Replace a faulty drive in a SolidFire eSDS cluster

If your SolidFire eSDS cluster has a faulty drive, the Element UI displays an alert. Before you remove the drive from the cluster, verify the reason for the failure by looking at the information in the IPMI interface for your node or server. These steps apply whether you are replacing a block drive or a metadata drive.

Procedure

  1. Verify the drive failure and view the events that are logged for it:
    1. Log in to the IPMI interface of the node (iLO in this case).
    2. Click Information > Integrated Management Log. The reason for the drive failure (for example, SSDWearOut) and the location will be listed here. You will also see an event stating that the status of the drive is degraded.
    3. Click System Information from the left-hand navigation and click Storage.
    4. Verify the information available about the failed drive. The status of the failed drive will say Degraded.
  2. Remove the failed drive from the cluster by using the Element UI:
    1. Click Cluster > Drives > Failed.
    2. Click the icon under Actions and select Remove.
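    If you prefer to script this step, the Element API exposes the same operation. The following shell sketch is a minimal example, not the documented procedure: the API endpoint version, the credentials, and the drive ID 42 are all placeholders, and you would take the real drive ID from the ListDrives response.

    ```
    # List drives to find the ID of the failed drive (placeholder MVIP and credentials).
    curl -k -u admin:password https://<cluster-MVIP>/json-rpc/10.0 \
      -d '{"method": "ListDrives", "params": {}, "id": 1}'

    # Remove the failed drive by its drive ID (42 is a placeholder).
    curl -k -u admin:password https://<cluster-MVIP>/json-rpc/10.0 \
      -d '{"method": "RemoveDrives", "params": {"drives": [42]}, "id": 2}'
    ```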
  3. Physically remove the drive:
    1. Identify the drive bay.
      [Image: front of the server, with the drive bay numbering labeled on the left side.]

    2. Press the power button on the drive that you want to replace. The LED blinks for 5-10 seconds and stops.
    3. After the LED stops blinking and the drive is powered off, remove it from the server by pressing the red button and pulling the latch.
      Note: Ensure that you handle drives very carefully.
  4. Insert the replacement drive by carefully pushing the drive into the bay using the latch and closing the latch. The drive powers on when inserted correctly.
  5. Verify the new drive details in iLO:
    1. Click Information > Integrated Management Log.
    2. Refresh the page. You will see an event logged for the drive that you added.
  6. Verify the health of your storage system in iLO:
    1. Click System Information from the left-hand navigation and click Storage.
    2. Scroll until you find information about the bay in which you installed the new drive.
    3. Make a note of the serial number.
  7. Add the new drive information in the sf_sds_config.yaml file for the node in which you replaced the drive.
    The sf_sds_config.yaml file is stored in /opt/sf/. This file includes all the information about the drives in the node. You must enter the details of the new drive in this file every time you add a new drive. For more information about this file, see Contents of the sf_sds_config.yaml file.
    1. Establish an SSH connection to the node by using PuTTY.
    2. In the PuTTY configuration window, enter the node MIP (management IP address) in the Host Name (or IP address) field.
    3. Click Open.
    4. In the terminal window that opens, log in with your username and password.
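      If you are connecting from a Linux or macOS host rather than PuTTY, a plain OpenSSH command does the same thing; the username and address below are placeholders for your account and the node MIP.

      ```
      # Replace <username> and <node-MIP> with your account and the node management IP.
      ssh <username>@<node-MIP>
      ```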
    5. Run the # cat /opt/sf/sf_sds_config.yaml command to list the contents of the file.
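      The exact contents of the file vary by node. The following is only an illustrative sketch: the dataDevices and cacheDevices list names come from the next substep, and every device path and serial number shown here is a made-up placeholder.

      ```
      # cat /opt/sf/sf_sds_config.yaml
      # Illustrative sketch only; your file contains the real node and drive details.
      dataDevices:                                  # block drives
        - /dev/disk/by-id/nvme-<model>_<serial-1>
        - /dev/disk/by-id/nvme-<model>_<serial-2>
      cacheDevices:                                 # metadata (cache) drives
        - /dev/disk/by-id/nvme-<model>_<serial-3>
      ```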
    6. In the dataDevices or cacheDevices list, replace the entry for the drive that failed with the new drive information.
    7. Run # nvme list to see the list of NVMe drives in the node. You will see the serial number of the new drive (that you noted in step 6) in this list.
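      The output below is only a sketch of the typical nvme list format; the device names, models, serial numbers, and sizes are placeholders, not captured output.

      ```
      # nvme list
      Node          SN            Model          Namespace  Usage              Format       FW Rev
      ------------  ------------  -------------  ---------  -----------------  -----------  ------
      /dev/nvme0n1  <serial-1>    <drive model>  1          3.84 TB / 3.84 TB  512 B + 0 B  <fw>
      /dev/nvme1n1  <new-serial>  <drive model>  1          3.84 TB / 3.84 TB  512 B + 0 B  <fw>
      ```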
    8. Run # systemctl start solidfire-update-drives.
      The command returns you to the bash prompt when it completes. You must then go to the Element UI to add the drive to the cluster. The Element UI shows an alert indicating that a new drive is available.
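      If you want to confirm that the unit ran cleanly before moving to the Element UI, the standard systemctl and journalctl queries apply; this is a generic sketch, not output specific to this service.

      ```
      # systemctl start solidfire-update-drives
      # Optional checks that the unit ran and logged no errors:
      systemctl status solidfire-update-drives
      journalctl -u solidfire-update-drives --since "10 minutes ago"
      ```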
  8. Click Cluster > Drives > Available.
    You will see the serial number of the new drive that you installed.
  9. Click the icon under Actions and select Add.
  10. Refresh the Element UI after the block sync job completes. If you access the Running Tasks page from the Reporting tab of the Element UI, you will see that the alert about the available drive has cleared.
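    As with the removal in step 2, adding the drive can also be scripted against the Element API. The sketch below assumes the same placeholder endpoint and credentials; the drive ID is a placeholder that you would take from the ListDrives response for the newly available drive.

    ```
    # Find the drive ID of the newly available drive (placeholders throughout).
    curl -k -u admin:password https://<cluster-MVIP>/json-rpc/10.0 \
      -d '{"method": "ListDrives", "params": {}, "id": 1}'

    # Add the available drive to the cluster by its drive ID.
    curl -k -u admin:password https://<cluster-MVIP>/json-rpc/10.0 \
      -d '{"method": "AddDrives", "params": {"drives": [{"driveID": 42}]}, "id": 2}'
    ```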