Replace drives for HPE DL360


Choose from the procedures listed here to replace a drive proactively, replace a drive after it has failed, or replace a cache drive in your SolidFire eSDS cluster.

Replace a drive proactively

Perform this procedure if you want to proactively replace a metadata drive or a block drive in your SolidFire eSDS cluster. The Element UI Cluster > Drives page shows the drive wear information.

What you’ll need
  • From the NetApp Element software UI, ensure that your cluster is in good health and there are no warnings or cluster faults. You can access the Element UI by using the management virtual IP (MVIP) address of the primary cluster node.

  • Ensure that there are no active jobs running on the cluster.

  • Ensure that you have familiarized yourself with all the steps.

  • Ensure that you take necessary precautions to prevent electrostatic discharge (ESD) while handling drives.

Steps
  1. Perform the following steps in the Element UI:

    1. In the Element UI, select Cluster > Drives > Active.

    2. Select the drive that you want to replace.

    3. Make a note of the serial number of the drive. This will help you locate the corresponding BayID in the IPMI interface of the node (HPE Integrated Lights-Out or iLO, in this case).

    4. Select Bulk Actions > Remove. After you remove the drive, it goes into the Removing state and stays there for a while, waiting for the data on the drive to be synced or redistributed to the remaining drives in the cluster. After the removal is complete, the drive moves to the Available state.

  2. Perform the following steps to locate the drive slot of the drive that you are replacing:

    1. Log in to the IPMI interface of the node (iLO in this case).

    2. Select System Information from the left-hand navigation, and then select Storage.

    3. Match the serial number you made a note of in the previous step with what you see on the screen.

    4. Look for the slot number listed against the serial number. This is the physical slot from which you must remove the drive.

  3. Now that you have identified the drive, physically remove it as follows:

    1. Identify the drive bay.

      The following image shows the front of the server, with the drive bay numbering on the left side:

      (Image: drive bay numbering on the DL360 node.)

    2. Press the power button on the drive that you want to replace. The LED blinks for 5-10 seconds and stops.

    3. After the LED stops blinking and the drive is powered off, remove it from the server by pressing the red button and pulling the latch.

      Note Ensure that you handle drives very carefully.

      After you physically remove the drive, the drive state changes to Failed in the Element UI.

  4. In the Element UI, select Cluster > Drives > Failed.

  5. Select the icon under Actions, and then select Remove.

    You can now install the new drive in the node.

  6. Make a note of the serial number of the new drive.

  7. Insert the replacement drive by carefully pushing it into the bay and closing the latch. The drive powers on when inserted correctly.

  8. Perform the following steps to verify the new drive details in iLO:

    1. Log in to iLO.

    2. Select Information > Integrated Management Log. You will see an event logged for the drive that you added.

    3. Select System Information from the left-hand navigation, and then select Storage.

    4. Scroll until you find information about the bay in which you replaced the drive.

    5. Verify that the serial number on your screen matches the serial number of the new drive that you installed.

  9. Add the new drive information in the sf_sds_config.yaml file for the node in which you replaced the drive.

    The sf_sds_config.yaml file is stored in /opt/sf/. This file includes all the information about the drives in the node. Every time you replace a drive, you must enter the replacement drive information in this file. For more information about this file, see Contents of the sf_sds_config.yaml file.

    1. Establish an SSH connection to the node by using PuTTY.

    2. In the PuTTY configuration window, enter the node MIP address in the Host Name (or IP address) field.

    3. Select Open.

    4. In the terminal window that opens, log in with your username and password.

    5. Run the # cat /opt/sf/sf_sds_config.yaml command to list the contents of the file.

    6. Replace the entries in the dataDevices or cacheDevices lists for the drive you replaced with the new drive information (an illustrative sketch follows these sub-steps).

    7. Run # systemctl start solidfire-update-drives.

      You see the Bash prompt again after this command runs. Next, go to the Element UI to add the drive to the cluster. The Element UI shows an alert indicating that a new drive is available.
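
    The exact layout of the sf_sds_config.yaml file can vary by release, so treat the following console sketch as illustrative only. The dataDevices and cacheDevices keys are the ones referenced in this procedure; the device identifier shown is a placeholder, not a value from your node.

      # cat /opt/sf/sf_sds_config.yaml
      ...
      dataDevices:
        # replace the entry for the removed drive with the identifier of the new drive
        - <new-drive-identifier>
      ...
      # systemctl start solidfire-update-drives

    Edit the file with a text editor available on the node and save it before running the systemctl command.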

  10. Select Cluster > Drives > Available.

    You see the serial number of the new drive that you installed.

  11. Select the icon under Actions, and then select Add.

  12. Refresh the Element UI after the block sync job completes. If you access the Running Tasks page from the Reporting tab of the Element UI, you see that the alert about the available drive has cleared.

Replace a faulty drive

If your SolidFire eSDS cluster has a faulty drive, the Element UI displays an alert. Before you remove the drive from the cluster, verify the reason for failure by looking at the information in the IPMI interface for your node/server. These steps are applicable if you are replacing a block drive or a metadata drive.

What you’ll need
  • From the NetApp Element software UI, verify that the drive has failed. Element displays an alert when a drive fails. You can access the Element UI by using the management virtual IP (MVIP) address of the primary cluster node.

  • Ensure that you have familiarized yourself with all the steps.

  • Ensure that you take necessary precautions to prevent electrostatic discharge (ESD) while handling drives.

Steps
  1. Remove the failed drive from the cluster as follows using the Element UI:

    1. Select Cluster > Drives > Failed.

    2. Note the node name and serial number associated with the failed drive.

    3. Select the icon under Actions, and then select Remove. If you see warnings about the service associated with the drive, wait until bin syncing completes, and then remove the drive.

  2. Perform the following steps to verify the drive failure and view the events logged for it:

    1. Log in to the IPMI interface of the node (iLO in this case).

    2. Select Information > Integrated Management Log. The reason for the drive failure (for example, SSDWearOut) and its location are listed here. You can also see an event stating that the status of the drive is degraded.

    3. Select System Information from the left-hand navigation, and then select Storage.

    4. Verify the information available about the failed drive. The status of the failed drive will say Degraded.

  3. Remove the drive physically as follows:

    1. Identify the drive slot number in the chassis.

      The following image shows the front of the server, with the drive bay numbering on the left side:

      (Image: drive bay numbering on the DL360 node.)

    2. Press the power button on the drive that you want to replace. The LED blinks for 5-10 seconds and stops.

    3. After the LED stops blinking and the drive is powered off, remove it from the server by pressing the red button and pulling the latch.

      Note Ensure that you handle drives very carefully.
  4. Insert the replacement drive by carefully pushing it into the bay and closing the latch. The drive powers on when inserted correctly.

  5. Verify the new drive details in iLO:

    1. Select Information > Integrated Management Log. You see an event logged for the drive that you added.

    2. Refresh the page to see the events logged for the new drive that you added.

  6. Verify the health of your storage system in iLO:

    1. Select System Information from the left-hand navigation, and then select Storage.

    2. Scroll until you find information about the bay in which you installed the new drive.

    3. Make a note of the serial number.

  7. Add the new drive information in the sf_sds_config.yaml file for the node in which you replaced the drive.

    The sf_sds_config.yaml file is stored in /opt/sf/. This file includes all the information about the drives in the node. Every time you replace a drive, you must enter the replacement drive information in this file. For more information about this file, see Contents of the sf_sds_config.yaml file.

    1. Establish an SSH connection to the node by using PuTTY.

    2. In the PuTTY configuration window, enter the node MIP address in the Host Name (or IP address) field.

    3. Select Open.

    4. In the terminal window that opens, log in with your username and password.

    5. Run the # cat /opt/sf/sf_sds_config.yaml command to list the contents of the file.

    6. Replace the entries in the dataDevices or cacheDevices lists for the drive you replaced with the new drive information (the sketch after these sub-steps shows one way to confirm the new drive's device name).

    7. Run # systemctl start solidfire-update-drives.

      You see the Bash prompt again after this command runs. Next, go to the Element UI to add the drive to the cluster. The Element UI shows an alert indicating that a new drive is available.
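
    Before you edit the file, you can confirm which device name the operating system assigned to the new drive by matching the serial number you noted in iLO. The following sketch uses standard Linux commands; <new-drive-serial> is a placeholder for that serial number:

      # lsblk -o NAME,MODEL,SERIAL
      # ls -l /dev/disk/by-id/ | grep -i <new-drive-serial>

    The output can help you confirm which entry in /opt/sf/sf_sds_config.yaml corresponds to the drive you replaced.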

  8. Select Cluster > Drives > Available.

    You see the serial number of the new drive that you installed.

  9. Select the icon under Actions, and then select Add.

  10. Refresh the Element UI after the block sync job completes. If you access the Running Tasks page from the Reporting tab of the Element UI, you see that the alert about the available drive has cleared.

Replace a cache drive

Perform this procedure if you want to replace the cache drive in your SolidFire eSDS cluster. The cache drive is associated with metadata services. The Element UI Cluster > Drives page shows the drive wear information.

What you’ll need
  • From the NetApp Element software UI, ensure that your cluster is in good health and there are no warnings or cluster faults. You can access the Element UI by using the management virtual IP (MVIP) address of the primary cluster node.

  • Ensure that there are no active jobs running on the cluster.

  • Ensure that you have familiarized yourself with all the steps.

  • Ensure that you remove metadata services from the Element UI.

  • Ensure that you take necessary precautions to prevent electrostatic discharge (ESD) while handling drives.

Steps
  1. Perform the following steps in the Element UI:

    1. In the Element UI, select Cluster > Nodes > Active.

    2. Make a note of the node ID and management IP address of the node in which you are replacing the cache drive.

    3. If the cache drive is healthy and you are proactively replacing it, select Active Drives, locate the metadata drive, and remove it from the UI.

      After you remove it, the metadata drive goes to the Removing state first, and then to Available.

    4. If you are performing the replacement after the cache drive has failed, the metadata drive is in the Available state and listed under Cluster > Drives > Available.

    5. In the Element UI, select Cluster > Drives > Active.

    6. Select the metadata drive associated with the node on which you are replacing the cache drive.

    7. Select Bulk Actions > Remove. After you remove the drive, it goes into the Removing state and stays there for a while, waiting for the data on the drive to be synced or redistributed to the remaining drives in the cluster. After the removal is complete, the drive moves to the Available state.

  2. Perform the following steps to locate the drive slot of the cache drive that you are replacing:

    1. Log in to the IPMI interface of the node (iLO in this case).

    2. Select System Information from the left-hand navigation, and then select Storage.

    3. Locate the cache drive.

      Note Cache drives have a smaller capacity than storage drives.

    4. Look for the slot number listed for the cache drive. This is the physical slot from which you must remove the drive.

  3. Now that you have identified the drive, physically remove it as follows:

    1. Identify the drive bay.

      The following image shows the front of the server, with the drive bay numbering on the left side:

      (Image: drive bay numbering on the DL360 node.)

    2. Press the power button on the drive that you want to replace. The LED blinks for 5-10 seconds and stops.

    3. After the LED stops blinking and the drive is powered off, remove it from the server by pressing the red button and pulling the latch.

      Note Ensure that you handle drives very carefully.

      After you physically remove the drive, the drive state changes to Failed in the Element UI.

  4. Make a note of the HPE model number and the ISN (serial number) of the new cache drive.

  5. Insert the replacement drive by carefully pushing it into the bay and closing the latch. The drive powers on when inserted correctly.

  6. Perform the following steps to verify the new drive details in iLO:

    1. Log in to iLO.

    2. Select Information > Integrated Management Log. You see an event logged for the drive that you added.

    3. Select System Information from the left-hand navigation, and then select Storage.

    4. Scroll until you find information about the bay in which you replaced the drive.

    5. Verify that the serial number on your screen matches the serial number of the new drive that you installed.

  7. Add the new cache drive information in the sf_sds_config.yaml file for the node in which you replaced the drive.

    The sf_sds_config.yaml file is stored in /opt/sf/. This file includes all the information about the drives in the node. Every time you replace a drive, you should enter the replacement drive information in this file. For more information about this file, see Contents of the sf_sds_config.yaml file.

    1. Establish an SSH connection to the node by using PuTTY.

    2. In the PuTTY configuration window, enter node MIP address (that you made a note of from the Element UI earlier) in the Host Name (or IP address) field.

    3. Select Open.

    4. In the terminal window that opens, log in with your username and password.

    5. Run the nvme list command to list the NVMe devices.

      You can see the model number and serial number of the new cache drive. See the following sample output:

      (Image: sample nvme list output showing the model number and serial number of the new cache drive.)

    6. Add the new cache drive information in /opt/sf/sf_sds_config.yaml.

      You should replace the existing cache drive model number and serial number with the corresponding information for the new cache drive. See the following example and the illustrative sketch after these sub-steps:

      (Image: example sf_sds_config.yaml entry showing the model number and serial number.)

    7. Save the /opt/sf/sf_sds_config.yaml file.
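
    As a reference, the following console sketch shows the general shape of nvme list output (abbreviated) and a cacheDevices entry. The model, serial number, and entry format are placeholders; the exact fields in your sf_sds_config.yaml might differ.

      # nvme list
      Node           SN             Model          ...
      /dev/nvme0n1   <new-serial>   <new-model>    ...

      # cat /opt/sf/sf_sds_config.yaml
      ...
      cacheDevices:
        # replace the old cache drive's model and serial number with the values reported by nvme list
        - <new-cache-drive-identifier>
      ...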

  8. Perform the steps for the scenario that is applicable to you:

    If the newly inserted cache drive shows up after you run the nvme list command:

    1. Run # systemctl restart solidfire. This takes around three minutes.

    2. Check the solidfire status by running systemctl status solidfire (a sample of healthy output follows these steps).

    3. Go to step 9.

    If the newly inserted cache drive does not show up after you run the nvme list command:

    1. Reboot the node.

    2. After the node reboots, verify that the solidfire services are running by logging in to the node (using PuTTY) and running the systemctl status solidfire command.

    3. Go to step 9.

    Note Restarting solidfire or rebooting the node causes some cluster faults, which eventually clear in about five minutes.
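
    In either scenario, a healthy service looks roughly like the following sketch; the exact systemctl status output varies by systemd version:

      # systemctl status solidfire
        ...
        Active: active (running) ...
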
  9. In the Element UI, add back the metadata drive that you removed:

    1. Select Cluster > Drives > Available.

    2. Select the icon under Actions, and select Add.

  10. Refresh your Element UI after the block sync job completes.

    You can see that the alert about the available drive has cleared, along with other cluster faults.