Replace DIMMs in compute nodes

Contributors: netapp-amitha

You can replace a faulty dual inline memory module (DIMM) in NetApp HCI compute nodes instead of replacing the entire node.

What you’ll need
  • You have contacted NetApp Support and received a replacement part. Support will be involved during the installation of the replacement. If you have not done so already, contact Support before you start this procedure.

  • You have planned for system downtime, because you need to power down or power cycle the node and boot it to NetApp Safe Mode to access the terminal user interface (TUI).

About this task

This procedure applies to the following compute node models:

  • H410C node. An H410C node is inserted into a 2U NetApp HCI chassis.

  • H610C node. An H610C node is built into the chassis.

  • H615C node. An H615C node is built into the chassis.

    H410C and H615C nodes include DIMMs from different vendors. Ensure that you do not mix DIMMs from different vendors in one chassis.
    The terms "chassis" and "node" are used interchangeably in the case of H610C and H615C, because the node and chassis are not separate components.

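Replacing a DIMM in a compute node involves two stages, described in the sections that follow: preparing to replace the DIMM, and replacing the DIMM in the chassis.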

Prepare to replace the DIMM

When issues with the DIMM occur, VMware ESXi displays alerts, such as Memory Configuration Error, Memory Uncorrectable ECC, Memory Transition to Critical, and Memory Critical Overtemperature. Even if the alerts disappear after a while, the hardware problem might persist. You should diagnose and address the faulty DIMM. You can get information about the faulty DIMM from vCenter Server. If you need more information than what is available from vCenter Server, you must run the hardware check in the TUI.

Steps
  1. Access the node by logging in to vCenter Server.

  2. Right-click the node that is reporting the error, and select the option to place the node in maintenance mode.

  3. Migrate the virtual machines (VMs) to another available host.

    See the VMware documentation for the migration steps.
  4. Power down the compute node.

    If you have the information about which DIMM needs to be replaced and do not need to access the TUI, you can skip the following steps in this section.
  5. Connect a keyboard, video, and mouse (KVM) to the back of the node that reported the error.

  6. Press the power button at the front of the node.
    It takes approximately six minutes for the node to boot. The screen displays a boot menu.

  7. Use the keyboard to select NetApp Safe Mode.

    You must make the selection within three seconds. If you miss the window, repeat the boot process.
  8. In the TUI window that opens, navigate to Maintenance Tasks > Check Hardware, and select OK.
    A window opens with the results of the hardware check. If the check detects a DIMM issue, the results include a timestamp and a slot identifier. Here is a sample output from the hardware check:

    Shows a sample output from the hardware check.
  9. Record the CPU number and the DIMM slot number/ID.
    This will help you identify the failed DIMM in the chassis.

  10. For H410C and H615C nodes, perform the steps to identify the DIMM manufacturer.

    H410C and H615C nodes include DIMMs from different manufacturers. You should not mix different DIMM types in the same chassis. You should identify the manufacturer of the faulty DIMM and order a replacement of the same type.
    1. Log in to the BMC to launch the console on the node.

    2. Press F2 on the keyboard to get to the Customize System/View Logs menu.

    3. Enter the password when prompted.

      The password should match what you configured in the NetApp Deployment Engine when you set up NetApp HCI.
      Shows the window to enter the password to log in to the console on the node.
    4. From the System Customization menu, press the down arrow to navigate to Troubleshooting Options, and press Enter.

      Shows the System Customization menu.
    5. From the Troubleshooting Mode Options menu, use the up or down arrow to enable ESXi shell and SSH, which are disabled by default.

    6. Press the <Esc> key twice to exit Troubleshooting Options.

    7. Run the smbiosDump command using one of the following options:

      Option Steps

      Option A

      1. Connect to the ESXi host (compute node) using the IP address of the host and the root credentials that you defined.

      2. Run the smbiosDump command.
        See the following sample output:

      `Memory Device:#30
      Location: "P1-DIMMA1"
      Bank: "P0_Node0_Channel0_Dimm0"
      Manufacturer:"Samsung"
      Serial: "38EB8380"
      Asset Tag: "P1-DIMMA1_AssetTag (date:18/15)"
      Part Number: "M393A4K40CB2-CTD"
      Memory Array: #29
      Form Factor: 0x09(DIMM)
      Type: 0x1a (DDR4)
      Type Detail: 0x0080 (Synchronous)
      Data Width: 64 bits (+8 ECC bits)
      Size: 32 GB`

      Option B

      1. Press Alt + F1 to enter the shell, log in to the node, and run the smbiosDump command.
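
The smbiosDump output can be long, so a short script might help when you compare manufacturers across slots. The following Python sketch is not part of the NetApp procedure. It assumes that you have saved only the Memory Device records as text in the same layout as the sample output, and the function names (parse_memory_devices, has_mixed_manufacturers) are illustrative.

```python
# Sketch: extract per-DIMM fields from saved smbiosDump text.
# Assumption: the text contains only "Memory Device" records laid out
# like the sample output in this procedure.

SAMPLE = '''Memory Device:#30
Location: "P1-DIMMA1"
Bank: "P0_Node0_Channel0_Dimm0"
Manufacturer:"Samsung"
Serial: "38EB8380"
Asset Tag: "P1-DIMMA1_AssetTag (date:18/15)"
Part Number: "M393A4K40CB2-CTD"
Memory Array: #29
Form Factor: 0x09(DIMM)
Type: 0x1a (DDR4)
Type Detail: 0x0080 (Synchronous)
Data Width: 64 bits (+8 ECC bits)
Size: 32 GB
'''


def parse_memory_devices(dump_text):
    """Return one dict of field name -> value per Memory Device record."""
    devices = []
    current = None
    for raw in dump_text.splitlines():
        # Split on the first colon only, so values that contain colons
        # (such as the Asset Tag) stay intact.
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        key = key.strip()
        value = value.strip().strip('"')
        if key == "Memory Device":
            # A new record starts; keep its handle (for example "#30").
            current = {"Handle": value}
            devices.append(current)
        elif current is not None:
            current[key] = value
    return devices


def has_mixed_manufacturers(devices):
    """True if the records name more than one DIMM manufacturer."""
    names = {d["Manufacturer"] for d in devices if d.get("Manufacturer")}
    return len(names) > 1


if __name__ == "__main__":
    for dimm in parse_memory_devices(SAMPLE):
        print(dimm["Location"], dimm["Manufacturer"], dimm["Part Number"])
```

When you order the replacement, compare the Manufacturer and Part Number values for the slot that you recorded from the hardware check against the part that Support supplies.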

Replace the DIMM in the chassis

Before you physically remove and replace the faulty DIMM in the chassis, ensure that you have performed all the preparatory steps.

Steps
  1. Power down the chassis or node.

    For an H610C or H615C chassis, power down the chassis. For H410C nodes in a 2U, four-node chassis, power down only the node with the faulty DIMM.
  2. Remove the power cables and network cables, carefully slide the node or chassis out of the rack, and place it on a flat, antistatic surface.

    Consider using twist ties for cables.
  3. Put on antistatic protection before you open the chassis cover to replace the DIMM.

  4. Perform the steps relevant to your node model:

    Node model Steps

    H410C

    1. Find the failed DIMM by matching the slot number/ID you noted earlier with the numbering on the motherboard. Here are sample images showing the DIMM slot numbers on the motherboard:

      Shows the DIMM slot numbers on the motherboard of the H410C node.
      Shows a close-up view of the DIMM slot numbers on the H410C node motherboard.
    2. Press the two retaining clips outward, and carefully pull the DIMM up. Here is a sample image showing the retaining clips:

      Shows the retaining clips for the DIMMs in the H410C node.
    3. Install the replacement DIMM correctly. When you insert the DIMM into the slot correctly, the two clips lock in place.

      Ensure that you touch only the rear ends of the DIMM. If you press on other parts of the DIMM, it might result in damage to the hardware.
    4. Install the node in the NetApp HCI chassis, ensuring that the node clicks when you slide it into place.

    H610C

    1. Lift the cover as shown in the following image:

      Shows the cover lifted on the H610C node.
    2. Loosen the four blue lock screws at the back of the node. Here is a sample image showing the location of two lock screws; you will find the other two on the other side of the node:

      Shows the lock screws at the back of the H610C node.
    3. Remove both PCI card blanks.

    4. Remove the GPU and the airflow cover.

    5. Find the failed DIMM by matching the slot number/ID you noted earlier with the numbering on the motherboard. Here is a sample image showing the location of the DIMM slot numbers on the motherboard:

      Shows the DIMM slot numbers on the H610C motherboard.
    6. Press the two retaining clips outward, and carefully pull the DIMM up.

    7. Install the replacement DIMM correctly. When you insert the DIMM into the slot correctly, the two clips lock in place.

      Ensure that you touch only the rear ends of the DIMM. If you press on other parts of the DIMM, it might result in damage to the hardware.
    8. Replace all the components that you removed: GPU, airflow cover, and PCI blanks.

    9. Tighten the lock screws.

    10. Put the cover back on the node.

    11. Install the H610C chassis in the rack, ensuring that the chassis clicks when you slide it into place.

    H615C

    1. Lift the cover as shown in the following image:

      Shows the cover lifted on the H615C node.
    2. Remove the GPU (if your H615C node has a GPU installed) and the airflow cover.

      Shows the airflow cover removed on the H615C node.
    3. Find the failed DIMM by matching the slot number/ID you noted earlier with the numbering on the motherboard. Here is a sample image showing the location of the DIMM slot numbers on the motherboard:

      Shows the DIMM slot numbers on the H615C motherboard.
    4. Press the two retaining clips outward, and carefully pull the DIMM up.

    5. Install the replacement DIMM correctly. When you insert the DIMM into the slot correctly, the two clips lock in place.

      Ensure that you touch only the rear ends of the DIMM. If you press on other parts of the DIMM, it might result in damage to the hardware.
    6. Replace the airflow cover.

    7. Put the cover back on the node.

    8. Install the H615C chassis in the rack, ensuring that the chassis clicks when you slide it into place.

  5. Insert the power cables and network cables.
    Ensure that all the port lights turn on.

  6. Press the power button at the front of the node if it does not power on automatically when you install it.

  7. After the node is displayed in vSphere, right-click the name and take the node out of maintenance mode.

  8. Verify the hardware information as follows:

    1. Log in to the baseboard management controller (BMC) UI.

    2. Select System > Hardware Information, and check the DIMMs listed.

What’s next

After the node returns to normal operation, in vCenter, check the Summary tab to ensure that the memory capacity is as expected.

If the DIMM is not installed correctly, the node will operate normally but with lower than expected memory capacity.
After the DIMM replacement procedure, you can clear the warnings and errors on the Hardware Status tab in vCenter if you want to erase the history of errors related to the hardware that you replaced.
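
The memory capacity check described above is simple arithmetic, and a small helper can make the comparison explicit. The following Python sketch is illustrative only: the function name missing_dimms and the example figures (12 DIMMs of 32 GB each) are assumptions, not values from this procedure. Use the DIMM count and size for your own node.

```python
def missing_dimms(expected_dimms, dimm_size_gb, reported_gb):
    """Estimate how many DIMMs' worth of capacity is absent.

    expected_dimms and dimm_size_gb describe the intended configuration;
    reported_gb is the capacity shown on the vCenter Summary tab.
    """
    expected_gb = expected_dimms * dimm_size_gb
    shortfall = expected_gb - reported_gb
    if shortfall <= 0:
        return 0
    # Reported capacity is often slightly below the nominal total, so
    # round to the nearest whole DIMM instead of flagging small gaps.
    return round(shortfall / dimm_size_gb)


if __name__ == "__main__":
    # Hypothetical node: 12 x 32 GB DIMMs, with 351.7 GB reported.
    print(missing_dimms(12, 32, 351.7))
```

A nonzero result suggests that at least one DIMM is not seated correctly, in which case you should recheck the installation and the DIMMs listed in the BMC UI.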