BeeGFS on NetApp with E-Series Storage

Update file node adapter firmware

Contributors mcwhiteside

Follow these steps to update the file node's ConnectX-7 adapters to the latest firmware.

Overview

Updating the ConnectX-7 adapter firmware may be required to support a new MLNX_OFED driver, enable new features, or fix bugs. This guide uses NVIDIA's mlxfwmanager utility to update the adapters because it is easy to use and efficient.

Upgrade considerations

This guide covers two approaches to updating ConnectX-7 adapter firmware: a rolling update and a two-node cluster update. Use the rolling update for HA clusters with more than two nodes and the two-node cluster update for HA clusters with exactly two nodes. Before performing firmware updates, verify that:

  • A supported MLNX_OFED driver is installed (refer to the technology requirements).

  • Valid backups exist for your BeeGFS filesystem and Pacemaker cluster configuration.

  • The cluster is in a healthy state.
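Part of the first check can be scripted. The following is a minimal sketch, not an official NetApp or NVIDIA procedure: it only reports which of the command-line tools used later in this guide are missing from PATH. The function name missing_tools is hypothetical, and the presence of a tool does not prove that the installed MLNX_OFED version is a supported one.

```shell
# Hypothetical helper: print the name of every listed command that is
# not found on PATH. Prints nothing when all commands are available.
missing_tools() (
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
)

# Example call (the tool names are the ones used in this guide):
#   missing_tools mlxfwmanager mlxfwreset ibstat pcs
```

An empty result means every listed command was found. Any name that is printed should be resolved before starting the update.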

Firmware update preparation

Use NVIDIA's mlxfwmanager utility, which is bundled with NVIDIA's MLNX_OFED driver, to update a node's adapter firmware. Before starting the updates, download the adapter's firmware image from NVIDIA's support site and store it on each file node.

Note For Lenovo ConnectX-7 adapters, use the mlxfwmanager_LES tool, which is available on NVIDIA’s OEM firmware page.
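Because the same image is stored on every file node, it can help to confirm that each copy is intact before using it. The following is a sketch under stated assumptions: firmware_image_ok and firmware_image_digest are hypothetical helper names, and the digest helper assumes the sha256sum command from GNU coreutils is installed. It does not validate that the image is the correct firmware for the adapter.

```shell
# Hypothetical helper: succeed only if the firmware image exists,
# is readable, and is not empty.
firmware_image_ok() {
    [ -r "$1" ] && [ -s "$1" ]
}

# Hypothetical helper: print only the SHA-256 digest of the image so the
# copies stored on different file nodes can be compared.
# Assumes sha256sum (GNU coreutils) is available.
firmware_image_digest() {
    sha256sum "$1" | awk '{print $1}'
}

# Example calls, using the same placeholder as the commands in this guide:
#   firmware_image_ok <path/to/firmware.bin> && firmware_image_digest <path/to/firmware.bin>
```

If the digest printed on one file node differs from the others, copy the image to that node again before running mlxfwmanager.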

Rolling update approach

This approach is recommended for any HA cluster with more than two nodes. Adapter firmware is updated on one file node at a time, which allows the HA cluster to keep servicing requests, although avoiding I/O during the update is recommended.

  1. Confirm that the cluster is in an optimal state, with each BeeGFS service running on its preferred node. Refer to Examine the state of the cluster for details.

  2. Choose a file node to update and place it into standby mode, which drains (or moves) all BeeGFS services from that node:

    pcs node standby <HOSTNAME>
  3. Verify the node's services have drained by running:

    pcs status

    Verify no services are reporting as Started on the node in standby.

    Note Depending on cluster size, it may take seconds or minutes for BeeGFS services to move to the sister node. If a BeeGFS service fails to start on the sister node, refer to the Troubleshooting Guides.
  4. Update the adapter firmware using mlxfwmanager:

    mlxfwmanager -i <path/to/firmware.bin> -u

    Record the PCI Device Name of each adapter that receives a firmware update; the adapter reset in the next step requires it.

  5. Reset each adapter using the mlxfwreset utility to apply the new firmware.

    Note Some firmware updates may require a reboot to apply the update. Refer to NVIDIA's mlxfwreset limitations for guidance. If a reboot is required, perform a reboot instead of resetting the adapters.
    1. Stop the opensm service:

      systemctl stop opensm
    2. Execute the following command for each PCI Device Name previously noted:

      mlxfwreset -d <pci_device_name> reset -y
    3. Start the opensm service:

      systemctl start opensm
  6. Run ibstat and verify all adapters are running at the desired firmware version:

    ibstat
  7. Start Pacemaker cluster services on the node:

    pcs cluster start <HOSTNAME>
  8. Bring the node out of standby:

    pcs node unstandby <HOSTNAME>
  9. Relocate all BeeGFS services back to their preferred node:

    pcs resource relocate run

Repeat these steps for each file node in the cluster until all adapters have been updated.
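The drain check in step 3 can be sketched as a small filter over the pcs status output. This is an illustrative sketch only: count_started_on is a hypothetical helper, and it assumes that a resource line contains the word Started followed immediately by the node name. That assumption may not hold for every pcs version or for cloned resources, so always confirm the result against the full pcs status output.

```shell
# Hypothetical helper: read `pcs status` text on stdin and print how many
# lines report a resource as "Started" on the given node.
# Assumption: the line format is "... Started <node name>"; the node name
# must match a whole whitespace-separated field exactly.
count_started_on() {
    awk -v host="$1" '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "Started" && $(i + 1) == host) { n++; break }
            }
        }
        END { print n + 0 }'
}

# Example call on a cluster node:
#   pcs status | count_started_on <HOSTNAME>
```

A result of 0 for the node in standby suggests that its services have drained. Any other value means some services are still reported as Started there.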

Two-node cluster update approach

This approach is recommended for HA clusters with only two nodes. It is similar to a rolling update but adds steps that prevent service downtime when one node's cluster services are stopped.

  1. Confirm that the cluster is in an optimal state, with each BeeGFS service running on its preferred node. Refer to Examine the state of the cluster for details.

  2. Choose a file node to update and place the node in standby mode, which drains (or moves) all BeeGFS services from that node:

    pcs node standby <HOSTNAME>
  3. Verify the node's resources have drained by running:

    pcs status

    Verify no services are reporting as Started on the node in standby.

    Tip Depending on cluster size, it may take seconds or minutes for BeeGFS services to report as Started on the sister node. If a BeeGFS service fails to start, refer to the Troubleshooting Guides.
  4. Place the cluster into maintenance mode:

    pcs property set maintenance-mode=true
  5. Update the adapter firmware using mlxfwmanager:

    mlxfwmanager -i <path/to/firmware.bin> -u

    Record the PCI Device Name of each adapter that receives a firmware update; the adapter reset in the next step requires it.

  6. Reset each adapter using the mlxfwreset utility to apply the new firmware.

    Note Some firmware updates may require a reboot to apply the update. Refer to NVIDIA's mlxfwreset limitations for guidance. If a reboot is required, perform a reboot instead of resetting the adapters.
    1. Stop the opensm service:

      systemctl stop opensm
    2. Execute the following command for each PCI Device Name previously noted:

      mlxfwreset -d <pci_device_name> reset -y
    3. Start the opensm service:

      systemctl start opensm
  7. Run ibstat and verify all adapters are running at the desired firmware version:

    ibstat
  8. Start Pacemaker cluster services on the node:

    pcs cluster start <HOSTNAME>
  9. Bring the node out of standby:

    pcs node unstandby <HOSTNAME>
  10. Take the cluster out of maintenance mode:

    pcs property set maintenance-mode=false
  11. Relocate all BeeGFS services back to their preferred node:

    pcs resource relocate run

Repeat these steps for each file node in the cluster until all adapters have been updated.
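Because the order of the maintenance-mode steps matters in this procedure, it can be useful to print the full command sequence for one node and review it before running anything. The following sketch only prints the commands listed in the steps above and executes none of them. The function name plan_two_node_update and its argument order are hypothetical, and the sketch assumes that no reboot is required in place of the adapter reset.

```shell
# Hypothetical helper: print, in order, the commands from the two-node
# procedure for one file node. Nothing is executed.
# Arguments: <hostname> <firmware image path> <PCI device name>...
plan_two_node_update() (
    host=$1
    image=$2
    shift 2
    printf 'pcs node standby %s\n' "$host"
    printf 'pcs status\n'
    printf 'pcs property set maintenance-mode=true\n'
    printf 'mlxfwmanager -i %s -u\n' "$image"
    printf 'systemctl stop opensm\n'
    # One reset command per adapter that received new firmware.
    for dev in "$@"; do
        printf 'mlxfwreset -d %s reset -y\n' "$dev"
    done
    printf 'systemctl start opensm\n'
    printf 'ibstat\n'
    printf 'pcs cluster start %s\n' "$host"
    printf 'pcs node unstandby %s\n' "$host"
    printf 'pcs property set maintenance-mode=false\n'
    printf 'pcs resource relocate run\n'
)

# Example call with the placeholders used in this guide:
#   plan_two_node_update <HOSTNAME> <path/to/firmware.bin> <pci_device_name>
```

The printed list is a checklist rather than a script to pipe into a shell: the PCI Device Names are only known after mlxfwmanager has run, and the pcs status and ibstat output still has to be checked by a person at the corresponding steps.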