Update file node adapter firmware
Follow these steps to update the file node's ConnectX-7 adapters to the latest firmware.
Overview
Updating the ConnectX-7 adapter firmware may be required to support a new MLNX_OFED driver, enable new features, or fix bugs. This guide uses NVIDIA's mlxfwmanager utility for adapter updates because of its ease of use and efficiency.
Upgrade considerations
This guide covers two approaches to updating ConnectX-7 adapter firmware: a rolling update and a two-node cluster update. Choose the appropriate update approach according to your cluster's size. Before performing firmware updates, verify that:
-
A supported MLNX_OFED driver is installed. Refer to the technology requirements.
-
Valid backups exist for your BeeGFS filesystem and Pacemaker cluster configuration.
-
The cluster is in a healthy state.
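The driver and cluster-health checks above can be sketched as a short shell snippet. This is a minimal sketch under stated assumptions, not a definitive procedure: it assumes ofed_info (installed with MLNX_OFED) and pcs are on the PATH, and the run and preflight helpers are hypothetical names introduced here. By default it only prints the commands; set DRY_RUN=0 to execute them. Backups still need to be confirmed manually.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper covering two of the three pre-update checks.
preflight() {
  run ofed_info -s   # show the installed MLNX_OFED driver version
  run pcs status     # review node and resource state before proceeding
}

preflight
```

Compare the reported driver version against the technology requirements, and confirm in the pcs status output that no nodes are offline and no resources are failed.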
Firmware update preparation
It is recommended to use NVIDIA's mlxfwmanager utility, which is bundled with NVIDIA's MLNX_OFED driver, to update a node's adapter firmware. Before starting the updates, download the adapter's firmware image from NVIDIA's support site and store it on each file node.
For Lenovo ConnectX-7 adapters, use the mlxfwmanager_LES tool, which is available on NVIDIA's OEM firmware page.
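As an illustration of the preparation step, the following sketch copies a downloaded firmware image to each file node and queries the adapters' current firmware. It is a sketch under assumptions, not the documented procedure: the image path, the host names, the use of scp, and the run, stage_firmware, and query_firmware helpers are all hypothetical. By default the commands are only printed; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
# Hypothetical location of the downloaded firmware image.
FW_IMAGE="${FW_IMAGE:-/tmp/fw-ConnectX7.bin}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Copy the image to the same path on every node passed as an argument.
stage_firmware() {
  for node in "$@"; do
    run scp "$FW_IMAGE" "${node}:${FW_IMAGE}"
  done
}

# List the local node's adapters with their current firmware versions.
query_firmware() {
  run mlxfwmanager --query
}

stage_firmware beegfs_01 beegfs_02   # hypothetical host names
query_firmware
```

Running the query on each file node before the update gives a baseline to compare against once the new firmware is applied.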
Rolling update approach
This approach is recommended for any HA cluster with more than two nodes. It updates adapter firmware on one file node at a time, which allows the HA cluster to keep servicing requests, although avoiding I/O during the update is recommended.
-
Confirm that the cluster is in an optimal state, with each BeeGFS service running on its preferred node. Refer to Examine the state of the cluster for details.
-
Choose a file node to update and place it into standby mode, which drains (or moves) all BeeGFS services from that node:
-
Verify the node's services have drained by running:
Verify no services are reporting as Started on the node in standby.
Depending on cluster size, it may take seconds or minutes for BeeGFS services to move to the sister node. If a BeeGFS service fails to start on the sister node, refer to the Troubleshooting Guides.
-
Update the adapter firmware using mlxfwmanager.
Note the PCI Device Name for each adapter receiving firmware updates.
-
Reset each adapter using the mlxfwreset utility to apply the new firmware.
Some firmware updates may require a reboot to apply the update. Refer to NVIDIA's mlxfwreset limitations for guidance. If a reboot is required, perform a reboot instead of resetting the adapters.
-
Stop the opensm service:
-
Execute the following command for each PCI Device Name previously noted.
-
Start the opensm service:
-
-
Run ibstat and verify all adapters are running at the desired firmware version:
-
Start Pacemaker cluster services on the node:
-
Bring the node out of standby:
-
Relocate all BeeGFS services back to their preferred node:
Repeat these steps for each file node in the cluster until all adapters have been updated.
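The rolling update steps above can be sketched as one shell function per node. This is a hedged sketch, not the commands from the original page: it assumes a pcs release that provides the node standby and node unstandby subcommands, the mlxfwmanager and mlxfwreset options from NVIDIA's firmware tools, and that the script runs on the node being updated. The run and update_node helpers, the host name, the image path, and the PCI device names are hypothetical. By default it only prints the commands; set DRY_RUN=0 to execute them, and pause after each status check to confirm the result manually.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: update_node <node> <firmware-image> <pci-device-name>...
update_node() {
  node=$1
  image=$2
  shift 2

  run pcs node standby "$node"       # drain BeeGFS services from the node
  run pcs status                     # confirm nothing is Started on the node
  run mlxfwmanager -i "$image" -u    # burn the new firmware image

  # Reset each adapter so the new firmware takes effect.
  # If NVIDIA's limitations call for a reboot, reboot instead of resetting.
  run systemctl stop opensm
  for dev in "$@"; do
    run mlxfwreset -d "$dev" -y reset
  done
  run systemctl start opensm

  run ibstat                         # check the reported firmware version
  run pcs cluster start "$node"      # start Pacemaker services on the node
  run pcs node unstandby "$node"     # allow services to run here again
  run pcs resource relocate run      # move services to their preferred nodes
}

# Hypothetical node name, image path, and PCI device names.
update_node beegfs_01 /tmp/fw-ConnectX7.bin 0000:41:00.0 0000:c1:00.0
```

Keeping the dry-run default makes it possible to review the exact command sequence for a node before anything touches the cluster.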
Two-node cluster update approach
This approach is recommended for HA clusters with only two nodes. It is similar to a rolling update but includes additional steps to prevent service downtime when one node's cluster services are stopped.
-
Confirm that the cluster is in an optimal state, with each BeeGFS service running on its preferred node. Refer to Examine the state of the cluster for details.
-
Choose a file node to update and place the node in standby mode, which drains (or moves) all BeeGFS services from that node:
-
Verify the node's resources have drained by running:
Verify no services are reporting as Started on the node in standby.
Depending on cluster size, it may take seconds or minutes for BeeGFS services to report as Started on the sister node. If a BeeGFS service fails to start, refer to the Troubleshooting Guides.
-
Place the cluster into maintenance mode.
-
Update the adapter firmware using mlxfwmanager.
Note the PCI Device Name for each adapter receiving firmware updates.
-
Reset each adapter using the mlxfwreset utility to apply the new firmware.
Some firmware updates may require a reboot to apply the update. Refer to NVIDIA's mlxfwreset limitations for guidance. If a reboot is required, perform a reboot instead of resetting the adapters.
-
Stop the opensm service:
-
Execute the following command for each PCI Device Name previously noted.
-
Start the opensm service:
-
-
Run ibstat and verify all adapters are running at the desired firmware version:
-
Start Pacemaker cluster services on the node:
-
Bring the node out of standby:
-
Take the cluster out of maintenance mode.
-
Relocate all BeeGFS services back to their preferred node:
Repeat these steps for each file node in the cluster until all adapters have been updated.
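The cluster-level ordering that distinguishes the two-node approach can be sketched as follows. This is an illustrative sketch only: it assumes maintenance mode is toggled through the Pacemaker maintenance-mode cluster property with pcs property set, which may differ from the commands on the original page, and the run and two_node_update helpers and the host name are hypothetical. The firmware update, adapter reset, and ibstat check are left as a comment because they are performed on the node itself as described in the steps above. By default the commands are only printed; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: two_node_update <node>
two_node_update() {
  node=$1

  run pcs node standby "$node"                 # drain services to the sister node
  run pcs status                               # confirm services Started on the sister node
  run pcs property set maintenance-mode=true   # enter maintenance mode

  # Update the firmware, reset the adapters (or reboot), and verify
  # the firmware version with ibstat on the node before continuing.

  run pcs cluster start "$node"                # start Pacemaker services on the node
  run pcs node unstandby "$node"               # bring the node out of standby
  run pcs property set maintenance-mode=false  # leave maintenance mode
  run pcs resource relocate run                # move services to their preferred nodes
}

two_node_update beegfs_01   # hypothetical host name
```

The maintenance-mode window opens only after the services have drained and closes before they are relocated, which mirrors the order of the steps above.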