Update file node adapter firmware

03/12/2025

Follow these steps to update the file node's ConnectX-7 adapters to the latest firmware.

Overview

Updating the ConnectX-7 adapter firmware may be required to support a new MLNX_OFED driver, enable new features, or fix bugs. This guide uses NVIDIA's mlxfwmanager utility for adapter updates because it is easy to use and efficient.

Upgrade considerations

This guide covers two approaches to updating ConnectX-7 adapter firmware: a rolling update and a two-node cluster update. Choose the appropriate update approach according to your cluster's size. Before performing firmware updates, verify that:

A supported MLNX_OFED driver is installed. Refer to the technology requirements.
Valid backups exist for your BeeGFS filesystem and Pacemaker cluster configuration.
The cluster is in a healthy state.
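
As a rough aid for the first check, the sketch below confirms that the command-line tools used later in this guide are on a node's PATH. The helper name `require_cmds` is hypothetical, and finding the tools does not prove the installed driver version is supported, so the technology requirements remain the authority.

```shell
# Sketch only: returns non-zero and names the first tool that is not on PATH.
# require_cmds is a hypothetical helper name, not part of any NVIDIA or pcs tooling.
require_cmds() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing required command: $cmd" >&2
            return 1
        fi
    done
}

# Example, using the tools this guide calls:
#   require_cmds mlxfwmanager mlxfwreset pcs ibstat systemctl
# Assumption: MLNX_OFED also installs ofed_info, whose -s option prints the
# installed driver version for comparison against the requirements:
#   ofed_info -s
```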

Firmware update preparation

Use NVIDIA's mlxfwmanager utility, which is bundled with NVIDIA's MLNX_OFED driver, to update a node's adapter firmware. Before starting the updates, download the adapter's firmware image from NVIDIA's support site and store it on each file node.

For Lenovo ConnectX-7 adapters, use the mlxfwmanager_LES tool, which is available on NVIDIA’s OEM firmware page.
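
Before starting, it can help to confirm on each file node that the firmware image was actually stored there. The following is a minimal sketch; `firmware_image_ready` is a hypothetical helper name, and the path shown in the usage comment is the same placeholder used in the update command.

```shell
# Sketch only: succeeds when the given path is a readable, non-empty regular file.
# It does not validate that the file is a correct firmware image for the adapter.
firmware_image_ready() {
    [ -f "$1" ] && [ -r "$1" ] && [ -s "$1" ]
}

# Example (placeholder path):
#   firmware_image_ready /path/to/firmware.bin || echo "firmware image missing" >&2
```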

Rolling update approach

This approach is recommended for any HA cluster with more than two nodes. It updates adapter firmware on one file node at a time, which allows the HA cluster to keep servicing requests, although avoiding I/O during the update is recommended.

Confirm that the cluster is in an optimal state, with each BeeGFS service running on its preferred node. Refer to Examine the state of the cluster for details.
Choose a file node to update and place it into standby mode, which drains (or moves) all BeeGFS services from that node:
```
pcs node standby <HOSTNAME>
```

Verify the node's services have drained by running:

pcs status

Verify no services are reporting as Started on the node in standby.

Depending on cluster size, it may take seconds or minutes for BeeGFS services to move to the sister node. If a BeeGFS service fails to start on the sister node, refer to the Troubleshooting Guides.

Update the adapter firmware using mlxfwmanager:
```
mlxfwmanager -i <path/to/firmware.bin> -u
```
Note the PCI Device Name for each adapter receiving firmware updates.

Reset each adapter using the mlxfwreset utility to apply the new firmware.

Some firmware updates require a reboot to take effect. Refer to NVIDIA's mlxfwreset limitations for guidance. If a reboot is required, reboot the node instead of resetting the adapters.

Stop the opensm service:
```
systemctl stop opensm
```
Run the following command for each PCI Device Name noted previously:
```
mlxfwreset -d <pci_device_name> reset -y
```
Start the opensm service:
```
systemctl start opensm
```

Run ibstat and verify all adapters are running at the desired firmware version:
```
ibstat
```
Start Pacemaker cluster services on the node:
```
pcs cluster start <HOSTNAME>
```
Bring the node out of standby:
```
pcs node unstandby <HOSTNAME>
```
Relocate all BeeGFS services back to their preferred node:
```
pcs resource relocate run
```

Repeat these steps for each file node in the cluster until all adapters have been updated.
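
Two of the manual checks in this procedure, confirming the node has drained and noting each PCI Device Name, can be scripted when many nodes need updating. The helpers below are a sketch under stated assumptions: the function names and the `node01` host are hypothetical, `pcs status` is assumed to print `Started` immediately followed by the node name on each resource line, and mlxfwmanager is assumed to label the field `PCI Device Name:`. Output formats can differ between versions, so compare against real output from your nodes before relying on them.

```shell
# Sketch only. Both helpers read command output from stdin, so they can be
# tried against saved output before being pointed at a live node.

# Succeeds (exit 0) when no line shows "Started" followed by the given node.
# Assumption: pcs status prints the state and node name as adjacent fields.
node_is_drained() {
    awk -v node="$1" '
        { for (i = 1; i < NF; i++) if ($i == "Started" && $(i + 1) == node) found = 1 }
        END { exit (found ? 1 : 0) }
    '
}

# Prints one PCI Device Name per line.
# Assumption: mlxfwmanager labels the field "PCI Device Name:".
pci_device_names() {
    sed -n 's/^[[:space:]]*PCI Device Name:[[:space:]]*//p'
}

# Example flow on one node (hypothetical host name, placeholder firmware path):
#   pcs status | node_is_drained node01 || echo "node01 still has services" >&2
#   mlxfwmanager -i /path/to/firmware.bin -u | tee fw-update.log
#   pci_device_names < fw-update.log
```

Reading from stdin keeps the helpers independent of the cluster tools themselves, which makes it straightforward to adjust the patterns if your output differs.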

Two-node cluster update approach

This approach is recommended for HA clusters with only two nodes. It is similar to a rolling update but includes additional steps to prevent service downtime when one node's cluster services are stopped.

Confirm that the cluster is in an optimal state, with each BeeGFS service running on its preferred node. Refer to Examine the state of the cluster for details.
Choose a file node to update and place the node in standby mode, which drains (or moves) all BeeGFS services from that node:
```
pcs node standby <HOSTNAME>
```
Verify the node's resources have drained by running:
```
pcs status
```
Verify no services are reporting as Started on the node in standby.

Depending on cluster size, it may take seconds or minutes for BeeGFS services to report as Started on the sister node. If a BeeGFS service fails to start, refer to the Troubleshooting Guides.
Place the cluster into maintenance mode:
```
pcs property set maintenance-mode=true
```
Update the adapter firmware using mlxfwmanager:
```
mlxfwmanager -i <path/to/firmware.bin> -u
```
Note the PCI Device Name for each adapter receiving firmware updates.