Hot-swap the I/O module used for cluster and HA traffic - AFF C30 and AFF C60
The cluster and HA I/O module provides the interconnects for cluster and high-availability (HA) traffic. You can hot-swap the module in your AFF C30 or AFF C60 storage system if the module fails and your storage system meets specific requirements.
To hot-swap a module, you ensure your storage system meets the procedure requirements, prepare the storage system and the I/O module in slot 4, replace the failed module with an equivalent one, bring the replacement module online, restore the storage system to normal operation, and return the failed module to NetApp.
-
Hot-swapping the cluster and HA I/O module means that you do not have to perform a manual takeover; the impaired controller (the controller with the failed cluster and HA I/O module) has automatically taken over the healthy controller.
When the impaired controller has taken over the healthy controller, the only way to recover without an outage is to hot-swap the module.
-
It is critical to apply the commands to the correct controller when you are hot-swapping the cluster and HA I/O module:
-
The impaired controller is the controller on which you are hot-swapping the cluster and HA I/O module and it is the controller that has taken over the healthy controller.
-
The healthy controller is the HA partner of the impaired controller and it is the controller that was taken over by the impaired controller.
-
-
If needed, you can turn on the storage system location (blue) LEDs to aid in physically locating the affected storage system. Log in to the BMC using SSH and enter the
system location-led on
command. A storage system has three location LEDs: one on the operator display panel and one on each controller. Location LEDs remain illuminated for 30 minutes.
You can turn them off by entering the
system location-led off
command. If you are unsure whether the LEDs are on or off, you can check their state by entering the
system location-led show
command.
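For example, after logging in to the BMC over SSH, you might run the following sequence to turn the LEDs on, check their state, and turn them off again. Only the commands themselves are shown; the BMC prompt and any output format are omitted here because they can vary:
system location-led on
system location-led show
system location-led off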
Step 1: Ensure the storage system meets the procedure requirements
To use this procedure, make sure your storage system meets all requirements.
Note: If your storage system does not meet all of these requirements, you must use the replace an I/O module procedure.
-
Your storage system must be running ONTAP 9.17.1 or later.
-
The I/O module that failed must be a cluster and HA I/O module in slot 4, and you must replace it with an equivalent cluster and HA I/O module. You cannot change the I/O module type.
-
Your storage system configuration must have only one cluster and HA I/O module located in slot 4, not two cluster and HA I/O modules.
-
Your storage system must be a two-node (switchless or switched) cluster configuration.
-
The controller with the failed cluster and HA I/O module (the impaired controller) must have already taken over the healthy partner controller. The takeover should have occurred automatically when the I/O module failed.
For two-node clusters, the storage system cannot discern which controller has the failed I/O module, so either controller might initiate the takeover. The cluster and HA I/O module hot-swap procedure is only supported when the controller with the failed I/O module (the impaired controller) has taken over the healthy controller.
You can verify that the impaired controller successfully took over the healthy controller by entering the
storage failover show
command, as shown in the example after this list. If you are not sure which controller has the failed I/O module, contact NetApp Support.
-
All other components in the storage system must be functioning properly; if not, contact NetApp Support before continuing with this procedure.
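The following illustrative output of the storage failover show command assumes a two-node cluster in which node2 (the impaired controller) has taken over node1 (the healthy controller). The node names are placeholders, and the exact State Description text can vary by ONTAP release:
node2::> storage failover show
                              Takeover
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------
node1          node2          -        Unknown
node2          node1          false    In takeover
2 entries were displayed.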
Step 2: Prepare the storage system and I/O module slot 4
Prepare the storage system and I/O module slot 4 so that it is safe to remove the failed cluster and HA I/O module:
-
Properly ground yourself.
-
Unplug cabling from the failed cluster and HA I/O module.
Make sure to label the cables so that later in this procedure you can reconnect them to the same ports.
-
If AutoSupport is enabled, suppress automatic case creation by invoking an AutoSupport message:
system node autosupport invoke -node * -type all -message MAINT=<number of hours down>h
For example, the following AutoSupport message suppresses automatic case creation for two hours:
node2::> system node autosupport invoke -node * -type all -message MAINT=2h
-
Disable automatic giveback:
-
Enter the following command from the console of the impaired controller:
storage failover modify -node local -auto-giveback false
-
Enter
y
when you see the prompt Do you want to disable auto-giveback?
-
-
Prepare the failed cluster and HA module in slot 4 for removal by removing it from service and powering it off:
-
Enter the following command:
system controller slot module remove -node impaired_node_name -slot slot_number
-
Enter
y
when you see the prompt Do you want to continue?
For example, the following command prepares the module in slot 4 on node2 (the impaired controller) for removal, and displays a message that it is safe to remove:
node2::> system controller slot module remove -node node2 -slot 4
Warning: IO_2X_100GBE_NVDA_NIC module in slot 4 of node node2 will be powered off for removal. Do you want to continue? {y|n}: y
The module has been successfully removed from service and powered off. It can now be safely removed.
-
-
Verify the failed cluster and HA module in slot 4 is powered off:
system controller slot module show
The output should show
powered-off
in the status column for the failed module in slot 4, as in the illustrative session after this list.
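The following illustrative session shows disabling automatic giveback and then confirming that the module in slot 4 is powered off. The node name and the column layout of the system controller slot module show output are assumptions for illustration; what matters is the powered-off value in the status column:
node2::> storage failover modify -node local -auto-giveback false
node2::> system controller slot module show
Node     Slot  Module                  Status
-------- ----- ----------------------- -----------
node2    4     IO_2X_100GBE_NVDA_NIC   powered-off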
Step 3: Replace the failed cluster and HA I/O module
Replace the failed cluster and HA I/O module in slot 4 with an equivalent I/O module:
-
If you are not already grounded, properly ground yourself.
-
Remove the failed cluster and HA I/O module from the impaired controller:
-
Turn the I/O module thumbscrew counterclockwise to loosen it.
-
Pull the I/O module out of the controller using the port label tab on the left and the thumbscrew on the right.
-
Install the replacement cluster and HA I/O module into slot 4:
-
Align the I/O module with the edges of the slot.
-
Gently push the I/O module all the way into the slot, making sure to properly seat the I/O module into the connector.
You can use the tab on the left and the thumbscrew on the right to push in the I/O module.
-
Turn the thumbscrew clockwise to tighten.
-
-
Cable the cluster and HA I/O module, using your labels to reconnect each cable to the same port it was connected to before.
Step 4: Bring the replacement cluster and HA I/O module online
Bring the replacement cluster and HA I/O module in slot 4 online, verify the module ports initialized successfully, verify slot 4 is powered on, and then verify the module is online and recognized.
-
Bring the replacement cluster and HA I/O module online:
-
Enter the following command:
system controller slot module insert -node impaired_node_name -slot slot_number
-
Enter
y
when you see the prompt Do you want to continue?
The output should confirm that the cluster and HA I/O module was successfully brought online (powered on, initialized, and placed into service).
For example, the following command brings slot 4 on node 2 (the impaired controller) online, and displays a message that the process was successful:
node2::> system controller slot module insert -node node2 -slot 4
Warning: IO_2X_100GBE_NVDA_NIC module in slot 4 of node node2 will be powered on and initialized. Do you want to continue? {y|n}: y
The module has been successfully powered on, initialized and placed into service.
-
-
Verify that each port on the cluster and HA I/O module successfully initialized:
event log show -event *hotplug.init*
It might take several minutes to allow for any required firmware updates and port initialization. The output should show a
hotplug.init.success
EMS event in the Event column for each port on the cluster and HA I/O module.
For example, the following output shows initialization succeeded for cluster and HA I/O module ports e4b and e4a:
node2::> event log show -event *hotplug.init*
Time                Node             Severity      Event
------------------- ---------------- ------------- ---------------------------
7/11/2025 16:04:06  node2            NOTICE        hotplug.init.success: Initialization of ports "e4b" in slot 4 succeeded
7/11/2025 16:04:06  node2            NOTICE        hotplug.init.success: Initialization of ports "e4a" in slot 4 succeeded
2 entries were displayed.
-
Verify I/O module slot 4 is powered on and ready for operation:
system controller slot module show
The output should show slot 4 status as
powered-on
indicating that the replacement cluster and HA I/O module is ready for operation (see the illustrative session after this list).
-
Verify that the replacement cluster and HA I/O module is online and recognized.
Enter the command from the console of the impaired controller:
system controller config show -node local -slot 4
If the replacement cluster and HA I/O module was successfully brought online and is recognized, the output shows I/O module information, including port information, for slot 4.
For example, you should see output similar to the following:
node2::> system controller config show -node local -slot 4
Node: node2
     Sub-   Device/
Slot slot   Information
---- ----   -----------------------------
4    -      Dual 40G/100G Ethernet Controller CX6-DX
            e4a MAC Address: d0:39:ea:59:69:74 (auto-100g_cr4-fd-up)
                QSFP Vendor: CISCO-BIZLINK
                QSFP Part Number: L45593-D218-D10
                QSFP Serial Number: LCC2807GJFM-B
            e4b MAC Address: d0:39:ea:59:69:75 (auto-100g_cr4-fd-up)
                QSFP Vendor: CISCO-BIZLINK
                QSFP Part Number: L45593-D218-D10
                QSFP Serial Number: LCC2809G26F-A
            Device Type: CX6-DX PSID(NAP0000000027)
            Firmware Version: 22.44.1700
            Part Number: 111-05341
            Hardware Revision: 20
            Serial Number: 032403001370
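The following illustrative session shows the powered-on check; as in Step 2, the node name and column layout are assumptions for illustration:
node2::> system controller slot module show
Node     Slot  Module                  Status
-------- ----- ----------------------- -----------
node2    4     IO_2X_100GBE_NVDA_NIC   powered-on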
Step 5: Restore the storage system to normal operation
Restore your storage system to normal operation by giving back storage to the healthy controller, restoring automatic giveback, and reenabling AutoSupport automatic case creation.
-
Return the healthy controller (the controller that was taken over) to normal operation by giving back its storage (the commands in this section are shown together in the example after this list):
storage failover giveback -ofnode healthy_node_name
-
Restore automatic giveback from the console of the impaired controller (the controller that took over the healthy controller):
storage failover modify -node local -auto-giveback true
-
If AutoSupport is enabled, restore automatic case creation:
system node autosupport invoke -node * -type all -message MAINT=end
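The following illustrative session shows the three restore commands run in order from the console of the impaired controller; node1 is a placeholder for the healthy controller's name, and any confirmation prompts or output are omitted:
node2::> storage failover giveback -ofnode node1
node2::> storage failover modify -node local -auto-giveback true
node2::> system node autosupport invoke -node * -type all -message MAINT=end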
Step 6: Return the failed part to NetApp
Return the failed part to NetApp, as described in the RMA instructions shipped with the kit. See the Part Return and Replacements page for further information.