Replace an NVIDIA SN2100 cluster switch
Follow this procedure to replace a defective NVIDIA SN2100 switch in a cluster network. This is a nondisruptive procedure (NDU).
Review requirements
Ensure that:
-
The existing cluster is verified as completely functional, with at least one fully connected cluster switch.
-
All cluster ports are up.
-
All cluster logical interfaces (LIFs) are up and on their home ports.
-
The ONTAP cluster ping-cluster -node node1 command indicates that basic connectivity and larger-than-PMTU communication are successful on all paths.
For the replacement switch, ensure that:
-
Management network connectivity on the replacement switch is functional.
-
Console access to the replacement switch is in place.
-
The node connections are on ports swp1 through swp14.
-
All Inter-Switch Link (ISL) ports are disabled on ports swp15 and swp16.
-
The desired reference configuration file (RCF) and Cumulus operating system image are loaded onto the switch.
-
Initial customization of the switch is complete.
Also make sure that any previous site customizations, such as STP, SNMP, and SSH, are copied to the new switch.
You must execute the command for migrating a cluster LIF from the node where the cluster LIF is hosted.
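For example, to move node1_clus2 off its current port ahead of the maintenance, a command similar to the following could be issued; the LIF, node, and port names are taken from this procedure's examples and might differ in your environment, and, as the note above states, the command must be run from the node that currently hosts the LIF:
network interface migrate -vserver Cluster -lif node1_clus2 -destination-node node1 -destination-port e3a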
Enable console logging
NetApp strongly recommends that you enable console logging on the devices that you are using and take the following actions when replacing your switch:
-
Leave AutoSupport enabled during maintenance.
-
Trigger a maintenance AutoSupport before and after maintenance to disable case creation for the duration of the maintenance. See this Knowledge Base article SU92: How to suppress automatic case creation during scheduled maintenance windows for further details.
-
Enable session logging for any CLI sessions. For instructions on how to enable session logging, review the "Logging Session Output" section in this Knowledge Base article How to configure PuTTY for optimal connectivity to ONTAP systems.
Replace the switch
The examples in this procedure use the following switch and node nomenclature:
-
The names of the existing NVIDIA SN2100 switches are sw1 and sw2.
-
The name of the new NVIDIA SN2100 switch is nsw2.
-
The node names are node1 and node2.
-
The cluster ports on each node are named e3a and e3b.
-
The cluster LIF names are node1_clus1 and node1_clus2 for node1, and node2_clus1 and node2_clus2 for node2.
-
The prompt for changes to all cluster nodes is
cluster1::*>
-
Breakout ports take the format: swp[port]s[breakout port 0-3]. For example, four breakout ports on swp1 are swp1s0, swp1s1, swp1s2, and swp1s3.
This procedure is based on the following cluster network topology:
Show example topology
cluster1::*> network port show -ipspace Cluster

Node: node1
                                                                        Ignore
                                                  Speed(Mbps)  Health   Health
Port      IPspace      Broadcast Domain Link MTU  Admin/Oper   Status   Status
--------- ------------ ---------------- ---- ---- ------------ -------- ------
e3a       Cluster      Cluster          up   9000 auto/100000  healthy  false
e3b       Cluster      Cluster          up   9000 auto/100000  healthy  false

Node: node2
                                                                        Ignore
                                                  Speed(Mbps)  Health   Health
Port      IPspace      Broadcast Domain Link MTU  Admin/Oper   Status   Status
--------- ------------ ---------------- ---- ---- ------------ -------- ------
e3a       Cluster      Cluster          up   9000 auto/100000  healthy  false
e3b       Cluster      Cluster          up   9000 auto/100000  healthy  false

cluster1::*> network interface show -vserver Cluster

            Logical     Status     Network            Current       Current Is
Vserver     Interface   Admin/Oper Address/Mask       Node          Port    Home
----------- ----------- ---------- ------------------ ------------- ------- ----
Cluster
            node1_clus1 up/up      169.254.209.69/16  node1         e3a     true
            node1_clus2 up/up      169.254.49.125/16  node1         e3b     true
            node2_clus1 up/up      169.254.47.194/16  node2         e3a     true
            node2_clus2 up/up      169.254.19.183/16  node2         e3b     true

cluster1::*> network device-discovery show -protocol lldp

Node/       Local  Discovered
Protocol    Port   Device (LLDP: ChassisID)  Interface    Platform
----------- ------ ------------------------- ------------ ----------------
node1      /lldp
            e3a    sw1 (b8:ce:f6:19:1a:7e)   swp3         -
            e3b    sw2 (b8:ce:f6:19:1b:96)   swp3         -
node2      /lldp
            e3a    sw1 (b8:ce:f6:19:1a:7e)   swp4         -
            e3b    sw2 (b8:ce:f6:19:1b:96)   swp4         -
cumulus@sw1:~$ net show lldp

LocalPort  Speed  Mode        RemoteHost         RemotePort
---------  -----  ----------  -----------------  -----------
swp3       100G   Trunk/L2    node1              e3a
swp4       100G   Trunk/L2    node2              e3a
swp15      100G   BondMember  sw2                swp15
swp16      100G   BondMember  sw2                swp16

cumulus@sw2:~$ net show lldp

LocalPort  Speed  Mode        RemoteHost         RemotePort
---------  -----  ----------  -----------------  -----------
swp3       100G   Trunk/L2    node1              e3b
swp4       100G   Trunk/L2    node2              e3b
swp15      100G   BondMember  sw1                swp15
swp16      100G   BondMember  sw1                swp16
Step 1: Prepare for replacement
-
If AutoSupport is enabled on this cluster, suppress automatic case creation by invoking an AutoSupport message:
system node autosupport invoke -node * -type all -message MAINT=xh
where x is the duration of the maintenance window in hours.
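For example, the following command suppresses automatic case creation for a two-hour maintenance window:
system node autosupport invoke -node * -type all -message MAINT=2h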
-
Change the privilege level to advanced, entering y when prompted to continue:
set -privilege advanced
The advanced prompt (*>) appears.
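The output is similar to the following; the exact warning text can vary by ONTAP release:
cluster1::> set -privilege advanced

Warning: These advanced commands are potentially dangerous; use them only when directed to do so by NetApp personnel.
Do you want to continue? {y|n}: y

cluster1::*>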
-
Install the appropriate RCF and image on the switch, nsw2, and make any necessary site preparations.
If necessary, verify, download, and install the appropriate versions of the RCF and Cumulus software for the new switch.
-
You can download the applicable Cumulus software for your cluster switches from the NVIDIA Support site. Follow the steps on the Download page to download the Cumulus Linux software for the version of ONTAP software you are installing.
-
The appropriate RCF is available from the NVIDIA Cluster and Storage Switches page. Follow the steps on the Download page to download the correct RCF for the version of ONTAP software you are installing.
Step 2: Configure ports and cabling
-
On the new switch nsw2, log in as admin and shut down all of the ports that will be connected to the node cluster interfaces (ports swp1 to swp14).
The LIFs on the cluster nodes should have already failed over to the other cluster port for each node.
Show example
cumulus@nsw2:~$ net add interface swp1s0-3, swp2s0-3, swp3-14 link down
cumulus@nsw2:~$ net pending
cumulus@nsw2:~$ net commit
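If you want to confirm the failover before continuing, you can check the "Is Home" column from any cluster node; this quick check is not part of the original command set, and the LIFs that were homed on ports connected to the defective switch should show false:
cluster1::*> network interface show -vserver Cluster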
-
Disable auto-revert on the cluster LIFs:
network interface modify -vserver Cluster -lif * -auto-revert false
Show example
cluster1::*> network interface modify -vserver Cluster -lif * -auto-revert false

Warning: Disabling the auto-revert feature of the cluster logical interface may
effect the availability of your cluster network. Are you sure you want to
continue? {y|n}: y
-
Verify that auto-revert is disabled on all cluster LIFs:
net interface show -vserver Cluster -fields auto-revert
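The output should be similar to the following; the LIF names follow this procedure's examples, and every entry should report false:
cluster1::*> network interface show -vserver Cluster -fields auto-revert

vserver lif         auto-revert
------- ----------- -----------
Cluster node1_clus1 false
Cluster node1_clus2 false
Cluster node2_clus1 false
Cluster node2_clus2 false
4 entries were displayed.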
-
Shut down the ISL ports swp15 and swp16 on the SN2100 switch sw1.
Show example
cumulus@sw1:~$ net add interface swp15-16 link down
cumulus@sw1:~$ net pending
cumulus@sw1:~$ net commit
-
Remove all the cables from the SN2100 sw2 switch, and then connect them to the same ports on the SN2100 nsw2 switch.
-
Bring up the ISL ports swp15 and swp16 between the sw1 and nsw2 switches.
Show example
The following commands enable ISL ports swp15 and swp16 on switch sw1:
cumulus@sw1:~$ net del interface swp15-16 link down
cumulus@sw1:~$ net pending
cumulus@sw1:~$ net commit
The following example shows that the ISL ports are up on switch sw1:
cumulus@sw1:~$ net show interface

State  Name         Spd   MTU    Mode        LLDP            Summary
-----  -----------  ----  -----  ----------  --------------  -----------------------
...
...
UP     swp15        100G  9216   BondMember  nsw2 (swp15)    Master: cluster_isl(UP)
UP     swp16        100G  9216   BondMember  nsw2 (swp16)    Master: cluster_isl(UP)
The following example shows that the ISL ports are up on switch nsw2:
cumulus@nsw2:~$ net show interface

State  Name         Spd   MTU    Mode        LLDP            Summary
-----  -----------  ----  -----  ----------  --------------  -----------------------
...
...
UP     swp15        100G  9216   BondMember  sw1 (swp15)     Master: cluster_isl(UP)
UP     swp16        100G  9216   BondMember  sw1 (swp16)     Master: cluster_isl(UP)
-
Verify that port e3b is up on all nodes:
network port show -ipspace Cluster
Show example
The output should be similar to the following:
cluster1::*> network port show -ipspace Cluster

Node: node1
                                                                        Ignore
                                                  Speed(Mbps)  Health   Health
Port      IPspace      Broadcast Domain Link MTU  Admin/Oper   Status   Status
--------- ------------ ---------------- ---- ---- ------------ -------- ------
e3a       Cluster      Cluster          up   9000 auto/100000  healthy  false
e3b       Cluster      Cluster          up   9000 auto/100000  healthy  false

Node: node2
                                                                        Ignore
                                                  Speed(Mbps)  Health   Health
Port      IPspace      Broadcast Domain Link MTU  Admin/Oper   Status   Status
--------- ------------ ---------------- ---- ---- ------------ -------- ------
e3a       Cluster      Cluster          up   9000 auto/100000  healthy  false
e3b       Cluster      Cluster          up   9000 auto/100000  healthy  false
-
Verify that the cluster ports on each node are now connected to the cluster switches in the following way, from the nodes' perspective:
network device-discovery show -protocol lldp
Show example
cluster1::*> network device-discovery show -protocol lldp

Node/       Local  Discovered
Protocol    Port   Device (LLDP: ChassisID)  Interface    Platform
----------- ------ ------------------------- ------------ ----------------
node1      /lldp
            e3a    sw1 (b8:ce:f6:19:1a:7e)   swp3         -
            e3b    nsw2 (b8:ce:f6:19:1b:b6)  swp3         -
node2      /lldp
            e3a    sw1 (b8:ce:f6:19:1a:7e)   swp4         -
            e3b    nsw2 (b8:ce:f6:19:1b:b6)  swp4         -
-
On switch nsw2, verify that all node cluster ports are up:
net show interface
Show example
cumulus@nsw2:~$ net show interface

State  Name         Spd   MTU    Mode        LLDP              Summary
-----  -----------  ----  -----  ----------  ----------------  -----------------------
...
...
UP     swp3         100G  9216   Trunk/L2                      Master: bridge(UP)
UP     swp4         100G  9216   Trunk/L2                      Master: bridge(UP)
UP     swp15        100G  9216   BondMember  sw1 (swp15)       Master: cluster_isl(UP)
UP     swp16        100G  9216   BondMember  sw1 (swp16)       Master: cluster_isl(UP)
-
Verify that each node has one connection to each switch:
net show lldp
Show example
The following example shows the appropriate results for both switches:
cumulus@sw1:~$ net show lldp

LocalPort  Speed  Mode        RemoteHost         RemotePort
---------  -----  ----------  -----------------  -----------
swp3       100G   Trunk/L2    node1              e3a
swp4       100G   Trunk/L2    node2              e3a
swp15      100G   BondMember  nsw2               swp15
swp16      100G   BondMember  nsw2               swp16

cumulus@nsw2:~$ net show lldp

LocalPort  Speed  Mode        RemoteHost         RemotePort
---------  -----  ----------  -----------------  -----------
swp3       100G   Trunk/L2    node1              e3b
swp4       100G   Trunk/L2    node2              e3b
swp15      100G   BondMember  sw1                swp15
swp16      100G   BondMember  sw1                swp16
-
Enable auto-revert on the cluster LIFs:
cluster1::*> network interface modify -vserver Cluster -lif * -auto-revert true
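With auto-revert enabled, the LIFs should return to their home ports on their own. If any LIF has not reverted after a short wait, you can optionally revert it manually and then recheck; these commands are not part of the original steps:
cluster1::*> network interface revert -vserver Cluster -lif *
cluster1::*> network interface show -vserver Cluster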
-
On switch nsw2, bring up the ports connected to the network ports of the nodes.
Show example
cumulus@nsw2:~$ net del interface swp1-14 link down
cumulus@nsw2:~$ net pending
cumulus@nsw2:~$ net commit
-
Display information about the nodes in a cluster:
cluster show
Show example
This example shows that the node health for node1 and node2 in this cluster is true:
cluster1::*> cluster show

Node          Health  Eligibility
------------- ------- ------------
node1         true    true
node2         true    true
-
Verify that all physical cluster ports are up:
network port show -ipspace Cluster
Show example
cluster1::*> network port show -ipspace Cluster

Node: node1
                                                                        Ignore
                                                  Speed(Mbps)  Health   Health
Port      IPspace      Broadcast Domain Link MTU  Admin/Oper   Status   Status
--------- ------------ ---------------- ---- ---- ------------ -------- ------
e3a       Cluster      Cluster          up   9000 auto/100000  healthy  false
e3b       Cluster      Cluster          up   9000 auto/100000  healthy  false

Node: node2
                                                                        Ignore
                                                  Speed(Mbps)  Health   Health
Port      IPspace      Broadcast Domain Link MTU  Admin/Oper   Status   Status
--------- ------------ ---------------- ---- ---- ------------ -------- ------
e3a       Cluster      Cluster          up   9000 auto/100000  healthy  false
e3b       Cluster      Cluster          up   9000 auto/100000  healthy  false
Step 3: Verify the configuration
-
Verify that the cluster network is healthy.
Show example
cumulus@sw1:~$ net show lldp

LocalPort  Speed  Mode        RemoteHost      RemotePort
---------  -----  ----------  --------------  -----------
swp3       100G   Trunk/L2    node1           e3a
swp4       100G   Trunk/L2    node2           e3a
swp15      100G   BondMember  nsw2            swp15
swp16      100G   BondMember  nsw2            swp16
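You can also rerun the connectivity check from the requirements section against each node to confirm that basic and larger-than-PMTU communication succeed on all paths (output omitted here):
cluster1::*> cluster ping-cluster -node node1
cluster1::*> cluster ping-cluster -node node2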
-
Change the privilege level back to admin:
set -privilege admin
-
If you suppressed automatic case creation, re-enable it by invoking an AutoSupport message:
system node autosupport invoke -node * -type all -message MAINT=END