Replace a NVIDIA SN2100 cluster switch

Contributors netapp-yvonneo

Replacing a defective NVIDIA SN2100 switch in a cluster network is a nondisruptive procedure (NDU).

Before you begin

The following conditions must exist before performing the switch replacement in the current environment and on the replacement switch.

  • Existing cluster and network infrastructure:

    • The existing cluster must be verified as completely functional, with at least one fully connected cluster switch.

    • All cluster ports must be up.

    • All cluster logical interfaces (LIFs) must be up and on their home ports.

    • The ONTAP cluster ping-cluster -node node1 command must indicate that basic connectivity and larger than PMTU communication are successful on all paths.

  • NVIDIA SN2100 replacement switch:

    • Management network connectivity on the replacement switch must be functional.

    • Console access to the replacement switch must be in place.

    • The node connections are ports swp1 through swp14.

    • All Inter-Switch Link (ISL) ports must be disabled on ports swp15 and swp16.

    • The desired reference configuration file (RCF) and Cumulus operating system image switch must be loaded onto the switch.

    • Initial customization of the switch must be complete, as detailed in:

      Any previous site customizations, such as STP, SNMP, and SSH, should be copied to the new switch.

You must execute the command for migrating a cluster LIF from the node where the cluster LIF is hosted.

About this task

The examples in this procedure use the following switch and node nomenclature:

  • The names of the existing NVIDIA SN2100 switches are sw1 and sw2.

  • The name of the new NVIDIA SN2100 switch is nsw2.

  • The node names are node1 and node2.

  • The cluster ports on each node are named e3a and e3b.

  • The cluster LIF names are node1_clus1 and node1_clus2 for node1, and node2_clus1 and node2_clus2 for node2.

  • The prompt for changes to all cluster nodes is cluster1::*>

  • Breakout ports take the format: swp[port]s[breakout port 0-3]. For example, four breakout ports on swp1 are swp1s0, swp1s1, swp1s2, and swp1s3.

    Note The following procedure is based on the following cluster network topology:
    cluster1::*> network port show -ipspace Cluster
    
    Node: node1
                                                                            Ignore
                                                      Speed(Mbps)  Health   Health
    Port      IPspace      Broadcast Domain Link MTU  Admin/Oper   Status   Status
    --------- ------------ ---------------- ---- ---- ------------ -------- ------
    e3a       Cluster      Cluster          up   9000  auto/100000 healthy  false
    e3b       Cluster      Cluster          up   9000  auto/100000 healthy  false
    
    Node: node2
                                                                            Ignore
                                                      Speed(Mbps)  Health   Health
    Port      IPspace      Broadcast Domain Link MTU  Admin/Oper   Status   Status
    --------- ------------ ---------------- ---- ---- ------------ -------- ------
    e3a       Cluster      Cluster          up   9000  auto/100000 healthy  false
    e3b       Cluster      Cluster          up   9000  auto/100000 healthy  false
    
    
    cluster1::*> network interface show -vserver Cluster
    
                Logical    Status     Network            Current       Current Is
    Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
    ----------- ---------- ---------- ------------------ ------------- ------- ----
    Cluster
                node1_clus1  up/up    169.254.209.69/16  node1         e3a     true
                node1_clus2  up/up    169.254.49.125/16  node1         e3b     true
                node2_clus1  up/up    169.254.47.194/16  node2         e3a     true
                node2_clus2  up/up    169.254.19.183/16  node2         e3b     true
    
    
    cluster1::*> network device-discovery show -protocol lldp
    Node/       Local  Discovered
    Protocol    Port   Device (LLDP: ChassisID)  Interface     Platform
    ----------- ------ ------------------------- ------------  ----------------
    node1      /lldp
                e3a    sw1 (b8:ce:f6:19:1a:7e)   swp3          -
                e3b    sw2 (b8:ce:f6:19:1b:96)   swp3          -
    node2      /lldp
                e3a    sw1 (b8:ce:f6:19:1a:7e)   swp4          -
                e3b    sw2 (b8:ce:f6:19:1b:96)   swp4          -
    cumulus@sw1:~$ net show lldp
    
    LocalPort  Speed  Mode        RemoteHost         RemotePort
    ---------  -----  ----------  -----------------  -----------
    swp3       100G   Trunk/L2    sw2                e3a
    swp4       100G   Trunk/L2    sw2                e3a
    swp15      100G   BondMember  sw2                swp15
    swp16      100G   BondMember  sw2                swp16
    
    
    cumulus@sw2:~$ net show lldp
    
    LocalPort  Speed  Mode        RemoteHost         RemotePort
    ---------  -----  ----------  -----------------  -----------
    swp3       100G   Trunk/L2    sw1                e3b
    swp4       100G   Trunk/L2    sw1                e3b
    swp15      100G   BondMember  sw1                swp15
    swp16      100G   BondMember  sw1                swp16
Steps
  1. If AutoSupport is enabled on this cluster, suppress automatic case creation by invoking an AutoSupport message: system node autosupport invoke -node * -type all -message MAINT=xh

    where x is the duration of the maintenance window in hours.

  2. Change the privilege level to advanced, entering y when prompted to continue: set -privilege advanced

    The advanced prompt (*>) appears.

  3. Install the appropriate RCF and image on the switch, nsw2, and make any necessary site preparations.

    If necessary, verify, download, and install the appropriate versions of the RCF and Cumulus software for the new switch. If you have verified that the new switch is correctly set up and does not need updates to the RCF and Cumulus software, continue to step 4. See Setup and configure the NVIDIA SN2100 switches for further details.

    1. You can download the applicable Cumulus software for your cluster switches from the NVIDIA Support site. Follow the steps on the Download page to download the Cumulus Linux for the version of ONTAP software you are installing.

    2. The appropriate RCF is available from the NVIDIA Cluster and Storage Switches page. Follow the steps on the Download page to download the correct RCF for the version of ONTAP software you are installing.

  4. On the new switch nsw2, log in as admin and shut down all of the ports that will be connected to the node cluster interfaces (ports swp1 to swp14).

    If the switch that you are replacing is not functional and is powered down, go to Step 5. The LIFs on the cluster nodes should have already failed over to the other cluster port for each node.

    cumulus@nsw2:~$ net add interface swp1s0-3, swp2s0-3, swp3-14 link down
    cumulus@nsw2:~$ net pending
    cumulus@nsw2:~$ net commit
  5. Disable auto-revert on the cluster LIFs: network interface modify -vserver Cluster -lif * -auto-revert false

    cluster1::*> network interface modify -vserver Cluster -lif * -auto-revert false
    
    Warning: Disabling the auto-revert feature of the cluster logical interface may effect the availability of your cluster network. Are you sure you want to continue? {y|n}: y
  6. Shut down the ISL ports swp15 and swp16 on the SN2100 switch sw1:

    cumulus@sw1:~$ net add interface swp15-16 link down
    cumulus@sw1:~$ net pending
    cumulus@sw1:~$ net commit
  7. Remove all the cables from the SN2100 sw1 switch, and then connect them to the same ports on the SN2100 nsw2 switch.

  8. Bring up the ISL ports swp15 and swp16 between the sw1 and nsw2 switches.

    The following commands enable ISL ports swp15 and swp16 on switch sw1:

    cumulus@sw1:~$ net del interface swp15-16 link down
    cumulus@sw1:~$ net pending
    cumulus@sw1:~$ net commit

    The following example shows that the ISL ports are up on switch sw1:

    cumulus@sw1:~$ net show interface
    
    State  Name         Spd   MTU    Mode        LLDP           Summary
    -----  -----------  ----  -----  ----------  -------------- ----------------------
    ...
    ...
    UP     swp15        100G  9216   BondMember  nsw2 (swp15)   Master: cluster_isl(UP)
    UP     swp16        100G  9216   BondMember  nsw2 (swp16)   Master: cluster_isl(UP)

    The following example shows that the ISL ports are up on switch nsw2:

    cumulus@nsw2:~$ net show interface
    
    State  Name         Spd   MTU    Mode        LLDP           Summary
    -----  -----------  ----  -----  ----------  -------------  -----------------------
    ...
    ...
    UP     swp15        100G  9216   BondMember  sw1 (swp15)    Master: cluster_isl(UP)
    UP     swp16        100G  9216   BondMember  sw1 (swp16)    Master: cluster_isl(UP)
  9. Verify that port e3b is up on all nodes: network port show -ipspace Cluster

    The output should be similar to the following:

    cluster1::*> network port show -ipspace Cluster
    
    Node: node1
                                                                             Ignore
                                                       Speed(Mbps)  Health   Health
    Port      IPspace      Broadcast Domain Link MTU   Admin/Oper   Status   Status
    --------- ------------ ---------------- ---- ----- ------------ -------- -------
    e3a       Cluster      Cluster          up   9000  auto/100000  healthy  false
    e3b       Cluster      Cluster          up   9000  auto/100000  healthy  false
    
    
    Node: node2
                                                                             Ignore
                                                       Speed(Mbps) Health    Health
    Port      IPspace      Broadcast Domain Link MTU   Admin/Oper  Status    Status
    --------- ------------ ---------------- ---- ----- ----------- --------- -------
    e3a       Cluster      Cluster          up   9000  auto/100000  healthy  false
    e3b       Cluster      Cluster          up   9000  auto/100000  healthy  false
  10. The cluster ports on each node are now connected to cluster switches in the following way, from the nodes' perspective:

    cluster1::*> network device-discovery show -protocol lldp
    Node/       Local  Discovered
    Protocol    Port   Device (LLDP: ChassisID)  Interface     Platform
    ----------- ------ ------------------------- ------------  ----------------
    node1      /lldp
                e3a    sw1  (b8:ce:f6:19:1a:7e)   swp3          -
                e3b    nsw2 (b8:ce:f6:19:1b:b6)   swp3          -
    node2      /lldp
                e3a    sw1  (b8:ce:f6:19:1a:7e)   swp4          -
                e3b    nsw2 (b8:ce:f6:19:1b:b6)   swp4          -
  11. Verify that all node cluster ports are up: net show interface

    cumulus@nsw2:~$ net show interface
    
    State  Name         Spd   MTU    Mode        LLDP              Summary
    -----  -----------  ----  -----  ----------  ----------------- ----------------------
    ...
    ...
    UP     swp3         100G  9216   Trunk/L2                      Master: bridge(UP)
    UP     swp4         100G  9216   Trunk/L2                      Master: bridge(UP)
    UP     swp15        100G  9216   BondMember  sw1 (swp15)       Master: cluster_isl(UP)
    UP     swp16        100G  9216   BondMember  sw1 (swp16)       Master: cluster_isl(UP)
  12. Verify that both nodes each have one connection to each switch: net show lldp

    The following example shows the appropriate results for both switches:

    cumulus@sw1:~$ net show lldp
    
    LocalPort  Speed  Mode        RemoteHost         RemotePort
    ---------  -----  ----------  -----------------  -----------
    swp3       100G   Trunk/L2    node1              e3a
    swp4       100G   Trunk/L2    node2              e3a
    swp15      100G   BondMember  nsw2               swp15
    swp16      100G   BondMember  nsw2               swp16
    
    
    cumulus@nsw2:~$ net show lldp
    
    LocalPort  Speed  Mode        RemoteHost         RemotePort
    ---------  -----  ----------  -----------------  -----------
    swp3       100G   Trunk/L2    node1                e3b
    swp4       100G   Trunk/L2    node2                e3b
    swp15      100G   BondMember  sw1                swp15
    swp16      100G   BondMember  sw1                swp16
  13. Enable auto-revert on the cluster LIFs: cluster1::*> network interface modify -vserver Cluster -lif * -auto-revert true

  14. On switch nsw2, bring up the ports connected to the network ports of the nodes.

    cumulus@nsw2:~$ net del interface swp1-14 link down
    cumulus@nsw2:~$ net pending
    cumulus@nsw2:~$ net commit
  15. Display information about the nodes in a cluster: cluster show

    This example shows that the node health for node1 and node2 in this cluster is true:

    cluster1::*> cluster show
    
    Node          Health  Eligibility
    ------------- ------- ------------
    node1         true    true
    node2         true    true
  16. Verify that all physical cluster ports are up: network port show ipspace Cluster

    cluster1::*> network port show -ipspace Cluster
    
    Node node1                                                               Ignore
                                                        Speed(Mbps) Health   Health
    Port      IPspace     Broadcast Domain  Link  MTU   Admin/Oper  Status   Status
    --------- ----------- ----------------- ----- ----- ----------- -------- ------
    e3a       Cluster     Cluster           up    9000  auto/10000  healthy  false
    e3b       Cluster     Cluster           up    9000  auto/10000  healthy  false
    
    Node: node2
                                                                             Ignore
                                                        Speed(Mbps) Health   Health
    Port      IPspace      Broadcast Domain Link  MTU   Admin/Oper  Status   Status
    --------- ------------ ---------------- ----- ----- ----------- -------- ------
    e3a       Cluster      Cluster          up    9000  auto/10000  healthy  false
    e3b       Cluster      Cluster          up    9000  auto/10000  healthy  false
  17. Verify that the cluster network is healthy:

    cumulus@sw1:~$ net show lldp
    
    LocalPort  Speed  Mode        RemoteHost      RemotePort
    ---------  -----  ----------  --------------  -----------
    swp3       100G   Trunk/L2    node1           e3a
    swp4       100G   Trunk/L2    node2           e3a
    swp15      100G   BondMember  nsw2            swp15
    swp16      100G   BondMember  nsw2            swp16
  18. Enable the Ethernet switch health monitor log collection feature for collecting switch-related log files, using the commands: system switch ethernet log setup-password and system switch ethernet log enable-collection

    Enter: system switch ethernet log setup-password

    cluster1::*> system switch ethernet log setup-password
    Enter the switch name: <return>
    The switch name entered is not recognized.
    Choose from the following list:
    sw1
    nsw2
    
    cluster1::*> system switch ethernet log setup-password
    
    Enter the switch name: sw1
    RSA key fingerprint is e5:8b:c6:dc:e2:18:18:09:36:63:d9:63:dd:03:d9:cc
    Do you want to continue? {y|n}::[n] y
    
    Enter the password: <enter switch password>
    Enter the password again: <enter switch password>
    
    cluster1::*> system switch ethernet log setup-password
    
    Enter the switch name: nsw2
    RSA key fingerprint is 57:49:86:a1:b9:80:6a:61:9a:86:8e:3c:e3:b7:1f:b1
    Do you want to continue? {y|n}:: [n] y
    
    Enter the password: <enter switch password>
    Enter the password again: <enter switch password>

    Followed by: system switch ethernet log enable-collection

    cluster1::*> system switch ethernet log enable-collection
    
    Do you want to enable cluster log collection for all nodes in the cluster?
    {y|n}: [n] y
    
    Enabling cluster switch log collection.
    
    cluster1::*>
    Note If any of these commands return an error, contact NetApp support.
  19. Initiate the switch log collection feature: system switch ethernet log collect -device *

    Wait for 10 minutes and then check that the log collection was successful using the command: system switch ethernet log show

    cluster1::*> system switch ethernet log show
    Log Collection Enabled: true
    
    Index  Switch                       Log Timestamp        Status
    ------ ---------------------------- -------------------  ---------    
    1      sw1 (b8:ce:f6:19:1b:42)      4/29/2022 03:05:25   complete   
    2      nsw2 (b8:ce:f6:19:1b:96)     4/29/2022 03:07:42   complete
  20. Change the privilege level back to admin: set -privilege admin

  21. If you suppressed automatic case creation, re-enable it by invoking an AutoSupport message: system node autosupport invoke -node * -type all -message MAINT=END