Upgrading controllers in a four-node MetroCluster FC configuration using switchover and switchback with "system controller replace" commands (ONTAP 9.10.1 and later)
You can use this guided automated MetroCluster switchover operation to perform a non-disruptive controller upgrade on a four-node MetroCluster FC configuration. Other components (such as storage shelves or switches) cannot be upgraded as part of this procedure.
Supported platform combinations
-
For information on what platform upgrade combinations are supported, review the MetroCluster FC upgrade table in Choose a controller upgrade procedure.
Refer to Choosing an upgrade or refresh method for additional procedures.
About this task
-
You can use this procedure only for controller upgrade.
Other components in the configuration, such as storage shelves or switches, cannot be upgraded at the same time.
-
This procedure applies to controller modules in a four-node MetroCluster FC configuration.
-
The platforms must be running ONTAP 9.10.1 or later.
-
You can use this procedure to upgrade controllers in a four-node MetroCluster FC configuration using NSO based automated switchover and switchback. If you want to perform a controller upgrade using aggregate relocation (ARL), refer to Use "system controller replace" commands to upgrade controller hardware running ONTAP 9.8 or later. It is recommended to use the NSO based automated procedure.
-
If your MetroCluster sites are physically at two different locations, you should use the automated NSO controller upgrade procedure to upgrade the controllers at both sites in sequence.
-
This automated NSO based controller upgrade procedure gives you the capability to initiate controller replacement to a MetroCluster disaster recovery (DR) site. You can only initiate a controller replacement at one site at a time.
-
To initiate a controller replacement at site A, you need to run the controller replacement start command from site B. The operation guides you to replace controllers of both the nodes at site A only. To replace the controllers at site B, you need to run the controller replacement start command from site A. A message displays identifying the site at which the controllers are being replaced.
The following example names are used in this procedure:
-
site_A
-
Before upgrade:
-
node_A_1-old
-
node_A_2-old
-
-
After upgrade:
-
node_A_1-new
-
node_A_2-new
-
-
-
site_B
-
Before upgrade:
-
node_B_1-old
-
node_B_2-old
-
-
After upgrade:
-
node_B_1-new
-
node_B_2-new
-
-
Preparing for the upgrade
To prepare for the controller upgrade, you need to perform system prechecks and collect the configuration information.
At any stage during the upgrade, you can run the system controller replace show
or system controller replace show-details
command from site A to check the status. If the commands return a blank output, wait for a few minutes and rerun the command.
-
Start the automated controller replacement procedure from site A to replace the controllers at site B:
system controller replace start
The automated operation executes the prechecks. If no issues are found, the operation pauses so you can manually collect the configuration related information.
The current source system and all compatible target systems are displayed. If you have replaced the source controller with a controller that has a different ONTAP version or a non-compatible platform, the automation operation halts and reports an error after the new nodes are booted up. To bring the cluster back to a healthy state, you need to follow the manual recovery procedure. The
system controller replace start
command might report the following precheck error:Cluster-A::*>system controller replace show Node Status Error-Action ----------- -------------- ------------------------------------ Node-A-1 Failed MetroCluster check failed. Reason : MCC check showed errors in component aggregates
Check if this error occurred because you have unmirrored aggregates or due to another aggregate issue. Verify that all mirrored aggregates are healthy and not degraded or mirror-degraded. If this error is due to unmirrored aggregates only, you can override this error by selecting the
-skip-metrocluster-check true
option on thesystem controller replace start
command. If remote storage is accessible, the unmirrored aggregates come online after switchover. If the remote storage link fails, the unmirrored aggregates fail to come online. -
Manually collect the configuration information by logging in at site B and following the commands listed in the console message under the
system controller replace show
orsystem controller replace show-details
command.
Gathering information before the upgrade
Before upgrading, if the root volume is encrypted, you must gather the backup key and other information to boot the new controllers with the old encrypted root volumes.
This task is performed on the existing MetroCluster FC configuration.
-
Label the cables for the existing controllers, so you can easily identify the cables when setting up the new controllers.
-
Display the commands to capture the backup key and other information:
system controller replace show
Run the commands listed under the
show
command from the partner cluster. -
Gather the system IDs of the nodes in the MetroCluster configuration:
metrocluster node show -fields node-systemid,dr-partner-systemid
During the upgrade procedure, you will replace these old system IDs with the system IDs of the new controller modules.
In this example for a four-node MetroCluster FC configuration, the following old system IDs are retrieved:
-
node_A_1-old: 4068741258
-
node_A_2-old: 4068741260
-
node_B_1-old: 4068741254
-
node_B_2-old: 4068741256
metrocluster-siteA::> metrocluster node show -fields node-systemid,ha-partner-systemid,dr-partner-systemid,dr-auxiliary-systemid dr-group-id cluster node node-systemid ha-partner-systemid dr-partner-systemid dr-auxiliary-systemid ----------- --------------- ---------- ------------- ------------------- ------------------- --------------------- 1 Cluster_A Node_A_1-old 4068741258 4068741260 4068741256 4068741256 1 Cluster_A Node_A_2-old 4068741260 4068741258 4068741254 4068741254 1 Cluster_B Node_B_1-old 4068741254 4068741256 4068741258 4068741260 1 Cluster_B Node_B_2-old 4068741256 4068741254 4068741260 4068741258 4 entries were displayed.
In this example for a two-node MetroCluster FC configuration, the following old system IDs are retrieved:
-
node_A_1: 4068741258
-
node_B_1: 4068741254
metrocluster node show -fields node-systemid,dr-partner-systemid dr-group-id cluster node node-systemid dr-partner-systemid ----------- ---------- -------- ------------- ------------ 1 Cluster_A Node_A_1-old 4068741258 4068741254 1 Cluster_B node_B_1-old - - 2 entries were displayed.
-
-
Gather port and LIF information for each old node.
You should gather the output of the following commands for each node:
-
network interface show -role cluster,node-mgmt
-
network port show -node node-name -type physical
-
network port vlan show -node node-name
-
network port ifgrp show -node node_name -instance
-
network port broadcast-domain show
-
network port reachability show -detail
-
network ipspace show
-
volume show
-
storage aggregate show
-
system node run -node node-name sysconfig -a
-
-
If the MetroCluster nodes are in a SAN configuration, collect the relevant information.
You should gather the output of the following commands:
-
fcp adapter show -instance
-
fcp interface show -instance
-
iscsi interface show
-
ucadmin show
-
-
If the root volume is encrypted, collect and save the passphrase used for key-manager:
security key-manager backup show
-
If the MetroCluster nodes are using encryption for volumes or aggregates, copy information about the keys and passphrases.
For additional information, see Backing up onboard key management information manually.
-
If Onboard Key Manager is configured:
security key-manager onboard show-backup
You will need the passphrase later in the upgrade procedure.
-
If enterprise key management (KMIP) is configured, issue the following commands:
security key-manager external show -instance
security key-manager key query
-
-
After you finish collecting the configuration information, resume the operation:
system controller replace resume
Removing the existing configuration from the Tiebreaker or other monitoring software
If the existing configuration is monitored with the MetroCluster Tiebreaker configuration or other third-party applications (for example, ClusterLion) that can initiate a switchover, you must remove the MetroCluster configuration from the Tiebreaker or other software prior to replacing the old controller.
-
Remove the existing MetroCluster configuration from the Tiebreaker software.
-
Remove the existing MetroCluster configuration from any third-party application that can initiate switchover.
Refer to the documentation for the application.
Replacing the old controllers and booting up the new controllers
After you gather information and resume the operation, the automation proceeds with the switchover operation.
The automation operation initiates the switchover, heal-aggregates
, and heal root-aggregates
operations. After these operations complete, the operation pauses at paused for user intervention so you can rack and install the controllers, boot up the partner controllers, and reassign the root aggregate disks to the new controller module from flash backup using the sysids
gathered earlier.
Before initiating switchover, the automation operation pauses so you can manually verify that all LIFs are “up” at site B. If necessary, bring any LIFs that are “down” to “up” and resume the automation operation by using the system controller replace resume
command.
Preparing the network configuration of the old controllers
To ensure that the networking resumes cleanly on the new controllers, you must move LIFs to a common port and then remove the networking configuration of the old controllers.
-
This task must be performed on each of the old nodes.
-
You will use the information gathered in Preparing for the upgrade.
-
Boot the old nodes and then log in to the nodes:
boot_ontap
-
Assign the home port of all data LIFs on the old controller to a common port that is the same on both the old and new controller modules.
-
Display the LIFs:
network interface show
All data LIFS including SAN and NAS will be admin “up” and operationally “down” since those are up at switchover site (cluster_A).
-
Review the output to find a common physical network port that is the same on both the old and new controllers that is not used as a cluster port.
For example, “e0d” is a physical port on old controllers and is also present on new controllers. “e0d” is not used as a cluster port or otherwise on the new controllers.
For port usage for platform models, see the NetApp Hardware Universe
-
Modify all data LIFS to use the common port as the home port:
network interface modify -vserver svm-name -lif data-lif -home-port port-id
In the following example, this is “e0d”.
For example:
network interface modify -vserver vs0 -lif datalif1 -home-port e0d
-
-
Modify broadcast domains to remove VLAN and physical ports that need to be deleted:
broadcast-domain remove-ports -broadcast-domain broadcast-domain-name -ports node-name:port-id
Repeat this step for all VLAN and physical ports.
-
Remove any VLAN ports using cluster ports as member ports and interface groups using cluster ports as member ports.
-
Delete VLAN ports:
network port vlan delete -node node-name -vlan-name portid-vlandid
For example:
network port vlan delete -node node1 -vlan-name e1c-80
-
Remove physical ports from the interface groups:
network port ifgrp remove-port -node node-name -ifgrp interface-group-name -port portid
For example:
network port ifgrp remove-port -node node1 -ifgrp a1a -port e0d
-
Remove VLAN and interface group ports from broadcast domain:
network port broadcast-domain remove-ports -ipspace ipspace -broadcast-domain broadcast-domain-name -ports nodename:portname,nodename:portname,..
-
Modify interface group ports to use other physical ports as member as needed.:
ifgrp add-port -node node-name -ifgrp interface-group-name -port port-id
-
-
Halt the nodes:
halt -inhibit-takeover true -node node-name
This step must be performed on both nodes.
Setting up the new controllers
You must rack and cable the new controllers.
-
Plan out the positioning of the new controller modules and storage shelves as needed.
The rack space depends on the platform model of the controller modules, the switch types, and the number of storage shelves in your configuration.
-
Properly ground yourself.
-
Install the controller modules in the rack or cabinet.
-
If the new controller modules did not come with FC-VI cards of their own and if FC-VI cards from old controllers are compatible on new controllers, swap FC-VI cards and install those in correct slots.
See the NetApp Hardware Universe for slot info for FC-VI cards.
-
Cable the controllers' power, serial console and management connections as described in the MetroCluster Installation and Configuration Guides.
Do not connect any other cables that were disconnected from old controllers at this time.
-
Power up the new nodes and press Ctrl-C when prompted to display the LOADER prompt.
Netbooting the new controllers
After you install the new nodes, you need to netboot to ensure the new nodes are running the same version of ONTAP as the original nodes. The term netboot means you are booting from an ONTAP image stored on a remote server. When preparing for netboot, you must put a copy of the ONTAP 9 boot image onto a web server that the system can access.
This task is performed on each of the new controller modules.
-
Access the NetApp Support Site to download the files used for performing the netboot of the system.
-
Download the appropriate ONTAP software from the software download section of the NetApp Support Site and store the ontap-version_image.tgz file on a web-accessible directory.
-
Go to the web-accessible directory and verify that the files you need are available.
If the platform model is…
Then…
FAS/AFF8000 series systems
Extract the contents of the ontap-version_image.tgzfile to the target directory: tar -zxvf ontap-version_image.tgz
NOTE: If you are extracting the contents on Windows, use 7-Zip or WinRAR to extract the netboot image.
Your directory listing should contain a netboot folder with a kernel file:netboot/kernel
All other systems
Your directory listing should contain a netboot folder with a kernel file: ontap-version_image.tgz
You do not need to extract the ontap-version_image.tgz file.
-
At the LOADER prompt, configure the netboot connection for a management LIF:
-
If IP addressing is DHCP, configure the automatic connection:
ifconfig e0M -auto
-
If IP addressing is static, configure the manual connection:
ifconfig e0M -addr=ip_addr -mask=netmask
-gw=gateway
-
-
Perform the netboot.
-
If the platform is an 80xx series system, use this command:
netboot http://web_server_ip/path_to_web-accessible_directory/netboot/kernel
-
If the platform is any other system, use the following command:
netboot http://web_server_ip/path_to_web-accessible_directory/ontap-version_image.tgz
-
-
From the boot menu, select option (7) Install new software first to download and install the new software image to the boot device.
Disregard the following message: "This procedure is not supported for Non-Disruptive Upgrade on an HA pair". It applies to nondisruptive upgrades of software, not to upgrades of controllers.
-
If you are prompted to continue the procedure, enter
y
, and when prompted for the package, enter the URL of the image file:http://web_server_ip/path_to_web-accessible_directory/ontap-version_image.tgz
Enter username/password if applicable, or press Enter to continue.
-
Be sure to enter
n
to skip the backup recovery when you see a prompt similar to the following:Do you want to restore the backup configuration now? {y|n}
-
Reboot by entering
y
when you see a prompt similar to the following:The node must be rebooted to start using the newly installed software. Do you want to reboot now? {y|n}
Clearing the configuration on a controller module
Before using a new controller module in the MetroCluster configuration, you must clear the existing configuration.
-
If necessary, halt the node to display the LOADER prompt:
halt
-
At the LOADER prompt, set the environmental variables to default values:
set-defaults
-
Save the environment:
saveenv
-
At the LOADER prompt, launch the boot menu:
boot_ontap menu
-
At the boot menu prompt, clear the configuration:
wipeconfig
Respond
yes
to the confirmation prompt.The node reboots and the boot menu is displayed again.
-
At the boot menu, select option 5 to boot the system into Maintenance mode.
Respond
yes
to the confirmation prompt.
Restoring the HBA configuration
Depending on the presence and configuration of HBA cards in the controller module, you need to configure them correctly for your site's usage.
-
In Maintenance mode configure the settings for any HBAs in the system:
-
Check the current settings of the ports:
ucadmin show
-
Update the port settings as needed.
If you have this type of HBA and desired mode…
Use this command…
CNA FC
ucadmin modify -m fc -t initiator adapter-name
CNA Ethernet
ucadmin modify -mode cna adapter-name
FC target
fcadmin config -t target adapter-name
FC initiator
fcadmin config -t initiator adapter-name
-
-
Exit Maintenance mode:
halt
After you run the command, wait until the node stops at the LOADER prompt.
-
Boot the node back into Maintenance mode to enable the configuration changes to take effect:
boot_ontap maint
-
Verify the changes you made:
If you have this type of HBA…
Use this command…
CNA
ucadmin show
FC
fcadmin show
Reassigning root aggregate disks
Reassign the root aggregate disks to the new controller module, using the sysids
gathered earlier
This task is performed in Maintenance mode.
The old system IDs were identified in Gathering information before the upgrade.
The examples in this procedure use controllers with the following system IDs:
Node |
Old system ID |
New system ID |
---|---|---|
node_B_1 |
4068741254 |
1574774970 |
-
Cable all other connections to the new controller modules (FC-VI, storage, cluster interconnect, etc.).
-
Halt the system and boot to Maintenance mode from the LOADER prompt:
boot_ontap maint
-
Display the disks owned by node_B_1-old:
disk show -a
The command output shows the system ID of the new controller module (1574774970). However, the root aggregate disks are still owned by the old system ID (4068741254). This example does not show drives owned by other nodes in the MetroCluster configuration.
*> disk show -a Local System ID: 1574774970 DISK OWNER POOL SERIAL NUMBER HOME DR HOME ------------ ------------- ----- ------------- ------------- ------------- ... rr18:9.126L44 node_B_1-old(4068741254) Pool1 PZHYN0MD node_B_1-old(4068741254) node_B_1-old(4068741254) rr18:9.126L49 node_B_1-old(4068741254) Pool1 PPG3J5HA node_B_1-old(4068741254) node_B_1-old(4068741254) rr18:8.126L21 node_B_1-old(4068741254) Pool1 PZHTDSZD node_B_1-old(4068741254) node_B_1-old(4068741254) rr18:8.126L2 node_B_1-old(4068741254) Pool0 S0M1J2CF node_B_1-old(4068741254) node_B_1-old(4068741254) rr18:8.126L3 node_B_1-old(4068741254) Pool0 S0M0CQM5 node_B_1-old(4068741254) node_B_1-old(4068741254) rr18:9.126L27 node_B_1-old(4068741254) Pool0 S0M1PSDW node_B_1-old(4068741254) node_B_1-old(4068741254) ...
-
Reassign the root aggregate disks on the drive shelves to the new controller:
disk reassign -s old-sysid -d new-sysid
The following example shows reassignment of drives:
*> disk reassign -s 4068741254 -d 1574774970 Partner node must not be in Takeover mode during disk reassignment from maintenance mode. Serious problems could result!! Do not proceed with reassignment if the partner is in takeover mode. Abort reassignment (y/n)? n After the node becomes operational, you must perform a takeover and giveback of the HA partner node to ensure disk reassignment is successful. Do you want to continue (y/n)? Jul 14 19:23:49 [localhost:config.bridge.extra.port:error]: Both FC ports of FC-to-SAS bridge rtp-fc02-41-rr18:9.126L0 S/N [FB7500N107692] are attached to this controller. y Disk ownership will be updated on all disks previously belonging to Filer with sysid 4068741254. Do you want to continue (y/n)? y
-
Check that all disks are reassigned as expected:
disk show
*> disk show Local System ID: 1574774970 DISK OWNER POOL SERIAL NUMBER HOME DR HOME ------------ ------------- ----- ------------- ------------- ------------- rr18:8.126L18 node_B_1-new(1574774970) Pool1 PZHYN0MD node_B_1-new(1574774970) node_B_1-new(1574774970) rr18:9.126L49 node_B_1-new(1574774970) Pool1 PPG3J5HA node_B_1-new(1574774970) node_B_1-new(1574774970) rr18:8.126L21 node_B_1-new(1574774970) Pool1 PZHTDSZD node_B_1-new(1574774970) node_B_1-new(1574774970) rr18:8.126L2 node_B_1-new(1574774970) Pool0 S0M1J2CF node_B_1-new(1574774970) node_B_1-new(1574774970) rr18:9.126L29 node_B_1-new(1574774970) Pool0 S0M0CQM5 node_B_1-new(1574774970) node_B_1-new(1574774970) rr18:8.126L1 node_B_1-new(1574774970) Pool0 S0M1PSDW node_B_1-new(1574774970) node_B_1-new(1574774970) *>
-
Display the aggregate status:
aggr status
*> aggr status Aggr State Status Options aggr0_node_b_1-root online raid_dp, aggr root, nosnap=on, mirrored mirror_resync_priority=high(fixed) fast zeroed 64-bit
-
Repeat the above steps on the partner node (node_B_2-new).
Booting up the new controllers
You must reboot the controllers from the boot menu to update the controller flash image. Additional steps are required if encryption is configured.
You can reconfigure VLANs and interface groups. If required, manually modify the ports for the cluster LIFs and broadcast domain details before resuming the operation by using the system controller replace resume
command.
This task must be performed on all the new controllers.
-
Halt the node:
halt
-
If external key manager is configured, set the related bootargs:
setenv bootarg.kmip.init.ipaddr ip-address
setenv bootarg.kmip.init.netmask netmask
setenv bootarg.kmip.init.gateway gateway-address
setenv bootarg.kmip.init.interface interface-id
-
Display the boot menu:
boot_ontap menu
-
If root encryption is used, select the boot menu option for your key management configuration.
If you are using…
Select this boot menu option…
Onboard key management
Option “10”
Follow the prompts to provide the required inputs to recover and restore the key-manager configuration.
External key management
Option “11”
Follow the prompts to provide the required inputs to recover and restore the key-manager configuration.
-
If autoboot is enabled, interrupt autoboot by pressing Ctrl-C.
-
From the boot menu, run option “6”.
Option “6” will reboot the node twice before completing. Respond “y” to the system id change prompts. Wait for the second reboot messages:
Successfully restored env file from boot media... Rebooting to load the restored env file...
-
Double-check that the partner-sysid is correct:
printenv partner-sysid
If the partner-sysid is not correct, set it:
setenv partner-sysid partner-sysID
-
If root encryption is used, select the boot menu option again for your key management configuration.
If you are using…
Select this boot menu option…
Onboard key management
Option “10”
Follow the prompts to provide the required inputs to recover and restore the key-manager configuration.
External key management
Option “11”
Follow the prompts to provide the required inputs to recover and restore the key-manager configuration.
Depending on the key manager setting, perform the recovery procedure by selecting option “10” or option “11”, followed by option “6” at the first boot menu prompt. To boot the nodes completely, you might need to repeat the recovery procedure continued by option “1” (normal boot).
-
Boot the nodes:
boot_ontap
-
Wait for the replaced nodes to boot up.
If either node is in takeover mode, perform a giveback using the
storage failover giveback
command. -
Verify that all ports are in a broadcast domain:
-
View the broadcast domains:
network port broadcast-domain show
-
Add any ports to a broadcast domain as needed.
-
Add the physical port that will host the intercluster LIFs to the corresponding broadcast domain.
-
Modify intercluster LIFs to use the new physical port as home port.
-
After the intercluster LIFs are up, check the cluster peer status and re-establish cluster peering as needed.
You may need to reconfigure cluster peering.
-
Recreate VLANs and interface groups as needed.
VLAN and interface group membership might be different than that of the old node.
-
Verify that the partner cluster is reachable and that the configuration is successfully resynchronized on the partner cluster:
metrocluster switchback -simulate true
-
-
If encryption is used, restore the keys using the correct command for your key management configuration.
If you are using…
Use this command…
Onboard key management
security key-manager onboard sync
For more information, see Restoring onboard key management encryption keys.
External key management
security key-manager external restore -vserver SVM -node node -key-server host_name|IP_address:port -key-id key_id -key-tag key_tag node-name
For more information, see Restoring external key management encryption keys.
-
Before you resume the operation, verify that the MetroCluster is configured correctly. Check the node status:
metrocluster node show
Verify that the new nodes (site_B) are in Waiting for switchback state from site_A.
-
Resume the operation:
system controller replace resume
Completing the upgrade
The automation operation runs verification system checks and then pauses so you can verify the network reachability. After verification, the resource regain phase is initiated and the automation operation executes switchback at site A and pauses at the post upgrade checks. After you resume the automation operation, it performs the post upgrade checks and if no errors are detected, marks the upgrade as complete.
-
Verify the network reachability by following the console message.
-
After you complete the verification, resume the operation:
system controller replace resume
-
The automation operation performs switchback at site A and the post upgrade checks. When the operation pauses, manually check the SAN LIF status and verify the network configuration by following the console message.
-
After you complete the verification, resume the operation:
system controller replace resume
-
Check the post upgrade checks status:
system controller replace show
If the post upgrade checks did not report any errors, the upgrade is complete.
-
After you complete the controller upgrade, log in at site B and verify that the replaced controllers are configured correctly.
Restoring Tiebreaker monitoring
If the MetroCluster configuration was previously configured for monitoring by the Tiebreaker software, you can restore the Tiebreaker connection.
-
Use the steps in Adding MetroCluster configurations.