Completing recovery
Perform the required tasks to complete the recovery from a multi-controller or storage failure.
Reestablishing object stores for FabricPool configurations
If one of the object stores in a FabricPool mirror was co-located with the MetroCluster disaster site and was destroyed, you must reestablish the object store and the FabricPool mirror.
-
If the object stores are remote and a MetroCluster site is destroyed, you do not need to rebuild the object store; the original object store configurations as well as cold data contents are retained.
-
For more information about FabricPool configurations, see the Disk and aggregates management documentation.
-
Follow the procedure "Replacing a FabricPool mirror on a MetroCluster configuration" in the Disk and aggregates management documentation.
Verifying licenses on the replaced nodes
You must install new licenses for the replacement nodes if the impaired nodes were using ONTAP features that require a standard (node-locked) license. For features with standard licenses, each node in the cluster should have its own key for the feature.
Until you install license keys, features requiring standard licenses continue to be available to the replacement node. However, if the impaired node was the only node in the cluster with a license for the feature, no configuration changes to the feature are allowed. Also, using unlicensed features on the node might put you out of compliance with your license agreement, so you should install the replacement license key or keys on the replacement node as soon as possible.
The license keys must be in the 28-character format.
You have a 90-day grace period in which to install the license keys. After the grace period, all old licenses are invalidated. After a valid license key is installed, you have 24 hours to install all of the keys before the grace period ends.
If all nodes at a site have been replaced (a single node in the case of a two-node MetroCluster configuration), license keys must be installed on the replacement node or nodes prior to switchback.
-
Identify the licenses on the node:
license show
The following example displays the information about licenses in the system:
cluster_B::> license show
  (system license show)
Serial Number: 1-80-00050
Owner: site1-01
Package           Type    Description            Expiration
----------------- ------- ---------------------- -----------
Base              license Cluster Base License   -
NFS               site    NFS License            -
CIFS              site    CIFS License           -
iSCSI             site    iSCSI License          -
FCP               site    FCP License            -
FlexClone         site    FlexClone License      -
6 entries were displayed.
-
Verify that the licenses are good for the node after switchback:
metrocluster check license show
The following example displays the licenses that are good for the node:
cluster_B::> metrocluster check license show
Cluster           Check                        Result
----------------- ---------------------------- --------------
Cluster_B         negotiated-switchover-ready  not-applicable
NFS               switchback-ready             not-applicable
CIFS              job-schedules                ok
iSCSI             licenses                     ok
FCP               periodic-check-enabled       ok
-
If you need new license keys, obtain replacement license keys on the NetApp Support Site in the My Support section under Software licenses.
The new license keys that you require are automatically generated and sent to the email address on file. If you fail to receive the email with the license keys within 30 days, refer to the "Who to contact if I have issues with my Licenses?" section in the Knowledge Base article Post Motherboard Replacement Process to update Licensing on an AFF/FAS system.
-
Install each license key:
system license add -license-code license-key, license-key…
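For example, a minimal sketch with placeholder values rather than real keys (substitute the 28-character keys you received by email):
cluster_B::> system license add -license-code <license-key-1>,<license-key-2>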
-
Remove the old licenses, if desired:
-
Check for unused licenses:
license clean-up -unused -simulate
-
If the list looks correct, remove the unused licenses:
license clean-up -unused
-
Restoring key management
If data volumes are encrypted, you must restore key management. If the root volume is encrypted, you must recover key management.
-
If data volumes are encrypted, restore the keys using the correct command for your key management configuration.
If you are using onboard key management, use this command:
security key-manager onboard sync
For more information, see Restoring onboard key management encryption keys.
If you are using external key management, use this command:
security key-manager key query -node node-name
For more information, see Restoring external key management encryption keys.
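As an illustrative sketch for onboard key management only (the node name is a placeholder), you might synchronize the keys and then confirm that they are present; the sync command prompts for the onboard key manager cluster-wide passphrase:
cluster_B::> security key-manager onboard sync

cluster_B::> security key-manager key query -node node_B_1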
-
If the root volume is encrypted, use the procedure in Recovering key management if the root volume is encrypted.
Performing a switchback
After you heal the MetroCluster configuration, you can perform the MetroCluster switchback operation. The MetroCluster switchback operation returns the configuration to its normal operating state, with the sync-source storage virtual machines (SVMs) on the disaster site active and serving data from the local disk pools.
-
The disaster cluster must have successfully switched over to the surviving cluster.
-
Healing must have been performed on the data and root aggregates.
-
The surviving cluster nodes must not be in the HA failover state (all nodes must be up and running for each HA pair).
-
The disaster site controller modules must be completely booted and not in the HA takeover mode.
-
The root aggregate must be mirrored.
-
The Inter-Switch Links (ISLs) must be online.
-
Any required licenses must be installed on the system.
-
Confirm that all nodes are in the enabled state:
metrocluster node show
The following example displays the nodes that are in the enabled state:
cluster_B::> metrocluster node show
DR                               Configuration  DR
Group Cluster Node               State          Mirroring Mode
----- ------- ------------------ -------------- --------- --------------------
1     cluster_A
              node_A_1           configured     enabled   heal roots completed
              node_A_2           configured     enabled   heal roots completed
      cluster_B
              node_B_1           configured     enabled   waiting for switchback recovery
              node_B_2           configured     enabled   waiting for switchback recovery
4 entries were displayed.
-
Confirm that resynchronization is complete on all SVMs:
metrocluster vserver show
-
Verify that any automatic LIF migrations being performed by the healing operations have been successfully completed:
metrocluster check lif show
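If this check reports LIF placement failures, you can attempt to correct them with the repair command; a hedged sketch in which the SVM name is only a placeholder:
cluster_B::> metrocluster check lif repair-placement -vserver <sync-destination-SVM-name>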
-
Perform the switchback by running the
metrocluster switchback
command from any node in the surviving cluster.
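If you want to preview the operation first, you can run the command with the -simulate parameter at the advanced privilege level, as the verification output later in this procedure also suggests; a minimal sketch:
cluster_B::> set -privilege advanced
cluster_B::*> metrocluster switchback -simulate
cluster_B::*> set -privilege admin
-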
Check the progress of the switchback operation:
metrocluster show
The switchback operation is still in progress when the output displays "waiting-for-switchback":
cluster_B::> metrocluster show
Cluster                    Entry Name           State
-------------------------  -------------------  -----------
 Local: cluster_B          Configuration state  configured
                           Mode                 switchover
                           AUSO Failure Domain  -
Remote: cluster_A          Configuration state  configured
                           Mode                 waiting-for-switchback
                           AUSO Failure Domain  -
The switchback operation is complete when the output displays "normal":
cluster_B::> metrocluster show
Cluster                    Entry Name           State
-------------------------  -------------------  -----------
 Local: cluster_B          Configuration state  configured
                           Mode                 normal
                           AUSO Failure Domain  -
Remote: cluster_A          Configuration state  configured
                           Mode                 normal
                           AUSO Failure Domain  -
If a switchback takes a long time to finish, you can check on the status of in-progress baselines by using the following command at the advanced privilege level:
metrocluster config-replication resync-status show
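A minimal sketch of running this check, assuming you start from the admin privilege level:
cluster_B::> set -privilege advanced
cluster_B::*> metrocluster config-replication resync-status show
cluster_B::*> set -privilege admin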
-
Reestablish any SnapMirror or SnapVault configurations.
In ONTAP 8.3, you need to manually reestablish a lost SnapMirror configuration after a MetroCluster switchback operation. In ONTAP 9.0 and later, the relationship is reestablished automatically.
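As a hedged sketch only (the SVM and volume names are placeholders, and the exact steps depend on which relationships were lost), re-creating and resynchronizing a SnapMirror relationship might look like this:
cluster_B::> snapmirror create -source-path <src_svm:src_vol> -destination-path <dst_svm:dst_vol>
cluster_B::> snapmirror resync -destination-path <dst_svm:dst_vol>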
Verifying a successful switchback
After performing the switchback, you should confirm that all aggregates and storage virtual machines (SVMs) are switched back and online.
-
Verify that the switched-over data aggregates are switched back:
storage aggregate show
In the following example, aggr_b2 on node B2 has switched back:
node_B_1::> storage aggregate show
Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
...
aggr_b2    227.1GB   227.1GB    0% online       0 node_B_2         raid_dp,
                                                                   mirrored,
                                                                   normal

node_A_1::> aggr show
Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
...
aggr_b2          -         -     - unknown      - node_A_1
If the disaster site included unmirrored aggregates and the unmirrored aggregates are no longer present, the aggregate might show up with a state of “unknown” in the output of the storage aggregate show command. Contact technical support to remove the out-of-date entries for the unmirrored aggregates; see the Knowledge Base article How to remove stale unmirrored aggregate entries in a MetroCluster following disaster where storage was lost.
-
Verify that all sync-destination SVMs on the surviving cluster are dormant (showing an Admin State of “stopped”) and the sync-source SVMs on the disaster cluster are up and running:
vserver show -subtype sync-source
node_B_1::> vserver show -subtype sync-source
                                            Admin      Root                  Name    Name
Vserver            Type    Subtype          State      Volume     Aggregate  Service Mapping
-----------        ------- ---------------- ---------- ---------- ---------- ------- -------
...
vs1a               data    sync-source      running    vs1a_vol   node_B_2   file    file
                                                                  aggr_b2

node_A_1::> vserver show -subtype sync-destination
                                            Admin      Root                  Name    Name
Vserver            Type    Subtype          State      Volume     Aggregate  Service Mapping
-----------        ------- ---------------- ---------- ---------- ---------- ------- -------
...
cluster_A-vs1a-mc  data    sync-destination stopped    vs1a_vol   sosb_      file    file
                                                                  aggr_b2
Sync-destination aggregates in the MetroCluster configuration have the suffix "-mc" automatically appended to their name to help identify them.
-
Confirm that the switchback operations succeeded by using the
metrocluster operation show
command.
If the command output shows that the switchback operation state is successful, then the switchback process is complete and you can proceed with operation of the system.
If the command output shows that the switchback operation or switchback-continuation-agent operation is partially successful, then perform the suggested fix provided in the output of the metrocluster operation show command.
You must repeat the previous sections to perform the switchback in the opposite direction. If site_A did a switchover of site_B, have site_B do a switchover of site_A.
Mirroring the root aggregates of the replacement nodes
If disks were replaced, you must mirror the root aggregates of the new nodes on the disaster site.
-
On the disaster site, identify the aggregates which are not mirrored:
storage aggregate show
cluster_A::> storage aggregate show
Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
node_A_1_aggr0
            1.49TB   74.12GB   95% online       1 node_A_1         raid4,
                                                                   normal
node_A_2_aggr0
            1.49TB   74.12GB   95% online       1 node_A_2         raid4,
                                                                   normal
node_A_1_aggr1
            1.49TB   74.12GB   95% online       1 node_A_1         raid4,
                                                                   normal
                                                                   mirrored
node_A_2_aggr1
            1.49TB   74.12GB   95% online       1 node_A_2         raid4,
                                                                   normal
                                                                   mirrored
4 entries were displayed.

cluster_A::>
-
Mirror one of the root aggregates:
storage aggregate mirror -aggregate root-aggregate
The following example shows how the command selects disks and prompts for confirmation when mirroring the aggregate.
cluster_A::> storage aggregate mirror -aggregate node_A_2_aggr0

Info: Disks would be added to aggregate "node_A_2_aggr0" on node "node_A_2" in
      the following manner:

      Second Plex

        RAID Group rg0, 3 disks (block checksum, raid4)
          Position   Disk                      Type       Size
          ---------- ------------------------- ---------- ---------------
          parity     2.10.0                    SSD        -
          data       1.11.19                   SSD        894.0GB
          data       2.10.2                    SSD        894.0GB

      Aggregate capacity available for volume use would be 1.49TB.

Do you want to continue? {y|n}: y
cluster_A::>
-
Verify that mirroring of the root aggregate is complete:
storage aggregate show
The following example shows that the root aggregates are mirrored.
cluster_A::> storage aggregate show
Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
node_A_1_aggr0
            1.49TB   74.12GB   95% online       1 node_A_1         raid4,
                                                                   mirrored,
                                                                   normal
node_A_2_aggr0
            2.24TB   838.5GB   63% online       1 node_A_2         raid4,
                                                                   mirrored,
                                                                   normal
node_A_1_aggr1
            1.49TB   74.12GB   95% online       1 node_A_1         raid4,
                                                                   mirrored,
                                                                   normal
node_A_2_aggr1
            1.49TB   74.12GB   95% online       1 node_A_2         raid4,
                                                                   mirrored,
                                                                   normal
4 entries were displayed.

cluster_A::>
-
Repeat these steps for the other root aggregates.
Any root aggregate that does not have a status of mirrored must be mirrored.
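For example, continuing the example output shown earlier, you would mirror the remaining root aggregate with the same command:
cluster_A::> storage aggregate mirror -aggregate node_A_1_aggr0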
Reconfiguring the ONTAP Mediator service (MetroCluster IP configurations)
If you have a MetroCluster IP configuration that was configured with the ONTAP Mediator service, you must remove and reconfigure the association with the mediator.
-
You must have the IP address, username, and password for the ONTAP Mediator service.
-
The ONTAP Mediator service must be configured and operating on the Linux host.
-
Remove the existing ONTAP Mediator configuration:
metrocluster configuration-settings mediator remove
-
Reconfigure the ONTAP Mediator configuration:
metrocluster configuration-settings mediator add -mediator-address mediator-IP-address
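A hedged example using the Mediator address that appears later in this procedure's sample output; the command prompts for the ONTAP Mediator username and password:
cluster_A::> metrocluster configuration-settings mediator add -mediator-address 172.16.254.1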
Verifying the health of the MetroCluster configuration
You should check the health of the MetroCluster configuration to verify proper operation.
-
Check that the MetroCluster is configured and in normal mode on each cluster:
metrocluster show
cluster_A::> metrocluster show
Cluster                    Entry Name           State
-------------------------  -------------------  -----------
 Local: cluster_A          Configuration state  configured
                           Mode                 normal
                           AUSO Failure Domain  auso-on-cluster-disaster
Remote: cluster_B          Configuration state  configured
                           Mode                 normal
                           AUSO Failure Domain  auso-on-cluster-disaster
-
Check that mirroring is enabled on each node:
metrocluster node show
cluster_A::> metrocluster node show
DR                               Configuration  DR
Group Cluster Node               State          Mirroring Mode
----- ------- ------------------ -------------- --------- --------------------
1     cluster_A
              node_A_1           configured     enabled   normal
      cluster_B
              node_B_1           configured     enabled   normal
2 entries were displayed.
-
Check that the MetroCluster components are healthy:
metrocluster check run
cluster_A::> metrocluster check run
Last Checked On: 10/1/2014 16:03:37

Component           Result
------------------- ---------
nodes               ok
lifs                ok
config-replication  ok
aggregates          ok
4 entries were displayed.

Command completed. Use the `metrocluster check show -instance` command or
sub-commands in `metrocluster check` directory for detailed results. To check
if the nodes are ready to do a switchover or switchback operation, run
`metrocluster switchover -simulate` or `metrocluster switchback -simulate`,
respectively.
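For component-level detail, you can follow up with the commands that the output suggests; a brief sketch:
cluster_A::> metrocluster check show
cluster_A::> metrocluster check aggregate show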
-
Check that there are no health alerts:
system health alert show
-
Simulate a switchover operation:
-
From any node's prompt, change to the advanced privilege level:
set -privilege advanced
You need to respond with
y
when prompted to continue into advanced mode and see the advanced mode prompt (*>).
-
Perform the switchover operation with the
-simulate
parameter:
metrocluster switchover -simulate
-
Return to the admin privilege level:
set -privilege admin
-
-
For MetroCluster IP configurations using the ONTAP Mediator service, confirm that the Mediator service is up and operating.
-
Check that the Mediator disks are visible to the system:
storage failover mailbox-disk show
The following example shows that the mailbox disks have been recognized.
node_A_1::*> storage failover mailbox-disk show
                    Mailbox
Node                Owner     Disk Name     Disk UUID
------------------- ------    ------------- ------------------------------------
sti113-vsim-ucs626g
.
.
                    local     0m.i2.3L26    7BBA77C9:AD702D14:831B3E7E:0B0730EE:00000000:00000000:00000000:00000000:00000000:00000000
                    local     0m.i2.3L27    928F79AE:631EA9F9:4DCB5DE6:3402AC48:00000000:00000000:00000000:00000000:00000000:00000000
                    local     0m.i1.0L60    B7BCDB3C:297A4459:318C2748:181565A3:00000000:00000000:00000000:00000000:00000000:00000000
.
.
.
                    partner   0m.i1.0L14    EA71F260:D4DD5F22:E3422387:61D475B2:00000000:00000000:00000000:00000000:00000000:00000000
                    partner   0m.i2.3L64    4460F436:AAE5AB9E:D1ED414E:ABF811F7:00000000:00000000:00000000:00000000:00000000:00000000
28 entries were displayed.
-
Change to the advanced privilege level:
set -privilege advanced
-
Check that the mailbox LUNs are visible to the system:
storage iscsi-initiator show
The output will show the presence of the mailbox LUNs:
Node      Type     Label     Target Portal   Target Name                                      Admin/Op
--------- -------- --------- --------------- ------------------------------------------------ --------
.
.
.
node_A_1  mailbox  mediator  172.16.254.1    iqn.2012-05.local:mailbox.target.db5f02d6-e3d3   up/up
.
.
.
17 entries were displayed.
-
Return to the administrative privilege level:
set -privilege admin