Completing recovery
Perform the required tasks to complete the recovery from a multi-controller or storage failure.
Reestablishing object stores for FabricPool configurations
If one of the object stores in a FabricPool mirror was co-located with the MetroCluster disaster site and was destroyed, you must reestablish the object store and the FabricPool mirror.
-
If the object stores are remote and a MetroCluster site is destroyed, you do not need to rebuild the object store; the original object store configurations as well as cold data contents are retained.
-
For more information about FabricPool configurations, see the Disk and aggregates management documentation.
-
Follow the procedure "Replacing a FabricPool mirror on a MetroCluster configuration" in the Disk and aggregates management documentation.
Verifying licenses on the replaced nodes
You must install new licenses for the replacement nodes if the impaired nodes were using ONTAP features that require a standard (node-locked) license. For features with standard licenses, each node in the cluster should have its own key for the feature.
Until you install license keys, features requiring standard licenses continue to be available to the replacement node. However, if the impaired node was the only node in the cluster with a license for the feature, no configuration changes to the feature are allowed. Also, using unlicensed features on the node might put you out of compliance with your license agreement, so you should install the replacement license key or keys on the replacement node as soon as possible.
The license keys must be in the 28-character format.
You have a 90-day grace period in which to install the license keys. After the grace period, all old licenses are invalidated. After a valid license key is installed, you have 24 hours to install all of the keys before the grace period ends.
If all nodes at a site have been replaced (a single node in the case of a two-node MetroCluster configuration), license keys must be installed on the replacement node or nodes prior to switchback.
-
Identify the licenses on the node:
license show
The following example displays the information about licenses in the system:
cluster_B::> license show
  (system license show)
Serial Number: 1-80-00050
Owner: site1-01
Package           Type    Description            Expiration
----------------- ------- ---------------------- -----------
Base              license Cluster Base License   -
NFS               site    NFS License            -
CIFS              site    CIFS License           -
iSCSI             site    iSCSI License          -
FCP               site    FCP License            -
FlexClone         site    FlexClone License      -
6 entries were displayed.
-
Verify that the licenses are good for the node after switchback:
metrocluster check license show
The following example displays the licenses that are good for the node:
cluster_B::> metrocluster check license show
Cluster           Check                        Result
----------------- ---------------------------- --------------
Cluster_B         negotiated-switchover-ready  not-applicable
NFS               switchback-ready             not-applicable
CIFS              job-schedules                ok
iSCSI             licenses                     ok
FCP               periodic-check-enabled       ok
-
If you need new license keys, obtain replacement license keys on the NetApp Support Site in the My Support section under Software licenses.
The new license keys that you require are automatically generated and sent to the email address on file. If you fail to receive the email with the license keys within 30 days, refer to the "Who to contact if I have issues with my Licenses?" section in the Knowledge Base article Post Motherboard Replacement Process to update Licensing on an AFF/FAS system.
-
Install each license key:
system license add -license-code license-key, license-key…
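For example, a minimal sketch with placeholder values rather than real keys (substitute the 28-character keys you received by email):
cluster_B::> system license add -license-code <license-key-1>,<license-key-2>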
-
Remove the old licenses, if desired:
-
Check for unused licenses:
license clean-up -unused -simulate
-
If the list looks correct, remove the unused licenses:
license clean-up -unused
-
Restoring key management
If data volumes are encrypted, you must restore key management. If the root volume is encrypted, you must recover key management.
-
If data volumes are encrypted, restore the keys using the correct command for your key management configuration.
If you are using onboard key management, use this command:
security key-manager onboard sync
For more information, see Restoring onboard key management encryption keys.
If you are using external key management, use this command:
security key-manager key query -node node-name
For more information, see Restoring external key management encryption keys.
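As an illustrative sketch for onboard key management only (the node name is a placeholder), you might synchronize the keys and then confirm that they are present; the sync command prompts for the onboard key manager cluster-wide passphrase:
cluster_B::> security key-manager onboard sync

cluster_B::> security key-manager key query -node node_B_1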
-
If the root volume is encrypted, use the procedure in Recovering key management if the root volume is encrypted.
Performing a switchback
After you heal the MetroCluster configuration, you can perform the MetroCluster switchback operation. The MetroCluster switchback operation returns the configuration to its normal operating state, with the sync-source storage virtual machines (SVMs) on the disaster site active and serving data from the local disk pools.
-
The disaster cluster must have successfully switched over to the surviving cluster.
-
Healing must have been performed on the data and root aggregates.
-
The surviving cluster nodes must not be in the HA failover state (all nodes must be up and running for each HA pair).
-
The disaster site controller modules must be completely booted and not in the HA takeover mode.
-
The root aggregate must be mirrored.
-
The Inter-Switch Links (ISLs) must be online.
-
Any required licenses must be installed on the system.
-
Confirm that all nodes are in the enabled state:
metrocluster node show
The following example displays the nodes that are in the enabled state:
cluster_B::> metrocluster node show
DR                               Configuration  DR
Group Cluster Node               State          Mirroring Mode
----- ------- ------------------ -------------- --------- --------------------
1     cluster_A
              node_A_1           configured     enabled   heal roots completed
              node_A_2           configured     enabled   heal roots completed
      cluster_B
              node_B_1           configured     enabled   waiting for switchback recovery
              node_B_2           configured     enabled   waiting for switchback recovery
4 entries were displayed.
-
Confirm that resynchronization is complete on all SVMs:
metrocluster vserver show
-
Verify that any automatic LIF migrations being performed by the healing operations have been successfully completed:
metrocluster check lif show
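If this check reports LIF placement failures, you can attempt to correct them with the repair command; a hedged sketch in which the SVM name is only a placeholder:
cluster_B::> metrocluster check lif repair-placement -vserver <sync-destination-SVM-name>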
-
Perform the switchback by running the
metrocluster switchback
command from any node in the surviving cluster.
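If you want to preview the operation first, you can run the command with the -simulate parameter at the advanced privilege level, as the verification output later in this procedure also suggests; a minimal sketch:
cluster_B::> set -privilege advanced
cluster_B::*> metrocluster switchback -simulate
cluster_B::*> set -privilege admin
-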
Check the progress of the switchback operation:
metrocluster show
The switchback operation is still in progress when the output displays "waiting-for-switchback":
cluster_B::> metrocluster show
Cluster                    Entry Name           State
-------------------------  -------------------  -----------
 Local: cluster_B          Configuration state  configured
                           Mode                 switchover
                           AUSO Failure Domain  -
Remote: cluster_A          Configuration state  configured
                           Mode                 waiting-for-switchback
                           AUSO Failure Domain  -
The switchback operation is complete when the output displays "normal":
cluster_B::> metrocluster show
Cluster                    Entry Name           State
-------------------------  -------------------  -----------
 Local: cluster_B          Configuration state  configured
                           Mode                 normal
                           AUSO Failure Domain  -
Remote: cluster_A          Configuration state  configured
                           Mode                 normal
                           AUSO Failure Domain  -
If a switchback takes a long time to finish, you can check on the status of in-progress baselines by using the following command at the advanced privilege level:
metrocluster config-replication resync-status show
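A minimal sketch of running this check, assuming you start from the admin privilege level:
cluster_B::> set -privilege advanced
cluster_B::*> metrocluster config-replication resync-status show
cluster_B::*> set -privilege admin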
-
Reestablish any SnapMirror or SnapVault configurations.
In ONTAP 8.3, you need to manually reestablish a lost SnapMirror configuration after a MetroCluster switchback operation. In ONTAP 9.0 and later, the relationship is reestablished automatically.
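As a hedged sketch only (the SVM and volume names are placeholders, and the exact steps depend on which relationships were lost), re-creating and resynchronizing a SnapMirror relationship might look like this:
cluster_B::> snapmirror create -source-path <src_svm:src_vol> -destination-path <dst_svm:dst_vol>
cluster_B::> snapmirror resync -destination-path <dst_svm:dst_vol>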
Verifying a successful switchback
After performing the switchback, you should confirm that all aggregates and storage virtual machines (SVMs) are switched back and online.
-
Verify that the switched-over data aggregates are switched back:
storage aggregate show
In the following example, aggr_b2 on node B2 has switched back:
node_B_1::> storage aggregate show
Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
...
aggr_b2    227.1GB   227.1GB    0% online       0 node_B_2         raid_dp,
                                                                   mirrored,
                                                                   normal

node_A_1::> aggr show
Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
...
aggr_b2          -         -     - unknown      - node_A_1
If the disaster site included unmirrored aggregates and the unmirrored aggregates are no longer present, the aggregate might show up with a state of “unknown” in the output of the storage aggregate show command. Contact technical support to remove the out-of-date entries for the unmirrored aggregates; see the Knowledge Base article How to remove stale unmirrored aggregate entries in a MetroCluster following disaster where storage was lost.
-
Verify that all sync-destination SVMs on the surviving cluster are dormant (showing an Admin State of “stopped”) and the sync-source SVMs on the disaster cluster are up and running:
vserver show -subtype sync-source
node_B_1::> vserver show -subtype sync-source
                                            Admin      Root                  Name    Name
Vserver            Type    Subtype          State      Volume     Aggregate  Service Mapping
-----------        ------- ---------------- ---------- ---------- ---------- ------- -------
...
vs1a               data    sync-source      running    vs1a_vol   node_B_2   file    file
                                                                  aggr_b2

node_A_1::> vserver show -subtype sync-destination
                                            Admin      Root                  Name    Name
Vserver            Type    Subtype          State      Volume     Aggregate  Service Mapping
-----------        ------- ---------------- ---------- ---------- ---------- ------- -------
...
cluster_A-vs1a-mc  data    sync-destination stopped    vs1a_vol   sosb_      file    file
                                                                  aggr_b2
Sync-destination aggregates in the MetroCluster configuration have the suffix "-mc" automatically appended to their name to help identify them.
-
Confirm that the switchback operations succeeded by using the
metrocluster operation show
command.
If the command output shows that the switchback operation state is successful, then the switchback process is complete and you can proceed with operation of the system.
If the command output shows that the switchback operation or switchback-continuation-agent operation is partially successful, then perform the suggested fix provided in the output of the metrocluster operation show command.
You must repeat the previous sections to perform the switchback in the opposite direction. If site_A did a switchover of site_B, have site_B do a switchover of site_A.
Mirroring the root aggregates of the replacement nodes
If disks were replaced, you must mirror the root aggregates of the new nodes on the disaster site.
-
On the disaster site, identify the aggregates which are not mirrored:
storage aggregate show
cluster_A::> storage aggregate show
Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
node_A_1_aggr0
            1.49TB   74.12GB   95% online       1 node_A_1         raid4,
                                                                   normal
node_A_2_aggr0
            1.49TB   74.12GB   95% online       1 node_A_2         raid4,
                                                                   normal
node_A_1_aggr1
            1.49TB   74.12GB   95% online       1 node_A_1         raid4,
                                                                   normal
                                                                   mirrored
node_A_2_aggr1
            1.49TB   74.12GB   95% online       1 node_A_2         raid4,
                                                                   normal
                                                                   mirrored
4 entries were displayed.

cluster_A::>
-
Mirror one of the root aggregates:
storage aggregate mirror -aggregate root-aggregate
The following example shows how the command selects disks and prompts for confirmation when mirroring the aggregate.
cluster_A::> storage aggregate mirror -aggregate node_A_2_aggr0

Info: Disks would be added to aggregate "node_A_2_aggr0" on node "node_A_2" in
      the following manner:

      Second Plex

        RAID Group rg0, 3 disks (block checksum, raid4)
          Position   Disk                      Type       Size
          ---------- ------------------------- ---------- ---------------
          parity     2.10.0                    SSD        -
          data       1.11.19                   SSD        894.0GB
          data       2.10.2                    SSD        894.0GB

      Aggregate capacity available for volume use would be 1.49TB.

Do you want to continue? {y|n}: y
cluster_A::>
-
Verify that mirroring of the root aggregate is complete:
storage aggregate show
The following example shows that the root aggregates are mirrored.
cluster_A::> storage aggregate show
Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
node_A_1_aggr0
            1.49TB   74.12GB   95% online       1 node_A_1         raid4,
                                                                   mirrored,
                                                                   normal
node_A_2_aggr0
            2.24TB   838.5GB   63% online       1 node_A_2         raid4,
                                                                   mirrored,
                                                                   normal
node_A_1_aggr1
            1.49TB   74.12GB   95% online       1 node_A_1         raid4,
                                                                   mirrored,
                                                                   normal
node_A_2_aggr1
            1.49TB   74.12GB   95% online       1 node_A_2         raid4,
                                                                   mirrored,
                                                                   normal
4 entries were displayed.

cluster_A::>
-
Repeat these steps for the other root aggregates.
Any root aggregate that does not have a status of mirrored must be mirrored.
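For example, continuing the example output shown earlier, you would mirror the remaining root aggregate with the same command:
cluster_A::> storage aggregate mirror -aggregate node_A_1_aggr0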
Reconfiguring the ONTAP Mediator service (MetroCluster IP configurations)
If you have a MetroCluster IP configuration that was configured with the ONTAP Mediator service, you must remove and reconfigure the association with the mediator.
-
You must have the IP address, username, and password for the ONTAP Mediator service.
-
The ONTAP Mediator service must be configured and operating on the Linux host.
-
Remove the existing ONTAP Mediator configuration:
metrocluster configuration-settings mediator remove
-
Reconfigure the ONTAP Mediator configuration:
metrocluster configuration-settings mediator add -mediator-address mediator-IP-address
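A hedged example using the Mediator address that appears later in this procedure's sample output; the command prompts for the ONTAP Mediator username and password:
cluster_A::> metrocluster configuration-settings mediator add -mediator-address 172.16.254.1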
Verifying the health of the MetroCluster configuration
You should check the health of the MetroCluster configuration to verify proper operation.
-
Check that the MetroCluster is configured and in normal mode on each cluster:
metrocluster show
cluster_A::> metrocluster show
Cluster                    Entry Name           State
-------------------------  -------------------  -----------
 Local: cluster_A          Configuration state  configured
                           Mode                 normal
                           AUSO Failure Domain  auso-on-cluster-disaster
Remote: cluster_B          Configuration state  configured
                           Mode                 normal
                           AUSO Failure Domain  auso-on-cluster-disaster
-
Check that mirroring is enabled on each node:
metrocluster node show
cluster_A::> metrocluster node show
DR                               Configuration  DR
Group Cluster Node               State          Mirroring Mode
----- ------- ------------------ -------------- --------- --------------------
1     cluster_A
              node_A_1           configured     enabled   normal
      cluster_B
              node_B_1           configured     enabled   normal
2 entries were displayed.
-
Check that the MetroCluster components are healthy:
metrocluster check run
cluster_A::> metrocluster check run
Last Checked On: 10/1/2014 16:03:37

Component           Result
------------------- ---------
nodes               ok
lifs                ok
config-replication  ok
aggregates          ok
4 entries were displayed.

Command completed. Use the `metrocluster check show -instance` command or
sub-commands in `metrocluster check` directory for detailed results. To check
if the nodes are ready to do a switchover or switchback operation, run
`metrocluster switchover -simulate` or `metrocluster switchback -simulate`,
respectively.
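For component-level detail, you can follow up with the commands that the output suggests; a brief sketch:
cluster_A::> metrocluster check show
cluster_A::> metrocluster check aggregate show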
-
Check that there are no health alerts:
system health alert show
-
Simulate a switchover operation:
-
From any node's prompt, change to the advanced privilege level:
set -privilege advanced
You need to respond with
y
when prompted to continue into advanced mode and see the advanced mode prompt (*>).
-
Perform the switchover operation with the
-simulate
parameter:
metrocluster switchover -simulate
-
Return to the admin privilege level:
set -privilege admin
-
-
For MetroCluster IP configurations using the ONTAP Mediator service, confirm that the Mediator service is up and operating.
-
Check that the Mediator disks are visible to the system:
storage failover mailbox-disk show
The following example shows that the mailbox disks have been recognized.
node_A_1::*> storage failover mailbox-disk show
                    Mailbox
Node                Owner     Disk Name     Disk UUID
------------------- ------    ------------- ------------------------------------
sti113-vsim-ucs626g
.
.
                    local     0m.i2.3L26    7BBA77C9:AD702D14:831B3E7E:0B0730EE:00000000:00000000:00000000:00000000:00000000:00000000
                    local     0m.i2.3L27    928F79AE:631EA9F9:4DCB5DE6:3402AC48:00000000:00000000:00000000:00000000:00000000:00000000
                    local     0m.i1.0L60    B7BCDB3C:297A4459:318C2748:181565A3:00000000:00000000:00000000:00000000:00000000:00000000
.
.
.
                    partner   0m.i1.0L14    EA71F260:D4DD5F22:E3422387:61D475B2:00000000:00000000:00000000:00000000:00000000:00000000
                    partner   0m.i2.3L64    4460F436:AAE5AB9E:D1ED414E:ABF811F7:00000000:00000000:00000000:00000000:00000000:00000000
28 entries were displayed.
-
Change to the advanced privilege level:
set -privilege advanced
-
Check that the mailbox LUNs are visible to the system:
storage iscsi-initiator show
The output will show the presence of the mailbox LUNs:
Node      Type     Label     Target Portal   Target Name                                      Admin/Op
--------- -------- --------- --------------- ------------------------------------------------ --------
.
.
.
node_A_1  mailbox  mediator  172.16.254.1    iqn.2012-05.local:mailbox.target.db5f02d6-e3d3   up/up
.
.
.
17 entries were displayed.
-
Return to the administrative privilege level:
set -privilege admin