ONTAP MetroCluster

Completing recovery


Perform the required tasks to complete the recovery from a multi-controller or storage failure.

Reestablishing object stores for FabricPool configurations

If one of the object stores in a FabricPool mirror was co-located with the MetroCluster disaster site and was destroyed, you must reestablish the object store and the FabricPool mirror.

About this task
  • If the object stores are remote and a MetroCluster site is destroyed, you do not need to rebuild the object store; the original object store configuration and cold data contents are retained.

  • For more information about FabricPool configurations, see the Disk and aggregates management documentation.

Step
  1. Follow the procedure "Replacing a FabricPool mirror on a MetroCluster configuration" in the Disk and aggregates management documentation.

Verifying licenses on the replaced nodes

You must install new licenses for the replacement nodes if the impaired nodes were using ONTAP features that require a standard (node-locked) license. For features with standard licenses, each node in the cluster should have its own key for the feature.

About this task

Until you install license keys, features requiring standard licenses continue to be available to the replacement node. However, if the impaired node was the only node in the cluster with a license for the feature, no configuration changes to the feature are allowed. Also, using unlicensed features on the node might put you out of compliance with your license agreement, so you should install the replacement license key or keys on the replacement node as soon as possible.

The license keys must be in the 28-character format.

You have a 90-day grace period in which to install the license keys. After the grace period, all old licenses are invalidated. After a valid license key is installed, you have 24 hours to install all of the keys before the grace period ends.

Note If all nodes at a site have been replaced (a single node in the case of a two-node MetroCluster configuration), license keys must be installed on the replacement node or nodes prior to switchback.
Steps
  1. Identify the licenses on the node:

    license show

    The following example displays the information about licenses in the system:

    cluster_B::>  license show
             (system license show)
    
    Serial Number: 1-80-00050
    Owner: site1-01
    Package           Type       Description             Expiration
    -------          -------     -------------           -----------
    Base             license     Cluster Base License        -
    NFS              site        NFS License                 -
    CIFS             site        CIFS License                -
    iSCSI            site        iSCSI License               -
    FCP              site        FCP License                 -
    FlexClone        site        FlexClone License           -
    
    6 entries were displayed.
  2. Verify that the licenses are good for the node after switchback:

    metrocluster check license show

    The following example displays the licenses that are good for the node:

    cluster_B::> metrocluster check license show

    Cluster           Check                             Result
    -------           -------                           -------------
    Cluster_B         negotiated-switchover-ready       not-applicable
    Cluster_B         switchback-ready                  not-applicable
    Cluster_B         job-schedules                     ok
    Cluster_B         licenses                          ok
    Cluster_B         periodic-check-enabled            ok
  3. If you need new license keys, obtain replacement license keys on the NetApp Support Site in the My Support section under Software licenses.

    Note The new license keys that you require are automatically generated and sent to the email address on file. If you fail to receive the email with the license keys within 30 days, refer to the "Who to contact if I have issues with my Licenses?" section in the Knowledge Base article Post Motherboard Replacement Process to update Licensing on a AFF/FAS system.
  4. Install each license key:

    system license add -license-code license-key,license-key…
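
    For example, to install one key (the 28-character key below is a placeholder, not a real license):

    cluster_B::> system license add -license-code AAAAAAAAAAAAAAAAAAAAAAAAAAAA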

  5. Remove the old licenses, if desired:

    1. Check for unused licenses:

      license clean-up -unused -simulate

    2. If the list looks correct, remove the unused licenses:

      license clean-up -unused

Restoring key management

If data volumes are encrypted, you must restore key management. If the root volume is encrypted, you must recover key management.

Steps
  1. If data volumes are encrypted, restore the keys using the correct command for your key management configuration.

    If you are using…               Use this command…

    Onboard key management          security key-manager onboard sync

    External key management         security key-manager key query -node node-name
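
    For example, with onboard key management the sync prompts for the cluster-wide passphrase. The session below is representative; the exact prompt wording can vary by ONTAP release:

    cluster_B::> security key-manager onboard sync
    Enter the cluster-wide passphrase for onboard key management: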

  2. If the root volume is encrypted, use the procedure in Recovering key management if the root volume is encrypted.

Performing a switchback

After you heal the MetroCluster configuration, you can perform the MetroCluster switchback operation. The MetroCluster switchback operation returns the configuration to its normal operating state, with the sync-source storage virtual machines (SVMs) on the disaster site active and serving data from the local disk pools.

Before you begin
  • The disaster cluster must have successfully switched over to the surviving cluster.

  • Healing must have been performed on the data and root aggregates.

  • The surviving cluster nodes must not be in the HA failover state (all nodes must be up and running for each HA pair).

  • The disaster site controller modules must be completely booted and not in the HA takeover mode.

  • The root aggregate must be mirrored.

  • The Inter-Switch Links (ISLs) must be online.

  • Any required licenses must be installed on the system.

Steps
  1. Confirm that all nodes are in the enabled state:

    metrocluster node show

    The following example displays the nodes that are in the enabled state:

    cluster_B::>  metrocluster node show
    
    DR                        Configuration  DR
    Group Cluster Node        State          Mirroring Mode
    ----- ------- ----------- -------------- --------- --------------------
    1     cluster_A
                  node_A_1    configured     enabled   heal roots completed
                  node_A_2    configured     enabled   heal roots completed
          cluster_B
                  node_B_1    configured     enabled   waiting for switchback recovery
                  node_B_2    configured     enabled   waiting for switchback recovery
    4 entries were displayed.
  2. Confirm that resynchronization is complete on all SVMs:

    metrocluster vserver show
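
    All SVMs should report a healthy configuration state before you proceed. The output below is illustrative only; your SVM and partner names will differ:

    cluster_B::> metrocluster vserver show

    Cluster     Vserver     Partner Vserver  Configuration State
    ----------- ----------- ---------------- -------------------
    cluster_A   vs1         vs1-mc           healthy
    cluster_B   vs1-mc      vs1              healthy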

  3. Verify that any automatic LIF migrations being performed by the healing operations have been successfully completed:

    metrocluster check lif show

  4. Perform the switchback by running the metrocluster switchback command from any node in the surviving cluster.
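
    For example, a successful run ends with a job-succeeded message similar to the following (the job number is illustrative):

    cluster_B::> metrocluster switchback
    [Job 130] Job succeeded: Switchback is successful.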

  5. Check the progress of the switchback operation:

    metrocluster show

    The switchback operation is still in progress when the output displays "waiting-for-switchback":

    cluster_B::> metrocluster show
    Cluster                   Entry Name          State
    ------------------------- ------------------- -----------
     Local: cluster_B         Configuration state configured
                              Mode                switchover
                              AUSO Failure Domain -
    Remote: cluster_A         Configuration state configured
                              Mode                waiting-for-switchback
                              AUSO Failure Domain -

    The switchback operation is complete when the output displays "normal":

    cluster_B::> metrocluster show
    Cluster                   Entry Name          State
    ------------------------- ------------------- -----------
     Local: cluster_B         Configuration state configured
                              Mode                normal
                              AUSO Failure Domain -
    Remote: cluster_A         Configuration state configured
                              Mode                normal
                              AUSO Failure Domain -

    If a switchback takes a long time to finish, you can check on the status of in-progress baselines by using the following command at the advanced privilege level:

    metrocluster config-replication resync-status show

  6. Reestablish any SnapMirror or SnapVault configurations.

    In ONTAP 8.3, you need to manually reestablish a lost SnapMirror configuration after a MetroCluster switchback operation. In ONTAP 9.0 and later, the relationship is reestablished automatically.
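
    For example, a lost SnapMirror relationship can be resumed manually with the snapmirror resync command, run from the destination cluster; the destination path below is a placeholder:

    cluster_B::> snapmirror resync -destination-path svm_dst:vol1_dst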

Verifying a successful switchback

After performing the switchback, you should confirm that all aggregates and storage virtual machines (SVMs) are switched back and online.

Steps
  1. Verify that the switched-over data aggregates are switched back:

    storage aggregate show

    In the following example, aggr_b2 on node B2 has switched back:

    node_B_1::> storage aggregate show
    Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
    --------- -------- --------- ----- ------- ------ ---------------- ------------
    ...
    aggr_b2    227.1GB   227.1GB    0% online       0 node_B_2   raid_dp,
                                                                       mirrored,
                                                                       normal
    
    node_A_1::> aggr show
    Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
    --------- -------- --------- ----- ------- ------ ---------------- ------------
    ...
    aggr_b2          -         -     - unknown      - node_A_1

    If the disaster site included unmirrored aggregates and the unmirrored aggregates are no longer present, the aggregate might show up with a state of “unknown” in the output of the storage aggregate show command. Contact technical support to remove the out-of-date entries for the unmirrored aggregates; for details, see the Knowledge Base article How to remove stale unmirrored aggregate entries in a MetroCluster following disaster where storage was lost.

  2. Verify that all sync-destination SVMs on the surviving cluster are dormant (showing an Admin State of “stopped”) and the sync-source SVMs on the disaster cluster are up and running:

    vserver show -subtype sync-source

    node_B_1::> vserver show -subtype sync-source
                                   Admin      Root                       Name    Name
    Vserver     Type    Subtype    State      Volume     Aggregate       Service Mapping
    ----------- ------- ---------- ---------- ---------- ----------      ------- -------
    ...
    vs1a        data    sync-source
                                   running    vs1a_vol   node_B_2        file    file
                                                                         aggr_b2
    
    node_A_1::> vserver show -subtype sync-destination
                                   Admin      Root                         Name    Name
    Vserver            Type    Subtype    State      Volume     Aggregate  Service Mapping
    -----------        ------- ---------- ---------- ---------- ---------- ------- -------
    ...
    cluster_A-vs1a-mc  data    sync-destination
                                          stopped    vs1a_vol   sosb_      file    file
                                                                           aggr_b2

    Sync-destination aggregates in the MetroCluster configuration have the suffix "-mc" automatically appended to their name to help identify them.

  3. Confirm that the switchback operations succeeded by using the metrocluster operation show command.

    If the command output shows…                        Then…

    The switchback operation state is successful.       The switchback process is complete and you can
                                                        proceed with operation of the system.

    The switchback operation or the                     Perform the suggested fix provided in the output
    switchback-continuation-agent operation is          of the metrocluster operation show command.
    partially successful.

After you finish

You must repeat the previous sections to perform the switchback in the opposite direction. If site_A did a switchover of site_B, have site_B do a switchover of site_A.

Mirroring the root aggregates of the replacement nodes

If disks were replaced, you must mirror the root aggregates of the new nodes on the disaster site.

Steps
  1. On the disaster site, identify the aggregates that are not mirrored:

    storage aggregate show

    cluster_A::> storage aggregate show
    
    Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
    --------- -------- --------- ----- ------- ------ ---------------- ------------
    node_A_1_aggr0
                1.49TB   74.12GB   95% online       1 node_A_1         raid4,
                                                                       normal
    node_A_2_aggr0
                1.49TB   74.12GB   95% online       1 node_A_2         raid4,
                                                                       normal
    node_A_1_aggr1
                1.49TB   74.12GB   95% online       1 node_A_1         raid4,
                                                                       mirrored,
                                                                       normal
    node_A_2_aggr1
                1.49TB   74.12GB   95% online       1 node_A_2         raid4,
                                                                       mirrored,
                                                                       normal
    4 entries were displayed.
    
    cluster_A::>
  2. Mirror one of the root aggregates:

    storage aggregate mirror -aggregate root-aggregate

    The following example shows how the command selects disks and prompts for confirmation when mirroring the aggregate.

    cluster_A::> storage aggregate mirror -aggregate node_A_2_aggr0
    
    Info: Disks would be added to aggregate "node_A_2_aggr0" on node "node_A_2" in
          the following manner:
    
          Second Plex
    
            RAID Group rg0, 3 disks (block checksum, raid4)
              Position   Disk                      Type                  Size
              ---------- ------------------------- ---------- ---------------
              parity     2.10.0                    SSD                      -
              data       1.11.19                   SSD                894.0GB
              data       2.10.2                    SSD                894.0GB
    
          Aggregate capacity available for volume use would be 1.49TB.
    
    Do you want to continue? {y|n}: y
    
    cluster_A::>
  3. Verify that mirroring of the root aggregate is complete:

    storage aggregate show

    The following example shows that the root aggregates are mirrored.

    cluster_A::> storage aggregate show
    
    Aggregate     Size Available Used% State   #Vols  Nodes       RAID Status
    --------- -------- --------- ----- ------- ------ ----------- ------------
    node_A_1_aggr0
                1.49TB   74.12GB   95% online       1 node_A_1    raid4,
                                                                  mirrored,
                                                                  normal
    node_A_2_aggr0
                2.24TB   838.5GB   63% online       1 node_A_2    raid4,
                                                                  mirrored,
                                                                  normal
    node_A_1_aggr1
                1.49TB   74.12GB   95% online       1 node_A_1    raid4,
                                                                  mirrored,
                                                                  normal
    node_A_2_aggr1
                1.49TB   74.12GB   95% online       1 node_A_2    raid4,
                                                                  mirrored,
                                                                  normal
    4 entries were displayed.
    
    cluster_A::>
  4. Repeat these steps for the other root aggregates.

    Any root aggregate that does not have a status of mirrored must be mirrored.

Reconfiguring the ONTAP Mediator service (MetroCluster IP configurations)

If you have a MetroCluster IP configuration that was configured with the ONTAP Mediator service, you must remove and reconfigure the association with the mediator.

Before you begin
  • You must have the IP address, username, and password for the ONTAP Mediator service.

  • The ONTAP Mediator service must be configured and operating on the Linux host.

Steps
  1. Remove the existing ONTAP Mediator configuration:

    metrocluster configuration-settings mediator remove

  2. Reconfigure the ONTAP Mediator configuration:

    metrocluster configuration-settings mediator add -mediator-address mediator-IP-address
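
    The add command prompts for the ONTAP Mediator username and password. A representative session (prompts may vary slightly by release); the IP address is a placeholder:

    cluster_A::> metrocluster configuration-settings mediator add -mediator-address 172.16.10.49

    Enter the user name: mediatoradmin
    Enter the password: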

Verifying the health of the MetroCluster configuration

You should check the health of the MetroCluster configuration to verify proper operation.

Steps
  1. Check that the MetroCluster is configured and in normal mode on each cluster:

    metrocluster show

    cluster_A::> metrocluster show
    Cluster                   Entry Name          State
    ------------------------- ------------------- -----------
     Local: cluster_A         Configuration state configured
                              Mode                normal
                              AUSO Failure Domain auso-on-cluster-disaster
    Remote: cluster_B         Configuration state configured
                              Mode                normal
                              AUSO Failure Domain auso-on-cluster-disaster
  2. Check that mirroring is enabled on each node:

    metrocluster node show

    cluster_A::> metrocluster node show
    DR                           Configuration  DR
    Group Cluster Node           State          Mirroring Mode
    ----- ------- -------------- -------------- --------- --------------------
    1     cluster_A
                  node_A_1       configured     enabled   normal
          cluster_B
                  node_B_1       configured     enabled   normal
    2 entries were displayed.
  3. Check that the MetroCluster components are healthy:

    metrocluster check run

    cluster_A::> metrocluster check run
    
    Last Checked On: 10/1/2014 16:03:37
    
    Component           Result
    ------------------- ---------
    nodes               ok
    lifs                ok
    config-replication  ok
    aggregates          ok
    4 entries were displayed.
    
    Command completed. Use the "metrocluster check show -instance" command or sub-commands in "metrocluster check" directory for detailed results.
    To check if the nodes are ready to do a switchover or switchback operation, run "metrocluster switchover -simulate" or "metrocluster switchback -simulate", respectively.
  4. Check that there are no health alerts:

    system health alert show
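
    If there are no outstanding alerts, the command returns no entries:

    cluster_A::> system health alert show
    There are no entries matching your query.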

  5. Simulate a switchover operation:

    1. From any node's prompt, change to the advanced privilege level:

      set -privilege advanced

      You need to respond with y when prompted to continue into advanced mode and see the advanced mode prompt (*>).
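
      The warning and prompt look similar to the following (exact wording can vary by ONTAP release):

      cluster_A::> set -privilege advanced

      Warning: These advanced commands are potentially dangerous; use them only
               when directed to do so by NetApp personnel.
      Do you want to continue? {y|n}: y

      cluster_A::*>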

    2. Perform the switchover operation with the -simulate parameter:

      metrocluster switchover -simulate
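
      The simulation runs as a job; a successful run ends with a message similar to the following (the job number is illustrative):

      cluster_A::*> metrocluster switchover -simulate
      [Job 126] Job succeeded: Switchover simulation is successful.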

    3. Return to the admin privilege level:

      set -privilege admin

  6. For MetroCluster IP configurations using the ONTAP Mediator service, confirm that the Mediator service is up and operating.

    1. Check that the Mediator disks are visible to the system:

      storage failover mailbox-disk show

      The following example shows that the mailbox disks have been recognized.

      node_A_1::*> storage failover mailbox-disk show
                       Mailbox
       Node             Owner     Disk Name       Disk UUID
       ---------------- --------- --------------- ----------------
      sti113-vsim-ucs626g
      .
      .
           local     0m.i2.3L26      7BBA77C9:AD702D14:831B3E7E:0B0730EE:00000000:00000000:00000000:00000000:00000000:00000000
           local     0m.i2.3L27      928F79AE:631EA9F9:4DCB5DE6:3402AC48:00000000:00000000:00000000:00000000:00000000:00000000
           local     0m.i1.0L60      B7BCDB3C:297A4459:318C2748:181565A3:00000000:00000000:00000000:00000000:00000000:00000000
      .
      .
      .
           partner   0m.i1.0L14      EA71F260:D4DD5F22:E3422387:61D475B2:00000000:00000000:00000000:00000000:00000000:00000000
           partner   0m.i2.3L64      4460F436:AAE5AB9E:D1ED414E:ABF811F7:00000000:00000000:00000000:00000000:00000000:00000000
      28 entries were displayed.
    2. Change to the advanced privilege level:

      set -privilege advanced

    3. Check that the mailbox LUNs are visible to the system:

      storage iscsi-initiator show

      The output will show the presence of the mailbox LUNs:

       Node      Type      Label     Target Portal   Target Name                                       Admin/Op
       --------  --------  --------  --------------  ------------------------------------------------  --------
       .
       .
       .
       node_A_1
                 mailbox   mediator  172.16.254.1    iqn.2012-05.local:mailbox.target.db5f02d6-e3d3    up/up
      .
      .
      .
      17 entries were displayed.
    4. Return to the administrative privilege level:

      set -privilege admin