Skip to main content

HA pair management overview

Contributors netapp-jsnyder netapp-aherbin netapp-dbagwell netapp-lisa

Cluster nodes are configured in high-availability (HA) pairs for fault tolerance and nondisruptive operations. If a node fails or if you need to bring a node down for routine maintenance, its partner can take over its storage and continue to serve data from it. The partner gives back storage when the node is brought back on line.

The HA pair controller configuration consists of a pair of matching FAS/AFF storage controllers (local node and partner node). Each of these nodes is connected to the other’s disk shelves. When one node in an HA pair encounters an error and stops processing data, its partner detects the failed status of the partner and takes over all data processing from that controller.

Takeover is the process in which a node assumes control of its partner's storage.

Giveback is the process in which the storage is returned to the partner.

By default, takeovers occur automatically in any of the following situations:

  • A software or system failure occurs on a node that leads to a panic. The HA pair controllers automatically fail over to their partner node. After the partner has recovered from the panic and booted up, the node automatically performs a giveback, returning the partner to normal operation.

  • A system failure occurs on a node, and the node cannot reboot. For example, when a node fails because of a power loss, HA pair controllers automatically fail over to their partner node and serve data from the surviving storage controller.

Note If the storage for a node also loses power at the same time, a standard takeover is not possible.
  • Heartbeat messages are not received from the node's partner. This could happen if the partner experienced a hardware or software failure (for example, an interconnect failure) that did not result in a panic but still prevented it from functioning correctly.

  • You halt one of the nodes without using the -f or -inhibit-takeover true parameter.

Note In a two-node cluster with cluster HA enabled, halting or rebooting a node using the ‑inhibit‑takeover true parameter causes both nodes to stop serving data unless you first disable cluster HA and then assign epsilon to the node that you want to remain online.
  • You reboot one of the nodes without using the ‑inhibit‑takeover true parameter. (The ‑onboot parameter of the storage failover command is enabled by default.)

  • The remote management device (Service Processor) detects failure of the partner node. This is not applicable if you disable hardware-assisted takeover.

You can also manually initiate takeovers with the storage failover takeover command.

Cluster resiliency and diagnostic improvements

Beginning in ONTAP 9.9.1, the following resiliency and diagnostic additions improve cluster operation:

  • Port monitoring and avoidance: In two-node switchless cluster configurations, the system avoids ports that experience total packet loss (connectivity loss). In ONTAP 9.8.1 and earlier, this functionality was only available in switched configurations.

  • Automatic node failover: If a node cannot serve data across its cluster network, that node should not own any disks. Instead its HA partner should take over, if the partner is healthy.

  • Commands to analyze connectivity issues: Use the following command to display which cluster paths are experiencing packet loss: network interface check cluster-connectivity show