Automating the failover of stateful applications with Trident

Contributors netapp-aruldeepa

Trident's force-detach feature allows you to automatically detach volumes from unhealthy nodes in a Kubernetes cluster, preventing data corruption and ensuring application availability. This feature is particularly useful in scenarios where nodes become unresponsive or are taken offline for maintenance.

Details about force detach

Force detach is available for ontap-san, ontap-san-economy, ontap-nas, and ontap-nas-economy only. Before enabling force detach, non-graceful node shutdown (NGNS) must be enabled on the Kubernetes cluster. NGNS is enabled by default for Kubernetes 1.28 and above. For more information, refer to Kubernetes: Non Graceful node shutdown.

Note When using the ontap-nas or ontap-nas-economy driver, you must set the autoExportPolicy parameter to true in the backend configuration so that Trident can use managed export policies to restrict access from the tainted Kubernetes node.
Warning Because Trident relies on Kubernetes NGNS, do not remove out-of-service taints from an unhealthy node until all non-tolerable workloads are rescheduled. Recklessly applying or removing the taint can jeopardize backend data protection.
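For illustration, a minimal ontap-nas backend configuration with autoExportPolicy enabled might look like the following sketch (the management LIF, SVM, credentials, and CIDR values are placeholders for your environment):

```json
{
  "version": 1,
  "storageDriverName": "ontap-nas",
  "managementLIF": "10.0.0.1",
  "svm": "svm_nfs",
  "username": "vsadmin",
  "password": "password",
  "autoExportPolicy": true,
  "autoExportCIDRs": ["10.0.0.0/24"]
}
```

With autoExportPolicy enabled, Trident manages export policy rules per node, which is what allows it to revoke a tainted node's access during force-detach.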

When the Kubernetes cluster administrator has applied the node.kubernetes.io/out-of-service=nodeshutdown:NoExecute taint to the node and enableForceDetach is set to true, Trident will determine the node status and:

  1. Stop backend I/O access for volumes mounted to that node.

  2. Mark the Trident node object as dirty (not safe for new publications).

    Note The Trident controller will reject new publish volume requests until the node (after having been marked as dirty) is re-qualified by the Trident node pod. Any workload scheduled with a mounted PVC (even after the cluster node is healthy and ready) will not be accepted until Trident can verify the node is clean (safe for new publications).

When node health is restored and the taint is removed, Trident will:

  1. Identify and clean stale published paths on the node.

  2. If the node is in a cleanable state (the out-of-service taint has been removed and the node is in the Ready state) and all stale published paths are cleaned, Trident will readmit the node as clean and allow new volume publications to the node.
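The taint-based workflow above can be driven with standard kubectl commands. The taint key and effect below come from the Kubernetes non-graceful node shutdown feature; the node name is a placeholder:

```shell
# Apply the taint only after confirming the node is truly down
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

# Remove the taint (trailing "-") once the node is healthy again
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
```

Per the warning above, remove the taint only after all non-tolerable workloads have been rescheduled.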

Details about automated failover

You can automate the force-detach process by integrating with the Node Health Check (NHC) operator. When a node failure occurs, NHC triggers Trident node remediation (TNR) and force-detach automatically by creating a TridentNodeRemediation CR, named after the failed node, in Trident's namespace. A TNR is created only upon node failure and is removed by NHC once the node comes back online or is deleted.

Failed node pod removal process

Automated failover selects the workloads to remove from the failed node. When a TNR is created, the TNR controller marks the node as dirty, preventing any new volume publications, and begins removing force-detach-supported pods and their volume attachments.

All volumes/PVCs supported by force-detach are supported by automated-failover:

  • NAS and NAS-economy volumes using auto-export policies (SMB is not yet supported).

  • SAN and SAN-economy volumes.

Default behavior:

  • Pods using volumes supported by force-detach are removed from the failed node. Kubernetes will reschedule these onto a healthy node.

  • Pods using a volume not supported by force-detach, including non-Trident volumes, are not removed from the failed node.

  • Stateless pods (those without PVCs) are not removed from the failed node unless the pod annotation trident.netapp.io/podRemediationPolicy: delete is set.

Overriding the pod removal behavior:

Pod removal behavior can be customized with the pod annotation trident.netapp.io/podRemediationPolicy, which takes a value of retain or delete. The annotation is examined and used when a failover occurs.
Apply the annotation in the pod template of the Kubernetes Deployment or ReplicaSet so that it persists after a failover:

  • retain - Pod WILL NOT be removed from the failed node during an automated-failover.

  • delete - Pod WILL be removed from the failed node during an automated-failover.

These annotations can be applied to any pod.
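For example, to ensure a pod is never evicted during automated failover, the retain policy can be set in a Deployment's pod template. The Deployment name, labels, and image below are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Keep this pod on the failed node during automated failover
        trident.netapp.io/podRemediationPolicy: retain
    spec:
      containers:
        - name: app
          image: my-app:latest
```

Because the annotation lives in the pod template rather than on the pod itself, replacement pods created after a failover carry it as well.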

Warning
  • I/O operations will be blocked only on failed nodes for volumes that support force-detach.

  • For volumes that do not support force-detach, there is a risk of data corruption and multi-attach issues.

TridentNodeRemediation CR

The TridentNodeRemediation (TNR) CR defines a failed node. The name of the TNR is the name of the failed node.

Example TNR:

apiVersion: trident.netapp.io/v1
kind: TridentNodeRemediation
metadata:
  name: <K8s-node-name>
spec: {}

TNR states:
Use the following command to view the status of a TNR:
kubectl get tnr <name> -n <trident-namespace>

TNRs can be in one of the following states:

  • Remediating:

    • Backend I/O access is stopped for force-detach-supported volumes mounted to that node.

    • The Trident node object is marked dirty (not safe for new publications).

    • Pods and their volume attachments are removed from the node.

  • NodeRecoveryPending:

    • The controller is waiting for the node to come back online.

    • Once the node is online, publish-enforcement will ensure the node is clean and ready for new volume publications.

    • If the node is deleted from K8s, the TNR controller will remove the TNR and cease reconciliation.

  • Succeeded:

    • All remediation and node recovery steps completed successfully. The node is clean and ready for new volume publications.

  • Failed:

    • Unrecoverable error. Error reasons are set in the status.message field of the CR.

Enabling automated-failover

Prerequisites:

  • The Node Health Check (NHC) operator must be installed on the cluster. See Node Health Check Operator for more information.

Note You can also use alternative ways to detect node failure, as described in the [Integrating custom node health check solutions] section below.

Steps
  1. Create a NodeHealthCheck (NHC) CR in the Trident namespace to monitor the worker nodes in the cluster. Example:

    apiVersion: remediation.medik8s.io/v1alpha1
    kind: NodeHealthCheck
    metadata:
      name: <CR name>
    spec:
      selector:
        matchExpressions:
          - key: node-role.kubernetes.io/control-plane
            operator: DoesNotExist
          - key: node-role.kubernetes.io/master
            operator: DoesNotExist
      remediationTemplate:
        apiVersion: trident.netapp.io/v1
        kind: TridentNodeRemediationTemplate
        namespace: <Trident installation namespace>
        name: trident-node-remediation-template
      minHealthy: 0 # Trigger force-detach upon one or more node failures
      unhealthyConditions:
        - type: Ready
          status: "False"
          duration: 0s
        - type: Ready
          status: Unknown
          duration: 0s
  2. Apply the node health check CR in the Trident namespace:

    kubectl apply -f <nhc-cr-file>.yaml -n <trident-namespace>

The above CR is configured to watch K8s worker nodes for the node conditions Ready: False and Ready: Unknown. Automated failover is triggered when a node enters either state.

The unhealthyConditions in the CR use a 0-second grace period, causing automated failover to trigger immediately when K8s sets the node condition Ready: False, which happens after K8s loses the heartbeat from a node. By default, K8s waits 40 seconds after the last heartbeat before setting Ready: False. This grace period can be customized in the K8s deployment options.
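For example, the heartbeat grace period is controlled by a kube-controller-manager flag, shown here at its upstream default of 40 seconds. Note that changing it affects all node-failure detection in the cluster, not just Trident:

```shell
kube-controller-manager --node-monitor-grace-period=40s
```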

For additional configuration options, refer to Node-Healthcheck-Operator documentation.

Additional setup information

When Trident is installed with force-detach enabled, two additional resources are automatically created in the Trident namespace to facilitate integration with NHC: TridentNodeRemediationTemplate (TNRT) and ClusterRole.

TridentNodeRemediationTemplate (TNRT):

The TNRT serves as a template for the NHC controller, which uses TNRT to generate TNR resources as needed.

apiVersion: trident.netapp.io/v1
kind: TridentNodeRemediationTemplate
metadata:
  name: trident-node-remediation-template
  namespace: trident
spec:
  template:
    spec: {}

ClusterRole:

A cluster role is also added during installation when force-detach is enabled. It gives NHC permission to manage TNRs and TNR templates in the Trident namespace.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    rbac.ext-remediation/aggregate-to-ext-remediation: "true"
  name: tridentnoderemediation-access
rules:
- apiGroups:
  - trident.netapp.io
  resources:
  - tridentnoderemediationtemplates
  - tridentnoderemediations
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete

K8s cluster upgrades and maintenance

To prevent unintended failovers, pause automated failover during K8s maintenance or upgrades, when nodes are expected to go down or reboot. You can pause the NHC CR (described above) by patching it:

kubectl patch NodeHealthCheck <cr-name> --patch '{"spec":{"pauseRequests":["<description-for-reason-of-pause>"]}}' --type=merge

This pauses automated failover. To re-enable it, remove the pauseRequests entry from the spec after the maintenance is complete.

Limitations

  • I/O operations are prevented only on the failed nodes for volumes supported by force-detach. Only pods using volumes/PVCs supported by force-detach are automatically removed.

  • Automated failover and force-detach run inside the trident-controller pod. If the node hosting trident-controller fails, automated failover is delayed until K8s reschedules the pod onto a healthy node.

Integrating custom node health check solutions

You can replace the Node Health Check operator with alternative node-failure detection tools to trigger automated failover.
To ensure compatibility with the automated failover mechanism, your custom solution should:

  • Create a TNR when a node failure is detected, using the failed node’s name as the TNR CR name.

  • Delete the TNR when the node has recovered and the TNR is in the Succeeded state.
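As a sketch, a custom detector could satisfy both requirements with kubectl; the node and namespace names are placeholders:

```shell
# On failure detection: create a TNR named after the failed node
kubectl apply -n <trident-namespace> -f - <<EOF
apiVersion: trident.netapp.io/v1
kind: TridentNodeRemediation
metadata:
  name: <failed-node-name>
spec: {}
EOF

# After the node recovers: confirm the TNR reached Succeeded, then delete it
kubectl get tnr <failed-node-name> -n <trident-namespace>
kubectl delete tnr <failed-node-name> -n <trident-namespace>
```

This mirrors what the NHC operator does automatically: the TNR's name identifies the failed node, and deleting the TNR after it reaches Succeeded completes the remediation cycle.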