Data protection

Learn about data protection and recoverability options that NetApp’s storage platforms provide. Astra Trident can provision volumes that can take advantage of some of these features. You should have a data protection and recovery strategy for each application with a persistence requirement.

Back up the etcd cluster data

Astra Trident stores its metadata in the Kubernetes cluster's etcd database. Periodically backing up the etcd cluster data is important to recover Kubernetes clusters under disaster scenarios.

Steps
  1. The etcdctl snapshot save command enables you to take a point-in-time snapshot of the etcd cluster:

    sudo docker run --rm -v /backup:/backup \
      --network host \
      -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
      --env ETCDCTL_API=3 \
      k8s.gcr.io/etcd-amd64:3.2.18 \
      etcdctl --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
      --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
      snapshot save /backup/etcd-snapshot.db

    This command creates an etcd snapshot by spinning up an etcd container and saves it in the /backup directory.

  2. In the event of a disaster, you can spin up a Kubernetes cluster by using the etcd snapshots.
    Use the etcdctl snapshot restore command to restore a specific snapshot to the /var/lib/etcd folder. After restoring, confirm that the /var/lib/etcd folder has been populated with the member folder. The following is an example of the etcdctl snapshot restore command:

    # etcdctl snapshot restore '/backup/etcd-snapshot-latest.db' ; mv /default.etcd/member/ /var/lib/etcd/
  3. Before you initialize the Kubernetes cluster, copy all the necessary certificates.

  4. Create the cluster with the --ignore-preflight-errors=DirAvailable--var-lib-etcd flag.

  5. After the cluster comes up, ensure that the kube-system pods have started.

  6. Use the kubectl get crd command to verify that the custom resources created by Trident are present, and retrieve the Trident objects to make sure that all the data is available (see the sketch below).
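
The following is a minimal sketch of steps 4 through 6, assuming a kubeadm-based cluster and that Astra Trident runs in the trident namespace; adjust the namespace and paths for your environment:

    # Step 4: initialize the cluster, tolerating the pre-populated /var/lib/etcd
    kubeadm init --ignore-preflight-errors=DirAvailable--var-lib-etcd

    # Step 5: confirm that the kube-system pods have started
    kubectl get pods -n kube-system

    # Step 6: confirm that the Trident custom resources and objects are present
    kubectl get crd | grep trident
    ./tridentctl get backend -n trident
    ./tridentctl get volume -n trident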

Recover data by using ONTAP snapshots

Snapshots play an important role by providing point-in-time recovery options for application data. However, snapshots are not backups by themselves; they do not protect against storage system failure or other catastrophes. They are, however, a convenient, quick, and easy way to recover data in most scenarios. Learn how you can use ONTAP snapshot technology to take backups of the volume and how to restore them.

  • If the snapshot policy has not been defined in the backend, it defaults to using the none policy. This results in ONTAP taking no automatic snapshots. However, the storage administrator can take manual snapshots or change the snapshot policy via the ONTAP management interface. This does not affect Trident operation.

  • The snapshot directory is hidden by default. This helps facilitate maximum compatibility of volumes provisioned using the ontap-nas and ontap-nas-economy drivers. Enable the .snapshot directory when using these drivers to allow applications to recover data from snapshots directly (see the sketch after the example below).

  • Restore a volume to a state recorded in a prior snapshot by using the volume snapshot restore ONTAP CLI command. When you restore a snapshot copy, the restore operation overwrites the existing volume configuration. Any changes made to the data in the volume after the snapshot copy was created are lost.

cluster1::*> volume snapshot restore -vserver vs0 -volume vol3 -snapshot vol3_snap_archive
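
If applications need direct access to the .snapshot directory (as noted above), the following is a minimal sketch, assuming the volume vol3 on SVM vs0 from the example; it takes a manual snapshot and exposes the snapshot directory to NFS clients:

    cluster1::> volume snapshot create -vserver vs0 -volume vol3 -snapshot vol3_manual_snap
    cluster1::> volume modify -vserver vs0 -volume vol3 -snapdir-access true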

Replicate data by using ONTAP

Replicating data can play an important role in protecting against data loss due to storage array failure.

Note To learn more about ONTAP replication technologies, see the ONTAP documentation.

SnapMirror Storage Virtual Machine (SVM) replication

You can use SnapMirror to replicate a complete SVM, which includes its configuration settings and its volumes. In the event of a disaster, you can activate the SnapMirror destination SVM to start serving data. You can switch back to the primary when the systems are restored.

Astra Trident cannot configure replication relationships itself; however, the storage administrator can use ONTAP’s SnapMirror SVM Replication feature to automatically replicate volumes to a Disaster Recovery (DR) destination.

Consider the following if you are planning to use the SnapMirror SVM Replication feature or are currently using the feature:

  • You should create a distinct backend for each SVM that has SVM-DR enabled.

  • You should configure the storage classes so that they do not select the replicated backends except when desired. This is important to avoid provisioning volumes that do not need the protection of a replication relationship onto the backends that support SVM-DR.

  • Application administrators should understand the additional cost and complexity associated with replicating data and determine a recovery plan before they leverage data replication.

  • Before activating the SnapMirror destination SVM, stop all scheduled SnapMirror transfers, abort all ongoing SnapMirror transfers, break the replication relationship, stop the source SVM, and then start the SnapMirror destination SVM (see the sketch after this list).

  • Astra Trident does not automatically detect SVM failures. Therefore, upon a failure, the administrator should run the tridentctl backend update command to trigger Trident’s failover to the new backend.
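
The following is a minimal sketch of the activation sequence described above, assuming an SVM-DR relationship with placeholder SVM names svm_src and svm_dst; run the SnapMirror commands from the destination cluster:

    # Stop scheduled transfers and abort any transfers in progress
    destination::> snapmirror quiesce -destination-path svm_dst:
    destination::> snapmirror abort -destination-path svm_dst:

    # Break the relationship so that the destination SVM volumes become read/write
    destination::> snapmirror break -destination-path svm_dst:

    # Stop the source SVM (if reachable), then start the destination SVM
    source::> vserver stop -vserver svm_src
    destination::> vserver start -vserver svm_dst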

Here is an overview of the SVM setup steps:

  • Set up peering between the source and destination cluster and SVM.

  • Create the destination SVM by using the -subtype dp-destination option.

  • Create a replication job schedule to ensure that replication happens at the required intervals.

  • From the destination SVM, create a SnapMirror relationship with the source SVM by using the -identity-preserve true option to ensure that the source SVM configurations and source SVM interfaces are copied to the destination. Then initialize the SnapMirror SVM replication relationship from the destination SVM.

Figure: Steps involved in setting up SVM replication.
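
A minimal sketch of these setup steps, assuming placeholder cluster names cluster1 and cluster2, SVM names svm_src and svm_dst, and a 15-minute replication schedule; adjust names, peer addresses, and intervals for your environment:

    # Peer the clusters, then create the DR destination SVM and peer the SVMs
    cluster2::> cluster peer create -peer-addrs <cluster1_intercluster_LIFs>
    cluster2::> vserver create -vserver svm_dst -subtype dp-destination
    cluster2::> vserver peer create -vserver svm_dst -peer-vserver svm_src -applications snapmirror -peer-cluster cluster1

    # Create a job schedule for the required replication interval
    cluster2::> job schedule interval create -name 15min -minutes 15

    # Create and initialize the SVM-DR relationship from the destination
    cluster2::> snapmirror create -source-path svm_src: -destination-path svm_dst: -identity-preserve true -schedule 15min
    cluster2::> snapmirror initialize -destination-path svm_dst: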

Disaster recovery workflow for Trident

Astra Trident 19.07 and later uses Kubernetes CRDs to store and manage its own state. It uses the Kubernetes cluster's etcd to store its metadata. Here we assume that the Kubernetes etcd data files and the certificates are stored on a NetApp FlexVolume. This FlexVolume resides in an SVM, which has a SnapMirror SVM-DR relationship with a destination SVM at the secondary site.

The following steps describe how to recover a single master Kubernetes cluster with Astra Trident in the event of a disaster:

  1. If the source SVM fails, activate the SnapMirror destination SVM. To do this, you should stop scheduled SnapMirror transfers, abort ongoing SnapMirror transfers, break the replication relationship, stop the source SVM, and start the destination SVM.

  2. From the destination SVM, mount the volume that contains the Kubernetes etcd data files and certificates onto the host that will be set up as a master node.

  3. Copy all the required certificates pertaining to the Kubernetes cluster under /etc/kubernetes/pki and the etcd member files under /var/lib/etcd.

  4. Create a Kubernetes cluster by using the kubeadm init command with the --ignore-preflight-errors=DirAvailable--var-lib-etcd flag. The hostnames used for the Kubernetes nodes should be the same as the source Kubernetes cluster.

  5. Run the kubectl get crd command to verify that all the Trident custom resources have come up, and retrieve the Trident objects to verify that all the data is available.

  6. Update all the required backends to reflect the new destination SVM name by running the ./tridentctl update backend <backend-name> -f <backend-json-file> -n <namespace> command.

Note For application persistent volumes, when the destination SVM is activated, all the volumes provisioned by Trident start serving data. After the Kubernetes cluster is set up on the destination side by using the steps outlined above, all the deployments and pods are started and the containerized applications should run without any issues.
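
A minimal sketch of steps 2 and 3, assuming that the FlexVolume is exported over NFS; the data LIF address, export path, and mount point are placeholders:

    # Step 2: mount the volume that holds the etcd data files and certificates
    mkdir -p /mnt/k8s_etcd
    mount -t nfs <destination_SVM_data_LIF>:/k8s_etcd_vol /mnt/k8s_etcd

    # Step 3: copy the certificates and the etcd member files into place
    mkdir -p /etc/kubernetes/pki /var/lib/etcd
    cp -r /mnt/k8s_etcd/pki/* /etc/kubernetes/pki/
    cp -r /mnt/k8s_etcd/member /var/lib/etcd/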

SnapMirror volume replication

ONTAP SnapMirror volume replication is a disaster recovery feature that enables failover from primary storage to destination storage at the volume level. SnapMirror creates a volume replica or mirror of the primary storage on the secondary storage by syncing snapshots.

Here is an overview of the ONTAP SnapMirror volume replication setup steps:

  • Set up peering between the clusters in which the volumes reside and the SVMs that serve data from the volumes.

  • Create a SnapMirror policy, which controls the behavior of the relationship and specifies the configuration attributes for that relationship.

  • Create a SnapMirror relationship between the destination volume and the source volume by using the snapmirror create command and assign the appropriate SnapMirror policy.

  • After the SnapMirror relationship is created, initialize the relationship so that a baseline transfer from the source volume to the destination volume is completed.

Figure: SnapMirror volume replication setup.
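
A minimal sketch of these setup steps, assuming cluster and SVM peering are already in place and using placeholder names (svm_src/vol1 as the source and svm_dst/vol1_dst as the destination); a data protection (DP) destination volume is also created here:

    # Create a SnapMirror policy and a DP destination volume
    cluster2::> snapmirror policy create -vserver svm_dst -policy mirror_policy -type async-mirror
    cluster2::> volume create -vserver svm_dst -volume vol1_dst -aggregate aggr1 -type DP -size 10g

    # Create and initialize the relationship from the destination
    cluster2::> snapmirror create -source-path svm_src:vol1 -destination-path svm_dst:vol1_dst -policy mirror_policy -schedule hourly
    cluster2::> snapmirror initialize -destination-path svm_dst:vol1_dst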

SnapMirror volume disaster recovery workflow for Trident

The following steps describe how to recover a single master Kubernetes cluster with Astra Trident.

  1. In the event of a disaster, stop all the scheduled SnapMirror transfers and abort all ongoing SnapMirror transfers. Break the replication relationship between the destination and source volumes so that the destination volume becomes read/write.

  2. From the destination SVM, mount the volume that contains the Kubernetes etcd data files and certificates on to the host, which will be set up as a master node.

  3. Copy all the required certificates pertaining to the Kubernetes cluster under /etc/kubernetes/pki and the etcd member files under /var/lib/etcd.

  4. Create a Kubernetes cluster by running the kubeadm init command with the --ignore-preflight-errors=DirAvailable--var-lib-etcd flag. The hostnames should be the same as the source Kubernetes cluster.

  5. Run the kubectl get crd command to verify that all the Trident custom resources have come up, and retrieve the Trident objects to make sure that all the data is available.

  6. Clean up the previous backends and create new backends on Trident. Specify the new management and data LIF, new SVM name, and password of the destination SVM.
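
A sketch of step 6, following the tridentctl syntax used elsewhere in this section; the backend names, backend definition file, and namespace are placeholders, and the definition file should contain the destination SVM's management LIF, data LIF, SVM name, and credentials:

    # Remove the backend that points to the failed source SVM
    ./tridentctl delete backend <old-backend-name> -n <namespace>

    # Create a new backend that points to the destination SVM
    ./tridentctl create backend -f <new-backend-json-file> -n <namespace>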

Disaster recovery workflow for application persistent volumes

The following steps describe how SnapMirror destination volumes can be made available for containerized workloads in the event of a disaster:

  1. Stop all the scheduled SnapMirror transfers and abort all ongoing SnapMirror transfers. Break the replication relationship between the destination and source volumes so that the destination volume becomes read/write. Clean up the deployments that were consuming PVCs bound to volumes on the source SVM.

  2. After the Kubernetes cluster is set up on the destination side by using the steps outlined above, clean up the deployments, PVCs, and PVs from the Kubernetes cluster.

  3. Create new backends on Trident by specifying the new management and data LIF, new SVM name and password of the destination SVM.

  4. Import the required volumes as PVs bound to new PVCs by using the Trident import feature (see the sketch after this list).

  5. Redeploy the application deployments with the newly created PVCs.
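
A sketch of the volume import in step 4; the backend name, ONTAP volume name, PVC definition file, and namespace are placeholders:

    # Import an existing volume and bind it to a new PVC
    ./tridentctl import volume <backend-name> <volume-name> -f <path-to-pvc-file> -n <namespace>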

Recover data by using Element snapshots

Back up data on an Element volume by setting a snapshot schedule for the volume and ensuring that the snapshots are taken at the required intervals. You should set the snapshot schedule by using the Element UI or APIs. Currently, it is not possible to set a snapshot schedule for a volume through the solidfire-san driver.

In the event of data corruption, you can choose a particular snapshot and roll back the volume to the snapshot manually by using the Element UI or APIs. This reverts any changes made to the volume since the snapshot was created.
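
For illustration, the following is a hedged sketch of rolling a volume back to a snapshot through the Element JSON-RPC API; the management virtual IP (MVIP), API version, credentials, volume ID, and snapshot ID are placeholders and should be verified against the Element API reference for your release:

    # Roll back volume 1 to snapshot 5 (hypothetical IDs)
    curl -k -u admin:<password> -X POST https://<mvip>/json-rpc/10.0 \
      -H 'Content-Type: application/json' \
      -d '{"method": "RollbackToSnapshot", "params": {"volumeID": 1, "snapshotID": 5, "saveCurrentState": false}, "id": 1}'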