Recovering from Storage Node failures

11/22/2022

The procedure for recovering a failed Storage Node depends on the type of failure and the type of Storage Node that has failed.

Use this table to select the recovery procedure for a failed Storage Node.

Issue	Action	Notes
More than one Storage Node has failed, or a second Storage Node has failed less than 15 days after a Storage Node failure or recovery. This includes the case where a Storage Node fails while recovery of another Storage Node is still in progress.	You must contact technical support.	Recovering more than one Storage Node (or more than one Storage Node within 15 days) might affect the integrity of the Cassandra database, which can cause data loss. Technical support will assess your situation, develop a recovery plan, and determine when it is safe to begin recovery of a second Storage Node. If all failed Storage Nodes are at the same site, it might be necessary to perform a site recovery procedure; see "How site recovery is performed by technical support." Note: If more than one Storage Node that contains the ADC service fails at a site, you lose any pending platform service requests for that site.
A Storage Node has been offline for more than 15 days.	Recovering a Storage Node that has been down more than 15 days	This procedure is required to ensure Cassandra database integrity.
An appliance Storage Node has failed.	Recovering a StorageGRID appliance Storage Node	The recovery procedure for appliance Storage Nodes is the same for all failures.
One or more storage volumes have failed, but the system drive is intact.	Recovering from storage volume failure where the system drive is intact	This procedure is used for software-based Storage Nodes.
The system drive has failed.	Recovering from system drive failure	The node replacement procedure depends on the deployment platform and on whether any storage volumes have also failed.
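
As a rough illustration, the table can be read as a top-to-bottom decision list. The sketch below encodes it that way in Python. The `NodeFailure` class, its field names, and `select_recovery_action` are hypothetical and are not part of any StorageGRID tool or API. The sketch also assumes that the first matching row takes precedence, which the table does not state explicitly; when in doubt, contact technical support.

```python
from dataclasses import dataclass
from typing import Optional

# Window taken from the table: a second failure within 15 days needs support.
WINDOW_DAYS = 15


@dataclass
class NodeFailure:
    """Hypothetical description of a failure; not a StorageGRID structure."""
    failed_node_count: int = 1
    # Days since any earlier Storage Node failure or recovery (None = none).
    days_since_previous_event: Optional[float] = None
    days_offline: float = 0
    is_appliance: bool = False
    system_drive_failed: bool = False
    storage_volumes_failed: bool = False


def select_recovery_action(f: NodeFailure) -> str:
    """Return the Action column of the first table row that matches."""
    prev = f.days_since_previous_event
    if f.failed_node_count > 1 or (prev is not None and prev < WINDOW_DAYS):
        return "You must contact technical support."
    if f.days_offline > WINDOW_DAYS:
        return "Recovering a Storage Node that has been down more than 15 days"
    if f.is_appliance:
        return "Recovering a StorageGRID appliance Storage Node"
    if f.storage_volumes_failed and not f.system_drive_failed:
        return ("Recovering from storage volume failure "
                "where the system drive is intact")
    if f.system_drive_failed:
        return "Recovering from system drive failure"
    # Assumption: anything the table does not cover is flagged, not guessed.
    raise ValueError("No matching row in the table")
```

For example, `select_recovery_action(NodeFailure(days_offline=20))` selects the procedure for a node that has been down more than 15 days, while a description with `failed_node_count=2` always routes to technical support.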

Some StorageGRID recovery procedures use Reaper to handle Cassandra repairs. Repairs occur automatically as soon as the related or required services have started. You might notice script output that mentions “reaper” or “Cassandra repair.” If you see an error message indicating the repair has failed, run the command indicated in the error message.
