Recover from Storage Node failures: Overview

03/20/2023 Contributors

The procedure for recovering a failed Storage Node depends on the type of failure and the type of Storage Node that has failed.

Use this table to select the recovery procedure for a failed Storage Node.

Issue	Action	Notes
More than one Storage Node has failed. A second Storage Node has failed less than 15 days after a Storage Node failure or recovery. This includes the case where a Storage Node fails while recovery of another Storage Node is still in progress.	Contact technical support.	Recovering more than one Storage Node (or more than one Storage Node within 15 days) might affect the integrity of the Cassandra database, which can cause data loss. Technical support can determine when it is safe to begin recovery of a second Storage Node. Note: If more than one Storage Node that contains the ADC service fails at a site, you lose any pending platform service requests for that site.
More than one Storage Node at a site has failed or an entire site has failed.	Contact technical support. It might be necessary to perform a site recovery procedure.	Technical support will assess your situation and develop a recovery plan. See How site recovery is performed by technical support.
A Storage Node has been offline for more than 15 days.	Recover Storage Node down more than 15 days	This procedure is required to ensure Cassandra database integrity.
An appliance Storage Node has failed.	Recover appliance Storage Node	The recovery procedure for appliance Storage Nodes is the same for all failures.
One or more storage volumes have failed, but the system drive is intact	Recover from storage volume failure where system drive is intact	This procedure is used for software-based Storage Nodes.
The system drive has failed.	Recover from system drive failure	The node replacement procedure depends on the deployment platform and on whether any storage volumes have also failed.

Issue

Action

Notes

More than one Storage Node has failed.
A second Storage Node has failed less than 15 days after a Storage Node failure or recovery.

This includes the case where a Storage Node fails while recovery of another Storage Node is still in progress.

Contact technical support.

Recovering more than one Storage Node (or more than one Storage Node within 15 days) might affect the integrity of the Cassandra database, which can cause data loss.

Technical support can determine when it is safe to begin recovery of a second Storage Node.

Note: If more than one Storage Node that contains the ADC service fails at a site, you lose any pending platform service requests for that site.

More than one Storage Node at a site has failed or an entire site has failed.

Contact technical support. It might be necessary to perform a site recovery procedure.

Technical support will assess your situation and develop a recovery plan. See How site recovery is performed by technical support.

A Storage Node has been offline for more than 15 days.

Recover Storage Node down more than 15 days

This procedure is required to ensure Cassandra database integrity.

An appliance Storage Node has failed.

Recover appliance Storage Node

The recovery procedure for appliance Storage Nodes is the same for all failures.

One or more storage volumes have failed, but the system drive is intact

Recover from storage volume failure where system drive is intact

This procedure is used for software-based Storage Nodes.

The system drive has failed.

Recover from system drive failure

The node replacement procedure depends on the deployment platform and on whether any storage volumes have also failed.

Some StorageGRID recovery procedures use Reaper to handle Cassandra repairs. Repairs occur automatically as soon as the related or required services have started. You might notice script output that mentions “reaper” or “Cassandra repair.” If you see an error message indicating the repair has failed, run the command indicated in the error message.

Recover from Storage Node failures: Overview

Creating your file...