Recovering from Storage Node failures

Contributors netapp-lhalbert

The procedure for recovering a failed Storage Node depends on the type of failure and the type of Storage Node that has failed.

Use this table to select the recovery procedure for a failed Storage Node.

Issue:
  • More than one Storage Node has failed.
  • A second Storage Node has failed less than 15 days after a Storage Node failure or recovery. This includes the case where a Storage Node fails while recovery of another Storage Node is still in progress.
Action: You must contact technical support.
Notes:
  If all failed Storage Nodes are at the same site, it might be necessary to perform a site recovery procedure.
  Technical support will assess your situation and develop a recovery plan.
  Recovering more than one Storage Node (or more than one Storage Node within 15 days) might affect the integrity of the Cassandra database, which can cause data loss.
  Technical support can determine when it is safe to begin recovery of a second Storage Node.
  Note: If more than one Storage Node that contains the ADC service fails at a site, you lose any pending platform service requests for that site.

Issue: A Storage Node has been offline for more than 15 days.
Notes: This procedure is required to ensure Cassandra database integrity.

Issue: An appliance Storage Node has failed.
Notes: The recovery procedure for appliance Storage Nodes is the same for all failures.

Issue: One or more storage volumes have failed, but the system drive is intact.
Notes: This procedure is used for software-based Storage Nodes.

Issue: The system drive has failed.
Notes: The node replacement procedure depends on the deployment platform and on whether any storage volumes have also failed.
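
The table above can be read as a decision sequence. The following Python sketch is illustrative only and is not part of StorageGRID: the class, field names, and enum values are hypothetical. It assumes that the rows are checked from top to bottom and that the storage volume and system drive rows apply to software-based (non-appliance) Storage Nodes. It returns a category label, not the name of a recovery procedure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Recovery(Enum):
    """Hypothetical labels for the rows of the table."""
    CONTACT_SUPPORT = "contact technical support"
    OFFLINE_OVER_15_DAYS = "node offline more than 15 days"
    APPLIANCE = "appliance Storage Node"
    VOLUMES_ONLY = "storage volumes failed, system drive intact"
    SYSTEM_DRIVE = "system drive failed"


@dataclass
class NodeFailure:
    """Hypothetical description of a failure; not a StorageGRID data structure."""
    failed_nodes: int = 1                              # Storage Nodes currently failed
    days_since_previous_event: Optional[float] = None  # last failure or recovery, if any
    days_offline: float = 0
    is_appliance: bool = False
    system_drive_failed: bool = False
    volumes_failed: bool = False


def select_recovery(f: NodeFailure) -> Optional[Recovery]:
    """Return the matching table row, assuming rows are evaluated top to bottom."""
    recent = (
        f.days_since_previous_event is not None
        and f.days_since_previous_event < 15
    )
    if f.failed_nodes > 1 or recent:
        return Recovery.CONTACT_SUPPORT
    if f.days_offline > 15:
        return Recovery.OFFLINE_OVER_15_DAYS
    if f.is_appliance:
        return Recovery.APPLIANCE
    if f.system_drive_failed:
        # Covers the case where storage volumes have also failed.
        return Recovery.SYSTEM_DRIVE
    if f.volumes_failed:
        return Recovery.VOLUMES_ONLY
    return None  # No row of the table matches.


if __name__ == "__main__":
    print(select_recovery(NodeFailure(volumes_failed=True)))
    print(select_recovery(NodeFailure(days_since_previous_event=10)))
```

Placing the multi-node check first reflects the table's warning that recovering a second Storage Node too soon might affect Cassandra database integrity, so that case is routed to technical support before any other row is considered.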

Note: Some StorageGRID recovery procedures use Reaper to handle Cassandra repairs. Repairs occur automatically as soon as the related or required services have started. You might notice script output that mentions “reaper” or “Cassandra repair.” If you see an error message indicating the repair has failed, run the command indicated in the error message.