Recover Storage Node down more than 15 days

10/29/2021

If a single Storage Node has been offline and not connected to other Storage Nodes for more than 15 days, you must rebuild Cassandra on the node.

What you'll need

You have checked that a Storage Node decommissioning is not in progress, or you have paused the node decommission procedure. (In the Grid Manager, select MAINTENANCE > Tasks > Decommission.)
You have checked that an expansion is not in progress. (In the Grid Manager, select MAINTENANCE > Tasks > Expansion.)

About this task

Storage Nodes have a Cassandra database that includes object metadata. If a Storage Node has not been able to communicate with other Storage Nodes for more than 15 days, StorageGRID assumes that node's Cassandra database is stale. The Storage Node cannot rejoin the grid until Cassandra has been rebuilt using information from other Storage Nodes.
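The 15-day threshold can be illustrated with simple arithmetic. The following is only a sketch: StorageGRID performs its own staleness check internally, and the function name and epoch-second inputs are assumptions made for illustration, not part of the product.

```shell
#!/bin/sh
# Illustrative only: not the check StorageGRID itself runs.
GRACE_DAYS=15   # grace period stated in this procedure

# offline_exceeds_grace LAST_SEEN_EPOCH NOW_EPOCH
# Returns 0 (true) when the node has been out of contact for
# strictly more than 15 days, measured in seconds since the epoch.
offline_exceeds_grace() {
    last_seen=$1
    now=$2
    grace_seconds=$((GRACE_DAYS * 24 * 60 * 60))
    [ $((now - last_seen)) -gt "$grace_seconds" ]
}
```

For example, `offline_exceeds_grace "$last_contact" "$(date +%s)"` would succeed only if the hypothetical `last_contact` timestamp is more than 15 days in the past.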

Use this procedure to rebuild Cassandra only if a single Storage Node is down. Contact technical support if additional Storage Nodes are offline or if Cassandra has been rebuilt on another Storage Node within the last 15 days; for example, Cassandra might have been rebuilt as part of the procedures to recover failed storage volumes or to recover a failed Storage Node.

If more than one Storage Node has failed (or is offline), contact technical support. Do not perform the following recovery procedure. Data loss could occur.

If this is the second Storage Node failure in less than 15 days after a Storage Node failure or recovery, contact technical support. Do not perform the following recovery procedure. Data loss could occur.

If more than one Storage Node at a site has failed, a site recovery procedure might be required. Contact technical support.

See "How site recovery is performed by technical support."

Steps

If necessary, power on the Storage Node that needs to be recovered.
Log in to the grid node:
1. Enter the following command: ssh admin@grid_node_IP
2. Enter the password listed in the Passwords.txt file.
3. Enter the following command to switch to root: su -
4. Enter the password listed in the Passwords.txt file.
When you are logged in as root, the prompt changes from $ to #.

If you are unable to log in to the grid node, the system disk might not be intact. Go to the procedure for recovering from system drive failure.
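If you prefer to confirm the switch to root without relying on the prompt character, one option is to check the numeric user ID. This is a minimal sketch assuming a POSIX shell; the helper name `is_root` is hypothetical and not part of StorageGRID.

```shell
#!/bin/sh
# is_root [UID]
# Returns 0 when the given numeric user ID is 0 (root).
# With no argument, it checks the current user via `id -u`.
is_root() {
    uid=${1:-$(id -u)}
    [ "$uid" -eq 0 ]
}
```

Running `is_root && echo "logged in as root"` after `su -` should print the message, which corresponds to the prompt changing to #.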

Perform the following checks on the Storage Node:
1. Issue this command: nodetool status
  
  The output should be: Connection refused
2. In the Grid Manager, select SUPPORT > Tools > Grid topology.
3. Select Site > Storage Node > SSM > Services. Verify that the Cassandra service displays Not Running.
4. Select Storage Node > SSM > Resources. Verify that there is no error status in the Volumes section.
5. Issue this command: grep -i Cassandra /var/local/log/servermanager.log
  
  You should see the following message in the output:
  Cassandra not started because it has been offline for more than 15 day grace period - rebuild Cassandra
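The two command-line checks above (steps 1 and 5) could be combined in a small helper. This is a sketch under stated assumptions: the function name is hypothetical, and the nodetool output and log path are passed in as arguments so the logic does not depend on a live node. It does not replace the Grid Manager checks.

```shell
#!/bin/sh
# Tail of the log message quoted in this procedure.
EXPECTED_MSG='rebuild Cassandra'

# cassandra_rebuild_indicated NODETOOL_OUTPUT LOG_FILE
# Returns 0 only if the nodetool output contains "Connection refused"
# AND a Cassandra line in the log file contains the rebuild message.
cassandra_rebuild_indicated() {
    nodetool_output=$1
    log_file=$2
    printf '%s\n' "$nodetool_output" | grep -q 'Connection refused' || return 1
    grep -i 'Cassandra' "$log_file" | grep -q "$EXPECTED_MSG"
}
```

On the Storage Node, it might be called as `cassandra_rebuild_indicated "$(nodetool status 2>&1)" /var/local/log/servermanager.log`, using the command and log path given in the steps.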

Issue this command, and monitor the script output: check-cassandra-rebuild

If storage services are running, you will be prompted to stop them. Enter: y

Review the warnings in the script. If none of them apply, confirm that you want to rebuild Cassandra. Enter: y

Some StorageGRID recovery procedures use Reaper to handle Cassandra repairs. Repairs occur automatically as soon as the related or required services have started. You might notice script output that mentions “reaper” or “Cassandra repair.” If you see an error message indicating the repair has failed, run the command indicated in the error message.
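If you saved the script output to a file (for example, through your terminal's session logging), a quick filter can surface the repair-related lines mentioned above. This is only a sketch: the function name and the idea of a saved output file are assumptions, and the exact wording of the script's messages is not specified here.

```shell
#!/bin/sh
# repair_messages OUTPUT_FILE
# Prints every line that mentions "reaper" or "Cassandra repair"
# (case-insensitive). Returns non-zero if no such line is found.
repair_messages() {
    grep -iE 'reaper|cassandra repair' "$1"
}
```

Any error line it prints should still be read in full, because the command to run is the one indicated in the error message itself.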

After the rebuild completes, perform the following checks:
1. In the Grid Manager, select SUPPORT > Tools > Grid topology.
2. Select Site > recovered Storage Node > SSM > Services.
3. Confirm that all services are running.
4. Select DDS > Data Store.
5. Confirm that the Data Store Status is “Up” and the Data Store State is “Normal.”
