Recover Storage Node down more than 15 days

Contributors: netapp-perveilerk, netapp-madkat, ssantho3, netapp-lhalbert

If a single Storage Node has been offline and not connected to other Storage Nodes for more than 15 days, you must rebuild Cassandra on the node.

Before you begin
  • You have checked that a Storage Node decommissioning is not in progress, or you have paused the node decommission procedure. (In the Grid Manager, select MAINTENANCE > Tasks > Decommission.)

  • You have checked that an expansion is not in progress. (In the Grid Manager, select MAINTENANCE > Tasks > Expansion.)

About this task

Storage Nodes have a Cassandra database that includes object metadata. If a Storage Node has not been able to communicate with other Storage Nodes for more than 15 days, StorageGRID assumes that node's Cassandra database is stale. The Storage Node can't rejoin the grid until Cassandra has been rebuilt using information from other Storage Nodes.

Use this procedure to rebuild Cassandra only if a single Storage Node is down. Contact technical support if additional Storage Nodes are offline or if Cassandra has been rebuilt on another Storage Node within the last 15 days; for example, Cassandra might have been rebuilt as part of the procedures to recover failed storage volumes or to recover a failed Storage Node.

Caution If more than one Storage Node has failed (or is offline), contact technical support. Don't perform the following recovery procedure. Data loss could occur.
Caution If a second Storage Node has failed less than 15 days after a Storage Node failure or recovery, contact technical support. Don't perform the following recovery procedure. Data loss could occur.
Note If more than one Storage Node at a site has failed, a site recovery procedure might be required. See How site recovery is performed by technical support.
Steps
  1. If necessary, power on the Storage Node that needs to be recovered.

  2. Log in to the grid node:

    1. Enter the following command: ssh admin@grid_node_IP

    2. Enter the password listed in the Passwords.txt file.

    3. Enter the following command to switch to root: su -

    4. Enter the password listed in the Passwords.txt file.

      When you are logged in as root, the prompt changes from $ to #.

      Note If you are unable to log in to the grid node, the system disk might not be intact. Go to the procedure for recovering from system drive failure.
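The prompt change is the usual cue that you are root, but you can also verify it explicitly. A minimal sketch, using plain POSIX shell (nothing here is StorageGRID-specific):

```shell
# After `su -`, confirm the switch to root before continuing.
# `id -u` prints the effective user ID; root is always 0.
if [ "$(id -u)" -eq 0 ]; then
  echo "logged in as root"
else
  echo "still not root - rerun su -"
fi
```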
  3. Perform the following checks on the Storage Node:

    1. Issue this command: nodetool status

      The output should be Connection refused.

    2. In the Grid Manager, select SUPPORT > Tools > Grid topology.

    3. Select Site > Storage Node > SSM > Services. Verify that the Cassandra service displays Not Running.

    4. Select Storage Node > SSM > Resources. Verify that there is no error status in the Volumes section.

    5. Issue this command: grep -i Cassandra /var/local/log/servermanager.log

      You should see the following message in the output:

      Cassandra not started because it has been offline for more than 15 day grace period - rebuild Cassandra
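The log check in substep 5 can be wrapped in a small helper if you want to script it. A hedged sketch: the helper name is an invention for illustration, but the log path and the grace-period message are the ones from the step above.

```shell
# Return success if servermanager.log contains the 15-day grace-period
# message, meaning Cassandra must be rebuilt before the node can rejoin.
needs_cassandra_rebuild() {
  grep -qi "offline for more than 15 day grace period" "$1"
}

# On the node you would run:
#   needs_cassandra_rebuild /var/local/log/servermanager.log && echo "rebuild required"
```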
  4. Issue this command, and monitor the script output: check-cassandra-rebuild

    • If the Cassandra service depending on volume 0 is running, you will be prompted to stop it. Enter: y

      Note If the Cassandra service is already stopped, you aren't prompted. The Cassandra service is stopped only for volume 0.
    • Review the warnings in the script. If none of them apply, confirm that you want to rebuild Cassandra. Enter: y

      Note Some StorageGRID recovery procedures use Reaper to handle Cassandra repairs. Repairs occur automatically as soon as the related or required services have started. You might notice script output that mentions “reaper” or “Cassandra repair.” If you see an error message indicating the repair has failed, run the command indicated in the error message.
  5. After the rebuild completes, perform the following checks:

    1. In the Grid Manager, select SUPPORT > Tools > Grid topology.

    2. Select Site > recovered Storage Node > SSM > Services.

    3. Confirm that all services are running.

    4. Select DDS > Data Store.

    5. Confirm that the Data Store Status is “Up” and the Data Store State is “Normal.”
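As a command-line cross-check of these Grid Manager checks, `nodetool status` on the recovered node should now succeed (instead of returning Connection refused) and list the node with state UN (Up/Normal). A hedged sketch that parses saved `nodetool status` output; the helper name and the captured-file argument are illustrative, not part of the product.

```shell
# Succeed only if every node status line in the captured nodetool output
# is "UN" (Up/Normal). Status lines begin with a two-letter state code
# such as UN, DN (Down/Normal), or UJ (Up/Joining).
all_nodes_up_normal() {
  ! grep -E '^[A-Z][A-Z] ' "$1" | grep -vq '^UN '
}

# On the node you would run:
#   nodetool status > /tmp/status.txt && all_nodes_up_normal /tmp/status.txt
```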