Recovering a Storage Node that has been down more than 15 days

If a single Storage Node has been offline and not connected to other Storage Nodes for more than 15 days, you must rebuild Cassandra on the node.

Before you begin

About this task

Storage Nodes have a Cassandra database that includes object metadata. If a Storage Node has not been able to communicate with other Storage Nodes for more than 15 days, StorageGRID assumes that node's Cassandra database is stale. The Storage Node cannot rejoin the grid until Cassandra has been rebuilt using information from other Storage Nodes.

Use this procedure to rebuild Cassandra only if a single Storage Node is down. Contact technical support if additional Storage Nodes are offline or if Cassandra has been rebuilt on another Storage Node within the last 15 days. For example, Cassandra might have been rebuilt as part of the procedures to recover failed storage volumes or to recover a failed Storage Node.

CAUTION:
If more than one Storage Node has failed (or is offline), contact technical support. Do not perform the following recovery procedure. Data loss could occur.
CAUTION:
If this is the second Storage Node failure within 15 days, contact technical support. Do not perform the following recovery procedure. Data loss could occur.
Note: If more than one Storage Node at a site has failed, a site recovery procedure might be required. Contact technical support.

See: How site recovery is performed by technical support
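
The eligibility rules above can be summarized as a small decision helper. This is only an illustrative sketch, not part of StorageGRID: the function name rebuild_decision, its three arguments, and the returned strings are all hypothetical, and treating a rebuild exactly 15 days ago as "within the last 15 days" is an assumption.

```shell
#!/bin/sh
# Hypothetical helper that encodes the rules described above.
# Arguments (all names are assumptions for illustration):
#   $1 = days this Storage Node has been offline
#   $2 = number of Storage Nodes currently offline, including this one
#   $3 = days since Cassandra was last rebuilt on another Storage Node
#        (pass a large value such as 9999 if no such rebuild took place)
rebuild_decision() {
    days_offline=$1
    nodes_offline=$2
    days_since_other_rebuild=$3

    if [ "$nodes_offline" -gt 1 ]; then
        # More than one Storage Node is offline: do not proceed.
        echo "contact-support"
    elif [ "$days_since_other_rebuild" -le 15 ]; then
        # Cassandra was rebuilt elsewhere within the last 15 days.
        echo "contact-support"
    elif [ "$days_offline" -gt 15 ]; then
        # A single node has been down for more than 15 days.
        echo "rebuild-cassandra"
    else
        # The node is still inside the 15-day window.
        echo "procedure-not-applicable"
    fi
}
```

For example, rebuild_decision 20 1 9999 corresponds to the single-node case this procedure covers, while rebuild_decision 20 2 9999 corresponds to the case in the first CAUTION.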

Procedure

  1. If necessary, power on the Storage Node that needs to be recovered.
  2. From the service laptop, log in to the grid node:
    1. Enter the following command: ssh admin@grid_node_IP
    2. Enter the password listed in the Passwords.txt file.
    3. Enter the following command to switch to root: su -
    4. Enter the password listed in the Passwords.txt file.
    When you are logged in as root, the prompt changes from $ to #.
    Note: If you are unable to log in to the grid node, the system disk might not be intact. Go to the procedure for recovering from system drive failure.


  3. Perform the following checks on the Storage Node:
    1. Issue this command: nodetool status
      The output should be: Connection refused
    2. In the Grid Manager, select Support. Then, in the Tools section of the menu, select Grid Topology.
    3. Select site > Storage Node > SSM > Services. Verify that the Cassandra service displays Not Running.
    4. Select Storage Node > SSM > Resources. Verify that there is no error status in the Volumes section.
    5. Issue this command: grep -i Cassandra /var/local/log/servermanager.log
      You should see the following message in the output:
      Cassandra not started because it has been offline for more than 15 day grace period - rebuild Cassandra
  4. Issue this command, and monitor the script output: check-cassandra-rebuild
    • If storage services are running, you will be prompted to stop them. Enter: y
    • Review the warnings in the script. If none of them apply, confirm that you want to rebuild Cassandra. Enter: y
  5. After the rebuild completes, perform the following checks:
    1. In the Grid Manager, select Support. Then, in the Tools section of the menu, select Grid Topology.
    2. Select site > recovered Storage Node > SSM > Services.
    3. Confirm that all services are running.
    4. Select DDS > Data Store.
    5. Confirm that the Data Store Status is Up and the Data Store State is Normal.
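
The two command-line checks in step 3 (the nodetool status output and the servermanager.log message) could be combined in a short script, sketched below. The function names and returned strings are hypothetical. The sketch covers only the shell checks; the Grid Manager checks still need to be done by hand, and check-cassandra-rebuild is deliberately not run automatically.

```shell
#!/bin/sh
# Message that step 3 says should appear in servermanager.log.
REBUILD_MSG='Cassandra not started because it has been offline for more than 15 day grace period - rebuild Cassandra'

# Succeeds when the supplied nodetool output contains "Connection refused".
nodetool_refused() {
    printf '%s\n' "$1" | grep -q 'Connection refused'
}

# Succeeds when the supplied log file contains the rebuild message.
log_requests_rebuild() {
    grep -qF -- "$REBUILD_MSG" "$1"
}

# Hypothetical wrapper: $1 = captured nodetool output, $2 = log file path.
precheck() {
    if nodetool_refused "$1" && log_requests_rebuild "$2"; then
        echo "ready-for-check-cassandra-rebuild"
    else
        echo "stop-and-investigate"
    fi
}
```

On the Storage Node, as root, this might be called as: precheck "$(nodetool status 2>&1)" /var/local/log/servermanager.log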