Recovering a Storage Node that has been down more than 15 days

If a single Storage Node has been offline and not connected to other Storage Nodes for more than 15 days, you must rebuild Cassandra on the node.

Before you begin

About this task

Storage Nodes have a Cassandra database that includes object metadata. If a Storage Node has not been able to communicate with other Storage Nodes for more than 15 days, StorageGRID Webscale assumes that node's Cassandra database is stale. The Storage Node cannot rejoin the grid until Cassandra has been rebuilt using information from other Storage Nodes.
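
The 15-day threshold can be illustrated with a small shell sketch. This is an illustration only: StorageGRID Webscale performs its own check, and the function names `days_offline` and `needs_cassandra_rebuild` are hypothetical helpers that are not part of the product. The sketch assumes you know, as a Unix timestamp in seconds, when the node last communicated with other Storage Nodes.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of StorageGRID Webscale.

# Print the whole number of days between two Unix timestamps (in seconds).
# $1 = time the node was last connected, $2 = current time
days_offline() {
    last_seen=$1
    now=$2
    echo $(( (now - last_seen) / 86400 ))
}

# Succeed (exit status 0) only when the node has been offline for
# strictly more than 15 days, compared in seconds to avoid rounding.
needs_cassandra_rebuild() {
    last_seen=$1
    now=$2
    [ $(( now - last_seen )) -gt $(( 15 * 86400 )) ]
}
```

For example, `needs_cassandra_rebuild "$last_seen" "$(date +%s)"` succeeds only when more than 15 full days have elapsed since the timestamp in `last_seen`.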

Use this procedure to rebuild Cassandra only if a single Storage Node is down. Contact technical support if additional Storage Nodes are offline or if Cassandra has been rebuilt on another Storage Node within the last 15 days. For example, Cassandra might have been rebuilt as part of the procedures to recover failed storage volumes or to recover a failed Storage Node.

CAUTION:
Do not perform this procedure if more than one Storage Node is offline, or if more than one Storage Node has a failure of any kind. Doing so might result in data loss. Contact technical support.

CAUTION:
Do not rebuild Cassandra on more than one Storage Node within a 15-day period. Rebuilding Cassandra on two or more Storage Nodes within 15 days of each other might result in data loss. Contact technical support.

Steps

  1. If necessary, power on the Storage Node that needs to be recovered.
  2. From the service laptop, log in to the grid node:
    1. Enter the following command: ssh admin@grid_node_IP
    2. Enter the password listed in the Passwords.txt file.
    3. Enter the following command to switch to root: su -
    4. Enter the password listed in the Passwords.txt file.
    When you are logged in as root, the prompt changes from $ to #.
  3. Perform the following checks on the Storage Node:
    1. Issue this command: nodetool status
      The output should be: Connection refused
    2. In the Grid Manager, select Support > Grid Topology. Then, select site > Storage Node > SSM > Services. Verify that the Cassandra service displays Not Running.
    3. Select Storage Node > SSM > Resources. Verify that there is no error status in the Volumes section.
    4. Issue this command: grep -i Cassandra /var/local/log/servermanager.log
      You should see the following message in the output: Cassandra not started because it has been offline for more than 15 day grace period - rebuild Cassandra
  4. Issue this command: check-cassandra-rebuild
    • If storage services are running, you will be prompted to stop them. Enter: y
    • Review the warnings in the script. If none of them apply, confirm that you want to rebuild Cassandra. Enter: y

    In the following example script output, services were not running.

    Cassandra has been down for more than 15 days.
    Cassandra needs rebuilding.
    Rebuild the Cassandra database for this Storage Node.
    
    ATTENTION: Do not execute this script when two or more Storage Nodes have failed
    or been offline at the same time. Doing so may result in data loss. Contact technical support.
    
    ATTENTION: Do not rebuild more than a single node within a 15 day period.
    Rebuilding 2 or more nodes within 15 days of each other may result in data loss.
    
    Enter 'y' to rebuild the Cassandra database for this Storage Node. [y/N]? y
    Cassandra is down.
    
    Rebuilding may take 12-24 hours. Do not stop or pause the rebuild.
    If the rebuild was stopped or paused, re-run this command.
    
    Cassandra node needs to be bootstrapped.
    Cleaning Cassandra directories for node.
    Adding replace_address_first_boot flag.
    Starting ntp service.
    Starting nginx service.
    Starting dynip service.
    Starting cassandra service.
    Cassandra mode is NORMAL. No bootstrap resume required.
    Rebuild was successful.
    Starting services.
  5. After the rebuild completes, perform the following checks:
    1. In the Grid Manager, select Support > Grid Topology.
    2. Select site > recovered Storage Node > SSM > Services.
    3. Confirm that all services are running.
    4. Select DDS > Data Store.
    5. Confirm that the Data Store Status is "Up" and the Data Store State is "Normal".
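
The two command-line checks in step 3 can be combined into one shell sketch, run as root on the Storage Node. This is a minimal sketch, not a supported tool: the function names are hypothetical, and the match strings are taken from the expected output quoted in step 3. The Grid Manager checks in step 3 (Cassandra service shown as Not Running, no error status in the Volumes section) still have to be done by hand.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical and are not part of
# StorageGRID Webscale.

# Succeed when the given nodetool output shows that Cassandra is not
# accepting connections. $1 = captured output of "nodetool status"
cassandra_refuses_connections() {
    printf '%s\n' "$1" | grep -q 'Connection refused'
}

# Succeed when the given log file contains the message asking for a
# Cassandra rebuild. $1 = path to servermanager.log
log_requests_rebuild() {
    grep -i 'Cassandra' "$1" | grep -q 'rebuild Cassandra'
}

# Run both command-line checks from step 3 and report the first failure.
# $1 = optional log path (defaults to the path used in step 3)
precheck_cassandra_rebuild() {
    log_file=${1:-/var/local/log/servermanager.log}
    status_output=$(nodetool status 2>&1) || true

    if ! cassandra_refuses_connections "$status_output"; then
        echo "nodetool status did not report Connection refused" >&2
        return 1
    fi
    if ! log_requests_rebuild "$log_file"; then
        echo "No rebuild message found in $log_file" >&2
        return 1
    fi
    echo "Command-line checks passed. Complete the Grid Manager checks before continuing."
}
```

Calling `precheck_cassandra_rebuild` with no arguments reads the log path shown in step 3. A nonzero exit status means at least one check did not match the expected output, in which case do not continue to step 4 on the strength of this sketch.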