Recovering from a single Storage Node down more than 15 days

If a single Storage Node has been offline and not connected to other Storage Nodes for more than 15 days, you must rebuild Cassandra on the node.

Before you begin

About this task

Storage Nodes have a Cassandra database that includes object metadata. If a Storage Node has not been able to communicate with other Storage Nodes for more than 15 days, the Storage Node assumes that its Cassandra database is stale. The Storage Node cannot rejoin the grid until Cassandra has been rebuilt using information from other Storage Nodes.

This procedure can only be used to rebuild Cassandra on a single Storage Node that has been down for more than 15 days. If more than one Storage Node is offline, or if Cassandra has been rebuilt on another Storage Node within the last 15 days, contact technical support.
Note: Cassandra may also have been rebuilt on a Storage Node as part of the procedure to recover failed storage volumes, or as part of the procedure to recover a failed Storage Node.
CAUTION:
Do not perform this procedure if more than one Storage Node is offline, or if more than one Storage Node has a failure of any kind. Doing so might result in data loss. Contact technical support.
CAUTION:
Do not rebuild Cassandra on more than one Storage Node within a 15-day period. Rebuilding Cassandra on two or more Storage Nodes within 15 days of each other might result in data loss. Contact technical support.
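
The conditions above can be summarized as a small decision check. The following is a minimal sketch only, not part of the product: the function name rebuild_procedure_applies and its three arguments are hypothetical, and the values must be determined by the operator. The sketch also assumes that a rebuild on another Storage Node exactly 15 days ago still counts as "within the last 15 days," which is the more cautious reading.

```shell
#!/bin/sh
# Hypothetical helper (not a StorageGRID command): encodes the conditions
# under which this single-node procedure may be used.
GRACE_DAYS=15

# Usage: rebuild_procedure_applies OFFLINE_NODES DAYS_OFFLINE DAYS_SINCE_OTHER_REBUILD
#   OFFLINE_NODES            number of Storage Nodes that are offline or failed
#   DAYS_OFFLINE             whole days this Storage Node has been offline
#   DAYS_SINCE_OTHER_REBUILD whole days since Cassandra was rebuilt on any
#                            other Storage Node, or "never"
# Returns 0 if the procedure applies, 1 if technical support should be contacted
# or the procedure is not needed.
rebuild_procedure_applies() {
    offline_nodes=$1
    days_offline=$2
    days_since_other_rebuild=$3

    # Only a single offline or failed Storage Node is allowed.
    [ "$offline_nodes" -eq 1 ] || return 1

    # The node must have been down for more than the 15-day grace period.
    [ "$days_offline" -gt "$GRACE_DAYS" ] || return 1

    # Assumption: a rebuild exactly 15 days ago is treated as too recent.
    if [ "$days_since_other_rebuild" != "never" ] &&
       [ "$days_since_other_rebuild" -le "$GRACE_DAYS" ]; then
        return 1
    fi

    return 0
}
```

For example, rebuild_procedure_applies 1 20 never succeeds, while rebuild_procedure_applies 2 20 never and rebuild_procedure_applies 1 20 7 both fail, which corresponds to the cases where you must contact technical support.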

Steps

  1. If necessary, power on the Storage Node that needs to be recovered.
  2. From the service laptop, log in to the grid node:
    1. Enter the following command: ssh admin@grid_node_IP
    2. Enter the password listed in the Passwords.txt file.
    3. Enter the following command to switch to root: su -
    4. Enter the password listed in the Passwords.txt file.
    When you are logged in as root, the prompt changes from $ to #.
  3. Perform the following checks on the Storage Node:
    1. Issue this command: nodetool status
      The output should include the following: Connection refused
    2. In the Grid Management Interface, verify that the Cassandra service under SSM > Services displays the following: Not Running
    3. In the Grid Management Interface, verify that the selection SSM > Resources > Volumes has no error status.
    4. Issue this command: grep -i Cassandra /var/local/log/servermanager.log
      You should see the following message in the output: Cassandra not started because it has been offline for more than 15 day grace period - rebuild Cassandra
  4. Issue this command: check-cassandra-rebuild
    • If storage services are running, you are prompted to stop them. Enter: y
    • Review the warnings in the script. If none of them apply, confirm that you want to rebuild Cassandra. Enter: y

    In the following example script output, services were not running.

    Cassandra has been down for more than 15 days.
    Cassandra needs rebuilding.
    Rebuild the Cassandra database for this Storage Node.
     
    ATTENTION: Do not execute this script when two or more Storage Nodes have failed or been offline at the same time. Doing so may result in data loss. Contact technical support.
     
    ATTENTION: Do not rebuild more than a single node within a 15 day period. Rebuilding 2 or more nodes within 15 days of each other may result in data loss.
     
    Enter 'y' to rebuild the Cassandra database for this Storage Node. [y/N]? y
    Cassandra is down.
     
    Rebuild may take 12-24 hours. Do not stop or pause the rebuild.
    If the rebuild was stopped or paused, re-run this command.
     
    Removing Cassandra commit logs
    Removing Cassandra SSTables
    Updating timestamps of the Cassandra data directories.
    Starting ntp service.
    Starting cassandra service.
    Running nodetool rebuild.
    Done. Cassandra database successfully rebuilt.
    Rebuild was successful.
    Starting services.
    
    
  5. After the rebuild completes, perform the following checks:
    1. Check that all services on the recovered Storage Node show as "online" in the Grid Management Interface.
    2. For the Storage Node that was rebuilt, check that DDS > Data Store > Data Store Status shows a status of "Up" and DDS > Data Store > Data Store State shows a status of "Normal."
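
The two command-line checks in step 3 (the nodetool status output and the servermanager.log message) can be scripted. The following is a minimal sketch under stated assumptions: the function names cassandra_is_down and log_requests_rebuild are hypothetical helpers, not product commands, and the sketch matches only the text quoted in this procedure. The Grid Management Interface checks still have to be performed manually.

```shell
#!/bin/sh
# Hypothetical helpers (not StorageGRID commands) for the step 3 checks.
# The log path is the one named in this procedure.
LOG_FILE=/var/local/log/servermanager.log

# Succeeds if the captured "nodetool status" output reports a refused
# connection, which indicates that Cassandra is not running.
# Usage: cassandra_is_down "$(nodetool status 2>&1)"
cassandra_is_down() {
    printf '%s\n' "$1" | grep -qF 'Connection refused'
}

# Succeeds if the log file contains the grace-period message quoted in
# step 3. An optional first argument overrides the log file path.
log_requests_rebuild() {
    grep -qiF 'offline for more than 15 day grace period - rebuild Cassandra' \
        "${1:-$LOG_FILE}"
}
```

A possible use on the Storage Node, as root, is: cassandra_is_down "$(nodetool status 2>&1)" && log_requests_rebuild && echo "Checks passed". If either helper fails, do not continue to check-cassandra-rebuild until the cause is understood.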