Remounting and reformatting appliance storage volumes (Manual Steps)

You must run two scripts to remount preserved storage volumes and reformat failed storage volumes. The first script remounts volumes that are properly formatted as StorageGRID Webscale storage volumes. The second script reformats any unmounted volumes and rebuilds the Cassandra database on the Storage Node if the system determines that a rebuild is necessary.

Before you begin

Attention: This procedure might rebuild the Cassandra database if the system determines that a rebuild is necessary. If more than one Storage Node is offline, or if a Storage Node in this grid has been rebuilt in the last 15 days, contact technical support and do not run the sn-recovery-postinstall.sh script. Rebuilding Cassandra on two or more Storage Nodes within 15 days of each other can result in data loss.

About this task

You must manually run scripts that check the attached storage: the scripts look for storage volumes attached to the server, attempt to mount them, and then verify that the volumes are structured like StorageGRID Webscale object stores. Any storage volume that cannot be mounted or that does not pass this check is assumed to have failed and is reformatted. All data on reformatted storage volumes is lost.

The following two recovery scripts are used to identify and recover failed storage volumes:

• sn-remount-volumes checks for storage volumes that are still properly formatted as StorageGRID Webscale storage volumes and remounts them.

• sn-recovery-postinstall.sh reformats all unmounted (failed) storage volumes, rebuilds the Cassandra database if needed, and starts all StorageGRID Webscale services.

When the sn-recovery-postinstall.sh script completes, the StorageGRID Webscale software installation on the appliance is finished, and the appliance reboots automatically. You must wait for the reboot to complete before you can restore object data to the recovered storage volumes.
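The overall flow, condensed from the steps that follow, looks like this. This is an illustrative summary only; run both commands as root on the recovered Storage Node and follow the full steps below.

# 1. Remount any storage volumes that are still properly formatted.
sn-remount-volumes
# 2. Reformat unmounted (failed) volumes, rebuild Cassandra if needed,
#    and start all StorageGRID Webscale services.
sn-recovery-postinstall.sh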

Steps

  1. From the service laptop, log in to the recovered Storage Node:
    1. Enter the following command: ssh admin@grid_node_IP
    2. Enter the password listed in the Passwords.txt file.
    3. Enter the following command to switch to root: su -
    4. Enter the password listed in the Passwords.txt file.
    When you are logged in as root, the prompt changes from $ to #.
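    For example, a login session might look like the following. The hostnames and prompts are placeholders for illustration only; the commands are the ones listed above.
    service-laptop$ ssh admin@grid_node_IP
    admin@grid_node_IP's password:   (enter the password from Passwords.txt)
    admin@SG:~ $ su -
    Password:                        (enter the password from Passwords.txt)
    root@SG:~ #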
  2. If you think that some or all of the storage volumes may contain good data:
    1. Run the script to check and remount storage volumes: sn-remount-volumes
      The script checks for preserved storage volumes and remounts them. In this example, /dev/sdd was found to be bad, and was not remounted.
      root@SG:~ # sn-remount-volumes
      The configured LDR noid is 12632740
      Attempting to remount /dev/sdg
      Found LDR node id 12632740, volume number 4 in the volID file
      Device /dev/sdg remounted successfully
      Attempting to remount /dev/sdf
      Found LDR node id 12632740, volume number 3 in the volID file
      Device /dev/sdf remounted successfully
      Attempting to remount /dev/sde
      Found LDR node id 12632740, volume number 2 in the volID file
      Device /dev/sde remounted successfully
      Failed to mount device /dev/sdd
      Attempting to remount /dev/sdc
      Found LDR node id 12632740, volume number 0 in the volID file
      Device /dev/sdc remounted successfully
      
      Data on devices that the script finds and mounts is preserved; all other data on this node is lost and must be recovered from other nodes.
    2. Review the list of devices that the script mounted or failed to mount:
      • If you believe a volume is valid but the script did not mount it, stop and investigate before you continue. After you correct all issues that prevented the script from mounting the device, run the script again.

      • You must repair or replace any storage devices that could not be mounted.
      Note: You can monitor the contents of the script's log file (/var/local/log/sn-remount-volumes.log) by running tail -f /var/local/log/sn-remount-volumes.log in a separate session. The log file contains more detailed information than the command-line output.
    3. Record the device name of each storage volume that failed to mount, and identify its volume ID (/var/local/rangedb/number).
      You will need the volume IDs of failed volumes in a later procedure, Restoring object data to a storage volume, as input to the script used to recover data.
      In the following example, /dev/sdd was not mounted successfully by the script.
      Attempting to remount /dev/sde
      Found LDR node id 12632740, volume number 2 in the volID file
      Device /dev/sde remounted successfully
      Failed to mount device /dev/sdd
      Attempting to remount /dev/sdc
      Found LDR node id 12632740, volume number 0 in the volID file
      Device /dev/sdc remounted successfully
      
      In this case, /dev/sdd corresponds to volume number 1.
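      To cross-check the device-to-volume mapping before proceeding, you can list the mounted rangedbs. The listing below is a sketch: the output format and the ext3 filesystem type are illustrative assumptions, but the device-to-volume pairing matches the volID output above.
      root@SG:~ # mount | grep rangedb
      /dev/sdc on /var/local/rangedb/0 type ext3 (rw)
      /dev/sde on /var/local/rangedb/2 type ext3 (rw)
      /dev/sdf on /var/local/rangedb/3 type ext3 (rw)
      /dev/sdg on /var/local/rangedb/4 type ext3 (rw)
      In this sketch, /dev/sdd (volume 1) does not appear because it could not be mounted.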
  3. Run sn-recovery-postinstall.sh to reformat all unmounted (failed) storage volumes, rebuild Cassandra if needed, and start all StorageGRID Webscale services.
    The script outputs the post-install status.
    root@SG:~ # sn-recovery-postinstall.sh
    Starting Storage Node recovery post installation.
    Reformatting all unmounted disks as rangedbs
    Formatting devices that are not in use...
    Skipping in use device /dev/sdc
    Successfully formatted /dev/sdd with UUID d6533e5f-7dfe-4a45-af7c-08ae6864960a
    Skipping in use device /dev/sde
    Skipping in use device /dev/sdf
    Skipping in use device /dev/sdg
    All devices processed
    Creating Object Stores for LDR
    Generating Grid Interface Configuration file
    LDR initialization complete
    Setting up Cassandra directory structure
    Cassandra needs rebuilding.
    Rebuild the Cassandra database for this Storage Node.
    
    ATTENTION: Do not execute this script when two or more Storage Nodes have failed or been offline at the same time. Doing so may result in data loss. Contact technical support.
    
    ATTENTION: Do not rebuild more than a single node within a 15 day period. Rebuilding 2 or more nodes within 15 days of each other may result in data loss.
    
    Cassandra is down.
    
    Rebuild may take 12-24 hours. Do not stop or pause the rebuild.
    If the rebuild was stopped or paused, re-run this command.
    
    Removing Cassandra commit logs
    Removing Cassandra SSTables
    Updating timestamps of the Cassandra data directories.
    Starting ntp service.
    Starting cassandra service.
    Running nodetool rebuild.
    Done. Cassandra database successfully rebuilt.
    Rebuild was successful.
    Not starting services due to --do-not-start-services argument.
    Updating progress on primary Admin Node
    Starting services
    
    #######################################
            STARTING SERVICES
    #######################################
    
    Starting Syslog daemon
    Stopping system logging: syslog-ng.
    Starting system logging: syslog-ng.
    Starting SSH
    Starting OpenBSD Secure Shell server: sshd.
    No hotfix to install
    starting persistence ... done
    remove all error states
    starting all services
    services still stopped: acct adc ade-exporter cms crr dds idnt kstn ldr net-monitor nginx node-exporter ssm
    starting ade-exporter
    Starting service ade-exporter in background
    starting cms
    Starting service cms in background
    starting crr
    Starting service crr in background
    starting net-monitor
    Starting service net-monitor in background
    starting nginx
    Starting service nginx in background
    starting node-exporter
    Starting service node-exporter in background
    starting ssm
    Starting service ssm in background
    services still stopped: acct adc dds idnt kstn ldr
    starting adc
    Starting service adc in background
    starting dds
    Starting service dds in background
    starting ldr
    Starting service ldr in background
    services still stopped: acct idnt kstn
    starting acct
    Starting service acct in background
    starting idnt
    Starting service idnt in background
    starting kstn
    Starting service kstn in background
    all services started
    Starting service servermanager in background
    Restarting SNMP services::  snmpd
    
    #######################################
               SERVICES STARTED
    #######################################
    
    Loaded node_id from server-config
    node_id=5611d3c9-45e9-47e4-99ac-cd16ef8a20b9
    Storage Node recovery post installation complete.
    Object data must be restored to the storage volumes.
    Triggering bare-metal reboot of SGA to complete installation.
    SGA install phase 2: awaiting chainboot.
    
    While the sn-recovery-postinstall.sh script is running, the progress stage updates on the Grid Management Interface.

    (Screenshots: the recovery progress stage shown in the Grid Management Interface.)
  4. Return to the Monitor Install page of the StorageGRID Webscale Appliance Installer by entering http://Controller_IP:8080, using the IP address of the E5600SG controller or the E5700SG controller.
    The Monitor Install page shows the installation progress while the sn-recovery-postinstall.sh script is running.
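    If the page does not load in the browser, you can quickly confirm that the installer is reachable from the service laptop. This sketch assumes curl is available on the laptop; the expected 200 response code is also an assumption.
    # Hypothetical reachability check for the appliance installer on port 8080.
    curl -s -o /dev/null -w "%{http_code}\n" http://Controller_IP:8080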
  5. Wait until the appliance automatically reboots.
  6. After the node reboots, check the software version on the recovered node.
    1. Select Grid.
    2. Select site > grid node > SSM > Services.
    3. Note the installed version in the Packages field.
    4. Confirm that the version matches the version on the primary Admin Node, including any previously applied hotfixes.
    5. If the software versions do not match, you must manually reapply the hotfix to the recovered node.
  7. Restore object data to any storage volumes that were rebuilt. See "Restoring object data" for instructions.