Remounting and reformatting storage volumes (Manual Steps)

You must run scripts to remount preserved storage volumes and reformat failed storage volumes. The first script remounts volumes that are properly formatted as StorageGRID Webscale storage volumes. The second script reformats any unmounted volumes and rebuilds the Cassandra database on the Storage Node if the system determines that it is necessary.

Before you begin

Attention: This procedure may rebuild the Cassandra database if the system determines that it is necessary. You will not be prompted to confirm the rebuild. If more than one Storage Node is offline, or if a Storage Node in this grid has been rebuilt in the last 15 days, contact technical support and do not run sn-recovery-postinstall.sh.

About this task

You must manually run scripts to check the attached storage, look for storage volumes attached to the server and attempt to mount them, and then verify that the volumes are structured like StorageGRID Webscale object stores. Any storage volume that cannot be mounted or does not pass this check is assumed to be failed. Two recovery scripts are used to identify and recover failed storage volumes: sn-remount-volumes, which remounts the preserved volumes, and sn-recovery-postinstall.sh, which reformats the failed volumes and restarts services.
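The validity check described above can be pictured roughly as follows. This is a hypothetical sketch, not the actual sn-remount-volumes code: the real volID file format is not documented here, so a simple two-field record (node ID, volume number) is assumed purely for illustration.

```python
# Hypothetical sketch of the per-volume check sn-remount-volumes performs.
# The real volID file format is not shown in this procedure; a two-field
# record is assumed here for illustration only.

def check_volid(volid_contents: str, expected_node_id: int):
    """Return (node_id, volume_number) if the volID file parses and the
    node ID matches the LDR configured on this Storage Node, else None."""
    try:
        node_id, volume_number = (int(f) for f in volid_contents.split())
    except ValueError:
        return None  # not structured like a StorageGRID Webscale object store
    if node_id != expected_node_id:
        return None  # volume belongs to a different Storage Node
    return node_id, volume_number

# A volume whose volID matches the configured LDR (12632740) is remounted:
print(check_volid("12632740 4", 12632740))   # (12632740, 4)
# Anything else is treated as failed and left unmounted:
print(check_volid("garbage", 12632740))      # None
```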

Steps

  1. From the service laptop, log in to the recovered Storage Node:
    1. Enter the following command: ssh admin@grid_node_IP
    2. Enter the password listed in the Passwords.txt file.
    3. Enter the following command to switch to root: su -
    4. Enter the password listed in the Passwords.txt file.
    When you are logged in as root, the prompt changes from $ to #.
  2. If some or all of the storage volumes may contain good data:
    Note: If you believe that all storage volumes are bad and need to be reformatted, or if all storage volumes failed before the system drive failed, you can skip this step and go to Step 3.
    1. Run the script to check and remount storage volumes: sn-remount-volumes
      The script checks for preserved storage volumes and remounts them. In this example, /dev/sdd was found to be bad and was not remounted.
      root@SG:~ # sn-remount-volumes
      The configured LDR noid is 12632740
      Attempting to remount /dev/sdg
      Found LDR node id 12632740, volume number 4 in the volID file
      Device /dev/sdg remounted successfully
      Attempting to remount /dev/sdf
      Found LDR node id 12632740, volume number 3 in the volID file
      Device /dev/sdf remounted successfully
      Attempting to remount /dev/sde
      Found LDR node id 12632740, volume number 2 in the volID file
      Device /dev/sde remounted successfully
      Failed to mount device /dev/sdd
      Attempting to remount /dev/sdc
      Found LDR node id 12632740, volume number 0 in the volID file
      Device /dev/sdc remounted successfully
      
      Data on devices that are found and mounted by the script is preserved; all other data on this node is lost and must be recovered from other nodes.
      Note: You can monitor the contents of the script's log file (/var/local/log/sn-remount-volumes.log) using the tail -f command. The log file contains more detailed information than the command line output.
    2. Review the list of devices that the script mounted or failed to mount:
      • If a volume that you believe is valid was not mounted by the script, investigate the cause before continuing. After you correct all issues preventing the script from mounting the devices, run the sn-remount-volumes script again.

      • If the script finds and mounts devices that you believe to be bad, finish the recovery procedure for a Storage Node with a failed system drive. Then perform the procedure "Recovering from storage volume failure where the system drive is intact" to repair the bad storage volume.
      • You must repair or replace any storage devices that could not be mounted.
    3. Record the device name of each failed storage volume, and identify its volume ID (/var/local/rangedb/number).
      You will need the volume IDs of failed volumes in a later procedure, to use as input to the script used to recover data.
      In the following example, /dev/sdd was not mounted successfully by the script.
      Attempting to remount /dev/sde
      Found LDR node id 12632740, volume number 2 in the volID file
      Device /dev/sde remounted successfully
      Failed to mount device /dev/sdd
      Attempting to remount /dev/sdc
      Found LDR node id 12632740, volume number 0 in the volID file
      Device /dev/sdc remounted successfully
      
      In this case, /dev/sdd corresponds to volume number 1.
    4. Confirm that each mounted device should be associated with this Storage Node:
      1. Review the messages for mounted devices in the script output. The message "Found LDR node id node_id, volume number volume_number in the volID file" is displayed for each mounted device.
      2. Find the node ID of the LDR service, either in the StorageGRID Webscale system on the Grid > Topology > Site > Node > LDR > Overview page, or in the index.html file in the /Doc directory of the Recovery Package.
      3. Check that the node_id for each mounted device is the same as the node ID of the LDR service that you are restoring.
      4. If the node ID for any of the volumes is different than the node ID of the LDR service, contact technical support.
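Taken together, substeps 3 and 4 amount to parsing the sn-remount-volumes output into failed devices, mounted volume numbers, and node IDs, then checking that every node ID matches the LDR being restored. The following Python sketch is an illustration only, and assumes the output format matches the examples above exactly:

```python
import re

# Sketch: parse sn-remount-volumes output (format assumed from the
# examples above) and flag devices whose node ID differs from the LDR's.

def parse_remount_output(output):
    mounted = {}    # device -> (node_id, volume_number)
    failed = []     # devices that must be repaired or reformatted
    pending = None  # (node_id, volume_number) from the preceding Found line
    for line in output.splitlines():
        m = re.search(r"Found LDR node id (\d+), volume number (\d+)", line)
        if m:
            pending = (int(m.group(1)), int(m.group(2)))
        m = re.search(r"Device (\S+) remounted successfully", line)
        if m:
            mounted[m.group(1)] = pending
        m = re.search(r"Failed to mount device (\S+)", line)
        if m:
            failed.append(m.group(1))
    return mounted, failed

output = """\
Attempting to remount /dev/sde
Found LDR node id 12632740, volume number 2 in the volID file
Device /dev/sde remounted successfully
Failed to mount device /dev/sdd
Attempting to remount /dev/sdc
Found LDR node id 12632740, volume number 0 in the volID file
Device /dev/sdc remounted successfully"""

mounted, failed = parse_remount_output(output)
print(failed)                                  # ['/dev/sdd']
print(sorted(v for _, v in mounted.values()))  # [0, 2] -> /dev/sdd is volume 1

# Substep 4: every mounted device must report the LDR's node ID.
ldr_node_id = 12632740
mismatched = [d for d, (nid, _) in mounted.items() if nid != ldr_node_id]
print(mismatched)  # [] -> consistent; otherwise contact technical support
```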
  3. Run the script that reformats unmounted storage volumes, rebuilds Cassandra if required, and restarts services on the Storage Node. Enter: sn-recovery-postinstall.sh
    If the script fails, contact technical support.
    root@SG:~ # sn-recovery-postinstall.sh
    Starting Storage Node recovery post installation.
    Reformatting all unmounted disks as rangedbs
    Formatting devices that are not in use...
    Skipping in use device /dev/sdc
    Successfully formatted /dev/sdd with UUID d6533e5f-7dfe-4a45-af7c-08ae6864960a
    Skipping in use device /dev/sde
    Skipping in use device /dev/sdf
    Skipping in use device /dev/sdg
    All devices processed
    Creating Object Stores for LDR
    Generating Grid Interface Configuration file
    LDR initialization complete
    Cassandra does not need rebuilding.
    Not starting services due to --do-not-start-services argument.
    Updating progress on primary Admin Node
    Starting services
    
    #######################################
            STARTING SERVICES
    #######################################
    
    Starting Syslog daemon
    Stopping system logging: syslog-ng.
    Starting system logging: syslog-ng.
    Starting SSH
    Starting OpenBSD Secure Shell server: sshd.
    No hotfix to install
    starting persistence ... done
    remove all error states
    starting all services
    services still stopped: acct adc ade-exporter cms crr dds idnt kstn ldr net-monitor nginx
    node-exporter ssm
    starting ade-exporter
    Starting service ade-exporter in background
    starting cms
    Starting service cms in background
    starting crr
    Starting service crr in background
    starting net-monitor
    Starting service net-monitor in background
    starting nginx
    Starting service nginx in background
    starting node-exporter
    Starting service node-exporter in background
    starting ssm
    Starting service ssm in background
    services still stopped: acct adc dds idnt kstn ldr
    starting adc
    Starting service adc in background
    starting dds
    Starting service dds in background
    starting ldr
    Starting service ldr in background
    services still stopped: acct idnt kstn
    starting acct
    Starting service acct in background
    starting idnt
    Starting service idnt in background
    starting kstn
    Starting service kstn in background
    all services started
    Starting service servermanager in background
    Restarting SNMP services:: snmpd
    
    #######################################
               SERVICES STARTED
    #######################################
    
    Loaded node_id from server-config
    node_id=5611d3c9-45e9-47e4-99ac-cd16ef8a20b9
    Storage Node recovery post installation complete.
    Object data must be restored to the storage volumes.
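The "Formatting devices that are not in use" phase shown in the log can be pictured as a simple partition of the attached devices: devices already remounted as object stores are skipped, and everything else is reformatted as a rangedb. A hypothetical sketch, with device names taken from the example log rather than from the script's actual logic:

```python
# Hypothetical sketch of the reformat decision in the log above: mounted
# devices are skipped, unmounted ones are reformatted. Illustration only.

def plan_reformat(all_devices, in_use):
    """Return (skipped, to_format), mirroring the 'Skipping in use device'
    and 'Successfully formatted' lines in the log."""
    skipped = [d for d in all_devices if d in in_use]
    to_format = [d for d in all_devices if d not in in_use]
    return skipped, to_format

all_devices = ["/dev/sdc", "/dev/sdd", "/dev/sde", "/dev/sdf", "/dev/sdg"]
in_use = {"/dev/sdc", "/dev/sde", "/dev/sdf", "/dev/sdg"}  # remounted earlier
skipped, to_format = plan_reformat(all_devices, in_use)
print(to_format)  # ['/dev/sdd'] -- the volume that failed to mount
```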
    

After you finish

After the node is rebuilt, you must restore object data to any storage volumes that were formatted by sn-recovery-postinstall.sh, as described in the next procedure.