Remounting and reformatting storage volumes ("Manual Steps")

You must manually run two scripts to remount preserved storage volumes and reformat failed storage volumes. The first script remounts volumes that are properly formatted as StorageGRID Webscale storage volumes. The second script reformats any unmounted volumes, rebuilds the Cassandra database if required, and restarts services on the Storage Node.

Before you begin

You need the IP address of the recovered Storage Node and access to the Passwords.txt file.

About this task

To complete this procedure, you perform these high-level tasks:

• Log in to the recovered Storage Node.
• Run the sn-remount-volumes script to remount any properly formatted storage volumes.
• Review the script output and resolve any issues.
• Run the sn-recovery-postinstall.sh script to reformat any unmounted storage volumes, rebuild the Cassandra database if required, and start services.
• Monitor the recovery in the Grid Manager.

Steps

  1. From the service laptop, log in to the recovered Storage Node:
    1. Enter the following command: ssh admin@grid_node_IP
    2. Enter the password listed in the Passwords.txt file.
    3. Enter the following command to switch to root: su -
    4. Enter the password listed in the Passwords.txt file.
    When you are logged in as root, the prompt changes from $ to #.
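    Example
    The following is a representative login session. The laptop prompt and host name shown here are placeholders; substitute the actual IP address of the grid node you are recovering.
    user@laptop:~ $ ssh admin@grid_node_IP
    Password:
    admin@SG:~ $ su -
    Password:
    root@SG:~ #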
  2. Run the first script to remount any properly formatted storage volumes.
    Note: If all storage volumes are new and need to be formatted, or if all storage volumes have failed, you can skip this step and run the second script to reformat all unmounted storage volumes.
    1. Run the script: sn-remount-volumes
    2. As the script runs, review the output and answer any prompts.
      Note: As required, you can use the tail -f command to monitor the contents of the script's log file (/var/local/log/sn-remount-volumes.log). The log file contains more detailed information than the command line output. A short example is shown after the output analysis below.
      Example
      root@SG:~ # sn-remount-volumes
      The configured LDR noid is 12632740
      
      ====== Device /dev/sdb ======
      Mount and unmount device /dev/sdb and checking file system consistency:
      The device is consistent.
      Check rangedb structure on device /dev/sdb:
      Mount device /dev/sdb to /tmp/sdb-654321 with rangedb mount options
      This device has all rangedb directories.
      Found LDR node id 12632740, volume number 0 in the volID file
      Attempting to remount /dev/sdb
      Device /dev/sdb remounted successfully
      
      ====== Device /dev/sdc ======
      Mount and unmount device /dev/sdc and checking file system consistency:
      Error: File system consistency check retry failed on device /dev/sdc. 
      You can see the diagnosis information in the /var/local/log/sn-remount-volumes.log.
      
      This volume could be new or damaged. If you run sn-recovery-postinstall.sh,
      this volume and any data on this volume will be deleted. If you only had two
      copies of object data, you will temporarily have only a single copy.
      StorageGRID Webscale will attempt to restore data redundancy by making
      additional replicated copies or EC fragments, according to the rules in
      the active ILM policy.
      
      Do not continue to the next step if you believe that the data remaining on
      this volume cannot be rebuilt from elsewhere in the grid (for example, if
      your ILM policy uses a rule that makes only one copy or if volumes have
      failed on multiple nodes). Instead, contact support to determine how to
      recover your data.
      
      ====== Device /dev/sdd ======
      Mount and unmount device /dev/sdd and checking file system consistency:
      Failed to mount device /dev/sdd
      This device could be an uninitialized disk or has corrupted superblock.
      File system check might take a long time. Do you want to continue? (y or n) [y/N]? y
      
      Error: File system consistency check retry failed on device /dev/sdd. 
      You can see the diagnosis information in the /var/local/log/sn-remount-volumes.log.
      
      This volume could be new or damaged. If you run sn-recovery-postinstall.sh,
      this volume and any data on this volume will be deleted. If you only had two
      copies of object data, you will temporarily have only a single copy.
      StorageGRID Webscale will attempt to restore data redundancy by making
      additional replicated copies or EC fragments, according to the rules in
      the active ILM policy.
      
      Do not continue to the next step if you believe that the data remaining on
      this volume cannot be rebuilt from elsewhere in the grid (for example, if
      your ILM policy uses a rule that makes only one copy or if volumes have
      failed on multiple nodes). Instead, contact support to determine how to
      recover your data.
      
      ====== Device /dev/sde ======
      Mount and unmount device /dev/sde and checking file system consistency:
      The device is consistent.
      Check rangedb structure on device /dev/sde:
      Mount device /dev/sde to /tmp/sde-654321 with rangedb mount options
      This device has all rangedb directories.
      Found LDR node id 12000078, volume number 9 in the volID file
      Error: This volume does not belong to this node. Fix the attached volume and re-run this script.

      In the example output, one storage volume was remounted successfully and three storage volumes had errors.

      • /dev/sdb passed the XFS file system consistency check and had a valid volume structure, so it was remounted successfully. Data on devices that are remounted by the script is preserved.
      • /dev/sdc failed the XFS file system consistency check because the storage volume was new or corrupt.
      • /dev/sdd could not be mounted because the disk was uninitialized or the disk's superblock was corrupted. When the script cannot mount a storage volume, it asks if you want to run the file system consistency check.
        • If the storage volume is attached to a new disk, answer N to the prompt. You do not need to check the file system on a new disk.
        • If the storage volume is attached to an existing disk, answer Y to the prompt. You can use the results of the file system check to determine the source of the corruption. The results are saved in the /var/local/log/sn-remount-volumes.log log file.
      • /dev/sde passed the XFS file system consistency check and had a valid volume structure; however, the LDR node ID in the volID file did not match the ID for this Storage Node (the "configured LDR noid" displayed at the top). This message indicates that this volume belongs to another Storage Node.
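      As noted above, you can follow the log from a second session while the script runs. A minimal illustration, assuming a second SSH session is open on the same Storage Node:
      root@SG:~ # tail -f /var/local/log/sn-remount-volumes.log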
  3. Review the script output and resolve any issues.
    Attention: If a storage volume failed the XFS file system consistency check or could not be mounted, carefully review the error messages in the output. You must understand the implications of running the sn-recovery-postinstall.sh script on these volumes.
    1. Check that the results include an entry for every storage volume you expected to see. If any volumes are not listed, rerun the script.
    2. Review the messages for all mounted devices. Make sure there are no errors indicating that a storage volume does not belong to this Storage Node.
      Example
      The output for /dev/sde includes the following error message:
      Error: This volume does not belong to this node. Fix the attached volume and re-run this script.
      If a storage volume is reported as belonging to another Storage Node, replace or remove the disk and run the sn-remount-volumes script again to ensure the issue is resolved.

      As needed, you can find the node ID for the Storage Node you are recovering at the top of the script output (the "configured LDR noid"). You can look up node IDs for other Storage Nodes in the Grid Manager. Select Support > Grid Topology > Site > Storage Node > LDR > Overview.

      CAUTION:
      If you are unable to resolve the issue, contact technical support. If you run the sn-recovery-postinstall.sh script, the storage volume will be reformatted, which might cause data loss.
    3. Review the messages for devices that could not be mounted, and make a note of the device name for each failed storage volume.
      Note: You must repair or replace any storage devices that could not be mounted.
      You will use the device name to look up the volume ID, which is required input when you run the repair-data script in the next procedure (restoring object data). A quick way to cross-check which devices are currently mounted is shown in the example after the caution below.
    4. Run the sn-remount-volumes script again to ensure that all valid storage volumes have been remounted.
    Attention: If a storage volume could not be mounted or is improperly formatted, and you continue to the next step, the volume and any data on the volume will be deleted. If you had two copies of object data, you will have only a single copy until you complete the next procedure (restoring object data).
    CAUTION:
    Do not run the sn-recovery-postinstall.sh script if you believe that the data remaining on these volumes cannot be rebuilt from elsewhere in the grid (for example, if your ILM policy uses a rule that makes only one copy or if volumes have failed on multiple nodes). Instead, contact technical support to determine how to recover your data.
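    Example
    Before you continue, you can cross-check which block devices are currently mounted so that your list of failed (unmounted) devices is complete. This is a minimal sketch using standard Linux utilities, and it assumes that remounted storage volumes appear as rangedb mounts; device names and mount points in your environment will differ:
    root@SG:~ # lsblk
    root@SG:~ # mount | grep rangedb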
  4. Run the second script to reformat all unmounted (failed) storage volumes, rebuild Cassandra if required, and start services on the Storage Node: sn-recovery-postinstall.sh
  5. As the script runs, monitor the Recovery page in the Grid Manager, and review the command line output.
    The Progress bar and the Stage column on the Recovery page provide a high-level status of the sn-recovery-postinstall.sh script.
    [Screenshot: Recovery page in the Grid Manager showing the Progress bar and Stage column for the recovering Storage Node]
    The output for the script provides more detailed status information.
    Example
    root@SG:~ # sn-recovery-postinstall.sh
    Starting Storage Node recovery post installation.
    Reformatting all unmounted disks as rangedbs
    Formatting devices that are not in use...
    Skipping in use device /dev/sdb
    Successfully formatted /dev/sdc with UUID d6533e5f-7dfe-4a45-af7c-08ae6864960a
    Successfully formatted /dev/sdd with UUID a2534c4b-6bcc-3a12-fa4e-88ee8621452c
    Skipping in use device /dev/sde
    All devices processed
    Creating Object Stores for LDR
    Generating Grid Interface Configuration file
    LDR initialization complete
    Cassandra does not need rebuilding.
    Not starting services due to --do-not-start-services argument.
    Updating progress on primary Admin Node
    Starting services
    
    #######################################
            STARTING SERVICES
    #######################################
    
    Starting Syslog daemon
    Stopping system logging: syslog-ng.
    Starting system logging: syslog-ng.
    Starting SSH
    Starting OpenBSD Secure Shell server: sshd.
    No hotfix to install
    starting persistence ... done
    remove all error states
    starting all services
    services still stopped: acct adc ade-exporter cms crr dds idnt kstn ldr net-monitor nginx
    node-exporter ssm
    starting ade-exporter
    Starting service ade-exporter in background
    starting cms
    Starting service cms in background
    starting crr
    Starting service crr in background
    starting net-monitor
    Starting service net-monitor in background
    starting nginx
    Starting service nginx in background
    starting node-exporter
    Starting service node-exporter in background
    starting ssm
    Starting service ssm in background
    services still stopped: acct adc dds idnt kstn ldr
    starting adc
    Starting service adc in background
    starting dds
    Starting service dds in background
    starting ldr
    Starting service ldr in background
    services still stopped: acct idnt kstn
    starting acct
    Starting service acct in background
    starting idnt
    Starting service idnt in background
    starting kstn
    Starting service kstn in background
    all services started
    Starting service servermanager in background
    Restarting SNMP services:: snmpd
    
    #######################################
               SERVICES STARTED
    #######################################
    
    Loaded node_id from server-config
    node_id=5611d3c9-45e9-47e4-99ac-cd16ef8a20b9
    Storage Node recovery post installation complete.
    Object data must be restored to the storage volumes.
    

After you finish

Restore object data to any storage volumes that were formatted by sn-recovery-postinstall.sh, as described in the next procedure.