Identifying and unmounting failed storage volumes

When recovering a Storage Node with failed storage volumes, you must identify and unmount the failed volumes. You must verify that only the failed storage volumes are reformatted as part of the recovery procedure.

Before you begin

You must be signed in to the Grid Manager using a supported browser.

About this task

You should recover failed storage volumes as soon as possible.

The first step of the recovery process is to detect volumes that have become detached, need to be unmounted, or have I/O errors. If failed volumes are still attached but have a randomly corrupted file system, the system might not detect any corruption in unused or unallocated parts of the disk. While you should run file system consistency checks on a regular basis, perform this procedure for detecting failed volumes on a large file system only when necessary, such as after a power loss.

Note: You must finish this procedure before performing manual steps to recover the volumes, such as adding or re-attaching the disks, stopping the node, starting the node, or rebooting. Otherwise, when you run the reformat_storage_block_devices.rb script, you might encounter a file system error that causes the script to hang or fail.
Note: Repair the hardware and properly attach the disks before running the reboot command.
CAUTION:
Identify failed storage volumes carefully. You will use this information to verify which volumes must be reformatted. Once a volume has been reformatted, data on the volume cannot be recovered.

To correctly recover failed storage volumes, you need to know both the device names of the failed storage volumes and their volume IDs.

At installation, each storage device is assigned a file system universally unique identifier (UUID) and is mounted to a rangedb directory on the Storage Node using that assigned file system UUID. The file system UUID and the rangedb directory are listed in the /etc/fstab file. The device name, rangedb directory, and the size of the mounted volume are displayed in the Grid Manager.

To see these values, select Support > Grid Topology. Then, select site > failed Storage Node > SSM > Resources > Overview > Main.

In the following example, device /dev/sdc has a volume size of 4 TB and is mounted to /var/local/rangedb/0 using the device name /dev/disk/by-uuid/822b0547-3b2b-472e-ad5e-e1cf1809faba in the /etc/fstab file:
[Screenshot: volume size sample]
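
The mapping between file system UUIDs, rangedb directories, and device names described above can also be read directly on the node. The following is a minimal sketch, not part of the documented procedure: the function names list_rangedb_mounts and resolve_device are hypothetical helpers, and the sketch assumes that awk and readlink are available on the Storage Node.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the product) for inspecting rangedb mounts.

# Print "<fstab device> <mount point>" for every rangedb entry in an
# fstab-style file. Defaults to /etc/fstab; pass another path to inspect
# a copy, for example one taken from an equivalent Storage Node.
list_rangedb_mounts() {
    fstab_file="${1:-/etc/fstab}"
    # Skip comment lines and keep entries mounted under /var/local/rangedb/.
    awk '$1 !~ /^#/ && $2 ~ /^\/var\/local\/rangedb\// { print $1, $2 }' "$fstab_file"
}

# Resolve a /dev/disk/by-uuid/... symlink to the device node it points to.
resolve_device() {
    readlink -f "$1"
}
```

For example, running list_rangedb_mounts with no argument lists the rangedb entries in /etc/fstab, and passing one of the listed /dev/disk/by-uuid paths to resolve_device shows the device node it currently points to. Treat the output as a cross-check against the values shown in the Grid Manager, not as a replacement for them.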

Steps

  1. Complete the following steps to record the failed storage volumes and their device names:
    1. Select Support > Grid Topology.
    2. Select site > failed Storage Node > LDR > Storage > Overview > Main, and look for object stores with alarms.

      [Screenshot: Object Stores section]
    3. Select site > failed Storage Node > SSM > Resources > Overview > Main. Determine the mount point and volume size of each failed storage volume identified in the previous step.

      Object stores are numbered in hex notation, from 0000 to 000F. In the example, the object store with an ID of 0000 corresponds to /var/local/rangedb/0 with device name sdc and a size of 107 GB.


      [Screenshot: example showing object stores and mount points]

      If you cannot determine the volume number and device name of failed storage volumes, log in to an equivalent Storage Node and determine the mapping of volumes to device names on that server.

      Storage Nodes are usually added in pairs, with identical hardware and storage configurations. Examine the /etc/fstab file on the equivalent Storage Node to identify the device names that correspond to each storage volume. Identify and record the device name for each failed storage volume.

  2. From the service laptop, log in to the failed Storage Node:
    1. Enter the following command: ssh admin@grid_node_IP
    2. Enter the password listed in the Passwords.txt file.
    3. Enter the following command to switch to root: su -
    4. Enter the password listed in the Passwords.txt file.
    When you are logged in as root, the prompt changes from $ to #.
  3. Stop the LDR service and unmount the failed storage volumes:
    1. Stop the LDR service: service ldr stop
    2. If rangedb/0 needs recovery, stop Cassandra before unmounting rangedb/0: service cassandra stop
    3. Unmount the failed storage volume: umount /var/local/rangedb/object_store_ID

      The object_store_ID is the ID of the failed storage volume. For example, specify 0 in the command for an object store with ID 0000.
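
The order of operations in step 3 can be sketched as a dry run that prints the commands instead of executing them, so that you can review them against your recorded list of failed volumes first. This is an illustrative sketch only: print_unmount_plan and is_mounted are hypothetical helper names, the arguments are the rangedb directory numbers you recorded in step 1, and the check assumes a Linux-style mount table at /proc/mounts.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the product) that mirror step 3.

# Print, without running, the step 3 commands for the given rangedb
# directory numbers (for example: print_unmount_plan 0 2).
print_unmount_plan() {
    echo "service ldr stop"
    # Cassandra must be stopped before rangedb/0 is unmounted.
    for id in "$@"; do
        if [ "$id" = "0" ]; then
            echo "service cassandra stop"
        fi
    done
    for id in "$@"; do
        echo "umount /var/local/rangedb/$id"
    done
}

# Succeed (exit status 0) if the mount point is listed in the mount table.
# The table defaults to /proc/mounts; pass another file to check a copy.
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}
```

After you run the real commands as root, a check such as is_mounted /var/local/rangedb/0 should fail, which indicates that the volume no longer appears in the mount table.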