Restoring object data to a storage volume for an appliance

After recovering storage volumes for the appliance Storage Node, you restore the object data that was lost when the Storage Node failed. Object data is restored from other Storage Nodes and Archive Nodes, assuming that the grid's ILM rules were configured such that object copies are available.

Before you begin

About this task

To restore object data, you need the volume ID of each storage volume that failed and was restored. If you know only the device names of these storage volumes, you can use the device names to find the volume IDs, which correspond to the volume's /var/local/rangedb number.

At installation, each storage device is assigned a file system universally unique identifier (UUID) and is mounted to a rangedb directory on the Storage Node using that assigned file system UUID. The file system UUID and the rangedb directory are listed in the /etc/fstab file. The device name, rangedb directory, and the size of the mounted volume are displayed in the Grid Manager.

To see these values, select Support > Grid Topology. Then, select site > failed Storage Node > SSM > Resources > Overview > Main.

In the following example, device /dev/sdc has a volume size of 4 TB and is mounted to /var/local/rangedb/0, using the device name /dev/disk/by-uuid/822b0547-3b2b-472e-ad5e-e1cf1809faba in the /etc/fstab file:
Volume size sample
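
If you prefer to work from the command line, you can gather the same information directly on the Storage Node. The following is a minimal sketch, assuming standard Linux tools and the /etc/fstab layout described above:

# List the rangedb mounts in /etc/fstab to see which file system UUID
# is mounted to each /var/local/rangedb directory
grep rangedb /etc/fstab

# Resolve each file system UUID back to its underlying device (for example, /dev/sdc)
ls -l /dev/disk/by-uuid/

# Show the size of each mounted rangedb volume
df -h /var/local/rangedb/*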

Using the repair-data script

To restore object data, you run the repair-data script. This script begins the process of restoring object data and works with ILM scanning to ensure that ILM rules are met. You use different options with the repair-data script to restore object data that is protected using object replication and data that is protected using erasure coding.
Note: You can enter repair-data --help from the primary Admin Node command line for more information on using the repair-data script.

Replicated data

Two commands are available for restoring replicated data, based on whether you need to repair the entire node or only certain volumes on the node.

repair-data start-replicated-node-repair
repair-data start-replicated-volume-repair
You cannot determine the status of repairs of replicated data with certainty, because Cassandra inconsistencies might be present and failed repairs are not tracked; however, you can monitor these repairs approximately. Select Support > Grid Topology. Then, select deployment > Overview > ILM Activity. You can use a combination of the following attributes to monitor repairs and to determine, as well as possible, whether replicated repairs are complete:
  • Repairs Attempted (XRPA): Use this attribute to track the progress of replicated repairs. This attribute increases each time an LDR tries to repair a high-risk object. When this attribute does not increase for a period longer than the current scan period (provided by the Scan Period – Estimated attribute), it means that ILM scanning found no high-risk objects that need to be repaired on any nodes.
    Note: High-risk objects are objects that are at risk of being completely lost. This does not include objects that do not satisfy their ILM configuration.
  • Scan Period – Estimated (XSCM): Use this attribute to estimate when a policy change will be applied to previously ingested objects. If the Repairs Attempted attribute does not increase for a period longer than the current scan period, it is probable that replicated repairs are done. Note that the scan period can change. The Scan Period – Estimated (XSCM) attribute is at the deployment level and is the maximum of all node scan periods. You can query the Scan Period – Estimated attribute history at the deployment level to determine an appropriate timeframe for your grid.

Erasure coded (EC) data

Two commands are available for restoring erasure coded data, based on whether you need to repair the entire node or only certain volumes on the node.

repair-data start-ec-node-repair
repair-data start-ec-volume-repair
Repairs of erasure coded data can begin while some Storage Nodes are offline. Repair will complete after all nodes are available. You can track repairs of erasure coded data with this command:
repair-data show-ec-repair-status

Notes on data recovery

If the only remaining copy of object data is located on an Archive Node, object data is retrieved from the Archive Node. Due to the latency associated with retrievals from external archival storage systems, restoring object data to a Storage Node from an Archive Node takes longer than restoring copies from other Storage Nodes.

Attention: If ILM rules are configured to store only one replicated copy and the copy exists on a Storage Node that has failed, you will not be able to recover the object. However, you must still perform the procedure to restore object data to a storage volume to purge lost object information from the database.

Steps

  1. From the service laptop, log in to the primary Admin Node:
    1. Enter the following command: ssh admin@primary_Admin_Node_IP
    2. Enter the password listed in the Passwords.txt file.
    3. Enter the following command to switch to root: su -
    4. Enter the password listed in the Passwords.txt file.
      When you are logged in as root, the prompt changes from $ to #.
  2. Use the /etc/hosts file to find the host name of the Storage Node for the restored storage volumes. To see a list of all nodes in the grid, enter the following: cat /etc/hosts
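    For example, you can narrow the output by filtering it with grep. This is only a sketch; it assumes that your Storage Node host names contain the string "SN", as in the examples used in this procedure:
    grep -i sn /etc/hosts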
  3. If all storage volumes have failed, you must repair the entire node. If only some volumes have failed, go to the next step.
    Run both of the following commands if your grid has both replicated and erasure coded data.
    Attention: You cannot run repair-data operations for more than one node at the same time. To recover multiple nodes, contact technical support.
    1. Replicated data: Use the repair-data start-replicated-node-repair command with the --nodes option to repair the entire Storage Node.

      For example, this command repairs a Storage Node named SG-DC-SN3:

      repair-data start-replicated-node-repair --nodes SG-DC-SN3
    2. Erasure coded data: Use the repair-data start-ec-node-repair command with the --nodes option to repair the entire Storage Node.

      For example, this command repairs a Storage Node named SG-DC-SN3:

      repair-data start-ec-node-repair --nodes SG-DC-SN3

      The operation returns a unique repair ID that identifies this repair-data operation. Use this repair ID to track the progress and result of the repair. No other feedback is returned as the recovery process completes.

      Note: Repairs of erasure coded data can begin while some Storage Nodes are offline. Repair will complete after all nodes are available.

    As object data is restored, the LOST (Lost Objects) alarm is triggered if the StorageGRID Webscale system cannot locate replicated object data. Alarms might be triggered on Storage Nodes throughout the system. You should determine the cause of the loss and if recovery is possible. See the instructions for troubleshooting StorageGRID Webscale.

  4. If only some of the volumes have failed, you can repair one volume or a range of volumes.
    Run both of the following commands if your grid has both replicated and erasure coded data. Enter the volume IDs in hexadecimal, where 0000 is the first volume and 000F is the sixteenth volume; for example, the volume mounted at /var/local/rangedb/10 corresponds to volume ID 000A. You can specify one volume or a range of volumes.
    1. Replicated data: Use the start-replicated-volume-repair command with the --nodes and --volume-range options.

      For example, the following command restores object data to all volumes in the range 0003 to 000B on a Storage Node named SG-DC-SN3:

      repair-data start-replicated-volume-repair --nodes SG-DC-SN3 --volume-range 0003,000B

      For replicated data, you can run more than one repair-data operation at the same time for the same node. You might want to do this if you need to restore two volumes that are not in a range, such as 0000 and 000A.

    2. Erasure coded data: Use the start-ec-volume-repair command with the --nodes and --volume-range options.

      For example, the following command restores data to a single volume 000A on a Storage Node named SG-DC-SN3:

      repair-data start-ec-volume-repair --nodes SG-DC-SN3 --volume-range 000A

      For erasure coded data, you must wait until one repair-data start-ec-volume-repair operation completes before starting a second repair-data operation for the same node.

      The repair-data operation returns a unique repair ID that identifies this repair-data operation. Use this repair ID to track the progress and result of the repair. No other feedback is returned as the recovery process completes.
      Note: Repairs of erasure coded data can begin while some Storage Nodes are offline. Repair will complete after all nodes are available.

    As object data is restored, the LOST (Lost Objects) alarm is triggered if the StorageGRID Webscale system cannot locate replicated object data. Alarms might be triggered on Storage Nodes throughout the system. You should determine the cause of the loss and if recovery is possible. See the instructions for troubleshooting StorageGRID Webscale.

  5. Track the status of the repair of erasure coded data and make sure it completes successfully. Choose one of the following:
    • Use this command to see the status of a specific repair-data operation:
      repair-data show-ec-repair-status --repair-id repair_ID
    • Use this command to list all repairs:
      repair-data show-ec-repair-status
      The output lists information, including repair ID, for all previously and currently running repairs.
      root@DC1-ADM1:~ # repair-data show-ec-repair-status

       Repair ID  Scope                       Start Time             End Time  State    Est Bytes Affected  Bytes Repaired  Retry Repair
      ==================================================================================================================================
       949283     DC1-S-99-10(Volumes: 1,2)   2016-11-30T15:27:06.9            Success  17359               17359           No
       949292     DC1-S-99-10(Volumes: 1,2)   2016-11-30T15:37:06.9            Failure  17359               0               Yes
       949294     DC1-S-99-10(Volumes: 1,2)   2016-11-30T15:47:06.9            Failure  17359               0               Yes
       949299     DC1-S-99-10(Volumes: 1,2)   2016-11-30T15:57:06.9            Failure  17359               0               Yes
      
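      If a repair runs for a long time, you can re-run the status command periodically rather than checking it by hand. The following is a minimal sketch that uses the standard Linux watch utility; the repair ID 949292 is taken from the sample output above:
      watch -n 300 "repair-data show-ec-repair-status --repair-id 949292"
      The status command is re-run every 300 seconds (5 minutes); press Ctrl+C to stop watching.
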
  6. If the output shows that the repair operation failed, use the --repair-id option to retry the repair.
    The following command retries a failed node repair, using the repair ID 83930030303133434:
    repair-data start-ec-node-repair --repair-id 83930030303133434
    The following command retries a failed volume repair, using the repair ID 83930030303133434:
    repair-data start-ec-volume-repair --repair-id 83930030303133434