Restore object data to storage volume (system drive failure)

04/12/2023

After recovering storage volumes for a non-appliance Storage Node, you can restore the replicated or erasure-coded object data that was lost when the Storage Node failed.

Which procedure should I use?

Whenever possible, restore object data using the Volume restoration page in the Grid Manager.

If the volumes are listed at MAINTENANCE > Volume restoration > Nodes to restore, restore object data using the Volume restoration page in the Grid Manager.
If the volumes aren't listed at MAINTENANCE > Volume restoration > Nodes to restore, follow the steps below for using the repair-data script to restore object data.

If the recovered Storage Node contains fewer volumes than the node it is replacing, you must use the repair-data script.

Use the `repair-data` script to restore object data

Before you begin

You must have confirmed that the recovered Storage Node has a Connection State of Connected on the NODES > Overview tab in the Grid Manager.

About this task

Object data can be restored from other Storage Nodes, an Archive Node, or a Cloud Storage Pool, assuming that the grid's ILM rules were configured such that object copies are available.

Note the following:

If an ILM rule was configured to store only one replicated copy and that copy existed on a storage volume that failed, you will not be able to recover the object.
If the only remaining copy of an object is in a Cloud Storage Pool, StorageGRID must issue multiple requests to the Cloud Storage Pool endpoint to restore object data. Before performing this procedure, contact technical support for help in estimating the recovery time frame and the associated costs.
If the only remaining copy of an object is on an Archive Node, object data is retrieved from the Archive Node. Restoring object data to a Storage Node from an Archive Node takes longer than restoring copies from other Storage Nodes because of the latency associated with retrievals from external archival storage systems.

About the `repair-data` script

To restore object data, you run the repair-data script. This script begins the process of restoring object data and works with ILM scanning to ensure that ILM rules are met.

Select Replicated data or Erasure-coded (EC) data below to learn the different options for the repair-data script, based on whether you are restoring replicated data or erasure-coded data. If you need to restore both types of data, you must run both sets of commands.

For more information about the repair-data script, enter repair-data --help from the command line of the primary Admin Node.

Replicated data

Two commands are available for restoring replicated data, based on whether you need to repair the entire node or only certain volumes on the node:

repair-data start-replicated-node-repair

repair-data start-replicated-volume-repair

You can track repairs of replicated data with this command:

repair-data show-replicated-repair-status

Erasure-coded (EC) data

Two commands are available for restoring erasure-coded data, based on whether you need to repair the entire node or only certain volumes on the node:

repair-data start-ec-node-repair

repair-data start-ec-volume-repair

You can track repairs of erasure-coded data with this command:

repair-data show-ec-repair-status

Repairs of erasure-coded data can begin while some Storage Nodes are offline. However, if not all erasure-coded data can be accounted for, the repair can't be completed. The repair will complete after all nodes are available.

The EC repair job temporarily reserves a large amount of storage. Storage alerts might be triggered, but will resolve when the repair is complete. If there is not enough storage for the reservation, the EC repair job will fail. Storage reservations are released when the EC repair job completes, whether the job failed or succeeded.

Find hostname for Storage Node

Log in to the primary Admin Node:
1. Enter the following command: ssh admin@primary_Admin_Node_IP
2. Enter the password listed in the Passwords.txt file.
3. Enter the following command to switch to root: su -
4. Enter the password listed in the Passwords.txt file.
  
  When you are logged in as root, the prompt changes from $ to #.
Use the /etc/hosts file to find the hostname of the Storage Node for the restored storage volumes. To see a list of all nodes in the grid, enter the following: cat /etc/hosts.
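If the grid has many nodes, the raw file can be hard to scan. The following is a minimal sketch, assuming a standard hosts-file layout (an IP address followed by one or more names). The function name list_grid_hosts is hypothetical and isn't part of StorageGRID.

```shell
# Hypothetical helper: print every name found in a hosts-format file.
# Defaults to /etc/hosts when no file is given.
list_grid_hosts() {
    hosts_file="${1:-/etc/hosts}"
    # Skip blank lines and comment lines. Field 1 is the IP address;
    # the remaining fields are names. Stop at an inline comment.
    awk '!/^[ \t]*(#|$)/ {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^#/) break
            print $i
        }
    }' "$hosts_file"
}
```

For example, `list_grid_hosts | grep SN3` narrows the list to names containing SN3, which is only useful if your node names follow a pattern like the SG-DC-SN3 example used on this page.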

Repair data if all volumes have failed

If all storage volumes have failed, repair the entire node. Follow the instructions for replicated data, erasure-coded (EC) data, or both, depending on which types of data your grid stores.

If only some volumes have failed, go to Repair data if only some volumes have failed.

You can't run repair-data operations for more than one node at the same time. To recover multiple nodes, contact technical support.

Replicated data

If your grid includes replicated data, use the repair-data start-replicated-node-repair command with the --nodes option, where --nodes is the hostname (system name), to repair the entire Storage Node.

This command repairs the replicated data on a Storage Node named SG-DC-SN3:

repair-data start-replicated-node-repair --nodes SG-DC-SN3

As object data is restored, the Objects Lost alert is triggered if the StorageGRID system can't locate replicated object data. Alerts might be triggered on Storage Nodes throughout the system. You should determine the cause of the loss and whether recovery is possible. See Investigate lost objects.

Erasure-coded (EC) data

If your grid contains erasure-coded data, use the repair-data start-ec-node-repair command with the --nodes option, where --nodes is the hostname (system name), to repair the entire Storage Node.

This command repairs the erasure-coded data on a Storage Node named SG-DC-SN3:

repair-data start-ec-node-repair --nodes SG-DC-SN3

The operation returns a unique repair ID that identifies this repair-data operation. Use this repair ID to track the progress and result of the operation. No other feedback is returned as the recovery process completes.

Repairs of erasure-coded data can begin while some Storage Nodes are offline. Repair will complete after all nodes are available.

Repair data if only some volumes have failed

If only some of the volumes have failed, repair the affected volumes. Follow the instructions for replicated data, erasure-coded (EC) data, or both, depending on which types of data your grid stores.

If all volumes have failed, go to Repair data if all volumes have failed.

Enter the volume IDs in hexadecimal. For example, 0000 is the first volume and 000F is the sixteenth volume. You can specify one volume, a range of volumes, or multiple volumes that aren't in a sequence.

All the volumes must be on the same Storage Node. If you need to restore volumes for more than one Storage Node, contact technical support.
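If you track volumes by their position (first, second, and so on), the conversion to four-digit hexadecimal IDs can be sketched as follows. The helper names vol_id and vol_list are hypothetical. They take zero-based decimal indexes, so the first volume is 0.

```shell
# Hypothetical helper: convert a zero-based decimal volume index
# to a four-digit hexadecimal volume ID.
# Pass plain decimal numbers without leading zeros.
vol_id() {
    printf '%04X\n' "$1"
}

# Hypothetical helper: build a comma-separated list of volume IDs
# from several indexes, in the form used by the --volumes option.
vol_list() {
    list=""
    for idx in "$@"; do
        id=$(vol_id "$idx")
        list="${list:+$list,}$id"
    done
    printf '%s\n' "$list"
}
```

With these definitions, `vol_id 15` prints 000F (the sixteenth volume) and `vol_list 1 5 8` prints 0001,0005,0008.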

Replicated data

If your grid contains replicated data, use the start-replicated-volume-repair command with the --nodes option to identify the node (where --nodes is the hostname of the node). Then add either the --volumes or --volume-range option, as shown in the following examples.

Single volume: This command restores replicated data to volume 0002 on a Storage Node named SG-DC-SN3:

repair-data start-replicated-volume-repair --nodes SG-DC-SN3 --volumes 0002

Range of volumes: This command restores replicated data to all volumes in the range 0003 to 0009 on a Storage Node named SG-DC-SN3:

repair-data start-replicated-volume-repair --nodes SG-DC-SN3 --volume-range 0003,0009

Multiple volumes not in a sequence: This command restores replicated data to volumes 0001, 0005, and 0008 on a Storage Node named SG-DC-SN3:

repair-data start-replicated-volume-repair --nodes SG-DC-SN3 --volumes 0001,0005,0008

As object data is restored, the Objects Lost alert is triggered if the StorageGRID system can't locate replicated object data. Alerts might be triggered on Storage Nodes throughout the system. Note the alert description and recommended actions to determine the cause of the loss and whether recovery is possible.

Erasure-coded (EC) data

If your grid contains erasure-coded data, use the start-ec-volume-repair command with the --nodes option to identify the node (where --nodes is the hostname of the node). Then add either the --volumes or --volume-range option, as shown in the following examples.

Single volume: This command restores erasure-coded data to volume 0007 on a Storage Node named SG-DC-SN3:

repair-data start-ec-volume-repair --nodes SG-DC-SN3 --volumes 0007

Range of volumes: This command restores erasure-coded data to all volumes in the range 0004 to 0006 on a Storage Node named SG-DC-SN3:

repair-data start-ec-volume-repair --nodes SG-DC-SN3 --volume-range 0004,0006

Multiple volumes not in a sequence: This command restores erasure-coded data to volumes 000A, 000C, and 000E on a Storage Node named SG-DC-SN3:

repair-data start-ec-volume-repair --nodes SG-DC-SN3 --volumes 000A,000C,000E

The repair-data operation returns a unique repair ID that identifies the operation. Use this repair ID to track the progress and result of the repair. No other feedback is returned as the recovery process completes.

Repairs of erasure-coded data can begin while some Storage Nodes are offline. Repair will complete after all nodes are available.
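When a grid stores both replicated and erasure-coded data, both volume-repair commands are needed for the same node and volumes. The sketch below only prints the two command lines for review and does not run them. The function name print_volume_repairs and its "volumes" and "range" keywords are hypothetical. The commands and options it prints are the ones shown in the examples above.

```shell
# Hypothetical helper: print (not run) the volume-repair commands
# for both replicated and erasure-coded data.
# Usage: print_volume_repairs NODE volumes|range VALUE
print_volume_repairs() {
    node="$1"
    mode="$2"
    value="$3"
    case "$mode" in
        volumes) opt="--volumes" ;;
        range)   opt="--volume-range" ;;
        *)
            echo "mode must be 'volumes' or 'range'" >&2
            return 1
            ;;
    esac
    for kind in replicated ec; do
        printf 'repair-data start-%s-volume-repair --nodes %s %s %s\n' \
            "$kind" "$node" "$opt" "$value"
    done
}
```

For example, `print_volume_repairs SG-DC-SN3 volumes 0001,0005,0008` prints one start-replicated-volume-repair line and one start-ec-volume-repair line, which you can check before entering them on the primary Admin Node.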

Monitor repairs

Monitor the status of the repair jobs, based on whether you use replicated data, erasure-coded (EC) data, or both.

You can also monitor the status of volume restoration jobs in process and view a history of restoration jobs completed in Grid Manager.

Replicated data

To get an estimated percent completion for the replicated repair, add the show-replicated-repair-status option to the repair-data command.

repair-data show-replicated-repair-status

To determine if repairs are complete:
1. Select NODES > Storage Node being repaired > ILM.
2. Review the attributes in the Evaluation section. When repairs are complete, the Awaiting - All attribute indicates 0 objects.
To monitor the repair in more detail:
1. Select SUPPORT > Tools > Grid topology.
2. Select grid > Storage Node being repaired > LDR > Data Store.
3. Use a combination of the following attributes to determine, as well as possible, if replicated repairs are complete.
  
  Cassandra inconsistencies might be present, and failed repairs aren't tracked.
  - Repairs Attempted (XRPA): Use this attribute to track the progress of replicated repairs. This attribute increases each time a Storage Node tries to repair a high-risk object. When this attribute does not increase for a period longer than the current scan period (provided by the Scan Period — Estimated attribute), it means that ILM scanning found no high-risk objects that need to be repaired on any nodes.
    
    High-risk objects are objects that are at risk of being completely lost. This does not include objects that don't satisfy their ILM configuration.
  - Scan Period — Estimated (XSCM): Use this attribute to estimate when a policy change will be applied to previously ingested objects. If the Repairs Attempted attribute does not increase for a period longer than the current scan period, it is probable that replicated repairs are done. Note that the scan period can change. The Scan Period — Estimated (XSCM) attribute applies to the entire grid and is the maximum of all node scan periods. You can query the Scan Period — Estimated attribute history for the grid to determine an appropriate time frame.
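
The rule of thumb described for these two attributes can be written as a small check. This is only a sketch of the reasoning, not a StorageGRID tool. The function name is hypothetical, and all four values are ones you read from the Grid Manager yourself and convert to seconds where needed.

```shell
# Hypothetical check: succeeds (exit status 0) when Repairs Attempted
# has not increased for longer than the estimated scan period.
#   $1  earlier Repairs Attempted (XRPA) reading
#   $2  later Repairs Attempted (XRPA) reading
#   $3  seconds elapsed between the two readings
#   $4  estimated scan period (XSCM), converted to seconds
replicated_repairs_probably_done() {
    [ "$2" -eq "$1" ] && [ "$3" -gt "$4" ]
}
```

For example, `replicated_repairs_probably_done 120 120 7200 3600` succeeds because the counter stayed at 120 for longer than the scan period. Because the scan period can change and failed repairs aren't tracked, treat a success as "probably done" rather than as confirmation.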

Erasure-coded (EC) data

To monitor the repair of erasure-coded data and retry any requests that might have failed:

Determine the status of erasure-coded data repairs:
- Select SUPPORT > Tools > Metrics to view the estimated time to completion and the completion percentage for the current job. Then, select EC Overview in the Grafana section. Look at the Grid EC Job Estimated Time to Completion and Grid EC Job Percentage Completed dashboards.
- Use this command to see the status of a specific repair-data operation:
  
  repair-data show-ec-repair-status --repair-id repair_ID
- Use this command to list all repairs:
  
  repair-data show-ec-repair-status
  
  The output lists information, including repair ID, for all previously and currently running repairs.
If the output shows that the repair operation failed, use the --repair-id option to retry the repair.

This command retries a failed node repair, using the repair ID 6949309319275667690:

repair-data start-ec-node-repair --repair-id 6949309319275667690

This command retries a failed volume repair, using the repair ID 6949309319275667690:

repair-data start-ec-volume-repair --repair-id 6949309319275667690
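
The retry command depends on whether the failed job was a node repair or a volume repair. The following sketch prints the matching retry command for a given repair ID without running it. The function name print_ec_retry and its "node" and "volume" keywords are hypothetical.

```shell
# Hypothetical helper: print (not run) the retry command for a
# failed erasure-coded repair.
# Usage: print_ec_retry node|volume REPAIR_ID
print_ec_retry() {
    case "$1" in
        node|volume) ;;
        *)
            echo "first argument must be 'node' or 'volume'" >&2
            return 1
            ;;
    esac
    printf 'repair-data start-ec-%s-repair --repair-id %s\n' "$1" "$2"
}
```

For example, `print_ec_retry volume 6949309319275667690` prints the start-ec-volume-repair retry command shown above.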
