Monitor repair-data jobs

06/21/2023

You can monitor the status of repair jobs by using the repair-data script from the command line.

These include jobs that you initiated manually and jobs that StorageGRID initiated automatically as part of a decommission procedure.

If you are running volume restoration jobs, monitor the progress and view a history of those jobs in the Grid Manager instead.

Monitor the status of repair-data jobs based on whether you use replicated data, erasure-coded (EC) data, or both.

To get an estimated percent completion for the replicated repair, add the show-replicated-repair-status option to the repair-data command.

repair-data show-replicated-repair-status
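
If you want to check the estimate repeatedly instead of rerunning the command by hand, a small wrapper can poll it at a fixed interval. The following is a minimal sketch, not part of StorageGRID: the helper names (run_command, poll_status), the poll count, and the interval are illustrative assumptions, and the sketch assumes the repair-data script can be run from the shell where Python is started. Because the output format of the command is not described here, the sketch returns the raw text without parsing it.

```python
import subprocess
import time
from typing import Callable, Iterator, Sequence

# The command shown above. Assumption: it is runnable on this host as-is.
STATUS_COMMAND = ["repair-data", "show-replicated-repair-status"]


def run_command(command: Sequence[str]) -> str:
    """Run a command and return its standard output as text."""
    result = subprocess.run(
        list(command), capture_output=True, text=True, check=True
    )
    return result.stdout


def poll_status(
    runner: Callable[[Sequence[str]], str] = run_command,
    command: Sequence[str] = STATUS_COMMAND,
    polls: int = 3,              # hypothetical default
    interval_s: float = 300.0,   # hypothetical default: 5 minutes
) -> Iterator[str]:
    """Yield the raw status output `polls` times, sleeping between polls."""
    for i in range(polls):
        yield runner(command)
        if i < polls - 1:
            time.sleep(interval_s)
```

For example, `for text in poll_status(): print(text)` prints the raw output of each poll as it arrives. The runner parameter exists only so the loop can be exercised without access to a grid.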
To determine if repairs are complete:
1. Select NODES > Storage Node being repaired > ILM.
2. Review the attributes in the Evaluation section. When repairs are complete, the Awaiting - All attribute indicates 0 objects.
To monitor the repair in more detail:
1. Select SUPPORT > Tools > Grid topology.
2. Select grid > Storage Node being repaired > LDR > Data Store.
3. Use a combination of the following attributes to determine, as well as possible, if replicated repairs are complete.
  
  Cassandra inconsistencies might be present, and failed repairs aren't tracked.
  - Repairs Attempted (XRPA): Use this attribute to track the progress of replicated repairs. This attribute increases each time a Storage Node tries to repair a high-risk object. When this attribute does not increase for a period longer than the current scan period (provided by the Scan Period — Estimated attribute), it means that ILM scanning found no high-risk objects that need to be repaired on any nodes.
    
    High-risk objects are objects that are at risk of being completely lost. This does not include objects that don't satisfy their ILM configuration.
  - Scan Period — Estimated (XSCM): Use this attribute to estimate when a policy change will be applied to previously ingested objects. If the Repairs Attempted attribute does not increase for a period longer than the current scan period, it is probable that replicated repairs are done. Note that the scan period can change. The Scan Period — Estimated (XSCM) attribute applies to the entire grid and is the maximum of all node scan periods. You can query the Scan Period — Estimated attribute history for the grid to determine an appropriate time frame.
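
The rule described by these two attributes can be sketched as a small check: if Repairs Attempted has held the same value for longer than the scan period, replicated repairs are probably done. This is an illustration only. The Sample type and the repairs_probably_done function are hypothetical names, and the readings are assumed to be values you noted manually from the Grid Manager, not data fetched through any StorageGRID API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One manual reading of Repairs Attempted (XRPA)."""
    timestamp_s: float      # when the value was read, in seconds
    repairs_attempted: int  # XRPA value at that time


def repairs_probably_done(samples: Sequence[Sample], scan_period_s: float) -> bool:
    """Return True if XRPA has not increased for longer than the scan period.

    `samples` must be ordered from oldest to newest.
    """
    if len(samples) < 2:
        return False  # not enough readings to judge

    latest = samples[-1]
    # Walk backwards to find the earliest reading in the trailing run
    # that already shows the latest value.
    stable_since = latest.timestamp_s
    for sample in reversed(samples):
        if sample.repairs_attempted != latest.repairs_attempted:
            break
        stable_since = sample.timestamp_s

    return (latest.timestamp_s - stable_since) > scan_period_s
```

Because the scan period can change, pass the largest value seen in the attribute history for the time frame you are checking. A True result is only an indication, since failed repairs aren't tracked.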
