Monitor repair-data jobs
You can monitor the status of repair jobs by using the repair-data script from the command line.
These include jobs that you initiated manually and jobs that StorageGRID initiated automatically as part of a decommission procedure.
Note: If you are running volume restoration jobs, monitor the progress and view a history of those jobs in the Grid Manager instead.
Monitor the status of repair-data jobs based on whether you use replicated data, erasure-coded (EC) data, or both.
-
To get an estimated percent completion for the replicated repair, add the show-replicated-repair-status option to the repair-data command:
repair-data show-replicated-repair-status
-
To determine if repairs are complete:
-
Select Nodes > Storage Node being repaired > ILM.
-
Review the attributes in the Evaluation section. When repairs are complete, the Awaiting - All attribute indicates 0 objects.
-
To monitor the repair in more detail:
-
Select Nodes.
-
Select grid name > ILM.
-
Position your cursor over the ILM queue graph to see the value of the Scan rate (objects/sec) attribute, which is the rate at which objects in the grid are scanned and queued for ILM.
-
In the ILM Queue section, look at the following attributes:
-
Scan period - estimated: The estimated time to complete a full ILM scan of all objects.
A full scan doesn't guarantee that ILM has been applied to all objects.
-
Repairs attempted: The total number of attempted object repair operations for replicated data that are considered high risk. High-risk objects are any objects with one copy remaining, whether specified by the ILM policy or as a result of lost copies. This count increments each time a Storage Node tries to repair a high-risk object. High-risk ILM repairs are prioritized if the grid becomes busy.
The count might increment more than once for the same object if replication fails after the repair.
These attributes can be useful when you are monitoring the progress of Storage Node volume recovery. If the number of repairs attempted has stopped increasing and a full scan has been completed, the repair has probably completed.
-
Alternatively, submit a Prometheus query for storagegrid_ilm_scan_period_estimated_minutes and storagegrid_ilm_repairs_attempted.
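The completion heuristic described above (the repairs-attempted count has stopped increasing and a full scan has completed) can be sketched as a small check over readings you have collected yourself, for example from storagegrid_ilm_repairs_attempted. This is only an illustration: the function name, the stable_samples parameter, and the way you decide that a full scan has completed are assumptions, not part of StorageGRID.

```python
from typing import Sequence


def repair_probably_complete(
    repairs_attempted: Sequence[float],
    full_scan_completed: bool,
    stable_samples: int = 3,  # hypothetical threshold, tune to your sampling interval
) -> bool:
    """Apply the heuristic: counter flat AND a full ILM scan has finished.

    repairs_attempted: readings of the Repairs attempted count, oldest first.
    full_scan_completed: supplied by the operator; this sketch does not detect it.
    """
    if not full_scan_completed:
        return False
    if len(repairs_attempted) < stable_samples:
        # Not enough readings to say the counter has stopped increasing.
        return False
    recent = repairs_attempted[-stable_samples:]
    # "Stopped increasing" is taken to mean the last few readings are identical.
    return all(value == recent[0] for value in recent)
```

A result of True only means the repair has probably completed. Confirm with the Awaiting - All attribute as described earlier.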
-
To monitor the repair of erasure-coded data and retry any requests that might have failed:
-
Determine the status of erasure-coded data repairs:
-
Select Support > Tools > Metrics to view the estimated time to completion and the completion percentage for the current job. Then, select EC Overview in the Grafana section. Look at the Grid EC Job Estimated Time to Completion and Grid EC Job Percentage Completed dashboards.
-
Use this command to see the status of a specific repair-data operation:
repair-data show-ec-repair-status --repair-id repair ID
-
Use this command to list all repairs:
repair-data show-ec-repair-status
The output lists information, including repair ID, for all previously and currently running repairs.
-
If the output shows that the repair operation failed, use the --repair-id option to retry the repair.
This command retries a failed node repair, using the repair ID 6949309319275667690:
repair-data start-ec-node-repair --repair-id 6949309319275667690
This command retries a failed volume repair, using the repair ID 6949309319275667690:
repair-data start-ec-volume-repair --repair-id 6949309319275667690
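If you script retries, the choice between the two retry commands above can be captured in a small helper. The sketch below only assembles the argument list and does not run anything. The function name and the "node"/"volume" scope labels are hypothetical; the subcommands and the --repair-id option are the ones shown above.

```python
from typing import List

# Map a failed repair's scope to the retry subcommand shown in this procedure.
_RETRY_SUBCOMMANDS = {
    "node": "start-ec-node-repair",
    "volume": "start-ec-volume-repair",
}


def build_ec_retry_command(scope: str, repair_id: str) -> List[str]:
    """Return the repair-data argument list that retries a failed EC repair."""
    if scope not in _RETRY_SUBCOMMANDS:
        raise ValueError(f"scope must be 'node' or 'volume', got {scope!r}")
    if not repair_id:
        raise ValueError("repair_id must not be empty")
    return ["repair-data", _RETRY_SUBCOMMANDS[scope], "--repair-id", repair_id]


# Example: rebuild the node retry command from the text above.
print(" ".join(build_ec_retry_command("node", "6949309319275667690")))
```

Returning a list rather than a single string keeps the repair ID as one argument if the list is later passed to a process launcher.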