Skip to main content

Checking data repair jobs

Contributors

Before decommissioning a grid node, you must confirm that no data repair jobs are active. If any repairs have failed, you must restart them and allow them to complete before performing the decommission procedure.

If you need to decommission a disconnected Storage Node, you will also complete these steps after the decommission procedure completes in order to ensure the data repair job has completed successfully. You must ensure that any erasure-coded fragments that were on the removed node have been restored successfully.

These steps only apply to systems that have erasure-coded objects.

  1. Log in to the primary Admin Node:

    1. Enter the following command: ssh admin@grid_node_IP

      When you are logged in as root, the prompt changes from $ to #.

    2. Enter the password listed in the Passwords.txt file.

    3. Enter the following command to switch to root: su -

    4. Enter the password listed in the Passwords.txt file.

  2. Check for running repairs: repair-data show-ec-repair-status

    • If you have never run a data repair job, the output is No job found. You do not need to restart any repair jobs.

    • If the data repair job was run previously or is running currently, the output lists information for the repair. Each repair has a unique repair ID. Go to the next step.

    root@DC1-ADM1:~ # repair-data show-ec-repair-status
    
    Repair ID Scope Start Time End Time State Est/Affected Bytes Repaired Retry Repair
    ===================================================================================
    949283 DC1-S-99-10(Volumes: 1,2) 2016-11-30T15:27:06.9 Success 17359 17359 No
    949292 DC1-S-99-10(Volumes: 1,2) 2016-11-30T15:37:06.9 Failure 17359 0     Yes
    949294 DC1-S-99-10(Volumes: 1,2) 2016-11-30T15:47:06.9 Failure 17359 0     Yes
    949299 DC1-S-99-10(Volumes: 1,2) 2016-11-30T15:57:06.9 Failure 17359 0     Yes
  3. If the State for all repairs is Success, you do not need to restart any repair jobs.

  4. If the State for any repair is Failure, you must restart that repair.

    1. Obtain the repair ID for the failed repair from the output.

    2. Run the repair-data start-ec-node-repair command.

      Use the --repair-id option to specify the Repair ID. For example, if you want to retry a repair with repair ID 949292, run this command: repair-data start-ec-node-repair --repair-id 949292

    3. Continue to track the status of EC data repairs until the State for all repairs is Success.