Rebalancing erasure-coded data after adding Storage Nodes

In some cases, you might need to rebalance erasure-coded data after you add new Storage Nodes.

Before you begin

About this task

While the EC rebalance procedure is running, the performance of ILM operations and of S3 and Swift client operations is likely to be impacted. For this reason, you should perform this procedure only in limited cases.

Note: The EC rebalance procedure temporarily reserves a large amount of storage. Storage alerts might be triggered, but will resolve when the rebalance is complete. If there is not enough storage for the reservation, the EC rebalance procedure will fail. Storage reservations are released when the EC rebalance procedure completes, whether the procedure failed or succeeded.
Note: S3 and Swift API operations to upload objects (or object parts) might fail during the EC rebalancing procedure if they require more than 24 hours to complete. Long-duration PUT operations will fail if the applicable ILM rule uses Strict or Balanced placement on ingest. The following error will be reported:
500 Internal Server Error

Procedure

  1. Review the current object storage details for the site you plan to rebalance.
    1. Select Nodes.
    2. Select the first Storage Node at the site.
    3. Select the Storage tab.
    4. Hover your cursor over the Storage Used - Object Data chart to see the current amount of replicated data and erasure-coded data on the Storage Node.
    5. Repeat these steps to view the other Storage Nodes at the site.
  2. Log in to the primary Admin Node:
    1. Enter the following command: ssh admin@primary_Admin_Node_IP
    2. Enter the password listed in the Passwords.txt file.
    3. Enter the following command to switch to root: su -
    4. Enter the password listed in the Passwords.txt file.
      When you are logged in as root, the prompt changes from $ to #.
  3. Enter the following command: rebalance-data start --site "site-name"
    For "site-name", specify the first site where you added a new Storage Node or nodes. Enclose site-name in quotes.
    The EC rebalance procedure starts, and a job ID is returned.
  4. Copy the job ID.
  5. Monitor the status of the EC rebalance procedure.
    • To view the status of a single EC rebalance procedure:
      rebalance-data status --job-id job-id

      For job-id, specify the ID that was returned when you started the procedure.

    • To view the status of the current EC rebalance procedure and any previously completed procedures:
      rebalance-data status
    Note: To get help on the rebalance-data command:
    rebalance-data --help
  6. Perform additional steps, based on the status returned:
    • If the status is In progress, the EC rebalance operation is still running. You should periodically monitor the procedure until it completes.
    • If the status is Failure, go to step 9.
    • If the status is Success, go to step 10.
  7. If the EC rebalance procedure is generating too much load (for example, ingest operations are affected), pause the procedure.
    rebalance-data pause --job-id job-id
  8. If you need to terminate the EC rebalance procedure (for example, so you can perform a StorageGRID software upgrade), enter the following:
    rebalance-data abort --job-id job-id
    Note: When you terminate an EC rebalance procedure, any data fragments that have already been moved remain in the new location. Data is not moved back to the original location.
  9. If the status of the EC rebalance procedure is Failure, follow these steps:
    1. Confirm that all Storage Nodes at the site are connected to the grid.
    2. Check for and resolve any alerts that might be affecting these Storage Nodes.
      For information about specific alerts, see the monitoring and troubleshooting instructions.
    3. Restart the EC rebalance procedure: rebalance-data start --job-id job-id
    4. If the status of the EC rebalance procedure is still Failure, contact technical support.
  10. If the status of the EC rebalance procedure is Success, optionally repeat step 1 to review the updated object storage details for the site.
    Erasure-coded data should now be more balanced among the Storage Nodes at the site.
    Note: Replicated object data is not moved by the EC rebalance procedure.
  11. If you are using erasure coding at more than one site, run this procedure for all other affected sites.
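
The branching in step 6 can be sketched as a small shell helper. This is only an illustration: the function name next_step_for_status is hypothetical, and it assumes you pass in the status text yourself (In progress, Failure, or Success), because the exact output format of rebalance-data status is not shown in this procedure.

```shell
#!/bin/sh
# Hypothetical helper (not part of StorageGRID): maps a status string
# to the step of the procedure that applies next.
# Assumption: the argument is one of the three status values named in
# step 6; parsing real `rebalance-data status` output is out of scope.
next_step_for_status() {
    case "$1" in
        "In progress") echo "keep monitoring (step 5)" ;;
        "Failure")     echo "troubleshoot and restart (step 9)" ;;
        "Success")     echo "review storage details (step 10)" ;;
        *)             echo "unrecognized status"; return 1 ;;
    esac
}
```

For example, next_step_for_status "Failure" prints the reminder to troubleshoot and restart. Any status the helper does not recognize returns a nonzero exit code, so a calling script can stop rather than guess.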