Rebalance erasure-coded data after adding Storage Nodes

04/13/2023

After you add Storage Nodes, you can use the EC rebalance procedure to redistribute erasure-coded fragments among the existing and new Storage Nodes.

Before you begin

You have completed the expansion steps to add the new Storage Nodes.
You have reviewed the considerations for rebalancing erasure-coded data.
You understand that this procedure does not move replicated object data, and that it does not consider the replicated data usage on each Storage Node when determining where to move erasure-coded data.
You have the Passwords.txt file.

What happens when this procedure runs

Before starting the procedure, note the following:

The EC rebalance procedure will not start if one or more volumes are offline (unmounted) or if they are online (mounted) but in an error state.
The EC rebalance procedure temporarily reserves a large amount of storage. Storage alerts might be triggered, but will resolve when the rebalance is complete. If there is not enough storage for the reservation, the EC rebalance procedure will fail. Storage reservations are released when the EC rebalance procedure completes, whether the procedure failed or succeeded.
If a volume goes offline or experiences an error while an EC rebalance is in progress, the rebalance terminates in a partially completed state, with no loss of data. After all volumes are back online without errors, you can resume the EC rebalance procedure from the point where it terminated.

When the EC rebalance procedure is running, the performance of ILM operations and S3 and Swift client operations might be impacted.

S3 and Swift API operations to upload objects (or object parts) might fail during the EC rebalancing procedure if they require more than 24 hours to complete. Long-duration PUT operations will fail if the applicable ILM rule uses Balanced or Strict placement on ingest. The following error will be reported: 500 Internal Server Error.
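To judge whether a planned upload is at risk of crossing the 24-hour limit, a rough estimate can help. The following shell sketch divides an object (or part) size by an assumed sustained throughput and compares the result with 24 hours. The function names and the MiB-based units are assumptions made for this illustration; actual throughput during a rebalance may be lower than in normal operation.

```shell
#!/bin/sh
# Illustrative arithmetic only; size and rate are inputs you supply.
LIMIT_SECONDS=86400   # 24 hours expressed in seconds

# Print the estimated upload duration in seconds, rounded up.
# $1 = size in MiB, $2 = sustained throughput in MiB per second
upload_seconds() {
    size_mib="$1"
    rate_mib_per_s="$2"
    echo $(( (size_mib + rate_mib_per_s - 1) / rate_mib_per_s ))
}

# Succeed (exit status 0) when the estimate is longer than 24 hours.
exceeds_24h() {
    [ "$(upload_seconds "$1" "$2")" -gt "$LIMIT_SECONDS" ]
}
```

For example, exceeds_24h 5242880 50 succeeds for a 5 TiB upload at a sustained 50 MiB/s, because the estimate of roughly 29 hours is above the limit.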

Steps

Review the current object storage details for the site you plan to rebalance.
1. Select NODES.
2. Select the first Storage Node at the site.
3. Select the Storage tab.
4. Position your cursor over the Storage Used - Object Data chart to see the current amount of replicated data and erasure-coded data on the Storage Node.
5. Repeat these steps to view the other Storage Nodes at the site.
Log in to the primary Admin Node:
1. Enter the following command: ssh admin@primary_Admin_Node_IP
2. Enter the password listed in the Passwords.txt file.
3. Enter the following command to switch to root: su -
4. Enter the password listed in the Passwords.txt file.
  
  When you are logged in as root, the prompt changes from $ to #.
Start the procedure:

rebalance-data start --site "site-name"

For "site-name", specify the first site where you added one or more new Storage Nodes. Enclose site-name in quotes.

The EC rebalance procedure starts, and a job ID is returned.
Copy the job ID.
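The quoting requirement matters most when a site name contains spaces. As a minimal sketch, the following shell helper passes the site name to the start command as a single argument. The helper name start_rebalance and the REBALANCE_CMD override (useful for a dry run) are assumptions for this example and are not part of StorageGRID. The sketch does not assume anything about the format of the returned output.

```shell
#!/bin/sh
# Hypothetical helper, not part of StorageGRID.
# REBALANCE_CMD can be overridden for a dry run; it defaults to the real command.
REBALANCE_CMD="${REBALANCE_CMD:-rebalance-data}"

start_rebalance() {
    site="$1"
    if [ -z "$site" ]; then
        echo "usage: start_rebalance SITE_NAME" >&2
        return 2
    fi
    # The double quotes keep a name such as "Data Center 1" as one argument.
    "$REBALANCE_CMD" start --site "$site"
}
```

For example, start_rebalance "Data Center 1" | tee rebalance-start.log would start the job and keep a copy of the output, so the job ID can be copied from the file later. The log file name is arbitrary.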
Monitor the status of the EC rebalance procedure.
- To view the status of a single EC rebalance procedure:
  
  rebalance-data status --job-id job-id
  
  For job-id, specify the ID that was returned when you started the procedure.
- To view the status of the current EC rebalance procedure and any previously completed procedures:
  
  rebalance-data status
  
  To get help on the rebalance-data command:
  
  rebalance-data --help

Perform additional steps, based on the status returned:

If the status is In progress, the EC rebalance operation is still running. You should periodically monitor the procedure until it completes.

To view the estimated time to completion and the completion percentage for the current job:
1. Select SUPPORT > Tools > Metrics.
2. Select EC Overview in the Grafana section.
3. Look at the Grid EC Job Estimated Time to Completion and Grid EC Job Percentage Completed dashboards.

If the status is Success, optionally review object storage to see the updated details for the site.

Erasure-coded data should now be more balanced among the Storage Nodes at the site.

If the following message appears, run the EC rebalance procedure again until all erasure-coded data has been rebalanced:

The moves in this rebalance job have been limited. To rebalance additional data, start EC rebalance again for the same site.

If the status is Failure:
1. Confirm that all Storage Nodes at the site are connected to the grid.
2. Check for and resolve any alerts that might be affecting these Storage Nodes.
3. Restart the EC rebalance procedure:
  
  rebalance-data start --job-id job-id
4. If the status of the EC rebalance procedure is still Failure, contact technical support.
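The periodic status check can be scripted. The sketch below polls the status command until the job is no longer In progress. It assumes that the words In progress, Success, or Failure appear literally somewhere in the status output; verify this against the real output before relying on it. The function names, the REBALANCE_CMD override, and the default 300-second interval are assumptions for this example, not part of StorageGRID.

```shell
#!/bin/sh
# Hypothetical helpers, not part of StorageGRID.
REBALANCE_CMD="${REBALANCE_CMD:-rebalance-data}"

# Map status text read from stdin to one keyword.
# Assumption: the status words appear literally in the command output.
classify_status() {
    text=$(cat)
    case "$text" in
        *"In progress"*) echo in_progress ;;
        *Failure*)       echo failure ;;
        *Success*)       echo success ;;
        *)               echo unknown ;;
    esac
}

# Poll one job until it leaves the In progress state.
# $1 = job ID, $2 = seconds between checks (default 300)
wait_for_rebalance() {
    job_id="$1"
    interval="${2:-300}"
    while :; do
        state=$("$REBALANCE_CMD" status --job-id "$job_id" | classify_status)
        case "$state" in
            in_progress) sleep "$interval" ;;
            success)     echo "Job $job_id: Success"; return 0 ;;
            failure)     echo "Job $job_id: Failure" >&2; return 1 ;;
            *)           echo "Job $job_id: unrecognized status output" >&2; return 3 ;;
        esac
    done
}
```

For example, wait_for_rebalance job-id 600 would check every ten minutes, where job-id is the ID that was returned when you started the procedure. A nonzero exit status indicates that the steps for a Failure status apply or that the output needs to be inspected manually.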

If the EC rebalance procedure is generating too much load (for example, ingest operations are affected), pause the procedure.

rebalance-data pause --job-id job-id

If you need to terminate the EC rebalance procedure (for example, so you can perform a StorageGRID software upgrade), enter the following:

rebalance-data terminate --job-id job-id

When you terminate an EC rebalance procedure, any data fragments that have already been moved remain in the new location. Data is not moved back to the original location.
If you are using erasure coding at more than one site, run this procedure for all other affected sites.
