Rebalance erasure-coded data after adding Storage Nodes

01/04/2022 Contributors

PDFs

In some cases, you might need to rebalance erasure-coded data after you add new Storage Nodes.

What you'll need

You have completed the expansion steps to add the new Storage Nodes.
You have reviewed the considerations for rebalancing erasure-coded data.

Only perform this procedure if the Low Object Storage alert has been triggered for one or more Storage Nodes at a site and you were unable to add the recommended number of new Storage Nodes.
You understand that replicated object data will not be moved by this procedure and that the EC rebalance procedure does not consider the replicated data usage on each Storage Node when determining where to move erasure-coded data.
You have the Passwords.txt file.

About this task

When the EC rebalance procedure is running, the performance of ILM operations and S3 and Swift client operations are likely to be impacted. For this reason, you should only perform this procedure in limited cases.

The EC rebalance procedure temporarily reserves a large amount of storage. Storage alerts might be triggered, but will resolve when the rebalance is complete. If there is not enough storage for the reservation, the EC rebalance procedure will fail. Storage reservations are released when the EC rebalance procedure completes, whether the procedure failed or succeeded.

S3 and Swift API operations to upload objects (or object parts) might fail during the EC rebalancing procedure if they require more than 24 hours to complete. Long-duration PUT operations will fail if the applicable ILM rule uses Strict or Balanced placement on ingest. The following error will be reported:

500 Internal Server Error

Steps

Review the current object storage details for the site you plan to rebalance.
1. Select NODES.
2. Select the first Storage Node at the site.
3. Select the Storage tab.
4. Hover your cursor over the Storage Used - Object Data chart to see the current amount of replicated data and erasure-coded data on the Storage Node.
5. Repeat these steps to view the other Storage Nodes at the site.
Log in to the primary Admin Node:
1. Enter the following command: ssh admin@primary_Admin_Node_IP
2. Enter the password listed in the Passwords.txt file.
3. Enter the following command to switch to root: su -
4. Enter the password listed in the Passwords.txt file.
  
  When you are logged in as root, the prompt changes from $ to #.
Enter the following command:

rebalance-data start --site "site-name"

For "site-name", specify the first site where you added new Storage Node or nodes. Enclose site-name in quotes.

The EC rebalance procedure starts, and a job ID is returned.
Copy the job ID.
Monitor the status of the EC rebalance procedure.
- To view the status of a single EC rebalance procedure:
  
  rebalance-data status --job-id job-id
  
  For job-id, specify the ID that was returned when you started the procedure.
- To view the status of the current EC rebalance procedure and any previously completed procedures:
  
  rebalance-data status
  
  To get help on the rebalance-data command:
  
  rebalance-data --help
- To view the estimated time to completion and the completion percentage for the current job, select SUPPORT > Tools > Metrics. Then, select EC Overview in the Grafana section. Look at the Grid EC Job Estimated Time to Completion and Grid EC Job Percentage Completed dashboards.
Perform additional steps, based on the status returned:
- If the status is In progress, the EC rebalance operation is still running. You should periodically monitor the procedure until it completes.
- If the status is Failure, perform the failure steps.
- If the status is Success, perform the success step.
If the EC rebalance procedure is generating too much load (for example, ingest operations are affected), pause the procedure.

rebalance-data pause --job-id job-id
If you need to terminate the EC rebalance procedure (for example, so you can perform a StorageGRID software upgrade), enter the following:

rebalance-data terminate --job-id job-id

When you terminate an EC rebalance procedure, any data fragments that have already been moved remain in the new location. Data is not moved back to the original location.
If the status of the EC rebalance procedure is Failure, follow these steps:
1. Confirm that all Storage Nodes at the site are connected to the grid.
2. Check for and resolve any alerts that might be affecting these Storage Nodes.
  
  For information about specific alerts, see the monitoring and troubleshooting instructions.
3. Restart the EC rebalance procedure:
  rebalance-data start –-job-id job-id
4. If the status of the EC rebalance procedure is still Failure, contact technical support.
If the status of the EC rebalance procedure is Success, optionally review object storage to see the updated details for the site.

Erasure-coded data should now be more balanced among the Storage Nodes at the site.
If you are using erasure coding at more than one site, run this procedure for all other affected sites.

Rebalance erasure-coded data after adding Storage Nodes

Creating your file...