Rebalance erasure-coded data after adding Storage Nodes

Contributors netapp-maireadn netapp-madkat netapp-lhalbert

After you add Storage Nodes, you can use the EC rebalance procedure to redistribute erasure-coded fragments among the existing and new Storage Nodes.

Before you begin
  • You have completed the expansion steps to add the new Storage Nodes.

  • You have reviewed the considerations for rebalancing erasure-coded data.

  • You understand that replicated object data will not be moved by this procedure and that the EC rebalance procedure does not consider the replicated data usage on each Storage Node when determining where to move erasure-coded data.

  • You have the Passwords.txt file.

What happens when this procedure runs

Before starting the procedure, note the following:

  • The EC rebalance procedure will not start if one or more volumes are offline (unmounted) or if they are online (mounted) but in an error state.

  • The EC rebalance procedure temporarily reserves a large amount of storage. Storage alerts might be triggered, but will resolve when the rebalance is complete. If there is not enough storage for the reservation, the EC rebalance procedure will fail. Storage reservations are released when the EC rebalance procedure completes, whether the procedure failed or succeeded.

  • If a volume goes offline while the EC rebalance procedure is in progress, the rebalance procedure will terminate. Any data fragments that were already moved will remain in their new locations, and no data will be lost.

    You can rerun the procedure after all volumes are back online.

  • When the EC rebalance procedure is running, the performance of ILM operations and S3 and Swift client operations might be impacted.

    Note: S3 and Swift API operations to upload objects (or object parts) might fail during the EC rebalance procedure if they require more than 24 hours to complete. Long-duration PUT operations will fail if the applicable ILM rule uses Balanced or Strict placement on ingest. The following error will be reported: 500 Internal Server Error.

  • During this procedure all nodes have a storage capacity limit of 80%. Nodes that exceed this limit, but still store below the target data partition, are excluded from:

    • The site imbalance value

    • Any job completion conditions

      Note: The target data partition is calculated by dividing the total data for a site by the number of nodes.

  • Job completion conditions. The EC rebalance procedure is considered complete when any of the following is true:

    • It can't move any more erasure-coded data.

    • The data in all nodes is within a 5% deviation of the target data partition.

    • The procedure has been running for 30 days.
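
The capacity limit and the 5% completion condition above can be illustrated with a small calculation. This is a minimal sketch under stated assumptions, not StorageGRID's actual implementation: the node names, sizes, and function names are hypothetical, the data counted toward the target is assumed to be each node's erasure-coded data, and the 5% deviation is assumed to be relative to the target data partition.

```python
from dataclasses import dataclass

CAPACITY_LIMIT = 0.80   # per-node storage capacity limit during the procedure
DEVIATION_LIMIT = 0.05  # completion threshold around the target data partition


@dataclass
class Node:
    name: str           # hypothetical node name
    capacity_tb: float  # total usable capacity (illustrative figure)
    used_tb: float      # all data stored on the node
    data_tb: float      # data counted toward the target (assumed: erasure-coded)


def target_data_partition(nodes):
    """Total data for the site divided by the number of nodes."""
    return sum(n.data_tb for n in nodes) / len(nodes)


def is_excluded(node, target):
    """Over the 80% limit but still below the target data partition."""
    over_limit = node.used_tb / node.capacity_tb > CAPACITY_LIMIT
    return over_limit and node.data_tb < target


def is_balanced(nodes):
    """True when every non-excluded node is within 5% of the target."""
    target = target_data_partition(nodes)
    return all(
        abs(n.data_tb - target) <= DEVIATION_LIMIT * target
        for n in nodes
        if not is_excluded(n, target)
    )


# Illustrative site: sn3 is over 80% full (mostly replicated data) and
# holds less than the target, so it is ignored by the completion check.
site = [
    Node("sn1", capacity_tb=100, used_tb=60, data_tb=40),
    Node("sn2", capacity_tb=100, used_tb=55, data_tb=40),
    Node("sn3", capacity_tb=100, used_tb=85, data_tb=34),
    Node("sn4", capacity_tb=100, used_tb=50, data_tb=40),
]
print(target_data_partition(site))
print(is_balanced(site))
```

In this example the remaining three nodes are each within 5% of the target, so the sketch reports the site as balanced even though sn3 stores noticeably less than the others.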

Steps
  1. Review the current object storage details for the site you plan to rebalance.

    1. Select NODES.

    2. Select the first Storage Node at the site.

    3. Select the Storage tab.

    4. Position your cursor over the Storage Used - Object Data chart to see the current amount of replicated data and erasure-coded data on the Storage Node.

    5. Repeat these steps to view the other Storage Nodes at the site.

  2. Log in to the primary Admin Node:

    1. Enter the following command: ssh admin@primary_Admin_Node_IP

    2. Enter the password listed in the Passwords.txt file.

    3. Enter the following command to switch to root: su -

    4. Enter the password listed in the Passwords.txt file.

      When you are logged in as root, the prompt changes from $ to #.

  3. Start the procedure:

    rebalance-data start --site "site-name"

    For "site-name", specify the first site where you added the new Storage Node or nodes. Enclose the site name in quotes.

    The EC rebalance procedure starts, and a job ID is returned.

  4. Copy the job ID.

  5. Monitor the status of the EC rebalance procedure.

    • To view the status of a single EC rebalance procedure:

      rebalance-data status --job-id job-id

      For job-id, specify the ID that was returned when you started the procedure.

    • To view the status of the current EC rebalance procedure and any previously completed procedures:

      rebalance-data status

      Note

      To get help on the rebalance-data command:

      rebalance-data --help

  6. Perform additional steps, based on the status returned:

    • If State is In progress, the EC rebalance operation is still running. You should periodically monitor the procedure until it completes.

      Use the Site Imbalance value to assess how unbalanced erasure-coded data usage is across the Storage Nodes at the site. This value can range from 1.0 to 0, with 0 indicating that erasure-coded data usage is completely balanced across all Storage Nodes at the site.

      The EC rebalance job is considered complete and will stop when the data in all nodes is within a 5% deviation of the target data partition.

    • If State is Success, optionally review object storage to see the updated details for the site.

      Erasure-coded data should now be more balanced among the Storage Nodes at the site.

    • If State is Failure:

      1. Confirm that all Storage Nodes at the site are connected to the grid.

      2. Check for and resolve any alerts that might be affecting these Storage Nodes.

      3. Restart the EC rebalance procedure:

        rebalance-data start --job-id job-id

      4. View the status of the new procedure. If State is still Failure, contact technical support.

  7. If the EC rebalance procedure is generating too much load (for example, ingest operations are affected), pause the procedure.

    rebalance-data pause --job-id job-id

  8. If you need to terminate the EC rebalance procedure (for example, so you can perform a StorageGRID software upgrade), enter the following:

    rebalance-data terminate --job-id job-id

    Note: When you terminate an EC rebalance procedure, any data fragments that have already been moved remain in their new locations. Data is not moved back to the original location.

  9. If you are using erasure coding at more than one site, run this procedure for all other affected sites.
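
The decision logic in step 6 can be summarized as a small lookup from the reported State to the follow-up action. This is only a sketch for operators who keep runbook helpers: the function names and the job ID shown are hypothetical, the helper does not run anything on the Admin Node, and it uses only the rebalance-data command forms that appear in the steps above. States other than the three named in step 6 are not covered here and raise an error.

```python
def start_command(site_name):
    """Command from step 3; the site name is enclosed in quotes."""
    return f'rebalance-data start --site "{site_name}"'


def next_step(state, job_id):
    """Map the State shown by 'rebalance-data status' to the step 6 follow-up.

    Returns (advice, command); command is None when nothing needs to be run.
    """
    if state == "In progress":
        return (
            "Still running; check the status again periodically.",
            f"rebalance-data status --job-id {job_id}",
        )
    if state == "Success":
        return ("Optionally review object storage for the site.", None)
    if state == "Failure":
        return (
            "Confirm all Storage Nodes are connected and resolve alerts, "
            "then restart the procedure.",
            f"rebalance-data start --job-id {job_id}",
        )
    # Any other value is outside what step 6 describes.
    raise ValueError(f"Unrecognized state: {state!r}")


# "example-job-id" is a placeholder, not a real job ID format.
advice, command = next_step("Failure", "example-job-id")
print(advice)
print(command)
```

If the restarted procedure still reports Failure, the documented next step is to contact technical support rather than to retry again.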