Decommissioning disconnected grid nodes

You might need to decommission a node that is not currently connected to the grid (one whose Health is Unknown or Administratively Down).

Before you begin

About this task

You can identify disconnected nodes by looking for Unknown (blue) or Administratively Down (gray) icons in the Health column. In the example, the Storage Node named DC1-S4 is disconnected; all of the other nodes are connected.


Decommission Nodes page with one node disconnected
Before decommissioning any disconnected node, note the following:
  • This procedure is primarily intended for removing a single disconnected node. If your grid contains multiple disconnected nodes, the software requires you to decommission them all at the same time, which increases the potential for unexpected results.
    Attention: Be very careful when decommissioning more than one disconnected grid node at a time, especially if you are selecting multiple disconnected Storage Nodes.
  • If a disconnected node cannot be removed (for example, a Storage Node that is required for the ADC quorum), no other disconnected node can be removed.
Before decommissioning a disconnected Storage Node, note the following:
  • You should never decommission a disconnected Storage Node unless you are sure it cannot be brought online or recovered.
    Attention: If you believe that object data can still be recovered from the node, do not perform this procedure. Instead, contact technical support to determine if node recovery is possible.
  • If you decommission more than one disconnected Storage Node, data loss might occur. The system might not be able to reconstruct data if not enough object copies, erasure-coded fragments, or object metadata remain available.
    Attention: If you have more than one disconnected Storage Node that you cannot recover, contact technical support to determine the best course of action.
  • When you decommission a disconnected Storage Node, StorageGRID starts data repair jobs at the end of the decommissioning process. These jobs attempt to reconstruct the object data and metadata that was stored on the disconnected node.
  • When you decommission a disconnected Storage Node, the decommission procedure completes relatively quickly. However, the data repair jobs can take days or weeks to run and are not monitored by the decommission procedure. You must manually monitor these jobs and restart them as needed.
  • If you decommission a disconnected Storage Node that contains the only copy of an object, the object will be lost. The data repair jobs can only reconstruct and recover objects if at least one replicated copy or enough erasure-coded fragments exist on Storage Nodes that are currently connected.
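The recoverability rule in the last bullet can be sketched as a small check. This is an illustrative model only, not a StorageGRID API: the `ObjectPlacement` class, its fields, and the node names other than DC1-S4 are hypothetical, and the sketch assumes you already know where each object's copies and fragments are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectPlacement:
    """Hypothetical description of where one object's data is stored."""
    replica_nodes: frozenset = frozenset()  # nodes holding a full replicated copy
    fragment_nodes: tuple = ()              # one entry per erasure-coded fragment
    fragments_needed: int = 0               # fragments required to rebuild the object


def is_recoverable(obj, connected_nodes):
    """Return True if repair jobs could rebuild the object from connected nodes.

    An object survives if at least one replicated copy, or enough
    erasure-coded fragments, remain on currently connected Storage Nodes.
    """
    # Any replicated copy on a connected node is sufficient.
    if any(node in connected_nodes for node in obj.replica_nodes):
        return True
    # Otherwise, count the erasure-coded fragments that are still reachable.
    if obj.fragments_needed > 0:
        surviving = sum(1 for node in obj.fragment_nodes if node in connected_nodes)
        return surviving >= obj.fragments_needed
    return False


# Example: DC1-S4 is disconnected; the other three nodes are connected.
connected = {"DC1-S1", "DC1-S2", "DC1-S3"}
only_copy = ObjectPlacement(replica_nodes=frozenset({"DC1-S4"}))
two_copies = ObjectPlacement(replica_nodes=frozenset({"DC1-S1", "DC1-S4"}))
print(is_recoverable(only_copy, connected))   # the only copy is on DC1-S4
print(is_recoverable(two_copies, connected))  # a copy remains on DC1-S1
```

In this model, an object whose only copy is on the disconnected node is reported as unrecoverable, which is the case the warning above describes.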
Before decommissioning a disconnected Admin Node or Gateway Node, note the following:
Attention: Do not remove a grid node's virtual machine or other resources until instructed to do so in this procedure.

Procedure

  1. Attempt to bring any disconnected grid nodes back online or to recover them.
    See the recovery procedures for instructions.
  2. If you are unable to recover a disconnected grid node and you want to decommission it while it is disconnected, select the check box for that node.
    Note: If your grid contains multiple disconnected nodes, the software requires you to decommission them all at the same time, which increases the potential for unexpected results.
    Attention: Be very careful when you select more than one disconnected grid node to decommission at a time, especially if you are selecting multiple disconnected Storage Nodes. If you have more than one disconnected Storage Node that you cannot recover, contact technical support to determine the best course of action.
  3. Enter the provisioning passphrase.
    The Start Decommission button is enabled.
  4. Click Start Decommission.
    A warning appears, indicating that you have selected a disconnected node and that object data will be lost if the node has the only copy of an object.

    screenshot of decommission warning message
  5. Review the list of nodes, and click OK.
    The decommission procedure starts, and the progress is displayed for each node. During the procedure, a new Recovery Package is generated to show the grid configuration change.

    screenshot of node decommissioning in progress
  6. As soon as the new Recovery Package is available, click the link or select Maintenance > Recovery Package to access the Recovery Package page. Then, download the .zip file.
    See the instructions for downloading the Recovery Package.
    Note: Download the Recovery Package as soon as possible to ensure you can recover your grid if something goes wrong during the decommission procedure.
  7. Periodically monitor the Decommission page to ensure that all selected nodes are decommissioned successfully.
    Storage Nodes can take days or weeks to decommission. When all tasks are complete, the node selection list is redisplayed with a success message. If you decommissioned a disconnected Storage Node, an information message indicates that the repair jobs have been started.

    screenshot showing that repair jobs have started
  8. Remove any remaining virtual machines or other resources that are associated with the decommissioned node.
  9. If you are decommissioning a Storage Node, monitor the status of the data repair jobs that are automatically started during the decommissioning process.
    1. Select Support. Then, in the Tools section of the menu, select Grid Topology.
    2. Select StorageGRID deployment at the top of the Grid Topology tree.
    3. On the Overview tab, locate the ILM Activity section.
    4. Use a combination of the following attributes to determine, as well as possible, if replicated repairs are complete.
      Note: Cassandra inconsistencies might be present, and failed repairs are not tracked.
      • Repairs Attempted (XRPA): Use this attribute to track the progress of replicated repairs. This attribute increases each time a Storage Node tries to repair a high-risk object. When this attribute does not increase for a period longer than the current scan period (provided by the Scan Period – Estimated attribute), it means that ILM scanning found no high-risk objects that need to be repaired on any nodes.
        Note: High-risk objects are objects that are at risk of being completely lost. This does not include objects that do not satisfy their ILM configuration.
      • Scan Period – Estimated (XSCM): Use this attribute to estimate when a policy change will be applied to previously ingested objects. If the Repairs Attempted attribute does not increase for a period longer than the current scan period, it is probable that replicated repairs are done. Note that the scan period can change. The Scan Period – Estimated (XSCM) attribute applies to the entire grid and is the maximum of all node scan periods. You can query the Scan Period – Estimated attribute history for the grid to determine an appropriate time frame.
    5. Use the following commands to track or restart repairs:
      • Use the repair-data show-ec-repair-status command to track repairs of erasure-coded data.
      • Use the repair-data start-ec-node-repair command with the --repair-id option to restart a failed repair.
      See the instructions for checking data repair jobs.
  10. Continue to track the status of EC data repairs until all repair jobs have completed successfully.
    As soon as the disconnected nodes have been decommissioned and all data repair jobs have been completed, you can decommission any connected grid nodes as required.
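
The check for replicated repairs described in step 9 can be sketched as follows. This is a minimal illustration, assuming you record the Repairs Attempted (XRPA) value and the Scan Period – Estimated (XSCM) value by hand from the Grid Topology page; the function name and the sample format are hypothetical, and no StorageGRID API is used. Because failed repairs are not tracked, treat the result as an indication only.

```python
def replicated_repairs_probably_done(samples, scan_period_seconds):
    """Estimate whether replicated repairs have finished.

    samples: list of (timestamp_seconds, repairs_attempted) tuples, oldest
        first, recorded manually from the Repairs Attempted (XRPA) attribute.
    scan_period_seconds: current Scan Period - Estimated (XSCM) value.

    Returns True when Repairs Attempted has not increased for longer than
    the scan period, which suggests no high-risk objects remain.
    """
    if len(samples) < 2:
        return False  # not enough history to judge

    latest_time, latest_value = samples[-1]

    # Walk backwards to find when the current value was first observed.
    unchanged_since = latest_time
    for timestamp, value in reversed(samples):
        if value != latest_value:
            break
        unchanged_since = timestamp

    return (latest_time - unchanged_since) > scan_period_seconds


# Example: the counter stopped increasing at t=100 and was last read at t=900.
history = [(0, 10), (100, 12), (200, 12), (900, 12)]
print(replicated_repairs_probably_done(history, scan_period_seconds=600))
```

Because the scan period can change, it is safer to compare against the largest value seen in the attribute history than against a single reading.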

After you finish

Complete these steps after you complete the decommission procedure: