Decommissioning disconnected grid nodes

You might need to decommission a grid node that is not currently connected to the grid (one whose Health is Unknown or Administratively Down).

Before you begin

Before decommissioning a disconnected node, confirm that the node cannot be brought back online or recovered, and review the considerations in "About this task."

About this task

You can identify disconnected nodes by looking for Unknown (blue) or Administratively Down (gray) icons in the Health column. No health icons are shown for connected nodes. In addition, the Decommission Possible column shows "No, at least one grid node is disconnected" for the connected nodes. You cannot decommission a connected node while any nodes are disconnected.


screenshot of Decommission page with one node disconnected
Before decommissioning a disconnected node, note the following:
  • You should never decommission a disconnected node unless you are sure it cannot be brought online or recovered.
  • You can safely decommission an API Gateway Node while it is disconnected.
  • If you decommission an Admin Node that is disconnected, you will lose the audit logs from that node; however, these logs should also exist on the primary Admin Node.
  • When you decommission a Storage Node that is disconnected, StorageGRID Webscale starts data repair jobs at the end of the decommissioning process. These jobs attempt to reconstruct the object data and metadata that was stored on the disconnected node.
  • When you decommission a disconnected Storage Node, the decommission procedure completes relatively quickly. However, the data repair jobs can take days or weeks to run and are not monitored by the decommission procedure. You must manually monitor these jobs and restart them as needed.
  • If you decommission a Storage Node that is disconnected and that node contains the only copy of an object, the object will be lost. The data repair jobs can only reconstruct and recover objects if at least one replicated copy or enough erasure-coded fragments exist on Storage Nodes that are currently connected. A worked example follows the attention note below.
  • If you attempt to decommission more than one disconnected Storage Node at a time, you increase the risk of unexpected results and data loss. The system might not be able to reconstruct data if too few copies of object data, metadata, or EC fragments remain available.
Attention: Do not remove a grid node's virtual machine or other resources until instructed to do so in this procedure.
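
To make the risk concrete, consider a hypothetical example; the 4+2 erasure-coding scheme and the fragment counts below are illustrative values, not numbers read from StorageGRID. With 4+2 erasure coding, an object is stored as 4 data fragments plus 2 parity fragments, and any 4 of the 6 fragments are sufficient to reconstruct it:

  # Illustrative recoverability check for a hypothetical 4+2 erasure-coding scheme.
  # k = fragments needed, n = total fragments, available = fragments remaining on
  # Storage Nodes that are still connected. All values here are made-up examples.
  k=4; n=6; available=3
  if [ "$available" -ge "$k" ]; then
      echo "recoverable: $available of $n fragments remain; $k needed"
  else
      echo "lost: $available of $n fragments remain; $k needed"
  fi

In this example, the disconnected node held 3 of the 6 fragments, so only 3 remain and the object cannot be rebuilt. A node that held 2 or fewer of the fragments could have been decommissioned without losing the object, provided the remaining nodes stayed connected.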

Steps

  1. Attempt to bring any disconnected grid nodes back online or to recover them.
    See "Recovery procedures" for instructions.
  2. If you are unable to recover a disconnected grid node and you want to decommission it while it is disconnected, select the check box for that node.
    Attention: Be very careful when selecting to decommission more than one disconnected grid node at a time, especially if you are selecting multiple disconnected Storage Nodes. If you have more than one disconnected Storage Node that you cannot recover, contact technical support to determine the best course of action.
  3. Enter the provisioning passphrase.
    The Start Decommission button is enabled.
  4. Click Start Decommission.
    A warning appears, indicating that you have selected a disconnected node and that object data will be lost if the node has the only copy of an object.
    screenshot of decommission warning message
  5. Review the list of nodes, and click OK.
    The decommission procedure starts, and the progress is displayed for each node. During the procedure, a new Recovery Package is generated to show the grid configuration change.
    screenshot of node decommissioning in progress

  6. As soon as the new Recovery Package is available, click the link or select Maintenance > Recovery Package to access the Recovery Package page. Then, download the .zip file.
    See "Downloading the Recovery Package" for instructions.
    Note: Download the Recovery Package as soon as possible to ensure you can recover your grid if something goes wrong during the decommission procedure.
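    Because the Recovery Package is your safety net for this procedure, you might also confirm that the downloaded archive is intact before continuing. This sketch uses standard Linux tools rather than StorageGRID commands, and the filename pattern is illustrative:
      # Test that the downloaded archive is readable and not truncated.
      unzip -t sgws-recovery-package-*.zip
      # Record a checksum alongside any backup copy you keep of the file.
      sha256sum sgws-recovery-package-*.zip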
  7. Periodically monitor the Decommission page to ensure that all selected nodes are decommissioned successfully.
    Storage Nodes can take days or weeks to decommission. When all tasks are complete, the node selection list is redisplayed with a success message. If you decommissioned a disconnected Storage Node, an information message indicates that the repair jobs have been started.
    screenshot showing that repair jobs have started
  8. Remove any remaining virtual machines or other resources that are associated with the decommissioned node.
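    For example, if the node ran as a KVM virtual machine managed through libvirt, removal might look like the following sketch. The domain name is hypothetical, and the equivalent steps differ on vSphere or other hypervisors:
      # Illustrative libvirt commands; "sgws-storage-node-3" is a made-up VM name.
      virsh destroy sgws-storage-node-3                        # force the VM off
      virsh undefine sgws-storage-node-3 --remove-all-storage  # delete the VM and its disks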
  9. If you are decommissioning a Storage Node, monitor the status of the data repair jobs that are automatically started during the decommissioning process.
    1. Select Support > Grid Topology.
    2. Select StorageGRID Webscale deployment at the top of the Grid Topology tree.
    3. On the Overview tab, locate the ILM Activity section.
    4. Use a combination of the following attributes to monitor repairs and to determine, as closely as possible, whether replicated repairs are complete:
      • Use the Repairs Attempted (XRPA) attribute to track the progress of replicated repairs. This attribute increases each time the LDR service tries to repair a high-risk object. When this attribute does not increase for a period longer than the current scan period (provided by the Scan Period - Estimated attribute), it means that ILM scanning found no high-risk objects that need to be repaired on any nodes.
        Note: High-risk objects are objects that are at risk of being completely lost. This does not include objects that do not satisfy their ILM configuration.
      • Use the Scan Period - Estimated (XSCM) attribute to estimate when a policy change will be applied to previously ingested objects. If the Repairs Attempted attribute does not increase for a period longer than the current scan period, it is probable that replicated repairs are done. Note that the scan period can change. The Scan Period - Estimated (XSCM) attribute is at the Summary level and is the maximum of all node scan periods. You can query the Scan Period - Estimated attribute history at the Summary level to determine an appropriate timeframe for your grid.
    5. Use the repair-data show-ec-repair-status command to track repairs of erasure-coded data. Use the repair-data start-ec-node-repair command with the --repair-id option to restart a failed repair.
      See "Checking data repair jobs" for instructions.
  10. Continue to track the status of EC data repairs until all repair jobs have completed successfully.
    As soon as the disconnected nodes have been decommissioned and all data repair jobs have been completed, you can decommission any connected grid nodes as required.

After you finish

After you complete the decommission procedure, ensure that the drives of the decommissioned grid node are wiped clean. Use a commercially available data wiping tool or service to permanently and securely remove data from the drives.
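
If you wipe the drives yourself rather than using a wiping service, one common approach on Linux is to overwrite each device in place. This is a generic sketch, not a StorageGRID-specific procedure, and /dev/sdX is a placeholder:

  # Overwrite the drive with multiple passes. This destroys all data on the
  # device, so double-check the device name before running.
  shred -v -n 3 /dev/sdX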