Decommission disconnected grid nodes
You might need to decommission a node that is not currently connected to the grid (one whose Health is Unknown or Administratively Down).
- You understand the considerations for decommissioning Admin and Gateway Nodes and the considerations for decommissioning Storage Nodes.
- You have obtained all prerequisite items.
- You have ensured that no data repair jobs are active. See Check data repair jobs.
- You have confirmed that Storage Node recovery is not in progress anywhere in the grid. If it is, you must wait until any Cassandra rebuild performed as part of the recovery is complete. You can then proceed with decommissioning.
- You have ensured that other maintenance procedures will not be run while the node decommission procedure is running, unless the node decommission procedure is paused.
- The Decommission Possible column for the disconnected node or nodes you want to decommission includes a green check mark.
- You have the provisioning passphrase.
You can identify disconnected nodes by looking for the blue Unknown icon or the gray Administratively down icon in the Health column.
Before decommissioning any disconnected node, note the following:
- This procedure is primarily intended for removing a single disconnected node. If your grid contains multiple disconnected nodes, the software requires you to decommission them all at the same time, which increases the potential for unexpected results.
- Data loss might occur if you decommission more than one disconnected Storage Node at a time. See Considerations for disconnected Storage Nodes.
- Use caution when you decommission Storage Nodes in a grid containing software-based metadata-only nodes. If you decommission all nodes configured to store both objects and metadata, the ability to store objects is removed from the grid. See Types of Storage Nodes for more information about metadata-only Storage Nodes.
- If a disconnected node can't be removed (for example, a Storage Node that is required for the ADC quorum), no other disconnected node can be removed.
- Unless you are decommissioning an Archive Node (which must be disconnected), attempt to bring any disconnected grid nodes back online or recover them. See Grid node recovery procedures for instructions.
- If you are unable to recover a disconnected grid node and you want to decommission it while it is disconnected, select the checkbox for that node.
  Be careful when choosing to decommission more than one disconnected grid node at a time, especially if you are selecting multiple disconnected Storage Nodes. If you have more than one disconnected Storage Node that you can't recover, contact technical support to determine the best course of action.
- Enter the provisioning passphrase.
  The Start Decommission button is enabled.
- Click Start Decommission.
  A warning appears, indicating that you have selected a disconnected node and that object data will be lost if the node has the only copy of an object.
- Review the list of nodes, and click OK.
  The decommission procedure starts, and the progress is displayed for each node. During the procedure, a new recovery package is generated containing the grid configuration change.
- As soon as the new recovery package is available, click the link or select Maintenance > System > Recovery package to access the recovery package page. Then, download the .zip file. See the instructions for downloading the recovery package.
  Download the recovery package as soon as possible to ensure you can recover your grid if something goes wrong during the decommission procedure. The recovery package file must be secured because it contains encryption keys and passwords that can be used to obtain data from the StorageGRID system.
- Periodically monitor the Decommission page to ensure that all selected nodes are decommissioned successfully.
  Storage Nodes can take days or weeks to decommission. When all tasks are complete, the node selection list is redisplayed with a success message. If you decommissioned a disconnected Storage Node, an information message indicates that the repair jobs have been started.
- After the nodes have shut down automatically as part of the decommission procedure, remove any remaining virtual machines or other resources that are associated with the decommissioned node.
  Don't perform this step until the nodes have shut down automatically.
- If you are decommissioning a Storage Node, monitor the status of the replicated data and erasure-coded (EC) data repair jobs that are automatically started during the decommissioning process.
  - To get an estimated percent completion for the replicated repair, add the show-replicated-repair-status option to the repair-data command:
    repair-data show-replicated-repair-status
  - To determine if repairs are complete:
    - Select Nodes > Storage Node being repaired > ILM.
    - Review the attributes in the Evaluation section. When repairs are complete, the Awaiting - All attribute indicates 0 objects.
  - To monitor the repair in more detail:
    - Select Nodes.
    - Select grid name > ILM.
    - Position your cursor over the ILM queue graph to see the value of the Scan rate (objects/sec) attribute, which is the rate at which objects in the grid are scanned and queued for ILM.
    - In the ILM Queue section, look at the following attributes:
      - Scan period - estimated: The estimated time to complete a full ILM scan of all objects. A full scan doesn't guarantee that ILM has been applied to all objects.
      - Repairs attempted: The total number of attempted object repair operations for replicated data that are considered high risk. High-risk objects are any objects with one copy remaining, whether specified by the ILM policy or as a result of lost copies. This count increments each time a Storage Node tries to repair a high-risk object. High-risk ILM repairs are prioritized if the grid becomes busy. The same object repair might increment again if replication failed after the repair.
    These attributes can be useful when you are monitoring the progress of Storage Node volume recovery. If the number of repairs attempted has stopped increasing and a full scan has been completed, the repair has probably completed.
  - Alternatively, submit a Prometheus query for storagegrid_ilm_scan_period_estimated_minutes and storagegrid_ilm_repairs_attempted.
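The replicated-repair status check above lends itself to a simple polling loop. The following is a minimal sketch, assuming you run it from a shell where repair-data is available and that the status output includes a percent-complete figure; the exact wording of that output varies by release, so treat the matched pattern as an assumption to adjust.

```shell
# repair_done: succeeds (exit 0) once the status text on stdin
# reports 100% completion. The "100%" pattern is an assumption
# about the repair-data output format; verify it on your system.
repair_done() {
  grep -q '100%'
}

# Example polling loop (not run here; requires repair-data):
#   until repair-data show-replicated-repair-status | repair_done; do
#     sleep 300   # check every five minutes
#   done
```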
  - To monitor the repair of erasure-coded data and retry any requests that might have failed:
    - Determine the status of erasure-coded data repairs:
      - Select Support > Tools > Metrics to view the estimated time to completion and the completion percentage for the current job. Then, select EC Overview in the Grafana section. Look at the Grid EC Job Estimated Time to Completion and Grid EC Job Percentage Completed dashboards.
      - Use this command to see the status of a specific repair-data operation:
        repair-data show-ec-repair-status --repair-id repair ID
      - Use this command to list all repairs:
        repair-data show-ec-repair-status
        The output lists information, including repair ID, for all previously and currently running repairs.
    - If the output shows that the repair operation failed, use the --repair-id option to retry the repair.
      This command retries a failed node repair, using the repair ID 6949309319275667690:
        repair-data start-ec-node-repair --repair-id 6949309319275667690
      This command retries a failed volume repair, using the repair ID 6949309319275667690:
        repair-data start-ec-volume-repair --repair-id 6949309319275667690
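If several EC repairs have failed, the listing and retry commands above can be combined in a small script. This is a sketch, not part of the product CLI: it assumes the show-ec-repair-status listing is a table with the repair ID in the first column and a state field containing the word failure, so verify the actual column layout of your repair-data release before relying on it.

```shell
# failed_repair_ids: reads "repair-data show-ec-repair-status"
# output on stdin and prints the first column (assumed to be the
# repair ID) of every data row whose state mentions "failure".
# The column layout is an assumption; adjust the awk pattern to
# match your release's output.
failed_repair_ids() {
  awk 'NR > 1 && /failure/ { print $1 }'
}

# Retry every failed node repair (not run here; requires repair-data):
#   repair-data show-ec-repair-status | failed_repair_ids |
#   while read -r id; do
#     repair-data start-ec-node-repair --repair-id "$id"
#   done
```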
As soon as the disconnected nodes have been decommissioned and all data repair jobs have been completed, you can decommission any connected grid nodes as required.
After you complete the decommission procedure:
- Ensure that the drives of the decommissioned grid node are wiped clean. Use a commercially available data wiping tool or service to permanently and securely remove data from the drives.
- If you decommissioned an appliance node and the data on the appliance was protected using node encryption, use the StorageGRID Appliance Installer to clear the key management server configuration (Clear KMS). You must clear the KMS configuration if you want to add the appliance to another grid. For instructions, see Monitor node encryption in maintenance mode.