Restore Prometheus metrics when recovering non-primary Admin Node

09/28/2021

Optionally, you can retain the historical metrics maintained by Prometheus on a non-primary Admin Node that has failed.

The recovered Admin Node must be installed and running.
The StorageGRID system must include at least two Admin Nodes.
You must have the Passwords.txt file.
You must have the provisioning passphrase.
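
Before you start, you might find it useful to script a quick check of the prerequisites that can be verified locally. The following is a minimal sketch only, not a StorageGRID tool: `preflight_check` is a hypothetical helper name, and the path to Passwords.txt and both IP addresses are values you supply. It cannot confirm that the recovered Admin Node is running or that you have the provisioning passphrase.

```shell
#!/bin/sh
# Hypothetical helper (not part of StorageGRID): checks only what can be
# verified locally before starting the procedure.
#   $1 = path to your copy of Passwords.txt
#   $2 = IP address of the source (primary) Admin Node
#   $3 = IP address of the recovered Admin Node
preflight_check() {
    passwords_file="$1"
    source_ip="$2"
    recovered_ip="$3"

    # The procedure needs passwords from Passwords.txt.
    if [ ! -r "$passwords_file" ]; then
        echo "Passwords.txt is not readable: $passwords_file" >&2
        return 1
    fi

    # Both Admin Node addresses must be supplied.
    if [ -z "$source_ip" ] || [ -z "$recovered_ip" ]; then
        echo "Both the source and recovered Admin Node IPs are required" >&2
        return 1
    fi

    # The grid must have at least two Admin Nodes, so the addresses differ.
    if [ "$source_ip" = "$recovered_ip" ]; then
        echo "Source and recovered Admin Node must be different nodes" >&2
        return 1
    fi

    echo "preflight ok"
}
```

For example, `preflight_check ./Passwords.txt 192.0.2.10 192.0.2.11` prints `preflight ok` when the file is readable and the two addresses differ. The addresses shown are placeholders.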

If an Admin Node fails, the metrics maintained in the Prometheus database on the Admin Node are lost. When you recover the Admin Node, the software installation process creates a new Prometheus database. After the recovered Admin Node is started, it records metrics as if you had performed a new installation of the StorageGRID system.

If you restored a non-primary Admin Node, you can restore the historical metrics by copying the Prometheus database from the primary Admin Node (the source Admin Node) to the recovered Admin Node.

Copying the Prometheus database might take an hour or more. Some Grid Manager features will be unavailable while services are stopped on the source Admin Node.

Log in to the source Admin Node:
1. Enter the following command: ssh admin@grid_node_IP
2. Enter the password listed in the Passwords.txt file.
3. Enter the following command to switch to root: su -
4. Enter the password listed in the Passwords.txt file.
From the source Admin Node, stop the Prometheus service: service prometheus stop
Complete the following steps on the recovered Admin Node:
1. Log in to the recovered Admin Node:
  1. Enter the following command: ssh admin@grid_node_IP
  2. Enter the password listed in the Passwords.txt file.
  3. Enter the following command to switch to root: su -
  4. Enter the password listed in the Passwords.txt file.
2. Stop the Prometheus service: service prometheus stop
3. Add the SSH private key to the SSH agent. Enter: ssh-add
4. Enter the SSH Access Password listed in the Passwords.txt file.
5. Copy the Prometheus database from the source Admin Node to the recovered Admin Node: /usr/local/prometheus/bin/prometheus-clone-db.sh Source_Admin_Node_IP
6. When prompted, press Enter to confirm that you want to destroy the new Prometheus database on the recovered Admin Node.
  
  The original Prometheus database and its historical data are copied to the recovered Admin Node. When the copy operation is done, the script starts the recovered Admin Node. The following status appears:
  
  Database cloned, starting services
7. When you no longer require passwordless access to other servers, remove the private key from the SSH agent. Enter: ssh-add -D
Restart the Prometheus service on the source Admin Node: service prometheus start
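
Because the procedure alternates between two nodes, it can help to see the whole command sequence in one place. The following sketch prints the commands from the steps above in order, labeled with the node on which each one runs. It does not execute anything and does not handle the login or password prompts. `print_restore_plan` is a hypothetical helper name, not a StorageGRID command, and the argument is the source Admin Node IP that you supply.

```shell
#!/bin/sh
# Hypothetical checklist generator (not part of StorageGRID).
# Prints the commands from the procedure in order; nothing is executed.
#   $1 = IP address of the source (primary) Admin Node
print_restore_plan() {
    source_ip="$1"
    if [ -z "$source_ip" ]; then
        echo "usage: print_restore_plan Source_Admin_Node_IP" >&2
        return 1
    fi

    # Each line is tagged with the node where you run the command as root.
    cat <<EOF
[source] service prometheus stop
[recovered] service prometheus stop
[recovered] ssh-add
[recovered] /usr/local/prometheus/bin/prometheus-clone-db.sh $source_ip
[recovered] ssh-add -D
[source] service prometheus start
EOF
}
```

For example, `print_restore_plan 192.0.2.10` lists six commands, with the placeholder address substituted into the clone step. The ordering reflects the procedure: Prometheus is stopped on both nodes before the database is copied, and the source service is restarted only after the copy is complete.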
