Run Element storage health checks prior to upgrading storage

11/10/2023

Run health checks before you upgrade Element storage to confirm that all storage nodes in your cluster are ready for the upgrade.

What you'll need

Management services: You have updated to the latest management services bundle (2.10.27 or later).

You must upgrade to the latest management services bundle before upgrading your Element software.
Management node: You are running management node 11.3 or later.
Element software: Your cluster is running NetApp Element software 11.3 or later.
End User License Agreement (EULA): Beginning with management services 2.20.69, you must accept and save the EULA before using the NetApp Hybrid Cloud Control UI or API to run Element storage health checks:
1. Open the IP address of the management node in a web browser:
  https://<ManagementNodeIP>
2. Log in to NetApp Hybrid Cloud Control by providing the storage cluster administrator credentials.
3. Select Upgrade near the top right of the interface.
4. The EULA pops up. Scroll down, select I accept for current and all future updates, and select Save.

Health check options

You can run health checks using the NetApp Hybrid Cloud Control UI or the NetApp Hybrid Cloud Control API:

Use NetApp Hybrid Cloud Control to run Element storage health checks prior to upgrading storage (Preferred method)

Use API to run Element storage health checks prior to upgrading storage

You can also find out more about storage health checks that are run by the service:

Storage health checks made by the service

Use NetApp Hybrid Cloud Control to run Element storage health checks prior to upgrading storage

Using NetApp Hybrid Cloud Control, you can verify that a storage cluster is ready to be upgraded.

Steps

Open the IP address of the management node in a web browser:
```
https://<ManagementNodeIP>
```
Log in to NetApp Hybrid Cloud Control by providing the storage cluster administrator credentials.
Select Upgrade near the top right of the interface.
On the Upgrades page, select the Storage tab.
Select the health check for the cluster you want to check for upgrade readiness.
On the Storage Health Check page, select Run Health Check.
If there are issues, do the following:
1. Go to the specific KB article listed for each issue or perform the specified remedy.
2. If a KB is specified, complete the process described in the relevant KB article.
3. After you have resolved cluster issues, select Re-Run Health Check.

After the health check completes without errors, the storage cluster is ready to upgrade. See storage node upgrade instructions to proceed.

Use API to run Element storage health checks prior to upgrading storage

You can use the REST API to verify that a storage cluster is ready to be upgraded. The health check verifies that there are no obstacles to upgrading, such as pending nodes, disk space issues, and cluster faults.

Steps

Locate the storage cluster ID:
1. Open the management node REST API UI on the management node:
  https://<ManagementNodeIP>/mnode
2. Select Authorize and complete the following:
  1. Enter the cluster user name and password.
  2. Enter the client ID as mnode-client if the value is not already populated.
  3. Select Authorize to begin a session.
  4. Close the authorization window.
3. From the REST API UI, select GET /assets.
4. Select Try it out.
5. Select Execute.
6. From the response, copy the "id" from the "storage" section of the cluster you intend to check for upgrade readiness.
  
  Do not use the "parent" value in this section because this is the management node’s ID, not the storage cluster’s ID.

  "config": {},
  "credentialid": "12bbb2b2-f1be-123b-1234-12c3d4bc123e",
  "host_name": "SF_DEMO",
  "id": "12cc3a45-e6e7-8d91-a2bb-0bdb3456b789",
  "ip": "10.123.12.12",
  "parent": "d123ec42-456e-8912-ad3e-4bd56f4a789a",
  "sshcredentialid": null,
  "ssl_certificate": null

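If you fetch or save the GET /assets response in a script, the lookup described above can be sketched in Python. This is a minimal sketch rather than an official client: it assumes the response is a JSON list of asset objects, each with a "storage" list whose entries carry the "id", "host_name", and "ip" fields shown in the example. The function name `storage_cluster_ids` and the outer shape of the sample are illustrative assumptions.

```python
import json


def storage_cluster_ids(assets_json):
    """Return (host_name, ip, id) for every storage cluster in a GET /assets response.

    Assumption: the response is a JSON list of assets, each with a "storage" list.
    The "parent" field is deliberately ignored: it is the management node's ID.
    """
    assets = json.loads(assets_json)
    clusters = []
    for asset in assets:
        for storage in asset.get("storage", []):
            clusters.append((storage.get("host_name"), storage.get("ip"), storage["id"]))
    return clusters


# Sample shaped like the documentation example (placeholder values only).
sample = json.dumps([{
    "id": "d123ec42-456e-8912-ad3e-4bd56f4a789a",
    "storage": [{
        "config": {},
        "host_name": "SF_DEMO",
        "id": "12cc3a45-e6e7-8d91-a2bb-0bdb3456b789",
        "ip": "10.123.12.12",
        "parent": "d123ec42-456e-8912-ad3e-4bd56f4a789a",
    }],
}])

for host_name, ip, cluster_id in storage_cluster_ids(sample):
    print(host_name, ip, cluster_id)
```

Printing the host name and IP next to each ID makes it easier to pick the right cluster when the management node tracks more than one.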
Run health checks on the storage cluster:

Open the storage REST API UI on the management node:
```
https://<ManagementNodeIP>/storage/1/
```
Select Authorize and complete the following:
1. Enter the cluster user name and password.
2. Enter the client ID as mnode-client if the value is not already populated.
3. Select Authorize to begin a session.
4. Close the authorization window.
Select POST /health-checks.
Select Try it out.

In the parameter field, enter the storage cluster ID that you copied from the GET /assets response:

{
  "config": {},
  "storageId": "123a45b6-1a2b-12a3-1234-1a2b34c567d8"
}

Select Execute to run a health check on the specified storage cluster.

The response should indicate state as initializing:

{
  "_links": {
    "collection": "https://10.117.149.231/storage/1/health-checks",
    "log": "https://10.117.149.231/storage/1/health-checks/358f073f-896e-4751-ab7b-ccbb5f61f9fc/log",
    "self": "https://10.117.149.231/storage/1/health-checks/358f073f-896e-4751-ab7b-ccbb5f61f9fc"
  },
  "config": {},
  "dateCompleted": null,
  "dateCreated": "2020-02-21T22:11:15.476937+00:00",
  "healthCheckId": "358f073f-896e-4751-ab7b-ccbb5f61f9fc",
  "state": "initializing",
  "status": null,
  "storageId": "c6d124b2-396a-4417-8a47-df10d647f4ab",
  "taskId": "73f4df64-bda5-42c1-9074-b4e7843dbb77"
}

Copy the healthCheckId value from the response.
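For scripted use, the same request can be prepared outside the REST API UI. The following Python sketch builds the POST request and pulls healthCheckId out of a response body. The URL path is taken from the "_links" values in the sample response. Sending the session token as a bearer token is an assumption, as are the helper names `build_health_check_request` and `health_check_id`. The sketch does not cover obtaining the token.

```python
import json
import urllib.request


def build_health_check_request(mnode_ip, storage_id, token):
    """Build (but do not send) the POST /health-checks request."""
    url = f"https://{mnode_ip}/storage/1/health-checks"
    body = json.dumps({"config": {}, "storageId": storage_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the session token is passed as a bearer token.
            "Authorization": f"Bearer {token}",
        },
    )


def health_check_id(response_text):
    """Extract healthCheckId from the JSON response body."""
    return json.loads(response_text)["healthCheckId"]


request = build_health_check_request(
    "10.117.149.231", "123a45b6-1a2b-12a3-1234-1a2b34c567d8", "<token>"
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request).read().decode("utf-8")
```

Keeping request construction separate from sending lets you inspect the URL and body before anything reaches the cluster.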

Verify the results of the health checks:
1. Select GET /health-checks/{healthCheckId}.
2. Select Try it out.
3. Enter the health check ID in the parameter field.
4. Select Execute.
5. Scroll to the bottom of the response body.
  
  If all health checks are successful, the return is similar to the following example:

  "message": "All checks completed successfully.",
  "percent": 100,
  "timestamp": "2020-03-06T00:03:16.321621Z"
If the returned message indicates problems with cluster health, do the following:
1. Select GET /health-checks/{healthCheckId}/log.
2. Select Try it out.
3. Enter the health check ID in the parameter field.
4. Select Execute.
5. Review any specific errors and obtain their associated KB article links.
6. Go to the specific KB article listed for each issue or perform the specified remedy.
7. If a KB is specified, complete the process described in the relevant KB article.
8. After you have resolved cluster issues, run GET /health-checks/{healthCheckId}/log again.
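Because a new health check starts in the initializing state, a script that verifies the results has to repeat GET /health-checks/{healthCheckId} until the check has finished. The polling loop below is a sketch under stated assumptions: `fetch` is a hypothetical callable that returns the parsed JSON response as a dict, and "dateCompleted" is assumed to stay null until the check is done. The source shows it as null only in the initializing response.

```python
import time


def wait_for_health_check(fetch, interval_seconds=10, max_attempts=60, sleep=time.sleep):
    """Poll until the health check reports a completion date, then return the response.

    fetch: hypothetical callable returning the parsed JSON (a dict) of
           GET /health-checks/{healthCheckId}.
    Assumption: "dateCompleted" is null (None) while the check is still running.
    """
    for _ in range(max_attempts):
        result = fetch()
        if result.get("dateCompleted") is not None:
            return result
        sleep(interval_seconds)
    raise TimeoutError("health check did not complete in time")


# Stand-in for real HTTP calls: two in-progress responses, then a completed one.
responses = iter([
    {"dateCompleted": None},
    {"dateCompleted": None},
    {"dateCompleted": "2020-03-06T00:03:16.321621Z"},
])
final = wait_for_health_check(lambda: next(responses), sleep=lambda seconds: None)
print(final["dateCompleted"])
```

Passing `fetch` and `sleep` in as parameters keeps the loop independent of any particular HTTP library and lets you exercise it without a cluster.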

Storage health checks made by the service

Storage health checks make the following checks per cluster.

Check Name	Node/Cluster	Description
check_async_results	Cluster	Verifies that the number of asynchronous results in the database is below a threshold number.
check_cluster_faults	Cluster	Verifies that there are no upgrade-blocking cluster faults (as defined in Element source).
check_upload_speed	Node	Measures the upload speed between the storage node and the management node.
connection_speed_check	Node	Verifies that nodes have connectivity to the management node serving upgrade packages and estimates connection speed.
check_cores	Node	Checks for kernel crash dumps and core files on the node. The check fails if any crash occurred within the threshold period (the last 7 days).
check_root_disk_space	Node	Verifies the root file system has sufficient free space to perform an upgrade.
check_var_log_disk_space	Node	Verifies that `/var/log` free space meets a percentage-free threshold. If it does not, the check rotates and purges older logs to bring usage back within the threshold. The check fails if it cannot free enough space.
check_pending_nodes	Cluster	Verifies that there are no pending nodes on the cluster.
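When you review failures, it can help to separate cluster-wide problems from per-node ones. The Python sketch below transcribes the Node/Cluster column of the table into a lookup and groups a list of failed check names by scope. How check names appear in the health check log is not specified here, so the input list is a hypothetical example and `group_by_scope` is an illustrative helper.

```python
# Scope of each check, transcribed from the table above.
CHECK_SCOPES = {
    "check_async_results": "Cluster",
    "check_cluster_faults": "Cluster",
    "check_upload_speed": "Node",
    "connection_speed_check": "Node",
    "check_cores": "Node",
    "check_root_disk_space": "Node",
    "check_var_log_disk_space": "Node",
    "check_pending_nodes": "Cluster",
}


def group_by_scope(failed_checks):
    """Group failed check names under 'Cluster', 'Node', or 'Unknown'."""
    grouped = {}
    for name in failed_checks:
        scope = CHECK_SCOPES.get(name, "Unknown")
        grouped.setdefault(scope, []).append(name)
    return grouped


# Hypothetical failures picked out of a health check log.
print(group_by_scope(["check_cores", "check_pending_nodes", "check_root_disk_space"]))
```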
