Alerts reference

The following table lists all default StorageGRID alerts. As required, you can create custom alert rules to fit your system management approach.

For information about the Prometheus metrics used in some of these alerts, see "Commonly used Prometheus metrics."

Each entry below lists the alert name, the related alarm (or none), and the description with recommended actions.
Cloud Storage Pool connectivity error none

The health check for Cloud Storage Pools detected one or more new errors.
  1. Go to the Cloud Storage Pools section of the Storage Pools page.
  2. Look at the Last Error column to determine which Cloud Storage Pool has an error.
  3. See the instructions for administering StorageGRID.

Administering StorageGRID

Expiration of server certificate for Management Interface MCEP

The server certificate used for the management interface is about to expire.
  1. Go to Configuration > Server Certificates.
  2. In the Management Interface Server Certificate section, upload a new certificate.

Administering StorageGRID

Expiration of server certificate for Storage API Endpoints SCEP

The server certificate used for accessing storage API endpoints is about to expire.
  1. Go to Configuration > Server Certificates.
  2. In the Object Storage API Service Endpoints Server Certificate section, upload a new certificate.

Administering StorageGRID
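
For both certificate expiration alerts, it can help to know how many days remain before the certificate's notAfter time. The following is a minimal sketch, not part of StorageGRID: the function names and the 30-day `warn_days` threshold are assumptions chosen for illustration, and the notAfter string is expected in the format that Python's `ssl` module reports for peer certificates.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Return the number of days between now_epoch and the notAfter time.

    not_after uses the format found in Python's ssl certificate dicts,
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now_epoch) / SECONDS_PER_DAY


def needs_renewal(not_after: str, now_epoch: float, warn_days: int = 30) -> bool:
    # warn_days is a hypothetical threshold, not a StorageGRID default.
    return days_until_expiry(not_after, now_epoch) <= warn_days
```

Pass `time.time()` as `now_epoch` to evaluate a certificate against the current time. A negative result from `days_until_expiry` means the certificate has already expired.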

Large audit queue AMQS

The disk queue for audit messages is full.

  1. Check the load on the system. If there have been a significant number of transactions, the alert should resolve itself over time, and you can ignore it.
  2. If the alert persists and increases in severity, view a chart of the queue size. If the number is steadily increasing over hours or days, the audit load has likely exceeded the audit capacity of the system.
  3. Reduce the client operation rate or decrease the number of audit messages logged by changing the audit level for Client Writes and Client Reads to Error or Off (Configuration > Audit).

Understanding audit messages
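
Deciding whether the audit queue size is "steadily increasing" can be approximated from a series of queue-size samples taken over hours or days. The sketch below is one possible heuristic under stated assumptions: the function name, the sample format, and the `min_rise_fraction` cutoff are all hypothetical and are not defined by StorageGRID.

```python
from typing import Sequence


def is_steadily_increasing(samples: Sequence[float],
                           min_rise_fraction: float = 0.8) -> bool:
    """Return True if most consecutive samples rise and the last exceeds the first.

    samples is an ordered series of queue sizes, oldest first.
    min_rise_fraction is an assumed cutoff for how many steps must go up.
    """
    if len(samples) < 3:
        return False  # too few points to call a trend
    steps = len(samples) - 1
    rises = sum(1 for prev, cur in zip(samples, samples[1:]) if cur > prev)
    return samples[-1] > samples[0] and rises / steps >= min_rise_fraction
```

A burst of transactions that drains afterward fails the check because the last sample does not exceed the first, which matches the case where the alert resolves on its own.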

Low audit log disk capacity VMFR

The space available for audit logs is low.

  1. Monitor this alert to see if the issue resolves on its own and the disk space becomes available again.
  2. Contact technical support if the available space continues to decrease.

Low available node memory TMEM

The amount of RAM available on a node is low.

Low available RAM could indicate a change in the workload or a memory leak on one or more nodes.
  1. Monitor this alert to see if the issue resolves on its own.
  2. If the available memory falls below the major alert threshold, contact technical support.

Low installed node memory UMEM

The amount of installed memory on a node is low.

Increase the amount of RAM available to the virtual machine or Linux host. Check the threshold value for the major alert to determine the default minimum requirement for a StorageGRID node. See the installation instructions for your platform.

Low metadata query performance CQST

The average time for Cassandra metadata queries is too long.

An increase in query latency can be caused by a hardware change, such as replacing a disk, or a workload change, such as a sudden increase in ingests.
  1. Determine if there were any hardware or workload changes around the time the query latency increased.
  2. If you are unable to resolve the problem, contact technical support.

Low metadata storage CDLP

The space available for storing object metadata is low.

Critical alert
  1. Stop ingesting objects.
  2. Immediately add Storage Nodes in an expansion procedure.

Major alert

Immediately add Storage Nodes in an expansion procedure.

Minor alert
  1. Monitor the rate at which object metadata space is being used. Select Nodes > Storage Nodes > Storage, and view the Storage Used - Object Metadata graph.
  2. Add Storage Nodes in an expansion procedure as soon as possible.

Once new Storage Nodes are added, the system automatically rebalances object metadata across all Storage Nodes, and the alarm clears.

Monitoring object metadata capacity for each Storage Node

Expanding a StorageGRID system
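
The severity-specific actions for the Low metadata storage alert can be captured as a simple lookup, for example in a runbook script. This is a minimal sketch: the dictionary and function names are hypothetical, and the action text is copied from the entry above.

```python
from typing import Dict, List

# Recommended actions per severity, taken from the Low metadata storage entry.
RECOMMENDED_ACTIONS: Dict[str, List[str]] = {
    "critical": [
        "Stop ingesting objects.",
        "Immediately add Storage Nodes in an expansion procedure.",
    ],
    "major": [
        "Immediately add Storage Nodes in an expansion procedure.",
    ],
    "minor": [
        "Monitor the rate at which object metadata space is being used.",
        "Add Storage Nodes in an expansion procedure as soon as possible.",
    ],
}


def actions_for(severity: str) -> List[str]:
    """Return the recommended actions for a severity name (case-insensitive)."""
    key = severity.strip().lower()
    if key not in RECOMMENDED_ACTIONS:
        raise ValueError(f"unknown severity: {severity!r}")
    return list(RECOMMENDED_ACTIONS[key])
```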

Low metrics disk capacity VMFR

The space available for the metrics database is low.

  1. Monitor this alert to see if the issue resolves on its own and the disk space becomes available again.
  2. Contact technical support if the available space continues to decrease.

Low object data storage SSTS

The space available for storing object data is low.

Perform an expansion procedure. You can add storage volumes (LUNs) to existing Storage Nodes, or you can add new Storage Nodes.

Troubleshooting Low object data storage alerts

Expanding a StorageGRID system

Low root disk capacity VMFR

The space available for the root disk is low.

  1. Monitor this alert to see if the issue resolves on its own and the disk space becomes available again.
  2. Contact technical support if the available space continues to decrease.

Low volume disk capacity VMFR

The space available for the /var/local mount point is low.

  1. Monitor this alert to see if the issue resolves on its own and the disk space becomes available again.
  2. Contact technical support if the available space continues to decrease.

Node network connectivity error NRER, NTER

Errors have occurred while transferring data between nodes.

Network connectivity errors might clear without manual intervention. Contact technical support if the errors do not clear.

Node not in sync with time source NTSO

The node's time is not in sync with the NTP time source.

Monitor the alert for 10 minutes to see if the issue resolves on its own. If the alert persists:
  1. Verify that you have specified at least four external NTP sources, each providing a Stratum 3 or better reference.
  2. Check that all NTP sources are operating normally.
  3. Verify the connection to the NTP sources. Make sure they are not blocked by a firewall.
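
The three NTP checks above can be sketched as a small validation routine. This is an illustration under stated assumptions: the `NtpSource` record and `check_ntp_sources` function are hypothetical, and the stratum and reachability values would have to come from whatever NTP tooling is available in your environment.

```python
from dataclasses import dataclass
from typing import List

MIN_SOURCES = 4   # at least four external NTP sources
MAX_STRATUM = 3   # Stratum 3 or better; lower numbers are better


@dataclass
class NtpSource:
    host: str        # hypothetical field names for illustration
    stratum: int
    reachable: bool


def check_ntp_sources(sources: List[NtpSource]) -> List[str]:
    """Return a list of problems found; an empty list means all checks pass."""
    problems = []
    if len(sources) < MIN_SOURCES:
        problems.append(
            f"only {len(sources)} sources configured; at least {MIN_SOURCES} required"
        )
    for src in sources:
        if src.stratum > MAX_STRATUM:
            problems.append(f"{src.host}: stratum {src.stratum} is worse than {MAX_STRATUM}")
        if not src.reachable:
            problems.append(f"{src.host}: not reachable (check firewall rules)")
    return problems
```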
Objects lost LOST

One or more objects have been lost from the grid.

This alert might indicate that data has been permanently lost and is not retrievable.
  1. Investigate this alert immediately. You might need to take action to prevent further data loss. You also might be able to restore a lost object if you take prompt action.

    Lost and missing object data

  2. When the underlying problem is resolved, reset the counter:
    1. Select Support > Grid Topology.
    2. For the Storage Node that raised the alert, select site > grid node > LDR > Data Store > Configuration > Main.
    3. Select Reset Lost Objects Count and click Apply Changes.

Platform services unavailable none

Too few Storage Nodes with the RSM service are running or available at a site.

Make sure that the majority of the Storage Nodes that have the RSM service at the affected site are running and in a non-error state.

See "Troubleshooting platform services" in the instructions for administering StorageGRID.

Administering StorageGRID
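
The majority condition for the Platform services unavailable alert can be expressed as a short check. In this sketch, "majority" is assumed to mean strictly more than half of the Storage Nodes that have the RSM service at the site; the function name and input format are hypothetical.

```python
from typing import Dict


def has_rsm_majority(node_states: Dict[str, bool]) -> bool:
    """Return True if more than half of the RSM-capable nodes are healthy.

    node_states maps a node name to True when that node is running and in a
    non-error state. Strict majority is an assumption, not a documented rule.
    """
    if not node_states:
        return False
    healthy = sum(1 for ok in node_states.values() if ok)
    return healthy > len(node_states) / 2
```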

Unable to communicate with node none

One or more services are unresponsive or cannot be reached by the metrics collection job.

This alert indicates a problem connecting to the node or a service on the node. For example, the node might be powered down, there might be a network connectivity issue, or a service on the node might be stopped.

Monitor this alert to see if the issue resolves on its own. If the issue persists:
  1. Determine if there is another alert affecting this node. This alert might be resolved when you resolve the other alert.
  2. Determine if there is a network connectivity issue between this node and the Admin Node.
  3. Contact technical support.
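
A basic way to test network connectivity, as in step 2, is to attempt a TCP connection from a host on the same network path as the Admin Node. The sketch below uses only the Python standard library. The function name is hypothetical, and the host and port to test are placeholders you must supply, because the ports used by node services are not stated here.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A False result does not distinguish a powered-down node from a firewall block or a stopped service, so treat it as a starting point for the checks above rather than a diagnosis.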