Skip to main content

Alerts reference

Contributors netapp-lhalbert netapp-madkat netapp-perveilerk ssantho3

This reference lists the default alerts that appear in the Grid Manager. Recommended actions are in the alert message you receive.

As required, you can create custom alert rules to fit your system management approach.

Some of the default alerts use Prometheus metrics.

Appliance alerts

Alert name Description

Appliance battery expired

The battery in the appliance's storage controller has expired.

Appliance battery failed

The battery in the appliance's storage controller has failed.

Appliance battery has insufficient learned capacity

The battery in the appliance's storage controller has insufficient learned capacity.

Appliance battery near expiration

The battery in the appliance's storage controller is nearing expiration.

Appliance battery removed

The battery in the appliance's storage controller is missing.

Appliance battery too hot

The battery in the appliance's storage controller is overheated.

Appliance BMC communication error

Communication with the baseboard management controller (BMC) has been lost.

Appliance cache backup device failed

A persistent cache backup device has failed.

Appliance cache backup device insufficient capacity

There is insufficient cache backup device capacity.

Appliance cache backup device write-protected

A cache backup device is write-protected.

Appliance cache memory size mismatch

The two controllers in the appliance have different cache sizes.

Appliance compute controller chassis temperature too high

The temperature of the compute controller in a StorageGRID appliance has exceeded a nominal threshold.

Appliance compute controller CPU temperature too high

The temperature of the CPU in the compute controller in a StorageGRID appliance has exceeded a nominal threshold.

Appliance compute controller needs attention

A hardware fault has been detected in the compute controller of a StorageGRID appliance.

Appliance compute controller power supply A has a problem

Power supply A in the compute controller has a problem.

Appliance compute controller power supply B has a problem

Power supply B in the compute controller has a problem.

Appliance compute hardware monitor service stalled

The service that monitors storage hardware status has stalled.

Appliance DAS drive fault detected

A problem was detected with a direct-attached storage (DAS) drive in the appliance.

Appliance DAS drive rebuilding

A direct-attached storage (DAS) drive is rebuilding. This is expected if it was recently replaced or removed/reinserted.

Appliance fan fault detected

A problem with a fan unit in the appliance was detected.

Appliance Fibre Channel fault detected

A Fibre Channel link problem has been detected between the appliance storage controller and compute controller

Appliance Fibre Channel HBA port failure

A Fibre Channel HBA port is failing or has failed.

Appliance flash cache drives non-optimal

The drives used for the SSD cache are non-optimal.

Appliance interconnect/battery canister removed

The interconnect/battery canister is missing.

Appliance LACP port missing

A port on a StorageGRID appliance is not participating in the LACP bond.

Appliance NIC fault detected

A problem with a network interface card (NIC) in the appliance was detected.

Appliance overall power supply degraded

The power of a StorageGRID appliance has deviated from the recommended operating voltage.

Appliance SSD critical warning

An appliance SSD is reporting a critical warning.

Appliance storage controller A failure

Storage controller A in a StorageGRID appliance has failed.

Appliance storage controller B failure

Storage controller B in a StorageGRID appliance has failed.

Appliance storage controller drive failure

One or more drives in a StorageGRID appliance has failed or is not optimal.

Appliance storage controller hardware issue

SANtricity software is reporting "Needs attention" for a component in a StorageGRID appliance.

Appliance storage controller power supply A failure

Power supply A in a StorageGRID appliance has deviated from the recommended operating voltage.

Appliance storage controller power supply B failure

Power supply B in a StorageGRID appliance has deviated from the recommended operating voltage.

Appliance storage hardware monitor service stalled

The service that monitors storage hardware status has stalled.

Appliance storage shelves degraded

The status of one of the components in the storage shelf for a storage appliance is degraded.

Appliance temperature exceeded

The nominal or maximum temperature for the appliance's storage controller has been exceeded.

Appliance temperature sensor removed

A temperature sensor has been removed.

Disk I/O is very slow

Very slow disk I/O may be impacting grid performance.

Storage appliance fan fault detected

A problem with a fan unit in the storage controller for an appliance was detected.

Storage appliance storage connectivity degraded

There is a problem with one or more connections between the compute controller and storage controller.

Storage device inaccessible

A storage device can't be accessed.

Audit and syslog alerts

Alert name Description

Audit logs are being added to the in-memory queue

Node can't send logs to the local syslog server and the in-memory queue is filling up.

External syslog server forwarding error

Node can't forward logs to the external syslog server.

Large audit queue

The disk queue for audit messages is full. If this condition is not addressed, S3 or Swift operations might fail.

Logs are being added to the on-disk queue

Node can't forward logs to the external syslog server and the on-disk queue is filling up.

Bucket alerts

Alert name Description

FabricPool bucket has unsupported bucket consistency setting

A FabricPool bucket uses the Available consistency level, which is not supported.

Cassandra alerts

Alert name Description

Cassandra auto-compactor error

The Cassandra auto-compactor has experienced an error.

Cassandra auto-compactor metrics out of date

The metrics that describe the Cassandra auto-compactor are out of date.

Cassandra communication error

The nodes that run the Cassandra service are having trouble communicating with each other.

Cassandra compactions overloaded

The Cassandra compaction process is overloaded.

Cassandra oversize write error

An internal StorageGRID process sent a write request to Cassandra that was too large.

Cassandra repair metrics out of date

The metrics that describe Cassandra repair jobs are out of date.

Cassandra repair progress slow

The progress of Cassandra database repairs is slow.

Cassandra repair service not available

The Cassandra repair service is not available.

Cassandra table corruption

Cassandra has detected table corruption. Cassandra automatically restarts if it detects table corruption.

Improved read availability disabled

When improved read availability is disabled, GET and HEAD requests might fail when Storage Nodes are unavailable.

Cloud Storage Pool alerts

Alert name Description

Cloud Storage Pool connectivity error

The health check for Cloud Storage Pools detected one or more new errors.

Cross-grid replication alerts

Alert name Description

Cross-grid replication permanent failure

A cross-grid replication error occurred that requires user intervention to resolve.

Cross-grid replication resources unavailable

Cross-grid replication requests are pending because a resource is unavailable.

DHCP alerts

Alert name Description

DHCP lease expired

The DHCP lease on a network interface has expired.

DHCP lease expiring soon

The DHCP lease on a network interface is expiring soon.

DHCP server unavailable

The DHCP server is unavailable.

Debug and trace alerts

Alert name Description

Debug performance impact

When debug mode is enabled, system performance might be negatively impacted.

Trace configuration enabled

When trace configuration is enabled, system performance might be negatively impacted.

Email and AutoSupport alerts

Alert name Description

AutoSupport message failed to send

The most recent AutoSupport message failed to send.

Email notification failure

The email notification for an alert could not be sent.

Erasure coding (EC) alerts

Alert name Description

EC rebalance failure

The EC rebalance procedure has failed or has been stopped.

EC repair failure

A repair job for EC data has failed or has been stopped.

EC repair stalled

A repair job for EC data has stalled.

Expiration of certificates alerts

Alert name Description

Expiration of client certificate

One or more client certificates are about to expire.

Expiration of global server certificate for S3 and Swift

The global server certificate for S3 and Swift is about to expire.

Expiration of load balancer endpoint certificate

One or more load balancer endpoint certificates are about to expire.

Expiration of server certificate for Management interface

The server certificate used for the management interface is about to expire.

External syslog CA certificate expiration

The certificate authority (CA) certificate used to sign the external syslog server certificate is about to expire.

External syslog client certificate expiration

The client certificate for an external syslog server is about to expire.

External syslog server certificate expiration

The server certificate presented by the external syslog server is about to expire.

Grid Network alerts

Alert name Description

Grid Network MTU mismatch

The MTU setting for the Grid Network interface (eth0) differs significantly across nodes in the grid.

Grid federation alerts

Alert name Description

Expiration of grid federation certificate

One or more grid federation certificates are about to expire.

Grid federation connection failure

The grid federation connection between the local and remote grid is not working.

High usage or high latency alerts

Alert name Description

High Java heap use

A high percentage of Java heap space is being used.

High latency for metadata queries

The average time for Cassandra metadata queries is too long.

Identity federation alerts

Alert name Description

Identity federation synchronization failure

Unable to synchronize federated groups and users from the identity source.

Identity federation synchronization failure for a tenant

Unable to synchronize federated groups and users from the identity source configured by a tenant.

Information lifecycle management (ILM) alerts

Alert name Description

ILM placement unachievable

A placement instruction in an ILM rule can't be achieved for certain objects.

ILM scan period too long

The time required to scan, evaluate, and apply ILM to objects is too long.

ILM scan rate low

The ILM scan rate is set to less than 100 objects/second.

Key management server (KMS) alerts

Alert name Description

KMS CA certificate expiration

The certificate authority (CA) certificate used to sign the key management server (KMS) certificate is about to expire.

KMS client certificate expiration

The client certificate for a key management server is about to expire

KMS configuration failed to load

The configuration for the key management server exists but failed to load.

KMS connectivity error

An appliance node could not connect to the key management server for its site.

KMS encryption key name not found

The configured key management server does not have an encryption key that matches the name provided.

KMS encryption key rotation failed

All appliance volumes were successfully decrypted, but one or more volumes could not rotate to the latest key.

KMS is not configured

No key management server exists for this site.

KMS key failed to decrypt an appliance volume

One or more volumes on an appliance with node encryption enabled could not be decrypted with the current KMS key.

KMS server certificate expiration

The server certificate used by the key management server (KMS) is about to expire.

Local clock offset alerts

Alert name Description

Local clock large time offset

The offset between local clock and Network Time Protocol (NTP) time is too large.

Low memory or low space alerts

Alert name Description

Low audit log disk capacity

The space available for audit logs is low. If this condition is not addressed, S3 or Swift operations might fail.

Low available node memory

The amount of RAM available on a node is low.

Low free space for storage pool

The space available for storing object data in the Storage Node is low.

Low installed node memory

The amount of installed memory on a node is low.

Low metadata storage

The space available for storing object metadata is low.

Low metrics disk capacity

The space available for the metrics database is low.

Low object data storage

The space available for storing object data is low.

Low read-only watermark override

The Storage Volume Soft Read-Only Watermark Override is less than the minimum optimized watermark for a Storage Node.

Low root disk capacity

The space available on the root disk is low.

Low system data capacity

The space available for StorageGRID system data on the /var/local mount point is low.

Low tmp directory free space

The space available in the /tmp directory is low.

Node or node network alerts

Alert name Description

Firewall configuration failure

Failed to apply firewall configuration.

Node network connectivity error

Errors have occurred while transferring data between nodes.

Node network reception frame error

A high percentage of the network frames received by a node had errors.

Node not in sync with NTP server

The node is not in sync with the network time protocol (NTP) server.

Node not locked with NTP server

The node is not locked to a network time protocol (NTP) server.

Non-appliance node network down

One or more network devices are down or disconnected.

Services appliance link down on Admin Network

The appliance interface to the Admin Network (eth1) is down or disconnected.

Services appliance link down on Admin Network port 1

The Admin Network port 1 on the appliance is down or disconnected.

Services appliance link down on Client Network

The appliance interface to the Client Network (eth2) is down or disconnected.

Services appliance link down on network port 1

Network port 1 on the appliance is down or disconnected.

Services appliance link down on network port 2

Network port 2 on the appliance is down or disconnected.

Services appliance link down on network port 3

Network port 3 on the appliance is down or disconnected.

Services appliance link down on network port 4

Network port 4 on the appliance is down or disconnected.

Storage appliance link down on Admin Network

The appliance interface to the Admin Network (eth1) is down or disconnected.

Storage appliance link down on Admin Network port 1

The Admin Network port 1 on the appliance is down or disconnected.

Storage appliance link down on Client Network

The appliance interface to the Client Network (eth2) is down or disconnected.

Storage appliance link down on network port 1

Network port 1 on the appliance is down or disconnected.

Storage appliance link down on network port 2

Network port 2 on the appliance is down or disconnected.

Storage appliance link down on network port 3

Network port 3 on the appliance is down or disconnected.

Storage appliance link down on network port 4

Network port 4 on the appliance is down or disconnected.

Storage Node not in desired storage state

The LDR service on a Storage Node can't transition to the desired state because of an internal error or volume related issue

Unable to communicate with node

One or more services are unresponsive, or the node can't be reached.

Unexpected node reboot

A node rebooted unexpectedly within the last 24 hours.

Object alerts

Alert name Description

Object existence check failed

The object existence check job has failed.

Object existence check stalled

The object existence check job has stalled.

Objects lost

One or more objects have been lost from the grid.

S3 PUT object size too large

A client is attempting a PUT Object operation that exceeds S3 size limits.

Unidentified corrupt object detected

A file was found in replicated object storage that could not be identified as a replicated object.

Platform services alerts

Alert name Description

Platform services unavailable

Too few Storage Nodes with the RSM service are running or available at a site.

Storage volume alerts

Alert name Description

Storage volume needs attention

A storage volume is offline and needs attention.

Storage volume needs to be restored

A storage volume has been recovered and needs to be restored.

Storage volume offline

A storage volume has been offline for more than 5 minutes, possibly because the node rebooted during the volume formatting step.

Volume Restoration failed to start replicated data repair

Replicated data repair for a repaired volume couldn't be started automatically.

StorageGRID services alerts

Alert name Description

nginx service using backup configuration

The configuration of the nginx service is invalid. The previous configuration is now being used.

nginx-gw service using backup configuration

The configuration of the nginx-gw service is invalid. The previous configuration is now being used.

SSH service using backup configuration

The configuration of the SSH service is invalid. The previous configuration is now being used.

Tenant alerts

Alert name Description

Tenant quota usage high

A high percentage of quota space is being used. This rule is disabled by default because it might cause too many notifications.