Alerts reference
This reference lists the default alerts that appear in the Grid Manager. Recommended actions are in the alert message you receive.
As required, you can create custom alert rules to fit your system management approach.
Some of the default alerts use Prometheus metrics.
Appliance alerts
Alert name | Description |
---|---|
Appliance battery expired |
The battery in the appliance's storage controller has expired. |
Appliance battery failed |
The battery in the appliance's storage controller has failed. |
Appliance battery has insufficient learned capacity |
The battery in the appliance's storage controller has insufficient learned capacity. |
Appliance battery near expiration |
The battery in the appliance's storage controller is nearing expiration. |
Appliance battery removed |
The battery in the appliance's storage controller is missing. |
Appliance battery too hot |
The battery in the appliance's storage controller is overheated. |
Appliance BMC communication error |
Communication with the baseboard management controller (BMC) has been lost. |
Appliance boot device fault detected |
A problem was detected with the boot device in the appliance. |
Appliance cache backup device failed |
A persistent cache backup device has failed. |
Appliance cache backup device insufficient capacity |
There is insufficient cache backup device capacity. |
Appliance cache backup device write-protected |
A cache backup device is write-protected. |
Appliance cache memory size mismatch |
The two controllers in the appliance have different cache sizes. |
Appliance CMOS battery fault |
A problem was detected with the CMOS battery in the appliance. |
Appliance compute controller chassis temperature too high |
The temperature of the compute controller in a StorageGRID appliance has exceeded a nominal threshold. |
Appliance compute controller CPU temperature too high |
The temperature of the CPU in the compute controller in a StorageGRID appliance has exceeded a nominal threshold. |
Appliance compute controller needs attention |
A hardware fault has been detected in the compute controller of a StorageGRID appliance. |
Appliance compute controller power supply A has a problem |
Power supply A in the compute controller has a problem. |
Appliance compute controller power supply B has a problem |
Power supply B in the compute controller has a problem. |
Appliance compute hardware monitor service stalled |
The service that monitors storage hardware status has stalled. |
Appliance DAS drive exceeding limit for data written per day |
An excessive amount of data is being written to a drive each day, which might void its warranty. |
Appliance DAS drive fault detected |
A problem was detected with a direct-attached storage (DAS) drive in the appliance. |
Appliance DAS drive locator light on |
The drive locator light for one or more direct-attached storage (DAS) drives in an appliance Storage Node is on. |
Appliance DAS drive rebuilding |
A direct-attached storage (DAS) drive is rebuilding. This is expected if it was recently replaced or removed/reinserted. |
Appliance fan fault detected |
A problem with a fan unit in the appliance was detected. |
Appliance Fibre Channel fault detected |
A Fibre Channel link problem has been detected between the appliance storage controller and compute controller |
Appliance Fibre Channel HBA port failure |
A Fibre Channel HBA port is failing or has failed. |
Appliance flash cache drives non-optimal |
The drives used for the SSD cache are non-optimal. |
Appliance interconnect/battery canister removed |
The interconnect/battery canister is missing. |
Appliance LACP port missing |
A port on a StorageGRID appliance is not participating in the LACP bond. |
Appliance NIC fault detected |
A problem with a network interface card (NIC) in the appliance was detected. |
Appliance overall power supply degraded |
The power of a StorageGRID appliance has deviated from the recommended operating voltage. |
Appliance SSD critical warning |
An appliance SSD is reporting a critical warning. |
Appliance storage controller A failure |
Storage controller A in a StorageGRID appliance has failed. |
Appliance storage controller B failure |
Storage controller B in a StorageGRID appliance has failed. |
Appliance storage controller drive failure |
One or more drives in a StorageGRID appliance has failed or is not optimal. |
Appliance storage controller hardware issue |
SANtricity software is reporting "Needs attention" for a component in a StorageGRID appliance. |
Appliance storage controller power supply A failure |
Power supply A in a StorageGRID appliance has deviated from the recommended operating voltage. |
Appliance storage controller power supply B failure |
Power supply B in a StorageGRID appliance has deviated from the recommended operating voltage. |
Appliance storage hardware monitor service stalled |
The service that monitors storage hardware status has stalled. |
Appliance storage shelves degraded |
The status of one of the components in the storage shelf for a storage appliance is degraded. |
Appliance temperature exceeded |
The nominal or maximum temperature for the appliance's storage controller has been exceeded. |
Appliance temperature sensor removed |
A temperature sensor has been removed. |
Appliance UEFI secure boot error |
An appliance has not been booted securely. |
Disk I/O is very slow |
Very slow disk I/O might be impacting grid performance. |
Storage appliance fan fault detected |
A problem with a fan unit in the storage controller for an appliance was detected. |
Storage appliance storage connectivity degraded |
There is a problem with one or more connections between the compute controller and storage controller. |
Storage device inaccessible |
A storage device cannot be accessed. |
Audit and syslog alerts
Alert name | Description |
---|---|
Audit logs are being added to the in-memory queue |
Node cannot send logs to the local syslog server and the in-memory queue is filling up. |
External syslog server forwarding error |
Node cannot forward logs to the external syslog server. |
Large audit queue |
The disk queue for audit messages is full. If this condition is not addressed, S3 or Swift operations might fail. |
Logs are being added to the on-disk queue |
Node cannot forward logs to the external syslog server and the on-disk queue is filling up. |
Bucket alerts
Alert name | Description |
---|---|
FabricPool bucket has unsupported bucket consistency setting |
A FabricPool bucket uses the Available or Strong-site consistency level, which is not supported. |
FabricPool bucket has unsupported versioning setting |
A FabricPool bucket has versioning or S3 Object Lock enabled, which are not supported. |
Cassandra alerts
Alert name | Description |
---|---|
Cassandra auto-compactor error |
The Cassandra auto-compactor has experienced an error. |
Cassandra auto-compactor metrics out of date |
The metrics that describe the Cassandra auto-compactor are out of date. |
Cassandra communication error |
The nodes that run the Cassandra service are having trouble communicating with each other. |
Cassandra compactions overloaded |
The Cassandra compaction process is overloaded. |
Cassandra oversize write error |
An internal StorageGRID process sent a write request to Cassandra that was too large. |
Cassandra repair metrics out of date |
The metrics that describe Cassandra repair jobs are out of date. |
Cassandra repair progress slow |
The progress of Cassandra database repairs is slow. |
Cassandra repair service not available |
The Cassandra repair service is not available. |
Cassandra table corruption |
Cassandra has detected table corruption. Cassandra automatically restarts if it detects table corruption. |
Cloud Storage Pool alerts
Alert name | Description |
---|---|
Cloud Storage Pool connectivity error |
The health check for Cloud Storage Pools detected one or more new errors. |
IAM Roles Anywhere end-entity certification expiration |
IAM Roles Anywhere end-entity certificate is about to expire. |
Cross-grid replication alerts
Alert name | Description |
---|---|
Cross-grid replication permanent failure |
A cross-grid replication error occurred that requires user intervention to resolve. |
Cross-grid replication resources unavailable |
Cross-grid replication requests are pending because a resource is unavailable. |
DHCP alerts
Alert name | Description |
---|---|
DHCP lease expired |
The DHCP lease on a network interface has expired. |
DHCP lease expiring soon |
The DHCP lease on a network interface is expiring soon. |
DHCP server unavailable |
The DHCP server is unavailable. |
Debug and trace alerts
Alert name | Description |
---|---|
Debug performance impact |
When debug mode is enabled, system performance might be negatively impacted. |
Trace configuration enabled |
When trace configuration is enabled, system performance might be negatively impacted. |
Email and AutoSupport alerts
Alert name | Description |
---|---|
AutoSupport message failed to send |
The most recent AutoSupport message failed to send. |
Domain name resolution failure |
The StorageGRID node has been unable to resolve domain names. |
Email notification failure |
The email notification for an alert could not be sent. |
SNMP inform errors |
Errors sending SNMP inform notifications to a trap destination. |
SSH or console login detected |
In the past 24 hours, a user has logged in with Web Console or SSH. |
Erasure coding (EC) alerts
Alert name | Description |
---|---|
EC rebalance failure |
The EC rebalance procedure has failed or has been stopped. |
EC repair failure |
A repair job for EC data has failed or has been stopped. |
EC repair stalled |
A repair job for EC data has stalled. |
Erasure-coded fragment verification error |
Erasure-coded fragments can no longer be verified. Corrupt fragments might not be repaired. |
Expiration of certificates alerts
Alert name | Description |
---|---|
Admin Proxy CA certificate expiration |
One or more certificates in the admin proxy server CA bundle is about to expire. |
Expiration of client certificate |
One or more client certificates are about to expire. |
Expiration of global server certificate for S3 and Swift |
The global server certificate for S3 and Swift is about to expire. |
Expiration of load balancer endpoint certificate |
One or more load balancer endpoint certificates are about to expire. |
Expiration of server certificate for Management interface |
The server certificate used for the management interface is about to expire. |
External syslog CA certificate expiration |
The certificate authority (CA) certificate used to sign the external syslog server certificate is about to expire. |
External syslog client certificate expiration |
The client certificate for an external syslog server is about to expire. |
External syslog server certificate expiration |
The server certificate presented by the external syslog server is about to expire. |
Grid Network alerts
Alert name | Description |
---|---|
Grid Network MTU mismatch |
The MTU setting for the Grid Network interface (eth0) differs significantly across nodes in the grid. |
Grid federation alerts
Alert name | Description |
---|---|
Expiration of grid federation certificate |
One or more grid federation certificates are about to expire. |
Grid federation connection failure |
The grid federation connection between the local and remote grid is not working. |
High usage or high latency alerts
Alert name | Description |
---|---|
High Java heap use |
A high percentage of Java heap space is being used. |
High latency for metadata queries |
The average time for Cassandra metadata queries is too long. |
Identity federation alerts
Alert name | Description |
---|---|
Identity federation synchronization failure |
Unable to synchronize federated groups and users from the identity source. |
Identity federation synchronization failure for a tenant |
Unable to synchronize federated groups and users from the identity source configured by a tenant. |
Information lifecycle management (ILM) alerts
Alert name | Description |
---|---|
ILM placement unachievable |
A placement instruction in an ILM rule cannot be achieved for certain objects. |
ILM scan rate low |
The ILM scan rate is set to less than 100 objects/second. |
Key management server (KMS) alerts
Alert name | Description |
---|---|
KMS CA certificate expiration |
The certificate authority (CA) certificate used to sign the key management server (KMS) certificate is about to expire. |
KMS client certificate expiration |
The client certificate for a key management server is about to expire |
KMS configuration failed to load |
The configuration for the key management server exists but failed to load. |
KMS connectivity error |
An appliance node could not connect to the key management server for its site. |
KMS encryption key name not found |
The configured key management server does not have an encryption key that matches the name provided. |
KMS encryption key rotation failed |
All appliance volumes were successfully decrypted, but one or more volumes could not rotate to the latest key. |
KMS is not configured |
No key management server exists for this site. |
KMS key failed to decrypt an appliance volume |
One or more volumes on an appliance with node encryption enabled could not be decrypted with the current KMS key. |
KMS server certificate expiration |
The server certificate used by the key management server (KMS) is about to expire. |
KMS server connectivity failure |
An appliance node could not connect to one or more servers in the key management server cluster for its site. |
Load balancer alerts
Alert name | Description |
---|---|
Elevated zero-request load balancer connections |
An elevated percentage of connections to load balancer endpoints disconnected without performing requests. |
Local clock offset alerts
Alert name | Description |
---|---|
Local clock large time offset |
The offset between local clock and Network Time Protocol (NTP) time is too large. |
Low memory or low space alerts
Alert name | Description |
---|---|
Low audit log disk capacity |
The space available for audit logs is low. If this condition is not addressed, S3 or Swift operations might fail. |
Low available node memory |
The amount of RAM available on a node is low. |
Low free space for storage pool |
The space available for storing object data in the Storage Node is low. |
Low installed node memory |
The amount of installed memory on a node is low. |
Low metadata storage |
The space available for storing object metadata is low. |
Low metrics disk capacity |
The space available for the metrics database is low. |
Low object data storage |
The space available for storing object data is low. |
Low read-only watermark override |
The storage volume soft read-only watermark override is less than the minimum optimized watermark for a Storage Node. |
Low root disk capacity |
The space available on the root disk is low. |
Low system data capacity |
The space available for /var/local is low. If this condition is not addressed, S3 or Swift operations might fail. |
Low tmp directory free space |
The space available in the /tmp directory is low. |
Node or node network alerts
Alert name | Description |
---|---|
Admin Network receive usage |
The receive usage on the Admin Network is high. |
Admin Network transmit usage |
The transmit usage on the Admin Network is high. |
Firewall configuration failure |
Failed to apply firewall configuration. |
Management interface endpoints in fallback mode |
All management interface endpoints have been falling back to the default ports for too long. |
Node network connectivity error |
Errors have occurred while transferring data between nodes. |
Node network reception frame error |
A high percentage of the network frames received by a node had errors. |
Node not in sync with NTP server |
The node is not in sync with the network time protocol (NTP) server. |
Node not locked with NTP server |
The node is not locked to a network time protocol (NTP) server. |
Non-appliance node network down |
One or more network devices are down or disconnected. |
Services appliance link down on Admin Network |
The appliance interface to the Admin Network (eth1) is down or disconnected. |
Services appliance link down on Admin Network port 1 |
The Admin Network port 1 on the appliance is down or disconnected. |
Services appliance link down on Client Network |
The appliance interface to the Client Network (eth2) is down or disconnected. |
Services appliance link down on network port 1 |
Network port 1 on the appliance is down or disconnected. |
Services appliance link down on network port 2 |
Network port 2 on the appliance is down or disconnected. |
Services appliance link down on network port 3 |
Network port 3 on the appliance is down or disconnected. |
Services appliance link down on network port 4 |
Network port 4 on the appliance is down or disconnected. |
Storage appliance link down on Admin Network |
The appliance interface to the Admin Network (eth1) is down or disconnected. |
Storage appliance link down on Admin Network port 1 |
The Admin Network port 1 on the appliance is down or disconnected. |
Storage appliance link down on Client Network |
The appliance interface to the Client Network (eth2) is down or disconnected. |
Storage appliance link down on network port 1 |
Network port 1 on the appliance is down or disconnected. |
Storage appliance link down on network port 2 |
Network port 2 on the appliance is down or disconnected. |
Storage appliance link down on network port 3 |
Network port 3 on the appliance is down or disconnected. |
Storage appliance link down on network port 4 |
Network port 4 on the appliance is down or disconnected. |
Storage Node not in desired storage state |
The LDR service on a Storage Node cannot transition to the desired state because of an internal error or volume related issue |
TCP connection usage |
The number of TCP connections on this node is approaching the maximum number that can be tracked. |
Unable to communicate with node |
One or more services are unresponsive, or the node cannot be reached. |
Unexpected node reboot |
A node rebooted unexpectedly within the last 24 hours. |
Object alerts
Alert name | Description |
---|---|
Object existence check failed |
The object existence check job has failed. |
Object existence check stalled |
The object existence check job has stalled. |
Objects lost |
One or more objects have been lost from the grid. |
S3 PUT object size too large |
A client is attempting a PUT Object operation that exceeds S3 size limits. |
Unidentified corrupt object detected |
A file was found in replicated object storage that could not be identified as a replicated object. |
Platform services alerts
Alert name | Description |
---|---|
Platform Services pending request capacity low |
The number of Platform Services pending requests is approaching capacity. |
Platform services unavailable |
Too few Storage Nodes with the RSM service are running or available at a site. |
Storage volume alerts
Alert name | Description |
---|---|
Storage volume needs attention |
A storage volume is offline and needs attention. |
Storage volume needs to be restored |
A storage volume has been recovered and needs to be restored. |
Storage volume offline |
A storage volume has been offline for more than 5 minutes. |
Storage volume remount attempted |
A storage volume was offline and triggered an automatic remount. This could indicate a drive issue or filesystem errors. |
Volume Restoration failed to start replicated data repair |
Replicated data repair for a repaired volume couldn't be started automatically. |
StorageGRID services alerts
Alert name | Description |
---|---|
nginx service using backup configuration |
The configuration of the nginx service is invalid. The previous configuration is now being used. |
nginx-gw service using backup configuration |
The configuration of the nginx-gw service is invalid. The previous configuration is now being used. |
Reboot required to disable FIPS |
The security policy does not require FIPS mode, but the NetApp Cryptographic Security Module is enabled. |
Reboot required to enable FIPS |
The security policy requires FIPS mode, but the NetApp Cryptographic Security Module is disabled. |
SSH service using backup configuration |
The configuration of the SSH service is invalid. The previous configuration is now being used. |
Tenant alerts
Alert name | Description |
---|---|
Tenant quota usage high |
A high percentage of quota space is being used. This rule is disabled by default because it might cause too many notifications. |