Alerts reference

Contributors netapp-madkat netapp-perveilerk netapp-lhalbert

The following table lists all default StorageGRID alerts. As required, you can create custom alert rules to fit your system management approach.

See the information about commonly used Prometheus metrics to learn about the metrics used in some of these alerts.

Alert name Description and recommended actions

Appliance battery expired

The battery in the appliance’s storage controller has expired.

  1. Replace the battery. The steps to remove and replace a battery are included in the procedure for replacing a storage controller. See the instructions for your storage appliance.

  2. If this alert persists, contact technical support.

Appliance battery failed

The battery in the appliance’s storage controller has failed.

  1. Replace the battery. The steps to remove and replace a battery are included in the procedure for replacing a storage controller. See the instructions for your storage appliance.

  2. If this alert persists, contact technical support.

Appliance battery has insufficient learned capacity

The battery in the appliance’s storage controller has insufficient learned capacity.

  1. Replace the battery. The steps to remove and replace a battery are included in the procedure for replacing a storage controller. See the instructions for your storage appliance.

  2. If this alert persists, contact technical support.

Appliance battery near expiration

The battery in the appliance’s storage controller is nearing expiration.

  1. Replace the battery soon. The steps to remove and replace a battery are included in the procedure for replacing a storage controller. See the instructions for your storage appliance.

  2. If this alert persists, contact technical support.

Appliance battery removed

The battery in the appliance’s storage controller is missing.

  1. Install a battery. The steps to remove and replace a battery are included in the procedure for replacing a storage controller. See the instructions for your storage appliance.

  2. If this alert persists, contact technical support.

Appliance battery too hot

The battery in the appliance’s storage controller is overheated.

  1. Determine if there is another alert affecting this node. This alert might be resolved when you resolve the other alert.

  2. Investigate possible reasons for the temperature increase, such as a fan or HVAC failure.

  3. If this alert persists, contact technical support.

Appliance BMC communication error

Communication with the baseboard management controller (BMC) has been lost.

  1. Confirm that the BMC is operating normally. Select NODES, and then select the Hardware tab for the appliance node. Locate the Compute Controller BMC IP field, and browse to that IP.

  2. Attempt to restore BMC communications by placing the node into maintenance mode and then powering the appliance off and back on. See the instructions for your appliance.

  3. If this alert persists, contact technical support.

Appliance cache backup device failed

A persistent cache backup device has failed.

  1. Determine if there is another alert affecting this node. This alert might be resolved when you resolve the other alert.

  2. Contact technical support.

Appliance cache backup device insufficient capacity

There is insufficient cache backup device capacity.

Contact technical support.

Appliance cache backup device write-protected

A cache backup device is write-protected.

Contact technical support.

Appliance cache memory size mismatch

The two controllers in the appliance have different cache sizes.

Contact technical support.

Appliance compute controller chassis temperature too high

The temperature of the compute controller in a StorageGRID appliance has exceeded a nominal threshold.

  1. Check the hardware components for overheating conditions, and follow the recommended actions:

    • If you have an SG100, SG1000, or SG6000, use the BMC.

    • If you have an SG5600 or SG5700, use SANtricity System Manager.

  2. If necessary, replace the component. See the instructions for your appliance.

Appliance compute controller CPU temperature too high

The temperature of the CPU in the compute controller in a StorageGRID appliance has exceeded a nominal threshold.

  1. Check the hardware components for overheating conditions, and follow the recommended actions:

    • If you have an SG100, SG1000, or SG6000, use the BMC.

    • If you have an SG5600 or SG5700, use SANtricity System Manager.

  2. If necessary, replace the component. See the instructions for your appliance.

Appliance compute controller needs attention

A hardware fault has been detected in the compute controller of a StorageGRID appliance.

  1. Check the hardware components for errors, and follow the recommended actions:

    • If you have an SG100, SG1000, or SG6000, use the BMC.

    • If you have an SG5600 or SG5700, use SANtricity System Manager.

  2. If necessary, replace the component. See the instructions for your appliance.

Appliance compute controller power supply A has a problem

Power supply A in the compute controller has a problem.

This alert might indicate that the power supply has failed or that it has a problem providing power.

  1. Check the hardware components for errors, and follow the recommended actions:

    • If you have an SG100, SG1000, or SG6000, use the BMC.

    • If you have an SG5600 or SG5700, use SANtricity System Manager.

  2. If necessary, replace the component. See the instructions for your appliance.

Appliance compute controller power supply B has a problem

Power supply B in the compute controller has a problem.

This alert might indicate that the power supply has failed or that it has a problem providing power.

  1. Check the hardware components for errors, and follow the recommended actions:

    • If you have an SG100, SG1000, or SG6000, use the BMC.

    • If you have an SG5600 or SG5700, use SANtricity System Manager.

  2. If necessary, replace the component. See the instructions for your appliance.

Appliance compute hardware monitor service stalled

The service that monitors storage hardware status has stopped reporting data.

  1. Check the status of the eos-system-status service in the base-os.

  2. If the service is in a stopped or error state, restart the service.

  3. If this alert persists, contact technical support.

Appliance Fibre Channel fault detected

A Fibre Channel link problem has been detected between the appliance storage controller and compute controller.

This alert might indicate that there is a problem with the Fibre Channel connection between the storage and compute controllers in the appliance.

  1. Check the hardware components for errors (NODES > appliance node > Hardware). If the status of any of the components is not “Nominal,” take these actions:

    1. Verify that the Fibre Channel cables between controllers are completely connected.

    2. Ensure that the Fibre Channel cables are free of excessive bends.

    3. Confirm that the SFP+ modules are properly seated.

      Note: If this problem persists, the StorageGRID system might take the problematic connection offline automatically.

  2. If necessary, replace components. See the instructions for your appliance.

Appliance Fibre Channel HBA port failure

A Fibre Channel HBA port is failing or has failed.

Contact technical support.

Appliance flash cache drives non-optimal

The drives used for the SSD cache are non-optimal.

  1. Replace the SSD cache drives. See the instructions for your appliance.

  2. If this alert persists, contact technical support.

Appliance interconnect/battery canister removed

The interconnect/battery canister is missing.

  1. Replace the battery. The steps to remove and replace a battery are included in the procedure for replacing a storage controller. See the instructions for your storage appliance.

  2. If this alert persists, contact technical support.

Appliance LACP port missing

A port on a StorageGRID appliance is not participating in the LACP bond.

  1. Check the configuration for the switch. Ensure the interface is configured in the correct link aggregation group.

  2. If this alert persists, contact technical support.

Appliance overall power supply degraded

The power of a StorageGRID appliance has deviated from the recommended operating voltage.

  1. Check the status of power supply A and B to determine which power supply is operating abnormally, and follow the recommended actions:

    • If you have an SG100, SG1000, or SG6000, use the BMC.

    • If you have an SG5600 or SG5700, use SANtricity System Manager.

  2. If necessary, replace the component. See the instructions for your appliance.

Appliance storage controller A failure

Storage controller A in a StorageGRID appliance has failed.

  1. Use SANtricity System Manager to check hardware components, and follow the recommended actions.

  2. If necessary, replace the component. See the instructions for your appliance.

Appliance storage controller B failure

Storage controller B in a StorageGRID appliance has failed.

  1. Use SANtricity System Manager to check hardware components, and follow the recommended actions.

  2. If necessary, replace the component. See the instructions for your appliance.

Appliance storage controller drive failure

One or more drives in a StorageGRID appliance have failed or are not optimal.

  1. Use SANtricity System Manager to check hardware components, and follow the recommended actions.

  2. If necessary, replace the component. See the instructions for your appliance.

Appliance storage controller hardware issue

SANtricity software is reporting "Needs attention" for a component in a StorageGRID appliance.

  1. Use SANtricity System Manager to check hardware components, and follow the recommended actions.

  2. If necessary, replace the component. See the instructions for your appliance.

Appliance storage controller power supply A failure

Power supply A in a StorageGRID appliance has deviated from the recommended operating voltage.

  1. Use SANtricity System Manager to check hardware components, and follow the recommended actions.

  2. If necessary, replace the component. See the instructions for your appliance.

Appliance storage controller power supply B failure

Power supply B in a StorageGRID appliance has deviated from the recommended operating voltage.

  1. Use SANtricity System Manager to check hardware components, and follow the recommended actions.

  2. If necessary, replace the component. See the instructions for your appliance.

Appliance storage hardware monitor service stalled

The service that monitors storage hardware status has stopped reporting data.

  1. Check the status of the eos-system-status service in the base-os.

  2. If the service is in a stopped or error state, restart the service.

  3. If this alert persists, contact technical support.

Appliance storage shelves degraded

The status of one of the components in the storage shelf for a storage appliance is degraded.

  1. Use SANtricity System Manager to check hardware components, and follow the recommended actions.

  2. If necessary, replace the component. See the instructions for your appliance.

Appliance temperature exceeded

The nominal or maximum temperature for the appliance’s storage controller has been exceeded.

  1. Determine if there is another alert affecting this node. This alert might be resolved when you resolve the other alert.

  2. Investigate possible reasons for the temperature increase, such as a fan or HVAC failure.

  3. If this alert persists, contact technical support.

Appliance temperature sensor removed

A temperature sensor has been removed. Contact technical support.

Cassandra auto-compactor error

The Cassandra auto-compactor has experienced an error.

The Cassandra auto-compactor exists on all Storage Nodes and manages the size of the Cassandra database for overwrite-heavy and delete-heavy workloads. While this condition persists, certain workloads will experience unexpectedly high metadata consumption.

  1. Determine if there is another alert affecting this node. This alert might be resolved when you resolve the other alert.

  2. Contact technical support.

Audit logs are being added to the in-memory queue

Node cannot send logs to the local syslog server and the in-memory queue is filling up.

  1. Ensure that the rsyslog service is running on the node.

  2. If necessary, restart the rsyslog service on the node using the command 'service rsyslog restart'.

  3. If the rsyslog service cannot be restarted and you do not save audit messages on Admin Nodes, contact technical support. Audit logs will be lost if this condition is not corrected.
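The first two steps above can be sketched in Python. The restart command is the one named in step 2. The 'service rsyslog status' call, and the assumption that it exits with 0 only when the service is running, are not taken from this reference, and the function names are hypothetical.

```python
import subprocess


def rsyslog_command(action):
    """Build the argv list for 'service rsyslog <action>'."""
    if action not in ("status", "restart"):
        raise ValueError(f"unsupported action: {action}")
    return ["service", "rsyslog", action]


def ensure_rsyslog_running(run=subprocess.run):
    """Check rsyslog and attempt one restart if the status check fails.

    Assumption: 'service rsyslog status' exits with 0 only when the
    service is running. 'run' is injectable so the logic can be tried
    without touching a real node.
    """
    status = run(rsyslog_command("status"), capture_output=True, text=True)
    if status.returncode == 0:
        return "running"
    restart = run(rsyslog_command("restart"), capture_output=True, text=True)
    return "restarted" if restart.returncode == 0 else "restart-failed"
```

A result of "restart-failed" corresponds to step 3, where the remaining action is to contact technical support.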

Cassandra auto-compactor metrics out of date

The metrics that describe the Cassandra auto-compactor are out of date.

The Cassandra auto-compactor exists on all Storage Nodes and manages the size of the Cassandra database for overwrite-heavy and delete-heavy workloads. While this alert persists, certain workloads will experience unexpectedly high metadata consumption.

  1. Determine if there is another alert affecting this node. This alert might be resolved when you resolve the other alert.

  2. Contact technical support.

Cassandra communication error

The nodes that run the Cassandra service are having trouble communicating with each other.

This alert indicates that something is interfering with node-to-node communications. There might be a network issue or the Cassandra service might be down on one or more Storage Nodes.

  1. Determine if there is another alert affecting one or more Storage Nodes. This alert might be resolved when you resolve the other alert.

  2. Check for a network issue that might be affecting one or more Storage Nodes.

  3. Select SUPPORT > Tools > Grid topology.

  4. For each Storage Node in your system, select SSM > Services. Ensure that the status of the Cassandra service is "Running."

  5. If Cassandra is not running, follow the steps for starting or restarting a service.

  6. If all instances of the Cassandra service are now running and the alert is not resolved, contact technical support.

Cassandra compactions overloaded

The Cassandra compaction process is overloaded.

If the compaction process is overloaded, read performance might be degraded and RAM might be used up. The Cassandra service might also become unresponsive or crash.

  1. Restart the Cassandra service by following the steps for restarting a service.

  2. If this alert persists, contact technical support.

Cassandra repair metrics out of date

The metrics that describe Cassandra repair jobs are out of date. If this condition persists for more than 48 hours, client queries, such as bucket listings, might show deleted data.

  1. Reboot the node. From the Grid Manager, go to NODES, select the node, and select the Tasks tab.

  2. If this alert persists, contact technical support.

Cassandra repair progress slow

The progress of Cassandra database repairs is slow.

When database repairs are slow, Cassandra data consistency operations are impeded. If this condition persists for more than 48 hours, client queries, such as bucket listings, might show deleted data.

  1. Confirm that all Storage Nodes are online and there are no networking-related alerts.

  2. Monitor this alert for up to 2 days to see if the issue resolves on its own.

  3. If database repairs continue to proceed slowly, contact technical support.

Cassandra repair service not available

The Cassandra repair service is not available.

The Cassandra repair service exists on all Storage Nodes and provides critical repair functions for the Cassandra database. If this condition persists for more than 48 hours, client queries, such as bucket listings, might show deleted data.

  1. Select SUPPORT > Tools > Grid topology.

  2. For each Storage Node in your system, select SSM > Services. Ensure that the status of the Cassandra Reaper service is "Running."

  3. If Cassandra Reaper is not running, follow the steps for starting or restarting a service.

  4. If all instances of the Cassandra Reaper service are now running and the alert is not resolved, contact technical support.

Cassandra table corruption

Cassandra has detected table corruption.

Cassandra automatically restarts if it detects table corruption.

Contact technical support.

Cloud Storage Pool connectivity error

The health check for Cloud Storage Pools detected one or more new errors.

  1. Go to the Cloud Storage Pools section of the Storage Pools page.

  2. Look at the Last Error column to determine which Cloud Storage Pool has an error.

  3. See the instructions for managing objects with information lifecycle management.

DHCP lease expired

The DHCP lease on a network interface has expired. If the DHCP lease has expired, follow the recommended actions:

  1. Ensure there is connectivity between this node and the DHCP server on the affected interface.

  2. Ensure there are IP addresses available to assign in the affected subnet on the DHCP server.

  3. Ensure there is a permanent reservation for the IP address configured in the DHCP server. Or, use the StorageGRID Change IP tool to assign a static IP address outside of the DHCP address pool. See the recovery and maintenance instructions.

DHCP lease expiring soon

The DHCP lease on a network interface is expiring soon.

To prevent the DHCP lease from expiring, follow the recommended actions:

  1. Ensure there is connectivity between this node and the DHCP server on the affected interface.

  2. Ensure there are IP addresses available to assign in the affected subnet on the DHCP server.

  3. Ensure there is a permanent reservation for the IP address configured in the DHCP server. Or, use the StorageGRID Change IP tool to assign a static IP address outside of the DHCP address pool. See the recovery and maintenance instructions.

DHCP server unavailable

The DHCP server is unavailable.

The StorageGRID node is unable to contact your DHCP server. The DHCP lease for the node’s IP address cannot be validated.

  1. Ensure there is connectivity between this node and the DHCP server on the affected interface.

  2. Ensure there are IP addresses available to assign in the affected subnet on the DHCP server.

  3. Ensure there is a permanent reservation for the IP address configured in the DHCP server. Or, use the StorageGRID Change IP tool to assign a static IP address outside of the DHCP address pool. See the recovery and maintenance instructions.

Disk I/O is very slow

Very slow disk I/O might be impacting StorageGRID performance.

  1. If the issue is related to a storage appliance node, use SANtricity System Manager to check for faulty drives, drives with predicted faults, or in-progress drive repairs. Also check the status of the Fibre Channel or SAS links between the appliance compute and storage controllers to see if any links are down or showing excessive error rates.

  2. Examine the storage system that hosts this node’s volumes to determine, and correct, the root cause of the slow I/O.

  3. If this alert persists, contact technical support.

Note: Affected nodes might disable services and reboot themselves to avoid impacting overall grid performance. When the underlying condition is cleared and these nodes detect normal I/O performance, they will return to full service automatically.

EC rebalance failure

The job to rebalance erasure-coded data among Storage Nodes has failed or has been paused by the user.

  1. Ensure that all Storage Nodes at the site being rebalanced are online and available.

  2. Ensure that there are no volume failures at the site being rebalanced. If there are, terminate the EC rebalance job so that you can run a repair job.

    'rebalance-data terminate --job-id <ID>'

  3. Ensure that there are no service failures on the site being rebalanced. If a service is not running, follow the steps for starting or restarting a service in the recovery and maintenance instructions.

  4. After resolving any issues, restart the job by running the following command on the primary Admin Node:

    'rebalance-data start --job-id <ID>'

  5. If you are unable to resolve the problem, contact technical support.
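If you script the terminate and restart steps, the two command lines above can be assembled as in the following Python sketch. Only the two subcommands shown in this entry are used. The helper names and the example job ID are hypothetical, and the sketch builds the command line without running it.

```python
import shlex


def rebalance_command(action, job_id):
    """Return the argv list for 'rebalance-data <action> --job-id <ID>'."""
    if action not in ("terminate", "start"):
        raise ValueError(f"unsupported action: {action}")
    job = str(job_id).strip()
    if not job:
        raise ValueError("job_id must not be empty")
    return ["rebalance-data", action, "--job-id", job]


def as_shell_line(argv):
    """Render an argv list as a single, safely quoted shell line."""
    return shlex.join(argv)


# Example with a made-up job ID; run the resulting line on the primary Admin Node.
print(as_shell_line(rebalance_command("start", "12345")))
```

Building the arguments as a list and quoting them only for display avoids problems if a job ID ever contains characters that the shell treats specially.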

EC repair failure

A repair job for erasure-coded data has failed or has been stopped.

  1. Ensure that there are sufficient available Storage Nodes or volumes to take the place of the failed Storage Node or volume.

  2. Ensure that there are sufficient available Storage Nodes to satisfy the active ILM policy.

  3. Ensure there are no network connectivity issues.

  4. After resolving any issues, restart the job by running the following command on the primary Admin Node:

    'repair-data start-ec-node-repair --repair-id <ID>'

  5. If you are unable to resolve the problem, contact technical support.

EC repair stalled

A repair job for erasure-coded data has stalled.

  1. Ensure that there are sufficient available Storage Nodes or volumes to take the place of the failed Storage Node or volume.

  2. Ensure there are no network connectivity issues.

  3. After resolving any issues, check if the alert is resolved. To see a more detailed report on the repair progress, run the following command on the primary Admin Node:

    'repair-data show-ec-repair-status --repair-id <ID>'

  4. If you are unable to resolve the problem, contact technical support.

Email notification failure

The email notification for an alert could not be sent.

This alert is triggered when an alert email notification fails or a test email (sent from the ALERTS > Email setup page) cannot be delivered.

  1. Sign in to Grid Manager from the Admin Node listed in the Site/Node column of the alert.

  2. Go to the ALERTS > Email setup page, check the settings, and change them if required.

  3. Click Send Test Email, and check the inbox of a test recipient for the email. A new instance of this alert might be triggered if the test email cannot be sent.

  4. If the test email could not be sent, confirm your email server is online.

  5. If the server is working, select SUPPORT > Tools > Logs, and collect the log for the Admin Node. Specify a time period that is 15 minutes before and after the time of the alert.

  6. Extract the downloaded archive, and review the contents of prometheus.log (_/GID<gid><time_stamp>/<site_node>/<time_stamp>/metrics/prometheus.log).

  7. If you are unable to resolve the problem, contact technical support.
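Steps 5 and 6 above can be sketched in Python: the first helper computes the 15-minute window around the alert time, and the second locates prometheus.log files under the extracted archive. The function names are hypothetical, and the search matches any prometheus.log whose parent directory is named metrics rather than assuming the exact archive layout.

```python
from datetime import datetime, timedelta
from pathlib import Path


def log_collection_window(alert_time, margin_minutes=15):
    """Return (start, end) covering margin_minutes before and after the alert."""
    margin = timedelta(minutes=margin_minutes)
    return alert_time - margin, alert_time + margin


def find_prometheus_logs(extracted_root):
    """Return every metrics/prometheus.log found under the extracted archive."""
    root = Path(extracted_root)
    return sorted(
        path for path in root.rglob("prometheus.log")
        if path.parent.name == "metrics"
    )


# Example with a made-up alert time.
start, end = log_collection_window(datetime(2023, 1, 1, 12, 0))
print(start.isoformat(), end.isoformat())
```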

Expiration of client certificates configured on the Certificates page

One or more client certificates configured on the Certificates page are about to expire.

  1. In the Grid Manager, select CONFIGURATION > Security > Certificates and then select the Client tab.

  2. Select a certificate that will expire soon.

  3. Select Attach new certificate to upload or generate a new certificate.

  4. Repeat these steps for each certificate that will expire soon.

Expiration of load balancer endpoint certificate

One or more load balancer endpoint certificates are about to expire.

  1. Select CONFIGURATION > Network > Load balancer endpoints.

  2. Select an endpoint that has a certificate that will expire soon.

  3. Select Edit endpoint to upload or generate a new certificate.

  4. Repeat these steps for each endpoint that has an expired certificate or one that will expire soon.

For more information about managing load balancer endpoints, see the instructions for administering StorageGRID.

Expiration of server certificate for management interface

The server certificate used for the management interface is about to expire.

  1. Select CONFIGURATION > Security > Certificates.

  2. On the Global tab, select Management interface certificate.

  3. Upload a new management interface certificate.

Expiration of global server certificate for S3 and Swift API

The server certificate used for accessing storage API endpoints is about to expire.

  1. Select CONFIGURATION > Security > Certificates.

  2. On the Global tab, select S3 and Swift API certificate.

  3. Upload a new S3 and Swift API certificate.
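Before uploading a replacement certificate, you might want to confirm how much time remains on a certificate. The following Python sketch computes the days remaining from a certificate's notAfter value, in the text format that Python's ssl module reports. The 30-day warning window is an assumption for illustration only. This reference does not state how far in advance the expiration alerts are triggered, and the function names are hypothetical.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return the number of days until a certificate expires.

    not_after uses the 'notAfter' text format reported by Python's ssl
    module, for example "Jan  5 09:34:43 2018 GMT". 'now' is a Unix
    timestamp and defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def expires_soon(not_after, warn_days=30, now=None):
    """Return True if the certificate expires within warn_days.

    warn_days=30 is a hypothetical window, not the alert's threshold.
    """
    return days_until_expiry(not_after, now) <= warn_days
```

A negative result from days_until_expiry means the certificate has already expired.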

External syslog CA certificate expiration

The certificate authority (CA) certificate used to sign the external syslog server certificate is about to expire.

  1. Update the CA certificate on the external syslog server.

  2. Obtain a copy of the updated CA certificate.

  3. From the Grid Manager, go to CONFIGURATION > Monitoring > Audit and syslog server.

  4. Select Edit external syslog server.

  5. Select Browse to upload the new certificate.

  6. Complete the Configuration wizard to save the new certificate and key.

External syslog client certificate expiration

The client certificate for an external syslog server is about to expire.

  1. From the Grid Manager, go to CONFIGURATION > Monitoring > Audit and syslog server.

  2. Select Edit external syslog server.

  3. Select Browse to upload the new certificate.

  4. Select Browse to upload the new private key.

  5. Complete the Configuration wizard to save the new certificate and key.

External syslog server certificate expiration

The server certificate presented by the external syslog server is about to expire.

  1. Update the server certificate on the external syslog server.

  2. If you previously used the Grid Manager API to provide a server certificate for certificate validation, upload the updated server certificate using the API.

External syslog server forwarding error

Node cannot forward logs to the external syslog server.

  1. From the Grid Manager, go to CONFIGURATION > Monitoring > Audit and syslog server.

  2. Select Edit external syslog server.

  3. Advance through the Configuration wizard until you are able to select Send test messages.

  4. Select Send test messages to determine why logs cannot be forwarded to the external syslog server.

  5. Resolve any reported issues.

Grid Network MTU mismatch

The maximum transmission unit (MTU) setting for the Grid Network interface (eth0) differs significantly across nodes in the grid.

The differences in MTU settings could indicate that some, but not all, eth0 networks are configured for jumbo frames. An MTU size mismatch of greater than 1000 might cause network performance problems.

See the instructions for the Grid Network MTU mismatch alert in Troubleshoot network, hardware, and platform issues.
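If you have collected the eth0 MTU value from each node, a quick comparison can show whether the values differ by more than 1000. The following Python sketch compares the largest and smallest values. That comparison is an assumption about how a mismatch might be measured, not the alert's documented rule, and the node names are hypothetical.

```python
def mtu_spread(mtu_by_node):
    """Return the difference between the largest and smallest MTU values."""
    values = list(mtu_by_node.values())
    if not values:
        return 0
    return max(values) - min(values)


def has_significant_mismatch(mtu_by_node, threshold=1000):
    """Return True if the MTU values differ by more than the threshold."""
    return mtu_spread(mtu_by_node) > threshold


# Example: one node uses jumbo frames and the others do not.
grid_mtus = {"dc1-s1": 1500, "dc1-s2": 1500, "dc1-s3": 9000}
print(mtu_spread(grid_mtus), has_significant_mismatch(grid_mtus))
```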

High Java heap use

A high percentage of Java heap space is being used.

If the Java heap becomes full, metadata services can become unavailable and client requests can fail.

  1. Review the ILM activity on the Dashboard. This alert might resolve on its own when the ILM workload decreases.

  2. Determine if there is another alert affecting this node. This alert might be resolved when you resolve the other alert.

  3. If this alert persists, contact technical support.

High latency for metadata queries

The average time for Cassandra metadata queries is too long.

An increase in query latency can be caused by a hardware change, such as replacing a disk; a workload change, such as a sudden increase in ingests; or a network change, such as a communication problem between nodes and sites.

  1. Determine if there were any hardware, workload, or network changes around the time the query latency increased.

  2. If you are unable to resolve the problem, contact technical support.

Identity federation synchronization failure

Unable to synchronize federated groups and users from the identity source.

  1. Confirm that the configured LDAP server is online and available.

  2. Review the settings on the Identity Federation page. Confirm that all values are current. See Use identity federation in the instructions for administering StorageGRID.

  3. Click Test Connection to validate the settings for the LDAP server.

  4. If you cannot resolve the issue, contact technical support.

Identity federation synchronization failure for a tenant

Unable to synchronize federated groups and users from the identity source configured by a tenant.

  1. Sign in to the Tenant Manager.

  2. Confirm that the LDAP server configured by the tenant is online and available.

  3. Review the settings on the Identity Federation page. Confirm that all values are current. See Use identity federation in the instructions for using a tenant account.

  4. Click Test Connection to validate the settings for the LDAP server.

  5. If you cannot resolve the issue, contact technical support.

ILM placement unachievable

A placement instruction in an ILM rule cannot be achieved for certain objects.

This alert indicates that a node required by a placement instruction is unavailable or that an ILM rule is misconfigured. For example, a rule might specify more replicated copies than there are Storage Nodes.

  1. Ensure that all nodes are online.

  2. If all nodes are online, review the placement instructions in all ILM rules that are used in the active ILM policy. Confirm that there are valid instructions for all objects. See the instructions for managing objects with information lifecycle management.

  3. As required, update rule settings and activate a new policy.

    Note: It might take up to 1 day for the alert to clear.

  4. If the problem persists, contact technical support.

Note: This alert might appear during an upgrade and could persist for 1 day after the upgrade is completed successfully. When this alert is triggered by an upgrade, it will clear on its own.

ILM scan period too long

The time required to scan, evaluate objects, and apply ILM is too long.

If the estimated time to complete a full ILM scan of all objects is too long (see Scan Period - Estimated on the Dashboard), the active ILM policy might not be applied to newly ingested objects. Changes to the ILM policy might not be applied to existing objects.

  1. Determine if there is another alert affecting this node. This alert might be resolved when you resolve the other alert.

  2. Confirm that all Storage Nodes are online.

  3. Temporarily reduce the amount of client traffic. For example, from the Grid Manager, select CONFIGURATION > Network > Traffic classification, and create a policy that limits bandwidth or the number of requests.

  4. If disk I/O or CPU is overloaded, try to reduce the load or increase the resource.

  5. If necessary, update ILM rules to use synchronous placement (default for rules created after StorageGRID 11.3).

  6. If this alert persists, contact technical support.

ILM scan rate low

The ILM scan rate is set to less than 100 objects/second.

This alert indicates that someone has changed the ILM scan rate for your system to less than 100 objects/second (default: 400 objects/second). The active ILM policy might not be applied to newly ingested objects. Subsequent changes to the ILM policy will not be applied to existing objects.

  1. Determine if a temporary change was made to the ILM scan rate as part of an ongoing support investigation.

  2. Contact technical support.

Important: Never change the ILM scan rate without contacting technical support.

KMS CA certificate expiration

The certificate authority (CA) certificate used to sign the key management server (KMS) certificate is about to expire.

  1. Using the KMS software, update the CA certificate for the key management server.

  2. From the Grid Manager, select CONFIGURATION > Security > Key management server.

  3. Select the KMS that has a certificate status warning.

  4. Select Edit.

  5. Select Next to go to Step 2 (Upload Server Certificate).

  6. Select Browse to upload the new certificate.

  7. Select Save.

KMS client certificate expiration

The client certificate for a key management server is about to expire.

  1. From the Grid Manager, select CONFIGURATION > Security > Key management server.

  2. Select the KMS that has a certificate status warning.

  3. Select Edit.

  4. Select Next to go to Step 3 (Upload Client Certificates).

  5. Select Browse to upload the new certificate.

  6. Select Browse to upload the new private key.

  7. Select Save.

KMS configuration failed to load

The configuration for the key management server exists but failed to load.

  1. Determine if there is another alert affecting this node. This alert might be resolved when you resolve the other alert.

  2. If this alert persists, contact technical support.

KMS connectivity error

An appliance node could not connect to the key management server for its site.

  1. From the Grid Manager, select CONFIGURATION > Security > Key management server.

  2. Confirm that the port and hostname entries are correct.

  3. Confirm that the server certificate, client certificate, and the client certificate private key are correct and not expired.

  4. Ensure that firewall settings allow the appliance node to communicate with the specified KMS.

  5. Correct any networking or DNS issues.

  6. If you need assistance or this alert persists, contact technical support.
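
Parts of steps 2 through 4 can be checked from a script. The following Python sketch is illustrative only and is not a StorageGRID tool. It assumes you run it from a host that has the same network path to the KMS as the appliance node, and that you have the certificate's notAfter value in the text format that OpenSSL prints. The function names are hypothetical.

```python
import socket
import ssl
import time
from typing import Optional


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def cert_days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days until a certificate's notAfter time; negative if already expired.

    not_after uses the OpenSSL text format, for example
    "Jan  5 09:34:43 2030 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400
```

For example, call can_connect with the hostname and port shown on the Key management server page, and call cert_days_remaining for the server certificate and the client certificate. A failed connection points to a firewall, networking, or DNS issue. A small or negative number of days points to a certificate issue.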

KMS encryption key name not found

The configured key management server does not have an encryption key that matches the name provided.

  1. Confirm that the KMS assigned to the site is using the correct name for the encryption key and any prior versions.

  2. If you need assistance or this alert persists, contact technical support.

KMS encryption key rotation failed

All appliance volumes were decrypted, but one or more volumes could not rotate to the latest key. Contact technical support.

KMS is not configured

No key management server exists for this site.

  1. From the Grid Manager, select CONFIGURATION > Security > Key management server.

  2. Add a KMS for this site or add a default KMS.

KMS key failed to decrypt an appliance volume

One or more volumes on an appliance with node encryption enabled could not be decrypted with the current KMS key.

  1. Determine if there is another alert affecting this node. This alert might be resolved when you resolve the other alert.

  2. Ensure that the key management server (KMS) has the configured encryption key and any previous key versions.

  3. If you need assistance or this alert persists, contact technical support.

KMS server certificate expiration

The server certificate used by the key management server (KMS) is about to expire.

  1. Using the KMS software, update the server certificate for the key management server.

  2. If you need assistance or this alert persists, contact technical support.

Large audit queue

The disk queue for audit messages is full.

  1. Check the load on the system. If there have been a significant number of transactions, the alert should resolve itself over time, and you can ignore the alert.

  2. If the alert persists and increases in severity, view a chart of the queue size. If the number is steadily increasing over hours or days, the audit load has likely exceeded the audit capacity of the system.

  3. Reduce the client operation rate or decrease the number of audit messages logged by changing the audit level for Client Writes and Client Reads to Error or Off (CONFIGURATION > Monitoring > Audit and syslog server).
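
Step 2 asks you to judge whether the queue size is steadily increasing. As a rough illustration, the following Python sketch applies one possible test to queue sizes that you have read from the chart at regular intervals. The function name and the 80% threshold are assumptions chosen for this example, not StorageGRID settings.

```python
from typing import Sequence


def is_steadily_increasing(sizes: Sequence[float], min_fraction: float = 0.8) -> bool:
    """Return True if the queue grew overall and rose in most intervals.

    sizes holds queue-size readings in time order. The trend counts as
    steady when the last reading exceeds the first and at least
    min_fraction of consecutive readings went up.
    """
    if len(sizes) < 3:
        return False  # too few readings to call it a trend
    rises = sum(1 for prev, cur in zip(sizes, sizes[1:]) if cur > prev)
    return sizes[-1] > sizes[0] and rises / (len(sizes) - 1) >= min_fraction
```

A burst that drains again, such as readings of 50, 80, 40, 20, and 10, returns False. Readings that climb across the whole period return True and suggest that the audit load has exceeded the audit capacity.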

Legacy CLB load balancer activity detected

Some clients might be connecting to the deprecated CLB load balancer service using the default S3 and Swift API certificate.

  1. To simplify future upgrades, install a custom S3 and Swift API certificate on the Global tab of the Certificates page. Then, ensure that all S3 or Swift clients who connect to the legacy CLB have the new certificate.

  2. Create one or more load balancer endpoints. Then, direct all existing S3 and Swift clients to these endpoints. Contact technical support if you need to remap the client port.

Other activity might trigger this alert, including port scans. To determine if the deprecated CLB service is currently in use, view the storagegrid_private_clb_http_connection_established_successful Prometheus metric.

As required, silence or disable this alert rule if the CLB service is no longer in use.
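
If you retrieve the metric as JSON, a short script can list which nodes report CLB connections. The following Python sketch assumes the response uses the standard Prometheus instant-query format (a status field and a data.result list of series, each with a metric label set and a [timestamp, value] pair) and that each series carries an instance label. It does not show how to obtain the response, and the function name is hypothetical.

```python
import json
from typing import Dict

CLB_METRIC = "storagegrid_private_clb_http_connection_established_successful"


def instances_with_clb_connections(response_text: str) -> Dict[str, float]:
    """Map each series' instance label to its value when the value is above zero."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("metric query did not succeed")
    found = {}
    for series in payload["data"]["result"]:
        # Prometheus returns each sample as [timestamp, "value-as-string"].
        value = float(series["value"][1])
        if value > 0:
            found[series["metric"].get("instance", "unknown")] = value
    return found
```

If the metric is cumulative, a nonzero value shows only that connections were counted at some point. Compare two readings taken some time apart to judge whether the service is still in use.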

Logs are being added to the on-disk queue

Node cannot forward logs to the external syslog server and the on-disk queue is filling up.

  1. From the Grid Manager, go to CONFIGURATION > Monitoring > Audit and syslog server.

  2. Select Edit external syslog server.

  3. Advance through the Configuration wizard until you are able to select Send test messages.

  4. Select Send test messages to determine why logs cannot be forwarded to the external syslog server.

  5. Resolve any reported issues.

Low audit log disk capacity

The space available for audit logs is low.

  1. Monitor this alert to see if the issue resolves on its own and the disk space becomes available again.

  2. Contact technical support if the available space continues to decrease.

Low available node memory

The amount of RAM available on a node is low.

Low available RAM could indicate a change in the workload or a memory leak on one or more nodes.

  1. Monitor this alert to see if the issue resolves on its own.

  2. If the available memory falls below the major alert threshold, contact technical support.

Low free space for storage pool

The amount of space available to store object data in a storage pool is low.

  1. Select ILM > Storage pools.

  2. Select the storage pool listed in the alert, and select View details.

  3. Determine where additional storage capacity is required. You can either add Storage Nodes to each site in the storage pool or add storage volumes (LUNs) to one or more existing Storage Nodes.

  4. Perform an expansion procedure to increase storage capacity.

Low installed node memory

The amount of installed memory on a node is low.

Increase the amount of RAM available to the virtual machine or Linux host. Check the threshold value for the major alert to determine the default minimum requirement for a StorageGRID node. See the installation instructions for your platform.

Low metadata storage

The space available for storing object metadata is low.

Critical alert

  1. Stop ingesting objects.

  2. Immediately add Storage Nodes in an expansion procedure.

Major alert

Immediately add Storage Nodes in an expansion procedure.

Minor alert

  1. Monitor the rate at which object metadata space is being used. Select NODES > Storage Node > Storage, and view the Storage Used - Object Metadata graph.

  2. Add Storage Nodes in an expansion procedure as soon as possible.

After new Storage Nodes are added, the system automatically rebalances object metadata across all Storage Nodes, and the alert clears.

See the instructions for the Low metadata storage alert in Troubleshoot metadata issues.

Low metrics disk capacity

The space available for the metrics database is low.

  1. Monitor this alert to see if the issue resolves on its own and the disk space becomes available again.

  2. Contact technical support if the available space continues to decrease.

Low object data storage

The space available for storing object data is low.

Perform an expansion procedure. You can add storage volumes (LUNs) to existing Storage Nodes, or you can add new Storage Nodes.

Low read-only watermark override

The Storage Volume Soft Read-Only Watermark Override is less than the minimum optimized watermark for a Storage Node.

To learn how to resolve this alert, go to Troubleshoot Low read-only watermark override alerts.

Low root disk capacity

The space available for the root disk is low.

  1. Monitor this alert to see if the issue resolves on its own and the disk space becomes available again.

  2. Contact technical support if the available space continues to decrease.

Low system data capacity

The space available for StorageGRID system data on the /var/local file system is low.

  1. Monitor this alert to see if the issue resolves on its own and the disk space becomes available again.

  2. Contact technical support if the available space continues to decrease.

Low tmp directory free space

The space available in the /tmp directory is low.

  1. Monitor this alert to see if the issue resolves on its own and the disk space becomes available again.

  2. Contact technical support if the available space continues to decrease.

Node network connectivity error

Errors have occurred while transferring data between nodes.

Network connectivity errors might clear without manual intervention. Contact technical support if the errors do not clear.

See the instructions for the Network Receive Error (NRER) alarm in Troubleshoot network, hardware, and platform issues.

Node network reception frame error

A high percentage of the network frames received by a node had errors.

This alert might indicate a hardware issue, such as a bad cable or a failed transceiver on either end of the Ethernet connection.

  1. If you are using an appliance, try replacing each SFP+ or SFP28 transceiver and cable, one at a time, to see if the alert clears.

  2. If this alert persists, contact technical support.

Node not in sync with NTP server

The node’s time is not in sync with the network time protocol (NTP) server.

  1. Verify that you have specified at least four external NTP servers, each providing a Stratum 3 or better reference.

  2. Check that all NTP servers are operating normally.

  3. Verify the connections to the NTP servers. Make sure they are not blocked by a firewall.
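
The requirement in step 1 can be expressed as a simple check. The following Python sketch assumes you have already collected each external NTP server's hostname and reported stratum, for example from your NTP client's status output. The function name and input shape are hypothetical.

```python
from typing import List, Tuple

MIN_SERVERS = 4
MAX_STRATUM = 3  # "Stratum 3 or better" means a stratum number of 3 or lower


def check_ntp_sources(sources: List[Tuple[str, int]]) -> List[str]:
    """Return a list of problems; an empty list means step 1 is satisfied."""
    problems = []
    if len(sources) < MIN_SERVERS:
        problems.append(
            f"only {len(sources)} external NTP servers; need at least {MIN_SERVERS}"
        )
    for host, stratum in sources:
        # Stratum 1 is the best a server can report; 0 is not a valid server stratum.
        if not 1 <= stratum <= MAX_STRATUM:
            problems.append(f"{host} reports stratum {stratum}; need {MAX_STRATUM} or better")
    return problems
```

The same check applies to the Node not locked with NTP server alert, which lists the same requirement.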

Node not locked with NTP server

The node is not locked to a network time protocol (NTP) server.

  1. Verify that you have specified at least four external NTP servers, each providing a Stratum 3 or better reference.

  2. Check that all NTP servers are operating normally.

  3. Verify the connections to the NTP servers. Make sure they are not blocked by a firewall.

Non appliance node network down

One or more network devices are down or disconnected. This alert indicates that a network interface (eth) for a node installed on a virtual machine or Linux host is not accessible.

Contact technical support.

Object existence check failed

The object existence check job has failed.

  1. Select MAINTENANCE > Object existence check.

  2. Note the error message. Perform the appropriate corrective actions:

    Failed to start, Lost connection, Unknown error

    1. Ensure the Storage Nodes and volumes included in the job are online and available.

    2. Ensure there are no service or volume failures on the Storage Nodes. If a service is not running, start or restart the service. See the recovery and maintenance instructions.

    3. Ensure the selected consistency control can be satisfied.

    4. After resolving any issues, select Retry. The job will resume from the last valid state.

    Critical storage error in volume

    1. Recover the failed volume. See the recovery and maintenance instructions.

    2. Select Retry.

    3. After the job completes, create another job for the remaining volumes on the node to check for additional errors.

  3. If you are unable to resolve the issues, contact technical support.

Object existence check stalled

The object existence check job has stalled.

The object existence check job cannot continue. Either one or more Storage Nodes or volumes included in the job are offline or unresponsive, or the selected consistency control can no longer be satisfied because too many nodes are down or unavailable.

  1. Ensure that all Storage Nodes and volumes being checked are online and available (select NODES).

  2. Ensure that sufficient Storage Nodes are online and available to allow the current coordinator node to read object metadata using the selected consistency control. If necessary, start or restart a service. See the recovery and maintenance instructions.

    After you complete steps 1 and 2, the job automatically resumes where it left off.

  3. If the selected consistency control cannot be satisfied, cancel the job and start another job using a lower consistency control.

  4. If you are unable to resolve the issues, contact technical support.

Objects lost

One or more objects have been lost from the grid.

This alert might indicate that data has been permanently lost and is not retrievable.

  1. Investigate this alert immediately. You might need to take action to prevent further data loss. You also might be able to restore a lost object if you take prompt action.

  2. When the underlying problem is resolved, reset the counter:

    1. Select SUPPORT > Tools > Grid topology.

    2. For the Storage Node that raised the alert, select site > grid node > LDR > Data Store > Configuration > Main.

    3. Select Reset Lost Objects Count and click Apply Changes.

Platform services unavailable

Too few Storage Nodes with the RSM service are running or available at a site.

Make sure that the majority of the Storage Nodes that have the RSM service at the affected site are running and in a non-error state.

See “Troubleshooting platform services” in the instructions for administering StorageGRID.

S3 PUT Object size too large

An S3 client is attempting to perform a PUT Object operation that exceeds the S3 size limits.

  1. Use the tenant ID shown in the alert details to identify the tenant account.

  2. Select SUPPORT > Tools > Logs, and collect the Application Logs for the Storage Node shown in the alert details. Specify a time period that starts 15 minutes before the time of the alert and ends 15 minutes after it.

  3. Extract the downloaded archive, and navigate to the location of bycast.log (/GID<grid_id>_<time_stamp>/<site_node>/<time_stamp>/grid/bycast.log).

  4. Search the contents of bycast.log for "method=PUT" and identify the IP address of the S3 client by looking at the clientIP field.

  5. Inform all client users that the maximum PUT Object size is 5 GiB.

  6. Use multipart uploads for objects larger than 5 GiB.
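
Steps 2 through 6 can be sketched in Python. This example is illustrative only. It assumes that bycast.log records fields as key=value tokens (such as method=PUT and clientIP=<address>, optionally quoted), which you should confirm against your own log before relying on the pattern. All function names are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime, timedelta
from pathlib import Path
from typing import Iterable, Tuple

MAX_PUT_OBJECT_BYTES = 5 * 1024**3  # 5 GiB, the limit stated in step 5

# Assumed field layout: clientIP=<address>, with optional double quotes.
CLIENT_IP = re.compile(r'clientIP="?([0-9A-Fa-f.:]+)')


def log_window(alert_time: datetime, minutes: int = 15) -> Tuple[datetime, datetime]:
    """Start and end of the log collection period around the alert (step 2)."""
    margin = timedelta(minutes=minutes)
    return alert_time - margin, alert_time + margin


def put_client_ips(lines: Iterable[str]) -> Counter:
    """Count PUT requests per client IP address (step 4)."""
    counts = Counter()
    for line in lines:
        if "method=PUT" not in line:
            continue
        match = CLIENT_IP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def scan_archive(root: str) -> Counter:
    """Scan every bycast.log found under the extracted archive directory (step 3)."""
    totals = Counter()
    for log in Path(root).rglob("bycast.log"):
        with log.open(errors="replace") as handle:
            totals.update(put_client_ips(handle))
    return totals


def needs_multipart(size_bytes: int) -> bool:
    """True if an object is too large for a single PUT Object request (step 6)."""
    return size_bytes > MAX_PUT_OBJECT_BYTES
```

Call scan_archive with the directory where you extracted the archive. The client addresses with the highest counts are a starting point for identifying which S3 client to contact.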

Services appliance link down on Admin Network port 1

The Admin Network port 1 on the appliance is down or disconnected.

  1. Check the cable and physical connection to Admin Network port 1.

  2. Address any connection issues. See the installation and maintenance instructions for your appliance hardware.

  3. If this port is disconnected on purpose, disable this rule. From the Grid Manager, select ALERTS > Rules, select the rule, and click Edit rule. Then, uncheck the Enabled check box.

Services appliance link down on Admin Network (or Client Network)

The appliance interface to the Admin Network (eth1) or the Client Network (eth2) is down or disconnected.

  1. Check the cables, SFPs, and physical connections to the StorageGRID network.

  2. Address any connection issues. See the installation and maintenance instructions for your appliance hardware.

  3. If this port is disconnected on purpose, disable this rule. From the Grid Manager, select ALERTS > Rules, select the rule, and click Edit rule. Then, uncheck the Enabled check box.

Services appliance link down on network port 1, 2, 3, or 4

Network port 1, 2, 3, or 4 on the appliance is down or disconnected.

  1. Check the cables, SFPs, and physical connections to the StorageGRID network.

  2. Address any connection issues. See the installation and maintenance instructions for your appliance hardware.

  3. If this port is disconnected on purpose, disable this rule. From the Grid Manager, select ALERTS > Rules, select the rule, and click Edit rule. Then, uncheck the Enabled check box.

Services appliance storage connectivity degraded

One of the two SSDs in a services appliance has failed or is out of synchronization with the other.

Appliance functionality is not impacted, but you should address the issue immediately. If both drives fail, the appliance will no longer function.

  1. From the Grid Manager, select NODES > services appliance, and then select the Hardware tab.

  2. Review the message in the Storage RAID Mode field.

  3. If the message shows the progress of a resynchronization operation, wait for the operation to complete and then confirm that the alert is resolved. A resynchronization message means that an SSD was replaced recently or that it is being resynchronized for another reason.

  4. If the message indicates that one of the SSDs has failed, replace the failed drive as soon as possible.

    For instructions on how to replace a drive in a services appliance, see the SG100 and SG1000 appliances installation and maintenance guide.

Storage appliance link down on Admin Network port 1

The Admin Network port 1 on the appliance is down or disconnected.

  1. Check the cable and physical connection to Admin Network port 1.

  2. Address any connection issues. See the installation and maintenance instructions for your appliance hardware.

  3. If this port is disconnected on purpose, disable this rule. From the Grid Manager, select ALERTS > Rules, select the rule, and click Edit rule. Then, uncheck the Enabled check box.

Storage appliance link down on Admin Network (or Client Network)

The appliance interface to the Admin Network (eth1) or the Client Network (eth2) is down or disconnected.

  1. Check the cables, SFPs, and physical connections to the StorageGRID network.

  2. Address any connection issues. See the installation and maintenance instructions for your appliance hardware.

  3. If this port is disconnected on purpose, disable this rule. From the Grid Manager, select ALERTS > Rules, select the rule, and click Edit rule. Then, uncheck the Enabled check box.

Storage appliance link down on network port 1, 2, 3, or 4

Network port 1, 2, 3, or 4 on the appliance is down or disconnected.

  1. Check the cables, SFPs, and physical connections to the StorageGRID network.

  2. Address any connection issues. See the installation and maintenance instructions for your appliance hardware.

  3. If this port is disconnected on purpose, disable this rule. From the Grid Manager, select ALERTS > Rules, select the rule, and click Edit rule. Then, uncheck the Enabled check box.

Storage appliance storage connectivity degraded

There is a problem with one or more connections between the compute controller and storage controller.

  1. Go to the appliance to check the port indicator lights.

  2. If a port’s lights are off, confirm the cable is properly connected. As needed, replace the cable.

  3. Wait up to five minutes.

    Note: If a second cable needs to be replaced, do not unplug it for at least 5 minutes. Otherwise, the root volume might become read-only, which requires a hardware restart.

  4. From the Grid Manager, select NODES. Then, select the Hardware tab of the node that had the problem. Verify that the alert condition has resolved.

Storage device inaccessible

A storage device cannot be accessed.

This alert indicates that a volume cannot be mounted or accessed because of a problem with an underlying storage device.

  1. Check the status of all storage devices used for the node:

    • If the node is installed on a virtual machine or Linux host, follow the instructions for your operating system to run hardware diagnostics or perform a filesystem check.

    • If the node is installed on an SG100, SG1000, or SG6000 appliance, use the BMC.

    • If the node is installed on an SG5600 or SG5700 appliance, use SANtricity System Manager.

  2. If necessary, replace the component. See the instructions for your appliance.

Tenant quota usage high

A high percentage of tenant quota space is being used. If a tenant exceeds its quota, new ingests are rejected.

Note: This alert rule is disabled by default because it might generate a lot of notifications.

  1. From the Grid Manager, select TENANTS.

  2. Sort the table by Quota Utilization.

  3. Select a tenant whose quota utilization is close to 100%.

  4. Do either or both of the following:

    • Select Edit to increase the storage quota for the tenant.

    • Notify the tenant that their quota utilization is high.
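
Steps 2 and 3 amount to ranking tenants by quota utilization. The following Python sketch shows the arithmetic, assuming you have exported each tenant's used space and quota. The Tenant type, the function names, and the 90% threshold are assumptions made for this example.

```python
from typing import List, NamedTuple, Optional


class Tenant(NamedTuple):
    name: str
    used_bytes: int
    quota_bytes: Optional[int]  # None means no quota is set


def quota_utilization(tenant: Tenant) -> Optional[float]:
    """Percentage of quota in use, or None if the tenant has no quota."""
    if not tenant.quota_bytes:
        return None
    return 100.0 * tenant.used_bytes / tenant.quota_bytes


def tenants_near_quota(tenants: List[Tenant], threshold_pct: float = 90.0) -> List[Tenant]:
    """Tenants at or above threshold_pct, highest utilization first."""
    scored = [(quota_utilization(t), t) for t in tenants]
    near = [(pct, t) for pct, t in scored if pct is not None and pct >= threshold_pct]
    near.sort(key=lambda item: item[0], reverse=True)
    return [t for _, t in near]
```

Tenants without a quota are skipped because they cannot exceed one.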

Unable to communicate with node

One or more services are unresponsive, or the node cannot be reached.

This alert indicates that a node is disconnected for an unknown reason. For example, a service on the node might be stopped, or the node might have lost its network connection because of a power failure or unexpected outage.

Monitor this alert to see if the issue resolves on its own. If the issue persists:

  1. Determine if there is another alert affecting this node. This alert might be resolved when you resolve the other alert.

  2. Confirm that all of the services on this node are running. If a service is stopped, try starting it. See the recovery and maintenance instructions.

  3. Ensure that the host for the node is powered on. If it is not, start the host.

    Note: If more than one host is powered off, see the recovery and maintenance instructions.

  4. Determine if there is a network connectivity issue between this node and the Admin Node.

  5. If you cannot resolve the alert, contact technical support.

Unexpected node reboot

A node rebooted unexpectedly within the last 24 hours.

  1. Monitor this alert. The alert will be cleared after 24 hours. However, if the node reboots unexpectedly again, this alert will be triggered again.

  2. If you cannot resolve the alert, there might be a hardware failure. Contact technical support.

Unidentified corrupt object detected

A file was found in replicated object storage that could not be identified as a replicated object.

  1. Determine if there are any issues with the underlying storage on a Storage Node. For example, run hardware diagnostics or perform a filesystem check.

  2. After resolving any storage issues, run an object existence check to determine if any replicated copies, as defined by your ILM policy, are missing.

  3. Monitor this alert. The alert will clear after 24 hours, but will be triggered again if the issue has not been fixed.

  4. If you cannot resolve the alert, contact technical support.