Alarms reference (legacy system)

The following table lists all of the legacy Default alarms. If an alarm is triggered, you can look up the alarm code in this table to find the recommended actions.

Note: While the legacy alarm system continues to be supported, the alert system offers significant benefits and is easier to use.

Code Name Service Recommended action

ABRL Available Attribute Relays BADC, BAMS, BARC, BCLB, BCMN, BLDR, BNMS, BSSM, BDDS

Restore connectivity to a service (an ADC service) running an Attribute Relay Service as soon as possible. If there are no connected attribute relays, the grid node cannot report attribute values to the NMS service. Thus, the NMS service can no longer monitor the status of the service, or update attributes for the service.

If the problem persists, contact technical support.

ACMS Available Metadata Services BARC, BLDR, BCMN

An alarm is triggered when an LDR or ARC service loses connection to a DDS service. If this occurs, ingest or retrieve transactions cannot be processed. If the unavailability of DDS services is only a brief transient issue, transactions can be delayed.

Check and restore connections to a DDS service to clear this alarm and return the service to full functionality.

ACTS Cloud Tiering Service Status ARC

Only available for Archive Nodes with a Target Type of Cloud Tiering - Simple Storage Service (S3).

If the ACTS attribute for the Archive Node is set to Read-Only Enabled or Read-Write Disabled, you must set the attribute to Read-Write Enabled.

If a major alarm is triggered due to an authentication failure, verify the credentials associated with the destination bucket and update the values, if necessary.

If a major alarm is triggered due to any other reason, contact technical support.

ADCA ADC Status ADC

If an alarm is triggered, select Support > Tools > Grid Topology. Then select site > grid node > ADC > Overview > Main and ADC > Alarms > Main to determine the cause of the alarm.

If the problem persists, contact technical support.

ADCE ADC State ADC

If the value of ADC State is Standby, continue monitoring the service and if the problem persists, contact technical support.

If the value of ADC State is Offline, restart the service. If the problem persists, contact technical support.

AITE Retrieve State BARC

Only available for Archive Nodes with a Target Type of Tivoli Storage Manager (TSM).

If the value of Retrieve State is Waiting for Target, check the TSM middleware server and ensure that it is operating correctly. If the Archive Node has just been added to the StorageGRID system, ensure that the Archive Node's connection to the targeted external archival storage system is configured correctly.

If the value of Archive Retrieve State is Offline, attempt to update the state to Online. Select Support > Tools > Grid Topology. Then select site > grid node > ARC > Retrieve > Configuration > Main, select Archive Retrieve State > Online, and click Apply Changes.

If the problem persists, contact technical support.

AITU Retrieve Status BARC

If the value of Retrieve Status is Target Error, check the targeted external archival storage system for errors.

If the value of Archive Retrieve Status is Session Lost, check the targeted external archival storage system to ensure it is online and operating correctly. Check the network connection with the target.

If the value of Archive Retrieve Status is Unknown Error, contact technical support.

ALIS Inbound Attribute Sessions ADC

If the number of inbound attribute sessions on an attribute relay grows too large, it can be an indication that the StorageGRID system has become unbalanced. Under normal conditions, attribute sessions should be evenly distributed amongst ADC services. An imbalance can lead to performance issues.

If the problem persists, contact technical support.

ALOS Outbound Attribute Sessions ADC

The ADC service has a high number of attribute sessions, and is becoming overloaded. If this alarm is triggered, contact technical support.

ALUR Unreachable Attribute Repositories ADC

Check network connectivity with the NMS service to ensure that the service can contact the attribute repository.

If this alarm is triggered and network connectivity is good, contact technical support.

AMQS Audit Messages Queued BADC, BAMS, BARC, BCLB, BCMN, BLDR, BNMS, BDDS

If audit messages cannot be immediately forwarded to an audit relay or repository, the messages are stored in a disk queue. If the disk queue becomes full, outages can occur.

To allow you to respond in time to prevent an outage, AMQS alarms are triggered when the number of messages in the disk queue reaches the following thresholds:
  • Notice: More than 100,000 messages
  • Minor: At least 500,000 messages
  • Major: At least 2,000,000 messages
  • Critical: At least 5,000,000 messages
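The thresholds above can be read as a simple mapping from queue size to severity. The following is a minimal sketch in Python, not part of StorageGRID; the function name amqs_severity is hypothetical, and only the threshold values come from the list above.

```python
from typing import Optional


def amqs_severity(queued_messages: int) -> Optional[str]:
    """Map a disk-queue size to the AMQS severity listed above.

    Hypothetical helper for illustration only. Notice is "more than"
    100,000 messages; the other levels are "at least" their threshold.
    """
    if queued_messages >= 5_000_000:
        return "Critical"
    if queued_messages >= 2_000_000:
        return "Major"
    if queued_messages >= 500_000:
        return "Minor"
    if queued_messages > 100_000:
        return "Notice"
    return None  # below every threshold, no alarm


print(amqs_severity(750_000))
```

Checking the highest threshold first keeps each comparison independent of the others.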

If an AMQS alarm is triggered, check the load on the system. If there have been a significant number of transactions, the alarm should resolve itself over time. In this case, you can ignore the alarm.

If the alarm persists and increases in severity, view a chart of the queue size. If the number is steadily increasing over hours or days, the audit load has likely exceeded the audit capacity of the system. Reduce the client operation rate or decrease the number of audit messages logged by changing the audit level to Error or Off. See Changing audit message levels in Understanding audit messages.

Understanding audit messages

AOTE Store State BARC

Only available for Archive Nodes with a Target Type of Tivoli Storage Manager (TSM).

If the value of Store State is Waiting for Target, check the external archival storage system and ensure that it is operating correctly. If the Archive Node has just been added to the StorageGRID system, ensure that the Archive Node's connection to the targeted external archival storage system is configured correctly.

If the value of Store State is Offline, check the value of Store Status. Correct any problems before moving the Store State back to Online.

AOTU Store Status BARC

If the value of Store Status is Session Lost, check that the external archival storage system is connected and online.

If the value of Store Status is Target Error, check the external archival storage system for errors.

If the value of Store Status is Unknown Error, contact technical support.

APMS Storage Multipath Connectivity SSM

If the multipath state alarm appears as Degraded (select Support > Tools > Grid Topology, then select site > grid node > SSM > Events), do the following:
  1. Plug in or replace the cable that does not display any indicator lights.
  2. Wait one to five minutes.

    Do not unplug the other cable until at least five minutes after you plug in the first one. Unplugging too early can cause the root volume to become read-only, which requires that the hardware be restarted.

  3. Return to the SSM > Resources page, and verify that the Degraded Multipath status has changed to Nominal in the Storage Hardware section.

ARCE ARC State ARC

The ARC service has a state of Standby until all ARC components (Replication, Store, Retrieve, Target) have started. It then transitions to Online.

If the value of ARC State does not transition from Standby to Online, check the status of the ARC components.

If the value of ARC State is Offline, restart the service. If the problem persists, contact technical support.

AROQ Objects Queued ARC

This alarm can be triggered if the removable storage device is running slowly due to problems with the targeted external archival storage system, or if it encounters multiple read errors. Check the external archival storage system for errors, and ensure that it is operating correctly.

In some cases, this error can occur as a result of a high rate of data requests. Monitor the number of objects queued as system activity declines.

ARRF Request Failures ARC

If a retrieval from the targeted external archival storage system fails, the Archive Node retries the retrieval as the failure can be due to a transient issue. However, if the object data is corrupt or has been marked as being permanently unavailable, the retrieval does not fail. Instead, the Archive Node continuously retries the retrieval and the value for Request Failures continues to increase.

This alarm can indicate that the storage media holding the requested data is corrupt. Check the external archival storage system to further diagnose the problem.

If you determine that the object data is no longer in the archive, the object will have to be removed from the StorageGRID system. For more information, contact technical support.

Once the problem that triggered this alarm is addressed, reset the failures count. Select Support > Tools > Grid Topology. Then select site > grid node > ARC > Retrieve > Configuration > Main, select Reset Request Failure Count and click Apply Changes.

ARRV Verification Failures ARC

To diagnose and correct this problem, contact technical support.

Once the problem that triggered this alarm is addressed, reset the failures count. Select Support > Tools > Grid Topology. Then select site > grid node > ARC > Retrieve > Configuration > Main, select Reset Verification Failure Count and click Apply Changes.

ARVF Store Failures ARC

This alarm can occur as a result of errors with the targeted external archival storage system. Check the external archival storage system for errors, and ensure that it is operating correctly.

Once the problem that triggered this alarm is addressed, reset the failures count. Select Support > Tools > Grid Topology. Then select site > grid node > ARC > Retrieve > Configuration > Main, select Reset Store Failure Count, and click Apply Changes.

ASXP Audit Shares AMS

An alarm is triggered if the value of Audit Shares is Unknown. This alarm can indicate a problem with the installation or configuration of the Admin Node.

If the problem persists, contact technical support.

AUMA AMS Status AMS

If the value of AMS Status is DB Connectivity Error, restart the grid node.

If the problem persists, contact technical support.

AUME AMS State AMS

If the value of AMS State is Standby, continue monitoring the StorageGRID system. If the problem persists, contact technical support.

If the value of AMS State is Offline, restart the service. If the problem persists, contact technical support.

AUXS Audit Export Status AMS

If an alarm is triggered, correct the underlying problem, and then restart the AMS service.

If the problem persists, contact technical support.

BADD Storage Controller Failed Drive Count SSM

This alarm is triggered when one or more drives in a StorageGRID appliance have failed or are not optimal.

Replace the drives as required.

BASF Available Object Identifiers CMN

When a StorageGRID system is provisioned, the CMN service is allocated a fixed number of object identifiers. This alarm is triggered when the StorageGRID system begins to exhaust its supply of object identifiers.

To allocate more identifiers, contact technical support.

BASS Identifier Block Allocation Status CMN

By default, an alarm is triggered when object identifiers cannot be allocated because ADC quorum cannot be reached.

Identifier block allocation on the CMN service requires a quorum (50% + 1) of the ADC services to be online and connected. If quorum is unavailable, the CMN service is unable to allocate new identifier blocks until ADC quorum is re-established. If ADC quorum is lost, there is generally no immediate impact on the StorageGRID system (clients can still ingest and retrieve content), as approximately one month's supply of identifiers is cached elsewhere in the grid; however, if the condition continues, the StorageGRID system will lose the ability to ingest new content.
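As a rough illustration of the quorum rule, the sketch below computes how many ADC services must be online and connected. It assumes that "50% + 1" means a strict majority, that is, half the ADC count rounded down, plus one; the source does not spell out the rounding for odd counts. The function names are hypothetical.

```python
def adc_quorum_size(total_adc_services: int) -> int:
    """Smallest number of ADC services that forms a quorum.

    Assumption: "50% + 1" is interpreted as a strict majority
    (floor(n / 2) + 1). Hypothetical helper for illustration only.
    """
    if total_adc_services < 1:
        raise ValueError("at least one ADC service is required")
    return total_adc_services // 2 + 1


def has_adc_quorum(online_and_connected: int, total_adc_services: int) -> bool:
    """True if enough ADC services are online and connected."""
    return online_and_connected >= adc_quorum_size(total_adc_services)


# With three ADC services, two must be online and connected.
print(adc_quorum_size(3), has_adc_quorum(1, 3))
```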

If an alarm is triggered, investigate the reason for the loss of ADC quorum (for example, it can be a network or Storage Node failure) and take corrective action.

If the problem persists, contact technical support.

BRDT Compute Controller Chassis Temperature SSM

An alarm is triggered if the temperature of the compute controller in a StorageGRID appliance exceeds a nominal threshold.

Check the hardware components and environmental conditions for signs of overheating. If necessary, replace the component.

BTOF Offset BADC, BLDR, BNMS, BAMS, BCLB, BCMN, BARC

An alarm is triggered if the service time (seconds) differs significantly from the operating system time. Under normal conditions, the service should resynchronize itself. If the service time drifts too far from the operating system time, system operations can be affected. Confirm that the StorageGRID system’s time source is correct.

If the problem persists, contact technical support.

BTSE Clock State BADC, BLDR, BNMS, BAMS, BCLB, BCMN, BARC

An alarm is triggered if the service’s time is not synchronized with the time tracked by the operating system. Under normal conditions, the service should resynchronize itself. If the time drifts too far from operating system time, system operations can be affected. Confirm that the StorageGRID system’s time source is correct.

If the problem persists, contact technical support.

CAHP Java Heap Usage Percent DDS

An alarm is triggered if Java is unable to perform garbage collection at a rate that allows enough heap space for the system to properly function. An alarm might indicate a user workload that exceeds the resources available across the system for the DDS metadata store. Check the ILM Activity in the Dashboard, or select Support > Tools > Grid Topology, then select site > grid node > DDS > Resources > Overview > Main.

If the problem persists, contact technical support.

CAIH Number Available Ingest Destinations CLB

This alarm is deprecated.

CAQH Number Available Destinations CLB

This alarm clears when underlying issues of available LDR services are corrected. Ensure that the HTTP component of LDR services is online and running normally.

If the problem persists, contact technical support.

CASA Data Store Status DDS

An alarm is raised if the Cassandra metadata store becomes unavailable.

Check the status of Cassandra:
  1. At the Storage Node, log in as admin and su to root using the password listed in the Passwords.txt file.
  2. Enter: service cassandra status
  3. If Cassandra is not running, restart it: service cassandra restart
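The check-then-restart steps above could be scripted along the following lines. This is a sketch only: it assumes it runs as root on the Storage Node and that service cassandra status exits with code 0 when Cassandra is running (the usual init-script convention, not confirmed by this document). The function names and the injectable run parameter are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


def ensure_cassandra_running(run: Runner = _run) -> bool:
    """Restart Cassandra if the status check reports it is not running.

    Returns True if a restart was issued, False otherwise.
    Assumption: exit code 0 from the status command means "running".
    """
    if run(["service", "cassandra", "status"]) == 0:
        return False
    run(["service", "cassandra", "restart"])
    return True
```

Passing the command runner as a parameter lets you exercise the logic with a stub before running it against a real node.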

This alarm might also indicate that the metadata store (Cassandra database) for a Storage Node requires rebuilding.

Troubleshooting the Services: Status - Cassandra (SVST) alarm

If the problem persists, contact technical support.

CASE Data Store State DDS

This alarm is triggered during installation or expansion to indicate a new data store is joining the grid.

CCES Incoming Sessions - Established CLB

This alarm is triggered if there are 20,000 or more HTTP sessions currently active (open) on the Gateway Node. If a client has too many connections, you might see connection failures. You should reduce the workload.

CCNA Compute Hardware SSM

This alarm is triggered if the status of the compute controller hardware in a StorageGRID appliance is Needs Attention.

CDLP Metadata Used Space (Percent) DDS

This alarm is triggered when the Metadata Effective Space (CEMS) reaches 70% full (minor alarm), 90% full (major alarm), and 100% full (critical alarm).
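The three CDLP thresholds can be expressed as a small lookup. The sketch below is illustrative only; the function name cdlp_severity is hypothetical, and "reaches" is taken to mean greater than or equal to the stated percentage.

```python
from typing import Optional


def cdlp_severity(metadata_used_percent: float) -> Optional[str]:
    """Map Metadata Used Space (Percent) to the CDLP severity above.

    Hypothetical helper for illustration only.
    """
    if metadata_used_percent >= 100:
        return "Critical"
    if metadata_used_percent >= 90:
        return "Major"
    if metadata_used_percent >= 70:
        return "Minor"
    return None  # below 70%, no alarm


print(cdlp_severity(92.5))
```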

If this alarm reaches the 90% threshold, a warning appears on the Dashboard in the Grid Manager. You must perform an expansion procedure to add new Storage Nodes as soon as possible. See the instructions for expanding a StorageGRID grid.

If this alarm reaches the 100% threshold, you must stop ingesting objects and add Storage Nodes immediately. Cassandra requires a certain amount of space to perform essential operations such as compaction and repair. These operations will be impacted if object metadata uses more than 100% of the allowed space. Undesirable results can occur.

Note: Contact technical support if you are unable to add Storage Nodes.

Once new Storage Nodes are added, the system automatically rebalances object metadata across all Storage Nodes, and the alarm clears.

Troubleshooting the Low metadata storage alert

Expanding a StorageGRID system

CLBA CLB Status CLB

If an alarm is triggered, select Support > Tools > Grid Topology, then select site > grid node > CLB > Overview > Main and CLB > Alarms > Main to determine the cause of the alarm and to troubleshoot the problem.

If the problem persists, contact technical support.

CLBE CLB State CLB

If the value of CLB State is Standby, continue monitoring the situation and if the problem persists, contact technical support.

If the state is Offline and there are no known server hardware issues (for example, the server is unplugged) or scheduled downtime, restart the service. If the problem persists, contact technical support.

CMNA CMN Status CMN

If the value of CMN Status is Error, select Support > Tools > Grid Topology, then select site > grid node > CMN > Overview > Main and CMN > Alarms > Main to determine the cause of the error and to troubleshoot the problem.

During a hardware refresh of the primary Admin Node, an alarm is triggered and the value of CMN Status is No Online CMN while the CMNs are switched (the value of the old CMN State is Standby and the new is Online).

If the problem persists, contact technical support.

CPRC Remaining Capacity NMS

An alarm is triggered if the remaining capacity (number of available connections that can be opened to the NMS database) falls below the threshold configured for the alarm severity.

If an alarm is triggered, contact technical support.

CPSA Compute Controller Power Supply A SSM

An alarm is triggered if there is an issue with power supply A in the compute controller for a StorageGRID appliance.

If necessary, replace the component.

CPSB Compute Controller Power Supply B SSM

An alarm is triggered if there is an issue with power supply B in the compute controller for a StorageGRID appliance.

If necessary, replace the component.

CPUT Compute Controller CPU Temperature SSM

An alarm is triggered if the temperature of the CPU in the compute controller in a StorageGRID appliance exceeds a nominal threshold.

If the Storage Node is a StorageGRID appliance, the StorageGRID system indicates that the controller needs attention.

Check the hardware components and environmental conditions for signs of overheating. If necessary, replace the component.

DNST DNS Status SSM

After installation completes, a DNST alarm is triggered in the SSM service. After the DNS is configured and the new server information reaches all grid nodes, the alarm is canceled.

ECCD Corrupt Fragments Detected LDR

An alarm is triggered when the background verification process detects a corrupt erasure coded fragment. If a corrupt fragment is detected, an attempt is made to rebuild the fragment.

Reset the Corrupt Fragments Detected and Copies Lost attributes to zero and monitor them to see if the counts go up again. If the counts do go up, there might be a problem with the Storage Node's underlying storage. A copy of erasure coded object data is not considered missing until the number of lost or corrupt fragments breaches the erasure code's fault tolerance; therefore, it is possible to have a corrupt fragment and still be able to retrieve the object.
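To make the fault-tolerance point concrete, the sketch below assumes a conventional erasure coding scheme with a number of data fragments plus a number of parity fragments, where the fault tolerance equals the parity count. The class name, field names, and the 4+2 example scheme are hypothetical and are not taken from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErasureCodingScheme:
    """Hypothetical description of an erasure coding scheme."""
    data_fragments: int
    parity_fragments: int

    @property
    def fault_tolerance(self) -> int:
        # Assumption: the scheme tolerates losing as many fragments
        # as it has parity fragments.
        return self.parity_fragments

    def is_retrievable(self, lost_or_corrupt: int) -> bool:
        """True while the bad-fragment count stays within tolerance."""
        return lost_or_corrupt <= self.fault_tolerance


scheme = ErasureCodingScheme(data_fragments=4, parity_fragments=2)
# One corrupt fragment is within tolerance; three are not.
print(scheme.is_retrievable(1), scheme.is_retrievable(3))
```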

If the problem persists, contact technical support.

ECST Verification Status LDR

This alarm indicates the current status of the background verification process for erasure coded object data on this Storage Node.

A major alarm is triggered if there is an error in the background verification process.

FOPN Open File Descriptors BADC, BAMS, BARC, BCLB, BCMN, BLDR, BNMS, BSSM, BDDS

FOPN can become large during peak activity. If it does not diminish during periods of slow activity, contact technical support.

HSTE HTTP State BLDR

HSTE and HSTU are related to the HTTP protocol for all LDR traffic, including S3, Swift, and other internal StorageGRID traffic. An alarm indicates that one of the following situations has occurred:

  • The HTTP protocol has been taken offline manually.
  • The Auto-Start HTTP attribute has been disabled.
  • The LDR service is shutting down.

The Auto-Start HTTP attribute is enabled by default. If this setting is changed, HTTP could remain offline after a restart.

If necessary, wait for the LDR service to restart.

Select Support > Tools > Grid Topology. Then select Storage Node > LDR > Configuration. If the HTTP protocol is offline, place it online. Verify that the Auto-Start HTTP attribute is enabled.

If the HTTP protocol remains offline, contact technical support.

HSTU HTTP Status BLDR

See the recommended actions for HSTE (HTTP State).

HTAS Auto-Start HTTP LDR

Specifies whether to start HTTP services automatically on start-up. This is a user-specified configuration option.

IRSU Inbound Replication Status BLDR, BARC

An alarm indicates that inbound replication has been disabled. Confirm configuration settings: Select Support > Tools > Grid Topology. Then select site > grid node > LDR > Replication > Configuration > Main.

LATA Average Latency NMS

Check for connectivity issues.

Check system activity to confirm whether it has increased. An increase in system activity increases attribute data activity, which delays the processing of attribute data. This can be normal system activity and will subside.

Check for multiple alarms. An increase in average latency times can be indicated by an excessive number of triggered alarms.

If the problem persists, contact technical support.

LDRE LDR State LDR

If the value of LDR State is Standby, continue monitoring the situation and if the problem persists, contact technical support.

If the value of LDR State is Offline, restart the service. If the problem persists, contact technical support.

LOST Lost Objects DDS, LDR

Triggered when the StorageGRID system fails to retrieve a copy of the requested object from anywhere in the system. Before a LOST (Lost Objects) alarm is triggered, the system attempts to retrieve and replace a missing object from elsewhere in the system.

Lost objects represent a loss of data. The Lost Objects attribute is incremented whenever the number of locations for an object drops to zero without the DDS service purposely purging the content to satisfy the ILM policy.

Investigate LOST (Lost Objects) alarms immediately. If the problem persists, contact technical support.

Troubleshooting lost and missing object data

MCEP Management Interface Certificate Expiry CMN

Triggered when the certificate used for accessing the management interface is about to expire.
  1. Go to Configuration > Server Certificates.
  2. In the Management Interface Server Certificate section, upload a new certificate.

Administering StorageGRID

MINQ E-mail Notifications Queued NMS

Check the network connections of the servers hosting the NMS service and the external mail server. Also confirm that the email server configuration is correct.

Configuring email server settings for alarms (legacy system)

MINS E-mail Notifications Status BNMS

A minor alarm is triggered if the NMS service is unable to connect to the mail server. Check the network connections of the servers hosting the NMS service and the external mail server. Also confirm that the email server configuration is correct.

Configuring email server settings for alarms (legacy system)

MISS NMS Interface Engine Status BNMS

An alarm is triggered if the NMS interface engine on the Admin Node that gathers and generates interface content is disconnected from the system. Check Server Manager to determine if the individual server application is down.

NANG Network Auto Negotiate Setting SSM

Check the network adapter configuration. The setting must match the preferences of your network routers and switches.

An incorrect setting can have a severe impact on system performance.

NDUP Network Duplex Setting SSM

Check the network adapter configuration. The setting must match the preferences of your network routers and switches.

An incorrect setting can have a severe impact on system performance.

NLNK Network Link Detect SSM

Check the network cable connections on the port and at the switch.

Check the network router, switch, and adapter configurations.

Restart the server.

If the problem persists, contact technical support.

NRER Receive Errors SSM

The following can be causes of NRER alarms:
  • Forward error correction (FEC) mismatch
  • Switch port and NIC MTU mismatch
  • High link error rates
  • NIC ring buffer overrun

Troubleshooting the Network Receive Error (NRER) alarm

NRLY Available Audit Relays BADC, BARC, BCLB, BCMN, BLDR, BNMS, BDDS

If audit relays are not connected to ADC services, audit events cannot be reported. They are queued and unavailable to users until the connection is restored.

Restore connectivity to an ADC service as soon as possible.

If the problem persists, contact technical support.

NSCA NMS Status NMS

If the value of NMS Status is DB Connectivity Error, restart the service. If the problem persists, contact technical support.

NSCE NMS State NMS

If the value of NMS State is Standby, continue monitoring and if the problem persists, contact technical support.

If the value of NMS State is Offline, restart the service. If the problem persists, contact technical support.

NSPD Speed SSM

This can be caused by network connectivity or driver compatibility issues. If the problem persists, contact technical support.

NTBR Free Tablespace NMS

If an alarm is triggered, check how fast database usage has been changing. A sudden drop (as opposed to a gradual change over time) indicates an error condition. If the problem persists, contact technical support.

Adjusting the alarm threshold allows you to proactively manage when additional storage needs to be allocated.

If the available space reaches a low threshold (see alarm threshold), contact technical support to change the database allocation.

NTER Transmit Errors SSM

These errors can clear without being manually reset. If they do not clear, check network hardware. Check that the adapter hardware and driver are correctly installed and configured to work with your network routers and switches.

When the underlying problem is resolved, reset the counter. Select Support > Tools > Grid Topology. Then select site > grid node > SSM > Resources > Configuration > Main, select Reset Transmit Error Count, and click Apply Changes.

NTFQ NTP Frequency Offset SSM

If the frequency offset exceeds the configured threshold, there is likely a hardware problem with the local clock. If the problem persists, contact technical support to arrange a replacement.

NTLK NTP Lock SSM

If the NTP daemon is not locked to an external time source, check network connectivity to the designated external time sources, their availability, and their stability.

NTOF NTP Time Offset SSM

If the time offset exceeds the configured threshold, there is likely a hardware problem with the oscillator of the local clock. If the problem persists, contact technical support to arrange a replacement.

NTSJ Chosen Time Source Jitter SSM

This value indicates the reliability and stability of the time source that NTP on the local server is using as its reference.

If an alarm is triggered, it can be an indication that the time source’s oscillator is defective, or that there is a problem with the WAN link to the time source.

NTSU NTP Status SSM

If the value of NTP Status is Not Running, contact technical support.

OPST Overall Power Status SSM

An alarm is triggered if the power of a StorageGRID appliance deviates from the recommended operating voltage.

Check the status of Power Supply A or B to determine which power supply is operating abnormally.

If necessary, replace the power supply.

OQRT Objects Quarantined LDR

After the objects are automatically restored by the StorageGRID system, the quarantined objects can be removed from the quarantine directory.

  1. Select Support > Tools > Grid Topology.
  2. Select site > Storage Node > LDR > Verification > Configuration > Main.
  3. Select Delete Quarantined Objects.
  4. Click Apply Changes.

The quarantined objects are removed, and the count is reset to zero.

ORSU Outbound Replication Status BLDR, BARC

An alarm indicates that outbound replication is not possible: storage is in a state where objects cannot be retrieved. An alarm is triggered if outbound replication is disabled manually. Select Support > Tools > Grid Topology. Then select site > grid node > LDR > Replication > Configuration.

An alarm is triggered if the LDR service is unavailable for replication. Select Support > Tools > Grid Topology. Then select site > grid node > LDR > Storage.

OSLF Shelf Status SSM

An alarm is triggered if the status of one of the components in the storage shelf for a storage appliance is degraded. Storage shelf components include the IOMs, fans, power supplies, and drive drawers.

If this alarm is triggered, see the maintenance instructions for your appliance.

PMEM Service Memory Usage (Percent) BADC, BAMS, BARC, BCLB, BCMN, BLDR, BNMS, BSSM, BDDS

This attribute can have a value of Over Y% RAM, where Y represents the percentage of memory being used by the server.

Figures under 80% are normal. Over 90% is considered a problem.
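The two figures above can be turned into a quick triage helper. The sketch below is illustrative only: the function name is hypothetical, and because the text does not classify values from 80% through 90%, the "monitor" label for that band is an assumption.

```python
def pmem_assessment(memory_used_percent: float) -> str:
    """Classify a PMEM reading using the figures above.

    Hypothetical helper. Under 80% is normal and over 90% is a
    problem; the label for the band in between is an assumption.
    """
    if memory_used_percent < 80:
        return "normal"
    if memory_used_percent > 90:
        return "problem"
    return "monitor"  # assumed label for 80% through 90%


print(pmem_assessment(93))
```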

If memory usage is high for a single service, monitor the situation and investigate.

If the problem persists, contact technical support.

PSAS Power Supply A Status SSM

An alarm is triggered if power supply A in a StorageGRID appliance deviates from the recommended operating voltage.

If necessary, replace power supply A.

PSBS Power Supply B Status SSM

An alarm is triggered if power supply B in a StorageGRID appliance deviates from the recommended operating voltage.

If necessary, replace power supply B.

RDTE Tivoli Storage Manager State BARC

Only available for Archive Nodes with a Target Type of Tivoli Storage Manager (TSM).

If the value of Tivoli Storage Manager State is Offline, check Tivoli Storage Manager Status and resolve any problems.

Bring the component back online. Select Support > Tools > Grid Topology. Then select site > grid node > ARC > Target > Configuration > Main, select Tivoli Storage Manager State > Online, and click Apply Changes.

RDTU Tivoli Storage Manager Status BARC

Only available for Archive Nodes with a Target Type of Tivoli Storage Manager (TSM).

If the value of Tivoli Storage Manager Status is Configuration Error and the Archive Node has just been added to the StorageGRID system, ensure that the TSM middleware server is correctly configured.

If the value of Tivoli Storage Manager Status is Connection Failure, or Connection Failure, Retrying, check the network configuration on the TSM middleware server, and the network connection between the TSM middleware server and the StorageGRID system.

If the value of Tivoli Storage Manager Status is Authentication Failure, or Authentication Failure, Reconnecting, the StorageGRID system can connect to the TSM middleware server, but cannot authenticate the connection. Check that the TSM middleware server is configured with the correct user, password, and permissions, and restart the service.

If the value of Tivoli Storage Manager Status is Session Failure, an established session has been lost unexpectedly. Check the network connection between the TSM middleware server and the StorageGRID system. Check the middleware server for errors.

If the value of Tivoli Storage Manager Status is Unknown Error, contact technical support.

RIRF Inbound Replications – Failed BLDR, BARC

An Inbound Replications – Failed alarm can occur during periods of high load or temporary network disruptions. After system activity reduces, this alarm should clear. If the count of failed replications continues to increase, look for network problems and verify that the source and destination LDR and ARC services are online and available.

To reset the count, select Support > Tools > Grid Topology, then select site > grid node > LDR > Replication > Configuration > Main. Select Reset Inbound Replication Failure Count, and click Apply Changes.

RIRQ Inbound Replications – Queued BLDR, BARC

Alarms can occur during periods of high load or temporary network disruption. After system activity decreases, this alarm should clear. If the count of queued replications continues to increase, look for network problems and verify that the source and destination LDR and ARC services are online and available.

RORQ Outbound Replications – Queued BLDR, BARC

The outbound replication queue contains object data being copied to satisfy ILM rules and objects requested by clients.

An alarm can occur as a result of a system overload. Wait to see if the alarm clears when system activity declines. If the alarm recurs, add capacity by adding Storage Nodes.

SAVP Total Usable Space (Percent) LDR

If usable space reaches a low threshold, options include expanding the StorageGRID system or moving object data to archive through an Archive Node.

SCAS Status CMN

If the value of Status for the active grid task is Error, look up the grid task message. Select Support > Tools > Grid Topology. Then select site > grid node > CMN > Grid Tasks > Overview > Main. The grid task message displays information about the error (for example, “check failed on node 12130011”).

After you have investigated and corrected the problem, restart the grid task. Select Support > Tools > Grid Topology. Then select site > grid node > CMN > Grid Tasks > Configuration > Main, and select Actions > Run.

If the value of Status for a grid task being aborted is Error, retry aborting the grid task.

If the problem persists, contact technical support.

SCEP Storage API Service Endpoints Certificate Expiry CMN

Triggered when the certificate used for accessing storage API endpoints is about to expire.
  1. Go to Configuration > Server Certificates.
  2. In the Object Storage API Service Endpoints Server Certificate section, upload a new certificate.
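
As a rough illustration of the "about to expire" condition, the following Python sketch computes how many days remain before a certificate's notAfter date. The function names, the 30-day warning window, and the sample dates are assumptions made for illustration only; they are not StorageGRID defaults, and the actual SCEP alarm thresholds are set within StorageGRID.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Return whole days left before `not_after`.

    `not_after` uses the certificate date format, for example
    'Jan 11 00:00:00 2030 GMT'. `now` is a Unix timestamp and
    defaults to the current time.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return int((expires - current) // SECONDS_PER_DAY)


def is_expiring_soon(not_after, warn_days=30, now=None):
    # warn_days is a hypothetical window, not a StorageGRID default.
    return days_until_expiry(not_after, now) <= warn_days
```

The notAfter value can be read from the certificate file with any tool that prints certificate details; the sketch only covers the date arithmetic.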

For more information, see Administering StorageGRID.

SCHR Status CMN

If the value of Status for the historical grid task is Aborted, investigate the reason and run the task again if required.

If the problem persists, contact technical support.

SCSA Storage Controller A SSM

An alarm is triggered if there is an issue with storage controller A in a StorageGRID appliance.

If necessary, replace the component.

SCSB Storage Controller B SSM

An alarm is triggered if there is an issue with storage controller B in a StorageGRID appliance.

If necessary, replace the component.

Some appliance models do not have a storage controller B.

SHLH Health LDR

If the value of Health for an object store is Error, check and correct:

  • problems with the volume being mounted
  • file system errors

SLSA CPU Load Average SSM

The higher the value, the busier the system.

If the CPU Load Average persists at a high value, investigate the number of transactions in the system to determine whether the value is due to heavy load at the time. To view a chart of the CPU load average, select Support > Tools > Grid Topology. Then select site > grid node > SSM > Resources > Reports > Charts.

If the load on the system is not heavy and the problem persists, contact technical support.
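
When judging whether a load average is "high", a common Linux rule of thumb is to compare it with the number of CPUs on the node. The Python sketch below shows that normalization. It is a general heuristic, not the method the SSM service uses to compute the CPU Load Average attribute, and the helper name is hypothetical.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Normalize a load average by the number of CPUs.

    Values persistently above 1.0 suggest more runnable work
    than the CPUs can service.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"15-minute load per CPU: {load_per_cpu(fifteen_min, cpus):.2f}")
```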

SMST Log Monitor State SSM

If the value of Log Monitor State is not Connected for an extended period of time, contact technical support.

SMTT Total Events SSM

If the value of Total Events is greater than zero, check if there are known events (such as network failures) that can be the cause. Unless these errors have been cleared (that is, the count has been reset to 0), Total Events alarms can be triggered.

When an issue is resolved, reset the counter to clear the alarm. Select Nodes > site > grid node > Events > Reset event counts.
Note: To reset event counts, you must have the Grid Topology Page Configuration permission.

If the value of Total Events is zero, or the number increases and the problem persists, contact technical support.

SNST Status CMN

An alarm indicates that there is a problem storing the grid task bundles. If the value of Status is Checkpoint Error or Quorum Not Reached, confirm that a majority of ADC services are connected to the StorageGRID system (50 percent plus one) and then wait a few minutes.

If the problem persists, contact technical support.
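
The majority rule mentioned for the SNST alarm can be sketched in Python as follows. The sketch reads "50 percent plus one" as a strict majority (more than half of the ADC services); that reading and the function names are assumptions for illustration.

```python
def adc_majority(total_adc_services):
    """Smallest count that is more than half of the total."""
    if total_adc_services < 1:
        raise ValueError("total_adc_services must be at least 1")
    return total_adc_services // 2 + 1


def has_majority(connected, total_adc_services):
    """True if the connected ADC services form a majority."""
    return connected >= adc_majority(total_adc_services)
```

For example, under this reading a grid with three ADC services needs two connected, and a grid with four needs three.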

SOSS Storage Operating System Status SSM

An alarm is triggered if SANtricity software indicates that there is a Needs attention issue with a component in a StorageGRID appliance.

Select Nodes. Then select appliance Storage Node > Hardware. Scroll down to view the status of each component. In SANtricity software, check other appliance components to isolate the issue.

SSMA SSM Status SSM

If the value of SSM Status is Error, select Support > Tools > Grid Topology, then select site > grid node > SSM > Overview > Main and SSM > Overview > Alarms to determine the cause of the alarm.

If the problem persists, contact technical support.

SSME SSM State SSM

If the value of SSM State is Standby, continue monitoring, and if the problem persists, contact technical support.

If the value of SSM State is Offline, restart the service. If the problem persists, contact technical support.

SSTS Storage Status BLDR

If the value of Storage Status is Insufficient Usable Space, there is no more available storage on the Storage Node and data ingests are redirected to other available Storage Nodes. Retrieval requests can continue to be delivered from this grid node.

Add storage. This condition does not affect end-user functionality, but the alarm persists until additional storage is added.

If the value of Storage Status is Volume(s) Unavailable, a part of the storage is unavailable. Storage to and retrieval from these volumes are not possible. Check the volume's Health for more information: Select Support > Tools > Grid Topology. Then select site > grid node > LDR > Storage > Overview > Main. The volume's Health is listed under Object Stores.

If the value of Storage Status is Error, contact technical support.

See Troubleshooting the Storage Status (SSTS) alarm.

SVST Status SSM

This alarm clears when other alarms related to a non-running service are resolved. Track the source service alarms to restore operation.

Select Support > Tools > Grid Topology. Then select site > grid node > SSM > Services > Overview > Main. When the status of a service is shown as Not Running, its state is Administratively Down. The service’s status can be listed as Not Running for the following reasons:
  • The service has been manually stopped (/etc/init.d/<service> stop).
  • There is an issue with the MySQL database and Server Manager shuts down the MI service.
  • A grid node has been added, but not started.
  • During installation, a grid node has not yet connected to the Admin Node.

If a service is listed as Not Running, restart the service (/etc/init.d/<service> restart).

This alarm might also indicate that the metadata store (Cassandra database) for a Storage Node requires rebuilding.

If the problem persists, contact technical support.

See Troubleshooting the Services: Status - Cassandra (SVST) alarm.

TMEM Installed Memory SSM

Nodes running with less than 24 GiB of installed memory can experience performance problems and system instability. Increase the amount of memory installed on the system to at least 24 GiB.
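
To compare a Linux host against the 24 GiB figure, one option is to read MemTotal from /proc/meminfo. The Python sketch below parses that value. The function names are hypothetical, and MemTotal reports memory usable by the kernel, which is typically slightly less than the physically installed amount, so treat the result as approximate.

```python
def mem_total_gib(meminfo_text):
    """Extract MemTotal from /proc/meminfo text and convert KiB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # value is reported in kB (KiB)
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")


def below_recommended(meminfo_text, minimum_gib=24):
    """True if total memory is under the recommended minimum."""
    return mem_total_gib(meminfo_text) < minimum_gib
```

On a node, the input could come from `open("/proc/meminfo").read()`.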

TPOP Pending Operations ADC

A queue of messages can indicate that the ADC service is overloaded. Too few ADC services might be connected to the StorageGRID system. In a large deployment, the ADC service might require additional computational resources, or the system might require additional ADC services.

UMEM Available Memory SSM

If the available RAM gets low, determine whether this is a hardware or software issue. If it is not a hardware issue, or if available memory falls below 50 MB (the default alarm threshold), contact technical support.

VMFI Entries Available SSM

This alarm indicates that additional storage is required. Contact technical support.

VMFR Space Available SSM

If the value of Space Available gets too low (see alarm thresholds), investigate whether log files are growing out of proportion or whether objects are taking up too much disk space. Reduce or delete these files or objects as needed.

If the problem persists, contact technical support.
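
For a quick look at the quantities behind the VMFI and VMFR alarms on a mounted volume, the Python sketch below reads file system statistics with os.statvfs. It assumes that Entries Available corresponds to free inodes and Space Available to free bytes for unprivileged users; the function name and dictionary keys are hypothetical, and the values that StorageGRID reports might be computed differently.

```python
import os


def volume_headroom(path):
    """Return free space and free inodes for the file system at `path`."""
    st = os.statvfs(path)  # Unix-like systems only
    return {
        # Blocks available to unprivileged users, times the fragment size.
        "space_available_bytes": st.f_bavail * st.f_frsize,
        # Inodes available to unprivileged users.
        "entries_available": st.f_favail,
    }
```

For example, `volume_headroom("/")` reports the headroom of the root file system.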

VMST Status SSM

An alarm is triggered if the value of Status for the mounted volume is Unknown. A value of Unknown or Offline can indicate that the volume cannot be mounted or accessed due to a problem with the underlying storage device.

VPRI Verification Priority BLDR, BARC

By default, the value of Verification Priority is Adaptive. If Verification Priority is set to High, an alarm is triggered because storage verification can slow normal operations of the service.

VSTU Object Verification Status BLDR

Select Support > Tools > Grid Topology. Then select site > grid node > LDR > Storage > Overview > Main.

Check the operating system for any signs of block-device or file system errors.

If the value of Object Verification Status is Unknown Error, it usually indicates a low-level file system or hardware problem (I/O error) that prevents the Storage Verification task from accessing stored content. Contact technical support.

XAMS Unreachable Audit Repositories BADC, BARC, BCLB, BCMN, BLDR, BNMS

Check network connectivity to the server hosting the Admin Node.

If the problem persists, contact technical support.