Alarms reference

The following table lists all pre-configured StorageGRID Webscale system alarms. Responses are assigned according to the severity of the alarm. This can vary if you customize the alarm settings to fit your system management approach.

Code Name Service Recommended action
ABRL Available Attribute Relays BADC, BAMS, BARC, BCLB, BCMN, BCMS, BLDR, BNMS, BSSM, BDDS

Restore connectivity to a service (an ADC service) running an Attribute Relay Service as soon as possible. If there are no connected attribute relays, the grid node cannot report attribute values to the NMS service. Thus, the NMS service can no longer monitor the status of the service, or update attributes for the service.

If the problem persists, contact technical support.

ACMS Available Metadata Services BARC, BLDR, BCMN

An alarm is triggered when an LDR or ARC service loses connection to a DDS service. If this occurs, ingest or retrieve transactions cannot be processed. If the unavailability of DDS services is only a brief transient issue, transactions can be delayed.

Check and restore connections to a DDS service to clear this alarm and return the service to full functionality.

ACTS Cloud Tiering Service Status ARC

Only available for Archive Nodes with a Target Type of Cloud Tiering - Simple Storage Service (S3).

If the ACTS attribute for the Archive Node is set to Read-Only Enabled or Read-Write Disabled, you must set the attribute to Read-Write Enabled.

If a major alarm is triggered due to an authentication failure, verify the credentials associated with the destination bucket and update the values, if necessary.

If a major alarm is triggered due to any other reason, contact technical support.

ADCA ADC Status ADC

If an alarm is triggered, select Support > Grid Topology. Then select site > grid node > ADC > Overview > Main and ADC > Alarms > Main to determine the cause of the alarm.

If the problem persists, contact technical support.

ADCE ADC State ADC

If the value of ADC State is Standby, continue monitoring the service and if the problem persists, contact technical support.

If the value of ADC State is Offline, restart the service. If the problem persists, contact technical support.

AITE Retrieve State BARC

Only available for Archive Nodes with a Target Type of Tivoli Storage Manager (TSM).

If the value of Retrieve State is Waiting for Target, check the TSM middleware server and ensure that it is operating correctly. If the Archive Node has just been added to the StorageGRID Webscale system, ensure that the Archive Node's connection to the targeted external archival storage system is configured correctly.

If the value of Archive Retrieve State is Offline, attempt to update the state to Online. Select Support > Grid Topology. Then select site > grid node > ARC > Retrieve > Configuration > Main, select Archive Retrieve State > Online, and click Apply Changes.

If the problem persists, contact technical support.

AITU Retrieve Status BARC

If the value of Retrieve Status is Target Error, check the targeted external archival storage system for errors.

If the value of Archive Retrieve Status is Session Lost, check the targeted external archival storage system to ensure it is online and operating correctly. Check the network connection with the target.

If the value of Archive Retrieve Status is Unknown Error, contact technical support.

ALIS Inbound Attribute Sessions ADC

If the number of inbound attribute sessions on an attribute relay grows too large, it can be an indication that the StorageGRID Webscale system has become unbalanced. Under normal conditions, attribute sessions should be evenly distributed amongst ADC services. An imbalance can lead to performance issues.

If the problem persists, contact technical support.

ALOS Outbound Attribute Sessions ADC

The ADC service has a high number of attribute sessions, and is becoming overloaded. If this alarm is triggered, contact technical support.

ALUR Unreachable Attribute Repositories ADC

Check network connectivity with the NMS service to ensure that the service can contact the attribute repository.

If this alarm is triggered and network connectivity is good, contact technical support.

AMQS Audit Messages Queued BADC, BAMS, BARC, BCLB, BCMN, BCMS, BLDR, BNMS, BDDS

If audit messages cannot be immediately forwarded to an audit relay or repository, the messages are stored in a disk queue. During heavy loads this queue can exceed 100,000 messages. If this occurs, monitor the queue to determine if messages are being forwarded.

If the alarm is triggered, check the load on the system. If there has been a significant number of transactions, this can be normal and will resolve itself over time. In this case, the alarm can be ignored and will clear itself.

If the alarm persists, view a chart of the queue size. If the number continues increasing without occasional decreases, contact technical support.
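The trend check described above can be sketched in a few lines of Python. This is only an illustration: it assumes you have copied queue-size readings from the chart into a list by hand, and the function name `grows_without_decrease` is hypothetical, not part of any StorageGRID Webscale interface.

```python
from typing import Sequence


def grows_without_decrease(samples: Sequence[int]) -> bool:
    """Return True if the readings rise overall and never decrease.

    `samples` is a hypothetical list of AMQS queue sizes, oldest first,
    transcribed manually from the chart.
    """
    if len(samples) < 2:
        return False  # not enough data to describe a trend
    pairs = list(zip(samples, samples[1:]))
    never_drops = all(later >= earlier for earlier, later in pairs)
    return never_drops and samples[-1] > samples[0]
```

A result of True corresponds to the case where the number keeps increasing without occasional decreases; a series with at least one dip returns False.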

In rare instances, the disk queue can be large enough to cause a thread deadlock when the AMS service starts. If a thread deadlock occurs, contact technical support.

AOTE Store State BARC

Only available for Archive Nodes with a Target Type of Tivoli Storage Manager (TSM).

If the value of Store State is Waiting for Target, check the external archival storage system and ensure that it is operating correctly. If the Archive Node has just been added to the StorageGRID Webscale system, ensure that the Archive Node's connection to the targeted external archival storage system is configured correctly.

If the value of Store State is Offline, check the value of Store Status. Correct any problems before moving the Store State back to Online.

AOTU Store Status BARC

If the value of Store Status is Session Lost, check that the external archival storage system is connected and online.

If the value of Store Status is Target Error, check the external archival storage system for errors.

If the value of Store Status is Unknown Error, contact technical support.

APMS Multipath State SSM

If the multipath state alarm appears as "Simplex" (select Support > Grid Topology, then select site > grid node > SSM > Events), do the following:
  1. Plug in or replace the cable that does not display any indicator lights.
  2. Wait one to five minutes.

    Do not unplug the other cable until at least five minutes after you plug in the first one. Unplugging too early can cause the root volume to become read-only, which requires that the hardware be restarted.

  3. Return to the SSM > Resources page, and verify that the "Simplex" Multipath status changed to "Nominal" in the Storage Hardware section.

ARCE ARC State ARC

The ARC service has a state of Standby until all ARC components (Replication, Store, Retrieve, Target) have started. It then transitions to Online.

If the value of ARC State does not transition from Standby to Online, check the status of the ARC components.

If the value of ARC State is Offline, restart the service. If the problem persists, contact technical support.

AROQ Objects Queued ARC

This alarm can be triggered if the removable storage device is running slowly due to problems with the targeted external archival storage system, or if it encounters multiple read errors. Check the external archival storage system for errors, and ensure that it is operating correctly.

In some cases, this error can occur as a result of a high rate of data requests. Monitor the number of objects queued as system activity declines.

ARRF Request Failures ARC

If a retrieval from the targeted external archival storage system fails, the Archive Node retries the retrieval as the failure can be due to a transient issue. However, if the object data is corrupt or has been marked as being permanently unavailable, the retrieval does not fail. Instead, the Archive Node continuously retries the retrieval and the value for Request Failures continues to increase.

This alarm can indicate that the storage media holding the requested data is corrupt. Check the external archival storage system to further diagnose the problem.

If you determine that the object data is no longer in the archive, the object will have to be removed from the StorageGRID Webscale system. For more information, contact technical support.

Once the problem that triggered this alarm is addressed, reset the failures count. Select Support > Grid Topology. Then select site > grid node > ARC > Retrieve > Configuration > Main, select Reset Request Failure Count and click Apply Changes.

ARRS Repository Status NMS

The NMS service is unexpectedly not gathering attribute information from the StorageGRID Webscale system.

If the problem persists, contact technical support.

ARRV Verification Failures ARC

To diagnose and correct this problem, contact technical support.

Once the problem that triggered this alarm is addressed, reset the failures count. Select Support > Grid Topology. Then select site > grid node > ARC > Retrieve > Configuration > Main, select Reset Verification Failure Count and click Apply Changes.

ARVF Store Failures ARC

This alarm can occur as a result of errors with the targeted external archival storage system. Check the external archival storage system for errors, and ensure that it is operating correctly.

Once the problem that triggered this alarm is addressed, reset the failures count. Select Support > Grid Topology. Then select site > grid node > ARC > Retrieve > Configuration > Main, select Reset Store Failure Count, and click Apply Changes.

ASXP Audit Shares AMS

An alarm is triggered if the value of Audit Shares is Unknown. This alarm can indicate a problem with the installation or configuration of the Admin Node.

If the problem persists, contact technical support.

AUMA AMS Status AMS

If the value of AMS Status is DB Connectivity Error, restart the grid node.

If the problem persists, contact technical support.

AUME AMS State AMS

If the value of AMS State is Standby, continue monitoring the StorageGRID Webscale system. If the problem persists, contact technical support.

If the value of AMS State is Offline, restart the service. If the problem persists, contact technical support.

AUXS Audit Export Status AMS

If an alarm is triggered, correct the underlying problem, and then restart the AMS service.

If the problem persists, contact technical support.

BASF Available Object Identifiers CMN

When a StorageGRID Webscale system is provisioned, the CMN service is allocated a fixed number of object identifiers. This alarm is triggered when the StorageGRID Webscale system begins to exhaust its supply of object identifiers.

To allocate more identifiers, contact technical support.

BASS Identifier Block Allocation Status CMN

By default, an alarm is triggered when object identifiers cannot be allocated because ADC quorum cannot be reached.

Identifier block allocation on the CMN service requires a quorum (50% + 1) of the ADC services to be online and connected. If quorum is unavailable, the CMN service is unable to allocate new identifier blocks until ADC quorum is re-established. If ADC quorum is lost, there is generally no immediate impact on the StorageGRID Webscale system (clients can still ingest and retrieve content), as approximately one month's supply of identifiers is cached elsewhere in the grid; however, if the condition continues, the StorageGRID Webscale system will lose the ability to ingest new content.
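As a rough illustration of the quorum rule, the following Python sketch reads "50% + 1" as a simple majority of ADC services. That reading is an assumption, and the function names are hypothetical; they are not part of the StorageGRID Webscale system.

```python
def adc_quorum(total_adc_services: int) -> int:
    """Smallest count that is strictly more than half of the ADC services.

    Assumes "50% + 1" means a simple majority.
    """
    if total_adc_services < 1:
        raise ValueError("at least one ADC service is required")
    return total_adc_services // 2 + 1


def has_quorum(online_adc_services: int, total_adc_services: int) -> bool:
    """True if enough ADC services are online and connected for quorum."""
    return online_adc_services >= adc_quorum(total_adc_services)
```

Under this reading, a grid with three ADC services keeps quorum with two online, while a grid with four ADC services needs three.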

If an alarm is triggered, investigate the reason for the loss of ADC quorum (for example, it can be a network or Storage Node failure) and take corrective action.

If the problem persists, contact technical support.

BRDT Module temperature SSM

An alarm is triggered if the temperature of a StorageGRID Webscale appliance E5600SG controller exceeds a nominal threshold.

If the Storage Node is a StorageGRID Webscale appliance, StorageGRID Webscale indicates that the storage controller needs attention.

Check hardware components and environmental conditions for signs of overheating. If necessary, replace the component.

BTOF Offset BADC, BLDR, BNMS, BAMS, BCLB, BCMN, BARC, BCMS

An alarm is triggered if the service time (seconds) differs significantly from the operating system time. Under normal conditions, the service should resynchronize itself. If the service time drifts too far from the operating system time, system operations can be affected. Confirm that the StorageGRID Webscale system’s time source is correct.

If the problem persists, contact technical support.

BTSE Clock State BADC, BLDR, BNMS, BAMS, BCLB, BCMN, BARC, BCMS

An alarm is triggered if the service’s time is not synchronized with the time tracked by the operating system. Under normal conditions, the service should resynchronize itself. If the time drifts too far from operating system time, system operations can be affected. Confirm that the StorageGRID Webscale system’s time source is correct.

If the problem persists, contact technical support.

CAHP Java Heap Usage Percent DDS

An alarm is triggered if Java is unable to perform garbage collection at a rate that allows enough heap space for the system to properly function. An alarm might indicate a user workload that exceeds the resources available across the system for the DDS metadata store. Check the ILM Activity in the Dashboard, or select Support > Grid Topology, then select site > grid node > DDS > Resources > Overview > Main.

If the problem persists, contact technical support.

CAIH Number Available Ingest Destinations CLB

This alarm clears when underlying issues of available LDR services are corrected. Ensure that the HTTP component of LDR services is online and running normally.

If the problem persists, contact technical support.

CAQH Number Available Q/R Destinations CLB

This alarm clears when underlying issues of available LDR services are corrected. Ensure that the HTTP component of LDR services is online and running normally.

If the problem persists, contact technical support.

CASA Data Store Status DDS

An alarm is raised if the Cassandra metadata store becomes unavailable.

Check the status of Cassandra:
  1. At the Storage Node, log in as admin and su to root using the password listed in the Passwords.txt file.
  2. Enter: service cassandra status
  3. If Cassandra is not running, restart it: service cassandra restart

This alarm might also indicate that the metadata store (Cassandra database) for a Storage Node requires rebuilding.

For more information, see "Troubleshooting SVST (Services: Status - Cassandra) alarm."

If the problem persists, contact technical support.

CDLP Metadata Used Space (Percent) DDS

This alarm is triggered when the Metadata Effective Space (CEMS) reaches 70% full (minor alarm), 90% full (major alarm), and 100% full (critical alarm).
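The threshold mapping can be expressed as a small Python function. This is a sketch of the default thresholds stated above only; customized alarm settings would change the numbers, and the function name `cdlp_severity` is hypothetical.

```python
from typing import Optional


def cdlp_severity(percent_used: float) -> Optional[str]:
    """Map a CDLP percentage to the default alarm severity, if any.

    Thresholds follow the defaults in the text: 70% minor, 90% major,
    100% critical. "Reaches" is read as greater than or equal to.
    """
    if percent_used >= 100:
        return "critical"
    if percent_used >= 90:
        return "major"
    if percent_used >= 70:
        return "minor"
    return None  # below the lowest default threshold
```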

If this alarm reaches the 90% threshold, a warning appears on the Dashboard in the Grid Manager. You must perform an expansion procedure to add new Storage Nodes as soon as possible. See the instructions for expanding a StorageGRID Webscale grid.

If this alarm reaches the 100% threshold, you must stop ingesting objects and add Storage Nodes immediately. Cassandra requires a certain amount of space to perform essential operations such as compaction and repair. These operations will be impacted if object metadata uses more than 100% of the allowed space. Undesirable results can occur.

Note: Contact technical support if you are unable to add Storage Nodes.

Once new Storage Nodes are added, the system automatically rebalances object metadata across all Storage Nodes, and the alarm clears.

CLBA CLB Status CLB

If an alarm is triggered, select Support > Grid Topology, then select site > grid node > CLB > Overview > Main and CLB > Alarms > Main to determine the cause of the alarm and to troubleshoot the problem.

If the problem persists, contact technical support.

CLBE CLB State CLB

If the value of CLB State is Standby, continue monitoring the situation and if the problem persists, contact technical support.

If the state is Offline and there are no known server hardware issues (for example, the server is unplugged) or scheduled downtime, restart the service. If the problem persists, contact technical support.

CMNA CMN Status CMN

If the value of CMN Status is Error, select Support > Grid Topology, then select site > grid node > CMN > Overview > Main and CMN > Alarms > Main to determine the cause of the error and to troubleshoot the problem.

An alarm is triggered and the value of CMN Status is No Online CMN during a hardware refresh of the primary Admin Node when the CMNs are switched (the value of the old CMN State is Standby and the new is Online).

If the problem persists, contact technical support.

CMSS CMS State CMS

If an alarm is triggered, contact technical support.

CMST CMS Status CMS

If an alarm is triggered, contact technical support.

CPRC Remaining Capacity NMS

An alarm is triggered if the remaining capacity (number of available connections that can be opened to the NMS database) falls below the configured alarm severity.

If an alarm is triggered, contact technical support.

CPUT CPU Temperature SSM

An alarm is triggered if the temperature of a StorageGRID Webscale appliance E5600SG controller CPU exceeds a nominal threshold.

If the Storage Node is a StorageGRID Webscale appliance, the StorageGRID Webscale system indicates that the storage controller needs attention.

Check hardware components and environmental conditions for signs of overheating. If necessary, replace the component.

CQST Average Query Latency LDR, DDS

This alarm is triggered when the average time required to run a query against the metadata store through the service exceeds the value set in the Grid Manager.

To resolve this alarm, check for hardware and workload changes around the time the query latency increased. For example, hardware issues such as multiple failed disks and workload changes such as a sudden increase in ingests, can lead to an increase in query latency.

DNST DNS Status SSM

After installation completes, a DNST alarm is triggered in the SSM service. After the DNS is configured and the new server information reaches all grid nodes, the alarm is canceled.

ECCD Corrupt Fragments Detected LDR

An alarm is triggered when the background verification process detects a corrupt erasure coded fragment. If a corrupt fragment is detected, an attempt is made to rebuild the fragment.

Reset the Corrupt Fragments Detected and Copies Lost attributes to zero and monitor them to see if counts go up again. If counts do go up, there may be a problem with the Storage Node's underlying storage. A copy of erasure coded object data is not considered missing until the number of lost or corrupt fragments breaches the erasure code's fault tolerance; therefore, it is possible to have a corrupt fragment and still be able to retrieve the object.

If the problem persists, contact technical support.

ECST Verification Status LDR

This alarm indicates the current status of the background verification process for erasure coded object data on this Storage Node.

A major alarm is triggered if there is an error in the background verification process.

FOPN Open File Descriptors BADC, BAMS, BARC, BCLB, BCMN, BLDR, BNMS, BSSM, BDDS

FOPN can become large during peak activity. If it does not diminish during periods of slow activity, contact technical support.

HSTE HTTP State BLDR

It is critical that the HTTP protocol be online and running without errors.

Check the state of the LDR service and the related Storage component. Ensure all are online.

Check that the HTTP component is configured to autostart when the service is restarted.

HSTU HTTP Status

HTAS Auto-Start HTTP LDR

Specifies whether to start HTTP services automatically on start-up. This is a user-specified configuration option.

IQSZ Number of Objects  

Either objects are arriving for ingest faster than the ILM policy can evaluate them, or a large number of objects that require an ILM re-evaluation are being processed.

Plot the value of IQSZ over the course of a day or week, and check that at times of low system activity the number of objects drops, and tends towards zero.

If the problem persists, contact technical support.

IRSU Inbound Replication Status BLDR, BARC

An alarm indicates that inbound replication has been disabled. Confirm configuration settings: Select Support > Grid Topology. Then select site > grid node > LDR > Replication > Configuration > Main.

LATA Average Latency NMS

Check for connectivity issues.

Check system activity to confirm that there is an increase in system activity. An increase in system activity will result in an increase to attribute data activity. This increased activity will result in a delay to the processing of attribute data. This can be normal system activity and will subside.

Check for multiple alarms. An increase in average latency times can be indicated by an excessive number of triggered alarms.

If the problem persists, contact technical support.

LATW Worst-Case Latency NMS

Check for connectivity issues.

Check system activity to confirm that there is an increase in activity. An increase in system activity will result in an increase to attribute data activity. This increased activity will result in a delay to the processing of attribute data. This can be normal system activity and will subside.

Check for multiple alarms. An increase in average latency times can be indicated by an excessive number of triggered alarms.

If the problem persists, contact technical support.

LDRE LDR State LDR

If the value of LDR State is Standby, continue monitoring the situation and if the problem persists, contact technical support.

If the value of LDR State is Offline, restart the service. If the problem persists, contact technical support.

LOST Lost Objects DDS, LDR

Triggered when the StorageGRID Webscale system fails to retrieve a copy of the requested object from anywhere in the system. Before a LOST (Lost Objects) alarm is triggered, the system attempts to retrieve and replace a missing object from elsewhere in the system.

Lost objects represent a loss of data. The Lost Objects attribute is incremented whenever the number of locations for an object drops to zero without the DDS service purposely purging the content to satisfy the ILM policy.

Investigate LOST (Lost Objects) alarms immediately. If the problem persists, contact technical support.

MINQ E-mail Notifications Queued NMS

Check the network connections of the servers hosting the NMS service and the external mail server. Also confirm that the NMS e-mail server configuration is correct.

MINS E-mail Notifications Status BNMS

A minor alarm is triggered if the NMS service is unable to connect to the mail server. Check the network connections of the servers hosting the NMS service and the external mail server. Also confirm that the NMS e-mail server configuration is correct.

MISS NMS Interface Engine Status BNMS

An alarm is triggered if the NMS interface engine on the Admin Node that gathers and generates interface content is disconnected from the system. Check Server Manager to determine whether an individual application on the server is down.

MMQS Peak Message Queue Size BADC, BAMS, BARC, BCLB, BCMN, BLDR, BNMS, BSSM, BDDS

An alarm indicates that the grid node is overloaded and might not be able to process operations at a high enough rate to support normal system operation. Client requests can time out when nodes are in this condition.

If the problem persists, contact technical support.

NANG Network Auto Negotiate Setting SSM

Check the network adapter configuration. The setting must match the preferences of your network routers and switches.

An incorrect setting can have a severe impact on system performance.

NDUP Network Duplex Setting SSM

Check the network adapter configuration. The setting must match the preferences of your network routers and switches.

An incorrect setting can have a severe impact on system performance.

NLNK Network Link Detect SSM

Check the network cable connections on the port and at the switch.

Check the network router, switch, and adapter configurations.

Restart the server.

If the problem persists, contact technical support.

NRER Receive Errors SSM

These errors can clear without being manually reset. If errors do not clear, check the network hardware.

Check that the adapter hardware and driver are correctly installed and configured to work with your network routers and switches.

When the underlying problem is resolved, reset the counter: Select Support > Grid Topology. Then select site > grid node > SSM > Resources > Configuration > Main. Select Reset Receive Error Count and click Apply Changes.

NRLY Available Audit Relays BADC, BARC, BCLB, BCMN, BCMS, BLDR, BNMS, BDDS

If audit relays are not connected to ADC services, audit events cannot be reported. They are queued and unavailable to users until the connection is restored.

Restore connectivity to an ADC service as soon as possible.

If the problem persists, contact technical support.

NSCA NMS Status NMS

If the value of NMS Status is DB Connectivity Error, restart the service. If the problem persists, contact technical support.

NSCE NMS State NMS

If the value of NMS State is Standby, continue monitoring and if the problem persists, contact technical support.

If the value of NMS State is Offline, restart the service. If the problem persists, contact technical support.

NSPD Speed SSM

This can be caused by network connectivity or driver compatibility issues. If the problem persists, contact technical support.

NTBR Free Tablespace NMS

If an alarm is triggered, check how fast database usage has been changing. A sudden drop (as opposed to a gradual change over time) indicates an error condition. If the problem persists, contact technical support.
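One way to tell a sudden drop from a gradual change is to compare consecutive readings, as in the Python sketch below. The readings are assumed to be free-tablespace values transcribed by hand, oldest first. The 20% step limit is an arbitrary illustrative value, not a StorageGRID Webscale setting, and the function name is hypothetical.

```python
from typing import Sequence


def has_sudden_drop(free_space: Sequence[float],
                    max_step_fraction: float = 0.2) -> bool:
    """Return True if any single step loses more than the given fraction.

    `max_step_fraction` is a hypothetical tuning value: 0.2 flags a
    reading that is more than 20% lower than the one before it.
    """
    for earlier, later in zip(free_space, free_space[1:]):
        if earlier > 0 and (earlier - later) / earlier > max_step_fraction:
            return True
    return False
```

A series that declines a little at each reading returns False, while a series with one large step down returns True.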

Adjusting the alarm threshold allows you to proactively manage when additional storage needs to be allocated.

If the available space reaches a low threshold (see alarm threshold), contact technical support to change the database allocation.

NTER Transmit Errors SSM

These errors can clear without being manually reset. If they do not clear, check network hardware. Check that the adapter hardware and driver are correctly installed and configured to work with your network routers and switches.

When the underlying problem is resolved, reset the counter. Select Support > Grid Topology. Then select site > grid node > SSM > Resources > Configuration > Main, select Reset Transmit Error Count, and click Apply Changes.

NTFQ NTP Frequency Offset SSM

If the frequency offset exceeds the configured threshold, there is likely a hardware problem with the local clock. If the problem persists, contact technical support to arrange a replacement.

NTLK NTP Lock SSM

If the NTP daemon is not locked to an external time source, check network connectivity to the designated external time sources, their availability, and their stability.

NTLR Repair Completion Status DDS

If a nodetool repair task for Cassandra stalls, the normal background process of checking for and repairing potential database inconsistencies cannot complete and is retried every hour.

Check the Cassandra log at /var/local/log/cassandra/system.log for errors, and correct any issues that you discover. For example, the Storage Node could be isolated due to network issues.
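A small Python helper can narrow the log down to error entries. This sketch assumes that each log entry begins with its severity level (for example, ERROR), as in Cassandra's default logging layout; if the logging configuration on your Storage Node differs, adjust the check accordingly. The function name `error_lines` is hypothetical.

```python
from typing import Iterable, List


def error_lines(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines whose first token is ERROR.

    Assumes the severity level is the first thing on each line.
    """
    return [
        line.rstrip("\n")
        for line in log_lines
        if line.lstrip().startswith("ERROR")
    ]
```

To use it, open /var/local/log/cassandra/system.log on the Storage Node and pass the file object to `error_lines`. Continuation lines of a stack trace do not start with the level, so they are not included in the result.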

Contact technical support if you cannot identify or resolve the issue that prevents nodetool repair from completing.

NTOF NTP Time Offset SSM

If the time offset exceeds the configured threshold, there is likely a hardware problem with the oscillator of the local clock. If the problem persists, contact technical support to arrange a replacement.

NTSA NTP Sources Available SSM

If this server is configured to act as a primary NTP server for the StorageGRID Webscale system, this attribute tracks the number of external NTP time sources available. It is normal for this number to fluctuate if there are a large number of external time sources available.

If the server is configured to act as a secondary NTP time server or an NTP client, the server uses other servers as its NTP time sources. For more information about the StorageGRID Webscale system’s NTP configuration, see the Solution Design document for your deployment.

If the number of NTP time sources available falls below the configured minimum, the accuracy and consistency of local time on the server can suffer. If the number of NTP time sources falls to zero, local server time will drift out of synchronization with the time recorded by other services. In extreme cases, this can disrupt system operations. Correct the issue as quickly as possible.

NTSD Chosen Time Source Delay SSM

These values give an indication of the reliability and stability of the time source that NTP on the local server is using as its reference.

If an alarm is triggered, it can be an indication that the time source’s oscillator is defective, or that there is a problem with the WAN link to the time source.

NTSJ Chosen Time Source Jitter

NTSO Chosen Time Source Offset

NTSU NTP Status SSM

If the value of NTP Status is Not Running, contact technical support.

OCOR Corrupt Objects Detected LDR

The total number of corrupt replicated objects that the most recently run background verification process has detected on the Storage Node. Any corrupt object should be investigated. More than 10 indicates a major problem.

Note that this value is persistent: it is not updated once the corrupt objects have been restored.

If corrupt objects are detected, change the Verification Priority to High. This speeds up verification and helps determine the magnitude of the problem. Select Support > Grid Topology. Then select site > grid node > LDR > Verification > Configuration > Main, select Verification Priority > High, and click Apply Changes.

After the underlying problem is resolved, reset the counter to clear the alarm. Select Support > Grid Topology. Then select site > grid node > LDR > Verification > Configuration > Main, select Reset Corrupt Objects Count, and click Apply Changes.

OPST Overall Power Status SSM

An alarm is triggered if the power of a StorageGRID Webscale appliance enclosure deviates from the recommended operating voltage.

Check Power Supply A or B status to determine which power supply is operating abnormally.

If necessary, replace the power supply.

OQRT Objects Quarantined LDR

After the objects are automatically restored by the StorageGRID Webscale system, the quarantined objects must be manually removed from the quarantine directory. Contact technical support.

After the quarantined objects are removed, the value of OQRT is updated and the alarm clears.

ORSU Outbound Replication Status BLDR, BARC

An alarm indicates that outbound replication is not possible: storage is in a state where objects cannot be retrieved. An alarm is triggered if outbound replication is disabled manually. Select Support > Grid Topology. Then select site > grid node > LDR > Replication > Configuration.

An alarm is triggered if the LDR service is unavailable for replication. Select Support > Grid Topology. Then select site > grid node > LDR > Storage.

PMEM Service Memory Usage (Percent) BADC, BAMS, BARC, BCLB, BCMN, BCMS, BLDR, BNMS, BSSM, BDDS

Can have a value of Over Y% RAM where Y represents the percentage of memory being used by the server.

Figures under 80% are normal. Over 90% is considered a problem.
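The two figures above can be turned into a simple classification, sketched here in Python. The text does not name the band between 80% and 90%, so the "watch" label is an assumption made for illustration, as is the function name.

```python
def classify_memory_usage(percent: float) -> str:
    """Classify a PMEM percentage using the figures in the text.

    Under 80% is normal and over 90% is a problem. The "watch" label
    for the range in between is a hypothetical name, not from the source.
    """
    if percent < 80:
        return "normal"
    if percent > 90:
        return "problem"
    return "watch"
```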

If memory usage is high for a single service, monitor the situation and investigate.

If the problem persists, contact technical support.

PSAS Power Supply A Status SSM

An alarm is triggered if power supply A of a StorageGRID Webscale appliance deviates from the recommended operating voltage.

If necessary, replace power supply A.

PSBS Power Supply B Status SSM

An alarm is triggered if power supply B of a StorageGRID Webscale appliance deviates from the recommended operating voltage.

If necessary, replace power supply B.

RDTE Tivoli Storage Manager State BARC

Only available for Archive Nodes with a Target Type of Tivoli Storage Manager (TSM).

If the value of Tivoli Storage Manager State is Offline, check Tivoli Storage Manager Status and resolve any problems.

Bring the component back online. Select Support > Grid Topology. Then select site > grid node > ARC > Target > Configuration > Main, select Tivoli Storage Manager State > Online, and click Apply Changes.

RDTU Tivoli Storage Manager Status BARC

Only available for Archive Nodes with a Target Type of Tivoli Storage Manager (TSM).

If the value of Tivoli Storage Manager Status is Configuration Error and the Archive Node has just been added to the StorageGRID Webscale system, ensure that the TSM middleware server is correctly configured.

If the value of Tivoli Storage Manager Status is Connection Failure, or Connection Failure, Retrying, check the network configuration on the TSM middleware server, and the network connection between the TSM middleware server and the StorageGRID Webscale system.

If the value of Tivoli Storage Manager Status is Authentication Failure, or Authentication Failure, Reconnecting, the StorageGRID Webscale system can connect to the TSM middleware server, but cannot authenticate the connection. Check that the TSM middleware server is configured with the correct user, password, and permissions, and restart the service.

If the value of Tivoli Storage Manager Status is Session Failure, an established session has been lost unexpectedly. Check the network connection between the TSM middleware server and the StorageGRID Webscale system. Check the middleware server for errors.

If the value of Tivoli Storage Manager Status is Unknown Error, contact technical support.

RIRF Inbound Replications – Failed BLDR, BARC

Replication alarms (Inbound Replications – Failed (RIRF) and Outbound Replications – Failed (RORF)) can occur during periods of high load or temporary network disruptions. After system activity reduces, these alarms should clear. If the count of failed replications continues to increase, look for network problems and verify that the source and destination LDR and ARC services are online and available.

To reset the count, go to ARC or select Support > Grid Topology, then select site > grid node > LDR > Replication > Configuration > Main. Then select Reset Inbound Replication Failure Count, and click Apply Changes.

RIRQ Inbound Replications – Queued BLDR, BARC

Alarms can occur during periods of high load or temporary network disruption. After system activity reduces, this alarm should clear. If the count for queued replications continues to increase, look for network problems and verify that the source and destination LDR and ARC services are online and available.

RORF Outbound Replications – Failed BLDR, BARC

The threshold for a notice alarm is 10 objects; more than 50 objects triggers a minor alarm.

Replication alarms (Inbound Replications – Failed (RIRF) and Outbound Replications – Failed (RORF)) can occur during periods of high load or temporary network disruptions. After system activity reduces, these alarms should clear. If the count of failed replications continues to increase, look for network problems and verify that the source and destination LDR and ARC services are online and available.

To reset the count, go to ARC or select Support > Grid Topology, then select site > grid node > LDR > Replication > Configuration > Main. Then select Reset Outbound Replication Failure Count, and click Apply Changes.
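
As a rough illustration of the RORF thresholds stated above, the following sketch maps a failed-replication count to an alarm severity. The function and constant names are hypothetical. The sketch assumes that the notice alarm starts at exactly 10 objects; this reference does not say whether the boundary is inclusive.

```python
NOTICE_THRESHOLD = 10  # notice alarm threshold (assumed inclusive)
MINOR_THRESHOLD = 50   # more than 50 objects triggers a minor alarm


def failed_replication_severity(failed_count: int) -> str:
    """Return the assumed alarm severity for a failed-replication count."""
    if failed_count < 0:
        raise ValueError("failed_count cannot be negative")
    if failed_count > MINOR_THRESHOLD:
        return "minor"
    if failed_count >= NOTICE_THRESHOLD:
        return "notice"
    return "normal"


for count in (3, 10, 50, 51):
    print(count, failed_replication_severity(count))
```

Thresholds can differ if the alarm settings have been customized, so treat the constants as defaults to adjust rather than fixed values.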

RORQ Outbound Replications – Queued BLDR, BARC

The outbound replication queue contains object data being copied to satisfy ILM rules and objects requested by clients.

An alarm can occur as a result of a system overload. Wait to see if the alarm clears when system activity declines. If the alarm recurs, add capacity by adding Storage Nodes.

SAVP Total Usable Space (Percent) LDR

If usable space reaches a low threshold, options include expanding the StorageGRID Webscale system or moving object data to archive through an Archive Node.

SCAS Status CMN

If the value of Status for the active grid task is Error, look up the grid task message. Select Support > Grid Topology. Then select site > grid node > CMN > Grid Tasks > Overview > Main. The grid task message displays information about the error (for example, “check failed on node 12130011”).

After you have investigated and corrected the problem, restart the grid task. Select Support > Grid Topology. Then select site > grid node > CMN > Grid Tasks > Configuration > Main, and select Actions > Run.

If the value of Status for a grid task being aborted is Error, retry aborting the grid task.

If the problem persists, contact technical support.

SCHR Status CMN

If the value of Status for the historical grid task is Aborted, investigate the reason and run the task again if required.

If the problem persists, contact technical support.

SHLH Health LDR

If the value of Health for an object store is Error, check and correct:

  • problems with the volume being mounted
  • file system errors

SLSA CPU Load Average SSM

The higher the value, the busier the system.

If the CPU Load Average persists at a high value, investigate the number of transactions in the system to determine whether the value is due to heavy load at the time. To view a chart of the CPU load average, select Support > Grid Topology. Then select site > grid node > SSM > Resources > Reports > Charts.

If the load on the system is not heavy and the problem persists, contact technical support.

Note: If you use Linux and run multiple containers on a single host, you might want to change the trigger values for the CPU Load Average alarm to better reflect the host utilization. See Changing trigger values for CPU Load Average.

SMST Log Monitor State SSM

If the value of Log Monitor State is not Connected for a persistent period of time, contact technical support.

SMTT Total Events SSM

If the value of Total Events is greater than zero, check if there are known events (such as network failures) that can be the cause. Unless these errors have been cleared (that is, the count has been reset to 0), Total Events alarms can be triggered.

When an issue is resolved, reset the counter to clear the alarm. Select Nodes > site > grid node > Events > Reset event counts.
Note: To reset event counts, you must be a user who belongs to a group that has the Grid Topology Page Configuration permission enabled.

If the value of Total Events is zero, or the number increases and the problem persists, contact technical support.

SNST Status CMN

An alarm indicates that there is a problem storing the grid task bundles. If the value of Status is Checkpoint Error or Quorum Not Reached, confirm that a majority of ADC services are connected to the StorageGRID Webscale system (50 percent plus one) and then wait a few minutes.

If the problem persists, contact technical support.
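
The majority check described above can be sketched as follows. This is a minimal sketch that interprets "50 percent plus one" as a simple majority (more than half) of the ADC services in the grid. The function name is hypothetical and is not part of any StorageGRID Webscale interface.

```python
def has_adc_majority(connected: int, total: int) -> bool:
    """Return True if more than half of the ADC services are connected."""
    if total <= 0 or not 0 <= connected <= total:
        raise ValueError("expected 0 <= connected <= total and total > 0")
    # Integer form of "50 percent plus one": floor(total / 2) + 1.
    return connected >= total // 2 + 1


print(has_adc_majority(2, 3))  # two of three ADC services connected
print(has_adc_majority(2, 4))  # exactly half is not a majority
```

With an even number of ADC services, exactly half being connected does not satisfy the check, which is why a tie cannot clear the alarm under this interpretation.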

SOSS Storage Operating System Status SSM

An alarm is triggered if SANtricity software indicates that there is a "Needs attention" issue with an E2700 controller StorageGRID Webscale appliance component.

Select Support > Grid Topology. Then select site > grid node > SSM > Resources > Overview page and check the power supply statuses. In SANtricity software, check other appliance components to isolate the issue.

SSMA SSM Status SSM

If the value of SSM Status is Error, select Support > Grid Topology, then select site > grid node > SSM > Overview > Main and SSM > Overview > Alarms to determine the cause of the alarm.

If the problem persists, contact technical support.

SSME SSM State SSM

If the value of SSM State is Standby, continue monitoring, and if the problem persists, contact technical support.

If the value of SSM State is Offline, restart the service. If the problem persists, contact technical support.

SSTS Storage Status BLDR

If the value of Storage Status is Insufficient Usable Space, there is no more available storage on the Storage Node and data ingests are redirected to other available Storage Nodes. Retrieval requests can continue to be delivered from this grid node.

Additional storage should be added. This condition does not affect end-user functionality, but the alarm persists until additional storage is added.

If the value of Storage Status is Volume(s) Unavailable, a part of the storage is unavailable. Storage and retrieval from these volumes is not possible. Check the volume’s Health for more information: Select Support > Grid Topology. Then select site > grid node > LDR > Storage > Overview > Main. The volume's Health is listed under Object Stores.

If the value of Storage Status is Error, contact technical support.

SVST Status SSM

This alarm clears when other alarms related to a non-running service are resolved. Track the source service alarms to restore operation.

Select Support > Grid Topology. Then select site > grid node > SSM > Services > Overview > Main. When the status of a service is shown as Not Running, its state is Administratively Down. The service’s status can be listed as Not Running for the following reasons:
  • The service has been manually stopped (/etc/init.d/<service> stop).
  • There is an issue with the MySQL database and Server Manager shuts down the MI service.
  • A grid node has been added, but not started.
  • During installation, a grid node has not yet connected to the Admin Node.

If a service is listed as Not Running, restart the service (/etc/init.d/<service> restart).

This alarm might also indicate that the metadata store (Cassandra database) for a Storage Node requires rebuilding. For more information, see Troubleshooting SVST (Services: Status - Cassandra) alarm.

If the problem persists, contact technical support.

TMEM Installed Memory SSM

Nodes running with less than 24 GiB of installed memory can experience performance problems and system instability. Increase the amount of memory installed on the system to at least 24 GiB.
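
Because the minimum is stated in GiB (binary units), a quick check needs to compare against 24 × 1024³ bytes rather than 24 × 10⁹ bytes. The following sketch shows the arithmetic; the names are hypothetical, and how you obtain the installed byte count depends on your platform.

```python
GIB = 1024 ** 3          # one gibibyte in bytes
MIN_INSTALLED_GIB = 24   # minimum installed memory from this reference


def meets_memory_minimum(installed_bytes: int) -> bool:
    """Return True if installed memory is at least 24 GiB."""
    return installed_bytes >= MIN_INSTALLED_GIB * GIB


print(meets_memory_minimum(24 * GIB))        # exactly 24 GiB
print(meets_memory_minimum(24 * 10 ** 9))    # 24 GB (decimal) falls short
```

Note that operating systems often report slightly less than the physically installed amount, so a value just under the limit might need a manual check of the hardware specification.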

TPOP Pending Operations ADC

A queue of messages can indicate that the ADC service is overloaded. It can also indicate that too few ADC services are connected to the StorageGRID Webscale system. In a large deployment, the ADC service might require additional computational resources, or the system might require additional ADC services.

UMEM Available Memory SSM

If the available RAM gets low, determine whether this is a hardware or software issue. If it is not a hardware issue, or if available memory falls below 50 MB (the default alarm threshold), contact technical support.

VMFI Entries Available SSM

This is an indication that additional storage is required. Contact technical support.

VMFR Space Available SSM

If the value of Space Available gets too low (see alarm thresholds), investigate whether log files are growing out of proportion or objects are taking up too much disk space and need to be reduced or deleted.

If the problem persists, contact technical support.

VMST Status SSM

An alarm is triggered if the value of Status for the mounted volume is Unknown. A value of Unknown or Offline can indicate that the volume cannot be mounted or accessed due to a problem with the underlying storage device.

VPRI Verification Priority BLDR, BARC

By default, the value of Verification Priority is Adaptive. If Verification Priority is set to High, an alarm is triggered because storage verification can slow normal operations of the service.

VSTU Object Verification Status BLDR, BARC

Look for other problems: Select Support > Grid Topology. Then select site > grid node > LDR > Storage > Overview > Main.

If the value of Object Verification Status is Verify Location Synchronize Failed, check that the LDR service is connected to at least one CMS service.

Also check the operating system for any signs of block-device or file system errors.

If the value of Object Verification Status is Maximum Number of Failures Reached, it usually indicates a low-level file system or hardware problem (I/O error) that prevents the Storage Verification task from accessing stored content. This alarm can also occur when there is a high number of content errors indicating that data was invalid.

If the value of Object Verification Status is Unknown Error, contact technical support.

XAMS Unreachable Audit Repositories BADC, BARC, BCLB, BCMN, BCMS, BLDR, BNMS

Check network connectivity to the server hosting the Admin Node.

If the problem persists, contact technical support.