Monitoring alerts (preview)

Available as a preview in the StorageGRID 11.3 release, the alerts system provides an easy-to-use interface for detecting, evaluating, and resolving issues that can occur during StorageGRID operation.

The alerts system offers significant benefits when compared to the alarms system:
  • The alerts system focuses on real problems in the system. Unlike some alarms in the legacy system, all of the new alerts are triggered for events that require your immediate attention, not for events that can safely be ignored.
  • Multiple alerts of the same type are grouped into one email to reduce the number of notifications, and they are shown as a group on the Alerts page. You can expand and collapse alert groups to show or hide the individual alerts. For example, if several nodes are reporting the Low installed node memory alert, only one email is sent and the alert is shown as a group on the Alerts page.
    Alerts Page
  • The Alerts page provides a more user-friendly interface for viewing current problems. You can sort the listing by individual alerts and alert groups. For example, you might want to sort all alerts by node/site to see which alerts are affecting a specific node. Or you might want to sort the alerts in a group by time triggered to find the most recent instance of a specific alert.
  • Alerts use intuitive names and descriptions to help you understand more quickly what the problem is. Alert notifications include details about the node and site affected, the alert severity, the time when the alert rule was triggered, and the current value of metrics related to the alert.
  • Both alert notifications and the alert listings on the Alerts page provide recommended actions for resolving an alert. These recommended actions often include direct links to the StorageGRID documentation center to make it easier to find and access more detailed troubleshooting procedures.
    Alerts Page Details Modal
  • If you need to temporarily suppress the notifications for an alert at one or more severity levels, you can easily silence a specific alert rule for a specified duration. You can silence an alert rule for the entire grid, a single site, or a single node. The new silences functionality is more powerful than the acknowledge functionality in the alarms system.
  • Creating custom alert rules is significantly easier and allows for greater functionality than creating custom alarms using the StorageGRID attributes system. You can create custom alert rules to target the specific conditions that are relevant to your situation and to provide your own recommended actions. To define the conditions for a custom alert, you create expressions using the Prometheus metrics available from the Metrics section of the Grid Management API.

    For example, this expression triggers an alert if the amount of installed RAM for a node is less than 24,000,000,000 bytes (24 GB):

    node_memory_MemTotal < 24000000000
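
In Prometheus, comparing a metric to a number acts as a filter: the expression returns only the series whose value satisfies the comparison, so only the nodes below the threshold would match. The following Python sketch imitates that behavior locally to show which nodes the example expression would select. It does not call StorageGRID or Prometheus; the node names, the sample values, and the function name `nodes_below_threshold` are hypothetical and used only for illustration.

```python
# Hypothetical sample data: installed RAM in bytes per node.
# These values are illustrative only and do not come from a real grid.
SAMPLE_MEM_TOTAL = {
    "node-a": 16_000_000_000,
    "node-b": 32_000_000_000,
    "node-c": 24_000_000_000,
}

# Threshold taken from the example expression: node_memory_MemTotal < 24000000000
THRESHOLD_BYTES = 24_000_000_000


def nodes_below_threshold(mem_total_by_node, threshold=THRESHOLD_BYTES):
    """Return the sorted names of nodes whose value is strictly below the threshold."""
    return sorted(
        node for node, value in mem_total_by_node.items() if value < threshold
    )


if __name__ == "__main__":
    # Only nodes strictly below the threshold are selected.
    print(nodes_below_threshold(SAMPLE_MEM_TOTAL))
```

Because the comparison is strict, a node with exactly 24,000,000,000 bytes installed (`node-c` in the sample data) does not match the expression.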