Commonly used Prometheus metrics

The Prometheus service on Admin Nodes collects time series metrics from the services on all nodes. While Prometheus collects more than a thousand metrics, a relatively small number are required to monitor the most critical StorageGRID operations.

The following table lists the most commonly used Prometheus metrics and maps each metric to its equivalent attribute in the alarm system, where one exists.

You can refer to this list to better understand the conditions in the default alert rules or to construct the conditions for custom alert rules. For a complete list of metrics, select Help > API Documentation.
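As an illustration of constructing a condition, the following Python sketch assembles a comparison expression from a metric name, an operator, and a threshold. The helper name build_condition and the threshold values are hypothetical, and the sketch assumes that a condition takes the form of a simple Prometheus comparison expression. Check the exact expression syntax against your StorageGRID release before relying on it.

```python
import re

# Comparison operators available in Prometheus expressions.
_OPERATORS = {"==", "!=", ">", "<", ">=", "<="}

# Prometheus metric names: letters, digits, underscores, colons; no leading digit.
_METRIC_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def build_condition(metric: str, operator: str, threshold: float) -> str:
    """Return a comparison expression such as 'metric > 0' (hypothetical helper)."""
    if not _METRIC_NAME.match(metric):
        raise ValueError(f"invalid metric name: {metric!r}")
    if "_private_" in metric:
        # Private metrics are for internal use and can change between releases.
        raise ValueError(f"private metric not allowed: {metric!r}")
    if operator not in _OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    return f"{metric} {operator} {threshold:g}"


# The thresholds below are placeholders, not recommended values.
print(build_condition("storagegrid_content_objects_lost", ">", 0))
print(build_condition(
    "storagegrid_servercertificate_management_interface_cert_expiry_days", "<", 30
))
```

Rejecting metric names that contain _private_ is a design choice in this sketch. It keeps custom conditions from depending on metrics that can change between releases.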

Note: Metrics that include _private_ in their names are intended for internal use only and are subject to change between StorageGRID releases without notice.

Note: Prometheus metrics are retained for 31 days.

Prometheus metric Attribute Description
alertmanager_notifications_failed_total   The total number of failed alert notifications.
node_filesystem_avail_bytes   The amount of filesystem space available to non-root users in bytes.
node_memory_MemAvailable_bytes   Memory information field MemAvailable_bytes.
node_network_carrier   Carrier value of /sys/class/net/<iface>.
node_network_receive_errs_total   Network device statistic receive_errs.
node_network_transmit_errs_total   Network device statistic transmit_errs.
storagegrid_administratively_down   The node is not connected to the grid for an expected reason. For example, the node or its services have been gracefully shut down, the node is rebooting, or the software is being upgraded.
storagegrid_appliance_compute_controller_hardware_status   The status of the compute controller hardware in an appliance.
storagegrid_appliance_failed_disks BADD For the storage controller in an appliance, the number of drives that are not optimal.
storagegrid_appliance_storage_controller_hardware_status   The overall status of the storage controller hardware in an appliance.
storagegrid_content_buckets_and_containers SBKC The total number of S3 buckets and Swift containers known by this Storage Node.
storagegrid_content_objects SDOC The total number of S3 and Swift data objects known by this Storage Node. Count is valid only for data objects created by client applications that interface with the system through S3 or Swift.
storagegrid_content_objects_lost LOST The total number of objects this service detects as missing from the StorageGRID system. Take action to determine the cause of the loss and whether recovery is possible. See "Troubleshooting lost and missing object data."
storagegrid_http_sessions_incoming_attempted HAIS The total number of HTTP sessions that have been attempted to a Storage Node.
storagegrid_http_sessions_incoming_currently_established HCCS The number of HTTP sessions that are currently active (open) on the Storage Node.
storagegrid_http_sessions_incoming_failed HEIS The total number of HTTP sessions that failed to complete successfully, either due to a malformed HTTP request or a failure while processing an operation.
storagegrid_http_sessions_incoming_successful HISC The total number of HTTP sessions that have completed successfully.
storagegrid_ilm_awaiting_background_objects BQUZ The total number of objects on this node awaiting ILM evaluation from the scan.
storagegrid_ilm_awaiting_client_evaluation_objects_per_second EVRT The current rate at which objects are evaluated against the ILM policy on this node.
storagegrid_ilm_awaiting_client_objects CQUZ The total number of objects on this node awaiting ILM evaluation from client operations (for example, ingest).
storagegrid_ilm_awaiting_total_objects QUSZ The total number of objects awaiting ILM evaluation.
storagegrid_ilm_scan_objects_per_second SCRT The rate at which objects owned by this node are scanned and queued for ILM.
storagegrid_ilm_scan_period_estimated_minutes SCTM The estimated time to complete a full ILM scan on this node.
Note: A full scan does not guarantee that ILM has been applied to all objects owned by this node.
storagegrid_load_balancer_endpoint_cert_expiry_time   The expiration time of the load balancer endpoint certificate in seconds since the epoch.
storagegrid_metadata_queries_average_latency_milliseconds CQST The average time required to run a query against the metadata store through this service.
storagegrid_network_received_bytes TRXB The total amount of data received since installation.
storagegrid_network_transmitted_bytes TTXB The total amount of data sent since installation.
storagegrid_ntp_chosen_time_source_offset_milliseconds NTSO Systematic offset of time provided by a chosen time source. Offset is introduced when the delay to reach a time source is not equal to the time required for the time source to reach the NTP client.
storagegrid_ntp_locked   Indicates whether the node is locked to a Network Time Protocol (NTP) server.
storagegrid_s3_data_transfers_bytes_ingested SRXB The total amount of data ingested from S3 clients to this Storage Node since the attribute was last reset.
storagegrid_s3_data_transfers_bytes_retrieved STXB The total amount of data retrieved by S3 clients from this Storage Node since the attribute was last reset.
storagegrid_s3_operations_failed SFAL The total number of failed S3 operations (HTTP status codes 4xx and 5xx), excluding those caused by S3 authorization failure.
storagegrid_s3_operations_successful SSUC The total number of successful S3 operations (HTTP status code 2xx).
storagegrid_s3_operations_unauthorized SUAU The total number of failed S3 operations that are the result of an authorization failure.
storagegrid_servercertificate_management_interface_cert_expiry_days   The number of days before the Management Interface certificate expires.
storagegrid_servercertificate_storage_api_endpoints_cert_expiry_days   The number of days before the Object Storage API certificate expires.
storagegrid_service_cpu_seconds SUTM The cumulative amount of time that the CPU has been used by this service since installation.
storagegrid_service_load SLOD The percentage of available CPU time currently being used by this service. Indicates how busy the service is. The amount of available CPU time depends on the number of CPUs for the server.
storagegrid_service_memory_usage_bytes SMEM The amount of memory (RAM) currently in use by this service. This value is identical to that displayed by the Linux top utility as RES.
storagegrid_service_network_received_bytes BREC The total amount of data received by this service since installation.
storagegrid_service_network_transmitted_bytes BTRA The total amount of data sent by this service.
storagegrid_service_restarts RSTS The total number of times the service has been restarted.
storagegrid_service_runtime_seconds SVRT The total amount of time that the service has been running since installation.
storagegrid_service_uptime_seconds SVUT The total amount of time the service has been running since it was last restarted.
storagegrid_storage_state_current SSCR The current state of the storage services. Attribute values are:
  • 10 = Offline
  • 15 = Maintenance
  • 20 = Read-only
  • 30 = Online
storagegrid_storage_status SSTS The current status of the storage services. Attribute values are:
  • 0 = No Errors
  • 10 = In Transition
  • 20 = Insufficient Free Space
  • 30 = Volume(s) Unavailable
  • 40 = Error
storagegrid_storage_utilization_data_bytes SPSD An estimate of the total size of replicated and erasure coded object data on the Storage Node.
storagegrid_storage_utilization_metadata_allowed_bytes CEMS The total space available on storage volume 0 for object metadata. Metadata Allowed Space (CEMS) is always less than the Metadata Reserved Space (CAWM) because a portion of the reserved metadata space is required for essential database operations, such as compaction and repair.
storagegrid_storage_utilization_metadata_bytes CADL The amount of object metadata on storage volume 0, in bytes.
storagegrid_storage_utilization_total_space_bytes STTS The total amount of storage space allocated to all object stores.
storagegrid_storage_utilization_usable_space_bytes STAS The total amount of object storage space remaining. Calculated by adding together the amount of available space for all object stores on the Storage Node.
storagegrid_swift_data_transfers_bytes_ingested WRXB The total amount of data ingested from Swift clients to this Storage Node since the attribute was last reset.
storagegrid_swift_data_transfers_bytes_retrieved WTXB The total amount of data retrieved by Swift clients from this Storage Node since the attribute was last reset.
storagegrid_swift_operations_failed WFAL The total number of failed Swift operations (HTTP status codes 4xx and 5xx), excluding those caused by Swift authorization failure.
storagegrid_swift_operations_successful WSUC The total number of successful Swift operations (HTTP status code 2xx).
storagegrid_swift_operations_unauthorized WUAU The total number of failed Swift operations that are the result of an authorization failure (HTTP status codes 401, 403, 405).
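
The numeric values listed in the table for storagegrid_storage_state_current and storagegrid_storage_status can be translated into labels once a sample has been retrieved. The following Python sketch shows one way to do that. It also derives a used-space fraction from storagegrid_storage_utilization_total_space_bytes and storagegrid_storage_utilization_usable_space_bytes. The function names and sample values are hypothetical, and the used-space formula assumes that total space minus usable space approximates the space in use.

```python
# Value-to-label mappings copied from the table above.
STORAGE_STATE = {10: "Offline", 15: "Maintenance", 20: "Read-only", 30: "Online"}
STORAGE_STATUS = {
    0: "No Errors",
    10: "In Transition",
    20: "Insufficient Free Space",
    30: "Volume(s) Unavailable",
    40: "Error",
}


def decode(mapping: dict, value: float) -> str:
    """Translate a numeric sample into its label; unknown codes are reported as such."""
    code = int(value)
    if code != value or code not in mapping:
        return f"Unknown ({value:g})"
    return mapping[code]


def used_fraction(total_bytes: float, usable_bytes: float) -> float:
    """Fraction of object storage in use (assumes used = total - usable)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (total_bytes - usable_bytes) / total_bytes


# Sample values are made up for illustration.
print(decode(STORAGE_STATE, 30.0))
print(decode(STORAGE_STATUS, 20))
print(used_fraction(1000, 250))
```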