Commonly used Prometheus metrics
Refer to this list of commonly used Prometheus metrics to better understand conditions in the default alert rules or to construct the conditions for custom alert rules.
You can also obtain a complete list of all metrics.
For details on the syntax of Prometheus queries, see Querying Prometheus.
What are Prometheus metrics?
Prometheus metrics are time series measurements. The Prometheus service on Admin Nodes collects these metrics from the services on all nodes. Metrics are stored on each Admin Node until the space reserved for Prometheus data is full. When the /var/local/mysql_ibdata/ volume reaches capacity, the oldest metrics are deleted first.
Where are Prometheus metrics used?
The metrics collected by Prometheus are used in several places in the Grid Manager:
-
Nodes page: The graphs and charts on the tabs available from the Nodes page use the Grafana visualization tool to display the time-series metrics collected by Prometheus. Grafana displays time-series data in graph and chart formats, while Prometheus serves as the backend data source.
-
Alerts: Alerts are triggered at specific severity levels when alert rule conditions that use Prometheus metrics evaluate as true.
-
Grid Management API: You can use Prometheus metrics in custom alert rules or with external automation tools to monitor your StorageGRID system. A complete list of Prometheus metrics is available from the Grid Management API. (From the top of the Grid Manager, select the help icon and select API documentation > metrics.) While more than a thousand metrics are available, only a relatively small number are required to monitor the most critical StorageGRID operations.
Metrics that include private in their names are intended for internal use only and are subject to change between StorageGRID releases without notice.
-
The SUPPORT > Tools > Diagnostics page and the SUPPORT > Tools > Metrics page: These pages, which are primarily intended for use by technical support, provide several tools and charts that use the values of Prometheus metrics.
Some features and menu items within the Metrics page are intentionally non-functional and are subject to change.
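Because metrics with private in their names are for internal use only, it can help to drop them before building custom alert rules or feeding metric names to external automation. The following is a minimal sketch, not an official tool. It assumes that a simple substring match on "private" is an adequate reading of "include private in their names", and the name storagegrid_private_example_metric is hypothetical, used only for illustration.

```python
def public_metric_names(names):
    """Return only the metric names that do not contain 'private'.

    Assumption: a plain substring match is enough to identify
    internal-use metrics.
    """
    return [name for name in names if "private" not in name]


sample = [
    "storagegrid_s3_operations_failed",
    "storagegrid_private_example_metric",  # hypothetical name, illustration only
    "node_filesystem_avail_bytes",
]

print(public_metric_names(sample))
```

The input list could come from the complete metric list in the Grid Management API; how that list is retrieved is left out of this sketch.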
List of most common metrics
The following list contains the most commonly used Prometheus metrics.
Metrics that include private in their names are for internal use only and are subject to change without notice between StorageGRID releases.
- alertmanager_notifications_failed_total
-
The total number of failed alert notifications.
- node_filesystem_avail_bytes
-
The amount of file system space available to non-root users in bytes.
- node_memory_MemAvailable_bytes
-
Memory information field MemAvailable_bytes.
- node_network_carrier
-
Carrier value of /sys/class/net/iface.
- node_network_receive_errs_total
-
Network device statistic receive_errs.
- node_network_transmit_errs_total
-
Network device statistic transmit_errs.
- storagegrid_administratively_down
-
The node is not connected to the grid for an expected reason. For example, the node, or services on the node, has been gracefully shut down, the node is rebooting, or the software is being upgraded.
- storagegrid_appliance_compute_controller_hardware_status
-
The status of the compute controller hardware in an appliance.
- storagegrid_appliance_failed_disks
-
For the storage controller in an appliance, the number of drives that aren't optimal.
- storagegrid_appliance_storage_controller_hardware_status
-
The overall status of the storage controller hardware in an appliance.
- storagegrid_content_buckets_and_containers
-
The total number of S3 buckets and Swift containers known by this Storage Node.
- storagegrid_content_objects
-
The total number of S3 and Swift data objects known by this Storage Node. Count is valid only for data objects created by client applications that interface with the system through S3 or Swift.
- storagegrid_content_objects_lost
-
The total number of objects this service detects as missing from the StorageGRID system. Take action to determine the cause of the loss and whether recovery is possible.
- storagegrid_http_sessions_incoming_attempted
-
The total number of HTTP sessions that have been attempted to a Storage Node.
- storagegrid_http_sessions_incoming_currently_established
-
The number of HTTP sessions that are currently active (open) on the Storage Node.
- storagegrid_http_sessions_incoming_failed
-
The total number of HTTP sessions that failed to complete successfully, either due to a malformed HTTP request or a failure while processing an operation.
- storagegrid_http_sessions_incoming_successful
-
The total number of HTTP sessions that have completed successfully.
- storagegrid_ilm_awaiting_background_objects
-
The total number of objects on this node awaiting ILM evaluation from the scan.
- storagegrid_ilm_awaiting_client_evaluation_objects_per_second
-
The current rate at which objects are evaluated against the ILM policy on this node.
- storagegrid_ilm_awaiting_client_objects
-
The total number of objects on this node awaiting ILM evaluation from client operations (for example, ingest).
- storagegrid_ilm_awaiting_total_objects
-
The total number of objects awaiting ILM evaluation.
- storagegrid_ilm_scan_objects_per_second
-
The rate at which objects owned by this node are scanned and queued for ILM.
- storagegrid_ilm_scan_period_estimated_minutes
-
The estimated time to complete a full ILM scan on this node.
Note: A full scan does not guarantee that ILM has been applied to all objects owned by this node.
- storagegrid_load_balancer_endpoint_cert_expiry_time
-
The expiration time of the load balancer endpoint certificate in seconds since the epoch.
- storagegrid_metadata_queries_average_latency_milliseconds
-
The average time required to run a query against the metadata store through this service.
- storagegrid_network_received_bytes
-
The total amount of data received since installation.
- storagegrid_network_transmitted_bytes
-
The total amount of data sent since installation.
- storagegrid_node_cpu_utilization_percentage
-
The percentage of available CPU time currently being used by this service. Indicates how busy the service is. The amount of available CPU time depends on the number of CPUs for the server.
- storagegrid_ntp_chosen_time_source_offset_milliseconds
-
Systematic offset of time provided by a chosen time source. Offset is introduced when the delay to reach a time source is not equal to the time required for the time source to reach the NTP client.
- storagegrid_ntp_locked
-
The node is not locked to a Network Time Protocol (NTP) server.
- storagegrid_s3_data_transfers_bytes_ingested
-
The total amount of data ingested from S3 clients to this Storage Node since the attribute was last reset.
- storagegrid_s3_data_transfers_bytes_retrieved
-
The total amount of data retrieved by S3 clients from this Storage Node since the attribute was last reset.
- storagegrid_s3_operations_failed
-
The total number of failed S3 operations (HTTP status codes 4xx and 5xx), excluding those caused by S3 authorization failure.
- storagegrid_s3_operations_successful
-
The total number of successful S3 operations (HTTP status code 2xx).
- storagegrid_s3_operations_unauthorized
-
The total number of failed S3 operations that are the result of an authorization failure.
- storagegrid_servercertificate_management_interface_cert_expiry_days
-
The number of days before the Management Interface certificate expires.
- storagegrid_servercertificate_storage_api_endpoints_cert_expiry_days
-
The number of days before the Object Storage API certificate expires.
- storagegrid_service_cpu_seconds
-
The cumulative amount of time that the CPU has been used by this service since installation.
- storagegrid_service_memory_usage_bytes
-
The amount of memory (RAM) currently in use by this service. This value is identical to that displayed by the Linux top utility as RES.
- storagegrid_service_network_received_bytes
-
The total amount of data received by this service since installation.
- storagegrid_service_network_transmitted_bytes
-
The total amount of data sent by this service.
- storagegrid_service_restarts
-
The total number of times the service has been restarted.
- storagegrid_service_runtime_seconds
-
The total amount of time that the service has been running since installation.
- storagegrid_service_uptime_seconds
-
The total amount of time the service has been running since it was last restarted.
- storagegrid_storage_state_current
-
The current state of the storage services. Attribute values are:
-
10 = Offline
-
15 = Maintenance
-
20 = Read-only
-
30 = Online
- storagegrid_storage_status
-
The current status of the storage services. Attribute values are:
-
0 = No Errors
-
10 = In Transition
-
20 = Insufficient Free Space
-
30 = Volume(s) Unavailable
-
40 = Error
- storagegrid_storage_utilization_data_bytes
-
An estimate of the total size of replicated and erasure-coded object data on the Storage Node.
- storagegrid_storage_utilization_metadata_allowed_bytes
-
The total space on volume 0 of each Storage Node that is allowed for object metadata. This value is always less than the actual space reserved for metadata on a node, because a portion of the reserved space is required for essential database operations (such as compaction and repair) and future hardware and software upgrades. The allowed space for object metadata controls overall object capacity.
- storagegrid_storage_utilization_metadata_bytes
-
The amount of object metadata on storage volume 0, in bytes.
- storagegrid_storage_utilization_total_space_bytes
-
The total amount of storage space allocated to all object stores.
- storagegrid_storage_utilization_usable_space_bytes
-
The total amount of object storage space remaining. Calculated by adding together the amount of available space for all object stores on the Storage Node.
- storagegrid_swift_data_transfers_bytes_ingested
-
The total amount of data ingested from Swift clients to this Storage Node since the attribute was last reset.
- storagegrid_swift_data_transfers_bytes_retrieved
-
The total amount of data retrieved by Swift clients from this Storage Node since the attribute was last reset.
- storagegrid_swift_operations_failed
-
The total number of failed Swift operations (HTTP status codes 4xx and 5xx), excluding those caused by Swift authorization failure.
- storagegrid_swift_operations_successful
-
The total number of successful Swift operations (HTTP status code 2xx).
- storagegrid_swift_operations_unauthorized
-
The total number of failed Swift operations that are the result of an authorization failure (HTTP status codes 401, 403, 405).
- storagegrid_tenant_usage_data_bytes
-
The logical size of all objects for the tenant.
- storagegrid_tenant_usage_object_count
-
The number of objects for the tenant.
- storagegrid_tenant_usage_quota_bytes
-
The maximum amount of logical space available for the tenant's objects. If a quota metric is not provided, an unlimited amount of space is available.
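Several metrics in the list above report coded or raw values that usually need a small conversion before they are readable: the storage state and status codes, the metadata byte counts, and the load balancer certificate expiry expressed in seconds since the epoch. The sketch below illustrates one way an external script might interpret such values. The function names (describe, percent_used, days_until) are hypothetical helpers, not part of StorageGRID, and dividing storagegrid_storage_utilization_metadata_bytes by storagegrid_storage_utilization_metadata_allowed_bytes is an assumed way to express metadata fullness, not necessarily the formula used by the default alert rules.

```python
from datetime import datetime, timezone

# Value-to-label mappings copied from the metric descriptions above.
STORAGE_STATE = {10: "Offline", 15: "Maintenance", 20: "Read-only", 30: "Online"}
STORAGE_STATUS = {
    0: "No Errors",
    10: "In Transition",
    20: "Insufficient Free Space",
    30: "Volume(s) Unavailable",
    40: "Error",
}


def describe(value, mapping):
    """Translate a numeric sample into its label, or flag it as unknown."""
    return mapping.get(int(value), f"Unknown ({value})")


def percent_used(used_bytes, allowed_bytes):
    """Assumed fullness measure: metadata bytes as a share of allowed bytes."""
    if allowed_bytes <= 0:
        raise ValueError("allowed_bytes must be positive")
    return 100.0 * used_bytes / allowed_bytes


def days_until(expiry_epoch_seconds, now=None):
    """Days from `now` until an expiry given in seconds since the epoch."""
    if now is None:
        now = datetime.now(timezone.utc)
    expiry = datetime.fromtimestamp(expiry_epoch_seconds, tz=timezone.utc)
    return (expiry - now).total_seconds() / 86400


# Example with made-up sample values (Prometheus returns samples as floats).
print(describe(30.0, STORAGE_STATE))
print(describe(20.0, STORAGE_STATUS))
print(percent_used(25, 100))
```

Passing an explicit now to days_until keeps the calculation reproducible, which is convenient when the helper is exercised in tests or compared against storagegrid_load_balancer_endpoint_cert_expiry_time samples taken at a known time.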