Monitoring the recovery point objective through ILM

You can track ILM evaluation attributes to determine the recovery point objective (RPO) of the StorageGRID system as defined by the ILM policy. The RPO defines the maximum tolerable period in which data might be lost because of a site failure, a Storage Node failure, or both.

Before you begin

You must be signed in to the Grid Manager using a supported browser.

About this task

The StorageGRID system manages objects by applying the active ILM policy. The ILM policy and associated ILM rules determine how many copies are made, how those copies are made, the appropriate placement, and the length of time each copy is retained.

Ingest or other activity can exceed the rate at which the system can process ILM. When this occurs, the system might begin to queue objects whose ILM can no longer be fulfilled in near real time. The Awaiting - Client chart can help you determine whether this situation has occurred. You can find the chart in the Grid Manager by going to Dashboard > Information Lifecycle Management (ILM) > Awaiting - Client and clicking the chart icon.

The example chart shows a situation where the number of objects awaiting ILM evaluation temporarily increased in an unsustainable manner, then eventually decreased. Such a trend indicates that ILM was temporarily not fulfilled in near real time.


Awaiting - Client vs. Time chart
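
The trend described above can also be checked against exported chart values. The following is a minimal sketch, assuming you have already collected (timestamp, queued objects) samples of the Awaiting - Client value by some means. The function name find_backlog_episodes, the sample layout, and the baseline threshold are hypothetical and are not part of StorageGRID.

```python
from typing import List, Tuple

# Hypothetical helper: not part of StorageGRID or any StorageGRID API.
def find_backlog_episodes(
    samples: List[Tuple[float, int]], baseline: int
) -> List[Tuple[float, float, int]]:
    """Return (start_time, end_time, peak) for each run of samples above baseline.

    samples  -- (timestamp_seconds, objects_awaiting) pairs in time order
    baseline -- assumed queue depth that counts as normal for this grid
    """
    episodes = []
    start = None
    peak = 0
    for timestamp, queued in samples:
        if queued > baseline:
            if start is None:
                # First sample above the baseline opens a new episode.
                start = timestamp
                peak = queued
            else:
                peak = max(peak, queued)
        elif start is not None:
            # Queue is back at or below the baseline: close the episode.
            episodes.append((start, timestamp, peak))
            start = None
    if start is not None:
        # Data ended while the queue was still elevated.
        episodes.append((start, samples[-1][0], peak))
    return episodes


if __name__ == "__main__":
    # Illustrative values only, sampled once per minute.
    example = [(0, 10), (60, 500), (120, 2000), (180, 300), (240, 20)]
    print(find_backlog_episodes(example, baseline=100))
```

Each returned episode marks a period in which ILM was probably not fulfilled in near real time. The baseline is a judgment call that depends on the normal workload of the grid.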

You can further investigate ILM queues using the Nodes tab.

Steps

  1. Select Nodes.
  2. Select deployment > ILM, where deployment is the name of your StorageGRID deployment.
  3. Hover your cursor over the ILM Queue graph to see the value of the following attributes at a given point in time:
    • Objects queued (from client operations): The total number of objects awaiting ILM evaluation because of client operations (for example, ingest).
    • Objects queued (from all operations): The total number of objects awaiting ILM evaluation.
    • Scan rate (objects/sec): The rate at which objects in the grid are scanned and queued for ILM.
    • Evaluation rate (objects/sec): The current rate at which objects are being evaluated against the ILM policy in the grid.
  4. In the ILM Queue section, look at the following attributes:
    • Scan Period - Estimated: The estimated time to complete a full ILM scan of all objects.
      Note: A full scan does not guarantee that ILM has been applied to all objects.
    • Repairs Attempted: The total number of object repair operations for replicated data that have been attempted. This count increments each time a Storage Node tries to repair a high-risk object. High-risk ILM repairs are prioritized if the grid becomes busy.
      Note: The count might increment more than once for the same object if replication fails after a repair.

    These attributes can be useful when you are monitoring the progress of Storage Node volume recovery. If the number of Repairs Attempted has stopped increasing and a full scan has been completed, the repair has probably completed.
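
    The completion check described above can be sketched as a small function. This is a minimal illustration, assuming you record the Repairs Attempted value at regular intervals and decide separately whether a full scan has completed. The name repair_probably_complete and the stable_samples parameter are hypothetical and are not part of StorageGRID.

```python
from typing import Sequence

# Hypothetical helper: not part of StorageGRID or any StorageGRID API.
def repair_probably_complete(
    repairs_attempted: Sequence[int],
    full_scan_completed: bool,
    stable_samples: int = 3,
) -> bool:
    """Apply the heuristic: the counter stopped increasing and a full scan finished.

    repairs_attempted   -- Repairs Attempted readings in time order
    full_scan_completed -- True if a full ILM scan has finished since recovery began
    stable_samples      -- assumed number of trailing equal readings that
                           counts as "stopped increasing"
    """
    if not full_scan_completed:
        return False
    if len(repairs_attempted) < stable_samples:
        # Not enough readings to judge whether the counter is stable.
        return False
    recent = repairs_attempted[-stable_samples:]
    return all(value == recent[0] for value in recent)


if __name__ == "__main__":
    # Illustrative readings only.
    print(repair_probably_complete([10, 25, 40, 40, 40], full_scan_completed=True))
```

    A True result only indicates that the repair has probably completed. As noted above, a full scan does not guarantee that ILM has been applied to all objects.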