Responding to a dynamic performance event caused by a disk failure

11/30/2022 Contributors

You can use Unified Manager to investigate a performance event caused by workloads overutilizing an aggregate. You can also use Unified Manager to check the health of the aggregate to see if recent health events detected on the aggregate contributed to the performance event.

Before you begin

You must have the Operator, OnCommand Administrator, or Storage Administrator role.
There must be new, acknowledged, or obsolete performance events.

Steps

Display the Event details page to view information about the event.
Read the Description, which describes the workloads involved in the event and the cluster component in contention.

There are multiple victim volumes whose latency was impacted by the cluster component in contention. The aggregate, which is in the middle of a RAID reconstruct to replace the failed disk with a spare disk, is the cluster component in contention. Under Component in Contention, the Aggregate icon is highlighted red and the name of the aggregate is displayed in parentheses.
In the Workload Utilization chart, select Bully Workloads.
Hover your cursor over the chart to view the top bully workloads that are affecting the component.

The top workloads with the highest peak utilization since the event was detected are displayed at the top of the chart. One of the top workloads is the system-defined workload Disk Health, which indicates a RAID reconstruct. A reconstruct is the internal process involved with rebuilding the aggregate with the spare disk. The Disk Health workload, along with other workloads on the aggregate, likely caused the contention on the aggregate and the associated event.
After confirming that the activity from the Disk Health workload caused the event, wait for approximately 30 minutes for the reconstruction to finish and for Unified Manager to analyze the event and detect whether the aggregate is still in contention.
In Unified Manager, search for the event ID you recorded in Step 2.

The event for the disk failure is displayed on the Event details page. After the RAID reconstruction is complete, check that the State is obsolete, indicating that the event is resolved.
In the Workload Utilization chart, select Bully Workloads to view the workloads on the aggregate by peak utilization.
Navigate to the Performance/Volume Details page for the top workload.
Click 1d to display the last 24 hours (1 day) of data for the selected volume.

In the Latency chart, a red dot () indicates when the disk failure event occurred.
Select Break down data by.
Under Components, select Disk Utilization.
Click Submit.

The Disk Utilization chart displays a graph of all read and write requests from the selected workload to the disks of the target aggregate.
Compare the data in the Disk Utilization chart to the data at the time of the event in the Latency chart.

At the time of the event, the Disk Utilization shows a high amount of read and write activity, caused by the RAID reconstruction processes, which increased the latency of the selected volume. A few hours after the event occurred, both the reads and writes and the latency have decreased, confirming that the aggregate is no longer in contention.

Responding to a dynamic performance event caused by a disk failure

Creating your file...

Before you begin

Steps