Responding to a performance event caused by a disk failure

You can use Unified Manager to investigate a performance event caused by workloads overutilizing an aggregate. You can also use Unified Manager to check the health of the aggregate and see whether recent health events detected on it contributed to the performance event.

Before you begin

Steps

  1. Display the Dynamic Threshold Event Details page to view information about the event.
  2. Under Summary, read the Description, which describes the workloads involved in the event and the cluster component in contention. Record the event ID; you need it in Step 5.
    Multiple victim volumes had their latency affected by the cluster component in contention. That component is the aggregate, which is in the middle of a RAID reconstruction to replace the failed disk with a spare disk. Under Component in Contention, the Aggregate icon is highlighted in red and the name of the aggregate is displayed in parentheses.
  3. In the Workload Details table, click Bullies - Peak Deviation in Utilization to sort the workloads on the aggregate by peak utilization.
    The workloads with the highest peak utilization since the event was detected are displayed at the top of the table. One of them is the system-defined workload Disk Health, which indicates a RAID reconstruction. A reconstruction is the internal process that rebuilds the aggregate using the spare disk. The Disk Health workload, along with other workloads on the aggregate, likely caused the contention on the aggregate and the associated event.
  4. After confirming that the activity from the Disk Health workload caused the event, wait approximately 30 minutes for the reconstruction to finish and for Unified Manager to analyze the event and determine whether the aggregate is still in contention.
  5. In Unified Manager, search for the event ID you recorded in Step 2.
    The event for the disk failure is displayed on the Dynamic Threshold Event Details page. After the RAID reconstruction is complete, the Status under Summary is obsolete, which indicates that the event is resolved.
  6. In the Workload Details table, click Bullies - Peak Deviation in Utilization to sort the workloads on the aggregate by peak utilization.
  7. Click the name of a top volume workload.
    Details for the selected volume are displayed on the Performance/Volume Details page.
  8. Click 1d to display the last 24 hours (1 day) of data for the selected volume.
    In the Latency chart, a red dot (Performance Manager incident icon) indicates when the disk failure event occurred.
  9. Select Break down data by.
  10. Under Components, select Disk Utilization.
  11. Click Submit.
    The Disk Utilization chart displays a graph of all read and write requests from the selected workload to the disks of the target aggregate.
  12. Compare the data in the Disk Utilization chart to the data at the time of the event in the Latency chart.
    At the time of the event, the Disk Utilization chart shows a high volume of read and write activity caused by the RAID reconstruction, which increased the latency of the selected volume. A few hours after the event, both the read and write activity and the latency have decreased, confirming that the aggregate is no longer in contention.
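
The comparison in the final step can also be reasoned about numerically. The following is a minimal Python sketch, not a Unified Manager feature or API: it assumes you have copied hourly disk utilization and latency readings from the two charts by hand. The sample values, the function names, and the 0.6 ratio are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical hourly samples (illustrative values, not real Unified Manager output):
# (hours relative to the event, disk utilization %, volume latency in ms/op)
samples = [
    (-3, 35.0, 1.2),
    (-2, 38.0, 1.4),
    (-1, 41.0, 1.5),
    (0, 92.0, 9.8),   # event detected during the RAID reconstruction
    (1, 88.0, 8.9),
    (2, 47.0, 2.1),
    (3, 39.0, 1.6),
    (4, 36.0, 1.3),
]


def summarize(rows):
    """Return (mean utilization, mean latency) for a list of samples."""
    return mean(r[1] for r in rows), mean(r[2] for r in rows)


def contention_cleared(samples, event_window=(0, 1), settle_after=2, ratio=0.6):
    """Return True if both metrics dropped well below their event-window averages.

    Samples inside `event_window` (inclusive hours) are compared with samples
    taken more than `settle_after` hours after the event. The 0.6 ratio is an
    arbitrary illustrative threshold, not a Unified Manager setting.
    """
    during = [s for s in samples if event_window[0] <= s[0] <= event_window[1]]
    after = [s for s in samples if s[0] > settle_after]
    util_during, lat_during = summarize(during)
    util_after, lat_after = summarize(after)
    return util_after < ratio * util_during and lat_after < ratio * lat_during


print(contention_cleared(samples))
```

With these sample values, both disk utilization and latency fall to well under 60 percent of their levels during the event, which matches the pattern described in Step 12: high activity during the reconstruction, followed by a return to normal a few hours later.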