Responding to a dynamic performance event caused by QoS policy group throttling

You can use Unified Manager to investigate a performance event caused by a Quality of Service (QoS) policy group throttling workload throughput (MBps). The throttling increased the response times (latency) of volume workloads in the policy group. You can use the event information to determine whether new limits on the policy groups are needed to stop the throttling.

Steps

  1. Display the Dynamic Threshold Event Details page to view information about the event.
  2. Under Summary, read the Description, which displays the name of the workloads impacted by the throttling.
    Note: The description can display the same workload for the victim and bully, because the throttling makes the workload a victim of itself. For a QoS policy group, you can point the cursor to the name of the policy group to display its throughput limit and the last time it was modified. If the policy group was modified before the associated cluster was added to Unified Manager, the last modified time is the date and time when Unified Manager first discovered the cluster.
  3. Record the name of the volume, using an application such as a text editor.
    You can search on the volume name to locate it later.
  4. In the Workload Details table, click Bullies - Peak Deviation in Activity.
    The workloads in the policy group are sorted by the highest deviation of actual activity from their expected activity. The workload at the top of the list has the highest deviation and is the one that caused the throttling to occur. The activity is the percentage of the policy group limit used by each workload; a short calculation illustrating this ranking is sketched after these steps.
  5. In the Workloads column, click the name of the top workload.
    The Performance/Volume Details page is displayed, with detailed performance data for the selected workload.
  6. Select Break down data by.
  7. Select the check box next to Latency to select all latency breakdown charts.
  8. Under IOPS, select Reads/writes/other.
  9. Click Submit.
    The breakdown charts are displayed under the Latency chart and the IOPS chart.
  10. Compare the Policy Group Impact chart to the Latency chart to see what percentage of throttling impacted the latency at the time of the event.
    The policy group has a maximum throughput of 1,000 operations per second (op/sec), which the workloads in it cannot collectively exceed. At the time of the event, the workloads in the policy group had a combined throughput of over 1,200 op/sec, which caused the policy group to throttle its activity back to 1,000 op/sec. The Policy Group Impact chart shows that the throttling caused 10% of the total latency, confirming that the throttling caused the event to occur. The arithmetic behind this comparison is sketched after these steps.
  11. Review the Cluster Components chart, which shows the total latency by cluster component.
    The latency is highest at the policy group, further confirming that the throttling caused the event.
  12. Compare the Reads/writes latency chart to the Reads/writes/other chart.
    Both charts show a high number of read requests with high latency, but the number and latency of write requests are low. These values help you determine whether a high amount of throughput (MBps) or a high number of operations (IOPS) increased the latency, and you can use them when deciding whether to set the policy group limit on throughput or on operations.
  13. Use OnCommand System Manager to increase the current limit on the policy group to 1,300 op/sec.
    An example of how the new limit might be sized is sketched after these steps.
  14. After a day, return to Unified Manager and search for the name of the workload that you recorded in Step 3.
    The Performance/Volume Details page is displayed.
  15. Select Break down data by > IOPS.
  16. Click Submit.
    The Reads/writes/other chart is displayed.
  17. At the bottom of the page, point your cursor to the change event icon for the policy group limit change.
  18. Compare the Reads/writes/other chart to the Latency chart.
    The read and write requests are the same, but the throttling has stopped and the latency has decreased.
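
The ranking described in Step 4 can be illustrated with a short calculation. The following Python sketch uses hypothetical workload names and op/sec figures (none of these values come from the event itself) to show how each workload's activity, expressed as a percentage of the policy group limit, and its deviation from expected activity determine which workload appears at the top of the Bullies - Peak Deviation in Activity list.

```python
# Hypothetical ranking of workloads in a QoS policy group by deviation of
# actual activity from expected activity. All names and numbers are
# illustrative assumptions, not values reported by Unified Manager.

POLICY_GROUP_LIMIT_OPS = 1000  # policy group maximum throughput (op/sec)

# (workload name, actual op/sec, expected op/sec)
workloads = [
    ("vol_finance", 700, 300),
    ("vol_backup", 350, 320),
    ("vol_logs", 150, 140),
]

def activity_pct(ops):
    """Activity: the percentage of the policy group limit used by a workload."""
    return 100.0 * ops / POLICY_GROUP_LIMIT_OPS

# Sort by the deviation of actual activity from expected activity, highest first.
ranked = sorted(workloads,
                key=lambda w: activity_pct(w[1]) - activity_pct(w[2]),
                reverse=True)

for name, actual, expected in ranked:
    deviation = activity_pct(actual) - activity_pct(expected)
    print(f"{name}: activity {activity_pct(actual):.0f}% "
          f"(expected {activity_pct(expected):.0f}%), deviation {deviation:+.0f}%")
```

The workload printed first corresponds to the bully at the top of the table, the one whose unexpected activity caused the policy group to throttle.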
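
The arithmetic behind the comparison in Step 10 can be sketched in the same way. The 1,000 op/sec limit and the roughly 1,200 op/sec of combined throughput come from the example above; the per-workload demand figures below are assumptions that add up to that total. Note that the 10% latency impact is measured by Unified Manager and shown in the Policy Group Impact chart; it is not derived from this arithmetic.

```python
# Minimal sketch of the throttling check described in Step 10.
# The per-workload demand values are assumptions chosen to add up to the
# ~1,200 op/sec combined throughput mentioned in the example.

POLICY_GROUP_LIMIT_OPS = 1000          # op/sec limit on the policy group
workload_demand_ops = [700, 350, 150]  # hypothetical per-workload demand

combined_demand = sum(workload_demand_ops)                 # 1,200 op/sec
excess = max(0, combined_demand - POLICY_GROUP_LIMIT_OPS)  # 200 op/sec

if excess > 0:
    print(f"Combined demand of {combined_demand} op/sec exceeds the "
          f"{POLICY_GROUP_LIMIT_OPS} op/sec limit by {excess} op/sec; "
          f"the policy group throttles the workloads back to "
          f"{POLICY_GROUP_LIMIT_OPS} op/sec, which adds latency.")
else:
    print("Combined demand is within the policy group limit; no throttling.")
```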
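
Step 13 raises the limit to 1,300 op/sec. One way to reason about the new value is to size it slightly above the peak combined demand observed during the event; the sketch below illustrates that reasoning with the numbers from this example. The 5% headroom factor is an assumption used for illustration, not a Unified Manager recommendation. If you prefer the command line, the same limit can also be changed with the ONTAP qos policy-group modify command.

```python
import math

# Hypothetical sizing of a new QoS policy group limit from the peak combined
# demand observed during the event (~1,200 op/sec). The headroom factor is an
# assumption for illustration only.

observed_peak_ops = 1200   # combined demand at the time of the event
headroom = 0.05            # assumed ~5% headroom above peak demand

# Round up to the nearest 100 op/sec for a tidy limit value.
new_limit = math.ceil(observed_peak_ops * (1 + headroom) / 100) * 100
print(f"Suggested new policy group limit: {new_limit} op/sec")  # 1300 op/sec
```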