Responding to a dynamic performance event caused by QoS policy group throttling
You can use Unified Manager to investigate a performance event caused by a Quality of Service (QoS) policy group throttling workload throughput (MBps). The throttling increased the response times (latency) of volume workloads in the policy group. You can use the event information to determine whether new limits on the policy groups are needed to stop the throttling.
Before you begin
- You must have the Operator, OnCommand Administrator, or Storage Administrator role.
- There must be new, acknowledged, or obsolete performance events.
Steps
1. Display the Event details page to view information about the event.
2. Read the Description, which displays the names of the workloads impacted by the throttling.
   The description can display the same workload for the victim and the bully, because the throttling makes the workload a victim of itself.
3. Record the name of the volume, using an application such as a text editor.
   You can search for the volume name to locate it later.
4. In the Workload Latency and Workload Activity charts, select Bully Workloads.
5. Hover your cursor over the charts to view the top user-defined workloads that are affecting the policy group.
   The workload at the top of the list has the highest deviation and caused the throttling to occur. The activity is the percentage of the policy group limit used by each workload (see the activity sketch after these steps).
6. Navigate to the Performance/Volume Details page for the top workload.
7. Select Break down data by.
8. Select the check box next to Latency to select all of the latency breakdown charts.
9. Under IOPS, select Reads/writes/other.
10. Click Submit.
    The breakdown charts are displayed under the Latency chart and the IOPS chart.
11. Compare the Policy Group Impact chart to the Latency chart to see what percentage of the latency at the time of the event was caused by the throttling.
    The policy group has a maximum throughput of 1,000 operations per second (op/sec), which the workloads in it cannot collectively exceed. At the time of the event, the workloads in the policy group had a combined throughput of over 1,200 op/sec, which caused the policy group to throttle its activity back to 1,000 op/sec. The Policy Group Impact chart shows that the throttling caused 10% of the total latency, confirming that the throttling caused the event to occur (this arithmetic is reworked in a sketch after these steps).
12. Review the Cluster Components chart, which shows the total latency by cluster component.
    The latency is highest at the policy group, further confirming that the throttling caused the event.
13. Compare the Reads/writes latency chart to the Reads/writes/other chart.
    Both charts show a high number of read requests with high latency, while the number of write requests and the write latency are low. These values help you determine whether the latency was increased by a high amount of throughput or by a high number of operations, and they guide your decision to place a policy group limit on throughput (MBps) or on operations (op/sec); see the decision sketch after these steps.
14. Use OnCommand System Manager to increase the current limit on the policy group to 1,300 op/sec.
    A scripted alternative is sketched after these steps.
15. After a day, return to Unified Manager and search for the name of the workload that you recorded in Step 3.
    The Performance/Volume Details page is displayed.
16. Select Break down data by > IOPS.
17. Click Submit.
    The Reads/writes/other chart is displayed.
18. At the bottom of the page, point your cursor to the change event icon for the policy group limit change.
19. Compare the Reads/writes/other chart to the Latency chart.
    The read and write requests are the same, but the throttling has stopped and the latency has decreased.
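
The activity values described in Step 5 can be reproduced by hand. The following minimal sketch shows how each workload's activity is derived as its share of the policy group limit; the workload names and per-workload operation rates are hypothetical and are not taken from Unified Manager.

```python
# Sketch of the Workload Activity calculation from Step 5 (hypothetical numbers).
policy_group_limit = 1000  # op/sec ceiling for the whole policy group

# Hypothetical per-workload operation rates at the time of the event.
workload_ops = {
    "vol_app_data": 800,
    "vol_app_logs": 400,
}

for name, ops in workload_ops.items():
    # Activity is the share of the policy group limit that the workload consumes.
    activity_pct = 100 * ops / policy_group_limit
    print(f"{name}: {ops} op/sec = {activity_pct:.0f}% of the policy group limit")
```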
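Step 11 rests on a small amount of arithmetic: the combined demand of the workloads exceeds the policy group ceiling, and the Policy Group Impact chart attributes a share of the measured latency to that throttling. The sketch below reworks those numbers; only the 1,000 op/sec limit, the 1,200 op/sec combined demand, and the 10% impact come from the scenario, while the 5 ms total latency is an assumed figure used for illustration.

```python
# Reworking the numbers from Step 11.
policy_group_limit = 1000      # op/sec maximum for the policy group
combined_demand = 1200         # op/sec requested by all workloads in the group

# The group cannot exceed its limit, so demand above the ceiling queues up.
delivered_ops = min(combined_demand, policy_group_limit)   # 1000 op/sec
excess_demand = combined_demand - delivered_ops            # 200 op/sec held back

# The Policy Group Impact chart reports how much of the total latency is
# attributable to waiting on the QoS limit; 10% in this event.
policy_group_impact = 0.10
total_latency_ms = 5.0         # assumed measured latency, for illustration only
throttle_latency_ms = total_latency_ms * policy_group_impact

print(f"delivered {delivered_ops} op/sec, {excess_demand} op/sec throttled")
print(f"{throttle_latency_ms:.1f} ms of every {total_latency_ms:.1f} ms is QoS queuing")
```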
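Step 13 asks whether the latency is being driven by a large number of operations or by a large amount of data transferred, because that determines whether an op/sec or an MBps limit is the better control. The following sketch makes that comparison explicit; the request counts, the average request size, and the decision thresholds are all assumptions chosen for illustration.

```python
# Deciding between an op/sec limit and an MBps limit (Step 13), with
# hypothetical request counts, request size, and thresholds.
read_ops_per_sec = 1100        # many small read requests (hypothetical)
write_ops_per_sec = 100        # few write requests (hypothetical)
avg_request_size_kb = 8        # assumed average I/O size

total_ops = read_ops_per_sec + write_ops_per_sec
throughput_mbps = total_ops * avg_request_size_kb / 1024

# If the operation count is high while the data moved stays modest, an op/sec
# limit matches the workload; if large transfers dominate, cap MBps instead.
if total_ops >= 1000 and throughput_mbps < 50:   # illustrative thresholds
    decision = "limit operations (op/sec)"
else:
    decision = "limit throughput (MBps)"

print(f"{total_ops} op/sec at {throughput_mbps:.1f} MBps -> {decision}")
```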
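Step 14 raises the policy group ceiling through OnCommand System Manager. If you prefer to script the same change, the ONTAP command line exposes it through the qos policy-group modify command. The sketch below is one way to run that command over SSH with the paramiko library; the cluster address, credentials, and policy group name are placeholders, and you should verify the command syntax against your ONTAP release before relying on it.

```python
# Raising the policy group maximum to 1,300 op/sec over the ONTAP CLI.
# This is an alternative to the System Manager step; the host, credentials, and
# policy group name below are placeholders, not values from this scenario.
import paramiko

ONTAP_HOST = "cluster-mgmt.example.com"   # hypothetical cluster management LIF
POLICY_GROUP = "pg_app"                   # hypothetical policy group name
NEW_LIMIT = "1300iops"                    # raised from 1000iops after the event

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(ONTAP_HOST, username="admin", password="********")

# "qos policy-group modify" changes the ceiling that the group's workloads
# share; throttling resumes only if their combined demand exceeds the new value.
command = (
    f"qos policy-group modify -policy-group {POLICY_GROUP} "
    f"-max-throughput {NEW_LIMIT}"
)
stdin, stdout, stderr = client.exec_command(command)
print(stdout.read().decode())
print(stderr.read().decode())
client.close()
```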