Monitor volume latency in EDA workloads
As an IT administrator or DevOps engineer managing EDA workloads, you can use latency analysis to proactively monitor volume performance by tracking read and write latency metrics across your FSx for ONTAP file systems. Configure customizable thresholds for warning and critical events to identify potential performance bottlenecks before they impact simulation run-time and time-to-market. When latency events are detected, automated basic analysis helps identify the root cause.
Overview
High latency directly impacts simulation run-time and time-to-market for your EDA projects. Unhealthy volumes can cause significant performance degradation, leading to costly production delays. Latency analysis helps you proactively identify, troubleshoot, and remediate operational issues across your entire storage estate before they affect your workloads.
Latency analysis collects and monitors CloudWatch metrics for volume read and write operations. When both latency and IOPS thresholds are breached for all data points within a specified time range, the system generates alerts that appear in the latency events table.
When latency events are detected, the system automatically performs basic analysis using ONTAP QoS delay center metrics to identify the latency source.
This enables you to:
-
Identify volumes experiencing performance degradation.
-
Distinguish between warning-level and critical-level performance issues.
-
Automatically analyze the root cause of latency issues.
-
Track latency trends over time to optimize storage configurations.
-
Take proactive action before latency impacts workload performance.
Requirements
To use latency monitoring and analysis features, ensure you meet the following requirements:
- AWS credentials and permissions
-
You must add AWS credentials to Workload Factory with read/write permissions. The latency monitoring feature requires access to CloudWatch metrics for all FSx for ONTAP volumes associated with your AWS credentials.
Basic mode and read-only mode permissions are not supported for latency monitoring.
If you haven't configured AWS credentials, see Add AWS credentials.
- FSx for ONTAP file system
-
You need at least one FSx for ONTAP file system with volumes deployed in your AWS environment. The latency monitoring feature automatically collects metrics for all volumes associated with your configured AWS credentials.
- Link to FSx for ONTAP
-
To get insights from basic analysis, you must associate a link with your FSx for ONTAP file system. If no link is already associated, select Associate link in EDA, choose whether to create a new link or associate an existing link, and then select Continue to automatically go to the link creation page in Storage workloads.
For instructions on creating and associating links, see Create a link.
Understanding alerts
The latency analysis feature uses CloudWatch alarms to monitor volume performance. Understanding how alerts are triggered helps you configure appropriate thresholds and interpret the results.
Metrics collected
The system collects the following CloudWatch metrics for each volume:
-
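Read latency: Calculated as 1000 * m2/(m1+0.000001), where m1 = DataReadOperations and m2 = DataReadOperationTime. The result is the average latency per read operation in milliseconds.
-
Write latency: Calculated as 1000 * m2/(m1+0.000001), where m1 = DataWriteOperations and m2 = DataWriteOperationTime. The result is the average latency per write operation in milliseconds.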
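The formula above can be sketched as a small Python function. This is only an illustration of the arithmetic, not the product's implementation; the function name `latency_ms` and the sample values are hypothetical, and the operation time is assumed to be reported in seconds.

```python
def latency_ms(operations: float, operation_time_s: float) -> float:
    """Average latency per operation in milliseconds.

    operations       -> m1 (DataReadOperations or DataWriteOperations)
    operation_time_s -> m2 (DataReadOperationTime or DataWriteOperationTime),
                        assumed here to be a total in seconds
    """
    # The small constant keeps the division defined when a period has no operations.
    return 1000 * operation_time_s / (operations + 0.000001)


# Hypothetical data point: 60,000 reads that took 420 seconds in total.
print(round(latency_ms(60_000, 420.0), 3))  # 7.0
# An idle period yields 0.0 instead of a division error.
print(latency_ms(0, 0.0))  # 0.0
```

With the default warning threshold of 6 ms for reads, the first sample data point would count as exceeding the latency threshold.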
Alert trigger conditions
An alert is triggered when all of the following conditions are met:
-
The latency threshold is exceeded for the operation type (read or write).
-
The IOPS threshold is exceeded for the operation type.
-
Both conditions persist for all data points within the configured time range.
For example, with default warning thresholds, a read alert triggers only if read latency exceeds 6 ms AND read IOPS exceeds 100 ops/sec for all data points within a 10-minute period.
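The trigger conditions above can be expressed as a short check. The following is a minimal sketch, assuming each data point already carries a latency value and an IOPS value; the names `DataPoint` and `breaches` are hypothetical, and the treatment of an empty window is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DataPoint:
    latency_ms: float
    iops: float


def breaches(points: Sequence[DataPoint],
             latency_threshold_ms: float,
             iops_threshold: float) -> bool:
    """True only if every data point in the window exceeds both thresholds."""
    if not points:
        return False  # assumption: a window without data never alerts
    return all(
        p.latency_ms > latency_threshold_ms and p.iops > iops_threshold
        for p in points
    )


# Hypothetical data points for one time range, using the default
# warning thresholds for reads (6 ms, 100 ops/sec).
sustained = [DataPoint(7.2, 150), DataPoint(6.8, 140), DataPoint(9.1, 210)]
dipped = [DataPoint(7.2, 150), DataPoint(5.9, 140), DataPoint(9.1, 210)]
quiet = [DataPoint(7.2, 150), DataPoint(6.8, 80), DataPoint(9.1, 210)]

print(breaches(sustained, 6, 100))  # True: every point exceeds both thresholds
print(breaches(dipped, 6, 100))     # False: one point is below 6 ms
print(breaches(quiet, 6, 100))      # False: one point is below 100 ops/sec
```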
Event severity
-
Warning events: Indicate elevated latency that might need attention.
-
Critical events: Indicate severe latency that requires immediate investigation.
Configure latency thresholds
Configuring appropriate latency thresholds enables you to receive timely notifications when volumes are experiencing performance issues. By setting both warning and critical thresholds, you can distinguish between issues that need attention and those requiring immediate action, allowing you to manage your storage estate more effectively and prevent performance problems from impacting production workloads.
You can configure thresholds for both warning and critical events. Each event type includes separate thresholds for read and write operations. The system evaluates these thresholds continuously and generates alerts when conditions are met.
You must set critical event thresholds higher than warning event thresholds to ensure proper alert escalation. Otherwise, you cannot save your configuration.
For an alert to trigger, both the latency threshold and the IOPS threshold must be breached for all data points within the specified time range. This dual-condition logic helps reduce false positives by ensuring that high latency is sustained under significant load.
-
Log in using one of the console experiences.
-
Select the menu, and then select EDA.
-
From the EDA menu, select Latency.
-
On the EDA latency configuration page, configure the following thresholds:
-
Warning events
-
Read latency threshold: Enter the latency threshold in milliseconds. Default: 6 ms.
-
Read IOPS threshold: Enter the IOPS threshold in operations per second. Default: 100 ops/sec.
-
Read time range: Enter the time range in minutes (5-20). Default: 10 minutes.
-
Write latency threshold: Enter the latency threshold in milliseconds. Default: 8 ms.
-
Write IOPS threshold: Enter the IOPS threshold in operations per second. Default: 100 ops/sec.
-
Write time range: Enter the time range in minutes (5-20). Default: 10 minutes.
-
Critical events
-
Read latency threshold: Enter the latency threshold in milliseconds. Default: 12 ms.
-
Read IOPS threshold: Enter the IOPS threshold in operations per second. Default: 100 ops/sec.
-
Read time range: Enter the time range in minutes (5-20). Default: 10 minutes.
-
Write latency threshold: Enter the latency threshold in milliseconds. Default: 15 ms.
-
Write IOPS threshold: Enter the IOPS threshold in operations per second. Default: 100 ops/sec.
-
Write time range: Enter the time range in minutes (5-20). Default: 10 minutes.
-
Select Apply.
Workload Factory begins collecting latency metrics for all FSx for ONTAP volumes associated with your AWS credentials. Metrics are collected at least every 20 minutes. The latency events table displays any volumes that breach your configured thresholds.
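The default values and the two constraints stated in this section (time ranges of 5-20 minutes, critical thresholds higher than warning thresholds) can be captured in a small validation sketch. The names `Thresholds`, `DEFAULTS`, and `validate` are hypothetical, and the sketch assumes that "higher" is checked on the latency value only; the console might apply the rule differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    latency_ms: float
    iops: float
    time_range_min: int


# Defaults listed in the steps above, keyed by (severity, operation).
DEFAULTS = {
    ("warning", "read"): Thresholds(6, 100, 10),
    ("warning", "write"): Thresholds(8, 100, 10),
    ("critical", "read"): Thresholds(12, 100, 10),
    ("critical", "write"): Thresholds(15, 100, 10),
}


def validate(config: dict) -> list:
    """Return a list of human-readable problems; empty means the config is valid."""
    errors = []
    for (severity, op), t in config.items():
        if not 5 <= t.time_range_min <= 20:
            errors.append(f"{severity} {op}: time range must be 5-20 minutes")
    for op in ("read", "write"):
        warning = config[("warning", op)]
        critical = config[("critical", op)]
        # Assumption: only the latency value is compared between severities.
        if critical.latency_ms <= warning.latency_ms:
            errors.append(f"{op}: critical latency must be higher than warning latency")
    return errors


print(validate(DEFAULTS))  # []

# A hypothetical misconfiguration: critical read latency below warning,
# and a time range outside the allowed 5-20 minutes.
bad = dict(DEFAULTS)
bad[("critical", "read")] = Thresholds(5, 100, 30)
for problem in validate(bad):
    print(problem)
```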
View latency events
The latency events table gives administrators who manage multiple file systems and volumes a centralized view of all performance issues that require attention. The table displays all warning and critical events detected within the last 72 hours. Each event includes automated basic analysis results in the Details column, which helps you quickly identify the root cause of latency issues and prioritize remediation efforts across your estate.
-
Only the latest breach for each volume appears in the table. If a volume experiences multiple breaches, only the most recent event is displayed.
-
Events are automatically removed after 72 hours.
-
The table displays a maximum of 200 events. Older events are removed as new events are added.
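The three retention rules above can be sketched as a filter over a list of events. This is an illustration only; the function name `visible_events` and the dictionary keys `volume_id` and `detected_at` are hypothetical and do not reflect the product's data model.

```python
from datetime import datetime, timedelta

MAX_EVENTS = 200
RETENTION = timedelta(hours=72)


def visible_events(events, now):
    """Apply the table rules: 72-hour retention, latest event per volume, 200-event cap."""
    latest = {}
    for event in events:
        if now - event["detected_at"] > RETENTION:
            continue  # older than 72 hours
        current = latest.get(event["volume_id"])
        if current is None or event["detected_at"] > current["detected_at"]:
            latest[event["volume_id"]] = event  # keep only the most recent breach
    newest_first = sorted(latest.values(), key=lambda e: e["detected_at"], reverse=True)
    return newest_first[:MAX_EVENTS]  # oldest events fall off when the cap is reached


now = datetime(2025, 1, 10, 12, 0)
sample = [
    {"volume_id": "vol-1", "detected_at": now - timedelta(hours=5)},
    {"volume_id": "vol-1", "detected_at": now - timedelta(hours=1)},
    {"volume_id": "vol-2", "detected_at": now - timedelta(hours=80)},
    {"volume_id": "vol-3", "detected_at": now - timedelta(hours=30)},
]
for event in visible_events(sample, now):
    print(event["volume_id"], event["detected_at"])
```

In this sample, only the more recent `vol-1` event and the `vol-3` event remain; the `vol-2` event is past the 72-hour window.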
-
In the Latency tab, view the latency events table.
-
Review the information for each event including:
-
Severity: Indicates whether the event is Critical or Warning.
-
Volume name: The name of the affected volume.
-
Volume ID: The ID of the affected volume.
-
File system: The FSx for ONTAP file system containing the volume.
-
Time detected: The time when the breach was detected.
-
Median latency: The median latency value during the breach period.
-
Details: Automated basic analysis results identifying the latency source and recommended actions.
-
To sort the table, select any column header. By default, critical events appear first sorted by time, followed by warning events sorted by time.
-
To dismiss an event, select Dismiss next to it. Repeat for each event that you want to dismiss.
-
To add columns to the table, select the column icon, choose the columns, and select Apply.
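The default ordering described in the steps above (critical events first, then warning events, each group sorted by time) can be sketched with two stable sorts. The source does not state the time direction, so this sketch assumes newest first; the function name `default_order` and the dictionary keys are hypothetical.

```python
from datetime import datetime

SEVERITY_RANK = {"Critical": 0, "Warning": 1}


def default_order(events):
    """Critical before Warning; within each group, newest first (an assumption)."""
    by_time = sorted(events, key=lambda e: e["detected_at"], reverse=True)
    # sorted() is stable, so the time order is preserved inside each severity group.
    return sorted(by_time, key=lambda e: SEVERITY_RANK[e["severity"]])


events = [
    {"volume": "vol_a", "severity": "Warning", "detected_at": datetime(2025, 1, 10, 11, 0)},
    {"volume": "vol_b", "severity": "Critical", "detected_at": datetime(2025, 1, 10, 9, 0)},
    {"volume": "vol_c", "severity": "Critical", "detected_at": datetime(2025, 1, 10, 10, 0)},
    {"volume": "vol_d", "severity": "Warning", "detected_at": datetime(2025, 1, 10, 8, 0)},
]
print([e["volume"] for e in default_order(events)])
# ['vol_c', 'vol_b', 'vol_a', 'vol_d']
```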
Understanding basic analysis
Basic analysis helps you quickly identify the root cause of latency issues without manual investigation. When a latency event is detected, Workload Factory automatically performs basic analysis using ONTAP QoS delay center metrics. The analysis identifies which component is causing the latency and provides actionable guidance in the Details column of the latency events table.
There might be slight discrepancies between latency values from ONTAP QoS analysis and CloudWatch data due to different collection methodologies. The basic analysis uses ONTAP data for root cause identification.
Analysis scenarios
The basic analysis evaluates multiple latency components and provides specific guidance based on the results for each scenario:
-
FlexCache: Latency per I/O operation for FlexCache operations
-
Capacity pool: Latency per I/O operation for capacity pool operations
-
QoS min: Latency per I/O operation for QoS Policy Group Floor
-
QoS max: Latency per I/O operation for QoS Policy Group Ceiling
-
Disk: Latency per I/O operation in the Storage subsystem
-
Data: Latency per I/O operation in the WAFL file system subsystem, which includes tasks such as CPU processing, metadata updates, and cache management
-
Cluster: Latency per I/O operation across the internally connected nodes in a cluster
-
Other: Latency per I/O operation in other FSx for ONTAP subsystems
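One way to reason about these components is to look for the one that contributes the largest share of the total latency. The sketch below illustrates that idea only; the source does not describe how Workload Factory selects a latency source, and the function name `dominant_delay_center`, the dictionary keys, and the sample values (in microseconds) are all hypothetical.

```python
def dominant_delay_center(latency_by_center: dict) -> tuple:
    """Return (component, share of total latency) for the largest contributor.

    Assumption: the component with the highest per-operation latency is
    treated as the likely source. The real analysis might weigh components differently.
    """
    total = sum(latency_by_center.values())
    if total <= 0:
        return (None, 0.0)
    name = max(latency_by_center, key=latency_by_center.get)
    return (name, latency_by_center[name] / total)


# Hypothetical per-operation latency breakdown in microseconds.
sample = {
    "flexcache": 0,
    "capacity_pool": 0,
    "qos_min": 0,
    "qos_max": 4800,
    "disk": 400,
    "data": 300,
    "cluster": 500,
    "other": 0,
}
name, share = dominant_delay_center(sample)
print(f"{name}: {share:.0%} of total latency")  # qos_max: 80% of total latency
```

In this hypothetical breakdown, most of the delay comes from the QoS policy group ceiling, which would point to a throughput limit rather than to the storage subsystem.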
Manage latency configuration
After the initial configuration, you can edit your thresholds.
-
On the Latency page, select Edit.
-
Modify any of the threshold values as needed.
Ensure that critical thresholds remain higher than warning thresholds. The system displays an error if you configure critical thresholds lower than warning thresholds.
-
Select Apply to save your changes.
Best practices
Consider these recommendations when configuring and using latency analysis:
-
Set realistic thresholds: Configure thresholds based on your workload requirements. Default values provide a starting point but might need adjustment for your specific environment.
-
Start with warning thresholds: Use warning events to establish baseline performance expectations before fine-tuning critical thresholds.
-
Consider time ranges carefully: Shorter time ranges (5-10 minutes) detect issues faster but might generate more alerts. Longer time ranges (15-20 minutes) reduce false positives but might delay detection.
-
Monitor trends: Regularly review the latency events table to identify patterns or recurring issues that might indicate underlying configuration problems.
-
Coordinate IOPS and latency thresholds: The dual-condition logic means both must be exceeded. Setting very high IOPS thresholds might prevent alerts even when latency is problematic.
-
Review dismissed events: Periodically review why events were dismissed to identify opportunities for threshold adjustment or infrastructure improvements.