Learn about latency monitoring in Workload Factory for EDA

Contributors netapp-sineadd

Latency monitoring in Workload Factory for EDA helps you proactively identify and resolve performance bottlenecks in your FSx for ONTAP volumes. The system monitors read and write latency using CloudWatch metrics and provides automated analysis to help you understand the root cause of performance issues.

How latency monitoring works

Latency analysis collects CloudWatch metrics for read and write operations on all FSx for ONTAP volumes associated with your AWS credentials. The system continuously evaluates these metrics against configurable thresholds to detect performance issues early.

When a latency event is detected, Workload Factory automatically performs basic analysis using ONTAP QoS delay center metrics to identify the primary latency contributor. For more complex scenarios involving data or cluster components, you can optionally run AI-agent analysis to get detailed root cause explanations, affected client lists, and specific remediation steps.

Alert generation

An alert is generated when both the latency threshold and the IOPS threshold are breached for all data points within the configured time range. This dual-condition approach reduces false positives by ensuring elevated latency is sustained under real load.

You can configure separate thresholds for each operation type and each severity level:

  • Read operations

  • Write operations

  • Warning severity

  • Critical severity

All detected events appear in the latency events table, and if you've configured notifications, you receive email or Amazon SNS notifications with details about affected volumes.

Understanding alerts

Understanding how alerts are triggered helps you configure appropriate thresholds and interpret the results.

Metrics collected

The system derives the following latency values for each volume from CloudWatch metrics:

  • Read latency: Calculated as 1000 * m2/(m1+0.000001) where m1 = DataReadOperations and m2 = DataReadOperationTime

  • Write latency: Calculated as 1000 * m2/(m1+0.000001) where m1 = DataWriteOperations and m2 = DataWriteOperationTime
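The calculation above can be sketched in a few lines of Python. This is an illustration only, not Workload Factory's implementation. The function name and sample values are hypothetical, and the sketch assumes the operation-time metric is reported in seconds, which the factor of 1000 suggests (seconds per operation converted to milliseconds).

```python
EPSILON = 0.000001  # small constant from the formula; avoids division by zero


def average_latency_ms(operations: float, operation_time_seconds: float) -> float:
    """Return the average latency per operation in milliseconds.

    Mirrors 1000 * m2 / (m1 + 0.000001), where m1 is the operation count
    and m2 is the total operation time for the same period (assumed seconds).
    """
    return 1000 * operation_time_seconds / (operations + EPSILON)


# Hypothetical sample: 30,000 reads that took 240 seconds in total.
read_latency = average_latency_ms(operations=30_000, operation_time_seconds=240.0)

# An idle volume (no operations, no time) yields 0 rather than an error.
idle_latency = average_latency_ms(operations=0, operation_time_seconds=0.0)

print(f"read latency: {read_latency:.3f} ms")
print(f"idle latency: {idle_latency:.3f} ms")
```

The small constant in the denominator only matters when a volume has no operations in the period. For any realistic operation count, its effect on the result is negligible.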

Alert trigger conditions

An alert is triggered when all of the following conditions are met:

  • The latency threshold is exceeded for the operation type (read or write).

  • The IOPS threshold is exceeded for the operation type.

  • Both conditions persist for all data points within the configured time range.

For example, with default warning thresholds, a read alert triggers only if read latency exceeds 6 ms AND read IOPS exceeds 100 ops/sec for all data points within a 10-minute period.
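The trigger conditions above can be sketched as a simple check over the data points in the time range. This is a minimal illustration, not the product's code. The `Threshold` class, the `breaches` function, and the sample data points are hypothetical; only the 6 ms and 100 ops/sec values come from the example above.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Threshold:
    latency_ms: float
    iops: float


def breaches(points: Sequence[Tuple[float, float]], threshold: Threshold) -> bool:
    """Return True only if every (latency_ms, iops) point exceeds both limits."""
    if not points:
        return False  # assumption: no data points means no alert
    return all(
        latency > threshold.latency_ms and iops > threshold.iops
        for latency, iops in points
    )


# Default warning thresholds for reads, as given in the example above.
warning_read = Threshold(latency_ms=6.0, iops=100.0)

sustained = [(7.2, 150.0), (6.5, 120.0), (9.1, 300.0)]  # every point breaches both
spike = [(7.2, 150.0), (4.0, 120.0), (9.1, 300.0)]      # latency recovers once
low_load = [(7.2, 150.0), (6.5, 80.0), (9.1, 300.0)]    # IOPS dips below the limit

print(breaches(sustained, warning_read))  # True
print(breaches(spike, warning_read))      # False
print(breaches(low_load, warning_read))   # False
```

A single data point that falls back under either limit is enough to suppress the alert, which is how the dual-condition approach filters out short spikes and high latency on nearly idle volumes.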

Event severity

  • Warning events: Indicate elevated latency that might need attention

  • Critical events: Indicate severe latency that requires immediate investigation
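One way to picture how the two severities relate is to evaluate the critical thresholds first and fall back to the warning thresholds. The sketch below assumes that ordering. It is not taken from the product, and all names are hypothetical. Only the warning read values (6 ms, 100 ops/sec) come from this page; the other threshold values are placeholders chosen for illustration.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (latency_ms, iops)

# (operation, severity) -> (latency threshold in ms, IOPS threshold).
# Only ("read", "warning") reflects the documented default; the rest are
# hypothetical placeholders.
THRESHOLDS = {
    ("read", "warning"): (6.0, 100.0),
    ("read", "critical"): (20.0, 100.0),
    ("write", "warning"): (6.0, 100.0),
    ("write", "critical"): (20.0, 100.0),
}


def sustained_breach(points: Sequence[Point], latency_ms: float, iops: float) -> bool:
    """True if every data point exceeds both the latency and IOPS limits."""
    return bool(points) and all(l > latency_ms and i > iops for l, i in points)


def classify(operation: str, points: Sequence[Point]) -> Optional[str]:
    """Return "critical", "warning", or None for the given data points."""
    for severity in ("critical", "warning"):  # assumed: most severe match wins
        latency_ms, iops = THRESHOLDS[(operation, severity)]
        if sustained_breach(points, latency_ms, iops):
            return severity
    return None


print(classify("read", [(25.0, 500.0), (30.0, 450.0)]))  # critical
print(classify("read", [(8.0, 500.0), (30.0, 450.0)]))   # warning
print(classify("write", [(2.0, 500.0)]))                 # None
```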

Latency analysis

Workload Factory provides two levels of analysis to help you troubleshoot latency issues.

Basic analysis

When a latency event is detected, Workload Factory automatically runs basic analysis using ONTAP QoS delay center metrics to identify which component is causing the latency (for example, FlexCache, capacity pool, QoS limits, disk, data, cluster, or other subsystems). This analysis provides a quick identification of the latency source without manual investigation.

Basic analysis is available for all latency events when you have associated a link with the FSx for ONTAP file system. Without a link, events can still be detected, but the analysis provides limited insights.

Note: There might be slight discrepancies between latency values from ONTAP QoS analysis and CloudWatch data due to different collection methodologies. The basic analysis uses ONTAP data for root cause identification.

AI-agent analysis

While basic analysis identifies the latency source, complex scenarios involving data or cluster components often require deeper investigation. AI-agent analysis provides this deeper level of troubleshooting by identifying issues such as bully volumes, non-optimal configurations, or scale-out requirements that basic analysis cannot detect.

When you run AI-agent analysis, the system provides:

  • Potential root cause: Detailed explanation of what's causing the latency issue

  • Affected clients: List of EC2 instance names impacted by the latency

  • Potential remediation steps: Two or more specific actions to resolve the issue

AI-agent analysis requires an Amazon Bedrock model ARN configured in your Workload Factory settings. If Bedrock is not configured, you can still use latency monitoring and automated basic analysis.