
Monitor volume latency in EDA workloads

Contributors netapp-sineadd

As an IT administrator or DevOps engineer managing EDA workloads, you can use latency analysis to monitor FSx for ONTAP volume read and write latency. Configure warning and critical thresholds to detect performance issues early. When events occur, Workload Factory provides automated basic analysis, and you can optionally run AI-agent analysis for root cause details, impacted clients, and recommended remediation steps.

Overview

Latency analysis collects CloudWatch metrics for read and write operations on all FSx for ONTAP volumes associated with your AWS credentials. An alert is generated when both the latency threshold and the IOPS threshold are breached for all data points within the configured time range. This dual-condition logic reduces false positives by ensuring elevated latency is sustained under real load.

When an event is detected, Workload Factory runs basic analysis using ONTAP QoS delay center metrics to identify the primary latency contributor (for example, FlexCache, capacity pool, QoS limits, disk, data, cluster, or other subsystems).

For data and cluster scenarios, you can optionally invoke AI-agent analysis from the Latency analysis panel to get a detailed root cause explanation, a list of affected EC2 clients, and recommended remediation steps.

Requirements

To use latency monitoring and analysis features, ensure you meet the following requirements:

AWS credentials and permissions

You must add AWS credentials to Workload Factory with read/write permissions. The latency monitoring feature requires access to CloudWatch metrics for all FSx for ONTAP volumes associated with your AWS credentials.

Basic mode and Read-only mode permissions are not supported for latency monitoring.

If you haven't configured AWS credentials, see Add AWS credentials.

FSx for ONTAP file system

You need at least one FSx for ONTAP file system with volumes deployed in your AWS environment. The latency monitoring feature automatically collects metrics for all volumes associated with your configured AWS credentials.

Link to FSx for ONTAP

To view basic analysis insights in the latency events table and analysis panel, you must associate a link with the FSx for ONTAP file system. Without a link, events can still be detected, but the analysis provides limited insights. If no link is associated, select Associate link in EDA, choose whether to create a new link or associate an existing link, and then select Continue to go to the link creation page in Storage workloads.

For instructions on creating and associating links, see Create a link.

Amazon Bedrock model ARN (optional)

To use the optional AI-agent analysis feature, you must provide an Amazon Bedrock model ARN in your Workload Factory settings.

For more details, see Basic GenAI requirements.

If you don't configure a Bedrock model ARN, you can still use latency monitoring and automated basic analysis. AI-agent analysis will not be available.

Understanding alerts

The latency analysis feature uses CloudWatch alarms to monitor volume performance. Understanding how alerts are triggered helps you configure appropriate thresholds and interpret the results.

Metrics collected

The system collects the following CloudWatch metrics for each volume:

  • Read latency: Calculated as 1000 * m2/(m1+0.000001), where m1 = DataReadOperations and m2 = DataReadOperationTime

  • Write latency: Calculated as 1000 * m2/(m1+0.000001), where m1 = DataWriteOperations and m2 = DataWriteOperationTime
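The metric math above divides cumulative operation time by the operation count to get average per-operation latency, with a tiny constant guarding against division by zero on idle volumes. A minimal sketch of that calculation, assuming operation time is reported in seconds (the function name and sample values are illustrative):

```python
def latency_ms(operations: float, operation_time_seconds: float) -> float:
    """Average per-operation latency in milliseconds, matching the
    CloudWatch metric math 1000 * m2/(m1+0.000001).

    The 0.000001 term prevents division by zero when a volume is idle
    (zero operations in the evaluation period)."""
    return 1000 * operation_time_seconds / (operations + 0.000001)

# Example: 5,000 reads with 40 seconds of cumulative operation time
# works out to an average read latency of about 8 ms.
read_latency = latency_ms(operations=5000, operation_time_seconds=40)
```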

Alert trigger conditions

An alert is triggered when all of the following conditions are met:

  • The latency threshold is exceeded for the operation type (read or write).

  • The IOPS threshold is exceeded for the operation type.

  • Both conditions persist for all data points within the configured time range.

For example, with default warning thresholds, a read alert triggers only if read latency exceeds 6 ms AND read IOPS exceeds 100 ops/sec for all data points within a 10-minute period.
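The dual-condition logic can be sketched as follows. The function and the sample datapoint windows are hypothetical, but the rule matches the description above: every datapoint in the window must breach both thresholds before an alert fires.

```python
def alert_triggered(datapoints, latency_threshold_ms, iops_threshold):
    """True only if every (latency, IOPS) datapoint in the window
    breaches BOTH thresholds -- the dual-condition logic that filters
    out latency spikes occurring without real load."""
    return all(
        latency > latency_threshold_ms and iops > iops_threshold
        for latency, iops in datapoints
    )

# Default warning thresholds: 6 ms read latency, 100 read ops/sec.
window = [(7.2, 150), (6.8, 130), (9.1, 180)]   # (latency ms, IOPS)
assert alert_triggered(window, 6, 100)           # sustained breach: alert

# One datapoint below the latency threshold breaks the streak: no alert.
window = [(7.2, 150), (5.0, 130), (9.1, 180)]
assert not alert_triggered(window, 6, 100)
```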

Event severity

  • Warning events: Indicate elevated latency that might need attention

  • Critical events: Indicate severe latency that requires immediate investigation

Configure latency thresholds

Configure warning and critical thresholds for read and write operations. The system evaluates thresholds continuously and generates alerts when conditions are met.

Note You must set critical event thresholds higher than warning event thresholds to ensure proper alert escalation. If not, you cannot save your configuration.
Steps
  1. Log in using one of the console experiences.

  2. Select the menu icon and then select EDA.

  3. Select the Latency tab.

  4. In the EDA latency configuration page, configure the following thresholds:

    • Warning events

      • Read latency threshold: Enter the latency threshold in milliseconds. Default: 6 ms.

      • Read IOPS threshold: Enter the IOPS threshold in operations per second. Default: 100 ops/sec.

      • Read time range: Enter the time range in minutes (5-20). Default: 10 minutes.

      • Write latency threshold: Enter the latency threshold in milliseconds. Default: 8 ms.

      • Write IOPS threshold: Enter the IOPS threshold in operations per second. Default: 100 ops/sec.

      • Write time range: Enter the time range in minutes (5-20). Default: 10 minutes.

    • Critical events

      • Read latency threshold: Enter the latency threshold in milliseconds. Default: 12 ms.

      • Read IOPS threshold: Enter the IOPS threshold in operations per second. Default: 100 ops/sec.

      • Read time range: Enter the time range in minutes (5-20). Default: 10 minutes.

      • Write latency threshold: Enter the latency threshold in milliseconds. Default: 15 ms.

      • Write IOPS threshold: Enter the IOPS threshold in operations per second. Default: 100 ops/sec.

      • Write time range: Enter the time range in minutes (5-20). Default: 10 minutes.

  5. Select Apply.

Result

Workload Factory begins collecting latency metrics for all FSx for ONTAP volumes associated with your AWS credentials. Metrics are collected at least once every 20 minutes. The latency events table displays any volumes that breach your configured thresholds.
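The rule that critical thresholds must be higher than warning thresholds (noted above) can be sanity-checked before applying a configuration. This is an illustrative sketch, not the product's validation code; it assumes only the latency thresholds are subject to the rule, and the dictionary keys are hypothetical:

```python
def validate_thresholds(warning: dict, critical: dict) -> list:
    """Return error messages for any latency metric whose critical
    threshold is not strictly higher than its warning threshold.
    An empty list means the configuration can be saved."""
    errors = []
    for metric in ("read_latency_ms", "write_latency_ms"):
        if critical[metric] <= warning[metric]:
            errors.append(
                f"{metric}: critical ({critical[metric]}) must be "
                f"higher than warning ({warning[metric]})"
            )
    return errors

# Defaults from the configuration page: warning 6/8 ms, critical 12/15 ms.
warning = {"read_latency_ms": 6, "write_latency_ms": 8}
critical = {"read_latency_ms": 12, "write_latency_ms": 15}
assert validate_thresholds(warning, critical) == []
```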

View latency events

The latency events table provides a centralized view of all warning and critical events detected within the last 72 hours.

  • Only the latest breach for each volume appears in the table. If a volume experiences multiple breaches, only the most recent event is displayed.

  • Events are automatically removed after 72 hours.

  • The table displays a maximum of 200 events. Older events are removed as new events are added.

  • Events appear in the table even if no link is associated with the file system. A link is required to view basic analysis details and run AI-agent analysis.
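The retention rules above (latest breach per volume, 72-hour expiry, 200-event cap) can be modeled as a small pruning function. This is a hypothetical sketch of the behavior described, not the product's implementation:

```python
from datetime import datetime, timedelta

MAX_EVENTS = 200
RETENTION = timedelta(hours=72)

def prune_events(events, now):
    """Apply the table's retention rules to (volume_id, detected_at)
    tuples: drop events older than 72 hours, keep only the latest
    breach per volume, and cap the result at 200 events, newest first."""
    fresh = [e for e in events if now - e[1] <= RETENTION]
    latest = {}
    for volume_id, detected_at in fresh:
        if volume_id not in latest or detected_at > latest[volume_id]:
            latest[volume_id] = detected_at
    ordered = sorted(latest.items(), key=lambda e: e[1], reverse=True)
    return ordered[:MAX_EVENTS]

now = datetime(2024, 1, 2, 12, 0)
events = [
    ("vol-1", now - timedelta(hours=1)),   # kept: latest breach of vol-1
    ("vol-1", now - timedelta(hours=5)),   # superseded by the newer breach
    ("vol-2", now - timedelta(hours=80)),  # expired: older than 72 hours
]
assert prune_events(events, now) == [("vol-1", now - timedelta(hours=1))]
```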

Steps
  1. In the Latency tab, view the latency events table.

  2. Review the information for each event including:

    • Severity: Indicates whether the event is Critical or Warning

    • Volume name: The name of the affected volume

    • Volume ID: The ID of the affected volume

    • File system: The FSx for ONTAP file system containing the volume

    • Median latency (ms): The median latency value during the breach period

    • % above threshold: The percentage by which the latency exceeded the configured threshold

    • Time detected: When the breach was detected

  3. To view details for a latency event, select the event in the Severity column of the latency events table. This opens a latency analysis panel for that event.

  4. To sort the table, select any column header. By default, critical events appear first sorted by time, followed by warning events sorted by time.

  5. To dismiss one or more events, select the action menu icon next to an event, and then select Dismiss.

  6. To add columns to the table, select the column icon, choose the columns to display, and then select Apply.

Understanding basic analysis

Basic analysis helps you quickly identify the root cause of latency issues without manual investigation. When a latency event is detected, Workload Factory automatically performs basic analysis using ONTAP QoS delay center metrics. The analysis identifies which component is causing the latency and provides a short description in the Latency analysis panel.

Note There might be slight discrepancies between latency values from ONTAP QoS analysis and CloudWatch data due to different collection methodologies. The basic analysis uses ONTAP data for root cause identification.

Latency analysis panel

Selecting a latency event in the Severity column of the latency events table opens a latency analysis panel for that event. The panel identifies the primary latency contributor as one of the following delay centers:

  • FlexCache: Latency from FlexCache operations

  • Capacity pool: Latency from capacity pool operations

  • QoS min: Latency from QoS policy group floor limits

  • QoS max: Latency from QoS policy group ceiling limits

  • Disk: Latency from the storage subsystem

  • Data: Latency from the WAFL subsystem, including CPU processing, metadata updates, and cache management

  • Cluster: Latency across internally connected nodes

  • Other: Latency from other subsystems such as NVRAM and network
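At its simplest, attributing latency means finding which delay center contributes the most. This sketch is a simplification for illustration only; the actual ONTAP QoS delay-center analysis is more involved, and the keys and microsecond values below are hypothetical:

```python
def primary_contributor(delay_centers: dict) -> str:
    """Return the delay center contributing the most latency.
    A simplified stand-in for the basic analysis attribution step."""
    return max(delay_centers, key=delay_centers.get)

# Illustrative per-delay-center latency contributions (microseconds).
sample = {
    "flexcache": 0, "capacity_pool": 120, "qos_min": 0, "qos_max": 0,
    "disk": 340, "data": 2100, "cluster": 90, "other": 45,
}
assert primary_contributor(sample) == "data"
```

A "data" or "cluster" result is exactly the case where the panel offers the optional AI-agent analysis described below.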

If an Amazon Bedrock model ARN is configured, the panel also includes an option to run AI-agent analysis for data and cluster scenarios. If Bedrock is not configured, the panel displays a link to the Storage workloads configuration page for the specific file system where you can configure Bedrock access.

Run AI-agent analysis

While basic analysis identifies the latency source, complex scenarios involving data or cluster components often require deeper investigation to determine the specific root cause and potential remediation steps. AI-agent analysis provides this deeper level of troubleshooting by identifying issues such as bully volumes, non-optimal configurations, or scale-out requirements that basic analysis cannot detect.

Before you begin

You must have configured an Amazon Bedrock model ARN in Workload Factory settings.

About this task

When you run AI-agent analysis, the system automatically refreshes the basic analysis data and uses it as input for the AI-agent. The AI-agent evaluates the latency scenario and provides:

  • Potential root cause: Detailed explanation of what's causing the latency issue

  • Affected clients: List of EC2 instance names impacted by the latency

  • Potential remediation steps: Two or more specific actions to resolve the issue

The AI-agent follows the basic analysis guidelines to identify scenarios such as:

  • Bully volumes consuming excessive resources (for data delays)

  • Non-optimal mount point configurations (for cluster delays)

  • FlexGroup rebalancing needs (for cluster delays)

  • Scale-out requirements (for cluster delays)

Steps
  1. In the Latency tab, locate the event you want to analyze.

  2. In the Severity column of the latency events table, select a latency event to open an analysis panel for that event.

    If no link is associated with the file system, a prompt is displayed asking you to associate a link with the affected file system. Select the prompt to be redirected to the link setup page for that file system. A tooltip explains the redirect and notes that associating a link and configuring Bedrock access (recommended) enables full event analysis.

  3. In the analysis panel, review the basic analysis results to understand the latency source.

  4. If the latency source is identified as data or cluster, select Analyze.

  5. Review the AI-agent analysis results, which include:

    • Root cause explanation

    • List of affected EC2 clients

    • Potential remediation steps

  6. Implement the recommended remediation steps to resolve the latency issue.

  7. After remediation, monitor the latency events table to verify the issue is resolved.

Manage latency configuration

After the initial configuration, you can edit your thresholds.

Steps
  1. In the Latency page, select Edit.

  2. Modify any of the threshold values as needed.

    Note Ensure that critical thresholds remain higher than warning thresholds. The system displays an error if you configure critical thresholds lower than warning thresholds.
  3. Select Apply to save your changes.

Best practices

Consider these recommendations when configuring and using latency analysis:

  • Set realistic thresholds: Configure thresholds based on your workload requirements. Default values provide a starting point but might need adjustment for your specific environment.

  • Start with warning thresholds: Use warning events to establish baseline performance expectations before fine-tuning critical thresholds.

  • Consider time ranges carefully: Shorter time ranges (5-10 minutes) detect issues faster but might generate more alerts. Longer time ranges (15-20 minutes) reduce false positives but might delay detection.

  • Monitor trends: Regularly review the latency events table to identify patterns or recurring issues that might indicate underlying configuration problems.

  • Coordinate IOPS and latency thresholds: The dual-condition logic means both must be exceeded. Setting very high IOPS thresholds might prevent alerts even when latency is problematic.

  • Review dismissed events: Periodically review why events were dismissed to identify opportunities for threshold adjustment or infrastructure improvements.

  • Use AI-agent analysis strategically: Run AI-agent analysis for data and cluster scenarios where basic analysis recommends it. AI-agent analysis provides deeper insights for complex performance issues that require detailed troubleshooting.