Skip to main content
Cloud Insights

System Monitors

Contributors netapp-alavoie

Cloud Insights includes a number of system-defined monitors for both metrics and logs. The system monitors available are dependent on the data collectors present in your environment. Because of that, the monitors available in Cloud Insights may change as data collectors are added or their configurations changed.

Note Many System Monitors are in Paused state by default. You can enable a system monitor by selecting the Resume option for the monitor. Ensure that Advanced Counter Data Collection and Enable ONTAP EMS log collection are enabled in the Data Collector. These options can be found in the ONTAP Data Collector under Advanced Configuration:
Enabling Advanced Counter and EMS Log collection for ONTAP

Monitor Descriptions

System-defined monitors are comprised of pre-defined metrics and conditions, as well as default descriptions and corrective actions, which can not be modified. You can modify the notification recipient list for system-defined monitors. To view the metrics, conditions, description and corrective actions, or to modify the recipient list, open a system-defined monitor group and click the monitor name in the list.

System-defined monitor groups cannot be modified or removed.

The following system-defined monitors are available, in the noted groups.

  • ONTAP Infrastructure includes monitors for infrastructure-related issues in ONTAP clusters.

  • ONTAP Workload Examples includes monitors for workload-related issues.

  • Monitors in both group default to Paused state.

Below are the system monitors currently included with Cloud Insights:

Metric Monitors

Monitor Name

Severity

Monitor Description

Corrective Action

Fiber Channel Port Utilization High

CRITICAL

Fiber Channel Protocol ports are used to receive and transfer the SAN traffic between the customer host system and the ONTAP LUNs. If the port utilization is high, then it will become a bottleneck and it will ultimately affect the performance of sensitive of Fiber Channel Protocol workloads.…A warning alert indicates that planned action should be taken to balance network traffic.…A critical alert indicates that service disruption is imminent and emergency measures should be taken to balance network traffic to ensure service continuity.

If critical threshold is breached, consider immediate actions to minimize service disruption:
1. Move workloads to another lower utilized FCP port.
2. Limit the traffic of certain LUNs only to essential work, either via QoS policies in ONTAP or host-side configuration to lighten the utilization of the FCP ports.…
If warning threshold is breached, plan to take the following actions:
1. Configure more FCP ports to handle the data traffic so that the port utilization gets distributed among more ports.
2. Move workloads to another lower utilized FCP port.
3. Limit the traffic of certain LUNs only to essential work, either via QoS policies in ONTAP or host-side configuration to lighten the utilization of the FCP ports.

Lun Latency High

CRITICAL

LUNs are objects that serve the I/O traffic often driven by performance sensitive applications such as databases. High LUN latencies means that the applications themselves might suffer and be unable to accomplish their tasks.…A warning alert indicates that planned action should be taken to move the LUN to appropriate Node or Aggregate.…A critical alert indicates that service disruption is imminent and emergency measures should be taken to ensure service continuity. Following are expected latencies based on media type - SSD up to 1-2 milliseconds; SAS up to 8-10 milliseconds, and SATA HDD 17-20 milliseconds

If critical threshold is breached, consider following actions to minimize service disruption:
If the LUN or its volume has a QoS policy associated with it, then evaluate its threshold limits and validate if they are causing the LUN workload to get throttled.…
If warning threshold is breached, plan to take the following actions:
1. If aggregate is also experiencing high utilization, move the LUN to another aggregate.
2. If the node is also experiencing high utilization, move the volume to another node or reduce the total workload of the node.
3. If the LUN or its volume has a QoS policy associated with it, evaluate its threshold limits and validate if they are causing the LUN workload to get throttled.

Network Port Utilization High

CRITICAL

Network ports are used to receive and transfer the NFS, CIFS, and iSCSI protocol traffic between the customer host systems and the ONTAP volumes. If the port utilization is high, then it becomes a bottleneck and it will ultimately affect the performance of NFS, CIFS and iSCSI workloads.…A warning alert indicates that planned action should be taken to balance network traffic.…A critical alert indicates that service disruption is imminent and emergency measures should be taken to balance network traffic to ensure service continuity.

If critical threshold is breached, consider following immediate actions to minimize service disruption:
1. Limit the traffic of certain volumes only to essential work, either via QoS policies in ONTAP or host-side analysis to decrease the utilization of the network ports.
2. Configure one or more volumes to use another lower utilized network port.…
If warning threshold is breached, consider the following immediate actions:
1. Configure more network ports to handle the data traffic so that the port utilization gets distributed among more ports.
2. Configure one or more volumes to use another lower utilized network port.

NVMe Namespace Latency High

CRITICAL

NVMe Namespaces are objects that serve the I/O traffic that is driven by performance sensitive applications such as databases. High NVMe Namespaces latency means that the applications themselves may suffer and be unable to accomplish their tasks.…A warning alert indicates that planned action should be taken to move the LUN to appropriate Node or Aggregate.…A critical alert indicates that service disruption is imminent and emergency measures should be taken to ensure service continuity.

If critical threshold is breached, consider immediate actions to minimize service disruption:
If the NVMe namespace or its volume has a QoS policy assigned to them, then evaluate its limit thresholds in case they are causing the NVMe namespace workload to get throttled.…
If warning threshold is breached, consider to take the following actions:
1. If aggregate is also experiencing high utilization, move the LUN to another aggregate.
2. If the node is also experiencing high utilization, move the volume to another node or reduce the total workload of the node.
3. If the NVMe namespace or its volume has a QoS policy assigned to them, evaluate its limit thresholds in case they are causing the NVMe namespace workload to get throttled.

QTree Capacity Full

CRITICAL

A qtree is a logically defined file system that can exist as a special subdirectory of the root directory within a volume. Each qtree has a default space quota or a quota defined by a quota policy to limit amount of data stored in the tree within the volume capacity.…A warning alert indicates that planned action should be taken to increase the space.…A critical alert indicates that service disruption is imminent and emergency measures should be taken to free up space to ensure service continuity.

If critical threshold is breached, consider immediate actions to minimize service disruption:
1. Increase the space of the qtree in order to accommodate the growth.
2. Delete unwanted data to free up space.…
If warning threshold is breached, plan to take the following immediate actions:
1. Increase the space of the qtree in order to accommodate the growth.
2. Delete unwanted data to free up space.

QTree Capacity Hard Limit

CRITICAL

A qtree is a logically defined file system that can exist as a special subdirectory of the root directory within a volume. Each qtree has a space quota measured in KBytes that is used to store data in order to control the growth of user data in volume and not exceed its total capacity.…A qtree maintains a soft storage capacity quota that provides alert to the user proactively before reaching the total capacity quota limit in the qtree and being unable to store data anymore. Monitoring the amount of data stored within a qtree ensures that the user receives uninterrupted data service.

If critical threshold is breached, consider following immediate actions to minimize service disruption:
1. Increase the tree space quota in order to accommodate the growth
2. Instruct the user to delete unwanted data in the tree to free up space

QTree Capacity Soft Limit

WARNING

A qtree is a logically defined file system that can exist as a special subdirectory of the root directory within a volume. Each qtree has a space quota measured in KBytes that it can use to store data in order to control the growth of user data in volume and not exceed its total capacity.…A qtree maintains a soft storage capacity quota that provides alert to the user proactively before reaching the total capacity quota limit in the qtree and being unable to store data anymore. Monitoring the amount of data stored within a qtree ensures that the user receives uninterrupted data service.

If warning threshold is breached, consider the following immediate actions:
1. Increase the tree space quota to accommodate the growth.
2. Instruct the user to delete unwanted data in the tree to free up space.

QTree Files Hard Limit

CRITICAL

A qtree is a logically defined file system that can exist as a special subdirectory of the root directory within a volume. Each qtree has a quota of the number of files that it can contain to maintain a manageable file system size within the volume.…A qtree maintains a hard file number quota beyond which new files in the tree are denied. Monitoring the number of files within a qtree ensures that the user receives uninterrupted data service.

If critical threshold is breached, consider immediate actions to minimize service disruption:
1. Increase the file count quota for the qtree.
2. Delete unwanted files from the qtree file system.

QTree Files Soft Limit

WARNING

A qtree is a logically defined file system that can exist as a special subdirectory of the root directory within a volume. Each qtree has a quota of the number of files that it can contain in order to maintain a manageable file system size within the volume.…A qtree maintains a soft file number quota to provide alert to the user proactively before reaching the limit of files in the qtree and being unable to store any additional files. Monitoring the number of files within a qtree ensures that the user receives uninterrupted data service.

If warning threshold is breached, plan to take the following immediate actions:
1. Increase the file count quota for the qtree.
2. Delete unwanted files from the qtree file system.

Snapshot Reserve Space Full

CRITICAL

Storage capacity of a volume is necessary to store application and customer data. A portion of that space, called snapshot reserved space, is used to store snapshots which allow data to be protected locally. The more new and updated data stored in the ONTAP volume the more snapshot capacity is used and less snapshot storage capacity is available for future new or updated data. If the snapshot data capacity within a volume reaches the total snapshot reserve space, it might lead to the customer being unable to store new snapshot data and reduction in the level of protection for the data in the volume. Monitoring the volume used snapshot capacity ensures data services continuity.

If critical threshold is breached, consider immediate actions to minimize service disruption:
1. Configure snapshots to use data space in the volume when the snapshot reserve is full.
2. Delete some older unwanted snapshots to free up space.…
If warning threshold is breached, plan to take the following immediate actions:
1. Increase the snapshot reserve space within the volume to accommodate the growth.
2. Configure snapshots to use data space in the volume when the snapshot reserve is full.

Storage Capacity Limit

CRITICAL

When a storage pool (aggregate) is filling up, I/O operations slow down and finally stop resulting in storage outage incident. A warning alert indicates that planned action should be taken soon to restore minimum free space. A critical alert indicates that service disruption is imminent and emergency measures should be taken to free up space to ensure service continuity.

If critical threshold is breached, immediately consider the following actions to minimize service disruption:
1. Delete Snapshots on non-critical volumes.
2. Delete Volumes or LUNs that are non-essential workloads and that may be restored from off storage copies.……If warning threshold is breached, plan the following immediate actions:
1. Move one or more volumes to a different storage location.
2. Add more storage capacity.
3. Change storage efficiency settings or tier inactive data to cloud storage.

Storage Performance Limit

CRITICAL

When a storage system reaches its performance limit, operations slow down, latency goes up and workloads and applications may start failing. ONTAP evaluates the storage pool utilization for workloads and estimates what percent of performance has been consumed.…A warning alert indicates that planned action should be taken to reduce storage pool load to ensure that there will be enough storage pool performance left to service workload peaks.…A critical alert indicates that a performance brownout is imminent and emergency measures should be taken to reduce storage pool load to ensure service continuity.

If critical threshold is breached, consider following immediate actions to minimize service disruption:
1. Suspend scheduled tasks such as Snapshots or SnapMirror replication.
2. Idle non-essential workloads.…
If warning threshold is breached, take the following actions immediately:
1. Move one or more workloads to a different storage location.
2. Add more storage nodes (AFF) or disk shelves(FAS) and redistribute workloads
3. Change workload characteristics(block size, application caching).

User Quota Capacity Hard Limit

CRITICAL

ONTAP recognizes the users of Unix or Windows systems who have the rights to access volumes, files or directories within a volume. As a result, ONTAP allows the customers to configure storage capacity for their users or groups of users of their Linux or Windows systems. The user or group policy quota limits the amount of space the user can utilize for their own data.…A hard limit of this quota allows notification of the user when the amount of capacity used within the volume is right before reaching the total capacity quota. Monitoring the amount of data stored within a user or group quota ensures that the user receives uninterrupted data service.

If critical threshold is breached, consider following immediate actions to minimize service disruption:
1. Increase the space of the user or group quota in order to accommodate the growth.
2. Instruct the user or group to delete unwanted data to free up space.

User Quota Capacity Soft Limit

WARNING

ONTAP recognizes the users of Unix or Windows systems that have the rights to access volumes, files or directories within a volume. As a result, ONTAP allows the customers to configure storage capacity for their users or groups of users of their Linux or Windows systems. The user or group policy quota limits the amount of space the user can utilize for their own data.…A soft limit of this quota allows proactive notification to the user when the amount of capacity used within the volume is reaching the total capacity quota. Monitoring the amount of data stored within a user or group quota ensures that the user receives uninterrupted data service.

If warning threshold is breached, plan to take the following immediate actions:
1. Increase the space of the user or group quota in order to accommodate the growth.
2. Delete unwanted data to free up space.

Volume Capacity Full

CRITICAL

Storage capacity of a volume is necessary to store application and customer data. The more data stored in the ONTAP volume the less storage availability for future data. If the data storage capacity within a volume reaches the total storage capacity may lead to the customer being unable to store data due to lack of storage capacity. Monitoring the volume used storage capacity ensures data services continuity.

If critical threshold is breached, consider following immediate actions to minimize service disruption:
1. Increase the space of the volume to accommodate the growth.
2. Delete unwanted data to free up space.
3. If snapshot copies occupy more space than the snapshot reserve, delete old Snapshots or enable Volume Snapshot Autodelete.…If warning threshold is breached, plan to take the following immediate actions:
1. Increase the space of the volume in order to accommodate the growth
2. If snapshot copies occupy more space than the snapshot reserve, delete old Snapshots or enabling Volume Snapshot Autodelete.……

Volume Inodes Limit

CRITICAL

Volumes that store files use index nodes (inode) to store file metadata. When a volume exhausts its inode allocation, no more files can be added to it.…A warning alert indicates that planned action should be taken to increase the number of available inodes.…A critical alert indicates that file limit exhaustion is imminent and emergency measures should be taken to free up inodes to ensure service continuity.

If critical threshold is breached, consider following immediate actions to minimize service disruption:
1. Increase the inodes value for the volume. If the inodes value is already at the max value, then split the volume into two or more volumes because the file system has grown beyond the maximum size.
2. Use FlexGroup as it helps to accommodate large file systems.…
If warning threshold is breached, plan to take the following immediate actions:
1. Increase the inodes value for the volume. If the inodes value is already at the max, then split the volume into two or more volumes because the file system has grown beyond the maximum size.
2. Use FlexGroup as it helps to accommodate large file systems

Volume Latency High

CRITICAL

Volumes are objects that serve the I/O traffic often driven by performance sensitive applications including devOps applications, home directories, and databases. High volume latencies means that the applications themselves may suffer and be unable to accomplish their tasks. Monitoring volume latencies is critical to maintain application consistent performance. The following are expected latencies based on media type - SSD up to 1-2 milliseconds; SAS up to 8-10 milliseconds and SATA HDD 17-20 milliseconds.

If critical threshold is breached, consider following immediate actions to minimize service disruption:
If the volume has a QoS policy assigned to it, evaluate its limit thresholds in case they are causing the volume workload to get throttled.…
If warning threshold is breached, consider the following immediate actions:
1. If aggregate is also experiencing high utilization, move the volume to another aggregate.
2. If the volume has a QoS policy assigned to it, evaluate its limit thresholds in case they are causing the volume workload to get throttled.
3. If the node is also experiencing high utilization, move the volume to another node or reduce the total workload of the node.

Monitor Name

Severity

Monitor Description

Corrective Action

Node High Latency

WARNING / CRITICAL

Node latency has reached the levels where it might affect the performance of the applications on the node. Lower node latency ensures consistent performance of the applications. The expected latencies based on media type are: SSD up to 1-2 milliseconds; SAS up to 8-10 milliseconds and SATA HDD 17-20 milliseconds.

If critical threshold is breached, then immediate actions should be taken to minimize service disruption:
1. Suspend scheduled tasks, Snapshots or SnapMirror replication
2. Lower the demand of lower priority workloads via QoS limits
3. Inactivate non-essential workloads

Consider immediate actions when warning threshold is breached:
1. Move one or more workloads to a different storage location
2. Lower the demand of lower priority workloads via QoS limits
3. Add more storage nodes (AFF) or disk shelves (FAS) and redistribute workloads
4. Change workload characteristics (block size, application caching etc)

Node Performance Limit

WARNING / CRITICAL

Node performance utilization has reached the levels where it might affect the performance of the IOs and the applications supported by the node. Low node performance utilization ensures consistent performance of the applications.

Immediate actions should be taken to minimize service disruption if critical threshold is breached:
1. Suspend scheduled tasks, Snapshots or SnapMirror replication
2. Lower the demand of lower priority workloads via QoS limits
3. Inactivate non-essential workloads

Consider the following actions if warning threshold is breached:
1. Move one or more workloads to a different storage location
2. Lower the demand of lower priority workloads via QoS limits
3. Add more storage nodes (AFF) or disk shelves (FAS)and redistribute workloads
4. Change workload characteristics (block size, application caching etc)

Storage VM High Latency

WARNING / CRITICAL

Storage VM (SVM) latency has reached the levels where it might affect the performance of the applications on the storage VM. Lower storage VM latency ensures consistent performance of the applications. The expected latencies based on media type are: SSD up to 1-2 milliseconds; SAS up to 8-10 milliseconds and SATA HDD 17-20 milliseconds.

If critical threshold is breached, then immediately evaluate the threshold limits for volumes of the storage VM with a QoS policy assigned, to verify whether they are causing the volume workloads to get throttled

Consider following immediate actions when warning threshold is breached:
1. If aggregate is also experiencing high utilization, move some volumes of the storage VM to another aggregate.
2. For volumes of the storage VM with a QoS policy assigned, evaluate the threshold limits if they are causing the volume workloads to get throttled
3. If the node is experiencing high utilization, move some volumes of the storage VM to another node or reduce the total workload of the node

User Quota Files Hard Limit

CRITICAL

The number of files created within the volume has reached the critical limit and additional files cannot be created. Monitoring the number of files stored ensures that the user receives uninterrupted data service.

Immediate actions are required to minimize service disruption if critical threshold is breached.…Consider taking following actions:
1. Increase the file count quota for the specific user
2. Delete unwanted files to reduce the pressure on the files quota for the specific user

User Quota Files Soft Limit

WARNING

The number of files created within the volume has reached the threshold limit of the quota and is near to the critical limit. You cannot create additional files if quota reaches the critical limit. Monitoring the number of files stored by a user ensures that the user receives uninterrupted data service.

Consider immediate actions if warning threshold is breached:
1. Increase the file count quota for the specific user quota
2. Delete unwanted files to reduce the pressure on the files quota for the specific user

Volume Cache Miss Ratio

WARNING / CRITICAL

Volume Cache Miss Ratio is the percentage of read requests from the client applications that are returned from the disk instead of being returned from the cache. This means that the volume has reached the set threshold.

If critical threshold is breached, then immediate actions should be taken to minimize service disruption:
1. Move some workloads off of the node of the volume to reduce the IO load
2. If not already on the node of the volume, increase the WAFL cache by purchasing and adding a Flash Cache
3. Lower the demand of lower priority workloads on the same node via QoS limits

Consider immediate actions when warning threshold is breached:
1. Move some workloads off of the node of the volume to reduce the IO load
2. If not already on the node of the volume, increase the WAFL cache by purchasing and adding a Flash Cache
3. Lower the demand of lower priority workloads on the same node via QoS limits
4. Change workload characteristics (block size, application caching etc)

Volume Qtree Quota Overcommit

WARNING / CRITICAL

Volume Qtree Quota Overcommit specifies the percentage at which a volume is considered to be overcommitted by the qtree quotas. The set threshold for the qtree quota is reached for the volume. Monitoring the volume qtree quota overcommit ensures that the user receives uninterrupted data service.

If critical threshold is breached, then immediate actions should be taken to minimize service disruption:
1. Increase the space of the volume
2. Delete unwanted data

When warning threshold is breached, then consider increasing the space of the volume.

Log Monitors

Monitor Name

Severity

Description

Corrective Action

AWS Credentials Not Initialized

INFO

This event occurs when a module attempts to access Amazon Web Services (AWS) Identity and Access Management (IAM) role-based credentials from the cloud credentials thread before they are initialized.

Wait for the cloud credentials thread, as well as the system, to complete initialization.

Cloud Tier Unreachable

CRITICAL

A storage node cannot connect to Cloud Tier object store API. Some data will be inaccessible.

If you use on-premises products, perform the following corrective actions: …Verify that your intercluster LIF is online and functional by using the "network interface show" command.…Check the network connectivity to the object store server by using the "ping" command over the destination node intercluster LIF.…Ensure the following:…The configuration of your object store has not changed.…The login and connectivity information is still valid.…Contact NetApp technical support if the issue persists.

If you use Cloud Volumes ONTAP, perform the following corrective actions: …Ensure that the configuration of your object store has not changed.… Ensure that the login and connectivity information is still valid.…Contact NetApp technical support if the issue persists.

Disk Out of Service

INFO

This event occurs when a disk is removed from service because it has been marked failed, is being sanitized, or has entered the Maintenance Center.

None.

FlexGroup Constituent Full

CRITICAL

A constituent within a FlexGroup volume is full, which might cause a potential disruption of service. You can still create or expand files on the FlexGroup volume. However, none of the files that are stored on the constituent can be modified. As a result, you might see random out-of-space errors when you try to perform write operations on the FlexGroup volume.

It is recommended that you add capacity to the FlexGroup volume by using the "volume modify -files +X" command.…Alternatively, delete files from the FlexGroup volume. However, it is difficult to determine which files have landed on the constituent.

Flexgroup Constituent Nearly Full

WARNING

A constituent within a FlexGroup volume is nearly out of space, which might cause a potential disruption of service. Files can be created and expanded. However, if the constituent runs out of space, you might not be able to append to or modify the files on the constituent.

It is recommended that you add capacity to the FlexGroup volume by using the "volume modify -files +X" command.…Alternatively, delete files from the FlexGroup volume. However, it is difficult to determine which files have landed on the constituent.

FlexGroup Constituent Nearly Out of Inodes

WARNING

A constituent within a FlexGroup volume is almost out of inodes, which might cause a potential disruption of service. The constituent receives lesser create requests than average. This might impact the overall performance of the FlexGroup volume, because the requests are routed to constituents with more inodes.

It is recommended that you add capacity to the FlexGroup volume by using the "volume modify -files +X" command.…Alternatively, delete files from the FlexGroup volume. However, it is difficult to determine which files have landed on the constituent.

FlexGroup Constituent Out of Inodes

CRITICAL

A constituent of a FlexGroup volume has run out of inodes, which might cause a potential disruption of service. You cannot create new files on this constituent. This might lead to an overall imbalanced distribution of content across the FlexGroup volume.

It is recommended that you add capacity to the FlexGroup volume by using the "volume modify -files +X" command.…Alternatively, delete files from the FlexGroup volume. However, it is difficult to determine which files have landed on the constituent.

LUN Offline

INFO

This event occurs when a LUN is brought offline manually.

Bring the LUN back online.

Main Unit Fan Failed

WARNING

One or more main unit fans have failed. The system remains operational.…However, if the condition persists for too long, the overtemperature might trigger an automatic shutdown.

Reseat the failed fans. If the error persists, replace them.

Main Unit Fan in Warning State

INFO

This event occurs when one or more main unit fans are in a warning state.

Replace the indicated fans to avoid overheating.

NVRAM Battery Low

WARNING

The NVRAM battery capacity is critically low. There might be a potential data loss if the battery runs out of power.…Your system generates and transmits an AutoSupport or "call home" message to NetApp technical support and the configured destinations if it is configured to do so. The successful delivery of an AutoSupport message significantly improves problem determination and resolution.

Perform the following corrective actions:…View the battery's current status, capacity, and charging state by using the "system node environment sensors show" command.…If the battery was replaced recently or the system was non-operational for an extended period of time, monitor the battery to verify that it is charging properly.…Contact NetApp technical support if the battery runtime continues to decrease below critical levels, and the storage system shuts down automatically.

Service Processor Not Configured

WARNING

This event occurs on a weekly basis, to remind you to configure the Service Processor (SP). The SP is a physical device that is incorporated into your system to provide remote access and remote management capabilities. You should configure the SP to use its full functionality.

Perform the following corrective actions:…Configure the SP by using the "system service-processor network modify" command.…Optionally, obtain the MAC address of the SP by using the "system service-processor network show" command.…Verify the SP network configuration by using the "system service-processor network show" command.…Verify that the SP can send an AutoSupport email by using the "system service-processor autosupport invoke" command.
NOTE: AutoSupport email hosts and recipients should be configured in ONTAP before you issue this command.

Service Processor Offline

CRITICAL

ONTAP is no longer receiving heartbeats from the Service Processor (SP), even though all the SP recovery actions have been taken. ONTAP cannot monitor the health of the hardware without the SP.…The system will shut down to prevent hardware damage and data loss. Set up a panic alert to be notified immediately if the SP goes offline.

Power-cycle the system by performing the following actions:…Pull the controller out from the chassis.…Push the controller back in.…Turn the controller back on.…If the problem persists, replace the controller module.

Shelf Fans Failed

CRITICAL

The indicated cooling fan or fan module of the shelf has failed. The disks in the shelf might not receive enough cooling airflow, which might result in disk failure.

Perform the following corrective actions:…Verify that the fan module is fully seated and secured.
NOTE: The fan is integrated into the power supply module in some disk shelves.…If the issue persists, replace the fan module.…If the issue still persists, contact NetApp technical support for assistance.

System Cannot Operate Due to Main Unit Fan Failure

CRITICAL

One or more main unit fans have failed, disrupting system operation. This might lead to a potential data loss.

Replace the failed fans.

Unassigned Disks

INFO

System has unassigned disks - capacity is being wasted and your system may have some misconfiguration or partial configuration change applied.

Perform the following corrective actions:…Determine which disks are unassigned by using the "disk show -n" command.…Assign the disks to a system by using the "disk assign" command.

Antivirus Server Busy

WARNING

The antivirus server is too busy to accept any new scan requests.

If this message occurs frequently, ensure that there are enough antivirus servers to handle the virus scan load generated by the SVM.

AWS Credentials for IAM Role Expired

CRITICAL

Cloud Volume ONTAP has become inaccessible. The Identity and Access Management (IAM) role-based credentials have expired. The credentials are acquired from the Amazon Web Services (AWS) metadata server using the IAM role, and are used to sign API requests to Amazon Simple Storage Service (Amazon S3).

Perform the following:…Log in to the AWS EC2 Management Console.…Navigate to the Instances page.…Find the instance for the Cloud Volumes ONTAP deployment and check its health.…Verify that the AWS IAM role associated with the instance is valid and has been granted proper privileges to the instance.

AWS Credentials for IAM Role Not Found

CRITICAL

The cloud credentials thread cannot acquire the Amazon Web Services (AWS) Identity and Access Management (IAM) role-based credentials from the AWS metadata server. The credentials are used to sign API requests to Amazon Simple Storage Service (Amazon S3). Cloud Volume ONTAP has become inaccessible.…

Perform the following:…Log in to the AWS EC2 Management Console.…Navigate to the Instances page.…Find the instance for the Cloud Volumes ONTAP deployment and check its health.…Verify that the AWS IAM role associated with the instance is valid and has been granted proper privileges to the instance.

AWS Credentials for IAM Role Not Valid

CRITICAL

The Identity and Access Management (IAM) role-based credentials are not valid. The credentials are acquired from the Amazon Web Services (AWS) metadata server using the IAM role, and are used to sign API requests to Amazon Simple Storage Service (Amazon S3). Cloud Volume ONTAP has become inaccessible.

Perform the following:…Log in to the AWS EC2 Management Console.…Navigate to the Instances page.…Find the instance for the Cloud Volumes ONTAP deployment and check its health.…Verify that the AWS IAM role associated with the instance is valid and has been granted proper privileges to the instance.

AWS IAM Role Not Found

CRITICAL

The Identity and Access Management (IAM) roles thread cannot find an Amazon Web Services (AWS) IAM role on the AWS metadata server. The IAM role is required to acquire role-based credentials used to sign API requests to Amazon Simple Storage Service (Amazon S3). Cloud Volume ONTAP has become inaccessible.…

Perform the following:…Log in to the AWS EC2 Management Console.…Navigate to the Instances page.…Find the instance for the Cloud Volumes ONTAP deployment and check its health.…Verify that the AWS IAM role associated with the instance is valid.

AWS IAM Role Not Valid

CRITICAL

The Amazon Web Services (AWS) Identity and Access Management (IAM) role on the AWS metadata server is not valid. The Cloud Volume ONTAP has become inaccessible.…

Perform the following:…Log in to the AWS EC2 Management Console.…Navigate to the Instances page.…Find the instance for the Cloud Volumes ONTAP deployment and check its health.…Verify that the AWS IAM role associated with the instance is valid and has been granted proper privileges to the instance.

AWS Metadata Server Connection Fail

CRITICAL

The Identity and Access Management (IAM) roles thread cannot establish a communication link with the Amazon Web Services (AWS) metadata server. Communication should be established to acquire the necessary AWS IAM role-based credentials used to sign API requests to Amazon Simple Storage Service (Amazon S3). Cloud Volume ONTAP has become inaccessible.…

Perform the following:…Log in to the AWS EC2 Management Console.…Navigate to the Instances page.…Find the instance for the Cloud Volumes ONTAP deployment and check its health.…

FabricPool Space Usage Limit Nearly Reached

WARNING

The total cluster-wide FabricPool space usage of object stores from capacity-licensed providers has nearly reached the licensed limit.

Perform the following corrective actions:…Check the percentage of the licensed capacity used by each FabricPool storage tier by using the "storage aggregate object-store show-space" command.…Delete Snapshot copies from volumes with the tiering policy "snapshot" or "backup" by using the "volume snapshot delete" command to clear up space.…Install a new license on the cluster to increase the licensed capacity.

FabricPool Space Usage Limit Reached

CRITICAL

The total cluster-wide FabricPool space usage of object stores from capacity-licensed providers has reached the license limit.

Perform the following corrective actions:…Check the percentage of the licensed capacity used by each FabricPool storage tier by using the "storage aggregate object-store show-space" command.…Delete Snapshot copies from volumes with the tiering policy "snapshot" or "backup" by using the "volume snapshot delete" command to clear up space.…Install a new license on the cluster to increase the licensed capacity.

Giveback of Aggregate Failed

CRITICAL

This event occurs during the migration of an aggregate as part of a storage failover (SFO) giveback, when the destination node cannot reach the object stores.

Perform the following corrective actions:…Verify that your intercluster LIF is online and functional by using the "network interface show" command.…Check network connectivity to the object store server by using the"'ping" command over the destination node intercluster LIF. …Verify that the configuration of your object store has not changed and that login and connectivity information is still accurate by using the "aggregate object-store config show" command.…Alternatively, you can override the error by specifying false for the "require-partner-waiting" parameter of the giveback command.…Contact NetApp technical support for more information or assistance.

HA Interconnect Down

WARNING

The high-availability (HA) interconnect is down. Risk of service outage when failover is not available.

Corrective actions depend on the number and type of HA interconnect links supported by the platform, as well as the reason why the interconnect is down. …If the links are down:…Verify that both controllers in the HA pair are operational.…For externally connected links, make sure that the interconnect cables are connected properly and that the small form-factor pluggables (SFPs), if applicable, are seated properly on both controllers.…For internally connected links, disable and re-enable the links, one after the other, by using the "ic link off" and "ic link on" commands. …If links are disabled, enable the links by using the "ic link on" command. …If a peer is not connected, disable and re-enable the links, one after the other, by using the "ic link off" and "ic link on" commands.…Contact NetApp technical support if the issue persists.

Max Sessions Per User Exceeded

WARNING

You have exceeded the maximum number of sessions allowed per user over a TCP connection. Any request to establish a session will be denied until some sessions are released. …

Perform the following corrective actions: …Inspect all the applications that run on the client, and terminate any that are not operating properly.…Reboot the client.…Check if the issue is caused by a new or existing application:…If the application is new, set a higher threshold for the client by using the "cifs option modify -max-opens-same-file-per-tree" command.
In some cases, clients operate as expected, but require a higher threshold. You should have advanced privilege to set a higher threshold for the client. …If the issue is caused by an existing application, there might be an issue with the client. Contact NetApp technical support for more information or assistance.

Max Times Open Per File Exceeded

WARNING

You have exceeded the maximum number of times that you can open the file over a TCP connection. Any request to open this file will be denied until you close some open instances of the file. This typically indicates abnormal application behavior.…

Perform the following corrective actions:…Inspect the applications that run on the client using this TCP connection.
The client might be operating incorrectly because of the application running on it.…Reboot the client.…Check if the issue is caused by a new or existing application:…If the application is new, set a higher threshold for the client by using the "cifs option modify -max-opens-same-file-per-tree" command.
In some cases, clients operate as expected, but require a higher threshold. You should have advanced privilege to set a higher threshold for the client. …If the issue is caused by an existing application, there might be an issue with the client. Contact NetApp technical support for more information or assistance.

NetBIOS Name Conflict

CRITICAL

The NetBIOS Name Service has received a negative response to a name registration request, from a remote machine. This is typically caused by a conflict in the NetBIOS name or an alias. As a result, clients might not be able to access data or connect to the right data-serving node in the cluster.

Perform any one of the following corrective actions:…If there is a conflict in the NetBIOS name or an alias, perform one of the following:…Delete the duplicate NetBIOS alias by using the "vserver cifs delete -aliases alias -vserver vserver" command.…Rename a NetBIOS alias by deleting the duplicate name and adding an alias with a new name by using the "vserver cifs create -aliases alias -vserver vserver" command. …If there are no aliases configured and there is a conflict in the NetBIOS name, then rename the CIFS server by using the "vserver cifs delete -vserver vserver" and "vserver cifs create -cifs-server netbiosname" commands.
NOTE: Deleting a CIFS server can make data inaccessible. …Remove NetBIOS name or rename the NetBIOS on the remote machine.

NFSv4 Store Pool Exhausted

CRITICAL

A NFSv4 store pool has been exhausted.

If the NFS server is unresponsive for more than 10 minutes after this event, contact NetApp technical support.

No Registered Scan Engine

CRITICAL

The antivirus connector notified ONTAP that it does not have a registered scan engine. This might cause data unavailability if the "scan-mandatory" option is enabled.

Perform the following corrective actions:…Ensure that the scan engine software installed on the antivirus server is compatible with ONTAP.…Ensure that scan engine software is running and configured to connect to the antivirus connector over local loopback.

No Vscan Connection

CRITICAL

ONTAP has no Vscan connection to service virus scan requests. This might cause data unavailability if the "scan-mandatory" option is enabled.

Ensure that the scanner pool is properly configured and the antivirus servers are active and connected to ONTAP.

Node Root Volume Space Low

CRITICAL

The system has detected that the root volume is dangerously low on space. The node is not fully operational. Data LIFs might have failed over within the cluster, because of which NFS and CIFS access is limited on the node. Administrative capability is limited to local recovery procedures for the node to clear up space on the root volume.

Perform the following corrective actions:…Clear up space on the root volume by deleting old Snapshot copies, deleting files you no longer need from the /mroot directory, or expanding the root volume capacity.…Reboot the controller.…Contact NetApp technical support for more information or assistance.

Nonexistent Admin Share

CRITICAL

Vscan issue: a client has attempted to connect to a nonexistent ONTAP_ADMIN$ share.

Ensure that Vscan is enabled for the mentioned SVM ID. Enabling Vscan on a SVM causes the ONTAP_ADMIN$ share to be created for the SVM automatically.

NVMe Namespace Out of Space

CRITICAL

An NVMe namespace has been brought offline because of a write failure caused by lack of space.

Add space to the volume, and then bring the NVMe namespace online by using the "vserver nvme namespace modify" command.

NVMe-oF Grace Period Active

WARNING

This event occurs on a daily basis when the NVMe over Fabrics (NVMe-oF) protocol is in use and the grace period of the license is active. The NVMe-oF functionality requires a license after the license grace period expires. NVMe-oF functionality is disabled when the license grace period is over.

Contact your sales representative to obtain an NVMe-oF license, and add it to the cluster, or remove all instances of NVMe-oF configuration from the cluster.

NVMe-oF Grace Period Expired

WARNING

The NVMe over Fabrics (NVMe-oF) license grace period is over and the NVMe-oF functionality is disabled.

Contact your sales representative to obtain an NVMe-oF license, and add it to the cluster.

NVMe-oF Grace Period Start

WARNING

The NVMe over Fabrics (NVMe-oF) configuration was detected during the upgrade to ONTAP 9.5 software. NVMe-oF functionality requires a license after the license grace period expires.

Contact your sales representative to obtain an NVMe-oF license, and add it to the cluster.

Object Store Host Unresolvable

CRITICAL

The object store server host name cannot be resolved to an IP address. The object store client cannot communicate with the object-store server without resolving to an IP address. As a result, data might be inaccessible.

Check the DNS configuration to verify that the host name is configured correctly with an IP address.

Object Store Intercluster LIF Down

CRITICAL

The object-store client cannot find an operational LIF to communicate with the object store server. The node will not allow object store client traffic until the intercluster LIF is operational. As a result, data might be inaccessible.

Perform the following corrective actions:…Check the intercluster LIF status by using the "network interface show -role intercluster" command.…Verify that the intercluster LIF is configured correctly and operational.…If an intercluster LIF is not configured, add it by using the "network interface create -role intercluster" command.

Object Store Signature Mismatch

CRITICAL

The request signature sent to the object store server does not match the signature calculated by the client. As a result, data might be inaccessible.

Verify that the secret access key is configured correctly. If it is configured correctly, contact NetApp technical support for assistance.

READDIR Timeout

CRITICAL

A READDIR file operation has exceeded the timeout that it is allowed to run in WAFL. This can be because of very large or sparse directories. Corrective action is recommended.

Perform the following corrective actions:…Find information specific to recent directories that have had READDIR file operations expire by using the following 'diag' privilege nodeshell CLI command:
wafl readdir notice show.…Check if directories are indicated as sparse or not:…If a directory is indicated as sparse, it is recommended that you copy the contents of the directory to a new directory to remove the sparseness of the directory file. …If a directory is not indicated as sparse and the directory is large, it is recommended that you reduce the size of the directory file by reducing the number of file entries in the directory.

Relocation of Aggregate Failed

CRITICAL

This event occurs during the relocation of an aggregate, when the destination node cannot reach the object stores.

Perform the following corrective actions:…Verify that your intercluster LIF is online and functional by using the "network interface show" command.…Check network connectivity to the object store server by using the"'ping" command over the destination node intercluster LIF. …Verify that the configuration of your object store has not changed and that login and connectivity information is still accurate by using the "aggregate object-store config show" command.…Alternatively, you can override the error by using the "override-destination-checks" parameter of the relocation command.…Contact NetApp technical support for more information or assistance.

Shadow Copy Failed

CRITICAL

A Volume Shadow Copy Service (VSS), a Microsoft Server backup and restore service operation, has failed.

Check the following using the information provided in the event message:…Is shadow copy configuration enabled?…Are the appropriate licenses installed? …On which shares is the shadow copy operation performed?…Is the share name correct?…Does the share path exist?…What are the states of the shadow copy set and its shadow copies?

Storage Switch Power Supplies Failed

WARNING

There is a missing power supply in the cluster switch. Redundancy is reduced, risk of outage with any further power failures.

Perform the following corrective actions:…Ensure that the power supply mains, which supplies power to the cluster switch, is turned on.…Ensure that the power cord is connected to the power supply.…Contact NetApp technical support if the issue persists.

Too Many CIFS Authentication

WARNING

Many authentication negotiations have occurred simultaneously. There are 256 incomplete new session requests from this client.

Investigate why the client has created 256 or more new connection requests. You might have to contact the vendor of the client or of the application to determine why the error occurred.

Unauthorized User Access to Admin Share

WARNING

A client has attempted to connect to the privileged ONTAP_ADMIN$ share even though their logged-in user is not an allowed user.

Perform the following corrective actions:…Ensure that the mentioned username and IP address is configured in one of the active Vscan scanner pools.…Check the scanner pool configuration that is currently active by using the "vserver vscan scanner pool show-active" command.

Virus Detected

WARNING

A Vscan server has reported an error to the storage system. This typically indicates that a virus has been found. However, other errors on the Vscan server can cause this event.…Client access to the file is denied. The Vscan server might, depending on its settings and configuration, clean the file, quarantine it, or delete it.

Check the log of the Vscan server reported in the "syslog" event to see if it was able to successfully clean, quarantine, or delete the infected file. If it was not able to do so, a system administrator might have to manually delete the file.

Volume Offline

INFO

This message indicates that a volume is made offline.

Bring the volume back online.

Volume Restricted

INFO

This event indicates that a flexible volume is made restricted.

Bring the volume back online.

Storage VM Stop Succeeded

INFO

This message occurs when a 'vserver stop' operation succeeds.

Use 'vserver start' command to start the data access on a storage VM.

Node Panic

WARNING

This event is issued when a panic occurs

Contact NetApp customer support.

Anti-Ransomware Log Monitors

Monitor Name

Severity

Description

Corrective Action

Storage VM Anti-ransomware Monitoring Disabled

WARNING

The anti-ransomware monitoring for the storage VM is disabled. Enable anti-ransomware to protect the storage VM.

None

Storage VM Anti-ransomware Monitoring Enabled (Learning Mode)

INFO

The anti-ransomware monitoring for the storage VM is enabled in learning mode.

None

Volume Anti-ransomware Monitoring Enabled

INFO

The anti-ransomware monitoring for the volume is enabled.

None

Volume Anti-ransomware Monitoring Disabled

WARNING

The anti-ransomware monitoring for the volume is disabled. Enable anti-ransomware to protect the volume.

None

Volume Anti-ransomware Monitoring Enabled (Learning Mode)

INFO

The anti-ransomware monitoring for the volume is enabled in learning mode.

None

Volume Anti-ransomware Monitoring Paused (Learning Mode)

WARNING

The anti-ransomware monitoring for the volume is paused in learning mode.

None

Volume Anti-ransomware Monitoring Paused

WARNING

The anti-ransomware monitoring for the volume is paused.

None

Volume Anti-ransomware Monitoring Disabling

WARNING

The anti-ransomware monitoring for the volume is disabling.

None

Ransomware Activity Detected

CRITICAL

To protect the data from the detected ransomware, a Snapshot copy has been taken that can be used to restore original data.
Your system generates and transmits an AutoSupport or "call home" message to NetApp technical support and any configured destinations. AutoSupport message improves problem determination and resolution.

Refer to the "FINAL-DOCUMENT-NAME" to take remedial measures for ransomware activity.

FSx for NetApp ONTAP Monitors

Monitor Name

Thresholds

Monitor Description

Corrective Action

FSx Volume Capacity is Full

Warning @ > 85 %…Critical @ > 95 %

Storage capacity of a volume is necessary to store application and customer data. The more data stored in the ONTAP volume the less storage availability for future data. If the data storage capacity within a volume reaches the total storage capacity may lead to the customer being unable to store data due to lack of storage capacity. Monitoring the volume used storage capacity ensures data services continuity.

Immediate actions are required to minimize service disruption if critical threshold is breached:…1. Consider deleting data that is not needed anymore to free up space

FSx Volume High Latency

Warning @ > 1000 µs…Critical @ > 2000 µs

Volumes are objects that serve the IO traffic often driven by performance sensitive applications including devOps applications, home directories, and databases. High volume latencies means that the applications themselves may suffer and be unable to accomplish their tasks. Monitoring volume latencies is critical to maintain application consistent performance.

Immediate actions are required to minimize service disruption if critical threshold is breached:…1. If the volume has a QoS policy assigned to it, evaluate its limit thresholds in case they are causing the volume workload to get throttled……Plan to take the following actions soon if warning threshold is breached:…1. If the volume has a QoS policy assigned to it, evaluate its limit thresholds in case they are causing the volume workload to get throttled.…2. If the node is also experiencing high utilization, move the volume to another node or reduce the total workload of the node.

FSx Volume Inodes Limit

Warning @ > 85 %…Critical @ > 95 %

Volumes that store files use index nodes (inode) to store file metadata. When a volume exhausts its inode allocation no more files can be added to it. A warning alert indicates that planned action should be taken to increase the number of available inodes. A critical alert indicates that file limit exhaustion is imminent and emergency measures should be taken to free up inodes to ensure service continuity

Immediate actions are required to minimize service disruption if critical threshold is breached:…1. Consider increasing the inodes value for the volume. If the inodes value is already at the max, then consider splitting the volume into two or more volumes because the file system has grown beyond the maximum size……Plan to take the following actions soon if warning threshold is breached:…1. Consider increasing the inodes value for the volume. If the inodes value is already at the max, then consider splitting the volume into two or more volumes because the file system has grown beyond the maximum size

FSx Volume Qtree Quota Overcommit

Warning @ > 95 %…Critical @ > 100 %

Volume Qtree Quota Overcommit specifies the percentage at which a volume is considered to be overcommitted by the qtree quotas. The set threshold for the qtree quota is reached for the volume. Monitoring the volume qtree quota overcommit ensures that the user receives uninterrupted data service.

If critical threshold is breached, then immediate actions should be taken to minimize service disruption:
1. Delete unwanted data…When warning threshold is breached, then consider increasing the space of the volume.

FSx Snapshot Reserve Space is Full

Warning @ > 90 %…Critical @ > 95 %

Storage capacity of a volume is necessary to store application and customer data. A portion of that space, called snapshot reserved space, is used to store snapshots which allow data to be protected locally. The more new and updated data stored in the ONTAP volume the more snapshot capacity is used and less snapshot storage capacity will be available for future new or updated data. If the snapshot data capacity within a volume reaches the total snapshot reserve space it may lead to the customer being unable to store new snapshot data and reduction in the level of protection for the data in the volume. Monitoring the volume used snapshot capacity ensures data services continuity.

Immediate actions are required to minimize service disruption if critical threshold is breached:…1. Consider configuring snapshots to use data space in the volume when the snapshot reserve is full…2. Consider deleting some older snapshots that may not be needed anymore to free up space……Plan to take the following actions soon if warning threshold is breached:…1. Consider increasing the snapshot reserve space within the volume to accommodate the growth…2. Consider configuring snapshots to use data space in the volume when the snapshot reserve is full

FSx Volume Cache Miss Ratio

Warning @ > 95 %…Critical @ > 100 %

Volume Cache Miss Ratio is the percentage of read requests from the client applications that are returned from the disk instead of being returned from the cache. This means that the volume has reached the set threshold.

If critical threshold is breached, then immediate actions should be taken to minimize service disruption:
1. Move some workloads off of the node of the volume to reduce the IO load
2. Lower the demand of lower priority workloads on the same node via QoS limits…Consider immediate actions when warning threshold is breached:
1. Move some workloads off of the node of the volume to reduce the IO load
2. Lower the demand of lower priority workloads on the same node via QoS limits
3. Change workload characteristics (block size, application caching etc)

K8s Monitors

Monitor Name

Description

Corrective Actions

Severity/Threshold

Persistent Volume Latency High

High persistent volume latencies means that the applications themselves may suffer and be unable to accomplish their tasks. Monitoring persistent volume latencies is critical to maintain application consistent performance. The following are expected latencies based on media type - SSD up to 1-2 milliseconds; SAS up to 8-10 milliseconds and SATA HDD 17-20 milliseconds.

Immediate Actions
If critical threshold is breached, consider immediate actions to minimize service disruption:
If the volume has a QoS policy assigned to it, evaluate its limit thresholds in case they are causing the volume workload to get throttled.
Actions To Do Soon
If warning threshold is breached, plan the following immediate actions:
1. If storage pool is also experiencing high utilization, move the volume to another storage pool.
2. If the volume has a QoS policy assigned to it, evaluate its limit thresholds in case they are causing the volume workload to get throttled.
3. If the controller is also experiencing high utilization, move the volume to another controller or reduce the total workload of the controller.

Warning @ > 6,000 μs
Critical @ > 12,000 μs

Cluster Memory Saturation High

Cluster allocatable memory saturation is high.
Cluster CPU saturation is calculated as the sum of memory usage divided by the sum of allocatable memory across all K8s nodes.

Add nodes.
Fix any unscheduled nodes.
Right-size pods to free up memory on nodes.

Warning @ > 80 %
Critical @ > 90 %

POD Attach Failed

This alert occurs when a volume attachment with POD is failed.

Warning

High Retransmit Rate

High TCP Retransmit Rate

Check for Network congestion - Identify workloads that consume a lot of network bandwidth.
Check for high Pod CPU utilization.
Check hardware network performance.

Warning @ > 10 %
Critical @ > 25 %

Node File System Capacity High

Node File System Capacity High

- Increase the size of the node disks to ensure that there is sufficient room for the application files.
- Decrease application file usage.

Warning @ > 80 %
Critical @ > 90 %

Workload Network Jitter High

High TCP Jitter (high latency/response time variations)

Check for Network congestion. Identify workloads that consume a lot of network bandwidth.
Check for high Pod CPU utilization.
Check hardware network performance

Warning @ > 30 ms
Critical @ > 50 ms

Persistent Volume Throughput

MBPS thresholds on persistent volumes can be used to alert an administrator when persistent volumes exceed predefined performance expectations, potentially impacting other persistent volumes. Activating this monitor will generate alerts appropriate for the typical throughput profile of persistent volumes on SSDs. This monitor will cover all persistent volumes in your environment. The warning and critical threshold values can be adjusted based on your monitoring goals by duplicating this monitor and setting thresholds appropriate for your storage class. A duplicated monitor can be further targeted to a subset of the persistent volumes in your environment.

Immediate Actions
If critical threshold is breached, plan immediate actions to minimize service disruption:
1. Introduce QoS MBPS limits for the volume.
2. Review the application driving the workload on the volume for anomalies.
Actions To Do Soon
If warning threshold is breached, plan to take the following immediate actions:
1. Introduce QoS MBPS limits for the volume.
2. Review the application driving the workload on the volume for anomalies.

Warning @ > 10,000 MB/s
Critical @ > 15,000 MB/s

Container at Risk of Going OOM Killed

The container's memory limits are set too low. The container is at risk of eviction (Out of Memory Kill).

Increase container memory limits.

Warning @ > 95 %

Workload Down

Workload has no healthy pods.

Critical @ < 1

Persistent Volume Claim Failed Binding

This alert occurs when a binding is failed on a PVC.

Warning

ResourceQuota Mem Limits About to Exceed

Memory limits for Namespace are about to exceed ResourceQuota

Warning @ > 80 %
Critical @ > 90 %

ResourceQuota Mem Requests About to Exceed

Memory requests for Namespace are about to exceed ResourceQuota

Warning @ > 80 %
Critical @ > 90 %

Node Creation Failed

The node could not be scheduled because of a configuration error.

Check the Kubernetes event log for the cause of the configuration failure.

Critical

Persistent Volume Reclamation Failed

The volume has failed its automatic reclamation.

Warning @ > 0 B

Container CPU Throttling

The container's CPU Limits are set too low. Container processes are slowed.

Increase container CPU limits.

Warning @ > 95 %
Critical @ > 98 %

Service Load Balancer Failed to Delete

Warning

Persistent Volume IOPS

IOPS thresholds on persistent volumes can be used to alert an administrator when persistent volumes exceed predefined performance expectations. Activating this monitor will generate alerts appropriate for the typical IOPS profile of persistence volumes. This monitor will cover all persistent volumes in your environment. The warning and critical threshold values can be adjusted based on your monitoring goals by duplicating this monitor and setting thresholds appropriate for your workload.

Immediate Actions
If critical threshold is breached, plan Immediate actions to minimize service disruption :
1. Introduce QoS IOPS limits for the volume.
2. Review the application driving the workload on the volume for anomalies.
Actions To Do Soon
If warning threshold is breached, plan the following immediate actions:
1. Introduce QoS IOPS limits for the volume.
2. Review the application driving the workload on the volume for anomalies.

Warning @ > 20,000 IO/s
Critical @ > 25,000 IO/s

Service Load Balancer Failed to Update

Warning

POD Failed Mount

This alert occurs when a mount is failed on a POD.

Warning

Node PID Pressure

Available process identifiers on the (Linux) node has fallen below an eviction threshold.

Find and fix pods that generate many processes and starve the node of available process IDs.
Set up PodPidsLimit to protect your node against pods or containers that spawn too many processes.

Critical @ > 0

Pod Image Pull Failure

Kubernetes failed to pull the pod container image.

- Make sure the pod's image is spelled correctly in the pod configuration.
- Check image tag exists in your registry.
- Verify the credentials for the image registry.
- Check for registry connectivity issues.
- Verify you are not hitting the rate limits imposed by public registry providers.

Warning

Job Running Too Long

Job is running for too long

Warning @ > 1 hr
Critical @ > 5 hr

Node Memory High

Node memory usage is high

Add nodes.
Fix any unscheduled nodes.
Right-size pods to free up memory on nodes.

Warning @ > 85 %
Critical @ > 90 %

ResourceQuota CPU Limits About to Exceed

CPU limits for Namespace are about to exceed ResourceQuota

Warning @ > 80 %
Critical @ > 90 %

Pod Crash Loop Backoff

Pod has crashed and attempted to restart multiple times.

Critical @ > 3

Node CPU High

Node CPU usage is high.

Add nodes.
Fix any unscheduled nodes.
Right-size pods to free up CPU on nodes.

Warning @ > 80 %
Critical @ > 90 %

Workload Network Latency RTT High

High TCP RTT (Round Trip Time) latency

Check for Network congestion ▒ Identify workloads that consume a lot of network bandwidth.
Check for high Pod CPU utilization.
Check hardware network performance.

Warning @ > 150 ms
Critical @ > 300 ms

Job Failed

Job did not complete successfully due to a node crash or reboot, resource exhaustion, job timeout, or pod scheduling failure.

Check the Kubernetes event logs for failure causes.

Warning @ > 1

Persistent Volume Full in a Few Days

Persistent Volume will run out of space in a few days

-Increase the volume size to ensure that there is sufficient room for the application files.
-Reduce the amount of data stored in applications.

Warning @ < 8 day
Critical @ < 3 day

Node Memory Pressure

Node is running out of memory. Available memory has met eviction threshold.

Add nodes.
Fix any unscheduled nodes.
Right-size pods to free up memory on nodes.

Critical @ > 0

Node Unready

Node has been unready for 5 minutes

Verify the node have enough CPU, memory, and disk resources.
Check node network connectivity.
Check the Kubernetes event logs for failure causes.

Critical @ < 1

Persistent Volume Capacity High

Persistent Volume backend used capacity is high.

- Increase the volume size to ensure that there is sufficient room for the application files.
- Reduce the amount of data stored in applications.

Warning @ > 80 %
Critical @ > 90 %

Service Load Balancer Failed to Create

Service Load Balancer Create Failed

Critical

Workload Replica Mismatch

Some pods are currently not available for a Deployment or DaemonSet.

Warning @ > 1

ResourceQuota CPU Requests About to Exceed

CPU requests for Namespace are about to exceed ResourceQuota

Warning @ > 80 %
Critical @ > 90 %

High Retransmit Rate

High TCP Retransmit Rate

Check for Network congestion - Identify workloads that consume a lot of network bandwidth.
Check for high Pod CPU utilization.
Check hardware network performance.

Warning @ > 10 %
Critical @ > 25 %

Node Disk Pressure

Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold.

- Increase the size of the node disks to ensure that there is sufficient room for the application files.
- Decrease application file usage.

Critical @ > 0

Cluster CPU Saturation High

Cluster allocatable CPU saturation is high.
Cluster CPU saturation is calculated as the sum of CPU usage divided by the sum allocatable CPU across all K8s nodes.

Add nodes.
Fix any unscheduled nodes.
Right-size pods to free up CPU on nodes.

Warning @ > 80 %
Critical @ > 90 %

Change Log Monitors

Monitor Name

Severity

Monitor Description

Internal Volume Discovered

Informational

This message occurs when an Internal Volume is discovered.

Internal Volume Modified

Informational

This message occurs when an Internal Volume is modified.

Storage Node Discovered

Informational

This message occurs when an Storage Node is discovered.

Storage Node Removed

Informational

This message occurs when an Storage Node is removed.

Storage Pool Discovered

Informational

This message occurs when an Storage Pool is discovered.

Storage Virtual Machine Discovered

Informational

This message occurs when an Storage Virtual Machine is discovered.

Storage Virtual Machine Modified

Informational

This message occurs when an Storage Virtual Machine is modified.

Data Collection Monitors

Monitor Name

Description

Corrective Action

Acquisition Unit Shutdown

Cloud Insights Acquisition Units periodically restart as part of upgrades to introduce new features. This happens once a month or less in a typical environment. A Warning Alert that an Acquisition Unit has shutdown should be followed soon after by a Resolution noting that the newly-restarted Acquisition Unit has completed a registration with Cloud Insights. Typically this shutdown-to-registration cycle takes 5 to 15 minutes.

If the alert occurs frequently or lasts longer than 15 minutes, check on the operation of the system hosting the Acquisition Unit, the network, and any proxy connecting the AU to the Internet.

Collector Failed

The poll of a data collector encountered an unexpected failure situation.

Visit the data collector page in Cloud Insights to learn more about the situation.

Collector Warning

This Alert typically can arise because of an erroneous configuration of the data collector or of the target system. Revisit the configurations to prevent future Alerts. It can also be due to a retrieval of less-than-complete data where the data collector gathered all the data that it could. This can happen when situations change during data collection (e.g., a virtual machine present at the beginning of data collection is deleted during data collection and before its data is captured).

Check the configuration of the data collector or target system.

Note that the monitor for Collector Warning can send more alerts than other monitor types, so it is recommended to set no alert recipients unless you are troubleshooting.

Security Monitors

Monitor Name

Threshold

Monitor Description

Corrective Action

AutoSupport HTTPS transport disabled

Warning @ < 1

AutoSupport supports HTTPS, HTTP, and SMTP for transport protocols. Because of the sensitive nature of AutoSupport messages, NetApp strongly recommends using HTTPS as the default transport protocol for sending AutoSupport messages to NetApp support.

To set HTTPS as the transport protocol for AutoSupport messages, run the following ONTAP command:…system node autosupport modify -transport https

Cluster Insecure ciphers for SSH

Warning @ < 1

Indicates that SSH is using insecure ciphers, for example ciphers beginning with *cbc.

To remove the CBC ciphers, run the following ONTAP command:…security ssh remove -vserver <admin vserver> -ciphers aes256-cbc,aes192-cbc,aes128-cbc,3des-cbc

Cluster Login Banner Disabled

Warning @ < 1

Indicates that the Login banner is disabled for users accessing the ONTAP system. Displaying a login banner is helpful for establishing expectations for access and use of the system.

To configure the login banner for a cluster, run the following ONTAP command:…security login banner modify -vserver <admin svm> -message "Access restricted to authorized users"

Cluster Peer Communication Not Encrypted

Warning @ < 1

When replicating data for disaster recovery, caching, or backup, you must protect that data during transport over the wire from one ONTAP cluster to another. Encryption must be configured on both the source and destination clusters.

To enable encryption on cluster peer relationships that were created prior to ONTAP 9.6, the source and destination cluster must be upgraded to 9.6. Then use the "cluster peer modify" command to change both the source and destination cluster peers to use Cluster Peering Encryption.…See the NetApp Security Hardening Guide for ONTAP 9 for details.

Default Local Admin User Enabled

Warning @ > 0

NetApp recommends locking (disabling) any unneeded Default Admin User (built-in) accounts with the lock command. They are primarily default accounts for which passwords were never updated or changed.

To lock the built-in "admin" account, run the following ONTAP command:…security login lock -username admin

FIPS Mode Disabled

Warning @ < 1

When FIPS 140-2 compliance is enabled, TLSv1 and SSLv3 are disabled, and only TLSv1.1 and TLSv1.2 remain enabled. ONTAP prevents you from enabling TLSv1 and SSLv3 when FIPS 140-2 compliance is enabled.

To enable FIPS 140-2 compliance on a cluster, run the following ONTAP command in advanced privilege mode:…security config modify -interface SSL -is-fips-enabled true

Log Forwarding Not Encrypted

Warning @ < 1

Offloading of syslog information is necessary for limiting the scope or footprint of a breach to a single system or solution. Therefore, NetApp recommends securely offloading syslog information to a secure storage or retention location.

Once a log forwarding destination is created, its protocol cannot be changed. To change to an encrypted protocol, delete and recreate the log forwarding destination using the following ONTAP command:…cluster log-forwarding create -destination <destination ip> -protocol tcp-encrypted

MD5 Hashed password

Warning @ > 0

NetApp strongly recommends to use the more secure SHA-512 hash function for ONTAP user account passwords. Accounts using the less secure MD5 hash function should migrate to the SHA-512 hash function.

NetApp strongly recommends user accounts migrate to the more secure SHA-512 solution by having users change their passwords.…to lock accounts with passwords that use the MD5 hash function, run the following ONTAP command:…security login lock -vserver * -username * -hash-function md5

No NTP servers are configured

Warning @ < 1

Indicates that the cluster has no configured NTP servers. For redundancy and optimum service, NetApp recommends that you associate at least three NTP servers with the cluster.

To associate an NTP server with the cluster, run the following ONTAP command:

cluster time-service ntp server create -server <ntp server host name or ip address>

NTP server count is low

Warning @ < 3

Indicates that the cluster has less than 3 configured NTP servers. For redundancy and optimum service, NetApp recommends that you associate at least three NTP servers with the cluster.

To associate an NTP server with the cluster, run the following ONTAP command:…cluster time-service ntp server create -server <ntp server host name or ip address>

Remote Shell Enabled

Warning @ > 0

Remote Shell is not a secure method for establishing command-line access to the ONTAP solution. Remote Shell should be disabled for secure remote access.

NetApp recommends Secure Shell (SSH) for secure remote access.…To disable Remote shell on a cluster, run the following ONTAP command in advanced privilege mode:…security protocol modify -application rsh- enabled false

Storage VM Audit Log Disabled

Warning @ < 1

Indicates that Audit logging is disabled for SVM.

To configure the Audit log for a vserver, run the following ONTAP command:…vserver audit enable -vserver <svm>

Storage VM Insecure ciphers for SSH

Warning @ < 1

Indicates that SSH is using insecure ciphers, for example ciphers beginning with *cbc.

To remove the CBC ciphers, run the following ONTAP command:…security ssh remove -vserver <vserver> -ciphers aes256-cbc,aes192-cbc,aes128-cbc,3des-cbc

Storage VM Login banner disabled

Warning @ < 1

Indicates that the Login banner is disabled for users accessing SVMs on the system. Displaying a login banner is helpful for establishing expectations for access and use of the system.

To configure the login banner for a cluster, run the following ONTAP command:…security login banner modify -vserver <svm> -message "Access restricted to authorized users"

Telnet Protocol Enabled

Warning @ > 0

Telnet is not a secure method for establishing command-line access to the ONTAP solution. Telnet should be disabled for secure remote access.

NetApp recommends Secure Shell (SSH) for secure remote access.
To disable Telnet on a cluster, run the following ONTAP command in advanced privilege mode:…security protocol modify -application telnet -enabled false

Data Protection Monitors

Monitor Name

Thresholds

Monitor Description

Corrective Action

Insufficient Space for Lun Snapshot Copy

(Filter contains_luns = Yes) Warning @ > 95 %…Critical @ > 100 %

Storage capacity of a volume is necessary to store application and customer data. A portion of that space, called snapshot reserved space, is used to store snapshots which allow data to be protected locally. The more new and updated data stored in the ONTAP volume the more snapshot capacity is used and less snapshot storage capacity will be available for future new or updated data. If the snapshot data capacity within a volume reaches the total snapshot reserve space it may lead to the customer being unable to store new snapshot data and reduction in the level of protection for the data in the LUNs in the volume. Monitoring the volume used snapshot capacity ensures data services continuity.

Immediate Actions
If critical threshold is breached, consider immediate actions to minimize service disruption:

1. Configure snapshots to use data space in the volume when the snapshot reserve is full.
2. Delete some older unwanted snapshots to free up space.

Actions To Do Soon
If warning threshold is breached, plan to take the following immediate actions:

1. Increase the snapshot reserve space within the volume to accommodate the growth.
2. Configure snapshots to use data space in the volume when the snapshot reserve is full.

SnapMirror Relationship Lag

Warning @ > 150%…Critical @ > 300%

SnapMirror relationship lag is the difference between the snapshot timestamp and the time on the destination system. The lag_time_percent is the ratio of lag time to the SnapMirror Policy's schedule interval. If the lag time equals the schedule interval, the lag_time_percent will be 100%. If the SnapMirror policy does not have a schedule, lag_time_percent will not be calculated.

Monitor SnapMirror status using the "snapmirror show" command. Check the SnapMirror transfer history using the "snapmirror show-history" command

Cloud Volume (CVO) Monitors

Monitor Name

CI Severity

Monitor Description

Corrective Action

CVO Disk Out of Service

INFO

This event occurs when a disk is removed from service because it has been marked failed, is being sanitized, or has entered the Maintenance Center.

None

CVO Giveback of Storage Pool Failed

CRITICAL

This event occurs during the migration of an aggregate as part of a storage failover (SFO) giveback, when the destination node cannot reach the object stores.

Perform the following corrective actions:

Verify that your intercluster LIF is online and functional by using the "network interface show" command.

Check network connectivity to the object store server by using the"'ping" command over the destination node intercluster LIF.

Verify that the configuration of your object store has not changed and that login and connectivity information is still accurate by using the "aggregate object-store config show" command.

Alternatively, you can override the error by specifying false for the "require-partner-waiting" parameter of the giveback command.

Contact NetApp technical support for more information or assistance.

CVO HA Interconnect Down

WARNING

The high-availability (HA) interconnect is down. Risk of service outage when failover is not available.

Corrective actions depend on the number and type of HA interconnect links supported by the platform, as well as the reason why the interconnect is down.

If the links are down:

Verify that both controllers in the HA pair are operational.

For externally connected links, make sure that the interconnect cables are connected properly and that the small form-factor pluggables (SFPs), if applicable, are seated properly on both controllers.

For internally connected links, disable and re-enable the links, one after the other, by using the "ic link off" and "ic link on" commands.

If links are disabled, enable the links by using the "ic link on" command.

If a peer is not connected, disable and re-enable the links, one after the other, by using the "ic link off" and "ic link on" commands.

Contact NetApp technical support if the issue persists.

CVO Max Sessions Per User Exceeded

WARNING

You have exceeded the maximum number of sessions allowed per user over a TCP connection. Any request to establish a session will be denied until some sessions are released.

Perform the following corrective actions:

Inspect all the applications that run on the client, and terminate any that are not operating properly.

Reboot the client.

Check if the issue is caused by a new or existing application:

If the application is new, set a higher threshold for the client by using the "cifs option modify -max-opens-same-file-per-tree" command.
In some cases, clients operate as expected, but require a higher threshold. You should have advanced privilege to set a higher threshold for the client.

If the issue is caused by an existing application, there might be an issue with the client. Contact NetApp technical support for more information or assistance.

CVO NetBIOS Name Conflict

CRITICAL

The NetBIOS Name Service has received a negative response to a name registration request, from a remote machine. This is typically caused by a conflict in the NetBIOS name or an alias. As a result, clients might not be able to access data or connect to the right data-serving node in the cluster.

Perform any one of the following corrective actions:

If there is a conflict in the NetBIOS name or an alias, perform one of the following:

Delete the duplicate NetBIOS alias by using the "vserver cifs delete -aliases alias -vserver vserver" command.

Rename a NetBIOS alias by deleting the duplicate name and adding an alias with a new name by using the "vserver cifs create -aliases alias -vserver vserver" command.

If there are no aliases configured and there is a conflict in the NetBIOS name, then rename the CIFS server by using the "vserver cifs delete -vserver vserver" and "vserver cifs create -cifs-server netbiosname" commands.
NOTE: Deleting a CIFS server can make data inaccessible.

Remove NetBIOS name or rename the NetBIOS on the remote machine.

CVO NFSv4 Store Pool Exhausted

CRITICAL

A NFSv4 store pool has been exhausted.

If the NFS server is unresponsive for more than 10 minutes after this event, contact NetApp technical support.

CVO Node Panic

WARNING

This event is issued when a panic occurs

Contact NetApp customer support.

CVO Node Root Volume Space Low

CRITICAL

The system has detected that the root volume is dangerously low on space. The node is not fully operational. Data LIFs might have failed over within the cluster, because of which NFS and CIFS access is limited on the node. Administrative capability is limited to local recovery procedures for the node to clear up space on the root volume.

Perform the following corrective actions:

Clear up space on the root volume by deleting old Snapshot copies, deleting files you no longer need from the /mroot directory, or expanding the root volume capacity.

Reboot the controller.

Contact NetApp technical support for more information or assistance.

CVO Nonexistent Admin Share

CRITICAL

Vscan issue: a client has attempted to connect to a nonexistent ONTAP_ADMIN$ share.

Ensure that Vscan is enabled for the mentioned SVM ID. Enabling Vscan on a SVM causes the ONTAP_ADMIN$ share to be created for the SVM automatically.

CVO Object Store Host Unresolvable

CRITICAL

The object store server host name cannot be resolved to an IP address. The object store client cannot communicate with the object-store server without resolving to an IP address. As a result, data might be inaccessible.

Check the DNS configuration to verify that the host name is configured correctly with an IP address.

CVO Object Store Intercluster LIF Down

CRITICAL

The object-store client cannot find an operational LIF to communicate with the object store server. The node will not allow object store client traffic until the intercluster LIF is operational. As a result, data might be inaccessible.

Perform the following corrective actions:

Check the intercluster LIF status by using the "network interface show -role intercluster" command.

Verify that the intercluster LIF is configured correctly and operational.

If an intercluster LIF is not configured, add it by using the "network interface create -role intercluster" command.

CVO Object Store Signature Mismatch

CRITICAL

The request signature sent to the object store server does not match the signature calculated by the client. As a result, data might be inaccessible.

Verify that the secret access key is configured correctly. If it is configured correctly, contact NetApp technical support for assistance.

CVO QoS Monitor Memory Maxed Out

CRITICAL

The QoS subsystem's dynamic memory has reached its limit for the current platform hardware. Some QoS features might operate in a limited capacity.

Delete some active workloads or streams to free up memory. Use the “statistics show -object workload -counter ops” command to determine which workloads are active. Active workloads show non-zero ops. Then use the “workload delete <workload_name>” command multiple times to remove specific workloads. Alternatively, use the “stream delete -workload <workload name> *” command to delete the associated streams from the active workload.

CVO READDIR Timeout

CRITICAL

A READDIR file operation has exceeded the timeout that it is allowed to run in WAFL. This can be because of very large or sparse directories. Corrective action is recommended.

Perform the following corrective actions:

Find information specific to recent directories that have had READDIR file operations expire by using the following 'diag' privilege nodeshell CLI command:
wafl readdir notice show.

Check if directories are indicated as sparse or not:

If a directory is indicated as sparse, it is recommended that you copy the contents of the directory to a new directory to remove the sparseness of the directory file.

If a directory is not indicated as sparse and the directory is large, it is recommended that you reduce the size of the directory file by reducing the number of file entries in the directory.

CVO Relocation of Storage Pool Failed

CRITICAL

This event occurs during the relocation of an aggregate, when the destination node cannot reach the object stores.

Perform the following corrective actions:

Verify that your intercluster LIF is online and functional by using the "network interface show" command.

Check network connectivity to the object store server by using the"'ping" command over the destination node intercluster LIF.

Verify that the configuration of your object store has not changed and that login and connectivity information is still accurate by using the "aggregate object-store config show" command.

Alternatively, you can override the error by using the "override-destination-checks" parameter of the relocation command.

Contact NetApp technical support for more information or assistance.

CVO Shadow Copy Failed

CRITICAL

A Volume Shadow Copy Service (VSS), a Microsoft Server backup and restore service operation, has failed.

Check the following using the information provided in the event message:

Is shadow copy configuration enabled?

Are the appropriate licenses installed?

On which shares is the shadow copy operation performed?

Is the share name correct?

Does the share path exist?

What are the states of the shadow copy set and its shadow copies?

CVO Storage VM Stop Succeeded

INFO

This message occurs when a 'vserver stop' operation succeeds.

Use 'vserver start' command to start the data access on a storage VM.

CVO Too Many CIFS Authentication

WARNING

Many authentication negotiations have occurred simultaneously. There are 256 incomplete new session requests from this client.

Investigate why the client has created 256 or more new connection requests. You might have to contact the vendor of the client or of the application to determine why the error occurred.

CVO Unassigned Disks

INFO

System has unassigned disks - capacity is being wasted and your system may have some misconfiguration or partial configuration change applied.

Perform the following corrective actions:

Determine which disks are unassigned by using the "disk show -n" command.

Assign the disks to a system by using the "disk assign" command.

CVO Unauthorized User Access to Admin Share

WARNING

A client has attempted to connect to the privileged ONTAP_ADMIN$ share even though their logged-in user is not an allowed user.

Perform the following corrective actions:

Ensure that the mentioned username and IP address is configured in one of the active Vscan scanner pools.

Check the scanner pool configuration that is currently active by using the "vserver vscan scanner pool show-active" command.

CVO Virus Detected

WARNING

A Vscan server has reported an error to the storage system. This typically indicates that a virus has been found. However, other errors on the Vscan server can cause this event.

Client access to the file is denied. The Vscan server might, depending on its settings and configuration, clean the file, quarantine it, or delete it.

Check the log of the Vscan server reported in the "syslog" event to see if it was able to successfully clean, quarantine, or delete the infected file. If it was not able to do so, a system administrator might have to manually delete the file.

CVO Volume Offline

INFO

This message indicates that a volume is made offline.

Bring the volume back online.

CVO Volume Restricted

INFO

This event indicates that a flexible volume is made restricted.

Bring the volume back online.

SnapMirror for Business Continuity (SMBC) Mediator Log Monitors

Monitor Name

Severity

Monitor Description

Corrective Action

ONTAP Mediator Added

INFO

This message occurs when ONTAP Mediator is added successfully on a cluster.

None

ONTAP Mediator Not Accessible

CRITICAL

This message occurs when either the ONTAP Mediator is repurposed or the Mediator package is no longer installed on the Mediator server. As a result, SnapMirror failover is not possible.

Remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.

ONTAP Mediator Removed

INFO

This message occurs when ONTAP Mediator is removed successfully from a cluster.

None

ONTAP Mediator Unreachable

WARNING

This message occurs when the ONTAP Mediator is unreachable on a cluster. As a result, SnapMirror failover is not possible.

Check the network connectivity to the ONTAP Mediator by using the "network ping" and "network traceroute" commands. If the issue persists, remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.

SMBC CA Certificate Expired

CRITICAL

This message occurs when the ONTAP Mediator certificate authority (CA) certificate has expired. As a result, all further communication to the ONTAP Mediator will not be possible.

Remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Update a new CA certificate on the ONTAP Mediator server. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.

SMBC CA Certificate Expiring

WARNING

This message occurs when the ONTAP Mediator certificate authority (CA) certificate is due to expire within the next 30 days.

Before this certificate expires, remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Update a new CA certificate on the ONTAP Mediator server. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.

SMBC Client Certificate Expired

CRITICAL

This message occurs when the ONTAP Mediator client certificate has expired. As a result, all further communication to the ONTAP Mediator will not be possible.

Remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.

SMBC Client Certificate Expiring

WARNING

This message occurs when the ONTAP Mediator client certificate is due to expire within the next 30 days.

Before this certificate expires, remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.

SMBC Relationship Out of Sync
Note: UM doesn't have this one

CRITICAL

This message occurs when a SnapMirror for Business Continuity (SMBC) relationship changes status from "in-sync" to "out-of-sync". Due to this RPO=0 data protection will be disrupted.

Check the network connection between the source and destination volumes. Monitor the SMBC relationship status by using the "snapmirror show" command on the destination, and by using the "snapmirror list-destinations" command on the source. Auto-resync will attempt to bring the relationship back to "in-sync" status. If the resync fails, verify that all the nodes in the cluster are in quorum and are healthy.

SMBC Server Certificate Expired

CRITICAL

This message occurs when the ONTAP Mediator server certificate has expired. As a result, all further communication to the ONTAP Mediator will not be possible.

Remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Update a new server certificate on the ONTAP Mediator server. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.

SMBC Server Certificate Expiring

WARNING

This message occurs when the ONTAP Mediator server certificate is due to expire within the next 30 days.

Before this certificate expires, remove the configuration of the current ONTAP Mediator by using the "snapmirror mediator remove" command. Update a new server certificate on the ONTAP Mediator server. Reconfigure access to the ONTAP Mediator by using the "snapmirror mediator add" command.

Additional Power, Heartbeat, and Miscellaneous System Monitors

Monitor Name Severity Monitor Description Corrective Action

Disk Shelf Power Supply Discovered

INFORMATIONAL

This message occurs when a power supply unit is added to the disk shelf.

NONE

Disk Shelves Power Supply Removed

INFORMATIONAL

This message occurs when a power supply unit is removed from the disk shelf.

NONE

MetroCluster Automatic Unplanned Switchover Disabled

CRITICAL

This message occurs when automatic unplanned switchover capability is disabled.

Run the "metrocluster modify -node-name <nodename> -automatic-switchover-onfailure true" command for each node in the cluster to enable automatic switchover.

MetroCluster Storage Bridge Unreachable

CRITICAL

The storage bridge is not reachable over the management network

1) If the bridge is monitored by SNMP, verify that the node management LIF is up by using the "network interface show" command. Verify that the bridge is alive by using the "network ping" command.
2) If the bridge is monitored in-band, check the fabric cabling to the bridge, and then verify that the bridge is powered up.

MetroCluster Bridge Temperature Abnormal - Below Critical

CRITICAL

The sensor on the Fibre Channel bridge is reporting a temperature that is below the critical threshold.

1) Check the operational status of the fans on the storage bridge.
2) Verify that the bridge is operating under recommended temperature conditions.

MetroCluster Bridge Temperature Abnormal - Above Critical

CRITICAL

The sensor on the Fibre Channel bridge is reporting a temperature that is above the critical threshold.

1) Check the operational status of the chassis temperature sensor on the storage bridge using the command "storage bridge show -cooling".
2) Verify that the storage bridge is operating under recommended temperature conditions.

MetroCluster Aggregate Left Behind

WARNING

The aggregate was left behind during switchback.

1) Check the aggregate state by using the command "aggr show".
2) If the aggregate is online, return it to its original owner by using the command "metrocluster switchback".

All Links Between Metrocluster Partners Down

CRITICAL

RDMA interconnect adapters and intercluster LIFs have broken connections to the peered cluster or the peered cluster is down.

1) Ensure that the intercluster LIFs are up and running. Repair the intercluster LIFs if they are down.
2) Verify that the peered cluster is up and running by using the "cluster peer ping" command. See the MetroCluster Disaster Recovery Guide if the peered cluster is down.
3) For fabric MetroCluster, verify that the back-end fabric ISLs are up and running. Repair the back-end fabric ISLs if they are down.
4) For non-fabric MetroCluster configurations, verify that the cabling is correct between the RDMA interconnect adapters. Reconfigure the cabling if the links are down.

MetroCluster Partners Not Reachable Over Peering Network

CRITICAL

The connectivity to the peer cluster is broken.

1) Ensure that the port is connected to the correct network/switch.
2) Ensure that the intercluster LIF is connected with the peered cluster.
3) Ensure that the peered cluster is up and running by using the command "cluster peer ping". Refer to the MetroCluster Disaster Recovery Guide if the peered cluster is down.

MetroCluster Inter Switch All Links Down

CRITICAL

All Inter-Switch Links (ISLs) on the storage switch are down.

1) Repair the back-end fabric ISLs on the storage switch.
2) Ensure that the partner switch is up and its ISLs are operational.
3) Ensure that intermediate equipment, such as xWDM devices, are operational.

MetroCluster Node To Storage Stack SAS Link Down

WARNING

The SAS adapter or its attached cable might be at fault.

1. Verify that the SAS adapter is online and running.
2. Verify that the physical cable connection is secure and operating, and replace the cable if necessary.
3. If the SAS adapter is connected to disk shelves, ensure IOMs and disks are properly seated.

MetroClusterFC Initiator Links Down

CRITICAL

The FC initiator adapter is at fault.

1. Ensure that the FC initiator link has not been tampered with.
2. Verify the operational status of the FC initiator adapter by using the command "system node run -node local -command storage show adapter".

FC-VI Interconnect Link Down

CRITICAL

The physical link on the FC-VI port is offline.

1. Ensure that the FC-VI link has not been tampered with.
2. Verify that the physical status of the FC-VI adapter is "Up" by using the command "metrocluster interconnect adapter show".
3. If the configuration includes fabric switches, ensure that they are properly cabled and configured.

MetroCluster Spare Disks Left Behind

WARNING

The spare disk was left behind during switchback.

If the disk is not failed, return it to its original owner by using the command "metrocluster switchback".

MetroCluster Storage Bridge Port Down

CRITICAL

The port on the storage bridge is offline.

1) Check the operational status of the ports on the storage bridge by using the command "storage bridge show -ports".
2) Verify logical and physical connectivity to the port.

MetroCluster Storage Switch Fans Failed

CRITICAL

The fan on the storage switch failed.

1) Ensure that the fans in the switch are operating correctly by using the command "storage switch show -cooling".
2) Ensure that the fan FRUs are properly inserted and operational.

MetroCluster Storage Switch Unreachable

CRITICAL

The storage switch is not reachable over the management network.

1) Ensure that the node management LIF is up by using the command "network interface show".
2) Ensure that the switch is alive by using the command "network ping".
3) Ensure that the switch is reachable over SNMP by checking its SNMP settings after logging into the switch.

MetroCluster Switch Power Supplies Failed

CRITICAL

A power supply unit on the storage switch is not operational.

1) Check the error details by using the command "storage switch show -error -switch-name <swtich name>".
2) Identify the faulty power supply unit by using the command "storage switch show -power -switch-name <switch name>".
3) Ensure that the power supply unitis properly inserted into the chassis of the storage switch and fully operational.

MetroCluster Switch Temperature Sensors Failed

CRITICAL

The sensor on the Fibre Channel switch failed.

1) Check the operational status of the temperature sensors on the storage switch by using the command "storage switch show -cooling".
2) Verify that the switch is operating under recommended temperature conditions.

MetroCluster Switch Temperature Abnormal

CRITICAL

The temperature sensor on the Fibre Channel switch reported abnormal temperature.

1) Check the operational status of the temperature sensors on the storage switch by using the command "storage switch show -cooling".
2) Verify that the switch is operating under recommended temperature conditions.

Service Processor Heartbeat Missed

INFORMATIONAL

This message occurs when ONTAP does not receive an expected "heartbeat" signal from the Service Processor (SP). Along with this message, log files from SP will be sent out for debugging. ONTAP will reset the SP to attempt to restore communication. The SP will be unavailable for up to two minutes while it reboots.

Contact NetApp technical support.

Service Processor Heartbeat Stopped

WARNING

This message occurs when ONTAP is no longer receiving heartbeats from the Service Processor (SP). Depending on the hardware design, the system may continue to serve data or may determine to shut down to prevent data loss or hardware damage. The system continues to serve data, but because the SP might not be working, the system cannot send notifications of down appliances, boot errors, or Open Firmware (OFW) Power-On Self-Test (POST) errors. If your system is configured to do so, it generates and transmits an AutoSupport (or 'call home') message to NetApp technical support and to the configured destinations. Successful delivery of an AutoSupport message significantly improves problem determination and resolution.

If the system has shut down, attempt a hard power cycle: Pull the controller out from the chassis, push it back in then power on the system. Contact NetApp technical support if the problem persists after the power cycle, or for any other condition that may warrant attention.