Cluster fault codes

If a storage cluster experiences an error or a state that might be of interest to an administrator, it generates a cluster fault. You can use the ListClusterFaults method to retrieve the current list of resolved and unresolved faults on a storage cluster.
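
As an illustration of how the fault list might be retrieved and triaged, the following is a minimal sketch that builds a JSON-RPC request body for ListClusterFaults and groups unresolved faults by severity. It does not contact a cluster. The `faultTypes` parameter and the response field names (`faults`, `code`, `severity`, `resolved`) are assumptions to verify against the API reference for your Element version, and the sample response is invented for the example.

```python
import json


def build_list_cluster_faults_request(fault_types="current", request_id=1):
    """Build a JSON-RPC body for ListClusterFaults (nothing is sent)."""
    # Assumed accepted values; check the API reference for your version.
    if fault_types not in ("current", "resolved", "all"):
        raise ValueError(f"unexpected fault_types: {fault_types!r}")
    return {
        "method": "ListClusterFaults",
        "params": {"faultTypes": fault_types},
        "id": request_id,
    }


def unresolved_by_severity(response):
    """Group the codes of unresolved faults by their severity string."""
    grouped = {}
    for fault in response.get("result", {}).get("faults", []):
        if fault.get("resolved"):
            continue
        grouped.setdefault(fault["severity"], []).append(fault["code"])
    return grouped


# Invented sample data shaped like the assumed response.
sample_response = {
    "id": 1,
    "result": {
        "faults": [
            {"code": "driveAvailable", "severity": "warning", "resolved": False},
            {"code": "nodeOffline", "severity": "error", "resolved": False},
            {"code": "sshEnabled", "severity": "warning", "resolved": True},
        ]
    },
}

print(json.dumps(build_list_cluster_faults_request("all")))
print(unresolved_by_severity(sample_response))
```

The fault codes returned this way can then be looked up in the list that follows.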

The following list gives more information about and possible solutions for NetApp Element storage cluster faults:

authenticationServiceFault
The Authentication Service on one or more cluster nodes is not functioning as expected.
Contact NetApp Support for assistance.
availableVirtualNetworkIPAddressesLow
The number of virtual network addresses in the block of IP addresses is low.
To resolve this fault, add more IP addresses to the block of virtual network addresses.
blockClusterFull
There is not enough free block storage space to support a single node loss. See the GetClusterFullThreshold API method for details on cluster fullness levels. This cluster fault indicates one of the following conditions:
  • stage3Low (Warning): User-defined threshold was crossed. Adjust Cluster Full settings or add more nodes.
  • stage4Critical (Error): There is not enough space to recover from a 1-node failure. Creation of volumes, snapshots, and clones is not allowed.
  • stage5CompletelyConsumed (Critical): No writes or new iSCSI connections are allowed. Current iSCSI connections will be maintained. Writes will fail until more capacity is added to the cluster.
To resolve this fault, purge or delete volumes or add another storage node to the storage cluster.
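
The fullness stages above can be captured in a small lookup table. This is a sketch only: the operation names (`create_volume`, `write`, and so on) are hypothetical labels chosen for the example, and the table records just the restrictions the list states for each stage.

```python
# Only what the list above states is recorded. Restrictions the text does not
# mention for a stage are left out rather than guessed.
FULLNESS_STAGES = {
    "stage3Low": {"severity": "Warning", "blocked": frozenset()},
    "stage4Critical": {
        "severity": "Error",
        "blocked": frozenset({"create_volume", "create_snapshot", "create_clone"}),
    },
    "stage5CompletelyConsumed": {
        "severity": "Critical",
        "blocked": frozenset({"write", "new_iscsi_connection"}),
    },
}


def fault_severity(stage):
    """Return the fault severity for a stage, or None if no fault is listed."""
    entry = FULLNESS_STAGES.get(stage)
    return entry["severity"] if entry else None


def is_blocked(stage, operation):
    """True if the text above says the operation is disallowed at this stage."""
    entry = FULLNESS_STAGES.get(stage)
    return bool(entry) and operation in entry["blocked"]


print(fault_severity("stage4Critical"))
print(is_blocked("stage4Critical", "create_snapshot"))
```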
blocksDegraded
Block data is no longer fully replicated due to a failure.
This fault uses the following severities:
  • Warning: Only two complete copies of the block data are accessible.
  • Error: Only a single complete copy of the block data is accessible.
  • Critical: No complete copies of the block data are accessible.
Note: The warning status can only occur on a Triple Helix system.
To resolve this fault, restore any offline nodes or block services, or contact NetApp Support for assistance.
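
The severity rules for blocksDegraded can be sketched as a function of how many complete copies remain. The sketch assumes a standard system keeps two copies and a Triple Helix system keeps three. The function name and the `configured_copies` parameter are hypothetical.

```python
def blocks_degraded_severity(complete_copies, configured_copies=2):
    """Map the number of accessible complete copies to a fault severity.

    Returns None when every configured copy is accessible (no fault).
    """
    if complete_copies < 0 or configured_copies not in (2, 3):
        raise ValueError("expected a non-negative copy count and 2 or 3 configured copies")
    if complete_copies >= configured_copies:
        return None
    if complete_copies >= 2:
        # Only reachable with three configured copies (Triple Helix).
        return "Warning"
    if complete_copies == 1:
        return "Error"
    return "Critical"


for copies in (3, 2, 1, 0):
    print(copies, blocks_degraded_severity(copies, configured_copies=3))
```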
blockServiceTooFull
A block service is using too much space.
To resolve this fault, purge or delete volumes or add another storage node to the storage cluster.
clockSkewExceedsFaultThreshold
The time skew between the cluster master and the node which is presenting a token exceeds the recommended threshold.
The storage cluster cannot correct the time skew between the nodes automatically. To resolve this fault, use NTP servers that are internal to your network, rather than the installation defaults. If you are already using internal NTP servers, contact NetApp Support for assistance.
clusterCannotSync
Cluster block data is in a degraded state and the auto-heal process to restore full block data redundancy cannot proceed; either too many nodes or block services are offline, or the cluster block services are too full.
To resolve this fault, add more block capacity or contact NetApp Support.
clusterFull
There is no more free storage space in the storage cluster.
To resolve this fault, add more storage.
clusterIOPSAreOverProvisioned
Storage cluster IOPS are over provisioned. The sum of all minimum QoS IOPS is greater than the expected IOPS of the cluster. The system cannot maintain minimum QoS for all volumes simultaneously.
To resolve this fault, lower the minimum QoS IOPS settings for volumes.
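
The over-provisioning condition is simple arithmetic: the minimum QoS IOPS of all volumes are summed and compared with the expected IOPS of the cluster. A minimal sketch with invented volume names and figures:

```python
def min_iops_overcommit(volume_min_iops, expected_cluster_iops):
    """Return by how many IOPS the summed minimums exceed the cluster's
    expected IOPS, or 0 if the cluster is not over-provisioned."""
    total_min = sum(volume_min_iops.values())
    return max(0, total_min - expected_cluster_iops)


# Invented example: 15000 + 20000 + 20000 = 55000 minimum IOPS requested.
volumes = {"vol-a": 15000, "vol-b": 20000, "vol-c": 20000}
print(min_iops_overcommit(volumes, expected_cluster_iops=50000))  # 5000
```

In this example the minimum settings would need to drop by at least 5000 IOPS in total for the condition to no longer hold.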
disableDriveSecurityFailed
Drive security could not be disabled on a drive when the Encryption at Rest feature was turned off. The drive still has drive security enabled.
The reason that drive security could not be disabled is shown in the fault details; you might need to investigate the problem based on the reason. If you need to recover a disk that does not successfully disable security, use the following steps:
  1. Logically remove the drive by moving it to "available" status.
  2. Perform a secure erase on the drive.
  3. Move the drive to "active" status.
If these steps do not resolve the issue, replace the drive.
disconnectedClusterPair
A cluster pair is disconnected or configured incorrectly.
Check the network connectivity of the cluster.
disconnectedRemoteNode
A remote node is disconnected or configured incorrectly.
Check network connectivity between the nodes.
disconnectedSnapMirrorEndpoint
A remote SnapMirror endpoint is disconnected or configured incorrectly.
Check network connectivity between the cluster and the remote SnapMirror endpoint.
driveAvailable
One or more drives are available to be added in the storage cluster. In general, all storage clusters should have all drives added and none in the available state. If this fault appears unexpectedly, contact NetApp Support.
To resolve this fault, add any available drives to the storage cluster.
driveFailed
The cluster returns this fault when one or more drives have failed, indicating one of the following conditions:
  • The drive manager cannot access the drive.
  • The slice or block service has failed too many times, presumably because of drive read or write failures, and cannot restart.
  • The drive is missing.
  • The master service for the node is inaccessible (all drives in the node are considered missing/failed).
  • The drive is locked and the authentication key for the drive cannot be acquired.
  • The drive is locked and the unlock operation fails.
To resolve this issue:
  • Check network connectivity for the node.
  • Replace the drive.
  • Ensure that the authentication key is available.
driveHealthFault
A drive has failed the SMART health check and as a result, the drive’s functions are diminished. There is a Critical severity level for this fault:
  • Drive with serial: <serial number> in slot: <node slot><drive slot> has failed the SMART overall health check.
To resolve this fault, replace the drive.
driveWearFault
A drive's remaining life has dropped below thresholds, but it is still functioning. There are two possible severity levels for this fault, Critical and Warning:
  • Drive with serial: <serial number> in slot: <node slot><drive slot> has critical wear levels.
  • Drive with serial: <serial number> in slot: <node slot><drive slot> has low wear reserves.
To resolve this fault, replace the drive soon.
duplicateClusterMasterCandidates
There is more than one storage cluster master candidate.
Contact NetApp Support for assistance.
enableDriveSecurityFailed
Drive security could not be enabled on a drive when the Encryption at Rest feature was turned on.
Ensure that the correct key is being used to enable security. If you need to recover a disk that does not successfully enable security, use the following steps:
  1. Logically remove the drive by moving it to "available" status.
  2. Perform a secure erase on the drive.
  3. Move the drive to "active" status.
If these steps do not resolve the issue, replace the drive.
ensembleDegraded
One of the ensemble nodes has lost network connectivity or power.
To resolve this fault, restore network connectivity or power to the affected node.
exception
An unusual fault has occurred. These faults are not automatically cleared from the fault queue.
Contact NetApp Support for assistance.
failedSpaceTooFull
A block service is not responding to data write requests. This causes the slice service to run out of space to store failed writes.
Contact NetApp Support for assistance.
fanSensor
A fan sensor has failed or is missing.
Replace any failed hardware in the node. If this does not resolve the issue, contact NetApp Support for assistance.
fibreChannelAccessDegraded
A Fibre Channel node is not responding to other nodes in the storage cluster via its storage IP address.
Check the network connectivity of the cluster.
fibreChannelAccessUnavailable
All Fibre Channel nodes are unresponsive. The node IDs are displayed.
Check the network connectivity of the cluster.
fibreChannelActiveIxL
The IxL Nexus count is approaching the supported limit of 8000 active sessions per Fibre Channel node.
  • Best practice limit is 5500.
  • Warning limit is 7500.
  • Maximum limit (not enforced) is 8192.
To resolve this fault, reduce the IxL Nexus count below the best practice limit of 5500.
fibreChannelConfig
This cluster fault indicates one of the following conditions:
  • There is an unexpected Fibre Channel port on a PCI slot.
  • There is an unexpected Fibre Channel HBA model.
  • There is a problem with the firmware of a Fibre Channel HBA.
  • A Fibre Channel port is not online.
  • There is a persistent issue configuring Fibre Channel passthrough.
Contact NetApp Support for assistance.
fibreChannelIOPS
The total IOPS count is approaching the IOPS limit for Fibre Channel nodes in the cluster. The limits are:
  • FC0025: 450K IOPS limit at 4K block size per Fibre Channel node.
  • FCN001: 625K IOPS limit at 4K block size per Fibre Channel node.
To resolve this fault, balance the load across all available Fibre Channel nodes.
fibreChannelStaticIxL
The IxL Nexus count is approaching the supported limit of 16000 static sessions per Fibre Channel node.
  • Best practice limit is 11000.
  • Warning limit is 15000.
  • Maximum limit (enforced) is 16384.
To resolve this fault, reduce the IxL Nexus count below the best practice limit of 11000.
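
The active and static IxL Nexus limits listed for fibreChannelActiveIxL and fibreChannelStaticIxL can be checked with one small classifier. This is a sketch: the returned labels are hypothetical, and treating a count exactly at a limit as having reached it is an assumption, since the text does not say whether the limits are inclusive.

```python
# Limits per Fibre Channel node, taken from the two fault descriptions.
IXL_LIMITS = {
    "active": {"best_practice": 5500, "warning": 7500, "maximum": 8192},
    "static": {"best_practice": 11000, "warning": 15000, "maximum": 16384},
}


def classify_ixl_count(kind, count):
    """Classify an IxL Nexus count for one Fibre Channel node."""
    limits = IXL_LIMITS[kind]
    if count >= limits["maximum"]:
        return "at or above maximum"
    if count >= limits["warning"]:
        return "at or above warning limit"
    if count > limits["best_practice"]:
        return "above best practice"
    return "within best practice"


print(classify_ixl_count("active", 6000))
print(classify_ixl_count("static", 15200))
```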
fileSystemCapacityLow
There is insufficient space on one of the filesystems.
To resolve this fault, add more capacity to the filesystem.
fipsDrivesMismatch
A non-FIPS drive has been inserted into a FIPS storage node or a FIPS drive has been inserted into a non-FIPS storage node.
Remove or replace the drive or drives in question.
fipsDrivesOutOfCompliance
The system has detected that Encryption at Rest is disabled, or non-FIPS hardware is present in the storage cluster.
Enable Encryption at Rest or remove the non-FIPS hardware from the storage cluster.
fipsSelfTestFailure
The system has detected a failure during the FIPS self test.
Contact NetApp Support for assistance.
hardwareConfigMismatch
This cluster fault indicates one of the following conditions:
  • The configuration does not match the node definition.
  • There is an incorrect drive size for this type of node.
  • A node is using an unsupported drive.
  • There is a drive firmware mismatch.
  • A drive's encryption capable state does not match its parent node.
Contact NetApp Support for assistance.
idPCertificateExpiration
The cluster’s service provider SSL certificate for use with a third-party identity provider is nearing expiration or has already expired. This fault uses the following severities based on urgency:
  • Warning: Certificate expires within 30 days.
  • Error: Certificate expires within 7 days.
  • Critical: Certificate expires within 3 days or has already expired.
To resolve this fault, update the SSL certificate before it expires. Use the UpdateIdpConfiguration method with refreshCertificateExpirationTime=true to provide the updated SSL certificate.
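
The severity thresholds above can be sketched as a function of the time remaining before the certificate expires. The function name is hypothetical, and treating each boundary as inclusive (for example, exactly 7 days remaining counts as Error) is an assumption.

```python
from datetime import datetime, timedelta, timezone


def certificate_severity(not_after, now):
    """Return the fault severity for a certificate expiring at not_after,
    or None if it expires more than 30 days after now."""
    remaining = not_after - now
    if remaining <= timedelta(days=3):
        # Also covers certificates that have already expired.
        return "Critical"
    if remaining <= timedelta(days=7):
        return "Error"
    if remaining <= timedelta(days=30):
        return "Warning"
    return None


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
for days_left in (45, 20, 5, 2, -1):
    print(days_left, certificate_severity(now + timedelta(days=days_left), now))
```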
inconsistentBondModes
The bond modes on the VLAN device are missing. This fault will display the expected bond mode and the bond mode currently in use.
To resolve this fault, modify the bond modes in the per-node web UI.
inconsistentInterfaceConfiguration
The interface configuration is inconsistent.
To resolve this fault, ensure the node interfaces in the storage cluster are consistently configured.
inconsistentMtus
This cluster fault indicates one of the following conditions:
  • Bond1G mismatch: Inconsistent MTUs have been detected on Bond1G interfaces.
  • Bond10G mismatch: Inconsistent MTUs have been detected on Bond10G interfaces.
This fault displays the node or nodes in question along with the associated MTU value.
To resolve this fault, modify the MTU settings in the per-node web UI.
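
When comparing MTU values collected from several nodes, a quick way to spot the nodes that need attention is to flag those that differ from the most common value. This is only a sketch: treating the most common MTU as the intended one is an assumption, and the node names and values are invented.

```python
from collections import Counter


def find_mtu_outliers(node_mtus):
    """Return {node: mtu} for nodes whose MTU differs from the most common
    value on the same bond interface type."""
    if not node_mtus:
        return {}
    most_common_mtu, _ = Counter(node_mtus.values()).most_common(1)[0]
    return {node: mtu for node, mtu in node_mtus.items() if mtu != most_common_mtu}


# Invented Bond10G MTU values for a four-node cluster.
bond10g = {"node1": 9000, "node2": 9000, "node3": 1500, "node4": 9000}
print(find_mtu_outliers(bond10g))
```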
inconsistentRoutingRules
The routing rules for this interface are inconsistent.
inconsistentSubnetMasks
The network mask on the VLAN device does not match the internally recorded network mask for the VLAN. This fault displays the expected network mask and the network mask currently in use.
To resolve this fault, modify the subnet mask in the Element (storage cluster) web UI.
incorrectBondPortCount
The number of bond ports is incorrect.
invalidConfiguredFibreChannelNodeCount
One of the two expected Fibre Channel node connections is degraded. This fault appears when only one Fibre Channel node is connected.
To resolve this fault, check the cluster network connectivity and network cabling, and check for failed services. If there are no network or service problems, contact NetApp Support for a Fibre Channel node replacement.
irqBalanceFailed
An exception occurred while attempting to balance interrupts.
Contact NetApp Support for assistance.
kmipCertificateFault (Root Certification Authority (CA) certificate is nearing expiration)
The root Certification Authority (CA) certificate is nearing expiration. This fault uses the following severities based on urgency:
  • Warning: Certificate expires within 30 days.
  • Error: Certificate expires within 7 days.
  • Critical: Certificate expires within 3 days.
To resolve this fault, update the certificate before it expires. Acquire a new certificate from the root CA with an expiration date at least 30 days in the future. Use the ModifyKeyServerKmip API method to provide the updated root CA certificate.
kmipCertificateFault (Client certificate is nearing expiration)
The client certificate is nearing expiration. This fault uses the following severities based on urgency:
  • Warning: Certificate expires within 30 days.
  • Error: Certificate expires within 7 days.
  • Critical: Certificate expires within 3 days.
To resolve this fault, create a new CSR with the GetClientCertificateSigningRequest method. Have the CSR signed with an expiration greater than 30 days and then use the ModifyKeyServerKmip API method to replace the expiring KMIP client certificate with the new certificate.
kmipCertificateFault (Root Certification Authority (CA) certificate expired)
The root CA certificate has expired.
Acquire a new certificate from the root CA with an expiration date at least 30 days in the future. Use the ModifyKeyServerKmip API method to provide the updated root CA certificate.
kmipCertificateFault (Client certificate expired)
The client certificate has expired.
Create a new CSR using the GetClientCertificateSigningRequest API method and have it signed, making sure that the new expiration date is at least 30 days in the future. Use the ModifyKeyServerKmip API method to replace the expired client certificate with the new certificate.
kmipCertificateFault (Invalid root certification authority (CA) certificate)
The root CA certificate is invalid.
Make sure that the correct certificate was provided. If needed, reacquire the certificate from the root CA. Use the ModifyKeyServerKmip API method to install the correct certificate.
kmipCertificateFault (Invalid client certificate)
The client certificate is invalid.
Make sure that the correct KMIP client certificate is installed. The root CA of the client certificate should be installed on the external key management server. If you need to update the client certificate, use the ModifyKeyServerKmip API method to do so.
kmipServerFault (Connection failure)
One or more of the nodes cannot access the external key management server.
The key server ID is provided in the fault details. Ensure that the server is functional and reachable via the management network. If only some nodes are unable to access the external key management server, the nodes that are unable to reach the key server are listed in the fault details. Perform troubleshooting at the network or specific node level to determine why only some of the nodes can access the external key management server.
kmipServerFault (Authentication failure)
One or more of the nodes cannot authenticate with the external key management server.
Ensure that the correct root CA and KMIP client certificates are in use. If you need to update any of the certificates, use the ModifyKeyServerKmip method to install the correct certificate.
kmipServerFault (Server error)
The external key management server has an error.
The error details are provided in the fault details. You might need to troubleshoot the external key management server based on the error.
memoryEccThreshold
A large number of correctable or uncorrectable ECC errors have been detected. When a severity of type Error is returned, this is likely due to a DIMM failure.
Contact NetApp Support for assistance.
memoryUsageThreshold
Memory usage is above normal. This fault uses the following severities based on urgency:
Note: See the Details heading for more detailed information on the fault.
  • Warning: System memory is low.
  • Error: System memory is very low.
  • Critical: System memory is completely consumed.
To resolve this fault, contact NetApp Support for assistance.
metadataClusterFull
There is not enough free metadata storage space to support a single node loss. See the GetClusterFullThreshold API method for details on cluster fullness levels. This cluster fault indicates one of the following conditions:
  • stage3Low (Warning): User-defined threshold was crossed. Adjust Cluster Full settings or add more nodes.
  • stage4Critical (Error): There is not enough space to recover from a 1-node failure. Creation of volumes, snapshots, and clones is not allowed.
  • stage5CompletelyConsumed (Critical): No writes or new iSCSI connections are allowed. Current iSCSI connections will be maintained. Writes will fail until more capacity is added to the cluster. Purge or delete data or add more nodes.
See Understanding cluster fullness levels for more information.
To resolve this fault, purge or delete volumes or add another storage node to the storage cluster.
mtuCheckFailure
A network device is not configured for the proper MTU size.
To resolve this fault, ensure that all network interfaces and switch ports are configured for jumbo frames (MTUs up to 9000 bytes in size).
networkConfig
This cluster fault indicates one of the following conditions:
  • An expected network interface is not present.
  • A duplicate network interface is present.
  • A network interface is configured but down.
  • A network interface restart is needed.
Contact NetApp Support for assistance.
noAvailableVirtualNetworkIPAddresses
There are no available virtual network addresses in the block of IP addresses.
  • virtualNetworkID # TAG(###) has no available storage IP addresses. Additional nodes cannot be added to the cluster.
To resolve this fault, add more IP addresses to the block of virtual network addresses.
nodeHardwareFault (Network interface <name> is down or cable is unplugged)
A network interface is either down or the cable is unplugged.
To resolve this fault, check network connectivity for the node or nodes.
nodeHardwareFault (Drive encryption capable state mismatches node's encryption capable state for the drive in slot <node slot><drive slot>)
A drive does not match encryption capabilities with the storage node it is installed in.
nodeHardwareFault (Incorrect <drive type> drive size <actual size> for the drive in slot <node slot><drive slot> for this node type - expected <expected size>)
A storage node contains a drive that is the incorrect size for this node.
nodeHardwareFault (Unsupported drive detected in slot <node slot><drive slot>; drive statistics and health information will be unavailable)
A storage node contains a drive it does not support.
nodeHardwareFault (The drive in slot <node slot><drive slot> should be using firmware version <expected version>, but is using unsupported version <actual version>)
A storage node contains a drive running an unsupported firmware version.
nodeMaintenanceMode
A node has been placed in maintenance mode. This fault uses the following severities based on urgency:
  • Warning: Indicates that the node is still in maintenance mode.
  • Error: Indicates that maintenance mode has failed to disable, most likely due to failed or active standbys.
To resolve this fault, disable maintenance mode once maintenance completes. If the Error level fault persists, contact NetApp Support for assistance.
nodeOffline
Element software cannot communicate with the specified node.
To resolve this fault, check network connectivity and network cabling of the cluster. If there are no network problems, contact NetApp Support for a node replacement.
notUsingLACPBondMode
LACP bonding mode is not configured.
To resolve this fault, use LACP bonding when deploying storage nodes; clients might experience performance issues if LACP is not enabled and properly configured.
ntpServerUnreachable
The storage cluster cannot communicate with the specified NTP server or servers.
To resolve this fault, check the NTP server configuration, network, and firewall.
ntpTimeNotInSync
The difference between storage cluster time and the specified NTP server time is too large. The storage cluster cannot correct the difference automatically.
To resolve this fault, use NTP servers that are internal to your network, rather than the installation defaults. If you are using internal NTP servers and the issue persists, contact NetApp Support for assistance.
nvramDeviceStatus
An NVRAM device has an error, is failing, or has failed. This fault has the following severities:
Warning: A warning has been detected by the hardware. This condition may be transitory, such as a temperature warning.
  • nvmLifetimeError
  • nvmLifetimeStatus
  • energySourceLifetimeStatus
  • energySourceTemperatureStatus
  • warningThresholdExceeded
Error: An Error or Critical status has been detected by the hardware. The cluster master attempts to remove the slice drive from operation (this generates a drive removal event). If secondary slice services are not available, the drive will not be removed. Errors returned in addition to the Warning level errors:
  • NVRAM device mount point doesn't exist.
  • NVRAM device partition doesn't exist.
  • NVRAM device partition exists, but is not mounted.
Critical: An Error or Critical status has been detected by the hardware. The cluster master attempts to remove the slice drive from operation (this generates a drive removal event). If secondary slice services are not available, the drive will not be removed.
  • persistenceLost
  • armStatusSaveNArmed
  • csaveStatusError
Replace any failed hardware in the node. If this does not resolve the issue, contact NetApp Support for assistance.
powerSupplyError
This cluster fault indicates one of the following conditions:
  • A power supply is not present.
  • A power supply has failed.
  • A power supply has no input or the input is out of range.
To resolve this fault, verify that redundant power is supplied to all nodes. Contact NetApp Support if the issue persists.
provisionedSpaceTooFull
The overall provisioned capacity of the storage cluster is too full.
To resolve this fault, add more provisioned space, or delete and purge volumes or snapshots.
remoteRepAsyncDelayExceeded
The configured asynchronous delay for replication has been exceeded.
remoteRepClusterFull
The volumes have paused remote replication because the target storage cluster is too full.
To resolve this fault, free up some space on the target storage cluster.
remoteRepSnapshotClusterFull
The volumes have paused remote replication of snapshots because the target storage cluster is too full.
To resolve this fault, free up some space on the target storage cluster.
remoteRepSnapshotsExceededLimit
The volumes have paused remote replication of snapshots because the target storage cluster volume has exceeded its snapshot limit.
To resolve this fault, remove some snapshots on the remote cluster.
scheduleActionError
One or more of the scheduled activities ran, but failed.
The fault clears if the scheduled activity runs again and succeeds, if the scheduled activity is deleted, or if the activity is paused and resumed.
sensorReadingFailed
The Baseboard Management Controller (BMC) self-test failed or a sensor could not communicate with the BMC.
Contact NetApp Support for assistance.
serviceNotRunning
A required service is not running.
Contact NetApp Support for assistance.
sliceServiceTooFull
A slice service has too little provisioned capacity assigned to it.
To resolve this fault, add more storage nodes or contact NetApp Support.
sliceServiceUnhealthy
The system has detected that a slice service is unhealthy and is automatically decommissioning it.
  • Severity = Warning: No action is taken. This warning period will expire in 6 minutes.
  • Severity = Error: The system is automatically decommissioning data and re-replicating its data to other healthy drives.
Check for network connectivity issues and hardware errors. There will be other faults if specific hardware components have failed. The fault will clear when the slice service is accessible or when the service has been decommissioned.
sshEnabled
The SSH service is enabled on one or more nodes in the storage cluster.
To resolve this fault, disable the SSH service on the node or nodes.
sslCertificateExpiration
The SSL certificate associated with this node is nearing expiration or has expired. This fault uses the following severities based on urgency:
  • Warning: Certificate expires within 30 days.
  • Error: Certificate expires within 7 days.
  • Critical: Certificate expires within 3 days or has already expired.
To resolve this fault, renew the SSL certificate. If needed, contact NetApp Support for assistance.
strandedCapacity
A single node accounts for more than half of the storage cluster capacity.
In order to maintain data redundancy, the system reduces the capacity of the largest node so that some of its block capacity is stranded (not used). To resolve this fault, add more drives to existing storage nodes or add storage nodes to the cluster.
tempSensor
A temperature sensor is reporting higher than normal temperatures. This fault can be triggered in conjunction with powerSupplyError or fanSensor faults.
To resolve this fault, check for airflow obstructions near the storage cluster. If needed, contact NetApp Support for assistance.
upgrade
An upgrade has been in progress for more than 24 hours.
To resolve this fault, resume the upgrade or contact NetApp Support for assistance.
unbalancedMixedNodes
A single node accounts for more than one-third of the storage cluster's capacity.
Contact NetApp Support for assistance.
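
Both strandedCapacity (more than half) and unbalancedMixedNodes (more than one-third) depend on the share of cluster capacity held by the largest node. The following sketch computes that share exactly and compares it with the two thresholds. The node names and capacities are invented, and the function does not claim to reproduce how the cluster itself decides which fault to raise.

```python
from fractions import Fraction


def largest_node_share(node_capacities):
    """Return (node, share) for the node holding the largest fraction of
    total capacity. Capacities are integers in any consistent unit."""
    total = sum(node_capacities.values())
    if total <= 0:
        raise ValueError("total capacity must be positive")
    node = max(node_capacities, key=node_capacities.get)
    return node, Fraction(node_capacities[node], total)


def share_thresholds(node_capacities):
    """Report whether the largest node exceeds one-third and one-half."""
    _, share = largest_node_share(node_capacities)
    return {
        "exceeds_one_third": share > Fraction(1, 3),
        "exceeds_half": share > Fraction(1, 2),
    }


# Invented capacities in TB: the large node holds 30 of 50 TB (3/5).
cluster = {"node1": 10, "node2": 10, "node3": 30}
print(largest_node_share(cluster))
print(share_thresholds(cluster))
```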
unresponsiveService
A system service has become unresponsive.
Contact NetApp Support for assistance.
virtualNetworkConfig
This cluster fault indicates one of the following conditions:
  • An interface is not present.
  • There is an incorrect namespace on an interface.
  • There is an incorrect network mask.
  • There is an incorrect IP address.
  • An interface is not up and running.
  • There is a superfluous interface on a node.
Contact NetApp Support for assistance.
volumesDegraded
Secondary volumes have not yet completely replicated and synchronized.
This fault is cleared when the synchronization is complete.
If the fault persists, check for network connectivity issues and hardware errors.
volumesOffline
One or more volumes in the storage cluster are offline.
Contact NetApp Support for assistance.