RAC timeouts

11/19/2024 Contributors

Oracle RAC is a clusterware product with several types of internal heartbeat processes that monitor the health of the cluster.

The information in the misscount section includes critical information for Oracle RAC environments using networked storage, and in many cases the default Oracle RAC settings will need to be changed to ensure the RAC cluster survives network path changes and storage failover/switchover operations.

disktimeout

The primary storage-related RAC parameter is disktimeout. This parameter controls the threshold within which voting file I/O must complete. If the disktimeout parameter is exceeded, then the RAC node is evicted from the cluster. The default for this parameter is 200. This value should be sufficient for standard storage takeover and giveback procedures.

NetApp strongly recommends testing RAC configurations thoroughly before placing them into production because many factors affect a takeover or giveback. In addition to the time required for storage failover to complete, additional time is also required for Link Aggregation Control Protocol (LACP) changes to propagate. Also, SAN multipathing software must detect an I/O timeout and retry on an alternate path. If a database is extremely active, a large amount of I/O must be queued and retried before voting disk I/O is processed.

If an actual storage takeover or giveback cannot be performed, the effect can be simulated with cable pull tests on the database server.

NetApp recommends the following:

Leaving the disktimeout parameter at the default value of 200.
Always test a RAC configuration thoroughly.

misscount

The misscount parameter normally affects only the network heartbeat between RAC nodes. The default is 30 seconds. If the grid binaries are on a storage array or the OS boot drive is not local, this parameter might become important. This includes hosts with boot drives located on an FC SAN, NFS-booted OSs, and boot drives located on virtualization datastores such as a VMDK file.

If access to a boot drive is interrupted by a storage takeover or giveback, it is possible that the grid binary location or the entire OS temporarily hangs. The time required for ONTAP to complete the storage operation and for the OS to change paths and resume I/O might exceed the misscount threshold. As a result, a node immediately evicts after connectivity to the boot LUN or grid binaries is restored. In most cases, the eviction and subsequent reboot occur with no logging messages to indicate the reason for the reboot. Not all configurations are affected, so test any SAN-booting, NFS-booting, or datastore-based host in a RAC environment so that RAC remains stable if communication to the boot drive is interrupted.

In the case of nonlocal boot drives or a nonlocal file system hosting grid binaries, the misscount will need to be changed to match disktimeout. If this parameter is changed, conduct further testing to also identify any effects on RAC behavior, such as node failover time.

NetApp recommends the following:

Leave the misscount parameter at the default value of 30 unless one of the following conditions applies:
- grid binaries are located on a network-attached drive, including NFS, iSCSI, FC, and datastore-based drives.
- The OS is SAN booted.
In such cases, evaluate the effect of network interruptions that affect access to OS or GRID_HOME file systems. In some cases, such interruptions cause the Oracle RAC daemons to stall, which can lead to a misscount-based timeout and eviction. The timeout defaults to 27 seconds, which is the value of misscount minus reboottime. In such cases, increase misscount to 200 to match disktimeout.

RAC timeouts

Creating your file...

disktimeout

misscount