RAC timeouts
Oracle RAC is a clusterware product with several types of internal heartbeat processes that monitor the health of the cluster.
The information in the misscount section includes critical information for Oracle RAC environments using networked storage, and in many cases the default Oracle RAC settings will need to be changed to ensure the RAC cluster survives network path changes and storage failover/switchover operations. |
disktimeout
The primary storage-related RAC parameter is disktimeout
. This parameter controls the threshold within which voting file I/O must complete. If the disktimeout
parameter is exceeded, then the RAC node is evicted from the cluster. The default for this parameter is 200. This value should be sufficient for standard storage takeover and giveback procedures.
NetApp strongly recommends testing RAC configurations thoroughly before placing them into production because many factors affect a takeover or giveback. In addition to the time required for storage failover to complete, additional time is also required for Link Aggregation Control Protocol (LACP) changes to propagate. Also, SAN multipathing software must detect an I/O timeout and retry on an alternate path. If a database is extremely active, a large amount of I/O must be queued and retried before voting disk I/O is processed.
If an actual storage takeover or giveback cannot be performed, the effect can be simulated with cable pull tests on the database server.
NetApp recommends the following:
|
misscount
The misscount
parameter normally affects only the network heartbeat between RAC nodes. The default is 30 seconds. If the grid binaries are on a storage array or the OS boot drive is not local, this parameter might become important. This includes hosts with boot drives located on an FC SAN, NFS-booted OSs, and boot drives located on virtualization datastores such as a VMDK file.
If access to a boot drive is interrupted by a storage takeover or giveback, it is possible that the grid binary location or the entire OS temporarily hangs. The time required for ONTAP to complete the storage operation and for the OS to change paths and resume I/O might exceed the misscount
threshold. As a result, a node immediately evicts after connectivity to the boot LUN or grid binaries is restored. In most cases, the eviction and subsequent reboot occur with no logging messages to indicate the reason for the reboot. Not all configurations are affected, so test any SAN-booting, NFS-booting, or datastore-based host in a RAC environment so that RAC remains stable if communication to the boot drive is interrupted.
In the case of nonlocal boot drives or a nonlocal file system hosting grid
binaries, the misscount
will need to be changed to match disktimeout
. If this parameter is changed, conduct further testing to also identify any effects on RAC behavior, such as node failover time.
NetApp recommends the following:
|