NODE EVICTION OVERVIEW AND TROUBLESHOOTING STEPS
The Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if some critical problem is detected. A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin process. The purpose of this node eviction is to maintain the overall health of the cluster by removing bad members.
Starting in 11.2.0.2 RAC or above (or if you are on Exadata), a node eviction may not actually reboot the machine. This is called a rebootless restart. In this case we restart most of the clusterware stack to see if that fixes the unhealthy node.
Oracle Clusterware evicts a node when any of the following conditions occurs:
- The node is not responding via the network heartbeat
- The node is not pinging the voting disk
- The node is hung or too busy to perform the above two tasks
What is the use of CSS Heartbeat Mechanism in Oracle RAC
The CSS component of Oracle Clusterware maintains two heartbeat mechanisms:
1. The disk heartbeat to the voting device, and
2. The network heartbeat across the interconnect (this establishes and confirms valid node membership in the cluster).
Both of these heartbeat mechanisms have an associated timeout value. The disk heartbeat has an internal I/O timeout interval (DTO, Disk TimeOut), in seconds, within which an I/O to the voting disk must complete. The misscount parameter (MC) is the maximum time, in seconds, that a network heartbeat can be missed. The disk heartbeat I/O timeout interval is directly related to the misscount parameter setting: Disk TimeOut (DTO) = Misscount (MC) - 15 seconds (some versions differ).
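The relation between the two timeouts can be written as a one-line calculation. The following is a minimal sketch only: the function name and the offset parameter are illustrative, and the 15-second offset is the value quoted above, which may not hold on every version.

```python
def disk_timeout_from_misscount(misscount_s, offset_s=15):
    """Return DTO = MC - offset, in seconds.

    offset_s defaults to the 15 seconds quoted in the text; it is an
    assumption here that the same offset applies to your version.
    """
    if misscount_s <= offset_s:
        raise ValueError("misscount must be larger than the offset")
    return misscount_s - offset_s


if __name__ == "__main__":
    for mc in (30, 60):
        print("MC =", mc, "-> DTO =", disk_timeout_from_misscount(mc))
```

With the default misscount of 30 seconds this gives a DTO of 15 seconds under the stated relation.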
1.0 - PROCESS ROLES FOR REBOOTS
OCSSD (aka CSS daemon) - This process is spawned by the cssdagent process. It runs in both vendor clusterware and non-vendor clusterware environments. OCSSD's primary job is internode health monitoring and RDBMS instance endpoint discovery. The health monitoring includes a network heartbeat and a disk heartbeat (to the voting files). OCSSD can also evict a node after escalation of a member kill from a client (such as a database LMON process). This is a multi-threaded process that runs at an elevated priority and runs as the Oracle user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent --> ocssd --> ocssd.bin
CSSDAGENT - This process is spawned by OHASD and is responsible for spawning the OCSSD process, monitoring for node hangs (via oprocd functionality), monitoring the OCSSD process for hangs (via oclsomon functionality), and monitoring vendor clusterware (via vmon functionality). This is a multi-threaded process that runs at an elevated priority and runs as the root user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent
CSSDMONITOR - This process also monitors for node hangs (via oprocd functionality), monitors the OCSSD process for hangs (via oclsomon functionality), and monitors vendor clusterware (via vmon functionality). This is a multi-threaded process that runs at an elevated priority and runs as the root user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdmonitor
2.0 - DETERMINING WHICH PROCESS IS RESPONSIBLE FOR A REBOOT
Important files to review:
Clusterware alert log in
The cssdagent log(s)
The cssdmonitor log(s)
The ocssd log(s)
The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
CHM or OS Watcher data
'opatch lsinventory -detail' output for the GRID home
Messages files, at these locations:
Linux: /var/log/messages
Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
IBM: /bin/errpt -a > messages.out
Document 1513912.1 - TFA Collector - Tool for Enhanced Diagnostic Gathering
11.2 Clusterware evictions should, in most cases, have some kind of meaningful error in the clusterware alert log. This can be used to determine which process is responsible for the reboot. Example message from a clusterware alert log:
[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, component: cssagent, with timestamp: L-2009-05-05-10:03:25.340
[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 27630, network timeout 28500, last heartbeat from CSSD at epoch seconds 1241543005.340, 4294967295 milliseconds ago based on invariant clock value of 93235653
This particular eviction happened when we had hit the network timeout. CSSD exited and the cssdagent took action to evict. The cssdagent knows the information in the error message from local heartbeats made from CSSD.
If no message is in the evicted node's clusterware alert log, check the lastgasp logs on the local node and/or the clusterware alert logs of other nodes.
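When many alert logs have to be scanned, the reboot advisory lines can be picked apart mechanically. The sketch below parses the CRS-8011 and CRS-8013 lines in the format of the example above. The function and field names are hypothetical, and the rule "limit equals network timeout means the network timeout was hit" is an assumption based on the single example in this article, not a documented rule.

```python
import re

# Patterns follow the example alert log lines shown in this article.
COMPONENT = re.compile(r"CRS-8011:.*?component: (\w+)")
LIMITS = re.compile(
    r"CRS-8013:.*?limit (\d+) exceeded; "
    r"disk timeout (\d+), network timeout (\d+)"
)


def parse_reboot_advisory(log_text):
    """Return component and timeout values (ms) from a reboot advisory, or None."""
    limits = LIMITS.search(log_text)
    if limits is None:
        return None
    limit, disk, network = (int(v) for v in limits.groups())
    component = COMPONENT.search(log_text)
    # Assumption: the exceeded limit matches the timeout that was hit.
    if limit == network:
        suspected = "network"
    elif limit == disk:
        suspected = "disk"
    else:
        suspected = "unknown"
    return {
        "component": component.group(1) if component else None,
        "limit_ms": limit,
        "disk_timeout_ms": disk,
        "network_timeout_ms": network,
        "suspected_timeout": suspected,
    }


SAMPLE = (
    "[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, "
    "component: cssagent, with timestamp: L-2009-05-05-10:03:25.340\n"
    "[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after "
    "limit 28500 exceeded; disk timeout 27630, network timeout 28500"
)

if __name__ == "__main__":
    print(parse_reboot_advisory(SAMPLE))
```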
3.0 - TROUBLESHOOTING OCSSD EVICTIONS
If you have encountered an OCSSD eviction, review the common causes in section 3.1 below.
3.1 - COMMON CAUSES OF OCSSD EVICTIONS
Network failure or latency between nodes. It would take 30 consecutive missed checkins (by default - determined by the CSS misscount) to cause a node eviction.
Problems writing to or reading from the CSS voting disk. If the node cannot perform a disk heartbeat to the majority of its voting files, then the node will be evicted.
A member kill escalation. For example, a database LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism. If this times out, it could escalate to a node kill.
An unexpected failure or hang of the OCSSD process. This can be caused by any of the above issues or by something else.
An Oracle bug.
3.2 - FILES TO REVIEW AND GATHER FOR OCSSD EVICTIONS
All files from section 2.0 from all cluster nodes. More data may be required.
Example of an eviction due to loss of voting disk:
CSS log:
2012-03-27 22:05:48.693: [ CSSD][1100548416](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 3 configured voting disks available, need 2
2012-03-27 22:05:48.693: [ CSSD][1100548416]###################################
2012-03-27 22:05:48.693: [ CSSD][1100548416]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
OS messages:
Mar 27 22:03:58 choldbr132p kernel: Error:Mpx:All paths to Symm 000190104720 vol 0c71 are dead.
Mar 27 22:03:58 choldbr132p kernel: Error:Mpx:Symm 000190104720 vol 0c71 is dead.
Mar 27 22:03:58 choldbr132p kernel: Buffer I/O error on device sdbig, logical block 0
...
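The clssnmvDiskCheck line in the CSS log states how many voting disks were reachable and how many were needed. A small sketch of extracting those three numbers follows; the function name and dictionary keys are illustrative, and the pattern is based only on the log line shown above.

```python
import re

# Pattern taken from the CSS log line in the example above.
DISK_CHECK = re.compile(
    r"clssnmvDiskCheck: Aborting, (\d+) of (\d+) configured voting disks "
    r"available, need (\d+)"
)


def parse_disk_check(line):
    """Return available/configured/needed counts from a clssnmvDiskCheck line."""
    match = DISK_CHECK.search(line)
    if match is None:
        return None
    available, configured, needed = (int(v) for v in match.groups())
    return {
        "available": available,
        "configured": configured,
        "needed": needed,
        "enough_disks": available >= needed,
    }


SAMPLE_LINE = (
    "2012-03-27 22:05:48.693: [ CSSD][1100548416](:CSSNM00018:)"
    "clssnmvDiskCheck: Aborting, 0 of 3 configured voting disks available, need 2"
)

if __name__ == "__main__":
    print(parse_disk_check(SAMPLE_LINE))
```

For the sample line, none of the 3 configured disks is available while 2 are needed, which matches the storage errors in the OS messages file.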
4.0 - TROUBLESHOOTING CSSDAGENT OR CSSDMONITOR EVICTIONS
If you have encountered a CSSDAGENT or CSSDMONITOR eviction, review the common causes in section 4.1 below.
4.1 - COMMON CAUSES OF CSSDAGENT OR CSSDMONITOR EVICTIONS
An OS scheduler problem. For example, the OS is getting locked up in a driver or hardware, or there is an excessive load on the machine (at or near 100% CPU utilization), preventing the scheduler from behaving reasonably.
One or more threads within the CSS daemon are hung.
An Oracle bug.
4.2 - FILES TO REVIEW AND GATHER FOR CSSDAGENT OR CSSDMONITOR EVICTIONS
All files from section 2.0 from all cluster nodes. More data may be required.
Cluster health check
1. CSS misscount (Default 30 seconds)
The Cluster Synchronization Services (CSS) misscount is the maximum time, in seconds, that a network heartbeat can be missed before a cluster reconfiguration is started to evict the node.
How to get the CSS Misscount value
[oracrs@node1~]$ crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
[oracrs@node1~]$
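If these values are collected from several nodes, the CRS-4678 line can be parsed rather than read by eye. The sketch below handles the output format shown in this article for misscount, disktimeout and reboottime; the function name is hypothetical and other message formats are not covered.

```python
import re

# Matches the CRS-4678 message as it appears in the transcripts here.
CSS_GET = re.compile(
    r"CRS-4678: Successful get (\w+) (\d+) for Cluster Synchronization Services\."
)


def parse_css_get(output):
    """Return (parameter_name, value_in_seconds) or None if no CRS-4678 line is found."""
    match = CSS_GET.search(output)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    sample = "CRS-4678: Successful get misscount 30 for Cluster Synchronization Services."
    print(parse_css_get(sample))
```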
How to set the CSS Misscount value
As root, shut down CRS on all nodes but one:
# crsctl stop crs
Then, as root on the remaining node, set the value:
# crsctl set css misscount 60
Configuration parameter misscount is now set to 60
2. CSS disktimeout (Default 200 seconds)
The maximum amount of time (in seconds) allowed for a voting file I/O to complete. If this time is exceeded, the voting disk is marked as offline. Note that this is also the amount of time required for initial cluster formation, i.e. when no nodes have previously been up and in a cluster.
How to get the CSS disktimeout value
[oracrs@node1l ~]$ crsctl get css disktimeout
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.
[oracrs@node1l ~]$
How to set the CSS disktimeout value
As root, shut down CRS on all nodes but one:
# crsctl stop crs
Then, as root on the remaining node, set the value:
# crsctl set css disktimeout 300
Configuration parameter disktimeout is now set to 300
3. CSS reboottime (Default 3 seconds)
The amount of time allowed for a node to complete a reboot after the CSS daemon has been evicted.
How to get the CSS reboottime value
[oracrs@node1 ~]$ crsctl get css reboottime
CRS-4678: Successful get reboottime 3 for Cluster Synchronization Services.
[oracrs@node1 ~]$
How to set the CSS reboottime value
As root, shut down CRS on all nodes but one:
# crsctl stop crs
Then, as root on the remaining node, set the value:
# crsctl set css reboottime 10
Configuration parameter reboottime is now set to 10
- Oracle Clusterware uses voting disk files to determine which nodes are members of a cluster.
- You can configure voting disks on Oracle ASM, or you can configure voting disks on shared storage.
- If you do not configure voting disks on Oracle ASM, then for high availability, Oracle recommends that you have a minimum of three voting disks on physically separate storage. This avoids having a single point of failure. If you configure a single voting disk, then you must use external mirroring to provide redundancy.
- The number of voting disks depends on the redundancy type of the disk group. From 11.2.0.x onwards, OCR and voting files can be placed in an ASM disk group.
External redundancy = 1 Voting disk
Normal redundancy = 3 Voting disks
High redundancy = 5 Voting disks
You can have up to 32 voting disks in your cluster in releases before 11.2; from 11.2 onwards the documented limit is 15.
Oracle recommends that you configure multiple voting disks during Oracle Clusterware installation to improve availability. If you choose to put the voting disks into an Oracle ASM disk group, then Oracle ASM ensures the configuration of multiple voting disks if you use a normal or high redundancy disk group.
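The redundancy-to-count mapping above can be kept as a small lookup table. In the sketch below the names are illustrative, and the tolerated-failures figure is derived from the rule stated in this article that a node must reach more than half of the voting disks; it is not quoted from Oracle documentation.

```python
# Voting disk counts per ASM disk group redundancy, as listed above.
VOTING_DISKS_BY_REDUNDANCY = {"external": 1, "normal": 3, "high": 5}


def voting_disks_for(redundancy):
    """Return the number of voting disks for a given redundancy type."""
    return VOTING_DISKS_BY_REDUNDANCY[redundancy.lower()]


def tolerated_failures(voting_disks):
    """Disks that can be lost while more than half remain accessible."""
    return (voting_disks - 1) // 2


if __name__ == "__main__":
    for name in ("external", "normal", "high"):
        count = voting_disks_for(name)
        print(name, count, tolerated_failures(count))
```

Under that rule, external redundancy tolerates no voting disk loss at the clusterware level, normal tolerates one, and high tolerates two.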
To identify the voting disk location :-
[oracle@rac1 ~]$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE b4a7f383bb414f7ebf6aaae7c3873401 (/dev/oracleasm/disks/ASMDISK1) [DATA]
Located 1 voting disk(s).
To back up the voting disk (before 11gR2) :-
dd if=voting_disk_name of=backup_file_name
The following can be used to restore the voting disk from the backup file created:
dd if=backup_file_name of=voting_disk_name
In previous versions of Oracle Clusterware you needed to back up the voting disks with the dd command. Starting with Oracle Clusterware 11g Release 2 you no longer need to back up the voting disks. The voting disks are automatically backed up as a part of the OCR. In fact, Oracle explicitly indicates that you should not use a backup tool like dd to back up or restore voting disks. Doing so can lead to the loss of the voting disk.
What Information is stored in VOTING DISK/FILE?
It contains 2 types of data.
Static data: Information about the nodes in cluster
Dynamic data: Disk heartbeat logging
It contains the important details of the cluster nodes membership like
- Which node is part of the cluster?
- Which node is leaving the cluster?
- Which node is joining the cluster?
Although the voting disk contents are not changed frequently, in releases before 11gR2 you need to back up the voting disk file every time
- you add or remove a node from the cluster, or
- immediately after you configure or upgrade a cluster.
To move the voting disk, create another disk group with external redundancy, named ‘DATA1’.
- From 11gR2, voting files can be stored in an ASM disk group.
- The “add” and “delete” commands are not available when voting files are stored in an ASM disk group; only the “replace” command is available.
- Note: With external redundancy in 11.2, you cannot create more than one voting disk, whether in the same disk group or in a different one.
To identify the status and voting disk location :-
[oracle@rac1 ~]$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE b4a7f383bb414f7ebf6aaae7c3873401 (/dev/oracleasm/disks/ASMDISK1) [DATA]
Located 1 voting disk(s).
Replace a voting disk :-
[oracle@rac1 ~]$ crsctl replace votedisk +DATA1
Successful addition of voting disk 9789b4bf42214f8bbf14fda587ba331a.
Successful deletion of voting disk b4a7f383bb414f7ebf6aaae7c3873401.
Successfully replaced voting disk group with +DATA1.
CRS-4266: Voting file(s) successfully replaced
Check the status and verify voting disk location :-
[oracle@rac1 ~]$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 9789b4bf42214f8bbf14fda587ba331a (/dev/oracleasm/disks/ASMDISK2) [DATA1]
Located 1 voting disk(s).
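To verify a move in a script, the votedisk listing can be turned into records and the disk group checked. This is a rough sketch that assumes the column layout shown in the transcripts above; the function and key names are hypothetical.

```python
import re

# One data row: "<n>. <STATE> <file universal id> (<path>) [<disk group>]"
VOTEDISK_ROW = re.compile(
    r"^\s*(\d+)\.\s+(\S+)\s+([0-9a-fA-F]+)\s+\((.+?)\)\s+\[(\w+)\]\s*$"
)


def parse_votedisk(output):
    """Return a list of voting disk records from 'crsctl query css votedisk' output."""
    rows = []
    for line in output.splitlines():
        match = VOTEDISK_ROW.match(line)
        if match:
            rows.append({
                "number": int(match.group(1)),
                "state": match.group(2),
                "file_universal_id": match.group(3),
                "file_name": match.group(4),
                "disk_group": match.group(5),
            })
    return rows


SAMPLE_OUTPUT = """\
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 9789b4bf42214f8bbf14fda587ba331a (/dev/oracleasm/disks/ASMDISK2) [DATA1]
Located 1 voting disk(s).
"""

if __name__ == "__main__":
    for row in parse_votedisk(SAMPLE_OUTPUT):
        print(row["state"], row["disk_group"], row["file_name"])
```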
Why should we have an odd number of voting disks?
A node must be able to access more than half of the voting disks at any time.
Scenario:
Consider a 2-node cluster with an even number of voting disks, say 2.
- Suppose node 1 can access only voting disk 1.
- Node 2 can access only voting disk 2.
- In this situation there is no common file on which the clusterware can check the heartbeat of both nodes.
- With 3 voting disks, if both nodes can access more than half (i.e., 2 voting disks), there is at least one disk that both nodes can access. The clusterware can use this disk to check the heartbeat of the nodes.
- A node that cannot access more than half of the voting disks is evicted from the cluster by another node that can, to maintain the integrity of the cluster.
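The scenario above can be checked by brute force: enumerate every set of disks a node might see and test whether any two such sets share a disk. The following is an illustrative sketch, not how the clusterware itself decides; the function names are made up for this example.

```python
from itertools import combinations


def majority(n_disks):
    """Smallest number of disks that is more than half of n_disks."""
    return n_disks // 2 + 1


def every_pair_of_views_overlaps(n_disks, disks_per_node):
    """True if any two sets of disks_per_node disks (out of n_disks) share a disk."""
    views = [set(c) for c in combinations(range(n_disks), disks_per_node)]
    return all(a & b for a in views for b in views)


if __name__ == "__main__":
    # 2 disks, each node sees 1: the two nodes may see different disks.
    print(every_pair_of_views_overlaps(2, 1))
    # 3 disks, each node sees a majority (2): a common disk always exists.
    print(every_pair_of_views_overlaps(3, majority(3)))
```

The first call reproduces the 2-disk case where the nodes share no disk; the second shows that with 3 disks and a majority per node, a common disk is guaranteed.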
Recover the corrupted voting disk :-
ASMCMD> lsdsk -G DATA1
Path
/dev/oracleasm/disks/ASMDISK2
As the root user, overwrite the disk with zeros to simulate the corruption:
# dd if=/dev/zero of=/dev/oracleasm/disks/ASMDISK2 bs=4096 count=1000000
The session above will hang.
Check the clusterware status from another session:
**************************************************************
rac1:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************
rac2:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************
After rebooting both nodes, check the clusterware status :-
[oracle@rac1 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
The voting disk cannot be restored to the DATA1 disk group, because the disk in DATA1 has been corrupted.
Stop CRS forcefully on both nodes:
[root@rac1 bin]# ./crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'rac1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac1'
CRS-2673: Attempting to stop 'ora.evmd' on 'rac1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'rac1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac1'
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac1'
CRS-2677: Stop of 'ora.cssdmonitor' on 'rac1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.gipcd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'rac1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'rac1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Start CRS in exclusive mode on any one node:
[root@rac1 bin]# ./crsctl start crs -excl
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.evmd' on 'rac1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac1'
CRS-2676: Start of 'ora.mdnsd' on 'rac1' succeeded
CRS-2676: Start of 'ora.evmd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac1'
CRS-2676: Start of 'ora.gpnpd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2672: Attempting to start 'ora.gipcd' on 'rac1'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac1' succeeded
CRS-2676: Start of 'ora.gipcd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2676: Start of 'ora.diskmon' on 'rac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'rac1'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2676: Start of 'ora.crf' on 'rac1' succeeded
CRS-2676: Start of 'ora.ctssd' on 'rac1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2679: Attempting to clean 'ora.asm' on 'rac1'
CRS-2681: Clean of 'ora.asm' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac1'
CRS-2676: Start of 'ora.asm' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'rac1'
CRS-2676: Start of 'ora.storage' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'rac1'
CRS-2676: Start of 'ora.crsd' on 'rac1' succeeded
After the exclusive startup, check the clusterware status:
[root@rac1 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4692: Cluster Ready Services is online in exclusive mode
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
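A status check like the one above can be summarized automatically. The sketch below splits the output into (code, text) pairs and reports whether every component says it is online. It deliberately looks at the message text rather than the CRS code, which is a simplifying assumption for this example; the function names are hypothetical.

```python
import re

# Finds each "CRS-nnnn: message" item, whether the output is one per line
# or run together on a single line.
CRS_MESSAGE = re.compile(r"(CRS-\d+): (.*?)(?=\s*CRS-\d+:|$)", re.S)


def parse_crs_messages(output):
    """Return a list of (code, text) tuples found in crsctl output."""
    return [(code, text.strip()) for code, text in CRS_MESSAGE.findall(output)]


def all_online(output):
    """True if at least one message exists and every message says 'is online'."""
    messages = parse_crs_messages(output)
    return bool(messages) and all("is online" in text for _, text in messages)


EXCLUSIVE_OK = """\
CRS-4638: Oracle High Availability Services is online
CRS-4692: Cluster Ready Services is online in exclusive mode
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
"""

AFTER_CORRUPTION = """\
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
"""

if __name__ == "__main__":
    print(all_online(EXCLUSIVE_OK), all_online(AFTER_CORRUPTION))
```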
Using ASMCA, recreate the ASM disk group that previously held the voting disk (‘DATA1’):
ASMCMD> lsdg
State    Type    Rebal  Sector  Block  AU       Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N      512     4096   1048576  30718     20165    0                20165           0              N             DATA/
MOUNTED  EXTERN  N      512     4096   1048576  10236     10183    0                10183           0              N             DATA1/
Check the voting disk location :-
[oracle@rac1 ~]$ crsctl query css votedisk
Located 0 voting disk(s).
Replace the voting disk :-
[oracle@rac1 ~]$ crsctl replace votedisk +DATA1
Successful addition of voting disk 5a1ef50fe3354f35bfa7f86a6ccb8990.
Successfully replaced voting disk group with +DATA1.
CRS-4266: Voting file(s) successfully replaced
[oracle@rac1 ~]$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 5a1ef50fe3354f35bfa7f86a6ccb8990 (/dev/oracleasm/disks/ASMDISK2) [DATA1]
Located 1 voting disk(s).
Stop the CRS that is running in exclusive mode:
# crsctl stop crs
Start CRS (Clusterware) on all nodes:
# crsctl start crs
Check the clusterware status of both nodes:
[root@rac1 bin]# ./crsctl check cluster -all
**************************************************************
rac1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
rac2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************