Thursday, September 23, 2010

Node Evictions in RAC Environment

Node evictions happen from time to time in RAC environments on any platform, and troubleshooting them to find the root cause is important for DBAs so that the same problem can be avoided in the future. Two clusterware processes essentially decide whether a node gets evicted, and they initiate the eviction on almost all platforms.
1. OCSSD : This process is primarily responsible for inter-node health monitoring and instance endpoint recovery. It runs as the oracle user and also provides basic cluster locking and group services. It can run with or without vendor clusterware. Abnormal termination of (or killing) this process causes the node to be rebooted by the init.cssd script. If the init.cssd script itself is killed, the ocssd process survives and the node keeps functioning; however, the /etc/inittab entry respawns init.cssd, which then tries to start its own ocssd process. Since an ocssd process is already running, this second startup fails and the second init.cssd script reboots the node.

2. OPROCD : This process detects hangs and scheduling freezes on the machine. On Linux it is not used up to 10.2.0.3, because the same function is performed by the Linux hangcheck-timer kernel module; starting with 10.2.0.4 it is started as part of clusterware startup and runs as root. Killing this process reboots the node. If the machine hangs for too long, this process forces a node reboot to prevent further I/O reaching the disks, so that the remaining nodes can remaster its resources. The executable sets a signal handler and sets its interval timer based on a value in milliseconds. It takes two parameters:

a. Timeout value (-t) : the length of time between executions; by default 1000 ms.
b. Margin (-m) : the acceptable difference between dispatches; by default 500 ms.

When we set diagwait to 13, the margin becomes 13 - 3 (the reboottime, in seconds) = 10 seconds, so the value of -m becomes 10000 ms.
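For reference, below is a minimal sketch of how diagwait is typically set with crsctl; it assumes the commands are run as root with the clusterware stack stopped on all nodes first, so follow Oracle Support's documented procedure for your exact release.

# On every node, stop the clusterware stack first (as root):
crsctl stop crs
# On one node only, set diagwait to 13 seconds:
crsctl set css diagwait 13 -force
# Verify the new value, then restart the stack on all nodes:
crsctl get css diagwait
crsctl start crs
# After restart, oprocd should show -t 1000 -m 10000 in its process listing.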

There are two kinds of heartbeat mechanisms responsible for node reboot and reconfiguration of the remaining clusterware nodes.



a. Network heartbeat : This indicates that the node can participate in cluster activities such as group membership changes. When it is missing for too long, cluster membership is changed by rebooting the node. "Too long" is determined by the CSS misscount parameter, which is 30 seconds on most platforms but can be changed depending on the network configuration of a particular environment. If it does need to be changed, it is advisable to contact Oracle Support and follow their recommendation.
b. Disk heartbeat : This means heartbeats to the voting disk file, which holds the latest information about node membership. Connectivity to a majority of the voting files must be maintained for a node to stay alive. The voting disk file uses kill blocks to notify nodes that they have been evicted; the remaining nodes then go through reconfiguration, and generally the node with the lowest node number becomes the master as per Oracle's algorithm. The default timeout is 200 seconds, controlled by the CSS disktimeout parameter; again, changing this parameter requires Oracle Support's recommendation. When a node can no longer communicate over the private interconnect but the other nodes can still see its heartbeats in the voting file, it is evicted using the voting disk kill block functionality. The current values of both timeouts can be checked as shown below.
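A simple check of the current CSS timeout values, assuming a 10.2-style clusterware home:

# Network heartbeat timeout (seconds); typically 30 on most platforms
$ORACLE_CRS_HOME/bin/crsctl get css misscount
# Voting disk I/O timeout (seconds); typically 200
$ORACLE_CRS_HOME/bin/crsctl get css disktimeout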

Network split resolution : When the network fails and nodes cannot communicate with each other, part of the cluster has to go down to maintain data integrity. The surviving nodes should form an optimal subcluster of the original cluster. Each node writes its own vote to the voting file, and the reconfiguration manager component reads these votes to calculate the optimal subcluster. Nodes that are not to survive are evicted via communication through the network and the disk.
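To see how many voting files are configured, and therefore what a "majority" means for your cluster, the standard query is:

# Lists all configured voting disks; a node must retain access to more than half of them
$ORACLE_CRS_HOME/bin/crsctl query css votedisk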


Causes of reboot by clusterware processes
======================================
We will now briefly discuss the causes of reboot by these processes and, at the end, which files to review and upload to Oracle Support for further diagnosis.
Reboot by OCSSD.
============================
1. Network failure : 30 consecutive missed checkins will reboot a node, where heartbeats are issued once per second. You will see messages in ocssd.log along the lines of "heartbeat fatal, eviction in xx seconds". Two cases apply here (see the grep sketch after this list):
a. If the node eviction time in the messages log file is earlier than the missed checkins, the eviction is likely not due to missed checkins.
b. If the node eviction time in the messages log file is later than the missed checkins, the eviction is likely due to missed checkins.
2. Problems writing to the voting disk file : some kind of hang in accessing the voting disk.
3. High CPU utilization : when the CPU is heavily utilized, the CSS daemon does not get CPU time to ping the voting disk, so it cannot write its own vote to the voting disk file and the node gets rebooted.
4. Disk subsystem unresponsive due to storage issues.
5. Killing the ocssd process.
6. An Oracle bug.
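Below is a rough grep sketch for correlating the two timestamps mentioned in point 1; the exact message text varies between clusterware versions, so treat the patterns as illustrative only.

# CSS side: look for missed checkins / eviction warnings (adjust CRS home and hostname)
grep -iE "checkin|eviction" $ORACLE_CRS_HOME/log/<hostname>/cssd/ocssd.log | tail -20
# OS side: find the reboot marker (Linux example)
grep -iE "restart|reboot" /var/log/messages | tail -20
# If the reboot timestamp is earlier than the missed checkins, the eviction was probably
# not caused by them; if it is later, missed checkins are the likely cause.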

Reboot by OPROCD.
============================
When a problem is detected by oprocd, it will reboot the node for one of the following reasons.
1. OS scheduler algorithm problem.
2. High CPU utilization, due to which oprocd does not get CPU time to perform its hang checks at the OS level.
3. An Oracle bug.
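A quick way to confirm that oprocd is running with the expected timeout and margin, and to locate its log for the reboot window (the log paths are listed again later in this post):

# The -t and -m values should appear in the process arguments (run as root)
ps -ef | grep [o]procd
# Review the most recent oprocd log around the reboot time
ls -ltr /etc/oracle/oprocd/        # /var/opt/oracle/oprocd on Solaris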

Also, to share an experience from one of the client sites: the LMS processes were running at a low scheduling priority, so when CPU utilization was high LMS could not get CPU in time to communicate with the clusterware processes. Node eviction got delayed, and it was observed that oprocd rebooted the node. That reboot should not have happened; LMS running at the lower scheduling priority was the real culprit.
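On Linux, the scheduling class of the LMS processes can be checked as below; a real-time class (RR) with a non-zero rtprio is what you want to see, while a plain time-sharing class (TS) means LMS can be starved when the CPU is saturated.

# Show scheduling class, real-time priority and nice value of the LMS processes
ps -eo pid,class,rtprio,ni,comm | grep -i lms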
Determining which process caused the reboot
==============================================

1. If the log files contain the following kinds of messages, the reboot was most likely done by the ocssd process.
a. "Reboot due to cluster integrity" in the syslog or messages file.
b. Any error prior to the reboot in the ocssd.log file.
c. Missed checkins in the syslog file, with an eviction time prior to the node reboot time.
2. If the log files contain the following kinds of messages, the reboot was most likely done by the oprocd process.
a. A "Resetting" message in the messages log file on Linux.
b. Any error in the oprocd log (in the /etc/oracle/oprocd directory) matching, or just prior to, the timestamp of the reboot.
3. If there are other messages, such as Ethernet problems or other errors, in the messages or syslog file, please check with the sysadmins; a short command checklist follows below. On AIX, the errpt -a output gives a lot of information about the cause of a reboot.
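A short checklist of commands for the checks above, with illustrative grep patterns (the exact message text differs between OS versions):

# Linux: scan the OS log around the reboot for the usual indicators
grep -iE "restart|resetting|reboot|checkin|eth" /var/log/messages | tail -50
# AIX: the error report usually states the reboot reason explicitly
errpt -a | more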
Log files to collect when a node reboots
==============================================
Whenever a node reboot occurs in a clusterware environment, please review the log files below to find the reason for the reboot; these files should also be uploaded to Oracle Support for node eviction diagnosis.
a. CRS log files (for release 10.2.0 and above)
=============================================
1. $ORACLE_CRS_HOME/log/<hostname>/crsd/crsd.log
2. $ORACLE_CRS_HOME/log/<hostname>/cssd/ocssd.log
3. $ORACLE_CRS_HOME/log/<hostname>/evmd/evmd.log
4. $ORACLE_CRS_HOME/log/<hostname>/alert<hostname>.log
5. $ORACLE_CRS_HOME/log/<hostname>/client/cls*.log (not all files, only the latest files matching the timestamp of the node reboot)
6. $ORACLE_CRS_HOME/log/<hostname>/racg/ (check for files and directories matching the timestamp of the reboot and copy them only if found)
7. The latest <hostname>.oprocd.log file from /etc/oracle/oprocd (Linux) or /var/opt/oracle/oprocd (Solaris)

Note: We can use $ORACLE_CRS_HOME/bin/diagcollection.pl to collect the files above, but it does not collect the oprocd, OS, or OS Watcher log files, and it may take a long time to run and consume resources, so it is better to copy the files manually, for example as sketched below.
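A minimal manual-collection sketch follows; it assumes a 10.2-style log directory layout and should be adjusted for your CRS home, hostname and platform. Add the client, racg and oprocd files that match the reboot timestamp before uploading.

#!/bin/sh
# Sketch only: collect the core clusterware logs into one tar file
HOST=`hostname`
DEST=/tmp/crs_logs_`date +%Y%m%d`
mkdir -p $DEST
cp -p $ORACLE_CRS_HOME/log/$HOST/crsd/crsd.log   $DEST
cp -p $ORACLE_CRS_HOME/log/$HOST/cssd/ocssd.log  $DEST
cp -p $ORACLE_CRS_HOME/log/$HOST/evmd/evmd.log   $DEST
cp -p $ORACLE_CRS_HOME/log/$HOST/alert$HOST.log  $DEST
cp -p /var/log/messages $DEST 2>/dev/null        # or /var/adm/messages, errpt output
tar -cf $DEST.tar $DEST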
b. OS log files (these get overwritten, so copy them soon after the reboot)
====================================================
1. /var/log/syslog
2. /var/adm/messages
3. errpt -a > error_<date>.log (AIX only)

c. OS Watcher log files (these get overwritten, so copy them soon after the reboot)
=======================================================
Please check the crontab to see where OS Watcher is installed, go to that directory and then to its archive folder, and collect the files from every subdirectory that match the timestamp of the node reboot (a small collection sketch follows the list below).
1. OS_WATCHER_HOME/archive/oswtop
2. OS_WATCHER_HOME/archive/oswvmstat
3. OS_WATCHER_HOME/archive/oswmpstat
4. OS_WATCHER_HOME/archive/oswnetstat
5. OS_WATCHER_HOME/archive/oswiostat
6. OS_WATCHER_HOME/archive/oswps
7. OS_WATCHER_HOME/archive/oswprvtnet
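A small collection sketch for the OS Watcher archives; file names in the archive typically embed the date and hour (for example <host>_vmstat_10.09.23.1400.dat), so the date pattern and the OS Watcher home below are assumptions to adjust for your installation.

# Sketch only: copy the archive files from around a (hypothetical) 23-Sep-2010 reboot
OSW_HOME=/opt/oswatcher            # take the real path from the crontab entry
mkdir -p /tmp/osw_reboot
cd $OSW_HOME/archive
for d in oswtop oswvmstat oswmpstat oswnetstat oswiostat oswps oswprvtnet; do
    cp -p $d/*10.09.23* /tmp/osw_reboot/ 2>/dev/null
done
tar -cf /tmp/osw_reboot.tar /tmp/osw_reboot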