Thursday, September 23, 2010

Node Evictions in RAC environment

Node evictions happen from time to time in RAC environments on any platform, and troubleshooting them to find the root cause is important for DBAs so that the same problem can be avoided in the future. Two clusterware processes essentially decide on, and initiate, node evictions on almost all platforms.
1. OCSSD : This process is primarily responsible for inter-node health monitoring and instance endpoint recovery. It runs as the oracle user, provides basic cluster locking and group services, and can run with or without vendor clusterware. Abnormal termination or killing of this process causes the node to be rebooted by the init.cssd script. If the init.cssd script itself is killed, the ocssd process survives and the node keeps functioning; however, init.cssd is respawned from an /etc/inittab entry, and the respawned script tries to start its own ocssd process. Since an ocssd process is already running, this second start fails and the second init.cssd script reboots the node.

2. OPROCD : This process detects hangs and scheduling freezes on the machine. On Linux it is not used in releases up to 10.2.0.3, because the same function is performed by the Linux hangcheck-timer kernel module; starting from 10.2.0.4 it is started as part of clusterware startup and runs as root. Killing this process reboots the node. If the machine hangs for too long, oprocd reboots the node so that no further I/O reaches the disks and the remaining nodes can remaster its resources. The executable sets a signal handler and an interval timer based on its millisecond parameters. It takes two parameters (a quick way to inspect both daemons and these settings is sketched after the diagwait note below).

a. Timeout value (-t): the length of time between executions; 1000 milliseconds by default.
b. Margin (-m): the acceptable difference between dispatches; 500 milliseconds by default.

When we set diagwait to 13, the margin becomes 13 - 3 (reboottime seconds) = 10 seconds, so the -m value becomes 10000 milliseconds.
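A quick way to inspect these daemons and the diagwait setting on a running node (a minimal sketch; binary names, inittab entries and the exact oprocd arguments can vary by platform and release):

# init.cssd is respawned from /etc/inittab; killing it (rather than ocssd) leaves the node up
grep cssd /etc/inittab
# the CSS daemon runs as oracle, oprocd runs as root
ps -ef | grep -v grep | egrep 'ocssd|oprocd'
# diagwait is set as root with the clusterware stack down on all nodes, then verified
crsctl set css diagwait 13 -force
crsctl get css diagwait
# after the stack is restarted, the oprocd command line should show -t 1000 -m 10000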

There are two kinds of heartbeat mechanisms that can trigger a node reboot and reconfiguration of the remaining clusterware nodes.



a. Network heartbeat : This indicates that the node can participate in cluster activities such as group membership changes. When it is missing for too long, cluster membership changes as a result of a reboot. How long is "too long" is determined by the CSS misscount parameter, which is 30 seconds on most platforms but can be changed depending on the network configuration of a particular environment. If it ever needs to be changed, it is advisable to contact Oracle Support and follow their recommendations.
b. Disk heartbeat : This is the heartbeat to the voting disk files, which hold the latest information about node membership. Connectivity to a majority of the voting files must be maintained for a node to stay alive. The voting disk uses kill blocks to notify nodes that they have been evicted; the remaining nodes then go through reconfiguration, and generally the node with the lowest node number becomes the master as per Oracle's algorithm. The timeout is governed by the CSS disktimeout parameter, 200 seconds by default; again, changing it requires Oracle Support's recommendation. When a node can no longer communicate over the private interconnect but the other nodes can still see its heartbeats in the voting file, it is evicted using the voting disk kill block functionality. (The current values of both timeouts can be checked with the crsctl commands shown just below.)
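The current timeout values and the voting disk locations can be checked from any node with crsctl (a minimal sketch; values are reported in seconds):

crsctl get css misscount      # network heartbeat timeout
crsctl get css disktimeout    # disk heartbeat timeout
crsctl query css votedisk     # configured voting disks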

Network split resolution : When the network fails and nodes cannot communicate with each other, at least one node has to go down to maintain data integrity. The surviving nodes should form an optimal subcluster of the original cluster. Each node writes its own vote to the voting file, and the reconfiguration manager component reads these votes to calculate the optimal subcluster. Nodes that are not to survive are evicted via communication through the network and the disk.


Causes of reboot by clusterware processes
======================================
Now we will briefly discuss the causes of reboots by these processes and, at the end, which files to review and upload to Oracle Support for further diagnosis.
Reboot by OCSSD.
============================
1. Network failure : 30 consecutive missed check-ins will reboot a node (heartbeats are issued once per second). You will see messages in ocssd.log such as "heartbeat fatal, eviction in xx seconds". Two cases need to be distinguished here:
a. If the node eviction time in the messages log file is less than the missed check-in time, the node eviction is likely not due to missed check-ins.
b. If the node eviction time in the messages log file is greater than the missed check-in time, the node eviction is likely due to missed check-ins.
2. Problems writing to voting disk file : some kind of hang in accessing voting disk.
3. High CPU utilization : When the CPU is saturated, the CSS daemon may not get onto the CPU in time to ping the voting disk; as a result it cannot write its vote to the voting disk file and the node gets rebooted.
4. Disk subsystem is unresponsive due to storage issues.
5. Killing ocssd process.
6. An oracle bug.

Reboot by OPROCD.
============================
When a problem is detected by oprocd, it reboots the node for one of the following reasons.
1. OS scheduler algorithm problem.
2. High CPU utilization, due to which oprocd does not get CPU time to perform its hang checks at the OS level.
3. An oracle bug.

Also, just to share an experience: at one client site the LMS processes were running at a low scheduling priority. When CPU utilization was high, LMS could not get onto the CPU in time and could not communicate through the clusterware processes, so the node eviction was delayed and it was observed that oprocd rebooted the node. This should not have happened; the root cause was that the LMS processes had been left running at the lower scheduling priority. The scheduling class of the LMS processes can be checked as shown below.
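On Linux, the scheduling class and priority of the LMS processes can be checked as below (a minimal sketch; ps options and column names differ on other platforms such as AIX or Solaris):

# cls = scheduling class (RR/FF = real time, TS = time sharing), rtprio = real-time priority
ps -eo pid,cls,rtprio,pri,comm | grep -v grep | grep lms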
Determining which process caused the reboot
==============================================

1. The reboot was likely initiated by the ocssd process if you see messages like the following:
a. "Reboot due to cluster integrity" in the syslog or messages file.
b. Any error just prior to the reboot in the ocssd.log file.
c. Missed check-ins in the syslog file, with the eviction time prior to the node reboot time.
2. The reboot was likely initiated by the oprocd process if you see:
a. A "Resetting" message in the messages log file on Linux.
b. Any error in the oprocd log (under the /etc/oracle/oprocd directory) with a timestamp matching, or just prior to, the reboot.
3. If there are other messages, such as Ethernet errors, in the messages or syslog file, then please check with the sysadmins. On AIX, the errpt -a output gives a lot of information about the cause of a reboot.
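A generic way to scan these files around the reboot time for the keywords mentioned above (a sketch only; the exact message text varies by version and platform):

NODE=$(hostname)
# CSS daemon log: eviction / heartbeat / missed check-in warnings before the reboot
grep -iE 'evict|fatal|missed' $ORACLE_CRS_HOME/log/$NODE/cssd/ocssd.log | tail -50
# OS log on Linux: reboot, reset and network related messages
grep -iE 'restart|reset|eth|bond' /var/log/messages | tail -50
# AIX: full error report around the reboot time
errpt -a | more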
Log files to collect when a node reboots
==============================================
Whenever a node reboot occurs in a clusterware environment, please review the log files below to find the reason for the reboot; these files also need to be uploaded to Oracle Support for node eviction diagnosis.
a. CRS log files (for release 10.2.0 and above)
=============================================
1. $ORACLE_CRS_HOME/log/<hostname>/crsd/crsd.log
2. $ORACLE_CRS_HOME/log/<hostname>/cssd/ocssd.log
3. $ORACLE_CRS_HOME/log/<hostname>/evmd/evmd.log
4. $ORACLE_CRS_HOME/log/<hostname>/alert<hostname>.log
5. $ORACLE_CRS_HOME/log/<hostname>/client/cls*.log (not all files, only the latest files matching the timestamp of the node reboot)
6. $ORACLE_CRS_HOME/log/<hostname>/racg/ (check for files and directories matching the timestamp of the reboot and copy them only if found)
7. The latest .oprocd.log file from /etc/oracle/oprocd (or /var/opt/oracle/oprocd on Solaris)

Note: We can use $ORACLE_CRS_HOME/bin/diagcollection.pl to collect the files above, but it does not collect the oprocd log files, OS log files or OS Watcher log files, it can take a long time to run, and it consumes resources, so it is better to copy the files manually, for example as sketched below.
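A minimal sketch of a manual copy of the files listed above into a single archive (assuming $ORACLE_CRS_HOME is set; adjust the oprocd and OS log paths for your platform):

NODE=$(hostname)
DEST=/tmp/crs_diag_$NODE
mkdir -p $DEST
cd $ORACLE_CRS_HOME/log/$NODE
cp crsd/crsd.log cssd/ocssd.log evmd/evmd.log alert*.log $DEST/
cp /etc/oracle/oprocd/*.oprocd.log $DEST/ 2>/dev/null      # /var/opt/oracle/oprocd on Solaris
cp /var/log/messages /var/log/syslog $DEST/ 2>/dev/null    # OS logs, Linux paths
tar -cf $DEST.tar -C /tmp crs_diag_$NODE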
b. OS log files (This will get overwritten so we need to copy soon)
====================================================
1. /var/log/syslog
2. /var/adm/messages
3. errpt -a > error_.log (AIX only)

c. OS Watcher log files (This will get overwritten so we need to copy soon)
=======================================================
Check the crontab to find where OS Watcher is installed, go to that directory and then to the archive folder, and collect the files from each of the directories below that match the timestamp of the node reboot (see the sketch after this list).
1. OS_WATCHER_HOME/archive/oswtop
2. OS_WATCHER_HOME/archive/oswvmstat
3. OS_WATCHER_HOME/archive/oswmpstat
4. OS_WATCHER_HOME/archive/oswnetstat
5. OS_WATCHER_HOME/archive/oswiostat
6. OS_WATCHER_HOME/archive/oswps
7. OS_WATCHER_HOME/archive/oswprvtnet
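A sketch of pulling only the OS Watcher archive files that cover the reboot window (this assumes the standard OS Watcher archive file naming, where the date is embedded in each file name; adjust OSW_HOME to the directory found in the crontab):

OSW_HOME=/opt/oswatcher        # assumption: take the real path from the crontab entry
REBOOT_DATE=10.09.23           # yy.mm.dd of the node reboot
mkdir -p /tmp/osw_$REBOOT_DATE
find $OSW_HOME/archive -type f -name "*${REBOOT_DATE}*" -exec cp {} /tmp/osw_$REBOOT_DATE/ \;
tar -cf /tmp/osw_$REBOOT_DATE.tar -C /tmp osw_$REBOOT_DATE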

Monday, March 15, 2010

RAC Services, Load Balancing, TAF and FAN


RAC Services
Oracle 10g RAC provides a feature called Services, which can be used to configure different services for different applications or departments, such as finance, marketing or HR. We can create services and allocate resources as required for a particular application or department.
For example, if we create a service called HR, we can define the nodes on which it should run, any load balancing options, and, if the node (or all nodes) on which the service initially runs fails, the node to which the service will fail over.
You can check the service status using crs_stat -t, which shows all application and database resources:
ora.myracdb.LBS1.myracdb1.srv ONLINE OFFLINE
ora.myracdb.LBS1.myracdb3.srv ONLINE ONLINE on myrac1
ora.myracdb.LBS1.cs ONLINE ONLINE on myrac1
ora.myracdb.taf_precon.myracdb1.srv ONLINE ONLINE on myrac1
ora.myracdb.taf_precon.cs ONLINE OFFLINE
ora.myracdb.myracdb1.inst ONLINE OFFLINE
ora.myracdb.myracdb2.inst OFFLINE OFFLINE
ora.myracdb.myracdb3.inst ONLINE ONLINE on myrac1
ora.myracdb.myractaf.myracdb3.srv ONLINE ONLINE on myrac1
ora.myracdb.myractaf.cs ONLINE ONLINE on myrac1
ora.myracdb.myractaf_BKP.myracdb3.srv ONLINE ONLINE on myrac1
ora.myracdb.myractaf_BKP.cs ONLINE ONLINE on myrac1
ora.myracdb.db ONLINE ONLINE on myrac1
ora.myracdb.hr.myracdb1.srv ONLINE ONLINE on myrac1
ora.myracdb.hr.cs ONLINE OFFLINE
ora.myracdb.hr_precon.myracdb1.srv ONLINE ONLINE on myrac1
ora.myracdb.hr_precon.cs ONLINE OFFLINE
ora.myracdb.hr_basic.myracdb3.srv ONLINE ONLINE on myrac1
ora.myracdb.hr_basic.cs ONLINE ONLINE on myrac1
ora.myracdb1.ASM3.asm ONLINE ONLINE on myrac1
ora.myracdb1.gsd ONLINE ONLINE on myrac1
ora.myracdb1.ons ONLINE ONLINE on myrac1
ora.myracdb1.vip ONLINE ONLINE on myrac1
ora.myracdb2.ASM1.asm ONLINE OFFLINE
ora.myracdb2.LISTENER_myracdb2.lsnr ONLINE OFFLINE
ora.myracdb2.gsd ONLINE OFFLINE
ora.myracdb2.ons ONLINE OFFLINE
ora.myracdb2.vip ONLINE OFFLINE
ora.myracdb3.ASM2.asm ONLINE OFFLINE
ora.myracdb3.LISTENER_myracdb3.lsnr OFFLINE OFFLINE
ora.myracdb3.gsd ONLINE OFFLINE
ora.myracdb3.ons ONLINE OFFLINE
ora.myracdb3.vip ONLINE ONLINE on myrac1

As we can see from the above output, we have a database called myracdb with instances myracdb1, myracdb2 and myracdb3 defined across the nodes myrac1, myrac2 and myrac3. Whenever we create a database, default resources are created for the database and its instances, along with resources for the listener, ONS, GSD, VIP and ASM (if we are using ASM) on each node.
Now, if we want to create a non-default service called finance, we can use the command below:
srvctl add service -d MYRACDB -s finance -r myracdb1
-r specifies preferred instances.
Using the command below, we can see the service details:
srvctl config service -d MYRACDB -s finance -S 9
#@=info: operation={config} config={full} ver={10.2.0.0.0}
finance PREF: MYRACDB1 AVAIL:
#@=service[0]: name={finance} enabled={true} pref={MYRACDB1} avail={} disabled_insts={} tafpolicy={NONE} type={user}
#@=endconfig:
To remove the service, we can use the command below:
srvctl remove service -d MYRACDB -s finance

When we specify multiple preferred instances, server-side load balancing is enabled, so when users connect to the database using this service, the connections are load balanced by the listeners.
srvctl add service -d MYRACDB -s finance -r myracdb1,myracdb2
srvctl config service -d MYRACDB -s finance -S 9
#@=info: operation={config} config={full} ver={10.2.0.0.0}
finance PREF: MYRACDB1 MYRACDB2 AVAIL:
#@=service[0]: name={finance} enabled={true} pref={MYRACDB1, MYRACDB2} avail={} disabled_insts={} tafpolicy={NONE} type={user}
#@=endconfig:
We can also specify available instances in case both preferred instances are down.
srvctl add service -d MYRACDB -s finance -r myracdb1,myracdb2 -a myracdb3
-a specifies available instances (We can specify multiple available instances)
srvctl config service -d MYRACDB -s finance -S 9
#@=info: operation={config} config={full} ver={10.2.0.0.0}
finance PREF: MYRACDB1 MYRACDB2 AVAIL: MYRACDB3
#@=service[0]: name={finance} enabled={true} pref={MYRACDB1, MYRACDB2} avail={MYRACDB3} disabled_insts={} tafpolicy={NONE} type={user}
#@=endconfig:
Please be careful when specifying the same instance in both the preferred and available lists. See the example below.
srvctl modify service -d MYRACDB -s finance -n -i "myracdb1,myracdb2" -a "myracdb1,myracdb3" -f
-f disconnects currently connected sessions.
srvctl config service -d MYRACDB -s finance -S 9
#@=info: operation={config} config={full} ver={10.2.0.0.0}
finance PREF: MYRACDB2 AVAIL: MYRACDB1 MYRACDB3
#@=service[0]: name={finance} enabled={true} pref={MYRACDB2} avail={MYRACDB1, MYRACDB3} disabled_insts={} tafpolicy={NONE} type={user}
#@=endconfig:
So the available-instance parameter has overridden the preferred-instance value, because an instance can be in either the preferred list or the available list, but not in both.
Please see another example as below:
srvctl modify service -d MYRACDB -s finance -n -i myracdb1,myracdb2 -f
PRKP-1048 : Failed to change configuration for service finance.
CRS-0211: Resource 'ora.MYRACDB.finance.MYRACDB1.srv' has already been registered.
So we unregistered the resource first:
crs_unregister ora.MYRACDB.finance.MYRACDB1.srv
srvctl modify service -d MYRACDB -s finance -n -i myracdb1,myracdb2 -f
srvctl config service -d MYRACDB -s finance -S 9
#@=info: operation={config} config={full} ver={10.2.0.0.0}
finance PREF: MYRACDB1 MYRACDB2 AVAIL:
#@=service[0]: name={finance} enabled={true} pref={MYRACDB1, MYRACDB2} avail={} disabled_insts={} tafpolicy={NONE} type={user}
#@=endconfig:
As we can see, MYRACDB1 has been removed from the available list and moved to the preferred list, and no instance is listed as available because we skipped the -a option of modify service.
So the bottom line is that we cannot specify the same instance name in both the preferred and available lists.
If you need any other help on modifying services, you can use the built-in help:
srvctl modify service -h
Usage: srvctl modify service -d <db_unique_name> -s <service_name> -i <old_inst_name> -t <new_inst_name> [-f]
-d Unique name for the database
-s Service name
-i Old instance name
-t New instance name
-f Disconnect all sessions during stop or relocate service operations
Usage: srvctl modify service -d <db_unique_name> -s <service_name> -i <inst_name> -r [-f]
-d Unique name for the database
-s Service name
-i Instance name
-r Upgrade instance to preferred
-f Disconnect all sessions during stop or relocate service operations
Usage: srvctl modify service -d <db_unique_name> -s <service_name> -n -i "<preferred_list>" [-a "<available_list>"] [-f]
-d Unique name for the database
-s Service name
-n Modify service configuration
-r "" List of preferred instances
-a "" List of available instances
-f Disconnect all sessions during stop or relocate service operations
Usage: srvctl modify service -d <db_unique_name> -s <service_name> -P <TAF_policy> [-f]
-d Unique name for the database
-s Service name
-P TAF policy (NONE, BASIC, or PRECONNECT)
-f Disconnect all sessions during stop or relocate service operations
-h Print usage

To start and stop the service, we can use the commands below:
srvctl start service -d MYRACDB -s finance
srvctl stop service -d MYRACDB -s finance
By default, a service is OFFLINE after it is created, so we need to start it explicitly. It starts on its preferred instances; if none of the preferred instances are available, it starts on the available instances.
Assume that myracdb2 is down and we issue the commands below.
srvctl modify service -d MYRACDB -s finance -n -i myracdb2 -a myracdb1,myracdb3 -f
srvctl config service -d MYRACDB -s finance -S 9
#@=info: operation={config} config={full} ver={10.2.0.0.0}
finance PREF: MYRACDB2 AVAIL: MYRACDB1 MYRACDB3
#@=service[0]: name={finance} enabled={true} pref={MYRACDB2} avail={MYRACDB1, MYRACDB3} disabled_insts={} tafpolicy={NONE} type={user}
#@=endconfig:
srvctl start service -d MYRACDB -s finance
ora.MYRACDB.finance.MYRACDB2.srv ONLINE ONLINE on myrac1
ora.MYRACDB.finance.cs ONLINE ONLINE on myrac1
As we can see, since the preferred instance MYRACDB2 is down, the service has started on the available instance MYRACDB1 on the myrac1 server.

Load Balancing

When we specify multiple preferred instances (server-side load balancing) and also enable client-side load balancing, listener-level load balancing happens automatically. For server-side load balancing, the local_listener and remote_listener parameters are set in the spfile as shown below.
Both parameters point to TNSNAMES.ora entries:
local_listener=LISTENER_MYRAC1
remote_listener = LISTENERS_MYRACDB
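A sketch of setting these parameters from SQL*Plus (the instance-to-node mapping below is an assumption; adjust it to your environment, and make sure the aliases exist in every node's tnsnames.ora):

sqlplus -s "/ as sysdba" <<EOF
-- each instance points to the listener alias of its own node
alter system set local_listener='LISTENER_MYRAC1' scope=both sid='myracdb1';
alter system set local_listener='LISTENER_MYRAC2' scope=both sid='myracdb2';
alter system set local_listener='LISTENER_MYRAC3' scope=both sid='myracdb3';
-- all instances share the same remote_listener alias
alter system set remote_listener='LISTENERS_MYRACDB' scope=both sid='*';
EOF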
TNSNAMES.ORA entries: the entries below should be present in the tnsnames.ora file on all nodes.
LISTENERS_MYRACDB =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = myrac1-vip)(PORT = 2042))
(ADDRESS = (PROTOCOL = TCP)(HOST = myrac3-vip)(PORT = 2042))
(ADDRESS = (PROTOCOL = TCP)(HOST = myrac2-vip)(PORT = 2042))
)
LISTENER_MYRAC3 =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = myrac3-vip)(PORT = 2042))
)
LISTENER_MYRAC2 =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = myrac2-vip)(PORT = 2042))
)
LISTENER_MYRAC1 =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = myrac1-vip)(PORT = 2042))
)
Client side load balancing
finance =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = myrac2-vip)(PORT = 2042))
(ADDRESS = (PROTOCOL = TCP)(HOST = myrac1-vip)(PORT = 2042))
(ADDRESS = (PROTOCOL = TCP)(HOST = myrac3-vip)(PORT = 2042))
(LOAD_BALANCE = yes)
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = FINANCE)
(FAILOVER_MODE =
(TYPE = SELECT)
(METHOD = BASIC)
(RETRIES = 180)
(DELAY = 5)
)
)
)
Server Side load balancing
In a dedicated server configuration, a listener selects an instance in the following order:
1. Least loaded node
2. Least loaded instance

If a database service has multiple instances on multiple nodes, the listener chooses the least loaded instance on the least loaded node. If shared server is configured, then the least loaded dispatcher of the selected instance is chosen.
This selection is based on node load, so the distribution of sessions may not be equal in terms of session count. To verify it, we can set event 10257 at level 16, restart the instance, and check the PMON trace file, as sketched below.
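A sketch of enabling the event (the EVENT parameter can only be changed in the spfile, so an instance restart is needed, as mentioned above):

sqlplus -s "/ as sysdba" <<EOF
-- record listener load update (goodness/delta) information in the PMON trace file
alter system set event='10257 trace name context forever, level 16' scope=spfile sid='*';
EOF
# restart the instance, then review the pmon trace file under background_dump_dest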
The pmon trace file output could be like below.
kmmgdnu: MYRACDB goodness 27
kmmgdnu: MYRACDB
goodness=27, delta=1,
flags=0x4:unblocked/not overloaded, update=0x6:G/D/-
kmmlrl: 65 processes
kmmlrl: node load 687
kmmlrl: instance load 35
kmmlrl: nsgr update returned 0
flags=0x4:unblocked/not overloaded, update=0x6:G/D/-
kmmgdnu: MYRACDBXDB
goodness=0, delta=1,
flags=0x5:unblocked/not overloaded, update=0x6:G/D/-
kmmgdnu: MYRACDB
goodness=27, delta=1,
flags=0x4:unblocked/not overloaded, update=0x6:G/D/-
kmmlrl: 66 processes
kmmlrl: node load 775
kmmlrl: nsgr update returned 0
kmmgdnu: MYRACDB
goodness=28, delta=1,
flags=0x4:unblocked/not overloaded, update=0x6:G/D/-
kmmlrl: 70 processes

kmmlrl: node load 847
kmmlrl: nsgr update returned 0
*** 2009-08-18 03:47:26.891
err=-319 tbtime=0 etime=319 srvs=1 nreqs=0 sreq
'Node load' gives an indication of how busy the machine is. By default, 'node load' is the primary factor for determining where new connections will go.
Load Balance Advisory
We can configure the load balancing advisory according to our requirements. There are two kinds of goals that can be configured for a service.

Service Time: Attempts to direct work requests to instances according to response time. Load balancing advisory data is based on elapsed time for work done in the service plus available bandwidth to the service.

Throughput: Attempts to direct work requests according to throughput. The load balancing advisory is based on the rate that work is completed in the service plus available bandwidth to the service.
In short, the load balancing advisory does the following:
• Calculates goodness locally, forwards to master mmon
• Master mmon builds advisory for distribution of work
• Records advice to SYS$SERVICE_METRICS
• Posts load balancing advice via FAN event to AQ, PMON, ONS
Connection Load Balancing
We can also configure connection load balancing by specifying the clb_goal attribute for a specific service (see the sketch after the two descriptions below).
Long: Use the LONG connection load balancing method for applications that have long-lived connections. This is typical for connection pools and SQL*Forms sessions. LONG is the default connection load balancing goal.

Short: Use the SHORT connection load balancing method for applications that have short-lived connections.
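A sketch of setting both goals for the finance service with DBMS_SERVICE (10.2; GOAL_THROUGHPUT and CLB_GOAL_LONG are the other possible constants):

sqlplus -s "/ as sysdba" <<EOF
begin
  dbms_service.modify_service(
    service_name => 'finance',
    goal         => dbms_service.goal_service_time,  -- load balancing advisory goal
    clb_goal     => dbms_service.clb_goal_short);    -- connection load balancing goal
end;
/
EOF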

TAF configuration
The TAF feature provides failover capability for Oracle database connections in a multi-node RAC configuration. Depending on requirements, we configure a TAF method for a given service. There are two methods for server-side TAF configuration.
TAF methods
1. Basic: It’s default method when we create a service. Establishes connections at failover time. This option requires almost no work on the backup database server until failover time
2. Pre-establishes connections. This provides faster failover but requires that the backup instance be able to support all connections from every supported instance. It creates the backup connection in alternate instance (available instance of a service).

Server side preconnect TAF is not supported using dbms_service.
SQL> exec dbms_service.modify_service(service_name=>'MYRACTAF_BKP',aq_ha_notifications => TRUE ,failover_method=>DBMS_SERVICE.FAILOVER_METHOD_PRECONNECT);
BEGIN dbms_service.modify_service(service_name=>'MYRACTAF_BKP',aq_ha_notifications => TRUE ,failover_method=>DBMS_SERVICE.FAILOVER_METHOD_PRECONNECT); END;

*
ERROR at line 1:
ORA-06550: line 1, column 124:
PLS-00302: component 'FAILOVER_METHOD_PRECONNECT' must be declared
ORA-06550: line 1, column 7:
PL/SQL: Statement ignored

However it’ll allow to specify preconnect if we are creating a service through srvctl add

service -d MYRACDB -s finance -r MYRACDB3 -a MYRACDB1 -P preconnect
srvctl config service -d MYRACDB -s finance -S 9
#@=info: operation={config} config={full} ver={10.2.0.0.0}
finance PREF: MYRACDB3 AVAIL: MYRACDB1
#@=service[0]: name={finance} enabled={true} pref={MYRACDB3} avail={MYRACDB1} disabled_insts={} tafpolicy={preconnect} type={user}
#@=endconfig:

So this seems to be a bug in 10.2.0.4, and Oracle Metalink notes also say that server-side preconnect is not supported. So if we want to take advantage of the preconnect method, we have to specify an available instance different from the preferred instance and enable client-side TAF with the preconnect method, as below.

TAF_Precon =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = myrac2-vip)(PORT = 2042))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = TAF_PRECON)
(LOAD_BALANCE = NO)
(FAILOVER = yes)
(FAILOVER_MODE =
(TYPE = SELECT)
(METHOD = PRECONNECT)
(BACKUP = myrac3) -> myrac3 is a tnsnames.ora alias, defined below
(RETRIES = 180)
(DELAY = 5)
)
)
)
myrac3 =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = myrac1-vip)(PORT = 2042))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = MYRACTAF_BKP)
(FAILOVER = yes)
(FAILOVER_MODE =
(TYPE = SELECT)
(RETRIES = 180)
(DELAY = 5)
)
)
)
srvctl config service -d MYRACDB -s TAF_PRECON -S 9
#@=info: operation={config} config={full} ver={10.2.0.0.0}
TAF_PRECON PREF: MYRACDB2 AVAIL: MYRACDB1
#@=service[0]: name={TAF_PRECON} enabled={true} pref={MYRACDB2} avail={MYRACDB1} disabled_insts={} tafpolicy={basic} type={user}
#@=endconfig:

srvctl config service -d MYRACDB -s MYRACTAF_BKP -S 9
#@=info: operation={config} config={full} ver={10.2.0.0.0}
MYRACTAF_BKP PREF: MYRACDB1 AVAIL: MYRACDB2
#@=service[0]: name={MYRACTAF_BKP} enabled={true} pref={MYRACDB1} avail={MYRACDB2} disabled_insts={} tafpolicy={NONE} type={user}
#@=endconfig:

So after above configuration when we connect to database using TAF_Precon , it’ll create one connection on myrac2 and backup connection on myrac1.

sqlplus kmakwana@TAF_PRECON

SQL*Plus: Release 10.2.0.2.0 - Production on Mon Nov 23 22:51:40 2009

Copyright (c) 1982, 2005, Oracle. All Rights Reserved.

Enter password:

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL>
So we can see connections on both MYRAC2 and MYRAC1, since we are using the preconnect method.
oracle@myrac2:/home/oracle>
KMAKWANA oracle 497 7731 1532148 sqlplus@myapp1 (TNS myapp1 I TAF_PRECON 0:0:33

oracle@myrac1:/home/oracle>
KMAKWANA oracle 489 9376 913578 sqlplus@myapp1 (TNS myapp1 I MYRACTAF_BKP 0:0:25

TAF Types

The TAF type can be SELECT or SESSION. It defines what happens to the session or to a running query when a failover occurs.
a. SELECT failover. With SELECT failover, Oracle Net keeps track of all SELECT statements issued during the transaction, tracking how many rows have been fetched back to the client for each cursor associated with a SELECT statement. If the connection to the instance is lost, Oracle Net establishes a connection to another Oracle RAC node and re-executes the SELECT statements, repositioning the cursors so the client can continue fetching rows as if nothing has happened. The SELECT failover approach is best for data warehouse systems that perform complex and time-consuming transactions.
b. SESSION failover. When the connection to an instance is lost, SESSION failover results only in the establishment of a new connection to another Oracle RAC node; any work in progress is lost. SESSION failover is ideal for online transaction processing (OLTP) systems, where transactions are small.

NONE: No failover for this service.
SESSION: Only reconnect the session; do not resume select statements.
SELECT: Uses a small amount of additional client memory to store the state information necessary to resume non-transactional SELECT statements during a failover. It allows users with open cursors to continue fetching from them after a failure. However, this mode involves some overhead on the client side during normal SELECT operations. DML transactions will still break and require a rollback.
TAF can operate in one of two modes, Session Failover and Select Failover. Session Failover will recreate lost connections and sessions. Select Failover will replay queries that were in progress.
In OLTP systems, we can use the SESSION type with the preconnect method so that backup connections are already created when the primary connections are established; only the clients running queries at the time of the failure get an error, while clients connecting immediately after the failover can continue retrieving data.
If we use the SELECT type with the preconnect method (backup connections already created using the MYRACTAF_BKP service in our example), the actual failover takes 4-5 seconds; if we use the SELECT type with the basic method, it takes around 12-13 seconds to resume query execution, assuming the service fails over to an available instance. However, DML failover is not supported by Oracle up to 10.2.0.4, so any DML statements running at the time of failover are rolled back.
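Whether TAF is in effect for the connected sessions, and whether they have already failed over, can be checked from GV$SESSION (a minimal sketch):

sqlplus -s "/ as sysdba" <<EOF
select inst_id, username, service_name, failover_type, failover_method, failed_over
from   gv\$session
where  username is not null;
EOF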
If we are using the SELECT type with the preconnect method, there is some overhead involved on the client side, as mentioned in the table below.

The table below summarizes the TAF method and TAF type choices depending on the environment (OLTP or data warehouse).

Row 1
Type of system: OLTP / Data warehousing
TAF method: Preconnect or Basic
TAF type: Session
Advantages: Since backup connections are already created (preconnect), clients can reconnect immediately after a failover and keep selecting data. If the method is not preconnect, clients fail over to an available instance of the service, since no backup connections exist. Best suited to OLTP systems.
Disadvantages: Clients that are running SELECT queries at the time of the failure get an error. Any in-flight DML statements are rolled back.

Row 2
Type of system: OLTP / Data warehousing
TAF method: Preconnect or Basic
TAF type: Select
Advantages: Using the pre-created backup connections, client queries can resume in 4-5 seconds on the backup service connections. If the method is not preconnect, client queries fail over to an available instance of the service, which can take longer because Oracle has to establish connections to the available instance before resuming query processing (it took around 40 seconds in our environment). Best suited to data warehouse environments.
Disadvantages: Extra client memory is used to track information, including the SQL plan hash, SCN and rows fetched for each active SELECT statement, so that the statements can be resumed transparently during a failover. The client-side memory requirement may not be trivial if all clients connect from the same machine or one client opens multiple sessions.

(Note: The failover timings mentioned in this document differ from environment to environment; they were measured in our environment and are not specified by Oracle, so please test in your own environment before any production implementation.)

FAN – Fast Application notification
This feature allows events related to node, instance or service up/down status to be published to applications; the events are published using ONS (a RAC component) and Advanced Queues. ONS sends notifications to the application and to the load balancing advisory framework, which adjusts the goodness and delta values for a particular instance. FAN cleans up connections when a failure occurs; it keeps track of the service and instance for each connection.
When a down event is published by ONS in a RAC environment, the framework:
a. Routes new requests to the remaining instances.
b. Throws exceptions if applications are in the middle of transactions. As mentioned earlier, DML is not failed over, so applications should be able to handle interrupted DML transactions and throw appropriate errors or exceptions to the users.
When an up event is published by ONS in a RAC environment, the framework:
a. Creates new connections to the instance that has come up.
b. Distributes new requests evenly across all instances based on node load and instance load.
c. Updates the load balancing advisory.
FAN configuration
a. Configure at application level
For example, if the application uses Java, we can configure it as below.
OracleDataSource ods = new OracleDataSource();
ods.setUser(username);
ods.setPassword(password);
ods.setConnectionCachingEnabled(true);        // implicit connection cache is required for FCF
ods.setFastConnectionFailoverEnabled(true);   // enable Fast Connection Failover
ods.setONSConfiguration("nodes=myrac1:6300,host2:6300");
ods.setConnectionCacheName("MyCache");
ods.setConnectionCacheProperties(cp);         // cp is a java.util.Properties with cache settings
ods.setURL("jdbc:oracle:thin:@(DESCRIPTION=" +
    "(LOAD_BALANCE=on)" +
    "(ADDRESS=(PROTOCOL=TCP)(HOST=myracVIP1)(PORT=2042))" +
    "(ADDRESS=(PROTOCOL=TCP)(HOST=myracVIP2)(PORT=2042))" +
    "(CONNECT_DATA=(SERVICE_NAME=MYSERVICE)))");

Alternatively, set the system property:
-Doracle.jdbc.FastConnectionFailover=true

b. Change the ons.config file shown below.
oracle@myrac1:/u01/app/oracle/product/10.2.0/db_1/opmn/conf>cat ons.config
localport=6150
remoteport=6300
loglevel=3
usesharedinstall=true

Please make sure that the remoteport value is the same on all RAC nodes (the ONS daemons can be checked with onsctl, as sketched after step c below).

c. When starting the application, specify the system property below and make sure that ons.jar is in the CLASSPATH.
-Doracle.ons.oraclehome=<ORACLE_HOME location>
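The ONS daemons themselves can be checked with onsctl (a sketch; depending on the setup the binary is under $ORACLE_CRS_HOME/bin or $ORACLE_HOME/opmn/bin):

$ORACLE_CRS_HOME/bin/onsctl ping     # is the local ONS daemon up and responding?
$ORACLE_CRS_HOME/bin/onsctl debug    # dump the current ONS configuration and connections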

FAN Server side call out
We can configure server-side callouts for FAN in a RAC environment. Server-side callouts fire on database instance up/down events, ASM instance up/down events and other Clusterware FAN events.
We can download sample code for a Clusterware FAN server-side callout from the link below: http://www.oracle.com/technology/sample_code/products/rac/index.html
We can modify the script as required, but it must be stored in the location below. In the script we can configure alerts based on the events and send email alerts to the DBAs so that they can take appropriate action.
oracle@myrac1:/$ORACLE_CRS_HOME/racg/usrco
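A minimal sketch of such a callout (the script name and mail address are examples; any executable placed in $ORACLE_CRS_HOME/racg/usrco is invoked for each FAN event, with the event text passed as command-line arguments):

#!/bin/sh
# $ORACLE_CRS_HOME/racg/usrco/callout_notify.sh  (example name; must be executable by oracle)
FAN_LOGFILE=/tmp/fan_events_`hostname`.log
echo "`date` $*" >> $FAN_LOGFILE              # keep a history of all FAN events
case "$*" in
  *down*) echo "$*" | mail -s "FAN down event on `hostname`" dba-team@example.com ;;
esac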

CRS is not coming up on all nodes after adding voting disks on AIX

Adding voting disks to the cluster in AIX

Recently we added two more voting disks to our RAC cluster, and after that CRS would not come up on both nodes simultaneously. Whichever node started CRS first kept it running, but CRS then would not start on the other node. We were also not able to shut down CRS on the node where it had started, so we had to kill ocssd.bin or oprocd.bin to restart CRS on that first node. Our conclusion was that whichever node started CRS first was locking the voting disks, and as a result CRS could not come up on the other node, even though the voting disks had been added correctly, as shown below.
crsctl query css votedisk
0. 0 /dev/crs/voting_disk01
1. 0 /dev/crs/voting_disk02
2. 0 /dev/crs/voting_disk03
oracle@tst01:/home/oracle>ls -ltr /dev/crs/voting_disk01 /dev/crs/voting_disk02 /dev/crs/voting_disk03
lrwxrwxrwx 1 oracle dba 12 Nov 29 2007 /dev/crs/voting_disk01-> /dev/rhdisk100
lrwxrwxrwx 1 oracle dba 12 Jan 28 2010 /dev/crs/voting_disk02-> /dev/rhdisk101
lrwxrwxrwx 1 oracle dba 12 Jan 28 2010 /dev/crs/voting_disk03-> /dev/rhdisk102
oracle@tst01:/home/oracle>ls -ltr /dev/*hdisk100 /dev/*hdisk101 /dev/*hdisk102
brw------- 1 root system 38, 2 Nov 28 2007 /dev/hdisk100
crw-rw---- 1 oracle dba 38, 2 Nov 28 2007 /dev/rhdisk100
brw------- 1 root system 38, 2 Jan 18 2010 /dev/hdisk101
crw-rw---- 1 oracle dba 38, 2 Jan 18 2010 /dev/rhdisk101
brw------- 1 root system 38, 2 Jan 18 2010 /dev/hdisk102
crw-rw---- 1 oracle dba 38, 2 Jan 18 2010 /dev/rhdisk102

So we checked the disk permissions and the disk attributes and found the culprit: the reserve_policy and algorithm settings shown in the listing below (the lsattr commands used to pull these attributes are sketched first).
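The relevant attributes can be pulled for each candidate disk with lsattr, for example (run as root on AIX):

# show only the attributes that matter for shared clusterware devices
for d in hdisk100 hdisk101 hdisk102; do
  echo "DISK: $d"
  lsattr -El $d -a reserve_policy -a algorithm -a hcheck_mode -a hcheck_interval -a queue_depth -a rw_timeout
done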

DISK: hdisk101
algorithm fail_over Algorithm True
hcheck_interval 10 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
queue_depth 16 Queue DEPTH True
reserve_policy single_path Reserve Policy True
rw_timeout 40 READ/WRITE time out value True
DISK: hdisk102
algorithm fail_over Algorithm True
hcheck_interval 10 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
queue_depth 16 Queue DEPTH True
reserve_policy single_path Reserve Policy True
rw_timeout 40 READ/WRITE time out value True

The disk reserve_policy must be no_reserve and the algorithm should be round_robin, as set below. We can change some other required attributes as well, for both disks hdisk101 and hdisk102.
# chdev -l hdisk101 -a reserve_policy=no_reserve (Mandatory)
# chdev -l hdisk101 -a algorithm='round_robin' (Mandatory)
# chdev -l hdisk101 -a hcheck_mode='enabled'
# chdev -l hdisk101 -a hcheck_interval='600'
# chdev -l hdisk101 -a queue_depth='2'
# chdev -l hdisk101 -a rw_timeout='60'
Please see below output of changed attributes for both the disks.
# lsattr -El hdisk101
PCM PCM/friend/MSYMM_RAIDS Path Control Module True
PR_key_value none Persistant Reserve Key Value True
algorithm round_robin Algorithm True
clr_q yes Device CLEARS its Queue on error True
hcheck_interval 600 Health Check Interval True
hcheck_mode enabled Health Check Mode True
location Location Label True
lun_id 0x15d000000000000 Logical Unit Number ID False
lun_reset_spt yes FC Forced Open LUN True
max_transfer 0x40000 Maximum TRANSFER Size True
node_name 0x50060482d53061b6 FC Node Name False
pvid none Physical volume identifier False
q_err no Use QERR bit True
q_type simple Queue TYPE True
queue_depth 2 Queue DEPTH True
reserve_policy no_reserve Reserve Policy True
rw_timeout 60 READ/WRITE time out value True
scsi_id 0x18100 SCSI ID False
start_timeout 180 START UNIT time out value True
ww_name 0x50060482d53061b6 FC World Wide Name False
# lsattr -El hdisk102
PCM PCM/friend/MSYMM_RAIDS Path Control Module True
PR_key_value none Persistant Reserve Key Value True
algorithm round_robin Algorithm True
clr_q yes Device CLEARS its Queue on error True
hcheck_interval 600 Health Check Interval True
hcheck_mode enabled Health Check Mode True
location Location Label True
lun_id 0x15e000000000000 Logical Unit Number ID False
lun_reset_spt yes FC Forced Open LUN True
max_transfer 0x40000 Maximum TRANSFER Size True
node_name 0x50060482d53061b6 FC Node Name False
pvid none Physical volume identifier False
q_err no Use QERR bit True
q_type simple Queue TYPE True
queue_depth 2 Queue DEPTH True
reserve_policy no_reserve Reserve Policy True
rw_timeout 60 READ/WRITE time out value True
scsi_id 0x18100 SCSI ID False
start_timeout 180 START UNIT time out value True
ww_name 0x50060482d53061b6 FC World Wide Name False

After changing disk attributes, CRS came online on both the nodes.