Troubleshooting ORA-29740 in a RAC Environment

Original post | Oracle | Author: lifewise | Date: 2007-10-16 10:50:36
Subject: Troubleshooting ORA-29740 in a RAC Environment
Doc ID: Note:219361.1
Type: TROUBLESHOOTING
Last Revision Date: 21-AUG-2007
Status: PUBLISHED

PURPOSE
=======

This note was created to troubleshoot the ORA-29740 error in a Real Application 
Clusters environment.

 
SCOPE & APPLICATION
====================

This note is for DBAs who need to resolve ORA-29740.


Troubleshooting ORA-29740 in a RAC Environment
==============================================

An ORA-29740 error occurs when a member is evicted from the group by another
member of the cluster database.  Possible reasons include a communications
error in the cluster or a failure to issue a heartbeat to the control file.
This mechanism is in place to prevent problems that would affect the entire
database.  For example, instead of allowing a cluster-wide hang to occur,
Oracle evicts the problematic instance(s) from the cluster.  When an
ORA-29740 error occurs, a surviving instance removes the problem instance(s)
from the cluster.  When the problem is detected, the instances 'race' to get
a lock on the control file (the Results Record lock) for updating.  The
instance that obtains the lock tallies the votes of the instances to decide
membership.  A member is evicted if:

	a) A communications link is down
	b) There is a split-brain (more than 1 subgroup) and the member is 
	   not in the largest subgroup
	c) The member is perceived to be inactive
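
The split-brain rule in (b) can be illustrated with a short sketch.  This is
a simplified model and not Oracle's actual voting code: the function names,
the input format (each instance's set of reachable peers), and the tie-break
rule are all assumptions made for illustration.

```python
def subgroups(links):
    """Group instances into connected subgroups.

    `links` maps an instance number to the set of instances it can reach.
    Assumption: a link counts only when both sides can reach each other.
    """
    seen, groups = set(), []
    for start in sorted(links):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            for peer in links.get(node, ()):
                # Follow the link only if the peer can also reach this node
                if node in links.get(peer, ()):
                    stack.append(peer)
        seen |= group
        groups.append(group)
    return groups


def eviction_candidates(links):
    """Return the instances that are not in the largest subgroup."""
    groups = subgroups(links)
    # Assumption: ties are broken in favour of the lowest instance number
    survivors = max(groups, key=lambda g: (len(g), -min(g)))
    return sorted(set(links) - survivors)
```

With three instances where instance 2 has lost its links to 0 and 1, the
sketch reports instance 2 as the only eviction candidate.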

Sample message in Alert log of the evicted instance:

	Fri Sep 28 17:11:51 2001
	Errors in file /oracle/export/TICK_BIG/lmon_26410_tick2.trc:
	ORA-29740: evicted by member %d, group incarnation %d
	Fri Sep 28 17:11:53 2001
	Trace dumping is performing id=[cdmp_20010928171153]
	Fri Sep 28 17:11:57 2001
	Instance terminated by LMON, pid = 26410
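
To locate such messages in a long alert log, a small script can help.  The
following is a minimal sketch that assumes the alert log layout shown in the
sample above (a timestamp line followed by message lines); the function name
and the returned tuple layout are hypothetical, and real alert logs may
differ between releases.

```python
import re
from datetime import datetime

# Matches lines such as "Errors in file /path/to/lmon_26410_tick2.trc:"
TRACE_FILE = re.compile(r"Errors in file (\S+):\s*$")


def find_evictions(alert_lines):
    """Return (timestamp, trace_file, message) for each ORA-29740 line."""
    current_ts, last_trace, found = None, None, []
    for raw in alert_lines:
        line = raw.strip()
        try:
            # Timestamp lines look like "Fri Sep 28 17:11:51 2001"
            current_ts = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
            continue
        except ValueError:
            pass  # not a timestamp line
        match = TRACE_FILE.match(line)
        if match:
            last_trace = match.group(1)
        elif line.startswith("ORA-29740"):
            found.append((current_ts, last_trace, line))
    return found
```

The trace file name returned for each hit points to the LMON trace that
should be reviewed next.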

The key to resolving the ORA-29740 error is to review the LMON trace files 
from each of the instances.  On the evicted instance we will see something 
like:

	*** 2002-11-20 18:49:51.369
	kjxgrdtrt: Evicted by 0, seq (3, 2)
			      ^
			      |
This indicates which instance initiated the eviction.

On the evicting instance we will see something like:

	kjxgrrcfgchk: Initiating reconfig, reason 3
	*** 2002-11-20 18:49:29.559
	kjxgmrcfg: Reconfiguration started, reason 3

	...
	*** 2002-11-20 18:49:29.727
	Obtained RR update lock for sequence 2, RR seq 2
	*** 2002-11-20 18:49:31.284
	Voting results, upd 0, seq 3, bitmap: 0
	Evicting mem 1, stat 0x0047 err 0x0002

You can see above that the instance initiated a reconfiguration for reason 3
(see Note 139435.1 for more information on reconfigurations).  The
reconfiguration then started and this instance obtained the RR lock
(Results Record lock), which means this instance tallies the votes of the
instances to decide membership.  The last lines show the voting results,
after which this instance evicts instance 1.
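
When several LMON trace files have to be compared, the key lines can be
pulled out with a script.  The sketch below is an illustration only: the
regular expressions are derived from the sample lines quoted in this note,
the function and event names are hypothetical, and trace formats may vary
between Oracle versions.

```python
import re

# Patterns taken from the sample trace lines shown in this note
PATTERNS = {
    "evicted_by": re.compile(r"kjxgrdtrt: Evicted by (\d+)"),
    "reconfig_reason": re.compile(
        r"kjxgmrcfg: Reconfiguration started, reason (\d+)"),
    "evicting_member": re.compile(r"Evicting mem (\d+)"),
}


def scan_lmon_trace(lines):
    """Return (event_name, number) pairs in the order they appear."""
    events = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                events.append((name, int(match.group(1))))
                break  # at most one event per line
    return events
```

Running this over the traces of all instances shows which instance started
the reconfiguration, for which reason, and which member it evicted.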

For troubleshooting ORA-29740 errors, the 'reason' is very important.
In the above example, the first lines of the evicting instance's trace give
the reason for the reconfiguration.  The reasons are as follows:

Reason 0 = No reconfiguration  
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend

For ORA-29740 errors, you will most likely see reasons 1, 2, or 3.
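
For use in scripts, the reason codes above can be kept in a small lookup
table.  This is only a convenience sketch; the names are hypothetical and
the descriptions are copied from the list in this note.

```python
# Reason codes as listed in this note
RECONFIG_REASONS = {
    0: "No reconfiguration",
    1: "The Node Monitor generated the reconfiguration",
    2: "An instance death was detected",
    3: "Communications failure",
    4: "Reconfiguration after suspend",
}

# Reasons the note says are most likely with ORA-29740
LIKELY_FOR_ORA_29740 = {1, 2, 3}


def describe_reason(code):
    """Return a one-line description of a reconfiguration reason code."""
    text = RECONFIG_REASONS.get(code, "Unknown reason (not listed in this note)")
    flag = " [commonly seen with ORA-29740]" if code in LIKELY_FOR_ORA_29740 else ""
    return f"Reason {code}: {text}{flag}"
```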

-----------------------------------------------------------------------------

Reason 1: The Node Monitor generated the reconfiguration.  This can happen if:

	a) An instance joins the cluster
	b) An instance leaves the cluster
	c) A node is halted

It should be easy to determine the cause of the error by reviewing the alert
logs and LMON trace files from all instances.  If an instance joins or leaves
the cluster or a node is halted then the ORA-29740 error is not a problem.  

ORA-29740 evictions with reason 1 are usually expected when the cluster 
membership changes.  Very rarely are these types of evictions a real problem.

If you feel that this eviction was not correct, do a search in Metalink or 
the bug database for:

	ORA-29740 'reason 1'

Important files to review are:

	a) Each instance's alert log
	b) Each instance's LMON trace file
	c) Statspack reports from all nodes leading up to the eviction
	d) Each node's syslog or messages file
	e) iostat output before, after, and during evictions
	f) vmstat output before, after, and during evictions
	g) netstat output before, after, and during evictions

A tool called "OS Watcher" is being developed to help gather this
information.  For more information on "OS Watcher" see Note 301137.1,
"OS Watcher User Guide".

-----------------------------------------------------------------------------

Reason 2: An instance death was detected.  This can happen if:

	a) An instance fails to issue a heartbeat to the control file.

When the heartbeat is missing, LMON issues a network ping to the instance
that is not issuing the heartbeat.  As long as the instance responds to the
ping, LMON considers the instance alive.  If, however, the heartbeat is not
issued for the duration of the control file enqueue timeout, the instance is
considered to be problematic and is evicted.
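
The decision described above can be summarized in a short sketch.  This is
a simplified model and not LMON's actual logic: the function name, the
status labels, and both timing parameters are hypothetical, and the note
does not say what happens when the ping goes unanswered before the timeout,
so that case is only labelled here.

```python
def classify_instance(seconds_since_heartbeat, ping_answered, *,
                      heartbeat_interval, enqueue_timeout):
    """Classify an instance from its heartbeat age and ping result.

    Both intervals are in seconds and must be supplied by the caller;
    no real Oracle default values are implied.
    """
    if seconds_since_heartbeat <= heartbeat_interval:
        return "healthy"            # heartbeat is current
    if seconds_since_heartbeat >= enqueue_timeout:
        return "evict"              # heartbeat missing for the full timeout
    # Heartbeat is late but the timeout has not yet expired
    return "alive" if ping_answered else "no-ping-response"
```

Note that in this model an instance that answers the ping is still evicted
once the heartbeat has been missing for the whole timeout period.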

Common causes for an ORA-29740 eviction (Reason 2):

        a) NTP (Time changes on cluster) - usually on Linux, Tru64, or IBM AIX
        b) Network Problems (SAN).
        c) Resource Starvation (CPU, I/O, etc.)
        d) An Oracle bug.


Common bugs for reason 2 evictions:

Bug 2820871 - Abrupt time adjustments can crash instance with ORA-29740 
(Reason 2) (Linux Only)
Fixed-Releases: 9204+ A000

Bug 3917158 - ORA-29740 and a false instance eviction can occur (Reason 2) 
(IBM AIX Only)
Fixed-Releases: 9206+

If you feel that this eviction was not correct, do a search in Metalink or the 
bug database for:

	ORA-29740 'reason 2'

Important files to review are:

	a) Each instance's alert log
	b) Each instance's LMON trace file
	c) Statspack reports from all nodes leading up to the eviction
	d) The CKPT process trace file of the evicted instance
	e) Other bdump or udump files...
	f) Each node's syslog or messages file 
	g) iostat output before, after, and during evictions
	h) vmstat output before, after, and during evictions
	i) netstat output before, after, and during evictions

A tool called "OS Watcher" is being developed to help gather this
information.  For more information on "OS Watcher" see Note 301137.1,
"OS Watcher User Guide".

-----------------------------------------------------------------------------

Reason 3: Communications Failure.  This can happen if:

	a) The LMON processes lose communication between one another.
	b) One instance loses communications with the LMS or LMD process of
	   another instance.
	c) The LCK processes lose communication between one another.
	d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck
	   and is not responding to remote requests.

In this case the ORA-29740 error is recorded when there are communication
issues between the instances.  It indicates that an instance has been
evicted from the configuration as a result of an IPC send timeout.  A
communications failure between processes across instances will also generate
an ORA-29740 with reason 3.  When this occurs, the trace file of the process
experiencing the error will print a message:

  	Reporting Communication error with instance:

If communication is lost at the cluster layer (for example, network cables
are pulled), the cluster software may also perform node evictions in the 
event of a cluster split-brain.  Oracle will detect a possible split-brain 
and wait for cluster software to resolve the split-brain.  If cluster
software does not resolve the split-brain within a specified interval, 
Oracle proceeds with evictions.

Oracle Support has seen cases where resource starvation (CPU, I/O, etc.) can
cause an instance to be evicted with this reason code.  The LMON or LMD process
could be blocked waiting for resources and not respond to polling by the remote
instance(s).  This could cause that instance to be evicted.  If you have
a statspack report available from the time just prior to the eviction on the
evicted instance, check for poor I/O times and high CPU utilization.  An
average read time above 20 ms indicates poor I/O.
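
Once average read times have been copied out of a statspack report, the
20 ms check is easy to script.  The sketch below assumes the values have
already been extracted into a dictionary keyed by file name; the function
name and input format are hypothetical, and parsing the report itself is
not shown.

```python
POOR_READ_MS = 20.0  # threshold quoted in this note


def slow_reads(avg_read_ms_by_file, threshold_ms=POOR_READ_MS):
    """Return (name, avg_ms) pairs above the threshold, slowest first."""
    slow = {name: ms for name, ms in avg_read_ms_by_file.items()
            if ms > threshold_ms}
    return sorted(slow.items(), key=lambda item: item[1], reverse=True)
```

An empty result means none of the supplied files exceeded the threshold; it
does not rule out CPU starvation, which has to be checked separately.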

Common causes for an ORA-29740 eviction (Reason 3):

        a) Network Problems.
        b) Resource Starvation (CPU, I/O, etc.)
        c) Severe Contention in Database.
        d) An Oracle bug.


Common bugs for reason 3 evictions:

Bug 2276622 - ORA-29740 (Reason 3) possible in RAC under heavy load 
Fixed-Releases: 9014+ 9202+

Bug 2994260 - IPCSOCK_SEND FAILED WITH STATUS: 10054 (Windows only)
Fixed-Releases: 9203 with patch or 9204+

Bug 2210879 - ORACLE PROCESS CRASHES, WITH ASSERTION FAILURE IN LOWFAT 
SKGXP CODE (HP-UX only with clic interface)
Fixed-Releases: Fixed by HP in PHNE 26551 or above.

Bug 3007107 - FREQUENT ORA-29740 EVICTIONS OCCURING WITH MINIMAL ACTIVITY
(HP-UX only with clic interface)
Fixed-Releases: Fixed by HP in PHKL_28695 HP patch. 

Bug 3663773 - ORA-29740: EVICTED BY MEMBER 1   REASON 3
(HP Itanium Only)
Fixed-Releases: Fixed via the dbc_max_pct=8 parameter.

The issue described in Note 312935.1 (Sun Scrubber - Sun Solaris only)
Fixed-Releases: Contact Sun for patch number after reviewing the note.

Tips for tuning inter-instance performance can be found in the following note:

  Note 181489.1 
  Tuning Inter-Instance Performance in RAC and OPS 

If you feel that this eviction was not correct, do a search in Metalink or the 
bug database for:

	ORA-29740 'reason 3'

Important files to review are:

	a) Each instance's alert log
	b) Each instance's LMON trace file
	c) Each instance's LMD and LMS trace files
	d) Statspack reports from all nodes leading up to the eviction
	e) Other bdump or udump files...
	f) Each node's syslog or messages file 
	g) iostat output before, after, and during evictions
	h) vmstat output before, after, and during evictions
	i) netstat output before, after, and during evictions

A tool called "OS Watcher" is being developed to help gather this
information.  For more information on "OS Watcher" see Note 301137.1,
"OS Watcher User Guide".

-----------------------------------------------------------------------------

References:

Note 139435.1 Fast Reconfiguration in 9i Real Application Clusters
Bug 2276622 ORA-29740 UNDER HEAVY LOAD
Bug 1999778 RAC/OPS DATABASE CRASHES WITH ORA-29740 ON RESTART ON FAILED SYSTEM
Bug 2529223 INSTANCE EVICTED WITH ORA-29740
Note 175678.1 RAC Instances Crash with ORA-29740 or ORA-600 [ksxpwait5] on IBM AIX
Note 212381.1 RAC: Cluster Node evicted due to Change of System Time

Source: "ITPUB Blog", link: http://blog.itpub.net/620862/viewspace-976926/. If you repost this article, please credit the source; otherwise legal action may be pursued.
