
RAC problem that occurred at noon on 2006/05/17

Original post. Category: Linux operating system. Author: tolywang. Date: 2006-05-17 00:00:00

Environment: Linux AS 2.1 + Oracle 9.2.0.4 RAC, OLTP workload.
Linux kernel: 2.4.9-e.40smp #1 SMP


Restarting the instances did restore normal operation. Possible causes are listed further below.


http://www.itpub.net/showthread.php...928&pagenumber=

Some approaches that were discussed: http://www.itpub.net/showthread.php?s=&threadid=550162




Alert log on Node1:


Wed May 17 12:00:26 2006
ARC1: Evaluating archive log 2 thread 1 sequence 19622
Wed May 17 12:00:26 2006
Current log# 3 seq# 19623 mem# 0: /ocfs_ctrl_redo/orcl/redo03.log
Current log# 3 seq# 19623 mem# 1: /ocfs_data/orcl/redo03b.log
Wed May 17 12:00:26 2006
ARC1: Beginning to archive log 2 thread 1 sequence 19622
Creating archive destination LOG_ARCHIVE_DEST_1: '/ocfs_arch1/orcl/1_19622.dbf'
ARC1: Completed archiving log 2 thread 1 sequence 19622

Wed May 17 12:11:03 2006
Waiting for clusterware split-brain resolution
Evicting instance 2 from cluster
Wed May 17 12:21:12 2006
Reconfiguration started
List of nodes: 0,
Wed May 17 12:21:12 2006
Reconfiguration started
List of nodes: 0,


Wed May 17 12:41:46 2006
Starting ORACLE instance (normal)
Wed May 17 12:41:46 2006
Global Enqueue Service Resources = 20158, pool = 8
Wed May 17 12:41:46 2006
Global Enqueue Service Enqueues = 32606
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
GES IPC: Receivers 3 Senders 3
GES IPC: Buffers Receive 1000 Send 2260 Reserve 1000
GES IPC: Msg Size Regular 396 Batch 2048
SCN scheme 2
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.4.0.
System parameters with non-default values:



-------------------------------------------------------------------------------------------------


Alert log on Node2:


Wed May 17 12:09:17 2006
Communications reconfiguration: instance 0
Wed May 17 12:11:03 2006
Waiting for clusterware split-brain resolution
Wed May 17 12:21:04 2006
Errors in file /u01/product/admin/orcl/bdump/orcl2_lmon_2458.trc:
ORA-29740: evicted by member 1, group incarnation 3
LMON: terminating instance due to error 29740
Wed May 17 12:21:06 2006
Errors in file /u01/product/admin/orcl/bdump/orcl2_smon_2472.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-29740: evicted by member , group incarnation
Instance terminated by LMON, pid = 2458
Wed May 17 12:41:46 2006
Starting ORACLE instance (normal)
Wed May 17 12:41:46 2006
Global Enqueue Service Resources = 20158, pool = 4
Wed May 17 12:41:46 2006
Global Enqueue Service Enqueues = 32606
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
GES IPC: Receivers 3 Senders 3
GES IPC: Buffers Receive 1000 Send 2260 Reserve 1000
GES IPC: Msg Size Regular 396 Batch 2048
SCN scheme 2
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.4.0.
System parameters with non-default values:



----------------------------------------------------------------------------------------------------
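The two alert logs are easier to compare once the key events sit on one timeline. The following is a minimal sketch, not an Oracle tool: the function name and the keyword list are assumptions chosen for this incident, and the sample text is copied from the Node2 log above.

```python
import re
from datetime import datetime

# Alert-log timestamp lines look like "Wed May 17 12:11:03 2006".
TS_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$")

# Hypothetical keyword list picked for this incident; adjust as needed.
KEYWORDS = ("split-brain", "Evicting instance", "ORA-29740",
            "Reconfiguration started", "Starting ORACLE instance")

def parse_alert_log(text):
    """Return (timestamp, message) pairs for message lines containing a keyword."""
    events = []
    current = None  # timestamp of the most recent timestamp line
    for raw in text.splitlines():
        line = raw.strip()
        if TS_RE.match(line):
            current = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
        elif current is not None and any(k in line for k in KEYWORDS):
            events.append((current, line))
    return events

SAMPLE = """Wed May 17 12:09:17 2006
Communications reconfiguration: instance 0
Wed May 17 12:11:03 2006
Waiting for clusterware split-brain resolution
Wed May 17 12:21:04 2006
Errors in file /u01/product/admin/orcl/bdump/orcl2_lmon_2458.trc:
ORA-29740: evicted by member 1, group incarnation 3
LMON: terminating instance due to error 29740
"""

events = parse_alert_log(SAMPLE)
for ts, msg in events:
    print(ts.strftime("%H:%M:%S"), msg)

# Seconds between the split-brain wait and the eviction on Node2.
wait_secs = (events[1][0] - events[0][0]).total_seconds()
print("waited", wait_secs, "seconds")
```

On the Node2 excerpt this gives roughly ten minutes between "Waiting for clusterware split-brain resolution" and the ORA-29740 eviction, which matches the 12:11:03 and 12:21:04 timestamps in the log.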




Trace file contents:



/u01/product/admin/orcl/bdump/orcl2_smon_2472.trc
Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.4.0 - Production
ORACLE_HOME = /u01/product/oracle
System name: Linux
Node name: dell-node2
Release: 2.4.9-e.40smp
Version: #1 SMP Thu Apr 8 16:53:29 EDT 2004
Machine: i686
Instance name: orcl2
Redo thread mounted by this instance: 2
Oracle process number: 11
Unix process pid: 2472, image: oracle@dell-node2 (SMON)

*** SESSION ID:(12.1) 2006-05-17 12:18:18.028
*** 2006-05-17 12:18:18.028
kjctipccb: send timed out for msg 0x0x97406ad0 to (0 2), inc 2 type 32 waited 307 sec
kjctipccb: stat 3 dest_inc 2 sys_inc 2
------ Dumping SKGXP context ------
SKGXPCTX: 0xad7f470 ctx
admono 0x68c70cb4 admport:
SSKGXPT 0xad7f558 flags info for network 0
socket no 8 IP 10.1.1.6 UDP 32831
sflags SSKGXPT_WRITESSKGXPT_UP
info for network 1
socket no 0 IP 0.0.0.0 UDP 0
sflags SSKGXPT_DOWN
active 0 actcnt 1
context timestamp 0x80ccca75
no ports
sconno accono ertt state seq# sent async sync rtrans acks
0x538ba1ef 0x02579b42 32 3 51736 84509 84509 0 0 84509
0x538ba1f0 0x226d924e 32 3 46788 14025 14025 0 0 11593
0x538ba1f1 0x1999f15f 32 3 47055 14292 14292 0 297 12062
ach accono sconno admno state seq# rcv rtrans acks
*** 2006-05-17 12:21:06.322
KCL: caught error 29740 during cr lock op
*** 2006-05-17 12:21:06.323
SMON: following errors trapped and ignored:
ORA-00604: error occurred at recursive SQL level 1
ORA-29740: evicted by member , group incarnation
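
The `kjctipccb: send timed out` line is the most direct sign of an IPC problem in this trace. The sketch below pulls such lines out of a trace file so several traces can be scanned quickly. The regular expression is written against the single line shown above, and the group names (`msg`, `dest`, `inc`, `type`, `waited`) are descriptive guesses rather than Oracle's documented field names.

```python
import re

# Pattern for the "send timed out" line exactly as it appears in the SMON trace.
# Group names are illustrative assumptions, not official Oracle terminology.
SEND_TIMEOUT_RE = re.compile(
    r"kjctipccb: send timed out for msg (?P<msg>\S+) to \((?P<dest>\d+ \d+)\), "
    r"inc (?P<inc>\d+) type (?P<type>\d+) waited (?P<waited>\d+) sec"
)

def find_send_timeouts(trace_text):
    """Return one dict per IPC send-timeout line found in the trace text."""
    hits = []
    for line in trace_text.splitlines():
        m = SEND_TIMEOUT_RE.search(line)
        if m:
            hit = m.groupdict()
            hit["waited"] = int(hit["waited"])  # seconds, as printed in the trace
            hits.append(hit)
    return hits

TRACE_LINE = ("kjctipccb: send timed out for msg 0x0x97406ad0 to (0 2), "
              "inc 2 type 32 waited 307 sec")

for hit in find_send_timeouts(TRACE_LINE):
    print(hit["dest"], hit["waited"])
```

Here the send had been waiting 307 seconds by 12:18:18, which would put the start of the stall a few minutes before the 12:21 eviction recorded in the alert logs.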

---------------------------------------------

Reason 1: The Node Monitor generated the reconfiguration. This can happen if:

a) An instance joins the cluster
b) An instance leaves the cluster
c) A node is halted

It should be easy to determine the cause of the error by reviewing the alert
logs and LMON trace files from all instances. If an instance joins or leaves
the cluster or a node is halted then the ORA-29740 error is not a problem.

ORA-29740 evictions with reason 1 are usually expected when the cluster
membership changes. Very rarely are these types of evictions a real problem.

If you feel that this eviction was not correct, do a search in Metalink or
the bug database for:

ORA-29740 'reason 1'

Important files to review are:

a) Each instance's alert log
b) Each instance's LMON trace file
c) Statspack reports from all nodes leading up to the eviction
d) Each node's syslog or messages file

-----------------------------------------------------------------------------

Reason 2: An instance death was detected. This can happen if:

a) An instance fails to issue a heartbeat to the control file.

When the heartbeat is missing, LMON will issue a network ping to the instance
not issuing the heartbeat. As long as the instance responds to the ping,
LMON will consider the instance alive. If, however, the heartbeat is not
issued for the length of time of the control file enqueue timeout, the
instance is considered to be problematic and will be evicted.
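
The paragraph above combines two checks: the control file heartbeat and the network ping. As a reading aid only, here is a toy model of that decision. It is not Oracle's implementation. The function name, the return strings and the 900-second value in the usage lines are assumptions, and the note gives no actual timeout value, so it is passed in as a parameter.

```python
def heartbeat_verdict(secs_without_heartbeat, answers_ping, enqueue_timeout_secs):
    """Toy model of the Reason 2 check described above (not Oracle code)."""
    if secs_without_heartbeat >= enqueue_timeout_secs:
        # Heartbeat missing for the full control file enqueue timeout.
        return "evict"
    if answers_ping:
        # Heartbeat is late, but the instance still responds to the ping.
        return "alive"
    # The note does not say what happens in this case; left open on purpose.
    return "undetermined"

# Usage with a hypothetical 900-second timeout.
print(heartbeat_verdict(10, True, 900))
print(heartbeat_verdict(900, True, 900))
```

The point of the model is that a successful ping only keeps the instance in the cluster for a while: once the heartbeat has been missing for the whole timeout, the instance is evicted regardless.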

Common causes for an ORA-29740 eviction (Reason 2):

a) NTP (Time changes on cluster) - usually on Linux, Tru64, or IBM AIX
b) Network Problems (SAN).
c) Resource Starvation (CPU, I/O, etc.)
d) An Oracle bug.



Common bugs for reason 2 evictions:



If you feel that this eviction was not correct, do a search in Metalink or the
bug database for:

ORA-29740 'reason 2'

Important files to review are:

a) Each instance's alert log
b) Each instance's LMON trace file
c) Statspack reports from all nodes leading up to the eviction
d) The CKPT process trace file of the evicted instance
e) Other bdump or udump files...
f) Each node's syslog or messages file
g) iostat output before, after, and during evictions
h) vmstat output before, after, and during evictions
i) netstat output before, after, and during evictions

-----------------------------------------------------------------------------

Reason 3: Communications Failure. This can happen if:

a) The LMON processes lose communication between one another.
b) One instance loses communications with the LMD process of another
instance.
c) An LMON process is blocked, spinning, or stuck and is not
responding to the other instance(s) LMON process.
d) An LMD process is blocked or spinning.

In this case the ORA-29740 error is recorded when there are communication
issues between the instances. It is an indication that an instance has been
evicted from the configuration as a result of IPC send timeout. A
communications failure between a foreground, or background other than LMON,
and a remote LMD will also generate an ORA-29740 with reason 3. When this
occurs, the trace file of the process experiencing the error will print a
message:

Reporting Communication error with instance:

If communication is lost at the cluster layer (for example, network cables
are pulled), the cluster software may also perform node evictions in the
event of a cluster split-brain. Oracle will detect a possible split-brain
and wait for cluster software to resolve the split-brain. If cluster
software does not resolve the split-brain within a specified interval,
Oracle proceeds with evictions.

Oracle Support has seen cases where resource starvation (CPU, I/O, etc.) can
cause an instance to be evicted with this reason code. The LMON or LMD process
could be blocked waiting for resources and not respond to polling by the remote
instance(s). This could cause that instance to be evicted. If you have
a statspack report available from the time just prior to the eviction on the
evicted instance, check for poor I/O times and high CPU utilization. Poor I/O
times would be an average read time of > 20ms.
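
The 20 ms rule of thumb above is easy to apply once read counts and read times have been copied out of a statspack report. A minimal sketch follows. The input layout (name, physical reads, total read time in milliseconds) and the sample numbers are made up for illustration and do not come from this system.

```python
POOR_READ_MS = 20.0  # threshold quoted in the note above

def slow_read_files(stats, threshold_ms=POOR_READ_MS):
    """stats: iterable of (name, physical_reads, total_read_time_ms).

    Returns (name, average_read_ms) for entries whose average exceeds the threshold.
    """
    flagged = []
    for name, reads, total_ms in stats:
        if reads == 0:
            continue  # no reads, so no meaningful average
        avg_ms = total_ms / reads
        if avg_ms > threshold_ms:
            flagged.append((name, round(avg_ms, 1)))
    return flagged

# Hypothetical figures, not taken from the incident.
SAMPLE_STATS = [
    ("USERS", 1000, 35000.0),
    ("SYSTEM", 500, 4000.0),
    ("TEMP", 0, 0.0),
]

print(slow_read_files(SAMPLE_STATS))
```

With these sample numbers only USERS is flagged, at an average of 35 ms per read. SYSTEM averages 8 ms and TEMP is skipped because it has no reads.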

Common causes for an ORA-29740 eviction (Reason 3):

a) Network Problems.
b) Resource Starvation (CPU, I/O, etc.)
c) Severe Contention in Database.
d) An Oracle bug.

Source: "ITPUB Blog", link: http://blog.itpub.net/35489/viewspace-84379/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
