ITPub Blog

[Summary] 9i RAC LMON: terminating instance due to error 29702

Original post, category: Linux operating system, author: tolywang, date: 2009-02-14 10:24:09

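Basic configuration:

Linux AS 3.0, kernel version 2.4.21-37.ELsmp
Oracle 9.2.0.4
Upgraded to Oracle 9.2.0.7, two-node RAC. The clusterware was installed from the 9.2.0.4 software.

Later investigation showed that when an Oracle 9.2.0.4 RAC system is upgraded to 9.2.0.7, the Oracle RDBMS software can be upgraded to 9.2.0.7, but the 9.2.0.7 patch set contains no upgraded version of the ORACM cluster manager software. This is an Oracle bug (9.2.0.7.0 Bug 4163445). The ORACM clusterware installed with 9.2.0.4 can only be upgraded from the 9.2.0.6 patch set. The Oracle CM log shows that the ORACM version installed with Oracle 9.2.0.4 is oracm 9.2.0.2.0; the 9.2.0.7 patch set leaves that version unchanged, while the 9.2.0.6 patch set upgrades it to oracm 9.2.0.6.0.52. Every upgrade step must follow the README strictly, although the Oracle README does not necessarily cover everything, and this problem is one example.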


http://www.itpub.net/viewthread.php?tid=922265&extra=&highlight=%2Btolywang&page=3 9.2.0.7.0 Bug 4163445
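To confirm which ORACM version is actually in use, the version string can be pulled out of the CM log. The following is only a minimal sketch: it assumes the log contains the word "oracm" followed by a dotted version number (as in "oracm 9.2.0.2.0"), and the function name `oracm_versions` is illustrative, not something from Oracle.

```shell
# Print every distinct dotted version number that follows the word "oracm"
# (optionally followed by "version") in the given log file.
# The line format is an assumption; adjust the pattern to the real CM log.
oracm_versions() {
    grep -i -o -E 'oracm[[:space:]]+(version[[:space:]]+)?[0-9]+(\.[0-9]+)+' "$1" |
        grep -o -E '[0-9]+(\.[0-9]+)+' |
        sort -u
}
```

Called with the path of the CM log on each node, it should print one version per node; if the nodes print different versions, or the old 9.2.0.2.0 still shows up after patching, the ORACM upgrade was not applied everywhere.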

Problem description:

The symptoms were as follows (the instances on node 1 and node 2 crashed alternately, roughly once every 58 days):

alert_orcl1.log
-------------------------------------------------------------------------------------------------


Sat Jan 5 18:44:19 2008
ARC1: Evaluating archive log 1 thread 1 sequence 122
ARC1: Beginning to archive log 1 thread 1 sequence 122
Creating archive destination LOG_ARCHIVE_DEST_1: '/ocfs_arch1/orcl/1_122.dbf'
ARC1: Completed archiving log 1 thread 1 sequence 122
Sat Jan 5 19:36:06 2008
Thread 1 advanced to log sequence 124
Current log# 4 seq# 124 mem# 0: /ocfs_ctrl_redo/orcl/redo04.log
Current log# 4 seq# 124 mem# 1: /ocfs_data/orcl/redo04b.log
Sat Jan 5 19:36:06 2008
ARC1: Evaluating archive log 3 thread 1 sequence 123
ARC1: Beginning to archive log 3 thread 1 sequence 123
Creating archive destination LOG_ARCHIVE_DEST_1: '/ocfs_arch1/orcl/1_123.dbf'
ARC1: Completed archiving log 3 thread 1 sequence 123
Sat Jan 5 19:45:15 2008
Errors in file /u01/product/admin/orcl/bdump/orcl1_lmon_14214.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jan 5 19:45:15 2008
LMON: terminating instance due to error 29702
Sat Jan 5 19:45:16 2008
System state dump is made for local instance
Sat Jan 5 19:45:20 2008
Instance terminated by LMON, pid = 14214
Sat Jan 5 19:54:53 2008
Starting ORACLE instance (normal)
Sat Jan 5 19:54:53 2008
Global Enqueue Service Resources = 26694, pool = 4
Sat Jan 5 19:54:53 2008
Global Enqueue Service Enqueues = 39350
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
SCN scheme 2
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.7.0.
System parameters with non-default values:
processes = 1000
timed_statistics = FALSE
resource_limit = TRUE
shared_pool_size = 419430400
large_pool_size = 33554432
java_pool_size = 33554432
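Since the crashes recur on a schedule, it helps to list every termination with its timestamp instead of reading the alert log by eye. The sketch below assumes the alert log layout shown in the excerpt (a timestamp line followed by message lines); the function name `lmon_terminations` is made up for illustration.

```shell
# Print "<timestamp> | <message>" for every instance termination found in
# an alert log. Each alert log entry is a timestamp line such as
# "Sat Jan 5 19:45:15 2008" followed by one or more message lines.
lmon_terminations() {
    awk '
        # Remember the most recent timestamp line.
        /^[A-Z][a-z][a-z] [A-Z][a-z][a-z] +[0-9]+ [0-9][0-9]:[0-9][0-9]:[0-9][0-9] [0-9][0-9][0-9][0-9]$/ { ts = $0; next }
        # Report termination messages together with that timestamp.
        /terminating instance due to error/ { print ts " | " $0 }
    ' "$1"
}
```

Running it against the alert log of each node gives one line per crash, which makes the interval between crashes easy to read off.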







$ vi /u01/product/admin/orcl/bdump/orcl1_lmon_14214.trc
=============



/u01/product/admin/orcl/bdump/orcl1_lmon_14214.trc
Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production
ORACLE_HOME = /u01/product/oracle
System name: Linux
Node name: DELL-RAC01
Release: 2.4.21-37.ELsmp
Version: #1 SMP Wed Sep 7 13:28:55 EDT 2005
Machine: i686
Instance name: orcl1
Redo thread mounted by this instance: 0
Oracle process number: 4
Unix process pid: 14214, image: oracle@DELL-RAC01 (LMON)

*** SESSION ID:(3.1) 2007-12-31 12:07:45.591
GES IPC: Receivers 3 Senders 3
GES IPC: Buffers Receive 1000 Send (i:2230 b:2230) Reserve 1000
GES IPC: Msg Size Regular 396 Batch 2048
Batch msg size = 2048
Batching factor: enqueue replay 48, ack 53
Batching factor: cache replay 34 size per lock 56
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 1 Oracle 9i Reference CM)
CMCLI WARNING: CMInitContext: init ctx(0xb6d93f8)
*** 2007-12-31 12:07:49.396
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 0 0.
*** 2007-12-31 12:07:49.396
Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 122787
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 1 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 1 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 1 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 1 5.
Name Service normal
Name Service recovery done
*** 2007-12-31 12:07:49.611
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 1 6.
*** 2007-12-31 12:07:49.832
*** 2007-12-31 12:07:49.832
Reconfiguration started (old inc 0, new inc 1)
Synchronization timeout interval: 600 sec
List of nodes:
0
Global Resource Directory frozen
node 0
release 9 2 0 7
* kjshashcfg: I'm the only node in the cluster (node 0)
Active Sendback Threshold = 50 %
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 0
0 GCS shadows traversed, 0 cancelled, 0 closed
0 GCS resources traversed, 0 cancelled
set master node info
Submitted all remote-enqueue requests
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 2007-12-31 12:07:50.121
0 GCS shadows traversed, 0 replayed, 0 unopened
Submitted all GCS cache requests
0 write requests issued in 0 GCS resources
0 PIs marked suspect, 0 flush PI msgs

ORACM log entries at the time (ERROR: WriteEventPort: write failed with error 32):

------------------------------------------------------------

Debug Hang :ClientProcListener (PID=14257) UnRegistered with watchdog daemon. {Sat Jan 5 19:45:16 2008 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:688145 file = unixinc.c, line = 767 {Sat Jan 5 19:45:16 2008 }
>ERROR: WriteEventPort: write failed with error 32., tid = ClientProcListener:688145 file = unixinc.c, line = 915 {Sat Jan 5 19:45:16 2008 }
Debug Hang :ClientProcListener (PID=14261) UnRegistered with watchdog daemon. {Sat Jan 5 19:45:16 2008 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:622615 file = unixinc.c, line = 767 {Sat Jan 5 19:45:16 2008 }
Debug Hang :ClientProcListener (PID=14255) UnRegistered with watchdog daemon. {Sat Jan 5 19:45:16 2008 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:557077 file = unixinc.c, line = 767 {Sat Jan 5 19:45:16 2008 }
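On Linux, errno 32 is EPIPE (broken pipe), which would fit the "socket closed by peer" warnings next to it, assuming ORACM reports the raw errno here. To see how often each message occurs, the CM log can be summarised per reporting function. This is a rough sketch; `cm_log_summary` is a hypothetical helper and the line format is taken only from the excerpt above.

```shell
# Summarise an ORACM log: strip carriage returns, then count WARNING and
# ERROR lines per reporting function (e.g. ReadCommPort, WriteEventPort).
cm_log_summary() {
    tr -d '\r' < "$1" |
        sed -n -E 's/^>?(WARNING|ERROR): ([A-Za-z]+):.*/\1 \2/p' |
        sort | uniq -c |
        awk '{ print $1, $2, $3 }'   # normalise uniq's padded count column
}
```

Each output line has the form "count severity function", so a sudden jump in one counter after a patch is easy to spot.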

DIAG trace log:

/u01/product/admin/orcl/bdump/orcl2_diag_14211.trc
Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production
ORACLE_HOME = /u01/product/oracle
System name: Linux
Node name: DELL-RAC02
Release: 2.4.21-37.ELsmp
Version: #1 SMP Wed Sep 7 13:28:55 EDT 2005
Machine: i686
Instance name: orcl2
Redo thread mounted by this instance: 0
Oracle process number: 3
Unix process pid: 14211, image: oracle@DELL-RAC02 (DIAG)
*** SESSION ID:(2.1) 2008-01-16 12:16:14.524
CMCLI WARNING: CMInitContext: init ctx(0xb9115f4)
kjzcprt:rcv port created
Node id: 1
List of nodes: 0, 1,
*** 2008-01-16 12:16:14.526
Reconfiguration starts [incarn=0]
I'm the voting node
Send my bitmap to master 0
Rcfg confirmation is received from master 0
I agree with the rcfg confirmation
*** 2008-01-16 12:16:25.233
Reconfiguration completes [incarn=2]
*** 2008-01-19 04:50:21.933
Instance is terminating by process 14215 [ospid=oracle@DELL-RAC02 (LMON)]
Performing diagnostic data dump for this instance
CMCLI WARNING: CommonContextCleanup: closing comm port
DIAG detachs from CM
error 29723 detected in background process
OPIRIP: Uncaught error 447. Error stack:
ORA-00447: fatal error in background process
ORA-29723: Failed to attach to the global enqueue service (status=32)

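Judging from the error description on Metalink, the problem appears to be caused by libskgxn9.so being inconsistent between the two instances of the RAC environment.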
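Whether two copies of a library really differ can be checked with a checksum comparison. A minimal sketch using the POSIX `cksum` utility follows; the helper name `same_checksum` is an assumption for illustration, and how the second node's copy is fetched (for example with scp) is left open.

```shell
# Succeed (exit status 0) only when both files are readable and have the
# same CRC and size according to cksum.
same_checksum() {
    [ -r "$1" ] && [ -r "$2" ] &&
        [ "$(cksum < "$1")" = "$(cksum < "$2")" ]
}
```

For example, after copying the other node's `$ORACLE_HOME/lib/libskgxn9.so` to a local temporary file, `same_checksum "$ORACLE_HOME/lib/libskgxn9.so" /tmp/libskgxn9.so.node2 || echo "libraries differ"` would flag a mismatch. The same check can compare libskgxn9.so with libcmdll.so on one node, since the relink output in this post copies libcmdll.so over libskgxn9.so.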

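Resolution:

1. Because the system was upgraded from Oracle 9.2.0.4 to 9.2.0.7, and the 9.2.0.7 patch set ships only the RDBMS software with no upgraded ORACM, the 9.2.0.6 patch set must also be used to upgrade oracm 9.2.0.2 to oracm 9.2.0.6.0.52. Be sure to follow the README strictly.

2. After upgrading the Oracle RDBMS and ORACM 9.2.0.6, scripts such as catproc.sql still have to be run to update the data dictionary. These steps are all in the README.

3. Some bugs are not published openly and cannot be found through Google or Baidu; they are visible only on Metalink. The alert log alone is also not enough: it has to be read together with the trace files, the DIAG log, the CM log and so on.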
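One simple way to read several logs together is to collect the ORA error codes from all of them in one pass and then search Metalink for the codes that stand out. The sketch below is an illustration only; `ora_error_counts` is a hypothetical name.

```shell
# List each distinct ORA-nnnnn code found in the given files, followed by
# the number of times it occurs across all of them.
ora_error_counts() {
    grep -h -o -E 'ORA-[0-9]+' "$@" |
        sort | uniq -c |
        awk '{ print $2, $1 }'
}
```

In this case it would be called with the alert log plus the LMON and DIAG trace files, and would surface ORA-29702, ORA-00447 and ORA-29723 side by side.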

Bug 4390716 Linux: "CMCLI WARNING" messages after applying 9.2.0.6 / 7

This note gives a brief overview of bug 4390716.

Affects:

Product (Component): Oracle Server (Rdbms)
Range of versions believed to be affected: Versions >= 9.2.0.6
Versions confirmed as being affected: 9.2.0.6, 9.2.0.7
Platforms affected: Linux 32bit

It is believed to be a regression in default behaviour thus:
Regression introduced in 9.2.0.6

Fixed:

This issue is fixed in

  • (None Specified)

Symptoms:

Related To:

  • RAC (Real Application Clusters) / OPS

Description

After applying 9.2.0.6 or 9.2.0.7 Patch Set on Linux
platforms then RAC installations may start reporting
numerous errors to trace files of the form:
  CMCLI WARNING: CMInitContext:  init ctx(0xae5c9a4)
  CMCLI WARNING: CommonContextCleanup:  closing comm port
This can lead to disk full and instance crash scenarios.
Workaround:
  After installation of the Patch Set ensure that the
  following steps are executed on ALL nodes of the RAC
  cluster:
   cd $ORACLE_HOME/rdbms/lib
   Shut down all the instances in the OH
   make -f ins_rdbms.mk rac_on ioracle
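Because the relink must only run while every instance is down, a small guard around the make command can prevent mistakes. This is a sketch under stated assumptions: it treats any `ora_pmon_<SID>` process as a running instance (regardless of which Oracle home it belongs to, which is deliberately conservative), and the function names `running_instances` and `relink_rac` are invented for this example.

```shell
# Read "ps" output (one command line per input line) on stdin and print
# the SIDs of running instances, identified by their ora_pmon_<SID>
# background process.
running_instances() {
    sed -n 's/^ora_pmon_\([A-Za-z0-9_]*\).*/\1/p' | sort -u
}

# Relink only when no instance is running; otherwise report and fail.
# Assumes ORACLE_HOME is set. The make command is the one from the bug note.
relink_rac() {
    sids=$(ps -e -o args | running_instances)
    if [ -n "$sids" ]; then
        echo "instances still running: $sids" >&2
        return 1
    fi
    # Subshell so the caller's working directory is left unchanged.
    ( cd "$ORACLE_HOME/rdbms/lib" && make -f ins_rdbms.mk rac_on ioracle )
}
```

`relink_rac` would be run as the Oracle software owner on each node in turn.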

Run the workaround commands from the bug note on every node, with all instances in the Oracle home shut down:

cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk rac_on ioracle

Actual execution:

DELL-RAC01$
DELL-RAC01$make -f ins_rdbms.mk rac_on ioracle
rm -f /u01/product/oracle/lib/libskgxp9.so
cp /u01/product/oracle/lib//libskgxpu.so /u01/product/oracle/lib/libskgxp9.so
cp /u01/product/oracle/lib/libcmdll.so /u01/product/oracle/lib/libskgxn9.so
/usr/bin/ar cr /u01/product/oracle/rdbms/lib/libknlopt.a /u01/product/oracle/rdbms/lib/kcsm.o
- Linking Oracle
rm -f /u01/product/oracle/rdbms/lib/oracle
gcc -o /u01/product/oracle/rdbms/lib/oracle -L/u01/product/oracle/rdbms/lib/ -L/u01/product/oracle/lib/ -L/u01/product/oracle/lib/stubs/ -Wl,-E `test -f /u01/product/oracle/rdbms/lib/skgaioi.o && echo /u01/product/oracle/rdbms/lib/skgaioi.o` /u01/product/oracle/rdbms/lib/opimai.o /u01/product/oracle/rdbms/lib/ssoraed.o /u01/product/oracle/rdbms/lib/ttcsoi.o /u01/product/oracle/lib/nautab.o /u01/product/oracle/lib/naeet.o /u01/product/oracle/lib/naect.o /u01/product/oracle/lib/naedhs.o /u01/product/oracle/rdbms/lib/config.o -lserver9 -lodm9 -lskgxp9 -lskgxn9 -lclient9 -lvsn9 -lwtcserver9 -lcommon9 -lgeneric9 /u01/product/oracle/rdbms/lib/defopt.o -lknlopt `if /usr/bin/ar tv /u01/product/oracle/rdbms/lib/libknlopt.a | grep xsyeolap.o > /dev/null 2>&1 ; then echo "-loraolap9" ; fi` -lslax9 -lpls9 -lplp9 -lserver9 -lclient9 -lvsn9 -lwtcserver9 -lcommon9 -lgeneric9 -lknlopt -lslax9 -lpls9 -lplp9 -ljox9 -lserver9 -locijdbcst9 -lwwg9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lnro9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lmm -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lnro9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -ltrace9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `if /usr/bin/ar tv /u01/product/oracle/rdbms/lib/libknlopt.a | grep "kxmnsd.o" > /dev/null 2>&1 ; then echo " " ; else echo "-lordsdo9"; fi` -lctxc9 -lctx9 -lzx9 -lgx9 -lctx9 -lzx9 -lgx9 -lordimt9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 -lsnls9 -lunls9 -lxsd9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `cat /u01/product/oracle/lib/sysliblist` -Wl,-rpath,/u01/product/oracle/lib:/lib/i686:/lib:/usr/lib -lm `cat /u01/product/oracle/lib/sysliblist` -ldl -lm `test -f /u01/product/oracle/rdbms/lib/skgaioi.o && 
echo -laio`

mv -f /u01/product/oracle/bin/oracle /u01/product/oracle/bin/oracleO
mv /u01/product/oracle/rdbms/lib/oracle /u01/product/oracle/bin/oracle
chmod 6751 /u01/product/oracle/bin/oracle
DELL-RAC01$q
-bash: q: command not found
DELL-RAC01$

DELL-RAC02$
DELL-RAC02$cd $ORACLE_HOME/rdbms/lib
DELL-RAC02$make -f ins_rdbms.mk rac_on ioracle
rm -f /u01/product/oracle/lib/libskgxp9.so
cp /u01/product/oracle/lib//libskgxpu.so /u01/product/oracle/lib/libskgxp9.so
cp /u01/product/oracle/lib/libcmdll.so /u01/product/oracle/lib/libskgxn9.so
/usr/bin/ar cr /u01/product/oracle/rdbms/lib/libknlopt.a /u01/product/oracle/rdbms/lib/kcsm.o
- Linking Oracle
rm -f /u01/product/oracle/rdbms/lib/oracle
gcc -o /u01/product/oracle/rdbms/lib/oracle -L/u01/product/oracle/rdbms/lib/ -L/u01/product/oracle/lib/ -L/u01/product/oracle/lib/stubs/ -Wl,-E `test -f /u01/product/oracle/rdbms/lib/skgaioi.o && echo /u01/product/oracle/rdbms/lib/skgaioi.o` /u01/product/oracle/rdbms/lib/opimai.o /u01/product/oracle/rdbms/lib/ssoraed.o /u01/product/oracle/rdbms/lib/ttcsoi.o /u01/product/oracle/lib/nautab.o /u01/product/oracle/lib/naeet.o /u01/product/oracle/lib/naect.o /u01/product/oracle/lib/naedhs.o /u01/product/oracle/rdbms/lib/config.o -lserver9 -lodm9 -lskgxp9 -lskgxn9 -lclient9 -lvsn9 -lwtcserver9 -lcommon9 -lgeneric9 /u01/product/oracle/rdbms/lib/defopt.o -lknlopt `if /usr/bin/ar tv /u01/product/oracle/rdbms/lib/libknlopt.a | grep xsyeolap.o > /dev/null 2>&1 ; then echo "-loraolap9" ; fi` -lslax9 -lpls9 -lplp9 -lserver9 -lclient9 -lvsn9 -lwtcserver9 -lcommon9 -lgeneric9 -lknlopt -lslax9 -lpls9 -lplp9 -ljox9 -lserver9 -locijdbcst9 -lwwg9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lnro9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lmm -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -lnro9 `cat /u01/product/oracle/lib/ldflags` -lnsslb9 -lncrypt9 -lnsgr9 -lnzjs9 -ln9 -lnl9 -ltrace9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `if /usr/bin/ar tv /u01/product/oracle/rdbms/lib/libknlopt.a | grep "kxmnsd.o" > /dev/null 2>&1 ; then echo " " ; else echo "-lordsdo9"; fi` -lctxc9 -lctx9 -lzx9 -lgx9 -lctx9 -lzx9 -lgx9 -lordimt9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 -lsnls9 -lunls9 -lxsd9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 `cat /u01/product/oracle/lib/sysliblist` -Wl,-rpath,/u01/product/oracle/lib:/lib/i686:/lib:/usr/lib -lm `cat /u01/product/oracle/lib/sysliblist` -ldl -lm `test -f /u01/product/oracle/rdbms/lib/skgaioi.o && 
echo -laio`

mv -f /u01/product/oracle/bin/oracle /u01/product/oracle/bin/oracleO
mv /u01/product/oracle/rdbms/lib/oracle /u01/product/oracle/bin/oracle
chmod 6751 /u01/product/oracle/bin/oracle
DELL-RAC02$cd
DELL-RAC02$

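After two weeks of observation the problem had not recurred. The instance crash that used to happen once every 67 days is gone, the logs are back to normal, and no similar errors appear in the CM log. The bug is resolved.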
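A quick way to confirm that the trace files have stopped filling up is to count the "CMCLI WARNING" lines per trace file before and after the relink. A minimal sketch, assuming the traces live in one directory as `*.trc` files; the name `cmcli_warning_counts` is illustrative.

```shell
# Print "<count> <file>" for each .trc file in the given directory that
# contains at least one "CMCLI WARNING" line.
cmcli_warning_counts() {
    # /dev/null is added so grep always prefixes counts with the file name,
    # even when the directory holds a single trace file.
    grep -c 'CMCLI WARNING' "$1"/*.trc /dev/null 2>/dev/null |
        awk -F: '$NF > 0 { n = $NF; sub(/:[0-9]+$/, ""); print n, $0 }'
}
```

Pointed at the bdump directory (here /u01/product/admin/orcl/bdump), the counts should stop growing once the fix is in place.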

For the whole process, see: http://www.itpub.net/viewthread.php?tid=922265&highlight=%2Btolywang

 

From the ITPUB blog, link: http://blog.itpub.net/35489/viewspace-558261/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
