ITPub博客

首页 > Linux操作系统 > Linux操作系统 > 第一次RAC宕机记录(zt)

第一次RAC宕机记录(zt)

原创 Linux操作系统 作者:tolywang 时间:2007-12-13 00:00:00 0 删除 编辑
9月24日下午16点,收到报警说rac中的一台机器负载狂高,马上登录上去,发现load average已经快20了。登录另一个节点查看,却发现另一个节点怎么也连接不上去,初步断定第二个节点肯定操作系统出现了问题,正好有同事在机房,连接显示器和键盘到服务器上去,显示器没有任何显示,估计节点二死机了。找了一会没有发现任何线索,于是先把节点二重新启动,启动后一切恢复正常,直到今天一个月过去了,没有出现过异常。由于但是对rac知之甚少,无重下手,到今天也没整明白宕机的原因,关键是没有办法重现这个错误,只好先记录在此,以后待查。

主机二上的message日志,看不出什么异常,16点36分31秒是服务器手工启动的时间:
Sep 23 10:50:52 billdb2 sshd(pam_unix)[11928]: session opened for user oracle by (uid=0)
Sep 23 13:03:47 billdb2 sshd(pam_unix)[11928]: session closed for user oracle
Sep 24 09:05:19 billdb2 sshd(pam_unix)[10119]: session opened for user oracle by (uid=0)
Sep 24 16:36:31 billdb2 syslogd 1.4.1: restart.
Sep 24 16:36:31 billdb2 syslog: syslogd startup succeeded

主机二上的cssd日志,可以看到15点38分宕之前一切正常:
[ CSSD]2007-09-24 15:35:36.977 [1157658976] >TRACE: clssgmClientConnectMsg: Connect from con(0x7a3cb0) proc(0x7a4700) pid(14719
) proto(10:2:1:1)
[ CSSD]2007-09-24 15:36:05.091 [1157658976] >TRACE: clssgmClientConnectMsg: Connect from con(0x7f6330) proc(0x80d710) pid() pro
to(10:2:1:1)
[ CSSD]2007-09-24 15:37:05.395 [1157658976] >TRACE: clssgmClientConnectMsg: Connect from con(0x7ad220) proc(0x80d710) pid() pro
to(10:2:1:1)
[ CSSD]2007-09-24 15:38:05.692 [1157658976] >TRACE: clssgmClientConnectMsg: Connect from con(0x7a8e50) proc(0x80d710) pid() pro
to(10:2:1:1)
[ CSSD]2007-09-24 16:38:01.083 >USER: Oracle Database 10g CSS Release 10.2.0.2.0 Production Copyright 1996, 2004 Oracle. All
rights reserved.
[ clsdmt]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=billdb2DBG_CSSD))
[ CSSD]2007-09-24 16:38:01.083 >USER: CSS daemon log for node billdb2, number 2, in cluster crs
[ CSSD]2007-09-24 16:38:01.098 [2538094272] >TRACE: clssscmain: local-only set to false
[ CSSD]2007-09-24 16:38:01.105 [2538094272] >TRACE: clssnmReadNodeInfo: added node 1 (billdb1) to cluster
[ CSSD]2007-09-24 16:38:01.109 [2538094272] >TRACE: clssnmReadNodeInfo: added node 2 (billdb2) to cluster
[ CSSD]2007-09-24 16:38:01.113 [1115699552] >TRACE: clssnm_skgxnmon: skgxn init failed, rc 1
[ CSSD]2007-09-24 16:38:01.113 [2538094272] >TRACE: clssnm_skgxnonline: Using vacuous skgxn monitor
[ CSSD]2007-09-24 16:38:01.114 [2538094272] >TRACE: clssnmNMInitialize: misscount set to (120), impending reconfig threshold se
t to (116)

主机二上的ALERT日志:
Mon Sep 24 15:27:51 2007
Thread 2 advanced to log sequence 1466
Current log# 4 seq# 1466 mem# 0: +DATA/oradata/billdb/redo4_01.log
Current log# 4 seq# 1466 mem# 1: +DATA/oradata/billdb/redo4_02.log
Mon Sep 24 15:27:51 2007
ARC3: LGWR is actively archiving destination LOG_ARCHIVE_DEST_3
Mon Sep 24 15:29:00 2007
LOGMINER: End mining logfile: +DATA/oradata/billdb/redo3_01.log
Mon Sep 24 15:29:00 2007
LOGMINER: Begin mining logfile: +DATA/oradata/billdb/redo4_01.log
Mon Sep 24 16:44:13 2007
Starting ORACLE instance (normal)
Mon Sep 24 16:44:13 2007

主机一上的message日志,可以看到有OCFS的信息:
Sep 24 10:24:33 billdb1 sshd(pam_unix)[15577]: session closed for user oracle
Sep 24 15:38:55 billdb1 kernel: o2net: connection to node billdb2 (num 1) at 192.168.1.101:7777 has been idle for 10 seconds, shutti
ng it down.
Sep 24 15:38:55 billdb1 kernel: (0,5):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1190619525
.629518 now 1190619535.627382 dr 1190619525.629491 adv 1190619525.629523:1190619525.629525 func (63631f10:504) 1190618890.711057:119
0618890.711071)
Sep 24 15:38:55 billdb1 kernel: o2net: no longer connected to node billdb2 (num 1) at 192.168.1.101:7777
Sep 24 15:38:56 billdb1 kernel: tg3: eth2: Link is down.
Sep 24 15:38:56 billdb1 kernel: bonding: bond1: link status definitely down for interface eth2, disabling it
Sep 24 15:38:56 billdb1 kernel: bonding: bond1: making interface eth3 the new active one.
Sep 24 15:38:58 billdb1 kernel: tg3: eth2: Link is up at 100 Mbps, full duplex.
Sep 24 15:38:58 billdb1 kernel: tg3: eth2: Flow control is off for TX and off for RX.
Sep 24 15:38:58 billdb1 kernel: bonding: bond1: link status definitely up for interface eth2.
Sep 24 15:40:52 billdb1 kernel: (24105,7):ocfs2_replay_journal:1174 Recovering node 1 from slot 0 on device (120,33)
Sep 24 15:40:55 billdb1 kernel: kjournald starting. Commit interval 5 seconds
Sep 24 16:03:47 billdb1 sshd(pam_unix)[4642]: session opened for user oracle by (uid=0)
Sep 24 16:04:06 billdb1 sshd(pam_unix)[4642]: session closed for user oracle
Sep 24 16:04:34 billdb1 sshd(pam_unix)[19966]: session closed for user oracle
Sep 24 16:07:02 billdb1 sshd(pam_unix)[20362]: session closed for user oracle
Sep 24 16:07:03 billdb1 sshd(pam_unix)[14637]: session opened for user oracle by (uid=0)
Sep 24 16:08:47 billdb1 sshd(pam_unix)[21097]: session opened for user oracle by (uid=0)

主机一上的cssd日志:
[ CSSD]2007-09-24 15:37:56.550 [1157658976] >TRACE: clssgmClientConnectMsg: Connect from con(0x2a97776a60) proc(0x2a97779700) p
id() proto(10:2:1:1)
[ CSSD]2007-09-24 15:38:03.351 [1157658976] >TRACE: clssgmClientConnectMsg: Connect from con(0x2a9779ae50) proc(0x2a97752e00) p
id() proto(10:2:1:1)
[ CSSD]2007-09-24 15:38:51.107 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) missed(2) checkin(s)
[ CSSD]2007-09-24 15:38:52.109 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) missed(3) checkin(s)
[ CSSD]2007-09-24 15:38:53.110 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) missed(4) checkin(s)
[ CSSD]2007-09-24 15:38:54.113 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) missed(5) checkin(s)
。。。。。。。
[ CSSD]2007-09-24 15:40:43.331 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) missed(114) checkin(s)
[ CSSD]2007-09-24 15:40:44.333 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) missed(115) checkin(s)
[ CSSD]2007-09-24 15:40:45.335 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) is impending reconfig
[ CSSD]2007-09-24 15:40:45.335 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) missed(116) checkin(s)
[ CSSD]2007-09-24 15:40:45.335 [1199618400] >TRACE: clssnmPollingThread: diskTimeout set to (117000)ms impending reconfig statu
s(1)
[ CSSD]2007-09-24 15:40:46.336 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) is impending reconfig
[ CSSD]2007-09-24 15:40:46.336 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) missed(117) checkin(s)
[ CSSD]2007-09-24 15:40:47.338 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) is impending reconfig
[ CSSD]2007-09-24 15:40:47.338 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) missed(118) checkin(s)
[ CSSD]2007-09-24 15:40:48.340 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) is impending reconfig
[ CSSD]2007-09-24 15:40:48.340 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) missed(119) checkin(s)
[ CSSD]2007-09-24 15:40:49.342 [1199618400] >TRACE: clssnmPollingThread: node billdb2 (2) is impending reconfig
[ CSSD]2007-09-24 15:40:49.342 [1199618400] >TRACE: clssnmPollingThread: Eviction started for node billdb2 (2), flags 0x000d, s
tate 3, wt4c 0
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmDoSyncUpdate: Initiating sync 13
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (117000)ms
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmSetupAckWait: Ack message type (11)
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmSetupAckWait: node(1) is ALIVE
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmSetupAckWait: node(2) is ALIVE
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmSendSync: syncSeqNo(13)
[ CSSD]2007-09-24 15:40:49.343 [1147169120] >TRACE: clssnmHandleSync: Acknowledging sync: src[1] srcName[billdb1] seq[9] sync[1
3]
[ CSSD]2007-09-24 15:40:49.343 [1147169120] >TRACE: clssnmHandleSync: diskTimeout set to (117000)ms
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(2)
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmWaitForAcks: node(2) is expiring, msg type(11)
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmWaitForAcks: done, msg type(11)
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmDoSyncUpdate: node(0) missCount(2019416) state(0)
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmDoSyncUpdate: node(2) missCount(120) state(3)
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmSetupAckWait: Ack message type (13)
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmSetupAckWait: node(1) is ACTIVE
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmSendVote: syncSeqNo(13)
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(1)
[ CSSD]2007-09-24 15:40:49.343 [1147169120] >TRACE: clssnmSendVoteInfo: node(1) syncSeqNo(13)
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmWaitForAcks: done, msg type(13)
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmEvict: Start
[ CSSD]2007-09-24 15:40:49.343 [1220598112] >TRACE: clssnmEvict: Evicting node 2, birth 12, death 13, killme 1
[ CSSD]2007-09-24 15:40:49.360 [1220598112] >TRACE: clssnmEvict: Evicting Node(2), timeout(120140)
[ CSSD]2007-09-24 15:40:49.360 [1220598112] >TRACE: clssnmSendShutdown: req to node 2, kill time 2022549754
[ CSSD]2007-09-24 15:40:49.360 [1220598112] >TRACE: clssnmDiscHelper: node billdb2 (2) connection failed
[ CSSD]2007-09-24 15:40:49.360 [1220598112] >TRACE: clssnmWaitOnEvictions: Start
[ CSSD]2007-09-24 15:40:49.360 [1220598112] >TRACE: clssnmWaitOnEvictions: Node(2) down, LATS(2022429614),timeout(120160)
[ CSSD]2007-09-24 15:40:49.360 [1220598112] >TRACE: clssnmSetupAckWait: Ack message type (15)
[ CSSD]2007-09-24 15:40:49.360 [1220598112] >TRACE: clssnmSetupAckWait: node(1) is ACTIVE

节点一上的alert日志:
Mon Sep 24 15:40:53 2007
Reconfiguration started (old inc 44, new inc 46)
List of nodes:
0
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Mon Sep 24 15:40:53 2007
LMS 0: 37 GCS shadows cancelled, 2 closed
Mon Sep 24 15:40:53 2007
LMS 1: 28 GCS shadows cancelled, 2 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Mon Sep 24 15:40:54 2007
LMS 1: 114136 GCS shadows traversed, 0 replayed
Mon Sep 24 15:40:54 2007
LMS 0: 114634 GCS shadows traversed, 0 replayed
Mon Sep 24 15:40:54 2007
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Mon Sep 24 15:40:54 2007
Instance recovery: looking for dead threads
Mon Sep 24 15:40:54 2007
Beginning instance recovery of 1 threads
Reconfiguration complete

里面有两个可疑的地方,一个是ocfs超时引起的第二节点被重启,一个是vote磁盘访问超时引起的。系统中的参数设置如下:
[oracle@db1 proc]$ crsctl get css misscount
120
这个120正好是节点一的cssd日志中记录的miss次数
#
[oracle@db1 proc]$ more /etc/sysconfig/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running /etc/init.d/o2cb configure.
# Please use that method to modify this file
#

# O2CB_ENABELED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=61

OCFS心跳设置的阀值是61,超时应该是120秒


:
?

-->

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/35489/viewspace-85108/,如需转载,请注明出处,否则将追究法律责任。

请登录后发表评论 登录
全部评论
Oracle , MySQL, SAP IQ, SAP HANA, PostgreSQL, Tableau 技术讨论,希望在这里一起分享知识,讨论技术,畅谈人生 。

注册时间:2007-12-10

  • 博文量
    5595
  • 访问量
    13107715