ITPub博客

首页 > Linux操作系统 > Linux操作系统 > rac宕机分析

rac宕机分析

原创 Linux操作系统 作者:tonglei2000 时间:2012-02-02 15:07:22 0 删除 编辑
首先,在47分的时候实例4发起驱逐实例2的事件,实例2首先crash。
这是实例4的log
Wed Jan 11 09:47:07 GMT+08:00 2012Evicting instance 2 from cluster
Wed Jan 11 09:47:14 GMT+08:00 2012Reconfiguration started (old inc 12, new inc 16)
List of nodes:
 0 2 3
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 * domain 0 valid = 0 according to instance 0
Wed Jan 11 09:47:14 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
这是实例3的log
Wed Jan 11 09:47:07 GMT+08:00 2012Evicting instance 2 from cluster
Wed Jan 11 09:47:14 GMT+08:00 2012Reconfiguration started (old inc 12, new inc 16)
List of nodes:
 0 2 3
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 * domain 0 valid = 0 according to instance 0
Wed Jan 11 09:47:14 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
这是实例1的log
Wed Jan 11 09:47:07 GMT+08:00 2012Evicting instance 2 from cluster
Wed Jan 11 09:47:14 GMT+08:00 2012Reconfiguration started (old inc 12, new inc 16)
List of nodes:
 0 2 3
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 * domain 0 not valid according to instance 2
 * domain 0 not valid according to instance 3
Wed Jan 11 09:47:14 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
这是实例2的log
Wed Jan 11 09:47:11 GMT+08:00 2012Received an instance abort message from instance 4 (reason 0x1)
Wed Jan 11 09:47:11 GMT+08:00 2012Received an instance abort message from instance 4 (reason 0x1)
Wed Jan 11 09:47:11 GMT+08:00 2012Please check instance 4 alert and LMON trace files for detail.
Wed Jan 11 09:47:11 GMT+08:00 2012Please check instance 4 alert and LMON trace files for detail.
Wed Jan 11 09:47:11 GMT+08:00 2012LMD0: terminating instance due to error 481
Instance terminated by LMD0, pid = 4325382
Wed Jan 11 09:51:33 GMT+08:00 2012Starting ORACLE instance (normal)

然后实例2在51分开始启动,52分完成。
Wed Jan 11 09:52:40 GMT+08:00 2012Reconfiguration started (old inc 20, new inc 22)
List of nodes:
 0 1 2 3
 Global Resource Directory frozen
 Communication channels reestablished
 * domain 0 valid = 1 according to instance 0
Wed Jan 11 09:52:40 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 
实例1在51分的时候因为自己的asm异常而导致crash,具体log如下
Wed Jan 11 09:51:13 GMT+08:00 2012Errors in file /u01/app/oracle/admin/zhpt/bdump/zhpt1_asmb_6947490.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Wed Jan 11 09:51:21 GMT+08:00 2012ASMB: terminating instance due to error 15064
Instance terminated by ASMB, pid = 6947490
 
在实例4上可以看到当时2没有完全起来,而1已经crash的状态
Wed Jan 11 09:48:46 GMT+08:00 2012 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Wed Jan 11 09:51:27 GMT+08:00 2012Reconfiguration started (old inc 16, new inc 18)
List of nodes:
 2 3
 Nested reconfiguration detected.
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 * domain 0 valid = 0 according to instance 2
从上面还可以看出1的crash在crs中没有信息,说明它的crash和crs无关,而2的crash是crs发起的,日志是不一样的。

实例3的日志
Wed Jan 11 09:49:12 GMT+08:00 2012Completed redo scan
 196588 redo blocks read, 29797 data blocks need recovery
 Fix write in gcs resources
Wed Jan 11 09:51:27 GMT+08:00 2012Reconfiguration started (old inc 16, new inc 18)
List of nodes:
 2 3
 Nested reconfiguration detected.
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 * domain 0 not valid according to instance 3
Wed Jan 11 09:51:27 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out

实例2的日志没有同样的信息是因为它当时正在open中。

下面是实例4在2和1先后好了以后记录的log
Wed Jan 11 09:51:29 GMT+08:00 2012 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Wed Jan 11 09:51:43 GMT+08:00 2012Reconfiguration started (old inc 18, new inc 20)
List of nodes:
 1 2 3
 Global Resource Directory frozen
 Communication channels reestablished
 * domain 0 valid = 0 according to instance 2
Wed Jan 11 09:51:43 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 

Wed Jan 11 09:51:45 GMT+08:00 2012 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Wed Jan 11 09:52:40 GMT+08:00 2012Reconfiguration started (old inc 20, new inc 22)
List of nodes:
 0 1 2 3
 Global Resource Directory frozen
 Communication channels reestablished
 * domain 0 valid = 1 according to instance 1
Wed Jan 11 09:52:40 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out

下面是实例3在2和1先后好了以后记录的log

Wed Jan 11 09:51:32 GMT+08:00 2012Completed redo scan
 390680 redo blocks read, 58663 data blocks need recovery
Wed Jan 11 09:51:43 GMT+08:00 2012Reconfiguration started (old inc 18, new inc 20)
List of nodes:
 1 2 3
 Global Resource Directory frozen
 Communication channels reestablished
 * domain 0 valid = 0 according to instance 1
Wed Jan 11 09:51:43 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out

Wed Jan 11 09:52:02 GMT+08:00 2012Completed redo application
Wed Jan 11 09:52:11 GMT+08:00 2012Completed instance recovery at
 Thread 1: logseq 10747, block 1136082, scn 12622806928332
 Thread 2: logseq 8082, block 1925780, scn 12622806926550
 58633 data blocks read, 58725 data blocks written, 390680 redo blocks read
Wed Jan 11 09:52:40 GMT+08:00 2012Reconfiguration started (old inc 20, new inc 22)
List of nodes:
 0 1 2 3
 Global Resource Directory frozen
 Communication channels reestablished
 * domain 0 valid = 1 according to instance 1
Wed Jan 11 09:52:40 GMT+08:00 2012 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Wed Jan 11 09:52:40 GMT+08:00 2012 LMS 3: 0 GCS shadows cancelled, 0 closed
 

实例2crash的原因
在实例2上,有一条日志
*** 2012-01-11 09:45:21.918
kjxggpoll: change poll time to 50 ms
然后各个节点开始
 
 

实例1crash的原因
database上的alert里面提示看asm的log
Wed Jan 11 09:51:13 GMT+08:00 2012Errors in file /u01/app/oracle/admin/zhpt/bdump/zhpt1_asmb_6947490.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Wed Jan 11 09:51:21 GMT+08:00 2012ASMB: terminating instance due to error 15064
Instance terminated by ASMB, pid = 6947490
database的zhpt1_asmb_6947490.trc里面提示
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/10.2.0/db_1
System name:    AIX
Node name:      jtzhptdb1
Release:        1
Version:        6
Machine:        00F6EEBB4C00
Instance name: zhpt1
Redo thread mounted by this instance: 1
Oracle process number: 43
Unix process pid: 6947490, image: oracle@jtzhptdb1 (ASMB)
*** 2012-01-11 09:51:13.858
*** SERVICE NAME:(SYS$BACKGROUND) 2012-01-11 09:51:10.117
*** SESSION ID:(1598.1) 2012-01-11 09:51:10.117
error 15064 detected in background process
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
ksuitm: waiting up to [5] seconds before killing DIAG(17957082)
 
asm的log里面提示
Wed Jan 11 09:50:55 GMT+08:00 2012LMD0 (ospid: 8192988) is not heartbeating for 216 seconds.
LMON detects unhealthy receivers.
Please check LMON and DIAG trace files for detail.
Wed Jan 11 09:51:00 GMT+08:00 2012LMON (ospid: 8520498) is terminating the instance.
LMON: terminating instance due to error 481
Termination issued to instance processes. Waiting for the processes to exit
Wed Jan 11 09:51:14 GMT+08:00 2012Instance termination failed to kill one or more processes
Instance terminated by LMON, pid = 8520498
Wed Jan 11 09:52:10 GMT+08:00 2012Starting ORACLE instance (normal)
 
lmon的log
Repeat 3 times
*** 2012-01-11 09:50:55.139
kjfmrcvrchk: receiver LMD[0] has no heartbeat for 216 sec (1326246437.1326246653.0).
kjfmrcvrchk: Dumping callstack of lmd0
Submitting asynchronized dump request [20]
kjfmrcvrchk: receivers are not healthy. kill instance.
ksuitm: waiting up to [5] seconds before killing DIAG(8258490)
*** 2012-01-11 09:51:14.825
Instance termination failed to kill one or more processes
ksuitm_check: OS PID=10616892 is still alive
*** 2012-01-11 09:51:14.825
Process diagnostic dump for oracle@jtzhptdb1 (LCK0), OS id=10616892,
pid: 15, proc_ser: 1, sid: 432, sess_ser: 1
-------------------------------------------------------------------------------
loadavg : 2.12 2.51 3.42
swap info: free_mem = 6406.91M rsv = 192.00M
           alloc = 15371.82M avail = 49152.00M swap_free = 33780.18M
       F S      UID      PID     PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
Short stack dump: ORA-00072: process "Unix process pid: 10616892, image: oracle@jtzhptdb1 (LCK0)" is not active

lmod的log
*** 2012-01-11 09:49:01.111
KJM_HISTORY: RCVR STALL OP(12) context 8 elapsed 99907602 us
KJM HIST LMD0:
  12:8:99907602 7:0 6:1 10:28 11:1 15:0 12:78144 7:0 6:1 10:31
  11:0 15:0 12:78149 7:1 6:1 10:22 2:33 1:0 11:1 15:0
  12:78147 7:1 6:0 10:37 11:1 15:1 12:78148 7:2 6:0 10:41
  11:1 15:1 12:78154 7:1 6:0 10:38 11:0 15:1 12:78148 7:1
  6:0 10:34 11:0 15:1 12:78150 7:2 6:0 10:62 11:2 15:1
  12:78161 7:0 6:0 10:30 11:0 15:0 12:78149 7:1 6:0 10:76
  11:1 15:2 12:78164 7:0
----------------------------------------
SO: 70000001709cab0, type: 4, owner: 70000001754ca80, flag: INIT/-/-/0x00
  (session) sid: 441 trans: 0, creator: 70000001754ca80, flag: (51) USR/- BSY/-/-/-/-/-
            DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
            txn branch: 0
            oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
  last wait for 'ges remote message' wait_time=1 min 17 sec, seconds since wait started=2616
          waittime=40, loop=0, p3=0
          blocking sess=0x0 seq=11868
  Dumping Session Wait History
   for 'ges remote message' count=1 wait_time=1 min 17 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078142 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078146 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078143 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078145 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078149 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078145 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078147 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078152 sec
          waittime=40, loop=0, p3=0
   for 'ges remote message' count=1 wait_time=0.078147 sec
          waittime=40, loop=0, p3=0
 
 
 
 
 
 
 
 
 
 
 

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/13337857/viewspace-715519/,如需转载,请注明出处,否则将追究法律责任。

上一篇: 10g rac宕机的猜想
请登录后发表评论 登录
全部评论

注册时间:2010-10-19

  • 博文量
    109
  • 访问量
    212508