Oracle RAC (Cluster) Reconfiguration Notes (3)

Category: Linux Operating System   Author: 东方友诚   Date: 2016-04-01 15:33:28

node2 alert.log:

Sat Jul 09 16:41:28 CST 2011

Reconfiguration started (old inc 2, new inc 4)

List of nodes:

 0 1

 Global Resource Directory frozen

 Communication channels reestablished

 Master broadcasted resource hash value bitmaps

 Non-local Process blocks cleaned out

Sat Jul 09 16:41:29 CST 2011

 LMS 0: 0 GCS shadows cancelled, 0 closed

 Set master node info

 Submitted all remote-enqueue requests

 Dwn-cvts replayed, VALBLKs dubious

 All grantable enqueues granted

Sat Jul 09 16:41:30 CST 2011

 LMS 0: 5074 GCS shadows traversed, 2242 replayed

Sat Jul 09 16:41:30 CST 2011

 Submitted all GCS remote-cache requests

 Post SMON to start 1st pass IR

 Fix write in gcs resources

Reconfiguration complete

 

 

node1 alert.log (node2 shutdown abort):

Sat Jul 09 17:32:37 CST 2011

Reconfiguration started (old inc 4, new inc 6)

List of nodes:

 0

 Global Resource Directory frozen

 * dead instance detected - domain 0 invalid = TRUE

 Communication channels reestablished

 Master broadcasted resource hash value bitmaps

 Non-local Process blocks cleaned out

Sat Jul 09 17:32:38 CST 2011

 LMS 0: 0 GCS shadows cancelled, 0 closed

 Set master node info

 Submitted all remote-enqueue requests

 Dwn-cvts replayed, VALBLKs dubious

 All grantable enqueues granted

 Post SMON to start 1st pass IR

Sat Jul 09 17:32:39 CST 2011

 LMS 0: 5947 GCS shadows traversed, 0 replayed

Sat Jul 09 17:32:39 CST 2011

 Submitted all GCS remote-cache requests

 Fix write in gcs resources

Reconfiguration complete

Sat Jul 09 17:32:40 CST 2011

Instance recovery: looking for dead threads

Sat Jul 09 17:32:40 CST 2011

Beginning instance recovery of 1 threads

Sat Jul 09 17:32:42 CST 2011

Started redo scan

Sat Jul 09 17:32:46 CST 2011

Completed redo scan

 3 redo blocks read, 5 data blocks need recovery

Sat Jul 09 17:32:46 CST 2011

Started redo application at

 Thread 2: logseq 5, block 1884

Sat Jul 09 17:32:47 CST 2011

Recovery of Online Redo Log: Thread 2 Group 3 Seq 5 Reading mem 0

  Mem# 0: +RAC_DISK/racdb/onlinelog/group_3.258.751759681

Sat Jul 09 17:32:47 CST 2011

Completed redo application

Sat Jul 09 17:32:47 CST 2011

Completed instance recovery at

 Thread 2: logseq 5, block 1887, scn 532837

 3 data blocks read, 5 data blocks written, 3 redo blocks read

Sat Jul 09 17:32:48 CST 2011

Thread 2 advanced to log sequence 6 (thread recovery)
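To see how long the surviving node took to reconfigure and recover, the timestamps in an alert.log excerpt like the one above can be paired with the messages that follow them. The following is a minimal sketch, not an Oracle tool: the function names `parse_alert_events` and `elapsed_seconds` are hypothetical, the sample text is an abridged copy of the node1 log, and the parser assumes the timestamp layout shown here (day name, month, day, time, time zone, year).

```python
import re
from datetime import datetime

# Matches alert-log timestamps such as "Sat Jul 09 17:32:37 CST 2011".
TS_RE = re.compile(r"^\w{3} (\w{3} \d{2} \d{2}:\d{2}:\d{2}) \w+ (\d{4})$")

def parse_alert_events(text):
    """Return (timestamp, message) pairs; each message inherits the last timestamp seen."""
    events, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = TS_RE.match(line)
        if m:
            # Drop the time-zone token so parsing does not depend on platform tz names.
            current = datetime.strptime(f"{m.group(1)} {m.group(2)}", "%b %d %H:%M:%S %Y")
        elif current is not None:
            events.append((current, line))
    return events

def elapsed_seconds(events, start_prefix, end_prefix):
    """Seconds between the first message starting with start_prefix and the first with end_prefix."""
    start = next(ts for ts, msg in events if msg.startswith(start_prefix))
    end = next(ts for ts, msg in events if msg.startswith(end_prefix))
    return (end - start).total_seconds()

# Abridged from the node1 alert.log above.
SAMPLE = """\
Sat Jul 09 17:32:37 CST 2011
Reconfiguration started (old inc 4, new inc 6)
Sat Jul 09 17:32:39 CST 2011
Reconfiguration complete
Sat Jul 09 17:32:47 CST 2011
Completed instance recovery at
"""

events = parse_alert_events(SAMPLE)
reconfig = elapsed_seconds(events, "Reconfiguration started", "Reconfiguration complete")
recovery = elapsed_seconds(events, "Reconfiguration started", "Completed instance recovery")
print(reconfig, recovery)  # 2.0 10.0
```

The alert log only has one-second resolution, so these figures are approximate; the LMON trace further below carries millisecond timestamps.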

 

An important service is involved here: Cluster Group Service (CGS).

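LMON: the LMON processes of the instances communicate with each other periodically to check the health of every node in the cluster, and when a node fails, LMON is responsible for cluster reconfiguration. The service it provides is called Cluster Group Service (CGS). Oracle Clusterware addresses split-brain through its Process Monitor Daemon, but if an instance on some node hangs abnormally, the problem may go undetected when viewed only from the network, OS, and Clusterware layers. The database therefore needs its own self-monitoring mechanism. The LMON process provides a Node Monitor function, which records the health of each node at the application layer. Node health is recorded in a bitmap in the GRD, one bit per node: 0 means down and 1 means running normally. The LMON processes on all nodes communicate with each other to confirm that this bitmap is consistent.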
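The one-bit-per-node membership bitmap can be illustrated with a small sketch. This is only a model of the idea described above, not Oracle's internal data structure: the helper names `to_bitmap`, `live_nodes`, and `views_agree` are hypothetical, and the bitmap is represented as a plain Python integer.

```python
def to_bitmap(nodes):
    """Build a membership bitmap: bit i is 1 when node i is running normally."""
    bitmap = 0
    for node in nodes:
        bitmap |= 1 << node
    return bitmap

def live_nodes(bitmap):
    """List the node numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

def views_agree(views):
    """views maps node id -> the bitmap that node holds; True when all are identical."""
    return len(set(views.values())) <= 1

both_up = to_bitmap([0, 1])      # 0b11, nodes 0 and 1 running
node1_down = to_bitmap([0])      # 0b01, only node 0 running

print(live_nodes(both_up))                           # [0, 1]
print(views_agree({0: both_up, 1: both_up}))         # True
print(views_agree({0: node1_down, 1: both_up}))      # False
```

In the last call the two nodes disagree about membership, which is the kind of inconsistency the LMON processes have to resolve before reconfiguration can complete.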

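    LMON can cooperate with the underlying Clusterware or work on its own. When LMON detects an instance-level split-brain, it expects Clusterware to resolve it, but RAC does not assume that Clusterware will definitely succeed. LMON therefore does not wait indefinitely for the Clusterware layer: when the wait times out, the LMON process automatically triggers IMR (Instance Membership Recovery). IMR can be regarded as the split-brain and I/O fencing mechanism that Oracle provides at the database layer.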
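The "wait for Clusterware, then fall back to IMR" behaviour is a bounded wait with a fallback. Below is a minimal sketch of that control flow only. Everything in it is an assumption made for illustration: the function `resolve_split_brain`, the `clusterware_resolved` callback, and the timeout values are hypothetical and do not correspond to real Oracle interfaces or parameters.

```python
import time

def resolve_split_brain(clusterware_resolved, timeout_s, poll_s=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll until Clusterware reports success or the timeout expires.

    Returns "clusterware" if the lower layer resolved the problem in time,
    otherwise "imr" to indicate the database-level fallback should run.
    clock and sleep are injectable so the logic can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if clusterware_resolved():
            return "clusterware"
        sleep(poll_s)
    return "imr"

class FakeTime:
    """Deterministic stand-in for the clock, used only for this demonstration."""
    def __init__(self):
        self.t = 0.0
    def clock(self):
        return self.t
    def sleep(self, seconds):
        self.t += seconds

fake = FakeTime()
print(resolve_split_brain(lambda: False, 5, 1.0, fake.clock, fake.sleep))  # imr
print(resolve_split_brain(lambda: True, 5, 1.0, fake.clock, fake.sleep))   # clusterware
```

The LMON trace later in this post prints several timeouts (CSS recovery, IMR Reconfig, CGS rcfg); the sketch does not claim which of them governs this wait.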

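    LMON relies mainly on two kinds of heartbeat for health monitoring:

    1. The network heartbeat between nodes.

    2. The control file disk heartbeat: each instance's CKPT process updates the Checkpoint Progress Record block in the control file every 3 seconds. Because the control file is shared, instances can check whether the others are updating it on time and judge their state accordingly.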
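The disk-heartbeat check amounts to asking whether each instance's last update is recent enough. A minimal sketch follows, assuming the 3-second interval stated above; the function name `stale_instances`, the `missed_allowed` tolerance, and the timestamps are hypothetical, and the real check reads the Checkpoint Progress Record rather than a Python dictionary.

```python
HEARTBEAT_INTERVAL_S = 3.0  # update interval stated in the text

def stale_instances(last_update, now, missed_allowed=3):
    """Return instance ids whose last heartbeat is older than the allowed window.

    last_update maps instance id -> time of its last control-file update (seconds).
    missed_allowed is an illustrative tolerance, not an Oracle parameter.
    """
    limit = HEARTBEAT_INTERVAL_S * missed_allowed
    return sorted(inst for inst, ts in last_update.items() if now - ts > limit)

# Instance 1 updated 1 second ago, instance 2 updated 13 seconds ago.
print(stale_instances({1: 100.0, 2: 88.0}, now=101.0))  # [2]
```

With a tolerance of three missed updates the window is 9 seconds, so only instance 2 is reported as suspect.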

 

The corresponding LMON trace log:

*** 2011-07-09 16:41:25.412

kjxgmrcfg: Reconfiguration started, reason 1

kjxgmcs: Setting state to 2 0.

*** 2011-07-09 16:41:25.570

     Name Service frozen

kjxgmcs: Setting state to 2 1.

kjxgrssvote: reconfig bitmap chksum 0xccd0ae50 cnt 2 master 0 ret 0

kjxggpoll: change poll time to 50 ms

*** 2011-07-09 16:41:25.665

Obtained RR update lock for sequence 3, RR seq 2

*** 2011-07-09 16:41:25.752

Voting results, upd 0, seq 4, bitmap: 0 1

CGS/IMR TIMEOUTS:

  CSS recovery timeout = 71 sec

  IMR Reconfig timeout = 300 sec

  CGS rcfg timeout = 300 sec

kjxgmps: proposing substate 2

kjxgmcs: Setting state to 4 2.

 kjfmuin: bitmap 0 1

 kjfmmhi: received msg from 0 (inc 2)

 kjfmmhi: received msg from 1 (inc 4)

     Performed the unique instance identification check

kjxgmps: proposing substate 3

kjxgmcs: Setting state to 4 3.

     Name Service recovery started

     Deleted all dead-instance name entries

kjxgmps: proposing substate 4

kjxgmcs: Setting state to 4 4.

     Multicasted all local name entries for publish

     Replayed all pending requests

kjxgmps: proposing substate 5

kjxgmcs: Setting state to 4 5.

     Name Service normal

     Name Service recovery done

*** 2011-07-09 16:41:27.200

kjxgmps: proposing substate 6

kjxgmcs: Setting state to 4 6.

kjxggpoll: change poll time to 600 ms

*** 2011-07-09 16:41:28.279

kjfcrfg: DRM window size = 128->128 (min lognb = 10)

*** 2011-07-09 16:41:28.279

Reconfiguration started (old inc 2, new inc 4)

Synchronization timeout interval: 900 sec

List of nodes:

 0 1

Undo tsn affinity 1

*** 2011-07-09 16:41:28.311

*** 2011-07-09 16:41:28.311

kjfcrfg: query of NESTED_RECONFIGURATION for node 1 failed with 7

 Global Resource Directory frozen

node 0

node 1

release 10 2 0 5

 asby init, 0/0/x2

 asby returns, 0/0/x2/false

* Domain maps before reconfiguration:

*   DOMAIN 0 (valid 1): 0

* End of domain mappings

* Domain maps after recomputation:

*   DOMAIN 0 (valid 1): 0 1

* End of domain mappings

 Dead  inst

 Join  inst 1

 Exist inst 0

 Active Sendback Threshold = 50 %

 Communication channels reestablished

 sent syncr inc 4 lvl 1 to 0 (4,5/0/0)

 sent synca inc 4 lvl 1 (4,5/0/0)

 received all domreplay (4.6)

 sent master 0 (4.6)

*** 2011-07-09 16:41:29.535

KJBDOMHVMAP: BEGINS

*** 2011-07-09 16:41:29.560

KJBDOMHVMAP: ENDS

 sent dom info (4.6)

 sent hv info (4.6)

 sent syncr inc 4 lvl 2 to 0 (4,7/0/0)

 sent synca inc 4 lvl 2 (4,7/0/0)

 Master broadcasted resource hash value bitmaps

* kjfcrfg: domain 0 valid, valid_ver = 4

 Non-local Process blocks cleaned out

 Set master node info

 sent syncr inc 4 lvl 3 to 0 (4,13/0/0)

 sent synca inc 4 lvl 3 (4,13/0/0)

 Submitted all remote-enqueue requests

kjfcrfg: Number of mesgs sent to node 1 = 774

 sent syncr inc 4 lvl 4 to 0 (4,15/0/0)

 sent synca inc 4 lvl 4 (4,15/0/0)

 Dwn-cvts replayed, VALBLKs dubious

 sent syncr inc 4 lvl 5 to 0 (4,18/0/0)

 sent synca inc 4 lvl 5 (4,18/0/0)

 All grantable enqueues granted

 sent syncr inc 4 lvl 6 to 0 (4,20/0/0)

 sent synca inc 4 lvl 6 (4,20/0/0)

 Submitted all GCS cache requests

 sent syncr inc 4 lvl 7 to 0 (4,22/0/0)

 sent synca inc 4 lvl 7 (4,22/0/0)

 Post SMON to start 1st pass IR

 Fix write in gcs resources

 sent syncr inc 4 lvl 8 to 0 (4,24/0/0)

 sent synca inc 4 lvl 8 (4,24/0/0)

*** 2011-07-09 16:41:31.006

Reconfiguration complete

 

 

*** 2011-07-09 17:32:33.682

kjxgmpoll reconfig bitmap: 0

*** 2011-07-09 17:32:33.745

kjxgmrcfg: Reconfiguration started, reason 1

kjxgmcs: Setting state to 4 0.

*** 2011-07-09 17:32:34.157

     Name Service frozen

kjxgmcs: Setting state to 4 1.

kjxgrssvote: reconfig bitmap chksum 0x6668604e cnt 1 master 0 ret 0

kjxggpoll: change poll time to 50 ms

*** 2011-07-09 17:32:34.464

Obtained RR update lock for sequence 5, RR seq 4

*** 2011-07-09 17:32:37.539

Voting results, upd 0, seq 6, bitmap: 0

CGS/IMR TIMEOUTS:

  CSS recovery timeout = 71 sec

  IMR Reconfig timeout = 300 sec

  CGS rcfg timeout = 300 sec

kjxgmps: proposing substate 2

kjxgmcs: Setting state to 6 2.

kjfmSendAbortInstMsg: send an abort message to node 1

kjfmSendAbortInstMsg: unique id 0x0 reason 0x1

 kjfmuin: bitmap 0

 kjfmmhi: received msg from 0 (inc 2)

     Performed the unique instance identification check

kjxgmps: proposing substate 3

kjxgmcs: Setting state to 6 3.

     Name Service recovery started

     Deleted all dead-instance name entries

kjxgmps: proposing substate 4

kjxgmcs: Setting state to 6 4.

     Multicasted all local name entries for publish

     Replayed all pending requests

kjxgmps: proposing substate 5

kjxgmcs: Setting state to 6 5.

     Name Service normal

     Name Service recovery done

*** 2011-07-09 17:32:37.598

kjxgmps: proposing substate 6

kjxgmcs: Setting state to 6 6.

kjxggpoll: change poll time to 600 ms

kjfmact: call ksimdic on instance (1)

*** 2011-07-09 17:32:37.843

kjfcrfg: DRM window size = 128->128 (min lognb = 10)

*** 2011-07-09 17:32:37.845

Reconfiguration started (old inc 4, new inc 6)

Synchronization timeout interval: 900 sec

List of nodes:

 0

Undo tsn affinity 1

*** 2011-07-09 17:32:37.906

 Global Resource Directory frozen

node 0

 asby init, 0/0/x2

 asby returns, 0/0/x2/false

* Domain maps before reconfiguration:

*   DOMAIN 0 (valid 1): 0 1

* End of domain mappings

* kjbdomrcfg2: domain 0 invalid = TRUE

* Domain maps after recomputation:

*   DOMAIN 0 (valid 0): 0

* End of domain mappings

 Active Sendback Threshold = 50 %

 Communication channels reestablished

 sent syncr inc 6 lvl 1 to 0 (6,5/0/0)

 sent syncr inc 6 lvl 2 to 0 (6,7/0/0)

 Master broadcasted resource hash value bitmaps

 Non-local Process blocks cleaned out

 Set master node info

 sent syncr inc 6 lvl 3 to 0 (6,13/0/0)

 Submitted all remote-enqueue requests

 sent syncr inc 6 lvl 4 to 0 (6,15/0/0)

 Dwn-cvts replayed, VALBLKs dubious

 sent syncr inc 6 lvl 5 to 0 (6,18/0/0)

 All grantable enqueues granted

 sent syncr inc 6 lvl 6 to 0 (6,20/0/0)

*** 2011-07-09 17:32:39.351

 Post SMON to start 1st pass IR

 Submitted all GCS cache requests

 sent syncr inc 6 lvl 7 to 0 (6,22/0/0)

 Fix write in gcs resources

 sent syncr inc 6 lvl 8 to 0 (6,24/0/0)

*** 2011-07-09 17:32:39.673

Reconfiguration complete

*   domain 0 valid?: 0

kjxgfipccb: msg 0x0xb7db2a6c, mbo 0x0xb7db2a68, type 19, ack 0, ref 0, stat 34
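The two reconfigurations in the trace above can be compared mechanically by pulling out the `kjxgmcs: Setting state to` and `Voting results` lines. The sketch below is a hypothetical helper, not an Oracle utility. It assumes, based only on this excerpt, that the first number after `Setting state to` follows the incarnation and the second follows the proposed substate, and that the `bitmap:` field of `Voting results` lists the surviving node numbers.

```python
import re

STATE_RE = re.compile(r"kjxgmcs: Setting state to (\d+) (\d+)\.")
VOTE_RE = re.compile(r"Voting results, upd (\d+), seq (\d+), bitmap:((?: \d+)*)")

def parse_lmon_trace(text):
    """Extract (first, second) state pairs and (seq, member set) voting results."""
    states, votes = [], []
    for raw in text.splitlines():
        line = raw.strip()
        m = STATE_RE.search(line)
        if m:
            states.append((int(m.group(1)), int(m.group(2))))
            continue
        m = VOTE_RE.search(line)
        if m:
            members = {int(n) for n in m.group(3).split()}
            votes.append((int(m.group(2)), members))
    return states, votes

def membership_changes(votes):
    """Compare consecutive voting results and report which nodes joined or left."""
    changes = []
    for (_, before), (seq, after) in zip(votes, votes[1:]):
        changes.append({"seq": seq,
                        "joined": sorted(after - before),
                        "left": sorted(before - after)})
    return changes

# Abridged from the LMON trace above.
SAMPLE = """\
kjxgmcs: Setting state to 2 1.
Voting results, upd 0, seq 4, bitmap: 0 1
kjxgmcs: Setting state to 4 2.
kjxgmcs: Setting state to 4 1.
Voting results, upd 0, seq 6, bitmap: 0
kjxgmcs: Setting state to 6 2.
"""

states, votes = parse_lmon_trace(SAMPLE)
print(states)                     # [(2, 1), (4, 2), (4, 1), (6, 2)]
print(membership_changes(votes))  # [{'seq': 6, 'joined': [], 'left': [1]}]
```

The reported change at seq 6 matches the scenario in this post: node 1 left the cluster after the shutdown abort on node2.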

Source: ITPUB Blog, link: http://blog.itpub.net/30036720/viewspace-2073753/. Please credit the source when reposting; otherwise legal action may be pursued.
