
The Preferred Node of the SERVICE Switched (Part 2)

Original, by yangtingkun, 2011-07-28 23:46:25

In the customer's test environment, connections used to go to node 1 by default, but they now land on node 2.

This installment traces the failure through the logs.

The Preferred Node of the SERVICE Switched (Part 1): http://yangtingkun.itpub.net/post/468/521452

Although the problem itself has been resolved, the logs should still yield evidence of the SERVICE failure and, if possible, its cause. Start with the database alert log.

Normally a SERVICE failure leaves no trace in the database alert log. This database, however, runs an expdp export every day, so the alert log records the resulting changes to the service_names parameter. As the previous article showed, once the SERVICE died on instance 1, the rac_xshdb entry also disappeared from instance 1's service_names.

Tue May 10 15:04:36 2011
ALTER SYSTEM SET service_names='SYS$SYS.KUPC$S_1_20110510150402.XSHDB','rac_xshdb','raclb_xshdb' SCOPE=MEMORY SID='xshdb1';
ALTER SYSTEM SET service_names='rac_xshdb','raclb_xshdb' SCOPE=MEMORY SID='xshdb1';
.
.
.
Wed May 11 14:13:23 2011
ALTER SYSTEM SET service_names='raclb_xshdb','SYS$SYS.KUPC$C_1_20110511141323.XSHDB' SCOPE=MEMORY SID='xshdb1';
ALTER SYSTEM SET service_names='SYS$SYS.KUPC$C_1_20110511141323.XSHDB','raclb_xshdb','SYS$SYS.KUPC$S_1_20110511141323.XSHDB' SCOPE=MEMORY SID='xshdb1';

Although the alert log contains no error message as such, this narrows the failure down to the window between May 10 15:04 and May 11 14:13, which makes the subsequent search of the cluster logs far more targeted.
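This narrowing step can be scripted. A minimal sketch, using a throwaway sample file in place of the real alert log (whose actual path, under the diagnostic destination, is an assumption here): list each service_names change together with the timestamp line above it, then single out the changes from which rac_xshdb has vanished, since the first such entry bounds the failure window from above.

```shell
# Build a miniature alert log containing the entries quoted above.
cat > /tmp/alert_xshdb1.log <<'EOF'
Tue May 10 15:04:36 2011
ALTER SYSTEM SET service_names='rac_xshdb','raclb_xshdb' SCOPE=MEMORY SID='xshdb1';
Wed May 11 14:13:23 2011
ALTER SYSTEM SET service_names='raclb_xshdb','SYS$SYS.KUPC$C_1_20110511141323.XSHDB' SCOPE=MEMORY SID='xshdb1';
EOF

# Each service_names change with the timestamp line that precedes it.
grep -B1 "service_names" /tmp/alert_xshdb1.log

# Changes in which rac_xshdb no longer appears: the first of these
# marks the latest time by which the service had already died.
grep "service_names" /tmp/alert_xshdb1.log | grep -v "rac_xshdb"
```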

Checking the cluster alert log, an error does indeed show up at 12:53 on May 11:

2011-05-11 12:53:21.710
[crsd(17664)]CRS-2765:Resource 'ora.xshdb.rac_xshdb.svc' has failed on server 'xsh-server1'.
2011-05-11 12:53:21.723
[crsd(17664)]CRS-2771:Maximum restart attempts reached for resource 'ora.xshdb.rac_xshdb.svc'; will not restart.
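Scanning the cluster alert log for these two events can be reduced to a single filter; a sketch against a sample file (the real log lives under the Grid Infrastructure home, and its exact path is an assumption that varies by version):

```shell
# Miniature cluster alert log containing the two events above.
cat > /tmp/crs_alert.log <<'EOF'
2011-05-11 12:53:21.710
[crsd(17664)]CRS-2765:Resource 'ora.xshdb.rac_xshdb.svc' has failed on server 'xsh-server1'.
2011-05-11 12:53:21.723
[crsd(17664)]CRS-2771:Maximum restart attempts reached for resource 'ora.xshdb.rac_xshdb.svc'; will not restart.
EOF

# Resource-failure and restart-exhaustion events, each with the
# timestamp line that precedes it.
grep -B1 -E "CRS-2765|CRS-2771" /tmp/crs_alert.log
```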

The cluster alert log does not record the cause, but it confirms that the failure happened and pins down the exact time. Next, the crsd log:

2011-05-11 12:53:21.705: [ AGFW][1171855680] {0:6:7} Agfw Proxy Server received the message: RESOURCE_STATUS[Proxy] ID 20481:19458791
2011-05-11 12:53:21.706: [ AGFW][1171855680] {0:6:7} Received state change for ora.xshdb.rac_xshdb.svc 1 1 [old state = ONLINE, new state = OFFLINE]
2011-05-11 12:53:21.706: [ AGFW][1171855680] {0:6:7} Agfw Proxy Server sending message to PE, Contents = [MIDTo:2|OpID:3|FromA:{Invalid|Node:0|Process:0|Type:0}|ToA:{Invalid|Node:-1|Process:-1|Type:-1}|MIDFrom:0|Type:4|Pri2|Id:4715323:Ver:2]
2011-05-11 12:53:21.706: [ AGFW][1171855680] {0:6:7} Agfw Proxy Server replying to the message: RESOURCE_STATUS[Proxy] ID 20481:19458791
2011-05-11 12:53:21.706: [ CRSPE][1182361920] {0:6:7} State change received from xsh-server1 for ora.xshdb.rac_xshdb.svc 1 1
2011-05-11 12:53:21.706: [ CRSPE][1182361920] {0:6:7} Processing PE command id=478061. Description: [Resource State Change (ora.xshdb.rac_xshdb.svc 1 1) : 0x2aaab82c6800]
2011-05-11 12:53:21.706: [ CRSPE][1182361920] {0:6:7} RI [ora.xshdb.rac_xshdb.svc 1 1] new external state [OFFLINE] old value: [ONLINE] on xsh-server1 label = []
2011-05-11 12:53:21.710: [ CRSRPT][1184463168] {0:6:7} Published to EVM CRS_RESOURCE_STATE_CHANGE for ora.xshdb.rac_xshdb.svc
2011-05-11 12:53:21.722: [ CRSPE][1182361920] {0:6:7} Resource Resource Instance ID[ora.xshdb.rac_xshdb.svc 1 1]. Values:
STATE=OFFLINE
TARGET=ONLINE
LAST_SERVER=xsh-server1
CURRENT_RCOUNT=0
LAST_RESTART=1298450637
FAILURE_COUNT=1
FAILURE_HISTORY=1290078515:xsh-server1
STATE_DETAILS=
INCARNATION=12
STATE_CHANGE_VERS=21
LAST_FAULT=1290078515
LAST_STATE_CHANGE=1305089601
INTERNAL_STATE=0
DEGREE_ID=1
ID=ora.xshdb.rac_xshdb.svc 1 1
CARDINALITY_ID=1
Lock Info:
Write Locks:none
ReadLocks:|STATE INITED| has failed!
2011-05-11 12:53:21.723: [ CRSPE][1182361920] {0:6:7} Processing unplanned state change for [ora.xshdb.rac_xshdb.svc 1 1]
2011-05-11 12:53:21.723: [ CRSPE][1182361920] {0:6:7} ora.xshdb.rac_xshdb.svc 1 1: uptime exceeds uptime threshold , resetting restart count
2011-05-11 12:53:21.723: [ CRSPE][1182361920] {0:6:7} No Local Restart of [ora.xshdb.rac_xshdb.svc 1 1], restart attempt exhausted
2011-05-11 12:53:21.726: [ CRSPE][1182361920] {0:6:7} Will try to failover [ora.xshdb.rac_xshdb.svc 1 1]
2011-05-11 12:53:21.727: [ CRSRPT][1184463168] {0:6:7} Published to EVM CRS_NOT_RESTARTING for ora.xshdb.rac_xshdb.svc
2011-05-11 12:53:21.734: [ CRSPE][1182361920] {0:6:7} RI [ora.xshdb.rac_xshdb.svc 1 1] new internal state: [STARTING] old value: [STABLE]
2011-05-11 12:53:21.735: [ CRSPE][1182361920] {0:6:7} Sending message to agfw: id = 4715326
2011-05-11 12:53:21.735: [ CRSPE][1182361920] {0:6:7} CRS-2672: Attempting to start 'ora.xshdb.rac_xshdb.svc' on 'xsh-server2'

2011-05-11 12:53:21.853: [ CRSPE][1182361920] {0:6:7} Received reply to action [Start] message ID: 4715326
2011-05-11 12:53:21.856: [ CRSPE][1182361920] {0:6:7} Received reply to action [Start] message ID: 4715326
2011-05-11 12:53:21.856: [ CRSPE][1182361920] {0:6:7} RI [ora.xshdb.rac_xshdb.svc 1 1] new internal state: [STABLE] old value: [STARTING]
2011-05-11 12:53:21.856: [ CRSPE][1182361920] {0:6:7} RI [ora.xshdb.rac_xshdb.svc 1 1] new external state [ONLINE] old value: [OFFLINE] on xsh-server2 label = []
2011-05-11 12:53:21.859: [ CRSPE][1182361920] {0:6:7} CRS-2676: Start of 'ora.xshdb.rac_xshdb.svc' on 'xsh-server2' succeeded

2011-05-11 12:53:21.860: [ CRSRPT][1184463168] {0:6:7} Published to EVM CRS_RESOURCE_STATE_CHANGE for ora.xshdb.rac_xshdb.svc
2011-05-11 12:53:21.860: [ CRSOCR][1173956928] {0:6:7} Multi Write Batch processing...
2011-05-11 12:53:21.867: [ CRSOCR][1173956928] {0:6:7} Multi Write Batch done.
2011-05-11 12:53:21.868: [ CRSPE][1182361920] {0:6:7} PE Command [ Resource State Change (ora.xshdb.rac_xshdb.svc 1 1) : 0x2aaab82c6800 ] has completed
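The LAST_RESTART, FAILURE_HISTORY, LAST_FAULT and LAST_STATE_CHANGE fields in the resource dump above are Unix epoch seconds. Converting them (GNU date assumed) shows the dump is consistent with the rest of the logs: LAST_STATE_CHANGE corresponds to 12:53:21 local time (UTC+8) on May 11, 2011, while FAILURE_HISTORY records an earlier incident back in November 2010.

```shell
# Epoch values copied from the crsd resource dump above.
for field in \
    "FAILURE_HISTORY=1290078515" \
    "LAST_RESTART=1298450637" \
    "LAST_STATE_CHANGE=1305089601"
do
    name=${field%%=*}
    epoch=${field#*=}
    # -u prints UTC; the log's local timestamps are UTC+8.
    printf '%s = %s UTC\n' "$name" "$(date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S')"
done
```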

The crsd log is more detailed, but it still does not reveal what caused the failure. Continuing to the cssd log:

2011-05-11 12:53:21.842: [ CSSD][1110309184]clssgmHandleMasterMemberAdd: [s(2) d(1)]
2011-05-11 12:53:21.842: [ CSSD][1110309184]clssgmAddMember: new master 2 for group(IGXSHDBrac_xshdb)
2011-05-11 12:53:21.842: [ CSSD][1110309184]clssgmAddMember: member (2/0x2aaab800fea0) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab80590b0/IGXSHDBrac_xshdb)
2011-05-11 12:53:21.842: [ CSSD][1110309184]clssgmCommonAddMember: global group grock IGXSHDBrac_xshdb member(2/Remote) node(2) flags 0x0 0xb800fea0

The cssd log does not explain why the SERVICE died either, and since some of the logs have already been recycled, the root cause can no longer be recovered. Even something as structurally simple as a SERVICE can die unexpectedly.


From "ITPUB博客", link: http://blog.itpub.net/4227/viewspace-703419/. Please credit the source when reprinting.
