Another Scenario That Causes Split-Brain in 9i RAC

Original  Category: Linux Operating System  Author: Xuan_Baby  Date: 2012-03-17 10:53:23

I. Symptoms

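At around 01:15 on February 14, 2012, the customer phoned in an alert. After checking the system environment, we found that database node 1 was down and its database unavailable. Most services had already failed over automatically to node 2, so only a small part of the business should have been affected.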

II. Handling

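To restore service as quickly as possible, we logged in and started the database immediately, which brought the business back. Because the failure happened in the small hours, the engineer logged in and handled it remotely, and the whole recovery took about 15 minutes. Once service was restored, we investigated the cause. The analysis follows: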

From the database logs:

Tue Feb 14 01:14:00 2012

ORA-29740: evicted by member 1, group incarnation 27

Tue Feb 14 01:14:00 2012

LMON: terminating instance due to error 29740

Instance terminated by LMON, pid = 237844

 

*** 2012-02-14 01:13:36.677

kjxgrgetresults: Detect reconfig from 1, seq 26, reason 3

kjxgrrcfgchk: Initiating reconfig, reason 3

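1. The logs show that the direct cause of the outage was a RAC split-brain: the instance was evicted from the cluster;

2. The trace file shows that the reconfiguration was triggered with reason code 3. According to Oracle's official documentation, the possible causes for code 3 are: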

Common causes for an ORA-29740 eviction (Reason 3):
        a) Network Problems.
        b) Resource Starvation (CPU, I/O, etc..)
        c) Severe Contention in Database.
        d) An Oracle bug.
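
The reason code can be pulled out of a trace file mechanically. The following is a minimal Python sketch, assuming the trace line has exactly the shape shown in the excerpt above; the function name `find_reconfigs` and the interpretation of the tuple fields are illustrative assumptions, not part of any Oracle tool.

```python
import re

# Matches trace lines of the form seen in the excerpt above, e.g.
#   kjxgrgetresults: Detect reconfig from 1, seq 26, reason 3
RECONFIG_RE = re.compile(
    r"kjxgrgetresults: Detect reconfig from (\d+), seq (\d+), reason (\d+)"
)


def find_reconfigs(trace_text):
    """Return (origin, seq, reason) integer tuples for each detected reconfig.

    'origin' is simply the number printed after 'from' in the trace line.
    """
    return [
        tuple(int(group) for group in match.groups())
        for match in RECONFIG_RE.finditer(trace_text)
    ]


# Sample copied from the trace excerpt in this post.
sample = """*** 2012-02-14 01:13:36.677
kjxgrgetresults: Detect reconfig from 1, seq 26, reason 3
kjxgrrcfgchk: Initiating reconfig, reason 3
"""

print(find_reconfigs(sample))  # [(1, 26, 3)]
```

Scanning all trace files for reason codes this way makes it easy to see whether repeated evictions share the same reason before digging into each candidate cause.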

The ITSM monitoring showed normal CPU and memory usage, and the operating system reported no errors. Checking the private network, we found:

JXDX_ODS_ORA01:/tmp#netstat -in
Name Mtu  Network     Address           Ipkts      Ierrs Opkts      Oerrs Coll
en2  1500 link#2      0.14.5e.db.68.7b  1446979209 0     2539611154 11    0
en2  1500 172.31.13   172.31.13.40      1446979209 0     2539611154 11    0
en4  1500 link#3      0.11.25.bd.bd.d4  1277602257 0     363030239  2     0
en4  1500 134.224.60. 134.224.60.40     1277602257 0     363030239  2     0

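The NIC statistics show that the private interface has had transmit errors at some point, but because these counters are cumulative, we cannot tie them to the time of the failure. We also found that the private heartbeat NIC (the heartbeat runs over a directly connected cable) is not configured with EtherChannel, which leaves a single point of failure.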
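
Because the counters are cumulative, one way to tie errors to a time window is to sample `netstat -in` periodically and compare consecutive samples. Below is a minimal Python sketch of that idea. It assumes the column layout shown above, with the five counters as the last five fields; the function names are illustrative, and the second sample (`after_text`) is hypothetical data invented for the demonstration, not output from the system in this post.

```python
def parse_netstat_in(text):
    """Map (interface, network) -> counters from `netstat -in` style output."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7 or fields[0] == "Name":
            continue  # skip blank lines, the prompt line, and the header
        try:
            ipkts, ierrs, opkts, oerrs, coll = (int(f) for f in fields[-5:])
        except ValueError:
            continue  # not a counter row
        stats[(fields[0], fields[2])] = {
            "ipkts": ipkts, "ierrs": ierrs,
            "opkts": opkts, "oerrs": oerrs, "coll": coll,
        }
    return stats


def error_deltas(before, after):
    """Return {(interface, network): (new_ierrs, new_oerrs)} where errors grew."""
    deltas = {}
    for key, new in after.items():
        old = before.get(key)
        if old is None:
            continue
        d_in = new["ierrs"] - old["ierrs"]
        d_out = new["oerrs"] - old["oerrs"]
        if d_in > 0 or d_out > 0:
            deltas[key] = (d_in, d_out)
    return deltas


# First sample: copied from the output in this post.
before_text = """Name Mtu  Network   Address      Ipkts Ierrs  Opkts Oerrs Coll
en2 1500 link#2  0.14.5e.db.68.7b 1446979209  0 2539611154  11  0
en4 1500 link#3   0.11.25.bd.bd.d4 1277602257  0 363030239   2   0
"""
# Second sample: HYPOTHETICAL values, for illustration only.
after_text = """Name Mtu  Network   Address      Ipkts Ierrs  Opkts Oerrs Coll
en2 1500 link#2  0.14.5e.db.68.7b 1446989209  0 2539621154  13  0
en4 1500 link#3   0.11.25.bd.bd.d4 1277612257  0 363040239   2   0
"""

print(error_deltas(parse_netstat_in(before_text), parse_netstat_in(after_text)))
```

With the hypothetical second sample, only en2 is reported, with two new output errors. Storing each sample with its collection time would let a later eviction be matched against the interval in which the errors actually appeared.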

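III. Improvements

1. Continue to monitor the database's network and I/O load.

2. Add a second private heartbeat cable and configure EtherChannel to remove the single point of failure.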

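IV. One week after the private network was adjusted, a node was evicted again, which showed that the problem had not been solved. A careful look at the node's alert.log revealed a problem with archiving, as the following excerpt shows: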
Sat Feb 25 00:22:17 2012
ARC0: Evaluating archive   log 9 thread 1 sequence 76399
ARC0: Beginning to archive log 9 thread 1 sequence 76399
Creating archive destination LOG_ARCHIVE_DEST_1: '/archive/1_76399.dbf'
Sat Feb 25 00:22:22 2012
ARC1: Evaluating archive   log 9 thread 1 sequence 76399
ARC1: Unable to archive log 9 thread 1 sequence 76399
      Log actively being archived by another process
Sat Feb 25 00:23:22 2012
ARC1: Evaluating archive   log 9 thread 1 sequence 76399
ARC1: Unable to archive log 9 thread 1 sequence 76399
      Log actively being archived by another process
Sat Feb 25 00:24:22 2012
ARC1: Evaluating archive   log 9 thread 1 sequence 76399
ARC1: Unable to archive log 9 thread 1 sequence 76399
      Log actively being archived by another process
Sat Feb 25 00:25:36 2012
ARC1: Evaluating archive   log 9 thread 1 sequence 76399
ARC1: Unable to archive log 9 thread 1 sequence 76399
      Log actively being archived by another process
Sat Feb 25 00:26:36 2012
ARC1: Evaluating archive   log 9 thread 1 sequence 76399
ARC1: Unable to archive log 9 thread 1 sequence 76399
      Log actively being archived by another process
Sat Feb 25 00:32:34 2012
ARC1: Evaluating archive   log 9 thread 1 sequence 76399
ARC1: Unable to archive log 9 thread 1 sequence 76399
      Log actively being archived by another process
Sat Feb 25 00:33:34 2012
ARC1: Evaluating archive   log 9 thread 1 sequence 76399
ARC1: Unable to archive log 9 thread 1 sequence 76399
      Log actively being archived by another process
Sat Feb 25 00:33:57 2012
ARC0: Completed archiving  log 9 thread 1 sequence 76399
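Archiving a 1 GB log file took 11 to 12 minutes, whereas it normally completes in about 20 seconds, which pointed to a disk write problem. We immediately asked the storage array engineer to go to the machine room and check whether any disks were showing a red light. Sure enough, two disks had failed, and one of the red-lit disks held the archive logs. The fix was straightforward: the array supports hot swapping, so the disks were replaced. Since then the instance has had no further problems. Split-brain can therefore be caused not only by host resources, RAC itself, or the heartbeat network, but also by the storage array.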
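
The archive time can be measured directly from the alert log rather than read off by eye. The following is a minimal Python sketch, assuming the alert.log layout shown in the excerpt above (a timestamp line followed by the messages logged at that time) and an English locale for the day and month names; `archive_durations` is an illustrative name, not an Oracle utility.

```python
import re
from datetime import datetime

TS_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Sat Feb 25 00:22:17 2012"
BEGIN_RE = re.compile(
    r"ARC\d+: Beginning to archive\s+log (\d+) thread (\d+) sequence (\d+)"
)
DONE_RE = re.compile(
    r"ARC\d+: Completed archiving\s+log (\d+) thread (\d+) sequence (\d+)"
)


def archive_durations(alert_text):
    """Return {(thread, sequence): seconds} for archives that began and completed."""
    current_ts = None
    started = {}
    durations = {}
    for raw_line in alert_text.splitlines():
        line = raw_line.strip()
        try:
            # A timestamp line applies to the message lines that follow it.
            current_ts = datetime.strptime(line, TS_FORMAT)
            continue
        except ValueError:
            pass
        if current_ts is None:
            continue
        match = BEGIN_RE.search(line)
        if match:
            started[(int(match.group(2)), int(match.group(3)))] = current_ts
            continue
        match = DONE_RE.search(line)
        if match:
            key = (int(match.group(2)), int(match.group(3)))
            if key in started:
                durations[key] = (current_ts - started[key]).total_seconds()
    return durations


# Lines copied from the alert.log excerpt in this post (retries omitted).
sample_log = """Sat Feb 25 00:22:17 2012
ARC0: Evaluating archive   log 9 thread 1 sequence 76399
ARC0: Beginning to archive log 9 thread 1 sequence 76399
Sat Feb 25 00:33:57 2012
ARC0: Completed archiving  log 9 thread 1 sequence 76399
"""

print(archive_durations(sample_log))  # {(1, 76399): 700.0}
```

For sequence 76399 this gives 700 seconds, which matches the 11 to 12 minutes noted above. Flagging any archive that takes far longer than the usual 20 seconds or so would surface this kind of storage problem before it leads to an eviction.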

Source: "ITPUB博客", link: http://blog.itpub.net/26634508/viewspace-718820/. If you repost this article, please credit the source; otherwise legal action may be taken.
