ASM diskgroup auto-dismount takes down one RAC node

Original post, Oracle. Author: 小馒头. Date: 2015-07-30 17:16:36
ASM alert log, found under:
/u01/app/grid/diag/asm/+asm/+ASM1/trace


Thu Jul 30 02:10:46 2015
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
Thu Jul 30 02:10:47 2015
NOTE: process _b000_+asm1 (38695) initiating offline of disk 0.3915941304 (DATA2_0000) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (38695) initiating offline of disk 1.3915941302 (DATA2_0001) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (38695) initiating offline of disk 2.3915941303 (DATA2_0002) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
GMON checking disk modes for group 1 at 12 for pid 28, osid 38695
ERROR: no read quorum in group: required 2, found 0 disks
.............
Dirty Detach Reconfiguration complete
Thu Jul 30 02:10:47 2015
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0xB368755B (DATA2)   <-- the diskgroup dismounted on its own
SQL> alter diskgroup DATA2 dismount force /* ASM SERVER:3009967451 */ 
.............
Thu Jul 30 02:11:24 2015
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1
SUCCESS: diskgroup DATA2 was mounted    <--- and then mounted again on its own
SUCCESS: ALTER DISKGROUP DATA2 MOUNT  /* asm agent *//* {0:31:15779} */     



Reference document
ASM diskgroup dismount with "Waited 15 secs for write IO to PST" (Doc ID 1581684.1)

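The alert log shows the ASM diskgroup being dismounted, together with the warning "Waited 15 secs for write IO to PST". This is ASM's own heartbeat timeout check:
the ASM instance periodically checks whether every ASM disk still responds normally.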
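
To see how often this warning fires and on which disks, the alert log can be scanned with a small script. The following is a minimal Python sketch, not an Oracle tool: the function name `count_pst_waits` and the regular expression are assumptions derived only from the message format in the log excerpt above.

```python
import re
from collections import Counter

# Pattern inferred from the alert log lines quoted above; this is an
# assumption, not an official Oracle message specification.
PST_WAIT = re.compile(
    r"WARNING: Waited (\d+) secs for write IO to PST disk (\d+) in group (\d+)"
)


def count_pst_waits(lines):
    """Count PST write-wait warnings per (group, disk) pair."""
    counts = Counter()
    for line in lines:
        match = PST_WAIT.search(line)
        if match:
            _secs, disk, group = match.groups()
            counts[(int(group), int(disk))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.",
        "WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.",
        "NOTE: checking PST: grp = 1",
        "WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.",
    ]
    for (group, disk), hits in sorted(count_pst_waits(sample).items()):
        print(f"group {group} disk {disk}: {hits} warning(s)")
```

In practice the `lines` argument would be the opened alert log file from the trace directory given at the top of this post.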


Generally this kind messages comes in ASM alertlog file on below situations,
Delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup,
thus the ASM instance dismount the diskgroup.By default, it is 15 seconds.
By the way the heart beat delays are sort of ignored for external redundancy diskgroup.
ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,
but the heart beat delays do not dismount external redundancy diskgroup directly.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
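The description above boils down to the following points:
1. The ASM instance periodically checks the disks of every diskgroup to confirm they still respond;
2. A dismount caused by this check only affects normal and high redundancy diskgroups; for external redundancy the heartbeat delays are largely ignored and do not dismount the diskgroup directly;
3. The default timeout is 15 seconds: if the disks have still not answered the ASM instance after 15 s, the diskgroup is dismounted.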
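
The three points above can be condensed into a tiny decision function. This is only a simplified model of the behaviour described in the note, not ASM's actual logic: the function name `would_dismount` and its parameters are hypothetical, and real ASM also takes PST quorum into account (see the "no read quorum" line in the log).

```python
def would_dismount(redundancy, hbeat_delay_secs, hbeat_io_wait=15):
    """Simplified model: does a delayed PST heartbeat dismount the diskgroup?

    redundancy       -- "external", "normal" or "high"
    hbeat_delay_secs -- how long the PST heartbeat write has been pending
    hbeat_io_wait    -- timeout in seconds (15 is the default quoted above)
    """
    redundancy = redundancy.lower()
    if redundancy not in ("external", "normal", "high"):
        raise ValueError(f"unknown redundancy type: {redundancy!r}")
    if redundancy == "external":
        # Per the note, heartbeat delays do not directly dismount
        # external redundancy diskgroups.
        return False
    return hbeat_delay_secs >= hbeat_io_wait


if __name__ == "__main__":
    for case in [("normal", 15), ("high", 10), ("external", 60)]:
        print(case, "->", would_dismount(*case))
```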


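Problems in the storage network can trigger this error. In other words, when ASM sends out its periodic check and a disk does not answer within 15 s, ASM considers that disk inaccessible.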


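In this case, the 02:10 timestamp above is exactly when the full database backup runs, so the heavy write load most likely slowed down I/O response.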
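
To check this kind of correlation on a longer log, the alert log timestamps can be parsed and compared with the backup window. The sketch below assumes the timestamp format seen in the log excerpt ("Thu Jul 30 02:10:46 2015") and an English/C locale; the 02:00 to 04:00 window is a hypothetical value, since only the 02:10 time is known here.

```python
from datetime import datetime, time

# Format inferred from the alert log excerpt, e.g. "Thu Jul 30 02:10:46 2015".
ALERT_TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_alert_timestamp(line):
    """Return a datetime if the line is an alert log timestamp, else None."""
    try:
        return datetime.strptime(line.strip(), ALERT_TS_FORMAT)
    except ValueError:
        return None


def in_window(ts, start, end):
    """True if ts falls inside [start, end]; windows crossing midnight are not handled."""
    return start <= ts.time() <= end


if __name__ == "__main__":
    # Hypothetical backup window; adjust to the real schedule.
    backup_start, backup_end = time(2, 0), time(4, 0)
    ts = parse_alert_timestamp("Thu Jul 30 02:10:46 2015")
    print(ts, "inside backup window:", in_window(ts, backup_start, backup_end))
```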

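The hidden parameter behind this timeout (_asm_hbeatiowait, shown in the query output below) only exists from 11.2.0.3.0 onward; in other words, this ASM disk timeout check was introduced in 11.2.0.3.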


set pages 9999;

SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ
FROM SYS.x$ksppi x, SYS.x$ksppcv y
WHERE x.inst_id = USERENV ('Instance')
AND y.inst_id = USERENV ('Instance')
AND x.indx = y.indx
AND upper(x.ksppinm) like '%ASM_H%';
Output:
NAME                    VALUE  DESCRIB
_asm_hbeatiowait        15     number of secs to wait for PST Async Hbeat IO return
_asm_hbeatwaitquantum   2      quantum used to compute time-to-wait for a PST Hbeat check


When the storage network is not very reliable, the timeout can be set to a longer value; in fact, in 12.1.0.2 the default is already 120 seconds.

alter system set "_asm_hbeatiowait"=120 scope=spfile;

Since the change is made with scope=spfile, restart ASM for it to take effect, then keep monitoring.

Source: ITPUB blog, link: http://blog.itpub.net/61604/viewspace-1756906/. If you repost this article, please credit the source; otherwise legal action may be taken.
