ITPub博客

首页 > 数据库 > Oracle > ORA-00600 3020 Lost Write写丢失问题分析

ORA-00600 3020 Lost Write写丢失问题分析

原创 Oracle 作者:yaanzy 时间:2015-05-15 22:36:26 0 删除 编辑
最近半年在客户和测试环境前后发生了3次ORA-00600 [3020]错误, 都是在11g data guard环境下做完switchover切换
操作之后, Standby的MRP进程起不来了.

Enviroment:Oracle 11g data guard

Errors in file /d/db/oradata/diag/rdbms/oss/oss/trace/oss_pr01_11035.trc  (incident=324505):
ORA-00600: internal error code, arguments: [3020], [3], [3712], [3712], [], [], [], [], [], [], [], [] 
ORA-10567: Redo is inconsistent with data block (file# 3, block# 3712, file offset is 30408704 bytes) 
ORA-10564: tablespace UNDOTBS1 
ORA-01110: data file 3: '/d/db/oradata/oss/undotbs01.dbf' 
ORA-10560: block type 'KTU UNDO BLOCK' 
Incident details in: /d/db/oradata/diag/rdbms/oss/oss/incident/incdir_324505/oss_pr01_11035_i324505.trc 

下面是oss_pr01_11035.trc的内容(仅列出关键部分):

*** 2015-04-22 14:00:00.090
RECOVERY STUCK AT BLOCK 3712 OF FILE 3
Redo record scn: 0x0000.07e06bec
CHANGE #1 TYP:0 CLS:52 AFN:3 DBA:0x00000e80 OBJ:4294967295 SCN:0x0000.07d38c6e SEQ:254 OP:5.1 ENC:0 RBL:0
 
Buffer read during recovery:
DDE rules only execution for: ORA 1110
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----
Successfully dispatched
----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (SUCCESS, 0 csec) -----
Executing ASYNC actions
----- END DDE Actions Dump (total 0 csec) -----
buffer tsn: 2 rdba: 0x00000e80 (1024/3712)
scn: 0x0000.078bd5ad seq: 0x01 flg: 0x04 tail: 0xd5ad0201
frmt: 0x02 chkval: 0xcf92 type: 0x02=KTU UNDO BLOCK
on-disk scn: 0x0.78bd5ad
 
Redo applied in this recovery session:
DUMP REDO
 Opcodes *.*
 DBAs (file#, block#):
 (3, 3712) .
 SCNs: scn: 0x0000.078bd5ad (126604717) thru scn: 0x0000.07e06bec (132148204)
 Times: creation thru eternity
Initial buffer sizes: read 1024K, overflow 832K, change 805K

------------------------------------------------------------------------------------------------------

通过分析上面日志,可以得出以下结果:
1、Recovery trying to recover the block to scn: 0x0000.07e06bec 
2、Expect SCN in that same block to be SCN: 0x0000.07d38c6e 
3、But found a wrong SCN in that block scn: 0x0000.078bd5ad

0x0000.078bd5ad < 0x0000.07d38c6e, 所以是在备用数据库上面发生了lost write写丢失事件.

这类问题的root cause是发生了Lost Write写丢失.通常在I/O子系统确认写块I/O完毕,在写操作未永远存储到磁盘时,
发生写丢失. 有多种原因会导致写丢失,最常见的是存在故障的主机总线适配器、固件错误或存在故障的存储硬件.写
丢失本质上是“无提示”的数据损坏,原因是在执行后续读操作时才能检测到受损块,可能历经数天、数周甚至数月.出于
这个原因,写丢失极难诊断.在后续块读取中,I/O系统返回一个块,该块实际上是数据块的过时版本.

如果Primay上的块SCN比Standby上的小,则检测出Primary上的一个写丢失.用来修复Primary的写丢失的推荐步骤是执行
针对物理备用的故障转移,然后重建主数据库.
如果Primay上的块SCN比Standby上的大,则检测出Standby上的一个写丢失.要修复备用数据库的写丢失,必须重建备用数
据库或受影响的数据文件.

Oracle support给出的解释描述:
Ora-00600 [3020] is known as stuck recovery and this can happen on standby due to lost writes.A data block 
lost write occurs when an I/O subsystem acknowledges the completion of the block write, while in fact the 
write did not occur in the persistent storage. Lost writes always happen due to storage/os/hardware issues. 

Example: 
------------- 
Block 1 had scn of 100 
Block 1 was updated and scn incremented to 110 in buffer cache. So the change vector in the redo recorded the previous SCN to be 100 and changed scn to 110. 
Block 1 was indicated to be flushed to disk but due an I/O issue the block was flushed but not written to disk. So the SCN for the block in disk remains 100. 
Again the same block gets updated and the scn gets incremented from 100 to 120. So the change vector in the redo recorded the previous SCN to be 100 and changed scn to 120. 
The redolog gets shipped to standby 
The recovery applies first redo change vector and changes the block scn from 100 to 110. 
The recovery tries to apply the second change vector. It finds the block scn to be 110 whereas it is expecting it to be 100 since for this change vector the previous scn recorded is 100. 
Recovery halts with ora-00600 [3020] because of the lost write which happened in step 3. 

解决方案I :直接基于主数据库重新重建备用数据库

解决方案II:仅仅只恢复文件undotbs01.dbf

1、关闭从库数据库实例
2、主库上让undo表空间进入热备份状态
    alter tablespace undotbs1 begin backup;
3、利用操作scp命令复制数据文件至从库指定位置,原有文件重命名下
4、启动从库数据库实例到read only状态,开启mrp进程,发现alert日志无报错信息
    startup mount
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
5、主库关闭undo表空间热备份状态
    alter tablespace undotbs1 end backup;
6、删除从库原有的undo表空间数据文件备份

方案II实施后,发现MRP进程启动成功, 但是没有应用归档日志,有gap产生

SQL> select process,status,sequence# from v$managed_standby;

PROCESS   STATUS        SEQUENCE#
--------- ------------ ----------
ARCH      CLOSING             803
ARCH      CONNECTED             0
ARCH      CLOSING             797
ARCH      CLOSING             798
ARCH      CLOSING             799
ARCH      CLOSING             800
ARCH      CLOSING             801
ARCH      CLOSING             802
MRP0      WAIT_FOR_GAP         93
RFS       IDLE                  0
RFS       IDLE                  0
RFS       IDLE                804
RFS       IDLE                  0

在备库查询:
SQL> select * from v$archive_gap;


   THREAD# LOW_SEQUENCE# HIGH_SEQUENCE#
---------- ------------- --------------
         1            92            323

把缺失的归档文件从主库拷贝到备库之后, 发现MRP进程仍然处于WAIT_FOR_GAP状态

查询v$archived_log,发现324之前的没有注册进来,估计是被rman的delete NOPROMPT archivelog all命令给清除了:

SQL> select name, sequence#,dest_id,creator,registrar,standby_dest,archived,applied from v$archived_log order by 2,3;

NAME                                  SEQUENCE#    DEST_ID CREATOR REGISTR STA ARC APPLIED
------------------------------------ ---------- ---------- ------- ------- --- --- ---------
/d/db-arc/oss/1_324_873459413.dbf           324          1 ARCH    RFS     NO  YES NO
/d/db-arc/oss/1_325_873459413.dbf           325          2 ARCH    RFS     NO  YES NO
/d/db-arc/oss/1_326_873459413.dbf           326          2 ARCH    RFS     NO  YES NO
/d/db-arc/oss/1_327_873459413.dbf           327          2 ARCH    RFS     NO  YES NO
/d/db-arc/oss/1_328_873459413.dbf           328          2 ARCH    RFS     NO  YES NO
.
.
.

重新注册这些归档日志:
alter database register logfile '/d/db-arc/oss/1_93_873459413.dbf';
alter database register logfile '/d/db-arc/oss/1_94_873459413.dbf';
alter database register logfile '/d/db-arc/oss/1_95_873459413.dbf';
alter database register logfile '/d/db-arc/oss/1_96_873459413.dbf';
.
.
alter database register logfile '/d/db-arc/oss/1_323_873459413.dbf';

然后再检查直到APPLYING_LOG的日志达到最新序号:
SQL> select process,status,sequence# from v$managed_standby;

PROCESS   STATUS        SEQUENCE#
--------- ------------ ----------
ARCH      CLOSING             803
ARCH      CONNECTED             0
ARCH      CLOSING             804
ARCH      CLOSING             798
ARCH      CLOSING             799
ARCH      CLOSING             800
ARCH      CLOSING             801
ARCH      CLOSING             802
MRP0      APPLYING_LOG        805
RFS       IDLE                  0
RFS       IDLE                  0
RFS       IDLE                805
RFS       IDLE                  0

到此备库恢复完毕.


参考资料:
【Oracle Data Guard 11g Handbook】第3章 3.3.2 小节
ORA600 [3020] "Stuck Recovery" (Doc ID 30866.1)
Resolving ORA-752 or ORA-600 [3020] During Standby Recovery (Doc ID 1265884.1)

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/3898/viewspace-1656809/,如需转载,请注明出处,否则将追究法律责任。

下一篇: 没有了~
请登录后发表评论 登录
全部评论
  • 博文量
    108
  • 访问量
    759976