ITPub博客

首页 > Linux操作系统 > Linux操作系统 > 转:一个RAC恢复案例并思路

转:一个RAC恢复案例并思路

原创 Linux操作系统 作者:yanyp 时间:2009-05-19 16:45:48 0 删除 编辑
一个RAC恢复案例并思路--1
2008-07-11 16:05

环境:
AIX 5.3
Oracle 10.2.0.2 RAC
2 instance

错误:看alert把
Thu Jul 10 15:37:47 2008
Errors in file /oracle/app/oracle/admin/qasmdm/bdump/qasmdm1_lgwr_446692.trc:
ORA-00340: IO error processing online log 2 of thread 1
ORA-00345: redo log write error block 781893 count 4
ORA-00312: online log 2 thread 1: '/dev/rlvredo12'
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 5: I/O error
Additional information: -1

。。。。。。

Fri Jul 11 13:22:22 2008
Errors in file /oracle/app/oracle/admin/qasmdm/bdump/qasmdm1_j000_1073290.trc:
ORA-12012: error on auto execute of job 1
ORA-00376: file 3 cannot be read at this time
ORA-01110: data file 3: '/dev/rlvsysaux01'
ORA-06512: at "SYSMAN.EMD_MAINTENANCE", line 48
ORA-06512: at "SYSMAN.EMD_MAINTENANCE", line 936
ORA-06512: at line 1
Fri Jul 11 13:22:37 2008
Errors in file /oracle/app/oracle/admin/qasmdm/bdump/qasmdm1_j000_1073290.trc:
ORA-12012: error on auto execute of job 42579
ORA-00376: file ORA-00376: file 3 cannot be read at this time
ORA-01110: data file 3: '/dev/rlvsysaux01'
ORA-06512: at "EXFSYS.DBMS_RLMGR_DR", line 15
ORA-06512: at line 1
cannot be read at this time

肯定要先看lv的状态:
[qasmdm1:oracle] /oracle/app/oracle/admin/qasmdm/bdump> lsvg -l qasmdmvg
qasmdmvg:
LV NAME             TYPE       LPs     PPs     PVs LV STATE      MOUNT POINT
lv_ocr              jfs        2       2       1    open/syncd    N/A
lv_voting           jfs        2       2       1    open/syncd    N/A
lvuser03            jfs        128     128     1    open/syncd    N/A
lv_temp             jfs        160     160     1    open/syncd    N/A
lvcntrl1            jfs        1       1       1    open/syncd    N/A
lvcntrl2            jfs        1       1       1    open/syncd    N/A
lvcntrl3            jfs        1       1       1    open/syncd    N/A
lvsystem01          jfs        16      16      1    open/syncd    N/A
lvuser01            jfs        32      32      1    open/syncd    N/A
lvundotbs01         jfs        160     160     1    open/syncd    N/A
lvundotbs02         jfs        160     160     1    open/syncd    N/A
lvredo11            jfs        8       8       1    open/syncd    N/A
lvredo12            jfs        8       8       1    open/syncd    N/A
lvredo13            jfs        8       8       1    closed/syncd N/A
lvredo21            jfs        8       8       1    closed/syncd N/A
lvredo22            jfs        8       8       1    closed/syncd N/A
lvredo33            jfs        8       8       1    closed/syncd N/A
lvsysaux01          jfs        32      32      1    closed/syncd    N/A
lvuser02            jfs        256     256     2    open/syncd    N/A
lvuser04            jfs        128     128     2    open/syncd    N/A
lvsystem02          jfs        32      32      1    open/syncd    N/A


搞过aix的应该知道:closed并不代表逻辑卷有问题。并且看了一下最近的hacmp.out也没发现问题。

再看操作系统日志:

AFA89905   0711130408 I O grpsvcs        Group Services daemon started
97419D60   0711130408 I O topsvcs        Topology Services daemon started
AE3E3FAD   0711125608 I O SYSJ2          FSCK FOUND ERRORS
AE3E3FAD   0711125508 I O SYSJ2          FSCK FOUND ERRORS
AE3E3FAD   0711125408 I O SYSJ2          FSCK FOUND ERRORS
A6DF45AA   0711124508 I O RMCdaemon      The daemon is started.
7BFEEA1F   0711124308 T H fcs0           LINK ERROR
7BFEEA1F   0711124308 T H fcs0           LINK ERROR
7BFEEA1F   0711124308 T H fcs1           LINK ERROR
7BFEEA1F   0711124308 T H fcs1           LINK ERROR
2BFA76F6   0711124308 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0711124408 T O errdemon       ERROR LOGGING TURNED ON
192AC071   0710203708 T O errdemon       ERROR LOGGING TURNED OFF
AA8AB241   0710203708 T O OPERATOR       OPERATOR NOTIFICATION
A6DF45AA   0710193808 I O RMCdaemon      The daemon is started.
F45CFAFB   0710193808 P H hdisk29        OPEN FAILURE
F45CFAFB   0710193808 P H hdisk25        OPEN FAILURE
F45CFAFB   0710193808 P H hdisk15        OPEN FAILURE
F45CFAFB   0710193808 P H hdisk5         OPEN FAILURE
F45CFAFB   0710193808 P H hdisk28        OPEN FAILURE
F45CFAFB   0710193808 P H hdisk24        OPEN FAILURE
F45CFAFB   0710193808 P H hdisk14        OPEN FAILURE
F45CFAFB   0710193808 P H hdisk4         OPEN FAILURE
F45CFAFB   0710193808 P H hdisk39        OPEN FAILURE
F45CFAFB   0710193808 P H hdisk33        OPEN FAILURE
F45CFAFB   0710193808 P H hdisk19        OPEN FAILURE
F45CFAFB   0710193808 P H hdisk9         OPEN FAILURE

据说:昨天存储出过问题,但是今天中午已经修好了,并且其他表空间跟sysaux在同一个阵列上的,并且同一个vg.
理论上应该排除硬件问题了。

于是查一下数据文件的情况:
SQL> select file#,status from v$datafile;

     FILE# STATUS
---------- -------
         1 SYSTEM
         2 ONLINE
         3 RECOVER
         4 ONLINE
         5 ONLINE
         6 SYSTEM
         7 ONLINE
         8 ONLINE
         9 ONLINE

9 rows selected.

data file 3: '/dev/rlvsysaux01' 状态是recover

SQL> select file_id,bytes,status,online_status from dba_data_files;

   FILE_ID      BYTES STATUS    ONLINE_
---------- ---------- --------- -------
         4 8589148160 AVAILABLE ONLINE
         3            AVAILABLE RECOVER
         2 3.0774E+10 AVAILABLE ONLINE
         1 2144337920 AVAILABLE SYSTEM
         5 3.1029E+10 AVAILABLE ONLINE
         6 4293918720 AVAILABLE SYSTEM
         7 3.4357E+10 AVAILABLE ONLINE
         8 3.4357E+10 AVAILABLE ONLINE
         9 3.4357E+10 AVAILABLE ONLINE

9 rows selected.


datafile状态:available,bytes居然是null

看到recover自然要查v$recover_file,再往下查很明朗了:

SQL> select * from V$RECOVER_FILE;

     FILE# ONLINE ONLINE_ ERROR          CHANGE# TIME
---------- ------- ------- ----------- ---------- ---------
         3 OFFLINE OFFLINE             1.6800E+10 10-JUL-08

SQL> alter database recover datafile 3;
alter database recover datafile 3
*
ERROR at line 1:
ORA-00279: change 16799532025 generated at 07/10/2008 14:56:19 needed for thread 1
ORA-00289: suggestion : /qasmdmarch1/1_4049_653004728.dbf
ORA-00280: change 16799532025 for thread 1 is in sequence #4049

到这里,有个插曲:
我一看时间以为要麻烦了,因为是昨天的归档,可能已经被备份并删除了。准备要找tsm的team的时候想 :还是先看一眼吧:

[qasmdm1:oracle] /qasmdmarch1> ls
1_4042_653004728.dbf 1_4044_653004728.dbf 1_4046_653004728.dbf 1_4048_653004728.dbf 1_4050_653004728.dbf 
1_4043_653004728.dbf 1_4045_653004728.dbf 1_4047_653004728.dbf 1_4049_653004728.dbf 1_4051_653004728.dbf

归档文件居然还在,后来用recover datafile 3就不这样了
并且在这个时候做了一件事情,那就是把node 2的归档也ftp到了这个目录来。

接:一个RAC恢复案例并思路--1

SQL> alter database recover tablespace sysaux;
alter database recover tablespace sysaux
*
ERROR at line 1:
ORA-00275: media recovery has already been started


SQL> alter database recover cancel;

Database altered.

SQL> recover datafile 3;
ORA-00279: change 16799532025 generated at 07/10/2008 14:56:27 needed for thread 2
ORA-00289: suggestion : /qasmdmarch1/2_3886_653004728.dbf
ORA-00280: change 16799532025 for thread 2 is in sequence #3886


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 16799532025 generated at 07/10/2008 14:56:19 needed for thread 1
ORA-00289: suggestion : /qasmdmarch1/1_4049_653004728.dbf
ORA-00280: change 16799532025 for thread 1 is in sequence #4049


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 16799532174 generated at 07/10/2008 14:56:40 needed for thread 1
ORA-00289: suggestion : /qasmdmarch1/1_4050_653004728.dbf
ORA-00280: change 16799532174 for thread 1 is in sequence #4050
ORA-00278: log file '/qasmdmarch1/1_4049_653004728.dbf' no longer needed for this recovery

......
......省略了些

Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 16800179378 generated at 07/11/2008 13:06:58 needed for thread 2
ORA-00289: suggestion : /qasmdmarch1/2_3887_653004728.dbf
ORA-00280: change 16800179378 for thread 2 is in sequence #3887
ORA-00278: log file '/qasmdmarch1/2_3886_653004728.dbf' no longer needed for this recovery


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00308: cannot open archived log '/qasmdmarch1/2_3887_653004728.dbf'
ORA-27037: unable to obtain file status
IBM AIX RISC System/6000 Error: 2: No such file or directory
Additional information: 3


SQL> recover datafile 3;
ORA-00279: change 16800179378 generated at 07/11/2008 13:06:58 needed for thread 2
ORA-00289: suggestion : /qasmdmarch1/2_3887_653004728.dbf
ORA-00280: change 16800179378 for thread 2 is in sequence #3887


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00308: cannot open archived log '/qasmdmarch1/2_3887_653004728.dbf'
ORA-27037: unable to obtain file status
IBM AIX RISC System/6000 Error: 2: No such file or directory
Additional information: 3

到这里也应该能猜到,recover的过程中node2正好又生成了新的归档,第一次ftp没有复制过来。
把刚生成的归档再ftp过来

PEKAX131:/qasmdmarch1>ls
1_4042_653004728.dbf 1_4046_653004728.dbf 1_4050_653004728.dbf 2_3881_653004728.dbf 2_3885_653004728.dbf
1_4043_653004728.dbf 1_4047_653004728.dbf 1_4051_653004728.dbf 2_3882_653004728.dbf 2_3886_653004728.dbf
1_4044_653004728.dbf 1_4048_653004728.dbf 1_4052_653004728.dbf 2_3883_653004728.dbf 2_3887_653004728.dbf
1_4045_653004728.dbf 1_4049_653004728.dbf 1_4053_653004728.dbf 2_3884_653004728.dbf

然后重新开始recover:
SQL> recover datafile 3;
ORA-00279: change 16800179378 generated at 07/11/2008 13:06:58 needed for thread 2
ORA-00289: suggestion : /qasmdmarch1/2_3887_653004728.dbf
ORA-00280: change 16800179378 for thread 2 is in sequence #3887


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 16800179378 generated at 07/10/2008 15:03:46 needed for thread 1
ORA-00289: suggestion : /qasmdmarch1/1_4051_653004728.dbf
ORA-00280: change 16800179378 for thread 1 is in sequence #4051


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 16800179379 generated at 07/11/2008 13:06:59 needed for thread 1
ORA-00289: suggestion : /qasmdmarch1/1_4052_653004728.dbf
ORA-00280: change 16800179379 for thread 1 is in sequence #4052
ORA-00278: log file '/qasmdmarch1/1_4051_653004728.dbf' no longer needed for this recovery


Specify log: {=suggested | filename | AUTO | CANCEL}

Log applied.
Media recovery complete.

Perfect!
recover已经complete,然后该online 表空间了


SQL> alter tablespace sysaux online;

Tablespace altered.

SQL> select * from V$RECOVER_FILE;

no rows selected


SQL> select file_id,bytes,status,online_status from dba_data_files;

   FILE_ID      BYTES STATUS    ONLINE_
---------- ---------- --------- -------
         4 8589148160 AVAILABLE ONLINE
         3 4293918720 AVAILABLE ONLINE
         2 3.0774E+10 AVAILABLE ONLINE
         1 2144337920 AVAILABLE SYSTEM
         5 3.1029E+10 AVAILABLE ONLINE
         6 4293918720 AVAILABLE SYSTEM
         7 3.4357E+10 AVAILABLE ONLINE
         8 3.4357E+10 AVAILABLE ONLINE
         9 3.4357E+10 AVAILABLE ONLINE

已经没有问题了,file也已经online.
      
然后再第二个节点上:
SQL> alter tablespace sysaux online;

Tablespace altered.

原因分析和整个过程总结:

10日存储有问题,导致/dev/rlvsysaux01 offline.
因为这个表空间比较特殊,用户应用普通的查询和修改等不会被影响,所以问题没有被发现


当有job等需要碰到它操作的时候才会触发错误.
如ORA-12012: error on auto execute of job 1

结决过程思路比较简单:
因为硬件问题已经解决,所以恢复也比较简单就是recover datafile
由于rac的两个instance 各自归档在本地,所以在其中一个节点recover的时候要保证它能读到两个instance的归档,所以用了ftp把归档都放在了执行recover的节点

redo log 本来就是共享的,所以不用关心,自然能recover.

最后找到了错误的源头:
hu Jul 10 15:38:02 2008
Errors in file /oracle/app/oracle/admin/qasmdm/bdump/qasmdm2_lgwr_573572.trc:
ORA-00345: redo log write error block 586061 count 2
ORA-00312: online log 4 thread 2: '/dev/rlvredo22'
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 46: Device not ready
Additional information: -1
Additional information: 1024

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/230160/viewspace-600471/,如需转载,请注明出处,否则将追究法律责任。

请登录后发表评论 登录
全部评论

注册时间:2009-04-12

  • 博文量
    139
  • 访问量
    172479