ITPub博客

首页 > Linux操作系统 > Linux操作系统 > MAA switchover失败经历分享

MAA switchover失败经历分享

原创 Linux操作系统 作者:Allen2312 时间:2009-01-04 23:51:35 0 删除 编辑
    一直想把前段时间做的MAA switchover失败的经历总结下,结果赶上一个项目实施,再加上元旦放假,所以都没来得及总结!今天总算腾出时间来了,必须总结出来!
    应客户要求要对已经上线的一套MAA系统做一次switchover测试,由于是上线系统,所以必须晚上+周末来做,可是苦了我了~~~
    依然按惯例介绍系统环境:
    4节点RAC+2节点Physical Standby,DB release: Oracle 10.2.0.3.0,采用real time apply方式同步physical standby database,没有安装Broker
    参考文档:metalink Note:751600.1(2008年12月4号刚刚更新的一篇文档,以下用文档代替)
    swtichover过程:
    1、预先检查
     根据文档建议,先要检查整个环境,这里就不多写了,省得大家看的累,我放到附件里了,感兴趣的下载看吧!
     我们的环境没有下载最新的patch bundle,我看了看就patch 6081547有些关系,其他应该没事,又加上上网不方便,所以抱着侥幸心理就没有去安装patch
    2、Pre-switchover检查
    这里分成两部分,一部分是使用broker管理的,一种是没有使用broker管理的
    我是没用过broker,就知道是Oracle用来图形化管理Data guard的工具,这种switchover我也更倾向于使用命令方式,这样能更好的定位错误,便于解决。
    a、Verify Managed Recovery is Running (non-broker) on the standby
    SQL> select process from v$managed_standby where process like 'MRP%';
    检查是否还有MRP在运行,我们的结果是0;
    b、Cancel apply delay for the target standby using SQL(如果有延时应用参数,取消之)
    On the Standby capture the current value
    select delay_mins from v$managed_standby where process = 'MRP0';
    On the standby turn off delay if  > 0
    SQL> RECOVER MANAGED STANDBY DATABASE CANCEL
    SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE NODELAY USING CURRENT LOGFILE DISCONNECT FROM SESSION;
    我们这里没有设置delay参数,跳过之
    c、Verify there are no large GAPS (检查是否存在gaps,我觉得这里有点问题)
    文档步骤如下:
1.      Identify the current sequence number for each thread

SQL> SELECT THREAD#, SEQUENCE# FROM V$THREAD;

2.      Verify  the standby has applied up to, but not including the logs from the primary query.  On the standby the following query should be no more than 1-2 less than the primary query result.

SQL> SELECT THREAD#, MAX(SEQUENCE#) FROM V$ARCHIVED_LOG val, V$DATABASE vdb
     WHERE APPLIED = 'YES' AND val.RESETLOGS_CHANGE#=vdb.RESETLOGS_CHANGE# GROUP BY THREAD#;

3.      If large gaps exist (more than 3 logs) then consult the “Oracle Data Guard Concepts and Administration, 10g Release 2 (10.2) guide”: Section 5.8 “Managing Archive Gaps” and Section 12.11 “Resolving Archive Gaps Manually”.  If the gap is not resolved by Data Guard automatically then consult, “Manually Determining and Resolving Archive Gaps”.
我们这里刚刚最大的是3个,我记得,所以我想可以忽略,但是下面一步就又提出新问题
4.      If a large redo apply lag (greater than 2 logs) persists then review the MAA best practice paper, “Data Guard Redo Apply & Media Recovery” and also consult the “Oracle Data Guard Concepts and Administration 11g Release 1 (11.1)  guide” to monitor in more detail, 9.5 Monitoring Primary, Physical Standby.
我当时把这里看成和第3步一样的意思了,实际上这里说的是redo apply lag,所以我决定去做recover,但是这里失败了,报错如下:
alter database recover automatic standby database
Sat Dec 13 21:57:51 2008
Media Recovery Start
Sat Dec 13 21:57:51 2008
Managed Standby Recovery not using Real Time Apply
Sat Dec 13 21:57:52 2008
 parallel recovery started with 7 processes
Sat Dec 13 21:57:52 2008
Media Recovery Log +DG1/standby/1_1163_664662264.dbf
Errors with log +DG1/standby/1_1163_664662264.dbf
ORA-279 signalled during: alter database recover automatic standby database...
这个1_1163_664662264.dbf还没有产生,应该是找不到的,估计是在主库的redo log日志组里面,当时没有考虑到这个问题,就又想让standby自动apply试试,就去执行:
alter database recover managed standby database using current logfile disconnect
最后是完成了,但是中间报错:
Managed Standby Recovery starting Real Time Apply
Sat Dec 13 22:02:10 2008
MRP0: Some datafile enqueues are still held! Retry recovery...
Sat Dec 13 22:02:10 2008
Errors in file /home/oracle/admin/standby/bdump/standby1_mrp0_30118.trc:
ORA-01124: cannot recover data file 1 - file is in use or recovery
ORA-01110: data file 1: '+DG1/standby/datafile/system.3984.665174909'
Sat Dec 13 22:02:10 2008
Managed Standby Recovery not using Real Time Apply
Sat Dec 13 22:02:10 2008
Errors in file /home/oracle/admin/standby/bdump/standby1_mrp0_30118.trc:
ORA-01124: cannot recover data file 1 - file is in use or recovery
ORA-01110: data file 1: '+DG1/standby/datafile/system.3984.665174909'
Sat Dec 13 22:02:30 2008
MRP0: Background Media Recovery terminated with error 1153
Sat Dec 13 22:02:30 2008
Errors in file /home/oracle/admin/standby/bdump/standby1_mrp0_30118.trc:
ORA-01153: an incompatible media recovery is active
Sat Dec 13 22:02:30 2008
Errors in file /home/oracle/admin/standby/bdump/standby1_mrp0_30118.trc:
ORA-01153: an incompatible media recovery is active
Sat Dec 13 22:02:30 2008
MRP0: Background Media Recovery process shutdown (standby1)
Sat Dec 13 22:02:30 2008
Completed: alter database recover managed standby database using current logfile disconnect
Sat Dec 13 22:09:21 2008
alter database recover managed standby database noparallel using current logfile disconnect
Sat Dec 13 22:09:21 2008
Attempt to start background Managed Standby Recovery process (standby1)
MRP0 started with pid=41, OS id=1271
Sat Dec 13 22:09:21 2008
MRP0: Background Managed Standby Recovery process started (standby1)
Sat Dec 13 22:09:26 2008
Managed Standby Recovery starting Real Time Apply
Sat Dec 13 22:09:26 2008
MRP0: Some datafile enqueues are still held! Retry recovery...
最后shutdown standby DB,然后重新尝试recover,依然和上面一样错误,最后只好使用real time apply方式自动去应用归档,然后不再报错,就没再做什么别的操作,最后失败我现在分析就跟这步有关!但是现在又说不太明白,哪位有相关经验,帮我指点下,谢谢啊!

    d、Use “THROUGH ALL SWITCHOVER” on Bystander Standbys(如果有多个standby database,这步需要关注,如果我没理解错的话这步也没什么问题)
    e、Verify Primary and Standby TEMP Files Match(检查主库和备库temp files是否匹配)
    这步在我们的环境是不匹配的,因为是rman复制方式创建的备库,所以备库没有temp files,文档给出的建议是:
    If the queries do not match then you can correct the mismatch now or immediately after the open of the new primary.

·        To correct now: add or delete a tempfile now requires managed recovery to be stopped and the standby to be open read only.  Opening the standby read-only will require a database close and open before becoming the new primary, see “Open the new primary database”.  

·        To correct post-primary-open: see “Correct any tempfile mismatch” step of Switchover
     我选择的是第二种方法: create temp tablespace immediately after the open of the new primary,这里应该也不是错误的关键!(个人感觉temp files没有那么大影响)
     f、Verify that there is no issue with V$LOG_HISTORY on the Standby(确认v$log_history中resetlogs的sequence#和v$archived_log中的sequence#的匹配程度,不能超过3),详细步骤说明:
     6010833 , 10.2.0.3 patch available on Linux 32-bit, this is included in the 6081547 patch bundle listed above under “Apply Latest Patch Bundle”)(这里还特别提到了6081547这个patch,metalink中该patch未公开,但是能找到6081547 Tracking Bug for Inventory of Data Guard Physical  Standby for 10.2.0.3 includes Bug 5399901 (archived logs can take a long time to search [control file]) > fixes Bug 5399901的说法,怀疑后面失败跟这个可能也有点关系


1.Get the Primary RESETLOGS_CHANGE#.  On the primary execute:
SQL> select RESETLOGS_CHANGE# from V$DATABASE;

2.On the standby get the maximum sequence numbers for each thread from V$LOG_HISTORY:
SQL> select thread#, max(sequence#) from V$LOG_HISTORY where resetlogs_change#=< resetlogs_change# from the primary V$DATABASE.RESETLOGS_CHANGE# >  group by thread#;

3.Get current sequence numbers on standby:
SQL> select thread#, max(sequence#) from v$archived_log alog,  v$database db where alog.resetlogs_change#=db.resetlogs_change# group by thread#;

4.The last sequence# for each thread# from V$LOG_HISTORY should be close (the difference in log sequences < 3) to the last sequence# for each thread# from V$ARCHIVED_LOG.

5.If there is an issue with V$LOG_HISTORY then recreate the standby controlfile.  See Note 459411.1.  If backups are being done on the standby without an RMAN Catalog then backup history will be lost.  It is highly recommended to use an RMAN Catalog for all backups.
这里我们检查结果都符合条件,就没有做第5步的操作,时间有点长,记不清了都有点!

     g、Verify no old partial Standby Redo Logs on the Standby
(bug 7159505, fixed in 10.2.0.5 and 11.1.0.7; 10.2.0.3 patch available on Solaris Sparc64 and can be requested for other platforms.  This patch conflicts with the 6081547 patch bundle and would require a patch merge request if you want to apply this on top of the 6081547 patch bundle.)

1.  Identify any active standby redo logs (SRL’s)
SQL> SELECT GROUP#, THREAD#, SEQUENCE# FROM V$STANDBY_LOG WHERE STATUS = 'ACTIVE' ORDER BY THREAD#,SEQUENCE#;

2.      Identify maximum applied sequence number(s).
SQL> select thread#, max(sequence#) from V$LOG_HISTORY where resetlogs_change#=< resetlogs_change# from the primary V$DATABASE.RESETLOGS_CHANGE# >  group by thread#;

3.      If there are any active SRL's that have a thread#/sequence# less than the thread#/sequence# returned from the V$LOG_HISTORY (meaning the recovery has progressed beyond the active SRL) query then clear them.

SQL> RECOVER MANAGED STANDBY DATABASE CANCEL

SQL> ALTER DATABASE CLEAR LOGFILE GROUP ;
这里检查的时候,我已经把其他节点都关闭了,所以是不是这里也出现问题,现在都没法检查了,不过我当时觉得是没关系的,现在又没环境了!
      3、Switchover(到了最关键的步骤了,其实前面准备过程最重要)
      a、关闭运行的jobs
      b、Shutdown all mid-tiers (optional,如果有的话)
      c、Turn on Data Guard tracing on primary and standby(打开DG trace)
      d、Tail Primary and Standby alert logs on all instances(输出logs)
      e、Create Guaranteed Restore Points (optional,如果有环境的话)
      f、Verify that the primary database can be switched to the standby role(检查主库是否具备switch条件)
     SQL> SELECT SWITCHOVER_STATUS FROM V$DATABASE;
SWITCHOVER_STATUS
-----------------
TO STANDBY
     
TO STANDBY or SESSIONS ACTIVE状态是可以switch的
       g、If RAC, then shutdown all secondary primary instance(关闭其他次主节点)
       h、Switchover the primary to a standby database
 SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO STANDBY WITH SESSION SHUTDOWN;
1.      If an ORA-16139 is encountered, as long as V$DATABASE.DATABASE_ROLE=’PHYSICAL STANDBY’, then you can proceed.  A common case where this can occur is when there are a large number of data files, greater than 1,000, the apply of the EOR log  will timeout..  Once managed recovery is started on the new standby it will recover. 
2.      If the role was not changed then you need to cancel the switchover and review the alert logs and trace files further.
这一步操作成功了,日志太多就不贴出来了!
       i、 If the standby is a RAC configuration, then shutdown all secondary standby instances(关闭其他次备库)
       j、Verify that the standby database can be switched to the primary role(检查备库是否具备switch到主库的条件,passed)
SQL> SELECT SWITCHOVER_STATUS FROM V$DATABASE;
SWITCHOVER_STATUS
----------------
TO PRIMARY
TO PRIMARY or SESSIONS ACTIVE 模式是可以switch的
       k、Check if the standby has ever been open read-only (passed)
1.On the standby run this query:
SQL> SELECT VALUE FROM V$DATAGUARD_STATS WHERE NAME='standby has been open';
2.      If the standby was open read-only then restart the standby
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT
       l、Switchover the standby database to a primary(失败了)
SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
看日志看的头晕了~~~~~还真没那么容易看来!

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/122290/viewspace-526608/,如需转载,请注明出处,否则将追究法律责任。

上一篇: 没有了~
下一篇: 呼和浩特印象
请登录后发表评论 登录
全部评论

注册时间:2008-11-03

  • 博文量
    12
  • 访问量
    70028