
Resolving Data Guard ORA-16825 and ORA-16661: A Case Study

Original post by liupzmin (Oracle), 2016-04-05 16:02:08
Around 5 a.m. one day I was woken by a call from the customer's operations staff: one of their switches had rebooted for some reason, and the resulting loss of heartbeat caused the two Linux servers running RHCS (not the Oracle servers) to reboot as well.

I logged in right away to check on Oracle and found the primary database sitting in MOUNT state and the standby already acting as the new primary; the heartbeat loss had clearly triggered an automatic failover (the production Data Guard environment is configured with Fast-Start Failover).
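For context, a Fast-Start Failover setup like the one in this environment is normally put in place through DGMGRL along these lines (a minimal sketch, assuming the broker configuration jsconf already exists with jsrw and jsread as its two databases; the connect string is a placeholder):

  DGMGRL> CONNECT sys/password@jsrw
  DGMGRL> EDIT CONFIGURATION SET PROTECTION MODE AS MaxAvailability;
  DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
  DGMGRL> ENABLE FAST_START FAILOVER;
  -- START OBSERVER blocks, so it runs in its own DGMGRL session on the observer host (CAHW50 here)
  DGMGRL> START OBSERVER;

The observer is best run on a third, independent host; as we will see, in this case it lived on one of the RHCS servers that rebooted.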

Connect with the broker (DGMGRL) and check the status:

  DGMGRL> SHOW CONFIGURATION VERBOSE;

  Configuration - jsconf

    Protection Mode: MaxAvailability
    Databases:
      jsread - Primary database
        Error: ORA-16825: multiple errors or warnings, including fast-start failover-related errors or warnings, detected for the database

      jsrw   - (*) Physical standby database (disabled)
        ORA-16661: the standby database needs to be reinstated

    (*) Fast-Start Failover target

    Properties:
      FastStartFailoverThreshold      = '30'
      OperationTimeout                = '30'
      FastStartFailoverLagLimit       = '30'
      CommunicationTimeout            = '180'
      ObserverReconnect               = '0'
      FastStartFailoverAutoReinstate  = 'TRUE'
      FastStartFailoverPmyShutdown    = 'TRUE'
      BystandersFollowRoleChange      = 'ALL'
      ObserverOverride                = 'FALSE'
      ExternalDestination1            = ''
      ExternalDestination2            = ''
      PrimaryLostWriteAction          = 'CONTINUE'

  Fast-Start Failover: ENABLED

    Threshold:          30 seconds
    Target:             jsrw
    Observer:           CAHW50
    Lag Limit:          30 seconds (not in use)
    Shutdown Primary:   TRUE
    Auto-reinstate:     TRUE
    Observer Reconnect: (none)
    Observer Override:  FALSE

  Configuration Status:
  ERROR


Check the status of the primary and standby databases:

  DGMGRL> show database jsrw

  Database - jsrw

    Role:            PHYSICAL STANDBY
    Intended State:  APPLY-ON
    Transport Lag:   (unknown)
    Apply Lag:       (unknown)
    Apply Rate:      (unknown)
    Real Time Query: OFF
    Instance(s):
      jsrw

  Database Status:
  ORA-16661: the standby database needs to be reinstated

  DGMGRL> show database jsread

  Database - jsread

    Role:            PRIMARY
    Intended State:  TRANSPORT-ON
    Instance(s):
      jsread

    Database Error(s):
      ORA-16820: fast-start failover observer is no longer observing this database

    Database Warning(s):
      ORA-16817: unsynchronized fast-start failover configuration

  Database Status:
  ERROR


It looks as though the original primary has been isolated. Let's check the alert logs.

Alert log of the original primary:

  Tue Dec 15 04:14:45 2015
  ORA-16198: LGWR received timedout error from KSR
  LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
  LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
  Error 16198 for archive log file 6 to 'jsread'
  ORA-16198: LGWR received timedout error from KSR
  LGWR: Error 16198 disconnecting from destination LOG_ARCHIVE_DEST_2 standby host 'jsread'
  Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
  Primary has heard from neither observer nor target standby within FastStartFailoverThreshold seconds.
  It is likely an automatic failover has already occurred. Primary is shutting down.
  Errors in file /orasystem_readwrite/oracle/oraWR/diag/rdbms/jsrw/jsrw/trace/jsrw_lgwr_73912.trc:
  ORA-16830: primary isolated from fast-start failover partners longer than FastStartFailoverThreshold seconds: shutting down
The log shows that the primary lost its connections to both the standby and the observer for longer than the 30-second threshold; it concluded that an automatic failover had most likely already taken place, and shut itself down.
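This behaviour is driven entirely by the broker properties shown earlier: FastStartFailoverThreshold controls how long the isolated primary waits, and FastStartFailoverPmyShutdown makes it shut itself down afterwards. If a 30-second threshold is too aggressive for a network that can hiccup, it can be raised from DGMGRL (a sketch; 60 is just an example value, not a recommendation):

  DGMGRL> SHOW CONFIGURATION VERBOSE;
  DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 60;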

Alert log of the original standby:

  Tue Dec 15 04:14:47 2015
  RFS[44]: Assigned to RFS process 41247
  RFS[44]: Possible network disconnect with primary database
  Tue Dec 15 04:14:47 2015
  RFS[40]: Possible network disconnect with primary database
  Tue Dec 15 04:14:47 2015
  RFS[42]: Possible network disconnect with primary database
  Tue Dec 15 04:14:50 2015
  Attempting Fast-Start Failover because the threshold of 30 seconds has elapsed.
  Tue Dec 15 04:14:50 2015
  Data Guard Broker: Beginning failover
  Tue Dec 15 04:14:50 2015
  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL
  Tue Dec 15 04:14:50 2015
  MRP0: Background Media Recovery cancelled with status 16037
  Errors in file /orasystem_readonly/oracle/oraread/diag/rdbms/jsread/jsread/trace/jsread_pr00_59244.trc:
  ORA-16037: user requested cancel of managed recovery operation
  Managed Standby Recovery not using Real Time Apply
  Tue Dec 15 04:14:50 2015
  ALTER SYSTEM SET service_names='jsread' SCOPE=MEMORY SID='jsread';
  Recovery interrupted!
  Recovered data files to a consistent state at change 229593026
  Tue Dec 15 04:14:51 2015
  MRP0: Background Media Recovery process shutdown (jsread)
  Managed Standby Recovery Canceled (jsread)
  Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL
  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
  Attempt to do a Terminal Recovery (jsread)
  Media Recovery Start: Managed Standby Recovery (jsread)
  started logmerger process
  Tue Dec 15 04:14:51 2015
  Managed Standby Recovery not using Real Time Apply
  Parallel Media Recovery started with 64 slaves
  Media Recovery Waiting for thread 1 sequence 3176 (in transit)
  Killing 1 processes with pids 41252 (all RFS, wait for I/O) in order to disallow current and future RFS connections. Requested by OS process 130750
  Begin: Standby Redo Logfile archival
  End: Standby Redo Logfile archival
  Terminal Recovery timestamp is '12/15/2015 04:14:57'
  Terminal Recovery: applying standby redo logs.
  Terminal Recovery: thread 1 seq# 3176 redo required
  Terminal Recovery:
  Recovery of Online Redo Log: Thread 1 Group 9 Seq 3176 Reading mem 0
  Mem# 0: /oradata_readonly/jsread/stb_redo02.log
  Identified End-Of-Redo (failover) for thread 1 sequence 3176 at SCN 0xffff.ffffffff
  Incomplete Recovery applied until change 229593027 time 12/15/2015 04:14:14
  Media Recovery Complete (jsread)
  Terminal Recovery: successful completion
  Forcing ARSCN to IRSCN for TR 0:229593027
  Tue Dec 15 04:14:57 2015
  ARCH: Archival stopped, error occurred. Will continue retrying
  Attempt to set limbo arscn 0:229593027 irscn 0:229593027
  ORACLE Instance jsread - Archival ErrorResetting standby activation ID 486853183 (0x1d04ca3f)
  ORA-16014: log 9 sequence# 3176 not archived, no available destinations
  ORA-00312: online log 9 thread 1: '/oradata_readonly/jsread/stb_redo02.log'
  Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
  ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
  ALTER DATABASE SWITCHOVER TO PRIMARY (jsread)
  Maximum wait for role transition is 15 minutes.
  All dispatchers and shared servers shutdown
  CLOSE: killing server sessions.
  Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
  ... (the same 'Active process 117884' line repeated over 30 times) ...
  Tue Dec 15 04:15:00 2015
  Active process 117884 user 'grid' program 'oracle@jsc_db_r (TNS V1-V3)'
  CLOSE: all sessions shutdown successfully.
  Tue Dec 15 04:15:01 2015
  SMON: disabling cache recovery
  Backup controlfile written to trace file /orasystem_readonly/oracle/oraread/diag/rdbms/jsread/jsread/trace/jsread_rsm0_79936.trc
  Standby terminal recovery start SCN: 229593026
  RESETLOGS after incomplete recovery UNTIL CHANGE 229593027
  Online log /oradata_readonly/jsread/redo01.log: Thread 1 Group 1 was previously cleared
  Online log /oradata_readonly/jsread/redo02.log: Thread 1 Group 2 was previously cleared
  Online log /oradata_readonly/jsread/redo03.log: Thread 1 Group 3 was previously cleared
  Online log /oradata_readonly/jsread/redo04.log: Thread 1 Group 4 was previously cleared
  Online log /oradata_readonly/jsread/redo05.log: Thread 1 Group 5 was previously cleared
  Online log /oradata_readonly/jsread/redo06.log: Thread 1 Group 6 was previously cleared
  Online log /oradata_readonly/jsread/redo07.log: Thread 1 Group 7 was previously cleared
  Online log /oradata_readonly/jsread/redo08.log: Thread 1 Group 16 was previously cleared
  Online log /oradata_readonly/jsread/redo09.log: Thread 1 Group 17 was previously cleared
  Online log /oradata_readonly/jsread/redo10.log: Thread 1 Group 18 was previously cleared
  Standby became primary SCN: 229593025
  Tue Dec 15 04:15:09 2015
  Setting recovery target incarnation to 5
  AUDIT_TRAIL initialization parameter is changed back to its original value as specified in the parameter file.
  Switchover: Complete - Database mounted as primary
  Completed: ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
  ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
  Completed: ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY
  ALTER DATABASE OPEN
  Data Guard Broker initializing...
  Tue Dec 15 04:15:09 2015
  Assigning activation ID 488981393 (0x1d254391)
  LGWR: Primary database is in MAXIMUM AVAILABILITY mode
  Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
  LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR
  Thread 1 advanced to log sequence 2 (thread open)
  Tue Dec 15 04:15:09 2015
  ARC2: Becoming the 'no SRL' ARCH
  Tue Dec 15 04:15:09 2015
  ARC3: Becoming the 'no SRL' ARCH
  Thread 1 opened at log sequence 2
  Current log# 2 seq# 2 mem# 0: /oradata_readonly/jsread/redo02.log
  Successful open of redo thread 1
  MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
  ARC2: Becoming the 'no SRL' ARCH
  SMON: enabling cache recovery
  Archived Log entry 3278 added for thread 1 sequence 1 ID 0x1d254391 dest 1:
  Archiver process freed from errors. No longer stopped
  Tue Dec 15 04:15:10 2015
  PING[ARC0]: Heartbeat failed to connect to standby 'jsrw'. Error is 16058.
  [79936] Successfully onlined Undo Tablespace 2.
  Undo initialization finished serial:0 start:1846831906 end:1846832126 diff:220 (2 seconds)
  Dictionary check beginning
  PING[ARC0]: Heartbeat failed to connect to standby 'jsrw'. Error is 16058.
  Dictionary check complete
  Verifying file header compatibility for 11g tablespace encryption..
  Verifying 11g file header compatibility for tablespace encryption completed
  SMON: enabling tx recovery
  Starting background process SMCO
  Database Characterset is ZHS16GBK
  Tue Dec 15 04:15:10 2015
  idle dispatcher 'D000' terminated, pid = (24, 1)
  Tue Dec 15 04:15:10 2015
  SMCO started with pid=24, OS id=131236
  No Resource Manager plan active
  Tue Dec 15 04:15:11 2015
  Starting background process QMNC
  Tue Dec 15 04:15:11 2015
  QMNC started with pid=35, OS id=131244
  LOGSTDBY: Validating controlfile with logical metadata
  LOGSTDBY: Validation complete
  Completed: ALTER DATABASE OPEN

The standby did indeed start an automatic failover after losing the primary, and the failover eventually completed successfully.

A note here: in theory, once the switch came back Data Guard should have detected the old primary and automatically reinstated it as a standby, but this time it clearly did not. Remember that the two RHCS servers had rebooted because of the network problem, and the customer's observer happened to be deployed on one of them. Logging on to that host, I could not even find an observer log. Why had the observer written nothing?
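When the observer's health is in doubt, the fast-start failover columns of v$database give a quick picture from inside the database itself. A minimal check on the surviving primary (these columns are available in 11gR2, which is what this environment appears to be running):

  SQL> SELECT fs_failover_status,
              fs_failover_observer_present,
              fs_failover_observer_host
       FROM   v$database;

Had the observer been alive, with FastStartFailoverAutoReinstate left at TRUE it would have reinstated the old primary automatically once that database came back up in MOUNT state.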

I tried starting the observer again and restarting the old primary, hoping the observer would automatically reinstate the failed primary as a standby, but I hit a problem as soon as I ran START OBSERVER:


  DGMGRL> start observer
  Error: ORA-16647: could not start more than one observer
  Failed.
  DGMGRL> stop observer
  Error:
  ORA-01034: ORACLE not available
  Process ID: 0
  Session ID: 0 Serial number: 0

  Observer started
  [W000 12/15 05:30:13.80] Observer started.
  Observer stopped
  Error:
  ORA-01034: ORACLE not available
  Process ID: 0
  Session ID: 0 Serial number: 0
  [W000 12/15 05:32:05.86] Failed to start the Observer.
  Error: ORA-16647: could not start more than one observer
  [W000 12/15 05:33:40.06] Failed to start the Observer.

The observer could be neither stopped nor started. My guess is that the server reboot left the broker in an inconsistent state: the switch reboot broke the network and triggered the fast-start failover, and before the failover had fully finished its bookkeeping, the observer's server rebooted as well, leaving Data Guard in a bad state.
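One avenue worth noting for this kind of stale-observer deadlock (offered only as a suggestion, not something verified in this incident) is to make sure the DGMGRL session is connected to the surviving new primary rather than to the downed old primary before issuing STOP OBSERVER, since the observer registration lives in the broker configuration and can usually be cleared from any healthy member. The connect string below is a placeholder:

  DGMGRL> CONNECT sys/password@jsread
  DGMGRL> STOP OBSERVER;
  DGMGRL> START OBSERVER;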
With the observer unusable, the failed primary could no longer be recovered automatically and had to be fixed by hand. The logs confirm that the new primary completed the failover successfully, so all that remains is to recover the failed primary. How? There are several options: rebuild it from scratch, use Flashback Database, or use the broker's REINSTATE command. Since REINSTATE relies on flashback under the hood anyway, the simplest route is to just use REINSTATE.
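For reference, REINSTATE automates roughly the following manual procedure (a sketch of the documented flashback-based conversion of a failed primary into a standby; <scn> stands for the value returned by the first query and must be looked up, not guessed):

  -- on the new primary: find the SCN at which the standby became primary
  SQL> SELECT standby_became_primary_scn FROM v$database;

  -- on the failed (old) primary:
  SQL> SHUTDOWN IMMEDIATE;
  SQL> STARTUP MOUNT;
  SQL> FLASHBACK DATABASE TO SCN <scn>;
  SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
  SQL> SHUTDOWN IMMEDIATE;
  SQL> STARTUP MOUNT;
  -- restart redo apply; with the broker in charge this last step is normally handled for you
  SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;

Either way, Flashback Database must already have been enabled on the old primary; without it, rebuilding the standby is the only option left.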


  DGMGRL> reinstate database jsrw;
  Reinstating database "jsrw", please wait...
  Operation requires shutdown of instance "jsrw" on database "jsrw"
  Shutting down instance "jsrw"...
  ORA-01109: database not open
  Database dismounted.
  ORACLE instance shut down.
  Operation requires startup of instance "jsrw" on database "jsrw"
  Starting instance "jsrw"...
  ORACLE instance started.
  Database mounted.
  Continuing to reinstate database "jsrw" ...
  Reinstatement of database "jsrw" succeeded

Check the configuration status again:

  DGMGRL> show configuration verbose;

  Configuration - jsconf

    Protection Mode: MaxAvailability
    Databases:
      jsread - Primary database
        Error: ORA-16820: fast-start failover observer is no longer observing this database

      jsrw   - (*) Physical standby database
        Error: ORA-16820: fast-start failover observer is no longer observing this database

    (*) Fast-Start Failover target

    Properties:
      FastStartFailoverThreshold      = '30'
      OperationTimeout                = '30'
      FastStartFailoverLagLimit       = '30'
      CommunicationTimeout            = '180'
      ObserverReconnect               = '0'
      FastStartFailoverAutoReinstate  = 'TRUE'
      FastStartFailoverPmyShutdown    = 'TRUE'
      BystandersFollowRoleChange      = 'ALL'
      ObserverOverride                = 'FALSE'
      ExternalDestination1            = ''
      ExternalDestination2            = ''
      PrimaryLostWriteAction          = 'CONTINUE'

  Fast-Start Failover: ENABLED

    Threshold:          30 seconds
    Target:             jsrw
    Observer:           CAHW50
    Lag Limit:          30 seconds (not in use)
    Shutdown Primary:   TRUE
    Auto-reinstate:     TRUE
    Observer Reconnect: (none)
    Observer Override:  FALSE

  Configuration Status:
  ERROR
The reinstatement clearly succeeded, and only the missing-observer errors remain. At this point STOP OBSERVER and START OBSERVER work again without complaint; to be on the safe side I stopped the observer and then started it afresh:

  DGMGRL> stop observer;
  Done.
  DGMGRL> start observer;
  Observer started
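To avoid the "observer wrote no log" mystery next time, the observer can be started on its host with an explicit log file and kept in the background (a sketch; the connect string and path are placeholders, and -logfile is a standard dgmgrl option):

  # run on the observer host (CAHW50); password and path are placeholders
  nohup dgmgrl -logfile /home/oracle/fsfo_observer.log \
        sys/password@jsread "start observer" &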

This was clearly a false failover set off by the switch reboot; no failover was needed at all, and the very measure meant to provide high availability ended up causing its own share of trouble. We later found a window to switch back to the original primary, and automatic fast-start failover has not been re-enabled since; a manual failover is more flexible when one is genuinely needed. The customer's application was not resilient either and went down completely during this incident, because the two rebooted servers also hosted the Redis message relay service.
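The manual-control setup we eventually settled on boils down to a couple of broker commands (a sketch, run once the configuration is healthy again):

  DGMGRL> DISABLE FAST_START FAILOVER;
  DGMGRL> STOP OBSERVER;
  -- switch back to the original primary in a maintenance window
  DGMGRL> SWITCHOVER TO jsrw;

If a real disaster strikes later, FAILOVER TO the surviving standby can still be issued by hand from DGMGRL.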


