ITPub博客

首页 > Linux操作系统 > Linux操作系统 > RAC环境下,2个Node的snapshot都有问题

RAC环境下,2个Node的snapshot都有问题

原创 Linux操作系统 作者:xz43 时间:2011-02-15 16:53:45 0 删除 编辑
正式系统的RAC下,2个Node都有不同的问题,记录如下.
 
节点1上,生成addm报告报错
SQL> @?/rdbms/admin/addmrpt
Current Instance
~~~~~~~~~~~~~~~~
   DB Id    DB Name      Inst Num Instance
----------- ------------ -------- ------------
 1963923747 ORCL                 1 orcl1

Instances in this Workload Repository schema
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   DB Id     Inst Num DB Name      Instance     Host
------------ -------- ------------ ------------ ------------
  1963923747        2 ORCL          orcl2         dbsvr2
* 1963923747        1 ORCL          orcl1         dbsvr1
Using 1963923747 for database Id
Using          1 for instance number

Specify the number of days of snapshots to choose from
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Entering the number of days (n) will result in the most recent
(n) days of snapshots being listed.  Pressing without
specifying a number lists all completed snapshots.
 
Listing the last 3 days of Completed Snapshots
                                                        Snap
Instance     DB Name        Snap Id    Snap Started    Level
------------ ------------ --------- ------------------ -----
orcl1         ORCL              19036 13 2月  2011 00:00     1
                              19037 13 2月  2011 01:00     1
                              19038 13 2月  2011 02:00     1
                               ...
                               ...
                              19093 15 2月  2011 09:00     1
                              19094 15 2月  2011 10:00     1
                              19095 15 2月  2011 11:00     1
                              19096 15 2月  2011 12:00     1
                              19097 15 2月  2011 13:00     1
 
Specify the Begin and End Snapshot Ids
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
输入 begin_snap 的值:  19093
Begin Snapshot Id specified: 19093
输入 end_snap 的值:  19095
End   Snapshot Id specified: 19095
 
Specify the Report Name
~~~~~~~~~~~~~~~~~~~~~~~
The default report file name is addmrpt_1_19093_19095.txt.  To use this name,
press to continue, otherwise enter an alternative.
输入 report_name 的值: 
Using the report name addmrpt_1_19093_19095.txt

Running the ADDM analysis on the specified pair of snapshots ...
begin
*
第 1 行出现错误:
ORA-00604: 递归 SQL 级别 1 出现错误
ORA-01426: 数字溢出
ORA-06512: 在 "SYS.PRVT_ADVISOR", line 1624
ORA-06512: 在 "SYS.DBMS_ADVISOR", line 186
ORA-06512: 在 line 27
 
Generating the ADDM report for this analysis ...

ERROR:
ORA-13608: 任务或对象名 NULL 无效。
ORA-06512: 在 "SYS.PRVT_ADVISOR", line 2043
ORA-06512: 在 "SYS.DBMS_ADVISOR", line 560
ORA-06512: 在 line 1
 
End of Report
Report written to addmrpt_1_19093_19095.txt
SQL>
 
节点2上,发现snapshot近几天都没采集了
$date
Tue Feb 15 11:39:35 CST 2011
首先,查看当前系统时间.
SQL> @?/rdbms/admin/addmrpt
Current Instance
~~~~~~~~~~~~~~~~
   DB Id    DB Name      Inst Num Instance
----------- ------------ -------- ------------
 1963923747 ORCL                 2 orcl2

Instances in this Workload Repository schema
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   DB Id     Inst Num DB Name      Instance     Host
------------ -------- ------------ ------------ ------------
  1963923747        1 ORCL          orcl1         dbsvr1
* 1963923747        2 ORCL          orcl2         dbsvr2
Using 1963923747 for database Id
Using          2 for instance number

Specify the number of days of snapshots to choose from
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Entering the number of days (n) will result in the most recent
(n) days of snapshots being listed.  Pressing without
specifying a number lists all completed snapshots.
 
Listing the last 3 days of Completed Snapshots
                                                        Snap
Instance     DB Name        Snap Id    Snap Started    Level
------------ ------------ --------- ------------------ -----
orcl2         ORCL              19023 12 2月  2011 11:00     1
                              19024 12 2月  2011 12:00     1
                              19025 12 2月  2011 13:00     1
 
Specify the Begin and End Snapshot Ids
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
输入 begin_snap 的值:
发现从2月12日的13点开始,后面的snapshot就没有采集过了.
 
首先处理节点2上的问题
SQL> show parameter timed_statistics
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
timed_statistics                     boolean     TRUE
SQL> show parameter statistics_level
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
statistics_level                     string      TYPICAL
SQL> select * from dba_hist_wr_control;
      DBID
----------
SNAP_INTERVAL
---------------------------------------------------------------------------
RETENTION
---------------------------------------------------------------------------
TOPNSQL
----------
1963923747
+00000 01:00:00.0
+00003 00:00:00.0
DEFAULT
没有发现问题,查看alert日志,也没找到任何相关的异常信息.
 
尝试手动执行快照采集
SQL> exec DBMS_WORKLOAD_REPOSITORY.create_snapshot();
PL/SQL 过程已成功完成.
 
过段时间,再次手动执行快照采集
SQL> exec DBMS_WORKLOAD_REPOSITORY.create_snapshot();
PL/SQL 过程已成功完成.
SQL> select * from (select * from DBA_HIST_SNAPSHOT s order by s.snap_id desc) where INSTANCE_NUMBER=2 and rownum<=2;
   SNAP_ID       DBID INSTANCE_NUMBER STARTUP_TIME                                                                BEGIN_INTERVAL_TIME                                                         END_INTERVAL_TIME                                                           FLUSH_ELAPSED                                                               SNAP_LEVEL ERROR_COUNT
---------- ---------- --------------- --------------------------------------------------------------------------- --------------------------------------------------------------------------- --------------------------------------------------------------------------- --------------------------------------------------------------------------- ---------- -----------
     19102 1963923747               2 25-12月-10 10.54.28.000 上午                                                15-2月 -11 03.48.01.665 下午                                                15-2月 -11 04.29.31.055 下午                                                +00000 00:00:03.9                                                                    1           0
     19101 1963923747               2 25-12月-10 10.54.28.000 上午                                                12-2月 -11 01.00.59.246 下午                                                15-2月 -11 03.48.01.665 下午                                                +00000 00:00:09.3                                                                    1           0

系统按手动执行结果,生成了2条快照记录.这里有点地方需要注意, 就是第一次手动采集,BEGIN_INTERVAL_TIME的时间为最后一次应该采集快照但没采集快照的时间.
 
根据以上现象,找到老白转贴的一个页面文件,内容转贴如下:

Applies to:

Oracle Server Enterprise Edition - Version: 10.1.0.4 to 10.2.0.4
This problem can occur on any platform.

Symptoms

In RAC or Single instance databases, the AWR snapshots are not collected automatically sometimes.

MMON process is not taking the AWR snapshots automatically.
By default MMON process should gather statistics every hour, but it is seen that there are no snapshot in DBA_HIST_SNAPSHOT.

There are some cases in RAC where for some instances the snapshots are generated automatically without any issues but for some instances it is not.

When tried generating manual snapshots, it takes long which usually used to finish up within a minute.

WORKAROUND
----------------------
Only recycling the box which is not feasible in prod.

Cause

Unpublished Bug 4699610 encountered for the problem which is fixed in 11.1

Another unpublished Bug 4226260 encountered for the problem which is fixed in 10.2

Solution

There could be many reason behind the issue.

Check whether the MMON trace ,i.e m00x has been generated in bdump.

If the trace is generated then check the contents.

Case 1:

========

If the MMON trace shows the following then the SYSAUX tablespace could be growing and that could stop the snapshot collection.

KEWRAFC: Flush slave failed, AWR Enqueue Timeout
The issue appears when deleting from WRI$_OPTSTAT_HISTHEAD_HISTORY.

This issue can effect other workload repository tables such as SYS.WRH$_SQL_PLAN.

Solution

=======

1. Take backup of table SYS.WRI$_OPTSTAT_HISTHEAD_HISTORY , Do CTAS of the table.
2. TRUNCATE TABLE SYS.WRI$_OPTSTAT_HISTHEAD_HISTORY
3. Monitor the trace files

Once the SYSAUX space is released MMON would resume collecting snapshots automatically.

Case 2

=======

If the MMON trace shows the following then it could be due to the Bug 4699610.

***KEWRAPC: Auto-Purge slave failed, return Code: 5
*** 2007-06-27 17:47:13.544
***KEWRAPC: Auto-Purge slave failed, return Code: 5
*** 2007-06-27 18:17:14.632
***KEWRAPC: Auto-Purge slave failed, return Code: 5

It is an unpublished Bug 4699610 which is fixed in 11g. The patch could be requested.

Case 3

=======

If the snapshots are not collected automatically and the MMON trace is not generated then try generating the snapshots manually.

If the manual generation of snapshots end up with the following error then do the following:-

ERROR at line 1:
ORA-13509: error encountered during updates to a SWRF table
ORA-02291: integrity constraint (ORA-02291: integrity constraint (SYS.WRM$_SNAPSHOT_FK) violated - parent key not found .) violated - parent key not found
ORA-06512: at "SYS.DBMS_WORKLOAD_REPOSITORY", line 8
ORA-06512: at "SYS.DBMS_WORKLOAD_REPOSITORY", line 31
ORA-06512: at line 1


To take the snapshots we need to populate WRM$_SNAPSHOT table successfully and for that instance metadata should be present in WRM$_DATABASE_INSTANCE.


If the workload related tables are not populated with the data then there could be some dependency issues which could stop the snapshot collection automatically.

Solution

========

To recreate the AWR objects by running catnoawr.sql and catawr.sql and then bounce the database. On restart of the database instance the WRM$_DATABASE_INSTANCE will be populated with the required data. Then MMON process would start collecting the snapshots.

Case 4

=======

Unpublished Bug 4226260.

In memory AWR hash information may not be cleared properly leading to incorrect MMON behaviour.
The most likely symptoms of this issue are that MMON may spin and not capture further AWR snapshots and/or ORA-600 [ktte_verify_tsth] may be reported.

The bug is fixed in 10.2 and 10.1.0.5 patchset also.

Case 5

=======

If the above cases are not relevant and there is no MMON trace generated then restart of box would help.

Otherwise you could recreate the AWR objects by running catnoawr.sql and catawr.sql and then bounce the database. This would anyways resolve the problem.

首先,在bdump目录下,没有看到MMON的日志文件.其次,这个库是正式环境,也不敢按上面所分析的去随意的重启数据库,重新创建AWR对象或者打补丁,唉.再想想办法了.
 
尝试修改属性
exec dbms_workload_repository.modify_snapshot_settings(retention => 3*24*60,interval => 15);
节点1可以正常产生,节点2还是不行,进一步证明这是Oracle的一个bug了,等升级打补丁了.
 
按老白的指点,做了一次hanganalyze,看是否是MMON进程hang住了。
SQL> oradebug setmypid;
已处理的语句
SQL> oradebug hanganalyze 3;
Hang Analysis in /oracle/admin/kms/udump/kms2_ora_6515.trc
查看生成的跟踪文件,没有看到锁信息,排除了MMON进程被hang住的可能性。
 
再DUMP一下MMON的PROCESSSTATE,看看该进程的状态信息:
 
oracle@zjgldb2$ps -ef|grep mmon
  oracle  8063  3292   0 15:03:43 pts/1       0:00 grep mmon
  oracle 26600     1   0   Feb 19 ?           0:26 ora_mmon_orcl2
oracle@zjgldb2$sqlplus / as sysdba
SQL*Plus: Release 10.2.0.1.0 - Production on 星期一 2月 21 15:04:05 2011
Copyright (c) 1982, 2005, Oracle.  All rights reserved.

连接到:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
SQL> oradebug setospid 26600
Oracle pid: 21, Unix process pid: 26600, image: oracle@dbsvr2 (MMON)
SQL> oradebug dump processstate 10;
已处理的语句
然后到bdump目录下,找到刚生成的“实例名_mmon_系统进程ID.tra”格式的跟踪文件。

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/9399028/viewspace-687286/,如需转载,请注明出处,否则将追究法律责任。

上一篇: Oracle AWR速查
请登录后发表评论 登录
全部评论

注册时间:2010-11-16

  • 博文量
    420
  • 访问量
    1744216