ITPub博客

首页 > 数据库 > Oracle > 数据库故障分析与处理

数据库故障分析与处理

原创 Oracle 作者:gydba 时间:2015-09-29 09:30:29 0 删除 编辑

一个新生较忽略的错误,引以为戒!

问题症状:系统数据库(节点1主机发生重启,数据库实例暂时未启动),节点2操作系统负载不高,但是业务反应非常缓慢,观察数据库存在较多等待事件:

  COUNT(*) EVENT

---------- --------------------------------------------------

         1 Streams AQ: qmn coordinator idle wait

         1 Streams AQ: waiting for messages in the queue

         1 Streams AQ: qmn slave idle wait

         1 SQL*Net message to client

         1 DIAG idle wait

         1 Streams AQ: waiting for time management or cleanup

            tasks

         2 control file sequential read

         3 SQL*Net message from dblink

        10 cursor: pin S wait on X

        35 log file switch (archiving needed)

        46 buffer busy waits

       135 row cache lock

       376 enq: SQ - contention

 

13 rows selected.

 

 

问题处理过程:

观察以上查询到的等待事件进行,发生等待最多的enq: SQ - contentionrow cache lock看起来和序列有关,结合ASH视图查询,发生等待最多的SQL语句:

SQL> select sql_id,count(*)  from v$active_session_history where event  = 'enq: SQ - contention' and sample_time >= to_date('20150928 15:00:00', 'yyyymmdd hh24:mi:ss') group by sql_id;

 

SQL_ID                                    COUNT(*)

--------------------------------------- ----------

74whb22kwku28                               172933

fzc102gjarjdx                                40257

 

SQL> select sql_text from v$sqlarea where sql_id='74whb22kwku28';

 

SQL_TEXT

--------------------------------------------------------------------------------

select GETOPERID from dual  --->GETOPERID函数也是访问SEQ_MD_DEV_STORAGE.nextval

 

SQL> select sql_text from v$sqlarea where sql_id='fzc102gjarjdx';

 

SQL_TEXT

--------------------------------------------------------------------------------

select to_char(SEQ_MD_DEV_STORAGE.nextval) as ID from dual

同时查看了相关序列的信息:

SQL> SET LINE 200

SQL> col SEQUENCE_OWNER for a15

SQL> col SEQUENCE_NAME for a45

SQL> select * from dba_sequences a where sequence_name='SEQ_MD_DEV_STORAGE' and sequence_owner='SXYKCC';

 

SEQUENCE_OWNER  SEQUENCE_NAME                                  MIN_VALUE  MAX_VALUE INCREMENT_BY CYCL ORDE CACHE_SIZE LAST_NUMBER

--------------- --------------------------------------------- ---------- ---------- ------------ ---- ---- ---------- -----------

SXYKCC          SEQ_MD_DEV_STORAGE                                     1 1.0000E+16            1 N    N            20      781075

 

序列的Cache看起来比较小,尝试加Cache会被hung住,尝试kill部分会话,kill之后并没有新的会话产生。此时注意到,等待事件情况变化为:

  COUNT(*) EVENT

---------- ---------------------------------------------

        32 log file switch (archiving needed)

        19 buffer busy waits

        14 row cache lock

        12 cursor: pin S wait on X

         2 SQL*Net message from dblink

         2 control file sequential read

         1 SQL*Net message to client

 

7 rows selected.

 

等待事件log file switch (archiving needed)通常为日志空间满,alert告警日志信息,并无明显的不能归档信息:

on Sep 28 15:20:03 CST 2015

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=169

System State dumped to trace file /oracle/admin/pmcpdp/bdump/pmcpdp2_j003_25428018.trc

Mon Sep 28 15:21:33 CST 2015

Error 3135 trapped in 2PC on transaction 18.3.2426591. Cleaning up.

Error stack returned to user:

ORA-03135: connection lost contact

Mon Sep 28 15:22:23 CST 2015

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=233

Mon Sep 28 15:24:28 CST 2015

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=288

System State dumped to trace file /oracle/admin/pmcpdp/udump/pmcpdp2_ora_19792100.trc

Mon Sep 28 15:27:01 CST 2015

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=657

Mon Sep 28 15:29:28 CST 2015

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=735

System State dumped to trace file /oracle/admin/pmcpdp/udump/pmcpdp2_ora_34930770.trc

Mon Sep 28 15:31:40 CST 2015

>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=742

 

 

进一步检查归档路径,空间使用情况:

SQL> archive log list

Database log mode              Archive Mode

Automatic archival             Enabled

Archive destination            /backup

Oldest online log sequence     2987

Next log sequence to archive   2987

Current log sequence           2994

 

$ df -k

Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on

/dev/hd4          5242880   2967360   44%    11564     2% /

/dev/hd2         10485760   6689804   37%    61859     4% /usr

/dev/hd9var       2097152   1238092   41%     9023     4% /var

/dev/hd3          5242880   4064048   23%     2957     1% /tmp

/dev/fwdump       1572864   1503456    5%       16     1% /var/adm/ras/platform

/dev/hd1        110100480 109831152    1%      385     1% /home

/dev/hd11admin      524288    523864    1%        5     1% /admin

/proc                   -         -    -         -     -  /proc

/dev/hd10opt      5242880   4872412    8%    10861     1% /opt

/dev/livedump      524288    513072    3%       49     1% /var/adm/ras/livedump

/dev/lv00          524288    507776    4%       18     1% /var/adm/csd

/dev/oralv       41943040  23617468   44%    85100     2% /oracle

/dev/fslv00     104857600 104584732    1%       11     1% /arch02

NFS server cmspdb1 not responding still trying

--此处hung

 

检查/backup挂载点相关信息:

/backup:

        dev             = /backup

        vfs             = nfs

        nodename        = cmspdb1

        mount           = true

        options         = cio,rw,bg,hard,nointr,rsize=32768,wsize=32768,proto=tcp,noac,vers=3,timeo=600

可以看到节点2的归档路径,实际是从节点1通过NFS挂载过来的归档路径

 

问题结论:

因为节点1上的NFS服务,为手动方式启动,重启主机以后,NFS服务未正常启动,所以节点2的归档路径变得不可用,导致归档无法正常进行,阻塞了大量会话。手动启动节点1上的NFS服务后,归档恢复正常,数据库等待事件消失。

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/30430420/viewspace-1811315/,如需转载,请注明出处,否则将追究法律责任。

请登录后发表评论 登录
全部评论

注册时间:2015-08-24

  • 博文量
    77
  • 访问量
    152014