About systemstate dump

Original post. Category: Linux operating systems. Author: qgrape. Date: 2011-05-12 14:05:35


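First, a brief introduction to the systemstate event.
Many people take a systemstate dump to be a snapshot of every process in the system at the instant the dump is issued. That is a misconception:
the trace file produced by a systemstate dump describes all processes in the system over the period from the moment the dump starts until the dump task completes, so it is not a single consistent point-in-time snapshot.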

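The trace file produced by a systemstate dump contains state information for every process in the system. Each process has its own section in the trace file describing its state: process information, session information, enqueue information (mainly locks), buffer information, and the state of the objects the process holds in the SGA.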
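
Because each process has its own section, a trace file can be indexed by process before it is read in detail. The following is a minimal Python sketch, not an Oracle tool. It assumes that every section starts with a line of the form `PROCESS n:`, that the first `ospid:` in a section is the operating system process ID of that process, and that waits appear as `waiting for '<event>'`, as in the 9i excerpt later in this post. The function name `summarize_processes` and the abbreviated `SAMPLE` text are made up for illustration; other Oracle versions may format these lines differently.

```python
import re

# Assumed line formats, taken from the 9i trace excerpt in this post.
SECTION_RE = re.compile(r"^PROCESS (\d+):", re.MULTILINE)
OSPID_RE = re.compile(r"ospid: (\d+)")
WAIT_RE = re.compile(r"waiting for '([^']+)'")


def summarize_processes(trace_text):
    """Return one dict per PROCESS section: Oracle pid, OS pid, current wait."""
    starts = list(SECTION_RE.finditer(trace_text))
    rows = []
    for i, start in enumerate(starts):
        # A section runs until the next "PROCESS n:" line or the end of the text.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(trace_text)
        section = trace_text[start.start():end]
        ospid = OSPID_RE.search(section)
        wait = WAIT_RE.search(section)
        rows.append({
            "oracle_pid": int(start.group(1)),
            "ospid": int(ospid.group(1)) if ospid else None,
            "waiting_for": wait.group(1) if wait else None,
        })
    return rows


# Abbreviated sample; these lines are copied from the trace excerpt in this post.
SAMPLE = """\
PROCESS 28:
  (process) Oracle pid=28, calls cur/top: c00000010b277890/c00000010b277890, flag: (0) -
    O/S info: user: ora9i, term: pts/th, ospid: 22580
    waiting for 'library cache lock' blocking sess=0x0 seq=18589 wait_time=0
"""

if __name__ == "__main__":
    for row in summarize_processes(SAMPLE):
        print(row)
```

On a real trace file, pass the file contents instead of `SAMPLE`. The resulting list gives a quick view of which Oracle pids are waiting and on what.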

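So when is it appropriate to take a systemstate dump?
 Oracle recommends using the systemstate event in the following situations:

·          The database hangs

·          The database is very slow

·          A process is hanging

·          The database raises certain errors

·          Resource contention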

The syntax for dumping systemstate is:
    ALTER SESSION SET EVENTS 'immediate trace name systemstate level 10';

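ORADEBUG can do the same thing. It must be attached to a process first, and the level is given as a bare number without the LEVEL keyword:
    ORADEBUG SETMYPID
    ORADEBUG DUMP SYSTEMSTATE 10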

To have a systemstate dump taken when a particular error occurs, set the event parameter in the parameter file (spfile or pfile).
For example, to dump systemstate when a deadlock occurs (error ORA-00060):
    event = "60 trace name systemstate level 10"
 

Back to the main topic. We dump the system state:
SQL> ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 8';

Session altered.

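We take library cache lock as an example to walk through the dump file.

First, by searching the trace file for the string "waiting for 'library cache lock'", we find the information about the blocked process: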

PROCESS 28: ---------------- the blocked Oracle process; PROCESS 28 here corresponds to the PID value in V$PROCESS,
            which means we can use it to find the blocked session in V$PROCESS and V$SESSION
  ----------------------------------------
  SO: c000000109c83bf0, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
  (process) Oracle pid=28, calls cur/top: c00000010b277890/c00000010b277890, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 17 24 6
              last post received-location: ksusig
              last process to post me: c000000109c840f8 25 0
              last post sent: 0 0 15
              last post sent-location: ksasnd
              last process posted by me: c000000109c7ff90 1 6
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: c000000109eefda0
    O/S info: user: ora9i, term: pts/th, ospid: 22580  ---------------- the operating system process ID of this process, corresponding to SPID in V$PROCESS
    OSD pid info: Unix process pid: 22580, image: oracle@cs_dc02 (TNS V1-V3)
    ----------------------------------------
    SO: c000000109f02c68, type: 4, owner: c000000109c83bf0, flag: INIT/-/-/0x00
    (session) trans: 0000000000000000, creator: c000000109c83bf0, flag: (100041) USR/- BSY/-/-/-/-/-
              DID: 0002-001C-00000192, short-term DID: 0000-0000-00000000
              txn branch: 0000000000000000
              oct: 0, prv: 0, sql: c00000011f8ea068, psql: c00000011f8ea068, user: 50/PUBUSER
    O/S info: user: ora9i, term: , ospid: 22536, machine: cs_dc02
              program: sqlplus@cs_dc02 (TNS V1-V3)
    application name: SQL*Plus, hash value=3669949024
    waiting for 'library cache lock' blocking sess=0x0 seq=18589 wait_time=0
                handle address=c000000122e2a6d8, lock address=c00000011a449e20, 100*mode+namespace=515

... ...

    SO: c00000010b277890, type: 3, owner: c000000109c83bf0, flag: INIT/-/-/0x00
    (call) sess: cur c000000109f02c68, rec 0, usr c000000109f02c68; depth: 0
      ----------------------------------------
      SO: c00000011a449e20, type: 51, owner: c00000010b277890, flag: INIT/-/-/0x00
      LIBRARY OBJECT LOCK: lock=c00000011a449e20 handle=c000000122e2a6d8 request=S
      call pin=0000000000000000 session pin=0000000000000000
      htl=c00000011a449e90[c00000011a4bc350,c00000011a4bc350] htb=c00000011a4bc350
      user=c000000109f02c68 session=c000000109f02c68 count=0 flags=[00] savepoint=463
      the rest of the object was already dumped

... ...

Note the following information:
    waiting for 'library cache lock' blocking sess=0x0 seq=18589 wait_time=0
                handle address=c000000122e2a6d8, lock address=c00000011a449e20, 100*mode+namespace=515

This tells us that the process with Oracle PID 28 (PROCESS 28) is waiting for 'library cache lock'. Using 'handle address=c000000122e2a6d8' we can find the Oracle PID of the session that is blocking it.

Also note this information:
      LIBRARY OBJECT LOCK: lock=c00000011a449e20 handle=c000000122e2a6d8 request=S
      call pin=0000000000000000 session pin=0000000000000000
      htl=c00000011a449e90[c00000011a4bc350,c00000011a4bc350] htb=c00000011a4bc350
      user=c000000109f02c68 session=c000000109f02c68 count=0 flags=[00] savepoint=463

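This is the library object lock entry that leads to the session blocking PROCESS 28. The entry shown here, with request=S and session=c000000109f02c68, is the request made by the waiting session itself (the address matches the session state object of PROCESS 28). The blocking session is the one whose LIBRARY OBJECT LOCK entry, under a different PROCESS, carries the same handle value and holds the lock.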

The key point to remember is:

the 'handle address' value of the waiting session equals the 'handle' value of the blocking session.
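
The matching rule above can be applied mechanically to a trace file. The following Python sketch is one way to do it, under stated assumptions: sections start with `PROCESS n:`, the `handle address=` value sits on the line after the wait line (as in the excerpt above), and a held lock is printed as `mode=...` in the same position where the excerpt shows `request=S`. The `mode=` form is an assumption, because this post only shows the requesting entry. The `PROCESS 26` line in `SAMPLE`, including its lock address, is invented to stand in for a blocker; the `PROCESS 28` lines are copied from the excerpt.

```python
import re

SECTION_RE = re.compile(r"^PROCESS (\d+):", re.MULTILINE)
# The wait line is followed by a line carrying "handle address=...".
WAIT_RE = re.compile(
    r"waiting for 'library cache lock'[^\n]*\n\s*handle address=([0-9a-fA-F]+)"
)
# "request=" is the form shown in this post; "mode=" for a held lock is assumed.
LOCK_RE = re.compile(
    r"LIBRARY OBJECT LOCK: lock=[0-9a-fA-F]+ handle=([0-9a-fA-F]+) (mode|request)=(\w+)"
)


def sections(trace_text):
    """Yield (oracle_pid, section_text) for every PROCESS section."""
    starts = list(SECTION_RE.finditer(trace_text))
    for i, start in enumerate(starts):
        end = starts[i + 1].start() if i + 1 < len(starts) else len(trace_text)
        yield int(start.group(1)), trace_text[start.start():end]


def match_waiters(trace_text):
    """Map each waiting Oracle pid to the (pid, kind, mode) lock entries of
    other processes that reference the same library cache handle."""
    waiters, locks = {}, []
    for pid, text in sections(trace_text):
        wait = WAIT_RE.search(text)
        if wait:
            waiters[pid] = wait.group(1).lower()
        for lock in LOCK_RE.finditer(text):
            locks.append((pid, lock.group(1).lower(), lock.group(2), lock.group(3)))
    return {
        waiter: [(pid, kind, mode) for pid, handle, kind, mode in locks
                 if handle == wanted and pid != waiter]
        for waiter, wanted in waiters.items()
    }


# PROCESS 26 is a hypothetical blocker entry; PROCESS 28 is from this post.
SAMPLE = """\
PROCESS 26:
      LIBRARY OBJECT LOCK: lock=c000000000000001 handle=c000000122e2a6d8 mode=X
PROCESS 28:
    waiting for 'library cache lock' blocking sess=0x0 seq=18589 wait_time=0
                handle address=c000000122e2a6d8, lock address=c00000011a449e20, 100*mode+namespace=515
      LIBRARY OBJECT LOCK: lock=c00000011a449e20 handle=c000000122e2a6d8 request=S
"""

if __name__ == "__main__":
    print(match_waiters(SAMPLE))
```

The waiter's own `request=S` entry is excluded on purpose, so that only entries from other processes are reported as candidates for the blocker.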


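Looking back at these values: the handle address and lock address correspond to the P1 and P2 values of this wait in V$SESSION_WAIT. Converting the handle address to decimal: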
SQL> select to_number('C000000122E2A6D8','XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') from dual;

TO_NUMBER('C000000122E2A6D8','XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
----------------------------------------------------------------
                                                      1.3835E+19

SQL>      
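
The same conversion can be cross-checked outside the database, which is handy when going from the decimal P1 value in V$SESSION_WAIT back to the hexadecimal address used in the trace file. This is a small Python sketch; the helper names are made up for illustration.

```python
def handle_to_decimal(hex_address):
    """Convert a hexadecimal handle address from the trace file to an integer."""
    return int(hex_address, 16)


def decimal_to_handle(value):
    """Convert a decimal value (such as P1) back to an uppercase hex address."""
    return format(value, "X")


if __name__ == "__main__":
    value = handle_to_decimal("C000000122E2A6D8")
    print(value)                     # full precision, not rounded
    print(f"{value:.4E}")            # same rounding as the SQL*Plus output above
    print(decimal_to_handle(value))  # back to the hex form used in the trace
```

SQL*Plus shows only 1.3835E+19 with the default number format, so the full integer is what should be used when converting back to hex.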

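The cause of the problem is now basically clear. Two ways of resolving it are recommended here:
Method 1: using the address c000000122e2a6d8, we can get the corresponding lock information currently in the library cache: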
SQL> l
  1  select INST_ID,USER_NAME,KGLNAOBJ,KGLLKSNM,KGLLKUSE,KGLLKSES,KGLLKMOD,KGLLKREQ,KGLLKPNS,KGLLKHDL
  2* from X$KGLLK where KGLLKHDL = 'C000000122E2A6D8' order by KGLLKSNM,KGLNAOBJ
SQL> /

   INST_ID USER_NAME     KGLNAOBJ                 KGLLKSNM KGLLKUSE         KGLLKSES           KGLLKMOD   KGLLKREQ KGLLKPNS         KGLLKHDL
---------- ------------- ---------------------- ---------- ---------------- ---------------- ---------- ---------- ---------------- ----------------
         2 PUBUSER       CSNOZ629926699966              30 C000000109F02C68 C000000109F02C68          0          2 00               C000000122E2A6D8
         2 PUBUSER       CSNOZ629926699966              37 C000000108C99E28 C000000108C99E28          3          0 00               C000000122E2A6D8

SQL> 

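In this output SID 30 is requesting the lock (KGLLKREQ = 2) and SID 37 is holding it (KGLLKMOD = 3). Following Oracle's recommended practice, we should now kill SID 37 with the 'alter system kill session' command.
The result was an ORA-00031 error: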
SQL> alter system kill session '37,2707';

alter system kill session '37,2707'
*
ERROR at line 1:
ORA-00031: session marked for kill

SQL>

Check the status of SID 37:
SQL> set linesize 150
SQL> col program for a50
SQL> select sid,serial#,status,username,program from v$session where sid=37;

       SID    SERIAL# STATUS   USERNAME                       PROGRAM
---------- ---------- -------- ------------------------------ --------------------------------------------------
        37       2707 KILLED   PUBUSER                        sqlplus@cs_dc02 (TNS V1-V3)

SQL> 
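This confirms our initial suspicion once more: someone ran a DDL statement that takes a long time to complete (most likely an inefficient statement, although a bug cannot be ruled out),
and then terminated the session abnormally before the statement finished.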

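In this example we had already found the operating system process (SPID) for the session in the trace file above. In other cases, how do we find the operating system process ID (SPID)
of a session whose status is 'KILLED'?
One approach is shown below for reference: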
SQL> l
  1  SELECT s.username,s.status,
  2  x.ADDR,x.KSLLAPSC,x.KSLLAPSN,x.KSLLASPO,x.KSLLID1R,x.KSLLRTYP,
  3  decode(bitand (x.ksuprflg,2),0,null,1)
  4  FROM x$ksupr x,v$session s
  5  WHERE s.paddr(+)=x.addr
  6  and bitand(ksspaflg,1)!=0
  7* and s.sid=37
SQL> /

USERNAME                       STATUS   ADDR               KSLLAPSC   KSLLAPSN KSLLASPO       KSLLID1R KS D
------------------------------ -------- ---------------- ---------- ---------- ------------ ---------- -- -
PUBUSER                        KILLED   C000000109C831E0         41         15 16243                17

SQL>


The x$ksupr.ADDR column corresponds to ADDR in V$PROCESS. Once we know this process address, finding the operating system process (SPID) is easy, for example:
SQL> select spid,pid from v$process where addr='C000000109C831E0';

SPID                PID
------------ ----------
20552                26

 

From the ITPUB blog, link: http://blog.itpub.net/12330444/viewspace-695108/. Please credit the source when reposting; otherwise legal liability will be pursued.
