ITPub博客

首页 > 数据库 > Oracle > 利用latch的hold来模拟database hang

利用latch的hold来模拟database hang

原创 Oracle 作者:dbs101 时间:2011-05-30 18:01:54 0 删除 编辑

问题描述:

最近碰见一个database hang的情况,但是重启之后就再也没有重现。也没有机会做system
state dump。为了重现database hang,查了查metalink,很多都是偶然发生的,有的是老
版本的数据库的bug。尝试了几种方法也不能使database hang,后来发现一个网址
http://dbsnake.com/cgi-bin/mt/mt-search.cgi?IncludeBlogs=1&tag=db%20hang%20simulation&limit=20
上面介绍了用hold latch的方法。这的确是个好方法。但是在做了system state dump后,
找不到blocker的session。这是因为是用了”oradebug poke 0x2F0BCB4C 4 1“强制标志latch。
其实有一个更好的方法"oradebug call kslgetl 0x20008070 1",这个
方法可以hold住一个latch,并且标明自己是这个latch的holder。

环境:

Linux Server 5, Oracle 11.2.0.2

模拟步骤:

下面是模拟的步骤:

1. 启动一个session,以sys用户登录, 找到一个latch的地址。这个latch "ksqeql_"是用来
保护KSE(enqueue)相关的操作的。


点击(此处)折叠或打开

  1. SQL> set linesize 120
  2. SQL> col ksmfsnam format a20
  3. SQL> col ksmfstyp format a20
  4. SQL> col kslldnam format a30
  5. SQL>
  6. SQL> select k.ksmfsadr, ksmfsnam, ksmfstyp, ksmfssiz, kslldnam, kslldlvl
  7.   2  from x$ksmfsv k, x$kslld a
  8.   3  where k.ksmfstyp like '%ksllt%' and k.ksmfsadr = a.kslldadr
  9.   4    and ksmfsnam = 'ksqeql_'
  10.   5  order by ksmfsnam;
  11. KSMFSADR KSMFSNAM             KSMFSTYP               KSMFSSIZ KSLLDNAM                         KSLLDLVL
  12. -------- -------------------- -------------------- ---------- ------------------------------ ----------
  13. 20008070 ksqeql_              ksllt                       100 enqueues                                5



2. hold住latch

点击(此处)折叠或打开

  1. oradebug setmypid
  2. oradebug call kslgetl 0x20008070 1


返回1的时候表示已经hold住了。


3. 新起一个session,并尝试登录数据库,发现sqlplus hold住了。

4. 开始做system state dump


点击(此处)折叠或打开

  1. [oracle@dbs101 trace]$ sqlplus -prelim '/ as sysdba'
  2. SQL*Plus: Release 11.2.0.2.0 Production on Mon May 30 22:34:46 2011
  3. Copyright (c) 1982, 2010, Oracle.  All rights reserved.
  4. SQL> oradebug setmypid
  5. Statement processed.
  6. SQL> oradebug unlimit;
  7. Statement processed.
  8. SQL> oradebug dump systemstate 10
  9. Statement processed.


5. 用ass109.awk来分析trace文件

点击(此处)折叠或打开

  1. awk -f ass109.awk dbs101_ora_18429.trc
  2. [oracle@dbs4 trace]$ awk -f ass109.awk dbs101_ora_18429.trc
  3. Starting Systemstate 1
  4. .................................................................
  5. Ass.Awk Version 1.0.9 - Processing dbs101_ora_18429.trc
  6. System State 1
  7. ~~~~~~~~~~~~~~~~
  8. 1:                                     
  9. 2:  0: waiting for 'pmon timer'        
  10. 3:  0: waiting for 'rdbms ipc message' 
  11. 4:  0: waiting for 'VKTM Logical Idle Wait'
  12. 5:  0: waiting for 'rdbms ipc message' 
  13. 6:  0: waiting for 'DIAG idle wait'    
  14. 7:  0: waiting for 'rdbms ipc message' 
  15. 8:  0: waiting for 'PING'              
  16. 9:  0: waiting for 'rdbms ipc message' 
  17. 10: 0: waiting for 'DIAG idle wait'    
  18. 11: 0: waiting for 'latch free'        [Latch 20008070]
  19. 12: 0: waiting for 'ges remote message'
  20. 13: 0: waiting for 'gcs remote message'
  21. 14: 0: waiting for 'rdbms ipc message' 
  22. 15: 0: waiting for 'KSV master wait'   [Latch 20008070]
  23. 16: 0: waiting for 'rdbms ipc message' 
  24. 17: 0: waiting for 'rdbms ipc message' 
  25. 18: 0: waiting for 'rdbms ipc message' 
  26. 19: 0: waiting for 'latch free'        [Latch 20008070]
  27. 20: 0: waiting for 'latch free'        [Latch 20008070]
  28. 21: 0: waiting for 'rdbms ipc message' 
  29. 22: 0: waiting for 'rdbms ipc message' 
  30. 23: 0: waiting for 'ASM background timer'
  31. 24: 0: waiting for 'latch free'        [Latch 20008070]
  32. 25: 0: waiting for 'latch free'        [Latch 20008070]
  33. 26:                                    
  34. 27:                                    
  35. 28: 0: waiting for 'wait for unread message on broadcast channel'
  36. 29: 0: waiting for 'Space Manager: slave idle wait'
  37. 30: 0: waiting for 'rdbms ipc message' 
  38. 31: 0: waiting for 'rdbms ipc message' 
  39. 32: 0: waiting for 'SQL*Net message from client'
  40.      Cmd: PL/SQL Execute
  41. 33: 0: waiting for 'class slave wait'  
  42. 34: 0: waiting for 'class slave wait'  
  43. 35: 0: waiting for 'class slave wait'  
  44. 36: 0: waiting for 'rdbms ipc message' 
  45. 37: 0: waiting for 'wait for unread message on broadcast channel'
  46. 38: 0: waiting for 'Streams AQ: qmn coordinator idle wait'
  47. 39: 0: waiting for 'rdbms ipc message' 
  48. 40: 0: waiting for 'SQL*Net message from client'
  49. 42: 0: waiting for 'Streams AQ: qmn slave idle wait'
  50. 43: 0: waited for 'Streams AQ: waiting for time management or cleanup tasks'
  51. 44: 0: waiting for 'latch free'        [Latch 20008070]
  52. 45: 0: waiting for 'SQL*Net message from client'
  53. 46: 0: waiting for 'SQL*Net message from client'
  54. 47: 0: waiting for 'latch free'        [Latch 20008070]
  55.      Cmd: Select
  56. 57: 0: waiting for 'class slave wait'  
  57.      Cmd: Select
  58. 60: 0: waiting for 'class slave wait'  
  59.      Cmd: Select
  60. 63: 0: waiting for 'class slave wait'  
  61.      Cmd: Select
  62. 68: 0: waiting for 'class slave wait'  
  63.      Cmd: Select
  64. 69: 0: waiting for 'class slave wait'  
  65.      Cmd: Select
  66. 71: 0: waiting for 'class slave wait'  
  67.      Cmd: Select
  68. 72: 0: waiting for 'class slave wait'  
  69.      Cmd: Select
  70. 73: 0: waiting for 'class slave wait'  
  71.      Cmd: Select
  72. 80: 0: waiting for 'class slave wait'  
  73.      Cmd: Select
  74. 82: 0: waiting for 'class slave wait'  
  75.      Cmd: Select
  76. 83: 0: waiting for 'class slave wait'  
  77.      Cmd: Select
  78. 84: 0: waiting for 'class slave wait'  
  79.      Cmd: Select
  80. 85: 0: waiting for 'class slave wait'  
  81.      Cmd: Select
  82. 91: 0: waiting for 'class slave wait'  
  83.      Cmd: Select
  84. 97: 0: waiting for 'class slave wait'  
  85.      Cmd: Select
  86. 100:0: waiting for 'class slave wait'  
  87.      Cmd: Select
  88. 102:0: waiting for 'class slave wait'  
  89.      Cmd: Select
  90. 104:0: waiting for 'class slave wait'  
  91.      Cmd: Select
  92. 149:0: waiting for 'class slave wait'  
  93.      Cmd: Select
  94. Blockers
  95. ~~~~~~~~
  96.         Above is a list of all the processes. If they are waiting for a resource
  97.         then it will be given in square brackets. Below is a summary of the
  98.         waited upon resources, together with the holder of that resource.
  99.         Notes:
  100.         ~~~~~
  101.          o A process id of '???' implies that the holder was not found in the
  102.            systemstate.
  103.                     Resource Holder State
  104.               Latch 20008070    ??? Blocker
  105. Object Names
  106. ~~~~~~~~~~~~
  107. Latch 20008070   enqueues                    

  108. 59194 Lines Processed.



可以看到进程11在等待latch 20008070。

点击(此处)折叠或打开

  1. 11: 0: waiting for 'latch free'        [Latch 20008070]
  2. PROCESS 11: LMON
  3.   ----------------------------------------
  4.   SO: 0x32fd3ac8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
  5.    proc=0x32fd3ac8, name=process, file=ksu.h LINE:12451, pg=0
  6.   (process) Oracle pid:11, ser:1, calls cur/top: 0x32759778/0x32759778
  7.             flags : (0x6) SYSTEM
  8.             flags2: (0x100),  flags3: (0x0)
  9.             intr error: 0, call error: 0, sess error: 0, txn error 0
  10.             intr queue: empty
  11.     ksudlp FALSE at location: 0
  12.   (post info) last post received: 0 0 35
  13.               last post received-location: ksr2.h LINE:603 ID:ksrpublish
  14.               last process to post me: 32fe22c4 3 0
  15.               last post sent: 0 0 128
  16.               last post sent-location: kjxgn1.h LINE:171 ID:kjxgnaef
  17.               last process posted by me: 32fdd570 1 2
  18.     (latch info) wait_event=0 bits=0
  19.         Location from where call was made: ksq.h LINE:1807 ID:ksqgel: create enqueue: parent obj
  20.         Context saved from call: 846567288
  21.       waiting for 20008070  enqueues level=5
  22.         Location from where latch is held: kywm2.h LINE:185 ID:kywmcrpln: creating new WLM plan:
  23.         Context saved from call: 3215570592
  24.         state=busy [holder rapid=45] wlstate=free [value=0]
  25.           waiters [orapid (seconds since: put on list, posted, alive check)]:
  26.            11 (178, 1306757849, 4)
  27.            19 (177, 1306757849, 177)
  28.            44 (174, 1306757849, 174)
  29.            25 (162, 1306757849, 162)
  30.            47 (150, 1306757849, 150)
  31.            15 (141, 1306757849, 141)
  32.            24 (121, 1306757849, 121)
  33.            20 (117, 1306757849, 117)
  34.            waiter count=8
  35.           gotten 54415596 times wait, failed first 1012 sleeps 65
  36.           gotten 0 times nowait, failed: 0
  37.         possible holder pid = 45 spid=17550
  38.       on wait list for 20008070



可以看到进程45在hold住了latch 20008070。
        state=busy [holder rapid=45] wlstate=free [value=0]

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/25105315/viewspace-696705/,如需转载,请注明出处,否则将追究法律责任。

请登录后发表评论 登录
全部评论

注册时间:2010-12-18

  • 博文量
    92
  • 访问量
    438018