1.问题描述
客户说最近几天产生大量trace文件导致文件系统撑满
2.数据库版本
该客户数据库版本8.1.7.4 for aix 4.3
SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle8i Enterprise Edition Release 8.1.7.4.0 - 64bit Production
PL/SQL Release 8.1.7.4.0 - Production
CORE 8.1.7.0.0 Production
TNS for IBM/AIX RISC System/6000: Version 8.1.7.4.0 - Production
NLSRTL Version 3.4.1.0.0 - Production
3.初步诊断
登上去看到
SNP1 started with pid=207
Fri Jan 2 23:31:30 2009
Errors in file /oracle/app/oracle/admin/database_sid/bdump/snp1_91196_database_sid.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Jan 2 23:32:36 2009
Errors in file /oracle/app/oracle/admin/database_sid/bdump/snp2_251096_database_sid.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Jan 2 23:34:27 2009
Errors in file /oracle/app/oracle/admin/database_sid/bdump/snp4_148778_database_sid.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Jan 2 23:34:41 2009
Restarting dead background process SNP0
SNP0 started with pid=13
Fri Jan 2 23:34:41 2009
Restarting dead background process SNP3
SNP3 started with pid=16
Fri Jan 2 23:34:41 2009
Restarting dead background process SNP4
SNP4 started with pid=35
Fri Jan 2 23:34:45 2009
Errors in file /oracle/app/oracle/admin/database_sid/bdump/snp0_74068_database_sid.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Jan 2 23:35:47 2009
Restarting dead background process SNP2
SNP2 started with pid=113
Fri Jan 2 23:35:49 2009
Errors in file /oracle/app/oracle/admin/database_sid/bdump/snp3_48338_database_sid.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Jan 2 23:36:53 2009
Restarting dead background process SNP1
SNP1 started with pid=104
Fri Jan 2 23:36:56 2009
Errors in file /oracle/app/oracle/admin/database_sid/bdump/snp1_151260_database_sid.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Jan 2 23:37:59 2009
Errors in file /oracle/app/oracle/admin/database_sid/bdump/snpb_195544_database_sid.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Jan 2 23:39:04 2009
Restarting dead background process SNPB
SNPB started with pid=104
Fri Jan 2 23:39:08 2009
Errors in file /oracle/app/oracle/admin/database_sid/bdump/snpb_195570_database_sid.trc:
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Jan 2 23:39:37 2009
看来就是job出了问题导致不断的产生trace文件,才是文件系统不断暴涨。
再看,发现这个问题已经在5天之前就出现了。
4.新的疑问
按道理,如果job失败的话,失败16次就会被broken,而这里的job重启了这么多次还在不断的重启,不断地产生trace文件.
SQL> select count(FAILURES) from dba_jobs_running;
COUNT(FAILURES)
---------------
121
不断检查发现job重启并不增加job失败的次数.
5.进一步检查
打开trc文件发现都是如下内容:
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Current SQL statement for this session:
select .... from a,b,c,d,e where ......
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedmp+00fc bl ksedst 100000001 ?
ssexhd+0284 bl ksedmp 300000067 ?
0000480C ? 00000000
ddfnet2Normal+022c bl ddfnetCFull 1101F4B58 ? 100000001 ?
1101FDB60 ? FFFFFFFFFFF74B0 ?
70000023D5B4500 ? 000000000 ?
000000001 ? 000000000 ?
kkmtab+0840 bl ddfnet2Normal 110012BE0 ? 000000000 ?
700000216298118 ?
kkmprfro+0078 bl kkmtab 100A39620 ? 000000001 ?
FFFFFFF00000005 ? 00000133A ?
FFFFFFFFFFF7710 ?
kkmdrv+03a4 bl kkmprfro FFFFFFFFFFF78C0 ? 101C29DA8 ?
FFFFFFFFFFF77E0 ? 110011718 ?
100A12E0C ? 100000001 ?
kkmevw+064c bl kkmdrv 000000010 ? 100000001 ?
10183CC6C ? 000000000 ?
kkmtab+0e34 bl kkmevw FFFFFFFFFFFA1A0 ?
700000224D0C8D8 ?
70000022265ADF0 ?
135000000000003 ? 000000000 ?
kkmprfro+0078 bl kkmtab 11000D0A0 ? 700000204A7BEF0 ?
70000020DE4F6A8 ?
7000001E8B884E0 ? 000000000 ?
kkmdrv+03a4 bl kkmprfro 000000000 ? 000000000 ?
000000000 ? 000000000 ?
000000000 ? FFFFFFFFFFFA1A0 ?
opiprs+0c64 bl kkmdrv 000000024 ? 100000001 ?
101C389D4 ?
44202482FFFF7F20 ?
kksald+03c8 bl opiprs FFFFFFFFFFFA1A0 ?
FFFFFFFFFFFD030 ? 000000347 ?
70000020A9C5718 ?
FFFFFFFFFFF8780 ?
2220242200000060 ?
101774AC4 ?
rpiswu2+02bc bl _ptrgl
kkslod+0fd8 bl rpiswu2 7000001E897B1B8 ? 000000000 ?
70000020A9C5D78 ? 200000002 ?
70000020A9C5E88 ? 000000000 ?
70000020A9C5DB4 ? 000000000 ?
kglobld+0318 bl _ptrgl
kglobpn+0458 bl kglobld 11000F378 ? 700000216298008 ?
FFFFFFFFFFFAFF8 ?
700000230E3CD60 ?
7000001F7EADF38 ?
kglpim+0184 bl kglobpn 000000000 ? 700000228C3CED8 ?
FFFFFFFFFFFA9A0 ?
70000023C8956E0 ?
kglpin+092c bl kglpim 11000F378 ? FFFFFFFFFFFAFF8 ?
700000230E3CD60 ?
kksfbc+1d18 bl kglpin 11000F378 ? FFFFFFFFFFFAFF8 ?
70000023C8956E0 ?
700000228C3CE78 ? 300000000 ?
1017903D8 ? 1101F4BF8 ?
kkspsc0+0728 bl kksfbc 1101F4B58 ? 300000003 ?
800000008 ? FFFFFFFFFFFD030 ?
000000347 ? 70000023AFBC3E8 ?
000000000 ? 000000000 ?
opiosq0+087c bl kkspsc0 1101F25A8 ? FFFFFFFFFFFD030 ?
000000347 ? 300000003 ?
600000006 ? 400000000000004 ?
000000000 ?
kpooprx+0130 bl opiosq0 3FFFFC5C0 ?
42202022101F4B58 ?
100121C34 ? 4FFFFFFFFFFD550 ?
kpoal8+02d0 bl kpooprx FFFFFFFFFFFD554 ?
FFFFFFFFFFFD030 ?
34600000346 ? 100000001 ?
000000000 ? 000007FFF ?
00000010E ? 000000006 ?
opiodr+06bc bl _ptrgl
ttcpip+0a74 bl _ptrgl
opitsk+06d8 bl ttcpip 57B0000057B ? 11010FA60 ?
FFFFFFFFFFFE248 ? 0000007FF ?
000000800 ? 000000000 ?
000000001 ? 000012D81 ?
opiino+0670 bl opitsk 110013C88 ?
opiodr+06bc bl _ptrgl
opidrv+056c bl opiodr 3C00000002 ? 4200FE590 ?
FFFFFFFFFFFFE00 ? 000000000 ?
sou2o+0028 bl opidrv 3C0000003C ? 400000004 ?
FFFFFFFFFFFFE00 ?
main+0128 bl sou2o 90000000022C5B0 ?
9FFFFFFF0009E00 ?
FFFFFFFFFFFFE10 ?
9001000A0214428 ?
__start+0090 bl main 000000000 ? 000000000 ?
----- Argument/Register Address Dump -----
6.问题定位
尝试单独执行trace中的sql语句报同样错误
前台
ORA-03113: 通信通道的文件结束
后台
ORA-07445: exception encountered: core dump [] [] [] [] [] []
由此定位问题由该语句引起,而和是否在job中没有关系
由于语句涉及多个表,所以不断对各个表进行各种操作将问题定位到表a
继续查看表a 发现它根本不是一个表 而是一个视图 类似如下
select ... from
schema f下的表g的select 权限授予了db link h所在的用户,其中远程数据库版本9.2.0.4
7.该死的bug
再次回到trace文件发现与一个bug基本相符
Bug 2610353 - Dump / memory corruption possible from remote select
Dump / memory corruption can occur when doing a select
of a remote table if the remote table contains "hidden"
columns (such as SYS_NC_OID$ , SYS_NC_ROWINFO$ etc).
During select of remote table the local session performs
a describe of the table then checks whether any indexes
reference any columns of that table.
Workaround:
Remove the index/s on the table (may not be feasible)
查看该补丁仅在solaris的9以上的版本才有对应的补丁
无奈只能建议客户尝试drop掉函数索引了,但即使drop掉函数索引也may not be feasible。
8附:
8.1.将一台版本为9.2.0.4的数据库做为本地库连接远程数据库没有该问题出现。
8.2.将8.1.7.4的库做为本地库,将另外一台9.2.0.4的数据库做为远程数据库,建立测试表和测试的函数索引没能重现该问题。
看来这个问题并不总是出现,而且oracle还告诉我们除了升级还并不总是能够解决,oh,yeah,神奇的oracle !!!
(需要引用, 请注明出处:痴情甲骨文http://space.itpub.net/14130873)
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/14130873/viewspace-526840/,如需转载,请注明出处,否则将追究法律责任。