RACGIMON HAS FILE HANDLE LEAK ON HEALTHCHECK FILE

Original post. Category: Linux operating system. Author: zhanglei_itput. Date: 2009-06-23 08:36:17

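   This morning I found one of our RAC database servers misbehaving: logins were slow and system idle was at 0%, with the resources taken up by a large number of racgmain processes.
   Temporary workaround:
  1. Check OS resource usage
      Before rebooting the system, I checked resource usage and found a large number of racgmain processes consuming the resources.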
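
A rough way to confirm this kind of process pile-up is to count the matching processes. The sketch below assumes a POSIX ps; count_procs is a hypothetical helper name, not part of any Oracle tool.

```shell
# count_procs NAME (hypothetical helper): reads one command name per line
# on stdin and prints how many lines contain NAME.
count_procs() {
    # -F matches NAME literally, -c prints the number of matching lines.
    # grep exits non-zero when the count is 0, so mask that status.
    grep -F -c -- "$1" || true
}

# Example on a live system. "comm=" prints only the command name, so the
# grep inside count_procs does not match itself:
#   ps -e -o comm= | count_procs racgmain
```

A count that keeps growing between runs points to processes accumulating rather than a short burst.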

  2. Check the database status
[oracle@ra1 ~]$ sqlplus /nolog
SQL*Plus: Release 10.2.0.1.0 - Production on Mon Jun 15 08:39:04 2009
Copyright (c) 1982, 2005, Oracle.  All rights reserved.
SQL> conn / as sysdba
Connected.
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod
PL/SQL Release 10.2.0.1.0 - Production
CORE    10.2.0.1.0      Production
TNS for Linux: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production

  3. Check the CRS processes
[oracle@ra1 ~]$ ps -ef|grep crs
root      3241     1  0 08:35 ?        00:00:00 /bin/su -l oracle -c sh -c 'ulimit -c unlimited; cd /u01/app/oracle/product/crs/log/ra1/evmd; exec /u01/app/oracle/product/crs/bin/evmd '
oracle    4787  3241  0 08:36 ?        00:00:00 /u01/app/oracle/product/crs/bin/evmd.bin
root      4892  4774  0 08:36 ?        00:00:00 /bin/su -l oracle -c /bin/sh -c 'ulimit -c unlimited; cd /u01/app/oracle/product/crs/log/ra1/cssd;  /u01/app/oracle/product/crs/bin/ocssd  || exit $?'
oracle    4893  4892  0 08:36 ?        00:00:00 /bin/sh -c ulimit -c unlimited; cd /u01/app/oracle/product/crs/log/ra1/cssd;  /u01/app/oracle/product/crs/bin/ocssd  || exit $?
oracle    4918  4893  0 08:36 ?        00:00:01 /u01/app/oracle/product/crs/bin/ocssd.bin
oracle    5189  4787  0 08:36 ?        00:00:00 /u01/app/oracle/product/crs/bin/evmlogger.bin -o /u01/app/oracle/product/crs/evm/log/evmlogger.info -l /u01/app/oracle/product/crs/evm/log/evmlogger.log
oracle    6186     1  0 08:36 ?        00:00:00 /u01/app/oracle/product/crs/opmn/bin/ons -d
oracle    6187  6186  0 08:36 ?        00:00:00 /u01/app/oracle/product/crs/opmn/bin/ons -d
root     19744     1  0 08:48 ?        00:00:00 /u01/app/oracle/product/crs/bin/crsd.bin restart
oracle    8784  9729  0 09:01 pts/1    00:00:00 grep crs

Preliminary conclusion: the abnormal resource usage is caused by CRS.

4. Stop the CRS resources
This includes the CSS process, the CRS processes (database, listener, node), the EVM process, and so on.
[root@ra1 ~]# cd /u01/app/oracle/product/crs/bin
[root@ra1 bin]# ./crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy


[root@ra1 bin]# ./crsctl stop crs
Stopping resources.
Successfully stopped CRS resources
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.

[root@ra1 bin]# ./crsctl check crs
Failure 1 contacting CSS daemon
Cannot communicate with CRS
Cannot communicate with EVM
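
The before and after output of crsctl check crs can also be evaluated in a script. The following is a minimal sketch that relies only on the three "appears healthy" lines shown in the transcript; crs_all_healthy is a hypothetical name, and other CRS versions may print different messages.

```shell
# crs_all_healthy (hypothetical helper): reads `crsctl check crs` output on
# stdin and returns 0 only if CSS, CRS and EVM all report "appears healthy".
crs_all_healthy() {
    out=$(cat)
    for daemon in CSS CRS EVM; do
        printf '%s\n' "$out" | grep -q "^$daemon appears healthy" || return 1
    done
    return 0
}

# Example (path taken from the transcript):
#   /u01/app/oracle/product/crs/bin/crsctl check crs | crs_all_healthy \
#       && echo "stack is up" || echo "stack is down or degraded"
```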


5. Check the database processes
[root@ra1 bin]# ps -ef|grep ora_
root     23490  9148  0 09:11 pts/1    00:00:00 grep ora_


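6. Switch CRS to manual startup (optional, depending on your situation)
    The CRS service is registered in the host's boot scripts and starts automatically, so it has to be switched to manual startup by hand. In this case we only needed the applications on the server and no longer needed the database, so changing this default was acceptable. In most production environments, decide based on the actual situation.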
[root@ra1 ~]# cd /u01/app/oracle/product/crs/bin
[root@ra1 bin]# /u01/app/oracle/product/crs/bin/crsctl disable crs
[root@ra1 bin]# more /etc/oracle/scls_scr/ra1/root/crsstart
disable

At this point system resources returned to normal. Looking further for the cause:
MetaLink information: Bug No. 7235094

PROBLEM:
--------
racgimon has a file handle leak on the healthcheck file. At the customer's site, ServiceGuard detected a split brain and a node was bounced. At that time, "ORA-27301: OS failure message: File table overflow" was recorded in alert.log. Also, "glance" showed that racgimon had more than 26,000 file handles open. The racgimon process had been started around 20 days earlier (14th Jun). Because of the handle leak in racgimon, the operating system was exhausting the kernel limit on open files ("nfile" on HP-UX).

DIAGNOSTIC ANALYSIS:
--------------------
"$ORACLE_HOME/log/< NodeName>/racg/imon_< InstanceName>.log"
During the handle leak, the racgimon log recorded the following error every 60 seconds (the health check interval).
- imon_r1024.log
2008-07-04 16:16:24.707: [RACG][20] [25433][20][ora.r1024.r10241.inst]:  
GIMH: GIM-00104: Health check failed to connect to instance.  
GIM-00090: OS-dependent operation:mmap failed with status: 12  
GIM-00091: OS failure message: Not enough space  
GIM-00092: OS failure occurred at: sskgmsmr_13

The error recorded in imon_r1024.log above appears to be the same as Bug:6931689. In addition, Bug:6989661 explains that a looping error in racgimon can leave opened files unclosed. So my guess is that racgimon was looping on the error from Bug:6931689, and that loop caused the handle leak. Eventually it exceeded "nfile" on HP-UX, and ServiceGuard, Oracle, and other applications could no longer run normally.
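
To see whether an instance is hitting the repeating health check error described here, one could count the GIM-00104 entries in the imon log. This is only a sketch: count_healthcheck_failures is a hypothetical helper, and the log location is assumed to follow the path pattern quoted in the bug text.

```shell
# count_healthcheck_failures FILE (hypothetical helper): prints how many
# lines in FILE contain the GIM-00104 health check error code.
count_healthcheck_failures() {
    # grep -c prints the match count; it exits non-zero when the count is 0
    # or the file cannot be read, so mask that status. An unreadable file
    # produces empty output rather than a number.
    grep -c 'GIM-00104' "$1" 2>/dev/null || true
}

# Example (file name taken from the bug text, directory is an assumption):
#   count_healthcheck_failures "$ORACLE_HOME/log/$(hostname)/racg/imon_r1024.log"
```

With a 60-second health check interval, a count that rises by about one per minute would be consistent with the looping error the bug describes.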

WORKAROUND:
-----------
Kill racgimon from time to time.

RELATED BUGS:
-------------
Bug:6989661
Bug:6931689
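
The bug text used "glance" on HP-UX to see the open handle count. On Linux, a comparable number can be read from /proc. The sketch below assumes a Linux /proc filesystem; fd_count is a hypothetical helper name, and reading another user's process requires running as that user or as root.

```shell
# fd_count PID (hypothetical helper): prints the number of file descriptors
# currently open in process PID, based on the entries in /proc/PID/fd.
fd_count() {
    # Each entry in /proc/PID/fd is one open descriptor. tr strips any
    # padding that some wc implementations add around the number.
    ls -1 "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Example: print the descriptor count for every racgimon process.
#   for pid in $(pgrep racgimon); do
#       echo "$pid $(fd_count "$pid")"
#   done
```

A count that climbs steadily over days, as in the 26,000 handle case above, would suggest the leak is present.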

 

References:
MetaLink:
Bug No. 7235094
Filed: 04-JUL-2008    Updated: 08-JUL-2008
Product: Oracle Server - Enterprise Edition    Product Version: 10.2.0.4
Platform: HP-UX Itanium    Platform Version: No Data
Database Version: 10.2.0.4    Affects Platforms: Port-Specific
Severity: Severe Loss of Service    Status: Duplicate Bug. To Filer
Base Bug: 6931689    Fixed in Product Version: No Data

Source: ITPUB blog, link: http://blog.itpub.net/9252210/viewspace-607212/. If you repost this article, please credit the source; otherwise legal action may be pursued.
