ITPub博客

首页 > Linux操作系统 > Linux操作系统 > 连接ASM实例后出现ORA-1012错误(三)

连接ASM实例后出现ORA-1012错误(三)

原创 Linux操作系统 作者:yangtingkun 时间:2007-06-12 00:00:00 0 删除 编辑

继续讨论这个问题的产生原因。

连接ASM实例后出现ORA-1012错误:http://yangtingkun.itpub.net/post/468/280923

连接ASM实例后出现ORA-1012错误(二):http://yangtingkun.itpub.net/post/468/294068


上面一篇文章分析了导致ORA-1012错误的原因,是由于连接ASM实例的进程数超过了ASM初始化参数的设置,从而导致了问题的产生。

这篇文章进一步分析,是什么导致了ASM产生了大量的连接。

首先通过操作系统命令检查:

$ ps -ef|grep ASM|grep -v grep
oracle 17061 1 0 May 14 ? 0:00 asm_psp0_+ASM1
oracle 17059 1 0 May 14 ? 0:05 asm_diag_+ASM1
oracle 17108 1 0 May 14 ? 0:00 asm_smon_+ASM1
oracle 2556 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17057 1 0 May 14 ? 0:03 asm_pmon_+ASM1
oracle 17063 1 0 May 14 ? 0:11 asm_lmon_+ASM1
oracle 17110 1 0 May 14 ? 0:00 asm_rbal_+ASM1
oracle 17067 1 0 May 14 ? 0:08 asm_lms0_+ASM1
oracle 17079 1 0 May 14 ? 0:00 asm_mman_+ASM1
oracle 17104 1 0 May 14 ? 0:01 asm_lgwr_+ASM1
oracle 17420 1 0 May 14 ? 0:12 /data/oracle/product/10.2/database/bin/racgimon daemon ora.racnode1.ASM1.asm
oracle 28629 1 0 Jun 08 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 19622 1 0 Jun 01 ? 0:05 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17065 1 0 May 14 ? 0:19 asm_lmd0_+ASM1
oracle 17112 1 0 May 14 ? 2:10 asm_gmon_+ASM1
oracle 17143 1 0 May 14 ? 0:20 asm_lck0_+ASM1
oracle 1204 1 0 May 31 ? 0:00 asm_o001_+ASM1
oracle 8710 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11907 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 29561 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 23342 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17098 1 0 May 14 ? 0:01 asm_dbw0_+ASM1
oracle 23469 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17106 1 0 May 14 ? 0:01 asm_ckpt_+ASM1
oracle 12134 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 24374 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 17 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 28631 1 0 Jun 08 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 22777 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 12888 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 15334 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 4977 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 26859 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 9633 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 28634 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 23136 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 736 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 6678 1 0 Jun 09 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11080 11079 0 17:54:43 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

$ ps -ef|grep ASM|grep -v grep|wc -l
39

仅仅是一个测试库,且只有我一个人在测试RMAN的备份,怎么会有这么多的会话呢,登陆ASM实例进行检查:

bash-2.03$ export ORACLE_SID=+ASM1
bash-2.03$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 6 11 18:21:23 2007

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

连接到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> select count(*) from v$session;

COUNT(*)
----------
38

SQL> select count(*) from v$process;

COUNT(*)
----------
39

SQL> select distinct process from v$session where username is not null;

PROCESS
------------------------
11234
19620
19969
19995
23709
23768

已选择6行。

SQL> select process, count(*) from v$session where username is not null group by process;

PROCESS COUNT(*)
------------------------ ----------
11234 1
19620 1
19969 11
19995 8
23709 1
23768 1

已选择6行。

根据ASM实例上V$SESSION上的PROCESS列,找到连接到ASM实例的操作系统进程。上面的查询列出了所有非ASM实例后台进程对应的操作系统进程。可以看到,绝大部分进程是由于1996919995两个进程造成的。

下面依次分析这些进程:

$ ps -ef|grep 11234|grep -v grep
oracle 11246 11234 0 18:21:23 ? 0:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 11234 8887 0 18:21:23 pts/1 0:00 sqlplus / as sysdba
$ ps -ef|grep 19620|grep -v grep
oracle 19620 1 0 Jun 01 ? 0:06 ora_asmb_testrac1

这两个进程一个是通过SQLPLUS连接到ASM实例的进程,也就是当前查询的后台进程,另一个是节点testrac1连接到ASM实例的进程。

$ ps -ef|grep 23709|grep -v grep
oracle 23709 1 0 Jun 07 ? 0:56 oracletestrac1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$ ps -ef|grep 23768|grep -v grep
oracle 23768 1 0 Jun 07 ? 0:39 oracletestrac1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期一 6 11 18:29:41 2007

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

连接到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> set pages 100 lines 120
SQL> col program format a25
SQL> col module format a25
SQL> col event format a30
SQL> col client_info format a20
SQL> select s.program, module, client_info, event from v$session s, v$process p where paddr = p.addr and spid in (23709, 23768);

PROGRAM MODULE CLIENT_INFO EVENT
------------------------- --------------------- ----------------- ----------------------
rman@racnode1 (TNS V1-V3) backup full datafile rman channel=c1 control file parallel write
rman@racnode1 (TNS V1-V3) backup full datafile rman channel=c2 control file parallel write

这两个会话是当时备份出错后,很久无法结束,通过CTRL+强制结束的,导致会话异常结束,Oracle这里并没有将资源完全释放。

不过这两个会话应该不是问题的关键,进程1996919995两个进程在ASM实例上占用了18个会话,下面看看这两个实例在做什么:

$ ps -ef|grep 19969|grep -v grep
oracle 19969 1 0 Jun 01 ? 0:48 ora_arc0_testrac1
$ ps -ef|grep 19995|grep -v grep
oracle 19995 1 0 Jun 01 ? 0:18 ora_arc1_testrac1

居然是Oracle的归档进程,回想最开始错误产生的时候,就是归档进程出现的问题,那么基本上确定,问题是由于归档无法创建新的文件导致问题的产生。

检查实例1上的alert文件,发现问题产生的时候出现大量下面的错误:

ORA-19504: failed to create file "+DISK/testrac/1_307_618591128.dbf"
ORA-17502: ksfdcre:4 Failed to create file +DISK/testrac/1_307_618591128.dbf
ORA-15041: diskgroup space exhausted
Sat Jun 9 10:08:04 2007
ARC1: Error 19504 Creating archive log file to '+DISK/testrac/1_307_618591128.dbf'
ARCH: Archival stopped, error occurred. Will continue retrying
Sat Jun 9 10:08:04 2007
ORACLE Instance testrac1 - Archival Error
Sat Jun 9 10:08:04 2007
ORA-16038: log 1 sequence# 307 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Sat Jun 9 10:08:04 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc1_19995.trc:
ORA-16038: log 1 sequence# 307 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Sat Jun 9 10:08:28 2007
ARCH: Archival stopped, error occurred. Will continue retrying
Sat Jun 9 10:08:28 2007
ORACLE Instance testrac1 - Archival Error
Sat Jun 9 10:08:28 2007
ORA-16014: log 1 sequence# 307 not archived, no available destinations
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Sat Jun 9 10:08:28 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc0_19969.trc:
ORA-16014: log 1 sequence# 307 not archived, no available destinations
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
Sat Jun 9 10:13:28 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc1_19995.trc:
ORA-19504: failed to create file "+DISK/testrac/1_307_618591128.dbf"
ORA-17502: ksfdcre:4 Failed to create file +DISK/testrac/1_307_618591128.dbf
ORA-15041: diskgroup space exhausted

显然错误是从ASM磁盘组没有空间,导致归档失败开始的。检查失败归档的进程号,就是上面出现问题的1996919995。显然,Oracle在不断重试归档的过程中,出现了一些错误,导致归档进程连接到ASM实例的会话没有释放,进而导致了下面的错误:

Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc0_19969.trc:
ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '+DISK/testrac/onlinelog/group_2.260.618591151'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/onlinelog/group_2.260.618591151
ORA-00020: maximum number of processes () exceeded

Tue Jun 12 09:29:04 2007
Unexpected communication failure with ASM instance:
ORA-00020: maximum number of processes () exceeded
Unexpected communication failure with ASM instance:
ORA-00020: maximum number of processes () exceeded
Tue Jun 12 09:29:05 2007
Errors in file /data/oracle/admin/testrac/bdump/testrac1_arc0_19969.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.259.618591145'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/onlinelog/group_1.259.618591145
ORA-00020: maximum number of processes () exceeded
ORA-00312: online log 1 thread 1: '+DISK/testrac/onlinelog/group_1.258.618591139'
ORA-17503: ksfdopn:2 Failed to open file +DISK/testrac/onlinelog/group_1.258.618591139
ORA-00020: maximum number of processes () exceeded

解决问题的最好方法是重启数据库实例和ASM,如果暂时无法重启,可以使用第一篇文章给出的方法,另外建立一个归档目的地。

或者采用下面的方法清楚异常的会话和进程:

SQL> select count(*) from v$session;

COUNT(*)
----------
38

SQL> select 'ALTER SYSTEM KILL SESSION ''' || SID || ', ' || SERIAL# || ''';'
2 from v$session where process in (19995, 19969);

'ALTERSYSTEMKILLSESSION'''||SID||','||SERIAL#||''';'
--------------------------------------------------------------------------------
ALTER SYSTEM KILL SESSION '12, 4';
ALTER SYSTEM KILL SESSION '34, 475';
ALTER SYSTEM KILL SESSION '14, 1';
ALTER SYSTEM KILL SESSION '15, 1';
ALTER SYSTEM KILL SESSION '16, 1';
ALTER SYSTEM KILL SESSION '17, 1';
ALTER SYSTEM KILL SESSION '18, 3';
ALTER SYSTEM KILL SESSION '19, 1';
ALTER SYSTEM KILL SESSION '20, 1';
ALTER SYSTEM KILL SESSION '21, 1';
ALTER SYSTEM KILL SESSION '22, 1';
ALTER SYSTEM KILL SESSION '23, 1';
ALTER SYSTEM KILL SESSION '24, 1';
ALTER SYSTEM KILL SESSION '25, 2';
ALTER SYSTEM KILL SESSION '26, 1';
ALTER SYSTEM KILL SESSION '27, 4';
ALTER SYSTEM KILL SESSION '28, 148';
ALTER SYSTEM KILL SESSION '33, 7';
ALTER SYSTEM KILL SESSION '13, 1';

已选择19行。

SQL> ALTER SYSTEM KILL SESSION '12, 4';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '34, 475';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '14, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '15, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '16, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '17, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '18, 3';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '19, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '20, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '21, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '22, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '23, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '24, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '25, 2';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '26, 1';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '27, 4';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '28, 148';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '33, 7';

系统已更改。

SQL> ALTER SYSTEM KILL SESSION '13, 1';

系统已更改。

并通过操作系统命令清楚刚才RMAN失败造成的两个后台进程:

bash-2.03$ kill -9 23709
bash-2.03$ kill -9 28631

无论是清除Oracle实例的后台进程23709,还是清除23768进程对应的ASM实例后台进程28631都可以达到相同的目的。

等待一段时间,Oracle自动清除所有的后台进程,数据库恢复正常。

bash-2.03$ ps -ef|grep ASM|grep -v grep|wc -l
18

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/4227/viewspace-69308/,如需转载,请注明出处,否则将追究法律责任。

上一篇: RMAN备份的并行
请登录后发表评论 登录
全部评论
暂无介绍

注册时间:2007-12-29

  • 博文量
    1955
  • 访问量
    10424844