ITPub博客

首页 > Linux操作系统 > Linux操作系统 > SUSPEND导致RAC环境的ORA-29702错误(一)

SUSPEND导致RAC环境的ORA-29702错误(一)

原创 Linux操作系统 作者:yangtingkun 时间:2009-04-25 21:19:32 0 删除 编辑

在测试RAC环境的SUSPEND命令时,出现了这个错误。

 

 

测试RAC环境下,SUSPEND对多个实例的影响时,在后台日志中出现了这个错误。

简单描述一下测试环境,3节点的Oracle Rac 10203 for Solaris sparc64,其中实例启动,一个实例关闭,将数据库置于SUSPEND状态,然后通过SRVCTL启动远端关闭的实例,这时SRVCTL命令会被阻塞,登陆远端站点发现数据库还没有被MOUNT,而实例1由于上述的操作被自动SHUTDOWN ABORT,并自动重起。在后台日志中可以发现ORA-29702错误。

下面模拟一下这个现象:

bash-2.03$ srvctl status db -d testrac
Instance testrac1 is running on node racnode1
Instance testrac2 is running on node racnode2
Instance testrac3 is running on node racnode3
bash-2.03$ srvctl stop inst -d testrac -i testrac3
bash-2.03$ srvctl status db -d testrac
Instance testrac1 is running on node racnode1
Instance testrac2 is running on node racnode2
Instance testrac3 is not running on node racnode3
bash-2.03$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期六 2 21 01:01:17 2009

Copyright (c) 1982, 2006, Oracle.  All Rights Reserved.


连接到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> alter session set nls_date_format = 'yyyy-mm-dd hh24:mi:ss';

会话已更改。

SQL> select instance_name, startup_time from v$instance;

INSTANCE_NAME    STARTUP_TIME
---------------- -------------------
testrac1         2009-02-21 00:54:20

SQL> select instance_name, status, database_status from gv$instance;

INSTANCE_NAME    STATUS       DATABASE_STATUS
---------------- ------------ -----------------
testrac1         OPEN         ACTIVE
testrac2         OPEN         ACTIVE

SQL> alter system suspend;

系统已更改。

SQL> select instance_name, status, database_status from gv$instance;

INSTANCE_NAME    STATUS       DATABASE_STATUS
---------------- ------------ -----------------
testrac1         OPEN         SUSPENDED
testrac2         OPEN         SUSPENDED

检查实例2的状态:

SQL> select instance_name from v$instance;

INSTANCE_NAME
----------------
testrac2

SQL> select instance_name, status, database_status from v$instance;

INSTANCE_NAME    STATUS       DATABASE_STATUS
---------------- ------------ -----------------
testrac2         OPEN         SUSPENDED

下面在另一个终端使用srvctl启动实例3

bash-2.03$ srvctl start inst -d testrac -i testrac3

这个操作处于被阻塞的状态。

现在通过sqlplus来连接实例3

$ sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.3.0 - Production on 星期三 4 22 16:32:08 2009

Copyright (c) 1982, 2006, Oracle.  All Rights Reserved.


连接到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> select instance_name, status, database_status from v$instance;

INSTANCE_NAME    STATUS       DATABASE_STATUS
---------------- ------------ -----------------
testrac3         STARTED      ACTIVE

SQL> select instance_name, status, database_status from gv$instance;

可以看到数据库的状态并不是OPEN,而是STARTED,说明数据库还没有被MOUNT。而查询GV$INSTANCE信息的操作被阻塞。

又等待一段时间,发现SRVCTL命令结束,但是实例3还处于等待状态中。

在实例1上执行RESUME操作,并检查当前实例状态:

SQL> alter system resume;

系统已更改。

SQL> select instance_name, status, database_status from gv$instance;

INSTANCE_NAME    STATUS       DATABASE_STATUS
---------------- ------------ -----------------
testrac1         OPEN         ACTIVE
testrac3         OPEN         ACTIVE

实例3上被阻塞的查询得到结果,但是查询GV$INSTANCE却只有3的信息:


INSTANCE_NAME    STATUS       DATABASE_STATUS
---------------- ------------ -----------------
testrac3         OPEN         ACTIVE

SQL> select instance_name from v$instance;

INSTANCE_NAME
----------------
testrac3

SQL> select instance_name, status, database_status from gv$instance;

INSTANCE_NAME    STATUS       DATABASE_STATUS
---------------- ------------ -----------------
testrac3         OPEN         ACTIVE
testrac2         OPEN         ACTIVE
testrac1         OPEN         ACTIVE

随后3个节点都启动,且恢复正常的状态。

SQL> select instance_name, startup_time from v$instance;

INSTANCE_NAME    STARTUP_TIME
---------------- -------------------
testrac1         2009-02-21 00:54:20

跟踪这次测试的结果实例1并没有重启,而实例2则发生了重启,而第一次发现问题的时候是在实例2上执行alter system resume操作,看来没有执行这个操作的实例会发生重启。

下面是实例2上的操作:

SQL> select instance_name, status, database_status from v$instance;

INSTANCE_NAME    STATUS       DATABASE_STATUS
---------------- ------------ -----------------
testrac2         OPEN         ACTIVE

SQL> select instance_name, status, database_status from v$instance;
select instance_name, status, database_status from v$instance
*
1 行出现错误:
ORA-03135:
连接失去联系

可以看到实例2在刚执行完alter system resume后还可以执行查询,但是随后就断开连接。

检查实例2对应的alert文件:

Resuming database after alter system resume command
Sat Feb 21 01:17:44 2009
Error: KGXGN polling error (15)
Sat Feb 21 01:17:44 2009
Errors in file /data/oracle/admin/testrac/bdump/testrac2_lmon_16311.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Sat Feb 21 01:17:44 2009
System state dump is made for local instance
System State dumped to trace file /data/oracle/admin/testrac/bdump/testrac2_diag_16307.trc
Sat Feb 21 01:17:46 2009
Shutting down instance (abort)
License high water mark = 6
Sat Feb 21 01:17:46 2009
Trace dumping is performing id=[cdmp_20090221011744]
Sat Feb 21 01:17:48 2009
Instance terminated by LMON, pid = 16311
Sat Feb 21 01:17:51 2009
Instance terminated by USER, pid = 21837
Sat Feb 21 01:17:55 2009
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 ce1 10.0.0.0 configured from OCR for use as a cluster interconnect
Interface type 1 ce0 172.25.0.0 configured from OCR for use as  a public interface
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.3.0.
System parameters with non-default values:
  processes                = 300
  __shared_pool_size       = 352321536
  __large_pool_size        = 16777216
  __java_pool_size         = 16777216
  __streams_pool_size      = 50331648
  streams_pool_size        = 50331648
  spfile                   = +DISK/testrac/spfiletestrac.ora
  backup_tape_io_slaves    = TRUE
  sga_target               = 2147483648
  control_files            = +DISK/testrac/controlfile/current.256.618591133, +DISK/testrac/controlfile/current.257.618591137
  db_file_name_convert     = /data1/oradata/testrac/g, +DISK/testrac/datafile/g, /data/oradata/testrac/l, +DISK/testrac/datafile/l, /data1/oradata/testrac/n, +DISK/testrac/datafile/n, /data1/oradata/testrac/s, +DISK/testrac/datafile/s, /data1/oradata/testrac/u, +DISK/testrac/datafile/u, /export/home/oracle/oradata/testrac/t, +DISK/testrac/tempfile/t, /export/home/oracle/oradata/testrac/zhejiang01.dbf, +DISK/testrac/datafile/zhejiang01.dbf, /export/home/oracle/oradata/testrac/zhejiang02.dbf, +DISK/testrac/datafile/zhejiang02.dbf, /expor
  log_file_name_convert    = /export/home/oracle/oradata/testrac, +DISK/testrac/onlinelog
  db_block_size            = 8192
  __db_cache_size          = 1694498816
  compatible               = 10.2.0.1.0
  log_archive_config       =
  log_archive_dest_1       = LOCATION=/data1/archivelog VALID_FOR=(ALL_LOGFILES,ALL_ROLES) DB_UNIQUE_NAME=testrac
  log_archive_dest_2       =
  log_archive_dest_state_2 = ENABLE
  standby_archive_dest     = /data1/archivelog
  log_archive_local_first  = TRUE
  log_archive_format       = %t_%s_%r.dbf
  fal_client               = TESTRAC
  fal_server               = TESTRAC_STANDBY
  db_file_multiblock_read_count= 16
  cluster_database         = TRUE
  cluster_database_instances= 3
  db_create_file_dest      = +DISK
  db_recovery_file_dest    =
  db_recovery_file_dest_size= 5368709120
  standby_file_management  = AUTO
  thread                   = 2
  instance_number          = 2
  undo_management          = AUTO
  undo_tablespace          = UNDOTBS2
  remote_login_passwordfile= EXCLUSIVE
  db_domain                =
  service_names            = TESTRAC, testrac
  dispatchers              = (PROTOCOL=TCP) (SERVICE=testracXDB)
  local_listener           = (ADDRESS = (PROTOCOL = TCP)(HOST = 172.25.198.225)(PORT = 1521))
  remote_listener          = LISTENERS_TESTRAC
  job_queue_processes      = 10
  cursor_sharing           = EXACT
  background_dump_dest     = /data/oracle/admin/testrac/bdump
  user_dump_dest           = /data/oracle/admin/testrac/udump
  core_dump_dest           = /data/oracle/admin/testrac/cdump
  audit_file_dest          = /data/oracle/admin/testrac/adump
  session_max_open_files   = 20
  db_name                  = testrac
  open_cursors             = 5000
  pga_aggregate_target     = 536870912
Cluster communication is configured to use the following interface(s) for this instance
  10.0.0.2
Sat Feb 21 01:17:58 2009
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
PMON started with pid=2, OS id=22188
DIAG started with pid=3, OS id=22190
PSP0 started with pid=4, OS id=22192
LMON started with pid=5, OS id=22194
Sat Feb 21 01:17:59 2009
WARNING: Failed to set buffer limit on IPC interconnect socket
Oracle requires that the SocketReceive buffer size be tunable upto 1MB
Please make sure the kernel parameterwhich limits SO_RCVBUF value set by
applications is atleast 1MB
LMD0 started with pid=6, OS id=22206
LMS0 started with pid=7, OS id=22227
LMS1 started with pid=8, OS id=22231
MMAN started with pid=9, OS id=22235
DBW0 started with pid=10, OS id=22237
LGWR started with pid=11, OS id=22239
CKPT started with pid=12, OS id=22246
SMON started with pid=13, OS id=22251
RECO started with pid=14, OS id=22253
CJQ0 started with pid=15, OS id=22259
MMON started with pid=16, OS id=22273
Sat Feb 21 01:18:01 2009
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=17, OS id=22280
Sat Feb 21 01:18:01 2009
starting up 1 shared server(s) ...
Sat Feb 21 01:18:01 2009
lmon registered with NM - instance id 2 (internal mem no 1)
Sat Feb 21 01:18:01 2009
Reconfiguration started (old inc 0, new inc 12)
List of nodes:
 0 1 2
 Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
 Communication channels reestablished
 * domain 0 valid = 1 according to instance 0
Sat Feb 21 01:18:02 2009
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Sat Feb 21 01:18:02 2009
 LMS 0: 0 GCS shadows cancelled, 0 closed
Sat Feb 21 01:18:02 2009
 LMS 1: 0 GCS shadows cancelled, 0 closed
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
Sat Feb 21 01:18:02 2009
 LMS 0: 0 GCS shadows traversed, 0 replayed
Sat Feb 21 01:18:02 2009
 LMS 1: 0 GCS shadows traversed, 0 replayed
Sat Feb 21 01:18:02 2009
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
LCK0 started with pid=20, OS id=22315
Sat Feb 21 01:18:03 2009
ALTER DATABASE   MOUNT
Sat Feb 21 01:18:03 2009
Starting background process ASMB
ASMB started with pid=22, OS id=22347
Starting background process RBAL
RBAL started with pid=23, OS id=22351
Sat Feb 21 01:18:07 2009
SUCCESS: diskgroup DISK was mounted
Sat Feb 21 01:18:11 2009
Setting recovery target incarnation to 2
Sat Feb 21 01:18:11 2009
Successful mount of redo thread 2, with mount id 57044521
Sat Feb 21 01:18:11 2009
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Completed: ALTER DATABASE   MOUNT
Sat Feb 21 01:18:13 2009
ALTER DATABASE OPEN
Picked broadcast on commit scheme to generate SCNs
Sat Feb 21 01:18:13 2009
LGWR: STARTING ARCH PROCESSES
ARC0 started with pid=27, OS id=22623
Sat Feb 21 01:18:14 2009
ARC0: Archival started
ARC1: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC1 started with pid=28, OS id=22625
Sat Feb 21 01:18:14 2009
Thread 2 opened at log sequence 842
  Current log# 4 seq# 842 mem# 0: +DISK/testrac/onlinelog/group_4.270.633485455
Successful open of redo thread 2
Sat Feb 21 01:18:14 2009
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Sat Feb 21 01:18:14 2009
SMON: enabling cache recovery
Sat Feb 21 01:18:14 2009
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
Sat Feb 21 01:18:14 2009
ARC1: Becoming the heartbeat ARCH
Sat Feb 21 01:18:15 2009
Successfully onlined Undo Tablespace 4.
Sat Feb 21 01:18:15 2009
SMON: enabling tx recovery
Sat Feb 21 01:18:15 2009
Database Characterset is ZHS16GBK
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=31, OS id=22830
Sat Feb 21 01:18:25 2009
Completed: ALTER DATABASE OPEN

alert文件中可以清晰的看到,Oracle执行resume命令后由于出现的ORA-29702错误对当前实例执行了SHUTDOWN ABORT并重启的操作。根据metalink中的描述,ORA-29702只是一个超时的错误,怀疑是SRVCTL命令发出时间超时,导致CLUSTER无法判断数据库是否处于CLUSTER一致性的状态,因此在恢复操作后,重启了另外的实例。

 

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/4227/viewspace-591813/,如需转载,请注明出处,否则将追究法律责任。

请登录后发表评论 登录
全部评论
暂无介绍

注册时间:2007-12-29

  • 博文量
    1955
  • 访问量
    10488528