LIUBINGLIN

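About the author

Oracle database administrator and Oracle database system architect; author of 《构建最高可用Oracle数据库系统:Oracle 11gR2 RAC管理、维护与性能优化》 (Building Highly Available Oracle Database Systems: Oracle 11gR2 RAC Administration, Maintenance, and Performance Tuning), published in July 2012; Oracle 10g OCM.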


Category: Oracle

2015-06-26 15:51:49


In a customer's Oracle Active Data Guard environment, the primary database logged the following errors every day during the peak load period:
Fri Apr 24 17:25:59 2015
ORA-16198: LGWR received timedout error from KSR
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Error 16198 for archive log file 1 to 'afabdg01'
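
To confirm that the errors really cluster around the daily peak, the alert log can be tallied by hour. The following is a minimal sketch, not part of the original troubleshooting: it assumes the alert log uses the timestamp layout shown above (a timestamp line followed by message lines) and an English locale for day and month names. The function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout as it appears in the excerpt, e.g. "Fri Apr 24 17:25:59 2015".
# Assumes English day/month abbreviations (C locale).
TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_timestamp(line):
    """Return a datetime if the line is an alert-log timestamp, else None."""
    try:
        return datetime.strptime(line.strip(), TS_FORMAT)
    except ValueError:
        return None


def count_ora16198_by_hour(lines):
    """Count ORA-16198 message lines per hour of day.

    Each message is attributed to the most recent timestamp line above it.
    """
    counts = Counter()
    current = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is not None:
            current = ts
            continue
        if current is not None and line.startswith("ORA-16198"):
            counts[current.hour] += 1
    return counts
```

Passing an open alert log file object to `count_ora16198_by_hour` returns a `Counter` keyed by hour (0 to 23); a spike in the business peak hours supports the load-related explanation.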

See the following My Oracle Support (MOS) note:
Redo Transport Services fails with ORA-16198 when using SYNC (synchronous) mode (Doc ID 808469.1)


Applies to:

Oracle Database - Enterprise Edition - Version 10.2.0.1 and later
Information in this document applies to any platform.
***Checked for relevance on 26-Feb-2014***

This will affect LGWR SYNC transport mode in 10.2.0.x databases and SYNC transport mode in 11.2.0.x databases


Symptoms

Redo Transport Services failed with ORA-16198 from primary database to either the physical standby database or logical standby database using LGWR SYNC mode.

The primary alert log file showed:

Fri Feb 6 21:22:26 2009
ORA-16198: LGWR received timedout error from KSR
LGWR: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16198)
LGWR: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Fri Feb 6 21:22:26 2009
Errors in file /u01/app/oracle/admin/crthpd01/bdump/crthpd01_lgwr_2793488.trc:
ORA-16198: Timeout incurred on internal channel during remote archival
LGWR: Network asynch I/O wait error 16198 log 2 service '(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=abc)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=xyz_STANDBY_XPT.world)(INSTANCE_NAME=xyz)(SERVER=dedicated)))'
Fri Feb 6 21:22:26 2009
Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED
LGWR: Failed to archive log 2 thread 1 sequence 628 (16198)
Fri Feb 6 21:22:27 2009



If you use Data Guard Broker, then the primary drc log showed:

DG 2009-04-12-12:11:08 0 2 678445059 Operation CTL_GET_STATUS cancelled during phase 2, error = ORA-16778
DG 2009-04-12-12:12:08 0 2 0 RSM detected log transport problem: log transport for database 'xyz_STANDBY' has the following error.
DG 2009-04-12-12:12:08 0 2 0 ORA-16198: Timeout incurred on internal channel during remote archival
DG 2009-04-12-12:12:08 0 2 0 RSM0: HEALTH CHECK ERROR: ORA-16737: the redo transport service for standby database "xyz_STANDBY" has an error
DG 2009-04-12-12:12:08 0 2 678445062 Operation CTL_GET_STATUS cancelled during phase 2, error = ORA-16778
DG 2009-04-12-12:12:08 0 2 678445062 Operation CTL_GET_STATUS cancelled during phase 2, error = ORA-16778

Cause

The NET_TIMEOUT attribute of LOG_ARCHIVE_DEST_2 on the primary is set too low, so LNS could not finish sending the redo block within the timeout (10 seconds in this example).

log_archive_dest_2 service="(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=abc)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=xyz_STANDBY_XPT.world)(INSTANCE_NAME=xyz)(SERVER=dedicated)))",
LGWR SYNC AFFIRM delay=0 OPTIONAL max_failure=0
max_connections=1 reopen=300 db_unique_name="xyz_STANDBY"
register net_timeout=10 valid_for=(online_logfile,primary_role)

Note that the destination uses LGWR SYNC log transport mode and NET_TIMEOUT is set to 10.
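
When several destinations or databases have to be checked, the same inspection can be scripted. The sketch below is only an illustration: it assumes the destination value has already been retrieved as a single string with any wrapped lines joined, and it uses the 15-second lower bound from the Solution section as its default threshold. The function names are hypothetical.

```python
import re


def dest_attributes(dest_value):
    """Extract the transport-related attributes from a LOG_ARCHIVE_DEST_n value."""
    upper = dest_value.upper()
    timeout = re.search(r"\bNET_TIMEOUT\s*=\s*(\d+)", upper)
    return {
        # \bSYNC\b does not match ASYNC, and \bAFFIRM\b does not match NOAFFIRM.
        "sync": re.search(r"\bSYNC\b", upper) is not None,
        "affirm": re.search(r"\bAFFIRM\b", upper) is not None,
        "net_timeout": int(timeout.group(1)) if timeout else None,
    }


def timeout_too_low(attrs, minimum=15):
    """True if the destination is SYNC and its explicit NET_TIMEOUT is below minimum."""
    return (
        attrs["sync"]
        and attrs["net_timeout"] is not None
        and attrs["net_timeout"] < minimum
    )
```

For the destination shown above, `dest_attributes` reports `sync` as true and `net_timeout` as 10, so `timeout_too_low` flags it. A destination without an explicit NET_TIMEOUT is not flagged, because the sketch does not try to guess the version-specific default.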

Solution

Increase the NET_TIMEOUT value in LOG_ARCHIVE_DEST_2 on the primary to at least 15 to 20 seconds, depending on your network speed.

If you don't use Data Guard Broker, you can change LOG_ARCHIVE_DEST_2 from SQL*Plus with the ALTER SYSTEM command. For example:

SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=xyz_STANDBY LGWR SYNC DB_UNIQUE_NAME=xyz_STANDBY NET_TIMEOUT=30 VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)';

If you use Data Guard Broker, modify the NetTimeout property from DGMGRL or Grid Control instead.

For example, connect to the DGMGRL command-line interface from the primary machine:

DGMGRL> connect sys/

DGMGRL> EDIT DATABASE '' SET PROPERTY NetTimeout = 30;

=======================================================================

Note: If NET_TIMEOUT is already set to 30 and you still get ORA-16198, LNS could not finish sending the redo block within 30 seconds.

The slowness may be caused by:

1. Operating system. Keep track of OS usage (for example, with iostat).

2. Network. Keep track of network traffic (for example, with tcpdump).

Note: Do not use SYNC log transport mode across a wide area network (WAN) with latencies above 10 ms.

The purpose here is to figure out whether the slowness is caused by a temporary OS glitch or a temporary network glitch.
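
As a rough first check of the network side, the time to open a TCP connection from the primary host to the standby listener approximates one network round trip. The sketch below is an assumption-laden illustration, not an Oracle tool: the host name in the usage note is hypothetical, port 1521 is taken from the listener address in the example above, and the 10 ms limit comes from the note on WAN latency. It does not replace tcpdump or OS-level monitoring.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port, timeout=5.0):
    """Time one TCP connection setup to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; close immediately
    return (time.perf_counter() - start) * 1000.0


def sample_latency_ms(host, port, samples=5):
    """Collect several connection-setup timings."""
    return [tcp_connect_ms(host, port) for _ in range(samples)]


def exceeds_sync_limit(samples_ms, limit_ms=10.0):
    """True if the median of the samples is above the latency limit for SYNC."""
    return statistics.median(samples_ms) > limit_ms
```

For example, `exceeds_sync_limit(sample_latency_ms("standby-host", 1521))` run on the primary server during the peak period indicates whether the median connection time is above 10 ms. Connection setup time includes listener and OS overhead, so treat the result as an upper-bound estimate of network latency.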


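This error occurs because the primary's LGWR process could not finish sending the redo to the standby within the NET_TIMEOUT window (10 seconds here). Raising NET_TIMEOUT to 15 or 30 seconds gives LGWR more time to ship redo to the standby and makes the error less likely. If the problem persists with NET_TIMEOUT set to 30 seconds, check whether the network between the primary and the standby has a performance problem or a fault. For a standby reached over a WAN, it is best not to use LGWR SYNC for real-time synchronization; asynchronous transport (ARCH or ASYNC) is a better fit.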

--end--
