Oracle 9i, 10g, and 11gR1 RAC on Linux All Require the Hangcheck-Timer Module

Original  Linux Operating System  Author: 尛样儿  Date: 2011-09-06 17:00:53
Hangcheck-Timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux [ID 726833.1]

  Modified 29-JUL-2010     Type REFERENCE     Status PUBLISHED  

In this Document
  Purpose
  Scope
  Hangcheck-Timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux
  References


Applies to:

Oracle Server - Enterprise Edition - Version: 10.1.0.2 to 11.1.0.7 - Release: 10.1 to 11.1
Oracle Server - Enterprise Edition - Version: 9.2.0.8 to 11.1.0.7   [Release: 9.2 to 11.1]
Linux x86
Linux x86-64

Purpose

The hangcheck-timer module is required for a supported configuration in Oracle Real Application Clusters (RAC) environments on Linux with Oracle releases 9i, 10g, or 11g. This note outlines the requirements for configuring hangcheck-timer in an Oracle Enterprise Linux, Red Hat Linux, or SUSE Linux environment.

Note: Hangcheck-timer is not required starting with Oracle Clusterware 11gR2.

Scope

This article is intended for product managers, system architects, and system administrators involved in deploying and configuring Oracle RAC 9i, 10g, or 11g in a Linux environment. It is also useful to field engineers and consulting organizations who install and configure Oracle in a Linux RAC environment.

Hangcheck-Timer Module Requirements for Oracle 9i, 10g, and 11g RAC on Linux

Starting with release 9.2.0.2, Oracle RAC environments on Linux require an I/O fencing mechanism called the hangcheck-timer module. This module replaced the Watchdog module, which provided similar fencing functionality. Hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.

Hangcheck-timer should be loaded at boot time. It monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node. It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs: it sets a timer, then checks, when the timer fires, whether it was delayed by more than the allowed margin of error. If the elapsed time exceeds the allowed (hangcheck_tick + hangcheck_margin) seconds, the machine is restarted. Hangcheck-timer will not cause reboots due to CPU starvation.
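
The check described above amounts to a single comparison. The following is a minimal sketch in shell, not the kernel module's code: the function name hangcheck_would_fire and its arguments are hypothetical, and times are whole seconds rather than TSC readings.

```shell
#!/bin/sh
# Illustrative only: mimics the comparison hangcheck-timer makes each time
# its timer fires. The real module runs in the kernel and reads the TSC.
#
# Arguments (all in seconds; names are hypothetical):
#   $1 = elapsed time since the timer was last armed
#   $2 = hangcheck_tick
#   $3 = hangcheck_margin
# Returns 0 (true) if the elapsed time exceeds tick + margin, 1 otherwise.
hangcheck_would_fire() {
    elapsed=$1
    tick=$2
    margin=$3
    [ "$elapsed" -gt $((tick + margin)) ]
}

# Example with tick=1 and margin=10:
if hangcheck_would_fire 15 1 10; then
    echo "hang detected: 15s > 1s + 10s"
fi
if ! hangcheck_would_fire 5 1 10; then
    echo "no hang: 5s <= 1s + 10s"
fi
```

With hangcheck_reboot set to 0, the real module would only log at the point where this sketch reports a hang.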

 Hangcheck-timer requires three configuration parameters:

  • hangcheck_tick - defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is 60 seconds.
  • hangcheck_margin - defines how much margin is allowed, in seconds, between expected scheduling and real scheduling time. The default value is 180 seconds.
  • hangcheck_reboot - determines if the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If the value of hangcheck_reboot is equal to or greater than 1, then the hangcheck-timer module restarts the system. If the hangcheck_reboot parameter is set to zero, then the hangcheck-timer module will not reboot the node, even if a hang is detected.   The default value varies by kernel version.  In the 2.4 kernel, the default is 1.  In 2.6 kernels, the default is 0.
All hangcheck-timer default values should be explicitly overridden when loading the kernel module, based on the Oracle release as follows: 
  • 9i: Assuming the default setting of "oracm misscount" is set to 220 seconds: 
    hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1
  • 10g/11g: Assuming the default setting of "CSS misscount" is set to either 30 or 60 seconds:
    hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1

You must always ensure that the cluster misscount setting is greater than the sum of hangcheck_tick and hangcheck_margin.
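
The misscount rule can be checked with simple arithmetic before loading the module. This is a small sketch under the values quoted in this note; the function name misscount_is_safe is hypothetical and all values are in seconds.

```shell
#!/bin/sh
# Returns 0 if misscount > hangcheck_tick + hangcheck_margin, 1 otherwise.
# The function name is hypothetical; all values are in seconds.
#   $1 = cluster misscount, $2 = hangcheck_tick, $3 = hangcheck_margin
misscount_is_safe() {
    misscount=$1
    tick=$2
    margin=$3
    [ "$misscount" -gt $((tick + margin)) ]
}

# Settings recommended in this note:
misscount_is_safe 220 30 180 && echo "9i: 220 > 30 + 180, OK"
misscount_is_safe 30 1 10 && echo "10g/11g: 30 > 1 + 10, OK"
# Module defaults (tick=60, margin=180) against a 60-second CSS misscount:
misscount_is_safe 60 60 180 || echo "defaults: 60 <= 60 + 180, NOT safe"
```

The last case illustrates why the defaults should be overridden explicitly: with them, the sum of tick and margin is far larger than either CSS misscount value.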


When running Oracle Clusterware on Linux, hangcheck-timer should always be configured on each RAC cluster node. The module provides the I/O fencing that ensures no stray writes occur from a node evicted from the RAC cluster. To verify that the hangcheck-timer module is loaded on a node, run the following as the root or oracle user:

# /sbin/lsmod | grep hangcheck

hangcheck-timer         2672   0

If the hangcheck-timer module is loaded (running) you will see output similar to above. When hangcheck-timer is not loaded no output is generated, and the command prompt is returned to the user.
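
For scripted health checks, the same test can be wrapped in a function that returns an exit status. This is a sketch only: the function name hangcheck_loaded is hypothetical, and accepting both the hyphen and underscore spellings of the module name is a precaution rather than something this note states.

```shell
#!/bin/sh
# Reads lsmod-style output on stdin and succeeds if a hangcheck-timer entry
# is present. Accepts "hangcheck-timer" and "hangcheck_timer" (assumption:
# the separator shown may differ between kernels).
# The function name is hypothetical.
hangcheck_loaded() {
    grep -Eq '^hangcheck[-_]timer[[:space:]]'
}

# Typical use on a cluster node:
#   /sbin/lsmod | hangcheck_loaded && echo "loaded" || echo "NOT loaded"

# Self-contained demonstration using the sample output from this note:
if printf 'hangcheck-timer         2672   0\n' | hangcheck_loaded; then
    echo "hangcheck-timer is loaded"
fi
```

Reading from stdin keeps the function usable against saved lsmod output collected from several nodes.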

In an Oracle Enterprise Linux, Red Hat 4/5, or SUSE 9/10 environment the hangcheck-timer module is loaded using the modprobe command:

# modprobe hangcheck-timer  hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1

To ensure the module is loaded at boot time, also place the same command in the appropriate local startup script (e.g. /etc/rc.d/rc.local, or /etc/init.d/boot.local). In earlier releases, hangcheck-timer was loaded using insmod in place of modprobe. Consult your release-specific documentation to determine which initialization method is required.
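
Adding the command to a startup script can be done idempotently, so that rerunning a setup script does not duplicate the line. The sketch below assumes the 10g/11g values from this note; the function name add_hangcheck_to_rc is hypothetical, and the demonstration writes to a scratch file rather than the real rc.local.

```shell
#!/bin/sh
# Appends the modprobe command to a startup script only if no hangcheck-timer
# line is already present. Function name and argument are hypothetical.
#   $1 = path to the startup script, e.g. /etc/rc.d/rc.local
add_hangcheck_to_rc() {
    rc_file=$1
    line='modprobe hangcheck-timer hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1'
    if grep -q 'modprobe hangcheck-timer' "$rc_file" 2>/dev/null; then
        return 0    # already configured; leave the existing line alone
    fi
    printf '%s\n' "$line" >> "$rc_file"
}

# Demonstration against a scratch file:
demo_rc=$(mktemp)
add_hangcheck_to_rc "$demo_rc"
add_hangcheck_to_rc "$demo_rc"    # second call is a no-op
cat "$demo_rc"
rm -f "$demo_rc"
```

On a 9i cluster the parameter values in the line would need to be changed to the 9i settings listed earlier in this note.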

Hangcheck-timer writes to the system messages log when it detects a failure and when it initiates a node restart:
  • When hangcheck-timer reboots the node, it may leave the message "Hangcheck: hangcheck is restarting the machine" in /var/log/messages.
  • If you see the message "Hangcheck: hangcheck value past margin!" in /var/log/messages, a reboot was required but was not performed because hangcheck_reboot was not set to 1. In that case, reload the hangcheck module as described earlier in this note, with hangcheck_reboot set to 1.
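
The two messages above can be searched for in a script. This is a rough sketch: the function name hangcheck_log_status and its output labels are hypothetical, and the demonstration uses a scratch file containing only the message text quoted in this note.

```shell
#!/bin/sh
# Classifies hangcheck-timer entries in a syslog-style file.
#   $1 = log file (defaults to /var/log/messages)
# Prints one of: rebooted, margin-exceeded-no-reboot, none.
# The function name and output labels are hypothetical.
hangcheck_log_status() {
    log=${1:-/var/log/messages}
    if grep -q 'Hangcheck: hangcheck is restarting the machine' "$log" 2>/dev/null; then
        echo "rebooted"
    elif grep -q 'Hangcheck: hangcheck value past margin!' "$log" 2>/dev/null; then
        echo "margin-exceeded-no-reboot"
    else
        echo "none"
    fi
}

# Demonstration against a scratch file holding only the message text:
demo_log=$(mktemp)
printf '%s\n' 'Hangcheck: hangcheck value past margin!' > "$demo_log"
hangcheck_log_status "$demo_log"
rm -f "$demo_log"
```

A result of margin-exceeded-no-reboot corresponds to the second case above, where the module should be reloaded with hangcheck_reboot=1.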

Known Issues
  • Bug:6125546, which can prevent hangcheck-timer from rebooting the node on RHEL4 (fixed in 2.6.9.56 or RHEL4.6)
  • Bug:6782377, incompatibility between hangcheck and the HPET clock timer

References

NOTE:559365.1 - Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions
NOTE:567730.1 - Changes in Oracle Clusterware on Linux with the 10.2.0.4 Patchset
http://dbdev.us.oracle.com/twiki/bin/view/Cluster/IOFencingHangcheckOprocd
http://rat.us.oracle.com/pls/htmldb/f?p=191:4:7293384680077836::NO:RP,4:P4_SUCCESS_FACTOR_ID:3C00CC5801C7AAE5E0401490CACF1BB5

Source: ITPUB Blog, link: http://blog.itpub.net/23135684/viewspace-706767/. If you repost this article, please cite the source; otherwise legal action may be pursued.
