Data Collecting for Troubleshooting (CRS or GI) And (RAC) Issues

Original post. Category: Linux Operating System. Author: renjixinchina. Date: 2013-09-24 14:58:48

In this Document (Doc ID 289690.1)

Purpose
 File Formats for Data Uploaded to Oracle Support
Troubleshooting Steps
 1. Data Gathering for All Oracle Clusterware Issues
 2. Data Gathering for Node Reboot/Eviction
 3. Data Gathering for All Real Application Cluster Issues
 4. Data Gathering for Real Application Cluster Performance/Hang Issues
 5. Data Gathering for Oracle Clusterware Installation Issues
 5.1. Failure before executing root script
 5.2. Failure while or after executing root script
 Appendix A. RDA
 Appendix B. OS logs
 Appendix C. systemstate and hanganalyze in RAC
References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.1.0.2 to 11.2.0.3 [Release 10.1 to 11.2]
Information in this document applies to any platform.

PURPOSE

This note lists what to collect for different types of Oracle Clusterware and Real Application Clusters issues. Uploading every file is not mandatory to open an SR; however, uploading all relevant information will speed up the resolution.

File Formats for Data Uploaded to Oracle Support


Oracle Support requests that you upload compressed files grouped together by node and labeled as such in a standard format, such as .tar, .gz, .Z or .zip.

Older runs of diagcollection or any other old files (e.g. if diagcollection was run a few days or weeks ago) may not contain current log information, which can delay the resolution.
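
As an illustration of the "grouped by node and labeled" request above, the following is a minimal sketch and not an Oracle-supplied tool. The function name bundle_for_node and the archive naming pattern diag_<node>_<timestamp>.tar.gz are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Oracle tool): bundle the given files
# or directories into one compressed archive whose name carries the node
# name and a timestamp, so uploads from different nodes stay distinguishable.
# Usage: bundle_for_node <output-dir> <path>...
bundle_for_node() {
    out_dir=$1
    shift
    node=$(uname -n)                 # node name, normally the same as `hostname`
    stamp=$(date +%Y%m%d%H%M%S)
    archive="$out_dir/diag_${node}_${stamp}.tar.gz"
    # Stream through gzip so this also works with tar versions lacking -z.
    tar cf - "$@" | gzip > "$archive" || return 1
    echo "$archive"                  # print the archive path for the caller
}
```

Run it once on each node, for example `bundle_for_node /tmp crsData_*.tar.gz`, and upload the resulting archives. The input file name in that call is only a placeholder.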

TROUBLESHOOTING STEPS

 

1. Data Gathering for All Oracle Clusterware Issues

Provide current diagcollection output from all nodes in the cluster.

Note 330358.1 - CRS 10gR2/ 11gR1/ 11gR2 Diagnostic Collection Guide
Note 272332.1 - CRS 10gR1 Diagnostic Collection Guide


2. Data Gathering for Node Reboot/Eviction

Provide the files listed in Section "Data Gathering for All Oracle Clusterware Issues" and the following:

  • Approximate date and time of the reboot, and the hostname of the rebooted node
  • OSWatcher archives which cover the reboot time at an interval of 20 seconds with private network monitoring configured.   
Note 301137.1 - OS Watcher User Guide
Note 433472.1 - OS Watcher For Windows (OSWFW) User Guide
  • For pre-11.2, zip of /var/opt/oracle/oprocd/* or /etc/oracle/oprocd/*
  • For pre-11.2, OS logs - refer to Section Appendix B
  • For 11gR2+, zip of /etc/oracle/lastgasp/* or /var/opt/oracle/lastgasp/*
  • CHM/OS data that covers the reboot time for platforms where it is available, refer to Note 1328466.1 for section "How do I collect the Cluster Health Monitor data"
  • If vendor clusterware is being used, upload the vendor clusterware logs
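
Because the oprocd and lastgasp directories live in one of two locations depending on platform, a small wrapper can archive whichever location exists. This is a sketch only: the function name collect_dirs and the archive name in the usage note are assumptions, and it uses tar rather than zip.

```shell
#!/bin/sh
# Hypothetical helper: archive only those candidate directories that exist.
# Usage: collect_dirs <archive.tar> <dir>...
collect_dirs() {
    archive=$1
    shift
    # Rebuild the argument list, keeping only existing directories.
    for d in "$@"; do
        shift
        [ -d "$d" ] && set -- "$@" "$d"
    done
    if [ $# -eq 0 ]; then
        echo "none of the candidate directories exist" >&2
        return 1
    fi
    tar cf "$archive" "$@"
}
```

For 11gR2 this could be called as `collect_dirs lastgasp.$(uname -n).tar /etc/oracle/lastgasp /var/opt/oracle/lastgasp`, and for pre-11.2 with the two oprocd paths instead.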

 

3. Data Gathering for All Real Application Cluster Issues

From all nodes:

  • Provide instance alert_${ORACLE_SID}.log, lmon, lmd*, lms*, ckpt, lgwr, lck*, dia*, lmhb (11g only), and all other traces that were modified around the incident time. A quick way to identify all such traces and tar them up is to search for the incident time, as in the following example:
$ grep "2010-09-02 03" *.trc | awk -F: '{print $1}' | sort -u |xargs tar cvf trace.`hostname`.`date +%Y%m%d%H%M%S`.tar

$ gzip trace*.tar

For pre-11g, execute the command in bdump and udump to identify the list of files.

For 11g+, execute the command in ${ORACLE_BASE}/diag/rdbms/$DBNAME/${ORACLE_SID}/trace to identify the list of files
  • Incident files/packages in alert.log at time of the incident
  • If ASM is involved, provide same set of files for ASM
  • OS logs - refer to Appendix B
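
The trace selection in the first bullet above can also be written as a small function, which makes it easy to check the list of files before archiving them. This is a sketch under assumptions: the name traces_matching is hypothetical, and it relies on the trace files containing timestamps in the same "YYYY-MM-DD HH" form as the example.

```shell
#!/bin/sh
# Hypothetical helper: list the *.trc files in a directory that contain the
# given timestamp prefix, one file name per line.
# Usage: traces_matching "<YYYY-MM-DD HH>" [trace-dir]
traces_matching() {
    stamp=$1
    dir=${2:-.}
    # -l prints each matching file name once; -F treats the stamp literally.
    grep -lF -- "$stamp" "$dir"/*.trc 2>/dev/null
}
```

For example, `traces_matching "2010-09-02 03"` run in the trace directory prints the same file list that the grep/awk/sort pipeline feeds to tar.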

 

4. Data Gathering for Real Application Cluster Performance/Hang Issues

Provide files in Section "Data Gathering for All Real Application Cluster Issues" and the following:

  • systemstate and hanganalyze - refer to Appendix C
  • awr, addm and ash report, each report covers a period no more than 60 minutes
  • OSWatcher archives which cover the hang time
Note 301137.1 - OS Watcher User Guide
Note 433472.1 - OS Watcher For Windows (OSWFW) User Guide
  • CHM/OS data that covers the hang time for platforms where it is available, refer to Note 1328466.1 for section "How do I collect the Cluster Health Monitor data"

 

5. Data Gathering for Oracle Clusterware Installation Issues

5.1. Failure before executing root script

For 11gR2: note 1056322.1 - Troubleshoot 11gR2 Grid Infrastructure/RAC Database runInstaller Issues

For pre-11.2: note 406231.1 - Diagnosing RAC/RDBMS Installation Problems

5.2. Failure while or after executing root script

Provide files in Section "Data Gathering for All Oracle Clusterware Issues" and the following:

  • Screen output of the root script (root.sh or rootupgrade.sh)
  • For 11gR2: provide zip of <$ORACLE_BASE>/cfgtoollogs and <$ORACLE_BASE>/diag for grid user.
  • For pre-11.2: Note 240001.1 - Troubleshooting 10g or 11.1 Oracle Clusterware Root.sh Problems



Appendix A. RDA

It is recommended to provide the latest RDA output for all issues, from all nodes in the cluster.

Note 314422.1 - Remote Diagnostics Agent (RDA)


Appendix B. OS logs

OS logs are found in the following locations, depending on platform:

Linux: /var/log/messages

AIX: /bin/errpt -a (redirect this to a file called messages.out)

Solaris: /var/adm/messages

HP-UX: /var/adm/syslog/syslog.log

Tru64: /var/adm/messages

Windows: save Application Log and System Log as .TXT files using Event Viewer


Note: From 11gR2, OS logs are part of diagcollection on Linux, Solaris, HP-UX.
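
The platform list above can be expressed as a lookup keyed on the output of `uname -s`. This is a sketch only: the function name os_log_source is hypothetical, and the mapping of Solaris to "SunOS" and Tru64 to "OSF1" reflects what `uname -s` is assumed to print on those systems. Windows is not covered because the logs there are exported from Event Viewer.

```shell
#!/bin/sh
# Hypothetical helper: print where the OS log comes from on a given platform.
# With no argument, the platform is taken from `uname -s`.
# Note that for AIX the result is a command to run, not a file path.
os_log_source() {
    os=${1:-$(uname -s)}
    case "$os" in
        Linux)  echo "/var/log/messages" ;;
        AIX)    echo "/bin/errpt -a" ;;               # redirect to messages.out
        SunOS)  echo "/var/adm/messages" ;;           # Solaris
        HP-UX)  echo "/var/adm/syslog/syslog.log" ;;
        OSF1)   echo "/var/adm/messages" ;;           # Tru64
        *)      echo "unknown platform: $os" >&2; return 1 ;;
    esac
}
```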


Appendix C. systemstate and hanganalyze in RAC


To collect hanganalyze and systemstate in RAC, execute the following on one instance to generate cluster wide dumps:

a - Connect to sqlplus as sysdba: "sqlplus / as sysdba"; 
if this does not work, use "sqlplus -prelim / as sysdba"

b - Execute the following commands:

  • For 11g+
SQL> oradebug setospid <OS pid of the diag process>
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3
##..Wait about 2 minutes 
SQL> oradebug -g all hanganalyze 3
SQL> oradebug -g all dump systemstate 258


If possible, take another one at level 266 instead of 258


If the SGA is large or the fix for bug 11800959 (fixed in 11.2.0.2 DB PSU5, 11.2.0.3 and above) is not applied, level 266 could take a very long time, generate a huge trace file, and may not finish for hours.
  • For 10g
SQL> oradebug setospid <OS pid of the diag process>
SQL> oradebug unlimit
SQL> oradebug -g all dump systemstate 266
##..Wait about 2 minutes
SQL> oradebug -g all dump systemstate 266

Please upload *diag* trace from either bdump or trace directory.
  • If diag trace is huge or "oradebug -g all ..." command is hanging, please collect system state dump from each instance individually at similar time:
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug hanganalyze 3
##..Wait about 2 minutes 
SQL> oradebug hanganalyze 3
SQL> oradebug dump systemstate 258
SQL> oradebug tracefile_name

      Please upload the trace file listed above.


  • If "sqlplus -prelim / as sysdba" does not work, refer to note 359536.1 Step "1.)  Using OS debuggers like dbx or gdb" to take the system state dumps on all nodes.

 If ASM is involved, collect hanganalyze and systemstate from ASM using the instructions above.
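
When the per-instance sequence above has to be run on several instances at a similar time, it can help to generate the command list once and pipe it into SQL*Plus on each node. The following is a sketch only, not part of the Oracle note: the function name oradebug_script is hypothetical, and using the SQL*Plus HOST command with `sleep 120` to implement the two-minute wait is an assumption of this example.

```shell
#!/bin/sh
# Hypothetical helper: print the per-instance hanganalyze/systemstate
# command sequence for SQL*Plus. Only the two systemstate levels mentioned
# in this note (258 and 266) are accepted.
# Usage: oradebug_script [258|266]
oradebug_script() {
    level=${1:-258}
    case "$level" in
        258|266) ;;
        *) echo "unsupported systemstate level: $level" >&2; return 1 ;;
    esac
    cat <<EOF
oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
host sleep 120
oradebug hanganalyze 3
oradebug dump systemstate $level
oradebug tracefile_name
exit
EOF
}
```

A possible invocation is `oradebug_script 258 | sqlplus -S / as sysdba`, with `-prelim` added if a normal connection does not work. The trace file name printed by the last oradebug command is the file to upload.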

Database - RAC/Scalability Community
To discuss this topic further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Database - RAC/Scalability Community

Source: "ITPUB Blog", link: http://blog.itpub.net/15747463/viewspace-773297/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
