ITPub博客

首页 > 数据库 > Oracle > oracle两节点RAC,由于gipc导致某节点crs无法启动问题分析

oracle两节点RAC,由于gipc导致某节点crs无法启动问题分析

原创 Oracle 作者:巡完南山巡南山 时间:2019-01-14 18:39:05 0 删除 编辑

两节点RAC,其中1 节点集群CRS无法启动。经过分析原因为2节点gipcd进程异常,导致节点之间无法正常通信,重启2节点gipcd.bin后问题得以恢复。 从现象来看,是 ora.crsd ora.evmd 无法启动,其他组件正常。

1.     检查和分析

1.1.   节点 1 集群 alert 日志

节点1集群日志13:08分时手动重启内容如下, 关于olsnodes.log无法删除的信息本环境中一直存在,此处信息可忽略。

 

2018-11-26   13:08:29.521:

[client(892)]CRS-0009:log   file   "/home/u01/app/grid/11.2.0/product/log/sxmms1/client/olsnodes.log"   reopened

2018-11-26   13:08:29.521:

[client(892)]CRS-0019:file   rotation terminated. log file: "/home/u01/app/grid/11.2.0/product/log/sxmms1/client/olsnodes.log"

2018-11-26   13:08:42.421:

[ohasd(903)]CRS-2112:The   OLR service started on node sxmms1.

2018-11-26   13:08:42.433:

[ohasd(903)]CRS-1301:Oracle   High Availability Service started on node sxmms1.

2018-11-26   13:08:42.433:

[ohasd(903)]CRS-8017:location:   /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0   errors occurred

2018-11-26   13:08:45.864:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(948)]CRS-2302:Cannot   get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).

2018-11-26   13:08:51.238:

[gpnpd(1118)]CRS-2328:GPNPD   started on node sxmms1.

2018-11-26   13:08:53.710:

[cssd(1184)]CRS-1713:CSSD   daemon is started in clustered mode

2018-11-26   13:08:55.508:

[ohasd(903)]CRS-2767:Resource   state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE

2018-11-26   13:08:55.509:

[ohasd(903)]CRS-2769:Unable   to failover resource 'ora.diskmon'.

2018-11-26   13:09:03.406:

[cssd(1184)]CRS-1707:Lease   acquisition for node sxmms1 number 1 completed

2018-11-26   13:09:04.658:

[cssd(1184)]CRS-1605:CSSD   voting file is online: ORCL:OCR2; details in   /home/u01/app/grid/11.2.0/product/log/sxmms1/cssd/ocssd.log.

2018-11-26   13:09:07.670:

[cssd(1184)]CRS-1601:CSSD   Reconfiguration complete. Active nodes are sxmms1 sxmms2 .

2018-11-26   13:09:09.989:

[ctssd(1269)]CRS-2407:The   new Cluster Time Synchronization Service reference node is host sxmms2.

2018-11-26   13:09:09.990:

[ctssd(1269)]CRS-2401:The   Cluster Time Synchronization Service started on host sxmms1.

2018-11-26   13:09:11.701:

[ohasd(903)]CRS-2767:Resource   state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE

2018-11-26   13:09:11.701:

[ohasd(903)]CRS-2769:Unable   to failover resource 'ora.diskmon'.

2018-11-26   13:10:08.710:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(1129)]CRS-5818:Aborted   command 'start' for resource 'ora.ctssd'. Details at (:CRSAGF00113:) {0:0:2}   in /home/u01/app/grid/11.2.0/product/log/sxmms1/agent/ohasd/orarootagent_root/orarootagent_root.log.

2018-11-26   13:10:12.714:

[ohasd(903)]CRS-2757:Command   'Start' timed out waiting for response from the resource 'ora.ctssd'. Details   at (:CRSPE00111:) {0:0:2} in /home/u01/app/grid/11.2.0/product/log/sxmms1/ohasd/ohasd.log.

[client(1584)]CRS-10001:26-Nov-18   13:10 ACFS-9391: Checking for existing ADVM/ACFS installation.

[client(1589)]CRS-10001:26-Nov-18   13:10 ACFS-9392: Validating ADVM/ACFS installation files for operating   system.

[client(1591)]CRS-10001:26-Nov-18   13:10 ACFS-9393: Verifying ASM Administrator setup.

[client(1594)]CRS-10001:26-Nov-18   13:10 ACFS-9308: Loading installed ADVM/ACFS drivers.

[client(1597)]CRS-10001:26-Nov-18   13:10 ACFS-9154: Loading 'oracleoks.ko' driver.

[client(1625)]CRS-10001:26-Nov-18   13:10 ACFS-9154: Loading 'oracleadvm.ko' driver.

[client(1653)]CRS-10001:26-Nov-18   13:10 ACFS-9154: Loading 'oracleacfs.ko' driver.

[client(1764)]CRS-10001:26-Nov-18   13:10 ACFS-9327: Verifying ADVM/ACFS devices.

[client(1773)]CRS-10001:26-Nov-18   13:10 ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.

[client(1777)]CRS-10001:26-Nov-18   13:10 ACFS-9156: Detecting control device '/dev/ofsctl'.

[client(1782)]CRS-10001:26-Nov-18   13:10 ACFS-9322: completed

2018-11-26   13:10:14.067:

[ohasd(903)]CRS-2807:Resource   'ora.asm' failed to start automatically.

2018-11-26   13:10:14.067:

[ohasd(903)]CRS-2807:Resource   'ora.crsd' failed to start automatically.

2018-11-26   13:10:14.067:

[ohasd(903)]CRS-2807:Resource   'ora.evmd' failed to start automatically.

2018-11-26   13:11:42.738:

[ohasd(903)]CRS-2765:Resource   'ora.ctssd' has failed on server 'sxmms1'.

2018-11-26   13:11:45.381:

[ctssd(2151)]CRS-2407:The   new Cluster Time Synchronization Service reference node is host sxmms2.

2018-11-26   13:11:45.382:

[ctssd(2151)]CRS-2401:The   Cluster Time Synchronization Service started on host sxmms1.

 

1.2.   节点 1 AGENT 分析

日志只截取了部分内容,从日志来看,几乎很多组件在启动时都出现了超时

/home/u01/app/grid/11.2.0/product/log/sxmms1/agent/ohasd/orarootagent_root/orarootagent_root.log

2018-11-26 13:10:06.792:   [ora.ctssd][2525660928]{0:0:2} [start] clsdmc_respget return: status=0,   ecode=0, returnbuf=[0x7f51780ce0c0], buflen=8

2018-11-26 13:10:06.792:   [ora.ctssd][2525660928]{0:0:2} [start] Start: Extended check return buffer:   "? with length of 8

2018-11-26 13:10:06.792:   [ora.ctssd][2525660928]{0:0:2} [start] translateReturnCodes, return = 0,   state detail = Checkcb data [0x7f51780ce0c0]: mode[0xc0] offset[0 ms].

[    clsdmc][2525660928]CLSDMC.C returnbuflen=8, extraDataBuf=C0,   returnbuf=7805FCE0

2018-11-26 13:10:07.793:   [ora.ctssd][2525660928]{0:0:2} [start] clsdmc_respget return: status=0,   ecode=0, returnbuf=[0x7f517805fce0], buflen=8

2018-11-26 13:10:07.793:   [ora.ctssd][2525660928]{0:0:2} [start] Start: Extended check return buffer:   "? with length of 8

2018-11-26 13:10:07.793:   [ora.ctssd][2525660928]{0:0:2} [start] translateReturnCodes, return = 0,   state detail = Checkcb data [0x7f517805fce0]: mode[0xc0] offset[0 ms].

2018-11-26 13:10:08.710: [   AGENT][2527762176]{0:0:2} {0:0:2} Created   alert : (:CRSAGF00113:) :  Aborting the   command: start for resource: ora.ctssd 1 1

2018-11-26 13:10:08.710:   [ora.ctssd][2527762176]{0:0:2} [start] (:CLSN00110:) clsn_agent::abort {

2018-11-26 13:10:08.710:   [ora.ctssd][2527762176]{0:0:2} [start] abort {

2018-11-26 13:10:08.710:   [ora.ctssd][2527762176]{0:0:2} [start] Agent::abort last call info:  "Agent::Agent refreshAttr"

2018-11-26 13:10:08.710: [ora.ctssd][2527762176]{0:0:2}   [start] abort command: start

2018-11-26 13:10:08.710:   [ora.ctssd][2527762176]{0:0:2} [start] tryActionLock {

[    clsdmc][2525660928]CLSDMC.C returnbuflen=8, extraDataBuf=C0,   returnbuf=780CE0C0

2018-11-26 13:10:08.795:   [ora.ctssd][2525660928]{0:0:2} [start] clsdmc_respget return: status=0,   ecode=0, returnbuf=[0x7f51780ce0c0], buflen=8

2018-11-26 13:10:08.795:   [ora.ctssd][2525660928]{0:0:2} [start] Start: Extended check return buffer:   "? with length of 8

2018-11-26 13:10:08.795: [ora.ctssd][2525660928]{0:0:2}   [start] translateReturnCodes, return = 0, state detail = Checkcb data   [0x7f51780ce0c0]: mode[0xc0] offset[0 ms].

2018-11-26 13:10:08.795:   [ora.ctssd][2525660928]{0:0:2} [start] Start action aborted

2018-11-26 13:10:08.796: [ora.ctssd][2525660928]{0:0:2}   [start] clsnUtils::error Exception type=2 string=

CRS-5017: The resource action "ora.ctssd   start" encountered the following error:

Start action for octssd aborted. For details   refer to "(:CLSN00107:)" in   "/home/u01/app/grid/11.2.0/product/log/sxmms1/agent/ohasd/orarootagent_root/orarootagent_root.log".

 

2018-11-26 13:10:08.796: [    AGFW][2525660928]{0:0:2} sending status   msg [CRS-5017: The resource action "ora.ctssd start" encountered   the following error:

Start action for octssd aborted. For details   refer to "(:CLSN00107:)" in   "/home/u01/app/grid/11.2.0/product/log/sxmms1/agent/ohasd/orarootagent_root/orarootagent_root.log".

] for start for resource: ora.ctssd 1 1

2018-11-26 13:10:08.796:   [ora.ctssd][2525660928]{0:0:2} [start] (:CLSN00107:) clsn_agent::start }

2018-11-26 13:10:08.797: [    AGFW][2523559680]{0:0:2} Agent sending   reply for: RESOURCE_START[ora.ctssd 1 1] ID 4098:361

2018-11-26 13:10:12.711:   [ora.ctssd][2527762176]{0:0:2} [start] got lock

2018-11-26 13:10:12.711: [ora.ctssd][2527762176]{0:0:2}   [start] tryActionLock }

2018-11-26 13:10:12.711:   [ora.ctssd][2527762176]{0:0:2} [start] abort    }

2018-11-26 13:10:12.711:   [ora.ctssd][2527762176]{0:0:2} [start] (:CLSN00110:) clsn_agent::abort }

2018-11-26 13:10:12.711: [    AGFW][2527762176]{0:0:2} Command: start   for resource: ora.ctssd 1 1 completed with status: TIMEDOUT

2018-11-26 13:10:12.712: [    AGFW][2523559680]{0:0:2} Agent sending   reply for: RESOURCE_START[ora.ctssd 1 1] ID 4098:361

[    clsdmc][2527762176]CLSDMC.C returnbuflen=8, extraDataBuf=C0,   returnbuf=84006B40

 

 

1.3.   节点 1CRSD 日志分析

CRSD 日志看到异常信息,但并未有特别的指向具体是哪里出现问题导致无法启动。

2018-11-26 13:03:59.877: [ CRSMAIN][3661960960]   Policy Engine is not initialized yet!

2018-11-26 13:04:29.881: [ CRSMAIN][3661960960]   Policy Engine is not initialized yet!

2018-11-26 13:04:59.886: [ CRSMAIN][3661960960]   Policy Engine is not initialized yet!

2018-11-26 13:05:29.890: [ CRSMAIN][3661960960]   Policy Engine is not initialized yet!

2018-11-26 13:05:59.895: [ CRSMAIN][3661960960]   Policy Engine is not initialized yet!

2018-11-26 13:06:29.897: [ CRSMAIN][3661960960]   Policy Engine is not initialized yet!

2018-11-26 13:11:55.357: [ CRSMAIN][3963266848]   First attempt: init CSS context succeeded.

[    clsdmt][3956815616]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=sxmms1DBG_CRSD))

2018-11-26 13:11:55.361: [  clsdmt][3956815616]PID for the Process   [2324], connkey 1

2018-11-26 13:11:55.361: [  clsdmt][3956815616]Creating PID [2324] file   for home /home/u01/app/grid/11.2.0/product host sxmms1 bin crs to   /home/u01/app/grid/11.2.0/product/crs/init/

2018-11-26 13:11:55.362: [  clsdmt][3956815616]Writing PID [2324] to   the file [/home/u01/app/grid/11.2.0/product/crs/init/sxmms1.pid]

2018-11-26 13:11:56.304: [ CRSMAIN][3956815616]   Policy Engine is not initialized yet!

2018-11-26 13:11:56.304: [ CRSMAIN][3963266848]   CRS Daemon Starting

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: allcomp  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: default  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: COMMCRS  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: COMMNS  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: CSSCLNT  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: GIPCLIB  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: GIPCXBAD  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: GIPCLXPT  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: GIPCUNDE  0

2018-11-26 13:11:56.305: [    CRSD][3963266848] Logging level for   Module: GIPC  0

1.4.   节点 1EVMD 日志分析

evmd 日志为早上无法启动时的信息,但后续内容一致,所以时间09:51的错误信息也可用来分析;日志显示CRS无法启动,gipc连接超时,由于另外节点都正常,所以排除OCR出现问题。

2018-11-26 09:51:56.086: [  OCRMSG][2406381344]prom_connect: error   while waiting for connection complete [24]

2018-11-26 09:51:56.086: [  CRSOCR][2406381344] OCR context init   failure.  Error: PROC-32: Cluster Ready   Services on the local node is not running Messaging error   [gipcretConnectionRefused] [29]

2018-11-26 09:51:57.088: [  OCRMSG][2406381344]prom_waitconnect: CONN   NOT ESTABLISHED (0,29,1,2)

2018-11-26 09:51:57.088: [  OCRMSG][2406381344]GIPC error [29] msg   [gipcretConnectionRefused]

2018-11-26 09:51:57.088: [  OCRMSG][2406381344]prom_connect: error   while waiting for connection complete [24]

2018-11-26 09:51:57.088: [  CRSOCR][2406381344] OCR context init   failure.  Error: PROC-32: Cluster Ready   Services on the local node is not running Messaging error   [gipcretConnectionRefused] [29]

2018-11-26 09:51:58.089: [  OCRMSG][2406381344]prom_waitconnect: CONN   NOT ESTABLISHED (0,29,1,2)

2018-11-26 09:51:58.089: [  OCRMSG][2406381344]GIPC error [29] msg   [gipcretConnectionRefused]

2018-11-26 09:51:58.089: [  OCRMSG][2406381344]prom_connect: error   while waiting for connection complete [24]

2018-11-26 09:51:58.089: [  CRSOCR][2406381344] OCR context init   failure.  Error: PROC-32: Cluster Ready   Services on the local node is not running Messaging error   [gipcretConnectionRefused] [29]

 

1.5.   节点 1GIPC 日志分析

分析gipc日志发现,在检测与2节点之间连接时发生超时,最终导致连接无法建立,gipc是RAC集群中负责私网的连接建立,当建立连接出现问题时就需要进一步分析。

2018-11-26 13:05:27.553: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics,   endp 0000000000001a67

2018-11-26 13:05:28.954: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000348

2018-11-26 13:05:29.108: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000723

2018-11-26 13:05:29.109: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms1

2018-11-26 13:05:29.110: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms2

2018-11-26 13:05:29.110: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: updating timeout node sxmms2

2018-11-26 13:05:29.110: [GIPCDMON][2818569984]   gipcdMonitorFailZombieNodes: skipping live node 'sxmms2', time 0 ms, endp   0000000000000000, 00000000000007ba

2018-11-26 13:05:30.624: [ CLSINET][2818569984]   Returning NETDATA: 1 interfaces

2018-11-26 13:05:30.624: [ CLSINET][2818569984]   # 0 Interface   'eth2',ip='192.168.0.1',mac='a0-36-9f-5d-5c-54',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26 13:05:31.960: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 000000000000032d

2018-11-26 13:05:32.478: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000121

2018-11-26 13:05:32.559: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000001a67

2018-11-26 13:05:33.550: [GIPCDMON][2818569984]   gipcdMonitorSaveInfMetrics: inf[ 0]    eth2                 - rank   99, avgms 1.111111 [ 151 / 117 / 117 ]

2018-11-26 13:05:33.964: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000348

2018-11-26 13:05:34.114: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000723

2018-11-26 13:05:34.115: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms1

2018-11-26 13:05:34.116: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms2

2018-11-26 13:05:34.116: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: updating timeout node sxmms2

2018-11-26 13:05:34.116: [GIPCDMON][2818569984]   gipcdMonitorFailZombieNodes: skipping live node 'sxmms2', time 0 ms, endp   0000000000000000, 00000000000007ba

2018-11-26 13:05:35.646: [ CLSINET][2818569984]   Returning NETDATA: 1 interfaces

2018-11-26 13:05:35.646: [ CLSINET][2818569984]   # 0 Interface 'eth2',ip='192.168.0.1',mac='a0-36-9f-5d-5c-54',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26 13:05:36.971: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 000000000000032d

2018-11-26 13:05:37.480: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000121

2018-11-26 13:05:37.565: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000001a67

2018-11-26 13:05:38.975: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000348

2018-11-26 13:05:39.120: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics,   endp 0000000000000723

2018-11-26 13:05:39.121: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms1

2018-11-26 13:05:39.122: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms2

2018-11-26 13:05:39.122: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: updating timeout node sxmms2

2018-11-26 13:05:39.122: [GIPCDMON][2818569984]   gipcdMonitorFailZombieNodes: skipping live node 'sxmms2', time 0 ms, endp   0000000000000000, 00000000000007ba

2018-11-26 13:05:40.626: [ CLSINET][2818569984]   Returning NETDATA: 1 interfaces

2018-11-26 13:05:40.626: [ CLSINET][2818569984]   # 0 Interface   'eth2',ip='192.168.0.1',mac='a0-36-9f-5d-5c-54',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26 13:05:41.982: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 000000000000032d

2018-11-26 13:05:42.482: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000121

2018-11-26 13:05:42.572: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000001a67

2018-11-26 13:05:43.986: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000348

2018-11-26 13:05:44.127: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000723

2018-11-26 13:05:44.128: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms1

2018-11-26 13:05:44.128: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms2

2018-11-26 13:05:44.128: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: updating timeout node sxmms2

2018-11-26 13:05:44.128: [GIPCDMON][2818569984]   gipcdMonitorFailZombieNodes: skipping live node 'sxmms2', time 0 ms, endp   0000000000000000, 00000000000007ba

2018-11-26 13:05:45.632: [ CLSINET][2818569984]   Returning NETDATA: 1 interfaces

2018-11-26 13:05:45.632: [ CLSINET][2818569984]   # 0 Interface   'eth2',ip='192.168.0.1',mac='a0-36-9f-5d-5c-54',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26 13:05:46.992: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 000000000000032d

2018-11-26 13:05:47.485: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000121

2018-11-26 13:05:47.578: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000001a67

2018-11-26 13:05:48.997: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000348

2018-11-26 13:05:48.997: [GIPCDCLT][2946344704]   gipcdClientInterfaceMetrics: Received type(gipcdmsgtypeInterfaceMetrics),   endp(0000000000000348), len(1032), buf(0x7f7ca0254448), inf(ip:   192.168.0.1:14459, mask: 255.255.255.0, subnet: 192.168.0.0, mac: , ifname: )   time(0), retry(0), stamp(0), send(0), recv(0)

2018-11-26 13:05:48.997: [GIPCDCLT][2946344704]   gipcdClientInterfaceMetrics: enqueue local interface metrics (1) to worklist

2018-11-26 13:05:49.133: [GIPCDCLT][2946344704]   gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceMetrics,   endp 0000000000000723

2018-11-26 13:05:49.133: [GIPCDCLT][2946344704]   gipcdClientInterfaceMetrics: Received type(gipcdmsgtypeInterfaceMetrics),   endp(0000000000000723), len(1032), buf(0x7f7ca0254448), inf(ip:   192.168.0.1:29489, mask: 255.255.255.0, subnet: 192.168.0.0, mac: , ifname: )   time(0), retry(0), stamp(0), send(0), recv(0)

2018-11-26 13:05:49.133: [GIPCDCLT][2946344704]   gipcdClientInterfaceMetrics: enqueue local interface metrics (1) to worklist

2018-11-26 13:05:49.134: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms1

2018-11-26 13:05:49.134: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: found node sxmms2

2018-11-26 13:05:49.135: [GIPCDMON][2818569984]   gipcdMonitorCssCheck: updating timeout node sxmms2

1.6.   连接建立超时分析

通过以上分析,一些资源无法启动皆是gipc建立建立超时导致,而私网出现问题可分为几种情况

1 、os层面网卡、网络配置、路由、网络交换机等出现问题;

2 、iptable、selinux等出现异常;

2 、两个节点gipc至少有一个进程存在异常;

通过1节点系统日志massage、netstat、iptables、selinux等方面的检查,确认系统层面网络无异常,私网也可以ping通,其他通过集群状态也可间接判断,因为1节点cssd服务正常,基本可以确认私网的连通性是正常的,而1节点的gipc日志中eht2网卡的rank值为99,也表名网卡状态正常,排除1节点的网卡和gipc后,需要查看2节点来进一步判断。

 

1.7.   节点 2 集群 ALERT 日志

节点2集群ALERT日志分析,13:08分节点1重启期间,节点2的gipc、evmd和ctss服务也出现了异常

2018-11-26   13:00:40.747:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   13:06:57.926:

[cssd(15973)]CRS-1625:Node   sxmms1, number 1, was manually shut down

2018-11-26   13:06:57.944:

[cssd(15973)]CRS-1601:CSSD   Reconfiguration complete. Active nodes are sxmms2 .

2018-11-26   13:09:08.318:

[cssd(15973)]CRS-1601:CSSD   Reconfiguration complete. Active nodes are sxmms1 sxmms2 .

2018-11-26   13:10:46.812:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   13:20:52.862:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   13:31:02.879:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   13:41:12.924:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   13:51:20.976:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   14:01:33.018:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   14:11:45.027:

[/home/u01/app/grid/11.2.0/product/bin/orarootagent.bin(6098)]CRS-5018:(:CLSN00037:)   Removed unused HAIP route:    169.254.95.0 / 255.255.255.0 / 0.0.0.0 / usb0

2018-11-26   14:20:10.218:

[ohasd(5572)]CRS-2765:Resource   'ora.gipcd' has failed on server 'sxmms2'.

2018-11-26   14:20:10.287:

[ohasd(5572)]CRS-2765:Resource   'ora.ctssd' has failed on server 'sxmms2'.

2018-11-26   14:20:10.870:

[ohasd(5572)]CRS-2765:Resource   'ora.evmd' has failed on server 'sxmms2'.

2018-11-26   14:20:11.060:

[/home/u01/app/grid/11.2.0/product/bin/scriptagent.bin(52396)]CRS-5822:Agent   '/home/u01/app/grid/11.2.0/product/bin/scriptagent_grid' disconnected from   server. Details at (:CRSAGF00117:) {0:11:97} in   /home/u01/app/grid/11.2.0/product/log/sxmms2/agent/crsd/scriptagent_grid/scriptagent_grid.log.

2018-11-26   14:20:11.060:

1.8.   节点 2 GIPC 日志分析

分析日志发现,rank值出现过0的情况,并且drop信息出现超时,同时2节点的系统日志、网卡以及netstat等信息无异常,至此已基本可以确认2节点的gipc进程异常导致的两节点间无法建立连接。

2018-11-26   13:07:41.526: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms2

2018-11-26   13:07:41.526: [GIPCDMON][4093638400] gipcdMonitorFailZombieNodes: skipping   live node 'sxmms1', time 737498576 ms, endp 0000000000000000,   00000000470e5359

2018-11-26   13:07:41.537: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000355

2018-11-26   13:07:42.226: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befdfb

2018-11-26   13:07:42.384: [ CLSINET][4093638400] Returning NETDATA: 1 interfaces

2018-11-26   13:07:42.384: [ CLSINET][4093638400] # 0 Interface   'eth2',ip='192.168.0.2',mac='a0-36-9f-5d-66-ec',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26   13:07:43.840: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000370

2018-11-26   13:07:44.077: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011bf00f6

2018-11-26   13:07:44.888: [GIPCDMON][4093638400] gipcdMonitorSaveInfMetrics: inf[ 0]  eth2                 - rank    0, avgms 30000000000.000000 [ 16 / 0 / 0   ]

2018-11-26   13:07:44.888: [GIPCDMON][4093638400] gipcdMonitorSaveInfMetrics: saving:  eth2:0

2018-11-26   13:07:45.542: [GIPCHALO][4091537152] gipchaLowerProcessNode: no valid   interfaces found to node for 737502596 ms, node 0x7f00d022b420 { host   'sxmms1', haName 'gipcd_ha_name', srcLuid 3de74e37-09e82f35, dstLuid   50a77767-85458162 numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq   [11888 : 11888], createTime 725590486, sentRegister 1, localMonitor 1, flags   0x4 }

2018-11-26   13:07:45.812: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000ca1

2018-11-26   13:07:46.529: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befc2f

2018-11-26   13:07:46.531: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms2

2018-11-26   13:07:46.531: [GIPCDMON][4093638400] gipcdMonitorFailZombieNodes: skipping   live node 'sxmms1', time 737503586 ms, endp 0000000000000000,   00000000470e5359

2018-11-26   13:07:46.544: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000355

2018-11-26   13:07:47.232: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befdfb

2018-11-26   13:07:47.232: [GIPCDCLT][4166940416] gipcdClientInterfaceMetrics: Received   type(gipcdmsgtypeInterfaceMetrics), endp(0000000011befdfb), len(1032),   buf(0x7f00e82872e8), inf(ip: 192.168.0.2:39280, mask: 255.255.255.0, subnet:   192.168.0.0, mac: , ifname: ) time(0), retry(0), stamp(0), send(0), recv(0)

2018-11-26   13:07:47.232: [GIPCDCLT][4166940416] gipcdClientInterfaceMetrics: enqueue   local interface metrics (1) to worklist

2018-11-26 13:07:47.380:   [ CLSINET][4093638400] Returning NETDATA: 1 interfaces

2018-11-26   13:07:47.380: [ CLSINET][4093638400] # 0 Interface   'eth2',ip='192.168.0.2',mac='a0-36-9f-5d-66-ec',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26 13:07:48.846:   [GIPCDCLT][4166940416] gipcdClientThread: req from local client of type   gipcdmsgtypeInterfaceMetrics, endp 0000000000000370

2018-11-26   13:07:49.083: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011bf00f6

2018-11-26   13:07:50.813: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000ca1

2018-11-26   13:07:51.534: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befc2f

2018-11-26   13:07:51.536: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms2

2018-11-26   13:07:51.536: [GIPCDMON][4093638400] gipcdMonitorFailZombieNodes: skipping   live node 'sxmms1', time 737508586 ms, endp 0000000000000000,   00000000470e5359

2018-11-26   13:07:51.550: [GIPCHALO][4091537152] gipchaLowerProcessNode: no valid   interfaces found to node for 737508606 ms, node 0x7f00d022b420 { host   'sxmms1', haName 'gipcd_ha_name', srcLuid 3de74e37-09e82f35, dstLuid   50a77767-85458162 numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq   [11894 : 11894], createTime 725590486, sentRegister 1, localMonitor 1, flags   0x4 }

2018-11-26   13:07:51.550: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000355

2018-11-26   13:07:52.237: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befdfb

2018-11-26   13:07:52.386: [ CLSINET][4093638400] Returning NETDATA: 1 interfaces

2018-11-26   13:07:52.386: [ CLSINET][4093638400] # 0 Interface   'eth2',ip='192.168.0.2',mac='a0-36-9f-5d-66-ec',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26   13:07:53.852: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000370

2018-11-26   13:07:54.089: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011bf00f6

2018-11-26   13:07:55.815: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000ca1

2018-11-26   13:07:56.540: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befc2f

2018-11-26   13:07:56.541: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms2

2018-11-26   13:07:56.541: [GIPCDMON][4093638400] gipcdMonitorFailZombieNodes: skipping   live node 'sxmms1', time 737513596 ms, endp 0000000000000000,   00000000470e5359

2018-11-26   13:07:56.556: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000355

2018-11-26   13:07:57.243: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befdfb

2018-11-26   13:07:57.381: [ CLSINET][4093638400] Returning NETDATA: 1 interfaces

2018-11-26   13:07:57.381: [ CLSINET][4093638400] # 0 Interface 'eth2',ip='192.168.0.2',mac='a0-36-9f-5d-66-ec',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26   13:07:57.556: [GIPCHALO][4091537152] gipchaLowerProcessNode: no valid   interfaces found to node for 737514606 ms, node 0x7f00d022b420 { host 'sxmms1',   haName 'gipcd_ha_name', srcLuid 3de74e37-09e82f35, dstLuid 50a77767-85458162   numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [11900 : 11900],   createTime 725590486, sentRegister 1, localMonitor 1, flags 0x4 }

2018-11-26   13:07:58.856: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000370

2018-11-26   13:07:59.095: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011bf00f6

2018-11-26   13:08:00.816: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000ca1

2018-11-26   13:08:01.544: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befc2f

2018-11-26   13:08:01.546: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms2

2018-11-26   13:08:01.546: [GIPCDMON][4093638400] gipcdMonitorFailZombieNodes: skipping   live node 'sxmms1', time 737518596 ms, endp 0000000000000000,   00000000470e5359

2018-11-26   13:08:01.561: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000355

2018-11-26   13:08:02.248: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befdfb

2018-11-26   13:08:02.401: [ CLSINET][4093638400] Returning NETDATA: 1 interfaces

2018-11-26   13:09:41.659: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms1

2018-11-26   13:09:41.659: [GIPCDMON][4093638400] gipcdMonitorCssCheck: updating timeout   node sxmms1

2018-11-26   13:09:41.659: [GIPCDMON][4093638400] gipcdMonitorCssCheck: found node sxmms2

2018-11-26   13:09:41.659: [GIPCDMON][4093638400] gipcdMonitorFailZombieNodes: skipping   live node 'sxmms1', time 0 ms, endp 0000000000000000, 00000000470e5359

2018-11-26   13:09:41.705: [GIPCHALO][4091537152] gipchaLowerDropMsg: dropping because of   sequence timeout, waited 30050, msg 0x7f00dc0a8478 { len 1160, seq 2, type   gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 1, flags 0x1,   srcLuid bcd67f68-d3658d20, dstLuid 00000000-00000000, msgId 1 }, node   0x7f00d02332c0 { host 'sxmms1', haName 'gipcd_ha_name', srcLuid   3de74e37-fe7418b0, dstLuid bcd67f68-d3658d20 numInf 1, contigSeq 0, lastAck   0, lastValidAck 0, sendSeq [30 : 30], createTime 737588686, sentRegister 1,   localMonitor 1, flags 0x0 }

2018-11-26   13:09:42.350: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000011befdfb

2018-11-26   13:09:42.427: [ CLSINET][4093638400] Returning NETDATA: 1 interfaces

2018-11-26   13:09:42.427: [ CLSINET][4093638400] # 0 Interface 'eth2',ip='192.168.0.2',mac='a0-36-9f-5d-66-ec',mask='255.255.255.0',net='192.168.0.0',use='cluster_interconnect'

2018-11-26   13:09:42.707: [GIPCHALO][4091537152] gipchaLowerDropMsg: dropping because of   sequence timeout, waited 30060, msg 0x7f00dc0c71b8 { len 1160, seq 3, type   gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 2, flags 0x1,   srcLuid bcd67f68-d3658d20, dstLuid 00000000-00000000, msgId 2 }, node   0x7f00d02332c0 { host 'sxmms1', haName 'gipcd_ha_name', srcLuid   3de74e37-fe7418b0, dstLuid bcd67f68-d3658d20 numInf 1, contigSeq 0, lastAck   0, lastValidAck 0, sendSeq [31 : 31], createTime 737588686, sentRegister 1,   localMonitor 1, flags 0x0 }

2018-11-26   13:09:42.708: [GIPCHALO][4091537152] gipchaLowerDropMsg: dropping because of   sequence timeout, waited 30060, msg 0x7f00dc038cc8 { len 1160, seq 3, type   gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 2, flags 0x1,   srcLuid bcd67f68-d3658d20, dstLuid 00000000-00000000, msgId 2 }, node   0x7f00d02332c0 { host 'sxmms1', haName 'gipcd_ha_name', srcLuid   3de74e37-fe7418b0, dstLuid bcd67f68-d3658d20 numInf 1, contigSeq 0, lastAck   0, lastValidAck 0, sendSeq [31 : 31], createTime 737588686, sentRegister 1,   localMonitor 1, flags 0x0 }

2018-11-26   13:09:42.708: [GIPCHALO][4091537152] gipchaLowerDropMsg: dropping because of   sequence timeout, waited 30060, msg 0x7f00dc04f288 { len 1160, seq 3, type   gipchaHdrTypeRecvEstablish (5), lastSeq 0, lastAck 0, minAck 2, flags 0x1,   srcLuid bcd67f68-d3658d20, dstLuid 00000000-00000000, msgId 2 }, node   0x7f00d02332c0 { host 'sxmms1', haName 'gipcd_ha_name', srcLuid   3de74e37-fe7418b0, dstLuid bcd67f68-d3658d20 numInf 1, contigSeq 0, lastAck   0, lastValidAck 0, sendSeq [31 : 31], createTime 737588686, sentRegister 1,   localMonitor 1, flags 0x0 }

2018-11-26   13:09:42.708: [GIPCHALO][4091537152] gipchaLowerProcessNode: no valid   interfaces found to node for 737619746 ms, node 0x7f00d02332c0 { host   'sxmms1', haName 'gipcd_ha_name', srcLuid 3de74e37-fe7418b0, dstLuid   bcd67f68-d3658d20 numInf 1, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq   [31 : 31], createTime 737588686, sentRegister 1, localMonitor 1, flags 0x4 }

2018-11-26   13:09:42.708: [GIPCDCLT][4166940416] gipcdClientThread: req from local client   of type gipcdmsgtypeInterfaceMetrics, endp 0000000000000355

2.     结论

   通过以上过程分析,1节点CRS无法启动的原因为2节点gipc进程异常,进而导致两个节点间无法正常建立连接,而gipc为RAC的重要组成部分,从11gR2(11.2.0.2)开始,oracle决定由集群自己来管理私网网卡,集群新特性gipc(Grid IPC)被介绍,这个新特性以守护进程gipcd.bin的形式存在于集群中,主要的功能如下:

1. 当集群启动时,发现集群的私网网卡,集群私网的信息是从gpnp profile中获得的。并对发现的私网接口进行检查;

2. 利用之前发现的私网网卡,发现集群中的其他节点,并和其他节点的私网网卡建立联系;

3. 如果集群配置了多块私网网卡,当某个节点的某一个/几个私网网卡出现问题时,离线有问题的私网,并通知其他节点;

确认gipcd.bin的作用后,其实1节点CRS无法启动的原因就已经找到,集群私网的连接是通过该进程实现,但2节点gipc进程处于异常状态,所以1节点经过多次重启始终无法加入集群。

3.     解决方案

从以上分析确定为2节点gipc进程异常导致的1节点CRS无法正常启动,虽然gipc是私网连接使用,但它本身的重启不会导致集群异常,所以通过手动kill -9 gipcd.bin进程,随后gipc进程会自动启动,而1节点也随之启动成功。


来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/26964624/viewspace-2557324/,如需转载,请注明出处,否则将追究法律责任。

请登录后发表评论 登录
全部评论

注册时间:2016-04-11

  • 博文量
    44
  • 访问量
    43277