ITPub博客

首页 > 数据库 > Oracle > OGG-01705 CHECKPOINT RBA IS LARGER THAN LOCAL TRAIL SIZE问题

OGG-01705 CHECKPOINT RBA IS LARGER THAN LOCAL TRAIL SIZE问题

原创 Oracle 作者:royevictory 时间:2016-11-16 16:36:16 0 删除 编辑
一.背景说明
2014/09/19,某客户工程师联系说dptent进程(该进程为entry schema成都到灾备中心的data pump进程)无法启动。报错信息如下:
2014-09-19 09:22:28  ERROR   OGG-01705  Input checkpoint position 68047491 for input trail file './dirdat/et001646' is greater than the size of the file (68047295).  Please consult Oracle Knowledge Management Doc ID 1138409.1. for instructions.


经确认,09/18日晚上该主机意外down机过,机器启动后因本错误而无法启动goldengate。而在goldengate中,datapump和extract在交换数据的时候data pump是从cache区域中去抓取数据传送到目标端(而不是等到真的写到磁盘后,可以提高性能)。当意外down机时系统来不及将cache中的内容写到磁盘,出现了datapump新建的检查点是基于cache中的信息更新的,而trail文件的大小实际上是要比检查点写的RBA小。当下次启动时,前一个进程进行恢复动作,并将比文件大的一部分内容写到了下一个trail文件中(extract启动的时候会etrollover)。所以,要么将进程的检查点跳到下一个trail的指定RBA,要么重新初始化。确认问题步骤:
1.确认ggerr.log日志文件报错信息是否包含:
ERROR OGG-01705 Input checkpoint position 160374765 for input trail file '/ogg-data/ogg/itmscmp/rmtdir/pe027505' is greater than the size of the file (160301092).

如果gg版本高于11.2.0.1还应该留意报错(在另外一个客户处遇到过此现象):
 ERROR OGG-01091 Unable to open file "/u01/app/goldengate/dirdat/rt000068" (error 2, No such file or directory).

2.确认进程当前检查点位置:
REPLICAT XXXX Last Started 2012-09-07 14:28 Status ABENDED
Checkpoint Lag 00:00:00 (updated 02:05:27 ago)
Log Read Checkpoint File /u01/app/goldengate/dirdat/rt000067
2012-09-06 22:32:04.123161 RBA 197650300

3.检查trail文件大小(通常RBA的位置已经超过了trail大小):
-rw-rw-rw- 1 oracle dba 197640272 Sep 6 22:43 /u01/app/goldengate/dirdat/rt000067
-rw-rw-rw- 1 oracle dba 83807478  Sep 7 12:46 /u01/app/goldengate/dirdat/rt000068


出现CHECKPOINT RBA IS LARGER THAN LOCAL TRAIL SIZE的问题可能出现的地方还有replicat进程,所以本文从rep和datapump两个方面进行说明.


二.Replicat进程检查点RBA比trail文件大:
如果在目标端(版本高于11.2.1.0.7,会有新的启动选项)的replicat进程出现了类似的情况,在使用了checkpoint table的情况下我们可以通过: FILTERDUPTRANSACTIONS 选项启动replicat。该选项的原理是:使用checkpoint table中的相关信息启动replicat并过滤掉重复的数据,不同情况处理方式如下:
1).OGG版本低于11.2.1.0.1:
无法使用此参数,因为低于此版本的checkpointtable中并不包含所需信息。处理方式需要参照后面的datapump进程的处理方式.

2).OGG版本介于11.2.1.0.1到11.2.1.0.7的:
升级到11.2.1.0.7以上版本并使用filterduptransactions参数启动Replicat进程:
start replicat <rep name> filterduptransactions

3).OGG版本高于11.2.1.0.7的:
直接使用该参数启动rep进程:
start replicat <rep name> filterduptransactions

使用该参数启动rep进程后,该进程报告文件会有类似如下信息:
2014-05-16 12:21:05 INFO OGG-05664 CSN-based filtering suppressed 534 duplicate transactions, from trail Seqno 30, RBA 1874 to Seqno 30, RBA 786118. The first transaction suppressed has CSN 473374133 and transaction ID 35.25.204083, and the last suppressed one has CSN 473375032 and transaction ID 39.22.203971.

三.datapump进程检查点比trail文件大:
基于概述中的原理说明,我们可以知道手工恢复的时候是需要找到seqno为X的这个trail比RBA小的这部分内容在X+1这个Trail中的具体位置。有三种方法可以得到这个结果:

3.1.方法一:
通过公式:进程新RBA=报错信息的过大RBA(变量A)+新trail中的第一条记录的RBA(变量B)-进程失败trail中包含下个trail第一条记录事务的起始RBA(变量C)
1.查看报错信息中(或者INFO查看进程当前RBA)为:变量A;
REPLICAT REPDB Last Started 2011-11-28 07:51 Status ABENDED
Checkpoint Lag 00:00:02 (updated 02:00:18 ago)
Log Read Checkpoint File ./dirdat/pe027505
2011-11-28 05:54:08.002168 RBA 160374765(变量A)

或者:
OGG-01705 Input checkpoint position 160374765 for input trail file '/ogg-data/ogg/itmscmp/rmtdir/pe027505' is greater than the size of the file (160301092)

2.使用Logdump工具找到下一个trail中的第一条记录:变量B;
Logdump 56 >open ./dirdat/pe027506
Current LogTrail is ./dirdat/pe027506
Logdump 57 >ghdr on
Logdump 59 >n
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x00)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 0 (x0000) IO Time : 2011/11/28 06:18:54.007.851
IOType : 150 (x96) OrigNode : 0 (x00)
TransInd : . (x03) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 0 AuditPos : 0
Continued : N (x00) RecCount : 0 (x00)

2011/11/28 06:18:54.007.851 RestartAbend Len 0 RBA 1477
Name:
After Image: Partition 0 G s

Logdump 60 >n
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x04)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 114 (x0072) IO Time : 2011/11/28 05:54:08.004.543
IOType : 15 (x0f) OrigNode : 255 (xff)
TransInd : . (x00) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 14735 AuditPos : 272667152
Continued : N (x00) RecCount : 1 (x01)

2011/11/28 05:54:08.004.543 FieldComp Len 114 RBA 1537(变量B)
Name: SCHEMA.XXXX
After Image: Partition 4 G b
0000 000c 0000 0008 3130 3532 3030 3839 0001 000c | ........10520089....
0000 0008 3130 3534 3639 3532 0002 0008 0000 0004 | ....10546952........
3633 3737 0003 0010 0000 000c 4461 7669 732c 204d | 6377........Jacob, M
696c 6573 0004 0015 0000 3230 3039 2d30 342d 3235 | iles......2009-05-25
3a30 323a 3338 3a35 3900 0500 1500 0032 3031 312d | :02:38:59......2011-
3131 2d32 383a 3035 3a35 343a 3037 | 11-28:05:54:07

:如果新trail没有第一条记录,则依次往后找,直到找到有记录的trail为止.

3.找到新trail事务在旧trail中的起始位置:变量C:
1).通过步骤2,我们找到了新trail的第一条记录RBA为1537.而这条记录对应的AuditRBA=14735, AuditPos=272667152 这表明此事务在Oracle数据库中的14735号redo的 RBA:272667152处。另外还需要记录此记录的表名。

2).在旧的trail中查找以上事务的起始点:
Logdump>open 进程ABEND的trail
Logdump>ghdr on
Logdump>filter include AuditRBA <number>
Logdump>filter include filename <table_name>
Logdump>filter match all
Logdump>n

找到记录:
Hdr-Ind : E (x45) Partition : . (x04)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 114 (x0072) IO Time : 2011/11/28 05:54:08.000.919
IOType : 15 (x0f) OrigNode : 255 (xff)
TransInd : . (x00) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 14735 AuditPos : 272667152 ------------ same as the one in step 2
Continued : N (x00) RecCount : 1 (x01)
2011/11/28 05:54:08.000.919 FieldComp Len 114 RBA 160284255 -->我们要找的RBA
Name: SCHEMA.XXXX
After Image: Partition 4 G b
0000 000c 0000 0008 3130 3532 3030 3839 0001 000c | ........10520089....
0000 0008 3130 3534 3639 3532 0002 0008 0000 0004 | ....10546952........
3633 3737 0003 0010 0000 000c 4461 7669 732c 204d | 6377........Jacob, M
696c 6573 0004 0015 0000 3230 3039 2d30 342d 3235 | iles......2009-05-25
3a30 323a 3338 3a35 3900 0500 1500 0032 3031 312d | :02:38:59......2011-
3131 2d32 383a 3035 3a35 343a 3037 | 11-28:05:54:07

3).获得此条记录的RBA为(变量C):160284255

4.计算进程应该在新trial的哪个位置:
应该在新trail的RBA=160374765+1537-160284255=92047

5.基于新的RBA启动进程:
Ggsci> alter rep < rep name>, extseqno 27506, extrba 92047.


3.2.方法二:
基于公式:进程新RBA=报错信息的过大RBA(变量A)+新trail中的第一条记录的RBA(变量B)-旧Trail的文件大小(变量C)
1.获取变量A:同方法一中办法

2.获取变量B:同方法一中办法

3.获取变量C:文件系统查看:
OGG-01705 Input checkpoint position 160374765(变量A) for input trail file '/ogg-data/ogg/itmscmp/rmtdir/pe027505' is greater than the size of the file (160301092:变量C)

4.计算新进程RBA:
160374765+1537-160301092=75210

5.基于新的RBA启动:
Ggsci> alter rep < rep name>, extseqno 27506, extrba 75210


3.3.方法三:
通过以上两个方法可见,在相同的案例中不同的方法结果也不一样。也就是说不一定有效,如果以上两个方法都无效才能使用此方法(如果数据量不大最好重新初始化)。
此方法参照MOS

This option should only be used if option 1 and option 2 failed.
Please refer the bug- 13955891

A live example which explains this scenario:

1) GGSCI (OGGORACLE)1> info rep01prd, detail

REPLICAT   REP01PRD  Last Started 2012-04-11 12:05   Status ABENDED
Checkpoint Lag       00:00:00 (updated 08:24:18 ago)
Log Read Checkpoint  File /opt/GoldenGateV11/dirdat/rp002289
                     2012-04-11 04:38:15.998847  RBA 499003921

2. sh>ls -tlr dirdat

-rw-rw-rw- 1 oggadm dba 499999882 Apr 11 01:52 /opt/GoldenGateV11/dirdat/rp002288
-rw-rw-rw- 1 oggadm dba 498003983 Apr 11 06:15 /opt/GoldenGateV11/dirdat/rp002289 << where the checkpoint is pointing
-rw-rw-rw- 1 oggadm dba 499999920 Apr 11 06:16 /opt/GoldenGateV11/dirdat/rp002290 << the next available trail file

Here the actual trail file size, End of File (EOF), is 498003983 where the replicat read checkpoint rba is at 499003921

The formula to find the correct RBA is

New datapump / Replicat RBA = Reader's too-big checkpoint RBA(A) - Repaired record RBA in existing trail (B)+ First record RBA in the new trail file (after the restart abend)(c)

Let us call
Reader's too-big checkpoint RBA -- A
Repaired record RBA in existing trail -- B
First record RBA in the new trail file (after the restart abend) -- C

We know the Reader's too-big checkpoint RBA (A) =  499003921( from the info output) -- step 1
We could get the Repaired record RBA in existing trail-- B -- step 2
----------------------------------------------------------
Logdump 979 >open rp002289
Current LogTrail is C:\GG\V11111\rp002289
Logdump 980 >ghdr on
Logdump 981 >count
LogTrail C:\GG\V11111\rp002289 has 2352689 records
Total Data Bytes         272734228
  Avg Bytes/Record             115
Delete                       14237
Insert                     2192433
FieldComp                   146014
GGSPKUpdate                      1
LargeObject                      3
Repaired Records                 1
Others                           1
Before Images                14237
After Images               2338451

Average of 265671 Transactions
    Bytes/Trans .....       1451
    Records/Trans ...          8
    Files/Trans .....          0

Logdump 982 >skip 2352688
Logdump 983 >n
___________________________________________________________________
Hdr-Ind    :     E  (x45)     Partition  :     .  (x04)
UndoFlag   :     .  (x00)     BeforeAfter:     A  (x41)
RecLength  :   251  (x00fb)   IO Time    : 2012/04/11 05:38:08.999.658
IOType     :     5  (x05)     OrigNode   :   255  (xff)
TransInd   :     .  (x03)     FormatType :     R  (x52)
SyskeyLen  :     0  (x00)     Incomplete :     .  (x00)
AuditRBA   :     336307       AuditPos   : 89724944
Continued  :     N  (x00)     RecCount   :     1  (x01)

2012/04/11 05:38:08.999.658 Insert               Len   251 RBA 498003230
Name: PRODDTA.F0911T
After  Image:                                             Partition 4   G  s
0000 0004 0000 4841 0001 000a 0000 0006 3739 3436 | ......HA........7946
3536 0002 0007 0000 3030 3333 3000 0300 0a00 0000 | 56......00330.......
0000 0000 01b5 e600 0400 0400 0041 4100 0500 0500 | .............AA.....
0000 0138 0006 0004 0000 2020 0007 0003 0000 2000 | ...8......  ...... .
0800 0300 0020 0009 0003 0000 2000 0a00 0300 0020 | ..... ...... ......
000b 0003 0000 2000 0c00 0300 0020 000d 0003 0000 | ...... ...... ......
2000 0e00 0300 0020 000f 0003 0000 2000 1000 0300 |  ...... ...... .....

Logdump 984 >n

                            Repaired             Len   369 RBA 498003606
4800 0031 4504 0041 00fb 05ff 02f1 e397 7402 50ea | H..1E..A........t.P.
0000 0000 0559 2010 0005 21b3 0352 0000 0001 5052 | .....Y ...!..R....PR
4f44 4454 412e 4630 3931 3154 0044 0000 fb00 0000 | ODDTA.F0911T.D......
0400 0048 4100 0100 0a00 0000 0637 3934 3635 3600 | ...HA........794656.
0200 0700 0030 3033 3330 0003 000a 0000 0000 0000 | .....00330..........
0001 b5e6 0004 0004 0000 4141 0005 0005 0000 0001 | ..........AA........
3900 0600 0400 0020 2000 0700 0300 0020 0008 0003 | 9......  ...... ....
0000 2000 0900 0300 0020 000a 0003 0000 2000 0b00 | .. ...... ...... ...
0300 0020 000c 0003 0000 2000 0d00 0300 0020 000e | ... ...... ...... ..
0003 0000 2000 0f00 0300 0020 0010 0003 0000 2000 | .... ...... ...... .
1100 0300 0020 0012 000e 0000 2020 2020 2020 2020 | ..... ......
2020 2020 0013 0003 0000 2000 1400 0e00 0020 2020 |     ...... ......
2020 2020 2020 2020 2000 1500 0300 0020 0016 000e |          ...... ....
0000 2020 2020 2020 2020 2020 2020 0017 0003 0000 | ..            ......
2000 1800 0e00 0020 2020 2020 2020 2020 2020 2000 |  ......            .
1900 0500 0000 0130 5400 0039 5200 0014 4141 4142 | .......0T..9R...AAAB
3936 4143 5441 4147 3156 5541 417a 0001 4c00 000b | 96ACTAAG1VUAAz..L...
3731 3331 3831 3530 3739 3636 0000 0e31 312e 0000 | 713181507966...11...
0000 0000 0000 0000 00                            | .........

Logdump 985 >

-----------------------------------------------
Here we could get the RBA for (B) as 498003606

-----------------------------------------------------------------------------------------------------

First record RBA in the new trail file (after the restart abend)(c) --- Step 3

-----------------------------------------------------------------------------------------------------

Logdump 985 >open rp002289
LogTrail C:\GG\V11111\rp002289 closed
Current LogTrail is C:\GG\V11111\rp002290
Logdump 986 >n

2012/04/11 07:15:49.756.239 FileHeader           Len  1307 RBA 0
Name: *FileHeader*
3000 01f3 3000 0008 4747 0d0a 544c 0a0d 3100 0002 | 0...0...GG..TL..1...
0002 3200 0004 2000 0000 3300 0008 02f1 e398 d156 | ..2... ...3........V
5d4f 3400 001e 001c 7572 693a 7468 6562 653a 3a6f | ]O4.....uri:thebe::o
7074 3a47 6f6c 6465 6e47 6174 6556 3131 3500 0044 | pt:GoldenGateV115..D
3500 001e 001c 7572 693a 7468 6562 653a 3a6f 7074 | 5.....uri:thebe::opt
3a47 6f6c 6465 6e47 6174 6556 3131 3500 001e 001c | :GoldenGateV115.....
7572 693a 7468 6562 653a 3a6f 7074 3a47 6f6c 6465 | uri:thebe::opt:Golde

Logdump 987 >n
___________________________________________________________________
Hdr-Ind    :     E  (x45)     Partition  :     .  (x00)
UndoFlag   :     .  (x00)     BeforeAfter:     A  (x41)
RecLength  :     0  (x0000)   IO Time    : 2012/04/11 07:15:49.762.966
IOType     :   150  (x96)     OrigNode   :     0  (x00)
TransInd   :     .  (x03)     FormatType :     R  (x52)
SyskeyLen  :     0  (x00)     Incomplete :     .  (x00)
AuditRBA   :          0       AuditPos   : 0
Continued  :     N  (x00)     RecCount   :     0  (x00)

2012/04/11 07:15:49.762.966 RestartAbend         Len     0 RBA 1315
Name:
After  Image:                                             Partition 0   G  s

Logdump 988 >n
___________________________________________________________________
Hdr-Ind    :     E  (x45)     Partition  :     .  (x04)
UndoFlag   :     .  (x00)     BeforeAfter:     A  (x41)
RecLength  :   251  (x00fb)   IO Time    : 2012/04/11 05:38:09.014.430
IOType     :     5  (x05)     OrigNode   :   255  (xff)
TransInd   :     .  (x03)     FormatType :     R  (x52)
SyskeyLen  :     0  (x00)     Incomplete :     .  (x00)
AuditRBA   :     336307       AuditPos   : 89726992
Continued  :     N  (x00)     RecCount   :     1  (x01)

2012/04/11 05:38:09.014.430 Insert               Len   251 RBA 1378
Name: PRODDTA.F0911T
After  Image:                                             Partition 4   G  s
0000 0004 0000 4841 0001 000a 0000 0006 3739 3436 | ......HA........7946
3536 0002 0007 0000 3030 3333 3000 0300 0a00 0000 | 56......00330.......
0000 0000 01b5 e600 0400 0400 0041 4100 0500 0500 | .............AA.....
0000 0139 0006 0004 0000 2020 0007 0003 0000 2000 | ...9......  ...... .
0800 0300 0020 0009 0003 0000 2000 0a00 0300 0020 | ..... ...... ......
000b 0003 0000 2000 0c00 0300 0020 000d 0003 0000 | ...... ...... ......
2000 0e00 0300 0020 000f 0003 0000 2000 1000 0300 |  ...... ...... .....

Logdump 989 >

-------------------------------------------------------------------------------


Here we could get the RBA for (C) as 1378

Now we have all the three values needed for the calculation

New datapump / Replicat RBA = Reader's too-big checkpoint RBA(A) - Repaired record RBA in existing trail (B)+ First record RBA in the new trail file (after the restart abend)(c)
New datapump / Replicat RBA = 499003921  -    498003606  +    1378 =    1001693


Ensure that you have a good record at RBA 1001693 with TransInd x00 or x03.

This option should only be used if option 1 and option 2 does not work.

If it is a good record, then the datapump / replicat should be altered to trail file sequence number 2290 and rba 1001693 by using the following command

Ggsci> alter rep < rep name>, extseqno 2290, extrba 1001693

总结:

补个刀,Oracle自己在同一个问题上给你了三个方法。重点是三个方法得出的结果不一致,看来还是不要这么折腾了,生产数据搞坏了影响业务。还不如重新初始化啦》》》》》》

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/29302187/viewspace-2128573/,如需转载,请注明出处,否则将追究法律责任。

下一篇: 没有了~
请登录后发表评论 登录
全部评论

注册时间:2014-08-06

  • 博文量
    195
  • 访问量
    528003