
The local tax bureau's TSM backup system is broken again @_@

Original post · Database Development Technology · Author: busyfan · Posted: 2007-06-15 16:01:51

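At noon our sales rep told me that the backup system at the local tax bureau customer had failed again, and asked me to contact the administrator on the customer side.

Every time I hear that the tax bureau's backup system has a problem, I feel like fainting...

I'll go and see what is going on this afternoon )_(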


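In the afternoon I got in touch with the customer's administrator, who told me that the backup processes on several TSM clients had stalled and were no longer making progress. The timestamps in the logs showed that the last entries were written three days earlier by a backup that never completed, yet the logs contained no errors.

I checked several other client nodes and found the same situation.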
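The symptom above (a log that simply stops being written, with no error in it) can be checked for mechanically. The following is only a minimal sketch: the function name `is_stale` and the one-day threshold are my own assumptions, not part of TSM or of the setup described here.

```python
import os
import time
from datetime import timedelta


def is_stale(path, max_age=timedelta(days=1), now=None):
    """Return True if the file at `path` was last modified more than `max_age` ago.

    `max_age` is an arbitrary illustrative threshold; pick one that matches
    how often the backup is expected to write to its log.
    """
    now = time.time() if now is None else now
    age_seconds = now - os.path.getmtime(path)
    return age_seconds > max_age.total_seconds()
```

Run against each node's backup log, a check like this would flag a backup whose log has gone quiet even though nothing in it reports a failure.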

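I logged in to the TSM console with dsmadmc and ran q session to look at the sessions. I had to press the space bar several times to get through the output (each press pages down one screen). Ugh, it looked like some wretched client node had locked the drives and then died again.

I ran q drive f=d to find out which client node had locked the drives this time. The output was: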

Library Name: 3584LIB
Drive Name: DRIVE01
Device Type: LTO
On-Line: Yes
Read Formats: ULTRIUM3C,ULTRIUM3,ULTRIUM2C,ULTRIUM2,ULTRIUMC,ULTRIUM
Write Formats: ULTRIUM3C,ULTRIUM3,ULTRIUM2C,ULTRIUM2
Element: 258
Drive State: LOADED
Volume Name: A00054L3
Allocated to: AGENT_MHYWDB_A
......

There are 4 drives in total, and all of them were allocated to AGENT_MHYWDB_A, so it looked like the client program on that node had failed again.
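Reading four screens of q drive f=d by eye is tedious, so here is a rough sketch of grouping drives by the "Allocated to" field. It assumes the output has been captured as plain text and that every drive record begins with a "Library Name" line, as in the excerpt above; the function names are made up for illustration.

```python
from collections import defaultdict

SAMPLE = """\
Library Name: 3584LIB
Drive Name: DRIVE01
Device Type: LTO
On-Line: Yes
Element: 258
Drive State: LOADED
Volume Name: A00054L3
Allocated to: AGENT_MHYWDB_A
"""


def parse_drives(text):
    """Split 'q drive f=d' style text into one dict per drive record."""
    drives, current = [], {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank lines and filler such as "......"
        key, value = (part.strip() for part in line.split(":", 1))
        # Assumption: a new record starts at each "Library Name" line.
        if key == "Library Name" and current:
            drives.append(current)
            current = {}
        current[key] = value
    if current:
        drives.append(current)
    return drives


def drives_by_owner(drives):
    """Map each 'Allocated to' value to the list of drive names it holds."""
    owners = defaultdict(list)
    for drive in drives:
        owners[drive.get("Allocated to", "")].append(drive.get("Drive Name", "?"))
    return dict(owners)


if __name__ == "__main__":
    print(drives_by_owner(parse_drives(SAMPLE)))
```

If a single owner holds every drive in the library, as happened here, that owner is the first place to look.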

Sure enough, q actlog showed a pile of "unable to open drive" errors:

06/13/2007 15:48:26 ANR8779E Unable to open drive /dev/rmt4, error number=16.
06/13/2007 15:48:26 ANR8779E Unable to open drive /dev/rmt3, error number=16.
06/13/2007 15:48:26 ANR8779E Unable to open drive /dev/rmt1, error number=16.
06/13/2007 15:48:36 ANR8779E Unable to open drive /dev/rmt2, error number=16.
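The error number in these ANR8779E lines is an operating system errno value. As a small sketch, the lines can be parsed and the number translated into its symbolic name. Note the assumption: the lookup uses the errno table of the machine running the script, which is not necessarily identical to the table on the TSM server, although 16 is EBUSY on the common Unix-like systems.

```python
import errno
import re

SAMPLE = """\
06/13/2007 15:48:26 ANR8779E Unable to open drive /dev/rmt4, error number=16.
06/13/2007 15:48:26 ANR8779E Unable to open drive /dev/rmt3, error number=16.
06/13/2007 15:48:26 ANR8779E Unable to open drive /dev/rmt1, error number=16.
06/13/2007 15:48:36 ANR8779E Unable to open drive /dev/rmt2, error number=16.
"""

# Pattern written against the four lines quoted above, not against any
# official description of the message format.
ANR8779E = re.compile(
    r"ANR8779E Unable to open drive (?P<device>\S+), error number=(?P<errno>\d+)\."
)


def busy_drives(log_text):
    """Return (device, errno name) pairs for every ANR8779E line."""
    result = []
    for line in log_text.splitlines():
        match = ANR8779E.search(line)
        if match:
            code = int(match.group("errno"))
            result.append((match.group("device"), errno.errorcode.get(code, "UNKNOWN")))
    return result


if __name__ == "__main__":
    for device, name in busy_drives(SAMPLE):
        print(device, name)
```

A "busy" result for every drive fits the picture of some other process still holding the devices open.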

There were also many errors showing that sessions with this client node were being rejected or refused:

06/13/2007 15:48:26 ANR0454E Session rejected by server AGENT_MHYWDB_A, reason: Communication Failure.
06/13/2007 15:48:26 ANR0454E Session rejected by server AGENT_MHYWDB_A, reason: Communication Failure.
06/13/2007 15:48:26 ANR8390W Failure connecting to library client AGENT_MHYWDB_A to manage volume A00055L3.
06/13/2007 15:48:26 ANR8390W Failure connecting to library client AGENT_MHYWDB_A to manage volume A00054L3.

06/13/2007 15:48:26 ANR8214E Session open with 150.100.16.103 failed due to connection refusal.
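When the activity log is this noisy, a quick tally by message ID shows what dominates. The sketch below assumes the usual convention that the last letter of a TSM message ID encodes severity (I, W, E, S); the `tally` helper is hypothetical and only looks at the first message ID on each line.

```python
import re
from collections import Counter

SAMPLE = """\
06/13/2007 15:48:26 ANR0454E Session rejected by server AGENT_MHYWDB_A, reason: Communication Failure.
06/13/2007 15:48:26 ANR0454E Session rejected by server AGENT_MHYWDB_A, reason: Communication Failure.
06/13/2007 15:48:26 ANR8390W Failure connecting to library client AGENT_MHYWDB_A to manage volume A00055L3.
06/13/2007 15:48:26 ANR8390W Failure connecting to library client AGENT_MHYWDB_A to manage volume A00054L3.
06/13/2007 15:48:26 ANR8214E Session open with 150.100.16.103 failed due to connection refusal.
"""

MESSAGE_ID = re.compile(r"\b(AN[RS]\d{4})([IWES])\b")

# Assumed meaning of the trailing letter of a message ID.
SEVERITY = {"I": "information", "W": "warning", "E": "error", "S": "severe"}


def tally(log_text):
    """Count log lines per (message ID, severity) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MESSAGE_ID.search(line)
        if match:
            message_id = match.group(1) + match.group(2)
            counts[(message_id, SEVERITY[match.group(2)])] += 1
    return counts


if __name__ == "__main__":
    for (message_id, severity), count in tally(SAMPLE).most_common():
        print(message_id, severity, count)
```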


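Seeing this, I was fairly sure the client program on node AGENT_MHYWDB_A had gone wrong, since the same thing had already happened several times before. Without thinking any further, I logged straight in to 150.100.16.103 and ran: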

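ps -ef | grep dsm         # list the TSM (dsm*) processes and note the PID of the hung one
kill -9 dsm_process_id    # dsm_process_id is a placeholder for the PID reported by ps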
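The ps | grep step can also be sketched as a small parser that pulls the candidate PIDs out of captured ps -ef text. The sample output below is invented for illustration (the PIDs and command lines are not from the real machine), and the script only lists PIDs; it deliberately sends no signal.

```python
# Hypothetical ps -ef output; every value here is made up.
SAMPLE = """\
     UID     PID    PPID   C    STIME    TTY  TIME CMD
    root  204886       1   0   Jun 13      -  0:07 dsmsta
  oracle  311452  204886   0 15:02:11      -  0:00 dsmc sched
    root  327900  299120   0 16:20:03  pts/1  0:00 grep dsm
    root  118800       1   0   Jun 01      -  0:12 /usr/sbin/syslogd
"""


def dsm_pids(ps_output, needle="dsm"):
    """Return PIDs of lines mentioning `needle`, ignoring the grep itself.

    The whole line is searched instead of a parsed CMD column, because the
    STIME column may contain a space (for example "Jun 13"), which makes
    column splitting unreliable.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue  # header line or something unparseable
        if needle in line and "grep" not in line:
            pids.append(int(fields[1]))
    return pids


if __name__ == "__main__":
    print(dsm_pids(SAMPLE))
```

Listing the matches first and choosing the PID by hand keeps a stray match from being killed by accident.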

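After killing it I waited about twenty seconds, then heard the tape library clunk as it began changing tapes, so I knew the drives had been released. A quick q mount confirmed it: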

tsm:TSM>q mount
ANR8380I LTO volume A00055L3 is mounted R/W in drive DRIVE02 (/dev/rmt2), status: RETRY DISMOUNT FAILURE.
ANR8380I LTO volume A00054L3 is mounted R/W in drive DRIVE01 (/dev/rmt1), status: RETRY DISMOUNT FAILURE.
ANR8380I LTO volume A00052L3 is mounted R/W in drive DRIVE03 (/dev/rmt3), status: RETRY DISMOUNT FAILURE.
ANR8380I LTO volume A00053L3 is mounted R/W in drive DRIVE04 (/dev/rmt4), status: RETRY DISMOUNT FAILURE.
ANR8379I Mount point in device class 3584CLASS is waiting for the volume mount to complete, status: WAITING FOR VOLUME.
ANR8379I Mount point in device class 3584CLASS is waiting for the volume mount to complete, status: WAITING FOR VOLUME.
ANR8379I Mount point in device class 3584CLASS is waiting for the volume mount to complete, status: WAITING FOR VOLUME.
ANR8334I 7 matches found.
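To keep an eye on the recovery, the q mount output can be reduced to volume, drive, and status. This is a sketch written against the lines shown above only; the regular expression is my own guess at the layout of ANR8380I and is not taken from any message reference.

```python
import re

SAMPLE = """\
ANR8380I LTO volume A00055L3 is mounted R/W in drive DRIVE02 (/dev/rmt2), status: RETRY DISMOUNT FAILURE.
ANR8380I LTO volume A00054L3 is mounted R/W in drive DRIVE01 (/dev/rmt1), status: RETRY DISMOUNT FAILURE.
ANR8380I LTO volume A00052L3 is mounted R/W in drive DRIVE03 (/dev/rmt3), status: RETRY DISMOUNT FAILURE.
ANR8380I LTO volume A00053L3 is mounted R/W in drive DRIVE04 (/dev/rmt4), status: RETRY DISMOUNT FAILURE.
ANR8379I Mount point in device class 3584CLASS is waiting for the volume mount to complete, status: WAITING FOR VOLUME.
ANR8379I Mount point in device class 3584CLASS is waiting for the volume mount to complete, status: WAITING FOR VOLUME.
ANR8379I Mount point in device class 3584CLASS is waiting for the volume mount to complete, status: WAITING FOR VOLUME.
ANR8334I 7 matches found.
"""

MOUNTED = re.compile(
    r"ANR8380I \S+ volume (?P<volume>\S+) is mounted (?P<mode>\S+) "
    r"in drive (?P<drive>\S+) \((?P<device>[^)]+)\), status: (?P<status>.+?)\.?$"
)


def parse_mounts(text):
    """Return one dict (volume, mode, drive, device, status) per ANR8380I line."""
    mounts = []
    for line in text.splitlines():
        match = MOUNTED.search(line.strip())
        if match:
            mounts.append(match.groupdict())
    return mounts


def waiting_mount_points(text):
    """Count mount points still waiting for a volume (ANR8379I lines)."""
    return sum(1 for line in text.splitlines() if "ANR8379I" in line)


if __name__ == "__main__":
    for mount in parse_mounts(SAMPLE):
        print(mount["drive"], mount["volume"], mount["status"])
    print("waiting:", waiting_mount_points(SAMPLE))
```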

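After that the library clattered away for quite a while as tapes were swapped in and out. Running q session again showed that the sessions were finally back to normal. The lev0.log and lev1.log files on each client node now ended with an error record, meaning the hung backups had at last terminated, so things looked more or less normal: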

released channel: t4
released channel: t1
released channel: t2
released channel: t3
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup command at 06/13/2007 15:02:49
ORA-19502: write error on file "0aijql6t_1_1", blockno 2049 (blocksize=512)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
ANS0278S (RC157) The transaction will be aborted.
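An error stack like this mixes three layers: RMAN, the Oracle server (ORA-), and the TSM client API (ANS). A short sketch for pulling the codes out in order, so they can be looked up one by one, follows. Skipping the "=====" banner lines is my own choice, and the pattern is fitted to this excerpt only.

```python
import re

SAMPLE = """\
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup command at 06/13/2007 15:02:49
ORA-19502: write error on file "0aijql6t_1_1", blockno 2049 (blocksize=512)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
ANS0278S (RC157) The transaction will be aborted.
"""

CODE = re.compile(r"\b(RMAN-\d{5}|ORA-\d{5}|ANS\d{4}[IWES])\b")


def error_codes(text):
    """Return the distinct error codes in order of appearance, skipping banner lines."""
    seen = []
    for line in text.splitlines():
        if "=====" in line:
            continue  # decorative banner, carries no failure information
        for code in CODE.findall(line):
            if code not in seen:
                seen.append(code)
    return seen


if __name__ == "__main__":
    print(error_codes(SAMPLE))
```

Read top to bottom, the stack moves from the RMAN command down to the media manager, so the last code is the one closest to TSM.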

I'll keep watching it tomorrow. Ugh!

The TSM server is at version 5.3.3, while the failing client node mhywdb_a runs TSM 5.2.0, so I keep suspecting that the version mismatch is what causes these odd problems.
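To find every node that lags behind the server, the version strings can be compared as tuples of integers. This is a trivial sketch; the helper names are hypothetical, and the only version numbers used are the two mentioned above.

```python
def parse_version(text):
    """Turn a dotted version string such as '5.3.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_older(client_version, server_version):
    """Return True if the client version sorts before the server version."""
    return parse_version(client_version) < parse_version(server_version)


if __name__ == "__main__":
    server = "5.3.3"
    nodes = {"mhywdb_a": "5.2.0"}  # node name and version as given in the post
    for name, version in nodes.items():
        if is_older(version, server):
            print(name, version, "is older than server", server)
```

Comparing tuples avoids the trap of comparing the strings directly, where "5.10.0" would sort before "5.9.0".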

The TSM installation media we ordered will arrive next week. I'll just upgrade the 5.2.0 nodes then. One more groan, and that's all for now.

Source: "ITPUB Blog", link: http://blog.itpub.net/266238/viewspace-918992/. If you repost this, please credit the source; otherwise legal action may be pursued.
