ITPub Blog

OCFS, OCFS2, ASM, RAW Discussion 1 (repost)

Original  Oracle  Author: m77m78  Time: 2007-04-16 21:32:25
20 billion rows, so roughly 20 TB. Is ASM the better choice, or OCFS, or RAW? A few questions need answering first:

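1. Is the 20 TB historical and archived data, or is it modified, updated, and changing every day?
2. What proportion of the 20 TB needs to be modified, queried, or updated frequently?
3. Is the 20 TB pure record data, or does it include media data?
4. How will this system's data grow? How much new data arrives per day, per month, and per year, and how fast is the growth?

Once these questions are answered, your database storage plan is basically clear; otherwise you are working like the blind men feeling the elephant.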
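
As a rough aid for question 4, the sizing arithmetic can be sketched in a few lines of shell. This is only an illustration: the 1000-byte average row size is an assumption inferred from "20 billion rows is about 20 TB", the 30% yearly growth rate is made up, and both function names are hypothetical.

```shell
# Rough capacity estimate: rows x average row size, in decimal terabytes.
# ASSUMPTION: 1 TB = 10^12 bytes; the row size must come from your own tables.
estimate_tb() (
    rows=$1
    bytes_per_row=$2
    echo $(( rows * bytes_per_row / 1000000000000 ))
)

# Size after N years of compound growth (growth in whole percent per year).
# Integer arithmetic, so each year's result is rounded down.
project_tb() (
    tb=$1
    percent=$2
    years=$3
    i=0
    while [ "$i" -lt "$years" ]; do
        tb=$(( tb * (100 + percent) / 100 ))
        i=$(( i + 1 ))
    done
    echo "$tb"
)

estimate_tb 20000000000 1000   # prints 20
project_tb 20 30 3             # prints 42 (hypothetical 30% growth per year)
```

The point is only that the growth rate, not the current size, often decides the storage layout; plug in measured numbers before drawing conclusions.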


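In my experience, probably not all 20 TB needs to be always online, so the logical database design should treat the data in tiers. Even with Oracle, if you keep all 20 TB in a single tier, performance will likely be dreadful.

Also, ASM, OCFS, and RAW are not equivalent, directly comparable options. Their characteristics and designs differ a great deal.

ASM performs about the same as RAW but is far easier to manage. The price is added system complexity: one more layer, and with it a much higher chance of problems.

One thing I am sure of: for your 20 TB of data, OCFS2 should not be considered. Don't ask me why, because that would take a lot of explaining.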


cwinxp replied on 2006-03-09 09:36:55

Thanks.
What is the best way to search 10 TB to 20 TB of data at the same time?


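nntp replied on 2006-03-09 16:50:14

Quote: originally posted by cwinxp at 2006-3-9 09:36
Thanks.
What is the best way to search 10 TB to 20 TB of data at the same time?

Find a professional firm to consult. Searching 10 TB to 20 TB of data at once is no longer a routine application.

Normally they would do the following for an application like this:

1. Analyze your data usage patterns and adjust the database structure (including optimizations aimed at query operations).
2. Build a conventional HA cluster solution and, depending on the query load sent to the system, factor in load balancing.
3. After a small pilot test, tune your OS and file system based on the sampled performance results. (If you have staff who know Linux well, you can do this yourselves; OTN has plenty of performance-tuning material.)
4. Possibly, after analyzing the data to be queried, lay the data out separately at the physical level.
5. Steps 1-4 assume your hardware budget is limited. After that work, there is still the unpleasant possibility that the bottleneck turns out to be the hardware: your performance and availability requirements do not match what your physical infrastructure can actually deliver, so the hardware has to be upgraded.

Doing a 10 TB to 20 TB application well makes all of this fairly complex; it requires hands-on access to the actual system and a deep understanding of the application.

good luck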


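cwinxp replied on 2006-03-10 10:19:48

Thank you. For data this large, hardware is not a problem. Five cascaded CX700s should be enough, right?

Should I carve the five CX700s into several RAW volumes and so on, as you said, and then give 10 TB or even more to ASM as a single area?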


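nntp replied on 2006-03-10 17:31:38

Quote: originally posted by cwinxp at 2006-3-10 10:19
Thank you. For data this large, hardware is not a problem. Five cascaded CX700s should be enough, right?

Should I carve the five CX700s into several RAW volumes and so on, as you said, and then give 10 TB or even more to ASM as a single area?

Good hardware is of course good, but good hardware alone cannot guarantee that the system will deliver the expected performance and availability. As for how to plan it, I really can't say; conclusions can only come from a careful analysis of your application. With this much data, if you get it wrong, it stays wrong.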


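shimu replied on 2006-03-10 23:49:21

Personally, I think safety matters most for data this large, so of course choose RAW. OCFS and ASM are relatively new; their maturity and stability cannot compare.


cwinxp replied on 2006-03-13 09:29:46

Use RAW? Thanks.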


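brave_script replied on 2006-05-09 16:01:09

I am building an Oracle application cluster and store OCFS2 file systems on separate LVs created in a VG. I would like to extend an LV online when it fills up. ext3 has a way to do this, but I don't know how to do it with OCFS2.
Any advice from the experts would be appreciated.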


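nntp replied on 2006-05-09 18:12:59

Is this a production system? Don't use OCFS2.

raw + ASM is enough.

In current RAC environments I see no reason to use OCFS2 in production.

RAC touches storage in two places: the OCR and voting disk (and their redundant configuration), and the Oracle data plus the Flashback Recovery Area.

Oracle RAC is now usually configured in one of two ways: raw (OCR + voting) + ASM (data + Flashback Recovery Area), or OCFS2 + ASM.

The OCR and voting disk take very little space, so there is no need to put an OS LVM underneath OCFS2 for them. Doing so would also be wrong: the OCR and voting disk need cluster-aware storage, which is why raw or OCFS2 is used. With LVM + OCFS2, the OS LVM underneath is not cluster-aware, so it will corrupt your data. This is a very old topic; search the Oracle forums, or read up on MetaLink if you have an account. There is no point discussing it further.

If you use OS LVM + OCFS2 for data + Flashback Recovery Area, I advise against it. It is not impossible, but OCFS2 really is fragile. Do you subscribe to the OCFS2 mailing list? Go take a look.
For data + FRA, both ASM and RAW are good, in performance, manageability, and reliability.

I suggest you study the RAC installation material carefully and get the fundamentals clear.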


shahand replied on 2006-05-09 19:15:30

nntp answers patiently and never tires of teaching.


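brave_script replied on 2006-05-09 23:21:55

Thanks, moderator. The ASM material on Oracle's official site generally targets Oracle 10g, but for special reasons we run Oracle 9204. With raw, partitions are limited to at most 255, so we went with the OCFS2 file system, which is also what Oracle's official site recommends. The RAC is already built; only expansion is less than ideal. The OCR and voting disk are on a single raw device. The main issue now is how to extend the space for the data files and the Flashback Recovery Area. OCFS2 is indeed sometimes unstable, but it is much better when it comes to expansion.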


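nntp replied on 2006-05-10 04:00:48

Oracle never said that best practice is to use OCFS2. In fact, nobody from Oracle in the community dares to come out and say you can safely use OCFS2 in production.

Since RAC is the premise, my advice leans toward safety.

Since the problem is the data portion and you are not using ASM, there is no choice left: you can only use LVM + OCFS.

OCFS R1 is a lot of trouble, though. Like R2 it does not support online resizing, and resizing it takes a specific sequence of steps.

The trouble now is that the array can be resized online, LUNs can be hot-added online, PVs can be added online, the VG can be extended online, and the LV can be extended online. Only resizing OCFS on the LV cannot be done online: OCFS must be unmounted from all nodes.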


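nntp replied on 2006-05-10 14:39:51

OCFS1 cannot be upgraded directly to OCFS2. A later upgrade requires exporting and re-importing the database.

Yesterday, to double-check my reply to you, I searched again. OCFS1 bugs are all over the web, and they are alarming.
Frankly, choosing an architecture like this just creates trouble for the implementers and for the customer, and the pain is still to come.

[Last edited by nntp at 2006-5-10 14:40]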


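brave_script replied on 2006-05-11 10:33:51

Thanks, moderator. What I do now is actually what you described: unmount the OCFS volume to be extended on all nodes and then reformat it. Oracle itself does not really require this, since Oracle stores everything as files. I just wanted to understand whether an OCFS2 file system on LVM can be extended dynamically.


brave_script replied on 2006-05-11 10:35:05

By the way, we are using OCFS2.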


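nntp replied on 2006-05-11 12:47:19

Yesterday I saw an OCFS2 developer answer a similar question on the OCFS2 mailing list.
Their reply is basically the same as what I wrote in the second post.

To repeat: OCFS2 is a cluster-aware file system. An instance runs on every RAC node and, through network communication plus a locking mechanism, ensures that reads and writes by different nodes to the same storage region happen under control, and that every node knows through its OCFS2 instance who wrote and who read. So the integrity of an OCFS2 file system has a guaranteed baseline.

When you create OCFS2 on LVM, LVM control is handled separately on each node by that node's OS and LVM module. The LVM instances on different nodes do not communicate; they are independent and access the shared storage without mutual exclusion or locking. You can scan the PVs/VGs/LVs on the shared array with the LVM tools from every node, but as soon as reads and writes are involved, each node acts in complete isolation. Reading and writing the LVM metadata therefore becomes a serious problem.
So OCFS2 + LVM is not advisable for shared data in RAC.

________________________________
The mailing list reply was: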

That's why ocfs2 is not certified with lvm2.

Going forward, we will be looking into this issue. But currently
there is no certified solution.

If you are running Oracle db and need volume mgmt, you should look into ASM.
-----------------------------------------------------------------------------------------------


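joyhappy replied on 2006-05-12 08:52:29

"So OCFS2 + LVM is not advisable for shared data in RAC."

I agree with that.
In principle, though, if you really need LVM you can use LVM2, that is, OCFS2 + CLVM. I have not tried it, but it should work.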


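nntp replied on 2006-05-12 15:17:17

Quote: originally posted by joyhappy at 2006-5-12 08:52
"So OCFS2 + LVM is not advisable for shared data in RAC."

I agree with that.
In principle, though, if you really need LVM you can use LVM2, that is, OCFS2 + CLVM. I have not tried it, but it should work.

LVM is common in HA setups, but there only one side uses the volume at a time and locks it. For RAC, which needs simultaneous access, I'm afraid it is not that simple.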


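blue_stone replied on 2006-05-12 16:54:09

LVM on Linux is not cluster-aware, so it cannot be used in a cluster environment; a cluster should use CLVM.

I don't understand why OCFS2 can't be used in production. After all, OCFS2 has been merged into the Linux kernel.
Could nntp please explain?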


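nntp replied on 2006-05-12 17:38:00

Quote: originally posted by blue_stone at 2006-5-12 16:54
LVM on Linux is not cluster-aware, so it cannot be used in a cluster environment; a cluster should use CLVM.

I don't understand why OCFS2 can't be used in production. After all, OCFS2 has been merged into the Linux kernel.
Could nntp please explain?

When I talk about production, what I go by is not whether something is merged into the kernel but, very practically, whether it is stable. If I were the manager responsible for an enterprise's system architecture, I would not care how much something is hyped or how big a name is behind it. If I saw many people reporting failures whose source is at the code level, and the failures under discussion end up affecting not only data safety (reliability) but also service continuity (availability), I would not consider it, even if the vendor offered technical support and services, let alone a technology with no support at all.

One thing is clear: if I adopt a new technology today and take on risk for features and performance I have not yet seen or enjoyed, I must assess that risk objectively. At what level does the risk sit? How wide could its impact be? How large is the concrete loss if it materializes? How many resources, how much money, and how much extra labor must I invest in prevention? How complex is recovery afterwards? How long until the system is back online? How does downtime affect the company and individuals? Will an incident affect the next phase of the company's IT plans and funding? Will it affect my credibility and voice with decision makers as head of IT architecture?

When it comes to production, our premise is the worst, most embarrassing, most damaging possibility. So how much a new feature is worth adopting is a question for the system as a whole, not something to judge in isolation.

This has drifted off topic. I guess most people here are engineers and rarely deal with project management or risk assessment and control, but it benefits the whole project greatly if everyone on the team understands some of it.

Back to the topic: as a Linux fan, I think OCFS2 being integrated into the Linux kernel is great. From a project standpoint, I don't recommend it for now.

I suggest subscribing to the OCFS2 mailing list for first-hand information.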


blue_stone replied on 2006-05-12 18:01:20

I fully agree with nntp.
I feel my own understanding was too shallow.


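rambus replied on 2006-06-23 15:33:52

I'm in training right now, and the Oracle side hypes OCFS and OCFS2 performance to the skies. But I have asked peers in various places, and apparently no production environment uses OCFS.
What is it actually like in practice?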


pawnjazz replied on 2006-06-23 16:19:15

Oracle's strength is the database. For clustering you should go to the OS platform vendor. As far as I know, Red Hat GFS is used for clusters.


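soway replied on 2006-06-23 16:35:19

Let the moderator answer this one.

I remember he answered earlier: for now, definitely don't use it.


我爱钓鱼 replied on 2006-06-23 17:35:28

Oracle's own senior engineers say: what is mostly used now is still RAC; don't use it for the time being...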


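nntp replied on 2006-06-23 17:37:51

OCFS can only be used with RAC.
The development direction of OCFS2 has changed significantly; the goal is to become a general-purpose cluster file system.

I trust the ability of Oracle and the OCFS2 development team and their future progress. Software always develops from immature to mature, chaotic to clear, fragile to stable. If you have a production system that needs a cluster file system right now, don't consider OCFS2.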


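fengwy replied on 2006-06-26 10:48:13

Quote: originally posted by 我爱钓鱼 at 2006-6-23 17:35
Oracle's own senior engineers say: what is mostly used now is still RAC; don't use it for the time being...


Isn't OCFS used precisely in RAC?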


nntp replied on 2006-06-26 11:03:16

Quote: originally posted by fengwy at 2006-6-26 10:48

Isn't OCFS used precisely in RAC?

Yep, OCFS can only be used in a RAC environment. OCFS2 is different: Oracle is making it a general cluster file system for normal applications.


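cs119 replied on 2006-06-26 17:02:11

Whoever uses it will know! For good performance, raw devices are best.


rambus replied on 2006-06-27 14:47:47

Strangely, why can OCFS only be used on AS3 and is no longer supported on AS4?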


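nntp replied on 2006-06-27 15:10:34

Quote: originally posted by rambus at 2006-6-27 14:47
Strangely, why can OCFS only be used on AS3 and is no longer supported on AS4?

Strange that you would think that. Have you looked at the OCFS2 site? It is all written there. OCFS (OCFS1) has now been superseded by OCFS2.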


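archangle replied on 2006-06-28 08:27:09

I tested OCFS2 last year and performance was poor. I don't know how it is now, but I feel it will be a good product once it matures.


nntp replied on 2006-06-28 12:25:51

OCFS2 performance is still poor at present. Wait a while longer.


nimysun replied on 2006-06-29 10:02:51

After a new product is released, it needs at least a few years to mature before it can be considered for production systems.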


fengwy replied on 2006-06-29 10:31:22

Quote: originally posted by nntp at 2006-6-26 11:03

Yep, OCFS can only be used in a RAC environment. OCFS2 is different: Oracle is making it a general cluster file system for normal applications.


I haven't seen any Oracle material about this.


nntp replied on 2006-06-29 11:23:06

Sigh... it is written on the first line of the first page of the OCFS project site: (http://oss.oracle.com/projects/ocfs2/)

Here is an excerpt:

WHAT IS OCFS2?

OCFS2 is the next generation of the Oracle Cluster File System for Linux. It is an extent based, POSIX compliant file system. Unlike the previous release (OCFS), OCFS2 is a general-purpose file system that can be used for shared Oracle home installations making management of Oracle Real Application Cluster (RAC) installations even easier. Among the new features and benefits are:

* Node and architecture local files using Context Dependent Symbolic Links (CDSL)
* Network based pluggable DLM
* Improved journaling / node recovery using the Linux Kernel "JBD" subsystem
* Improved performance of meta-data operations (space allocation, locking, etc).
* Improved data caching / locking (for files such as oracle binaries, libraries, etc)


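fengwy replied on 2006-06-30 11:21:03

Oops, I didn't expect it to be in the open source material.


fengwy replied on 2006-06-30 11:23:08

Quote: originally posted by nntp at 2006-6-28 12:25
OCFS2 performance is still poor at present. Wait a while longer.


Does the poor performance refer to database use under RAC, or to use as a general-purpose cluster file system?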


nntp replied on 2006-06-30 13:45:31

Quote: originally posted by fengwy at 2006-6-30 11:23

Does the poor performance refer to database use under RAC, or to use as a general-purpose cluster file system?

both.


youngcow replied on 2006-07-19 16:21:41

Cluster file systems all seem to perform poorly; GFS is the same.


blue_stone replied on 2006-07-19 22:03:51

Can GFS be used in production?


nntp replied on 2006-07-20 04:06:15

See the sticky post.


vecentli replied on 2006-07-27 11:05:15

OCFS2, developed by Oracle Corporation, is a Cluster File System which allows all nodes in a cluster to concurrently access a device via the standard file system interface. This allows for easy management of applications that need to run across a cluster.

OCFS (Release 1) was released in December 2002 to enable Oracle Real Application Cluster (RAC) users to run the clustered database without having to deal with RAW devices. The file system was designed to store database related files, such as data files, control files, redo logs, archive logs, etc. OCFS2 is the next generation of the Oracle Cluster File System. It has been designed to be a general purpose cluster file system. With it, one can store not only database related files on a shared disk, but also store Oracle binaries and configuration files (shared Oracle Home) making management of RAC even easier.

In this article, I will be using OCFS2 to store the two files that are required to be shared by the Oracle Clusterware software. (Along with these two files, I will also be using this space to store the shared SPFILE for all Oracle RAC instances.)


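秋风No.1 replied on 2006-08-04 16:49:16

Quote: originally posted by nntp at 2006-6-30 13:45

both.


I don't quite understand.

If it is poor in both cases, why does the Oracle RAC installation still recommend OCFS2?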


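nntp replied on 2006-08-04 21:02:59

Quote: originally posted by 秋风No.1 at 2006-8-4 16:49

I don't quite understand.

If it is poor in both cases, why does the Oracle RAC installation still recommend OCFS2?

Have you ever seen software that was good from the start of its development? Software under development has all sorts of problems at its current stage. Should development stop and the project be abandoned because of that?

The RAC installation has never recommended OCFS2. Have you seen what the Oracle RAC product manager said at OracleWorld? It was very clear.

Without a cluster-wide file system, a RAC system suffers a long storage failover delay after a node fails. The reasoning is the same as RHCS versus RHCS + GFS.

Installing RAC is not hard. The hard part is knowing when RAC should be deployed, how to deploy it, which parts to deploy, what can be used with confidence now and what cannot, what risks come with using it, and how to prevent and resolve them.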


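秋风No.1 replied on 2006-08-04 22:07:49

Quote: originally posted by nntp at 2006-8-4 21:02

Have you ever seen software that was good from the start of its development? Software under development has all sorts of problems at its current stage. Should development stop and the project be abandoned because of that?

The RAC installation has never recommended OCFS2. Have you seen what the Oracle RAC product manager said at OracleWorld? It was very clear.

...


Thanks for the advice.


nntp replied on 2006-08-05 10:30:31

One addition: in a RAC environment, if you are not using raw, you should still choose ASM for production.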


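fengwy replied on 2006-08-07 00:55:06

Installing RAC is not hard. The hard part is knowing when RAC should be deployed, how to deploy it, which parts to deploy, what can be used with confidence now and what cannot, what risks come with using it, and how to prevent and resolve them.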
------------------------------------------------------------------------------


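oncity replied on 2006-08-18 07:44:40

Installing OCFS2 is not hard. (With the latest SUSE server 10, everything is included.)

But in use there are a great many strange problems.

1) Crashes, especially when copying large directories.

2) Crashes when one of the nodes goes down unexpectedly (network cable pulled).

3) Crashes... for no apparent reason...

There is no log entry before or after a crash to explain it!

It looks like my plan to move from NFS to iSCSI + a cluster file system will end in failure... :em10: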


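好好先生 replied on 2006-08-18 09:22:46

Quote: originally posted by oncity at 2006-8-18 07:44
Installing OCFS2 is not hard. (With the latest SUSE server 10, everything is included.)

But in use there are a great many strange problems.

1) Crashes, especially when copying large directories.

2) Crashes when one of the nodes goes down unexpectedly (network cable pulled).

3) Crashes... for no ...

How does it crash? Is there any message on the screen? Does it stop responding entirely, or is it a kernel crash? Please describe your situation clearly... thanks!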


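oncity replied on 2006-08-18 10:07:55

Quote: originally posted by 好好先生 at 2006-8-18 09:22

How does it crash? Is there any message on the screen? Does it stop responding entirely, or is it a kernel crash? Please describe your situation clearly... thanks!

There is no error message at all, neither on screen nor in syslog.

I have used Linux for a long time, and this is the first time I have seen such a complete, instantaneous crash.

I suspect a kernel problem.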


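我爱钓鱼 replied on 2006-08-18 10:12:56

Quote: originally posted by oncity at 2006-8-18 10:07

There is no error message at all, neither on screen nor in syslog.

I have used Linux for a long time, and this is the first time I have seen such a complete, instantaneous crash.

I suspect a kernel problem.

That can't be right... if the kernel crashes there will be a log; by default it is printed to the console.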


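nntp replied on 2006-08-18 10:14:14

Have you read my earlier comments on OCFS?

You should first rule out environment and version-dependency problems, because OCFS2 is still in an early stage of development. Despite the 2 in its name, it is really the first version to support a general-purpose cluster file system. Using OCFS2 for a production system is unwise (see my posts) and incorrect. With OCFS2 today you simply cannot lock down a stable set.

Why not use GFS?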


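oncity replied on 2006-08-18 10:22:34

Quote: originally posted by nntp at 2006-8-18 10:14
Have you read my earlier comments on OCFS?

You should first rule out environment and version-dependency problems, because OCFS2 is still in an early stage of development. Despite the 2 in its name, it is really the first version to support a general-purpose cluster file system. Using OCFS2 for a production ...

Because the platform is SUSE Linux Enterprise Server 10, which ships with OCFS2, so of course I wanted to try it first. :lol:

Setting up OCFS2 is easy and simple tests pass, but problems appear when actually copying large amounts of data.

If I use GFS, I suppose I need to switch to Red Hat. Which version is the most stable to install on? AS 4 U2?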


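nntp replied on 2006-08-18 12:45:11

The higher the better.


nntp replied on 2006-08-18 16:50:27

To the original poster: I suggest you subscribe to the OCFS2 mailing list and look at the losses others have suffered before you act, so you can better judge whether to use it.

For SuSE SLES releases, it is generally best not to go to production before the first SP is out.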


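pxwyd replied on 2006-08-29 19:05:40

I installed OCFS2 on RHEL4 Update 4.
With nodes node01 and node02, when the network cable of either node02 or node01 is pulled, node02 hangs, while node01 is fine.
/var/log/messages shows the following entries before the hang: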
Aug 28 18:23:14 node02 kernel: o2net: connection to node node01 (num 0) at 192.168.210.201:7777 has been idle for 10 seconds, shutting it down.
Aug 28 18:23:14 node02 kernel: (0,0): o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1156760584.614463 now 1156760594.612669 dr 1156760584.614448 adv 1156760584.614468:1156760584.614471 func (8911b11d:505) 1156760549.622451:1156760549.622455)
Aug 28 18:23:14 node02 kernel: o2net: no longer connected to node node01 (num 0) at 192.168.210.201:7777
Aug 28 18:25:01 node02 crond(pam_unix)[4833]: session opened for user root by (uid=0)
Aug 28 18:25:01 node02 crond(pam_unix)[4833]: session closed for user root
Aug 28 18:30:01 node02 crond(pam_unix)[6257]: session opened for user root by (uid=0)
Aug 28 18:30:01 node02 crond(pam_unix)[6259]: session opened for user root by (uid=0)
Aug 28 18:30:01 node02 crond(pam_unix)[6259]: session closed for user root
Aug 28 18:30:02 node02 crond(pam_unix)[6257]: session closed for user root
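
For anyone sifting through a log like the one above, a small filter can pull out just the o2net idle-timeout events, which show which peer went silent and for how long. This is only a sketch: the function name is hypothetical, and the pattern is written against the exact message wording shown in this post, which may differ in other OCFS2 releases.

```shell
# Print one summary line per o2net idle-timeout message found in a syslog file.
# ASSUMPTION: the message text matches the wording in the log excerpt above.
o2net_idle_events() {
    sed -n -E 's/.*o2net: connection to node ([^ ]+) \(num ([0-9]+)\) at ([0-9.]+):([0-9]+) has been idle for ([0-9]+) seconds.*/node=\1 num=\2 addr=\3 port=\4 idle=\5s/p' "$1"
}
```

Running `o2net_idle_events /var/log/messages` on each node and comparing timestamps is one way to see which side lost the interconnect first.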


oncity replied on 2006-08-29 21:25:48

OCFS2's problems are too complicated.

For an ordinary clustered web site, NFS is still the better fit.


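nonameboy replied on 2006-08-29 23:00:40

First, I strongly recommend raw devices.
When the network cable is pulled, the second node should normally reboot, not hang.
Try stopping the CRS services first and then pulling the cable.
My guess is that it will not hang then.
Why does one node die?
My understanding is this: RAC uses both nodes at the same time, with two virtual IPs, one on each host,
and the Oracle client connects to both VIPs.
Normally, when one node has a problem, its VIP is moved to the surviving node, so that clients can still reach both VIPs.
The two hosts communicate over the private NICs. RAC relies on them to share the memory pool, and the traffic is considerable. This concept is different from what we used to do with OFS on MSCS!
If you pull the cable, they can no longer share the memory pool, and if clients kept using both hosts, Oracle would run into trouble. So once the cable is pulled, one host must take over all the VIPs while the other reboots endlessly until the cable is plugged back in.

Your question is why it hangs rather than reboots.
Check whether your module settings follow the documentation, and check the system's own settings.
My guess is that the CRS process failed while rebooting the machine, which left the system hung.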


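nonameboy replied on 2006-08-29 23:03:34

Also, production systems must go on raw devices. OCFS is just too perverse; that is all I can say.
If you use OCFS, future kernel upgrades will be a hassle.


As for running RAC on SUSE, I remain skeptical.
Several very, very senior Linux/Oracle engineers at our company have been testing this,
and after half a year it still has not passed.

So our production RAC has stayed on RHEL 3.0.
Remember that RAC is not finished once it is installed.

[Last edited by nonameboy at 2006-8-29 23:06]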


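nntp replied on 2006-08-30 01:48:15

To post #12,

I completely disagree with your view on running RAC on SuSE (leaving OCFS/OCFS2 aside).
I assume those very senior Linux/Oracle engineers at your company know that the team in Oracle Consulting responsible for IDC business recommends running Oracle on SLES9 for business-critical systems.

If there is a chance, I would like to compare notes with your senior engineers on SuSE and RAC. I wonder how they passed stability testing on RHEL + RAC.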


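pxwyd replied on 2006-08-30 09:04:57

I am doing Oracle RAC, which is why I use OCFS2.


pxwyd replied on 2006-08-30 09:13:50

Installing RAC is not hard. The hard part is knowing when RAC should be deployed, how to deploy it, which parts to deploy, what can be used with confidence now and what cannot, what risks come with using it, and how to prevent and resolve them.



Is there any good way to solve these problems? I installed OCFS2 + RAC on RHEL4 with two nodes; when node1's network cable is disconnected, node2 hangs. Performance otherwise seems acceptable. I heard OCFS2 was released back in 2004 and assumed it was ready for commercial use; only after reading this discussion did I learn that it is not really used in production yet.

If I need an Oracle RAC environment, please give me some advice: which file system is the better choice?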


ljhb replied on 2006-08-30 10:14:03

UT uses GFS with Hitachi storage in some projects; I don't know how well it works.


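pxwyd replied on 2006-08-30 12:04:02

Thanks for the pointers.
My test environment: two Sun V65x servers, one SCSI disk array, RHEL4 Update 4, Oracle 10g R2, and OCFS2.
The problem occurs when pulling the cable with only OCFS2 running: only one server hangs and the other is fine. With ocfs2 and o2cb stopped, there is no problem.
We have given up on OCFS2 and plan to use raw and ASM. Is the storage your company uses for Oracle RAC just raw?

How is the performance?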


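nonameboy replied on 2006-08-30 16:52:17

Quote: originally posted by pxwyd at 2006-8-30 12:04
Thanks for the pointers.
My test environment: two Sun V65x servers, one SCSI disk array, RHEL4 Update 4, Oracle 10g R2, and OCFS2.
The problem occurs when pulling the cable with only OCFS2 running: only one server hangs and the other is fine. With ocfs2 and o2cb stopped, there is no problem.
...


We used OCFS before, but only for testing, because we found some problems in our experiments.
That is why we moved to raw devices.
All our production RAC environments now run on raw devices.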


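nonameboy replied on 2006-08-30 16:53:53

Quote: originally posted by nntp at 2006-8-30 01:48
To post #12,

I completely disagree with your view on running RAC on SuSE (leaving OCFS/OCFS2 aside).
I assume those very senior Linux/Oracle engineers at your company know that the team in Oracle Consulting responsible for IDC business recommends running Oracle on ...


One of those guys has a forum; go chat there when you have time.
http://www.puschitz.com/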


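nntp replied on 2006-08-30 17:36:20

Quote: originally posted by pxwyd at 2006-8-30 09:04
I am doing Oracle RAC, which is why I use OCFS2.

Please use ASM.

Judging by performance, the maturity of the current versions, vendor R&D investment and support, best practices, and success stories, OCFS2 should not be used in production systems at this stage.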


nntp replied on 2006-08-30 17:38:43

to pxwyd.

Use ASM.


pxwyd replied on 2006-08-31 08:22:50

Thanks. I am now testing with ASM + raw to see how it works.


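nntp replied on 2006-08-31 15:23:19

Quote: originally posted by nonameboy at 2006-8-30 16:53

One of those guys has a forum; go chat there when you have time.
http://www.puschitz.com/

Didn't he just publish a few articles on OTN?

I have not seen any enterprise-level testing by him of SLES and RHEL in a RAC environment.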


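nntp replied on 2006-08-31 15:24:33

Quote: originally posted by pxwyd at 2006-8-31 08:22
Thanks. I am now testing with ASM + raw to see how it works.

ASM performance differs little from RAW, and in some benchmark metrics it exceeds RAW.

Keep in mind that raw on Linux is not the same as raw on commercial Unix.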


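nntp replied on 2006-08-31 15:27:49

Quote: originally posted by nonameboy at 2006-8-30 16:53

One of those guys has a forum; go chat there when you have time.
http://www.puschitz.com/

Also, by "one of those guys", do you mean that Werner Puschitz is a colleague at your company? As far as I know he is an independent consultant who works for himself. When did he become an employee of your company?
Last year, on a RAC project, we contacted him and asked about the price of his remote consulting service; we could not agree on a price and it fell through. I didn't expect him to have joined your company.

:em11:

[Last edited by nntp at 2006-8-31 15:29]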


vecentli replied on 2006-08-31 15:53:28

For the voting disk and OCR, I still use OCFS2.

For data files and the like, lvm + raw also works.


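nntp replied on 2006-08-31 16:07:04

Quote: originally posted by vecentli at 2006-8-31 15:53
For the voting disk and OCR, I still use OCFS2.

For data files and the like, lvm + raw also works.

I put my voting disk and OCR on raw. 10g R2 has a redundant configuration, so raw is convenient, and because they are small, extra backups are easy too.

Whether the voting disk and OCR are on raw has little effect on runtime performance, but when the cluster becomes unstable and the system starts changing node membership, there is a performance difference.

For data files, are you using lvm + raw in RAC, or clvm + raw? How could lvm + raw work? LVM itself is not cluster-aware. Your raw devices appear to be created on LVM, but a node's LVM cannot propagate changes to the other nodes. Storing data files on raw shows no performance advantage on Linux; ASM gives assured performance.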


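vecentli replied on 2006-08-31 16:21:57

What the OCR and voting disk go on doesn't matter much. I think personal habit is the deciding factor, and OCFS2 simply suits most people's habits better.
As for lvm + raw, LVM is only the way the data files are managed; the data lives on raw, and the raw devices are of course built on LVs.

I haven't tested it. Can LVM really not manage raw devices under RAC? Or can RAC not recognize raw devices built on LVs?

[Last edited by vecentli at 2006-8-31 16:24]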


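vecentli replied on 2006-08-31 16:29:41

Without a UPS, after an unexpected power loss the data in the ASM cache is gone, and the database may not come up.
Without enough technical expertise, you cannot back up the database with RMAN.

So what to use still depends on your own circumstances. :)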


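vecentli replied on 2006-08-31 16:33:01

Quote: originally posted by nntp at 2006-8-31 16:07

I put my voting disk and OCR on raw. 10g R2 has a redundant configuration, so raw is convenient, and because they are small, extra backups are easy too.

Whether the voting disk and OCR are on raw has little effect on runtime performance, but when the cluster becomes unst ...

That makes sense.

The LVM configuration cannot be propagated to the other machines; an LV used on node1 cannot be recognized by the instance on node2.
:mrgreen: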


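nntp replied on 2006-08-31 18:01:16

Quote: originally posted by vecentli at 2006-8-31 16:29
Without a UPS, after an unexpected power loss the data in the ASM cache is gone, and the database may not come up.
Without enough technical expertise, you cannot back up the database with RMAN.

So what to use still depends on your own circumstances. :)

Single instance or RAC? With RAC, ASM can handle this situation even after a power loss. Do you subscribe to Oracle Magazine? An issue at the end of last year covered a similar case.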


vecentli replied on 2006-08-31 21:59:58

OCFS2 - FREQUENTLY ASKED QUESTIONS

CONTENTS
* General
* Download and Install
* Configure
* O2CB Cluster Service
* Format
* Mount
* Oracle RAC
* Migrate Data from OCFS (Release 1) to OCFS2
* Coreutils
* Troubleshooting
* Limits
* System Files
* Heartbeat
* Quorum and Fencing
* Novell SLES9
* Release 1.2
* Upgrade to the Latest Release
* Processes

GENERAL
1. How do I get started?
* Download and install the module and tools rpms.
* Create cluster.conf and propagate to all nodes.
* Configure and start the O2CB cluster service.
* Format the volume.
* Mount the volume.
2. How do I know the version number running?

# cat /proc/fs/ocfs2/version
OCFS2 1.2.1 Fri Apr 21 13:51:24 PDT 2006 (build bd2f25ba0af9677db3572e3ccd92f739)

3. How do I configure my system to auto-reboot after a panic?
To auto-reboot system 60 secs after a panic, do:

# echo 60 > /proc/sys/kernel/panic

To enable the above on every reboot, add the following to /etc/sysctl.conf:

kernel.panic = 60
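
If the setting is applied by script across several nodes, it helps to make the edit idempotent so repeated runs do not append duplicate lines. A minimal sketch, assuming a POSIX shell; the function name `set_panic_reboot` is hypothetical and not part of any OCFS2 tool:

```shell
# set_panic_reboot CONF SECS
# Set "kernel.panic = SECS" in a sysctl-style file: replace an existing
# kernel.panic line if there is one, otherwise append a new line.
# Similar keys such as kernel.panic_on_oops are left untouched.
set_panic_reboot() (
    conf=$1
    secs=$2
    if grep -Eq '^[[:space:]]*kernel\.panic[[:space:]]*=' "$conf" 2>/dev/null; then
        tmp=$(mktemp) || exit 1
        sed -E "s/^[[:space:]]*kernel\.panic[[:space:]]*=.*/kernel.panic = $secs/" "$conf" > "$tmp" &&
            cat "$tmp" > "$conf"
        rm -f "$tmp"
    else
        printf 'kernel.panic = %s\n' "$secs" >> "$conf"
    fi
)
```

As root, `set_panic_reboot /etc/sysctl.conf 60` followed by `sysctl -p` would then apply the value without waiting for a reboot.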

DOWNLOAD AND INSTALL
4. Where do I get the packages from?
For Novell's SLES9, upgrade to the latest SP3 kernel to get the required modules installed. Also, install ocfs2-tools and ocfs2console packages. For Red Hat's RHEL4, download and install the appropriate module package and the two tools packages, ocfs2-tools and ocfs2console. Appropriate module refers to one matching the kernel version, flavor and architecture. Flavor refers to smp, hugemem, etc.
5. What are the latest versions of the OCFS2 packages?
The latest module package version is 1.2.2. The latest tools/console packages versions are 1.2.1.
6. How do I interpret the package name ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm?
The package name is comprised of multiple parts separated by '-'.
* ocfs2 - Package name
* 2.6.9-22.0.1.ELsmp - Kernel version and flavor
* 1.2.1 - Package version
* 1 - Package subversion
* i686 - Architecture
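
The breakdown above can be reproduced with plain shell parameter expansion. Note that the kernel part itself contains a '-', so a naive split on '-' gives the wrong fields; the sketch below peels fields off the right-hand side instead. The function name `parse_ocfs2_pkg` is hypothetical, and the code assumes the name follows exactly the layout described above.

```shell
# parse_ocfs2_pkg FILE.rpm
# Split an OCFS2 module package file name into its parts.
parse_ocfs2_pkg() (
    file=${1%.rpm}        # drop the .rpm suffix
    arch=${file##*.}      # text after the last '.'
    rest=${file%.*}       # everything before the architecture
    subver=${rest##*-}    # package subversion
    rest=${rest%-*}
    version=${rest##*-}   # package version
    rest=${rest%-*}
    name=${rest%%-*}      # package name (text before the first '-')
    kernel=${rest#*-}     # kernel version and flavor (may contain '-')
    printf 'name=%s kernel=%s version=%s subversion=%s arch=%s\n' \
        "$name" "$kernel" "$version" "$subver" "$arch"
)

parse_ocfs2_pkg ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm
# prints: name=ocfs2 kernel=2.6.9-22.0.1.ELsmp version=1.2.1 subversion=1 arch=i686
```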
7. How do I know which package to install on my box?
After one identifies the package name and version to install, one still needs to determine the kernel version, flavor and architecture.
To know the kernel version and flavor, do:

# uname -r
2.6.9-22.0.1.ELsmp

To know the architecture, do:

# rpm -qf /boot/vmlinuz-`uname -r` --queryformat "%{ARCH}\n"
i686

8. Why can't I use uname -p to determine the kernel architecture?
uname -p does not always provide the exact kernel architecture. Case in point the RHEL3 kernels on x86_64. Even though Red Hat has two different kernel architectures available for this port, ia32e and x86_64, uname -p identifies both as the generic x86_64.
9. How do I install the rpms?
First install the tools and console packages:

# rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm

Then install the appropriate kernel module package:

# rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.1-1.i686.rpm

10. Do I need to install the console?
No, the console is not required but recommended for ease-of-use.
11. What are the dependencies for installing ocfs2console?
ocfs2console requires e2fsprogs, glib2 2.2.3 or later, vte 0.11.10 or later, pygtk2 (EL4) or python-gtk (SLES9) 1.99.16 or later, python 2.3 or later and ocfs2-tools.
12. What modules are installed with the OCFS2 1.2 package?
* configfs.ko
* ocfs2.ko
* ocfs2_dlm.ko
* ocfs2_dlmfs.ko
* ocfs2_nodemanager.ko
* debugfs
13. What tools are installed with the ocfs2-tools 1.2 package?
* mkfs.ocfs2
* fsck.ocfs2
* tunefs.ocfs2
* debugfs.ocfs2
* mount.ocfs2
* mounted.ocfs2
* ocfs2cdsl
* ocfs2_hb_ctl
* o2cb_ctl
* o2cb - init service to start/stop the cluster
* ocfs2 - init service to mount/umount ocfs2 volumes
* ocfs2console - installed with the console package
14. What is debugfs and is it related to debugfs.ocfs2?
debugfs is an in-memory filesystem developed by Greg Kroah-Hartman. It is useful for debugging as it allows kernel space to easily export data to userspace. It is currently being used by OCFS2 to dump the list of filesystem locks and could be used for more in the future. It is bundled with OCFS2 as the various distributions are currently not bundling it. While debugfs and debugfs.ocfs2 are unrelated in general, the latter is used as the front-end for the debugging info provided by the former. For example, refer to the troubleshooting section.

CONFIGURE
15. How do I populate /etc/ocfs2/cluster.conf?
If you have installed the console, use it to create this configuration file. For details, refer to the user's guide. If you do not have the console installed, check the Appendix in the User's guide for a sample cluster.conf and the details of all the components. Do not forget to copy this file to all the nodes in the cluster. If you ever edit this file on any node, ensure the other nodes are updated as well.
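
For readers without the console, the file can also be generated by a small script. The sketch below is an assumption-laden illustration: the stanza layout (tab-indented `key = value` lines under `node:` and `cluster:` headers, with a `node_count` entry) is written from memory of the user's guide sample and should be checked against the Appendix; the function name `gen_cluster_conf` and the second node's address are hypothetical.

```shell
# gen_cluster_conf CLUSTER PORT NAME:IP [NAME:IP ...]
# Print a cluster.conf with one node stanza per NAME:IP pair,
# numbering the nodes from 0 in the order given.
# ASSUMPTION: stanza layout as recalled from the user's guide sample.
gen_cluster_conf() (
    cluster=$1
    port=$2
    shift 2
    num=0
    for entry in "$@"; do
        name=${entry%%:*}
        ip=${entry#*:}
        printf 'node:\n\tip_port = %s\n\tip_address = %s\n\tnumber = %s\n\tname = %s\n\tcluster = %s\n\n' \
            "$port" "$ip" "$num" "$name" "$cluster"
        num=$((num + 1))
    done
    printf 'cluster:\n\tnode_count = %s\n\tname = %s\n' "$num" "$cluster"
)
```

For example, `gen_cluster_conf ocfs2 7777 node01:192.168.210.201 node02:192.168.210.202 > cluster.conf` writes a two-node file that can then be copied unchanged to every node, which keeps the copies identical as the answer above requires.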
16. Should the IP interconnect be public or private?
Using a private interconnect is recommended. While OCFS2 does not take much bandwidth, it does require the nodes to be alive on the network and sends regular keepalive packets to ensure that they are. To avoid a network delay being interpreted as a node disappearing on the net which could lead to a node-self-fencing, a private interconnect is recommended. One could use the same interconnect for Oracle RAC and OCFS2.
17. What should the node name be and should it be related to the IP address?
The node name needs to match the hostname. The IP address need not be the one associated with that hostname. As in, any valid IP address on that node can be used. OCFS2 will not attempt to match the node name (hostname) with the specified IP address.
18. How do I modify the IP address, port or any other information specified in cluster.conf?
While one can use ocfs2console to add nodes dynamically to a running cluster, any other modifications require the cluster to be offlined. Stop the cluster on all nodes, edit /etc/ocfs2/cluster.conf on one and copy to the rest, and restart the cluster on all nodes. Always ensure that cluster.conf is the same on all the nodes in the cluster.
19. How do I add a new node to an online cluster?
You can use the console to add a new node. However, you will need to explicitly add the new node on all the online nodes. That is, adding on one node and propagating to the other nodes is not sufficient. If the operation fails, it will most likely be due to bug#741. In that case, you can use the o2cb_ctl utility on all online nodes as follows:

# o2cb_ctl -C -i -n NODENAME -t node -a number=NODENUM -a ip_address=IPADDR -a ip_port=IPPORT -a cluster=CLUSTERNAME

Ensure the node is added both in /etc/ocfs2/cluster.conf and in /config/cluster/CLUSTERNAME/node on all online nodes. You can then simply copy the cluster.conf to the new (still offline) node as well as other offline nodes. At the end, ensure that cluster.conf is consistent on all the nodes.
20. How do I add a new node to an offline cluster?
You can either use the console or use o2cb_ctl or simply hand edit cluster.conf. Then either use the console to propagate it to all nodes or hand copy using scp or any other tool. The o2cb_ctl command to do the same is:

# o2cb_ctl -C -n NODENAME -t node -a number=NODENUM -a ip_address=IPADDR -a ip_port=IPPORT -a cluster=CLUSTERNAME

Notice the "-i" argument is not required as the cluster is not online.
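
Because the online and offline commands differ only in the "-i" flag, a small helper can build the right command line for either case. This is a sketch, not part of ocfs2-tools: the function name `o2cb_add_node_cmd` is hypothetical, and it only prints the command so it can be reviewed before being run on each node.

```shell
# o2cb_add_node_cmd online|offline NODENAME NODENUM IPADDR IPPORT CLUSTERNAME
# Print the o2cb_ctl command for adding a node; "-i" is included only
# when the cluster is online, as described above.
o2cb_add_node_cmd() (
    case $1 in
        online)  flag=' -i' ;;
        offline) flag='' ;;
        *) echo "state must be online or offline" >&2; exit 2 ;;
    esac
    printf 'o2cb_ctl -C%s -n %s -t node -a number=%s -a ip_address=%s -a ip_port=%s -a cluster=%s\n' \
        "$flag" "$2" "$3" "$4" "$5" "$6"
)
```

For example, `o2cb_add_node_cmd online node03 2 192.168.210.203 7777 ocfs2` prints the command to run on every online node (the node name and address here are made up).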

O2CB CLUSTER SERVICE
21. How do I configure the cluster service?

# /etc/init.d/o2cb configure

Enter 'y' if you want the service to load on boot and the name of the cluster (as listed in /etc/ocfs2/cluster.conf).
22. How do I start the cluster service?
* To load the modules, do:

# /etc/init.d/o2cb load

* To Online it, do:

# /etc/init.d/o2cb online [cluster_name]

If you have configured the cluster to load on boot, you could combine the two as follows:

# /etc/init.d/o2cb start [cluster_name]

The cluster name is not required if you have specified the name during configuration.
23. How do I stop the cluster service?
* To offline it, do:

# /etc/init.d/o2cb offline [cluster_name]

* To unload the modules, do:

# /etc/init.d/o2cb unload

If you have configured the cluster to load on boot, you could combine the two as follows:

# /etc/init.d/o2cb stop [cluster_name]

The cluster name is not required if you have specified the name during configuration.
24. How can I learn the status of the cluster?
To learn the status of the cluster, do:

# /etc/init.d/o2cb status

25. I am unable to get the cluster online. What could be wrong?
Check whether the node name in the cluster.conf exactly matches the hostname. One of the nodes in the cluster.conf needs to be in the cluster for the cluster to be online.
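
The hostname check can be scripted. The sketch below assumes the usual cluster.conf layout of `node:` and `cluster:` stanza headers followed by `key = value` lines (verify against the user's guide); the function name `node_in_cluster_conf` is hypothetical.

```shell
# node_in_cluster_conf CONF HOSTNAME
# Succeed (exit 0) if some node stanza in CONF has "name = HOSTNAME".
# The cluster stanza also has a "name" key, so only node stanzas count.
node_in_cluster_conf() {
    awk -v host="$2" '
        /^node:/    { in_node = 1; next }
        /^cluster:/ { in_node = 0; next }
        in_node && $1 == "name" && $2 == "=" && $3 == host { found = 1 }
        END { exit !found }
    ' "$1"
}
```

A typical call would be `node_in_cluster_conf /etc/ocfs2/cluster.conf "$(hostname)" || echo "hostname not listed"`. The match is exact, so a short name in the file will not match a fully qualified hostname.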

FORMAT
26. How do I format a volume?
You could either use the console or use mkfs.ocfs2 directly to format the volume. For console, refer to the user's guide.

# mkfs.ocfs2 -L "oracle_home" /dev/sdX

The above formats the volume with default block and cluster sizes, which are computed based upon the size of the volume.

# mkfs.ocfs2 -b 4k -C 32K -L "oracle_home" -N 4 /dev/sdX

The above formats the volume for 4 nodes with a 4K block size and a 32K cluster size.
27. What does the number of node slots during format refer to?
The number of node slots specifies the number of nodes that can concurrently mount the volume. This number is specified during format and can be increased using tunefs.ocfs2. This number cannot be decreased.
28. What should I consider when determining the number of node slots?
OCFS2 allocates system files, like Journal, for each node slot. So as to not to waste space, one should specify a number within the ballpark of the actual number of nodes. Also, as this number can be increased, there is no need to specify a much larger number than one plans for mounting the volume.
29. Does the number of node slots have to be the same for all volumes?
No. This number can be different for each volume.
30. What block size should I use?
A block size is the smallest unit of space addressable by the file system. OCFS2 supports block sizes of 512 bytes, 1K, 2K and 4K. The block size cannot be changed after the format. For most volume sizes, a 4K size is recommended. On the other hand, the 512 bytes block is never recommended.
31. What cluster size should I use?
A cluster size is the smallest unit of space allocated to a file to hold the data. OCFS2 supports cluster sizes of 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K and 1M. For database volumes, a cluster size of 128K or larger is recommended. For Oracle home, 32K to 64K.
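
Since neither value can be changed casually after the fact, it can be worth validating the sizes in a wrapper script before calling mkfs.ocfs2. A minimal sketch using only the supported values listed in the two answers above; the function names are hypothetical:

```shell
# to_bytes SIZE: convert 512, 4K, 1M and the like to a byte count.
to_bytes() (
    v=$1
    case $v in
        *[Kk]) echo $(( ${v%[Kk]} * 1024 )) ;;
        *[Mm]) echo $(( ${v%[Mm]} * 1024 * 1024 )) ;;
        *)     echo "$v" ;;
    esac
)

# Succeed only for block sizes of 512 bytes, 1K, 2K or 4K.
valid_block_size() {
    case $(to_bytes "$1") in
        512|1024|2048|4096) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed only for cluster sizes of 4K, 8K, ... 512K or 1M.
valid_cluster_size() {
    case $(to_bytes "$1") in
        4096|8192|16384|32768|65536|131072|262144|524288|1048576) return 0 ;;
        *) return 1 ;;
    esac
}
```

For instance, `valid_block_size 4k && valid_cluster_size 128K` succeeds for the combination recommended above for database volumes.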
32. Any advantage of labelling the volumes?
As in a shared disk environment, the disk name (/dev/sdX) for a particular device could be different on different nodes, labelling becomes a must for easy identification. You could also use labels to identify volumes during mount.

# mount -L "label" /dir

The volume label is changeable using the tunefs.ocfs2 utility.

MOUNT
33. How do I mount the volume?
You could either use the console or use mount directly. For console, refer to the user's guide.

# mount -t ocfs2 /dev/sdX /dir

The above command will mount device /dev/sdX on directory /dir.
34. How do I mount by label?
To mount by label do:

# mount -L "label" /dir

35. What entry do I add to /etc/fstab to mount an ocfs2 volume?
Add the following:

/dev/sdX /dir ocfs2 noauto,_netdev 0 0

The _netdev option indicates that the device needs to be mounted after the network is up.
36. What do I need to do to mount OCFS2 volumes on boot?
* Enable o2cb service using:

# chkconfig --add o2cb

* Enable ocfs2 service using:

# chkconfig --add ocfs2

* Configure o2cb to load on boot using:

# /etc/init.d/o2cb configure

* Add entries into /etc/fstab as follows:

/dev/sdX /dir ocfs2 _netdev 0 0

37. How do I know my volume is mounted?
* Enter mount without arguments, or,

# mount

* List /etc/mtab, or,

# cat /etc/mtab

* List /proc/mounts, or,

# cat /proc/mounts

* Run ocfs2 service.

# /etc/init.d/ocfs2 status

mount command reads the /etc/mtab to show the information.
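
To see only the OCFS2 volumes rather than every mount, the same files can be filtered on the file system type column. A small sketch; the function name `ocfs2_mounts` is hypothetical, and it relies only on the standard field order of /proc/mounts and /etc/mtab (device, mount point, type, options):

```shell
# ocfs2_mounts [MOUNTS_FILE]
# Print device, mount point and options for every ocfs2 entry.
# Reads /proc/mounts unless another file in the same format is given.
ocfs2_mounts() {
    awk '$3 == "ocfs2" { print $1, $2, $4 }' "${1:-/proc/mounts}"
}
```

An empty result means no OCFS2 volume is currently mounted on that node.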
38. What are the /config and /dlm mountpoints for?
OCFS2 comes bundled with two in-memory filesystems configfs and ocfs2_dlmfs. configfs is used by the ocfs2 tools to communicate to the in-kernel node manager the list of nodes in the cluster and to the in-kernel heartbeat thread the resource to heartbeat on. ocfs2_dlmfs is used by ocfs2 tools to communicate with the in-kernel dlm to take and release clusterwide locks on resources.
39. Why does it take so much time to mount the volume?
It takes around 5 secs for a volume to mount. It does so so as to let the heartbeat thread stabilize. In a later release, we plan to add support for a global heartbeat, which will make most mounts instant.

ORACLE RAC
40. Any special flags to run Oracle RAC?
OCFS2 volumes containing the Voting diskfile (CRS), Cluster registry (OCR), Data files, Redo logs, Archive logs and Control files must be mounted with the datavolume and nointr mount options. The datavolume option ensures that the Oracle processes open these files with the o_direct flag. The nointr option ensures that the ios are not interrupted by signals.

# mount -o datavolume,nointr -t ocfs2 /dev/sda1 /u01/db

41. What about the volume containing Oracle home?
Oracle home volume should be mounted normally, that is, without the datavolume and nointr mount options. These mount options are only relevant for Oracle files listed above.

# mount -t ocfs2 /dev/sdb1 /software/orahome

Also as OCFS2 does not currently support shared writeable mmap, the health check (GIMH) file $ORACLE_HOME/dbs/hc_ORACLESID.dat and the ASM file $ASM_HOME/dbs/ab_ORACLESID.dat should be symlinked to local filesystem. We expect to support shared writeable mmap in the RHEL5 timeframe.
42. Does that mean I cannot have my data file and Oracle home on the same volume?
Yes. The volume containing the Oracle data files, redo-logs, etc. should never be on the same volume as the distribution (including the trace logs like, alert.log).
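The mount-option rules from questions 40 to 42 can be checked mechanically. The following is a minimal sketch, not part of the OCFS2 tools. It assumes the options show up in /proc/mounts-style lines the same way they were passed to mount. The function names and the sample text are hypothetical; the devices and mountpoints are the ones from the examples above, and the second sample line is deliberately wrong.

```python
# Sketch: check OCFS2 mount options against the rules in questions 40-42.
# Assumption: options appear in /proc/mounts as they were given to mount.

REQUIRED = {"datavolume", "nointr"}

def ocfs2_mounts(mounts_text):
    """Return {mountpoint: set_of_options} for every ocfs2 entry."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mountpoint, fstype, options, dump, pass
        if len(fields) >= 4 and fields[2] == "ocfs2":
            result[fields[1]] = set(fields[3].split(","))
    return result

def check(mounts_text, data_mountpoints):
    """Return a list of problems; an empty list means the rules hold."""
    problems = []
    for mountpoint, options in sorted(ocfs2_mounts(mounts_text).items()):
        missing = REQUIRED - options
        extra = REQUIRED & options
        if mountpoint in data_mountpoints and missing:
            problems.append(f"{mountpoint}: missing {','.join(sorted(missing))}")
        elif mountpoint not in data_mountpoints and extra:
            problems.append(f"{mountpoint}: unexpected {','.join(sorted(extra))}")
    return problems

# Hypothetical sample: the Oracle home line wrongly carries nointr.
SAMPLE = """\
/dev/sda1 /u01/db ocfs2 rw,datavolume,nointr 0 0
/dev/sdb1 /software/orahome ocfs2 rw,nointr 0 0
"""

problems = check(SAMPLE, data_mountpoints={"/u01/db"})
print(problems)
```

On a real node the text would come from reading /proc/mounts, and data_mountpoints would list the volumes that hold the Oracle files named in question 40.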
43. Any other information I should be aware of?
The 1.2.3 release of OCFS2 does not update the modification time on the inode across the cluster for non-extending writes. However, the time will be locally updated in the cached inodes. As a result, one can observe different times (ls -l) for the same file on different nodes of the cluster.
While this does not affect most uses of the filesystem, as one invariably changes the file size during write, the one usage where this is most commonly experienced is with Oracle datafiles and redologs. This is because Oracle rarely resizes these files and thus almost all writes are non-extending.
In the short term (1.2.x), we intend to provide a mount option (nocmtime) to allow users to explicitly ask the filesystem to not change the modification time during non-extending writes. While this is not the complete solution, this will ensure that the times are consistent across the cluster.
In the long term (1.4.x), we intend to fix this by updating modification times for all writes while providing an opt-out option (nocmtime) for users who would prefer to avoid the performance overhead associated with this feature.

MIGRATE DATA FROM OCFS (RELEASE 1) TO OCFS2
44. Can I mount OCFS volumes as OCFS2?
No. OCFS and OCFS2 are not on-disk compatible. We had to break the compatibility in order to add many of the new features. At the same time, we have added enough flexibility in the new disk layout so as to maintain backward compatibility in the future.
45. Can OCFS volumes and OCFS2 volumes be mounted on the same machine simultaneously?
No. OCFS only works on 2.4 Linux kernels (Red Hat's AS2.1/EL3 and SuSE's SLES8). OCFS2, on the other hand, only works on the 2.6 kernels (Red Hat's EL4 and SuSE's SLES9).
46. Can I access my OCFS volume on 2.6 kernels (SLES9/RHEL4)?
Yes, you can access the OCFS volume on 2.6 kernels using FSCat tools, fsls and fscp. These tools can access the OCFS volumes at the device layer, to list and copy the files to another filesystem. FSCat tools are available on oss.oracle.com.
47. Can I in-place convert my OCFS volume to OCFS2?
No. The on-disk layouts of OCFS and OCFS2 are sufficiently different that it would require a third disk (as a temporary buffer) in order to in-place upgrade the volume. With that in mind, it was decided not to develop such a tool but instead provide tools to copy data from OCFS without one having to mount it.
48. What is the quickest way to move data from OCFS to OCFS2?
Quickest would mean performing the minimum number of copies. If you have a current backup on a non-OCFS volume accessible from the 2.6 kernel install, then all you need to do is restore the backup on the OCFS2 volume(s). If you do not have a backup but have a setup in which the system containing the OCFS2 volumes can access the disks containing the OCFS volume, you can use the FSCat tools to extract data from the OCFS volume and copy it onto OCFS2.

COREUTILS
49. Like with OCFS (Release 1), do I need to use o_direct enabled tools to perform cp, mv, tar, etc.?
No. OCFS2 does not need the o_direct enabled tools. The file system allows processes to open files in both o_direct and buffered mode concurrently.

TROUBLESHOOTING


vecentli replied at: 2006-08-31 22:01:12

# How do I enable and disable filesystem tracing?
To list all the debug bits along with their statuses, do:

# debugfs.ocfs2 -l

To enable tracing the bit SUPER, do:

# debugfs.ocfs2 -l SUPER allow

To disable tracing the bit SUPER, do:

# debugfs.ocfs2 -l SUPER off

To turn off tracing of the SUPER bit entirely, that is, even if some other enabled bit would otherwise trace the same messages, do:

# debugfs.ocfs2 -l SUPER deny

To enable heartbeat tracing, do:

# debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow

To disable heartbeat tracing, do:

# debugfs.ocfs2 -l HEARTBEAT off ENTRY EXIT deny

# How do I get a list of filesystem locks and their statuses?
OCFS2 1.0.9+ has this feature. To get this list, do:

* Mount debugfs at /debug.

# mount -t debugfs debugfs /debug

* Dump the locks.

# echo "fs_locks" | debugfs.ocfs2 /dev/sdX >/tmp/fslocks

# How do I read the fs_locks output?
Let's look at a sample output:

Lockres: M000000000000000006672078b84822 Mode: Protected Read
Flags: Initialized Attached
RO Holders: 0 EX Holders: 0
Pending Action: None Pending Unlock Action: None
Requested Mode: Protected Read Blocking Mode: Invalid

First thing to note is the Lockres, which is the lockname. The dlm identifies resources using locknames. A lockname is a combination of a lock type (S superblock, M metadata, D filedata, R rename, W readwrite), inode number and generation.
To get the inode number and generation from lockname, do:

# echo "stat <M000000000000000006672078b84822>" | debugfs.ocfs2 -n /dev/sdX
Inode: 419616 Mode: 0666 Generation: 2025343010 (0x78b84822)
....
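The relationship between the lockname and the stat output can also be checked by hand. The sketch below is an illustration only. It assumes the layout suggested by the sample above: one lock-type character, zero padding, the inode number in hex, and the generation as the last 8 hex digits. The function name decode_lockname is hypothetical.

```python
# Sketch: split an OCFS2 lockname into lock type, inode number, generation.
# Assumed layout (inferred from the FAQ sample): type character, zero
# padding, inode number in hex, generation in the last 8 hex digits.

LOCK_TYPES = {
    "S": "superblock",
    "M": "metadata",
    "D": "filedata",
    "R": "rename",
    "W": "readwrite",
}

def decode_lockname(lockname):
    """Return (lock type name, inode number, generation)."""
    lock_type = LOCK_TYPES[lockname[0]]
    generation = int(lockname[-8:], 16)   # last 8 hex digits
    inode = int(lockname[1:-8], 16)       # padding zeros do not change the value
    return lock_type, inode, generation

print(decode_lockname("M000000000000000006672078b84822"))
# prints ('metadata', 419616, 2025343010)
```

The decoded values match the Inode and Generation fields printed by the stat command above.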

To map the lockname to a directory entry, do:

# echo "locate <M000000000000000006672078b84822>" | debugfs.ocfs2 -n /dev/sdX
419616 /linux-2.6.15/arch/i386/kernel/semaphore.c

One could also provide the inode number instead of the lockname.

# echo "locate <419616>" | debugfs.ocfs2 -n /dev/sdX
419616 /linux-2.6.15/arch/i386/kernel/semaphore.c

To get a lockname from a directory entry, do:

# echo "encode /linux-2.6.15/arch/i386/kernel/semaphore.c" | debugfs.ocfs2 -n /dev/sdX
M000000000000000006672078b84822 D000000000000000006672078b84822 W000000000000000006672078b84822

The first is the Metadata lock, then Data lock and last ReadWrite lock for the same resource.

The DLM supports 3 lock modes: NL no lock, PR protected read and EX exclusive.

If you have a dlm hang, the resource to look for would be one with the "Busy" flag set.
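Scanning a large fs_locks dump for the "Busy" flag by eye is tedious, so a small filter can help. This is a minimal sketch that assumes the dump consists of records shaped like the sample output above, each starting with a Lockres: line followed by a Flags: line. The function name busy_lockres is hypothetical, and the second record in the sample is invented for illustration.

```python
# Sketch: list the locknames whose Flags line contains "Busy".
# Assumes each record starts with "Lockres:" and has a "Flags:" line.

def busy_lockres(dump_text):
    """Return the locknames of all records flagged Busy, in dump order."""
    busy = []
    current = None
    for line in dump_text.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "Lockres:":
            current = words[1]
        elif words[0] == "Flags:" and current and "Busy" in words[1:]:
            busy.append(current)
    return busy

# First record is the FAQ sample; the second one is hypothetical.
SAMPLE = """\
Lockres: M000000000000000006672078b84822 Mode: Protected Read
Flags: Initialized Attached
RO Holders: 0 EX Holders: 0
Pending Action: None Pending Unlock Action: None
Requested Mode: Protected Read Blocking Mode: Invalid

Lockres: D000000000000000006672078b84822 Mode: Protected Read
Flags: Initialized Attached Busy
RO Holders: 0 EX Holders: 0
"""

print(busy_lockres(SAMPLE))
```

With a real dump, the text would be read from the file written in the previous step (/tmp/fslocks).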

The next step would be to query the dlm for the lock resource.

Note: The dlm debugging is still a work in progress.

To do dlm debugging, first one needs to know the dlm domain, which matches the volume UUID.

# echo "stats" | debugfs.ocfs2 -n /dev/sdX | grep UUID: | while read a b ; do echo $b ; done
82DA8137A49A47E4B187F74E09FBBB4B

Then do:

# echo R dlm_domain lockname > /proc/fs/ocfs2_dlm/debug

For example:

# echo R 82DA8137A49A47E4B187F74E09FBBB4B M000000000000000006672078b84822 > /proc/fs/ocfs2_dlm/debug
# dmesg | tail
struct dlm_ctxt: 82DA8137A49A47E4B187F74E09FBBB4B, node=79, key=965960985
lockres: M000000000000000006672078b84822, owner=75, state=0 last used: 0, on purge list: no
granted queue:
type=3, conv=-1, node=79, cookie=11673330234144325711, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
converting queue:
blocked queue:

It shows that the lock is mastered by node 75 and that node 79 has been granted a PR lock on the resource.
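The same reading of the output can be sketched in code. This is only an illustration of how the sample above is interpreted. Since dlm debugging is described as a work in progress, the output format may differ between releases. The function name parse_lockres is hypothetical, and only the mapping of type=3 to PR is taken from the example; other type values are left as raw numbers.

```python
# Sketch: extract the lock master and the granted locks from the
# dmesg output shown above. Format assumed from the FAQ sample only.
import re

MODE_NAMES = {3: "PR"}  # only type=3 (PR) is confirmed by the example

def parse_lockres(text):
    """Return a dict with the lockname, owner node and granted locks."""
    info = {"lockres": None, "owner": None, "granted": []}
    queue = None
    for raw in text.splitlines():
        line = raw.strip()
        head = re.match(r"lockres: ([^,\s]+), owner=(\d+)", line)
        entry = re.match(r"type=(-?\d+), conv=(-?\d+), node=(\d+)", line)
        if head:
            info["lockres"] = head.group(1)
            info["owner"] = int(head.group(2))
        elif line.endswith("queue:"):
            queue = line.split()[0]          # granted, converting or blocked
        elif entry and queue == "granted":
            mode = int(entry.group(1))
            name = MODE_NAMES.get(mode, f"type={mode}")
            info["granted"].append((int(entry.group(3)), name))
    return info

SAMPLE = """\
struct dlm_ctxt: 82DA8137A49A47E4B187F74E09FBBB4B, node=79, key=965960985
lockres: M000000000000000006672078b84822, owner=75, state=0 last used: 0, on purge list: no
granted queue:
type=3, conv=-1, node=79, cookie=11673330234144325711, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
converting queue:
blocked queue:
"""

print(parse_lockres(SAMPLE))
```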

This is just to give a flavor of dlm debugging.

LIMITS
# Is there a limit to the number of subdirectories in a directory?
Yes. OCFS2 currently allows up to 32000 subdirectories. While this limit could be increased, we will not be doing it till we implement some kind of efficient name lookup (htree, etc.).
# Is there a limit to the size of an ocfs2 file system?
Yes, current software addresses block numbers with 32 bits. So the file system device is limited to (2 ^ 32) * blocksize (see mkfs -b). With a 4KB block size this amounts to a 16TB file system. This block addressing limit will be relaxed in future software. At that point the limit becomes addressing clusters of 1MB each with 32 bits which leads to a 4PB file system.
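The arithmetic behind these limits is easy to reproduce. The sketch below simply evaluates (2 ^ 32) * unit size for a few sizes. The function name max_fs_bytes and the list of block sizes are chosen for illustration.

```python
# Sketch: file system size limit with 32-bit addressing.
# limit = 2^32 addressable units * size of one unit in bytes

TIB = 2 ** 40
PIB = 2 ** 50

def max_fs_bytes(unit_size_bytes, address_bits=32):
    """Largest addressable size for the given block or cluster size."""
    return (2 ** address_bits) * unit_size_bytes

# Block-addressed limit for a few illustrative block sizes.
for block_size in (512, 1024, 2048, 4096):
    print(block_size, "byte blocks:", max_fs_bytes(block_size) // TIB, "TB")

# Cluster-addressed limit with 1MB clusters.
print("1MB clusters:", max_fs_bytes(1024 * 1024) // PIB, "PB")
```

With 4KB blocks this gives 16TB, and with 1MB clusters 4PB, matching the figures in the answer above.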

SYSTEM FILES
# What are system files?
System files are used to store standard filesystem metadata like bitmaps, journals, etc. Storing this information in files in a directory allows OCFS2 to be extensible. These system files can be accessed using debugfs.ocfs2. To list the system files, do:

# echo "ls -l //" | debugfs.ocfs2 -n /dev/sdX
18 16 1 2 .
18 16 2 2 ..
19 24 10 1 bad_blocks
20 32 18 1 global_inode_alloc
21 20 8 1 slot_map
22 24 9 1 heartbeat
23 28 13 1 global_bitmap
24 28 15 2 orphan_dir:0000
25 32 17 1 extent_alloc:0000
26 28 16 1 inode_alloc:0000
27 24 12 1 journal:0000
28 28 16 1 local_alloc:0000
29 3796 17 1 truncate_log:0000

The first column lists the block number.
# Why do some files have numbers at the end?
There are two types of files, global and local. Global files are for all the nodes, while local files, like journal:0000, are node specific. The set of local files used by a node is determined by the slot mapping of that node. The number at the end of the system file name is the slot number. To list the slot maps, do:

# echo "slotmap" | debugfs.ocfs2 -n /dev/sdX
Slot# Node#
0 39
1 40
2 41
3 42
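Combining the slot map with the system file listing tells you which local files a given node uses. The following is a minimal sketch that parses the slotmap output above. It assumes the suffix is the slot number zero-padded to four digits, as the names in the listing (for example journal:0000) suggest. The function names are hypothetical.

```python
# Sketch: map a node number to its node-local system files.
# Assumption: the suffix is the slot number padded to 4 digits.

SLOTMAP = """\
Slot# Node#
0 39
1 40
2 41
3 42
"""

LOCAL_FILES = ("orphan_dir", "extent_alloc", "inode_alloc",
               "journal", "local_alloc", "truncate_log")

def parse_slotmap(text):
    """Return {node number: slot number} from slotmap output."""
    slots = {}
    for line in text.splitlines()[1:]:     # skip the header line
        fields = line.split()
        if len(fields) == 2:
            slots[int(fields[1])] = int(fields[0])
    return slots

def local_files_for(node, slots):
    """Return the names of the local system files used by a node."""
    return [f"{name}:{slots[node]:04d}" for name in LOCAL_FILES]

slots = parse_slotmap(SLOTMAP)
print(local_files_for(41, slots))
```

For node 41, which holds slot 2 in the sample, the list starts with orphan_dir:0002 and includes journal:0002.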

From "ITPUB Blog", link: http://blog.itpub.net/34329/viewspace-911076/. If you repost this, please credit the source; otherwise legal liability will be pursued.
