OCFS, OCFS2, ASM, RAW Discussion 2 (repost)


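Original · Oracle · Author: m77m78 · Date: 2007-04-16 21:41:23
If you need to deploy RAC quickly and lack experience in this area, the "Oracle Validated Configurations" that Oracle provides are the best helper you can get.
When Oracle first introduced OVC I thought it was excellent; even for people who know Linux/Oracle/RAC very well, it is a good tool that greatly reduces the workload.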
# How does the disk heartbeat work?
Every node writes every two secs to its block in the heartbeat system file. The block offset is equal to its global node number. So node 0 writes to the first block, node 1 to the second, etc. All the nodes also read the heartbeat sysfile every two secs. As long as the timestamp is changing, that node is deemed alive.
# When is a node deemed dead?
An active node is deemed dead if it does not update its timestamp for O2CB_HEARTBEAT_THRESHOLD (default=7) loops. Once a node is deemed dead, the surviving node that manages to take the cluster lock on the dead node's journal recovers it by replaying the journal.
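
The dead-node rule above can be sketched as a counter over successive heartbeat reads. This is only an illustration, not the kernel's code: the function name and the list-of-timestamps input are hypothetical, and the exact off-by-one behaviour of the real o2hb thread is an assumption.

```python
def read_index_when_dead(timestamps, threshold=7):
    """Return the index of the read at which a node is deemed dead, or None.

    timestamps: values read from the node's heartbeat block, one per loop
    (hypothetical input format). A node is treated as dead once its
    timestamp has stayed unchanged for `threshold` consecutive reads
    (assumed counting convention).
    """
    unchanged = 0
    for i in range(1, len(timestamps)):
        if timestamps[i] != timestamps[i - 1]:
            unchanged = 0          # timestamp moved: node is alive
        else:
            unchanged += 1         # one more loop without an update
            if unchanged >= threshold:
                return i
    return None


# The node updates twice, then stops writing.
print(read_index_when_dead([1, 2, 3] + [3] * 7))   # 9
print(read_index_when_dead([1, 2, 3, 3, 3, 4]))    # None
```
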
# What about self fencing?
A node self-fences if it fails to update its timestamp for ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) secs. The [o2hb-xx] kernel thread, after every timestamp write, sets a timer to panic the system after that duration. If the next timestamp is written within that duration, as it should be, the thread first cancels that timer before setting up a new one. This ensures the system will self-fence if for some reason the [o2hb-xx] kernel thread is unable to update the timestamp and is thus deemed dead by the other nodes in the cluster.
# How can one change the parameter value of O2CB_HEARTBEAT_THRESHOLD?
This parameter value could be changed by adding it to /etc/sysconfig/o2cb and RESTARTING the O2CB cluster. This value should be the SAME on ALL the nodes in the cluster.
# What should one set O2CB_HEARTBEAT_THRESHOLD to?
It should be set to the timeout value of the io layer. Most multipath solutions have a timeout ranging from 60 secs to 120 secs. For 60 secs, set it to 31. For 120 secs, set it to 61.

O2CB_HEARTBEAT_THRESHOLD = (((timeout in secs) / 2) + 1)
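
As a quick check of the arithmetic, the formula above and the self-fence duration from the earlier answer can be written as two small helpers. The function names are made up for this sketch; the 2-second interval is the heartbeat period stated in the FAQ.

```python
def heartbeat_threshold(io_timeout_secs, interval_secs=2):
    """O2CB_HEARTBEAT_THRESHOLD = (timeout in secs / 2) + 1."""
    return io_timeout_secs // interval_secs + 1


def self_fence_secs(threshold, interval_secs=2):
    """Seconds without a timestamp write before a node panics itself."""
    return (threshold - 1) * interval_secs


for timeout in (60, 120):
    t = heartbeat_threshold(timeout)
    print(timeout, t, self_fence_secs(t))
# 60 31 60
# 120 61 120

print(self_fence_secs(7))  # default threshold -> 12
```
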

# How does one check the current active O2CB_HEARTBEAT_THRESHOLD value?

# cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold

# What if a node umounts a volume?
During umount, the node will broadcast to all the nodes that have mounted that volume to drop that node from its node maps. As the journal is shutdown before this broadcast, any node crash after this point is ignored as there is no need for recovery.
# I encounter "Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing" whenever I run a heavy io load?
We have encountered a bug with the default CFQ io scheduler which causes a process doing heavy io to temporarily starve out other processes. While this is not fatal for most environments, it is for OCFS2, as we expect the hb thread to read from and write to the hb area at least once every 12 secs (default). A bug with the fix has been filed with Red Hat, which is expected to include the fix in the RHEL4 U4 release. SLES9 SP3 2.6.5-7.257 includes this fix. For the latest, refer to the tracker bug filed on bugzilla. Until this issue is resolved, one is advised to use the DEADLINE io scheduler. To use it, add "elevator=deadline" to the kernel command line as follows:

* For SLES9, edit the command line in /boot/grub/menu.lst.

title Linux 2.6.5-7.244-bigsmp (with deadline)
kernel (hd0,4)/boot/vmlinuz-2.6.5-7.244-bigsmp root=/dev/sda5 vga=0x314 selinux=0 splash=silent resume=/dev/sda3 elevator=deadline showopts console=tty0 console=ttyS0,115200 noexec=off
initrd (hd0,4)/boot/initrd-2.6.5-7.244-bigsmp

* For RHEL4, edit the command line in /boot/grub/grub.conf:

title Red Hat Enterprise Linux AS (2.6.9-22.EL) (with deadline)
root (hd0,0)
kernel /vmlinuz-2.6.9-22.EL ro root=LABEL=/ console=ttyS0,115200 console=tty0 elevator=deadline noexec=off
initrd /initrd-2.6.9-22.EL.img

To see the current kernel command line, do:

# cat /proc/cmdline

# What is a quorum?
A quorum is a designation given to a group of nodes in a cluster which are still allowed to operate on shared storage. It comes up when there is a failure in the cluster which breaks the nodes up into groups which can communicate in their groups and with the shared storage but not between groups.
# How does OCFS2's cluster services define a quorum?
The quorum decision is made by a single node based on the number of other nodes that are considered alive by heartbeating and the number of other nodes that are reachable via the network.
A node has quorum when:

* it sees an odd number of heartbeating nodes and has network connectivity to more than half of them.
* it sees an even number of heartbeating nodes and has network connectivity to at least half of them *and* has connectivity to the heartbeating node with the lowest node number.
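
A literal reading of the two rules above can be sketched as follows. This is not the kernel's quorum code: the function name and set-based arguments are hypothetical, and the sketch assumes a node counts itself both as heartbeating and as reachable.

```python
def has_quorum(self_node, heartbeating, connected):
    """Apply the two quorum rules as stated in the FAQ.

    heartbeating: node numbers seen alive via disk heartbeat.
    connected:    node numbers this node has a network connection to.
    Assumption: the node itself counts in both sets.
    """
    hb = set(heartbeating) | {self_node}
    reachable = (set(connected) | {self_node}) & hb
    if len(hb) % 2 == 1:
        # Odd cluster: need connectivity to more than half.
        return 2 * len(reachable) > len(hb)
    # Even cluster: at least half, and the lowest-numbered node must be reachable.
    return 2 * len(reachable) >= len(hb) and min(hb) in reachable


# Two-node cluster whose interconnect fails: only node 0 keeps quorum.
print(has_quorum(0, {0, 1}, set()))  # True
print(has_quorum(1, {0, 1}, set()))  # False
```

Under this reading, when a 2-node cluster loses its interconnect, the higher-numbered node is the one that fences itself.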

# What is fencing?
Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2 mounted will fence itself when it realizes that it doesn't have quorum in a degraded cluster. It does this so that other nodes won't get stuck trying to access its resources. Currently OCFS2 will panic the machine when it realizes it has to fence itself off from the cluster. As described in the quorum definition above, it will do this when it sees more nodes heartbeating than it has connectivity to and fails the quorum test.
# How does a node decide that it has connectivity with another?
When a node sees another come to life via heartbeating it will try and establish a TCP connection to that newly live node. It considers that other node connected as long as the TCP connection persists and the connection is not idle for 10 seconds. Once that TCP connection is closed or idle it will not be reestablished until heartbeat thinks the other node has died and come back alive.
# How long does the quorum process take?
First a node will realize that it doesn't have connectivity with another node. This can happen immediately if the connection is closed but can take a maximum of 10 seconds of idle time. Then the node must wait long enough to give heartbeating a chance to declare the node dead. It does this by waiting two iterations longer than the number of iterations needed to consider a node dead (see the Heartbeat section of this FAQ). The current default of 7 iterations of 2 seconds results in waiting for 9 iterations or 18 seconds. By default, then, a maximum of 28 seconds can pass from the time a network fault occurs until a node fences itself.
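
The 28-second figure follows directly from the numbers in this answer; a small helper (hypothetical name and parameters, defaults taken from the text) makes the arithmetic explicit and shows how it scales with a larger threshold.

```python
def max_secs_until_fence(threshold=7, interval_secs=2,
                         idle_timeout_secs=10, extra_iterations=2):
    """Worst-case time from a network fault to self-fencing.

    idle timeout + (threshold + 2 extra iterations) * heartbeat interval.
    """
    wait_iterations = threshold + extra_iterations
    return idle_timeout_secs + wait_iterations * interval_secs


print(max_secs_until_fence())              # 10 + 9 * 2 = 28
print(max_secs_until_fence(threshold=31))  # 10 + 33 * 2 = 76
```
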
# How can one prevent a node from panicking when the other node in a 2-node cluster is shut down?
This typically means that the network is shutting down before all the OCFS2 volumes have been umounted. Ensure the ocfs2 init script is enabled. This script ensures that the OCFS2 volumes are umounted before the network is shut down. To check whether the service is enabled, do:

# chkconfig --list ocfs2
ocfs2 0:off 1:off 2:on 3:on 4:on 5:on 6:off

# How does one list out the startup and shutdown ordering of the OCFS2 related services?

* To list the startup order for runlevel 3 on RHEL4, do:

# cd /etc/rc3.d
# ls S*ocfs2* S*o2cb* S*network*
S10network S24o2cb S25ocfs2

* To list the shutdown order on RHEL4, do:

# cd /etc/rc6.d
# ls K*ocfs2* K*o2cb* K*network*
K19ocfs2 K20o2cb K90network

* To list the startup order for runlevel 3 on SLES9, do:

# cd /etc/init.d/rc3.d
# ls S*ocfs2* S*o2cb* S*network*
S05network S07o2cb S08ocfs2

* To list the shutdown order on SLES9, do:

# cd /etc/init.d/rc3.d
# ls K*ocfs2* K*o2cb* K*network*
K14ocfs2 K15o2cb K17network

Please note that the default ordering in the ocfs2 scripts only includes the network service and not any shared-device specific service, like iscsi. If one is using iscsi or any shared device requiring a service to be started and shut down, please ensure that that service starts before and shuts down after the ocfs2 init service.
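
The ordering rule can be checked mechanically from the `ls` output shown above. The sketch below is only an illustration: the helper names are invented, and the `S30iscsi` entry is a hypothetical example of a storage service that starts too late.

```python
import re


def priorities(link_names):
    """Map service name -> two-digit priority from names like 'S24o2cb'."""
    prio = {}
    for name in link_names:
        m = re.fullmatch(r"[SK](\d\d)(.+)", name)
        if m:
            prio[m.group(2)] = int(m.group(1))
    return prio


def startup_ok(link_names, prerequisites=("network",)):
    """Prerequisites must start before o2cb, and o2cb before ocfs2."""
    p = priorities(link_names)
    return all(p[s] < p["o2cb"] for s in prerequisites) and p["o2cb"] < p["ocfs2"]


def shutdown_ok(link_names, prerequisites=("network",)):
    """ocfs2 must stop before o2cb, and o2cb before the prerequisites."""
    p = priorities(link_names)
    return p["ocfs2"] < p["o2cb"] and all(p["o2cb"] < p[s] for s in prerequisites)


print(startup_ok(["S10network", "S24o2cb", "S25ocfs2"]))   # True (RHEL4 listing)
print(shutdown_ok(["K19ocfs2", "K20o2cb", "K90network"]))  # True (RHEL4 listing)
# Hypothetical iscsi service that starts after ocfs2: flagged as wrong.
print(startup_ok(["S10network", "S24o2cb", "S25ocfs2", "S30iscsi"],
                 prerequisites=("network", "iscsi")))      # False
```
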

# Why are OCFS2 packages for SLES9 not made available on
OCFS2 packages for SLES9 are available directly from Novell as part of the kernel. The same is true for the various Asianux distributions and for Ubuntu. As OCFS2 is now part of the mainline kernel, we expect more distributions to bundle the product with the kernel.
# What versions of OCFS2 are available with SLES9 and how do they match with the Red Hat versions available on
As Novell and Oracle ship OCFS2 on different schedules, the package versions do not match. We expect this to resolve itself over time as the number of patch fixes decreases. Novell is shipping two SLES9 releases, viz., SP2 and SP3.

* The latest kernel with the SP2 release is 2.6.5-7.202.7. It ships with OCFS2 1.0.8.
* The latest kernel with the SP3 release is 2.6.5-7.257. It ships with OCFS2 1.2.1.

# What is new in OCFS2 1.2?
OCFS2 1.2 has two new features:

* It is endian-safe. With this release, one can mount the same volume concurrently on x86, x86-64, ia64 and big endian architectures ppc64 and s390x.
* Supports readonly mounts. The fs uses this feature to auto remount ro when encountering on-disk corruptions (instead of panic-ing).

# Do I need to re-make the volume when upgrading?
No. OCFS2 1.2 is fully on-disk compatible with 1.0.
# Do I need to upgrade anything else?
Yes, the tools need to be upgraded to ocfs2-tools 1.2. ocfs2-tools 1.0 will not work with OCFS2 1.2, nor will the 1.2 tools work with the 1.0 modules.

# How do I upgrade to the latest release?

* Download the latest ocfs2-tools and ocfs2console for the target platform and the appropriate ocfs2 module package for the kernel version, flavor and architecture. (For more, refer to the "Download and Install" section above.)

* Umount all OCFS2 volumes.

# umount -at ocfs2

* Shutdown the cluster and unload the modules.

# /etc/init.d/o2cb offline
# /etc/init.d/o2cb unload

* If required, upgrade the tools and console.

# rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm

* Upgrade the module.

# rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.2-1.i686.rpm

* Ensure init services ocfs2 and o2cb are enabled.

# chkconfig --add o2cb
# chkconfig --add ocfs2

* To check whether the services are enabled, do:

# chkconfig --list o2cb
o2cb 0:off 1:off 2:on 3:on 4:on 5:on 6:off
# chkconfig --list ocfs2
ocfs2 0:off 1:off 2:on 3:on 4:on 5:on 6:off

* At this stage one could either reboot the node or simply restart the cluster and mount the volume.

# Can I do a rolling upgrade from 1.0.x/1.2.x to 1.2.2?
Rolling upgrade to 1.2.2 is not recommended. Shut down the cluster on all nodes before upgrading the nodes.
# After upgrade I am getting the following error on mount "mount.ocfs2: Invalid argument while mounting /dev/sda6 on /ocfs".
Do "dmesg | tail". If you see the error:

ocfs2_parse_options:523 ERROR: Unrecognized mount option "heartbeat=local" or missing value

it means that you are trying to use the 1.2 tools and 1.0 modules. Ensure that you have unloaded the 1.0 modules and installed and loaded the 1.2 modules. Use modinfo to determine the version of the module installed and/or loaded.
# The cluster fails to load. What do I do?
Check "dmesg | tail" for any relevant errors. One common error is as follows:

SELinux: initialized (dev configfs, type configfs), not configured for labeling
audit(1139964740.184:2): avc: denied { mount } for ...

The above error indicates that you have SELinux activated. A bug in SELinux does not allow configfs to mount. Disable SELinux by setting "SELINUX=disabled" in /etc/selinux/config. The change takes effect on reboot.

[ Last edited by nntp on 2006-9-1 00:00 ]

vecentli replied on 2006-08-31 22:02:14

# List and describe all OCFS2 threads?

* [o2net] One per node. Is a workqueue thread started when the cluster is brought online and stopped when offline. It handles the network communication for all threads. It gets the list of active nodes from the o2hb thread and sets up tcp/ip communication channels with each active node. It sends regular keepalive packets to detect any interruption on the channels.
* One per node. Is a workqueue thread started when dlmfs is loaded and stopped on unload. (dlmfs is an in-memory file system which allows user space processes to access the dlm in kernel to lock and unlock resources.) Handles lock downconverts when requested by other nodes.
* One per node. Is a workqueue thread started when the ocfs2 module is loaded and stopped on unload. Handles blockable file system tasks like truncate log flush, orphan dir recovery and local alloc recovery, which involve taking dlm locks. Various code paths queue tasks to this thread. For example, ocfs2rec queues orphan dir recovery so that while the task is kicked off as part of recovery, its completion does not affect the recovery time.
* [o2hb-xx] One per heartbeat device. Is a kernel thread started when the heartbeat region is populated in configfs and stopped when it is removed. It writes every 2 secs to its block in the heartbeat region to indicate to other nodes that that node is alive. It also reads the region to maintain a nodemap of live nodes. It notifies o2net and dlm of any changes in the nodemap.
* One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. It downgrades locks when requested by other nodes in response to blocking ASTs (BASTs). It also fixes up the dentry cache in response to files unlinked or renamed on other nodes.
* One per dlm domain. Is a kernel thread started when a dlm domain is created and stopped when destroyed. This is the core dlm which maintains the list of lock resources and handles the cluster locking infrastructure.
* One per dlm domain. Is a kernel thread which handles dlm recovery whenever a node dies. If the node is the dlm recovery master, it remasters all the locks owned by the dead node.
* One per dlm domain. Is a workqueue thread. o2net queues dlm tasks on this thread.
* [kjournald] One per mount. Is used as OCFS2 uses JBD for journalling.
* One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. Works in conjunction with kjournald.
* [ocfs2rec] Is started whenever another node needs to be recovered. This could be either on mount when it discovers a dirty journal or during operation when hb detects a dead node. ocfs2rec handles the file system recovery and it runs after the dlm has finished its recovery.

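vecentli replied on 2006-08-31 22:02:47


nntp replied on 2006-09-01 00:44:25

Everyone, I have merged this board's main discussion threads on ocfs, ocfs2, ASM, and raw into this one; please continue the discussion here.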

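nntp replied on 2006-09-01 03:05:31

If you need to deploy RAC quickly and lack experience in this area, the "Oracle Validated Configurations" that Oracle provides are the best helper you can get.
When Oracle first introduced OVC I thought it was excellent; even for people who know Linux/Oracle/RAC very well, it is a good tool that greatly reduces the workload.

Those who are unsure of the situation and pressed by deadlines can follow OVC exactly to get the job done; if you already have RAC running and hit a fault, OVC also serves as a troubleshooting reference.

Oracle Validated Configurations

nntp replied on 2006-09-01 03:46:54
A very worthwhile Q&A discussion on the Oracle Forum. My view largely matches that of the later posters there, especially the one who mentioned the convenient conversion between ASM and RAW.
As for my earlier answer in this thread about where to place the voting disk and the OCR, I did not give much of the reasoning then; that discussion also touches on it briefly.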

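vecentli replied on 2006-09-01 10:07:53

Quote: originally posted by nntp on 2006-8-31 18:01

Single instance or RAC? If it is RAC, ASM can handle this situation even after a power loss. Do you subscribe to Oracle Magazine? An issue late last year covered a similar case.



[ Last edited by vecentli on 2006-9-1 10:10 ]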

blue_stone replied on 2006-09-01 12:01:34

Could someone compare gfs, gpfs, ocfs, and ocfs2?
In terms of use cases, reliability, availability, performance, stability, and so on.

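nntp replied on 2006-09-01 16:13:41

gfs and ocfs2 are the same kind of thing, and they are not the same kind of thing as ocfs or gpfs. ocfs is unlike any of the others.

gfs/ocfs2 make it possible for multiple nodes to access the same location on shared storage. They synchronize file system caches across nodes over an ordinary network, use cluster locks to rule out applications on different nodes racing on the same file and corrupting it, and exchange heartbeat state between nodes over an ordinary network. That is where they are functionally alike. In maturity and performance, ocfs2 is currently far from comparable to gfs: wherever ocfs2 can be used, gfs can replace it, but not the reverse. In HA cluster environments gfs plays the role of a "cheap cut-down" PolyServe. At least for now, my personal view is that gfs leads ocfs2 by roughly three years in technology, maturity, development investment, and performance, and the gap may widen further.

ocfs works only for Oracle. It was the first version with which Oracle brought cluster file systems into its plans. As I said before, that version was not positioned as a general-purpose cluster file system, and on quality, performance, stability, and so on, negative opinions were the majority among Oracle users.

Even today, in the ocfs2 era, the Oracle mailing lists and forums are full of complaints about ocfs2's quality, performance, and reliability.

ASM is the new-generation storage management system that Oracle uses on Linux and on several high-end commercial Unix platforms such as HP-UX and Solaris. Its standing among Oracle's products, its development investment, its user base, and the tiers and fields it applies to are beyond comparison with the ocfs2 project.
Functionally, ASM is equivalent to RAW+LVM. It also holds up well as data volume and access volume grow linearly. In real test environments ASM's performance is close to RAW; it carries a small overhead because of the volume layer, which is easy to understand. CLVM+OCFS2 performs clearly below ASM and RAW in the linear-growth tests. The day before yesterday a friend sent me the slides he presented at an annual meeting of a European high-energy physics laboratory. Their IT department tallied it up: across all of the lab's single-instance databases and clusters, more than 540 TB of data now runs on ASM, and after heavy use and testing they are quite satisfied with how ASM performs. Most of their systems are IA64+Linux and AMD Opteron+Linux. When I have time I will post some of their tests and conclusions.

[ Last edited by nntp on 2006-9-1 16:30 ]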

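myprotein replied on 2006-09-15 09:14:06

One thing I don't understand: regarding lvm+ocfs2, you said lvm is not cluster aware, but as far as my limited knowledge goes, AIX lets you create a concurrent VG, doesn't it? Is that concurrent VG cluster aware?

blue_stone replied on 2006-09-15 10:18:33

Quote: originally posted by myprotein on 2006-9-15 09:14
One thing I don't understand: regarding lvm+ocfs2, you said lvm is not cluster aware, but as far as my limited knowledge goes, AIX lets you create a concurrent VG, doesn't it? Is that concurrent VG cluster aware?

Neither lvm nor lvm2 is cluster aware; on Linux the cluster-aware volume manager is clvm.
The concurrent VG in AIX is cluster aware.

myprotein replied on 2006-09-15 10:47:13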


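king3171 replied on 2006-09-19 17:14:25

Quote: originally posted by nntp on 2006-9-1 16:13
gfs and ocfs2 are the same kind of thing, and they are not the same kind of thing as ocfs or gpfs. ocfs is unlike any of the others.

gfs/ocfs2 make it possible for multiple nodes to access the same location on shared storage. They synchronize file system caches across nodes over an ordinary network, ...

I read every reply in this thread and learned a lot. I am very interested in the comparison of these file systems, but I still have doubts. I know the Sun Solaris file systems well and have some knowledge of HP-UX and AIX, though not much about their file systems. Solaris has a file system called Global File Systems, also known as Cluster file system or Proxy file system, which I think is the GFS you are talking about. In Solaris this Global File Systems can be accessed by multiple cluster nodes at the same time, but only one node actually controls the reads and writes. The other nodes operate through this primary node, and when the primary goes down, control moves to another node. However, this Solaris Global File Systems is not fundamentally different from an ordinary UFS file system; the partition to be used as Global File Systems is simply mounted with the global option, as follows:
mount -o global,logging /dev/vx/dsk/nfs-dg/vol-01 /global/nfs
Last year, while building a Sun cluster running IBM DB2, we hit some problems when using this Global File Systems. The vendor's engineer then said Global File Systems was not recommended because it was prone to problems, so we dropped it, although it later turned out that the problems were not caused by Global File Systems.

[ Last edited by king3171 on 2006-9-19 17:18 ]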

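nntp replied on 2006-09-19 21:58:47

Sorry, but to be blunt, your understanding of Solaris cluster file systems is incorrect.

Solaris can run a standalone cluster file system product called SUN CFS - Cluster File System. It was bought from Veritas CFS and O*-ed into Sun's own product. HP-UX also has a CFS, likewise O*-ed from Veritas CFS. When this CFS was launched, Sistina's GFS was in fact still in its infancy, so within the industry Veritas claimed that this CFS could deliver a Global File Service.
That is one of the incorrect points in what you have learned. Sun/HP's CFS claims to implement a Global File Service, but that GFS is not Sistina's "GFS" (Global File System). The one-word difference captures both the similarity and the distinction between the two.

As for the principles and internals of Sun's CFS, you can find a white paper on the Sun site. As I recall it is a PDF named Sun Cluster Software Cluster File System xxxxx; Google it. It describes in detail the components, characteristics, principles, and basic features of Sun CFS, and it is quite clearly written.

The pinned post on this board has detailed links and documents about the GFS from Sistina, which Red Hat acquired. Since you said you need to understand the difference between the two, and it cannot be explained in a sentence or two, I suggest reading both white papers and specifications in detail; a clear comparison will follow naturally.

Because they are different products with different purposes, design characteristics, and uses, there is no shared functional standard. There certainly are standards at the low-level coding and design level, following the major standards common in the Unix world.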

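king3171 replied on 2006-09-20 13:33:58

Thanks, I will look into it further. You say Solaris can run a standalone cluster file system product called SUN CFS - Cluster File System. I don't know whether that is Sun's CLUSTER 3.1 product; I think not, because SUN CLUSTER 3.1 does not mention what you describe. The GLOBAL file system I mentioned earlier is a concept from the CLUSTER 3.1 product. In my conversations with Sun's engineers I have not heard them mention a separate cluster file system product. I will look into it and come back to discuss any new findings with you.

[ Last edited by king3171 on 2006-9-20 13:46 ]

nntp replied on 2006-09-21 18:08:54

Sorry, you had better take a look. Heh.

From the ITPUB blog. If you repost, please credit the source; otherwise legal action may be pursued.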
