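Background: During the National Day holiday in 2010 our company relocated its machine room. In the same window we also had to upgrade some of the IBM midrange servers, expand the EMC storage, and upgrade every Oracle database to 10g RAC. In total this covered 16 IBM servers, 2 EMC storage arrays, and 8 Oracle databases. It was a big job on a very tight schedule: only 4 days, because the factory was due to start operating on October 5. Almost everyone considered the project crazy and unbelievable, but management stayed calm. The result after 4 days was acceptable, though not every goal was met. The relocation was done, the IBM server upgrades were done, and the EMC storage data migration, MirrorView, and NetWorker implementations were done. Only 3 databases failed to reach RAC as planned, because of storage and HA issues, and were only upgraded to 10g; everything else met its target. To us the goals were basically met, but management was clearly not very satisfied and only let it go because time had run out. The team had worked day and night and had barely slept during those days. Operation afterward was fairly stable. However, the change in connection method that came with RAC caused the applications quite a bit of trouble.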
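The first task was to modify the TNS configuration on every client on the factory floor. There were many machines, but the work was not hard and anyone could do it. The real problem was that the office network and the factory network were on different subnets (I later learned that the switches of the two networks were physically isolated). As a result, FAILOVER connections from office-network clients and applications worked only intermittently, while connections that specified instance_name were very stable.

My manager insisted that my Oracle configuration was at fault, and I refused to accept that. My reasoning was that factory-network clients could fail over and were very stable, and office-network clients could sometimes connect too. If they could connect at all, the Oracle configuration was fine, so I said the network was the problem. My manager found this unconvincing and asked me to point out exactly what was wrong with the network, which left me speechless with anger. With no other choice, I captured network packets for analysis. Fortunately I had used Wireshark before, though I was no expert.

The capture showed that when an office-network client connected to the Oracle server, the packets stopped as soon as the client was redirected to the VIP of the other node, and SQL*Plus on the client returned ORA-12170: TNS:Connect timeout occurred. Combined with the trace files produced by Oracle's own Net tracing, my preliminary suspicion was this: for load balancing, the Oracle server redirects the office-network client's connection to the other node's VIP, which sits on the production network and is not reachable from the office network. The client cannot reach the VIP, yet the server tells it the path is correct, and after a long period with no response TNS returns the connect timeout error. This was only a suspicion. I could not fully confirm it, and I did not know how to solve it. So I turned to Metalink. Below are the main excerpts of my exchange with the Metalink engineer: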
Customer Problem Description:
Each node of the cluster server has three network cards. The first, which carries the VIP, is the public card on the production network. The second is the private card. The third is on the office network. Both the production network and the office network have Oracle clients, and company policy keeps the two networks physically isolated. Production-network clients can connect in FAILOVER mode, but office-network clients in FAILOVER mode sometimes connect and sometimes do not. When an office-network client cannot connect, the error is ORA-12170, yet at the same time tnsping of the Oracle net service name from that client is normal. How can we enable FAILOVER for office-network clients?
=== ODM Solution / Action Plan ===
Thanks for your patience. I have analyzed the client trace provided and from that I could see that the issue is happening when there is a redirection request from the listeners running on any of the 2 nodes. As per the load balance feature, the listeners redirect connection requests based on the load on each instance. But in our scenario, when the listeners are redirecting the connection requests, they are sending the Internal IP address of the other node (which the client is unaware of) to the client rather than the external IP address. This is causing the connection to timeout as the client is in different domain of the RAC nodes and completely unaware of the nodes' external IP addresses.
I have looked for any relevant notes and found the Doc 453544.1 which states this may happen if the client/database server are behind a NAT firewall. Request you to follow the guidelines mentioned in this note and see if the issue comes up again (or) if you can provide me the output of "lsnrctl services" from the listeners on both the nodes along with the init
Thank you and have a nice day,
Oracle Support Services.
Update from Customer:
First, thank you very much for your help!
Following the document you pointed me to, I ran a test on the test server. Office-network clients can connect to the RAC DB server in FAILOVER mode once the client's /etc/hosts file contains the proper mapping. The plant-network clients' /etc/hosts files must also contain the proper mapping; otherwise the connection fails.
I understand your explanation of why the failure occurred. My new question is this: internal and external clients together number in the hundreds. Modifying the /etc/hosts file on every one of them would be a very difficult task. Is there any other solution?
Thanks a lot again and have a nice day!
Thanks for uploading the files. After verifying, I could see that the listeners are running on the VIP addresses and the communication between oracle net is happening. The reason for your connections failing intermittently is because, whenever there is a redirection happening, the client is unable to resolve the VIP address as it is in a different network class than the client. In general, both Public IP and VIP of a node should be in the same network class and it seems VIP is configured under Private network. This is a RAC configuration issue as the client is unable to resolve the VIP address. You can use the below note to change the VIP of a node and please work with your network administrator for network related issues. Once VIP is configured properly, your intermittent connection issues should be resolved. I am passing this SR to RAC Specialization team who can help in case of issues with VIP. Please let me know if any concerns.
Thank you and have a nice day,
Oracle Support Services
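The failure described in this exchange, where a client reaches the listener but times out once it is redirected to a VIP on another network, can be checked from any client machine with a plain TCP probe. The following is a minimal Python sketch and was not part of the original troubleshooting. The function names, the host names in the usage comment, and the assumption that the listeners use the default port 1521 are all illustrative.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks,
        # and host names that do not resolve on this client.
        return False


def probe_all(endpoints, timeout=3.0):
    """Probe every (host, port) pair and return {label: reachable}."""
    return {
        label: can_connect(host, port, timeout)
        for label, (host, port) in endpoints.items()
    }


# Usage (hypothetical names; replace with the real public and VIP names):
# results = probe_all({
#     "node1-public": ("rac1.example", 1521),
#     "node1-vip": ("rac1-vip.example", 1521),
#     "node2-vip": ("rac2-vip.example", 1521),
# })
# for label, ok in results.items():
#     print(label, "reachable" if ok else "NOT reachable")
```

Run from an office-network client, a result where an address on the office network is reachable but a VIP is not would be consistent with the redirect timing out as ORA-12170. It only tests TCP reachability and says nothing about the listener configuration itself.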
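After this, the RAC Specialization team did get in touch with me, but they never offered a really good solution, saying only that they were currently working on the issue. So for now I can only go with the method Bharath Vykuntam provided. Still, I feel this does not solve the problem at its root. I will just have to wait and see!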
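If the interim workaround is the hosts-file mapping tested in the exchange above, the entries for hundreds of clients are easier to generate than to type by hand. Below is a minimal Python sketch, assuming each client needs the VIP host names mapped to addresses it can actually reach. The function name, the host names, and the 192.0.2.x documentation addresses are hypothetical, and distributing the lines to the clients is left out.

```python
def render_hosts_lines(mapping):
    """Render hosts-file lines from {host_name: ip_address}.

    Lines are sorted by host name so the output is stable across runs.
    """
    lines = []
    for name, addr in sorted(mapping.items()):
        # hosts-file format: address first, then the name it maps to
        lines.append(f"{addr}\t{name}")
    return lines


# Hypothetical mapping for an office-network client: each VIP host name
# points at an address that is reachable from the office network.
office_mapping = {
    "rac1-vip": "192.0.2.11",
    "rac2-vip": "192.0.2.12",
}

for line in render_hosts_lines(office_mapping):
    print(line)
```

A separate mapping per network (office versus plant) keeps the two sets of addresses apart, since the same host name has to resolve differently depending on where the client sits.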
Source: ITPUB Blog, link: http://blog.itpub.net/702413/viewspace-683611/. If you repost this article, please credit the source; otherwise legal action will be pursued.