8/28/11

"root.sh" failed on second during the installation of 11.2.0.2 Grid InfraStructure

While installing 11.2.0.2 Grid Infrastructure, root.sh succeeded on the first node but failed on the second node.
On node2, root.sh failed with the following errors:

CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node prittoprfdb1, number 1, and is terminating
An active cluster was found during exclusive startup, restarting to join the cluster
Failed to start Oracle Clusterware stack
Failed to start Cluster Synchorinisation Service in clustered mode at /u01/app/crs/11.2.0.2/crs/install/crsconfig_lib.pm line 1016.
/u01/app/crs/11.2.0.2/perl/bin/perl -I/u01/app/crs/11.2.0.2/perl/lib -I/u01/app/crs/11.2.0.2/crs/install /u01/app/crs/11.2.0.2/crs/install/rootcrs.pl execution failed



There may be a number of different reasons for root.sh failing on the second node, but in my case it failed due to a "MULTICASTING BUG".


What is Multicasting? And why do we need it?


Oracle 11.2.0.2 introduces a new feature called "Redundant Interconnect Usage", which provides redundancy for the interconnect without using any external NIC bonding. Oracle provides the redundancy internally if you specify two private interfaces at the time of installing 11.2.0.2 Grid Infrastructure.
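For reference, after the installation you can list or extend the private interfaces with oifcfg (the interface name and subnet below are only examples for illustration):

$ oifcfg getif
$ oifcfg setif -global eth2/192.168.2.0:cluster_interconnect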


To use this new feature, multicasting must be enabled on the private interfaces and also on the switches they are connected to.


Multicast-based communication on the private interconnect is used to establish communication with the peers in the cluster on each startup of the stack on a node. Once the connection with the peers in the cluster has been established, the communication is switched back to unicast.
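Before re-running root.sh it is worth confirming at the OS level that the private interface really supports multicast. A minimal check on Linux (eth1 below is just an example; use your own private interconnect interface):

$ ip link show eth1      # the interface flags should include MULTICAST
$ netstat -g             # lists the multicast group memberships per interface

Oracle Support also provides a small mcasttest.pl script (MOS note 1212703.1) to test multicast communication between the cluster nodes on the addresses used by CSSD.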


In the CSSD log you will find errors like the following:


2010-09-16 23:13:14.862: [GIPCHGEN][1107937600] gipchaNodeCreate: adding new node 0x2aaab408d4a0 { host 'node1', haName 'CSS_ttoprf10cluster', srcLuid 54d7bb0e-ef4a0c7e, dstLuid 00000000-00000000 numInf 0, contigSeq 0, lastAck 0, lastValidAck 0, sendSeq [0 : 0], createTime 9563084, flags 0x0 }


2010-09-16 23:13:15.839: [ CSSD][1087465792]clssnmvDHBValidateNCopy: node 1, node1, has a disk HB, but no network HB, DHB has rcfg 180134562, wrtcnt, 8627, LATS 9564064, lastSeqNo 8624, uniqueness 1284701023, timestamp 1284703995/10564774




Solution:


Apply patch 9974223 on both nodes and run root.sh again on node2.
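A rough sketch of the steps on this cluster (the patch staging directory and the deconfigure step are assumptions; the patch README and MOS are the authority on the exact procedure):

# As the grid user, on both nodes, apply the patch to the GI home
$ cd /tmp/9974223                             # wherever the patch was unzipped
$ /u01/app/crs/11.2.0.2/OPatch/opatch apply

# As root on node2, deconfigure the failed attempt if needed, then re-run root.sh
# /u01/app/crs/11.2.0.2/perl/bin/perl -I/u01/app/crs/11.2.0.2/perl/lib -I/u01/app/crs/11.2.0.2/crs/install /u01/app/crs/11.2.0.2/crs/install/rootcrs.pl -deconfig -force
# /u01/app/crs/11.2.0.2/root.sh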













ORA-00210, ORA-00202, ORA-17503, ORA-15001, ORA-27140, ORA-27300 to ORA-27303 while installing a RAC database


After successfully installing GI, I tried to install a RAC database, but I got the following error messages:

ORA-00210: cannot open control file
ORA-00202: error in writing '+RECODG/utsdb/controlfile/current.256.732754521'
ORA-17503: ksfdopn: 2 Failed to open file +RECODG/utsdb/controlfile/current.256.732754521
ORA-15001: diskgroup "RECODG" does not exist or is not mounted
ORA-15055: unable to connect to ASM instance
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_euid failed with status: 1
ORA-27301: OS failure message: Not owner
ORA-27302: failure occurred at: skgpwinit5
ORA-27303: additional information: startup euid = 100 (grid), current euid = 101 (oracle)


The error message is a little bit confusing because it complains about ASM, but in fact ASM is working fine; the problem is the permissions on the "oracle" executable.
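A quick way to confirm this (the GI home path below is the one used on this cluster; substitute your own):

$ ls -l /u01/app/crs/11.2.0.2/bin/oracle
# the permission string should read -rwsr-s--x (setuid/setgid bits set);
# something like -rwxrwxr-x means those bits were lost and connections to ASM fail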

Cause
The issue is caused by incorrect permissions on the GI_HOME/bin/oracle executable.

In one case, the customer had changed the file permissions of GI_HOME/bin/oracle from "-rwsr-s--x" to "-rwxrwxr-x".
The correct permissions are "-rwsr-s--x".
Solution
1. As the grid user, change the file permissions of GI_HOME/bin/oracle back to "-rwsr-s--x":
$ su - grid
$ cd GI_HOME/bin
$ chmod 6751 oracle     # 6751 sets the setuid/setgid bits, i.e. -rwsr-s--x
$ ls -l oracle
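After the chmod, ls -l should again report "-rwsr-s--x" for the oracle binary; once it does, retry the RAC database creation.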