Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1485145.1
Update Date:2012-08-30
Keywords:

Solution Type: Problem Resolution (Sure)

Solution 1485145.1: Exadata: CSSD fails to start with CRS-1717 during Grid Infrastructure Upgrade


Related Items
  • Exadata Database Machine X2-8
  • Exadata Database Machine X2-2 Full Rack
  • Exadata Database Machine X2-2 Half Rack
  • Exadata Database Machine X2-2 Qtr Rack
  • Oracle Exadata Hardware
  • Exadata Database Machine X2-2 Hardware
  • Exadata Database Machine V2
  • Oracle Server - Enterprise Edition
Related Categories
  • PLA-Support>Database Technology>Engineered Systems>Oracle Exadata>DB: Exadata_EST
  • .Old GCS Categories>ST>Server>Engineered Systems>Exadata>Patching




Created from <SR 3-6073311545>

Applies to:

Exadata Database Machine V2 - Version Not Applicable and later
Oracle Server - Enterprise Edition - Version 11.2.0.1 to 11.2.0.2 [Release 11.2]
Oracle Exadata Hardware - Version 11.2.0.1 to 11.2.0.2 [Release 11.2]
Exadata Database Machine X2-2 Full Rack - Version Not Applicable and later
Exadata Database Machine X2-2 Half Rack - Version Not Applicable and later
Information in this document applies to any platform.

Symptoms

Clusterware upgrade fails during rootupgrade.sh execution. The log files indicate that ocssd.bin fails to restart.
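
The CRS-1717 message itself is recorded in the Clusterware alert log. As a quick first check, it can be located with grep; the path below assumes the standard 11.2 layout of <GRID_HOME>/log/<hostname>/:

grep CRS-1717 <GRID_HOME>/log/<hostname>/alert<hostname>.log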

rootcrs_<hostname>.log:

2012-08-18 17:26:38: Executing cmd: /u01/app/ora/product/11.2.0.3/grid_1/bin/crsctl check crs

2012-08-18 17:26:38: Command output:
> CRS-4638: Oracle High Availability Services is online
> CRS-4535: Cannot communicate with Cluster Ready Services
> CRS-4529: Cluster Synchronization Services is online
> CRS-4534: Cannot communicate with Event Manager

...........
...........

2012-08-18 17:36:52: Checking the status of cluster

2012-08-18 17:36:57: ###### Begin DIE Stack Trace ######
2012-08-18 17:36:57: Package File Line Calling
2012-08-18 17:36:57: --------------- -------------------- ---- ----------
2012-08-18 17:36:57: 1: main rootcrs.pl 378 crsconfig_lib::dietrap
2012-08-18 17:36:57: 2: crsconfig_lib crsconfig_lib.pm 11524 main::__ANON__
2012-08-18 17:36:57: 3: crsconfig_lib crsconfig_lib.pm 1311 crsconfig_lib::wait_for_stack_start
2012-08-18 17:36:57: 4: crsconfig_lib crsconfig_lib.pm 1186 crsconfig_lib::start_cluster
2012-08-18 17:36:57: 5: main rootcrs.pl 824 crsconfig_lib::perform_start_cluster
2012-08-18 17:36:57: ####### End DIE Stack Trace #######

 

 

Cause

alert<hostname>.log (Clusterware alert log):

2012-08-18 18:52:26.682[cssd(1616)]CRS-1717:The CSS daemon has detected a voting file add during startup and is waiting for the add to complete; Details at (:CSSNM00072:) in /u01/app/ora/product/11.2.0.3/grid_1/log/xldnc913001hpor/cssd/ocssd.log

 

ocssd.log:

clssnmvDiskVerify: Successful discovery for disk o/192.168.10.8/SYSTEMDG_CD_02_dldn913cel04, UID b1493e81-82034f27-bf30e20d-8765d98f, Pending CIN 0:1300559781:0, Committed CIN 0:1300559781:0

clssnmvDiskVerify: Successful discovery for disk o/192.168.10.7/SYSTEMDG_CD_02_dldn913cel03, UID 92e385f2-2c9d4fd0-bfc84d09-da10b879, Pending CIN 0:1300559781:0, Committed CIN 0:1300559781:0

clssnmvDiskVerify: Successful discovery for disk o/192.168.10.5/SYSTEMDG_CD_02_dldn913cel01, UID 5913d91c-0d4c4f81-bf4995c6-6f73ddf5,   Pending CIN 0:1300559781:0, Committed CIN 0:1300559781:0

2012-08-18 18:35:11.324: [ CSSD][1092651328]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery

2012-08-18 17:26:36.708: [ CSSD][1097820480]clssnmvDiskPing: Writing with status 0x3, timestamp 411681166/1345307196
2012-08-18 17:26:36.709: [ CSSD][1109272896]clssnmvDiskPing: Writing with status 0x3, timestamp 411681166/1345307196
2012-08-18 17:26:36.709: [ CSSD][1106118976]clssnmvDiskPing: Writing with status 0x3, timestamp 411681166/1345307196

2012-08-18 17:26:36.739: [ CSSD][1081133376]clssnmvDiskKillCheck: not evicted, file o/192.168.10.6/SYSTEMDG_CD_02_dldn913cel02 flags
0x00000000, kill block unique 0, my unique 1345307129

2012-08-18 17:26:36.769: [ CSSD][1123465536]clssnmvDiskPing: Writing with status 0x3, timestamp 411681226/1345307196

2012-08-18 17:26:37.068: [ CSSD][1076136256]clssnmPollingThread: signaling reconfig for config change<<

2012-08-18 18:35:11.324: [ CSSD][1092651328](:CSSNM00072:)clssnmCompleteInitVFDiscovery: Detected voting file add in progress for CIN 0:1345307185:0, waiting for configuration to complete 0:1300559781:0 <<<

 

The issue is not actually related to the upgrade; it is a CSSD startup issue caused by a pending voting file configuration change.

The messages in the CSSD log file indicate that the configuration change is a voting file addition.

Possible reasons:

1. Failure during an ASM automatic voting file transfer. If a disk containing a voting file becomes unavailable, ASM automatically relocates the voting file to another disk in the diskgroup. This is more likely to happen in an Exadata environment because of the large number of storage cells and disks in the OCR diskgroup; the current voting file configuration can be checked as shown below.

2. A database node crash during a voting file addition/deletion operation.

Refer to <Bug 11816852>.
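
Before taking corrective action, the current voting file configuration can be inspected with crsctl (this requires Cluster Synchronization Services to be running on the node); in a healthy configuration, every copy is listed as ONLINE:

<GRID_HOME>/bin/crsctl query css votedisk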

Solution

The solution is to start the clusterware in exclusive mode on one of the nodes and let the clusterware roll back the pending configuration change.

For example, for an 11.2.0.2 to 11.2.0.3 Clusterware upgrade:

Follow the steps below. (If this issue occurs outside of an upgrade, skip steps 2 and 8.)

OLD_GRID_HOME =  11.2.0.2 Grid Home
NEW_GRID_HOME = 11.2.0.3 Grid Home

1.      Shut down the clusterware on all nodes
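
For example, as root on each database node (the -f option can be added if a normal stop fails because the stack is only partially up):

$OLD_GRID_HOME/bin/crsctl stop crs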

2.      Run rootcrs.pl with the -downgrade option to downgrade back to the old version on the node where rootupgrade.sh failed (typically the first node):

perl $NEW_GRID_HOME/crs/install/rootcrs.pl -downgrade -force -oldcrshome $OLD_GRID_HOME -version 11.2.0.2.0

Refer to <Doc ID 1364946.1> (How to Downgrade 11.2.0.3 Grid Infrastructure Cluster to Lower 11.2 GI or Pre-11.2 CRS).

3.      Start up the clusterware in exclusive mode on node 1 (after the clusterware is down on all other nodes)

$OLD_GRID_HOME/bin/crsctl start crs -excl
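
As a side note, on 11.2.0.2 and later the optional -nocrs flag can be added so that the CRS daemon itself is not started while in exclusive mode:

$OLD_GRID_HOME/bin/crsctl start crs -excl -nocrs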

4.      Once CRS comes up on node 1, make sure everything came up:

$OLD_GRID_HOME/bin/crsctl stat res -t -init
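
In particular, verify that the CSS daemon is online before proceeding; for example:

$OLD_GRID_HOME/bin/crsctl stat res ora.cssd -init

The STATE of ora.cssd should show ONLINE.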

5.      Shut down CRS on node 1

$OLD_GRID_HOME/bin/crsctl stop crs -f

6.      Start CRS normally on node 1

$OLD_GRID_HOME/bin/crsctl start crs

7.      Start CRS normally on all other nodes

$OLD_GRID_HOME/bin/crsctl start crs
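
Optionally, before re-running the upgrade script, cluster health can be verified across all nodes with a single command:

$OLD_GRID_HOME/bin/crsctl check cluster -all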

8.      Re-run $NEW_GRID_HOME/rootupgrade.sh on node 1. Once this is successful, proceed with the remaining nodes to complete the upgrade.

References

<NOTE:1420265.1> - Upgrade Advisor: Database (DB) Exadata from 11.2.0.1/11.2.0.2 to 11.2.0.3
<BUG:11816852> - TB_X64:X:CSSD UNABLE TO STARTUP AFTER KILLING CELL NODES PERIODICALLY
<NOTE:1373255.1> - 11.2.0.1/11.2.0.2 to 11.2.0.3 Database Upgrade on Exadata Database Machine
<NOTE:1364946.1> - How to Downgrade 11.2.0.3 Grid Infrastructure Cluster to Lower 11.2 GI or Pre-11.2 CRS
<NOTE:888828.1> - Database Machine and Exadata Storage Server 11g Release 2 (11.2) Supported Versions
