Sun Storage 7000 Unified Storage System: A cluster node fails to rejoin the cluster

Asset ID:	1-75-1403503.1
Update Date:	2012-02-21
Keywords:

Solution Type: Troubleshooting Sure

Solution 1403503.1 : Sun Storage 7000 Unified Storage System: A cluster node fails to rejoin the cluster

Applies to:

Sun ZFS Storage 7420 - Version: Not Applicable and later [Release: N/A and later]
Sun Storage 7310 Unified Storage System - Version: Not Applicable and later [Release: N/A and later]
Sun Storage 7410 Unified Storage System - Version: Not Applicable and later [Release: N/A and later]
Sun ZFS Storage 7320 - Version: Not Applicable and later [Release: N/A and later]
7000 Appliance OS (Fishworks)
NAS head revision : [not dependent]
BIOS revision : [not dependent]
ILOM revision : [not dependent]
JBODs Model : [not dependent]
CLUSTER related : [yes]

Purpose

This document is provided to assist in troubleshooting cluster join issues where one node of a cluster, following a reboot, fails to rejoin the cluster.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - 7000 Series ZFS Appliances.

Last Review Date

February 1, 2012

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

When one cluster node fails to join the cluster, the cause is often heavy load on the working cluster node, which slows communication between the nodes, or sometimes a cluster-wide locking issue on the working node.
In the first case, simply leaving the system to retry the rejoin operation may be sufficient, and the join may eventually complete successfully.
In the second case, the second node is unlikely to rejoin the cluster on its own. This document provides a workaround for that case.

Note:
If you wish to know the cause of the node's failure to rejoin the cluster, please contact Oracle Support so they can collect additional diagnostic information and determine the underlying cause of the failure.

If you wish to try to resolve the issue yourself then please follow these steps.

Step 1. Power down the node that is failing to join the cluster.

The node must be powered off to ensure the cluster interconnect is offline. Simply shutting down the node is not sufficient in this case.

From the console where the system is reporting the cluster join failure message, press the following key sequence to halt the system:

<ESC>-3 - Halt system

If possible, connect to the SP (Service Processor, sometimes called the ILOM) of the system and log in as the root user. Check the system status by issuing the following command:

-> show /SYS

The power state is displayed towards the end of the output, in the section entitled 'Properties':

Properties:
    type = Host System
    chassis_name = SUN FIRE X4240
    chassis_part_number = 540-7618-XX
    .
    .
    .
    product_manufacturer = SUN MICROSYSTEMS
    power_state = On

If the system is powered on, issue the following SP command to power off the system:

-> stop /SYS

At this point the system will be powered off. You can check the status by reissuing the 'show /SYS' command used earlier.

If you have access to the node itself then you can simply depress the Power button on the front panel.
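
If you capture the output of 'show /SYS' to a text file or session log, the power-state check can be scripted. The following Python sketch is only an illustration: it assumes the output lists one 'name = value' property per line, as in the Properties section above, and the function name parse_power_state is hypothetical. It does not connect to the SP itself.

```python
def parse_power_state(show_sys_output):
    """Return the power_state value from 'show /SYS' output, or None if absent.

    Assumes one 'name = value' property per line.
    """
    for line in show_sys_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "power_state":
            return value.strip()
    return None


# Sample text modelled on the Properties section shown in this document.
sample = """\
Properties:
    type = Host System
    chassis_name = SUN FIRE X4240
    product_manufacturer = SUN MICROSYSTEMS
    power_state = On
"""

print(parse_power_state(sample))  # prints "On"
```

A result other than "Off" after issuing 'stop /SYS' suggests the node is not yet fully powered down and the check should be repeated before moving on to Step 2.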

Step 2. Restart the management service (called akd) on the working node

Connect to the working node and issue the following CLI command:

> maintenance system restart

This will not affect the data services, but it will restart the Admin interfaces, and as a result you will be logged out of the CLI session.

Step 3. Wait for the system to restart the management interfaces and resume normal operation

It may take several minutes for the management services to fully initialize. Once you have regained access to the Admin BUI or CLI, check that the system is working correctly.
You may wish to wait one or two minutes more to ensure the system is fully recovered before proceeding.
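
Rather than retrying the login by hand, you can poll from an administration host until the management interface accepts connections again. The Python sketch below is an illustration under stated assumptions: the function name wait_for_port and the hostname in the usage comment are hypothetical, and port 215 is assumed to be the port on which the Admin BUI listens in your environment. A successful TCP connection only shows that the listener is back; it does not prove that the management service has fully initialized.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=600.0, interval_s=10.0):
    """Poll until a TCP connection to host:port succeeds.

    Returns True once a connection is accepted, or False if the deadline
    passes without a successful connection.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=5.0):
                return True
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Example call (hypothetical hostname; port 215 is an assumption):
# wait_for_port("working-node.example.com", 215)
```

Because the check is only a connection test, the advice above still applies: wait a further minute or two after it succeeds, then log in and confirm the system is healthy before powering on the second node.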

Step 4. Power on the second cluster node

At this point the system should be working correctly with all resources available from the single working node. You can now power on the remaining node, and this time it should rejoin the cluster successfully.

If you have access to the node itself then you can simply depress the Power button on the front panel to power-on the node.

If you have access to the SP then issue the following SP command:

-> start /SYS

This will power on the node. You can check the power state by issuing the 'show /SYS' command, as before.

Once the system has completed its power-on self tests it will load the operating system and appliance firmware and start operation.

Step 5. Check the system is working as a cluster

From the Admin BUI you can check the status from the Configuration -> Cluster page.

From the CLI you can issue the following command:

> configuration cluster show

The cluster will probably show one node as Owner and the other as Stripped, indicating that the cluster is operational and ready for the cluster fail-back operation.
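
If you save the output of 'configuration cluster show', the Owner/Stripped check can be expressed as a small script. The Python sketch below rests on assumptions that are not taken from this document: that the output contains 'state = ...' and 'peer_state = ...' lines, and that the Owner and Stripped states appear as AKCS_OWNER and AKCS_STRIPPED. The sample text and both function names are hypothetical; compare them with the output on your own system before relying on the result.

```python
def parse_properties(text):
    """Collect 'key = value' lines into a dict; other lines are ignored."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def ready_for_failback(props):
    """True when one node reports Owner and the other reports Stripped.

    The AKCS_* state names are an assumption about the CLI output.
    """
    states = {props.get("state"), props.get("peer_state")}
    return states == {"AKCS_OWNER", "AKCS_STRIPPED"}


# Hypothetical sample; real output contains further properties.
sample = """\
Properties:
    state = AKCS_OWNER
    peer_state = AKCS_STRIPPED
"""

print(ready_for_failback(parse_properties(sample)))  # prints "True"
```

Any other combination of states, or a missing peer state, returns False and is a cue to inspect the Configuration -> Cluster page directly.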

Note:
If the cluster node still fails to join the cluster then further investigation will be required.

Please contact Oracle Support so they can collect additional diagnostic information in order to determine the underlying cause of the failure.

Attachments

This solution has no attachment