Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1403503.1
Update Date: 2012-02-21
Keywords:

Solution Type  Troubleshooting Sure

Solution  1403503.1 :   Sun Storage 7000 Unified Storage System: A cluster node fails to rejoin the cluster  


Related Items
  • Sun Storage 7410 Unified Storage System
  • Sun ZFS Storage 7320
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7420
Related Categories
  • PLA-Support>Sun Systems>DISK>NAS>SN-DK: 7xxx NAS
  • .Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage



In this Document
  Purpose
  Last Review Date
  Instructions for the Reader
  Troubleshooting Details


Applies to:

Sun ZFS Storage 7420 - Version: Not Applicable and later [Release: N/A and later]
Sun Storage 7310 Unified Storage System - Version: Not Applicable and later [Release: N/A and later]
Sun Storage 7410 Unified Storage System - Version: Not Applicable and later [Release: N/A and later]
Sun ZFS Storage 7320 - Version: Not Applicable and later [Release: N/A and later]
7000 Appliance OS (Fishworks)
NAS head revision : [not dependent]
BIOS revision : [not dependent]
ILOM revision : [not dependent]
JBODs Model : [not dependent]
CLUSTER related : [yes]

Purpose

This document is provided to assist in troubleshooting cluster join issues where one node of a cluster, following a reboot, fails to rejoin the cluster.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - 7000 Series ZFS Appliances.

Last Review Date

February 1, 2012

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

When one cluster node fails to join the cluster, the problem is often due to heavy load on the working cluster node, which slows communication between the nodes, or sometimes due to cluster-wide locking issues on the working node.
In the former case, simply leaving the system to retry the rejoin operation may be sufficient, and the join may eventually complete successfully.
In the latter case, the second node is unlikely to rejoin the cluster on its own, and this document provides a workaround for that particular issue.

Note:
If you wish to know the cause of the node's failure to rejoin the cluster, please contact Oracle Support so they can collect additional diagnostic information to determine the underlying cause of the failure.


If you wish to try to resolve the issue yourself then please follow these steps.

Step 1   Power down the node that is failing to join the cluster

The node must be powered off to ensure the cluster interconnect is offline. Simply shutting down the node is not sufficient in this case.

From the console where the system is reporting the cluster join failure message, press the following key sequence to halt the system:


 <ESC>-3 - Halt system


If possible, connect to the SP (Service Processor, sometimes called the ILOM) of the system and log in as the root user. Check the system status by issuing the following command:

-> show /SYS

The power state is displayed towards the end of the output, in the section entitled 'Properties':

    Properties:
        type = Host System
        chassis_name = SUN FIRE X4240
        chassis_part_number = 540-7618-XX
        .   .   .
        product_manufacturer = SUN MICROSYSTEMS
        power_state = On
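
If you capture the 'show /SYS' output to a file or from a script, the power state can be picked out mechanically. The following is a minimal Python sketch, assuming the output uses the 'name = value' layout shown above. The function name parse_power_state is hypothetical and is not part of any Oracle tool.

```python
import re
from typing import Optional


def parse_power_state(show_sys_output: str) -> Optional[str]:
    """Return the power_state value from captured 'show /SYS' text, or None if absent."""
    # Match a line of the form "power_state = <value>", ignoring indentation.
    match = re.search(
        r"^\s*power_state\s*=\s*(\S.*?)\s*$", show_sys_output, re.MULTILINE
    )
    return match.group(1) if match else None


# Sample text copied from the Properties section shown in this document.
sample = """\
    Properties:
        type = Host System
        chassis_name = SUN FIRE X4240
        product_manufacturer = SUN MICROSYSTEMS
        power_state = On
"""

print(parse_power_state(sample))
```

A result of None means the property was not found in the captured text, in which case check the output by eye rather than assuming a state.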

If the system is powered on, issue the following SP command to power it off:

-> stop /SYS

The system should now be powered off. You can confirm this by reissuing the 'show /SYS' command used earlier.

Alternatively, if you have physical access to the node, you can simply press the Power button on the front panel.

Step 2   Restart the management service (called akd) on the working node
Connect to the working node and issue the following CLI command:

 >  maintenance system restart

This will not affect the data services, but it will restart the Admin interfaces, so you will be logged out of the CLI session.

Step 3   Wait for the system to restart the management interfaces and resume normal operation
It may take several minutes for the management services to fully initialize. Once you have regained access to the Admin BUI or CLI, check that the system is working correctly.
You may wish to wait one or two minutes more to ensure the system is fully recovered before proceeding.
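
If you are scripting this procedure, the wait in this step can be expressed as a simple polling loop. The sketch below is a generic Python helper, not an appliance API: wait_until and the check function passed to it are hypothetical names, and how you test that the BUI or CLI is reachable is left to the caller. The timeout and interval values are illustrative assumptions only.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or timeout_s elapses.

    Returns True if check() succeeded, False if the timeout expired first.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demonstration with a stand-in check that succeeds on the third poll.
# A real check (hypothetical) would test whether the Admin BUI or CLI responds.
attempts = {"count": 0}


def fake_management_is_up() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


ok = wait_until(fake_management_is_up, timeout_s=600, interval_s=0, sleep=lambda s: None)
print(ok, attempts["count"])
```

After the check first succeeds, a further fixed pause of a minute or two, as suggested above, gives the system time to settle before the second node is powered on.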

Step 4   Power on the second cluster node
At this point the system should be working correctly with all resources available from the single working node. You can now power on the remaining node, and this time it should rejoin the cluster successfully.

If you have physical access to the node, you can simply press the Power button on the front panel to power on the node.

If you have access to the SP then issue the following SP command:

-> start /SYS

This will power on the node. You can check the power state by issuing the 'show /SYS' command, as before.
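
The power-off in Step 1 and the power-on in this step follow the same pattern: check the current state, then issue an SP command only if a change is needed. A minimal Python sketch of that decision follows. The function name sp_power_command is hypothetical, and the value "Off" for a powered-off system is an assumption, since this document only shows "On".

```python
from typing import Optional


def sp_power_command(current_state: str, desired_state: str) -> Optional[str]:
    """Return the SP command needed to reach desired_state, or None if none is needed."""
    current = current_state.strip().lower()
    desired = desired_state.strip().lower()
    if desired not in ("on", "off"):
        raise ValueError("desired_state must be 'On' or 'Off'")
    if current == desired:
        return None  # already in the wanted state, nothing to issue
    # 'start /SYS' and 'stop /SYS' are the SP commands used in this document.
    return "start /SYS" if desired == "on" else "stop /SYS"


print(sp_power_command("On", "Off"))
print(sp_power_command("Off", "On"))
print(sp_power_command("On", "On"))
```

The sketch only selects the command text. Issuing it on the SP and rechecking with 'show /SYS' remain manual steps.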

Once the system has completed its power-on self tests it will load the operating system and appliance firmware and start operation.

Step 5   Check the system is working as a cluster
From the Admin BUI you can check the status from the Configuration -> Cluster page.

From the CLI you can issue the following command:

   > configuration cluster show


The cluster will probably show one node as Owner and the other as Stripped, indicating that the cluster is operational and ready for the cluster fail-back operation.
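
The expected outcome can be written down as a small check. The Python sketch below assumes you have read the two node states from the BUI or CLI yourself and pass them in using the names this document uses (Owner, Stripped). The function name ready_for_failback is hypothetical, and the sketch does not parse any CLI output.

```python
def ready_for_failback(state_a: str, state_b: str) -> bool:
    """True when one node is Owner and the other is Stripped, in either order."""
    states = {state_a.strip().lower(), state_b.strip().lower()}
    return states == {"owner", "stripped"}


print(ready_for_failback("Owner", "Stripped"))
print(ready_for_failback("Stripped", "Owner"))
print(ready_for_failback("Owner", "Owner"))
```

Any other combination of states is outside what this document describes and is a reason to investigate further before attempting fail-back.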


Note:
If the cluster node still fails to join the cluster then further investigation will be required.

Please contact Oracle Support so they can collect additional diagnostic information in order to determine the underlying cause of the failure.


Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.