Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1402545.1
Update Date:2012-02-21
Keywords:

Solution Type  Troubleshooting Sure

Solution  1402545.1 :   Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems  


Related Items
  • Sun Storage 7410 Unified Storage System
  • Sun ZFS Storage 7320
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7420
Related Categories
  • PLA-Support>Sun Systems>DISK>NAS>SN-DK: 7xxx NAS
  • .Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage


This document is provided to assist in troubleshooting cluster issues on the ZFS Storage Appliance

In this Document
  Purpose
  Last Review Date
  Instructions for the Reader
  Troubleshooting Details
     Identifying the problem
     Setting-up the Cluster


Applies to:

Sun ZFS Storage 7420 - Version: Not Applicable and later   [Release: N/A and later ]
Sun Storage 7310 Unified Storage System - Version: Not Applicable and later    [Release: N/A and later]
Sun Storage 7410 Unified Storage System - Version: Not Applicable and later    [Release: N/A and later]
Sun ZFS Storage 7320 - Version: Not Applicable and later    [Release: N/A and later]
7000 Appliance OS (Fishworks)
NAS head revision : [not dependent]
BIOS revision : [not dependent]
ILOM revision : [not dependent]
JBODs Model : [not dependent]
CLUSTER related : [yes]

Purpose

This document is provided to assist in troubleshooting cluster issues.
It helps to frame the problem, identifies some known issues, and provides guidelines for obtaining a stable clustered system.
This document has been written as a resolution path, each step giving links to other specific documents.  

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - 7000 Series ZFS Appliances

Last Review Date

January 31, 2012

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details


Note: For any cluster issues that are not addressed by this document, please contact Oracle Support for assistance in diagnosing the issue, and be prepared for remote access to be required.
Ref: Oracle Shared Shell Document 1194226.1

Identifying the problem

The following sections address cluster issues based on when the issues are observed during the clustering life-cycle (i.e. creating the initial cluster, operating a cluster, removing nodes from a cluster.)

If you are experiencing a cluster problem:
 - during the initial configuration of the cluster then see the section Setting-up the Cluster
 - during normal cluster operations then see Problems during normal cluster operation
 - while removing clustering then see Removing a node from a cluster

Setting-up the Cluster

For the initial configuration steps see:
     'Sun Storage 7000 Unified Storage System: How to set up NAS clustering' <Document 1329307.1>

INTERNAL: FOR TSC USE
For a system that is clustered but where the cluster is to be reset or rebuilt, see:
     'Sun Storage 7000 Unified Storage System: How to factory reset a cluster node without downtime' <Document 1174473.1>

Problems during normal cluster operation

This section describes some common cluster issues that may be observed during normal operations.

1.  A cluster node fails to join the cluster <Document 1403503.1>

2.  A node reboots following a take-over or fail-back operation

This indicates a resource problem detected by the node that is attempting to acquire its resources from the main node (the node that currently owns them).
For example, if network interfaces are not operational on the acquiring node, that node would be unable to provide a data service after the cluster operation. In this case the node automatically reboots itself, which forces the cluster resources to remain on the working node. After an automatic reboot of this kind, check the network cables connecting the node to the network switches and the SAS cables connecting the node to the shelves.
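The post-reboot cable checks can be written down as a simple checklist. The following is a minimal Python sketch, not appliance code: the Link record, its field names, and the example link names are all hypothetical, and the up/down status is assumed to be recorded by hand after inspecting the cabling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One cable-backed connection on a node (hypothetical record)."""
    name: str   # illustrative label only, e.g. "net0" or "sas-port-1"
    kind: str   # "network" (to a switch) or "sas" (to a shelf)
    up: bool    # True if the link was observed to be operational


def links_needing_attention(links):
    """Return the names of links that are down, network links first."""
    down = [link for link in links if not link.up]
    # sorted() is stable, so links keep their input order within each kind
    ordered = sorted(down, key=lambda link: link.kind != "network")
    return [link.name for link in ordered]


def ready_to_retry(links):
    """True only if every recorded link is up."""
    return all(link.up for link in links)


# Example observations (made-up values)
observed = [
    Link("sas-port-1", "sas", up=False),
    Link("net0", "network", up=False),
    Link("net1", "network", up=True),
]
print(links_needing_attention(observed))
print(ready_to_retry(observed))
```

The sketch only orders the manual checks; it does not reproduce how the appliance itself decides to reboot.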

3.  The Admin BUI does not respond when the Configuration:Clustering page is selected

This can be caused by load on the management service (the akd service).  If the system is busy performing a lengthy operation, it may not respond to some menu selections until the operation has completed. Some deletion operations take several minutes; large snapshot deletions can take several hours. This is not necessarily a cluster issue but a management interface issue.


Note:
For any other cluster issues please contact Oracle Support who will work with you in resolving the issue.

Removing a node from a cluster

To remove a node from a cluster or to unconfigure clustering:
1.  Power off the node to be removed from the cluster.

2.  From the remaining node, in the Admin BUI navigate to the Configuration -> Cluster page. Press the <Unconfig> button to remove the cluster configuration.

3.  Detach the cluster interconnect cables and detach the powered-off storage controller from the cluster's external storage enclosures (shelves).

At this point both of the ZFS SA nodes will operate independently.

INTERNAL: FOR TSC USE
If the Admin BUI is inoperative then it is possible to unconfigure clustering from the CLI using the raw command:
   > raw cluster.unconfigure();
see also:
     'Sun Storage 7000 Unified Storage System: How to factory reset a cluster node without downtime' <Document 1174473.1>

Configuration Guidelines

There are additional items to consider when configuring nodes to form a clustered system. For example, how to distribute the data pools and network interfaces between the nodes so that the load is balanced across both.

Oracle recommends that one network interface on each node be dedicated for use as a management interface. In this case the interface is marked as a private resource for that single node.

For more information on clustering, see the online Help pages available from the Admin BUI.  Navigate to the Configuration -> Cluster page and then press Help in the top right-hand corner of the page; this opens the help pages in the cluster context.
Alternatively, press Help in the top right-hand corner of any page to display the main help page and then navigate to Configuration and Cluster.

Other considerations

Some cluster-wide resources need special attention when transitioning from one node to the other.  For example, SCSI & FC LUN resources need support from the clients themselves: the clients will need to support ALUA for their FC LUNs.

Some client systems require additional configuration if they themselves are also members of a cluster.  For example, for some notes on configuring Solaris Cluster see:
     'Sun Storage 7000 Unified Storage System: Configuring the ZFS Storage Appliance to work in Oracle Solaris Cluster' <Document 1380870.1>

Terms & Definitions 

Cluster : With the ZFS Storage Appliance the term cluster is used to denote a system comprising two identical ZFS SA nodes accessing shared storage and with access to a common network infrastructure.
In the event of a node failure, the resources and services of the failed node are taken over by the remaining working node, which continues to provide those services to clients and users.

  • Cluster types
    • active-active  : a cluster in which the resources are shared between the two nodes and each provides services to clients.
    • active-passive : a cluster in which one node performs most of the work while the second node remains idle until the active node fails, at which point the passive node takes over and becomes the active node.
  • Cluster States
    • AKCS_CLUSTERED  : Both nodes are running in normal condition sharing resources.
    • AKCS_OWNER      : One node in the cluster owns all of the shared cluster resources
    • AKCS_STRIPPED   : One node has joined the cluster but does not own any cluster resources (the node is waiting for the administrator to perform a fail-back operation)
  • Cluster operations
    • Take over      : following a node failure the remaining node takes over the resources from the failed node.
    • Fail back      : once a failed node has been repaired and has rejoined the cluster, it waits for the Administrator to fail back its resources from the main node (which owns all of the cluster resources).  On completion of the fail-back operation both nodes will be operating in a fully clustered mode (active-active).
    • Shutdown       : see: 'Sun Storage 7000 Unified Storage System: How To Shutdown ZFSSA Cluster' <Document 1379117.1>
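The relationship between the states and operations listed above can be sketched as a small state model. This is a simplified illustration in Python, assuming only the transitions described in this section; the function names take_over, rejoin and fail_back are hypothetical and are not appliance commands.

```python
from enum import Enum


class ClusterState(Enum):
    """Cluster states as named in this document."""
    AKCS_CLUSTERED = "both nodes running normally and sharing resources"
    AKCS_OWNER = "this node owns all of the shared cluster resources"
    AKCS_STRIPPED = "this node has joined but owns no cluster resources"


def take_over(survivor):
    """Take over: the remaining node ends up owning every resource."""
    if not isinstance(survivor, ClusterState):
        raise TypeError("survivor must be a ClusterState")
    return ClusterState.AKCS_OWNER


def rejoin():
    """A repaired node joins the cluster owning nothing until fail-back."""
    return ClusterState.AKCS_STRIPPED


def fail_back(owner, peer):
    """Fail back: modeled as valid only from (AKCS_OWNER, AKCS_STRIPPED)."""
    if (owner, peer) != (ClusterState.AKCS_OWNER, ClusterState.AKCS_STRIPPED):
        raise ValueError(f"cannot fail back from {owner.name}/{peer.name}")
    return ClusterState.AKCS_CLUSTERED, ClusterState.AKCS_CLUSTERED


# Walk through one failure-and-recovery cycle
node_a = take_over(ClusterState.AKCS_CLUSTERED)  # peer has failed
node_b = rejoin()                                # peer repaired and rejoined
node_a, node_b = fail_back(node_a, node_b)       # administrator fails back
print(node_a.name, node_b.name)
```

The model deliberately omits any states or transitions not described in this document.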

References

Collecting diagnostic data:
      'Sun Storage 7000 Unified Storage System: How to collect support bundle using the BUI or CLI' <Document 1019887.1>

Online Help is available in the Admin BUI under the section: Configuration:Cluster

Sun ZFS Storage 7000 System Administration Guide
       http://download.oracle.com/docs/cd/E22471_01/html/820-4167 - see the section on Clustering.

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.