Sun Storage 7000 Unified Storage System: How to troubleshoot long cluster take-over and fail-back times

Asset ID:	1-75-1408475.1
Update Date:	2012-04-02
Keywords:

Solution Type Troubleshooting Sure

Solution 1408475.1 : Sun Storage 7000 Unified Storage System: How to troubleshoot long cluster take-over and fail-back times

Applies to:

Sun Storage 7410 Unified Storage System - Version: Not Applicable and later [Release: N/A and later]
Sun ZFS Storage 7320 - Version: Not Applicable and later [Release: N/A and later]
Sun ZFS Storage 7420 - Version: Not Applicable and later [Release: N/A and later]
Sun Storage 7310 Unified Storage System - Version: Not Applicable and later [Release: N/A and later]
7000 Appliance OS (Fishworks)

Purpose

The appliance performs two kinds of cluster operation, fail-back and takeover, and the distinction between them is important: fail-back requires the relinquishing head to give up resources before they move to the other head, whereas takeover simply claims them.
In the case of a slow fail-back, it is worth determining whether the relinquishing head is slow to give up the resources or the claiming head is slow to take them.

Last Review Date

February 15, 2012

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

The takeover and fail-back times depend on the number of objects that need to be iterated during the resource import phase. On the 7x20 and 7x10 series systems those objects include: shares, LUNs, datalinks, VLANs, network interfaces, IPMP/LACP setup, iSCSI/FC targets, initiators and groups, etc. Simple configurations are faster than complex configurations.

Other considerations:

If there is I/O to the pool at the time of takeover or fail-back, the operation takes longer. This applies especially to writes, since they leave more dirty data in the Logzillas to be "replayed" during takeover. Reads should not make a difference.

When iSCSI/FC LUNs are in use, a takeover requires the contents of the Logzilla to be replayed before the zpool can be imported.

If there are many CIFS clients authorized by an Active Directory server, more time will be needed to re-authorize them on the peer cluster head after takeover or fail-back.

If a destroy operation is in progress, it must complete before the zpool can be imported. This has been seen especially with snapshots: the head being taken over was in the process of destroying a snapshot, and the other head then had to complete the destroy before the pool could be imported. This situation is remedied in appliance software 2011.1.1.0.

Finally, if Readzilla (L2ARC) caches are present, they will need warming up after their associated pool is imported. Note that this does not apply to the Logzillas, because they are imported along with the rest of the pool, whereas the Readzillas are specific to each cluster head.

Identifying the problem
Determine
For reference, the expected takeover time is:
Time in seconds = (20 * D) + (0.03 * S)
where:
D is the number of disksets (half JBODs)
S is the number of shares (filesystems)
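
As a worked example of the formula above, the following minimal Python sketch computes the expected takeover time. The function name and the sample configuration (4 disksets, 1000 shares) are hypothetical and chosen only for illustration; the coefficients are those given above.

```python
def expected_takeover_seconds(disksets: int, shares: int) -> float:
    """Rule-of-thumb takeover time: 20 s per diskset plus 0.03 s per share."""
    if disksets < 0 or shares < 0:
        raise ValueError("counts must be non-negative")
    return 20 * disksets + 0.03 * shares


# Hypothetical configuration: 4 disksets (two JBODs) and 1000 shares.
estimate = expected_takeover_seconds(4, 1000)
print(f"expected takeover time: {estimate:.1f} s")  # 80 s + 30 s = 110.0 s
```

A measured takeover that is far above this estimate suggests that one of the considerations listed earlier (pending writes, a destroy in progress, CIFS re-authorization) is contributing.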

1402545.1 - Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems

How to gather key data and information for Oracle Disk array products, to minimise problem diagnosis and resolution times (Doc ID 1346234.1)

Note: For any fail-over issues that are not addressed by this document, please contact Oracle Support for assistance in diagnosing the issue, and be prepared for remote access to be required.
Ref: Oracle Shared Shell Document 1194226.1

Sun ZFS Storage Appliances Troubleshooting Resource Center (Doc ID 1416406.1)

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - 7000 Series ZFS Appliances

https://communities.oracle.com/portal/server.pt/community/7000_series_zfs_appliance/456

Customers are not permitted to run commands at the emergency shell.

Checking takeover/failback times from a support bundle:

Refer to the rm.ak log file in the support bundle, for example:
bash-3.2$ cd /cores/sr-id/supportbundle
bash-3.2$ find . -type f -exec grep "in 0." {} /dev/null \;

This gives an overview of how long the export and import of individual items take.

Example from <Bug 7144862>:

adc26stor08:configuration cluster> date
2012-2-11 08:59:46
adc26stor08:configuration cluster> failback
Continuing will immediately fail back the resources assigned to the cluster
peer. This may result in clients experiencing a slight delay in service.

Are you sure? (Y/N)
date
adc26stor08:configuration cluster> date
2012-2-11 09:06:24
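
The two `date` outputs in this transcript bracket the fail-back, so their difference gives the approximate wall-clock duration. A minimal Python sketch of that arithmetic follows; the helper name is hypothetical, and it assumes the second `date` was run as soon as the fail-back command returned.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two timestamps in the format printed by the appliance CLI."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


# Timestamps copied from the transcript above.
seconds = elapsed_seconds("2012-2-11 08:59:46", "2012-2-11 09:06:24")
print(f"fail-back took about {seconds:.0f} s ({seconds / 60:.1f} min)")
```

For this transcript the result is 398 seconds, which matches the roughly 6.5 minute fail-back named in the bug title.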

On the exporting node (08), the pools take the longest:
adc26stor08# aklog rm | grep -i export | grep "Sat Feb 11 09:0" | grep -v "in 0." | tail -20
Sat Feb 11 09:01:26 2012: export of ak:/nas/pool07a succeeded in 95.727s
Sat Feb 11 09:04:10 2012: export of ak:/zfs/pool07a succeeded in 164.224s
adc26stor08#

On the importing node (07), the pools are also the largest contributors:
adc26stor07# aklog rm | grep -i import | grep "Sat Feb 11 09:0" | grep -v "in 0." | tail -20
Sat Feb 11 09:05:14 2012: [zfs import] zpool_import_props() succeeded in 61.090s
Sat Feb 11 09:05:14 2012: import of ak:/zfs/pool07a succeeded in 61.129s
Sat Feb 11 09:06:03 2012: [nas import] discovery completed in 48.400s
Sat Feb 11 09:06:16 2012: [nas import] mounted 673 datasets in 6.989s
Sat Feb 11 09:06:17 2012: import of ak:/nas/pool07a succeeded in 62.531s
Sat Feb 11 09:06:20 2012: import of ak:/net/ixgbe93003 succeeded in 1.649s
adc26stor07#
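
When the log lines have been copied off the appliance (for example from a support bundle), the same ranking can be done offline. The Python sketch below is one possible approach, not a supported tool: it parses lines in the format shown above, lists the slowest entries, and totals the export and import phases separately to show which head accounts for most of the time. The function names are hypothetical, and the code assumes that bracketed entries such as "[zfs import] ..." are sub-steps of the matching "import of ..." entry, so only the top-level entries are summed.

```python
import re

# Matches log lines such as:
#   Sat Feb 11 09:04:10 2012: export of ak:/zfs/pool07a succeeded in 164.224s
LINE_RE = re.compile(r"^.+? \d{4}: (?P<what>.+) in (?P<secs>\d+(?:\.\d+)?)s$")

SAMPLE = """\
Sat Feb 11 09:01:26 2012: export of ak:/nas/pool07a succeeded in 95.727s
Sat Feb 11 09:04:10 2012: export of ak:/zfs/pool07a succeeded in 164.224s
Sat Feb 11 09:05:14 2012: [zfs import] zpool_import_props() succeeded in 61.090s
Sat Feb 11 09:05:14 2012: import of ak:/zfs/pool07a succeeded in 61.129s
Sat Feb 11 09:06:03 2012: [nas import] discovery completed in 48.400s
Sat Feb 11 09:06:16 2012: [nas import] mounted 673 datasets in 6.989s
Sat Feb 11 09:06:17 2012: import of ak:/nas/pool07a succeeded in 62.531s
Sat Feb 11 09:06:20 2012: import of ak:/net/ixgbe93003 succeeded in 1.649s
"""


def parse_timings(lines):
    """Return (seconds, description) for every line that reports a duration."""
    timings = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            timings.append((float(match.group("secs")), match.group("what")))
    return timings


def slowest(timings, threshold=1.0):
    """Entries taking at least `threshold` seconds, slowest first."""
    return sorted((t for t in timings if t[0] >= threshold), reverse=True)


def phase_totals(timings):
    """Sum only top-level 'export of' / 'import of' entries (assumed not nested)."""
    totals = {"export": 0.0, "import": 0.0}
    for secs, what in timings:
        for phase in totals:
            if what.startswith(phase + " of "):
                totals[phase] += secs
    return totals


timings = parse_timings(SAMPLE.splitlines())
for secs, what in slowest(timings):
    print(f"{secs:9.3f}s  {what}")
print(phase_totals(timings))
```

With the sample lines from this bug, the export total is roughly twice the import total, which points at the relinquishing head (08) rather than the claiming head (07).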

The DTrace script import.d helps to troubleshoot long cluster takeover and fail-back times. It measures the time taken to import each resource, which shows which operation takes the most time.

Output:
The first table is the aggregate time spent importing each resource; the second is the number of times each resource was imported. The special "resource" SAS LOCK is the time taken to grab all the zone locks in the expanders. Importing resources and acquiring the SAS locks are essentially all there is to a takeover, so together they should capture everything that consumes time.

References

<NOTE:1402545.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems
<BUG:7144862> - 6.5 MINUTE FAILBACK ON Q3.4.3 - NEED RCA
Dtrace Script - import.d: https://stbeehive.oracle.com/content/dav/st/AmberRoadSupport/Software/import.d
<NOTE:1194226.1> - Oracle Shared Shell

Attachments

This solution has no attachment