Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Troubleshooting Sure
Solution 1408475.1: Sun Storage 7000 Unified Storage System: How to troubleshoot long cluster take-over and fail-back times

Applies to:
Sun Storage 7410 Unified Storage System - Version: Not Applicable
Sun ZFS Storage 7320 - Version: Not Applicable and later [Release: N/A and later]
Sun ZFS Storage 7420 - Version: Not Applicable and later [Release: N/A and later]
Sun Storage 7310 Unified Storage System - Version: Not Applicable and later [Release: N/A and later]
Sun Microsystems > Storage - Disk > Unified Storage
7000 Appliance OS (Fishworks)

Purpose
The appliance supports both fail-back and takeover, and the distinction between them is important: fail-back requires that resources be given up before they move to the other head, whereas takeover just takes them. In the case of a slow fail-back, it is worth figuring out whether the relinquishing head is slow to give up the resources or the claiming head is slow to take them.

Last Review Date
February 15, 2012

Instructions for the Reader
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details
The takeover and fail-back times depend on the number of objects that need to be iterated during the resource import phase. On the 7x20 and 7x10 series systems, those objects include shares, LUNs, data-links, VLANs, network interfaces, IPMP/LACP setup, iSCSI/FC targets, initiators, and groups. Simple configurations are faster than complex configurations.

Other considerations:

Identifying the problem
Determine

For reference, the expected takeover time is:

    Time in seconds = (20 * D) + (.03 * S)

    D is the number of disksets (half JBODs)
    S is the number of shares (filesystems)

1402545.1 - Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems
How to gather key data and information for Oracle Disk array products, to minimise problem diagnosis and resolution times (Doc ID 1346234.1)

Note: For any fail-over issues that are not addressed by this document, please contact Oracle Support for assistance in diagnosing the issue, and be prepared for remote access to be required. Ref: Oracle Shared Shell Document 1194226.1

Sun ZFS Storage Appliances Troubleshooting Resource Center (Doc ID 1416406.1)

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - 7000 Series ZFS Appliances:
https://communities.oracle.com/portal/server.pt/community/7000_series_zfs_appliance/456

Customers are not permitted to run commands at the emergency shell.

Checking takeover/failback times from a support bundle:
Refer to the support bundle log file rm.ak, for example:

    bash-3.2$ cd /cores/sr-id/supportbundle
    bash-3.2$ find . -type f -exec grep "in 0." {} /dev/null \;

This gives you an overview of how long the export and import of individual items take.

Example from <Bug 7144862>:

    adc26stor08:configuration cluster> date
    2012-2-11 08:59:46
    adc26stor08:configuration cluster> failback
    Continuing will immediately fail back the resources assigned to the cluster peer. This may result in clients experiencing a slight delay in service.
    Are you sure? (Y/N)
    date
    adc26stor08:configuration cluster> date
    2012-2-11 09:06:24

On the exporting node (08), the pools take the longest:

    adc26stor08# aklog rm | grep -i export | grep "Sat Feb 11 09:0" | grep -v "in 0." | tail -20
    Sat Feb 11 09:01:26 2012: export of ak:/nas/pool07a succeeded in 95.727s
    Sat Feb 11 09:04:10 2012: export of ak:/zfs/pool07a succeeded in 164.224s
    adc26stor08#

On the importing node (07), the pools are also the biggest hitters:

    adc26stor07# aklog rm | grep -i import | grep "Sat Feb 11 09:0" | grep -v "in 0." | tail -20
    Sat Feb 11 09:05:14 2012: [zfs import] zpool_import_props() succeeded in 61.090s
    Sat Feb 11 09:05:14 2012: import of ak:/zfs/pool07a succeeded in 61.129s
    Sat Feb 11 09:06:03 2012: [nas import] discovery completed in 48.400s
    Sat Feb 11 09:06:16 2012: [nas import] mounted 673 datasets in 6.989s
    Sat Feb 11 09:06:17 2012: import of ak:/nas/pool07a succeeded in 62.531s
    Sat Feb 11 09:06:20 2012: import of ak:/net/ixgbe93003 succeeded in 1.649s
    adc26stor07#

The DTrace script import.d (see References) helps to troubleshoot long cluster takeover and fail-back times by showing which operation takes the most time. The script measures the time taken to import each resource.

Output: The first table is the aggregate time spent importing each resource; the second is the number of times it was imported. The special "resource" SAS LOCK is just the time taken to grab all the zone locks in the expanders. These two activities are basically all there is to takeover, so they should capture everything that consumes time.

References
<NOTE:1402545.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems
<BUG:7144862> - 6.5 MINUTE FAILBACK ON Q3.4.3 - NEED RCA
Dtrace Script - import.d: https://stbeehive.oracle.com/content/dav/st/AmberRoadSupport/Software/import.d
<NOTE:1194226.1> - Oracle Shared Shell

Attachments
This solution has no attachment
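
Example (illustrative sketch only, not an Oracle-supplied tool): the reference formula from "Identifying the problem", Time in seconds = (20 * D) + (.03 * S), and the manual grep of the "aklog rm" output can be combined in a few lines of Python to rank the slowest export and import steps. The function names, the min_seconds threshold (standing in for the grep -v "in 0." filter), the line pattern, and the diskset count used in the example are assumptions made for illustration. The log lines are copied from the Bug 7144862 example in this document, and treating the 673 mounted datasets as the share count is also an assumption.

```python
import re

# Matches timed entries in "aklog rm" output, for example:
#   Sat Feb 11 09:01:26 2012: export of ak:/nas/pool07a succeeded in 95.727s
# The pattern is an assumption derived only from the sample lines in this document.
TIMED_ENTRY = re.compile(
    r"^(?P<stamp>.+? \d{4}): (?P<what>.+?)(?: succeeded| completed)? "
    r"in (?P<secs>\d+(?:\.\d+)?)s\s*$"
)


def expected_takeover_seconds(disksets, shares):
    """Reference formula from this document: (20 * D) + (.03 * S)."""
    return 20 * disksets + 0.03 * shares


def slowest_operations(log_lines, min_seconds=1.0):
    """Return (seconds, operation) tuples, slowest first.

    Entries faster than min_seconds are dropped, which roughly mirrors
    the grep -v "in 0." filter used in the manual procedure.
    """
    ops = []
    for line in log_lines:
        match = TIMED_ENTRY.match(line.strip())
        if match is None:
            continue  # not a timed entry
        secs = float(match.group("secs"))
        if secs >= min_seconds:
            ops.append((secs, match.group("what")))
    return sorted(ops, reverse=True)


# Log lines copied from the Bug 7144862 example (both nodes combined).
SAMPLE_LOG = """\
Sat Feb 11 09:01:26 2012: export of ak:/nas/pool07a succeeded in 95.727s
Sat Feb 11 09:04:10 2012: export of ak:/zfs/pool07a succeeded in 164.224s
Sat Feb 11 09:05:14 2012: [zfs import] zpool_import_props() succeeded in 61.090s
Sat Feb 11 09:05:14 2012: import of ak:/zfs/pool07a succeeded in 61.129s
Sat Feb 11 09:06:03 2012: [nas import] discovery completed in 48.400s
Sat Feb 11 09:06:16 2012: [nas import] mounted 673 datasets in 6.989s
Sat Feb 11 09:06:17 2012: import of ak:/nas/pool07a succeeded in 62.531s
Sat Feb 11 09:06:20 2012: import of ak:/net/ixgbe93003 succeeded in 1.649s
"""

if __name__ == "__main__":
    for secs, what in slowest_operations(SAMPLE_LOG.splitlines()):
        print(f"{secs:9.3f}s  {what}")
    # disksets=2 is a hypothetical value; the document does not state it.
    estimate = expected_takeover_seconds(disksets=2, shares=673)
    print(f"expected takeover time: {estimate:.2f}s")
```

Comparing the ranked durations against the formula's estimate gives a quick indication of whether the exporting or the importing head accounts for the excess time; the figures still need to be confirmed against the full rm log.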