Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Solution Type: Problem Resolution Sure
Solution 1008074.1: Sun StorEdge[TM] A5x00 Array: IO become unresponsive or hang on disks

Previously Published As: 211117

Symptoms

IO to one or more drives in a Sun StorEdge[TM] A5x00 Array subsystem becomes unresponsive, or incurs retryable SCSI errors, on a Solaris(tm) host. In most cases, this is a hardware fault. By design, FC-AL (Fibre Channel Arbitrated Loop) allows any participant in the loop to corrupt, or fail to pass along, frames destined for the target of the IO. This is much like a token ring network: if one station fails, the rest of the ring cannot communicate. The difficulty in isolating faults in this FC-AL architecture is that any one participant can fail marginally, still accepting IO without causing an overt failure. This is typically seen as a host hang, but can also appear as strange, partial, or corrupted output from the Solaris luxadm(1M) and format(1M) commands.

Example symptoms:

luxadm probe shows individual drives instead of the SES devices:

    Found Fibre Channel device(s):
      Node WWN:20000004cf6be6b7  Device Type:Disk device
        Logical Path:/dev/rdsk/c9t0d0s2
      Node WWN:20000004cf7009a2  Device Type:Disk device
        Logical Path:/dev/rdsk/c9t1d0s2
      Node WWN:20000004cf6bfd68  Device Type:Disk device
        Logical Path:/dev/rdsk/c9t4d0s2

NOTE: This is normal for some FC Multipacks and JBODs, but not for the A5x00.

format shows drive type unknown:

    12. c9t1d0 <drive type unknown>
        /sbus@6,0/SUNW,socal@1,0/sf@1,0/ssd@w21000004cf7009a2,0
    13. c9t5d0 <drive type unknown>
        /sbus@6,0/SUNW,socal@1,0/sf@1,0/ssd@w21000004cf6bde10,0
    14. c9t10d0 <drive type unknown>

luxadm -e dump_map or luxadm display enclosure_name shows all zeros for the Port WWN (WWPN) and Node WWN of a drive:

    luxadm dump_map:
    Pos  AL_PA  ID  Hard_Addr  Port WWN          Node WWN          Type
    0    1      7d  1          2007020000122498  5020020000122498  0x3  (Processor device, Host Bus Adapter)
    1    d2     d   0          0000000000000000  0000000000000000  0x1f (Unknown Type)
    2    ef     0   41         0000000000000000  0000000000000000  0x1f (Unknown Type)
    3    e8     1   0          0000000000000000  0000000000000000  0x1f (Unknown Type)
    4    e1     4   0          0000000000000000  0000000000000000  0x1f (Unknown Type)

Resolution

Methodical fault isolation of the components offers the most comprehensive resolution to this issue. As such, it is recommended that Sun[TM] Support Services be contacted. Start this isolation by running:

    format(1M)
    luxadm probe
    luxadm display enclosure_name
    luxadm -e dump_map enclosure_name

If any of the aforementioned symptoms are observed, or if one or more of these commands hang, contact Sun Support Services immediately.

Additionally, it may be necessary to collect Read Link Status (RLS) data from the array on both HBA channels. RLS data is useful when viewed as a delta, or change, between two points in time.

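The delta view of RLS data can be sketched in Python. This is a minimal illustration, not a Sun-supplied tool. It assumes the output of luxadm -e rdls has been saved as text at two points in time, and that each counter row holds seven whitespace-separated fields (the AL_PA followed by six counters). The function names parse_rls and rls_delta are hypothetical, and the second snapshot below is invented purely for illustration.

```python
import re

# Counter columns, in the order they appear in the rdls output
# (assumed layout: al_pa, then these six counters).
COUNTERS = ("lnk_fail", "sync_loss", "signal_loss",
            "sequence_err", "invalid_word", "crc")

# A data row: a hex AL_PA followed by exactly six decimal counters.
ROW = re.compile(r"^\s*([0-9a-fA-F]{1,2})((?:\s+\d+){6})\s*$")


def parse_rls(text):
    """Return {al_pa: {counter: value}} for every counter row in the text."""
    result = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            values = [int(v) for v in match.group(2).split()]
            result[match.group(1).lower()] = dict(zip(COUNTERS, values))
    return result


def rls_delta(before, after):
    """Return only the counters that changed between two snapshots."""
    delta = {}
    for al_pa, counters in after.items():
        old = before.get(al_pa, {})
        changed = {name: value - old.get(name, 0)
                   for name, value in counters.items()
                   if value != old.get(name, 0)}
        if changed:
            delta[al_pa] = changed
    return delta


# First snapshot: three rows in the assumed rdls layout.
FIRST_SNAPSHOT = """\
al_pa lnk fail sync loss signal loss sequence err invalid word CRC
5a 0 13 16 0 0 0
72 4 1186 0 0 0 0
71 0 16 0 0 0 0
"""

# Second snapshot: HYPOTHETICAL values, invented for this example only.
SECOND_SNAPSHOT = """\
al_pa lnk fail sync loss signal loss sequence err invalid word CRC
5a 0 13 16 0 0 0
72 4 1250 0 0 3 0
71 0 16 0 0 0 0
"""

if __name__ == "__main__":
    changes = rls_delta(parse_rls(FIRST_SNAPSHOT), parse_rls(SECOND_SNAPSHOT))
    for al_pa, counters in changes.items():
        print(al_pa, counters)
```

Only AL_PAs whose counters moved between the two collections are reported, which is the point of reading RLS data as a delta: large cumulative totals alone do not show whether errors are still accruing. A negative difference would suggest the counters were reset between the two collections.
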
If you suspect a problem, RLS data can be collected by running:

    luxadm -e rdls enclosure_name

Example:

    # luxadm -e rdls A
    Link Error Status information for loop:/devices/sbus@6,0/SUNW,socal@1,0:0
    al_pa   lnk fail   sync loss   signal loss   sequence err   invalid word   CRC
    5a      0          13          16            0              0              0
    72      4          1186        0             0              0              0
    71      0          16          0             0              0              0
    6e      0          15          0             0              0              0
    6d      0          28          0             0              0              0
    6c      0          26          0             0              0              0
    6b      0          26          0             0              0              0
    6a      0          21          0             0              0              0
    45      0          0           11            0              0              0
    55      4          930         0             0              0              0
    54      0          19          0             0              0              0
    53      0          19          0             0              0              0
    52      0          19          0             0              0              0
    4e      0          19          0             0              0              0
    4d      0          18          0             0              0              0
    1       720896     0           0             0              0              0

NOTE: Remember, these outputs are cumulative since the last power cycle of the

Additional Information

The Sun StorEdge[TM] A5x00 Array has the following loop architecture:

1) Each drive participates on 2 channels, A and B.

To illustrate the basic loop, here is an A or B drive channel in full loop mode (front and rear backplanes are joined):

    HBA in front port -> IB front port (SES chip) -> Drive slots 0-NN in the front -> IB rear port (SES chip) -> Drive slots 0-NN in the rear

This means that any component that connects these devices, or passes information along between them, can cause data transmission problems that present as a hang, an IO timeout, or incorrect or incomplete output.

Product

Sun StorageTek A5000 Array
Sun StorageTek A5100 Array
Sun StorageTek A5200 Array

Internal Comments

For internal Sun use only.

Fault Isolation: Services Engineers should look for:
1) Hangs on one path but not another.

NOTE: RLS paths MUST be used in a delta fashion (collect at least two points of data).
4) After parts replacement, engineers should have the customer monitored on a

photon, A5000, A5200, A5100, luxadm, rdls, timeout

Previously Published As 87010

Change History

Date: 2006-09-25
User Name: 97961
Action: Approved
Comment: - Changed title to comply to the standard format - Tidied up formatting - Applied trademarking where it is missing
Version: 3

Date: 2006-09-25
User Name: 97961
Action: Accept
Comment:
Version: 0

Date: 2006-09-25
User Name: 128938
Action: Approved
Comment: I had done a technical review of the document and do not find any issue with it. I found the content technically accurate and correct. Hence, sending it for final review and publication.
Version: 0

Attachments

This solution has no attachment