Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type Problem Resolution Sure Solution 1008074.1: Sun StorEdge A5x00 Array: I/O becomes unresponsive or hangs on disks
Previously Published As 211117

Symptoms

I/O to drive(s) in a Sun StorEdge A5x00 Array subsystem becomes unresponsive or incurs retryable SCSI errors on a Solaris host. In most cases, this is a hardware fault.

In general, the design of FC-AL (Fibre Channel Arbitrated Loop) allows any participant on the loop to corrupt, or fail to pass, packets destined for the target of the I/O. This is much like a token ring network, in that if one station fails, the rest of the ring cannot communicate. The difficulty in isolating faults in this FC-AL architecture is that any one participant can be failing in a marginal fashion: it still accepts I/O, but does not cause an overt failure. This is typically seen as a host hang, but can also be seen as strange, partial, or corrupted output from the Solaris luxadm(1M) and format(1M) commands.

Example symptoms:
luxadm probe shows individual drives instead of the SES devices:

    Found Fibre Channel device(s):
      Node WWN:20000004cf6be6b7  Device Type:Disk device
        Logical Path:/dev/rdsk/c9t0d0s2
      Node WWN:20000004cf7009a2  Device Type:Disk device
        Logical Path:/dev/rdsk/c9t1d0s2
      Node WWN:20000004cf6bfd68  Device Type:Disk device
        Logical Path:/dev/rdsk/c9t4d0s2

NOTE: This is normal for some FC Multipacks and JBODs, but not for the A5x00.

format shows drive type unknown:

    12. c9t1d0 <drive type unknown>
        /sbus@6,0/SUNW,socal@1,0/sf@1,0/ssd@w21000004cf7009a2,0
    13. c9t5d0 <drive type unknown>
        /sbus@6,0/SUNW,socal@1,0/sf@1,0/ssd@w21000004cf6bde10,0
    14. c9t10d0 <drive type unknown>

luxadm -e dump_map or luxadm display enclosure_name shows all zeros for the Port WWN and Node WWN of a drive:

luxadm -e dump_map:

    Pos  AL_PA  ID  Hard_Addr  Port WWN          Node WWN          Type
    0    1      7d  1          2007020000122498  5020020000122498  0x3 (Processor device, Host Bus Adapter)
    1    d2     d   0          0000000000000000  0000000000000000  0x1f (Unknown Type)
    2    ef     0   41         0000000000000000  0000000000000000  0x1f (Unknown Type)
    3    e8     1   0          0000000000000000  0000000000000000  0x1f (Unknown Type)
    4    e1     4   0          0000000000000000  0000000000000000  0x1f (Unknown Type)

Resolution

Methodical fault isolation of the component(s) is the most comprehensive way to resolve this issue. Start the isolation by running:

    format(1M)
    luxadm probe
    luxadm display enclosure_name
    luxadm -e dump_map enclosure_name

If any of the aforementioned symptoms are observed, or if one or more of these commands hang, contact Oracle Support immediately.

Additionally, it may be necessary to collect Read Link Status (RLS) data from the array on both HBA channels. RLS data is useful when viewed as a delta, or change, between two points in time.
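The delta comparison of two RLS samples can be sketched in a few lines of Python. This is only an illustration, not a Sun or Oracle tool: the function names `parse_rdls` and `rdls_delta` are hypothetical, and the parser assumes the seven-column layout of `luxadm -e rdls` output (a hex AL_PA followed by six decimal counters), which may differ between luxadm versions. The second sample in the demo is invented purely to show a change.

```python
import re

# Counter names mirror the column header printed by `luxadm -e rdls`.
COUNTERS = ("lnk_fail", "sync_loss", "signal_loss",
            "sequence_err", "invalid_word", "crc")


def parse_rdls(text):
    """Return {al_pa: {counter: value}} from captured rdls output (assumed layout)."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row is assumed to be a hex AL_PA plus six decimal counters;
        # header and banner lines do not match and are skipped.
        if len(fields) != 7:
            continue
        if not re.fullmatch(r"[0-9a-fA-F]{1,2}", fields[0]):
            continue
        if not all(f.isdigit() for f in fields[1:]):
            continue
        samples[fields[0].lower()] = dict(zip(COUNTERS, map(int, fields[1:])))
    return samples


def rdls_delta(before, after):
    """Return only the counters that increased between two parsed samples."""
    changes = {}
    for al_pa, counters in after.items():
        old = before.get(al_pa)
        if old is None:
            continue  # device absent from the first sample; nothing to compare
        grown = {name: counters[name] - old[name]
                 for name in COUNTERS if counters[name] > old[name]}
        if grown:
            changes[al_pa] = grown
    return changes


# First sample: two rows in the assumed layout.
first = """\
al_pa lnk fail sync loss signal loss sequence err invalid word CRC
72 4 1186 0 0 0 0
71 0 16 0 0 0 0
"""
# Second sample: hypothetical values, made up for illustration only.
second = """\
al_pa lnk fail sync loss signal loss sequence err invalid word CRC
72 4 1250 0 0 0 3
71 0 16 0 0 0 0
"""
print(rdls_delta(parse_rdls(first), parse_rdls(second)))
# prints {'72': {'sync_loss': 64, 'crc': 3}}
```

Only devices whose counters grew between the two samples are reported, so a loop participant with large but static cumulative totals does not distract from one that is actively accumulating errors.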
If you suspect a problem, RLS data can be collected by running:

    luxadm -e rdls enclosure_name

Example:

    # luxadm -e rdls A
    Link Error Status information for loop:/devices/sbus@6,0/SUNW,socal@1,0:0
    al_pa  lnk fail  sync loss  signal loss  sequence err  invalid word  CRC
    5a     0         13         16           0             0             0
    72     4         1186       0            0             0             0
    71     0         16         0            0             0             0
    6e     0         15         0            0             0             0
    6d     0         28         0            0             0             0
    6c     0         26         0            0             0             0
    6b     0         26         0            0             0             0
    6a     0         21         0            0             0             0
    45     0         0          11           0             0             0
    55     4         930        0            0             0             0
    54     0         19         0            0             0             0
    53     0         19         0            0             0             0
    52     0         19         0            0             0             0
    4e     0         19         0            0             0             0
    4d     0         18         0            0             0             0
    1      720896    0          0            0             0             0

NOTE: Remember, these outputs are cumulative since the last power cycle of the

Additional Information

The Sun StorEdge A5x00 Array has the following loop architecture:

1) each drive participates on 2 channels, A and B

To illustrate the basic loop, here is an A or B drive channel in full loop mode (front and rear backplanes are joined):

HBA in front port -> IB front port (SES chip) -> Drive slots 0-NN in the front -> IB rear port (SES chip) -> Drive slots 0-NN in the rear

Any component that connects these devices, or passes information along between them, can cause data transmission problems that present as a hang, an I/O timeout, or incorrect or incomplete output.

Product

Sun StorEdge A5000 Array
Sun StorEdge A5100 Array
Sun StorEdge A5200 Array

Internal Comments

For internal Sun use only.

Fault Isolation: Service Engineers should look for:
1) hangs on one path but not another

NOTE: RLS data MUST be used in a delta fashion (collect at least two samples of data).
4) After parts replacement, engineers should have the customer monitor on a photon, A5000, A5200, A5100, luxadm, rdls, timeout

Previously Published As 87010

Change History

Date: 2006-09-25
User Name: 97961
Action: Approved
Comment:
- Changed title to comply to the standard format
- Tidied up formatting
- Applied trademarking where it is missing
Version: 3

Date: 2006-09-25
User Name: 97961
Action: Accept
Comment:
Version: 0

Date: 2006-09-25
User Name: 128938
Action: Approved
Comment: I had done a technical review of the document and do not find any issue with it. I found the content technically accurate and correct. Hence, sending it for final review and publication.
Version: 0

Attachments

This solution has no attachment