Sun StorEdge A5x00 Array: I/O becomes unresponsive or hangs on disks

Asset ID:	1-72-1008074.1
Update Date:	2012-06-07
Keywords:

Solution Type Problem Resolution Sure

Solution 1008074.1 : Sun StorEdge A5x00 Array: I/O becomes unresponsive or hangs on disks

Related Items


Sun Storage A5000 Array
 Sun Storage A5200 Array


PreviouslyPublishedAs
211117

Symptoms
I/O to one or more drives in a Sun StorEdge A5x00 Array subsystem becomes unresponsive or incurs retryable SCSI errors on a Solaris host.
In most cases, this is a hardware fault. By design, FC-AL (Fibre Channel Arbitrated Loop) allows any participant on the loop to corrupt frames, or fail to pass them on, before they reach the target of the I/O. This is much like a Token Ring network: if one station fails, the rest of the ring cannot communicate.

The difficulty in isolating faults in the FC-AL architecture is that any one participant can fail marginally: it still accepts I/O and causes no overt failure. This typically appears as a host hang, but it can also appear as strange, partial, or corrupted output from the Solaris luxadm(1M) and format(1M) commands.

Example symptoms:

luxadm probe shows individual drives instead of the SES devices for the IBs:

 Found Fibre Channel device(s):
Node WWN:20000004cf6be6b7  Device Type:Disk device
Logical Path:/dev/rdsk/c9t0d0s2
Node WWN:20000004cf7009a2  Device Type:Disk device
Logical Path:/dev/rdsk/c9t1d0s2
Node WWN:20000004cf6bfd68  Device Type:Disk device
Logical Path:/dev/rdsk/c9t4d0s2

NOTE: This is normal for some FC Multipacks and JBODs, but not for the A5x00.

format shows drive type unknown:

 12. c9t1d0 <drive type unknown>
/sbus@6,0/SUNW,socal@1,0/sf@1,0/ssd@w21000004cf7009a2,0
13. c9t5d0 <drive type unknown>
/sbus@6,0/SUNW,socal@1,0/sf@1,0/ssd@w21000004cf6bde10,0
14. c9t10d0 <drive type unknown>

luxadm -e dump_map or luxadm display enclosure_name shows all zeros for the Port WWN and Node WWN of a drive:

luxadm -e dump_map:

Pos AL_PA ID Hard_Addr Port WWN         Node WWN         Type
0   1     7d 1         2007020000122498 5020020000122498 0x3  (Processor device, Host Bus Adapter)
1   d2    d  0         0000000000000000 0000000000000000 0x1f (Unknown Type)
2   ef    0  41        0000000000000000 0000000000000000 0x1f (Unknown Type)
3   e8    1  0         0000000000000000 0000000000000000 0x1f (Unknown Type)
4   e1    4  0         0000000000000000 0000000000000000 0x1f (Unknown Type)
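
When the dump_map output has been saved to a file, the zeroed-WWN symptom can be checked mechanically. The following is a minimal sketch, not a Sun-supplied tool: it assumes the column layout of the example above (Pos, AL_PA, ID, Hard_Addr, Port WWN, Node WWN), and the function name find_zeroed_wwns is hypothetical.

```python
import re

# Matches a dump_map data row: Pos, AL_PA, ID, Hard_Addr, Port WWN, Node WWN.
# The layout is assumed from the example output above; real output may differ.
ROW = re.compile(
    r"^\s*(\d+)\s+([0-9a-fA-F]+)\s+([0-9a-fA-F]+)\s+([0-9a-fA-F]+)"
    r"\s+([0-9a-fA-F]{16})\s+([0-9a-fA-F]{16})\b"
)

ZERO_WWN = "0" * 16


def find_zeroed_wwns(dump_map_text):
    """Return (pos, al_pa) for each row whose Port or Node WWN is all zeros."""
    suspects = []
    for line in dump_map_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, wrapped continuation text, or blank line
        pos, al_pa, _id, _hard_addr, port_wwn, node_wwn = match.groups()
        if port_wwn == ZERO_WWN or node_wwn == ZERO_WWN:
            suspects.append((int(pos), al_pa))
    return suspects


# Rows copied from the example above.
SAMPLE = """\
Pos AL_PA ID Hard_Addr Port WWN         Node WWN         Type
0     1   7d    1      2007020000122498 5020020000122498 0x3  (Processor device,
Host Bus Adapter)
1     d2  d     0      0000000000000000 0000000000000000 0x1f (Unknown Type)
2     ef  0     41     0000000000000000 0000000000000000 0x1f (Unknown Type)
"""

if __name__ == "__main__":
    for pos, al_pa in find_zeroed_wwns(SAMPLE):
        print(f"Pos {pos} (AL_PA {al_pa}): WWN is all zeros")
```

A non-empty result only flags the symptom; it does not identify which component on the loop is at fault.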

Resolution
Methodical fault isolation of the components is the most thorough way to resolve this issue.

Start this isolation by running:

format(1M)
luxadm probe
luxadm display enclosure_name
luxadm -e dump_map enclosure_name

If any of the aforementioned symptoms are observed, or if one or more of these commands hang, contact Oracle Support immediately.

Additionally, it may be necessary to collect Read Link Status (RLS) data from the array on both HBA channels. RLS data is most useful when viewed as a delta, or change, between two points in time. If you suspect a problem, collect RLS data by running:

luxadm -e rdls enclosure_name

Example:

# luxadm -e rdls A

Link Error Status information for loop:/devices/sbus@6,0/SUNW,socal@1,0:0
al_pa   lnk fail    sync loss   signal loss   sequence err   invalid word   CRC
5a      0           13          16            0              0              0
72      4           1186        0             0              0              0
71      0           16          0             0              0              0
6e      0           15          0             0              0              0
6d      0           28          0             0              0              0
6c      0           26          0             0              0              0
6b      0           26          0             0              0              0
6a      0           21          0             0              0              0
45      0           0           11            0              0              0
55      4           930         0             0              0              0
54      0           19          0             0              0              0
53      0           19          0             0              0              0
52      0           19          0             0              0              0
4e      0           19          0             0              0              0
4d      0           18          0             0              0              0
1       720896      0           0             0              0              0

NOTE: Remember, these outputs are cumulative since the last power cycle of the
array, so it is worthwhile to collect two samples of data on each path to the
array. The Explorer also collects this information.
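
Because the counters are cumulative, comparing two saved samples is simple arithmetic. The sketch below is one possible way to do it, assuming each sample was saved as text in the format shown in the example above. The function names parse_rdls and rdls_delta are hypothetical, and the second sample's values are invented for illustration only.

```python
COUNTERS = ("lnk_fail", "sync_loss", "signal_loss",
            "sequence_err", "invalid_word", "crc")


def parse_rdls(text):
    """Parse saved rdls output into {al_pa: {counter: value}}."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows are assumed to be an AL_PA followed by six decimal counters.
        if len(fields) != 7 or not all(f.isdigit() for f in fields[1:]):
            continue  # title line, column header, or blank line
        table[fields[0]] = dict(zip(COUNTERS, map(int, fields[1:])))
    return table


def rdls_delta(before, after):
    """Return the nonzero counter changes per AL_PA between two samples.

    An AL_PA missing from the first sample is compared against zero.
    A negative change suggests the counters were reset (e.g. a power cycle).
    """
    delta = {}
    for al_pa, counters in after.items():
        old = before.get(al_pa, {})
        change = {name: value - old.get(name, 0)
                  for name, value in counters.items()
                  if value != old.get(name, 0)}
        if change:
            delta[al_pa] = change
    return delta


# First sample: two rows taken from the example above.
FIRST = """\
al_pa   lnk fail    sync loss   signal loss   sequence err   invalid word   CRC
72      4           1186        0             0              0              0
71      0           16          0             0              0              0
"""

# Second sample: hypothetical values, invented for this illustration.
SECOND = """\
al_pa   lnk fail    sync loss   signal loss   sequence err   invalid word   CRC
72      4           1190        0             0              12             0
71      0           16          0             0              0              0
"""

if __name__ == "__main__":
    for al_pa, change in rdls_delta(parse_rdls(FIRST), parse_rdls(SECOND)).items():
        print(al_pa, change)
```

Run the comparison separately for each path to the array, since each HBA channel reports its own counters.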

Additional Information
The Sun StorEdge A5x00 Array has the following loop architecture:

1) Each drive participates on two channels, A and B.
2) Each Interface Board (IB) participates on either the A or the B drive channel.
3) Each IB has a participant FC port that accesses the front and rear drive backplanes.

To illustrate the basic loop, here is an A or B drive channel in full loop mode (front and rear backplanes joined):

HBA in front port -> IB front port (SES chip) -> Drive slots 0-NN in the front -> IB rear port (SES chip) -> Drive slots 0-NN in the rear

Any component that connects these devices, or passes information between them, can cause data transmission problems that present as hangs, I/O timeouts, or incorrect or incomplete output.
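
The effect of a single bad participant can be illustrated with a toy model of the loop. This is a deliberately simplified sketch, not a description of the array's firmware or bypass circuitry: it assumes frames travel in one direction only, that a faulty participant forwards nothing, and that two drive slots per backplane are enough to show the idea. All names in it are hypothetical.

```python
def path_clear(loop, faulty, source, target):
    """True if a frame sent downstream from source reaches target
    without having to be forwarded by a faulty participant."""
    position = loop.index(source)
    while True:
        position = (position + 1) % len(loop)
        node = loop[position]
        if node == target:
            return True
        if node in faulty or node == source:
            return False  # frame was dropped, or target is not on the loop


def round_trip_ok(loop, faulty, host, target):
    """True if the request reaches the target and the reply reaches the host."""
    return (path_clear(loop, faulty, host, target)
            and path_clear(loop, faulty, target, host))


# Loop order taken from the diagram above; the slot count is illustrative.
LOOP = ["HBA", "IB front port", "front slot 0", "front slot 1",
        "IB rear port", "rear slot 0", "rear slot 1"]

if __name__ == "__main__":
    bad = {"front slot 1"}
    # The request reaches front slot 0, which sits upstream of the bad drive...
    print(path_clear(LOOP, bad, "HBA", "front slot 0"))
    # ...but the reply must pass through the bad drive to get back to the HBA.
    print(round_trip_ok(LOOP, bad, "HBA", "front slot 0"))
```

In this model a fault in any one participant breaks the round trip to every other participant, which is why the symptoms often do not point at the failing component.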

Product
Sun StorEdge A5000 Array
Sun StorEdge A5100 Array
Sun StorEdge A5200 Array

Internal Comments
For internal Sun use only.

Fault Isolation:

Service Engineers should look for:

1) Hangs on one path but not another.
2) Increasing RLS counters on a particular path.
3) Increasing RLS counters on a particular component.
-It is a drive/backplane/midplane issue if the increase occurs on both the A and B channels.
-The first drive in the dump_map output that shows the RLS increase should be replaced, IF THERE ARE MULTIPLE INCREASES OVER A PERIOD OF TIME.
-Increases of fewer than 10 Invalid Words per hour are acceptable.
-Increases in CRC errors are never acceptable.
-Increases on all drives in a single channel indicate an IB fault. We recommend replacing the HBA, cable, GBIC, and IB board in this case.

NOTE: RLS data MUST be used in a delta fashion (collect at least two samples of data).

4) After parts replacement, engineers should have the customer monitor the array daily, providing new RLS and /var/adm/messages output, or an Explorer, for review of the RLS counters to note any increases.
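
The two numeric rules above (Invalid Word rate and CRC increases) can be applied to a set of counter deltas as follows. This is a minimal sketch under stated assumptions: the input is a dictionary of per-AL_PA counter increases between two RLS samples, the elapsed time between the samples is known, and the function name assess and the sample values are hypothetical. It does not attempt the path and channel comparisons, which still need an engineer's judgment.

```python
INVALID_WORD_LIMIT_PER_HOUR = 10  # increases below this rate are acceptable


def assess(delta, hours):
    """Flag AL_PAs whose counter increases over `hours` hours break the rules.

    delta: {al_pa: {"invalid_word": increase, "crc": increase, ...}}
    Returns {al_pa: [finding, ...]} for AL_PAs with at least one finding.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    findings = {}
    for al_pa, change in delta.items():
        notes = []
        if change.get("crc", 0) > 0:
            notes.append("CRC errors increased: never acceptable")
        if change.get("invalid_word", 0) / hours >= INVALID_WORD_LIMIT_PER_HOUR:
            notes.append("Invalid Word rate is 10 per hour or more")
        if notes:
            findings[al_pa] = notes
    return findings


# Hypothetical deltas between two samples taken one hour apart.
EXAMPLE_DELTA = {
    "72": {"invalid_word": 12},
    "55": {"crc": 1},
    "71": {"invalid_word": 5},
}

if __name__ == "__main__":
    for al_pa, notes in assess(EXAMPLE_DELTA, hours=1).items():
        print(al_pa, "; ".join(notes))
```

An AL_PA that is flagged on both the A and B channels points toward the drive, backplane, or midplane, as described in item 3.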

photon, A5000, A5200, A5100, luxadm, rdls, timeout
Previously Published As
87010

Change History
Date: 2006-09-25
User Name: 97961
Action: Approved
Comment: - Changed title to comply to the standard format
- Tidied up formatting
- Applied trademarking where it is missing
Version: 3
Date: 2006-09-25
User Name: 97961
Action: Accept
Comment:
Version: 0
Date: 2006-09-25
User Name: 128938
Action: Approved
Comment: I had done a technical review of the document and do not find any
issue with it. I found the content technically accurate and correct.
Hence, sending it for final review and publication.
Version: 0
