Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Technical Instruction Sure

Solution 1352894.1 : Sun StorEdge[TM] 9900 Series: Solaris host reports "ASC: 0xc0" SCSI errors in the host messages

Created from <SR 3-4351866981>
Applies to:

Sun Storage 9990 System - Version: Not Applicable
Information in this document applies to any platform.

Goal

This document outlines considerations when "ASC: 0xc0" errors are seen on a Solaris host that is accessing a Hitachi/Sun StorEdge 9900 series storage array.

On the 6540 (connected as external storage to an SE9990) we corrected the parameters for the host mode type Windows Non-clustered (DMP support) overnight. However, we are still encountering a number of SCSI timeout messages reported from the Solaris domains connected to the SE9990 storage; there are many warnings like the following, against different LUNs:

Aug 25 02:30:53 server1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e80045c2a0000005c2a0000384e (ssd303):
Aug 25 02:30:53 server1 Error for Command: write(10) Error Level: Retryable
Aug 25 02:30:53 server1 scsi: [ID 107833 kern.notice] Requested Block: 70280096 Error Block: 70280096
Aug 25 02:30:53 server1 scsi: [ID 107833 kern.notice] Vendor: HITACHI Serial Number: 50 05C2A384E
Aug 25 02:30:53 server1 scsi: [ID 107833 kern.notice] Sense Key: Aborted Command
Aug 25 02:30:53 server1 scsi: [ID 107833 kern.notice] ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0
Aug 25 02:30:53 server1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e80045c2a0000005c2a0000384e (ssd303):

...up to this:

Aug 25 02:30:55 server1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e80045c2a0000005c2a0000380e (ssd307):
Aug 25 02:30:55 server1 Error for Command: write(10) Error Level: Retryable
Aug 25 02:30:55 server1 scsi: [ID 107833 kern.notice] Requested Block: 40308832 Error Block: 40308832
Aug 25 02:30:55 server1 scsi: [ID 107833 kern.notice] Vendor: HITACHI Serial Number: 50 05C2A380E
Aug 25 02:30:55 server1 scsi: [ID 107833 kern.notice] Sense Key: Aborted Command
Aug 25 02:30:55 server1 scsi: [ID 107833 kern.notice] ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0

The errors are always of the type "ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0".

Customer investigations found that the TOV is set to 15 seconds, whereas their standard is 30 seconds.

Analysis of the host explorer shows that the host started getting these error messages some time ago, on 9th Aug; it all started at this point:

bash-3.00$ grep "ASC: 0xc0" messages.1 | more
Aug  9 12:01:30 server1 scsi: [ID 107833 kern.notice] ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0
Aug  9 12:01:30 server1 scsi: [ID 107833 kern.notice] ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0
Aug  9 12:01:30 server1 scsi: [ID 107833 kern.notice] ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0

On the QLogic web site there is a reference explaining the meaning of the sense key reported by the storage, "ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0", and it also applies to our case. Document: QLogic Controller Attached to Hitachi Storage Returns C0/00 ASC/ASCQ Check Condition:

"The vendor unique ASC/ASCQ code 0xc0/0x00 within this Solaris host example is from the HDS storage array; it indicates that there is an internal timeout on the array because of too much load. This usually means that the subsystem needs to add more cache, or possibly change the layout of the array groups so that the box can perform better. This code is not an indication of trouble with a QLogic controller."

Based on that, we checked the autodump from the SE9990 and, for the first time on 9th Aug at 12:05 (almost the same time as on the host), we see the following SSBs:

D034 58 04 11/08/09 12:05:35 38618 0 Synchronous command JOB_WAIT time exaggerated generating
D034 58 4C 11/08/09 12:05:36 38624 0 Synchronous command JOB_WAIT time exaggerated generating
D034 58 0C 11/08/09 12:05:36 38629 0 Synchronous command JOB_WAIT time exaggerated generating
D034 58 44 11/08/09 12:05:36 38661 0 Synchronous command JOB_WAIT time exaggerated generating
DE4E 58 4C 11/08/09 12:10:24 40119 0 Command-queuing TOV starting (60 or less seconds)
DE4E 58 0C 11/08/09 12:10:24 40127 0 Command-queuing TOV starting (60 or less seconds)
DE4E 58 04 11/08/09 12:10:24 40139 0 Command-queuing TOV starting (60 or less seconds)
DE4E 58 44 11/08/09 12:10:25 40152 0 Command-queuing TOV starting (60 or less seconds)
D034 58 04 11/08/09 12:21:29 41823 0 Synchronous command JOB_WAIT time exaggerated generating
D034 58 44 11/08/09 12:21:29 41832 0 Synchronous command JOB_WAIT time exaggerated generating
D034 58 0C 11/08/09 12:21:30 41864 0 Synchronous command JOB_WAIT time exaggerated generating
D034 58 4C 11/08/09 12:21:30 41894 0 Synchronous command JOB_WAIT time exaggerated generating

These are always reported by microprocessors 04, 0C, 44 and 4C:

DE4E 58 0C 11/08/12 16:11:02 57719 0 Command-queuing TOV starting (60 or less seconds)
DE4E 58 44 11/08/12 16:11:02 57726 0 Command-queuing TOV starting (60 or less seconds)
DE4E 58 4C 11/08/12 16:11:02 57735 0 Command-queuing TOV starting (60 or less seconds)
DE4E 58 04 11/08/12 16:11:03 57753 0 Command-queuing TOV starting (60 or less seconds)

According to HDS, when associated with front-end MPs (our case), error code DE4E indicates a queue overrun on the storage array ports. SSB D034 most likely means the port is dropping I/O due to an overload condition, which is consistent with the other SSB, DE4E.
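As an aside, a shell sketch like the one below (not part of the original investigation) can extend the grep shown above to quantify the problem from the host messages files; the file name messages.1 and the ssd instance format are taken from the output above and may differ on other hosts:

bash-3.00$ # Count "ASC: 0xc0" entries per day (fields 1-2 are the month and day):
bash-3.00$ grep "ASC: 0xc0" /var/adm/messages.1 | awk '{print $1, $2}' | sort | uniq -c

bash-3.00$ # Count WARNING lines per ssd instance to spot the busiest LUNs
bash-3.00$ # (the instance, e.g. "(ssd303):", is the last field of each WARNING line):
bash-3.00$ grep "WARNING: /scsi_vhci" /var/adm/messages.1 | awk '{print $NF}' | sort | uniq -c | sort -rn | head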
Solution

As explained by QLogic, this usually means that the subsystem needs to add more cache, or possibly change the layout of the array groups so that the box can perform better. The way to address this is by redistributing LUNs onto other ports with less utilization.

MP 04 --> CHA-1E ports 1C and 5C
MP 0C --> CHA-1F ports 1G and 5G
MP 44 --> CHA-2Q ports 2C and 6C
MP 4C --> CHA-2R ports 2G and 6G

The hosts connected to these ports are:

1C, 1G, 2C, 2G --> server0, server1 and server2
5C, 6C --> server3, server4
5G, 6G --> nothing connected

...so it is most likely that this is caused by hosts server0, server1 and server2.

You can also try to tune this on the host side, in order not to overload these ports (a hedged queue depth example follows the procedure below):

StorEdge[TM] 99X0: Heterogeneous Hosts, Queue Depth, and Other I/O Related Parameters (Doc ID 1018905.1)
Sun StorEdge[TM] 9900 Series: Setting Queue Depth (Doc ID 1003338.1)

On the other hand, we do not see a relation to the I/O TOV. The I/O TOV is a value associated with the cross-subsystem paths to the external storage. From the "Hitachi Universal Storage Platform V/VM Hitachi Universal Volume Manager User's Guide", page 4-14:

I/O TOV: Value specified as the time over of the I/O to the external volume.

How to change that I/O TOV value is described on page 5-42, "Changing the Port Setting of the External Storage System":

---------------------------------------
You can change the setting of the port of the external storage system in the Path Operation window. For changing the setting of the port, use the Change WWN Parameter dialog box.

To change the port setting of the external storage system:

1. Start Storage Navigator, and open the Path Operation window.
2. Make sure that Storage Navigator is in Modify mode.
3. Select Fibre - External Subsystem from the drop-down list above the Path Operation tree.
4. Click the product name in the Path Operation tree.
5. Right-click the WWN whose setting you want to change in the Path Operation list.
6. Click Change WWN Parameter in the pop-up menu. The Change WWN Parameter dialog box opens.
7. Change the set parameter of the selected port on the Change WWN Parameter dialog box.
8. Click OK to close the Change WWN Parameter dialog box and return to the Path Operation window. The selected items appear in blue italics.
9. Verify the settings in the Preview dialog box.
10. Click Apply in the Path Operation window. The settings are applied to the local storage system and the Path Operation window appears normally. When an error occurs, an error message is displayed.

The Change WWN Parameter dialog box consists of:

* QDepth (2-128): The number of Read/Write commands which can be issued (queued) to the external volume at a time. The value that can be set ranges from 2 to 128. The default value is 8.
* I/O TOV (5-240): Value specified as the time over of the I/O to the external volume. The value that can be set ranges from 5 to 240 (seconds). The default value is 15.
* Path Blockade Watch (5-180): The time from when the connection of all the paths to the external volume has gone down to when the external volume is blocked. The commands from the host are accepted until the time set for this parameter has passed. After the time set for this parameter has passed, the path status becomes Blockade. The value that can be set ranges from 5 to 180 (seconds). The default value is 10.
-------------------------------------
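As mentioned above, the host-side queue depth can be limited so that these array ports are not overloaded. The correct value depends on the configuration and is covered in Doc ID 1003338.1; the /etc/system entry below is only a hedged sketch, and the value 8 is an illustrative assumption, not a recommendation for this configuration:

* /etc/system: limit the number of commands queued per LUN by the ssd driver
* (use sd_max_throttle instead if the LUNs are attached via the sd driver).
set ssd:ssd_max_throttle=8
* A reboot is required for /etc/system changes to take effect.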
The conclusion from the last analysis is that we do not see any fault; everything is working as configured, so we would recommend contacting Oracle Advanced Services to address this performance/bottleneck problem.

References

<NOTE:1003338.1> - Sun StorEdge[TM] 9900 Series: Setting Queue Depth
<NOTE:1004918.1> - Sun StorEdge[TM] SAN Software 4.4: Logical block MPxIO load balancing method
<NOTE:1010664.1> - Starcat and Sun StorageTek[TM] 9900 (SE9900) Performance Considerations
<NOTE:1018905.1> - StorEdge[TM] 99X0: Heterogeneous Hosts, Queue Depth, and Other I/O Related Parameters
<NOTE:1019786.1> - Sun Storage 9990V System with external Storage
<NOTE:1352893.1> - SE9990 - Parity Groups Status As 'External Device Error' And 'Warning' - External LDEVs As Blockade

Attachments

This solution has no attachment