Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Technical Instruction Sure

Solution 1352894.1 : Sun StorEdge[TM] 9900 Series: Solaris host reports "ASC: 0xc0" SCSI errors in the host messages

Created from <SR 3-4351866981>
Applies to:

Sun Storage 9990 System - Version: Not Applicable
Information in this document applies to any platform.

Goal

This document outlines considerations when "ASC: 0xc0" errors are seen on a Solaris host that is accessing a Hitachi/Sun StorEdge 9900 series storage array.

On the 6540 (connected as external storage to an SE9990) we corrected the parameters for the host mode type Windows Non-clustered (DMP support) overnight. However, we are still encountering a number of SCSI timeout messages reported from the Solaris domains connected to the SE9990 storage; there are many warnings like the following, against different LUNs:

Aug 25 02:30:53 server1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e80045c2a0000005c2a0000384e (ssd303):
Aug 25 02:30:53 server1 Error for Command: write(10) Error Level: Retryable
Aug 25 02:30:53 server1 scsi: [ID 107833 kern.notice] Requested Block: 70280096 Error Block: 70280096
Aug 25 02:30:53 server1 scsi: [ID 107833 kern.notice] Vendor: HITACHI Serial Number: 50 05C2A384E
Aug 25 02:30:53 server1 scsi: [ID 107833 kern.notice] Sense Key: Aborted Command
Aug 25 02:30:53 server1 scsi: [ID 107833 kern.notice] ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0
Aug 25 02:30:53 server1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e80045c2a0000005c2a0000384e (ssd303):

...up to this:

Aug 25 02:30:55 server1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e80045c2a0000005c2a0000380e (ssd307):
Aug 25 02:30:55 server1 Error for Command: write(10) Error Level: Retryable
Aug 25 02:30:55 server1 scsi: [ID 107833 kern.notice] Requested Block: 40308832 Error Block: 40308832
Aug 25 02:30:55 server1 scsi: [ID 107833 kern.notice] Vendor: HITACHI Serial Number: 50 05C2A380E
Aug 25 02:30:55 server1 scsi: [ID 107833 kern.notice] Sense Key: Aborted Command
Aug 25 02:30:55 server1 scsi: [ID 107833 kern.notice] ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0

The errors are always of the type "ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0".

Customer investigations found that the TOV is set to 15 seconds, whereas their standard is 30 seconds.

Analysis of the host explorer shows that the host started getting these error messages some time ago, on 9th Aug; it all started at this point:

bash-3.00$ grep "ASC: 0xc0" messages.1 | more
Aug  9 12:01:30 server1 scsi: [ID 107833 kern.notice] ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0
Aug  9 12:01:30 server1 scsi: [ID 107833 kern.notice] ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0
Aug  9 12:01:30 server1 scsi: [ID 107833 kern.notice] ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0

On the QLogic web site there is a reference explaining the meaning of the sense key reported by the storage, "ASC: 0xc0 (<vendor unique code 0xc0>), ASCQ: 0x0, FRU: 0x0", and it also applies to our case. Document: QLogic Controller Attached to Hitachi Storage Returns C0/00 ASC/ASCQ Check Condition:

"The vendor unique ASC/ASCQ code 0xc0/0x00 within this Solaris host example is from the HDS storage array; it indicates that there is an internal timeout on the array because of too much load. This usually means that the subsystem needs to add more cache, or possibly change the layout of the array groups so that the box can perform better. This code is not an indication of trouble with a QLogic controller."

Based on that, we checked the autodump from the SE9990 and, for the first time on 9th Aug at 12:05 (almost the same time as on the host), we see the following SSBs:

D034 58 04 11/08/09 12:05:35 38618 0 Synchronous command JOB_WAIT time exaggerated generating
D034 58 4C 11/08/09 12:05:36 38624 0 Synchronous command JOB_WAIT time exaggerated generating
D034 58 0C 11/08/09 12:05:36 38629 0 Synchronous command JOB_WAIT time exaggerated generating
D034 58 44 11/08/09 12:05:36 38661 0 Synchronous command JOB_WAIT time exaggerated generating
DE4E 58 4C 11/08/09 12:10:24 40119 0 Command-queuing TOV starting (60 or less seconds)
DE4E 58 0C 11/08/09 12:10:24 40127 0 Command-queuing TOV starting (60 or less seconds)
DE4E 58 04 11/08/09 12:10:24 40139 0 Command-queuing TOV starting (60 or less seconds)
DE4E 58 44 11/08/09 12:10:25 40152 0 Command-queuing TOV starting (60 or less seconds)
D034 58 04 11/08/09 12:21:29 41823 0 Synchronous command JOB_WAIT time exaggerated generating
D034 58 44 11/08/09 12:21:29 41832 0 Synchronous command JOB_WAIT time exaggerated generating
D034 58 0C 11/08/09 12:21:30 41864 0 Synchronous command JOB_WAIT time exaggerated generating
D034 58 4C 11/08/09 12:21:30 41894 0 Synchronous command JOB_WAIT time exaggerated generating

These are always reported by microprocessors 04, 0C, 44 and 4C:

DE4E 58 0C 11/08/12 16:11:02 57719 0 Command-queuing TOV starting (60 or less seconds)
DE4E 58 44 11/08/12 16:11:02 57726 0 Command-queuing TOV starting (60 or less seconds)
DE4E 58 4C 11/08/12 16:11:02 57735 0 Command-queuing TOV starting (60 or less seconds)
DE4E 58 04 11/08/12 16:11:03 57753 0 Command-queuing TOV starting (60 or less seconds)

According to HDS, when associated with front-end MPs (our case), error code DE4E indicates a queue overrun on the storage array ports. SSB D034 most likely means the port is dropping I/O due to an overload condition, which is consistent with the other SSB, DE4E.
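As an aside, a shell sketch like the one below (not part of the original investigation) can extend the grep shown above to quantify the problem from the host messages files; the file name messages.1 and the ssd instance format are taken from the output above and may differ on other hosts:

bash-3.00$ # Count "ASC: 0xc0" entries per day (fields 1-2 are the month and day):
bash-3.00$ grep "ASC: 0xc0" /var/adm/messages.1 | awk '{print $1, $2}' | sort | uniq -c

bash-3.00$ # Count WARNING lines per ssd instance to spot the busiest LUNs
bash-3.00$ # (the instance, e.g. "(ssd303):", is the last field of each WARNING line):
bash-3.00$ grep "WARNING: /scsi_vhci" /var/adm/messages.1 | awk '{print $NF}' | sort | uniq -c | sort -rn | head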
Solution

As explained by QLogic, this usually means that the subsystem needs to add more cache, or possibly change the layout of the array groups so that the box can perform better. The way to address this is by redistributing LUNs onto other ports with less utilization.

MP 04 --> CHA-1E ports 1C and 5C
MP 0C --> CHA-1F ports 1G and 5G
MP 44 --> CHA-2Q ports 2C and 6C
MP 4C --> CHA-2R ports 2G and 6G

The hosts connected to these ports are:

1C, 1G, 2C, 2G --> server0, server1 and server2
5C, 6C --> server3, server4
5G, 6G --> nothing connected

...so it is most likely that this is caused by hosts server0, server1 and server2.

You can also try to tune this on the host side, in order not to overload these ports (a hedged queue depth example follows the procedure below):

StorEdge[TM] 99X0: Heterogeneous Hosts, Queue Depth, and Other I/O Related Parameters (Doc ID 1018905.1)
Sun StorEdge[TM] 9900 Series: Setting Queue Depth (Doc ID 1003338.1)

On the other hand, we do not see a relation to the I/O TOV. The I/O TOV is a value associated with the cross-subsystem paths to the external storage. From the "Hitachi Universal Storage Platform V/VM Hitachi Universal Volume Manager User's Guide", page 4-14:

I/O TOV: Value specified as the time over of the I/O to the external volume.

How to change that I/O TOV value is described on page 5-42, "Changing the Port Setting of the External Storage System":

---------------------------------------
You can change the setting of the port of the external storage system in the Path Operation window. For changing the setting of the port, use the Change WWN Parameter dialog box.

To change the port setting of the external storage system:

1. Start Storage Navigator, and open the Path Operation window.
2. Make sure that Storage Navigator is in Modify mode.
3. Select Fibre - External Subsystem from the drop-down list above the Path Operation tree.
4. Click the product name in the Path Operation tree.
5. Right-click the WWN whose setting you want to change in the Path Operation list.
6. Click Change WWN Parameter in the pop-up menu. The Change WWN Parameter dialog box opens.
7. Change the set parameter of the selected port on the Change WWN Parameter dialog box.
8. Click OK to close the Change WWN Parameter dialog box and return to the Path Operation window. The selected items appear in blue italics.
9. Verify the settings in the Preview dialog box.
10. Click Apply in the Path Operation window. The settings are applied to the local storage system and the Path Operation window appears normally. When an error occurs, an error message is displayed.

The Change WWN Parameter dialog box consists of:

* QDepth (2-128): The number of Read/Write commands which can be issued (queued) to the external volume at a time. The value that can be set ranges from 2 to 128. The default value is 8.
* I/O TOV (5-240): Value specified as the time over of the I/O to the external volume. The value that can be set ranges from 5 to 240 (seconds). The default value is 15.
* Path Blockade Watch (5-180): The time from when the connection of all the paths to the external volume has gone down to when the external volume is blocked. The commands from the host are accepted until the time set for this parameter has passed. After the time set for this parameter has passed, the path status becomes Blockade. The value that can be set ranges from 5 to 180 (seconds). The default value is 10.
-------------------------------------
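As mentioned above, the host-side queue depth can be limited so that these array ports are not overloaded. The correct value depends on the configuration and is covered in Doc ID 1003338.1; the /etc/system entry below is only a hedged sketch, and the value 8 is an illustrative assumption, not a recommendation for this configuration:

* /etc/system: limit the number of commands queued per LUN by the ssd driver
* (use sd_max_throttle instead if the LUNs are attached via the sd driver).
set ssd:ssd_max_throttle=8
* A reboot is required for /etc/system changes to take effect.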
The conclusion from the last analysis is that we do not see any fault; everything is working as configured, so we would recommend contacting Oracle Advanced Services to address this performance/bottleneck problem.

References

<NOTE:1003338.1> - Sun StorEdge[TM] 9900 Series: Setting Queue Depth
<NOTE:1004918.1> - Sun StorEdge[TM] SAN Software 4.4: Logical block MPxIO load balancing method
<NOTE:1010664.1> - Starcat and Sun StorageTek[TM] 9900 (SE9900) Performance Considerations
<NOTE:1018905.1> - StorEdge[TM] 99X0: Heterogeneous Hosts, Queue Depth, and Other I/O Related Parameters
<NOTE:1019786.1> - Sun Storage 9990V System with external Storage
<NOTE:1352893.1> - SE9990 - Parity Groups Status As 'External Device Error' And 'Warning' - External LDEVs As Blockade

Attachments

This solution has no attachment