Document Audience:	INTERNAL
Document ID:	I0845-1
Title:	RAID Manager 6 may hang for 3-8 minutes when an IBM drive is in the failed state in a StorEdge A1000/A3x00/A3500FC Array
Copyright Notice:	Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved
Update Date:	2002-06-25

---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)

FIN #: I0845-1

Synopsis: RAID Manager 6 may hang for 3-8 minutes when an IBM drive is in the failed state in a StorEdge A1000/A3x00/A3500FC Array

Create Date: Jun/25/02

Keywords:

RAID Manager 6 may hang for 3-8 minutes when an IBM drive is in the failed state in a StorEdge A1000/A3x00/A3500FC Array

SunAlert: No

Top FIN/FCO Report: No

Products Reference: RAID Manager 6

Product Category: Storage / Service

Product Affected:

Systems Affected:
-----------------
Mkt_ID   Platform     Model   Description                  Serial Number
------   --------     -----   -----------                  -------------
  -       ANYSYS       -      System Platform Independent        -
  

X-Options Affected:
-------------------
Mkt_ID          Platform   Model   Description                Serial Number
------          --------   -----   -----------                -------------
  -              A1000       -     A1000 Storage Array              -
  -              A3000       -     A3000 Storage Array              -
  -              A3500       -     A3500 Storage Array              -
  -              A3500FC     -     A3500FC Storage Array            -
6530A              -         -     Sun RSM Array 63GB 15X4GB        -
6531A              -         -     Sun RSM Array 147GB 7X4GB        -
6532A              -         -     A3000 15*4.2GB/7200 FWSCSI       -
6533A              -         -     RSM2000 35*4.2GB/7200 FWSCSI     -
6534A              -         -     A3000 15*9.1GB/7200 FWSCSI       -
6535A              -         -     A3000 35*9.1GB/7200 FWSCSI       -
SG-XARY1*          -         -     STOREDGE A1000/RACK              -
SG-XARY3*          -         -     STOREDGE A3500/RACK              -
UG/CU-A3500FC*     -         -     ASSY,TOP OPT,1X5X9,MAX,9GB,10K   -
UG-A3K-A3500FC     -         -     ASSY,UPGRADE,A3500FC/TABASCO     -
UG-A3500-A3500FC   -         -     ASSY,UPGRADE,A3500FC/DILBERT     -
X6538A             -         -     X-OPT,A3500FC CONTROLLER         -
6538A              -         -     FCTY, CONTROLLER, A3500FC        -
X2611A             -         -     OPT INT I/O BD FOR EXX00         -
X2612A             -         -     OPT INT I/O BD EXX00 W/FC-AL     -
X2622A             -         -     OPT INT GRAPHICS I/O BD EXX00    -

Parts Affected:

Part Number            Description                           Model
-----------            -----------                           -----
798-0522-03 or lower   RAID MGR 6.1.1 & Update 1/2             -
704-6708-10            CD SUN STOREDGE RAID Mgr 6.22           -
704-7937-05            CD Sun StorEdge RAID Mgr 6.22.1         -

References:

BugId:  4656976 - manually failing IBM drive results in interrupt 
                  flood/failover: read capacity-NR. 

ESC:    536884 - Hotspare failover did not work. bug 4656976.
        535938 - Repeated controller failures.
        
FIN:    I0724-2

Issue Description:

Sun StorEdge A1000/A3x00/A3500FC Arrays with RAID Manager 6 (RM6) may 
hang for 3 to 8 minutes when they contain one or more failed IBM disk 
drives.  Array I/O may cease for several minutes and the array will 
appear unresponsive to the RM6 GUI.  There is no data loss, but customers
or field personnel may think the array has hung indefinitely.

In addition, delays of over 5 minutes may cause some upper-level
software to report errors for non-responsiveness.  These I/O timeouts
could cause database or volume management software to treat the LUN or
array as dead.  RM6 commands may also experience long delays due to this
problem.  Load on the system or array, and any LUN reconstructions
occurring on the array, will further exacerbate these delays.
  
This issue applies to any system type and any A1000/A3x00/A3500FC array
using RM6.  The array will have one or more IBM drives, model
DDYST18350 or DDYST36950, or the older IBM DGHS-18Y or DDRS-39130,
which are in a failed state.  This failed state could have resulted
from the drive failing or from the user manually failing the drive with
'drivutil -f' or the RM6 GUI.

Although two of the drives, IBM Discovery models DDYST18350 and
DDYST36950, have been discontinued from use in the A1000/A3x00/A3500FC
products, they have a drive failure susceptibility which makes this
problem more likely to occur.  See FIN I0724-2.

To determine if these IBM drives are installed in an array:

   # drivutil -i    (shows the drive vendor, e.g. IBM)
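
On an array with many drives, the captured command output can be
filtered for the vendor string.  The following is a minimal sketch
only.  It assumes the vendor name appears as a whitespace-separated
token on each drive line; the function name and the sample text are
hypothetical and do not reproduce real drivutil output.

```python
def ibm_drive_lines(drivutil_output):
    """Return the lines of captured text that contain the vendor token IBM.

    Assumption: the vendor is printed as its own whitespace-separated
    token. Adjust the match if the real output is laid out differently.
    """
    matches = []
    for line in drivutil_output.splitlines():
        if "IBM" in line.split():
            matches.append(line.strip())
    return matches


# Hypothetical sample text for illustration, NOT real drivutil output.
sample = """\
[1,0]  Optimal  IBM      EXAMPLE-A
[2,0]  Optimal  SEAGATE  EXAMPLE-B
[1,1]  Failed   IBM      EXAMPLE-C
"""

for line in ibm_drive_lines(sample):
    print(line)
```

Matching on a whole token rather than a substring avoids false hits on
strings that merely contain the letters IBM.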

This problem will be seen more frequently when issuing the CLI
healthck(1M) command or using the RM6 GUI health check.  These RM6
commands issue ReadCapacity using the SCSI pass-through facility of
sd/ssd, which leads to the hang.

This issue occurs because the IBM drives do not respond the same way as
other A3x00/A1000 drives to ReadCapacity commands when the drive is not
spinning.  Other drive types cache the drive capacity figure and return
it even when spun down.  The IBM drives do not respond.

The A3x00/A1000 array firmware retries the ReadCapacity 3 times before
returning NOT_READY (ASC/ASCQ of 2402).  The sd/ssd target drivers
retry 24 times before returning NOT_READY to the application.  The
driver has a 5 second delay between retries, so 24 * 5 plus the
original command of 5 seconds means a delay of at least 125 seconds.
Investigation has shown actual delays of up to 190 seconds, or about 3
minutes.  If two drives are failed then the delay is double that, or up
to 380 seconds.
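
The delay arithmetic above can be checked with a short sketch.  The
function and parameter names are illustrative only and are not part of
RM6 or the sd/ssd drivers.  The figures (24 retries, 5 second retry
delay, 5 second original command, 190 seconds observed) are the ones
quoted in this FIN, and the linear scaling with the number of failed
drives is an assumption based on the two-drive case described above.

```python
def min_delay_seconds(retries=24, retry_interval=5, initial_command=5):
    """Lower bound on the delay for one failed drive, in seconds."""
    return retries * retry_interval + initial_command


def observed_delay_seconds(failed_drives, observed_per_drive=190):
    """Scale the observed per-drive delay by the number of failed drives.

    Assumption: delays add linearly, as this FIN states for two drives.
    """
    return failed_drives * observed_per_drive


print(min_delay_seconds())         # 125
print(observed_delay_seconds(1))   # 190
print(observed_delay_seconds(2))   # 380
```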

The IBM Discovery 1 drive has already been discontinued.  All X-option
and FRU inventories have been purged of these drives.  The only drives
which are still in the field would be those delivered before September
2001 that are in customer systems.  

A workaround has been provided to avoid this issue.  See the Corrective 
Action below.

Implementation:

         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        |   |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---

Corrective Action:

The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned problem.

Field personnel should manually fail IBM drives in these arrays only 
when there is no existing failed drive in the same array.  Correct any 
existing failed drive condition before failing another drive. 
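
The rule above can be expressed as a simple pre-check.  This is a
sketch under stated assumptions: the drive locations and the status
strings ("Optimal", "Failed") are hypothetical inputs that would have
to be collected by hand or from RM6 output, and neither function is
part of RM6.

```python
def drives_to_correct_first(status_by_location):
    """Return the locations of drives that are already in a failed state."""
    return sorted(
        location
        for location, status in status_by_location.items()
        if status.strip().lower() == "failed"
    )


def safe_to_manually_fail(status_by_location):
    """True only when no drive in the same array is already failed."""
    return not drives_to_correct_first(status_by_location)


# Hypothetical array state for illustration only.
array_state = {"[1,0]": "Optimal", "[1,1]": "Failed", "[2,0]": "Optimal"}

print(drives_to_correct_first(array_state))
print(safe_to_manually_fail(array_state))
```

With the sample state above, the drive at [1,1] would have to be
corrected before any other drive in that array is manually failed.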

To check for failed drives in an array, use one of the following:

   1) From the CLI, use the commands 'drivutil -i' 
      or 'healthck -a'.
      
   2) From the RM6 GUI, run Health Check in the Recovery Guru.
      
   3) Visually check the drive LED on the drive tray.  Each slot for
      a hard drive has a drive LED above the disk drive.  If there is 
      a failed drive, the LED will turn amber.

Comments:

None

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------

Status

active