Document fins/I0634-1


FIN #: I0634-1

SYNOPSIS: StorEdge A3x00 Array controller failover

DATE: Nov/14/00

KEYWORDS: StorEdge A3x00 Array controller failover


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS: StorEdge A3x00 Array controller failover may cause temporary
          delay in processing of pending disk I/O.
              

TOP FIN/FCO REPORT: No 
 
PRODUCT_REFERENCE:  StorEdge A3x00 Array controller failover  
 
PRODUCT CATEGORY:   Storage / A3000 ; Storage / A3500 

PRODUCTS AFFECTED:  
  
Mkt_ID   Platform   Model   Description   Serial Number
------   --------   -----   -----------   -------------

Systems Affected
----------------

  -      ANYSYS       -        System Platform Independent    -

X-Options Affected
------------------

SG-XARY351A-180G     -   -   A3500 1 CONT MOD/5 TRAYS/18GB        -
SG-XARY353A-1008G    -   -   A3500 2 CONT/7 TRAYS/18GB            -
SG-XARY353A-360G     -   -   A3500 2 CONT/7 TRAYS/18GB            -
SG-XARY355A-2160G    -   -   A3500 3 CONT/15 TRAYS/18GB           -
SG-XARY360A-545G     -   -   545-GB A3500 (1X5X9-GB)              -
SG-XARY360A-90G      -   -   A3500 1 CONT/5 TRAYS/9GB(10K)        -
SG-XARY362A-180G     -   -   A3500 2 CONT/7 TRAYS/9GB(10K)        -
SG-XARY362A-763G     -   -   A3500 2 CONT/7 TRAYS/9GB(10K)        -
SG-XARY364A-1635G    -   -   A3500 3 CONT/15 TRAYS/9GB(10K)       -
SG-XARY366A-72G      -   -   A3500 1 CONT/2 TRAYS/9GB(10K)        -
SG-XARY380A-1092G    -   -   1092-GB A3500 (1x5x18-GB)            -
SG-XARY360B-90G      -   -   ASSY,TOP OPT,1X5X9,MIN,9GB,10K       -
SG-XARY360B-545G     -   -   ASSY,TOP OPT,1X5X9,MAX,9GB,10K       -
SG-XARY362B-180G     -   -   X-OPT,2X7X9,MIN,FCAL,9G10K           -
SG-XARY374B-273G     -   -   ASSY,TOP OPT,3X15X9,MIN,9GB,10K      -
SG-XARY380B-182G     -   -   X-OPT,FC-SN,1X5X18MIN,18GB10K        -
SG-XARY380B-1092G    -   -   ASSY,FC-SNL,1X5X18MAX,18G10K         -
SG-XARY382B-364G     -   -   ASSY,FC-SN,2X7X18,MIN,18GB,10K       -
SG-XARY384B-546G     -   -   ASSY,FC,3X15X18,MIN,18GB             -
SG-XARY381B-364G     -   -   ASSY,FC-SN,1X5X36MIN,36G10K          -
SG-XARY381B-1456G    -   -   ASSY,FC-SN,1X5X36MAX,36B10K          -
SG-XARY383B-728G     -   -   ASSY,FC-SN,2X7X36MIN,36B10K          -
SG-XARY385B-1092G    -   -   ASSY,FC-SN,3X15X36MIN,36B10K         -
UG-A3500-FC-545G     -   -   ASSY,TOP OPT,1X5X9,MAX,9GB,10K       -
CU-A3500-FC-545G     -   -   ASSY,TOP OPT,1X5X9,MAX,9GB,10K       -
UG-A3500FC-182-10K   -   -   FCTY,A3500FC/SCSI,1X5X18MIN,18/10K   -
CU-A3500FC-182-10K   -   -   FCTY,A3500FC/SCSI,1X5X18MIN,18/10K   -
UG-A3500FC-364-10K   -   -   FCTY,A3500FC/SCSI,2X7X18MIN,18/10K   -
CU-A3500FC-364-10K   -   -   FCTY,A3500FC/SCSI,2X7X18MIN,18/10K   -
UG-A3500FC-546-10K   -   -   FCTY,A3500FC/SCSI,3X15X18MIN 18G10K  -
CU-A3500FC-546-10K   -   -   FCTY,A3500FC/SCSI,3X15X18MIN 18G10K  -
UG-A3K-A3500FC       -   -   ASSY,UPGRADE,A3500FC/TABASCO         -
UG-A3500-A3500FC     -   -   ASSY,UPGRADE,A3500FC/DILBERT         -
X6538A               -   -   X-OPT,A3500FC CONTROLLER             -
6538A                -   -   FCTY, CONTROLLER, A3500FC            -


PART NUMBERS AFFECTED: 

Part Number   Description                             Model
-----------   -----------                             -----

825-3869-02   MNL Set, SUN RSM ARRAY 2000               -
798-0188-01   SS, CD ASSY, RAID Manager 6.1             -
798-0522-01   RAID Manager 6.1.1                        -
798-0522-02   RAID Manager6.1.1 Update 1                -
798-0522-03   RAID Manager6.1.1 Update 2                -
704-6708-10   CD, SUN STOREDGE RAID Manager6.22         -


REFERENCES:
 
ESC:  526134

      
PROBLEM DESCRIPTION: 

A3x00/A3500FC controllers fail over whenever the driver or the RM 6 host
software detects a failure in one of the dual controllers. At this point
the redundant system has a single point of failure, i.e. the remaining
controller. Customers can be quite anxious about every message and delay at
this stage.  The purpose of this FIN is to help the field set the right
expectations for the customer.

A sample failover message stream from an A3500FC is:

Sep 13 13:35:31 autolycus unix: sf2:    Open failure to target 0x4 forcing LIP
Sep 13 13:35:31 autolycus unix: ID[SUNWssa.socal.link.5010] socal1: port 0: 
Fibre Channel is OFFLINE
Sep 13 13:35:31 autolycus unix: ID[SUNWssa.socal.link.6010] socal1: port 0: 
Fibre Channel Loop is ONLINE
Sep 13 13:35:31 autolycus unix: sf2:    target 0x4 al_pa 0xe1 offlined
Sep 13 13:35:31 autolycus last message repeated 10 times
Sep 13 13:35:31 autolycus unix: WARNING:
/sbus@2,0/SUNW,socal@1,0/sf@0,0/ssd@w20 
0100a0b8071c11,1 (ssd20):
Sep 13 13:35:31 autolycus unix:         ssdrestart transport failed (fffffffe)
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 109897680
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1003] The errored I/O is 
being routed to the Resolution daemon
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 146427792
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1004] A resolution is 
already in progress for this device - the I/O will be queued for retry after
the 
reso
Sep 13 13:35:31 autolycus unix: lution completes.
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 198713616
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1004] A resolution is 
already in progress for this device - the I/O will be queued for retry after
the 
reso
Sep 13 13:35:31 autolycus unix: lution completes.
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 126296480
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1004] A resolution is 
already in progress for this device - the I/O will be queued for retry after
the 
reso
Sep 13 13:35:31 autolycus unix: lution completes.
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 146425232
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1004] A resolution is 
already in progress for this device - the I/O will be queued for retry after
the 
reso
Sep 13 13:35:31 autolycus unix: lution completes.
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 158884432
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1004] A resolution is 
already in progress for this device - the I/O will be queued for retry after
the 
reso
Sep 13 13:35:31 autolycus unix: lution completes.
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 194880272
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1004] A resolution is 
already in progress for this device - the I/O will be queued for retry after
the 
reso
Sep 13 13:35:31 autolycus unix: lution completes.
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 194879712
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1004] A resolution is 
already in progress for this device - the I/O will be queued for retry after
the 
reso
Sep 13 13:35:31 autolycus unix: lution completes.
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 126296416
Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.1004] A resolution is 
already in progress for this device - the I/O will be queued for retry after
the 
reso
Sep 13 13:35:31 autolycus unix: lution completes.
Sep 13 13:35:32 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:32 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 183697552
Sep 13 13:35:32 autolycus unix: ID[RAIDarray.rdriver.1004] A resolution is 
already in progress for this device - the I/O will be queued for retry after
the 
reso
Sep 13 13:35:32 autolycus unix: lution completes.
Sep 13 13:35:32 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:33 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 52070480
Sep 13 13:35:33 autolycus unix: ID[RAIDarray.rdriver.1004] A resolution is 
already in progress for this device - the I/O will be queued for retry after
the 
reso
Sep 13 13:35:33 autolycus unix: lution completes.
Sep 13 13:35:33 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:33 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 109897952
Sep 13 13:35:33 autolycus unix: ID[RAIDarray.rdriver.1004] A resolution is 
already in progress for this device - the I/O will be queued for retry after
the 
reso
Sep 13 13:35:33 autolycus unix: lution completes.
Sep 13 13:35:33 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:33 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 198715424
Sep 13 13:35:33 autolycus unix: ID[RAIDarray.rdriver.1004] A resolution is 
already in progress for this device - the I/O will be queued for retry after
the 
reso
Sep 13 13:35:33 autolycus unix: lution completes.
Sep 13 13:35:33 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with 
errno 5, returned to the Array driver on module_x, LUN 1
Sep 13 13:35:33 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is a

write at sector: 126296224
Sep 13 13:35:33 autolycus unix: ID[RAIDarray.rdriver.1004] A resolution is 
already in progress for this device - the I/O will be queued for retry after
the 
reso
Sep 13 13:35:33 autolycus unix: lution completes.
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdaemon.3003] The RDAC Resolution 
Daemon has failed a controller on module_x
Sep 13 13:35:48 autolycus last message repeated 1 time
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.1005] The Array Resolution

Daemon is resuming I/Os on module_x, LUN 1
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 109897680
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 198713616
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 194880272
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 146427792
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 126296480
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 52070480
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 146425232
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 183697552
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 158884432
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 194879712
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 109897952
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 126296416
Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 126296224
Sep 13 13:35:49 autolycus unix: ID[RAIDarray.rdriver.6001] The Array 
driver/daemon has recovered an Errored I/O on module_x, Lun 1, sector 198715424

The system log in /var/adm/messages can show I/O completing from 1 to 20
minutes after the failure.  The customer may be concerned that failover is
taking that long when the A3x00 product specification calls for failover to
occur within 120 seconds. This can happen in any A3x00/A3500FC configuration
when an array controller fails over.

Upper level applications like Oracle will not tolerate I/O taking longer than
28 minutes to complete, in which case the application will report fatal errors.
The customer must understand that failover will not cause this to happen.
 
The cause of the concern is the time between the message which says an I/O
has "failed and been routed to the resolution daemon" and the time that the
particular I/O has "been recovered".  An individual I/O is identified by the
module, LUN and sector address.  For example, in the above message file the
last message shows an I/O to module_x, Lun 1, sector 198715424 which completes
at Sep 13 13:35:49. Earlier in the log we see that the same I/O was attempted
and reported with soft I/O error code 5, and was then retried:

  Sep 13 13:35:33 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with
  errno 5, returned to the Array driver on module_x, LUN 1
  Sep 13 13:35:33 autolycus unix: ID[RAIDarray.rdriver.1002] The errored I/O is
  a write at sector: 198715424

Note this is 16 seconds prior to its completion above.
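
The pairing of "errored" and "recovered" messages described above can be
automated. The following is a minimal Python sketch, not a supported tool. It
assumes the wrapped log entries have first been rejoined so that each message
sits on one line, and that the message formats are exactly those of the sample
log. The function names and the placeholder year (syslog timestamps carry no
year) are made up for illustration.

```python
import re
from datetime import datetime

STAMP = r"(\w{3} +\d+ \d\d:\d\d:\d\d)"
# The 3001 message names the module and LUN; the 1002 message that follows
# names the sector. The 6001 message names all three.
ERR_LUN = re.compile(r"rdriver\.3001\] .*Array driver on ([^,\s]+), LUN (\d+)")
ERR_SECTOR = re.compile(
    STAMP + r" .*rdriver\.1002\] The errored I/O is a \w+ at sector: (\d+)")
RECOVERED = re.compile(
    STAMP + r" .*rdriver\.6001\] .*recovered an Errored I/O"
    r" on ([^,\s]+), Lun (\d+), sector (\d+)")


def parse_stamp(text, year=2000):
    """Parse a syslog timestamp; the year is an assumed placeholder."""
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")


def recovery_delays(lines):
    """Map (module, LUN, sector) to seconds between 'errored' and 'recovered'."""
    last_lun = None   # (module, lun) taken from the most recent 3001 message
    errored_at = {}   # key -> time of the first 1002 message for that I/O
    delays = {}
    for line in lines:
        m = ERR_LUN.search(line)
        if m:
            last_lun = (m.group(1), int(m.group(2)))
            continue
        m = ERR_SECTOR.search(line)
        if m and last_lun:
            key = last_lun + (int(m.group(2)),)
            errored_at.setdefault(key, parse_stamp(m.group(1)))
            continue
        m = RECOVERED.search(line)
        if m:
            key = (m.group(2), int(m.group(3)), int(m.group(4)))
            if key in errored_at:
                delta = parse_stamp(m.group(1)) - errored_at[key]
                delays[key] = delta.total_seconds()
    return delays
```

Given the two messages quoted above plus the final "recovered" message from
the sample log, each rejoined onto one line, recovery_delays() returns a
single entry for ("module_x", 1, 198715424) with a value of 16.0 seconds.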

The actual time for the controller failover is measured from the first failing
I/O being routed to the resolution daemon to the failover message:

  Sep 13 13:35:31 autolycus unix: ID[RAIDarray.rdriver.3001] Errored I/O, with
  errno 5, returned to the Array driver on module_x, LUN 1

  Sep 13 13:35:48 autolycus unix: ID[RAIDarray.rdaemon.3003] The RDAC
  Resolution Daemon has failed a controller on module_x

which is 17 seconds in this example.
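
The failover interval itself can be measured the same way. This is a rough
Python sketch under the same assumptions: one message per line, message IDs
as in the sample log, and a placeholder year. The function names are
hypothetical, and the sketch looks only at the first failover in the input;
it does not separate events by array module.

```python
import re
from datetime import datetime

STAMP = re.compile(r"(\w{3} +\d+ \d\d:\d\d:\d\d) ")
FIRST_ERROR = "RAIDarray.rdriver.3001]"  # errored I/O returned to Array driver
FAILOVER = "RAIDarray.rdaemon.3003]"     # Resolution Daemon failed a controller


def parse_stamp(line, year=2000):
    """Return the syslog timestamp of a line; the year is assumed."""
    m = STAMP.search(line)
    if m is None:
        raise ValueError(f"no timestamp in line: {line!r}")
    return datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")


def failover_seconds(lines):
    """Seconds from the first 3001 message to the first 3003 message after it."""
    started = None
    for line in lines:
        if started is None and FIRST_ERROR in line:
            started = parse_stamp(line)
        elif started is not None and FAILOVER in line:
            return (parse_stamp(line) - started).total_seconds()
    return None  # no complete failover sequence found
```

For the two messages quoted above (13:35:31 and 13:35:48) the function
returns 17.0, matching the figure worked out by hand.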


Now you can see the difference between the failover time and the length of
time it takes for I/O to be unqueued from the sd or ssd drivers. This shows
that the controller failover is occurring within the prescribed window of 120
seconds, often in 20 seconds or so, but the unqueueing of I/O under heavy load
can take a long time, which is normal in a failed-over scenario. This happens
as the sd or ssd drivers retry the queued I/O operations.

Remember that RDAC does not start failover investigation and activation until
an I/O error occurs.  Errors about fibre-channel (FC) offline and other
anomalies do not initiate failover processing by themselves.  This means a
controller cable could be pulled, but a failover might not happen until the
next day, when I/O is started to a LUN owned by that controller.  Only errors
which appear to indicate a controller failure initiate failover processing.

Errors reported by RDAC are retried by RDAC, unless Rdac_RetryCount in rmparams
is explicitly set to 1.  Other drivers in the stack will also retry errors,
especially the target drivers sd and ssd.  Other drivers in the driver stack
may also generate and/or detect errors, including target drivers like sd and
ssd, HBA drivers like isp, glm and qlc, transport layer drivers like sf and
fcp, and layered drivers like VxVM and DMP.  Only errors which are actually
returned to the user application will affect its status. In the case of an
RDAC controller failover, the errno 5 in the example is not returned to the
application layer above RDAC.  Also remember that throughout this process,
ordering of I/Os is strictly maintained by queues in the driver stack.
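
Because the RDAC retry behavior described above depends on Rdac_RetryCount, it
can be useful to check how that parameter is set. The sketch below is an
illustration only. It ASSUMES that rmparams holds one "name=value" setting per
line and that lines starting with "#" are comments; verify this against an
actual rmparams file before relying on it. The function names are hypothetical
and the file contents are passed in as text, so no file location is assumed.

```python
def read_rdac_retry_count(rmparams_text):
    """Return Rdac_RetryCount as an int, or None if it is not set.

    Assumes 'name=value' lines and '#' comment lines (unverified format).
    """
    for raw in rmparams_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == "Rdac_RetryCount":
            return int(value.strip())
    return None


def rdac_retries_disabled(rmparams_text):
    """True only when Rdac_RetryCount is explicitly set to 1."""
    return read_rdac_retry_count(rmparams_text) == 1
```

A result of None means the parameter is not set explicitly in the text that
was supplied; per the paragraph above, RDAC then retries the errors it
reports.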


IMPLEMENTATION:  
 
         ---
        |   |   MANDATORY (Fully Pro-Active)
         ---    
         
  
         ---
        |   |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---
         

CORRECTIVE ACTION: 

An Authorized Enterprise Field Service Representative may avoid the
above mentioned problems by following the recommendations as shown
below.

When supporting customers who have A3x00 StorEdge Arrays with dual
controllers, please advise the customers and set expectations using the
following guidelines:

1) In the event of read/write errors on one of the controllers,
   failover to the alternate controller will occur when the RAID Manager
   RDAC driver detects the problem.  This failover will always occur in
   less than 2 minutes.

2) After the failover, it may take queued I/Os up to 20 minutes to
   complete on an array with a high level of disk activity.  This is
   normal and does not mean that there is any data loss.
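
The two guidelines above can be expressed as a simple check on measured
times. This is a hedged Python sketch: the thresholds come from the
guidelines, while the function name, argument names and wording of the
findings are invented for illustration.

```python
FAILOVER_LIMIT_S = 120          # guideline 1: failover in less than 2 minutes
DRAIN_NORMAL_LIMIT_S = 20 * 60  # guideline 2: queued I/O may take up to 20 min


def assess(failover_s, longest_io_delay_s):
    """Return a list of findings for one failover event."""
    findings = []
    if failover_s < FAILOVER_LIMIT_S:
        findings.append("failover within expected window")
    else:
        findings.append("failover slower than expected: investigate")
    if longest_io_delay_s <= DRAIN_NORMAL_LIMIT_S:
        findings.append("queued I/O completion time is normal")
    else:
        findings.append("queued I/O slower than expected: investigate")
    return findings
```

With the figures from the sample log (17 seconds for failover, 16 seconds for
the last recovered I/O), both findings report expected behavior.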
       

COMMENTS:  

------------------------------------------------------------------------------ 


Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist.  Edist can be 
  accessed internally at the following URL: http://edist.corp/.
  
* From there, follow the hyperlink path of "Enterprise Services
  Documentation" and click on "FIN & FCO attachments", then choose the
  appropriate folder, FIN or FCO.  This will display supporting
  directories/files for FINs or FCOs.
   
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
---------------------------------------------------------------------------
                                                        



Copyright (c) 1997-2003 Sun Microsystems, Inc.