Document fins/I0551-1


FIN #: I0551-1

SYNOPSIS: Boot process and controller on-line process may take hours in systems
          with large StorEdge A3000, A3500 or A1000 configs

DATE: Oct/31/02

KEYWORDS: Boot process and controller on-line process may take hours in systems
          with large StorEdge A3000, A3500 or A1000 configs


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS: Boot process and controller on-line process may take hours  
          in systems with large StorEdge A3000, A3500 or A1000 configs.
                          
              
TOP FIN/FCO REPORT: No 
 
PRODUCT_REFERENCE:  Large StorEdge A1000, A3000 or A3500 Configurations 
 
PRODUCT CATEGORY:   Storage / Sw Admin; 

PRODUCTS AFFECTED:  

Mkt_ID   Platform   Model   Description         Serial Number
------   --------   -----   -----------         -------------
Systems Affected
----------------

  -      ANYSYS       -     System Platform Independent  -
   
X-Options Affected
------------------
  -       A1000      ALL    StorEdge A1000               - 
  -       A3000      ALL    StorEdge A3000               - 
  -       A3500      ALL    StorEdge A3500               -

PART NUMBERS AFFECTED: 

Part Number   Description                             Model
-----------   -----------                             -----
798-1036-01   CD Assy RAID MGR 6.1.1                    -
704-6708-10   CD SUN STOREDGE RAID MGR 6.22             -
704-7937-05   CD SUN STOREDGE RAID Manager6.22.1        -   


REFERENCES:

BugId: 4238051 - sd: (sparc) device probe extremely slow when multiple 
                 LUNs per target.
       4630273 - installing rm6 on Solaris 9 leads to a long pause at 
                 system boot.

ESC:   523848 - E10k/A3500 cluster - SCSI-maintenance - system 
                performance/hang afterwards.

       539493 - bug 4238051 /sd.conf entries cause excessive boot times 
                w/A3500 and EMC

      
PROBLEM DESCRIPTION:

By default, an installed A3x00/A3500FC supports eight LUNs on each
target before the next target is used.  The default targets of the
Raid Controllers are 4 and 5.

As part of the Raid Manager 6 installation, the /kernel/drv/sd.conf
file is modified to include a RAID Manager section covering
Targets 0-15 and LUNs 1-7, as shown below:

	# BEGIN RAID Manager additional LUN entries
	# DO NOT EDIT from BEGIN above to END below...
	name="sd" class="scsi"
	target=0 lun=1;

	name="sd" class="scsi"
	target=0 lun=2;

	name="sd" class="scsi"
	target=0 lun=3;

	name="sd" class="scsi"
	target=0 lun=4;

	name="sd" class="scsi"
	target=0 lun=5;

	name="sd" class="scsi"
	target=0 lun=6;

	name="sd" class="scsi"
	target=0 lun=7;
	.
	.
	.
	(Middle portion omitted for readability)
	.
	.
	.
	name="sd" class="scsi"
	target=15 lun=1;
 
	name="sd" class="scsi"
	target=15 lun=2;
 
	name="sd" class="scsi"
	target=15 lun=3;
 
	name="sd" class="scsi"
	target=15 lun=4;
 
	name="sd" class="scsi"
	target=15 lun=5;
 
	name="sd" class="scsi"
	target=15 lun=6;
 
	name="sd" class="scsi"
	target=15 lun=7;
	# END RAID Manager additional lun entries

There are three problems with this:

1. During the boot process of the node, the sd driver will time out
   for every non-existent LUN.  If multiple A3x00s or A1000s are
   attached, a reboot cycle takes at least an hour to complete when
   the first 16 LUNs are supported.  If 16-32 LUNs have to be supported
   on the A3x00 or the A1000, the reboot cycle takes even longer.
  
2. When a StorEdge A3500 or A3000 controller is brought back online,
   a drvconfig is run.  If the sd.conf file contains extra (unused)
   targets and LUNs, the rescan of the device tree for those targets
   and LUNs may take even longer to complete than a reboot on an
   Enterprise 10000 system.

3. On Solaris 9, probing all the device nodes that RM6 places under
   rdnexus is slowed by a kernel change in the power management code,
   as described in bug 4630273.  The new code spends about 1 second
   scanning every third rdriver instance in the Solaris device tree.
   The duration of the slowdown can be estimated with some simple math:

     X = total number of rdnexus entries in /kernel/drv/rdnexus.conf
         example of an rdnexus entry:
         name="rdnexus" parent="pseudo" instance=0;
     V = total number of rdriver generic module entries in
         /kernel/drv/rdriver.conf
         example of an rdriver generic module entry:
         name="rdriver" parent="rdnexus" target=4 lun=0;
     N = total number of rdriver instances in the Solaris device tree
     T = slowdown duration in seconds
     N = X*V
     T = N/3

   The V value can be increased by running the add16lun.sh or
   add32lun.sh scripts, changing sd.conf, running genscsiconf(1m),
   changing the target ID on the Rdac controller, or attaching other
   types of arrays.  The rdriver generic module entries are always
   symmetric, i.e. each target has the same number of LUNs, so V can
   be calculated by multiplying the number of LUNs supported by the
   number of targets listed in rdriver.conf.  For example, for an
   A1000 with 8 LUNs supported, X=64 and, if the default target ID and
   rmparam are used, V=8*3 for targets 0, 4 and 5 listed in
   rdriver.conf, so T=64*8*3/3=512 seconds.
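
   As a rough check of the arithmetic above, the estimate can be
   written as a short Python sketch.  The function name and parameter
   names are illustrative only and are not part of RM6 or Solaris; the
   cost of about 1 second per third rdriver instance is the approximate
   figure quoted above, so the result is an estimate, not a measurement.

```python
def estimate_slowdown_seconds(rdnexus_entries, luns_per_target, targets):
    """Estimate the extra Solaris 9 boot delay T, in seconds.

    rdnexus_entries  -- X, entries in /kernel/drv/rdnexus.conf
    luns_per_target  -- LUNs supported per target
    targets          -- number of targets listed in rdriver.conf
    """
    v = luns_per_target * targets   # V = LUNs supported * targets listed
    n = rdnexus_entries * v         # N = X * V rdriver instances
    return n / 3                    # T = N / 3 (about 1 s per third instance)


# Example from the text: A1000, 8 LUNs, targets 0, 4 and 5, X = 64.
t = estimate_slowdown_seconds(64, 8, 3)
print("%d seconds (about %.1f minutes)" % (t, t / 60))
```

   The same formula with X=64 and V=96 gives 2048 seconds, or roughly
   34 minutes, which is in line with the Solaris 8 versus Solaris 9
   difference reported in the test results under CORRECTIVE ACTION.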
   
   
IMPLEMENTATION: 
 
         ---
        |   |   MANDATORY (Fully Pro-Active)
         ---    
         
  
         ---
        |   |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---


CORRECTIVE ACTION:

Enterprise Customers and Authorized Enterprise Field Service
Representatives may avoid the above-mentioned boot delays with
A3x00 and A1000 arrays by following the recommendations
shown below:

The step-by-step procedure below speeds up both the reboot and the
process of bringing a Raid Controller back from an "offline" state
to an "online" state.  Its main purpose is to make sure that
drvconfig does not have to probe non-existent LUNs.

Note: A3x00 Raid Controllers are set to Targets 4 and 5 by default.

1. From the command prompt, type the following:
	   
   # /etc/raid/bin/lad

   c1t5d0s0 1T92401270 LUNS: 0 1 2 
   c2t5d0s0 1T92401348 LUNS: 0 1 2 
   c3t4d3s0 1T92600542 LUNS: 3 4 5 
   c4t5d0s0 1T92400129 LUNS: 0 1 2 
   c7t4d3s0 1T92401081 LUNS: 3 4 5 
   c8t4d3s0 1T92401082 LUNS: 3 4 5

   In the /etc/raid/bin/lad output above, every Raid Controller is set
   to target 4 (e.g. c3[t4]d3s0) or target 5 (e.g. c1[t5]d0s0).  These
   are the target entries that need to be retained in the
   /kernel/drv/sd.conf file.  Target numbers may vary between
   different configurations.

2. Using a text editor, edit the /kernel/drv/sd.conf file.

   In the "RAID Manager" section, delete the entries for ALL targets
   except 4 and 5 (or whichever targets lad reported), which the
   A3x00 Raid Controllers use.

3. The new sd.conf for the "RAID Manager" section should look
   like this after the modification:

        # BEGIN RAID Manager additional LUN entries
        # DO NOT EDIT from BEGIN above to END below...
        
        name="sd" class="scsi"
        target=4 lun=1;

        name="sd" class="scsi"
        target=4 lun=2;

        name="sd" class="scsi"
        target=4 lun=3;

        name="sd" class="scsi"
        target=4 lun=4;

        name="sd" class="scsi"
        target=4 lun=5;

        name="sd" class="scsi"
        target=4 lun=6;

        name="sd" class="scsi"
        target=4 lun=7;

        name="sd" class="scsi"
        target=5 lun=1;

        name="sd" class="scsi"
        target=5 lun=2;

        name="sd" class="scsi"
        target=5 lun=3;

        name="sd" class="scsi"
        target=5 lun=4;

        name="sd" class="scsi"
        target=5 lun=5;

        name="sd" class="scsi"
        target=5 lun=6;

        name="sd" class="scsi"
        target=5 lun=7;
        # END RAID Manager additional LUN entries
	
Note: LUN 0 of Targets 4 and 5 is not shown because it is already
      defined near the top of sd.conf.

4. Save the file and reboot.  The boot cycle time should now be
   reduced.  A reconfiguration reboot is not necessary for the
   A3x00(s) or A1000(s).
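
   The edit in steps 2 and 3 can also be sketched as a small Python
   filter.  This is a minimal sketch, not an RM6 tool: the function
   name is hypothetical, and it assumes the section is delimited by the
   "# BEGIN RAID Manager" and "# END RAID Manager" comment lines and
   that entries use the name/target/lun layout shown above.  Lines
   outside the section are passed through unchanged.

```python
import re

BEGIN = "# BEGIN RAID Manager"
END = "# END RAID Manager"
TARGET_RE = re.compile(r"\btarget=(\d+)\s+lun=(\d+)\s*;")


def filter_raid_manager_section(text, keep_targets):
    """Return sd.conf text with RAID Manager entries limited to keep_targets."""
    out = []
    in_section = False
    pending = None  # a name="sd" line waiting for its target/lun line
    for line in text.splitlines():
        stripped = line.strip()
        match = TARGET_RE.search(stripped)
        if stripped.startswith(BEGIN):
            in_section = True
            out.append(line)
        elif stripped.startswith(END):
            in_section = False
            pending = None
            out.append(line)
        elif not in_section:
            out.append(line)            # outside the section: untouched
        elif match:
            if int(match.group(1)) in keep_targets:
                if pending is not None:
                    out.append(pending)
                out.extend([line, ""])  # keep entry, one blank line after it
            pending = None
        elif stripped.startswith("name="):
            pending = line              # first half of a two-line entry
        elif stripped:
            out.append(line)            # comments and anything unrecognized
    return "\n".join(out) + "\n"


# Shortened sample in the format shown above (hypothetical content).
SAMPLE = '''\
name="sd" class="scsi"
target=0 lun=0;

# BEGIN RAID Manager additional LUN entries
# DO NOT EDIT from BEGIN above to END below...
name="sd" class="scsi"
target=0 lun=1;

name="sd" class="scsi"
target=4 lun=1;

name="sd" class="scsi"
target=5 lun=1;
# END RAID Manager additional lun entries
'''

print(filter_raid_manager_section(SAMPLE, {4, 5}))
```

   If something like this is used on a real system, it is safer to
   write the result to a copy, compare it with the original sd.conf,
   and keep a backup before replacing the file.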

NOTE: Not every A1000 is set to targets 4 and 5 by default.  Make sure
      that the entries for the targets actually in use go into the
      "RAID Manager" section of sd.conf.  Otherwise, after rebooting,
      the RM6 SW will not see your A1000 Raid Module.
      This procedure may affect third party multi-LUN devices.  If the
      systems have third party multi-LUN devices, verify their target ID
      settings and make sure you do not disable them with this procedure.
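
Because the targets are not always 4 and 5, the set of targets to
retain can be read from the lad output in step 1 rather than assumed.
The Python sketch below shows one way to do that.  The function name is
hypothetical, and the parsing assumes device names of the form
c<controller>t<target>d<LUN>s<slice> at the start of each line, as in
the sample output above.  It does not detect third party multi-LUN
devices, which still have to be checked by hand.

```python
import re

# Device names look like c1t5d0s0: controller 1, target 5, LUN 0, slice 0.
DEVICE_RE = re.compile(r"^c(\d+)t(\d+)d(\d+)s(\d+)\b")


def targets_from_lad(lad_output):
    """Return the sorted list of SCSI target IDs seen in lad output."""
    targets = set()
    for line in lad_output.splitlines():
        match = DEVICE_RE.match(line.strip())
        if match:
            targets.add(int(match.group(2)))
    return sorted(targets)


# Sample taken from the lad output shown in step 1.
LAD_SAMPLE = """\
c1t5d0s0 1T92401270 LUNS: 0 1 2
c2t5d0s0 1T92401348 LUNS: 0 1 2
c3t4d3s0 1T92600542 LUNS: 3 4 5
c4t5d0s0 1T92400129 LUNS: 0 1 2
c7t4d3s0 1T92401081 LUNS: 3 4 5
c8t4d3s0 1T92401082 LUNS: 3 4 5
"""

print(targets_from_lad(LAD_SAMPLE))
```

For the sample output this yields targets 4 and 5, which matches the
entries retained in step 3.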

The slowdown in boot time, especially on Solaris 9 systems, can also
be reduced by removing rdnexus entries from /kernel/drv/rdnexus.conf.
The number of rdnexus entries that can be removed depends on the
individual system, but the general guidelines are as follows:

  . Use 'ls /devices/pseudo | grep rdnexus' to check the number of
    rdnexus nodes in use.

  . Leave enough entries for future expansion.  Each rdnexus node
    represents an HBA port connected to an Rdac controller.  In general,
    16 rdnexus entries in /kernel/drv/rdnexus.conf are sufficient for
    systems with 4 arrays or fewer.

  . Remove the rdnexus entries starting from the highest instance number.

  . Test results show a substantial improvement in boot time.

     System configuration:

      A1000 with 32 LUN support, V=96
      Solaris 8 with 64 rdnexus entries, boot time =  3 minutes
      Solaris 9 with 64 rdnexus entries, boot time = 39 minutes
      Solaris 9 with 16 rdnexus entries, boot time = 13 minutes
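
The guideline of removing entries from the highest instance number
downward can be sketched in Python as follows.  This is an illustration
only: the function name is hypothetical, and it assumes each rdnexus
entry sits on a single line in the format shown earlier in this notice
and that instance numbers run contiguously from 0.

```python
import re

ENTRY_RE = re.compile(
    r'^\s*name="rdnexus"\s+parent="pseudo"\s+instance=(\d+)\s*;'
)


def trim_rdnexus_conf(text, keep):
    """Drop rdnexus entries whose instance number is >= keep.

    Keeping instances 0..keep-1 removes entries starting from the
    highest instance number.  All other lines are left unchanged.
    """
    kept = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match and int(match.group(1)) >= keep:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Build a 64-entry sample and trim it to 16 entries.
SAMPLE_CONF = "".join(
    'name="rdnexus" parent="pseudo" instance=%d;\n' % i for i in range(64)
)
trimmed = trim_rdnexus_conf(SAMPLE_CONF, 16)
remaining = sum(1 for l in trimmed.splitlines() if ENTRY_RE.match(l))
print(remaining, "rdnexus entries kept")
```

Before trimming a real rdnexus.conf, compare the number kept against
the count from 'ls /devices/pseudo | grep rdnexus' so that no node
currently in use is removed, and keep a backup of the original file.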


COMMENTS:

--------------------------------------------------------------------------
Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist.  Edist can be 
  accessed internally at the following URL: http://edist.corp/.
  
* From there, follow the hyperlink path of "Enterprise Services
  Documentation" and click on "FIN & FCO attachments", then choose the
  appropriate folder, FIN or FCO.  This will display supporting
  directories/files for FINs or FCOs.
   
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
-------------------------------------------------------------------------


Copyright (c) 1997-2003 Sun Microsystems, Inc.