Document fins/I0841-1


FIN #: I0841-1

SYNOPSIS: Sun Fire (3800/4800/4810/6800) Servers with very large storage
          configurations or large driver .conf files may encounter panics or
          hangs during bootup

DATE: Jun/20/02


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)

            

SYNOPSIS: Sun Fire (3800/4800/4810/6800) Servers with very large storage  
          configurations or large driver .conf files may encounter panics
          or hangs during bootup.
      

SunAlert:           Yes

TOP FIN/FCO REPORT: Yes 
 
PRODUCT_REFERENCE:  Sun Fire 3800/4800/4810/6800 
 
PRODUCT CATEGORY:   Server / Service


PRODUCTS AFFECTED:  

Systems Affected:
-----------------  
Mkt_ID   Platform   Model   Description          Serial Number
------   --------   -----   -----------          -------------
  -        S8         -     Sun Fire 3800              -
  -        S12        -     Sun Fire 4800              -
  -        S12i       -     Sun Fire 4810              -
  -        S24        -     Sun Fire 6800              -


X-Options Affected:
-------------------
Mkt_ID   Platform   Model   Description   Serial Number
------   --------   -----   -----------   -------------
  -         -         -          -              -


PART NUMBERS AFFECTED: 

Part Number   Description        Model
-----------   -----------        -----
     -             -               -


REFERENCES:

BugId:    4660795 - OBP virtual-memory translation buffer for Solaris
                    can truncate the list

ESC:      535993 - 6800 panic on boot with Hitachi 9960 disk.
          535960 - F/G+/ system panic due to SD.conf file.

Sun Alert: 44348
 
     
PROBLEM DESCRIPTION:

Sun Fire 3800/4800/4810/6800 servers may become unbootable due to the
inability to manage large numbers of virtual memory translations in the
OBP.  When this occurs, the system may hang or panic while booting,
making the system unusable.

This issue can occur on any Sun Fire system with firmware 5.12.6 or
earlier whose configuration requires a large number of translation
table entries (TTEs) early in the boot process.  These will generally
be systems attached to large Storage Area Network (SAN) units with many
LUNs in their storage configurations, systems with a very large .conf
file (sd, ssd) for the driver used at boot time, or in some cases a
moderately sized configuration with kernel memory auditing enabled.

Typically, the bug is encountered when the driver that controls the
boot device also controls a large number of other devices in the
storage configuration, as has been seen on systems attached to large
SAN units with many LUNs.

The most common symptom is a "BAD TRAP: type=31" panic message, because
the underlying cause of the panic is the use of an untranslatable
address.  The rest of the panic message will vary depending on which
subsystem was executing when the bad pointer was referenced.

To determine the system firmware version:

    From Solaris on one of the platform's domains:

         /usr/platform/sun4u/sbin/prtdiag -v | grep OBP

         Example output (showing a vulnerable system):

         OBP 5.12.5 09/26/01 15:46

    Or from the platform System Controller:

	 showboards -p proms

         Example output (showing a vulnerable system):

	 Component Device    Type  Version  Date       Time  
	 --------- ------    ----  -------  ----       ----  
	 SSC0                ScApp 5.12.5   09/26/2001 15:51 
	 SSC0                Info  5.12.5   09/26/2001 15:51 
	 /N0/IB6   SBBC 0    iPOST 5.12.5   09/26/2001 15:47 
	 /N0/IB6   SBBC 0    Info  5.12.5   09/26/2001 15:48 
	 /N0/SB0   SBBC 0    POST  5.12.5   09/26/2001 15:47 
	 /N0/SB0   SBBC 0    OBP   5.12.5   09/26/2001 15:47 
	 /N0/SB0   SBBC 0    Info  5.12.5   09/26/2001 15:47 
	 /N0/SB0   SBBC 1    POST  5.12.5   09/26/2001 15:47 
	 /N0/SB0   SBBC 1    OBP   5.12.5   09/26/2001 15:47 
	 /N0/SB0   SBBC 1    Info  5.12.5   09/26/2001 15:47 
	 /N0/IB8   SBBC 0    iPOST 5.12.5   09/26/2001 15:47 
	 /N0/IB8   SBBC 0    Info  5.12.5   09/26/2001 15:48 
	 /N0/SB2   SBBC 0    POST  5.12.5   09/26/2001 15:47 
	 /N0/SB2   SBBC 0    OBP   5.12.5   09/26/2001 15:47 
	 /N0/SB2   SBBC 0    Info  5.12.5   09/26/2001 15:47 
	 /N0/SB2   SBBC 1    POST  5.12.5   09/26/2001 15:47 
	 /N0/SB2   SBBC 1    OBP   5.12.5   09/26/2001 15:47 
	 /N0/SB2   SBBC 1    Info  5.12.5   09/26/2001 15:47 
	 /N0/SB4   SBBC 0    POST  5.12.5   09/26/2001 15:47 
	 /N0/SB4   SBBC 0    OBP   5.12.5   09/26/2001 15:47 
	 /N0/SB4   SBBC 0    Info  5.12.5   09/26/2001 15:47 
	 /N0/SB4   SBBC 1    POST  5.12.5   09/26/2001 15:47 
	 /N0/SB4   SBBC 1    OBP   5.12.5   09/26/2001 15:47 
	 /N0/SB4   SBBC 1    Info  5.12.5   09/26/2001 15:47 
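
The version check above can be scripted.  The following is a minimal
Python sketch, assuming the output formats shown in the two examples.
The names parse_obp_versions and is_vulnerable are illustrative only
and are not part of any Sun tool.  Per this FIN, any OBP version
earlier than 5.13.0 is treated as affected.

```python
import re

# First fixed firmware release according to this FIN.
FIXED_VERSION = (5, 13, 0)

# Matches "OBP 5.12.5 ..." (prtdiag) and "... OBP   5.12.5 ..." (showboards).
_OBP_RE = re.compile(r"\bOBP\s+(\d+)\.(\d+)\.(\d+)\b")


def parse_obp_versions(text):
    """Return every OBP version found in text as (major, minor, micro) tuples."""
    return [tuple(int(part) for part in m.groups())
            for m in _OBP_RE.finditer(text)]


def is_vulnerable(text):
    """Return True if any reported OBP version is older than FIXED_VERSION."""
    versions = parse_obp_versions(text)
    if not versions:
        raise ValueError("no OBP version found in input")
    return any(version < FIXED_VERSION for version in versions)


if __name__ == "__main__":
    # Sample line copied from the prtdiag example in this FIN.
    sample = "OBP 5.12.5 09/26/01 15:46"
    print(parse_obp_versions(sample), is_vulnerable(sample))
```

Comparing versions as integer tuples avoids the string-comparison
mistake of ranking "5.9.0" above "5.13.0".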

An affected system may hang or panic at boot time. If the system panics, 
a typical stack trace will look like:

         die(31,10407710,31002089000,0,3,c4488003) + 4
         [savfp=0x10406c11,savpc=0x1002b584]
         trap(31002088000,1,6,0,10407710,0) + 8dc
         [savfp=0x10406d51,savpc=0x10019ee8]
          + 640
         prom_rtt(48,1044da08,10,2000,2000,c8)
         [savfp=0x10406fb1,savpc=0x1002f654]
         page_freelist_coalesce(1044f224,10052ad0,0,1043c328,0,1043c328) + c
         [savfp=0x10407071,savpc=0x100295d4]
         startup_vm(0,10450de8,0,2000,2000,0) + 1cc
         [savfp=0x10407171,savpc=0x1002820c]
         startup(7d,edd00028,40,183e000,2000,ffffffffffffffff) + 2c
         [savfp=0x10407221,savpc=0x100a8578]
         main(1041d800,2000,10407ec0,10408030,fff2,10050df0) + 4
         [savfp=0x104072f1,savpc=0x10006fa0]
         _start(10006e38,1044ecd0,1044ecd0,1044ecd0,1049d8f8

Note that this stack trace is representative, and a specific failure
may not result in the exact same stack trace, depending on how and
where the dropped TTE is missed.
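
Because the stack trace varies from one failure to the next, a search
of captured console output is better keyed on the trap type than on
any particular function name.  The following is a minimal Python
sketch under that assumption.  The function name find_bad_trap_31 is
illustrative, and the only text it relies on is the "BAD TRAP: type=31"
string quoted in this FIN.

```python
import re

# The panic signature quoted in this FIN; the rest of the line is not assumed.
_TRAP_RE = re.compile(r"BAD TRAP:\s*type\s*=\s*31\b")


def find_bad_trap_31(console_log):
    """Return the lines of console_log that contain the type=31 trap message."""
    return [line for line in console_log.splitlines()
            if _TRAP_RE.search(line)]
```

A match only indicates that the symptom is present.  The firmware
version and storage configuration still need to be checked before
attributing the panic to this issue.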

The OBP uses a statically sized memory buffer to pass a list of Memory
Management Unit (MMU) translation table entries to Solaris during the
boot process.  If a driver required during the boot process has a
sufficiently large .conf file, OBP may run out of space in the static
buffer, and will then silently drop any remaining entries in the list.
Solaris will then panic and/or hang during boot when it attempts to
reference a virtual address that, from Solaris' point of view, has no
TTE available.
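
The failure mode described above can be illustrated with a toy model.
The following Python sketch is not OBP code.  The function names, the
addresses, and the buffer size are all made up for illustration.  It
only shows how silently truncating a list at a fixed capacity leaves
the consumer with lookups that fail later, far from the point where
the entries were dropped.

```python
def export_translations(translations, buffer_slots):
    """Copy translations into a fixed-size buffer.

    Entries beyond buffer_slots are dropped without any error,
    mirroring the silent truncation described in this FIN.
    """
    return dict(list(translations.items())[:buffer_slots])


def lookup(exported, vaddr):
    """Return the physical address for vaddr, or fail as an untranslatable access."""
    try:
        return exported[vaddr]
    except KeyError:
        raise LookupError(
            f"no translation for virtual address {vaddr:#x}") from None


if __name__ == "__main__":
    # Hypothetical numbers: 8 mappings, but room for only 6 in the buffer.
    firmware_view = {0x10000000 + i * 0x2000: 0x80000000 + i * 0x2000
                     for i in range(8)}
    kernel_view = export_translations(firmware_view, buffer_slots=6)
    print(len(firmware_view) - len(kernel_view), "entries dropped")

    lookup(kernel_view, 0x10000000)  # within the buffer, still mapped
    try:
        lookup(kernel_view, 0x10000000 + 7 * 0x2000)  # was dropped
    except LookupError as err:
        print("fault:", err)
```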
  
The problem will be fixed in firmware release 5.13.0 and later.  Until
that firmware release is available, follow the workaround provided
under CORRECTIVE ACTION.


IMPLEMENTATION: 

         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        | X |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        |   |   REACTIVE (As Required)
         ---


CORRECTIVE ACTION:

The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the
above-mentioned problem.

Use this workaround until firmware version 5.13.0 becomes available:

1. Examine the customer's driver .conf files and determine whether
   they can be trimmed to reduce the memory requirements of the
   driver.  The sd and ssd driver .conf files are typically the most
   useful to examine.  To trim a file, remove any unneeded entries
   from it.

2. If it is not possible to reduce the size of the .conf file due to
   the customer's system configuration, reconfigure the system to boot
   from a different type of device (for instance, if the system is 
   booting from an ssd device, reconfigure the system to boot from an
   sd device).
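
To help with step 1, the entries in a driver .conf file can be counted
per target to show where most of the file's size comes from.  The
following is a minimal Python sketch.  It assumes the usual
driver.conf layout of semicolon-terminated entries with '#' comments,
and the function name count_conf_entries is illustrative.  This FIN
does not state a specific entry count at which the problem occurs, so
the result is only a guide to which entries to review.

```python
import re
from collections import Counter


def count_conf_entries(conf_text):
    """Count semicolon-terminated entries in driver.conf-style text.

    Returns (total, per_target), where per_target maps each target
    value to the number of entries that name it.
    """
    # Drop comments (from '#' to end of line) before splitting on ';'.
    # Assumption: '#' does not appear inside quoted property values.
    body = re.sub(r"#.*", "", conf_text)
    per_target = Counter()
    total = 0
    for entry in body.split(";"):
        if not entry.strip():
            continue
        total += 1
        match = re.search(r"\btarget\s*=\s*(\w+)", entry)
        per_target[match.group(1) if match else "(no target)"] += 1
    return total, per_target
```

For example, calling count_conf_entries on the contents of the
customer's sd.conf and printing per_target.most_common(5) lists the
five targets with the most entries.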


COMMENTS:  

None

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
     support teams will recommend implementation of the FIN to their
     respective accounts, at the convenience of the customer.

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------


Copyright (c) 1997-2003 Sun Microsystems, Inc.