Document fins/I0733-1


FIN #: I0733-1

SYNOPSIS: Netra X1 and Netra T1 AC200/DC200 servers may experience a system
          hang or system crash.

DATE: 11/02/01

KEYWORDS: Netra X1 and Netra T1 AC200/DC200 servers may experience a system
          hang or system crash.


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)

 
SYNOPSIS: Netra X1 and Netra T1 AC200/DC200 servers may experience
	  a system hang or system crash. 
 

Sun Alert:          No

TOP FIN/FCO REPORT: No
 
PRODUCT_REFERENCE:  Netra X1 & Netra T1 AC200/DC200 
 
PRODUCT CATEGORY:   Server / Service
 

PRODUCTS AFFECTED:

Systems Affected
---------------- 
Mkt_ID   Platform   Model   Description            Serial Number
------   --------   -----   -----------            ------------- 
  -        N19       ALL    Netra X1 Servers             -
  -        N21       ALL    Netra T1 Servers             -
  -        A36       ALL    Sun Blade 100                -


X-Options Affected
------------------
Mkt_ID   Platform   Model   Description   Serial Number
------   --------   -----   -----------   ------------- 
  -         -         -          -              -   


PART NUMBERS AFFECTED:
 
Part Number   Description                      Model
-----------   -----------                      -----
600-6900-04   CONF 500MHZ 1X256MB 1X18GB AC      -
600-6899-04   CONF 500MHZ 2X256MB 1X18GB AC      -
600-6898-04   CONF 500MHZ 2X512MB 2X18GB AC      -
600-6980-02   CONF 500MHZ 1X256MB 1X18GB DC      -
600-7402-01   CONF 500MHZ 4X512MB 2X36GB AC      -
600-7029-01   BASE CONF NETRA T1 AC200           -
600-7030-01   BASE CONF NETRA T1 DC200           -
600-7084-02   400Mz 1X128MB 1X20G                -
600-7085-02   400Mz 2x256MB 2X20G                -
600-7097-02   400Mz 4x256MB 2x20G                -
600-7395-01   400Mz 1X128MB 2X20GB AC            -
600-7295-01   500Mz 1x128MB 1x40G                -
600-7296-01   500Mz 2x256MB 1x40G                - 
600-7297-01   500Mz 4x256MB 2x40G                -
600-7298-01   500Mz 4x512MB 2x40G                -

 
REFERENCES:
 
BugId:   4467264 - GET_NATIVE_TIME for hummingbird/UIIe can return 
                   erroneous values 
         4487325 - Multiple Netra T1 200 systems hang.

PatchId: 108528 - SunOS 5.8: kernel update patch.

ESC:     530844 
         532371 
         532536  
         532969  
         532987 
         532172 
         532173  
         532360 
         532385


PROBLEM DESCRIPTION: 
  
Customers with Netra X1 or Netra T1 AC200/DC200 servers may experience
an unexpected hang lasting 10-15 minutes, or may receive an erroneous
time value from gettimeofday().  A logic error in the UltraSPARC IIe
processor support code fails to handle the rollover of the 'tick'
value, which is stored across two 32-bit registers.  As a result,
these systems may panic or hang unexpectedly.

The kernel function hres_tick() panics if it sees the time anomaly
caused by the failure described above, but fails to release a lock on
the high resolution time, which panic() subsequently tries to acquire.
The result is a system hang.

Symptom 1: Other kernel hangs
---------- 
   A secondary cause of hangs lies in the kernel's use of delays.  The
   high resolution time system is at the heart of every delay or
   busy-wait, so the problem identified above can cause kernel code to
   wait for much longer than intended.  The kernel function
   drv_usecwait(), specified in the DDI, is an example of a busy-wait
   strategy affected by this problem.  If drv_usecwait() is called at
   the point of rollover, it may busy-wait for approximately 780
   seconds, hanging the system from a user perspective.
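
   The effect on a busy-wait can be illustrated with a minimal model.
   This is a sketch only, not the Solaris implementation of
   drv_usecwait(): the function names and the deadline-based loop are
   assumptions, and the tick frequency is left as a parameter because
   this FIN does not state it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of a deadline-based busy-wait.
 * true_now    : the real counter value when the wait starts
 * start_read  : the value the wait routine actually read (may be wrong)
 * delay_ticks : the delay the caller asked for, in ticks
 * Returns the number of ticks the loop spins before the real counter
 * reaches the deadline computed from the (possibly wrong) start value.
 */
uint64_t spin_ticks(uint64_t true_now, uint64_t start_read,
                    uint64_t delay_ticks)
{
    uint64_t deadline = start_read + delay_ticks;

    return (deadline > true_now) ? deadline - true_now : 0;
}

/* Convert a tick count to seconds for an assumed tick frequency. */
double ticks_to_seconds(uint64_t ticks, double freq_hz)
{
    assert(freq_hz > 0.0);
    return (double)ticks / freq_hz;
}
```

   In this model, a start value that is too high by 2^32 ticks extends
   the wait by the same 2^32 ticks, however short the requested delay.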

Symptom 2: Unreliable result from high resolution time
----------
   Programs that use the high resolution time services of Solaris, for
   example through the gettimeofday() system call, can be returned an
   incorrect high value.  Subsequent calls return the correct value.
   This problem has been observed on systems running web services that
   depend on high resolution time to generate time stamps.
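
   An application that generates time stamps could screen for this
   symptom along the following lines.  This is a minimal sketch and not
   a substitute for the corrective action in this FIN: the helper
   names and the max_step_us threshold are hypothetical, and the single
   re-sample relies on the observation above that subsequent calls
   return the correct value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Convert a struct timeval to microseconds since the epoch. */
int64_t timeval_to_us(const struct timeval *tv)
{
    return (int64_t)tv->tv_sec * 1000000 + (int64_t)tv->tv_usec;
}

/*
 * A sample is suspect if it moves backwards, or forwards by more than
 * max_step_us (a threshold chosen by the application).
 */
bool is_suspect_sample(int64_t prev_us, int64_t cur_us, int64_t max_step_us)
{
    assert(max_step_us >= 0);
    return cur_us < prev_us || cur_us - prev_us > max_step_us;
}

/*
 * Read the time; if the value looks wrong relative to the previous
 * sample, read it once more.  Pass prev_us == 0 for the first sample.
 */
int64_t checked_now_us(int64_t prev_us, int64_t max_step_us)
{
    struct timeval tv;
    int64_t cur;

    gettimeofday(&tv, NULL);
    cur = timeval_to_us(&tv);
    if (prev_us != 0 && is_suspect_sample(prev_us, cur, max_step_us)) {
        gettimeofday(&tv, NULL);   /* re-sample once */
        cur = timeval_to_us(&tv);
    }
    return cur;
}
```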

Both of the above symptoms have been reported and observed.  There are
other theoretical side effects of this problem.  For example, if the
value in the 'tick' register is being changed at the time, a forward
jump in time may be seen.  However, this type of problem rarely occurs
in the field.
 
Root Cause:
-----------
Solaris high resolution time is measured by counting clock ticks since 
start-up and using a reference time obtained from the time of day chip 
(Real time clock), again at start-up.  In order to make this work, we need 
to know the frequency at which the 'tick' value is incremented.
 
Power management features introduced in the UltraSPARC IIe
(Hummingbird) allow the frequency at which the cpu operates to vary.
This allows us to reduce the power used by the cpu by slowing it down.
In order to calculate time on Hummingbird based platforms we use a
reference clock from the PCI bus to increment the 'tick', as we can no
longer be sure of the value of the cpu frequency and hence the rate at
which 'tick' is incremented.
 
The tick is a 64-bit value stored in two 32-bit registers, and herein
lies the problem.  At the point at which rollover occurs between the two we
must be extremely careful.  To summarize the root of the problem: we
incorrectly handle the rollover, both when we read from the tick
register and when we write to it.  This incorrect handling can result in
an incorrect high value being returned or inserted.
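
The hazard of reading a 64-bit counter as two 32-bit halves can be
sketched as follows.  This is an illustration of a common
rollover-safe read technique, not the code in Patch 108528; the
split_counter structure and the function names are hypothetical and do
not represent the UltraSPARC IIe register layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit counter exposed as two 32-bit halves. */
struct split_counter {
    volatile uint32_t hi;
    volatile uint32_t lo;
};

/* Join two 32-bit halves into one 64-bit value. */
uint64_t combine(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/*
 * Naive read: if lo rolls over between the two loads, the result
 * pairs halves from different moments and can be wrong by nearly 2^32.
 */
uint64_t read_naive(const struct split_counter *c)
{
    uint32_t hi = c->hi;
    uint32_t lo = c->lo;

    return combine(hi, lo);
}

/*
 * Rollover-safe read: sample hi, then lo, then hi again, and retry
 * until hi is unchanged across the load of lo.
 */
uint64_t read_safe(const struct split_counter *c)
{
    uint32_t hi, lo, hi2;

    assert(c != NULL);
    do {
        hi = c->hi;
        lo = c->lo;
        hi2 = c->hi;
    } while (hi != hi2);
    return combine(hi, lo);
}
```

For example, pairing a high half read after the rollover with a low
half read just before it gives combine(1, 0xFFFFFFFF), which is 2^32
larger than the true value combine(0, 0xFFFFFFFF).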

These problems exacerbate a locking problem within the kernel function
hres_tick().  If hres_tick() is called at the time the rollover is
mishandled as described above, then on its next invocation it detects
that the newly calculated value of hrtime_base is less than the
previous value.  It then calls panic() but fails to release hres_lock
first.  panic() calls gethrtime(), which in turn finds that the lock on
high resolution time is already held and hangs, because the lock is
never released.
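
The locking pattern described above can be modeled in a few lines.
This is a simplified sketch, not Solaris source: the flags, the
model_panic() stand-in, and the tick_buggy()/tick_fixed() functions are
hypothetical, and a real spin on the lock is recorded as a flag rather
than performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

bool hres_lock_held;   /* stand-in for the high resolution time lock */
bool panicked;         /* panic path completed */
bool panic_hung;       /* panic path would spin forever on the lock */

void lock_enter(void) { hres_lock_held = true; }
void lock_exit(void)  { hres_lock_held = false; }

void model_reset(void)
{
    hres_lock_held = panicked = panic_hung = false;
}

/* Stand-in for panic(): it needs the time, which needs the lock. */
void model_panic(void)
{
    if (hres_lock_held) {
        panic_hung = true;   /* real code would wait here indefinitely */
        return;
    }
    panicked = true;
}

/* Pattern described in this FIN: panic is called with the lock held. */
void tick_buggy(int64_t last, int64_t now)
{
    lock_enter();
    if (now < last) {
        model_panic();
        return;
    }
    lock_exit();
}

/* One way to avoid the hang: drop the lock before taking the panic path. */
void tick_fixed(int64_t last, int64_t now)
{
    lock_enter();
    if (now < last) {
        lock_exit();
        model_panic();
        return;
    }
    lock_exit();
}
```

With the time going backwards, tick_buggy() leaves panic_hung set,
while tick_fixed() lets the panic path complete.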

The fix for this problem has been integrated into Kernel Update Patch
108528 for Solaris 8.  The planned release date for Patch 108528 
is 09-Nov-2001.  Until that time, a binary fix is available for download 
from Sun CTE. 

This patch will be incorporated into the HDD pre-installed images from 
Solaris 8 Update 6 onwards.  Availability of the new images for the
affected platforms is expected as follows:

                        Netra T1  Nov, Dec 2001
                        Netra X1  Jan 2002

 
IMPLEMENTATION: 
 
          ---
         |   |   MANDATORY (Fully Pro-Active)
          ---
 
 
          ---
         |   |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
          ---
 
 
          ---
         | x |   REACTIVE (As Required)
          ---
 

CORRECTIVE ACTION:

An Authorized Enterprise Services Field Representative may avoid the
above-mentioned problems by following the recommendations below.


1) Install Kernel Update Patch 108528 for Solaris 8 on Netra X1 and
   Netra T1 AC200/DC200 platforms.

   OR
 
2) If this patch is not yet available, the following binary fix can be used:

   Download two files from anonymous ftp site cte-ftp.eng:/esc/531563/28-10

      unix (838184 bytes)
      SUNW,UltraSPARC-IIe (83192 bytes)

   After backing up the original files,

      # cp unix /platform/sun4u/kernel/sparcv9 
      # cp SUNW,UltraSPARC-IIe /platform/sun4u/kernel/cpu/sparcv9 
      # reboot
  
   All binaries have been verified and tested successfully.
 
 
COMMENTS: 

None
 
============================================================================
 
Implementation Footnote:
 
i)   In case of MANDATORY FINs, Enterprise Services will attempt to
     contact all affected customers to recommend implementation of
     the FIN.
 
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
     support teams will recommend implementation of the FIN  (to their
     respective accounts), at the convenience of the customer.
 
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
 
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
 
* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.
 
Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist.  Edist can be
  accessed internally at the following URL: http://edist.corp/.
 
* From there, follow the hyperlink path of "Enterprise Services
  Documentation" and click on "FIN & FCO attachments", then choose the
  appropriate folder, FIN or FCO.  This will display supporting
  directories/files for FINs or FCOs.
 
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
---------------------------------------------------------------------------

 
 




Copyright (c) 1997-2003 Sun Microsystems, Inc.