Document fins/I0805-1


FIN #: I0805-1

SYNOPSIS: DIMMs are being unnecessarily replaced on Enterprise 10000 servers

DATE: Oct/30/02

KEYWORDS: DIMMs are being unnecessarily replaced on Enterprise 10000 servers


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS:  DIMMs are being unnecessarily replaced on Enterprise 10000 
           servers.

             
Sun Alert:          No             

TOP FIN/FCO REPORT: No 
 
PRODUCT_REFERENCE:  DIMMs on Enterprise 10000 servers
 
PRODUCT CATEGORY:   Server / Service 


PRODUCTS AFFECTED:  

Systems Affected
------- ---------
Mkt_ID      Platform   Model   Description                   Serial Number
------      --------   -----   -----------                   -------------
  -         E10000      ALL    Ultra Enterprise 10000 Server       - 
  -         E10000-HPC  ALL    Ultra Enterprise 10000 HPC          -


X-Options Affected
------------------
Mkt_ID     Platform   Model      Description                        Serial Number
------     --------   -----      -----------                        -------------
X7023A     E10000-X    ALL       OPT MEMORY 1GB (8 x 128MB DIMMS)         -
X7022A     E10000-X    ALL       OPT MEMORY 256MB (8 x 32MB DIMMS)        -


PART NUMBERS AFFECTED:

Part Number             Description                       Model
-----------             -----------                       -----
501-2654-01             128 MB DIMM DRAM 16Mx72 60ns        -
501-2653-01             32 MB DIMM DRAM 4Mx72 60ns          -


REFERENCES:

URL: http://bestpractices.central/bestpractices_guide_memory_errors.pdf
URL: http://esp.west/starfire/post/redxintro.html


PROBLEM DESCRIPTION:  

Significant numbers of Dual In-Line Memory Modules (DIMMs) for Enterprise 10000
(E10000) servers are being returned from the field.  However, upon failure
analysis (FA), the most common DIMM diagnosis is No Trouble Found (NTF). In
calendar year 2001, over 70% of E10000 DIMM returns were diagnosed NTF. The
intent of this FIN is to provide Sun Service an overview of Error Correcting
Codes (ECC), to give criteria for replacing DIMMs, to reduce the number of
unnecessarily replaced DIMMs, and to increase system reliability by reducing
service actions.  It is also intended to reduce the number of NTF parts by
emphasizing to Sun Service the necessity of returning verified failures with
the actual error messages encountered to assist in FA.  One of the causes of
these unnecessary returns is believed to be a lack of information provided to
Sun Service on what ECC is, how the terms related to ECC are defined, and what
the criteria are for determining when ECC errors are considered excessive. The
following ECC overview should help in providing an understanding of this issue:

 --------------------     
| An Overview of ECC | 
 --------------------
                            
  Introduction
  ------------
    The scope of this discussion is limited to soft and hard errors that
    occur in memory and how they are reported by Solaris.  It does not
    account for errors that occur while data travels through the E10000
    interconnect, CPU Module, or I/O.  For this discussion, soft errors 
    are transient or temporary errors in memory that can be corrected by
    rewriting the affected memory cell.  Hard errors occur when a cell
    is permanently damaged and cannot hold the correct information. With
    a hard error, the cell can be permanently stuck-at "0" or "1".


  ECC Concepts
  ------------
    Any volatile storage medium, whether it be the Dynamic Random Access
    Memory (DRAM) used on main memory DIMMs or Static Random Access Memory
    (SRAM) mainly used for caches, is subject to occasional natural
    incidences of data loss due to the impact of alpha particles or cosmic
    rays. This data loss manifests itself in the changing of the value
    stored in the memory cell affected by the collision.  Typically only a
    single bit is affected, but there is a small probability that multiple
    cells can be upset.

    When a bit flips due to this phenomenon, it is referred to as a soft
    error.  This is to distinguish it from a hard error resulting from a
    hardware failure.  These soft errors happen at a rate, called the soft
    error rate (SER), that can be predicted as a function of the memory
    density, the memory technology, and the altitude of the system in which
    the memory resides.

    ECC was invented to allow survival from these naturally occurring
    losses of data.  The ECC method used on the E10000 is called a Single
    Error Correcting, Double Error Detecting code (SEC-DED).  The concept is
    that every word of data is written to memory along with a number of
    extra check bits.  When the word is read back from memory, a fresh set
    of check bits is computed and compared with the check bits that were
    stored in memory.  The result of this comparison is called the syndrome.
    If the syndrome is zero, the two sets are identical, and thus the
    data is good.  A non-zero syndrome means the data is in error, and the
    syndrome is used to find a single bit in error and correct it.  A
    single bit error is called a Correctable Error (CE).  The syndrome can
    also detect if two bits are in error, but it does not have enough
    information to identify which two bits.  This type of error is called
    an Uncorrectable Error (UE).  UltraSPARC microprocessors use a SEC-DED
    variant called S4ED that also can detect, but not correct, three or
    four bit errors if they are clustered within a four bit nibble.
 
    Table 10-2 in the document specified by the URL below shows how the
    syndrome is used to identify the bit in error or determine if multiple
    bits are in error.  Solaris does this table look-up work for you so you
    don't have to, but the information in the table is interesting if you
    are curious about what type of memory error occurred in an E10000.
 
      http://sun-www.central.sun.com/microelectronics/manuals/805-0168.pdf
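
    The syndrome idea described above can be illustrated with a toy code.
    The sketch below is NOT the code used on the E10000 or by UltraSPARC
    (those protect 64 data bits with 8 check bits); it is an extended
    Hamming code over 4 data bits, chosen only because it is small enough
    to read.  The function names encode() and decode() are hypothetical.
    It shows the three outcomes discussed here: a zero syndrome (data
    good), a single bit error that is located and corrected (CE), and a
    double bit error that is detected but cannot be corrected (UE).

```python
# Toy SEC-DED example: extended Hamming code, 4 data bits + 4 check bits.
# Assumption: illustrative only; the real E10000 code is wider and different.

def encode(nibble):
    """Return an 8-bit codeword as a list of 0/1 values.

    Index 0 holds an overall parity bit.  Indexes 1..7 are Hamming
    positions, where positions 1, 2 and 4 hold check bits and
    positions 3, 5, 6 and 7 hold the data bits.
    """
    data = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8
    word[3], word[5], word[6], word[7] = data
    for p in (1, 2, 4):
        # Check bit p covers every position whose index has bit p set.
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    word[0] = sum(word[1:]) % 2  # makes the total number of 1s even
    return word


def decode(word):
    """Return (status, nibble) where status is 'OK', 'CE' or 'UE'."""
    word = list(word)
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i  # XOR of the positions that hold a 1
    parity_ok = sum(word) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "OK"
    elif not parity_ok:
        # Odd number of flipped bits: treat as one flipped bit.  The
        # syndrome is its position (0 means the overall parity bit).
        word[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "UE", None
    nibble = word[3] | word[5] << 1 | word[6] << 2 | word[7] << 3
    return status, nibble
```

    Flipping one bit of an encoded word and calling decode() returns "CE"
    with the original data; flipping two bits returns "UE" with no data.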

 --------------
| SSP Behavior |
 --------------
 
  All E10000 System Service Processor (SSP) patches are mandatory and
  can adversely affect memory error reporting if not installed. A list
  of patches for the version of the SSP software you are running is
  available at:
 
     http://cpre-amer.west/esg/hsg/starfire/patches.html

  The synopsis of the SSP patches normally just lists one of potentially
  several bugs fixed by the patch.  Do not ignore the patch just because
  your customer has not encountered the one bug listed in the synopsis.

  Correctable Errors
  ------------------
    The SSP generates a Recordstop file if the E10000 encounters a CE.
    The Recordstop cannot identify the exact DIMM where the CE occurred,
    but narrows it down to two possible DIMMs.  Solaris is responsible
    for identifying the exact DIMM experiencing the error.

  Uncorrectable Errors
  --------------------
    The SSP generates a Recordstop file if the E10000 encounters a UE.
    The Recordstop cannot identify the exact DIMM where the UE occurred,
    but narrows it down to two possible DIMMs.  Solaris is responsible
    for identifying the exact failing DIMM.

 ------------------
| Solaris Behavior |
 ------------------

  Ensure you have the updates to the version of Solaris you are running
  that include main memory scrubbing and improved error messaging.  See 
  FIN I0616-1 at:

  http://sunsolve.Central.Sun.COM/cgi/retrieve.pl?type=0&doc=fins/I0616-1

  for details. It is an E10000 requirement to use a version of Solaris 
  that runs the main memory scrubber.  

  Correctable Errors
  ------------------
    When a CE is detected, the device that read the word and detected
    the error can correct the data read and continue on unimpeded.
    However, this does not address the fact that the referenced word
    could still be resident in memory uncorrected (i.e. a subsequent
    read of this word could result in another CE event).  If, over
    time, this word in memory is never corrected, the possibility
    starts to arise that another bit may flip in the same word.  This
    would lead to a UE event which will result in a loss of system
    service (See Uncorrectable Error discussion below).  To avoid this
    possibility, the detection of a CE causes a trap to Solaris.  The
    Solaris error handling code logs the error and scrubs the affected
    memory word by writing the corrected word back into memory.

  Uncorrectable Errors
  --------------------
    If a UE is detected, the device that read the word and detected the
    error cannot correct the data and continue on.  A UE will cause
    Solaris to panic if the UE was in kernel memory, or cause a kill of
    the particular user process that contained the memory in error,
    followed by an orderly shutdown and reboot to protect the other
    processes in the domain.

  Memory Scrubber
  ---------------
    Solaris also runs a memory "scrubber" routine as part of its normal
    operation.  This scrubber doesn't do anything special besides
    ensure every memory location is accessed at least once every 12
    hours.  If the access finds a CE, then the normal trap to Solaris
    that occurs for any CE will scrub the affected memory word by
    writing the corrected word back into memory and log the event.  This
    ensures that multiple CEs do not have time to build up and form a
    UE at memory locations that are infrequently accessed. 

  Solaris Error Messages
  ----------------------

    As part of handling the error, Solaris will proceed to log a fair
    amount of diagnostic information. One such error message, taken from
    the /var/adm/messages file of an E10000 domain running Solaris 8, looks
    like the following:

       Feb  4 18:21:50 cod-b0 SUNW,UltraSPARC: [ID 787962 kern.notice]
            [AFT0] Corrected Memory Error on CPU31, errID 0x0003a08d.15fec176
       Feb  4 18:21:50 cod-b0  AFSR 0x00000000.00100000<CE>
            AFAR 0x00000016.71bb89a8
       Feb  4 18:21:50 cod-b0  AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00
           Fault_PC 0x10024df4
       Feb  4 18:21:50 cod-b0     UDBL Syndrome 0x6e Memory Module Board# 4
            Bank# 2 P# P15 MM 2_3
       Feb  4 18:21:50 cod-b0 SUNW,UltraSPARC: [ID 218875 kern.notice]
            [AFT0] errID 0x0003a08d.15fec176
            Corrected Memory Error on Board# 4 Bank# 2 P# P15 MM 2_3 is
            Persistent
       Feb  4 18:21:50 cod-b0 SUNW,UltraSPARC: [ID 758418 kern.notice]
            [AFT0] errID 0x0003a08d.15fec176
            ECC Data Bit 56 was in error and corrected

  Points that need explanation are the following: 

  . Asynchronous Fault Trap 0 (AFT0) messages are for correctable or
    survivable errors, such as CE memory errors.  AFT1 messages are for
    uncorrectable or non-survivable errors (i.e. errors that usually
    cause Solaris to panic), such as UE memory errors.  AFT2 and AFT3
    messages are for additional diagnostic information, such as cache
    line dumps.

  . The event was detected by CPU31.  All this means is that CPU31 is the
    processor that took the trap, thus invoking the Solaris error handling
    code.
 
  . Contents of the Asynchronous Fault Status Register (AFSR) and 
    Asynchronous Fault Address Register (AFAR) along with the E-cache tag
    parity syndrome (AFSR.ETS) and the data parity syndrome (AFSR.PSYND)
    are given.  These are CPU parity syndromes, not the SEC-DED syndrome
    used on the DIMMs.  They should be zero if the error was on the
    DIMM.  (The "Score" is used if multiple AFSR parity error messages
    are reported.  The highest score is the most likely originator of
    the parity error.  The "Score" is unrelated to errors involving
    DIMMs.)

  . The UltraSPARC Data Buffer Lower Error Register (UDBL) ECC syndrome is
    the syndrome used to detect and correct errors on DIMMs.  This syndrome
    is decoded from a table and the last line of the error message
    indicates that this was done by Solaris and bit 56 was found to be
    in error.
       
  . The DIMM containing the affected memory word is on: Board# 4 Bank# 2 
    MM 2_3.  This is not important information by itself, because we have
    not determined if the error is soft or hard, or if the DIMM is the
    cause of the condition or another component was the cause.  A "P"
    number is also given to identify the DIMM.  The DIMMs on a memory
    mezzanine are numbered from 1 to 32.  P15 is just another method of
    saying MM 2_3.  See FIN# I0396-1 for a P to MM table:
 
    http://sunsolve.Central.Sun.COM/cgi/retrieve.pl?type=0&doc=fins/I0396-1

  . Solaris describes this event as "Persistent" even though the next
    error message clearly indicates the error has been corrected and does
    not persist.  The choice of the word persistent in this context causes
    confusion and can cause Sun Service to incorrectly remove a DIMM.
    The Solaris error handling code provides a disposition code as a
    result of the scrub operation.  This disposition is one of
    "Intermittent", "Persistent", or "Sticky".  The definition of each of
    these codes is:

    Intermittent - Means the error was not detected on a reread of
    ------------   the affected memory word.  "Intermittent" is also not
                   the best choice of words because it implies that
                   this same error can be expected to manifest itself
                   at irregular intervals.  This CE is more commonly
                   known as a transient soft error.  No DIMM with this
                   sort of error can be considered for replacement
                   without first examining the soft error rate (SER) of
                   this DIMM and the System Service Processor (SSP)
                   Recordstop files to be certain that the memory
                   caused this error.  A step by step procedure to
                   accomplish this is given in this FIN's CORRECTIVE
                   ACTION heading.

    Persistent -   Means the error was detected again on a reread of
    ----------     the affected memory word but the scrub operation
                   corrected it.  This CE is more commonly known as a
                   temporary soft error.  No DIMM with this sort of error
                   can be considered for replacement without first
                   examining the SER of this DIMM and the SSP Recordstop
                   files to be certain that the memory caused this
                   error.  A step by step procedure to accomplish this is
                   given in this FIN's CORRECTIVE ACTION heading.

        Sticky -   Means that the error still exists in memory even after
        ------     the scrub operation.  These events should be immediately
                   investigated to determine if some hardware replacement is
                   necessary since this is indicative of a hard error.  This
                   CE is more commonly known as a stuck-at hard error.  A
                   DIMM with a "Sticky" CE should be considered for
                   replacement after first examining the SSP Recordstop files
                   to be certain that memory caused this error.  A step by
                   step procedure to accomplish this is given in this FIN's
                   CORRECTIVE ACTION heading.
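
  When many messages have to be reviewed, the fields discussed above can
  be pulled out of the log text mechanically.  The following is a minimal
  sketch, assuming the message layout matches the sample quoted in this
  FIN; the function name parse_ce() and the "error_class" key are
  hypothetical and not part of Solaris.  It extracts the Board#, Bank#,
  P# and MM fields and maps the disposition to the soft/hard classes
  defined above.

```python
import re

# Assumption: field layout as in the sample messages quoted in this FIN.
CE_RE = re.compile(
    r"Corrected Memory Error on Board#\s*(?P<board>\d+)\s+"
    r"Bank#\s*(?P<bank>\d+)\s+P#\s*(?P<p>P\d+)\s+"
    r"MM\s+(?P<mm>\d+_\d+)\s+is\s+"
    r"(?P<disposition>Intermittent|Persistent|Sticky)"
)

# Disposition to error class, following the definitions in this FIN.
ERROR_CLASS = {"Intermittent": "soft", "Persistent": "soft", "Sticky": "hard"}


def parse_ce(text):
    """Return a dict for the first CE disposition found, or None."""
    # Collapse whitespace so a message wrapped across lines still matches.
    match = CE_RE.search(" ".join(text.split()))
    if match is None:
        return None
    info = match.groupdict()
    info["board"] = int(info["board"])
    info["bank"] = int(info["bank"])
    info["error_class"] = ERROR_CLASS[info["disposition"]]
    return info
```

  Applied to the sample message above, this yields Board 4, Bank 2, P15,
  MM 2_3 with a "Persistent" disposition, which is a soft error.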

  As discussed earlier, soft errors are naturally occurring events.  We
  have also indicated it is possible for the phenomenon that causes single
  bit soft CEs to cause multiple bit soft UEs.  Since the consequences
  of UEs are significant, and the occurrence of soft UEs induced by
  natural causes rare, it is best not to take chances and always to
  replace a DIMM that is responsible for a UE.  Conversely, a single
  report of a soft CE should not be the basis for replacing a memory
  device.  In fact, one should expect the number of soft CEs reported by a
  system to correlate with the SER that can be predicted by the amount of
  memory in the system and the altitude of the system.  Rather than going
  through system specific calculations to determine acceptable SER, the
  recommendation for the servicing of DIMMs in the presence of CEs along
  with UEs is outlined below.  

  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  ! o Remove a DIMM for soft CEs (Intermittent or Persistent) only if      !
  !   three or more soft CEs can be definitively attributed to the same    !
  !   DIMM within a 24 hour period.                                        !
  !                                                                        !
  ! o Remove a DIMM for a hard CE (Sticky) if just one hard CE can be      !
  !   definitively attributed to a DIMM.                                   !
  !                                                                        !
  ! o Remove a DIMM for a UE if just one UE can be definitively attributed !
  !   to a DIMM.                                                           !
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

  See the "CORRECTIVE ACTION" heading of this FIN for a procedure on how
  to definitively determine that the DIMM and not some other component is
  the source of the error.  Examples of determining if the DIMM is
  responsible are given in the "Diagnosing Memory Errors" section below.
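
  The three replacement criteria in the box above can be written down as
  a small decision function.  This is only a sketch under stated
  assumptions: the events passed in have ALREADY been definitively
  attributed to one DIMM, the function name should_replace() is
  hypothetical, and "within a 24 hour period" is read as a sliding window
  in which the first and third error are at most 24 hours apart.

```python
from datetime import datetime, timedelta

SOFT = ("Intermittent", "Persistent")


def should_replace(events, window=timedelta(hours=24), soft_threshold=3):
    """Apply the FIN's removal criteria to the events of a single DIMM.

    events: list of (timestamp, kind) tuples, where kind is one of
    'Intermittent', 'Persistent', 'Sticky' or 'UE'.
    """
    # One hard CE (Sticky) or one UE is enough.
    if any(kind in ("Sticky", "UE") for _, kind in events):
        return True
    # Soft CEs: look for soft_threshold errors inside one window.
    soft = sorted(t for t, kind in events if kind in SOFT)
    for i in range(len(soft) - soft_threshold + 1):
        if soft[i + soft_threshold - 1] - soft[i] <= window:
            return True
    return False
```

  Two soft CEs in a day, or three soft CEs spread over three days,
  return False; three soft CEs within 20 hours return True.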

  Let's look at another error message in order to illustrate a point about
  the SER:

  Feb  5 08:54:42 cod-b0 SUNW,UltraSPARC: [ID 126141 kern.notice]
       [AFT0] Corrected Memory Error on CPU56, errID 0x0003d02e.cbca34fe
  Feb  5 08:54:42 cod-b0  AFSR 0x00000000.00100000<CE>
       AFAR 0x0000001e.3807aad8
  Feb  5 08:54:42 cod-b0  AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 
       Fault_PC 0x10095764
  Feb  5 08:54:42 cod-b0     UDBL Syndrome 0x91 Memory Module Board# 0
     Bank# 1 P# P17 MM 1_0
  Feb  5 08:54:42 cod-b0 unix: [ID 908439 kern.notice] [AFT0]
       Multiple Softerrors:
  Feb  5 08:54:42 cod-b0 unix: [ID 356634 kern.notice]
       0 Intermittent, 256 Persistent, and 0 Sticky Softerrors accumulated
  Feb  5 08:54:42 cod-b0 unix: [ID 340762 kern.notice] 
     from Memory Module Board# 0 Bank# 1 P# P17 MM 1_0
  Feb  5 08:54:42 cod-b0 SUNW,UltraSPARC: [ID 453948 kern.notice]
       [AFT0] errID 0x0003d02e.cbca34fe
       Corrected Memory Error on Board# 0 Bank# 1 P# P17 MM 1_0 is Persistent
  Feb  5 08:54:42 cod-b0 SUNW,UltraSPARC: [ID 104955 kern.notice]
       [AFT0] errID 0x0003d02e.cbca34fe
       ECC Data Bit  5 was in error and corrected

  New in this error message are the lines that say "Multiple
  Softerrors" have occurred and "0 Intermittent, 256 Persistent, and 0
  Sticky Softerrors accumulated from Memory Module Board# 0 Bank# 1 P# P17
  MM 1_0".  Solaris will issue a summary report like this when the
  number of correctable errors exceeds a threshold value, max_ce_err, on
  a particular DIMM.  This threshold value on the E10000 is set to 255.
  Does a DIMM with 256 errors always need to be replaced? Not necessarily!
  As stated above, three or more soft CEs attributed to the same DIMM
  within a 24 hour period is not acceptable.  That means two errors
  per day is OK.  So if the uptime of the domain was greater than 128 days
  (256 errors / 2 errors per day = 128 days) it is conceivable that the
  SER never exceeded 2 errors per day, and the DIMM should not be
  replaced. 

  The point being emphasized here is to always ensure that the SER is three
  or more "Intermittent" or "Persistent" CEs on the same DIMM within a 24
  hour period before even considering replacement.
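
  The 128 day arithmetic can be turned around into a quick screening
  check.  The sketch below is an illustration only, and the function name
  count_alone_is_conclusive() is hypothetical.  It answers one narrow
  question: does a cumulative count, by itself, prove that some day saw
  three or more errors?  If the count is larger than two errors for every
  day of uptime, at least one day must have had three or more.  If not,
  the count proves nothing either way and the individual message
  timestamps have to be examined.

```python
import math


def count_alone_is_conclusive(total_errors, uptime_days, allowed_per_day=2):
    """True if the cumulative count alone proves the daily limit was exceeded.

    Assumption: uptime is split into whole calendar-day buckets, so a
    partial day counts as one bucket.
    """
    buckets = max(1, math.ceil(uptime_days))
    return total_errors > allowed_per_day * buckets
```

  With 256 errors and 128 or more days of uptime the result is False, so
  the summary message alone does not justify replacement.  With 256
  errors and 100 days of uptime the result is True.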

  Note that Solaris 9 KU2 (Patch 112233 or later) and Solaris 8 KU16 
  (Patch 108528 or later) replace the cumulative error count shown above
  with an error count that just spans a 24 hour window. These kernel updates
  also no longer send individual memory error messages to the console by
  default. If three errors occur in a 24 hour period the following message
  is printed on the console:

  Oct  3 22:46:31 thing2 unix: WARNING: [AFT0] 3 soft errors in less than
  24:00 (hh:mm) detected from Memory Module Board# 3 Bank# 3 P# P32 MM 3_7	


 --------------------------
| Diagnosing Memory Errors |
 --------------------------

  Identifying memory errors on the E10000 is best accomplished by first 
  looking at a set of Solaris error messages and trying to find a pattern.
  Below is a non-comprehensive list of possible patterns: 

  . If all the errors involve the same CPU Module, then suspect a problem
    with the CPU Module seating, the CPU Module itself, or the System
    Board it resides on.

  . If the errors all involve CPU Modules on the same System Board, then
    suspect a problem with the System Board seating or the System Board
    itself.

  . If the errors involve multiple CPUs on multiple System Boards, but 
    the same DIMM, suspect the DIMM.
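
  The three patterns above can be checked mechanically once the detecting
  CPU and the reported DIMM have been collected for each error.  This is
  a minimal sketch, not a diagnostic tool: the function name
  suspect_component() and the input layout are hypothetical, and it
  assumes four CPU Modules per System Board so that the board number is
  the CPU number divided by four, as in the CPU25 example later in this
  FIN.  Any result still has to be confirmed with the Recordstops.

```python
CPUS_PER_BOARD = 4  # assumption: four CPU Modules per System Board


def suspect_component(errors):
    """Suggest what to suspect from a list of memory error reports.

    errors: list of dicts with 'cpu' (int, the CPU that took the trap)
    and 'dimm' (str, the DIMM named by Solaris).
    """
    if len(errors) < 2:
        return "not enough errors to show a pattern"
    cpus = {e["cpu"] for e in errors}
    boards = {e["cpu"] // CPUS_PER_BOARD for e in errors}
    dimms = {e["dimm"] for e in errors}
    if len(cpus) == 1:
        return "CPU Module, its seating, or its System Board"
    if len(boards) == 1:
        return "System Board or its seating"
    if len(dimms) == 1:
        return "DIMM"
    return "no pattern from this list"
```

  The checks are ordered as in the list above, so errors that all come
  from one CPU Module are attributed to that module even when they also
  name the same DIMM.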

  Once a pattern has been identified, a diagnosis can be made by confirming
  the pattern through looking at the Recordstops. Here are two complete
  examples of how to diagnose Solaris memory errors reported on an E10000.
  One is an actual DIMM problem, the other shows how important it is to use
  the Recordstop to verify if an error was really caused by a DIMM. 

  Example Diagnosis 1 : A True Soft Memory Error
  ---------------------------------------------- 

    This first example is from an E10000 that has netcon logging
    enabled.  In this case the SSP $SSPLOGGER/<domain>/netcon file can
    be examined instead of having to log into the domain and examine
    /var/adm/messages. (Note that Solaris 9 KU2 and Solaris 8 KU16 by
    default no longer report these messages to the console, and thus
    they do not appear in the netcon log either. In that case, check the
    domain's /var/adm/messages file instead.)

    All E10000s should have netcon logging enabled. See FIN I0593-1 at:

    http://sunsolve.Central.Sun.COM/cgi/retrieve.pl?type=0&doc=fins/I0593-1

    Assume we have already established from the log that three CEs have
    occurred on the same DIMM within a 24 hour period, and now we are
    trying to determine if this particular set of CEs was caused by the
    DIMM.

      Dec 27 15:35:16 cod-ssp netcon_server: [ID 366040 local1.info] (cod-b0) :
      Dec 27 15:34:41 cod-b0 SUNW,UltraSPARC: [AFT0]
          Corrected Memory Error on CPU25, errID 0x00000189.0bb189df
      Dec 27 15:35:16 cod-ssp netcon_server: [ID 366040 local1.info] (cod-b0) :
      Dec 27 15:34:41 cod-b0 AFSR 0x00000000.00100000<CE> AFAR
          0x00000000.30fdbd70
      Dec 27 15:35:17 cod-ssp netcon_server: [ID 366040 local1.info] (cod-b0) :
      Dec 27 15:34:41 cod-b0     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00
          Fault_PC 0x10024df0
      Dec 27 15:35:17 cod-ssp netcon_server: [ID 366040 local1.info] (cod-b0) :
      Dec 27 15:34:41 cod-b0     UDBH Syndrome 0x8 Memory Module Board# 15
          Bank# 1 P# P30 MM 1_7
      Dec 27 15:35:17 cod-ssp netcon_server: [ID 366040 local1.info] (cod-b0) :
      Dec 27 15:34:41 cod-b0 SUNW,UltraSPARC: [AFT0] errID 0x00000189.0bb189df
          Corrected Memory Error on Board# 15 Bank# 1 P# P30 MM 1_7 is
          Persistent
      Dec 27 15:35:17 cod-ssp netcon_server: [ID 366040 local1.info] (cod-b0) :
      Dec 27 15:34:41 cod-b0 SUNW,UltraSPARC: [AFT0] errID 0x00000189.0bb189df
          ECC Check Bit  3 was in error and corrected

    In the same $SSPLOGGER/<domain> directory, look for a Recordstop that
    occurred around the same time:

    % ls -la Edd-Record-Stop-Dump-12.27*
      -rw-rw-rw-   1 ssp      staff      82680 Dec 27 15:43
      Edd-Record-Stop-Dump-12.27.15:35

    % redx -c -l 
      redxl> dumpf load Edd-Record-Stop-Dump-12.27.15:35
             Created Thu Dec 27 15:35:42 2001
             By hpost v. 3.4 Jun 20 2001 12:19:51  executing as pid=5840
             On ssp name =  cod-ssp.SD_Lab.West.Sun.COM
             HOSTNAME =  cod-b0
             platform_name =  cod
             Boardmask = 3FFFF    -D option
             Edd-Record-Stop-Dump
    There were 0 errors encountered while creating this dump.

      redxl> wfail

         LAARB 0     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB 1     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB 2     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB 3     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB 4     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB 5     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB 6     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB 6     ErrorCSR3[63:0]: Hist: 0 N 0000    Flgs = 000 00100000
             ErrCSR3[20]: Recordstop Requested by XDB0 (LAARB)
         XDB   6.0   EccErrFlags[11:0] = 140
             EccFlg[6]: Correctable   error in ldat bus hi half, bits [143:72]
             EccFlg[11:8]: Error count = 1
         ldat[143:72]= 08 00000000 00000000 (xmux_par[5:0]= 1F) syn= 08: 
             bit 67 [3F]
             Ldat hi data recordstop requested by XDB 6.0.
         LAARB 7     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB 8     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB 9     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB A     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB B     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB C     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB D     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB E     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB F     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB F     ErrorCSR3[63:0]: Hist: 0 N 0000    Flgs = 000 00800000
             ErrCSR3[23]: Recordstop Requested by XDB3 (LAARB)
         XDB   F.3   EccErrFlags[11:0] = 104
             EccFlg[2]: Correctable   error in  psi bus hi half, bits [143:72]
             EccFlg[11:8]: Error count = 1
         psi [143:72]= 08 00000000 00000000 (xmux_par[5:0]= 1F)  syn= 08: 
             bit 67 [3F]
         Memory ECC error detected by XDB F.3 will be analyzed later.
         GAARB 0     ErrorCSR1[65:0] = 0 00000000 00000002
             ErrCSR1[1]: Recordstop Detected
         GAARB 0     ArbStopLog[15:0] = 0000   RecordStopLog[15:0] = 8000
         GAARB 1     ErrorCSR1[65:0] = 0 00000000 00000002
             ErrCSR1[1]: Recordstop Detected
         GAARB 1     ArbStopLog[15:0] = 0000   RecordStopLog[15:0] = 8000
         GAARB 2     ErrorCSR1[65:0] = 0 00000000 00000002
             ErrCSR1[1]: Recordstop Detected
         GAARB 2     ArbStopLog[15:0] = 0000   RecordStopLog[15:0] = 8000
         GAARB 3     ErrorCSR1[65:0] = 0 00000000 00000002
             ErrCSR1[1]: Recordstop Detected
         GAARB 3     ArbStopLog[15:0] = 0000   RecordStopLog[15:0] = 8000
         Ldat-side data recordstops are assumed caused by psi-side errors.
           No further action is appropriate for them.
           Memory data ecc error detected by XDB F.3: PUP 1/3 output parity 
           history matches XDB in.  No action taken here.
           No components would be failed based on this state.

  . Solaris says the error was detected by CPU25. We know that CPU 25
    is on Board 6 (25 divided by 4 = 6, remainder 1).  The Recordstop
    shows that XDB 6.0 requested the Recordstop, and we know XDB 6.0
    interfaces with CPU 6.1, confirming that something on Board 6
    detected the error.

  . Solaris says the error was detected by CPU25 but occurred on Board#15
    Bank# 1 MM 1_7 check bit 3.  The Recordstop confirms that check bit 3
    was the bit affected by the error.  We know this because there are 72
    bits in an E10000 memory word.  Bits 0-63 are the 64 data bits, and
    bits 64-71 are the ECC check bits.  The Recordstop indicates bit 67 was
    the bit in error, which happens to be check bit 3 (bit 64=check bit 0,
    65=1, 66=2, 67=3, etc.).

  . The Recordstop says the error was a "Memory data ECC error detected by
    XDB F.3" (board F hexadecimal = 15), confirming Solaris' claim that
    Board 15 had the memory error.

  . The Recordstop continues saying "PUP 1/3 output parity history matches
    XDB in." which lets us know that the data sent out from the Pack/Unpack
    (PUP) ASICs matched the XDB (Xfire Data Buffer) input, therefore the
    error was not caused by the PUPs or the connection to the memory, but
    in the memory itself. 

  . Recordstops only have enough information to narrow down the error to
    two of the four possible memory banks; in this case banks 1 and 3 are
    identified.

  . The Recordstop identifies that the CE occurred "in psi bus hi half
    ... bit 67".  Each DIMM always provides the same 18 contiguous bits of
    data in a 144 bit transfer cycle, so according to the table below the
    affected DIMM is DIMM 7.

        ------------------------------
       | 144 bit transfer cycle table |
       |==============================|
       | DIMM 0: lo half bits [17: 0] |
       | DIMM 1: lo half bits [35:18] |
       | DIMM 2: lo half bits [53:36] |
       | DIMM 3: lo half bits [71:54] |
       | DIMM 4: hi half bits [17: 0] |
       | DIMM 5: hi half bits [35:18] |
       | DIMM 6: hi half bits [53:36] |
       | DIMM 7: hi half bits [71:54] | <--- "hi half ... bit 67"
        ------------------------------

    So it is now known from the Recordstop that the error occurred in DIMM
    F.1.7 or F.3.7.  One of these matches Solaris' indication of Board# 15
    MM 1_7 (DIMM F.1.7), therefore this Solaris error has been corroborated
    by the Recordstop.
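
    The lookups used in this example can be sketched as three small
    helpers.  These are illustrations only; the function names are
    hypothetical, and the mappings are taken from the text and the 144 bit
    transfer cycle table above (four CPUs per board, 18 bits per DIMM in
    each 72 bit half, bits 64-71 being check bits 0-7).

```python
def cpu_location(cpu):
    """Return (board, processor) for a Solaris CPU number, e.g. 25 -> (6, 1)."""
    return divmod(cpu, 4)


def dimm_for_bit(half, bit):
    """Return the DIMM number (0-7) that supplies a bit of the transfer.

    half: 'lo' or 'hi' as printed by redx; bit: 0..71 within that half.
    """
    if half not in ("lo", "hi") or not 0 <= bit <= 71:
        raise ValueError("half must be 'lo' or 'hi' and bit must be 0..71")
    return bit // 18 + (4 if half == "hi" else 0)


def describe_bit(bit):
    """Name a bit of the 72 bit memory word the way Solaris reports it."""
    if not 0 <= bit <= 71:
        raise ValueError("bit must be 0..71")
    if bit < 64:
        return "data bit %d" % bit
    return "check bit %d" % (bit - 64)
```

    For this example, cpu_location(25) gives board 6, processor 1;
    dimm_for_bit("hi", 67) gives DIMM 7; and describe_bit(67) gives
    "check bit 3", matching both the Recordstop and the Solaris message.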

  Example Diagnosis 2 : CPU Module Failure looks like a DIMM failure
  ------------------------------------------------------------------

    Once again we start by examining the /var/adm/messages file.  Again
    assume we have already established from the log that three CEs have
    occurred on the same DIMM within a 24 hour period, and now we are
    trying to determine if this particular set of CEs was caused by
    the DIMM.

      Dec 12 15:58:00 xf2-b7 SUNW,UltraSPARC: [AFT0] 
          Corrected Memory Error on CPU31, errID 0x00000076.b4f9a4cc
      Dec 12 15:58:00 xf2-b7     AFSR 0x00000000.00100000<CE> AFAR 
          0x0000000e.ebe3c000
      Dec 12 15:58:00 xf2-b7     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 
          Fault_PC 0x781cac7c
      Dec 12 15:58:00 xf2-b7     UDBH Syndrome 0x64 Memory Module Board# 7 
          Bank# 0 P# P2 MM 0_4
      Dec 12 15:58:00 xf2-b7 SUNW,UltraSPARC: [AFT0] errID 0x00000076.b4f9a4cc 
          Corrected Memory Error on Board# 7 Bank# 0 P# P2 MM 0_4 is Persistent
      Dec 12 15:58:00 xf2-b7 SUNW,UltraSPARC: [AFT0] errID 0x00000076.b4f9a4cc 
          ECC Data Bit  7 was in error and corrected

    % redx -c -l 
      redxl> dumpf load Edd-Record-Stop-Dump-12.12.15:58
         Created Wed Dec 12 15:58:25 2001
         By hpost v. 3.4 Aug 20 2000 19:14:56  executing as pid=28054
         On ssp name =  xf2-ssp2.SD_Lab.West.Sun.COM
         HOSTNAME =  xf2-b7
         platform_name =  allxf2
         Boardmask = 30088    -D option
         Edd-Record-Stop-Dump
         There were 0 errors encountered while creating this dump.
      redxl> wfail
         LAARB 3     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB 7     ErrorCSR1[65:0] = 0 00000000 3C000002
             ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
         LAARB 7     ErrorCSR3[63:0]: Hist: 0 N 0000    Flgs = 000 00200000
             ErrCSR3[21]: Recordstop Requested by XDB1 (LAARB)
         XDB   7.1   EccErrFlags[11:0] = 204
             EccFlg[2]: Correctable   error in  psi bus hi half, bits [143:72]
             EccFlg[11:8]: Error count = 2
         psi [143:72]= 6C 0ECCF00D 7EA00080 (xmux_par[5:0]= 2A)  syn= 64: 
             bit 07 [2B]
         FAIL proc 7.2: Arbstop/Recordstop detected by xdb.
         FAIL proc 7.3: Arbstop/Recordstop detected by xdb.
         GAARB 0     ErrorCSR1[65:0] = 0 00000000 00000002
             ErrCSR1[1]: Recordstop Detected
         GAARB 0     ArbStopLog[15:0] = 0000   RecordStopLog[15:0] = 0080
         GAARB 1     ErrorCSR1[65:0] = 0 00000000 00000002
             ErrCSR1[1]: Recordstop Detected
         GAARB 1     ArbStopLog[15:0] = 0000   RecordStopLog[15:0] = 0080
         GAARB 2     ErrorCSR1[65:0] = 0 00000000 00000002
             ErrCSR1[1]: Recordstop Detected
         GAARB 2     ArbStopLog[15:0] = 0000   RecordStopLog[15:0] = 0080
         GAARB 3     ErrorCSR1[65:0] = 0 00000000 00000002
             ErrCSR1[1]: Recordstop Detected
         GAARB 3     ArbStopLog[15:0] = 0000   RecordStopLog[15:0] = 0080

  . Solaris says the error was detected by CPU31.  We know that CPU 31
    is on Board 7 (31 divided by 4 = 7, remainder 3, so it is proc 7.3).
    The Recordstop shows that XDB 7.1 requested the Recordstop, and we
    know XDB 7.1 interfaces with CPUs 7.2 & 7.3 (CPU 30 & 31), confirming
    that something on Board 7 detected the error.

  . Solaris says the error was detected by CPU31 and occurred on Board# 7 
    Bank# 0 MM 0_4 data bit 7.  The Recordstop confirms that data bit 7 was
    the bit affected by the error.
 

  . The Recordstop says:

       "FAIL proc 7.2: Arbstop/Recordstop detected by xdb.
        FAIL proc 7.3: Arbstop/Recordstop detected by xdb."

    It appears CPU 7.2 or 7.3 created an ECC error that was detected by
    XDB 7.1, which they share.  The offending CPU is one of these two and
    still needs to be isolated.

  . The problem in this case was caused by a CPU writing bad data into a DIMM.
    No DIMM should be replaced based on an examination of the Recordstop.
    The bad CPU needs to be replaced.
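
    The board and processor position used in the reasoning above follow
    directly from the CPU number that Solaris reports.  A minimal Python
    sketch, assuming four processors per system board as in this example
    (CPU 30 and 31 are procs 7.2 and 7.3); cpu_location is a hypothetical
    helper name, not a Solaris or SSP command.

```python
PROCS_PER_BOARD = 4  # assumption drawn from the example: CPUs 28-31 are on Board 7


def cpu_location(cpu):
    """Split a Solaris CPU number into (board, processor position)."""
    if cpu < 0:
        raise ValueError("cpu must be non-negative")
    # The quotient is the board number; the remainder is the position on it.
    return divmod(cpu, PROCS_PER_BOARD)


board, proc = cpu_location(31)
print(f"CPU31 is proc {board}.{proc}")  # prints "CPU31 is proc 7.3"
```

    Note that the board number is the quotient of the division and the
    processor position is the remainder.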


IMPLEMENTATION: 
 
         ---
        |   |   MANDATORY (Fully Pro-Active)
         ---    
         
  
         ---
        |   |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---
         

CORRECTIVE ACTION:

The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned condition.

Please perform the following as needed: 

  1. Examine the domain's Solaris message logs for memory errors.

     A. For CEs Solaris calls "Intermittent" or "Persistent":

         i) Determine that three or more "Intermittent" or "Persistent"
            errors have occurred within a 24 hour period.  If three or more
            errors have not happened, take no service action and do not
            proceed with the next steps.

        ii) Copy the three error messages verbatim for possible use during
            DIMM FA.

       iii) Note the time of the errors for comparison with Recordstops.

     B. For CEs Solaris calls "Sticky":

         i) Copy error message verbatim for possible use during DIMM FA.

        ii) Note the time of the error for comparison with Recordstops.

     C. For UEs:

         i) Copy error messages verbatim for possible use during DIMM FA.

        ii) Note the time of the error for comparison with Recordstop.

  2. Examine the domain's SSP Recordstops to corroborate the error and the
     DIMM that Solaris reports as affected. For any Solaris report of a CE
     or UE, you must check whether the error was caused by broken hardware
     writing data with errors into a memory location rather than by the
     memory itself.

     A. For all CEs ("Intermittent", "Persistent", and "Sticky"):

          i) Examine the output of the wfail redx command on all Recordstops
             that occurred around the same time as the errors indicated in the
             Solaris error messages and up to 12 hours earlier.  This is to
             rule out a memory error that was caused by broken hardware writing
             erroneous data out to memory and then detected by a later memory 
             read. 

       NOTE: The requirement to check back only 12 hours is due to the fact
             that the Solaris memory scrubber accesses all DIMM locations
             every 12 hours.  If this scenario were to occur, the Recordstop
             would have to fall within the 12 hours preceding the Solaris
             error message.

         ii) For "Intermittent" and "Persistent" CEs, replace the DIMM
             indicated in the three Solaris error messages with a Field
             Replaceable Unit (FRU) DIMM only if all three Recordstops
             corroborate the Solaris messages' errors.

        iii) A DIMM with a "Sticky" CE may be considered for replacement
             after just one Solaris error message has been corroborated
             with a Recordstop.

         iv) Copy the Recordstops' wfail output verbatim for possible use 
             during DIMM FA.

          v) Bring up the domain with a minimum hpost level of 16 to test
             memory ECC functionality.  If time permits, a level 24, 32, or
             64 hpost will perform increasingly rigorous testing of memory.

     B. For all UEs:
 
          i) Examine the output of the wfail redx command on all Recordstops
             that occurred around the same time as the error indicated in
             the Solaris error message.

       NOTE: UE error messages that indicate a "Syndrome 0x3" can be related
             to a CPU Module E-cache parity or I/O parity error (SBus or
             PCIbus).  Investigate these sources of errors before replacing
             any DIMMs.

         ii) Replace the DIMM indicated in the Solaris error message with
             a FRU only if the Recordstop corroborates the Solaris failure
             message.

        iii) Copy the Recordstop's wfail output verbatim for possible use
             during DIMM FA.

         iv) Bring up the domain with an hpost level of 64 to fully test
             memory and other hardware functionality.

  3. Return DIMMs for FA along with complete Solaris error message and SSP 
     Recordstop wfail output.
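
The two timing rules in steps 1.A.i and 2.A.i can be sketched in Python as
shown below.  This is an illustration under stated assumptions, not a
supported tool: the function names are hypothetical, the timestamps are
assumed to have been extracted by hand from the Solaris messages and the
Recordstop dump headers, and the five minute allowance for a Recordstop
logged shortly after the Solaris message is an assumed reading of "around
the same time".  Pass only the CE times that belong to a single DIMM.

```python
from datetime import datetime, timedelta

CE_THRESHOLD = 3                       # three or more CEs ...
CE_WINDOW = timedelta(hours=24)        # ... within a 24 hour period
SCRUB_LOOKBACK = timedelta(hours=12)   # memory scrubber cycle from the NOTE
SLACK = timedelta(minutes=5)           # assumed allowance, not from the FIN


def meets_ce_threshold(ce_times):
    """True if any CE_THRESHOLD errors fall within one CE_WINDOW."""
    times = sorted(ce_times)
    for i in range(len(times) - CE_THRESHOLD + 1):
        # Compare the first and last error of each run of CE_THRESHOLD errors.
        if times[i + CE_THRESHOLD - 1] - times[i] <= CE_WINDOW:
            return True
    return False


def recordstops_to_review(error_time, recordstop_times):
    """Recordstops from 12 hours before the Solaris error until just after it."""
    earliest = error_time - SCRUB_LOOKBACK
    latest = error_time + SLACK
    return [t for t in recordstop_times if earliest <= t <= latest]


# Illustrative times only; the last CE and the first Recordstop use the
# timestamps from Example Diagnosis 2, the rest are made up.
ces = [datetime(2001, 12, 12, 3, 0),
       datetime(2001, 12, 12, 9, 30),
       datetime(2001, 12, 12, 15, 58, 0)]
stops = [datetime(2001, 12, 12, 15, 58, 25),
         datetime(2001, 12, 11, 20, 0)]

print(meets_ce_threshold(ces))                        # prints True
print(len(recordstops_to_review(ces[-1], stops)))     # prints 1
```

A Recordstop that falls inside the window is only a candidate for review.
The wfail output must still be read to decide whether it corroborates the
DIMM or points to other hardware.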


COMMENTS:  

If you are not certain a particular DIMM is the cause of repeated correctable
memory errors that meet the SER replacement criteria of three CEs within 24
hours, do not replace it.  The experience gained in servicing E-cache parity
errors has made it clear that performing an unnecessary service action can do
more harm than good to an E10000 and increase future service calls.  Make your
customers aware that soft errors are natural and expected, and that
"Intermittent" and "Persistent" errors do not necessarily imply intermittent
or persistent issues with their memory.

If you are certain a DIMM is the issue, share this information with the repair
depot by copying and providing the exact Solaris error messages and redx wfail
output you used to determine this along with the DIMM.  If this information
isn't provided, the DIMM may pass testing with an NTF diagnosis and come back
to you as a FRU again.

============================================================================

Implementation Footnote:

  i) In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
 ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------


Copyright (c) 1997-2003 Sun Microsystems, Inc.