Document fins/I0760-1
FIN #: I0760-1
SYNOPSIS: Too many Memory DIMMs are being unnecessarily replaced on UltraSPARC
III family of systems utilizing NG-DIMM memory
DATE: Aug/04/02
KEYWORDS: Too many Memory DIMMs are being unnecessarily replaced on UltraSPARC
III family of systems utilizing NG-DIMM memory
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
SYNOPSIS: Too many Memory DIMMs are being unnecessarily replaced on
UltraSPARC III family of systems utilizing NG-DIMM memory.
Sun Alert: No
TOP FIN/FCO REPORT: No
PRODUCT_REFERENCE: UltraSPARC III family of systems utilizing NG-DIMM memory
PRODUCT CATEGORY: Server / Service
PRODUCTS AFFECTED:
Systems Affected
----------------
Mkt_ID   Platform   Model   Description       Serial Number
------   --------   -----   -----------       -------------
  -      S8         ALL     Sun Fire 3800           -
  -      S12        ALL     Sun Fire 4800           -
  -      S12i       ALL     Sun Fire 4810           -
  -      S24        ALL     Sun Fire 6800           -
  -      F15K       ALL     Sun Fire 15000          -
  -      A28        ALL     Sun Blade 1000          -
  -      A35        ALL     Sun Fire 280R           -
  -      A30        ALL     Sun Fire V880           -
  -      N28        ALL     Netra 20                -
X-Options Affected
------------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- - - - -
PART NUMBERS AFFECTED:
Part Number Description Model
----------- ----------- -----
- - -
REFERENCES:
N/A
PROBLEM DESCRIPTION:
Memory components (DIMMs) for UltraSPARC III family of systems
utilizing NG-DIMM memory are being returned from the field as failed
based on Correctable Error (CE) reports. However, upon failure
analysis, most of these memory parts show No Trouble Found (NTF). The
intent of this FIN is to provide Sun Service Representatives with an
overview of ECC and criteria for replacing DIMMs. This FIN is
expected to reduce or eliminate unnecessary DIMM replacements.
For the UltraSPARC III family of systems utilizing NG-DIMM memory,
DIMMs suspected of failing because of ECC errors are reportedly being
replaced in the field unnecessarily. This may be caused in part by
Field Engineers' lack of familiarity with what ECC actually is, how
the terms related to ECC are defined, and what criteria determine when
ECC errors are to be considered excessive.
Failure analysis of suspected failed DIMMs returned from the field has
determined that nearly 100% turn out to be NTF. This has a substantial
impact on the resources of Engineering, Operations and Service.
Further, the cost of procuring additional DIMMs in order to maintain
Target Stocking Levels (TSL) in the field is very high.
The following ECC overview should help in providing an understanding
of this issue:
An Overview of ECC
Introduction:
=============
The recent launch of the UltraSPARC III family of systems utilizing
NG-DIMM memory, coupled with recent changes that have been
implemented and released in the Solaris 8 Operating Environment,
has led to some confusion about reported ECC errors and whether
these events are indications of hardware that needs replacement.
This document is intended to give a brief overview of ECC to
explain why these events occur and what action, if any, should be
taken when they do.
The scope of this discussion is limited to soft errors that occur
in memory and how they are reported by Solaris. It does not
account for hard errors or errors that occur while data travels
through the system interconnect. It also does not account for
information made available to the service processor. As such, the
concepts discussed here can be applied to any system, not just
UltraSPARC III family of systems utilizing NG-DIMM memory. For
this discussion, soft errors are transient errors in memory that
can be corrected by rewriting the affected memory cell. Hard
errors occur when a cell is permanently damaged and cannot hold the
correct information. Such a cell may be stuck at either "0" or "1".
ECC Concepts:
=============
Any non-persistent storage device, whether it be Dynamic Random
Access Memory (DRAM) used for main memory or Static Random Access
Memory (SRAM) used for caches, is subject to occasional data loss
caused by the impact of energetic alpha particles or cosmic rays.
This data loss manifests itself as a change in the value stored in
the memory location affected by the collision. Typically only a
single bit is affected, but there is a small probability (<10%)
that multiple cells are upset.
When a bit flips due to this phenomenon, it is referred to as a soft
error. This is to distinguish it from a hard error resulting from
faulty hardware. These soft errors happen at a rate, called the soft
error rate, that can be predicted as a function of the memory density,
the memory technology, and the geographic location of the memory system.
ECC was invented to facilitate the survival of these naturally
occurring losses of data. The concept is that every word of data stored
in memory also has check information stored along with it. This check
information serves two purposes. First, when a word of data is read out
of memory, the check information can be used to detect if any of the
bits of the word have changed, and whether just a single bit has
changed or more than one bit has changed. Second, in the event that a
single bit has changed, the check information can be used to determine
which bit in the word changed and therefore correct the word by
flipping this bit back to its complementary value.
When an ECC check mechanism has detected that one or more bits in a
word of data have changed, this is broadly categorized as an ECC error.
These errors can be further categorized as a function of the number of
bits in error. Because ECC can correct single bit flips, single bit
errors are referred to as Correctable Errors (CEs). Multi-bit errors
are referred to as Uncorrectable Errors (UEs).
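The detect-and-correct behavior described above can be illustrated with
a minimal sketch. It uses an extended Hamming (8,4) code, chosen only
because it is small enough to read. The code actually used by the
memory hardware is not described in this FIN and will differ; the
function names encode and decode are hypothetical.

```python
# Illustrative single-error-correct, double-error-detect code over a
# 4-bit data word. Not the code used by any Sun memory controller.

def encode(data):
    """Return an 8-bit codeword (list of bits) for a 4-bit value."""
    d = [(data >> i) & 1 for i in range(4)]        # d[0] is the LSB
    code = [0] * 8            # index 0: overall parity, 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d         # data bits
    code[1] = code[3] ^ code[5] ^ code[7]          # check bit for positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]          # check bit for positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]          # check bit for positions with bit 2 set
    code[0] = sum(code[1:]) & 1                    # make overall parity even
    return code

def decode(code):
    """Return (status, data); status is 'OK', 'CE' or 'UE'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                        # XOR of positions holding a 1
    odd_parity = sum(code) & 1
    fixed = list(code)
    if syndrome == 0 and not odd_parity:
        status = "OK"                              # no bit changed
    elif odd_parity:
        fixed[syndrome] ^= 1                       # single flip: syndrome names the bit
        status = "CE"
    else:
        return "UE", None                          # two flips: detected, not correctable
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, data

word = encode(0b1011)
word[6] ^= 1                                       # simulate a single-bit soft error
print(decode(word))                                # CE, with the original data recovered
```

Flipping a second bit in the same codeword before it is corrected makes
decode report a UE, which is the situation the scrub operation
described in the next section is meant to prevent.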
Solaris Behavior:
=================
When a CE is detected, the device that read the word and detected the
event can certainly correct the data and continue on unimpeded. However,
this does not address the fact that the referenced word is still
resident in memory uncorrected, i.e., a subsequent read of this word
would result in another CE event. If, over time, this word in memory
is never corrected, the possibility starts to arise that another bit
may flip in the same word. This would lead to a UE event. To avoid
this possibility, the detection of a CE causes a trap to Solaris.
The resultant error handling code scrubs the affected memory word by
writing the corrected word back into memory.
As part of handling the error, Solaris will proceed to log a fair
amount of diagnostic information. One such event log, taken from a
Sun Fire 6800 running Solaris 8, looks like the following:
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 796192 kern.notice] NOTICE:[AFT0]
Corrected system bus (CE) Event on CPU18 at TL=0, errID 0x0000c9b9.19d92690
Oct 25 09:06:25 wpc26 AFSR 0x00000002<CE>.00000097 AFAR 0x00000001.04bdf7d0
Oct 25 09:06:25 wpc26 Fault_PC 0x10024a74 Esynd 0x0097 /N0/SB5/P3/B0/D2 J16500
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 154767 kern.notice] [AFT0] errID
0x0000c9b9.19d92690 Corrected Memory Error on /N0/SB5/P3/B0/D2 J16500 is Persistent
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 682217 kern.notice] [AFT0] errID
0x0000c9b9.19d92690 Data Bit 3 was in error and corrected
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 422650 kern.info] [AFT2] errID
0x0000c9b9.19d92690 E$tag PA=0x00000000.00bdf7c0 does not match
AFAR=0x00000001.04bdf7c0
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 904800 kern.info] [AFT2] errID
0x0000c9b9.19d92690 PA=0x00000000.00bdf7c0
Oct 25 09:06:25 wpc26 E$tag 0x00000000.01000001 E$state_7 Invalid
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data
(0x00)
0x5a8d0016.00000a20 0x20202020.37333231 ECC 0x128
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data
(0x10)
0x37333330.32062c00 0x5a8f000c.00000a20 ECC 0x1f6
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data
(0x30)
0x20202020.37333330 0x34062c00.5a8f000d ECC 0x1fc
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 929717 kern.info] [AFT2] D$ data
not available
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 335345 kern.info] [AFT2] I$ data
not available
For the case of a CE, the lines tagged with AFT0 are the most important
ones to examine. The lines tagged with AFT2 provide data for detailed
diagnostic evaluation, which is not expected to occur in the field.
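When many such logs have to be reviewed, the AFT0 line that names the
DIMM and the disposition can be pulled out mechanically. The following
is a minimal sketch that assumes the message wording shown in the
example above; other platforms or Solaris releases may word the message
differently. The function name parse_ce_events and the pattern are
hypothetical and have only been checked against that one sample.

```python
import re

# Assumes the "Corrected Memory Error on <DIMM> <J-number> is <disposition>"
# wording from the sample log; this is not an official message format.
CE_PATTERN = re.compile(
    r"errID\s+(?P<errid>0x[0-9a-f]+\.[0-9a-f]+)\s+"
    r"Corrected Memory Error on\s+(?P<dimm>\S+)\s+(?P<refdes>\S+)\s+is\s+"
    r"(?P<disposition>Intermittent|Persistent|Sticky)"
)

def parse_ce_events(log_text):
    """Return one dict per 'Corrected Memory Error' message in log_text."""
    flat = " ".join(log_text.split())      # undo line wrapping in the log
    return [m.groupdict() for m in CE_PATTERN.finditer(flat)]

sample = """Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 154767 kern.notice] [AFT0] errID
0x0000c9b9.19d92690 Corrected Memory Error on /N0/SB5/P3/B0/D2 J16500 is
Persistent"""

for event in parse_ce_events(sample):
    print(event["dimm"], event["refdes"], event["disposition"])
```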
Points that need explanation are the following:
1. The event was detected by CPU18 (line 2). All this really means is
that CPU18 is the processor that took the trap, thus invoking the
Solaris CE error handling code.
2. The DIMM containing the affected memory location is /N0/SB5/P3/B0/D2,
which has a reference designator on the Sun Fire system board of
J16500 (lines 4 and 6). This is the important information, not by
itself, but in conjunction with other events reported over time, as
will be described in the next section.
3. Solaris describes this event as Persistent (line 6). The Solaris
error handling code provides a disposition code as a result of the
scrub operation. This disposition is one of Intermittent,
Persistent, or Sticky. The definition of each of these codes is:
o Intermittent means the error was not detected on a reread of the
affected memory location.
o Persistent means the error was detected again on a reread of the
affected memory location but the scrub operation corrected it.
o Sticky means that the error still exists in memory even after the
scrub operation. These events should be investigated further to
determine if some hardware replacement is necessary since this is
indicative of a hard failure.
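The three disposition codes follow from two observations: whether the
error reappears on a reread, and whether it is still present after the
scrub. The sketch below restates the definitions above as a decision
function. It is not the Solaris error handling code, and the name
classify_ce and its parameters are hypothetical.

```python
def classify_ce(error_on_reread, error_after_scrub):
    """Map the two observations to a disposition, per the definitions above."""
    if not error_on_reread:
        return "Intermittent"   # error not seen again on reread
    if not error_after_scrub:
        return "Persistent"     # seen again on reread, but the scrub corrected it
    return "Sticky"             # still present after the scrub: suspect a hard failure

print(classify_ce(error_on_reread=True, error_after_scrub=False))
```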
Servicing Memory Based on Soft Errors:
--------------------------------------
As discussed earlier, soft errors are naturally occurring events. As
such, a single report of a CE should not be the basis for servicing/
replacing a memory device. In fact, one should expect the number of CEs
reported by a system to correlate with the soft error rate that can be
predicted by the amount of memory in the system and the geographic
location of the system. Rather than going through system specific
calculations to determine acceptable soft error rates, the guideline
that is recommended for servicing of memory in the presence of soft
errors is the following:
NOTE: Three or more CEs attributed to the same memory module within
a 24 hour period are not acceptable.
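On systems where CE reports must be reviewed by hand, the guideline can
be checked with a simple sliding window over the logged events. This is
a minimal sketch under stated assumptions: the function name
dimms_exceeding_guideline is hypothetical, the input is a list of
(timestamp, DIMM) pairs gathered from the logs, and two events exactly
24 hours apart are treated as falling outside the same period.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def dimms_exceeding_guideline(events, window=timedelta(hours=24), threshold=3):
    """Return the set of DIMMs with `threshold` or more CEs inside any `window`."""
    recent = defaultdict(deque)            # per-DIMM timestamps still inside the window
    flagged = set()
    for ts, dimm in sorted(events):
        q = recent[dimm]
        q.append(ts)
        while ts - q[0] >= window:         # drop events older than the window
            q.popleft()
        if len(q) >= threshold:
            flagged.add(dimm)
    return flagged

start = datetime(2002, 3, 22, 0, 0)
events = [
    (start, "/N0/SB5/P3/B0/D2"),
    (start + timedelta(hours=5), "/N0/SB5/P3/B0/D2"),
    (start + timedelta(hours=23), "/N0/SB5/P3/B0/D2"),
    (start + timedelta(hours=2), "/N0/SB5/P3/B0/D0"),
]
print(dimms_exceeding_guideline(events))
```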
Effective with Solaris 9 KU1 and Solaris 8 KU16, there is now
functionality implemented in Solaris that notifies the administrators of
excessive CE events.
In Solaris 9 KU1 and Solaris 8 KU16, two new tunables, ecc_softerr_limit
and ecc_softerr_interval, are introduced. A per-DIMM count is kept of
the intermittent CEs that have occurred. This count is decremented
every (ecc_softerr_interval/ecc_softerr_limit) minutes. If the count ever
exceeds ecc_softerr_limit, the following message is printed out:
Mar 22 13:12:31 wpc26 unix: WARNING: [AFT0] 3 soft errors in less than
24:00 (hh:mm) detected from Memory Module U1004
The default values are 1440 minutes (24 hours) for ecc_softerr_interval
and 2 for ecc_softerr_limit. These values can be changed by adding
entries in /etc/system. For example:
set ecc_softerr_interval=2880
set ecc_softerr_limit=4
Note that ecc_softerr_interval is defined in minutes, so the example
above sets a 48 hour interval.
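The counting scheme described above behaves like a leaky bucket. The
following minimal sketch models it for a single DIMM. It is not the
Solaris implementation: the class name SoftErrorBucket is hypothetical,
times are plain numbers in the same unit as ecc_softerr_interval, and
details the text does not state (how the decrement timer is anchored,
whether the count is reset after a warning) are assumptions.

```python
class SoftErrorBucket:
    """Per-DIMM leaky bucket modeled on the description of the two tunables."""

    def __init__(self, limit=2, interval=1440):
        self.limit = limit
        self.leak_period = interval / limit    # time between decrements
        self.count = 0
        self.last_leak = None                  # assumption: timer starts at the first CE

    def record(self, now):
        """Record one CE at time `now`; return True if the warning would be due."""
        if self.last_leak is None:
            self.last_leak = now
        leaks = int((now - self.last_leak) // self.leak_period)
        if leaks > 0:
            self.count = max(0, self.count - leaks)
            self.last_leak += leaks * self.leak_period
        self.count += 1
        return self.count > self.limit         # "exceeds ecc_softerr_limit"

bucket = SoftErrorBucket()
for t in (0, 60, 120):                         # three CEs in quick succession
    warn = bucket.record(t)
print(warn)
```

With the default values the count drains by one every 720 time units, so
three closely spaced CEs push the count past the limit of 2, while CEs
spaced further apart than the drain period never accumulate.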
IMPLEMENTATION:
---
| | MANDATORY (Fully Pro-Active)
---
---
| | CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
---
---
| X | REACTIVE (As Required)
---
CORRECTIVE ACTION:
The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned situation.
For the three categories of ECC memory errors that Solaris reports
(Intermittent, Persistent, Sticky), the following guidelines should be
followed for replacement of DIMMs on Sun Fire Midframe and High-End
servers.
. Intermittent: Check for the reporting of parity errors, otherwise
ignore.
. Persistent: Replace DIMM if a message is output to the console
warning of excess errors.
. Sticky: Replace DIMM on first occurrence.
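The guideline above can be restated as a small lookup for use in a log
review script. This is a sketch of one reading of the guideline, not an
official procedure: the function name recommended_action and the
returned strings are hypothetical, and the result for a Persistent
error without an excess-error warning is inferred, since the guideline
does not state it explicitly.

```python
def recommended_action(disposition, excess_error_warning=False):
    """Return the action suggested by the guideline for one reported CE."""
    kind = disposition.capitalize()
    if kind == "Sticky":
        return "replace DIMM"                          # replace on first occurrence
    if kind == "Persistent":
        if excess_error_warning:
            return "replace DIMM"                      # console warned of excess errors
        return "no replacement indicated"              # inferred, not stated in the FIN
    if kind == "Intermittent":
        return "check for parity errors, otherwise ignore"
    raise ValueError("unknown disposition: " + disposition)

print(recommended_action("Persistent", excess_error_warning=True))
```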
COMMENTS:
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------
Copyright (c) 1997-2003 Sun Microsystems, Inc.