Document fins/I0887-1
FIN #: I0887-1
SYNOPSIS: Guidelines for understanding and diagnosing UltraSPARC III Level 2
(L2) SRAM Cache Memory Errors
DATE: Oct/02/02
KEYWORDS: Guidelines for understanding and diagnosing UltraSPARC III Level 2
(L2) SRAM Cache Memory Errors
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
SYNOPSIS: Guidelines for understanding and diagnosing UltraSPARC III
Level 2 (L2) SRAM Cache Memory Errors.
SunAlert: No
TOP FIN/FCO REPORT: No
PRODUCT_REFERENCE: UltraSPARC III Level 2 SRAM
PRODUCT CATEGORY: Server / Service
PRODUCTS AFFECTED:
Systems Affected:
-----------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- A28 ALL Sun Blade 1000 -
- A29 ALL Sun Blade 2000 -
- A35 ALL Sun Fire 280R -
- A37 ALL Sun Fire V480 -
- A30 ALL Sun Fire V880 -
- S8 ALL Sun Fire 3800 -
- S12 ALL Sun Fire 4800 -
- S12i ALL Sun Fire 4810 -
- S24 ALL Sun Fire 6800 -
- F12K ALL Sun Fire 12K -
- F15K ALL Sun Fire 15K -
- N28 ALL Netra 20 -
X-Options Affected:
-------------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
X4007A - - ASSY CPU-4PROC USIIIP 900MHz -
X4525A - - MAXCPU 900MHz CNFIG F15K -
X4004A - - CPU-2PROC USIII 750MHz -
X4005A - - CPU-4PROC USIII 900MHz -
X4006A - - CPU-2PROC USIIIP 900MHz -
X4046A - - CPU DUAL 750MHz AL A30 -
X4047A - - CPU DUAL 750MHz AL A30 -
XCPUBD-4049 - - CPU-4GB/4PROC USIII 900+M -
XCPUBD-F4089 - - CPU-8GB/4PROC USIII 900+M -
XCPUBD-F4169 - - CPU-16GB/4PROC USIII 900+M -
XCPUBD-F4329 - - CPU-32GB/4PROC USIII 900+M -
XCPUBD-2029 - - CPU-2GB/2PROC USIII 900+M -
XCPUBD-2049 - - CPU-4GB/2PROC USIII 900+M -
XCPUBD-2089 - - CPU-8GB/2PROC USIII 900+M -
SF-XCPUBD-227 - - CPU-2GB/2PROC USIII 750MHz -
SF-XCPUBD-447 - - CPU-4GB/4PROC USIII 750MHz -
SF-XCPUBD-487 - - CPU-8GB/4PROC 512MB USIII -
PART NUMBERS AFFECTED:
Part Number Description Model
----------- ----------- -----
540-5052-02 or below ASSY CPU-4PROC USIIIP 900+ MHz -
540-4729-04 or below ASSY CPU-2PROC USIII 750MHz -
540-4730-04 or below ASSY CPU-4PROC USIII 750MHz -
540-5051-02 or below ASSY CPU-2PROC USIIIP 900+ MHz -
501-5818-06 or below ASSY CPU DUAL 750MHz AL A30 -
540-4934-03 or below ASSY CPU-4GB/4PROC USIII 900+ MHz -
540-4992-02 or below ASSY CPU-8GB/4PROC USIII 900+ MHz -
540-4990-03 or below ASSY CPU-16GB/4PROC USIII 900+ MHz -
540-4993-02 or below ASSY CPU-32GB/4PROC USIII 900+ MHz -
540-4984-02 or below ASSY CPU-2GB/2PROC USIII 900+ MHz -
REFERENCES:
BugId: 4484338 - Need to improve handling of correctable errors.
4741848 - Invalid AFSR message is only HW diagnostic on VSP
US-III platforms.
PatchId: 108528: SunOS 5.8: kernel update patch.
112233: SunOS 5.9: Kernel Patch.
FIN: I0856-1: UltraSPARC III and III+ based platforms could be
susceptible to UCC errors that may cause system
panics.
SUN Alert: 45527
URL: http://sram.eng/MTG/Quality/24hr_estimate_3.pdf
http://onestop/ecache/index.shtml?menu
PROBLEM DESCRIPTION:
This FIN provides vital information to Sun employees, especially
those in Sun Support Services, and to Authorized Sun Service Provider
employees regarding two major points:
1. Using the correct terminology associated with UltraSPARC III
Level 2 (L2) SRAM memory errors.
2. The number of such correctable errors that can occur in either
Sun Fire 4MB or 8MB Level 2 (L2) cache modules before there is cause
for concern.
The underlying concern is that such memory errors may have been
incorrectly identified as "E-cache" errors or, upon detection, have
caused unnecessary replacement of the entire main system board
module, since the Level 2/SRAM cache module is not a field replaceable
unit. Replacing a perfectly good main system board module is likely to
create new problems, causing additional downtime for the customer and
extra hardware costs to Sun.
This issue can occur in the following releases:
UltraSPARC-III / UltraSPARC-III+ family:
Solaris 8
Solaris 9
Detection:
----------
Correctable Level 2/SRAM cache errors are indicated by the following
AFT0 error events: UCC Event, EDC Event, CPC Event, WDC Event. By
default, starting in S8 KU-16 and S9U1, these events are logged by all
platforms to /var/adm/messages.
Uncorrectable Level 2/SRAM cache errors are indicated by the following
AFT1 error events: UCU Event, EDU:BLD Event, EDU:ST Event, CPU Event,
WDU Event. By default these events are logged to the console and
/var/adm/messages on all platforms for all releases of S8 and S9.
For a description of these events, see below or Infodoc 43642, which
describes in detail all Level 1 cache (L1$), Level 2/SRAM cache (L2$) and
Memory errors of the UltraSPARC-III family processors.
For Sun Blade 1000, Sun Blade 2000, Sun Fire 280R, Sun Fire V480 and
Sun Fire V880 platforms prior to S8 KU-16 and S9U1, correctable Level
2/SRAM cache messages were not logged to either the console or
/var/adm/messages. The only symptoms of a failing Level 2/SRAM cache
module will typically be: "Invalid AFSR" messages, "UCC Event at
TL>0" messages and/or "ptl1 panic" events. However, these messages
may also indicate a failing memory DIMM, a failing CPU or a failing
system board. To diagnose errors on these platforms, it is recommended
to either upgrade to S8 KU-16 or S9U1 or to set the configuration
variable "ce_verbose" to 1 in /etc/system.
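As a rough aid for checking whether the setting described above is
already present, the following Python sketch scans the text of an
/etc/system file for an uncommented "set ce_verbose" line. The function
name and the sample text are illustrative only, and the sketch assumes
that a later "set" line overrides an earlier one; it is not a Sun tool.

```python
import re

# Matches a line such as "set ce_verbose=1" or "set ce_verbose = 0x1".
_SET_RE = re.compile(r"^\s*set\s+ce_verbose\s*=\s*(\S+)")


def ce_verbose_enabled(system_text):
    """Return True if the last uncommented 'set ce_verbose' line sets it to 1."""
    value = None
    for line in system_text.splitlines():
        if line.lstrip().startswith("*"):    # '*' starts a comment in /etc/system
            continue
        match = _SET_RE.match(line)
        if match:
            try:
                value = int(match.group(1), 0)   # base 0 accepts 1, 0x1, ...
            except ValueError:
                value = None                     # unparseable value: treat as unset
    return value == 1


# Hypothetical file contents, for illustration only.
sample = """* enable verbose correctable-error logging
set ce_verbose=1
"""
print(ce_verbose_enabled(sample))
```

On a live system the text would come from reading /etc/system itself;
a change to that file normally takes effect only after a reboot.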
SRAM Level 2 Cache Memory Terminology Explained:
------------------------------------------------
. The UltraSPARC III (USIII) processor has cache memory on the chip;
64KB 4-way associative for data and 32KB 4-way associative for
instruction, along with 2KB write buffer and 2KB prefetch. This
memory is called Level 1 (L1) as it is the first level that the
processor uses.
. Level 2 (L2) cache for the USIII processor refers to off-chip cache;
it is not on the processor itself.
. The USIII architecture has both Error Detection and Error Correction
Code (ECC).
. An ECC event in USIII is reported simply as an ECC event in which a
certain bit was corrected. This can occur for either the cache or the
main DRAM memory. Certain details in the error message allow for
determining whether the event occurred in Level 2/SRAM cache or in
main memory.
. There is no Level 3 cache for the USIII processor line. One can
think of the memory as hierarchical, with the processor having
different levels that it accesses. The first level is closest to the
processor, and can be accessed the fastest. The second level is
farther away, and takes more time. In Sun's case, there is no 3rd
level, but the next step is accessing main memory, which takes many
cycles.
Types of L2/SRAM Cache Errors:
------------------------------
Level 2/SRAM cache errors are either single-bit, which are
correctable, or multi-bit, which are uncorrectable. Level 2/SRAM
cache errors are reported depending upon how the processor detects
the error. The USIII architecture uses an extremely robust error
correcting code called SEC-DED (single error correct - double error
detect) that minimizes the impact of Level 2/SRAM cache and memory
errors. A single bit error in a 144 bit checkword is corrected and a
double bit error is detected and trapped in the processor. An entire
576 bit word consists of 4 checkwords, so up to 4 independent bit
errors (one per checkword) are handled by the SEC-DED code without
any impact to data integrity.
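The checkword arithmetic above can be illustrated with a small Python
sketch. It assumes, purely for illustration, that the four checkwords
occupy contiguous 144-bit ranges of the 576-bit word; the actual bit
layout is not given in this FIN. The function name and the outcome
strings are hypothetical.

```python
from collections import Counter

CHECKWORD_BITS = 144   # from the text: one SEC-DED checkword
WORD_BITS = 576        # from the text: 4 checkwords per word


def secded_outcome(error_bits):
    """Classify a set of flipped bit positions (0..575) in one 576-bit word."""
    per_checkword = Counter()
    for bit in set(error_bits):
        if not 0 <= bit < WORD_BITS:
            raise ValueError(f"bit {bit} outside 0..{WORD_BITS - 1}")
        # Assumption: checkword k covers bits 144*k .. 144*k + 143.
        per_checkword[bit // CHECKWORD_BITS] += 1
    worst = max(per_checkword.values(), default=0)
    if worst == 0:
        return "clean"
    if worst == 1:
        return "corrected"                  # at most one bad bit per checkword
    if worst == 2:
        return "detected, uncorrectable"    # double-bit error in one checkword
    return "beyond SEC-DED guarantees"      # three or more bits in one checkword


print(secded_outcome([3, 150, 300, 500]))   # one flipped bit in each checkword
print(secded_outcome([3, 4]))               # two flipped bits in the same checkword
```

Four single-bit errors are harmless only when they land in four
different checkwords; two errors in the same checkword are already
uncorrectable.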
The errors reported in the Asynchronous Fault Status Registers
(AFSR) are:
1. UCC - SW correctable Level 2/SRAM cache ECC error for instruction
fetch or data access other than block load. Some paths
accessing Level 2/SRAM cache do not have sufficient time for
the ECC algorithm to present corrected data to the processor.
In these instances, software must intervene and flush Level
2/SRAM cache and perhaps D$ to ensure correct operation.
2. UCU - Uncorrectable Level 2/SRAM cache error for instruction fetch
or data access other than block load.
3. EDC - HW corrected Level 2/SRAM cache ECC error for store merge or
block load.
4. EDU - Uncorrectable Level 2/SRAM cache ECC error for store merge or
block load. In most cases, software can differentiate between
an error from a store merge, which it indicates with EDU:ST,
and an error from a block load, which it indicates with
EDU:BLD.
5. WDC - HW corrected Level 2/SRAM cache ECC error for writeback
(victimization).
6. WDU - Uncorrectable Level 2/SRAM cache ECC error for writeback
(victimization).
7. CPC - HW corrected Level 2/SRAM cache ECC error for copyout
(snoop request).
8. CPU - Uncorrectable Level 2/SRAM cache ECC error for copyout
(snoop request).
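The eight AFSR error types above can be summarized as a lookup table.
The Python sketch below transcribes the list and adds a helper that
pulls the error code out of a message line. It assumes that a logged
event shows the code directly before the word "Event" (as in the event
names given under Detection, e.g. "EDU:ST Event"); the real message
layout may differ, and the function name is hypothetical.

```python
import re

# code: (correctable?, access path), transcribed from the list above
AFSR_L2_ERRORS = {
    "UCC": (True, "instruction fetch or data access other than block load"),
    "UCU": (False, "instruction fetch or data access other than block load"),
    "EDC": (True, "store merge or block load"),
    "EDU": (False, "store merge or block load"),
    "WDC": (True, "writeback (victimization)"),
    "WDU": (False, "writeback (victimization)"),
    "CPC": (True, "copyout (snoop request)"),
    "CPU": (False, "copyout (snoop request)"),
}

# Assumption: the code precedes " Event", optionally with an :ST or :BLD suffix.
_EVENT_RE = re.compile(
    r"\b(UCC|UCU|EDC|EDU|WDC|WDU|CPC|CPU)(?::(?:ST|BLD))? Event"
)


def classify(message):
    """Return (code, correctable, path) for a message line, or None."""
    match = _EVENT_RE.search(message)
    if match is None:
        return None
    code = match.group(1)
    correctable, path = AFSR_L2_ERRORS[code]
    return code, correctable, path


# Schematic strings for illustration; not real Solaris log output.
print(classify("[AFT0] UCC Event"))
print(classify("[AFT1] EDU:ST Event"))
```

The codes ending in C are the correctable (AFT0) events and the codes
ending in U are the uncorrectable (AFT1) events.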
Causes of SRAM Errors
=====================
For an extensive discussion on the causes of Level 2/SRAM cache memory
errors, please reference the white paper: "Estimate of Threshold for
ECC in Serengeti L2 SRAM Modules", C. Slayman, 08/20/02
This document can be found at:
http://sram.eng/MTG/Quality/24hr_estimate_3.pdf
IMPLEMENTATION:
---
| | MANDATORY (Fully Proactive)
---
---
| | CONTROLLED PROACTIVE (per Sun Geo Plan)
---
---
| X | REACTIVE (As Required)
---
CORRECTIVE ACTION:
The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the above
mentioned problem.
It is recommended that all UltraSPARC-III / UltraSPARC-III+ systems
be upgraded to these kernel patch levels:
Solaris 8 108528 or later
Solaris 9 112233 or later
The following information should be used when determining what action
to take when Level 2/SRAM Cache Memory Errors occur.
RECOMMENDATIONS:
================
1. Don't replace a module that shows only one or two XXC errors
in a 24hr period:
It is difficult to determine if two errors on the same Level 2/SRAM
cache module are due to a multi-cell cosmic ray event or the onset
of a different failure mechanism. Therefore, replacement of such a
module runs the risk of creating more downtime at the customer site
by swapping out a motherboard with perfectly good components.
Unless a diagnostic program can be run to determine that the two
errors ARE NOT from nearest neighbor cells, it is recommended that
the module not be replaced. (Conversely, if it can be verified that
the two errors are indeed independent or come from different SRAMs
on the same module, this cannot be explained by cosmic ray
phenomena and the reliability of the Level 2/SRAM cache module
should be viewed as suspect).
2. Take a closer look at three XXC errors in a 24hr period and
replace if necessary:
Three errors on the same Level 2/SRAM cache module within a 24hr
period are highly unlikely to result from a cosmic ray event: ~10 to
100 times less frequent than a double-cell event and ~100 to 10,000
times less frequent than a single-cell upset. Therefore, the risk of
replacing a perfectly good module using this criterion is much
lower. Please note that only XXC errors that have unique error ids
should be considered separate errors. For example: A UCC/WDC pair
with the same error id should be considered as one error and not two.
3. Take a very close look at all XXU errors:
Cosmic ray events can only corrupt single bits in each checkword.
This is handled by SEC/DED code and should not lead to any system
crash. Not every XXU will lead to a system crash, since the system
is able to recover from many of them. Nonetheless, every XXU is
serious and should be attended to promptly. Until we have software
that does this automatically, it would be advisable to scan
/var/adm/messages periodically (weekly, daily or even every few
hours depending on the severity of the problem) to see if any
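recent XXU events (UCU, WDU, EDU or CPU) have occurred. One
possible command to do this would be the following, where the "EDU:"
alternative catches EDU:ST and EDU:BLD events, whose names do not
contain the string "U Event":
egrep AFT /var/adm/messages /var/adm/messages.0 | egrep "U Event|EDU:"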
If any are found, then the full context of the message should be
examined (not just the lines printed by the command above, but
the complete set of associated messages in /var/adm/messages) to
see what board or module should be replaced.
4. Watch for patterns in the errors:
Cosmic ray SER is random in space and time, so all Level 2/SRAM
cache modules are equally likely to be hit. If there appears to be a
particular module, motherboard or bit that is showing a higher error
rate or the errors appear to be occurring at a particular part of
the day, then the events are not cosmic ray induced (unless the time
is correlated to scrubber operation which exercises 100% of
memory).
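The counting rules in recommendations 1 through 3 can be sketched in
Python as follows. This is only an illustration under stated
assumptions: events are supplied as (timestamp, module, code, error id)
tuples that have already been extracted from the logs, error ids are
compared per module, and the module names and advice strings are
hypothetical. It does not replace reading the full message context.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)


def triage(events):
    """events: iterable of (timestamp, module, code, error_id).

    Returns {module: advice string} following recommendations 1 to 3.
    """
    urgent = set()
    first_seen = {}                        # (module, error_id) -> earliest time
    for ts, module, code, error_id in events:
        base = code.split(":", 1)[0]       # "EDU:ST" -> "EDU"
        if base.endswith("U"):             # XXU: always serious
            urgent.add(module)
        elif base.endswith("C"):           # XXC: count distinct error ids only
            key = (module, error_id)
            if key not in first_seen or ts < first_seen[key]:
                first_seen[key] = ts
    times_by_module = {}
    for (module, _), ts in first_seen.items():
        times_by_module.setdefault(module, []).append(ts)
    advice = {}
    for module, times in times_by_module.items():
        times.sort()
        best = start = 0
        for end, ts in enumerate(times):   # sliding 24hr window over sorted times
            while ts - times[start] > WINDOW:
                start += 1
            best = max(best, end - start + 1)
        advice[module] = ("take a closer look; replace if necessary"
                          if best >= 3 else "do not replace on this evidence")
    for module in urgent:
        advice[module] = "XXU seen: examine full message context promptly"
    return advice


# Hypothetical events for illustration only.
t0, hour = datetime(2002, 10, 2), timedelta(hours=1)
print(triage([
    (t0, "m0", "UCC", "id1"),
    (t0 + hour, "m0", "WDC", "id1"),       # same error id: counted once
    (t0 + 2 * hour, "m0", "EDC", "id2"),
]))
```

A module with a UCC and a WDC that share one error id, plus one further
XXC error, therefore counts as two errors and stays below the
three-error threshold.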
COMMENTS:
None
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------
Copyright (c) 1997-2003 Sun Microsystems, Inc.