Document Audience: | INTERNAL |
Document ID: | I0805-1 |
Title: | DIMMs are being unnecessarily replaced on Enterprise 10000 servers |
Copyright Notice: | Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved |
Update Date: | 2004-01-07 |
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
FIN #: I0805-1
Synopsis: DIMMs are being unnecessarily replaced on Enterprise 10000 servers
Create Date: Oct/30/02
Keywords:
DIMMs are being unnecessarily replaced on Enterprise 10000 servers
SunAlert: No
Top FIN/FCO Report: No
Products Reference: DIMMs on Enterprise 10000 servers
Product Category: Server / Service
Product Affected:
Systems Affected
------- ---------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- E10000 ALL Ultra Enterprise 10000 Server -
- E10000-HPC ALL Ultra Enterprise 10000 HPC -
X-Options Affected
------------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
X7023A E10000-X ALL OPT MEMORY 1GB (8 x 128MB DIMMS) -
X7022A E10000-X ALL OPT MEMORY 256MB (8 x 32MB DIMMS) -
Parts Affected:
Part Number Description Model
----------- ----------- -----
501-2654-01 128 MB DIMM DRAM 16Mx72 60ns -
501-2653-01 32 MB DIMM DRAM 4Mx72 60ns -
References:
URL: http://bestpractices.central/bestpractices_guide_memory_errors.pdf
URL: http://esp.west/starfire/post/redxintro.html
Issue Description:
Significant numbers of Dual In-Line Memory Modules (DIMMs) for Enterprise 10000
(E10000) servers are being returned from the field. However, upon failure
analysis (FA), the most common DIMM diagnosis is No Trouble Found (NTF). In
calendar year 2001, over 70% of E10000 DIMM returns were diagnosed NTF. The
intent of this FIN is to provide Sun Service an overview of Error Correcting
Codes (ECC), to give criteria for replacing DIMMs, to reduce unnecessarily
replaced DIMMs, and increase system reliability by reducing service actions.
It is also intended to reduce the number of NTF parts by emphasizing to Sun
Service the necessity of returning verified failures with the actual error
messages encountered to assist in FA. One of the causes of these unnecessary
returns is believed to be a lack of information provided to Sun Service on what
ECC is, what the definitions of different terms related to ECC are, and what
the criteria are for determining when ECC errors are considered excessive. The
following ECC overview should help in providing an understanding of this issue:
--------------------
| An Overview of ECC |
--------------------
Introduction
------------
The scope of this discussion is limited to soft and hard errors that
occur in memory and how they are reported by Solaris. It does not
account for errors that occur while data travels through the E10000
interconnect, CPU Module, or I/O. For this discussion, soft errors
are transient or temporary errors in memory that can be corrected by
rewriting the affected memory cell. Hard errors occur when a cell
is permanently damaged and cannot hold the correct information. With
a hard error, the cell can be permanently stuck at "0" or "1".
ECC Concepts
------------
Any volatile storage medium, whether it be the Dynamic Random Access
Memory (DRAM) used on main memory DIMMs or Static Random Access Memory
(SRAM) mainly used for caches, is subject to occasional natural
incidences of data loss due to the impact of alpha particles or cosmic
rays. This data loss manifests itself in the changing of the value
stored in the memory cell affected by the collision. Typically only a
single bit is affected, but there is a small probability that multiple
cells can be upset.
When a bit flips due to this phenomenon, it is referred to as a soft
error. This is to distinguish it from a hard error resulting from a
hardware failure. These soft errors happen at a rate, called the soft
error rate (SER), that can be predicted as a function of the memory
density, the memory technology, and the altitude of the system in which
the memory resides.
ECC was invented to allow survival from these naturally occurring
losses of data. The ECC method used on the E10000 is called a Single
Error Correcting, Double Error Detecting code (SEC-DED). The concept is
that every word of data is written to memory along with a number of
extra check bits. When the word is read back from memory, a fresh set
of check bits is computed and compared with the check bits that were
stored in memory. The result of this comparison is called the syndrome.
If the syndrome is zero, the comparison was identical, and thus the
data is good. A non-zero syndrome means the data is in error, and the
syndrome is used to find a single bit in error and correct it. A
single bit error is called a Correctable Error (CE). The syndrome can
also detect if two bits are in error, but it does not have enough
information to identify which two bits. This type of error is called
an Uncorrectable Error (UE). UltraSPARC microprocessors use a SEC-DED
variant called S4ED that also can detect, but not correct, three or
four bit errors if they are clustered within a four bit nibble.
Table 10-2 in the document specified by the URL below shows how the
syndrome is used to identify the bit in error or determine if multiple
bits are in error. Solaris does this table look-up work for you so you
don't have to, but the information in the table is interesting if you
are curious about what type of memory error occurred in an E10000.
http://sun-www.central.sun.com/microelectronics/manuals/805-0168.pdf
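The check bit and syndrome mechanism described above can be illustrated
with a toy SEC-DED code. The sketch below is a teaching example only and
rests on stated assumptions: it protects 4 data bits with an extended
Hamming (8,4) code, not the 64 data bit plus 8 check bit word used on
E10000 DIMMs, and its syndrome values do not correspond to Table 10-2.
The function names encode and check are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into a toy 8-bit SEC-DED codeword (list of bits).

    Index 0 holds an overall parity bit; indexes 1..7 are the classic
    Hamming positions (1, 2 and 4 are check bits; 3, 5, 6 and 7 are data).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def check(code):
    """Recompute the check bits and return (status, codeword).

    status is "OK", "CE" (single bit error, corrected) or "UE"
    (two bits in error, detected but not correctable).
    """
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_parity_bad = sum(code) % 2 == 1
    if syndrome == 0 and not overall_parity_bad:
        return "OK", code
    if overall_parity_bad:
        # Exactly one bit flipped; the syndrome is its position
        # (a zero syndrome here means the overall parity bit itself).
        fixed = list(code)
        fixed[syndrome] ^= 1
        return "CE", fixed
    # Non-zero syndrome with good overall parity: two bits flipped.
    return "UE", code
```

Flipping any one bit of an encoded word yields "CE" and the original
word back; flipping two bits yields "UE", which mirrors the CE and UE
distinction drawn above.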
--------------
| SSP Behavior |
--------------
All E10000 System Service Processor (SSP) patches are mandatory and
can adversely affect memory error reporting if not installed. A list
of patches for the version of the SSP software you are running is
available at:
http://cpre-amer.west/esg/hsg/starfire/patches.html
The synopsis of the SSP patches normally just lists one of potentially
several bugs fixed by the patch. Do not ignore the patch just because
your customer has not encountered the one bug listed in the synopsis.
Correctable Errors
------------------
The SSP generates a Recordstop file if the E10000 encounters a CE.
The Recordstop cannot identify the exact DIMM where the CE occurred,
but narrows it down to two possible DIMMs. Solaris is responsible
for identifying the exact DIMM experiencing the error.
Uncorrectable Errors
--------------------
The SSP generates a Recordstop file if the E10000 encounters a UE.
The Recordstop cannot identify the exact DIMM where the UE occurred,
but narrows it down to two possible DIMMs. Solaris is responsible
for identifying the exact failing DIMM.
------------------
| Solaris Behavior |
------------------
Ensure you have the updates to the version of Solaris you are running
that include main memory scrubbing and improved error messaging. See
FIN I0616-1 at:
http://sunsolve.Central.Sun.COM/cgi/retrieve.pl?type=0&doc=fins/I0616-1
for details. It is an E10000 requirement to use a version of Solaris
that runs the main memory scrubber.
Correctable Errors
------------------
When a CE is detected, the device that read the word and detected
the error can correct the data read and continue on unimpeded.
However, this does not address the fact that the referenced word
could still be resident in memory uncorrected (i.e. a subsequent
read of this word could result in another CE event). If, over
time, this word in memory is never corrected, the possibility
starts to arise that another bit may flip in the same word. This
would lead to a UE event which will result in a loss of system
service (See Uncorrectable Error discussion below). To avoid this
possibility, the detection of a CE causes a trap to Solaris. The
Solaris error handling code logs the error and scrubs the affected
memory word by writing the corrected word back into memory.
Uncorrectable Errors
--------------------
If a UE is detected, the device that read the word and detected the
error cannot correct the data and continue on. A UE will cause
Solaris to panic if the UE was in kernel memory, or cause a kill of
the particular user process that contained the memory in error and
then an orderly shutdown and reboot to protect the other processes
in the domain.
Memory Scrubber
---------------
Solaris also runs a memory "scrubber" routine as part of its normal
operation. This scrubber doesn't do anything special besides
ensure every memory location is accessed at least once every 12
hours. If the access finds a CE, then the normal trap to Solaris
that occurs for any CE will scrub the affected memory word by
writing the corrected word back into memory and log the event. This
ensures that multiple CEs do not have time to build up and form a
UE at memory locations that are infrequently accessed.
Solaris Error Messages
----------------------
As part of handling the error, Solaris will proceed to log a fair
amount of diagnostic information. One such error message, taken from
the /var/adm/messages file of an E10000 domain running Solaris 8, looks
like the following:
Feb 4 18:21:50 cod-b0 SUNW,UltraSPARC: [ID 787962 kern.notice]
[AFT0] Corrected Memory Error on CPU31, errID 0x0003a08d.15fec176
Feb 4 18:21:50 cod-b0 AFSR 0x00000000.00100000
AFAR 0x00000016.71bb89a8
Feb 4 18:21:50 cod-b0 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00
Fault_PC 0x10024df4
Feb 4 18:21:50 cod-b0 UDBL Syndrome 0x6e Memory Module Board# 4
Bank# 2 P# P15 MM 2_3
Feb 4 18:21:50 cod-b0 SUNW,UltraSPARC: [ID 218875 kern.notice]
[AFT0] errID 0x0003a08d.15fec176
Corrected Memory Error on Board# 4 Bank# 2 P# P15 MM 2_3 is
Persistent
Feb 4 18:21:50 cod-b0 SUNW,UltraSPARC: [ID 758418 kern.notice]
[AFT0] errID 0x0003a08d.15fec176
ECC Data Bit 56 was in error and corrected
Points that need explanation are the following:
. Asynchronous Fault Trap 0 (AFT0) messages are for correctable or
survivable errors such as CE memory errors. AFT1 messages are for
uncorrectable or non-survivable errors (i.e. errors that usually
cause Solaris to panic) such as UE memory errors. AFT2 and AFT3
messages are for additional diagnostic information, such as cache
line dumps.
. The event was detected by CPU31. All this means is that CPU31 is the
processor that took the trap, thus invoking the Solaris error handling
code.
. Contents of the Asynchronous Fault Status Register (AFSR) and
Asynchronous Fault Address Register (AFAR) along with the E-cache tag
parity syndrome (AFSR.ETS) and the data parity syndrome (AFSR.PSYND)
are given. These are CPU parity syndromes, not the SEC-DED syndrome
used on the DIMMs. They should be zero if the error was on the
DIMM. (The "Score" is used if multiple AFSR parity error messages
are reported. The highest score is the most likely originator of
the parity error. The "Score" is unrelated to errors involving
DIMMs.)
. The UltraSPARC Data Buffer Lower Error Register (UDBL) ECC syndrome is
the syndrome used to detect and correct errors on DIMMs. This syndrome
is decoded from a table and the last line of the error message
indicates that this was done by Solaris and bit 56 was found to be
in error.
. The DIMM containing the affected memory word is on: Board# 4 Bank# 2
MM 2_3. This is not important information by itself, because we have
not determined if the error is soft or hard, or if the DIMM is the
cause of the condition or another component was the cause. A "P"
number is also given to identify the DIMM. The DIMMs on a memory
mezzanine are numbered from 1 to 32. P15 is just another method of
saying MM 2_3. See FIN# I0396-1 for a P to MM table:
http://sunsolve.Central.Sun.COM/cgi/retrieve.pl?type=0&doc=fins/I0396-1
. Solaris describes this event as "Persistent" even though the next error
message clearly indicates the error has been corrected and does not
persist. The choice of the word persistent in this context causes
confusion and can cause Sun Service to incorrectly remove a DIMM.
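When many such messages have to be reviewed, the DIMM location and the
disposition can be pulled out of the log text mechanically. The sketch
below is illustrative only: the regular expression is written against
the message wording shown in this FIN, and the names CE_RE and parse_ce
are hypothetical. Other Solaris releases may word the message
differently.

```python
import re

# Matches the summary line of a corrected memory error as shown in this
# FIN, e.g. "Corrected Memory Error on Board# 4 Bank# 2 P# P15 MM 2_3 is
# Persistent". The line that names the detecting CPU does not match.
CE_RE = re.compile(
    r"Corrected Memory Error on Board#\s*(\d+) Bank#\s*(\d+) "
    r"P#\s*(P\d+) MM (\d+_\d+) is (Intermittent|Persistent|Sticky)"
)


def parse_ce(text):
    """Return one dict per corrected memory error summary found in text."""
    # Log lines are often wrapped, so collapse all whitespace first.
    flat = " ".join(text.split())
    return [
        {
            "board": int(board),
            "bank": int(bank),
            "p": p,
            "mm": mm,
            "disposition": disposition,
        }
        for board, bank, p, mm, disposition in CE_RE.findall(flat)
    ]
```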
The Solaris error handling code provides a disposition code as a
result of the scrub operation. This disposition is one of "Intermittent",
"Persistent", or "Sticky". The definition of each of these codes is:
Intermittent - Means the error was not detected on a reread of
------------ the affected memory word. "Intermittent" is also not
the best choice of words because it implies that
this same error can be expected to manifest itself
at irregular intervals. This CE is more commonly
known as a transient soft error. No DIMM with this
sort of error can be considered for replacement
without first examining the soft error rate (SER) of
this DIMM and the System Service Processor (SSP)
Recordstop files to be certain that the memory
caused this error. A step by step procedure to
accomplish this is given in this FIN's CORRECTIVE
ACTION heading.
Persistent - Means the error was detected again on a reread of
---------- the affected memory word but the scrub operation
corrected it. This CE is more commonly known as a
temporary soft error. No DIMM with this sort of error
can be considered for replacement without first
examining the SER of this DIMM and the SSP Recordstop
files to be certain that the memory caused this
error. A step by step procedure to accomplish this is
given in this FIN's CORRECTIVE ACTION heading.
Sticky - Means that the error still exists in memory even after
------ the scrub operation. These events should be immediately
investigated to determine if some hardware replacement is
necessary since this is indicative of a hard error. This
CE is more commonly known as a stuck-at hard error. A
DIMM with a "Sticky" CE should be considered for
replacement after first examining the SSP Recordstop files
to be certain that memory caused this error. A step by
step procedure to accomplish this is given in this FIN's
CORRECTIVE ACTION heading.
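The three disposition codes defined above can be summarized as a small
decision procedure. The following sketch is a simplified model, not the
Solaris error handling code: MemoryWord and classify are hypothetical
names, the "expected" argument stands in for the ECC-corrected value,
and a hard fault is modeled as a single bit stuck at 1.

```python
class MemoryWord:
    """Toy memory word; stuck_bit models a cell permanently stuck at 1."""

    def __init__(self, value, stuck_bit=None):
        self.value = value
        self.stuck_bit = stuck_bit

    def read(self):
        if self.stuck_bit is not None:
            return self.value | (1 << self.stuck_bit)
        return self.value

    def write(self, value):
        self.value = value


def classify(word, expected):
    """Classify a CE that was just detected on this word."""
    if word.read() == expected:
        # Error not seen on reread.
        return "Intermittent"
    word.write(expected)  # scrub: write the corrected word back
    if word.read() == expected:
        # Seen again on reread, but the scrub cleared it.
        return "Persistent"
    # Still wrong after the scrub: indicative of a hard error.
    return "Sticky"
```

In this model a word that reads back correctly is "Intermittent", a
word holding a flipped bit that the scrub repairs is "Persistent", and
a word with a stuck cell is "Sticky".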
As discussed earlier, soft errors are naturally occurring events. We
have also indicated it is possible for the phenomenon that causes single
bit soft CEs to cause multiple bit soft UEs. Since the consequences
of UEs are significant, and the occurrence of soft UEs induced by
natural causes rare, it is best not to take chances and always to
replace a DIMM that is responsible for a UE. Conversely, a single
report of a soft CE should not be the basis for replacing a memory
device. In fact, one should expect the number of soft CEs reported by a
system to correlate with the SER that can be predicted by the amount of
memory in the system and the altitude of the system. Rather than going
through system specific calculations to determine acceptable SER, the
recommendation for the servicing of DIMMs in the presence of CEs along
with UEs is outlined below.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! o Remove a DIMM for soft CEs (Intermittent or Persistent) only if !
! three or more soft CEs can be definitively attributed to the same !
! DIMM within a 24 hour period. !
! !
! o Remove a DIMM for a hard CE (Sticky) if just one hard CE can be !
! definitively attributed to a DIMM. !
! !
! o Remove a DIMM for a UE if just one UE can be definitively attributed !
! to a DIMM. !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
See the "CORRECTIVE ACTION" heading of this FIN for a procedure on how
to definitively determine that the DIMM and not some other component is
the source of the error. Examples of determining if the DIMM is
responsible are in this heading under the "Diagnosing Memory Errors"
section below.
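The three removal criteria in the box above can be written as a short
check. This is a minimal sketch under stated assumptions: the function
name should_consider_replacement is hypothetical, the input must contain
only events already definitively attributed to one DIMM, and treating
exactly 24 hours as inside the window is an assumption.

```python
from datetime import datetime, timedelta


def should_consider_replacement(events):
    """events: list of (timestamp, kind) for ONE DIMM.

    kind is "Intermittent", "Persistent", "Sticky" or "UE".
    """
    # One hard CE (Sticky) or one UE is enough.
    if any(kind in ("Sticky", "UE") for _, kind in events):
        return True
    # Soft CEs: look for any three within a 24 hour period.
    soft = sorted(
        t for t, kind in events if kind in ("Intermittent", "Persistent")
    )
    window = timedelta(hours=24)
    for i in range(len(soft) - 2):
        if soft[i + 2] - soft[i] <= window:
            return True
    return False
```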
Let's look at another error message in order to illustrate a point about
the SER:
Feb 5 08:54:42 cod-b0 SUNW,UltraSPARC: [ID 126141 kern.notice]
[AFT0] Corrected Memory Error on CPU56, errID 0x0003d02e.cbca34fe
Feb 5 08:54:42 cod-b0 AFSR 0x00000000.00100000 AFAR 0x0000001e.3807aad8
Feb 5 08:54:42 cod-b0 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00
Fault_PC 0x10095764
Feb 5 08:54:42 cod-b0 UDBL Syndrome 0x91 Memory Module Board# 0
Bank# 1 P# P17 MM 1_0
Feb 5 08:54:42 cod-b0 unix: [ID 908439 kern.notice] [AFT0]
Multiple Softerrors:
Feb 5 08:54:42 cod-b0 unix: [ID 356634 kern.notice]
0 Intermittent, 256 Persistent, and 0 Sticky Softerrors accumulated
Feb 5 08:54:42 cod-b0 unix: [ID 340762 kern.notice]
from Memory Module Board# 0 Bank# 1 P# P17 MM 1_0
Feb 5 08:54:42 cod-b0 SUNW,UltraSPARC: [ID 453948 kern.notice]
[AFT0] errID 0x0003d02e.cbca34fe
Corrected Memory Error on Board# 0 Bank# 1 P# P17 MM 1_0 is Persistent
Feb 5 08:54:42 cod-b0 SUNW,UltraSPARC: [ID 104955 kern.notice]
[AFT0] errID 0x0003d02e.cbca34fe
ECC Data Bit 5 was in error and corrected
This error message also contains lines that say "Multiple
Softerrors" have occurred and "0 Intermittent, 256 Persistent, and 0
Sticky Softerrors accumulated from Memory Module Board# 0 Bank# 1 P# P17
MM 1_0". Solaris will issue a summary report like this when the
number of correctable errors exceeds a threshold value, max_ce_err, on
a particular DIMM. This threshold value on the E10000 is set to 255.
Does a DIMM with 256 errors always need to be replaced? Not necessarily!
As stated above, three or more CEs attributed to the same DIMM
within a 24 hour period is not acceptable. That means two errors
per day is OK. So if the uptime of the domain was greater than 128 days
(256 errors / 2 errors per day = 128 days) it is conceivable that the
SER never exceeded 2 errors per day, and the DIMM should not be
replaced.
The point being emphasized here is to always ensure that the SER is three or
more "Intermittent" or "Persistent" CEs on the same DIMM within a 24 hour
period before even considering replacement.
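The arithmetic above can be restated as a small calculation. The sketch
below uses hypothetical function names and only checks the average
rate; an acceptable average does not prove that no single 24 hour
period contained three CEs, so the timestamps of the individual
messages still have to be examined.

```python
def min_uptime_days(cumulative_ces, max_ok_per_day=2):
    """Shortest uptime over which the count could stay within the limit."""
    return cumulative_ces / max_ok_per_day


def average_rate_ok(cumulative_ces, uptime_days, max_ok_per_day=2):
    """True if the average CE rate does not exceed max_ok_per_day."""
    return cumulative_ces / uptime_days <= max_ok_per_day
```

For the 256 error summary shown above, min_uptime_days(256) gives 128
days, matching the figure in the text.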
Note that Solaris 9 KU2 (Patch 112233-02 or later) and Solaris 8 KU16
(Patch 108528-16 or later) replace the cumulative error count shown above
with an error count that just spans a 24 hour window. These kernel updates
also no longer send individual memory error messages to the console by
default. If three errors occur in a 24 hour period the following message
is printed on the console:
Oct 3 22:46:31 thing2 unix: WARNING: [AFT0] 3 soft errors in less than
24:00 (hh:mm) detected from Memory Module Board# 3 Bank# 3 P# P32 MM 3_7
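This newer warning can also be extracted from a log mechanically. The
sketch below is written against the single example shown above; the
names SOFT_WARN_RE and parse_soft_warning are hypothetical, and other
kernel updates may word the warning differently.

```python
import re

SOFT_WARN_RE = re.compile(
    r"\[AFT0\] (\d+) soft errors in less than (\d+:\d+) \(hh:mm\) "
    r"detected from Memory Module Board#\s*(\d+) Bank#\s*(\d+) "
    r"P#\s*(P\d+) MM (\d+_\d+)"
)


def parse_soft_warning(text):
    """Return a dict for the first soft error warning in text, or None."""
    # Collapse wrapped lines into one before matching.
    flat = " ".join(text.split())
    m = SOFT_WARN_RE.search(flat)
    if m is None:
        return None
    count, window, board, bank, p, mm = m.groups()
    return {
        "count": int(count),
        "window": window,
        "board": int(board),
        "bank": int(bank),
        "p": p,
        "mm": mm,
    }
```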
--------------------------
| Diagnosing Memory Errors |
--------------------------
Identifying memory errors on the E10000 is best accomplished by first
looking at a set of Solaris error messages and trying to find a pattern.
Below is a non-comprehensive list of possible patterns:
. If all the errors involve the same CPU Module, then suspect a problem
with the CPU Module seating, the CPU Module itself, or the System
Board it resides on.
. If the errors all involve CPU Modules on the same System Board, then
suspect a problem with the System Board seating or the System Board
itself.
. If the errors involve multiple CPUs on multiple System Boards, but
the same DIMM, suspect the DIMM.
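The patterns listed above can be expressed as a first-pass triage
function. This is a rough sketch only: the function name suspect and
the returned strings are hypothetical, the board number is taken as the
CPU number divided by 4 (as in the examples later in this FIN), and
requiring at least two events is an assumption. Any result still has to
be confirmed against the Recordstops.

```python
def suspect(events):
    """events: list of (cpu_number, dimm_label) taken from CE messages."""
    if len(events) < 2:
        return "not enough events to see a pattern"
    cpus = {cpu for cpu, _ in events}
    boards = {cpu // 4 for cpu in cpus}
    dimms = {dimm for _, dimm in events}
    if len(cpus) == 1:
        return "CPU Module seating, CPU Module, or its System Board"
    if len(boards) == 1:
        return "System Board seating or System Board"
    if len(dimms) == 1:
        return "DIMM"
    return "no single pattern"
```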
Once a pattern has been identified, a diagnosis can be made by confirming
the pattern through looking at the Recordstops. Here are two complete
examples of how to diagnose Solaris memory errors reported on an E10000.
One is an actual DIMM problem, the other shows how important it is to use
the Recordstop to verify if an error was really caused by a DIMM.
Example Diagnosis 1 : A True Soft Memory Error
----------------------------------------------
This first example is from an E10000 that has netcon logging
enabled. In this case the SSP $SSPLOGGER//netcon file can
be examined instead of having to log into the domain and examine
/var/adm/messages. (Note that Solaris 9 KU2 and Solaris 8 KU16 by
default no longer report these messages to the console, so they do
not appear in the netcon log either. In that case, check the domain's
/var/adm/messages file instead.)
All E10000s should have netcon logging enabled. See FIN I0593-1 at:
http://sunsolve.Central.Sun.COM/cgi/retrieve.pl?type=0&doc=fins/I0593-1
Assume we have already established from the log that three CEs have
occurred on the same DIMM within a 24 hour period, and now we are
trying to determine if this particular set of CEs was caused by the
DIMM.
Dec 27 15:35:16 cod-ssp netcon_server: [ID 366040 local1.info] (cod-b0) :
Dec 27 15:34:41 cod-b0 SUNW,UltraSPARC: [AFT0]
Corrected Memory Error on CPU25, errID 0x00000189.0bb189df
Dec 27 15:35:16 cod-ssp netcon_server: [ID 366040 local1.info] (cod-b0) :
Dec 27 15:34:41 cod-b0 AFSR 0x00000000.00100000 AFAR
0x00000000.30fdbd70
Dec 27 15:35:17 cod-ssp netcon_server: [ID 366040 local1.info] (cod-b0) :
Dec 27 15:34:41 cod-b0 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00
Fault_PC 0x10024df0
Dec 27 15:35:17 cod-ssp netcon_server: [ID 366040 local1.info] (cod-b0) :
Dec 27 15:34:41 cod-b0 UDBH Syndrome 0x8 Memory Module Board# 15 Bank#
1 P# P30 MM 1_7
Dec 27 15:35:17 cod-ssp netcon_server: [ID 366040 local1.info] (cod-b0) :
Dec 27 15:34:41 cod-b0 SUNW,UltraSPARC: [AFT0] errID 0x00000189.0bb189df
Corrected Memory Error on Board# 15 Bank# 1 P# P30 MM 1_7 is
Persistent
Dec 27 15:35:17 cod-ssp netcon_server: [ID 366040 local1.info] (cod-b0) :
Dec 27 15:34:41 cod-b0 SUNW,UltraSPARC: [AFT0] errID 0x00000189.0bb189df
ECC Check Bit 3 was in error and corrected
In the same $SSPLOGGER/ directory, look for a Recordstop that
occurred around the same time:
% ls -la Edd-Record-Stop-Dump-12.27*
-rw-rw-rw- 1 ssp staff 82680 Dec 27 15:43
Edd-Record-Stop-Dump-12.27.15:35
% redx -c -l
redxl> dumpf load Edd-Record-Stop-Dump-12.27.15:35
Created Thu Dec 27 15:35:42 2001
By hpost v. 3.4 Jun 20 2001 12:19:51 executing as pid=5840
On ssp name = cod-ssp.SD_Lab.West.Sun.COM
HOSTNAME = cod-b0
platform_name = cod
Boardmask = 3FFFF -D option
Edd-Record-Stop-Dump
There were 0 errors encountered while creating this dump.
redxl> wfail
LAARB 0 ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB 1 ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB 2 ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB 3 ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB 4 ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB 5 ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB 6 ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB 6 ErrorCSR3[63:0]: Hist: 0 N 0000 Flgs = 000 00100000
ErrCSR3[20]: Recordstop Requested by XDB0 (LAARB)
XDB 6.0 EccErrFlags[11:0] = 140
EccFlg[6]: Correctable error in ldat bus hi half, bits [143:72]
EccFlg[11:8]: Error count = 1
ldat[143:72]= 08 00000000 00000000 (xmux_par[5:0]= 1F) syn= 08:
bit 67 [3F]
Ldat hi data recordstop requested by XDB 6.0.
LAARB 7 ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB 8 ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB 9 ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB A ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB B ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB C ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB D ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB E ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB F ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB F ErrorCSR3[63:0]: Hist: 0 N 0000 Flgs = 000 00800000
ErrCSR3[23]: Recordstop Requested by XDB3 (LAARB)
XDB F.3 EccErrFlags[11:0] = 104
EccFlg[2]: Correctable error in psi bus hi half, bits [143:72]
EccFlg[11:8]: Error count = 1
psi [143:72]= 08 00000000 00000000 (xmux_par[5:0]= 1F) syn= 08:
bit 67 [3F]
Memory ECC error detected by XDB F.3 will be analyzed later.
GAARB 0 ErrorCSR1[65:0] = 0 00000000 00000002
ErrCSR1[1]: Recordstop Detected
GAARB 0 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 8000
GAARB 1 ErrorCSR1[65:0] = 0 00000000 00000002
ErrCSR1[1]: Recordstop Detected
GAARB 1 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 8000
GAARB 2 ErrorCSR1[65:0] = 0 00000000 00000002
ErrCSR1[1]: Recordstop Detected
GAARB 2 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 8000
GAARB 3 ErrorCSR1[65:0] = 0 00000000 00000002
ErrCSR1[1]: Recordstop Detected
GAARB 3 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 8000
Ldat-side data recordstops are assumed caused by psi-side errors.
No further action is appropriate for them.
Memory data ecc error detected by XDB F.3: PUP 1/3 output parity
history matches XDB in. No action taken here.
No components would be failed based on this state.
. Solaris says the error was detected by CPU25. We know that CPU 25
is on Board 6 (25 divided by 4 = 6, remainder 1, so CPU 6.1). The
Recordstop shows that XDB 6.0 requested the Recordstop, and we know
XDB 6.0 interfaces with CPU 6.1, confirming that something on Board 6
detected the error.
. Solaris says the error was detected by CPU25 but occurred on Board#15
Bank# 1 MM 1_7 check bit 3. The Recordstop confirms that check bit 3
was the bit affected by the error. We know this because there are 72
bits in an E10000 memory word. Bits 0-63 are the 64 data bits, and bits
64-71 are ECC check bits. The Recordstop indicates bit 67 was the bit
in error, which is check bit 3 (bit 64 = check bit 0, 65 = 1, 66 = 2,
67 = 3, etc.).
. The Recordstop says the error was a "Memory data ECC error detected by
XDB F.3" confirming Solaris' claim that Board 15 had the memory error.
. The Recordstop continues saying "PUP 1/3 output parity history matches
XDB in." which lets us know that the data sent out from the Pack/Unpack
(PUP) ASICs matched the XDB (Xfire Data Buffer) input, therefore the
error was not caused by the PUPs or the connection to the memory, but
in the memory itself.
. Recordstops only have enough information to narrow down the error to
two of the four possible memory banks, in this case banks 1 and 3 are
identified.
. The Recordstop identifies that the CE occurred "in psi bus hi half
... bit 67". Each DIMM provides the same 18 contiguous bits of data in
a 144 bit transfer cycle, so this means according to the table below,
that the affected DIMM is DIMM 7.
------------------------------
| 144 bit transfer cycle table |
|==============================|
| DIMM 0: lo half bits [17: 0] |
| DIMM 1: lo half bits [35:18] |
| DIMM 2: lo half bits [53:36] |
| DIMM 3: lo half bits [71:54] |
| DIMM 4: hi half bits [17: 0] |
| DIMM 5: hi half bits [35:18] |
| DIMM 6: hi half bits [53:36] |
| DIMM 7: hi half bits [71:54] | <--- "hi half ... bit 67"
------------------------------
So it is now known from the Recordstop that the error occurred in DIMM
F.1.7 or F.3.7. One of these matches Solaris' indication of Board# 15
MM 1_7 (DIMM F.1.7), therefore this Solaris error has been corroborated
by the Recordstop.
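The lookups used in this example (CPU number to board, bit number to
data or check bit, and bit position to DIMM) can be sketched as small
helper functions. The function names are hypothetical; the mappings
themselves are taken from the explanation and the 144 bit transfer
cycle table above.

```python
def cpu_location(cpu):
    """Return (board, processor) for a Solaris CPU number, 4 per board."""
    return divmod(cpu, 4)


def describe_bit(bit):
    """Bits 0-63 of the 72 bit word are data; bits 64-71 are check bits."""
    if not 0 <= bit <= 71:
        raise ValueError("bit must be in the range 0..71")
    return ("data", bit) if bit < 64 else ("check", bit - 64)


def dimm_for_bit(half, bit):
    """Map a bit within the lo or hi half of a transfer to DIMM 0..7.

    Each DIMM supplies 18 contiguous bits: DIMMs 0-3 cover the lo half
    and DIMMs 4-7 cover the hi half.
    """
    if half not in ("lo", "hi"):
        raise ValueError("half must be 'lo' or 'hi'")
    if not 0 <= bit <= 71:
        raise ValueError("bit must be in the range 0..71")
    return bit // 18 + (4 if half == "hi" else 0)
```

For this example, cpu_location(25) gives board 6, describe_bit(67)
gives check bit 3, and dimm_for_bit("hi", 67) gives DIMM 7.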
Example Diagnosis 2 : CPU Module Failure looks like a DIMM failure
------------------------------------------------------------------
Once again we start by examining the /var/adm/messages file. Again
assume we have already established from the log that three CEs have
occurred on the same DIMM within a 24 hour period, and now we are
trying to determine if this particular set of CEs was caused by
the DIMM.
Dec 12 15:58:00 xf2-b7 SUNW,UltraSPARC: [AFT0]
Corrected Memory Error on CPU31, errID 0x00000076.b4f9a4cc
Dec 12 15:58:00 xf2-b7 AFSR 0x00000000.00100000 AFAR
0x0000000e.ebe3c000
Dec 12 15:58:00 xf2-b7 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00
Fault_PC 0x781cac7c
Dec 12 15:58:00 xf2-b7 UDBH Syndrome 0x64 Memory Module Board# 7
Bank# 0 P# P2 MM 0_4
Dec 12 15:58:00 xf2-b7 SUNW,UltraSPARC: [AFT0] errID 0x00000076.b4f9a4cc
Corrected Memory Error on Board# 7 Bank# 0 P# P2 MM 0_4 is Persistent
Dec 12 15:58:00 xf2-b7 SUNW,UltraSPARC: [AFT0] errID 0x00000076.b4f9a4cc
ECC Data Bit 7 was in error and corrected
% redx -c -l
redxl> dumpf load Edd-Record-Stop-Dump-12.12.15:58
Created Wed Dec 12 15:58:25 2001
By hpost v. 3.4 Aug 20 2000 19:14:56 executing as pid=28054
On ssp name = xf2-ssp2.SD_Lab.West.Sun.COM
HOSTNAME = xf2-b7
platform_name = allxf2
Boardmask = 30088 -D option
Edd-Record-Stop-Dump
There were 0 errors encountered while creating this dump.
redxl> wfail
LAARB 3 ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB 7 ErrorCSR1[65:0] = 0 00000000 3C000002
ErrCSR1[29:26,1]: Recordstop, requested by all 4 GAARBs
LAARB 7 ErrorCSR3[63:0]: Hist: 0 N 0000 Flgs = 000 00200000
ErrCSR3[21]: Recordstop Requested by XDB1 (LAARB)
XDB 7.1 EccErrFlags[11:0] = 204
EccFlg[2]: Correctable error in psi bus hi half, bits [143:72]
EccFlg[11:8]: Error count = 2
psi [143:72]= 6C 0ECCF00D 7EA00080 (xmux_par[5:0]= 2A) syn= 64:
bit 07 [2B]
FAIL proc 7.2: Arbstop/Recordstop detected by xdb.
FAIL proc 7.3: Arbstop/Recordstop detected by xdb.
GAARB 0 ErrorCSR1[65:0] = 0 00000000 00000002
ErrCSR1[1]: Recordstop Detected
GAARB 0 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 0080
GAARB 1 ErrorCSR1[65:0] = 0 00000000 00000002
ErrCSR1[1]: Recordstop Detected
GAARB 1 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 0080
GAARB 2 ErrorCSR1[65:0] = 0 00000000 00000002
ErrCSR1[1]: Recordstop Detected
GAARB 2 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 0080
GAARB 3 ErrorCSR1[65:0] = 0 00000000 00000002
ErrCSR1[1]: Recordstop Detected
GAARB 3 ArbStopLog[15:0] = 0000 RecordStopLog[15:0] = 0080
. Solaris says the error was detected by CPU31. We know that CPU 31
is on Board 7 (31 divided by 4 = 7, remainder 3, so CPU 7.3). The
Recordstop shows that XDB 7.1 requested the Recordstop, and we know
XDB 7.1 interfaces with CPUs 7.2 & 7.3 (CPU 30 & 31), confirming that
something on Board 7 detected the error.
. Solaris says the error was detected by CPU31 and occurred on Board# 7
Bank# 0 MM 0_4 data bit 7. The Recordstop confirms that data bit 7 was
the bit affected by the error.
. The Recordstop says:
"FAIL proc 7.2: Arbstop/Recordstop detected by xdb.
FAIL proc 7.3: Arbstop/Recordstop detected by xdb."
It appears CPU 7.2 or 7.3 created an ECC error that was detected by
XDB 7.1 that they share. The offending CPU needs to be isolated
from one of two possibilities.
. The problem in this case was caused by a CPU writing bad data into a DIMM.
No DIMM should be replaced based on an examination of the Recordstop.
The bad CPU needs to be replaced.
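When reviewing many Recordstops, the "FAIL proc" lines that
distinguished this example from the first one can be picked out of
saved wfail output. The sketch below is illustrative only:
FAIL_PROC_RE and failed_procs are hypothetical names, the pattern is
written against the two lines shown above, and the board is read as a
hexadecimal digit because wfail labels boards 0 through F. The absence
of such a line does not by itself prove that the DIMM is at fault.

```python
import re

FAIL_PROC_RE = re.compile(r"FAIL proc ([0-9A-Fa-f])\.(\d):")


def failed_procs(wfail_text):
    """Return (board, proc, cpu_number) for each FAIL proc line found."""
    result = []
    for board_hex, proc in FAIL_PROC_RE.findall(wfail_text):
        board = int(board_hex, 16)
        # Four processors per board, so CPU number = board * 4 + proc.
        result.append((board, int(proc), board * 4 + int(proc)))
    return result
```

Applied to the wfail output of this example, the function reports
processors 7.2 and 7.3 (CPU 30 and 31); applied to the output of
Example Diagnosis 1 it returns an empty list.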
Implementation:
---
| | MANDATORY (Fully Pro-Active)
---
---
| | CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
---
---
| X | REACTIVE (As Required)
---
Corrective Action:
The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned condition.
Please perform the following as needed:
1. Examine the domain's Solaris message logs for memory errors.
A. For CEs Solaris calls "Intermittent" or "Persistent":
i) Determine that three or more "Intermittent" or "Persistent"
errors have occurred within a 24 hour period. If three or more
errors have not happened, take no service action and do not
proceed with the next steps.
ii) Copy the three error messages verbatim for possible use during
DIMM FA.
iii) Note the time of the errors for comparison with Recordstops.
B. For CEs Solaris calls "Sticky":
i) Copy error message verbatim for possible use during DIMM FA.
ii) Note the time of the error for comparison with Recordstops.
C. For UEs:
i) Copy error messages verbatim for possible use during DIMM FA.
ii) Note the time of the error for comparison with Recordstop.
2. Examine the domain's SSP Recordstops to corroborate both the error and the
DIMM that Solaris reports as affected. For any Solaris report of a CE or UE,
you must check whether the error was caused by broken hardware writing data
with errors into a memory location rather than by the memory itself.
A. For all CEs ("Intermittent", "Persistent", and "Sticky"):
i) Examine the output of the wfail redx command on all Recordstops
that occurred around the same time as the errors indicated in the
Solaris error messages and up to 12 hours earlier. This is to
rule out a memory error that was caused by broken hardware writing
erroneous data out to memory and then detected by a later memory
read.
NOTE: Checking back only 12 hours is sufficient because the Solaris
memory scrubber accesses all DIMM locations every 12 hours. If broken
hardware had written erroneous data to memory, the corresponding
Recordstop would have occurred within the 12 hours preceding the
Solaris error message.
ii) For "Intermittent" and "Persistent" CEs, replace the DIMM
indicated in the three Solaris error messages with a Field
Replaceable Unit (FRU) DIMM only if all three Recordstops
corroborate the Solaris messages' errors.
iii) "Sticky" CEs may be considered for replacement after just one
Solaris error message has been corroborated with a Recordstop.
iv) Copy the Recordstops' wfail output verbatim for possible use
during DIMM FA.
v) Bring up the domain with a minimum hpost level of 16 to test
memory ECC functionality. If time permits, a level 24, 32, or
64 hpost will perform increasingly rigorous testing of memory.
B. For all UEs:
i) Examine the output of the wfail redx command on all Recordstops
that occurred around the same time as the error indicated in the
Solaris error message.
NOTE: UE error messages that indicate a "Syndrome 0x3" can be related
to a CPU Module E-cache parity or I/O parity error (SBus or
PCIbus). Investigate these sources of errors before replacing
any DIMMs.
ii) Replace the DIMM indicated in the Solaris error message with
a FRU only if the Recordstop corroborates the Solaris failure
message.
iii) Copy the Recordstop's wfail output verbatim for possible use during
DIMM FA.
iv) Bring up the domain with an hpost level of 64 to fully test memory
and other hardware functionality.
3. Return DIMMs for FA along with complete Solaris error message and SSP
Recordstop wfail output.
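The timing rules in steps 1.A and 2.A can be sketched as follows. This is an
illustrative Python sketch under stated assumptions, not a Sun-supplied tool.
The function names are hypothetical, the timestamps are assumed to have been
extracted by hand from the Solaris message log and the SSP Recordstops, and
"around the same time" is read as "no later than the Solaris error" unless a
slack value is supplied. Whether a Recordstop actually corroborates a Solaris
message still requires reading the wfail output.

```python
from datetime import datetime, timedelta

CE_THRESHOLD = 3                   # three or more CEs (step 1.A.i)
CE_WINDOW = timedelta(hours=24)    # within a 24 hour period (step 1.A.i)
LOOKBACK = timedelta(hours=12)     # memory scrubber interval (step 2.A.i)

def ce_threshold_met(ce_times):
    """True if any 24-hour span holds at least CE_THRESHOLD CE timestamps."""
    times = sorted(ce_times)
    for i in range(len(times) - CE_THRESHOLD + 1):
        # times[i] .. times[i + 2] are three consecutive errors in time order.
        if times[i + CE_THRESHOLD - 1] - times[i] <= CE_WINDOW:
            return True
    return False

def recordstops_to_examine(error_time, recordstop_times,
                           slack=timedelta(0)):
    """Recordstops from 12 hours before the Solaris error up to the error.

    slack is a hypothetical allowance for Recordstops logged slightly
    after the Solaris message; the FIN does not give a value for it.
    """
    start = error_time - LOOKBACK
    end = error_time + slack
    return [t for t in sorted(recordstop_times) if start <= t <= end]

# Made-up timestamps, for illustration only.
ces = [datetime(2002, 10, 30, 1, 0),
       datetime(2002, 10, 30, 9, 30),
       datetime(2002, 10, 30, 23, 0)]
stops = [datetime(2002, 10, 30, 10, 0),
         datetime(2002, 10, 30, 12, 0),
         datetime(2002, 10, 30, 22, 59)]

print(ce_threshold_met(ces))                    # True: 22 hours apart
print(recordstops_to_examine(ces[-1], stops))   # the 12:00 and 22:59 entries
```

For "Sticky" CEs and for UEs the three-error threshold does not apply, so
only the Recordstop selection part of the sketch is relevant to them.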
Comments:
If you are not certain a particular DIMM is the cause of repeated correctable
memory errors that meet the SER replacement criteria of three CEs within 24
hours, do not replace it. The experience gained in servicing E-cache parity
errors has made it clear that performing an unnecessary service action can do
more harm than good to an E10000 and increase future service calls. Make your
customers aware that soft errors are natural and expected, and that
"Intermittent" and "Persistent" errors do not necessarily imply intermittent or
persistent issues with their memory.
If you are certain a DIMM is the issue, share this information with the repair
depot by copying the exact Solaris error messages and redx wfail output you
used to determine this and providing them along with the DIMM. If this
information isn't provided, the DIMM may pass testing with an NTF diagnosis
and come back to you as a FRU again.
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------