Document Audience: | INTERNAL |
Document ID: | I0954-1 |
Title: | Diagnosing Main Memory errors versus L2SRAM errors on Sun Fire 3800/4800/4810/6800, Sun Fire 12K/15K and Sun Fire V1280 systems. |
Copyright Notice: | Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved |
Update Date: | 2003-04-09 |
---------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
FIN #: I0954-1
Synopsis: Diagnosing Main Memory errors versus L2SRAM errors on Sun Fire 3800/4800/4810/6800, Sun Fire 12K/15K and Sun Fire V1280 systems.
Create Date: Apr/07/03
SunAlert: No
Top FIN/FCO Report: No
Products Reference: Sun Fire 3800/4800/4810/6800/12K/15K/V1280
Product Category: Server / Service
Product Affected:
Systems Affected:
-----------------
Mkt_ID   Platform   Model   Description      Serial Number
------   --------   -----   -----------      -------------
  -      LW8        ALL     Netra 1280             -
  -      A40        ALL     Sun Fire V1280         -
  -      S8         ALL     Sun Fire 3800          -
  -      S12        ALL     Sun Fire 4800          -
  -      S12i       ALL     Sun Fire 4810          -
  -      S24        ALL     Sun Fire 6800          -
  -      F12K       ALL     Sun Fire 12000         -
  -      F15K       ALL     Sun Fire 15000         -
X-Options Affected:
-------------------
Mkt_ID   Platform   Model   Description      Serial Number
------   --------   -----   -----------      -------------
  -         -         -          -                 -
Part Number   Description   Model
-----------   -----------   -----
     -             -          -
References:
BugID: 4829924 - DUE followed by EDU:ST can be reported in reverse
order.
4830028 - US-III L2 SRAM messages need improvement.
PatchID: 108528-18 or higher : SunOS 5.8: kernel update patch
112233-04: SunOS 5.9: Kernel Patch
FIN: I0909-2
URL: http://onestop/programs/us3quality
Sun Alert: 50471
Infodoc: 43642
Issue Description:
On Sun Fire 3800/4800/4810/6800, Sun Fire 12K/15K and Sun Fire V1280
systems, main memory DIMM errors may be misdiagnosed as L2SRAM errors,
and L2SRAM errors may be misdiagnosed as main memory errors. Either
mistake may result in the wrong component being replaced, leaving the
system vulnerable to future failures.
L2 SRAM errors may occur when accessing the CPU's Level 2 SRAM cache
memory. The reports of errors vary with workload and data patterns.
One type of known L2SRAM issue is described in Sun Alert 50471. See
the Corrective Action section for the recommended SMS, firmware, and
Solaris kernel patches required to resolve this L2SRAM timing issue.
One type of error condition originates in a main memory DIMM and is
propagated to the L2SRAM, which results in a system panic. In this
case the last entry in the /var/adm/messages file appears to call out
a bad L2SRAM when in fact the source of the problem is a DIMM, which
may even reside on a completely different system board from the L2SRAM
that is reporting the error. These errors have been observed when an
ECC error occurs on a "memory prefetch" operation and is followed by
ECC errors on the associated data.
If the troubleshooting engineer does not perform due diligence on
problem diagnostics and only looks at the last entry in the messages
file, this may lead the engineer to recommend replacing the L2SRAM
(i.e. system board) when in fact the DIMM was the source of the error.
DIMM Errors Misdiagnosed as L2SRAM Errors
=========================================
NOTE: Refer to Internal Infodoc 43642 for abbreviated definitions
for errors such as DUE, EDU, etc.
The cases in question show Syndrome 0x003 errors (i.e. "*Bad* Esynd=0x003")
together with a "DUE" event (i.e. WARNING: [AFT1] DUE Event on CPU) and an
"EDU:ST" event (i.e. WARNING: [AFT1] EDU:ST Event on CPU) in the
/var/adm/messages file.
In some cases, the last error captured in the log file that originated
from a DIMM error may even be reported as a syndrome 0x071 (i.e.
"*Bad* Esynd=0x071"), which is typically associated with an L2SRAM
failure, but in these cases it is not an L2SRAM failure.
Syndrome 0x071 errors are almost always caused by a prior uncorrectable
error, which may be from either L2SRAM or Main Memory. You need to
find the original uncorrectable error that caused the corruption.
Syndrome 0x003 and Syndrome 0x11c errors should be ignored if
the error is a UCU, EDU, WDU or CPU Event, but not if the error is a
DUE or UE. If a syndrome 0x003 or 0x11c error in L2SRAM is flushed to
memory, it is turned into a Syndrome 0x071 error in memory.
The point to note in such cases is that the device reporting the DUE
(not to be confused with the EDU and EDU:ST also reported during this
event) is the source of the error, and the device reporting the EDU:ST
(not to be confused with a plain EDU) is the recipient of the error.
The diagnostic engineer must retrace the events that led up to the
error because, in most cases, the EDU:ST is the last entry in the
/var/adm/messages file, which might lead to replacing the recipient of
the error and not the source of the error.
It is also important to note that in most cases the DUE (source of the
error) will precede the EDU:ST (destination of the error) in the
messages file, but this may not always be the case. What is important
is that when the DUE and the EDU:ST are seen together, the DUE is
reporting the source of the error and the EDU:ST is the destination of
the error. Proper matching can be performed by examining the AFAR
associated with each event. Events with AFARs that are the same when
rounded down to either a 32-byte or a 64-byte boundary can be
associated with each other, irrespective of the order in which they
occur. The EDU:ST syndrome will also match the DUE syndrome, or be one
of the special syndromes of 0x003 or 0x071.
If multiple DUEs occur to different AFARs, multiple EDU:STs may be
interleaved among them.
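The AFAR and syndrome matching rule above can be sketched in Python.
This is a minimal illustration only, not a Sun-supplied tool. The
function names are hypothetical, and treating a logged AFAR as a single
integer (the two dotted halves joined together) is an assumption made
for the example.

```python
SPECIAL_SYNDROMES = {0x003, 0x071}  # special EDU:ST syndromes named above


def parse_afar(text):
    """Turn a logged AFAR such as '0x00000000.35dd9790' into an integer."""
    return int(text.replace(".", ""), 16)


def round_down(afar, boundary):
    """Round an address down to a power-of-two boundary (32 or 64 bytes)."""
    return afar & ~(boundary - 1)


def afars_match(afar_a, afar_b):
    """True if the AFARs agree at either a 32-byte or a 64-byte boundary."""
    return any(round_down(afar_a, b) == round_down(afar_b, b) for b in (32, 64))


def edu_st_matches_due(due_afar, due_synd, edu_afar, edu_synd):
    """Apply both tests: aligned AFARs, and an equal or special syndrome."""
    return afars_match(due_afar, edu_afar) and (
        edu_synd == due_synd or edu_synd in SPECIAL_SYNDROMES
    )


# Values taken from Sections 3 and 4 of Example 1 in this document.
due = parse_afar("0x00000000.35dd9780")
edu_st = parse_afar("0x00000000.35dd9790")
print(edu_st_matches_due(due, 0x1B6, edu_st, 0x003))  # True
```

Because the test depends only on the AFAR and the syndrome, it gives the
same answer whichever of the two events appears first in the messages
file.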
It is also important to note that sometimes what is really an EDU:ST
will be reported as a plain EDU. This has been seen when both the ME
and EDU bits are on in the AFSR, and also when both the DUE and EDU
bits are on in the AFSR. There may be other combinations of bits that
will cause an EDU:ST to masquerade as a plain EDU. A plain EDU can be
assumed to be an EDU:ST if it has the same errID as a DUE or if it can
be matched via its AFAR and syndrome to a DUE.
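The check for a plain EDU that is really an EDU:ST might be sketched as
follows. The dictionary keys (errid, afar, synd) and the function names
are hypothetical, chosen only for this illustration. The sketch assumes
the events have already been extracted from the messages file.

```python
def same_line(afar_a, afar_b):
    """AFARs agree when rounded down to a 32-byte or a 64-byte boundary."""
    return any((afar_a & ~(b - 1)) == (afar_b & ~(b - 1)) for b in (32, 64))


def is_disguised_edu_st(edu, dues):
    """Return True if a plain EDU should be read as an EDU:ST.

    An EDU qualifies if it shares an errID with a DUE, or if its AFAR and
    syndrome can be matched to a DUE (equal syndrome, or 0x003 / 0x071).
    """
    for due in dues:
        if edu["errid"] == due["errid"]:
            return True
        synd_ok = edu["synd"] == due["synd"] or edu["synd"] in (0x003, 0x071)
        if same_line(edu["afar"], due["afar"]) and synd_ok:
            return True
    return False


# Section 3 of Example 1: the EDU and the DUE carry the same errID.
due = {"errid": "0x0000240e.76f08db0", "afar": 0x35DD9780, "synd": 0x1B6}
edu = {"errid": "0x0000240e.76f08db0", "afar": 0x35DD9780, "synd": 0x1B6}
print(is_disguised_edu_st(edu, [due]))  # True
```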
Note that one can substitute ordinary UEs for DUEs above, and the same
rules apply; if an EDU:ST is associated (by AFAR) with a UE, ignore
it. Because UEs are more likely to bring the system down right away,
however, the likelihood of misdiagnosis due to associated EDU:STs is
less.
Note also that when this condition occurs, the error exhibited on the
victim L2SRAM may be reported on one or more L2SRAMs, which may span
one or more system boards.
When reviewing the messages that call out L2SRAM during this condition,
the EDU, EDU:ST, WDU, CPU, UCU, and UE event messages may display error
text such as the following:
"likely from E$ WDU/CPU"
"likely from E$ EDU:ST"
While the message is technically correct, as stated previously, the
L2SRAM is the victim of the error and the error originated from main
memory.
L2SRAM Errors Misdiagnosed as DIMM Errors
=========================================
It should also be noted that blind application of the above could lead
one to misdiagnose true L2SRAM errors as DIMM errors. A main memory
DIMM UE or DUE with a syndrome of 0x071 is most likely a secondary error
and should be ignored for diagnostic purposes.
Similarly, L2SRAM xxU events with syndromes of 0x003, 0x071, and 0x11c
are most likely secondary errors and should be ignored. An L2SRAM xxU
event with a syndrome other than 0x003, 0x071, or 0x11c, that can be
matched with a UE or DUE event with the same AFAR (rounded down to
either a 32-byte or a 64-byte boundary), especially if it has the same
syndrome, should also be ignored.
A UE or DUE with a syndrome other than 0x071 may indicate a possible
memory error, but needs to be correlated with the F15K recordstop or
F3800/4800/4810/6800 loghost logs information to make sure of this.
Similarly, an L2SRAM xxU event with a syndrome other than 0x003, 0x071,
or 0x11c that cannot be matched with a UE or DUE event with the same
AFAR (rounded down to either a 32-byte or a 64-byte boundary) AND the
same syndrome may indicate a possible L2SRAM error, but again the
recordstop (F15K) or loghost logs (F3800/4800/4810/6800) information
needs to be checked to make sure of this.
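The rules in this section can be summarized as a small decision
function. The sketch below is an illustration under stated assumptions,
not a diagnostic tool: the dictionary keys and result strings are
hypothetical, and the handling of an L2SRAM event that matches a UE or
DUE by AFAR but not by syndrome is one possible reading of the text,
flagged in the code. In every "possible" case the recordstop or loghost
logs still decide the matter.

```python
SECONDARY_L2_SYNDROMES = {0x003, 0x071, 0x11C}
CONFIRM = "confirm with recordstop/loghost logs"


def aligned(afar_a, afar_b):
    """AFARs agree when rounded down to a 32-byte or a 64-byte boundary."""
    return any((afar_a & ~(b - 1)) == (afar_b & ~(b - 1)) for b in (32, 64))


def triage(event, memory_errors):
    """Classify one event; memory_errors lists the UE/DUE events in the log."""
    if event["component"] == "memory":            # UE or DUE against a DIMM
        if event["synd"] == 0x071:
            return "secondary error: ignore"
        return "possible memory error: " + CONFIRM

    # Otherwise an L2SRAM xxU event (EDU, WDU, CPU, UCU and so on).
    if event["synd"] in SECONDARY_L2_SYNDROMES:
        return "secondary error: ignore"
    matches = [m for m in memory_errors if aligned(event["afar"], m["afar"])]
    if any(m["synd"] == event["synd"] for m in matches):
        return "secondary error: ignore"
    if matches:
        # Assumption: an AFAR-only match is treated as secondary but flagged.
        return "likely secondary error (syndrome differs): " + CONFIRM
    return "possible L2SRAM error: " + CONFIRM


# Section 3 of Example 1: EDU against /N0/SB2/P1/E0, DUE against /N0/SB4/P2/B1.
due = {"component": "memory", "synd": 0x1B6, "afar": 0x35DD9780}
edu = {"component": "l2sram", "synd": 0x1B6, "afar": 0x35DD9780}
print(triage(edu, [due]))  # secondary error: ignore
print(triage(due, [due]))  # possible memory error: confirm with recordstop/loghost logs
```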
Sometimes correctable errors are also reported identifying a particular
DIMM in the same bank as identified by the UE or DUE reports. This can
help focus attention on a suspect DIMM.
The associated recordstop file on the F15K server and the loghost logs
on 3800/4800/4810/6800 systems contain additional messages which are
critical to proper diagnosis of L2SRAM and DIMM errors.
The following examples are indicative of main memory DIMM errors that
could be misdiagnosed as L2SRAM errors.
EXAMPLE 1:
==========
The following is an example of the condition where the last device
reported in the messages file is an L2SRAM, but in fact the error
originated on a DIMM. Each section contains a description of what
the /var/adm/messages file is reporting and what the diagnosing
engineer should be reviewing.
-------------------------------------------------------------------------------
Section 1:
----------
A Correctable Error (CE) condition occurred on /N0/SB4/P2/B1/D2 DIMM
J15501. Bit 116 is identified as the troublesome bit. In the cache
dump it has the value "1", which means it read as "0" before being
corrected.
Feb 6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 226472 kern.notice] NOTICE:
[AFT0] Corrected system bus (CE) Event on CPU18 at TL=0, errID
0x0000240b.240fca00
Feb 6 17:46:23 la001 AFSR 0x00000002.00000070 AFAR 0x00000000.35e61780
Feb 6 17:46:23 la001 Fault_PC 0x1009ebc8 Esynd 0x0070 /N0/SB4/P2/B1/D2
J15501
Feb 6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 991331 kern.notice]
[AFT0] errID 0x0000240b.240fca00 Corrected Memory Error on /N0/SB4/P2/B1/D2
J15501 is Intermittent
Feb 6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 477466 kern.notice]
[AFT0] errID 0x0000240b.240fca00 Data Bit 116 was in error and corrected
Feb 6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 135948 kern.info]
[AFT2] errID 0x0000240b.240fca00 PA=0x00000000.35e61780
Feb 6 17:46:23 la001 E$tag 0x00000000.d7124924 E$state_6 Modified
Feb 6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x00) 0x6d345f73.68617265 0x61726773.006c6d5f ECC 0x17e
Feb 6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x10) 0x6164645f.626c6f63 0x6b006c6d.5f676c6f ECC 0x121
Feb 6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0x62616c5f.6e6c6d69 0x64005f69.6e697400 ECC 0x0d2
Feb 6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0x7864725f.6e6c6d5f 0x6c6f636b.61726773 ECC 0x15c
Feb 6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
-------------------------------------------------------------------------------
Section 2:
----------
Another Correctable Error (CE) occurs on /N0/SB4/P2/B1/D2 DIMM
J15501. This time bit 117 is identified as the troublesome bit. It
has the value "0" in the cache dump, which means it read as "1"
before being corrected.
Feb 6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 621556 kern.notice] NOTICE:
[AFT0] Corrected system bus (CE) Event on CPU18 at TL=0, errID
0x0000240b.244ed150
Feb 6 17:46:30 la001 AFSR 0x00000002.000001e8 AFAR 0x00000000.3549db90
Feb 6 17:46:30 la001 Fault_PC 0x100336e4 Esynd 0x01e8 /N0/SB4/P2/B1/D2
J15501
Feb 6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 700027 kern.notice]
[AFT0] errID 0x0000240b.244ed150 Corrected Memory Error on /N0/SB4/P2/B1/D2
J15501 is Intermittent
Feb 6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 714893 kern.notice]
[AFT0] errID 0x0000240b.244ed150 Data Bit 117 was in error and corrected
Feb 6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 195358 kern.info]
[AFT2] errID 0x0000240b.244ed150 PA=0x00000000.3549db80
Feb 6 17:46:30 la001 E$tag 0x00000000.d5900124 E$state_6 Modified
Feb 6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x00) 0x00000000.00000000 0x00000300.0a8fa970 ECC 0x00f
Feb 6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x10) 0x00000001.0c43db60 0x00000300.0bdab890 ECC 0x154
Feb 6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0x00000000.00000000 0x00600188.7b730000 ECC 0x041
Feb 6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0x00000000.00000000 0x00000300.0a4fa998 ECC 0x014
Feb 6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
-------------------------------------------------------------------------------
Section 3:
----------
A DUE and an EDU:ST occur together. Because they occur together,
Solaris reports the EDU:ST as a plain EDU event. It also prints the
EDU report first, even though the DUE actually occurred first. We
know they occurred together because both reports have the same errID
(0x0000240e.76f08db0), so we assume they are matched.
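Grouping reports by errID can be done mechanically. The following
Python sketch is an illustration only. It assumes each event header
sits on one physical line (the excerpts in this document are wrapped
for display), and the regular expression is written against the header
text shown in these excerpts rather than against any formal message
specification. The names HEADER and group_by_errid are hypothetical.

```python
import re
from collections import defaultdict

# Matches the event header as it appears in the excerpts in this document,
# e.g. "[AFT1] DUE Event on CPU9 at TL=0, errID 0x0000240e.76f08db0".
HEADER = re.compile(
    r"\[AFT\d\] (?P<kind>.+?) Event (?:on|detected by) CPU(?P<cpu>\d+) "
    r"at TL=\d+, errID (?P<errid>0x[0-9a-f]{8}\.[0-9a-f]{8})"
)


def group_by_errid(lines):
    """Map each errID to the list of (kind, cpu) headers that carry it."""
    groups = defaultdict(list)
    for line in lines:
        m = HEADER.search(line)
        if m:
            groups[m["errid"]].append((m["kind"], int(m["cpu"])))
    return groups


sample = [
    "WARNING: [AFT1] EDU Event on CPU9 at TL=0, errID 0x0000240e.76f08db0",
    "WARNING: [AFT1] DUE Event on CPU9 at TL=0, errID 0x0000240e.76f08db0",
    "WARNING: [AFT1] EDU:ST Event on CPU9 at TL=0, errID 0x0000240e.76fd8150",
]
for errid, events in group_by_errid(sample).items():
    kinds = {kind for kind, _ in events}
    paired = {"EDU", "DUE"} <= kinds
    print(errid, sorted(kinds), "EDU reported with DUE" if paired else "")
```

An errID that carries both an EDU and a DUE header is the pattern
described in this section. The later EDU:ST has its own errID and has
to be matched by AFAR instead.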
The DUE (Uncorrectable system bus data ECC for prefetch queue) event
calls out DIMM bank /N0/SB4/P2/B1. CPU9 and its L2 bank
/N0/SB2/P1/E0 J5400 are innocent victims of the DUE.
Note that the cache dump (the lines containing the string "E$Data")
shows two syndromes, the 0x1b6 on the even checkwords, that is also
reported in the AFSR, and an 0x02d on the odd checkwords. This is
the data as CPU9 received it.
Note also the "5" and the "a" in the third nibble from the left in
each checkword. This is the nibble that contains bits 116 and 117,
which were identified as troublesome in the CE reports, although now
it appears that bits 118 and 119 may also be affected in the even and
odd checkwords, respectively. (A syndrome of 0x1b6 is consistent
with a flip in both data bit 116 and 118. Similarly, a syndrome of
0x02d is consistent with a flip in both data bit 117 and 119.)
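The nibble observation can be checked by comparing the E$Data values
dumped in this section with the values dumped for the same offsets in
Section 4. The Python sketch below is a worked illustration. It assumes
that data bit 127 is the leftmost bit of the first 64-bit value on an
E$Data line, which agrees with the statement above that the third
nibble from the left holds bits 116 and 117. The function names are
hypothetical. The comparison shows only which bits differ between the
two dumps. It does not by itself show which dump holds the intended
data.

```python
def checkword(dump):
    """Build a 128-bit integer from the two 64-bit values of an E$Data line."""
    first, second = (int(part.replace(".", ""), 16) for part in dump.split())
    return (first << 64) | second


def differing_bits(dump_a, dump_b):
    """List the data bit positions (127 = leftmost) where two dumps differ."""
    diff = checkword(dump_a) ^ checkword(dump_b)
    return [bit for bit in range(127, -1, -1) if (diff >> bit) & 1]


# E$Data (0x00) and (0x10) as dumped in Section 3, against the same
# offsets as dumped in Section 4.
print(differing_bits("0x00500000.1044c758 0x00000000.00000008",
                     "0x00000000.1044c758 0x00000000.00000008"))  # [118, 116]
print(differing_bits("0x00a029cf.0100fff1 0x00000000.10428830",
                     "0x000029cf.0100fff1 0x00000000.10428830"))  # [119, 117]
```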
The Invalid AFSR message can be ignored. It is an artifact of a
misunderstanding between Solaris and the CPU that will be fixed in a
future release.
Feb 6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 487947 kern.warning]
WARNING: [AFT1] EDU Event on CPU9 at TL=0, errID 0x0000240e.76f08db0
Feb 6 17:46:37 la001 AFSR 0x00500000.000001b6 AFAR
0x00000000.35dd9780 AMBIGUOUS
Feb 6 17:46:37 la001 Fault_PC 0x1000ba50 Esynd 0x01b6 AMBIGUOUS
/N0/SB2/P1/E0 J5400
Feb 6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 907614 kern.notice]
[AFT1] errID 0x0000240e.76f08db0 Two Bits were in error
Feb 6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 693922 kern.info]
[AFT2] errID 0x0000240e.76f08db0 PA=0x00000000.35dd9780
Feb 6 17:46:37 la001 E$tag 0x00000000.d7124924 E$state_6 Modified
Feb 6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x00) 0x00500000.1044c758 0x00000000.00000008 ECC 0x082
*Bad* Esynd=0x1b6
Feb 6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x10) 0x00a029cf.0100fff1 0x00000000.10428830 ECC 0x1e3
*Bad* Esynd=0x02d
Feb 6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x20) 0x00500000.00000008 0x000029dd.0200fff1 ECC 0x1a7
*Bad* Esynd=0x1b6
Feb 6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x30) 0x00a00000.10016cf4 0x00000000.00000024 ECC 0x076
*Bad* Esynd=0x02d
Feb 6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Feb 6 17:46:37 la001 unix: [ID 321153 kern.notice] NOTICE: Scheduling
clearing of error on page 0x00000000.35dd8000
Feb 6 17:46:37 la001 unix: [ID 221039 kern.notice] NOTICE: Previously
reported error on page 0x00000000.35dd8000 cleared
Feb 6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 583311 kern.warning]
WARNING: [AFT1] DUE Event on CPU9 at TL=0, errID 0x0000240e.76f08db0
Feb 6 17:46:37 la001 AFSR 0x00500000.000001b6 AFAR
0x00000000.35dd9780
Feb 6 17:46:37 la001 Fault_PC 0x1000ba50 Esynd 0x01b6 /N0/SB4/P2/B1
Feb 6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 907614 kern.notice]
[AFT1] errID 0x0000240e.76f08db0 Two Bits were in error
Feb 6 17:46:37 la001 unix: [ID 321153 kern.notice]
NOTICE: Scheduling clearing of error on page 0x00000000.35dd8000
Feb 6 17:46:38 la001 unix: [ID 221039 kern.notice]
NOTICE: Previously reported error on page 0x00000000.35dd8000 cleared
Feb 6 17:46:38 la001 SUNW,UltraSPARC-III+: [ID 647234 kern.warning]
WARNING: [AFT1] Invalid AFSR CPU9 at TL=0, errID 0x0000240e.76f377a0
Feb 6 17:46:38 la001 AFSR 0x00000000.00000000 AFAR 0x00000000.35dd9780
INVALID
Feb 6 17:46:38 la001 Fault_PC 0x1009ebc4
-------------------------------------------------------------------------------
Section 4:
----------
Following the DUE, an EDU:ST (Uncorrectable Ecache data ECC error for
store merge or block load or prefetch queue operation) is reported
calling out /N0/SB2/P1/E1, but this is the destination of the error.
Note that the AFAR is 0x00000000.35dd9790, which is the same as the
DUE AFAR (0x00000000.35dd9780) when both are rounded down to a
32-byte boundary. It also has an Esynd of 0x0003.
Feb 6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 911731 kern.warning]
WARNING: [AFT1] EDU:ST Event on CPU9 at TL=0, errID 0x0000240e.76fd8150
Feb 6 17:46:45 la001 AFSR 0x00000008.00000003 AFAR
0x00000000.35dd9790
Feb 6 17:46:45 la001 Fault_PC 0x1009eb4c Esynd 0x0003 /N0/SB2/P1/E1
J5300
Feb 6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 662991 kern.notice]
[AFT1] errID 0x0000240e.76fd8150 Two Bits were in error
Feb 6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 102248 kern.info]
[AFT2] errID 0x0000240e.76fd8150 PA=0x00000000.35dd9780
Feb 6 17:46:45 la001 E$tag 0x00000000.d7924924 E$state_6 Modified
Feb 6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x00) 0x00000000.1044c758 0x00000000.00000008 ECC 0x081
*Bad* Esynd=0x003
Feb 6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x10) 0x000029cf.0100fff1 0x00000000.10428830 ECC 0x1e0
*Bad* Esynd=0x003
Feb 6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0x00000000.00000008 0x000029dd.0200fff1 ECC 0x1a7
Feb 6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0x00000000.10016cf4 0x00000000.00000024 ECC 0x076
Feb 6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Feb 6 17:46:45 la001 unix: [ID 321153 kern.notice]
NOTICE: Scheduling clearing of error on page 0x00000000.35dd8000
Feb 6 17:46:45 la001 unix: [ID 868141 kern.warning]
WARNING: Uncorrectable Error occurred at PA 0x00000000.35dd9780 while
attempting to clear previously reported error; page removed from service
-------------------------------------------------------------------------------
Section 5:
----------
This is similar to the messages in Section 3. An EDU and DUE are
reported together (same errID), as is an Invalid AFSR that can be
ignored. The DUE indicts /N0/SB4/P2/B1. Note that the AFAR
(0x00000000.35de5790) is different from the Section 3 AFAR
(0x00000000.35dd9780), but the syndromes in the third and fourth
checkwords are similar (0x1b6 and 0x02d, respectively). This is
consistent with a bad DRAM on a memory DIMM, but it is also consistent
with a component writing bad data into memory. The recordstop logs and
loghost logs need to be consulted to determine the true source of the
error.
Feb 6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 638863 kern.warning]
WARNING: [AFT1] EDU Event on CPU9 at TL=0, errID 0x0000240e.77173e60
Feb 6 17:46:51 la001 AFSR 0x00500000.0000002d AFAR
0x00000000.35de5790 AMBIGUOUS
Feb 6 17:46:51 la001 Fault_PC 0x100071bc Esynd 0x002d AMBIGUOUS
/N0/SB2/P1/E1 J5300
Feb 6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 432893 kern.notice]
[AFT1] errID 0x0000240e.77173e60 Two Bits were in error
Feb 6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 892145 kern.info]
[AFT2] errID 0x0000240e.77173e60 PA=0x00000000.35de5780
Feb 6 17:46:51 la001 E$tag 0x00000000.d7124924 E$state_6 Modified
Feb 6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x00) 0x00000000.1012529c 0x00000000.00000018 ECC 0x0eb
*Bad* Esynd=0x003
Feb 6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x10) 0x00a18d18.0100fff1 0x00000000.104a41e0 ECC 0x175
*Bad* Esynd=0x003
Feb 6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x20) 0x00500000.00000008 0x00018d49.0200fff1 ECC 0x150
*Bad* Esynd=0x1b6
Feb 6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x30) 0x00a00000.100940bc 0x00000000.000000d0 ECC 0x08e
*Bad* Esynd=0x02d
Feb 6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Feb 6 17:46:51 la001 unix: [ID 321153 kern.notice]
NOTICE: Scheduling clearing of error on page 0x00000000.35de4000
Feb 6 17:46:52 la001 unix: [ID 221039 kern.notice]
NOTICE: Previously reported error on page 0x00000000.35de4000 cleared
Feb 6 17:46:52 la001 SUNW,UltraSPARC-III+: [ID 998431 kern.warning]
WARNING: [AFT1] DUE Event on CPU9 at TL=0, errID 0x0000240e.77173e60
Feb 6 17:46:52 la001 AFSR 0x00500000.0000002d AFAR
0x00000000.35de5790
Feb 6 17:46:52 la001 Fault_PC 0x100071bc Esynd 0x002d /N0/SB4/P2/B1
Feb 6 17:46:52 la001 SUNW,UltraSPARC-III+: [ID 432893 kern.notice]
[AFT1] errID 0x0000240e.77173e60 Two Bits were in error
Feb 6 17:46:52 la001 unix: [ID 321153 kern.notice]
NOTICE: Scheduling clearing of error on page 0x00000000.35de4000
Feb 6 17:46:52 la001 unix: [ID 221039 kern.notice]
NOTICE: Previously reported error on page 0x00000000.35de4000 cleared
Feb 6 17:46:52 la001 SUNW,UltraSPARC-III+: [ID 379899 kern.warning]
WARNING: [AFT1] Invalid AFSR CPU9 at TL=0, errID 0x0000240e.77197090
Feb 6 17:46:52 la001 AFSR 0x00000000.00000000 AFAR 0x00000000.35de5790
INVALID
Feb 6 17:46:52 la001 Fault_PC 0x1009ebc4
-------------------------------------------------------------------------------
Section 6:
----------
Similar to Section 4, this is a subsequent EDU:ST which follows the
Section 5 DUE event. The EDU:ST AFAR (0x00000000.35de5790) is
identical to the Section 5 DUE AFAR (0x00000000.35de5790).
Feb 6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 254914 kern.warning]
WARNING: [AFT1] EDU:ST Event on CPU9 at TL=0, errID 0x0000240e.771e6500
Feb 6 17:46:59 la001 AFSR 0x00000008.00000003 AFAR
0x00000000.35de5790
Feb 6 17:46:59 la001 Fault_PC 0x1009eb4c Esynd 0x0003 /N0/SB2/P1/E1
J5300
Feb 6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 402940 kern.notice]
[AFT1] errID 0x0000240e.771e6500 Two Bits were in error
Feb 6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 317376 kern.info]
[AFT2] errID 0x0000240e.771e6500 PA=0x00000000.35de5780
Feb 6 17:46:59 la001 E$tag 0x00000000.d7924924 E$state_6 Modified
Feb 6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x00) 0x00000000.1012529c 0x00000000.00000018 ECC 0x0eb
*Bad* Esynd=0x003
Feb 6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x10) 0x00018d18.0100fff1 0x00000000.104a41e0 ECC 0x158
*Bad* Esynd=0x003
Feb 6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0x00000000.00000008 0x00018d49.0200fff1 ECC 0x150
Feb 6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0x00000000.100940bc 0x00000000.000000d0 ECC 0x08e
Feb 6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Feb 6 17:46:59 la001 unix: [ID 321153 kern.notice]
NOTICE: Scheduling clearing of error on page 0x00000000.35de4000
Feb 6 17:47:00 la001 unix: [ID 221039 kern.notice]
NOTICE: Previously reported error on page 0x00000000.35de4000 cleared
-------------------------------------------------------------------------------
Section 7:
----------
This is similar to Sections 3 and 5. The AFAR is 0x00000000.35de9780.
Note the same pattern of Esynd in the cache dump.
Feb 6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 498093 kern.warning]
WARNING: [AFT1] EDU Event on CPU9 at TL=0, errID 0x0000240e.772342f0
Feb 6 17:47:06 la001 AFSR 0x00500000.000001b6 AFAR
0x00000000.35de9780 AMBIGUOUS
Feb 6 17:47:06 la001 Fault_PC 0x1000ba50 Esynd 0x01b6 AMBIGUOUS
/N0/SB2/P1/E0 J5400
Feb 6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 279337 kern.notice]
[AFT1] errID 0x0000240e.772342f0 Two Bits were in error
Feb 6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 317260 kern.info]
[AFT2] errID 0x0000240e.772342f0 PA=0x00000000.35de9780
Feb 6 17:47:06 la001 E$tag 0x00000000.d7124924 E$state_6 Modified
Feb 6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x00) 0x0051f505.0100fff1 0x00000000.104eb230 ECC 0x126
*Bad* Esynd=0x1b6
Feb 6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x10) 0x00a00000.000001b0 0x0001f50e.0200fff1 ECC 0x06f
*Bad* Esynd=0x02d
Feb 6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x20) 0x00500000.1018ecd8 0x00000000.000000c4 ECC 0x1c6
*Bad* Esynd=0x1b6
Feb 6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x30) 0x00a1f51f.0200fff1 0x00000000.101a0388 ECC 0x05f
*Bad* Esynd=0x02d
Feb 6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Feb 6 17:47:06 la001 unix: [ID 321153 kern.notice]
NOTICE: Scheduling clearing of error on page 0x00000000.35de8000
Feb 6 17:47:07 la001 unix: [ID 221039 kern.notice]
NOTICE: Previously reported error on page 0x00000000.35de8000 cleared
Feb 6 17:47:07 la001 SUNW,UltraSPARC-III+: [ID 214841 kern.warning]
WARNING: [AFT1] DUE Event on CPU9 at TL=0, errID 0x0000240e.772342f0
Feb 6 17:47:07 la001 AFSR 0x00500000.000001b6 AFAR
0x00000000.35de9780
Feb 6 17:47:07 la001 Fault_PC 0x1000ba50 Esynd 0x01b6 /N0/SB4/P2/B1
Feb 6 17:47:07 la001 SUNW,UltraSPARC-III+: [ID 279337 kern.notice]
[AFT1] errID 0x0000240e.772342f0 Two Bits were in error
Feb 6 17:47:07 la001 unix: [ID 321153 kern.notice]
NOTICE: Scheduling clearing of error on page 0x00000000.35de8000
Feb 6 17:47:07 la001 unix: [ID 221039 kern.notice]
NOTICE: Previously reported error on page 0x00000000.35de8000 cleared
Feb 6 17:47:07 la001 SUNW,UltraSPARC-III+: [ID 736351 kern.warning]
WARNING: [AFT1] Invalid AFSR CPU9 at TL=0, errID 0x0000240e.77257750
Feb 6 17:47:07 la001 AFSR 0x00000000.00000000 AFAR 0x00000000.35de9780
INVALID
Feb 6 17:47:07 la001 Fault_PC 0x1009ebc4
-------------------------------------------------------------------------------
Section 8:
----------
Again, this EDU:ST is related to the Section 7 DUE in the same way that
Sections 4 and 6 relate to Sections 3 and 5, respectively.
Feb 6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 945805 kern.warning]
WARNING: [AFT1] EDU:ST Event on CPU9 at TL=0, errID 0x0000240e.7726bb10
Feb 6 17:47:14 la001 AFSR 0x00000008.00000003 AFAR
0x00000000.35de9790
Feb 6 17:47:14 la001 Fault_PC 0x1009eb4c Esynd 0x0003 /N0/SB2/P1/E1
J5300
Feb 6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 260570 kern.notice]
[AFT1] errID 0x0000240e.7726bb10 Two Bits were in error
Feb 6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 887708 kern.info]
[AFT2] errID 0x0000240e.7726bb10 PA=0x00000000.35de9780
Feb 6 17:47:14 la001 E$tag 0x00000000.d7924924 E$state_6 Modified
Feb 6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x00) 0x0001f505.0100fff1 0x00000000.104eb230 ECC 0x125
*Bad* Esynd=0x003
Feb 6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x10) 0x00000000.000001b0 0x0001f50e.0200fff1 ECC 0x06c
*Bad* Esynd=0x003
Feb 6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0x00000000.1018ecd8 0x00000000.000000c4 ECC 0x1c6
Feb 6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0x0001f51f.0200fff1 0x00000000.101a0388 ECC 0x05f
Feb 6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Feb 6 17:47:14 la001 unix: [ID 321153 kern.notice] NOTICE:
Scheduling clearing of error on page 0x00000000.35de8000
Feb 6 17:47:14 la001 unix: [ID 221039 kern.notice]
NOTICE: Previously reported error on page 0x00000000.35de8000 cleared
-------------------------------------------------------------------------------
If the troubleshooting engineer looked only at the last entry, they
might incorrectly conclude that the L2SRAM is at fault, when in fact
the suspect part is DIMM /N0/SB4/P2/B1/D2 (i.e. J15501), which reported
the first CE error. For this case, the corrective action was to replace
DIMM /N0/SB4/P2/B1/D2 (i.e. J15501), which resolved the problem.
EXAMPLE 2:
==========
The following is a second example of the problem where the last
recorded entry looks like an L2SRAM problem due to the syndrome
0x071 error message, but the error originated in the DIMM. Only a
few of the error messages from the messages file are shown.
---------------------------------------------------------------------------------
Section 1:
----------
A CE event is recorded on a DIMM read calling out SB3/P2/B0/D1 J15400
on data bit 120.
Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 964002 kern.info]
NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU449 at
TL=0, errID 0x0003c931.296a7ee0
Feb 19 14:22:25 ht01da AFSR 0x00000002.00000068 AFAR
0x00000061.eab2e2a0
Feb 19 14:22:25 ht01da Fault_PC 0x104fe04 Esynd 0x0068 SB3/P2/B0/D1
J15400
Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 270288 kern.info]
[AFT0] errID 0x0003c931.296a7ee0 Corrected Memory Error on SB3/P2/B0/D1
J15400 is Intermittent
Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 634372 kern.info]
[AFT0] errID 0x0003c931.296a7ee0 Data Bit 120 was in error and corrected
Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 491712 kern.info]
[AFT2] errID 0x0003c931.296a7ee0 PA=0x00000061.eab2e280
Feb 19 14:22:25 ht01da E$tag 0x00000187.aa000124 E$state_2 Modified
Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x00) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x10) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Feb 19 14:22:25 ht01da unix: [ID 868141 kern.warning]
WARNING: Uncorrectable Error occurred at PA 0x00000061.eab2e280 while
attempting to clear previously reported error; page removed from service
Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Feb 19 14:22:26 ht01da unix: [ID 321153 kern.notice]
NOTICE: Scheduling clearing of error on page 0x00000061.eab2e000
-------------------------------------------------------------------------------
Section 2:
----------
In this case, the DUE is reported after the EDU:ST. The important
thing to note, however, is that there is a DUE and EDU:ST pair. The
EDU:ST shows the recipient of the bad data, not the source. The
recipient of the bad data is SB14/P1/E0 J5400.
Note that the AFAR is the same as the above CE AFAR when both are
rounded down to a 64-byte boundary.
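The 64-byte comparison can be checked by hand or with a few lines of
code. The following is a minimal sketch only: the helper names
parse_pa and line_base are hypothetical, and the addresses are the
PA reported with the CE (errID 0x0003c931.296a7ee0), the AFAR of the
EDU:ST in this section, and the AFAR of the second EDU:ST reported
later in this log.

```python
CACHE_LINE_BYTES = 64  # boundary named in the text of this FIN


def parse_pa(text):
    """Convert the log's 'hi.lo' hex notation (0x00000061.eab2e280) to an int."""
    return int(text.replace(".", ""), 16)


def line_base(pa):
    """Round a physical address down to its 64-byte boundary."""
    return pa & ~(CACHE_LINE_BYTES - 1)


ce_pa = parse_pa("0x00000061.eab2e280")       # PA logged with the CE
edu_afar = parse_pa("0x00000061.eab2e280")    # AFAR of the first EDU:ST
second_edu = parse_pa("0x00000061.eab2e290")  # AFAR of the second EDU:ST

# All three addresses reduce to the same 64-byte line.
print(line_base(ce_pa) == line_base(edu_afar))
print(hex(line_base(second_edu)))
```

Two events whose AFARs reduce to the same value refer to the same
64-byte block of data, which is what ties the EDU:ST back to the
earlier CE.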
Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 947761 kern.warning]
WARNING: [AFT1] EDU:ST Event detected by CPU449 at TL=0, errID
0x0003c931.297eb2d5
Feb 19 14:22:26 ht01da AFSR 0x00000008.0000018c AFAR
0x00000061.eab2e280
Feb 19 14:22:26 ht01da Fault_PC 0x1007580 Esynd 0x018c SB14/P1/E0
J5400
Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 119791 kern.notice]
[AFT1] errID 0x0003c931.297eb2d5 Two Bits were in error
Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 643204 kern.info]
[AFT2] errID 0x0003c931.297eb2d5 PA=0x00000061.eab2e280
Feb 19 14:22:26 ht01da E$tag 0x00000187.aa000100 E$state_2 Modified
Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x00) 0xfcff0000.00000000 0x00000000.00000000 ECC 0x134
*Bad* Esynd=0x18c
Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x10) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Feb 19 14:22:26 ht01da unix: [ID 321153 kern.notice]
NOTICE: Scheduling clearing of error on page 0x00000061.eab2e000
-------------------------------------------------------------------------------
Section 3:
----------
As mentioned in Section 2, the DUE is shown after the EDU:ST.
Feb 19 14:22:27 ht01da SUNW,UltraSPARC-III+: [ID 879788 kern.warning]
WARNING: [AFT1] DUE Event detected by CPU449 at TL=0, errID
0x0003c931.297f3ed5
Feb 19 14:22:27 ht01da AFSR 0x00500000.0000018c AFAR
0x00000061.eab2e280
Feb 19 14:22:27 ht01da Fault_PC 0x104fe04 Esynd 0x018c SB3/P2/B0
J15300 J15400 J15500 J15600
Feb 19 14:22:27 ht01da SUNW,UltraSPARC-III+: [ID 439813 kern.notice]
[AFT1] errID 0x0003c931.297f3ed5 Two Bits were in error
Feb 19 14:22:27 ht01da unix: [ID 321153 kern.notice]
NOTICE: Scheduling clearing of error on page 0x00000061.eab2e000
-------------------------------------------------------------------------------
Section 4:
----------
A second EDU:ST is reported, this time with Esynd=0x003. Its AFAR
(0x00000061.eab2e290) still matches the earlier AFARs when rounded
down to a 64-byte boundary (0x00000061.eab2e280).
Feb 19 14:22:28 ht01da SUNW,UltraSPARC-III+: [ID 632355 kern.warning]
WARNING: [AFT1] EDU:ST Event detected by CPU449 at TL=0, errID
0x0003c931.2a71fa89
Feb 19 14:22:28 ht01da AFSR 0x00000008.00000003 AFAR
0x00000061.eab2e290
Feb 19 14:22:28 ht01da Fault_PC 0x104fe04 Esynd 0x0003 SB14/P1/E1
J5300
Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 143088 kern.notice]
[AFT1] errID 0x0003c931.2a71fa89 Two Bits were in error
Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 535859 kern.info]
[AFT2] errID 0x0003c931.2a71fa89 PA=0x00000061.eab2e280
Feb 19 14:22:29 ht01da E$tag 0x00000187.aa924900 E$state_2 Modified
Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x00) 0xfcffffff.ffffffff 0xffffffff.ffffffff ECC 0x00c
*Bad* Esynd=0x003
Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x10) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x180
*Bad* Esynd=0x003
Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Feb 19 14:22:29 ht01da unix: [ID 321153 kern.notice]
NOTICE: Scheduling clearing of error on page 0x00000061.eab2e000
-------------------------------------------------------------------------------
Section 5:
----------
Data with two bits in error was stored in the L2SRAM on board 14.
When the cache line is evicted, it results in a WDU event with
syndrome 0x003. Note that the error message (the fourth line in the
trace) tells you that this error likely originated from a previous
EDU:ST. (The DUE that brought this line into the cache, and the
EDU:ST that rewrote the syndrome, are not shown in this excerpt, but
do appear in the messages file from which this example was taken.)
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 963611 kern.warning]
WARNING: [AFT1] WDU Event detected by CPU449 at TL=0, errID
0x0003c931.2a93c40a
Feb 19 14:22:48 ht01da AFSR 0x00000020.00000003 AFAR
0x00000061.eab2f690
Feb 19 14:22:48 ht01da Fault_PC 0x1170690 Esynd 0x0003 SB14/P1/E1
J5300
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 744453 kern.notice]
[AFT1] errID 0x0003c931.2a93c40a Two Bits in error, likely from E$
EDU:ST
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 744087 kern.info]
[AFT2] errID 0x0003c931.2a93c40a E$tag PA=0x00000000.0032f680 does
not match AFAR=0x00000061.eab2f680
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 936089 kern.info]
[AFT2] errID 0x0003c931.2a93c40a PA=0x00000000.0032f680
Feb 19 14:22:48 ht01da E$tag 0x00000000.00000000 E$state_2 Invalid
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x00) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x180
*Bad* Esynd=0x003
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x10) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x180
*Bad* Esynd=0x003
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 744087 kern.info]
[AFT2] errID 0x0003c931.2a93c40a E$tag PA=0x000001e1.86f2f680 does not
match AFAR=0x00000061.eab2f680
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 936089 kern.info]
[AFT2] errID 0x0003c931.2a93c40a PA=0x000001e1.86f2f680
Feb 19 14:22:48 ht01da E$tag 0x00000786.1b000000 E$state_2 Invalid
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x00) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x10) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0x00000000.00000000 0x00000700.8ef58e40 ECC 0x0af
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Feb 19 14:22:48 ht01da unix: [ID 321153 kern.notice]
NOTICE: Scheduling clearing of error on page 0x00000061.eab2e000
------------------------------------------------------------------------------
Section 6:
----------
Other data with two bits in error was stored in the L2SRAM on board
14. When this cache line is evicted, it also results in a WDU event
with the syndrome *Bad* Esynd=0x003. (Again, the DUE that brought
this line into the cache, and the EDU:ST that rewrote the syndrome,
are not shown in this excerpt, but do appear in the messages file
from which this example was taken.)
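The WDU AFARs in Sections 5 and 6 fall on different cache lines than
the original error, yet the kernel schedules clearing on the same
page, 0x00000061.eab2e000, for every event in this log. The sketch
below checks that relationship. It assumes an 8 KB page size, which
is consistent with the page addresses in the NOTICE lines but is not
stated in this FIN; the helper names are hypothetical.

```python
PAGE_BYTES = 8192  # assumption: 8 KB page, consistent with the NOTICE lines


def parse_pa(text):
    """Convert the log's 'hi.lo' hex notation to an int."""
    return int(text.replace(".", ""), 16)


def page_base(pa):
    """Round a physical address down to the start of its page."""
    return pa & ~(PAGE_BYTES - 1)


# Page named in "Scheduling clearing of error on page ..." in this log.
scheduled_page = parse_pa("0x00000061.eab2e000")

# AFARs copied from the events in this log.
afars = {
    "EDU:ST / DUE": "0x00000061.eab2e280",
    "WDU (Section 5)": "0x00000061.eab2f690",
    "WDU (Section 6)": "0x00000061.eab2ee90",
}

for label, text in afars.items():
    same_page = page_base(parse_pa(text)) == scheduled_page
    print(f"{label}: same page as scheduled clearing = {same_page}")
```

A shared page is only a hint that the events are related; the
DUE and EDU:ST records in the full messages file remain the evidence
for where each bad cache line came from.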
Feb 19 14:22:50 ht01da SUNW,UltraSPARC-III+: [ID 963783 kern.warning]
WARNING: [AFT1] WDU Event detected by CPU449 at TL=0, errID
0x0003c931.2a96a1cd
Feb 19 14:22:50 ht01da AFSR 0x00000020.00000003 AFAR
0x00000061.eab2ee90
Feb 19 14:22:50 ht01da Fault_PC 0x1170690 Esynd 0x0003 SB14/P1/E1
J5300
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 322410 kern.notice]
[AFT1] errID 0x0003c931.2a96a1cd Two Bits in error, likely from E$
EDU:ST
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 555008 kern.info]
[AFT2] errID 0x0003c931.2a96a1cd E$tag PA=0x00000000.0032ee80 does
not match AFAR=0x00000061.eab2ee80
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 421903 kern.info]
[AFT2] errID 0x0003c931.2a96a1cd PA=0x00000000.0032ee80
Feb 19 14:22:51 ht01da E$tag 0x00000000.00000000 E$state_2 Invalid
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x00) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x180
*Bad* Esynd=0x003
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x10) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x180
*Bad* Esynd=0x003
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 555008 kern.info]
[AFT2] errID 0x0003c931.2a96a1cd E$tag PA=0x00000000.0072ee80 does
not match AFAR=0x00000061.eab2ee80
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 421903 kern.info]
[AFT2] errID 0x0003c931.2a96a1cd PA=0x00000000.0072ee80
Feb 19 14:22:51 ht01da E$tag 0x00000000.01000000 E$state_2 Invalid
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x00) 0x8143c000.9ba01a2c 0xae34c013.973d2007 ECC 0x1b8
*Bad* Esynd=0x003
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x10) 0x99150013.b1a01894 0x11800003.9684c014 ECC 0x03f
*Bad* Esynd=0x003
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x20) 0xe248001b.ada018d2 0xa634c013.8143c000 ECC 0x101
*Bad* Esynd=0x003
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x30) 0xa93d0014.81ab0a28 0xb3a000a5.988cf4e6 ECC 0x038
*Bad* Esynd=0x003
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Feb 19 14:22:51 ht01da unix: [ID 321153 kern.notice]
NOTICE: Scheduling clearing of error on page 0x00000061.eab2e000
-----------------------------------------------------------------------------
Section 7:
----------
Finally, an error event occurs which results in an uncorrectable system
bus error (UE) and a system panic. This event reports syndrome 0x071,
which is typically associated with an L2SRAM error, when in fact the
data error originated in the DIMM, as recorded by the initial CE
error. An engineer who looked only at this entry might incorrectly
conclude that the error is an L2SRAM error.
Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 918604 kern.warning]
WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by
CPU449 Privileged Data Access at TL=0, errID 0x0003c931.2aa7c82f
Feb 19 14:22:54 ht01da AFSR 0x00100004.00000071 AFAR
0x00000061.eab2e280
Feb 19 14:22:54 ht01da Fault_PC 0x104fe68 Esynd 0x0071 SB3/P2/B0
J15300 J15400 J15500 J15600
Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 685643 kern.notice]
[AFT1] errID 0x0003c931.2aa7c82f Two Bits in error, likely from E$
WDU/CPU
Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 577198 kern.info]
[AFT2] errID 0x0003c931.2aa7c82f PA=0x00000061.eab2e280
Feb 19 14:22:54 ht01da E$tag 0x00000187.aa000049 E$state_2 Shared
Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x00) 0x3cffffff.ffffffff 0xffffffff.ffffffff ECC 0x00f
*Bad* Esynd=0x071
Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info]
[AFT2] E$Data (0x10) 0x3fffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
*Bad* Esynd=0x071
Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Analysis of suspicious bits and comparison with CE reports in the log
allow us to narrow down the fault from the entire SB3/P2/B0 bank to
the single DIMM at SB3/P2/B0/D1 J15400.
The corrective action is to replace DIMM SB3/P2/B0/D1 J15400.
The conclusion from the two examples above is:
As with any type of error, not only is it important to review the last
error in the log file, it is absolutely critical to follow the steps
that led up to the error in order to find the source of the error and
replace the appropriate component. Not retracing the steps may result
in the wrong part being replaced.
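Retracing the steps can be partly automated. The following is a
minimal sketch, not a supported tool: it scans message text for
[AFT0]/[AFT1] event headers, reads the AFAR that follows each one,
and lists, in log order, every event that shares a 64-byte line with
the last event. The regular expression and function names are
hypothetical and were written against the message layout shown in
this FIN only; other kernel patch levels may format messages
differently.

```python
import re

# Matches an event header and the first "AFAR <addr>" that follows it.
# Whitespace (including line wraps) between tokens is tolerated.
EVENT_RE = re.compile(
    r"\[AFT[01]\]\s+(?P<kind>[^\[\]]+?)\s+Event\s+detected\s+by\s+(?P<cpu>CPU\d+)"
    r".*?AFAR\s+(?P<afar>0x[0-9a-fA-F]{8}\.[0-9a-fA-F]{8})",
    re.DOTALL,
)


def parse_pa(text):
    """Convert the log's 'hi.lo' hex notation to an int."""
    return int(text.replace(".", ""), 16)


def find_events(log_text):
    """Return (kind, cpu, afar) tuples in the order they appear."""
    events = []
    for m in EVENT_RE.finditer(log_text):
        kind = " ".join(m.group("kind").split())  # collapse wrapped text
        events.append((kind, m.group("cpu"), parse_pa(m.group("afar"))))
    return events


def chain_for_last_event(log_text):
    """Events that share a 64-byte line with the final event."""
    events = find_events(log_text)
    if not events:
        return []
    target = events[-1][2] & ~63
    return [e for e in events if (e[2] & ~63) == target]


# Excerpt assembled from the event headers in this FIN.
SAMPLE = """\
WARNING: [AFT1] EDU:ST Event detected by CPU449 at TL=0, errID
0x0003c931.297eb2d5
Feb 19 14:22:26 ht01da AFSR 0x00000008.0000018c AFAR
0x00000061.eab2e280
WARNING: [AFT1] DUE Event detected by CPU449 at TL=0, errID
0x0003c931.297f3ed5
Feb 19 14:22:27 ht01da AFSR 0x00500000.0000018c AFAR
0x00000061.eab2e280
WARNING: [AFT1] WDU Event detected by CPU449 at TL=0, errID
0x0003c931.2a93c40a
Feb 19 14:22:48 ht01da AFSR 0x00000020.00000003 AFAR
0x00000061.eab2f690
WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by
CPU449 Privileged Data Access at TL=0, errID 0x0003c931.2aa7c82f
Feb 19 14:22:54 ht01da AFSR 0x00100004.00000071 AFAR
0x00000061.eab2e280
"""

for kind, cpu, afar in chain_for_last_event(SAMPLE):
    print(f"{kind:32s} {cpu}  AFAR line 0x{afar & ~63:x}")
```

The output is only a starting point for the analysis. The engineer
still has to read the CE, DUE and EDU:ST records themselves to decide
which component was the source and which was the recipient of the
bad data.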
Implementation:
---
| | MANDATORY (Fully Proactive)
---
---
| | CONTROLLED PROACTIVE (per Sun Geo Plan)
---
---
| X | REACTIVE (As Required)
---
Corrective Action:
The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the above
mentioned problem.
1. Use the guidelines provided above when diagnosing main memory and
L2SRAM errors on the listed platforms.
2. Apply the patches recommended in Sun Alert 50471 to prevent
unnecessary L2SRAM issues. These patches provide the following
levels of firmware/SMS software:
. Upgrade Sun Fire 12K/15K customers to SMS 1.3 (114608-01 or later)
or SMS 1.2 (112488-11 or later) at the earliest opportunity.
. Upgrade Sun Fire 3800 - 6800 customers to 5.14.4 firmware (112883-05
or later) or 5.13.5 firmware (112494-08 or later) at the earliest
opportunity.
. Upgrade Sun Fire V1280 and Netra 1280 customers to 5.13.0012 firmware
(113751-02 or later).
3. Kernel Update and SunVTS Requirements:
. Per FIN I0909-2, 108528-18 (Solaris 8) and 112233-04 (Solaris 9) are
the minimum recommended kernel updates to be deployed with the firmware
and SMS patches shown above.
. Per FIN I0909-2, SunVTS version 5.1 is the minimum recommended SunVTS
version.
4. Escalations/CIC:
Customers that require replacement of the 900 MHz CPU boards or
servers for suspected L2 SRAM issue(s) will need to follow standard
Escalation/CIC processes. The escalation and CIC request will be
reviewed by the appropriate technical teams. If CIC action is
required, the customer will be prioritized for distribution of
hardware.
The escalation policy is published at:
http://onestop/programs/us3quality
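The minimum kernel patch levels in step 3 can be compared against a
list of installed patch IDs, for example one collected from the
output of `showrev -p`. The sketch below is illustrative only: the
function names are hypothetical, and the installed patch lists are
made-up sample data, not taken from any real system.

```python
# Minimum kernel update patches from step 3 of this FIN.
MINIMUM_KERNEL_PATCH = {
    "5.8": "108528-18",  # Solaris 8
    "5.9": "112233-04",  # Solaris 9
}


def split_patch(patch_id):
    """Split 'BBBBBB-RR' into (base, revision as int)."""
    base, rev = patch_id.split("-")
    return base, int(rev)


def meets_minimum(installed_ids, minimum_id):
    """True if the same base patch is installed at or above the minimum revision."""
    min_base, min_rev = split_patch(minimum_id)
    return any(
        base == min_base and rev >= min_rev
        for base, rev in map(split_patch, installed_ids)
    )


# Hypothetical sample data for a Solaris 8 domain.
installed = ["108528-29", "108993-18"]
print(meets_minimum(installed, MINIMUM_KERNEL_PATCH["5.8"]))

# Hypothetical sample data with a kernel patch below the minimum.
print(meets_minimum(["108528-17"], MINIMUM_KERNEL_PATCH["5.8"]))
```

This comparison only recognizes the same base patch ID. If a kernel
patch has been superseded by a patch with a different ID, the check
will report a miss and the patch history must be reviewed by hand.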
Comments:
None.
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Sun Services will attempt to contact
all affected customers to recommend implementation of the FIN.
ii) For CONTROLLED PROACTIVE FINs, Sun Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Sun Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.central/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.central/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
----------------
* Access the top level URL of https://spe.sun.com
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------