Document fins/I0616-1
FIN #: I0616-1
SYNOPSIS: Ecache Memory Parity Error
DATE: Sep/11/00
KEYWORDS: Ecache Memory Parity Error
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
SYNOPSIS: Solaris kernel patches provide improved handling and reduction
of CPU, Ecache, and main memory errors in UltraSPARC systems.
TOP FIN/FCO REPORT: Yes
PRODUCT_REFERENCE: Solaris 2.5.1, 2.6, 7, and 8
PRODUCT CATEGORY: Software / Solaris
PRODUCTS AFFECTED:
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
Systems Affected
----------------
- E10000-HPC ALL Ultra Enterprise 10000 HPC -
- E10000 ALL Ultra Enterprise 10000 -
- E6500-HPC ALL Ultra Enterprise 6500 HPC -
- E6500 ALL Ultra Enterprise 6500 -
- E5500-HPC ALL Ultra Enterprise 5500 HPC -
- E5500 ALL Ultra Enterprise 5500 -
- E4500-HPC ALL Ultra Enterprise 4500 HPC -
- E4500 ALL Ultra Enterprise 4500 -
- E3500-HPC ALL Ultra Enterprise 3500 HPC -
- E3500 ALL Ultra Enterprise 3500 -
- E450-HPC ALL Ultra Enterprise 450 HPC -
- A25 ALL Enterprise 450 -
- A33 ALL Enterprise 420R -
- A26 ALL Enterprise 250 -
- A34 ALL Enterprise 220R -
- N14 ALL Netra T-1405 -
- N15 ALL Netra T-1400 -
- N06 ALL Netra T1 AC -
- N04 ALL Netra T-1125 -
- N03 ALL Netra T-1120 -
- A27 ALL Ultra 80 -
- A23 ALL Ultra 60 -
- A20 ALL Ultra 450 -
- A16 ALL Ultra 30 -
- A14 ALL Ultra 2 -
- E6000 ALL Ultra Enterprise 6000 -
- E5000 ALL Ultra Enterprise 5000 -
- E4000 ALL Ultra Enterprise 4000 -
- E3000 ALL Ultra Enterprise 3000 -
- A12 ALL Ultra 1E -
- A11 ALL Ultra 1 -
- A22 ALL Ultra 10 -
- A21 ALL Ultra 5 -
X-Options Affected
------------------
X2248A - - 480MHz UltraSPARC II Module 8MB Cache -
X2244A - - 400MHz UltraSPARC II Module 4MB Cache -
X1994A - - 400MHz UltraSPARC II Module 2MB Cache -
X2240A - - 300MHz UltraSPARC II Module 2MB Cache -
X2230A - - 250MHz UltraSPARC II Module 1MB Cache -
X1995A - - 450MHz UltraSPARC II Module 4MB Cache -
X1997A - - 440MHz UltraSPARC II Module 4MB Cache -
X2580A - - 400MHz UltraSPARC II Module 8MB Cache -
X2570A - - 400MHz UltraSPARC II Module 4MB Cache -
X1993A - - 400MHz UltraSPARC II Module 2MB Cache -
X1992A - - 360MHz UltraSPARC II Module 4MB Cache -
X2560A - - 336MHz UltraSPARC II Module 4MB Cache -
X1991A - - 300MHz UltraSPARC II Module 1MB Cache -
X2550A - - 250MHz UltraSPARC II Module 4MB Cache -
X1990A - - 250MHz UltraSPARC II Module 1MB Cache -
X2530A - - 250MHz UltraSPARC II Module 1MB Cache -
X1188A - - 200MHz UltraSPARC I Module 1MB Cache -
X2510A - - 167MHz UltraSPARC I Module 1MB Cache -
X1187A - - 167MHz UltraSPARC I Module .5MB Cache -
X2500A - - 167MHz UltraSPARC I Module .5MB Cache -
PART NUMBERS AFFECTED:
Part Number Description Model
----------- ----------- -----
501-5729-0X 480 MHz UltraSPARC II Module 8MB Cache -
501-5344-0X 450 MHz UltraSPARC II Module 4MB Cache -
501-5539-0X 450 MHz UltraSPARC II Module 4MB Cache -
501-5682-0X 440 MHz UltraSPARC II Module 4MB Cache -
501-5235-0X 400 MHz UltraSPARC II Module 8MB Cache -
501-5661-0X 400 MHz UltraSPARC II Module 8MB Cache -
501-5762-0X 400 MHz UltraSPARC II Module 8MB Cache -
501-4995-0X 400 MHz UltraSPARC II Module 4MB Cache -
501-5239-0X 400 MHz UltraSPARC II Module 4MB Cache -
501-5420-0X 400 MHz UltraSPARC II Module 4MB Cache -
501-5425-0X 400 MHz UltraSPARC II Module 4MB Cache -
501-5446-0X 400 MHz UltraSPARC II Module 4MB Cache -
501-5500-0X 400 MHz UltraSPARC II Module 4MB Cache -
501-5585-0X 400 MHz UltraSPARC II Module 4MB Cache -
501-5237-0X 400 MHz UltraSPARC II Module 2MB Cache -
501-5445-0X 400 MHz UltraSPARC II Module 2MB Cache -
501-5541-0X 400 MHz UltraSPARC II Module 2MB Cache -
501-5545-0X 400 MHz UltraSPARC II Module 2MB Cache -
501-4781-0X 360 MHz UltraSPARC II Module 4MB Cache -
501-5129-0X 360 MHz UltraSPARC II Module 4MB Cache -
501-5552-0X 360 MHz UltraSPARC II Module 4MB Cache -
501-4363-0X 336 MHz UltraSPARC II Module 4MB Cache -
501-4196-0X 300 MHz UltraSPARC II Module 2MB Cache -
501-4849-0X 300 MHz UltraSPARC II Module 2MB Cache -
501-4249-0X 250 MHz UltraSPARC II Module 4MB Cache -
501-4836-0X 250 MHz UltraSPARC II Module 4MB Cache -
501-4178-0X 250 MHz UltraSPARC II Module 1MB Cache -
501-4278-0X 250 MHz UltraSPARC II Module 1MB Cache -
501-4857-0X 250 MHz UltraSPARC II Module 1MB Cache -
501-3041-0X 200 MHz UltraSPARC I Module 1MB Cache -
501-4791-0X 200 MHz UltraSPARC I Module 1MB Cache -
501-2959-0X 167 MHz UltraSPARC I Module 1MB Cache -
501-2702-03 167 MHz UltraSPARC I Module .5MB Cache -
501-2941-0X 167 MHz UltraSPARC I Module .5MB Cache -
501-2942-0X 167 MHz UltraSPARC I Module .5MB Cache -
501-5149-0X 440 MHz UltraSPARC IIi Module 2MB Cache -
501-5740-0X 400 MHz UltraSPARC IIi Module 2MB Cache -
501-5741-0X 400 MHz UltraSPARC IIi Module 2MB Cache -
501-5148-0X 360 MHz UltraSPARC IIi Module 256KB Cache -
501-5222-0X 360 MHz UltraSPARC IIi Module 2MB Cache -
501-5090-0X 333 MHz UltraSPARC IIi Module 2MB Cache -
501-5568-0X 333 MHz UltraSPARC IIi Module 2MB Cache -
501-4379-0X 300 MHz UltraSPARC IIi Module 512KB Cache -
501-5040-0X 300 MHz UltraSPARC IIi Module 512KB Cache -
501-4477-0X 270 MHz UltraSPARC IIi Module 256KB Cache -
501-5039-0X 270 MHz UltraSPARC IIi Module 256KB Cache -
(SCSI Devices)
Type Vendor Model Serial Number(Min) Serial Number(Max) Firmware
---- ------ ------- ------------------ ------------------ --------
N/A
REFERENCES:
FIN: I0570-3
FIN: I0593-1
Sun Alert: SA 24669 - Possible WAIT_MBOX_DONE System Panics With Recent
Kernel Update Patches
DOC: 806-5118-13 Best Practices Guide Addressing: E-cache Parity Errors
PatchId: 103640 Kernel Patch (Solaris 2.5.1)
PatchId: 105181 Kernel Patch (Solaris 2.6)
PatchId: 106541 Kernel Patch (Solaris 7)
PatchId: 108528 Kernel Patch (Solaris 8)
PatchId: 110151 SunMC 2.1 FCS Patch (Solaris 2.6)
PatchId: 110152 SunMC 2.1 L10N Patch (Solaris 2.6)
PatchId: 110094 SunMC 2.1.1 FCS Patch (Solaris 2.6)
PatchId: 103346 Exx00 flashprom update
URL: http://bestpractices.central/
URL: http://cte-www.uk/cgi-bin/afsr/afsr.pl
URL: http://cte-www.eng/cgi-bin/afsr/afsr.pl
PROBLEM DESCRIPTION:
Solaris Kernel patches are available (see "Features Table" below for
availability details) that provide improved handling and reduction of CPU,
Ecache, and main memory errors in systems using UltraSPARC-I, -II, -IIi,
and -IIe processors. All customers on Solaris 2.5.1, 2.6, 7 and 8 are
encouraged to consider upgrading to these kernel patches as they become
available.
Table Of Contents
*****************
Kernel Patch Features Overview
Cache Scrubber
Improved Error Handling
Improved Error Messages
Performance Considerations
Kernel Patch Features Details
Features Table
Details on the Cache Scrubber
Errors and Events
Details on Improved Error Handling
Details on Improved Error Messages
Messages that identify the type and source of an error
Messages that supply a cache line or memory dump
Messages from the kernel error recovery code
Messages that indicate the disposition of an error
Error Messages Examples
EDP Event - Ecache Data Parity Event
WP Event - Writeback Data Parity Error
CP Event - Copyout Data Parity Error
UE Event - Uncorrectable Memory Error
BERR Event - Bus Error
CE Event - Correctable Memory Error
Starfire Specific
Arbstop
Recordstop
DTag Considerations
Kernel Patch Features Overview
******************************
With the patches listed below, one or more of the following features
become available in the Solaris operating system (see "Features Table"
below to determine the features delivered with each patch):
1. Cache Scrubber
==============
To reduce the likelihood of Ecache Data, Writeback and CopyOut
Parity errors, a "Cache Scrubber" has been implemented in the
Solaris Kernel that periodically flushes modified cache lines out
to main memory and invalidates cache lines that have not been
modified. By reducing the likelihood that an otherwise nonfatal
error in the Ecache will result in a system failure, this procedure
improves the system's reliability.
2. Improved Error Handling
=======================
Each error reported by the CPU is now evaluated to determine
whether it is fatal to the operating system, only fatal to a
user process, or of no immediate consequence. Fatal errors in
the kernel result in a system panic, as they did before. Fatal
errors within user space will now cause the machine to reboot
instead of panic, allowing file systems to be fully synched and
also preventing the creation of unnecessary kernel core files.
Events that do not affect the integrity of either the kernel or
user processes are logged, but otherwise ignored.
Because UltraSPARC-IIi and UltraSPARC-IIe use simplified error
reporting logic (as compared to UltraSPARC-II), the error
handling behavior for UltraSPARC-IIi and UltraSPARC-IIe based
systems has not been changed. Those systems will still panic
on most CPU, Ecache, or uncorrectable memory errors.
3. Improved Error Messages
=======================
The CPU, Ecache, and memory error messages have been improved
to be more accurate and complete. Text descriptions have been
rewritten to emphasize the important parameters associated with
each event. Also, the logic for reporting hardware errors has
changed to ensure that error events are reported accurately,
completely, and in the order they occurred. These new error
messages will make it easier to determine the CPU that has
encountered an error.
There are related patches to SunMC so that it will recognize
the improved error messages; without them, the management
console will under-report the occurrence of corrected main
memory errors. See "Corrective Action" item 3, below, for a list
of the related patches.
Performance Considerations
==========================
The above changes can slightly degrade system performance. The primary
cause of this is the Improved Error Handling, which required inserting
membars in the kernel to properly isolate user-encountered errors from
kernel-encountered ones. (A membar is an UltraSPARC instruction that
stalls the CPU pipeline until all outstanding memory operations have
completed, and any errors that may result from them have been reported.
Any errors reported after the execution of a membar completes can only
result from instructions that follow the membar in the instruction stream.)
In addition, the Cache Scrubber consumes 0.4% of CPU cycles in scanning
the Ecache.
Measurements using industry standard benchmarks have shown a decrease
in TPC-C performance of about 2% and in one kenbus configuration a
decrease in performance of about 5%. Performance degradation of most
of the other benchmarks in the performance suite was indistinguishable
from measurement noise. We do not expect most customers to notice
significant performance degradation.
Kernel Patch Features Details
*****************************
Features Table
==============
The following list gives details about the features delivered with each
of the patches:
Solaris 2.5.1 with patch 103640 will introduce:
- Cache Scrubber
Solaris 2.6 with patch 105181 will introduce:
- Cache Scrubber
- Improved Error Messages
- Improved Error Handling [1]
Solaris 7 with patch 106541 (est. Nov/10/2000) will introduce:
- Cache Scrubber
- Improved Error Messages
- Improved Error Handling [1]
Solaris 8 with patch 108528 (est. Oct/27/2000) will introduce:
- Cache Scrubber Only for UltraSPARC-I, -II, -IIi
- Improved Error Messages Only for UltraSPARC-I, -II, -IIi, -IIe
- Improved Error Handling [1] Only for UltraSPARC-I, -II
Solaris 8 Update 3 (est. Dec/2000) will introduce:
- Cache Scrubber Only for UltraSPARC-I, -II, -IIi, -IIe
- Improved Error Messages Only for UltraSPARC-I, -II, -IIi, -IIe
- Improved Error Handling [1] Only for UltraSPARC-I, -II
NOTE [1]: Due to hardware limitations there is no improved error handling
for UltraSPARC-IIi and UltraSPARC-IIe based systems.
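The list above can also be read as a lookup table. The following is a minimal Python sketch of that reading, not a Sun tool: the names FEATURES and has_feature are hypothetical, None means "the notice lists no CPU restriction", and the restriction of Improved Error Handling to UltraSPARC-I and -II for Solaris 2.6 and 7 is taken from NOTE [1].

```python
# Hypothetical lookup built from the Features Table in this notice.
# None = no CPU restriction listed; a set = CPUs the feature applies to.
HANDLING_CPUS = {"I", "II"}  # per NOTE [1]: not available on IIi / IIe

FEATURES = {
    "Solaris 2.5.1 + 103640": {"Cache Scrubber": None},
    "Solaris 2.6 + 105181": {
        "Cache Scrubber": None,
        "Improved Error Messages": None,
        "Improved Error Handling": HANDLING_CPUS,
    },
    "Solaris 7 + 106541": {
        "Cache Scrubber": None,
        "Improved Error Messages": None,
        "Improved Error Handling": HANDLING_CPUS,
    },
    "Solaris 8 + 108528": {
        "Cache Scrubber": {"I", "II", "IIi"},
        "Improved Error Messages": {"I", "II", "IIi", "IIe"},
        "Improved Error Handling": HANDLING_CPUS,
    },
    "Solaris 8 Update 3": {
        "Cache Scrubber": {"I", "II", "IIi", "IIe"},
        "Improved Error Messages": {"I", "II", "IIi", "IIe"},
        "Improved Error Handling": HANDLING_CPUS,
    },
}


def has_feature(release, feature, cpu):
    """True if the notice lists `feature` for `release` on UltraSPARC-`cpu`."""
    cpus = FEATURES.get(release, {}).get(feature, set())
    return cpus is None or cpu in cpus
```

For example, has_feature("Solaris 8 + 108528", "Cache Scrubber", "IIe") is False, while the same query for "Solaris 8 Update 3" is True.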
Details on the Cache Scrubber
=============================
The cache scrubber reduces the likelihood of EDP, WP, and CP events by
shortening the data lifetime in the Ecache, and by eliminating parity
errors where possible. (See "Errors and Events" below for an explanation
of the EDP, WP, and CP event types.)
The cache scrubber is enabled by default. It scans the entire Ecache
of every CPU in the system once every ten seconds.
On an idle CPU, it scrubs all clean lines (lines that are identical to
the system memory from where they came), and dirty lines (lines that
have newer data than the system memory from where they came) that have
good parity. This reduces the lifetime of data in the Ecache on an
idle CPU, reducing the likelihood that a parity error will affect
critical system or user data.
On a busy CPU, it only scrubs clean lines with bad parity (which might
otherwise lead to EDP or CP events). Clean lines with good parity and
dirty lines are left in the Ecache so as to not adversely impact system
performance.
To avoid causing WP events, the cache scrubber never scrubs dirty lines
with bad parity. The program using such a line may overwrite it before
it is accessed or flushed, in which case no parity event occurs at all.
(This is sometimes referred to as the natural scrubbing behavior of a
busy system.)
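The scrubbing rules above reduce to a small decision table. The sketch below restates them in Python for illustration only; the function name scrubber_action and its arguments are hypothetical and do not correspond to kernel symbols.

```python
def scrubber_action(cpu_idle, dirty, parity_ok):
    """Return 'scrub' or 'leave' for one Ecache line, per the rules above."""
    if dirty and not parity_ok:
        # Never scrubbed: flushing this line would cause a WP event.
        return "leave"
    if cpu_idle:
        # Idle CPU: all clean lines, plus dirty lines with good parity.
        return "scrub"
    if not dirty and not parity_ok:
        # Busy CPU: only clean lines with bad parity are scrubbed.
        return "scrub"
    # Busy CPU: clean lines with good parity and dirty lines stay cached.
    return "leave"
```

The first test comes first because it applies to idle and busy CPUs alike.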
Errors and Events
=================
UltraSPARC processors can detect errors that are reported in the
following types of events (as detailed in the UltraSPARC-I/II User's
Manual, P/N 802-7220-02):
ETP A parity error was detected by the CPU when reading from the
Ecache Tag SRAM. This is a fatal error because system coherency
has been lost. The system will reset (POR) and Starfire domains
will arbstop (UPA Fatal error). No Solaris error message will be
generated.
EDP A parity error was detected by the CPU when reading from the
Ecache Data SRAM on a cache hit.
LDP A parity error was detected by the CPU while reading main
memory through its Ultra Data Buffer (UDB) chip on an Ecache
miss. Note that the Ecache itself is not involved. This can occur
when the CPU is reading non-cacheable data (for example, a frame
buffer or I/O device), or when filling a line of cache from main
memory.
WP A parity error was detected by one of the UDB chips while data
was being written back from the Ecache into main memory. The UDB
chips convert the data with bad parity into data with bad ECC, so
that a subsequent access to the same physical address will result
in a UE. (See UE below.) (The conversion of a parity error to a
latent UE does not occur on either UltraSPARC-IIi or -IIe, which
is one of the reasons why improved error handling is not
available on those processors.)
CP A parity error was detected during a copyout transaction; that
is, a data transfer from one CPU's Ecache to another CPU. This
error is detected by the UDB chips of the providing CPU,
resulting in the CP event. The providing CPU's UDB chips convert
the data with bad parity to data with bad ECC, so that the UDBs
of the receiving CPU will report a UE event. (See UE below.)
UE An uncorrectable memory error has occurred. This event refers to
an error in the main system memory, reported by the system data
bus on a read access. The underlying source of this error could
be main memory, another CPU module (see CP above), or another UPA
device (for example, the I/O controller). The UDB chips detect
this error.
CE A correctable error was detected when reading from main memory,
or when reading from another CPU's UDB chips. The data read has
been corrected and valid data is given to the CPU and the CPU's
Ecache. This error is detected by the UDB chips.
BERR A bus error has occurred during an attempt to read from a memory
address. Either there is no device at that address, or the
device at that address has returned a bus error. Therefore, bus
errors are caused by a programming error or by a corrupted or
defective device.
TO A bus timeout was encountered during an attempt to read from a
memory address. Too much time has elapsed waiting for a device
at that address to respond.
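The sample messages later in this notice print the CPU AFSR value together with decoded flags such as <PRIV,EDP>. The Python sketch below shows how such a value might be decoded by hand. The bit positions are inferred only from those sample messages (PRIV, BERR, CP, WP, EDP, UE, CE, and the low 16 bits as P_SYND); bits for events that no sample shows (ETP, LDP, TO) are deliberately left out, and the name decode_afsr is hypothetical. Consult the User's Manual cited above for the authoritative layout.

```python
# Bit positions inferred from the sample messages in this notice.
AFSR_FLAGS = [
    (31, "PRIV"),
    (26, "BERR"),
    (24, "CP"),
    (23, "WP"),
    (22, "EDP"),
    (21, "UE"),
    (20, "CE"),
]


def decode_afsr(text):
    """Decode a CPU AFSR string such as '0x00000000.80408000'.

    Returns (list of flag names, P_SYND value).
    """
    value = int(text.replace(".", ""), 16)
    flags = [name for bit, name in AFSR_FLAGS if value & (1 << bit)]
    psynd = value & 0xFFFF
    return flags, psynd
```

For the first sample value, decode_afsr("0x00000000.80408000") returns (["PRIV", "EDP"], 0x8000), which matches the <PRIV,EDP> and PSYND 0x8000 fields printed in the message. The SBus AFSR shown in the SBus I/O samples is a different register and is not covered by this sketch.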
Details on Improved Error Handling
==================================
Any of the above mentioned errors can occur in kernel instruction
space, kernel data space, user instruction space, user data space, or
when the kernel reads or writes user data (as in copyin). Depending on
these different states, the operating system will react differently so
as to maximize system availability.
On EDP, LDP, CP, UE, BERR, and TO events, the system will panic if the
affected data is in kernel space or if the error occurs while the CPU is
at a trap level greater than zero. Otherwise, the process that caused the
error will be killed immediately (sent SIGKILL) and the system will be
rebooted (as if a privileged user had entered "init 6"). [2]
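The panic-or-reboot rule in the preceding paragraph can be restated as a short Python sketch. This is an illustration of the rule as written in this notice, not kernel code; the names disposition, in_kernel_space, and trap_level are hypothetical, and the rule applies only to systems that have Improved Error Handling.

```python
# Events covered by the panic-or-reboot rule in this notice.
PANIC_OR_REBOOT_EVENTS = {"EDP", "LDP", "CP", "UE", "BERR", "TO"}


def disposition(event, in_kernel_space, trap_level):
    """Return the action the notice describes for one error event."""
    if event not in PANIC_OR_REBOOT_EVENTS:
        raise ValueError("rule does not cover event %r" % (event,))
    if in_kernel_space or trap_level > 0:
        return "panic"
    # User-space error at TL=0: SIGKILL the process, then reboot
    # as if a privileged user had entered "init 6".
    return "kill process, then reboot"
```

WP events are excluded on purpose; they follow the separate reporting path described next.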
On WP events, an error is reported, and the memory scrubber is notified
to scan all of system memory for the latent UE the hardware has written
to memory (see below for the behavior of the memory scrubber on
encountering UE events). If some CPU later attempts to read this
location (other than on behalf of the memory scrubber), a UE event will
occur. Hence, when a UE event is encountered, it is recommended that the
log be checked for an earlier WP event that may have in fact caused the
UE event.
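The recommended log check can be partly automated. The following Python sketch lists, for each UE message, the WP messages logged before it. It assumes that the first line of each message is intact in the log (not wrapped as in the printed samples in this notice), and the names EVENT_RE and earlier_wp_events are hypothetical. It only narrows the search; whether a given WP actually caused a given UE still needs human review.

```python
import re

# Matches the start of the two [AFT1] message types of interest.
EVENT_RE = re.compile(
    r"\[AFT1\] (WP event|Uncorrectable Memory Error) on (CPU\d+)"
)


def earlier_wp_events(log_lines):
    """For each UE message, list the WP messages logged before it.

    Returns a list of ((ue_line_no, ue_cpu), [(wp_line_no, wp_cpu), ...]).
    """
    wp_seen = []
    result = []
    for lineno, line in enumerate(log_lines, start=1):
        match = EVENT_RE.search(line)
        if not match:
            continue
        kind, cpu = match.groups()
        if kind == "WP event":
            wp_seen.append((lineno, cpu))
        else:
            result.append(((lineno, cpu), list(wp_seen)))
    return result
```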
If the memory scrubber detects a UE event the system will neither panic
nor reboot but trigger a recovery mechanism instead. If the page
containing the corrupted data is not in use, it will be retired and the
error will be cleared. If it is in use, it will be marked for
retirement and clearing if and when it is no longer in use.
NOTE [2]: An active SC2.X cluster node will panic with a "Failfast
timeout" (usually with "Device closed while Armed") when
rebooted. It is therefore useful to check the system messages
for EDP, LDP, CP, UE, BERR, and TO events when investigating
"Failfast timeout" panics.
Details on Improved Error Messages
==================================
For each error that is detected, the kernel generates an individual
report. This is a major change; previously, some errors would hide other
errors, and some errors were combined into a single message. The report
typically consists of several error messages. Each message [3] contains
an AFT ("Asynchronous Fault Trap") tag that eases filtering, and an errID
code that associates all of the messages emitted for the same event. The
errID is a 64-bit code that corresponds to a specific set of error bits in
the Asynchronous Fault Status Register (AFSR) at a specific instant in
time; the value has no intrinsic meaning.
Each message may be longer than one physical line; long messages are
folded using embedded newlines. Each folded line begins with four
space characters.
NOTE [3]: Because of the introduction of improved error messages, any tool
using the affected error messages may have to be modified.
Neither the format nor the content of kernel error messages is
a committed interface, and both may change without notice.
Users (both internal and external) who rely on the exact format
and/or content do so at their own risk.
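As an illustration of how the AFT tag and errID might be used for filtering, the Python sketch below rejoins folded lines and groups messages by errID. It is an example only, subject to the caveat in NOTE [3] that the message format is not a committed interface; the names unfold and group_by_errid are hypothetical, and input lines are assumed to have any syslog prefix already stripped.

```python
import re
from collections import defaultdict

# An AFT tag followed, possibly on a folded line, by the errID value.
ERRID_RE = re.compile(
    r"\[(AFT[0-3])\].*?errID\s+(0x[0-9a-f]{8}\.[0-9a-f]{8})",
    re.S | re.I,
)


def unfold(lines):
    """Join folded lines (those starting with four spaces) to their message."""
    messages = []
    for line in lines:
        if line.startswith("    ") and messages:
            messages[-1] += "\n" + line
        else:
            messages.append(line)
    return messages


def group_by_errid(lines):
    """Map each errID to the list of (AFT tag, message) pairs that carry it."""
    groups = defaultdict(list)
    for message in unfold(lines):
        match = ERRID_RE.search(message)
        if match:
            groups[match.group(2)].append((match.group(1), message))
    return dict(groups)
```

Grouping on the errID string rather than on timestamps keeps all messages for one event together even when events interleave in the log.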
The error messages can be grouped into four categories:
Category 1: Messages that identify the type and source of an error
------------------------------------------------------------------
Example:
WARNING: [AFT1] EDP event on CPU1 Instruction access at TL=0, errID
0x0000ad88.6cd9989f
AFSR 0x00000000.80408000<PRIV,EDP> AFAR 0x00000000.0f0c8080
AFSR.PSYND 0x8000(Score 95) AFSR.ETS 0x00 FAULT_PC 0x780b481c
UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
Either the [AFT0] tag (for correctable errors) or the [AFT1] tag (for
uncorrectable errors) is present in the message. An "errID" field
appears at the end of the first line of the message. Messages from
this category are displayed on the console and collected in the log
file. [4]
To aid diagnosis of an Ecache-related error, especially if multiple
components are involved, a heuristic algorithm has been included that
automates analysis of the P_SYND bytes. Every component reporting a
failure has its AFSR decoded and a score ranging from 5 to 95 is
assigned ("Score 95" in the above example).
The Score indicates the likelihood that this component was the original
source of the bad parity. The higher the value, the higher the
likelihood that this component was the original source.
NOTE [4]: This is the default behavior. The /etc/system setting
report_ce_console is no longer referenced and should therefore
be removed.
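When several components report the same event, the Score fields can be compared directly. The Python sketch below extracts the reporting CPU and Score from each Category 1 message and returns the highest-scoring one. It assumes each message is passed in as one string with its folded lines already rejoined; the names SCORE_RE and likeliest_source are hypothetical, and the message format may change as stated in NOTE [3].

```python
import re

# Reporting CPU, then the PSYND value and Score from the same message.
SCORE_RE = re.compile(
    r"on (CPU\d+).*?AFSR\.PSYND (0x[0-9a-fA-F]{4})\(Score (\d+)\)",
    re.S,
)


def likeliest_source(messages):
    """Return (cpu, score) for the highest-scoring message, or None."""
    best = None
    for message in messages:
        match = SCORE_RE.search(message)
        if not match:
            continue
        cpu, score = match.group(1), int(match.group(3))
        if best is None or score > best[1]:
            best = (cpu, score)
    return best
```

Applied to the CP samples later in this notice, where CPU3 reports a UE with Score 05 and CPU1 reports a CP event with Score 95, this selects CPU1.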
Category 2: Messages that supply a cache line or memory dump
------------------------------------------------------------
Example:
[AFT2] errID 0x0000ad88.6cd9989f PA 0x00000000.0f0c8080 E$tag
0x00000000.0bc001e1 E$State: Modified E$parity 0x05
[AFT2] E$Data (0x00): 0xffffffff.beefface *Bad* PSYND=0x8000
[AFT2] E$Data (0x08): 0x00000000.00000000
[AFT2] E$Data (0x10): 0x6d656d6d.6f727920
[AFT2] E$Data (0x18): 0x6572726f.7220696e
[AFT2] E$Data (0x20): 0x6a656374.6f720000
[AFT2] E$Data (0x28): 0x6d656d74.65737420
[AFT2] E$Data (0x30): 0x6d757465.780059f8
[AFT2] E$Data (0x38): 0x00000300.00c11000
[AFT2] Event PA displayed in AFAR was derived from E$Tag
Messages from this category are targeted for Sun Microsystems support
staff to be used in backline diagnosis and for statistics.
The [AFT2] tag is always present in these messages. The "errID" field
appears at the beginning of the first line of the message. Messages
from this category are by default only collected in the log file.
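For statistics, the word flagged *Bad* in a cache line dump can be picked out mechanically. The Python sketch below is an illustration based solely on the example above; the names EDATA_RE and bad_words are hypothetical, and the format may change as stated in NOTE [3].

```python
import re

# One E$Data line, optionally flagged "*Bad* PSYND=0x....".
EDATA_RE = re.compile(
    r"\[AFT2\] E\$Data \((0x[0-9a-fA-F]+)\): "
    r"(0x[0-9a-fA-F]{8}\.[0-9a-fA-F]{8})"
    r"( \*Bad\* PSYND=(0x[0-9a-fA-F]+))?"
)


def bad_words(lines):
    """Return (offset, data, psynd) for each E$Data word flagged *Bad*."""
    found = []
    for line in lines:
        match = EDATA_RE.search(line)
        if match and match.group(3):
            offset = int(match.group(1), 16)
            psynd = int(match.group(4), 16)
            found.append((offset, match.group(2), psynd))
    return found
```

For the dump shown above, this yields a single entry: offset 0x00, data 0xffffffff.beefface, PSYND 0x8000.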
Category 3: Messages from the kernel error recovery code
--------------------------------------------------------
Example:
[AFT3] errID 0x00000058.0d0dc830 Above Error detected by protected Kernel
code
that will try to clear error from system
Messages from this category supply analysis information from the
kernel error recovery code, thereby indicating the actions the kernel
took to contain the error.
The [AFT3] tag is always present in these messages. An "errID"
field appears at the beginning of the first line of the message.
Messages from this category are by default only collected in the log
file.
Category 4: Messages that indicate the disposition of an error
--------------------------------------------------------------
Example:
panic[CPU1]/thread=30000670800: [AFT1] errID 0x00000392.89cbfefc EDP
Error(s)
See previous message(s) for details
Messages from this category state the final handling (like panic or
reboot) of a previously encountered error.
Either the [AFT0] tag (for correctable errors) or the [AFT1] tag (for
uncorrectable errors) is present in the message. The "errID" field
appears at the beginning of the first line of the message. Messages from
this category are displayed on the console and collected in the log file.
Error Messages Examples
=======================
The following compares previous messages with the new, improved error
messages. This is not an exhaustive list, but a sampling of possible
messages for each event type. Only console output is shown; the
log-only messages are omitted.
Lines are shown exactly as they appear on the console. If you print
this file, you will need to either use software that wraps long lines,
or print in landscape mode.
EDP Event - Ecache Data Parity Event
------------------------------------
* Solaris 8 Message - Kernel Data:
panic[CPU1]/thread=3000225bcc0: CPU1 Ecache SRAM Data Parity Error: AFSR
0x00000000.80408000 AFAR 0x00000000.0bd83bd0
* Improved Message - Kernel Data:
WARNING: [AFT1] EDP event on CPU1 Data access at TL=0, errID
0x00000093.6323e6f8
AFSR 0x00000000.80408000<PRIV,EDP> AFAR 0x00000000.06901980
AFSR.PSYND 0x8000(Score 95) AFSR.ETS 0x00 Fault_PC 0x78128a84
UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
panic[cpu1]/thread=30000ae5000: [AFT1] errID 0x00000093.6323e6f8 EDP Error(s)
See previous message(s) for details
* Solaris 8 Message - User Data:
panic[CPU3]/thread=30001f4fa00: CPU3 Ecache SRAM Data Parity Error: AFSR
0x00000000.00400080 AFAR 0x00000000.01820000
* Improved Message - User Data (Reboot):
Aug 16 16:47:20 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] EDP event on CPU3
Data access at TL=0, errID 0x00000057.d35eff81
Aug 16 16:47:20 thishost AFSR 0x00000000.00400080<EDP> AFAR
0x00000000.05e24418
Aug 16 16:47:20 thishost AFSR.PSYND 0x0080(Score 95) AFSR.ETS 0x00 Fault_PC
0x11ce8
Aug 16 16:47:20 thishost UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND
0x00
Aug 16 16:47:20 thishost unix: NOTICE: Scheduling clearing of error on page
0x00000000.05e24000
Aug 16 16:47:20 thishost unix: WARNING: [AFT1] initiating reboot due to above
error in pid 309 (mtst)
Aug 16 16:47:23 thishost unix: NOTICE: Previously reported error on page
0x00000000.05e24000 cleared
INIT: New run level: 6
The system is coming down. Please wait.
System services are now being stopped.
Print services stopped.
Aug 16 16:47:27 thishost syslogd: going down on signal 15
The system is down.
syncing file systems... done
rebooting...
Resetting ...
* Solaris 8 Message - Kernel Data at TL=1:
panic[CPU3]/thread=30001cfabe0: Async data error at tl1: AFAR
0x00000000.0ab8f760 AFSR 0x00000000.80400080
* Improved Message - Kernel Data at TL=1 (Panic):
WARNING: [AFT1] EDP event on CPU3 Data access at TL>0, errID
0x00000111.53a7b8dd
AFSR 0x00000000.80408000<PRIV,EDP> AFAR 0x00000000.01f47dc0
AFSR.PSYND 0x8000(Score 95) AFSR.ETS 0x00 Fault_PC 0x1002fe20
UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
panic[cpu3]/thread=30000a4e040: [AFT1] errID 0x00000111.53a7b8dd EDP Error(s)
See previous message(s) for details
* Solaris 8 Message - Kernel Instruction at TL=1:
panic[CPU3]/thread=3000226a140: Async instruction error at tl1: AFAR
0x00000000.0dd55f70 AFSR 0x00000000.80408000
* Improved Message - Kernel Instruction at TL=1 (Panic):
WARNING: [AFT1] EDP event on CPU3 Instruction access at TL>0, errID
0x00000043.24bfd349
AFSR 0x00000000.80400800<PRIV,EDP> AFAR 0x00000000.0605c790
AFSR.PSYND 0x0800(Score 95) AFSR.ETS 0x00 Fault_PC 0x1002fe20
UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
panic[cpu3]/thread=30000ad05c0: [AFT1] errID 0x00000043.24bfd349 EDP Error(s)
See previous message(s) for details
WP Event - Writeback Data Parity Error
--------------------------------------
* Solaris 8 Message:
panic[CPU1]/thread=30001b26640: CPU1 Ecache Writeback Data Parity Error: AFSR
0x00000000.00800080 AFAR 0x00000000.0d5010f0
* Improved Message:
Aug 16 16:50:56 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] WP event on CPU1,
errID 0x0000002b.3c7cd6d9
Aug 16 16:50:56 thishost AFSR 0x00000000.00800080<WP> AFAR
0x000001c8.01802800
Aug 16 16:50:56 thishost AFSR.PSYND 0x0080(Score 95) AFSR.ETS 0x00 Fault_PC
0x11d7c
Aug 16 16:50:56 thishost UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND
0x00
Aug 16 16:50:56 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] Uncorrectable
Memory Error on CPU3 Data access at TL=0, errID 0x0000002b.45daae92
Aug 16 16:50:56 thishost AFSR 0x00000000.80200000<PRIV,UE> AFAR
0x00000000.03824418
Aug 16 16:50:56 thishost AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC
0x10023414
Aug 16 16:50:56 thishost UDBH 0x0000 UDBH.ESYND 0x00 UDBL
0x0203<UE>
UDBL.ESYND 0x03
Aug 16 16:50:56 thishost UDBL Syndrome 0x3 Memory Module 190x
Aug 16 16:50:56 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] errID
0x0000002b.45daae92 Syndrome 0x3 indicates that this may not be a memory module
problem
Aug 16 16:50:56 thishost unix: NOTICE: Scheduling clearing of error on page
0x00000000.03824000
Aug 16 16:50:58 thishost unix: NOTICE: Previously reported error on page
0x00000000.03824000 cleared
NOTE: The last message (reporting clearing of the error) may appear much later,
or may never appear, as the page may never drop out of use. Also, the
message reporting scheduling of clearing may occur more than once, as the
memory scrubber may encounter the particular UE more than once before it
can be cleared.
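In every sample in this notice, the page address in the "Scheduling clearing of error on page" message equals the AFAR of the UE with its low 13 bits cleared, which is consistent with an 8 KB page size. The Python sketch below reproduces that arithmetic. The 8 KB page size is an inference from these samples, not a statement from the notice, and the name page_of is hypothetical.

```python
PAGE_SIZE = 0x2000  # 8 KB; an assumption inferred from the samples


def page_of(afar_text):
    """Return the page base for an AFAR such as '0x00000000.03824418'."""
    afar = int(afar_text.replace(".", ""), 16)
    page = afar & ~(PAGE_SIZE - 1)
    # Format like the kernel messages: two 32-bit halves joined by a dot.
    return "0x%08x.%08x" % (page >> 32, page & 0xFFFFFFFF)
```

For the UE above, page_of("0x00000000.03824418") gives "0x00000000.03824000", the page named in the NOTICE lines.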
CP Event - Copyout Data Parity Error
------------------------------------
* Solaris 8 Message:
panic[CPU3]/thread=2a100105d40: CPU3 UE Error: Ecache Copyout on CPU1: AFSR
0x00000000.01000080 AFAR 0x00000000.06c53090
* Improved Message - Kernel (Panic):
WARNING: [AFT1] Uncorrectable Memory Error on CPU3 Data access at TL=0, errID
0x0000003a.30aafcba
AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.00347dc0
AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x78067b54
UDBH 0x0203<UE> UDBH.ESYND 0x03 UDBL 0x0000 UDBL.ESYND 0x00
UDBH Syndrome 0x3 Memory Module 190x
WARNING: [AFT1] errID 0x0000003a.30aafcba Syndrome 0x3 indicates that this may
not be a memory module problem
WARNING: [AFT1] CP event on CPU1 (caused Data access error on CPU3), errID
0x0000003a.30aafcba
AFSR 0x00000000.01008000<CP> AFAR 0x00000000.00347dc0
AFSR.PSYND 0x8000(Score 95) AFSR.ETS 0x00
UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
panic[cpu3]/thread=2a100157d40: [AFT1] errID 0x0000003a.30aafcba UE Error(s)
See previous message(s) for details
* Improved Message - User (Reboot):
Aug 16 17:06:44 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] Uncorrectable
Memory Error on CPU3 Data access at TL=0, errID 0x0000002b.963a3d3c
Aug 16 17:06:44 thishost AFSR 0x00000000.00200000<UE> AFAR
0x00000000.00224418
Aug 16 17:06:44 thishost AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC
0x12380
Aug 16 17:06:44 thishost UDBH 0x0000 UDBH.ESYND 0x00 UDBL
0x0203<UE>
UDBL.ESYND 0x03
Aug 16 17:06:44 thishost UDBL Syndrome 0x3 Memory Module 190x
Aug 16 17:06:44 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] errID
0x0000002b.963a3d3c Syndrome 0x3 indicates that this may not be a memory module
problem
Aug 16 17:06:44 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] CP event on CPU1
(caused Data access error on CPU3), errID 0x0000002b.963a3d3c
Aug 16 17:06:44 thishost AFSR 0x00000000.01000080<CP> AFAR
0x00000000.00224418
Aug 16 17:06:44 thishost AFSR.PSYND 0x0080(Score 95) AFSR.ETS 0x00
Aug 16 17:06:44 thishost UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND
0x00
Aug 16 17:06:44 thishost unix: NOTICE: Scheduling clearing of error on page
0x00000000.00224000
Aug 16 17:06:44 thishost unix: WARNING: [AFT1] initiating reboot due to above
error in pid 304 (mtst)
Aug 16 17:06:46 thishost unix: NOTICE: Previously reported error on page
0x00000000.00224000 cleared
INIT: New run level: 6
The system is coming down. Please wait.
System services are now being stopped.
Print services stopped.
Aug 16 17:06:50 thishost syslogd: going down on signal 15
The system is down.
syncing file systems... done
rebooting...
Resetting ...
NOTE: Due to a coding error, early versions of some of the patches produce
the string "CP Error" instead of "CP event"; programs that parse the
messages must be prepared to deal with both.
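A parser can accept both spellings with a single pattern. The Python sketch below is a minimal example; the names CP_RE and cp_source_cpu are hypothetical, and the rest of the message is assumed to be identical in both patch versions.

```python
import re

# Accept both "CP event" and the early-patch spelling "CP Error".
CP_RE = re.compile(r"\[AFT1\] CP (?:event|Error) on (CPU\d+)")


def cp_source_cpu(message):
    """Return the CPU that reported the CP event, or None if no match."""
    match = CP_RE.search(message)
    return match.group(1) if match else None
```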
UE Event - Uncorrectable Memory Error
-------------------------------------
* Solaris 8 Message - CPU Reference to Memory:
panic[CPU1]/thread=2a10007dd40: UE Error: AFSR 0x00000000.80200000 AFAR
0x00000000.089cd740 Id 0 Inst 0 MemMod U0501 U0401
* Improved Message - CPU Reference to Memory - Kernel (Panic):
WARNING: [AFT1] Uncorrectable Memory Error on CPU1 Instruction access at TL=0,
errID 0x0000004f.818d9280
AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.0685c7a0
AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x7815c7a0
UDBH 0x0203<UE> UDBH.ESYND 0x03 UDBL 0x0000 UDBL.ESYND 0x00
UDBH Syndrome 0x3 Memory Module 190x
WARNING: [AFT1] errID 0x0000004f.818d9280 Syndrome 0x3 indicates that this may
not be a memory module problem
panic[cpu1]/thread=30000ad6320: [AFT1] errID 0x0000004f.818d9280 UE Error(s)
See previous message(s) for details
* Improved Message - CPU Reference to Memory - User (Reboot):
Aug 16 17:03:04 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] Uncorrectable
Memory Error on CPU1 Instruction access at TL=0, errID 0x00000032.593d8229
Aug 16 17:03:04 thishost AFSR 0x00000000.00200000<UE> AFAR
0x00000000.04921bf0
Aug 16 17:03:04 thishost AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC
0x11bf0
Aug 16 17:03:04 thishost UDBH 0x0203<UE> UDBH.ESYND 0x03 UDBL 0x0000
UDBL.ESYND 0x00
Aug 16 17:03:04 thishost UDBH Syndrome 0x3 Memory Module 190x
Aug 16 17:03:04 thishost SUNW,UltraSPARC-II: WARNING: [AFT1] errID
0x00000032.593d8229 Syndrome 0x3 indicates that this may not be a memory module
problem
Aug 16 17:03:04 thishost unix: NOTICE: Scheduling clearing of error on page
0x00000000.04920000
Aug 16 17:03:07 thishost unix: NOTICE: Previously reported error on page
0x00000000.04920000 cleared
Aug 16 17:03:07 thishost unix: WARNING: [AFT1] initiating reboot due to above
error in pid 304 (mtst)
INIT: New run level: 6
The system is coming down. Please wait.
System services are now being stopped.
Print services stopped.
Aug 16 17:03:13 thishost syslogd: going down on signal 15
The system is down.
syncing file systems... done
rebooting...
Resetting ...
* Solaris 8 Message - SBus I/O Reference to Memory:
panic[CPU1]/thread=2a10007dd40: SBus0 UE Primary Error DMA read: AFSR
0x40001be0.00000000 AFAR 0x00000000.02818000 MemMod U0501 U0401 Id 31
* Improved Message - SBus I/O Reference to Memory:
WARNING: SBus0 UE Primary Error DMA read: AFSR 0x40001be0.00000000 AFAR
0x00000000.0d25c000 MemMod U0501 U0401 Id 31
panic[cpu0]/thread=2a10007dd40: Fatal Sbus0 UE Error
BERR Event - Bus Error
----------------------
* Solaris 8 Message:
panic[CPU1]/thread=30000d2c300: CPU1 Privileged Bus Error: AFSR
0x00000000.84000000 AFAR 0x00000000.03422000
* Improved Message - Kernel (Panic):
WARNING: [AFT1] Bus Error on System Bus in privileged mode from CPU1 Data
access at TL=0, errID 0x0000002c.52b3d2c8
AFSR 0x00000000.84000000<PRIV,BERR> AFAR 0x00000000.05224410
AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x780671a4
UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
panic[cpu1]/thread=30000b06080: [AFT1] errID 0x0000002c.52b3d2c8 BERR Error(s)
See previous message(s) for details
CE Event - Correctable Memory Error
-----------------------------------
* Solaris 8 Message:
May 8 14:35:30 thishost SUNW,UltraSPARC-II: CPU1 CE Error: AFSR
0x00000000.00100000 AFAR 0x00000000.8abb5a00 UDBH Syndrome 0x85 MemMod U0904
May 8 14:35:30 thishost SUNW,UltraSPARC-II: ECC Data Bit 63 was corrected
May 8 14:35:30 thishost unix: Softerror: Intermittent ECC Memory Error, U0904
* Improved Message:
Aug 16 16:34:48 thishost SUNW,UltraSPARC-II: [AFT0] Corrected Memory Error on
CPU1, errID 0x00000036.629edc25
Aug 16 16:34:48 thishost AFSR 0x00000000.00100000<CE> AFAR
0x00000000.00347dc0
Aug 16 16:34:48 thishost AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC
0x1002fe20
Aug 16 16:34:48 thishost UDBH Syndrome 0x85 Memory Module 1904
Aug 16 16:34:48 thishost SUNW,UltraSPARC-II: [AFT0] errID 0x00000036.629edc25
Corrected Memory Error on 1904 is Intermittent
Aug 16 16:34:48 thishost SUNW,UltraSPARC-II: [AFT0] errID 0x00000036.629edc25
ECC Data Bit 63 was in error and corrected
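For tools that must recognize the improved messages, the fields shown in the
examples above ([AFTn] tag, errID, AFSR value and its decoded flags) can be
pulled out with a few regular expressions. This is a minimal Python sketch
based only on the sample messages in this document; the function name
parse_aft and the returned dictionary keys are hypothetical, and real logs
may contain variations not covered here.

```python
import re

# Patterns derived from the sample messages in this document only.
AFT_TAG = re.compile(r"\[AFT(\d)\]")
ERR_ID = re.compile(r"errID (0x[0-9a-f]{8}\.[0-9a-f]{8})")
# "AFSR " followed by the register value and an optional <FLAG,FLAG> list.
# The required space keeps "AFSR.PSYND" and "AFSR.ETS" from matching.
AFSR = re.compile(r"AFSR (0x[0-9a-f]{8}\.[0-9a-f]{8})(?:<([A-Z,]+)>)?")


def parse_aft(text):
    """Extract the AFT level, errID, AFSR value and AFSR flags from a message.

    Whitespace is flattened first because the examples are wrapped across
    several lines.
    """
    flat = " ".join(text.split())
    tag = AFT_TAG.search(flat)
    err = ERR_ID.search(flat)
    afsr = AFSR.search(flat)
    flags = afsr.group(2).split(",") if afsr and afsr.group(2) else []
    return {
        "aft_level": int(tag.group(1)) if tag else None,
        "err_id": err.group(1) if err else None,
        "afsr": afsr.group(1) if afsr else None,
        "afsr_flags": flags,
    }
```

Because every message belonging to one event repeats the same errID, the
err_id value is a convenient key for grouping related lines.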
Starfire Specific
*****************
Arbstop
=======
STag Parity Errors on an E10000 almost always result in a "UPA Fatal
Error" Arbstop dump. Although these can also be caused by poor VCore
voltage power pucks on a System Board, error trends have shown that these
errors are generally an "ETP Event", caused by the CPU identified in the
Arbstop dump file.
Recordstop
==========
Recordstop dump files will be generated anytime data is transferred
through the crossbar of the Starfire centerplane. This means that
a recordstop is likely to occur during WP, CP, and LDP events.
As always, the "psi" reported error is an extremely strong indication
of the source of the "UE ECC Error" as reported in the wfail output
of redx. The reporting XDB can be associated with one or two CPUs,
but which CPU actually sourced the data cannot be determined from the
recordstop itself, unless only one of the two possible CPUs is
present at the time of the error. In these cases, the syndrome of 03
is always present in the XDB Error report. Use the recordstop dump
to complement and confirm information provided by Solaris in the
message and console logs. Expect Solaris to report a relatively
high "score" against one of the CPUs attached to the reporting XDB
within the AFT messages previously described in this document.
Note of Caution: Conversely, an XDB could report an "ldat" error with
a syndrome of 03, which includes the same data pattern and xmux_par
values. In these cases, the XDB that reports the "ldat" error is the
XDB for the "victim" CPU in a copyback (CP) event. In essence, an
"ldat" error reported by an XDB will actually prove that the CPUs it
services are victims of another CPU's Cache Parity Error, and therefore
can be used to exonerate the attached CPUs.
These XDB-reported "ldat" errors are extremely rare, but can occur due to
other variables in a Starfire platform. These errors may or may not
be reported with a complementary "psi" error, but the XDBs will continue
to report a "UE ECC" error in the wfail output, along with a syndrome
value of 0x03. For these events, the "ldat" error exonerates the
attached CPUs, and might be traced back through the Centerplane X-Bar
to the System Board where the data originated. However, it is
likely that the error will not be traceable back to a CPU on the
sourcing System Board, unless a corresponding "psi"-side error
is reported by an XDB from that System Board.
For all XDB-reported "ldat" errors, expect Solaris to report a low
"score" against one of the CPUs attached to the reporting XDB within
the AFT messages previously described in this document.
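The "psi" versus "ldat" reasoning above can be summarized as a small
decision helper. This is a minimal Python sketch of that reasoning only; it
does not parse redx or wfail output, and the function name
triage_xdb_report and its arguments are hypothetical. It assumes the error
kind and the CPUs attached to the reporting XDB have already been read from
the recordstop dump.

```python
def triage_xdb_report(error_kind, attached_cpus):
    """Classify the CPUs attached to the XDB that reported a UE ECC error.

    error_kind    -- "psi" or "ldat", as read from the recordstop dump
    attached_cpus -- the one or two CPUs served by the reporting XDB
    """
    kind = error_kind.lower()
    cpus = list(attached_cpus)
    if kind == "psi":
        # A "psi" error strongly indicates that one of the attached CPUs
        # sourced the bad data. With two CPUs present, the recordstop alone
        # cannot tell which one; compare the Solaris AFT "score" values.
        return {"suspects": cpus, "exonerated": []}
    if kind == "ldat":
        # An "ldat" error means the attached CPUs received the bad data;
        # they are victims and are exonerated.
        return {"suspects": [], "exonerated": cpus}
    raise ValueError("unsupported XDB error kind: %r" % (error_kind,))
```

The result is a starting point for confirmation against the Solaris message
and console logs, not a replacement decision.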
DTag Considerations
===================
A rumor has been circulating that this patch increases the rate of
DTag parity errors on E10000 systems. That rumor is false.
The development team observed two customer systems (out of 30) using
the USER-level scrubber that experienced an increased rate of DTag
parity errors. It was determined that the combination of the USER-
level scrubber plus a certain customer-dependent application mix
(which we have yet to characterize) tickles marginal E10000 boards into
producing DTag parity errors. The KERNEL level scrubber that is
contained in the patch uses a completely different algorithm, and does
not have this tickling effect. There have been no reports of increased
DTag parity errors with the KERNEL scrubber.
If a customer experiences a DTag parity error with or without the USER
or KERNEL level scrubber, standard replacement policies apply.
IMPLEMENTATION: (T) (R) (Proactive vs Reactive)
---
| | MANDATORY (Fully Pro-Active)
---
---
| X | CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
---
---
| | REACTIVE (As Required)
---
CORRECTIVE ACTION:
The following recommendations are provided as a guide for authorized
Enterprise Services Field Representatives and Enterprise Customers on
UltraSPARC based platforms running Solaris versions 2.5.1, 2.6, 7, and 8:
1. If this system is running the user level cache scrubber, remove it.
To determine whether a system is running the user level cache scrubber,
enter the command:
/usr/lib/cachescrubber -V
If the response is "Command not found," then the user level scrubber is
not installed on this system. If the response is a message containing
the current version of the user level scrubber, then the user level
scrubber is installed on this system and must be removed.
To remove the user level cache scrubber, follow the removal procedure
as described in the README file for the user level cache scrubber. The
removal procedure varies for different versions of the scrubber.
2. Apply the appropriate Kernel Patch for the version of Solaris per the
chart below.
PatchId     Solaris Release   Availability (estimated)
---------   ---------------   ------------------------
103640      Solaris 2.5.1     Now
105181      Solaris 2.6       Now
106541      Solaris 7         Nov/10/2000
108528      Solaris 8         Nov/15/2000
3. If the system is running SunMC, apply the appropriate SunMC patches,
per the table below. This is necessary to maintain SunMC's ability to
report corrected memory errors. If the system is running SyMon, it will
be necessary to upgrade to SunMC and then apply the appropriate patch.
                          Solaris Release
                  -----------------------------
                  2.6       7 [6]     8 [6]
                  --------- --------- ---------
SunMC 2.1 FCS     110151    110213    110216
SunMC 2.1 L10N    110152    110214    110217
SunMC 2.1.1 FCS   110094    110215    110218
(SunMC patches are not needed for 2.5.1 as the 2.5.1 patch does not
contain improved error messages.)
NOTE [6]: The patches for Solaris 7 and Solaris 8 are not yet available.
4. To ensure proper preservation of system error messages across a panic or
reboot:
- E3x00, E4x00, E5x00, E6x00 systems must apply OBP patch 103346
(or higher).
- E10000 systems must activate netcon logging, as described in
FIN I0593-1.
5. If the system has operations personnel who have been trained to respond
to the older system error and panic messages, these personnel must be
notified of, and become familiar with, the changed error messages that
are described in this document (see the "Details on Improved Error
Messages" and "Error Messages Examples" sections, above). See the comment
on "Customer White Paper," below.
6. If the system employs custom software tools that extract system messages
from kernel core dumps or log files (like /var/adm/messages), these tools
will have to be modified to recognize the new messages. See the comment
on "Customer White Paper," below.
7. For FRU replacement guidelines, refer to the Best Practices Guide:
http://bestpractices.central.sun.com/BestPrac_Sept11_2000.ps
8. For those systems where the appropriate above listed Kernel Patches
have not yet been applied, FIN I0570-3 will remain the reference
document for troubleshooting Ecache errors.
9. A mailing list has been set up to address the KJP. Any bugs filed
against the cache scrubber or error recovery mechanisms should include
this mailing list on the interest list of the bug. In addition, any
unexplained system behavior changes should be directed to this mailing
list as well.
[email protected]
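The patch tables in steps 2 and 3 above can be captured as a simple lookup
for use in site scripts. This is a minimal Python sketch; the function names
are hypothetical, only the base patch IDs listed in this document are
included (no revision suffixes), and keying the tables by the SunOS release
string that "uname -r" prints (5.5.1, 5.6, 5.7, 5.8) is a choice made for
illustration.

```python
# Kernel patch base IDs from step 2, keyed by SunOS release ("uname -r").
KERNEL_PATCH = {
    "5.5.1": "103640",  # Solaris 2.5.1
    "5.6": "105181",    # Solaris 2.6
    "5.7": "106541",    # Solaris 7
    "5.8": "108528",    # Solaris 8
}

# SunMC patch base IDs from step 3. There is no entry for 5.5.1 because the
# 2.5.1 kernel patch does not contain the improved error messages.
SUNMC_PATCH = {
    "2.1 FCS": {"5.6": "110151", "5.7": "110213", "5.8": "110216"},
    "2.1 L10N": {"5.6": "110152", "5.7": "110214", "5.8": "110217"},
    "2.1.1 FCS": {"5.6": "110094", "5.7": "110215", "5.8": "110218"},
}


def kernel_patch_for(release):
    """Return the kernel patch base ID for a SunOS release, or None."""
    return KERNEL_PATCH.get(release)


def sunmc_patch_for(sunmc_version, release):
    """Return the SunMC patch base ID, or None if no patch is listed."""
    return SUNMC_PATCH.get(sunmc_version, {}).get(release)
```

For example, kernel_patch_for("5.6") returns "105181", and
sunmc_patch_for("2.1 FCS", "5.5.1") returns None. Availability dates are not
modeled; check the tables above before scheduling an installation.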
COMMENTS:
User Level Cache Scrubber
-------------------------
The user level cache scrubber was an early process-level implementation of
the cache scrubber, deployed by a small number of customers as an interim
measure. Its functionality is superseded by that of the kernel level
cache scrubber that is provided in the Kernel Patch. Running the user
level scrubber on a system that has the Kernel Patch applied may degrade
performance and will defeat some of the functionality of the kernel cache
scrubber. For this reason, the user level scrubber should be removed
(uninstalled) prior to applying the Kernel Patch.
Systems that are currently running the user level cache scrubber and are
not applying the Kernel Patch (for example, on platforms where the Kernel
Patch is not yet available) should continue to run the user level
scrubber. The user level scrubber should be removed only in preparation
for installing the Kernel Patch.
AFSR Decoder Tool
-----------------
The SPG-CTE AFSR decoder is available at the following URLs:
http://cte-www.uk/cgi-bin/afsr/afsr.pl
http://cte-www.eng/cgi-bin/afsr/afsr.pl
Output equivalent to that of the AFSR decoder is now available directly
in the [AFTn] messages. However, the tool remains useful when
troubleshooting I/O related problems (DVMA transactions), and it has been
updated to reflect the Score parameter. (See "Details on Improved Error
Messages" Category 1, above, for details on Score.)
Customer White Paper
--------------------
A customer white paper is being written that describes the improved error
handling capabilities of the Solaris Operating System. The document will
assist customer operations personnel and monitoring tools developers who
need to become familiar with the new error messages.
--------------------------------------------------------------------------
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
--------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist. Edist can be
accessed internally at the following URL: http://edist.corp/.
* From there, follow the hyperlink path of "Enterprise Services
Documentation" and click on "FIN & FCO attachments", then choose the
appropriate folder, FIN or FCO. This will display supporting
directories/files for FINs or FCOs.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------
Copyright (c) 1997-2003 Sun Microsystems, Inc.