Document Audience: INTERNAL
Document ID: I0798-1
Title: When an ECC error occurs on Sun Blade 100, Solaris incorrectly identifies the faulty DIMM
Copyright Notice: Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved
Update Date: 2004-01-07
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
FIN #: I0798-1
Synopsis: When an ECC error occurs on Sun Blade 100, Solaris incorrectly identifies the faulty DIMM
Create Date: Apr/03/02
Keywords:
When an ECC error occurs on Sun Blade 100, Solaris incorrectly identifies the faulty DIMM
SunAlert: No
Top FIN/FCO Report: Yes
Products Reference: DIMM on Sun Blade 100
Product Category: Desktop / SW Admin
Product Affected:
Systems affected:
-------------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
A36 ALL Sun Blade 100 -
X-Options affected:
-------------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- - - - -
Parts Affected:
Part Number Description Model
----------- ----------- -----
- - -
References:
BugId: 4624001 - grover DIMM reporting off by one.
PatchId: 111179-04 - Hardware/PROM: Blade 100 Flash PROM Update.
ESC: 534363 - It appears the OBP is incorrectly converting
an AFAR to the wrong UNUM when.
DOC: 806-3416-10: Sun Blade 100 Service Manual.
Issue Description:
When an ECC memory error occurs on a Sun Blade 100 system, Solaris logs
diagnostic information that names the affected memory module. However,
the wrong DIMM can be reported as faulty. This can lead to unnecessary
outages as well as additional service calls if the wrong DIMM is replaced.
The two cases below show error messages logged by Solaris. In the first
case, Solaris places the physical address 0x3e52e030 within DIMM2,
although according to the Sun Blade 100 Service Manual, 806-3416-10,
that address is located within DIMM1.
Case1:
AFSR 0x00000001.80300000 AFAR 0x00000000.3e52e030
AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1009421c
UDBH 0x0362 UDBH.ESYND 0x62 UDBL 0x0000 UDBL.ESYND 0x00
UDBH Syndrome 0x62 Memory Module DIMM2
^^^^^^^^^^^^^^^^^^^
From the manual 806-3416-10:
DIMM#   UNUM   DIMM Starting Address
------------------------------------
DIMM0   U2     0x00000000
DIMM1   U3     0x20000000
DIMM2   U4     0x40000000
DIMM3   U5     0x60000000
Yet AFAR 3e52e030 was reported as DIMM2. It should be DIMM1.
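The lookup against the manual's table can be sketched in a few lines of
Python. This is only an illustration, not a Sun tool: the function name
dimm_for_afar is hypothetical, and the 0x20000000-byte window per slot is
an assumption inferred from the spacing of the starting addresses (the
manual excerpt above does not state an upper bound for DIMM3).

```python
# Starting physical address per DIMM slot, as listed in 806-3416-10.
DIMM_BASES = [
    ("DIMM0", "U2", 0x00000000),
    ("DIMM1", "U3", 0x20000000),
    ("DIMM2", "U4", 0x40000000),
    ("DIMM3", "U5", 0x60000000),
]

# Assumption: every slot covers a fixed 0x20000000-byte window, inferred
# from the spacing of the starting addresses above.
SLOT_SPAN = 0x20000000


def dimm_for_afar(afar):
    """Return (DIMM name, UNUM) for a physical address (hypothetical helper)."""
    for name, unum, base in DIMM_BASES:
        if base <= afar < base + SLOT_SPAN:
            return name, unum
    raise ValueError(f"address {afar:#x} is outside the DIMM table")


print(dimm_for_afar(0x3E52E030))  # ('DIMM1', 'U3')
print(dimm_for_afar(0x1489BDB0))  # ('DIMM0', 'U2')
```

Both addresses are the AFAR values from the two cases in this FIN.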
Case2:
WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL>0,
errID 0x0000d11d.7f890248
AFSR 0x00000000.80300000 AFAR 0x00000000.1489bdb0
AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10023c08
UDBH 0x03c2 UDBH.ESYND 0xc2 UDBL 0x0000 UDBL.ESYND 0x00
UDBH Syndrome 0xc2 Memory Module DIMM1
[AFT2] errID 0x0000d11d.7f890248 E$tag != PA from AFAR; E$line
was victimized dumping memory from PA 0x00000000.1489bd80 instead
[AFT2] E$Data (0x00): 0x00000000.00000000
[AFT2] E$Data (0x08): 0x00000000.00000000
[AFT2] E$Data (0x10): 0x00000000.00000000
[AFT2] E$Data (0x18): 0x00000000.00000000
[AFT2] E$Data (0x20): 0x00000000.00000000
[AFT2] E$Data (0x28): 0x00000000.00000000
[AFT2] E$Data (0x30): 0x00000000.00008800
[AFT2] E$Data (0x38): 0x00000000.00000000
panic[cpu0]/thread=2a100017d40: [AFT1] errID 0x0000d11d.7f890248 UE Error(s)
In the example above, AFAR 1489bdb0 indicates DIMM0, but the error
message reports "Memory Module DIMM1".
On the Sun Blade 100, the OBP converts an AFAR to the wrong UNUM: it
generates UNUM strings of DIMM1-4 instead of DIMM0-3, so every reported
DIMM number is one too high.
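A logged message can be cross-checked against the DIMM starting
addresses from the manual with a short script. The sketch below is
hypothetical (check_report and the regular expressions are not part of
Solaris or any Sun tool). It assumes that the AFAR is printed as the high
and low 32-bit halves of one address separated by a period, and that each
of the four slots covers a 0x20000000-byte window.

```python
import re

# Assumption: "AFAR 0xHHHHHHHH.LLLLLLLL" is the high and low 32-bit halves
# of a single 64-bit physical address.
AFAR_RE = re.compile(r"AFAR\s+0x([0-9a-fA-F]{8})\.([0-9a-fA-F]{8})")
MODULE_RE = re.compile(r"Memory Module DIMM(\d+)")

SLOT_SPAN = 0x20000000  # assumed window per slot, from the manual's table
SLOT_COUNT = 4          # DIMM0 through DIMM3


def check_report(log_text):
    """Return (reported DIMM number, DIMM number implied by the AFAR)."""
    afar_match = AFAR_RE.search(log_text)
    module_match = MODULE_RE.search(log_text)
    if afar_match is None or module_match is None:
        raise ValueError("no AFAR or Memory Module field found")
    afar = int(afar_match.group(1) + afar_match.group(2), 16)
    expected = afar // SLOT_SPAN
    if expected >= SLOT_COUNT:
        raise ValueError(f"AFAR {afar:#x} is outside the DIMM table")
    return int(module_match.group(1)), expected


CASE1 = """\
AFSR 0x00000001.80300000 AFAR 0x00000000.3e52e030
UDBH Syndrome 0x62 Memory Module DIMM2
"""

reported, expected = check_report(CASE1)
print(f"reported DIMM{reported}, expected DIMM{expected}")
```

For the Case1 lines this prints "reported DIMM2, expected DIMM1", the
off-by-one difference described above.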
This Sun Blade 100 DIMM reporting problem has been fixed in OBP 4.5.9,
which is delivered by patch 111179-04. Changes were made to
obp/arch/sun4u/grover/memprobe.fth. Please follow the recommendations
provided in the Corrective Action below.
Implementation:
---
| | MANDATORY (Fully Proactive)
---
---
| X | CONTROLLED PROACTIVE (per Sun Geo Plan)
---
---
| | REACTIVE (As Required)
---
Corrective Action:
The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned problem.
Please follow one of the guidelines below:
Guideline 1:
------------
For Sun Blade 100 platforms with OBP 4.5.0 and below, remap the
reported DIMM# by subtracting one (1) from the reported number.
Example: If Solaris reports DIMM2, DIMM1 is the defective DIMM.
OR:
Guideline 2:
------------
This problem has been fixed with the new "Sun Blade 100 Flash PROM".
Apply patch 111179-04, which upgrades the OBP to 4.5.9. With OBP 4.5.9,
the DIMM# reported will be correct.
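The remapping rule from the two guidelines can be sketched as a small
Python helper for anyone scripting log triage. The function names are
hypothetical. This FIN only covers OBP 4.5.0 and below and OBP 4.5.9, so
the sketch refuses versions in between and treats later versions as
fixed, which is an assumption.

```python
def parse_obp_version(text):
    """Turn a version string such as '4.5.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def corrected_dimm(reported, obp_version):
    """Return the DIMM number to act on (hypothetical helper).

    reported    -- the n in "Memory Module DIMMn" as logged by Solaris
    obp_version -- OBP version string, for example "4.5.0"
    """
    version = parse_obp_version(obp_version)
    if version <= (4, 5, 0):
        # Guideline 1: affected OBP reports DIMM1-4 instead of DIMM0-3.
        if reported < 1:
            raise ValueError("affected OBP never reports DIMM0")
        return reported - 1
    if version >= (4, 5, 9):
        # Guideline 2: 4.5.9 reports the DIMM# correctly. Later versions
        # are assumed to keep the fix.
        return reported
    raise ValueError(f"OBP {obp_version} is not covered by this FIN")


print(corrected_dimm(2, "4.5.0"))  # 1
print(corrected_dimm(2, "4.5.9"))  # 2
```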
Comments:
None
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------