Asset ID: 1-72-1020256.1
Update Date: 2010-08-11
Solution Type: Problem Resolution Sure
Solution 1020256.1: OPL: DIMMs are suddenly marked faulty after upgrading kernel patches
Related Items:
- Sun SPARC Enterprise M5000 Server
- Sun SPARC Enterprise M9000-32 Server
- Sun SPARC Enterprise M9000-64 Server
- Sun SPARC Enterprise M4000 Server
- Sun SPARC Enterprise M8000 Server
Related Categories:
- GCS>Sun Microsystems>Servers>OPL Servers
Previously Published As: 254968
Symptoms
Shortly after upgrading to KJP 127111-08 or higher, one or several DIMMs are marked faulty by XSCF.
Kernel patch 127111-08 introduces memory page retirement for intermittent ECC errors.
DIMMs that were installed prior to patching may suddenly show many errors and be marked faulty by XSCF.
Before 127111-08, these errors were corrected silently and were not reported by FMA.
On older systems that were originally installed with Solaris 10u4, this may give the impression that one or several DIMMs suddenly went bad.
Because patching the kernel is required when upgrading to SPARC64 VII (Jupiter), customers may also believe that the new CMUs are causing the problems.
Resolution
Schedule a maintenance action to replace the faulted DIMMs.
Relief/Workaround
Additional Information
Description of past and current behavior for intermittent and permanent correctable ECC errors (CEs).
Patch 127111-07 and older:
=========================
DIMMs are marked for replacement when more than 128 pages are retired.
A single permanent CE on a page triggers retirement of that page.
Intermittent CEs are handled and corrected silently.
Patch 127111-08 and newer:
=========================
DIMMs are marked for replacement when more than 128 pages are retired. (same as before)
A single permanent CE on a page triggers retirement of that page. (same as before)
Three intermittent CEs (ICEs) within 72 hours on a DIMM trigger retirement of the page associated with the third ICE. (this is new)
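The rules for 127111-08 and newer can be illustrated with a small Python model. This is only a sketch of the thresholds stated above, not the actual FMA diagnosis logic: the class name DimmTracker, the sliding-window interpretation of "within 72 hours", and the choice to restart the ICE count after a retirement are all assumptions made for illustration.

```python
from collections import defaultdict, deque

WINDOW_HOURS = 72     # from the text: 3 ICEs within 72 hours
ICE_THRESHOLD = 3
PAGE_LIMIT = 128      # DIMM is marked when MORE than 128 pages are retired


class DimmTracker:
    """Toy model of the post-127111-08 rules (hypothetical, not FMA code)."""

    def __init__(self):
        self.recent_ice = defaultdict(deque)  # dimm -> ICE timestamps in hours
        self.retired = defaultdict(set)       # dimm -> retired page numbers

    def permanent_ce(self, dimm, page):
        # A single permanent CE retires the affected page.
        self.retired[dimm].add(page)

    def intermittent_ce(self, dimm, page, t_hours):
        window = self.recent_ice[dimm]
        window.append(t_hours)
        # Forget ICEs that are older than the 72-hour window.
        while t_hours - window[0] > WINDOW_HOURS:
            window.popleft()
        if len(window) >= ICE_THRESHOLD:
            # The page associated with the third ICE is retired.
            self.retired[dimm].add(page)
            window.clear()  # assumption: counting restarts after a retirement

    def is_faulty(self, dimm):
        return len(self.retired[dimm]) > PAGE_LIMIT
```

Under the older patch levels, the same model would apply with intermittent_ce doing nothing, which is why a DIMM with a long history of silent ICEs can cross the 128-page limit soon after patching.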
fmdump -e will report:
================
Intermittent errors (only 127111-08 and newer):
================================
ereport.asic.mac.mi-ice
and / or
ereport.asic.mac.ptrl-ice
Persistent errors (always):
=================
ereport.asic.mac.mi-ce
and / or
ereport.asic.mac.ptrl-ce
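To see at a glance how many of the reported events are intermittent versus persistent, the four class names above can be tallied with a short script. The following Python sketch assumes only that each class name appears as a whitespace-separated token somewhere on a line; the function name tally_ce_ereports is hypothetical.

```python
from collections import Counter

# Class names exactly as listed in this document.
INTERMITTENT = {"ereport.asic.mac.mi-ice", "ereport.asic.mac.ptrl-ice"}
PERSISTENT = {"ereport.asic.mac.mi-ce", "ereport.asic.mac.ptrl-ce"}


def tally_ce_ereports(lines):
    """Count intermittent and persistent CE ereports in an iterable of lines."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            if token in INTERMITTENT:
                counts["intermittent"] += 1
            elif token in PERSISTENT:
                counts["persistent"] += 1
    return counts
```

The function accepts any iterable of text lines, so saved fmdump -e output could be passed in as an open file object. A large intermittent count with few or no persistent events matches the situation described in this document.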
fmadm faulty -a will show something like:
===============================
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 26 18:52:14 a4309495-67c4-eb98-d174-f0c091643420  SUN4U-8000-2S  Major

Fault class : fault.memory.dimm 95%
Affects     : mem:///unum=/CMU03/MEM10A
                  degraded but still in service
FRU         : mem:///unum=/CMU03/MEM10A 95%
                  faulty
Serial ID.  : D21757AD:36HTF51272PY-667E1

Description : The number of errors associated with this memory module has
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/SUN4U-8000-2S for more information.

Response    : Pages of memory associated with this memory module are being
              removed from service as errors are reported.

Impact      : Total system memory capacity will be reduced as pages are
              retired.

Action      : Schedule a repair procedure to replace the affected memory
              module.  Use fmdump -v -u <EVENT_ID> to identify the module.
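When several DIMMs are faulted, it can help to pull the board and slot names out of the report before scheduling the replacement. The following Python sketch extracts them from mem:/// FMRIs of the form shown in the sample above. The regular expression is an assumption based solely on that one sample (CMU followed by digits, then a MEM slot name), and faulted_dimms is a hypothetical helper name.

```python
import re

# Pattern derived from the single sample FMRI mem:///unum=/CMU03/MEM10A.
UNUM_RE = re.compile(r"mem:///unum=/(CMU\d+)/(MEM\w+)")


def faulted_dimms(report_text):
    """Return sorted, de-duplicated (board, slot) pairs found in the text."""
    return sorted(set(UNUM_RE.findall(report_text)))
```

Because the Affects and FRU lines name the same module, duplicates are collapsed so that each DIMM is listed once.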
Product:
- Sun SPARC Enterprise M4000 Server
- Sun SPARC Enterprise M5000 Server
- Sun SPARC Enterprise M8000 Server
- Sun SPARC Enterprise M9000 Server
Attachments
This solution has no attachment