Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1020256.1
Update Date:2010-08-11
Keywords:

Solution Type  Problem Resolution Sure

Solution  1020256.1 :   OPL: DIMMs are suddenly marked faulty after upgrading kernel patches  


Related Items
  • Sun SPARC Enterprise M5000 Server
  • Sun SPARC Enterprise M9000-32 Server
  • Sun SPARC Enterprise M9000-64 Server
  • Sun SPARC Enterprise M4000 Server
  • Sun SPARC Enterprise M8000 Server
Related Categories
  • GCS>Sun Microsystems>Servers>OPL Servers

PreviouslyPublishedAs
254968


Symptoms
Shortly after upgrading to kernel patch (KJP) 127111-08 or higher, one or more DIMMs are marked faulty by XSCF.
Kernel patch 127111-08 introduces memory page retirement for intermittent correctable ECC errors.
DIMMs that were installed before patching may suddenly show many errors and be marked faulty by XSCF.
Before 127111-08, these errors were corrected silently and were not reported by FMA.

On older systems that were originally installed with Solaris 10u4, this may give the impression that one or more DIMMs have suddenly gone bad.
Because patching the kernel is required when upgrading to SPARC64 VII (Jupiter), customers may believe that the new CMUs are causing the problems.


Resolution
Schedule a maintenance action to replace the faulted DIMMs.



Relief/Workaround



Additional Information
Description of past and current behavior for intermittent and permanent correctable ECC errors (CEs).
Patch 127111-07 and older:
=========================
DIMMs are marked for replacement when more than 128 pages are retired.
A single permanent CE on a page triggers retirement of that page.
Intermittent CEs are handled and corrected silently.
Patch 127111-08 and newer:
=========================
DIMMs are marked for replacement when more than 128 pages are retired. (same as before)
A single permanent CE on a page triggers retirement of that page. (same as before)
3 intermittent CEs (ICEs) within 72 hours on a DIMM trigger retirement of the page associated with the 3rd ICE. (this is new)
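
The difference between the two patch levels can be illustrated with a small model. The Python sketch below is not the Solaris FMA diagnosis engine; it only replays the three rules listed above. The function name simulate, the event tuple layout, the sliding-window reading of "within 72 hours", and the reset of the ICE count after a retirement are all assumptions made for illustration.

```python
from collections import defaultdict, deque

WINDOW_HOURS = 72    # from the rules above: 3 ICEs within 72 hours
ICE_THRESHOLD = 3
PAGE_LIMIT = 128     # DIMM is marked when MORE than 128 pages are retired


def simulate(events, ice_retirement):
    """Replay CE events against the documented rules.

    events: iterable of (hour, dimm, page, kind), sorted by hour,
            where kind is "ce" (permanent) or "ice" (intermittent).
    ice_retirement: False models 127111-07 and older,
                    True models 127111-08 and newer.
    Returns (retired pages per DIMM, DIMMs marked for replacement).
    """
    retired = defaultdict(set)
    recent_ice = defaultdict(deque)   # per-DIMM timestamps of recent ICEs
    for hour, dimm, page, kind in events:
        if kind == "ce":
            # A single permanent CE retires the page at every patch level.
            retired[dimm].add(page)
        elif kind == "ice" and ice_retirement:
            window = recent_ice[dimm]
            window.append(hour)
            # Drop ICEs that fell out of the 72-hour window.
            while window and hour - window[0] > WINDOW_HOURS:
                window.popleft()
            if len(window) >= ICE_THRESHOLD:
                retired[dimm].add(page)   # page of the 3rd ICE
                window.clear()            # assumption: count restarts
        # Older patch levels correct ICEs silently: nothing is recorded.
    faulty = {d for d, pages in retired.items() if len(pages) > PAGE_LIMIT}
    return dict(retired), faulty


if __name__ == "__main__":
    # Hypothetical load: one ICE per hour, each on a different page.
    dimm = "/CMU03/MEM10A"
    events = [(h, dimm, h, "ice") for h in range(129 * 3)]
    for label, flag in (("-07 and older", False), ("-08 and newer", True)):
        retired, faulty = simulate(events, flag)
        print(label, len(retired.get(dimm, ())), sorted(faulty))
```

Under these assumptions the same stream of intermittent errors leaves the DIMM untouched at the older patch level and pushes it past the 128-page limit at the newer one, which matches the symptom described in this solution.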
fmdump -e will report:
================

Intermittent errors (only 127111-08 and newer):
================================
ereport.asic.mac.mi-ice
and / or
ereport.asic.mac.ptrl-ice

Persistent errors (always):
=================
ereport.asic.mac.mi-ce
and / or
ereport.asic.mac.ptrl-ce
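
To see quickly whether a system is logging mostly intermittent or mostly persistent errors, the classes above can be tallied from saved fmdump -e output. The following Python sketch is only an illustration: the function name count_ce_ereports is hypothetical, and it assumes that the ereport class is the last whitespace-separated field on each line of the saved output.

```python
from collections import Counter

# Class names taken from the lists above.
INTERMITTENT = {"ereport.asic.mac.mi-ice", "ereport.asic.mac.ptrl-ice"}
PERSISTENT = {"ereport.asic.mac.mi-ce", "ereport.asic.mac.ptrl-ce"}


def count_ce_ereports(fmdump_text):
    """Count intermittent and persistent CE ereports in fmdump -e text."""
    counts = Counter()
    for line in fmdump_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        ereport_class = fields[-1]   # assumption: class is the last field
        if ereport_class in INTERMITTENT:
            counts["intermittent"] += 1
        elif ereport_class in PERSISTENT:
            counts["persistent"] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): fmdump -e > ereports.txt
    #                                 python count_ce.py < ereports.txt
    print(dict(count_ce_ereports(sys.stdin.read())))
```

A high intermittent count with few or no persistent entries on a freshly patched system would be consistent with the behavior change described in this solution.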

fmadm faulty -a will show something like:
===============================
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 26 18:52:14 a4309495-67c4-eb98-d174-f0c091643420  SUN4U-8000-2S  Major

Fault class : fault.memory.dimm 95%
Affects     : mem:///unum=/CMU03/MEM10A
                  degraded but still in service
FRU         : mem:///unum=/CMU03/MEM10A 95%
                  faulty
Serial ID.  : D21757AD:36HTF51272PY-667E1

Description : The number of errors associated with this memory module has
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/SUN4U-8000-2S for more information.

Response    : Pages of memory associated with this memory module are being
              removed from service as errors are reported.

Impact      : Total system memory capacity will be reduced as pages are
              retired.

Action      : Schedule a repair procedure to replace the affected memory
              module. Use fmdump -v -u <EVENT_ID> to identify the module.


Product
Sun SPARC Enterprise M4000 Server
Sun SPARC Enterprise M5000 Server
Sun SPARC Enterprise M8000 Server
Sun SPARC Enterprise M9000 Server


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.