Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution

1010905.1 : Sun Enhanced Memory DIMM Replacement Policy for SPARC

Previously Published As: 215045

Applies to:
Sun Enterprise 220R Server
Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms

Goal

Description

Sun Enhanced Memory DIMM Replacement Policy for SPARC

The rules detailed in this Policy apply to all supported machines that use the SPARC architecture.

NOTE: Acronyms used in this document and their definitions:
DIMM - Dual Inline Memory Module

Further definitions can be referenced in <Document:1004729.1> Introduction to Solaris[TM] Operating System CE/UE/ECC/CBB/CBI/DBB/DBI Error Messages

Solution

Sun's SPARC/Solaris DIMM Replacement Policy - Version 20100623

Note: The rules detailed in this Policy apply to all supported machines that use the SPARC architecture.

Replace a DIMM when:

1. Rule 1: POST (when run at a level which actually tests memory) fails it.

2. Rule 2: For systems with Predictive Self-Healing (Solaris 10 and later, except on UltraSPARC II-based platforms), when the system tells you to.

3. Rule 3: For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier), whenever Solaris reports a UE or DUE, and investigation shows that the UE or DUE truly originated from memory, and not from a transfer from some CPU's cache, as determined by a qualified Sun Support specialist.

4. Rule 4: For two or more CEs:

4.1 Rule 4A. For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier), whenever Solaris reports two or more CEs from two or more different physical addresses on each of two or more different bit positions from the same DIMM within 24 hours of each other, and all the addresses are in the same relative checkword (that is, the AFARs are all the same modulo 64).

[Note: This means at least 4 CEs; two from one bit position, with unique addresses, and two from another, also with unique addresses, and the lower 6 bits of all the addresses are the same.]
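The Rule 4A condition can be illustrated with a minimal sketch. This is not the cediag implementation; the function name, the (timestamp, AFAR, bit position) tuple layout, and the reading of "within 24 hours of each other" as "all qualifying CEs fall inside one 24-hour span" are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)


def rule_4a_matches(ces):
    """Hypothetical check of Rule 4A for the CEs logged against ONE DIMM.

    ces: iterable of (timestamp, afar, bit_position) tuples (assumed layout).
    Returns True when, inside some 24-hour span, at least two bit positions
    each show CEs at two or more distinct AFARs, and all of those AFARs are
    equal modulo 64 (same relative checkword).
    """
    ces = sorted(ces)
    for i, (start, _, _) in enumerate(ces):
        # checkword offset (AFAR mod 64) -> bit position -> distinct AFARs
        groups = defaultdict(lambda: defaultdict(set))
        for ts, afar, bit in ces[i:]:
            if ts - start > WINDOW:
                break
            groups[afar % 64][bit].add(afar)
        for bits in groups.values():
            qualifying = [b for b, addrs in bits.items() if len(addrs) >= 2]
            if len(qualifying) >= 2:
                return True
    return False


if __name__ == "__main__":
    t0 = datetime(2010, 6, 23)
    hour = timedelta(hours=1)
    sample = [
        (t0, 0x1000, 5),
        (t0 + hour, 0x2000, 5),
        (t0 + 2 * hour, 0x1040, 9),
        (t0 + 3 * hour, 0x3000, 9),
    ]
    print(rule_4a_matches(sample))
```

The sample is the minimum case described in the Note: four CEs, two per bit position, with the lower 6 bits of every address identical.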
5. Rule 5: For Solaris 8 and 9 systems with page retirement (Solaris 8, patch level 108528-24 or later; Solaris 9, patch level 112233-11 or later), as well as for UltraSPARC II-based systems running Solaris 10 and later:

5.1 Rule 5A.

5.1.1 Definition: the term "faulted page" denotes a page scheduled for retirement, whether or not that retirement has succeeded.

5.1.2 When the system indicates that
    a DIMM has accumulated 512 or more faulted pages
    AND the bad reader/writer check (defined below) FAILS,
  OR
    a DIMM has accumulated 128 or more faulted pages
    AND ( <physical address of highest faulted page> - <physical address of lowest faulted page> ) / ( <number of faulted pages> - 1 ) > 512KB
    AND the bad reader/writer check FAILS
  then replace the DIMM.

5.1.3 The bad reader/writer check is defined as follows:

5.1.3.1 For this DIMM and any other DIMM in the system, if they each have at least 4 ereports at unique addresses (unique per DIMM; depending upon the system design each DIMM could have the same address in an ereport) which have the same symbol position, AND if the number of pages faulted on the DIMM with the smaller number of pages faulted is greater than 1/16 times the number of pages faulted on the DIMM with the greater number of pages faulted, then the bad reader/writer check SUCCEEDS.

5.1.3.2 If, for all sets of this DIMM and any other DIMM in the system, the number of pages faulted on the DIMM with the smaller number of pages faulted is not greater than 1/16 times the number of pages faulted on the DIMM with the greater number of pages faulted OR the two DIMMs do not each have four Correctable Errors (CEs) at unique per-DIMM addresses at the same symbol position, then the bad reader/writer check FAILS.

5.1.3.3 If the bad reader/writer check SUCCEEDS, then other possible causes of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs.

[Note: Determining these factors is aided by the cediag diagnostic tool set.]
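The Rule 5A decision in 5.1.2 and the bad reader/writer check in 5.1.3 can be sketched as follows. This is an illustrative reading of the rule text, not the logic shipped in cediag or FMA. The DimmStats structure, all function names, and the interpretation of "the same symbol position" as "one symbol position at which both DIMMs each have four or more unique ereport addresses" are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Tuple

KB = 1024


@dataclass
class DimmStats:
    """Hypothetical per-DIMM summary; not a cediag or FMA data structure."""
    name: str
    # Physical addresses of faulted pages, assumed unique.
    faulted_pages: List[int] = field(default_factory=list)
    # CE ereports as (address, symbol position) pairs.
    ce_ereports: List[Tuple[int, int]] = field(default_factory=list)


def addresses_by_symbol(dimm):
    """Map symbol position -> set of unique ereport addresses on this DIMM."""
    by_symbol = defaultdict(set)
    for address, symbol in dimm.ce_ereports:
        by_symbol[symbol].add(address)
    return by_symbol


def pair_succeeds(a, b):
    """5.1.3.1 evaluated for one pair of DIMMs."""
    sym_a, sym_b = addresses_by_symbol(a), addresses_by_symbol(b)
    shared_symbol = any(
        len(addrs) >= 4 and len(sym_b.get(symbol, ())) >= 4
        for symbol, addrs in sym_a.items()
    )
    smaller, greater = sorted((len(a.faulted_pages), len(b.faulted_pages)))
    # "smaller > greater / 16", written without division.
    return shared_symbol and smaller * 16 > greater


def bad_reader_writer_check_succeeds(dimm, others):
    """SUCCEEDS if any pairing qualifies (5.1.3.1); otherwise FAILS (5.1.3.2)."""
    return any(pair_succeeds(dimm, other) for other in others)


def average_spacing(pages):
    """(highest address - lowest address) / (number of faulted pages - 1)."""
    return (max(pages) - min(pages)) / (len(pages) - 1)


def rule_5a_replace(dimm, others):
    """True when 5.1.2 calls for replacing `dimm`."""
    if bad_reader_writer_check_succeeds(dimm, others):
        return False  # 5.1.3.3: a specialist must rule out other causes first
    count = len(dimm.faulted_pages)
    if count >= 512:
        return True
    return count >= 128 and average_spacing(dimm.faulted_pages) > 512 * KB
```

Under these assumptions, a DIMM with 128 faulted pages packed 8 KB apart does not meet 5.1.2, whereas the same number of pages spread 1 MB apart does, provided the bad reader/writer check FAILS.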
5.2 Rule 5B: If more than 120 non-intermittent CEs are reported against one bit position of one AFAR in 24 hours.

6. Rule 6: For older Solaris releases and patch levels, when Solaris reports more than 24 non-intermittent CEs in 24 hours from a single DIMM. If more than one DIMM has experienced more than 24 non-intermittent CEs in 24 hours, then other possible causes of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs.

7. Limitations: Prior to Solaris 10, retired pages are returned to service whenever a system is rebooted, and will be re-retired if and when Solaris encounters CEs from them again. POST may fail a DIMM that contained retired pages; if it does, replace the DIMM at that time.

Copyright: Sun Microsystems, Inc.
Original version: Nov. 17, 2004
Updated March 16, 2006
Updated January 13, 2010 (Updated Rule 5A, added Rule 5B)
Updated March 5, 2010 (removed Rule 4B)
Updated March 11, 2010 (modified Rules 5.1.3.1 and 5.1.3.2)
Updated March 12, 2010 (spelling correction: “depending”)
Updated June 23, 2010 (Corrected typo in 5.1.2 to remove duplicated text)

cediag(1M) diagnostic tool download and reference:

When deploying the cediag tool, follow the instructions in <Document:1003867.1> Memory DIMM Replacement Tool - cediag FAQ, which also provides the patches where cediag can be obtained.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in an appropriate My Oracle Support Community, Oracle Sun Technologies Community.
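The 24-hour counting thresholds in Rules 5B and 6 above can be sketched with a sliding window. This is a minimal illustration, not the cediag algorithm. The function names, the tuple layouts, and the boolean "intermittent" flag are assumptions; the policy text does not say how a CE is classified as intermittent, so that classification is taken as given input here.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)


def exceeds_in_window(timestamps, limit, window=WINDOW):
    """True if more than `limit` timestamps fall within any span of `window`."""
    ts = sorted(timestamps)
    lo = 0
    for hi in range(len(ts)):
        # Advance the left edge until the span fits inside the window.
        while ts[hi] - ts[lo] > window:
            lo += 1
        if hi - lo + 1 > limit:
            return True
    return False


def rule_5b_matches(ces):
    """Rule 5B sketch for one DIMM.

    ces: iterable of (timestamp, afar, bit_position, intermittent) tuples
    (assumed layout). Counts non-intermittent CEs per (AFAR, bit position).
    """
    per_cell = defaultdict(list)
    for ts, afar, bit, intermittent in ces:
        if not intermittent:
            per_cell[(afar, bit)].append(ts)
    return any(exceeds_in_window(stamps, 120) for stamps in per_cell.values())


def rule_6_verdict(ces_by_dimm):
    """Rule 6 sketch.

    ces_by_dimm: DIMM name -> list of (timestamp, intermittent) tuples
    (assumed layout). Returns (verdict, names of DIMMs over the threshold).
    """
    over = sorted(
        name
        for name, ces in ces_by_dimm.items()
        if exceeds_in_window([ts for ts, flag in ces if not flag], 24)
    )
    if len(over) == 1:
        return ("replace", over)
    if len(over) > 1:
        # More than one DIMM over the threshold: a specialist must rule out
        # other causes before any DIMM is replaced.
        return ("escalate", over)
    return ("no action", over)
```

Both thresholds are strict ("more than"), so exactly 120 CEs under Rule 5B or exactly 24 under Rule 6 do not trigger a replacement in this sketch.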
Internal Comments

DIMM Replacement & Related Links

Quality Communications Office DIMM Directory: https://onestop.sfbay.sun.com/qco/dimm/index_dimm.shtml

Note that any article in this directory dated prior to June 23, 2010 predates the most recent DIMM Policy changes. The policy, as shown above, is the most recent version (dated June 23, 2010).

Definitions and Error Explanations

<Document:1004729.1> Introduction to Solaris[TM] Operating System CE/UE/ECC/CBB/CBI/DBB/DBI Error Messages

Refer all questions and comments to: [email protected]

NOTE: DIMMs displaying consistent Mtag CE errors on the Sun Fire[TM] 12K/15K/E20K/E25K should be replaced and will not be reported on by cediag.

ARCHIVED RULES - ORIGINAL RULES FROM PRIOR POLICY [November 17, 2004, March 16, 2006] ARE BELOW

Prior rule info is saved here for reference only, since CEDIAG Tool v1.3.2 for Solaris 8 & 9 and FMA for Solaris 10 are using the old/existing Rules 4 & 5 until new software patches are released implementing the revised rules.

ARCHIVED 4.2 Rule 4B. For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier), whenever Solaris reports two or more CEs from two or more different physical addresses on each of three or more different outputs from the same DRAM within 24 hours of each other, as long as the three outputs do not all correspond to the same relative bit position in their respective checkwords.

[Note: This means at least 6 CEs; two from one DRAM output signal, with unique addresses, two from another output from the same DRAM, also with unique addresses, and two more from yet another output from the same DRAM, again with unique addresses, as long as the three outputs do not all correspond to the same relative bit position in their respective checkwords.]

ARCHIVED 5. RULE-5.
For Solaris 8 and 9 systems with page retirement (Solaris 8, patch level 108528-24 or later; Solaris 9, patch level 112233-11 or later), as well as for UltraSPARC II-based systems running Solaris 10 and later, when the system indicates that the page retirement limit of 0.1% of physical memory has been reached and denotes one and only one DIMM as suspect (i.e., it has accumulated 130 or more non-intermittent CEs). If more than one DIMM is marked as suspect, then other possible causes of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs.

[Note: Determining these factors is aided by the cediag diagnostic tool set.]

In the unlikely event that the system indicates that the page retirement limit has been reached but no DIMM is marked as suspect, contact a Sun Support specialist for assistance in determining any necessary action.

END OF ARCHIVED SECTION.

UltraSPARC, II, III, IV, IV+, DIMM, Replacement, Policy, Memory

Previously Published As 79928

Attachments
This solution has no attachment