Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution

1011988.1 : Sun Fire[TM] 12K/15K/E20K/E25K: How to translate a DIMM Slot SubSlot message from the System Controller platform messages file into a physical location.

Previously Published As 216430
Applies to:
Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms

Goal
How is the following error, logged to a system controller's (SC) /var/opt/SUNWSMS/SMS/adm/platform/messages file, deciphered to determine which DIMM is being reported in error?

Jul 14 12:30:11 2003 sc0 dsmd[1686]: [0 277533849565489 ERR SoftErrorHandler.cc 665] DIMM Slot 0 SubSlot 24

Solution
When Solaris[TM] on the domain detects an ECC error, it logs an error detailing the problem encountered. It then sends an error notification to SMS (System Management Services) on the SC to allow error tracking of FRUs. SMS prints a message like the one above when it receives this notification from the domain. Using the information in this document, messages like the one above can be matched to the error messages produced by Solaris in the domain.

First, the message shows that a DIMM is in error. The word DIMM means a main memory DIMM, not an Ecache or L2SRAM DIMM. If your message shows "E$ Slot SubSlot", use Technical Instruction Document 1006140.1 as a guide for those errors. Second, the message identifies a "Slot" location. Third, it identifies a "SubSlot" location. Together, the Slot and SubSlot locations determine the physical DIMM that is reporting the error.

"Slot" refers to the system board that reports the error. A platform can hold up to 18 system boards (depending on the platform type), so this type of error reports a Slot between 0 and 17.

"SubSlot" refers to the DIMM slot on the system board that reports the error. There are 32 DIMM slots on a system board, so the numbering runs from 0 to 31.
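As a rough illustration of this first step, the Slot and SubSlot fields can be pulled out of a platform message with a short script. The following is a minimal sketch in Python, not part of SMS: the function name parse_dimm_message and the regular expression are assumptions made for illustration, and the range checks simply mirror the 0-17 and 0-31 limits described above.

```python
import re

# Matches the "DIMM Slot <n> SubSlot <m>" portion of a platform message.
# "E$ Slot ... SubSlot ..." messages are deliberately not matched, since
# they are covered by a different document.
DIMM_MSG = re.compile(r"\bDIMM Slot (\d+) SubSlot (\d+)\b")


def parse_dimm_message(line):
    """Return (slot, subslot) for a DIMM message, or None if not a DIMM message."""
    match = DIMM_MSG.search(line)
    if match is None:
        return None
    slot = int(match.group(1))
    subslot = int(match.group(2))
    # Up to 18 system boards per platform, 32 DIMM slots per board.
    if not 0 <= slot <= 17:
        raise ValueError(f"Slot {slot} is outside the range 0-17")
    if not 0 <= subslot <= 31:
        raise ValueError(f"SubSlot {subslot} is outside the range 0-31")
    return slot, subslot


sample = (
    "Jul 14 12:30:11 2003 sc0 dsmd[1686]: "
    "[0 277533849565489 ERR SoftErrorHandler.cc 665] DIMM Slot 0 SubSlot 24"
)
print(parse_dimm_message(sample))  # (0, 24)
```

For the sample message from the Goal section this yields system board 0 and DIMM slot 24, which can then be looked up in the SubSlot translation table.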
----------------------------
SubSlot Translation Table
----------------------------
Subslot  Physical     J#####
#        Location
----------------------------
0        CPU0/B0/D0   J13300
1        CPU0/B1/D0   J13301
2        CPU0/B0/D1   J13400
3        CPU0/B1/D1   J13401
4        CPU0/B0/D2   J13500
5        CPU0/B1/D2   J13501
6        CPU0/B0/D3   J13600
7        CPU0/B1/D3   J13601
8        CPU1/B0/D0   J14300
9        CPU1/B1/D0   J14301
10       CPU1/B0/D1   J14400
11       CPU1/B1/D1   J14401
12       CPU1/B0/D2   J14500
13       CPU1/B1/D2   J14501
14       CPU1/B0/D3   J14600
15       CPU1/B1/D3   J14601
16       CPU2/B0/D0   J15300
17       CPU2/B1/D0   J15301
18       CPU2/B0/D1   J15400
19       CPU2/B1/D1   J15401
20       CPU2/B0/D2   J15500
21       CPU2/B1/D2   J15501
22       CPU2/B0/D3   J15600
23       CPU2/B1/D3   J15601
24       CPU3/B0/D0   J16300
25       CPU3/B1/D0   J16301
26       CPU3/B0/D1   J16400
27       CPU3/B1/D1   J16401
28       CPU3/B0/D2   J16500
29       CPU3/B1/D2   J16501
30       CPU3/B0/D3   J16600
31       CPU3/B1/D3   J16601
----------------------------

On the second line, SMS provides more detailed information about the error. The "Comp ID" (Component ID) is another encoding of the Slot and SubSlot. The component ID can be broken down as follows:

-------------------------
Component ID Upper Nibble
-------------------------
Upper    CPU
Nibble
-------------------------
0        MaxCAT CPU 0
1        MaxCAT CPU 1
4        SB CPU 0
5        SB CPU 1
6        SB CPU 2
7        SB CPU 3
-------------------------

The lower nibble details the specific DIMM associated with the component identified by the upper nibble.

-------------------------
Component ID Lower Nibble
-------------------------
Lower    DIMM
Nibble
-------------------------
0        MaxCAT E$ 0
1        MaxCAT E$ 1
2        CPU E$ 0 (Jx400)
3        CPU E$ 1 (Jx300)
6        B0/D0
7        B1/D0
8        B0/D1
9        B1/D1
a        B0/D2
b        B1/D2
c        B0/D3
d        B1/D3
-------------------------

Thus, a Comp ID of 0x76 (upper nibble 7, lower nibble 6) corresponds to SB CPU 3, B0/D0 (P3/B0/D0) on the system board of the identified Slot.
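The two lookups above can be sketched in a few lines of Python. This is an illustrative example only, not an SMS tool: the function names subslot_to_location and decode_comp_id are hypothetical, and the arithmetic used to build the CPU/bank/DIMM label and the J-number is a pattern inferred from the rows of the SubSlot translation table rather than a documented formula. The nibble dictionaries are copied from the Component ID tables.

```python
def subslot_to_location(subslot):
    """Map a SubSlot number (0-31) to (physical location, J-number).

    Assumption: the regular pattern visible in the translation table
    (8 subslots per CPU, 2 banks per DIMM pair) holds for every row.
    """
    if not 0 <= subslot <= 31:
        raise ValueError(f"SubSlot {subslot} is outside the range 0-31")
    cpu, rest = divmod(subslot, 8)   # 8 consecutive subslots per CPU
    dimm, bank = divmod(rest, 2)     # even = bank B0, odd = bank B1
    location = f"CPU{cpu}/B{bank}/D{dimm}"
    connector = f"J1{3 + cpu}{3 + dimm}0{bank}"
    return location, connector


# Comp ID nibble tables, copied from the document.
UPPER_NIBBLE = {
    0x0: "MaxCAT CPU 0", 0x1: "MaxCAT CPU 1",
    0x4: "SB CPU 0", 0x5: "SB CPU 1", 0x6: "SB CPU 2", 0x7: "SB CPU 3",
}
LOWER_NIBBLE = {
    0x0: "MaxCAT E$ 0", 0x1: "MaxCAT E$ 1",
    0x2: "CPU E$ 0 (Jx400)", 0x3: "CPU E$ 1 (Jx300)",
    0x6: "B0/D0", 0x7: "B1/D0", 0x8: "B0/D1", 0x9: "B1/D1",
    0xA: "B0/D2", 0xB: "B1/D2", 0xC: "B0/D3", 0xD: "B1/D3",
}


def decode_comp_id(comp_id):
    """Split a one-byte Comp ID into its (CPU, DIMM) descriptions."""
    upper = (comp_id >> 4) & 0xF
    lower = comp_id & 0xF
    return (UPPER_NIBBLE.get(upper, "unknown"),
            LOWER_NIBBLE.get(lower, "unknown"))


print(subslot_to_location(24))  # ('CPU3/B0/D0', 'J16300')
print(decode_comp_id(0x76))     # ('SB CPU 3', 'B0/D0')
```

For the sample message (Slot 0, SubSlot 24) and a Comp ID of 0x76, both lookups point at the same DIMM: CPU 3, bank 0, DIMM 0 on system board 0. Any result produced by the inferred arithmetic should still be checked against the table itself.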
The "Error Code" is decoded as follows:

--------------------------------------------------
Code     Error
--------------------------------------------------
0        UNKNOWN
1        CE   Correctable ECC error
2        UE   Uncorrectable ECC error
3        EDC  Correctable ECC error from E$
4        EDU  Uncorrectable ECC error from E$
5        WDC  Correctable E$ write-back ECC
6        WDU  Uncorrectable E$ write-back ECC
7        CPC  Copy-out correctable ECC error
8        CPU  Copy-out uncorrectable ECC error
9        UCC  SW handled correctable ECC
a        UCU  SW handled uncorrectable ECC
b        EMC  Correctable MTAG ECC error
c        EMU  Uncorrectable MTAG ECC error
--------------------------------------------------

The "Error Type" is decoded as follows:

-------------------------
Error Type
-------------------------
Value    Error Type
-------------------------
0        Unknown
1        Single bit error
2        Double bit error
3        Triple bit error
4        Quad bit error
5        Multiple bit error
-------------------------

For an Uncorrectable event, replace the implicated component identified through use of this document. For a Correctable event, replace the component only as Best Practices recommendations direct. Contact Sun Support Services for details of Best Practice or to set up a service request for an error of this type. Please reference this document when contacting service and provide log data (SC explorer data preferred) so support can confirm the analysis.

Product
Sun Fire 15K Server
Sun Fire 12K Server
Sun Fire E25K Server
Sun Fire E20K Server

Internal Section

CAUTION
Using the /var/opt/SUNWSMS/SMS/adm/platform/messages file alone to identify an E$ DIMM failure is not recommended. This document is solely to be used to explain how to translate these error messages into the correct DIMM being reported in error. Standard troubleshooting of E$ related errors should involve analysis of rstop/dstop/xcstate files and POST logs, and following the instructions laid out by the FINs, Sun Alerts, and Problem Resolution articles that relate to analysis of these types of errors.

This error alone only identifies the E$ DIMM issue and should be used solely as confirmation of the analysis of the other files previously mentioned.
Keywords: 12k, 15k, e20k, e25k, memory dimm, system controllers, SC, translate, subslot, slot, SubSlot, Slot, DIMM

Previously Published As 71129