Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction Sure
Solution 1006140.1: Sun Fire[TM] 12K/15K/E20K/E25K: How to translate E$ Slot SubSlot messages from the System Controller platform messages file into a physical location.
Previously Published As 208600
Applies to:
Sun Fire 12K Server
Sun Fire E25K Server
Sun Fire 15K Server
Sun Fire E20K Server
All Platforms

Goal
Identify the location of the Ecache DIMM referenced in the following message:

Jul 30 20:00:20 2003 sysconf1-1 dsmd[556]: [0 1384804385762372 ERR SoftErrorHandler.cc 660] E$ Slot 16 SubSlot 7
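As a rough illustration of the first step, the sketch below pulls the Slot and SubSlot numbers out of a message line. The pattern is inferred only from the single sample message shown above, and the names `E_CACHE_RE` and `parse_ecache_message` are hypothetical helpers, not part of SMS. Other SMS releases may format the line differently.

```python
import re

# Pattern inferred from the one sample line in this document (an assumption).
# It deliberately matches only "E$ Slot ... SubSlot ..." and not the
# "DIMM Slot SubSlot" variant, which this document does not cover.
E_CACHE_RE = re.compile(r"E\$ Slot (\d+) SubSlot (\d+)")


def parse_ecache_message(line):
    """Return (slot, subslot) as integers, or None if the line does not match."""
    match = E_CACHE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = (
    "Jul 30 20:00:20 2003 sysconf1-1 dsmd[556]: "
    "[0 1384804385762372 ERR SoftErrorHandler.cc 660] E$ Slot 16 SubSlot 7"
)
print(parse_ecache_message(sample))  # (16, 7)
```

The extracted pair still has to be translated into a physical E$ DIMM location using the Slot and SubSlot rules in this document.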
Solution
Steps to Follow

When Solaris[TM] on the domain detects an ECC error, it prints an error in the /var/adm/messages file detailing the problem encountered. It then sends an error notification to SMS (System Management Services) on the SC to allow error tracking of FRUs. SMS prints a message like the one above when it receives this notification from the domain. Using the information in this document, messages like the one above can be matched to the error messages produced by Solaris in the domain.

Now let's take a look at the SMS error message itself.

First, the error shows that an E$ (ecache - L2SRAM) is in error.
NOTE: If your message shows "DIMM Slot SubSlot" messages, you must use
Second, it identifies a "Slot" location.
Third, it identifies a "SubSlot" location.

So now we must use the Slot and SubSlot locations to determine the actual part in error, that is, the E$ DIMM which reports the error.

"Slot" refers to the system board which reports the error. A platform can hold up to 18 system boards (depending on the platform type), so this type of error reports a Slot from 0 to 17.

"SubSlot" identifies which E$ DIMM on the system board reports the error. There are 8 E$ DIMMs on a system board; therefore, the numbering runs from 0 to 7. You might expect E$ DIMM 0 for CPU 0 to be SubSlot 0, but that is not the case. It is actually E$ DIMM 1 for CPU 0 which is SubSlot 0. See the chart below for the translations.

--------------------------

On the second line, SMS provides more detailed information about the error.

The "Comp ID" (Component ID) is another encoding of the Slot and SubSlot. The component ID can be broken down as follows:

-------------------------

The lower nibble identifies the specific DIMM associated with the component identified by the upper nibble.

-------------------------

Thus, a Comp ID of 0x76 corresponds to P3/B0/D0 on the System Board of the identified Slot.
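The range limits and the nibble split described above can be sketched as follows. This is a minimal illustration only: it assumes the Comp ID is a single byte (as the 0x76 example suggests), and `check_location` and `split_comp_id` are hypothetical helper names. It does not translate nibbles into a physical location, because that step depends on this document's translation tables.

```python
MAX_SLOT = 17     # document: Slot is reported in the range 0-17
MAX_SUBSLOT = 7   # document: eight E$ DIMMs per system board, SubSlot 0-7


def check_location(slot, subslot):
    """Return True if slot and subslot fall inside the documented ranges."""
    return 0 <= slot <= MAX_SLOT and 0 <= subslot <= MAX_SUBSLOT


def split_comp_id(comp_id):
    """Split a Comp ID into (upper_nibble, lower_nibble).

    Assumes the Comp ID fits in one byte; this is inferred from the
    0x76 example and is not confirmed elsewhere in the document.
    """
    if not 0 <= comp_id <= 0xFF:
        raise ValueError("Comp ID expected to fit in one byte")
    upper = (comp_id >> 4) & 0xF  # identifies the component
    lower = comp_id & 0xF         # identifies the DIMM on that component
    return upper, lower


print(check_location(16, 7))  # True
print(split_comp_id(0x76))    # (7, 6)
```

For the example Comp ID of 0x76, the upper nibble is 7 and the lower nibble is 6; the document's tables are still needed to turn those two values into the P3/B0/D0 location.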
The "Error Code" is decoded as follows:

--------------------------------------------------

The "Error Type" is decoded as follows:

-------------------------

For an Uncorrectable event, replace the implicated component identified through use of this document. For a Correctable event, replace the component only as directed by Best Practices recommendations. Contact Sun Support Services for details of Best Practices or to set up a service request for an error of this type. Please reference this document when contacting service, and provide log data (SC explorer data preferred) so that support can confirm the analysis.

Internal Comments

*CAUTION* Using the /var/opt/SUNWSMS/SMS/adm/platform/messages file alone to identify an E$ DIMM failure is not recommended. This document is solely to be used to explain how to translate these error messages into the correct DIMM being reported in error. Standard troubleshooting of E$ related errors should involve analysis of rstop/dstop/xcstate files and POST logs, and should follow the instructions laid out by the FABs, Sun Alerts, and Problem Resolution articles which relate to analysis of these types of errors. This error alone only identifies the E$ DIMM issue and should be used solely as confirmation of the analysis of the other files previously mentioned.
References:
Previously Published As 71043

Attachments
This solution has no attachment