Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Troubleshooting Sure
Solution 1321335.1: Sun Enterprise[TM] 10000: Troubleshooting Recordstop Dumps
Applies to:
Sun Enterprise 10000 Server - Version: Not Applicable to Not Applicable - Release: N/A to N/A
Information in this document applies to any platform.

Purpose
This document provides troubleshooting information for various recordstop dump events.

Last Review Date
May 11, 2011

Instructions for the Reader
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.
Troubleshooting Details

Bogus Uncorrectable Error reported on bit 32, syndrome 13

There is a problem in the Starfire's XDB algorithm that checks the syndrome bits to identify the bad bit and determine whether the error is a single-bit or a multiple-bit error. The XDB is coded to expect a syndrome of 12 for bit 32, but the syndrome for bit 32 is actually 13. As a result, the XDB requests a Recordstop but, instead of recording a single-bit error (CE), it records a multiple-bit error (UE).

From the wfail output, we see something like the following:

redxl> wfail

Bear in mind that the UE is misreported by the XDB only. Solaris detects and reports this error properly. As a result, only the Recordstop Dump File will reflect a UE with bit 32 in error in the XDB output.

The flip side of this problem is that the XDB reports a Syndrome 12 Correctable Error but does not identify which bit was corrected. In reality, Syndrome 12 maps to an Uncorrectable Error (UE) and cannot be mapped to a single bit.
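The two misreports described above can be summarized as a small lookup, which may help when reading a Recordstop Dump File by hand. This is a minimal sketch and not part of any Sun tool: the function name `reinterpret_xdb_report` and its return layout are assumptions made for illustration, only the two syndromes discussed above are modelled, and the syndrome values are treated as the opaque numbers printed in the text.

```python
# Illustrative only: the function name and tuple layout are assumptions,
# and only the two syndromes discussed in the text are modelled.

def reinterpret_xdb_report(kind, syndrome):
    """Map an XDB-reported (kind, syndrome) pair to its actual meaning.

    Returns (actual_kind, bit). bit is None when no single bit can be
    identified or when the case is not covered by this sketch.
    """
    if kind == "UE" and syndrome == 13:
        # The XDB expects syndrome 12 for bit 32, so it misreports this
        # single-bit error on bit 32 as a UE.
        return ("CE", 32)
    if kind == "CE" and syndrome == 12:
        # Syndrome 12 is really uncorrectable and maps to no single bit.
        return ("UE", None)
    # Any other combination is outside the scope of this sketch and is
    # passed through unchanged, without decoding a bit number.
    return (kind, None)


print(reinterpret_xdb_report("UE", 13))
print(reinterpret_xdb_report("CE", 12))
```

Because Solaris reports these errors properly, a lookup like this is only relevant to the XDB portion of the Recordstop Dump File.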
Correctable ECC Error (CE) Processor X Dtags

From the wfail output, we see something like the following:

redxl> wfail

The above error should be analyzed in a way consistent with other Correctable Error recordstops. This means that the first instance or event for any given error against a particular DTag SRAM (in this example, CIC 7.2 - DTag SRAM 0) should be diagnosed as a soft error, and no action should be taken against it. Swap the "Implicated FRU" (SB7 in the example) when the third failure occurs on any one CIC.

NOTE: Blacklisting the affected CPU (proc 7.1 in the example) could be used as a short-term workaround.

Attachments
This solution has no attachment
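The escalation rule for DTag correctable errors (first event treated as soft, FRU swapped on the third failure on any one CIC) can be sketched as a simple per-CIC counter. This is only an illustration under stated assumptions: the event tuples, the function name `recommend_actions`, and the action strings are hypothetical, failures are counted per CIC rather than per DTag SRAM for simplicity, and the handling of the second failure is an assumption because the text does not specify it.

```python
from collections import Counter

SWAP_THRESHOLD = 3  # the text says to swap on the third failure on any one CIC


def recommend_actions(events):
    """Return one recommendation per CE recordstop event.

    `events` is a chronological list of (cic, implicated_fru) tuples,
    a hypothetical structure invented for this sketch.
    """
    failures_per_cic = Counter()
    actions = []
    for cic, fru in events:
        failures_per_cic[cic] += 1
        count = failures_per_cic[cic]
        if count >= SWAP_THRESHOLD:
            actions.append(f"swap {fru}")
        elif count == 1:
            actions.append("soft error, no action")
        else:
            # Assumption: the text does not say what to do on the second
            # failure, so this sketch only keeps counting.
            actions.append("keep monitoring")
    return actions


history = [("7.2", "SB7"), ("7.2", "SB7"), ("3.0", "SB3"), ("7.2", "SB7")]
print(recommend_actions(history))
```

In this example the third event on CIC 7.2 is the one that triggers the swap recommendation for SB7, while the single event on the other CIC is still treated as a soft error.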