Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type Technical Instruction Sure Solution 1010919.1 : Understanding and Troubleshooting VCMON & CSTH-LITE errors
Previously Published As 215064
Applies to:
Sun Fire 6800 Server
Sun Fire 3800 Server
Sun Fire V1280 Server
Sun Fire 4800 Server
Sun Fire 4810 Server
All Platforms

Goal

Description
This document is about understanding and troubleshooting VCMON & CSTH-LITE errors.

Solution

Steps to Follow

Troubleshooting Steps

CSTH-lite and VCMON were developed as tools to determine whether a small percentage of early manufactured Uniboards may be susceptible to a particular class of failure known as "socket attach". Later version Uniboards were modified after quality analysis data exhibited a slightly higher than expected failure rate of these FRUs. The older boards are sometimes referred to as "TYCO" boards and the newer FRUs are sometimes referred to as "CINCH" boards.

A few of the "TYCO" boards had the "socket attach" problem, a rare condition that can cause a domain to pause due to system errors. Engineers analyzing the errors discovered that "TYCO" boards with "socket attach" issues displayed a condition known as vcore drift. The VCMON and CSTH-lite tests were produced to monitor the condition. When "TYCO" boards are discovered with vcore drift errors, they can be proactively replaced to avoid unscheduled downtime. Since the majority of "TYCO" boards will never exhibit the "socket attach" condition, they can be left in place, which lowers the risks associated with the handling involved in mass board replacement and ensures the best possible domain stability for Sun's customers.

What is VCORE and VCORE drift?

What is CSTH-lite?

What is VCMON?

Frequently asked questions

A: CSTH-lite should not be installed on a system controller running SMS 1.5.

Q: Why does it take several days before CSTH-lite/VCMON can make a decision about Uniboards?
A: CSTH-lite/VCMON measures the difference in mean core voltages over time. Therefore, CSTH-lite/VCMON requires a baseline before the software can compare voltages. This baseline takes approximately 48 hours to develop.
This interval is often referred to as the "blackout and training phase". We recommend allowing at least 5 days beyond the baseline before examining the CSTH-lite/VCMON flags, to eliminate most transient voltage ramp detections.

Q: What do CSTH-lite/VCMON errors look like?
A: The domain logs of the affected system will have errors like these, depending on the VCMON/CSTH-lite implementation.

VCMON example:

From a 12K - 25K server:

CHS will display (showchs -v -r SB16):

CSTH-lite example:
Aug 4 02:05:29 2004 tow21c0 csth-l[]: [WARNING csthl-detector] SB2 has been flagged by

Q: How do I determine the affected board on a 12K/15K/E20K/E25K from a VCMON message?
version: 2

Q: If my system gets an error what should I do?
A:

Prior to 5.19.0
Prior to 5.19.0, CSTH-lite and VCMON only created warning messages. CSTH-lite and VCMON messages predict a possible failure. They do not mean that anything in the system is currently failing. If a message is displayed more than 2 times in 24 hours, it should not be ignored. Please contact your service provider and have the errors examined for root cause. Upgrading to the most current SC firmware is also strongly encouraged, as several issues that caused false VCMON warnings have been fixed.

SMS 1.5 or below
Please upgrade to SMS 1.6.

5.19.0 or higher and SMS 1.6
VCMON will disable the CPU by marking the CPU's Component Health Status (CHS). The CPU is not disabled immediately; it is disabled at the next POST. If a CPU is disabled once in 6 months, it can be re-enabled by resetting the CHS. This may require a service mode password, depending on the platform type and SC firmware version. If the same CPU is disabled more than 2 times in 6 months, contact your service provider and have the errors examined for root cause. Upgrading to the most current SC firmware is also strongly encouraged, as several issues that caused false VCMON CHS disables have been fixed.

The original process for Cinch failures was to not replace on first failure.
The process for addressing TYCO boards was to proactively replace them as they are flagged. At this point, even Tyco boards (if any are still in service) are probably beyond the point where a processor attach error would be seen. The SSH breakdown for system boards contains information on which versions contain Cinch and Tyco sockets. DO NOT submit a CIC for replacement Uniboards.

Previously Published As 79833

Change History
Publishing Information
Date: 2007-01-24
User Name: 114416

Attachments
This solution has no attachment
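
Example: checking the 24-hour rule

The FAQ above says that a CSTH-lite/VCMON warning displayed more than 2 times in 24 hours should not be ignored. The sketch below shows one way that rule could be checked against a list of warning timestamps. It is an illustration only and not part of CSTH-lite, VCMON, or SMS. The function name needs_escalation and its parameters are hypothetical, and it assumes the timestamps for one board have already been extracted from the domain logs.

```python
from datetime import datetime, timedelta

# Thresholds taken from the FAQ text: "more than 2 times in 24 hours".
WINDOW = timedelta(hours=24)
MAX_TOLERATED = 2


def needs_escalation(timestamps, window=WINDOW, max_tolerated=MAX_TOLERATED):
    """Return True if more than max_tolerated warnings fall within any
    span of length `window` (hypothetical helper, not a Sun tool)."""
    times = sorted(timestamps)
    start = 0
    for end, current in enumerate(times):
        # Slide the start of the window forward until it spans <= `window`.
        while current - times[start] > window:
            start += 1
        if end - start + 1 > max_tolerated:
            return True
    return False


if __name__ == "__main__":
    first = datetime(2004, 8, 4, 2, 5, 29)
    warnings = [first, first + timedelta(hours=3), first + timedelta(hours=9)]
    print(needs_escalation(warnings))
```

The same function could be reused for the 6-month CHS disable rule by passing a longer window, for example window=timedelta(days=182), if 6 months is taken to be roughly 182 days (an assumption; the source does not define the interval more precisely).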