Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Technical Instruction Sure

Solution 1010919.1: Understanding and Troubleshooting VCMON & CSTH-LITE errors
Previously Published As: 215064

Applies to:
Sun Fire 12K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire 15K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire E20K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire E25K Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire 6800 Server - Version Not Applicable to Not Applicable [Release N/A]
All Platforms

Goal

Description

Fix

Steps to Follow

CSTH-lite and VCMON were developed to determine whether a Uniboard from a small percentage of early-manufactured boards may be susceptible to a particular class of failure known as "socket attach". Later Uniboard versions were modified after quality analysis data showed a slightly higher than expected failure rate for these FRUs. The older boards are sometimes referred to as "TYCO" boards and the newer FRUs as "CINCH" boards.

A few of the "TYCO" boards had the "socket attach" problem, a rare condition that can cause a domain to pause due to system errors. Engineers analyzing the errors discovered that "TYCO" boards with "socket attach" issues displayed a condition known as vcore drift. The VCMON and CSTH-lite tests were produced to monitor this condition.

When "TYCO" boards are discovered with vcore drift errors, they can be proactively replaced to avoid unscheduled downtime. Since the majority of "TYCO" boards will never exhibit the "socket attach" condition, they can be left in place, which lowers the risks associated with the handling involved in mass board replacement and ensures the best possible domain stability for Sun's customers.

What is VCORE and VCORE drift?

What is CSTH-lite?

What is VCMON?

Frequently asked questions

A: CSTH-Lite should not be installed on a system controller running SMS 1.5.

Q: Why does it take several days before CSTH-Lite/VCMON can make a decision about Uniboards?
A: CSTH-lite/VCMON measures the difference in mean core voltages over time, so it requires a baseline before the software can compare voltages. This baseline takes approximately 48 hours to develop. This interval is often referred to as the "blackout and training phase". We recommend allowing at least 5 days beyond the baseline before examining the CSTH-lite/VCMON flags, to eliminate most transient voltage ramp detections.

Q: What do CSTH-lite/VCMON errors look like?

A: The domain logs of the affected system will contain errors like the following, depending on the VCMON/CSTH-lite implementation.

VCMON example:

    Nov 21 02:09:49 F48001 Domain-A.SC: [ID 776763 local1.warning] [VCM] Event: SF4800.VCMON.1.10.1459 CSN: 0404HH20A2 DomainID: A ADInfo: 1.VCMON.18.0 Time: Sun Nov 21 02:09:48 GMT+08:00 2004 FRU-List-Count: 1; FRU-PN: 5016178; FRU-SN: A16071; FRU-LOC: /N0/SB4/P1 Recommended-Action: Service action required

From a 12K - 25K server:

    Sep 11 08:20:03 2005 franklin-a-sc0 erd[2060]-A(): [11900 422941882287915 CRIT MessageReportingService.cc 385] [VCM] Event: SF15000-800C-SP CSN: 233a205a DomainID: A ADInfo: 1.VCM-DE.1 Time: Sun Sep 11 08:19:36 BST 2005 Recommended-Action: Service action required
    Sep 11 08:21:00 2005 franklin-a-sc0 erd[2060]-A(): [11900 422998961239628 CRIT MessageReportingService.cc 385] [VCM] Event: SF15000-800C-SP CSN: 233a205a DomainID: A ADInfo: 1.VCM-DE.1 Time: Sun Sep 11 08:19:36 BST 2005 Recommended-Action: Service action required
    Sep 11 15:58:09 2005 franklin-a-sc0 erd[2060]-A(): [11900 450427915740031 CRIT MessageReportingService.cc 385] [VCM] Event: SF15000-800C-SP CSN: 233a205a DomainID: A ADInfo: 1.VCM-DE.1 Time: Sun Sep 11 15:57:58 BST 2005 Recommended-Action: Service action required
    Sep 11 15:58:36 2005 franklin-a-sc0 erd[2060]-A(): [11900 450454286748070 CRIT MessageReportingService.cc 385] [VCM] Event: SF15000-800C-SP CSN: 233a205a DomainID: A ADInfo: 1.VCM-DE.1 Time: Sun Sep 11 15:57:58 BST 2005 Recommended-Action: Service action required

CHS will display (showchs -v -r SB16):

    Component : sb16
    Time Stamp : Tue Sep 27 12:50:54 BST 2005
    New Status : FAULTY
    Old Status : OK
    Event Code : 01000000 (unrecognized value)
    Initiator : SMS
    Message : 2 SF15000-800C-SP 1 1

CSTH-lite example:

    Aug 4 02:05:29 2004 tow21c0 csth-l[]: [WARNING csthl-detector] SB2 has been flagged by CSTHL (2081.425)
    Aug 4 02:05:29 2004 tow21c0 csthl[]: [WARNING csthl-detector] SB2 has been flagged by CSTHL (3081.421)
    Aug 4 02:05:29 2004 CSTHL Detector executed successfully

Q: How do I determine the affected board on a 12K/15K/E20K/E25K from a VCMON message?

A: The fault event associated with the message identifies the board. In the example below, the FRU component is SB3 and the resource component is SB3/P3 (CPU P3 on that board).

    version: 2
    class: list.suspects
    fault-diag-time: Sun Jan 1 15:16:15 MET 2006
    DE:
      scheme: diag-engine
      authority:
        product-id: SF15000
        chassis-id: 323MM25EF
        domain-id: C
      name: VCM-DE
      version: 1
    uuid: 322387a0-7ad1-11da-943a-8003ba1236b7
    code: SF15000-800C-SP
    list-sz: 1
    fault-events:
      version: 2
      class: fault.board.sb.cpu.socket
      fault-diag-time: Sun Jan 1 15:16:15 MET 2006
      DE:
        scheme: diag-engine
        authority:
          product-id: SF15000
          chassis-id: 323MM25EF
          domain-id: C
        name: VCM-DE
        version: 1
      ENA-list-sz: 0
      FRU:
        scheme: sf-hc
        part: 5016178
        serial: A15390
        authority:
          product-id: SF15000
          chassis-id: 323MM25EF
          domain-id: C
        component: SB3
      resource:
        scheme: sf-hc
        part: 5016178
        serial: A15390
        authority:
          product-id: SF15000
          chassis-id: 323MM25EF
          domain-id: C
        component: SB3/P3
      certainty: 100
      event-specific-data:
        vcmonIndex: 7.995333

Q: If my system gets an error what should I do?

A:

Prior to 5.19.0

Prior to 5.19.0, CSTH-lite and VCMON only created warning messages. These messages predict a possible failure; they do not mean that anything in the system is currently failing. If a message is displayed more than 2 times in 24 hours, it should not be ignored. Please contact your service provider and have the errors examined for root cause. Upgrading to the most current SC firmware is also strongly encouraged, as several issues that caused false VCMON warnings have been fixed.
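
The "more than 2 times in 24 hours" guidance can be checked mechanically against a saved platform log. The following is a minimal sketch, not a Sun-supplied tool. It assumes log lines in the 12K - 25K format shown earlier (a leading timestamp that includes the year, and English month names), and it counts every logged [VCM] line, including repeats of the same event time. The function names parse_vcm_line and exceeds_daily_threshold are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Leading timestamp with a year, e.g. "Sep 11 08:20:03 2005" (12K - 25K format).
TS_RE = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s")
# Fields that follow the [VCM] tag in the sample messages.
EVENT_RE = re.compile(r"\[VCM\] Event: (\S+) CSN: (\S+) DomainID: (\S+)")


def parse_vcm_line(line):
    """Return (logged_at, event, csn, domain), or None if not a [VCM] line."""
    ts = TS_RE.match(line)
    ev = EVENT_RE.search(line)
    if not ts or not ev:
        return None
    # Collapse repeated spaces so "Aug  4" and "Aug 4" parse the same way.
    stamp = " ".join(ts.group(1).split())
    logged_at = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
    return logged_at, ev.group(1), ev.group(2), ev.group(3)


def exceeds_daily_threshold(timestamps, limit=2, window=timedelta(hours=24)):
    """True if more than `limit` timestamps fall inside any single `window`."""
    ordered = sorted(timestamps)
    # limit + 1 messages fit in one window when the first and last of any
    # run of limit + 1 consecutive messages are at most `window` apart.
    for i in range(len(ordered) - limit):
        if ordered[i + limit] - ordered[i] <= window:
            return True
    return False
```

Feeding the four 12K - 25K sample lines above through parse_vcm_line and passing the resulting timestamps to exceeds_daily_threshold would flag the domain, since all four were logged on the same day.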
SMS 1.5 or below

Please upgrade to SMS 1.6.

5.19.0 or higher and SMS 1.6

VCMON will disable the CPU by marking the CPU's Component Health Status (CHS). The CPU is not disabled immediately; it is disabled at the next POST. If a CPU is disabled once in 6 months, it can be re-enabled by resetting the CHS. This may require a service mode password, depending on the platform type and SC firmware version. If the same CPU is disabled more than 2 times in 6 months, contact your service provider and have the errors examined for root cause. Upgrading to the most current SC firmware is also strongly encouraged, as several issues that caused false VCMON CHS disables have been fixed.
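
The CHS guidance above amounts to a small decision rule, sketched below as an illustration only. It assumes "6 months" can be approximated as 183 days and that the caller supplies the times at which one CPU was disabled. The function and constant names are hypothetical. The article does not say what to do for exactly two disables, so the sketch reports that case as unspecified instead of guessing.

```python
from datetime import datetime, timedelta

SIX_MONTHS = timedelta(days=183)  # assumption: "6 months" taken as 183 days

NO_ACTION = "no recent disable"
RESET_CHS = "re-enable by resetting the CHS"
CONTACT_SERVICE = "contact service provider for root cause"
UNSPECIFIED = "exactly 2 disables: not covered by the article"


def recommend_action(disable_times, now):
    """Map one CPU's CHS disable history to the article's guidance."""
    # Keep only disables that happened in the 6 months before `now`.
    recent = [t for t in disable_times
              if timedelta(0) <= now - t <= SIX_MONTHS]
    count = len(recent)
    if count == 0:
        return NO_ACTION
    if count == 1:
        return RESET_CHS
    if count > 2:
        return CONTACT_SERVICE
    return UNSPECIFIED
```

For example, a CPU disabled three times in the preceding three months would map to CONTACT_SERVICE, while a single disable in that period would map to RESET_CHS.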
Attachments

This solution has no attachment