Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1010919.1
Update Date:2011-04-28
Keywords:

Solution Type  Technical Instruction Sure

Solution  1010919.1 :   Understanding and Troubleshooting VCMON & CSTH-LITE errors  


Related Items
  • Sun Fire E6900 Server
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 6800 Server
  • Sun Fire 3800 Server
  • Sun Fire E4900 Server
  • Sun Fire 12K Server
  • Sun Fire 4800 Server
  • Sun Fire V1280 Server
  • Sun Fire E2900 Server
  • Sun Fire 15K Server
  • Sun Fire 4810 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange V and Netra Servers

PreviouslyPublishedAs
215064


Applies to:

Sun Fire 6800 Server
Sun Fire 3800 Server
Sun Fire V1280 Server
Sun Fire 4800 Server
Sun Fire 4810 Server
All Platforms

Goal

Description
This document is about understanding and troubleshooting VCMON & CSTH-LITE errors.

Solution

Steps to Follow
Troubleshooting Steps

CSTH-lite and VCMON were developed to determine whether a Uniboard from a small percentage of early-manufactured boards is susceptible to a particular class of failure known as "socket attach". Later Uniboard versions were modified after quality analysis data showed a slightly higher than expected failure rate for these FRUs. The older boards are sometimes referred to as "TYCO" boards and the newer FRUs as "CINCH" boards. A few of the "TYCO" boards had the "socket attach" problem, a rare condition that can cause a domain to pause due to system errors. Engineers analyzing the errors discovered that "TYCO" boards with "socket attach" issues displayed a condition known as VCORE drift. The VCMON and CSTH-lite tests were produced to monitor this condition. When "TYCO" boards are found with VCORE drift errors, they can be proactively replaced to avoid unscheduled downtime. Since the majority of "TYCO" boards will never exhibit the "socket attach" condition, they can be left in place, which lowers the risks associated with the handling involved in mass board replacement and ensures the best possible domain stability for Sun's customers.

What is VCORE and VCORE drift?
VCORE is the voltage that powers the CPU core of UltraSPARC III processors. There is a separate power supply for every CPU on a Uniboard.
VCORE drift is a condition in which the Uniboard VCORE (CPU core power) changes over time. Boards with "socket attach" problems exhibit VCORE drift because the resistance of the processor attach mechanism increases. The majority of VCMON/CSTH-lite errors can be root-caused to Uniboards with "socket attach" issues. If, however, VCORE drift occurs on a "CINCH" Uniboard, the error events need to be examined to determine their root cause.

What is CSTH-lite?
CSTH stands for Continuous System Telemetry Harness. CSTH uses the Sun system hardware and software sensors (which monitor the system) to provide continuous streams of environmental data about the current state of the system. CSTH-Lite is specifically designed to proactively detect potential CPU core voltage (VCORE) drift conditions on system Uniboards. CSTH-Lite functionality officially became part of SMS 1.5 (August 2005) and FW 5.18 (October 2004).

What is VCMON?
VCMON is the current version of the earlier CSTH-lite packages, which were integrated into SMS 1.5 (August 2005) and FW 5.18 (October 2004). Sun recommends removing CSTH-lite monitoring and upgrading the System Controller firmware on 3800/48x0/6800 systems. Patch ID 114525-01, which can be retrieved from SunSolve, updates the System Controller firmware to take advantage of VCMON.

Frequently asked questions
Q: Can I run CSTH-Lite with SMS 1.5?

A: CSTH-Lite should not be installed on a system controller running SMS 1.5.

Q: Why does it take several days before CSTH-Lite/VCMON can make a decision about Uniboards?

A: CSTH-lite/VCMON measures the difference in mean core voltages over time. The software therefore needs a baseline before it can compare voltages. This baseline takes approximately 48 hours to develop, an interval often referred to as the "blackout and training phase". We recommend allowing at least 5 days beyond the baseline before examining the CSTH-lite/VCMON flags, to eliminate most transient voltage ramp detections.
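
The baseline-then-compare idea can be illustrated with a minimal sketch. The actual CSTH-lite/VCMON algorithm is not described in this document, so everything below is an assumption made for illustration: the function name mean_drift, the sampling interval (one reading every 12 hours) and the voltage values are all hypothetical.

```python
from statistics import mean


def mean_drift(samples, baseline_count):
    """Return (mean after the baseline) - (mean of the baseline), in volts.

    samples        -- VCORE readings in time order (hypothetical data)
    baseline_count -- number of leading readings that form the baseline
    """
    if baseline_count <= 0 or baseline_count >= len(samples):
        raise ValueError("need readings both inside and after the baseline")
    baseline = mean(samples[:baseline_count])
    current = mean(samples[baseline_count:])
    return current - baseline


# Hypothetical readings, one every 12 hours: the first 4 cover the
# 48-hour baseline, the remaining 4 show a slow upward ramp.
readings = [1.50, 1.50, 1.50, 1.50, 1.50, 1.54, 1.58, 1.62]
drift = mean_drift(readings, baseline_count=4)
print(f"mean drift relative to baseline: {drift:+.3f} V")
```

Comparing means rather than single readings is what makes a waiting period necessary: a short transient barely moves the mean, while a sustained ramp does.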

Q: What do CSTH-lite/VCMON errors look like?

A: The domain logs of the affected system will contain errors like these, depending on the VCMON/CSTH-lite implementation.

VCMON example:
From a 4800-6900 Server:
Nov 21 02:09:49 F48001 Domain-A.SC: [ID 776763 local1.warning] [VCM] Event: SF4800.VCMON.1.10.1459
CSN: 0404HH20A2 DomainID: A ADInfo: 1.VCMON.18.0
Time: Sun Nov 21 02:09:48 GMT+08:00 2004
FRU-List-Count: 1; FRU-PN: 5016178; FRU-SN: A16071; FRU-LOC: /N0/SB4/P1
Recommended-Action: Service action required

From a 12K - 25K server:
Sep 11 08:20:03 2005 franklin-a-sc0 erd[2060]-A(): [11900 422941882287915 CRIT MessageReportingService.cc 385] [VCM] Event: SF15000-800C-SP CSN: 233a205a DomainID: A ADInfo: 1.VCM-DE.1 Time: Sun Sep 11 08:19:36 BST 2005 Recommended-Action: Service action required
Sep 11 08:21:00 2005 franklin-a-sc0 erd[2060]-A(): [11900 422998961239628 CRIT MessageReportingService.cc 385] [VCM] Event: SF15000-800C-SP CSN: 233a205a DomainID: A ADInfo: 1.VCM-DE.1 Time: Sun Sep 11 08:19:36 BST 2005 Recommended-Action: Service action required
Sep 11 15:58:09 2005 franklin-a-sc0 erd[2060]-A(): [11900 450427915740031 CRIT MessageReportingService.cc 385] [VCM] Event: SF15000-800C-SP CSN: 233a205a DomainID: A ADInfo: 1.VCM-DE.1 Time: Sun Sep 11 15:57:58 BST 2005 Recommended-Action: Service action required
Sep 11 15:58:36 2005 franklin-a-sc0 erd[2060]-A(): [11900 450454286748070 CRIT MessageReportingService.cc 385] [VCM] Event: SF15000-800C-SP CSN: 233a205a DomainID: A ADInfo: 1.VCM-DE.1 Time: Sun Sep 11 15:57:58 BST 2005 Recommended-Action: Service action required
CHS will display (showchs -v -r SB16):
Component : sb16
Time Stamp : Tue Sep 27 12:50:54 BST 2005
New Status : FAULTY
Old Status : OK
Event Code : 01000000 (unrecognized value)
Initiator : SMS
Message : 2 SF15000-800C-SP 1 1
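
When many such messages have to be reviewed, the relevant fields can be pulled out with a short script. The following is a sketch only: the regular expressions are derived from the sample messages above, not from a published message format, and the helper names parse_vcm_message and distinct_events are hypothetical. Treating log lines that share the same event code, domain and Time field as one event is also an assumption.

```python
import re

# Patterns follow the field labels visible in the sample messages.
EVENT_RE = re.compile(r"\[VCM\] Event: (\S+)")
DOMAIN_RE = re.compile(r"DomainID: (\S+)")
FRU_LOC_RE = re.compile(r"FRU-LOC: (\S+)")
TIME_RE = re.compile(r"Time: (.+?)(?= (?:FRU-List-Count|Recommended-Action):|$)")


def parse_vcm_message(text):
    """Extract event code, domain, event time and FRU location (if any)
    from one VCMON message; the message may span several lines."""
    flat = " ".join(text.split())  # collapse line breaks and extra spaces

    def first(pattern):
        match = pattern.search(flat)
        return match.group(1) if match else None

    return {
        "event": first(EVENT_RE),
        "domain": first(DOMAIN_RE),
        "time": first(TIME_RE),
        "fru_loc": first(FRU_LOC_RE),
    }


def distinct_events(messages):
    """Collapse messages that repeat the same event, domain and time."""
    seen = {}
    for message in messages:
        info = parse_vcm_message(message)
        seen.setdefault((info["event"], info["domain"], info["time"]), info)
    return list(seen.values())


midrange = """Nov 21 02:09:49 F48001 Domain-A.SC: [ID 776763 local1.warning] [VCM] Event: SF4800.VCMON.1.10.1459
CSN: 0404HH20A2 DomainID: A ADInfo: 1.VCMON.18.0
Time: Sun Nov 21 02:09:48 GMT+08:00 2004
FRU-List-Count: 1; FRU-PN: 5016178; FRU-SN: A16071; FRU-LOC: /N0/SB4/P1
Recommended-Action: Service action required"""

print(parse_vcm_message(midrange)["fru_loc"])
```

On the midrange message the FRU-LOC field identifies the board and processor directly. The 12K - 25K messages carry no FRU-LOC field, so fru_loc is None for them.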

CSTH-lite example:

Aug 4 02:05:29 2004 tow21c0 csth-l[]: [WARNING csthl-detector] SB2 has been flagged by CSTHL (2081.425)
Aug 4 02:05:29 2004 tow21c0 csthl[]: [WARNING csthl-detector] SB2 has been flagged by CSTHL (3081.421)
Aug 4 02:05:29 2004 CSTHL Detector executed successfully
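
A short sketch for listing the boards flagged in a CSTH-lite log follows. The pattern is based only on the sample lines above, and the function name flagged_boards is hypothetical. The meaning of the number in parentheses is not documented here, so the sketch returns it unchanged without interpreting it.

```python
import re

# \s+ before "CSTHL" tolerates a line break in the middle of the message.
FLAG_RE = re.compile(
    r"\[WARNING csthl-detector\] (SB\d+) has been flagged by\s+CSTHL \(([\d.]+)\)"
)


def flagged_boards(log_text):
    """Return (board, value) pairs for every CSTH-lite flag in the text."""
    return [(m.group(1), float(m.group(2))) for m in FLAG_RE.finditer(log_text)]


sample = """Aug 4 02:05:29 2004 tow21c0 csth-l[]: [WARNING csthl-detector] SB2 has been flagged by
CSTHL (2081.425)
Aug 4 02:05:29 2004 tow21c0 csthl[]: [WARNING csthl-detector] SB2 has been flagged by
CSTHL (3081.421)
Aug 4 02:05:29 2004 CSTHL Detector executed successfully"""

for board, value in flagged_boards(sample):
    print(board, value)
```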

Q: How do I determine the affected board on a 12K/15K/E20K/E25K from a VCMON message?
A: The message only includes the domain letter, not the affected board. "showlogs -p e" will display the detailed information (example from a different system):

version: 2
class: list.suspects
fault-diag-time: Sun Jan 1 15:16:15 MET 2006
DE:
  scheme: diag-engine
  authority:
    product-id: SF15000
    chassis-id: 323MM25EF
    domain-id: C
  name: VCM-DE
  version: 1
uuid: 322387a0-7ad1-11da-943a-8003ba1236b7
code: SF15000-800C-SP
list-sz: 1
fault-events:
  version: 2
  class: fault.board.sb.cpu.socket
  fault-diag-time: Sun Jan 1 15:16:15 MET 2006
  DE:
    scheme: diag-engine
    authority:
      product-id: SF15000
      chassis-id: 323MM25EF
      domain-id: C
    name: VCM-DE
    version: 1
  ENA-list-sz: 0
  FRU:
    scheme: sf-hc
    part: 5016178
    serial: A15390
    authority:
      product-id: SF15000
      chassis-id: 323MM25EF
      domain-id: C
    component: SB3
  resource:
    scheme: sf-hc
    part: 5016178
    serial: A15390
    authority:
      product-id: SF15000
      chassis-id: 323MM25EF
      domain-id: C
    component: SB3/P3
  certainty: 100
  event-specific-data:
    vcmonIndex: 7.995333
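
In this output the affected board is the "component" under "FRU:" (SB3) and the affected processor is the "component" under "resource:" (SB3/P3). A minimal sketch for extracting them from saved output follows. It assumes the "key: value" layout shown in the example and ignores indentation; the function name find_suspects is hypothetical, and the sketch has not been validated against other SMS versions.

```python
def find_suspects(report_text):
    """Return one dict per fault event: its class, the FRU and resource
    components, the certainty and the vcmonIndex (when present)."""
    suspects = []
    current = None   # fault event being filled in
    section = None   # "FRU" or "resource" while inside that block
    for raw in report_text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "class" and value.startswith("fault."):
            current = {"class": value}
            suspects.append(current)
            section = None
        elif current is None:
            continue  # still in the list.suspects header
        elif key in ("FRU", "resource") and not value:
            section = key
        elif key == "component" and section:
            current[section] = value
        elif key == "certainty":
            current[key] = int(value)
        elif key == "vcmonIndex":
            current[key] = float(value)
    return suspects


# Abbreviated copy of the example output above.
sample = """class: list.suspects
fault-events:
class: fault.board.sb.cpu.socket
FRU:
scheme: sf-hc
component: SB3
resource:
scheme: sf-hc
component: SB3/P3
certainty: 100
event-specific-data:
vcmonIndex: 7.995333"""

for suspect in find_suspects(sample):
    print(suspect["FRU"], suspect["resource"], suspect["certainty"])
```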

Q: If my system gets an error what should I do?

A:

Prior to 5.19.0

Prior to 5.19.0, CSTH-lite and VCMON only created warning messages. CSTH-lite and VCMON messages predict a possible failure; they do not mean that anything in the system is currently failing. If a message is displayed more than 2 times in 24 hours, it should not be ignored. Please contact your service provider and have the errors examined for root cause. Upgrading to the most current SC firmware is also strongly encouraged, as several issues that caused false VCMON warnings have been fixed.


SMS 1.5 or below - Please upgrade to SMS 1.6.


5.19.0 or higher and SMS 1.6

VCMON will disable the CPU by marking the CPU's Component Health Status (CHS). The CPU is not disabled immediately; it is disabled at the next POST. If a CPU is disabled once in 6 months, it can be re-enabled by resetting the CHS. This may require a service mode password, depending on the platform type and SC firmware version. If the same CPU is disabled more than 2 times in 6 months, contact your service provider and have the errors examined for root cause. Upgrading to the most current SC firmware is also strongly encouraged, as several issues that caused false VCMON CHS disables have been fixed.
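
The two thresholds above (more than 2 warnings in 24 hours, more than 2 disables of the same CPU in 6 months) are both "more than N events inside a time window" checks. A minimal sketch of such a check follows. The function name exceeds_threshold is hypothetical, 6 months is approximated as 183 days, and the timestamps are assumed to be those of distinct events rather than of repeated log lines for one event.

```python
from datetime import datetime, timedelta


def exceeds_threshold(timestamps, window, max_count=2):
    """Return True if more than max_count timestamps fall inside any
    period of length `window` (sliding-window check)."""
    times = sorted(timestamps)
    start = 0
    for end, current in enumerate(times):
        # Drop events that are older than the window relative to `current`.
        while current - times[start] > window:
            start += 1
        if end - start + 1 > max_count:
            return True
    return False


DAY = timedelta(hours=24)
SIX_MONTHS = timedelta(days=183)  # approximation, not an official value

# Hypothetical event times for one board.
first = datetime(2005, 9, 11, 8, 19, 36)
warnings = [first, first + timedelta(hours=7, minutes=38), first + timedelta(hours=20)]

print(exceeds_threshold(warnings, DAY))       # three warnings inside 24 hours
print(exceeds_threshold(warnings[:2], DAY))   # only two warnings
```

The same function covers the 6-month rule by passing the CHS disable times of a single CPU together with SIX_MONTHS as the window.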




The original process for Cinch failures was not to replace the board on the first failure. The process for Tyco boards was to replace them proactively as they were flagged. At this point, even Tyco boards (if any are still in service) are probably beyond the point where a processor attach error would be seen.

The SSH breakdown for System boards contains information on which versions contain Cinch and Tyco sockets.

DO NOT submit a CIC for replacement uniboards.


Previously Published As
79833

Change History
Publishing Information
Date: 2007-01-24
User Name: 114416

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.