Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type Technical Instruction Sure

Solution 1010757.1 : Sun Fire[TM] 12K/15K Servers: Voltage Error on CPU Leads to Blacklisting the PROCPAIR
PreviouslyPublishedAs 214857

Description

A voltage problem is detected on a single CPU; ASR (Automatic System Recovery) blacklists its PROCPAIR; and the domain is reset. Blacklisting the PROCPAIR means that two CPUs and their memory are disabled and removed from the domain configuration.

This document explains why esmd (the Event Status Monitoring Daemon) disables and removes two CPUs as the result of a voltage problem on a single CPU. This behavior might seem incorrect, but the recovery action is exactly as it was designed to be.

Steps to Follow

The following is an example of a voltage fault that might be logged in the /var/opt/SUNWSMS/SMS/adm/platform/messages file on the System Controller (SC):

    Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [1919 216320102467983 ERR DetectorV.cc609] A low voltage or power supply has been detected on Core3, located on CPU at SB8. The voltage detected is 0.02v; should be 1.31v to 1.47v. PROCPAIR at SB8/PP1 is being removed from the domain and powered off. Check all hardware for the cause.
    Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [0 216320149848762 NOTICE SysControl.cc 5296] Component PROCPAIR at SB8/PP1 has been blacklisted
    Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [1930 216320225206159 NOTICE SysControl.cc 6113] PROCPAIR at SB8/PP1 has been powered off: ecode=0

In the preceding error message, Core3 (CPU3) on SB8 has a low voltage.

***************************************************************
* NOTE: The voltage tolerances are defined on the SC in the   *
*       /etc/opt/SUNWSMS/SMS/config/esmd_tuning.txt file.     *
*       DO NOT EDIT THIS FILE!                                *
***************************************************************

In the preceding error message, esmd has blacklisted "Component PROCPAIR at SB8/PP1."

    PP0 = PROCPAIR0 = CPU0 and CPU1
    PP1 = PROCPAIR1 = CPU2 and CPU3

The voltage fault reported on only one CPU (CPU3) therefore results in two CPUs (CPU2 and CPU3) being removed from the domain configuration through blacklisting.
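The CPU-to-PROCPAIR relationship above can be sketched in a few lines of Python. This is only an illustration, not part of SMS or esmd: the function names are hypothetical, the regular expression is inferred from the single sample message shown in this article, and the four-CPU board layout is an assumption based on the PP0/PP1 list.

```python
import re
from typing import List, Optional

# From the article: PP0 = CPU0 and CPU1, PP1 = CPU2 and CPU3.
CPUS_PER_PROCPAIR = 2
# Assumption: a System Board carries CPU0..CPU3, the only CPUs the article names.
CPUS_PER_BOARD = 4

# Hypothetical pattern, inferred from the one sample esmd message in this article.
FAULT_RE = re.compile(
    r"detected on Core(?P<core>\d+), located on CPU at SB(?P<board>\d+)"
)


def procpair_for_cpu(cpu: int) -> int:
    """Return the PROCPAIR number (0 or 1) that contains the given CPU."""
    if not 0 <= cpu < CPUS_PER_BOARD:
        raise ValueError(f"CPU number out of range: {cpu}")
    return cpu // CPUS_PER_PROCPAIR


def cpus_in_procpair(pp: int) -> List[int]:
    """Return the CPU numbers that belong to the given PROCPAIR."""
    first = pp * CPUS_PER_PROCPAIR
    return list(range(first, first + CPUS_PER_PROCPAIR))


def blacklisted_component(message: str) -> Optional[str]:
    """Derive the 'SBn/PPm' name from a voltage-fault message, or None."""
    match = FAULT_RE.search(message)
    if match is None:
        return None
    core = int(match.group("core"))
    board = int(match.group("board"))
    return f"SB{board}/PP{procpair_for_cpu(core)}"
```

Applied to the sample message, `blacklisted_component` yields `SB8/PP1`, and `cpus_in_procpair(1)` lists CPU2 and CPU3, matching the component that esmd reports as blacklisted.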
The decision to blacklist the whole PROCPAIR for the failure of a single CPU is a compromise between two differing forces placed upon POST with regard to availability. POST is the set of hardware tests executed against components before entering OBP. These tests confirm the hardware sanity of the components.

****************************************************************
*               Differing Forces made upon POST                *
****************************************************************
* FORCE 1  POST needs to be able to exclude faulty components  *
*          from the domain configuration so that future        *
*          failures don't occur.                               *
*                                                              *
* FORCE 2  POST should allow as many resources as possible to  *
*          be configured into the domain to minimize           *
*          domain impact as much as possible.                  *
****************************************************************

It is important to note that a voltage fault reported by a single CPU might not actually be limited to that CPU. The same voltage issue could also be affecting its "related" components, such as the BBC ASIC, the DCDS ASIC, and so on. Ultimately, the fault could be the result of a power distribution issue, representative of a larger issue on the board. The actual reason for the voltage fault might not be fully known, and the number of components affected by it might also be unknown.

Because of this "unknown" factor, there are two approaches for dealing with voltage issues on a board: the Conservative Approach and the Aggressive Approach. These two approaches relate directly to the two POST forces described previously:

************************************************************************
* Conservative: Disable the entire System Board. Now, any future       *
*               outage is prevented if "related" components are        *
*               affected by this voltage problem. FORCE 1.             *
*                                                                      *
* Aggressive:   Disable only the component reporting the voltage       *
*               problem. This leaves as many resources as possible     *
*               available to the domain, but there is some             *
*               risk associated with this approach. FORCE 2.           *
************************************************************************

Arguments can be made for each approach, and neither argument is incorrect. One customer might believe that disabling the whole System Board is the best decision; another customer might believe that it is absolutely unacceptable to lose that many resources. Neither customer is incorrect. A compromise is necessary to meet the needs of both forces.

Here is what esmd does as a compromise to meet these different POST forces:

* Force 1 results in the exclusion of faulty components from the domain configuration. This is done by disabling both the CPU that reports the voltage problem and its PROCPAIR partner. Isolating the whole PROCPAIR prevents a problem on a "CPU-related" component, such as the BBC ASIC or the DCDS ASIC, from causing further incidents. The CPUs in a PROCPAIR share components such as these ASICs. Thus, the PROCPAIR is a logical place to isolate.

* Force 2 results in the configuration of as many resources as possible into the domain configuration. The remaining PROCPAIR is allowed into the domain configuration so that the domain can function.

For a single-board domain, the Conservative Approach leaves the domain down until service can take place. The Aggressive Approach leaves an exposure if a "related" component has a voltage issue of its own. The compromise configures the domain with as minimal a resource impact as possible while also providing as much error isolation as possible. Ultimately, both POST forces are satisfied: the domain remains available, and the fault is isolated.
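The trade-off between the three options can be made concrete with a small Python model. It is a toy sketch, not esmd's implementation: the function names and approach labels are hypothetical, and it assumes a System Board with four CPUs paired as PP0 = CPU0/CPU1 and PP1 = CPU2/CPU3, as listed earlier in this article.

```python
# Assumption: four CPUs per System Board, paired as PP0 = {0, 1}, PP1 = {2, 3}.
BOARD_CPUS = frozenset({0, 1, 2, 3})


def disabled_cpus(faulty_cpu: int, approach: str) -> set:
    """Return the CPUs removed from the domain under the given approach."""
    if faulty_cpu not in BOARD_CPUS:
        raise ValueError(f"unknown CPU: {faulty_cpu}")
    if approach == "conservative":
        # Disable the entire System Board (Force 1 only).
        return set(BOARD_CPUS)
    if approach == "aggressive":
        # Disable only the CPU that reported the fault (Force 2 only).
        return {faulty_cpu}
    if approach == "compromise":
        # Disable the reporting CPU and its PROCPAIR partner.
        first = (faulty_cpu // 2) * 2
        return {first, first + 1}
    raise ValueError(f"unknown approach: {approach}")


def remaining_cpus(faulty_cpu: int, approach: str) -> set:
    """Return the CPUs still available to the domain."""
    return set(BOARD_CPUS) - disabled_cpus(faulty_cpu, approach)


if __name__ == "__main__":
    for approach in ("conservative", "aggressive", "compromise"):
        lost = sorted(disabled_cpus(3, approach))
        kept = sorted(remaining_cpus(3, approach))
        print(f"{approach:12s} disabled={lost} remaining={kept}")
```

For a fault on CPU3, the model shows the conservative choice leaving no CPUs on the board, the aggressive choice leaving CPU2 in service next to the faulted CPU, and the compromise leaving PP0 (CPU0 and CPU1) to keep a single-board domain running.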
This compromise is not perfect, but it is an appropriate way to isolate faulty components from the configuration and prevent future outages, while keeping as many resources as possible available for domain production until a maintenance window is available to resolve the issue.

Product

Sun Fire 15K Server
Sun Fire 12K Server

Internal Comments

This article is a result of KGap request ID 263, a request made to clarify why the PROCPAIR blacklist behavior exists.

A false over-voltage issue existed in SMS 1.2 and 1.3 software: Sun Alert 53625 "CPU0/CPU1 May Be Disabled on Sun Fire 12K/15K System Boards Resulting in Domain Interruption"

12k, 15k, 12K, 15K, esmd, voltage, procpair, PROCPAIR, blacklisted, ASR

Previously Published As 76240

Change History

Date: 2005-09-27
User Name: 18392
Action: Update Canceled
Comment: *** Restored Published Content *** fixed techgroup
Version: 0

Date: 2005-09-27
User Name: 18392
Action: Update Started
Comment: fixing techgroup
Version: 0

Date: 2005-06-27
User Name: 95826
Action: Update Canceled
Comment: *** Restored Published Content *** canceling update as updater is no longer within Sun
Version: 0

Date: 2005-06-27
User Name: 95826
Action: Reassign
Comment: reassigning document as updater is no longer within Sun
Version: 0

Date: 2005-01-09
User Name: 132461
Action: Update Started
Comment: spl/fmt/upd
Version: 0

Date: 2004-05-24
User Name: c8840
Action: Approved
Comment: This document was edited and is now ready for publication.
Version: 0

Date: 2004-05-19
User Name: c8840
Action: Accepted
Comment:
Version: 0

Date: 2004-05-18
User Name: 101037
Action: Approved
Comment: Good doc
Version: 0

Date: 2004-05-18
User Name: 101037
Action: Accepted
Comment:
Version: 0

Date: 2004-05-18
User Name: 103287
Action: Approved
Comment: Please review. This doc was written to help explain why procpair's are disabled for the voltage event of a single cpu. Nothing more than information... Joshua Freeman, PTS Server ESG
Version: 0

Date: 2004-05-17
User Name: 103287
Action: Created
Comment:
Version: 0

Product_uuid

29e4659c-0a18-11d6-9fa1-e67bbc033df8|Sun Fire 15K Server
077fd4c5-df8f-4320-ad69-7d01603a674d|Sun Fire 12K Server

Attachments

This solution has no attachment