Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution
1004845.1: Sun Fire[TM] Midrange and HighEnd Servers: Discussion of Component Health Status (CHS)
Previously Published As: 206790
Applies to:
Sun Fire 4800 Server
Sun Fire V1280 Server
Sun Fire 6800 Server
Sun Fire 12K Server
Sun Fire 15K Server
All Platforms

Goal

Description
This document provides information about the advantages of Component Health Status (CHS) on Sun Fire[TM] Midrange and HighEnd Servers. This includes a discussion on life before and after CHS. CHS is only available on the servers listed below:
CHS was first integrated in:
Solution

Steps to Follow

Philosophy of CHS: What does it do?
Auto-Diagnosis and Recovery (ADR or AVL) enhancements include features for ScApp and SMS that improve the platform's Availability - the ability for a system to recover and stay in production following hardware issues. The enhancements included are:
Component Health Status is the Auto-Diagnosis and Recovery mechanism used to prevent repeat failures. In very general terms, when the Diagnostic Engine determines that a hardware device is the root cause of an error, that component's "Health Status" is marked "faulty". POST has been enhanced to read the component health status at the very beginning of a POST cycle. Any components marked faulty are excluded from even being tested by POST. They are kept out of the domain configuration completely, which prevents repeat failures caused by the same component problem.

What happened prior to CHS?
Before the CHS enhancements, if a component errored while a domain was up (in Solaris[TM]) and the domain rebooted (perhaps a panic reboot), the next POST cycle would have to fail that component in order to prevent it from being configured back into the domain. The problem with requiring POST to fail the component is that POST simply does not detect all fault events all the time for all devices. In reality, many faulty components "pass" POST upon reboot or a similar event, and are allowed back into the domain configuration. In some cases, they error and panic again, severely limiting Availability.

Additionally, prior to CHS, a component that failed during a POST cycle was "removed" from the configuration only until a new POST cycle was executed. If that component happened to "pass" (more accurately, "not fail") the next POST cycle, it was back in the domain (to possibly fail again). For these two reasons, the behavior was not perfect and could be improved to truly isolate the offending hardware. CHS was that improvement.

Is the improvement better?
Absolutely. As stated above, before CHS, the limitations were:
1) POST had to fail a device to isolate it from the configuration.
2) POST testing may not fail the device consistently.
3) The process to completely isolate a faulty device was manual.
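The difference between the two behaviors described above can be illustrated with a minimal Python sketch. This is a simplified model, not Sun firmware or ScApp/SMS code: the `Component` class, its `chs` and `passes_post` fields, and the `post_cycle` function are all hypothetical names introduced only for illustration. Only the "faulty" state is handled, because that is the case this document describes for POST.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical model of a board; not a real ScApp/SMS structure."""
    name: str
    chs: str = "ok"            # assumed values: "ok", "suspect", "faulty"
    passes_post: bool = True   # whether this simulated POST run happens to pass it


def post_cycle(components, honor_chs):
    """Return the names of components configured into the domain.

    With honor_chs=False (pre-CHS model), only a POST failure keeps a
    component out. With honor_chs=True, components marked faulty are
    skipped before POST tests them at all.
    """
    configured = []
    for comp in components:
        if honor_chs and comp.chs == "faulty":
            continue  # excluded at the start of the cycle, never tested
        if comp.passes_post:
            configured.append(comp.name)
    return configured


boards = [
    Component("sb0"),
    # A previously diagnosed board whose fault this POST run does not detect.
    Component("sb2", chs="faulty", passes_post=True),
    Component("sb4", passes_post=False),
]

print(post_cycle(boards, honor_chs=False))  # ['sb0', 'sb2']
print(post_cycle(boards, honor_chs=True))   # ['sb0']
```

In the pre-CHS run, sb2 returns to the domain because POST did not fail it; with the health status honored, it stays out regardless of the POST result.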
CHS resolves all of these limitations. It takes immediate action to diagnose the fault, label the root cause suspect as faulty, and automatically prevent that device from future configuration in the domain. The status is recorded on the hardware itself, so even if the defective board moves to a different slot or server, it remains disabled by CHS. This increases availability because the error is prevented from repeating.

Safari Port Descriptor (SPD)
A data structure is created and initialized by POST for each safari port (CPU/IOC). The structure is located in each SBBC's Boot Bus SRAM. Typically, there are 2 SPDs per SBBC. For a CPU board, there would be 4 SPDs. For an I/O board, there would be 2 SPDs. SPDs contain the following information:
1. Port Status Summary {pass|fail|unknown|blacklisted}

Seeprom has 3 records in the Dynamic FRUID section (DFRUID) to support CHS:
There are 3 ways to change the Status records:
CHS commands:
Tunables:
CHS: Failure to record event
When the CHS of a component is changed, two records in different segments of DFRUID have to be updated:
Since these 2 segments are different, it is possible that there will be a power loss or ScApp/SMS reboot after the StatusCurrentR record is updated, but before the StatusEventR is updated. This is very rare, but if it does happen, the reason for the change will not be captured.

CHS's Three Component "states"
ScApp/SMS and POST can only "downgrade" the CHS of a component (mark faulty, suspect). A service engineer can manually "upgrade" or "downgrade" the CHS if necessary (mark ok, suspect, or faulty).

There is no relationship between seeprom-based CHS and ScApp/SMS based blacklisting (enablecomponent/disablecomponent). A change in CHS will not modify the ScApp/SMS based blacklists. A change in the ScApp/SMS blacklist will not modify CHS.

Location of CHS Recording
CHS records are in the seeprom chips of the following devices:
CPU chips do not have their own seeprom and hence do not have their own CHS. Rather, a "proxy record" is kept on the seeprom of the system board.

Fan Trays and PSUs only provide CHS records so that the SSE can manually tag them as bad, as a note to the repair depot or other service people. The system ignores the CHS status of PSUs and Fan Trays.

The CHS for RPs will only get set to "suspect" automatically by the system. It will never get set to "faulty".

Examples
See 1004879.1 for details on resetting the CHS status on Midrange servers, including the procedure for using setchs whether or not a service password is needed.

This is how to set the status of sb2 to faulty with the reason being "bad sb" on a Midrange server:
6800-sc0:SC> setchs -s faulty -r "bad sb" -c sb2

This is how to display CHS status on the platform:
6800-sc0:SC> showchs -b
Component Status

It is important to remember that CHS information is recorded to a component's seeprom. As long as that component remains in the system, the history of CHS events remains with it as well. Any time automatic or manual action is taken to change the component's CHS status, this history is updated to reflect the "new" event that relates to the component. If a component is replaced, the history leaves with the component.

This is how to display a particular component's CHS history (notice that the latest event in this history is the manual "FOO" event from the above command):
6800-sc0:SC[service]> showchs -v -c sb2
Component : /N0/SB2
Component : /N0/SB2
Component : /N0/SB2
Component : /N0/SB2

Internal Only NOTE: The last two CHS history records are from the Auto-Diagnosis engine. The message is an event code (SF6800.FAULT.ASIC.xxxxxxxx). This code represents a particular failure type, and can be looked up in the Sun System Handbook Auto Diagnosis and Recovery Fault Tables.
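The status rules stated earlier (ScApp/SMS and POST may only downgrade; a service engineer may set any state) and the history behavior described above can be combined in a small Python sketch. It is only a conceptual model under stated assumptions: the `Seeprom` class, the `set_chs` function, the `actor` values "auto" and "service", and the severity ordering are hypothetical, and the real DFRUID record layout is not represented.

```python
from dataclasses import dataclass, field

# Assumed ordering of the three states named in this document; higher is worse.
SEVERITY = {"ok": 0, "suspect": 1, "faulty": 2}


@dataclass
class Seeprom:
    """Hypothetical stand-in for the CHS data stored on a component."""
    status: str = "ok"
    history: list = field(default_factory=list)  # oldest event first


def set_chs(seeprom, new_status, reason, actor):
    """Apply a CHS change and append it to the component's history.

    actor is "auto" (models ScApp/SMS or POST) or "service" (models a
    service engineer). Automatic actors may not improve the status.
    """
    if new_status not in SEVERITY:
        raise ValueError(f"unknown status: {new_status}")
    is_upgrade = SEVERITY[new_status] < SEVERITY[seeprom.status]
    if actor == "auto" and is_upgrade:
        raise PermissionError("automatic actors may only downgrade CHS")
    seeprom.status = new_status
    seeprom.history.append((actor, new_status, reason))


sb2 = Seeprom()
set_chs(sb2, "suspect", "diagnosis event", actor="auto")
set_chs(sb2, "faulty", "bad sb", actor="service")

try:
    set_chs(sb2, "ok", "passed POST", actor="auto")  # not allowed
except PermissionError as err:
    print("rejected:", err)

set_chs(sb2, "ok", "good", actor="service")  # manual clear after repair
print(sb2.status)       # ok
print(sb2.history[-1])  # ('service', 'ok', 'good')
```

Because the history lives in the same object as the status, it travels with the modeled component, which mirrors the point that CHS history stays with the hardware rather than with the slot or server.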
As a result of CHS disabling a component, setkeyswitch operations will report:
6800-sc0:A> setkeyswitch on

The ScApp command showcomponent would also show the CHS status of the component (all subcomponents on the board are marked disabled as well):
6800-sc0:A> showcomponent sb2

This example shows how to clear CHS after replacing the bad component with a good one (again, use 1004879.1 for the full process):
sc1:SC> setchs -s OK -r "good" -c SB3/p0 <--- Clears CHS

Attachments
This solution has no attachment