Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type Technical Instruction Sure Solution 1003575.1 : Sun Fire[TM] F12K/F15K/E20K/E25K: Component Health Status (CHS) - Q and A
PreviouslyPublishedAs 205032

Applies to:
Sun Fire E20K Server - Version Not Applicable and later
Sun Fire 12K Server - Version Not Applicable and later
Sun Fire 15K Server - Version Not Applicable and later
Sun Fire E25K Server - Version Not Applicable and later
All Platforms

Goal
Sun Fire[TM] System Management Services (SMS) 1.4 introduced new features that improve the availability, serviceability, diagnosability, and recovery characteristics of Sun Fire[TM] 12K/15K systems.

Fix
Q. What causes the CHS status of a component to be set to "suspect"?
A. The "suspect" CHS status means that the Auto Diagnosis engine (AD in the rest of this document) has determined that more than one component could be at fault; every component that could have caused the failure is marked as suspect. A "suspect" component is not removed from the configuration by hpost.

Q. Can I still use DR to replace the FRU?
A. Dynamic Reconfiguration (DR) should always be considered on Sun Fire[TM] 12K/15K/E20K/E25K systems to reduce the downtime caused by maintenance and/or hardware operations. There is no particular problem with using DR here. The CHS status of the defective FRU (Field Replaceable Unit) is set to faulty in the FRUID SEEPROM of that FRU; the new FRU will have a cleared CHS status.

Q. Why does the first reboot right after a change of the CHS of a component fail?
A. If a reboot or panic occurs right after the CHS of a component has changed, the 'hpost -Q' associated with the event fails with:

    [...output omitted]
    Reading Component Health Status (CHS) information ...
    CHS reports E$Dimm SB5/P0/E0 status NOT_GOOD. Treating as blacklisted.
    Port SB5/P0 in sram struct is good (P), inconsistent with current blacklist.
    opt_Q_handler(): GDCD recovered from Sram IO5/S0 is inconsistent with the current PCD/blacklist state.
    Exitcode = 81: Invalid recovered gdcd/ldcd structure
    POST (level=7, verbose=20, -Q) execution time 0:13

As the output reports, the failure is due to an inconsistency between the GDCD (Global Domain Configuration Descriptor, held in the IO Board SRAM) and the PCD (Platform Configuration Database, held on the SMS). In the case described here, SMS automatically runs a subsequent hpost that deconfigures (blacklists) the component.

Q. What happens to proc1 when proc0 (or one of its subcomponents) is reported as faulty, or vice versa?
A. CHS and Auto Diagnosis use the blacklist mechanism to deconfigure faulty components. Because of the design of the CPU/Memory board and the AR ASIC, processors 0 and 1 share the local Safari address bus. This is not the case for processors 2 and 3, nor for the two processors on the MaxCPU board. Deconfiguring one of processors 0 and 1 therefore also takes the other out of the configuration.

Example:

    CPU_Brds:  Proc  Mem P/B: 3/1 3/0 2/1 2/0 1/1 1/0 0/1 0/0
    Slot  Gen  3210       /L: 10  10  10  10  10  10  10  10  CDC
    SB05:  P   PPbb           PP  PP  PP  PP  cc  cc  cc  cc   P

Q. What about USIV cores?
A. On a USIV board, the behavior is the same. If proc0 is blacklisted, proc1 is crunched and 4 cores will be blacklisted.

Example:

    CPU_Brds:  PortCore
               3 2 1 0   Mem P/B: 3/1 3/0 2/1 2/0 1/1 1/0 0/1 0/0
    Slot  Gen  10101010       /L: 10  10  10  10  10  10  10  10  CDC
    SB04:  P   PPPPcccc           mm  PP  mm  PP  cc  cc  cc  cc   P

Q. How does CHS work for USIV boards?
A. There is no difference from USIII boards; the behavior is the same. If an event is detected and reported by AD for one core, the CHS status of the proc is changed to "faulty" and the proc is blacklisted.

    SEEPROM probe took 0 seconds.
    Reading Component Health Status (CHS) information ...
    CHS reports E$Dimm SB4/P3 status NOT_GOOD. Treating as blacklisted.
    [...output omitted]
    CPU_Brds:  PortCore
               3 2 1 0   Mem P/B: 3/1 3/0 2/1 2/0 1/1 1/0 0/1 0/0
    Slot  Gen  10101010       /L: 10  10  10  10  10  10  10  10  CDC
    SB04:  P   ccPPPPPP           mm  cc  mm  PP  PP  PP  PP  PP   P

Q. Why can I see a proc reported as "crunched" but no blacklisted (or failed) component in the configuration?
A. As the following example shows, a proc may be reported as "crunched" while no component is reported as blacklisted or failed.

    CPU_Brds:  Proc  Mem P/B: 3/1 3/0 2/1 2/0 1/1 1/0 0/1 0/0
    Slot  Gen  3210       /L: 10  10  10  10  10  10  10  10  CDC
    SB05:  P   PcPP           PP  PP  cc  cc  PP  PP  PP  PP   P

By the definition of the "crunched" status, the proc is the victim of another component.

Q. What is the impact of a faulty DIMM versus a faulty logical bank?
A. When a DIMM is considered defective and its CHS status is set to faulty, the full physical bank (the whole DIMM) is removed from the configuration by hpost. When a logical bank is considered defective and its CHS status is set to faulty, only that logical bank is removed from the configuration by hpost.

Q. When upgrading from an SMS version prior to the implementation of CHS, or when adding a system board to a platform managed by SMS 1.4 or above, is it possible that the system board already has disabled components?
A. Yes, this is possible. Every effort should be made to ascertain why the component is marked failed before re-enabling it. If the component is re-enabled, it should be thoroughly tested before being returned to production.
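The hpost board rows quoted in the answers above can be hard to read by eye. The following Python sketch is illustrative only and is not part of SMS: it splits one CPU board row and lists every proc (or core) and logical bank that is not marked "P". The function name `non_passing`, the component labels it prints, and the meaning assigned to each status letter are assumptions made for this example; the article itself only names the "crunched", "blacklisted", and "failed" states and does not define the letters.

```python
# Assumed meaning of the hpost status letters (not defined in this article).
STATUS = {"P": "pass", "b": "blacklisted", "c": "crunched",
          "m": "missing", "f": "failed"}

# Column order of the "Mem P/B" header: proc/physical bank, highest first.
BANK_LABELS = ["3/1", "3/0", "2/1", "2/0", "1/1", "1/0", "0/1", "0/0"]


def non_passing(row):
    """Return (label, status) pairs for every entry of a CPU board row
    that is not 'P'. Labels are illustrative, not official SMS names."""
    fields = row.split()
    slot = fields[0].rstrip(":")      # e.g. "SB05"
    procs = fields[2]                 # "3210" column, or 8 letters on USIV
    banks = fields[3:11]              # eight two-letter logical bank pairs
    per_port = len(procs) // 4        # 1 letter per proc, 2 per USIV port
    found = []
    for i, letter in enumerate(procs):
        if letter == "P":
            continue
        port = 3 - i // per_port      # leftmost letter is port 3
        label = f"{slot}/P{port}"
        if per_port > 1:              # USIV: core 1 is listed before core 0
            label += f"/C{per_port - 1 - i % per_port}"
        found.append((label, STATUS.get(letter, "unknown")))
    for bank, pair in zip(BANK_LABELS, banks):
        for logical, letter in zip("10", pair):   # "/L: 10" column order
            if letter != "P":
                found.append((f"{slot} mem {bank} L{logical}",
                              STATUS.get(letter, "unknown")))
    return found


if __name__ == "__main__":
    for label, status in non_passing("SB05: P PcPP PP PP cc cc PP PP PP PP P"):
        print(label, status)
```

Run against the "crunched" example row, the sketch reports proc 2 and both logical banks of its two physical banks, and no blacklisted or failed entry, which is the situation that question describes.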
Internal Section:
A. The 'setchs' and 'showchs' SMS commands are intended for Sun internal use only.

    sc0:sms-user:> showchs -b -v -r SB5/P0/E0

Q. Can I still use DR to replace the FRU (continued)?
A. Example: proc is victim of an L2SRAM DIMM failure.
References
<NOTE:1002020.1> - Sun Fire[TM] 12K/15K/E20K/E25K: Component Health Status (CHS) - General discussion
<NOTE:1002023.1> - Sun Fire[TM] 12K/15K/E20K/E25K: Component Health Status (CHS) - Troubleshooting NOT_GOOD/faulty status

Attachments
This solution has no attachment