Sun Fire[TM] F12K/F15K/E20K/E25K: Component Health Status (CHS)

Asset ID:	1-71-1003575.1
Update Date:	2012-07-30

Solution Type Technical Instruction Sure

Solution 1003575.1 : Sun Fire[TM] F12K/F15K/E20K/E25K: Component Health Status (CHS) - Q and A

Applies to:

Sun Fire E20K Server - Version Not Applicable and later
Sun Fire 12K Server - Version Not Applicable and later
Sun Fire 15K Server - Version Not Applicable and later
Sun Fire E25K Server - Version Not Applicable and later
All Platforms

Goal

Sun Fire[TM] System Management Services (SMS) 1.4 introduced new features that improve the availability, serviceability, diagnosability, and recovery characteristics of Sun Fire[TM] 12K/15K systems.

These features are also available on Sun Fire[TM] E20K/E25K with SMS 1.4.1 and above.

This document presents several questions and answers about Component Health Status (CHS).

Please refer to the appropriate documentation in the references section below for further information about AD, CHS, Error Report, Class description and hpost mechanism and terminology.

Fix

Q. What causes the CHS status of a component to be set to "suspect"?

A. The "suspect" CHS status means that the Auto Diagnosis engine (AD in the rest of this document) has determined that more than one component could be at fault; every component that could have caused the failure is marked as suspect. A "suspect" component is not removed from the configuration by hpost.

Q. Can I still use DR to replace the FRU?

A. Dynamic Reconfiguration (DR) should always be considered on Sun Fire[TM] 12K/15K/E20K/E25K to reduce the downtime induced by maintenance and/or HW operations. Using DR here poses no particular problem. The CHS status of the defective FRU (Field Replaceable Unit) is set to faulty on the FRUID SEEPROM of that FRU; the new FRU will have a cleared CHS status.

Q. Why does the first reboot right after a change of the CHS of a component fail?

A. If a reboot or panic occurs right after the CHS of a component has changed, the 'hpost -Q' associated with the event fails with:

    [...output omitted]
    Reading Component Health Status (CHS) information ...
    CHS reports E$Dimm SB5/P0/E0 status NOT_GOOD. Treating as blacklisted.
    Port SB5/P0 in sram struct is good (P), inconsistent with current blacklist.
    opt_Q_handler(): GDCD recovered from Sram IO5/S0 is inconsistent with
    the current PCD/blacklist state.
    Exitcode = 81: Invalid recovered gdcd/ldcd structure
    POST (level=7, verbose=20, -Q) execution time 0:13

As reported, this is due to the inconsistency between the GDCD (Global Domain Configuration Descriptor on the IO Board SRAM) and the PCD (Platform Configuration Database on the SMS).
'hpost -Q' is not able to remove HW from the configuration and it is not able to take the blacklist of SB5/P0/E0 into account. A full hpost ('setkeyswitch standby/on' or DR) must be run to remove the faulty component from the configuration.

In the case described here, SMS will automatically run a subsequent hpost that will deconfigure the component (blacklisted) from the configuration.
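
When reviewing a saved POST log, the failure signature described above (a CHS NOT_GOOD report followed by exit code 81) can be picked out mechanically. The following Python sketch is illustrative only: the function names are hypothetical, and the patterns are modeled solely on the log excerpt shown in this document, so other hpost messages are not covered.

```python
import re

# Patterns modeled on the log excerpt in this document only (assumption:
# other hpost message variants exist and are not handled here).
CHS_LINE = re.compile(r"CHS reports (?P<kind>.+?) (?P<component>\S+) status NOT_GOOD")
EXIT_LINE = re.compile(r"Exitcode = (?P<code>\d+):")


def summarize_hpost_log(text):
    """Return (components CHS reported as NOT_GOOD, exit code or None)."""
    flagged = []
    exit_code = None
    for line in text.splitlines():
        match = CHS_LINE.search(line)
        if match:
            flagged.append(match.group("component"))
            continue
        match = EXIT_LINE.search(line)
        if match:
            exit_code = int(match.group("code"))
    return flagged, exit_code


def needs_full_hpost(text):
    """True when the log shows the 'hpost -Q' failure described above."""
    flagged, exit_code = summarize_hpost_log(text)
    return bool(flagged) and exit_code == 81


SAMPLE = """\
Reading Component Health Status (CHS) information ...
CHS reports E$Dimm SB5/P0/E0 status NOT_GOOD. Treating as blacklisted.
Port SB5/P0 in sram struct is good (P), inconsistent with current blacklist.
Exitcode = 81: Invalid recovered gdcd/ldcd structure
"""

if __name__ == "__main__":
    print(summarize_hpost_log(SAMPLE))
    print(needs_full_hpost(SAMPLE))
```

A log that matches this signature only indicates that a full hpost is required; as noted above, SMS runs that subsequent hpost automatically.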

Q. What happens to proc1 when proc0 (or one of its subcomponents) is reported as faulty, or vice versa?

A. CHS and Auto Diagnosis use the blacklist mechanism to deconfigure faulty components.

The design of the CPU/Memory board and the AR ASIC results in processors 0 and 1 sharing the local Safari address bus. This is not the case for processors 2 and 3, nor for the two processors on the MaxCPU board.
So, when proc0 is considered faulty and is blacklisted, or is manually disabled by running disablecomponent, the partnering proc1 is crunched, and vice versa.

Example:

    CPU_Brds:  Proc  Mem P/B: 3/1 3/0  2/1 2/0  1/1 1/0  0/1 0/0
    Slot  Gen  3210        /L: 10  10   10  10   10  10   10  10     CDC
    SB05:  P   PPbb            PP  PP   PP  PP   cc  cc   cc  cc      P
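
The pairing rule described above can be sketched in a few lines of Python. This is a simplified model, not hpost logic: the names PARTNER and procs_out_of_config are hypothetical, only the four processors of a CPU/Memory board are modeled, and the MaxCPU board is not covered.

```python
# Processors 0 and 1 share the local Safari address bus, so each one is the
# other's partner. Processors 2 and 3 have no partner (per the text above).
PARTNER = {0: 1, 1: 0}


def procs_out_of_config(disabled):
    """Return every proc that leaves the configuration.

    'disabled' is the set of procs that are blacklisted by CHS/AD or disabled
    manually; a partner of such a proc is lost too (crunched).
    """
    lost = set(disabled)
    for proc in disabled:
        partner = PARTNER.get(proc)
        if partner is not None:
            lost.add(partner)
    return lost


if __name__ == "__main__":
    for case in ({0}, {1}, {2}, {0, 3}):
        print(sorted(case), "->", sorted(procs_out_of_config(case)))
```

The sketch only answers which processors are lost; it does not try to reproduce the status letters that hpost prints for each of them.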

Q. What about USIV core?

A. On a USIV board, the behavior is the same. If proc0 is blacklisted, proc1 is crunched and 4 cores will be blacklisted.

Example:

    CPU_Brds:  PortCore
                3 2 1 0  Mem P/B: 3/1 3/0  2/1 2/0  1/1 1/0  0/1 0/0
    Slot  Gen  10101010        /L: 10  10   10  10   10  10   10  10     CDC
    SB04:  P   PPPPcccc            mm  PP   mm  PP   cc  cc   cc  cc      P

Q. How does CHS work for USIV boards?

A. There is no difference with USIII boards; the behavior is the same. If an event is detected and reported by AD for 1 core, the CHS status for the proc will be changed to "faulty" and the proc will be blacklisted.
Example:

    SEEPROM probe took 0 seconds.
    Reading Component Health Status (CHS) information ...
    CHS reports E$Dimm SB4/P3 status NOT_GOOD. Treating as blacklisted.
    [...output omitted]
    CPU_Brds:  PortCore
                     3  2  1  0  Mem P/B: 3/1 3/0  2/1 2/0  1/1 1/0  0/1 0/0
    Slot  Gen  10101010                /L: 10  10   10  10   10  10   10  10     CDC
    SB04:  P   ccPPPPPP                    mm  cc   mm  PP   PP  PP   PP  PP      P

Q. Why can I see a proc reported as "crunched" but no blacklisted (or failed) component in the configuration?

A. As reported in the following example, a proc may be reported as "crunched" and no component is reported as blacklisted or failed.
Example:

    CPU_Brds:  Proc  Mem P/B:  3/1 3/0  2/1 2/0  1/1 1/0  0/1 0/0
    Slot   Gen  3210        /L: 10  10   10  10   10  10   10  10     CDC
    SB05:  P    PcPP            PP  PP   cc  cc   PP  PP   PP  PP      P

By definition, a "crunched" proc is the victim of another component that has failed, is missing, is unusable, or is blacklisted; many cases can cause a proc to be crunched.
When a proc is crunched, the evidence will be available in the logs.
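
To locate crunched entries quickly in a saved configuration table, a board row such as the one above can be decoded by column position. The Python sketch below is an assumption-laden illustration: decode_row is a hypothetical helper, it handles only the four-processor row layout shown in this example (not the USIV PortCore layout), and it translates only the letters this document describes (P, c, b), leaving any other letter untouched.

```python
# Letters described in this document; any other letter is returned as-is.
MEANING = {"P": "good", "c": "crunched", "b": "blacklisted"}

# Memory columns in table order (proc/bank), taken from the 'Mem P/B' header.
BANK_COLUMNS = ["3/1", "3/0", "2/1", "2/0", "1/1", "1/0", "0/1", "0/0"]


def decode_row(row):
    """Decode one 'SBxx:' row of the four-processor CPU board table."""
    fields = row.split()
    if len(fields) != 12 or len(fields[2]) != 4:
        raise ValueError("unexpected row layout: %r" % row)
    pairs = fields[3:11]
    if any(len(pair) != 2 for pair in pairs):
        raise ValueError("unexpected memory column in row: %r" % row)
    # The Proc field is printed in '3210' order.
    procs = {num: MEANING.get(ch, ch) for num, ch in zip((3, 2, 1, 0), fields[2])}
    banks = {}
    for column, pair in zip(BANK_COLUMNS, pairs):
        # Each pair is logical bank 1 then logical bank 0 (the '/L: 10' header).
        banks[column] = {1: MEANING.get(pair[0], pair[0]),
                         0: MEANING.get(pair[1], pair[1])}
    return {"slot": fields[0].rstrip(":"), "procs": procs, "banks": banks}


ROW = "SB05:  P    PcPP            PP  PP   cc  cc   PP  PP   PP  PP      P"

if __name__ == "__main__":
    decoded = decode_row(ROW)
    crunched = sorted(n for n, s in decoded["procs"].items() if s == "crunched")
    print(decoded["slot"], "crunched procs:", crunched)
```

Decoding the row only shows which entries are crunched; the cause still has to be found in the logs.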

Q. Impact of a faulty DIMM vs impact of a faulty logical bank?

A. When a DIMM is considered as defective and its CHS status is set to faulty, the full physical bank (whole DIMM) will be removed from the configuration by hpost. When a logical bank is considered as defective and its CHS status is set to faulty, only this logical bank will be removed from the configuration by hpost.

Q. When upgrading from a SMS version prior to the implementation of CHS or when adding a system board into a platform managed by SMS 1.4 and above, is it possible that the system board could already have disabled components?

A. Yes, this is possible. Every effort to ascertain why a component is marked failed should be exhausted before re-enabling it. If the component is re-enabled, it should be thoroughly tested before being returned to production.

Internal Section:

Q. Are there any logs when using the 'setchs' command?

A. The 'setchs' and 'showchs' SMS commands are intended for Sun internal use only.
There is no manual available online and there are no logs when using the 'setchs' SMS command.
That said, the "-v" option of the 'showchs' SMS command displays the CHS history, and therefore any change to the CHS status of a component.

Example:

    sc0:sms-user:> showchs -b -v -r SB5/P0/E0

    Component:          SB5/P0/E0
    Time Stamp:         Tue Apr 13 11:32:27 PDT 2004
    New Status:         Faulty
    Old Status:         OK
    Major Event Code:   Hardware-detected error
    Initiator:          Kernel
    Message:            1 SF15000-8000-FF 5-10-s10_50 1
    Component:          SB5/P0/E0
    Time Stamp:         Tue Apr 13 11:26:36 PDT 2004
    New Status:         OK
    Old Status:         Faulty
    Major Event Code:   Field Engineer-supplied status
    Initiator:          Field Engineer
    Message:            change CHS status for SB5/P0/E0

Note: 'showchs' output is not collected by Explorer (RFE 4962254); it may need to be requested separately.
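
When 'showchs -v' output has been captured to a file, the history shown above can be split into one record per status change. The Python sketch below assumes the "Field: value" layout of this example and that each record starts with a "Component:" line; parse_chs_history and transitions are hypothetical helper names, not SMS tools.

```python
def parse_chs_history(text):
    """Split captured 'showchs -v' output into one dict per history record."""
    records = []
    current = None
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so time stamps keep their colons.
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip()
        if key == "Component":
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records


def transitions(records):
    """Return (time stamp, old status, new status) for each record."""
    return [(r.get("Time Stamp"), r.get("Old Status"), r.get("New Status"))
            for r in records]


SAMPLE = """\
    Component:          SB5/P0/E0
    Time Stamp:         Tue Apr 13 11:32:27 PDT 2004
    New Status:         Faulty
    Old Status:         OK
    Major Event Code:   Hardware-detected error
    Initiator:          Kernel
    Message:            1 SF15000-8000-FF 5-10-s10_50 1
    Component:          SB5/P0/E0
    Time Stamp:         Tue Apr 13 11:26:36 PDT 2004
    New Status:         OK
    Old Status:         Faulty
    Major Event Code:   Field Engineer-supplied status
    Initiator:          Field Engineer
    Message:            change CHS status for SB5/P0/E0
"""

if __name__ == "__main__":
    for stamp, old, new in transitions(parse_chs_history(SAMPLE)):
        print(stamp, ":", old, "->", new)
```

Records are returned in the order they appear in the output, so in this example the most recent change comes first.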

Q. Can I still use DR to replace the FRU (continued)?

A. If a new FRU is inserted into a system, its CHS status must be verified via the 'showchs' command prior to replacing the HW.
In case of a non-HW issue, DR can be used to configure the component back to the domain (after changing its CHS status via the 'setchs' command).
For a detailed explanation of the reasons behind crunching the second proc in the proc-pair p0 and p1 see http://esp.us.oracle.com/starcat/post/faq.html#crunch_proc1.
For a detailed explanation of the "crunched" status see http://esp.us.oracle.com/starcat/post/glossary.html#crunch.

Q. Why can I see a proc reported as "crunched" but no blacklisted (or failed) component in the configuration (continued)?

A. Example: proc is victim of an L2SRAM DIMM failure.

    [...output omitted]
    SEEPROM probe took 0 seconds.
    Reading Component Health Status (CHS) information ...
    CHS reports E$Dimm SB5/P2/E0 status NOT_GOOD. Treating as blacklisted.
    [...output omitted]

In the example provided, the associated memory DIMMs are also crunched because, with this architecture, the memory controller resides on the processor.

Q. Impact of a faulty DIMM vs impact of a faulty logical bank (continued)?

A. Example:

    sc0:sms-user:> setchs -r "test" -s faulty -c SB5/P0/B0/D2

    sc0:sms-user:> setchs -r "test" -s faulty -c SB11/P0/B0/D2/L0
    [...output omitted]
    SEEPROM probe took 0 seconds.
    Reading Component Health Status (CHS) information ...
    CHS reports Logical Dimm SB5/P0/B0/L0/D2 status NOT_GOOD. Treating as blacklisted.
    CHS reports Logical Dimm SB5/P0/B0/L1/D2 status NOT_GOOD. Treating as blacklisted.
    CHS reports Logical Dimm SB11/P0/B0/L0/D2 status NOT_GOOD. Treating as blacklisted.
    ...
    CPU_Brds: Proc Mem P/B: 3/1 3/0 2/1 2/0 1/1 1/0 0/1 0/0
    Slot Gen 3210        /L: 10 10   10 10   10 10   10 10     CDC
    SB05: P   PPPP            PP PP   PP PP   PP PP   PP bb      P
    SB11: P   PPPP            PP PP   PP PP   PP PP   PP Pb      P
    Configured in 333 with 8 procs, 14.500 GBytes, 2 IO adapters.
    Interconnect frequency is 149.989 MHz, Measured.
    Golden sram is on Slot IO5.
    POST (level=16, verbose=20) execution time 4:42
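
The difference between the two 'setchs' targets in the example above can be expressed as a small expansion rule: a physical DIMM marked faulty takes both of its logical DIMMs out of the configuration, while a logical DIMM marked faulty takes out only itself. The Python sketch below is a model built from this example alone. The helper name and the path pattern are assumptions, as is the premise that each DIMM has exactly two logical banks (L0 and L1), which is inferred from the "/L: 10" header and the POST messages.

```python
import re

# Path forms taken from the example: 'SB5/P0/B0/D2' (physical DIMM) and
# 'SB11/P0/B0/D2/L0' (logical DIMM). Other CHS path forms are not handled.
PATH = re.compile(r"^(SB\d+)/(P\d)/(B\d)/(D\d)(?:/(L[01]))?$")


def blacklisted_logical_dimms(component):
    """List the logical DIMMs hpost would report for a faulty component.

    Names are returned in the board/proc/bank/logical/dimm order that the
    POST messages in the example use.
    """
    match = PATH.match(component)
    if match is None:
        raise ValueError("unsupported component path: %r" % component)
    board, proc, bank, dimm, logical = match.groups()
    # Assumption: a physical DIMM spans logical banks L0 and L1.
    logicals = [logical] if logical else ["L0", "L1"]
    return ["/".join((board, proc, bank, name, dimm)) for name in logicals]


if __name__ == "__main__":
    for target in ("SB5/P0/B0/D2", "SB11/P0/B0/D2/L0"):
        print(target, "->", blacklisted_logical_dimms(target))
```

For the two targets used above, the sketch yields the same three logical DIMM names that appear in the "CHS reports Logical Dimm" lines of the POST output.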

References:

Document 1002020.1 12K/15K/E20K/E25K: CHS General discussion
Document 1002023.1 12K/15K/E20K/E25K: CHS Troubleshooting NOT_GOOD/faulty status
Starcat POST Glossary - http://esp.us.oracle.com/starcat/post/glossary.html
Sun Fire 15K POST & redx Information - http://esp.us.oracle.com/starcat/post/

Previously Published As 75472

References

Document 1002020.1 - Sun Fire[TM] 12K/15K/E20K/E25K: Component Health Status (CHS) - General discussion
Document 1002023.1 - Sun Fire[TM] 12K/15K/E20K/E25K: Component Health Status (CHS) - Troubleshooting NOT_GOOD/faulty status
