Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1004879.1
Update Date:2011-06-01
Keywords:

Solution Type  Technical Instruction Sure

Solution  1004879.1 :   Sun Fire[TM] 3800, 48x0, 6800, E2900, E4900, E6900, v1280 or Netra[TM] 1280, or 1290: Resetting a component's CHS status using setchs  


Related Items
  • Sun Fire E6900 Server
  • Sun Fire 3800 Server
  • Sun Fire 6800 Server
  • Sun Fire E4900 Server
  • Sun Fire 4800 Server
  • Sun Fire V1280 Server
  • Sun Fire E2900 Server
  • Sun Fire 4810 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange Servers
  • GCS>Sun Microsystems>Servers>Entry-Level Servers
  • GCS>Sun Microsystems>Servers>Midrange V and Netra Servers
  • GCS>Sun Microsystems>Servers>NEBS-Certified Servers

PreviouslyPublishedAs
206842


Applies to:

Sun Fire E6900 Server
Sun Fire E4900 Server
Sun Fire 3800 Server
Sun Fire 4810 Server
Sun Fire V1280 Server
All Platforms

Goal

Description
This document describes how to re-enable a component that Component Health Status (CHS) has marked Faulty or Suspect. It applies to the Sun Fire[TM] 3800, 4800, 4810, 6800, E2900, E4900, E6900, and V1280 systems and to the Netra[TM] 1280 and 1290 systems.

The System Controller (SC) or lom command showchs might report Faulty or Suspect component(s) similar to the following example:
lom>showchs
Component Status
--------------- --------
/N0/SB2 Suspect
/N0/SB2/P0 Faulty
/N0/SB2/P1 Faulty
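
When this output has been captured to a log, the Faulty and Suspect entries can be pulled out with a few lines of script before opening a service request. The following Python sketch is only an illustration: it assumes the two-column layout shown in the example above, and the function names `parse_showchs` and `needs_review` are hypothetical helpers, not part of any Sun tool.

```python
def parse_showchs(output):
    """Return {component: status} from showchs-style text (layout assumed from the example)."""
    statuses = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows have exactly two fields; this skips the prompt line and blanks.
        if len(parts) != 2:
            continue
        component, status = parts
        # Skip the column header and the dashed rule under it.
        if component == "Component" or set(component) == {"-"}:
            continue
        statuses[component] = status
    return statuses


def needs_review(statuses):
    """Components that Support Services should examine before any reset."""
    return sorted(c for c, s in statuses.items() if s in ("Faulty", "Suspect"))


SAMPLE = """lom>showchs
Component Status
--------------- --------
/N0/SB2 Suspect
/N0/SB2/P0 Faulty
/N0/SB2/P1 Faulty
"""

if __name__ == "__main__":
    for component in needs_review(parse_showchs(SAMPLE)):
        print(component)
```

The list produced this way is a convenient attachment for a service request; it does not replace the investigation described in the notes below.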
The prtdiag output may also show component(s) as failed or disabled, as in the following example:
Fru Operational Status:
-------------------------
Location Status
-------------------------
SB0 failed
SB0/P2 disabled
SB0/P3 disabled
SB0/P0 disabled
SB0/P1 disabled
Important Notes: Disabled hardware should be investigated by Support Services prior to resetting any CHS status for any component(s).
  • Failure to investigate why component(s) were marked Suspect or Faulty prior to resetting their status could leave the system exposed to future outages.
  • You are strongly encouraged to open a service request and have Support Services examine the appropriate data to validate if the component(s) status should be reset or if hardware needs to be replaced.  
Please note that it is fairly common to have to reset component(s) status following service actions (hardware replacement), especially after CPU or memory errors, where a CPU and its entire collection of associated memory DIMMs are often marked Suspect.

Solution

Procedure to reset a component's CHS status.

1. As stated above, a support engineer should first validate that the component(s) CHS status should be reset.
The support engineer should analyze the data and determine whether the CHS status should be reset or whether the component should be replaced.

Assuming that a Support Services engineer has verified that the CHS status needs to be reset, the following options exist to reset its status:

2. If the system is running ScApp < 5.20.15 (i.e. 5.20.14 or lower; 5.21.x also counts as lower)
The CHS status can only be reset by a Support Services engineer, using Sun Shared Shell if that option exists. This is because the setchs command is available ONLY in a restricted access mode on the SC, which requires a service engineer to perform the procedure.

You are encouraged to upgrade ScApp to avoid this inconvenience (see step 3).

If Shared Shell is not an option for a particular site, a field engineer must be dispatched, but this may involve Time and Material charges depending on contract terms.
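
Because the version rule has an exception (5.21.x is treated as lower than 5.20.15), a plain numeric comparison gives the wrong answer for those releases. The Python sketch below encodes the rule exactly as this article states it. The function names are hypothetical, the version strings are assumed to be dot-separated numbers such as 5.20.15 or 5.13.0009, and "5.21.3" is an invented sample value used only to exercise the exception.

```python
def parse_scapp_version(text):
    """Turn '5.20.15' or '5.13.0009' into a tuple of ints, e.g. (5, 13, 9)."""
    return tuple(int(part) for part in text.strip().split("."))


def setchs_needs_service_mode(version_text):
    """True if, per this article, setchs is only available in restricted (service) mode."""
    version = parse_scapp_version(version_text)
    # The article treats every 5.21.x release as "lower" than 5.20.15.
    if version[:2] == (5, 21):
        return True
    return version < (5, 20, 15)


if __name__ == "__main__":
    # "5.21.3" is a made-up sample, not a statement that such a release exists.
    for sample in ("5.13.0009", "5.20.14", "5.20.15", "5.21.3"):
        mode = "service mode required" if setchs_needs_service_mode(sample) else "normal mode"
        print(sample, "->", mode)
```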

3. If the system is running ScApp 5.20.15 or higher
The CHS status can be reset from the SC or lom prompt by anyone who can log in to the Main SC; no special access is required.

Perform the following steps:
  • Verify what is currently marked as Suspect or Faulty.
    6800-sc1:SC> showchs -b
    Component           Status
    ---------------     --------
    SB3/p0              Faulty
  • Reset the CHS status of the component(s) in question.
    6800-sc1:SC> setchs -s OK -r "service_request_number" -c SB3/p0
  • Validate that the component's status has been reset.
    6800-sc1:SC> showchs -b
    Component           Status
    ---------------     --------
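
When several components have to be reset under the same service request, the command lines can be generated rather than typed by hand. The Python sketch below is a minimal illustration that uses only the setchs options shown in the steps above (-s, -r, -c). The function names are hypothetical, "1-12345678" is a placeholder service request number, and the check assumes, as in the example output above, that a reset component no longer appears in the showchs -b listing.

```python
def build_setchs_reset(component, service_request):
    """Format the reset command in the form shown in this article."""
    if not component or any(ch.isspace() for ch in component):
        raise ValueError("component must be a single token such as SB3/p0")
    if not service_request or '"' in service_request:
        raise ValueError("service request must be non-empty and contain no double quotes")
    return 'setchs -s OK -r "{}" -c {}'.format(service_request, component)


def reset_confirmed(component, showchs_after):
    """True if the component is absent from the captured 'showchs -b' output."""
    listed = {line.split()[0] for line in showchs_after.splitlines() if line.split()}
    return component not in listed


BEFORE = """6800-sc1:SC> showchs -b
Component           Status
---------------     --------
SB3/p0              Faulty
"""

AFTER = """6800-sc1:SC> showchs -b
Component           Status
---------------     --------
"""

if __name__ == "__main__":
    # "1-12345678" is a placeholder, not a real service request number.
    print(build_setchs_reset("SB3/p0", "1-12345678"))
    print("reset confirmed:", reset_confirmed("SB3/p0", AFTER))
```

The generated line is meant to be pasted at the SC prompt by someone authorized to perform the reset; the script itself does not connect to the System Controller.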

4. If a component was marked Faulty, it will not return to the configuration until it has been run through POST.
The component must be removed from and then added back into the domain using Dynamic Reconfiguration (DR), or the domain must be rebooted (sometimes known as 'keyswitched'), to trigger this testing.

Assuming the component completes POST testing successfully, it should be configured back into the domain. Contact Support Services if this presents any problems, and provide the console log showing the POST execution so that they can diagnose any remaining issues.


    Internal Only Instructions for Support Service engineers

    Engineers should validate why the component's status has been marked
    CHS Faulty or Suspect prior to resetting its status.

    Utilize Document 1010056.1 to validate whether the component that is
    currently disabled, Faulty, Suspect, or Missing is defective or not. 

    If it is defective, the FRU should be replaced instead of having its status reset.
    If it is determined that a component(s) CHS status needs to be reset, do so
    depending on which version of ScApp is installed.


    If the system is running ScApp 5.20.15 or higher:
    Follow the procedure documented in the public section of this knowledge article
    (STEP 3 up above).

    Customers can reset the CHS status themselves using setchs in 'normal mode'
    (without a service mode password) on ScApp 5.20.15 or higher.


    If the system is running ScApp 5.20.14 or lower (5.21.x IS lower):
    You MUST generate a service mode password and then reset the status of the
    device for the customer.  The customer should not be given access to service
    mode themselves if at all possible.  You should make every attempt to perform
    this procedure yourself using Sun Shared Shell.

    If you need to reset the status using service mode, perform the following steps:


    1. Obtain the System Controller's HostID, ScApp version, and RTOS version.

    To obtain this information, enter a carriage return in place of the
    password three times:

    Connected to Hostname-sc.
    Escape character is '^]'.

    Enter Password: <--- Enter Return Here
    Invalid password.

    Enter Password: <--- Enter Return Here
    Invalid password.

    Enter Password: <--- Enter Return Here
    Invalid password.

    HostID: 83195a96
    ScApp version: 5.13.0009
    RTOS version: 23
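
If the login session has been saved to a file, the three values needed in the next step can be extracted from it so that they are not mistyped. The Python sketch below is an illustration only: it assumes the banner prints the fields with exactly the labels shown in the example above, and `parse_sc_banner` is a hypothetical helper name.

```python
import re

# Field labels are taken from the example banner above.
BANNER_FIELD = re.compile(r"^\s*(HostID|ScApp version|RTOS version):\s*(\S+)\s*$")


def parse_sc_banner(text):
    """Return the HostID, ScApp version and RTOS version found in a captured session."""
    fields = {}
    for line in text.splitlines():
        match = BANNER_FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    missing = [k for k in ("HostID", "ScApp version", "RTOS version") if k not in fields]
    if missing:
        raise ValueError("missing banner fields: " + ", ".join(missing))
    return fields


SESSION = """Enter Password:
Invalid password.

HostID: 83195a96
ScApp version: 5.13.0009
RTOS version: 23
"""

if __name__ == "__main__":
    for name, value in parse_sc_banner(SESSION).items():
        print(name + ": " + value)
```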


    2.  Generate a Service Mode password.

    Take the information from step 1 and visit the Service Mode Password Generator
    to generate a service mode password.

    A backup is available here: Backup Service Mode Password Generator


    3.  Utilize Sun Shared Shell to connect to the customer's system and perform
    the reset procedure.

    Where Sun Shared Shell cannot be used, follow the recommendations in
    Document 1010655.1 and directly supervise the customer's use of this access to
    reset the CHS status.


    4. Verify what is currently marked as Suspect or Faulty.

    6800-sc1:SC[service]> showchs -b
    Component           Status  
    ---------------     -------- 
    SB3/p0              Faulty


    5.  Reset the CHS status of the component in question.

    6800-sc1:SC[service]> setchs -s  OK -r "service_request_number" -c SB3/p0


    6.  Validate that the component's status has been reset.

    6800-sc1:SC[service]> showchs -b
    Component           Status  
    ---------------     -------- 


    7.  POST must be executed on the component to return it to service.

    This can be accomplished by executing a setkeyswitch on or a DR operation.
    When performing this action, monitor POST to ensure that no errors are detected
    on the newly reset device.


    Background information on why you might need to reset CHS status.

    There may be times when a good component, such as a CPU or system board, is
    marked as faulty.  Here are some reasons good components get marked as bad:

    Example1: CR 4868106 - Upgrading to 5.15.0 without following upgrade
    procedures can lead to a "ParitySingle error" and a CHS disabled SB.

    Example2: POST fails test ID 6.1, with an error like:  ERROR: TEST=Memory
    Tests,SUBTEST=Memory Addressing ID=61.1

    In this situation, the CPU is failed in order to disable the memory it controls, but
    the CPU is fine. It is the memory DIMM(s) that need to be replaced. For the case
    of bad memory, here is what you need to do:
       - Use setchs to re-enable the CPU.
       - Verify with showchs that the pending status is 'ok'.
       - DR the SB out and replace the memory.
       - DR the SB back in.
    When you reinsert the SB, the local tests will be sufficient to make the CHS status
    'current' rather than 'pending'. You do not need to do a setkeyswitch off on the domain.

    Example3: Some repair action that is later corrected, but the component is left
    marked as failed by CHS.
    For instance, one recent example involved a customer moving memory around
    on the board. It was unknown exactly which DIMMs were involved, or how many
    times the customer tried to setkeyswitch the domain on, etc. One of the CPUs on
    the board was marked as CHS failed.

    It was quite possible that a DIMM had been mis-seated or something to that
    effect. The DIMMs were reseated and the CPU was re-enabled. The system board
    was tested to the satisfaction of the engineers involved, which saved an
    unnecessary board replacement.

    These cases are not the only ones where a good component can be marked as
    faulty. The point is, question the recent history of the machine and any
    maintenance activity when you have a CHS disabled component, before
    proceeding to re-enable it.

    A word of caution - do not just 'blindly' re-enable a component, since the
    system disabled it for a reason. When in doubt, seek the advice of a senior
    engineer by collaborating with the next level of technical support.

    Previously Published As
    72066

    Attachments
    This solution has no attachment
      Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.