Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-75-1010056.1
Update Date: 2011-03-23
Keywords:

Solution Type  Troubleshooting Sure

Solution  1010056.1 :   Troubleshooting offline or disabled components on Sun Fire [TM] Midrange servers  


Related Items
  • Sun Fire E6900 Server
  • Sun Fire 3800 Server
  • Sun Fire 6800 Server
  • Sun Netra 1280 Server
  • Sun Fire E4900 Server
  • Sun Fire 4800 Server
  • Sun Fire V1280 Server
  • Sun Fire E2900 Server
  • Sun Netra 1290 Server
  • Sun Fire 4810 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange Servers
  • GCS>Sun Microsystems>Servers>Midrange V and Netra Servers

Previously Published As
213805


Applies to:

Sun Netra 1290 Server
Sun Netra 1280 Server
Sun Fire V1280 Server
Sun Fire 3800 Server
Sun Fire 4800 Server
All Platforms

Purpose

This document addresses a system or domain which is missing or has disabled, faulty, suspect, or offline components.

System Configuration:

  • Sun Fire [TM] 3800, 4800, 4810, E4900, 6800, E6900 (Serengeti) systems.
  • Sun Fire [TM] V1280, E2900, and Netra [TM] 1280, 1290 (LightWeight8) systems.
  • Solaris[TM] 8, 9, or 10 and at least ScApp 5.15.x

Symptoms:

'showchs -b' might report Faulty and/or Suspect components, such as the following examples:

lom>showchs
Component Status
--------------- --------
/N0/SB2 Suspect
/N0/SB2/P0 Faulty
/N0/SB2/P1 Faulty

SC>showchs
Component Status
--------------- --------
/N0/SB0 Suspect
/N0/SB2 Suspect
RP2 Suspect
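
When several components are listed, it can help to turn the showchs listing into a lookup table before comparing it with other data. The following is a minimal sketch only, assuming the two-column layout shown in the examples above; the function name parse_showchs is invented for this illustration and is not part of any Sun tool.

```python
def parse_showchs(text):
    """Return {component: status} from showchs-style output."""
    statuses = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip prompt lines, the header row, the dashed rule and blank lines.
        if len(parts) != 2 or parts[0] == "Component" or parts[0].startswith("-"):
            continue
        statuses[parts[0]] = parts[1]
    return statuses


# Sample text copied from the second example above.
sample = """\
SC>showchs
Component Status
--------------- --------
/N0/SB0 Suspect
/N0/SB2 Suspect
RP2 Suspect
"""

report = parse_showchs(sample)
for component, status in report.items():
    print(component, status)
```

Real output may contain additional columns or states depending on the ScApp version, in which case the two-field check would need to be relaxed.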

prtdiag may report components as failed or disabled, such as the following example:

Fru Operational Status:
-------------------------
Location Status
-------------------------
SB0 failed
SB0/P2 disabled
SB0/P3 disabled
SB0/P0 disabled
SB0/P1 disabled

Additional Symptoms:

  • /var/adm/messages may have FMA faults or errors in it.
  • A system may be running slowly or with reduced resources.
  • The problem may also be described as "missing" CPU(s) or Memory from the configuration.
  • The system console may show POST failures and the system may boot up with resources disabled.

Last Review Date

March 23, 2011

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Steps to Follow

Please validate that each troubleshooting step below is true for your environment. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Please do not skip a step.

In most cases a service request will eventually have to be opened to resolve a disabled component.  The steps in this section are intended to let a user rule out known issues that do not require a service request.  If none of these known issues explains the disabled component, collect the appropriate data (Step 5) and have an Oracle Support Engineer continue the analysis.

1. Verify the CPU(s) have not been deliberately disabled by another system administrator

Resources can be disabled at two levels.

A. From Solaris
  • On Solaris 10, psrinfo reports off-line for CPUs disabled by a user and faulted for CPUs disabled by FMA.

$ psrinfo
96 on-line since 09/14/2004 22:18:00
97 off-line since 09/24/2004 13:11:13 <<-- User disabled
98 faulted since 09/18/2004 12:14:00  <<-- FMA disabled
99 on-line since 09/14/2004 22:18:00
  • On Solaris 8/9, psrinfo reports off-line for both user-disabled and system-disabled CPUs.
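
On a domain with many processors, the psrinfo states can be grouped mechanically. The sketch below is illustrative only: it assumes each line has the form "<cpu id> <state> since <date> <time>" as in the example above, and the function name classify_cpus is hypothetical. On Solaris 8/9 the off-line group cannot distinguish user-disabled from system-disabled CPUs.

```python
def classify_cpus(psrinfo_output):
    """Group CPU ids by the state psrinfo reports for them."""
    by_state = {}
    for line in psrinfo_output.splitlines():
        fields = line.split()
        # Ignore anything that does not start with a numeric CPU id.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        by_state.setdefault(fields[1], []).append(int(fields[0]))
    return by_state


# Sample lines copied from the Solaris 10 example above.
sample = """\
96 on-line since 09/14/2004 22:18:00
97 off-line since 09/24/2004 13:11:13
98 faulted since 09/18/2004 12:14:00
99 on-line since 09/14/2004 22:18:00
"""

states = classify_cpus(sample)
print("off-line (user disabled on Solaris 10):", states.get("off-line", []))
print("faulted (FMA disabled on Solaris 10):", states.get("faulted", []))
```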

B. From the System Controller
  • showboards will show if boards are powered on and confirm the platform is receiving power.
  • Check the output of showcomponent to determine the state of components
    • <Document:1017491.1> "showcomponent indicates something is disabled" can be used as a reference.

2. Verify this is not the known issue described in Alert 1000495.1

  • <Document:1000495.1> Sun Fire Systems Equipped With UltraSPARC IV+ Processor Modules Running Solaris 9 or Solaris 10 may Exhibit Unnecessary CPU Offlining and Solaris Panics

3. Verify this is not the known issue described in Document 1012043.1

  • <Document:1012043.1> Processor may be Incorrectly Offlined When it Encounters a UCC + ME bit set in AFSR

4. Verify that a CPU has not been offlined as a result of memory errors

There are some bugs which can cause CPUs to be disabled as a result of Uncorrectable Memory errors.

  • <Document:1018939.1> Solaris[TM] 10 Operating System: Displaying the list of Fault Management Architecture (FMA) resources currently believed to be faulted
  • <Document:1006517.1> Troubleshooting uncorrectable CPU/Memory errors on Solaris [TM] 8 and 9

5. Collect data to allow Oracle Support to progress your call

Collecting the following data allows your issue to be diagnosed more quickly and accurately:

  • Console log data if the CPU/Memory has been disabled in POST (Reference: <Document:1008676.1> for console log information).
  • Explorer data, collected with the command for your platform:
    • For Sun Fire[TM] V1280, E2900 and Netra[TM] 1280, 1290 servers (Reference: <Document:1009102.1>) the command is /opt/SUNWexplo/bin/explorer -w default,1280extended
    • For Sun Fire[TM] 3800, 4800, 4810, E4900, 6800, and E6900 servers (Reference: <Document:1011830.1>) the command is /opt/SUNWexplo/bin/explorer -w default,scextended,fru
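
The choice between the two Explorer invocations can be sketched as a simple lookup. This is an illustration under stated assumptions, not a supported tool: the explorer path and the -w module lists are quoted from this document, while the model labels and the function name explorer_command are invented for the example.

```python
EXPLORER = "/opt/SUNWexplo/bin/explorer"

# Model labels are hypothetical keys for this sketch; the module lists
# are the ones given in this document for each platform family.
LW8_MODELS = {"V1280", "E2900", "Netra 1280", "Netra 1290"}
SERENGETI_MODELS = {"3800", "4800", "4810", "E4900", "6800", "E6900"}


def explorer_command(model):
    """Return the Explorer command line (as a list) for a server model."""
    if model in LW8_MODELS:
        modules = "default,1280extended"
    elif model in SERENGETI_MODELS:
        modules = "default,scextended,fru"
    else:
        raise ValueError("model not covered by this document: " + model)
    return [EXPLORER, "-w", modules]


print(" ".join(explorer_command("E2900")))
print(" ".join(explorer_command("6800")))
```

The sketch only builds the command string; running Explorer itself still follows the referenced documents.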

If you cannot collect Explorer data, collect all of the following data in order to collaborate with Oracle:

  • Console log data if the CPU/Memory has been disabled in POST (Reference: <Document:1008676.1> for console log information).
  • <Document:1003529.1> Procedure to manually collect Sun Fire[TM] Midrange System Controller level failure data
  • If the domain is Solaris 10: <Document:1018939.1> Solaris[TM] 10 Operating System: Displaying the list of Fault Management Architecture (FMA) resources currently believed to be faulted

Internal Support Instructions

Before proceeding, understand the relation between CPUs & Memory DIMMs on this architecture.
Architecturally, a CPU serves as the Memory Controller for a group of Memory DIMMs.  This
means that SB0/P0 (CPU 0 on SB0) "manages" or is responsible for all DIMMs located in SB0/P0/B0
(bank0) and SB0/P0/B1 (bank1) - So, eight DIMMs in total (D0, D1, D2, D3 for both banks).

Excessive errors on any one of the DIMMs may lead a Solaris or ScApp Diagnosis Engine to conclude
that the DIMM is suspect or faulty. The Diagnosis Engine's automated response is then to isolate or
disable the DIMM as well as its memory controller (which is the CPU). View this as being "guilty by
association" if it makes the concept easier to understand.

In any event, the expected behavior is for a CPU to be implicated (usually marked "suspect" in
showchs output) for memory error events on the DIMMs it controls. To further confuse matters, when
the CPU is isolated, the remainder of the DIMMs it controlled are also disabled because their
controlling CPU is no longer available for use in the configuration. You will often see a CPU and
all 8 of its DIMMs removed from the configuration for memory events that may be caused by a
single bad DIMM. Customers often report an issue like this as "missing a CPU and all of its
memory".
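
The CPU-to-DIMM relationship described above can be modeled in a few lines. This is a sketch for illustration only: the location strings such as SB0/P0/B1/D2 extend the bank names quoted above and are an assumption about naming, and both function names are hypothetical.

```python
BANKS = ("B0", "B1")
DIMMS = ("D0", "D1", "D2", "D3")


def dimms_for_cpu(cpu):
    """List the eight DIMM locations behind one CPU, e.g. 'SB0/P0'."""
    return [f"{cpu}/{bank}/{dimm}" for bank in BANKS for dimm in DIMMS]


def removed_with(bad_dimm):
    """Components that drop out when one DIMM's errors implicate its CPU."""
    # 'SB0/P0/B1/D2' -> 'SB0/P0' (board and processor are the first two parts)
    cpu = "/".join(bad_dimm.split("/")[:2])
    return [cpu] + dimms_for_cpu(cpu)


lost = removed_with("SB0/P0/B1/D2")
print(lost[0], "removed together with", len(lost) - 1, "DIMMs:")
for location in lost[1:]:
    print("  ", location)
```

The point of the model is that a single bad DIMM accounts for the whole group going missing, which is why the errors must be analyzed before any part is replaced.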

Never replace a System Board and its memory just because they are marked suspect or faulty.
If you see a CPU and all of its memory disabled, the event is most likely a memory issue, so analyze
the errors to determine the cause.  Replace only what is actually faulty.  It is NOT okay
to guess and replace parts simply because of their status.

Which type of data do you have available?

I have Explorer with 1280extended,fru or scextended options.
...Use <Document:1020760.1> to continue this investigation.

I have System Controller command output or FMA command output available.
...Use <Document:1020760.1> to continue this investigation.

I haven't collected data yet. What should I collect to investigate this?
...See <Document:1019066.1> for instructions on how to collect 1280extended,fru or
   scextended Explorer data.
...Then utilize <Document:1020760.1> to continue this investigation.

Not sure how to proceed?
Progress the issue to the next level of technical support and get help.  It is much better to
collaborate for diagnostic assistance before changing parts. Never simply re-enable parts and
re-test FRUs without understanding why the components were disabled in the first place.

Document info
This document contains normalized content and is managed by the Domain Lead(s)
of the respective domains. To notify content owners of a knowledge gap contained in this
document, please use the feedback mechanism on this page and the document owner
will be notified of the comment.

Previously Published As 91474


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.