Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Troubleshooting Sure
Solution 1010056.1: Troubleshooting offline or disabled components on Sun Fire[TM] Midrange servers
Previously Published As: 213805
Applies to:
Sun Netra 1290 Server
Sun Netra 1280 Server
Sun Fire V1280 Server
Sun Fire 3800 Server
Sun Fire 4800 Server
All Platforms

Purpose
This document addresses a system or domain that is missing components, or that has disabled, faulty, suspect, or offline components.

System Configuration:
Symptoms:
'showchs -b' might report Faulty and/or Suspect components, such as the following examples:
prtdiag may report components as failed or disabled, such as the following example:
Additional Symptoms:
Last Review Date: March 23, 2011

Instructions for the Reader
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.
Troubleshooting Details

Steps to Follow
Please validate that each troubleshooting step below is true for your environment. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Please do not skip a step.

In most cases, a service request will eventually have to be opened to resolve a disabled-component situation. The steps in this section are intended to let you rule out known issues that do not require a service request. If none of these known issues is the reason for the disabled component, collect the appropriate data (Step 5) and have an Oracle Support Engineer continue the analysis.

1. Verify the CPU(s) have not been deliberately disabled by another Systems Admin
Resources can be disabled at two levels.
A. From Solaris
B. From the System Controller
2. Verify this is not the known issue described in Alert 1000495.1
3. Verify this is not the known issue described in Document 1012043.1
4. Verify that a CPU has not been offlined as a result of memory errors
There are some bugs which can cause CPUs to be disabled as a result of Uncorrectable Memory errors.
5. Collect data to allow Oracle Support to progress your call
Collecting the following data is preferred, because it allows your issue to be diagnosed more quickly and accurately:
If you cannot collect Explorer data, collect all of the following data in order to collaborate with Oracle (in addition to the console log data if the CPU/Memory has been disabled in POST):
Internal Support Instructions
Before proceeding, understand the relation between CPUs and Memory DIMMs on this architecture. Architecturally, a CPU serves as the Memory Controller for a group of Memory DIMMs. This means that SB0/P0 (CPU 0 on SB0) "manages" or is responsible for all DIMMs located in SB0/P0/B0 (bank0) and SB0/P0/B1 (bank1), so eight DIMMs in total (D0, D1, D2, D3 in each bank).

Excessive errors on any one of the DIMMs may trigger analysis by a Solaris or ScApp Diagnosis Engine, which may determine that the DIMM is suspect or faulty. The Diagnosis Engine's automated response is then to isolate or disable the DIMM as well as its memory controller (which is the CPU). View this as being "guilty by association" if it makes the concept easier to understand. In any event, the expected behavior is for a CPU to be implicated (usually marked "suspect" in showchs output) for memory error events on DIMMs it controls.

To further confuse matters, when the CPU is isolated, the remainder of the DIMMs it controlled are also disabled, because their controlling CPU is no longer available for use in the configuration. You will often see a CPU and all 8 of its DIMMs removed from the configuration for memory events that may be caused by a single bad DIMM. Customers often report an issue like this as "missing a cpu and all of its memory".

Never replace a System Board and its memory just because they are marked suspect or faulty. If you see a CPU and all of its memory disabled, the event is most likely a Memory issue, so analyze the errors to determine the cause. Replace only what is actually faulty. It is NOT okay to guess and replace parts simply because of their status.

Which type of data do you have available?

I have Explorer with 1280extended,fru or scextended options.
...Use <Document:1020760.1> to continue this investigation.

I have System Controller command output or FMA command output available.
...Use <Document:1020760.1> to continue this investigation.

I haven't collected data yet. What should I collect to investigate this?
...See <Document:1019066.1> for instructions on how to collect 1280extended,fru or scextended Explorer data.
...Then utilize <Document:1020760.1> to continue this investigation.

Not sure how to proceed?
Progress the issue to the next level of technical support and get help. It is much better to collaborate for diagnostic assistance before changing parts. Never simply re-enable parts and re-test FRUs without understanding why the components were disabled in the first place.

Document info
This document contains normalized content and is managed by the Domain Lead(s) of the respective domains. To notify content owners of a knowledge gap contained in this document, please use the feedback mechanism on this page and the document owner will be notified of the comment.

Previously Published As 91474

Attachments
This solution has no attachment
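
As an illustration of the CPU-to-DIMM relationship described under Internal Support Instructions, the following is a minimal Python sketch, not a Sun or Oracle tool. It assumes component names follow the SBx/Py/Bz/Dn layout used in this document (for example SB0/P0/B1/D2), with two banks of four DIMMs per CPU. The function names are hypothetical and exist only for this example. Given one suspect DIMM, it lists the CPU and the eight DIMMs that may be removed from the configuration along with it.

```python
# Illustrative sketch only: function names are hypothetical, and the
# SBx/Py/Bz/Dn naming is assumed from the component names in this document.
BANKS = ("B0", "B1")                      # two banks per CPU
DIMMS_PER_BANK = ("D0", "D1", "D2", "D3")  # four DIMMs per bank


def controlling_cpu(dimm_path):
    """Return the CPU (memory controller) for a DIMM such as 'SB0/P0/B1/D2'."""
    parts = dimm_path.split("/")
    if len(parts) != 4:
        raise ValueError(f"expected SBx/Py/Bz/Dn, got {dimm_path!r}")
    board, cpu, bank, dimm = parts
    if bank not in BANKS or dimm not in DIMMS_PER_BANK:
        raise ValueError(f"unknown bank or DIMM in {dimm_path!r}")
    return f"{board}/{cpu}"


def dimms_controlled_by(cpu_path):
    """Return all eight DIMMs managed by a CPU such as 'SB0/P0'."""
    return [f"{cpu_path}/{bank}/{dimm}"
            for bank in BANKS
            for dimm in DIMMS_PER_BANK]


def collateral_components(suspect_dimm):
    """List the CPU plus every DIMM it controls, i.e. what may be disabled
    together when a single DIMM is diagnosed as suspect or faulty."""
    cpu = controlling_cpu(suspect_dimm)
    return [cpu] + dimms_controlled_by(cpu)


if __name__ == "__main__":
    for component in collateral_components("SB0/P0/B1/D2"):
        print(component)
```

The sketch only models naming. It does not determine which part is actually faulty, which still requires analysis of the error data as described above.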