Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1004797.1
Update Date: 2009-11-08
Keywords:

Solution Type  Technical Instruction Sure

Solution  1004797.1 :   Sun Enterprise[TM] 3500/4500/5500/6500 Servers: “Fatal Reset” FAQ  


Related Items
  • Sun Enterprise 4500 Server
  • Sun Enterprise 5500 Server
  • Sun Enterprise 3500 Server
  • Sun Enterprise 6500 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange Servers

PreviouslyPublishedAs
206657


Description
This document provides answers to frequently asked questions pertaining to fatal resets on Sun Enterprise[TM] 3500/4500/5500/6500 Servers.


Steps to Follow
FAQs:

Question:

What is a Fatal Reset?

Answer:

A fatal reset occurs when coherency is lost on the main system bus (between CPUs, I/O controllers, and boards).  In general, the system bus is made up of three buses: the address bus, the control bus, and the data bus.

Information on the data bus is protected by parity, ECC, or both, depending on the specific system type.  Corrupted data on this bus is therefore detected and handled by the operating system, either through error correction or through a panic.  The address and control buses, however, are not protected in the same manner.

When a piece of address or control information is corrupted, system coherency is lost.  In this instance, the system cannot stay up long enough to panic, because continuing to run could lead to data corruption.
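
The contrast between the protected data path and the unprotected address path can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the word values, bit positions, and the use of a single even-parity bit are made up for the example and do not represent the actual bus layout or protection scheme of any Enterprise server.

```python
def even_parity(word: int) -> int:
    """Return the even-parity bit for a word (1 if the count of set bits is odd)."""
    return bin(word).count("1") % 2


def flip_bit(word: int, bit: int) -> int:
    """Simulate a single-bit corruption by inverting one bit."""
    return word ^ (1 << bit)


# Protected data path: the parity bit travels with the data word,
# so the receiver can recompute parity and notice the mismatch.
data = 0b1011_0010                      # illustrative value
stored_parity = even_parity(data)
corrupted_data = flip_bit(data, 3)
data_error_detected = even_parity(corrupted_data) != stored_parity

# Unprotected address path (assumption for illustration): no check bits
# accompany the address, so a flipped bit silently selects another location.
address = 0x1F40                        # illustrative value
corrupted_address = flip_bit(address, 6)

print(f"data error detected: {data_error_detected}")
print(f"intended address 0x{address:04X}, delivered 0x{corrupted_address:04X}")
```

In the sketch, the corrupted data word is caught because its parity no longer matches, while the corrupted address simply looks like a valid but different address.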

Note that the various Sun4U systems provide different levels of protection for the address bus, so a loss of coherency does not always lead to a fatal reset.  Instead, it will often take the form of a Duplicate Tag (DTAG) parity error or a system address parity error.

Question:

What causes a Fatal Reset?

Answer:

A variety of hardware failures, both transient and hard, will cause a fatal reset.  Examples include noise on the centerplane or system bus, failed address controllers, bad DTAG SRAM modules, transient alpha-particle interruption of the DTAG, and central processing unit (CPU) failures.

In these cases, bits within the control or address information change, so that the information now points to a different address location or tag slot/state.  Since this information is generally not ECC-protected and no "copy" exists elsewhere, it cannot be recreated.

Memory or I/O components can almost always be eliminated as a cause.  When errors occur on the I/O bus (PCI or SBus) or the memory bus, the main system bus remains intact and unaffected.  The operating system can therefore detect and report I/O or memory errors properly, which usually leads to a panic rather than a fatal reset.

Question:

What troubleshooting data is created?

Answer:

Since the system cannot stay up long enough to run the panic() routine, no system core file or core dump is generated for analysis, which makes a proper diagnosis difficult.

However, at the time of the fatal reset, certain hardware registers still hold data related to the type of error and components involved in the error.  These error registers are dumped to the system controller or system console for use in analysis.  

Under some circumstances, the prtdiag(1M) command will also show some information on components that failed during POST and after the fatal reset.  As POST is not always capable of correctly identifying the failed component, prtdiag(1M) information should only be used in conjunction with console log output to determine the cause of the fatal reset.

Question:

What is the system's response to a Fatal Reset?

Answer:

When a Fatal Reset is detected, a CPU will immediately 'reset' (see "What is a Fatal Reset" above), resulting in a Power-On Reset (POR) on Enterprise systems or an Externally Initiated Reset (XIR) on Sun Fire[TM] systems.

Power-On Self-Test (POST) diagnostics are then run at the maximum level (diag-level=max), as dictated by system firmware.  Unfortunately, the needed troubleshooting data is displayed only on the system controller or console.  If the console is not logged (by connecting external hardware to the serial port), the root cause information is lost.

If an intermittent error caused the Fatal Reset, POST might not find the offending component.  In cases of hard-failed components, POST will detect them, mark them as failed, and continue.

Different systems respond differently, but generally the Automatic System Reconfiguration (ASR) process is initiated to remove failed components and try to configure an operable system.  During the next system boot, the operating system detects the prior fatal reset and logs a message to syslog stating "System booting after fatal error FATAL".
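
As a minimal sketch of how the syslog message above could be searched for after the fact, the following Python function filters log lines for the quoted string.  The function name and the sample log lines are hypothetical and exist only for illustration; real syslog entries will carry different timestamps, host names, and surrounding text.

```python
from typing import Iterable, List

# Message text quoted in this document.
FATAL_BOOT_MARKER = "System booting after fatal error FATAL"


def find_fatal_reset_boots(lines: Iterable[str]) -> List[str]:
    """Return every log line that contains the post-fatal-reset boot message."""
    return [line.rstrip("\n") for line in lines if FATAL_BOOT_MARKER in line]


# Made-up sample lines for illustration only.
sample_log = [
    "Jan 10 03:12:44 host1 unix: System booting after fatal error FATAL\n",
    "Jan 10 03:12:50 host1 unix: ordinary boot message\n",
]

for hit in find_fatal_reset_boots(sample_log):
    print(hit)
```

In practice the function could be given an open log file instead of the sample list, for example /var/adm/messages, assuming that is where syslog writes on the system in question.  A match only shows that a fatal reset preceded the boot; the error register output still has to come from the console log.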

Additionally, each type of system and firmware revision has different OpenBoot PROM (OBP) parameters that control how it responds to a fatal reset.  Users should therefore consult product-specific documentation for more details.

Question:

How is a Fatal Reset identified?

Answer:

Fatal Reset error messages are only visible from the machine console or system controller. To see these messages it is necessary to log the output from the console.  

Additionally, Sun Fire[TM] system controllers usually have a small first-in, first-out (FIFO) ring buffer where data is logged.  The initial, relevant fatal reset message can be pushed out of the buffer by the output of the subsequent boot.  To avoid this, system controller output should be logged using syslogd(1M).  There are many documents and resources dedicated to console and system controller logging.
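
The effect of a small ring buffer can be illustrated with a short Python sketch.  The buffer capacity, the message text, and the list standing in for an external log host are all assumptions made for the example; they do not reflect the actual buffer size or message format of any system controller.

```python
from collections import deque

# Hypothetical capacity; the real buffer size depends on the system controller.
BUFFER_LINES = 4

ring = deque(maxlen=BUFFER_LINES)   # oldest entries are discarded when full
external_log = []                   # stands in for an external logging host

# One fatal reset message followed by the output of the subsequent boot.
messages = ["Fatal Reset: <error register dump>"] + [
    f"boot message {n}" for n in range(1, 6)
]

for msg in messages:
    ring.append(msg)
    external_log.append(msg)

fatal_in_ring = any(m.startswith("Fatal Reset") for m in ring)
fatal_in_external = any(m.startswith("Fatal Reset") for m in external_log)

print(f"still in controller buffer: {fatal_in_ring}")
print(f"still in external log: {fatal_in_external}")
```

Because the boot output is longer than the buffer, the fatal reset entry is gone from the ring by the time the boot finishes, while the external copy still holds it.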

Refer to Solution 211946 to capture Fatal Reset output for Sun systems.

Question:

How are Fatal Resets diagnosed?

Answer:

Although diagnosis is outside the scope of this document, there are many other InfoDocs and SRDBs related to this topic.  Generally, analyzing the type of error, as displayed in the Error Status Register (ESR) and the Asynchronous Fault Status Register (AFSR), together with the components involved, as displayed in the Asynchronous Fault Address Register (AFAR), will identify the component that caused the error.

Question:

Where can I find more information on "fatal resets"?

Answer:

Search both Sun Product Documentation and SunSolve using the keyword string "fatal resets" for the latest resources.

Contract customers may access additional SunSolve resources by logging into the repository with their unique username and password.

The username and password are created by contract customers as part of SunSolve On-line Registration, which requires Terms of Use acceptance and a Sun Support Contract Number.



Product
Sun Enterprise 6500 Server
Sun Enterprise 5500 Server
Sun Enterprise 4500 Server
Sun Enterprise 3500 Server

Internal Comments
Audited/updated 11/06/09 - [email protected], Mid-Range Systems Content Team

To report Fatal Resets, please refer to the following web site:

http://pts-americas.west/esg/msg/techinfo/platform/sunfire/fatal-resets/


For more information on Fatal Resets, please refer to:







  • Be aware that a bug exists such that system board 0 can
    incorrectly be identified as the failing component due to a timing
    problem.  See section 4.2 of document 910-4188 (above).








fatal, reset, FATAL, error
Previously Published As
51105

Change History
Date: 2006-01-18
User Name: 97961
Action: Update Canceled
Comment: *** Restored Published Content *** SSH AUDIT.
[email protected]
KDO Knowledge Engineer
Version: 0
Date: 2006-01-18
User Name: 97961
Action: Update Started
Comment: SSH AUDIT.
[email protected]
KDO Knowledge Engineer
Version: 0
Date: 2006-01-17
User Name: 97961
Action: Update Canceled
Comment: *** Restored Published Content *** SSH AUDIT.
[email protected]
KDO Knowledge Engineer
Version: 0
Date: 2006-01-17
User Name: 97961
Action: Update Started
Comment: SSH AUDIT.


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.