Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Sun Alert Sure Solution
1277973.1: Sun4v CMT Systems May Hang/Panic/Reset or Power Off as a Result of Handling Correctable or Retryable Events

Applies to:
Sun SPARC Enterprise T5140 Server - Version: Not Applicable
Sun SPARC Enterprise T5240 Server - Version: Not Applicable and later [Release: N/A and later]
Sun SPARC Enterprise T5440 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Blade T6340 Server Module - Version: Not Applicable and later [Release: N/A and later]
Sun Netra T5440 Server - Version: Not Applicable and later [Release: N/A and later]
Sun SPARC Sun OS
________________________________
Date of Resolved Release: 01-Jan-2011
________________________________

Description
Multi-socket Sun4v CMT systems, when handling fault events, may under certain conditions lose system availability, resulting in a system reset, a panic, or possibly a system hang. This occurs when handling events that require data from a remote CPU node: the fault condition is triggered by the use of an uninitialised register.

Likelihood of Occurrence
This issue can occur on the following platforms:
Sun4v CMT Systems:
Notes:
1. No other Blade, Enterprise, or Netra systems are affected by this issue.
2. There is no specific set of conditions likely to trigger this issue, nor any method of predicting when or how frequently it may occur. The risk of seeing this issue is regarded as low, but the impact on system availability is high since an outage may occur.

To determine the firmware version on the system, run one of the following commands from the ILOM:
   -> show HOST
or:
   sc> showhost

Possible Symptoms
While initialising a register as a result of reading data from a remote node, a multi-socket CMT system may experience a loss of availability, which takes one of the following forms:
- A Solaris panic, such as one of the following:
     panic: send_mondo timeout
- Hypervisor Abort (HVabort) followed by a system power off
- Red State Exception and system reset
- Watchdog Exception leading to a system reset
- System Hang

Workaround or Resolution
To resolve this issue, upgrade system firmware to 7.3.0 (or above), using the appropriate patch for the platform (see the patches listed under References).
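
The firmware check described above can be scripted once the version string has been read from the ILOM. The following is a minimal Python sketch, not a Sun-supplied tool: the function names `parse_firmware_version` and `needs_upgrade` are hypothetical, and the code assumes the version is supplied by hand as a dotted string (for example "7.2.10"), since the exact output format of the ILOM commands is not shown in this document. Any trailing non-numeric suffix is ignored, which is also an assumption.

```python
import re

# Minimum fixed firmware level, taken from the Resolution section (7.3.0 or above).
MIN_FIXED_VERSION = (7, 3, 0)


def parse_firmware_version(text):
    """Return the leading dotted numeric part of a version string as a tuple of ints.

    Assumption: the caller passes only the version field, e.g. "7.2.10".
    A trailing non-numeric suffix (e.g. ".e") is ignored.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def needs_upgrade(version_text, minimum=MIN_FIXED_VERSION):
    """Return True if the given firmware version is older than the minimum."""
    version = parse_firmware_version(version_text)
    # Pad both tuples with zeros so that "7.3" compares equal to "7.3.0".
    width = max(len(version), len(minimum))
    padded = version + (0,) * (width - len(version))
    floor = tuple(minimum) + (0,) * (width - len(minimum))
    return padded < floor


if __name__ == "__main__":
    for candidate in ("7.2.10", "7.3.0", "7.10.1"):
        state = "upgrade required" if needs_upgrade(candidate) else "at or above 7.3.0"
        print(candidate, "->", state)
```

Comparing integer tuples rather than raw strings avoids treating "7.10.1" as older than "7.3.0", which a plain string comparison would do.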

Modification History
Date of Resolved Release: 01-Jan-2011

Internal Comments:
Please send technical questions to the following email: [email protected] and copy the Responsible Engineer and Knowledge Analyst.

6983478 - Multi-node systems crashing after CE due to incorrect rerouting code

CR 6983478 relates to a coding issue within Hypervisor where an uninitialised register is used to dereference the internal error table entry when reading data, such as Error Status Registers, from a remote node. Because the register is uninitialised, dereferencing it can cause a number of failures; what is most often seen is a system panic, hang, or reset, or an hvabort. This issue only applies to multi-socket CMT systems, and it has been seen by a number of customers to date. For more in-depth detail on this issue, please review the CR referenced above.

Internal Contributor/Submitter: [email protected], [email protected]
Internal Eng Responsible Engineer: [email protected]
Internal Services Knowledge Analyst: [email protected]
Internal Eng Business Unit Group: Systems Group-SVS (SPARC Volume Systems, Horizontal Systems (includes T2000/Ontario))
Internal Escalation ID: 2-8213384

References
<SUNPATCH:145676-01>
<SUNPATCH:145677-01>
<SUNPATCH:145678-01>
<SUNPATCH:145679-01>
<SUNPATCH:145680-01>
<SUNBUG:6983478>

Attachments
This solution has no attachment