Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction Sure
Solution 1008923.1: Sun Enterprise[TM] 3x00-6x00 servers: Data collection advice for unplanned system reboots
Previously Published As 212278
Applies to:
Sun Enterprise 3500 Server
Sun Enterprise 4500 Server
Sun Enterprise 5500 Server
Sun Enterprise 6500 Server
Sun Enterprise 3000 Server - Version: Not Applicable and later [Release: NA and later]
All Platforms

Goal
The goal of this document is to suggest what data should be collected when a Sun Enterprise 3x00, 4x00, 5x00, or 6x00 system encounters an unplanned system reboot. "Unplanned" covers any of the ways one might describe the system suddenly resetting (a crash, panic, reset, or some other name). Each of these names actually has a different technical meaning. The purpose of this article is to set expectations about the data required to determine the root cause of the event that was encountered. This article does not describe the actual resolution of the event, only how and what data is needed for the support engineer to perform the diagnosis.

Background: Sun Enterprise 3x00-6x00 servers may experience unplanned reboots for different reasons. Basic failure analysis methodology might require transferring large core files for analysis, but this is not necessary in some cases. While this document will NOT completely eliminate the need for core file analysis, the goal is to reduce the diagnosis time for those failures where the logged error messages are sufficient and transferring large core files is unnecessary.

Solution
What to do after a system encounters an "Unplanned Reboot":
First, assuming the system has recovered, look in the /var/adm/messages file and try to validate what type of system event was encountered. On this type of platform, a few events cause the majority of unplanned reboots:
Normal Reboots
In /var/adm/messages, a "normal reboot" simply shows messages indicating that the system is being rebooted.
If no errors are seen and you cannot confirm that anyone deliberately rebooted the system, please collect Explorer data from the system, and console logs if possible. Provide this data to support and an attempt will be made to identify what happened.
Power Failures
A power failure usually leaves very few traces of the event in the messages file.
Fatal Reset (aka Fatal Error) event
You can easily validate whether a system rebooted due to a Fatal Reset event, because the /var/adm/messages file will show the following message:
System booting after fatal error FATAL
A Fatal Reset or Fatal Error event is a hardware fault that affects system integrity. Fatal Resets do not generate Solaris core files, so error analysis depends primarily on the messages captured from the server's system console. Data Requirements to diagnose a Fatal Reset event are:
Fatal Reset
If console output is not available, obtain an Explorer from the system in question. With luck, the event can be diagnosed from this data, but that is not guaranteed. For this reason, configuring a console loghost is not a suggestion, it's a necessity.
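As a minimal sketch of the messages-file check described for Fatal Resets, the following Python snippet scans log lines for the marker text. The function name find_fatal_reset_lines is hypothetical and only the marker string itself comes from this article. It assumes a copy of the messages file is available on a machine with Python 3.

```python
# Marker text quoted in this article for a Fatal Reset reboot.
FATAL_MARKER = "System booting after fatal error"


def find_fatal_reset_lines(lines):
    """Return (line_number, text) pairs for lines containing the marker.

    `lines` is any iterable of strings, for example an open file object
    for a copy of /var/adm/messages. (Hypothetical helper, not a Sun tool.)
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if FATAL_MARKER in line:
            hits.append((number, line.rstrip("\n")))
    return hits
```

To use it, open the messages file (for example with open("/var/adm/messages", errors="replace")) and pass the file object to find_fatal_reset_lines. An empty result does not rule out a Fatal Reset; console output remains the primary source of data.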
Solaris Panic (crash, core dump, etc)
If a reboot was the result of a panic, some determination of the nature of the panic can be made from the messages in /var/adm/messages. While a full analysis of the core file is always preferred, Solaris panics caused by multi-bit ECC hardware errors usually leave messages that are sufficient for a diagnosis with a reasonable level of certainty. In these situations it is not always necessary to provide support with the core file.
To determine whether the source of the error is hardware ECC related, look for one of the following errors in the /var/adm/messages file or console log:
WARNING: [AFT1] EDP event on CPU1 Instruction access at TL=0, errID 0x0000ad88.6cd9989f
If you do see error messages similar to the one above, the support engineer really only needs an Explorer data file to proceed with the diagnosis. The core file is usually not required.
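A rough sketch of how such a warning line could be picked apart when scanning a log follows. The pattern is modelled only on the single example message shown in this article, so other message variants may need a looser expression. The names AFT_PATTERN and parse_aft_line are hypothetical.

```python
import re

# Assumption: the pattern is derived from the one example line in this
# article and is not an exhaustive description of AFT messages.
AFT_PATTERN = re.compile(
    r"\[(?P<tag>AFT\d+)\]\s+(?P<event>\S+) event on CPU(?P<cpu>\d+)"
    r".*?errID (?P<err_id>0x[0-9a-fA-F]+\.[0-9a-fA-F]+)"
)


def parse_aft_line(line):
    """Return a dict with tag, event, cpu and err_id, or None if no match."""
    match = AFT_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["cpu"] = int(fields["cpu"])  # CPU number as an integer
    return fields


example = (
    "WARNING: [AFT1] EDP event on CPU1 Instruction access at TL=0, "
    "errID 0x0000ad88.6cd9989f"
)
print(parse_aft_line(example))
```

For the example line, the returned fields identify the AFT1 tag, the EDP event, CPU 1, and the errID value. Lines that do not match return None and still call for the usual Explorer and console log review.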
Failure determinations based on /var/adm/messages alone can only be made when one of the above acronyms appears in the messages file.

Internal Only Information on memory errors above:
For failures that indicate a memory error event such as those listed above, the system provides a limited interpretation of the failure, which can aid in identifying the suspect component. For each error, each component named in the error is assigned a "Score" value between 5 and 95. The higher the score, the higher the probability that the indicated part is at fault. A part implicated with a "Score" of 95 should be considered the primary candidate for replacement, unless multiple parts are assigned a "Score" of 95. In the example above, CPU1 was the only part assigned a "(Score 95)".

Best Practices - the short version:
Mirrored E-Cache (Sombra): Swap after first failure.
Unmirrored E-Cache: Swap after second failure.
Cediag and Findaft enforce these rules. Follow their recommendations.

Internal Reference: http://ittdev.east.sun.com/TechTalk/Fatal/
Previously Published As 50348
Attachments: This solution has no attachment