Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type Problem Resolution Sure Solution

1013069.1 : Sun Blade[TM] 8000 Modular System: Certain low-level Chassis Monitoring Modules (CMM) faults cannot be manually repaired using normal procedures
PreviouslyPublishedAs 217902

Symptoms

There is a small group of Sun Blade[TM] 8000 Modular System Chassis Monitoring Module (CMM) faults that cannot be cleared using the normal fault-clearing procedures. A process running on the CMM, known as the Autonomous Fault Handler (AFH), monitors and handles low-level faults in the CMM hardware, including the embedded network switch.

Normal chassis faults, including CMM faults, are reported to and diagnosed by the high-level CMM fault management architecture, which logs the fault and lights the appropriate indicators. All such faults can be manually repaired from the CMM as follows:
AFH faults are detected at a lower level on the CMM than the chassis fault management and therefore do not interface with the normal chassis fault architecture. When a module is faulted by AFH, it is either rebooted or held in a reset state, depending on the nature of the fault.

In the "Reset" or "Reboot" states, if there is a redundant Standby CMM, it will take over as Active during the reboot, and the original CMM will come up as Standby.

In the "Reset and Hold" state, the CMM fault LED will be lit amber, the CMM hot-plug ready-to-remove LED will be lit blue, the CMM ok/normal LED will be off, and all network port LEDs will be on. In redundant CMM configurations, the operating CMM will transition to the Stand-Alone state.

The following fault types are detected by AFH, along with the action that occurs upon each:

Hardware Faults
Software Faults
Not all of the above fault types are logged in the CMM event log; only those that trigger a reset or reboot are logged. The following example, as seen in the CMM event log, shows an AFH-detected reset and hold fault caused by an external network packet flood that triggered the switch error and initiated a failover to the redundant CMM:

-> show /CMM/logs/event/list
...
655 Mon May 1 09:03:41 2000 System Log critical Peer reset and hold requested, reset in 10 seconds.
...
->

Resolution

There are 4 possible ways to clear AFH type faults:
If it is determined that all of the below are true:
then the CMM may have a low-level hardware fault and should be replaced.

Product

Sun Blade 8000 P Modular System
Sun Blade 8000 Modular System

Internal Comments

To determine the root cause of an AFH-detected fault in more detail, it is necessary to log in to the operating CMM as "sunservice", gather the output of "/usr/local/bin/collectDebugInfo.sh", scp it off the CMM, and analyze the various Linux and daemon log files. If this level of debug is necessary, an escalation should be opened.

Note: In ILOM 1.1.5, a level of debug was left enabled that caused "kill -9 & Restart process" faults to be logged in the customer-viewable CMM event log. If a process is not responding, it is restarted. If it continues not to respond, it is restarted repeatedly, which should not interfere with normal operation of the CMM. As such, these events should not be logged in the CMM event log, only in the sunservice-level process daemon logs. Do not replace a CMM for these events; ensure the firmware is current and, if the problem persists, open an escalation.

blade, 8000, chassis, fault, cmm, afh, reset, hold, network, peer

Previously Published As 90623

Change History

Date: 2007-09-25
User Name: 97961
Action: Approved
Comment:
- Applied trademarking where it is missing
- Changed title to comply to the standard format
- Made simple sentence/grammatical corrections
Version: 4

Date: 2007-09-25
User Name: 97961
Action: Accept
Comment:
Version: 0

Date: 2007-09-25

Attachments

This solution has no attachment
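
When reviewing a saved copy of the CMM event log, entries such as the "Peer reset and hold requested" line quoted in the Symptoms section can be picked out with a small script. The following is a minimal sketch, not a supported tool. The line layout is inferred only from the single sample entry in this article, the severity names other than "critical" are assumptions, and the names EventEntry, parse_entry, and find_reset_and_hold are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class EventEntry(NamedTuple):
    event_id: int
    timestamp: str
    event_class: str
    severity: str
    message: str


# Assumed line layout, inferred only from the one sample entry in this article:
#   <id> <weekday> <month> <day> <hh:mm:ss> <year> <class> <severity> <message>
# The severity list is an assumption; only "critical" appears in the sample.
_ENTRY = re.compile(
    r"^\s*(?P<id>\d+)\s+"
    r"(?P<ts>\w{3}\s+\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4})\s+"
    r"(?P<cls>.+?)\s+"
    r"(?P<sev>critical|major|minor)\s+"
    r"(?P<msg>.*)$"
)


def parse_entry(line: str) -> Optional[EventEntry]:
    """Parse one event log line; return None for prompts, '...' and other text."""
    m = _ENTRY.match(line)
    if m is None:
        return None
    return EventEntry(int(m["id"]), m["ts"], m["cls"], m["sev"], m["msg"].strip())


def find_reset_and_hold(log_text: str) -> List[EventEntry]:
    """Return entries whose message mentions a reset and hold request."""
    hits = []
    for line in log_text.splitlines():
        entry = parse_entry(line)
        if entry is not None and "reset and hold" in entry.message.lower():
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # Sample text copied from the event log excerpt in this article.
    sample = (
        "-> show /CMM/logs/event/list\n"
        "...\n"
        "655 Mon May 1 09:03:41 2000 System Log critical "
        "Peer reset and hold requested, reset in 10 seconds.\n"
        "...\n"
        "->\n"
    )
    for e in find_reset_and_hold(sample):
        print(e.event_id, e.timestamp, e.severity, e.message)
```

Lines that do not match the assumed layout are skipped rather than treated as errors, so the script tolerates CLI prompts and elided output. A match only indicates that a reset and hold was requested; root-cause analysis still requires the debug output described under Internal Comments.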