Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1013069.1
Update Date: 2009-06-23
Keywords:

Solution Type  Problem Resolution Sure

Solution  1013069.1 :   Sun Blade[TM] 8000 Modular System: Certain low-level Chassis Monitoring Modules (CMM) faults cannot be manually repaired using normal procedures  


Related Items
  • Sun Blade 8000 System
  • Sun Blade 8000 P System
Related Categories
  • GCS>Sun Microsystems>Servers>Blade Servers

PreviouslyPublishedAs
217902


Symptoms
A small group of Sun Blade[TM] 8000 Modular System Chassis Monitoring Module (CMM) faults cannot be cleared using the normal fault clearing procedures. A process on the CMM known as the Autonomous Fault Handler (AFH) monitors and handles low-level faults in the CMM hardware, including the embedded network switch.

Normal chassis faults, including CMM faults, are reported to and diagnosed by the high-level CMM fault management architecture, which logs the fault and lights the appropriate indicators. All such faults can be manually repaired from the CMM as follows:

  • On the Web interface, click "CMM" on the left side, select the "System Information" tab and then the "Components" sub-tab, select the faulted component, and select "Clear Faults"
  • On the command line, via SSH or the serial port to the CMM, enter "-> cd /CH/CMM" (or the path of the component of interest), then "-> set clear_service_action=true"
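
The command-line procedure above amounts to a two-command sequence per component. The following is a minimal sketch that only builds that sequence as text. The function name clear_fault_commands is hypothetical, and opening the SSH or serial session to the CMM is deliberately left out.

```python
def clear_fault_commands(component="/CH/CMM"):
    """Return the ILOM CLI commands that clear a fault on one component.

    The command text is taken from the procedure above. The function name
    and the idea of passing the component path as an argument are
    illustrative assumptions, not part of ILOM.
    """
    if not component.startswith("/"):
        raise ValueError("component must be an absolute ILOM target path")
    return [f"cd {component}", "set clear_service_action=true"]


if __name__ == "__main__":
    # Print the sequence as it would be typed at the ILOM "->" prompt.
    for command in clear_fault_commands("/CH/CMM"):
        print("->", command)
```

The output is meant to be typed or pasted at the "->" prompt; it is not a substitute for the documented procedure.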

AFH faults are detected at a lower level on the CMM than the chassis fault management and therefore do not interface with the normal chassis fault architecture. When a module is faulted by AFH, it is either rebooted or held in a reset state, depending on the nature of the fault. For a "Reset" or "Reboot" action, if there is a redundant Standby CMM, it takes over as Active during the reboot and the original CMM comes up as Standby. In the "Reset and Hold" state, the CMM fault LED is lit amber, the CMM hot-plug ready-to-remove LED is lit blue, the CMM ok/normal LED is off, and all network port LEDs are on. In redundant CMM configurations, the operating CMM transitions to the Stand-Alone state.
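
The role changes described above can be pictured as a small state update on a redundant CMM pair. The sketch below models only what this paragraph states. The class name Cmm, the function apply_afh_action, and the role label "Held in reset" are hypothetical names chosen for illustration, not ILOM terms.

```python
from dataclasses import dataclass


@dataclass
class Cmm:
    name: str
    role: str  # "Active", "Standby", "Stand-Alone", or "Held in reset"


def apply_afh_action(faulted, peer, action):
    """Update the roles of a redundant CMM pair after AFH acts on `faulted`.

    "Reset" and "Reboot" swap Active and Standby; "Reset & Hold" holds the
    faulted CMM in reset and leaves the operating peer Stand-Alone.
    """
    if action in ("Reset", "Reboot"):
        if faulted.role == "Active" and peer.role == "Standby":
            peer.role = "Active"      # Standby takes over during the reboot
            faulted.role = "Standby"  # original comes back up as Standby
    elif action == "Reset & Hold":
        faulted.role = "Held in reset"
        peer.role = "Stand-Alone"
    else:
        raise ValueError(f"unknown AFH action: {action}")
    return faulted, peer


if __name__ == "__main__":
    cmm0, cmm1 = Cmm("CMM0", "Active"), Cmm("CMM1", "Standby")
    print(apply_afh_action(cmm0, cmm1, "Reboot"))
```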

AFH detects the following fault types and takes the listed action for each:

Hardware Faults

  1. Switch Interface Errors, including shutdown of the switch due to external network activity that may flood the internal chassis management network. Action= Reset & Hold
  2. I2C Errors Action= Reboot
  3. Chassis Monitoring Status Register Access Errors Action= Reboot
  4. Power Supply RS485 Protocol Errors Action= Reboot
  5. CMM CPU Machine Check Errors Action= Reset
  6. CMM Memory ECC Correctable Errors Action= None

Software Faults

  1. Platform IPMI Interface Hung Action= Reboot
  2. ILOM Daemon Processes Hung Action= Kill -9 and Restart process
  3. Active/Standby Monitoring Processes Hung Action= 3 actions are taken sequentially:
       1. The Standby usurps mastership.
       2. The new Active places the other CMM into reset & hold to prevent it from restarting.
       3. The new Active transitions to the "Stand-Alone" state and stops negotiating mastership to prevent the Active role from bouncing between CMMs. It stays in this state until the other CMM is manually cleared of its fault.
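
For quick reference, the two lists above can be collapsed into a lookup table. The sketch below is only a restatement of those lists as data. The names AFH_ACTIONS, action_for and faults_with_action are hypothetical, and the three-step action for the hung Active/Standby monitoring processes is abbreviated to a single string.

```python
# Fault type -> AFH action, transcribed from the Hardware and Software lists.
AFH_ACTIONS = {
    "Switch Interface Errors": "Reset & Hold",
    "I2C Errors": "Reboot",
    "Chassis Monitoring Status Register Access Errors": "Reboot",
    "Power Supply RS485 Protocol Errors": "Reboot",
    "CMM CPU Machine Check Errors": "Reset",
    "CMM Memory ECC Correctable Errors": "None",
    "Platform IPMI Interface Hung": "Reboot",
    "ILOM Daemon Processes Hung": "Kill -9 and Restart process",
    # Abbreviation of the three sequential steps listed above.
    "Active/Standby Monitoring Processes Hung":
        "Standby takes over, other CMM Reset & Hold, new Active Stand-Alone",
}


def action_for(fault):
    """Return the AFH action for a fault type named as in the lists above."""
    return AFH_ACTIONS[fault]


def faults_with_action(action):
    """Return every fault type whose action matches `action` exactly."""
    return [fault for fault, act in AFH_ACTIONS.items() if act == action]


if __name__ == "__main__":
    print(action_for("Switch Interface Errors"))
    print(faults_with_action("Reboot"))
```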

Not all of the above fault types are logged in the CMM event log; only those that trigger a reset or reboot are. The following example from the CMM event log shows an AFH-detected reset and hold fault: an external network packet flood triggered the switch error and initiated a failover to the redundant CMM:

-> show /CMM/logs/event/list
...
655    Mon May  1 09:03:41 2000  System    Log       critical
Peer reset and hold requested, reset in 10 seconds.
...
->
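
When a saved copy of the event log is large, entries like the one above can be located with a small script. The sketch below assumes the two-line layout seen in the sample (a header line with ID, timestamp and three further columns, followed by the message line). The column names used in the regular expression and the function name find_reset_and_hold are assumptions made for illustration.

```python
import re

# Header layout inferred from the sample entry above; the group names
# (cls, type, severity) are assumed labels for the last three columns.
HEADER = re.compile(
    r"^(?P<id>\d+)\s+"
    r"(?P<time>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s+"
    r"(?P<cls>\S+)\s+(?P<type>\S+)\s+(?P<severity>\S+)\s*$"
)


def find_reset_and_hold(log_text):
    """Return event-log entries whose message mentions "reset and hold"."""
    events = []
    lines = log_text.splitlines()
    # Each header is paired with the line that follows it (the message).
    for i, line in enumerate(lines[:-1]):
        match = HEADER.match(line.strip())
        if match and "reset and hold" in lines[i + 1].lower():
            events.append({
                "id": int(match["id"]),
                "time": match["time"],
                "severity": match["severity"],
                "message": lines[i + 1].strip(),
            })
    return events


if __name__ == "__main__":
    sample = (
        "655    Mon May  1 09:03:41 2000  System    Log       critical\n"
        "Peer reset and hold requested, reset in 10 seconds.\n"
    )
    for event in find_reset_and_hold(sample):
        print(event["id"], event["severity"], event["message"])
```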


Resolution
There are 4 possible ways to clear AFH type faults:
  1. Determine and resolve what external network device or traffic may be causing the CMM to protect itself from a denial-of-service type event
  2. In a redundant CMM configuration, hard reset the faulted CMM using the Web interface:
    Click "CMM" on the left side, select the "Maintenance" tab, and then the "Reset Components" sub-tab. Select the CMM that is faulted, either /CH/CMM0 or /CH/CMM1, and select "Reset" from the drop-down. Wait approximately 2 to 3 minutes for the CMM to reset and reboot.
  3. Press the push-pin hard reset button on the faulted CMM. This is the left pinhole button (to the right of the main LEDs and buttons), labeled ->*<-
  4. Remove the CMM, wait 60 seconds for it to fully shut down, then re-insert the CMM into the chassis. The CMM is designed to be hot-swappable and can be safely removed from the chassis while the chassis is powered on.

If it is determined that all of the below are true:

  1. the AFH fault continually re-occurs on 1 specific CMM only when in Active state, AND
  2. the CMM is running the latest ILOM firmware (with known bugs fixed), AND
  3. external network activity has been ruled out as a factor

then the CMM may have a low-level hardware fault and should be replaced.
The CMM is a customer replaceable unit (CRU).
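
The three replacement conditions above form a simple all-must-hold rule. The sketch below expresses that rule as a checklist so that any unmet condition is reported by name. The function and parameter names are hypothetical, and the answers themselves still have to come from the troubleshooting described in this document.

```python
def should_replace_cmm(recurs_on_one_cmm_when_active,
                       firmware_is_latest,
                       network_ruled_out):
    """Return (replace, unmet), where `unmet` lists the failed conditions.

    Replacement is suggested only when all three conditions listed above
    are true.
    """
    checks = {
        "fault recurs on one specific CMM when Active":
            recurs_on_one_cmm_when_active,
        "CMM runs the latest ILOM firmware": firmware_is_latest,
        "external network activity ruled out": network_ruled_out,
    }
    unmet = [name for name, ok in checks.items() if not ok]
    return (not unmet, unmet)


if __name__ == "__main__":
    print(should_replace_cmm(True, False, True))
```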



Product
Sun Blade 8000 P Modular System
Sun Blade 8000 Modular System

Internal Comments
To further determine the root cause of an AFH-detected fault, it is necessary to log in to the operating CMM as "sunservice", gather the "/usr/local/bin/collectDebugInfo.sh" output, scp it off the CMM, and analyze the various Linux and daemon log files. If this level of debug is necessary, an escalation should be opened.

Note: In ILOM 1.1.5, a level of debug was left enabled that caused "kill -9 & Restart process" faults to be logged in the customer-viewable CMM event log. If a process is not responding, it is restarted. If it continues not to respond, it is restarted repeatedly, which should not interfere with normal operation of the CMM. As such, these events should not be logged in the CMM event log, only in the sunservice-level process daemon logs. Do not replace a CMM for these events; ensure the firmware is current and, if the problem persists, open an escalation.
Keywords: blade, 8000, chassis, fault, cmm, afh, reset, hold, network, peer
Previously Published As
90623

Change History
Date: 2007-09-25
User Name: 97961
Action: Approved
Comment: - Applied trademarking where it is missing
- Changed title to comply to the standard format
- Made simple sentence/grammatical corrections
Version: 4
Date: 2007-09-25
User Name: 97961
Action: Accept
Comment:
Version: 0
Date: 2007-09-25

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.