Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-79-1392829.1
Update Date:2012-01-24

Solution Type  Predictive Self-Healing Sure

Solution  1392829.1 :   Netra CT900: How to read ShMM's debug.log file  


Related Items
  • Sun Netra CT900 Server
Related Categories
  • PLA-Support>Sun Systems>x64>Blades>SN-x64: TELCO-BL-NETRA
  • .Old GCS Categories>Sun Microsystems>Boards>NEBS-Certified Servers
  • .Old GCS Categories>Sun Microsystems>Servers>Midrange V and Netra Servers


When debugging Netra CT900 problems, especially chassis related problems, the /tmp/debug.log file from the ShMM is a very useful resource for identifying the possible root cause (RC). This article shows how to read the ShMM's /tmp/debug.log file.

In this Document
  Purpose
  Netra CT900: How to read ShMM's debug.log file
     Data Collecting
     Content of debug.log file
     Type of Issues
     Details of debug.log file
     APPENDIX
  References


Applies to:

Sun Netra CT900 Server - Version: Not Applicable and later   [Release: N/A and later ]
Information in this document applies to any platform.

Purpose

For Netra CT900 related problems, the /tmp/debug.log file from the ShMM is very often required to investigate the possible root cause (RC), especially when the problem is chassis related.

This article shows how to read through the /tmp/debug.log file and which points are of interest for certain types of problems.

Netra CT900: How to read ShMM's debug.log file

Data Collecting

If the /tmp/debug.log file does not exist on the "Active" ShMM, run the /etc/summary script to generate it:


# ./summary
PING 10.133.104.1 (10.133.104.1): 56 data bytes
84 bytes from 10.133.104.1: icmp_seq=0 ttl=255 time=8.4 ms

--- 10.133.104.1 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 8.4/8.4/8.4 ms
send debug file /tmp/debug.log to PPS
#

Content of debug.log file

The /tmp/debug.log file is the ShMM equivalent of Explorer output. Each section is introduced by a header. The headers in a /tmp/debug.log file are:
  • Shell Environment Variables
  • U-Boot Environment Variables
  • Network Interfaces
  • Network Routing
  • Current Process List
  • CPLD State
  • Shelfman version
  • Shelfman status
  • ShMC LAN Configuration
  • Active RMCP Sessions
  • PEF Configuration
  • Board Information
  • Detailed IPMC Information
  • Detailed FRU Information
  • Fan List
  • Cooling State
  • Fan State
  • FRU Sizes
  • Shelf FRU Info
  • System Event Log
  • Shelfman output to syslog
Each section contains some useful information. Here are some of the most important ones:

  • Network Interfaces: Contains the IP addresses and shows whether the ShMM uses a 2x IP configuration (one floating IP address that stays with the "Active" ShMM and another that stays with the "Backup" ShMM) or a 3x IP configuration (the top and bottom ShMMs each have their own static IP address, plus one floating IP address that follows the "Active" ShMM, whether it is the top or the bottom one). The 3x IP configuration usually uses the eth0:1 interface.
  • Shelfman Version: Contains the firmware version of the ShMM.
  • Shelfman Status: Shows whether the ShMM is "Active" or "Backup". The debug.log from the Active ShMM is usually more useful because many commands are not allowed on the Backup ShMM.
  • ShMC IPMB Address: Shows whether the ShMM is the top one (shm1, with IPMB address 0x10) or the bottom one (shm2, with IPMB address 0x12).
  • Board Information: Generic information for each slot. It usually shows the type of blade, the A/RTM, and their current Hot Swap State.
  • Detailed IPMC Information: Details of the IPMC records. The ShMM also brokers the handshake between switch blades and computing blades, so these records can also show whether a computing blade is connected to the switch blades.
  • Fan List
  • Cooling State
  • Fan State: These three sections describe the chassis fan trays and the temperature sensors (chassis and blades).
  • System Event Log: Event log of the interactions between the ShMM and the chassis components.
  • Shelfman output to syslog: Syslog of the ShMM.

Type of Issues

The debug.log file can help locate the root cause of several types of chassis problems.

FAN TRAY/TEMPERATURE ISSUES

For fan tray/temperature related issues:
  1. Check the "Fan List", "Cooling State", and "Fan State" sections to confirm that all states are "Normal".
  2. Check whether the SEL (System Event Log) contains events for the temperature sensors listed in "Cooling State", for example a sensor reading that crosses the "upper non-recoverable" threshold.
  3. Use "[clia] sensordata <addr> <sensor #>" to obtain the current reading of a sensor.
With the above information, it should be possible to determine where the problem is and what might have happened.

NOTE: The list of sensors can also be obtained with the following command from the ShMM:

# clia sensor <addr> | grep Sensor

where <addr> could be <IPMB address of blade>, "board <N>", or "shm <N>".

PEM (Power Entry Module) ISSUES

PEM related issues do not need the debug.log file, because most of the required data is collected from the "[clia] sensordata" and "[clia] getfruledstate" commands.
  1. Use "[clia] getfruledstate 20 [6|7]" to determine the current state of the PEM: OK (GREEN), OOS (RED/AMBER), or Ready to Remove (BLUE).
  2. All PEM related sensors can be listed with the following three commands: "[clia] sensor 20 | grep PME", "[clia] sensor 20 | grep 48" and "[clia] sensor 20 | grep V"
  3. Use "[clia] sensordata 20 <sensor #>" to check the sensor reading and determine whether the PEM is working properly.
In the above, "20" is the IPMB address of the chassis/shelf itself, and "6" and "7" are the FRU numbers of PEM A and PEM B.

POWER ISSUES

  1. Check the SEL and group all events from the same blade (same IPMB address).
  2. Use "[clia] sensor board <slot #> | grep Sensor | grep V" to obtain a list of the voltage sensors on a blade.
  3. Match the sensor # from the SEL against the list from step 2.
  4. If the SEL often shows voltage sensors dropping below the "lower non-recoverable" threshold, there is a problem, whether PEM or board related. Examine the SEL and other data (such as the ShMM syslog at debug level) carefully to determine the true root cause.
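Step 1 can be sketched in Python. This is a minimal illustration, assuming that each SEL line carries a "from:(IPMB, FRU, LUN)" field as in the SEL example in this article; the function name group_sel_by_ipmb is hypothetical and not part of any ShMM tool.

```python
import re
from collections import defaultdict

# Assumption: the originator field looks like "from:(0x84,0,0)",
# as in the SEL example quoted in this article.
FROM_RE = re.compile(r"from:\((0x[0-9a-fA-F]+),")


def group_sel_by_ipmb(lines):
    """Return {ipmb_address: [SEL lines]}; lines without a from:(...) field are skipped."""
    groups = defaultdict(list)
    for line in lines:
        match = FROM_RE.search(line)
        if match:
            groups[int(match.group(1), 16)].append(line.strip())
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "0x0010: Event: at Jan 10 17:01:11 2011; from:(0x84,0,0); "
        "sensor:(0x07,4); event:0x3(asserted): 0x00 0xFF 0xFF",
    ]
    for ipmb, events in sorted(group_sel_by_ipmb(sample).items()):
        print(f"0x{ipmb:02x}: {len(events)} event(s)")
```

Grouping by IPMB address first makes it easier to see whether the voltage events cluster on one blade or are spread across the shelf.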

BLADE ISSUES

  1. Follow the troubleshooting steps of similar architecture (UltraSPARC or x86/x64) to determine whether the problem is within the blade or chassis/shelf related.
  2. If the RC is within the blade, debug.log file is irrelevant.

FIRMWARE UPGRADE ISSUES

  1. The upgrade log is more important.
  2. debug.log file is irrelevant.

NETWORK ISSUES

Network issues are outside the scope of this article.

Details of debug.log file

Network Interfaces

This is the "ifconfig -a" output of the ShMM. It shows several interfaces:
  • eth0     The main ShMM network interface
  • eth0:1   Used in the 3x IP configuration
  • eth1     Not accessible
  • usb0     Not accessible
  • usb1     Not accessible (usb0 and usb1 are used as the communication channel between shm1 and shm2 for fail-over purposes)
  • vlan55   Interface for the Netconsole (or NetConsole) feature

Shelfman Version

Pigeon Point Shelf Manager ver. 2.6.4-R3U3-RR
Pigeon Point and the stylized lighthouse logo are trademarks of Pigeon Point Systems.
Copyright (c) 2002-2008 Pigeon Point Systems
All rights reserved
Build date/time: Jan 11 2010 05:48:20
Carrier: ACB
Carrier subtype: 3; subversion: 0

Shelfman Status

"Active" or "Backup"

ShMC IPMB Address

Local IPMB Address = 0x10

0x10 is shm1 (top), and 0x12 is shm2 (bottom)

Board Information

This section shows the type of blade and its Hot Swap State:

Physical Slot # 3
92: Entity: (0xa0, 0x60) Maximum FRU device ID: 0x02
    PICMG Version 2.2
    Hot Swap State: M4 (Active), Previous: M3 (Activation In progress), Last State Change Cause: Normal State Change (0x0)

92: FRU # 0
    Entity: (0xa0, 0x60)
    Hot Swap State: M4(Active), Previous: M3 (Activation In progress), Last State Change Cause: Normal State Change (0x0)
    Device ID String: "NetraCP-3060"

92: FRU # 2 (AMC # 1)
    Entity: (0xa0, 0x61)
    Hot Swap State: M4(Active), Previous: M3 (Activation In progress), Last State Change Cause: Normal State Change (0x0)
    Device ID String: "375-3470-01"

Here are the explanations of the Hot Swap States:

State | Summary                  | Explanation
M0    | Not Installed            | FRU IPMC is not reachable.  All power to the FRU is off.  The blue LED is off.
M1    | FRU Inactive             | The FRU is installed and its IPMC is in communication with the ShMM.  The blue LED is on solid.  The FRU is not powered up, and none of its connectivity is active.  The next state is either M0 or M2.
M2    | FRU Activation Request   | The FRU IPMC is waiting for activation permission from the ShMM.  The blue LED has a long blink.  FRU removal is not safe.  The next state is M3.
M3    | Activation in Progress   | The FRU's IPMC requests power allocation from the ShMM.  The blue LED is off.  The FRU changes to state M4 when activation is complete.
M4    | FRU Active               | This is the normal FRU operational state.  The FRU is powered on and cannot be removed safely.  The blue LED is off.  The next state is either M5 or M6.
M5    | FRU Deactivation Request | The FRU's IPMC is requesting deactivation permission from the ShMM.  The blue LED shows a short blink.  The next state is M6.
M6    | Deactivation in Progress | The FRU is shutting down and its I/O connections are being deactivated.  The blue LED continues its short blink.  The next state is M1.
M7    | Communication Lost       | The ShMM has lost contact with the board IPMC, or the board IPMC has lost contact with its own FRUs.  This is an abnormal state.  The board should return to its previous state when IPMC communication is reestablished.
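When scanning the "Board Information" section, the current state can be pulled out of each "Hot Swap State:" line and looked up against the table above. The following Python sketch is illustrative only: the helper names current_state and classify are hypothetical, and the three-way classification is a reading aid based on this article's description of M4 as normal and M7 as abnormal, not an official rule.

```python
import re

# State summaries as listed in this article.
HOT_SWAP_STATES = {
    "M0": "Not Installed",
    "M1": "FRU Inactive",
    "M2": "FRU Activation Request",
    "M3": "Activation in Progress",
    "M4": "FRU Active",
    "M5": "FRU Deactivation Request",
    "M6": "Deactivation in Progress",
    "M7": "Communication Lost",
}

# Captures only the current state; the "Previous:" state is ignored here.
STATE_RE = re.compile(r"Hot Swap State:\s*(M[0-7])")


def current_state(line):
    """Return the current hot swap state ("M0".."M7") found in a line, or None."""
    match = STATE_RE.search(line)
    return match.group(1) if match else None


def classify(state):
    """Rough triage: M4 is normal, M7 is abnormal, everything else is inactive or in transition."""
    if state == "M4":
        return "normal"
    if state == "M7":
        return "abnormal"
    return "inactive or in transition"


if __name__ == "__main__":
    line = ("    Hot Swap State: M4 (Active), Previous: M3 (Activation In progress), "
            "Last State Change Cause: Normal State Change (0x0)")
    state = current_state(line)
    print(state, HOT_SWAP_STATES[state], classify(state))
```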


Fan List

Fan Tray state of the chassis:

20: FRU # 3
Current Level: 5
Minimum Speed Level: 0, Maximum Speed Level: 15
20: FRU # 4
Current Level: 5
Minimum Speed Level: 0, Maximum Speed Level: 15
20: FRU # 5
Current Level: 5
Minimum Speed Level: 0, Maximum Speed Level: 15

The fan speed level should be either 5 (min speed) or 15 (max speed).
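As an illustration, the "Fan List" section can be checked against this expectation with a short Python sketch. It assumes the section has exactly the layout shown above; parse_fan_list and unexpected_levels are hypothetical helper names.

```python
import re

FRU_RE = re.compile(r"^\s*([0-9a-fA-F]+):\s*FRU\s*#\s*(\d+)")
LEVEL_RE = re.compile(r"Current Level:\s*(\d+)")
RANGE_RE = re.compile(r"Minimum Speed Level:\s*(\d+),\s*Maximum Speed Level:\s*(\d+)")

# Levels this article describes as expected: 5 (min speed) or 15 (max speed).
EXPECTED_LEVELS = (5, 15)


def parse_fan_list(text):
    """Return one dict per fan tray FRU with its IPMB address, FRU number and levels."""
    fans = []
    current = None
    for line in text.splitlines():
        match = FRU_RE.match(line)
        if match:
            current = {"ipmb": int(match.group(1), 16), "fru": int(match.group(2))}
            fans.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first FRU line
        match = LEVEL_RE.search(line)
        if match:
            current["level"] = int(match.group(1))
        match = RANGE_RE.search(line)
        if match:
            current["min"] = int(match.group(1))
            current["max"] = int(match.group(2))
    return fans


def unexpected_levels(fans):
    """Return the fans whose current level is missing or not one of EXPECTED_LEVELS."""
    return [fan for fan in fans if fan.get("level") not in EXPECTED_LEVELS]


if __name__ == "__main__":
    sample = """20: FRU # 3
Current Level: 5
Minimum Speed Level: 0, Maximum Speed Level: 15
20: FRU # 4
Current Level: 5
Minimum Speed Level: 0, Maximum Speed Level: 15
20: FRU # 5
Current Level: 5
Minimum Speed Level: 0, Maximum Speed Level: 15"""
    fans = parse_fan_list(sample)
    for fan in fans:
        print(f"FRU {fan['fru']}: level {fan['level']} (range {fan['min']}-{fan['max']})")
    print("unexpected:", unexpected_levels(fans))
```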

Cooling State

Cooling state and list of all temperature sensors:


Cooling state: "Normal"
Sensor(s) at this state: (0x12,2,0) (0x90,41,0) (0x90,40,0) (0x90,6,0)
                         (0x94,42,0) (0x94,41,0) (0x94,40,0) (0x94,6,0)
                         (0x82,53,0) (0x82,52,0) (0x82,45,0) (0x82,44,0)
                         (0x82,37,0) (0x82,36,0) (0x82,20,0) (0x82,10,0)
                         (0x92,31,0) (0x92,30,0) (0x92,7,0) (0x92,6,0)
                         (0x92,5,0) (0x86,25,0) (0x86,24,0) (0x86,23,0)
                         (0x86,22,0) (0x86,21,0) (0x86,20,0) (0x86,19,0)
                         (0x96,31,0) (0x96,30,0) (0x96,29,0) (0x96,6,0)
                         (0x96,5,0) (0x96,4,0) (0x90,42,0) (0x88,6,0)
                         (0x86,26,0) (0x9c,4,0) (0x9c,3,0) (0x98,4,0)
                         (0x98,3,0) (0x9c,5,0) (0x98,5,0) (0x20,120,0)
                         (0x20,121,0) (0x20,122,0) (0x20,123,0) (0x20,124,0)
                         (0x20,125,0) (0x20,126,0) (0x20,200,0) (0x20,201,0)


Besides "Normal", the cooling state can be "Minor Alert", "Major Alert" or "Critical Alert". The severity levels map as follows:

IPMI            | PICMG 3.0 | Telco    | Meaning
Non-Critical    | Minor     | Minor    | Sensor out of normal range, but not yet a problem (Warning).
Critical        | Major     | Major    | Sensor well out of normal range, but still within vendor operating tolerances.
Non-Recoverable | Critical  | Critical | Sensor out of vendor operating tolerance range; equipment may be damaged.


Each entry in the "Sensor(s) at this state" list has the following format: (IPMB Addr, Sensor #, LUN).
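The sensor list can be turned into "clia sensordata" commands for follow-up on the ShMM. The Python sketch below is a minimal illustration under two assumptions: the address argument is written as bare hex (following the "[clia] sensordata 20 <sensor #>" usage in this article), and only LUN 0 entries are handled because this article does not show the syntax for other LUNs. The function names are hypothetical.

```python
import re

# Matches entries such as "(0x90,41,0)": IPMB address, sensor number, LUN.
TUPLE_RE = re.compile(r"\((0x[0-9a-fA-F]+),\s*(\d+),\s*(\d+)\)")


def parse_sensor_tuples(text):
    """Return a list of (ipmb_address, sensor_number, lun) integer tuples."""
    return [(int(addr, 16), int(num), int(lun))
            for addr, num, lun in TUPLE_RE.findall(text)]


def sensordata_commands(tuples):
    """Build one 'clia sensordata <addr> <sensor #>' command per LUN 0 sensor."""
    # Assumption: <addr> is bare hex, as in "clia sensordata 20 <sensor #>".
    return [f"clia sensordata {addr:x} {num}"
            for addr, num, lun in tuples if lun == 0]


if __name__ == "__main__":
    sample = """Sensor(s) at this state: (0x12,2,0) (0x90,41,0) (0x90,40,0) (0x90,6,0)
                         (0x94,42,0) (0x94,41,0) (0x94,40,0) (0x94,6,0)"""
    for command in sensordata_commands(parse_sensor_tuples(sample)):
        print(command)
```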

System Event Log

This section is the output of "[clia] sel". Each event has the following format:


<ID>: Event at <D&T>; from:(IPMB, FRU, LUN); sensor:(<type>, <#>); event:<event type>: <Details>

And an example:


0x0010: Event: at Jan 10 17:01:11 2011; from:(0x84,0,0); sensor:(0x07,4); event:0x3(asserted): 0x00 0xFF 0xFF

This event, "from:(0x84,0,0); sensor:(0x07,4); event:0x3(asserted)", comes from slot 8 (0x84 is the IPMB address of slot 8) and FRU 0 (the board itself), and shows that sensor #4 ("Hot Swap AMC #3") is asserted.

More details are needed to fully decode the SEL. Other output, such as "[clia] sensor", "[clia] sensordata", and "[clia] fruinfo", might be necessary.
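A first pass over the SEL can be automated by splitting each event line into its fields. The Python sketch below assumes lines shaped like the example above; the regular expression and the name parse_sel_event are illustrative, and the event data bytes are left as raw text because decoding them depends on the sensor.

```python
import re

# Assumption: events look like the example in this article. The colon after
# "Event" is optional because the format line and the example differ on it.
SEL_RE = re.compile(
    r"^(?P<id>0x[0-9a-fA-F]+):\s*Event:?\s*at\s+(?P<when>[^;]+);\s*"
    r"from:\((?P<ipmb>0x[0-9a-fA-F]+),\s*(?P<fru>\d+),\s*(?P<lun>\d+)\);\s*"
    r"sensor:\((?P<stype>0x[0-9a-fA-F]+),\s*(?P<snum>\d+)\);\s*"
    r"event:(?P<etype>0x[0-9a-fA-F]+)\((?P<direction>\w+)\):\s*(?P<data>.*)$"
)


def parse_sel_event(line):
    """Split one SEL line into a dict of fields, or return None if it does not match."""
    match = SEL_RE.match(line.strip())
    if not match:
        return None
    return {
        "id": int(match.group("id"), 16),
        "when": match.group("when").strip(),
        "ipmb": int(match.group("ipmb"), 16),
        "fru": int(match.group("fru")),
        "lun": int(match.group("lun")),
        "sensor_type": int(match.group("stype"), 16),
        "sensor_number": int(match.group("snum")),
        "event_type": int(match.group("etype"), 16),
        "direction": match.group("direction"),
        "data": match.group("data").split(),
    }


if __name__ == "__main__":
    example = ("0x0010: Event: at Jan 10 17:01:11 2011; from:(0x84,0,0); "
               "sensor:(0x07,4); event:0x3(asserted): 0x00 0xFF 0xFF")
    event = parse_sel_event(example)
    print(f"IPMB 0x{event['ipmb']:02x}, FRU {event['fru']}, "
          f"sensor #{event['sensor_number']}, {event['direction']}")
```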

APPENDIX

A quick reference table to convert between IPMB address, slot number, and switch port number:


Physical | shm1 shm2 Shelf 01   02   03   04   05   06
Logical  | ---- ---- ----  13   11   09   07   05   03
BASE     | ---- ---- ---- 0/13 0/11 0/09 0/07 0/05 0/03
EXTENDED | ---- ---- ---- 0/12 0/10 0/08 0/06 0/04 0/02
IPMB Add |  10   12   20   9a   96   92   8e   8a   86
HW Add   |  08   09   10   4d   4b   49   47   45   43


Physical |  07   08   09   10   11   12   13   14
Logical  |  01   02   04   06   08   10   12   14
BASE     | ---- ---- 0/04 0/06 0/08 0/10 0/12 0/14
EXTENDED | ---- ---- 0/03 0/05 0/07 0/09 0/11 0/13
IPMB Add |  82   84   88   8c   90   94   98   9c
HW Add   |  41   42   44   46   48   4a   4c   4e
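The table can also be expressed as a small lookup. In the Python sketch below, the physical-to-logical slot map is transcribed from the table; the arithmetic (HW address = 0x40 + logical slot, IPMB address = HW address x 2, base port = logical slot, extended port = logical slot - 1) is an observation that happens to fit every blade column of the table, not a rule stated in this article. All function names are hypothetical.

```python
# Physical slot -> logical slot, transcribed from the appendix table.
PHYSICAL_TO_LOGICAL = {
    1: 13, 2: 11, 3: 9, 4: 7, 5: 5, 6: 3, 7: 1,
    8: 2, 9: 4, 10: 6, 11: 8, 12: 10, 13: 12, 14: 14,
}


def hw_address(logical_slot):
    """Hardware address of a blade slot (pattern observed in the table)."""
    return 0x40 + logical_slot


def ipmb_address(logical_slot):
    """IPMB address is the hardware address shifted left by one bit."""
    return hw_address(logical_slot) << 1


def physical_slot_from_ipmb(ipmb):
    """Return the physical slot for a blade IPMB address, or None (e.g. 0x10, 0x12, 0x20)."""
    logical = (ipmb >> 1) - 0x40
    for physical, candidate in PHYSICAL_TO_LOGICAL.items():
        if candidate == logical:
            return physical
    return None


def switch_ports(logical_slot):
    """Return (base, extended) switch ports, or None for logical slots 1 and 2,
    which have no port entries in the table."""
    if logical_slot < 3:
        return None
    return (f"0/{logical_slot:02d}", f"0/{logical_slot - 1:02d}")


if __name__ == "__main__":
    for physical in sorted(PHYSICAL_TO_LOGICAL):
        logical = PHYSICAL_TO_LOGICAL[physical]
        print(f"physical {physical:02d}  logical {logical:02d}  "
              f"IPMB {ipmb_address(logical):02x}  HW {hw_address(logical):02x}  "
              f"ports {switch_ports(logical)}")
```

For example, the SEL originator 0x84 used earlier in this article resolves to physical slot 8 with this lookup.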


References

<NOTE:1346085.1> - Netra CT900 ShMM debug.log analysis

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.