Netra CT900: How to read ShMM's debug.log file

Asset ID:	1-79-1392829.1
Update Date:	2012-01-24
Keywords:

Solution Type Predictive Self-Healing Sure

Applies to:

Sun Netra CT900 Server - Version: Not Applicable and later [Release: N/A and later ]
Information in this document applies to any platform.

Purpose

For Netra CT900 problems, especially chassis-related ones, the /tmp/debug.log file from the ShMM is often required to investigate the possible root cause (RC).

This article shows how to read the /tmp/debug.log file and which sections are of interest for particular types of problems.

Data Collecting

If the /tmp/debug.log file does not exist on the "Active" ShMM, run the summary script in /etc to generate it:

# ./summary
PING 10.133.104.1 (10.133.104.1): 56 data bytes
84 bytes from 10.133.104.1: icmp_seq=0 ttl=255 time=8.4 ms

--- 10.133.104.1 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 8.4/8.4/8.4 ms
send debug file /tmp/debug.log to PPS
#

Content of debug.log file

The /tmp/debug.log file is the ShMM equivalent of Explorer output. Each section is introduced by a header. The headers in a /tmp/debug.log file are:

Shell Environment Variables
U-Boot Environment Variables
Network Interfaces
Network Routing
Current Process List
CPLD State
Shelfman version
Shelfman status
ShMC LAN Configuration
Active RMCP Sessions
PEF Configuration
Board Information
Detailed IPMC Information
Detailed FRU Information
Fan List
Cooling State
Fan State
FRU Sizes
Shelf FRU Info
System Event Log
Shelfman output to syslog

Each section contains useful information. The most important ones are:

Network Interfaces: Contains the IP addresses and shows whether the ShMM uses a 2x IP configuration (one floating IP address that stays with the "Active" ShMM and another that stays with the "Backup" ShMM) or a 3x IP configuration (the top and bottom ShMMs each have their own static IP address, plus one floating IP address that follows the "Active" ShMM, whether top or bottom). The 3x IP configuration usually uses the eth0:1 interface.
Shelfman Version: Contains the firmware version of the ShMM.
Shelfman Status: Shows whether the ShMM is "Active" or "Backup". The debug.log from the Active ShMM is usually more useful because many commands are not allowed on the Backup ShMM.
ShMC IPMB Address: Shows whether the ShMM is the top one (shm1, IPMB address 0x10) or the bottom one (shm2, IPMB address 0x12).
Board Information: Generic information about each slot. It usually shows the type of blade and A/RTM, and their current Hot Swap State.
Detailed IPMC Information: Details of the IPMC records. The ShMM also handles the handshaking between switch blades and computing blades, so these records can also be used to check whether a computing blade is connected to the switch blades.
Fan List, Cooling State, Fan State: These three sections contain information about the chassis fan trays and temperature sensors (chassis and blades).
System Event Log: Event log of interactions between the ShMM and chassis components.
Shelfman output to syslog: Syslog of the ShMM.
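
As a starting point for working with the file offline, the sketch below splits a debug.log into its sections. This article does not show how the headers are decorated in the real file, so the code assumes that a header line equals one of the section titles once surrounding punctuation (such as "=" or "-") is stripped. The function name split_sections and the "===" decoration in the usage note are hypothetical.

```python
# Section titles taken from this article. "ShMC IPMB Address" is described in
# the article even though it is missing from the header list, so it is included.
SECTION_TITLES = [
    "Shell Environment Variables", "U-Boot Environment Variables",
    "Network Interfaces", "Network Routing", "Current Process List",
    "CPLD State", "Shelfman version", "Shelfman status",
    "ShMC LAN Configuration", "Active RMCP Sessions", "PEF Configuration",
    "Board Information", "Detailed IPMC Information",
    "Detailed FRU Information", "Fan List", "Cooling State", "Fan State",
    "FRU Sizes", "Shelf FRU Info", "System Event Log",
    "Shelfman output to syslog", "ShMC IPMB Address",
]

# Case-insensitive lookup, because the article spells some titles both ways.
_TITLES_BY_KEY = {title.lower(): title for title in SECTION_TITLES}


def split_sections(text):
    """Return {section title: [body lines]} for every known header found.

    ASSUMPTION: a header is a line that matches a known title after
    stripping spaces and decoration characters. Lines that appear before
    the first recognised header are ignored.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        key = line.strip(" \t=-*#:").lower()
        title = _TITLES_BY_KEY.get(key)
        if title is not None:
            current = title
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections
```

For example, split_sections(open("debug.log").read())["Fan List"] would return the lines of the Fan List section, provided the headers in the file really follow the assumed form.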

Type of Issues

Several types of chassis problems can be investigated with the debug.log file to locate the RC.

FAN TRAY/TEMPERATURE ISSUES

For fan tray/temperature related issues:

Check the "Fan List", "Cooling State", and "Fan State" sections to see whether all states are "Normal".
Check whether the SEL (System Event Log) has events for the temperature sensors listed in "Cooling State", for example a temperature sensor reading that goes over the "upper non-recoverable" threshold.
Use "[clia] sensordata <addr> <sensor #>" to obtain the current sensor readings.

With the above information, it should be possible to determine where the problem is and what might have happened.

NOTE: The list of sensors can also be obtained with the following command from the ShMM:

# clia sensor <addr> | grep Sensor

where <addr> could be <IPMB address of blade>, "board <N>", or "shm <N>".

PEM (Power Entry Module) ISSUES

PEM related issues do not need the debug.log file, as most of the data needed is collected from the "[clia] sensordata" and "[clia] getfruledstate" commands.

Use "[clia] getfruledstate 20 [6|7]" to determine the current state of the PEM: OK (GREEN), OOS (RED/AMBER), or Ready to Remove (BLUE).
All PEM related sensors can be listed with the following three commands: "[clia] sensor 20 | grep PEM", "[clia] sensor 20 | grep 48", and "[clia] sensor 20 | grep V".
Use "[clia] sensordata 20 <sensor #>" to check the sensor reading and determine whether the PEM is working properly.

In the above, "20" is the IPMB address of the chassis/shelf itself, and "6" and "7" are the FRU numbers of PEM A and PEM B.

POWER ISSUES

Check the SEL and group all events from the same blade (same IPMB address).
Use "[clia] sensor board <slot #> | grep Sensor | grep V" to obtain a list of the voltage sensors on a blade.
Match the sensor # from the SEL against the list from step 2.
If the SEL shows voltage sensors going under the "lower non-recoverable" threshold very often, there is a problem, whether PEM or board related. Carefully examine the SEL and other data (such as the debug-level ShMM syslog) to determine the true RC.
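
Step 1 (grouping SEL events by blade) can be sketched as below. The regular expression is based only on the single SEL example line shown in the System Event Log section of this article, and it assumes that the FRU, LUN, and sensor numbers are printed in decimal, as they are in that example. The names SEL_SOURCE and count_events are hypothetical.

```python
import re
from collections import Counter

# Matches the "from:(IPMB,FRU,LUN); sensor:(type,#)" part of a SEL line.
# ASSUMPTION: layout as in the example line in this article.
SEL_SOURCE = re.compile(
    r"from:\((0x[0-9A-Fa-f]+),\s*(\d+),\s*(\d+)\);\s*"
    r"sensor:\((0x[0-9A-Fa-f]+),\s*(\d+)\)"
)


def count_events(sel_lines):
    """Count SEL events per (IPMB address, sensor #) pair.

    Lines that do not look like SEL events are skipped silently.
    """
    counts = Counter()
    for line in sel_lines:
        match = SEL_SOURCE.search(line)
        if match:
            ipmb = int(match.group(1), 16)
            sensor_number = int(match.group(5))
            counts[(ipmb, sensor_number)] += 1
    return counts
```

Calling count_events(lines).most_common(5) lists the noisiest blade/sensor pairs first, which are the ones to match against the voltage sensor list from step 2.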

BLADE ISSUES

Follow the troubleshooting steps for the blade's architecture (UltraSPARC or x86/x64) to determine whether the problem is within the blade or is chassis/shelf related.
If the RC is within the blade, the debug.log file is irrelevant.

FIRMWARE UPGRADE ISSUES

For firmware upgrade issues, the upgrade log is the important source.
The debug.log file is irrelevant.

NETWORK ISSUES

Network issues are out of the scope of this article.

Details of debug.log file

Network Interfaces

This is the "ifconfig -a" output of the ShMM. It shows several interfaces:

eth0 The main ShMM network interface
eth0:1 Used in the 3x IP configuration
eth1 Not accessible
usb0 Not accessible
usb1 Not accessible (usb0 and usb1 are used as the communication channel between shm1 and shm2 for fail-over purposes)
vlan55 Interface for Netconsole (or NetConsole) feature
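
The check for a 2x or 3x IP configuration can be sketched as follows. It relies on the statement above that the 3x IP configuration usually uses eth0:1, so the result is a hint and not a definitive answer. It also assumes the usual "ifconfig -a" layout, in which each interface block starts with the interface name in the first column. The function names are hypothetical.

```python
def interface_names(ifconfig_text):
    """Return the interface names found in `ifconfig -a` style output.

    ASSUMPTION: an interface block starts with its name in column 0 and
    all continuation lines are indented.
    """
    names = []
    for line in ifconfig_text.splitlines():
        if line and not line[0].isspace():
            names.append(line.split()[0])
    return names


def ip_scheme(ifconfig_text):
    """Guess the IP scheme: "3x IP" if an eth0:1 alias exists, else "2x IP".

    This is a heuristic based on the article's remark that the 3x IP
    configuration usually uses eth0:1.
    """
    return "3x IP" if "eth0:1" in interface_names(ifconfig_text) else "2x IP"
```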

Shelfman Version

Pigeon Point Shelf Manager ver. 2.6.4-R3U3-RR
Pigeon Point and the stylized lighthouse logo are trademarks of Pigeon Point Systems.
Copyright (c) 2002-2008 Pigeon Point Systems
All rights reserved
Build date/time: Jan 11 2010 05:48:20
Carrier: ACB
Carrier subtype: 3; subversion: 0

Shelfman Status

"Active" or "Backup"

ShMC IPMB Address

Local IPMB Address = 0x10

0x10 is shm1 (top), and 0x12 is shm2 (bottom)

Board Information

This section shows the type of each blade and its Hot Swap State:

Physical Slot # 3
92: Entity: (0xa0, 0x60) Maximum FRU device ID: 0x02
    PICMG Version 2.2
    Hot Swap State: M4 (Active), Previous: M3 (Activation In progress), Last State Change Cause: Normal State Change (0x0)

92: FRU # 0
    Entity: (0xa0, 0x60)
    Hot Swap State: M4 (Active), Previous: M3 (Activation In progress), Last State Change Cause: Normal State Change (0x0)
    Device ID String: "NetraCP-3060"

92: FRU # 2 (AMC # 1)
    Entity: (0xa0, 0x61)
    Hot Swap State: M4 (Active), Previous: M3 (Activation In progress), Last State Change Cause: Normal State Change (0x0)
    Device ID String: "375-3470-01"

Here are the explanations of the Hot Swap States:

State	Summary	Explanation
M0	Not Installed	FRU IPMC is not reachable. All power to the FRU is off. The blue LED is off.
M1	FRU Inactive	The FRU is installed and its IPMC is in communication with the ShMM. The blue LED is on solid. The FRU is not powered up, and none of its connectivity is active. The next state is either M0 or M2.
M2	FRU Activation Request	The FRU IPMC is waiting for activation permission from the ShMM. The blue LED has a long blink. FRU removal is not safe. The next state is M3.
M3	Activation in Progress	The FRU's IPMC requests power allocation from the ShMM. The blue LED is off. The FRU changes to state M4 when activation is complete.
M4	FRU Active	This is the normal FRU operational state. The FRU is powered on and cannot be removed safely. The blue LED is off. The next state is either M5 or M6.
M5	FRU Deactivation Request	The FRU's IPMC is requesting deactivation permission from the ShMM. The blue LED shows a short blink. The next state is M6.
M6	Deactivation in Progress	The FRU is shutting down and its I/O connections are being deactivated. The blue LED continues its short blink. The next state is M1.
M7	Communication Lost	The ShMM has lost contact with the board IPMC, or the board IPMC has lost contact with its own FRUs. This is an abnormal state. The board should return to its previous state when IPMC communication is reestablished.
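
The state table can be turned into a small lookup for scanning the Board Information section. The sketch below extracts the current and previous state codes from a "Hot Swap State:" line of the form shown in the sample output above. The names HOT_SWAP_STATES, parse_hot_swap, and describe are hypothetical.

```python
import re

# State code -> summary, copied from the table above.
HOT_SWAP_STATES = {
    "M0": "Not Installed",
    "M1": "FRU Inactive",
    "M2": "FRU Activation Request",
    "M3": "Activation in Progress",
    "M4": "FRU Active",
    "M5": "FRU Deactivation Request",
    "M6": "Deactivation in Progress",
    "M7": "Communication Lost",
}

# Captures the current and the previous state code of one line.
STATE_LINE = re.compile(
    r"Hot Swap State:\s*(M[0-7])\b.*?Previous:\s*(M[0-7])\b"
)


def parse_hot_swap(line):
    """Return (current, previous) state codes, or None if the line does not match."""
    match = STATE_LINE.search(line)
    return (match.group(1), match.group(2)) if match else None


def describe(line):
    """Return a readable summary of a Hot Swap State line, or None."""
    states = parse_hot_swap(line)
    if states is None:
        return None
    current, previous = states
    return (f"{current} ({HOT_SWAP_STATES[current]}), "
            f"previously {previous} ({HOT_SWAP_STATES[previous]})")
```

A FRU whose current state is M7, or one that stays in M2, M3, M5, or M6 across several collections of debug.log, is worth a closer look, since M4 is the normal operational state.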

Fan List

Fan Tray state of the chassis:

20: FRU # 3
Current Level: 5
Minimum Speed Level: 0, Maximum Speed Level: 15
20: FRU # 4
Current Level: 5
Minimum Speed Level: 0, Maximum Speed Level: 15
20: FRU # 5
Current Level: 5
Minimum Speed Level: 0, Maximum Speed Level: 15

The fan speed level should be either 5 (min speed) or 15 (max speed).
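
A minimal sketch of this check, assuming the Fan List section has exactly the layout shown above (an "<addr>: FRU # <n>" line followed by a "Current Level:" line). The function names and the returned structure are hypothetical.

```python
import re

FRU_LINE = re.compile(r"^\s*([0-9A-Fa-f]+):\s*FRU\s*#\s*(\d+)")
LEVEL_LINE = re.compile(r"Current Level:\s*(\d+)")


def fan_levels(lines):
    """Return {(ipmb address as printed, FRU #): current level}."""
    levels = {}
    current = None
    for line in lines:
        match = FRU_LINE.match(line)
        if match:
            # Remember which fan tray the following lines belong to.
            current = (match.group(1).lower(), int(match.group(2)))
            continue
        match = LEVEL_LINE.search(line)
        if match and current is not None:
            levels[current] = int(match.group(1))
    return levels


def unexpected_levels(levels, expected=(5, 15)):
    """Return the fan trays whose level is neither 5 nor 15."""
    return {fan: level for fan, level in levels.items()
            if level not in expected}
```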

Cooling State

Cooling state and list of all temperature sensors:

Cooling state: "Normal"
Sensor(s) at this state: (0x12,2,0) (0x90,41,0) (0x90,40,0) (0x90,6,0)
                         (0x94,42,0) (0x94,41,0) (0x94,40,0) (0x94,6,0)
                         (0x82,53,0) (0x82,52,0) (0x82,45,0) (0x82,44,0)
                         (0x82,37,0) (0x82,36,0) (0x82,20,0) (0x82,10,0)
                         (0x92,31,0) (0x92,30,0) (0x92,7,0) (0x92,6,0)
                         (0x92,5,0) (0x86,25,0) (0x86,24,0) (0x86,23,0)
                         (0x86,22,0) (0x86,21,0) (0x86,20,0) (0x86,19,0)
                         (0x96,31,0) (0x96,30,0) (0x96,29,0) (0x96,6,0)
                         (0x96,5,0) (0x96,4,0) (0x90,42,0) (0x88,6,0)
                         (0x86,26,0) (0x9c,4,0) (0x9c,3,0) (0x98,4,0)
                         (0x98,3,0) (0x9c,5,0) (0x98,5,0) (0x20,120,0)
                         (0x20,121,0) (0x20,122,0) (0x20,123,0) (0x20,124,0)
                         (0x20,125,0) (0x20,126,0) (0x20,200,0) (0x20,201,0)

Instead of "Normal", the cooling state can be "Minor Alert", "Major Alert", or "Critical Alert". The severity levels map as follows:

IPMI	PICMG 3.0	Telco	Meaning
Non-Critical	Minor	Minor	Sensor out of normal range, but not yet a problem (Warning).
Critical	Major	Major	Sensor well out of normal range, but still within vendor operating tolerances.
Non-Recoverable	Critical	Critical	Sensor out of vendor operating tolerance range; equipment may be damaged.

Each entry in the sensor list above has the format (IPMB address, sensor #, LUN).
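
To see which blade contributes which temperature sensors, the tuples can be grouped by IPMB address. This is a minimal sketch that assumes the tuple layout shown in the sample output; the function name sensors_by_address is hypothetical.

```python
import re
from collections import defaultdict

# One "(IPMB address, sensor #, LUN)" tuple, for example "(0x90,41,0)".
SENSOR_TUPLE = re.compile(r"\((0x[0-9A-Fa-f]+),\s*(\d+),\s*(\d+)\)")


def sensors_by_address(text):
    """Return {IPMB address: [sensor numbers]} in order of appearance."""
    grouped = defaultdict(list)
    for address, sensor_number, _lun in SENSOR_TUPLE.findall(text):
        grouped[int(address, 16)].append(int(sensor_number))
    return dict(grouped)
```

The resulting sensor numbers are the values to pass to "[clia] sensordata <addr> <sensor #>" when checking current readings.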

System Event Log

This section is the output of "[clia] sel". Each event has the following format:

<ID>: Event: at <D&T>; from:(<IPMB>, <FRU>, <LUN>); sensor:(<type>, <#>); event:<event type>: <Details>

For example:

0x0010: Event: at Jan 10 17:01:11 2011; from:(0x84,0,0); sensor:(0x07,4); event:0x3(asserted): 0x00 0xFF 0xFF

This event, "from:(0x84,0,0); sensor:(0x07,4); event:0x3(asserted)", comes from slot 8 (0x84 is the IPMB address of slot 8), FRU 0 (the board itself), and shows that sensor #4 ("Hot Swap AMC #3") is asserted.

Fully decoding the SEL requires more detail. Other output such as "[clia] sensor", "[clia] sensordata", and "[clia] fruinfo" might be necessary.
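
The fields of a SEL line can be pulled apart with a regular expression. The pattern below is derived only from the single example event in this article, so other event layouts may not match. The name parse_sel_line and the dictionary keys are hypothetical, and the timestamp is kept as text.

```python
import re

# ASSUMPTION: layout as in the example event shown in this article.
SEL_EVENT = re.compile(
    r"^(?P<id>0x[0-9A-Fa-f]+):\s*Event:?\s*at\s+(?P<when>.+?);\s*"
    r"from:\((?P<ipmb>0x[0-9A-Fa-f]+),\s*(?P<fru>\d+),\s*(?P<lun>\d+)\);\s*"
    r"sensor:\((?P<stype>0x[0-9A-Fa-f]+),\s*(?P<snum>\d+)\);\s*"
    r"event:(?P<etype>0x[0-9A-Fa-f]+)\((?P<direction>\w+)\):\s*(?P<data>.*)$"
)


def parse_sel_line(line):
    """Return the fields of one SEL event as a dict, or None if no match."""
    match = SEL_EVENT.match(line.strip())
    if match is None:
        return None
    return {
        "id": int(match.group("id"), 16),
        "when": match.group("when"),
        "ipmb": int(match.group("ipmb"), 16),
        "fru": int(match.group("fru")),
        "lun": int(match.group("lun")),
        "sensor_type": int(match.group("stype"), 16),
        "sensor_number": int(match.group("snum")),
        "event_type": int(match.group("etype"), 16),
        "direction": match.group("direction"),
        # Event data bytes, printed as hex values such as "0x00 0xFF 0xFF".
        "data": [int(byte, 16) for byte in match.group("data").split()],
    }
```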

APPENDIX

A quick reference table to convert between IPMB address, slot number, and switch port number:

Physical | shm1 shm2 Shelf 01   02   03   04   05   06
Logical  | ---- ---- ----  13   11   09   07   05   03
BASE     | ---- ---- ----  0/13 0/11 0/09 0/07 0/05 0/03
EXTENDED | ---- ---- ----  0/12 0/10 0/08 0/06 0/04 0/02
IPMB Add | 10   12   20    9a   96   92   8e   8a   86
HW Add   | 08   09   10    4d   4b   49   47   45   43

Physical | 07   08   09   10   11   12   13   14
Logical  | 01   02   04   06   08   10   12   14
BASE     | ---- ---- 0/04 0/06 0/08 0/10 0/12 0/14
EXTENDED | ---- ---- 0/03 0/05 0/07 0/09 0/11 0/13
IPMB Add | 82   84   88   8c   90   94   98   9c
HW Add   | 41   42   44   46   48   4a   4c   4e
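
The slot rows of the table follow a regular pattern: for every blade slot listed, the hardware address equals 0x40 plus the logical slot number, and the IPMB address is twice the hardware address. The sketch below uses that observation, which is read off the table and not stated elsewhere in this article, to convert between physical slot and IPMB address. The function names are hypothetical.

```python
# Physical slot -> logical slot, transcribed from the table above.
LOGICAL_SLOT = {
    1: 13, 2: 11, 3: 9, 4: 7, 5: 5, 6: 3, 7: 1,
    8: 2, 9: 4, 10: 6, 11: 8, 12: 10, 13: 12, 14: 14,
}


def hw_address(physical_slot):
    """Hardware address of a blade slot (0x40 + logical slot, per the table)."""
    return 0x40 + LOGICAL_SLOT[physical_slot]


def ipmb_address(physical_slot):
    """IPMB address of a blade slot (hardware address shifted left by one bit)."""
    return hw_address(physical_slot) << 1


def physical_slot_from_ipmb(ipmb):
    """Return the physical slot for an IPMB address, or None if not a blade slot."""
    logical = (ipmb >> 1) - 0x40
    for physical, candidate in LOGICAL_SLOT.items():
        if candidate == logical:
            return physical
    return None
```

For the SEL example in this article, physical_slot_from_ipmb(0x84) returns 8, which agrees with the table. The ShMM and shelf addresses (0x10, 0x12, 0x20) are not blade slots, so the function returns None for them.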
