Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1008409.1
Update Date:2011-06-08
Keywords:

Solution Type  Technical Instruction Sure

Solution  1008409.1 :   How to verify platform health on a Sun X64 Server  


Related Items
  • Sun Blade 6000 System
  • Sun Fire X2200 M2 Server
  • Sun Fire X4150 Server
  • Sun Fire X4540 Server
  • Sun Blade 8000 System
  • Sun Fire X4440 Server
  • Sun Netra X4200 M2 Server
  • Sun Fire V20z Server
  • Sun Fire X4250 Server
  • Sun Netra X4450 Server
  • Sun Fire X4275 Server
  • Sun Fire X4200 Server
  • Sun Netra X4270 Server
  • Sun Fire X4600 M2 Server
  • Sun Fire X4200 M2 Server
  • Sun Fire X4240 Server
  • Sun Fire X4470 Server
  • Sun Fire X2270 Server
  • Sun Fire X4140 Server
  • Sun Fire X4170 Server
  • Sun Blade 6048 System
  • Sun Fire X4270 M2 Server
  • Sun Fire X2250 Server
  • Sun Fire X2100 M2 Server
  • Sun Fire V40z Server
  • Sun Fire X4100 M2 Server
  • Sun Fire X4600 Server
  • Sun Fire X4640 Server
  • Sun Fire X4270 Server
  • Sun Netra X4250 Server
  • Sun Fire X2270 M2 Server
  • Sun Fire X4100 Server
  • Sun Fire X4500 Server
  • Sun Fire X4800 Server
  • Sun Fire X4450 Server
  • Sun Fire X4170 M2 Server
Related Categories
  • GCS>Sun Microsystems>Servers>x64 Servers

PreviouslyPublishedAs
211493


Applies to:

Sun Fire X4100 Server - Version: Not Applicable and later   [Release: N/A and later ]
Sun Fire X2270 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire X4100 M2 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire X4140 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire X4150 Server - Version: Not Applicable and later    [Release: N/A and later]
All Platforms

Goal

Description

The purpose of this document is to outline the various ways in which you can check a Sun X64 server for error conditions.

Symptoms

  • Data gathering for troubleshooting


Solution

Steps to Follow

This document explains how to examine the system LEDs, status indicators, and event logs via ipmitool, Service Processor Web GUI, and Service Processor CLI, as well as what to check if you are local to the server.

If any potential problems are identified, further troubleshooting will be required.

Checking LEDs and indicators with ipmitool

# ipmitool -I lan -H <SP IP Address> -U <SP username> sunoem led get
Look for LEDs with a value of ON that could indicate a problem.
Example: The Processor 1 DIMM 0 LED is ON (the number refers to the processor socket, not a CPU core):

p0.d2.led | OFF
p0.d3.led | OFF
p1.led | OFF
p1.d0.led | ON <--
p1.d1.led | OFF
p1.d2.led | OFF
p1.d3.led | OFF

Example: A fan fault on module FB1/FM0 causes other related LEDs to turn on and off:

OK | ON
SERVICE | ON <-- System service LED turns ON
LOCATE | OFF
PS_FAULT | OFF
FAN_FAULT | ON <-- Fan fault LED turns ON
TEMP_FAULT | OFF
...
FB1/FM0/SERVICE | ON <-- Faulted fan module service indicator requiring repair
FB1/FM1/SERVICE | OFF
FB1/FM2/SERVICE | OFF
FB1/FM0/OK | OFF <-- Fan module is not ok
FB1/FM1/OK | ON
FB1/FM2/OK | ON
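
The check described above can be sketched as a small Python filter over saved `sunoem led get` output. This is a heuristic sketch, not a Sun/Oracle tool: the function name `suspicious_leds` is hypothetical, and it assumes that LEDs named OK are healthy when ON, that LOCATE is not a fault indicator, and that any other LED reporting ON deserves a closer look. Some platforms may have other LEDs that are normally lit.

```python
# Sample taken from the fan fault example in this document.
SAMPLE = """\
OK | ON
SERVICE | ON <-- System service LED turns ON
LOCATE | OFF
PS_FAULT | OFF
FAN_FAULT | ON <-- Fan fault LED turns ON
TEMP_FAULT | OFF
...
FB1/FM0/SERVICE | ON <-- Faulted fan module service indicator requiring repair
FB1/FM1/SERVICE | OFF
FB1/FM0/OK | OFF <-- Fan module is not ok
FB1/FM1/OK | ON
"""


def suspicious_leds(output):
    """Return (name, state) pairs that may indicate a problem (heuristic)."""
    flagged = []
    for line in output.splitlines():
        if "|" not in line:
            continue  # skip "..." and blank lines
        name, _, rest = line.partition("|")
        name = name.strip()
        # Drop trailing annotations such as "<-- ..." from the state field.
        state = rest.split("<--")[0].strip().upper()
        leaf = name.upper().split("/")[-1]
        if leaf == "LOCATE":
            continue  # assumption: the locate LED is not a fault indicator
        if leaf == "OK":
            if state == "OFF":  # assumption: an OK LED should be lit
                flagged.append((name, state))
        elif state == "ON":
            flagged.append((name, state))
    return flagged


for name, state in suspicious_leds(SAMPLE):
    print(f"{name}: {state}")
```

Anything the filter reports still needs to be confirmed against the Service Manual for the platform.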


Output similar to the following may indicate a newer version of ipmitool is required:


Sun OEM Get LED command failed: Parameter out of range
Sun OEM Get LED command failed: Destination unavailable

Download the latest Sun-Oracle supplied version of ipmitool from the Sun Oracle product downloads web page; see <Document 1009698.1>.

Checking FMA data in the ILOM snapshot

ILOM 3.x versions may include an fma directory, populated with fault data, in the snapshot file. As far as faults are concerned, the ILOM FMA data should match the Solaris FMA data.

Example of a failed fan:



------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2011-03-22/14:22:09 f8dde8af-d369-e31f-c9a6-b159683a286f SPX86-8000-33  Major

Fault class : fault.chassis.device.fan.fail

FRU : /SYS/FB/FAN1 (Part Number: unknown) (Serial Number: unknown)

Response : The service-required LED may be illuminated on the affected
FRU and chassis. System will be powered down when the High Temperature
threshold is reached.

Action : The administrator should review the ILOM event log for
additional information pertaining to this diagnosis. Please refer to the
Details section of the Knowledge Article for additional information.
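
When several snapshots need reviewing, a fault record like the one above can be reduced to its key fields. The following is a minimal sketch that assumes the record layout matches the example in this document (one summary line followed by `Fault class`, `FRU`, `Response` and `Action` fields, with wrapped continuation lines). The function name `parse_fault` is hypothetical, and it handles a single record only.

```python
import re

# Summary line: time, UUID, msgid, severity (layout as in the example above).
SUMMARY = re.compile(
    r"^(\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2})\s+(\S+)\s+(\S+)\s+(\S+)$"
)
FIELD = re.compile(r"^(Fault class|FRU|Response|Action)\s*:\s*(.*)$")

SAMPLE = """\
------------------- ------------------------------------ -------------- --------
Time UUID msgid Severity
------------------- ------------------------------------ -------------- --------

2011-03-22/14:22:09 f8dde8af-d369-e31f-c9a6-b159683a286f SPX86-8000-33 Major

Fault class : fault.chassis.device.fan.fail

FRU : /SYS/FB/FAN1 (Part Number: unknown) (Serial Number: unknown)

Response : The service-required LED may be illuminated on the affected
FRU and chassis. System will be powered down when the High Temperature
threshold is reached.

Action : The administrator should review the ILOM event log for
additional information pertaining to this diagnosis. Please refer to the
Details section of the Knowledge Article for additional information.
"""


def parse_fault(text):
    """Parse one fault record into a dict of field name -> value."""
    record = {}
    current = None  # field that continuation lines are appended to
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("---"):
            continue
        summary = SUMMARY.match(line)
        field = FIELD.match(line)
        if summary:
            (record["Time"], record["UUID"],
             record["msgid"], record["Severity"]) = summary.groups()
            current = None
        elif field:
            current = field.group(1)
            record[current] = field.group(2)
        elif current:
            record[current] += " " + line  # wrapped continuation line
    return record


fault = parse_fault(SAMPLE)
print(fault["msgid"], fault["Severity"], fault["Fault class"], fault["FRU"])
```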



In addition, certain ILOM 3.x versions may include a fault management shell. Refer to How to use the Oracle ILOM 3.x Fault Management Shell (Doc ID 1309092.1).

Checking LEDs and indicators in the Service Processor Web GUI

Integrated Lights Out Manager (ILOM) and Embedded Lights Out Manager (ELOM) based Service Processors provide an easy-to-use web interface for managing the platform. Point your web browser to the Service Processor IP address or resolving DNS hostname, and enter your login credentials when prompted.

Then, when logged in (exact display will differ between ILOM and ELOM):

  • Click the "System Monitoring" tab, then click the "Sensor readings" tab. Newer ILOM versions also have an "Indicators" tab, which should be checked as well.
  • Using the drop-down menu, select "All Sensors" (or "All Indicators" when in the Indicators tab).
  • Browse the resulting output for fault LEDs and indicators:
  • Check the 'Name' column for names ending in 'fail', 'FAULT', or 'SERVICE'.
  • Then look along to the 'Status' or 'Reading' column for its status:
  • "Predictive Failure Asserted" means the fault LED is ON.
  • "Predictive Failure Deasserted" means the fault LED is OFF.
Example: CPU1 DIMM0 fault (as displayed by an older version of ILOM)

Status                         Name        Reading
Predictive Failure Asserted    p1.d0.fail  2 - Processor One DIMM 0 Fault LED ON
Predictive Failure Deasserted  p1.d1.fail  1 - Processor One DIMM 1 Fault LED OFF
Predictive Failure Deasserted  p1.d2.fail  1 - Processor One DIMM 2 Fault LED OFF
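
If the sensor table is copied from the browser into a text file, the name and status checks above can be applied mechanically. This is a rough sketch that assumes the pasted rows keep the whitespace-separated Status, Name, Reading order of the older ILOM example. The function name `asserted_faults` is hypothetical.

```python
import re

# A row starts with the status text, followed by the sensor name and the reading.
ROW = re.compile(r"^(Predictive Failure (?:Asserted|Deasserted))\s+(\S+)\s+(.*)$")
FAULT_SUFFIXES = ("fail", "FAULT", "SERVICE")

SAMPLE = """\
Status Name Reading
Predictive Failure Asserted p1.d0.fail 2 - Processor One DIMM 0 Fault LED ON
Predictive Failure Deasserted p1.d1.fail 1 - Processor One DIMM 1 Fault LED OFF
Predictive Failure Deasserted p1.d2.fail 1 - Processor One DIMM 2 Fault LED OFF
"""


def asserted_faults(table_text):
    """Return names of fault sensors whose status is 'Predictive Failure Asserted'."""
    names = []
    for line in table_text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header or unrelated line
        status, name, _reading = match.groups()
        # Compare the whole word so that "Deasserted" is never mistaken for "Asserted".
        if status.endswith(" Asserted") and name.endswith(FAULT_SUFFIXES):
            names.append(name)
    return names


print(asserted_faults(SAMPLE))
```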

Example: Fan fault on module FB1/FM0 shown in the Indicators tab on a newer ILOM version.


Name                 Status
FB1/FM0/SERVICE On <-- Faulty fan module
FB1/FM1/SERVICE Off
FB1/FM2/SERVICE Off
FB1/FM0/OK Off <-- Fan module no longer OK
FB1/FM1/OK On
FB1/FM2/OK On
...
/SYS/FAN_FAULT On <-- Fan fault indicator is On
/SYS/LOCATE Off
/SYS/OK On
/SYS/PS_FAULT Off
/SYS/SERVICE On <-- System service LED is On
/SYS/TEMP_FAULT Off
...

 



For more information, refer to the ELOM or ILOM Administration Guide for your platform:
http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html

Checking LEDs and indicators using the Service Processor CLI

ILOM:
-> show -d properties -level all /SYS
 
Example: Chassis 'Service' LED ON
/SYS/SERVICE
Properties:
type = Indicator
value = On
Example: Processor Zero DIMM 2 LED is ON (Processor 0 socket, not CPU core)

/SYS/P0/D2/SERVICE
Properties:
type = Indicator
value = On
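
The `show -d properties -level all /SYS` output can be long, so it may help to save it from the SSH session and filter it offline. The sketch below assumes each target is printed as a path line followed by `key = value` property lines, as in the examples above. The sample text is illustrative, assembled from the indicator names shown in this document. The function names are hypothetical, and the exclusion of OK and LOCATE is an assumption (OK is expected to be On on a healthy system).

```python
SAMPLE = """\
/SYS/SERVICE
Properties:
type = Indicator
value = On

/SYS/OK
Properties:
type = Indicator
value = On

/SYS/LOCATE
Properties:
type = Indicator
value = Off

/SYS/P0/D2/SERVICE
Properties:
type = Indicator
value = On
"""


def parse_show_output(text):
    """Map each target path to a dict of its properties."""
    targets = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/"):
            current = line
            targets[current] = {}
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            targets[current][key.strip()] = value.strip()
    return targets


def lit_fault_indicators(targets):
    """Return indicator paths that are On, ignoring OK and LOCATE (assumption)."""
    return sorted(
        path
        for path, props in targets.items()
        if props.get("type") == "Indicator"
        and props.get("value") == "On"
        and path.rsplit("/", 1)[-1] not in ("OK", "LOCATE")
    )


print(lit_fault_indicators(parse_show_output(SAMPLE)))
```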


Example: Processor Zero DIMM 2 fault from the Fault Management Architecture logic (FMA/FDD)


-> show /SP/faultmgmt

/SP/faultmgmt
Targets:
0 (/SYS/MB/P0/D2)

Properties:

Commands:
cd
show
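
The faulted components are listed under `Targets:` in the output above. As a small sketch, assuming the `index (path)` layout shown in the example, the following extracts those paths from saved output. The function name `faulted_components` is hypothetical.

```python
import re

# A target line looks like "0 (/SYS/MB/P0/D2)" in the example above.
TARGET = re.compile(r"^\d+\s+\((/\S+)\)$")

SAMPLE = """\
/SP/faultmgmt
Targets:
0 (/SYS/MB/P0/D2)

Properties:

Commands:
cd
show
"""


def faulted_components(show_output):
    """Return the component paths listed in the Targets section."""
    components = []
    in_targets = False
    for raw in show_output.splitlines():
        line = raw.strip()
        if line.endswith(":"):
            # Section headers: Targets:, Properties:, Commands:
            in_targets = line == "Targets:"
            continue
        if in_targets:
            match = TARGET.match(line)
            if match:
                components.append(match.group(1))
    return components


print(faulted_components(SAMPLE))
```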

ELOM:

-> show -level all /SP
-> show -level all /SYS
Example: CPU1 disabled due to a fault
/SP/SystemInfo/CPU/CPU1
 Properties:
  Designation = CPU 1
  Manufacturer = AMD
  Name = Opteron
  Speed = 2800MHz
  Status = disabled
V20z/V40z:
$ sensor get --type led
Example - CPU0 DIMM3 and System Fault LEDs are ON

Identifier Value
cd.lp 0.00 On/Off
cpu0.lp 0.00 On/Off
cpu0.mem0.lp 0.00 On/Off
cpu0.mem1.lp 0.00 On/Off
cpu0.mem2.lp 0.00 On/Off
cpu0.mem3.lp 1.00 On/Off <-- 1 means ON
...
cpuplanar.lp 0.00 On/Off
faultswitch 1.00 On/Off <--
floppy.lp 0.00 On/Off
...
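
For the V20z/V40z output above, the lit LEDs are the rows whose value is 1.00. A minimal sketch, assuming the whitespace-separated `Identifier Value` layout of the example, is shown below. The function name `lit_leds` is hypothetical.

```python
SAMPLE = """\
Identifier Value
cd.lp 0.00 On/Off
cpu0.lp 0.00 On/Off
cpu0.mem0.lp 0.00 On/Off
cpu0.mem1.lp 0.00 On/Off
cpu0.mem2.lp 0.00 On/Off
cpu0.mem3.lp 1.00 On/Off <-- 1 means ON
...
cpuplanar.lp 0.00 On/Off
faultswitch 1.00 On/Off <--
floppy.lp 0.00 On/Off
...
"""


def lit_leds(sensor_output):
    """Return identifiers whose value is 1.00 (LED ON)."""
    lit = []
    for line in sensor_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # "..." or blank line
        try:
            value = float(fields[1])
        except ValueError:
            continue  # header row
        if value == 1.0:
            lit.append(fields[0])
    return lit


print(lit_leds(SAMPLE))
```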


Physically checking LEDs if you are local to the server

Physically examine both the front and back of the server for illuminated LEDs. For further information about LED states, refer to the appropriate Server Service Manual or Server Diagnostics Guide:

http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html

Checking platform events and sensors with ipmitool

Use the following common ipmitool commands to gather further data on the possible reasons for the platform state. Their output is also useful if you need to open a support call.

ipmitool -I lan -H <SP IP Address> -U <SP username> sel elist
ipmitool -I lan -H <SP IP Address> -U <SP username> sel info
ipmitool -I lan -H <SP IP Address> -U <SP username> sdr list all info
ipmitool -I lan -H <SP IP Address> -U <SP username> fru print
ipmitool -I lan -H <SP IP Address> -U <SP username> sensor
ipmitool -I lan -H <SP IP Address> -U <SP username> sunoem led get
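
When the output is being collected for a support call, it can be convenient to run the six commands above in one pass and save each result to its own file. The following is a sketch only: the helper names, the output file naming, and the example address are hypothetical, and it assumes ipmitool is on the PATH and will prompt for the SP password on each invocation, as it does when run by hand with only `-U`.

```python
import subprocess
from pathlib import Path

# The subcommands exactly as listed in this document.
SUBCOMMANDS = [
    ["sel", "elist"],
    ["sel", "info"],
    ["sdr", "list", "all", "info"],
    ["fru", "print"],
    ["sensor"],
    ["sunoem", "led", "get"],
]


def build_command(host, user, subcommand):
    """Build the ipmitool argument list for one subcommand."""
    return ["ipmitool", "-I", "lan", "-H", host, "-U", user, *subcommand]


def output_name(subcommand):
    """File name used for one subcommand's output (hypothetical convention)."""
    return "ipmitool_" + "_".join(subcommand) + ".txt"


def collect(host, user, out_dir):
    """Run every subcommand and write stdout plus stderr to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for sub in SUBCOMMANDS:
        result = subprocess.run(
            build_command(host, user, sub), capture_output=True, text=True
        )
        (out / output_name(sub)).write_text(result.stdout + result.stderr)


# Example call (not run here; the address is a placeholder):
# collect("192.0.2.10", "root", "sp_data")
```

Keeping stderr in the saved files preserves messages such as "Sun OEM Get LED command failed", which indicate that a newer ipmitool may be needed.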

See <Document 1009698.1> for more information on using ipmitool to collect system event, state, and LED information.

Recent versions of ILOM include a Snapshot feature, which automates collection of ipmitool outputs and other relevant diagnostic information from the Service Processor needed for troubleshooting platform problems. A 'normal' level ILOM snapshot is appropriate in most cases.

For more information, see <Document 1020204.1>

Gathering information on system issues using the Service Processor web GUI

Point your web browser to the Service Processor IP address or resolving DNS hostname, and enter your login credentials when prompted.

Checking the System Event Log (SEL)

After you have logged in to the Service Processor, click the "System Monitoring" tab, then click the "Event Logs" tab. From the drop-down list, select the event log category that you want to view. You can select from the following types of events:

  • Sensor-specific events - Events generated by sensors.
  • BIOS-generated events - Error messages generated in the BIOS.
  • System management software events - Events that occur within the ILOM software.

After you have selected a category, the Event Log table displays the matching events. Alternatively, depending on the ILOM/ELOM version, use the Display drop-down to show all events or a set number of events.

Checking ILOM Fault Management

To display a list of active system faults, click the "System Information" tab, then the "Fault Management" tab (not available on all platforms).

If a fault is present, click on the fault in the "ID" column to display more details.

Refer to the Integrated Lights Out Manager (ILOM) Administration Guide for your platform and ILOM version. Also see the ILOM Administration Guide Supplement for Sun Fire, if available for your platform, at http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html

Gathering information on system issues using Service Processor CLI

SSH into the Service Processor, then use the following commands to view the system event and fault logs:

ILOM:
-> show /SP/logs/event/list
-> show -d properties -level all /SP/faultmgmt
* /SP/faultmgmt is not available on all platforms
ELOM:
-> show /SP/AgentInfo/SEL
V20z/V40z:

$ sp get events -v


Gathering information about possible issues if you are local to the server

To check for issues physically at the platform, the platform must be down, because you need to enter the BIOS. If the server is up, use one of the other methods described above: ipmitool, the SP web GUI, or the SP CLI.

  • Power on the platform by pressing the power button.
  • Press F2 when prompted to enter the BIOS. Note any events that are reported.
  • Once in the BIOS, use the cursor keys to navigate to the tab labeled Advanced.
  • Navigate down to Event Log Configuration and press Enter.
  • Select View Event Log and examine it for possible causes of the outage. Press Esc to exit.
  • Back at the 'Advanced' tab, navigate to 'IPMI 2.0 Configuration', press Enter, then select 'View BMC System Event Log'.

NOTE: These events are displayed in raw format. Unless you are familiar with that format, use the ipmitool commands above instead, as ipmitool decodes the events automatically. Because some events are part of the normal power-on process, the events must be decoded before you can look for issues.

The messages can also be decoded manually by accessing the following document:

http://download.intel.com/design/servers/ipmi/IPMI2_0E4_Markup_061209.pdf

It is beyond the scope of this document to cover this manual process of decoding.

Previously Published As 91593






Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.