Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-71-1383349.1
Update Date:2012-03-07
Keywords:

Solution Type  Technical Instruction Sure

Solution  1383349.1 :   How to perform onsite diagnosis for a down x86 AMD system:ATR:1383349.1:4  


Related Items
  • Sun Fire X2200 M2 Server
  • Sun Fire V20z Server
  • Sun Fire X4540 Server
  • Sun Fire X4440 Server
  • Sun Blade X6420 Server Module
  • Sun Fire X4640 Server
  • Sun Blade X6220 Server Module
  • Sun Fire X4200 Server
  • Sun Blade X6440 Server Module
  • Sun Fire X4600 M2 Server
  • Sun Fire X4200 M2 Server
  • Sun Fire X4240 Server
  • Sun Fire X4140 Server
  • Sun Fire X4100 M2 Server
  • Sun Fire X2100 M2 Server
  • Sun Fire V40z Server
  • Sun Fire X4600 Server
  • Sun Fire X4100 Server
  • Sun Fire X4500 Server
  • Sun Fire X2100 Server
  • Sun Blade X6240 Server Module
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP




In this Document
  Goal
  Solution
     1. Investigate system power source
     2. Validate the customer can log into the Service Processor (SP/ELOM/ILOM)
     3. Troubleshoot power issues
     4. Perform internal and external visual inspection
     5. Collect basic server information regarding the outage using the Service Processor
     6. Hardware best practices
     7. Run platform diagnostics
     8. Collect Post Codes during the boot
     9. Collect diagnostic information for Oracle support
     ANNEX: Links of interest
  References


Oracle Confidential (PARTNER). Do not distribute to customers
Reason: FRU CAP

Applies to:

Sun Fire V20z Server - Version: Not Applicable and later   [Release: N/A and later ]
Sun Fire X4140 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire X4500 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire X4200 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Blade X6440 Server Module - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Information in this document applies to any platform.

Goal

How to perform onsite diagnosis for a down x64 AMD system. It applies to AMD Processor-based Servers and Blade servers.

Solution

How to perform On Site Diagnosis for a Down x64 AMD system

DISPATCH INSTRUCTIONS

WHAT SKILLS DOES THE ENGINEER NEED:(IS A SITE ENGINEER AVAILABLE?)
ILOM, Intermediate Linux/Unix Skills

Time Estimate: 120 minutes

TASK COMPLEXITY: 4

FIELD ENGINEER INSTRUCTIONS

PROBLEM OVERVIEW: System Down

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY? : Down Hard, unknown reason

WHAT ACTION DOES THE ENGINEER NEED TO TAKE:

It's very important to document the server settings before any hardware or software changes are made.

1. Investigate system power source

  • Are LEDs lit?
  • Are fans spinning?
  • Confirm power to all the AC Power Supplies.
  • In collaboration with the customer, investigate the system's power source, power cords, etc. for a potential issue.

2. Validate the customer can log into the Service Processor (SP/ELOM/ILOM)

Depending on the system the monitoring interface can vary between:
  • Service Processor (SP)
  • Embedded Lights Out Manager (ELOM)
  • Integrated Lights Out Manager (ILOM)
Refer to the Oracle x86 Servers documentation to identify which monitoring interface your system uses:
http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html

Verifying system power status via ipmitool

Run ipmitool from a remote system to the Service Processor with the command shown in the examples below. The resulting output will indicate whether power is on or off.

# ipmitool -I lanplus -U root -H <ILOM IP Address> chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : unknown
Last Power Event     :
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false


# ipmitool -I lanplus -U root -H <ILOM IP Address> chassis power status
Chassis Power is on
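
When many systems have to be checked, the "chassis status" output shown above can be parsed by a script instead of being read by eye. The following is a minimal sketch in Python, assuming the output keeps the "Field : value" layout of the example; the function names parse_chassis_status, is_powered_on and active_faults are illustrative and are not part of ipmitool.

```python
def parse_chassis_status(text):
    """Turn the 'Field : value' lines of `ipmitool chassis status` into a dict."""
    status = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank lines or anything that is not a field
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status


def is_powered_on(status):
    """True when the 'System Power' field reads 'on' (case-insensitive)."""
    return status.get("System Power", "").lower() == "on"


def active_faults(status):
    """Names of the fields containing 'Fault' whose value is 'true'."""
    return [k for k, v in status.items() if "Fault" in k and v.lower() == "true"]


# Sample taken from the output shown in this document.
SAMPLE = """\
System Power         : on
Power Overload       : false
Main Power Fault     : false
Power Control Fault  : false
Last Power Event     :
Drive Fault          : false
Cooling/Fan Fault    : false
"""

if __name__ == "__main__":
    parsed = parse_chassis_status(SAMPLE)
    print("powered on:", is_powered_on(parsed))
    print("active faults:", active_faults(parsed))
```

An empty field such as "Last Power Event" is kept with an empty value, so a missing key can be told apart from a field that is present but blank.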

Verifying system power status via the Service Processor CLI

Log in to the Service Processor via SSH:

# ssh -l <USERNAME> <ILOM IP Address>


Then use one of the following commands to determine the platform power status:
  • Service Processor (applies only to V20z/V40z platforms)
$  platform get power state
On

  • ELOM
-> show /SP/SystemInfo/CtrlInfo
  /SP/SystemInfo/CtrlInfo
    Targets:
    Properties:
        PowerStatus = on

  • ILOM
-> show /SYS
...
   /SYS
   Properties:
   type = Host System
   ipmi_name = /SYS
   product_name = SUN FIRE X4440
   product_part_number = 602-4057-01
   product_serial_number = 0812ZYX001
   product_manufacturer = SUN MICROSYSTEMS
   power_state = On
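
The "name = value" property lines printed by the ELOM and ILOM show commands above can be read the same way. This is a sketch only, assuming the session output has been saved as text and that properties appear one per line as in the examples; parse_show_properties is a hypothetical helper name.

```python
def parse_show_properties(text):
    """Collect 'name = value' lines from saved ELOM/ILOM `show` output."""
    props = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # headings such as '/SYS' or 'Properties:' carry no value
        name, _, value = line.partition("=")
        props[name.strip()] = value.strip()
    return props


# Sample taken from the ILOM example in this document.
ILOM_SAMPLE = """\
   /SYS
   Properties:
   type = Host System
   product_name = SUN FIRE X4440
   power_state = On
"""

if __name__ == "__main__":
    props = parse_show_properties(ILOM_SAMPLE)
    # ILOM reports 'power_state'; the ELOM example reports 'PowerStatus'.
    state = props.get("power_state") or props.get("PowerStatus")
    print("power state:", state)
```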

Verifying system power status via the Service Processor Web GUI

Integrated Lights Out Manager (ILOM) and Embedded Lights Out Manager (ELOM) based Service Processors provide an easy-to-use web interface for managing the platform. Point your web browser to the Service Processor IP address or resolving DNS hostname, and enter your login credentials when prompted.

After you have logged in to the Service Processor, click the "Remote Control" tab, then click the "Remote Power Control" tab.

This contains the status of the platform, for example:   
Host is currently on

Alternatively, click the "System Monitoring" tab, then "Summary" tab where 'Power Status' will be shown.

If the power status is OFF and you expect it to be ON, refer to How to check why the system powered off, on Sun X64 servers. (Doc ID 1002941.1)

Refer to the ELOM or ILOM Administration Guide for your platform and firmware version. Also see the ELOM or ILOM Administration Guide Supplement for your platform:
http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html

Related ILOM documentation:
Integrated Lights Out Manager (ILOM) 2.0 documentation: http://docs.oracle.com/cd/E19720-01/
Integrated Lights Out Manager (ILOM) 3.0 and CMM documentation: http://docs.oracle.com/cd/E19860-01/

3. Troubleshoot power issues

Verify the state of the Power OK LED from the front or rear of the server. LED states may vary slightly between platforms, but generally:

  • STEADY GREEN ON - System is powered on.
  • SLOW BLINK GREEN - System is powered OFF, but standby power is present.
  • NOT ILLUMINATED (OFF) - Server main power and standby power are off (no AC power, not plugged in, defective power cord).

Investigate the system's power source, power cords, power supplies for a potential issue.
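
For a site visit checklist or a reporting script, the Power OK LED states described in this step can be kept as a small lookup table. The sketch below is illustrative only: the key strings and the function name interpret_power_ok_led are assumptions, and the platform Service Manual remains the reference for exact LED behavior.

```python
# Generic Power OK LED states as described in this document.
POWER_OK_LED = {
    "steady green": "System is powered on.",
    "slow blink green": "System is powered off, but standby power is present.",
    "off": "Main and standby power are off "
           "(no AC power, not plugged in, or defective power cord).",
}


def interpret_power_ok_led(observed):
    """Map an observed LED state (free text, any case) to its meaning."""
    key = " ".join(observed.lower().split())  # normalize case and spacing
    return POWER_OK_LED.get(
        key, "Unknown state; check the platform Service Manual."
    )


if __name__ == "__main__":
    print(interpret_power_ok_led("Slow Blink Green"))
```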

Refer to the following Oracle documents for help on diagnosing power issues on x64 platforms:

How to check if a Sun X64 server is powered on (Doc ID 1002926.1)
How to check why the system powered off, on Sun X64 servers. (Doc ID 1002941.1)

4. Perform internal and external visual inspection

- Check whether the General Service Fault LED or any Component Fault LED is lit, which would indicate a hardware failure.


- A system shutdown can be initiated by a request from either of the following:

  • Board management controller (BMC). The conditions that trigger the BMC to issue a shutdown request are:
    • An over-temperature condition for more than 1 second.
    • Multiple fan failures.
  • Fault condition. The fault conditions that trigger a shutdown are:
    • All power supplies have failed or have been removed.
    • A power supply has been out of spec for more than 100 ms.
    • The hot-swap circuit has faulted.
    • An over-temperature condition has occurred.
  1. Inspect the external status indicator LEDs, which can indicate a defective component.
  2. Verify that nothing in the server environment is blocking air flow or making a contact that could short out power.
  3. If the server does not power on, check the power source and power cords for a potential issue.
  4. Disconnect the power cords for a few minutes to discharge the capacitors.
  5. Reconnect the power cords and check whether the power issue remains.
  6. If no power is distributed, refer to the Sun System Handbook (https://support.oracle.com/handbook_private/) wiring diagram to identify the components that could trigger this power issue.
  7. Inspect the cables, cards and pins for any evidence of a visible defect.
  8. Reseat processors, riser cards, PCI cards, power supplies, memory modules, fans, cables, and disks.
  9. Disconnect any external storage array to verify whether the same symptoms remain.
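
The shutdown triggers listed earlier in this step (BMC requests and fault conditions) can be summarized as a simple rule check, which may help when matching sensor readings or log entries against a likely shutdown cause. This is an illustrative model only, not the BMC or firmware logic; the parameter names and the function shutdown_reasons are hypothetical, and only the thresholds (1 second, 100 ms) come from this document.

```python
def shutdown_reasons(over_temp_seconds=0.0, failed_fans=0,
                     working_power_supplies=1, psu_out_of_spec_ms=0.0,
                     hot_swap_fault=False):
    """Return the documented shutdown triggers matched by the readings."""
    reasons = []
    # BMC-initiated shutdown requests
    if over_temp_seconds > 1.0:
        reasons.append("BMC: over-temperature for more than 1 second")
    if failed_fans > 1:
        reasons.append("BMC: multiple fan failures")
    # Fault conditions
    if working_power_supplies == 0:
        reasons.append("Fault: all power supplies failed or removed")
    if psu_out_of_spec_ms > 100.0:
        reasons.append("Fault: power supply out of spec for more than 100 ms")
    if hot_swap_fault:
        reasons.append("Fault: hot-swap circuit faulted")
    if over_temp_seconds > 0.0:
        reasons.append("Fault: over-temperature condition occurred")
    return reasons


if __name__ == "__main__":
    for reason in shutdown_reasons(failed_fans=2, psu_out_of_spec_ms=250.0):
        print(reason)
```

An empty list means that none of the documented triggers matches the readings, in which case the power source and cabling checks in the list above remain the next step.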
 

5. Collect basic server information regarding the outage using the Service Processor


Log in to the Service Processor using ssh (requires the Service Processor IP address or resolvable DNS hostname):

# ssh -l <USERNAME> <ILOM IP Address>

Display System Event Logs, sensor & fault indicator information:

IPMITOOL:

# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sensor
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sunoem sbled get all
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sdr list all info
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sdr

Be sure to use the latest Oracle-compiled version of ipmitool to collect this information.

ipmitool is part of the Oracle Hardware pack. For more information, see http://download.oracle.com/docs/cd/E19960-01/index.html

Refer to Doc ID 1009698.1 for detailed information on the use of ipmitool for collection of data from the platform.
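
To avoid typing each ipmitool query by hand, the commands listed in this step can be driven from a short script that saves every output to its own file. The following Python sketch assumes ipmitool is installed and on the PATH of the remote system; the names QUERIES, build_commands and collect, and the output directory ipmi_out, are illustrative. Password handling is left to ipmitool itself.

```python
import subprocess
from pathlib import Path

# The six queries listed in this step, keyed by an output file name.
QUERIES = {
    "sel_elist": ["sel", "elist"],
    "sel_elist_verbose": ["-v", "sel", "elist"],
    "sensor": ["sensor"],
    "sbled": ["sunoem", "sbled", "get", "all"],
    "sdr_list": ["sdr", "list", "all", "info"],
    "sdr_verbose": ["-v", "sdr"],
}


def build_commands(host, user, interface="lan"):
    """Return {name: argv} for each query against one Service Processor."""
    base = ["ipmitool", "-I", interface, "-H", host, "-U", user]
    return {name: base + args for name, args in QUERIES.items()}


def collect(host, user, out_dir="ipmi_out"):
    """Run every query and write its output to <out_dir>/<name>.txt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, argv in build_commands(host, user).items():
        with open(out / (name + ".txt"), "w") as fh:
            # check=False: one failing query should not stop the others.
            subprocess.run(argv, stdout=fh, stderr=subprocess.STDOUT,
                           check=False)


if __name__ == "__main__":
    # Print the commands only; call collect() to actually run them.
    for argv in build_commands("<ILOM IP address>", "<ILOM Username>").values():
        print(" ".join(argv))
```

Keeping one file per query makes it easier to attach only the requested outputs to a Service Request.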

ILOM

Log in to the Server's ILOM and execute the commands:
-> show /SP/logs/event/list
-> show -d properties -level all /SYS
-> show -o table -level all /SP/faultmgmt (Only available with latest ILOM versions)

ELOM:

Log in to the Server's ELOM and execute the commands:
-> show /SP/AgentInfo/SEL
-> show -d properties -level all /SP/SystemInfo

CMM (Blade specific)

Log in to the CMM of the chassis in which the faulty Blade is inserted and execute the commands:
-> show /CMM/logs/event/list
-> show -d properties -level all /

V20z & V40z specific Service Processor commands

Log in to the Server's SP IP Address and execute the following commands:
# sp get events -v
# sensor get --verbose
# inventory get all -v
# sp get tdulog -f stdout

6. Hardware best practices

Best practices for isolating a hardware issue when an Oracle x64 AMD Processor-based server is down:

- Power off the platform and disconnect the power cords for a few minutes
- Update platform firmware to the latest versions (ILOM/BIOS/HW RAID/PCI Cards)
- Review the ILOM logs and sensors along with the OS boot sequence to verify whether any hardware or software issue is reported
-- Start the SP console to monitor the boot process
-- Start the Java Remote console to monitor OS errors
-- View component information to determine component status.
-- View the ILOM system event log.
- Run Oracle VTS to verify whether any hardware error is reported
- Disconnect any external storage array
- If a component is reported faulty, replace it
- If unable to boot the OS, reduce to a minimum CPU/memory configuration to isolate the faulty component
- Remove any additional PCI cards
- If there is no evidence of a hardware issue and the OS boots, consider gathering Operating System information
- Update platform-related OS drivers
- Engage the OS/software support to assist with a possible software issue

BIOS POST

From the point that the host subsystem is powered on and begins executing code, BIOS code is executed. The sequence that BIOS goes through, from the first point where code is executed to the point that the operating system booting begins, is referred to as POST (power-on self-test).

If a hardware issue is detected during POST, the boot process stops and a 4-digit error code may be displayed at the console. Refer to your platform Service Manual or Diagnostic Guide to translate the POST code.

Boot device

Verify the boot device is correct from the BIOS Boot tab:
Main    Advanced    PCIPnP    Boot    Security    Chipset    Exit
********************************************************************************
* Boot Settings                                       * Configure Settings     *
* *************************************************** * during System Boot.    *
* * Boot Settings Configuration                       *                        *
*                                                     *                        *
* * Boot Device Priority                              *                        *
* * Hard Disk Drives                                  *                        *
* * Removable Drives                                  *                        *
* * CD/DVD Drives                                     *                        *
*                                                     *                        *
*                                                     *                        *
*                                                     *                        *
*                                                     *                        *
*                                                     * **    Select Screen    *
*                                                     * **    Select Item      *
*                                                     * Enter Go to Sub Screen *
*                                                     * F1    General Help     *
*                                                     * F10   Save and Exit    *
*                                                     * ESC   Exit             *
*                                                     *                        *
*                                                     *                        *
********************************************************************************

BIOS boot device output is also available as a text file attached to this document: BIOS.TXT

Disks

To troubleshoot a disk issue identify your HW RAID Controller and follow the instructions from the document below:
How to Identify BIOS and Solaris[TM] Hardware RAID Status (Doc ID 1013107.1)

Blades

When troubleshooting a Blade issue, swap the Blade module to another known working slot to isolate the root cause.
  • If the problem follows the Blade then the failure is located on the Blade
  • If the Blade works in another slot then the problem could be related to the slot

Fans

A faulty fan or fan board can prevent an x64 server from booting because of the potential for system over-temperature and component damage.

Verify the Fans and Fan Board status from the ILOM Monitoring tool.

Memory modules

When investigating a memory issue:
  • Verify that only supported memory modules are inserted
  • Verify that the population rules are respected
  • Press the DIMM Fault Remind button, if available for your platform, to turn ON the slot LED for the faulty DIMM
Additionally, when memory errors are logged in Windows or Linux log files, install HERD to translate the memory error addresses into CPU slot/memory slot:
How to analyze Memory Errors on x64 Servers running Linux using HERD (Doc ID 1019683.1)

CAUTION: After replacing an Oracle server motherboard, it is necessary to update the platform serial number, which is the reference used to log Service Requests.

7. Run platform diagnostics

Oracle provides comprehensive diagnostic tools that test and validate Oracle hardware by verifying the connectivity and functionality of most hardware controllers and devices on Oracle hardware platforms.

The diagnostic tools can usually be executed booting from:

  • the Tools and Drivers CD/DVD
  • the ILOM
  • an external drive
  • PXE (network)
  • the running Operating System


Prefer a standalone method and avoid executing diagnostics from a running operating system, because doing so could generate false I/O access errors during the tests.

Oracle VTS

SunVTS software has a sophisticated graphical user interface (GUI) that provides test configuration and status monitoring. The user interface can be run on one system to display the SunVTS testing of another system on the network. SunVTS software also provides a TTY-mode interface for situations in which running a GUI is not possible.

The following tests are available in SunVTS: Processor/Memory/Disk/Graphics/Media/Ioports/Interconnects/Network/Environment/HBA

For more information refer to Oracle VTS 7.0 Software User's guide: http://docs.oracle.com/cd/E19719-01/E21664/index.html

PcCheck

PcCheck is diagnostic software that thoroughly checks the hardware components, including memory modules, floppy drives, hard disk drives, CD-ROM/DVD drives, I/O ports, and the graphics controller.

To run the PcCheck diagnostics follow the steps below:

  1. Boot the system with the Supplemental CD
  2. At the main menu select "Run Hardware Diagnostics"
  3. At the PcCheck main menu select "Advanced Diagnostic Tests"
  4. At the Advanced Diagnostic Tests menu select "Memory"
  5. Then select "Test System Memory"

HDT Tool (AMD Processor-Based specific)

The Hardware Debug Tool (HDT) is a low level diagnostic tool that tests access to the system bus, memory spaces and CPU registers of the AMD Processor-Based platform.

You can access HDT through the server module as follows:

  • Log in to the system SP as the user sunservice
  • Execute the command:
# hdtl -q


When running ILOM 3.x, the preferred method to run HDT is to execute an ILOM Snapshot with the Full dataset. Note that this may cause a reset, depending on the failure detected.

8. Collect Post Codes during the boot

If there is no video display when attempting to power on a platform, run the hdtl command below from the ILOM to capture the last POST code:

# hdtl -bp8

0156   port80: 08c6
waiting for next POST code ............
waiting for next POST code ............
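
If the hdtl session is saved to a text file, the last POST code can be extracted from the capture automatically. The sketch below assumes that every code is reported on a line containing "port80:" followed by hexadecimal digits, as in the sample above; this format is inferred from that single sample, and last_post_code is a hypothetical helper name.

```python
import re

# Matches e.g. "0156   port80: 08c6" and captures the hexadecimal code.
PORT80_RE = re.compile(r"port80:\s*([0-9A-Fa-f]+)")


def last_post_code(capture):
    """Return the last POST code found in the capture, or None."""
    codes = PORT80_RE.findall(capture)
    return codes[-1].lower() if codes else None


# Sample taken from the output shown in this document.
CAPTURE = """\
0156   port80: 08c6
waiting for next POST code ............
waiting for next POST code ............
"""

if __name__ == "__main__":
    print("last POST code:", last_post_code(CAPTURE))
```

The returned value still has to be translated with the platform Service Manual; the script only saves reading through a long capture.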

Capture the last POST codes from a Sun Fire V20z/V40z Service Processor with the command:

$ sp get port80 -m

0x97

Refer to the AMD Platform Service Manual to translate the POST code, which can help diagnose at which point of the initialization the boot is failing.

To translate Sun Fire V20z/V40z POST codes, refer to Cheatsheet for V20z and V40z Post Codes (Doc ID 1006320.1)

9. Collect diagnostic information for Oracle support

Collect an ILOM Service Snapshot from the ILOM Web GUI

The purpose of the ILOM Service Snapshot utility is to collect data for use by Oracle Services personnel to diagnose system problems.

An ILOM snapshot output can be generated from the ILOM GUI -> Maintenance tab -> Snapshot tab.

Select the desired Data Set:
  • Normal: Specifies that ILOM, operating system, and hardware information is collected.
  • FRUID: Available as of ILOM 3.0.3, specifies that information about FRUs currently configured on your server in addition to the data collected by the Normal set option is collected.
  • Full: Specifies that all data is to be collected. Selecting Full might reset the system on an AMD Processor-based platform if a HyperTransport bus failure is detected when running HDT low-level diagnostics.
  • Custom: Allows you to choose one or more data sets

Caution: Customers should not run this utility unless requested to do so by Oracle Services.

For more information about the ILOM Service Snapshot utility please refer to the Oracle Integrated Lights Out Manager (ILOM) 3.0 Web Interface Procedures Guide:
http://docs.oracle.com/cd/E19860-01/


If an ILOM Snapshot cannot be collected, it is recommended to collect one of the following outputs:

# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> chassis status
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> fru
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sensor
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sunoem sbled get all
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sdr list all info
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sdr

# ssh -l <USERNAME> <ILOM IP Address>
-> show / -l all -o table
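
Once the outputs above have been saved as text files, they can be bundled into a single archive for upload. This is a minimal sketch using the Python standard library, assuming the files sit together in one directory; bundle_outputs and the example paths are hypothetical names.

```python
import tarfile
from pathlib import Path


def bundle_outputs(src_dir, archive_path):
    """Pack every regular file in src_dir into a gzip-compressed tar file.

    Returns the list of file names that were added.
    """
    # List the files first so the archive itself is never picked up.
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for path in files:
            # arcname keeps only the file name, not the local directory path.
            tar.add(str(path), arcname=path.name)
    return [p.name for p in files]


if __name__ == "__main__":
    src = Path("ipmi_out")  # example directory name, adjust as needed
    if src.is_dir():
        print(bundle_outputs(src, "ipmi_out.tar.gz"))
    else:
        print("nothing to bundle:", src, "does not exist")
```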


For V20z or V40z platforms, the TDU logs must also be collected:

# sp get tdulog -f stdout


Refer to the following document for more details:
How to Collect Data from the TDULOGs on Sun Fire[TM]V20z/V40z (Doc ID 1018266.1)

ANNEX: Links of interest

Oracle x86 Servers Documentation
http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html#hic

Firmware Downloads and Release History for Sun Systems
http://www.oracle.com/technetwork/systems/patches/firmware/release-history-jsp-138416.html

Sun x86 and x64 Platforms: Matrix of expansion cards (Doc ID 1374659.1)

Sun System Handbook
https://support.oracle.com/handbook_private/

Oracle VTS 7.0
http://docs.oracle.com/cd/E19719-01/

Systems Management and Diagnostics
http://www.oracle.com/us/products/applications/crmondemand/login/sys-mgmt-networking-190072.html

Oracle Integrated Lights Out Manager (ILOM) 3.0 Documentation
http://docs.oracle.com/cd/E19860-01/index.html

Sun Integrated Lights Out Manager (ILOM) 2.0 Documentation
http://docs.oracle.com/cd/E19720-01/index.html

Sun Installation Assistant for x64 Servers Documentation
http://docs.oracle.com/cd/E19593-01/index.html

How to update the Serial Number on Oracle x64 platforms (Doc ID 1364359.1)

RAID Management Software Documentation
http://docs.oracle.com/cd/E23383_01/index.html

If unsure how to proceed, or unable to perform the above process, collect as much information pertaining to the boot failure as possible (console logs, error messages, etc.), then call back in and request the next available engineer.


References

<NOTE:1002941.1> - How to check why the system powered off, on Sun X64 servers.
<NOTE:1002926.1> - How to check if a Sun X64 server is powered on
<NOTE:1330254.1> - X86 Product Home
<NOTE:1418253.1> - How to perform onsite diagnosis for a down x64 Intel system:ATR:1418253.1:4
<NOTE:1431330.1> - Collect Operating System Data to troubleshoot X86 system issues

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.