Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type Technical Instruction Sure Solution 1383349.1 : How to perform onsite diagnosis for a down x86 AMD system:ATR:1383349.1:4
Oracle Confidential (PARTNER). Do not distribute to customers
Applies to:
Sun Fire V20z Server - Version: Not Applicable
Sun Fire X4140 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire X4500 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire X4200 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Blade X6440 Server Module - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Information in this document applies to any platform.

Goal
How to perform onsite diagnosis for a down x64 AMD system. It applies to AMD Processor-based Servers and Blade servers.

Solution
How to perform On Site Diagnosis for a Down x64 AMD system

DISPATCH INSTRUCTIONS
WHAT SKILLS DOES THE ENGINEER NEED: (IS A SITE ENGINEER AVAILABLE?) ILOM, Intermediate Linux/Unix Skills
Time Estimate: 120 minutes
TASK COMPLEXITY: 4

FIELD ENGINEER INSTRUCTIONS
PROBLEM OVERVIEW: System Down
WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?: Down Hard, unknown reason
WHAT ACTION DOES THE ENGINEER NEED TO TAKE:
It's very important to document the server settings before any hardware or software changes are made.

1. Investigate system power source
- Are LEDs lit?
- Are fans spinning?
- Confirm power to all the AC Power Supplies.
- In collaboration with the customer, investigate the system's power source, power cords, etc. for a potential issue.

2. Validate the customer can log into the Service Processor (SP/ELOM/ILOM)
Depending on the system the monitoring interface can vary between:
http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html

Verifying system power status via ipmitool
Run ipmitool from a remote system against the Service Processor with the command shown in the example below. The resulting output indicates whether power is on or off.
# ipmitool -I lanplus -U root -H <ILOM IP Address> chassis status
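The power state can be extracted from the chassis status output with a small filter. The following is a minimal sketch, assuming the output contains a line of the form "System Power : on" (the usual ipmitool layout); the function name power_state is hypothetical and not part of ipmitool.

```shell
#!/bin/sh
# power_state: read `ipmitool ... chassis status` output on stdin and print
# the value of the "System Power" field (typically "on" or "off").
# Assumption: the field appears as "System Power         : on".
# Prints "unknown" when no such line is found.
power_state() {
    awk -F: '
        /^System Power/ {
            gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim surrounding whitespace
            print $2
            found = 1
            exit
        }
        END { if (!found) print "unknown" }
    '
}
```

To use it, pipe the output of the chassis status command shown above into power_state. A result of "unknown" suggests the Service Processor did not answer or the output layout differs from the assumed one.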

Verifying system power status via the Service Processor CLI
Log in to the Service Processor via SSH:
# ssh -l <USERNAME> <ILOM IP Address>
Then use one of the following commands to determine the platform power status:
$ platform get power state
-> show /SP/SystemInfo/CtrlInfo
-> show /SYS

Verifying system power status via the Service Processor Web GUI
Integrated Lights Out Manager (ILOM) and Embedded Lights Out Manager (ELOM) based Service Processors provide an easy-to-use web interface for managing the platform. Point your web browser to the Service Processor IP address or resolvable DNS hostname, and enter your login credentials when prompted.
After you have logged in to the Service Processor, click the "Remote Control" tab, then the "Remote Power Control" tab. This shows the status of the platform, for example:
Host is currently on
Alternatively, click the "System Monitoring" tab, then the "Summary" tab, where 'Power Status' is shown.
If the status is OFF and you expect it to be ON, refer to How to check why the system powered off, on Sun X64 servers. (Doc ID 1002941.1)
Refer to the ELOM or ILOM Administration Guide for your platform and firmware version. Also see the ELOM or ILOM Administration Guide Supplement for your platform:
http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html
Related ILOM documentation:
Integrated Lights Out Manager (ILOM) 2.0 documentation: http://docs.oracle.com/cd/E19720-01/
Integrated Lights Out Manager (ILOM) 3.0 and CMM documentation: http://docs.oracle.com/cd/E19860-01/

3. Troubleshoot power issues
Verify the state of the Power OK LED from the front or rear of the server. LED states may vary slightly between platforms, but generally:
Investigate the system's power source, power cords, and power supplies for a potential issue.
Refer to the following Oracle documents for help on diagnosing power issues on x64 platforms:
How to check if a Sun X64 server is powered on (Doc ID 1002926.1)
How to check why the system powered off, on Sun X64 servers. (Doc ID 1002941.1)

4. Perform internal and external visual inspection
- Confirm whether the General Service Fault LED is lit or any Component Fault LED is ON, which would indicate a hardware failure.

5. Collect basic server information regarding the outage using the Service Processor
Log in to the Service Processor using ssh (requires the Service Processor IP address or resolvable DNS hostname):
# ssh -l <USERNAME> <ILOM IP Address>
Display System Event Logs, sensor and fault indicator information:

IPMITOOL:
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sel elist
Be sure to use the latest ipmitool version compiled by Oracle to collect this information. ipmitool is part of the Oracle Hardware pack; more info: http://download.oracle.com/docs/cd/E19960-01/index.html
Refer to Doc ID 1009698.1 for detailed information on the use of ipmitool for collecting data from the platform.

ILOM:
Log in to the server's ILOM and execute the command:
-> show /SP/logs/event/list

ELOM:
Log in to the server's ELOM and execute the command:
-> show /SP/AgentInfo/SEL

CMM (Blade specific):
Log in to the CMM of the chassis in which the faulty Blade is inserted and execute the commands:

V20z & V40z specific Service Processor commands:
Log in to the server's SP IP address and execute the following command:
# sp get events -v

6. Hardware best practices
Best-practice scenario to isolate a hardware issue when facing a down Oracle x64 AMD Processor-Based server:
- Power off the platform and disconnect the power cords for a few minutes
- Update the platform firmware to the latest version (ILOM/BIOS/HW RAID/PCI Cards)
- Review ILOM logs and sensors along with the OS boot sequence to verify whether any hardware or software issue is reported
-- Start the SP console to monitor the boot process
-- Start the Java Remote Console to monitor OS errors
-- View component information to determine component status
-- View the ILOM system event log
- Run Oracle VTS to verify whether any hardware error is reported
- Disconnect any external storage array
- If a component is reported faulty, replace it
- If unable to boot the OS, reduce to a minimum CPU/Memory configuration to isolate the faulty component
- Remove any additional PCI card
- If there is no evidence of a hardware issue and the OS is booting, consider gathering Operating System information
- Update platform-related OS drivers
- Engage OS/software support to assist with a possible software issue

BIOS POST
From the point that the host subsystem is powered on and begins executing code, BIOS code is executed. The sequence that BIOS goes through, from the first point where code is executed to the point that operating system booting begins, is referred to as POST (power-on self-test).
If a hardware issue is detected during POST, the boot process stops and a 4-digit error code may be displayed at the console. Refer to your platform Service Manual or Diagnostic Guide to translate the POST code.

Boot device
Verify the boot device is correct from the BIOS Boot tab:
Main Advanced PCIPnP Boot Security Chipset Exit
BIOS boot device output is also available as a text file attached to this document: BIOS.TXT

Disks
To troubleshoot a disk issue, identify your HW RAID controller and follow the instructions from the document below:
How to Identify BIOS and Solaris[TM] Hardware RAID Status (Doc ID 1013107.1)

Blades
When troubleshooting a Blade issue, move the Blade module to another known working slot to isolate the root cause.

Fans
A faulty fan or fan board can prevent an x64 server from booting because of the potential for system over-temperature and component damage.
Verify the fan and fan board status from the ILOM monitoring tool.

Memory modules
When investigating a memory issue, refer to:
How to analyze Memory Errors on x64 Servers running Linux using HERD (Doc ID 1019683.1)

CAUTION: After replacing an Oracle server motherboard, it is necessary to update the platform serial number, which is the reference used to log Service Requests.

7. Run platform diagnostics
Oracle provides comprehensive diagnostic tools that test and validate Oracle hardware by verifying the connectivity and functionality of most hardware controllers and devices on Oracle hardware platforms.

Oracle VTS
SunVTS software has a sophisticated graphical user interface (GUI) that provides test configuration and status monitoring. The user interface can be run on one system to display the SunVTS testing of another system on the network. SunVTS software also provides a TTY-mode interface for situations in which running a GUI is not possible.

PcCheck
PcCheck is diagnostic software that thoroughly checks the hardware components, including memory modules, floppy drives, hard disk drives, CD-ROM/DVD drives, I/O ports, and the graphics controller.

HDT Tool (AMD Processor-Based specific)
The Hardware Debug Tool (HDT) is a low-level diagnostic tool that tests access to the system bus, memory spaces and CPU registers of the AMD Processor-Based platform.
# hdtl -q
When running ILOM 3.x, the preferred method to run HDT is to execute an ILOM Snapshot with the Full data set, which may cause a reset depending on the failure detected.

8. Collect POST codes during the boot
If there is no video display when attempting to power on a platform, run the hdt command below from the ILOM to catch the last POST code:
# hdtl -bp8
Capture the last POST codes from a Sun Fire V20z/V40z Service Processor with the command:
$ sp get port80 -m
Refer to the AMD Platform Service Manual to translate the POST code, which can help diagnose at which point the boot is failing during initialization.
To translate Sun Fire V20z/V40z POST codes, refer to Cheatsheet for V20z and V40z Post Codes (Doc ID 1006320.1)

9. Collect diagnostic information for Oracle support
Collect ILOM Service Snapshot utility from ILOM Web GUI
The purpose of the ILOM Service Snapshot utility is to collect data for use by Oracle Services personnel to diagnose system problems.
An ILOM snapshot output can be generated from the ILOM GUI -> Maintenance tab -> Snapshot tab. Select the desired Data Set:
Caution: Customers should not run this utility unless requested to do so by Oracle Services.
For more information about the ILOM Service Snapshot utility, refer to the Oracle Integrated Lights Out Manager (ILOM) 3.0 Web Interface Procedures Guide: http://docs.oracle.com/cd/E19860-01/
If an ILOM Snapshot cannot be collected, it is recommended to collect one of the following outputs:
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> chassis status
# ssh -l <USERNAME> <ILOM IP Address>
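When a snapshot is not available, the ipmitool output can be saved to files for attachment to the Service Request. The following is a minimal sketch, not an Oracle-provided tool: the function name collect_fallback, the output file names, and the IPMITOOL override variable are all hypothetical. It runs the chassis status command above plus the sel elist command from step 5, and ipmitool is expected to prompt for the password.

```shell
#!/bin/sh
# collect_fallback HOST USER OUTDIR
# Saves the chassis status and the System Event Log to files under OUTDIR.
# IPMITOOL is a hypothetical override for the ipmitool binary (useful for
# a dry run); it defaults to "ipmitool" found on the PATH.
collect_fallback() {
    host=$1
    user=$2
    outdir=$3
    tool=${IPMITOOL:-ipmitool}

    mkdir -p "$outdir" || return 1

    # Same invocations as in the text, with output redirected to files.
    "$tool" -I lan -H "$host" -U "$user" chassis status \
        > "$outdir/chassis_status.txt" || return 1
    "$tool" -I lan -H "$host" -U "$user" sel elist \
        > "$outdir/sel_elist.txt" || return 1

    echo "Saved output to $outdir"
}
```

Call it with the Service Processor address, the ILOM user name and a target directory. The function stops at the first failing command, so a partial result indicates which query the Service Processor did not answer.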

For V20z or V40z platforms it is required to collect the tdu logs:
# sp get tdulog -f stdout
Refer to the following document for more details: How to Collect Data from the TDULOGs on Sun Fire[TM]V20z/V40z (Doc ID 1018266.1)

ANNEX: Links of interest
Oracle x86 Servers Documentation: http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html#hic
Firmware Downloads and Release History for Sun Systems: http://www.oracle.com/technetwork/systems/patches/firmware/release-history-jsp-138416.html
Sun x86 and x64 Platforms: Matrix of expansion cards (Doc ID 1374659.1)
Sun System Handbook: https://support.oracle.com/handbook_private/
Oracle VTS 7.0: http://docs.oracle.com/cd/E19719-01/
Systems Management and Diagnostics: http://www.oracle.com/us/products/applications/crmondemand/login/sys-mgmt-networking-190072.html
Oracle Integrated Lights Out Manager (ILOM) 3.0 Documentation: http://docs.oracle.com/cd/E19860-01/index.html
Sun Integrated Lights Out Manager (ILOM) 2.0 Documentation: http://docs.oracle.com/cd/E19720-01/index.html
Sun Installation Assistant for x64 Servers Documentation: http://docs.oracle.com/cd/E19593-01/index.html
How to update the Serial Number on Oracle x64 platforms (Doc ID 1364359.1)
RAID Management Software Documentation: http://docs.oracle.com/cd/E23383_01/index.html

If unsure how to proceed, or unable to perform the above process, collect as much information pertaining to the boot failure as possible (console logs, error messages, etc.), call back in and request the next available engineer.

References
<NOTE:1002941.1> - How to check why the system powered off, on Sun X64 servers.
<NOTE:1002926.1> - How to check if a Sun X64 server is powered on
<NOTE:1330254.1> - X86 Product Home
<NOTE:1418253.1> - How to perform onsite diagnosis for a down x64 Intel system:ATR:1418253.1:4
<NOTE:1431330.1> - Collect Operating System Data to troubleshoot X86 system issues

Attachments
This solution has no attachment