Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1008335.1
Update Date:2012-07-30
Keywords:

Solution Type  Troubleshooting Sure

Solution  1008335.1 :   Sun[TM] x86/x64 Guide to System Troubleshooting  


Related Items
  • Sun Fire X2100 Server
  • Sun Ultra 20 M2 Workstation
  • Sun Fire X4140 Server
  • Sun Netra CT900 Server
  • Sun Java Workstation W2100z
  • Sun Fire X2100 M2 Server
  • Sun Fire X4240 Server
  • Sun Fire V20z Compute Grid Rack System
  • Sun Fire X4150 Server
  • Sun Fire V20z Server
  • Sun Netra X4200 M2 Server
  • Sun Ultra 40 Workstation
  • Sun Ultra 24 Workstation
  • Sun Fire X4450 Server
  • Sun Fire X2200 M2 Server
  • Sun Fire V40z Server
  • Sun Netra X4250 Server
  • Sun Ultra 20 Workstation
  • Sun Ultra 27 Workstation
  • Sun Ultra 40 M2 Workstation
  • Sun Fire X4540 Server
  • Sun Java Workstation W1100z
  • Sun Fire X4440 Server
  • Sun Fire X2250 Server
  • Sun Fire X4250 Server
Related Categories
  • PLA-Support>Sun Systems>x64>Server>SN-x64: MISC-SERVER

PreviouslyPublishedAs
211405


Applies to:

Sun Netra X4250 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire X4450 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Ultra 27 Workstation - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire X4140 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Java Workstation W2100z - Version Not Applicable to Not Applicable [Release N/A]
All Platforms

Purpose

Description

This document provides a high-level guide to troubleshooting documents for Oracle's Sun x64/x86 product line.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join, or start a discussion in the My Oracle Support Community - Sun x86 Systems.



Contents
ILOM password reset
Sun System Handbook, Documentation, Downloads
Kernel Analysis
Fatal Reset
OS Panic
Hangs
OS Troubleshooting
Disk and Redundant Array of Independent/Inexpensive Disks (RAID) Troubleshooting
General Troubleshooting
IPMItool

Troubleshooting Steps

NOTE: If your product doesn't show up on the download link below, go to My Oracle Support and follow the procedure outlined here

 

Sun System Handbook       Docs   Downloads   Service Processor   Oracle Page
------------------------  -----  ----------  ------------------  -------------------------------

Workstations:
Sun Ultra 20              Docs   Download    None
Sun Ultra 20 M2           Docs   Download    None
Sun Ultra 24              Docs   Download    None
Sun Ultra 27              Docs   Downloads   None
Sun Ultra 40              Docs   Download    None
Sun Ultra 40 M2           Docs   Download    None
Sun Java W1100z           Docs   Download    None
Sun Java W2100z           Docs   Download    None

Servers:
Sun Fire X2100            Docs   Download    SMDC (option)       Sun x86 Systems
Sun Fire X2100 M2         Docs   Download    ELOM
Sun Fire X2200 M2         Docs   Download    ELOM
Sun Fire X2250            Docs   Download    ILOM
Sun Fire X4100            Docs   Download    ILOM
Sun Fire X4100 M2         Docs   Download    ILOM
Sun Fire X4140            Docs   Download    ILOM
Sun Fire X4150            Docs   Download    ILOM
Sun Fire X4200            Docs   Download    ILOM
Sun Fire X4200 M2         Docs   Download    ILOM
Sun Fire X4240            Docs   Download    ILOM
Sun Fire X4250            Docs   Download    ILOM
Sun Fire X4440            Docs   Download    ILOM
Sun Fire X4450            Docs   Download    ILOM
Sun Fire X4500            Docs   Download    ILOM
Sun Fire X4540            Docs   Download    ILOM
Sun Fire X4600            Docs   Download    ILOM
Sun Fire X4600 M2         Docs   Download    ILOM
Sun Fire V20z             Docs   Download    SP
Sun Fire V40z             Docs   Download    SP

Blade Servers:
Sun Blade 1600            Docs   Download    Switch SC (SSC)     Sun Blade Servers
Sun Blade 6000            Docs   Download    ILOM
Sun Blade 8000            Docs   Download    ILOM

Netra Blades And Servers:
Netra X4200 M2            Docs   Download    ILOM                Sun Netra Carrier-Grade Servers
Netra X4250               Docs   Download    ILOM
Netra X4450               Docs   Download    ILOM
Netra CT900               Docs   Download    ShMM

The product links above contain general information about each product. The Sun System Handbook links contain system specifications, parts lists, documentation, and the list of minimum supported operating systems. System firmware, drivers, and BIOS can be downloaded via the Download link.

ILOM password reset

The following document shows how to reset the ILOM root password back to its default:
DocID: 1328316.1 How to reset the ILOM root password back to the default 'changeme' using ipmitool


Kernel Analysis

A system becomes unresponsive for one of three reasons:

  • Fatal reset (hardware detected)
  • Operating system panic (software detected)
  • Operating system/application hang (not detected)


The following document provides some information about the necessary data to gather:
DocID: 1010911.1 What to send to Sun[TM] after a system panic and/or unexpected reboot

Fatal Reset

Fatal resets are hardware-detected problems. They occur when the central processing unit (CPU) takes a trap that immediately drops to the BIOS.
One cause is a watchdog reset, which occurs when the operating system fails to access the watchdog circuitry within its timeout period.
A watchdog reset is really an operating system hang detected by the watchdog timer, so see the Hangs section below for diagnostic techniques.

Other fatal resets are caused by hardware failures such as loss of input voltage or other major hardware problems. No core file is saved, and the messages file shows normal operation followed by an abrupt system restart (no shutdown messages).

The most important diagnostic information to retrieve is the following, most of which is obtained through the service processor (SP):

  • Console output. This typically contains a reason for the reset, for example "sync flood" (or nothing for total power loss).
  • SP events. These could include sensor-related events such as under-voltage conditions on one rail, or OEM-specific events like 0x12's.
  • SP sensor data. This shows whether a sensor has a consistent problem, such as a voltage regulator or fan failure.
  • SP field replaceable unit (FRU) data. This describes the hardware inventory configuration to assist with hardware replacement. Collect this to determine whether the system has the proper configuration (e.g. a partially installed memory bank). A good item to check is the system board page in the Sun System Handbook.
  • Explorer or other operating system data collector output that contains the messages files and other data.


If the cause of the reset cannot be quickly determined, it is important to perform hardware diagnostics such as a full power-on self test (POST), the bundled PCcheck, or SunVTS to determine whether the hardware is stable.

PCcheck and other diagnostic tools can typically be downloaded via the Download link above, or may already be available as part of the BIOS boot menu.

OS Panic

OS panics are software-detected problems. They occur when the operating system detects that the integrity of data is suspect or in danger of being corrupted. If properly configured, the panic routine creates a core dump and places panic strings into the messages file to assist in fault isolation.

Panics can be caused either by operating system coding errors, which are typically fixed by patches, or by hardware-related problems such as memory Uncorrectable Errors (UEs).

If the problem is software related, collect the core dump and pass it to Sun's kernel group for analysis.
If the problem is hardware related, collect the following data so the problem can be isolated:

  • Explorer or other operating system data collector output that contains the messages files and other data. This typically contains panic messages and a stack trace related to the panic.
  • SP events. These could include sensor-related events such as under-voltage conditions on one rail, or OEM-specific events like 0x12's.
  • SP sensor data. This shows whether a sensor has a consistent problem, such as a voltage regulator or fan failure.
  • SP FRU data. This describes the hardware inventory configuration to assist with hardware replacement. Collect this to determine whether the system has the proper configuration (e.g. a partially installed memory bank). A good item to check is the system board page in the Sun System Handbook.


If the cause of the panic cannot be quickly determined, it is important to perform hardware diagnostics such as a full power-on self test (POST), the bundled PCcheck, or SunVTS to determine whether the hardware is stable. PCcheck and other diagnostic tools can typically be downloaded via the Download link above, or may already be available as part of the BIOS boot menu.

Hangs

A hang occurs when some applications may operate properly while others appear dead, yet neither the hardware nor the operating system detects a problem. Hangs are caused by resource deadlocks due to operating system race conditions, or by resource deprivation due to one or more applications consuming too many resources. Console messages sometimes indicate the source of the hang, but typically a core dump should be forced so that Sun's kernel group can analyze the data. There is a small possibility that a hang is caused by hardware, but please contact the kernel group first for isolation.

The following document can assist with possible hang situations:
DocID: 1012991.1 How to check if your x64 platform "system hang" actually is a system hang.

The OS Troubleshooting section below describes how to configure and force core dumps. Note that forcing a core dump from a hung system is not always possible.

OS Troubleshooting

Sun x86/x64 systems typically support the Solaris[TM], Red Hat Enterprise Linux, SuSE Linux Enterprise, and Windows operating systems.
Please check the Sun System Handbook to ensure that the operating system in question is supported on that platform.
A good overall operating system document to review is:

DocID: 1019144.1 Data Requirements reference: What data is needed in order to troubleshoot my software or hardware problem?

Solaris:
Six important Solaris documents that discuss procedures and configuration for Solaris panics and hangs are as follows:

DocID: 1012913.1 Troubleshooting Panics, dumps, hangs or crashes in the Solaris[TM] Operating System
DocID: 1001950.1 Troubleshooting Suspected Solaris Operating System Hangs
DocID: 1004506.1 How to force a crash when my machine is hung
DocID: 1001950.1 When to Force a Solaris[TM] System Core File
DocID: 1004530.1 KERNEL: How to enable deadman kernel code
DocID: 1003085.1 Solaris[TM] Operating System: Forcing a kernel core dump on an x86 or x64 system

Red Hat Linux:
Three important Red Hat documents that discuss procedures and configuration for Red Hat panics & hangs are as follows:

DocID: 1005528.1 How to configure Kdump on Red Hat Enterprise Linux 5 systems
DocID: 1006577.1 Red Hat Linux: Diskdump Pre-requisites, install and settings
DocID: 1007699.1 Crash Dump capturing for Red Hat Linux

SuSE Linux:
Two important SuSE documents that discuss procedures and configuration for SuSE panics & hangs are as follows:
DocID: 1108937.1 How to configure Kdump on SuSE Linux Enterprise System 10
DocID: 1010059.1 How to configure LKCD on SuSE Linux Enterprise Systems 8 and 9

Windows:
An important Windows document that discusses procedures and configuration for panics is:
DocID: 1007054.1 How to handle Microsoft Windows panics on x64 platforms

Additional documents that assist in Windows troubleshooting are:
DocID: 1011590.1 How to check for Windows platform disk errors and online/offline status
DocID: 1010936.1 Microsoft Windows and Linux operating systems: How to obtain troubleshooting information

Disk and Redundant Array of Independent/Inexpensive Disks (RAID) Troubleshooting

Disk and RAID problems are sometimes related to the disk/RAID controller firmware and boot configuration.

A good overall document to determine the firmware revision from systems with a supported operating system and how to search for known issues is:
DocID: 1008396.1 How to Identify Optical and Hard Disk Firmware Revisions for Checking of Known Issues

A good document on boot related issues is:
DocID: 1005506.1 How to verify your boot media exists and is bootable on a Sun Fire[TM] X4100/X4200/X4600 and M2 models Server

Once the version is known, the following document provides information on how to list, create, or delete RAID volumes:
DocID: 1005358.1 Hardware RAID usage on X64 based systems with the LSI SAS1064

The LSI RAID controller firmware requires 64MB of unpartitioned disk space at the end of the disk for volume management, so back up data before creating any RAID volume.
LSI RAID status can be obtained via the BIOS, as shown in the following:

DocID: 1013107.1 How to Identify BIOS and Solaris[TM] Hardware RAID Status

Disks placed into a RAID volume should be of identical size to avoid problems.

RAID levels are:
RAID-0: Stripe of 2 or more disks that forms a larger virtual disk. There is no redundancy, so data is lost on disk failure, but performance is higher because a file is accessed across multiple disks.
RAID-1: Mirror of 2 or more disks that provides redundant data copies to prevent data loss on disk failure. Write performance decreases because each file update requires 2 or more writes, but read performance increases because a file can be read from multiple disks.
RAID-01: Mirror of striped disks. A disk failure takes its associated stripe offline.
RAID-10: Stripe of mirrored disks, which can tolerate the loss of two disks depending on configuration.
RAID-5: Stripe of 3 or more disks with distributed parity, so data loss is prevented if a single disk fails. Performance is medium because each file update requires two writes, but access is striped across multiple disks.
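
As a rough illustration of the trade-offs above, the following sketch estimates usable capacity for each RAID level. The function name raid_usable_mb is hypothetical, and the calculation assumes identical disks and ignores any space the controller reserves for its own metadata, so treat the result as an approximation only.

```shell
#!/bin/sh
# Sketch: usable capacity in MB for the RAID levels described above.
# raid_usable_mb is a hypothetical helper, not a Sun or LSI tool.
# Assumptions: all disks are the same size; controller metadata is ignored;
# RAID-01 and RAID-10 are given an even number of disks.

raid_usable_mb() {
    if [ $# -ne 3 ]; then
        echo "usage: raid_usable_mb <level> <disk-count> <disk-size-mb>" >&2
        return 2
    fi
    level=$1
    disks=$2
    size_mb=$3

    case $level in
        0)     min=2; data_disks=$disks ;;           # striping only, no redundancy
        1)     min=2; data_disks=1 ;;                # every disk holds a full copy
        01|10) min=4; data_disks=$((disks / 2)) ;;   # half the disks hold mirror copies
        5)     min=3; data_disks=$((disks - 1)) ;;   # one disk's worth of parity
        *)     echo "unknown RAID level: $level" >&2; return 2 ;;
    esac

    if [ "$disks" -lt "$min" ]; then
        echo "RAID-$level needs at least $min disks" >&2
        return 1
    fi
    echo $((data_disks * size_mb))
}
```

For example, `raid_usable_mb 5 3 500` prints 1000, because one of the three 500MB disks is consumed by parity.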

The Solaris raidctl command reports RAID status and supports RAID volume creation and deletion, as described in the following:

DocID: 1013107.1 How to Identify BIOS and Solaris[TM] Hardware RAID Status

Solaris commands that are helpful in disk troubleshooting are as follows:

 

# /usr/sbin/mount | grep "/ on"
/ on /dev/dsk/c1t0d0s0 read/write/setuid/devices/logging/xattr/onerror=panic/dev=f40040 on Thu Dec 6 11:49:54 2007

# iostat -E
sd0 Soft Errors: 1 Hard Errors: 2 Transport Errors: 0
Vendor: AMI Product: Virtual CDROM Revision: 1.00 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 2 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
sd1 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: AMI Product: Virtual Floppy Revision: 1.00 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0

# iostat -xe
                 extended device statistics         ---- errors ---
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot
sd0       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   1   2   0   3
sd1       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   2   0   0   2
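
On a system with many devices, the error counters in the iostat -E output can be scanned automatically. The following is a minimal sketch, assuming the per-device summary line has the form "sdN Soft Errors: n Hard Errors: n Transport Errors: n" as in the sample above. The function name report_disk_errors is hypothetical.

```shell
#!/bin/sh
# Sketch: list devices whose iostat -E summary line shows hard or
# transport errors. report_disk_errors is a hypothetical helper.
# It reads iostat -E output on standard input.

report_disk_errors() {
    awk '
        # Summary line: <device> Soft Errors: n Hard Errors: n Transport Errors: n
        $2 == "Soft" && $5 == "Hard" && $8 == "Transport" {
            if ($7 + 0 > 0 || $10 + 0 > 0)
                printf "%s hard=%d transport=%d\n", $1, $7, $10
        }
    '
}

# Typical use on a Solaris host:
#   iostat -E | report_disk_errors
```

A non-zero count is a starting point for investigation, not proof of a failing disk. In the sample above, the hard errors on sd0 belong to a virtual CDROM device and coincide with its "No Device" count.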
 


Linux disk issues can be isolated using the following document, which describes how to determine whether a Linux disk is under hardware RAID control:
DocID: 1013003.1 How to Identify if a Linux Operating Environment is Installed on a Hardware RAID Controller

Software RAID is configured using mdadm, as discussed in:
DocID: 1011427.1 How to setup software RAID in Linux

Linux commands that are helpful in disk troubleshooting are as follows:

 

# /bin/mount | grep "on / "     (Display root mount point)
/dev/sda2 on / type ext3 (rw)
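
Where Linux software RAID (mdadm) is in use, the kernel reports array state in /proc/mdstat. The following is a minimal sketch that lists arrays with a missing member, assuming the usual status string such as [UU] (all members up) or [U_] (one member missing). The function name degraded_md_arrays is hypothetical.

```shell
#!/bin/sh
# Sketch: print the names of md arrays whose member status string
# contains "_" (a missing or failed member). degraded_md_arrays is a
# hypothetical helper. It reads /proc/mdstat content on standard input.

degraded_md_arrays() {
    awk '
        # Array header line, for example: "md0 : active raid1 sdb1[1] sda1[0]"
        /^md[^ ]* :/ { array = $1 }

        # Status line, for example: "104320 blocks [2/2] [UU]"
        array != "" && /\[[U_]+\]/ {
            if ($0 ~ /\[[U_]*_[U_]*\]/)
                print array
            array = ""
        }
    '
}

# Typical use on a Linux host:
#   degraded_md_arrays < /proc/mdstat
```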

 

Windows disk status can be checked using information from the following:

DocID: 1011590.1 How to check for Windows platform disk errors and online/offline status

An example of a Windows RAID installation is obtained from:
DocID: 1009559.1 Installing Windows 2003 Server with RAID enabled on Sun Fire[TM] x2100

General Troubleshooting

For problems not covered by the prior two sections, collect the following information:

  • Obtain SP related data in all cases. This can be done via ipmitool (see below), or via the SP's GUI or command line interfaces (if functionality exists; see SP link above).
  • Ensure that the installed operating system is supported per the Sun System Handbook link above.
  • When possible, obtain operating system data collectors such as explorer or other output that records the state of the operating system and file system (including messages files).
  • PCcheck and other diagnostic tools can typically be downloaded via the Download link above, or may already be available as part of the BIOS boot menu.

IPMItool

IPMItool is a very useful tool that can gather information from the ILOM and other service processors (SPs).

Example data collection commands are as follows. Replace "ipaddress" with the address of the service processor, not the main platform:

 

ipmitool -H "ipaddress" -U root fru

ipmitool -H "ipaddress" -U root sel elist

ipmitool -H "ipaddress" -U root -v sdr

ipmitool -H "ipaddress" -U root sdr elist

ipmitool -H "ipaddress" -U root sdr list

ipmitool -H "ipaddress" -U root chassis status

ipmitool -H "ipaddress" -U root sunoem led get

ipmitool -H "ipaddress" -U root sensor
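
The commands above can be run in one pass and saved to files for a service request. The following is a minimal sketch. The function name collect_sp_data, the default output directory, and the file-naming scheme are assumptions made for illustration, not part of any Sun or Oracle tool.

```shell
#!/bin/sh
# Sketch: run each ipmitool query listed above against one service
# processor and save each result to its own file.
# collect_sp_data and the ./sp-data default are hypothetical.

collect_sp_data() {
    sp_addr=$1
    out_dir=${2:-./sp-data}
    if [ -z "$sp_addr" ]; then
        echo "usage: collect_sp_data <sp-address> [output-dir]" >&2
        return 2
    fi
    mkdir -p "$out_dir" || return 1

    failed=0
    for subcmd in "fru" "sel elist" "-v sdr" "sdr elist" "sdr list" \
                  "chassis status" "sunoem led get" "sensor"; do
        # Derive a file name such as "sel_elist" from the subcommand.
        name=$(printf '%s\n' "$subcmd" | sed 's/^-*//; s/[^A-Za-z0-9]/_/g')
        # $subcmd is left unquoted on purpose so that it splits into words.
        if ! ipmitool -H "$sp_addr" -U root $subcmd > "$out_dir/$name.txt" 2>&1; then
            echo "warning: 'ipmitool $subcmd' failed; see $out_dir/$name.txt" >&2
            failed=$((failed + 1))
        fi
    done
    echo "collected SP data in $out_dir ($failed command(s) failed)"
}

# Typical use, replacing the address with that of the service processor:
#   collect_sp_data 192.0.2.10 ./sp-data
```

As written, ipmitool asks for the root password once per command. If that is inconvenient, ipmitool's -E option reads the password from the IPMI_PASSWORD environment variable instead.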

  

Previously Published As 88276

References

<NOTE:1011427.1> - How to setup software RAID in Linux
<NOTE:1011590.1> - How to check for Windows platform disk errors and online/offline status
<NOTE:1012913.1> - Troubleshooting Panics, Hangs, Reboots and System Performance Issues in the Solaris Operating System
<NOTE:1019144.1> - Data Requirements reference: What data is needed in order to troubleshoot my software or hardware problem?
<NOTE:1108937.1> - How to configure Kdump on SuSE Linux Enterprise System 10 and 11
<NOTE:1010059.1> - How to configure LKCD on SuSE Linux Enterprise Systems 8 and 9
<NOTE:1013107.1> - How to Identify BIOS and Solaris[TM] Hardware RAID Status
<NOTE:1005506.1> - How to verify your boot media exists and is bootable, on a Sun X64 server.
<NOTE:1005528.1> - How to configure Kdump on Red Hat Enterprise Linux 5 systems
<NOTE:1003085.1> - Solaris[TM] Operating System: How to force a kernel core dump on an x86 or x64 system
<NOTE:1004506.1> - How to Force a Crash Dump When the Solaris Operating System is Hung
<NOTE:1004530.1> - How to Enable Deadman Kernel Code in Solaris 8 and Newer to Force a Kernel Panic During a Hang
<NOTE:1001950.1> - Troubleshooting Suspected Solaris Operating System Hangs
<NOTE:1005358.1> - Hardware RAID usage on X64 based systems with the LSI SAS1064
<NOTE:1006577.1> - Red Hat Linux: Diskdump Pre-requisites, install and settings.
<NOTE:1007054.1> - How to handle Microsoft Windows panics on x64 platforms
<NOTE:1007699.1> - Crash Dump capturing for Red Hat Linux
<NOTE:1012991.1> - How to check if your x64 platform "system hang" actually is a system hang
<NOTE:1013003.1> - How to Identify if a Linux Operating Environment is Installed on a Hardware RAID Controller
<NOTE:1010911.1> - What Should I Send to Oracle After a Solaris Panic or Unexpected Reboot?
<NOTE:1010936.1> - Microsoft Windows and Linux operating systems: How to obtain troubleshooting information
<NOTE:1008396.1> - How to Identify Optical and Hard Disk Firmware Revisions for Checking of Known Issues

  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.