Asset ID: 1-75-1004271.1
Update Date: 2011-03-24
Solution Type: Troubleshooting Sure
Solution 1004271.1: Troubleshooting errors on I/O devices (Disk drives, DVD, Tapes) in Sun Fire [TM] Serengeti or LightWeight8 systems
Related Items
- Sun Fire E6900 Server
- Sun Fire 3800 Server
- Sun Fire 6800 Server
- Sun Netra 1280 Server
- Sun Fire E4900 Server
- Sun Fire 4800 Server
- Sun Fire V1280 Server
- Sun Fire E2900 Server
- Sun Netra 1290 Server
- Sun Fire 4810 Server
Related Categories
- GCS>Sun Microsystems>Servers>Midrange V and Netra Servers
- GCS>Sun Microsystems>Servers>Midrange Servers
Previously Published As 205899
Applies to:
- Sun Netra 1280 Server
- Sun Netra 1290 Server
- Sun Fire V1280 Server
- Sun Fire 3800 Server
- Sun Fire 4800 Server
All Platforms
Purpose
Description
This document covers situations where certain I/O devices might be suspected to be defective. Specifically, it addresses how to troubleshoot device errors affecting Hard Disk Drives (HDDs), DVDs, or Tape Drives on Sun Fire [TM] 3800, 4800, 4810, E4900, 6800, E6900 and Sun Fire [TM] V1280, E2900, and Netra [TM] 1280, 1290 systems.
This document does not address a situation where a device is considered to be "missing" or has "disappeared".
- To troubleshoot a "missing" device, see <Document:1005522.1> Troubleshooting a "missing" Hard Disk Drive (HDD) on Sun Fire [TM] Serengeti or LightWeight8 systems
Symptoms:
- One might describe the situation by saying "I have a bad I/O device or devices", "I'm getting I/O device errors", or "I'm getting I/O errors".
- iostat may report excessive hard or soft errors on disks, DVDs, or tape drives.
- There may be numerous messages in /var/adm/messages in the domain reflecting read or write errors, SCSI transport errors, or similar.
- In some cases, the errors or problems could prevent a domain from booting.
- The problems could affect a whole controller or device path, or multiple controllers or device paths.
Last Review Date
March 24, 2011
Instructions for the Reader
A Troubleshooting Guide is provided to assist
in debugging a specific issue. When possible, diagnostic tools are included in the document
to assist in troubleshooting.
Troubleshooting Details
Steps to Follow
Please validate that each troubleshooting step below is true for your environment. Each step provides instructions, or a link to a document, for validating the step and taking corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Please do not skip a step.
1. Verify the error(s) affect an I/O device or devices (Hard Disk Drive, DVD-ROM, Tape Drive).
- Use iostat -En output, or error messages logged to /var/adm/messages on the domain, to identify the device(s) in error.
- The example iostat -En output below shows a DVD-ROM and a Hard Disk Drive (HDD):
$ iostat -En
c0t0d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: TOSHIBA Product: DVD-ROM SD-C2612 Revision: 1011 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c1t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU Product: MAP3735N SUN72G Revision: 0401 Serial No: 0435Q0E3UJ
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
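When many devices are present, it can help to reduce the iostat -En output to just the devices with non-zero counters. The following is a minimal Python sketch, not a supported tool. It assumes the output is laid out one field group per line, as iostat normally prints it, and that the counter names match those in the example above. The function names parse_iostat_en and devices_with_errors are hypothetical.

```python
import re

# Counter names exactly as they appear in the iostat -En example in Step 1.
COUNTERS = (
    "Soft Errors", "Hard Errors", "Transport Errors",
    "Media Error", "Device Not Ready", "No Device",
    "Recoverable", "Illegal Request", "Predictive Failure Analysis",
)

def parse_iostat_en(text):
    """Return {device: {counter: int, ..., "Revision": str}} from iostat -En text."""
    devices = {}
    current = None
    for line in text.splitlines():
        # A new device record starts with the device name followed by "Soft Errors:".
        header = re.match(r"^(\S+)\s+Soft Errors:", line)
        if header:
            current = devices.setdefault(header.group(1), {})
        if current is None:
            continue  # skip anything before the first device record
        for name in COUNTERS:
            m = re.search(re.escape(name) + r":\s*(\d+)", line)
            if m:
                current[name] = int(m.group(1))
        # The firmware revision sits between "Revision:" and "Serial No:".
        rev = re.search(r"Revision:\s*(.*?)\s*Serial No:", line)
        if rev:
            current["Revision"] = rev.group(1)
    return devices

def devices_with_errors(devices):
    """Keep only devices that have at least one non-zero counter."""
    flagged = {}
    for dev, fields in devices.items():
        nonzero = {k: v for k, v in fields.items() if isinstance(v, int) and v > 0}
        if nonzero:
            flagged[dev] = nonzero
    return flagged

# Sample text copied from the example output in Step 1.
SAMPLE = """\
c0t0d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: TOSHIBA Product: DVD-ROM SD-C2612 Revision: 1011 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c1t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU Product: MAP3735N SUN72G Revision: 0401 Serial No: 0435Q0E3UJ
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
"""

parsed = parse_iostat_en(SAMPLE)
print(devices_with_errors(parsed))
```

The parsed record also carries the "Revision:" field, so the same sketch can be used to list firmware revisions. Whether a given non-zero counter justifies replacement still depends on the error type and count.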
2. Verify that the devices in error are not newly installed or replaced units.
- If they are newly installed or replaced, make sure they have current firmware and that the units have been re-seated.
- For most I/O devices, such as a Hard Disk Drive or DVD-ROM, the firmware revision is displayed in iostat -En output (see the example in Step 1) as the field "Revision:".
- After completing these tasks, verify that the errors persist.
3. Check the type of error and the error count in iostat -En output to determine whether device replacement might be necessary.
- <Document:1007250.1> iostat -E: Explanation of error counters offers advice.
4. Verify that any device with a high Hard Error count in iostat -En output is actually unhealthy; a high count does not by itself prove a hardware fault.
- <Document:1017741.1> Solaris[TM] Operating System: High Hard Error value in iostat -E output explains situations that may cause hard errors on sane hardware.
5. Confirm that the iostat -En output is reporting errors that correlate to the present period of time.
- In other words, make sure that the iostat data is not "stale" (old errors that originated sometime in the past).
- If there are errors logged in /var/adm/messages and in iostat from the present period of time, the iostat data can be trusted and the device is implicated, depending on the type and count of errors.
- If the errors in iostat are "stale" and do not correlate to present errors in /var/adm/messages, do not replace the implicated device, but instead monitor it for any repeat errors.
- The error counters will be reset following the next system reboot.
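The correlation check in Step 5 can be sketched in Python as follows. This is only an illustration under several assumptions: that each log line begins with a syslog-style "Mon DD HH:MM:SS" stamp with no year, that month names are in English, and that a device can be recognized by a substring in the log line. The function names, the one-day window, and the sample lines are hypothetical; the sample lines are placeholders, not real Solaris messages.

```python
from datetime import datetime, timedelta

def parse_syslog_time(line, now):
    """Parse a leading 'Mon DD HH:MM:SS' stamp. The stamp has no year,
    so try the current year first, then the previous one."""
    stamp = " ".join(line.split()[:3])
    for year in (now.year, now.year - 1):
        try:
            ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamp, or an invalid date for that year
        if ts <= now:
            return ts
    return None

def recent_device_errors(lines, device, now, window=timedelta(days=1)):
    """Return log lines that mention `device` and fall within `window` of `now`."""
    hits = []
    for line in lines:
        if device not in line:
            continue
        ts = parse_syslog_time(line, now)
        if ts is not None and now - ts <= window:
            hits.append(line)
    return hits

# Placeholder log lines for illustration only (not real Solaris output).
LINES = [
    "Mar 24 09:58:01 host1 placeholder error text for sd15",
    "Jan  2 03:04:05 host1 placeholder error text for sd15",
    "Mar 24 10:00:00 host1 placeholder error text for sd3",
    "Dec 30 23:59:59 host1 placeholder error text for sd15",
]
NOW = datetime(2011, 3, 24, 12, 0, 0)

recent = recent_device_errors(LINES, "sd15", NOW)
# No recent log entries suggests the iostat counters may be stale.
print("possibly stale" if not recent else f"{len(recent)} recent error line(s)")
```

If no recent lines are found for a device whose iostat counters are non-zero, the counters may be left over from an earlier period, and the guidance above is to monitor the device rather than replace it.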
6. Verify the physical location of the devices in error.
- Decode the device paths in error using <Document:1005907.1> Solaris[TM] Operating System: Matrix of Recognized Device Paths.
7. Verify that no system configuration changes (software changes) recently affected the implicated devices.
- Examples of such changes would be PCI driver patches recently installed (or reboots to allow the changes to take effect), recent /etc/system file changes, disk firmware upgrades, etc.
- This confirmation is most important when the devices in error occupy different device paths, but the actual device type is similar. For example, if there are three disks in error and all are the same model, the likely cause might be a disk firmware issue. But if there are two different device types in error (like a DVD-ROM and a disk drive) attached to the same type of HBA, the likely cause might be the driver that operates that particular HBA.
- So, essentially, confirm whether there is a commonality between the devices in error and rule out recent changes to those similarities before proceeding to the next step.
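The commonality check described in Step 7 can be sketched as a simple grouping exercise. The Python below is illustrative only. It assumes devices follow the cXtYdZ naming shown in iostat -En output and uses the controller number as a rough stand-in for "same HBA", which is a simplification: identifying the actual HBA type requires decoding the device path as in Step 6. The function names and the device lists are hypothetical.

```python
import re
from collections import defaultdict

def controller_of(device):
    """'c1t0d0' -> 'c1'. Names that do not match cXtY... are returned unchanged."""
    m = re.match(r"(c\d+)t\d+", device)
    return m.group(1) if m else device

def common_factors(failing):
    """failing: list of (device, vendor, product) tuples for devices in error.
    Returns (models shared by 2+ devices, controllers shared by 2+ devices)."""
    by_model = defaultdict(list)
    by_controller = defaultdict(list)
    for device, vendor, product in failing:
        by_model[(vendor, product)].append(device)
        by_controller[controller_of(device)].append(device)
    shared_models = {k: v for k, v in by_model.items() if len(v) > 1}
    shared_controllers = {k: v for k, v in by_controller.items() if len(v) > 1}
    return shared_models, shared_controllers

# Hypothetical case 1: three disks of the same model on different controllers.
same_model = [
    ("c1t0d0", "FUJITSU", "MAP3735N SUN72G"),
    ("c2t1d0", "FUJITSU", "MAP3735N SUN72G"),
    ("c3t0d0", "FUJITSU", "MAP3735N SUN72G"),
]
# Hypothetical case 2: a DVD-ROM and a disk behind the same controller.
same_controller = [
    ("c0t0d0", "TOSHIBA", "DVD-ROM SD-C2612"),
    ("c0t1d0", "FUJITSU", "MAP3735N SUN72G"),
]

print(common_factors(same_model))
print(common_factors(same_controller))
```

A shared model with no shared controller points toward something device-specific such as disk firmware, while a shared controller with mixed device types points toward the HBA or its driver. Either result is only a hint for where to look for recent changes.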
8. Investigate the primary hardware cause of the errors depending on which devices are in error using the table below.
The advice in this table is intended to be used as a guide to troubleshooting this issue and not necessarily the exact resolution to every multiple device error situation. When investigating the cause of the errors you might choose to replace or relocate the suspect components to another location in the configuration to confirm the cause or determine the resolution to the errors.
- If replacing a suspect device, use the Sun System Handbook to identify the part number of the device you need to replace.
- For Hard Disk Drive (HDD) replacements, see the reference <Document:1004390.1> Hard Disk Drive (HDD) Part Number Identification.
9. If errors persist, investigate the secondary hardware cause of the errors depending on which devices are in error using the information and table from Step 8 above.
10. Collect the following data and collaborate with the next level of support.
- Explorer output is preferred, collected with the appropriate scextended or 1280extended option as detailed in <Document:1018748.1> How to Run Sun[TM] Explorer and Forward the Data to a Sun Engineer.
- If Explorer data cannot be collected for whatever reason, see <Document:1003529.1> Procedure to manually collect Sun Fire[TM] Midrange System Controller level failure data.
Internal Only Information
Stale iostat Data
If the errors in iostat are "stale" and do not correlate to present errors in /var/adm/messages, reset the error counts to prevent further confusion about the device's health by following <Document:1012731.1> If you want to reset the iostat -E hard/soft/tran errors counters without rebooting.
At this point, if the customer has validated that each troubleshooting step above is true for their environment, and the issue still exists, collaborate with the next level of technical expertise.
Previously Published As 91433
Attachments
This solution has no attachment