Asset ID: 1-75-1004271.1
Update Date: 2012-07-18
Solution Type
Troubleshooting Sure
Solution 1004271.1: Sun Fire [TM] SF3800/SF4800/SF4810/SF6800 - E4900/E6900 - V1280/E2900 - Netra 1280/1290 : Troubleshooting errors on I/O devices (Disk drives, DVD, Tapes)
Related Items
- Sun Fire E6900 Server
- Sun Fire 3800 Server
- Sun Fire 6800 Server
- Sun Netra 1280 Server
- Sun Fire E4900 Server
- Sun Fire 4800 Server
- Sun Fire E2900 Server
- Sun Fire V1280 Server
- Sun Netra 1290 Server
- Sun Fire 4810 Server
Related Categories
- PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Exx00
- .Old GCS Categories>Sun Microsystems>Servers>Midrange Servers
- .Old GCS Categories>Sun Microsystems>Servers>Midrange V and Netra Servers
Previously Published As 205899
Applies to:
Sun Netra 1280 Server
Sun Fire E4900 Server
Sun Fire 3800 Server
Sun Fire 4800 Server
Sun Fire 4810 Server
All Platforms
Purpose
Description
This document covers situations where certain I/O devices might be suspected to be defective.
Specifically, this document addresses how to troubleshoot device errors affecting Hard Disk Drives (HDDs), DVD drives, or tape drives on Sun Fire [TM] 3800, 4800, 4810, E4900, 6800, and E6900 systems, Sun Fire [TM] V1280 and E2900 systems, and Netra [TM] 1280 and 1290 systems. This document does not address a situation where a device is considered to be "missing" or has "disappeared".
- To troubleshoot a "missing" device, see <Document:1005522.1> Troubleshooting a "missing" Hard Disk Drive (HDD) on Sun Fire [TM] Serengeti or LightWeight8 systems
Symptoms:
- One might describe the situation by saying "I have a bad I/O device or devices" or "I'm getting I/O device errors" or "I'm getting I/O errors".
- iostat may report excessive hard or soft errors on disks, DVD drives, or tape drives.
- There may be numerous messages in /var/adm/messages in the domain reflecting read or write errors, SCSI transport errors, or similar.
- In some cases, the errors or problems could prevent a domain from booting.
- The problems may affect an entire controller or device path, or multiple controllers or device paths.
Troubleshooting Steps
Steps to Follow
Please validate that each troubleshooting step below is true for your environment. Each step provides instructions, or a link to a document, for validating the step and taking corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Please do not skip a step.
1. Verify the error(s) affect an I/O device or devices (Hard Disk Drive, DVD ROM, Tape Drive).
- Use iostat -En output, or error messages logged to /var/adm/messages on the domain, to identify the device(s) in error.
- The example iostat -En output below shows a DVD-ROM and a Hard Disk Drive (HDD):
$ iostat -En
c0t0d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: TOSHIBA Product: DVD-ROM SD-C2612 Revision: 1011 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c1t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU Product: MAP3735N SUN72G Revision: 0401 Serial No: 0435Q0E3UJ
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
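When many devices are attached, scanning this output by eye can be error-prone. The following is a minimal Python sketch, not part of the original procedure, that pulls the Soft, Hard, and Transport error counters out of saved iostat -En text and lists the devices with any non-zero counter. The function names (parse_error_counts, devices_with_errors) are hypothetical, and the sample text is copied from the example above.

```python
import re

# Sample text copied from the iostat -En example in Step 1.
SAMPLE = """\
c0t0d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: TOSHIBA Product: DVD-ROM SD-C2612 Revision: 1011 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c1t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU Product: MAP3735N SUN72G Revision: 0401 Serial No: 0435Q0E3UJ
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
"""

# Matches the first line of each device record.
HEADER = re.compile(
    r"^(?P<dev>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+Transport Errors:\s*(?P<tran>\d+)"
)


def parse_error_counts(text):
    """Return {device: {"soft": n, "hard": n, "transport": n}}."""
    counts = {}
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            counts[match.group("dev")] = {
                "soft": int(match.group("soft")),
                "hard": int(match.group("hard")),
                "transport": int(match.group("tran")),
            }
    return counts


def devices_with_errors(counts):
    """Return the sorted names of devices with any non-zero counter."""
    return sorted(dev for dev, c in counts.items() if any(c.values()))


if __name__ == "__main__":
    parsed = parse_error_counts(SAMPLE)
    for dev in devices_with_errors(parsed):
        print(dev, parsed[dev])
```

A non-zero counter only flags a device for closer inspection. Whether replacement is warranted depends on the error type and count, as covered in the later steps.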
2. Verify that the devices in error are not newly installed or replaced units.
- If they are newly installed or repaired, make sure they have current firmware and that the units have been re-seated.
- For most I/O devices, such as a Hard Disk Drive or DVD-ROM, the firmware revision is displayed in iostat -En output (see the example in Step 1) in the "Revision:" field.
- After completing these tasks, verify that the errors persist.
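To collect the firmware revision of every device in one pass, the "Revision:" field can be extracted from saved iostat -En text. The sketch below is illustrative only: the function name firmware_revisions is hypothetical, the sample is abridged from the Step 1 example, and it assumes the Vendor/Product/Revision/Serial No fields appear on one line in that order.

```python
import re

# Abridged from the iostat -En example in Step 1.
SAMPLE = """\
c0t0d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: TOSHIBA Product: DVD-ROM SD-C2612 Revision: 1011 Serial No:
Size: 0.00GB <0 bytes>
c1t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU Product: MAP3735N SUN72G Revision: 0401 Serial No: 0435Q0E3UJ
Size: 73.40GB <73400057856 bytes>
"""

# First line of a device record: device name followed by the counters.
DEVICE = re.compile(r"^(\S+)\s+Soft Errors:")

# Identity line; Product may contain spaces, Serial No may be empty.
IDENT = re.compile(
    r"^Vendor:\s*(?P<vendor>.*?)\s+Product:\s*(?P<product>.*?)\s+"
    r"Revision:\s*(?P<revision>\S*)\s+Serial No:\s*(?P<serial>.*)$"
)


def firmware_revisions(text):
    """Return {device: {"vendor", "product", "revision", "serial"}}."""
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        dev = DEVICE.match(line)
        if dev:
            current = dev.group(1)
            continue
        ident = IDENT.match(line)
        if ident and current is not None:
            result[current] = ident.groupdict()
    return result


if __name__ == "__main__":
    for name, info in firmware_revisions(SAMPLE).items():
        print(name, info["product"], "firmware", info["revision"])
```

The sketch only reports the revisions. Which revision counts as current for a given drive model still has to be checked against the vendor's firmware documentation.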
3. Verify the type of error and the error count in iostat -En output to determine whether device replacement might be necessary.
- <Document:1007250.1> iostat -E: Explanation of error counters offers advice.
4. Verify that any device with a high Hard Error count in iostat -En is actually NOT healthy.
- <Document:1017741.1> Solaris[TM] Operating System: High Hard Error value in iostat -E output provides an explanation of situations which may cause hard errors on sane hardware.
5. Confirm that the iostat -En output is reporting errors that correlate to the present period of time.
- In other words, make sure that the iostat data is not "stale" (old errors that originated sometime in the past).
- If errors from the present period of time are logged both in /var/adm/messages and in iostat, the iostat data can be trusted, and the device is implicated depending on the type and count of errors.
- If the errors in iostat are "stale" and do not correlate to present errors in /var/adm/messages, do not replace the implicated device, but instead monitor it for any repeat errors.
- The error counters will be reset following the next system reboot.
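One way to judge whether iostat counters are "stale" is to check whether the messages file contains recent entries for the same device. The Python sketch below illustrates the idea under several assumptions: log lines begin with a syslog-style "Mon DD HH:MM:SS" timestamp, the year is supplied by the caller (that timestamp format does not carry one), and the sample lines are hypothetical placeholders rather than real driver messages. The function name recent_matches is also hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical placeholder lines; real driver messages will differ.
SAMPLE_MESSAGES = [
    "Jul  2 03:10:44 domain-a scsi: WARNING: (sd0): placeholder error text",
    "Jul 18 09:30:12 domain-a scsi: WARNING: (sd0): placeholder error text",
    "Jul 18 09:31:00 domain-a scsi: WARNING: (sd5): placeholder error text",
]


def recent_matches(lines, token, now, window, year):
    """Return lines containing token whose timestamp is within window of now.

    The year argument is an assumption supplied by the caller, because the
    leading timestamp has no year field.
    """
    cutoff = now - window
    hits = []
    for line in lines:
        if token not in line:
            continue
        try:
            # The first 15 characters hold "Mon DD HH:MM:SS".
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # Skip continuation lines or unexpected formats.
        if cutoff <= stamp <= now:
            hits.append(line)
    return hits


if __name__ == "__main__":
    now = datetime(2012, 7, 18, 12, 0, 0)
    hits = recent_matches(SAMPLE_MESSAGES, "(sd0)", now, timedelta(hours=24), 2012)
    if hits:
        print("Recent entries found; iostat errors likely current:")
        for entry in hits:
            print(" ", entry)
    else:
        print("No recent entries; iostat errors may be stale.")
```

If the function returns no lines for a device that shows errors in iostat, the counters probably reflect past events, which matches the "monitor, do not replace" advice above.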
6. Verify the physical location of the devices in error.
- Decode the device paths in error using <Document:1005907.1> Solaris[TM] Operating System: Matrix of Recognized Device Paths.
7. Verify that no system configuration changes (software changes) recently affected the implicated devices.
- Examples of such changes include recently installed PCI driver patches (or reboots that allowed the changes to take effect), recent /etc/system file changes, and disk firmware upgrades.
- This confirmation is most important when the devices in error occupy different device paths, but the actual device type is similar. For example, if there are three disks in error and all are the same model, the likely cause might be a disk firmware issue. But, if there are two different device types in error (like a DVD ROM and disk drive) attached to the same type of HBA in error, the likely cause might be the driver that operates that particular HBA.
- So, essentially, confirm whether there is a commonality between the devices in error and rule out recent changes to those similarities before proceeding to the next step.
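The commonality check can be sketched as a simple grouping exercise. The Python example below groups devices in error by product model and firmware revision, so that several failing devices sharing one model stand out. The device list is hypothetical (it reuses the disk model from the Step 1 example for illustration), and the function names group_by_model and shared_models are not from any Sun tool.

```python
from collections import defaultdict

# Hypothetical devices in error; the model strings reuse the Step 1 example.
DEVICES_IN_ERROR = [
    {"name": "c1t0d0", "product": "MAP3735N SUN72G", "revision": "0401"},
    {"name": "c1t1d0", "product": "MAP3735N SUN72G", "revision": "0401"},
    {"name": "c2t0d0", "product": "MAP3735N SUN72G", "revision": "0401"},
    {"name": "c0t0d0", "product": "DVD-ROM SD-C2612", "revision": "1011"},
]


def group_by_model(devices):
    """Group device names by (product, revision)."""
    groups = defaultdict(list)
    for dev in devices:
        groups[(dev["product"], dev["revision"])].append(dev["name"])
    return dict(groups)


def shared_models(devices):
    """Return only the groups that contain more than one device in error."""
    return {
        key: sorted(names)
        for key, names in group_by_model(devices).items()
        if len(names) > 1
    }


if __name__ == "__main__":
    for (product, revision), names in shared_models(DEVICES_IN_ERROR).items():
        print(f"{len(names)} devices share {product} rev {revision}: {names}")
```

A group with several members points toward a model or firmware issue. Devices of different types that fail behind the same kind of HBA would not show up here and suggest looking at the HBA driver instead.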
8. Investigate the primary hardware cause of the errors depending on which devices are in error using the table below.
The advice in this table is intended to be used as a guide to troubleshooting this issue and not necessarily the exact resolution to every multiple device error situation. When investigating the cause of the errors you might choose to replace or relocate the suspect components to another location in the configuration to confirm the cause or determine the resolution to the errors.
- If replacing a suspect device, use the Sun System Handbook to identify the part number of the device you need to replace.
- For Hard Disk Drive (HDD) replacements, see the reference <Document:1004390.1> Hard Disk Drive (HDD) Part Number Identification.
9. If errors persist, investigate the secondary hardware cause of the errors depending on which devices are in error using the information and table from Step 8 above.
10. Collect the following data and collaborate with the next level of support.
- Preferably, collect Explorer data with the appropriate scextended or 1280extended option, as detailed in <Document:1019066.1> How to collect scextended or 1280extended Explorer.
- If Explorer data cannot be collected for whatever reason, see <Document:1003529.1> Procedure to manually collect Sun Fire[TM] Midrange System Controller level failure data.
Stale iostat Data
If the errors in iostat are "stale" and do not correlate to present errors in /var/adm/messages,
reset the error counts to prevent further confusion about the device's health by following
<Document:1012731.1> If you want to reset the iostat -E hard/soft/tran errors counters without rebooting.
At this point, if the customer has validated that each troubleshooting step above is true for their
environment, and the issue still exists, collaborate with the next level of technical expertise.
Previously Published As 91433
References
<NOTE:1004390.1> - Hard Disk Drive (HDD) Part Number Identification.
<NOTE:1005522.1> - Sun Fire [TM] SF3800/SF4800/SF4810/SF6800 - E4900/E6900 - V1280/E2900 - Netra 1280/1290 : Troubleshooting a "missing" Hard Disk Drive (HDD)
<NOTE:1003529.1> - Procedure to manually collect System Controller (SC) level failure data on Sun Fire[TM] v1280, E2900, 3800, 4800, E4900, 6800, E6900, and Netra 1280, 1290 servers.
<NOTE:1005907.1> - Solaris[TM] Operating System: Matrix of Recognized Device Paths for SPARC systems
<NOTE:1007250.1> - How to Interpret Error Counters for the Solaris iostat -E output
<NOTE:1012731.1> - How to Reset the iostat -E hard/soft/tran Error Counters Without Rebooting
<NOTE:1017741.1> - Solaris Operating System High Hard Error value in iostat -E output
<NOTE:1019066.1> - Sun Fire[TM] v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900 and Netra[TM] 1280, 2900 servers: How to collect scextended or 1280extended Explorer