Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1366035.1
Update Date:2012-07-26
Keywords:

Solution Type: Troubleshooting Sure

Solution 1366035.1: Sun Storage 7000 Unified Storage System: Troubleshooting Disk Drive Failures


Related Items
  • Sun Storage 7410 Unified Storage System
  • Sun ZFS Storage 7320
  • Sun Storage 7310 Unified Storage System
  • Sun Storage 7210 Unified Storage System
  • Sun Storage 7110 Unified Storage System
  • Sun ZFS Storage 7420
  • Sun ZFS Storage 7120
Related Categories
  • PLA-Support>Sun Systems>DISK>NAS>SN-DK: 7xxx NAS
  • .Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage



Applies to:

Sun Storage 7310 Unified Storage System - Version Not Applicable and later
Sun Storage 7110 Unified Storage System - Version Not Applicable and later
Sun Storage 7410 Unified Storage System - Version Not Applicable and later
Sun ZFS Storage 7320 - Version Not Applicable and later
Sun ZFS Storage 7420 - Version Not Applicable and later
Information in this document applies to any platform.

Purpose

The purpose of this document is to help troubleshoot disk drive failures on a Sun Storage 7000 series ZFS appliance.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - 7000 Series ZFS Appliances


NAS head revision : not dependent
BIOS revision : not dependent
ILOM revision : not dependent
JBOD Models : J4400|J4410|J4500
CLUSTER related : not dependent

Troubleshooting Steps

1. Verify the Problems list from the Appliance

To aid serviceability, the appliance detects persistent hardware failures (faults) and software failures (defects, often included under faults) and reports them as active problems on the Maintenance > Problems screen. If the phone-home service is enabled, active problems are automatically reported to Oracle, where a support case may be opened depending on the service contract and the nature of the fault.

From support bundle

cat ./fm/fmadm.out

From CLI

maintenance problems show

From BUI

  1. Click Maintenance
  2. Click System
  3. Click Problems
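
For reference, a CLI session checking for active problems might look like the following sketch (hypothetical output; the exact columns and wording vary by appliance release):

hostname:> maintenance problems show
Problems:

COMPONENT     DIAGNOSED            TYPE          DESCRIPTION
problem-000   2012-7-20 11:29:59   Major Fault   The number of checksum errors associated with the device has exceeded acceptable levels.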

The table below lists each error code and its descriptive text as they may appear on the system or in the customer's report. Review the problems shown by the appliance against this table:

Error Code      Description
ZFS-8000-GH     The number of checksum errors associated with the device has exceeded acceptable levels.
ZFS-8000-FD     The number of I/O errors associated with the device has exceeded acceptable levels.
ZFS-8000-D3     The device has failed or could not be opened.
DISK-8000-0X    SMART health-monitoring firmware reported that a failure is imminent on the disk.
AK-8000-F0      The disk 'XXX' uses an interface (SAS) that is incompatible with the enclosure.
DISK-8000-4Q    SCSI fault for media.


If there are no problems in the list, the drives in the appliance are OK and no further work is required.

  • If more than one of the above problems appears in the list, go to Step 2.
  • If the problem is ZFS-8000-FD, DISK-8000-0X, or DISK-8000-4Q, the drive has failed or is about to fail and requires replacement. Please contact Oracle for further support.
  • If the problem is ZFS-8000-GH, go to Step 3.
  • If the problem is AK-8000-F0, go to Step 4.
  • If the problem is ZFS-8000-D3, go to Step 5.

 

2.  Are the problems all ZFS-8000-GH?

If so, go to Step 3.
If not, please contact Oracle for further support.

3.  Run a scrub to attempt to correct the errors. 

The ZFS subsystem can report spurious checksum errors as part of a disk replacement or normal day-to-day operation. This does not require a disk replacement unless the checksum errors become unrecoverable (checksum errors alone do not cause data loss). The scrub may flag additional drives as faulted due to checksum errors, but it will never fail enough drives to take the storage pool offline.

  1. Run the scrub to completion
    - BUI: Configuration -> Storage -> Click on the pool -> Click Scrub
    - CLI: configuration storage scrub start
  2. Mark the checksum fault as repaired in the problems log
    - BUI: Maintenance -> Problems -> Click on the problem -> Click Mark Repaired
    - CLI: maintenance problems select <problem-id> markrepaired
  3. Repeat this process until there are no checksum errors or the system logs an unrecoverable problem.

Contact Oracle if there are unrecoverable errors generated during this process.
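
A sketch of one iteration of this CLI sequence, assuming a hypothetical problem ID of problem-000 (the intermediate 'show' used to poll scrub progress is an assumption; the other commands appear in the steps above):

hostname:> configuration storage scrub start
hostname:> configuration storage show                  (repeat until the scrub is reported complete)
hostname:> maintenance problems select problem-000 markrepaired
hostname:> maintenance problems show                   (confirm whether any checksum faults remain)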

4.  Verify Part is in the Correct Enclosure

AK-8000-F0 indicates that the disk is not compatible with the enclosure, for example a SAS-1 drive in a SAS-2 enclosure. Unfortunately, the error is often spurious and can disappear after reseating the drive. This fault is seen almost exclusively on 7410 and 7420 systems.

Please reseat the drive in question and mark the problem as repaired. If the problem returns, please review your configuration: there may be a SAS-1 or SAS-2 drive in the wrong enclosure.

5.  Verify whether CR 6999699 has been fixed in your system release AND whether you have replaced a disk drive recently

<SUNBUG: 6999699> This issue is caused by the creation of multiple entries for drives in the storage pools. It has been resolved in the 2010Q3.4.0 release.

Get the AK release:

  • BUI->Click on the Sun/Oracle symbol in the upper left corner
  • CLI->maintenance system updates show


The release string has the format:

ak-nas@<version>,1-1.14     2010-9-23 18:28:47        previous
ak-nas@<version>,1-1.31     2011-7-14 14:28:23        current



If you are at a version prior to the release listed above and you have replaced a disk drive recently, please contact Oracle for further support.

If the above steps do not provide a conclusion about the disk problem, collect a full support bundle and upload it to Oracle, or attach it to a Service Request you have open for this problem.

If the steps above have not pointed you towards a resolution, please contact Oracle for further help.



 

Step 6 and beyond are for internal Oracle Support, as they describe in detail how to review the system logs from a support bundle to identify the reason for the fault on a customer system.

 

 

6.  Check debug.sys to see whether there were several command timeouts for which the disk was offlined.


Get the drive serial number from the problems list:

From ./fm/fmadm.out

FRU         : "SCSI Device  13" (hc://:product-id=SUN-Storage-J4400:server-id=:chassis-id=1027QAK01F:serial=9QJ410Y8:part=SEAGATE-ST31000NSSUN1.0T:revision=SU0F/ses-enclosure=6/bay=13/disk=0)
                  faulty



From ./logs/debug.sys

grep "Disconnected command timeout for target" logs/debug.sys | grep "9QJ2WZNT"

Apr 27 22:00:14 s7410wwadminB scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci10de,376@e/pci1000,3150@0 (mpt0):
Apr 27 22:00:14 s7410wwadminB   Disconnected command timeout for target. Vendor='SEAGATE' Product='ST31000NSSUN1.0T' Serial='9QJ2WZNT



  • If timeouts exist and are consistent, go to Step 7.
  • If timeouts exist but are intermittent, go to Step 9.
  • If timeouts do not exist, the drive should be replaced (unless it was replaced previously on this system).
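
To judge whether the timeouts are consistent or intermittent, it can help to bucket them by hour. A sketch, assuming the syslog-style timestamps shown above (e.g. "Apr 27 22:00:14"):

grep "Disconnected command timeout for target" logs/debug.sys | grep "<serial number>" | awk '{print $1, $2, substr($3,1,2)":00"}' | sort | uniq -c

Hits in nearly every hour suggest a consistent problem; isolated bursts suggest intermittent timeouts.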

7. Verify whether the drive is in the head or expansion tray


From the fault in fmadm.out, review whether the drive is in the appliance head or in an expansion tray:

FRU         : "SCSI Device  13" (hc://:product-id=SUN-Storage-J4400:server-id=:chassis-id=1027QAK01F:serial=9QJ410Y8:part=SEAGATE-ST31000NSSUN1.0T:revision=SU0F/ses-enclosure=6/bay=13/disk=0)
                  faulty
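
A quick way to pull the enclosure type for each fault entry out of the support bundle (a sketch; paths are relative to the bundle root):

grep -o "product-id=[^:]*" fm/fmadm.out | sort | uniq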



If the device is in the head, go to Step 9.
If the device is in an expansion tray, as denoted by product-id=SUN-Storage-J4400 or product-id=SUN-Storage-J4410, go to Step 8.

8.  Verify the status of the SIMs/SIM paths


Look for the chassis serial number found in ./fm/fmadm.out:

FRU         : "SCSI Device  13" (hc://:product-id=SUN-Storage-J4400:server-id=:chassis-id=1027QAK01F:serial=9QJ410Y8:part=SEAGATE-ST31000NSSUN1.0T:revision=SU0F/ses-enclosure=6/bay=13/disk=0)
                  faulty



Now look for that same chassis serial number in ./hw/fmtopo.txt. The serial number should be listed twice, once for each SIM.
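
A sketch for pulling both SIM (controller) entries for that chassis out of the bundle, using the chassis-id from the fmadm.out example above:

grep "chassis-id=1027QAK01F" hw/fmtopo.txt | grep "controller="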

hc://:product-id=SUN-Storage-J4410:product-sn=1051FMJ01V:server-id=:chassis-id=1027QAK01F:serial=2029QTF-1043QC133F:part=3753633:revision=3524/ses-enclosure=7/controller=1


In this string we are only checking whether the revision is listed for the entry (revision=3524 in the example above).
It should be present for both SIMs. The controller value in the output indicates the SIM slot (0 = left, 1 = right).

If both SIMs show the revision in fmtopo.txt, go to Step 9.
If one SIM does not show a revision, then:

  1. reseat the SIM
  2. reseat the drive
  3. verify the FM fault is cleared

 

If the fault is not cleared, collaborate with L2.

 

9.  Check whether debug.sys shows other drives impacted by timeouts


Get the drive serial number from the problems list:

FRU         : "SCSI Device  13" (hc://:product-id=SUN-Storage-J4400:server-id=:chassis-id=1027QAK01F:serial=9QJ410Y8:part=SEAGATE-ST31000NSSUN1.0T:revision=SU0F/ses-enclosure=6/bay=13/disk=0)
                  faulty



grep "Disconnected command timeout for target" logs/debug.sys | grep "<serial number>"


Note the times of the timeouts, then check for timeouts on all other drives:

grep "Disconnected command timeout for target" logs/debug.sys | grep -v "<serial number>"


Do disconnected command timeouts exist for other drive serial numbers at the same times as those for the faulted drive?

If so, collaborate with L2.
If not, go to Step 10.

10.  Check for any other failed system components in hw.aksh

Perform a quick review of the component status in hw.aksh, looking for any other parts in a faulted state.
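
A quick scan for components in a faulted state can serve as a starting point (a sketch; the exact status strings in hw.aksh vary by release):

grep -i "fault" hw/hw.aksh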

If any other component is faulted, collaborate with L2.
If not, have the drive replaced.


Attachments
This solution has no attachment