Sun Storage J4000 JBOD Array: Troubleshooting Disk Failures

Asset ID:	1-75-1353887.1
Update Date:	2012-07-17

Solution Type Troubleshooting Sure

Solution 1353887.1 : Sun Storage J4000 JBOD Array: Troubleshooting Disk Failures

Applies to:

Sun Storage J4200 Array - Version Not Applicable and later
Sun Storage J4500 Array - Version Not Applicable and later
Sun Storage J4400 Array - Version Not Applicable and later
Information in this document applies to any platform.

Purpose

The purpose of this document is to help troubleshoot disk failure symptoms on Sun Storage J4000 JBOD arrays.

Symptoms:

An amber LED is lit on one or more drives in the array
Host(s) connected to the array report(s) SCSI driver errors for one or more drives
One or more drives from J4000 is/are not seen by host(s)
SAS RAID HBA reports Failed/Degraded status for Volume(s) configured using J4000 drives

Note: This document mainly deals with the Solaris Operating System Environment. The instructions may vary for other OS environments.

Troubleshooting Steps

1. Verify Host logs to identify the fault(s), and the details of each fault

Reference <Document 1005530.1> How to Check for Solaris[TM] x64 Disk Errors and Online/Offline Status

2. Verify whether the drive(s) is/are configured under RAID HBA

Reference <Document 1017961.1> How to Identify if a Solaris[TM] Operating Environment is Installed on a Hardware RAID Controller

If the drive(s) is/are configured under SAS RAID HBA, refer to <Document 1013107.1> How to Identify BIOS and Solaris[TM] Hardware RAID Status. If one or more J4000 drives are identified as faulty, proceed to Step 7.

For more information about SAS RAID HBAs, refer to the SAS RAID HBA product documentation.

If the drive(s) is/are NOT configured under SAS RAID HBA, proceed to Step 3.

3. Verify '/var/adm/messages*' file(s) for any SCSI errors

Check the /var/adm/messages* file(s) for SCSI errors similar to the following:

Apr 22 04:39:58 host01 scsi: [ID 107833 kern.warning] WARNING: /pci@7c,0/pci10de,378@b/pci1000,3150@0 (mpt0):
Apr 22 04:39:58 host01 scsi: [ID 107833 kern.warning] WARNING: /pci@7c,0/pci10de,378@b/pci1000,3150@0 (mpt0):
Apr 22 04:39:58 host01 Disconnected command timeout for Target 17
Apr 22 04:39:58 host01 Disconnected command timeout for Target 17
Apr 22 04:40:00 host01 scsi: [ID 107833 kern.warning] WARNING: /pci@7c,0/pci10de,378@b/pci1000,3150@0 (mpt0):
Apr 22 04:40:00 host01 scsi: [ID 365881 kern.info] /pci@7c,0/pci10de,378@b/pci1000,3150@0 (mpt0):
Apr 22 04:40:00 host01 mpt_check_task_mgt: Task 3 failed. ioc status = 4a target= 17
Apr 22 04:40:00 host01 Log info 31140000 received for target 17.
Apr 22 04:40:00 host01 scsi_status=0, ioc_status=8048, scsi_state=c
Apr 22 04:40:00 host01 scsi: [ID 107833 kern.warning] WARNING: /pci@7c,0/pci10de,378@b/pci1000,3150@0 (mpt0):
Apr 22 04:40:00 host01 mpt_check_task_mgt: Task 3 failed. ioc status = 4a target= 17

(or)

Mar 12 10:01:10 host02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,3150@0/sd@b,0 (sd6):
Mar 12 10:01:10 host02  Error for Command: read(10)                Error Level: Retryable
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice]    Requested Block: 55060475                  Error Block: 55060539
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice]    Vendor: SEAGATE                            Serial Number: 01234XXXXX
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice]    Sense Key: Media Error
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice]    ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0

If such errors are found, proceed to Step 4.
If no such errors are found, proceed to Step 7.
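The log check in this step can be scripted. The following is a minimal sketch, not an Oracle-supplied tool: the function name summarize_scsi_errors is hypothetical, and it assumes the message formats match the two samples above. It counts mpt "Disconnected command timeout" messages per target and counts "Sense Key: Media Error" lines in whatever log files are passed to it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Solaris or CAM): summarize SCSI errors
# found in messages files. Assumes the line formats shown in the samples above.
summarize_scsi_errors() {
    # "$@" is the list of log files to scan, e.g. /var/adm/messages*
    awk '
        /Disconnected command timeout for Target/ {
            # The target number is the last field on the line.
            timeouts[$NF]++
        }
        /Sense Key: Media Error/ { media++ }
        END {
            for (t in timeouts)
                printf "target %s: %d disconnected command timeout(s)\n", t, timeouts[t]
            printf "media error lines: %d\n", media + 0
        }
    ' "$@"
}
```

A typical invocation would be: summarize_scsi_errors /var/adm/messages*
Any non-zero count corresponds to the "errors are found" branch of this step.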

4. Verify whether the Common Array Manager (CAM) application is installed on the host

SCSI errors may be reported on host(s) that have Common Array Manager (CAM) installed and are connected to J4000 JBOD array(s) through a Pandora HBA. This is caused by the following bug:
<SunBug 6946711> - mpt Disconnected timeouts - Pandora HBA connected to two J4500 continually reset.

Note: Pandora is an 8-port 3Gbps SAS/SATA HBA - External. Model Number : SG-XPCIE8SAS-E-Z.

Verify whether the CAM application is installed on the host by using the pkginfo command, as shown below:

# pkginfo -l SUNWsefms
PKGINST: SUNWsefms
NAME: Sun Storage Common Array Manager Fault Management Services
CATEGORY: application
ARCH: all
VERSION: 6.8.0,REV=2011.06.04.08.08.24
BASEDIR: /opt
VENDOR: Oracle Corporation
DESC: The Sun Storage Common Array Manager Fault Management Services

If CAM is installed, proceed to Step 5.
If CAM is NOT installed, proceed to Step 6.
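If this check needs to be repeated across several hosts, the pkginfo output can be parsed rather than read by eye. The sketch below is an illustration only: the function name pkg_version_from_listing is hypothetical, and it assumes the "VERSION:" field appears as in the sample output above. It reads a pkginfo -l listing on standard input, prints the version, and returns non-zero when no VERSION line is present (for example, when the package is not installed and pkginfo writes nothing to standard output).

```shell
#!/bin/sh
# Hypothetical helper: print the VERSION field from `pkginfo -l` output
# supplied on stdin. Returns 1 (and prints nothing) if no VERSION line exists.
pkg_version_from_listing() {
    awk '
        $1 == "VERSION:" { print $2; found = 1 }
        END { exit(found ? 0 : 1) }
    '
}
```

Example use on the host: pkginfo -l SUNWsefms 2>/dev/null | pkg_version_from_listing
A zero exit status corresponds to "CAM is installed" (Step 5); a non-zero status corresponds to "CAM is NOT installed" (Step 6).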

5. Implement the workaround for Bug 6946711

The workaround for the Bug 6946711 is to disable fmservice as follows:

Verify the status of 'fmservice':

# svcs fmservice
STATE          STIME    FMRI
online         Aug_24   svc:/system/fmservice:default

If fmservice is reported online, disable fmservice and reboot the host for the drive(s) to come online. Then proceed to Step 12.

# svcadm disable fmservice
# svcs fmservice
STATE          STIME    FMRI
disabled       20:16:06 svc:/system/fmservice:default

Note: Disabling the fmservice is a workaround. This page will be updated when a permanent fix for the bug is available.

Note: Disabling the fmservice prevents CAM from monitoring the health of JBODs which were registered, so those JBODs should be visually checked for any failed components more often.

If fmservice is reported disabled, or no fmservice is found (as is the case when CAM is not installed), this bug does not apply; proceed to Step 6.
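The branch taken in this step depends only on the state that svcs reports. As a rough sketch (the function names fmservice_state and next_step_after_fmservice_check are hypothetical, and the parsing assumes the three-column svcs output shown above), the decision could be expressed as follows. States other than online and disabled are not covered by this document, so the sketch reports them as such rather than guessing.

```shell
#!/bin/sh
# Hypothetical helper: read `svcs fmservice` output on stdin and print the
# service state ("online", "disabled", ...), or "absent" if no fmservice
# row is present (for example, when CAM is not installed).
fmservice_state() {
    awk '
        $NF ~ /fmservice/ { state = $1 }
        END { print (state == "" ? "absent" : state) }
    '
}

# Hypothetical helper: map a state to the next action described in this step.
next_step_after_fmservice_check() {
    case "$1" in
        online)          echo "apply workaround, then Step 12" ;;
        disabled|absent) echo "Step 6" ;;
        *)               echo "state not covered by this document" ;;
    esac
}
```

Example use on the host: next_step_after_fmservice_check "$(svcs fmservice 2>/dev/null | fmservice_state)"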

6. Check the SCSI errors for any media errors:

If there are media errors as shown below, and there are fewer than 30 in a two-week period for the same drive, the affected blocks were relocated successfully and no further action is required.

Mar 12 10:01:10 host02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,3150@0/sd@b,0 (sd6):
Mar 12 10:01:10 host02 Error for Command: read(10) Error Level: Retryable
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice] Requested Block: 55060475 Error Block: 55060539
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 01234XXXXX
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice] Sense Key: Media Error
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice] ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0

If there are 30 or more media errors for the same drive within the two-week period, the drive needs to be replaced; contact Oracle Support for drive replacement.
If there is no uniform pattern, or if the errors are reported for multiple drives, proceed to Step 7.
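Counting media errors per drive by hand is error-prone, so the threshold above can be applied with a short script. This is a sketch under stated assumptions, not an official procedure: the function name media_errors_per_drive and the MEDIA_ERROR_LIMIT variable are hypothetical; each error report is assumed to begin with a WARNING line ending in "(sdN):" as in the sample above; and no date filtering is performed, so only log files covering the two-week window of interest should be passed in. If reports from different drives are interleaved in the log, the attribution may be wrong.

```shell
#!/bin/sh
# Hypothetical helper: count "Sense Key: Media Error" lines per sd instance
# and compare each count with the limit from this step (default 30).
# Output format: "<device> <count> <ok|replace>", one line per device.
media_errors_per_drive() {
    awk -v limit="${MEDIA_ERROR_LIMIT:-30}" '
        /WARNING:/ {
            # Remember the device named in the most recent WARNING line, e.g. "sd6".
            dev = ""
            if (match($0, /\(sd[0-9]+\)/))
                dev = substr($0, RSTART + 1, RLENGTH - 2)
        }
        /Sense Key: Media Error/ && dev != "" { count[dev]++ }
        END {
            for (d in count)
                printf "%s %d %s\n", d, count[d], (count[d] >= limit ? "replace" : "ok")
        }
    ' "$@"
}
```

Example use on the host: media_errors_per_drive /var/adm/messages*
A drive flagged "replace" has reached the limit; if several drives appear in the output, that matches the multiple-drive case and Step 7 applies.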

7. Verify whether the fault(s) is/are observed for multiple drives

If the issue is seen with a single drive, proceed to Step 8.
If the issue is seen with multiple drives, proceed to Step 9.

8. Verify the physical LED indications of the drive

If the Amber Fault LED is ON, the drive is faulty; contact Oracle Support for drive replacement.

9. Check the cable connectivity

Reference Cabling configuration for J4500
Reference Cabling configuration for J4200/J4400 - Single path
Reference Cabling configuration for J4200/J4400 - Multipath

If the cabling is as per the documentation, proceed to Step 11.
If not, proceed to Step 10.

10. Adjust the cabling as per the documentation and verify whether the host can access the drives properly

If host can access the drives properly, proceed to Step 12.
If host sees one of the Symptoms again, proceed to Step 11.

Note: Cabling cannot be adjusted while host is online and accessing other enclosure drives. You need to plan a maintenance window to correct the cabling.

11. Verify SIM board LEDs and back panel indicators

Reference Back Panel Indications J4500
Reference Back Panel Indications J4200/J4400

Capture any Amber LED indications seen and proceed to Step 13.

12. Monitor the system for any errors for two days

If the Symptoms repeat, proceed to Step 13.
If no further Symptoms are seen, the issue is considered to be resolved.

13. Open a call for further analysis

At this point, if you have worked through each troubleshooting step above that applies to your environment and the issue still exists, further troubleshooting is required. Please contact Oracle Support and supply:

Critical Faults
Support Data Collection (if applicable) Reference <Document 1002514.1> Collecting Support Data for Arrays Using Sun StorageTek[TM] Common Array Manager
Detailed LED indications
Cabling configuration
Explorer output
