Troubleshooting Sun Storage[TM] Array Impending Drive Failures

Asset ID:	1-75-1103184.1
Update Date:	2012-06-18
Keywords:

Solution Type Troubleshooting Sure

Solution 1103184.1 : Troubleshooting Sun Storage[TM] Array Impending Drive Failures

Applies to:

Sun Storage 2510 Array - Version Not Applicable and later
Sun Storage 2530 Array - Version Not Applicable and later
Sun Storage 2540 Array - Version Not Applicable and later
Sun Storage 2530-M2 Array - Version Not Applicable and later
Sun Storage 2540-M2 Array - Version Not Applicable and later
Information in this document applies to any platform.

Purpose

The purpose of this document is to help troubleshoot Impending Drive Failure alerts on StorageTek, Sun StorEdge, Sun StorageTek, and Sun Storage arrays.

Troubleshooting Steps

Impending Drive Failures can occur in two ways:

1) The disk drive itself keeps track of errors on the disk and will report to the subsystem. If too many errors have occurred on the drive, the drive will be flagged. This is known as a Drive Reported Predictive Failure Alert (PFA).
2) The system keeps track of the number of errors a drive has in communicating with the RAID controllers. If too many events occur in a given period, the drive will be flagged. This is known as a Synthesized PFA.

PFAs are NOT drive failures. They are an early warning service supplied by the array and drives to warn the administrator that a drive may fail or is about to fail. This document is designed to help clarify when a drive should be replaced if this fault is present on an array.

Symptoms Include:

SANtricity Storage Manager shows an alert for Impending Drive Failure Low Risk
Common Array Manager shows an alarm for Impending Drive Failure Risk Low (xx.66.1026)
SANtricity Storage Manager shows an alert for Impending Drive Failure Medium Risk
Common Array Manager shows an alarm for Impending Drive Failure Risk Medium (xx.66.1025)
SANtricity Storage Manager shows an alert for Impending Drive Failure High Risk
Common Array Manager shows an alarm for Impending Drive Failure Risk High (xx.66.1024)
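
The three Common Array Manager alarm codes above differ only in their last field. As a rough illustration, the mapping can be written as a small Python lookup. The function name risk_from_alarm_code is hypothetical, and the sketch assumes the code is always written as three dot-separated fields, as in the list above.

```python
# Map the trailing fields of a CAM alarm code to the risk level listed above.
# The leading field (shown as "xx" in this document) varies, so it is ignored.
RISK_BY_SUFFIX = {
    "66.1026": "LOW",
    "66.1025": "MEDIUM",
    "66.1024": "HIGH",
}


def risk_from_alarm_code(code):
    """Return LOW, MEDIUM, or HIGH for an Impending Drive Failure code, else None."""
    parts = code.strip().split(".")
    if len(parts) != 3:
        return None
    return RISK_BY_SUFFIX.get(".".join(parts[1:]))
```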

If you have been replacing a lot of drives, especially due to these faults, you will want to go through this exercise to identify the reason for the failure.

Steps To Follow

Please validate that each troubleshooting step below is true for your environment. Each step will provide instructions via a link to a document, for validating the step and taking corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Please do not skip a step.

1. Verify the critical fault reported by the array.

Use the user interface to verify the list of critical faults and the details of each fault, along with the firmware and NVSRAM versions of your system.

Reference <Document 1021057.1> Verify Sun StorageTek[TM] 2500 and Sun Storage[TM] 6000 Critical Faults via the User Interface

Reference: <Document 1021067.1> Verify Sun Storage[TM] Array Firmware via the User Interface

Use the table below to match up your faults to the firmware and nvsram on the array.

Fault	Model	Firmware	NVSRAM	Action
xx.66.1026/IMPENDING_DRIVE_FAILURE_RISK_LOW	ALL	ALL	ALL	Contact Oracle to have the drive replaced
xx.66.1025/IMPENDING_DRIVE_FAILURE_RISK_MED, xx.66.1024/IMPENDING_DRIVE_FAILURE_RISK_HIGH	Sun StorageTek 2510, Sun StorageTek 2530, Sun StorageTek 2540, Sun Storage 2530-M2, Sun Storage 2540-M2	ALL	ALL	Contact Oracle to have the drive replaced
xx.66.1025/IMPENDING_DRIVE_FAILURE_RISK_MED, xx.66.1024/IMPENDING_DRIVE_FAILURE_RISK_HIGH	Sun Storage 6180, Sun Storage 6580, Sun Storage 6780, Sun StorageTek 6540, Sun StorageTek 6140, StorageTek Flexline 380	7.60.56.10 or later firmware	ALL	Contact Oracle to have the drive replaced
xx.66.1025/IMPENDING_DRIVE_FAILURE_RISK_MED, xx.66.1024/IMPENDING_DRIVE_FAILURE_RISK_HIGH	Sun Storage 6180, Sun Storage 6580, Sun Storage 6780, Sun StorageTek 6540, Sun StorageTek 6140, StorageTek Flexline 380	Firmware 7.xx	One of these NVSRAM versions must be installed: N399X-710843-069, N399X-760843-049, N49XX-760843-039, N6091-710843-059, N6091-710855-059, N6091-750843-039, N6091-750855-029, N6091-760843-039, N6091-760855-901, N7091-730843-049, N7091-750843-039, N7091-760843-039	Contact Oracle to have the drive replaced
xx.66.1025/IMPENDING_DRIVE_FAILURE_RISK_MED, xx.66.1024/IMPENDING_DRIVE_FAILURE_RISK_HIGH	Sun StorageTek 6540, Sun StorageTek 6140, StorageTek Flexline 380, Sun StorEdge 6130, StorageTek Flexline 280	Firmware 6.15.xx.xx to 6.60.xx.xx	One of these NVSRAM versions must be installed: N288X-660843-039, N288X-660855-039, N399X-660843-039, N6091-660843-039, N6091-660855-039	Contact Oracle to have the drive replaced
xx.66.1025/IMPENDING_DRIVE_FAILURE_RISK_MED, xx.66.1024/IMPENDING_DRIVE_FAILURE_RISK_HIGH	ANY	Firmware below 6.16	ANY	Contact Oracle to have the drive replaced

If there is no critical fault as referenced in the symptoms above, no drives have this condition. If the array isn't reporting the Impending Drive Failure, there should be no alarm, and no further troubleshooting is required.
If you cannot match up your configuration and fault in the above list, then you are exposed to CR 6704575 and 7062619, which can cause false impending failure faults. Go to Step 2.

NOTE: CR 6704575 states 7.60.36.10 has the fix required, but CR 7062619 found that the fix was not implemented completely, and the firmware required an update.
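
Matching a firmware level against the table means comparing dotted version strings field by field rather than as text. A minimal Python sketch of that comparison follows. The function names are hypothetical, the default minimum is the 7.60.56.10 level from the table, and the sketch assumes every field is numeric (placeholders such as "xx" will raise a ValueError).

```python
def parse_firmware(version):
    """Turn a dotted firmware string such as '07.60.56.10' into a tuple of ints."""
    return tuple(int(field) for field in version.strip().split("."))


def has_minimum_firmware(installed, minimum="7.60.56.10"):
    """True if the installed firmware is at or above the minimum level."""
    # Tuples compare field by field, so (7, 60, 56, 10) sorts above (7, 60, 36, 10).
    return parse_firmware(installed) >= parse_firmware(minimum)
```

For example, has_minimum_firmware("07.60.36.10") is False, which is consistent with the note above that the 7.60.36.10 level did not carry the complete fix.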

2. Identify the number of affected volumes.

From the alarm details, you will be able to see the volume group/Virtual Disk ID (VDisk) and the drive location. Using the VDisk ID, get the number of volumes in the VDisk by looking at the VDisk details.

CAM Browser

Log into Common Array Manager
Click on your array name
Click on Virtual Disks in the left hand tree
In the large pane, click on your VDisk ID
In the section marked "Related Information" there is an entry for Volumes. The value between the "( )" is the number of volumes.

If there is more than one volume in the VDisk/Volume Group, go to Step 3.
If there is only a single volume or no volumes in the VDisk/Volume Group, the system is correctly identifying the drive fault, and the drive should be replaced.

3. View the volume details and identify whether the volumes are owned by a single controller or split between two controllers.

NOTE: You only need to go as far as identifying that at least two of the volumes in the affected VDisk/Volume Group are owned by different controllers.

CAM Browser

Log into Common Array Manager
Click on your array name
Click on Virtual Disks in the left hand tree
In the large pane, click on your VDisk ID
In the section marked "Related Information" there is an entry for Volumes. Click on it to get the list of volume names, and copy them down.
Click on the first volume in the list, and look at the Preferred Controller field, and copy that down.
Click on the Volumes Option in the left pane
Click on the next volume in your list, identify the preferred controller field.
Repeat steps 6 through 8 until you identify two volumes that have different preferred controllers, or have reached the end of the list and concluded that all volumes are owned by the same controller.

CAM sscs

Use the list of volumes from:

sscs list -a <array-name> vdisk <VDisk ID>

to run against:

sscs list -a <array-name> volume <volume-name>

There is a field for preferred owner. Review all volumes until you have identified whether there are two or more volumes owned by different controllers, or the end of the list is reached.

SANtricity

Launch SANtricity ES
Double Click on your Array Name to launch the Array Management Window
Click on the Logical tab
In the left pane of the logical tab there is an expandable tree under your array name. This will list the Volume Groups by name. Expand the Volume Group identified in the fault, and count the number of volumes. Note that "Free Space" is NOT a volume.
Right Click on each volume name in the list and click Properties (6.xx firmware), or just click on the volume and the right pane will populate with the volume details.
Identify the preferred owner of the volume
Repeat steps 5 and 6 until you identify two volumes that have different preferred controllers, or have reached the end of the list and concluded that all volumes are owned by the same controller.

If two or more volumes have preferred ownership by separate controllers, go to Step 4.
If all volumes have the same preferred owner, the system is correctly identifying the drive fault, and the drive should be replaced.
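
The decisions in Steps 2 and 3 reduce to one check on the list of preferred owners. The Python sketch below is only an illustration of that logic. The function name next_action and the controller labels "A" and "B" are assumptions, and the list is expected to hold one entry per volume in the affected VDisk/Volume Group, collected by hand from the user interface.

```python
def next_action(preferred_owners):
    """Decide the next step from the preferred owner of each volume in the VDisk.

    preferred_owners: one entry per volume, for example ["A", "B", "A"].
    """
    # Step 2: a single volume or no volumes means the fault is genuine.
    if len(preferred_owners) <= 1:
        return "replace drive"
    # Step 3: ownership split across controllers needs further investigation.
    if len(set(preferred_owners)) > 1:
        return "go to Step 4"
    return "replace drive"
```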

4. Save a stateCaptureData.dmp file from the array, and look for the Synthetic PFA flag for each affected drive.

This can be performed by collecting support data through normal methods and unzipping/unpacking the file.

Reference <Document 1002514.1> Collecting Support Data for Arrays Using Sun StorageTek[TM] Common Array Manager

Reference <Document 1014074.1> Collecting Support Data for Arrays Using Sun StorageTek[TM] SANtricity Storage Manager

CAUTION: If you are running 07.50.08.10 or 07.50.13.10 firmware on your array, collecting this output too often may cause both RAID controllers to reboot unexpectedly due to a memory leak. Refer to CR 6857533 for details. If you are running either of these two firmware versions, please upgrade to 07.60.18.10 or later prior to executing this step, or risk an unplanned outage.

CAM CLI:

service -d <array-name> -c save -t state -p <path> -o stateCaptureData.txt

Service is located in:
Solaris: /opt/SUNWsefms/bin
Windows: C:\Program Files\Sun\Common Array Manager\Component\fms\bin
Linux: /opt/sun/private/fms/bin

SANtricity GUI:

Launch SANtricity ES
Double Click on your Array Name to launch the Array Management Window
Open the Advanced Menu
Select the Troubleshooting Sub Menu
Select Capture State Information and save the file on your client system for review (this is a text file).

Within the file, there is an output called

getObjectGraph_MT(4,0,0,0,0,0,0,0,0)

In that output use the Tray and Slot location from the alarm to identify the drive. The tray value should be converted to Hexadecimal.
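
If the conversion is not obvious, the tray number from the alarm can be converted with one line of Python. This is only a sketch: the function name is hypothetical, and how the dump pads or prefixes the hexadecimal value is not specified here, so the result is returned as bare lowercase digits.

```python
def tray_to_hex(tray):
    """Convert a decimal tray number from the alarm into bare hexadecimal digits."""
    return format(int(tray), "x")
```

For example, tray 14 becomes e and tray 85 becomes 55.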

Here is a SAMPLE output for a drive:

Executing getObjectGraph_MT(4,0,0,0,0,0,0,0,0,0) on controller A:
OBJECT GRAPH - 0x2a7264b8
cfgGeneration : 0x6739
DRIVE - 0x2b073930
Off/GHS/Rmvd/Avlb : 0/0/0/0
uncertified : 0
Pfa : 0
drivePfaReason : 0x1 NONE
status : 0x1 OPTIMAL
cause : 0x1
vgIndex : 3
inquiry info : SEAGATE / ST373554FSUN72G / 0409
serialNumber : 337163KX 3KP163KX
blkSize : 0x200
raw/usable caps : 0x1117732400 / 0x10f7732400
interfaceType : Fibre 0x5 0x7 Fibre 0x1 0x7
slot/tray ref : 8/0e 10 00 00 a0 b8 4d 91 0b 00 00 00 00 00 00 00 00 00 00 00

From the output, identify the drivePfaReason.

If the drivePfaReason is DRIVE, the drive itself is reporting the fault and should be replaced.
If the drivePfaReason is SYNTHESIZED, go to Step 5.
If the drivePfaReason is NONE or something else, contact Oracle for further investigation.
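
When the state capture file is large, the drivePfaReason lines can be pulled out with a short script. The Python sketch below assumes each line has the same layout as the sample output above (a hexadecimal value followed by the reason name). The function names are hypothetical, and you still need to match each DRIVE block to the tray and slot from the alarm yourself.

```python
import re

# Matches lines such as "drivePfaReason : 0x1 NONE" from the sample output.
REASON_LINE = re.compile(
    r"^\s*drivePfaReason\s*:\s*(0x[0-9a-fA-F]+)\s+(\S+)", re.MULTILINE
)


def pfa_reasons(dump_text):
    """Return the reason name from every drivePfaReason line, in file order."""
    return [match.group(2) for match in REASON_LINE.finditer(dump_text)]


def step_for_reason(reason):
    """Map a drivePfaReason name to the action described in Step 4."""
    if reason == "DRIVE":
        return "replace drive"
    if reason == "SYNTHESIZED":
        return "go to Step 5"
    return "contact Oracle"
```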

5. Review the event log for event type 1016 (Unrecoverable Media Error) within the last 2 weeks.

Use the user interface to see if there are any event types 0x1016 for Unrecoverable Media Errors on the drive(s) listed in the impending failure fault, in the last two weeks.

See the appendix at the end of the document for details on how to get these events.

If there are 1016 errors for the drive(s) in the fault(s), contact Oracle to have the drive(s) replaced.
If there are no 1016 errors for the drive(s) in the fault(s), go to Step 6.

6. Review the event log for event type 100A (Check Condition) within the last 2 weeks.

Use the user interface to see if there are any event types 0x100A for Check Conditions on the drive(s) listed in the impending failure fault, in the last two weeks.

See the appendix at the end of the document for details on how to get these events.

If there are 100A events for the indicated drive with event specific code 03/xx/xx, the drive should be replaced, as this is a Drive Medium error.

If there are 100A events for the indicated drive with event specific code 04/xx/xx, the drive should be replaced, as this is a Drive Hardware failure.

If there are 100A events for the indicated drive with event specific code B/88/3, the drive should be replaced, as this is a transmission failure between the drive interposer and the drive.

If there are 100A events for the indicated drive with event specific code 1/nn/nn, and there are 30 or more of these for the same drive during the two week period, the drive should be replaced due to recoverable errors.

If there are no 100A events in the last two weeks, or they do not match the event specific codes above, go to Step 7.
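
The Step 6 rules can be summarised as a small classifier. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, each field of the event specific code is read as hexadecimal (so 03 and 3 are treated alike), and the input list is expected to contain only the 100A codes for one drive from the last two weeks.

```python
def parse_code(code):
    """Split a code such as '4/80/87' into three ints, reading each field as hex."""
    key, asc, ascq = (int(field, 16) for field in code.strip().split("/"))
    return key, asc, ascq


def replacement_reason(codes):
    """Return why the drive should be replaced, or None to continue with Step 7.

    codes: event specific codes of the 100A events for ONE drive, last two weeks.
    """
    parsed = [parse_code(code) for code in codes]
    for key, asc, ascq in parsed:
        if key == 0x3:
            return "drive medium error"
        if key == 0x4:
            return "drive hardware failure"
        if (key, asc, ascq) == (0xB, 0x88, 0x3):
            return "transmission failure between interposer and drive"
    # Recoverable errors only count once there are 30 or more for the drive.
    recovered = sum(1 for key, _, _ in parsed if key == 0x1)
    if recovered >= 30:
        return "recoverable errors (30 or more)"
    return None
```

With the sample events shown in the appendix, the code 4/80/87 is classified as a drive hardware failure, while 2/4/2 matches none of the rules.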

7. Based on the troubleshooting you have performed, you have identified that the drive has faulted with a Synthesized PFA, likely due to CR 6704575 or 7062619.

The CR is due to what is called a Stagnant IO threshold. This is a threshold based on the number of times an IO to a drive timed out in the array controllers within a set number of seconds. This threshold is set too low for Virtual Disks that have more than one volume with ownership split between controllers, as you identified in Step 3, and there is no user interface to change the threshold. You are more likely to see this issue recur when the affected drives are under heavy load.
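
To make the idea of a count-within-a-time-window threshold concrete, here is a generic sliding-window counter in Python. It is not the controller firmware's implementation, which is not described in this document. The class name and both parameters are hypothetical, and the values used with it are illustrative only.

```python
from collections import deque


class StagnantIoCounter:
    """Flags a drive when too many IO timeouts fall inside a sliding time window."""

    def __init__(self, max_timeouts, window_seconds):
        self.max_timeouts = max_timeouts      # hypothetical value, not from firmware
        self.window_seconds = window_seconds  # hypothetical value, not from firmware
        self.timeouts = deque()

    def record_timeout(self, now):
        """Record a timeout at time `now` (seconds); True if the threshold is exceeded."""
        self.timeouts.append(now)
        # Drop timeouts that have aged out of the window.
        while self.timeouts and now - self.timeouts[0] > self.window_seconds:
            self.timeouts.popleft()
        return len(self.timeouts) > self.max_timeouts
```

With a low max_timeouts value, a short burst of timeouts is enough to trip the flag, which is the kind of behavior the CR describes for drives under heavy load from both controllers.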

For 6140, 6540, and Flexline 380 arrays, there is a firmware fix in array firmware 07.60.56.10 that permanently sets the threshold. This is delivered via Sun StorageTek[TM] Common Array Manager version 6.8.1 or later.

For 6180, 6580, and 6780 arrays, there is a firmware fix in array firmware 07.77.13.11. This is delivered in Sun StorageTek[TM] Common Array Manager version 6.8 or later.

For other arrays, there may be an NVSRAM fix that can be requested against this CR. It is not currently available for download, and is only available upon request.

Neither the application of the NVSRAM nor the firmware upgrade to your array will clear the counters on the drive and clear the alarm. You will still need to perform the actions below.

In order to alleviate the CR, and clear the counters:

a) Migrate all volumes in the affected VDisk that you identified in Step 3 to a single array controller. Reference

<Document 1006464.1> Changing a volume's "owning controller" and "preferred controller" in Sun StorageTek[TM] Common Array Manager

b) Use the Revive function in the User Interface to clear the Synthetic Counters

CAM GUI

Log into Common Array Manager
Click on your array name
Click on Virtual Disks in the left hand tree
In the large pane, click on your VDisk ID
In the Related Information, click on Disks
In the Disks Summary, click on your affected Disk.
Click the Revive button. It should be blue; if it is grey, try the CLI below.

CAM CLI

sscs service -a my_array -t tXdriveY revive

OR

sscs revive -a my_array disk tXdriveY

Alternatively:
service -d <array-name> -c revive -t <tXdriveY>

SANtricity GUI

Launch SANtricity ES
Double Click on your Array Name to launch the Array Management Window
Click on the Physical Tab
Click on the affected drive slot in the array graphic
Open the Advanced Menu
Open the Recovery Sub Menu
Open the Revive sub Menu
Click Drive
Type "yes" when prompted and click continue.

If the PFA critical fault is still present on the array after 5 minutes, contact Oracle.
If the PFA critical fault is cleared, no further work is required.

NOTE: For as long as you have split ownership, this issue can and will return until we have a firmware fix in place for your arrays. It is recommended that you consolidate your volume ownership as prescribed, to avoid any false drive replacements.

Appendix:

Collecting Support Information

Reference Document:1002514.1 Collecting Support Data for Arrays Using Sun StorageTek[TM] Common Array Manager.
Reference Document:1014074.1 Collecting Support Data for Arrays Using Sun StorageTek[TM] SANtricity Storage Manager.

How to get event lists from the user interface:

Sun StorageTek Common Array Manager:

Browser:

Expand Storage Arrays in the left menu pane.
Expand your storage array name in the left menu pane.
Expand Troubleshooting in the left menu pane.
Click on Events.
In the right pane, click on the -|-> icon. If you mouse over it, it will state Advanced Filter.
Set Event to Log Events.
Set Event Type to Component.
Set Read the last X Kbytes From Log File to 100.
Set String Filter to 0x100A or 0x1016 (you will have to run a separate search for each).
Click on the Details of any alarm that is shown.

SSCS

sscs list -d <array_name> -t LogEvent -f 0x100A event
sscs list -d <array_name> -t LogEvent -f 0x1016 event

Severity    : Minor
Date        : Thu Feb 02 21:13:13 EST 2012
Device      :myarray (Sun Storage 6780)
Component   : Tray.01.Drive.02
Type        : LogEvent
Information : Drive returned CHECK CONDITION (4/80/87)
Event Code : 80.20.491
Aggregated : No
Description : Feb 02 21:13:13 pts-6780-bur Tray.01.Drive.02: [ID 0x100A] NOTICE:
              Drive returned CHECK CONDITION (4/80/87)

Probable Cause :
The array firmware has logged an informational event.

Recommended Action :
No action required.

The Event Specific Code in the example above is 4/80/87.
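
If you save the sscs output to a file, the component and event specific code can be extracted with a short Python script. This sketch assumes the output has the same layout as the example above. The function name parse_sscs_event is hypothetical, and only the first match in the given text is returned.

```python
import re

# Matches "CHECK CONDITION (4/80/87)" and captures the event specific code.
CHECK_CONDITION = re.compile(
    r"CHECK CONDITION \(([0-9A-Fa-f]+/[0-9A-Fa-f]+/[0-9A-Fa-f]+)\)"
)
# Matches "Component   : Tray.01.Drive.02" at the start of a line.
COMPONENT = re.compile(r"^Component\s*:\s*(\S+)", re.MULTILINE)


def parse_sscs_event(text):
    """Return (component, event_specific_code) from one event; None where missing."""
    component = COMPONENT.search(text)
    code = CHECK_CONDITION.search(text)
    return (
        component.group(1) if component else None,
        code.group(1) if code else None,
    )
```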

SANtricity Storage Manager:

GUI:

Launch SANtricity.
Double Click on your array name to open the Array Management Window.
Click on the Advanced Menu.
Click on the Troubleshooting Sub-Menu.
Click on View Event Log.
Un-Check View Only Critical Events.
Click on the Component Type field header to sort the events.
Look for Drive in the list of events.
For any Drive event, highlight it, and check the View Details box.
Get the value of the Event type and Event Specific Details field for each Drive event.

SMcli:

Get the list of events by saving off the event log:

SMcli -n array_name -c "save storageArray allEvents file=\"some/file/path/log.txt\";"

Open a text viewing application to look at the individual events.
Get the value of the Event type and Event Specific Details field for each Drive event.

Date/Time: Sun Feb 26 23:56:43 EST 2012
Sequence number: 164
Event type: 100A
Event category: Error
Priority: Informational
Description: Drive returned CHECK CONDITION
Event specific codes: 2/4/2
Component type: Drive
Component location: Tray.02.Drive.11
Logged by: Controller in slot A

impending, drive, disk, pfa, predictive, failure, fault, 2500, 6000, flx, 2510, 2530, 2540, flx240, flx280, flx380, 6540, 6130, 6140, 6180, 6780, 6580, flex, synthesized, normalized

Internal Comments
This document contains normalized content and is managed by the Domain Lead(s) of the respective domains. To notify content owners of a knowledge gap contained in this document, please add a comment to the document.

NOTE: The NVSRAM referenced to fix this issue is delivered via escalation to Tier 3 support for the product.

References

<NOTE:1002514.1> - Collecting Sun Storage Common Array Manager Array Support Data
<NOTE:1014074.1> - Collecting Support Data for Arrays Using Sun StorageTek SANtricity Storage Manager
<NOTE:1021057.1> - How to verify Sun StorageTek[TM] 2500 and Sun Storage[TM] 6000 and J4000 Critical Faults via the User Interface
<NOTE:1021067.1> - How to Verify Sun Storage[TM] Array Firmware Using Sun Storage Common Array Manager or SANtricity
