Troubleshooting Sun Storage[TM] 2500 and 6000 RAID Array Disk Failures

Asset ID:	1-75-1021055.1
Update Date:	2012-07-09
Keywords:

Solution Type: Troubleshooting Sure

Solution 1021055.1 : Troubleshooting Sun Storage[TM] 2500 and 6000 RAID Array Disk Failures

Applies to:

Sun Storage 6780 Array - Version Not Applicable and later
Sun Storage 6540 Array - Version Not Applicable and later
Sun Storage 6140 Array - Version Not Applicable and later
Sun Storage 6580 Array - Version Not Applicable and later
Sun Storage 2530-M2 Array - Version Not Applicable to Not Applicable [Release N/A]
All Platforms

Purpose

The purpose of this document is to help troubleshoot disk failure symptoms on StorageTek, Sun StorEdge, Sun StorageTek, and Sun Storage arrays.

Symptoms

Common Array Manager or SANtricity show a fault for Failed Hot Spare or Unassigned Drive (alarm ID xx.66.1021).
Common Array Manager or SANtricity show a fault for Failed Drive (alarm ID xx.66.1023).
Common Array Manager or SANtricity show a fault for Volume Degraded (alarm ID xx.66.1013).
Common Array Manager or SANtricity show a fault for Volume Failed (alarm ID xx.66.1017).
SANtricity or Common Array Manager show a fault for Impending Drive Failure Risk Low (xx.66.1026).
SANtricity or Common Array Manager show a fault for Impending Drive Failure Risk Medium (xx.66.1025).
SANtricity or Common Array Manager show a fault for Impending Drive Failure Risk High (xx.66.1024).
SANtricity or Common Array Manager show a fault for Drive Bypassed, reason not specified (xx.66.1064).
SANtricity or Common Array Manager show a fault for Drive Bypassed, Single Port (xx.66.1119).
SANtricity or Common Array Manager show a fault for Drive Path Degraded (xx.66.1076).
An amber LED is lit on one or more drives in the storage system.

Please validate that each troubleshooting step below is true for your environment. Each step will provide instructions via a link to a document, for validating the step and taking corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Please do not skip a step.

Troubleshooting Steps

1. Verify whether there are multiple Critical Faults Seen by the array.

Use the user interface to verify the list of critical faults, and the details of each fault. Verify that there is only a single disk drive failure on the array.

Reference <Document: 1021057.1> Verify Sun StorageTek[TM] 2500 and Sun Storage[TM] 6000 Critical Faults via the User Interface.

Check the list below as to whether you have one or more of the following critical faults listed:

xx.66.1023 / FAILED_DRIVE - Failed Unassigned Drive or Hot Spare, or Drive Tray.XX.Drive.YY failed
xx.66.1013 / DEGRADED_VOLUME - Degraded Volume detected or Degraded Volume
xx.66.1021 / HOT_SPARE_IN_USE - Hot Spare In Use
xx.66.1017 / FAILED_VOLUME - Failed Volume - Drive Failure or Failed Volume detected
xx.66.1064 / DRIVE_BYPASSED_CAUSE_UNKNOWN - Drive Bypassed, reason unknown
xx.66.1025 / IMPENDING_DRIVE_FAILURE_RISK_MED - Impending Drive Failure Risk Medium
xx.66.1026 / IMPENDING_DRIVE_FAILURE_RISK_LOW - Impending Drive Failure Risk Low
xx.66.1024 / IMPENDING_DRIVE_FAILURE_RISK_HIGH - Impending Drive Failure Risk High
xx.66.1076 / PATH_DEGRADED - Channel Path xx is Degraded for Drive
xx.66.1119 / DRIVE_BYPASSED_SINGLE_PORT - Drive Bypassed, Single Port

If ANY of these are for Failed Volume (xx.66.1017 / FAILED_VOLUME), contact Oracle Support, as you may have data loss.
If there is more than one fault, go to Step 2.
If there is only a single fault, go to Step 3.
If NO faults are listed as above, go to Step 7.
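
The branching at the end of Step 1 can be sketched as a small function. This is only an illustration, assuming the alarm IDs have been copied by hand from the user interface; the function and variable names are hypothetical and are not part of Common Array Manager or SANtricity.

```python
# Alarm ID suffixes from the Step 1 list; the leading "xx" field is ignored.
DRIVE_FAULT_SUFFIXES = {
    "66.1013", "66.1017", "66.1021", "66.1023", "66.1024",
    "66.1025", "66.1026", "66.1064", "66.1076", "66.1119",
}
FAILED_VOLUME = "66.1017"


def alarm_suffix(alarm_id):
    """Strip the leading field: 'xx.66.1023' -> '66.1023'."""
    return alarm_id.split(".", 1)[-1]


def step1_next_action(alarm_ids):
    """Apply the Step 1 branching to alarm IDs read from the user interface."""
    relevant = [alarm_suffix(a) for a in alarm_ids]
    relevant = [s for s in relevant if s in DRIVE_FAULT_SUFFIXES]
    if FAILED_VOLUME in relevant:
        return "contact Oracle Support"  # possible data loss
    if len(relevant) > 1:
        return "Step 2"
    if len(relevant) == 1:
        return "Step 3"
    return "Step 7"  # none of the listed faults is present


print(step1_next_action(["xx.66.1013", "xx.66.1021"]))  # Step 2
```

The Failed Volume check comes first so that it takes priority over the fault count, matching the order of the rules above.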

2. Verify whether the critical faults are for the same drive.

There may be three or four faults for the same drive depending on firmware revisions, which is normal.

Compare the following list of faults to the list of faults that you have:

Impending Drive Failure Risk Medium
Impending Drive Failure Risk Low
Impending Drive Failure Risk High
Degraded Volume detected or Degraded Volume
Hot Spare in Use
Drive Bypassed, reason unknown, or Drive Bypassed
Channel Path xx is Degraded for Drive/REC_PATH_DEGRADED
Drive Bypassed, Single Port

Each fault type listed above should appear at most once, and every fault should report the same drive location.

For the faults in the array, look at the details of the fault and determine the failed drive.

If the faults are for the same disk drive, the drive should be replaced. Continue to Step 5.
If the faults are for different disk drives, then you have multiple drive faults on your array. This requires further analysis. Continue to Step 7.
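
The same-drive check in Step 2 amounts to collecting the drive location from each fault and counting the distinct values. A minimal sketch, assuming you have noted each fault as a (fault name, drive location) pair from the fault details; the function name and the pair layout are hypothetical.

```python
def step2_next_action(faults):
    """faults: (fault_name, drive_location) pairs copied from the fault details."""
    drives = {location.strip() for _name, location in faults}
    if len(drives) == 1:
        return "Step 5"  # every fault points at one drive
    return "Step 7"      # more than one drive involved: further analysis


faults = [
    ("Degraded Volume", "Tray.01.Drive.02"),
    ("Hot Spare in Use", "Tray.01.Drive.02"),
]
print(step2_next_action(faults))  # Step 5
```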

3. Verify the Critical Fault Seen by the Array.

Use the user interface to verify the list of critical faults, and the details of each fault. Verify that there is only a single disk drive failure on the array. For the purposes of this investigation, other critical alerts can be ignored for now, although you may want to review them after troubleshooting your drive fault.

Reference <Document: 1021057.1> Verify Sun StorageTek[TM] 2500 and Sun Storage[TM] 6000 Critical Faults via the User Interface.

If there is a single Hot Spare in Use fault, but you have already replaced your drive and it has not copied back from its Global Hot Spare, go to Step 6.
If there are one or more Impending Failure faults, reference <Document: 1103184.1> Troubleshooting Sun Storage[TM] Array Impending Drive Failures.
If there is a single critical fault of Failed Volume - Drive Failure or Failed Volume detected, go to Step 7.
If there is a single critical fault listed as Failed Unassigned Drive or Hot Spare, or as Drive Tray.XX.Drive.YY failed, a drive failure occurred due to the array's periodic media scan. Go to Step 5.
If there is a single critical fault for Drive Bypassed, Single Port, or Channel Path xx is Degraded for Drive, you will need to manually fail the drive prior to replacement. Go to Step 5.
If there is a single critical fault listed as Degraded Volume detected or Degraded Volume, the data on the volume is accessible, but the volume has sustained one or more drive faults. Go to Step 4.
If there is a single critical fault that states Hot Spare In Use, go to Step 5.
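
The Step 3 routing can be written as a lookup keyed on the symbolic fault names from the Step 1 list. This is a sketch only: the function name, the awaiting_copy_back flag, and the returned strings are hypothetical. Drive Bypassed, reason unknown (DRIVE_BYPASSED_CAUSE_UNKNOWN) has no rule in Step 3, so the sketch reports it as not covered rather than guessing.

```python
IMPENDING = {
    "IMPENDING_DRIVE_FAILURE_RISK_LOW",
    "IMPENDING_DRIVE_FAILURE_RISK_MED",
    "IMPENDING_DRIVE_FAILURE_RISK_HIGH",
}

STEP3_ROUTES = {
    "FAILED_VOLUME": "Step 7",
    "FAILED_DRIVE": "Step 5",
    "DRIVE_BYPASSED_SINGLE_PORT": "manually fail the drive, then Step 5",
    "PATH_DEGRADED": "manually fail the drive, then Step 5",
    "DEGRADED_VOLUME": "Step 4",
    "HOT_SPARE_IN_USE": "Step 5",
}


def step3_next_action(fault_name, awaiting_copy_back=False):
    """Route a single critical fault according to the Step 3 rules.

    awaiting_copy_back: True when the drive was already replaced but the
    data has not copied back from the hot spare.
    """
    if fault_name == "HOT_SPARE_IN_USE" and awaiting_copy_back:
        return "Step 6"
    if fault_name in IMPENDING:
        return "Document 1103184.1"
    return STEP3_ROUTES.get(fault_name, "not covered by Step 3")


print(step3_next_action("DEGRADED_VOLUME"))  # Step 4
```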

4. Verify that there are no other assigned drives failed in the Degraded Volume Fault.

There will be ONE degraded volume fault for each VDisk or Volume Group affected by the drive failure. That may mean that for RAID 1 and RAID 6 configurations, multiple drives can be listed in the fault. We need to make sure that only one drive has failed in your VDisk/Volume Group.

If there is a Degraded Volume detected or Degraded Volume fault, but more than one drive is listed in the fault, contact Oracle, as you may have a more pervasive issue causing drive faults.
If there is a Degraded Volume detected or Degraded Volume fault, and only a single drive is listed in the fault, go to Step 5.

5. Identify Drive Model for Alert 1300555.1.

For drive model Reference: <Document:1021060.1> Verify Sun Storage[TM] Array Drive Model Information via the User Interface.

If the model is a ST330055SSUN300G or ST330055FSUN300G, please reference Alert <Document: 1300555.1> Replacement of Drives with Mechanical Positioning Errors May Cause RAID Controllers Reset or Lockdown Unexpectedly, for instructions on how to handle these drives. Contact Oracle with a Support Collection (see Appendix).

6. Verify your firmware revision, and review against document 1164893.1.

If your drive has not copied back from Hot Spare, the reason may depend on the revision of firmware and the circumstances of why the drive was failed. Verify your array firmware through the user interface.
Then check this against <Document:1164893.1> Copy back not starting after replacing a faulty drive in a Sun StorageTek[TM] 2500; 6140; 6540; 6580; 6780 and Flexline 380.

If the document did not resolve your issue, contact Oracle with a Support Collection (see Appendix).

7. Review the event log for event type 1016 (Unrecoverable Media Error) within the last 2 weeks.

Use the user interface to see if there are any event types 0x1016 for Unrecoverable Media Errors on the drive(s) listed in the impending failure fault, in the last two weeks.

See the appendix at the end of the document for details on how to get these events.

If there are 1016 errors for the indicated drive, contact Oracle with a Support Collection (see Appendix).
If there are no 1016 errors, go to Step 8.
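
Steps 7 and 8 both limit the search to the last two weeks. The check can be sketched against the timestamp layout shown in the sample events in the Appendix (for example "Thu Feb 02 21:13:13 EST 2012"). The function names are hypothetical, and the sketch makes two simplifying assumptions: month names are in English, and the time zone field is ignored rather than converted.

```python
from datetime import datetime, timedelta


def parse_event_time(stamp):
    """Parse 'Thu Feb 02 21:13:13 EST 2012', ignoring weekday and time zone."""
    _weekday, month, day, clock, _zone, year = stamp.split()
    return datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")


def in_last_two_weeks(stamp, now):
    """True if the event time falls within the 14 days before `now`."""
    age = now - parse_event_time(stamp)
    return timedelta(0) <= age <= timedelta(days=14)


# An event from 2 February checked on 10 February is inside the window.
print(in_last_two_weeks("Thu Feb 02 21:13:13 EST 2012", datetime(2012, 2, 10)))  # True
```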

8. Review the event log for event type 100A (Check Condition) within the last 2 weeks.

Use the user interface to see if there are any event types 0x100A for Check Conditions on the drive(s) listed in the impending failure fault, in the last two weeks.

See the appendix at the end of the document for details on how to get these events.

If there are 100A events for the indicated drive with event specific code 03/xx/xx, the drive should be replaced, as this is a Drive Medium error.

If there are 100A events for the indicated drive with event specific code 04/xx/xx, the drive should be replaced, as this is a Drive Hardware failure.

If there are 100A events for the indicated drive with event specific code B/88/3, the drive should be replaced, as this is a transmission failure between the drive interposer and the drive.

If there are 100A events for the indicated drive with event specific code 1/nn/nn, and there are 30 or more of these for the same drive during the two-week period, the drive should be replaced due to recoverable errors.

If there are no 100A events for the indicated drive then there are no problems with the drives in your system.
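
The Step 8 rules can be applied mechanically once the event specific codes for one drive have been collected. The sketch below is illustrative only: the function name and return strings are hypothetical, and it assumes the three fields of each code are hexadecimal (as the B in B/88/3 suggests). Codes that match none of the rules are reported as such, since this document gives no guidance for them.

```python
def step8_verdict(codes):
    """Apply the Step 8 rules to one drive's 100A event specific codes.

    codes: strings such as '4/80/87' from that drive's 100A events in the
    last two weeks. Each field is parsed as hexadecimal (an assumption).
    """
    if not codes:
        return "no 100A events"
    recovered = 0
    for code in codes:
        key, asc, ascq = (int(field, 16) for field in code.split("/"))
        if key == 0x3:
            return "replace: drive medium error"
        if key == 0x4:
            return "replace: drive hardware failure"
        if (key, asc, ascq) == (0xB, 0x88, 0x3):
            return "replace: drive-to-interposer transmission failure"
        if key == 0x1:
            recovered += 1  # counted against the threshold of 30
    if recovered >= 30:
        return "replace: recoverable errors"
    return "no Step 8 rule matched"


print(step8_verdict(["4/80/87"]))  # replace: drive hardware failure
```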

Appendix:

Collecting Support Information

Reference Document:1002514.1 Collecting Support Data for Arrays Using Sun StorageTek[TM] Common Array Manager.
Reference Document:1014074.1 Collecting Support Data for Arrays Using Sun StorageTek[TM] SANtricity Storage Manager.

How to get event lists from the user interface:

Sun StorageTek Common Array Manager:

Browser:

Expand Storage Arrays in the left menu pane.
Expand your storage array name in the left menu pane.
Expand Troubleshooting in the left menu pane.
Click on Events.
In the right pane, click on the -|-> icon. Hovering over it displays the label Advanced Filter.
Set Event to Log Events.
Set Event Type to Component.
Set Read the last X Kbytes From Log File to 100.
Set String Filter to 0x100A or 0x1016 (run a separate search for each).
Click on the Details of any alarm that is shown.

SSCS:

sscs list -d <array_name> -t LogEvent -f 0x100A event
sscs list -d <array_name> -t LogEvent -f 0x1016 event

Severity    : Minor
Date        : Thu Feb 02 21:13:13 EST 2012
Device      : myarray (Sun Storage 6780)
Component   : Tray.01.Drive.02
Type        : LogEvent
Information : Drive returned CHECK CONDITION (4/80/87)
Event Code  : 80.20.491
Aggregated  : No
Description : Feb 02 21:13:13 pts-6780-bur Tray.01.Drive.02: [ID 0x100A] NOTICE:
              Drive returned CHECK CONDITION (4/80/87)

Probable Cause :
The array firmware has logged an informational event.

Recommended Action :
No action required.

The Event Specific Code in the example above is 4/80/87.
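
If the sscs output has been saved to a file, the component and the Event Specific Code can be pulled out with two regular expressions. This is a minimal sketch that assumes the record layout shown in the example above; the function name is hypothetical, and the sample is a shortened copy of that example.

```python
import re

# Field layout taken from the sample sscs record; other layouts are not handled.
COMPONENT = re.compile(r"^Component\s*:\s*(\S+)", re.MULTILINE)
CHECK_CONDITION = re.compile(
    r"CHECK CONDITION \(([0-9A-Fa-f]+/[0-9A-Fa-f]+/[0-9A-Fa-f]+)\)"
)


def parse_sscs_event(record):
    """Return (component, event_specific_code); a missing field is None."""
    component = COMPONENT.search(record)
    code = CHECK_CONDITION.search(record)
    return (
        component.group(1) if component else None,
        code.group(1) if code else None,
    )


SAMPLE = """\
Component   : Tray.01.Drive.02
Type        : LogEvent
Information : Drive returned CHECK CONDITION (4/80/87)
"""

print(parse_sscs_event(SAMPLE))  # ('Tray.01.Drive.02', '4/80/87')
```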

SANtricity Storage Manager:

GUI:

Launch SANtricity.
Double Click on your array name to open the Array Management Window.
Click on the Advanced Menu.
Click on the Troubleshooting Sub-Menu.
Click on View Event Log.
Un-Check View Only Critical Events.
Click on the Component Type field header to sort the events.
Look for Drive in the list of events.
For any Drive event, highlight it, and check the View Details box.
Get the value of the Event type and Event Specific Details field for each Drive event.

SMcli:

Get the list of events by saving off the event log:

SMcli -n array_name -c "save storageArray allEvents file=\"some/file/path/log.txt\";"

Open a text viewing application to look at the individual events.
Get the value of the Event type and Event Specific Details field for each Drive event.

Date/Time: Sun Feb 26 23:56:43 EST 2012
Sequence number: 164
Event type: 100A
Event category: Error
Priority: Informational
Description: Drive returned CHECK CONDITION
Event specific codes: 2/4/2
Component type: Drive
Component location: Tray.02.Drive.11
Logged by: Controller in slot A
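
Instead of reading the saved log by eye, the Drive events can be filtered with a short script. The sketch below assumes the file consists of "Field: value" lines like the record above, with a blank line between events; that separator is an assumption about the file layout, and the function names are hypothetical.

```python
def parse_event_log(text):
    """Split a saved event log into one dict of fields per event."""
    events, current = [], {}
    for line in text.splitlines():
        if not line.strip():  # blank line ends the current event (assumed)
            if current:
                events.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        events.append(current)
    return events


def drive_events(events, types=("100A", "1016")):
    """Keep Drive events whose Event type is one of `types`."""
    return [
        e for e in events
        if e.get("Component type") == "Drive" and e.get("Event type") in types
    ]


SAMPLE = """\
Date/Time: Sun Feb 26 23:56:43 EST 2012
Sequence number: 164
Event type: 100A
Description: Drive returned CHECK CONDITION
Event specific codes: 2/4/2
Component type: Drive
Component location: Tray.02.Drive.11
"""

for event in drive_events(parse_event_log(SAMPLE)):
    print(event["Component location"], event["Event specific codes"])
# Tray.02.Drive.11 2/4/2
```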

At this point, if you have validated that each troubleshooting step above is true for your environment, and the issue still exists, further troubleshooting is required. Please contact Oracle Support.

References

<NOTE:1002514.1> - Collecting Sun Storage Common Array Manager Array Support Data
<NOTE:1021057.1> - How to verify Sun StorageTek[TM] 2500 and Sun Storage[TM] 6000 and J4000 Critical Faults via the User Interface
<NOTE:1021060.1> - Verify Sun Storage[TM] Array Drive Model Information via the User Interface
<NOTE:1103184.1> - Troubleshooting Sun Storage[TM] Array Impending Drive Failures
<NOTE:1164893.1> - Copy Back not Starting After Replacing a Faulty Drive in a Sun Storage 2500, 6140, 6540, 6580, 6780 and Flexline 380
<NOTE:1300555.1> - Replacement of Drives with Mechanical Positioning Errors May Cause RAID Controllers Reset or Lockdown Unexpectedly
