Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Sun Alert Sure Solution

1000018.1: Recovery Behavior From Fatal Drive Failure May Lead to Data Integrity Issues
Previously Published As: 200021
Product: Sun StorageTek 3310 SCSI Array, Sun StorageTek 3510 FC Array, Sun StorageTek 3320 SCSI Array, Sun StorageTek 3511 SATA Array
Bug Id: <SUNBUG: 5095223>
Date of Workaround Release: 12-JAN-2006
Date of Resolved Release: 15-JUN-2006

Impact

The "Sun StorEdge 3000 Family Installation, Operation, and Service Manual - Sun StorEdge 3510 FC Array" states in Section 8.5, "Recovering From Fatal Drive Failure," that you can recover from a "Status: FATAL FAIL" condition (two or more failed drives) by simply resetting the controller or powering off the array. This behavior can lead to data integrity issues.

Due to the current internal resource handling, all cached data (including uncommitted write data) for a logical drive is discarded if and when the logical drive enters the "FATAL FAIL" state. In the event of a fatally failed logical drive (two or more drive failures in a RAID 3 or RAID 5 set), the current recovery process is to reset the controller, which causes one of the failed drives to be included back into the logical drive and changes the logical drive state to "Degraded". If a global spare is assigned, the logical drive will rebuild; if no global spare is assigned, the user can assign a spare and rebuild the logical drive. If there were incomplete write operations at the time of the drive failure, this procedure could create inconsistent data.

The "Sun StorEdge 3000 Family Installation, Operation, and Service Manual" (part number 816-7300-17) can be found on docs.sun.com at http://docs.sun.com/app/docs?q=7300-17

Note: Please also see the related Sun Alert 102098, "Insufficient Information for Recovery From Double Drive Failure for Sun StorEdge 33x0/35xx Arrays".
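To make the failure sequence above concrete, the following is a minimal, purely illustrative Python sketch (hypothetical class and method names; it is not StorEdge controller firmware). It only models the transitions described in this alert: one member failure leaves a RAID 3/5 logical drive "Degraded", a second failure marks it "FATAL FAIL" and discards uncommitted cached writes, and a controller reset re-includes one failed drive and returns the logical drive to "Degraded" with potentially stale on-disk data.

    # Illustrative toy model only; all names are hypothetical.
    from dataclasses import dataclass, field
    from enum import Enum

    class LdState(Enum):
        GOOD = "GOOD"
        DEGRADED = "DEGRADED"
        FATAL_FAIL = "FATAL FAIL"

    @dataclass
    class LogicalDrive:
        """A RAID-3/5 logical drive that tolerates exactly one failed member."""
        members: int
        failed: set = field(default_factory=set)
        write_cache: list = field(default_factory=list)   # uncommitted host writes
        state: LdState = LdState.GOOD

        def cache_write(self, block: int, data: bytes) -> None:
            self.write_cache.append((block, data))

        def fail_drive(self, drive: int) -> None:
            self.failed.add(drive)
            if len(self.failed) == 1:
                self.state = LdState.DEGRADED      # parity still covers the loss
            else:
                self.state = LdState.FATAL_FAIL    # two or more failures
                self.write_cache.clear()           # uncommitted cached writes are discarded

        def reset_controller(self) -> None:
            # Documented recovery behavior: a reset re-includes one failed drive,
            # returning the LD to "Degraded" -- but the writes discarded at
            # FATAL FAIL time are gone, leaving stale on-disk data.
            if self.state is LdState.FATAL_FAIL and self.failed:
                self.failed.pop()
                self.state = LdState.DEGRADED

    ld = LogicalDrive(members=5)
    ld.cache_write(100, b"new data")    # still uncommitted in cache
    ld.fail_drive(2)                    # -> DEGRADED
    ld.fail_drive(4)                    # -> FATAL FAIL, cache discarded
    ld.reset_controller()               # -> DEGRADED again, block 100 is stale
    print(ld.state.value, "uncommitted writes lost:", len(ld.write_cache) == 0)

Running the example prints the final "Degraded" state and confirms the cached writes were lost, which is exactly the inconsistency window this alert describes.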
Contributing Factors

This issue can occur on the following platforms:

    Sun StorageTek 3310 SCSI Array
    Sun StorageTek 3510 FC Array
    Sun StorageTek 3320 SCSI Array
    Sun StorageTek 3511 SATA Array

for all current releases of controller firmware.

Symptoms

The first drive to fail in a logical drive will be persistently marked as BAD, while subsequent drives that fail (before the first drive has been fully reconstructed) will be temporarily marked as MISSING. If the failed drives are members of the same parity group, the owning logical drive is marked "FATAL FAIL", and any existing uncommitted write data is discarded in order to recover data cache resources.

Upon reset, the array will attempt to recover the MISSING drives automatically and, if possible, will restore the logical drive to "Degraded" status. The logical drive is restored, if possible, whether or not any uncommitted write data was discarded.

The exposure window is mainly centered on whole-site power outages that occur after the secondary drive failure, which would allow user applications to be restarted automatically in conjunction with an array reset. This situation increases the probability that the server/application might ignore the logical drive going away and then returning with stale data.

Workaround

The risk of data loss can be minimized by ensuring that an unused hot spare is available and/or that the first failed drive is replaced as soon as possible. This ensures that the rebuild process can start and finish as soon as possible, and reduces the exposure window as much as possible.

Unmapping the logical drive while it is in the "FATAL FAIL" state should prevent any hosts from attempting to use the logical drive automatically after a reset.

If the logical drive is recovered from a "FATAL FAIL" state, it is recommended that the application(s) using the logical drive run the appropriate data integrity verification utility (e.g. fsck, chkdsk) before making use of the logical drive.

Note: A clean filesystem check only guarantees the filesystem structure and does NOT guarantee user data validity. The proper use of data integrity features offered by modern databases, file systems and other applications will help ensure that user applications catch any potential data loss and can take higher-level recovery actions, thereby minimizing the effects.
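The recommended post-recovery check can be scripted. The following is a hedged Python sketch, not part of this alert: the device path is a hypothetical example and the surrounding policy is an assumption. It simply runs fsck in its no-write mode ("-n") against the filesystem on the recovered logical drive and refuses to proceed if problems are reported; as noted above, a clean structural check still does not prove user data validity.

    # Hedged sketch: run a read-only fsck pass before returning a recovered
    # logical drive to service. Device path and policy are assumptions;
    # adjust fsck options for your platform. Requires root, filesystem unmounted.
    import subprocess
    import sys

    DEVICE = "/dev/dsk/c1t0d0s6"   # hypothetical path to the recovered LUN's slice

    def filesystem_is_clean(device: str) -> bool:
        # "fsck -n" answers "no" to all repair prompts, so it checks without
        # modifying; a non-zero exit status means the filesystem needs attention.
        result = subprocess.run(["fsck", "-n", device],
                                capture_output=True, text=True)
        print(result.stdout)
        return result.returncode == 0

    if __name__ == "__main__":
        if not filesystem_is_clean(DEVICE):
            sys.exit("Filesystem check failed: do not remount; "
                     "run a full repair or restore from backup first.")
        print("Structure check passed; this does not prove user data validity.")

This only automates the structural check; application-level verification (database consistency checks, checksums, etc.) is still needed to detect stale data.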
Resolution

This issue is addressed on the platforms listed above by updated controller firmware, delivered in the patches listed under References below.

Modification History

Date: 25-APR-2006
Date: 15-JUN-2006
References

<SUNPATCH: 113722-15>
<SUNPATCH: 113723-15>
<SUNPATCH: 113724-09>
<SUNPATCH: 113730-01>

Previously Published As: 102126

Internal Comments

The patches for these firmware releases were developed across all products (all arrays impacted by these issues). Therefore, some of the patch READMEs may not reflect the Bug ID listed in this Sun Alert, but the firmware patch listed for each product does in fact remedy the issue for the platforms specified. With the fix, recovery of the logical drive becomes a manual process.

The following Sun Alerts have information about other known issues for the 3000 series products:

102011 - Sun StorEdge 33x0/3510 Arrays May Report a Higher Incidence of Drive Failures With Firmware 4.1x SMART Feature Enabled
102067 - Sun Cluster 3.x Nodes May Panic Upon Controller Failure/Replacement Within Sun StorEdge 3510/3511 Arrays
102086 - Failed Controller Condition May Cause Data Integrity Issues
102098 - Insufficient Information for Recovery From Double Drive Failure for Sun StorEdge 33x0/35xx Arrays
102126 - Recovery Behavior From Fatal Drive Failure May Lead to Data Integrity Issues
102127 - Performance Degradation Reported in Controller Firmware Releases 4.1x on Sun StorEdge 3310/351x Arrays for All RAID Types and Certain Patterns of I/O
102128 - Data Inconsistencies May Occur When Persistent SCSI Parity Errors are Generated Between the Host and the SE33x0 Array
102129 - Disks May be Marked as Bad Without Explanation After "Drive Failure," "Media Scan Failed" or "Clone Failed" Events

Note: One or more of the above Sun Alerts may require a Sun Spectrum Support Contract to log in to a SunSolve Online account.

Bug 5095223 indicates that this is invalid behavior and that recovery needs to be a manual process rather than an automatic one. Disk failed-state information is not persistent across a power cycle when a logical drive fatal fail occurs. This is currently by design, to allow a user to potentially recover from a spurious event that caused multiple drive failures; this can be especially useful in multi-enclosure configurations where cabling errors can occur. Any single drive failure is recorded in the private region of each disk drive that is a member of a logical drive; multiple drive failures are not recorded, allowing a user to possibly recover the failed logical drive with a simple reboot of the controller. Although this behavior can result in potential data loss as described, it can also save the user, or field personnel, from a full rebuild and restore due to common cabling mistakes.
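For illustration only, the persistence behavior described above can be modeled with a small Python sketch (all names are hypothetical; this is not controller firmware). The first member failure is recorded persistently in each member's private region, while subsequent failures are held only in volatile state, so a power cycle clears them and lets the controller re-admit those drives automatically.

    # Toy model of the documented persistence behavior; hypothetical names only.
    from dataclasses import dataclass, field

    @dataclass
    class MemberDisk:
        index: int
        private_region_failed: bool = False   # persistent "BAD" mark
        missing: bool = False                 # volatile "MISSING" mark

    @dataclass
    class ToyController:
        disks: list = field(default_factory=list)

        def record_failure(self, index: int) -> None:
            already_failed = any(d.private_region_failed for d in self.disks)
            disk = self.disks[index]
            if not already_failed:
                disk.private_region_failed = True   # persisted: survives a power cycle
            else:
                disk.missing = True                 # not persisted: cleared by reboot

        def power_cycle(self) -> None:
            for d in self.disks:
                d.missing = False                   # volatile state is lost on reset

    ctrl = ToyController(disks=[MemberDisk(i) for i in range(5)])
    ctrl.record_failure(2)      # first failure  -> persistent BAD
    ctrl.record_failure(4)      # second failure -> volatile MISSING (FATAL FAIL)
    ctrl.power_cycle()          # MISSING mark gone; drive 4 can be re-admitted
    print([(d.index, d.private_region_failed, d.missing) for d in ctrl.disks])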
Internal Contributor/Submitter: [email protected]
Internal Eng Business Unit Group: NWS (Network Storage)
Internal Eng Responsible Engineer: [email protected]
Internal Services Knowledge Engineer: [email protected]
Internal Escalation IDs: 1-71137216, 1-7813973
Internal Resolution Patches: 113723-15, 113722-15, 113730-01, 113724-09
Internal Sun Alert Kasp Legacy ID: 102126

Internal Sun Alert & FAB Admin Info:
Critical Category: Data Loss
Significant Change Date: 2006-01-12, 2006-06-15
Avoidance: Patch, Workaround
Responsible Manager: [email protected]
Original Admin Info:
[WF 12-Jun-2006, Dave M: updated for 4.15 FW, rerelease when patch is published to SS]
[WF 25-Apr-2006, Dave M: update for patch release, republished]
[WF 14-Apr-2006, Dave M: updated in anticipation of FW 4.15F release, per NWS and PTS engs]
[WF 12-Jan-2006, Dave M: ready for release]
[WF 05-Jan-2006, Dave M: review completed, Chessin changes added, all docs in this series on hold for Exec approval pending 1/12]
[WF 04-Jan-2006, Dave M: final edits before sending to review]
[WF 02-Jan-2006, Dave M: draft created]

Product_uuid:
3db30178-43d7-4d85-8bbe-551c33040f0d | Sun StorageTek 3310 SCSI Array
58553d0e-11f4-11d7-9b05-ad24fcfd42fa | Sun StorageTek 3510 FC Array
95288bce-56d3-11d8-9e3a-080020a9ed93 | Sun StorageTek 3320 SCSI Array
9fdbb196-73a6-11d8-9e3a-080020a9ed93 | Sun StorageTek 3511 SATA Array

Attachments: This solution has no attachment.