Sun StorEdge 3510 Arrays May Mark Disks as "bad" After Reporting Disk Errors

Asset ID:	1-77-1000370.1
Update Date:	2011-02-25
Keywords:

Solution Type Sun Alert Sure

Solution 1000370.1 : Sun StorEdge 3510 Arrays May Mark Disks as "bad" After Reporting Disk Errors

Related Items


Sun Storage 3510 FC Array

Related Categories


GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
 GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved
 GCS>Sun Microsystems>Sun Alert>Criteria Category>Data Loss

Previously Published As
200492

Product
Sun StorageTek 3510 FC Array

Bug Id
<SUNBUG: 6357118>

Date of Workaround Release
28-Apr-2006

Date of Resolved Release
28-Mar-2008

One or more disk drive(s) may become disabled and the logical drive may transition to a "Fatal Fail" status (see below for details).

1. Impact

One or more disk drive(s) may become disabled and the logical drive may transition to a "Fatal Fail" status. Cached data that has already been acknowledged to the host may still be waiting to be written to the logical drive. If this occurs, the pending write cache contents may be lost when the array is reset or power cycled.

If the array is running 4.15F firmware, "Cache purged" messages will be logged. For previous firmware versions, cache contents may be lost without notification.

2. Contributing Factors

This issue can occur on the following platform:

SPARC Platform

Sun StorEdge 3510 FC array

for all current releases of controller firmware.

3. Symptoms

If the described issue occurs, one or more disks may be disabled, perhaps in quick succession, especially under conditions of heavy I/O load. If running firmware 4.15F, there may be "0B/47" SCSI parity error messages in the event log. For earlier firmware versions, there are no specific error messages to identify this issue.
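As a rough illustration of the symptom check described above, the following Python sketch scans event-log text for the two indicators this alert mentions: "0B/47" SCSI parity errors and "Cache purged" messages. It assumes the event log has been saved as plain text with one event per line. The function name `scan_event_log` and the sample lines are hypothetical, and the actual event text produced by the array may differ.

```python
import re

# Indicators named in this alert. Matching is case-insensitive because the
# exact capitalisation of the logged text is an assumption.
INDICATORS = {
    "scsi_parity_0b47": re.compile(r"0B/47", re.IGNORECASE),
    "cache_purged": re.compile(r"cache(d data)? purged", re.IGNORECASE),
}

def scan_event_log(lines):
    """Return a dict mapping indicator name to the list of matching lines."""
    hits = {name: [] for name in INDICATORS}
    for line in lines:
        for name, pattern in INDICATORS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits

# Hypothetical sample lines for illustration only; real events look different.
sample = [
    "example event: SCSI parity error 0B/47",
    "example event: Cache purged",
    "example event: controller initialization completed",
]
result = scan_event_log(sample)
```

On firmware earlier than 4.15F an empty result proves nothing, since those versions log no specific message for this issue.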

4. Workaround

For array firmware 4.15F:

On Sun StorEdge 3510 FC arrays with firmware 4.15F, an array reset could clear this issue. After a proper array shutdown and reset, the transient error condition that caused disturbances in the disk drive loop may no longer be present. In this case the disks could participate in array operations, provided the disks are good and the error was transient in nature. The documented procedure can then be followed to force the logical drive to become available.

Note: Appropriate care should be taken to verify data consistency if the "cache purged" message was logged.

For additional details on recovering a logical drive from a "Fatal Fail" state, see the "Sun StorEdge 3000 Family Installation, Operation, and Service Manual", section 8.5 "Recovering From Fatal Drive Failure".

***IMPORTANT NOTE***

For array firmware prior to 4.15F:

The "cache purged" warning message is available only in firmware 4.15F. Therefore, for firmware versions prior to 4.15F, data consistency must be checked for any logical drive that has been recovered from a "Fatal Fail" state after an array shutdown and reset. If the cache was set to "write back" mode, pending write cache data may have been lost without any warning message.

Note: Array users should regularly monitor their arrays for messages in the "persistent event log" and take action to replace any faulty components.

5. Resolution

There are no further updates planned for this Sun Alert document. If you need additional assistance regarding this issue, please contact Sun Services.

This Sun Alert notification is being provided to you on an "AS IS" basis. This Sun Alert notification may contain information provided by third parties. The issues described in this Sun Alert notification may or may not impact your system(s). Sun makes no representations, warranties, or guarantees as to the information contained herein. ANY AND ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT, ARE HEREBY DISCLAIMED. BY ACCESSING THIS DOCUMENT YOU ACKNOWLEDGE THAT SUN SHALL IN NO EVENT BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES THAT ARISE OUT OF YOUR USE OR FAILURE TO USE THE INFORMATION CONTAINED HEREIN. This Sun Alert notification contains Sun proprietary and confidential information. It is being provided to you pursuant to the provisions of your agreement to purchase services from Sun, or, if you do not have such an agreement, the Sun.com Terms of Use. This Sun Alert notification may only be used for the purposes contemplated by these agreements.

Copyright 2000-2008 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, CA 95054 U.S.A. All rights reserved.

Modification History
28-Mar-2008: Resolved

Previously Published As
102329

Internal Comments

The Sun StorEdge 3510 array uses dual FC loops to communicate with its disks. If any loop disturbance is observed, including low signal quality or error conditions on the disks that disrupt loop transmission, the disks will report a SCSI parity error (0B/47) condition. Current array firmware error recovery does not use the alternate path to the disk drive: the current algorithm disables the drive upon an error condition on one path (e.g., SCSI parity error conditions on one path alone) without trying the alternate path. Current array firmware needs improvement in drive error handling. The array development group is working on the best possible approach to address this issue.
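The difference between disabling a drive on a single-path error and first trying the alternate path can be sketched with a toy Python model. This is not the array firmware and not a description of any planned fix; the class `Drive`, its fields, and both function names are hypothetical and exist only to make the reasoning above concrete.

```python
from dataclasses import dataclass

@dataclass
class Drive:
    name: str
    loop_a_ok: bool  # True if I/O over the first FC loop succeeds
    loop_b_ok: bool  # True if I/O over the second FC loop succeeds
    enabled: bool = True

def handle_io_current(drive):
    """Behaviour described above: an error on one path disables the drive."""
    if not drive.loop_a_ok:
        drive.enabled = False
    return drive.enabled

def handle_io_with_alternate_path(drive):
    """Hypothetical alternative: disable only if both loops report errors."""
    if not (drive.loop_a_ok or drive.loop_b_ok):
        drive.enabled = False
    return drive.enabled

# A drive whose first loop is disturbed but whose second loop is healthy.
d1 = Drive("disk-1", loop_a_ok=False, loop_b_ok=True)
d2 = Drive("disk-2", loop_a_ok=False, loop_b_ok=True)
disabled_now = not handle_io_current(d1)
kept_with_retry = handle_io_with_alternate_path(d2)
```

In this model a transient disturbance on one loop removes a healthy drive under the first policy, while the second policy keeps it enabled.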

Notes on recovery options and procedure:

For array firmware 4.15F:

1. Depending on the type of logical drive (e.g., RAID 5), when one or more drives are disabled, the logical drive could go into a "Fatal Fail" state. This means the logical drive is not usable because it has crossed its failure tolerance limit (e.g., in RAID 5, more than one disk failure results in the array not being able to provide access to data).

It is possible that cached data which has already been acknowledged to the host as received is still waiting to be written to this logical drive. In this case, "cache purged" messages will be recorded to notify the user that the cached data belonging to the logical drive will be discarded when the system is reset or power cycled.

2. An array reset could clear the issue. After a proper array shutdown and reset, the transient error condition that caused disturbances in the disk drive loop may no longer be present. In this case the disks could participate in array operation, provided the disks are good and the error was transient in nature. The user can then follow the documented procedure to force the logical drive to become available. Appropriate care should be taken to verify data consistency if the "cache purged" message was logged.

In summary, an array with 4.15F firmware behaves as follows when a logical drive goes into a "Fatal Fail" state:

It generates a "LD Fatal Fail" event when a logical drive goes into a "Fatal Fail" state.

It generates a "Cached Data Purged" event whenever cache data is discarded due to a "LD Fatal Fail" condition.

It saves the "Fatal Fail" status across a power cycle.

A procedure is available to clear the logical drive "Fatal Fail" state and recover the logical drive to a "Degraded" state.
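The behavior summarized above can be pictured as a small state model. The Python sketch below is a simplified illustration under several assumptions: the class `LogicalDrive`, its method names, and the "Good" status string are hypothetical. The purge event is recorded when the drive enters "Fatal Fail" and the data is dropped at the power cycle, which is one reading of the notes above. Only the status is tracked on recovery. The real firmware logic is not documented here.

```python
class LogicalDrive:
    def __init__(self, tolerated_failures, pending_cache_writes=0):
        self.tolerated_failures = tolerated_failures  # 1 for RAID 5
        self.failed_drives = 0
        self.pending_cache_writes = pending_cache_writes
        self.status = "Good"  # assumed name for the healthy state
        self.events = []

    def disable_drive(self):
        """Record one more disabled member drive and update the status."""
        self.failed_drives += 1
        if self.failed_drives <= self.tolerated_failures:
            self.status = "Degraded"
        elif self.status != "Fatal Fail":
            self.status = "Fatal Fail"
            self.events.append("LD Fatal Fail")
            if self.pending_cache_writes:
                # Notify that cached data for this drive will be discarded.
                self.events.append("Cached Data Purged")

    def power_cycle(self):
        """Fatal Fail status survives; pending cache for the drive is lost."""
        if self.status == "Fatal Fail":
            self.pending_cache_writes = 0

    def force_recover(self):
        """Stand-in for the documented recovery procedure."""
        if self.status == "Fatal Fail":
            self.status = "Degraded"

ld = LogicalDrive(tolerated_failures=1, pending_cache_writes=3)
ld.disable_drive()   # first failure: RAID 5 still serves data
ld.disable_drive()   # second failure: tolerance exceeded
ld.power_cycle()
status_after_reset = ld.status
ld.force_recover()
```

Because the pending writes are gone after the reset, data consistency still has to be verified once the drive is back in the "Degraded" state.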

Internal Contributor/submitter
[email protected]

Internal Eng Business Unit Group
NWS (Network Storage)

Internal Eng Responsible Engineer
[email protected]

Internal Services Knowledge Engineer
[email protected]

Internal Sun Alert Kasp Legacy ID
102329

Internal Sun Alert & FAB Admin Info
Critical Category: Data Loss, Availability ==> Severe
Significant Change Date: 2006-04-28
Avoidance: Workaround
Responsible Manager: [email protected]
Original Admin Info: [WF 28-Apr-2006, Jeff Folla: Sent for release.]

[WF 27-Apr-2006, Jeff Folla: Sent for review.]

[WF 26-Apr-2006, Jeff Folla: Sent to submitter and responsible engineer for review.]

Product_uuid
58553d0e-11f4-11d7-9b05-ad24fcfd42fa|Sun StorageTek 3510 FC Array
