Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1006649.1
Update Date:2011-05-19
Keywords:

Solution Type  Problem Resolution Sure

Solution  1006649.1 :   Sun StorEdge[TM] 3000 Arrays: 3.2x and 4.x Firmware Differences in Handling Media Errors  


Related Items
  • Sun Storage 3510 FC Array
  • Sun Storage 3310 Array
  • Sun Storage 3511 SATA Array
  • Sun Storage 3320 SCSI Array
Related Categories
  • GCS>Sun Microsystems>Storage - Disk>Modular Disk - 3xxx Arrays

Previously Published As
209273
This document clarifies how the firmware behaves when it encounters a bad block on a disk, which is also known as a media error or an "Unrecoverable Read Error".

Applies to:

Sun Storage 3310 Array
Sun Storage 3320 SCSI Array
Sun Storage 3510 FC Array
Sun Storage 3511 SATA Array
All Platforms

Symptoms



The following event is logged:

[1113]: StorEdge Array SN#xxxx CH2 ID10: SCSI Drive ALERT: bad block encountered (02h, 03h,11/00)

This applies to all products in the Sun StorEdge[TM] 3000 family.
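The parenthesised values in the event above can be decoded mechanically. The following is a minimal Python sketch, not a Sun tool: the function name `decode_drive_alert` and the regular expression are illustrative assumptions based only on the single sample line shown here. The mappings used are the standard SCSI ones (sense key 03h is MEDIUM ERROR, additional sense 11/00 is UNRECOVERED READ ERROR). Reading the first field (02h) as the SCSI status byte is an assumption.

```python
import re

# Standard SCSI values relevant to this event; other values are passed through as hex.
SENSE_KEYS = {0x03: "MEDIUM ERROR"}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}

# Pattern derived from the one sample event line in this document (assumption).
PATTERN = re.compile(
    r"CH(?P<ch>\d+) ID(?P<id>\d+): SCSI Drive ALERT: bad block encountered "
    r"\((?P<status>[0-9A-Fa-f]{2})h,\s*(?P<key>[0-9A-Fa-f]{2})h,\s*"
    r"(?P<asc>[0-9A-Fa-f]{2})/(?P<ascq>[0-9A-Fa-f]{2})\)"
)


def decode_drive_alert(line):
    """Return the decoded fields of a drive bad-block alert, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    key = int(m["key"], 16)
    asc, ascq = int(m["asc"], 16), int(m["ascq"], 16)
    return {
        "channel": int(m["ch"]),
        "target_id": int(m["id"]),
        "status": int(m["status"], 16),  # assumed to be the SCSI status byte
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "additional_sense": ASC_ASCQ.get((asc, ascq), f"{asc:02x}/{ascq:02x}"),
    }
```

Unlike the Logical Drive event discussed later, this event names a channel and target ID, so it identifies one physical member drive.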

Changes

As drive capacities grow with demand, the density of the stored data also increases, so there is a greater chance of encountering bad blocks. Array vendors each handle this situation in their own way.


This document applies to redundant RAID implementations, primarily RAID 5, and describes how the firmware handles bad blocks on the drives.

Consider the following scenario:


On a RAID 5 Logical drive:

1. One drive fails.
2. This triggers the hot spare, and a rebuild starts.
3. The rebuild finds a media error on another member drive.


Cause

For StorEdge[TM] 3000 Arrays with 3.2x Firmware:

The first time a bad block is encountered on a member disk while a rebuild is in progress, the rebuild fails. If the serial/telnet menu is in use when this happens, the firmware prompts us to continue the rebuild even though there is a bad block. If we answer yes, the rebuild continues to completion, provided there are no other errors. For the block that has the "unrecoverable media error", the firmware zeroes out the ECC of that block, writes a special pattern there, and then continues the rebuild until it completes.

For StorEdge[TM] 3000 Arrays with 4.x Firmware:

The firmware automatically continues the rebuild when a bad block is encountered on a member drive while the rebuild is running. On 4.x firmware, this "specially marked bad sector of the individual disk" also represents a "Logical Drive Bad Block", which is reported when the host next tries to access that area of the Logical Drive.

If the host tries to read this block, the following event is logged:

LG:2 NOTIFY:Logical Drive BAD Block Encountered 000000200.

Notice that no specific disk is mentioned, only the Logical Drive that contains that disk. To recover from this, the host has to issue a write to that area. If there is a file system on this logical drive, one option is to run fsck and see if this works. If there is no file system, we should be able to locate the Logical Drive Bad Block via a dd to /dev/null. After the file/block is located, take the appropriate recovery steps (i.e., recover from backup, re-write the data, etc.).
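Locating the bad block with a dd to /dev/null amounts to reading the logical drive sequentially and noting where the read fails. The same idea can be sketched in Python. This is a minimal, read-only sketch under stated assumptions: the function name `find_unreadable_blocks` is hypothetical, a 512-byte block size is assumed, a medium error is assumed to surface to the host as an EIO error, and the device size is assumed to be obtainable with lseek (which may not hold for raw devices on every platform).

```python
import errno
import os


def find_unreadable_blocks(path, block_size=512, pread=os.pread):
    """Read every block of `path` once and return the numbers of blocks that fail with EIO.

    `pread` is injectable only so the function can be exercised without a faulty disk.
    """
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)          # total size in bytes (assumption: supported)
        total_blocks = (size + block_size - 1) // block_size
        for block in range(total_blocks):
            try:
                pread(fd, block_size, block * block_size)
            except OSError as exc:
                if exc.errno != errno.EIO:           # anything else is not a media error
                    raise
                bad.append(block)                    # record the block and keep scanning
    finally:
        os.close(fd)
    return bad
```

Called with the path of the device that holds the logical drive, the function returns block numbers that can be compared with the address in the event log. It only reads, so it does not clear the bad block; that still requires a write from the host.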

Explanation of Controller Behavior:


For bad blocks encountered on a member drive while a rebuild is in progress, the controller erases the ECC bytes for that block, so any subsequent read results in an unrecoverable ECC error. The controller also writes a unique pattern in the block so that the firmware can identify it as a controller-generated bad block. Before this feature was implemented in 4.x, an unrecoverable media error on a surviving disk in an LD would result in a Rebuild Failure or require active intervention to allow the rebuild to continue past the bad block.
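The difference between stopping at the first bad block and continuing past it can be illustrated with a toy model. The sketch below is not the controller firmware and makes several simplifying assumptions: blocks are small integers, `BAD` stands for an unreadable block on a surviving member, the names `rebuild` and `continue_past_bad_blocks` are hypothetical, and marking a stripe simply means recording its number. It shows only why a stripe with an unreadable surviving block cannot be regenerated from XOR parity, and how a rebuild can still complete by flagging that stripe.

```python
from functools import reduce

BAD = None  # stands in for an unreadable block on a surviving member drive


def rebuild(survivors, continue_past_bad_blocks):
    """Regenerate the failed member of a toy RAID 5 set by XOR-ing the survivors.

    survivors: list of drives, each a list of int block contents (or BAD).
    Returns (rebuilt_blocks, marked_stripes).
    """
    rebuilt, marked = [], []
    for stripe, blocks in enumerate(zip(*survivors)):
        if any(b is BAD for b in blocks):
            if not continue_past_bad_blocks:
                # Models a rebuild that stops at the first bad block.
                raise RuntimeError(f"rebuild failed at stripe {stripe}")
            # Models continuing: the stripe cannot be regenerated, so flag it.
            rebuilt.append(BAD)
            marked.append(stripe)
            continue
        # XOR of all surviving data and parity blocks yields the missing block.
        rebuilt.append(reduce(lambda a, b: a ^ b, blocks))
    return rebuilt, marked
```

With `continue_past_bad_blocks=True`, the rebuild finishes and the flagged stripes play the role of the Logical Drive Bad Blocks that the host sees on its next read of that area.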

Solution

Case study:

As an example, consider the following events, which are taken from a customer case.

The customer is running 4.15F firmware on a StorEdge[TM] 3510, and the following messages are logged in the event log:

Wed Jul 5 14:26:13 2006
[Primary] Alert
LG:0 NOTIFY:Logical Drive BAD Block Encountered 0388FD300

Wed Jul 5 14:26:13 2006
[Primary] Alert
LG:0 NOTIFY:Logical Drive BAD Block Encountered 0388FD300

...

Notice that no specific drive is reporting the error, so this should NOT be confused with a media error on a particular drive. It is a bad block on the LD, and the host should also get a read error when it accesses this block. We can check this by running format->analyze->read on this LD:

analyze> read
Ready to analyze (won't harm SunOS). This takes a long time,
but is interruptable with CTRL-C. Continue? y

       pass 0
Medium error during read: block 948949760 (0x388fd300) (948949760)
ASC: 0x11 ASCQ: 0x0

Medium error during read: block 948949760 (0x388fd300) (948949760)
ASC: 0x11 ASCQ: 0x0

Note that the block number reported by format->analyze->read is the same as the block number reported by the 3510 in the event log. To recover from this, we need to find the file residing on this block and restore that file. If the application is a database, the DBA should be able to identify the table residing on this block, and only that table needs to be restored.


Note: The host needs to write to this block in order to make this block reusable.
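The match between the hexadecimal address in the event log and the decimal block number reported by format can be checked with a short conversion. This is a minimal Python sketch; the function name `parse_ld_bad_block` and the regular expression are illustrative assumptions derived from the event lines quoted in this document.

```python
import re

# Pattern based on the "Logical Drive BAD Block Encountered" events quoted above (assumption).
LD_EVENT = re.compile(
    r"LG:(?P<lg>\d+) NOTIFY:Logical Drive BAD Block Encountered (?P<block>[0-9A-Fa-f]+)"
)


def parse_ld_bad_block(line):
    """Return (logical_drive_number, block_number) from an event line, or None."""
    m = LD_EVENT.search(line)
    if m is None:
        return None
    return int(m["lg"]), int(m["block"], 16)  # the address is logged in hexadecimal


event = "LG:0 NOTIFY:Logical Drive BAD Block Encountered 0388FD300"
lg, block = parse_ld_bad_block(event)
print(lg, block, hex(block))
```

For the event from the case study, the block number converts to 948949760, which is the block reported by format->analyze->read.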

Typically, a drive has latent disk errors that can only be detected when the affected disk sector is accessed. Exposure to these latent errors can be reduced by accessing the drives continuously, which can be accomplished by enabling media-scan to scrub the disks.

 

For NRAID or RAID 0, if we encounter a bad block, the LD is effectively dead, and there is no way of recovering other than having the host issue a "write" to that block or restoring the file sitting on that bad block.

 



Sense Key:0x03, Sense Code:0x11, rebuild, double, drive, failure, 3510, 3310, 3320, 3511, 4.11, 4.13, 4.15, 3.25, 3.27, 4.21, firmware, bad, block, media, scan, 4.15, parity, regenerate, RAID, disk
Previously Published As
85181

Change History
Date: 2010-11-11
User Name: [email protected]
Action: Currency & Update
Date: 2007-11-13
User Name: 7058
Action: Approved

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.