Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-79-1477074.1
Update Date: 2012-07-31
Keywords:

Solution Type: Predictive Self-Healing

Solution 1477074.1: Premature replacement of drives on the x4500 and x4540 platforms


Related Items
  • Sun Fire X4500 Server
  • Sun Fire X4540 Server
Related Categories
  • PLA-Support>Sun Systems>x64>Server>SN-x64: SERVER 64bit




In this Document
Purpose
Details


Applies to:

Sun Fire X4500 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire X4540 Server - Version Not Applicable to Not Applicable [Release N/A]
Oracle Solaris on x86-64 (64-bit)

Purpose

To explain why disk drives on the X45xx platforms (Sun Fire X4500 and X4540) should not be replaced prematurely for unrecoverable media errors.

Details

Disk drives are very high precision electro-mechanical devices.  As capacities increase they now hold billions of data sectors on ultra-precise surfaces spinning at very high speed, floating just under a number of microscopic read/write heads.  Like any mechanical device they cannot be 100% perfect, and they do suffer from a variety of issues that can cause a single sector to become unreadable (referred to as an "Unrecoverable Media Error", or UME).  The various vendors continually sample their drives for quality control and have continual process improvement programs in place to ensure that the number of sectors affected remains tiny.  The drives themselves have a lot of internal technology to monitor and correct these sectors, but a few sectors may still degrade and become UMEs over the life of a drive.  The drive vendors set aside an area of the drive to allow remapping of unrecoverable blocks should the affected block be written with new data; these areas are sized for several thousand re-allocations (most of which are performed when the drive is formatted in the factory).

A small number of unrecoverable blocks developing on a disk drive is not an abnormal situation, and a disk drive should not be considered "FAULTY" and replaced unless the total number of defective blocks detected is excessive (hundreds of blocks) or the rate at which they are being found is excessive (more than 20 in 24 hours).  The disk drives have an internal monitoring system that enforces the correct rules for the drive and asserts a SMART error to the host operating system if those thresholds are exceeded.  A drive should be considered FAULTY and replaced only when the total number or the discovery rate of UMEs is excessive, or a SMART alert is raised.
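The drive's own view of its error history can be reviewed from Solaris before deciding that these criteria have been met.  As a minimal sketch (the exact output layout varies between Solaris releases and drive models), the "iostat -En" error summary includes a per-device "Media Error" count and a "Predictive Failure Analysis" flag driven by the drive's SMART monitoring:

# Per-device error summary kept by the sd driver since boot; review the
# "Media Error:" count and the "Predictive Failure Analysis:" flag for the
# suspect drive before concluding that it needs replacement.
iostat -En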

When a disk drive suffers a UME, the Solaris disk drivers produce a message sequence like the one shown below.  The block that suffered the UME can be seen in the "Error Block:" field.  Because each failing block is retried several times, events for the same block may need to be grouped together when counting distinct bad blocks (a way of doing this is sketched after the example messages).

## Initial I/O request
Jun  8 21:15:33 XXXX scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@2,0 (sd4):
Jun  8 21:15:33 XXXX     Error for Command: read(10)                Error Level: Retryable
Jun  8 21:15:33 XXXX scsi: [ID 107833 kern.notice]     Requested Block: 972810238                 Error Block: 972810440
Jun  8 21:15:33 XXXX scsi: [ID 107833 kern.notice]     Vendor: ATA                                Serial Number: 9QMCXXXX   
Jun  8 21:15:33 XXXX scsi: [ID 107833 kern.notice]     Sense Key: Media Error
Jun  8 21:15:33 XXXX scsi: [ID 107833 kern.notice]     ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0
## 1st retry
Jun  8 21:15:36 XXXX scsi: [ID 365881 kern.info] /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Jun  8 21:15:36 XXXX scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@2,0 (sd4):
Jun  8 21:15:36 XXXX     Error for Command: read(10)                Error Level: Retryable
Jun  8 21:15:36 XXXX scsi: [ID 107833 kern.notice]     Requested Block: 972810238                 Error Block: 972810440
Jun  8 21:15:36 XXXX scsi: [ID 107833 kern.notice]     Vendor: ATA                                Serial Number: 9QMCXXXX   
Jun  8 21:15:36 XXXX scsi: [ID 107833 kern.notice]     Sense Key: Media Error
Jun  8 21:15:36 XXXX scsi: [ID 107833 kern.notice]     ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0
## 2nd retry
Jun  8 21:15:39 XXXX scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@2,0 (sd4):
Jun  8 21:15:39 XXXX     Error for Command: read(10)                Error Level: Retryable
Jun  8 21:15:39 XXXX scsi: [ID 107833 kern.notice]     Requested Block: 972810238                 Error Block: 972810440
Jun  8 21:15:39 XXXX scsi: [ID 107833 kern.notice]     Vendor: ATA                                Serial Number: 9QMCXXXX   
Jun  8 21:15:39 XXXX scsi: [ID 107833 kern.notice]     Sense Key: Media Error
Jun  8 21:15:39 XXXX scsi: [ID 107833 kern.notice]     ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0
## 3rd retry
Jun  8 21:15:46 XXXX scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@2,0 (sd4):
Jun  8 21:15:46 XXXX     Error for Command: read(10)                Error Level: Retryable
Jun  8 21:15:46 XXXX scsi: [ID 107833 kern.notice]     Requested Block: 972810238                 Error Block: 972810440
Jun  8 21:15:46 XXXX scsi: [ID 107833 kern.notice]     Vendor: ATA                                Serial Number: 9QMCXXXX  
Jun  8 21:15:46 XXXX scsi: [ID 107833 kern.notice]     Sense Key: Media Error
Jun  8 21:15:46 XXXX scsi: [ID 107833 kern.notice]     ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0
## 4th retry
Jun  8 21:15:49 XXXX scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@2,0 (sd4):
Jun  8 21:15:49 XXXX     Error for Command: read(10)                Error Level: Retryable
Jun  8 21:15:49 XXXX scsi: [ID 107833 kern.notice]     Requested Block: 972810238                 Error Block: 972810440
Jun  8 21:15:49 XXXX scsi: [ID 107833 kern.notice]     Vendor: ATA                                Serial Number: 9QMCXXXX   
Jun  8 21:15:49 XXXX scsi: [ID 107833 kern.notice]     Sense Key: Media Error
Jun  8 21:15:49 XXXX scsi: [ID 107833 kern.notice]     ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0
## 5th and final retry
Jun  8 21:15:52 XXXX scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@2,0 (sd4):
Jun  8 21:15:52 XXXX     Error for Command: read(10)                Error Level: Fatal
Jun  8 21:15:52 XXXX scsi: [ID 107833 kern.notice]     Requested Block: 972810238                 Error Block: 972810440
Jun  8 21:15:52 XXXX scsi: [ID 107833 kern.notice]     Vendor: ATA                                Serial Number: 9QMCXXXX  
Jun  8 21:15:52 XXXX scsi: [ID 107833 kern.notice]     Sense Key: Media Error
Jun  8 21:15:52 XXXX scsi: [ID 107833 kern.notice]     ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0

 
There will be 6 blocks of messages very close together.  That does not mean there are 6 bad blocks: there is just one block with a UME, and the command is retried a further 5 times before the Solaris disk driver gives up on it.  This error recovery should only take a small amount of time.  When the retry sequence terminates with a Fatal error level, it indicates that the single disk command involved failed and did not complete successfully.  It does not indicate that the disk itself has suffered a fatal failure, and the disk should not be considered faulty on the basis of this one message sequence.
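To judge how many distinct blocks are involved, and how quickly new ones are appearing, the repeated messages for the same block can be grouped together.  A minimal sketch using standard shell tools against /var/adm/messages (the field positions assume the message layout shown above):

# Count how many times each failing block is reported.  A single UME normally
# produces 6 reports (the initial request plus 5 retries), so divide each
# count by 6 to estimate the number of UME events per block, then compare the
# number of distinct blocks and their discovery rate against the criteria above.
grep "Error Block:" /var/adm/messages | awk '{print $NF}' | sort | uniq -c | sort -rn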

The recovery action for a UME depends on the RAID level of the dataset containing that disk drive.  If the drive is being used in a RAID-0 dataset, the data that was on the block won't be available within the dataset without some administrative action, such as recovery from backups.
With a higher RAID level, e.g. RAID-1 (mirroring), the dataset contains multiple copies of the original data on multiple drives, so when a UME results in a fatal error the alternative copy is available to the RAID implementation and is supplied to the calling application immediately and automatically, with no administrative action required.  Most software RAID implementations providing the higher RAID levels will also perform a "write back" of that good data to the disk block that had the UME, allowing the disk to reallocate the faulty block to the spare area.  Once that has happened the disk is "healed" and no further administrative action is needed, unless the criteria above for total UME count or UME detection rate are exceeded.

With Solaris 10 and above, a ZFS zpool created with a mirrored or RAID-Z layout performs both of these operations: redundant access to the data and an immediate "write back" on a UME to heal the block.  ZFS also has an FMA diagnosis engine which will advise, via the "fmadm faulty" command, when a disk should actually be replaced.
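On a ZFS configuration the pool status and the FMA diagnosis, rather than the raw kernel messages, are therefore the appropriate places to decide whether a drive really needs replacing.  A minimal sketch (see the zpool(1M) and fmadm(1M) man pages for full details of the output):

# Report only pools that are not healthy, with per-device read/write/checksum
# error counters and any data affected by unrecoverable errors.
zpool status -xv

# Ask the Fault Management Architecture whether any component, including a
# disk diagnosed by the ZFS diagnosis engine, has actually been faulted.
fmadm faulty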

Swapping drives prematurely has some undesirable effects:

1) Unnecessary disruption of your system, even though the drive's internal monitoring has fully recovered from the single-sector UME.

2) Increased cost of ownership, as administrative actions and service calls have to be made by the system administrators to swap a drive on each UME, when the drive may carry on operating for many years without further incident.

3) When a healed disk is removed from a dataset with a RAID level greater than 0, the redundancy of the dataset is lost and part of it is back to RAID-0 characteristics.  If there is a problem with the remaining disk(s) holding the other copy of your data, that data is gone and an urgent administrative action such as recovery from backups will be required.  This exposure remains until the replacement drive has been resynchronized with the original data and can operate as part of the redundant dataset.

A disk with only a small rate of UME discovery should have no noticeable effect on the running system.  If it does, a support case should be opened to deal with the impact on the system rather than simply changing the drive, as the Solaris system is designed to be resilient to transient disk errors.

 

 


Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.