Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Predictive Self-Healing Sure
Solution 1477074.1: Premature replacement of drives on the X4500 and X4540 platforms
Applies to:
Sun Fire X4500 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire X4540 Server - Version Not Applicable to Not Applicable [Release N/A]
Oracle Solaris on x86-64 (64-bit)

Purpose
Premature replacement of drives for unrecoverable media errors on the X45xx platform

Details
Disk drives are very high precision electro-mechanical devices. As capacities increase, they now have billions of data sectors on ultra-precise surfaces spinning at very high speeds, with a number of microscopic read/write heads flying just off those surfaces. Like any mechanical device they cannot be 100% perfect, and they do suffer from a variety of issues that can cause a single sector to become unreadable (referred to as an "Unrecoverable Media Error" (UME)).

The various vendors continually sample their drives for quality control and have continual process improvement programs in place to ensure that the number of sectors affected remains tiny. The drives themselves have a lot of internal technology to monitor and correct these sectors, but a few sectors may degrade and become UMEs over a period of time on a drive. The drive vendors set aside an area of the drive to allow remapping of unrecoverable blocks when the affected block is written with new data. These areas are sized for several thousand reallocations (mostly performed when the drive is formatted in the factory).

## Initial I/O request
Jun 8 21:15:33 XXXX scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@2,0 (sd4):
Jun 8 21:15:33 XXXX Error for Command: read(10) Error Level: Retryable
Jun 8 21:15:33 XXXX scsi: [ID 107833 kern.notice] Requested Block: 972810238 Error Block: 972810440
Jun 8 21:15:33 XXXX scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: 9QMCXXXX
Jun 8 21:15:33 XXXX scsi: [ID 107833 kern.notice] Sense Key: Media Error
Jun 8 21:15:33 XXXX scsi: [ID 107833 kern.notice] ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0

## 1st retry
Jun 8 21:15:36 XXXX scsi: [ID 365881 kern.info] /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Jun 8 21:15:36 XXXX scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@2,0 (sd4):
Jun 8 21:15:36 XXXX Error for Command: read(10) Error Level: Retryable
Jun 8 21:15:36 XXXX scsi: [ID 107833 kern.notice] Requested Block: 972810238 Error Block: 972810440
Jun 8 21:15:36 XXXX scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: 9QMCXXXX
Jun 8 21:15:36 XXXX scsi: [ID 107833 kern.notice] Sense Key: Media Error
Jun 8 21:15:36 XXXX scsi: [ID 107833 kern.notice] ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0

## 2nd retry
Jun 8 21:15:39 XXXX scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@2,0 (sd4):
Jun 8 21:15:39 XXXX Error for Command: read(10) Error Level: Retryable
Jun 8 21:15:39 XXXX scsi: [ID 107833 kern.notice] Requested Block: 972810238 Error Block: 972810440
Jun 8 21:15:39 XXXX scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: 9QMCXXXX
Jun 8 21:15:39 XXXX scsi: [ID 107833 kern.notice] Sense Key: Media Error
Jun 8 21:15:39 XXXX scsi: [ID 107833 kern.notice] ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0

## 3rd retry
Jun 8 21:15:46 XXXX scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@2,0 (sd4):
Jun 8 21:15:46 XXXX Error for Command: read(10) Error Level: Retryable
Jun 8 21:15:46 XXXX scsi: [ID 107833 kern.notice] Requested Block: 972810238 Error Block: 972810440
Jun 8 21:15:46 XXXX scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: 9QMCXXXX
Jun 8 21:15:46 XXXX scsi: [ID 107833 kern.notice] Sense Key: Media Error
Jun 8 21:15:46 XXXX scsi: [ID 107833 kern.notice] ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0

## 4th retry
Jun 8 21:15:49 XXXX scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@2,0 (sd4):
Jun 8 21:15:49 XXXX Error for Command: read(10) Error Level: Retryable
Jun 8 21:15:49 XXXX scsi: [ID 107833 kern.notice] Requested Block: 972810238 Error Block: 972810440
Jun 8 21:15:49 XXXX scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: 9QMCXXXX
Jun 8 21:15:49 XXXX scsi: [ID 107833 kern.notice] Sense Key: Media Error
Jun 8 21:15:49 XXXX scsi: [ID 107833 kern.notice] ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0

## 5th and final retry
Jun 8 21:15:52 XXXX scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@2,0 (sd4):
Jun 8 21:15:52 XXXX Error for Command: read(10) Error Level: Fatal
Jun 8 21:15:52 XXXX scsi: [ID 107833 kern.notice] Requested Block: 972810238 Error Block: 972810440
Jun 8 21:15:52 XXXX scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number: 9QMCXXXX
Jun 8 21:15:52 XXXX scsi: [ID 107833 kern.notice] Sense Key: Media Error
Jun 8 21:15:52 XXXX scsi: [ID 107833 kern.notice] ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0

2) the cost of ownership will rise, because system administrators have to take administrative action and raise service calls to swap a drive on each UME, even though the drive may carry on in operation for many years without further incident.

3) when a healed disk is removed from a RAID level > 0 dataset, the redundancy on that dataset is lost and you are back to RAID-0 characteristics for some of that dataset.
If there is a problem with the remaining disk(s) holding the other copy of your data, then that data is gone and urgent administrative action, such as recovery from backups, will be required. This exposure remains until the replacement drive has been synced with the original data and can operate as part of the redundant dataset.
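
The log excerpt above shows one I/O request and five retries, all failing on the same error block, so six reports correspond to a single bad sector rather than six. As a rough illustration only, the following Python sketch groups such messages by device and error block. The function name `summarize_media_errors` is hypothetical, the regular expressions are assumptions based solely on the message layout shown in this document, and the input is assumed to hold one syslog message per line.

```python
import re
from collections import defaultdict

# Patterns assumed from the sd driver messages quoted in this document.
DEVICE_RE = re.compile(r"WARNING: \S+ \((sd\d+)\):")
LEVEL_RE = re.compile(r"Error Level: (\w+)")
BLOCK_RE = re.compile(r"Error Block: (\d+)")
SENSE_RE = re.compile(r"Sense Key: (.+?)\s*$")


def summarize_media_errors(lines):
    """Count media error reports per (device, error block).

    Returns a dict mapping (device, block) to a dict with the total
    number of reports and how many of them were logged as Fatal.
    """
    summary = defaultdict(lambda: {"reports": 0, "fatal": 0})
    device = level = block = None
    for line in lines:
        m = DEVICE_RE.search(line)
        if m:
            # A new WARNING line starts a new error report.
            device, level, block = m.group(1), None, None
        m = LEVEL_RE.search(line)
        if m:
            level = m.group(1)
        m = BLOCK_RE.search(line)
        if m:
            block = int(m.group(1))
        m = SENSE_RE.search(line)
        if m and device is not None and block is not None:
            # The Sense Key line follows the block line; count media errors only.
            if m.group(1) == "Media Error":
                entry = summary[(device, block)]
                entry["reports"] += 1
                if level == "Fatal":
                    entry["fatal"] += 1
            block = None  # avoid counting the same report twice
    return dict(summary)
```

Fed the lines of the excerpt above, a summary of this kind would contain a single key for sd4 and block 972810440, which makes it easier to see that the retries describe one UME and not a drive with many failing sectors.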
Attachments
This solution has no attachment