Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1452325.1
Update Date: 2012-09-25
Keywords:

Solution Type: Troubleshooting

Solution 1452325.1: Determining when Disks should be replaced on Oracle Exadata Database Machine


Related Items
  • Exadata Database Machine X2-8
  • Exadata Database Machine X2-2 Hardware
  • Enterprise Manager for Exadata
  • Oracle Exadata Hardware
  • Exadata Database Machine V2
Related Categories
  • PLA-Support>Sun Systems>x64>Engineered Systems HW>SN-x64: EXADATA



In this Document
Purpose
Troubleshooting Steps
 About Disk Error Handling:
 Errors for which Disk Replacement is Recommended:
 Errors for which Disk Replacement is NOT Recommended:
 
 Conclusion:
References


Applies to:

Enterprise Manager for Exadata
Exadata Database Machine X2-2 Hardware
Exadata Database Machine V2
Oracle Exadata Hardware - Version 11.2.0.1 to 11.2.0.3 [Release 11.2]
Exadata Database Machine X2-8
Information in this document applies to any platform.

Purpose

This document explains which I/O errors require disk replacement, which do not, and which should be investigated further. I/O errors can be reported in different places for different reasons, and not every I/O error is due to a physical hard disk problem that requires replacement.

Troubleshooting Steps

About Disk Error Handling:

The inability to read some sectors is not always an indication that a drive is about to fail. Even if the physical disk is damaged at one location, such that a certain sector is unreadable, the disk may be able to remap the bad area to spare space so that the sector can be overwritten. Physical hard disks are complex mechanical devices with spinning media, so media errors and other mechanical problems are a fact of life; this is why redundancy is designed into Exadata to protect data against such errors. It is important to stay up to date on disk vendors' firmware, which resolves known issues in internal drive mechanical control, media usage, and re-allocation algorithms; left unattended, these issues can lead to premature failure. The most recent Exadata patch image releases contain the latest disk firmware for each supported drive, as well as continuous improvements in how ASM and cellsrv manage and handle disk-related I/O errors and failures. Refer to Note 888828.1 for the latest patch releases.

Physical hard disks used in Exadata support SMART (Self-Monitoring, Analysis, and Reporting Technology) and report their own SMART status to the RAID HBA in the event of a problem. SMART events are based on vendor-defined thresholds for the various internal disk mechanisms being monitored, and these thresholds differ for different types of errors. By definition SMART has only two external states: predicted failure, or not failed (OK). The SMART status does not necessarily indicate the drive's past or present reliability. Exadata reports disk status in two places: on the cluster by ASM, and on the individual storage cell by cellsrv, which accesses the disks through LSI's RAID HBA MegaCli utility.

When a disk fails and is not replaced within the timeout period, ASM will forcibly rebalance the data off the failed disk to restore redundancy. Performance may be reduced during a rebalance operation. The status of the rebalance should be checked to ensure it has completed successfully, because if the rebalance fails, redundancy remains reduced until this is rectified or the disk is replaced. On-site spare disks are provided so that, if they choose, customers can replace failed disks rapidly, maintaining maximum performance and full redundancy before the timeout expires and forces a rebalance.
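To check whether a rebalance is still running, the current ASM operations can be listed from any database node as the Grid Infrastructure owner (with the ASM environment set); this is a minimal check and the exact output columns depend on the software version:

$ asmcmd lsop
(an empty listing means no rebalance or other ASM operation is currently in progress; alternatively, query GV$ASM_OPERATION as SYSASM)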

With normal redundancy disk groups, one disk failure can be survived before an ASM rebalance re-establishes data redundancy for the whole cluster; if a second failure occurs before the rebalance has completed, the database may lose data and crash. With high redundancy disk groups, two disk failures can be survived before an ASM rebalance re-establishes redundancy for the whole cluster; if a third failure occurs before then, the database may lose data and crash. While the statistical chance of a second disk failure is very low, the consequences in normal redundancy mode are severe. The redundancy configuration is a trade-off between higher availability for mission-critical and business-critical systems and higher disk group capacity available for data storage, and should be chosen according to individual customer need.

/opt/oracle.SupportTools/sundiag.sh is a utility used to collect data for Exadata service requests, and in particular it collects data specific to diagnosing disk failures. The version included in the Exadata software image may not be the latest; for more details and the latest version, refer to Note 761868.1. Each of the examples below is taken from output collected by sundiag.
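For example, a sundiag can be collected on an affected server by running the script as root; the exact options and the location of the resulting archive depend on the sundiag version (see Note 761868.1):

# /opt/oracle.SupportTools/sundiag.sh
(the script reports the name of the compressed archive it creates; attach that archive to the SR)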

 

If 2 or more disks ever report critical failure within seconds of each other, in particular from more than 1 server at the same time, then a sundiag output should be collected from each server and a SR opened for further analysis.

 

Errors for which Disk Replacement is Recommended:

Case 1. The cell's alert history reports that the drive has changed its SMART status to "Predictive Failure":

20_1 2012-03-18T02:22:43+00:00 critical "Hard disk status changed to predictive failure. Status : PREDICTIVE FAILURE Manufacturer : SEAGATE Model Number : ST32000SSSUN2.0T Size : 2.0TB Serial Number : L1A2B3 Firmware : 0514 Slot Number : 11 Cell Disk : CD_11_exd1cel01 Grid Disk DATA_EXD1_CD_11_exd1cel01, RECO_EXD1_CD_11_exd1cel01, DBFS_DG_CD_11_exd1cel01"

This indicates the drive has determined via SMART that it is predicted to fail. A SR should be opened for a replacement as soon as is convenient, with a sundiag output attached for data analysis purposes.

If the system is connected to Oracle via Automatic Service Request, then a SR will automatically be opened for this event.

Storage cell disks can be replaced by the customer using the onsite spare provided, if the customer chooses, or Oracle will send out an engineer with the disk. 
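To confirm the alert and the drive's current status directly on the cell before the replacement, CellCLI can be used, for example (serial numbers, slot numbers and grid disk names will differ):

# cellcli -e list alerthistory
# cellcli -e list physicaldisk detail
(look for the disk whose status is "predictive failure" and note its slot number and serial number for the SR)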

 

Case 2. The cell's alert history reports that the drive's LUN has experienced a critical error from which it cannot recover:

21 2012-03-24T10:45:41+08:00 warning "Logical drive status changed. Lun : 0_11 Status : critical Physical Hard disk : 20:11 Slot Number : 11 Serial Number : L1C4D5 Cell Disk : CD_11_edx1cel01 Grid Disks : RECO_EDX1_CD_11_edx1cel01, DBFS_DG_CD_11_edx1cel01, DATA_EDX1_CD_11_edx1cel01."

This indicates the drive experienced a critical error during a transaction, causing the RAID HBA to mark the volume as critical. This often occurs concurrently with Predictive Failure on storage cells, where each volume is a single-disk RAID0, but it may occur on its own in the event of a write problem. A SR should be opened for a replacement as soon as is convenient, with a sundiag output attached for data analysis purposes.

If the system is connected to Oracle via Automatic Service Request, then a SR will automatically be opened for this event.

Storage cell disks can be replaced by the customer using the onsite spare provided, if the customer chooses, or Oracle will send out an engineer with the disk.
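The LUN and physical disk status can be cross-checked on the cell before opening the SR, for example:

# cellcli -e list lun detail
# cellcli -e list physicaldisk detail
(the affected LUN shows status "critical"; quote the matching physical disk slot number and serial number in the SR)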

Internal Note: Review sundiag outputs for all disk failures, and look for the symptoms of Note 1360343.1 Issues 50 and 65, for which disks are requested to be CPAS'd.

Issue 50 describes a scenario in which a disk drive on a storage cell's RAID0 volumes appears to be hot-removed and hot-inserted by itself, with the two events within a few seconds of each other, although it would not be physically possible to remove and replace a drive that quickly; it may also occur at a time when the customer was not at the data center. This is evident in the cell alerts and LSI HBA firmware logs as a disk remove event followed by a disk insert event. It may offline the volume, cause SCSI I/O error messages from the kernel, and leave ASM write errors in evidence.

Issue 65 describes a scenario where the disk records a critical failure and, when the MegaCli event logs are parsed, there are many command timeouts ending in "Error 02" at the point the disk is marked failed.

 

Case 3. DB nodes where the MegaCli status is shown as "Firmware state: Unconfigured(bad)", preceded by logged errors indicating the drive was Failed or Predictive Failed:

=> cat exa1db01_megacli64-PdList_short_2012_03_30_01_23.out
...
Slot 03 Device 08 (HITACHI H103030SCSUN300GA2A81026A1B2C3 ) status is: Unconfigured(bad)

=> cat exa1db01_megacli64-status_2012_03_30_01_23.out
Checking RAID status on exa1db01.oracle.com
Controller a0: LSI MegaRAID SAS 9261-8i
No of Physical disks online : 3
Degraded : 0
Failed Disks : 1

The above command output files are gathered by a sundiag.

Since DB nodes use RAID5 with a hotspare, when a disk fails the hotspare is activated and the RAID volume is rebuilt, so the volume status is temporarily Degraded and then returns to Optimal. This is evident in the MegaCli logs, but may not be obvious to an operator without analysis. A failed disk can be verified by collecting a sundiag output, and a SR should be opened for analysis.
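On a database node, the same information can be checked interactively with MegaCli, for example (the path shown is for Linux database nodes; slot numbers will differ):

# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -E "Slot Number|Firmware state|Predictive Failure Count"
# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL | grep -E "Virtual Drive|State"
(a "Degraded" virtual drive state while the hotspare rebuilds is expected and should return to "Optimal")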

 

Case 4. DB nodes where the "Predictive Failure Count" is >0, even if the drive status shows as "Online".

# cat exa1db01_megacli64-PdList_long_2012_03_30_01_23.out
...
Slot Number: 2
...
Predictive Failure Count: 14
...

# cat exa1db01_megacli64-PdList_short_2012_03_30_01_23.out
...
Slot 02 Device 16 (HITACHI H103030SCSUN300GA2A81026A1B2C3 ) status is: Online,
...

The above command output files are gathered by a sundiag.

In this case, the hotspare has not been activated due to an incorrect MegaRAID setting. To force the hotspare to be activated at the next failure, run the following:

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpSetProp -SMARTCpyBkEnbl -1 -a0
(use "/opt/MegaRAID/MegaCli" on Solaris DB nodes)

If the drive is to be replaced before the next failure, hot-plug remove it and wait for the controller to start a copyback to the hotspare in response to the missing disk.
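As a sketch only, the adapter property and any copyback in progress can be checked as follows; MegaCli syntax varies slightly between versions, so confirm against the documentation for the installed MegaCli release:

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpGetProp SMARTCpyBkEnbl -a0
# /opt/MegaRAID/MegaCli/MegaCli64 -PDCpyBk -ShowProg -PhysDrv[252:2] -a0
(replace 252:2 with the enclosure:slot of the failing drive as reported in the PdList output)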

 

Case 5. Storage cells where the drive's cell status is "warning" and the MegaCli status is "Firmware state: Unconfigured(bad)". The cell's alert history may report the drive with a "not present" alert.

=> cat exa1cel01_alerthistory_2012_08_23_18_24.out
...
     31       2012-08-07T20:50:12-04:00     warning      "Logical drive status changed.  Lun                  : 0_2  Status               : not present  Physical Hard disk         : 20:2  Slot Number          : 2  Serial Number        : L5YH9W  Cell Disk            : CD_02_xsd1cel06  Grid Disks           : DBFS_DG_CD_02_xsd1cel06, FRAFILE_GRP1_CD_02_xsd1cel06, DBFILE_GRP1_CD_02_xsd1cel06."

=> cat exa1cel01_megacli64-status_2012_08_23_18_24.out
Checking RAID status on exa1cel01.oracle.com
Controller a0:  LSI MegaRAID SAS 9261-8i
No of Physical disks online : 11
Degraded : 0
Failed Disks : 1


=> cat exa1cel01_megacli64-PdList_short_2012_08_23_18_24.out
...
Slot 02 Device 17 (SEAGATE ST32000SSSUN2.0T061A1120L5YH9W  ) status is: Unconfigured(bad)
...



=> cat exa1cel01_physicaldisk-fail_2012_08_23_18_24.out
     20:2     L5YH9W     warning

This case may occur when the drive fails during a boot cycle, before the cell's management services are running, so the cell does not see it go offline, only that it is no longer present in the OS configuration. This is evident in the MegaCli logs, but may not be obvious to an operator without analysis. A failed disk can be verified by collecting a sundiag output, and a SR should be opened for analysis.

Storage cell disks can be replaced by the customer using the onsite spare provided, if the customer chooses, or Oracle will send out an engineer with the disk. In this case, follow the procedure for Storage Cell disks in Predictive Failure status.
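This combination can be spotted quickly on the cell by listing just the status of each physical disk, for example:

# cellcli -e "list physicaldisk attributes name, status, slotNumber"
(a disk reporting "warning" here, combined with "Unconfigured(bad)" in the MegaCli PdList output, matches this case)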
 

Errors for which Disk Replacement is NOT Recommended:

Case 1. The Media Error counters reported by MegaCli in PdList or LdPdInfo outputs in a sundiag. On Storage Servers, these are also reported by Cellsrv in the physical disk view:

# cat exa1db01_megacli64-PdList_long_2012_03_30_01_23.out
...
Enclosure Device ID: 252
Slot Number: 3
Device Id: 8
Sequence Number: 4
Media Error Count: 109
Other Error Count: 0
Predictive Failure Count: 0
...


# cellcli -e list physicaldisk detail
...
name: 20:3
deviceId: 16
diskType: HardDisk
enclosureDeviceId: 20
errMediaCount: 402
errOtherCount: 2
...
slotNumber: 3
status: normal

The above command output files are gathered by a sundiag.

These counters record how many times an individual disk I/O transaction has experienced a media error; they are not indicative of the health of a disk or its ability to keep operating. They are not SMART thresholds, and there is no specific threshold value that can be used to determine whether a disk needs replacement. On earlier Exadata image versions, some of these errors may have been generated during a Patrol Scrub operation, which verifies all blocks, including those on the disk that ASM has not yet used. Such blocks may or may not cause a problem in the future, so they should be left alone until ASM does use them and can manage any data and errors on them.

Errors counted here should be ignored until the disk's SMART asserts a critical or predictive failure, at which point the RAID HBA offlines the disk, sends an alert, and changes the "status" field accordingly.

If errors are being counted on multiple disks in different cells, there is a possibility of multiple disks failing at the same time, as described above. Greater diligence should be taken in monitoring for, and replacing, each predictive or critical failure in a timely manner.
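If there is a need to watch whether the media error counters are growing over time, they can be listed for all disks on a cell in one pass, for example:

# cellcli -e "list physicaldisk attributes name, errMediaCount, errOtherCount, status"
(on database nodes the equivalent counters are in the MegaCli PdList output gathered by sundiag)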

Case 2.  The Other Error counters reported by MegaCli in PdList or LdPdInfo outputs in a sundiag. On Storage Servers, these are also reported by Cellsrv in the physical disk view:

# cat exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out
...
Enclosure Device ID: 252
Slot Number: 3
Device Id: 16
Sequence Number: 4
Media Error Count: 0
Other Error Count: 190

Predictive Failure Count: 0
...

# grep "Error Count" exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 184
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 62
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 220
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 211
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 183
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 19
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 184
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 342
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 225
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 146
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 121

# cellcli -e list physicaldisk detail
...
name: 20:2
deviceId: 17
diskType: HardDisk
enclosureDeviceId: 20
errMediaCount: 62
errOtherCount: 220

...
slotNumber: 2
status: normal
...
name: 20:8
deviceId: 23
diskType: HardDisk
enclosureDeviceId: 20
errMediaCount: 0
errOtherCount: 342

...
slotNumber: 8
status: critical

...

The above command output files are gathered by a sundiag.

These counters record how many times an individual disk I/O transaction has experienced a SCSI transaction error, and such errors are most likely caused by a data path problem. On Exadata this could be due to the RAID HBA, the SAS cables, the SAS expander, the disk backplane, or the disk itself. Occasionally they may also cause a disk to report as 'critical', due to a timeout responding to an I/O transaction or some other unexpected sense being returned; replacing the disk in this case will most likely not resolve the problem. In many cases these data path problems appear on multiple disks, which indicates something other than the disk is at fault.

In the example shown, all of the disks have had data path errors. The disk in slot 2 had some corrected read errors as a non-critical side-effect of the data path errors, hence its status is normal and it does not match the replacement criteria outlined above. One of the data path errors triggered the disk in slot 8 to change to critical, although it shows no media errors. Replacing the slot 8 disk did not resolve the problem; analysis of the full history of the errors and their types identified the faulty component to be the SAS expander.

A sundiag output should be collected and a SR opened for further analysis to determine the source of the fault.
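The grep shown above lists the counters but not which slot they belong to; including the slot numbers makes it easier to see which disks are accumulating data path errors, for example:

# awk '/Slot Number|Media Error Count|Other Error Count/' exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out
(errors spread across many slots point towards a shared component such as the HBA, SAS cabling, SAS expander or backplane rather than an individual disk)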

 

Case 3.  ASM logs on the DB node show I/O error messages in *.trc files similar to:

ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.10.09/DATA_CD_01_exa1cel01 at offset 212417384 for data length 1048576
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:1 disk:75 AU:52 offset:1048576 size:1048576
path:o/192.168.10.09/DATA_CD_01_exa1cel01
incarnation:0xe96e1227 asynchronous result:'I/O error'
subsys:OSS iop:0x2b7a8ff34160 bufp:0x2b7a90ff9e00 osderr:0xc9 osderr1:0x0
Exadata error:'Generic I/O error'

That may also be accompanied by ASM recovery messages such as this:

WARNING: failed to read mirror side 1 of virtual extent 1251 logical extent 0 of file 73 in group [1.1721532102] from disk DATA_EXA1_CD_01_EXA1CEL01  allocation unit 52 reason error; if possible, will try another mirror side
NOTE: successfully read mirror side 2 of virtual extent 1251 logical extent 1 of file 293 in group [1.1721532102 ] from disk DATA_EXA1_CD_03_EXA1CEL12 allocation unit 191

This is a single I/O Error on a read, which ASM has recovered and corrected. There are other similar ASM messages for different types of read errors.

This will likely also have a matching entry in the Storage Server's alert.log such as this:

IO Error on dev=/dev/sdb cdisk=CD_01_exa1cel01 [op=RD offset=132200642 (in sectors) sz=1048576 bytes] (errno: Input/output error [5])

This may also generate a Storage Server entry in /var/log/messages such as this:

Mar 30 15:37:08 td01cel06 kernel: sd 0:2:1:0: SCSI error: return code = 0x00070002
Mar 30 15:37:08 td01cel06 kernel: end_request: I/O error, dev sdb, sector 132200642

and will probably match an entry in the RAID HBA logs gathered by sundiag.

These single errors are recoverable using the built-in redundancy of ASM: ASM initiates a re-write of the block that had the error using the mirrored copy, which allows the disk to re-allocate data around any bad blocks in the physical disk media. The disk should not be replaced unless the failures escalate to the point that they trigger predictive failure or critical cell alerts.
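To confirm that such a read error was isolated and recovered, the ASM trace files and the cell alert log can be searched for the corresponding messages; the paths below are illustrative examples only, and the actual trace directory depends on the diagnostic destination of the particular installation:

# grep -l "ORA-27603" /u01/app/oracle/diag/asm/+asm/+ASM1/trace/*.trc
# grep "IO Error" $CELLTRACE/alert.log
(a single occurrence followed by a successful read of the mirror side, with no predictive failure or critical alert, does not warrant replacement)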

 

Case 4. Users of the Oracle Enterprise Manager Exadata plug-ins may see alerts marked "Critical" for all I/O errors.

From: EnterpriseManager <Exadata-OracleSupport @ oracle.com>
Date: Mar 30, 2012 1:16:35 PM PDT
To: <Exadata-OracleSupport @ oracle.com>
Subject: EM Alert: Critical:+ASM1_exa1db01.oracle.com - Disk DATA.DATA_CD_09_EXA1CEL02 has 3 Read/Write errors.

Target Name=+ASM1_exa1db01.oracle.com
Target type=Automatic Storage Management
Host=exa1db01.oracle.com
Occurred At=Mar 30, 2012 1:15:21 PM PDT
Message=Disk DATA.DATA_CD_09_EXA1CEL02 has 3 Read/Write errors.
Metric=Read Write Errors
Metric value=3
Instance ID=3
Disk Group Name=DATA
Disk Name=DATA_CD_09_EXA1CEL02
Severity=Critical
Acknowledged=No
Notification Rule Name=EXADATA ASM Alerts
Notification Rule Owner=SYSMAN

FileName
----------------
EM Alert

Since read errors are correctable and not truly critical, this may be a false report. A sundiag output should be collected and a SR opened for further analysis to determine whether the fault is critical and requires replacement.

A request for enhancement has been filed for EM to separate non-critical and critical write errors.
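The counters that drive these EM alerts can be cross-checked directly in ASM from a database node; this is a sketch run as the Grid Infrastructure owner with the ASM environment set, not an EM procedure:

$ sqlplus -S / as sysasm <<'EOF'
select name, read_errs, write_errs from v$asm_disk where read_errs > 0 or write_errs > 0;
EOF
(read errors that ASM has already recovered from the mirror copy are not by themselves grounds for replacement)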

 

Case 5. A disk with firmware state "Unconfigured(good)".

This indicates the disk is good but not configured into a RAID volume. It is not an expected status in Exadata, except transiently after a replacement, or if something does not work correctly during the replacement.

In Storage Servers in particular, this may be an indication that the disk was replaced but the Management Service (MS) daemon did not function properly and did not create the RAID volume and subsequent cell and grid disks. Refer to Note 1281395.1 and Note 1312266.1 for more details.
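A disk in this state can be identified with MegaCli, for example (replace 20:2 with the enclosure:slot of the suspect disk as reported by cellcli or the sundiag output):

# /opt/MegaRAID/MegaCli/MegaCli64 -PDInfo -PhysDrv[20:2] -a0 | grep "Firmware state"
(a state of "Unconfigured(good)" outside of an in-progress replacement indicates the RAID volume and the cell and grid disks were not re-created automatically; see Note 1281395.1 and Note 1312266.1)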

Conclusion:

For any other disk or I/O errors for which a disk may be suspect, such as device not present, device missing, or timeouts, a sundiag output should be collected and a SR opened for further analysis to determine whether the fault is critical.

If there are any doubts or concerns about the disk or I/O errors listed above, a sundiag output should be collected and a SR opened for further analysis to determine whether action is necessary.

 

References

<NOTE:1281395.1> - Steps to manually create cell/grid disks on Exadata V2 if auto-create fails during disk replacement
<NOTE:761868.1> - Oracle Exadata Diagnostic Information required for Disk Failures
@<NOTE:1360343.1> - INTERNAL Exadata Database Machine Hardware Current Product Issues
<NOTE:888828.1> - Database Machine and Exadata Storage Server 11g Release 2 (11.2) Supported Versions
<NOTE:1312266.1> - Exadata: After disk replacement ,celldisk and Griddisk is not created automatically
<NOTE:1386147.1> - How to Replace a Hard Drive in an Exadata Storage Server (Hard Failure)
<NOTE:1390836.1> - How to Replace a Hard Drive in an Exadata Storage Server (Predictive Failure)
<NOTE:1479736.1> - How to replace an Exadata Compute (Database) node hard disk drive (Predictive or Hard Failure)
