Exadata :Cell Rebooted SCSI error: return code & kernel: end_request: I/O error

Asset ID:	1-72-1447496.1
Update Date:	2012-04-08
Keywords:

Solution Type:	Problem Resolution Sure

Solution 1447496.1 : Exadata :Cell Rebooted SCSI error: return code & kernel: end_request: I/O error

Applies to:

Exadata Database Machine V2 - Version: Not Applicable and later [Release: N/A and later ]
Information in this document applies to any platform.

Symptoms

Configuration:-
------------------
cellVersion:       OSS_11.2.2.4.2_LINUX.X64_111221
kernelVersion:     2.6.18-238.12.2.0.2.el5

Errors Like :-
-----------------
Alert History:
info "IO hang detected on CD_01_dm51cel06. Power cycle forced."

ASM Log:-

Wed Mar 28 14:03:20 2012
WARNING: Disk in group 2 mode 0x7f is now being offlined
ORA-27603: Cell storage I/O error, I/O failed on disk at offset 8392704 for data length 4096
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:5 disk:52 AU:2 offset:4096 size:4096
WARNING: cache failed reading from group fn=4 blk=1 count=1 from
disk= kfkist=0x20 status=0x02 file=kfc.c line=11366

system log:

kernel: sd 0:2:6:0: SCSI error: return code = 0x00040000
dm51cel06 kernel: end_request: I/O error, dev sdg, sector 2006351888
kernel: sd 0:2:6:0: SCSI error: return code = 0x00040000
disk LSI MR9261-8i 2.12 /dev/sdac
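
The return code in the system log lines above can be split into its component bytes to see which layer reported the failure. The following is a minimal sketch, not part of the original note: it assumes the standard Linux SCSI result layout (driver byte, host byte, message byte, status byte, from high to low) and the host-byte names from the kernel's SCSI headers. The function names and the regular expression are illustrative choices, not an Oracle tool.

```python
import re

# Host-byte values as defined by the Linux SCSI midlayer headers.
# Assumption: the cell's kernel uses the standard layout and values.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}

# Matches lines such as:
#   kernel: sd 0:2:6:0: SCSI error: return code = 0x00040000
SCSI_ERROR_RE = re.compile(
    r"sd (?P<hctl>\d+:\d+:\d+:\d+): SCSI error: "
    r"return code = (?P<code>0x[0-9a-fA-F]{8})"
)


def decode_return_code(code):
    """Split a 32-bit SCSI result value into its four bytes."""
    return {
        "driver_byte": (code >> 24) & 0xFF,
        "host_byte": (code >> 16) & 0xFF,
        "msg_byte": (code >> 8) & 0xFF,
        "status_byte": code & 0xFF,
    }


def summarize_scsi_errors(lines):
    """Return (host:channel:target:lun, host-byte name) per SCSI error line."""
    results = []
    for line in lines:
        match = SCSI_ERROR_RE.search(line)
        if match is None:
            continue  # not a SCSI error line
        fields = decode_return_code(int(match.group("code"), 16))
        name = HOST_BYTE_NAMES.get(fields["host_byte"], "UNKNOWN")
        results.append((match.group("hctl"), name))
    return results
```

Under these assumptions, 0x00040000 has a host byte of 0x04 and all other bytes zero, which the kernel headers name DID_BAD_TARGET. That reading is an interpretation of the log, not a statement made in this note.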

Cause

The sequence of events is:
A disk in one slot failed. I/O to a disk in another slot then timed out. This
caused the power cycle, because I/Os should never hang on other devices for
more than 30 seconds while the cell is handling one bad disk.

If an I/O hang on a disk remains outstanding for more than 95 seconds, the storage server is forcibly rebooted.
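
The 95-second rule can be expressed as a simple threshold check. This is only an illustrative sketch of the decision as described in this note, not the cell software's actual implementation: the function name, the input mapping, and the way I/O ages are measured are all hypothetical.

```python
# Threshold taken from this note; everything else here is hypothetical.
IO_HANG_REBOOT_THRESHOLD_SECONDS = 95


def disks_forcing_power_cycle(oldest_io_age_by_disk,
                              threshold=IO_HANG_REBOOT_THRESHOLD_SECONDS):
    """Return the disks whose oldest outstanding I/O exceeds the threshold.

    oldest_io_age_by_disk maps a disk name to the age, in seconds, of its
    oldest outstanding I/O. A non-empty result corresponds to the condition
    under which the note says the storage server is rebooted.
    """
    return sorted(
        disk
        for disk, age in oldest_io_age_by_disk.items()
        if age > threshold  # "more than 95 seconds", so strictly greater
    )
```

For example, a mapping in which CD_01_dm51cel06 has an I/O outstanding for 96 seconds and every other disk is below the threshold would return only that disk.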

Prior to image 11.2.3.1.0, there was no mechanism to cancel an I/O on a grid disk other than rebooting the server. So, to avoid the risk of hanging the entire database, the choice was made to reboot just the one storage cell.

Usually, the reboot provides quiet time for the background disk media scan to kick in on the offending disk and fix the bad sectors.

Solution

The fix is included in 11.2.3.1.0 (Patch 13536739).
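
To check whether a given cell already runs an image at or above 11.2.3.1.0, the version can be extracted from a cellVersion string and compared numerically. The sketch below assumes the cellVersion format shown in the Symptoms section (OSS_<version>_LINUX.X64_<build>); the function names are hypothetical, and other version string formats may need a different pattern.

```python
import re

# Image version that contains the fix, per this note.
FIXED_IN = (11, 2, 3, 1, 0)


def parse_cell_version(cell_version):
    """Extract the dotted version from a string like
    'OSS_11.2.2.4.2_LINUX.X64_111221' as a tuple of integers."""
    match = re.search(r"OSS_(\d+(?:\.\d+)+)_", cell_version)
    if match is None:
        raise ValueError("unrecognized cellVersion: %r" % cell_version)
    return tuple(int(part) for part in match.group(1).split("."))


def has_fix(cell_version):
    """True if the cell image is at or above the fixed version."""
    # Tuples compare element by element, so 11.2.2.4.2 < 11.2.3.1.0.
    return parse_cell_version(cell_version) >= FIXED_IN
```

Applied to the cellVersion from the Symptoms section, OSS_11.2.2.4.2_LINUX.X64_111221, this returns False, which is consistent with the cell in this note still being exposed to the reboot behavior.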

References

<BUG:13922277> - CELL NODE REBOOTED - WITH ERRORS IN ASM LOG, MESSAGE & CELL LOGS
<BUG:12592457> - FENCEMASTER: OSS_IOCTL_FENCE_ENTITY

Attachments

This solution has no attachment