Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1447496.1
Update Date: 2012-04-08
Keywords:

Solution Type  Problem Resolution Sure

Solution 1447496.1: Exadata: Cell Rebooted with "SCSI error: return code" and "kernel: end_request: I/O error"


Related Items
  • Exadata Database Machine V2
Related Categories
  • PLA-Support>Sun Systems>x64>Engineered Systems HW>SN-x64: EXADATA
  • .Old GCS Categories>Sun Microsystems>Specialized Systems>Database Systems




In this Document
  Symptoms
  Cause
  Solution
  References


Created from <SR 3-5525182881>

Applies to:

Exadata Database Machine V2 - Version: Not Applicable and later   [Release: N/A and later ]
Information in this document applies to any platform.

Symptoms


Configuration:-
------------------
cellVersion: OSS_11.2.2.4.2_LINUX.X64_111221
kernelVersion: 2.6.18-238.12.2.0.2.el5



Errors like:-
-----------------
Alert History:
 info "IO hang detected on CD_01_dm51cel06. Power cycle forced."


ASM Log:-

Wed Mar 28 14:03:20 2012
WARNING: Disk  in group 2 mode 0x7f is now being offlined
ORA-27603: Cell storage I/O error, I/O failed on disk  at offset 8392704 for data length 4096
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:5 disk:52 AU:2 offset:4096 size:4096
WARNING: cache failed reading from group fn=4 blk=1 count=1 from
disk=  kfkist=0x20 status=0x02 file=kfc.c line=11366


system log:
 
kernel: sd 0:2:6:0: SCSI error: return code = 0x00040000
 dm51cel06 kernel: end_request: I/O error, dev sdg, sector 2006351888
 kernel: sd 0:2:6:0: SCSI error: return code = 0x00040000
 disk LSI MR9261-8i 2.12 /dev/sdac
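
The return code in the kernel messages above can be decoded to see which layer reported the failure. The following is a minimal sketch, assuming the usual Linux SCSI midlayer layout of the 32-bit result word (driver byte, host byte, message byte, status byte, from most to least significant). The function names and the regular expression are illustrative and are not part of any Exadata tool.

```python
import re

# Host byte values as defined by the Linux SCSI midlayer (include/scsi/scsi.h).
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}

# Matches lines such as:
#   kernel: sd 0:2:6:0: SCSI error: return code = 0x00040000
SCSI_ERROR_RE = re.compile(
    r"sd (\d+:\d+:\d+:\d+): SCSI error: return code = (0x[0-9a-fA-F]+)"
)


def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four component bytes."""
    return {
        "status_byte": result & 0xFF,
        "msg_byte": (result >> 8) & 0xFF,
        "host_byte": (result >> 16) & 0xFF,
        "driver_byte": (result >> 24) & 0xFF,
    }


def parse_scsi_error_line(line):
    """Return decoded fields for a 'SCSI error' log line, or None if no match."""
    match = SCSI_ERROR_RE.search(line)
    if match is None:
        return None
    fields = decode_scsi_result(int(match.group(2), 16))
    fields["device"] = match.group(1)  # host:channel:target:lun
    fields["host_byte_name"] = HOST_BYTE_NAMES.get(fields["host_byte"], "unknown")
    return fields
```

For the logged value 0x00040000 the host byte is 0x04 (DID_BAD_TARGET) and the other three bytes are zero, which suggests the error was reported by the host adapter or driver layer and not as a SCSI status returned by the disk itself.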

Cause

The sequence of events is:
A disk in one slot failed. I/O to a disk in another slot then timed out. This caused the
power cycle, because I/Os should never hang on other devices for more than 30
seconds while we are having trouble with one bad disk.

If an I/O remains outstanding (hung) on a disk for more than 95 seconds, we force a reboot of the storage server.
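
The 95-second rule can be pictured with a small model. This is only an illustrative sketch of the threshold check and is not the cell software's actual implementation. The data structure (a mapping from disk name to the submit times of its outstanding I/Os) and the function names are assumptions made for the example.

```python
# Threshold taken from the text above; everything else here is hypothetical.
IO_HANG_THRESHOLD_SECONDS = 95


def find_hung_disks(outstanding, now, threshold=IO_HANG_THRESHOLD_SECONDS):
    """Return the disks whose oldest outstanding I/O is older than threshold.

    outstanding maps a disk name to a list of submit timestamps (in seconds)
    for I/Os that have not completed yet. now is the current time in the
    same units.
    """
    hung = []
    for disk, submit_times in outstanding.items():
        # Only the oldest pending I/O matters for the hang check.
        if submit_times and now - min(submit_times) > threshold:
            hung.append(disk)
    return sorted(hung)


def should_power_cycle(outstanding, now, threshold=IO_HANG_THRESHOLD_SECONDS):
    """True if any disk has an I/O that has been hung longer than threshold."""
    return bool(find_hung_disks(outstanding, now, threshold))
```

In this model a single I/O that has been pending for more than 95 seconds on any one disk is enough to trigger the power cycle, which matches the "IO hang detected ... Power cycle forced" alert shown in the symptoms.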

Prior to image 11.2.3.1.0, there was no mechanism to cancel an I/O on a grid disk other than rebooting the server. So, to avoid the risk of hanging the entire database, we chose to reboot just the one storage cell.

Usually, the reboot provides quiet time for the background disk media scan to kick in on the offending disk and fix the bad sectors.

Solution


The fix is included in image 11.2.3.1.0 (Patch 13536739).
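
To check whether a cell is already at or beyond the fixed image, the cellVersion string shown under Symptoms can be compared against 11.2.3.1.0. The following is a minimal sketch, assuming the version string has the form seen in this document (for example OSS_11.2.2.4.2_LINUX.X64_111221). The function names are hypothetical.

```python
import re

# First image that contains the fix, per the Solution section.
FIXED_IMAGE = (11, 2, 3, 1, 0)


def parse_cell_version(text):
    """Extract the dotted version from a cellVersion string as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError("no dotted version found in %r" % text)
    parts = tuple(int(p) for p in match.group(0).split("."))
    # Pad short versions with zeros so that 11.2.3.1 compares equal to 11.2.3.1.0.
    return parts + (0,) * (len(FIXED_IMAGE) - len(parts))


def has_io_hang_fix(cell_version):
    """True if the given cellVersion is at or above the fixed image."""
    return parse_cell_version(cell_version) >= FIXED_IMAGE
```

Applied to the configuration in this document, OSS_11.2.2.4.2_LINUX.X64_111221 parses to 11.2.2.4.2, which is below 11.2.3.1.0, so the cell does not yet include the fix.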

References

<BUG:13922277> - CELL NODE REBOOTED - WITH ERRORS IN ASM LOG, MESSAGE & CELL LOGS
<BUG:12592457> - FENCEMASTER: OSS_IOCTL_FENCE_ENTITY

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.