Document Audience: | INTERNAL |
Document ID: | I0886-1 |
Title: | StorEdge A1000, A3x00 or A3500FC LUNs running under volume manager software may cause upper level applications to timeout when too many I/O error retries occur. |
Copyright Notice: | Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved |
Update Date: | 2002-10-07 |
---------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
FIN #: I0886-1
Synopsis: StorEdge A1000, A3x00 or A3500FC LUNs running under volume manager software may cause upper level applications to timeout when too many I/O error retries occur.
Create Date: Oct/04/02
SunAlert: No
Top FIN/FCO Report: No
Products Reference: Sun StorEdge A1000, A3x00 or A3500FC
Product Category: StorEdge / Service
Product Affected:
Systems Affected:
-----------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- ANYSYS - System Platform Independent -
X-Options Affected:
-------------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- A1000 ALL A1000 Storage Array -
- A3000 ALL A3000 Storage Array -
- A3500 ALL A3500 Storage Array -
- A3500FC ALL A3500FC Storage Array -
Parts Affected:
Part Number Description Model
----------- ----------- -----
825-3869-02 MNL Set SUN RSM ARRAY 2000 -
798-0188-01 SS CD ASSY RAID Manager 6.1 -
798-0522-01 RAID Manager 6.1.1 -
798-0522-02 RAID Manager 6.1.1 Update 1 -
798-0522-03 RAID Manager 6.1.1 Update 2 -
704-6708-10 CD, SUN STOREDGE RAID Manager 6.22 -
704-7937-05 CD, SUN STOREDGE RAID Manager 6.22.1 -
References:
BugId: 4722564 - Customer has experienced multiple A3500FC controllers
offlining needs root cause.
4423716 - I/O failure recovery exceeds Oracle aiowait timeout --
DB crashes 27062.
4400536 - ipsserver install should stop only the relevant
processes.
FIN: I0634-1 - StorEdge A3x00 Array controller failover.
ESC: 539253 - bug 4722564/Root cause analysis for failed a3500
controllers.
NOTICE: Infodoc 28087 - Oracle crashes with asynchronous I/O (AIO) -
ORA-27062.
27248 - ORA-27062 Causes / Remedy.
Issue Description:
Systems with StorEdge A1000, A3x00 or A3500FC Arrays, which are
configured to run under volume manager software, may experience
database or application problems when repeated disk I/O error retries
occur. Excessive RDAC and Solaris disk driver retry attempts on I/O
errors may cause database managers (such as Oracle) or applications
to time out. The affected applications may then crash or run very
slowly following the I/O timeout.
This issue applies to any system type with StorEdge A1000, A3x00 or
A3500FC Arrays with LUNs that are covered by a volume manager.
Volume manager software could include Veritas Volume Manager or
Solaris DiskSuite.
To identify installed volume manager software:
For Veritas Volume Manager
# pkginfo -l VRTSvxvm
For Solaris DiskSuite
# pkginfo -l SUNWmdr
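The two package checks above can be combined into one helper. This is a minimal sketch, not part of the FIN: the function name detect_volume_managers and the PKGINFO override variable are hypothetical, and it relies on "pkginfo -q", which exits 0 when the named package is installed.

```shell
#!/bin/sh
# Report which volume manager packages appear to be installed.
# PKGINFO (hypothetical override, useful for testing) defaults to pkginfo.
detect_volume_managers() {
    pkginfo_cmd=${PKGINFO:-pkginfo}
    found=""
    # pkginfo -q prints nothing and exits 0 if the package is installed.
    if "$pkginfo_cmd" -q VRTSvxvm 2>/dev/null; then
        found="$found VxVM"
    fi
    if "$pkginfo_cmd" -q SUNWmdr 2>/dev/null; then
        found="$found DiskSuite"
    fi
    if [ -z "$found" ]; then
        echo "none"
    else
        # Unquoted on purpose so the leading space is dropped.
        echo $found
    fi
}
```

Calling detect_volume_managers prints "VxVM", "DiskSuite", "VxVM DiskSuite" or "none". A package being installed does not by itself prove that the array LUNs are under volume manager control.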
Symptoms for this issue may include error messages in the affected
application's error log. For example, in Oracle 8, if an asynchronous
I/O (AIO) request does not complete within 10 minutes, Oracle logs
messages such as the following in its alert log:
LGWR: terminating instance due to error 27062
Instance terminated by LGWR, pid=1351
In addition to logging error messages, the Oracle server may crash as a
result of the I/O timeout, in which case it appears non-responsive.
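To check whether a system has already hit this symptom, the alert log can be searched for the termination message shown above. The following is a hedged sketch: the function name count_27062_terminations is hypothetical, and the alert log path must be supplied by the caller because its location depends on the Oracle installation.

```shell
#!/bin/sh
# Count instance terminations caused by error 27062 in an Oracle alert log.
# $1 is the path to the alert log (site specific, supplied by the caller).
count_27062_terminations() {
    alert_log=$1
    if [ ! -r "$alert_log" ]; then
        echo "cannot read $alert_log" >&2
        return 2
    fi
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so the status is ignored and only the count is used.
    grep -c 'terminating instance due to error 27062' "$alert_log" || true
}
```

A non-zero count only shows that Oracle terminated with error 27062; the cause still has to be confirmed against the I/O errors logged for the array.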
When there is an I/O error in an A1000, A3x00 or A3500FC LUN, Solaris
disk drivers (sd or ssd) will retry the I/O. Under certain
circumstances, such as heavy I/O, this retry behavior of the Solaris
disk drivers, when combined with error recovery actions on the part of
RM6, can result in I/O error recovery attempts taking a long time to
complete.
Unfortunately, some database managers or applications, such as Oracle
8, do not tolerate I/O that takes a long time to complete. Oracle may
time out the I/O and then crash. The Oracle crash prevents the volume
manager from failing over to a working disk volume that holds a
mirrored image of the customer's data.
A configuration workaround is available for this issue. Setting
"Rdac_RetryCount" in rmparams to "1" eliminates RDAC retries on I/O
errors (by default, RDAC retries 7 times). This removes one layer of
error recovery: RM6 is bypassed and the error is passed from the disk
driver directly to the volume manager software. This greatly reduces
the probability of exceeding database manager or application I/O
timeout limits. See the Corrective Action section below.
Implementation:
---
| | MANDATORY (Fully Proactive)
---
---
| | CONTROLLED PROACTIVE (per Sun Geo Plan)
---
---
| X | REACTIVE (As Required)
---
Corrective Action:
The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the above
mentioned problem.
To eliminate RDAC retries on I/O errors:
1. Edit /etc/raid/rmparams and set:
Rdac_RetryCount=1 (The default value is 7).
2. Restart the amdaemon to make the change effective:
/etc/init.d/amdemon stop
/etc/init.d/amdemon start
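The edit in step 1 can also be scripted. The sketch below is illustrative only: the function name set_rdac_retry_count and the ".orig" backup suffix are hypothetical, and it assumes the parameter appears in rmparams as a "Rdac_RetryCount=value" line starting in the first column, as shown in step 1.

```shell
#!/bin/sh
# Set Rdac_RetryCount in an rmparams-style file, keeping a backup copy.
# $1 = path to the parameter file, $2 = new retry count.
set_rdac_retry_count() {
    params_file=$1
    new_value=$2
    [ -w "$params_file" ] || { echo "cannot write $params_file" >&2; return 1; }
    # Keep the original next to the file as <file>.orig (assumed naming).
    cp -p "$params_file" "$params_file.orig" || return 1
    if grep '^Rdac_RetryCount=' "$params_file" >/dev/null; then
        # Rewrite the existing line from the backup; sed -i is avoided
        # because it is not available in every sed implementation.
        sed "s/^Rdac_RetryCount=.*/Rdac_RetryCount=$new_value/" \
            "$params_file.orig" > "$params_file"
    else
        # Parameter not present: append it.
        echo "Rdac_RetryCount=$new_value" >> "$params_file"
    fi
}
```

For example, set_rdac_retry_count /etc/raid/rmparams 1 would apply the value recommended above. The change still requires the restart described in step 2 before it takes effect.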
Comments:
None
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Sun Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Sun Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Sun Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------