Sun StorEdge[TM] A1000/A3000/A3500/A3500FC array: Array may crash after 828 days of uptime resulting in possible data loss.

Asset ID:	1-72-1004737.1
Update Date:	2010-08-10
Keywords:

Solution Type Problem Resolution Sure

Solution 1004737.1 : Sun StorEdge[TM] A1000/A3000/A3500/A3500FC array: Array may crash after 828 days of uptime resulting in possible data loss.

Related Items


Sun Storage A1000 Array
Sun Storage A3500 FC Array

Related Categories


GCS>Sun Microsystems>Storage - Disk>Modular Disk - Other

PreviouslyPublishedAs
206579

Symptoms
Sun StorEdge[TM] A1000/A3000/A3500/A3500FC arrays running Raid Manager firmware
version 03.01.04.75 or earlier are at risk of losing data (or at least
temporarily losing access to the data) on their LUNs (logical unit numbers)
if the array has been running for 828 days without being reset or restarted.

Resolution
The problem was found to be in the handling of the overflow of the internal
"clock tick" counter, which the array controller maintains. This overflow
occurs after the controller has been running continuously for approximately
828 days and 12 hours.
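
The 828-day figure is consistent with a 32-bit unsigned tick counter that is
incremented 60 times per second. Neither the counter width nor the tick rate
is stated in this document; both are assumptions, used here only to
illustrate the arithmetic:

```python
# Assumptions (not stated in this document): the counter is a 32-bit
# unsigned integer and the controller clock ticks 60 times per second.
COUNTER_BITS = 32
TICKS_PER_SECOND = 60

def overflow_uptime(bits=COUNTER_BITS, hz=TICKS_PER_SECOND):
    """Return (days, hours) of uptime at which the tick counter wraps."""
    seconds = (2 ** bits) // hz               # whole seconds until the wrap
    days, remainder = divmod(seconds, 86400)  # 86400 seconds per day
    hours = remainder // 3600
    return days, hours

days, hours = overflow_uptime()
print(f"counter wraps after about {days} days and {hours} hours")
# prints: counter wraps after about 828 days and 12 hours
```

Under those assumptions the wrap point matches the "approximately 828 days
and 12 hours" quoted above.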

If the counter overflows while a write to a LUN is in progress, the array
controller "fails" all the drives in that LUN. Once this has happened,
resetting or power-cycling the array (or, for SCSI-attached arrays,
rebooting the host) causes the data on the LUNs using those drives to be
lost. At that point, the only way to recover is to reset the whole array
configuration and restore the data onto the array from previous backups.

Therefore if this problem occurs, it is important not to reset / power-cycle
the array, or reboot the attached host (for SCSI-attached arrays), as this
will result in data loss.

This issue has been identified and fixed in array controller firmware
03.01.04.81, provided for Raid Manager 6.22.1 (RM 6.22.1) in Patch ID
112125-08 (for hosts running Solaris[TM] 2.6 or Solaris 7 OS) and Patch ID
112126-08 (for hosts running Solaris 8 or Solaris 9 OS), and in higher
revisions of those patches. The patches are available on SunSolve Online[SM].

There are no patches with this fix for earlier versions of Raid Manager.
Therefore customers running earlier versions of Raid Manager must upgrade to
RM 6.22.1 and apply one of those patches, to get the fix for this issue.

Relief/Workaround

For SCSI-attached StorEdge[TM] arrays (A1000/A3000/A3500), the array is reset
when the attached host reboots. Therefore, rebooting the attached host once
every 2 years or so (i.e. before 828 days of uptime) will prevent this issue
from occurring on those arrays.

However, rebooting the attached host every 2 years or so will not prevent the
issue occurring on the StorEdge[TM] A3500FC array, since rebooting the attached
host does not reset that type of array.

If one of the affected arrays has not been reset for more than 828 days and
suddenly suffers from failed drives, and you are confident that this issue is
the cause because you know the array has not been reset in that time, you can
take the following actions to attempt to recover:

1. DO NOT REBOOT the array or the attached server. Doing so WILL result in
data loss.

2. Use "drivutil -u" to unfail all of the drives in the LUN. If the LUN is
still not optimal, contact your support provider for further assistance.

Additional Information

1. How do you know that you've encountered this problem?

  After the array has been running continuously for more than 2 years,
suddenly some LUNs disappear on the host, due to multiple disks being
marked as failed.  This may appear to be a hardware problem, and cause
people to replace the array controller/drives, which does not solve the
problem.

2. What errors should you expect to see?

  No errors are seen in advance as symptoms. All of a sudden, some LUNs
disappear, and the host application dies with SCSI reset/transport errors.

Product
Sun StorageTek A3500 Array
Sun StorageTek A3000
Sun StorageTek A3500 FC Array
Sun StorageTek A1000 Array
Netra st A1000 Array

Internal Comments
This problem may be less likely to be seen with dual-controller arrays such
as the Sun StorEdge A3000/A3500/A3500FC, which may have suffered controller
problem(s) and so are less likely to have run for so many days without a
controller reset. However, it was seen on multiple single-controller Sun
StorEdge A1000 arrays at a customer site.

Please see Bug ID 4874507 and escalations #545577 and #546371 for details.

=====

Some queries on this problem:

1. How do you determine that your controller has been up for 828 days
or more?

  You can use the serial port command "vxAbsTicks", or run the
"/usr/lib/osa/bin/perfutil" command from the command line:

  # perfutil -c cXtXdX

  Then run "drive_stats_u1.pl" on the output of "perfutil".

  For example,

  # ./drive_stats_u1.pl /net/sslab09/var/tmp/tfuku/perfutil-c_c5t5d0.out
  drive_stats.pl version 1.1
  Controller = c5t5d0   Host Time/Date: 10:25:45  08/07/2003
   min of runtime =	26.2716666666667   <-- uptime
   total_recovered_errors =	0
   total_unrecovered_errors =	0
   total_request_time_outs =	0
   total_retried_requests =	24
   total_drive_bus_resets =	0
   #

  Uptime is 26.2716666666667 minutes in this example. The script converts
the uptime from min to hr when it exceeds 60 min, and likewise from hr to
days when it exceeds 24 hr.
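
  As an illustration only, the uptime line in output like the example above
could be compared against the overflow limit as follows. This sketch is not
part of drive_stats.pl. The helper names are hypothetical, and the "hr" and
"days" unit labels are assumptions, since only the "min" form appears in the
sample output:

```python
import re

# Approximate limit from this document: 828 days and 12 hours.
OVERFLOW_DAYS = 828.5

# Assumed unit labels: only "min" is shown in the sample output; "hr" and
# "days" are guesses at how the script labels larger values.
MINUTES_PER_UNIT = {"min": 1.0, "hr": 60.0, "days": 1440.0}

# Matches lines such as " min of runtime =<TAB>26.2716666666667   <-- uptime"
RUNTIME_RE = re.compile(r"^\s*(\w+) of runtime\s*=\s*([0-9.]+)")

def uptime_days(report_text):
    """Return controller uptime in days from the report text, or None."""
    for line in report_text.splitlines():
        match = RUNTIME_RE.match(line)
        if match and match.group(1) in MINUTES_PER_UNIT:
            minutes = float(match.group(2)) * MINUTES_PER_UNIT[match.group(1)]
            return minutes / 1440.0
    return None

def days_until_overflow(report_text):
    """Return days remaining before the assumed overflow point, or None."""
    days = uptime_days(report_text)
    return None if days is None else OVERFLOW_DAYS - days

sample = " min of runtime =\t26.2716666666667   <-- uptime\n"
print(days_until_overflow(sample))
```

  A small or negative result would suggest scheduling a controller reset, or
checking whether the array has already hit the overflow.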

2. Where do we get the drive_stats.pl Perl script from?

  The script can be downloaded from the URL below:

   http://cpre-emea.uk/tools/sonoma-info/sonoma_info.html

  Alternatively, please see Technical Instruction <Document: 1010352.1>:
"Sun StorEdge[TM] Axx00:Tech Tip:Finding a Failed Disk that RM6 Reports as
Optimal", which also has an alternative link to the script.

=====

The full text of the recovery actions from LSI for an array which has hit
the bug and has failed several drives is reproduced here. Some of their
suggested actions (e.g. using "vdShow" via the serial port) are not customer
actions, and hence have been removed from the customer-viewable section of
this Problem Resolution:

"To recover, DO NOT REBOOT the controller or the server, doing so WILL result
in data loss. Use drivutil -u (lower case) to unfail all of the drives in
the lun, and verify the lun configuration with vdShow. If the lun is not
optimal, call LSI Tech Support."

A1000, A3500, A3000, A3500FC, failed, controller, RM6
Previously Published As
70661

Change History
Date: 2005-06-03
User Name: 86700
Action: Update Canceled
Comment: *** Restored Published Content *** The changes which i thought were missing are
already there in the document
-Sailesh
Version: 0
Date: 2005-06-03
User Name: 86700
Action: Update Started
Comment: Have to update this document with an additional info that this has been fixed for RM6.22.1 for which patch is available which will avoid this problem.
-Sailesh
Version: 0
Product_uuid
2a8022d4-0a18-11d6-8043-ee5a180fdb7f
2a7ca41a-0a18-11d6-82f2-e96014c515ea
b648cdf0-efb8-4d4f-93d4-b17c1baf1935
2a792916-0a18-11d6-8d0a-c3d03933af3c
49f7ad4a-aa28-47c7-935a-b971312469ea

Attachments

This solution has no attachment