Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

1004737.1: Sun StorEdge[TM] A1000/A3000/A3500/A3500FC array: Array may crash after 828 days of uptime resulting in possible data loss.
Previously Published As: 206579

Symptoms

Sun[TM] StorEdge A1000/A3000/A3500/A3500FC arrays running Raid Manager firmware version 03.01.04.75 or earlier risk losing data (or at least temporarily losing access to the data) on their LUNs (logical unit numbers) if the array has been running for 828 days without being reset or restarted.

Resolution

The problem was found to be in the handling of the overflow of the internal "clock tick" counter, which the array controller maintains. This overflow occurs after the controller has been running continuously for approximately 828 days and 12 hours.

If that counter overflow happens, and a write is in progress to a LUN at
Therefore if this problem occurs, it is important not to reset / power-cycle
This issue has been identified and fixed in array controller firmware
There are no patches with this fix for earlier versions of Raid Manager.

Relief/Workaround

For SCSI-attached StorEdge[TM] arrays (A1000/A3000/A3500) the array is reset when
However, rebooting the attached host every 2 years or so will not prevent the
If one of the affected arrays is not rebooted for a period greater than 828

1. DO NOT REBOOT the array, or the attached server, doing so WILL result in
2. Use "drivutil -u" to unfail all of the drives in the LUN. If the LUN is

Additional Information

1. How do you know that you've encountered this problem?

After the array has been running continuously for more than 2 years, some LUNs suddenly disappear on the host because multiple disks have been marked as failed. This may look like a hardware problem and lead people to replace the array controller or drives, which does not solve the problem.

2. What errors should you expect to see?

No errors are seen as symptoms. All of a sudden, some LUNs disappear, and the host application dies with SCSI reset/transport errors.
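The figure of approximately 828 days and 12 hours quoted in the Resolution section can be reproduced with a short calculation. The sketch below assumes an unsigned 32-bit tick counter incremented 60 times per second; neither value is stated in this document, they are assumptions that happen to be consistent with the quoted figure. The names TICK_RATE_HZ, COUNTER_BITS and overflow_uptime are illustrative only.

```python
# Assumptions (NOT stated in this document): the controller's "clock tick"
# counter is an unsigned 32-bit integer and advances 60 times per second.
TICK_RATE_HZ = 60
COUNTER_BITS = 32

SECONDS_PER_DAY = 86400
SECONDS_PER_HOUR = 3600


def overflow_uptime(tick_rate_hz=TICK_RATE_HZ, counter_bits=COUNTER_BITS):
    """Return (whole days, remaining hours) of uptime before the counter wraps."""
    total_seconds = (2 ** counter_bits) / tick_rate_hz
    days, remainder = divmod(total_seconds, SECONDS_PER_DAY)
    return int(days), remainder / SECONDS_PER_HOUR


days, hours = overflow_uptime()
print(f"Counter wraps after about {days} days and {hours:.1f} hours")
```

Under these assumptions the result is 828 days and roughly 12.1 hours, which agrees with the interval given above. A different tick rate or counter width would give a different interval.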
Product

Sun StorageTek A3500 Array
Sun StorageTek A3000
Sun StorageTek A3500 FC Array
Sun StorageTek A1000 Array
Netra st A1000 Array

Internal Comments

This problem is less likely to be seen on dual-controller arrays such as the Sun StorEdge A3000/A3500/A3500FC, which may have suffered controller problems and so are less likely to have run for so many days without a controller reset. However, it was seen on multiple single-controller Sun StorEdge A1000 arrays at a customer site. Please see Bug ID 4874507 and escalations# 545577 and 546371 for details.

=====
Some queries on this problem:
1. How do you determine that your controller has been up for 828 days?

You can use the serial port command "vxAbsTicks" or from

  # perfutil -c cXtXdX

On the output of "perfutil", run "drive_stats_u1.pl". For example,

  # ./drive_stats_u1.pl /net/sslab09/var/tmp/tfuku/perfutil-c_c5t5d0.out
  drive_stats.pl version 1.1
  Controller = c5t5d0
  Host Time/Date: 10:25:45 08/07/2003
  Uptime is 26.2716666666667.

The time shown is in minutes in this example.

2. Where do we get the drive_stats.pl perl script from?

The script can be downloaded from the URL below:

Alternatively, please see Technical Instruction <Document: 1010352.1> : "Sun StorEdge[TM]
=====
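As a rough aid for reading the uptime value, the sketch below pulls the number out of an "Uptime is ..." line like the one in the example output, converts it to days, and compares it with the approximate 828 days and 12 hours interval described earlier. It assumes the value is in minutes, as in the example; whether that holds for every report is not confirmed here. The names OVERFLOW_DAYS, parse_uptime_minutes and days_until_overflow are hypothetical helpers, not part of perfutil or the drive_stats script.

```python
import re

# Approximate overflow interval from this document: 828 days and 12 hours.
OVERFLOW_DAYS = 828.5
MINUTES_PER_DAY = 60 * 24


def parse_uptime_minutes(line):
    """Extract the numeric uptime from a line such as 'Uptime is 26.27.'"""
    match = re.search(r"Uptime is (\d+(?:\.\d+)?)", line)
    if match is None:
        raise ValueError(f"no uptime value found in: {line!r}")
    return float(match.group(1))


def days_until_overflow(uptime_minutes):
    """Return (uptime in days, days left before the approximate overflow)."""
    uptime_days = uptime_minutes / MINUTES_PER_DAY
    return uptime_days, OVERFLOW_DAYS - uptime_days


minutes = parse_uptime_minutes("Uptime is 26.2716666666667.")
up_days, left_days = days_until_overflow(minutes)
print(f"Up {up_days:.4f} days, about {left_days:.1f} days before overflow")
```

A negative "days left" value would mean the controller has already run past the interval.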
The full text of the recovery actions from LSI for an array which has hit
"To recover, DO NOT REBOOT the controller or the server, doing so WILL result

A1000, A3500, A3000, A3500FC, failed, controller, RM6

Previously Published As: 70661

Change History

Date: 2005-06-03
User Name: 86700
Action: Update Canceled
Comment: *** Restored Published Content *** The changes which i thought were missing are already there in the document -Sailesh
Version: 0

Date: 2005-06-03
User Name: 86700
Action: Update Started
Comment: Have to update this document with an additional info that this has been fixed for RM6.22.1 for which patch is available which will avoid this problem. -Sailesh
Version: 0

Product_uuid

2a8022d4-0a18-11d6-8043-ee5a180fdb7f
2a7ca41a-0a18-11d6-82f2-e96014c515ea
b648cdf0-efb8-4d4f-93d4-b17c1baf1935
2a792916-0a18-11d6-8d0a-c3d03933af3c
49f7ad4a-aa28-47c7-935a-b971312469ea

Attachments

This solution has no attachment