Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Sun Alert Sure Solution

1300555.1: Replacement of Drives with Mechanical Positioning Errors May Cause RAID Controllers Reset or Lockdown Unexpectedly
Applies to:

Sun Storage Flexline 380 Array - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun Storage 6780 Array - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun Storage 6580 Array - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun Storage 2530 Array - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun Storage 2540 Array - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Information in this document applies to any platform.

This issue also applies to Sun Storage 2510, 2530, 2540, 6140, 6180, 6540, 6580, and 6780 Arrays.

Date of Resolved Release: 02-Mar-2011

Description

Drives that have mechanical positioning errors may cause a RAID controller in one of the identified products to reboot when the controllers attempt to fail that drive. The drive will be marked as failed when the controller completes its start-of-day (SOD) sequence.

Note: Attempting to manually fail an affected drive may cause a lockdown, and therefore an outage, when the rebooting controller cannot verify the DACstore on a drive that the surviving controller still sees as Optimal.

Likelihood of Occurrence

This issue can occur on the following system:
Note: Arrays with 6.xx firmware are not affected by this issue. Only a single controller will issue the command to fail the drive, thus no race condition exists.

Possible Symptoms

The aforementioned disk drives fail because of a mechanical positioning error, as seen in array event log entries similar to the following:

B:10/27/10 7:11:30 AM : 4050 : 0/0/0 : 6008 : Internal : Drive : Tray 33, Slot 16 : Stable storage drive unusable Sense: 04 HARDWARE ERROR

This sense data indicates a Drive Positioning Mechanical Error. These drives, in particular, have an existing issue with the heads staying in one position after a drive error log update. This reduces the lubrication in the drive heads, leading to a head crash indicated by the codes mentioned above.

Manual failure of drives in an Optimal state, when it results in one or more volumes in the array failing, typically leads to one or both controllers resetting, and possibly being held in a Lockdown state. This is due to a problem accessing and updating the metadata on the disk drive that is reporting the error.

The lockdown state may show as LU, 88, or SD on one controller of a 6140, 6540, or Flexline 380. On a 6180, 6580, or 6780, the lockdown state may show as a flashing display sequence of OE+ LU+ blank-

Note: Controllers in a lockdown or offline state should be serviced immediately by Oracle support for correction.

Automatic failure of drives due to write failures caused by the aforementioned error can also result in a controller reset, with an event similar to the following:

B Sat Jan 01 16:30:54 PST 2011 54527 4/15/1 100A Error Drive Tray.01.Drive.03

Note: A drive being failed by the system does not usually result in a lockdown or offline controller state. After a power cycle or controller reset, the drives often transition to a state of INCOMPATIBLE.
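When reviewing a saved copy of the array event log, it may help to list every tray/slot that reported the hardware-error sense key before any drive is touched. The following is a minimal Python sketch, not an Oracle-supplied tool. The function name suspect_drives is hypothetical, and the patterns are inferred only from the single sample entry shown in the symptoms above, so the real log layout may differ between CAM and firmware releases.

```python
import re

# Both patterns are assumptions based on the one sample entry in this
# document ("Tray 33, Slot 16 ... Sense: 04 HARDWARE ERROR").
LOCATION = re.compile(r"Tray\s+(\d+),\s*Slot\s+(\d+)")
SENSE = re.compile(r"Sense:\s*04\s+HARDWARE ERROR", re.IGNORECASE)


def suspect_drives(lines):
    """Return sorted, de-duplicated (tray, slot) pairs whose log entry
    reports sense key 04 (HARDWARE ERROR)."""
    found = set()
    for line in lines:
        if not SENSE.search(line):
            continue  # not a hardware-error entry
        match = LOCATION.search(line)
        if match:
            found.add((int(match.group(1)), int(match.group(2))))
    return sorted(found)


if __name__ == "__main__":
    sample = [
        "B:10/27/10 7:11:30 AM : 4050 : 0/0/0 : 6008 : Internal : Drive : "
        "Tray 33, Slot 16 : Stable storage drive unusable Sense: 04 HARDWARE ERROR",
    ]
    for tray, slot in suspect_drives(sample):
        print(f"Tray {tray}, Slot {slot}: suspect drive, do not fail manually")
```

The sketch only identifies candidate drives. It does not contact the array, and a match should be confirmed against the full event details before a drive is pulled.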
Workaround or Resolution

To work around the described issue, avoid manually failing drives. This prevents the lockdown conditions that require service intervention.

To replace a drive under these conditions, use the steps below to avoid the accessibility and availability issues referenced in the symptoms section. In general, any configuration change made by the user should be avoided until the affected drive is removed from the system.

1. Physically remove ALL Global Hot Spares from the system. Do NOT unassign them. This step is necessary because any operation that removes a drive from the hot spare list through the user interface would provoke the defect.

2. Physically remove and replace the suspect disk(s). The Common Array Manager (CAM) Service Advisor removal and replacement procedures state that the drive fault LEDs should be lit and the status should be failed. Ignore this. The drive can be removed once its location is identified.

3. CAM: Use the disk replace procedure via CAM Service Advisor as outlined in “Service Advisor > Portable Virtual Disk Management > Replace a Disk Drive”. CAM allows the pulled drives to be replaced using the same tray/slot in which the new drives were inserted. Go directly to step 1 under “To Replace a Removed Disk Drive” and replace the drive with the same tray/slot as the one that was just physically replaced. Once the CAM drive replacement procedure has been completed, the volume group rebuild will start.

SANtricity: Select the Volume Group that contains the replacement drive. Select Volume Group -> Replace Drives. Select the replacement drive and replace it by itself.

4. Re-insert the Global Hot Spare drives. Any potentially failed hot spare drives should be unresponsive on insert and should not cause a controller reboot. If these were not previously identified during initial analysis, have the failed drives analyzed and replaced as necessary.
This issue is resolved in CAM 6.7.0 with Firmware Patch 145965-02/145966-02/145967-02 or later.

Modification History

Document created March 2
14-Mar-2011: Updated Likelihood of Occurrence section
06-Apr-2011: Updated Workaround section
02-May-2011: Updated Workaround/Resolution section.

If a lockdown condition does occur, evaluate and respond as follows:

For controllers showing:
5d or Sd on one controller and 88 on the other
or
5d or Sd on one controller and LU on the other

1. Power off the array.
2. Remove the controller showing 5d.
3. Power up.
4. Connect to the booted controller over the serial port and run lemClearLockdown, then sysReboot.
5. Wait for the controller to display the tray ID.
6. Insert the remaining controller.

For controllers showing:
LU on one controller and the tray ID (85 or 99) on the other

1. Connect to the LU controller over the serial port and run lemClearLockdown.
2. Use the management interface to bring the controller online.
3. The booting controller should show the same tray ID as the surviving one.

Internal Contributor/Submitter: [email protected]
Internal Eng Responsible Engineer: [email protected]
Internal Services Knowledge Engineer: [email protected]
Internal Eng Business Unit Group: NWS
Internal Escalation ID: 72785060, 73355762, 73346876, 73452350, 73430212, 73503190

Please send questions to the following email: [email protected] and copy the Responsible Engineer listed above.

Attachments

This solution has no attachment