Replacement of Drives with Mechanical Positioning Errors May Cause RAID Controllers to Reset or Lock Down Unexpectedly

Asset ID:	1-77-1300555.1
Update Date:	2012-10-08
Keywords:

Solution Type Sun Alert Sure

Solution 1300555.1 : Replacement of Drives with Mechanical Positioning Errors May Cause RAID Controllers to Reset or Lock Down Unexpectedly

Applies to:

Sun Storage 6180 Array - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 6540 Array - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 2510 Array - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage Flexline 380 Array - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 6140 Array - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.
This issue also applies to Sun Storage 2510, 2530, 2540, 6140, 6180, 6540, 6580, and 6780 Arrays.
______________________


Date of Resolved Release:
02-Mar-2011

Description

Drives that have mechanical positioning errors may cause a RAID controller in the products identified above to reboot when the controllers attempt to fail that drive. The drive is marked as failed once the controller completes start-of-day (SOD) processing.

Note: Attempting to manually fail an affected drive may cause a lockdown and an outage. This happens when the rebooting controller is unable to verify the DACstore on the drive while the drive is still reporting as optimal to the surviving controller.

Occurrence

This issue can occur on the following systems:

Sun StorageTek 6140 Arrays without Array Firmware 07.60.56.10 or later
Sun StorageTek 6540 Arrays without Array Firmware 07.60.56.10 or later
StorageTek Flexline 380 without Array Firmware 07.60.56.10 or later
Sun Storage 6180 Arrays without Array Firmware 07.60.56.10 or later
Sun Storage 6580 Arrays without Array Firmware 07.60.56.10 or later
Sun Storage 6780 Arrays without Array Firmware 07.60.56.10 or later
Sun StorageTek 2510 Arrays without Array Firmware 7.35.67.10 or later
Sun StorageTek 2530 Arrays without Array Firmware 7.35.67.10 or later
Sun StorageTek 2540 Arrays without Array Firmware 7.35.67.10 or later

This issue only occurs on arrays with one of the following drive models:

ST330055SSUN300G
ST330055FSUN300G
ST373455LC
ST3146855LC
ST3400008SS
ST3146855SS
ST373455SS
ST3300655LC
ST373455FC
ST3400008FC
ST3300655FC
ST3146855FC

Note: Arrays with 6.xx firmware are not affected by this issue. With that firmware, only a single controller issues the command to fail the drive, so no race condition exists.
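
The three exposure criteria above (array model, firmware level, and drive model) can be combined into a simple check. The following is a minimal sketch, assuming firmware versions can be compared field by field as dotted integers and that drive models are matched by exact string. The names FIXED_FIRMWARE, AFFECTED_DRIVE_MODELS, and is_exposed are hypothetical and are not part of CAM or SANtricity.

```python
# Minimum fixed firmware per array model, copied from the Occurrence list.
FIXED_FIRMWARE = {
    "6140": "07.60.56.10",
    "6540": "07.60.56.10",
    "Flexline 380": "07.60.56.10",
    "6180": "07.60.56.10",
    "6580": "07.60.56.10",
    "6780": "07.60.56.10",
    "2510": "7.35.67.10",
    "2530": "7.35.67.10",
    "2540": "7.35.67.10",
}

# Drive models listed in this alert.
AFFECTED_DRIVE_MODELS = {
    "ST330055SSUN300G", "ST330055FSUN300G", "ST373455LC", "ST3146855LC",
    "ST3400008SS", "ST3146855SS", "ST373455SS", "ST3300655LC",
    "ST373455FC", "ST3400008FC", "ST3300655FC", "ST3146855FC",
}


def parse_version(version):
    """Turn '07.60.56.10' into (7, 60, 56, 10) for field-by-field comparison."""
    return tuple(int(part) for part in version.split("."))


def is_exposed(array_model, firmware, drive_models):
    """Return True if the array meets all three exposure criteria."""
    fixed = FIXED_FIRMWARE.get(array_model)
    if fixed is None:
        return False  # model is not listed in this alert
    installed = parse_version(firmware)
    if installed[0] == 6:
        return False  # 6.xx firmware is not affected (see the note above)
    if installed >= parse_version(fixed):
        return False  # already at or above the fixed firmware level
    # Assumption: exact string match against the listed model names.
    return any(model in AFFECTED_DRIVE_MODELS for model in drive_models)
```

An array is reported as exposed only when the model is listed, the firmware is a 7.xx level below the fixed release, and at least one listed drive model is installed.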

Symptoms

The aforementioned disk drives are failing due to a mechanical positioning error as seen in the array event logs, similar to the following:

      B:10/27/10 7:11:30 AM : 4050 : 0/0/0 : 6008 : Internal : Drive : Tray 33, Slot 16 : Stable storage drive unusable
      A:10/27/10 7:10:47 AM : 4051 : 4/15/1 : 100A : Error : Drive : Tray 33, Slot 16 : Drive returned CHECK CONDITION : Mechanical Positioning Error
      A:10/27/10 7:10:49 AM : 4053 : 0/0/0 : 6008 : Internal : Drive : Tray 33, Slot 16 : Stable storage drive unusable

      Sense: 04 HARDWARE ERROR
      ASC/ASCQ:   15/01 MECHANICAL POSITIONING ERROR

This sense data identifies a drive mechanical positioning error. These drives in particular have a known issue in which the heads stay in one position after a drive error log update. This reduces the lubrication at the drive heads and leads to a head crash, which is indicated by the codes mentioned above.
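
When reviewing event logs in the colon-separated format shown above, the matching entries can be picked out programmatically. The sketch below assumes that the third " : "-separated field (for example 4/15/1) carries the sense key, ASC, and ASCQ as hexadecimal values. That reading is consistent with the Sense and ASC/ASCQ lines above, but it is an assumption about the log layout, and the function names are hypothetical.

```python
import re

# Sense key 04 (HARDWARE ERROR), ASC/ASCQ 15/01 (MECHANICAL POSITIONING ERROR).
MECH_POSITIONING = (0x04, 0x15, 0x01)

TRIPLET = re.compile(r"([0-9a-fA-F]+)/([0-9a-fA-F]+)/([0-9a-fA-F]+)")


def sense_triplet(record):
    """Return (sense key, ASC, ASCQ) from one log record, or None if absent."""
    # Collapse wrapped lines and repeated spaces before splitting into fields.
    fields = " ".join(record.split()).split(" : ")
    if len(fields) < 3:
        return None
    match = TRIPLET.fullmatch(fields[2].strip())
    if match is None:
        return None
    return tuple(int(group, 16) for group in match.groups())


def is_mechanical_positioning_error(record):
    """True if the record reports sense 04 with ASC/ASCQ 15/01."""
    return sense_triplet(record) == MECH_POSITIONING
```

Because whitespace is collapsed first, a record that wraps onto a second line can be passed in as a single string.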

Manually failing drives that are in an Optimal state, where this results in one or more volumes in the array failing, typically leads to one or both controllers resetting and possibly being held in a Lockdown state. This is caused by a problem accessing and updating the metadata on the disk drive that is reporting the error.

The lockdown state may show as LU, 88, or SD on one controller of a 6140, 6540, or Flexline 380. On a 6180, 6580, or 6780, the lockdown state may show as a flashing display sequence of OE+, LU+, blank-.

Note: Controllers in a lockdown or offline state should be serviced immediately by Oracle support for correction.

Automatic failure of drives due to write failures caused by the aforementioned error can also result in a controller reset, similar to the following:

      B     Sat Jan 01 16:30:54 PST 2011     54527     4/15/1     100A     Error     Drive     Tray.01.Drive.03
      Drive returned CHECK CONDITION - Mechanical Positioning Error
      B     Sat Jan 01 16:30:54 PST 2011     54528     204/15/1     1012     Error     Drive     Tray.01.Drive.03
      Destination driver event - Mechanical Positioning Error
      B     Sat Jan 01 16:30:54 PST 2011     54529     0/0/0     6008     Notification     Drive     Tray.01.Drive.03
      Stable storage drive unusable due to I/O errors
      B     Sat Jan 01 16:49:29 PST 2011     54530     0/0/0     100D     Error     Drive     Tray.01.Drive.03
      Timeout on drive side of controller
      B     Sat Jan 01 16:49:40 PST 2011     54531     0/0/0     100D     Error     Drive     Tray.01.Drive.03
      Timeout on drive side of controller
      B     Sat Jan 01 16:49:51 PST 2011     54532     0/0/0     100D     Error     Drive     Tray.01.Drive.03
      Timeout on drive side of controller
      B     Sat Jan 01 16:50:00 PST 2011     54533     201020b/0/0     1012     Error     Drive     Tray.01.Drive.03
      Destination driver event - IO timeout
      B     Sat Jan 01 16:50:00 PST 2011     54534     0/0/0     201E     Notification     Controller
      Tray.85.Controller.B     VDD repair started
      B     Sat Jan 01 16:50:00 PST 2011     54535     0/0/0     201E     Notification     Controller
      Tray.85.Controller.B     VDD repair started
      B     Sat Jan 01 16:50:00 PST 2011     54536     0/0/0     201E     Notification     Controller
      Tray.85.Controller.B     VDD repair started
      B     Sat Jan 01 16:50:00 PST 2011     54537     0/0/0     2014     Notification     Controller
      Tray.85.Controller.B     VDD logged an error
      B     Sat Jan 01 16:50:00 PST 2011     54538     0/0/0     201F     Notification     Controller
      Tray.85.Controller.B     VDD repair completed
      B     Sat Jan 01 16:50:00 PST 2011     54539     0/0/0     201F     Notification     Controller
      Tray.85.Controller.B     VDD repair completed
      B     Sat Jan 01 16:50:00 PST 2011     54540     0/0/0     201F     Notification     Controller
      Tray.85.Controller.B     VDD repair completed
      B     Sat Jan 01 16:50:01 PST 2011     54541     0/0/0     2226     Notification     Drive
      Tray.01.Drive.03     Drive spun down
      B     Sat Jan 01 16:50:01 PST 2011     54542     0/0/0     226C     Failure     Drive     Tray.01.Drive.03
      Drive failure detected
      B     Sat Jan 01 16:50:01 PST 2011     54543     0/0/0     2215     Notification     Drive     Tray.01.Drive.03
      Drive marked failed
      B     Sat Jan 01 16:50:01 PST 2011     54544     0/0/0     2217     Notification     Drive     Tray.01.Drive.03
      Piece failed
      B     Sat Jan 01 16:50:01 PST 2011     54545     0/0/0     2216     Notification     Drive     Tray.01.Drive.03
      Piece taken out of service
      B     Sat Jan 01 16:50:01 PST 2011     54546     0/0/0     2217     Notification     Drive     Tray.01.Drive.03
      Piece failed
      B     Sat Jan 01 16:50:02 PST 2011     54547     0/0/0     100D     Error     Drive     Tray.01.Drive.03
      Timeout on drive side of controller
      B     Sat Jan 01 16:51:02 PST 2011     54548     0/0/0     400F     Notification     Controller
      Tray.85.Controller.A     Controller reset by its alternate Reboot Reason: REBOOTALT_DBM_HEALTH_CHECK_EVENT
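
The pattern in the excerpt above (a drive reports a Mechanical Positioning Error and a controller is later reset by its alternate) can be searched for in a saved event log. The following is a rough sketch, assuming the log is saved as text in the layout shown, where each record starts with the controller letter and a timestamp and may wrap onto a second line. The function names are hypothetical, and the string matches are taken from the excerpt rather than from any documented log specification.

```python
import re

# A record starts with the controller letter and a timestamp, e.g. "B   Sat Jan 01 ".
RECORD_START = re.compile(r"^[AB]\s+\w{3} \w{3} \d{2} ")
DRIVE_ID = re.compile(r"Tray\.\d+\.Drive\.\d+")


def join_records(log_text):
    """Merge wrapped continuation lines back into one string per record."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def summarize(log_text):
    """Return (suspect drives, whether a controller reset followed)."""
    suspects = []
    reset_seen = False
    for record in join_records(log_text):
        drive = DRIVE_ID.search(record)
        if drive and "Mechanical Positioning Error" in record:
            if drive.group() not in suspects:
                suspects.append(drive.group())
        # Only count a reset that appears after a suspect drive was logged.
        if suspects and "Controller reset by its alternate" in record:
            reset_seen = True
    return suspects, reset_seen
```

A non-empty suspect list together with a reset flag suggests the log matches the symptom described here. It is a screening aid only and does not replace analysis of the full support data.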

Note: A drive being failed by the system does not usually result in a lockdown or offline controller state. After a power cycle or controller reset, the drives often transition to a state of INCOMPATIBLE.

Note: The possible symptoms that can occur when this issue is encountered will vary depending on the hardware configuration as well as the logical layout of the vdisks and volumes.

Workaround

To work around the described issue, avoid manually failing drives. This prevents the lockdown conditions that require service intervention. To replace a drive under these conditions, use the steps below to avoid the accessibility and availability issues referenced in the Symptoms section. In general, any configuration change made by the user should be avoided until an affected drive is removed from the system.

1. Physically remove ALL Global Hot Spares from the system. Do NOT unassign them. This step is necessary to prevent the occurrence of the defect as any operation to remove a drive from the hot spare list through the user interface would provoke the defect.

2. Physically remove and replace the suspect disk(s). The Common Array Manager (CAM) Service Advisor removal and replacement procedures (CAM Service Advisor, left pane, "Disk Drive Removal/Replacement" section) state that the drive fault LEDs should be lit and that the drive status should be failed. Ignore these prerequisites. The drive can be removed once its location is identified.

3. Complete the replacement in the management software in use:

CAM:
Use the disk replacement procedure in CAM Service Advisor, as outlined in "Service Advisor > Portable Virtual Disk Management > Replace a Disk Drive". CAM allows a pulled drive to be replaced using the same tray/slot into which the new drive was inserted. Go directly to step 1 under "To Replace a Removed Disk Drive" and replace the drive using the same tray/slot as the drive that was just physically replaced. Once the CAM drive replacement procedure has completed, the volume group rebuild starts.

SANtricity:
Select the Volume Group which contains the replacement drive. Select Volume Group -> Replace Drives. Select the replacement drive and replace it by itself.

4. Re-insert the Global Hot Spare drives. Any hot spare drive that has itself failed should be unresponsive on insertion and should not cause a controller reboot. If such drives were not identified during the initial analysis, have their failure analyzed and replace them as necessary.
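
The ordering constraint behind the steps above is that only physical actions are taken while a suspect drive is still installed, and any user-initiated configuration change waits until it has been pulled. The sketch below illustrates that rule as a check over a planned list of actions. All action names are hypothetical labels invented for this example and do not correspond to CAM or SANtricity commands.

```python
# Hypothetical labels for the physical actions used in the workaround.
PHYSICAL_ACTIONS = {
    "pull_hot_spare",
    "pull_suspect_drive",
    "insert_replacement_drive",
    "reinsert_hot_spare",
}
SUSPECT_REMOVED = "pull_suspect_drive"


def unsafe_steps(plan):
    """Return configuration changes planned before the suspect drive is pulled."""
    flagged = []
    suspect_present = True
    for step in plan:
        if step == SUSPECT_REMOVED:
            suspect_present = False
        elif suspect_present and step not in PHYSICAL_ACTIONS:
            # Anything that is not a physical action is treated as a
            # configuration change, e.g. a manual fail or a hot spare unassign.
            flagged.append(step)
    return flagged
```

A plan that follows steps 1 through 4 yields an empty list, while a plan that unassigns a hot spare or manually fails the drive first is flagged.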

This issue is resolved in CAM 6.8.1 or later.

Note: Please refer to Doc ID 1296274.1 for information on how to download Common Array Manager (CAM) software and patches.

History

Document created March 2

14-Mar-2011: Updated Likelihood of Occurrence section
06-Apr-2011: Updated Workaround section
02-May-2011: Updated Workaround/Resolution section.
06-Mar-2012: Updated Note 2 in the Workaround/Resolution section.
12-Mar-2012: Added note to Symptoms section
11-Sep-2012: Updated the "Note" in the Occurrence section.
21-Sep-2012: Updated Likelihood of Occurrence section.
08-Oct-2012: Updated Likelihood of Occurrence and Workaround sections.

The firmware versions that were originally listed in the Sun Alert at the time of release were incorrect. The CR was updated with the correct firmware versions after the Sun Alert had been released. The Sun Alert was updated because of a rework of the original bug to remove side effects. Had the side effect been known at the time, it would have been fixed in the same original CR. This is not a new bug but rather a completion of the original.

If a lockdown condition does occur, evaluate and respond as follows:

For controllers showing:
5d or Sd on one controller and 88 on the other
or
5d or Sd on one controller and LU on the other
1. Power off the array.
2. Remove the controller showing 5d or Sd.
3. Power up the array.
4. Connect to the booted controller over the serial port and run lemClearLockdown, then sysReboot.
5. Wait for the controller to display the tray ID.
6. Insert the remaining controller.

For controllers showing:
LU on one controller and the tray ID (85 or 99) on the other
1. Connect to the LU controller over the serial port and run lemClearLockdown.
2. Use the management interface to place the controller online.
3. The booting controller should show the same tray ID as the surviving one.
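
The choice between the two recovery procedures above depends only on the pair of display codes. The following sketch shows that decision as a lookup, assuming that 5d and Sd are the same display reading and that the tray IDs are 85 or 99 as stated above. The function name and the returned labels are hypothetical. As noted in the Symptoms section, controllers in a lockdown or offline state should be serviced by Oracle support.

```python
def normalize(code):
    """Treat '5d' and 'Sd' as the same seven-segment reading."""
    code = code.strip().upper()
    return "SD" if code == "5D" else code


def recovery_procedure(code_a, code_b, tray_ids=("85", "99")):
    """Pick a procedure from the display codes of the two controllers."""
    codes = {normalize(code_a), normalize(code_b)}
    # 5d/Sd on one controller and 88 or LU on the other: power-cycle procedure.
    if "SD" in codes and codes & {"88", "LU"}:
        return "power_cycle"
    # LU on one controller and a tray ID on the other: clear lockdown, then online.
    if "LU" in codes and codes & set(tray_ids):
        return "clear_lockdown_and_online"
    return None  # combination not covered by this document
```

Any combination that returns None falls outside the two cases listed here and should be escalated rather than guessed at.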

Internal Contributor/Submitter: [email protected]
Internal Eng Responsible Engineer: [email protected]
Internal Services Knowledge Engineer: [email protected]
Internal Eng Business Unit Group: NWS
Internal Escalation ID: 72785060, 73355762, 73346876, 73452350, 73430212, 73503190
Please send questions to the following email:
[email protected]
and copy the Responsible Engineer listed above

References

@ <BUG:6978258> - ST2:WORKFLOW AVAILABILITY FOR MZ3ST210 ENVIRONMENT
@ <BUG:7012554> - EM R2 : SDC78041SVQE: HOST UNREACHABLE
<NOTE:1296274.1> - How to Download Common Array Manager (CAM) Software and Patches
