Asset ID: 1-72-1392228.1
Update Date: 2012-09-26
Solution Type: Problem Resolution Sure Solution
1392228.1: Pillar Axiom: Fibre Channel Brick RAID Rebuilds Subsequent to a Drive Failure May, on Rare Occasions, Rebuild to an Incorrect Spare Drive
Related Items
- Pillar Axiom 300 Storage System
- Pillar Axiom 600 Storage System
- Pillar Axiom 500 Storage System

Related Categories
- PLA-Support>Sun Systems>DISK>Pillar Axiom>SN-DK: Ax600
A race condition may occur when a drive in an FC Brick momentarily goes offline and then returns before the two RAID controllers (CUs) can synchronize the drive failure status. If this occurs, the CU that detected the drive going offline may initiate recovery to the spare drive while the other CU does not. Because the two controllers then disagree on the members of the RAID array, data corruption continues to occur.
Applies to:
Pillar Axiom 300 Storage System - Version Not Applicable to Not Applicable [Release N/A]
Pillar Axiom 500 Storage System - Version Not Applicable to Not Applicable [Release N/A]
Pillar Axiom 600 Storage System - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.
N/A
Symptoms
Axiom Fibre Channel Brick RAID rebuilds subsequent to a drive failure may, on rare occasions, rebuild to an incorrect spare drive.
Synopsis
Axiom Fibre Channel Brick RAID rebuilds subsequent to a drive failure may, on rare occasions, rebuild to an incorrect spare drive.
Affected Hardware and Software Versions
This issue affects all Axiom systems with Fibre Channel disk drives that are running Axiom releases 02.04.02 through 04.03.17 or 05.00.00 through 05.00.06. The defect does not affect SATA drives, nor systems running any Axiom release prior to 02.04.02, R4 releases 04.03.18 and later, or releases after 05.00.06.
Hardware | Software
Ax600/Ax500/Ax300 with FC or FC V2 Bricks | 02.04.02 to 04.03.17; 05.00.00 to 05.00.06
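As a quick self-check, the affected-release ranges listed above can be expressed as a small Python helper. This is a minimal sketch, not a Pillar tool: the names `parse_release`, `is_affected`, and `AFFECTED_RANGES` are hypothetical, the code assumes releases are written as three dotted numeric fields (for example `04.03.17`), and it encodes only the Fibre Channel ranges stated in this alert.

```python
from typing import List, Tuple

Version = Tuple[int, int, int]

# Affected FC ranges as stated in this alert, inclusive on both ends.
# (Hypothetical helper data, transcribed from the table above.)
AFFECTED_RANGES: List[Tuple[Version, Version]] = [
    ((2, 4, 2), (4, 3, 17)),
    ((5, 0, 0), (5, 0, 6)),
]


def parse_release(release: str) -> Version:
    """Parse a dotted Axiom release string such as '04.03.17' into integers."""
    parts = release.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three dotted fields, got {release!r}")
    major, minor, patch = (int(p) for p in parts)
    return (major, minor, patch)


def is_affected(release: str, has_fc_drives: bool) -> bool:
    """Return True if the system matches the affected configuration.

    SATA-only systems are not affected, so the system must contain FC drives.
    """
    if not has_fc_drives:
        return False
    version = parse_release(release)
    # Tuple comparison is lexicographic: (major, minor, patch).
    return any(low <= version <= high for low, high in AFFECTED_RANGES)


if __name__ == "__main__":
    for rel in ["02.04.01", "02.04.02", "04.03.17", "04.03.18", "05.00.06", "05.00.07"]:
        print(rel, is_affected(rel, has_fc_drives=True))
```

Comparing parsed integer tuples rather than raw strings avoids depending on zero-padding; because Python compares tuples lexicographically, (4, 3, 18) sorts after (4, 3, 17) and before (5, 0, 0), so 04.03.18 correctly falls outside both ranges.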
Changes
N/A
Cause
Problem
There was a very small corner case in which the firmware on the two companion RAID controllers of the Axiom Fibre Channel Brick could choose different target drives for a rebuild. This could occur only if a candidate target drive suddenly went offline during the rebuild preparation phase and then came back online within one second. The case is exceedingly rare: a drive must fail in a very specific way during a very small timing window.
Should this condition occur, the two RAID controllers in the Fibre Channel or Fibre Channel V2 Brick will not agree on the members of the RAID array. The CU that detected the drive going offline removes that drive from the array, begins recovery to the spare, and then uses the spare. The other CU, which did not receive the drive status update, continues to use the drive that briefly went offline and returned.
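The disagreement described above can be illustrated with a deliberately simplified toy model in Python. It is not Pillar firmware logic: the `ControllerView` class, the drive names, and the rule "replace a member this CU saw go offline with the first available spare" are hypothetical assumptions, chosen only to show how two controllers with different views of a single transient event end up with different array membership.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ControllerView:
    """Hypothetical toy model of one RAID controller's (CU's) view of an array."""
    name: str
    members: List[str]
    spares: List[str]
    # Drives this CU observed going offline. A CU that missed a brief
    # offline/online transition simply has nothing recorded here.
    seen_offline: Set[str] = field(default_factory=set)

    def resolve_members(self) -> List[str]:
        """Toy rule: replace each member this CU saw go offline with the next spare."""
        available = list(self.spares)  # copy so the shared spare list is not mutated
        resolved = []
        for drive in self.members:
            if drive in self.seen_offline and available:
                resolved.append(available.pop(0))
            else:
                resolved.append(drive)
        return resolved


def members_agree(a: ControllerView, b: ControllerView) -> bool:
    """Both CUs must compute identical membership for the array to stay consistent."""
    return a.resolve_members() == b.resolve_members()


if __name__ == "__main__":
    members = ["D1", "D2", "D3", "D4"]
    spares = ["S1", "S2"]
    # CU0 saw D3 drop for under a second; CU1 missed the transition entirely.
    cu0 = ControllerView("CU0", members, spares, seen_offline={"D3"})
    cu1 = ControllerView("CU1", members, spares)
    print(cu0.name, cu0.resolve_members())
    print(cu1.name, cu1.resolve_members())
    print("agree:", members_agree(cu0, cu1))
```

With the sample data, CU0 resolves the array as D1, D2, S1, D4 while CU1 still resolves D1, D2, D3, D4, so `members_agree` returns False, mirroring the two controllers disagreeing on the RAID array members.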
Although the defect has existed for some time, it was encountered only recently.
IMPORTANT!
Contact Pillar Data Systems Customer Support immediately if this issue is encountered.
Solution
Solution/Workaround
Software fixes that mitigate the problem scenario are included in Axiom releases 04.03.18 and above (R4, e.g. 04.05.00) and 05.00.07 and above (R5, including all 05.02.xx and later releases). Upgrading to the currently recommended software level prevents this issue and is highly recommended.
Axiom customers running releases other than the recommended ones are urged to contact Pillar Support to open a Service Request, review their upgrade options, and schedule an upgrade for remediation.
Common Questions
Question: How much risk is there?
Answer: Although triggering this issue depends on a specific set of conditions, it is Pillar's position that any risk to data is too much. It is therefore strongly recommended that you take appropriate steps to remove that risk as soon as possible.
Question: How long have you known about this? Why didn’t we hear about this before?
Answer: This issue was discovered recently and a solution provided. Pillar is proactively issuing this alert following confirmation of risk.
Question: Is the fix non-disruptive?
Answer: Yes. Customers already on Axiom release 04.00.xx, 04.02.xx, 04.03.xx, or 05.00.xx can upgrade non-disruptively. Customers on releases earlier than R4.0 require a disruptive upgrade. Contact the Support Center for details.