Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Sun Alert Sure Solution

1001167.1 : Sun Cluster 3.x Nodes May Panic Upon Controller Failure/Replacement Within Sun StorEdge 3510/3511 Arrays
Previously Published As: 201561
Product: Sun StorageTek 3510 FC Array, Sun StorageTek 3511 SATA Array
Bug Id: <SUNBUG: 6321239>, <SUNBUG: 6365819>
Date of Workaround Release: 01-DEC-2005
Date of Resolved Release: 15-JUN-2006

Impact
Upon a controller failure or replacement on a Sun StorEdge 3510/3511 array, all of the nodes connected in a Sun Cluster 3.x environment may panic.

Contributing Factors
This issue can occur on the following platforms: SPARC Platform
This issue will only occur in cluster configurations that issue SCSI-2 reservations (for example, two-node clusters), including configurations where LUN filtering is enabled.
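Because the exposure is tied to SCSI-2 fencing, it can help to confirm that the cluster is in fact a two-node configuration and to identify which shared DID devices are backed by the 3510/3511 LUNs. The following console sketch uses standard Sun Cluster 3.x commands; the node prompt is hypothetical and no output is reproduced here:

    # Confirm the cluster membership (two-node clusters use SCSI-2
    # reservations for fencing):
    node1# scstat -n

    # List the DID instances and their device paths from every node; the DID
    # instances that map to the same 3510/3511 LUN from both nodes are the
    # shared devices on which the SCSI-2 reservations are placed:
    node1# scdidadm -L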
Symptoms
Should the described issue occur, all the nodes of the cluster will panic with a Reservation Conflict similar to the following:

    sun50-node:/var/crash/sun50-node #
    panic[cpu19]/thread=2a1001e5d20: Reservation Conflict

    000002a1001e57d0 ssd:ssd_mhd_watch_cb+c4 (3000c318808, 0, 7600000042, 300172f5578, 300172f55a8, 0)
      %l0-3: 000000007829c740 00000300164f3d98 000003000081f2c8 000003000727e320
      %l4-7: 0000030005b9b5e8 0000000000000000 0000000000000002 00000000ff08dffc
    000002a1001e5880 scsi:scsi_watch_request_intr+140 (0, 0, 30015f873c0, 300164f3d98, 0, 300172f5530)
      %l0-3: 000000001034aadc 000003000081f2c8 0000030005b9b5e8 000003002cdc0f30
      %l4-7: 000003000727e320 00000000782a81ac 00000300172f55a8 000003002b73e000
    000002a1001e5950 qlc:qlc_task_thread+698 (300008207f0, 300008207e8, ff00, 300008207f2, 783c3240, 783c3250)
      %l0-3: 000000007829920c 00000000783c3260 00000000783c3270 000000000001ff80
      %l4-7: 0000030000820ac0 000003100b5ef180 00000300008207e8 00000300008207c8
    000002a1001e5a60 qlc:qlc_task_daemon+70 (300008207e8, 300008207c8, 300008207f2, 104640c0, 30000820af8, 30000820afa)
      %l0-3: 0000030000820ae0 00000310002fdb20 0000000000000000 0000000010408000

Note: After the nodes boot following a panic, they will not be able to see the LUNs from the Sun StorEdge 3510/3511 array. Both nodes will show the drives as "<drive not available: reserved>" when using format(1M). Only after the Sun StorEdge 3510/3511 array is reset and the nodes are rebooted will everything return to normal.

The following example shows how the controller firmware's handling of the reservation of the nexus (controller, target, LUN) at the LUN level can cause the reservation conflict when LUN filtering is used.

Example from a "show map" output:

    Ch  Tgt  LUN  ld/lv  ID-Partition  Assigned  Filter Map
    -------------------------------------------------------------------
     0   40    0   ld0   24A193C9-00   Primary   210000E08B13AC6F {HBA-1}
     0   40    1   ld0   24A193C9-02   Primary   210000E08B13AC6F {HBA-1} <--
     0   40    1   ld0   24A193C9-02   Primary   210000E08B133FC2 {HBA-2} <--
     0   40    2   ld0   24A193C9-03   Primary   210000E08B13AC63 {HBA-3} <--
     0   40    2   ld0   24A193C9-05   Primary   210000E08B133FC4 {HBA-4} <--

Note that LUN #1 is mapped for the same partition, 24A193C9-02, to two different initiators/HBAs, {HBA-1} and {HBA-2}. Note that LUN #2 is mapped for two different partitions, 24A193C9-03 and 24A193C9-05, to two different initiators/HBAs, {HBA-3} and {HBA-4}. During a controller failure/reset, a reservation on one nexus can assert itself on the other nexus with the same LUN number.

There have been a few cases reported in which the process of logical drive partition/repartition can cause the reservation panic. While the issue with controller failure/reset is understood, root cause analysis of the partition/repartition issue is still in progress.

Workaround
To work around the described issue, disable LUN filtering and use switch zoning. Instructions for LUN filtering can be found in:

Sun StorEdge 3000 Family CLI 2.x User's Guide at:
http://www.sun.com/products-n-solutions/hardware/docs/html/817-4951-14

Sun StorEdge 3000 Family RAID Firmware 4.1x User's Guide at:
http://www.sun.com/products-n-solutions/hardware/docs/html/817-3711-14

For switch zoning, consult the corresponding switch manufacturer's documentation.

Resolution
This issue is addressed on the following platforms: SPARC Platform, with the controller firmware patches listed under References below (113723-15 and 113724-09).
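As a practical illustration of the Symptoms and recovery notes above: after the panic the shared LUNs remain reserved, and recovery requires resetting the array and rebooting the nodes. The sketch below is illustrative only; the format(1M) string is the one quoted in the Symptoms section, the device names are hypothetical, and the sccli subcommand is an assumption based on the Sun StorEdge 3000 Family CLI documentation referenced in the Workaround (verify it against that guide before use):

    # After the post-panic reboot, format(1M) on either node reports the
    # 3510/3511 LUNs as reserved (hypothetical device name shown):
    node1# format
        ...
        2. c4t40d1 <drive not available: reserved>
        ...

    # Recovery per the Symptoms note: reset the array, then reboot both
    # nodes. Assumed in-band reset through the SE3000 Family CLI:
    node1# sccli /dev/rdsk/c4t40d0s2 reset controller
    node1# init 6    # repeat the reboot on the other node as well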
Modification History
Date: 12-JAN-2006
Date: 25-APR-2006
Date: 15-JUN-2006
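To check whether the resolution patches listed under References below have been installed on a given host (the controller firmware itself must still be downloaded to the array as described in the patch README), a simple hedged check is:

    # showrev(1M) lists patches installed with patchadd(1M); look for the
    # SE3510 and SE3511 controller firmware patches named in this alert:
    node1# showrev -p | grep 113723
    node1# showrev -p | grep 113724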
References
<SUNPATCH: 113723-15>
<SUNPATCH: 113724-09>

Previously Published As
102067

Internal Comments
The issue associated with BugID 6321239 will be resolved in a firmware release around the 2nd quarter of 2006. The issue associated with BugID 6365819 is currently under investigation.

The following Sun Alerts have information about other known issues for the 3000 series products:

102011 - Sun StorEdge 33x0/3510 Arrays May Report a Higher Incidence of Drive Failures With Firmware 4.1x SMART Feature Enabled
102067 - Sun Cluster 3.x Nodes May Panic Upon Controller Failure/Replacement Within Sun StorEdge 3510/3511 Arrays
102086 - Failed Controller Condition May Cause Data Integrity Issues
102098 - Insufficient Information for Recovery From Double Drive Failure for Sun StorEdge 33x0/35xx Arrays
102126 - Recovery Behavior From Fatal Drive Failure May Lead to Data Integrity Issues
102127 - Performance Degradation Reported in Controller Firmware Releases 4.1x on Sun StorEdge 3310/351x Arrays for All RAID Types and Certain Patterns of I/O
102128 - Data Inconsistencies May Occur When Persistent SCSI Parity Errors are Generated Between the Host and the SE33x0 Array
102129 - Disks May be Marked as Bad Without Explanation After "Drive Failure," "Media Scan Failed" or "Clone Failed" Events

Note: One or more of the above Sun Alerts may require a Sun Spectrum Support Contract to log in to a SunSolve Online account.

The issue with the controller failure has been duplicated in the lab. The issue with partition/repartition has NOT been reproduced yet, and root cause analysis is still in progress. Test results indicate that when a controller fails and there are existing SCSI-2 reservations, a reservation may be incorrectly set on a nexus. This causes a loss of access to the path and causes the cluster nodes to panic. Upon the controller failure the device is reset by the host, along with an implicit fabric logout, which clears the existing RESERVE(6) reservations. All testing to date indicates this is only an issue with SE3510 two-node clusters, due to the use of the RESERVE(6) command. The investigation into root cause is in progress. This issue was not seen with firmware 3.27R, but this cannot be verified. Host I/O does not need to be present; the issue can occur on a newly rebooted server.

Internal Contributor/Submitter: [email protected]
Internal Eng Business Unit Group: NWS (Network Storage)
Internal Eng Responsible Engineer: [email protected]
Internal Services Knowledge Engineer: [email protected]
Internal Escalation ID: 1-11214948, 1-13027714, 1-13069641
Internal Resolution Patches: 113723-15, 113724-09
Internal Sun Alert Kasp Legacy ID: 102067

Internal Sun Alert & FAB Admin Info
Critical Category: Availability ==> HA-Failure
Significant Change Date: 2005-12-01, 2006-06-15
Avoidance: Patch, Workaround
Responsible Manager: [email protected]
Original Admin Info:
[WF 19-Jun-2006, Jeff Folla: Changed Audience from Contract to Free.]
[WF 15-Jun-2006, Jeff Folla: All patches are available. This is now resolved. Sent for re-release.]
[WF 12-Jun-2006, Dave M: FW released this week; updating with 6 other alerts to publish together for the coordinated FW patch release.]
[WF 25-Apr-2006, Jeff Folla: FW 4.15F patch now available. Sent to publish.]
[WF 14-Apr-2006, Dave M: Updated in anticipation of the FW 4.15F release, per NWS and PTS engineers.]
[WF 12-Jan-2006, Dave M: OK to republish.]
[WF 03-Jan-2005, Dave M: Updating for re-release per Storage group and Executive review.]
[WF 01-Dec-2005, Jeff Folla: Sent for release.]
[WF 30-Nov-2005, Jeff Folla: Sent for review.]

Product_uuid
58553d0e-11f4-11d7-9b05-ad24fcfd42fa | Sun StorageTek 3510 FC Array
9fdbb196-73a6-11d8-9e3a-080020a9ed93 | Sun StorageTek 3511 SATA Array

Attachments
This solution has no attachment