Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction Sure
Solution 1006856.1 : Troubleshooting StorEdge[TM] 351x Redundant Loop Failures
Previously Published As 209520

Description

Symptoms:
Purpose/scope: This is a subset of <Document: 1011431.1> : "Troubleshooting Sun StorEdge[TM] 33x0/351x Hardware". The steps below help verify and resolve Fibre Channel redundant path problems.

Steps to Follow

Step 1 - Check the event log or the persistent event log and verify that there are no redundant loop failures, which may or may not be accompanied by multiple drive failures on the same loop. Issue the sccli> show eventlog or sccli> show persistent-eventlog command.

For example, on the 3510:

sccli> show eventlog
Mon Jul 17 08:06:00 2006
[113f] #9: StorEdge Array SN#8011523 CH2: ALERT: redundant loop failure detected (ALT Surviving CH3)
Mon Jul 17 06:52:59 2006
[113f] #10: StorEdge Array SN#8011523 CH2: ALERT: redundant loop failure detected (ALT Surviving CH3)
Mon Jul 17 08:06:10 2006
[113f] #11: StorEdge Array SN#8011523 CH2: NOTICE: fibre channel loop connection restored
Mon Jul 17 06:53:08 2006
[113f] #12: StorEdge Array SN#8011523 CH2: NOTICE: fibre channel loop connection restored
Mon Jul 17 08:06:34 2006
[113f] #13: StorEdge Array SN#8011523 CH2: ALERT: redundant loop failure detected (ALT Surviving CH3)
Mon Jul 17 08:16:43 2006
...
[2101] #19: LD-ID 436CE267 on StorEdge Array SN#8011523: ALERT: SCSI drive failure (CH2 ID42)
Mon Jul 17 08:16:43 2006
[2101] #20: LD-ID 72BE7D18 on StorEdge Array SN#8011523: ALERT: SCSI drive failure (CH2 ID22)
Mon Jul 17 08:16:46 2006
[2101] #21: LD-ID 00000000 on StorEdge Array SN#8011523: ALERT: SCSI drive failure (CH2 ID5)
Mon Jul 17 08:16:46 2006
[2101] #22: LD-ID 72BE7D18 on StorEdge Array SN#8011523: ALERT: SCSI drive failure (CH2 ID25)
Mon Jul 17 08:16:50 2006
[2101] #23: LD-ID 436CE267 on StorEdge Array SN#8011523: ALERT: SCSI drive failure (CH2 ID43)
Mon Jul 17 08:16:54 2006

Step 2 - Issue the sccli> show disks command to verify that multiple drives on the same loop are not BAD.

Failure example on 3510:

sccli> show disks
Ch Id Size Speed LD Status IDs Rev

Step 3 - Ensure that the ID switch settings are unique per enclosure and that disk IDs are identified correctly, as described in <Document: 1007692.1> : Sun StorEdge[TM] 351x FC Array switch settings and disk Ids.

Step 4 - Verify that the diagnostic Invalid Transmission Word counters for the RAID devices are not increasing by comparing, over time, the output of the following sccli commands for each channel:

- show diag error channel 2
- show diag error channel 3

OR

If sccli is not available, or to capture data during I/O activity, check the Fibre Channel Error Statistics using the firmware interface, as described in the "Fibre Channel Error Statistics (FC and SATA Only)" section of the Sun StorEdge[TM] 3000 Family RAID Firmware 4.2x User's Guide. Monitor the following values for sharp increases on the RAID devices during I/O activity:

InvalTXWord. Total number of instances of invalid transmission words. This error indicates either an invalid transmit word or a disparity error.

InvalCRC. Total number of instances of invalid CRC, or the number of times a frame was received and the CRC was not as expected.
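The over-time comparison described in Step 4 can be sketched as a small script. This is a minimal illustration, not a Sun tool: it assumes the counter values have already been transcribed by hand from two runs of show diag error channel 2 (or from the firmware statistics screen) into dictionaries keyed by device ID. The snapshot layout, the function name increasing_counters, and all of the sample values are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot layout: {device_id: {counter_name: value}}.
# The values are assumed to be copied by hand from two captures taken
# some time apart; this sketch does not parse sccli output.
Snapshot = Dict[int, Dict[str, int]]

# Counter names taken from the firmware statistics named in Step 4.
WATCHED = ("InvalTXWord", "InvalCRC")


def increasing_counters(earlier: Snapshot, later: Snapshot) -> List[Tuple[int, str, int]]:
    """Return (device_id, counter, delta) for each watched counter that grew."""
    grown = []
    for device_id in sorted(later):
        before = earlier.get(device_id, {})
        for name in WATCHED:
            delta = later[device_id].get(name, 0) - before.get(name, 0)
            if delta > 0:
                grown.append((device_id, name, delta))
    return grown


# Made-up sample values for illustration only; not from a real array.
first = {
    0: {"InvalTXWord": 3, "InvalCRC": 0},
    14: {"InvalTXWord": 120, "InvalCRC": 0},
    15: {"InvalTXWord": 98, "InvalCRC": 2},
}
second = {
    0: {"InvalTXWord": 3, "InvalCRC": 0},
    14: {"InvalTXWord": 4510, "InvalCRC": 0},
    15: {"InvalTXWord": 3977, "InvalCRC": 9},
}

for device_id, name, delta in increasing_counters(first, second):
    print(f"device {device_id}: {name} increased by {delta}")
```

A device that appears only in the later snapshot is reported with its full counter value as the delta. Counters that were reset between captures are not handled here and would need a separate check.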
For example, the following 3510 RAID device has high invalid transmission counts for the channel 2 controller (device IDs 14 and 15):

sccli> show diag error channel 2

If the counters are increasing:

- Investigate the back-end loop device order to understand what is just before any devices showing high error counts.
- Investigate the device just BEFORE the device reporting high error counts.
- If there are invalid transmission counts or CRC errors for RAID devices 14 and 15, this may indicate a mis-seated or marginal component.

Step 5 - Issue the sccli> show channels command and ensure that all of the configured ports are running at the correct speed.

3510 example where Loop B is at an incorrect speed:

sccli> show channels
Ch Type Media Speed Width PID / SID

Step 6 - Issue the sccli> show enclosure-status command to ensure that both loops A and B are visible.

Step 7 - Issue the following sccli commands and verify that both controllers and all devices are visible on each loop, and again verify that the device IDs are correct:

- show loop-map channel 3
- show loop-map channel 2

For example, on 3510:

sccli> show loop-map channel 2
sccli: selected devi
PORT ENCL-ID ENCL-TYPE LOOP BYP-STATUS ATTRIBUTES

There are additional show bypass commands that can be used to verify device and RAID status. For example:

sccli> show bypass raid

Refer to the Sun StorEdge[TM] 3000 Family RAID Firmware 4.2x User's Guide for details.

Step 9 - Check the sccli> show fru output to confirm that there are no N/A or absent components on the loop, specifically the IOM or controller. If JBODs are attached, confirm that they are visible as well.

For example, on 3510, the show fru output does not show the lower IOM (ch3) on the RAID array:

Step 10 - For failures as described above, troubleshooting steps include:

- Verify that the IOM/controller is not mis-seated or failed. Refer to <Document: 1002641.1> : Troubleshooting the StorEdge[TM] 33x0/351x Controller
- Verify cabling is correct.
  Refer to <Document: 1008193.1> : Troubleshooting StorEdge[TM] 351x Cabling
- Verify there are no unused SFPs in drive channels 2 and 3 on each controller.
- Verify firmware levels for the controller, PLD and SES are at the latest revision.
- Hardware components that may need reseating include: SFP, cable, disk(s), controller/IOM.

Step 11 - If the hardware fault persists, gather the latest explorer information and escalate appropriately.

Step 12 - If no problems were found during the course of this document, refer back to <Document: 1011431.1> : Troubleshooting Sun StorEdge 33x0/351x Hardware.

Product
Sun StorageTek 3511 SATA Array
Sun StorageTek 3510 FC Array
Sun StorageTek 3510 2U FC Array

Internal Comments
This document contains normalized content and is managed by the Domain Lead(s) of the respective domains. To notify content owners of a knowledge gap contained in this document, and/or prior to updating this document, please contact the domain engineers that are managing this document via the “Document Feedback” alias(es) listed below:

[email protected]

fibre channel, path failure, loop down, loop up, 3510, 3511, 351x, normalized

Previously Published As 89049

Change History
Date: 2010-01-20
User Name: [email protected]
Action: Externalized
Comment: Checked, made contract customer facing
Version: 13

Date: 2007-12-04
User Name: 7058
Action: Approved
Comment: Updates OK to publish
Version: 13

Comment: No changes made as indicated by Vickie, so placing in final review as there are no changes to review. It may not be perfect, but with DocBook/Voyager translation bugs, this is the very best I could do. It literally took hours.

Comment: I found a reference to doc ID 76756, which is an internal-only doc. It is about details for a particular command, so moving it to the internal-only section of this doc won't cause a huge problem. I'm moving it to the internal-only section. This will allow us to move forward with Minnow normalization.
Attachments
This solution has no attachment