Troubleshooting Sun Storage[TM] 2500 and 6000 Array Drive Tray Lost Redundancy Events

Asset ID:	1-75-1388897.1
Update Date:	2012-10-09
Keywords:

Solution Type Troubleshooting Sure

Applies to:

Sun Storage 6540 Array - Version Not Applicable and later
Sun Storage 6580 Array - Version Not Applicable and later
Sun Storage 6780 Array - Version Not Applicable and later
Sun Storage 6130 Array - Version Not Applicable and later
Sun Storage 2530 Array - Version Not Applicable and later
Information in this document applies to any platform.

Purpose

The purpose of this document is to help troubleshoot Drive/Drive-Tray Lost Redundancy events for Sun Storage[TM] 2500 and 6000 Arrays.

Symptoms include:

Critical Fault for Drive <Tray.xx.Drive.xx> lost redundancy (xx.66.1032) or REC_LOST_REDUNDANCY_DRIVE
Critical Fault for Enclosure tray <Tray.xx> lost redundancy (xx.66.1033) or REC_LOST_REDUNDANCY_TRAY
Critical Fault for Lost communication with <Tray.xx.IOM.x> (xx.66.1034) or REC_LOST_REDUNDANCY_ESM

Please validate each troubleshooting step below against your environment. Each step links to a document with instructions for validating the step and taking corrective action as necessary. The steps are ordered to isolate the issue and identify the proper resolution, so please do not skip a step.

Troubleshooting Steps

1. Verify the Critical Faults on the array.

Reference <Document 1021057.1> How to verify Sun StorageTek[TM] 2500 and Sun Storage[TM] 6000 and J4000 Critical Faults via the User Interface.

If the critical fault is for a drive only, collect the Supportdata and proceed to Step 2.
If there is a critical fault for an IOM/Tray in addition to the drive fault above, collect a Cabling diagram along with the Supportdata and proceed to Step 2.
If the critical fault is for an IOM/Tray only, collect a Cabling diagram along with the Supportdata and proceed to Step 3.

Reference <Document 1002514.1> Collecting Sun Storage Common Array Manager Array Support Data.
Reference <Document 1014074.1> Collecting Support Data for Arrays Using Sun StorageTek[TM] SANtricity Storage Manager.
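
The three rules in Step 1 can be sketched as a small routing function. This is only an illustration of the decision logic above; the function name and its two boolean parameters are hypothetical and are not part of any Sun or Oracle tool.

```python
def next_step_after_fault_check(drive_fault, iom_or_tray_fault):
    """Return (data_to_collect, next_step) following the Step 1 rules.

    drive_fault        -- True if a drive lost-redundancy fault is reported
    iom_or_tray_fault  -- True if an IOM or tray fault is reported
    """
    if not drive_fault and not iom_or_tray_fault:
        raise ValueError("no lost-redundancy fault reported")

    collect = ["Supportdata"]
    if iom_or_tray_fault:
        # A cabling diagram is requested whenever an IOM/Tray fault is involved.
        collect.append("Cabling diagram")

    # Any drive fault leads to Step 2; an IOM/Tray-only fault leads to Step 3.
    next_step = 2 if drive_fault else 3
    return collect, next_step


print(next_step_after_fault_check(drive_fault=True, iom_or_tray_fault=True))
```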

2. Identify Drive Details from alarms.txt (recoveryGuruProcedures.html in case of SANtricity).

If you use Sun Storage Common Array Manager:
1. Extract and open the alarms.txt file from the supportdata.
2. Get TrayID, DriveID and Channel information from the alarm xx.66.1032.
  
  Example:
    Alarm ID    : alarm1
    Description : Drive Tray.45.Drive.05 lost redundancy, IOM N/A, working channel: 5.
    Severity    : Critical
    Element     : t45drive5
    GridCode    : 80.66.1032
    Date        : xx-xx-xx
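
If many alarms need to be checked, the TrayID, DriveID and working channel can be pulled out of the alarm description with a regular expression. The sketch below assumes the description text has exactly the wording shown in the example above; the pattern and the function name are illustrative and may need adjusting for other alarm wordings.

```python
import re

# Pattern modelled on the example description only (an assumption).
ALARM_RE = re.compile(
    r"Drive Tray\.(?P<tray>\d+)\.Drive\.(?P<drive>\d+) lost redundancy,"
    r".*?working channel:\s*(?P<channel>\d+)"
)


def parse_drive_alarm(description):
    """Return tray, drive and working channel as integers, or None if no match."""
    match = ALARM_RE.search(description)
    if match is None:
        return None
    return {
        "tray": int(match["tray"]),
        "drive": int(match["drive"]),
        "working_channel": int(match["channel"]),
    }


sample = "Drive Tray.45.Drive.05 lost redundancy, IOM N/A, working channel: 5."
print(parse_drive_alarm(sample))
```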

If you use Sun StorageTek[TM] SANtricity Storage Manager:
1. Extract and open the recoveryGuruProcedures.html file from the supportdata.
2. Get TrayID, DriveID and Channel information from the Failure Entry NO_REDUNDANCY_DRIVE.
  
  Example:
    Storage array: ST6540
    Component reporting problem: Drive in slot 8
    Status: Optimal
    Location: Drive tray 1
    Component requiring service: 8
    Service action (removal) allowed: No
    Service action LED on component: No
    Working channel: 2
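
The same details can be read from a Recovery Guru failure entry. The sketch below assumes the entry has already been copied out of recoveryGuruProcedures.html as plain text with one "Label: value" field per line; it does not parse the HTML itself, and the function names are hypothetical.

```python
def parse_guru_entry(text):
    """Turn 'Label: value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[label.strip()] = value.strip()
    return fields


def drive_details(fields):
    """Extract tray, slot and working channel from the parsed fields."""
    return {
        # "Drive tray 1" -> 1
        "tray": int(fields["Location"].rsplit(None, 1)[-1]),
        # "Drive in slot 8" -> 8
        "slot": int(fields["Component reporting problem"].rsplit(None, 1)[-1]),
        "working_channel": int(fields["Working channel"]),
    }


entry = """Storage array: ST6540
Component reporting problem: Drive in slot 8
Status: Optimal
Location: Drive tray 1
Component requiring service: 8
Working channel: 2"""

print(drive_details(parse_guru_entry(entry)))
```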

Proceed to Step 4 to identify Working and Affected Channels.

3. Identify Tray Details from alarms.txt (recoveryGuruProcedures.html in case of SANtricity).

Refer to the examples in Step 2 to get the TrayID and Working Channel information from alarms.txt or recoveryGuruProcedures.html.
Proceed to Step 4 to identify Working and Affected Channels.

4. Identify Affected and Working Channels:

Locate the 'luall' output by opening the stateCaptureData.dmp file, and searching for the keyword 'luall'. Locate the Affected Drive/Tray as mentioned in the previous steps, and identify the Affected and Working Channels by following the example below:

For example:

Executing luall(0,0,0,0,0,0,0,0,0,0) on controller A:

.......Logical Unit........:    :.Channels..:Que ............IOs............:
    Devnum Location Role   :ORP : 0 1 2 3 4 :Dep  Qd  Open  Completed  Errs : OldestCmdAge(ms)
---------- -------- ------ :--- : - - - - - :--- --- ----- ---------- ----- : ----------------
  00020000  t0        Encl :++  : A B       :  1   0     0      38399     3   0
  00010100  t0,s1     FCdr :+++ : * +       : 16   0     0       5934     2   0
  00010101  t0,s2     FCdr :+++ : + *       : 16   0     0       5935     4   0

Important fields to look at:

'Location' Column   - t0,s1 indicates Tray 0, Slot 1
'Channels' Column
0 1 2 3 4 . . .     - Drive Channel information. Numbering here starts from 0, so Channel-0 here represents Channel-1 in the storageArrayProfile or alarms.txt output, and so on.
'A' or 'B' under Channels - Reported only for Trays. Having both A and B for a tray indicates that the tray is redundant.
'*' under Channels  - Active Path
'+' under Channels  - Standby Path
'D' or 'd' or '-' or ' ' (no character) under Channels - Standby path is not available and needs further investigation.

Note1: The Working Channel is always shown with '*'.
Note2: For a Simplex (Single Controller) Array configuration, only the Active path is expected; no Standby path will be shown.
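
As an illustration of reading these fields, the sketch below splits one luall data row on its ':' separators and reports the working and standby channels using the 1-based numbering of alarms.txt. It assumes the channel field keeps the two-character spacing shown in the example output (one leading space, then one symbol per channel); rows with a status symbol in front of the device number are not handled. The function names are hypothetical.

```python
def parse_luall_row(line):
    """Split one luall data row into its left-hand fields and channel symbols."""
    fields = line.split(":")
    if len(fields) < 4:
        raise ValueError("not a luall data row: %r" % line)
    devnum, location, role = fields[0].split()[:3]
    chan_field = fields[2]
    # Assumed layout: symbol for channel i sits at offset 1 + 2*i.
    symbols = {i: chan_field[1 + 2 * i] for i in range(len(chan_field) // 2)}
    return {"devnum": devnum, "location": location, "role": role,
            "orp": fields[1].rstrip(), "channels": symbols}


def path_summary(row):
    """Report channels as 1-based numbers (luall channel 0 is Channel 1)."""
    symbols = row["channels"]
    working = [ch + 1 for ch, s in symbols.items() if s == "*"]
    standby = [ch + 1 for ch, s in symbols.items() if s == "+"]
    if row["role"] == "Encl":
        # Tray rows show 'A' and 'B' instead of path symbols.
        redundant = {"A", "B"} <= set(symbols.values())
    else:
        redundant = bool(working) and bool(standby)
    return {"working": working, "standby": standby, "redundant": redundant}


SAMPLE = "  00010101  t0,s2     FCdr :+++ : + *       : 16   0     0       5935     4   0"
row = parse_luall_row(SAMPLE)
print(row["location"], path_summary(row))
```

A drive row for which `redundant` is False is a candidate for the checks that follow in this step.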

Detailed Explanation of symbols for Oracle TSE:
    Symbols appearing before Device numbers:
       -< = no IT Nexus connected
       =< = logical unit rejecting IO requests
       #< = logical unit restricted or suspended
       d< = logical unit degraded... look at the ORP

    ORP Column = Operation, Redundancy, Performance

       Operation = the state of the ITN currently chosen

         + = chosen itn is not degraded
         d = chosen itn is degraded

       Redundancy = the state of the redundant ITN

         + = alternate itn is up
         d = alternate itn is degraded
         - = alternate itn is down
         x = there is no alternate itn

       Performance = Are we using the preferred path?

         + = chosen itn is preferred
         - = chosen itn is not preferred
           = no itn preferences

    Channels column indicates the state of the itn on that channel

       * = up and chosen
       + = up and not chosen
       D = degraded and chosen
       d = degraded and not chosen
       - = down
       x = not present
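
The ORP legend above can be expressed as three lookup tables, one per character position. The following is a minimal sketch; the function name is hypothetical, and it assumes the ORP value is passed as the up-to-three-character string from the luall output, with a missing third character meaning "no itn preferences".

```python
OPERATION = {
    "+": "chosen ITN is not degraded",
    "d": "chosen ITN is degraded",
}
REDUNDANCY = {
    "+": "alternate ITN is up",
    "d": "alternate ITN is degraded",
    "-": "alternate ITN is down",
    "x": "there is no alternate ITN",
}
PERFORMANCE = {
    "+": "chosen ITN is preferred",
    "-": "chosen ITN is not preferred",
    " ": "no ITN preferences",
}


def decode_orp(orp):
    """Decode the Operation, Redundancy and Performance characters."""
    # Pad to three characters so a blank Performance field is preserved.
    o, r, p = orp.ljust(3)[:3]
    return {
        "operation": OPERATION.get(o, "unknown symbol %r" % o),
        "redundancy": REDUNDANCY.get(r, "unknown symbol %r" % r),
        "performance": PERFORMANCE.get(p, "unknown symbol %r" % p),
    }


print(decode_orp("+-+"))
```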

If only one drive shows a single path and all the other drives in the same tray have both paths available, the drive may need to be replaced. Proceed to Step 11.
If all the drives show a single path and the tray is a controller tray, one of the controllers may not be working properly. To check for a controller issue and take appropriate action, follow <Document 1021113.1> Sun Storage[TM] Arrays: Troubleshooting RAID Controller Failures.
If all the drives show a single path and the tray is an expansion tray, proceed to Step 5.
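
The three outcomes of Step 4 can be summarised in a short decision function. This is a sketch only; the function name and parameters are hypothetical, and the case where several but not all drives show a single path is not covered by this document, so the function reports it as undetermined.

```python
def step4_next_action(single_path_drives, total_drives, is_controller_tray):
    """Map the per-tray path counts to the next action described in Step 4."""
    if single_path_drives == 0:
        return "no single-path drives found in this tray"
    if single_path_drives == total_drives:
        if is_controller_tray:
            return "follow Document 1021113.1 (RAID controller troubleshooting)"
        return "proceed to Step 5"
    if single_path_drives == 1:
        # Exactly one drive affected, all others have both paths.
        return "proceed to Step 11 (drive may need to be replaced)"
    # Assumption: partial failures are outside the rules stated above.
    return "undetermined: not covered by this procedure"


print(step4_next_action(single_path_drives=1, total_drives=16, is_controller_tray=False))
```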

5. Verify other alerts in alarms.txt (recoveryGuruProcedures.html in case of SANtricity).

If any alarm exists for Failed IOM on the same tray, the IOM may need to be replaced. Proceed to Step 11.
If any alarm exists for Minihub failed and/or SFP failed, the SFP may need to be replaced. Proceed to Step 11.
If no such alarm exists, proceed to Step 6.
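
Step 5 can be sketched the same way. The function below assumes the reader has already determined, from alarms.txt or recoveryGuruProcedures.html, whether a Failed IOM alarm or a Minihub/SFP failure alarm exists for the same tray; the exact alarm wording is not given in this document, so the inputs are plain booleans and the names are hypothetical.

```python
def step5_next_action(iom_failed, minihub_or_sfp_failed):
    """Return (suspect_parts, next_step) following the Step 5 rules."""
    suspects = []
    if iom_failed:
        suspects.append("IOM")
    if minihub_or_sfp_failed:
        suspects.append("SFP")
    # Any matching alarm leads to Step 11; otherwise continue with Step 6.
    next_step = 11 if suspects else 6
    return suspects, next_step


print(step5_next_action(iom_failed=False, minihub_or_sfp_failed=True))
```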

6. Physically locate the Affected Channel using information collected in Step 4.

Default channel numbers and their location for 6130 array:
Default channel numbers and their location for 6140 array:
Default channel numbers and their location for 6180 array:
Default channel numbers and their location for 6540 array:
Default channel numbers and their location for 6580/6780 array:
Default channel numbers and their location for 2540/2530/2510 array:

7. Trace the cable connectivity from the Affected Tray in the Affected Channel.

CAUTION: Do not disconnect any cables on the working channel. Doing so may cause loss of data accessibility.

If the array is 6000 series, proceed to Step 8.
If the array is 2500 series, proceed to Step 9.

8. Verify the 7 segment LED status code of IOM.

If LED status code is NOT the tray ID, capture the LED code and proceed to Step 9.
Reference: Different LED status codes and their descriptions.
If LED display is not available -or- LED display is the tray ID, proceed to Step 11.

Internal Note for Oracle Support Engineers:
a. The CSM200 Tray has a 7 segment LED display.
b. For detailed LED status code descriptions, refer to <Document 1021109.1> Sun StorageTek[TM] 6140, 6540, and Flexline 380 Array Controller 7-Segment LED

9. Verify the Port Status LED.

Reference Port Status LEDs for 2500 Series - Check for "Link Fault" LED status.
Reference Port Status LEDs for 6000 Series - Check for "Port Bypass" LED status

If Amber LED is ON, proceed to Step 10.
If the LED is OFF, proceed to Step 11.

10. Check the cable going IN to the array in the cabling sequence.

If the cable is loose -or- disconnected, connect it and evaluate the alarm. It may also be necessary to reseat the IOM for that tray.

If the issue is fixed, you are finished with the procedure.
If the issue is not resolved and the Amber LED is ON, the cable and/or SFP would need replacement. Proceed to Step 11.
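
The branching in Steps 7 through 10 can be traced with the sketch below, which returns the sequence of steps visited for a given set of observations. The function name, parameter names and the string values used for the 7 segment display state are hypothetical. Step 10 only describes the case where the Amber LED is still ON after reconnecting; treating every unresolved case as "proceed to Step 11" is an assumption.

```python
def walk_steps_7_to_10(series, seven_segment, amber_led_on, fixed_after_reconnect):
    """Return the list of steps visited, ending in 11 or "done".

    series         -- "6000" or "2500"
    seven_segment  -- "tray_id", "other_code" or "unavailable" (6000 series only)
    """
    path = [7]
    if series == "6000":
        path.append(8)
        # Only a code other than the tray ID continues to Step 9.
        if seven_segment != "other_code":
            return path + [11]
    elif series != "2500":
        raise ValueError("series must be '6000' or '2500'")

    path.append(9)
    if not amber_led_on:
        return path + [11]

    path.append(10)
    if fixed_after_reconnect:
        return path + ["done"]
    # Assumption: any unresolved case is escalated via Step 11.
    return path + [11]


print(walk_steps_7_to_10("6000", "other_code", True, False))
```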

11. Please contact Oracle Support and supply:

Supportdata Collection
Cabling Diagram (if applicable)
Results of the above steps (if applicable)
