Troubleshooting Sun StorEdge[TM] 6320 Disk Faults

Asset ID:	1-75-1019954.1
Update Date:	2009-07-29
Keywords:

Solution Type Troubleshooting Sure

Solution 1019954.1 : Troubleshooting Sun StorEdge[TM] 6320 Disk Faults

Related Items


Sun Storage 6320 System

Related Categories


GCS>Sun Microsystems>Storage - Disk>Modular Disk - 6xxx Arrays

PreviouslyPublishedAs
249668

Description
This document describes how to identify a failed or failing disk drive in the array from the symptoms listed below.

Symptoms:

SSRR disk.u#d# in 6020 array## is 'fault-disabled'
SSRR disk.u2d3 in sp0-array03 Not-Available ready-enabled->fault-enabled
SSRR disk.u2d2.fruDiskPort2State on sp0-array00 change 'ready' to 'bypass'
StorADE reports disk fault
Performance degraded
Disk Fault LED lit/on
Global Fault LED lit/on
Email of SSRR alarm
SSRR reports PFA

Steps to Follow
Please validate that each troubleshooting step below is true for your
environment. The steps will provide instructions or a link to a document, for
validating the step and taking corrective action as necessary. The steps are
ordered in the most appropriate sequence to isolate the issue and identify the
proper resolution. Please do not skip a step.
1. Validate the symptom from the symptom list above.

1a. If you received an email from the array or SSRR opened a Service Request
then go to Step 2.

1b. If you see a Device Alert, per the symptom set above, then go to Step 5.

1c. If you noticed a fault LED on your disk drive or the global fault LED is lit, then go to Step 2.

1d. If none of the above apply, skip to Step 8.

2. Validate that you can log into your 6320 by going to https://<array_IP>:9443 and logging in. See chapter 2 of the Sun StorEdge 6320 System 1.2 Reference and Service Manual for details on logging into the array.

If you have trouble logging into the service processor on the 6320, refer to <Document: 1019956.1> : Troubleshooting Sun StorEdge 6320[TM] Loss Of management Access Faults. Otherwise, continue to Step 3.

3. Validate the "Overall Health Status" box in the Configuration Service page immediately after logging into the 6320.

3a. If the status shows "Error", go to Step 4
3b. If the status shows "Ok", go to Step 5

4. Validate the existence of a StorADE Alarm against the disk drive by logging into the Storage Automated Diagnostic Environment (StorADE) using the following procedure:

- To open a secure connection, use the following URL:
https://system_ip_address:7443

If you cannot log into the subsystem, see <Document: 1019956.1>: Troubleshooting Sun StorEdge 6320[TM] Loss Of management Access Faults.

From the "Home" window that comes up, check the "Device Health Summary" to see if there are alerts listed. To verify that an alarm is related to a disk drive issue, select the "Alerts" link in the "Device Health Summary" and search for disk drive alarms. Selecting the alarm link will display details about the alarm.

4a. If there's an alarm, go to Step 5
4b. If there's no alarm and no fault LED is lit, go to Step 8
4c. If there's no alarm AND an LED lit on the disk drive or Global Fault Indicator, go to Step 5

5. Validate disk drive status against a Detailed FRU report.

To collect a Detailed FRU Report:
a) From StorADE, select the "Reports" tab.
b) In the "Reports" window select "General Reports" and then "Fru Reports".
c) In the "Fru Reports" page you will find a button labeled "Generate new Report-Set". Always select this button before looking at any reports.
d) From the "Fru Reports" window, under "Select report to display or Email", select the "Display" link in the "Detailed Fru Report" row.

5a. If status is ready-enabled, go to Step 8
5b. If status is substituted, you have validated that a drive has failed in the 6320 and requires replacement. Collect the following information and contact Sun Support for a drive replacement:
    •    Arrayname
    •    Disk position
    •    Disk drive model
5c. If status is ready-disabled, Go to Step 6.
5d. If status is fault-disabled, Go to Step 7.
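
The Step 5 branches can be summarized as a lookup table. The following Python sketch is illustrative only: the name next_action and the returned strings are assumptions made for this example and are not part of StorADE or any Sun tool. Only the four status values and their destinations come from the steps above.

```python
from typing import Optional

# Status values as listed in Step 5, mapped to the action each one leads to.
NEXT_ACTION = {
    "ready-enabled": "go to Step 8",
    "substituted": "drive has failed: collect details and contact Sun Support",
    "ready-disabled": "go to Step 6",
    "fault-disabled": "go to Step 7",
}

def next_action(status: str) -> Optional[str]:
    """Return the Step 5 action for a drive status, or None if Step 5 does not list it."""
    return NEXT_ACTION.get(status.strip().lower())

if __name__ == "__main__":
    for status in NEXT_ACTION:
        print(status, "->", next_action(status))
```

A status that Step 5 does not list returns None, so the caller has to decide how to handle it rather than being routed silently.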

6. Validate global hot spare presence and state for drive in a ready-disabled state.

Verify the presence and status of a hotspare as follows:
a) Log into the configuration services through https://<array_IP>:9443
b) Expand the SE6000 tree on the left side pane.
c) Select the array which contains the ready-disabled drive.
d) Check the icons on each drive and look for shared hotspares (global hotspares) in any of the trays, or a dedicated hotspare disk in the tray that contains the drive.

6a. If a hot spare is present, check the pool details for a state of "Reconstructing". If a "Reconstructing" state is present, allow it to complete before proceeding. Once the reconstruction has completed, verify that the drive status changes to "fault-substituted" in the "Fru Reports" output. You have then validated that a drive has failed in the 6320 and requires replacement. Collect the following information and contact Sun Support for a drive replacement:
    •    Arrayname
    •    Disk position
    •    Disk drive model

- If a "Reconstructing" state is NOT present, and the drive isn't in a "fault-substituted" state in the "Fru Reports" output, go to Step 9

6b. If a hotspare is NOT present, check the status of the pool: log into the configuration services through https://<array_IP>:9443, expand the SE6000 tree on the left side pane, then expand the array which contains the faulted drive. Select the pool that contains the drive to view the details for that pool.

- If the pool status is "Online", you have validated that a drive has failed in the 6320 and requires replacement. Collect the following information and contact Sun Support for a drive replacement:
    •    Arrayname
    •    Disk position
    •    Disk drive model

- If the pool status is "Offline", go to Step 9

7. Validate pool state for fault-disabled drive.

Check whether the pool is reconstructing to a hotspare as follows:
a) Log into the configuration services through https://<array_IP>:9443
b) Expand the SE6000 tree on the left side pane, then expand the array which contains the faulted drive.
c) Select the pool that contains the drive to view the details for that pool.
d) Check the state of the pool for "Reconstructing to Hot Spare".

- If no reconstruction is present, go to Step 9
- If there is a reconstruction occurring, wait until it finishes, then verify that the drive status changes to "fault-substituted". You have then validated that a drive has failed in the 6320 and requires replacement. Collect the following information and contact Sun Support for a drive replacement:

    •    Arrayname
    •    Disk position
    •    Disk drive model

8. Validate LED and/or Alarm existence against disk drive in ready-enabled state.

8a. If there is an Alarm AND a fault LED lit for the disk drive, clear the alarm and verify
LED status on array. If LED remains on, go to Step 9

8b. If there is an Alarm and no fault LED lit for the disk drive, clear the alarm.
8c. If there is an LED and NO Alarm for the disk drive, go to Step 9
8d. If there are no Alarms and no LEDs lit, you have verified that the disk drive is healthy.

9. At this point, if you have validated that each troubleshooting step above is true for your environment, and the issue still exists, further troubleshooting is required. Please open a Service Request with Sun Microsystems.

Please include:

StorADE Alarm text if available
Statement of Symptoms you see that pertain to the disk drive
Solution Extract. Refer to <Document: 1018865.1> : Sun StorEdge[TM] 6320: How to Collect an Extractor
Status of disk drive as shown in the Detailed FRU report in Step 5
Email text received from the 6320 storage system.

Product
Sun StorageTek 6320 System

Internal Comments
The following steps are a continuance of the above, and are Internal Only. If the above steps have not been performed, please start at Step 1. It is assumed, at this point, that you have followed steps 1-9 with the customer, and now need to further troubleshoot the array by looking at extractor data.

It is assumed at this point that customers do NOT have access to the 6020 console, and
all further steps will be against the Solution Extract.

10. Validate drive state, pool state, drive LED status, and the existence of a hot spare along with its status.

This information is the result of Step 5 through Step 7, but can also be done through
the extractor using the files:

<extractor>/Arrays/array<number>/commands/vol_list (provides existence of hotspare for drive)
<extractor>/Arrays/array<number>/commands/fru_stat (provides drive status)
<extractor>/Arrays/array<number>/commands/global_standby_list_u<number2>
where <extractor> is the Solution Extract location, <number> is the array number (e.g. array00), and <number2>
is the tray.
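
As a convenience, the three file locations above can be built programmatically. This is a minimal Python sketch under stated assumptions: the function name extractor_files is hypothetical, the array argument is taken to be the full directory name (for example array00), and the tray argument is the tray number as it appears in the u<number2> suffix.

```python
from pathlib import Path
from typing import Dict

def extractor_files(extractor: str, array: str, tray: int) -> Dict[str, Path]:
    """Build the paths of the Step 10 command outputs inside a Solution Extract."""
    commands = Path(extractor) / "Arrays" / array / "commands"
    return {
        "vol_list": commands / "vol_list",
        "fru_stat": commands / "fru_stat",
        "global_standby_list": commands / f"global_standby_list_u{tray}",
    }

if __name__ == "__main__":
    # "/tmp/extract" is a placeholder location used only for this example.
    for name, path in extractor_files("/tmp/extract", "array00", 1).items():
        print(name, path)
```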

Identify the following from the output for the suspect drive:

* The volume that the suspect drive is a part of from vol_list
* The other drives in that same volume from vol_list
* Whether the volume has a hot spare/standby allocated from vol_list
* Whether there is a global hot spare/standby allocated from global_standby_list_u<number2>
* The drive state and status from fru_stat

global_standby_list_u1 example:

============================
| COMMAND: global_standby list u1
============================
Global Standby         Substituted Drive
u1d14                  -

NOTE: There is one output file for each tray in the array, for a potential total of 6 output files.

vol_list example:

============================
| COMMAND: vol list
============================
volume            capacity raid data       standby
tray0_pool1     204.510 GB    1 u1d01-06      none
tray1_pool1     204.510 GB    1 u2d01-06      none
tray0_pool2     204.510 GB    1 u1d08-13      none
tray1_pool2     204.510 GB    1 u2d08-13      u2d14

Note that "u2d08-13" indicates that drives u2d08, u2d09, u2d10, u2d11, u2d12, u2d13 are all components of the same volume.
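
To find the volume, sibling drives, and standby for a suspect drive without reading the ranges by hand, the vol_list output can be parsed. The Python sketch below is illustrative, not a Sun-provided tool: the function names are hypothetical, and it assumes each data row has the six whitespace-separated tokens seen in the example above (name, size, unit, RAID level, drive range, standby).

```python
import re
from typing import Dict, List, Optional

def expand_drives(spec: str) -> List[str]:
    """Expand a range such as 'u2d08-13' into individual drive names."""
    m = re.fullmatch(r"(u\d+d)(\d+)-(\d+)", spec)
    if m is None:
        return [spec]  # not a range: return the token unchanged
    prefix, first, last = m.groups()
    width = len(first)  # keep the zero padding used in the output
    return [f"{prefix}{n:0{width}d}" for n in range(int(first), int(last) + 1)]

def parse_vol_list(text: str) -> List[Dict[str, object]]:
    """Return one dict per volume row; header and separator lines are skipped."""
    volumes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6 and parts[3].isdigit() and parts[4].startswith("u"):
            volumes.append({
                "volume": parts[0],
                "raid": parts[3],
                "drives": expand_drives(parts[4]),
                "standby": None if parts[5] == "none" else parts[5],
            })
    return volumes

def volume_of(drive: str, volumes: List[Dict[str, object]]) -> Optional[Dict[str, object]]:
    """Return the volume entry that lists the drive as a data member, if any."""
    for vol in volumes:
        if drive in vol["drives"]:
            return vol
    return None
```

Calling volume_of with the suspect drive returns the pool name, the other member drives, and the dedicated standby (or None) in one record.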

fru_stat example:

u2d01   ready    enabled     data disk   ready      ready      25    tray1_pool1
u2d02   ready    enabled     data disk   ready      ready      25    tray1_pool1
u2d03   ready    enabled     data disk   ready      ready      29    tray1_pool1
u2d04   ready    enabled     data disk   ready      ready      25    tray1_pool1
u2d05   ready    enabled     data disk   ready      ready      25    tray1_pool1
u2d06   ready    enabled     data disk   ready      ready      25    tray1_pool1
u2d07   ready    enabled     standby     ready      ready      24    -
u2d08   ready    enabled     data disk   ready      ready      25    tray1_pool2
u2d09   ready    enabled     data disk   ready      ready      25    tray1_pool2
u2d10   ready    enabled     data disk   ready      ready      25    tray1_pool2
u2d11   ready    enabled     data disk   ready      ready      25    tray1_pool2
u2d12   ready    enabled     data disk   ready      ready      25    tray1_pool2
u2d13   ready    enabled     data disk   ready      ready      26    tray1_pool2
u2d14   ready    enabled     standby     ready      ready      25    -
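
The fru_stat rows can be scanned for drives that are not "ready enabled". The following Python sketch makes two assumptions that are not stated in this document: the column names in FIELDS are labels chosen for this example (the sample above has no header row), and columns are separated by two or more spaces, which keeps "data disk" together as one field.

```python
import re
from typing import Dict, List

# Assumed column labels; the fru_stat sample in this document shows no header.
FIELDS = ["disk", "status", "state", "role", "port1", "port2", "temp", "volume"]

def parse_fru_stat(text: str) -> List[Dict[str, str]]:
    """Parse drive rows (u<tray>d<slot> ...) from fru_stat output."""
    rows = []
    for line in text.splitlines():
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) == len(FIELDS) and re.fullmatch(r"u\d+d\d+", parts[0]):
            rows.append(dict(zip(FIELDS, parts)))
    return rows

def not_ready_enabled(rows: List[Dict[str, str]]) -> List[str]:
    """Return the drives whose status/state pair is anything other than ready/enabled."""
    return [r["disk"] for r in rows if (r["status"], r["state"]) != ("ready", "enabled")]
```

Every drive in the sample above is ready and enabled, so not_ready_enabled would return an empty list for it.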

If a hot spare exists in vol_list OR global_standby_list_u<number2> and the drive is in a "fault disabled" or "ready disabled" state, go to Step 11.
If a hot spare does not exist and the drive is in a "ready-disabled" state, go to Step 16
If there is an LED on the drive, but it is in a "ready-enabled" status, go to Step 18
If the drive state is "substituted", use the <extractor>/Arrays/array<number>/commands/fru_list to identify the model and have the drive replaced.
If none of the above apply, go to Step 18
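
The Step 10 branches above can be written as a single function, evaluated in the same order as they are listed. This Python sketch is illustrative only: the function name, its parameters, and the returned strings are assumptions made for this example. The status argument is the combined status and state (for example "ready-disabled"), with a space accepted in place of the hyphen.

```python
def step10_next(status: str, has_hot_spare: bool, led_on: bool) -> str:
    """Return the next step for a suspect drive, following the Step 10 rules in order."""
    status = status.strip().lower().replace(" ", "-")
    if has_hot_spare and status in ("fault-disabled", "ready-disabled"):
        return "Step 11"
    if not has_hot_spare and status == "ready-disabled":
        return "Step 16"
    if led_on and status == "ready-enabled":
        return "Step 18"
    if status == "substituted":
        return "identify the model in fru_list and have the drive replaced"
    return "Step 18"  # none of the above apply

if __name__ == "__main__":
    print(step10_next("fault disabled", has_hot_spare=True, led_on=True))
```

Here has_hot_spare is true when a standby appears in either vol_list or global_standby_list_u<number2>.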

11. Validate whether a reconstruction process exists using the proc_list file

For the given pool affected by the suspect drive, use the <extractor>/Arrays/array<number>/commands/proc_list file.
There should be a "vol recon" process for the given pool.

Example:
============================
| COMMAND: proc list
============================
VOLUME          CMD_REF PERCENT     TIME COMMAND
tray0_pool1       21568      74 53928:47 vol verify
tray1_pool2       25666      27   178:04 vol recon <--- reconstruction process.

If a reconstruction process exists, wait until it completes. Then repeat Step 10.
If a reconstruction process does not exist, continue to Step 12.
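
The check for a "vol recon" process can be scripted against the proc_list file. The Python sketch below is a minimal example under stated assumptions: the function name find_recon is hypothetical, and rows are assumed to follow the layout of the example above (volume, command reference, percent, time, then the command text).

```python
import re
from typing import Dict, Optional

# volume, CMD_REF, PERCENT, TIME, COMMAND (the command may contain spaces)
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(.+)$")

def find_recon(text: str, pool: str) -> Optional[Dict[str, object]]:
    """Return progress details if a 'vol recon' process is listed for the pool."""
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m and m.group(1) == pool and m.group(5).startswith("vol recon"):
            return {"volume": pool, "percent": int(m.group(3)), "time": m.group(4)}
    return None  # no reconstruction listed for this pool
```

A return value of None corresponds to the "reconstruction process does not exist" branch. Any other result means the reconstruction is still running at the reported percentage.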

12. Validate whether suspect drive has been substituted by a global hotspare/standby.

Using the global standby list output from Step 10, you can identify whether
a drive has been substituted by a global hotspare/standby from the "Substituted Drive"
column.

============================
| COMMAND: global_standby list u1
============================
Global Standby         Substituted Drive
u1d14                 u4d07    <------Drive has been substituted

If the global hot spare exists, and the suspect drive has been substituted, use the <extractor>/Arrays/array<number>/commands/fru_list to identify the model and have the drive replaced.
If the global hot spare exists, and the suspect drive has NOT been substituted, continue to Step 13.
If no global hot spare exists, but vol_list shows a local hot spare, continue to Step 16.
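
The "Substituted Drive" column can be read from each global_standby_list_u<number2> file with a short parser. This Python sketch is illustrative: the function name is hypothetical, and it assumes each data row starts with the standby drive name followed by either "-" or the substituted drive, as in the two examples in this document.

```python
import re
from typing import Dict, Optional

def parse_global_standby(text: str) -> Dict[str, Optional[str]]:
    """Map each global standby to the drive it substitutes, or None if unused."""
    standbys: Dict[str, Optional[str]] = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a drive name; header and separator rows do not.
        if len(parts) >= 2 and re.fullmatch(r"u\d+d\d+", parts[0]):
            standbys[parts[0]] = None if parts[1] == "-" else parts[1]
    return standbys

def is_substituted(drive: str, standbys: Dict[str, Optional[str]]) -> bool:
    """True if any global standby lists the drive in its Substituted Drive column."""
    return drive in standbys.values()
```

An empty result means no global hot spare is listed for that tray, which corresponds to the last branch of Step 12.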

13. Validate global standby size to suspect drive

Compare the size of the global hot spare drives and the suspect drive in fru_list.
You may need to use the Sun System Handbook for the 6320 to identify the drive

Keywords: 6320, disk, hdd, fault, fault disabled, bypassed, disk led, normalized, audited

Attachments

This solution has no attachment