Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type Troubleshooting Sure Solution 1019954.1 : Troubleshooting Sun StorEdge[TM] 6320 Disk Faults
PreviouslyPublishedAs 249668

Description
This document addresses the identification of failed or failing disk drive(s) in the array via the various symptoms provided.

Symptoms:
Steps to Follow

Please validate that each troubleshooting step below is true for your environment. Each step provides instructions, or a link to a document, for validating the step and taking corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Please do not skip a step.

1. Validate the symptom from the symptom list above.
1a. If you received an email from the array, or SSRR opened a Service Request, go to Step 2.
1b. If you see a Device Alert, per the symptom set above, go to Step 5.
1c. If you noticed a fault LED on your disk drive, or the global fault LED is lit, go to Step 2.
1d. If none of the above apply, skip to Step 8.

2. Validate that you can log into your 6320 by going to https://<array_IP>:9443 and logging in. See chapter 2 of the Sun StorEdge 6320 System 1.2 Reference and Service Manual for details on logging into the array. If you have trouble logging into the service processor on the 6320, refer to <Document: 1019956.1>: Troubleshooting Sun StorEdge 6320[TM] Loss Of management Access Faults. Otherwise, continue to Step 3.

3. Validate the "Overall Health Status" box on the Configuration Service page immediately after logging into the 6320.
3a. If the status shows "Error", go to Step 4.
3b. If the status shows "Ok", go to Step 5.

4. Validate the existence of a StorADE alarm against the disk drive by logging into the Storage Automated Diagnostic Environment (StorADE) using the following procedure:
- To open a secure connection, use the following URL: https://system_ip_address:7443
If you cannot log into the Sub System, see <Document: 1019956.1>: Troubleshooting Sun StorEdge 6320[TM] Loss Of management Access Faults.
From the "Home" window that comes up, check the "Device Health Summary" to see if there are alerts listed. To verify that an alarm is related to a disk drive issue, select the "Alerts" link in the "Device Health Summary" and search for disk drive alarms.
Selecting the alarm link will display details about the alarm.
4a. If there is an alarm, go to Step 5.
4b. If there is no alarm, go to Step 8.
4c. If there is no alarm AND an LED is lit on the disk drive or the Global Fault Indicator, go to Step 5.

5. Validate disk drive status against a Detailed FRU report. To collect a Detailed FRU Report:
a) From StorADE, select the "Reports" tab.
b) In the "Reports" window, select "General Reports" and then "Fru Reports".
c) On the "Fru Reports" page you will find a button labeled "Generate new Report-Set". Always select this button before looking at any reports.
d) From the "Fru Reports" window, under "Select report to display or Email", select the "Display" link in the "Detailed Fru Report" row.
5a. If status is ready-enabled, go to Step 8.
5b. If status is substituted, you have validated that a drive has failed in the 6320 and requires replacement. Collect the following information and contact Sun Support for a drive replacement:
• Arrayname
• Disk position
• Disk drive model
5c. If status is ready-disabled, go to Step 6.
5d. If status is fault-disabled, go to Step 7.

6. Validate global hot spare presence and state for a drive in a ready-disabled state. Verify the presence and status of a hotspare by:
a) Logging into the configuration services through https://<array_IP>:9443.
b) Expanding the SE6000 tree on the left side pane.
c) Selecting the array which contains the ready-disabled drive.
d) Checking the icons on each drive and looking for shared hotspares (global-hotspares) in any of the trays, or a dedicated hotspare disk in the tray that contains the drive.
6a. If a hot spare is present, check the pool details for a state of "Reconstructing". If a "Reconstructing" state is present, allow it to complete before proceeding. Once the reconstruction has completed, verify that the drive status changes to "fault-substituted" in the "Fru Reports" output. You have validated that a drive has failed in the 6320 and requires replacement.
Collect the following information and contact Sun Support for a drive replacement:
• Arrayname
• Disk position
• Disk drive model
- If a "Reconstructing" state is NOT present, and the drive is not in a "fault-substituted" state in the "Fru Reports" output, go to Step 9.
6b. If a hotspare is NOT present, check the status of the pool by logging into the configuration services through https://<array_IP>:9443, expanding the SE6000 tree on the left side pane, then expanding the array which contains the faulted drive. Select the pool that contains the drive to view the details for that volume.
- If the pool status is "Online", you have validated that a drive has failed in the 6320 and requires replacement. Collect the following information and contact Sun Support for a drive replacement:
• Arrayname
• Disk position
• Disk drive model
- If the pool status is "Offline", go to Step 9.

7. Validate pool state for a fault-disabled drive. Check to see if the pool state is reconstructing to a hotspare by:
a) Logging into the configuration services through https://<array_IP>:9443.
b) Expanding the SE6000 tree on the left side pane, then expanding the array which contains the faulted drive.
c) Selecting the pool that contains the drive to view the details for that pool.
d) Checking the state of the pool for "Reconstructing to Hot Spare".
- If no reconstruction is present, go to Step 9.
- If there is a reconstruction occurring, wait until it finishes, then verify that the drive status changes to "fault-substituted". You have validated that a drive has failed in the 6320 and requires replacement. Collect the following information and contact Sun Support for a drive replacement:
• Arrayname
• Disk position
• Disk drive model

8. Validate LED and/or Alarm existence against a disk drive in ready-enabled state.
8a. If there is an Alarm AND a fault LED lit for the disk drive, clear the alarm and verify LED status on the array. If the LED remains on, go to Step 9.
8b. If there is an Alarm and no fault LED lit for the disk drive, clear the alarm.
8c. If there is an LED lit and NO Alarm for the disk drive, go to Step 9.
8d. If there are no Alarms or LEDs lit, you have verified that the disk drive is healthy.

9. At this point, if you have validated that each troubleshooting step above is true for your environment, and the issue still exists, further troubleshooting is required. Please open a Service Request with Sun Microsystems. Please include:
Product
Sun StorageTek 6320 System

Internal Comments

The following steps are a continuation of the above and are Internal Only. If the above steps have not been performed, please start at Step 1. It is assumed, at this point, that you have followed Steps 1-9 with the customer and now need to further troubleshoot the array by looking at extractor data. It is also assumed that customers do NOT have access to the 6020 console, so all further steps will be against the Solution Extract.

10. Validate drive state, pool state, drive LED status, and the existence of a hot spare along with its status.
This information is the result of Step 5 through Step 7, but can also be obtained from the extractor using the files:
<extractor>/Arrays/array<number>/commands/vol_list (provides existence of hotspare for drive)
<extractor>/Arrays/array<number>/commands/fru_stat (provides drive status)
<extractor>/Arrays/array<number>/commands/global_standby_list_u<number2>
where <extractor> is the Solution Extract location, <number> is the array number (e.g. array00), and <number2> is the tray.
Identify the following from the output for the suspect drive:
* The volume that the suspect drive is a part of, from vol_list
* The other drives in that same volume, from vol_list
* Whether the volume has a hot spare/standby allocated, from vol_list
* Whether there is a global hot spare/standby allocated, from global_standby_list_u<number2>
* The drive state and status, from fru_stat

global_standby_list_u1 example:
============================
| COMMAND: global_standby list u1
============================
Global Standby  Substituted Drive
u1d14           -
NOTE: There will be an output for each tray in the array. This has a potential for 6 total output files.
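The global standby check above can be scripted against the extractor files. The following is a minimal sketch, assuming each global_standby_list_u<number2> file has the two whitespace-separated columns shown in the example, with "-" meaning no drive has been substituted. The function name parse_global_standby is hypothetical and is not part of any Sun tool; real extractor files may differ in layout.

```python
import re

# Drive names in the examples look like u1d14 (unit number, then disk number).
DRIVE_NAME = re.compile(r"^u\d+d\d+$")


def parse_global_standby(text):
    """Map each global standby drive to the drive it substitutes, or None.

    Assumes the two-column layout from the example above; banner and
    header lines are skipped because they do not start with a drive name.
    """
    standbys = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not DRIVE_NAME.match(fields[0]):
            continue
        substituted = fields[1] if len(fields) > 1 else "-"
        standbys[fields[0]] = substituted if DRIVE_NAME.match(substituted) else None
    return standbys


sample = """\
============================
| COMMAND: global_standby list u1
============================
Global Standby Substituted Drive
u1d14 -
"""

for standby, substituted in parse_global_standby(sample).items():
    print(standby, "substitutes", substituted or "nothing")
```

Because there may be one file per tray, the same function would be run over each global_standby_list_u<number2> file and the results merged.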
vol_list example:
============================
| COMMAND: vol list
============================
volume       capacity    raid  data      standby
tray0_pool1  204.510 GB  1     u1d01-06  none
tray1_pool1  204.510 GB  1     u2d01-06  none
tray0_pool2  204.510 GB  1     u1d08-13  none
tray1_pool2  204.510 GB  1     u2d08-13  u2d014
Note that "u2d08-13" indicates that drives u2d08, u2d09, u2d10, u2d11, u2d12, and u2d13 are all components of the same volume.

fru_stat example:
u2d01  ready  enabled  data disk  ready  ready  25  tray1_pool1
u2d02  ready  enabled  data disk  ready  ready  25  tray1_pool1
u2d03  ready  enabled  data disk  ready  ready  29  tray1_pool1
u2d04  ready  enabled  data disk  ready  ready  25  tray1_pool1
u2d05  ready  enabled  data disk  ready  ready  25  tray1_pool1
u2d06  ready  enabled  data disk  ready  ready  25  tray1_pool1
u2d07  ready  enabled  standby    ready  ready  24  -
u2d08  ready  enabled  data disk  ready  ready  25  tray1_pool2
u2d09  ready  enabled  data disk  ready  ready  25  tray1_pool2
u2d10  ready  enabled  data disk  ready  ready  25  tray1_pool2
u2d11  ready  enabled  data disk  ready  ready  25  tray1_pool2
u2d12  ready  enabled  data disk  ready  ready  25  tray1_pool2
u2d13  ready  enabled  data disk  ready  ready  26  tray1_pool2
u2d14  ready  enabled  standby    ready  ready  25  -

If a hot spare exists in vol_list OR global_standby_list_u<number2> and the drive is in a "fault disabled" or "ready disabled" state, go to Step 11.
If a hot spare does not exist and the drive is in a "ready-disabled" state, go to Step 16.
If there is an LED lit on the drive, but it is in a "ready-enabled" status, go to Step 18.
If the drive state is "substituted", use the <extractor>/Arrays/array<number>/commands/fru_list to identify the model and have the drive replaced.
If none of the above apply, go to Step 18.

11. Validate whether a reconstruction process exists using the proc_list file.
For the given pool affected by the suspect drive, use the <extractor>/Arrays/array<number>/commands/proc_list file.
There should be a "vol recon" process for the given pool.
Example:
============================
| COMMAND: proc list
============================
VOLUME       CMD_REF  PERCENT  TIME      COMMAND
tray0_pool1  21568    74       53928:47  vol verify
tray1_pool2  25666    27       178:04    vol recon   <--- reconstruction process
If a reconstruction process exists, wait until it completes, then repeat Step 10.
If a reconstruction process does not exist, continue to Step 12.

12. Validate whether the suspect drive has been substituted by a global hotspare/standby.
Using the global standby list output from Step 10, you should be able to identify whether a drive has been substituted by a global hotspare/standby, based on the "Substituted Drive" column:
============================
| COMMAND: global_standby list u1
============================
Global Standby  Substituted Drive
u1d14           u4d07   <------ Drive has been substituted
If the global hot spare exists, and the suspect drive has been substituted, use the <extractor>/Arrays/array<number>/commands/fru_list to identify the model and have the drive replaced.
If the global hot spare exists, and the suspect drive has NOT been substituted, continue to Step 13.
If no global hot spare exists, but vol_list shows a local hot spare, continue to Step 16.

13. Validate global standby size against the suspect drive.
Compare the size of the global hot spare drives and the suspect drive in fru_list. You may need to use the Sun System Handbook for the 6320 to identify the drive

6320, disk, hdd, fault, fault disabled, bypassed, disk led, normalized, audited

Attachments
This solution has no attachment
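As an illustration of the extractor checks in Steps 10 and 11, the sketch below expands a vol_list drive range such as "u2d08-13" into individual drive names and looks for a "vol recon" entry for a given pool in proc_list. It assumes the whitespace-separated layouts shown in the examples above. The function names expand_drive_range and recon_percent are hypothetical helpers, not part of the Solution Extract tooling.

```python
import re


def expand_drive_range(spec):
    """Expand 'u2d08-13' into ['u2d08', ..., 'u2d13'].

    A single drive name such as 'u2d14' is returned as a one-item list.
    """
    match = re.fullmatch(r"(u\d+d)(\d+)(?:-(\d+))?", spec)
    if match is None:
        raise ValueError(f"unrecognized drive specification: {spec!r}")
    prefix, first = match.group(1), match.group(2)
    last = match.group(3) or first
    width = len(first)  # keep the zero padding used in the source name
    return [f"{prefix}{n:0{width}d}" for n in range(int(first), int(last) + 1)]


def recon_percent(proc_list_text, pool):
    """Return the PERCENT value of a 'vol recon' process for pool, else None."""
    for line in proc_list_text.splitlines():
        fields = line.split()
        # Expected columns: VOLUME CMD_REF PERCENT TIME COMMAND (two words)
        if len(fields) >= 6 and fields[0] == pool and fields[4:6] == ["vol", "recon"]:
            return int(fields[2])
    return None


proc_sample = """\
VOLUME CMD_REF PERCENT TIME COMMAND
tray0_pool1 21568 74 53928:47 vol verify
tray1_pool2 25666 27 178:04 vol recon
"""

print(expand_drive_range("u2d08-13"))
print(recon_percent(proc_sample, "tray1_pool2"))
```

A return value of None from recon_percent corresponds to the "reconstruction process does not exist" branch of Step 11.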