Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-71-1020006.1
Update Date:2012-07-13
Keywords:

Solution Type  Technical Instruction Sure

Solution  1020006.1 :   Steps to ensure there are no disk failures in LDOM environment  


Related Items
  • Sun Fire T1000 Server
  • Sun Fire T2000 Server
  • Sun SPARC Enterprise T5120 Server
  • Sun SPARC Enterprise T5220 Server
  • Sun SPARC Enterprise T1000 Server
  • Sun SPARC Enterprise T5240 Server
  • Sun SPARC Enterprise T2000 Server
  • Sun SPARC Enterprise T5140 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Usx/Blade/Netra>SN-SPARC: USx
  • .Old GCS Categories>Sun Microsystems>Servers>CMT Servers

PreviouslyPublishedAs
250806


Description
This document explains how to ensure there are no disk failures in an LDOM environment.

Steps to Follow
Disks must be kept in good condition to support LDOM operation as well as normal system operation. The information below describes how disks can fail and which Solaris[TM] commands are available to detect such failures.
Broadly speaking, a hard disk can fail in four ways that can lead to a loss of data:

1. Firmware Corruption / Damage to the firmware zone
2. Electronic Failure
3. Mechanical Failure
4. Logical Corruption

Combinations of these four types of failure are also possible.

1. Firmware Corruption / Damage to the firmware zone
Explanation: Hard disk firmware is the software code that controls, and is embedded in, the physical hard drive hardware. If the firmware of a hard disk becomes corrupted or unreadable, the computer is often unable to interact correctly with the hard disk. Frequently the data on the disk is fully recoverable once the drive has been repaired and reprogrammed.
Firmware failures - How to diagnose: Common Symptoms
* The hard disk will spin up when powered on, but be incorrectly recognised / not recognised at all by the computer
* The hard disk will spin up & be recognised correctly by the computer but the system will then hang during the boot process

2. Electronic Failure
Explanation: Electronic failure usually relates to problems on the controller board of the hard disk itself. The computer may suffer a power spike or electrical surge that knocks out the controller board on the hard disk, making it undetectable to the system firmware (OpenBoot PROM on these SPARC servers).
Electrical failures - How to diagnose: Common Symptom
* The hard disk will not spin up when the drive is powered on - it will appear dead & not be recognised by the computer

3. Mechanical Failure
Explanation: Mechanical hard disk failures are those which develop on components internal to the hard disk itself. Often, as soon as an internal component goes faulty, the data on the hard disk will become inaccessible.
Mechanical failures - How to diagnose: Common Symptoms
* When powered on, the hard drive will immediately begin to make a regular ticking or clicking sound

4. Logical Errors
Explanation: Often the easiest and the most difficult problems to deal with, logical errors can range from simple things such as an invalid entry in a file allocation table to truly horrific problems such as the corruption and loss of the file system on a severely fragmented drive. Logical errors are different to the electrical and mechanical problems above as there is usually nothing 'physically' wrong with the disk, just the information on it.

First, use the format command and the cfgadm -al command to check the disk status.
For example:

   format
   AVAILABLE DISK SELECTIONS:
          0. c0t0d0 <SUN18G cyl 7506 alt 2 hd 19 sec 248>
             /pci@1f,4000/scsi@3/sd@0,0
          1. c0t1d0 <SUN18G cyl 7506 alt 2 hd 19 sec 248>
             /pci@1f,4000/scsi@3/sd@1,0
          2. c0t2d0 <SUN18G cyl 7506 alt 2 hd 19 sec 248>
             /pci@1f,4000/scsi@3/sd@2,0
          3. c0t3d0 <SUN18G cyl 7506 alt 2 hd 19 sec 248>
             /pci@1f,4000/scsi@3/sd@3,0
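
The same listing can be checked without entering the interactive menu. The following is a minimal sketch, not part of the original procedure. It assumes the listing is produced with `echo | format` and that a disk in trouble is shown as `<drive type unknown>` or `<drive not available ...>` in place of its geometry. The function name `flag_format_disks` is hypothetical, and on Solaris `nawk` may be needed instead of `awk`.

```shell
# Hypothetical helper: reads a format(1M) disk listing on stdin and prints
# every disk whose description suggests the drive could not be queried.
flag_format_disks() {
    awk '
        # Disk entries look like: "0. c0t0d0 <SUN18G cyl 7506 alt 2 hd 19 sec 248>"
        $1 ~ /^[0-9]+\.$/ {
            if ($0 ~ /<drive type unknown>/ || $0 ~ /<drive not available/)
                print $2 ": needs attention"
        }
    '
}

# Assumed usage on a live system (run as root):
#   echo | format | flag_format_disks
```

No output from the helper means that every listed disk reported a normal description.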

Here is the 'cfgadm' display for controller c0:

   cfgadm -al
   Ap_Id                   Type         Receptacle   Occupant     Condition
   c0                      scsi-bus     connected    configured   unknown
   c0::dsk/c0t0d0          disk         connected    configured   unknown
   c0::dsk/c0t1d0          disk         connected    configured   unknown
   c0::dsk/c0t2d0          disk         connected    configured   unknown
   c0::dsk/c0t3d0          disk         connected    configured   unknown
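
On a system with many disks, the cfgadm listing can be filtered so that only suspicious entries remain. The sketch below is an illustration under stated assumptions: it treats a disk as suspicious when its receptacle is not `connected`, its occupant is not `configured`, or its condition is `failing`, `failed`, or `unusable`. A condition of `unknown`, as in the display above, is treated as normal. The function name `flag_cfgadm_disks` is hypothetical, and on Solaris `nawk` may be needed instead of `awk`.

```shell
# Hypothetical helper: reads "cfgadm -al" output on stdin and prints the
# disk attachment points that are not in the expected state.
flag_cfgadm_disks() {
    awk '
        # Columns: Ap_Id  Type  Receptacle  Occupant  Condition
        $2 == "disk" {
            if ($3 != "connected" || $4 != "configured" ||
                $5 == "failing" || $5 == "failed" || $5 == "unusable")
                print $1 ": " $3 "/" $4 "/" $5
        }
    '
}

# Assumed usage on a live system:
#   cfgadm -al | flag_cfgadm_disks
```

Each printed line shows the attachment point followed by its receptacle, occupant, and condition values.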
  
The iostat -En command also shows the error status of each device:

c0t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU  Product: MAY2073RCSUN72G  Revision: 0501 Serial No: 0706S08GSV
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c1t0d0           Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: MATSHITA Product: CD-RW  CW-8124   Revision: DZ13 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c0t1d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU  Product: MAY2073RCSUN72G  Revision: 0501 Serial No: 0706S08GST
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c0t2d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST973401LSUN72G  Revision: 0556 Serial No: 071111MCJT
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c0t3d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST973401LSUN72G  Revision: 0556 Serial No: 071111MCDV
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
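
The counters in this output can be scanned automatically. The following is a minimal sketch, assuming the five-line record layout shown above. It reports a device when its Hard Errors, Transport Errors, Media Error, or Predictive Failure Analysis counter is non-zero. Soft errors and illegal requests (such as those on the CD-RW drive above) are deliberately ignored, which is a choice made for this illustration and not a rule from this document. The function name `flag_disk_errors` is hypothetical, and on Solaris `nawk` may be needed instead of `awk`.

```shell
# Hypothetical helper: reads "iostat -En" output on stdin and prints the
# devices whose hard, transport, media, or predictive-failure counters
# are non-zero.
flag_disk_errors() {
    awk '
        # Record header: "<dev> Soft Errors: N Hard Errors: N Transport Errors: N"
        $2 == "Soft" && $3 == "Errors:" {
            dev = $1; hard = $7; trans = $10; media = 0; pfa = 0
        }
        # "Media Error: N Device Not Ready: N No Device: N Recoverable: N"
        $1 == "Media" && $2 == "Error:" { media = $3 }
        # Last line of a record: "Illegal Request: N Predictive Failure Analysis: N"
        $1 == "Illegal" && $2 == "Request:" {
            pfa = $7
            if (hard + trans + media + pfa > 0)
                print dev ": hard=" hard " transport=" trans " media=" media " pfa=" pfa
        }
    '
}

# Assumed usage on a live system:
#   iostat -En | flag_disk_errors
```

No output means that none of the monitored counters is above zero. Any device that is printed deserves a closer look with the other commands in this document.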


Use the prtdiag -v command to show the disk status as well as the overall machine status.

Solaris[TM] 10 11/06 (also known as Solaris[TM] 10 Update 3) includes the following feature for monitoring the disk status of the machine:

A new Fault Management Architecture-based diagnosis engine (DE) is provided on the Sun machine. This DE monitors the disk drives for
predictive failures by using the SMART technology in the disk drive's own firmware. When a disk failure is imminent, the LED next to the disk
is illuminated and a Fault Management Architecture fault is generated. This fault alerts the administrator to take specific action to ensure
system availability and full performance.

The T5140 and T5240 machines also provide the following feature:

Disk mirroring (RAID 1) is a technique that uses data redundancy (two complete
copies of all data stored on two separate disks) to protect against loss of data due to
disk failure. One logical volume is duplicated on two separate disks.



Product
Sun Fire T2000 Server
Sun Fire T1000 Server
Sun Netra T2000 Server
Sun Netra T5220 Server
Sun Netra T5440 Server
Netra T5220 AC
Sun SPARC Enterprise T5220 Server
Sun SPARC Enterprise T5240 Server
Sun Blade T6300 Server Module
Sun Blade T6320 Server Module
Sun SPARC Enterprise T5120 Server
Sun SPARC Enterprise T5140 Server
Sun Blade T6340 Server Module
Sun SPARC Enterprise T2000 Server
Sun SPARC Enterprise T5440 Server
Sun SPARC Enterprise T1000 Server

Internal Comments
This document contains normalized content and is managed by the Domain Lead(s) of the respective domains. To notify content owners of a knowledge gap contained in this document, and/or prior to updating this document, please contact the domain engineers that are managing this document via the "Document Feedback" alias(es) listed below:

Solaris OS Domain Feedback Alias : [email protected]


Normalized

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.