Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Technical Instruction Sure

Solution 1479736.1: How to replace an Exadata Compute (Database) node hard disk drive (Predictive or Hard Failure)
Canned Action Plan procedure to replace an Exadata Compute (Database) node hard disk drive (Predictive or Hard Failure). This covers Exadata disk alerts HALRT-02007 and HALRT-02008.

Applies to:
Exadata Database Machine X2-2 Qtr Rack - Version Not Applicable to Not Applicable [Release N/A]
Exadata Database Machine V2 - Version Not Applicable to Not Applicable [Release N/A]
Exadata Database Machine X2-8 - Version Not Applicable to Not Applicable [Release N/A]
Exadata Database Machine X2-2 Hardware - Version Not Applicable to Not Applicable [Release N/A]
Exadata Database Machine X2-2 Full Rack - Version Not Applicable to Not Applicable [Release N/A]
Oracle Solaris on x86-64 (64-bit)
Information in this document applies to any platform.

Goal
Identify and replace a failed hard disk drive from an Exadata Compute (Database) node for hard or predictive failures.

Fix
DISPATCH INSTRUCTIONS:
The customer may choose to do the replacement themselves. In this case, the disk should be sent out using a parts-only dispatch.
WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?:
Linux MegaRAID familiarity

TIME ESTIMATE: 60 minutes. Total time may depend on disk re-sync time.

TASK COMPLEXITY: 0 (CRU-optional); default is FRU with Task Complexity 2.

FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
The failed hard disk may be marked either "critical" (hard failure) or "predictive failure".

For a critical hard failure, the failed hard disk should have the "OK to Remove" blue LED illuminated/flashing and the "Service Action Required" amber LED illuminated/flashing. This may trigger alert HALRT-02007; refer to Note 1113034.1.

For a predictive failure, the failed hard disk should have the "Service Action Required" amber LED illuminated/flashing. On certain image revisions, a predictively failed disk may not yet have been removed from the volume and may not have a fault LED lit. This may trigger alert HALRT-02008; refer to Note 1113014.1.

The normal DB node volume arrangement depends on the OS installed and the current active image version. Use "/opt/oracle.cellos/imageinfo" to determine the current active image version, and "uname -s" to determine the OS type (see the command example after the list below). The volumes expected are as follows:

V2/X2-2 Linux only, if the dual-boot Solaris image partition has been reclaimed or was not present:
X2-2 Linux and Solaris dual-boot, if other OS image partitions have not been reclaimed:
X2-2 Solaris only, if dual-boot Linux image partition has been reclaimed:
X2-8 Linux only:
X2-8 Linux and Solaris dual-boot, if other OS image partitions have not been reclaimed:
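As a quick check before matching the node to one of the arrangements above, the OS type and active image version can be confirmed directly on the node. This is a minimal command example using only the two utilities named above; the exact output format varies by image release:

# uname -s
(reports "Linux"; "SunOS" indicates the Solaris image)

# /opt/oracle.cellos/imageinfo
(the "Active image version" line shows the image currently in use)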
WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:

1. Back up the volume and be familiar with the bare metal restore procedure before replacing the disk. See Note 1084360.1 for details.

Note: If the DB node was running 11.2.2.1.1 or 11.2.2.2.x images and was in write-through caching mode at some stage (the default is write-back), there is a possibility that the Linux file system is corrupt due to a disk controller firmware bug. When this is encountered, the file system may have been operating normally, but it will go read-only when the corrupted blocks are rebuilt across to the hotspare disk. This may be unavoidable, as the rebuild copy back from hotspare to replacement occurs automatically. A bare metal restore is required to correct it.

2. Identify the failed disk (a consolidated command example covering steps 2a-2c follows step 9 below).

a. Obtain the enclosure ID for the MegaRAID card:
Linux:
# /opt/MegaRAID/MegaCli/MegaCli64 -encinfo -a0 | grep ID
Solaris:
# /opt/MegaRAID/MegaCli -encinfo -a0 | grep ID

b. Identify the physical disk slot that has failed:
Linux:
# /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"
Solaris:
# /opt/MegaRAID/MegaCli -pdlist -a0 | egrep -i "slot|firmware"
"Unconfigured(bad)" is the expected state for the faulted disk. In this example, it is located in physical slot 0, and the hotspare in slot 3 has started rebuilding the volume.
If all disks show as Online or Hotspare, the disk may be in a predictive failure state but not yet offline. The failed disk can then be identified using this additional information:
Linux:
# /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|predictive|firmware"
Solaris:
# /opt/MegaRAID/MegaCli -pdlist -a0 | egrep -i "slot|predictive|firmware"
In this example, the disk in slot 1 has reported itself as predictively failed several times but is still online. This disk should be considered the bad one. For more details refer to Note 1452325.1.

c. Use the locate function, which turns on the flashing "Service Action Required" amber LED:
Linux:
# /opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -start -physdrv[E#:S#] -a0
Solaris:
# /opt/MegaRAID/MegaCli -PdLocate -start -physdrv[E#:S#] -a0
where E# is the enclosure ID number identified in step 2a, and S# is the slot number of the disk identified in step 2b. In the example above, the command would be:
# /opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -start -physdrv[252:0] -a0

3. Verify the state of the RAID is Optimal or Rebuilding if there is a hotspare, or Degraded if there is not, with the good disk(s) online, before hot-swap removing the failed disk. If the failed disk was the global hotspare, this step should be skipped.
Linux (RAID5 example):
# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | grep -iE "target|state|slot"
Linux (RAID1 example):
# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | grep -iE "target|state|slot"
Solaris:
The volume type on Solaris is RAID0, so the failure may cause the virtual drive to no longer be visible. In that case, check that the expected number of good drives are present and online (3 of the 4 in X2-2, or 6 of the 8 in X2-8; the hotspare does not show in this command), and verify with "zpool status" that the pool is degraded with one of the mirrors online:
# /opt/MegaRAID/MegaCli -LdPdInfo -a0 | egrep -i "target|state|slot"

4. On the drive you plan to remove, push the storage drive release button to open the latch.

5. Grasp the latch and pull the drive out of the drive slot. (Caution: The latch is not an ejector. Do not bend it too far to the right; doing so can damage the latch. Also, whenever you remove a storage drive, replace it with another storage drive or a filler panel; otherwise the server might overheat due to improper airflow.)

6. Wait three minutes for the system to acknowledge that the disk has been removed.

7. Slide the new drive into the drive slot until it is fully seated.

8. Close the latch to lock the drive in place.

9. Verify the "OK/Activity" green LED begins to flicker as the system recognizes the new drive. The other two LEDs for the drive should no longer be illuminated. The server locate and disk service LED blinking (locate) function should turn off automatically. If it does not, it can be turned off manually for the device using:
Linux:
# /opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -stop -physdrv[E#:S#] -a0
Solaris:
# /opt/MegaRAID/MegaCli -PdLocate -stop -physdrv[E#:S#] -a0
where E# is the enclosure ID number identified in step 2a, and S# is the slot number of the disk identified in step 2b. In the example above, the command would be:
# /opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -stop -physdrv[252:0] -a0
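For reference, the identification commands from steps 2a through 2c can be run back to back from a small script. This is a minimal Linux sketch, not part of the original procedure: it assumes the enclosure ID is reported on a "Device ID" line of the -encinfo output, and SLOT=0 is only an example value that must be replaced with the slot found in step 2b. On Solaris, substitute /opt/MegaRAID/MegaCli for MegaCli64.

#!/bin/bash
# Sketch: gather the MegaRAID details from steps 2a-2c in one pass (Linux).
MEGACLI=/opt/MegaRAID/MegaCli/MegaCli64
SLOT=0   # example only -- use the failed slot identified in step 2b

# Step 2a: enclosure ID of the internal disk enclosure
ENC=$($MEGACLI -encinfo -a0 | awk '/Device ID/ {print $NF; exit}')
echo "Enclosure ID: $ENC"

# Step 2b: slot, firmware state and predictive-failure counts for all disks
$MEGACLI -pdlist -a0 | grep -iE "slot|predictive|firmware"

# Step 2c: blink the amber "Service Action Required" LED on the suspect disk
$MEGACLI -PdLocate -start "-physdrv[${ENC}:${SLOT}]" -a0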
OBTAIN CUSTOMER ACCEPTANCE

1. Verify the state of the replacement disk.
If the OS is Linux, depending on the volume arrangement and image version, the disk may automatically become the new hotspare disk, or it may stay in an "Unconfigured(good)" state until the hotspare rebuild has completed. If it stays Unconfigured, the hotspare will copy back to rebuild onto the new disk after the rebuild has completed. If it is a RAID1, it should automatically come into the volume and start rebuilding. If the OS is Solaris, the volume is a Solaris RAID0, so the disk may not come into a volume automatically and will remain in the "Unconfigured(good)" state until it is added to a volume.
# /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -physdrv[E#:S#] -a0
where E# is the enclosure ID number identified in step 2a of the replacement steps, and S# is the slot number of the disk replaced. In the example above, the command would be:
# /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -physdrv[252:0] -a0

2. Verify the replacement disk has been added to the expected RAID volume.
If the OS is running Linux and the failed disk was originally the global hotspare, then the replacement should have become the hotspare automatically, as identified in step 1, and this step should be skipped. If that did not occur automatically, the new disk can be assigned as the hotspare with the following command:
# /opt/MegaRAID/MegaCli/MegaCli64 -PdHsp -set -EnclAffinity -PhysDrv[E#:S#] -a0
where E# is the enclosure ID number identified in step 2a of the replacement steps, and S# is the slot number of the disk replaced.
If the OS is running Linux and the failed disk was part of a RAID volume, use the following MegaRAID command to verify the status of the RAID:
# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | grep -iE "target|state|slot"
If the copyback had already completed when checked, the disk may already be in "Online" state. If it is in Rebuild or Copyback state, you can use the following to verify progress to completion (a polling example is given after the PARTS NOTE below):
# /opt/MegaRAID/MegaCli/MegaCli64 -pdrbld -showprog -physdrv[E#:S#] -a0
where E# is the enclosure ID number identified in step 2a of the replacement steps, and S# is the slot number of the disk in Rebuild state. This is typically the original hotspare disk slot, for example:
# /opt/MegaRAID/MegaCli/MegaCli64 -pdrbld -showprog -physdrv[252:3] -a0
or
# /opt/MegaRAID/MegaCli/MegaCli64 -pdcpybk -showprog -physdrv[E#:S#] -a0
where E# is the enclosure ID number identified in step 2a of the replacement steps, and S# is the slot number of the disk in Copyback state. This is typically the replaced disk slot, for example:
# /opt/MegaRAID/MegaCli/MegaCli64 -pdcpybk -showprog -physdrv[252:0] -a0
If the OS is running Solaris, the RAID0 MegaRAID volume may need to be recreated if this was not done automatically. In this example the rpool mirror disk in slot 3 had failed:
# /opt/MegaRAID/MegaCli -cfgldadd -r0[252:3] wb nora direct nocachedbadbbu -strpsz1024 -a0
Use format to partition the disk with a full-disk Solaris label, a single-cylinder boot block on slice 8, and the rest of the disk as the root partition on slice 0:
# format -e
Re-attach the new disk to the zpool. Use the -f option if this is a mounted root pool:
# zpool attach -f rpool c3t1d0s0 c3t2d0s0
If this was one of the two boot disks in the root pool, then re-enable booting:
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c3t2d0s0
Verify the status of the zpool rebuild:
# zpool status

PARTS NOTE:
Refer to the Exadata Database Machine Owner's Guide Appendix C for part information.
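As referenced in step 2 above, the rebuild and copyback progress checks can be wrapped in a simple polling loop so they do not have to be re-run by hand. This is a minimal Linux sketch, not part of the original note: the enclosure:slot pairs 252:3 (rebuilding hotspare) and 252:0 (replaced disk) are the example values used above and must be adjusted to the node being serviced. Interrupt the loop with Ctrl-C once both operations report completion.

#!/bin/bash
# Sketch: poll MegaRAID rebuild and copyback progress every 5 minutes (Linux).
MEGACLI=/opt/MegaRAID/MegaCli/MegaCli64

while true; do
    date
    # Rebuild progress on the original hotspare slot (example 252:3)
    $MEGACLI -pdrbld -showprog "-physdrv[252:3]" -a0
    # Copyback progress on the replaced disk slot (example 252:0)
    $MEGACLI -pdcpybk -showprog "-physdrv[252:0]" -a0
    # Overall view of logical drive and member disk states
    $MEGACLI -LdPdInfo -a0 | grep -iE "target|state|slot"
    sleep 300
done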
How to identify which Exadata disk FRU part number to order, based on image, vendor, and mixed disk support status - Note 1416303.1
Internal Only References:
NOTE:1360343.1 - INTERNAL Exadata Database Machine Hardware Current Product Issues
NOTE:1360360.1 - INTERNAL Exadata Database Machine Hardware Troubleshooting

References:
NOTE:1416303.1 - How to identify which Exadata disk FRU part number to order, based on image, vendor, and mixed disk support status
NOTE:1113034.1 - HALRT-02007: Database node hard disk failure
NOTE:1113014.1 - HALRT-02008: Database node hard disk predictive failure
NOTE:1084360.1 - Bare Metal Restore Procedure for Compute Nodes on an Exadata Environment
NOTE:1071220.1 - Oracle Sun Database Machine V2 Diagnosability and Troubleshooting Best Practices
NOTE:1452325.1 - Determining when Disks should be replaced on Oracle Exadata Database Machine
NOTE:1274324.1 - Oracle Sun Database Machine X2-2/X2-8 Diagnosability and Troubleshooting Best Practices

Attachments:
This solution has no attachment.