Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution
1004598.1: Sun Fire[TM] 12K/15K/E20K/E25K: Recovering from a System Controller disk failure
Previously Published As: 206377
Applies to:
Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms

Symptoms
OS version: Solaris[TM] 8 10/01 or later
SC software version: SMS 1.1 or higher
SDS version: SDS 4.2.1 or subsequent releases of SVM (Solaris Volume Manager)

Prerequisites:
Output of the 'metadb -i' command (included in explorer)
Output of the 'metastat -p' command (included in explorer)
Output of the 'prtvtoc' command for all disks (included in explorer)
Recent backup tape for filesystems (if both disks are lost)

Cause
One (or both) of the internal disks of the Platform System Controller (SC, where SMS services run) is faulted and needs to be replaced.

Solution

Scenario #1: Loss of 1 of 2 disks on SC

1. As we have to shut down the SC to replace the defective disk, we need to ensure that this SC is the SPARE before shutting it down. As user sms-svc, use the 'showfailover' command to determine the status of the SCs:
    % showfailover -v

2. Use the 'metastat' command to determine the failed submirrors:
    d10: Mirror

3. Use the 'metadb' command to determine unavailable/unreadable state database replicas:
    flags first blk block count

4. Use the 'metadb' command to delete the state database replicas on the bad disk:
    # metadb -d -f c0t3d0s4
Depending on how the disk has failed, this step may not succeed. If this is the case, delete the state database replicas during reboot (step 9).

5. As we have to shut down the SC to replace the defective disk, we need to ensure that the SC will boot using the correct OBP alias. Prevent the SC from rebooting automatically after shutdown, so that it stops at the ok prompt, by setting "auto-boot?" to false using the 'eeprom' command as superuser on the SC:
    # eeprom 'auto-boot?=false'

6. If the disk failure occurs on the MAIN SC: loss of a disk is not a failover condition, so a failover to the other SC must be forced using the 'setfailover' command as user sms-svc:
    % setfailover force
This action forces the former MAIN SC to reset and reboot as SPARE, and transfers the role of MAIN SC to the opposite SC.
If the disk failure occurs on the SPARE SC: disable failover on the MAIN SC using the 'setfailover' command as user sms-svc. The SPARE SC can then be shut down.
On the MAIN SC:
    % setfailover off
On the SPARE SC:
    # init 0

7. Replace the defective disk in the SCPER board (see the Sun Fire 15K System Service Manual, 806-3512-xx). While replacing the drive, remember that to power off the SCPER, you must run poweroff SC#.

8. Boot using the correct OBP alias:
    ok devalias

9. If step 4 above failed (the 'metadb -d -f' command was unsuccessful due to the nature of the disk failure), or if a reboot occurs before the disk is replaced, the current boot will fail and stop in single-user mode, because at least 51% of the state database replicas must be readable. If this is the case, in single-user mode, use the 'metadb' command to delete the state database replicas (ignore any "Read-only file system" error messages), then proceed with normal startup:
    # metadb -d -f c0t3d0s4

10. Partition the new disk exactly as the old disk was partitioned, using the 'format' command.

11. Re-create the state databases with the 'metadb' command using the previous configuration:
    # metadb -a -c3 -f c0t3d0s4
This configuration can be checked using 'metadb -i'.

12. Use the 'metareplace' command to re-enable the submirrors:
    # metareplace -e d10 c0t3d0s0
This operation takes about 20 minutes per gigabyte of filesystem. The configuration can be checked using 'metastat'.
Note: If the resync process does not complete properly (the submirror reports "needs maintenance"), the submirrors must be detached ('metadetach'), cleared ('metaclear') and then attached again ('metattach'). This automatically starts the resync process, and the metadevices then go to the "OK" state.

13. Set "auto-boot?" to true using the 'eeprom' command as superuser:
    # eeprom 'auto-boot?=true'

14. Enable failover using the 'setfailover' command as user sms-svc on the MAIN SC:
    % setfailover on

15. Synchronize data from the MAIN SC to the SPARE SC using the 'setdatasync' command as user sms-svc on the MAIN SC:
    % setdatasync backup

Scenario #2: Loss of both disks on SC

If the SDS-mirrored root disk for the SC has been completely destroyed, use the following steps to resolve issues with the mirrored boot configuration.

If this problem occurs on the MAIN SC, fail over to the opposite SC using the 'setfailover' command as user sms-svc:
    % setfailover force
We are now working on a SPARE SC which has the defective disks.

1. If disk failures have occurred, replace the defective disks in the SCPER board (see the Sun Fire 15K System Service Manual, 806-3512-xx). If the disks have not failed but have been corrupted, continue with step 2 below.

2. Boot from CD-ROM in single-user mode:
    ok boot cdrom -s

3. Restore the root filesystem from backup tape into /a and install the boot block using:
    # installboot /usr/platform/sun4u/lib/fs/ufs/bootblk /dev/rdsk/c0t2d0s0

4. Restore the /export/install filesystem.

5. Modify the /etc/system file: remove all lines between the "MDD root info" lines and between the "MDD database info" lines:
    Begin MDD root info (do not edit)

6. Modify the /etc/vfstab file by changing all metadevices for the root filesystem back to regular slices. Comment out all other metadevices.
Before:
    #device device mount FS fsck mount mount
After:
    #device device mount FS fsck mount mount

7. Remove all lines (except comment lines) from the /etc/lvm/mddb.cf file.

8. Boot the system from the freshly restored boot disk:
    ok boot disk2
At this time, this SC is defined as the SPARE SC. For reference, disk aliases are:
    ok devalias

9. Re-create the state databases with the 'metadb' command using the previous configuration:
    # metadb -a -c3 -f c0t2d0s4

10. Modify /etc/lvm/md.tab so that all mirrors are one-way mirrors and each one-way mirror refers to the restored side.
Before:
    d10 -m d11 d12
After:
    d10 -m d11

11. Create the metadevices:
    # metainit -f -a

12. Set the metadevice as the root device:
    # metaroot d10

13. Restore the metadevice entries in the /etc/vfstab file:
    #device device mount FS fsck mount mount

14. Reboot.

15. The second submirrors can now be attached to the mirrored metadevices using the 'metattach' command:
    # metattach d10 d12

16. Enable failover using the 'setfailover' command as user sms-svc on the MAIN SC:
    % setfailover on

17. Synchronize data from the MAIN SC to the SPARE SC using the 'setdatasync' command as user sms-svc on the MAIN SC:
    % setdatasync backup

Note: Scripts are available on the EIS-CD to set up the SC disks:
    /sun/tools/SF15K/SF15k-sc-bootdisks-start.sh
After running the scripts:
    # df -k

Scenario #3: Errors on drive but no failure of disk on SC

1. As we have to shut down the SC to replace the defective disk, we need to ensure that this SC is the SPARE before shutting it down. As user sms-svc, use the 'showfailover' command to determine the status of the SCs:
    % showfailover -v

2. Use the 'metastat' command to determine the failed submirrors:
    d10: Mirror

3. Use the 'metadb' command to determine unavailable/unreadable state database replicas:
    flags first blk block count

4. Set "auto-boot?" to false using the 'eeprom' command as superuser on the SC:
    # eeprom 'auto-boot?=false'

5. If the disk failure occurs on the MAIN SC: loss of a disk is not a failover condition, so a failover to the other SC must be forced using the 'setfailover' command as user sms-svc:
    % setfailover force
This action forces the former MAIN SC to reset and reboot as SPARE, and transfers the role of MAIN SC to the opposite SC.

6. If the disk failure occurs on the SPARE SC: disable failover on the MAIN SC using the 'setfailover' command as user sms-svc. The SPARE SC can then be shut down.
On the MAIN SC:
    % setfailover off
On the SPARE SC:
    # init 0

7. Replace the defective disk in the SCPER board (see the Sun Fire 15K System Service Manual, 806-3512-xx). As in Scenario #1, to power off the SCPER, you must run poweroff SC#.

8. Boot using the correct OBP alias:
    ok devalias
If the faulty disk was c0t2d0 (disk2), boot from disk3.
If the faulty disk was c0t3d0 (disk3), boot from disk2:
    ok boot disk2

9. If an earlier 'metadb -d -f' command was unsuccessful due to the nature of the disk failure, or if a reboot occurs before the disk is replaced, the current boot will fail and stop in single-user mode, because at least 51% of the state database replicas must be readable. If this is the case, in single-user mode, use the 'metadb' command to delete the state database replicas (ignore any "Read-only file system" error messages), then proceed with normal startup:
    # metadb -d -f c0t3d0s4

10. Partition the new disk exactly as the old disk was partitioned, using:
    # fmthard -s /var/tmp/prtvtoc.orig /dev/rdsk/c0t3d0s2

11. Re-create the state databases with the 'metadb' command using the previous configuration:
    # metadb -a -c3 -f c0t3d0s4
This configuration can be checked using 'metadb -i'.

12. Attach the submirrors using the 'metattach' command:
    # metattach d10 d12
    # metattach d20 d22
    # metattach d30 d32

13. To verify that the metattach operations completed properly, use:
    # metastat -i

14. Set "auto-boot?" to true using the 'eeprom' command as superuser:
    # eeprom 'auto-boot?=true'

15. Enable failover using the 'setfailover' command as user sms-svc on the MAIN SC:
    % setfailover on

16. Synchronize data from the MAIN SC to the SPARE SC using the 'setdatasync' command as user sms-svc on the MAIN SC:
    % setdatasync backup

Product: Sun Fire 15K Server, Sun Fire E25K Server, Sun Fire E20K Server, Sun Fire 12K Server, High-End Servers
Keywords: disk, failure, system, controller, disksuite, sds, sms
Attachments: This solution has no attachment
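Two rules from the procedures above are easy to get wrong at the console: which OBP alias to boot from when one SC disk is faulty (step 8 of Scenario #3), and how long a submirror resync is likely to take (about 20 minutes per gigabyte, step 12 of Scenario #1). The following is a minimal sketch of both as shell functions. The function names `surviving_alias` and `resync_minutes` are hypothetical and are not part of SMS, SDS or SVM; the disk-to-alias mapping is taken only from the text above and should be confirmed against the actual 'devalias' output on the SC.

```shell
#!/bin/sh
# Hypothetical helpers; not part of SMS, SDS/SVM, or the EIS-CD scripts.

# Print the OBP alias to boot from, given the faulty SC disk.
# Mapping taken from step 8 of Scenario #3:
#   c0t2d0 (disk2) faulty -> boot disk3
#   c0t3d0 (disk3) faulty -> boot disk2
surviving_alias() {
    case "$1" in
        c0t2d0) echo disk3 ;;
        c0t3d0) echo disk2 ;;
        *)
            echo "unknown SC disk: $1" >&2
            return 1
            ;;
    esac
}

# Rough resync estimate in minutes for a filesystem of $1 gigabytes,
# using the "about 20 minutes per gigabyte" figure from Scenario #1.
resync_minutes() {
    echo $(( $1 * 20 ))
}
```

For example, `surviving_alias c0t3d0` prints the alias to pass to 'boot' at the ok prompt, and `resync_minutes 4` gives a rough wait time for a 4 GB filesystem. The estimate is only as good as the 20-minutes-per-gigabyte figure quoted in the procedure.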