Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Problem Resolution Sure

Solution 1004712.1: Sun Fire[TM] 12K/15K/E20K/E25K: Domain reboot hangs at "resetting..." and does not run HPOST
PreviouslyPublishedAs 206542

Symptoms

A Sun Fire[TM] 12K/15K/E20K/E25K domain is issued a reboot, an init 6, or a "halt + boot", and it goes down to OBP. It prints "Resetting..." and then sits there indefinitely. No post log is generated and no HPOST process is ever run for this domain. A manual setkeyswitch off and on for the domain finally brings it back up with no issues.

Here is an example of this issue taken from a real case:

May 19 22:06:38 2003 # init 0
May 19 22:06:43 2003
May 19 22:06:44 2003 INIT: New run level: 0
May 19 22:06:44 2003 The system is coming down. Please wait.
May 19 22:06:44 2003 System services are now being stopped.
May 19 22:06:54 2003 Print services already stopped.
May 19 22:07:26 2003 The system is down.
May 19 22:07:56 2003 syncing file systems... done
May 19 22:08:03 2003 Program terminated
May 19 22:08:43 2003 {2} ok boot
May 19 22:08:43 2003 Resetting...
May 19 22:53:12 2003
May 19 22:53:12 2003 @(#)OBP 4.5.20 2003/02/13 18:08 Sun Fire 15000
May 19 22:53:12 2003 IOSRAM based Console initialized
May 19 22:53:12 2003 Probing Pseudo NVRAM device

The customer entered init 0 and then issued the boot command. The domain sat at "Resetting..." for almost 45 minutes (22:08:43 to 22:53:12) before the customer finally intervened with a setkeyswitch off and on to restore the domain to the OS. Afterward, the customer issued a reboot to see whether the issue was limited to "init 0 + boot". It was not; the reboot also hung at "Resetting...".

Resolution

Here is an example taken from the same customer case during a successful boot cycle:

May 8 15:40:47 2003 rebooting...
May 8 15:40:47 2003 Resetting...
May 8 15:47:49 2003
May 8 15:48:02 2003
May 8 15:48:02 2003
May 8 15:48:02 2003 Sun Fire 15000, using IOSRAM based Console
May 8 15:48:03 2003 Copyright 1998-2002 Sun Microsystems, Inc. All rights reserved.
May 8 15:48:03 2003 OpenBoot 4.5, 94208 MB memory installed, Serial #44593284.
May 8 15:48:03 2003 Ethernet address 0:0:be:a8:70:84, Host ID: 82a87084.
May 8 15:48:03 2003
May 8 15:48:03 2003
May 8 15:48:03 2003
May 8 15:48:04 2003 Rebooting with command: boot
May 8 15:48:04 2003
May 8 15:48:05 2003 Boot device: /pci@1c,600000/pci@1/scsi@2/disk@0,0:a File and args: /

During the successful boot cycle for this domain, about seven minutes pass between the "Resetting..." message and the OBP banner (the banner indicates that hpost has completed on the domain).

When a domain does a reset at OBP, it is supposed to execute HPOST on the domain components. An hpost process should exist on the SC, and if the domain was rebooted, the hpost process shows a -Q option being passed to it (Quick POST).

-------------------
Below SMS 1.3
-------------------

If the SMS version is below SMS 1.3, a hang at the "Resetting..." stage might be the result of domain_asr (domain Automatic System Recovery) being disabled. Domain ASR can be disabled in the dsmd_tuning.txt file located in the /etc/opt/SUNWSMS/SMS1.X/config directory on the system controller.

The dsmd_tuning.txt file is the configuration file of the Domain Status Monitoring Daemon (dsmd). It tells dsmd on the system controller how to function and how to control the platform's domains. The setting for domain_asr is shown towards the bottom of the file.

From /etc/opt/SUNWSMS/SMS1.X/config/dsmd_tuning.txt:

--------------------------------------------------------------------
** The default monitoring controls are on.
* To turn off all domains state monitoring, change domain_mon to 0.
* To turn off all domains recovery actions, change domain_asr to 0.
*
domain_mon = 1
domain_asr = 1
--------------------------------------------------------------------

If "domain_asr = 0" and you are running a version of SMS older than 1.3, this is why the domain hangs at "Resetting..." during normal reboot or boot operations.
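The check described above can be sketched in a few lines of Python. This is only an illustration, not a Sun-supplied tool: the function names parse_dsmd_tuning and asr_disabled are hypothetical, and the parsing rules (lines beginning with "*" are comments, settings are "name = integer") are an assumption inferred from the excerpt shown above. A missing domain_asr line is treated as "not disabled", on the assumption that the default is on, as the file's own comment says.

```python
def parse_dsmd_tuning(text):
    """Return {name: int} for every 'name = value' line in the text.

    Assumption: lines starting with '*' are comments, as in the excerpt.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comment lines, and anything that is not an assignment.
        if not line or line.startswith("*") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        try:
            settings[name.strip()] = int(value.strip())
        except ValueError:
            # Ignore values that are not plain integers.
            continue
    return settings


def asr_disabled(text):
    """True only if the text explicitly sets domain_asr to 0."""
    return parse_dsmd_tuning(text).get("domain_asr") == 0


# The excerpt from the article, used here as sample input.
SAMPLE = """\
** The default monitoring controls are on.
* To turn off all domains state monitoring, change domain_mon to 0.
* To turn off all domains recovery actions, change domain_asr to 0.
*
domain_mon = 1
domain_asr = 1
"""
```

To use it, read the platform-wide file and each domain-specific file on both SCs, pass the contents to asr_disabled, and treat any True result on a pre-1.3 SMS installation as a likely cause of the hang.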
********************************************************************
NOTE: Each domain can also have its own dsmd_tuning.txt file, which controls how dsmd behaves for that specific domain only. The domain-specific dsmd_tuning.txt file is in the domain configuration directory, /etc/opt/SUNWSMS/config/<A-R>. Make sure domain_asr is not disabled here either.
********************************************************************

Re-enable domain ASR by changing "domain_asr = 0" to "domain_asr = 1" in the relevant dsmd_tuning.txt files, and then restart dsmd so that it re-reads its configuration file. Dsmd is best restarted by stopping and starting SMS, but first make sure that failover is off and that no platform configuration changes are occurring when you stop and start SMS. Make the changes on both SCs so that the configuration of dsmd is the same regardless of which SC is the MAIN.

-------------------
SMS 1.3 and Above
-------------------

The fix for Bug ID 4658538, introduced in SMS 1.3, allows a domain to reboot properly regardless of the domain_asr setting. So, if this behavior is encountered and the SMS version is 1.3 or higher, the cause is something else.

The most likely cause of this behavior on SMS 1.3 and above is a permissions problem on the files responsible for configuring HPOST on the platform or domain. If it is a permissions problem, you should expect to see the domain reboot, go down to OBP, and appear to hang at "Resetting..." as described above. With a permissions problem, hpost will execute on the domain and post logs should be created. The post logs (in /var/opt/SUNWSMS/SMS/adm/<A-R>/post), however, should show an error like the following:

# Cmdline: /opt/SUNWSMS/SMS1.3/bin/hpost -d B -Q
Unable to open .postrc file /etc/opt/SUNWSMS/config/B/.postrc
Permission denied
Errors in .postrc file. Bailing out!

As that message clearly indicates, hpost cannot read the .postrc file in question, so the domain remains at "Resetting..." trying to execute HPOST on the domain.
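As an illustration of how the post log error above could be detected automatically, here is a minimal Python sketch. It is not part of SMS: the names unreadable_postrc_paths and mode_is_world_readable are hypothetical, and the regular expression assumes the error text appears exactly as in the excerpt above. The second helper tests the "other" read bit of a file mode, which is what world readable (for example 644) means in practice.

```python
import os
import re
import stat

# Assumption: the error wording matches the post log excerpt in this article.
POSTRC_ERROR = re.compile(r"Unable to open \.postrc file\s+(\S+)")


def unreadable_postrc_paths(post_log_text):
    """Return the .postrc paths that hpost reported it could not open."""
    return POSTRC_ERROR.findall(post_log_text)


def mode_is_world_readable(mode):
    """True if the 'other' read permission bit is set in a file mode."""
    return bool(mode & stat.S_IROTH)


def file_is_world_readable(path):
    """Apply the same check to a file on disk."""
    return mode_is_world_readable(os.stat(path).st_mode)


# The post log excerpt from the article, used here as sample input.
SAMPLE_LOG = """\
# Cmdline: /opt/SUNWSMS/SMS1.3/bin/hpost -d B -Q
Unable to open .postrc file /etc/opt/SUNWSMS/config/B/.postrc
Permission denied
Errors in .postrc file. Bailing out!
"""
```

Run against a domain's post log, unreadable_postrc_paths lists the files to inspect; file_is_world_readable can then confirm whether each one lacks world read permission.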
Ultimately, a setkeyswitch off and on is executed, and the domain posts just fine and then boots back up.

When a domain is rebooted, the sms-dsmd user is responsible for executing HPOST on the domain. When a domain is keyswitched off and on, HPOST is executed by the sms-svc user (or by a domain-specific user if Access Control Lists (ACLs) are in use). Both users must have access to the HPOST configuration files in order to properly recover a domain when necessary.

The .postrc files and blacklist files used by HPOST need to be world readable (644) regardless of the owner of the file. If they are world readable, both sms-svc and sms-dsmd can read them and configure a domain properly at this "Resetting..." stage of OBP.

Relief/Workaround

If the reboot that triggered this issue is the result of a cron job, or of a panic overnight or on a weekend when nobody is around, the hang at "Resetting..." may last for a long time until manual intervention brings the domain back up.

The basic warning is this: disable ASR only when instructed to do so by Sun support, and understand the risks of doing so when running an SMS version below 1.3.

This issue also stresses the importance of making sure the HPOST configuration files have the correct permissions, regardless of SMS version. These seemingly trivial changes can leave a domain down for an extended period as the result of something as basic as a reboot.

Additional Information

The Problem Resolution below provides full details on HPOST configuration file permissions.

<Document: 1010600.1> Sun Fire[TM] 12K/15K/E20K/E25K: "Domain failed by hpost: ecode=39"

Product

Sun Fire 15K Server
Sun Fire 12K Server
Sun Fire E25K Server

Internal Comments

For internal Sun use only.

See Problem Resolution <Document: 1004778.1> for details on why domain_asr might be disabled.

Bug ID 4521655 was filed for the domain_asr behavior. Just know that this "Resetting..." hang isn't a bug.
This is how ASR worked prior to SMS 1.3. The fix for Bug ID 4658538 allows ASR to be disabled while still permitting domain recovery through a reboot.

starcat, 12k, 15k, resetting, rebooting, boot, hang, asr, dsmd, dsmd_tuning

Previously Published As 70064

Change History

Reviewed by ESG Content Team on Nov 24, 2009

Date: 2006-01-22
User Name: 18392
Action: Update Canceled
Comment: *** Restored Published Content *** SSH Audit

Date: 2006-01-22
User Name: 18392
Action: Update Started
Comment: SSH Audit
Version: 0

Attachments

This solution has no attachment