Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Troubleshooting Sure
Solution 1003281.1: Sun Enterprise[TM] 10000: Recovering a hung domain
Previously Published As: 204556
Applies to:
Sun Enterprise 10000 Server
All Platforms
***Checked for relevance on 06-May-2011***

Purpose
Steps to follow to recover a hung Enterprise[TM] 10000 domain. The aim of this document is to provide guidelines for recovering a hung domain in a short period of time while maximising the chances of root cause analysis.

Last Review Date
May 5, 2011

Instructions for the Reader
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.
Troubleshooting Details

--- Symptoms ---
The first step is to confirm that the domain is hung. The domain itself may in fact be running, while the end user has the impression of a hang because the application they are using has become unresponsive.

Check ping, rlogin, and telnet to the domain. If it is possible to get a session on the domain, use UNIX commands (e.g. ps -elf, df -k) to determine why the domain appears to be hung.

Attempt a console connection through a netcon session on the SSP. Note any output from the console (disk errors, NFS mounts not responding, out of per-user processes, out of memory, not enough swap space to fork, etc.). Also check the netcon log for the same messages. If prompted, log in as root so that a network-mounted home directory is not required.

Try to ping an IP address external to the system (e.g. the default router).
Attempt df -k: is there a hung filesystem or NFS mount?
Try ps -elf: are there runaway or respawning processes?
Consult Document 1001950.1 for other forensics.

--- Solution ---
If it is still possible to control the domain, but the source of the problem cannot be found, then a live savecore(1M) of the domain may be a viable option. Because the system is still running, live savecore (savecore -L) can only be used when dump and swap do not share a common device. If a dedicated dump partition is not available, see Document 1004803.1 for information on using dumpadm to assign a dedicated dump file.

The output of the following ps command must be captured immediately prior to running savecore and submitted to Sun along with the crash dump files for analysis.

# /usr/bin/ps -e -o uid,pid,ppid,pri,nice,addr,vsz,wchan,time,fname

Note: Live savecore is possible on Solaris[TM] 7 and later.
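The precondition for a live savecore (dump and swap must not share a device) can be checked before running savecore -L. The following is a minimal sketch, not a supported tool. It assumes a POSIX shell (for example ksh), that dumpadm prints a "Dump device:" line, and that swap -l lists each swap device path in its first column. The function names dump_device, is_swap_device and live_savecore_ok are hypothetical helpers introduced only for illustration.

```shell
#!/bin/sh
# Sketch: decide whether `savecore -L` is usable by confirming that the
# configured dump device is not also listed as a swap device.
# Assumed input formats: `dumpadm` and `swap -l` output.

# Print the dump device path found in `dumpadm` output read from stdin.
dump_device() {
    sed -n 's/^ *Dump device: *\([^ ]*\).*/\1/p'
}

# Return 0 if device $1 appears in the first column of the
# `swap -l` output read from stdin, 1 otherwise.
is_swap_device() {
    while read -r path rest; do
        [ "$path" = "$1" ] && return 0
    done
    return 1
}

# live_savecore_ok DUMPADM_OUTPUT SWAP_L_OUTPUT
# Returns 0 if the dump device is dedicated, 1 if it is shared with
# swap, and 2 if no dump device line could be found.
live_savecore_ok() {
    dev=$(printf '%s\n' "$1" | dump_device)
    [ -n "$dev" ] || return 2
    if printf '%s\n' "$2" | is_swap_device "$dev"; then
        return 1
    fi
    return 0
}

# Possible use on the domain (left commented out; run as root):
#   if live_savecore_ok "$(dumpadm)" "$(swap -l)"; then
#       /usr/bin/ps -e -o uid,pid,ppid,pri,nice,addr,vsz,wchan,time,fname > /var/tmp/ps.out
#       savecore -L
#   fi
```

The output path /var/tmp/ps.out in the commented usage is only an example. The parsing is kept separate from the commands themselves so that it can be tried against saved output before being used on a domain that is already in trouble.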
If the domain is not accessible from the netcon session:

On the SSP:
Check that the host isn't resetting already. If so, monitor progress from the latest POST log.
Check platform/domain messages for current messages.
Other commands to be run to check status are:
These are the steps to recover a hung domain and are also defined on the hostint(1M) manpage.
Additional Information

Netcon commands:

These steps may be followed by a Sun Engineer and include additional steps and undocumented switches. These steps should be followed if there is sufficient time to attempt recovery of the domain without a hardware reset (hostreset or bringup). For each step, it is recommended to wait 5 minutes before checking the domain state to see whether the command has succeeded:

Steps 1 & 2 from above
3. hostint -p X

Reference internal for Live Savecore gotchas: Document 1004608.1

Keywords: E10K, hung, domain, hostreset, hostint, Starfire
Previously Published As 70566

Attachments
This solution has no attachment