Asset ID: 1-71-1321263.1
Update Date: 2011-05-19
Solution Type: Technical Instruction Sure Solution
1321263.1: Sun Enterprise[TM] 10000: How To Recover from a Domain Hang Condition
Related Items:
- Sun Enterprise 10000 Server
Related Categories:
- GCS>Sun Microsystems>Servers>High-End Servers
In this Document
Goal
Solution
Applies to:
Sun Enterprise 10000 Server - Version: Not Applicable to Not Applicable - Release: N/A to N/A
Information in this document applies to any platform.
Goal
This document provides a step-by-step process for recovering an E10000 domain from a hang condition.
Solution
If a domain hangs, follow these steps to force a core dump and/or
recover the domain. Before starting, use the domain_switch command to
set the SUNW_HOSTNAME environment variable to the name of the problem domain.
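Because every command in the procedure acts on whichever domain SUNW_HOSTNAME names, it can help to confirm the variable before going further. The following is a minimal sketch, assuming a POSIX sh is available on the SSP (the ssp% prompt suggests the interactive shell is csh, so this would be saved and run as a script). require_domain is a hypothetical helper name, not an SSP command.

```shell
#!/bin/sh
# Sketch only: refuse to continue unless SUNW_HOSTNAME names the hung domain.
# require_domain is a hypothetical helper, not part of the SSP software.
require_domain() {
    expected=$1
    # The variable is normally set by domain_switch, as described above.
    if [ -z "${SUNW_HOSTNAME:-}" ]; then
        echo "SUNW_HOSTNAME is not set; run domain_switch first" >&2
        return 1
    fi
    if [ "$SUNW_HOSTNAME" != "$expected" ]; then
        echo "SUNW_HOSTNAME is '$SUNW_HOSTNAME', expected '$expected'" >&2
        return 1
    fi
    echo "Working on domain $SUNW_HOSTNAME"
}
```

A caller might run require_domain with the name of the hung domain and stop if it returns non-zero, so that no interrupt is sent to a healthy domain by mistake.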
Step 1
Action(s):
ssp% hostinfo -h
ssp% ping <domain>
ssp% ps -ef | grep bringup
ssp% ps -ef | grep hpost
Notes:
This is to establish the domain state.
Step 2
Action(s):
ssp% hostint
Notes:
Wait at least 5 minutes for the panic to complete. Do not assume console activity is working.
Spot check the machine state by repeating Step 1.
Step 3
Action(s):
ssp% hostint -p <alternate cpu>
Notes:
Preferably, the <alternate cpu> is on a different system board than the bootproc. At a minimum, the alternate should use a different BBSRAM (bootproc+2).
Wait at least 5 minutes for the panic to complete. Do not assume console activity is working.
Spot check the machine state by repeating Step 1.
Step 4
Action(s):
ssp% sigbcmd panic
Notes:
Wait at least 10 minutes for the panic to complete. Do not assume console activity is working.
Spot check the machine state by repeating Step 1.
Step 5
Action(s):
ssp% sigbcmd -p <alternate cpu> panic
Notes:
Preferably, the <alternate cpu> is on a different system board than the bootproc. At a minimum, the alternate should use a different BBSRAM (bootproc+2).
Wait at least 10 minutes for the panic to complete. Do not assume console activity is working.
Spot check the machine state by repeating Step 1.
Step 6
Action(s):
ssp% sigbcmd -I panic
Notes:
Wait at least 10 minutes for the panic to complete. Do not assume console activity is working.
Spot check the machine state by repeating Step 1.
Step 7
Action(s):
ssp% sigbcmd -I -p <alternate cpu> panic
Notes:
Preferably, the <alternate cpu> is on a different system board than the bootproc. At a minimum, the alternate should use a different BBSRAM (bootproc+2).
Wait at least 10 minutes for the panic to complete. Do not assume console activity is working.
Spot check the machine state by repeating Step 1.
Step 8
Action(s):
ssp% sigbcmd obp
Notes:
If the OBP ok> prompt is reached, run each of the following explicitly:
ok> ctrace
ok> .registers
ok> .locals
ok> sync
Capture all of the screen output and provide it together with the panic dump generated by the OBP sync.
Step 9
Action(s):
ssp% sigbcmd -p <alternate cpu> obp
Notes:
Preferably, the <alternate cpu> is on a different system board than the bootproc. At a minimum, the alternate should use a different BBSRAM (bootproc+2).
If the OBP ok> prompt is reached, run each of the following explicitly:
ok> ctrace
ok> .registers
ok> .locals
ok> sync
Capture all of the screen output and provide it together with the panic dump generated by the OBP sync.
Step 10
Action(s):
ssp% bringup -f -l64
Notes:
Force the bringup only if all other attempts fail. At a minimum, the level needs to be 24 to test CPU operation.
For any of these steps to yield a usable dump, savecore must be enabled on the domain and the primary swap partition/dump
device must be large enough to hold the core file.
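The panic attempts in Steps 2 through 7 follow a fixed escalation order with fixed minimum waits, which can be written down as a checklist before starting. The sketch below only prints the commands and never runs them. It assumes a POSIX sh; print_panic_plan and min_alt_cpu are hypothetical helper names, and the alternate CPU number is supplied by the operator, who must still confirm that the CPU is actually present in the domain.

```shell
#!/bin/sh
# Sketch only: print (never execute) the panic attempts from Steps 2 through 7.
# Both function names are hypothetical helpers, not SSP commands.

# Minimum alternate CPU per the notes above: bootproc+2 (a different BBSRAM).
# A CPU on a different system board is preferred when one is available.
min_alt_cpu() {
    echo $(( $1 + 2 ))
}

print_panic_plan() {
    alt_cpu=$1
    # Each entry: step number | minimum wait in minutes | SSP command.
    while IFS='|' read -r step wait_min cmd; do
        echo "Step $step: $cmd (wait at least $wait_min minutes, then repeat Step 1)"
    done <<EOF
2|5|hostint
3|5|hostint -p $alt_cpu
4|10|sigbcmd panic
5|10|sigbcmd -p $alt_cpu panic
6|10|sigbcmd -I panic
7|10|sigbcmd -I -p $alt_cpu panic
EOF
}
```

For example, with a bootproc of 4, print_panic_plan "$(min_alt_cpu 4)" lists the six commands with CPU 6 as the alternate. Printing the plan first keeps the order and the waits in front of the operator without risking an unintended interrupt.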
Why wait so long between steps?
Five or ten minutes is a reasonable wait, and a little extra does no harm. Too often, people execute one command, wait a few seconds and conclude
it is not working when, in fact, it very well might be. At a minimum,
re-executing the hostinfo and ps commands from Step 1 slows the process down and
gives a command time to start and show progress. (Domains that are pingable
yet hung are rare, but they do exist, so ping alone is not conclusive.) Bottom line:
proceed with some caution and execute with a plan, but don't hurry.
Allowing a few extra minutes now may be what
is necessary to minimize domain interruptions in the future, by
ensuring that whatever information we gather can be used to
identify the problem. We know this is not a popular position
in the "heat of the moment", but extraordinary circumstances require
extraordinary action. Remember, this is not the mainstream situation,
and we must take every action necessary to ensure we collect data when
we can.
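One way to keep to the waiting discipline described above is to let a small loop enforce the minimum wait and run a spot check every minute. This is a minimal sketch, assuming a POSIX sh. wait_with_checks, CHECK_CMD and SLEEP_CMD are hypothetical names introduced here; CHECK_CMD stands in for whatever Step 1 command the operator chooses and must be a simple command without pipes.

```shell
#!/bin/sh
# Sketch only: wait a given number of minutes, spot checking once per minute.
# CHECK_CMD and SLEEP_CMD are hypothetical hooks so the loop can be tried safely.
CHECK_CMD=${CHECK_CMD:-"echo spot-check"}
SLEEP_CMD=${SLEEP_CMD:-sleep}

wait_with_checks() {
    minutes=$1
    elapsed=0
    while [ "$elapsed" -lt "$minutes" ]; do
        $SLEEP_CMD 60                 # one minute between spot checks
        elapsed=$((elapsed + 1))
        echo "minute $elapsed of $minutes"
        $CHECK_CMD                    # e.g. a Step 1 command chosen by the operator
    done
}
```

After issuing a command from the procedure, calling wait_with_checks 5 or wait_with_checks 10 holds the operator to the stated minimum while still showing progress each minute.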
Attachments
This solution has no attachment