Sun Enterprise[TM] 10000: How To Recover from a Domain Hang Condition

Asset ID:	1-71-1321263.1
Update Date:	2011-05-19
Keywords:

Solution Type Technical Instruction Sure

Solution 1321263.1 : Sun Enterprise[TM] 10000: How To Recover from a Domain Hang Condition

Applies to:

Sun Enterprise 10000 Server - Version: Not Applicable to Not Applicable - Release: N/A to N/A
Information in this document applies to any platform.

Goal

This document provides a step-by-step process for recovering an E10000 domain from a hang condition.

Solution

If a domain hangs, follow these steps to dump core and/or recover the domain. Before starting, use the domain_switch command to set the SUNW_HOSTNAME environment variable to the name of the problem domain.
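
As a minimal sketch, a guard like the following can confirm that the variable is set before any of the steps are run. The function name `require_domain` is hypothetical, and the sketch assumes a POSIX-style shell; if the SSP login shell is csh, the check would need to be adapted.

```sh
# Hypothetical helper (not part of the SSP software): refuse to continue
# unless SUNW_HOSTNAME names a domain.
require_domain() {
    if [ -z "${SUNW_HOSTNAME:-}" ]; then
        echo "SUNW_HOSTNAME is not set; run domain_switch <domain> first" >&2
        return 1
    fi
    echo "Target domain: $SUNW_HOSTNAME"
}
```

Calling `require_domain || exit 1` at the top of any wrapper script would stop it from acting on the wrong domain, or on no domain at all.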

Step	Action(s)	Notes
1	`ssp% hostinfo -h` `ssp% ping <domain>` `ssp% ps -ef | grep bringup` `ssp% ps -ef | grep hpost`	This is to establish the domain state.
2	`ssp% hostint`	Wait at least 5 minutes for the panic to complete. Do not assume console activity is working. Spot check the machine state by repeating Step 1.
3	`ssp% hostint -p <alternate cpu>`	Preferably, the <alternate cpu> is on a different system board than the bootproc. At a minimum, the alternate should use a different BBSRAM (bootproc+2). Wait at least 5 minutes for the panic to complete. Do not assume console activity is working. Spot check the machine state by repeating Step 1.
4	`ssp% sigbcmd panic`	Wait at least 10 minutes for the panic to complete. Do not assume console activity is working. Spot check the machine state by repeating Step 1.
5	`ssp% sigbcmd -p <alternate cpu> panic`	Preferably, the <alternate cpu> is on a different system board than the bootproc. At a minimum, the alternate should use a different BBSRAM (bootproc+2). Wait at least 10 minutes for the panic to complete. Do not assume console activity is working. Spot check the machine state by repeating Step 1.
6	`ssp% sigbcmd -I panic`	Wait at least 10 minutes for the panic to complete. Do not assume console activity is working. Spot check the machine state by repeating Step 1.
7	`ssp% sigbcmd -I -p <alternate cpu> panic`	Preferably, the <alternate cpu> is on a different system board than the bootproc. At a minimum, the alternate should use a different BBSRAM (bootproc+2). Wait at least 10 minutes for the panic to complete. Do not assume console activity is working. Spot check the machine state by repeating Step 1.
8	`ssp% sigbcmd obp`	If the OBP `ok>` prompt is reached, execute explicitly: `ok> ctrace` `ok> .registers` `ok> .locals` `ok> sync` Capture all of the screen output and provide it along with the panic dump generated by the OBP sync.
9	`ssp% sigbcmd -p <alternate cpu> obp`	Preferably, the <alternate cpu> is on a different system board than the bootproc. At a minimum, the alternate should use a different BBSRAM (bootproc+2). If the OBP `ok>` prompt is reached, execute explicitly: `ok> ctrace` `ok> .registers` `ok> .locals` `ok> sync` Capture all of the screen output and provide it along with the panic dump generated by the OBP sync.
10	`ssp% bringup -f -l64`	Force the bringup only if all other attempts fail. At a minimum, the POST level (`-l`) needs to be 24 to test CPU operation.
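
The panic-forcing part of the table (Steps 2 through 7) follows a fixed escalation order with a minimum wait after each command. The sketch below only prints that order as a checklist and executes nothing. The function name `print_panic_plan` is hypothetical, and the alternate CPU number is whatever the operator has chosen according to the notes in the table.

```sh
# Hypothetical dry-run helper: prints the commands from Steps 2-7 with the
# minimum wait after each one. It does not run any SSP command.
# $1 = alternate CPU number chosen by the operator.
print_panic_plan() {
    alt_cpu=$1
    while IFS='|' read -r step wait_min cmd; do
        printf 'Step %s: %s  (wait >= %s min, then repeat Step 1)\n' \
            "$step" "$cmd" "$wait_min"
    done <<EOF
2|5|hostint
3|5|hostint -p $alt_cpu
4|10|sigbcmd panic
5|10|sigbcmd -p $alt_cpu panic
6|10|sigbcmd -I panic
7|10|sigbcmd -I -p $alt_cpu panic
EOF
}
```

Running `print_panic_plan 6`, for example, lists the six commands in order with CPU 6 filled in as the alternate. Keeping the list as a printed plan rather than an automated script leaves the decision to move to the next step with the operator.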

For any of these steps to produce a usable core file, savecore must be enabled on the domain, and the primary swap partition or dump device must be large enough to hold the dump.
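
On Solaris releases that provide `dumpadm`, its output reports whether savecore is enabled. The sketch below is a hypothetical filter, `savecore_enabled`, that reads `dumpadm` output on standard input and succeeds only if savecore is reported as enabled. It assumes the usual `Savecore enabled: yes` line in that output; older releases without `dumpadm` need a different check.

```sh
# Hypothetical filter: succeeds only if the dumpadm output on stdin
# reports that savecore is enabled. Assumes the "Savecore enabled: yes"
# line format.
savecore_enabled() {
    grep -q 'Savecore enabled: *yes'
}
```

A typical use, run as root on the domain itself (not on the SSP) while it is healthy, would be `dumpadm | savecore_enabled || echo "savecore is not enabled"`.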

Why wait so long between steps?

Five to ten minutes is a reasonable wait, even with a little buffer added. Too often, people execute one command, wait a few seconds, and conclude that it is not working when in fact it very well might be. At a minimum, re-executing the hostinfo and ps commands from Step 1 slows the process down and gives a given command time to start and show progress (domains that are pingable but hung are rare, but they do exist). The bottom line: proceed with caution, execute with a plan, and do not hurry.
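
One way to enforce the wait is to wrap it together with the Step 1 checks, as in the following sketch. The function name `wait_then_check` and the `WAIT_UNIT` variable are hypothetical; `WAIT_UNIT` (seconds per minute of waiting, default 60) exists only so the function can be dry-run quickly. The sketch assumes a POSIX-style shell on the SSP.

```sh
# Hypothetical helper: wait the given number of minutes, then re-run the
# Step 1 state checks for the named domain.
# $1 = minutes to wait, $2 = domain name.
# WAIT_UNIT is an assumption added for dry runs; leave it at 60 in real use.
wait_then_check() {
    minutes=$1
    domain=$2
    sleep $(( $minutes * ${WAIT_UNIT:-60} ))
    hostinfo -h
    ping "$domain"
    ps -ef | grep bringup
    ps -ef | grep hpost
}
```

For example, `wait_then_check 10 <domain>` after a `sigbcmd panic` waits the full ten minutes before reporting the machine state, which removes the temptation to move on after a few seconds.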

Allowing a few extra minutes now may be what it takes to minimize domain interruptions in the future, by ensuring that whatever information we gather can be used to identify the problem. This is not a popular position in the heat of the moment, but extraordinary circumstances require extraordinary action. Remember, this is not the mainstream situation, and we must take every action necessary to ensure that we collect data when we can.

Attachments

This solution has no attachment