Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-71-1001778.1
Update Date:2012-05-17
Keywords:

Solution Type  Technical Instruction Sure

Solution  1001778.1 :   Sun Fire[TM] 3800, 4800/4810, 6800, E2900, E4900, E6900, V1280 or Netra[TM] 1280, 1290 server: How to Gather Data from a Hung Domain [Video]  


Related Items
  • Sun Fire 4810 Server
  • Sun Fire 3800 Server
  • Sun Netra 1290 Server
  • Sun Fire 6800 Server
  • Sun Fire E6900 Server
  • Sun Fire E2900 Server
  • Sun Fire V1280 Server
  • Sun Fire 4800 Server
  • Sun Fire E4900 Server
  • Sun Netra 1280 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Exx00
  • .Old GCS Categories>Sun Microsystems>Servers>Midrange V and Netra Servers
  • .Old GCS Categories>Support>KM>Content>Video
  • .Old GCS Categories>Sun Microsystems>Servers>Midrange Servers

PreviouslyPublishedAs
202431


Applies to:

Sun Fire E4900 Server - Version Not Applicable and later
Sun Netra 1280 Server - Version Not Applicable and later
Sun Fire E2900 Server - Version Not Applicable and later
Sun Netra 1290 Server - Version Not Applicable and later
Sun Fire 6800 Server - Version Not Applicable and later
All Platforms

Goal

Description
Instructions on how to gather data from a hung Sun Fire[TM] SF3800/SF4800/SF4810/SF6800/E4900/E6900/E2900/V1280.

A brief how-to video tutorial is available for this topic. It provides step-by-step instructions answering Sun's most frequently asked questions. View the video and/or follow the detailed instructions below.


Video - Troubleshooting a hung domain (5:00)

Fix

Steps to Follow

Please make sure to follow each step in the order in which it is presented.
The instructions for the Sun Fire[TM] E2900 and V1280 and the Netra[TM] 1280 and 1290 are presented separately, because these servers employ Lights Out Management (LOM) instead of the System Controller (SC) employed by the Serengeti/Amazon class of servers.

Instructions for Platforms employing the System Controller (SC)

1. Ensure that the domain is actually hung:

       - Can you ping the domain?
       - Can you telnet to the domain?

2. Ensure that the SC (System Controller) is not hung. If you can access the SC, log in to it and obtain a platform shell.

       A. If you get to the platform shell, run the following commands:
               SCname:SC> showlogs
               SCname:SC> showplatform
       B. If the SC is hung, see Document 1002033.1 for details on how to recover from a hung system controller. Then go back to step 2A.

3. Once in the platform shell attempt to get a domain shell:

               SCname:SC> console -d 

- If the command appears to hang, send a break signal to the domain:

       - If you are using telnet: press CTRL ], then at the telnet prompt type: send break
       - If you are connected to the SC via tip: use ~#

At this point you should have a domain shell prompt. If you do, continue with the following commands; otherwise continue to step 4.

- If you get the domain shell, run the following commands:

               SCname:A> showdomain -p status
               SCname:A> showlogs

Then type break to get to the OBP. If this takes you to the ok prompt, type sync to force a core file.

4. If you were not able to get to the ok prompt, the domain is truly hung and you will need to send an XIR (externally initiated reset) to the domain.

From the domain shell type: reset
The behavior of this command depends on the setting of the OBP variable error-reset-recovery:

       - sync: the system attempts to take a core file.
       - boot: the system simply reboots, as if the boot command had been issued at the ok prompt.
       - none: the system should drop you to the ok prompt.

At the ok prompt you can run the following commands. The '#' sign represents the CPU that took the XIR; use that number in the cbuf command. If possible, run this command on each of the CPUs (some commands depend on the firmware level of the SC):
{#} ok dump-sigblock
{#} ok # cbuf
{#} ok .xir-state-all

- If you were not able to return to the ok prompt, but have a domain prompt, type the following command:

SCname:A> showresetstate

5. If none of these tactics work, you may be forced to simply power off the domain.

If this is the case, run setkeyswitch off for the domain.
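
The reset behavior described in step 4 can be summarized as a small lookup. The Python sketch below is illustrative only: the helper names describe_reset_outcome and obp_commands_for_cpus are hypothetical and not part of any Sun tool, the outcome strings merely paraphrase step 4, and it assumes that cbuf takes the CPU number as a prefix, as the "{#} ok # cbuf" line suggests.

```python
# Illustrative sketch only; helper names are hypothetical, not Sun tooling.

# Expected result of 'reset' (XIR) for each error-reset-recovery setting,
# paraphrased from step 4 of this document.
RESET_OUTCOMES = {
    "sync": "attempts to take a core file",
    "boot": "reboots as if 'boot' was issued at the ok prompt",
    "none": "drops to the ok prompt for manual data collection",
}


def describe_reset_outcome(error_reset_recovery):
    """Return the expected XIR outcome for an error-reset-recovery value."""
    key = error_reset_recovery.strip().lower()
    if key not in RESET_OUTCOMES:
        raise ValueError(
            "unknown error-reset-recovery value: %r" % error_reset_recovery
        )
    return RESET_OUTCOMES[key]


def obp_commands_for_cpus(cpu_ids):
    """Build the ok-prompt command list; cbuf is repeated once per CPU."""
    commands = ["dump-sigblock"]
    commands += ["%d cbuf" % cpu for cpu in cpu_ids]
    commands.append(".xir-state-all")
    return commands


if __name__ == "__main__":
    print(describe_reset_outcome("none"))
    for line in obp_commands_for_cpus([0, 1, 2, 3]):
        print("ok " + line)
```

The CPU numbers passed to obp_commands_for_cpus are placeholders; use the CPU IDs actually present in the domain. Some of these commands may not be available, depending on the SC firmware level.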

 

Note: loghost setup for domain and platform may help in troubleshooting hang issues; please check reference section below for detailed information.
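
Step 1 of either procedure asks whether the domain still answers on the network. As a rough aid, the Python sketch below probes the telnet port with a plain TCP connection. The function name tcp_port_open and the default port 23 are assumptions made for illustration. A successful connection only shows that something accepted a connection on that port; it does not prove that Solaris is healthy, and it does not replace the ping test.

```python
import socket


def tcp_port_open(host, port=23, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, tcp_port_open("domain-a") would report whether a host named domain-a (a placeholder name) accepts telnet connections. If this check and ping both fail, continue with the console checks described in this document.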

 

Instructions for Platforms employing Lights Out Management (LOM)

1. Ensure that the domain is actually hung:

  • Can you ping the domain?
  • Can you telnet to the domain?

2. Log in to the LOM prompt via telnet/ssh or tip.

   A. Once you get the lom prompt, run the following commands:
       lom>showsc -v
       lom>showlogs -v

3. Try to connect to the domain and see what state it is in:

   A. Use the console command to connect to the domain:
       lom>console
   B. If there is no response from the console, use the escape sequence to break out. The default escape sequence is "#."
       lom>console
       #.
       lom>
   C. Once the domain is confirmed to be unreachable, go to the next step.

4. Use the 'break' or 'reset' command to recover.

   A. Try to break into the OBP with 'break'. If you get to the OBP, run sync to collect a core file.
       lom>break
       This will suspend Solaris.
       Do you want to continue? [no] yes
       Type 'go' to resume
       debugger entered.
       {3} ok sync
   B. If 'break' does not work, use 'reset' and collect the 'showresetstate' output as well. The behaviour of reset also depends on the OBP setting for error-reset-recovery, which should preferably be set to 'sync'.
       lom>reset
       This will abruptly terminate Solaris.
       Do you want to continue? [no] yes
       lom>showresetstate

5. If none of the procedures above work, a poweroff/poweron needs to be issued.

   Power off the platform:
       lom>poweroff
   Power on the platform, but do not start the domain:
       lom>poweron all
   Power on the platform and start the domain:
       lom>poweron
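
The LOM procedure above is an escalation ladder: each stage is attempted only when the previous one has failed to recover the domain or yield data. The Python sketch below simply records that order as a checklist. The stage names and the next_stage helper are hypothetical labels chosen for illustration; only the lom commands themselves come from the steps above.

```python
# Illustrative checklist only; stage names are hypothetical labels.
# The lom commands are taken from steps 2 to 5 of the LOM procedure.
LOM_ESCALATION = [
    ("collect", ["showsc -v", "showlogs -v"]),
    ("connect", ["console"]),
    ("break", ["break"]),
    ("reset", ["reset", "showresetstate"]),
    ("power-cycle", ["poweroff", "poweron"]),
]


def next_stage(current):
    """Return the stage to try after 'current', or None at the last stage."""
    names = [name for name, _ in LOM_ESCALATION]
    index = names.index(current)  # raises ValueError for an unknown stage
    return names[index + 1] if index + 1 < len(names) else None


if __name__ == "__main__":
    for name, commands in LOM_ESCALATION:
        for command in commands:
            print("%-12s lom>%s" % (name, command))
```

If 'break' reaches the ok prompt, the ladder stops there: run sync to collect the core file rather than moving on to 'reset'.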


References

Following Manuals contain procedures and details about setting up platform loghost (where available) and configuring automatic recovery of hung domains:

Also check:

  • Doc 1008702.1: Console Logging Options to capture Fatal Reset output for Sun systems
  • Doc 1018813.1: Sun Fire [TM] SF3800/SF4800/SF4810/SF6800 - E4900/E6900 Server: Domains running firmware 5.15.x or later with hang-policy set to "notify" may lose critical troubleshooting data



Product
Sun Fire 6800 Server
Sun Fire 4810 Server
Sun Fire 4800 Server
Sun Fire 3800 Server
Sun Fire E6900 Server
Sun Fire E4900 Server
Sun Fire E2900 Server
Sun Fire V1280 Server

Internal Section

NOTE: Procedures given in this document are dependent on OBP and SC versions

Keywords: System Controller, SC, Sun Fire, Serengeti, kernel, XIR

Previously Published As 46780

References

<NOTE:1002033.1> - Sun Fire[TM] v1280, E2900, 3800, 4800, 4810, 6800, E4900, E6900, and Netra 1280, 1290 Server: How to Recover from a Hung System Controller
<NOTE:1008702.1> - Console Logging Options to capture Fatal Reset output for Sun systems
<NOTE:1018813.1> - Sun Fire [TM] SF3800/SF4800/SF4810/SF6800 - E4900/E6900 Server: Domains running firmware 5.15.x or later with hang-policy set to "notify" may lose critical troubleshooting data
<NOTE:778.1> - Multimedia Content Reference

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.