Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-71-1002033.1
Update Date:2012-07-30
Solution Type  Technical Instruction Sure

Solution  1002033.1 :   Sun Fire[TM] v1280, E2900, 3800, 4800, 4810, 6800, E4900, E6900, and Netra 1280, 1290 Server: How to Recover from a Hung System Controller  


Related Items
  • Sun Fire 4810 Server
  • Sun Fire 3800 Server
  • Sun Fire 6800 Server
  • Sun Fire E6900 Server
  • Sun Fire 4800 Server
  • Sun Fire E2900 Server
  • Sun Fire E4900 Server
  • Sun Netra 1280 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Exx00
  • .Old GCS Categories>Sun Microsystems>Servers>Midrange Servers
  • .Old GCS Categories>Sun Microsystems>Servers>Midrange V and Netra Servers

PreviouslyPublishedAs
202844


Applies to:

Sun Netra 1280 Server - Version Not Applicable and later
Sun Fire 3800 Server - Version Not Applicable and later
Sun Fire 4800 Server - Version Not Applicable and later
Sun Fire 4810 Server - Version Not Applicable and later
Sun Fire 6800 Server - Version Not Applicable and later
All Platforms

Goal

When a system controller (SC) is hung, work through the following steps before resorting to the Reset button on the SC.

Fix

Try the following steps:

1) Connect to the "hung" SC by Telnet or directly through its serial port (for example, with tip), enter the platform shell, and run the "reboot" command (the "resetsc" command on Sun Fire[TM] v1280, E2900, and Netra 1280, 1290 servers).

2) If the "reboot" command does not work, or you cannot enter anything, log in to the spare SC and try to force a failover by using the "setfailover force" command.

  • This step is not available on Sun Fire[TM] v1280, E2900, and Netra 1280, 1290 servers.
  • This step will probably not work if the primary SC is completely hung. 

If this step does work, it will reboot the hung SC and make the spare SC the primary SC.

3) If failover does not complete, the LAST RESORT is to use the Reset button on the SC. This step is not available on Sun Fire[TM] v1280, E2900, and Netra 1280, 1290 servers; on those servers a platform power cycle is needed instead (shut down the Solaris OS before powering off).

BEFORE YOU PRESS THIS BUTTON, you must bring down the domains. This is critical because the domains may crash if the Reset button is pressed while they are up and running. See Document 1004364.1 for details.
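
The escalation order in steps 1 to 3 can be summarized as a small lookup. The Python sketch below is illustrative only: the function name, the model strings, and the step wording are assumptions made for this example and are not part of any Sun tool. It simply encodes which steps the text above says are unavailable on the v1280, E2900, and Netra 1280, 1290 servers.

```python
from typing import List

# Models for which the article says "setfailover force" and the SC Reset
# button are not available (hypothetical short names chosen for this sketch).
NO_FAILOVER_MODELS = {"v1280", "E2900", "Netra 1280", "Netra 1290"}


def recovery_steps(model: str) -> List[str]:
    """Return the escalation order from steps 1-3 for the given model."""
    if model in NO_FAILOVER_MODELS:
        return [
            "Connect to the SC (Telnet or serial port) and run: resetsc",
            "Shut down the Solaris OS, then power-cycle the platform",
        ]
    return [
        "Connect to the hung SC (Telnet or serial port) and run: reboot",
        "Log in to the spare SC and run: setfailover force",
        "LAST RESORT: bring down the domains, then press the SC Reset button",
    ]


if __name__ == "__main__":
    for number, step in enumerate(recovery_steps("6800"), start=1):
        print(f"{number}) {step}")
```

Keeping the steps as an ordered list mirrors the article: each later entry is attempted only when the previous one fails.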

NOTE: Make sure that the connection settings on the SC are correct.

Open a tip session to the serial port of the SC:

6800a-sc0:SC>  showplatform -p network

The system controller is configured to be on a network.

Network settings: static
Hostname: 6800a-sc0
IP Address: 129.156.xx.xx
Netmask: 255.255.255.0
Gateway: 129.156.xx.1
DNS Domain: UK.Sun.COM
Primary DNS Server: 129.156.xx.xx
Secondary DNS Server: 129.156.xx.xx
***Connection type: none   <----- No remote access enabled
Idle connection timeout : No timeout
Sun Fire Link Enabled: no
*** This shows remote access via telnet or ssh is not enabled.
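
When this output has been captured to a file or log, the "Connection type" field can be checked with a few lines of script. The following Python sketch is one possible way to do that; the function names are hypothetical, and it assumes the output has the same "Field: value" layout as the listing above, including the optional "***" markers and "<-----" annotations added in this article.

```python
from typing import Optional


def connection_type(showplatform_output: str) -> Optional[str]:
    """Extract the 'Connection type' value from captured output text."""
    for line in showplatform_output.splitlines():
        # Drop the leading "***" emphasis markers used in this article.
        cleaned = line.lstrip("* ").strip()
        if cleaned.lower().startswith("connection type"):
            value = cleaned.split(":", 1)[1]
            # Drop trailing annotations such as "<----- No remote access".
            return value.split("<")[0].strip()
    return None


def remote_access_enabled(showplatform_output: str) -> bool:
    """True when the connection type is ssh or telnet."""
    return connection_type(showplatform_output) in ("ssh", "telnet")


if __name__ == "__main__":
    sample = (
        "Hostname: 6800a-sc0\n"
        "***Connection type: none   <----- No remote access enabled\n"
        "Idle connection timeout : No timeout\n"
    )
    print(connection_type(sample), remote_access_enabled(sample))
```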

Running the command below changes the Connection type:

6800a-sc0:SC> setupplatform -p network

Network Configuration

Is the system controller on a network? [yes]:
Use DHCP or static network settings? [static]:
Hostname [6800a-sc0]:
IP Address [129.156.xx.xx]:
Netmask [255.255.255.0]:
Gateway [129.156.xx.1]:
DNS Domain [UK.Sun.COM]:
Primary DNS Server [129.156.xx.xx]:
Secondary DNS Server [129.156.xx.xx]:
**To enable remote access to the system controller, select "ssh" or "telnet".
**Connection type (ssh, telnet, none) [telnet]:
Idle connection timeout (in minutes; 0 means no timeout) [0]:
Enable Sun Fire Link? [no]: 

To enable remote access to the system controller, select either:

* ssh
* telnet

The SC must be rebooted for changes to the above network settings to take effect.



Product
Sun Fire 6800 Server
Sun Fire 4810 Server
Sun Fire 4800 Server
Sun Fire 3800 Server
Sun Fire v1280 Server
Sun Fire E2900 Server
Sun Fire E4900 Server
Sun Fire E6900 Server
Netra 1280 Server
Netra 1290 Server


Internal Comments

If the force option does not initiate a failover, and the customer or field personnel are remote from the system and therefore unable to press the Reset button on the hung system controller, there is a risky, undocumented alternative for "waking up" the hung SC. It can be performed on Serengeti (not LW8): execute the "setfailover override" command from the spare SC.
Note: this option is not shown by the setfailover -h command.

Note: as of 5.19.0 firmware, this option is available only in service or engineering mode (see Bug ID 4703904).

The override option ignores whatever the status of the system controller is supposed to be and tells the spare to become primary. It pays no attention to the fact that the other SC could still be primary.

Warning: use this procedure with caution and only as a last resort, because it could crash running domains.

Example (firmware prior to 5.19.0):

kremlin-sc1:sc> setfailover override
SC: SSC1
Spare System Controller
SC Failover: disabled
This will abruptly interrupt operations on the other System Controller.
This System Controller will become the main System Controller.
Do you want to continue? [no] yes
SC Failover did not complete.
The system controllers may not be synchronized.
Failover can be done forcefully but may crash domain(s).
Do you want to force failover to continue? [no] yes
kremlin-sc1:sc>

Example (firmware 5.19.0):

fort-sc0:sc> setfailover override
override: is not a valid argument
Usage: setfailover [-y|-n] off|on|force
       setfailover -h
fort-sc0:sc>
fort-sc0:sc> engineering
fort-sc0:sc[engineering]> setfailover override
Spare System Controller
SC Failover: disabled
Clock failover disabled.
This will abruptly interrupt operations on this System Controller.
This System Controller will become the spare System Controller.
Do you want to continue? [no]
fort-sc0:sc[engineering]>
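
The two transcripts above differ only in firmware level. As a rough aid, the Python sketch below compares a firmware version string against 5.19.0 to indicate whether the override option is expected to require service or engineering mode. The function names are hypothetical, the version string is assumed to be dot-separated integers, and treating every release after 5.19.0 the same way is an assumption that the note above does not confirm.

```python
from typing import Tuple


def parse_firmware_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '5.19.0' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def override_requires_restricted_mode(version: str) -> bool:
    """True when the firmware is 5.19.0 or later (assumption: later
    releases keep the restriction introduced in 5.19.0)."""
    return parse_firmware_version(version) >= (5, 19, 0)


if __name__ == "__main__":
    for fw in ("5.18.2", "5.19.0"):
        print(fw, override_requires_restricted_mode(fw))
```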
 

Keywords: SunFire, 3800, 4800, 4810, 6800, reset, system controller, failover

Previously Published As 75973

References

@ <BUG:4703904> - ADSR12BASE: LE2062 DISPLAYES A ERROR MESSAGE AFTER LAUNCHING THE FORMS APPL.
<NOTE:1004364.1> - Sun Fire[TM] Midrange Server: Safari Port Error may be caused by a resetting SC

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.