Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Problem Resolution Sure
Solution 1008719.1: Resolving Hardhang problems on Ultra Servers
Previously Published As: 211973
Applies to:

Sun Fire 4800 Server
Sun Fire 6800 Server
Sun Fire V480 Server
Sun Fire V490 Server
Sun Fire V880 Server
All Platforms

Symptoms

None of the terminals respond, the console does not respond, ping/telnet does not respond, Stop-A does not break to OBP, and "send break" from a tip line does not break into the OBP. If all of the above have been tried and none breaks out of the hang, the system is hung. Without a core file to analyze, it is almost impossible for support to determine the cause of the hang.

Cause

These steps neither provide the final solution nor detect the cause of the hardhang, but they will help in getting a core file with which to analyze the problem.

In all the cases listed below, once you are in OBP, type "sync" to get the core dump. If the system was booted with kadb, do some initial analysis and then type $q to enter OBP.

NOTE: This document was written specifically for Sun4U architecture systems. While many of these instructions will be applicable to other architectures, some will not. XIR is only available on Ultra Enterprise systems.

Solution

Options
-------
1. Enable Deadman
2. Set Breakpoint
3. Install Hardhang Kernel
4. XIR

1. Enable Deadman
-----------------
Enable the deadman timer as described in <Document 1004530.1> KERNEL: How to enable deadman kernel code.

2. Set Breakpoint
-----------------
The system should have been booted with kadb. After the system comes up, get into kadb (Stop-A/"send break") and set a breakpoint in system_high_handler(). This function is invoked only on level 15 interrupts and is associated with fan failures and system board detection.

To set the breakpoint in kadb:

    kadb: (type return)
    kadb[0]: system_high_handler:b
    kadb[0]: :c

When the system hardhangs again, follow the procedure described in the section "Generating a Level 15 Interrupt".

Pros: Will succeed in some instances where 'snooping' does not.
Cons: Requires a reboot if kadb is not enabled.
      Requires a free system board slot.
      Cannot break a hang caused by a device other than the CPU seizing a system bus.
      Will fail if level 15 interrupts have been masked out.

3. Install Hardhang Kernel
--------------------------
A special kernel needs to be built and installed at the customer site. Additionally, the breakpoint in system_high_handler() should be set through kadb (see the above section "Set Breakpoint"). The system is then set up to break out of the hang. Should the system hardhang, follow the procedure described in the section "Generating a Level 15 Interrupt".

Pros: Will succeed even if all the interrupts are masked.
Cons: Requires a custom kernel.
      Will fail if all the CPUs have PSTATE_IE = 0.
      Requires a free system board slot.
      Cannot break a hang caused by a device other than the CPU seizing a system bus.

4. XIR
------
This is the last resort in case the interrupts have been disabled. XIR is a non-maskable interrupt and will definitely break the system out of the hang. Unfortunately, this method also clears memory, so a core dump cannot be taken. It does, however, provide some information about the CPU state at the time of the hang.

The remote Externally Initiated Reset (XIR) command, although limited in its current form, can be used to aid software debugging of hung systems. Currently XIR stores the following information for each CPU:

    TL     (Trap Level)
    TT     (Trap Type)
    TPC    (Program Counter)
    TNPC   (Next Program Counter)
    TSTATE (Trap State Register)

This information is then gathered by typing .xir-state-all in the OBP. (You may need to send a Stop-A/"send break" to the machine to stop it from rebooting in order to issue this command.)

There are 2 methods for initiating the XIR:

Method 1:
Press the XIR pin on the clock board, which is at the rear of the E4000 (the FE handbook notes the location of the XIR switch). To the right of the XIR switch is the POR switch; DO NOT press it, as it will cycle power.
When XIR is pressed, the system will come to the "ok" prompt (or wait until it comes to the "ok" prompt). This method is easier than entering the key sequence noted in Method 2.

Method 2:
    Press Return key (twice)
    Press ~ key (once, possibly twice)
    Press Control-Shift-X keys (together)

This key sequence should reboot the system. At this point, you will need to do a Stop-A/"send break" to get to the ok prompt.

Once the system is at the OBP prompt, get the CPU state info:

    ok .xir-state-all

Pros: Will break out of the hang.
Cons: Will not be able to get a useful core file.

Generating a Level 15 Interrupt
-------------------------------
On a sun4u architecture system, a level 15 interrupt is generated when a system board is inserted. This interrupt is also generated by a fan failure, on both the sun4u and sun4d architectures, but since the fans are not easily accessible, board insertion is the method described here. If, however, the system in question is a sun4d, then disconnecting a fan is the only method available for generating a level 15 interrupt.

When the system hangs, insert a system board into a free slot. This will generate a level 15 interrupt, which should trigger the breakpoint in kadb. Once in kadb, debugger commands can be run to examine the current state of the system. Of particular interest are:

    $r          dump the registers
    $c          dump the current stack backtrace
    freemem/D   see how much memory is free

When kadb debugging is complete, attempt to take a core dump by doing:

    kadb[0]: $q
    ok sync

WARNING: If a non-forced level 15 interrupt occurs on the system while the breakpoint is set or the debug kernel is in place, the system will break to the OBP/kadb prompt. The system cannot be used until control is returned to the kernel by typing "go" at the OBP or :c at the kadb prompt.

Attachments

This solution has no attachment