Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1323594.1
Update Date: 2012-05-11
Keywords:

Solution Type  Problem Resolution Sure

Solution  1323594.1 :   Sun Fire[TM] X4540 Server: After BIOS Upgrade, Server Is Unresponsive


Related Items
  • Sun Fire X4540 Server
Related Categories
  • PLA-Support>Sun Systems>x64>Server>SN-x64: AMD-STOR-SERVER




In this Document
Symptoms
Cause
Solution


Created from <SR 3-3428803301>

Applies to:

Sun Fire X4540 Server - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.

Symptoms

After a firmware upgrade, the system was reported as unable to complete POST or to remain powered on. The IPMI SEL log from the ILOM snapshot showed BIOS/watchdog timeouts but no specific hardware-related failures.

less *ipmiint_sel_elist.out


6b7a | OEM record e0 | 0000e400140000000000000000
6b7b | OEM record e0 | 0000e45c140000000002000000
6b7c | OEM record e0 | 0000e464140000000044440004
6b7d | OEM record e0 | 000110e40400000000138061c5
6b7e | OEM record e0 | 000110e40400000000138061c5
6b7f | OEM record e0 | 000110e8040000000019467880
6b80 | OEM record e0 | 00000004040000000000b00006
6b81 | OEM record e0 | 00000048040000000011110022
6b82 | OEM record e0 | 000000500400000000007f0623
6b83 | OEM record e0 | 00000058040000000000030000
6b84 | OEM record e0 | 0001006c0400000000fff90000
6b85 | OEM record e0 | 000110e8040000000019467880
6b86 | OEM record e0 | 00060004040000000000b00105
6b87 | OEM record e0 | 0006001c0400000000028000f0
6b88 | OEM record e0 | 0006003c04000000000a000000
6b89 | OEM record e0 | 0006004c040000000000480000
6b8a | OEM record e0 | 00080004040000000000b00000
6b8b | OEM record e0 | 00090004040000000000b00000
6b8c | OEM record e0 | 001830000800100f2312031022
6b8d | OEM record e0 | 0018304c08f200000000070f0f
6b8e | OEM record e0 | 0018305408000000100e022020
6b8f | 04/19/2011 | 10:48:36 | Watchdog 1 #0x18 | BIOS Reset | Asserted
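When reviewing a long listing like the one above, it can help to separate the raw OEM records from the timestamped sensor events. The following is a minimal Python sketch, assuming the pipe-delimited layout seen in this listing (three fields for OEM records, six for sensor events). The function name split_sel and the dictionary keys are illustrative, not part of any ILOM or ipmitool interface.

```python
# Sample lines copied from the SEL listing in this document.
SAMPLE = """\
6b8d | OEM record e0 | 0018304c08f200000000070f0f
6b8e | OEM record e0 | 0018305408000000100e022020
6b8f | 04/19/2011 | 10:48:36 | Watchdog 1 #0x18 | BIOS Reset | Asserted
"""

EVENT_KEYS = ("id", "date", "time", "sensor", "event", "state")


def split_sel(text):
    """Split SEL listing text into (oem_records, sensor_events).

    Assumes the layout shown in this document: OEM records have three
    pipe-separated fields, timestamped sensor events have six.
    """
    oem, events = [], []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) == 3 and fields[1].startswith("OEM record"):
            oem.append({"id": fields[0], "type": fields[1], "data": fields[2]})
        elif len(fields) == 6:
            events.append(dict(zip(EVENT_KEYS, fields)))
    return oem, events


oem, events = split_sel(SAMPLE)
for ev in events:
    print(ev["id"], ev["sensor"], ev["event"], ev["state"])
```

Run against the full file, this would surface the "Watchdog 1 / BIOS Reset" event without scrolling through the OEM records.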


--------------------------------------------------------------------------------
SEL Decode Summary - Logged errors (2):
--------------------------------------------------------------------------------

6b8d | CPU0 AMD Opteron Misc. Control (00h:18h.3h) reg 04ch = f2000000_00070f0fh
MC NB Status High/Low
- Bus HT Watchdog timeout
- Request timed out
- Processor Context corrupt
- Error reporting enabled
- Uncorrected error
- Overflow error
- Error Valid
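The flags listed for record 6b8d can be cross-checked against the register value by hand. The sketch below assumes the common x86 machine-check status layout (bit 63 valid, 62 overflow, 61 uncorrected, 60 enabled, 57 processor context corrupt), an extended error code in bits 20:16, and an error code in bits 15:0. These bit positions are an assumption drawn from general machine-check conventions, not from this document, and should be confirmed against the AMD BIOS and Kernel Developer's Guide for the installed processor. The function name decode_mc_status is illustrative.

```python
# Assumed bit positions (generic machine-check status layout); verify
# against the processor's BKDG before relying on them.
STATUS_FLAGS = {
    63: "Error Valid",
    62: "Overflow error",
    61: "Uncorrected error",
    60: "Error reporting enabled",
    57: "Processor Context corrupt",
}


def decode_mc_status(text):
    """Decode a value written like 'f2000000_00070f0fh'.

    Returns (value, flags, ext_code, err_code).
    """
    value = int(text.rstrip("h").replace("_", ""), 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (value >> bit) & 1]
    ext_code = (value >> 16) & 0x1F   # assumed field: bits 20:16
    err_code = value & 0xFFFF         # assumed field: bits 15:0
    return value, flags, ext_code, err_code


value, flags, ext_code, err_code = decode_mc_status("f2000000_00070f0fh")
for name in flags:
    print("-", name)
print("extended error code: %#x, error code: %#06x" % (ext_code, err_code))
```

For the value in record 6b8d, all five assumed flag bits are set, which agrees with the flags in the decode summary.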

6b8e | CPU0 AMD Opteron Misc. Control (00h:18h.3h) reg 054h = 00000010_0e022020h
Watchdog timeout error decoding:
- HT Command : Reserved-HOST
- Operation : Normal
- NextAction : Send request
- SrcPtr : Reserved
- Source : Node 0 CPU
- Destination : Node 7 CPU
- WaitPW : None
- WaitCode : 00h
- RspCnt : 2
- GartTblWlk : GART table walk not in progress
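The decoded register values appear to come straight from the OEM record payloads: in records 6b8d and 6b8e, the last eight bytes of the payload match the decoded 64-bit value, and the fourth byte matches the register offset (04ch and 054h). The sketch below relies only on that observation. The field layout is inferred from these two records and is not a documented format, and the function name oem_register_value is illustrative.

```python
def oem_register_value(payload_hex, width_bytes=8):
    """Extract (register_offset, value) from an OEM record payload.

    Layout is inferred from records 6b8d and 6b8e in this document:
    byte 3 holds the register offset, and the trailing `width_bytes`
    bytes hold the register contents. Treat this as an assumption.
    """
    raw = bytes.fromhex(payload_hex)
    tail = raw[-width_bytes:]
    offset = "%03xh" % raw[3]
    value = "%s_%sh" % (tail[:4].hex(), tail[4:].hex())
    return offset, value


print(oem_register_value("0018304c08f200000000070f0f"))
print(oem_register_value("0018305408000000100e022020"))
```

The two calls reproduce the "reg 04ch = f2000000_00070f0fh" and "reg 054h = 00000010_0e022020h" values shown in the decode summary.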

Cause

After the BIOS/firmware re-flash, the system repeatedly failed to stay fully powered on and logged the same watchdog timeout errors in the system event log. Because the ILOM snapshot diagnostic data (SEL logs) showed no indication of hardware failure, the issue appeared to be strictly software related, though not the result of a failed firmware update or software corruption. The proposed cause was corruption within the system NVRAM, and the customer was asked to follow the steps to clear the CMOS and reset the BIOS to "Optimal Defaults".

Solution

To implement the solution, please execute the following steps:

1. Confirm that the latest available release of the platform firmware has been flashed onto the service processor, following the procedures in the platform documentation.

http://www.oracle.com/technetwork/systems/patches/firmware/index.html

2. From the ILOM web interface or CLI, reset the system to retest the system behavior and determine if the platform will complete POST and maintain power.

3. If the system fails to fully power on and maintain that state, recollect snapshot diagnostic data from the ILOM to review system event list output.

4. Compare SEL log results to previously collected log data to see if there were any changes in reported system behavior or additional errors or warnings.

5. If the SEL log output shows no significant changes, still points to BIOS software issues, and gives no specific indication of hardware problems, clear the CMOS and have the customer reset the BIOS to Optimal Defaults from the BIOS "Exit" menu screen, per the platform-specific documentation.

Because this issue was resolved without any clear indication of hardware errors, the same procedure should apply to other platforms whose SEL logs are inconclusive and where corrupted BIOS or NVRAM data is the likely cause.
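The comparison in step 4 can be sketched in Python. Record IDs and timestamps differ between snapshots, so this sketch compares only the sensor, event, and state fields of timestamped entries. It assumes the six-field layout seen in the Symptoms listing. The function names are illustrative, and the "after" sample line is hypothetical, with a made-up ID and timestamp.

```python
def event_signatures(text):
    """Return the set of (sensor, event, state) tuples in a SEL listing."""
    sigs = set()
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) == 6:          # timestamped sensor event
            sigs.add(tuple(fields[3:]))
    return sigs


def new_events(before, after):
    """List event signatures present in `after` but not in `before`."""
    return sorted(event_signatures(after) - event_signatures(before))


# Line taken from this document's listing.
BEFORE = "6b8f | 04/19/2011 | 10:48:36 | Watchdog 1 #0x18 | BIOS Reset | Asserted\n"
# Hypothetical line: same event, invented record ID and timestamp.
AFTER = "6c10 | 04/20/2011 | 09:15:02 | Watchdog 1 #0x18 | BIOS Reset | Asserted\n"

changes = new_events(BEFORE, AFTER)
print("new event types:", changes)
```

An empty result means the second snapshot reports no event types beyond those already seen, which is the condition step 5 describes. OEM records are ignored here and would still need to be decoded and compared separately.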



Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.