Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-77-1020990.1
Update Date:2012-08-22
Keywords:

Solution Type  Sun Alert Sure

Solution  1020990.1 :   BIOS Versions Prior to 3.0.2 May Cause System Hangs on Sun Fire X4150/X4250/X4450 Systems


Related Items
  • Sun Fire X4150 Server
  • Sun Fire X4450 Server
  • Sun Fire X4250 Server
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: Sun Alert
  • .Old GCS Categories>Sun Microsystems>Sun Alert>Release Phase>Resolved

PreviouslyPublishedAs
268668




***Checked for relevance on 22-Aug-2012***

Bug Id
<SUNBUG: 6871221>, <SUNBUG: 6873737>

Date of Resolved Release
02-Oct-2009

Sun Fire X4150/X4250/X4450 systems may hang as a result of correctable ECC memory errors.

1. Impact

Sun Fire X4150/X4250/X4450 systems with BIOS versions 3.0.1 or earlier may hang because correctable ECC memory errors are not handled properly.

2. Contributing Factors

This issue can occur on the following platforms:
  • Sun Fire X4150/X4250/X4450 system with BIOS versions 3.0.1 or earlier
Note 1: On Sun Fire X4150 and X4250 servers with BIOS versions 1ADQW061 or earlier, and on X4450 servers with BIOS versions 3B61 or earlier, the SMI (System Management Interrupt) handler never exits when it tries to handle a correctable ECC memory error detected by the patrol scrubber. When this happens, the system locks up with no ILOM SEL entry indicating the problem. This bug does not affect all operating systems, because they differ in how they handle a patrol-scrub-detected correctable memory error. VMware 3.5 and 4.0 and RHEL 5.3 are known to encounter this hang condition because they pass patrol scrub correctable errors on to the BIOS.

Note 2: Correctable errors can occur even in healthy systems. The likelihood of a system hang due to this bug depends on whether an error occurs, when it occurs, how it is detected, and which operating system is running.

3. Symptoms

If the described issue occurs, the system will lock up/hang with no ILOM SEL entry indicating a problem. Access to the ILOM is not affected.

4. Workaround

There is no workaround for this issue. See the Resolution section below.

5. Resolution

This issue is addressed on the following platforms:
  • Sun Fire X4150/X4250/X4450 systems with BIOS revision 3.0.2 or later
It is recommended to update affected systems with the latest BIOS versions located at:

For Sun Fire X4150:
For Sun Fire X4250:
For Sun Fire X4450:
Note: The above releases contain BIOS 1ADQW062 for the Sun Fire X4150/X4250 and BIOS 3B62 for the Sun Fire X4450.
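
As an illustration only, the BIOS thresholds given in this alert (1ADQW061 or earlier and 3B61 or earlier affected; 1ADQW062 and 3B62 fixed) can be turned into a small check. The sketch below is in Python and assumes that the trailing digits of the BIOS ID are a build number that increases with each release; that ordering, the function name is_affected, and the table name LAST_AFFECTED are hypothetical and not taken from Sun documentation.

```python
import re

# Last affected BIOS build per platform, from Note 1 of this alert.
# Splitting each ID into a fixed prefix and a numeric build is an assumption.
LAST_AFFECTED = {
    "X4150": ("1ADQW", 61),
    "X4250": ("1ADQW", 61),
    "X4450": ("3B", 61),
}


def is_affected(platform, bios_id):
    """Return True if bios_id is at or below the last affected build.

    Assumes the trailing digits of the BIOS ID increase with each release.
    """
    prefix, last_bad = LAST_AFFECTED[platform.upper()]
    match = re.fullmatch(re.escape(prefix) + r"(\d+)", bios_id.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized BIOS ID {bios_id!r} for {platform}")
    return int(match.group(1)) <= last_bad
```

For example, is_affected("X4150", "1ADQW061") flags the system for update, while is_affected("X4450", "3B62") does not. How the BIOS ID is read from a given system is outside the scope of this sketch.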


Modification History:

22-Aug-2012: Maintenance check for relevance/currency, no change in content



Product
Sun Fire X4150 Server
Sun Fire X4250 Server
Sun Fire X4450 Server

Internal Comments
Additional Information:

Two other known issues are being fixed in the next (3.1.0) software release:

Issue 1:
Incorrect error messaging
If a correctable ECC memory error is detected by the CPU, you will see this SEL entry as usual:

|67| IPMI | Log | minor | Fri Sep 4 17:04:57 2009 | ID = 1d : 09/04/2009 : 17:04:57 : Memory : BIOS : Correctable ECC; Channel: D, DIMM: 5 |

If the background scrubber detects the correctable ECC memory error, the SEL entry will look like this:

|118| IPMI | Log | *critical*| Tue Sep 8 18:00:47 2009 | ID = 3f : 09/08/2009 : 18:00:47 : Memory : BIOS : Memory Scrub Failed; Channel: D, DIMM: 5

This entry incorrectly reports the error as critical. A correctable ECC memory error detected by the scrubber is not a critical error, despite the SEL entry. This will be fixed in the next software release, in which both types will be reported as a minor correctable ECC error.
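
Until the fix is installed, a log-review script could downgrade these entries itself. The following Python sketch shows one way to do that. The six-field layout is inferred only from the two sample entries in this alert, and the names parse_sel_line and effective_severity are hypothetical; this is not an ILOM interface.

```python
def parse_sel_line(line):
    """Split one pipe-delimited SEL line into named fields.

    Assumed layout (inferred from the samples in this alert):
    number | source | type | severity | timestamp | detail
    """
    fields = [f.strip() for f in line.strip().strip("|").split("|")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    number, _source, _kind, severity, timestamp, detail = fields
    return {
        "number": int(number),
        "severity": severity.strip("*"),  # drop the emphasis asterisks
        "timestamp": timestamp,
        "detail": detail,
    }


def effective_severity(entry):
    """Treat a scrubber-detected correctable error as minor, per Issue 1."""
    if "Memory Scrub Failed" in entry["detail"]:
        return "minor"
    return entry["severity"]


SCRUB_LINE = ("|118| IPMI | Log | *critical*| Tue Sep 8 18:00:47 2009 | "
              "ID = 3f : 09/08/2009 : 18:00:47 : Memory : BIOS : "
              "Memory Scrub Failed; Channel: D, DIMM: 5")
entry = parse_sel_line(SCRUB_LINE)
print(entry["severity"], "->", effective_severity(entry))
```

The logged severity is kept in the parsed entry so that the original record is not lost; only the interpretation changes.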

Issue 2:
DIMMs being falsely mapped out during POST due to correctable ECC memory errors.
POST should not map out a DIMM because it detected a correctable ECC memory error. If a DIMM is mapped out during POST, reboot the system to determine whether the map-out was caused by a correctable ECC memory error. One of three things will then happen:
  1. The DIMM error goes away, indicating that the issue was due to a correctable ECC memory error. No further action is needed.
  2. The same DIMM maps out again. The DIMM is likely bad, and the DIMM pair should be replaced.
  3. A different DIMM maps out. Continue to reboot the system until the
    error goes away or one DIMM maps out persistently, in which case
    that DIMM pair should be replaced.
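
The reboot procedure above can be sketched as a small decision function. This Python example is illustrative only: it assumes the operator records which DIMM (if any) was mapped out on each POST, and it interprets "persistent" as the same DIMM mapping out on two consecutive boots, which is an assumption rather than a documented threshold. The name next_action and the DIMM labels are hypothetical.

```python
def next_action(history):
    """Suggest the next step from per-boot POST results.

    history: one entry per POST, oldest first. Each entry is the label of
    the DIMM that was mapped out (e.g. "D5"), or None if none was.
    """
    if not history:
        raise ValueError("no POST results recorded")
    latest = history[-1]
    if latest is None:
        # Case 1: the error went away; it was a correctable ECC error.
        return "ok"
    if len(history) >= 2 and history[-2] == latest:
        # Case 2, or the end of case 3: the same DIMM mapped out again.
        return "replace DIMM pair"
    # First map-out, or case 3: a different DIMM mapped out this time.
    return "reboot"


print(next_action(["D5"]))              # first map-out seen
print(next_action(["D5", "C3", "C3"]))  # C3 persists across two boots
```

The function only looks at the two most recent boots, so a DIMM that maps out on alternating boots keeps returning "reboot"; a stricter policy would need an explicit limit, which this alert does not specify.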
Please send technical questions to the following email:
 [email protected]
and CC the following persons:
 Internal Contributor/Submitter
 Internal Eng Responsible Engineer
 Internal Services Knowledge Engineer

Internal Contributor/Submitter
[email protected]

Internal Eng Responsible Engineer
[email protected]

Internal Services Knowledge Engineer
[email protected]

Internal Eng Business Unit Group
SVS (SPARC Volume Systems, Horizontal Systems (includes T2000/Ontario), NWS (Network Storage), Systems Group-x64 (X4100-X4600 (includes M2), V20z/V40z/V60z/V65z, Ultra20/40)

Internal Sun Alert & FAB Admin Info
WF 02-Sep-2009, jfolla: sent for release
WF 30-Sep-2009, jfolla: sent for review
WF 29-Sep-2009, jfolla: sent to submitter with questions
WF 29-Sep-2009, jfolla: created


Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.