Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-77-1458754.1
Update Date:2012-07-30
Keywords:

Solution Type  Sun Alert Sure

Solution  1458754.1 :   All M-Series Systems XSCFUs May Fail and/or Halt Due to Berkeley DataBase Corruption Without System Firmware Upgrade Version 1112 on SPARC Enterprise M8000/M9000-32/M9000-64 Servers, or a Minimum Version of 1113 on M3000/M4000/M5000 Servers  


Related Items
  • Sun SPARC Enterprise M5000 Server
  • Sun SPARC Enterprise M9000-64 Server
  • Sun SPARC Enterprise M3000 Server
  • Sun Software - Generic
  • Sun SPARC Enterprise M9000-32 Server
  • Sun SPARC Enterprise M4000 Server
  • Sun SPARC Enterprise M8000 Server
  • Sun Hardware - Generic
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: Sun Alert
  • .Old GCS Categories>Sun Microsystems>Sun Alert>Criteria Category>Availability
  • .Old GCS Categories>Sun Microsystems>Sun Alert>Release Phase>Resolved


___________________________________

Date of Resolved Release: 16-May-2012
___________________________________

In this Document
Description
Occurrence
Symptoms
Workaround
History


Applies to:

Sun SPARC Enterprise M3000 Server
Sun SPARC Enterprise M4000 Server
Sun SPARC Enterprise M5000 Server
Sun SPARC Enterprise M9000-32 Server
Sun SPARC Enterprise M8000 Server
Information in this document applies to any platform.
___________________________________

SUNBUG:7162656
SUNBUG:7180251

Date of Workaround Release: 16-May-2012

Date of Resolved Release: 23-Jul-2012
___________________________________

Description

A firmware upgrade is mandatory for all M-series systems: to XCP 1112 on Sun SPARC Enterprise M8000/M9000-32/M9000-64 Servers, and to a minimum of XCP 1113 on Sun SPARC Enterprise M3000/M4000/M5000 Servers. Without this upgrade, the XSCFUs used in the M-series platforms may fail and/or halt due to internal Berkeley DataBase (BDB) corruption.

XSCFU flash memory wear-leveling was not optimum in XCP versions prior to 1112.  Flash memory wear may result in BDB corruption due to the mishandling of certain types of ECC errors resulting from the worn region of the flash memory. XCP 1112/XCP 1113 firmware fixes the mishandling of ECC errors, improves flash memory wear-leveling, and allows the existing XSCFU hardware to be utilized.

Occurrence

This issue can occur on the following platforms:

  • Sun SPARC Enterprise M3000, M4000, M5000 without XCP firmware 1113
  • Sun SPARC Enterprise M8000, M9000-32, M9000-64 Servers without XCP firmware 1112

To determine the XCP firmware version on one of these systems, the following command can be used:

    XSCF> version -c xcp

XCP 1112 output will appear similar to the following:


     XSCF#0 (Active )
     XCP0 (Reserve): 1112
     XCP1 (Current): 1112
     XSCF#1 (Standby)
     XCP0 (Reserve): 1112
     XCP1 (Current): 1112
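
When many systems have to be checked, the comparison against the minimum fixed release can be scripted. The following is a minimal sketch in Python, not an Oracle-supplied tool. It assumes the output of "version -c xcp" has already been captured as text in the layout shown above and that XCP release numbers compare as plain integers. The names MIN_FIXED_XCP, current_xcp_versions and needs_upgrade are hypothetical.

```python
import re

# Minimum fixed XCP release per model, taken from the Occurrence list above.
MIN_FIXED_XCP = {
    "M3000": 1113, "M4000": 1113, "M5000": 1113,
    "M8000": 1112, "M9000-32": 1112, "M9000-64": 1112,
}

# Matches lines such as "XCP1 (Current): 1112" and captures the release number.
CURRENT_BANK = re.compile(r"XCP\d+\s*\(Current\)\s*:\s*(\d+)")


def current_xcp_versions(version_output: str) -> list[int]:
    """Return the release number of every bank marked (Current)."""
    return [int(m.group(1)) for m in CURRENT_BANK.finditer(version_output)]


def needs_upgrade(model: str, version_output: str) -> bool:
    """True if any (Current) bank is below the minimum fixed release."""
    versions = current_xcp_versions(version_output)
    if not versions:
        raise ValueError("no '(Current)' XCP bank found in the output")
    # Assumption: XCP release numbers order correctly as integers.
    return min(versions) < MIN_FIXED_XCP[model]


SAMPLE = """\
     XSCF#0 (Active )
     XCP0 (Reserve): 1112
     XCP1 (Current): 1112
     XSCF#1 (Standby)
     XCP0 (Reserve): 1112
     XCP1 (Current): 1112
"""

print(needs_upgrade("M9000-32", SAMPLE))
```

With the sample output above, an M8000 or M9000 is reported as not needing an upgrade, while the same release number on an M3000, M4000 or M5000 would be reported as needing XCP 1113.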

Symptoms

Behavior depends on the XCP firmware revision. Upon detecting BDB corruption during the boot process, an XSCF will exhibit symptoms similar to the following:

In revisions prior to XCP 1092, the BDB is reconstructed. If this happens on the Active XSCF, all domains are reset.

In XCP 1092 and newer revisions, the XSCFU is halted when BDB corruption is detected, to prevent an unplanned reset of the domains. On M3000, M4000 and M5000 systems, a complete platform downtime is required to recover. On M8000, M9000-32, and M9000-64 systems, the standby XSCFU will become active. However, as the BDB corruption has likely propagated to the standby XSCFU, the replacefru command cannot be used to recover and a complete platform downtime is required.
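
The revision-dependent behavior described above can be summarized as a small lookup. This is only an illustrative Python sketch of the text in this section. The function name expected_symptoms and the DUAL_XSCFU_MODELS set are hypothetical, and the returned strings paraphrase this document rather than any firmware message.

```python
# Models on which this document describes a standby XSCFU taking over.
DUAL_XSCFU_MODELS = {"M8000", "M9000-32", "M9000-64"}


def expected_symptoms(model: str, xcp: int, on_active_xscf: bool = True) -> list[str]:
    """Summarize what this document says happens when BDB corruption is
    detected at XSCF boot, for a given model and XCP release."""
    if xcp < 1092:
        symptoms = ["BDB is reconstructed"]
        if on_active_xscf:
            symptoms.append("all domains are reset")
        return symptoms

    # XCP 1092 and newer: the XSCFU halts instead of rebuilding the BDB.
    symptoms = ["XSCFU is halted"]
    if model in DUAL_XSCFU_MODELS:
        symptoms.append("standby XSCFU becomes active")
    symptoms.append("complete platform downtime is required to recover")
    return symptoms


for line in expected_symptoms("M9000-64", 1092):
    print(line)
```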

Workaround

There is no workaround for this issue.

This issue is addressed in the following releases:

  • XCP 1113 firmware for Sun SPARC Enterprise M3000, M4000, M5000 Servers
  • XCP 1112 firmware for Sun SPARC Enterprise M8000, M9000-32, M9000-64 Servers

XCP 1112 and 1113 firmware is available for download at:

Note 1: If your XCP firmware is older than XCP 1092, update it as soon as possible to avoid unexpected domain outages, and avoid XSCFU reboots until the firmware update has been completed. If your XCP firmware is XCP 1092 or later, update it at your next scheduled maintenance window to avoid potential loss of domain console access.

Note 2: The initial release of this document stated that this issue was fixed for all affected servers by XCP 1112. However, it was later found that the fix in XCP 1112 was incomplete for M3000, M4000 and M5000 Servers. The completion of the fix for these servers was delivered in XCP 1113.

History

16-May-2012: Document created, Resolved release

05-Jun-2012: Updated the Description and Workaround sections.

26-Jun-2012: Updated document status, Occurrence, and Workaround sections.

23-Jul-2012: Updated Occurrence and Workaround sections. Resolved.

25-Jul-2012: Updated Document title and description for clarification.

29-Jul-2012: Added note to the Workaround section.

 

NOTE: The XCP 1112 firmware does not fix this issue on the Sun SPARC Enterprise M3000, M4000, M5000 as initially thought. This issue will be addressed in the XCP 1113 firmware version for these systems. This Sun Alert will be updated when the firmware version that addresses this issue on the Sun SPARC Enterprise M3000, M4000, M5000 systems is available.

This issue is reported, and all escalations are documented, under CR 7078506.  CR 7162656 is the software workaround for this issue.

Explanation of the issue

Note: Services (Field and TSC) are not required to proactively create SRs for customer upgrades. There is no method to proactively check if BDB corruption has occurred.


The BDB stores all XSCF configuration and telemetry data and is constantly being updated with new data.  Because the file system that holds the BDB is small, these writes all land on a small number of sectors in the flash memory, and the frequent writing causes above-average wear on those sectors.  Two changes were implemented to improve flash memory wear leveling, including relocating the file system where the BDB is stored to a new, larger, and previously unused section of flash memory in XCP 1112.  In addition to reducing the risk of ECC errors from internal flash storage, XCP 1112 also fixes the mishandling of certain types of these ECC errors.
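
A back-of-the-envelope calculation shows why a larger file system reduces wear. The Python sketch below assumes perfectly even wear leveling and uses made-up figures (one million sector writes, 8 versus 64 sectors). These numbers are purely illustrative and are not XSCFU flash specifications.

```python
def erase_cycles_per_sector(total_sector_writes: int, sectors: int) -> float:
    """Average number of writes each sector absorbs when the same write
    load is spread evenly over `sectors` flash sectors."""
    if sectors <= 0:
        raise ValueError("sectors must be positive")
    return total_sector_writes / sectors


# Hypothetical figures, chosen only to illustrate the ratio.
small_fs = erase_cycles_per_sector(1_000_000, 8)
large_fs = erase_cycles_per_sector(1_000_000, 64)

print(small_fs, large_fs, small_fs / large_fs)
```

Under these assumptions, spreading the same write load over eight times as many sectors cuts the per-sector wear by a factor of eight.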

Below are the anticipated situations that a customer may ask support to address.

1. Customer has opened an SR because:

a) System is running XCP < 1092 and has had an unexpected reset of all domains.

In this case, the BDB has been reconstructed and should be clean (there is a very small risk that the new BDB was corrupted again by another ECC error during garbage collection).

Symptoms of this issue are:

In "showlogs monitor":

    Feb  6 08:28:04 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /OPNL
    Feb  6 08:28:05 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /PSU#0
    Feb  6 08:28:08 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /PSU#1
    Feb  6 08:28:09 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /PSU#2
    Feb  6 08:28:11 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /PSU#3
    Feb  6 08:28:13 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /FANBP_C
    Feb  6 08:28:14 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /MBU_B
    Feb  6 08:28:15 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /MBU_B/CPUM#0
    Feb  6 08:28:15 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /MBU_B/CPUM#0
    Feb  6 08:28:16 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /MBU_B/CPUM#1
    Feb  6 08:28:17 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /MBU_B/CPUM#1
    Feb  6 08:28:18 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /MBU_B/MEMB#0
    Feb  6 08:28:18 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /MBU_B/MEMB#1
    Feb  6 08:28:19 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /MBU_B/MEMB#2
    Feb  6 08:28:19 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /MBU_B/MEMB#3

In "showlogs power":

    Feb 06 08:28:02 GMT 2012      SCF Reset        Power On       --   Locked
    Feb 06 08:28:09 GMT 2012      System Power On  Pow.Fail/Recov.--   Locked
    Feb 06 08:29:08 GMT 2012      Domain Power On  Pow.Fail/Recov.00   Locked
    Feb 06 08:29:10 GMT 2012      Domain Power On  Pow.Fail/Recov.01   Locked.

"Unit configuration change(add)" messages are normal at first chassis power-on, upon XSCFU replacement, or when hardware is added.  What distinguishes a reconstruction triggered by BDB corruption is that the messages coincide with an outage of a running domain or an XSCF reset.
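
The coincidence check described above can be sketched in Python for captured log text. This is an illustrative helper, not a diagnostic tool. It assumes the "showlogs monitor" and "showlogs power" excerpts use exactly the line layouts shown above, borrows the year from the power log because monitor lines carry none, and uses an arbitrary five-minute window. The name add_messages_near_reset is hypothetical. A non-zero count only flags a candidate, since the same messages also appear after a legitimate power-on, XSCFU replacement, or hardware addition.

```python
import re
from datetime import datetime, timedelta

MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

ADD_RE = re.compile(
    r"^\s*(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})\s+\S+\s+monitor_msg:\s+"
    r"SCF:Unit configuration change\(add\)")
RESET_RE = re.compile(
    r"^\s*(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})\s+\w+\s+(\d{4})\s+SCF Reset")


def _stamp(year, mon, day, hh, mm, ss):
    return datetime(int(year), MONTHS[mon], int(day), int(hh), int(mm), int(ss))


def add_messages_near_reset(monitor_log, power_log, window=timedelta(minutes=5)):
    """Count 'change(add)' monitor messages logged within `window` after
    any 'SCF Reset' entry in the power log."""
    resets = []
    for line in power_log.splitlines():
        m = RESET_RE.match(line)
        if m:
            mon, day, hh, mm, ss, year = m.groups()
            resets.append(_stamp(year, mon, day, hh, mm, ss))

    count = 0
    for line in monitor_log.splitlines():
        m = ADD_RE.match(line)
        if not m:
            continue
        mon, day, hh, mm, ss = m.groups()
        # Monitor lines have no year, so reuse the year of each reset compared.
        if any(r <= _stamp(r.year, mon, day, hh, mm, ss) <= r + window for r in resets):
            count += 1
    return count


MONITOR_SAMPLE = """\
    Feb  6 08:28:04 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /OPNL
    Feb  6 08:28:05 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /PSU#0
    Feb  6 08:28:08 m5000-xscf0 monitor_msg: SCF:Unit configuration change(add) /PSU#1
"""
POWER_SAMPLE = """\
    Feb 06 08:28:02 GMT 2012      SCF Reset        Power On       --   Locked
    Feb 06 08:28:09 GMT 2012      System Power On  Pow.Fail/Recov.--   Locked
"""

print(add_messages_near_reset(MONITOR_SAMPLE, POWER_SAMPLE))
```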

Action Plan: Upgrade to XCP 1112 (XCP 1113 on M3000/M4000/M5000).

b) System is running XCP 1092 or later and has one (or both) XSCFs halted.

In this case, the db_initialized file has been removed and the BDB will be recreated after a power cycle.
Action Plan: Shut down all domains, run rebootxscf to get both XSCFs halted (DC only), power cycle the platform, flashupdate to XCP 1112 (XCP 1113 on M3000/M4000/M5000), then power on all domains.

Note: One or both XSCFs will also be marked faulted/degraded and a service password will be needed to clear the fault.

2.  Customer has opened a (proactive) SR and wants to know if their Mx000 is impacted and what the risks/action plans are.

Regardless of the XCP version currently installed, flashupdate to XCP 1112 (XCP 1113 on M3000/M4000/M5000) updates a bank and then reboots the XSCF from the updated bank.  If BDB corruption is detected during that reboot, the XSCF is halted and the situation is the same as case 1b.


Internal Contributor/Submitter: [email protected] / [email protected]
Internal Eng Responsible Engineer:  [email protected]
Internal Eng Business Unit Group: Systems
Internal Escalation ID:   3-3957102291, 3-4484123661, 3-4795058021, 3-4912442107, 3-5244346841,
3-5327805831, 3-5337450674, 3-5362565691, 3-5369066365, 3-5373970359,
3-5413736991, 3-5440100391, 3-5450148851, 3-5457916451, 3-5468922133,
3-5469106328, 3-5496285929, 3-5563305791, 3-5567630951, 3-5597598021,
3-5643807681


Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.