Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-77-1022237.1
Update Date: 2010-06-17
Keywords:

Solution Type  Sun Alert Sure

Solution  1022237.1 :   Sun Storage 7x00 2009.Q3 Software Release May Result in an Incorrect Diagnosis of CPU Correctable Error  


Related Items
  • Sun Storage 7410 Unified Storage System
  • Sun Storage 7110 Unified Storage System
  • Sun Storage 7210 Unified Storage System
  • Sun Storage 7310 Unified Storage System
Related Categories
  • GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
  • GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved

PreviouslyPublishedAs
278130


Bug Id
SUNBUG: 6853745

Product
Sun Storage 7000 Unified Storage System
Sun Storage 7110 Unified Storage System
Sun Storage 7210 Unified Storage System
Sun Storage 7310 Unified Storage System
Sun Storage 7410 Unified Storage System

Date of Resolved Release
02-Mar-2010

Sun Storage 7x00 2009.Q3 Software Release May Result in an Incorrect Diagnosis of CPU Correctable Error

Impact

On the 2009.Q3 Software Release for Sun Storage 7000/7110/7210/7310/7410, a CPU may be incorrectly diagnosed as faulty, resulting in unnecessary hardware replacement.  This issue also causes performance degradation.

Contributing Factors

This issue may occur on the following releases:
  • Sun Storage Software release 2009.Q3.0.0 through 2009.Q3.4.0 for Sun Storage 7000/7110/7210/7310/7410

Note: This issue occurs in the event of CPU correctable errors.

To determine if you have an affected release, use a browser to connect to the appliance
management BUI on port 215, https://applianceIP:215.  Click the Sun logo on the top left.
A window will be displayed showing support data for the system. The Operating System
version can be found near the end of the list. The release version to compare against the
list above appears immediately after the "@" in the Operating System line.
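
The comparison against the affected range can be sketched in Python. This is a minimal illustration only: it assumes the release is written in the "2009.Q3.4.0" form used in this alert, and the string actually shown after the "@" in the BUI may use a different layout and need converting first. The names parse_release and is_affected are hypothetical helpers, not part of any appliance tooling.

```python
import re

# Affected range quoted in this alert: 2009.Q3.0.0 through 2009.Q3.4.0, inclusive.
AFFECTED_FIRST = (2009, 3, 0, 0)
AFFECTED_LAST = (2009, 3, 4, 0)

# Assumed token shape: <year>.Q<quarter>.<minor>.<micro>, for example "2009.Q3.4.0".
_RELEASE_RE = re.compile(r"^(\d{4})\.Q([1-4])\.(\d+)\.(\d+)$")


def parse_release(text):
    """Return (year, quarter, minor, micro); raise ValueError on any other format."""
    match = _RELEASE_RE.match(text.strip())
    if match is None:
        raise ValueError("unrecognized release string: %r" % text)
    return tuple(int(part) for part in match.groups())


def is_affected(text):
    """True if the release falls inside the affected range listed in this alert."""
    return AFFECTED_FIRST <= parse_release(text) <= AFFECTED_LAST
```

Under these assumptions, is_affected("2009.Q3.4.0") is true, while is_affected("2009.Q3.4.1") and is_affected("2010.Q1.0.0") are false, which matches the releases named in the Resolution section.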

Symptoms

The exact fault differs depending on the type of correctable error received,
but will result in a fault indicating one of the cores of a CPU is faulty. This will
generate an ASR event, alert or active problem on the system. The message ID will
always match the form "GMCA-XXXX-XX".

The alert, log message or active problem: “a level 2 cache on this cpu is faulty”,
can be found in the following locations in the browser interface:

    Maintenance/Logs/Alert
    Maintenance/Logs/System
    Maintenance/Problems

Workaround

To work around the issue until the resolution can be applied, manually mark the CPU as repaired through the CLI or BUI.

To do this in the BUI:

    * Navigate in the BUI to Maintenance:Problems
    * Select the CPU fault
    * Click the Mark Repaired button

Resolution

This issue is addressed in the following releases:

  • Sun Storage 7x00 2009.Q3.4.1
  • Sun Storage 7x00 2010.Q1.0.0 or later

Information on the above upgrades can be found at:

    http://wikis.sun.com/display/FishWorks/Software+Updates

Note: If a CPU has already been incorrectly diagnosed as faulty, it will still need to
be manually marked as repaired via the CLI or BUI after the upgrade. Please see Workaround above.

Modification History:

17-Jun-2010: Updated to include Sun Storage 7110/7210/7310/7410

Internal Comments (for SAs)

Root Cause

This is due to a defect in the diagnosis software, which inappropriately "replays" correctable errors every 10 seconds. This turns a single correctable error, a normally benign event, into what appears to be a pathological problem with the CPU. This was fixed in Solaris by the following CR:

6853745 Same ereport is generated every 10 seconds automatically ...


This issue has been resolved by pulling the above CR into the 2009.Q3.4.1 release, and the fix is already present in the upcoming 2010.Q1 release.



For Support personnel:
To distinguish between this false diagnosis and a truly bad CPU, the following steps must be taken:
1. The customer must be running a software release between 2009.Q3.0.0 and 2009.Q3.4.0.

2. The fault must have a message ID of the form "GMCA-XXXX-XX".
3. Take the UUID of the fault (found in the active problems page) and run the following command from the Solaris shell:

    fmdump -V -u <uuid> -e | \
        egrep 'ereport|IA32_MCi_STATUS|IA32_MCi_ADDR'

   This command can also be run against a support bundle by going to the 'fm' directory and running the above command with 'fltlog' at the end of the fmdump command (before the pipe).
4. Determine if the output consists of the same CPU ereport replayed every 10 seconds. An example of a bad diagnosis is:

    Jan 25 2010 01:14:21.506739770 ereport.cpu.generic-x86.l2cache
            class = ereport.cpu.generic-x86.l2cache
            IA32_MCi_STATUS = 0x940001000000010a
            IA32_MCi_ADDR = 0xa07c900
    Jan 25 2010 01:14:11.506975918 ereport.cpu.generic-x86.l2cache
            class = ereport.cpu.generic-x86.l2cache
            IA32_MCi_STATUS = 0x940001000000010a
            IA32_MCi_ADDR = 0xa07c900
    Jan 25 2010 01:14:01.507217171 ereport.cpu.generic-x86.l2cache
            class = ereport.cpu.generic-x86.l2cache
            IA32_MCi_STATUS = 0x940001000000010a
            IA32_MCi_ADDR = 0xa07c900

Note that the ereports appear approximately every 10 seconds, and they contain identical payloads.

If the customer is running different software, or the ereports do not match the above pathology, the CPU is truly faulty and must be replaced.
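
The checks in steps 2 and 4 can be sketched in Python for use on saved command output. This is an illustrative helper, not a supported tool. It assumes the filtered fmdump output has one timestamped ereport line followed by its "name = value" lines, as in the example above. The 1-second tolerance around the 10-second period and the character set accepted for the "X" positions of the message ID are assumptions, and all function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Timestamped line, e.g. "Jan 25 2010 01:14:21.506739770 ereport.cpu.generic-x86.l2cache"
HEADER_RE = re.compile(
    r"^\s*(\w{3}\s+\d{1,2}\s+\d{4}\s+\d{2}:\d{2}:\d{2})\.(\d+)\s+(ereport\.\S+)\s*$"
)
# Payload line, e.g. "IA32_MCi_ADDR = 0xa07c900"
FIELD_RE = re.compile(r"^\s*(\S+)\s*=\s*(\S+)\s*$")
# Assumption: each "X" in "GMCA-XXXX-XX" is a digit or an uppercase letter.
MSG_ID_RE = re.compile(r"^GMCA-[0-9A-Z]{4}-[0-9A-Z]{2}$")


def is_gmca_message_id(message_id):
    """True if the message ID has the form "GMCA-XXXX-XX" (step 2)."""
    return MSG_ID_RE.match(message_id) is not None


def parse_ereports(text):
    """Group the filtered output into one record per timestamped ereport line."""
    records = []
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            stamp = datetime.strptime(
                " ".join(header.group(1).split()), "%b %d %Y %H:%M:%S"
            )
            # The fractional part has nanosecond digits; keep it as a float offset.
            stamp += timedelta(seconds=float("0." + header.group(2)))
            records.append({"time": stamp, "class": header.group(3), "fields": {}})
            continue
        field = FIELD_RE.match(line)
        if field and records:
            records[-1]["fields"][field.group(1)] = field.group(2)
    return records


def looks_like_replay(records, period=10.0, tolerance=1.0):
    """True if every ereport has the same class and payload and arrives ~period seconds apart (step 4)."""
    if len(records) < 2:
        return False
    ordered = sorted(records, key=lambda record: record["time"])
    first = ordered[0]
    for previous, current in zip(ordered, ordered[1:]):
        if current["class"] != first["class"] or current["fields"] != first["fields"]:
            return False
        gap = (current["time"] - previous["time"]).total_seconds()
        if abs(gap - period) > tolerance:
            return False
    return True
```

Calling looks_like_replay(parse_ereports(text)) on the saved output gives a first indication only. The release check in step 1 and a manual review of the ereports are still required before deciding that a CPU is not faulty.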
keywords: amber
Internal Contributor/submitter: [email protected]
Internal Eng Responsible Engineer: [email protected]
Internal Services Knowledge Engineer: [email protected]
Internal Eng Business Unit Group
Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.