Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-73-1463634.1
Update Date:2012-10-03
Keywords:

Solution Type  FAB (standard) Sure

Solution  1463634.1 :   A small population of T4-4 systems may experience C2C FMA SUN4V-8002-KQ Faults that can be repaired with updated coherency link tuning values.  


Related Items
  • SPARC T4-4
Related Categories
  • PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T4
  • .Old GCS Categories>Sun Microsystems>Sun FAB>Standard>Reactive


Affected X-Options:
 
  7101695 - Processor Module, 3.0GHz, T4-4 

Affected Parts:

  7019789 - FRU, Processor Module Assy, UltraSPARC T4, 8-Core 3.0GHz (7015550)

In this Document
Symptoms
Changes
Cause
Solution


Oracle Confidential (PARTNER). Do not distribute to customers.
Reason: FABs available to Internals and Partners only


Applies to:

SPARC T4-4 - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.
__________

Symptoms

A small population of shipped systems may experience an elevated rate of chip-to-chip (C2C) link replays, which triggers an FMA SUN4V-8002-KQ fault.  The fault can be seen by running "fmadm faulty"; the resulting fault will look like the following sample:

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Mar 06 16:17:01 12f78095-7788-e159-dfa9-8da2a22fb197  SUN4V-8002-KQ  Major   

Host        : ssccn4-m1
Platform    : ORCL,SPARC-T4-4   Chassis_id  :
Product_sn  :

Fault class : fault.cpu.generic-sparc.c2c
Affects     : hc://:product-id=ORCL,SPARC-T4-4:product-sn=1151BDY190:server-id=ssccn4-m1:chassis-id=1151BDY190/chassis=0/cpuboard=1/chip=2
              hc://:product-id=ORCL,SPARC-T4-4:product-sn=1151BDY190:server-id=ssccn4-m1:chassis-id=1151BDY190/chassis=0/cpuboard=0/chip=0
                  faulted but still in service
FRU         : "/SYS/PM1" (hc://:product-id=ORCL,SPARC-T4-4:product-sn=1151BDY190:server-id=ssccn4-m1:chassis-id=1151BDY190:serial=465769T+1149L9010F:part=7019789:revision=05/chassis=0/cpuboard=1) 50%
              "/SYS/PM0" (hc://:product-id=ORCL,SPARC-T4-4:product-sn=1151BDY190:server-id=ssccn4-m1:chassis-id=1151BDY190:serial=465769T+1124L9000G:part=7019789:revision=03/chassis=0/cpuboard=0) 50%   faulty
Description : The number of chip-to-chip recoverable errors has exceeded acceptable levels.

Response    : No automated response.

Impact      : System performance may be affected.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://sun.com/msg/SUN4V-8002-KQ for the latest service
              procedures and policies regarding this diagnosis.

------------------------------------- end of fault.cpu.generic-sparc.c2c report ---------------------------------

The report above is system generated and references an outdated, dead URL.  For more information on this subject, refer to the following internal-only links:

  http://events2.us.oracle.com/msg/FMA/SUN4V/82

  https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1452064.1

NOTE: Ereports for C2C replays are an expected part of normal system operation. They appear as follows:

   ereport.cpu.generic-sparc.c2c-link 

The existence of the above c2c ereports does not indicate improper or unexpected system operation. FMA will assess the rate of C2C replays and post a SUN4V-8002-KQ fault as noted above should the rate of replays become excessive.
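
When triaging a report like the sample above, the FRU lines carry the module label, part number, and revision of each suspect Processor Module. The following is a minimal Python sketch, not part of any Oracle tooling, that pulls those fields out of saved "fmadm faulty" text. The function name and the regular expression are illustrative assumptions based only on the sample report in this document; other fault classes may format their FRU lines differently.

```python
import re

# Assumption: FRU entries look like the sample in this document, i.e. a quoted
# label such as "/SYS/PM1" followed by an hc:// FMRI containing :part= and :revision=.
FRU_RE = re.compile(
    r'"(?P<fru>/SYS/[^"]+)"\s+\(hc://[^)]*?'
    r':part=(?P<part>[^:/)]+):revision=(?P<rev>[^:/)]+)'
)


def suspect_frus(fmadm_text):
    """Return (fru_label, part_number, revision) for each FRU named in the report."""
    return [(m["fru"], m["part"], m["rev"]) for m in FRU_RE.finditer(fmadm_text)]


# FRU lines copied from the sample fault report in this document.
SAMPLE_FRU_LINES = '''FRU         : "/SYS/PM1" (hc://:product-id=ORCL,SPARC-T4-4:product-sn=1151BDY190:server-id=ssccn4-m1:chassis-id=1151BDY190:serial=465769T+1149L9010F:part=7019789:revision=05/chassis=0/cpuboard=1) 50%
              "/SYS/PM0" (hc://:product-id=ORCL,SPARC-T4-4:product-sn=1151BDY190:server-id=ssccn4-m1:chassis-id=1151BDY190:serial=465769T+1124L9000G:part=7019789:revision=03/chassis=0/cpuboard=0) 50%   faulty
'''

for fru, part, rev in suspect_frus(SAMPLE_FRU_LINES):
    print(fru, part, rev)
```

For the sample report, this lists both /SYS/PM1 and /SYS/PM0 with part number 7019789, which matches the affected part listed in this document.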

Impact

The system will remain operational. C2C replays are link retries that are successful and therefore pose no issue with data integrity. 

The elevated level of replays that results in an FMA SUN4V-8002-KQ fault indicates only that link performance has degraded to a level not normally expected in a properly running system; it does not indicate an actual failure.  The FMA SUN4V-8002-KQ fault is raised as an attempt at pre-emptive hardware failure detection.

In systems with sub-optimal tuning, the issue is not that the hardware is actually degrading but that the link tuning is marginal. The Sun_SPARC_T4-4_PM_E0010556.pkg installs optimized tuning parameters.

Changes

 Contributing Factors

This issue is not specific to any particular configuration. The rate of C2C replays may vary with the system configuration (2P vs 4P) and from power cycle to power cycle. 
       
Systems that have experienced a C2C FMA SUN4V-8002-KQ fault should be updated with the Sun_SPARC_T4-4_PM_E0010556.pkg.

In addition, every Processor Module (PM) replacement, done for any reason, requires the Sun_SPARC_T4-4_PM_E0010556.pkg to be applied so that all installed processor modules have the new tuning values.  This is particularly important for 4P systems with two PMs, where the tuning values for links that run between the two processor modules must be identical.

Cause

The root cause of this fault is that the link tuning values originally set were not optimal: they did not allow the C2C link circuitry to make the needed dynamic adjustments as material characteristics changed across production lots. The original parameters could not accommodate component variation within the expected design margins, which stressed the dynamic tuning capabilities of the hardware and increased the number of C2C replays.

The new link tuning values offer more margin that will allow operation across the entire process environment as was originally intended.
 
Manufacturing has already implemented the changes to incorporate these new tuning values on all new product and FRU builds (see Note 1 under Step 3 in the Resolution section below).

Solution

Workaround

No workaround available - see Resolution section below.


Resolution

Installing the Sun_SPARC_T4-4_PM_E0010556.pkg resolves the C2C FMA SUN4V-8002-KQ fault.

Below are step-by-step instructions for applying the patch, which is available in Reference DocID 1452064.1 at the following URL:

  https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1452064.1

STEP #1: (Applying Sun_SPARC_T4-4_PM_E0010556.pkg)

 The Sun_SPARC_T4-4_PM_E0010556.pkg is applied as follows:

 a) Transfer the patch to a local FTP or HTTP server.
    The patch is available from MOS Article 1452064.1, which can be viewed at the following URL:
      https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1452064.1&h=Y

 b) Log in to the Service Processor via the ILOM CLI.

 c) The host must be powered off to apply the patch. (From the ILOM CLI: stop /SYS)

 d) Load the patch using the ILOM CLI "load" command.

    From the ILOM CLI:
      load -source tftp://localFTPserver/Sun_SPARC_T4-4_PM_E0010556.pkg
    - or -
      load -source http://localHTTPserver/Sun_SPARC_T4-4_PM_E0010556.pkg

 e) The load command will automatically reboot the service processor (ILOM) with
    the patch applied.
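
 A mistyped package name in the load URI is an easy error to make in STEP #1. As a hedged illustration only, the Python sketch below assembles the "load -source" command line from a transfer scheme and a server name so that the full package file name, including the .pkg extension, is always used. The helper name is hypothetical, the server names are the placeholders used in this document, and only the two schemes shown in this document are accepted; ILOM may support others.

```python
PKG = "Sun_SPARC_T4-4_PM_E0010556.pkg"

# Assumption: only the two transfer schemes shown in this document are allowed here.
SCHEMES_IN_DOCUMENT = ("tftp", "http")


def load_command(scheme, server, pkg=PKG):
    """Build the ILOM CLI 'load -source' line for the given scheme and server."""
    if scheme not in SCHEMES_IN_DOCUMENT:
        raise ValueError("scheme not shown in this document: " + scheme)
    if not server or "/" in server:
        raise ValueError("server must be a bare host name or address")
    return "load -source {}://{}/{}".format(scheme, server, pkg)


print(load_command("tftp", "localFTPserver"))
print(load_command("http", "localHTTPserver"))
```

 The printed lines can be pasted at the ILOM prompt once the host has been stopped as described in step c).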

   
STEP #2: (Verify Sun_SPARC_T4-4_PM_E0010556.pkg has been successfully applied)

 Once a PM module has been updated, the presence of the patch can be checked as follows:

         -> show /SP/logs/event/list
            970    Wed May 16 20:46:50 2012  System    Log       critical
                   SP is about to reboot
            969    Wed May 16 20:46:48 2012  System    Log       major
                   Sun_SPARC_T4-4_PM_E0010556.pkg applied to /SYS/PM1
            968    Wed May 16 20:46:45 2012  System    Log       major
                   Sun_SPARC_T4-4_PM_E0010556.pkg applied to /SYS/PM0

 NOTE: The above is an example of the ILOM event log output where 2 PMs were updated with
       the patch.
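
 The check in STEP #2 can also be done against a saved copy of the event log. The following Python sketch is an assumption-laden illustration, not an Oracle utility: it searches the text for the "applied to /SYS/PMn" entries shown above and reports which of the installed modules have no such entry. The function names are hypothetical. A missing entry is not proof that a module lacks the new tuning values, since an event log can roll over.

```python
import re

PKG = "Sun_SPARC_T4-4_PM_E0010556.pkg"

# Assumption: the log message format is exactly the one shown in the example above.
APPLIED_RE = re.compile(re.escape(PKG) + r" applied to (/SYS/PM\d+)")


def patched_modules(event_log_text):
    """Return the sorted PM targets that have an 'applied to' log entry."""
    return sorted(set(APPLIED_RE.findall(event_log_text)))


def modules_missing_entry(event_log_text, installed):
    """Return installed PM targets with no 'applied to' entry in the log text."""
    done = set(patched_modules(event_log_text))
    return [pm for pm in installed if pm not in done]


# Event log lines copied from the example in this document.
EVENT_LOG = """970    Wed May 16 20:46:50 2012  System    Log       critical
       SP is about to reboot
969    Wed May 16 20:46:48 2012  System    Log       major
       Sun_SPARC_T4-4_PM_E0010556.pkg applied to /SYS/PM1
968    Wed May 16 20:46:45 2012  System    Log       major
       Sun_SPARC_T4-4_PM_E0010556.pkg applied to /SYS/PM0
"""

print(patched_modules(EVENT_LOG))
print(modules_missing_entry(EVENT_LOG, ["/SYS/PM0", "/SYS/PM1"]))
```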

STEP #3: (Clear all prior FMA SUN4V-8002-KQ Faults)

 Once the Sun_SPARC_T4-4_PM_E0010556.pkg has been successfully loaded and verified,
 the existing C2C faults should be cleared.

 Clear any existing SUN4V-8002-KQ faults at the OS level.

 Use fmadm to obtain the UUID (shown in the EVENT-ID column) of any SUN4V-8002-KQ faults:

   # fmadm faulty

 For each fault, use fmadm repair to clear the fault:

   # fmadm repair <uuid>

 Using ILOM, clear any faults on PM0 and PM1:
       -> set /SYS/PM0 clear_fault_action=true
       Are you sure you want to clear /SYS/PM0 (y/n)? y
       Set 'clear_fault_action' to 'true'

       -> set /SYS/PM1 clear_fault_action=true
       Are you sure you want to clear /SYS/PM1 (y/n)? y
       Set 'clear_fault_action' to 'true'
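
 When several faults are outstanding, the commands in STEP #3 can be generated from saved "fmadm faulty" output instead of being copied by hand. The Python sketch below is a minimal illustration under stated assumptions: the summary row format is taken from the sample report in this document, and the function names are hypothetical. It only prints command lines for review and does not run anything.

```python
import re

# Assumption: a summary row looks like the sample in this document:
#   "Mar 06 16:17:01 <uuid>  <MSG-ID>  <severity>"
ROW_RE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+"
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<msgid>\S+)\s+(?P<severity>\S+)",
    re.MULTILINE,
)


def repair_commands(fmadm_text, msg_id="SUN4V-8002-KQ"):
    """Return one 'fmadm repair <uuid>' line per fault row with the given MSG-ID."""
    return [
        "fmadm repair " + m["uuid"]
        for m in ROW_RE.finditer(fmadm_text)
        if m["msgid"] == msg_id
    ]


def ilom_clear_commands(modules):
    """Return the ILOM 'set ... clear_fault_action=true' line for each PM target."""
    return ["set {} clear_fault_action=true".format(pm) for pm in modules]


# Summary rows copied from the sample fault report in this document.
FMADM_SUMMARY = """--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Mar 06 16:17:01 12f78095-7788-e159-dfa9-8da2a22fb197  SUN4V-8002-KQ  Major
"""

for line in repair_commands(FMADM_SUMMARY) + ilom_clear_commands(["/SYS/PM0", "/SYS/PM1"]):
    print(line)
```

 The fmadm lines are intended for the host OS and the set lines for the ILOM prompt, where each set command still asks for confirmation as shown above.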

 NOTE 1: Applying the Sun_SPARC_T4-4_PM_E0010556.pkg will resolve C2C FMA SUN4V-8002-KQ
       faults for all systems with marginal tuning parameters.

However, C2C FMA SUN4V-8002-KQ faults may also be caused by degraded hardware, in which case the Sun_SPARC_T4-4_PM_E0010556.pkg will not remedy them.

Therefore, if a C2C FMA SUN4V-8002-KQ fault is seen after applying the Sun_SPARC_T4-4_PM_E0010556.pkg, follow the normal hardware debug and replacement process.  Note in the SR that the C2C FMA SUN4V-8002-KQ fault was seen after applying the new C2C link tuning patch and that hardware replacement is therefore planned.

Identification of Affected Parts (how to)

All T4-4 Processor Modules with the following Part Numbers that are experiencing C2C FMA SUN4V-8002-KQ faults require the Sun_SPARC_T4-4_PM_E0010556.pkg to be applied:

   7015550     FRU,PM-Module,3.0G,T4-4
   7019789     Proc-Mod,3.0G,T4-4  ( Assembly level )

The part number of a Processor Module can be identified by typing the following at the ILOM prompt
(repeat this for each installed PM and look for the "fru_part_number" property):

   -> show /SYS/PM0 fru_part_number

        /SYS/PM0
        Properties:
        fru_part_number = 7019789

Note: PM modules with later production part numbers will NOT have Sun_SPARC_T4-4_PM_E0010556.pkg entries in the ILOM event log output as they were shipped from the factory with the new link training tuning settings, and hence do not require updating.

Processor Modules with new link training tuning settings:

7051795
7053169

References

  BugID: 7110931: SSC RQT Fault: fault.cpu.generic-sparc.c2c
  BugID: 7151759: fault.cpu.generic-sparc.c2c fault reported on SSC T4-4

  MOS DocID: 1452064.1

Contacts

    Contributor: [email protected], [email protected]
    Responsible Engineer: [email protected]
    Responsible Manager: [email protected]
    Business Unit Group: Systems Group - SVS


Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.