Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-77-1285535.1
Update Date:2011-01-20
Keywords:

Solution Type  Sun Alert Sure

Solution  1285535.1 :   Sun4v CMT Systems May Experience Storms of Events and May Stop Logging Error Telemetry for Errored Events  


Related Items
  • Sun Netra T5440 Server
  • Sun SPARC Enterprise T5440 Server
  • Sun SPARC Enterprise T5120 Server
  • Sun SPARC Enterprise T5220 Server
  • Sun SPARC Enterprise T5240 Server
  • Sun Blade T6320 Server Module
  • Sun Netra T5220 Server
  • Sun Blade T6340 Server Module
  • Sun Netra T6340 Server Module
  • Sun SPARC Enterprise T5140 Server
Related Categories
  • GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
  • GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved




In this Document
  Description
  Likelihood of Occurrence
  Possible Symptoms
  Workaround or Resolution
  Modification History
  References


Applies to:

Sun SPARC Enterprise T5440 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Netra T6340 Server Module - Version: Not Applicable and later    [Release: N/A and later]
Sun Netra T5440 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Blade T6320 Server Module - Version: Not Applicable and later    [Release: N/A and later]
Sun SPARC Enterprise T5120 Server - Version: Not Applicable and later    [Release: N/A and later]

Description


Sun4v CMT systems may experience the following issue when handling error events: after processing a stream of error events, the Service Processor may stop processing error telemetry and logging it to the host. This impacts diagnosis and FRU isolation, as well as the ability of Solaris to perform operations such as page retirement.

Likelihood of Occurrence


This issue can occur on the following platforms:

Multi Socket CPU CMT Systems:
  • Sun SPARC Enterprise T5140, T5240, T5440
  • Sun Blade T6340
  • Sun Netra T6340, Netra T5440
Single Socket CPU CMT Systems:
  • Sun SPARC Enterprise T5120, T5220
  • Sun Blade T6320
  • Netra T5220
when the above systems are running system firmware 7.3.0 and earlier.

Notes:

1. No other Blade, Enterprise, or Netra systems are affected by this issue.

2. There is no specific set of conditions likely to trigger this issue, nor any method of predicting when or how frequently this issue may occur. The risk of seeing this issue is regarded as low, but the potential impact is high since this issue may occur without notice.

To determine the firmware version on the system, run the following commands from the ILOM:
-> show HOST

/HOST
  Targets:
      bootmode
      diag
      domain

Properties:
    autorestart = reset
    autorunonerror = false
    bootfailrecovery = poweroff
    bootrestart = none
    boottimeout = 0
    hypervisor_version = Hypervisor 1.7.2.b 2009/07/17 09:35
    macaddress = 00:14:4f:ef:1b:c4
    maxbootfail = 3
    obp_version = OBP 4.30.2.b 2009/06/16 07:02
    post_version = POST 4.30.2 2009/04/21 09:57
    send_break_action = (none)
    status = Solaris running
    sysfw_version = Sun System Firmware 7.2.2.g 2009/07/17 10:34  <<<<<

Commands:
    cd
    set
    show
->
or, from the ALOM CMT compatibility shell:
sc> showhost
Sun System Firmware 7.2.7.b 2010/01/07 17:56

Host flash versions:
    Hypervisor 1.7.6 2009/12/01 14:30
    OBP 4.30.6 2009/12/01 12:41
    POST 4.30.6 2009/12/01 13:18
sc>
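
The version check described above can be sketched in Python. This is a minimal illustration, not a supported tool. The function names parse_sysfw_version and has_fix are hypothetical, and the version layout (three numbers plus an optional letter suffix) is an assumption based only on the sample output shown here. It also assumes that a release without a letter suffix sorts before any lettered release of the same number, which is consistent with 7.3.0 being listed as affected and 7.3.0.c as fixed.

```python
import re

# Assumption: a version looks like "7.2.2.g" or "7.3.0"
# (three numbers, then an optional single-letter suffix).
_VERSION_RE = re.compile(
    r"Sun System Firmware (\d+)\.(\d+)\.(\d+)(?:\.([a-z]))?"
)

# First fixed release named in this alert.
FIXED = (7, 3, 0, "c")


def parse_sysfw_version(output):
    """Return (major, minor, micro, letter) from 'show /HOST' or 'showhost' text."""
    match = _VERSION_RE.search(output)
    if match is None:
        raise ValueError("no 'Sun System Firmware' version found")
    major, minor, micro, letter = match.groups()
    # A missing letter becomes "", which sorts before any letter.
    return (int(major), int(minor), int(micro), letter or "")


def has_fix(version):
    """True if the version is 7.3.0.c or later (assumed ordering, see above)."""
    return version >= FIXED
```

Passing the captured console text to parse_sysfw_version and the result to has_fix gives a quick yes/no answer. Both sample outputs above (7.2.2.g and 7.2.7.b) would be reported as not yet fixed.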

Possible Symptoms


When fault management data is dropped, diagnosis and FRU isolation are impacted as the faults occur, along with the ability of Solaris to perform operations such as page retirement.

The primary issue is the delivery of events or, more importantly, the absence of events logged to either the ILOM or Solaris logs. It occurs when FMD running on ILOM dumps core, which can result in event reports not being processed. Given the nature of the fault, it is the absence of events where they would otherwise be expected that highlights the issue.

The secondary issue occurs when FMD on ILOM cannot pass events to FMD on Solaris, resulting in a backlog of reports on the SP that can consume resources and lead to a potential loss of ILOM service. An additional potential side effect is that Solaris, lacking visibility of an underlying event, will not take action against it (for example, Memory Page Retirement).

If this secondary issue causes ILOM to exhaust resources, the customer may see the following ILOM event:
Out of Memory: executing rebooting thread..... wait 600 secs for the userlevel to complete shutdown
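
As a rough aid, saved ILOM event log text can be searched for this message. The sketch below is only an illustration: find_oom_events is a hypothetical helper, and how the log text is collected (for example, copied from the console or taken from a snapshot) is left as an assumption. Only the leading part of the message quoted above is matched, since the trailing text may vary.

```python
# Leading part of the ILOM event quoted in this alert.
OOM_MARKER = "Out of Memory: executing rebooting thread"


def find_oom_events(log_lines):
    """Return the log lines that contain the out-of-memory reboot message."""
    return [line.rstrip("\n") for line in log_lines if OOM_MARKER in line]
```

An empty result does not rule out the issue, since the primary symptom is the absence of expected events.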

Workaround or Resolution


To resolve this issue, upgrade system firmware to 7.3.0.c (or above), using the appropriate patch listed below:

Multi Socket CPU CMT Systems:
  • Sun SPARC Enterprise T5140/T5240 patch 145676-02 or later
  • Sun SPARC Enterprise T5440 patch 145678-02 or later
  • Netra T5440 patch 145677-02 or later
  • Sun Blade T6340 patch 145679-02 or later
  • Sun Netra T6340 patch 145680-02 or later
Single Socket CPU CMT Systems:
  • Sun SPARC Enterprise T5120/T5220 patch 145673-02 or later
  • Sun Blade T6320 patch 145674-02 or later
  • Netra T5220 patch 145675-02 or later
Note: Although the likelihood of experiencing this issue is low, upgrading to firmware 7.3.0.c (or later) is recommended as soon as your schedule allows.
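
For sites tracking several platforms, the patch list above can be kept as a small lookup table. The following is a sketch only. The patch IDs are copied from the list above, while the minimum_patch function and its name normalization (dropping the words "Sun", "Server" and "Module") are assumptions made for illustration.

```python
# Minimum patch revisions copied from the list in this alert.
FIRMWARE_PATCHES = {
    "Sun SPARC Enterprise T5120": "145673-02",
    "Sun SPARC Enterprise T5220": "145673-02",
    "Sun Blade T6320": "145674-02",
    "Netra T5220": "145675-02",
    "Sun SPARC Enterprise T5140": "145676-02",
    "Sun SPARC Enterprise T5240": "145676-02",
    "Netra T5440": "145677-02",
    "Sun SPARC Enterprise T5440": "145678-02",
    "Sun Blade T6340": "145679-02",
    "Sun Netra T6340": "145680-02",
}

# Words ignored when comparing names, so that "Sun Netra T5440 Server"
# and "Netra T5440" refer to the same entry (assumed convention).
_IGNORED_WORDS = {"sun", "server", "module"}


def _normalize(name):
    words = [w for w in name.lower().split() if w not in _IGNORED_WORDS]
    return " ".join(words)


_PATCHES = {_normalize(name): patch for name, patch in FIRMWARE_PATCHES.items()}


def minimum_patch(platform):
    """Return the minimum patch ID for a platform name, or None if not listed."""
    return _PATCHES.get(_normalize(platform))
```

Full product names are used as keys rather than model numbers alone because the Netra T5440 and the SPARC Enterprise T5440 take different patches.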

Modification History

Date of Resolved Release: 20-Jan-2011
Internal Comments:

6981373: ILOM: fmd spawning lots of processes
6983799: L2 bank not stored correctly for DSC/DSU scrub errors

CRs 6983799 and 6981373 do not necessarily impact availability.

CR 6983799 addresses an issue whereby there is no storm protection
for DSC/DSU events (H/W DRAM scrubber events). The SP may therefore be
bombarded with these events from the HOST and become overwhelmed,
triggering other issues in the SP S/W stack that cause events to be
dropped (see CR 6724341 as an example). CR 6724341 is being worked on:
code is in review and a fix is planned for the near future. CR 6983799
can be detected by looking at the contents of an ILOM snapshot and
reviewing the 'fmdump' files for DSC events. Depending on Solaris
patches and platform type, it may also be possible to review DSC
events on the HOST using the command 'fmdump'.
CR 6981373 addresses an issue whereby, under certain conditions, fmd
on the SP may dump core. CR 6981373 also triggers another CR, 7006461,
which can cause fmd on the SP to consume ereports without logging
them to the HOST or to the SP, thus severely impacting field
diagnosis (CR 7006461 is still outstanding). CR 6981373 can only
be detected by looking for a core file in /coredump on the SP, which
can be checked in the field by examining the contents of an ILOM
snapshot. CR 7006461 cannot be detected at all.
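
The two snapshot checks described above could be scripted along the following lines. This is a hedged sketch, not a documented procedure. It assumes the ILOM snapshot has already been extracted to a local directory, that core files appear under a directory named "coredump" somewhere in that tree, that the fmdump output files have "fmdump" in their file names, and that DSC events can be recognized by the standalone token "dsc" (in any case). None of these layout details are confirmed by this alert, and scan_snapshot is a hypothetical name.

```python
import os
import re

# Assumption: DSC events show up as the standalone token "dsc" (any case).
_DSC_RE = re.compile(r"\bdsc\b", re.IGNORECASE)


def scan_snapshot(root):
    """Walk an extracted snapshot; return (core_files, fmdump_files_with_dsc)."""
    core_files = []
    dsc_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Assumption: core files live under a directory named "coredump".
        in_coredump = "coredump" in dirpath.split(os.sep)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if in_coredump:
                core_files.append(path)
            elif "fmdump" in name:
                with open(path, errors="replace") as handle:
                    if any(_DSC_RE.search(line) for line in handle):
                        dsc_files.append(path)
    return sorted(core_files), sorted(dsc_files)
```

A non-empty first list points towards CR 6981373 and a non-empty second list towards CR 6983799. As noted above, CR 7006461 leaves nothing to find.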

For more in-depth detail on these issues,
please review the CRs referenced above.

Internal Contributor/Submitter: 
[email protected], [email protected]
Internal Eng Responsible Engineer:  [email protected]
Internal Services Knowledge Analyst:  [email protected]
Internal Eng Business Unit Group:  Systems Group-SVS
(SPARC Volume Systems, Horizontal Systems (includes T2000/Ontario))
Internal Escalation ID: 2-8213384

References

<SUNPATCH:145673-02>
<SUNPATCH:145674-02>
<SUNPATCH:145675-02>
<SUNPATCH:145676-02>
<SUNPATCH:145677-02>
<SUNPATCH:145678-02>
<SUNPATCH:145679-02>
<SUNPATCH:145680-02>
<SUNBUG:6981373>
<SUNBUG:6983799>

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.