Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-77-1000134.1
Update Date:2011-02-18
Keywords:

Solution Type  Sun Alert Sure

Solution  1000134.1 :   Using the Reset Button on A Main System Controller May Cause Domain Outage  


Related Items
  • Sun Fire E6900 Server
  • Sun Fire 3800 Server
  • Sun Fire 6800 Server
  • Sun Fire E4900 Server
  • Sun Fire 4800 Server
  • Sun Fire 4810 Server
Related Categories
  • GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
  • GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved

PreviouslyPublishedAs
200180


Product
Sun Fire 3800 Server
Sun Fire 4800 Server
Sun Fire 4810 Server
Sun Fire 6800 Server
Sun Fire E6900 Server
Sun Fire E4900 Server

Bug Id
<SUNBUG: 4378797>

Date of Workaround Release
21-APR-2005

Date of Resolved Release
12-NOV-2007

Impact

If the main System Controller (SC) on a Sun Fire 3800, 4800, 4810, 4900, 6800 or 6900 system is reset with the reset button while domains are running, the hardware configuration may change, causing the domains to perform a "fatal" reset. The domains then reset and act according to the "error-reset-recovery" OBP property, which may result in unexpected system outages while the domains recover.


Contributing Factors

This issue can occur on the following platforms if the SC reset button is used while domains are running:

  • Sun Fire 3800, 4800, 4810, 4900, 6800, 6900 (without the recommended SunFire SCApp firmware update 5.12.6)

The Sun Fire System Controller (SC) periodically queries system ASICs (Application Specific Integrated Circuits) via JTAG buses to read configuration, monitor environmental states and change domain configuration. If the hardware reset button is used during one of these operations, the JTAG bus may be left in an undefined state. That undefined state can change the configuration of the ASICs, which can trigger a fatal reset on affected active domains.

To determine the firmware version of the SCApp, use the "showsc" command from the platform shell as follows:

    SC> showsc
    SC: SSC0
    Main System Controller
    SC Failover: disabled
    Clock failover enabled.
    SC date: Thu Jun 01 12:59:45 CDT 2006
    SC uptime: 25 minutes 58 seconds
    ScApp version: 5.19.6 Build_01
    RTOS version: 45
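
The version check above can also be scripted against captured "showsc" output. The following is a minimal sketch, not a Sun tool: the function names are hypothetical, and it assumes that any ScApp version numerically at or above 5.12.6 contains the fix, which this alert does not state explicitly.

```python
import re

# Version named in this alert as carrying the fix.
FIXED_SCAPP = (5, 12, 6)


def scapp_version(showsc_output):
    """Return the ScApp version found in `showsc` text as a tuple of ints, or None."""
    match = re.search(r"ScApp version:\s*(\d+(?:\.\d+)*)", showsc_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def has_fix(showsc_output, fixed=FIXED_SCAPP):
    """True if the reported ScApp version is at or above `fixed`.

    Assumption: later version numbers include the fix.
    """
    version = scapp_version(showsc_output)
    return version is not None and version >= fixed


if __name__ == "__main__":
    sample = "SC: SSC0\nScApp version: 5.19.6 Build_01\nRTOS version: 45\n"
    print(scapp_version(sample), has_fix(sample))
```

If no "ScApp version" line is found, the sketch reports the fix as absent rather than guessing.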

Symptoms

Shortly after the main System Controller has been reset using the reset button, the domains within the system reboot with an error message similar to the following:

ErrorMonitor: Domain A has a SYSTEM ERROR
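
To find which domains reported this error in a saved SC console or log capture, a small filter can help. This is only a sketch: the function name is hypothetical, and the pattern assumes the message wording shown above with a single-letter domain name. Since the alert says the message is only "similar to" this one, real logs may need a looser pattern.

```python
import re

# Pattern based on the sample message in this alert; actual wording may vary.
SYSTEM_ERROR = re.compile(r"ErrorMonitor: Domain ([A-Z]) has a SYSTEM ERROR")


def domains_with_system_error(log_text):
    """Return the sorted, de-duplicated domain letters that logged a SYSTEM ERROR."""
    return sorted(set(SYSTEM_ERROR.findall(log_text)))


if __name__ == "__main__":
    capture = (
        "ErrorMonitor: Domain A has a SYSTEM ERROR\n"
        "some unrelated line\n"
        "ErrorMonitor: Domain C has a SYSTEM ERROR\n"
    )
    print(domains_with_system_error(capture))
```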

Workaround

In the case of an SC becoming unresponsive, attempts should be made to confirm connectivity via the serial port and network prior to using the reset button. If the SC appears to be hung:

  1. Confirm that the SC is actually hung by connecting to the serial port of the SC (with a known good cable).
  2. Press "enter" a few times. If no prompt is returned, the SC is hung.
  3. If the SC is hung, halt all domains using the Solaris "init 0" command (or "shutdown").
  4. Reset the SC using the reset button, or power-cycle the whole chassis.

Note: Using the reset button while domains are running should be avoided whenever possible. The SC should be reset either by the above steps or via ScApp.
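
The order of the workaround steps can be summarised as a small decision function. This is a sketch for illustration only: the function name, its inputs and the returned strings are hypothetical, and it does not talk to an SC.

```python
USE_SCAPP = "SC responds: do not use the reset button; reset via ScApp if needed"
HALT_DOMAINS = "SC is hung: halt all domains first (Solaris 'init 0' or 'shutdown')"
RESET_SC = "SC is hung and domains are halted: use the reset button or power-cycle the chassis"


def next_step(serial_prompt_returned, domains_running):
    """Return the next action from the workaround for the observed state.

    serial_prompt_returned: a prompt came back on the SC serial port
                            after pressing enter a few times.
    domains_running:        at least one domain is still up.
    """
    if serial_prompt_returned:
        return USE_SCAPP
    if domains_running:
        return HALT_DOMAINS
    return RESET_SC


if __name__ == "__main__":
    print(next_step(serial_prompt_returned=False, domains_running=True))
```

The point of the ordering is that the reset button is reached only after the SC is confirmed hung and no domain is left running.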


Resolution

This issue is addressed on the following platforms:

  • Sun Fire 3800, 4800, 4810, 4900, 6800, 6900 with SunFire SCApp firmware 5.12.6 (as delivered in patch 112127-02 or later)

Note: The patch above addresses the software issue for BugID 4378797. Using the reset button while domains are running should still be avoided whenever possible.
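
"112127-02 or later" means the same patch base number with a revision of 02 or higher. A minimal sketch of that comparison follows; the helper names are hypothetical, and it only covers the patch ID named in this alert (whether other patch IDs also deliver a fixed ScApp is not covered here).

```python
def parse_patch_id(patch_id):
    """Split a patch ID such as '112127-02' into (base, revision)."""
    base, _, rev = patch_id.strip().partition("-")
    if not (base.isdigit() and rev.isdigit()):
        raise ValueError("not a base-revision patch ID: %r" % patch_id)
    return base, int(rev)


def patch_satisfies(installed, required="112127-02"):
    """True if `installed` is the same patch as `required` at the same or a later revision."""
    installed_base, installed_rev = parse_patch_id(installed)
    required_base, required_rev = parse_patch_id(required)
    return installed_base == required_base and installed_rev >= required_rev


if __name__ == "__main__":
    for candidate in ("112127-01", "112127-02", "112127-10"):
        print(candidate, patch_satisfies(candidate))
```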



Modification History
Date: 28-SEP-2005

29-Sep-2005:

  • Update Relief/Workaround section

Date: 12-NOV-2007
  • Updated Contributing Factors and Resolutions sections
  • State: Resolved


References

<SUNPATCH: 112127-02>

Previously Published As
101656
Internal Comments



This has been flagged as a Sun Alert because a domain outage caused by the reset button runs counter to what customers expect, i.e. that resetting the SC will not affect the running domains.



The Sun Fire System Controller (SC) periodically queries system ASICs (Application Specific Integrated Circuits) via JTAG buses to read configuration, monitor environmental states and change domain configuration. The design of JTAG is such that reading and writing to the bus must be done in a complete cycle in which the complete command and data bit sequence is serially shifted into the JTAG ring. Any interruption to this will result in an unknown bit pattern being placed on the JTAG bus.



If the SC is performing one of these operations, or has hung during one, and the hardware reset button is used, then the JTAG bus may be left in this undefined state. When the SC reboots it will resume JTAG activity. However, this will result in the remains of the previous operation being committed to the bus, which in some rare circumstances may change the configuration of core ASICs. This change in configuration can trigger a fatal reset on affected active domains. This does not happen with a software reset, as the SC will complete JTAG operations and quiesce the bus before resetting.



There are a number of bugs on this, such as 4378797. The core problem is that if the JTAG master (the SC) does not complete an operation (such as an SC hanging) it will leave a partial command or data sequence on the JTAG ring. When the SC attempts to start a new JTAG command it does not and cannot know what is already in the ring, and thus the placement of a new command will essentially push a random data or command value onto the ring which may cause changes to the configuration of ASICs such as ARs and DXs, with obvious unpleasant results.
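
The failure mode described above can be illustrated with a toy model. This is a deliberately simplified sketch and an assumption throughout: it is not the JTAG TAP state machine, not SC firmware, and the class and function names are hypothetical. It models the ring as a shift stage plus a latched configuration, and shows that committing a half-shifted sequence yields a value that is neither the old configuration nor the intended new one.

```python
class ToyScanRing:
    """Toy serial scan ring: a shift stage plus the latched configuration."""

    def __init__(self, config):
        self.config = list(config)      # value the ASICs currently act on
        self.shift = [0] * len(config)  # serial shift stage

    def capture(self):
        # Load the current configuration into the shift stage.
        self.shift = list(self.config)

    def shift_in(self, bit):
        # New bit enters at index 0; the oldest bit falls off the end.
        self.shift = [bit] + self.shift[:-1]

    def update(self):
        # Commit whatever is in the shift stage, complete or not.
        self.config = list(self.shift)


def write(ring, value, stop_after=None):
    """Shift `value` into the ring and commit it.

    If `stop_after` is given, stop after that many bits without committing,
    mimicking a hard reset in the middle of the operation.
    """
    ring.capture()
    for count, bit in enumerate(reversed(value)):
        if stop_after is not None and count == stop_after:
            return False  # interrupted: partial sequence left in the ring
        ring.shift_in(bit)
    ring.update()
    return True


if __name__ == "__main__":
    old = [1, 0, 1, 1, 0, 0, 1, 0]
    new = [0, 1, 1, 0, 1, 0, 0, 1]

    ring = ToyScanRing(old)
    write(ring, new, stop_after=3)  # hard reset after 3 of 8 bits
    ring.update()                   # resumed activity commits the remains
    print("old     :", old)
    print("intended:", new)
    print("actual  :", ring.config)
```

In the toy model a completed write always leaves the intended value, which mirrors the statement that a software reset, which lets the SC finish its JTAG operations first, does not cause the problem.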



There is no software workaround, as it is impossible to determine the state of the JTAG ring or to know what was left of the sequence being put into the JTAG ring. A possible hardware fix would be to use a persistent SRAM as a log of the pending JTAG command; however, the current generation of SCs lacks the hardware for that feature and there are no plans to add it.


Internal Contributor/submitter
[email protected]

Internal Eng Business Unit Group
SSG ES (Enterprise Systems)

Internal Eng Responsible Engineer
[email protected]

Internal Services Knowledge Engineer
[email protected]

Internal Escalation ID
1-4540996

Internal Resolution Patches
112127-02

Internal Sun Alert Kasp Legacy ID
101656, 57744 (Sun Alert)

Internal Sun Alert & FAB Admin Info
Critical Category: Availability ==> Pervasive
Significant Change Date: 2005-04-21, 2007-11-12
Avoidance: Patch, Workaround
Responsible Manager: [email protected]
Original Admin Info: [WF 09-Nov-2007, dave m: sent email for update, a patch is issued for this BugID, may be resolved]
[WF 28-Sep-2005, Dave M: no plans in the immediate future to fix this; Eng will notify me if there are any changes, update R/W, re-publish out of Preliminary to Workaround]
Engineering Notification Interval: 0
This document has been imported from KMS Creator and may need adjustment before re-publishing.

This imported document has been reviewed/adjusted by:
Review Name:
Review Date:

Original KMS Creator attributes below:

--- PLEASE DO NOT MAKE ANY CHANGES BELOW THIS LINE! ---

Sun Alert ID: 57744
Synopsis: Using the Reset Button on A Main System Controller May Cause Domain Outage
Category: Availability
Product: Sun Fire 3800, Sun Fire 4800, Sun Fire 4810, Sun Fire 4900, Sun Fire 6800, Sun Fire 6900
BugIDs: 4378797
Avoidance: None
State: Committed
Date Released: 21-Apr-2005
Date Closed:
Date Modified:
Escalation IDs: 1-4540996
Pending Patches:
Resolution Patches:
FIN:
FCO:
Date Submitted: 24-Feb-2005, 11-Apr-2005
Submitter: [email protected]
Responsible Engineer: [email protected]
Responsible Manager: [email protected]
CTE group: SSG-ES
Responsible Writer: [email protected]
Distribution: Preliminary Contract SunSolve

Workflow History:

WF State: Issued, 21-Apr-2005, David Mariotto
WF Note: sending for release

WF State: Draft, 21-Apr-2005, David Mariotto
WF Note: received comments from Chessin, sent revision to
submitters for review, OK by EOD to release.

WF State: Draft, 21-Apr-2005, David Mariotto
WF Note: 4/20 OK to go for tech review per Eddie

WF State: Draft, 20-Apr-2005, David Mariotto
WF Note: sent for review this morning, BU review needs clarify on
at least one issue

WF State: Draft, 20-Apr-2005, David Mariotto
WF Note: approved by BU for review (Eddie) - sending for review

WF State: Draft, 19-Apr-2005, David Mariotto
WF Note: expecting OK from Eddie today, Tues 4/19

WF State: Draft, 18-Apr-2005, David Mariotto
WF Note: no reply from BU or BUPO

WF State: Draft, 15-Apr-2005, David Mariotto
WF Note: emailed to BU again for OK on draft, now waiting until
Monday (per email of Leynette) for approval

WF State: Draft, 13-Apr-2005, David Mariotto
WF Note: emailed again for status (#4) to Leynette, with Omar,
Anne, ssg-es (no response 2 days....) still waiting

WF State: Draft, 18-Mar-2005, David Mariotto
WF Note: emailed 3/15 for status, have not heard back from
submitter on status of issue

WF State: Draft, 11-Mar-2005, David Mariotto
WF Note: resent request again for update on this issue

WF State: Draft, 03-Mar-2005, David Mariotto
WF Note: requested approval of draft 2nd time from BU, need
approval on draft and need Bug filed against documentation
as requested to continue this draft.

WF State: Draft, 25-Feb-2005, David Mariotto
WF Note: received notification from BU that they want to see a
copy of the current draft (for approval purposes) - sent
copy of draft, waiting for approval (write back Monday)

WF State: Draft, 24-Feb-2005, David Mariotto
WF Note: will need positive confirmation from both BUs prior to
sending the draft for review (Fri or Mon)

WF State: Draft, 24-Feb-2005, David Mariotto
WF Note: Article created.

Exported from KMS Creator Sat May 21 09:16:59 2005 GMT, [email protected]
Internal SA-FAB Eng Submission
Using the Reset Button on A Main System Controller May Cause Domain Outage

Product_uuid
29d05214-0a18-11d6-92b2-a111614865b5|Sun Fire 3800 Server
29d3a694-0a18-11d6-92da-df959df44cdd|Sun Fire 4800 Server
29d6f808-0a18-11d6-8aa8-943929fbbdd8|Sun Fire 4810 Server
29da7938-0a18-11d6-8a41-9ed1ad6d6779|Sun Fire 6800 Server
4fe39727-0599-11d8-84cb-080020a9ed93|Sun Fire E6900 Server
bed24aa9-0598-11d8-84cb-080020a9ed93|Sun Fire E4900 Server

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.