Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-73-1001149.1
Update Date:2010-08-31
Keywords:

Solution Type  FAB (standard) Sure

Solution  1001149.1 :   USIV+ (Panther and Jaguar) System Boards could experience POST Failure if Board Replacement Procedure is not correctly followed  


Related Items
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Sun FAB>Standard>Reactive

PreviouslyPublishedAs
201540


Product
Sun Fire 12K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire E25K Server

Bug Id
<SUNBUG: 6426425>

Part
  • Part No: 540-6439-06
  • Part Description: 1500MHz USIV+
Part
  • Part No: 540-6753
  • Part Description: 1800MHz USIV+
Part
  • Part No: 540-5664
  • Part Description: 1050MHz USIV+
Part
  • Part No: 540-5940
  • Part Description: 1200MHz USIV+
Part
  • Part No: 540-6251
  • Part Description: 1200MHz USIV+
Part
  • Part No: 540-6248
  • Part Description: 1050MHz USIV+
Part
  • Part No: 540-6233
  • Part Description: 1200MHz USIV+
Part
  • Part No: 540-6295
  • Part Description: 1350MHz USIV+

Impact

If the environmental status monitoring daemon (esmd) does not log the removal/insertion of System Boards, corruption of the System Board's SEEPROM can occur.  When this happens, POST fails with:

    FAIL E$Dimm SBx/Px/Ex: Not compatible with US-IV+ processor

Subsequent POST will report:

    CHS reports E$Dimm SB0/P0/E0 status NOT_GOOD. Treating as blacklisted.

Users are then unable to clear the CHS status with "setchs -s OK -r OK -c SBx/Px/Ex"; the command returns the error:

    ERROR: cannot set status.
    FRU error 2: I/O error (this component may not exist, may be powered off, or the FRU may be
    corrupt).

 


Symptoms

 


Root Cause

SEEPROM corruption can be triggered in a number of ways. In this case, a board is removed, but a different board is inserted before esmd notices the removal. Because esmd does not log the remove/insert events, it does not clear the SEEPROM cache in picld or frad, or any of the pending SEEPROM events for the board.  When frad next updates, it writes to the wrong location and corrupts the SEEPROM.
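
The race can be illustrated with a toy model. The sketch below is not esmd code: it assumes a poller that only compares slot occupancy between samples, which is a simplification made for illustration, and uses the 30-second polling interval quoted in the Workaround section. All function and variable names are hypothetical.

```python
POLL_INTERVAL_S = 30  # polling interval quoted in this document


def detected_events(timeline, duration_s, interval_s=POLL_INTERVAL_S):
    """Return the events a presence-only poller would log.

    timeline: list of (time_s, present) occupancy changes, sorted by time.
    The slot is assumed to be occupied at t=0.
    """
    def present_at(t):
        state = True
        for when, present in timeline:
            if when <= t:
                state = present
        return state

    events = []
    last = present_at(0)
    for t in range(interval_s, duration_s + 1, interval_s):
        now = present_at(t)
        if now != last:
            events.append((t, "inserted" if now else "removed"))
        last = now
    return events


# Board pulled at t=35 s and a different board inserted at t=50 s:
# both changes fall between the polls at 30 s and 60 s.
fast_swap = detected_events([(35, False), (50, True)], duration_s=120)

# Board pulled at t=35 s, replacement inserted at t=100 s.
slow_swap = detected_events([(35, False), (100, True)], duration_s=120)
```

In this model the fast swap produces no events at all, so nothing would prompt a cache flush, while the slow swap yields one removal and one insertion.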

Another failure signature related to this scenario is:

    FAIL Proc SB2/P0: Serial number of CPU (80000228.B850D4EF) doesn't match data in board SEEPROM
    (0000003E.77240482).

SEEPROM corruption can be further confirmed with the "prtfru" command:

    xc88-sc0:sms-svc:137> prtfru
    ex15?Label=ex15/EXB/sb15?Label=sb15/CPU/p0?Label=p0/e1?Label=e1/ECACHE/frutree/chassis
    /CP/ex15?Label=ex15/EXB/sb15?Label=sb15/CPU/p0?Label=p0/e1?Label=e1/ECACHE
    SEGMENT: ID
    SEGMENT: FD
    Error processing data in segment "FD":  IO error
    SEGMENT: ED
    /Fru_Type: 0A04 (unrecognized value)
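
When many boards have to be checked, captured prtfru output can be scanned for the two indicators visible above: a segment that fails to process and a field reported as an unrecognized value. The following is a minimal sketch that assumes the output has already been saved as text; the function name and the anomaly labels are hypothetical, and the patterns are derived only from the sample output in this document.

```python
import re

# Patterns taken from the sample prtfru output in this document.
SEGMENT_ERROR = re.compile(r'Error processing data in segment "(\w+)"')
UNRECOGNIZED = re.compile(r'^\s*(\S+):\s+(\S+)\s+\(unrecognized value\)')


def find_seeprom_anomalies(prtfru_output):
    """Return a list of (kind, name) tuples for suspicious prtfru lines."""
    anomalies = []
    for line in prtfru_output.splitlines():
        match = SEGMENT_ERROR.search(line)
        if match:
            anomalies.append(("segment_error", match.group(1)))
            continue
        match = UNRECOGNIZED.search(line)
        if match:
            anomalies.append(("unrecognized_value", match.group(1)))
    return anomalies


sample = """SEGMENT: ID
SEGMENT: FD
Error processing data in segment "FD":  IO error
SEGMENT: ED
/Fru_Type: 0A04 (unrecognized value)
"""

anomalies = find_seeprom_anomalies(sample)
```

An empty result only means that neither pattern was seen; it is not proof that the SEEPROM is intact.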

 


Workaround

Follow the maintenance procedure outlined in the Sun Fire 15K/12K Systems Service Manual:

Always use the proper FRU swap procedure when removing boards from or inserting boards into the platform.  Service engineers must wait for esmd to log the remove or insert message before taking any further action.  Use "showlogs -F" to monitor platform message events such as the examples below:

    esmd[4326]: [50141 744994421110 NOTICE Cabinet.cc 240] V3CPU at SB17 has been removed.
    esmd[4326]: [50141 776011024237 NOTICE Cabinet.cc 207] V3CPU at SB17 has been inserted.

These messages are logged to the platform message log located in /var/opt/SUNWSMS/adm/platform.  ESMD polls for board insertion or removal every 30 seconds, but it may take up to 1.5 minutes before the message is logged (depending upon the SC load).
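
The check for these messages can be sketched in a few lines. The example below assumes the platform log lines are already available as a list of strings and that they follow the format of the two sample messages above; the function names are hypothetical, and this is an aid to reading the log, not a replacement for the service manual procedure.

```python
import re

# Matches the esmd remove/insert messages shown in this document.
ESMD_EVENT = re.compile(
    r'esmd\[\d+\]: \[[^\]]*\] (?P<board>\S+) at (?P<slot>\S+) '
    r'has been (?P<event>removed|inserted)\.'
)


def board_events(lines):
    """Return (slot, event) tuples in the order they were logged."""
    events = []
    for line in lines:
        match = ESMD_EVENT.search(line)
        if match:
            events.append((match.group("slot"), match.group("event")))
    return events


def swap_fully_logged(lines, slot):
    """True only if a removal of `slot` is logged and later followed by an insertion."""
    seen_removal = False
    for logged_slot, event in board_events(lines):
        if logged_slot != slot:
            continue
        if event == "removed":
            seen_removal = True
        elif event == "inserted" and seen_removal:
            return True
    return False


log = [
    "esmd[4326]: [50141 744994421110 NOTICE Cabinet.cc 240] V3CPU at SB17 has been removed.",
    "esmd[4326]: [50141 776011024237 NOTICE Cabinet.cc 207] V3CPU at SB17 has been inserted.",
]

ok_to_power_on = swap_fully_logged(log, "SB17")
```

Because a message may take up to 1.5 minutes to appear, a negative result shortly after a board change means "keep waiting", not "the event will never be logged".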

Note 1: Due to the issue described in bug 6426425, USIV+ boards may have their SEEPROM corrupted if the above procedure is not followed. If the board becomes "corrupted" and exhibits the failure signature "FAIL E$Dimm SBx/Px/Ex: Not compatible with US-IV+ processor", escalate to the Technology Service Center to request the internal 'tool' that repairs the corrupted SEEPROM container and restores board functionality.

Exception: Failure Signature "Serial number of CPU (80000228.B850D4EF) doesn't match data in board SEEPROM" cannot be corrected with the Internal 'tool'. The System Board will have to be replaced.
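
The distinction drawn by Note 1 and the Exception can be expressed as a small triage helper. This is a sketch only: the patterns are built from the two failure signatures quoted in this document, the function name and the action strings are hypothetical, and real POST output may differ in ways not covered here.

```python
import re

# Signature from Note 1: repairable with the internal tool.
ECACHE_INCOMPATIBLE = re.compile(
    r'FAIL E\$Dimm (SB\d+/P\d+/E\d+): Not compatible with US-IV\+ processor'
)
# Signature from the Exception: not repairable with the internal tool.
SERIAL_MISMATCH = re.compile(
    r"FAIL Proc (SB\d+/P\d+): Serial number of CPU \(\S+\) "
    r"doesn't match data in board SEEPROM"
)


def triage(post_line):
    """Return (component, action) for a known signature, else None."""
    match = ECACHE_INCOMPATIBLE.search(post_line)
    if match:
        return (match.group(1), "escalate to request repair tool")
    match = SERIAL_MISMATCH.search(post_line)
    if match:
        return (match.group(1), "replace System Board")
    return None


ecache_result = triage(
    "FAIL E$Dimm SB0/P0/E0: Not compatible with US-IV+ processor"
)
serial_result = triage(
    "FAIL Proc SB2/P0: Serial number of CPU (80000228.B850D4EF) "
    "doesn't match data in board SEEPROM"
)
```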

Note 2: None of the following actions taken in the field will correct the SEEPROM corruption; the System Board will require replacement.

  • sms restart
  • power cycle of board
  • re-insertion of board
  • flashupdate of the "sgcpu" image on System Board

 


Resolution

 


Modification History
Date: 23-AUG-2006
  • Updated to include Jaguar line of system boards

Date: 09-OCT-2006
  • Updated Corrective Action, Workaround Note 1

Date: 22-JAN-2007
  • Updated Corrective Action, Workaround Note 1, and added Exception below the note.


Previously Published As
102488
Internal Comments


Once the esmd-logged insertion event has been verified, the board may be powered on and POSTed.



Definitions of FD and ED segments

  • FD = Field Dynamic data. Standard segment where daemons and field software write data common to all platforms. Contains records that are not updated frequently.

    Recommended Section: ReadWrite (Dynamic)
    Readable By: All
    Writable By: All settings other than Ops/Repair
    Lifetime: Forever or Field
    Dynamic Data Segment Size Bytes: 2949
    Tagged Data Assigned to Segment: ECO_CurrentR, Customer_DataR, InstallationR, Soft_ErrorsR, Status_EventsR
    Status: Active
    Formatter Error Type: Error

  • ED = can be used to store additional platform- or FRU-specific "Static" data. This type of data would be used for system configuration, etc.

    Recommended Section: Read Only (Static)
    Readable By: Sun Only
    Writable By: Ops/Repair
    Lifetime: Forever
    Dynamic Data Segment Size Bytes: n/a
    Tagged Data Assigned to Segment: n/a
    Status: Active
    Formatter Error Type: Error


Related Information
  • Manual: Sun Fire 15K/12K Systems Service Manual
  • URL: http://www.sun.com/products-n-solutions/hardware/docs/Servers/High-End_Servers/Sun_Fire_15K/HW_Documentation/index.html

Internal Contributor/submitter
[email protected]

Internal Eng Business Unit Group
SSG ES (Enterprise Systems)

Internal Eng Responsible Engineer
[email protected]

Internal Services Knowledge Engineer
[email protected], [email protected]

Internal Escalation ID
1-15782131

Internal Kasp FAB Legacy ID
102488

Internal Sun Alert & FAB Admin Info
Critical Category:
Significant Change Date: 2006-06-29
Avoidance: Service Procedure
Responsible Manager: [email protected]
Original Admin Info: WF - updated Note 1 and added exception in Corrective Action section per Scott Barnard (1/22/07) - Joe
WF - i added Joe to KE list instead of me - 25-Jul-07 karen

Product_uuid
077fd4c5-df8f-4320-ad69-7d01603a674d|Sun Fire 12K Server
1404a2d3-059a-11d8-84cb-080020a9ed93|Sun Fire E20K Server
29e4659c-0a18-11d6-9fa1-e67bbc033df8|Sun Fire 15K Server
d842dd03-059b-11d8-84cb-080020a9ed93|Sun Fire E25K Server

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.