Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-73-1001149.1
Update Date:2012-09-06
Keywords:

Solution Type  FAB (standard) Sure

Solution  1001149.1 :   USIV+ (Panther and Jaguar) System Boards could experience POST Failure if Board Replacement Procedure is not correctly followed  


Related Items
  • Sun Fire E25K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
  • Sun Fire E20K Server
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: Sun FAB
  • .Old GCS Categories>Sun Microsystems>Sun FAB>Standard>Reactive

PreviouslyPublishedAs
201540


__________

***Checked for relevance on 06-Sep-2012***

Product
Sun Fire 12K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire E25K Server

Bug Id
<SUNBUG: 6426425>

Part
  • Part No: 540-6439-06
  • Part Description: 1500MHz USIV+
Part
  • Part No: 540-6753
  • Part Description: 1800MHz USIV+
Part
  • Part No: 540-5664
  • Part Description: 1050MHz USIV+
Part
  • Part No: 540-5940
  • Part Description: 1200MHz USIV+
Part
  • Part No: 540-6251
  • Part Description: 1200MHz USIV+
Part
  • Part No: 540-6248
  • Part Description: 1050MHz USIV+
Part
  • Part No: 540-6233
  • Part Description: 1200MHz USIV+
Part
  • Part No: 540-6295
  • Part Description: 1350MHz USIV+

Impact
If the environmental status monitoring daemon (esmd) does not log the removal/insertion of System Boards, corruption of the System Board's SEEPROM can occur.

Symptoms

If the SEEPROM is corrupted in this way, POST fails with:

    FAIL E$Dimm SBx/Px/Ex: Not compatible with US-IV+ processor

Subsequent POST runs will report:

    CHS reports E$Dimm SB0/P0/E0 status NOT_GOOD. Treating as blacklisted.

The CHS status then cannot be cleared with "setchs -s OK -r OK -c SBx/Px/Ex"; the command returns the error:

    ERROR: cannot set status.
    FRU error 2: I/O error (this component may not exist, may be powered off, or the FRU may be
    corrupt).


Root Cause

SEEPROM corruption can be triggered in a number of ways. In this case, a board is removed and a different board is inserted before esmd notices the removal. Because esmd does not log the remove/insert events, it does not clear the SEEPROM cache in picld or frad, nor any of the pending SEEPROM events for the board. When frad next updates, it writes to the wrong location and corrupts the SEEPROM.
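The race described above can be illustrated with a toy model. This is only a loose analogy under stated assumptions, not the actual esmd, picld, or frad implementation: the `Board` and `ToyDaemon` classes, the `simulate` function, and the field names are all hypothetical. The model shows how cached data and pending writes that are never invalidated end up written to a different board in the same slot.

```python
class Board:
    """Hypothetical stand-in for a System Board and its SEEPROM contents."""

    def __init__(self, serial):
        self.seeprom = {"serial": serial}


class ToyDaemon:
    """Toy model of a daemon that caches SEEPROM data per slot (assumption)."""

    def __init__(self):
        self.cache = {}    # slot -> copy of the SEEPROM data last read
        self.pending = []  # slots with cached updates not yet written back

    def read(self, slot, board):
        self.cache[slot] = dict(board.seeprom)

    def queue_update(self, slot, field, value):
        self.cache[slot][field] = value
        self.pending.append(slot)

    def on_removal_logged(self, slot):
        # Models the cleanup that happens when the removal event is logged.
        self.cache.pop(slot, None)
        self.pending = [s for s in self.pending if s != slot]

    def flush(self, slots):
        # Writes cached data to whichever board currently sits in the slot.
        for slot in self.pending:
            slots[slot].seeprom = dict(self.cache[slot])
        self.pending = []


def simulate(wait_for_event):
    """Swap board A for board B and return the serial stored on board B."""
    slots = {"SB17": Board("A")}
    daemon = ToyDaemon()
    daemon.read("SB17", slots["SB17"])
    daemon.queue_update("SB17", "status", "OK")
    if wait_for_event:
        daemon.on_removal_logged("SB17")
    slots["SB17"] = Board("B")
    if wait_for_event:
        daemon.read("SB17", slots["SB17"])
    daemon.flush(slots)
    return slots["SB17"].seeprom["serial"]


print(simulate(wait_for_event=True))   # B  (board B keeps its own data)
print(simulate(wait_for_event=False))  # A  (board B overwritten with stale data)
```

In the second run the swap happens before the removal is processed, so the stale data belonging to board A is written to board B.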

Another failure signature related to this scenario is:

    FAIL Proc SB2/P0: Serial number of CPU (80000228.B850D4EF) doesn't match data in board SEEPROM
    (0000003E.77240482).

SEEPROM corruption can be further confirmed with the "prtfru" command:

    xc88-sc0:sms-svc:137> prtfru
    ex15?Label=ex15/EXB/sb15?Label=sb15/CPU/p0?Label=p0/e1?Label=e1/ECACHE/frutree/chassis
    /CP/ex15?Label=ex15/EXB/sb15?Label=sb15/CPU/p0?Label=p0/e1?Label=e1/ECACHE
    SEGMENT: ID
    SEGMENT: FD
    Error processing data in segment "FD":  IO error
    SEGMENT: ED
    /Fru_Type: 0A04 (unrecognized value)
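As a rough aid, saved prtfru output could be scanned for the two signs visible in the example above: a segment processing error and an unrecognized field value. The sketch below is a minimal illustration; the function name `find_seeprom_warnings` is hypothetical, and the two text patterns are taken only from this one sample, so other prtfru error wordings would not be caught.

```python
def find_seeprom_warnings(prtfru_output):
    """Return lines of prtfru text that match the two signs in the sample."""
    warnings = []
    for raw in prtfru_output.splitlines():
        line = raw.strip()
        if line.startswith("Error processing data in segment"):
            warnings.append(line)
        elif line.endswith("(unrecognized value)"):
            warnings.append(line)
    return warnings


# Excerpt of the prtfru output quoted in this document.
sample = """\
SEGMENT: ID
SEGMENT: FD
Error processing data in segment "FD":  IO error
SEGMENT: ED
/Fru_Type: 0A04 (unrecognized value)
"""

for warning in find_seeprom_warnings(sample):
    print(warning)
```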

 


Workaround
No workaround available - see Resolution section below.


Resolution

Follow the maintenance procedure below, which is outlined in the Sun Fire 15K/12K Systems Service Manual:

Always use the proper FRU swap procedure when removing boards from or inserting boards into the platform.  Service engineers must wait for esmd to log the remove or insert message before taking any further action.  Use "showlogs -F" to monitor platform message events such as the examples below:

    esmd[4326]: [50141 744994421110 NOTICE Cabinet.cc 240] V3CPU at SB17 has been removed.
    esmd[4326]: [50141 776011024237 NOTICE Cabinet.cc 207] V3CPU at SB17 has been inserted.

These messages are written to the platform message log in /var/opt/SUNWSMS/adm/platform.  esmd polls for board insertion or removal every 30 seconds, but depending on the SC load it may take up to 1.5 minutes for the message to be logged.
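The check for the logged events can be sketched in a few lines. This is a minimal illustration, assuming the log lines have already been captured as text: the regular expression is modeled only on the two sample esmd messages quoted above, and the function names `find_board_events` and `swap_is_logged` are hypothetical.

```python
import re

# Pattern modeled on the two sample esmd messages in this document;
# other esmd message layouts are not covered (assumption).
ESMD_EVENT = re.compile(
    r"esmd\[\d+\]: \[\d+ \d+ NOTICE \S+ \d+\] "
    r"(?P<board_type>\S+) at (?P<slot>\S+) has been (?P<action>removed|inserted)\."
)


def find_board_events(lines, slot):
    """Return the ordered 'removed'/'inserted' actions logged for one slot."""
    events = []
    for line in lines:
        match = ESMD_EVENT.search(line)
        if match and match.group("slot") == slot:
            events.append(match.group("action"))
    return events


def swap_is_logged(lines, slot):
    """True once a removal followed by a later insertion is logged for slot."""
    events = find_board_events(lines, slot)
    if "removed" not in events:
        return False
    return "inserted" in events[events.index("removed") + 1:]


sample = [
    "esmd[4326]: [50141 744994421110 NOTICE Cabinet.cc 240] V3CPU at SB17 has been removed.",
    "esmd[4326]: [50141 776011024237 NOTICE Cabinet.cc 207] V3CPU at SB17 has been inserted.",
]

print(find_board_events(sample, "SB17"))
print(swap_is_logged(sample, "SB17"))
```

A check like this only confirms what has already been logged; the engineer still has to allow for the polling delay before concluding that an event is missing.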

Note 1: Due to the issue described in bug 6426425, USIV+ boards may have their SEEPROM corrupted if the procedure above is not followed. If a board becomes corrupted and exhibits the failure signature "FAIL E$Dimm SBx/Px/Ex: Not compatible with US-IV+ processor", escalate to the Technology Service Center to request the internal tool that repairs the corrupted SEEPROM container and restores board functionality.

Exception: Failure Signature "Serial number of CPU (80000228.B850D4EF) doesn't match data in board SEEPROM" cannot be corrected with the Internal 'tool'. The System Board will have to be replaced.
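The decision between the two failure signatures can be summarized in a short sketch. It is only an illustration of the guidance in Note 1 and the Exception: the function name `recommended_action` and the returned strings are hypothetical, and the patterns cover just the two POST messages quoted in this document.

```python
import re

# Signatures copied from the POST messages quoted in this document.
ECACHE_SIGNATURE = re.compile(
    r"FAIL E\$Dimm \S+: Not compatible with US-IV\+ processor"
)
SERIAL_SIGNATURE = re.compile(
    r"FAIL Proc \S+: Serial number of CPU \(\S+\) "
    r"doesn't match data in board SEEPROM"
)


def recommended_action(post_line):
    """Map a POST failure line to the action described in this bulletin."""
    if SERIAL_SIGNATURE.search(post_line):
        # Per the Exception: the internal tool cannot correct this case.
        return "replace System Board"
    if ECACHE_SIGNATURE.search(post_line):
        # Per Note 1: request the internal SEEPROM repair tool.
        return "escalate to Technology Service Center"
    return "no signature from this bulletin matched"


print(recommended_action(
    "FAIL E$Dimm SB0/P0/E0: Not compatible with US-IV+ processor"
))
print(recommended_action(
    "FAIL Proc SB2/P0: Serial number of CPU (80000228.B850D4EF) "
    "doesn't match data in board SEEPROM (0000003E.77240482)."
))
```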

Note 2: The System Board will require replacement; none of the following actions taken in the field will correct the SEEPROM corruption:

  • sms restart
  • power cycle of board
  • re-insertion of board
  • flashupdate of the "sgcpu" image on System Board

 


Comments

Once the esmd insertion event has been verified in the log, the board may be powered on and POSTed.

Definitions of FD and ED segments.

  • FD = Field Dynamic data. Standard segment where daemons and field software write data common to all platforms. Contains records that are not updated frequently.

    Recommended Section: ReadWrite (Dynamic)

    Readable By: All

    Writable By: All settings other than Ops/Repair

    Lifetime: Forever or Field

    Dynamic Data Segment Size Bytes: 2949

    Tagged Data Assigned to Segment: ECO_CurrentR, Customer_DataR, InstallationR, Soft_ErrorsR, Status_EventsR

    Status: Active

    Formater Error Type: Error

  • ED = Can be used to store additional platform-specific or FRU-specific "Static" data. This type of data would be used for system configuration, etc.

    Recommended Section: Read Only (Static)

    Readable By: Sun Only

    Writable By: Ops/Repair

    Lifetime: Forever

    Dynamic Data Segment Size Bytes: n/a

    Tagged Data Assigned to Segment: n/a

    Status: Active

    Formater Error Type: Error


Modification History
Date: 23-AUG-2006
  • Updated to include Jaguar line of system boards

Date: 09-OCT-2006
  • Updated Corrective Action, Workaround Note 1

Date: 22-JAN-2007
  • Updated Corrective Action, Workaround Note 1 and added Exception below note.


Previously Published As
102488
Internal Contributor/submitter: [email protected]
Internal Eng Business Unit Group: SSG ES (Enterprise Systems)
Internal Eng Responsible Engineer: [email protected]
Internal Services Knowledge Engineer: [email protected], [email protected]

Internal Escalation ID: 1-15782131

Internal Kasp FAB Legacy ID: 102488

Internal Sun Alert & FAB Admin Info
Critical Category:
Significant Change Date: 2006-06-29
Avoidance: Service Procedure
Responsible Manager: [email protected]
Original Admin Info: WF - updated Note 1 and added exception in Corrective Action section per Scott Barnard (1/22/07) - Joe
WF - i added Joe to KE list instead of me - 25-Jul-07 karen

Product_uuid
077fd4c5-df8f-4320-ad69-7d01603a674d|Sun Fire 12K Server
1404a2d3-059a-11d8-84cb-080020a9ed93|Sun Fire E20K Server
29e4659c-0a18-11d6-9fa1-e67bbc033df8|Sun Fire 15K Server
d842dd03-059b-11d8-84cb-080020a9ed93|Sun Fire E25K Server

Attachments
This solution has no attachment
      Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.