Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-75-1019646.1
Update Date:2010-08-10
Keywords:

Solution Type  Troubleshooting Sure

Solution  1019646.1 :   Troubleshooting Interconnect errors on Sun Fire[TM] v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900, and Netra 1280, 1290 systems.  


Related Items
  • Sun Fire E6900 Server
  • Sun Fire 6800 Server
  • Sun Fire 3800 Server
  • Sun Fire E4900 Server
  • Sun Netra 1280 Server
  • Sun Fire 4800 Server
  • Sun Fire V1280 Server
  • Sun Fire E2900 Server
  • Sun Netra 1290 Server
  • Sun Fire 4810 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange V and Netra Servers
  • GCS>Sun Microsystems>Servers>Entry-Level Servers
  • GCS>Sun Microsystems>Servers>Midrange Servers

PreviouslyPublishedAs
242866


Applies to:

Sun Fire 6800 Server
Sun Netra 1290 Server
Sun Fire V1280 Server
Sun Fire 4810 Server
Sun Fire E4900 Server
All Platforms

Purpose

Description

This document provides basic troubleshooting steps for diagnosing the cause of Interconnect Errors on Sun Fire[TM] Midrange Servers.

Symptoms:

  • A System Board, I/O Board, or Repeater may have been recently serviced, replaced, or reseated.
  • A domain may not be able to boot.
  • A domain may be described as down, as failing to respond to setkeyswitch on, as unable to power on, or as having failed POST.
  • Error messages displayed in the System Controller (SC) log files (showlogs -v) or on the console could include messages like:
Failed AR interconnect test.

CPU Board V3 at /N0/SB1 has been removed from domain C due to a failure in interconnection test.
Service action required.

AR Interconnect test: System board SB1/ar0 address repeater connections to system board RP3/ar0 failed

DX Interconnect test: System board /N0/SB1 data line connections to system board RP0 failed
NOTE:  These errors can be reported against any domain, Repeater (RP), System Board (SB), or I/O Board (IB); the components and messages shown above are examples only.
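
As a rough illustration of how the sample messages above could be scanned for in saved SC log output, here is a minimal Python sketch. The pattern is modeled only on the two "Interconnect test" lines quoted above; the function name find_interconnect_failures and the SAMPLE_LOG text are hypothetical, and real showlogs output may use wordings this pattern does not match.

```python
import re

# Pattern modeled only on the two sample "Interconnect test" messages
# quoted in this document; other message wordings will not match.
INTERCONNECT_RE = re.compile(
    r"(?P<test>AR|DX) Interconnect test: System board (?P<board>\S+) "
    r".*? connections to system board (?P<peer>\S+) failed"
)


def find_interconnect_failures(log_text):
    """Return a (test, board, peer) tuple for each matching log line."""
    failures = []
    for line in log_text.splitlines():
        match = INTERCONNECT_RE.search(line)
        if match:
            failures.append((match["test"], match["board"], match["peer"]))
    return failures


# Hypothetical saved log text, copied from the examples in this document.
SAMPLE_LOG = """\
AR Interconnect test: System board SB1/ar0 address repeater connections to system board RP3/ar0 failed
DX Interconnect test: System board /N0/SB1 data line connections to system board RP0 failed
"""

for failure in find_interconnect_failures(SAMPLE_LOG):
    print(failure)
```

Each tuple names the test type and the two components on either end of the failed connection, which is the raw material for building a suspect list.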

System Type:

  • Sun Fire[TM] v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900
  • Netra[TM] 1280, 1290

Last Review Date

July 23, 2010

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Steps to Follow
Collect the appropriate troubleshooting data and contact Sun Support Services.
The error you have encountered is a board interconnection (connectivity) issue.  It is likely caused by a hardware defect, a board or slot problem, or a board "seating" problem.  A Sun Support Engineer must be engaged to diagnose and resolve this event.

Please contact Sun Support Services in order to diagnose this issue.  Being prepared with the following troubleshooting data will allow that engineer to immediately begin diagnosis of the issue, and decrease the time to resolution.

Please provide:

  • Explorer with scextended or 1280extended option (depending on platform type);  See Document 1019066.1 for details
  • When Explorer data cannot be captured, please collect the output of the System Controller (SC) commands listed in Document 1003529.1.


    Internal Comments
    Please validate that each troubleshooting step below is true for your
    environment.

    The steps will provide instructions or a link to a document, for
    validating the step and taking corrective action  as necessary. The
    steps are ordered in the most appropriate sequence to isolate the
    issue and identify the proper resolution.  Please do not skip a step.

    1.  Verify the components implicated in the interconnect errors were
         not recently replaced, reseated, or "handled".

    - Recently "handled" hardware includes any board that has been
       removed or inserted, whether to replace the board itself or a
       hardware component on it.

    - Since the error is an interconnection problem, the physical act of
       servicing or handling the board could be the cause of the problem.

    Reference:  Document 1019218.1 Sun Fire[TM] Midrange Servers:  How to
    identify pin or socket damage.

    2.  Verify that the errors persist after executing System Controller
         Failover (dual SC config) or an SC Reset (single SC config).

    - Failover (scfailover) is only available on systems with Dual SCs.
          Reference: Document 1003245.1 Sun Fire[TM] 3800-6900: System
                     Controller failover functionality

    - On Sun Fire[TM] v1280/E2900 and Netra[TM] 1280/1290 (single SC
       configurations), use the resetsc command to reset the SC and
       confirm its sanity.
          Reference: Document 1012388.1 Sun Fire[TM] V1280/2900 LOM
                     Quick Command Reference

        - If errors persist on both SCs or after the resetsc is issued,
          proceed to Step 3.
        - If errors go away after the resetsc, you are done.
        - If errors go away after executing scfailover, fail back to the
          original Main SC and check whether the errors return.
            - If they return, replace that SC.
    3.  Confirm that you are able to determine the suspect list for
         this issue and prioritize which suspect is most likely to be
         root cause.

    - See Document 1019649.1 How to determine the suspect list for
       Sun Fire[TM] Midrange Server interconnect errors.

    4.  Verify that the primary FRU is NOT defective (primary FRU
         determined by the results of Step 3).

    - If a System Board or I/O Board is implicated, it can be verified
       as defective in two different ways:
        - By replacing the board.
        - By having a Sun engineer move the suspect board into an empty
          slot or switch it with another board in the domain and observe
          the behavior.
            - If the board works in the alternate slot, the RP or the
              board slot (CP) is implicated (proceed to Step 5).
            - If the board fails to work in the alternate slot, the
              board is defective, so replace it.

    - If a Repeater (RP) is implicated, it can be verified as
       defective in two different ways:
        - By replacing it.
        - By having a Sun engineer switch the suspect RP with an
          alternate RP in the system and observe the behavior.
            - If the error follows the RP to its new location, then
              the RP is defective, so replace it.
            - If the failure remains at the old RP's slot, then the
              Centerplane is suspect.

    - The Sun engineer performing any replacement or moving any
       hardware should be extremely careful to inspect the board and
       CP pins and sockets.

       Reference: Document 1019218.1 Sun Fire[TM] Midrange Servers: 
       How to identify pin or socket damage.
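
The swap-test reasoning in this step can be summarized in a short sketch. It assumes the only observation recorded is whether the failure moved with the suspect part; the function name, argument values, and returned strings are hypothetical and simply restate the conclusions given above.

```python
def interpret_swap_test(suspect, failure_follows_suspect):
    """Interpret the result of moving a suspect FRU to another slot.

    suspect: "board" for a System Board or I/O Board, "repeater" for an RP.
    failure_follows_suspect: True if the failure appears at the part's new
        location, False if it stays with the original slot.
    """
    if suspect not in ("board", "repeater"):
        raise ValueError("suspect must be 'board' or 'repeater'")
    if suspect == "board":
        if failure_follows_suspect:
            return "board is defective: replace it"
        return "RP or board slot (CP) is implicated"
    if failure_follows_suspect:
        return "RP is defective: replace it"
    return "Centerplane is suspect"


# Example: a suspect board works after being moved to an alternate slot.
print(interpret_swap_test("board", False))
```

The sketch covers only the interpretation; the physical move and the pin and socket inspection still have to be done by the Sun engineer as described above.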

    5.  Verify that the secondary FRU is not defective (secondary
         FRU determined by the results of Step 3).

    - If a System Board or I/O Board is implicated, it can be verified
       as defective in two different ways:
        - By replacing the board.
        - By having a Sun engineer move the suspect board into an empty
          slot or switch it with another board in the domain and observe
          the behavior.
            - If the board works in the alternate slot, the RP or the
              board slot (CP) is implicated (proceed to Step 6).
            - If the board fails to work in the alternate slot, the
              board is defective, so replace it.

    - If a Repeater (RP) is implicated, it can be verified as
       defective in two different ways:
        - By replacing it.
        - By having a Sun engineer switch the suspect RP with
          an alternate RP in the same system and observe the
          behavior.
            - If the error follows the RP to its new location, then
              the RP is defective, so replace it.
            - If the failure remains at the old RP's slot, then the
              Centerplane is suspect.

    - The Sun engineer performing any replacement or moving any
       hardware should be extremely careful to inspect the board and
       CP pins and sockets. 

    Reference: Document 1019218.1 Sun Fire[TM] Midrange Servers: How
    to identify pin or socket damage.

    6.  Collaborate with TSC prior to proceeding to a Centerplane
         replacement.

    - When collaborating with TSC, make sure to have console data,
       Explorer data, and a detailed explanation of what has been
       replaced and when.
    - Most likely the Centerplane will have to be replaced, but TSC
       will want to make absolutely sure that nothing has been overlooked
       before proceeding to this invasive replacement action.

    NOTE: The testinterconnect command can be utilized to test board
    interconnections if you obtain a service mode password
    (setkeyswitch on also accomplishes this testing). For details on
    testinterconnect command usage refer to Document 1005014.1.

    Document Information:
    This document contains normalized content and is managed by the
    Content Lead(s) of the respective domains. Please provide
    feedback using the Add Comment link on this article to report
    a needed modification.

    Support Aliases: [email protected] or [email protected]
    Alias Archives: http://archives.central/alias/serengeti-support or
    http://archives.central/alias/lw8-support

    
    Instant Messenger Chat Room: Gl-ESG
    Service Request Queue:  GL-ESG


    Keywords:
    interconnect, Interconnect, interconnect test, interconnection,
    testinterconnect, Service action required, failure, POST, normalized

    Attachments
    This solution has no attachment
      Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.