Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1019646.1
Update Date:2012-01-05
Keywords:

Solution Type  Troubleshooting Sure

Solution  1019646.1 :   Troubleshooting Interconnect errors on Sun Fire[TM] v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900, and Netra 1280, 1290 systems.  


Related Items
  • Sun Fire E6900 Server
  • Sun Fire 3800 Server
  • Sun Fire 6800 Server
  • Sun Fire E4900 Server
  • Sun Netra 1280 Server
  • Sun Fire 4800 Server
  • Sun Fire V1280 Server
  • Sun Fire E2900 Server
  • Sun Fire 4810 Server
  • Sun Netra 1290 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Exx00
  • .Old GCS Categories>Sun Microsystems>Servers>Entry-Level Servers
  • .Old GCS Categories>Sun Microsystems>Servers>Midrange Servers
  • .Old GCS Categories>Sun Microsystems>Servers>Midrange V and Netra Servers

PreviouslyPublishedAs
242866


Applies to:

Sun Fire E6900 Server
Sun Fire 6800 Server
Sun Netra 1290 Server
Sun Fire V1280 Server
Sun Fire 4810 Server
All Platforms

Purpose

Description

This document provides the basic troubleshooting steps for diagnosing the cause of Interconnect Errors on Sun Fire[TM] Midrange Servers.

Symptoms:

  • A System Board, I/O Board, or Repeater may have been recently serviced, replaced, or reseated.
  • A domain may not be able to boot.
  • A domain may be reported as down, as unable to be powered on or brought up with setkeyswitch on, or as having failed POST.
  • Error messages displayed in the System Controller (SC) log files (showlogs -v) or on the console could include messages like:
Failed AR interconnect test.

CPU Board V3 at /N0/SB1 has been removed from domain C due to a failure in interconnection test.
Service action required.

AR Interconnect test: System board SB1/ar0 address repeater connections to system board RP3/ar0 failed

DX Interconnect test: System board /N0/SB1 data line connections to system board RP0 failed


 

NOTE:  The example errors can be associated with any domain, Repeater (RP), System Board (SB), or I/O Board (IB), and the examples above are not an exhaustive list of the messages these faults can produce.
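
When a saved copy of the SC log output is available, the messages above can be picked out with a short script. The following is a minimal sketch only: the function name find_interconnect_errors and the regular expressions are illustrative assumptions modeled solely on the four sample messages quoted above, and real log lines may carry different wording, timestamps, or prefixes.

```python
import re

# Hypothetical patterns derived only from the sample messages in this
# document; they are not an official or complete list of SC log formats.
INTERCONNECT_PATTERNS = [
    # e.g. "Failed AR interconnect test."
    re.compile(r"Failed (?P<test>AR|DX) interconnect test", re.IGNORECASE),
    # e.g. "AR Interconnect test: System board SB1/ar0 ... to system board RP3/ar0 failed"
    re.compile(
        r"(?P<test>AR|DX) Interconnect test: System board (?P<src>\S+) "
        r".* to system board (?P<dst>\S+) failed"
    ),
    # e.g. "... at /N0/SB1 has been removed from domain C due to a failure in interconnection test."
    re.compile(
        r"at (?P<src>\S+) has been removed from domain (?P<domain>\w+) "
        r"due to a failure in interconnection test"
    ),
]


def find_interconnect_errors(log_text):
    """Return one dict per log line that matches a known sample pattern.

    Each dict holds the named groups that matched (test, src, dst, domain,
    as available) plus the stripped source line under the key "line".
    """
    hits = []
    for line in log_text.splitlines():
        for pattern in INTERCONNECT_PATTERNS:
            match = pattern.search(line)
            if match:
                hit = {k: v for k, v in match.groupdict().items() if v}
                hit["line"] = line.strip()
                hits.append(hit)
                break  # one hit per line is enough
    return hits
```

The component names collected this way (for example SB1, RP3, RP0) are only a starting point for the suspect list; they do not replace the analysis performed by a Sun Support Engineer.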

System Type:

  • Sun Fire[TM] v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900
  • Netra[TM] 1280, 1290

Last Review Date

July 23, 2010

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Steps to Follow
Collect the appropriate troubleshooting data and contact Sun Support Services.
The error you have encountered is a board interconnection (connectivity) issue.  It is likely a hardware defect, a board or slot issue, or a board "seating" issue.  A Sun Support Engineer must be engaged to diagnose and resolve this event.

Please contact Sun Support Services in order to diagnose this issue.  Being prepared with the following troubleshooting data will allow that engineer to immediately begin diagnosis of the issue, and decrease the time to resolution.

Please provide:

  • Explorer with scextended or 1280extended option (depending on platform type);  See Document 1019066.1 for details
  • When Explorer data cannot be captured, please collect the output of the System Controller (SC) commands listed in Document 1003529.1.








    Please validate that each troubleshooting step below is true for your environment.

    The steps will provide instructions or a link to a document, for validating the step and taking corrective action  as necessary.
    The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. 
    Please do not skip a step.

    1.  Verify the components implicated in the interconnect errors were not recently replaced, reseated, or "handled".

       - Recently "handled" hardware includes any board that has been removed
         or inserted, whether to replace the board itself or a hardware component on it.

      - Since the error is an interconnection problem, the physical act of
        servicing or handling the board could be the cause of the problem.

       Reference:  Document 1019218.1 Sun Fire[TM] Midrange Servers:  How to identify pin or socket damage.

    2.  Verify that the errors persist after executing System Controller Failover (dual SC config) or an SC Reset (single SC config).

       - Failover (scfailover) is only available on systems with Dual SCs.
         Reference: Document 1003245.1 Sun Fire[TM] 3800-6900: System Controller failover functionality

       - On Sun Fire[TM] v1280/E2900 and Netra[TM] 1280/1290 (single SC configurations) you will need to use the resetsc command to
         reset the SC and confirm its sanity.
         Reference: Document 1012388.1 Sun Fire[TM] V1280/2900 LOM Quick Command Reference

        - If errors persist on both SCs, or after the resetsc is issued, proceed to Step 3.
        - If errors go away after the resetsc, you are done.
        - If errors go away after executing scfailover, fail back to the original Main SC and check whether the errors return.
        - If the errors return on the original Main SC, replace that SC.
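
    The branching in Step 2 can be written down as a small decision function. This is a sketch for illustration only: the function name step2_next_action and its parameters are hypothetical, and the case where errors do not return after failing back is not covered by this document, so the sketch reports it as unspecified.

```python
def step2_next_action(dual_sc, errors_persist, errors_return_on_failback=None):
    """Map the Step 2 observations to the next action described in the text.

    dual_sc: True for dual SC configurations (scfailover available),
        False for single SC configurations (resetsc only).
    errors_persist: True if the interconnect errors are still seen after
        the scfailover (dual SC) or the resetsc (single SC).
    errors_return_on_failback: dual SC only; True/False once the original
        Main SC has been made Main again, None if not yet checked.
    """
    if errors_persist:
        # Errors seen on both SCs, or still present after resetsc.
        return "proceed to Step 3"
    if not dual_sc:
        # Single SC: errors cleared by resetsc.
        return "done"
    if errors_return_on_failback is None:
        return "fail back to the original Main SC and recheck"
    if errors_return_on_failback:
        return "replace the original Main SC"
    # Assumption: the document does not describe this outcome.
    return "not specified by this document"
```

    For example, step2_next_action(dual_sc=True, errors_persist=False, errors_return_on_failback=True) corresponds to the case in which the original Main SC is replaced.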

    3.  Confirm that you are able to determine the suspect list for this issue and prioritize which suspect is most likely to be root cause.

      - See Document 1019649.1 How to determine the suspect list for Sun Fire[TM] Midrange Server interconnect errors.

    4.  Verify that the primary FRU is NOT defective (primary FRU determined by the results of Step 3).

      - If a System Board or I/O Board is implicated, it can be verified as defective two different ways:
        - By replacing the board.
        - By having a Sun engineer move the suspect board into an empty
          slot or switch it with another board in the domain and observe the behavior.
            - If the board works in the alternate slot, the RP or the
              board slot (CP) is implicated (proceed to Step 5).
            - If the board fails to work in the alternate slot, the
              board is defective, so replace it.

      - If a Repeater (RP) is implicated, it can be verified as
        defective two different ways:
        - By replacing it.
        - By having a Sun engineer switch the suspect RP with an
          alternate RP in the system and observe the behavior.
            - If the error follows the RP to its new location, then
              the RP is defective, so replace it.
            - If the failure remains at the old RP's slot, then the
              Centerplane is suspect.

      - The Sun engineer performing any replacement or moving any
        hardware should be extremely careful to inspect the board and CP pins and sockets.

       Reference: Document 1019218.1 Sun Fire[TM] Midrange Servers:  How to identify pin or socket damage.
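
    The swap tests in this step reduce to two yes/no observations. The sketch below encodes them for illustration; the names Verdict, board_swap_verdict, and rp_swap_verdict are hypothetical and are not part of any Sun tool. The physical moves themselves must still be performed by a Sun engineer.

```python
from enum import Enum


class Verdict(Enum):
    """Outcomes named in the swap-test bullets of this step."""
    BOARD_DEFECTIVE = "board is defective, replace it"
    RP_OR_SLOT_IMPLICATED = "RP or board slot (CP) is implicated"
    RP_DEFECTIVE = "RP is defective, replace it"
    CENTERPLANE_SUSPECT = "Centerplane is suspect"


def board_swap_verdict(works_in_alternate_slot: bool) -> Verdict:
    """Interpret moving a suspect System Board or I/O Board to another slot."""
    if works_in_alternate_slot:
        # The board itself behaves, so suspicion shifts to the RP or slot.
        return Verdict.RP_OR_SLOT_IMPLICATED
    return Verdict.BOARD_DEFECTIVE


def rp_swap_verdict(error_follows_rp: bool) -> Verdict:
    """Interpret switching a suspect Repeater with an alternate Repeater."""
    if error_follows_rp:
        # The failure moved with the RP, so the RP is at fault.
        return Verdict.RP_DEFECTIVE
    return Verdict.CENTERPLANE_SUSPECT
```

    Recording the verdict for each move alongside the console output gives a clear trail of what was tried and what was observed.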
     
    5.  Verify that the secondary FRU is not defective (secondary FRU determined by the results of Step 3).

      - If a System Board or I/O Board is implicated, it can be verified as defective two different ways:
        - By replacing the board.
        - By having a Sun engineer move the suspect board into an empty
          slot or switch it with another board in the domain and observe the behavior.
            - If the board works in the alternate slot, the RP or the
              board slot (CP) is implicated (proceed to Step 6).
            - If the board fails to work in the alternate slot, the
              board is defective, so replace it.

      - If a Repeater (RP) is implicated, it can be verified as
        defective two different ways:
        - By replacing it.
        - By having a Sun engineer switch the suspect RP with
          an alternate RP in the same system and observe the behavior.
            - If the error follows the RP to its new location, then the RP
              is defective, so replace it.
            - If the failure remains at the old RP's slot, then the
              Centerplane is suspect.

      - The Sun engineer performing any replacement or moving any
        hardware should be extremely careful to inspect the board and
        CP pins and sockets.

      Reference: Document 1019218.1 Sun Fire[TM] Midrange Servers: How to identify pin or socket damage.

    6.  Collaborate with TSC prior to proceeding to a Centerplane replacement.

      - When collaborating with TSC, make sure to have console data,
        explorer data, and a detailed explanation of what has been
        replaced, and when, available.
      - Most likely the Centerplane will have to be replaced, but TSC
        will want to make absolutely sure that nothing has been overlooked
        before proceeding to this invasive replacement action.

    NOTE: The testinterconnect command can be utilized to test board
    interconnections if you obtain a service mode password
    (setkeyswitch on also accomplishes this testing). For details on
    testinterconnect command usage refer to Document 1005014.1.


    Keywords:
    interconnect, Interconnect, interconnect test, interconnection,
    testinterconnect, Service action required, failure, POST, normalized, Mapped, Global_Oring





Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.