Asset ID: 1-75-1005476.1
Update Date: 2012-07-23
Solution Type: Troubleshooting Sure
Solution 1005476.1: Troubleshooting Level 2 Check Errors (L2CheckError) on Sun Fire[TM] 3800/4800/4810/6800/E2900/E4900/E6900 & Netra[TM] 1280/1290
Related Items
- Sun Netra 1280 Server
- Sun Fire 6800 Server
- Sun Fire 3800 Server
- Sun Fire E2900 Server
- Sun Fire 4810 Server
- Sun Fire V1280 Server
- Sun Fire 4800 Server
- Sun Fire E4900 Server
- Sun Fire E6900 Server
- Sun Netra 1290 Server
Previously Published As: 207600
Applies to:
Sun Fire E4900 Server - Version All Versions and later
Sun Fire E6900 Server - Version All Versions and later
Sun Fire 6800 Server - Version All Versions and later
Sun Netra 1290 Server - Version All Versions and later
Sun Netra 1280 Server - Version All Versions and later
All Platforms
Purpose
Description
This document provides the steps for troubleshooting Level 2 Check Error events (L2CheckErrors) on Sun Fire[TM] Midrange servers.
Symptoms:
- A system or domain may be described as having gone down, rebooted unexpectedly, panicked, reset, or similar.
- The word L2CheckError may be displayed in errors on the System Controller (SC) console or in showlogs output.
- It may be reported that one domain was rebooted and another unexpectedly reset, panicked, or similar.
- It may be reported that a System Board (SB), Repeater, I/O Board, CPU(s), Memory DIMMs, or similar component is labeled as faulty or suspect and may be missing or disabled.
- It may be reported that the system or domain booted after the System Controller (SC) was failed over, rebooted, or reset.
System Type and Configuration:
- Sun Fire[TM] 3800, 4800, 4810, 6800 Servers
- Sun Fire[TM] E4900, E6900 Servers
- Sun Fire[TM] V1280, E2900 Servers
- Netra[TM] 1280, 1290 Servers
Notes: The system configuration includes at least System Controller Application (ScApp) 5.15.x. An L2CheckError event implicates a device called a Repeater (RP). An RP is a type of board on all systems except the Sun Fire[TM] 3800, where the RPs are located on the system's Backplane/Centerplane.
Assumption:
This document assumes the event encountered is not a repeat event. If this is a repeat event, collect data as outlined in Step 12 below and let the Sun Support Services engineer perform the analysis.
Sun Shared Shell
If you require assistance in collecting the data recommended in this article, or require help in diagnosing a system issue, a collaborative service tool called Sun Shared Shell allows Sun Service engineers to remotely view and diagnose customers' systems. Consider using this option to reduce the problem resolution time.
Troubleshooting Steps
Steps to Follow
Please validate that each troubleshooting step below is true for your environment.
Each step provides instructions, or a link to a document, for validating the step and taking corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Please do not skip a step.
1. Verify the event encountered is an L2CheckError.
SC-Name:SC> showlogs -d c
Dec 29 10:23:51 sc0 Domain-C.SC: [ID 272780 local5.error] ArAsic reported first error on /N0/SB1
Dec 29 10:23:51 sc0 Domain-C.SC: [ID 807371 local5.error]
/partition1/domain0/SB1/ar0:
L2CheckError[0x6150] : 0x00608060
AccIncSyncErr [24:21] : 0x3 accumulated incoming mismatch
FE [15:15] : 0x1
INCSyncErr [08:05] : 0x3 Ports [9:6] incoming mismatched against internal expected incoming
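The register value printed in the log can be cross-checked against the decoded field lines. The following is a minimal Python sketch, not a Sun-provided tool. The function name decode_l2_check_error and the field table are illustrative, and the table lists only the bit ranges that appear in the log excerpts in this document, so any other fields of the register are not decoded.

```python
# Bit ranges copied from the log excerpts in this document: name -> (high bit, low bit).
# This is not a complete register map; fields not shown in the excerpts are omitted.
L2_CHECK_ERROR_FIELDS = {
    "AccCMDVSyncErr": (28, 25),
    "AccIncSyncErr": (24, 21),
    "AccPreqSyncErr": (20, 17),
    "FE": (15, 15),
    "CMDVSyncErr": (12, 9),
    "INCSyncErr": (8, 5),
    "PreqSyncErr": (4, 1),
}


def decode_l2_check_error(value):
    """Return {field name: field value} for every non-zero known field."""
    decoded = {}
    for name, (high, low) in L2_CHECK_ERROR_FIELDS.items():
        width = high - low + 1
        field = (value >> low) & ((1 << width) - 1)
        if field:
            decoded[name] = field
    return decoded


if __name__ == "__main__":
    # Register value taken from the showlogs example in Step 1.
    for name, field in decode_l2_check_error(0x00608060).items():
        print(f"{name}: {field:#x}")
```

Decoding 0x00608060 from the example above gives AccIncSyncErr 0x3, FE 0x1 and INCSyncErr 0x3, which matches the field lines printed by the SC.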
2. Verify this is not a "known" memory interleaving issue.
Note: Sun Fire[TM] V1280/E2900 & Netra[TM] 1280/1290 are excluded from this step because they cannot have a multiple domain configuration.
3. Verify this is not a "known" adjacent domain issue.
Note: Sun Fire[TM] V1280/E2900 & Netra[TM] 1280/1290 are excluded from this step because they cannot have a multiple domain configuration.
4. Customers should contact Sun Support Services, mention this document ID, and verify extended Explorer data is available for analysis or be prepared to use Sun Shared Shell to continue diagnosis of the event.
- Sun Shared Shell is a collaborative service tool which allows Sun Service engineers to remotely view and diagnose customers' systems. Consider using this option to reduce the problem resolution time.
- If the Shared Shell option above is not available, the Sun Support Engineer will verify that the previous steps have been performed and then perform the analysis offline using Explorer data.
Performing Additional Analysis Offline
Verify that Steps 1-3 in the Steps to Follow section above have been performed before commencing Steps 6 and 7.
6. Verify this is not a "known" Dynamic Reconfiguration (DR)/cfgadm issue (previously described in Doc 1001300.1).
Bug Id 6300392: Use of the cfgadm(1M) command can trigger a domain outage with an "L2CheckError."
This issue can occur on the following platforms: Sun Fire 3800, 4800, 4810, E2900, E4900, 6800, E6900 and V1280 systems without ScApp firmware 5.19.8 or 5.20.3 (as delivered in patches 114526-09 and 114527-04).
Notes: This issue may occur on the systems listed above running Solaris 8, 9 or 10. Solaris 7 does not support the x800/x900 series of Sun Fire Systems.
This issue will only occur on systems configured for Dynamic Reconfiguration (DR).
An example use of cfgadm(1M) that causes this condition is the configuration of a system board:
# cfgadm -c configure N0.SB2
Output from the "showerrorbuffer" command will display captured error messages similar to the following:
ErrorData[19]
Date: Mon Jun 13 20:55:01 GMT-07:00 2005
Device: /SSC0/sbbc0/systemepld
Register: FirstError[0x10] : 0x0800 SB2 encountered the first error
ErrorData[20]
Date: Mon Jun 13 20:55:01 GMT-07:00 2005
Device: /partition0/domain0/SB2/bbcGroup0/repeaterepld
Register: FirstError[0x10]: 0x0001 ar0 encountered the first error
ErrorData[21]
Date: Mon Jun 13 20:55:01 GMT-07:00 2005
Device: /partition0/domain0/SB2/ar0
ErrorID: 0x10221fff
Register: L2CheckError[0x6150] : 0x00001e00 CMDVSyncErr [12:09] : 0xf Ports [9:6] command valid mismatched against internal expected command valid
ErrorData[22]
Date: Mon Jun 13 20:55:01 GMT-07:00 2005
Device: /partition0/domain0/SB2/ar0
ErrorID: 0x10221fff
Register: L2CheckError[0x6150] : 0x0000001e PreqSyncErr [04:01] : 0xf Ports [9:6] prereq mismatched against internal expected prereq
ErrorData[23]
Date: Mon Jun 13 20:55:01 GMT-07:00 2005
Device: /partition0/domain0/SB2/ar0
ErrorID: 0x10221fff
Register: L2CheckError[0x6150] : 0x1e000000 AccCMDVSyncErr [28:25] : 0xf accumulated valid command mismatch
ErrorData[24]
Date: Mon Jun 13 20:55:01 GMT-07:00 2005
Device: /partition0/domain0/SB2/ar0
ErrorID: 0x10221fff
Register: L2CheckError[0x6150] : 0x001e0000 AccPreqSyncErr [20:17] : 0xf accumulated prerequisite mismatch
and from the output of the "showlogs -d <domain name>" command for the same error:
Jun 13 20:55:01 g1db1-sc0 Domain-A.SC: [ID 427805 local0.crit] ErrorMonitor: Domain A has a SYSTEM ERROR
Jun 13 20:55:01 g1db1-sc0 Domain-A.SC: [ID 924577 local0.error] /N0/SB2 encountered the first error
Jun 13 20:55:01 g1db1-sc0 Domain-A.SC: [ID 175522 local0.error] ArAsic reported first error on /N0/SB2
Jun 13 20:55:01 g1db1-sc0 Domain-A.SC: [ID 653352 local0.error] /partition0/domain0/SB2/ar0:
>>>>>> L2CheckError[0x6150] : 0x1e1e9e1e
CMDVSyncErr [12:09] : 0xf Ports [9:6] command valid mismatched against internal expected command valid
PreqSyncErr [04:01] : 0xf Ports [9:6] prereq mismatched against internal expected prereq
AccCMDVSyncErr [28:25] : 0xf accumulated valid command mismatch
FE [15:15] : 0x1
AccPreqSyncErr [20:17] : 0xf accumulated prerequisite mismatch
Jun 13 20:55:01 g1db1-sc0 Domain-A.SC: [ID 250001 local0.error]
[AD] Event: SF4800
CSN: 229H2199 DomainID: A ADInfo: 1.SCAPP.15.4
Time: Mon Jun 13 20:55:01 GMT-07:00 2005
FRU-List-Count: 0; FRU-PN: ; FRU-SN: ; FRU-LOC: UNRESOLVED
Recommended-Action: Service action required
Jun 13 20:55:01 g1db1-sc0 Domain-A.SC: [ID 253130 local0.crit] Domain A is
currently paused due to an error.
This domain must be turned off via "setkeyswitch off" to recover
To work around the described issue, use one of the two following options:
a) Reboot the main system controller
or:
b) Manually failover the main system controller
This issue is addressed on the following platforms:
Sun Fire 3800, 4800, 4810, E2900, E4900, 6800, E6900 and V1280 systems with ScApp firmware 5.19.8 or 5.20.3 (as delivered in patches 114526-09 or later and 114527-04 or later)
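The firmware condition above can be written as a small check. This is a sketch only: the function names are illustrative, the version is assumed to be typed in by hand as a dotted string such as 5.19.3, and treating any release later than 5.20.3 as containing the fix is an assumption that this document does not state.

```python
def parse_version(text):
    """Turn a dotted version string such as '5.19.8' into a tuple (5, 19, 8)."""
    return tuple(int(part) for part in text.strip().split("."))


def has_cfgadm_fix(scapp_version):
    """Return True if the ScApp version is at or above the fix levels in Step 6.

    Fix levels from this document: 5.19.8 on the 5.19 branch, 5.20.3 on the 5.20 branch.
    Assumption (not from the document): releases after 5.20.3 also carry the fix.
    """
    version = parse_version(scapp_version)
    if version[:2] == (5, 19):
        return version >= (5, 19, 8)
    return version >= (5, 20, 3)


if __name__ == "__main__":
    for candidate in ("5.15.4", "5.19.3", "5.19.8", "5.20.3"):
        state = "fix present" if has_cfgadm_fix(candidate) else "exposed to Bug Id 6300392"
        print(candidate, state)
```

Versions are compared as integer tuples rather than as strings so that, for example, 5.19.10 sorts after 5.19.8.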
7. Verify this is not a repeat event.
A repeat event is an event that:
- has an identical failure signature and suspect indictment list, or
- the customer reports or believes is reoccurring on the same system/platform.
Repeat events require collaboration with the next level of support (Step 12).
8. Verify that this event is not caused by a power failure on a System Board (SB) or I/O Board (IB).
A power failure of a System Board or I/O Board can be identified by the following message appearing in the System Controller showlogs or showlogs -v domainID file:
Path broken between CBH and SDC:SB# ----> For an SB fault.
Path broken between CBH and SDC:IB# ----> For an IB fault.
If the message shown above is present for a System Board (SB), utilize Document 1019667.1 to resolve this issue.
If the message shown above is present for an I/O Board (IB), utilize Document 1017844.1 to resolve this issue.
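When the SC log has been saved to a file, the search for this message can be scripted. The sketch below is illustrative only: it assumes the "#" in the message is a decimal board number (for example SB2), and the function name and the sample log line in the usage section are hypothetical.

```python
import re

# Message text from Step 8; the "#" placeholder is assumed to be a decimal board number.
PATH_BROKEN = re.compile(r"Path broken between CBH and SDC:\s*(SB|IB)(\d+)")

# Follow-up documents named in Step 8.
FOLLOW_UP_DOC = {"SB": "1019667.1", "IB": "1017844.1"}


def find_power_failures(log_text):
    """Return a list of (board, document ID) pairs for each matching message."""
    hits = []
    for match in PATH_BROKEN.finditer(log_text):
        kind, number = match.groups()
        hits.append((kind + number, FOLLOW_UP_DOC[kind]))
    return hits


if __name__ == "__main__":
    # Hypothetical log line, for illustration only.
    sample = "Path broken between CBH and SDC:SB2"
    for board, doc in find_power_failures(sample):
        print(f"{board}: see Document {doc}")
```

An empty result means the message was not found in the supplied text, in which case continue with Step 9.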
9. Verify that the Auto-Diagnosis (AD) Event Message "FRU-LOC" does not say "UNRESOLVED".
The AD Event Messages are contained in the System Controller (SC) log files (showlogs or showlogs -d <domain name>). Look for the AD Event Message that matches the date and time of the event in question.
The following example identifies suspects RP3 and SB0:
Jul 07 21:56:47 sc0 Domain-C.SC:
[AD] Event: SF6800.ASIC.AR.INC_SYNC_ERR.1024106f
CSN: 136M2383 DomainID: C ADInfo: 1.SCAPP.19.3
Time: Wed Jul 07 21:56:38 PDT 2004
FRU-List-Count: 2; FRU-PN: 5014953; FRU-SN: 013023; FRU-LOC: RP3
FRU-PN: 5014362; FRU-SN: 017608;
FRU-LOC: /N0/SB0
Recommended-Action: Service action required
The following example says "UNRESOLVED":
Dec 29 10:23:51 systemx Domain-C.SC:
[ID 436815 local5.error] [AD] Event: SF6800
CSN: 313H3174 DomainID: C ADInfo: 1.SCAPP.19.3
Time: Mon Dec 29 10:23:51 CST 2003
FRU-List-Count: 0; FRU-PN: ; FRU-SN: ;
FRU-LOC: UNRESOLVED
Recommended-Action: Service action required
Collaborate with the next level of support (see Step 12) if the FRU-LOC is UNRESOLVED, or if you are unable or unsure how to determine this.
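The FRU-LOC check can be sketched in Python as follows. This is not a Sun tool; the function names are illustrative, and the pattern assumes each location appears as "FRU-LOC:" followed by a single token, as in the two examples above.

```python
import re

# A location is the token after "FRU-LOC:", which may sit on the following line.
FRU_LOC = re.compile(r"FRU-LOC:\s*([^\s;]+)")


def fru_locations(ad_event_text):
    """Return every FRU-LOC value found in one AD Event Message, in order."""
    return FRU_LOC.findall(ad_event_text)


def is_unresolved(ad_event_text):
    """Return True if the message has no FRU-LOC or reports UNRESOLVED."""
    locations = fru_locations(ad_event_text)
    return not locations or "UNRESOLVED" in locations


if __name__ == "__main__":
    # FRU lines copied from the first example in Step 9.
    event = (
        "FRU-List-Count: 2; FRU-PN: 5014953; FRU-SN: 013023; FRU-LOC: RP3\n"
        "FRU-PN: 5014362; FRU-SN: 017608;\n"
        "FRU-LOC: /N0/SB0\n"
    )
    print(fru_locations(event), is_unresolved(event))
```

Treating a message with no FRU-LOC at all as unresolved is a design choice of this sketch, so that an incomplete log excerpt also leads to collaboration rather than to a part replacement.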
10. Identify and replace the Primary Suspect from the AD Event Message "FRU-LOC" indictment.
The FRU-LOC (Field Replaceable Unit Location) indictment comprises a list of suspects, which can include SBs, IBs, and RPs.
Count the number of individual SBs and IBs, and separately the number of individual RPs, listed in the AD Event Message, then compare the totals to the table below.
---------------------------------------------------
Number of      Number of      Primary
SB & IB        RP             Suspect
---------------------------------------------------
1              1              SB or IB
1              2 (or +)       SB or IB
2 (or +)       1              RP
2 (or +)       2 (or +)       Collaborate
---------------------------------------------------
In the event message example in Step 9, SB0 and RP3 were implicated. The table shows that when the number of "SB & IB" and the number of "RP" are both 1, the SB or IB is the primary suspect. In this example, that is SB0.
Collaborate with the next level of support (see Step 12) if unable or unsure how to determine this.
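The table in this step can also be expressed as a short function. The sketch below is illustrative: the function name is hypothetical, board types are recognized only by the SB, IB and RP name prefixes, and any combination not covered by the table (for example no RP listed) is treated as "Collaborate", which is an assumption of this sketch.

```python
def primary_suspect(fru_locations):
    """Apply the Step 10 table to FRU-LOC values such as ['RP3', '/N0/SB0']."""
    # Keep only the last path component, so '/N0/SB0' becomes 'SB0'.
    names = {loc.rsplit("/", 1)[-1] for loc in fru_locations}
    boards = sorted(n for n in names if n.startswith(("SB", "IB")))
    repeaters = sorted(n for n in names if n.startswith("RP"))

    if len(boards) == 1 and len(repeaters) >= 1:
        return boards[0]      # one SB or IB with one or more RPs: the board
    if len(boards) >= 2 and len(repeaters) == 1:
        return repeaters[0]   # several SBs/IBs sharing one RP: the RP
    return "Collaborate"      # two or more of each, or a case outside the table


if __name__ == "__main__":
    # FRU-LOC values from the first AD Event Message example in Step 9.
    print(primary_suspect(["RP3", "/N0/SB0"]))
```

For the FRU-LOC values RP3 and /N0/SB0, the function returns SB0, which agrees with the worked example in this step.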
11. Verify the problem does not reoccur within 24 hours after replacing the Primary Suspect.
Replacement procedures are located in the Systems Service Manual for each server, which can be reached through the appropriate system's Hardware link on the Midframe & Midrange Servers Product Documentation Website.
12. Verify the latest data is available and collaborate with the next level of support.
The information to provide includes:
- Explorer data collected with the appropriate scextended or 1280extended option, as detailed in How to run Sun data and send to Sun engineer. If Explorer data cannot be collected for whatever reason, see Procedure to collect Sunfire Midrange failure data manually.
- A detailed listing of all previous service actions, identifying parts replaced and dates of service.
- Confirmation that the previous steps in this resolution path were performed (unless this is a repeat event).
Resources for continued troubleshooting:
Document: 1006221.1 Sun Fire[TM] Servers: How L2CheckErrors Happen
Document: 1009156.1 SDC Parity errors and SDC L2CheckError discussion
Previously Published As 88171
References
<NOTE:1000008.1> - Solaris Reboot Triggers Spurious SYSTEM Error in Adjacent Domain
<NOTE:1000448.1> - Domains on Sun Fire 3800/4800/4810/4900/6800/6900 Systems May Experience a Domain Pause
<NOTE:1001300.1> - Use of cfgadm(1M) on Certain Systems May Cause Domain Outage, Reporting "L2CheckError"
<NOTE:1002383.1> - Oracle Explorer Data Collector for Sun systems
<NOTE:1006221.1> - Sun Fire[TM] Servers: How L2CheckErrors Happen
<NOTE:1009156.1> - SDC Parity errors and SDC L2CheckError discussion
<NOTE:1017844.1> - Sun[TM] Fire 3800/4800/4810/6800/E2900/E4900/E6900/V1280 and Netra[TM] 1280/1290 Server: I/O Board (IB) power supply failures.
<NOTE:1019667.1> - Sun Fire[TM] Server System Board (SB) voltage errors.