Asset ID: 1-71-1002710.1
Update Date: 2011-05-23
Solution Type: Technical Instruction Sure Solution
1002710.1: Sun Fire[TM] v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900, and Netra[TM] 1280 and 1290 systems: Incoming versus Outgoing errors.
Related Items
- Sun Fire 4810 Server
- Sun Fire 3800 Server
- Sun Netra 1290 Server
- Sun Fire E6900 Server
- Sun Fire 6800 Server
- Sun Fire V1280 Server
- Sun Fire 4800 Server
- Sun Fire E2900 Server
- Sun Fire E4900 Server
- Sun Netra 1280 Server
Related Categories
- GCS>Sun Microsystems>Servers>Midrange Servers
- GCS>Sun Microsystems>Servers>Midrange V and Netra Servers
Previously Published As 203717
Applies to:
- Sun Netra 1290 Server - Version: Not Applicable and later [Release: N/A and later]
- Sun Fire E6900 Server - Version: Not Applicable and later [Release: N/A and later]
- Sun Fire E4900 Server - Version: Not Applicable and later [Release: N/A and later]
- Sun Fire 4810 Server - Version: Not Applicable and later [Release: N/A and later]
- Sun Fire V1280 Server - Version: Not Applicable and later [Release: N/A and later]
All Platforms
Goal
Description
This document applies to Sun Fire[TM] v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900, and Netra[TM] 1280 and 1290 systems.
This document covers the diagnosis of error events that are logged to the error buffer on the System Controller (SC) of the systems listed above. The error buffer data is collected by the showerrorbuffer command when Explorer is run with the scextended or 1280extended option. Alternatively, a user can display this information directly on the System Controller by running the command as follows (this example is from the lom prompt on an E2900 server):
lom> showerrorbuffer
ErrorData[0]
  Date: Sat Aug 18 09:50:39 EDT 2007
  Device: /SB0/dx3
  ErrorID: 0x33071ff3
  Port: 3
  Syndrome: 0xd(CE bit 41)
  Direction: outgoing read
  TargetAid: 0x3
  Transid: 0x1
ErrorData[1]
  Date: Sat Aug 18 09:50:39 EDT 2007
  Device: /SB2/dx3
  ErrorID: 0x33071ff3
  Port: 3
  Syndrome: 0xd(CE bit 41)
  Direction: incoming read
  First error: true
  TargetAid: 0x3
  Transid: 0x1
The example above is used in the remainder of this article to explain how Incoming and Outgoing relate to each other when diagnosing error messages.
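When many events are logged, it can help to load the output into records before reading it. The following is a minimal sketch, not a supported tool: it assumes the showerrorbuffer text has been saved with one field per line, and the names parse_error_buffer and SAMPLE are hypothetical.

```python
import re

# Sample text in the one-field-per-line layout assumed by this sketch.
SAMPLE = """\
ErrorData[0]
  Date: Sat Aug 18 09:50:39 EDT 2007
  Device: /SB0/dx3
  ErrorID: 0x33071ff3
  Port: 3
  Syndrome: 0xd(CE bit 41)
  Direction: outgoing read
  TargetAid: 0x3
  Transid: 0x1
ErrorData[1]
  Date: Sat Aug 18 09:50:39 EDT 2007
  Device: /SB2/dx3
  ErrorID: 0x33071ff3
  Port: 3
  Syndrome: 0xd(CE bit 41)
  Direction: incoming read
  First error: true
  TargetAid: 0x3
  Transid: 0x1
"""


def parse_error_buffer(text):
    """Return one dict per ErrorData[n] record found in the text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        marker = re.match(r"ErrorData\[(\d+)\]$", line)
        if marker:
            # A new record starts at each ErrorData[n] marker.
            records.append({"index": int(marker.group(1))})
        elif records and ":" in line:
            # Split on the first colon only, so the time of day in the
            # Date field stays intact.
            key, _, value = line.partition(":")
            records[-1][key.strip()] = value.strip()
    return records


if __name__ == "__main__":
    for rec in parse_error_buffer(SAMPLE):
        print(rec["index"], rec["Device"], rec["Direction"])
```

Field values are kept as strings so that nothing in the log is reinterpreted.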
Solution
Diagnosing incoming versus outgoing errors in the showerrorbuffer file.
What is the relation of the terms Incoming and Outgoing?
The answer is fairly simple: both terms describe the direction of a data transaction. There are two possible directions for an error event to "travel", and the direction is always stated relative to the dx asic (the picture below illustrates the data path in question, between the DX and the DCDS):
- Outgoing - An error that is moving away from the dx asic (Ultimately to a DCDS/CPU/Memory on the board or off to some other board).
- Incoming - An error that is moving towards the dx asic (From a DCDS/CPU/Memory on the reporting dx asic's board).
Why do we care about what direction the error "travels"?
The short answer is that this is an error.
The longer answer is that the event(s) may mean that defective hardware is involved if the errors are uncorrectable or excessive (exceeding Oracle's Memory Error Best Practice). Knowing the direction of the event allows a user to identify the source of the error, which is crucial to resolving the event and stopping the errors.
The direction of the transaction identifies the source, and thus the Root Cause, of the event.
Now, how do we identify the direction in which an event is "traveling" and identify the source? Using the same error example as before:
ErrorData[0]
  Date: Sat Aug 18 09:50:39 EDT 2007
  Device: /SB0/dx3            <--- This dx is reporting the event.
  ErrorID: 0x33071ff3
  Port: 3                     <--- This is the CPU number implicated.
  Syndrome: 0xd(CE bit 41)    <--- This is the error syndrome.
  Direction: outgoing read    <--- This is the direction of the event as it relates to the dx.
  TargetAid: 0x3
  Transid: 0x1
Outgoing means that the error's direction went from the dx asic (SB0/dx3) to the CPU (SB0/P3) or its memory (through the DCDS). This is what is called a "Victim" event because the error came from somewhere else and the dx asic "passed it along".
The next error from the example error log file shows a "Source"
event. Source events are root cause events.
ErrorData[1]
  Date: Sat Aug 18 09:50:39 EDT 2007
  Device: /SB2/dx3            <--- This dx is reporting the event.
  ErrorID: 0x33071ff3
  Port: 3                     <--- This is the CPU number implicated.
  Syndrome: 0xd(CE bit 41)    <--- This is the error syndrome.
  Direction: incoming read    <--- This is the direction of the event as it relates to the dx.
  First error: true
  TargetAid: 0x3
  Transid: 0x1
Incoming means that the error's direction went from the CPU (SB2/P3) or its memory, via the DCDS, to the dx asic (SB2/dx3). This means that the error was sourced from the DCDS, the CPU, or its memory (the CPU is a memory controller). The dx is simply reporting that a CPU it monitors has seen the error and forwards it along, where it becomes a different dx asic's Outgoing event.
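The Victim/Source distinction described above can be sketched in a few lines. This is an illustration only: the function names classify and cpu_location are hypothetical, and the records are plain dicts that mirror the fields of the example events.

```python
def classify(record):
    """Map the Direction field to the article's terms.

    incoming -> "source" (root cause side), outgoing -> "victim".
    """
    words = record["Direction"].split()
    direction = words[0].lower() if words else ""
    if direction == "incoming":
        return "source"
    if direction == "outgoing":
        return "victim"
    return "unknown"


def cpu_location(record):
    """Combine the reporting board and the Port number, e.g. SB2/P3."""
    board = record["Device"].strip("/").split("/")[0]
    return f"{board}/P{record['Port']}"


if __name__ == "__main__":
    events = [
        {"Device": "/SB0/dx3", "Port": "3", "Direction": "outgoing read"},
        {"Device": "/SB2/dx3", "Port": "3", "Direction": "incoming read"},
    ]
    for event in events:
        print(event["Device"], classify(event), cpu_location(event))
```

For the two example events this labels /SB0/dx3 as the victim and /SB2/dx3 as the source, with SB2/P3 as the CPU location to investigate.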
In the above example, the Root Cause suspects would be SB2 DIMM
pair J16500/J16501 because data bit 41 (ESYN 0xd) translates to that DIMM pair.
- If there were correlating ECC errors in the domain's /var/adm/messages file that showed only one DIMM bank in error, then the error would be further isolated to a single DIMM (either Bank 0 or Bank 1).
- The suspect(s) should be replaced ONLY IF they meet the Best Practice rules as defined in Document 1010905.1 Oracle Enhanced Memory DIMM Replacement Policy.
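Before looking up a syndrome, it can be convenient to split the Syndrome field into its parts. The sketch below only parses the text in the form shown in the example, 0xd(CE bit 41); the name parse_syndrome is hypothetical, and the mapping from data bit to DIMM pair is deliberately left out because it is board specific and not reproduced in this article.

```python
import re


def parse_syndrome(field):
    """Split a field such as "0xd(CE bit 41)" into its components.

    Returns None when the text does not match that form.
    """
    match = re.fullmatch(r"\s*(0x[0-9a-fA-F]+)\((\w+) bit (\d+)\)\s*", field)
    if match is None:
        return None
    return {
        "esyn": int(match.group(1), 16),  # syndrome value as an integer
        "kind": match.group(2),           # "CE" in the example event
        "bit": int(match.group(3)),       # data bit in error
    }


if __name__ == "__main__":
    print(parse_syndrome("0xd(CE bit 41)"))
```

A None result means the field needs to be read by hand rather than guessed at.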
NOTES:
- It is worth mentioning that this document discusses one of the easiest error examples to diagnose as it relates to Incoming/Outgoing directions: it showed only "read" transactions.
- A read is almost always sourced to a memory DIMM.
- If you see an "incoming write" from a single CPU location with many different "outgoing reads", suspect the CPU related to the "incoming write" transaction as Root Cause.
- Big rule: CPUs "write" and DIMMs "read", so when only "reads" are logged, suspect a DIMM first.
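The notes above can be expressed as a rough first-pass check. This is a sketch of the rule of thumb only and does not replace engineering diagnosis: root_cause_hint and cpu_location are hypothetical names, and min_outgoing_reads is an assumed stand-in for "many", which the notes do not quantify.

```python
from collections import Counter


def cpu_location(record):
    """Combine the reporting board and the Port number, e.g. SB2/P3."""
    board = record["Device"].strip("/").split("/")[0]
    return f"{board}/P{record['Port']}"


def root_cause_hint(records, min_outgoing_reads=2):
    """Return a first suspect based on the NOTES, or None if no rule fits."""
    directions = Counter(r["Direction"] for r in records)
    write_sources = {cpu_location(r) for r in records
                     if r["Direction"] == "incoming write"}
    read_sources = {cpu_location(r) for r in records
                    if r["Direction"] == "incoming read"}
    # One CPU location writing, several boards seeing outgoing reads.
    if (len(write_sources) == 1
            and directions["outgoing read"] >= min_outgoing_reads):
        return "CPU " + next(iter(write_sources))
    # Only read transactions logged: memory is the usual source.
    if read_sources and not any("write" in d for d in directions):
        return "memory behind " + ", ".join(sorted(read_sources))
    return None


if __name__ == "__main__":
    article_events = [
        {"Device": "/SB0/dx3", "Port": "3", "Direction": "outgoing read"},
        {"Device": "/SB2/dx3", "Port": "3", "Direction": "incoming read"},
    ]
    print(root_cause_hint(article_events))
```

For the example events in this article the hint points at the memory behind SB2/P3, which matches the DIMM pair conclusion reached above.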
Internal Comments
There is an ESYN Translator located at http://panacea.uk.oracle.com/twiki/bin/view/Tools/ToolPageEsynDecoderUniboard
which can be used to translate ECC syndromes as shown in this article's example.
Previously Published As 90269