Asset ID: 1-71-1010407.1
Update Date: 2010-09-08
Solution Type
Technical Instruction Sure
Solution 1010407.1: DTAG Parity Error Troubleshooting and Analysis
Related Items
- Sun Enterprise 3000 Server
- Sun Enterprise 4500 Server
- Sun Enterprise 5500 Server
- Sun Enterprise 4000 Server
- Sun Enterprise 5000 Server
- Sun Enterprise 6000 Server
- Sun Enterprise 3500 Server
- Sun Enterprise 6500 Server
Related Categories
- GCS>Sun Microsystems>Servers>Midrange Servers
Previously Published As: 214288
Applies to:
Sun Enterprise 3000 Server
Sun Enterprise 3500 Server
Sun Enterprise 4000 Server
Sun Enterprise 4500 Server
Sun Enterprise 5000 Server
All Platforms
Goal
Description:
This document describes how to analyze DTAG parity error events on Sun Enterprise 3x00/4x00/5x00/6x00 (also known as Classic) servers and determine whether a replacement action is necessary.
Examples:
A DTAG parity error event is often visible only on the system console (sometimes called the console log, since this output is often captured by a console server) and usually appears within Fatal Reset output.
An example from console log data is below:
17-OCT-2001 17:07:55.17 LBC5 Fatal Reset
17-OCT-2001 17:07:56.69 LBC5 0,0>FATAL ERROR
17-OCT-2001 17:07:57.15 LBC5 0,0> At time of error: System software was running.
17-OCT-2001 17:07:57.37 LBC5 0,0> Diagnosis: Board 2, Dtag B (UPA Port1),AC
17-OCT-2001 17:07:57.37 LBC5 0,0>Log Date: Oct 17 21:17:19 GMT 2001
17-OCT-2001 17:07:57.37 LBC5 0,0>
17-OCT-2001 17:07:57.58 LBC5 0,0>RESET INFO for CPU/Memory board in slot 2
17-OCT-2001 17:07:57.58 LBC5 0,0> AC ESR 00000010.00000000 DT_PERRB
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[0] 00
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[1] 00
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[2] 00
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[3] 00
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[4] 00
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[5] 00
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[6] 00
17-OCT-2001 17:07:57.80 LBC5 0,0> DC[7] 00
17-OCT-2001 17:07:57.80 LBC5 0,0> FHC CSR 00050030 LOC_FATAL SYNC BRD_LED_M BRD_LED_R
17-OCT-2001 17:07:57.80 LBC5 0,0> FHC RCSR 02000000 FATAL
17-OCT-2001 17:07:57.80 LBC5 0,0> Config policy change
17-OCT-2001 17:07:57.80 LBC5 0,0>
17-OCT-2001 17:07:57.80 LBC5 0,0>@(#) POST 3.9.28 2000/12/20 12:29
17-OCT-2001 17:07:58.02 LBC5 0,0>Copyright 2000 Sun Microsystems, Inc. All rights reserved.
In the example above, the DTAG parity error occurred on System Board 2, Dtag B (UPA Port1).
NOTE that the affected port can also be Port A (Dtag A).
The Solution section of this article explains this event in further detail.
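As a rough illustration of pulling the key facts out of captured console data, the following Python sketch scans log text for the POST "Diagnosis" line and the AC ESR flag shown in the example above. The function names and regular expressions are hypothetical and are based only on this one sample; other POST versions may format these lines differently.

```python
import re

# Pattern for the POST "Diagnosis" line shown in the example above.
# Assumption: derived from this single sample, not from a POST specification.
DIAGNOSIS_RE = re.compile(r"Diagnosis:\s*Board\s+(\d+),\s*Dtag\s+([AB])")
# Pattern for the AC ESR flag that names the failing DTAG group.
ESR_FLAG_RE = re.compile(r"\bDT_PERR([AB])\b")


def find_dtag_events(console_text):
    """Return a list of (board, dtag_group) tuples found in console log text."""
    events = []
    for line in console_text.splitlines():
        match = DIAGNOSIS_RE.search(line)
        if match:
            events.append((int(match.group(1)), match.group(2)))
    return events


def find_esr_flags(console_text):
    """Return the DTAG groups named by DT_PERRA / DT_PERRB flags."""
    return ESR_FLAG_RE.findall(console_text)


sample = """\
17-OCT-2001 17:07:57.37 LBC5 0,0> Diagnosis: Board 2, Dtag B (UPA Port1),AC
17-OCT-2001 17:07:57.58 LBC5 0,0> AC ESR 00000010.00000000 DT_PERRB
"""
print(find_dtag_events(sample))  # [(2, 'B')]
print(find_esr_flags(sample))    # ['B']
```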
- Analyzing such an event requires console log data. If you need to configure console logging, see Document 1008702.1 Console Logging Options to capture Fatal Reset output for Sun systems.
A DTAG event may also be seen in prtdiag output (in the section called Analysis of most recent Fatal Hardware Watchdog). This is about the only type of Fatal Error event that can be diagnosed from prtdiag output alone.
A DTAG error looks like the following in prtdiag:
AC: UPA Port B Dtag Parity Error
NOTE that the port can also be Port A, for example:
AC: UPA Port A Dtag Parity Error
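If saved prtdiag output is available as text, a check for these lines might look like the sketch below. The function name is hypothetical, and the pattern assumes the message appears exactly as quoted above.

```python
import re

# Matches the prtdiag message quoted above; the port letter is A or B.
# Assumption: the wording is identical across prtdiag versions.
PRTDIAG_DTAG_RE = re.compile(r"AC:\s*UPA Port ([AB]) Dtag Parity Error")


def dtag_ports_in_prtdiag(prtdiag_text):
    """Return the UPA port letters reported with a Dtag parity error."""
    return PRTDIAG_DTAG_RE.findall(prtdiag_text)


print(dtag_ports_in_prtdiag("AC: UPA Port B Dtag Parity Error"))  # ['B']
```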
Once you have determined that your event matches what has been described above, proceed to the Solution section of this article to resolve the event.
Solution
What is a DTAG parity error?
The event DT_PERR indicates a Duplicate Tag SRAM (DTAG) parity error. These DTAG SRAMs reside on CPU/Memory boards in Sun Enterprise 3x00/4x00/5x00/6x00 (Classic) servers. DTAGs are duplicates of the CPUs' ETAGs on the system board.
- DT_PERRA refers to DTAG SRAMs supporting CPU location 0.
- DT_PERRB refers to DTAG SRAMs supporting CPU location 1.
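The mapping in the two bullets above can be written as a small lookup table, which may be handy when annotating log data. The names used here are illustrative only.

```python
# CPU location supported by each DTAG group, as listed in the bullets above.
DTAG_FLAG_TO_CPU_LOCATION = {"DT_PERRA": 0, "DT_PERRB": 1}


def cpu_location_for_flag(flag):
    """Return the CPU location for an AC ESR DTAG flag, or None if unknown."""
    return DTAG_FLAG_TO_CPU_LOCATION.get(flag)


print(cpu_location_for_flag("DT_PERRB"))  # 1
```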
Notes about troubleshooting DTAG errors:
DTAG errors are usually caused by bit flips in DTAG SRAM, which is located on the system board. The same factors that cause bit flips in memory (alpha particles, handling, and environmental conditions) also cause bit flips in DTAG SRAM.
The CPUs and memory on a system board that receives a DTAG parity error are never the cause.
Repair vendor testing of system boards that received DTAG parity errors shows that more than 90% of the time, these errors are transient and never occur again.
For this reason, Oracle's Best Practices (originally "Sun's Best Practices") recommend the following when a DTAG parity error occurs:
- Power cycle the system with max diags to re-POST the hardware.
- The DTAG error event may have disabled the board.
- To bring it back online and test its sanity, the power cycle and max diag POST execution is recommended.
- If no error is detected in POST, monitor the system for repeat errors but do not replace the system board.
- If a second DTAG error occurs on the same system board and same DTAG group within 6 months, replace the system board that reported the errors.
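The repeat-error rule in the list above can be sketched as a small decision function. This is only an illustration: the function name and the event tuple layout are hypothetical, "6 months" is approximated as 183 days, and the sketch covers only the repeat rule, not the POST step itself.

```python
from datetime import date, timedelta

# Assumption: "6 months" is approximated as 183 days for this sketch.
REPEAT_WINDOW = timedelta(days=183)


def recommend_action(history, new_event):
    """Apply the repeat-error rule to a new DTAG parity error.

    history   -- list of earlier (event_date, board, dtag_group) tuples
    new_event -- the (event_date, board, dtag_group) tuple just observed
    """
    new_date, board, group = new_event
    for old_date, old_board, old_group in history:
        same_location = (old_board, old_group) == (board, group)
        within_window = timedelta(0) <= new_date - old_date <= REPEAT_WINDOW
        if same_location and within_window:
            return "replace system board %d" % board
    return "power cycle with max diags, then monitor"


# First error on board 2, DTAG group B: no replacement.
print(recommend_action([], (date(2001, 10, 17), 2, "B")))
# Second error on the same board and group a few months later: replace.
print(recommend_action([(date(2001, 6, 1), 2, "B")], (date(2001, 10, 17), 2, "B")))
```

An earlier error on a different board or a different DTAG group does not count toward the repeat rule in this sketch, which mirrors the "same system board and same DTAG group" wording above.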
From the example in the Goal section:
The DTAG parity error occurred on System Board 2. The DTAG memory that suffered a bit flip was associated with CPU location 1 (DT_PERRB). If this was the second occurrence of the same error on the same system board in the past 6 months, system board 2 should be replaced. If this was a first error, Best Practice dictates that the board should not be replaced.
Additional Information:
One leading cause of DTAG errors is environmental factors.
A good environmental resource is Document 1011650.1 Sun Enterprise[TM] 3X00-6X00 Servers: Board Temperature Information.
Previously Published As 40760