Asset ID: 1-75-1012214.1
Update Date: 2012-07-30
Solution Type: Troubleshooting Sure
Solution 1012214.1: Troubleshooting Red State Exception Memory Errors
Related Items
- Sun Fire V480 Server
- Sun Fire V880z Visualization Server
- Sun Fire V890 Server
- Sun Fire V880 Server
- Sun Fire V490 Server
Related Categories
- PLA-Support>Sun Systems>SPARC>Usx/Blade/Netra>SN-SPARC: USx
- .Old GCS Categories>Sun Microsystems>Servers>Entry-Level Servers
Previously Published As: 216842
Applies to:
Sun Fire V480 Server
Sun Fire V490 Server
Sun Fire V880 Server
Sun Fire V880z Visualization Server
Sun Fire V890 Server
All Platforms
Purpose
When you scan error messages in any of the message files, the failing DIMM(s) are usually printed out for you. When decoding a Red State Exception, you do not have this luxury.
To debug Red State Exceptions (with CE/UE errors), you must be able to manually decode the bad DIMM(s) from the AFSR and AFAR data given in the Red State Exception error output. This output is printed to the ttya port or the RSC console logs at the time of the error.
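As a first pass over a captured console log, it can help to list which CPUs actually have error bits set in the AFSR. The following is a minimal Python sketch, not an official tool. It assumes the log uses the "CPUn Config/Control/Status registers:" headers and "AFAR:"/"AFSR:" lines shown in the example in this document; the function names are hypothetical.

```python
import re

def parse_reg(text):
    """Convert a dotted register value such as '0010.0006.0000.015b' to an int."""
    return int(text.replace(".", ""), 16)

def cpus_with_afsr_errors(log_text):
    """Return {cpu_number: (afar, afsr)} for every CPU whose AFSR is non-zero."""
    results = {}
    cpu = None
    afar = None
    for line in log_text.splitlines():
        header = re.match(r"\s*CPU(\d+) Config/Control/Status registers:", line)
        if header:
            cpu = int(header.group(1))
            afar = None
            continue
        # Matches 'AFAR:' and 'AFSR:' only; 'AFAR2:'/'AFSR2:' lines are skipped.
        reg = re.match(r"\s*(AFAR|AFSR):\s*([0-9a-fA-F.]+)", line)
        if reg and cpu is not None:
            value = parse_reg(reg.group(2))
            if reg.group(1) == "AFAR":
                afar = value
            elif value != 0:
                results[cpu] = (afar, value)
    return results

# Abbreviated sample taken from the register lines in this document.
sample = """CPU0 Config/Control/Status registers:
AFAR: 0000.0000.0000.0000
AFSR: 0000.0000.0000.0000 (no errors set)
CPU3 Config/Control/Status registers:
AFAR: 0000.00b0.ece1.0450
AFSR: 0010.0006.0000.015b PRIV UE CE
"""

for cpu, (afar, afsr) in cpus_with_afsr_errors(sample).items():
    print(f"CPU{cpu}: AFAR={afar:016x} AFSR={afsr:016x}")
```

The sketch only locates the registers of interest; decoding them to a DIMM still follows the manual procedure described in this document.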
Last Review Date
September 19, 2011
Instructions for the Reader
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.
Troubleshooting Details
Red State Exceptions (RSE):
Red State Exceptions are most often caused by hardware problems. In some isolated cases, software can cause a Red State Exception.
Note: Before proceeding, see <Document 1011888.1> How to set up and disable the RSC console on Sun Fire[TM] 280R, V480, V490, V880, V890 and V880z servers, because a full "Red State Exception" output is needed. That document explains how to capture this output.
Two common types of RSEs:
- Trap Level (TL)=5 with AFSR Error Bit(s) Set
- Trap Level (TL)=5 with Trap Type (TT) Code
This troubleshooting document focuses on the first type (Trap Level (TL)=5 with AFSR Error Bit(s) Set), since this is where you will see CE/UE errors.
Trap Level (TL)=5 with AFSR Error Bit(s) Set:
ERROR:
CPU3 RED State Exception **CPU3 called a Red State Exception; further investigation is needed**
System State (CPU3 reporting) **CPU3 is JUST reporting the error. Any CPU in the system can report the error; this does not mean CPU3 is the problem**
CPU0 Config/Control/Status registers:
CPUVersion: 003e.0014.5400.0507
SafConfig: 0caa.01bc.0000.8002
SafBaseAdr: 0000.0400.0000.0000
DCacheCtl: 0000.0200.0000.0000
ECacheCtl: 0000.0000.0009.4400
ECErrEnable: 0000.0000.0000.000b
AFAR: 0000.0000.0000.0000
AFSR: 0000.0000.0000.0000 (no errors set)
Important:
The Red State example is from a 750MHz CPU. Output from 900MHz CPUs (and beyond) also includes AFAR2/AFSR2 register lines below the AFAR/AFSR register lines; these represent the first error captured. The AFAR/AFSR always represents the most recent error that occurred on the system.
DMMU SFAR: 0000.0000.fff7.8ec8
DMMU SFSR: 0000.0000.0080.8008 TM PR
IMMU SFSR: 0000.0000.0000.0000 (no status set)
CPU0 Trap registers: Trap Level = 1
*TL=1 TT: 0000.0000.0000.0003
TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
TPC: 0000.0000.f004.9700
TnPC: 0000.0000.f004.9704
TL=2 TT: 0000.0000.0000.0068
TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
TPC: 0000.0000.f004.4b68
TnPC: 0000.0000.f004.4b6c
TL=3 TT: 0000.0000.0000.0000
TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
TPC: 0000.0000.3333.3330
TnPC: 0000.0000.3333.3330
TL=4 TT: 0000.0000.0000.0000
TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
TPC: 0000.0000.4444.4444
TnPC: 0000.0000.4444.4444
TL=5 TT: 0000.0000.0000.0000
TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
TPC: 0000.0000.5555.5554
TnPC: 0000.0000.5555.5554
CPU0 General registers:
%PIL: 15
%PC: 0000.0000.f004.9700
%nPC: 0000.0000.f004.9704
%PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
%CCR: 0000.0000.0000.0099 XCC:NC ICC:NC
%FPRS: 0000.0000.0000.0005 FEF DL
%v0: 0000.0000.0000.0000 %v1: 0000.0000.0000.004a %v2: 0000.0000.0000.0000
%v3: 0000.0000.fff7.8000 %v4: 0000.0000.0000.0ef8 %v5: 0caa.01bc.0000.8002
%v6: 0000.0000.0000.007f %v7: 0000.0000.0000.0680
.
. text deleted
.
%i0: 0000.0000.f000.00e0 %i1: 0000.0000.0000.0005 %i2: 0000.0000.0000.0004
%i3: 0000.0000.f000.00e0 %i4: 0000.0000.0000.001f %i5: 0000.0000.0000.0000
%i6: f000.0000.0001.c981 %i7: 0000.0000.f000.d680
CPU1 Config/Control/Status registers:
CPUVersion: 003e.0014.5400.0507
SafConfig: 0caa.01bc.0002.8002
SafBaseAdr: 0000.0400.0080.0000
DCacheCtl: 0000.0200.0000.0000
ECacheCtl: 0000.0000.0009.4400
ECErrEnable: 0000.0000.0000.000b
AFAR: 0000.0000.0000.0000
AFSR: 0000.0000.0000.0000 (no errors set)
DMMU SFAR: 0000.0000.fff7.8ec8
DMMU SFSR: 0000.0000.0080.8008 TM PR
IMMU SFSR: 0000.0000.0000.0000 (no status set)
CPU1 Trap registers: Trap Level = 1
*TL=1 TT: 0000.0000.0000.0003
TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
TPC: 0000.0000.f004.9700
TnPC: 0000.0000.f004.9704
TL=2 TT: 0000.0000.0000.0068
TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
TPC: 0000.0000.f004.4b68
TnPC: 0000.0000.f004.4b6c
TL=3 TT: 0000.0000.0000.0000
TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
TPC: 0000.0001.3333.3330
TnPC: 0000.0001.3333.3330
TL=4 TT: 0000.0000.0000.0000
TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
TPC: 0000.0001.4444.4444
TnPC: 0000.0001.4444.4444
TL=5 TT: 0000.0000.0000.0000
TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
TPC: 0000.0001.5555.5554
TnPC: 0000.0001.5555.5554
CPU1 General registers:
%PIL: 15
%PC: 0000.0000.f004.9700
%nPC: 0000.0000.f004.9704
%PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
%CCR: 0000.0000.0000.0099 XCC:NC ICC:NC
%FPRS: 0000.0000.0000.0005 FEF DL
%v0: 0000.0000.0000.0000 %v1: 0000.0000.0000.004a %v2: 0000.0000.0000.0000
%v3: 0000.0000.fff7.8000 %v4: 0000.0000.0000.0ef8 %v5: 0caa.01bc.0002.8002
%v6: 0000.0000.0000.007f %v7: 0000.0000.0000.0680
.
. text deleted
.
%i0: 0000.0000.f000.00e0 %i1: 0000.0000.0000.0005 %i2: 0000.0000.0000.0004
%i3: 0000.0000.f000.00e0 %i4: 0000.0000.0000.001f %i5: 0000.0001.0000.0000
%i6: f000.0000.0001.c981 %i7: 0000.0000.f000.d680
CPU2 Config/Control/Status registers:
CPUVersion: 003e.0014.5400.0507
SafConfig: 1534.01bc.0004.8002
SafBaseAdr: 0000.0400.0100.0000
DCacheCtl: 0000.0200.0000.0000
ECacheCtl: 0000.0000.0009.4400
ECErrEnable: 0000.0000.0000.000b
AFAR: 0000.0000.0000.0000
AFSR: 0000.0000.0000.0000 (no errors set)
DMMU SFAR: 0000.0000.fff7.8ec8
DMMU SFSR: 0000.0000.0080.8008 TM PR
IMMU SFSR: 0000.0000.0000.0000 (no status set)
CPU2 Trap registers: Trap Level = 1
*TL=1 TT: 0000.0000.0000.0003
TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
TPC: 0000.0000.f004.9700
TnPC: 0000.0000.f004.9704
TL=2 TT: 0000.0000.0000.0068
TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
TPC: 0000.0000.f004.4b68
TnPC: 0000.0000.f004.4b6c
TL=3 TT: 0000.0000.0000.0000
TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
TPC: 0000.0002.3333.3330
TnPC: 0000.0002.3333.3330
TL=4 TT: 0000.0000.0000.0000
TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
TPC: 0000.0002.4444.4444
TnPC: 0000.0002.4444.4444
TL=5 TT: 0000.0000.0000.0000
TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
TPC: 0000.0002.5555.5554
TnPC: 0000.0002.5555.5554
CPU2 General registers:
%PIL: 15
%PC: 0000.0000.f004.9700
%nPC: 0000.0000.f004.9704
%PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
%CCR: 0000.0000.0000.0099 XCC:NC ICC:NC
%FPRS: 0000.0000.0000.0005 FEF DL
%v0: 0000.0000.0000.0000 %v1: 0000.0000.0000.004a %v2: 0000.0000.0000.0000
%v3: 0000.0000.fff7.8000 %v4: 0000.0000.0000.0ef8 %v5: 1534.01bc.0004.8002
%v6: 0000.0000.0000.007f %v7: 0000.0000.0000.0680
.
. text deleted
.
%i0: 0000.0000.f000.00e0 %i1: 0000.0000.0000.0005 %i2: 0000.0000.0000.0004
%i3: 0000.0000.f000.00e0 %i4: 0000.0000.0000.001f %i5: 0000.0002.0000.0000
%i6: f000.0000.0002.b381 %i7: 0000.0000.f000.d680
CPU3 Config/Control/Status registers:
CPUVersion: 003e.0014.5400.0507
SafConfig: 1534.01bc.0006.8002
SafBaseAdr: 0000.0400.0180.0000
DCacheCtl: 0000.0000.0000.0000
ECacheCtl: 0000.0000.0009.4400
ECErrEnable: 0000.0000.0000.000b
AFAR: 0000.00b0.ece1.0450
AFSR: 0010.0006.0000.015b PRIV UE CE
'UE' and 'CE' tell you that 'Uncorrectable' and 'Correctable' memory errors occurred and caused this 'Red State Exception'.
'b0' in the AFAR tells you the error occurred on the CPU/Memory board in Slot 'B'. See 'Step #3 (Calculating the Physical Memory bank location)' in <Document 1359373.1> V480/V880 Manual Decoding of DIMM(s) in Memory Error for how to calculate this.
'15b' (Bits 8-0 of the AFSR) is the ECC Syndrome. In this example it is M2 (probable double-bit error within a nibble). See 'Step #1 (Find bit(s) in error using ECC Syndromes)' in <Document 1359373.1> V480/V880 Manual Decoding of DIMM(s) in Memory Error for how to calculate this.
'450' (Bits 9-6 of the AFAR) tells you which logical bank you are using. In this example it is 'CPU3 Bank0 (Bank0 located on CPU/Memory board in Slot 'B')'. See 'Step #3 (Calculating the Physical Memory bank location)' in <Document 1359373.1> V480/V880 Manual Decoding of DIMM(s) in Memory Error for how to calculate this.
The failing DIMMs in 'CPU3 Bank0' are J7900, J7901, J8001, and J8000. See 'Step #4 (Finding the 4 DIMMs (Jxxxx's) Related to this Physical Bank)' in <Document 1359373.1> V480/V880 Manual Decoding of DIMM(s) in Memory Error for how to find them.
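The bit fields named above can be pulled out of the raw register values mechanically. The following Python sketch is illustrative only; the helper names are hypothetical. It extracts the two fields from the CPU3 AFSR and AFAR values in this example. Mapping the syndrome to a bit in error and the AFAR bits to a physical bank and DIMMs still requires the tables in <Document 1359373.1>, which are not reproduced here.

```python
def parse_reg(text):
    """Convert a dotted register value such as '0010.0006.0000.015b' to an int."""
    return int(text.replace(".", ""), 16)

def bit_field(value, high, low):
    """Return bits high..low (inclusive) of value as an integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)

# CPU3 register values from the example output in this document.
afsr = parse_reg("0010.0006.0000.015b")
afar = parse_reg("0000.00b0.ece1.0450")

syndrome = bit_field(afsr, 8, 0)   # ECC syndrome, AFSR bits 8-0
bank_bits = bit_field(afar, 9, 6)  # logical bank selector, AFAR bits 9-6

print(f"ECC syndrome (AFSR bits 8-0): 0x{syndrome:03x}")  # prints 0x15b
print(f"AFAR bits 9-6: 0b{bank_bits:04b}")                # prints 0b0001
```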
Resolution:
All DIMMs in the above bank need to be changed. A multibit error cannot be narrowed down to a single DIMM, because the multiple bits in error could be on multiple DIMMs in the faulty memory bank.
Keep in mind that a DIMM showing CE errors early on in the Explorer message files (0,1,2,3,...) could very likely be the bad DIMM if a UE crashes the system and that DIMM is in the bank of DIMMs included in the error message. A DIMM that repeatedly causes CEs is more likely to hit a double-bit error or UE. Always review the message files starting at the oldest and working your way to the current one to build the DIMM history.
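To review the message files oldest first, they can be ordered by their numeric suffix. The following Python sketch assumes the usual rotation naming, where "messages" is the current file and a higher suffix (messages.0, messages.1, ...) is an older file; verify this against the timestamps in your own Explorer output. The function name is hypothetical.

```python
import re

def oldest_first(filenames):
    """Sort rotated message files so the oldest (highest suffix) comes first."""
    def age(name):
        match = re.fullmatch(r"messages(?:\.(\d+))?", name)
        if match is None:
            raise ValueError(f"not a messages file: {name}")
        # The unsuffixed file is the current one, so it sorts last.
        return int(match.group(1)) if match.group(1) is not None else -1
    return sorted(filenames, key=age, reverse=True)

order = oldest_first(["messages", "messages.1", "messages.0", "messages.3", "messages.2"])
print(order)
# ['messages.3', 'messages.2', 'messages.1', 'messages.0', 'messages']
```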
DMMU SFAR: 0000.0000.fff5.2000
DMMU SFSR: 0000.0000.0004.8028 TM CT1 PR
IMMU SFSR: 0000.0000.0080.8008 TM PR
CPU3 Trap registers: Trap Level = 5
CPU3 is in question since it went to Trap Level 5 (Red State Level)
TL=1 TT: 0000.0000.0000.0063 (Corrected ECC Error)
TSTATE: 0000.0099.8000.1603 XCC:NC ICC:NC MM=TSO PEF PRIV IE
TPC: 0000.0000.0102.96f0
TnPC: 0000.0000.0102.96dc
TL=2 TT: 0000.0000.0000.0068 (Fast Data Access MMU miss)
TSTATE: 0000.0099.8000.1503 XCC:NC ICC:NC MM=TSO PEF PRIV AG
TPC: 0000.0000.f004.2c24
TnPC: 0000.0000.f004.2c28
TL=3 TT: 0000.0000.0000.0032 (Data Access Error)
TSTATE: 0000.0088.5804.1403 XCC:N ICC:N MM=TSO PEF PRIV
TPC: 0000.0000.f004.4c64
TnPC: 0000.0000.f004.4c68
TL=4 TT: 0000.0000.0000.0010 (Illegal Instruction)
TSTATE: 0000.0088.5800.1503 XCC:N ICC:N MM=TSO PEF PRIV AG
TPC: 0000.0000.f000.4640
TnPC: 0000.0000.f000.4644
*TL=5 TT: 0000.0000.0000.0010 (Illegal Instruction)
TSTATE: 0000.0088.5800.1503 XCC:N ICC:N MM=TSO PEF PRIV AG
TPC: 0000.0000.f000.4200
TnPC: 0000.0000.f000.4204
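The sequence of traps that led CPU3 to Trap Level 5 can be listed from the "TL=n TT:" lines. The following Python sketch is illustrative only: the lookup table contains just the four trap types annotated in this example, any other code is reported as "unknown", and the names used are hypothetical.

```python
import re

# Trap type names taken only from the annotations in this example output.
TT_NAMES = {
    0x63: "Corrected ECC Error",
    0x68: "Fast Data Access MMU miss",
    0x32: "Data Access Error",
    0x10: "Illegal Instruction",
}

def trap_history(lines):
    """Return a list of (trap_level, trap_type, name) from 'TL=n TT:' lines."""
    history = []
    for line in lines:
        match = re.match(r"\s*\*?TL=(\d)\s+TT:\s*([0-9a-fA-F.]+)", line)
        if match:
            tt = int(match.group(2).replace(".", ""), 16)
            history.append((int(match.group(1)), tt, TT_NAMES.get(tt, "unknown")))
    return history

cpu3_lines = [
    "TL=1 TT: 0000.0000.0000.0063 (Corrected ECC Error)",
    "TL=2 TT: 0000.0000.0000.0068 (Fast Data Access MMU miss)",
    "TL=3 TT: 0000.0000.0000.0032 (Data Access Error)",
    "TL=4 TT: 0000.0000.0000.0010 (Illegal Instruction)",
    "*TL=5 TT: 0000.0000.0000.0010 (Illegal Instruction)",
]

for level, tt, name in trap_history(cpu3_lines):
    print(f"TL={level} TT=0x{tt:03x} {name}")
```

Reading the levels in order shows the corrected ECC error at TL=1 as the first trap in the chain, which is consistent with the memory diagnosis above.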
CPU3 General registers:
%PIL: 13
%PC: 0000.0000.f000.4200
%nPC: 0000.0000.f000.4204
%PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
%CCR: 0000.0000.0000.0091 XCC:NC ICC:C
%FPRS: 0000.0000.0000.0000
%v0: 0000.0000.0000.0000 %v1: 0000.0000.0000.0000 %v2: 0000.0000.0000.0000
%v3: 0000.0000.0000.0000 %v4: ffff.ffff.0000.0000 %v5: 0000.0000.0000.0000
%v6: 0000.0000.0000.0000 %v7: 00ca.02a8.0840.0005
.
. text deleted
.
%i0: 0000.0000.0008.0000 %i1: 0000.0000.0508.0000 %i2: 0000.0000.0000.0000
%i3: 0000.0700.0536.0c20 %i4: 0000.0700.0536.0c30 %i5: 0000.0000.0007.3c00
%i6: 0000.0000.0140.8fe1 %i7: 0000.0000.0102.96b8
IO-Bridge 8 at 0000.0400.0400.0000
Device ID fc00.0000.0011.a954
Ctl/Stat 0255.5554.0080.7e02
Error Ctl fc00.0000.0000.03e0
Int Ctl 8000.0000.0000.0017
Error Log 0000.0000.0000.0000
ECC Ctl e000.0000.0000.0000
EStar Ctl 0000.0000.0000.0001
Queue Ctl 0000.0000.0000.0000
Address Match Address Mask
PCIA Mem 8000.07fd.0000.0000 0000.07ff.0000.0000
PCIA C/IO 8000.07ff.ec00.0000 0000.07ff.fe00.0000
PCIB Mem 8000.07fe.0000.0000 0000.07ff.0000.0000
PCIB C/IO 8000.07ff.ee00.0000 0000.07ff.fe00.0000
AFAR AFSR
UE 0000.0000.0000.0000 0000.0000.0000.0000
CE 0000.0100.0000.0000 0000.0000.0000.0000
PCI A 0000.0000.0000.0000 0000.0000.0000.0000
PCI B 0000.0000.0000.0000 0000.0000.0000.0000
Control/Status Idle Check Diag Diagnostic
PCI A 0000.0002.010e.003f 0000.0000.0000.8000 0000.0000.0000.0000
PCI B 0000.0000.010e.003f 0000.0000.0000.8000 0000.0000.0000.0000
IO-Bridge 9 at 0000.0400.0480.0000
Device ID fc00.0000.0013.a954
Ctl/Stat 0255.59a8.0090.7e02
Error Ctl fc00.0000.0000.03e0
Int Ctl 8000.0000.0000.0017
Error Log 0000.0000.0000.0000
ECC Ctl e000.0000.0000.0000
EStar Ctl 0000.0000.0000.0001
Queue Ctl 0000.0000.0000.0000
Address Match Address Mask
PCIA Mem 8000.07fb.0000.0000 0000.07ff.0000.0000
PCIA C/IO 8000.07ff.e800.0000 0000.07ff.fe00.0000
PCIB Mem 8000.07fc.0000.0000 0000.07ff.0000.0000
PCIB C/IO 8000.07ff.ea00.0000 0000.07ff.fe00.0000
AFAR AFSR
UE 0000.0000.0000.0000 0000.0000.0000.0000
CE 0000.0000.0000.0000 0000.0000.0000.0000
PCI A 0000.0000.0000.0000 0000.0000.0000.0000
PCI B 0000.0000.0000.0000 0000.0000.0000.0000
Control/Status Idle Check Diag Diagnostic
PCI A 0000.0002.010e.003f 0000.0000.0000.8000 0000.0000.0000.0000
PCI B 0000.0000.010e.003f 0000.0000.0000.8000 0000.0000.0000.0000
Note: This document can also assist you in troubleshooting FATAL Reset errors that report UEs. Simply substitute FATAL Reset for Red State Exception in the above procedure.
Field and support engineers often use the Red State Exception decoder located at URL:
http://panacea.uk.oracle.com/twiki/bin/view/Tools/ToolDecodeRedStateDecoder
Note that the Red State Exception Decoder Tool is only a reference, and is only as good as the algorithm it uses to decode Red State Exception outputs. That algorithm is flawed: the decoder does not analyze all the errors together as one. It simply breaks out each individual error and tells you which CPU is suspect for replacement. THE RED STATE DECODER CANNOT TELL YOU IF THE RED STATE BEING REVIEWED HAS A MEMORY DIMM PROBLEM. For example, if the Red State Exception output above were put into the decoder, it would state that CPU3 is suspect and should be replaced, even though it has been shown above that the problem is not CPU related but is in fact a failing bank of DIMMs.
Attachments
This solution has no attachment