Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1012214.1
Update Date: 2012-07-30
Keywords:

Solution Type  Troubleshooting Sure

Solution  1012214.1 :   Troubleshooting Red State Exception Memory Errors  


Related Items
  • Sun Fire V480 Server
  • Sun Fire V880z Visualization Server
  • Sun Fire V890 Server
  • Sun Fire V880 Server
  • Sun Fire V490 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Usx/Blade/Netra>SN-SPARC: USx
  • .Old GCS Categories>Sun Microsystems>Servers>Entry-Level Servers

PreviouslyPublishedAs
216842


Applies to:

Sun Fire V480 Server
Sun Fire V490 Server
Sun Fire V880 Server
Sun Fire V880z Visualization Server
Sun Fire V890 Server
All Platforms

Purpose

When you scan error messages in any of the message files, the failing DIMM(s) are usually printed out for you. When decoding a Red State Exception, you do not have this luxury.

To debug a Red State Exception that involves CE/UE errors, you must manually decode the bad DIMM(s) from the AFSR and AFAR data given in the Red State Exception error output. This output is printed to the ttya port or to the RSC console logs at the time of the error.

Last Review Date

September 19, 2011

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Red State Exceptions (RSE):

Red State Exceptions are most often caused by hardware problems. In some isolated cases, software can cause a Red State Exception.

Note: Review <Document 1011888.1> How to set up and disable the RSC console on Sun Fire[TM] 280R, V480, V490, V880, V890 and V880z servers before proceeding, because the full "Red State Exception" output is needed. That document explains how to capture this output.

Two common types of RSEs:
  • Trap Level (TL)=5 with AFSR Error Bit(s) Set
  • Trap Level (TL)=5 with Trap Type (TT) Code

This troubleshooting document focuses on the first type (Trap Level (TL)=5 with AFSR Error Bit(s) Set), since this is where you will see CE/UE errors.

Trap Level (TL)=5 with AFSR Error Bit(s) Set:


ERROR:

CPU3 RED State Exception  **CPU3 called a Red State Exception; further investigation is needed**


System State (CPU3 reporting)  **CPU3 is JUST reporting the error. Any CPU in the system can report the error; this does not mean CPU3 is the problem**


CPU0 Config/Control/Status registers:

   CPUVersion: 003e.0014.5400.0507
   SafConfig: 0caa.01bc.0000.8002
   SafBaseAdr: 0000.0400.0000.0000
   DCacheCtl: 0000.0200.0000.0000
   ECacheCtl: 0000.0000.0009.4400
   ECErrEnable: 0000.0000.0000.000b

  AFAR: 0000.0000.0000.0000
  AFSR: 0000.0000.0000.0000  (no errors set)



Important:
This Red State example is from a 750MHz CPU. CPUs of 900MHz and beyond also print AFAR2/AFSR2 register lines below the AFAR/AFSR register lines; AFAR2/AFSR2 represent the first error captured. The AFAR/AFSR always represent the most recent error that occurred on the system.



  DMMU SFAR: 0000.0000.fff7.8ec8
  DMMU SFSR: 0000.0000.0080.8008 TM PR
  IMMU SFSR: 0000.0000.0000.0000 (no status set)

CPU0 Trap registers:   Trap Level = 1

*TL=1 TT: 0000.0000.0000.0003
   TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
   TPC: 0000.0000.f004.9700
   TnPC: 0000.0000.f004.9704
TL=2 TT: 0000.0000.0000.0068
   TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
   TPC: 0000.0000.f004.4b68
   TnPC: 0000.0000.f004.4b6c
TL=3 TT: 0000.0000.0000.0000
   TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
   TPC: 0000.0000.3333.3330
   TnPC: 0000.0000.3333.3330
TL=4 TT: 0000.0000.0000.0000
   TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
   TPC: 0000.0000.4444.4444
   TnPC: 0000.0000.4444.4444
TL=5 TT: 0000.0000.0000.0000
   TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
   TPC: 0000.0000.5555.5554
   TnPC: 0000.0000.5555.5554


CPU0 General registers:

      %PIL: 15
      %PC: 0000.0000.f004.9700
      %nPC: 0000.0000.f004.9704
      %PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
      %CCR: 0000.0000.0000.0099 XCC:NC ICC:NC
      %FPRS: 0000.0000.0000.0005 FEF DL

%v0: 0000.0000.0000.0000 %v1: 0000.0000.0000.004a %v2: 0000.0000.0000.0000
%v3: 0000.0000.fff7.8000 %v4: 0000.0000.0000.0ef8 %v5: 0caa.01bc.0000.8002
%v6: 0000.0000.0000.007f %v7: 0000.0000.0000.0680

.
.  text deleted
.

%i0: 0000.0000.f000.00e0 %i1: 0000.0000.0000.0005 %i2: 0000.0000.0000.0004
%i3: 0000.0000.f000.00e0 %i4: 0000.0000.0000.001f %i5: 0000.0000.0000.0000
%i6: f000.0000.0001.c981 %i7: 0000.0000.f000.d680

CPU1 Config/Control/Status registers:

    CPUVersion: 003e.0014.5400.0507
    SafConfig: 0caa.01bc.0002.8002
    SafBaseAdr: 0000.0400.0080.0000
    DCacheCtl: 0000.0200.0000.0000
    ECacheCtl: 0000.0000.0009.4400
    ECErrEnable: 0000.0000.0000.000b

    AFAR: 0000.0000.0000.0000
    AFSR: 0000.0000.0000.0000 (no errors set)


    DMMU SFAR: 0000.0000.fff7.8ec8
    DMMU SFSR: 0000.0000.0080.8008 TM PR
    IMMU SFSR: 0000.0000.0000.0000 (no status set)

CPU1 Trap registers:  Trap Level = 1


*TL=1 TT: 0000.0000.0000.0003
    TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
    TPC: 0000.0000.f004.9700
    TnPC: 0000.0000.f004.9704
TL=2 TT: 0000.0000.0000.0068
   TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
   TPC: 0000.0000.f004.4b68
   TnPC: 0000.0000.f004.4b6c
TL=3 TT: 0000.0000.0000.0000
   TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
   TPC: 0000.0001.3333.3330
   TnPC: 0000.0001.3333.3330
TL=4 TT: 0000.0000.0000.0000
   TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
   TPC: 0000.0001.4444.4444
   TnPC: 0000.0001.4444.4444
TL=5 TT: 0000.0000.0000.0000
   TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
   TPC: 0000.0001.5555.5554
   TnPC: 0000.0001.5555.5554

CPU1 General registers:

      %PIL: 15
      %PC: 0000.0000.f004.9700
      %nPC: 0000.0000.f004.9704
      %PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
      %CCR: 0000.0000.0000.0099 XCC:NC ICC:NC
      %FPRS: 0000.0000.0000.0005 FEF DL

%v0: 0000.0000.0000.0000 %v1: 0000.0000.0000.004a %v2: 0000.0000.0000.0000
%v3: 0000.0000.fff7.8000 %v4: 0000.0000.0000.0ef8 %v5: 0caa.01bc.0002.8002
%v6: 0000.0000.0000.007f %v7: 0000.0000.0000.0680

.
. text deleted
.

%i0: 0000.0000.f000.00e0 %i1: 0000.0000.0000.0005 %i2: 0000.0000.0000.0004
%i3: 0000.0000.f000.00e0 %i4: 0000.0000.0000.001f %i5: 0000.0001.0000.0000
%i6: f000.0000.0001.c981 %i7: 0000.0000.f000.d680

CPU2 Config/Control/Status registers:

    CPUVersion: 003e.0014.5400.0507
    SafConfig: 1534.01bc.0004.8002
    SafBaseAdr: 0000.0400.0100.0000
    DCacheCtl: 0000.0200.0000.0000
    ECacheCtl: 0000.0000.0009.4400
    ECErrEnable: 0000.0000.0000.000b

    AFAR: 0000.0000.0000.0000
    AFSR: 0000.0000.0000.0000  (no errors set)

    DMMU SFAR: 0000.0000.fff7.8ec8
    DMMU SFSR: 0000.0000.0080.8008 TM PR
    IMMU SFSR: 0000.0000.0000.0000 (no status set)

CPU2 Trap registers: Trap Level = 1

*TL=1 TT: 0000.0000.0000.0003
    TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
    TPC: 0000.0000.f004.9700
    TnPC: 0000.0000.f004.9704
TL=2 TT: 0000.0000.0000.0068
    TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
    TPC: 0000.0000.f004.4b68
    TnPC: 0000.0000.f004.4b6c
TL=3 TT: 0000.0000.0000.0000
    TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
    TPC: 0000.0002.3333.3330
    TnPC: 0000.0002.3333.3330
TL=4 TT: 0000.0000.0000.0000
    TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
    TPC: 0000.0002.4444.4444
    TnPC: 0000.0002.4444.4444
TL=5 TT: 0000.0000.0000.0000
    TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
    TPC: 0000.0002.5555.5554
    TnPC: 0000.0002.5555.5554

CPU2 General registers:

      %PIL: 15
      %PC: 0000.0000.f004.9700
      %nPC: 0000.0000.f004.9704
      %PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
      %CCR: 0000.0000.0000.0099 XCC:NC ICC:NC
      %FPRS: 0000.0000.0000.0005 FEF DL

%v0: 0000.0000.0000.0000 %v1: 0000.0000.0000.004a %v2: 0000.0000.0000.0000
%v3: 0000.0000.fff7.8000 %v4: 0000.0000.0000.0ef8 %v5: 1534.01bc.0004.8002
%v6: 0000.0000.0000.007f %v7: 0000.0000.0000.0680

.
. text deleted
.

%i0: 0000.0000.f000.00e0 %i1: 0000.0000.0000.0005 %i2: 0000.0000.0000.0004
%i3: 0000.0000.f000.00e0 %i4: 0000.0000.0000.001f %i5: 0000.0002.0000.0000
%i6: f000.0000.0002.b381 %i7: 0000.0000.f000.d680

CPU3 Config/Control/Status registers:

    CPUVersion:  003e.0014.5400.0507
    SafConfig:   1534.01bc.0006.8002 
    SafBaseAdr:  0000.0400.0180.0000
    DCacheCtl:   0000.0000.0000.0000 
    ECacheCtl:   0000.0000.0009.4400
    ECErrEnable: 0000.0000.0000.000b

    AFAR: 0000.00b0.ece1.0450
    AFSR: 0010.0006.0000.015b  PRIV UE CE



'UE' and 'CE' tell you that 'Uncorrectable' and 'Correctable' memory errors occurred and caused this 'Red State Exception'.

'b0' in the AFAR tells you the error occurred on the CPU/Memory board in Slot 'B'. See 'Step #3 (Calculating the Physical Memory bank location)' in <Document 1359373.1> V480/V880 Manual Decoding of DIMM(s) in Memory Error for how to calculate this.

'15b' (Bits 8-0 of the AFSR) tells you the ECC Syndrome. In this example it is M2, a probable double-bit error within a nibble. See 'Step #1 (Find bit(s) in error using ECC Syndromes)' in <Document 1359373.1> V480/V880 Manual Decoding of DIMM(s) in Memory Error for how to calculate this.
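
The syndrome and error flags can be pulled out of the AFSR value mechanically. The following is a minimal Python sketch, not part of the original procedure. The syndrome mask (bits 8-0) comes from the text above; the flag bit positions (52 for PRIV, 34 for UE, 33 for CE) are assumptions chosen because they reproduce the "PRIV UE CE" decode printed in the sample output, and should be checked against the processor documentation. The names parse_reg and decode_afsr are hypothetical. Mapping the syndrome to a bit in error still requires the table in <Document 1359373.1>.

```python
# Assumed flag positions; they match the sample line
# "AFSR: 0010.0006.0000.015b  PRIV UE CE" but are not taken from this document.
AFSR_FLAGS = {52: "PRIV", 34: "UE", 33: "CE"}
SYNDROME_MASK = 0x1FF  # bits 8-0 of the AFSR, as stated in the text


def parse_reg(text):
    """Convert a dotted register value such as '0010.0006.0000.015b' to an int."""
    return int(text.replace(".", ""), 16)


def decode_afsr(text):
    """Return (list of flag names, ECC syndrome) for a dotted AFSR value."""
    value = parse_reg(text)
    flags = [name
             for bit, name in sorted(AFSR_FLAGS.items(), reverse=True)
             if (value >> bit) & 1]
    return flags, value & SYNDROME_MASK


flags, syndrome = decode_afsr("0010.0006.0000.015b")
print(flags, hex(syndrome))  # ['PRIV', 'UE', 'CE'] 0x15b
```

An all-zero AFSR, as shown for CPU0 through CPU2, yields an empty flag list and a syndrome of 0.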

'450' (which contains Bits 9-6 of the AFAR) tells you which logical bank you are using. In this example it is 'CPU3 Bank0' (Bank0 located on the CPU/Memory board in Slot 'B'). See 'Step #3 (Calculating the Physical Memory bank location)' in <Document 1359373.1> V480/V880 Manual Decoding of DIMM(s) in Memory Error for how to calculate this.
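
Extracting bits 9-6 from the AFAR is a simple shift and mask. The sketch below is illustrative only; parse_reg and bit_field are hypothetical helper names. It stops at the raw field value on purpose: translating that value into a logical bank is covered by Step #3 of <Document 1359373.1> and is not reproduced here.

```python
def parse_reg(text):
    """Convert a dotted register value such as '0000.00b0.ece1.0450' to an int."""
    return int(text.replace(".", ""), 16)


def bit_field(value, high, low):
    """Return bits high..low (inclusive) of value as an integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


afar = parse_reg("0000.00b0.ece1.0450")
bank_bits = bit_field(afar, 9, 6)
print(format(bank_bits, "04b"))  # 0001
```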

The failing DIMMs in 'CPU3 Bank0' are J7900, J7901, J8001, and J8000. See 'Step #4 (Finding the 4 DIMMs (Jxxxx's) Related to this Physical Bank)' in <Document 1359373.1> V480/V880 Manual Decoding of DIMM(s) in Memory Error for how to find them.

Resolution:

All DIMMs in the above bank need to be changed. A multibit error cannot be narrowed down to a single DIMM, because the multiple bits in error could be spread across multiple DIMMs in the faulty memory bank.


Keep in mind that a DIMM showing CE errors early on in the Explorer message files (0, 1, 2, 3, ...) could very likely be the bad DIMM if a UE later crashes the system and that DIMM is in the bank of DIMMs named in the error message. A DIMM that repeatedly causes CEs is more likely to hit a double-bit error or UE. Always review the message files for DIMM history, starting with the oldest file and working forward to the current one.
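
The oldest-to-newest review can be partly scripted. The sketch below rests on two assumptions that are not stated in this document: that the files are named messages, messages.0, messages.1, and so on, with a higher suffix meaning an older file, and that DIMM sockets appear in the logs as a 'J' followed by four digits (as with J7900 above). The function names are hypothetical, and the real message format should be confirmed before relying on the counts.

```python
import re
from collections import Counter


def age_key(name):
    """Sort key under the assumed naming: a larger suffix means an older file."""
    match = re.fullmatch(r"messages(?:\.(\d+))?", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    # The live file has no suffix and is treated as the newest.
    return -1 if match.group(1) is None else int(match.group(1))


def oldest_first(names):
    """Order message file names from oldest to newest."""
    return sorted(names, key=age_key, reverse=True)


def dimm_mentions(lines):
    """Count tokens that look like DIMM socket labels (J plus four digits)."""
    return Counter(re.findall(r"\bJ\d{4}\b", " ".join(lines)))


print(oldest_first(["messages", "messages.1", "messages.0", "messages.2"]))
# ['messages.2', 'messages.1', 'messages.0', 'messages']
```

Running dimm_mentions over each file in that order gives a rough per-file history of which sockets were named, which can then be compared with the bank decoded from the Red State Exception.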




     DMMU SFAR: 0000.0000.fff5.2000
     DMMU SFSR: 0000.0000.0004.8028 TM CT1 PR
     IMMU SFSR: 0000.0000.0080.8008 TM PR



CPU3 Trap registers:  Trap Level = 5


CPU3 is in question since it went to Trap Level 5 (Red State Level).


TL=1 TT: 0000.0000.0000.0063 (Corrected ECC Error)
   TSTATE: 0000.0099.8000.1603 XCC:NC ICC:NC MM=TSO PEF PRIV IE
   TPC: 0000.0000.0102.96f0
   TnPC: 0000.0000.0102.96dc
TL=2 TT: 0000.0000.0000.0068 (Fast Data Access MMU miss)
   TSTATE: 0000.0099.8000.1503 XCC:NC ICC:NC MM=TSO PEF PRIV AG
   TPC: 0000.0000.f004.2c24
   TnPC: 0000.0000.f004.2c28
TL=3 TT: 0000.0000.0000.0032 (Data Access Error)
   TSTATE: 0000.0088.5804.1403 XCC:N ICC:N MM=TSO PEF PRIV
   TPC: 0000.0000.f004.4c64
   TnPC: 0000.0000.f004.4c68
TL=4 TT: 0000.0000.0000.0010 (Illegal Instruction)
   TSTATE: 0000.0088.5800.1503 XCC:N ICC:N MM=TSO PEF PRIV AG
   TPC: 0000.0000.f000.4640
   TnPC: 0000.0000.f000.4644
*TL=5 TT: 0000.0000.0000.0010 (Illegal Instruction)
   TSTATE: 0000.0088.5800.1503 XCC:N ICC:N MM=TSO PEF PRIV AG
   TPC: 0000.0000.f000.4200
   TnPC: 0000.0000.f000.4204

CPU3 General registers:

     %PIL: 13
     %PC: 0000.0000.f000.4200
     %nPC: 0000.0000.f000.4204
     %PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
     %CCR: 0000.0000.0000.0091 XCC:NC ICC:C
     %FPRS: 0000.0000.0000.0000

%v0: 0000.0000.0000.0000 %v1: 0000.0000.0000.0000 %v2: 0000.0000.0000.0000
%v3: 0000.0000.0000.0000 %v4: ffff.ffff.0000.0000 %v5: 0000.0000.0000.0000
%v6: 0000.0000.0000.0000 %v7: 00ca.02a8.0840.0005

.
. text deleted
.

%i0: 0000.0000.0008.0000 %i1: 0000.0000.0508.0000 %i2: 0000.0000.0000.0000
%i3: 0000.0700.0536.0c20 %i4: 0000.0700.0536.0c30 %i5: 0000.0000.0007.3c00
%i6: 0000.0000.0140.8fe1 %i7: 0000.0000.0102.96b8

IO-Bridge 8 at 0000.0400.0400.0000
   Device ID fc00.0000.0011.a954
   Ctl/Stat 0255.5554.0080.7e02
   Error Ctl fc00.0000.0000.03e0
   Int Ctl 8000.0000.0000.0017
   Error Log 0000.0000.0000.0000
   ECC Ctl e000.0000.0000.0000
   EStar Ctl 0000.0000.0000.0001
   Queue Ctl 0000.0000.0000.0000

           Address Match        Address Mask
PCIA Mem   8000.07fd.0000.0000  0000.07ff.0000.0000
PCIA C/IO  8000.07ff.ec00.0000  0000.07ff.fe00.0000
PCIB Mem   8000.07fe.0000.0000  0000.07ff.0000.0000
PCIB C/IO  8000.07ff.ee00.0000  0000.07ff.fe00.0000

           AFAR                 AFSR
UE         0000.0000.0000.0000  0000.0000.0000.0000
CE         0000.0100.0000.0000  0000.0000.0000.0000
PCI A      0000.0000.0000.0000  0000.0000.0000.0000
PCI B      0000.0000.0000.0000  0000.0000.0000.0000

                               Control/Status Idle Check Diag Diagnostic
PCI A 0000.0002.010e.003f 0000.0000.0000.8000 0000.0000.0000.0000
PCI B 0000.0000.010e.003f 0000.0000.0000.8000 0000.0000.0000.0000

IO-Bridge 9 at 0000.0400.0480.0000
   Device ID fc00.0000.0013.a954
   Ctl/Stat 0255.59a8.0090.7e02
   Error Ctl fc00.0000.0000.03e0
   Int Ctl 8000.0000.0000.0017
   Error Log 0000.0000.0000.0000
   ECC Ctl e000.0000.0000.0000
   EStar Ctl 0000.0000.0000.0001
   Queue Ctl 0000.0000.0000.0000

           Address Match        Address Mask
PCIA Mem   8000.07fb.0000.0000  0000.07ff.0000.0000
PCIA C/IO  8000.07ff.e800.0000  0000.07ff.fe00.0000
PCIB Mem   8000.07fc.0000.0000  0000.07ff.0000.0000
PCIB C/IO  8000.07ff.ea00.0000  0000.07ff.fe00.0000

           AFAR                 AFSR
UE         0000.0000.0000.0000  0000.0000.0000.0000
CE         0000.0000.0000.0000  0000.0000.0000.0000
PCI A      0000.0000.0000.0000  0000.0000.0000.0000
PCI B      0000.0000.0000.0000  0000.0000.0000.0000

                        Control/Status Idle Check Diag Diagnostic
PCI A 0000.0002.010e.003f 0000.0000.0000.8000 0000.0000.0000.0000
PCI B 0000.0000.010e.003f 0000.0000.0000.8000 0000.0000.0000.0000


Note: This document can also assist you in troubleshooting FATAL Reset errors that report UEs. Simply substitute FATAL Reset for Red State Exception in the above procedure.

Field and support engineers often use the Red State Exception decoder located at:

http://panacea.uk.oracle.com/twiki/bin/view/Tools/ToolDecodeRedStateDecoder

It must be noted that the Red State Exception Decoder Tool is only a reference, and is only as good as the algorithm it uses to decode Red State Exception output. That algorithm is itself flawed: the decoder does not evaluate all of the errors together as one; it simply breaks out each individual error and tells you which CPU is suspect for replacement. THE RED STATE DECODER CANNOT TELL YOU IF THE RED STATE BEING REVIEWED HAS A MEMORY DIMM PROBLEM. For example, if the Red State Exception output above were put into the decoder, it would state that CPU3 is suspect and should be replaced, even though it has been shown above that the problem is not CPU related but is in fact a failing bank of DIMMs.
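
Before decoding by hand, it can help to list which CPUs in a captured dump have AFSR bits set and what trap level each reached. The sketch below is a first-pass triage aid only and does not replace the manual procedure above or identify DIMMs. It assumes the capture uses the same line layout as the example in this document; the regular expressions and the name summarize are hypothetical.

```python
import re

CPU_HEADER = re.compile(r"^CPU(\d+) Config/Control/Status registers:")
TRAP_HEADER = re.compile(r"^CPU(\d+) Trap registers:\s*Trap Level = (\d+)")
AFSR_LINE = re.compile(r"^\s*AFSR:\s*([0-9a-f.]+)\s*(.*)$")


def summarize(lines):
    """Return {cpu: {'afsr': int, 'flags': str, 'trap_level': int}}."""
    cpus = {}
    current = None
    for line in lines:
        if (m := CPU_HEADER.match(line)):
            current = int(m.group(1))
            cpus.setdefault(current, {})
        elif (m := TRAP_HEADER.match(line)):
            cpus.setdefault(int(m.group(1)), {})["trap_level"] = int(m.group(2))
        elif current is not None and (m := AFSR_LINE.match(line)):
            cpus[current]["afsr"] = int(m.group(1).replace(".", ""), 16)
            cpus[current]["flags"] = m.group(2).strip()
    return cpus


# Lines copied from the example output in this document.
sample = """\
CPU0 Config/Control/Status registers:
  AFAR: 0000.0000.0000.0000
  AFSR: 0000.0000.0000.0000  (no errors set)
CPU0 Trap registers:   Trap Level = 1
CPU3 Config/Control/Status registers:
    AFAR: 0000.00b0.ece1.0450
    AFSR: 0010.0006.0000.015b  PRIV UE CE
CPU3 Trap registers:  Trap Level = 5
""".splitlines()

for cpu, info in sorted(summarize(sample).items()):
    if info.get("afsr"):
        print(f"CPU{cpu}: AFSR flags {info['flags']}, "
              f"trap level {info.get('trap_level')}")
# CPU3: AFSR flags PRIV UE CE, trap level 5
```

A CPU listed here with UE or CE set points to the memory decode described above rather than to an automatic CPU replacement.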

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.