Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-75-1012214.1
Update Date:2009-09-13
Keywords:

Solution Type  Troubleshooting Sure

Solution  1012214.1 :   Troubleshooting Red State Exception Memory Errors  


Related Items
  • Sun Fire V480 Server
  • Sun Fire V880z Visualization Server
  • Sun Fire V890 Server
  • Sun Fire V880 Server
  • Sun Fire V490 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Entry-Level Servers

PreviouslyPublishedAs
216842


Oracle Confidential (INTERNAL). Do not distribute to customers
Reason: Migrated distribution from Sun

Description
When you scan error messages in any of the message files, the failing DIMM(s) are usually printed out for you. When you decode a Red State Exception, you do not have this luxury.

When debugging Red State Exceptions (with CE/UE errors), you must be able to manually decode the bad DIMM(s) from the AFSR and AFAR data given in the Red State Exception error output. This output is printed to the ttya port or the RSC console logs at the time of the error.



Steps to Follow
Red State Exceptions (RSE):
Red State Exceptions are most often caused by hardware problems. In some isolated cases, software can cause a Red State Exception.

Note: See Technical Instruction <Document: 1008702.1> "Console Logging Options to capture Fatal Reset output for Sun systems" before proceeding, because the full Red State Exception output is needed. That document explains how to capture this output.

Two common types of RSEs:

  • Trap Level (TL)=5 with AFSR Error Bit(s) Set
  • Trap Level (TL)=5 with Trap Type (TT) Code

Note: This troubleshooting document focuses on the first type (Trap Level (TL)=5 with AFSR Error Bit(s) Set), since this is where you will see CE/UE errors.
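
As a rough illustration of recognizing the first type, the following Python sketch scans captured console output for any CPU whose AFSR is non-zero. It assumes the register lines look like the example output in this document (one "CPUn Config/Control/Status registers:" header per CPU, followed by an "AFSR:" line). The function and pattern names are hypothetical and are not part of any Sun tool.

```python
import re

# Assumed layout: a per-CPU header such as "CPU3 Config/Control/Status registers:"
CPU_HEADER = re.compile(r"^\s*CPU(\d+) Config/Control/Status registers:")
# Primary AFSR line only; "AFSR2:" lines and the IO-Bridge "AFAR  AFSR" table header do not match
AFSR_LINE = re.compile(r"^\s*AFSR:\s+([0-9a-fA-F.]{19})")


def cpus_with_afsr_errors(dump_text):
    """Return {cpu_number: afsr_value} for every CPU whose AFSR is non-zero."""
    flagged = {}
    current_cpu = None
    for line in dump_text.splitlines():
        header = CPU_HEADER.match(line)
        if header:
            current_cpu = int(header.group(1))
            continue
        afsr = AFSR_LINE.match(line)
        if afsr and current_cpu is not None:
            # Registers are printed as four dot-separated 16-bit hex groups
            value = int(afsr.group(1).replace(".", ""), 16)
            if value != 0:
                flagged[current_cpu] = value
    return flagged


# Shortened sample taken from the example output in this document
sample = """\
CPU0 Config/Control/Status registers:
    AFAR:   0000.0000.0000.0000
    AFSR:   0000.0000.0000.0000 (no errors set)
CPU3 Config/Control/Status registers:
    AFAR:   0000.00b0.ece1.0450
    AFSR:   0010.0006.0000.015b PRIV UE CE
"""

for cpu, afsr in sorted(cpus_with_afsr_errors(sample).items()):
    print(f"CPU{cpu}: AFSR = {afsr:#018x}")
```

An empty result would suggest the second type (Trap Type code only). A non-empty result only shows which CPU logged the error; it does not by itself identify the failing part.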



Trap Level (TL)=5 with AFSR Error Bit(s) Set:


ERROR: CPU3 RED State Exception

[**CPU3 called a Red State Exception, further investigation is needed**]

System State (CPU3 reporting)

[**CPU3 is **JUST** reporting the error. Any CPU in the system can report the error; this does not mean CPU3 is the problem**]


CPU0 Config/Control/Status registers:
          CPUVersion:   003e.0014.5400.0507
          SafConfig:    0caa.01bc.0000.8002
          SafBaseAdr:   0000.0400.0000.0000
          DCacheCtl:    0000.0200.0000.0000
          ECacheCtl:    0000.0000.0009.4400
          ECErrEnable:  0000.0000.0000.000b
          AFAR:         0000.0000.0000.0000
          AFSR:         0000.0000.0000.0000 (no errors set)




[**Important: This Red State example is from a 750MHz CPU. On 900MHz CPUs (and beyond) the output also includes AFAR2/AFSR2 register lines below the AFAR/AFSR register lines. AFAR2/AFSR2 represent the first error captured; AFAR/AFSR always represent the most recent error that occurred on the system.**]


          DMMU SFAR:    0000.0000.fff7.8ec8
          DMMU SFSR:    0000.0000.0080.8008 TM PR
          IMMU SFSR:    0000.0000.0000.0000 (no status set)
   CPU0 Trap registers:  Trap Level = 1
    *TL=1 TT:        0000.0000.0000.0003
          TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
         TPC:          0000.0000.f004.9700
         TnPC:        0000.0000.f004.9704
       TL=2 TT:        0000.0000.0000.0068
         TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
         TPC:          0000.0000.f004.4b68
         TnPC:        0000.0000.f004.4b6c
       TL=3 TT:        0000.0000.0000.0000
         TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
         TPC:          0000.0000.3333.3330
         TnPC:        0000.0000.3333.3330
       TL=4 TT:        0000.0000.0000.0000
         TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
         TPC:          0000.0000.4444.4444
         TnPC:        0000.0000.4444.4444
       TL=5 TT:        0000.0000.0000.0000
         TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
         TPC:          0000.0000.5555.5554
         TnPC:        0000.0000.5555.5554
  CPU0 General registers:
     %PIL:              15
      %PC:              0000.0000.f004.9700
     %nPC:            0000.0000.f004.9704
     %PSTATE:    0000.0000.0000.0035 TLE MM=TSO PEF
     %CCR:            0000.0000.0000.0099 XCC:NC ICC:NC
     %FPRS:          0000.0000.0000.0005 FEF DL
%v0: 0000.0000.0000.0000  %v1: 0000.0000.0000.004a  %v2: 0000.0000.0000.0000
%v3: 0000.0000.fff7.8000    %v4: 0000.0000.0000.0ef8   %v5: 0caa.01bc.0000.8002
%v6: 0000.0000.0000.007f   %v7: 0000.0000.0000.0680
.
.
.
%i0: 0000.0000.f000.00e0  %i1: 0000.0000.0000.0005  %i2: 0000.0000.0000.0004
%i3: 0000.0000.f000.00e0  %i4: 0000.0000.0000.001f   %i5: 0000.0000.0000.0000
%i6: f000.0000.0001.c981  %i7: 0000.0000.f000.d680
CPU1 Config/Control/Status registers:
          CPUVersion:   003e.0014.5400.0507
          SafConfig:    0caa.01bc.0002.8002
          SafBaseAdr:   0000.0400.0080.0000
          DCacheCtl:    0000.0200.0000.0000
          ECacheCtl:    0000.0000.0009.4400
          ECErrEnable:  0000.0000.0000.000b
          AFAR:         0000.0000.0000.0000
          AFSR:         0000.0000.0000.0000 (no errors set)
          DMMU SFAR:    0000.0000.fff7.8ec8
          DMMU SFSR:    0000.0000.0080.8008 TM PR
          IMMU SFSR:    0000.0000.0000.0000 (no status set)
   CPU1 Trap registers:  Trap Level = 1
      *TL=1 TT:        0000.0000.0000.0003
          TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
         TPC:         0000.0000.f004.9700
         TnPC:       0000.0000.f004.9704
       TL=2 TT:        0000.0000.0000.0068
         TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
         TPC:         0000.0000.f004.4b68
         TnPC:       0000.0000.f004.4b6c
       TL=3 TT:        0000.0000.0000.0000
         TSTATE:  0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
         TPC:         0000.0001.3333.3330
         TnPC:       0000.0001.3333.3330
       TL=4 TT:        0000.0000.0000.0000
         TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
         TPC:         0000.0001.4444.4444
         TnPC:       0000.0001.4444.4444
       TL=5 TT:        0000.0000.0000.0000
         TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
         TPC:         0000.0001.5555.5554
         TnPC:       0000.0001.5555.5554
CPU1 General registers:
     %PIL:             15
      %PC:               0000.0000.f004.9700
     %nPC:             0000.0000.f004.9704
     %PSTATE:     0000.0000.0000.0035 TLE MM=TSO PEF
     %CCR:             0000.0000.0000.0099 XCC:NC ICC:NC
     %FPRS:           0000.0000.0000.0005 FEF DL
%v0: 0000.0000.0000.0000  %v1: 0000.0000.0000.004a  %v2: 0000.0000.0000.0000
%v3: 0000.0000.fff7.8000    %v4: 0000.0000.0000.0ef8   %v5: 0caa.01bc.0002.8002
%v6: 0000.0000.0000.007f   %v7: 0000.0000.0000.0680
.
.
.
%i0: 0000.0000.f000.00e0  %i1: 0000.0000.0000.0005  %i2: 0000.0000.0000.0004
%i3: 0000.0000.f000.00e0  %i4: 0000.0000.0000.001f   %i5: 0000.0001.0000.0000
%i6: f000.0000.0001.c981  %i7: 0000.0000.f000.d680
CPU2 Config/Control/Status registers:
          CPUVersion:   003e.0014.5400.0507
          SafConfig:    1534.01bc.0004.8002
          SafBaseAdr:   0000.0400.0100.0000
          DCacheCtl:    0000.0200.0000.0000
          ECacheCtl:    0000.0000.0009.4400
          ECErrEnable:  0000.0000.0000.000b
          AFAR:         0000.0000.0000.0000
          AFSR:         0000.0000.0000.0000 (no errors set)
          DMMU SFAR:    0000.0000.fff7.8ec8
          DMMU SFSR:    0000.0000.0080.8008 TM PR
          IMMU SFSR:    0000.0000.0000.0000 (no status set)
   CPU2 Trap registers:  Trap Level = 1
      *TL=1 TT:        0000.0000.0000.0003
          TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
         TPC:         0000.0000.f004.9700
         TnPC:       0000.0000.f004.9704
       TL=2 TT:        0000.0000.0000.0068
         TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
         TPC:         0000.0000.f004.4b68
         TnPC:       0000.0000.f004.4b6c
       TL=3 TT:        0000.0000.0000.0000
         TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
         TPC:        0000.0002.3333.3330
         TnPC:      0000.0002.3333.3330
       TL=4 TT:        0000.0000.0000.0000
         TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
         TPC:        0000.0002.4444.4444
         TnPC:      0000.0002.4444.4444
       TL=5 TT:        0000.0000.0000.0000
         TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
         TPC:        0000.0002.5555.5554
         TnPC:      0000.0002.5555.5554
  CPU2 General registers:
     %PIL:               15
     %PC:               0000.0000.f004.9700
     %nPC:             0000.0000.f004.9704
     %PSTATE:     0000.0000.0000.0035 TLE MM=TSO PEF
     %CCR:             0000.0000.0000.0099 XCC:NC ICC:NC
     %FPRS:           0000.0000.0000.0005 FEF DL
%v0: 0000.0000.0000.0000  %v1: 0000.0000.0000.004a  %v2: 0000.0000.0000.0000
%v3: 0000.0000.fff7.8000    %v4: 0000.0000.0000.0ef8   %v5: 1534.01bc.0004.8002
%v6: 0000.0000.0000.007f   %v7: 0000.0000.0000.0680

.
.
.
%i0: 0000.0000.f000.00e0  %i1: 0000.0000.0000.0005  %i2: 0000.0000.0000.0004
%i3: 0000.0000.f000.00e0  %i4: 0000.0000.0000.001f   %i5: 0000.0002.0000.0000
%i6: f000.0000.0002.b381  %i7: 0000.0000.f000.d680
CPU3 Config/Control/Status registers:
          CPUVersion:   003e.0014.5400.0507
          SafConfig:    1534.01bc.0006.8002
          SafBaseAdr:   0000.0400.0180.0000
          DCacheCtl:    0000.0000.0000.0000
          ECacheCtl:    0000.0000.0009.4400
          ECErrEnable:  0000.0000.0000.000b
          AFAR:         0000.00b0.ece1.0450
          AFSR:         0010.0006.0000.015b PRIV UE CE




['UE' and 'CE' tell you that 'Uncorrectable' and 'Correctable' memory errors occurred and caused this 'Red State Exception']

['b0' in the AFAR tells you the error occurred on the CPU/Memory board in Slot 'B'. See 'Step #3 (Calculating the Physical Memory bank location)' in 'Voyager Article ID 77426 (V480/V880 Manual Decoding of DIMM(s) in Memory Error)' on how to calculate]

['15b' (Bits 8-0 of the AFSR) tells you the ECC Syndrome. In this example it is M2 (Probable Double bit error within a nibble). See 'Step #1 (Find bit(s) in error using ECC Syndromes)' in 'Voyager Article ID 77426 (V480/V880 Manual Decoding of DIMM(s) in Memory Error)' on how to calculate]

['450' (Bits 9-6 of the AFAR) tells you which logical bank you are using. In this example it is 'x001' (where 'x' means 'Don't Care') and the Physical bank is 'CPU3 Bank0 (Bank0 located on CPU/Memory board in Slot 'B')'. See 'Step #3 (Calculating the Physical Memory bank location)' in 'Voyager Article ID 77426 (V480/V880 Manual Decoding of DIMM(s) in Memory Error)' on how to calculate]

[The failing DIMMs in 'CPU3 Bank0' are J7900, J7901, J8001, and J8000. See 'Step #4 (Finding the 4 DIMMs (Jxxxx's) Related to this Physical Bank)' in 'Voyager Article ID 77426 (V480/V880 Manual Decoding of DIMM(s) in Memory Error)' on how to calculate]

[Resolution: All DIMMs in the above bank need to be changed, because a multibit error cannot be broken down to a single DIMM (the multiple bits in error could be on multiple DIMMs in the faulty memory bank)]

[Keep in mind that a DIMM logging CE errors early on in the Explorer message files (0,1,2,3,...) could very likely be the bad DIMM if a UE later crashes the system and that DIMM is in the bank of DIMMs included in the error message. A DIMM that repeatedly causes CEs is more likely to hit a double bit error or UE. Always review the message files starting at the oldest and working your way to the most recent to build the DIMM history]

          DMMU SFAR:    0000.0000.fff5.2000
          DMMU SFSR:    0000.0000.0004.8028 TM CT1 PR
          IMMU SFSR:    0000.0000.0080.8008 TM PR
   CPU3 Trap registers:  Trap Level = 5

[CPU3 is in question since it went to Trap Level 5 (Red State Level)]

       TL=1 TT:        0000.0000.0000.0063 (Corrected ECC Error)
         TSTATE: 0000.0099.8000.1603 XCC:NC ICC:NC MM=TSO PEF PRIV IE
         TPC:          0000.0000.0102.96f0
         TnPC:        0000.0000.0102.96dc
       TL=2 TT:        0000.0000.0000.0068 (Fast Data Access MMU miss)
         TSTATE: 0000.0099.8000.1503 XCC:NC ICC:NC MM=TSO PEF PRIV AG
         TPC:          0000.0000.f004.2c24
         TnPC:        0000.0000.f004.2c28
       TL=3 TT:        0000.0000.0000.0032 (Data Access Error)
         TSTATE: 0000.0088.5804.1403 XCC:N ICC:N MM=TSO PEF PRIV
         TPC:          0000.0000.f004.4c64
         TnPC:        0000.0000.f004.4c68
       TL=4 TT:        0000.0000.0000.0010 (Illegal Instruction)
         TSTATE: 0000.0088.5800.1503 XCC:N ICC:N MM=TSO PEF PRIV AG
         TPC:          0000.0000.f000.4640
         TnPC:        0000.0000.f000.4644
      *TL=5 TT:        0000.0000.0000.0010 (Illegal Instruction)
         TSTATE: 0000.0088.5800.1503 XCC:N ICC:N MM=TSO PEF PRIV AG
         TPC:          0000.0000.f000.4200
         TnPC:        0000.0000.f000.4204
  CPU3 General registers:
     %PIL:              13
     %PC:               0000.0000.f000.4200
     %nPC:              0000.0000.f000.4204
     %PSTATE:           0000.0000.0000.0035 TLE MM=TSO PEF
     %CCR:              0000.0000.0000.0091 XCC:NC ICC:C
     %FPRS:             0000.0000.0000.0000
%v0: 0000.0000.0000.0000  %v1: 0000.0000.0000.0000  %v2: 0000.0000.0000.0000
%v3: 0000.0000.0000.0000  %v4: ffff.ffff.0000.0000  %v5: 0000.0000.0000.0000
%v6: 0000.0000.0000.0000  %v7: 00ca.02a8.0840.0005

.
.
.
%i0: 0000.0000.0008.0000  %i1: 0000.0000.0508.0000  %i2: 0000.0000.0000.0000
%i3: 0000.0700.0536.0c20  %i4: 0000.0700.0536.0c30  %i5: 0000.0000.0007.3c00
%i6: 0000.0000.0140.8fe1  %i7: 0000.0000.0102.96b8

IO-Bridge 8 at 0000.0400.0400.0000

       Device ID   fc00.0000.0011.a954
      Ctl/Stat         0255.5554.0080.7e02
     Error Ctl         fc00.0000.0000.03e0
     Int Ctl          8000.0000.0000.0017
     Error Log       0000.0000.0000.0000
     ECC Ctl        e000.0000.0000.0000
     EStar Ctl        0000.0000.0000.0001
     Queue Ctl      0000.0000.0000.0000
                                       Address Match            Address Mask
  PCIA Mem   8000.07fd.0000.0000     0000.07ff.0000.0000
 PCIA C/IO        8000.07ff.ec00.0000      0000.07ff.fe00.0000
 PCIB Mem    8000.07fe.0000.0000     0000.07ff.0000.0000
 PCIB C/IO         8000.07ff.ee00.0000      0000.07ff.fe00.0000
                                              AFAR                            AFSR
  UE                  0000.0000.0000.0000    0000.0000.0000.0000
 CE                  0000.0100.0000.0000    0000.0000.0000.0000
 PCI A                0000.0000.0000.0000    0000.0000.0000.0000
 PCI B                0000.0000.0000.0000    0000.0000.0000.0000
                                       Control/Status            Idle Check Diag                Diagnostic
  PCI A      0000.0002.010e.003f     0000.0000.0000.8000    0000.0000.0000.0000
PCI B      0000.0000.010e.003f     0000.0000.0000.8000    0000.0000.0000.0000

IO-Bridge 9 at 0000.0400.0480.0000

       Device ID      fc00.0000.0013.a954
      Ctl/Stat         0255.59a8.0090.7e02
     Error Ctl         fc00.0000.0000.03e0
     Int Ctl           8000.0000.0000.0017
     Error Log       0000.0000.0000.0000
     ECC Ctl        e000.0000.0000.0000
     EStar Ctl        0000.0000.0000.0001
     Queue Ctl       0000.0000.0000.0000
                                      Address Match            Address Mask
  PCIA Mem       8000.07fb.0000.0000      0000.07ff.0000.0000
 PCIA C/IO        8000.07ff.e800.0000       0000.07ff.fe00.0000
 PCIB Mem        8000.07fc.0000.0000      0000.07ff.0000.0000
 PCIB C/IO         8000.07ff.ea00.0000       0000.07ff.fe00.0000
                     AFAR                    AFSR
 UE                  0000.0000.0000.0000     0000.0000.0000.0000
 CE                  0000.0000.0000.0000     0000.0000.0000.0000
 PCI A               0000.0000.0000.0000     0000.0000.0000.0000
 PCI B               0000.0000.0000.0000     0000.0000.0000.0000
                     Control/Status          Idle Check Diag         Diagnostic
 PCI A               0000.0002.010e.003f     0000.0000.0000.8000     0000.0000.0000.0000
 PCI B               0000.0000.010e.003f     0000.0000.0000.8000     0000.0000.0000.0000

Note: This document can also assist you in troubleshooting FATAL Reset errors that report UEs. Simply substitute "FATAL Reset" for "Red State Exception" in the above procedure.
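
The two bit-field extractions used in the example above (the ECC Syndrome in bits 8-0 of the AFSR, and the logical bank field in bits 9-6 of the AFAR) can be sketched in Python as follows. This is only an illustration with hypothetical function names. Mapping the syndrome to a bit in error and the logical bank to physical DIMMs still requires the tables in Voyager Article ID 77426, which are not reproduced here.

```python
def parse_register(text):
    """Convert a dotted register string such as '0010.0006.0000.015b' to an int."""
    return int(text.replace(".", ""), 16)


def ecc_syndrome(afsr):
    """Return the ECC Syndrome, taken from bits 8-0 of the AFSR."""
    return afsr & 0x1FF


def logical_bank_bits(afar):
    """Return bits 9-6 of the AFAR, the logical bank field used in this document."""
    return (afar >> 6) & 0xF


# Values from the CPU3 registers in the example output
afsr = parse_register("0010.0006.0000.015b")
afar = parse_register("0000.00b0.ece1.0450")

print(f"ECC Syndrome:  {ecc_syndrome(afsr):#x}")        # 0x15b
print(f"AFAR bits 9-6: {logical_bank_bits(afar):04b}")  # 0001
```

The second value prints as 0001, which corresponds to the 'x001' pattern quoted in the example (the top bit is 'Don't Care').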

Note: Field and support engineers often use the Red State Exception decoder located at:

http://cpre-emea.uk/cgi-bin/redstate.tcl

The Red State Exception Decoder Tool is only a reference, and is only as good as the algorithm it uses to decode Red State Exception output. That algorithm is flawed: the decoder does not analyze all the errors together as one. It simply breaks out each individual error and tells you which CPU is suspect for replacement. THE RED STATE DECODER CANNOT TELL YOU IF THE RED STATE BEING REVIEWED HAS A MEMORY DIMM PROBLEM. For example, if the Red State Exception output above were put into the decoder, it would state that CPU3 is suspect and should be replaced, even though it has been shown above that the problem is not CPU related but is in fact a failing bank of DIMMs.



Product
Sun Fire V880 Server
Sun Fire V890 Server
Sun Fire V880z Visualization Server
Sun Fire V490 Server
Sun Fire V480 Server




UE, CE, DIMM, memory, red state exception, fatal reset, V480, V880, decode, error, ECC
Previously Published As
77556

Change History
Date: 2005-12-14
User Name: 71396
Action: Approved
Comment: Performed final review of article.

No changes required.
Product_uuid
29726712-0a18-11d6-8636-c7e996b581dc|Sun Fire V880 Server
5d2816fe-5e51-11d7-8de2-d7bc0dd226fc|Sun Fire V890 Server
e9eace0a-3e09-11d7-9809-a34d84026ec3|Sun Fire V880z Visualization Server
5c71fc02-5e51-11d7-8add-8938754df22a|Sun Fire V490 Server
a2b9bc2b-52c6-45c2-a3e0-f19bd2c86953|Sun Fire V480 Server

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.