Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Troubleshooting Sure
Solution 1012214.1: Troubleshooting Red State Exception Memory Errors
PreviouslyPublishedAs 216842

Oracle Confidential (INTERNAL). Do not distribute to customers.
Reason: Migrated distribution from Sun

Description

When you scan the error messages in any of the message files, the failing DIMM(s) are usually printed for you. A Red State Exception gives you no such luxury. To debug a Red State Exception that involves CE/UE errors, you must manually decode the bad DIMM(s) from the AFSR and AFAR data in the Red State Exception error output. This output is printed to the ttya port or to the RSC console logs at the time of the error.

Steps to Follow

Red State Exceptions (RSE):

Red State Exceptions are most often caused by hardware problems. In some isolated cases, software can cause a Red State Exception.
Note: See Technical Instruction <Document: 1008702.1> "Console Logging Options to capture Fatal Reset output for Sun systems" before proceeding, because the full "Red State Exception" output is needed. That document explains how to capture this output.

Two common types of RSE:

Trap Level (TL)=5 with AFSR Error Bit(s) Set
Trap Level (TL)=5 with Trap Type (TT) Code

Note: This troubleshooting document focuses on the first type (Trap Level (TL)=5 with AFSR Error Bit(s) Set), since this is where you will see CE/UE errors.
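The annotated output later in this section locates the failure by reading two fields from the registers of the CPU whose AFSR has error bits set: bits 8-0 of the AFSR (the ECC syndrome) and bits 9-6 of the AFAR (the logical bank). The following is a minimal Python sketch of that extraction, not a supported tool. It assumes the console output has been saved as text and that each CPU block prints `AFAR:` and `AFSR:` as four dot-separated groups of hex digits, as in the example. The function names are illustrative only. Mapping the syndrome and bank bits to physical DIMMs still requires the tables in the decoding article referenced in the annotations.

```python
import re

# Header that starts each CPU block, e.g. "CPU3 Config/Control/Status registers:"
CPU_HEADER = re.compile(r"CPU(\d+) Config/Control/Status registers:")
# A 64-bit register printed as four dot-separated groups of four hex digits
REG = r"([0-9a-fA-F]{4}(?:\.[0-9a-fA-F]{4}){3})"
# "AFAR:" followed by "AFSR:" (the colon keeps AFAR2/AFSR2 lines from matching)
AFAR_AFSR = re.compile(r"AFAR:\s*" + REG + r"\s+AFSR:\s*" + REG)


def to_int(dotted):
    """Convert a dotted register string such as '0010.0006.0000.015b' to an int."""
    return int(dotted.replace(".", ""), 16)


def cpus_with_afsr_set(dump):
    """Return {cpu_number: (afar, afsr)} for every CPU whose AFSR is non-zero."""
    found = {}
    headers = list(CPU_HEADER.finditer(dump))
    for i, header in enumerate(headers):
        # Search only inside this CPU's block (up to the next CPU header).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(dump)
        match = AFAR_AFSR.search(dump, header.end(), end)
        if match:
            afar, afsr = to_int(match.group(1)), to_int(match.group(2))
            if afsr != 0:
                found[int(header.group(1))] = (afar, afsr)
    return found


def ecc_syndrome(afsr):
    """Bits 8-0 of the AFSR: the ECC syndrome."""
    return afsr & 0x1FF


def logical_bank_bits(afar):
    """Bits 9-6 of the AFAR: the logical bank field."""
    return (afar >> 6) & 0xF


# Two register lines taken from the example output in this document.
SAMPLE = """
CPU2 Config/Control/Status registers:
AFAR: 0000.0000.0000.0000 AFSR: 0000.0000.0000.0000 (no errors set)
CPU3 Config/Control/Status registers:
AFAR: 0000.00b0.ece1.0450 AFSR: 0010.0006.0000.015b PRIV UE CE
"""

if __name__ == "__main__":
    for cpu, (afar, afsr) in cpus_with_afsr_set(SAMPLE).items():
        print("CPU%d syndrome=%s bank bits=%s" % (
            cpu, hex(ecc_syndrome(afsr)), format(logical_bank_bits(afar), "04b")))
        # prints: CPU3 syndrome=0x15b bank bits=0001
```

Only CPU3 is returned for the sample because it is the only CPU with a non-zero AFSR. This is the same reason the walkthrough concentrates on the CPU3 registers rather than on the CPU named as "reporting".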
Trap Level (TL)=5 with AFSR Error Bit(s) Set:
ERROR: CPU3 RED State Exception
[**CPU3 called a Red State Exception; further investigation is needed**]

System State (CPU3 reporting)
[**CPU3 is JUST reporting the error. Any CPU in the system can report the error, so this does not mean CPU3 is the problem**]

CPU0 Config/Control/Status registers:
CPUVersion: 003e.0014.5400.0507  SafConfig: 0caa.01bc.0000.8002
SafBaseAdr: 0000.0400.0000.0000  DCacheCtl: 0000.0200.0000.0000
ECacheCtl: 0000.0000.0009.4400  ECErrEnable: 0000.0000.0000.000b
AFAR: 0000.0000.0000.0000  AFSR: 0000.0000.0000.0000 (no errors set)
[**Important: This Red State example is from a 750MHz CPU. 900MHz CPUs (and beyond) also print AFAR2/AFSR2 register lines below the AFAR/AFSR register lines. AFAR2/AFSR2 represent the first error captured, while AFAR/AFSR always represent the most recent error that occurred on the system.**]
DMMU SFAR: 0000.0000.fff7.8ec8
DMMU SFSR: 0000.0000.0080.8008 TM PR
IMMU SFSR: 0000.0000.0000.0000 (no status set)

CPU0 Trap registers: Trap Level = 1
*TL=1  TT: 0000.0000.0000.0003  TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
       TPC: 0000.0000.f004.9700  TnPC: 0000.0000.f004.9704
 TL=2  TT: 0000.0000.0000.0068  TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
       TPC: 0000.0000.f004.4b68  TnPC: 0000.0000.f004.4b6c
 TL=3  TT: 0000.0000.0000.0000  TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0000.3333.3330  TnPC: 0000.0000.3333.3330
 TL=4  TT: 0000.0000.0000.0000  TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0000.4444.4444  TnPC: 0000.0000.4444.4444
 TL=5  TT: 0000.0000.0000.0000  TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0000.5555.5554  TnPC: 0000.0000.5555.5554

CPU0 General registers:
%PIL: 15
%PC: 0000.0000.f004.9700  %nPC: 0000.0000.f004.9704
%PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
%CCR: 0000.0000.0000.0099 XCC:NC ICC:NC
%FPRS: 0000.0000.0000.0005 FEF DL
%v0: 0000.0000.0000.0000  %v1: 0000.0000.0000.004a
%v2: 0000.0000.0000.0000  %v3: 0000.0000.fff7.8000
%v4: 0000.0000.0000.0ef8  %v5: 0caa.01bc.0000.8002
%v6: 0000.0000.0000.007f  %v7: 0000.0000.0000.0680
. . .
%i0: 0000.0000.f000.00e0  %i1: 0000.0000.0000.0005
%i2: 0000.0000.0000.0004  %i3: 0000.0000.f000.00e0
%i4: 0000.0000.0000.001f  %i5: 0000.0000.0000.0000
%i6: f000.0000.0001.c981  %i7: 0000.0000.f000.d680

CPU1 Config/Control/Status registers:
CPUVersion: 003e.0014.5400.0507  SafConfig: 0caa.01bc.0002.8002
SafBaseAdr: 0000.0400.0080.0000  DCacheCtl: 0000.0200.0000.0000
ECacheCtl: 0000.0000.0009.4400  ECErrEnable: 0000.0000.0000.000b
AFAR: 0000.0000.0000.0000  AFSR: 0000.0000.0000.0000 (no errors set)
DMMU SFAR: 0000.0000.fff7.8ec8
DMMU SFSR: 0000.0000.0080.8008 TM PR
IMMU SFSR: 0000.0000.0000.0000 (no status set)

CPU1 Trap registers: Trap Level = 1
*TL=1  TT: 0000.0000.0000.0003  TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
       TPC: 0000.0000.f004.9700  TnPC: 0000.0000.f004.9704
 TL=2  TT: 0000.0000.0000.0068  TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
       TPC: 0000.0000.f004.4b68  TnPC: 0000.0000.f004.4b6c
 TL=3  TT: 0000.0000.0000.0000  TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0001.3333.3330  TnPC: 0000.0001.3333.3330
 TL=4  TT: 0000.0000.0000.0000  TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0001.4444.4444  TnPC: 0000.0001.4444.4444
 TL=5  TT: 0000.0000.0000.0000  TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0001.5555.5554  TnPC: 0000.0001.5555.5554

CPU1 General registers:
%PIL: 15
%PC: 0000.0000.f004.9700  %nPC: 0000.0000.f004.9704
%PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
%CCR: 0000.0000.0000.0099 XCC:NC ICC:NC
%FPRS: 0000.0000.0000.0005 FEF DL
%v0: 0000.0000.0000.0000  %v1: 0000.0000.0000.004a
%v2: 0000.0000.0000.0000  %v3: 0000.0000.fff7.8000
%v4: 0000.0000.0000.0ef8  %v5: 0caa.01bc.0002.8002
%v6: 0000.0000.0000.007f  %v7: 0000.0000.0000.0680
. . .
%i0: 0000.0000.f000.00e0  %i1: 0000.0000.0000.0005
%i2: 0000.0000.0000.0004  %i3: 0000.0000.f000.00e0
%i4: 0000.0000.0000.001f  %i5: 0000.0001.0000.0000
%i6: f000.0000.0001.c981  %i7: 0000.0000.f000.d680

CPU2 Config/Control/Status registers:
CPUVersion: 003e.0014.5400.0507  SafConfig: 1534.01bc.0004.8002
SafBaseAdr: 0000.0400.0100.0000  DCacheCtl: 0000.0200.0000.0000
ECacheCtl: 0000.0000.0009.4400  ECErrEnable: 0000.0000.0000.000b
AFAR: 0000.0000.0000.0000  AFSR: 0000.0000.0000.0000 (no errors set)
DMMU SFAR: 0000.0000.fff7.8ec8
DMMU SFSR: 0000.0000.0080.8008 TM PR
IMMU SFSR: 0000.0000.0000.0000 (no status set)

CPU2 Trap registers: Trap Level = 1
*TL=1  TT: 0000.0000.0000.0003  TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
       TPC: 0000.0000.f004.9700  TnPC: 0000.0000.f004.9704
 TL=2  TT: 0000.0000.0000.0068  TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
       TPC: 0000.0000.f004.4b68  TnPC: 0000.0000.f004.4b6c
 TL=3  TT: 0000.0000.0000.0000  TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0002.3333.3330  TnPC: 0000.0002.3333.3330
 TL=4  TT: 0000.0000.0000.0000  TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0002.4444.4444  TnPC: 0000.0002.4444.4444
 TL=5  TT: 0000.0000.0000.0000  TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0002.5555.5554  TnPC: 0000.0002.5555.5554

CPU2 General registers:
%PIL: 15
%PC: 0000.0000.f004.9700  %nPC: 0000.0000.f004.9704
%PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
%CCR: 0000.0000.0000.0099 XCC:NC ICC:NC
%FPRS: 0000.0000.0000.0005 FEF DL
%v0: 0000.0000.0000.0000  %v1: 0000.0000.0000.004a
%v2: 0000.0000.0000.0000  %v3: 0000.0000.fff7.8000
%v4: 0000.0000.0000.0ef8  %v5: 1534.01bc.0004.8002
%v6: 0000.0000.0000.007f  %v7: 0000.0000.0000.0680
. . .
%i0: 0000.0000.f000.00e0  %i1: 0000.0000.0000.0005
%i2: 0000.0000.0000.0004  %i3: 0000.0000.f000.00e0
%i4: 0000.0000.0000.001f  %i5: 0000.0002.0000.0000
%i6: f000.0000.0002.b381  %i7: 0000.0000.f000.d680

CPU3 Config/Control/Status registers:
CPUVersion: 003e.0014.5400.0507  SafConfig: 1534.01bc.0006.8002
SafBaseAdr: 0000.0400.0180.0000  DCacheCtl: 0000.0000.0000.0000
ECacheCtl: 0000.0000.0009.4400  ECErrEnable: 0000.0000.0000.000b
AFAR: 0000.00b0.ece1.0450  AFSR: 0010.0006.0000.015b PRIV UE CE

['UE' and 'CE' tell you that 'Uncorrectable' and 'Correctable' memory errors occurred and caused this 'Red State Exception']

['b0' in the AFAR tells you the error occurred on the CPU/Memory board in Slot 'B'. See 'Step #3 (Calculating the Physical Memory bank location)' in 'Voyager Article ID 77426 (V480/V880 Manual Decoding of DIMM(s) in Memory Error)' on how to calculate]

['15b' (bits 8-0 of the AFSR) is the ECC Syndrome. In this example it is M2 (probable double-bit error within a nibble). See 'Step #1 (Find bit(s) in error using ECC Syndromes)' in 'Voyager Article ID 77426 (V480/V880 Manual Decoding of DIMM(s) in Memory Error)' on how to calculate]

[Bits 9-6 of the AFAR, which fall within the final '450', tell you which logical bank is in use. In this example they are 'x001' (where 'x' means 'Don't Care'), and the physical bank is 'CPU3 Bank0' (Bank0 located on the CPU/Memory board in Slot 'B'). See 'Step #3 (Calculating the Physical Memory bank location)' in 'Voyager Article ID 77426 (V480/V880 Manual Decoding of DIMM(s) in Memory Error)' on how to calculate]

[The failing DIMMs in 'CPU3 Bank0' are J7900, J7901, J8001, and J8000. See 'Step #4 (Finding the 4 DIMMs (Jxxxx's) Related to this Physical Bank)' in 'Voyager Article ID 77426 (V480/V880 Manual Decoding of DIMM(s) in Memory Error)' on how to calculate]

[Resolution: All DIMMs in the above bank need to be replaced, because a multibit error cannot be narrowed down to a single DIMM (the bits in error could be on multiple DIMMs in the faulty memory bank)]

[Keep in mind that a DIMM that logged CE errors earlier in the Explorer message files (0,1,2,3,...) is very likely the bad DIMM if a UE later crashes the system and that DIMM is in the bank of DIMMs named in the error message. A DIMM that causes CEs repeatedly is more likely to hit a double-bit error or UE. Always review the message files starting at the oldest and working your way to the most recent for DIMM history]

DMMU SFAR: 0000.0000.fff5.2000
DMMU SFSR: 0000.0000.0004.8028 TM CT1 PR
IMMU SFSR: 0000.0000.0080.8008 TM PR

CPU3 Trap registers: Trap Level = 5
[CPU3 is in question since it went to Trap Level 5 (Red State Level)]
 TL=1  TT: 0000.0000.0000.0063 (Corrected ECC Error)  TSTATE: 0000.0099.8000.1603 XCC:NC ICC:NC MM=TSO PEF PRIV IE
       TPC: 0000.0000.0102.96f0  TnPC: 0000.0000.0102.96dc
 TL=2  TT: 0000.0000.0000.0068 (Fast Data Access MMU miss)  TSTATE: 0000.0099.8000.1503 XCC:NC ICC:NC MM=TSO PEF PRIV AG
       TPC: 0000.0000.f004.2c24  TnPC: 0000.0000.f004.2c28
 TL=3  TT: 0000.0000.0000.0032 (Data Access Error)  TSTATE: 0000.0088.5804.1403 XCC:N ICC:N MM=TSO PEF PRIV
       TPC: 0000.0000.f004.4c64  TnPC: 0000.0000.f004.4c68
 TL=4  TT: 0000.0000.0000.0010 (Illegal Instruction)  TSTATE: 0000.0088.5800.1503 XCC:N ICC:N MM=TSO PEF PRIV AG
       TPC: 0000.0000.f000.4640  TnPC: 0000.0000.f000.4644
*TL=5  TT: 0000.0000.0000.0010 (Illegal Instruction)  TSTATE: 0000.0088.5800.1503 XCC:N ICC:N MM=TSO PEF PRIV AG
       TPC: 0000.0000.f000.4200  TnPC: 0000.0000.f000.4204

CPU3 General registers:
%PIL: 13
%PC: 0000.0000.f000.4200  %nPC: 0000.0000.f000.4204
%PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
%CCR: 0000.0000.0000.0091 XCC:NC ICC:C
%FPRS: 0000.0000.0000.0000
%v0: 0000.0000.0000.0000  %v1: 0000.0000.0000.0000
%v2: 0000.0000.0000.0000  %v3: 0000.0000.0000.0000
%v4: ffff.ffff.0000.0000  %v5: 0000.0000.0000.0000
%v6: 0000.0000.0000.0000  %v7: 00ca.02a8.0840.0005
. . .
%i0: 0000.0000.0008.0000  %i1: 0000.0000.0508.0000
%i2: 0000.0000.0000.0000  %i3: 0000.0700.0536.0c20
%i4: 0000.0700.0536.0c30  %i5: 0000.0000.0007.3c00
%i6: 0000.0000.0140.8fe1  %i7: 0000.0000.0102.96b8

IO-Bridge 8 at 0000.0400.0400.0000
Device ID  fc00.0000.0011.a954  Ctl/Stat   0255.5554.0080.7e02
Error Ctl  fc00.0000.0000.03e0  Int Ctl    8000.0000.0000.0017
Error Log  0000.0000.0000.0000  ECC Ctl    e000.0000.0000.0000
EStar Ctl  0000.0000.0000.0001  Queue Ctl  0000.0000.0000.0000
           Address Match        Address Mask
PCIA Mem   8000.07fd.0000.0000  0000.07ff.0000.0000
PCIA C/IO  8000.07ff.ec00.0000  0000.07ff.fe00.0000
PCIB Mem   8000.07fe.0000.0000  0000.07ff.0000.0000
PCIB C/IO  8000.07ff.ee00.0000  0000.07ff.fe00.0000
           AFAR                 AFSR
UE         0000.0000.0000.0000  0000.0000.0000.0000
CE         0000.0100.0000.0000  0000.0000.0000.0000
PCI A      0000.0000.0000.0000  0000.0000.0000.0000
PCI B      0000.0000.0000.0000  0000.0000.0000.0000
           Control/Status       Idle Check Diag      Diagnostic
PCI A      0000.0002.010e.003f  0000.0000.0000.8000  0000.0000.0000.0000
PCI B      0000.0000.010e.003f  0000.0000.0000.8000  0000.0000.0000.0000

IO-Bridge 9 at 0000.0400.0480.0000
Device ID  fc00.0000.0013.a954  Ctl/Stat   0255.59a8.0090.7e02
Error Ctl  fc00.0000.0000.03e0  Int Ctl    8000.0000.0000.0017
Error Log  0000.0000.0000.0000  ECC Ctl    e000.0000.0000.0000
EStar Ctl  0000.0000.0000.0001  Queue Ctl  0000.0000.0000.0000
           Address Match        Address Mask
PCIA Mem   8000.07fb.0000.0000  0000.07ff.0000.0000
PCIA C/IO  8000.07ff.e800.0000  0000.07ff.fe00.0000
PCIB Mem   8000.07fc.0000.0000  0000.07ff.0000.0000
PCIB C/IO  8000.07ff.ea00.0000  0000.07ff.fe00.0000
           AFAR                 AFSR
UE         0000.0000.0000.0000  0000.0000.0000.0000
CE         0000.0000.0000.0000  0000.0000.0000.0000
PCI A      0000.0000.0000.0000  0000.0000.0000.0000
PCI B      0000.0000.0000.0000  0000.0000.0000.0000
           Control/Status       Idle Check Diag      Diagnostic
PCI A      0000.0002.010e.003f  0000.0000.0000.8000  0000.0000.0000.0000
PCI B      0000.0000.010e.003f  0000.0000.0000.8000  0000.0000.0000.0000

Note: This document can also assist you in troubleshooting FATAL Reset errors that report UEs. Simply substitute FATAL Reset for Red State Exception in the above procedure.

Note: Field and support engineers often use the Red State Exception decoder located at URL: http://cpre-emea.uk/cgi-bin/redstate.tcl

The Red State Exception Decoder Tool is only a reference, and it is only as good as the algorithm it uses to decode Red State Exception output. That algorithm is flawed: the decoder does not evaluate all of the errors together as a whole. It simply breaks out each individual error and tells you which CPU is suspect for replacement. THE RED STATE DECODER CANNOT TELL YOU IF THE RED STATE BEING REVIEWED HAS A MEMORY DIMM PROBLEM. For example, if the Red State Exception output above were put into the decoder, it would state that CPU3 is suspect and should be replaced, even though the analysis above shows that the problem is not CPU related but is in fact a failing bank of DIMMs.

Product
Sun Fire V880 Server
Sun Fire V890 Server
Sun Fire V880z Visualization Server
Sun Fire V490 Server
Sun Fire V480 Server

Internal Comments
Place Sun Internal-Use Only content here. This content will be published to internal SunSolve only.

UE, CE, DIMM, memory, red state exception, fatal reset, V480, V880, decode, error, ECC

Previously Published As 77556

Change History
Date: 2005-12-14
User Name: 71396
Action: Approved
Comment: Performed final review of article. No changes required.
Product_uuid
29726712-0a18-11d6-8636-c7e996b581dc|Sun Fire V880 Server
5d2816fe-5e51-11d7-8de2-d7bc0dd226fc|Sun Fire V890 Server
e9eace0a-3e09-11d7-9809-a34d84026ec3|Sun Fire V880z Visualization Server
5c71fc02-5e51-11d7-8add-8938754df22a|Sun Fire V490 Server
a2b9bc2b-52c6-45c2-a3e0-f19bd2c86953|Sun Fire V480 Server

Attachments
This solution has no attachment