Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1359373.1
Update Date:2011-11-07

Solution Type  Troubleshooting Sure

Solution  1359373.1 :   Sun Fire[TM] Servers (V480, V490, V880, V890): How to Manually Decode the DIMM(s) in a Memory Error


Related Items
  • Sun Fire V490 Server
  • Sun Fire V480 Server
  • Sun Fire V880z Visualization Server
  • Sun Fire V880 Server
  • Sun Fire V890 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Workgroup Servers>SN-SPARC: SF-Vx80
  • .Old GCS Categories>Sun Microsystems>Servers>Entry-Level Servers

In this Document
  Purpose
  Last Review Date
  Instructions for the Reader
  Troubleshooting Details
     Summary of Steps(1-6) needed for Manual Decoding of DIMM:
     Detailed Steps


Applies to:

Sun Fire V490 Server - Version: Not Applicable and later   [Release: N/A and later ]
Sun Fire V880 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire V480 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire V880z Visualization Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire V890 Server - Version: Not Applicable and later    [Release: N/A and later]
Information in this document applies to any platform.

Purpose

The purpose of this document is to provide guidance on manually decoding memory locations on the Entry Level Servers listed above. Many errors provide the DIMM location, but manual decoding is necessary for Red State Exceptions, Fatal Resets, or errors on systems running Solaris 8 Kernel Update Patch 108528-15 (or earlier). For those errors you must manually decode the DIMM(s) from the AFSR and AFAR data given in the "Red State Exception", "Fatal Reset", or Solaris 8 KUP 108528-15 (or earlier) error message output.

Last Review Date

September 16, 2011

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Memory Error Output from “/var/adm/messages”:


May 19 10:06:47 sf02 SUNW,UltraSPARC-III+: [ID 649096 kern.info] NOTICE: [AFT0] Corrected system bus (CE)
Event detected by CPU7 at TL=0, errID 0x00000019.ea55b668
May 19 10:06:47 sf02 AFSR 0x00000002<CE>.000000b0 AFAR 0x000000d0.cfea4a20
May 19 10:06:47 sf02 Fault_PC 0x1009b110 Esynd 0x00b0 Slot D: J3001
May 19 10:06:47 sf02 SUNW,UltraSPARC-III+: [ID 311202 kern.info] [AFT0] errID 0x00000019.ea55b668 Corrected Memory Error on Slot D:J3001 is Persistent
May 19 10:06:47 sf02 SUNW,UltraSPARC-III+: [ID 291034 kern.info] [AFT0] errID 0x00000019.ea55b668 Data Bit 103 was in error and corrected
May 19 10:06:47 sf02 SUNW,UltraSPARC-III+: [ID 315577 kern.info] [AFT2] errID 0x00000019.ea55b668 PA=0x000000d0.cfea4a00
May 19 10:06:47 sf02 E$tag 0x00000343.3f000002 E$state_0 Exclusive
May 19 10:06:47 sf02 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0xbaddcafe.baddcafe 0xbaddcafe.baddcafe ECC 0x0be
May 19 10:06:47 sf02 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0xbaddcafe.baddcafe 0xbaddcafe.baddcafe ECC 0x0be
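
The two register values needed for the rest of the procedure are on the AFSR/AFAR line of this output. As a minimal sketch (the helper name parse_afsr_afar and the regular expression are illustrative assumptions, not part of any Sun tool), they can be pulled out as integers like this:

```python
import re

# Hypothetical helper: extract AFSR and AFAR from one /var/adm/messages line.
# The flag text inside the AFSR (for example "<CE>") is skipped.
REG_RE = re.compile(
    r"AFSR 0x([0-9a-fA-F]{8})(?:<[^>]*>)?\.([0-9a-fA-F]{8})\s+"
    r"AFAR 0x([0-9a-fA-F]{8})\.([0-9a-fA-F]{8})"
)

def parse_afsr_afar(line):
    """Return (afsr, afar) as integers, or None if the line has no such pair."""
    m = REG_RE.search(line)
    if m is None:
        return None
    afsr = int(m.group(1) + m.group(2), 16)   # join the two 32-bit halves
    afar = int(m.group(3) + m.group(4), 16)
    return afsr, afar

line = "May 19 10:06:47 sf02 AFSR 0x00000002<CE>.000000b0 AFAR 0x000000d0.cfea4a20"
afsr, afar = parse_afsr_afar(line)
print(hex(afsr), hex(afar))   # 0x2000000b0 0xd0cfea4a20
```

Lines that do not carry both registers return None, so the helper can be run over a whole messages file and the misses filtered out.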


Summary of Steps(1-6) needed for Manual Decoding of DIMM:

Step #1 Find bit(s) in error using ECC Syndromes (Table #1)
Step #2 Retrieve Memory Configuration Output from OBP/POST
Step #3 Calculating the Physical Memory bank location (Table #2)
Step #4 Finding the 4 DIMMs (Jxxxx's) Related to this Physical Bank (Table #3)
Step #5 Finding the correct SDRAM (Table #4)
Step #6 Finding the correct DIMM to be replaced (Table #3)


Detailed Steps

Step #1 Find bit(s) in error using ECC Syndromes (Table #1):

Let's first look at the AFSR. Bits 8 - 0 comprise the system-bus or L2 cache data ECC syndrome (Esynd). In this example it breaks down as follows:


AFSR=00000002<CE>.000000b0
......................./^\
....................../ | \
...................../  |  \
..................0000 1011 0000 (binary) = 0b0 (hex)
............bits.....8 7654 3210 = From right to left; bits (8 - 0) is the ECC syndrome

                 ECC syndrome (Esynd) = 0b0
                           x coordinate = 0
                           y coordinate = 0b
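
The breakdown above is a simple mask and shift. The sketch below assumes, as the example suggests, that the x coordinate is the last hex digit of the syndrome and the y coordinate is the remaining upper bits; the function names are illustrative only.

```python
def esynd_from_afsr(afsr):
    """Return the 9-bit ECC syndrome held in AFSR bits 8 - 0."""
    return afsr & 0x1FF

def table1_coordinates(esynd):
    """Split a syndrome into (x, y): x is the low hex digit, y the upper bits."""
    return esynd & 0xF, esynd >> 4

afsr = 0x00000002000000B0          # AFSR from the example error message
esynd = esynd_from_afsr(afsr)
x, y = table1_coordinates(esynd)
print(f"Esynd = {esynd:03x}, x = {x:x}, y = {y:02x}")   # Esynd = 0b0, x = 0, y = 0b
```

The bit name itself (Data Bit 103 in this example) still has to be read from Table #1.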


Using Table #1 (ECC Syndromes) below, find where the "y" value (vertical left margin) and the "x" value (horizontal top margin) intersect to get the bit value (Data or ECC Check). In this example Esynd 0b0 decodes to "Data Bit 103" (Corrected Data Bit in Error). Notice that some of the values are not single Data bits but are instead single ECC Check bits or multi-bit errors, as described above Table #1. The following procedure assumes that the table decoded to a single Data or ECC Check bit, i.e. a Correctable Error (CE).


NOTE: Please write down this Data/ECC Check bit number. You will need it later in your calculations.


Step #2 Retrieve Memory Configuration Output from OBP/POST:

The Memory Configuration Output table from OBP/POST can be retrieved using one of the following two methods:
  • Set "diag-switch?" to true and run reset-all at the OBP prompt, then power the system off and back on.
  • Power off the system, put the keyswitch in the diagnostic position, and power back on.
Here is an example of the Memory Configuration Output from OBP/POST for a fully configured Sun Fire V880 with 8-way memory interleaving:


CPU0 Bank0 128 + 128 + 128 + 128 : 512MB @ a000000000 8way #0
CPU0 Bank1 128 + 128 + 128 + 128 : 512MB @ a000000000 8way #2
CPU0 Bank2 128 + 128 + 128 + 128 : 512MB @ a000000000 8way #4
CPU0 Bank3 128 + 128 + 128 + 128 : 512MB @ a000000000 8way #6

CPU1 Bank0 128 + 128 + 128 + 128 : 512MB @ b000000000 8way #0
CPU1 Bank1 128 + 128 + 128 + 128 : 512MB @ b000000000 8way #2
CPU1 Bank2 128 + 128 + 128 + 128 : 512MB @ b000000000 8way #4
CPU1 Bank3 128 + 128 + 128 + 128 : 512MB @ b000000000 8way #6

CPU2 Bank0 128 + 128 + 128 + 128 : 512MB @ a000000000 8way #1
CPU2 Bank1 128 + 128 + 128 + 128 : 512MB @ a000000000 8way #3
CPU2 Bank2 128 + 128 + 128 + 128 : 512MB @ a000000000 8way #5
CPU2 Bank3 128 + 128 + 128 + 128 : 512MB @ a000000000 8way #7

CPU3 Bank0 128 + 128 + 128 + 128 : 512MB @ b000000000 8way #1
CPU3 Bank1 128 + 128 + 128 + 128 : 512MB @ b000000000 8way #3
CPU3 Bank2 128 + 128 + 128 + 128 : 512MB @ b000000000 8way #5
CPU3 Bank3 128 + 128 + 128 + 128 : 512MB @ b000000000 8way #7

CPU4 Bank0 128 + 128 + 128 + 128 : 512MB @ c000000000 8way #0
CPU4 Bank1 128 + 128 + 128 + 128 : 512MB @ c000000000 8way #2
CPU4 Bank2 128 + 128 + 128 + 128 : 512MB @ c000000000 8way #4
CPU4 Bank3 128 + 128 + 128 + 128 : 512MB @ c000000000 8way #6

CPU5 Bank0 128 + 128 + 128 + 128 : 512MB @ d000000000 8way #0
CPU5 Bank1 128 + 128 + 128 + 128 : 512MB @ d000000000 8way #2
CPU5 Bank2 128 + 128 + 128 + 128 : 512MB @ d000000000 8way #4
CPU5 Bank3 128 + 128 + 128 + 128 : 512MB @ d000000000 8way #6

CPU6 Bank0 128 + 128 + 128 + 128 : 512MB @ c000000000 8way #1
CPU6 Bank1 128 + 128 + 128 + 128 : 512MB @ c000000000 8way #3
CPU6 Bank2 128 + 128 + 128 + 128 : 512MB @ c000000000 8way #5
CPU6 Bank3 128 + 128 + 128 + 128 : 512MB @ c000000000 8way #7

CPU7 Bank0 128 + 128 + 128 + 128 : 512MB @ d000000000 8way #1
CPU7 Bank1 128 + 128 + 128 + 128 : 512MB @ d000000000 8way #3
CPU7 Bank2 128 + 128 + 128 + 128 : 512MB @ d000000000 8way #5
CPU7 Bank3 128 + 128 + 128 + 128 : 512MB @ d000000000 8way #7
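
For scripted decoding it can help to turn each line of this table into a record. The following is a minimal sketch that assumes the line layout is exactly as printed above; the BankEntry type and parse_config function are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple

class BankEntry(NamedTuple):
    cpu: int
    bank: int
    size_mb: int
    base: int       # base address of the interleave group
    ways: int       # interleave factor (2, 4 or 8)
    logical: int    # logical bank number shown after '#'

LINE_RE = re.compile(
    r"CPU(\d+)\s+Bank(\d+)\s+.*?:\s*(\d+)MB\s*@\s*([0-9a-fA-F]+)\s+(\d+)way\s+#(\d+)"
)

def parse_config(text):
    """Return one BankEntry per recognisable line; other lines are ignored."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            entries.append(BankEntry(int(m[1]), int(m[2]), int(m[3]),
                                     int(m[4], 16), int(m[5]), int(m[6])))
    return entries

sample = """\
CPU5 Bank0 128 + 128 + 128 + 128 : 512MB @ d000000000 8way #0
CPU7 Bank3 128 + 128 + 128 + 128 : 512MB @ d000000000 8way #7
"""
for entry in parse_config(sample):
    print(entry)
```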




NOTE: On Sun Fire V480 and V880 servers, 8-way interleaving is the most common interleave group. If you have multiple interleave groups on the same board, you must find out which group the physical address belongs to (see the multiple interleave groups example below).

Here is an example of the Memory Configuration Output from OBP/POST for a 2-CPU Sun Fire V880 with multiple interleave groups:


CPU0 Bank1 128 + 128 + 128 + 128 : 512MB @ a000000000 4way #0
CPU0 Bank3 128 + 128 + 128 + 128 : 512MB @ a000000000 4way #2
CPU2 Bank0 128 + 128 + 128 + 128 : 512MB @ a080000000 2way #0
CPU2 Bank1 128 + 128 + 128 + 128 : 512MB @ a000000000 4way #1
CPU2 Bank2 128 + 128 + 128 + 128 : 512MB @ a080000000 2way #1
CPU2 Bank3 128 + 128 + 128 + 128 : 512MB @ a000000000 4way #3

Use the AFAR to find which interleave group the physical DIMM belongs to:

An AFAR between a000000000 and base+size will be in the 4way interleave group (in this example)
An AFAR between a080000000 and base+size will be in the 2way interleave group (in this example)

The above example shows we have two interleave groups:

The 4-way interleave group starts at offset a000000000:

CPU0 Bank1 128 + 128 + 128 + 128 : 512MB @ a000000000 4way #0
CPU2 Bank1 128 + 128 + 128 + 128 : 512MB @ a000000000 4way #1
CPU0 Bank3 128 + 128 + 128 + 128 : 512MB @ a000000000 4way #2
CPU2 Bank3 128 + 128 + 128 + 128 : 512MB @ a000000000 4way #3


The 2-way interleave group starts at offset a080000000:

CPU2 Bank0 128 + 128 + 128 + 128 : 512MB @ a080000000 2way #0
CPU2 Bank2 128 + 128 + 128 + 128 : 512MB @ a080000000 2way #1
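
The range check for the two groups can be sketched as follows. It assumes that a group's size is its interleave factor multiplied by the bank size, which agrees with this example (the 2-way group starts at a080000000, exactly 4 x 512MB above a000000000). The test addresses are made up solely to exercise each range.

```python
MB = 1024 * 1024

# (base address, interleave ways, bank size in MB) for the two groups above.
GROUPS = [
    (0xA000000000, 4, 512),
    (0xA080000000, 2, 512),
]

def find_group(afar, groups=GROUPS):
    """Return the (base, ways, bank_mb) group whose range holds afar, else None.

    Assumption: each group spans ways * bank size bytes starting at its base.
    """
    for base, ways, bank_mb in groups:
        if base <= afar < base + ways * bank_mb * MB:
            return base, ways, bank_mb
    return None

# Hypothetical addresses, not taken from a real error message.
print(find_group(0xA012345640))   # lands in the 4way group
print(find_group(0xA09ABCDE00))   # lands in the 2way group
```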


Once the interleave group (with its common AFAR addresses) is known we just need the Logical Bank # to complete our gathering of information for the physical bank.

Continuing with the 8way example above we find the AFAR addresses are common to each individual CPU/Memory board:

CPUs 0 and 2 interleave memory starting @ 0x a000000000
CPUs 1 and 3 interleave memory starting @ 0x b000000000
CPUs 4 and 6 interleave memory starting @ 0x c000000000
CPUs 5 and 7 interleave memory starting @ 0x d000000000



Very Important: Interleaving of memory is *Per CPU/Memory board slot only* and not across CPU/Memory board slots.

Logical banks in the 8way example above are numbered at the far right and are 8way interleaved with the onboard CPUs:

Slot A (CPU's 0 + 2)
Slot B (CPU's 1 + 3)
Slot C (CPU's 4 + 6)
Slot D (CPU's 5 + 7)


NOTE: The above example is shown with 8 CPUs for the V880 only. On a V480 (only 4 CPUs) you will see only the top half of this table (CPUs 0, 1, 2, 3).

Step #3 Calculating the Physical Memory bank location (Table #2 and Step #2):

Break AFAR nibbles (0, 1, 2) into binary:

AFAR=0x000000d0.cfea4a20
...................../^\
..................../ | \
.................../  |  \
...............1010 0010 0000 = From right to left; bits (11 – 0), but we are
                                just interested in bits 9 – 6, since they
                                correspond with the four bits of the lower
                                mask value LM[3 – 0] in Table #2 (Memory
                                Interleaving and Logical bank #'s).

............bits 98 7654 3210     bits 9 – 6 = 1000

Use bits 9 - 6 with the LM [3:0] (lower mask value) in Table #2 (Memory Interleaving and Logical bank #'s) to determine the logical bank. In this example you can see that bits 9 - 6 are 1000. The type of interleaving used (2, 4, or 8 way) determines which of the bits 1000 are used and which are "don't care" as shown in Table #3 (Cross Reference for CPU, Group, Bank, DIMM, DRAM, and Jxxxx).

Next you must look at the output from "Step #2 - Retrieve Memory Configuration Output from OBP/POST" to determine what the interleaving factor is.

The type of interleaving used is a function of the number of DIMMs in the system as well as the size of the DIMMs. In this example the system is using 8-way interleaving, so the first bit (bit 9) is a don't care: 1000 becomes x000, a lower mask of 000, which is the first logical bank.

Notice that Bit 9 (LM[3]) is an "x" (don't care). Its primary use is for Serengeti 16-way interleaving, which the Sun Fire V880 and V480 do not support.

In Table #2 (Memory Interleaving and Logical bank #'s) you can see that "x000" is the first logical bank (out of a possible 8, since 8-way interleaving is being used). We will call it Logical Bank #0.
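
The bit selection can be sketched as a shift and mask. The 8-way case (three significant bits, bit 9 ignored) follows the text; extending the same pattern to 4-way (two bits) and 2-way (one bit) is an assumption here and should be checked against Table #2 before relying on it.

```python
def lower_mask_bits(afar):
    """Return AFAR bits 9 - 6 as a 4-bit integer."""
    return (afar >> 6) & 0xF

def logical_bank(afar, ways):
    """Keep the low log2(ways) bits of AFAR[9:6]; the rest are don't-care.

    Assumption: 8way keeps 3 bits (as in the text), 4way keeps 2, 2way keeps 1.
    """
    if ways not in (2, 4, 8):
        raise ValueError("expected 2, 4 or 8 way interleaving")
    return lower_mask_bits(afar) & (ways - 1)

afar = 0x000000D0CFEA4A20                # AFAR from the example error message
print(f"{lower_mask_bits(afar):04b}")    # 1000
print(logical_bank(afar, 8))             # 0
```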

To find out which of the four CPU/Memory boards this logical bank is a part of we go back to the AFAR:

AFAR 0x000000d0.cfea4a20

You can see we're in the d0.xxxxxxxx range, so the address comes from the CPU/Memory board in "Slot D":

a0.xxxxxxxx > Slot A (CPU's 0 + 2)
b0.xxxxxxxx > Slot B (CPU's 1 + 3)
c0.xxxxxxxx > Slot C (CPU's 4 + 6)
d0.xxxxxxxx > Slot D (CPU's 5 + 7)
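
For the configuration shown in this document, the slot lookup amounts to reading one hex digit of the AFAR. The sketch below hard-codes the four ranges listed above; other configurations may use different base addresses, so the mapping table is an assumption tied to this example.

```python
# Base-address digit to (slot, CPUs) mapping taken from the list above.
SLOT_BY_DIGIT = {
    0xA: ("A", (0, 2)),
    0xB: ("B", (1, 3)),
    0xC: ("C", (4, 6)),
    0xD: ("D", (5, 7)),
}

def slot_for_afar(afar):
    """Return (slot letter, CPUs on that board) for an AFAR, or None."""
    return SLOT_BY_DIGIT.get((afar >> 36) & 0xF)

print(slot_for_afar(0x000000D0CFEA4A20))   # ('D', (5, 7))
```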

Logical Bank #0 is physical memory bank:

                   CPU/Memory board Slot D "CPU5 Bank0"

This physical memory bank comes from Step #2 as we step down the interleave output between CPU5+7 with the logical bank number (in this example; Logical Bank #0). We use the interleaving between CPU5+7, because our AFAR above is pointing to Slot D (000000d0.xxxxxxxx):

CPU5 Bank0 128 + 128 + 128 + 128 : 512MB @ d000000000 8way #0
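
Stepping down the interleave output is a lookup on two keys, the group base address and the logical bank number. A minimal sketch, using a few Slot D rows copied from the 8way table in Step #2 (the tuple layout and function name are illustrative):

```python
# Rows from the 8way table: (cpu, bank, group base, logical bank #).
ROWS = [
    (5, 0, 0xD000000000, 0),
    (5, 1, 0xD000000000, 2),
    (7, 0, 0xD000000000, 1),
    (7, 1, 0xD000000000, 3),
]

def physical_bank(rows, group_base, logical):
    """Return 'CPUn Bankm' for the row matching base and logical bank, else None."""
    for cpu, bank, base, num in rows:
        if base == group_base and num == logical:
            return f"CPU{cpu} Bank{bank}"
    return None

print(physical_bank(ROWS, 0xD000000000, 0))   # CPU5 Bank0
```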

Step #4 – Finding the 4 DIMMs (Jxxxx's) Related to this Physical Bank (Table #3):

Take the physical bank (CPU5 Bank0) from Step #3 and find the DIMMs that make it up by looking at Table #3 (Cross Reference for CPU, Group, Bank, DIMM, DRAM, and Jxxxx):

DIMMs J2900, J2901, J3000, J3001


NOTE: Checking the above memory error output we can see the DIMM in error (J3001) is part of our calculated four DIMMs.

Step #5 Finding the correct SDRAM:

Now with a Data bit value of 103 (from Step #1), knowing the physical memory bank is CPU5 Bank0 located on CPU/Memory board in Slot D (from Step #3), and the DIMMs making up this physical bank are J2900, J2901, J3000, and J3001 (from Step #4) we are ready to find the SDRAM.

Find the Data bit value of 103 in the left column of Table #4 (DIMM bit Assignment to # of SDRAM) below. It is located under the column titled "Nibble", which lists each nibble and its corresponding Data/ECC Check bits. Data [103:100] in "Nibble #29" shows a set of single data bits (103, 102, 101, 100), mapped to Quadwords [0-3] on the right.

Each Quadword [0-3] column provides information in the form Dx [Y1,Y2,Y3,Y4], where 'x' is the DRAM number and 'Y1-Y4' are the pins on that DRAM corresponding to the Data/ECC Check bits listed in the left "Nibble" column. For example, Quadword 0 in line with Data Bits [103:100] has D33 [0,7,10,13], which shows Data Bit 103 on DRAM 33 pin 0, Data Bit 102 on DRAM 33 pin 7, Data Bit 101 on DRAM 33 pin 10, and Data Bit 100 on DRAM 33 pin 13. It is usually not significant which quadword the error occurred in, since we only need to identify the DRAM, not the specific pin on the DRAM.

In the error message above, our Esynd (from the AFSR register in Step #1) told us that Data Bit 103 was in error. Find the line under the "Nibble" column in Table #4 with this Data Bit (Data [103:100], Nibble #29); looking across to the Quadword 0 column we see D33 [0,7,10,13]. Data Bit 103 maps to DRAM 33 pin 0, but we are just interested in D33.
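
The Dx [Y1,Y2,Y3,Y4] notation pairs the data bits of a nibble, highest first, with the listed pins. A short sketch of that pairing for the one entry quoted above (the function name is illustrative; the remaining Table #4 entries are not reproduced here):

```python
def pins_for_nibble(high_bit, pins):
    """Map data bits high_bit, high_bit-1, ... to the listed DRAM pins in order."""
    return {high_bit - i: pin for i, pin in enumerate(pins)}

# Data [103:100], Quadword 0 entry "D33 [0,7,10,13]" from the text.
dram = 33
mapping = pins_for_nibble(103, [0, 7, 10, 13])
print(f"Data Bit 103 -> DRAM D{dram} pin {mapping[103]}")   # pin 0
```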

Step #6 Finding the correct DIMM to be replaced:

Take D33 and find the correct DIMM in Table #3 (Cross Reference for CPU, Group, Bank, DIMM, DRAM, and Jxxxx).

D33 is part of 4 DIMMs (J3001, J3201, J8001, J8201)
CPU5 is the CPU that contains the following DIMMs (J3001, J3201)
Bank0 is the Bank in error (J3001)

[Bad DIMM: J3001]
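
As a cross-check on this example, the same answer falls out of intersecting the two DIMM lists already found: the four DIMMs of the physical bank (Step #4) and the four DIMMs that carry DRAM D33 (Step #6). This is only a sketch using the values from this document, not a replacement for the Table #3 lookup.

```python
bank_dimms = {"J2900", "J2901", "J3000", "J3001"}   # CPU5 Bank0, from Step #4
dram_dimms = {"J3001", "J3201", "J8001", "J8201"}   # DIMMs carrying D33, from Step #6

def suspect_dimms(bank, dram):
    """Return the DIMMs common to both sets, sorted for stable output."""
    return sorted(bank & dram)

print(suspect_dimms(bank_dimms, dram_dimms))   # ['J3001']
```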



  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.