Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Troubleshooting Sure
Solution 1359373.1: Sun Fire[TM] Servers (V480, V490, V880, V890): How to Manually Decode DIMM(s) in a Memory Error
Applies to:
Sun Fire V490 Server - Version: Not Applicable
Sun Fire V880 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire V480 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire V880z Visualization Server - Version: Not Applicable and later [Release: N/A and later]
Sun Fire V890 Server - Version: Not Applicable and later [Release: N/A and later]
Information in this document applies to any platform.

Purpose
The purpose of this document is to provide guidance on manually decoding memory locations on the Entry Level Servers listed above. Many errors report the DIMM location directly, but manual decoding is necessary for Red State Exceptions, Fatal Resets, or errors on systems running Solaris 8 Kernel Update Patch 108528-15 (or earlier). For those errors you will have to manually decode the DIMM(s) from the AFSR and AFAR data given in the "Red State Exception", "Fatal Reset", or Solaris 8 KUP 108528-15 (or earlier) error message output.

Last Review Date
September 16, 2011

Instructions for the Reader
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.
Troubleshooting Details

Memory Error Output from "/var/adm/messages":
May 19 10:06:47 sf02 SUNW,UltraSPARC-III+: [ID 649096 kern.info] NOTICE: [AFT0] Corrected system bus (CE)

Summary of Steps (1-6) needed for Manual Decoding of DIMM:
Step #1 Find bit(s) in error using ECC Syndromes (Table #1)
Step #2 Retrieve Memory Configuration Output from OBP/POST
Step #3 Calculating the Physical Memory bank location (Table #2)
Step #4 Finding the 4 DIMMs (Jxxxx's) Related to this Physical Bank (Table #3)
Step #5 Finding the correct SDRAM (Table #4)
Step #6 Finding the correct DIMM to be replaced (Table #3)

Detailed Steps

Step #1 Find bit(s) in error using ECC Syndromes (Table #1):
Let's first look at the AFSR. Bits 8 - 0 comprise the system-bus or L2 cache data ECC syndrome (Esynd). In this example it breaks down as follows:
AFSR=00000002<CE>.000000b0
Using Table #1 (ECC Syndromes) below, find where the "y" value (vertical left margin) and the "x" value (horizontal top margin) intersect to get the bit value (Data or ECC Check). In this example Esynd 0b0 decodes to "Data Bit 103" (Corrected Data Bit in Error). Notice that some of the values are not single Data bits but are instead single ECC Check bits or multi-bit errors, as described above Table #1. The following procedure assumes that this table decoded to a single Data or ECC Check bit, i.e., a Correctable Error (CE).
NOTE: Please write down this Data/ECC Check bit number. You will need it later in your calculations.

Step #2 Retrieve Memory Configuration Output from OBP/POST:
Retrieving the Memory Configuration Output from OBP/POST table can be done by using one of the following two methods:
CPU0 Bank0 128 + 128 + 128 + 128 : 512MB @ a000000000 8way #0

NOTE: On Sun Fire V480s/V880s, 8-way interleaving is the most common interleave group. If you have multiple interleave groups on the same board, you must find out which group the physical address belongs to (see the multiple interleave groups example below).

Here is an example of a 2-CPU Sun Fire V880 (Memory Configuration Output from OBP/POST) table with multiple interleave groups:
CPU0 Bank1 128 + 128 + 128 + 128 : 512MB @ a000000000 4way #0

Use the AFAR to find which interleave group the physical DIMM belongs to:
An AFAR between a000000000 and base+size will be in the 4way interleave group (in this example)
An AFAR between a080000000 and base+size will be in the 2way interleave group (in this example)

The above example shows two interleave groups.
The 4-way interleave group starts at offset a000000000:
CPU0 Bank1 128 + 128 + 128 + 128 : 512MB @ a000000000 4way #0
The 2-way interleave group starts at offset a080000000:
CPU2 Bank0 128 + 128 + 128 + 128 : 512MB @ a080000000 2way #0

Once the interleave group (with its common AFAR addresses) is known, we only need the Logical Bank # to finish gathering the information for the physical bank.

Continuing with the 8way example above, we find that the AFAR addresses are common to each individual CPU/Memory board:
CPUs 0 and 2 interleave memory starting @ 0x a000000000

Very Important: Interleaving of memory is *Per CPU/Memory board slot only* and not across CPU/Memory board slots.

Logical banks in the 8way example above are numbered at the far right and are 8way interleaved with the onboard CPUs:
Slot A (CPU's 0 + 2)

NOTE: The above example is shown with 8 CPUs for V880 purposes only. On a V480 (only 4 CPUs) you will see only the top half of this table (CPUs 0, 1, 2, 3).
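The Esynd extraction from Step #1 and the interleave-group check above both reduce to simple arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, and the group sizes are an assumption (number of ways times the 512MB bank size shown in the sample output), since the OBP/POST lines quoted here list only the base address and the bank size.

```python
MB = 1 << 20  # bytes per megabyte


def esynd(afsr: int) -> int:
    """Return AFSR bits 8-0, which hold the ECC syndrome (Esynd)."""
    return afsr & 0x1FF


def find_interleave_group(afar: int, groups):
    """Return the label of the group whose [base, base + size) range holds afar.

    groups is a list of (base, size, label) tuples; returns None if no
    group contains the address.
    """
    for base, size, label in groups:
        if base <= afar < base + size:
            return label
    return None


# AFSR=00000002<CE>.000000b0 from the example, written as one 64-bit value.
AFSR = 0x00000002_000000B0

# ASSUMPTION: group size = ways * 512MB. Only the bases come from the article.
GROUPS = [
    (0xA000000000, 4 * 512 * MB, "4way"),
    (0xA080000000, 2 * 512 * MB, "2way"),
]

print(hex(esynd(AFSR)))                             # 0xb0
print(find_interleave_group(0xA000001000, GROUPS))  # 4way
```

The syndrome value still has to be looked up in Table #1; the sketch only isolates the bits to look up.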
Step #3 Calculating the Physical Memory bank location (Table #2 and Step #2):
Break AFAR nibbles (0, 1, 2) into binary:
AFAR=0x000000d0.cfea4a20
The low three nibbles (a, 2, 0) are 1010 0010 0000 in binary, covering bits 11 - 0.

Use bits 9 - 6 with the LM [3:0] (lower mask value) in Table #2 (Memory Interleaving and Logical bank #'s) to determine the logical bank. In this example you can see that bits 9 - 6 are 1000. The type of interleaving used (2, 4, or 8 way) determines which of the bits 1000 are used and which are "don't care", as shown in Table #3 (Cross Reference for CPU, Group, Bank, DIMM, DRAM, and Jxxxx).

Next, look at the output from "Step #2 - Retrieve Memory Configuration Output from OBP/POST" to determine the interleaving factor. The type of interleaving used is a function of the number of DIMMs in the system as well as the size of the DIMMs. In this example the system is using 8way interleaving, meaning the first bit is a don't care (1000 is x000, which is a lower mask of 000, or just the first logical bank). Notice that Bit 9 (LM[3]) is an "x" (Don't Care). Its primary use is for Serengeti 16-Way interleave, which the Sun Fire V880 and V480 don't support.

In Table #2 (Memory Interleaving and Logical bank #'s) you can see that "x000" is the first logical bank (out of a possible 8, since 8-way interleaving is being used). We will call it Logical Bank #0.

To find out which of the four CPU/Memory boards this logical bank is a part of, we go back to the AFAR:
AFAR 0x000000d0.cfea4a20
You can see we're in the d0.xxxxxxxx range, so the address is coming from the CPU/Memory board in "Slot D":
a0.xxxxxxxx > Slot A (CPU's 0 + 2)

Logical Bank #0 is physical memory bank: CPU/Memory board Slot D "CPU5 Bank0"

This physical memory bank comes from Step #2 as we step down the interleave output between CPU5+7 with the logical bank number (in this example, Logical Bank #0).
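The bit arithmetic in this step can be sketched in Python as follows. This is a hedged illustration, not a replacement for Table #2: the function names are hypothetical, the rule that an N-way group uses the low log2(N) bits of LM[3:0] is an assumption extrapolated from the 8way case (x000) described above, and the slot map contains only the two address prefixes that this article actually quotes.

```python
def lm_bits(afar: int) -> int:
    """Return AFAR bits 9-6, the LM[3:0] lower mask value."""
    return (afar >> 6) & 0xF


def logical_bank(afar: int, ways: int) -> int:
    """Return the logical bank number for a 2-, 4-, or 8-way group.

    ASSUMPTION: the upper LM bits are "don't care", so only the low
    log2(ways) bits select the bank. Confirm against Table #2.
    """
    if ways not in (2, 4, 8):
        raise ValueError("expected 2-, 4-, or 8-way interleaving")
    return lm_bits(afar) & (ways - 1)


# Only the prefixes quoted in the article are listed; the entries for the
# other two CPU/Memory board slots are not shown in this text.
SLOT_BY_PREFIX = {0xA0: "A", 0xD0: "D"}


def board_slot(afar: int):
    """Return the slot letter for the AFAR's xx.xxxxxxxx prefix, if known."""
    return SLOT_BY_PREFIX.get((afar >> 32) & 0xFF)


AFAR = 0x000000D0_CFEA4A20  # AFAR=0x000000d0.cfea4a20 from the example

print(format(lm_bits(AFAR), "04b"))  # 1000
print(logical_bank(AFAR, 8))         # 0
print(board_slot(AFAR))              # D
```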
We use the interleaving between CPU5+7 because our AFAR above is pointing to Slot D (000000d0.xxxxxxxx):
CPU5 Bank0 128 + 128 + 128 + 128 : 512MB @ d000000000 8way #0

Step #4 Finding the 4 DIMMs (Jxxxx's) Related to this Physical Bank (Table #3):
Take the physical bank (CPU5 Bank0) from Step #3 and find which DIMMs make it up by looking at Table #3 (Cross Reference for CPU, Group, Bank, DIMM, DRAM, and Jxxxx):
DIMMs J2900, J2901, J3000, J3001
NOTE: Checking the above memory error output, we can see the DIMM in error (J3001) is one of our four calculated DIMMs.

Step #5 Finding the correct SDRAM:
We now have a Data bit value of 103 (from Step #1), we know the physical memory bank is CPU5 Bank0 on the CPU/Memory board in Slot D (from Step #3), and we know the DIMMs making up this physical bank are J2900, J2901, J3000, and J3001 (from Step #4). We are ready to find the SDRAM.

Find the Data bit value of 103 in the left column of Table #4 (DIMM bit Assignment to # of SDRAM) below. It is located under the column titled "Nibble", which lists each nibble and its corresponding Data/ECC Check bits. Data [103:100] in "Nibble #29" shows a set of single data bits (103, 102, 101, 100), mapped to the Quadwords [0-3] on the right.

Each Quadword [0-3] column provides information in the form Dx [Y1,Y2,Y3,Y4], where 'x' is the DRAM number and 'Y1-Y4' are the pins on that DRAM corresponding to the Data/ECC Check bits listed in the left "Nibble" column (i.e., Quadword 0 in line with Data Bits [103:100] has D33 [0,7,10,13], which shows Data Bit 103 located on DRAM 33 pin 0, Data Bit 102 on DRAM 33 pin 7, Data Bit 101 on DRAM 33 pin 10, and finally Data Bit 100 on DRAM 33 pin 13). It is usually not significant to note which quadword the error occurred in, since we only need to identify the DRAM, not the specific pin on the DRAM.

In the error message above, our Esynd (from the AFSR register in Step #1) told us that Data Bit 103 was in error.
Find the line under the "Nibble" column in Table #4 with this Data Bit (Data [103:100] Nibble #29). Looking across to the Quadword 0 column, we see D33 [0,7,10,13]. Data Bit 103 maps to DRAM 33 pin 0, but we are only interested in D33.

Step #6 Finding the correct DIMM to be replaced:
Take D33 and find the correct DIMM in Table #3 (Cross Reference for CPU, Group, Bank, DIMM, DRAM, and Jxxxx).
D33 is part of 4 DIMMs (J3001, J3201, J8001, J8201)
CPU5 is the CPU that contains the following DIMMs (J3001, J3201)
Bank0 is the Bank in error (J3001)
[Bad DIMM: J3001]

Attachments
This solution has no attachment
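As a cross-check on Step #6, the narrowing can be sketched as a set intersection between the four DIMMs of the physical bank (Step #4) and the four DIMMs that carry D33 (Table #3). This is a shortcut that gives the same answer for this example, not the procedure the article itself describes, and the function name is hypothetical.

```python
def candidate_dimms(bank_dimms, dram_dimms):
    """Return the DIMMs that belong to both the failing bank and the DRAM."""
    return sorted(set(bank_dimms) & set(dram_dimms))


# DIMMs making up CPU5 Bank0 (Step #4).
BANK_DIMMS = ["J2900", "J2901", "J3000", "J3001"]

# DIMMs that D33 is part of (Step #6).
D33_DIMMS = ["J3001", "J3201", "J8001", "J8201"]

print(candidate_dimms(BANK_DIMMS, D33_DIMMS))  # ['J3001']
```

If the intersection ever contains more or fewer than one DIMM, treat that as a sign that an earlier step was decoded incorrectly and repeat the table lookups.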