Asset ID: 1-71-1319343.1
Update Date: 2011-05-26
Solution Type: Technical Instruction Sure
Solution 1319343.1: Sun Enterprise [TM] 10000: POST and Hardware Dump Frequently Asked Questions
Related Items: Sun Enterprise 10000 Server
In this Document
Goal
Solution
Applies to:
Sun Enterprise 10000 Server - Version: Not Applicable to Not Applicable - Release: N/A to N/A
Information in this document applies to any platform.
Goal
This document contains answers to POST and hardware dump frequently asked questions.
Solution
What are the meanings of the various board and hardware component states found at the end of a POST run or log file?
Each component and the system board overall are associated with
one of nine states. The Gen column is the 'general health' of the
system board. Other columns report the state of individual components.
From left to right: processors, memory banks, I/O controllers and
slots, ASICs.
Board Descriptor Array:
Proc M/Grp IOC/Slot CIC PC XDB LDPTH
Brd Gen 3210 3210 1/3210 0/3210 3210 210 3210 10
0: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | G=Good
1: x xxxx x/xxxx x/xxxx x/xxxx xxxx xxx xxxx xx | f=Failed
2: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | m=Missing
3: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | b=Blacklisted
4: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | r=Redlisted
5: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | c=Crunched
6: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | _=Undefined
7: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | x=NotInDomain
8: G GGGG G/mmGG G/__mG G/__mG GGGG GGG GGGG GG | u=G,unconfig
9: G GGGb G/mmGG G/__mG G/__Gm GGGG GGG GGGG GG
A: G GGGG G/mmGG G/__GG G/__GG GGGG GGG GGGG GG
B: G GGGG G/GGGG G/__Gm G/__GG GGGG GGG GGGG GG
C: c mmmm m/cccc c/__mm c/__mm cccc ccc cccc cc
D: G GGGG G/mmGG G/__GG G/__mG GGGG GGG GGGG GG
E: G GGGG G/GGGG c/__mm c/__mm GGGG GGG GGGG GG
F: G GGGG G/bmGG G/__GG G/__mG GGGG GGG GGGG GG
State | Description
---|---
Good (G) | The component passed hpost tests. The hpost level is indicated on line 2 of the output.
Failed (f) | The component failed hpost tests. See the post log in /var/opt/SUNWssp/adm/<platform>/<domain>/post.
Missing (m) | The component is not physically present in the system. In the above output, memory banks 2 and 3 on board 8 are an example. I/O slots 8.0.1 and 8.1.1 are also empty.
Blacklisted (b) | The component has been blacklisted in either a platform or domain-specific blacklist file. Proc 9.0 above is an example.
Redlisted (r) | The component is considered 'untouchable' by hpost. Typically, this means the component is part of another domain, as hpost may not change the state of resources in other domains. A component may also have been added to the redlist file, although editing this file is not recommended. See the redlist man page for more.
Crunched (c) | A component is crunched when it serves no useful purpose, so hpost does not configure it into the system. It is essentially cause and effect: if component A relies on or serves component B, and component B is not good, there is no point in configuring component A. In the above output, system board C is crunched because it has no memory, no processors, and no I/O cards. Another example is the I/O on system board E: no I/O cards are present, so the I/O controllers have been crunched. Crunching can also result when ASICs are blacklisted.
Undefined (_) | An unimplemented location. Each SYSIO supports up to 4 SBus cards, but there is physical space for only 2.
NotInDomain (x) | If a board's Gen column is x but all other components are r, that board is part of another domain. System board 0 is an example. A board reporting all components as x is not configured in any domain. System board 1 in the example above is such a board.
G,unconfig (u) | The component is good, but not configured for some reason. This is a holdover from the CS6400 and is not used in Starfire.
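Reading the array by eye is error-prone because the unit numbers run right to left (3210). The following is a minimal Python sketch, not part of hpost or the SSP software, that decodes one row of the array. The column labels `IOC1/Slot` and `IOC0/Slot` and the function names are chosen here for illustration only.

```python
# State letters as listed in the legend of the Board Descriptor Array.
STATES = {
    "G": "Good",
    "f": "Failed",
    "m": "Missing",
    "b": "Blacklisted",
    "r": "Redlisted",
    "c": "Crunched",
    "_": "Undefined",
    "x": "NotInDomain",
    "u": "G,unconfig",
}

# Column labels in left-to-right order (illustrative names, not SSP names).
COLUMNS = ["Gen", "Proc", "M/Grp", "IOC1/Slot", "IOC0/Slot",
           "CIC", "PC", "XDB", "LDPTH"]


def decode_row(line):
    """Decode one array row, e.g. '9: G GGGb G/mmGG ...', into a dict."""
    # Drop the trailing legend ('| G=Good') that some rows carry.
    body = line.split("|")[0]
    board, fields = body.split(":", 1)
    decoded = {"board": board.strip()}
    for name, field in zip(COLUMNS, fields.split()):
        decoded[name] = field
    return decoded


def units_in_state(field, state):
    """Return the unit numbers within a field that are in the given state.

    The header numbers units right to left (3210), so the last character
    of a group is unit 0. For fields such as 'G/mmGG', only the part
    after the slash is indexed.
    """
    group = field.split("/")[-1]
    return [i for i, ch in enumerate(reversed(group)) if ch == state]


row = decode_row("9: G GGGb G/mmGG G/__mG G/__Gm GGGG GGG GGGG GG")
print(row["board"], STATES[row["Gen"]])    # board ID and general health
print(units_in_state(row["Proc"], "b"))    # blacklisted processors
print(units_in_state(row["M/Grp"], "m"))   # missing memory banks
```

For the board 9 row this reports processor 0 as blacklisted and memory banks 2 and 3 as missing, matching the table above.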
How do I analyze a WatchDog-Redmode-Dump file?
The WatchDog-Redmode-Dump file is useful only for reviewing the
configuration of a domain. It provides no information about the
failure, because a watchdog or redmode is a CPU-based failure, not
an interconnect-based one.
With a watchdog or redmode failure, look for a hostresetdump
file, which contains (among other things) the processor states.
Hpost reports Component ID discrepancy
A component ID discrepancy means that hpost has detected a piece of
hardware in a domain that is unknown to the scan database on the SSP.
The most common occurrence is with new processor modules. Messages can
be either WARNINGs or FAILs. For example:
(output omitted)
phase jtag_integ: JTAG probe and integrity test...
WARNING: b/r/c = sysboard12/proc0/spitfire:
Component ID is up-version: Actual A003602F
Expected 9003602F
FAIL b/r/c = sysboard12/proc0/udb0: Component ID discrepancy.
FAIL Actual 00000000; Expected one of:
FAIL 4F643989 or
FAIL 3F643989 or
FAIL 2F643989 or
FAIL 1F643989 or
FAIL 0F643989 or
FAIL 5002602F or
FAIL 1002602F
(output omitted)
If the messages reported are FAILs:
- Install any/all patches on the SSP that update the scan database with new hardware information.
- Reboot the SSP. This makes the scan database changes take effect.
- Run autoconfig on the system board(s) containing the new hardware.
Do not run autoconfig on any system board running OS or OBP as it will
crash that domain. See the autoconfig man page for details.
If the messages reported are WARNINGs, the domain should operate
without problems provided the board passes POST.
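Because the required action depends on whether the messages are WARNINGs or FAILs, it can help to extract them from a post log first. The following is a rough Python sketch that assumes the messages are formatted exactly as in the excerpt above. The function names are hypothetical and this is not an SSP tool.

```python
import re

# Matches lines such as
#   'WARNING: b/r/c = sysboard12/proc0/spitfire:'
#   'FAIL b/r/c = sysboard12/proc0/udb0: Component ID discrepancy.'
MESSAGE = re.compile(r"^(WARNING|FAIL):?\s+b/r/c = (\S+?):")


def component_id_messages(log_text):
    """Return (severity, component) pairs found in a POST log excerpt."""
    found = []
    for line in log_text.splitlines():
        match = MESSAGE.match(line.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found


def needs_scan_database_update(log_text):
    """True if any component reported a FAIL (see the steps above)."""
    return any(sev == "FAIL" for sev, _ in component_id_messages(log_text))


SAMPLE = """\
phase jtag_integ: JTAG probe and integrity test...
WARNING: b/r/c = sysboard12/proc0/spitfire:
Component ID is up-version: Actual A003602F
Expected 9003602F
FAIL b/r/c = sysboard12/proc0/udb0: Component ID discrepancy.
FAIL Actual 00000000; Expected one of:
FAIL 4F643989 or
"""

for severity, component in component_id_messages(SAMPLE):
    print(severity, component)
print("scan database update needed:", needs_scan_database_update(SAMPLE))
```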
Hpost reports bogus Mixed Ecache error
Symptom:
The post log reports something like:
(output omitted)
phase proc1: Initial processor module tests...
FAIL proc 9.3: Mixed Ecache sizes on board.
phase pc/cic_reg: PC and CIC register tests...
(output omitted)
However, the board in question is known to contain four identical
processors, and other POST runs have not failed with this error.
This problem occurs when:
- A power cycle of the system boards/platform is done
- AND The "failing" processors are 400MHz
- AND The SSP is at 3.1.1 or 3.2
The failure occurs only on the first hpost (bringup) immediately
following a power cycle of the system boards/platform. The failure does
not always occur. Failures have not been observed or reported on SSP
3.0 or 3.1.
Workaround:
Run hpost (bringup) again. The second and all subsequent hpost runs are error free until the next power cycle.
Resolution:
Fixed in SSP 3.4 and later.
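The three conditions above must all hold for the failure to be the known false report. As a small illustration, the check can be written as a single predicate. This Python sketch uses hypothetical parameter names, and the caller must supply the facts (it does not query the SSP).

```python
# SSP releases on which the bogus failure has been seen, per the list above.
AFFECTED_SSP_VERSIONS = {"3.1.1", "3.2"}


def mixed_ecache_failure_is_suspect(first_post_after_power_cycle,
                                    cpu_mhz, ssp_version):
    """Return True when a 'Mixed Ecache sizes' FAIL matches the known issue.

    All three conditions must hold: the run is the first hpost after a
    power cycle, the processors are 400MHz, and the SSP is at an
    affected release.
    """
    return bool(
        first_post_after_power_cycle
        and cpu_mhz == 400
        and ssp_version in AFFECTED_SSP_VERSIONS
    )


print(mixed_ecache_failure_is_suspect(True, 400, "3.1.1"))
print(mixed_ecache_failure_is_suspect(True, 400, "3.0"))
```

A True result only means the failure fits the known pattern. A second hpost run is still needed to confirm that the error does not recur.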
How do I determine the memory configuration of a system board?
DIMM information is available only in an interactive redx session from the SSP. The generic command is shdimm. The repeat command saves typing; the example below reports all 4 banks on system board 0.
WARNING: shdimm can crash a running domain!
redx> repeat 4 { shdimm 0 $loopcnt }
DIMMs 0.0[7:0] = 6F 6F 6F 6F 6F 6F 6F 6F
Type 6F:
0F Size/Org[4:0] Type[4:0] 128 MB dimm / 1 GB Bank
3 Speed[1:0] Type[6:5] 60 ns
0 Reserved Type[7]
DIMMs 0.1[7:0] = 6F 6F 6F 6F 6F 6F 6F 6F
Type 6F:
0F Size/Org[4:0] Type[4:0] 128 MB dimm / 1 GB Bank
3 Speed[1:0] Type[6:5] 60 ns
0 Reserved Type[7]
DIMMs 0.2[7:0] = FF FF FF FF FF FF FF FF
Type FF: Empty Socket
DIMMs 0.3[7:0] = FF FF FF FF FF FF FF FF
Type FF: Empty Socket
redx> repeat 4 { shdimm d $loopcnt }
DIMMs D.0[7:0] = 6B 6B 6B 6B 6B 6B 6B 6B
Type 6B:
0B Size/Org[4:0] Type[4:0] 32 MB dimm / 256 MB Bank
3 Speed[1:0] Type[6:5] 60 ns
0 Reserved Type[7]
DIMMs D.1[7:0] = 6B 6B 6B 6B 6B 6B 6B 6B
Type 6B:
0B Size/Org[4:0] Type[4:0] 32 MB dimm / 256 MB Bank
3 Speed[1:0] Type[6:5] 60 ns
0 Reserved Type[7]
DIMMs D.2[7:0] = FF FF FF FF FF FF FF FF
Type FF: Empty Socket
DIMMs D.3[7:0] = FF FF FF FF FF FF FF FF
Type FF: Empty Socket
To rifle through the entire platform, use:
repeat 16 { shdimm $loopcnt 0
shdimm $loopcnt 1
shdimm $loopcnt 2
shdimm $loopcnt 3 }
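The shdimm output splits each type byte into three bit fields: Type[4:0] for size/organization, Type[6:5] for speed, and Type[7] reserved. The Python sketch below reproduces that split for saved output. The lookup tables contain only the codes that appear in the sample output above; other codes are reported as unknown rather than guessed. The function names are illustrative.

```python
# Only codes seen in the sample output are listed; others are not
# documented in this article.
SIZE_ORG = {
    0x0F: "128 MB dimm / 1 GB Bank",
    0x0B: "32 MB dimm / 256 MB Bank",
}
SPEED = {0x3: "60 ns"}


def decode_dimm_type(type_byte):
    """Split a DIMM type byte into the fields that shdimm prints."""
    if type_byte == 0xFF:
        return "Empty Socket"
    size_org = type_byte & 0x1F         # Type[4:0]
    speed = (type_byte >> 5) & 0x3      # Type[6:5]
    reserved = (type_byte >> 7) & 0x1   # Type[7]
    return {
        "size_org": SIZE_ORG.get(size_org, "unknown code %02X" % size_org),
        "speed": SPEED.get(speed, "unknown code %X" % speed),
        "reserved": reserved,
    }


def decode_dimm_line(line):
    """Decode 'DIMMs 0.0[7:0] = 6F 6F ...' into (bank, [decoded types])."""
    label, values = line.split("=", 1)
    bank = label.split()[1].split("[")[0]
    return bank, [decode_dimm_type(int(v, 16)) for v in values.split()]


bank, dimms = decode_dimm_line("DIMMs 0.0[7:0] = 6F 6F 6F 6F 6F 6F 6F 6F")
print(bank, dimms[0])
print(decode_dimm_type(0x6B))
print(decode_dimm_type(0xFF))
```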
How do I determine what DTAGs are on a system board?
DTAG information can be read from a Recordstop Dump, Arbstop
Dump, or a live platform. Dump files only contain the DTAG information
for those system boards in the domain that produces the dump file.
redx> repeat 4 { shdtag 0 $loopcnt };
DTAG 0.0 Component IDs[2:0] = 100000E3 100000E3 100000E3
DTAG 0.1 Component IDs[2:0] = 100000E3 100000E3 100000E3
DTAG 0.2 Component IDs[2:0] = 100000E3 100000E3 100000E3
DTAG 0.3 Component IDs[2:0] = 100000E3 100000E3 100000E3
Component ID | SRAM Vendor
---|---
00000000 | System board not present in dumpfile/platform
100000E3, 100050E3 | Sony
01910149, 11910149 | IBM
To rifle through the entire platform, use:
repeat 16 { shdtag $loopcnt 0
shdtag $loopcnt 1
shdtag $loopcnt 2
shdtag $loopcnt 3 }
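When shdtag output has been saved for the whole platform, the vendor lookup can be automated. This Python sketch maps the component IDs from the table above to vendor names. IDs not listed in this article are reported as unknown. The function name is an illustrative choice.

```python
# Component ID to SRAM vendor, copied from the table in this article.
SRAM_VENDORS = {
    "00000000": "system board not present",
    "100000E3": "Sony",
    "100050E3": "Sony",
    "01910149": "IBM",
    "11910149": "IBM",
}


def dtag_vendors(line):
    """Parse a line such as
    'DTAG 0.0 Component IDs[2:0] = 100000E3 100000E3 100000E3'
    and return (dtag, [vendor, ...]).
    """
    label, ids = line.split("=", 1)
    dtag = label.split()[1]
    vendors = [SRAM_VENDORS.get(cid.upper(), "unknown") for cid in ids.split()]
    return dtag, vendors


print(dtag_vendors("DTAG 0.0 Component IDs[2:0] = 100000E3 100000E3 100000E3"))
```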
How do I read part/serial numbers?
Serial number data can be read for the centerplane (cp),
system boards (sys), control boards (ctlbd), centerplane support boards
(csb), memory mezzanines (mem), and I/O mezzanines (io).
redx> eepr cp 0
Serial number eeprom of centerplane 0:
Assembly Part Number 501-6509-04 Rev 01 Serial Number 28R301696
Programmed on Thu Jan 16 12:11:11 1997
redx> eepr sys 0
Serial number eeprom of system board 0:
Assembly Part Number 501-4347-10 Rev 50 Serial Number 28Q736115
Programmed on Mon Nov 17 14:28:20 1997
redx> eepr ctlbd 0
Serial number eeprom of control board 0:
Assembly Part Number 501-4345-05 Rev 50 Serial Number 28R301232
Programmed on Mon Apr 7 10:01:15 1997
redx> eepr csb 0
Serial number eeprom of cplane sup board 0:
Assembly Part Number 501-4346-04 Rev 50 Serial Number 28R301054
Programmed on Mon Apr 14 08:36:13 1997
redx> eepr mem 0
Serial number eeprom of memory module 0:
Assembly Part Number 501-4351-04 Rev 50 Serial Number 28R303807
Programmed on Tue Mar 25 10:57:04 1997
redx> eepr io 0
I/O module type on board 0: code = 01: 2 * (SYSIO w/ 2 SBus slots)
Serial number eeprom of I/O module 0:
Assembly Part Number 501-4349-50 Rev 52 Serial Number 28B008654
Programmed on Wed Dec 3 15:05:00 1997
Part/serial number data can be obtained from Arb/Record Stop
files for all components except control boards. Control board data
requires an interactive redx session from the SSP.
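For inventory purposes, the part number, revision, and serial number can be pulled out of saved eepr output with a regular expression. The sketch below is in Python and assumes the 'Assembly Part Number ... Rev ... Serial Number ...' line is formatted exactly as in the examples above. The function name is hypothetical.

```python
import re

# Pattern for the 'Assembly Part Number' line printed by eepr.
EEPROM_LINE = re.compile(
    r"Assembly Part Number (?P<part>\S+)\s+"
    r"Rev (?P<rev>\S+)\s+"
    r"Serial Number (?P<serial>\S+)"
)


def parse_eepr(text):
    """Return a dict with part, rev, and serial, or None if not found."""
    match = EEPROM_LINE.search(text)
    return match.groupdict() if match else None


SAMPLE = """\
Serial number eeprom of system board 0:
Assembly Part Number 501-4347-10 Rev 50 Serial Number 28Q736115
Programmed on Mon Nov 17 14:28:20 1997
"""

print(parse_eepr(SAMPLE))
```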
How do I read thermcal data?
Thermcal data applies only to system boards and the centerplane.
This command is primarily used to validate that thermcal data has
been written to a component.
redx> eepr -T sys 7
Serial number eeprom of system board 7:
Assembly Part Number 501-4347-09 Rev 50 Serial Number 28Q736308
Programmed on Mon Mar 24 08:33:37 1997
Asic thermistors calibrated at 26.953 degrees-C. 5 thermistors:
9B3 89F 949 8FB 86C
redx> eepr -T cp 0
Serial number eeprom of centerplane 0:
Assembly Part Number 501-6509-04 Rev 01 Serial Number 28R301696
Programmed on Thu Jan 16 12:11:11 1997
Asic thermistors calibrated at 25.228 degrees-C. 10 thermistors:
994 9FC A1D 9FC 9A4 9E4 99F A3C 874 884
The component will report an error if it has not been thermcal'ed. Part/serial number data for system boards can be obtained from
Arb/Record Stop files if data for those boards is included in the dump.
Centerplane information must be collected in an interactive redx
session from the SSP.
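Since the command is mainly a presence check, one simple validation is to confirm that the number of thermistor values matches the count in the header line. The Python sketch below does this for the output of a single eepr -T command. The hex values are treated as opaque tokens because this article does not define their meaning, and the function name is illustrative.

```python
import re

# Captures the calibration temperature, the stated thermistor count,
# and everything that follows (the hex values).
THERMCAL = re.compile(
    r"calibrated at ([\d.]+) degrees-C\.\s+(\d+) thermistors:\s*(.*)",
    re.S,
)


def parse_thermcal(text):
    """Parse the thermcal part of one 'eepr -T' output.

    Returns None when no thermcal data is found. Otherwise returns the
    calibration temperature, the raw hex tokens, and whether the number
    of tokens matches the stated count.
    """
    match = THERMCAL.search(text)
    if match is None:
        return None
    expected = int(match.group(2))
    values = match.group(3).split()
    return {
        "temperature_c": float(match.group(1)),
        "values": values,
        "complete": len(values) == expected,
    }


SAMPLE = """\
Asic thermistors calibrated at 26.953 degrees-C. 5 thermistors:
9B3 89F 949 8FB 86C
"""

print(parse_thermcal(SAMPLE))
```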
How do I see what hpost passed to OBP?
After hpost finishes, it builds a 'post2obp' structure and
stores it in the BBSRAM of the bootproc of a given domain. It's also
known as the board descriptor array. The structure can be viewed using
an interactive redx session.
1. Obtain the boot processor of the domain in question.
# cat /var/opt/SUNWssp/etc/southpark/kenny/bootproc
32
2. Using redx, set the current processor to the bootproc and dump the 'post2obp' structure.
redx> proc 32
Current proc set to 8.0 = 32
redx> p2o
p2o_magic = XFPOST_2OBP p2o_struct_version = 010F0000
Created by pid = 864 running at level 17 on Mon Nov 22 15:30:36 1999
Bus configuration = 3F ShuffleMode = 0 Flags = 00000000
Interconnect Freq = 99902435 Hz Processor Ext Freq = 199804870 Hz.
Processor Internal to Interconnect frequency ratio = 4.
Board Descriptor Array:
Proc M/Grp IOC/Slot CIC PC XDB LDPTH
Brd Gen 3210 3210 1/3210 0/3210 3210 210 3210 10
0: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | G=Good
1: x xxxx x/xxxx x/xxxx x/xxxx xxxx xxx xxxx xx | f=Failed
2: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | m=Missing
3: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | b=Blacklisted
4: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | r=Redlisted
5: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | c=Crunched
6: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | _=Undefined
7: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | x=NotInDomain
8: G GGGG G/mmGG G/__mG G/__mG GGGG GGG GGGG GG | u=G,unconfig
9: G GGGb G/mmGG G/__mG G/__Gm GGGG GGG GGGG GG
A: G GGGG G/mmGG G/__GG G/__GG GGGG GGG GGGG GG
B: G GGGG G/GGGG G/__Gm G/__GG GGGG GGG GGGG GG
C: c mmmm m/cccc c/__mm c/__mm cccc ccc cccc cc
D: G GGGG G/mmGG G/__GG G/__mG GGGG GGG GGGG GG
E: G GGGG G/GGGG c/__mm c/__mm GGGG GGG GGGG GG
F: G GGGG G/bmGG G/__GG G/__mG GGGG GGG GGGG GG
Memory total: 7 chunks, 2162688 8KB pages (16896 MBytes):
PA = 010.00000000 262144 Pages (2048 MBytes)
PA = 012.00000000 262144 Pages (2048 MBytes)
PA = 014.00000000 262144 Pages (2048 MBytes)
PA = 016.00000000 524288 Pages (4096 MBytes)
PA = 01A.00000000 65536 Pages (512 MBytes)
PA = 01C.00000000 524288 Pages (4096 MBytes)
PA = 01E.00000000 262144 Pages (2048 MBytes)
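The memory summary can be cross-checked by hand: each page is 8 KB, so a chunk's size in MBytes is pages x 8 / 1024. The Python sketch below repeats that arithmetic for the seven chunks in the sample output and confirms the reported totals. The chunk values are copied from the output above and the variable names are illustrative.

```python
PAGE_KB = 8  # page size stated in the 'Memory total' line

# Physical address -> page count, copied from the sample p2o output.
CHUNKS = {
    "010.00000000": 262144,
    "012.00000000": 262144,
    "014.00000000": 262144,
    "016.00000000": 524288,
    "01A.00000000": 65536,
    "01C.00000000": 524288,
    "01E.00000000": 262144,
}


def pages_to_mbytes(pages):
    """Convert a count of 8 KB pages to MBytes."""
    return pages * PAGE_KB // 1024


total_pages = sum(CHUNKS.values())
for address, pages in CHUNKS.items():
    print("PA = %s %7d Pages (%d MBytes)" % (address, pages, pages_to_mbytes(pages)))
print("Total: %d chunks, %d pages (%d MBytes)"
      % (len(CHUNKS), total_pages, pages_to_mbytes(total_pages)))
```

The totals agree with the 'Memory total' line: 7 chunks, 2162688 pages, 16896 MBytes.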