Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

Solution 1010750.1: Sun Fire[TM] 12K/15K/E20K/E25K: Domain fails HPOST test "PCI IOC External Functional Tests, SUBTEST=PCI IOC DMA External loopback Tests"
Previously Published As: 214835
Applies to:

Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms

Symptoms

A Sun Fire[TM] 12K/15K/E20K/E25K domain fails HPOST (Hardware Power On Self Test) with the following error. The IOC (I/O Controller) on the I/O board, and the HBAs controlled by that IOC, are taken out of the configuration by HPOST.

NOTE: If the SC is running SMS 1.4 or higher, CHS (Component Health Status) sets the IOC status to faulty, which prevents the IOC (and its HBAs) from being configured into the domain again at all.

The post log shows the failure:

ERROR: TEST=PCI IOC External Functional Tests,SUBTEST=PCI IOC DMA External loopback Tests ID=196.0

As displayed above, IOC 1 has failed, and the cassettes and HBAs it controls were crunched by the failure, leaving them unusable. How do we correct this situation?

Cause

The test that failed is the "PCI IOC External Functional Tests". As the name indicates, the failure occurred while testing devices external to the IOC: the I/O cassettes and the HBAs (and the pathway to those devices - see the NOTE below). The subtest that failed in this case is the "PCI IOC DMA External loopback Tests". The loopback test is a functional test which confirms the sanity of the devices on the bus.

NOTE: The RIO and SBBC asics on the I/O board share the bus with PCI slot 1, which is controlled by IOC 0. So any failure on IOC 0 also includes the RIO and SBBC asics as additional suspects.

This is a hardware error, and as such Oracle Technical Support should be contacted and a Service ticket created. Troubleshooting an error like this might involve moving hardware around or replacing components, and only a certified Support Field Engineer is authorized to do this.

Solution

The HPOST failure

Compare error: {SB10/P0/C0} Data miss compare

This is a data compare test (it compares what was sent against what was received), and as seen in the post log, the expected and observed data show that a bit was flipped. The bit flip is the result of bad hardware on the bus (a quick way to locate the flipped bit is sketched after the suspect list below).

NOTE: The reference to a CPU in the HPOST failure example (SB10/P0/C0) does not imply that the CPU is bad. It simply shows that this CPU is executing the test associated with this failure. The CPU is NOT suspect.

"Geography Lesson"

Document 1017493.1 shows the layout of an I/O board; refer to it if you need to see the board layout visually. Cassette slots 0 and 1 share IOC 0, while slots 2 and 3 share IOC 1. As noted above, the I/O board's RIO and SBBC asics share the slot 1 bus, so they are also on IOC 0. So, if a failure of this type occurs on IOC 0, the suspect list is (in no particular order):

IOC 0
Slot 0 (cassette/HBA)
Slot 1 (cassette/HBA)
RIO asic
SBBC asic
Pathway/Interconnect
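As an aside on reading the compare error: XORing the expected and observed data words isolates the flipped bit. A minimal sketch, assuming a ksh93 or bash shell on the SC; the two data values here are hypothetical stand-ins - the real values appear in your post log:

sms-svc> expected=0xfedcba9876543210   # hypothetical expected data word
sms-svc> observed=0xfedcba9876543218   # hypothetical observed data word
sms-svc> printf "flipped bits = 0x%x\n" $(( expected ^ observed ))
flipped bits = 0x8

A result with exactly one bit set (here bit 3) confirms the single-bit flip described above.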
Component under test

Looking at the POST output:

{SB10/P0/C0} Component under test: /IO10/P1: RIO

As noted in the "Geography Lesson", this shouldn't make sense: HPOST identifies the component under test as the IO10/P1 RIO, yet the RIO is not associated with IOC 1 but with IOC 0. This is Bug ID 6201351: the RIO is incorrectly identified as the component under test. Ignore the mention of the RIO and focus on the IOC that is implicated. In the example from the Problem Statement section, the implicated component is IO10/P1 = IOC 1.

So it is now known that some hardware is bad on the IO10/IOC 1 bus. According to the "Geography Lesson" section, the list of suspects is:

IOC 1
Slot 2
Slot 3
Pathway/Interconnect

This means an I/O board, two I/O cassettes, interconnects, and possibly HBAs are all on the list of possible FRUs (Field Replaceable Units).

Narrowing Down the FRU List

There are ways to narrow down the list of FRUs and resolve this issue without resorting to a mass FRU replacement. Some facts must first be established (a command sketch for step 1 follows this list):

1) Is the failure consistent?

A consistent error is one that happens every time HPOST is executed: no changes to the configuration have occurred, and the same level of HPOST is executed each time, yet the error recurs.

NOTE: SMS 1.4 and above will disable the reported IOC via CHS, so the component must be re-enabled to confirm that HPOST fails consistently in this configuration - see the Relief/Workaround section below for details.

A consistent error makes troubleshooting easy, because any hardware change we initiate has a direct, definitive effect on the results of HPOST: either the change resolves the error or it does not change it at all.

An inconsistent error is one that does not occur on every HPOST cycle. This is a difficult failure to troubleshoot, because it is unknown whether a hardware change truly resolved the error or the error is simply being inconsistent again. See the Additional Information section for advice on how to troubleshoot an inconsistent error of this type.

2) Have any recent service actions taken place that relate to the suspect component list?

What has recently changed on the domain that is associated with the implicated IOC? Has new hardware been installed - perhaps an HBA or a cassette was replaced? If a recent service action took place on the implicated IOC, that service action should be the focus of the troubleshooting process. Perhaps the cassette or HBA is a DOA (Dead On Arrival) part, or the action itself caused slot damage (cassette slot). Try reseating the part just replaced or installed and see what the resulting HPOST shows. Even better, try executing HPOST (if the failure is consistent) without the new part installed and see if the failure ceases. If HPOST passes, the root cause is the new part; if it still fails, the new part is not the cause. The important point is that the troubleshooting steps must account for any changes that have occurred to this list of suspects: a failure that follows a change is most likely a result of that change, and that change should be the starting point of troubleshooting.

3) I/O cassettes which are empty can still be related to the failure, so do not rule them out just because they are empty.
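To illustrate step 1, here is a minimal sketch of confirming a consistent failure from the MAIN SC as user sms-svc. The domain tag A and the post-log location are assumptions - adjust both for your site:

# Re-enable the CHS-disabled IOC so HPOST will exercise it again
sms-svc> setchs -s "ok" -r "troubleshooting" -c IO10/P1

# Cycle the virtual keyswitch to force a fresh HPOST run
sms-svc> setkeyswitch -d A off
sms-svc> setkeyswitch -d A on

# Check the newest post log for the same loopback failure
# (assumed log location; SMS post logs are commonly kept under
# /var/opt/SUNWSMS/adm/<domain>/post)
sms-svc> grep "PCI IOC DMA External loopback" /var/opt/SUNWSMS/adm/A/post/*

Repeat the cycle a few times: a consistent error reappears on every run, an inconsistent one does not.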
Resolution

As stated above, this is a hardware error, and as such Oracle Technical Support should be contacted and a Service ticket created. Troubleshooting an error like this might involve moving hardware around or replacing components, and only a certified Sun Support Field Engineer is authorized to do this.

Root cause is determined by ruling out suspects through troubleshooting. Focus first on recent service actions, as shown above. If none exist, remove the suspects from the configuration and add them back individually until the failure is reproduced, then take the appropriate action as troubleshooting dictates. For the example in the Problem Statement section, the troubleshooting steps could be: remove the slot 2 and slot 3 cassettes, confirm HPOST passes, then reinsert them one at a time until the failure reproduces (a command sketch follows below).
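A sketch of that remove/add-back cycle, again as sms-svc on the MAIN SC. The physical cassette pulls are field-engineer actions, so only the SC-side commands are shown, and the domain tag A is an assumption:

# 1) Field engineer removes the slot 2 and slot 3 cassettes (the IOC 1 suspects)

# 2) Clear the CHS status and re-run HPOST with the suspects absent
sms-svc> setchs -s "ok" -r "troubleshooting" -c IO10/P1
sms-svc> setkeyswitch -d A off
sms-svc> setkeyswitch -d A on    # if HPOST now passes, the fault left with a cassette

# 3) Reinsert one cassette at a time, repeating step 2 after each insertion,
#    until the failure reproduces - the last part added is the prime suspect

# 4) If the failure persists with both cassettes removed, the I/O board
#    (IOC 1 itself or the slot interconnect) remains the suspect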
NOTE 2: Troubleshooting may prove that the root cause is a particular I/O cassette slot: the cassette is replaced, yet errors persist. If this happens, the slot itself becomes suspect, meaning the I/O board is the next component to replace. Do not be confused if this situation occurs during troubleshooting.

The Problem Statement example above came from a "real world" case in which the root cause component turned out to be the I/O cassette in slot 2. That I/O cassette was empty. In other situations the resolution will vary as a result of troubleshooting the failure.

Relief/Workaround

If SMS 1.4 or above is installed on the platform, CHS will disable the IOC implicated in the post log and prevent that IOC from being configured in all future HPOST runs. This feature prevents repeated domain outages caused by the same bad component. For the troubleshooting reasons shown above, it is important to be able to reset this status so the root cause of the failure can be identified. We reset the status on the MAIN SC (System Controller), as user sms-svc:

1) Show the CHS-disabled part in question:

sms-svc> showchs -v -c IO10

This shows that the component disabled by CHS is IO10/P1, and that it was disabled by POST.

2) Re-enable the device:

sms-svc> setchs -s "ok" -r "troubleshooting" -c IO10/P1

Any time an HPOST failure occurs, the component status must be reset before continuing to the next stage of troubleshooting.
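Putting the two commands together, a minimal before-and-after check; the exact showchs output format varies by SMS release, so treat the sequence as a sketch:

# Before: confirm which component CHS has disabled, and why
sms-svc> showchs -v -c IO10

# Re-enable it for troubleshooting
sms-svc> setchs -s "ok" -r "troubleshooting" -c IO10/P1

# After: confirm IO10/P1 now reports a status of "ok"
sms-svc> showchs -v -c IO10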
Additional Information

Troubleshooting an inconsistent error: as defined above, an inconsistent error is one that does not occur on every HPOST cycle of the same level. There is no "rhythm" to the error - it does not reliably fail, say, every third HPOST or every other HPOST. If the error cannot be reliably reproduced, it is an inconsistent error, and troubleshooting it is therefore more difficult.

As shown above, there is a list of suspect components that are all implicated as possible root cause FRUs for an error of this type:

I/O Board
I/O cassettes (2)
HBAs (2)

The goal should be to replace as few components as possible - for cost reasons, but more importantly to limit the "touches" on the platform (reducing the chance of inflicting more harm and of inserting DOA components). More important still to the customer, we need to limit outages and failures on the platform, so we must make the correct replacement as quickly as possible. In an inconsistent failure mode, the correct action may not resolve the problem the first time, but we should choose the action wisely to have the largest likelihood of resolving the problem on the first attempt.

Replacing the I/O board, without any other history to suggest otherwise, is a logical first step. If errors persist, the I/O cassettes are the next FRUs to focus on.

Oracle SUPPORT - Please refer to the note in the Internal Only section of this document.

Product

Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server

Internal Section

Support Note: Please escalate the case to the Hw SPARC Team before recommending replacement of all components on the implicated IOC bus. We must avoid "mass swaps" of hardware components; an expert engineer may be able to direct the troubleshooting attention to where it should instead be focused. Please open a Collaboration SR to L2 engineers for any case that you feel needs assistance in explanation or troubleshooting.
Keywords: 12k, 15k, e20k, e25k, HPOST, POST, IOC, FAIL, CHS, cassette, crunched

Previously Published As: 79388

Attachments: This solution has no attachment