Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

Solution 1010750.1: Sun Fire[TM] 12K/15K/E20K/E25K: Domain fails HPOST test "PCI IOC External Functional Tests, SUBTEST=PCI IOC DMA External loopback Tests"
Previously Published As: 214835
Applies to:

Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms

Symptoms

A Sun Fire[TM] 12K/15K/E20K/E25K domain fails HPOST (Hardware Power On Self Test) with the following error. The IOC (I/O Controller) on the I/O board, and the HBAs controlled by that IOC, are taken out of the configuration by HPOST.

NOTE: If the SC is running SMS 1.4 or higher, CHS (Component Health Status) sets the IOC status to faulty, which prevents the IOC (and its HBAs) from being configured into the domain again at all.

The post log shows the failure:

ERROR: TEST=PCI IOC External Functional Tests,SUBTEST=PCI IOC DMA External loopback Tests ID=196.0

As displayed above, IOC 1 has failed, and the cassettes and HBAs it controls were crunched by the failure, leaving them unusable. How do we correct this situation?

Cause

The test that failed is the "PCI IOC External Functional Tests". As the name indicates, the failure occurred while testing devices external to the IOC: the I/O cassettes and the HBAs (and the pathway to those devices - see the NOTE below). The subtest that failed in this case is the "PCI IOC DMA External loopback Tests". The loopback test is a functional test which confirms the sanity of the devices on the bus.

NOTE: The RIO and SBBC asics on the I/O board share the bus with PCI slot 1, which is controlled by IOC 0. So any failure on IOC 0 also includes the RIO and SBBC asics as additional suspects.

This is a hardware error, and as such Oracle Technical Support should be contacted and a Service ticket created. Troubleshooting an error like this might involve moving hardware around or replacing components, and only a certified Support Field Engineer is authorized to do this.

Solution

The HPOST failure

Compare error: {SB10/P0/C0} Data miss compare

This is a data compare test (it compares what was sent against what was received), and as seen in the post log, the expected and observed data show that a bit was flipped. The bit flip is the result of bad hardware on the bus (a quick way to locate the flipped bit is sketched after the suspect list below).

NOTE: The reference to a CPU in the HPOST failure example (SB10/P0/C0) does not imply that the CPU is bad. It simply shows that this CPU is executing the test associated with this failure. The CPU is NOT suspect.

"Geography Lesson"

Document 1017493.1 shows the layout of an I/O board; refer to it if you need to see the board layout visually. Cassette slots 0 and 1 share IOC 0, while slots 2 and 3 share IOC 1. As noted above, the I/O board's RIO and SBBC asics share the slot 1 bus, so they are also on IOC 0. So, if a failure of this type occurs on IOC 0, the suspect list is (in no particular order):

IOC 0
Slot 0 (cassette/HBA)
Slot 1 (cassette/HBA)
RIO asic
SBBC asic
Pathway/Interconnect
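As an aside on reading the compare error: XORing the expected and observed data words isolates the flipped bit. A minimal sketch, assuming a ksh93 or bash shell on the SC; the two data values here are hypothetical stand-ins - the real values appear in your post log:

sms-svc> expected=0xfedcba9876543210   # hypothetical expected data word
sms-svc> observed=0xfedcba9876543218   # hypothetical observed data word
sms-svc> printf "flipped bits = 0x%x\n" $(( expected ^ observed ))
flipped bits = 0x8

A result with exactly one bit set (here bit 3) confirms the single-bit flip described above.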
Component under test

Looking at the POST output:

{SB10/P0/C0} Component under test: /IO10/P1: RIO

As noted in the "Geography Lesson", this shouldn't make sense: HPOST identifies the component under test as the IO10/P1 RIO, yet the RIO is not associated with IOC 1 but with IOC 0. This is Bug ID 6201351: the RIO is incorrectly identified as the component under test. Ignore the mention of the RIO and focus on the IOC that is implicated. In the example from the Problem Statement section, the implicated component is IO10/P1 = IOC 1.

So it is now known that some hardware is bad on the IO10/IOC 1 bus. According to the "Geography Lesson" section, the list of suspects is:

IOC 1
Slot 2
Slot 3
Pathway/Interconnect

This means an I/O board, two I/O cassettes, interconnects, and possibly HBAs are all on the list of possible FRUs (Field Replaceable Units).

Narrowing Down the FRU List

There are ways to narrow down the list of FRUs and resolve this issue without resorting to a mass FRU replacement. Some facts must first be established (a command sketch for step 1 follows this list):

1) Is the failure consistent?

A consistent error is one that happens every time HPOST is executed: no changes to the configuration have occurred, and the same level of HPOST is executed each time, yet the error recurs.

NOTE: SMS 1.4 and above will disable the reported IOC via CHS, so the component must be re-enabled to confirm that HPOST fails consistently in this configuration - see the Relief/Workaround section below for details.

A consistent error makes troubleshooting easy, because any hardware change we initiate has a direct, definitive effect on the results of HPOST: either the change resolves the error or it does not change it at all.

An inconsistent error is one that does not occur on every HPOST cycle. This is a difficult failure to troubleshoot, because it is unknown whether a hardware change truly resolved the error or the error is simply being inconsistent again. See the Additional Information section for advice on how to troubleshoot an inconsistent error of this type.

2) Have any recent service actions taken place that relate to the suspect component list?

What has recently changed on the domain that is associated with the implicated IOC? Has new hardware been installed - perhaps an HBA or a cassette was replaced? If a recent service action took place on the implicated IOC, that service action should be the focus of the troubleshooting process. Perhaps the cassette or HBA is a DOA (Dead On Arrival) part, or the action itself caused slot damage (cassette slot). Try reseating the part just replaced or installed and see what the resulting HPOST shows. Even better, try executing HPOST (if the failure is consistent) without the new part installed and see if the failure ceases. If HPOST passes, the root cause is the new part; if it still fails, the new part is not the cause. The important point is that the troubleshooting steps must account for any changes that have occurred to this list of suspects: a failure that follows a change is most likely a result of that change, and that change should be the starting point of troubleshooting.

3) I/O cassettes which are empty can still be related to the failure, so do not rule them out just because they are empty.
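To illustrate step 1, here is a minimal sketch of confirming a consistent failure from the MAIN SC as user sms-svc. The domain tag A and the post-log location are assumptions - adjust both for your site:

# Re-enable the CHS-disabled IOC so HPOST will exercise it again
sms-svc> setchs -s "ok" -r "troubleshooting" -c IO10/P1

# Cycle the virtual keyswitch to force a fresh HPOST run
sms-svc> setkeyswitch -d A off
sms-svc> setkeyswitch -d A on

# Check the newest post log for the same loopback failure
# (assumed log location; SMS post logs are commonly kept under
# /var/opt/SUNWSMS/adm/<domain>/post)
sms-svc> grep "PCI IOC DMA External loopback" /var/opt/SUNWSMS/adm/A/post/*

Repeat the cycle a few times: a consistent error reappears on every run, an inconsistent one does not.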
Resolution

As stated above, this is a hardware error, and as such Oracle Technical Support should be contacted and a Service ticket created. Troubleshooting an error like this might involve moving hardware around or replacing components, and only a certified Sun Support Field Engineer is authorized to do this.

Root cause is determined by ruling out suspects through troubleshooting. Focus first on recent service actions, as shown above. If none exist, remove the suspects from the configuration and add them back individually until the failure is reproduced, then take the appropriate action as troubleshooting dictates. For the example in the Problem Statement section, the troubleshooting steps could be: remove the slot 2 and slot 3 cassettes, confirm HPOST passes, then reinsert them one at a time until the failure reproduces (a command sketch follows below).
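A sketch of that remove/add-back cycle, again as sms-svc on the MAIN SC. The physical cassette pulls are field-engineer actions, so only the SC-side commands are shown, and the domain tag A is an assumption:

# 1) Field engineer removes the slot 2 and slot 3 cassettes (the IOC 1 suspects)

# 2) Clear the CHS status and re-run HPOST with the suspects absent
sms-svc> setchs -s "ok" -r "troubleshooting" -c IO10/P1
sms-svc> setkeyswitch -d A off
sms-svc> setkeyswitch -d A on    # if HPOST now passes, the fault left with a cassette

# 3) Reinsert one cassette at a time, repeating step 2 after each insertion,
#    until the failure reproduces - the last part added is the prime suspect

# 4) If the failure persists with both cassettes removed, the I/O board
#    (IOC 1 itself or the slot interconnect) remains the suspect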
NOTE 2: Troubleshooting may prove that the root cause is a particular I/O cassette slot: the cassette is replaced, yet errors persist. If this happens, the slot itself becomes suspect, meaning the I/O board is the next component to replace. Do not be confused if this situation occurs during troubleshooting.

The Problem Statement example above came from a "real world" case in which the root cause component turned out to be the I/O cassette in slot 2. That I/O cassette was empty. In other situations the resolution will vary as a result of troubleshooting the failure.

Relief/Workaround

If SMS 1.4 or above is installed on the platform, CHS will disable the IOC implicated in the post log and prevent that IOC from being configured in all future HPOST runs. This feature prevents repeated domain outages caused by the same bad component. For the troubleshooting reasons shown above, it is important to be able to reset this status so the root cause of the failure can be identified. We reset the status on the MAIN SC (System Controller), as user sms-svc:

1) Show the CHS-disabled part in question:

sms-svc> showchs -v -c IO10

This shows that the component disabled by CHS is IO10/P1, and that it was disabled by POST.

2) Re-enable the device:

sms-svc> setchs -s "ok" -r "troubleshooting" -c IO10/P1

Any time an HPOST failure occurs, the component status must be reset before continuing to the next stage of troubleshooting.
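Putting the two commands together, a minimal before-and-after check; the exact showchs output format varies by SMS release, so treat the sequence as a sketch:

# Before: confirm which component CHS has disabled, and why
sms-svc> showchs -v -c IO10

# Re-enable it for troubleshooting
sms-svc> setchs -s "ok" -r "troubleshooting" -c IO10/P1

# After: confirm IO10/P1 now reports a status of "ok"
sms-svc> showchs -v -c IO10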
Additional Information

Troubleshooting an inconsistent error: as defined above, an inconsistent error is one that does not occur on every HPOST cycle of the same level. There is no "rhythm" to the error - it does not reliably fail, say, every third HPOST or every other HPOST. If the error cannot be reliably reproduced, it is an inconsistent error, and troubleshooting it is therefore more difficult.

As shown above, there is a list of suspect components that are all implicated as possible root cause FRUs for an error of this type:

I/O Board
I/O cassettes (2)
HBAs (2)

The goal should be to replace as few components as possible - for cost reasons, but more importantly to limit the "touches" on the platform (reducing the chance of inflicting more harm and of inserting DOA components). More important still to the customer, we need to limit outages and failures on the platform, so we must make the correct replacement as quickly as possible. In an inconsistent failure mode, the correct action may not resolve the problem the first time, but we should choose the action wisely to have the largest likelihood of resolving the problem on the first attempt.

Replacing the I/O board, without any other history to suggest otherwise, is a logical first step. If errors persist, the I/O cassettes are the next FRUs to focus on.

Oracle SUPPORT - Please refer to the note in the Internal Only section of this document.

Product

Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server

Internal Section

Support Note: Please escalate the case to the Hw SPARC Team before recommending replacement of all components on the implicated IOC bus. We must avoid "mass swaps" of hardware components; an expert engineer may be able to direct the troubleshooting attention to where it should instead be focused. Please open a Collaboration SR to L2 engineers for any case that you feel needs assistance in explanation or troubleshooting.
Keywords: 12k, 15k, e20k, e25k, HPOST, POST, IOC, FAIL, CHS, cassette, crunched

Previously Published As: 79388

Attachments: This solution has no attachment