Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Troubleshooting Sure Solution
1362005.1: Sun SPARC Enterprise[TM] M3000/M4000/M5000/M8000/M9000 (OPL) Servers: Troubleshooting PCIEX-8000-KP and SUNOS-8000-FU fault codes produced by Solaris FMA
Oracle Confidential (PARTNER). Do not distribute to customers
Applies to:
Sun SPARC Enterprise M4000 Server - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun SPARC Enterprise M3000 Server - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun SPARC Enterprise M8000 Server - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun SPARC Enterprise M9000-32 Server - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun SPARC Enterprise M9000-64 Server - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Oracle Solaris on SPARC (64-bit)
Oracle Solaris on SPARC (32-bit)

Purpose
This document is an overall guide to troubleshooting OPL systems on which Solaris FMA on a System Domain reports the SUNOS-8000-FU and/or the PCIEX-8000-KP faults in "fmadm faulty" output.

Please note that on M3000/M4000/M5000/M8000/M9000 series systems, Solaris FMA is responsible for the majority of PCI fault troubleshooting. The XSCF only produces errors of the type FMD-8000-11 when these types of issues occur, and those errors are not useful in diagnosing the true problem.

Last Review Date
February 7, 2012

Instructions for the Reader
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.
Troubleshooting Details

Issue Verification
Verify that the faults exist by running "fmadm faulty" on the Solaris Domain as the root user (or confirm with the explorer outputs in the /fma directory).

Faults reported will be similar to this example:

--------------- ------------------------------------ -------------- ---------

Note: Most often both faults are reported together, but at times only one of them is reported without the other.

Resolution

SUNOS-8000-FU
If only the SUNOS-8000-FU is reported, without the associated PCIEX-8000-KP, then manual inspection of the "fmdump -eV" output is required to determine the device path of the fault. Example:

Sep 02 2011 03:13:48.955933600 ereport.io.pci.fabric

1. The SUNOS-8000-FU fault is caused by a bug: the FMA diagnosis engine is unable to diagnose the "ereport.io.pci.sec-rserr" error coming from the PCI bus. This bug has been resolved in <SunPatch:146855-01> and is documented in this Sun Alert: Healthy Solaris 10 SPARC Systems May Incorrectly Report Hardware Errors During PCIE Correctable Events <Document 1369869.1>

2. If this is the only fault being reported, very few PCI events are expected in the "fmdump -eV" output. The fault can then be safely cleared with "fmadm repair <uuid>", and the customer should be advised to apply the patch.

3. If the PCIEX-8000-KP fault is being reported in conjunction with SUNOS-8000-FU, verify that the "ereport.io.pci.sec-rserr" error is coming from the same device path as the card blamed in the "fmadm faulty" output for the PCIEX-8000-KP. If they are coming from the same device path, the fault can be cleared with "fmadm repair <uuid>" and the patch recommended to the customer.

3a. If the error has come from a different PCI bus than the PCIEX-8000-KP fault, then the number of errors on that bus should be checked manually from the "fmdump -eV" output. If the number of errors is low (<= 36 per hour), the fault can be cleared and, again, the patch recommended to the customer.

Sun Bug Reference: 6960665 SUNOS-8000-FU reported on ereport.io.pci.sec-rserr on OPL during PCI CE events

PCIEX-8000-KP
For this event, the "fmadm faulty" output already identifies the affected device that is taking the correctable errors.

1. The first step is to check the frequency of the errors being reported. The "fmdump -e" output is the simplest way to check this:

Sep 02 03:13:48.9559 ereport.io.pci.fabric
Please note that this issue regarding the SERD error rates has been documented in this SunAlert, which also includes further commands (fmstat) that can be run to investigate the SERD decision engines: Solaris 10 SPARC Kernel Patch 137137-09 May Cause Erroneous PCIEX-8000-KP Reports During PCIE Correctable Events <Document 1369835.1>

Sun Bug Reference: 7051331 SPARC Solaris IO FMA s10u6 and later causing false IO hardware faults

Note: If the device that reports the PCI correctable errors is an Aura F20 card (PCI Express Flash Accelerator F20 SAS HBA), please check the following bug before proceeding with the troubleshooting steps below:
Bug 6997490: PIC fabric errors seen during OPL M4/5000 production testing of Aura F20
The instructions to fix this issue on OPL Systems for this scenario are available here.
3a. If the fault returns after PCI card re-seat and pin cleaning, then the error rates must be checked again. If the error rate is <= 36 per hour, then the resolution is in Step 2.

Sun Bug Reference: 6907573 Repeated PCIEX-8000-KP errors on iou#0-pci#2 even after replacing the card.

4. If you have reached this step, then the next action is to replace the PCI card with a new FRU stock unit.

4a. If the fault returns after PCI card replacement, then the error rates must be checked again. If the error rate is <= 36 per hour, then the resolution is in Step 2.

5. If you have reached this step, then the next action is to replace the IOU in which the PCI card is installed with a new FRU stock unit.

5a. If the fault returns after IOU replacement, then the error rates must be checked again. If the error rate is <= 36 per hour, then the resolution is in Step 2.

6. If you have reached this step, then the card model must be checked. If the card being faulted is an 8-Port 3Gbps SAS/SATA HBA PN: 375-3487 (known internally as Pandora), there is a known rare bug with this card. If this card model is the one continually being faulted, it must be replaced with another card model, specifically the PCI Express 8-Port 6Gbps SAS HBA PN: 375-3641 (known internally as Erie). As this is a rare fault, an FCO has not been deemed justified, and Field Services should treat this as a CIC of the Pandora card(s) to get the Erie replacement(s) for the customer.

Sun Bug Reference: 7002517 Repeated PCIEX-8000-KP errors reopen of 6907573

7. If you have reached this step, engagement of a Senior Level Domain Engineer within the TSC SPARC OPL team is required.

Product
M3000 M4000 M5000 M8000 M9000

Addendum
To determine the correctable error (CE) rate for PCIEX-8000-KP events on a particular system, you may use one of the following methods:

1. Use the findfma tool available on cores2 (/cores_data/local/bin/findfma). It must be run against the affected system's "fmdump -eV" output (note that the fmdump-eV.out file is included in the fma directory of the explorer output).

2. Use a one-line counter on the errlog collected in explorer (fma/var/fm/fmd), for example:

% fmdump -e -c 'ereport.io.pciex.rc.ce-msg' -n "detector.device-path=/pci@2,600000*" -t 01/01/11 errlog | cut -b1-9 | uniq -c | awk '{print $2,$3,"2011",$4":00,"$1}' | egrep -v TIME | sort -t, -rn +1 | head

Example output (date/time hour, CE rate):

Nov 15 2011 23:00,537
Nov 01 2011 17:00,322
Oct 28 2011 16:00,235

NOTE: Substitute the device path of whichever adapter is listed as faulty by FMA. The command sorts by CE rate, so the highest correctable error rate per hour appears first. If it is higher than 36, then the patch won't help.

You can also see how the CE rate varies over time by dropping the sort. This may be useful to see whether the rate was influenced by a recent insertion or other cassette manipulation:

% fmdump -e -c 'ereport.io.pciex.rc.ce-msg' -n "detector.device-path=/pci@2,600000*" -t 01/01/11 errlog | cut -b1-9 | uniq -c | awk '{print $2,$3,"2011",$4":00,"$1}' | egrep -v TIME

References
<NOTE:1369835.1> - Solaris 10 SPARC Kernel Patch 137137-09 May Cause Erroneous PCIEX-8000-KP Reports During PCIE Correctable Events
<NOTE:1369869.1> - Healthy Solaris 10 SPARC Systems May Incorrectly Report Hardware Errors During PCIE Correctable Events

Attachments
This solution has no attachment
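Supplementary sketch: the hourly counting performed by the one-line command in the Addendum can also be done offline on saved "fmdump -e" text, for example where only explorer output is available. The Python below is a minimal illustration under stated assumptions, not an Oracle tool. The function names (hourly_ce_rates, exceeds_threshold) are hypothetical, the input is assumed to be already filtered to the event class and device path of interest, and the year is passed in as a parameter because "fmdump -e" timestamps do not carry one (the command in the Addendum hardcodes "2011" for the same reason). Apart from the first timestamp, the sample lines are made up for illustration.

```python
from collections import Counter

# Hourly correctable-error threshold quoted in this document.
CE_THRESHOLD_PER_HOUR = 36


def hourly_ce_rates(lines, year="2011"):
    """Count events per hour from `fmdump -e` style lines.

    Each event line is assumed to look like:
        Sep 02 03:13:48.9559 ereport.io.pci.fabric
    Returns (hour label, count) pairs, highest count first.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines and the "TIME CLASS" header row.
        if len(fields) < 4 or fields[0] == "TIME":
            continue
        month, day, timestamp = fields[0], fields[1], fields[2]
        hour = timestamp.split(":")[0]
        counts[(month, day, hour)] += 1
    rows = [(f"{m} {d} {year} {h}:00", n) for (m, d, h), n in counts.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


def exceeds_threshold(rows, threshold=CE_THRESHOLD_PER_HOUR):
    """True if the busiest hour is above the threshold."""
    return bool(rows) and rows[0][1] > threshold


# Illustrative input only; real input would be saved `fmdump -e` output.
sample = [
    "TIME                 CLASS",
    "Sep 02 03:13:48.9559 ereport.io.pciex.rc.ce-msg",
    "Sep 02 03:14:02.1234 ereport.io.pciex.rc.ce-msg",
    "Sep 02 04:01:10.0001 ereport.io.pciex.rc.ce-msg",
]
rates = hourly_ce_rates(sample)
for hour_label, count in rates:
    print(f"{hour_label},{count}")
print("above threshold:", exceeds_threshold(rates))
```

Unlike "uniq -c", which counts runs of adjacent identical lines, the Counter groups events by hour regardless of their order in the input, so unsorted or concatenated logs give the same totals. Logs that span more than one calendar year should be split by year before counting, because the hour labels would otherwise collide.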