Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1362005.1
Update Date:2012-02-07

Solution Type  Troubleshooting Sure

Solution  1362005.1 :   Sun SPARC Enterprise[TM] M3000/M4000/M5000/M8000/M9000 (OPL) Servers: Troubleshooting PCIEX-8000-KP and SUNOS-8000-FU fault codes produced by Solaris FMA  


Related Items
  • Sun SPARC Enterprise M9000-32 Server
  • Sun SPARC Enterprise M8000 Server
  • Sun SPARC Enterprise M9000-64 Server
  • Sun SPARC Enterprise M4000 Server
  • Sun SPARC Enterprise M3000 Server
  • Sun SPARC Enterprise M5000 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Mx000
  • .Old GCS Categories>Sun Microsystems>Servers>OPL Servers




In this Document
  Purpose
  Last Review Date
  Instructions for the Reader
  Troubleshooting Details
     Issue Verification
     Resolution
  References


Oracle Confidential (PARTNER). Do not distribute to customers
Reason: Troubleshooting steps require actions not allowed for customers to do

Applies to:

Sun SPARC Enterprise M4000 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun SPARC Enterprise M3000 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun SPARC Enterprise M8000 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun SPARC Enterprise M9000-32 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun SPARC Enterprise M9000-64 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Oracle Solaris on SPARC (64-bit)
Oracle Solaris on SPARC (32-bit)

Purpose

This document is an overall guide to troubleshooting OPL systems on which Solaris FMA, running on a System Domain, reports the SUNOS-8000-FU and/or PCIEX-8000-KP faults in "fmadm faulty" output.

Please note that on M3000/M4000/M5000/M8000/M9000 series systems, Solaris FMA is responsible for the majority of PCI fault troubleshooting. When these types of issues occur, the XSCF only produces errors of the type FMD-8000-11, which are not useful in diagnosing the true problem.

Last Review Date

February 7, 2012

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Issue Verification

Verify the faults exist by running "fmadm faulty" on the Solaris Domain as the root user (or confirm with the outputs from explorer in the /fma directory).

Faults reported will be similar to this example:

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 17 08:21:39 8f503397-4226-eba6-9d1f-ebf8f2ac6df2 PCIEX-8000-KP Major

Host : <system host name>
Platform : SUNW,SPARC-Enterprise Chassis_id : <system serial number>

Fault class : fault.io.pciex.device-interr-corr max 25%
fault.io.pciex.bus-linkerr-corr max 13%
Affects : dev:////pci@10,600000/pci@0/pci@9
faulted but still in service
dev:////pci@10,600000/pci@0/pci@9/SUNW,emlxs@0,1
ok and in service
dev:////pci@10,600000/pci@0/pci@9/SUNW,emlxs@0
ok and in service
FRU : "iou#1" (hc:///component=iou#1) 25%
faulty
"iou#1-pci#1" (hc://:product-id=SUNW,SPARC-Enterprise:chassis-id=B
EF08506C0:server-id=cdb1.dc1.prod/chassis=0/ioboard=1/hostbridge=0/pciexrc=0/pci
exbus=2/pciexdev=0/pciexfn=0/pciexbus=3/pciexdev=9/pciexfn=0/pciexbus=119/pciexd
ev=0) max 25%
repair attempted

Description : Too many recovered bus errors have been detected, which indicates
a problem with the specified bus or with the specified
transmitting device. This may degrade into an unrecoverable
fault.
Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response : One or more device instances may be disabled

Impact : Loss of services provided by the device instances associated with
this fault

Action : If a plug-in card is involved check for badly-seated cards or
bent pins. Otherwise schedule a repair procedure to replace the
affected device. Use fmadm faulty to identify the device or
contact Sun for support.

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 17 08:21:25 ac02f2f7-d203-6661-acb9-b3bb9c070d92 SUNOS-8000-FU Major

Host : <system host name>
Platform : SUNW,SPARC-Enterprise Chassis_id : <system serial number>

Fault class : defect.sunos.eft.undiag.fme

Description : The diagnosis engine encountered telemetry for which it was
unable to perform a diagnosis. Refer to
http://sun.com/msg/SUNOS-8000-FU for more information.

Response : Error reports have been logged for examination by Sun.

Impact : Automated diagnosis and response for these events will not occur.

Action : Ensure that the latest Solaris Kernel and Predictive Self-Healing
(PSH) patches are installed.

*Note - both faults are most often reported together, but at times only one of the two is reported.
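
When working from saved output (for example the explorer fma directory), the event IDs needed later for "fmadm repair <uuid>" can be pulled out of the summary lines with a short filter. The following is a minimal sketch only: the function name list_opl_pci_faults is hypothetical, and it assumes the summary line layout shown in the example above (EVENT-ID in the fourth field, MSG-ID in the fifth).

```shell
# Hypothetical helper (not part of FMA): read "fmadm faulty" output on stdin
# and print "<MSG-ID> <EVENT-ID>" for each PCIEX-8000-KP or SUNOS-8000-FU fault.
# Assumes the summary line layout: Mon DD HH:MM:SS EVENT-ID MSG-ID SEVERITY.
list_opl_pci_faults() {
    awk '$5 == "PCIEX-8000-KP" || $5 == "SUNOS-8000-FU" { print $5, $4 }'
}

# Typical use on the domain (as root):
#   fmadm faulty | list_opl_pci_faults
```

Only the two message IDs covered by this document are matched; other faults in the same output are ignored.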

Resolution

SUNOS-8000-FU

If only the SUNOS-8000-FU is reported, without the associated PCIEX-8000-KP, then manual inspection of the "fmdump -eV" outputs will be required to determine the device path of the fault.

Example:

Sep 02 2011 03:13:48.955933600 ereport.io.pci.fabric
nvlist version: 0
class = ereport.io.pci.fabric
ena = 0xfc4a6a8a88404801
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@10,600000/pci@0/pci@9/SUNW,emlxs@0
(end detector)

bdf = 0x7700
device_id = 0xfc20
vendor_id = 0x10df
rev_id = 0x2
dev_type = 0x0
pcie_off = 0x44
pcix_off = 0x0
aer_off = 0x100
ecc_ver = 0x0
pci_status = 0x10
pci_command = 0x147
pcie_status = 0x1
pcie_command = 0x203f
pcie_dev_cap = 0x6409a4
pcie_adv_ctl = 0x1f4
pcie_ue_status = 0x0
pcie_ue_mask = 0x0
pcie_ue_sev = 0x62011
pcie_ue_hdr0 = 0x4008001
pcie_ue_hdr1 = 0x1000703
pcie_ue_hdr2 = 0x22020000
pcie_ue_hdr3 = 0x220200
pcie_ce_status = 0x1
pcie_ce_mask = 0x0
remainder = 0x1
severity = 0x3
__ttl = 0x1
__tod = 0x4e60ac5c 0x38fa63a0

Sep 02 2011 03:13:48.955933200 ereport.io.pci.sec-rserr
nvlist version: 0
ena = 0xfc4a6a8a88404801
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@10,600000/pci@0/pci@9
(end detector)

class = ereport.io.pci.sec-rserr
pci-sec-status = 0x4000
pci-bdg-ctrl = 0x3
__ttl = 0x1
__tod = 0x4e60ac5c 0x38fa6210
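
On a long "fmdump -eV" log, the manual inspection described above can be narrowed down by listing only the device paths that raised "ereport.io.pci.sec-rserr". The following is a sketch under stated assumptions: the function name sec_rserr_paths is hypothetical, and it relies on the layout of the example above, where each event starts with a five-field header line ending in the class name and the first "device-path" line after it belongs to the detector.

```shell
# Hypothetical helper: read "fmdump -eV" output on stdin and print
# "<count> <device-path>" for every detector that reported
# ereport.io.pci.sec-rserr.
sec_rserr_paths() {
    awk '
        # Event header line: <Mon> <DD> <YYYY> <HH:MM:SS.nnnnnnnnn> <class>
        NF == 5 && $5 ~ /^ereport\./ { class = $5; seen = 0; next }
        # The first device-path after a header is taken as the detector path.
        $1 == "device-path" && $2 == "=" && !seen {
            seen = 1
            if (class == "ereport.io.pci.sec-rserr") count[$3]++
        }
        END { for (p in count) print count[p], p }
    ' | sort -k2
}

# Typical use:
#   fmdump -eV | sec_rserr_paths
```

The resulting path can then be compared with the "Affects" device path shown by "fmadm faulty" for any PCIEX-8000-KP fault.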

1. The SUNOS-8000-FU fault is caused by a Bug in the FMA diagnostic engine, which is unable to diagnose the "ereport.io.pci.sec-rserr" ereport coming from the PCI bus.  This Bug has been resolved in <SunPatch:146855-01> and is documented in this Sun Alert:

Healthy Solaris 10 SPARC Systems May Incorrectly Report Hardware Errors During PCIE Correctable Events <Document 1369869.1>

2. If this is the only fault being reported, very few PCI events are expected in the "fmdump -eV" output. In that case the fault can be safely cleared with "fmadm repair <uuid>" and the customer given the recommendation to apply the patch.

3. If the PCIEX-8000-KP fault is being reported in conjunction with SUNOS-8000-FU, simply verify the "ereport.io.pci.sec-rserr" fault is coming from the same device path as the card blamed in the "fmadm faulty" output for the PCIEX-8000-KP.  If they are coming from the same device path, then it can simply be cleared with the "fmadm repair <uuid>" and the patch recommended to the customer.  

3a. If the error has come from a different PCI bus than the one blamed in the PCIEX-8000-KP fault, then the number of errors on that bus should be checked manually from the "fmdump -eV" output. If the error count is low (<= 36 per hour), the fault can be cleared and, again, the patch recommended to the customer.


Sun Bug Reference: 6960665 SUNOS-8000-FU reported on ereport.io.pci.sec-rserr on OPL during PCI CE events



PCIEX-8000-KP

For this event, "fmadm faulty" output already gives us the affected device in question taking the correctable errors.

1. The first step is to check the frequency of the errors being reported; "fmdump -e" output is the simplest way to check for this:

Sep 02 03:13:48.9559 ereport.io.pci.fabric
Sep 02 03:13:48.9559 ereport.io.pci.fabric
Sep 02 03:13:48.9559 ereport.io.pci.fabric
Sep 02 03:13:48.9559 ereport.io.pci.fabric
Sep 02 03:13:48.9559 ereport.io.pci.sec-rserr
Sep 02 03:13:48.9559 ereport.io.pciex.pl.re
Sep 02 03:13:48.9559 ereport.io.pciex.rc.ce-msg
Sep 02 03:17:18.6640 ereport.io.pci.fabric
Sep 02 03:17:18.6640 ereport.io.pci.fabric
Sep 02 03:17:18.6640 ereport.io.pci.fabric
Sep 02 03:17:18.6640 ereport.io.pci.fabric
Sep 02 03:17:18.6640 ereport.io.pci.sec-rserr
Sep 02 03:17:18.6640 ereport.io.pciex.pl.re
Sep 02 03:17:18.6640 ereport.io.pciex.rc.ce-msg


The frequency of the errors is critical to the next step in the resolution path. 
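
Counting by eye becomes impractical on a busy log, so the per-hour frequency can be tallied with a small filter. This is a sketch only: the function name ereports_per_hour is hypothetical, and it counts a single ereport class (by default "ereport.io.pciex.rc.ce-msg", the class counted in the Addendum) because the sample above logs seven ereports at each timestamp and counting every line would inflate the rate. It assumes an awk that supports -v (nawk or /usr/xpg4/bin/awk on Solaris 10).

```shell
# Hypothetical helper: read "fmdump -e" output on stdin and print
# "<Mon> <DD> <HH>:00,<count>" for each clock hour, in log order.
#   $1 = ereport class to count (default: ereport.io.pciex.rc.ce-msg)
ereports_per_hour() {
    awk -v class="${1:-ereport.io.pciex.rc.ce-msg}" '
        $4 == class {
            key = $1 " " $2 " " substr($3, 1, 2) ":00"
            if (!(key in n)) order[++k] = key   # remember first-seen order
            n[key]++
        }
        END { for (i = 1; i <= k; i++) print order[i] "," n[order[i]] }
    '
}

# Typical use:
#   fmdump -e | ereports_per_hour
```

Any hour whose count exceeds 36 points to the >36 branch of the steps that follow.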

2. If the fault has occurred <= 36 times in an hour, then it should simply be cleared with "fmadm repair <uuid>".  This is a Solaris FMA Bug: the SERD error rates were changed in Solaris 10 Update 6 to produce the PCIEX-8000-KP fault with only 6 errors in a 2 hour time frame.   This Bug has been resolved in <SunPatch:147705-01>.  Solaris Patch 147705-01 must be installed on Solaris 10 Update 10; alternatively, apply the Solaris 10 Update 10 Feature Kernel Update patch <SunPatch:144500-19> first and then apply patch 147705-01.  *Only if the card is re-faulted after doing the Solaris upgrade and patch application would we continue to the next step.*

Please note that this issue regarding the SERD error rates has been documented in this SunAlert, and it includes further commands that can be run to investigate the SERD decision engines (fmstat):

Solaris 10 SPARC Kernel Patch 137137-09 May Cause Erroneous PCIEX-8000-KP Reports During PCIE Correctable Events <Document 1369835.1>


Sun Bug Reference: 7051331 SPARC Solaris IO FMA s10u6 and later causing false IO hardware faults


Note: if the device that reports the PCI correctable errors is an Aura F20 card (PCI Express Flash Accelerator F20 SAS HBA), please check the following bug before proceeding with troubleshooting steps below:

Bug 6997490: PIC fabric errors seen during OPL M4/5000 production testing of Aura F20

The instructions to fix this issue on OPL Systems for this scenario are available here.


3. If the fault has occurred >36 times in an hour, then the card is suspected of having a seating issue or a problem with its PCI pins. The PCI card should be removed from the system and from its cassette, and carefully inspected on both sides for any damage or contaminants on its PCI pins.   If no damage is found other than the normal scuffing seen parallel to the pins from insertion into the PCI slot, then the pins should be carefully wiped with an isopropyl alcohol wipe, taking care not to touch the pins after this is done.   The card should then be carefully installed in its cassette, verifying that it is properly mounted and engages fully and evenly when the lever on the cassette is in the locked position.   The card can now be inserted back into the system in the same slot, and brought back online.  Solaris FMA should then be checked again to see if the faults are still occurring (simplest to check "fmdump -e" every couple of minutes for 15 minutes with the card in full operation).

3a. If the fault returns after PCI card re-seat and pin cleaning, then the error rates must be checked again.  If the error rate is <= 36 per hour, then the resolution is in Step 2.
3b. If the fault returns after PCI card re-seat and pin cleaning, and the error rates are >36 per hour, only then would you move to Step 4.
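
The post-reseat check in Step 3 ("fmdump -e" every couple of minutes for 15 minutes) can be left running unattended with a simple loop. This is a rough sketch, not a supported tool: the function name watch_new_ereports is hypothetical, it only compares line counts of the error log, and it needs a POSIX shell such as ksh or bash rather than the legacy /bin/sh.

```shell
# Hypothetical helper: report how many new error-log lines have appeared
# since the first sample.
#   $1 = seconds between checks, $2 = number of checks,
#   remaining arguments = command to sample (default: fmdump -e)
watch_new_ereports() {
    interval=$1; rounds=$2; shift 2
    [ $# -gt 0 ] || set -- fmdump -e
    base=$("$@" | wc -l)            # line count before the observation window
    i=0
    while [ "$i" -lt "$rounds" ]; do
        sleep "$interval"
        now=$("$@" | wc -l)
        i=$((i + 1))
        echo "check $i: $((now - base)) new log lines"
    done
}

# Roughly 15 minutes at 2-minute intervals:
#   watch_new_ereports 120 8
```

A count that stays at zero for the whole window suggests the re-seat has held; any growth means the error rate must be checked again as described in 3a and 3b.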


Sun Bug Reference: 6907573 Repeated PCIEX-8000-KP errors on iou#0-pci#2 even after replacing the card.


4.  If you have reached this step, then the next action is to replace the PCI card with a new FRU stock unit.

4a. If the fault returns after PCI card replacement, then the error rates must be checked again. If the error rate is <= 36 per hour, then the resolution is in Step 2.
4b. If the fault returns after PCI card replacement, and the error rates are  >36 per hour, only then would you move to Step 5.

5. If you have reached this step, then the next action is to replace the IOU that the PCI card is installed in with a new FRU stock unit.

5a. If the fault returns after IOU replacement, then the error rates must be checked again. If the error rate is <= 36 per hour, then the resolution is in Step 2.
5b. If the fault returns after IOU replacement, and the error rates are >36 per hour, only then would you move to Step 6.

6. If you have reached this step, then the card model must be checked.  If the card being faulted is an 8-Port 3Gbps SAS/SATA HBA PN:375-3487, then there is a known rare bug with this card (known internally as Pandora).  If this card model is the one continually being faulted, then it must be replaced with another model card, specifically the PCI Express 8-Port 6Gbps SAS HBA PN: 375-3641 (known internally as Erie).  As this is a rare fault, an FCO has not been deemed justified, and Field Services should treat this as a CIC of the Pandora card(s) to get the Erie replacement(s) for the customer.


Sun Bug Reference: 7002517 Repeated PCIEX-8000-KP errors reopen of 6907573


7. If you have reached this step, engagement of a Senior Level Domain Engineer within the TSC SPARC OPL team is required.
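
The PCIEX-8000-KP steps above form a simple escalation ladder driven by two inputs: the highest error rate per hour and the last corrective action already tried. The sketch below restates that ladder for quick reference. The function name next_kp_step and the stage keywords (none, reseat, card, iou, model) are hypothetical labels introduced here, not FMA terms.

```shell
# Hypothetical helper: print the next action from the PCIEX-8000-KP steps.
#   $1 = highest observed error rate per hour for the faulted device
#   $2 = last corrective action already tried:
#        none | reseat | card | iou | model
next_kp_step() {
    if [ "$1" -le 36 ]; then
        echo "Step 2: clear with fmadm repair <uuid> and recommend the patch"
        return 0
    fi
    case "$2" in
        none)   echo "Step 3: remove, inspect, clean and re-seat the PCI card" ;;
        reseat) echo "Step 4: replace the PCI card" ;;
        card)   echo "Step 5: replace the IOU" ;;
        iou)    echo "Step 6: check the card model (PN 375-3487)" ;;
        model)  echo "Step 7: engage a Senior Level Domain Engineer" ;;
        *)      echo "unknown stage: $2" >&2; return 1 ;;
    esac
}

# Example: rate still high after the card was replaced
#   next_kp_step 537 card
```

The rate check comes first because, at every stage, a rate of 36 per hour or less leads back to Step 2.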


Product
M3000 M4000 M5000 M8000 M9000



Addendum

In order to determine the correctable error rate for PCIEX-8000-KP events on a particular system, you may use one of the following methods:

1. Use the findfma tool available on cores2 (/cores_data/local/bin/findfma); this needs to be run against the affected system's "fmdump -eV" output (note that the fmdump-eV.out file is included in the fma directory of the explorer output).

2. Use a single line counter on the errlog collected in explorer (fma/var/fm/fmd), for example:
% fmdump -e -c 'ereport.io.pciex.rc.ce-msg' -n "detector.device-path=/pci@2,600000*" -t 01/01/11 errlog  |  cut -b1-9 | uniq -c | awk '{print $2,$3,"2011",$4":00,"$1}' | egrep -v TIME | sort -t, -rn +1 | head
Example output (date/time hour, CE rate):
Nov 15 2011 23:00,537
Nov 01 2011 17:00,322
Oct 28 2011 16:00,235

NOTE: substitute the device path for whichever adapter is listed faulty by FMA.
This will sort the CE rate and show you the highest correctable error rate per hour. If it is higher than 36, then the patch won't help.
You can also see how the CE rate varies through time by dropping the sorting. This may be useful to see if the rate was influenced by a recent insertion or other cassette manipulation:
% fmdump -e -c 'ereport.io.pciex.rc.ce-msg' -n "detector.device-path=/pci@2,600000*" -t 01/01/11 errlog  |  cut -b1-9 | uniq -c | awk '{print $2,$3,"2011",$4":00,"$1}' | egrep -v TIME
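
The "date hour,count" lines produced by either pipeline above can be reduced to a single verdict against the 36-per-hour threshold. This is a minimal sketch: the function name peak_ce_rate and the wording of the verdict are hypothetical, and the input is assumed to be exactly in the example output format shown above.

```shell
# Hypothetical helper: read "<date> <hour>,<count>" lines on stdin and print
# the busiest hour, its count, and how it compares with the 36-per-hour
# threshold used in this document.
peak_ce_rate() {
    awk -F, '
        $2 + 0 > max { max = $2 + 0; when = $1 }
        END {
            if (when == "") { print "no events"; exit }
            print when "," max "," (max > 36 ? "above threshold" : "within threshold")
        }
    '
}

# Typical use: append "| peak_ce_rate" to the unsorted pipeline above.
```

Because only the maximum is kept, the input does not need to be sorted first.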


References

<NOTE:1369835.1> - Solaris 10 SPARC Kernel Patch 137137-09 May Cause Erroneous PCIEX-8000-KP Reports During PCIE Correctable Events
<NOTE:1369869.1> - Healthy Solaris 10 SPARC Systems May Incorrectly Report Hardware Errors During PCIE Correctable Events

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.