Document fins/I0852-1
FIN #: I0852-1
SYNOPSIS: Some pcisch driver panics on F15K systems are unrelated to failed
hardware
DATE: Nov/15/02
KEYWORDS: Some pcisch driver panics on F15K systems are unrelated to failed
hardware
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
SYNOPSIS: Some pcisch driver panics on F15K systems are unrelated to
failed hardware.
SunAlert: No
TOP FIN/FCO REPORT: No
PRODUCT_REFERENCE: Sun Fire 15K
PRODUCT CATEGORY: Server / Service
PRODUCTS AFFECTED:
Systems Affected:
-----------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- F15K ALL Sun Fire 15000 -
X-Options Affected:
-------------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- - - - -
PART NUMBERS AFFECTED:
Part Number Description Model
----------- ----------- -----
375-3030-01 PCI Dual FC Network Adapter+ -
375-3019-01 PCI Single FC Host Adapter -
501-6302-03 or lower hsPCI I/O Board (w/ Cassettes) -
501-5397-11 or lower hsPCI I/O Board (w/o Cassettes) -
501-5599-07 or lower 3.3V hsPCI Cassette -
REFERENCES:
BugId: 4699182 - OS panics w/ PCI SERR that H/W replacements
don't alleviate.
ESC: 537306 - SWON/LT/ system generated core file.
MANUAL: 806-3512-10: Sun Fire 15K System Service Manual.
URL: http://infoserver.central/data/816/816-5002/pdf/816-5002-11.pdf
PROBLEM DESCRIPTION:
The pcisch driver may panic on Sun Fire 15000 domains due to a parity
error on the PCI Bus. In most cases this is due to a faulty hardware
component. However, in some cases the panic cannot be corrected by
replacing a hardware FRU. This second scenario may result in multiple
unexpected domain failures if not corrected. This FIN describes how to
diagnose and correct this type of pcisch driver panic.
It is important to note that the panic stack for this problem is
IDENTICAL to the panic stack that is produced as a result of bad
hardware. It is imperative when diagnosing these types of errors that
the field troubleshoot the issue as faulty hardware first. Only if
the panic persists, or repeatedly moves to other pcisch instances,
should the field attribute the problem to the issue outlined in this FIN.
Panics in the pcisch driver cover a wide range of possible failures.
In this case, the control status register (CSR) calls out the detection
of bad parity on the PCI bus:
WARNING: pcisch-19: PCI fault log start:
PCI SERR
PCI error occurred on device #0
dwordmask=0 bytemask=0
pcisch-19: PCI primary error (0):pcisch-19: PCI secondary error
(0):pcisch-19:
PBM AFAR 0.00000000:WARNING: pcisch19: PCI config space
CSR=0xc2a0<signaled-system-error,detected-parity-error>
pcisch-19: PCI fault log end.
panic[cpu128]/thread=2a10001fd20: pcisch-19: PCI bus 3 error(s)!
000002a10001bea0 pcisch:pbm_error_intr+148 (30000b643d8, 2772, 30000b84548,
3,
30000b643d8, 3)
%l0-3: 00000300008b9860 0000000000004000 0000000000000000 0000030000b86584
%l4-7: 00000300009978c8 0000030008d03ea8 0000000000000000 0000030008d03ed0
000002a10001bf50 unix:current_thread+44 (0, ffffffffffffffff, 0, 300335b3528,
0, 1044f340)
%l0-3: 0000000010007450 000002a10001f061 000000000000000e 0000000000000016
%l4-7: 0000000000010000 00000300339922a8 000000000000000b 000002a10001f910
000002a10001f9b0 unix:disp_getwork+40 (1044e398, 0, 1044f340, 10457310, 2, 0)
%l0-3: 000000001010e2d8 0000000010509e00 00000300335bd518 000002a100c37d20
%l4-7: 000002a100cebd20 0000000002736110 0000000000000000 000002a10001f9c0
000002a10001fa60 unix:idle+a4 (0, 0, 80, 1044e398, 3000096d980, 0)
%l0-3: 0000000010043d58 2030205b275d2076 616c20696e646578 000002a10011dd20
%l4-7: 70636220290a2020 202e22202073703a 20222031205b275d 2076616c20696e64
NOTE: The stack itself can differ from case to case. What matters are
the CSR values (specifically the "detected-parity-error" bit).
With every other panic of this nature, a hardware replacement has
resolved the case. However, with one customer, repeated hardware
replacements did not resolve the issue. The customer's issue has since
been replicated on multiple machines in an engineering environment.
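Because the diagnosis hinges on the CSR value in the fault log rather
than on the stack, it can help to decode that value directly. The
following Python sketch is illustrative only. It assumes that the CSR
printed in the log is the standard 16-bit PCI status register, which is
consistent with the two flag names shown in the log above. The function
name and the labels for bits not shown in this FIN are hypothetical.

```python
import re

# Error bits of the standard 16-bit PCI status register.
# Assumption: the "CSR" value in the fault log is this register. Only the
# two top flag names appear in this FIN; the other labels are illustrative.
PCI_STATUS_ERROR_BITS = {
    15: "detected-parity-error",
    14: "signaled-system-error",
    13: "received-master-abort",
    12: "received-target-abort",
    11: "signaled-target-abort",
    8: "master-data-parity-error",
}


def decode_csr(log_text):
    """Return the set of error flags in the first CSR=0x... value found.

    Returns None when the text contains no CSR value.
    """
    match = re.search(r"CSR=0x([0-9a-fA-F]+)", log_text)
    if match is None:
        return None
    value = int(match.group(1), 16)
    return {
        name
        for bit, name in PCI_STATUS_ERROR_BITS.items()
        if value & (1 << bit)
    }


if __name__ == "__main__":
    line = "CSR=0xc2a0<signaled-system-error,detected-parity-error>"
    print(sorted(decode_csr(line)))
```

For the value in this FIN (0xc2a0), bits 15 and 14 are set, which matches
the two flags the driver printed.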
There are some unique factors that are needed to create this scenario:
A. To date, this problem has only been seen on 375-3030 (Crystal+)
cards.
B. All the panics have been in either slot 0 or slot 2 of the I/O Boat.
(Slots 0 and 2 are the lower 66 MHz slots.)
C. Schizo 2.3 seems to bring the problem out with more regularity.
D. Veritas software (specifically adding mirrors to volumes) seems
to increase the likelihood of failure.
Steps for Diagnosis
===================
As a reminder, when looking at an F15K I/O boat, the slots are designated:
-----------------------------------------------------
| Schizo 1, leaf B (33Mhz) | Schizo 0, leaf B (33Mhz) |
|--------------------------+--------------------------|
| Schizo 1, leaf A (66Mhz) | Schizo 0, leaf A (66Mhz) |
-----------------------------------------------------
OR
-----------------
| Slot 3 | Slot 1 |
| OR | OR |
| X.1.1.1| X.1.0.1|
|--------+--------|
| Slot 2 | Slot 0 |
| OR | OR |
| X.1.1.0| X.1.0.0|
-----------------
NOTE: X = hsPCI number (0-17)
To diagnose the pcisch panic from the above stack, follow these steps:
a) Use the /etc/path_to_inst file on the domain or the cfgadm/rcfgadm
commands to isolate the slot. For example, using the two methods with
the panic above (pcisch-19):
# grep pcisch /etc/path_to_inst
"/pci@3d,600000" 7 "pcisch"
"/pci@1c,700000" 0 "pcisch"
"/pci@3c,700000" 4 "pcisch"
"/pci@9d,600000" 19 "pcisch" <----------
"/pci@9c,600000" 17 "pcisch"
"/pci@3c,600000" 5 "pcisch"
"/pci@5d,600000" 11 "pcisch"
"/pci@7d,600000" 15 "pcisch"
"/pci@1c,600000" 1 "pcisch"
"/pci@1d,600000" 3 "pcisch"
"/pci@5c,700000" 8 "pcisch"
"/pci@7c,700000" 12 "pcisch"
"/pci@7c,600000" 13 "pcisch"
"/pci@9c,700000" 16 "pcisch"
"/pci@9d,700000" 18 "pcisch"
"/pci@3d,700000" 6 "pcisch"
"/pci@5c,600000" 9 "pcisch"
"/pci@1d,700000" 2 "pcisch"
"/pci@7d,700000" 14 "pcisch"
"/pci@5d,700000" 10 "pcisch"
"/pci@11c,700000" 20 "pcisch"
"/pci@11c,600000" 21 "pcisch"
"/pci@11d,700000" 22 "pcisch"
"/pci@11d,600000" 23 "pcisch"
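In this case, instance 19 is "/pci@9d,600000". To translate that into a
slot location, convert the 9d to binary <10011101>, then split it into
fields to obtain <100 1110 1>. The first field (100) is expander 4, the
middle field (1110) is skipped, and the last field (1) is PCI
controller (Schizo) 1, the pair of slots on the left in the diagram
above. In the listings in this FIN, the ,600000 paths correspond to the
lower slots (0 and 2) and the ,700000 paths to the upper slots (1 and
3), which makes this slot 2.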
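The binary breakdown above can be sketched in Python as follows. This is
an illustration, not a supported tool. The function and variable names
are hypothetical, and the mapping from the ,600000 and ,700000 suffixes
to slot numbers is an assumption inferred from comparing the
path_to_inst and rcfgadm listings in this FIN.

```python
import re

# Matches pcisch entries as they appear in /etc/path_to_inst in this FIN.
ENTRY_RE = re.compile(r'"/pci@([0-9a-f]+),([0-9a-f]+)"\s+(\d+)\s+"pcisch"')

# Assumption inferred from the listings in this FIN, not from driver
# documentation: ,600000 paths are slots 0/2 and ,700000 paths are 1/3.
LEAF_OFFSETS = {0x600000: 0, 0x700000: 1}


def decode_entry(line):
    """Decode one path_to_inst line into expander, Schizo, and slot."""
    match = ENTRY_RE.search(line)
    if match is None:
        return None
    agent = int(match.group(1), 16)   # e.g. 0x9d = 100 1110 1
    offset = int(match.group(2), 16)  # e.g. 0x600000
    schizo = agent & 0x1              # last field: Schizo 0 or 1
    leaf = LEAF_OFFSETS.get(offset)
    return {
        "instance": int(match.group(3)),
        "expander": agent >> 5,        # first field: expander number
        "middle": (agent >> 1) & 0xF,  # middle field, skipped in the FIN
        "schizo": schizo,
        "slot": None if leaf is None else 2 * schizo + leaf,
    }


if __name__ == "__main__":
    print(decode_entry('"/pci@9d,600000" 19 "pcisch"'))
```

For the example entry, this yields expander 4, Schizo 1, and slot 2,
which agrees with the e04b1slot2 location reported by rcfgadm for
pcisch19.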
The other option is to use the conversion which the dynamic
reconfiguration interface provides:
# rcfgadm -d a -la | grep pcisch
pcisch0:e00b1slot1 pci-pci/hp connected configured ok
pcisch10:e02b1slot3 unknown connected unconfigured unknown
pcisch11:e02b1slot2 pci-pci/hp connected configured ok
pcisch12:e03b1slot1 pci-pci/hp connected configured ok
pcisch13:e03b1slot0 pci-pci/hp connected configured ok
pcisch14:e03b1slot3 unknown connected unconfigured unknown
pcisch15:e03b1slot2 pci-pci/hp connected configured ok
pcisch16:e04b1slot1 unknown connected unconfigured unknown
pcisch17:e04b1slot0 pci-pci/hp connected configured ok
pcisch18:e04b1slot3 unknown connected unconfigured unknown
--> pcisch19:e04b1slot2 unknown empty unconfigured unknown
pcisch1:e00b1slot0 unknown empty unconfigured unknown
pcisch20:e08b1slot1 unknown empty unconfigured unknown
pcisch21:e08b1slot0 pci-pci/hp connected configured ok
pcisch22:e08b1slot3 unknown empty unconfigured unknown
pcisch23:e08b1slot2 unknown empty unconfigured unknown
pcisch2:e00b1slot3 unknown connected unconfigured unknown
pcisch3:e00b1slot2 pci-pci/hp connected configured ok
pcisch4:e01b1slot1 pci-pci/hp connected configured ok
pcisch5:e01b1slot0 unknown empty unconfigured unknown
pcisch6:e01b1slot3 unknown connected unconfigured unknown
pcisch7:e01b1slot2 pci-pci/hp connected configured ok
pcisch8:e02b1slot1 pci-pci/hp connected configured ok
pcisch9:e02b1slot0 unknown connected unconfigured unknown
In this case, the issue is on expander 4 (e04), I/O board (b1), slot 2.
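The attachment point IDs in the rcfgadm output already encode the
location, so they can be picked apart mechanically. The Python sketch
below is a minimal illustration that assumes the
pcisch<N>:e<XX>b<Y>slot<Z> naming seen in the listing above. The
function names are hypothetical.

```python
import re

# Attachment point IDs as shown in the rcfgadm listing in this FIN,
# for example "pcisch19:e04b1slot2".
AP_ID_RE = re.compile(r"pcisch(\d+):e(\d+)b(\d+)slot(\d+)")


def parse_ap_id(line):
    """Extract instance, expander, board, and slot from one output line."""
    match = AP_ID_RE.search(line)
    if match is None:
        return None
    instance, expander, board, slot = (int(g) for g in match.groups())
    return {
        "instance": instance,
        "expander": expander,
        "board": board,
        "slot": slot,
    }


def locate_instance(output_lines, instance):
    """Return the location of a given pcisch instance, or None."""
    for line in output_lines:
        parsed = parse_ap_id(line)
        if parsed is not None and parsed["instance"] == instance:
            return parsed
    return None


if __name__ == "__main__":
    sample = [
        "pcisch18:e04b1slot3 unknown connected unconfigured unknown",
        "pcisch19:e04b1slot2 unknown empty unconfigured unknown",
    ]
    print(locate_instance(sample, 19))
```

Searching by instance number rather than by a plain text match avoids
confusing, for example, pcisch1 with pcisch19.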
b) Once you identify the correct location, there are three FRUs which
could be causing the parity error: the hsPCI I/O board (p/n:
501-6302-03 or lower, or 501-5397-11 or lower), the 3.3V cassette
(p/n: 501-5599-07 or lower), or the adapter itself.
To narrow down the problem, employ standard hardware
troubleshooting techniques and move/replace one hardware FRU at a
time (CPRE recommends moving/replacing the adapter, then the
cassette, and finally the hsPCI I/O board). If the problem follows a
FRU (on a move) or the panics stop (on a replacement), CPAS the
offending FRU.
In the event that you are unable to follow this process, it may become
necessary to replace all three FRUs at once. However, this is not
recommended as this could impact FRU availability and will increase
service costs to Sun.
c) Once you identify a failing FRU and have taken appropriate action,
track the machine's availability for an appropriate amount of
time. Depending on how long it took to identify the failing FRU, the
recommendation is to run the machine for twice as long as the panic
interval. In some cases, that is 1 hour, while in others that is
24 days. If the problem persists or shows up on another pcisch
instance, the machine could be experiencing the problem reported in
Bug 4699182. Please escalate to CPRE.
d) Once CPRE verifies that the customer is experiencing this issue,
choose one of the workaround options listed in the "Corrective Action"
section.
The root cause of pcisch driver panics that are unrelated to faulty
hardware is still under investigation. There is no final fix at this
time. In the meantime, use the recommended workarounds mentioned in
the Corrective Action section below.
IMPLEMENTATION:
---
| | MANDATORY (Fully Proactive)
---
---
| | CONTROLLED PROACTIVE (per Sun Geo Plan)
---
---
| X | REACTIVE (As Required)
---
CORRECTIVE ACTION:
The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the
above-mentioned problem.
Troubleshoot pcisch driver panics on F15K domains as outlined above.
If the problem is determined NOT to be caused by faulty hardware,
implement one of the workarounds below.
A. Replace the 375-3030 (Crystal+) cards with 375-3019 (Amber) cards.
This has been shown to alleviate the issue after extensive testing.
OR
B. Move all 375-3030 cards to either slot 1 or slot 3. This assumes
there are enough I/O boats.
OR
C. Upgrade the 375-3030 (Crystal+) cards to 375-3108 (Crystal-2A). This
will require new drivers to be installed and LC-SC or LC-LC Fibre
Cables. See Product Note 816-5002 for details:
http://infoserver.central/data/816/816-5002/pdf/816-5002-11.pdf
COMMENTS:
None
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------
Copyright (c) 1997-2003 Sun Microsystems, Inc.