Document Audience: | INTERNAL |
Document ID: | A0209-1 |
Title: | Sun Fire 15K & Sun Fire 12K with Crystal+ cards may experience panics in the pcisch driver. |
Copyright Notice: | Copyright © 2007 Sun Microsystems, Inc. All Rights Reserved |
Update Date: | Fri Jun 13 00:00:00 MDT 2003 |
----------------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
----------------------------------------------------------------------------
FIELD CHANGE ORDER
(For Authorized Distribution by Enterprise Services)
FCO #: A0209-1
Status: inactive
Synopsis: Sun Fire 15K & Sun Fire 12K with Crystal+ cards may experience panics in the pcisch driver.
Date: Jun/13/2003
SunAlert: No
Top FIN/FCO Report: No
Products Reference: Sun Fire 15K/12K
Product Category: Server / System Component
Product Affected:
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- F15K - Sun Fire 15K -
- F12K - Sun Fire 12K -
X-Options Affected
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
X6727A F15K/F12K - PCI Dual FC Network Adapter+ -
Parts Affected:
Part Number Description Model
----------- ----------- -----
375-3030-xx PCI Dual FC Network Adapter+ -
(SCSI Devices)
Type Vendor Model SerialNumber(Min) SerialNumber(Max) Firmware
---- ------ ------- ----------------- ----------------- --------
N/A
References:
ESC: 537306
FIN: IO852-1
BugID: 4699182
Issue Description:
Sun Fire 15K and Sun Fire 12K systems with PCI Dual FC Network Adapter+
(Crystal+) used in the 66MHz slots may experience pcisch driver panics
due to a parity error on the PCI Bus.
Panics in the pcisch driver cover a wide range of possible failures.
In this case, the control status register (CSR) calls out the detection
of bad parity on the PCI bus:
WARNING: pcisch-19: PCI fault log start:
PCI SERR
PCI error occurred on device #0
dwordmask=0 bytemask=0
pcisch-19: PCI primary error (0):
pcisch-19: PCI secondary error (0):
pcisch-19: PBM AFAR 0.00000000:
WARNING: pcisch19: PCI config space CSR=0xc2a0
pcisch-19: PCI fault log end.
panic[cpu128]/thread=2a10001fd20: pcisch-19: PCI bus 3 error(s)!
000002a10001bea0 pcisch:pbm_error_intr+148 (30000b643d8, 2772, 30000b84548, 3,
30000b643d8, 3)
%l0-3: 00000300008b9860 0000000000004000 0000000000000000 0000030000b86584
%l4-7: 00000300009978c8 0000030008d03ea8 0000000000000000 0000030008d03ed0
000002a10001bf50 unix:current_thread+44 (0, ffffffffffffffff, 0, 300335b3528,
0, 1044f340)
%l0-3: 0000000010007450 000002a10001f061 000000000000000e 0000000000000016
%l4-7: 0000000000010000 00000300339922a8 000000000000000b 000002a10001f910
000002a10001f9b0 unix:disp_getwork+40 (1044e398, 0, 1044f340, 10457310, 2, 0)
%l0-3: 000000001010e2d8 0000000010509e00 00000300335bd518 000002a100c37d20
%l4-7: 000002a100cebd20 0000000002736110 0000000000000000 000002a10001f9c0
000002a10001fa60 unix:idle+a4 (0, 0, 80, 1044e398, 3000096d980, 0)
%l0-3: 0000000010043d58 2030205b275d2076 616c20696e646578 000002a10011dd20
%l4-7: 70636220290a2020 202e22202073703a 20222031205b275d 2076616c20696e64
NOTE: The stack itself can differ from case to case. What matters is the
CSR value (specifically the "detected-parity-error" bit).
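As a rough aid for reading the CSR value, the following sketch tests the
individual error bits of the 16-bit value printed in the fault log. It
ASSUMES the value follows the standard PCI configuration-space status
register layout (bit 15 = detected parity error, bit 14 = signaled system
error, and so on); that mapping and the function name csr_flags are
assumptions for illustration and are not taken from the pcisch driver
source. The sketch uses bash syntax and is not expected to run under the
legacy Bourne shell.

```shell
#!/bin/bash
# Sketch: name the error bits set in a pcisch "CSR=0x...." value.
# ASSUMPTION: standard PCI status register bit layout.
csr_flags() {
    local csr=$(( $1 ))     # accepts hex input such as 0xc2a0
    local out=""
    [ $(( csr & 0x8000 )) -ne 0 ] && out="$out detected-parity-error"
    [ $(( csr & 0x4000 )) -ne 0 ] && out="$out signaled-system-error"
    [ $(( csr & 0x2000 )) -ne 0 ] && out="$out received-master-abort"
    [ $(( csr & 0x1000 )) -ne 0 ] && out="$out received-target-abort"
    [ $(( csr & 0x0800 )) -ne 0 ] && out="$out signaled-target-abort"
    [ $(( csr & 0x0100 )) -ne 0 ] && out="$out master-data-parity-error"
    echo "${out# }"         # strip the leading space
}

# CSR value from the fault log above
csr_flags 0xc2a0
```

Under that assumption, the value 0xc2a0 from the example reports the
detected-parity-error and signaled-system-error bits, which is consistent
with the "PCI SERR" line in the same fault log.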
Although this type of panic can result from a hardware issue on any adapter,
this FCO addresses only those involving a PCI Dual FC Network Adapter+. In
addition, this FCO applies only to failures in the 66 MHz slots (the bottom
slots of an hsPCI assembly).
With every other panic of this nature, a hardware replacement has resolved
the case. However, with one customer, repeated hardware replacements did not
resolve the issue. The customer's issue has since been replicated on multiple
machines in an engineering environment. There are some unique factors that
are needed to create this scenario:
A. To date, this problem has only been seen on 375-3030 (Crystal+)
cards.
B. All the panics have been in either slot 0 or slot 2 of the I/O Boat.
(Slots 0 and 2 are the lower 66 MHz slots.)
C. Schizo 2.3 seems to bring the problem out with more regularity.
D. Veritas software (specifically adding mirrors to volumes) seems
to increase the likelihood of failure.
Please review the Steps for Diagnosis in the Special Considerations
section below before implementing any corrective action.
Parts Affected:
N/A
Implementation:
---
| | MANDATORY (Fully Pro-Active)
---
---
| | CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
---
---
| X | UPON FAILURE
---
Replacement Time Estimate:
4.0 hours
Special Considerations:
Steps for Diagnosis
===================
a) Isolate the offending PCI bus:
As a reminder, when looking at a Starcat I/O boat, the slots
are designated:
|---------------------------|---------------------------|
| Schizo 1, leaf B (33 MHz) | Schizo 0, leaf B (33 MHz) |
|---------------------------|---------------------------|
| Schizo 1, leaf A (66 MHz) | Schizo 0, leaf A (66 MHz) |
|---------------------------|---------------------------|
OR
|--------|--------|
| Slot 3 | Slot 1 |
| OR | OR |
| X.1.1.1| X.1.0.1|
|--------|--------|
| Slot 2 | Slot 0 |
| OR | OR |
| X.1.1.0| X.1.0.0|
|--------|--------|
NOTE: X = hsPCI number (0-17)
To diagnose the pcisch panic from the above stack, follow these steps:
Use the /etc/path_to_inst on the domain or the cfgadm/rcfgadm commands
to isolate the slot. For example, using the two methods with the panic
above (pcisch-19):
# grep pcisch /etc/path_to_inst
"/pci@3d,600000" 7 "pcisch"
"/pci@1c,700000" 0 "pcisch"
"/pci@3c,700000" 4 "pcisch"
--> "/pci@9d,600000" 19 "pcisch"
"/pci@9c,600000" 17 "pcisch"
"/pci@3c,600000" 5 "pcisch"
"/pci@5d,600000" 11 "pcisch"
"/pci@7d,600000" 15 "pcisch"
In this case, instance 19 is "/pci@9d,600000". To
translate that into a location, convert the 9d into
binary <10011101>, then group the bits as <100 1110 1>.
The leading group (100) is expander 4, i.e. hsPCI 4; the
middle group (1110) can be ignored; and the trailing bit
(1) is Schizo 1, i.e. the PCI slots on the left.
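The manual binary breakdown above can be sketched as a small helper. It
ASSUMES the bit grouping shown in the example generalizes to other nexus
paths: the low bit selects the Schizo, the next four bits are skipped, and
the remaining high bits give the expander. The function name
decode_pci_node is hypothetical, and the sketch uses bash syntax.

```shell
#!/bin/bash
# Sketch: split the hex field of a "/pci@XX,YYYYYY" nexus path the way the
# text does by hand. ASSUMPTION: <expander bits> <4 skipped bits> <schizo bit>.
decode_pci_node() {
    local hex=${1#/pci@}        # drop the leading "/pci@"
    hex=${hex%%,*}              # keep only the part before the comma
    local val=$(( 16#$hex ))    # bash base#number notation for hex
    echo "expander=$(( val >> 5 )) schizo=$(( val & 1 ))"
}

# Path for instance 19 from the path_to_inst listing above
decode_pci_node "/pci@9d,600000"
```

For the example path this yields expander 4 and Schizo 1, matching the
breakdown done by hand.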
The other option is to leverage the conversion the dynamic
reconfiguration interface provides:
# rcfgadm -d a -la | grep pcisch
pcisch0:e00b1slot1 pci-pci/hp connected configured ok
pcisch10:e02b1slot3 unknown connected unconfigured unknown
pcisch11:e02b1slot2 pci-pci/hp connected configured ok
pcisch12:e03b1slot1 pci-pci/hp connected configured ok
pcisch13:e03b1slot0 pci-pci/hp connected configured ok
pcisch14:e03b1slot3 unknown connected unconfigured unknown
pcisch15:e03b1slot2 pci-pci/hp connected configured ok
pcisch16:e04b1slot1 unknown connected unconfigured unknown
pcisch17:e04b1slot0 pci-pci/hp connected configured ok
pcisch18:e04b1slot3 unknown connected unconfigured unknown
--> pcisch19:e04b1slot2 unknown empty unconfigured unknown
pcisch1:e00b1slot0 unknown empty unconfigured unknown
pcisch20:e08b1slot1 unknown empty unconfigured unknown
pcisch21:e08b1slot0 pci-pci/hp connected configured ok
pcisch22:e08b1slot3 unknown empty unconfigured unknown
pcisch23:e08b1slot2 unknown empty unconfigured unknown
pcisch2:e00b1slot3 unknown connected unconfigured unknown
pcisch3:e00b1slot2 pci-pci/hp connected configured ok
pcisch4:e01b1slot1 pci-pci/hp connected configured ok
pcisch5:e01b1slot0 unknown empty unconfigured unknown
pcisch6:e01b1slot3 unknown connected unconfigured unknown
pcisch7:e01b1slot2 pci-pci/hp connected configured ok
pcisch8:e02b1slot1 pci-pci/hp connected configured ok
pcisch9:e02b1slot0 unknown connected unconfigured unknown
In this case, the issue is on expander 4 (e04), I/O board
(b1), slot 2.
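When the rcfgadm output has been saved to a file, the lookup can be
sketched as follows. The helper name ap_location is hypothetical, and the
sketch ASSUMES attachment points always have the form
"pcischN:eXXbYslotZ" seen in the listing above. The 66 MHz / 33 MHz label
follows the slot diagram earlier in this section (slots 0 and 2 are the
lower 66 MHz slots). It uses bash syntax.

```shell
#!/bin/bash
# Sketch: find a pcisch instance in "rcfgadm -d a -la" output read from
# stdin and report its expander, board and slot.
# ASSUMPTION: attachment points look like "pcisch19:e04b1slot2".
ap_location() {   # usage: ap_location INSTANCE < saved-listing
    sed -n "s/^[^p]*pcisch$1:e\([0-9]*\)b\([0-9]*\)slot\([0-9]\).*/\1 \2 \3/p" |
    while read -r exp board slot; do
        case $slot in
            0|2) speed="66MHz" ;;   # lower slots per the slot diagram
            *)   speed="33MHz" ;;
        esac
        # 10# forces decimal so "08" is not read as an invalid octal number
        echo "expander=$(( 10#$exp )) board=$board slot=$slot ($speed)"
    done
}

# Two lines copied from the listing above
printf '%s\n' '--> pcisch19:e04b1slot2 unknown empty unconfigured unknown' \
              'pcisch1:e00b1slot0 unknown empty unconfigured unknown' |
    ap_location 19
```

A result in a 66 MHz slot is what makes the failure a candidate for this
FCO.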
b) Once the offending FRU has been identified, follow FIN IO852-1
and replace the hsPCI and the cassette called out in the panic.
Once completed, replace ALL x6727A cards within the domain with
x6768A (Crystal2A) cards, including any x6727A that have not
generated panics.
So, for the example above, we would replace the hsPCI in
slot 4, the cassette in slot 2 (lower left), the failing x6727A
with an x6768A, and all other x6727A's in this domain.
It is expected that some customers may wish to take the
down time and replace all x6727A's in their entire platform
where applicable. This action has been approved under this
FCO.
EXCEPTION: If the customer has an A3500FC (540-4026 or 540-4027)
attached to the F12K/15K via Crystal+, then x6799A (Amber) must be
used in place of x6768A (Crystal2A).
c) There are some hardware prerequisites you might have to
contend with:
- The cables used for the x6768A differ from the cables
used for the x6727A. Before performing this FCO,
verify and replace all required cables or use LC ->
SC adapters.
replace 537-1004 2 Meter SC-SC with 537-1035 2 Meter LC-SC
replace 537-1020 5 Meter SC-SC with 537-1033 5 Meter LC-SC
replace 537-1004 15 Meter SC-SC with 537-1034 15 Meter LC-SC
If a custom length SC-SC cable is in use, order 0.4 Meter LC-SC
cable 537-1036 and SC-SC Female to Female coupler 130-4723.
d) There are some software prerequisites you might have to
contend with:
- Slot 1 DR will be available with SMS1.3. The current
target date (always subject to change) for SMS1.3 is
sometime near the end of January 2003. Until then,
the system will have to incur a downtime for
replacement.
- If the boot device is on the to-be-replaced hsPCI, it will be
necessary to have planned the system configuration to
allow a boot-device DR (i.e. multipathing/mirroring,
etc). If you do not have such a capability, the
domain will have to incur a downtime for
replacement.
- Once the x6727A has been replaced with a x6768A, the
controller number for the disk will change unless you
follow the procedure below.
- The customer will need to download the drivers for the x6768A.
Reference the procedure below. NOTE: DO NOT FORGET
TO UPDATE/PATCH THE JUMPSTART IMAGE, IF APPLICABLE.
At the time of authoring this FCO, the driver
requirements, according to the Sun StorEdge 2G FC PCI
Dual Channel Network Adapter Product Notes
(Part No. 816-5002-11, June 2002, Revision A), are as
follows:
Before installing the Sun StorEdge 2G FC PCI Dual
Channel Network Adapter card, the host must have
both the Solaris 8 update 4 operating environment
release with the recommended patch cluster and the
Sun StorEdge 2G FC PCI Dual Channel Network Adapter
driver.
Check http://www.sun.com/download/ or
http://www.sun.com/storage/san for updates. There
is one set of packages for the Solaris 8 operating
environment and another for the Solaris 9 operating
environment available under the respective links
for the operating environments. The SUNWsan package
is interchangeable between the releases.
Packages:
SUNWsan
SUNWcfpl
SUNWcfplx
Available at: http://www.sun.com/download/ or
http://www.sun.com/storage/san
Patches (NOTE: patches might be uprev'ed. These
are the minimum requirements):
Patch                                Solaris 8   Solaris 9
----------------------------------   ---------   ---------
Sun StorEdge Traffic Manager patch   111412-09   113039-01
fctl/fp/fcp/usoc driver              111095-10   113040-01
fcip driver                          111096-04   113041-01
qlc driver                           111097-10   113042-02
luxadm/liba5k and libg_fc patch      111413-08   113043-01
cfgadm fp plug-in library patch      111846-04   113044-01
SAN Foundation Kit patch             111847-04   111847-04
Available at: sunsolve.sun.com
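Because the table lists minimum revisions, a later revision of the same
patch also satisfies the requirement. The comparison can be sketched as
below. The helper names patch_ok and check_minimums are hypothetical, and
the list of installed patch IDs is supplied on standard input; how that
list is gathered on the domain (for example from "showrev -p" output) is
left to the reader. The sketch uses bash syntax.

```shell
#!/bin/bash
# Sketch: compare installed patch revisions against required minimums.
# Patch IDs have the form BASE-REV, for example 111412-09.
patch_ok() {   # usage: patch_ok INSTALLED MINIMUM
    local have_base=${1%-*} have_rev=${1#*-}
    local need_base=${2%-*} need_rev=${2#*-}
    [ "$have_base" = "$need_base" ] || return 1
    # 10# forces decimal so a revision such as "09" is not read as octal
    [ $(( 10#$have_rev )) -ge $(( 10#$need_rev )) ]
}

check_minimums() {   # stdin: installed patch IDs; args: minimum patch IDs
    local installed min p found
    installed=$(cat)
    for min in "$@"; do
        found=no
        for p in $installed; do
            patch_ok "$p" "$min" && found=yes
        done
        [ "$found" = yes ] || echo "missing or too old: $min"
    done
}

# Hypothetical installed list checked against three Solaris 8 minimums
printf '%s\n' 111412-10 111095-09 |
    check_minimums 111412-09 111095-10 111096-04
```

In this made-up example, 111412-10 satisfies the 111412-09 minimum, while
111095 is one revision too old and 111096 is not installed at all.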
- There is a known issue with replacing a Crystal+ card that
serves an encapsulated boot device.
If the replacement is not done correctly, the device major
numbers will be set incorrectly, forcing panics. Please
reference the procedure below and specifically the
name_to_major file references.
- Replacement Procedure:
Replace Crystal+ cards with Crystal2A on a F15K/12K
==================================================================
I. Prerequisites
- Crystal2A drivers, patches and packages.
[ refer to Sun StorEdge 2G FC PCI Dual Channel
Network Adapter Product Notes ]
- Solaris 8 2/02 with current recommended patch
cluster and san patches.
- Dedicated Solaris 8 2/02 network boot/jumpstart image with
Crystal2A drivers, packages and patches.
- A good backup of all filesystems.
==================================================================
II. Preparation
1. If you are replacing a controller that contains the
boot device or a Veritas Volume Manager device in
rootdg, you will have to create a boot server image
that contains the Crystal2A drivers. Otherwise you may
skip this step.
To do this, first create a Solaris 8 02/02
JumpStart Boot server. Then, install the Crystal2A
drivers, patches and packages into this image and
copy the /etc/name_to_major file from the domain
onto the boot image. (This will prevent problems
with differing major numbers between the domain and
the boot image).
Example using a Solaris 8 02/02 boot server image located at
/jumpstart/5.8_HW202:
For Packages:
=============
cd [ location of packages ]
pkgadd -R /jumpstart/5.8_HW202/Solaris_8/Tools/Boot -d .
For Patches:
============
cd [ location of patches ]
patchadd -C /jumpstart/5.8_HW202/Solaris_8/Tools/Boot
./[patchid]
For /etc/name_to_major:
=======================
cd /jumpstart/5.8_HW202/Solaris_8/Tools/Boot/etc
cp name_to_major name_to_major.orig
ftp domain
ftp> cd /etc
ftp> get name_to_major
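To see which drivers would have had differing major numbers, the copy
fetched from the domain can be compared with the saved name_to_major.orig.
The following is a minimal sketch; the function name major_mismatches is
hypothetical, the driver numbers in the demonstration are made up, and it
assumes each line of the file holds a driver name followed by its major
number. It uses bash syntax.

```shell
#!/bin/bash
# Sketch: list drivers whose major number differs between two
# name_to_major files (each line: "driver-name major-number").
major_mismatches() {   # usage: major_mismatches DOMAIN_FILE IMAGE_FILE
    awk 'NR == FNR { dom[$1] = $2; next }      # first file: remember majors
         ($1 in dom) && dom[$1] != $2 {         # second file: report changes
             printf "%s: domain=%s image=%s\n", $1, dom[$1], $2
         }' "$1" "$2"
}

# Demonstration with made-up entries in temporary files
tmp=$(mktemp -d)
printf 'sd 32\nqlc 153\n' > "$tmp/domain"
printf 'sd 32\nqlc 160\n' > "$tmp/image"
major_mismatches "$tmp/domain" "$tmp/image"
rm -r "$tmp"
```

In the boot image directory this would be run as
"major_mismatches name_to_major name_to_major.orig" after the ftp transfer.
Any driver it reports is one where the domain and the original boot image
disagreed.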
2. Bring the domain down to single user mode
OK> boot -s
3. Install patches and packages on the domain.
Follow normal patchadd and pkgadd procedures.
4. Verify the controller number[s] for the card being replaced
# format
# ls -l /dev/dsk
# ls -l /dev/es
( You may want to save this output for reference. )
==================================================================
III. Replacing a controller NOT used for the boot device.
(This example uses c1 as the controller to be changed.)
1. If Volume Manager is being used, disable it from starting.
# touch /etc/vx/reconfig.d/state.d/install-db
2. Reboot domain into single user mode
# init 0
OK> boot -s
Volume manager should not be running at this point.
# ps -ef | grep vx (this should show no volume manager
processes)
3. Remove the devices associated with the controller to be
replaced.
# cd /dev/dsk
# rm c1*
# cd /dev/rdsk
# rm c1*
# cd /dev/cfg
# rm c1 [ this entry may or may not exist ]
# cd /dev/es ( if applicable )
# rm ses2 ses3 ( for the ses devices associated with c1 )
4. Shut down and replace the cards. Make sure auto-boot? is
false.
# init 0
OK> setenv auto-boot? false
Shut off the domain from the SC
setkeyswitch -d [domainid] off
Replace the card, turn on the domain.
setkeyswitch -d [domainid] on
Verify that the new controller is available from OBP.
OK> probe-scsi-all
Do a single user reconfiguration boot
OK> boot -sr
5. Verify that the devices were created as expected:
# format
# ls -l /dev/dsk/c1* /dev/rdsk/c1*
# ls -lL /dev/dsk/c1* /dev/rdsk/c1*
# ls -l /dev/es
If Veritas Volume Manager was NOT used, check that the
devices can be mounted. If all looks good continue to
multiuser.
# mountall
6. If Veritas was disabled previously (Step 1), re-enable it and
reboot.
# rm /etc/vx/reconfig.d/state.d/install-db
# init 6
Verify that Veritas started correctly and all volumes are
available.
==================================================================
IV. Replacing a controller that IS used for the boot device.
(This example uses c0 as the controller to be changed.)
1. If the boot disk is encapsulated, you must first
unencapsulate the boot device.
Reboot and verify that the boot disk has been successfully
unencapsulated.
2. If Volume Manager is being used, disable it from starting.
# touch /etc/vx/reconfig.d/state.d/install-db
3. Reboot domain into single user mode
# init 0
OK> boot -s
Volume manager should not be running at this point.
# ps -ef | grep vx (this should show no volume manager
processes)
4. Shut down and replace the cards. Make sure auto-boot? is
false.
# init 0
OK> setenv auto-boot? false
Shut off the domain from the SC
setkeyswitch -d [domainid] off
Replace the card, turn on the domain.
setkeyswitch -d [domainid] on
Verify that the new controller is available from OBP.
OK> probe-scsi-all
5. Boot from the Crystal-2A enabled JumpStart Boot server (as
described under Preparation). Verify that the
devices are visible.
OK> boot net -s (from Crystal-2a patched jumpstart)
# format
6. Mount the boot device's / partition at /a. Remove the
previous controller's device nodes.
# mount /dev/dsk/ /a
# rm /a/dev/dsk/c0* /a/dev/rdsk/c0*
# rm /a/dev/cfg/c0 (this may or may not exist)
# rm /a/dev/es/ses0 /a/dev/es/ses1
( for the ses devices associated with c0 )
7. Build and verify the new device nodes and reset-all.
# devfsadm -r /a -p /a/etc/path_to_inst
# ls -l /a/dev/dsk/c0* /a/dev/rdsk/c0*
# ls -lL /a/dev/dsk/c0* /a/dev/rdsk/c0*
# ls -l /a/dev/es
# umount /a
# halt
OK> reset-all
8. Determine the new boot device path.
The device path WILL change because the Crystal2A has a different FW PROM.
For example, original boot device:
/pci@3d,600000/pci@1/SUNW,qlc@4/fp@0,0/disk@w220000203733433b,0:a
New boot device:
/pci@3d,600000/SUNW,qlc@1/fp@0,0/disk@w220000203733433b,0:a
OK> show-disks (check for new card's device)
OK> probe-scsi-all (check that the disks are visible)
Verify the new path of the boot device and use nvunalias and
nvalias to record it.
OK> nvunalias [old-boot-device-alias]
OK> nvalias [device-alias] [device-path]
OK> setenv boot-device [device-alias]
OK> setenv diag-device [device-alias] (if desired)
9. Boot off the new path into single user mode. Verify that the
devices were created as expected
OK> boot -s
# format
If Veritas Volume Manager was NOT used, check that the
devices can be mounted. If all looks good continue
to multiuser.
# mountall
10. If Veritas was disabled previously (Step 2), re-enable it
and reboot.
# rm /etc/vx/reconfig.d/state.d/install-db
# init 6
Verify that Veritas started correctly and all volumes are
available.
Corrective Action:
Important! Troubleshoot pcisch driver panics as outlined above and
in FIN IO852-1 and follow instructions outlined in the
Special Considerations section.
A. Replace all 375-3030-xx (Crystal+) cards with 375-3108-xx
(Crystal-2A) cards in the affected domain.
OR
B. If the customer has an A3500FC (540-4026 or 540-4027) attached to
the F12K/15K via Crystal+, replace all 375-3030-xx (Crystal+) cards
with 375-3019-xx (Amber) cards in the affected domain.
Either action will require new drivers to be installed and LC-SC
or LC-LC Fibre Cables. See Product Note 816-5002 for details:
http://infoserver.central/data/816/816-5002/pdf/816-5002-11.pdf
Comments:
Billing Type:
Warranty: Sun will provide parts at no charge under Warranty
Service. On-Site Labor Rates are based on how the
system was initially installed.
Contract: Sun will provide parts at no charge. On-Site Labor Rates
are based on the type of service contract.
Non Contract: Sun will provide parts at no charge. Installation by
Sun is available based on the On-Site Labor Rates
defined in the Price List.
--------------------------------------------------------------------------
Implementation Footnote:
________________________
i) In case of Mandatory FCOs, Sun Services will attempt to contact
all known customers to recommend the part upgrade.
ii) For controlled proactive swap FCOs, Sun Services mission critical
support teams will initiate proactive swap efforts for their respective
accounts, as required.
iii) For Replace upon Failure FCOs, Sun Services partners will implement
the necessary corrective actions as and when they are required.
--------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
______________
* Access the top level URL of http://sdpsweb.Central/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
_______________________
* Access the SunSolve Online URL at http://sunsolve.Central/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
_______________
* Access the top level URL of https://spe.sun.com
--------------------------------------------------------------------------
General:
________
Send questions or comments to [email protected]
---------------------------------------------------------------------------