Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Problem Resolution Sure
Solution 1009008.1: Sun Fire[TM] 3800, 4800, 4810, 6800, E4900, E6900, V1280, E2900, Netra 1280, or 1290 server: NCPQ_TO workarounds
Previously Published As: 212426
Applies to:
Sun Netra 1280 Server
Sun Fire V1280 Server
Sun Fire 3800 Server
Sun Fire 4800 Server
Sun Fire 4810 Server
All Platforms

Symptoms
An NCPQ_TO (Non-Coherent Pending Queue Time-Out) causes a domain to error pause. The domain must then be reset to recover (it shuts down and reboots).

The error pause appears on the SC platform shell (console messaging) or in the showlogs file as:

ErrorMonitor: Domain A has a SYSTEM ERROR

An NCPQ_TO also appears on the SC domain shell and in showlogs as:

/partition0/domain0/SB0/bbcGroup1/sbbc1:

On an UltraSPARC[R] III machine, the bit signalling NCPQ_TO changes slightly:

/partition0/domain0/SB0/bbcGroup1/cpuCD/cpusafariagent0:

Cause
Sun Fire[TM] 3800, 4800, 4810, 6800, E4900, E6900, V1280 and Netra[TM] 1280 and 1290 servers can be susceptible to NCPQ_TO error pause conditions that might not be resolved by hardware replacement. Repeat NCPQ_TO error pause conditions can usually be resolved by following the process described in Document 1017926.1.

NCPQ_TO error pauses on this range of server platforms with JNI 1063 Fibre Channel, Marconi/FORE HE155 ATM, or Sun cPCI Dual Fibre Channel Network Adapter (Part Number 375-0118) HBAs tend not to be resolved by hardware replacement as described in Document 1017926.1. They are more likely due to the problems described in Sun BugIDs 4836915, 4859295, 4919824, and 4408474.

The JNI 1063 HBA is not supported on these servers, but it is often sold with Hitachi Data Systems (HDS) 99xx disk arrays. A JNI 1063 card on one of these servers appears in prtdiag output as:

fibre-channel-pci1242,4643.0

The card itself carries a sticker reading:

FCI-1063

Another clue is that pkginfo lists the package:

JNIfcaPCI

A format command lists a device path like:

/ssm@0,0/pci@1d,700000/fibre-channel@3/sd@7d,0

The Marconi/FORE ATM 155 card is also unsupported. It appears in prtdiag output as:

FORE,HE-155

A 375-0118 cPCI Dual Fibre Channel HBA appears in prtdiag output as:

SUNW,qlc-pci1077,2200.1077.4084.+

A format command lists a device path like:

/ssm@0,0/pci@1d,600000/pci@1/SUNW,qlc@4/fp@0,0/ssd@w50060e80039d5d07,0

Workarounds exist for cases involving JNI 1063 Fibre Channel, Marconi/FORE HE155 ATM, and Sun cPCI Dual Fibre Channel (Part Number 375-0118) HBAs.
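
The prtdiag and pkginfo identifiers quoted above can be checked mechanically once the command output has been saved as text. The following is a minimal sketch, not a supported tool: the function names and the signature table are hypothetical, and it assumes the trailing ".+" on the qlc entry is a truncation marker that can be left out of the match.

```python
# Identifier strings quoted in this document, mapped to the affected HBA.
# Assumption: the trailing ".+" shown for the qlc card is a truncation
# marker, so only the part before it is matched.
PRTDIAG_SIGNATURES = {
    "fibre-channel-pci1242,4643.0": "JNI 1063 Fibre Channel HBA",
    "FORE,HE-155": "Marconi/FORE HE155 ATM HBA",
    "SUNW,qlc-pci1077,2200.1077.4084": "Sun cPCI Dual Fibre Channel HBA (375-0118)",
}

# Package name this document says pkginfo lists when the JNI driver is installed.
JNI_PACKAGE = "JNIfcaPCI"


def find_affected_hbas(prtdiag_text):
    """Return the names of affected HBAs whose signature occurs in prtdiag output."""
    return [
        name
        for signature, name in PRTDIAG_SIGNATURES.items()
        if signature in prtdiag_text
    ]


def has_jni_package(pkginfo_text):
    """Return True if the JNI driver package name occurs in pkginfo output."""
    return JNI_PACKAGE in pkginfo_text
```

A plain substring match is used deliberately: the column layout of prtdiag output differs between platforms and releases, so the sketch does not try to parse it. An empty result does not prove that none of these cards is installed; the sticker and the device path in format output remain useful cross-checks.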
Solution
First, verify that ScApp is at least at 5.20.6 (which includes the latest NCPQ_TO fix). If ScApp needs to be upgraded, see Patch ID 114527; loading the latest release is advised.
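
The minimum-version check can be sketched as a numeric comparison of the dotted version fields. This is an illustration only: the helper names are hypothetical, and it assumes the ScApp version has already been obtained as a purely numeric dotted string such as "5.20.6".

```python
# Minimum ScApp release named in this document.
MIN_SCAPP = (5, 20, 6)


def parse_version(text):
    """Split a dotted numeric version such as '5.20.6' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def scapp_is_current(version, minimum=MIN_SCAPP):
    """Return True if the given ScApp version is at or above the minimum."""
    return parse_version(version) >= minimum
```

Comparing integer tuples rather than raw strings matters here: as text, "5.9.0" sorts after "5.20.6", whereas numerically 5.9.0 is the older release.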
Workaround:
For cases involving a JNI 1063 Fibre Channel HBA:
NCPQ_TO error pause has been observed on these servers when the JNI cards have been left disconnected. Typically, the JNI 1063 card is used in a Fibre Channel-Arbitrated Loop (FC-AL) topology. When the loop is open (as is the case when no device or FC-AL hub is attached), the card resets continuously. The resets have been linked to eventual NCPQ_TO error pause. The following repeating message is often seen in the /var/adm/messages file on Sun Fire systems with JNI 1063 cards that have experienced NCPQ_TO error pause:

fca-pci0: Link Failure. Resetting...

The resets can be eliminated by completing the FC-AL loop or by removing the JNI 1063 card. The loop can be completed by attaching an FC-AL device or an FC-AL hub, or by inserting an external loopback plug into the JNI 1063 HBA. Replacing the JNI 1063 card with a Sun PCI Dual FC Network Adapter+ (Part Number 375-3030) HBA might also resolve the problem.

For cases involving a Marconi/FORE HE155 ATM HBA:
NCPQ_TO error pause has been observed on these servers with these cards when they share a PCI bus with other PCI devices. The other devices can be other HBA cards, or embedded devices that share a PCI bus with PCI slots 0, 1, and 2 on a Sun Fire server 8 slot PCI I/O Assembly (PN 501-4404) I/O boat. Isolating the Marconi card on its own PCI bus has been shown to prevent the NCPQ_TO error pause conditions. To accomplish this, the card should EITHER be placed in one of the 66 MHz slots:

OR be the ONLY card installed on one of the 33 MHz slots:

If installed in one of the 33 MHz slots, the Marconi card should be the ONLY CARD in any of those three slots, and the other two slots on the bus should be EMPTY.

For cases involving a Sun cPCI Dual Fibre Channel HBA (Part Number 375-0118):
NCPQ_TO error pause has been diagnosed as Sun BugIDs 4859295 and 4408474 when this HBA is installed. The problem may be prevented by using only one port on this dual-port Fibre Channel card. It may be necessary to supply an additional Sun cPCI Dual Fibre Channel HBA (PN 375-0118) to spread the fibre connections to one per HBA. It is also possible to swap all of the cards within an I/O boat from cPCI to their PCI equivalents. For Sun Fire 4800-6800 servers that use the 4 slot cPCI I/O boats, this also requires swapping the 4 slot cPCI I/O Assemblies (501-4868) for 8 slot PCI I/O Assemblies (501-4404). For Sun Fire 3800 servers, this requires swapping the entire server for another UltraSPARC III platform.

Sun Bugs:
CR 4836915 NCPQ timeouts on Serengeti difficult to diagnose
CR 4859295 NCPQ_TO error hangs the 3800/cPCI
CR 4919824 NCPQ_TOs fails Serengeti DS-6800 s9 domain
CR 4408474 DSTOP during disk I/O
CR 4899682 NCPQ_TO System hang occurs during solaris reboot using FW 5.14

Knowledge article references:
Document 1006063.1 SF 3800-6800: Non-Cacheable Address Space tables
Document 1017926.1 SF 3800-6800: Troubleshooting NCPQ_TO errors

Previously Published As 71760 & 212426

Attachments
This solution has no attachment