Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Solution Type: Technical Instruction Sure
Solution 1010580.1: How to recognize, diagnose, and troubleshoot PCI SERR errors on UltraSPARC(R) II, IIi, IIe based systems.

Previously Published As: 214556

Description

The following document explains what actually causes a PCI SERR, what the PCI debug drivers actually do, and how to interpret the driver output after a SERR on UltraSPARC II, IIi, and IIe based systems.

Steps to Follow

--- WHAT CAUSES A PCI ERROR?

In a modern PCI network, there can be many nested PCI buses. PCI buses are connected by pci-to-pci bridges. On each bridge, the bus nearer the machine is the primary bus, and the one farther away is the secondary bus. In a fairly complex system with an E1 PCI extender unit, there can be six or more hierarchical buses.

In Solaris[TM] systems, there is a device that translates from the system bus (UPA or another architecture-specific system bus type) to the PCI bus(es). On UltraSPARC II, IIi, and IIe based systems, this chip is the Psycho (upa-to-pci), and it is managed by PCI bus nexus drivers of similar names.

The topology of the PCI buses below these devices is machine dependent. Many low cost machines have simba pci-to-pci bridges that link one 64 bit bus to two separate 32 bit PCI buses. Often one bus is used to drive onboard devices such as consoles and ethernets, and the other bus is used to drive the available PCI slot. Many PCI cards have onboard pci-to-pci bridges, so that their multiple devices are hidden behind a bridge.

The SERR signal may be pulsed by any PCI device to report:

* address parity errors (when a parity error is detected during an address cycle, the address is bad and no device knows who should respond, so every device that sees the parity error asserts SERR);
* data parity errors during special cycles;
* critical errors other than parity errors (for example, data trapped in a bridge with no way to inform the sender that it is being dropped).

As your data flows through the various pci-to-pci bridges that must be traversed to reach the final device, any device on any of the intermediate buses could assert SERR, not just the final device.

The SERR signal is passed up from the secondary side of a bridge to the primary side until it percolates up to the PCI nexus device, which gets an interrupt and prints a few lines before the machine panics. The standard PCI nexus drivers do not walk the bus to see where an error was asserted. The drivers just dump the status register from the top level node, which is actually part of the nexus device. All one can say is that some component below the PCI bus nexus chip asserted SERR for some reason. Some machine configurations have only one device per bus, so identifying what is complaining is quite easy, but why is a different matter.

--- WHAT DO THE DEBUG DRIVERS DO?

The driver works by walking the device tree below the nexus driver that gets the SERR interrupt. At each node, it extracts the common PCI status and command registers. For nodes that it recognizes, it extracts more information (e.g. for a pci-to-pci bridge chip it gets registers from both sides as well as chip specific error registers).

--- DRIVER OUTPUT AFTER A SERR:

NOTICE: SIMBA pcipsy-0/simba-1 0x108e 0x5000 0x81 0x147 0x42a0 0x4280 0x23 0x0
NOTICE: SIMBA pcipsy-0/simba-1 0x0 0x0 0x0 0x0
NOTICE: simba-1/glm-0 - 0x1000 0xb 0x80 0x146 0x210
NOTICE: simba-1/glm-1 - 0x1000 0xb 0x80 0x146 0x210
NOTICE: DEC 21152 simba-1/pci_pci-0 0x1011 0x22 0x1 0x147 0x4290 0x6280 0x23 0x0
NOTICE: DEC 21152 pci_pci-0/pci_pci-1 0x1011 0x26 0x1 0x147 0x4290 0x6280 0x23 0x0
NOTICE: DEC 21152 pci_pci-1/pci_pci-4 0x8086 0xb152 0x1 0x147 0x4290 0x2280 0x23 0x10
NOTICE: pci_pci-4/pci108e,1000--1 - 0x108e 0x1000 0x80 0x2 0x280
NOTICE: pci_pci-4/hme--1 - 0x108e 0x1001 0x80 0x146 0x280
NOTICE: pci_pci-4/isp-1 - 0x1077 0x1020 0x0 0x157 0x200
panic[cpu0]/thread=300028cbce0: pcipsy-0: PCI bus 1 error(s)!

The format of each line is:

NOTICE: name parent/child vid did conf command status

If the name is not null, the driver has recognized the device and will get more registers; those exceptions are listed below.

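The line format above can be split into fields mechanically. The following is a minimal Python sketch, not part of the debug drivers: the function name parse_notice is hypothetical, and the rule that the parent/child field is the first token containing a slash is an assumption based only on the sample output shown here.

```python
def parse_notice(line):
    """Split one debug-driver NOTICE line into (name, parent, child, registers).

    Assumption: the parent/child field is the first token containing "/";
    everything before it is the optional device name.
    """
    text = line.strip()
    if not text.startswith("NOTICE:"):
        raise ValueError("not a NOTICE line: %r" % line)
    tokens = text[len("NOTICE:"):].split()

    # Locate the parent/child token.
    idx = None
    for i, tok in enumerate(tokens):
        if "/" in tok:
            idx = i
            break
    if idx is None:
        raise ValueError("no parent/child field in: %r" % line)

    name = " ".join(tokens[:idx])  # empty string for unrecognized devices
    parent, child = tokens[idx].split("/", 1)
    # Keep only hex register values; generic lines carry a "-" placeholder.
    regs = [int(tok, 16) for tok in tokens[idx + 1:]
            if tok.lower().startswith("0x")]
    return name, parent, child, regs


if __name__ == "__main__":
    sample = ("NOTICE: DEC 21152 simba-1/pci_pci-0 "
              "0x1011 0x22 0x1 0x147 0x4290 0x6280 0x23 0x0")
    name, parent, child, regs = parse_notice(sample)
    print(name, parent, child, [hex(r) for r in regs])
```

Lines that do not start with NOTICE:, such as the panic line, raise ValueError in this sketch, so a caller would filter or catch them.
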
The "parent/child" field shows the device tree relationships, so a good example of a generic line is:

NOTICE: simba-1/glm-0 - 0x1000 0xb 0x80 0x146 0x210

There is no "name", so we pull just the generic registers, which are the following:

VALUE   NAME     DESCRIPTION
0x1000  vid      16 bit, offset 0, PCI vendor id, 0x1000 = LSI
0xb     did      16 bit, offset 2, specific device id - look at the vendor's website
0x80    header   8 bit, offset 0xe
0x146   command  16 bit, offset 4, command register
0x210   status   16 bit, offset 0x6, status register

--- INTERPRETING THE OUTPUT.

1) Treat the data path between the nexus driver and the final device as a tree.
2) Start from the nexus driver and examine the status registers, looking for some indication of a received error.

In the example:

pcipsy-0/simba-1 0x108e 0x5000 0x81 0x147 0x42a0 0x4280 0x23 0x0

If we decode the status register for the primary bus, we get 0x42a0, which means:

b0100 0010 1010 0000
bit 14: signalled system error is the only abnormal bit set.

So we took the SERR panic because simba instance 1, which is below pcipsy instance 0, signalled a SERR upwards. Now look at the secondary bus status, 0x4280, which means:

b0100 0010 1000 0000
bit 14: received SERR.

So we know that the simba bridge was just passing on the SERR it received on its secondary bus. Which devices have simba-1 as a parent? It could be any one of them asserting SERR.

NOTICE: simba-1/glm-0 - 0x1000 0xb 0x80 0x146 0x210
NOTICE: simba-1/glm-1 - 0x1000 0xb 0x80 0x146 0x210
NOTICE: DEC 21152 simba-1/pci_pci-0 0x1011 0x22 0x1 0x147 0x4290 0x6280 0x23 0x0

The status registers of the glm units are both 0x210, which is normal, so they are not the culprit. The DEC 2115X bridge chip, however, has 0x4290 as its primary bus status, and that means:

b0100 0010 1001 0000
bit 14: set, so it was asserting SERR on the bus up to the simba-1 device.

Looking at its secondary bus status we see 0x6280, which means:

b0110 0010 1000 0000
bit 14: received SERR on this bus.
bit 13: received master abort, so while we were a master on this bus, the transaction was terminated with a master abort, which is not good.

So now we know the SERR was sent up from pci_pci instance 0 because a child node asserted SERR on its secondary bus.

-- SO, WHAT DEVICE IS BELOW THE PCI_PCI 0?

NOTICE: DEC 21152 pci_pci-0/pci_pci-1 0x1011 0x26 0x1 0x147 0x4290 0x6280 0x23 0x0

Again the same status, so let us move further down and determine who has pci_pci-1 as a parent.

NOTICE: DEC 21152 pci_pci-1/pci_pci-4 0x8086 0xb152 0x1 0x147 0x4290 0x2280 0x23 0x10

Here the primary bus status 0x4290 shows that it asserted SERR, but the secondary bus status 0x2280 shows only received master abort, not received SERR. So we know where the SERR originated.

If we ask why, the last register for this device is 0x10, and that is the p_serr_l_status, the reason this device asserted SERR. p_serr_l_status = 0x10 means master abort during posted write. So we had some buffered data in the bridge that we told the master had been delivered, but when we sent it to the target we got back a master abort. A master abort happens when no target claims the address for an existing transaction.

So we know why we got a SERR, and we know who: the bridge at the end of the path

pcipsy-0/simba-1
simba-1/pci_pci-0
pci_pci-0/pci_pci-1
pci_pci-1/pci_pci-4

got a master abort for some buffered data and had no way to report the failure back to the sender.

What devices are below the pci_pci@4 device?

NOTICE: pci_pci-4/pci108e,1000--1 - 0x108e 0x1000 0x80 0x2 0x280
NOTICE: pci_pci-4/hme--1 - 0x108e 0x1001 0x80 0x146 0x280
NOTICE: pci_pci-4/isp-1 - 0x1077 0x1020 0x0 0x157 0x200

So can we work out which device? NO! All we know is where to put our analyzer, and which drivers to instrument in order to determine who is accessing an address that no one currently owns.

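The walk described above (follow bridges whose primary status has bit 14 set until one is found whose secondary status does not show a received SERR) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the debug driver's logic: the names abnormal_bits and trace_serr are hypothetical, the bridge tuples are typed in by hand from the sample output, the bit names come from the register format tables in this document, and the sketch follows only the first signalling bridge at each level and ignores leaf devices.

```python
SERR_BIT = 1 << 14  # bit 14: signaled SERR (primary) / received SERR (secondary)

# Error bits only; capability and timing bits (4-7, 9-10) are left out on purpose.
PRIMARY_ERRORS = {
    8: "master data parity error",
    11: "signaled target abort",
    12: "received target abort",
    13: "received master abort",
    14: "signaled SERR",
    15: "detected parity error",
}
SECONDARY_ERRORS = dict(PRIMARY_ERRORS)
SECONDARY_ERRORS.update({8: "data parity reported", 14: "received SERR"})


def abnormal_bits(value, names):
    """Return the names of the error bits set in a 16-bit status value."""
    return [names[bit] for bit in sorted(names) if value & (1 << bit)]


def trace_serr(bridges, nexus):
    """Follow signaled-SERR bridges down from the nexus.

    bridges: list of (parent, child, primary_status, secondary_status).
    Returns the list of bridge names on the path; the last one is the
    bridge that signaled SERR without having received one from below.
    """
    path = []
    parent = nexus
    while True:
        hit = None
        for b_parent, child, primary, secondary in bridges:
            if b_parent == parent and primary & SERR_BIT:
                hit = (child, secondary)
                break
        if hit is None:
            return path
        child, secondary = hit
        path.append(child)
        if not secondary & SERR_BIT:
            return path  # SERR originated at this bridge
        parent = child


if __name__ == "__main__":
    # Primary and secondary status values taken from the sample output.
    bridges = [
        ("pcipsy-0", "simba-1", 0x42A0, 0x4280),
        ("simba-1", "pci_pci-0", 0x4290, 0x6280),
        ("pci_pci-0", "pci_pci-1", 0x4290, 0x6280),
        ("pci_pci-1", "pci_pci-4", 0x4290, 0x2280),
    ]
    print(trace_serr(bridges, "pcipsy-0"))
    print(abnormal_bits(0x2280, SECONDARY_ERRORS))
```

For the sample data, the trace ends at pci_pci-4, whose secondary status decodes to received master abort only, matching the manual analysis.
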
EXCEPTION DEVICES:

SIMBA DOCUMENTATION:

The simba PCI device is a pci-to-pci bridge that has a 64 bit primary bus and two 32 bit secondary buses, managed by the Solaris simba driver. The format is:

NOTICE: SIMBA pcipsy-0/simba-1 0x108e 0x5000 0x81 0x147 0x42a0 0x4280 0x23
NOTICE: SIMBA pcipsy-0/simba-1 0x0 0x0 0x0 0x0

So, there are 11 registers gathered.

Value   Name                  Description
0x108e  vid                   16 bits, offset 0, 0x108e = Sun
0x5000  did                   16 bits at offset 2, 0x5000 = simba
0x81    header                8 bit at offset 0xe
0x147   command               16 bits, offset 0x4, command register
0x42a0  status                16 bits at offset 0x6, status register
0x4280  secondary bus status  16 bit at offset 0x1e, secondary bus status register
0x23    bridge control        16 bit register at 0x3e
0x0     dma_afsr              64 bit value at offset 0xc8
0x0     dma_afar              64 bit value at offset 0xd0
0x0     pio_afsr              64 bit value at offset 0xe8
0x0     pio_afar              64 bit value at offset 0xf0

DEC 2115[234] DOCUMENTATION:

The DEC 2115X pci-to-pci bridge is quite common, and the 2115X family of pci-to-pci bridge chips is now released by Intel. It is managed by the Solaris pci_pci driver.

NOTICE: DEC 21152 simba-1/pci_pci-0 0x1011 0x22 0x1 0x147 0x4290 0x6280 0x23 0x0

INTEL 21554 DOCUMENTATION (managed by the db21554 driver):

Value   Name               Description
0x1011  vid
0x22    did
        conf
0x147   command
0x4290  status
        secondary command  16 bits at 0x44
0x6280  secondary status   16 bits at 0x46
        cc0                16 bit diagnostics at 0xcc
        cc1                16 bit diagnostics at 0xce
        cc2                16 bit diagnostics at 0xd0

-- COMMAND REGISTER FORMAT (16 bits):

BIT    MEANING
0      I/O space address decoder enable
1      Memory space address decoder enable
2      Bus master enable
3      Special cycles enable
4      Memory write-and-invalidate enable
5      VGA palette snoop enable
6      PERR generation enable
7      Address stepping enable
8      SERR enable
9      FAST back-to-back enable
10-15  reserved

-- STATUS REGISTER FORMAT (16 bits)

BIT   MEANING
0-3   Reserved
4     2.2 capable
5     66 MHz capable
6     Reserved
7     Fast back to back capable
8     Master data parity error
9-10  Dev sel timing
11    Signaled target abort
12    Received target abort
13    Received master abort
14    Signaled SERR on the bus
15    Detected a parity error

-- SECONDARY BUS STATUS REGISTER FORMAT (16 BITS)

BIT   MEANING
0-3   Reserved
4     2.2 capable
5     66 MHz capable
6     Reserved
7     Fast back to back capable
8     Data parity reported
9-10  Dev sel timing
11    Signaled target abort
12    Received target abort
13    Received master abort
14    Received SERR
15    Detected a parity error

Product
NEBS-Certified Servers
Sun Enterprise 450 Server
Sun Enterprise 420R Server
Sun Enterprise 250 Server
Sun Enterprise 220R Server
Sun Enterprise 150 Server
Ultra 80 Workstation
Ultra 450 Workstation
Ultra 60 Workstation
Ultra 5 Workstation
Ultra 30 Workstation
Ultra 10 Workstation
Sun Blade 150 Workstation
Sun Blade 100 Workstation

Internal Comments
Additional Documentation and Debug Drivers are available internally here: http://clem.uk/~timu/pci/index.html

pci, serr, errors, simba, psycho, ultrasparc, II, IIi, IIe, decode, diagnose

Previously Published As 72402

Change History
Date: 2009-02-26
The product NEBS-Certified Servers is a folder in the Swordfish database. We need specific products in the product statement. If you can provide a list of the specific servers that this article applies to, the document can be published. The specific products will show up with a plus sign rather than a folder in http://krep.central.sun.com/stats/swordfish/.

Date: 2005-09-23
User Name: 7058
Action: Update Canceled
Comment: *** Restored Published Content *** Only fixed metadata.
Version: 0

Date: 2005-09-23
User Name: 7058
Action: Update Started
Comment: Fixing missing tech group.
Version: 0

Attachments
This solution has no attachment