Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

1011106.1: Fabric devices and QuickLoop devices exported to Solaris[TM] via the same Fiber Channel connection
Previously Published As: 215279

Symptoms

When Fabric devices and QuickLoop devices are exported to Solaris via the same Fiber Channel connection, the link repeatedly goes offline and online under heavy load, and IO performance is poor.

Example: A high-end server at the customer site logged the following errors:

svr03 qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(1): Loop OFFLINE
svr03 qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(1): Link ONLINE
svr03 fctl: [ID 517869 kern.warning] WARNING: 2589=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID b19c7
svr03 fctl: [ID 517869 kern.warning] WARNING: 2591=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID b18c6
svr03 fctl: [ID 517869 kern.warning] WARNING: 2593=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID fffc0f
svr03 fctl: [ID 517869 kern.warning] WARNING: 2595=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID d0000
svr03 fctl: [ID 517869 kern.warning] WARNING: 2597=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID b0000
svr03 qlc: [ID 787125 kern.warning] WARNING: qlc(1) no lid for adisc b19c7
svr03 fp: [ID 517869 kern.info] NOTICE: fp(1): ADISC to b19c7 failed, cmd_flags=1 state=Packet Transport error, reason=No Connection
svr03 qlc: [ID 787125 kern.warning] WARNING: qlc(1) no lid for adisc b18c6
svr03 fctl: [ID 517869 kern.warning] WARNING: 2609=>fp(1)::fp_adisc_intr: Dev change notification to ULP port=300204db000, pd=300f2b5b998, map_flags=0 map_state=1
svr03 fp: [ID 517869 kern.info] NOTICE: fp(1): ADISC to b18c6 failed, cmd_flags=1 state=Packet Transport error, reason=No Connection
svr03 fctl: [ID 517869 kern.warning] WARNING: 2612=>fp(1)::fp_adisc_intr: Dev change notification to ULP port=300204db000, pd=300bd6c6140, map_flags=0 map_state=1
svr03 fcip: [ID 356328 kern.warning] WARNING: fc_ulp_login failed for d_id: 0xb19c7, rval: 0x41
svr03 fcip: [ID 356328 kern.warning] WARNING: fc_ulp_login failed for d_id: 0xb18c6, rval: 0x41
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/ssd@w5006016830601681,49 (ssd60):
svr03 Error for Command: read(10) Error Level: Retryable
svr03 scsi: [ID 107833 kern.notice] Requested Block: 95798528 Error Block: 95798528
svr03 scsi: [ID 107833 kern.notice] Vendor: DGC Serial Number: 4900004D24CL
svr03 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
svr03 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
... (repeated for several DGC LUNs with Error Level Retryable; omitted here)
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31):
svr03 Error for Command: write Error Level: Fatal
svr03 scsi: [ID 107833 kern.notice] Requested Block: 2303 Error Block: 2303
svr03 scsi: [ID 107833 kern.notice] Vendor: IBM Serial Number:
svr03 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
svr03 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0 (st30):
svr03 Error for Command: load/start/stop Error Level: Fatal
svr03 scsi: [ID 107833 kern.notice] Requested Block: 0 Error Block: 0
svr03 scsi: [ID 107833 kern.notice] Vendor: IBM Serial Number:
svr03 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
svr03 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
svr03 tldd[19472]: [ID 861947 daemon.error] TLD(0) unload failed in io_open, I/O error[5]
svr03 tldd[10635]: [ID 821050 daemon.error] TLD(0) drive 6 (device 0) is being DOWNED, status: Unable to SCSI unload drive
svr03 tldd[10635]: [ID 229259 daemon.error] Check integrity of the drive, drive path, and media
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/ssd@w5006016830601681,1f (ssd73):
svr03 Error for Command: write(10) Error Level: Retryable
svr03 scsi: [ID 107833 kern.notice] Requested Block: 20434 Error Block: 20434
svr03 scsi: [ID 107833 kern.notice] Vendor: DGC Serial Number: 1F000042A0CL
svr03 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
svr03 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0xca

Resolution

Similar issues occurred several times on the same server (svr03). The customer's DBA reported that the backup UFS file system, built on EMC[TM] CLARiiON Cx700 LUNs, showed very poor IO performance (less than 50 KB/s for read or write), and scheduled backup jobs failed.

A careful review of the IO subsystem configuration showed that the affected EMC[TM] CLARiiON Cx700 LUNs (which the OS reports as "Vendor: DGC") and the SAN-attached tape drives (which the OS reports as "Vendor: IBM") were all presented to Solaris via the same Fiber Channel port: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0. This is the lower Fiber Channel port of the server's 1st HBA (part number X6768 or 375-3108, a 2 Gb dual-port HBA).

The following diagram shows the original backup SAN architecture:

+--------+ +--------+ +--------+ +--------+
| 1st HBA| | 2nd HBA| | 3rd HBA| | 4th HBA|
|        | |        | |        | |        |
| FC(U)  | | FC(U)  | | FC(U)  | | FC(U)  |
|        | |        | |        | |        |
|        | |        | |        | |        |
| FC(L)  | | FC(L)  | | FC(L)  | | FC(L)  |
|  |     | |        | |        | |        |
+--|-----+ +--------+ +--------+ +--------+
   |
   |
   +--> To SAN switch ports for both Cx700 array & tape drives.
        (The port was configured in both zone svr3-bk-za & svr3-bk-zb, overlapped)

Remark: The upper FC ports [FC(U)] of the four HBAs above are used for a separate high-end storage connection.
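The review described above, noticing that disk and tape targets report errors behind the same fp port, can be sketched as a small log scan. This is only an illustration under stated assumptions: the function names and the regular expression are hypothetical, the sample lines are copied from the messages above, and the pattern only covers the ssd and st device paths seen in this case.

```python
import re
from collections import defaultdict

# Matches the device path in a Solaris scsi WARNING line, for example:
#   /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/ssd@w5006016830601681,49 (ssd60):
# Hypothetical pattern: it only knows the ssd (disk) and st (tape) drivers.
DEVICE_PATH = re.compile(
    r"(?P<port>/\S+?/fp@[0-9a-f]+,[0-9a-f]+)/(?P<driver>ssd|st)@w(?P<wwn>[0-9a-f]{16}),"
)

def devices_by_fc_port(log_lines):
    """Map each fp port path to the set of (driver, target WWN) pairs seen behind it."""
    ports = defaultdict(set)
    for line in log_lines:
        match = DEVICE_PATH.search(line)
        if match:
            ports[match.group("port")].add((match.group("driver"), match.group("wwn")))
    return dict(ports)

def shared_ports(ports):
    """Return the fp port paths that carry both disk (ssd) and tape (st) targets."""
    return sorted(
        port for port, devices in ports.items()
        if {driver for driver, _ in devices} >= {"ssd", "st"}
    )

sample = [
    "svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/ssd@w5006016830601681,49 (ssd60):",
    "svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31):",
    "svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0 (st30):",
]
print(shared_ports(devices_by_fc_port(sample)))
```

For the sample lines, the only port reported is /pci@fd,600000/SUNW,qlc@1,1/fp@0,0, the lower port of the 1st HBA.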
Zone configuration (svr3-bk-za and svr3-bk-zb):

Zone Definition    Port                           Port Type
--------------------------------------------------------------------
Zone svr03-bk-za   1st HBA lower port             F-Port
                   Cx700 Controller SPA           F-Port
                   Cx700 Controller SPB           F-Port
Zone svr03-bk-zb   1st HBA lower port (overlap)   F-Port
                   Tape Drive (st30)              L-Port, 1 Public
                   Tape Drive (st31)              L-Port, 1 Public
--------------------------------------------------------------------

We also noticed that the failure occurred only under heavy IO load; light IO workloads ran fine. Although Fabric devices and QuickLoop devices can work together on one connection, no storage or switch vendor recommends it. During a backup, each chunk of data is read from the array over this Fiber Channel connection and then written to the tape drives over the same connection. This can cause poor IO performance and, in turn, application failure.

When the switch ports for the tape drives in zone "svr03-bk-zb" were deliberately disabled, both tape drives (st30 and st31) went offline and the following messages were logged:
svr03 fctl: [ID 517869 kern.warning] WARNING: 2793=>fp(1)::GPN_ID for D_ID=b18c6 failed
svr03 fctl: [ID 517869 kern.warning] WARNING: 2794=>fp(1)::N_x Port with D_ID=b18c6, PWWN=50050763004a3e06 disappeared from fabric
svr03 fctl: [ID 517869 kern.warning] WARNING: 2804=>fp(1)::GPN_ID for D_ID=b19c7 failed
svr03 fctl: [ID 517869 kern.warning] WARNING: 2805=>fp(1)::N_x Port with D_ID=b19c7, PWWN=50050763004a3e05 disappeared from fabric
svr03 scsi: [ID 243001 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0 (fcp1):
svr03 offlining lun=0 (trace=0), target=b18c6 (trace=2800004)
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0 (st30):
svr03 transport rejected
svr03 genunix: [ID 408114 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0 (st30) offline
svr03 scsi: [ID 243001 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0 (fcp1):
svr03 offlining lun=0 (trace=0), target=b19c7 (trace=2800004)
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31):
svr03 transport rejected
svr03 genunix: [ID 408114 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31) offline

With the tape drives offline, IO performance became satisfactory: a single read or write thread could sustain 40-50 MB/s. Disabling the tape drives can therefore serve as a temporary workaround in a similar configuration. This result shows that the bottleneck was in the configuration.

The proposed solution was to rebuild the backup SAN architecture so that Fabric devices and QuickLoop devices are placed in two separate zones, each served by a different HBA.
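When reading the fp and fctl messages above, it can help to split each D_ID into its fields. A Fiber Channel port address is 24 bits wide: the high byte is the switch domain, the middle byte is the area, and the low byte is the port (on a public loop, the low byte carries the device's AL_PA). The following is a minimal sketch of that decoding; the helper name decode_d_id and the FcAddress type are hypothetical and not part of any Solaris tool.

```python
from typing import NamedTuple

class FcAddress(NamedTuple):
    """The three 8-bit fields of a 24-bit Fiber Channel port address."""
    domain: int
    area: int
    port: int

def decode_d_id(d_id: str) -> FcAddress:
    """Split a hexadecimal D_ID string (with or without a 0x prefix) into its fields."""
    value = int(d_id, 16)
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError(f"D_ID out of 24-bit range: {d_id}")
    return FcAddress((value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF)

# The two tape drive D_IDs taken from the messages above.
for d_id in ("b18c6", "b19c7"):
    addr = decode_d_id(d_id)
    print(f"{d_id}: domain=0x{addr.domain:02x} area=0x{addr.area:02x} port=0x{addr.port:02x}")
```

Both tape drive addresses decode to domain 0x0b, with areas 0x18 and 0x19 and low bytes 0xc6 and 0xc7 respectively.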
The two new planned zone definitions are as follows:

Zone Definition    Port                           Port Type
--------------------------------------------------------------------
Zone svr03-bk-za   1st HBA lower port             F-Port
                   2nd HBA lower port (for DMP)   F-Port
                   Cx700 Controller SPA           F-Port
                   Cx700 Controller SPB           F-Port
Zone svr03-bk-zb   3rd HBA lower port             F-Port
                   Tape Drive (st30)              L-Port, 1 Public
                   Tape Drive (st31)              L-Port, 1 Public
--------------------------------------------------------------------

The following diagram shows the final backup SAN architecture:

+--------+ +--------+ +--------+ +--------+
| 1st HBA| | 2nd HBA| | 3rd HBA| | 4th HBA|
|        | |        | |        | |        |
| FC(U)  | | FC(U)  | | FC(U)  | | FC(U)  |
|        | |        | |        | |        |
|        | |        | |        | |        |
| FC(L)  | | FC(L)  | | FC(L)  | | FC(L)  |
|  |     | |   |    | |    |   | |        |
+--|-----+ +---|----+ +----|---+ +--------+
   |           |           |
   |           |           |
   |           |           +--> To SAN Switch for tape drive connection
   |           |                (This port was configured in zone svr03-bk-zb)
   |           |
   |           +---> To SAN Switch for Cx700 Array connections (DMP path A)
   |                 (This port was configured in zone svr3-bk-za)
   |
   +--> To SAN Switch Port for Cx700 Array connections (DMP path B)
        (This port was configured in zone svr3-bk-za)

As a best practice when connecting devices through a SAN switch, avoid configuring Fabric and QuickLoop devices on the same Fiber Channel connection, especially when both are used by the same application.

Relief/Workaround

These two types of devices need to be in different zones. Until the zoning is corrected, temporarily disabling one of the two device types avoids the poor performance.
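The difference between the original and the planned zoning can be checked mechanically: an HBA port is a problem when the zones it belongs to expose both F-Port and L-Port targets. The sketch below models the two zone tables from this document as plain dictionaries. The data layout and the function name mixed_initiators are assumptions made for illustration, and nothing here queries a real switch.

```python
from collections import defaultdict

# Zone tables transcribed from this document (hypothetical data layout).
ORIGINAL_ZONES = {
    "svr03-bk-za": {
        "initiators": ["1st HBA lower port"],
        "targets": {"Cx700 Controller SPA": "F-Port", "Cx700 Controller SPB": "F-Port"},
    },
    "svr03-bk-zb": {
        "initiators": ["1st HBA lower port"],
        "targets": {"st30": "L-Port", "st31": "L-Port"},
    },
}

PLANNED_ZONES = {
    "svr03-bk-za": {
        "initiators": ["1st HBA lower port", "2nd HBA lower port"],
        "targets": {"Cx700 Controller SPA": "F-Port", "Cx700 Controller SPB": "F-Port"},
    },
    "svr03-bk-zb": {
        "initiators": ["3rd HBA lower port"],
        "targets": {"st30": "L-Port", "st31": "L-Port"},
    },
}

def mixed_initiators(zones):
    """Return HBA ports whose zones together expose both F-Port and L-Port targets."""
    seen = defaultdict(set)
    for zone in zones.values():
        for hba_port in zone["initiators"]:
            seen[hba_port].update(zone["targets"].values())
    return sorted(port for port, types in seen.items() if {"F-Port", "L-Port"} <= types)

print(mixed_initiators(ORIGINAL_ZONES))  # ['1st HBA lower port']
print(mixed_initiators(PLANNED_ZONES))   # []
```

The original layout flags the 1st HBA lower port because it is a member of both zones, while the planned layout flags nothing.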
Product
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server

SAN switch L-Port, F-Port, X6768, 375-3108, Qlogic qlc, Loop OFFLINE, Link ONLINE, tape, QuickLoop

Previously Published As: 83332

Change History
Date: 2005-12-06
User Name: 31620
Action: Approved
Comment: Verified Metadata - ok
Verified Keywords - ok
Verified still correct for audience - was free, has to be contract as per FvF http://kmo.central/howto/FvF.html
Checked review date - currently set to 2006-12-05
Checked for TM - added on efor Solaris
Publishing under the current publication rules of 18 Apr 2005:
Version: 4

Product_uuid
d842dd03-059b-11d8-84cb-080020a9ed93|Sun Fire E25K Server
1404a2d3-059a-11d8-84cb-080020a9ed93|Sun Fire E20K Server
29e4659c-0a18-11d6-9fa1-e67bbc033df8|Sun Fire 15K Server
077fd4c5-df8f-4320-ad69-7d01603a674d|Sun Fire 12K Server

Attachments
This solution has no attachment