Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1011106.1
Update Date:2012-07-30

Solution Type  Problem Resolution Sure

Solution  1011106.1 :   Fabric devices and QuickLoop devices exported to Solaris [TM] via the same Fiber Channel connection.  


Related Items
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: SF-Exxk
  • .Old GCS Categories>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
215279


Applies to:

Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms

Symptoms

When Fabric devices and QuickLoop devices are exported to Solaris via the same Fiber Channel connection, the link repeatedly goes offline and online under heavy load, and I/O performance is poor.

Example:
A high-end server at the customer site logged the following errors:

svr03 qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(1): Loop OFFLINE
svr03 qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(1): Link ONLINE
svr03 fctl: [ID 517869 kern.warning] WARNING: 2589=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID b19c7
svr03 fctl: [ID 517869 kern.warning] WARNING: 2591=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID b18c6
svr03 fctl: [ID 517869 kern.warning] WARNING: 2593=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID fffc0f
svr03 fctl: [ID 517869 kern.warning] WARNING: 2595=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID d0000
svr03 fctl: [ID 517869 kern.warning] WARNING: 2597=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID b0000
svr03 qlc: [ID 787125 kern.warning] WARNING: qlc(1) no lid for adisc b19c7
svr03 fp: [ID 517869 kern.info] NOTICE: fp(1): ADISC to b19c7 failed, cmd_flags=1 state=Packet Transport error, reason=No Connection
svr03 qlc: [ID 787125 kern.warning] WARNING: qlc(1) no lid for adisc b18c6
svr03 fctl: [ID 517869 kern.warning] WARNING: 2609=>fp(1)::fp_adisc_intr: Dev change notification to ULP port=300204db000, pd=300f2b5b998, map_flags=0 map_state=1
svr03 fp: [ID 517869 kern.info] NOTICE: fp(1): ADISC to b18c6 failed, cmd_flags=1 state=Packet Transport error, reason=No Connection
svr03 fctl: [ID 517869 kern.warning] WARNING: 2612=>fp(1)::fp_adisc_intr: Dev change notification to ULP port=300204db000, pd=300bd6c6140, map_flags=0 map_state=1
svr03 fcip: [ID 356328 kern.warning] WARNING: fc_ulp_login failed for d_id: 0xb19c7, rval: 0x41
svr03 fcip: [ID 356328 kern.warning] WARNING: fc_ulp_login failed for d_id: 0xb18c6, rval: 0x41
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/ssd@w5006016830601681,49 (ssd60):
svr03 Error for Command: read(10) Error Level: Retryable
svr03 scsi: [ID 107833 kern.notice] Requested Block: 95798528 Error Block: 95798528
svr03 scsi: [ID 107833 kern.notice] Vendor: DGC Serial Number: 4900004D24CL
svr03 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
svr03 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
... (the same Retryable errors were repeated for several other DGC LUNs; omitted here)
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31):
svr03 Error for Command: write Error Level: Fatal
svr03 scsi: [ID 107833 kern.notice] Requested Block: 2303 Error Block: 2303
svr03 scsi: [ID 107833 kern.notice] Vendor: IBM Serial Number:
svr03 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
svr03 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0 (st30):
svr03 Error for Command: load/start/stop Error Level: Fatal
svr03 scsi: [ID 107833 kern.notice] Requested Block: 0 Error Block: 0
svr03 scsi: [ID 107833 kern.notice] Vendor: IBM Serial Number:
svr03 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
svr03 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
svr03 tldd[19472]: [ID 861947 daemon.error] TLD(0) unload failed in io_open, I/O error[5]
svr03 tldd[10635]: [ID 821050 daemon.error] TLD(0) drive 6 (device 0) is being DOWNED, status: Unable to SCSI unload drive
svr03 tldd[10635]: [ID 229259 daemon.error] Check integrity of the drive, drive path, and media
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/ssd@w5006016830601681,1f (ssd73):
svr03 Error for Command: write(10) Error Level: Retryable
svr03 scsi: [ID 107833 kern.notice] Requested Block: 20434 Error Block: 20434
svr03 scsi: [ID 107833 kern.notice] Vendor: DGC Serial Number: 1F000042A0CL
svr03 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
svr03 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0xca
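
When a messages file contains many such entries, it can help to count the link state changes and the SCSI warnings per device. The following is a minimal sketch, assuming the message formats shown in the excerpt above; the function name summarize and the SAMPLE list are illustrative and not part of any Solaris tool.

```python
import re
from collections import Counter

# Patterns derived from the message formats in the excerpt above (assumption:
# other driver versions may format these messages differently).
LINK_EVENT = re.compile(r"qlc\((\d+)\): (Loop OFFLINE|Link ONLINE)")
SCSI_WARNING = re.compile(
    r"scsi: \[ID \d+ kern\.warning\] WARNING: (\S+) \((\w+)\):"
)


def summarize(lines):
    """Count link state changes per qlc instance and SCSI warnings per device."""
    link_events = Counter()
    scsi_warnings = Counter()
    for line in lines:
        match = LINK_EVENT.search(line)
        if match:
            # Key is (qlc instance number, event text).
            link_events[(int(match.group(1)), match.group(2))] += 1
            continue
        match = SCSI_WARNING.search(line)
        if match:
            # Key is the driver instance name, for example ssd60 or st31.
            scsi_warnings[match.group(2)] += 1
    return link_events, scsi_warnings


# A few lines copied from the excerpt above.
SAMPLE = [
    "svr03 qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(1): Loop OFFLINE",
    "svr03 qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(1): Link ONLINE",
    "svr03 qlc: [ID 787125 kern.warning] WARNING: qlc(1) no lid for adisc b19c7",
    "svr03 scsi: [ID 107833 kern.warning] WARNING: "
    "/pci@fd,600000/SUNW,qlc@1,1/fp@0,0/ssd@w5006016830601681,49 (ssd60):",
    "svr03 scsi: [ID 107833 kern.warning] WARNING: "
    "/pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31):",
]

if __name__ == "__main__":
    links, warnings = summarize(SAMPLE)
    for key, count in sorted(links.items()):
        print(key, count)
    for device, count in sorted(warnings.items()):
        print(device, count)
```

A high count of Loop OFFLINE and Link ONLINE pairs on one qlc instance, together with warnings on both ssd and st devices, matches the pattern described in this article.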

Cause

Similar issues occurred several times on the same server (svr03). The customer's DBA complained that the backup UFS file systems, which are built on EMC[TM] CLARiiON Cx700 LUNs, showed very poor I/O performance (less than 50 KB/s for reads or writes). Scheduled backup jobs failed.

A careful review of the I/O subsystem configuration showed that the affected EMC[TM] CLARiiON Cx700 LUNs (reported by the OS as "Vendor: DGC") and the SAN-attached tape drives (reported by the OS as "Vendor: IBM") were all presented to Solaris through the same Fiber Channel port: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0.
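
The same check can be sketched in code: group the device paths by their parent fp port and flag any port that carries both disk and tape devices. This is only an illustration, assuming that the driver name before the "@" in the last path component identifies the device class (ssd or sd for disks, st for tapes); the function names are hypothetical.

```python
from collections import defaultdict

# Device paths copied from the log excerpt in the Symptoms section.
DEVICE_PATHS = [
    "/pci@fd,600000/SUNW,qlc@1,1/fp@0,0/ssd@w5006016830601681,49",
    "/pci@fd,600000/SUNW,qlc@1,1/fp@0,0/ssd@w5006016830601681,1f",
    "/pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0",
    "/pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0",
]


def drivers_by_port(paths):
    """Map each parent fp port path to the set of driver names found under it."""
    ports = defaultdict(set)
    for path in paths:
        port, _, leaf = path.rpartition("/")   # split off the last component
        driver = leaf.split("@", 1)[0]         # "ssd@w...,49" -> "ssd"
        ports[port].add(driver)
    return dict(ports)


def shared_ports(paths, disk_drivers=("ssd", "sd"), tape_drivers=("st",)):
    """Return the ports that carry both disk and tape devices."""
    disks, tapes = set(disk_drivers), set(tape_drivers)
    return sorted(
        port
        for port, drivers in drivers_by_port(paths).items()
        if drivers & disks and drivers & tapes
    )


if __name__ == "__main__":
    for port in shared_ports(DEVICE_PATHS):
        print("disk and tape share port:", port)
```

With the paths from this case, the only port reported is /pci@fd,600000/SUNW,qlc@1,1/fp@0,0, which is the port identified in the review above.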

This is the lower Fiber Channel port of the server's 1st HBA (part number X6768 or 375-3108, a 2 Gb dual-port HBA). The following diagram shows the original backup SAN architecture:
+--------+  +--------+  +--------+  +--------+
| 1st HBA|  | 2nd HBA|  | 3rd HBA|  | 4th HBA|
|        |  |        |  |        |  |        |
|  FC(U) |  |  FC(U) |  |  FC(U) |  |  FC(U) |
|        |  |        |  |        |  |        |
|        |  |        |  |        |  |        |
|  FC(L) |  |  FC(L) |  |  FC(L) |  |  FC(L) |
|  |     |  |        |  |        |  |        |
+--|-----+  +--------+  +--------+  +--------+
   |
   |
   +--> To SAN switch ports for both the Cx700 array and the tape drives. (The port was configured in both zone svr03-bk-za and
   zone svr03-bk-zb, overlapped)

Note: The upper FC ports [FC(U)] of the four HBAs above are used for another high-end storage connection.

Zone configuration (svr03-bk-za and svr03-bk-zb):
Zone Definition    Port                            Port Type
--------------------------------------------------------------------
Zone svr03-bk-za   1st HBA lower port              F-Port
                   Cx700 Controller SPA            F-Port
                   Cx700 Controller SPB            F-Port
Zone svr03-bk-zb   1st HBA lower port (overlap)    F-Port
                   Tape Drive (st30)               L-Port, 1 Public
                   Tape Drive (st31)               L-Port, 1 Public
--------------------------------------------------------------------
We also noticed that these failures occurred only under heavy I/O load. Light I/O workloads ran without problems.
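
The overlap in this zone configuration can be detected mechanically. The sketch below is a hedged illustration, not a switch vendor tool: the ZONES dictionary is a hand-written copy of the zone table above with shortened member labels, and both function names are hypothetical.

```python
# Hand-written copy of the original zone table; member labels are shorthand
# chosen for this example, not names taken from the switch.
ZONES = {
    "svr03-bk-za": [
        ("HBA1 lower port", "F-Port"),
        ("Cx700 SPA", "F-Port"),
        ("Cx700 SPB", "F-Port"),
    ],
    "svr03-bk-zb": [
        ("HBA1 lower port", "F-Port"),
        ("st30", "L-Port"),
        ("st31", "L-Port"),
    ],
}


def overlapping_members(zones):
    """Return {member: [zone names]} for members that appear in more than one zone."""
    seen = {}
    for zone_name, members in zones.items():
        for name, _port_type in members:
            seen.setdefault(name, []).append(zone_name)
    return {name: names for name, names in seen.items() if len(names) > 1}


def reachable_port_types(zones, member):
    """Return the port types of all other members zoned together with `member`."""
    types = set()
    for members in zones.values():
        if any(name == member for name, _ in members):
            types.update(t for name, t in members if name != member)
    return types


if __name__ == "__main__":
    for member, names in overlapping_members(ZONES).items():
        types = sorted(reachable_port_types(ZONES, member))
        print(member, "is in zones", names, "and reaches port types", types)
```

For the table above, the 1st HBA lower port is the only member in two zones, and it reaches both F-Port and L-Port devices, which is the mixed configuration this article warns about.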

Solution

Although Fabric devices and QuickLoop devices can work on the same connection, this has never been recommended by any storage or switch vendor. During a backup, data must be read from the array through this Fiber Channel port and then written to the tape drives through the same port. This can cause poor I/O performance and, as a result, application failures.

When the switch ports for the tape drives in zone "svr03-bk-zb" were deliberately failed, both tape drives (st30 and st31) went offline. The following messages were logged:

svr03 fctl: [ID 517869 kern.warning] WARNING: 2793=>fp(1)::GPN_ID for D_ID=b18c6 failed
svr03 fctl: [ID 517869 kern.warning] WARNING: 2794=>fp(1)::N_x Port with D_ID=b18c6, PWWN=50050763004a3e06 disappeared from fabric
svr03 fctl: [ID 517869 kern.warning] WARNING: 2804=>fp(1)::GPN_ID for D_ID=b19c7 failed
svr03 fctl: [ID 517869 kern.warning] WARNING: 2805=>fp(1)::N_x Port with D_ID=b19c7, PWWN=50050763004a3e05 disappeared from fabric
svr03 scsi: [ID 243001 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0 (fcp1):
svr03 offlining lun=0 (trace=0), target=b18c6 (trace=2800004)
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0 (st30):
svr03 transport rejected
svr03 genunix: [ID 408114 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0 (st30) offline
svr03 scsi: [ID 243001 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0 (fcp1):
svr03 offlining lun=0 (trace=0), target=b19c7 (trace=2800004)
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31):
svr03 transport rejected
svr03 genunix: [ID 408114 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31) offline
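
To confirm which devices went offline, the port WWN in the "disappeared from fabric" messages can be matched against the WWN embedded in the st device paths. The following sketch assumes the message formats shown above; the function name offlined_devices is hypothetical, and SAMPLE holds a subset of the messages above.

```python
import re

# Patterns based on the fctl and genunix messages shown above (assumption:
# the formats are stable for this driver version).
GONE = re.compile(r"D_ID=([0-9a-f]+), PWWN=([0-9a-f]{16}) disappeared from fabric")
OFFLINE = re.compile(r"@w([0-9a-f]{16}),\w+ \((\w+)\) offline")


def offlined_devices(lines):
    """Return {instance: (port WWN, D_ID or None)} for devices reported offline."""
    d_id_by_wwn = {}
    wwn_by_instance = {}
    for line in lines:
        # finditer also copes with several messages fused onto one line.
        for match in GONE.finditer(line):
            d_id_by_wwn[match.group(2)] = match.group(1)
        for match in OFFLINE.finditer(line):
            wwn_by_instance[match.group(2)] = match.group(1)
    return {
        instance: (wwn, d_id_by_wwn.get(wwn))
        for instance, wwn in wwn_by_instance.items()
    }


SAMPLE = [
    "svr03 fctl: [ID 517869 kern.warning] WARNING: 2794=>fp(1)::N_x Port with "
    "D_ID=b18c6, PWWN=50050763004a3e06 disappeared from fabric",
    "svr03 fctl: [ID 517869 kern.warning] WARNING: 2805=>fp(1)::N_x Port with "
    "D_ID=b19c7, PWWN=50050763004a3e05 disappeared from fabric",
    "svr03 genunix: [ID 408114 kern.info] "
    "/pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0 (st30) offline",
    "svr03 genunix: [ID 408114 kern.info] "
    "/pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31) offline",
]

if __name__ == "__main__":
    for instance, (wwn, d_id) in sorted(offlined_devices(SAMPLE).items()):
        print(instance, wwn, d_id)
```

For the messages above, this pairs st30 with D_ID b18c6 and st31 with D_ID b19c7, the same two D_IDs that appear in the ADISC failures in the Symptoms section.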


With the tape drives offline, I/O performance became satisfactory: a single read or write thread could generate throughput of up to 40-50 MB/s. Disabling the tape drives can therefore be used as a temporary workaround in a similar configuration. This shows that the bottleneck was in the configuration.
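
To put the two throughput figures side by side, the short calculation below compares the "less than 50 KB/s" reported in the Cause section with the 40-50 MB/s measured here. It assumes decimal units (1 MB = 1000 KB); because the earlier figure is an upper bound, the result is a lower bound on the improvement.

```python
KB = 1000          # assumption: decimal units, bytes per KB
MB = 1000 * KB     # bytes per MB

before_max = 50 * KB      # "less than 50 KB/s" with disk and tape sharing the port
after_low = 40 * MB       # lower end of the range measured with tapes offline
after_high = 50 * MB      # upper end of the range measured with tapes offline


def improvement_factor(after, before):
    """Return how many times higher the `after` throughput is than `before`."""
    return after / before


if __name__ == "__main__":
    print("at least", improvement_factor(after_low, before_max), "times faster")
    print("up to", improvement_factor(after_high, before_max), "times or more")
```

Under these assumptions the improvement is at least a factor of 800, which is far larger than any effect expected from tuning alone and is consistent with a configuration bottleneck.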

The proposed solution was to rebuild the backup SAN architecture so that Fabric devices and QuickLoop devices are placed in two separate zones that also use different HBAs.
The two new planned zone definitions are as follows:
Zone Definition    Port                            Port Type
--------------------------------------------------------------------
Zone svr03-bk-za   1st HBA lower port              F-Port
                   2nd HBA lower port (for DMP)    F-Port
                   Cx700 Controller SPA            F-Port
                   Cx700 Controller SPB            F-Port
Zone svr03-bk-zb   3rd HBA lower port              F-Port
                   Tape Drive (st30)               L-Port, 1 Public
                   Tape Drive (st31)               L-Port, 1 Public
--------------------------------------------------------------------

The following diagram shows the final backup SAN architecture:
+--------+  +--------+  +--------+  +--------+
| 1st HBA|  | 2nd HBA|  | 3rd HBA|  | 4th HBA|
|        |  |        |  |        |  |        |
|  FC(U) |  |  FC(U) |  |  FC(U) |  |  FC(U) |
|        |  |        |  |        |  |        |
|        |  |        |  |        |  |        |
|  FC(L) |  |  FC(L) |  |  FC(L) |  |  FC(L) |
|  |     |  |   |    |  |    |   |  |        |
+--|-----+  +---|----+  +----|---+  +--------+
   |            |            |
   |            |            |
   |            |            +--> To SAN Switch for tape drive connection
   |            |            (This port was configured in zone svr03-bk-zb)
   |            |
   |            +---> To SAN Switch for Cx700 Array connections (DMP path A)
   |            (This port was configured in zone svr03-bk-za)
   |
   +--> To SAN Switch Port for Cx700 Array connections (DMP path B)
   (This port was configured in zone svr03-bk-za)

As a best practice when connecting devices through a SAN switch, avoid placing Fabric and QuickLoop devices on the same Fiber Channel connection, especially when both are used by the same application.

Relief/Workaround
These two types of devices need to be in different zones. Until the zoning is changed, temporarily disabling one of the two device types avoids the poor performance.

Product
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server


Keywords: SAN switch L-Port, F-Port, X6768, 375-3108, Qlogic qlc, Loop OFFLINE, Link ONLINE, tape, QuickLoop
Previously Published As 83332

  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.