Document Audience: | INTERNAL |
Document ID: | I0777-1 |
Title: | In some configurations, after a FC loop disruption the SOC+ HBA intermittently takes the FC loop down |
Copyright Notice: | Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved |
Update Date: | 2002-02-25 |
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
FIN #: I0777-1
Synopsis: In some configurations, after a FC loop disruption the SOC+ HBA intermittently takes the FC loop down
Create Date: Feb/25/02
Keywords:
In some configurations, after a FC loop disruption the SOC+ HBA intermittently takes the FC loop down
SunAlert: No
Top FIN/FCO Report: No
Products Reference: SOC+ Host Bus Adapter
Product Category: Storage / Service
Product Affected:
Systems Affected
----------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- Anysys - System Platform Independent -
X-Options Affected
------------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- A5000 ALL StorEdge A5X00 -
- A3500 ALL StorEdge A3500 -
- T3 ALL StorEdge T3 -
X6538A - - X-OPT A3500FC CONTROLLER -
X2611A - - OPT INT I/O BD FOR EXX00 -
X2612A - - OPT INT I/O BD EXX00 W/FC-AL -
X2622A - - OPT INT GRAPHICS I/O BD EXX00 -
X6730A - - SBus/FC HA w/GBIC -
X6757A - - SBus Dual FC Network Adapter -
Parts Affected:
Part Number Description Model
----------- ----------- -----
375-3048-01 SBus Dual FC Network Adapter -
540-4026-01 A3500FC FC-AL Array Ctrlr w/Memory -
501-4266-08 SBus I/O Board with SOC+ -
501-4833-01 I/O Board with SOC+ -
501-4884-07 Graphics I/O Board with SOC+ -
501-5266-04 FC-AL SBus Card (FC100/S) -
501-5202-03 FC-AL SBus Card (FC100/S) -
References:
BugId: 4525143 - SOC+ SAN: SOC+ & FC switch or hub - LIP on one port kills
I/O on both ports.
4479045 - SOC+ & FC-AL hub: storage not reliably seen after reboot
or loop interruption.
PROBLEM DESCRIPTION:
In some configurations, after a Fibre Channel (FC) loop disruption such
as a system reboot or a plugging/unplugging of the FC cable at the FC
hub, the SOC+ HBA might intermittently take the FC loop down. Also, it
is possible that a LIP (Loop Initialization Primitive) received on an
idle port of a SOC+ HBA can disrupt I/O activity on the second port of
that HBA.
Problem 1 - Bug 4479045
Bug 4479045 Problem Description:
---------------------------------
In some configurations, after a FC loop disruption (e.g. a system
reboot, or plugging or unplugging a FC cable at the FC hub), the SOC+
HBA intermittently takes the FC loop down. When the FC loop connected
to a hub is idle, any loop interruption (e.g. rebooting one host, or
plugging or unplugging a host or storage device on the loop) can cause
the SOC+ chip on one of the HBAs to hang the entire loop. This prevents
access to the storage on that loop from the host(s) which are still
running. The problem can also occur when there is I/O load on the loop,
but in that case the driver's error recovery procedure often resets the
affected SOC+ chip, causing it to run correctly again. I/O errors are
sometimes, but not always, seen in this case.
Bug 4479045 Susceptible Configurations:
----------------------------------------
A Sun FC hub with 2 x host connections using SOC+ HBAs and storage can
be affected. The type of storage appears to be irrelevant (problem has
been seen with T3x and A3500FC, but not yet with A5x00). This is not
related to any cluster software so it can occur with Sun Cluster, VCS
or simply a loop with 2 hosts sharing storage using VxVM to manage data
ownership with manual failover (poor man's cluster).
Problem 2 - Bug 4525143
Bug 4525143 Problem Description:
--------------------------------
This problem is seen if there is heavy I/O load on one port of a SOC+
HBA, and a LIP (Loop Initialization Primitive) occurs on the other port
of the same SOC+ HBA while that second SOC+ port is idle. The LIP
could be caused by plugging or unplugging a fibre cable, rebooting a
second host attached to the same FC switch or hub, or a marginal FC
loop component (cable, GBIC). The SOC+ port on the loop which receives
the LIP does not initialize correctly, resulting in that port going
offline. This means that it loses contact with the storage on that
port. The I/Os via the other SOC+ port to the other loop sometimes
continue successfully, but sometimes they fail. The affected SOC+ port
will recover if all I/Os are stopped on the other SOC+ port on that
HBA.
Bug 4525143 Susceptible Configurations:
---------------------------------------
A SOC+ HBA where both ports are used.
There are currently several workarounds available for avoidance of
Bugs 4479045 and 4525143. See the Corrective Action Section below.
In addition, Sun Engineering is developing a new HBA which, when used
in conjunction with Solaris 8 (or higher) and SunCluster 3.0, will
prevent the problems seen with Bug 4479045.
Implementation:
---
| | MANDATORY (Fully Proactive)
---
---
| | CONTROLLED PROACTIVE (per Sun Geo Plan)
---
---
| X | REACTIVE (As Required)
---
Corrective Action:
The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned problem.
Problem 1 - Bug 4479045 Workarounds:
------------------------------------
NOTE: There are five workarounds for bug# 4479045. Workarounds 1, 2,
and 3 are proactive workarounds and may be implemented to prevent
or reduce the frequency of future incidents. Workarounds 4 and 5
are reactive, and may be used to recover a loop that is in a hung
condition.
Proactive
---------
1. Use Sun FC switches instead of FC hubs. However, if both ports on
one SOC+ HBA are connected to (different) Sun FC switches, then the
configuration is exposed to bug 4525143 (see below).
NOTE: This requires an escalation to CPRE for technical review and a
CIC.
Proactive
---------
2. [Only helps with T3s]
Use this device order in the FC hub: T3/host/T3/host
Do not use T3/T3/host/host or host/host/T3/T3.
This partial workaround relies on a side effect of the T3's error
recovery whereby, in the current QLogic 2100 firmware used in our T3
firmware, the 2100 chip is reset during error recovery. If the
resulting brief loss-of-sync is 'seen' by a SOC+ HBA which has entered
the faulty state, then this causes the SOC+ HBA to recover. See bug
4430163 for more details of this. The use of this hub ordering is not
expected to help with any other type of storage on the loop, unless its
error recovery behavior causes a similar loss-of-sync.
Since it depends on which SOC+ HBA has entered the faulty state, this
change in hub order is not completely effective in preventing the
problem from occurring, but has been seen to reduce the incidence from
1 in 5 host reboots to 1 in 20 host reboots on a customer site.
Furthermore, there is no certainty that this T3 behavior will always
occur in the future since it is not required by the FC-AL spec.
For example:
Host/Host/T3/T3: 1 failure out of 5 tries (80% reliable)
Host/T3/Host/T3: 1 failure out of 20 tries (95% reliable)
NOTE: If this is not successful, open an escalation with CPRE for
technical review.
NOTE: If the configuration has 2 x hosts and 2 x T3s, consider this
workaround.
NOTE: If there are more than 2 T3s, this workaround does not help.
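The hub ordering rule and the reliability figures for workaround 2 can
be checked mechanically. The following is a minimal Python sketch, not
a Sun tool: the function names and the list-of-strings representation
of the hub port order are assumptions made for illustration, and the
hub is treated as a ring so that the last device neighbors the first.

```python
def is_alternating(hub_order):
    """Return True if no two neighboring hub ports hold the same device type.

    hub_order is a list such as ["T3", "host", "T3", "host"] giving the
    device type on each hub port in order (hypothetical representation).
    The hub is treated as a ring, so the last entry neighbors the first.
    """
    n = len(hub_order)
    if n < 2:
        return False
    return all(hub_order[i] != hub_order[(i + 1) % n] for i in range(n))


def workaround_2_applies(hub_order):
    """The FIN suggests this workaround for 2 hosts and 2 T3s only."""
    return hub_order.count("T3") == 2 and hub_order.count("host") == 2


def reliability_percent(failures, tries):
    """Percentage of tries that did not end in a hung loop."""
    return 100.0 * (tries - failures) / tries
```

With the figures quoted above, reliability_percent(1, 5) gives 80.0
and reliability_percent(1, 20) gives 95.0.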
Proactive
---------
3. [Only helps with T3s]
Use only 1 x T3 per FC hub. This relies on the same T3 error recovery
behavior as described in (2) above. It "appears" to be completely
effective in preventing this issue from being seen.
NOTE: If this is not successful, open an escalation with CPRE for
technical review.
Reactive
--------
4. Use 'luxadm -e forcelip' to send a LIP to the affected loop from
each host on that loop. The LIP *must* come from the SOC+ HBA which
has been affected, but it is not always possible to identify which
HBA that is without access to a FC analyzer. Therefore, do it on
every HBA on the affected loop, one host at a time. This is only a
partial recovery method, but it does work sometimes.
NOTE: If one host sees the devices, the loop is functional and the
devices will be seen on the other host as well, so there is no
need to issue 'luxadm -e forcelip' from the other host. After
issuing 'luxadm -e forcelip', run the format command to check for
the devices. If the devices are seen, stop issuing 'forcelip',
because the loop has recovered.
NOTE: If this is not successful, open an escalation with CPRE for
technical review.
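The recovery sequence in workaround 4 can be sketched as a loop over
the hosts on the affected FC loop. This Python sketch is illustrative
only: force_lip and devices_visible are hypothetical callables standing
in for running 'luxadm -e forcelip' and checking the output of the
format command on a given host. They are not part of any Sun tool.

```python
def recover_loop(hosts, force_lip, devices_visible):
    """Issue a LIP from each host in turn until the storage is visible.

    Returns the host whose LIP was followed by a successful device
    check, or None if the loop is still hung after every host has been
    tried (the FIN then calls for an escalation to CPRE).
    """
    for host in hosts:
        force_lip(host)            # stands in for 'luxadm -e forcelip'
        if devices_visible(host):  # stands in for checking 'format'
            return host            # loop recovered; stop issuing LIPs
    return None
```

Stopping at the first successful device check follows the NOTE in
workaround 4: once one host sees the devices, the loop is functional.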
Reactive
--------
5. One can achieve the same result by unplugging and replugging the
fibre going to any GBIC on the hub connected to the hung loop.
Repeat that process until the loop stabilizes. Depending on which
fibre goes to which device on the loop, you might have to unplug and
replug more than 1 fibre to get the loop to recover. There is also a
risk that the customer will unplug things from the wrong loop if they
have several in close proximity.
NOTE: This workaround requires physical access to the hub, so it
cannot be performed remotely.
NOTE: If this is not successful, open an escalation with CPRE for
technical review.
Problem 2 - Bug 4525143 Workaround
----------------------------------
In the case of Bug 4525143, please use the following workaround:
. Only use 1 port on a SOC+ HBA.
. Do not connect anything to the second port on a SOC+ card.
NOTE: If this is not successful, open an escalation with CPRE for
technical review.
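A configuration audit for the Bug 4525143 workaround can be sketched as
follows. This is a minimal Python illustration under stated
assumptions: the dictionary layout (HBA name mapped to the list of its
connected port numbers) and the HBA names are hypothetical, not the
output of any Sun utility.

```python
def hbas_exposed_to_4525143(connected_ports):
    """Return the sorted names of SOC+ HBAs with more than one port in use.

    connected_ports maps a hypothetical HBA name to the port numbers
    that have something plugged in, e.g. {"socal0": [0, 1]}.
    """
    return sorted(
        hba for hba, ports in connected_ports.items() if len(set(ports)) > 1
    )
```

Any HBA returned by this check has both ports in use and is therefore
a candidate for moving one connection to a separate SOC+ HBA.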
Comments:
None
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as
the need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO
index.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------