Document Audience: | INTERNAL |
Document ID: | I1030-1 |
Title: | T3/T3+ Power Cooling Unit (PCU) connectors on the midplane may be damaged by applied force and/or stress during normal maintenance. |
Copyright Notice: | Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved |
Update Date: | 2004-04-05 |
------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
------------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by Sun Service)
FIN #: I1030-1
Synopsis: T3/T3+ Power Cooling Unit (PCU) connectors on the midplane may be damaged by applied force and/or stress during normal maintenance.
Create Date: Mar/15/04
SunAlert: No
Top FIN/FCO Report: No
Products Reference: Sun StorEdge T3/T3+ Arrays
Product Category: Storage / Service
Product Affected:
Systems Affected:
-----------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- ANYSYS - System Platform Independent -
X-Options Affected:
-------------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- T3 ALL T3 StorEdge Array -
- T3+ ALL T3+ StorEdge Array -
Parts Affected:
----------------------
Part Number Description Model
----------- ----------- -----
300-1454-04 or lower PWR SUPPLY PURPLE1 NIMH -
References:
FIN: I0745-1
URL: http://grand.central/web/salesmktg/products/svc_prod/sz/SunMoves.html
http://sdpsweb.EBay/FIN_FCO/FIN/FINI0745-1_dir/connector_closeup.jpg
http://sdpsweb.EBay/FIN_FCO/FIN/FINI0745-1_dir/connector_closeup2.jpg
http://sdpsweb.EBay/FIN_FCO/FIN/FINI0745-1_dir/connector_inchassis.jpg
http://sdpsweb.EBay/FIN_FCO/FIN/FINI0745-1_dir/connector_inchassis2.jpg
http://sdpsweb.EBay/FIN_FCO/FIN/FINI0745-1_dir/connector_inchassis3.jpg
Issue Description:
Applied force or inadvertent overstress during PCU insertion may
cause the Power Cooling Unit (PCU) connector in StorEdge T3/T3+ arrays
to shift. This may break the PCU midplane connector and cause
the T3 PCU leads to short, rendering the disk array non-functional.
There have been no reported instances of data loss, but a new chassis
with midplane, loop cards, controller and at least one PCU may be
required to restore the system to full functionality.
When this connector damage occurs and power is applied to the T3/T3+
disk array, the power leads may short. Smoke and heat may be emitted
from the PCU and the T3/T3+ disk array, and the array becomes
non-functional. Heat damage can be seen on components of the
controller and loop cards, as well as on the PCU and the PCU midplane
connector.
NOTE: Follow all guidelines outlined in FIN I0745-1 when moving any
Sun T3 equipment.
Please see the following URL for T3 movement guidelines:
http://grand.central/web/salesmktg/products/svc_prod/sz/SunMoves.html
Implementation:
---
| | MANDATORY (Fully Proactive)
---
---
| | CONTROLLED PROACTIVE (per Sun Geo Plan)
---
---
| X | REACTIVE (As Required)
---
Corrective Action:
The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the
above-mentioned issue.
These guidelines help to ensure that midplane connectors are not
damaged during normal PCU insertion activities. However, the midplane
connector may already have sustained damage from prior PCU removal or
insertion actions. The guidelines therefore include instructions to
examine the midplane connector for damage before reinserting the
PCU. Please visit the following URLs to see examples of damaged
connectors:
http://sdpsweb.EBay/FIN_FCO/FIN/FINI0745-1_dir/connector_closeup.jpg
http://sdpsweb.EBay/FIN_FCO/FIN/FINI0745-1_dir/connector_closeup2.jpg
http://sdpsweb.EBay/FIN_FCO/FIN/FINI0745-1_dir/connector_inchassis.jpg
http://sdpsweb.EBay/FIN_FCO/FIN/FINI0745-1_dir/connector_inchassis2.jpg
http://sdpsweb.EBay/FIN_FCO/FIN/FINI0745-1_dir/connector_inchassis3.jpg
When performing a proactive replacement of PCUs that are near
expiration, always handle the PCUs carefully. Use only PCUs that
are in a sealed bag. Insert and remove PCUs with a single gentle
motion, without hesitation or side to side motion.
Please adhere to the guidelines shown below and perform the following
step-by-step procedure to remove and reinstall PCUs on all T3/T3+
arrays:
NOTE: Never change out both PCU1 and PCU2 on the same brick on the same
day. This gives the newly replaced PCU time to fully recharge.
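The same-day rule in the NOTE above can be checked against a planned
replacement schedule before going on site. The following is a minimal
sketch only; the plan format (unit, PCU, date) and the function name
same_day_conflicts are hypothetical and are not part of any Sun tool.

```python
from collections import defaultdict
from datetime import date


def same_day_conflicts(plan):
    """Return sorted (unit, day) pairs where more than one PCU of the
    same unit (brick) is scheduled for replacement on the same day.

    `plan` is a hypothetical list of (unit, pcu, day) tuples.
    """
    pcus_by_unit_day = defaultdict(set)
    for unit, pcu, day in plan:
        pcus_by_unit_day[(unit, day)].add(pcu)
    return sorted(key for key, pcus in pcus_by_unit_day.items() if len(pcus) > 1)


# Illustrative schedule (made-up dates): u1 has both PCUs on one day.
plan = [
    ("u1", "pcu1", date(2004, 4, 5)),
    ("u1", "pcu2", date(2004, 4, 5)),
    ("u2", "pcu1", date(2004, 4, 5)),
    ("u2", "pcu2", date(2004, 4, 6)),
]
conflicts = same_day_conflicts(plan)
```

Any entry in `conflicts` identifies a brick whose second PCU should be
moved to a later day.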
1. If applicable, confirm that the system administrator is ready and
is running 'tail -f /var/adm/messages.t300' on the admin host so that
syslog activity can be seen.
Ask the system administrator which filename is used on the host
system for remote logging of the array syslog. The filename given here
(/var/adm/messages.t300) is a standard name, but the customer may have
chosen a different one. Ensure that you are monitoring the remote
syslog messages from the array that you will be working on.
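When reviewing the remote syslog, it can help to filter for errors and
warnings that are not among the messages expected for the current
step. The sketch below assumes that the log lines look exactly like
the "typical messages" quoted in this notice (no host name column);
the regular expression and the function names are illustrative
assumptions, not a documented log format.

```python
import re

# Assumption: line layout matches the examples in this notice, e.g.
# "Jan 14 19:47:47 LPCT[1]: W: u2pcu1: Switch off, serial no = 005363"
LINE_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<task>[A-Za-z]+)\[(?P<instance>\d+)\]: "
    r"(?P<level>[A-Z]): "
    r"(?P<fru>u\d+[a-z]+\d*): "
    r"(?P<text>.*)$"
)


def parse_line(line):
    """Split one syslog line into named fields, or return None."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def additional_alerts(lines, fru, expected):
    """Return E/W records that concern another FRU or whose text does
    not start with one of the expected message fragments."""
    extras = []
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["level"] not in ("E", "W"):
            continue
        if rec["fru"] != fru or not rec["text"].startswith(tuple(expected)):
            extras.append(rec)
    return extras


# Lines quoted from this notice; the last one belongs to a later step.
lines = [
    "Jan 14 19:47:47 LPCT[1]: W: u2pcu1: Switch off, serial no = 005363",
    "Jan 14 19:47:48 LPCT[1]: W: u2pcu1: Off, serial no = 005363",
    "Jan 14 19:47:50 LPCT[1]: W: u2pcu1: DC not OK, serial no = 005363",
    "Jan 14 19:48:23 LPCT[1]: E: u2pcu1: Battery not present",
]
extras = additional_alerts(lines, "u2pcu1", ["Switch off", "Off", "DC not OK"])
```

Here `extras` holds only the "Battery not present" record, because it
is not in the list of fragments expected for a power-off. If the
customer's log file carries a host name column, the pattern would need
to be adjusted.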
2. As a best practice, perform the following prerequisites:
A. Verify that all loop cables (for ES config) and MIAs are screwed
down tightly by using a small flathead screwdriver and tightening
each loop cable. Be very careful not to disconnect any loop cable.
If you notice a loop cable that is not screwed in at all, notify
the customer.
B. Verify that all controllers and loop cards are seated securely in
their respective slots by pushing on each card and verifying that all
latches are in the locked position.
C. Verify that all PCUs are seated securely in their respective slots
by pushing on each PCU and verifying that the PCU latches are
in the locked position.
D. Type "fru stat" to check that ALL T3 FRUs are in a healthy state
and that their LEDs are in their normal state before proceeding.
E. Type "date" and "tzset" to check that the date and
timezone are correct. If not, use the "date" and "tzset" commands
to set the date and timezone, respectively.
F. Type "refresh -s" to check that no battery refresh is running
before proceeding. Also check that the "Next Refresh" is not due to
begin shortly after the PCU replacement. If it is, re-schedule the
"Next Refresh" to a later time (24 hours). Refer to the Field Service
Manual for changing the refresh time (BAT_BEG) in the file
/etc/schd.conf. A battery status of "Low" is acceptable, because the
purpose of this maintenance action is to replace the PCU or its
battery pack.
G. Type "proc list" to check that no drive reconstruction is running
before proceeding.
NOTE: If possible, perform this procedure during a maintenance
window to minimize disruption to customer operations.
H. Notify the customer that performance will degrade during and
after this procedure, because the new batteries need to be charged
after power on. Charging can take up to 12 hours (per battery),
and write caching is disabled during this time.
I. Advise the customer SysAdmin that the midplane connector will be
inspected while the PCU is removed. If the inspection finds a
cracked or damaged PCU midplane connector, notify the customer
immediately. Ask the customer SysAdmin to make the operational
decision whether to try to install a new PCU, or to start shutting
down access so the chassis can be replaced. The chassis will need
to be swapped out at some point.
3. All new PCUs from the RSLs should be at revision level -04. If you
cannot reliably identify the revision level, return the PCU to stock
and clearly identify the issue.
4. To remove the PCU, either to swap the battery or to replace the
PCU itself, you must power off the PCU before pulling it out.
Carefully observe and follow these guidelines:
A. Power off PCU, wait 30 seconds. (watch syslog)
NOTE - DO NOT POWER OFF MORE THAN ONE PCU AT A TIME FOR EITHER ES
OR WG CONFIGURATION. Powering off/removing a PCU will cause
the T3 cache to run in write-through mode. Make sure that
the AC LED (left) is AMBER and the PS LED (right) is OFF.
typical messages:
Jan 14 19:47:47 LPCT[1]: W: u2pcu1: Switch off, serial no = 005363
Jan 14 19:47:48 LPCT[1]: W: u2pcu1: Off, serial no = 005363
Jan 14 19:47:50 LPCT[1]: W: u2pcu1: DC not OK, serial no = 005363
No additional errors or warnings are noted.
B. Disconnect power cord from PCU. (watch syslog)
typical message:
Jan 14 19:48:23 LPCT[1]: E: u2pcu1: Battery not present
No additional errors or warnings are noted.
C. Push the PCU latches into the unlocked position and pull the unit
out of the disk tray. Wait 15 seconds and then verify that both
controller online LEDs are still GREEN. If any controller LED
changes to non-solid GREEN (i.e., OFF/AMBER/Flashing AMBER),
immediately refer to the "Troubleshooting" section below before
continuing.
CAUTION - Any PCU that is removed must be replaced within 30 minutes
or the Sun StorEdge T3 disk tray and all attached disk
trays will automatically shut down and power off.
CAUTION - For partner pair configurations, make sure that the loop
cables have sufficient length to be spread apart so you can
remove u1pcu1. Also make sure that the loop cables, along
with other cables connected to the T3, are screwed in
tightly so you do not inadvertently knock them off during
removal/insertion.
typical messages:
Jan 14 19:49:06 LPCT[1]: N: u2pcu1: Warranty date was cleared.
Jan 14 19:49:06 LPCT[1]: E: u2pcu1: Not present
Jan 14 19:49:06 TMRT[1]: E: u2pcu1: Missing; system shutting down
in 30 minutes
Jan 14 19:49:08 TMRT[1]: E: u2ctr: Multiple Fan Faults; system
shutting down in 30 minutes
Jan 14 19:50:45 LPCT[2]: E: u2pcu1: Not present
No additional errors or warnings are noted.
D. Look inside the PCU bay, inspect the left and right sides of the
PCU midplane connector for cracks or other damage. A working
flashlight is required to inspect the connector.
NOTE: PCU must be inserted within 30 minutes, otherwise the brick
will time out and shut off.
E. If obvious damage is seen, inform the SysAdmin of the risk of an
outage as soon as insertion of the new PCU is attempted.
Ask the customer SysAdmin to make the operational decision whether
to try to put the new PCU in, or to start shutting down
access so the chassis can be replaced. The chassis will need to be
swapped out at some point.
CAUTION - The same fault (cracked connector) may also result
from incorrect removal/re-insertion of a PCU. PCUs
should be inserted and removed with a single gentle
motion, without hesitation or side to side motion.
F. If no damage is seen, carefully install the replacement PCU. Do
not force it. If any abnormal resistance or friction is felt, select
another PCU to use in this T3 chassis. You can most likely use the
PCU that experienced friction in the next T3. Observe the same
insertion procedure.
G. Install the new PCU. Wait 30 seconds and then verify that both
controller online LEDs are still GREEN. If any controller LED
changes to non-solid GREEN (i.e., OFF/AMBER/Flashing AMBER),
immediately refer to the "Troubleshooting" section below before
continuing.
typical messages:
Jan 14 19:50:06 LPCT[1]: E: u2pcu1: Over temperature, serial no =
005363
Jan 14 19:50:06 LPCT[1]: W: u2pcu1: Switch off, serial no = 005363
Jan 14 19:50:07 LPCT[1]: W: u2pcu1: Off, serial no = 005363
Jan 14 19:50:07 LPCT[1]: E: u2pcu1: Battery not present
Jan 14 19:50:11 LPCT[1]: W: u2pcu1: DC not OK, serial no = 005363
No additional errors or warnings are noted.
H. Push the PCU latches into the locked position.
I. Connect power cord to PCU. (watch syslog)
typical messages:
Jan 14 19:50:58 LPCT[1]: N: u2pcu1: Battery not OK
Jan 14 19:50:58 LPCT[1]: W: u2pcu1: Off, serial no = 005363
No additional errors or warnings are noted.
J. Verify that the AC LED (left) is AMBER, indicating that AC power
is present.
K. Power on PCU, wait 30 seconds. (watch syslog)
typical message:
Jan 14 19:51:40 LPCT[1]: N: u2pcu1: Battery not OK
No additional errors or warnings are noted.
L. Verify that both LEDs on the Power Cooling Unit are GREEN,
indicating that the unit is receiving power. Wait 15 seconds
and then verify that both controller online LEDs are still GREEN.
If any controller LED changes to AMBER, immediately refer to the
"Troubleshooting" section below before continuing.
NOTE - The PS LED (right) may blink GREEN for a period of time
(up to 12 hours per battery while it charges and write
caching is disabled).
M. Type "fru stat" to check if new PCU is recognized and functioning.
Battery might show up as "fault" as it is charging up.
N. Verify the Battery Warranty Date by typing "id read u(x)pcu(y)".
hostname:/:<1>id read u1pcu1
Revision: 0000
Manufacture Week: 00421999
Battery Install Week : 00222001 <----- week # when battery was installed
Battery Life Used : 0 days, 0 hours <----- usage since pcu inserted
Battery Life Span : 730 days, 12 hours
Serial Number : 003566
range
Battery Warranty Date: 20010322172349 <----- date & time when PCU switch was turned on
Battery Internal Flag: 0x00000000
Vendor ID : TECTROL-CAN
Model ID : 300-1454-01(50)
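The date fields in the "id read" output above can be decoded for
easier comparison with the service record. The sketch below is
illustrative only. It assumes that the Battery Warranty Date is
encoded as yyyymmddhhmmss and that the week fields are encoded as
00WWYYYY (week number, then year); both layouts are inferred from the
sample output and its annotations, not from product documentation.
The function names are hypothetical.

```python
from datetime import datetime


def parse_id_read(text):
    """Collect 'name : value' pairs from id read output, dropping the
    '<-----' annotations and any line that has no value."""
    fields = {}
    for line in text.splitlines():
        line = line.split("<-----")[0]
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def decode_week(value):
    """Assumed layout 00WWYYYY -> (week, year)."""
    return int(value[2:4]), int(value[4:8])


def decode_warranty_date(value):
    """Assumed layout yyyymmddhhmmss -> datetime."""
    return datetime.strptime(value, "%Y%m%d%H%M%S")


# Field lines copied from the sample output in this notice.
sample = """\
Manufacture Week: 00421999
Battery Install Week : 00222001 <----- week # when battery was installed
Battery Life Used : 0 days, 0 hours
Battery Warranty Date: 20010322172349
"""
fields = parse_id_read(sample)
install_week = decode_week(fields["Battery Install Week"])
warranty = decode_warranty_date(fields["Battery Warranty Date"])
```

After a replacement, a decoded warranty date close to the time the PCU
switch was turned on is consistent with the annotation in the sample
output.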
5. Troubleshooting
During the removal, insertion, or switching on of the PCU, there is
a very small chance that the T3 (ES or WG config) will reboot and,
in an ES config, that one T3 controller will be disabled. When
this happens, the controller LED changes from solid GREEN to OFF
(reboot started), AMBER (booting), or Flashing AMBER (disabled).
It is important to run the extractor after the T3 boots up and to
get the reset log of the disabled controller. By default, the
extractor gets the reset log of the remaining live controller.
Give engineering the extractor output and reset log for analysis, and
note when the reboot occurred, i.e., at removal, insertion, or power on.
Whether the disabled controller can be reused depends on any
valid information from the reset log. To get the reset log of the
disabled controller:
A. Remove the disabled controller from the T3.
B. Insert a new controller.
NOTE: the new controller will boot up as alt master role for
ES config.
C. Take the removed controller back, install it in a spare T3
(single brick), and let it boot up.
D. Via the telnet session or serial port, type "logger -dmprstlog"
to dump the reset log to the T3 syslog.
E. If the reset log shows a valid hardware problem (e.g., cache parity
error) around the time the PCU was replaced, the controller should
be sent back via CPAS.
Example:
Jul 18 20:15:26 pshc[1]: N: logger -dmprstlog
Jul 18 20:15:26 pshc[1]: W: u1ctr SysFail Reset (7001) was initiated
at Cache memory parity error detected 20010626 163740
^^^^^^^^ ^^^^^^
/ /
/ /
yyyymmdd hr/min/sec
F. If the reset log shows other, non-hardware related messages and
the time of occurrence is not around the time the PCU was replaced,
the controller can be deemed good. The problem is more likely
related to firmware than hardware.
Example:
Jul 13 22:03:26 pshc[1]: W: u1ctr Exception Reset (2004) was
initiated at Instruction Access exception 20001103 175513
^^^^^^^^ ^^^^^^
/ /
/ /
yyyymmdd hr/min/sec
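The timing part of steps E and F can be sketched as follows: extract
the trailing "yyyymmdd hhmmss" stamp from a reset log entry and
compare it with the time the PCU was replaced. This is a minimal
sketch. The one-hour window and the replacement time used below are
arbitrary assumptions for illustration (the notice only says "around
the time"), and the code does not judge whether the message itself is
hardware related.

```python
import re
from datetime import datetime, timedelta

# Trailing stamp as annotated in the examples: yyyymmdd hhmmss
RESET_STAMP_RE = re.compile(r"(\d{8}) (\d{6})\s*$")


def reset_time(entry):
    """Return the reset time embedded at the end of a reset log entry,
    or None if no stamp is found."""
    match = RESET_STAMP_RE.search(entry)
    if not match:
        return None
    return datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")


def near_replacement(entry, replaced_at, window=timedelta(hours=1)):
    """True if the reset happened within `window` of the replacement.
    The default window is an assumption, not a Sun guideline."""
    when = reset_time(entry)
    return when is not None and abs(when - replaced_at) <= window


# Entries from the examples above, joined onto one line each.
hw_entry = ("Jul 18 20:15:26 pshc[1]: W: u1ctr SysFail Reset (7001) was initiated "
            "at Cache memory parity error detected 20010626 163740")
fw_entry = ("Jul 13 22:03:26 pshc[1]: W: u1ctr Exception Reset (2004) was "
            "initiated at Instruction Access exception 20001103 175513")

# Hypothetical replacement time, chosen only for this illustration.
replaced_at = datetime(2001, 6, 26, 16, 30, 0)
hw_near = near_replacement(hw_entry, replaced_at)
fw_near = near_replacement(fw_entry, replaced_at)
```

With these inputs the parity error entry falls inside the window and
the exception entry does not. The final decision on returning the
controller still rests on the criteria in steps E and F.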
Comments:
None.
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Sun Services will attempt to contact
all affected customers to recommend implementation of the FIN.
ii) For CONTROLLED PROACTIVE FINs, Sun Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Sun Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.central/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.central/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
----------------
* Access the top level URL of https://spe.sun.com
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------