Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-73-1354590.1
Update Date:2012-08-09
Keywords:

Solution Type  FAB (standard) Sure

Solution  1354590.1 :   FCO A0309-1: Higher than expected failure rate of the D150 DC to DC converter on the System Board can lead to a domain panic.


Related Items
  • Sun Fire E2900 Server
  • Sun Fire V1280 Server
  • Sun Netra 1290 Server
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: Sun FAB




In this Document
Symptoms
Changes
Cause
Solution


Oracle Confidential (PARTNER). Do not distribute to customers.
Reason: FABs available to Internals and Partners only

Applies to:

Sun Fire V1280 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Netra 1290 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun Fire E2900 Server - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.
__________

Affected Parts:

540-7048 - CPU/Memory UniBoard w/4x US IV+ 1.5GHz, 0MB
540-6077 - CPU/Memory UniBoard w/4x UltraSPARC IV 1050MHz, 0MB
540-6232 - CPU/Memory UniBoard w/4x UltraSPARC IV 1050MHz, 0MB
540-6864 - CPU/Memory UniBoard w/4x UltraSPARC IV 1050MHz, 0MB, RoHS
540-6038 - CPU/Memory UniBoard w/4x UltraSPARC IV 1200MHz, 0MB
540-6257 - CPU/Memory UniBoard w/4x UltraSPARC IV 1200MHz, 0MB
540-6880 - CPU/Memory UniBoard w/4x UltraSPARC IV 1200MHz, 0MB, RoHS
540-6298 - CPU/Memory UniBoard w/4x UltraSPARC IV 1350MHz, 0MB
540-6664 - CPU/Memory UniBoard w/4x UltraSPARC IV 1350MHz, 0MB, RoHS
540-6442 - CPU/Memory UniBoard w/4x UltraSPARC IV+ 1500MHz, 0MB
540-6680 - CPU/Memory UniBoard w/4x UltraSPARC IV+ 1500MHz, 0MB, RoHS
540-6757 - CPU/Memory UniBoard w/4x UltraSPARC IV+ 1800MHz, 0MB, RoHS
540-7531 - CPU/Memory UniBoard w/4x UltraSPARC IV+ 1800MHz, 0MB, RoHS
540-7126 - CPU/Memory UniBoard w/4x UltraSPARC IV+ 1950MHz, 0MB, RoHS
560-2962 - CPU Clutch Kit (Qty of 3 system board clutches)


Symptoms

The following examples show typical error messages that accompany a panic for this issue:

Sep 26 11:45:46 sc lom: [ID 360430 local0.error] Device voltage problem: /N0/SB0 abnormal state for device: 
Board 0 3.3 VDC 0 Value: 0.37 Volts DC 
Fri Sep 26 11:45:46 sc lom: [ID 322610 local0.notice] CPU Board V3 at /N0/SB0 Device poll caused: 
sun.serengeti.FailedHwException: (SdcAsic)Asic.getTemp: Path broken between CBH and SDC: SB0.sdc.10 (12000010) 
Fri Sep 26 11:45:46 sc lom: [ID 336982 local0.notice] Device will not be polled 
Fri Sep 26 11:45:46 sc lom: [ID 664082 local0.notice] CPU Board V3 at /N0/SB0 Device poll caused: 
sun.serengeti.FailedHwException: (ArAsic)Asic.getTemp: Path broken between CBH and SDC: SB0.ar.10 (12080010) 
Fri Sep 26 11:45:46 sc lom: [ID 336982 local0.notice] Device will not be polled 

Sat Sep 27 06:16:24 sc lom: [ID 395834 local0.error] Attempt to power up /N0/SB0 failed: /N0/SB0 3.3V DC 
failed, observed: 0.15 volts 
Sat Sep 27 06:16:25 sc lom: [ID 503827 local0.error] sun.serengeti.HpuFailedException: CPU Board V3 at /N0/SB0 
Sat Sep 27 06:16:25 sc lom: [ID 889337 local0.notice] sun.serengeti.CommException 
Sat Sep 27 06:16:29 sc lom: [ID 304509 local0.error] No usable Cpu board in domain. 

Wed Oct 01 21:56:10 sc lom: [ID 390680 local0.notice] CPU Board V3 at /N0/SB0 Device poll caused: 
sun.serengeti.HpuFailedException: CpuVoltageA2D.getOutputVoltage: sun.serengeti.CommException: I2cComm.readCmd: 
Path broken between CBH and SDC: SB0.sbbc1.regs.c0 (102000c0) 
Wed Oct 01 21:56:10 sc lom: [ID 336982 local0.notice] Device will not be polled 
Wed Oct 01 21:56:10 sc lom: [ID 120592 local0.notice] /N0/SB0, sensor status, outside acceptable limits 
(7,1,0x207000d00070000)

Impact

When the DC to DC converter (D150) on a system board fails while the board is part of a running domain, the board loses power and the domain panics. If the converter fails during power-on or during POST, the failing board is faulted and powered off. The system board may or may not remain failed after a power cycle. If it remains failed, it will be configured out when the domain is brought back up. If it does not remain failed, configure it out of the domain manually until it can be replaced: the DC to DC converter occasionally recovers after a power cycle, allowing the board to pass POST and be configured back into the domain, where it is likely to fail again and cause another panic. To prevent additional panics, disable the affected board as follows:

    lom> setls -s disable -l sbx

Where x is the number of the affected system board (for example, sb0).
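As an illustration only, the symptom messages shown above can be turned into a simple log check. The sketch below is a minimal Python example, assuming the SC/LOM messages are available as plain text with each message on one line; the patterns, the function names suspect_boards and disable_commands, and the sample text are hypothetical and are not part of any Sun or Oracle tool. A match only flags a board for review: a voltage fault or a broken CBH-to-SDC path can have other causes, so confirm the diagnosis before disabling a board with the setls command shown above.

```python
import re

# Hypothetical signatures based on the example messages in this FAB.
# Illustrative only: real SC/LOM logs may wrap or word messages differently.
SYMPTOM_PATTERNS = [
    # Device voltage fault reported on a board (the example shows the 3.3 VDC rail)
    re.compile(r"Device voltage problem: /N0/SB(\d+) abnormal state"),
    # Board fails to power up because its 3.3V rail does not come up
    re.compile(r"Attempt to power up /N0/SB(\d+) failed: /N0/SB\d+ 3\.3V DC"),
    # SC reports a broken path to the ASICs on the board
    re.compile(r"Path broken between CBH and SDC: SB(\d+)\."),
]


def suspect_boards(log_text: str) -> set[str]:
    """Return board names such as 'SB0' that match any symptom pattern."""
    boards = set()
    for pattern in SYMPTOM_PATTERNS:
        for match in pattern.finditer(log_text):
            boards.add(f"SB{match.group(1)}")
    return boards


def disable_commands(boards: set[str]) -> list[str]:
    """Build the setls command from this FAB for each suspect board."""
    return [f"setls -s disable -l {board.lower()}" for board in sorted(boards)]


sample = (
    "Sat Sep 27 06:16:24 sc lom: [ID 395834 local0.error] Attempt to power up "
    "/N0/SB0 failed: /N0/SB0 3.3V DC failed, observed: 0.15 volts\n"
)
for command in disable_commands(suspect_boards(sample)):
    print(f"lom> {command}")  # prints: lom> setls -s disable -l sb0
```

The check returns a set of board names rather than a verdict, leaving the decision to disable and replace a board with the service engineer.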

Changes

Contributing Factors

This issue affects all Netra 1290 Server configurations and all Sun Fire V1280 and Sun Fire E2900 systems that contain at least one of the affected system boards.
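The Contributing Factors rule can be expressed as a small decision function. This is a hypothetical Python sketch: the platform strings and the function name system_affected are assumptions, and each board's flag is assumed to have been determined separately by comparing its part number and dash level against the affected-parts list.

```python
# Platforms named in this FAB; the exact strings are assumptions for illustration.
ALWAYS_AFFECTED = {"Netra 1290"}
BOARD_DEPENDENT = {"Sun Fire V1280", "Sun Fire E2900"}


def system_affected(platform: str, board_is_affected: list[bool]) -> bool:
    """Apply the Contributing Factors rule from this FAB.

    board_is_affected holds one flag per installed system board, computed
    separately against the affected-parts list.
    """
    if platform in ALWAYS_AFFECTED:
        # All Netra 1290 configurations are affected.
        return True
    if platform in BOARD_DEPENDENT:
        # V1280 and E2900 are affected only with at least one affected board.
        return any(board_is_affected)
    raise ValueError(f"platform not covered by this FCO: {platform}")


print(system_affected("Sun Fire V1280", [False, True, False]))  # True
print(system_affected("Sun Fire E2900", [False, False]))         # False
```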

Cause

Root Cause

The 3.3V DC to DC converter (onboard PSU) on the system board fails. A new, more robust DC to DC converter has been designed and manufactured to address this issue. This fix was made available in limited quantities in late 2010 via GSAPs 5337.B, 5361.A, and 5362.A; however, sufficient quantities to support the release of this FCO did not become available until August 2011.


Implementation: Upon Failure (Reactive)

Solution

Workaround

No workaround is available; see the Resolution section below.

Resolution

1) Upon failure only, replace the affected system board with one that is at or above the dash levels listed in the "Identification of Affected Parts (how to)" section below.

2) On platforms meeting the criteria below, proactively replace the clutches in all 3 system board slots, including any empty slots, by ordering and installing one CPU Clutch Kit (containing 3 system board clutches), p/n F560-2962.

If the platform has a system serial number less than or equal to 0818xxxx, replace all 3 clutches in the system board slots during the first service of the first failed board. Before replacing the clutches, check for prior FCO 309 implementation markings as shown in the attached picture.

After proactively replacing the clutches, write "FCO 309" and the current date with a permanent (indelible, non-erasable) marker on the inside of the front door, as illustrated in the attached picture.

If the platform has a system serial number greater than or equal to 0819xxxx, no proactive replacement of system board clutches is required or authorized.
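The serial-number criteria for the proactive clutch replacement can be sketched as a small check. This minimal Python example assumes the first four characters of the system serial number are digits that compare numerically against the 0818/0819 thresholds, and it treats an existing "FCO 309" marking inside the front door as evidence that the clutches were already replaced; both are interpretations of this FAB, and clutch_replacement_required is a hypothetical name. The check applies during service of the first failed system board.

```python
def clutch_replacement_required(serial: str, fco309_marked: bool = False) -> bool:
    """Decide whether all 3 system board clutches should be replaced proactively.

    Assumes the first four characters of the serial number are digits that can
    be compared numerically, as the 0818xxxx / 0819xxxx thresholds suggest.
    """
    prefix = serial[:4]
    if len(prefix) != 4 or not prefix.isdigit():
        raise ValueError(f"unexpected serial number format: {serial!r}")
    if fco309_marked:
        # "FCO 309" and a date already written inside the front door:
        # assumed to mean the clutches were replaced on an earlier service.
        return False
    # <= 0818xxxx: replace all 3 clutches; >= 0819xxxx: not required or authorized.
    return int(prefix) <= 818


print(clutch_replacement_required("0818ABCD"))  # True
print(clutch_replacement_required("0819ABCD"))  # False
```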


Clutch Replacement Instructions

Instructions for replacing the anti-gravity clutches can be found in the following Service Manual sections:

N1290 Service Manual - 819-4373-10 - Section 3.4

http://download.oracle.com/docs/cd/E19102-01/n1290.srvr/index.html

-OR-

E2900 Service Manual - 817-4054-15 - Section 14.1.2

http://download.oracle.com/docs/cd/E19095-01/sfe2900.srvr/index.html

NOTE: The Service Manual sections above describe how to replace an anti-gravity clutch. Older manuals instruct you to reuse the existing screws to attach the new clutch; however, it is highly recommended that you discard the original screws and use the screws supplied with the new clutch kit. These screws are very short, with few threads, and have (blue) contact glue on their tips to ensure they stay in place and do not vibrate loose.


Identification of Affected Parts (how to):

The dash levels in the following list indicate boards that have been reworked for this issue. Dash levels lower than the ones in this list indicate boards that remain affected by this issue.

540-7048-03   CPU/Memory UniBoard w/4x US IV+ 1.5GHz, 0MB 
540-6077-12   CPU/Memory UniBoard w/4x UltraSPARC IV 1050MHz, 0MB 
540-6232-06   CPU/Memory UniBoard w/4x UltraSPARC IV 1050MHz, 0MB 
540-6864-02   CPU/Memory UniBoard w/4x UltraSPARC IV 1050MHz, 0MB, RoHS 
540-6038-13   CPU/Memory UniBoard w/4x UltraSPARC IV 1200MHz, 0MB 
540-6257-08   CPU/Memory UniBoard w/4x UltraSPARC IV 1200MHz, 0MB 
540-6880-02   CPU/Memory UniBoard w/4x UltraSPARC IV 1200MHz, 0MB, RoHS 
540-6298-04   CPU/Memory UniBoard w/4x UltraSPARC IV 1350MHz, 0MB 
540-6664-02   CPU/Memory UniBoard w/4x UltraSPARC IV 1350MHz, 0MB, RoHS 
540-6442-06   CPU/Memory UniBoard w/4x UltraSPARC IV+ 1500MHz, 0MB 
540-6680-03   CPU/Memory UniBoard w/4x UltraSPARC IV+ 1500MHz, 0MB, RoHS 
540-6757-03   CPU/Memory UniBoard w/4x UltraSPARC IV+ 1800MHz, 0MB, RoHS 
540-7531-02   CPU/Memory UniBoard w/4x UltraSPARC IV+ 1800MHz, 0MB, RoHS 
540-7126-03   CPU/Memory UniBoard w/4x UltraSPARC IV+ 1950MHz, 0MB, RoHS
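To check a board against the dash levels listed above, compare the dash suffix of its part number with the reworked level for that base part number. The Python sketch below is illustrative only: the table is transcribed from the list above, the function name board_is_affected is hypothetical, and it assumes part numbers are written in the 540-NNNN-DD form used in this FAB.

```python
# Minimum (reworked) dash level for each affected base part number, from the list above.
FIXED_DASH_LEVEL = {
    "540-7048": 3,
    "540-6077": 12,
    "540-6232": 6,
    "540-6864": 2,
    "540-6038": 13,
    "540-6257": 8,
    "540-6880": 2,
    "540-6298": 4,
    "540-6664": 2,
    "540-6442": 6,
    "540-6680": 3,
    "540-6757": 3,
    "540-7531": 2,
    "540-7126": 3,
}


def board_is_affected(part_number: str) -> bool:
    """Return True if a board such as '540-6077-11' is below its reworked dash level."""
    part_number = part_number.strip()
    if part_number.count("-") != 2:
        raise ValueError(f"expected 540-NNNN-DD format: {part_number!r}")
    base, _, dash = part_number.rpartition("-")
    if base not in FIXED_DASH_LEVEL:
        return False  # not one of the CPU/Memory boards listed in this FAB
    return int(dash) < FIXED_DASH_LEVEL[base]


for part in ("540-6077-11", "540-6077-12", "540-7126-02"):
    print(part, "affected" if board_is_affected(part) else "reworked")
# 540-6077-11 affected
# 540-6077-12 reworked
# 540-7126-02 affected
```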

Hardware Remediation and Material Availability Details

At the time of publication of this FAB, good replacement materials were available globally.


References

Problem DocID 1019667.1 - Sun Fire[TM] Server System Board (SB) voltage errors.

Sun Alert DocID 1021703.1 - Potential for System Outages Due to Cooling Issues on Sun Fire V1280/E2900, Netra 1280/1290

FAB DocID 1021064.1 - Preventing potential system outages through proactive cooling improvements on Sun Fire Entry Level servers.

Reference Manual: 819-4373-10, 817-4054-15
ECO: WO_43653
GSAP: 5337, 5361, 5362

 

For information about FAB documents, their release processes, implementation strategies and billing information, click here.

In addition to the above you may email:

   [email protected]

Contacts

Contributor/Submitter: [email protected]
Eng Responsible Engineer: [email protected]
Responsible Manager: [email protected]
Eng Business Unit Group: Systems Group-Enterprise Systems


Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.