Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-75-1008393.1
Update Date:2012-10-01
Keywords:

Solution Type  Troubleshooting Sure

Solution  1008393.1 :   Sun Fire [TM] SF3800/SF4800/SF4810/SF6800 - E4900/E6900 - V1280/E2900 - Netra 1280/1290 : Troubleshooting Cooling Fan Failures  


Related Items
  • Sun Fire 4810 Server
  • Sun Fire 3800 Server
  • Sun Netra 1290 Server
  • Sun Fire 6800 Server
  • Sun Fire E6900 Server
  • Sun Fire 4800 Server
  • Sun Fire E2900 Server
  • Sun Fire V1280 Server
  • Sun Fire E4900 Server
  • Sun Netra 1280 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Exx00
  • .Old GCS Categories>Sun Microsystems>Servers>Midrange Servers
  • .Old GCS Categories>Sun Microsystems>Servers>Midrange V and Netra Servers

Previously Published As 211476


Applies to:

Sun Fire 4800 Server - Version Not Applicable and later
Sun Fire 4810 Server - Version Not Applicable and later
Sun Fire 6800 Server - Version Not Applicable and later
Sun Fire E2900 Server - Version Not Applicable and later
Sun Fire E4900 Server - Version Not Applicable and later
All Platforms

Purpose

Description

This document describes how to troubleshoot cooling fan issues on Sun Fire [TM] 3800, 4800, 4810, E4900, 6800, E6900 (Serengeti) and Sun Fire [TM] V1280, E2900, and Netra [TM] 1280, 1290 (LightWeight8) systems.

Specifically, this document covers situations where a Fan Tray (FT) fan or Power Supply Unit (PSU) fan is suspected to be defective, or where a replacement FT or PSU does not function after it is installed.

  • To troubleshoot temperature warnings or messages related to a single component, see <Document:1010052.1> Troubleshooting temperature warnings on an individual component within a Sun Fire [TM] Serengeti or LightWeight8 system.
  • To troubleshoot temperature warnings or messages relating to multiple components, see <Document:1013119.1> Troubleshooting temperature warnings on multiple components within a Sun Fire [TM] Serengeti or LightWeight8 system.

 

Symptoms:

  • The issue may be described as a "bad Fan Tray", "bad PSU", "defective Fan" or similar.
  • Fan Tray(s) or Power Supply Unit(s) may be marked Failed in showenvironment output on the System Controller.
  • Domain(s) may fail to power on and boot, may run degraded (missing components), or may be completely unaffected.
  • A warning message such as the following may appear:
WARNING: PS2 temperature is elevated indicating it may have a failed cooling fan.

Troubleshooting Steps

Steps to Follow

Please validate that each troubleshooting step below is true for your environment. Each step provides instructions, or a link to a document, for validating the step and taking corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Please do not skip a step.

1. Verify external power is present and proper for the system.

 

  • Confirm that all the lights are on, the fans are spinning, and the SC is responsive or you are able to log in to the SC or a domain.
  • If the system has no power, see <Document:1010053.1> Troubleshooting Complete System Power Outages on Sun Fire [TM] Serengeti or LightWeight8 Systems.

 

2.  Verify the issue is not Alert 1000793.1 if the suspected failed fan is in a Power Supply Unit (PSU).



3.  Verify that the FT or PSU is marked FAILED in showenvironment.

  • Confirm the status as shown in <Document:1011930.1> Sun Fire[TM] 3800-6800 System Controller Application (ScApp) How To's.
  • Note that V1280/E2900 and Netra 1280/1290 systems have multiple fan types installed: System Fan Trays (FT) and IB or PCI I/O Fans. IB Fans may show "ERROR LOW" in showenvironment command output instead of "Failed".
  • Also check showlogs output on all platforms to verify the fan issue.
NOTE:  A Sun "badged" engineer or Certified Partner engineer should perform service actions that relate to re-seats or replacements (upcoming steps).
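
As a rough illustration of the showenvironment check in step 3, the Python sketch below scans captured showenvironment text for fan rows whose status is not OK. It assumes the rows look like the LightWeight8 excerpts quoted in this article (slot, "Fan N", "Cooling N", a value, an age in seconds, then the status). The function name suspect_fans and the row pattern are hypothetical helpers for offline review of saved output; they are not part of ScApp or LOM, and other platforms or firmware levels may format rows differently.

```python
import re

# Assumed row shape, based only on the excerpts in this article, e.g.:
#   /N0/IB6 Fan 1 Cooling 0 Off 6 sec *** ERROR LOW ***
ROW = re.compile(
    r"^(?P<slot>/N\d+/\w+)\s+Fan\s+(?P<fan>\d+)\s+Cooling\s+\d+\s+"
    r"(?P<value>\w+)\s+\d+\s+sec\s+(?P<status>.+?)\s*$"
)

def suspect_fans(showenv_text):
    """Return (slot, fan number, status) for every fan row not marked OK."""
    suspects = []
    for line in showenv_text.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue  # header, blank line, or a non-fan sensor row
        # Drop the "***" decoration used around error states.
        status = m.group("status").strip("* ")
        if status.upper() != "OK":
            suspects.append((m.group("slot"), int(m.group("fan")), status))
    return suspects

sample = """\
Slot Device Sensor Value Units Age Status
/N0/FT0 Fan 7 Cooling 0 Auto 3 sec OK
/N0/IB6 Fan 0 Cooling 0 High 6 sec OK
/N0/IB6 Fan 1 Cooling 0 Off 6 sec *** ERROR LOW ***
"""
print(suspect_fans(sample))
```

Any row the sketch reports should still be confirmed against showlogs output before a service action is planned.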

ATTENTION: For individual fan failures on V1280/E2900 and Netra 1280/1290 systems, take the following into consideration.

For Fan 0, 1, 2, 3, 4 and 5:
============================

The fan can be Hot-Swapped.
Some alarms will appear on the console but there will not be any impact to platform uptime.
Always have the replacement fan ready to be installed without delay.


For Fan 6 or 7:
===============

A failure of either of these fans will most likely result in a domain reboot within 9 minutes of the fan failure (to disable certain CPUs).
This behavior can vary depending on the measurement of the ambient temperature (showenvironment: ambient sensors are called 'Board').
The following CPUs will most likely be disabled as a result of the failure of Fan-6 or Fan-7 :

     SB0/P2
     SB2/P2
     SB4/P2

First check to see if "all" of the above CPUs have been disabled.
If, and only if, all of them have been disabled, the faulty fan can be Hot-Swapped (Fan 6 or 7).
Otherwise, use Cold-Swap (power off the domain).

Note that the disabled CPUs will need to be enabled again (setls, enablecomponent, setchs, whichever is appropriate).
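
The hot-swap versus cold-swap guidance above for V1280/E2900 and Netra 1280/1290 fans can be summarized as a small decision function. This is a minimal sketch of the rule as stated in this article, not a Sun-supplied tool: the function name swap_method and the idea of passing in a list of disabled CPUs are illustrative assumptions, and the list itself would have to be gathered by hand from the system.

```python
# CPUs this article lists as most likely disabled after a Fan 6 or Fan 7 failure.
FAN_6_7_CPUS = {"SB0/P2", "SB2/P2", "SB4/P2"}

def swap_method(fan_number, disabled_cpus):
    """Return "hot-swap" or "cold-swap" following the rule in this article.

    fan_number    -- system fan index, 0 through 7
    disabled_cpus -- iterable of CPU names currently disabled, e.g. ["SB0/P2"]
    """
    if 0 <= fan_number <= 5:
        # Fans 0-5 can be hot-swapped; console alarms are expected.
        return "hot-swap"
    if fan_number in (6, 7):
        # Hot-swap only if ALL of the listed CPUs are already disabled.
        if FAN_6_7_CPUS <= set(disabled_cpus):
            return "hot-swap"
        return "cold-swap"  # power off the domain first
    raise ValueError("fan_number must be between 0 and 7")

print(swap_method(3, []))                              # fan 3, nothing disabled
print(swap_method(6, ["SB0/P2", "SB2/P2"]))            # SB4/P2 still enabled
print(swap_method(7, ["SB0/P2", "SB2/P2", "SB4/P2"]))  # all three disabled
```

After either kind of swap, the disabled CPUs still need to be re-enabled as noted above.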

If you are a customer and have reached this stage in the troubleshooting process, please open a Service Ticket with Oracle Support Services or engage your local field office for assistance. Mention this knowledge article so that the following steps can be continued to resolve the issue.

 

4.  If this is a newly installed or replaced FT or PSU, verify that re-seating it does not resolve the issue.

 

5.  Verify the errors persist if the component is replaced.

  • Reference the appropriate System Service Manual for complete instructions on FRU replacement and procedures (see Step 4 for links).

 

6.  Confirm the same FT or PSU is still suspect when the other SC is main (if dual SC configuration).

  • If the errors cease utilizing the new SC, then the former SC is suspect.
  • System Controller failover reference is: <Document:1003245.1> Sun Fire[TM] 3800-6900: System Controller failover functionality 

 

7.  Verify that the FT or PSU is fully functional in a different slot.

  • Essentially, this confirms whether the failure follows the FT or PSU or stays with the slot.
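
The slot-swap reasoning in step 7 can be written out as a simple rule. The sketch below is only an illustration of that reasoning: the function name interpret_slot_swap and its two boolean inputs are hypothetical, and the observations themselves come from showenvironment and showlogs before and after moving the part.

```python
def interpret_slot_swap(failed_in_original_slot, failed_in_other_slot):
    """Classify a fault after moving the same FT or PSU to a different slot.

    Returns "component" if the failure follows the FT or PSU,
    "slot" if it stays with the original slot, else "inconclusive".
    """
    if failed_in_original_slot and failed_in_other_slot:
        return "component"   # the same part fails wherever it is installed
    if failed_in_original_slot and not failed_in_other_slot:
        return "slot"        # the part works elsewhere; suspect the slot side
    return "inconclusive"    # no failure seen in the original slot

print(interpret_slot_swap(True, True))
print(interpret_slot_swap(True, False))
```

A "slot" result is what leads on to the backplane check in step 8.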

 

8.  Verify replacing the appropriate backplane does not resolve the issue.

  • Use the Sun System Handbook to determine the correct FRU for the part in question and server.
  • Reference the appropriate System Service Manual for complete instructions on FRU replacement and procedures (see Step 4 for links).

 

9.  Collect the following data and collaborate with the next level of support.

  • Preferably, collect Explorer data with the appropriate scextended or 1280extended option, as detailed in <Document:1019066.1> How to collect scextended or 1280extended Explorer.
  • If Explorer data cannot be collected for whatever reason, see <Document:1003529.1> Procedure to manually collect Sun Fire[TM] Midrange System Controller level failure data.


Internal Section

More details on the lw8 (V1280/E2900 and Netra 1280/1290 systems) IB Fans; this is an excerpt of showenvironment command output:

Slot     Device  Sensor     Value  Units  Age    Status
/N0/FT0  Fan 3   Cooling 0  Auto          3 sec  OK
/N0/FT0  Fan 0   Cooling 0  Auto          3 sec  OK
/N0/FT0  Fan 1   Cooling 0  Auto          3 sec  OK
/N0/FT0  Fan 2   Cooling 0  Auto          3 sec  OK
/N0/FT0  Fan 4   Cooling 0  Auto          3 sec  OK
/N0/FT0  Fan 5   Cooling 0  Auto          3 sec  OK
/N0/FT0  Fan 6   Cooling 0  Auto          3 sec  OK
/N0/FT0  Fan 7   Cooling 0  Auto          3 sec  OK   ---> System Fan Tray (FT)
...
/N0/IB6  Fan 0   Cooling 0  High          5 sec  OK
/N0/IB6  Fan 1   Cooling 0  High          5 sec  OK   ---> PCI Fans located near the I/O board, on the top of the chassis

These PCI or I/O Fans may also show errors in lom logs:

Wed Jan 18 23:56:06 lom-hostname lom: [ID 806328 local0.error] IB6/FAN1 Faulty: replacement required
Wed Jan 18 23:56:07 lom-hostname lom: [ID 324967 local0.notice] /N0/IB6, sensor status, fan failure (7,4)

and in showenvironment command output:

Slot     Device  Sensor     Value  Units  Age    Status
/N0/IB6  Fan 0   Cooling 0  High          6 sec  OK
/N0/IB6  Fan 1   Cooling 0  Off           6 sec  *** ERROR LOW ***
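
When reviewing saved LOM log output for these PCI or I/O fan errors, a short script can pick out the "Faulty" lines. The Python sketch below assumes the message shape of the sample log lines quoted above; the pattern and the function name faulty_fans are illustrative assumptions, and other firmware levels may word the message differently.

```python
import re

# Assumed message shape, taken from the sample log line in this article:
#   ... lom: [ID 806328 local0.error] IB6/FAN1 Faulty: replacement required
FAULT = re.compile(
    r"lom: \[ID \d+ [\w.]+\] (?P<board>\w+)/FAN(?P<fan>\d+) Faulty"
)

def faulty_fans(log_text):
    """Return (board, fan number) for each 'Faulty' fan message found."""
    found = []
    for line in log_text.splitlines():
        m = FAULT.search(line)
        if m:
            found.append((m.group("board"), int(m.group("fan"))))
    return found

sample = """\
Wed Jan 18 23:56:06 lom-hostname lom: [ID 806328 local0.error] IB6/FAN1 Faulty: replacement required
Wed Jan 18 23:56:07 lom-hostname lom: [ID 324967 local0.notice] /N0/IB6, sensor status, fan failure (7,4)
"""
print(faulty_fans(sample))
```

The second sample line (the sensor status notice) is deliberately not matched; only the explicit "Faulty" message identifies the fan by board and number.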

PCI Fans may be replaced while the system is running (part numbers are in the I/O section of the System Handbook); the replacement procedure is in the Service Manual.


Previously Published As 91430

References

<NOTE:1000793.1> - Multiple Power Supply Unit (PSU) Fan Failures on Sun Fire 3800-6800 Servers may Result in Platform Outage
<NOTE:1001307.1> - Power Supply Fan failures can occur without notification in Sun Fire 3800, 4800, 4810, and 6800 Systems.
<NOTE:1003245.1> - Sun Fire[TM] 3800-6900: System Controller failover functionality
<NOTE:1003529.1> - Procedure to manually collect System Controller (SC) level failure data on Sun Fire[TM] v1280, E2900, 3800, 4800, E4900, 6800, E6900, and Netra 1280, 1290 servers.
<NOTE:1010052.1> - Troubleshooting temperature warnings on an individual component within a Sun Fire [TM] Serengeti or LightWeight8 system
<NOTE:1010053.1> - Sun Fire [TM] SF3800/SF4800/SF4810/SF6800 - E4900/E6900 - V1280/E2900 - Netra 1280/1290 : Troubleshooting complete system power outages
<NOTE:1011930.1> - Sun Fire[TM] 3800, 4800, 4810, 6800, E4900, and E6900 System Controller Application (ScApp) How To's.
<NOTE:1013119.1> - Sun Fire [TM] SF3800/SF4800/SF4810/SF6800 - E4900/E6900 - V1280/E2900 - Netra 1280/1290 : Troubleshooting temperature warnings on multiple components
<NOTE:1018919.1> - Sun Fire[TM] 3800-6800 servers: Power supply failures, (fan failures)
<NOTE:1019066.1> - Sun Fire[TM] v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900 and Netra[TM] 1280, 2900 servers: How to collect scextended or 1280extended Explorer

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.