Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1011650.1
Update Date:2011-01-04
Keywords:

Solution Type  Technical Instruction Sure

Solution  1011650.1 :   Sun Enterprise[TM] 3X00-6X00 Servers: Board Temperature Information  


Related Items
  • Sun Enterprise 3000 Server
  • Sun Enterprise 4500 Server
  • Sun Enterprise 5500 Server
  • Sun Enterprise 5000 Server
  • Sun Enterprise 6000 Server
  • Sun Enterprise 4000 Server
  • Sun Enterprise 3500 Server
  • Sun Enterprise 6500 Server
  • Solaris SPARC Operating System
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange Servers
  • GCS>Sun Microsystems>Operating Systems>Solaris Operating System

PreviouslyPublishedAs
215972


Applies to:

Sun Enterprise 3000 Server
Sun Enterprise 3500 Server
Sun Enterprise 4000 Server
Sun Enterprise 4500 Server
Sun Enterprise 5000 Server
All Platforms

Goal

This document provides an optimal temperature specification for Sun Enterprise[TM] classic systems. This document also describes how to tune the system environment to eliminate most known, transient errors.

Solution


Sun Enterprise[TM] 3X00-6X00 servers are tested for operation in ambient temperatures ranging from 0 to 68 degrees centigrade (32 to 154 degrees Fahrenheit). Each CPU/memory module has a thermistor installed below each of the processor boards. The analog output from each thermistor is fed to an analog-to-digital converter, and the resulting value is placed in a system register for reading by software.
The same implementation is used on the I/O boards and the clock board, so accurate temperature readings are maintained for all core system boards.

The sampled temperature is used to drive the speed of the cooling fans enclosed in the 300 watt Power Cooling Modules (PCMs).

Note: A memory-only CPU/memory board does not provide any temperature data because no thermistors are installed for monitoring the temperatures of DIMMs. However, memory DIMMs do not generate a significant amount of heat, so system reliability is not adversely affected in any way.

Software control is performed using a polling mechanism implemented in the Solaris[TM] Operating System (Solaris OS) that reads the temperature registers every 2 seconds. If the temperature reaches the "Yellow Zone" threshold, the system emits warnings as console messages. If the temperature reaches the "Red Zone" threshold, the system continues to run and repeats the warning. If the temperature of the affected component stays in the red zone for 20 seconds or longer, the system either powers down the component or powers itself down entirely, depending on the implementation level of the product.

Monitoring software sets the "Yellow Zone" at 60 degrees Celsius for CPU/memory boards, I/O boards, and the clock board. The "Red Zone" is set at 68 degrees Celsius on all boards.
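
The polling behavior described above can be sketched as a small state machine. This is only an illustration, not the Solaris driver logic: the class name, the action strings, and the choice to treat "reaches" as greater-than-or-equal are assumptions. Only the thresholds, the 2-second interval, and the 20-second limit come from the text.

```python
from dataclasses import dataclass

YELLOW_C = 60            # warning threshold from the text
RED_C = 68               # danger threshold from the text
POLL_SECONDS = 2         # polling interval from the text
RED_LIMIT_SECONDS = 20   # time allowed in the red zone before power-down


@dataclass
class BoardMonitor:
    """Hypothetical per-board monitor; counts consecutive red-zone polls."""
    red_polls: int = 0

    def poll(self, temp_c: float) -> str:
        """Process one reading and return the action a monitor might take."""
        if temp_c >= RED_C:
            # Seconds elapsed since the first consecutive red-zone reading.
            elapsed = self.red_polls * POLL_SECONDS
            self.red_polls += 1
            return "power-down" if elapsed >= RED_LIMIT_SECONDS else "red-warning"
        # Any reading below the red threshold resets the red-zone timer.
        self.red_polls = 0
        return "yellow-warning" if temp_c >= YELLOW_C else "ok"
```

With a 2-second poll, a board that enters the red zone and stays there returns "red-warning" for the first ten readings and "power-down" on the eleventh, which is 20 seconds after the first red reading.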

Finding Nominal Temperatures of the System Boards:
-------------------------------------------------
1) The first step is to determine which slots contain CPU/Memory boards. Then observe the temperatures for only these boards. I/O boards run at higher temperatures than CPU/Memory boards, and their temperatures can rise and fall with the usage of the attached SBus FRUs. SBus Card Thermal Baffle 330-3283 can be used to reduce the temperature of an I/O board that runs too hot.

If the temperature of any CPU/Memory board exceeds that of the others by more than 30%, investigate to determine the root cause. If any board is ramping more than 5.5C from its listed current temperature, investigate why. If the nominal temperature of all the system boards is above the optimal nominal temperature that is shown in the following example, perform an environmental audit to determine the cause.

High system board temperatures have been linked to a higher incidence of transient Duplicate Tag SRAM (DTAG) and ETAG parity errors as well as Uncorrectable Memory Errors (UE). Running a server outside the optimal nominal temperatures might also result in higher correctable memory error rates.
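
The two per-board checks in step 1 can be sketched as follows. The function names are hypothetical, and two interpretations are assumptions: "exceeds the others by more than 30%" is read as exceeding the mean of the other CPU/Memory boards, and "ramping" is read as the larger distance between the current reading and the recorded minimum or maximum.

```python
from statistics import mean


def hot_outliers(current_c, ratio=1.30):
    """Return slots whose temperature exceeds the mean of the other boards by more than 30%.

    current_c maps slot -> current temperature (Celsius) for CPU/Memory boards only.
    """
    flagged = []
    for slot, temp in current_c.items():
        others = [t for s, t in current_c.items() if s != slot]
        if others and temp > ratio * mean(others):
            flagged.append(slot)
    return flagged


def ramping_boards(readings, limit_c=5.5):
    """Return slots whose min or max is more than limit_c away from the current reading.

    readings maps slot -> (current, minimum, maximum) in Celsius.
    """
    return [
        slot
        for slot, (cur, lo, hi) in readings.items()
        if max(abs(hi - cur), abs(cur - lo)) > limit_c
    ]
```

For example, `hot_outliers({0: 29, 2: 29, 4: 45})` flags slot 4, because 45 is more than 1.3 times the 29-degree mean of the other two boards.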

A recent study of transient system errors shows that these errors could be greatly reduced by maintaining an environment that is optimal for the hardware. Temperature and humidity auditing can provide you with the data needed to achieve these optimal temperatures. An intake temperature of 70 degrees F (21.11 degrees C) and a relative humidity (RH) of 45% to 50% should bring a Sun Enterprise classic server into the optimal range.

2) The next step is to find the present nominal temperature.

To find the present nominal temperature, obtain the output of a prtdiag -v command from the suspect system. The output should be current and obtained after at least 168 hours (7 days) of up time. Simply add the temperatures of all the CPU/Memory boards (because the CPU/Memory boards contain the most temperature critical components), and then divide the total by the number of CPU/Memory boards to get an average.

For Example:

# prtdiag -v
System Temperatures (Celsius)
-----------------------------
Board  State  Current  Min  Max  Trend
-----  -----  -------  ---  ---  ------
0      OK     29       28   32   stable   <- temp should vary less than 5.5 Deg. C.
1      OK     39       38   48   stable   <- I/O Board not included in calculation
2      OK     29       30   32   stable
3      OK     41       40   45   stable   <- I/O Board not included in calculation
4      OK     30       28   32   stable
5      OK     31       32   35   stable
6      OK     32       31   36   stable
7      OK     32       31   33   stable
CLK    OK     33       30   34   stable   <- Clock Board not included in calculation.

In the example, the nominal temperature of this system's CPU/Memory boards (slots 0, 2, 4, 5, 6, and 7) is 30.5C:

(29+29+30+31+32+32)/6 = 183/6 = 30.5

ASIC Revisions
--------------

Brd  FHC  AC  SBus0  SBus1  PCI0  PCI1  FEPS  Board Type
---  ---  --  -----  -----  ----  ----  ----  ----------
0    1    5                                   CPU
1    1    5   1      1                  22    Dual-SBus-SOC+   <- I/O board not calculated
2    1    5                                   CPU
3    1    5   1      1                  22    Dual-SBus-SOC+   <- I/O board not calculated
4    1    5                                   CPU
5    1    5                                   CPU
6    1    5                                   CPU
7    1    5                                   CPU
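
The averaging step can be sketched in a few lines. This is a minimal illustration that works on rows already copied out of the two prtdiag sections shown above; the function names are hypothetical, and the sketch assumes whitespace-separated rows with the annotations removed.

```python
def cpu_slots(asic_rows):
    """Return the slots whose Board Type (last field) is CPU."""
    return {row.split()[0] for row in asic_rows if row.split()[-1] == "CPU"}


def nominal_temperature(temp_rows, slots):
    """Average the Current column (third field) over the given slots."""
    fields = (row.split() for row in temp_rows)
    current = [int(f[2]) for f in fields if f[0] in slots]
    return sum(current) / len(current)


# Rows from the example output above, annotations removed.
temp_rows = [
    "0 OK 29 28 32 stable", "1 OK 39 38 48 stable", "2 OK 29 30 32 stable",
    "3 OK 41 40 45 stable", "4 OK 30 28 32 stable", "5 OK 31 32 35 stable",
    "6 OK 32 31 36 stable", "7 OK 32 31 33 stable", "CLK OK 33 30 34 stable",
]
asic_rows = [
    "0 1 5 CPU", "1 1 5 1 1 22 Dual-SBus-SOC+", "2 1 5 CPU",
    "3 1 5 1 1 22 Dual-SBus-SOC+", "4 1 5 CPU", "5 1 5 CPU",
    "6 1 5 CPU", "7 1 5 CPU",
]

print(nominal_temperature(temp_rows, cpu_slots(asic_rows)))  # 30.5
```

The I/O boards in slots 1 and 3 and the clock board drop out because they do not appear in the set of CPU slots.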

Specifications for Temperature Zones
-------------------------------------
This section provides prtdiag specifications for temperature zones for Sun Enterprise classic servers and lists the required kernel patches. Systems with processors running at 400 MHz and above require these patches to prevent damage to the CPUs.

Solaris OS 2.5.1 with patch 103640-33 or later
Solaris OS 2.6 with sysctrl driver patch 105181-25 or later
Solaris OS 7 with kernel patch 106541-11 or later
Solaris OS 8 with fhc driver patch 108528-04 or later

When the patches are installed, warning messages appear at 60 degrees C, and a power down sequence of overheated CPU modules occurs at a new danger limit setting of 68 degrees C. These temperatures are lower than the standard default limits of 73 degrees C (for warning messages) and 83 degrees C (for a danger limit).
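
A patch-level check against the list above can be sketched as a simple revision comparison. The function name and the dictionary are hypothetical, and the sketch assumes patch IDs in the usual base-revision form. It does not account for a patch that has been obsoleted by a different patch ID.

```python
# Minimum patch levels taken from the list above, keyed by Solaris release.
REQUIRED_PATCH = {
    "2.5.1": "103640-33",
    "2.6": "105181-25",
    "7": "106541-11",
    "8": "108528-04",
}


def patch_meets_minimum(installed, required):
    """True if `installed` is the same patch as `required` at an equal or later revision."""
    inst_base, inst_rev = installed.split("-")
    req_base, req_rev = required.split("-")
    return inst_base == req_base and int(inst_rev) >= int(req_rev)
```

For example, `patch_meets_minimum("108528-13", REQUIRED_PATCH["8"])` is true, while `patch_meets_minimum("105181-20", REQUIRED_PATCH["2.6"])` is false.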

---------------------------------------------------------------------------
Board Type   Yellow Temps   Red Temps   Optimal Temps   Optimal/Nominal
---------------------------------------------------------------------------
CPU          24C - 60C      68C         28C - 32C       30C
I/O          24C - 60C      68C         28C - 46C       38C
CLOCK        24C - 60C      68C         28C - 32C       30C
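
A reading can be checked against this table with a small lookup. The names below are hypothetical, and the sketch assumes that 60C and 68C act as lower bounds for the warning and danger zones, as the surrounding text describes. Readings below 60C are sorted only into "optimal" or "outside-optimal".

```python
YELLOW_C = 60   # warning threshold
RED_C = 68      # danger threshold

# Optimal ranges (inclusive, Celsius) per board type, from the table above.
OPTIMAL_RANGE = {
    "CPU": (28, 32),
    "I/O": (28, 46),
    "CLOCK": (28, 32),
}


def classify(board_type, temp_c):
    """Classify one reading as red, yellow, optimal, or outside-optimal."""
    if temp_c >= RED_C:
        return "red"
    if temp_c >= YELLOW_C:
        return "yellow"
    low, high = OPTIMAL_RANGE[board_type]
    return "optimal" if low <= temp_c <= high else "outside-optimal"
```

A 41C reading is "optimal" for an I/O board but "outside-optimal" for a CPU/Memory board, which matches the different optimal ranges in the table.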

Note:

1) Severe temperature or relative humidity swings should be avoided.
Conditions should not be allowed to change by more than 5.5 C or
10% RH in any 60-minute period of operation.

2) A CPU temperature above 40 degrees centigrade might be within
specification for the CPU, but it is above the published
environmental specification for the computer room because it represents
an air intake temperature above 35 degrees C (95 degrees F).
Here are the room specifications for the Sun
Enterprise[TM] 3X00-6X00 servers:

           *** Operating: 5 C to 35 C (41 F to 95 F) ***

If the room is generally in compliance and the system boards are running above 40 degrees C, the machine might be installed in a "hotspot." Investigate the airflow and cooling around the machine and adjust them to bring it into compliance.
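
The swing limit in note 1 (no more than 5.5 C or 10% RH of change in any 60-minute period) can be checked against logged audit data with a pairwise comparison. This is a sketch under assumptions: the function name and the sample format (minute, temperature in Celsius, RH percent, sorted by time) are hypothetical.

```python
def excessive_swings(samples, max_temp_delta=5.5, max_rh_delta=10.0, window_min=60):
    """Return (start_minute, end_minute) pairs whose change exceeds a limit.

    samples is a list of (minute, temp_c, rh_percent) tuples sorted by minute.
    Only pairs that lie within window_min minutes of each other are compared.
    """
    violations = []
    for i, (t0, temp0, rh0) in enumerate(samples):
        for t1, temp1, rh1 in samples[i + 1:]:
            if t1 - t0 > window_min:
                break  # later samples are even further away
            if abs(temp1 - temp0) > max_temp_delta or abs(rh1 - rh0) > max_rh_delta:
                violations.append((t0, t1))
    return violations


log = [(0, 21.0, 47), (30, 23.0, 48), (60, 27.0, 49), (120, 27.5, 62)]
print(excessive_swings(log))  # [(0, 60), (60, 120)]
```

In this made-up log, the first violation is a 6 C rise within an hour and the second is a 13-point RH rise within an hour.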



Product
Solaris 2.5.1
Solaris 2.6 Operating System
Solaris 7 Operating System
Solaris 8 Operating System
Sun Enterprise 6500 Server
Sun Enterprise 6000 Server
Sun Enterprise 5500 Server
Sun Enterprise 5000 Server
Sun Enterprise 4500 Server
Sun Enterprise 4000 Server
Sun Enterprise 3500 Server
Sun Enterprise 3000 Server

Internal Comments
For internal Sun use only.

Many DTAG, ETAG, UE, and correctable errors are caused by environmental conditions (temperature and humidity problems). If the system experiences these types of failures, replacing hardware over and over will not solve the problem and most likely will lead to secondary problems that might make diagnosis even harder. It is a good idea to ensure that the system environment is correct before the hardware swap is initiated for these issues.




Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.