Sun Storage 7000 Unified Storage System: Thermal Events and Ongoing Fan issues on 7410/7110 storage arrays

Asset ID:	1-72-1386616.1
Update Date:	2012-03-30
Keywords:

Solution Type Problem Resolution Sure

Solution 1386616.1 : Sun Storage 7000 Unified Storage System: Thermal Events and Ongoing Fan issues on 7410/7110 storage arrays

Applies to:

Sun Storage 7410 Unified Storage System - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun Storage 7110 Unified Storage System - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
7000 Appliance OS (Fishworks)

Symptoms

Thermal shutdown events reported on Sun Storage 7410 Unified Storage and Sun Storage 7110 Unified Storage System arrays.

The array has a history of ongoing fan issues, usually reported via ASR. Service Processor resets were attempted to resolve the situation, but the issues persist. In extreme situations the Sun Storage 7000 Unified Storage System experiences a thermal shutdown event.

High temperature alerts are logged in the storage controller iLOM or IPMI logs, available via SupportBundle/BUI:

[logmgr: ID = 7401 : Wed Dec 7 07:02:21 2011 : IPMI : Log : critical : ID = 3a4 : 12/07/2011 : 07:02:21 : Temperature : sys.t_amb : Upper Critical going high : reading 44 > threshold 40 degrees C]
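When many such alerts have to be reviewed, it can help to extract the sensor name, reading, and threshold automatically. The following is a minimal sketch, assuming the alert lines follow the same layout as the example above; the function name parse_temperature_alert and the returned field names are illustrative and not part of any appliance tooling.

```python
import re

# Sample alert copied from the log excerpt above.
LINE = (
    "[logmgr: ID = 7401 : Wed Dec 7 07:02:21 2011 : IPMI : Log : critical : "
    "ID = 3a4 : 12/07/2011 : 07:02:21 : Temperature : sys.t_amb : "
    "Upper Critical going high : reading 44 > threshold 40 degrees C]"
)

# Assumption: every temperature alert ends with
# "Temperature : <sensor> : <event> : reading <n> > threshold <n> degrees C".
PATTERN = re.compile(
    r"Temperature : (?P<sensor>\S+) : (?P<event>[^:]+?) : "
    r"reading (?P<reading>\d+(?:\.\d+)?) > "
    r"threshold (?P<threshold>\d+(?:\.\d+)?) degrees C"
)


def parse_temperature_alert(line):
    """Return sensor, event, reading and threshold, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    reading = float(match.group("reading"))
    threshold = float(match.group("threshold"))
    return {
        "sensor": match.group("sensor"),
        "event": match.group("event").strip(),
        "reading_c": reading,
        "threshold_c": threshold,
        # How far the reading is above the threshold, in degrees C.
        "excess_c": reading - threshold,
    }


alert = parse_temperature_alert(LINE)
print(alert)
```

For the sample line, the ambient sensor sys.t_amb is 4 degrees C above its upper critical threshold.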

Reported incidents of a Sun Storage 7000 Unified Storage System controller shutting down automatically on an over-temperature warning (to prevent the system from overheating) can be confirmed by collecting the following data:

Fan LED status, currently or before the array shut down. Amber LEDs are suspect.
Confirmation that the current ambient data center temperature in the local environment is within an acceptable range
iLOM snapshot if possible
Any recent SupportBundle data if available
Once the system has been allowed to cool down, restart the Service Processor and the system using the console, and monitor the system as it comes up

If multiple external sensor fan errors are reported at FM 0, FM 1 and FM 2 at the same time, this may be BUG 6925325.

iLOM prompt:   reset /SP/
CLI> Maintenance - hardware - select - chassis-000 - select - SP - reset

Changes

Previous SRs dealt with fan issues and a reported Service Processor memory leak that was resolved with an SP reset.

The customer will need to confirm the current LED status of the fans on this unit and confirm that the actual environmental temperature is within an acceptable range in their data center.

An iLOM snapshot would also be useful at this time. I have included a link to the iLOM Service Processor sunservice data collection script, collectDebugInfo.sh.

To find the SP IP address, go to the Web GUI:
Maintenance -> Hardware -> select the Head -> go to "Show Details" -> go to "SP"

  Log in as   sunservice@SP_ip-address (root password)

        (flash)root@v-ss7110b-sp-gmp03:~#

1. /usr/local/bin/collectDebugInfo.sh

Cause

A number of known issues exist for the 7x10 array relating to memory leaks on the Service Processor. Over time memory becomes depleted and the Service Processor becomes unresponsive and/or hangs.
When present, the issue surfaces somewhere between 30 and 60 days of uptime if running Service Processor firmware below BIOS43; these BIOS versions therefore require an SP reboot every 30 days.
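The 30-day SP reboot interval for firmware below BIOS43 can be tracked with simple date arithmetic. This is a minimal sketch, assuming the date of the last SP reset is known from site records; the function names and the example dates are hypothetical.

```python
from datetime import date, timedelta

# Interval stated above for SP firmware below BIOS43.
SP_RESET_INTERVAL = timedelta(days=30)


def next_sp_reset_due(last_reset):
    """Date by which the Service Processor should be reset again."""
    return last_reset + SP_RESET_INTERVAL


def sp_reset_overdue(last_reset, today):
    """True if the 30-day interval has elapsed since the last SP reset."""
    return today >= next_sp_reset_due(last_reset)


# Hypothetical example: the SP was last reset on 1 November 2011.
last_reset = date(2011, 11, 1)
print(next_sp_reset_due(last_reset))
print(sp_reset_overdue(last_reset, date(2011, 12, 7)))
```

This only applies to systems still running the affected firmware; systems with the SP/BIOS fixes do not need the periodic reset described above.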

In reported cases of thermal shutdown, resetting the Service Processor directly from iLOM with # reset /SP/ fails to bring the SP back up and the array stays down.

In that case, proceed to reset the system with # reset /SYS , then try # start /SP/console

Note that # reset /SP only restarts the Service Processor and # reset /SYS only restarts the system part of the chassis.

Typical IPMI log data from a SupportBundle:

389 | 12/06/2011 | 21:46:52 | Temperature #0x03 | Upper Critical going high
38a | 12/06/2011 | 21:46:59 | Temperature #0x03 | Upper Critical going low
38b | 12/06/2011 | 21:50:44 | Temperature #0x03 | Upper Critical going high
38c | 12/06/2011 | 21:50:49 | Temperature #0x03 | Upper Critical going low
38d | 12/06/2011 | 21:58:06 | Temperature #0x03 | Upper Critical going high
38e | 12/06/2011 | 21:58:11 | Temperature #0x03 | Upper Non-recoverable going high
38f | 12/06/2011 | 21:58:18 | Temperature #0x03 | Upper Non-recoverable going low
390 | 12/06/2011 | 21:58:24 | Temperature #0x03 | Upper Critical going low
391 | 12/06/2011 | 21:58:48 | System ACPI Power State #0xfb | S5/G2: soft-off | Asserted
392 | 12/07/2011 | 06:57:15 | System ACPI Power State #0xfb | S0/G0: working | Asserted
393 | 12/07/2011 | 06:57:15 | System Boot Initiated #0x01 | Initiated by power up | Asserted
394 | 12/07/2011 | 06:57:18 | Processor #0x02 | Presence detected | Asserted
3ae | 12/07/2011 | 07:05:32 | System ACPI Power State #0xfb | S5/G2: soft-off | Asserted
3af | 12/07/2011 | 10:33:16 | Entity Presence #0x46 | Device Present
3b0 | 12/07/2011 | 10:33:16 | System ACPI Power State #0xfb | S5/G2: soft-off | Asserted
3b1 | 12/07/2011 | 10:33:21 | Entity Presence #0x50 | Device Present
3b2 | 12/07/2011 | 11:56:04 | System ACPI Power State #0xfb | S0/G0: working | Asserted
3b3 | 12/07/2011 | 11:56:06 | System Boot Initiated #0x01 | Initiated by power up | Asserted
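To confirm that a power-off was thermal rather than operator-initiated, the log can be scanned for an "Upper Non-recoverable going high" temperature event followed shortly by an "S5/G2: soft-off" power state. The sketch below assumes the pipe-separated layout shown above; the 5-minute correlation window and the function names are assumptions chosen for illustration, not values taken from the appliance.

```python
from datetime import datetime, timedelta

# A few entries copied from the log excerpt above.
SEL_TEXT = """\
38d | 12/06/2011 | 21:58:06 | Temperature #0x03 | Upper Critical going high
38e | 12/06/2011 | 21:58:11 | Temperature #0x03 | Upper Non-recoverable going high
38f | 12/06/2011 | 21:58:18 | Temperature #0x03 | Upper Non-recoverable going low
390 | 12/06/2011 | 21:58:24 | Temperature #0x03 | Upper Critical going low
391 | 12/06/2011 | 21:58:48 | System ACPI Power State #0xfb | S5/G2: soft-off | Asserted
392 | 12/07/2011 | 06:57:15 | System ACPI Power State #0xfb | S0/G0: working | Asserted
"""


def parse_sel(text):
    """Split 'id | date | time | sensor | event ...' lines into dicts."""
    entries = []
    for raw in text.splitlines():
        fields = [field.strip() for field in raw.split("|")]
        if len(fields) < 5:
            continue  # not a log entry
        try:
            when = datetime.strptime(
                fields[1] + " " + fields[2], "%m/%d/%Y %H:%M:%S"
            )
        except ValueError:
            continue  # unparseable timestamp, skip the line
        entries.append(
            {"id": fields[0], "time": when, "sensor": fields[3], "event": fields[4]}
        )
    return entries


def find_thermal_shutdowns(entries, window=timedelta(minutes=5)):
    """Pair each non-recoverable temperature trip with the next soft-off."""
    shutdowns = []
    last_trip = None
    for entry in entries:
        is_trip = (
            entry["sensor"].startswith("Temperature")
            and entry["event"] == "Upper Non-recoverable going high"
        )
        if is_trip:
            last_trip = entry
        elif "soft-off" in entry["event"] and last_trip is not None:
            delay = entry["time"] - last_trip["time"]
            if delay <= window:  # assumed correlation window
                shutdowns.append((last_trip["id"], entry["id"], delay))
            last_trip = None
    return shutdowns


shutdowns = find_thermal_shutdowns(parse_sel(SEL_TEXT))
for trip_id, off_id, delay in shutdowns:
    print(f"{trip_id} -> {off_id}: soft-off {delay.total_seconds():.0f} s after trip")
```

On the sample entries this pairs event 38e with the soft-off at 391, 37 seconds later, which matches the thermal shutdown pattern described in this document.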

Solution

Initial system boot issues were worked around by performing a # reset /SYS, and the Sun Storage 7410 Unified Storage System or Sun Storage 7110 Unified Storage System rebooted successfully. Within 10 minutes, however, the customer reported further overheating issues, followed by another shutdown.

Ongoing thermal issues can occur on a Sun Storage 7410 Unified Storage System regardless of whether it is clustered.
If it is part of a clustered pair, the other node in the cluster may well be located directly below the problem array in the same rack and still be operating in an optimal state.

All fan and power components are reported as optimal in the SupportBundles, but temperature warnings received via the iLOM lead to repeated shutdowns of the affected system.

Previous temperature events were recovered only by performing a # reset /SYS; the array then boots OK, but shuts down again 10 minutes later with temperature warnings from the iLOM.

A Field Engineer is required on-site to carry out the following:

First, carry out a full environmental inspection of the array in its current location:

Rack placement
Blockages in fan air intakes
General condition of the data center (dust, etc.)
Data center air conditioning system temperatures, etc.

If nothing there is causing concern, the hardware is faulty.

This is NOT the known memory leak situation, as the customer's array has the SP/BIOS fixes for it:
sp_version: '2.0.2.16',
fw_version: '0ABMN080',
os_version: 'ak/[email protected],1-1.21',

If the SP reset has NO effect and the array is still reporting temperature and fan issues, consider possible fan board connector issues.

541-2211, FRU Fan Board, RoHS:Y (Culprit)
541-2213, FRU Connector Board Assembly, PATA DVD (Victim)

The most likely resolution is to replace both parts, the fan board and the connector board, at the same time. The fan modules are not affected.

References

<NOTE:1004226.1> - ILOM Service Processor sunservice commands for Sun Fire[TM] X4100 Server (applies to SFX4200/SFX4500/SFX4600 also)
<NOTE:1267544.1> - Older versions of the Service Processor firmware on Sun Storage 7110, 7210, 7310 and 7410 can leak memory.
<BUG:6925325> - ELWOOD (2U) CHASSIS FAN BOARD INTERMITTENTS CAUSING FANS TO "DISSAPPEAR"& TACHOMETER READING ERRORS

Attachments

This solution has no attachment