Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1475984.1
Update Date:2012-07-17
Keywords:

Solution Type  Problem Resolution Sure

Solution  1475984.1 :   Pillar Axiom: Slammer Control Unit Recovery Succeeded on Axiom Releases below 03.03.25  


Related Items
  • Pillar Axiom 300 Storage System
  • Pillar Axiom 600 Storage System
  • Pillar Axiom 500 Storage System
Related Categories
  • PLA-Support>Sun Systems>DISK>Pillar Axiom>SN-DK: Ax300


This article documents the signature for a family of Slammer CU warm starts caused by memory leaks in the platform component code.
There are two different memory leaks: one typically takes more than 5-6 months to deplete the Slammer memory, and the other takes about 60 to 90 days.

Both of these are resolved in all currently supported releases. The long-term leak is fixed in 03.03.00 and higher, including 04.00.00 and higher; the short-term leak is fixed in 03.03.25 and higher.

In this Document
Symptoms
Cause
Solution


Created from <SR 3-5954435811>

Applies to:

Pillar Axiom 600 Storage System - Version Not Applicable to Not Applicable [Release N/A]
Pillar Axiom 500 Storage System - Version Not Applicable to Not Applicable [Release N/A]
Pillar Axiom 300 Storage System - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.
This specific issue affects only Ax600, Ax500, and Ax300 systems on releases below 03.03.25.

Symptoms

The Axiom Slammer CU runs continuously for several months, then warm starts and leaves a corrupt core dump. The warm start may succeed, or there may not be enough memory to complete it, in which case the CU fails over and then fails back.

Either outcome frees the leaked memory and allows the Slammer CU to run for several more months until it runs out of memory again.

If the load on the Axiom is reasonably well balanced, the two Slammer CUs in any given Slammer will tend to see the warm starts or CU failovers within a few days of each other.

In the scanlog summary Slammer Log section, the signature is: PSG_NMQ    PANIC_ASSERT: buf pointer is NULL

The dumper will attempt to run, but there is insufficient memory, and the dumper will fail to produce a core, or will produce a corrupt, incomplete core.

dump_open                      PSG        DUMPER     CORE TDS OPEN fp.core
dump_write                     PSG        DUMPER     * TDS WRITE failed. Overflow.
dump_close                     PSG        DUMPER     Partial core file generated. Not enough core file memory.
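As a rough aid, the signature above can be checked for with a short script. This is a minimal sketch, not a support tool: the function name `check_signature` and the idea of feeding it plain log text are assumptions for illustration, and only the message strings are taken from the excerpts above.

```python
# Message strings copied from the scanlog excerpts in this article.
PANIC_SIGNATURE = "PANIC_ASSERT: buf pointer is NULL"
DUMPER_FAILURES = (
    "TDS WRITE failed. Overflow.",
    "Partial core file generated. Not enough core file memory.",
)


def check_signature(log_text):
    """Report which parts of the signature appear in the given log text.

    Hypothetical helper: it only does substring matching per line.
    """
    lines = log_text.splitlines()
    panic = any(PANIC_SIGNATURE in line for line in lines)
    dumper = any(msg in line for line in lines for msg in DUMPER_FAILURES)
    return {"panic_assert": panic, "dumper_failed": dumper}


sample = """\
PSG_NMQ    PANIC_ASSERT: buf pointer is NULL
dump_open                      PSG        DUMPER     CORE TDS OPEN fp.core
dump_write                     PSG        DUMPER     * TDS WRITE failed. Overflow.
dump_close                     PSG        DUMPER     Partial core file generated. Not enough core file memory.
"""
print(check_signature(sample))
```

A match on both keys is consistent with this issue, but the continuous run time of the CU should still be checked before drawing a conclusion.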

The duration the Slammer has been running continuously is encoded in the name of the Slammer log file.

For example:  100130.200703-101127.213342.GMT.podr-pa.OUTLOG  where the dates are encoded as YYMMDD.HHMMSS-YYMMDD.HHMMSS GMT

This indicates that the Slammer CU had been running continuously since 2010, January 30th at 20:07:03 GMT.

The time of the panic is 2010, November 27th at 21:33:42 GMT, which is about 301 days (roughly 10 months) of continuous running.
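The file name decoding can be sketched in a few lines of Python. The helper name `parse_outlog_name` is hypothetical, and the sketch assumes the name always begins with the two timestamps followed by `.GMT`, as in the example above.

```python
from datetime import datetime, timezone


def parse_outlog_name(name):
    """Return (start, end) as UTC datetimes decoded from a Slammer log file name.

    Assumes the layout YYMMDD.HHMMSS-YYMMDD.HHMMSS.GMT.<rest>, as in the example.
    """
    span = name.split(".GMT", 1)[0]           # "100130.200703-101127.213342"
    start_text, end_text = span.split("-", 1)
    fmt = "%y%m%d.%H%M%S"
    start = datetime.strptime(start_text, fmt).replace(tzinfo=timezone.utc)
    end = datetime.strptime(end_text, fmt).replace(tzinfo=timezone.utc)
    return start, end


start, end = parse_outlog_name("100130.200703-101127.213342.GMT.podr-pa.OUTLOG")
uptime = end - start
print("running since:", start.isoformat())
print("panic at:     ", end.isoformat())
print("uptime days:  ", uptime.days)
```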

If the CU has been running continuously for more than 5-6 months, the specific memory leak is Bug 13746588 [Legacy defect 49027] [SLOW MEMORY LEAK].

If the CU has been running continuously for 2-3 months, the specific memory leak is Bug 13750977 [Legacy defect 55860] [MEMORY LEAK].
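The rule of thumb above can be written as a small lookup. This is only a sketch: the day thresholds are assumptions that approximate "more than 5-6 months" as 150 days or more and "2-3 months" as 60 to 90 days, and run times outside those ranges are reported as undetermined rather than guessed.

```python
def likely_leak(uptime_days):
    """Map continuous run time (in days) to the bug it most likely indicates.

    Thresholds are approximations chosen for this sketch, not official limits.
    """
    if uptime_days >= 150:                    # roughly 5 months or more
        return "Bug 13746588 (Legacy defect 49027, slow memory leak)"
    if 60 <= uptime_days <= 90:               # roughly 2-3 months
        return "Bug 13750977 (Legacy defect 55860, memory leak)"
    return None                               # not clearly either pattern


for days in (301, 75, 120):
    print(days, "->", likely_leak(days))
```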

Cause

The slow memory leak is in the command, status, and message-passing code used for communication between the Pilot software and the Slammer software, as well as between the Slammer CUs. Each time a message is passed, 12 bytes of memory are not properly freed. Eventually the Slammer CU does not have enough memory to respond to a heartbeat, health check, or command. When the CU then attempts to respond or send a status update, it warm starts with the "buf pointer is NULL" assertion.
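To see why a 12-byte leak takes months to matter, the arithmetic can be sketched as below. Only the 12 bytes per message comes from this article; the free memory and message rate in the usage line are purely hypothetical placeholders, not measured Axiom values.

```python
BYTES_LEAKED_PER_MESSAGE = 12   # from this article
SECONDS_PER_DAY = 86_400


def days_to_exhaust(free_bytes, messages_per_second):
    """Estimate days until free_bytes is consumed at a steady message rate."""
    leaked_per_day = BYTES_LEAKED_PER_MESSAGE * messages_per_second * SECONDS_PER_DAY
    return free_bytes / leaked_per_day


# Hypothetical inputs for illustration only: 1 GiB of headroom, 5 messages/second.
estimate = days_to_exhaust(free_bytes=2**30, messages_per_second=5)
print(f"estimated days to exhaustion: {estimate:.0f}")
```

The estimate scales inversely with the message rate, which fits the observation that evenly loaded CUs in the same Slammer tend to run out of memory within a few days of each other.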

The slightly faster memory leak is a buffer leak that occurs when target port group commands are sent from the Slammer CU to the Bricks to determine the internal fabric path status. This leak is seen less often, because the port group commands are sent only while events such as RAID controller failures or upgrades are occurring. The result is the same: eventually the Slammer CU cannot allocate enough memory to send the command, and it warm starts with the "buf pointer is NULL" assertion.

Solution

The fix for the long term memory leak is in release 03.03.00 and higher, and in all R4 and R5 releases.

The fix for the short term memory leak is in release 03.03.25 and all R4 and R5 releases.   

Axiom release 03.05.07 is available only as an interim step for upgrading to R4.x.

Axioms with this issue cannot be upgraded directly to R4: the prerequisite for that upgrade is 03.03.25, and an Axiom showing this issue is on a release below 03.03.25.

Before upgrading to R4.x, a pre-upgrade audit is required.  The Axiom must be at release 03.02.00 or higher to perform that audit. 
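The release thresholds in this section can be summarized as a comparison of dotted release numbers. This is a minimal sketch under stated assumptions: the helper names are hypothetical, releases are assumed to be three dot-separated numeric fields, and the result is a reading aid rather than an upgrade plan.

```python
def parse_release(release):
    """Turn a release string such as '03.03.25' into a comparable tuple."""
    return tuple(int(part) for part in release.split("."))


SLOW_LEAK_FIX = parse_release("03.03.00")   # long-term leak fixed here and higher
FAST_LEAK_FIX = parse_release("03.03.25")   # short-term leak fixed; R4 prerequisite
AUDIT_MINIMUM = parse_release("03.02.00")   # minimum release for the pre-upgrade audit


def assess(release):
    """Summarize exposure and upgrade readiness for one installed release."""
    version = parse_release(release)
    return {
        "slow_leak_exposed": version < SLOW_LEAK_FIX,
        "fast_leak_exposed": version < FAST_LEAK_FIX,
        "can_run_pre_upgrade_audit": version >= AUDIT_MINIMUM,
        "meets_r4_prerequisite": version >= FAST_LEAK_FIX,
    }


for release in ("03.01.00", "03.02.10", "03.03.25"):
    print(release, assess(release))
```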


Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.