Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

1475984.1: Pillar Axiom: Slammer Control Unit Recovery Succeeded on Axiom Releases below 03.03.25
This article documents the signature for a family of Slammer CU warm starts caused by memory leaks in the platform component code. There are two different memory leaks: one takes several months to deplete the Slammer memory, and the other takes about 60 to 90 days. Both are resolved in all currently supported releases. The long-term leak is fixed in 03.03.00 and 04.00.00 and higher; the short-term leak is fixed in 03.03.25.
Created from <SR 3-5954435811>

Applies to:
Pillar Axiom 600 Storage System - Version Not Applicable to Not Applicable [Release N/A]
Pillar Axiom 500 Storage System - Version Not Applicable to Not Applicable [Release N/A]
Pillar Axiom 300 Storage System - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.
This specific issue only affects Ax600, Ax500, and Ax300 systems on releases below 03.03.25.

Symptoms
The Axiom Slammer runs continuously for several months, then warm starts with a corrupt core dump. The warm start may succeed, or there may not be enough memory to complete it, in which case the CU fails over and fails back. Either outcome clears the leaked memory and allows the Slammer CU to run for several more months until it runs out of memory again. If the load on the Axiom is reasonably well balanced, the two Slammer CUs in any given Slammer tend to see the warm starts or CU failovers within a few days of each other.

In the scanlog summary Slammer Log section, the signature is:
PSG_NMQ PANIC_ASSERT: buf pointer is NULL

The dumper attempts to run, but there is insufficient memory, so it either fails to produce a core or produces a corrupt, incomplete core:
dump_open PSG DUMPER CORE TDS OPEN fp.core

The length of time the Slammer has been running continuously is encoded in the name of the Slammer log file. For example:
100130.200703-101127.213342.GMT.podr-pa.OUTLOG
where the dates are encoded as YYMMDD.HHMMSS-YYMMDD.HHMMSS GMT. This indicates that the Slammer CU has been running continuously since January 30, 2010 at 20:07:03 GMT.
The time of the panic is November 27, 2010 at 21:33:42 GMT.

If the CU has been running continuously for more than 5-6 months, the specific memory leak is BUG 13746588 - [Legacy defect 49027][SLOW MEMORY LEAK].
If the CU has been running continuously for 2-3 months, the specific memory leak is Bug 13750977 [Legacy defect 55860] [MEMORY LEAK].

Cause
The slow memory leak is caused by the command, status, and message passing code used for communication between the Pilot software and the Slammer software, as well as between the Slammer CUs. Each time a message is passed, 12 bytes of memory are not properly freed. Eventually the Slammer CU does not have enough memory to respond to a heartbeat, health check, or command. If the CU attempts to respond or send a status update, the warm start occurs with the "buf pointer is NULL" signature.

The slightly faster memory leak is caused by a buffer leak when target port group commands are sent from the Slammer CU to the Bricks to determine the internal fabric path status. This variety of memory leak tends to occur less often, as the port group commands are only sent when events such as RAID controller failures or upgrades are occurring. The result is the same: eventually the Slammer CU cannot allocate enough memory to send the command, and it warm starts with the "buf pointer is NULL" signature.

Solution
The fix for the long-term memory leak is in release 03.03.00 and higher, and in all R4 and R5 releases. The fix for the short-term memory leak is in release 03.03.25 and all R4 and R5 releases.

Axiom release 03.05.07 is available only as an interim step to upgrade to R4.x. Axioms with this issue cannot be upgraded directly to R4, as the prerequisite for that upgrade is 03.03.25, and this issue does not affect 03.03.25 or higher.

Before upgrading to R4.x, a pre-upgrade audit is required. The Axiom must be at release 03.02.00 or higher to perform that audit.

Attachments
This solution has no attachment
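
The triage steps in this article (decode the uptime from the Slammer log file name, match it to one of the two leaks, and check the release against the fix levels) can be sketched in Python. This is an illustrative helper, not a Pillar or Oracle tool: the function names are hypothetical, and the day thresholds are assumptions that turn the article's approximate ranges ("more than 5-6 months", "2-3 months") into numbers.

```python
import re
from datetime import datetime, timezone

# File-name layout from the article: YYMMDD.HHMMSS-YYMMDD.HHMMSS.GMT.<rest>
_OUTLOG_RE = re.compile(r"^(\d{6}\.\d{6})-(\d{6}\.\d{6})\.GMT\.")

# Assumed thresholds; the article only gives approximate month ranges.
SLOW_LEAK_MIN_DAYS = 150         # stands in for "more than 5-6 months"
FAST_LEAK_DAYS = range(60, 91)   # stands in for "2-3 months" (60 to 90 days)


def decode_outlog_name(name):
    """Return (start, panic, uptime) decoded from a Slammer log file name."""
    match = _OUTLOG_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized Slammer log name: {name!r}")
    start, panic = (
        datetime.strptime(stamp, "%y%m%d.%H%M%S").replace(tzinfo=timezone.utc)
        for stamp in match.groups()
    )
    return start, panic, panic - start


def likely_leak(uptime):
    """Map continuous uptime to the bug the article associates with it."""
    if uptime.days >= SLOW_LEAK_MIN_DAYS:
        return "Bug 13746588 (slow memory leak)"
    if uptime.days in FAST_LEAK_DAYS:
        return "Bug 13750977 (memory leak)"
    return "indeterminate"  # outside both assumed ranges


def unfixed_leaks(release):
    """List the leaks still present in a release such as '03.02.00'."""
    version = tuple(int(part) for part in release.split("."))
    leaks = []
    if version < (3, 3, 0):     # long-term leak fixed in 03.03.00
        leaks.append("long-term")
    if version < (3, 3, 25):    # short-term leak fixed in 03.03.25
        leaks.append("short-term")
    return leaks


if __name__ == "__main__":
    name = "100130.200703-101127.213342.GMT.podr-pa.OUTLOG"
    start, panic, uptime = decode_outlog_name(name)
    print(start, panic, uptime.days, likely_leak(uptime))
    print(unfixed_leaks("03.02.00"))
```

For the example file name above, the decoded uptime is 301 days, which falls in the slow-leak range. Comparing releases as integer tuples means every R4 and R5 release sorts above 03.03.25 and is reported as having neither leak, consistent with the Solution section.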