Pillar Axiom: Both Control Units Fail at Identical Time for Similar Reason

Asset ID:	1-72-1393811.1
Update Date:	2012-08-27
Keywords:

Solution Type Problem Resolution Sure

Solution 1393811.1 : Pillar Axiom: Both Control Units Fail at Identical Time for Similar Reason

Applies to:

Pillar Axiom 600 Storage System - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.

Symptoms

Problem Reported: Slammer CU0 Failed, LUN offline

Both CUs failed and the Axiom performed a self-initiated restart. The entire Axiom was unavailable.
2009-03-18 11:21:08 MCC received event 83, SlammerControlUnitFailed, Internal event code:3000e from AGENT on 0x2008000b08043152
2009-03-18 11:21:38 MCC received event 83, SlammerControlUnitFailed, Internal event code:3000e from AGENT on 0x2009000b0804315a
2009-03-18 11:22:08 Rebooting active pilot NOW...
2009-03-18 11:24:38 Processing request: ColdStartSoftware
2009-03-18 11:25:25 **** MCC COLD START COMPLETED SUCCESSFULLY
2009-03-18 11:25:31 AdminAction Added: FQN: /LUNOffline/VMFS_0 Type: LUNOffline
2009-03-18 11:52:03 Processing request: PerformClearLostData

PSG_NMQ    PANIC_ASSERT: buf pointer is NULL

Both CUs failed at almost the same time and for the same reason: a software issue that is fixed in current releases.

As soon as the second CU failed, both Slammer CUs were offline, so the Axiom went into a self-initiated full system restart.

All LUNs are unavailable from the time the second CU fails until the full system restart completes. Based on the timestamps in the log entries above, this was roughly 30 minutes.
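
The intervals described above can be checked directly from the log timestamps. The following is a minimal sketch, not a Pillar tool: the helper names parse and when are hypothetical, and treating the PerformClearLostData request as the end of the outage is an assumption made here because it is the entry that matches the "roughly 30 minutes" figure.

```python
from datetime import datetime

# Log lines copied from the Symptoms section of this note.
LOG = """\
2009-03-18 11:21:08 MCC received event 83, SlammerControlUnitFailed, Internal event code:3000e from AGENT on 0x2008000b08043152
2009-03-18 11:21:38 MCC received event 83, SlammerControlUnitFailed, Internal event code:3000e from AGENT on 0x2009000b0804315a
2009-03-18 11:22:08 Rebooting active pilot NOW...
2009-03-18 11:24:38 Processing request: ColdStartSoftware
2009-03-18 11:25:25 **** MCC COLD START COMPLETED SUCCESSFULLY
2009-03-18 11:25:31 AdminAction Added: FQN: /LUNOffline/VMFS_0 Type: LUNOffline
2009-03-18 11:52:03 Processing request: PerformClearLostData
"""


def parse(log):
    """Return (timestamp, message) pairs for lines that start with a timestamp."""
    events = []
    for line in log.splitlines():
        try:
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp
        events.append((ts, line[20:]))
    return events


def when(events, needle):
    """Timestamp of the first event whose message contains `needle`."""
    return next(ts for ts, msg in events if needle in msg)


events = parse(LOG)
failures = [ts for ts, msg in events if "SlammerControlUnitFailed" in msg]

# Time between the first and second CU failure.
gap_s = (failures[1] - failures[0]).total_seconds()

# Time from the second CU failure to the cold start completing.
to_cold_start_s = (when(events, "COLD START COMPLETED") - failures[1]).total_seconds()

# Time from the second CU failure to the PerformClearLostData request
# (assumed here to mark the end of the outage).
to_clear_lost_s = (when(events, "PerformClearLostData") - failures[1]).total_seconds()

print(gap_s, to_cold_start_s, to_clear_lost_s)
```

For this log, the two CU failures are 30 seconds apart, and the PerformClearLostData request arrives 1825 seconds (about 30 minutes) after the second failure.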

Cause

The cause is a resource leak in the memory pool allocated to the queue for the status and command interface on the private management network. When a Slammer CU cannot allocate memory to handle a critical function, it resets itself. Because the depleted queue is itself used to manage that reset, the attempted soft reset is turned into a Node Failure. Typically both CUs deplete this pool at very nearly the same time, so both CUs tend to fail at the same time.
Containment: The two CU failures resolved the problem for the time being, but it will recur.
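
To illustrate why a leak of this kind tends to take down both CUs together, the following toy model may help. It is only a sketch under stated assumptions: the pool size, leak frequency, and message rate are hypothetical values chosen for illustration and are not taken from the Axiom software.

```python
def messages_until_exhaustion(pool_buffers, leak_every):
    """Count messages handled before an allocation fails.

    Each message borrows one buffer from a fixed pool. Every
    `leak_every`-th message never returns its buffer (the leak).
    """
    free = pool_buffers
    handled = 0
    while True:
        if free == 0:
            return handled  # the next allocation would return NULL
        free -= 1  # allocate a buffer for this message
        handled += 1
        if handled % leak_every != 0:
            free += 1  # normal path: buffer is returned to the pool


# Hypothetical values: identical pools and identical management traffic
# on both CUs.
POOL_BUFFERS = 1000
LEAK_EVERY = 100
RATE = 2.0  # messages per second

t_cu0 = messages_until_exhaustion(POOL_BUFFERS, LEAK_EVERY) / RATE
t_cu1 = messages_until_exhaustion(POOL_BUFFERS, LEAK_EVERY) / RATE
print(t_cu0, t_cu1)
```

Because both CUs start with the same pool and see the same traffic in this model, they reach exhaustion after the same number of messages. A restart refills both pools, which is consistent with the containment note above: the leak starts again and the failure recurs.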

Solution

Solution: This is defect 47890, which is fixed in R3.1 and later releases.

Recommendations: Until the system is upgraded to a release that contains the fix, it will continue to fail periodically, and both CUs are likely to fail at the same time.
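
When reviewing several systems, a small check like the following can flag releases that predate the fix. It is a sketch only: the function name has_fix is hypothetical, and the release string format (for example "R3.0.2") is an assumption, since this note does not specify how the Axiom reports its release.

```python
import re

# From this note: defect 47890 is fixed in R3.1 and above.
FIXED_IN = (3, 1)


def has_fix(release):
    """Return True if a release string such as 'R3.1' or '3.0.2' is R3.1 or later.

    Only the first two numeric components (major, minor) are compared.
    """
    parts = re.findall(r"\d+", release)
    if not parts:
        raise ValueError(f"no version number found in {release!r}")
    version = tuple(int(p) for p in parts[:2])
    return version >= FIXED_IN


for rel in ("R2.8", "R3.0.2", "R3.1", "R4.0"):
    print(rel, "has fix" if has_fix(rel) else "needs upgrade")
```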

Attachments

This solution has no attachment