Sun Storage 7000 Unified Storage System: BUI unavailable and seeing errors like "failed to update kstat chain: Not enough space"

Asset ID:	1-72-1494369.1
Update Date:	2012-10-01
Keywords:

Solution Type:	Problem Resolution Sure


Applies to:

Sun Storage 7410 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7310 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7210 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7110 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7420 - Version Not Applicable to Not Applicable [Release N/A]
7000 Appliance OS (Fishworks)

Symptoms

BUI access hangs or generates log errors of the form:

Thu Mar 22 23:27:37 2012: asynchronous error on statistics module 'mem': failed to update kstat chain: Not enough space

Fri Mar 23 02:05:53 2012: failed to update chassis data: failed to update kstat chain: Not enough space

These errors will also be seen in the akd.ak error log.

CLI access may also be lost, although in some cases it remains available.

The management interface (akd) has usually been running for months without a restart.
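Where the akd.ak log can be exported as plain text, a short script can check it for the error signature quoted above. The following Python sketch assumes one log entry per line, with a leading timestamp in the format shown in the examples; the helper name find_kstat_errors is hypothetical and not part of the appliance software.

```python
import re

# Error signature taken from the log messages quoted above.
SIGNATURE = "failed to update kstat chain: Not enough space"

# Leading timestamp such as "Thu Mar 22 23:27:37 2012:" (format assumed from the examples).
TIMESTAMP = re.compile(r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}):")


def find_kstat_errors(lines):
    """Return (timestamp, line) pairs for log lines containing the signature.

    Hypothetical helper: assumes one log entry per line of plain text.
    The timestamp is None if the line does not start with one.
    """
    hits = []
    for line in lines:
        if SIGNATURE in line:
            match = TIMESTAMP.match(line)
            hits.append((match.group(1) if match else None, line.strip()))
    return hits


if __name__ == "__main__":
    sample = [
        "Thu Mar 22 23:27:37 2012: asynchronous error on statistics module 'mem': "
        "failed to update kstat chain: Not enough space",
        "Fri Mar 23 02:05:53 2012: failed to update chassis data: "
        "failed to update kstat chain: Not enough space",
        "an unrelated line that should not match",  # made-up filler line
    ]
    for ts, line in find_kstat_errors(sample):
        print(ts, "->", line)
```

A match only shows that the symptom is present; the vmem check that follows is what points to heap fragmentation.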

To confirm the diagnosis, take a core dump of the running akd process and examine its vmem arenas with mdb:

# gcore -o akd.core `pgrep -ox akd`

# mdb akd.core.<PID>

>::vmem

ADDR     NAME                  INUSE      TOTAL    SUCCEED    FAIL
fe9bdd98 sbrk_top         2454306816 3592060928 1040654983 8468408
fe9be20c sbrk_heap        2454306816 2454306816 1040654983 8466433
fe9be680 vmem_internal      89620480   89620480  126069236       0
fe9beaf4 vmem_seg           86294528   86294528      21068       0
fe9bef68 vmem_hash           3309824    3313664         35       0
fe9bf3dc vmem_vmem             17100      19128  126048153       0
08062000 umem_internal      22701056   22704128      79028       0
08062474 umem_cache           402320     626688         51       0
080628e8 umem_hash           2239488    2244608         54       0
08063000 umem_log                  0          0          0       0
08063474 umem_firewall_va          0          0          0       0
080638e8 umem_firewall             0          0          0       0
08064000 umem_oversize     158445402  166821888  913453146 8466433
08064474 umem_memalign       4431888   10883072      64834       0
080648e8 umem_default     2164277248 2164277248    1053568       0

In the sbrk_top line, TOTAL is over 3 GB (3592060928 bytes) but INUSE is much lower (2454306816 bytes), so roughly 1.1 GB of heap has been obtained by akd but is not in use. This indicates memory fragmentation.

The umem_oversize line shows a large number of allocations: over 900 million succeeded and about 8.5 million failed, matching the FAIL count of sbrk_heap.
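The fragmentation figures from a ::vmem listing can also be computed with a short script. This Python sketch assumes the raw mdb output has been saved as text, one arena per line, with no added annotations; parse_vmem and summarize are hypothetical helper names. It reports how much of sbrk_top is obtained but not in use, and what fraction of umem_oversize allocation attempts failed.

```python
def parse_vmem(text):
    """Parse mdb ::vmem output into {arena_name: {column: int}}."""
    arenas = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six columns (ADDR NAME INUSE TOTAL SUCCEED FAIL) with
        # numeric values in the last four; headers and separators are skipped.
        if len(fields) != 6 or not all(f.isdigit() for f in fields[2:]):
            continue
        _addr, name, inuse, total, succeed, fail = fields
        arenas[name] = {
            "inuse": int(inuse),
            "total": int(total),
            "succeed": int(succeed),
            "fail": int(fail),
        }
    return arenas


def summarize(arenas):
    """Compute the fragmentation indicators discussed in this note."""
    top = arenas["sbrk_top"]
    over = arenas["umem_oversize"]
    attempts = over["succeed"] + over["fail"]
    return {
        # Bytes the heap has obtained but that are not currently allocated.
        "sbrk_top_unused_bytes": top["total"] - top["inuse"],
        "sbrk_top_inuse_ratio": top["inuse"] / top["total"],
        "oversize_fail_ratio": over["fail"] / attempts if attempts else 0.0,
    }


if __name__ == "__main__":
    # Header plus two rows copied from the example listing.
    listing = """\
ADDR     NAME                  INUSE      TOTAL    SUCCEED    FAIL
fe9bdd98 sbrk_top         2454306816 3592060928 1040654983 8468408
08064000 umem_oversize     158445402  166821888  913453146 8466433
"""
    for key, value in summarize(parse_vmem(listing)).items():
        print(f"{key}: {value}")
```

With the example values, sbrk_top has 1137754112 bytes (about 1.1 GB) obtained but unused, only about 68% of it is in use, and about 0.9% of oversize allocation attempts failed.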

Cause

This is most likely a known problem in which akd, the process that controls the management interface, runs out of memory because a large number of oversize allocations fragments its heap.

If you are unsure, raise a service request with Oracle Support, who can verify whether you are hitting this issue.

The likely cause for this is CR 7157268.
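To illustrate how a heap can keep growing while much of it sits unused, the following Python sketch models a heap that only grows (in the spirit of sbrk) and allocates first-fit from a list of free holes. It is a toy model under those assumptions, not the algorithm used by libumem or vmem; the class ToyHeap and the function demo are hypothetical. Many small freed blocks leave holes that no large (oversize) request can use, so the heap has to grow instead.

```python
class ToyHeap:
    """Toy heap that only grows (in the spirit of sbrk) and allocates first-fit.

    Illustrative model only; it does not reproduce libumem or vmem policies.
    """

    def __init__(self):
        self.free = []   # (offset, size) holes, sorted by offset
        self.used = {}   # offset -> size of live blocks
        self.total = 0   # bytes obtained from the "OS" so far (loosely like TOTAL)

    @property
    def inuse(self):
        """Bytes in live blocks (loosely like INUSE)."""
        return sum(self.used.values())

    def alloc(self, size):
        for i, (off, hole) in enumerate(self.free):
            if hole >= size:  # first hole large enough wins
                if hole == size:
                    del self.free[i]
                else:
                    self.free[i] = (off + size, hole - size)
                self.used[off] = size
                return off
        off = self.total  # no hole fits: grow the heap at its end
        self.total += size
        self.used[off] = size
        return off

    def release(self, off):
        size = self.used.pop(off)
        holes = sorted(self.free + [(off, size)])
        merged = []
        for o, s in holes:  # coalesce holes that touch each other
            if merged and merged[-1][0] + merged[-1][1] == o:
                merged[-1] = (merged[-1][0], merged[-1][1] + s)
            else:
                merged.append((o, s))
        self.free = merged


def demo():
    heap = ToyHeap()
    blocks = [heap.alloc(64 * 1024) for _ in range(100)]
    for off in blocks[::2]:  # free every other block: 50 separate 64 KiB holes
        heap.release(off)
    free_before = heap.total - heap.inuse
    heap.alloc(1024 * 1024)  # one large request; no single hole can hold it
    return free_before, heap.inuse, heap.total


if __name__ == "__main__":
    free_before, inuse, total = demo()
    print(f"free before large request: {free_before} bytes")
    print(f"INUSE {inuse} bytes, TOTAL {total} bytes")
```

demo() returns (3276800, 4325376, 7602176): 3276800 bytes were free before the 1 MiB request, but only as 64 KiB holes, so TOTAL grew while INUSE stayed well below it. A freshly started process begins with an unfragmented heap, which is why restarting akd relieves the problem.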

Solution

The workaround is to restart the management interface (akd), which replaces the fragmented heap with a fresh one.

If the CLI is still available, it is possible to restart the management interface from there:

S7000:> maintenance system restart

If the appliance is clustered, verify that the cluster is in a healthy state before restarting the management interface on either head, to prevent an unintended takeover.

To check this, view the cluster configuration:

S7000:> configuration cluster show
Properties:
                         state = AKCS_CLUSTERED
                   description = Active
                      peer_asn = 7adaa852-e2da-e6d6-e0ad-d22330278cb3
                 peer_hostname = zs7420-tvp540-b-h1
                    peer_state = AKCS_CLUSTERED
              peer_description = Active

Children:
                        resources => Configure resources

Valid states for both the local head (state) and the peer (peer_state) are AKCS_CLUSTERED, AKCS_OWNER and AKCS_STRIPPED. When both heads are in one of these states, restarting the management interface will not cause a takeover by the other head.
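The state check can also be scripted against saved CLI output. This Python sketch assumes the output of configuration cluster show has been captured as text in the layout shown, and it only applies to clustered appliances; parse_cluster_show and safe_to_restart are hypothetical names.

```python
import re

# States this note lists as valid for both the local head and its peer.
VALID_STATES = {"AKCS_CLUSTERED", "AKCS_OWNER", "AKCS_STRIPPED"}

# Matches "name = value" property lines. "resources => ..." does not match
# because "=" must be followed by whitespace.
PROPERTY = re.compile(r"^\s*(\w+)\s*=\s+(\S+)")


def parse_cluster_show(text):
    """Collect 'name = value' properties from saved 'cluster show' output."""
    props = {}
    for line in text.splitlines():
        match = PROPERTY.match(line)
        if match:
            props[match.group(1)] = match.group(2)
    return props


def safe_to_restart(text):
    """True only if both state and peer_state are in VALID_STATES."""
    props = parse_cluster_show(text)
    return (props.get("state") in VALID_STATES
            and props.get("peer_state") in VALID_STATES)


if __name__ == "__main__":
    output = """\
S7000:> configuration cluster show
Properties:
                         state = AKCS_CLUSTERED
                   description = Active
                    peer_state = AKCS_CLUSTERED
              peer_description = Active

Children:
                        resources => Configure resources
"""
    print("safe to restart akd:", safe_to_restart(output))
```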

Restarting the management interface does not affect client access to the shares.

If the CLI is not available, raise a service request with Oracle Support to restart the management interface.

The permanent fix is to upgrade to Appliance Firmware Release 2011.04.24.5.0 (2011.1.5.0) once it is available.

Note that, at the time of writing, 2011.04.24.5.0 was not yet available.

If an urgent fix is required, it is also available as an IDR based on 2011.1.4.0 (IDR 2011.04.24.4.0,1-2.21.10.1).

Access to this IDR must be obtained from RPE.
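To check whether a given release string includes the fix, release numbers can be compared field by field. This Python sketch assumes the fix is present in 2011.04.24.5.0, in every later release, and in the specific IDR named above; the assumption about later releases is not stated in this note. It only understands the long form (2011.04.24.5.0), not the short form (2011.1.5.0). FIX_RELEASE, IDR_WITH_FIX, parse_release and has_fix are hypothetical names.

```python
# Release named in this note as containing the full fix.
FIX_RELEASE = (2011, 4, 24, 5, 0)
# IDR named in this note as also containing the fix.
IDR_WITH_FIX = "2011.04.24.4.0,1-2.21.10.1"


def parse_release(version):
    """Turn '2011.04.24.5.0' (or an IDR string with a ',' suffix) into a tuple.

    Long-form release strings only; '2011.1.5.0' would compare incorrectly.
    """
    base = version.split(",", 1)[0]
    return tuple(int(part) for part in base.split("."))


def has_fix(version):
    """Assumes every release at or after FIX_RELEASE also contains the fix."""
    if version == IDR_WITH_FIX:
        return True
    return parse_release(version) >= FIX_RELEASE


if __name__ == "__main__":
    for v in ("2011.04.24.4.0", IDR_WITH_FIX, "2011.04.24.5.0"):
        print(v, "->", "fixed" if has_fix(v) else "affected")
```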
