Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1401595.1
Update Date: 2012-06-29
Keywords:

Solution Type: Problem Resolution (Sure Solution)

Solution 1401595.1: Sun Storage 7000 Unified Storage System: BUI/CLI hang due to 'excessive' analytics collected


Related Items
  • Sun Storage 7410 Unified Storage System
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7120
  • Sun ZFS Storage 7320
  • Sun ZFS Storage 7420
  • Sun Storage 7110 Unified Storage System
  • Sun Storage 7210 Unified Storage System
Related Categories
  • PLA-Support>Sun Systems>DISK>NAS>SN-DK: 7xxx NAS
  • .Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage




In this Document
Symptoms
Cause
Solution
References


Applies to:

Sun Storage 7410 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7120 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7320 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7420 - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7110 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
7000 Appliance OS (Fishworks)

Symptoms

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - 7000 Series ZFS Appliances


Symptoms observable by the customer:

  • Cannot login to BUI/CLI
  • ssh to appliance returns "aksh: fatal error: Couldn't allocate memory for XML-RPC fault response"
  • BUI reporting error: "asynchronous statistic error: failed to update kstat chain: Not enough space"
  • ssh to appliance drops to 'failsafe' shell:
    $ ssh [email protected]
    [email protected]'s password:

    Last login: Fri Jan  7 11:12:35 2011 from 10.123.45.67

    Waiting for the appliance shell to start ...


    The appliance shell is taking longer than usual to start.
    Press Ctrl-C to exit or wait 45 seconds for the emergency shell.

    Waiting for the appliance shell to start ...


    aksh-wrapper: FATAL: Appliance shell failed to start in the allotted time

    Dropping into failsafe shell ...
  • Replication failures in the 'alert' log:
    class = alert.ak.appliance.nas.project.replication.send.fail.misc
    ak_errmsg = zfs_send failed: cannot send 'pool-0/local/iscsi_User': No such process
  • System log message buffer reported a number of events:
    WARNING: /var/run: File system full, swap space limit exceeded
    WARNING: /var/run: File system full, swap space limit exceeded
    WARNING: /var/run: File system full, swap space limit exceeded
    WARNING: /etc/svc/volatile: File system full, swap space limit exceeded
    WARNING: /etc/svc/volatile: File system full, swap space limit exceeded
    WARNING: /etc/svc/volatile: File system full, swap space limit exceeded
  • After restarting akd (or a system reboot without diagnosis), very large ONDISK values can be seen for the analytics datasets, e.g.:
sun-nas:analytics> datasets
sun-nas:analytics datasets> show
Datasets:

DATASET     STATE   INCORE ONDISK NAME
dataset-000 active    867K  3.88G arc.accesses[hit/miss]
dataset-001 active    148K   470M arc.l2_accesses[hit/miss]
dataset-002 active    148K   470M arc.l2_size
dataset-003 active    148K   470M arc.size
dataset-004 active    439K  2.82G arc.size[component]
dataset-005 active    148K   470M cpu.utilization
dataset-006 active    554K  1.51G cpu.utilization[cpu]
dataset-007 active    295K  1.83G cpu.utilization[mode]
dataset-008 active    337K  2.04G dnlc.accesses[hit/miss]
dataset-009 active    109K  11.2M fc.bytes
dataset-010 active    109K  11.2M fc.ops
dataset-011 active    109K  11.2M ftp.kilobytes
dataset-012 active    148K   470M http.reqs
dataset-013 active    148K   470M io.bytes
dataset-014 active   4.16M  6.54G io.bytes[disk]
dataset-015 active    334K  1.94G io.bytes[op]
dataset-016 active    113K  20.9M io.disks[utilization=95][disk]
dataset-017 active    148K   470M io.ops
dataset-018 active   4.65M  18.2G io.ops[disk]
dataset-019 active    334K  1.94G io.ops[op]
dataset-020 active    109K  11.2M iscsi.bytes
dataset-021 active    148K   470M iscsi.ops
dataset-022 active    148K   472M iscsi.ops[client]
dataset-023 active    148K   470M ndmp.diskkb
dataset-024 active    140K   282M net.kilobytes
dataset-025 active    395K  2.30G net.kilobytes[interface]
dataset-026 active    109K  11.2M nfs2.ops
dataset-027 active    109K  11.2M nfs2.ops[op]
dataset-028 active    148K   470M nfs3.ops
dataset-029 active   1.89M  1.53G nfs3.ops[client]
dataset-030 active   4.89M  94.3G nfs3.ops[file]
dataset-031 active   4.83M  90.9G nfs3.ops[file]
dataset-032 active   27.2M  35.6G nfs3.ops[latency]
dataset-033 active   1.39M  4.40G nfs3.ops[op]
dataset-034 active   3.54M  4.56G nfs3.ops[share]
dataset-035 active    148K   470M nfs4.ops
dataset-036 active    148K   470M nfs4.ops[op]
dataset-037 active    148K   470M nic.kilobytes
dataset-038 active    546K  2.90G nic.kilobytes[device]
dataset-039 active    336K  2.21G nic.kilobytes[direction]
dataset-040 active    109K  11.2M sftp.kilobytes
dataset-041 active    148K   470M smb.ops
dataset-042 active    148K   470M smb.ops[op]
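
The ONDISK column above uses binary size suffixes (K/M/G). When triaging, it helps to rank datasets by on-disk size to find the worst offenders quickly. A minimal sketch that parses pasted `show` output; the parser, its column assumptions, and the sample lines are illustrative only, not appliance code:

```python
# Hypothetical helper: rank analytics datasets by the ONDISK column of
# 'analytics datasets show' output pasted as text. Column layout
# (DATASET STATE INCORE ONDISK NAME) is taken from the listing above.

SUFFIX = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

def to_bytes(size):
    """Convert a size string like '18.2G' or '470M' to bytes."""
    if size[-1] in SUFFIX:
        return int(float(size[:-1]) * SUFFIX[size[-1]])
    return int(size)

def rank_datasets(show_output, top=5):
    """Return (name, ondisk_bytes) pairs, largest first."""
    rows = []
    for line in show_output.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[0].startswith("dataset-"):
            rows.append((fields[4], to_bytes(fields[3])))
    return sorted(rows, key=lambda r: r[1], reverse=True)[:top]

sample = """\
dataset-018 active   4.65M  18.2G io.ops[disk]
dataset-030 active   4.89M  94.3G nfs3.ops[file]
dataset-012 active    148K   470M http.reqs
"""
for name, size in rank_datasets(sample):
    print(f"{name}: {size / 2**30:.1f} GiB")
```

In the listing above, the two `nfs3.ops[file]` datasets (94.3G and 90.9G) would surface immediately as the largest consumers.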



Symptoms observable by the Oracle Support engineer:

  • aklog has 'no memory' messages logged:
Mon Sep 26 15:49:14 2011: magnitude-1 dataspan at 176860000 for 'smb.ops[file]': failed to write: failed to write file com/sun/ak/xmlrpc/dataset/3a8b5059-db08-46fd-befe-eefc6fc5b56f/_dataspan/M1/170000000/176860000.new for stash update: no memory
Mon Sep 26 15:49:14 2011: magnitude-1 dataspan at 176870000 for 'smb.ops[file]': failed to write: failed to write file com/sun/ak/xmlrpc/dataset/3a8b5059-db08-46fd-befe-eefc6fc5b56f/_dataspan/M1/170000000/176870000.new for stash update: no memory
Mon Sep 26 15:49:14 2011: magnitude-1 dataspan at 176880000 for 'smb.ops[file]': failed to write: failed to write file com/sun/ak/xmlrpc/dataset/3a8b5059-db08-46fd-befe-eefc6fc5b56f/_dataspan/M1/170000000/176880000.new for stash update: failed to write stash file
Mon Sep 26 15:49:14 2011: magnitude-0 dataspan at 176890000 for 'smb.ops[file]': failed to write: failed to write file com/sun/ak/xmlrpc/dataset/3a8b5059-db08-46fd-befe-eefc6fc5b56f/_dataspan/M0/176000000/176890000.new for stash update: failed to write stash file
  • prstat shows akd as taking up ~3.5 GB of memory:
PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  2515 daemon   5972K 5172K sleep   60  -20  26:46:25 0.3% nfsd/78
   951 root        0K    0K sleep   99  -20  10:39:50 0.2% zpool-pool-0/158
  1285 root       41M   21M sleep   59    0   0:01:05 0.1% httpd/1
  2455 root     3540M 3540M sleep  100    -   1:48:29 0.0% akd/175  <<<
  4302 root     4240K 2848K cpu4    59    0   0:00:00 0.0% prstat/1
  1360 root       34M   27M sleep   59    0   0:03:23 0.0% fmd/33
  1796 root     9136K 4588K sleep   59    0   0:00:00 0.0% syseventd/18
  4292 webservd   40M 7732K sleep   59    0   0:00:00 0.0% httpd/1
   184 root       22M 9412K sleep   59    0   0:00:00 0.0% akwatchd/1
  1066 root     4416K 3284K sleep   59    0   0:00:00 0.0% zebra/1
   115 netadm   3868K 2788K sleep   59    0   0:00:00 0.0% ipmgmtd/3
  1022 root     3300K 1408K sleep   59    0   0:00:00 0.0% dbus-daemon/1
   108 root     3224K 2000K sleep   59    0   0:00:00 0.0% dlmgmtd/8

 

Note that 3.5G is close to the 4 GB address-space limit for a 32-bit application, which can cause many different symptoms.
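
To make the headroom concrete, illustrative arithmetic (not appliance code): a 32-bit process has at most a 4 GiB virtual address space, and the usable heap is smaller still once text, stack, and library mappings are subtracted, so a 3540 MB footprint is effectively exhaustion:

```python
# Illustrative arithmetic: compare akd's reported size against the
# 32-bit address-space ceiling.
ADDRESS_SPACE = 2**32      # 4 GiB hard ceiling for a 32-bit process
akd_size = 3540 * 2**20    # 3540M as reported by prstat

headroom = ADDRESS_SPACE - akd_size
print(f"headroom: {headroom / 2**20:.0f} MB")  # ~556 MB, before subtracting
                                               # text/stack/library mappings
```

This matches the 'no memory' failures seen in aklog and the "Couldn't allocate memory" errors on ssh login.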
  • The stash size was many tens of GB, indicating a large amount of retained analytics data:
> ::akx_class -d -n dataset | ::print ak_dataset_state_t akds_datasets | ::list ak_dataset_t akd_next | ::ak_dataset
 DATASET       INCORE STAT
18749a88       885263 nic.kilobytes[direction]
18749808       316000 nfs2.ops
1857a548       467818 nfs4.ops[op]
18c1bc88       316000 iscsi.bytes
18c1ba08       316000 sftp.kilobytes
18c1b788       316000 arc.size
 8f26848      2671116 nfs3.ops[op]
17eee388      1963692 arc.accesses[hit/miss]
131400c8       316000 io.bytes
13140ac8       316000 ndmp.diskkb
13140848      7784753 nfs3.ops[client]
131405c8       316000 iscsi.ops
13140348       316000 nfs2.ops[op]
12fcd088      4020360 arc.accesses[file]
13140d48       316000 io.ops
12e542c8       876347 io.ops[op]
125ee008       316000 http.reqs
12000848       702401 net.kilobytes[interface]
120bb108      7480821 nfs3.ops[client]
125eea08       316000 arc.l2_size
125ee508      7047006 io.ops[disk]
125eec88       848828 cpu.utilization[mode]
120005c8       972386 nic.kilobytes[device]
120bb388       316000 nfs3.ops
120bbb08       876147 io.bytes[op]
120bb888       308000 fc.bytes
120bb608       316000 nfs4.ops
120bbd88      7970586 arc.l2_accesses[file]
 c224548       868108 dnlc.accesses[hit/miss]
120000c8       480185 io.disks[utilization=95][disk]
12000d48       316000 cpu.utilization
12000ac8       316000 smb.ops[op]
12000348       878583 net.kilobytes[direction]
 c224048      1461677 arc.size[component]
 c2242c8       316000 ftp.kilobytes
 bee7788       308000 fc.ops
 bee7a08   1080518527 nfs3.ops[file]        <<<<<<<<
 bee7c88       316000 nic.kilobytes
 8f260c8       316000 smb.ops
 91a2d88   1075506365 nfs3.ops[file]        <<<<<<<<
 8f265c8       829068 arc.l2_accesses[hit/miss]

Notice that there are two datasets for 'nfs3.ops' broken down by file, each taking over a GB.
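
Duplicated names with outsized in-core counts, as with `nfs3.ops[file]` above, are the signature of this issue. A hypothetical sketch of scanning the `::ak_dataset` output for them; the column layout (ADDR COUNT NAME) is taken from the output above, and the threshold is an assumption for illustration:

```python
# Hypothetical triage helper: flag dataset names that appear more than
# once, or whose in-core count is far above the rest, in pasted mdb output.
from collections import defaultdict

def find_suspects(mdb_output, threshold=100_000_000):
    seen = defaultdict(list)
    for line in mdb_output.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[1].isdigit():
            seen[fields[2]].append(int(fields[1]))
    duplicates = {n: v for n, v in seen.items() if len(v) > 1}
    oversized = {n: v for n, v in seen.items() if max(v) > threshold}
    return duplicates, oversized

sample = """\
 bee7a08   1080518527 nfs3.ops[file]
 91a2d88   1075506365 nfs3.ops[file]
 8f260c8       316000 smb.ops
"""
dups, big = find_suspects(sample)
print("duplicated:", sorted(dups))  # ['nfs3.ops[file]']
print("oversized:", sorted(big))    # ['nfs3.ops[file]']
```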

  • akd was spinning on the following thread, reading jobs/datasets data one at a time:
-----------------  lwp# 1 / thread# 1  --------------------
feed2345 syscall  (82, 8047aa0, 0, f00f67e2) + 5
feebcbc9 openat64 (82, 8047aa0, 0, f00f67e2) + d1
f00f6b7e ak_stash_reload_atom (93d7e54, f01bb8b8, 8047b50, f00f4e1d) + ee
f00f4e38 ak_stash_resume_object (93d7e54, f01bb940, fc2e562a, f00f58e2) + ac
f00f591f ak_stash_reload (8276498, f01bb940, 8712bf8, f00b06fd) + 13f
f00b07af ak_job_init (80ddc08, 8104578, f0185844, 0) + 13f
f0111568 ak_init  (1, 1, 0, 8047dd4, 1, 8047e6c) + 960
08051564 main     (1, 8047e6c, 8047e74, feffb8b4) + 28c
0805117d _start   (1, 8047f08, 0, 8047f18, 8047f2e, 8047f46) + 7d

Cause

This is a known issue caused by 'excessive' analytics data being collected and retained.


Known issue - <SunBug 6993881> - accrued analytics data can completely wedge akd

Solution

Recommended action for the customer:

You will need to engage Oracle Support by opening a Service Request so that Oracle Support Services can confirm this issue and then carry out the appropriate activities to resolve it.


Recommended actions for the Oracle Support engineer:

  Remove the analytics from the appliance stash - see the following wiki document:

    https://stbeehive.oracle.com/teamcollab/wiki/AmberRoadSupport:Solaris+shell+procedure+to+remove+%28excessive%29+analytics

  If you cannot access this document, engage NAS Storage-TSC for assistance.

 

Back to <Document 1401282.1> Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface (BUI/CLI Hang).

References

@ <BUG:6993881> - ACCRUED ANALYTICS DATA CAN COMPLETELY WEDGE AKD
@Support wiki - remove excess analytics: https://stbeehive.oracle.com/teamcollab/wiki/AmberRoadSupport:Solaris+shell+procedure+to+remove+%28excessive%29+analytics
<NOTE:1401282.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface (BUI/CLI hang)

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.