Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1401595.1
Update Date: 2012-06-29
Keywords:

Solution Type: Problem Resolution (Sure Solution)

Solution 1401595.1: Sun Storage 7000 Unified Storage System: BUI/CLI hang due to 'excessive' analytics collected


Related Items
  • Sun Storage 7410 Unified Storage System
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7120
  • Sun ZFS Storage 7320
  • Sun ZFS Storage 7420
  • Sun Storage 7110 Unified Storage System
  • Sun Storage 7210 Unified Storage System
Related Categories
  • PLA-Support>Sun Systems>DISK>NAS>SN-DK: 7xxx NAS
  • .Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage




In this Document
Symptoms
Cause
Solution
References


Applies to:

Sun Storage 7410 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7120 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7320 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7420 - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7110 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
7000 Appliance OS (Fishworks)

Symptoms

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - 7000 Series ZFS Appliances


Symptoms observable by the customer:

  • Cannot login to BUI/CLI
  • ssh to appliance returns "aksh: fatal error: Couldn't allocate memory for XML-RPC fault response"
  • BUI reporting error: "asynchronous statistic error: failed to update kstat chain: Not enough space"
  • ssh to appliance drops to 'failsafe' shell:
    $ ssh [email protected]
    [email protected]'s password:

    Last login: Fri Jan  7 11:12:35 2011 from 10.123.45.67

    Waiting for the appliance shell to start ...


    The appliance shell is taking longer than usual to start.
    Press Ctrl-C to exit or wait 45 seconds for the emergency shell.

    Waiting for the appliance shell to start ...


    aksh-wrapper: FATAL: Appliance shell failed to start in the allotted time

    Dropping into failsafe shell ...
  • Replication failures in the 'alert' log:
    class = alert.ak.appliance.nas.project.replication.send.fail.misc
    ak_errmsg = zfs_send failed: cannot send 'pool-0/local/iscsi_User': No such process
  • System log message buffer reported a number of events:
    WARNING: /var/run: File system full, swap space limit exceeded
    WARNING: /var/run: File system full, swap space limit exceeded
    WARNING: /var/run: File system full, swap space limit exceeded
    WARNING: /etc/svc/volatile: File system full, swap space limit exceeded
    WARNING: /etc/svc/volatile: File system full, swap space limit exceeded
    WARNING: /etc/svc/volatile: File system full, swap space limit exceeded
  • After restarting akd (or a system reboot without diagnosis), very large ONDISK values can be seen for the analytics datasets, e.g.:
sun-nas:analytics> datasets
sun-nas:analytics datasets> show
Datasets:

DATASET     STATE   INCORE ONDISK NAME
dataset-000 active    867K  3.88G arc.accesses[hit/miss]
dataset-001 active    148K   470M arc.l2_accesses[hit/miss]
dataset-002 active    148K   470M arc.l2_size
dataset-003 active    148K   470M arc.size
dataset-004 active    439K  2.82G arc.size[component]
dataset-005 active    148K   470M cpu.utilization
dataset-006 active    554K  1.51G cpu.utilization[cpu]
dataset-007 active    295K  1.83G cpu.utilization[mode]
dataset-008 active    337K  2.04G dnlc.accesses[hit/miss]
dataset-009 active    109K  11.2M fc.bytes
dataset-010 active    109K  11.2M fc.ops
dataset-011 active    109K  11.2M ftp.kilobytes
dataset-012 active    148K   470M http.reqs
dataset-013 active    148K   470M io.bytes
dataset-014 active   4.16M  6.54G io.bytes[disk]
dataset-015 active    334K  1.94G io.bytes[op]
dataset-016 active    113K  20.9M io.disks[utilization=95][disk]
dataset-017 active    148K   470M io.ops
dataset-018 active   4.65M  18.2G io.ops[disk]
dataset-019 active    334K  1.94G io.ops[op]
dataset-020 active    109K  11.2M iscsi.bytes
dataset-021 active    148K   470M iscsi.ops
dataset-022 active    148K   472M iscsi.ops[client]
dataset-023 active    148K   470M ndmp.diskkb
dataset-024 active    140K   282M net.kilobytes
dataset-025 active    395K  2.30G net.kilobytes[interface]
dataset-026 active    109K  11.2M nfs2.ops
dataset-027 active    109K  11.2M nfs2.ops[op]
dataset-028 active    148K   470M nfs3.ops
dataset-029 active   1.89M  1.53G nfs3.ops[client]
dataset-030 active   4.89M  94.3G nfs3.ops[file]
dataset-031 active   4.83M  90.9G nfs3.ops[file]
dataset-032 active   27.2M  35.6G nfs3.ops[latency]
dataset-033 active   1.39M  4.40G nfs3.ops[op]
dataset-034 active   3.54M  4.56G nfs3.ops[share]
dataset-035 active    148K   470M nfs4.ops
dataset-036 active    148K   470M nfs4.ops[op]
dataset-037 active    148K   470M nic.kilobytes
dataset-038 active    546K  2.90G nic.kilobytes[device]
dataset-039 active    336K  2.21G nic.kilobytes[direction]
dataset-040 active    109K  11.2M sftp.kilobytes
dataset-041 active    148K   470M smb.ops
dataset-042 active    148K   470M smb.ops[op]
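
The ONDISK column above uses binary size suffixes (K/M/G). When triaging, it helps to rank datasets by on-disk size to find the worst offenders quickly. A minimal sketch that parses pasted `show` output; the parser, its column assumptions, and the sample lines are illustrative only, not appliance code:

```python
# Hypothetical helper: rank analytics datasets by the ONDISK column of
# 'analytics datasets show' output pasted as text. Column layout
# (DATASET STATE INCORE ONDISK NAME) is taken from the listing above.

SUFFIX = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

def to_bytes(size):
    """Convert a size string like '18.2G' or '470M' to bytes."""
    if size[-1] in SUFFIX:
        return int(float(size[:-1]) * SUFFIX[size[-1]])
    return int(size)

def rank_datasets(show_output, top=5):
    """Return (name, ondisk_bytes) pairs, largest first."""
    rows = []
    for line in show_output.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[0].startswith("dataset-"):
            rows.append((fields[4], to_bytes(fields[3])))
    return sorted(rows, key=lambda r: r[1], reverse=True)[:top]

sample = """\
dataset-018 active   4.65M  18.2G io.ops[disk]
dataset-030 active   4.89M  94.3G nfs3.ops[file]
dataset-012 active    148K   470M http.reqs
"""
for name, size in rank_datasets(sample):
    print(f"{name}: {size / 2**30:.1f} GiB")
```

In the listing above, the two `nfs3.ops[file]` datasets (94.3G and 90.9G) would surface immediately as the largest consumers.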



Symptoms observable by the Oracle Support engineer:

  • aklog has 'no memory' messages logged:
Mon Sep 26 15:49:14 2011: magnitude-1 dataspan at 176860000 for 'smb.ops[file]': failed to write: failed to write file com/sun/ak/xmlrpc/dataset/3a8b5059-db08-46fd-befe-eefc6fc5b56f/_dataspan/M1/170000000/176860000.new for stash update: no memory
Mon Sep 26 15:49:14 2011: magnitude-1 dataspan at 176870000 for 'smb.ops[file]': failed to write: failed to write file com/sun/ak/xmlrpc/dataset/3a8b5059-db08-46fd-befe-eefc6fc5b56f/_dataspan/M1/170000000/176870000.new for stash update: no memory
Mon Sep 26 15:49:14 2011: magnitude-1 dataspan at 176880000 for 'smb.ops[file]': failed to write: failed to write file com/sun/ak/xmlrpc/dataset/3a8b5059-db08-46fd-befe-eefc6fc5b56f/_dataspan/M1/170000000/176880000.new for stash update: failed to write stash file
Mon Sep 26 15:49:14 2011: magnitude-0 dataspan at 176890000 for 'smb.ops[file]': failed to write: failed to write file com/sun/ak/xmlrpc/dataset/3a8b5059-db08-46fd-befe-eefc6fc5b56f/_dataspan/M0/176000000/176890000.new for stash update: failed to write stash file
  • prstat shows akd as taking up ~3.5 GB of memory:
PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  2515 daemon   5972K 5172K sleep   60  -20  26:46:25 0.3% nfsd/78
   951 root        0K    0K sleep   99  -20  10:39:50 0.2% zpool-pool-0/158
  1285 root       41M   21M sleep   59    0   0:01:05 0.1% httpd/1
  2455 root     3540M 3540M sleep  100    -   1:48:29 0.0% akd/175  <<<
  4302 root     4240K 2848K cpu4    59    0   0:00:00 0.0% prstat/1
  1360 root       34M   27M sleep   59    0   0:03:23 0.0% fmd/33
  1796 root     9136K 4588K sleep   59    0   0:00:00 0.0% syseventd/18
  4292 webservd   40M 7732K sleep   59    0   0:00:00 0.0% httpd/1
   184 root       22M 9412K sleep   59    0   0:00:00 0.0% akwatchd/1
  1066 root     4416K 3284K sleep   59    0   0:00:00 0.0% zebra/1
   115 netadm   3868K 2788K sleep   59    0   0:00:00 0.0% ipmgmtd/3
  1022 root     3300K 1408K sleep   59    0   0:00:00 0.0% dbus-daemon/1
   108 root     3224K 2000K sleep   59    0   0:00:00 0.0% dlmgmtd/8

 

Note that 3.5G is close to the 4 GB address-space limit for a 32-bit application, which can cause many different symptoms.
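
To make the headroom concrete, illustrative arithmetic (not appliance code): a 32-bit process has at most a 4 GiB virtual address space, and the usable heap is smaller still once text, stack, and library mappings are subtracted, so a 3540 MB footprint is effectively exhaustion:

```python
# Illustrative arithmetic: compare akd's reported size against the
# 32-bit address-space ceiling.
ADDRESS_SPACE = 2**32      # 4 GiB hard ceiling for a 32-bit process
akd_size = 3540 * 2**20    # 3540M as reported by prstat

headroom = ADDRESS_SPACE - akd_size
print(f"headroom: {headroom / 2**20:.0f} MB")  # ~556 MB, before subtracting
                                               # text/stack/library mappings
```

This matches the 'no memory' failures seen in aklog and the "Couldn't allocate memory" errors on ssh login.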
  • The stash size was many tens of GB, indicating a large amount of retained analytics data:
> ::akx_class -d -n dataset | ::print ak_dataset_state_t akds_datasets | ::list ak_dataset_t akd_next | ::ak_dataset
 DATASET       INCORE STAT
18749a88       885263 nic.kilobytes[direction]
18749808       316000 nfs2.ops
1857a548       467818 nfs4.ops[op]
18c1bc88       316000 iscsi.bytes
18c1ba08       316000 sftp.kilobytes
18c1b788       316000 arc.size
 8f26848      2671116 nfs3.ops[op]
17eee388      1963692 arc.accesses[hit/miss]
131400c8       316000 io.bytes
13140ac8       316000 ndmp.diskkb
13140848      7784753 nfs3.ops[client]
131405c8       316000 iscsi.ops
13140348       316000 nfs2.ops[op]
12fcd088      4020360 arc.accesses[file]
13140d48       316000 io.ops
12e542c8       876347 io.ops[op]
125ee008       316000 http.reqs
12000848       702401 net.kilobytes[interface]
120bb108      7480821 nfs3.ops[client]
125eea08       316000 arc.l2_size
125ee508      7047006 io.ops[disk]
125eec88       848828 cpu.utilization[mode]
120005c8       972386 nic.kilobytes[device]
120bb388       316000 nfs3.ops
120bbb08       876147 io.bytes[op]
120bb888       308000 fc.bytes
120bb608       316000 nfs4.ops
120bbd88      7970586 arc.l2_accesses[file]
 c224548       868108 dnlc.accesses[hit/miss]
120000c8       480185 io.disks[utilization=95][disk]
12000d48       316000 cpu.utilization
12000ac8       316000 smb.ops[op]
12000348       878583 net.kilobytes[direction]
 c224048      1461677 arc.size[component]
 c2242c8       316000 ftp.kilobytes
 bee7788       308000 fc.ops
 bee7a08   1080518527 nfs3.ops[file]        <<<<<<<<
 bee7c88       316000 nic.kilobytes
 8f260c8       316000 smb.ops
 91a2d88   1075506365 nfs3.ops[file]        <<<<<<<<
 8f265c8       829068 arc.l2_accesses[hit/miss]

Notice that there are two datasets for 'nfs3.ops' broken down by file, each taking over a GB.
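
Duplicated names with outsized in-core counts, as with `nfs3.ops[file]` above, are the signature of this issue. A hypothetical sketch of scanning the `::ak_dataset` output for them; the column layout (ADDR COUNT NAME) is taken from the output above, and the threshold is an assumption for illustration:

```python
# Hypothetical triage helper: flag dataset names that appear more than
# once, or whose in-core count is far above the rest, in pasted mdb output.
from collections import defaultdict

def find_suspects(mdb_output, threshold=100_000_000):
    seen = defaultdict(list)
    for line in mdb_output.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[1].isdigit():
            seen[fields[2]].append(int(fields[1]))
    duplicates = {n: v for n, v in seen.items() if len(v) > 1}
    oversized = {n: v for n, v in seen.items() if max(v) > threshold}
    return duplicates, oversized

sample = """\
 bee7a08   1080518527 nfs3.ops[file]
 91a2d88   1075506365 nfs3.ops[file]
 8f260c8       316000 smb.ops
"""
dups, big = find_suspects(sample)
print("duplicated:", sorted(dups))  # ['nfs3.ops[file]']
print("oversized:", sorted(big))    # ['nfs3.ops[file]']
```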

  • akd was spinning on the following thread, reading jobs/datasets data one at a time:
-----------------  lwp# 1 / thread# 1  --------------------
feed2345 syscall  (82, 8047aa0, 0, f00f67e2) + 5
feebcbc9 openat64 (82, 8047aa0, 0, f00f67e2) + d1
f00f6b7e ak_stash_reload_atom (93d7e54, f01bb8b8, 8047b50, f00f4e1d) + ee
f00f4e38 ak_stash_resume_object (93d7e54, f01bb940, fc2e562a, f00f58e2) + ac
f00f591f ak_stash_reload (8276498, f01bb940, 8712bf8, f00b06fd) + 13f
f00b07af ak_job_init (80ddc08, 8104578, f0185844, 0) + 13f
f0111568 ak_init  (1, 1, 0, 8047dd4, 1, 8047e6c) + 960
08051564 main     (1, 8047e6c, 8047e74, feffb8b4) + 28c
0805117d _start   (1, 8047f08, 0, 8047f18, 8047f2e, 8047f46) + 7d

Cause

This is a known issue caused by 'excessive' analytics data being collected and retained.


Known issue - <SunBug 6993881> - accrued analytics data can completely wedge akd

Solution

Recommended action for the customer:

You will need to engage Oracle Support by opening a Service Request so that Oracle Support Services can confirm this issue and then carry out the appropriate activities to resolve it.


Recommended actions for the Oracle Support engineer:

  Remove the analytics from the appliance stash - see the following wiki document:

    https://stbeehive.oracle.com/teamcollab/wiki/AmberRoadSupport:Solaris+shell+procedure+to+remove+%28excessive%29+analytics

  If you cannot access this document, engage NAS Storage-TSC for assistance.

 

Back to <Document 1401282.1> Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface (BUI/CLI Hang).

References

@ <BUG:6993881> - ACCRUED ANALYTICS DATA CAN COMPLETELY WEDGE AKD
@Support wiki - remove excess analytics: https://stbeehive.oracle.com/teamcollab/wiki/AmberRoadSupport:Solaris+shell+procedure+to+remove+%28excessive%29+analytics
<NOTE:1401282.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface (BUI/CLI hang)

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.