Asset ID: 1-72-1401595.1
Update Date: 2012-06-29
Keywords:
Solution Type: Problem Resolution Sure
Solution 1401595.1: Sun Storage 7000 Unified Storage System: BUI/CLI hang due to 'excessive' analytics collected
Related Items
- Sun Storage 7410 Unified Storage System
- Sun Storage 7310 Unified Storage System
- Sun ZFS Storage 7120
- Sun ZFS Storage 7320
- Sun ZFS Storage 7420
- Sun Storage 7110 Unified Storage System
- Sun Storage 7210 Unified Storage System
Related Categories
- PLA-Support>Sun Systems>DISK>NAS>SN-DK: 7xxx NAS
- .Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage
In this Document
  Symptoms
  Cause
  Solution
  References
Applies to:
Sun Storage 7410 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7120 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7320 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7420 - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7110 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
7000 Appliance OS (Fishworks)
Symptoms
Symptoms observable by the customer:
- Cannot login to BUI/CLI
- ssh to appliance returns "aksh: fatal error: Couldn't allocate memory for XML-RPC fault response"
- BUI reporting error: "asynchronous statistic error: failed to update kstat chain: Not enough space"
- ssh to appliance drops to 'failsafe' shell:
$ ssh [email protected]
[email protected]'s password:
Last login: Fri Jan 7 11:12:35 2011 from 10.123.45.67
Waiting for the appliance shell to start ...
The appliance shell is taking longer than usual to start.
Press Ctrl-C to exit or wait 45 seconds for the emergency shell.
Waiting for the appliance shell to start ...
aksh-wrapper: FATAL: Appliance shell failed to start in the allotted time
Dropping into failsafe shell ...
- Replication failures in the 'alert' log:
class = alert.ak.appliance.nas.project.replication.send.fail.misc
ak_errmsg = zfs_send failed: cannot send 'pool-0/local/iscsi_User': No such process
- System log message buffer reported a number of events:
WARNING: /var/run: File system full, swap space limit exceeded
WARNING: /var/run: File system full, swap space limit exceeded
WARNING: /var/run: File system full, swap space limit exceeded
WARNING: /etc/svc/volatile: File system full, swap space limit exceeded
WARNING: /etc/svc/volatile: File system full, swap space limit exceeded
WARNING: /etc/svc/volatile: File system full, swap space limit exceeded
- After restarting akd (or rebooting the system without diagnosis), very large ONDISK values can be seen for the analytics datasets (see the CLI sketch after this list), e.g.:
sun-nas:analytics> datasets
sun-nas:analytics datasets> show
Datasets:
DATASET STATE INCORE ONDISK NAME
dataset-000 active 867K 3.88G arc.accesses[hit/miss]
dataset-001 active 148K 470M arc.l2_accesses[hit/miss]
dataset-002 active 148K 470M arc.l2_size
dataset-003 active 148K 470M arc.size
dataset-004 active 439K 2.82G arc.size[component]
dataset-005 active 148K 470M cpu.utilization
dataset-006 active 554K 1.51G cpu.utilization[cpu]
dataset-007 active 295K 1.83G cpu.utilization[mode]
dataset-008 active 337K 2.04G dnlc.accesses[hit/miss]
dataset-009 active 109K 11.2M fc.bytes
dataset-010 active 109K 11.2M fc.ops
dataset-011 active 109K 11.2M ftp.kilobytes
dataset-012 active 148K 470M http.reqs
dataset-013 active 148K 470M io.bytes
dataset-014 active 4.16M 6.54G io.bytes[disk]
dataset-015 active 334K 1.94G io.bytes[op]
dataset-016 active 113K 20.9M io.disks[utilization=95][disk]
dataset-017 active 148K 470M io.ops
dataset-018 active 4.65M 18.2G io.ops[disk]
dataset-019 active 334K 1.94G io.ops[op]
dataset-020 active 109K 11.2M iscsi.bytes
dataset-021 active 148K 470M iscsi.ops
dataset-022 active 148K 472M iscsi.ops[client]
dataset-023 active 148K 470M ndmp.diskkb
dataset-024 active 140K 282M net.kilobytes
dataset-025 active 395K 2.30G net.kilobytes[interface]
dataset-026 active 109K 11.2M nfs2.ops
dataset-027 active 109K 11.2M nfs2.ops[op]
dataset-028 active 148K 470M nfs3.ops
dataset-029 active 1.89M 1.53G nfs3.ops[client]
dataset-030 active 4.89M 94.3G nfs3.ops[file]
dataset-031 active 4.83M 90.9G nfs3.ops[file]
dataset-032 active 27.2M 35.6G nfs3.ops[latency]
dataset-033 active 1.39M 4.40G nfs3.ops[op]
dataset-034 active 3.54M 4.56G nfs3.ops[share]
dataset-035 active 148K 470M nfs4.ops
dataset-036 active 148K 470M nfs4.ops[op]
dataset-037 active 148K 470M nic.kilobytes
dataset-038 active 546K 2.90G nic.kilobytes[device]
dataset-039 active 336K 2.21G nic.kilobytes[direction]
dataset-040 active 109K 11.2M sftp.kilobytes
dataset-041 active 148K 470M smb.ops
dataset-042 active 148K 470M smb.ops[op]
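If the CLI becomes responsive again after the akd restart, the heaviest breakdown datasets can be reviewed and trimmed before they regrow. The following is a minimal, illustrative sketch only: the hostname and dataset number are taken from the listing above, and the availability of the suspend/destroy dataset actions should be confirmed against the documentation for the installed release. Destroying a dataset discards all of its collected data, so agree this with the customer first.
sun-nas:> analytics datasets
sun-nas:analytics datasets> select dataset-030
sun-nas:analytics dataset-030> show
sun-nas:analytics dataset-030> destroy
Where the history needs to be kept, 'suspend' on the selected dataset stops further collection without discarding the data already on disk; the per-file and per-client breakdowns are typically the ones that grow fastest.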
Symptoms observable by the Oracle Support engineer:
- aklog has 'no memory' messages logged:
Mon Sep 26 15:49:14 2011: magnitude-1 dataspan at 176860000 for 'smb.ops[file]': failed to write: failed to write file com/sun/ak/xmlrpc/dataset/3a8b5059-db08-46fd-befe-eefc6fc5b56f/_dataspan/M1/170000000/176860000.new for stash update: no memory
Mon Sep 26 15:49:14 2011: magnitude-1 dataspan at 176870000 for 'smb.ops[file]': failed to write: failed to write file com/sun/ak/xmlrpc/dataset/3a8b5059-db08-46fd-befe-eefc6fc5b56f/_dataspan/M1/170000000/176870000.new for stash update: no memory
Mon Sep 26 15:49:14 2011: magnitude-1 dataspan at 176880000 for 'smb.ops[file]': failed to write: failed to write file com/sun/ak/xmlrpc/dataset/3a8b5059-db08-46fd-befe-eefc6fc5b56f/_dataspan/M1/170000000/176880000.new for stash update: failed to write stash file
Mon Sep 26 15:49:14 2011: magnitude-0 dataspan at 176890000 for 'smb.ops[file]': failed to write: failed to write file com/sun/ak/xmlrpc/dataset/3a8b5059-db08-46fd-befe-eefc6fc5b56f/_dataspan/M0/176000000/176890000.new for stash update: failed to write stash file
- prstat shows akd as taking up ~3.5 GB of memory:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
2515 daemon 5972K 5172K sleep 60 -20 26:46:25 0.3% nfsd/78
951 root 0K 0K sleep 99 -20 10:39:50 0.2% zpool-pool-0/158
1285 root 41M 21M sleep 59 0 0:01:05 0.1% httpd/1
2455 root 3540M 3540M sleep 100 - 1:48:29 0.0% akd/175 <<<
4302 root 4240K 2848K cpu4 59 0 0:00:00 0.0% prstat/1
1360 root 34M 27M sleep 59 0 0:03:23 0.0% fmd/33
1796 root 9136K 4588K sleep 59 0 0:00:00 0.0% syseventd/18
4292 webservd 40M 7732K sleep 59 0 0:00:00 0.0% httpd/1
184 root 22M 9412K sleep 59 0 0:00:00 0.0% akwatchd/1
1066 root 4416K 3284K sleep 59 0 0:00:00 0.0% zebra/1
115 netadm 3868K 2788K sleep 59 0 0:00:00 0.0% ipmgmtd/3
1022 root 3300K 1408K sleep 59 0 0:00:00 0.0% dbus-daemon/1
108 root 3224K 2000K sleep 59 0 0:00:00 0.0% dlmgmtd/8
Note that 3.5 GB is close to the 4 GB address-space limit for a 32-bit application, which can cause many different symptoms (see the shell sketch after this list for a quick way to confirm akd's memory usage).
- The stash size was many tens of GB, reflecting the large amount of retained analytics (per-dataset in-core sizes from mdb):
> ::akx_class -d -n dataset | ::print ak_dataset_state_t akds_datasets | ::list ak_dataset_t akd_next | ::ak_dataset
DATASET INCORE STAT
18749a88 885263 nic.kilobytes[direction]
18749808 316000 nfs2.ops
1857a548 467818 nfs4.ops[op]
18c1bc88 316000 iscsi.bytes
18c1ba08 316000 sftp.kilobytes
18c1b788 316000 arc.size
8f26848 2671116 nfs3.ops[op]
17eee388 1963692 arc.accesses[hit/miss]
131400c8 316000 io.bytes
13140ac8 316000 ndmp.diskkb
13140848 7784753 nfs3.ops[client]
131405c8 316000 iscsi.ops
13140348 316000 nfs2.ops[op]
12fcd088 4020360 arc.accesses[file]
13140d48 316000 io.ops
12e542c8 876347 io.ops[op]
125ee008 316000 http.reqs
12000848 702401 net.kilobytes[interface]
120bb108 7480821 nfs3.ops[client]
125eea08 316000 arc.l2_size
125ee508 7047006 io.ops[disk]
125eec88 848828 cpu.utilization[mode]
120005c8 972386 nic.kilobytes[device]
120bb388 316000 nfs3.ops
120bbb08 876147 io.bytes[op]
120bb888 308000 fc.bytes
120bb608 316000 nfs4.ops
120bbd88 7970586 arc.l2_accesses[file]
c224548 868108 dnlc.accesses[hit/miss]
120000c8 480185 io.disks[utilization=95][disk]
12000d48 316000 cpu.utilization
12000ac8 316000 smb.ops[op]
12000348 878583 net.kilobytes[direction]
c224048 1461677 arc.size[component]
c2242c8 316000 ftp.kilobytes
bee7788 308000 fc.ops
bee7a08 1080518527 nfs3.ops[file] <<<<<<<<
bee7c88 316000 nic.kilobytes
8f260c8 316000 smb.ops
91a2d88 1075506365 nfs3.ops[file] <<<<<<<<
8f265c8 829068 arc.l2_accesses[hit/miss]
Notice that there are two datasets for 'nfs3.ops' broken down by file, each taking over a GB.
- akd was spinning on the following thread, reading the jobs/datasets data back from the stash one item at a time:
----------------- lwp# 1 / thread# 1 --------------------
feed2345 syscall (82, 8047aa0, 0, f00f67e2) + 5
feebcbc9 openat64 (82, 8047aa0, 0, f00f67e2) + d1
f00f6b7e ak_stash_reload_atom (93d7e54, f01bb8b8, 8047b50, f00f4e1d) + ee
f00f4e38 ak_stash_resume_object (93d7e54, f01bb940, fc2e562a, f00f58e2) + ac
f00f591f ak_stash_reload (8276498, f01bb940, 8712bf8, f00b06fd) + 13f
f00b07af ak_job_init (80ddc08, 8104578, f0185844, 0) + 13f
f0111568 ak_init (1, 1, 0, 8047dd4, 1, 8047e6c) + 960
08051564 main (1, 8047e6c, 8047e74, feffb8b4) + 28c
0805117d _start (1, 8047f08, 0, 8047f18, 8047f2e, 8047f46) + 7d
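Where the appliance shell is unavailable, the same evidence can be gathered from the failsafe Solaris shell. A minimal sketch follows; the akd PID (2455) comes from the prstat example above and is illustrative only:
pgrep -x akd             # locate the akd process id (2455 in the example above)
prstat -p 2455 1 1       # confirm SIZE/RSS approaching the 32-bit address-space limit
pmap -x 2455 | tail -1   # summary line shows akd's total address-space usage
pstack 2455 | head -30   # check whether thread# 1 is still inside ak_stash_reload()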
Cause
This is a known issue due to 'excessive' analytics being collected and retained.
Known issue - <SunBug 6993881> - accrued analytics data can completely wedge akd
Solution
Recommended action for the customer:
You will need to engage Oracle Support by opening a Service Request, so that Oracle Support Services can confirm this issue and then carry out the appropriate activities to resolve it.
Recommended actions for the Oracle Support engineer:
Remove the analytics from the appliance stash - see the following wiki document:
https://stbeehive.oracle.com/teamcollab/wiki/AmberRoadSupport:Solaris+shell+procedure+to+remove+%28excessive%29+analytics
If you cannot access this document, engage NAS Storage-TSC for assistance.
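The wiki procedure itself is not reproduced here, but the scale of the problem can be confirmed from the (failsafe) Solaris shell before anything is removed. In the sketch below the stash root path is an assumption for illustration only (the dataset-relative paths match those seen in the aklog messages above); verify the actual location, and the removal steps themselves, with NAS Storage-TSC:
STASH=/var/ak/stash                                             # assumed stash root -- verify before use
du -sk $STASH/com/sun/ak/xmlrpc/dataset/* | sort -n | tail -5   # five largest dataset directories, in KB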
Back to <Document 1401282.1> Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface (BUI/CLI Hang).
References
@ <BUG:6993881> - ACCRUED ANALYTICS DATA CAN COMPLETELY WEDGE AKD
@Support wiki - remove excess analytics: https://stbeehive.oracle.com/teamcollab/wiki/AmberRoadSupport:Solaris+shell+procedure+to+remove+%28excessive%29+analytics
<NOTE:1401282.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface (BUI/CLI hang)
Attachments
This solution has no attachment