Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1410873.1
Update Date:2012-06-29

Solution Type  Problem Resolution Sure

Solution  1410873.1 :   Sun Storage 7000 Unified Storage System: SMF unable to spawn processes due to contract exhaustion  


Related Items
  • Sun Storage 7310 Unified Storage System
  • Sun Storage 7410 Unified Storage System
  • Sun ZFS Storage 7120
  • Sun Storage 7110 Unified Storage System
  • Sun ZFS Storage 7320
  • Sun ZFS Storage 7420
  • Sun Storage 7210 Unified Storage System
Related Categories
  • PLA-Support>Sun Systems>DISK>NAS>SN-DK: 7xxx NAS
  • .Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage




In this Document
Symptoms
Cause
Solution
References


Applies to:

Sun Storage 7210 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7110 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7310 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7410 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7120 - Version Not Applicable to Not Applicable [Release N/A]
7000 Appliance OS (Fishworks)

Symptoms



Symptoms observable by the customer:

  • Cannot log in to the BUI/CLI
  • SSH service is disabled
  • Users have lost access to NFS shares
  • Workflows time out when run
  • Error message 'failed to spawn job' when performing CLI operations, for example:
    nas1:maintenance system> sendbundle
    error: An unanticipated system error occurred: failed to spawn job
    a75604b6-36fd-2222-eeee-b5fbb4b9e9a6: invalid peer group name

    This may be due to transient failure, or a software defect. If this problem
    persists, contact your service provider.


    nas2:maintenance system updates> download
    nas2:maintenance system updates download (uncommitted)> set url=http://devops-storage.us.oracle.com/upgrade/OLD/ak-nas-2010-08-17-2-1-1-1-21-nd.pkg.gz
    url = http://devops-storage.us.oracle.com/upgrade/OLD/ak-nas-2010-08-17-2-1-1-1-21-nd.pkg.gz
    nas2:maintenance system updates download (uncommitted)> commit
    Transferred 618M of 644M (95.9%) ... done
    error: An unanticipated system error occurred: failed to spawn job
    64fcd5a2-1234-abcd-e232-d1865ec6e13c: invalid peer group name

    This may be due to transient failure, or a software defect. If this problem persists, contact your service provider.
  • BUI System log contains "fork: Resource temporarily unavailable" messages:
Mar 11 13:10:52 nas402 svc.startd[76]: [ID 748625 daemon.error] network/ssh:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Mar 11 13:11:15 nas402 sshd[2105]: [ID 800047 auth.error] error: fork: Resource temporarily unavailable
Mar 11 13:14:55 nas402 svc.startd[76]: [ID 748625 daemon.error] network/ssh:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Mar 11 13:16:46 nas402 sshd[2105]: [ID 800047 auth.error] error: fork: Resource temporarily unavailable
Mar 11 13:21:28 nas402 svc.startd[76]: [ID 748625 daemon.error] network/ssh:default failed: transitioned to maintenance (see 'svcs -xv' for details)
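
As a rough aid for gauging how often the fork failure shown above occurs, the following is a minimal sketch that counts matching lines in log text supplied on standard input. The function name count_fork_failures is hypothetical and is not an appliance command; it assumes the log has been saved or exported as plain text.

```shell
#!/bin/sh
# Hypothetical helper (not an appliance command): count log lines that
# report a failed fork(). Reads log text on standard input.
count_fork_failures() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so "|| true" keeps the function's status at 0.
    grep -c 'fork: Resource temporarily unavailable' || true
}

# Example use, with a placeholder file name for the saved log text:
#   count_fork_failures < saved-system-log.txt
```

A steadily growing count over a short period is consistent with the symptoms listed here, but it does not by itself confirm contract exhaustion.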



Symptoms observable by the Oracle Support engineer:

  • 'ctstat' shows that the number of contracts is close to the limit of 10000:
# svcs -pv akd

      -> Note the CTID (contract ID) value. Let's call it '$CTID'.

# ctstat | grep $CTID | wc -l        => returns the number of contracts used by akd

# ctstat | wc -l                     => returns the total number of contracts in use (the count includes the header line)
  • debug.sys shows "Resource temporarily unavailable" messages:
Mar 11 13:10:56 nas402 svc.startd[76]: [ID 462725 daemon.warning] svc:/network/ssh:default: Couldn't fork to execute method /lib/svc/bin/svcio -p -L ro -R /etc/svc/volatile -S /usr/lib/ak/svc/stencil -a && exec /lib/svc/method/sshd start: Resource temporarily unavailable
Mar 11 13:10:56 nas402 svc.startd[76]: [ID 748625 daemon.error] network/ssh:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Mar 11 16:30:07 nas402 sshd[2105]: [ID 800047 auth.error] error: fork: Resource temporarily unavailable
Mar 11 16:46:26 nas402 svc.startd[76]: [ID 462725 daemon.warning] svc:/network/nfs/status:default: Couldn't fork to execute method exec /usr/lib/ak/svc/method/nfs-status stop 189706: Resource temporarily unavailable
Mar 11 16:46:26 nas402 svc.startd[76]: [ID 462725 daemon.warning] svc:/appliance/kit/nfsconf:default: Couldn't fork to execute method /lib/svc/bin/svcio -p -L ro -R /etc/svc/volatile -S /usr/lib/ak/svc/stencil -au: Resource temporarily unavailable
Mar 11 16:46:26 nas402 svc.startd[76]: [ID 462725 daemon.warning] svc:/network/nfs/server:default: Couldn't fork to execute method exec /usr/lib/ak/svc/method/nfs-server stop 189709: Resource temporarily unavailable
Mar 11 16:48:36 nas402 svc.startd[76]: [ID 462725 daemon.warning] svc:/appliance/kit/nfsconf:default: Couldn't fork to execute method /lib/svc/bin/svcio -p -L ro -R /etc/svc/volatile -S /usr/lib/ak/svc/stencil -a: Resource temporarily unavailable
Mar 11 16:48:36 nas402 svc.startd[76]: [ID 748625 daemon.error] appliance/kit/nfsconf:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Mar 11 16:48:36 nas402 svc.startd[76]: [ID 462725 daemon.warning] svc:/network/nfs/cbd:default: Couldn't fork to execute method exec /usr/lib/nfs/nfs4cbd: Resource temporarily unavailable
  • SSH service is in maintenance mode:
adc26stor02# svcs -xv
svc:/network/ssh:default (SSH server)
State: maintenance since Wed Mar 30 14:46:45 2011
Reason: Method failed repeatedly.
See: http://sun.com/msg/SMF-8000-8Q
See: man -M /usr/share/man -s 1M sshd
See: /var/svc/log/network-ssh:default.log
Impact: This service is not running.
  • Many (network) services are in the maintenance state:
nas01# svcs -a | grep main
maintenance     7:10:41 svc:/system/ndmpd:default
maintenance     7:12:22 svc:/network/ntp:default
maintenance     7:12:44 svc:/system/identity:domain
maintenance     7:12:44 svc:/appliance/kit/nsswitch:default
maintenance     7:12:48 svc:/network/dns/client:default
maintenance     7:12:50 svc:/network/sendmail-client:default
maintenance     7:12:52 svc:/appliance/kit/netconf:default
maintenance     7:13:12 svc:/appliance/kit/nfsconf:default
maintenance     7:13:15 svc:/network/nfs/cbd:default
maintenance     7:13:18 svc:/network/nfs/status:default
maintenance     7:13:32 svc:/appliance/kit/http:default
  • Many job 'objects' in the stash:   
abc12# cd /var/ak/stash/com/sun/ak/job/
abc12# ls -l | wc -l
229
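
The contract count check above can be sketched as a small filter that reads 'ctstat' output on standard input and reports usage against the limit of 10000 quoted in this document. This is only an illustration: the function name contract_usage is hypothetical, the header row is assumed to begin with "CTID", and on Solaris 'nawk' may be needed in place of 'awk' for the -v option.

```shell
#!/bin/sh
# Hypothetical helper (not an appliance command): summarize contract usage
# from `ctstat` output read on standard input.
# $1 is the contract limit; it defaults to 10000, the limit quoted above.
contract_usage() {
    limit=${1:-10000}
    awk -v limit="$limit" '
        $1 == "CTID" { next }   # skip the header row (assumed to start with CTID)
        NF > 0       { n++ }    # every other non-empty row is one contract
        END {
            printf "%d of %d contracts in use (%d%%)\n", n, limit, (n * 100) / limit
        }
    '
}

# Example use on a system where ctstat is available:
#   ctstat | contract_usage
```

A result close to 100% matches the symptom described above; the percentage is truncated to a whole number.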

Cause

Whenever a workflow terminates abnormally, it leaves an unused contract ID behind.
Once this situation arises, the system 'stash' also fills up with failed 'jobs'.


After a workflow has been executed, if sleep is killed from the Solaris shell, the contract IDs are not cleaned up.

Known issue - <SunBug 7014175> - akd using maximum number of contracts
Additionally, <SunBug 7038390> - job spawn failures pollute the stash

Solution

Recommended action for the customer:

Open a Service Request with Oracle Support so that Oracle Support Services can confirm this issue and then carry out the appropriate activities to resolve it.
For a permanent resolution, update to Appliance Firmware Release 2010.Q3.3.1 or later.


Recommended actions for the Oracle Support engineer:

  Confirm the 'contract limit' issue and remove the stash 'job' objects - see the following wiki document:

    https://stbeehive.oracle.com/teamcollab/wiki/AmberRoadSupport:Confirm+contract+limit+issue+and+remove+stash+jobs

  If you cannot access this document, engage NAS Storage-TSC for assistance.

 

Back to <Document 1401282.1> Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface.

References

<BUG:7014175> - AKD USING MAXIMUM NUMBER OF CONTRACTS
<BUG:7038390> - JOB SPAWN FAILURES POLLUTE THE STASH
Support wiki - confirm contract limit: https://stbeehive.oracle.com/teamcollab/wiki/AmberRoadSupport:Confirm+contract+limit+issue+and+remove+stash+jobs
<NOTE:1401282.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface (BUI/CLI hang)

  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.