Sun Storage 7000 Unified Storage System: SMF unable to spawn processes due to contract exhaustion

Asset ID:	1-72-1410873.1
Update Date:	2012-06-29

Solution Type: Problem Resolution Sure

Solution 1410873.1 : Sun Storage 7000 Unified Storage System: SMF unable to spawn processes due to contract exhaustion

Applies to:

Sun Storage 7210 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7110 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7310 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7410 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7120 - Version Not Applicable to Not Applicable [Release N/A]
7000 Appliance OS (Fishworks)

Symptoms

Symptoms observable by the customer:

Cannot log in to the BUI/CLI

SSH service disabled

Users lost access to NFS shares

Workflows time out when run

Error message 'failed to spawn job' when performing CLI operations, for example:

    nas1:maintenance system> sendbundle
    error: An unanticipated system error occurred: failed to spawn job
    a75604b6-36fd-2222-eeee-b5fbb4b9e9a6: invalid peer group name

    This may be due to transient failure, or a software defect. If this problem
    persists, contact your service provider.


    nas2:maintenance system updates> download
    nas2:maintenance system updates download (uncommitted)> set url=http://devops-storage.us.oracle.com/upgrade/OLD/ak-nas-2010-08-17-2-1-1-1-21-nd.pkg.gz
    url = http://devops-storage.us.oracle.com/upgrade/OLD/ak-nas-2010-08-17-2-1-1-1-21-nd.pkg.gz
    nas2:maintenance system updates download (uncommitted)> commit
    Transferred 618M of 644M (95.9%) ... done
    error: An unanticipated system error occurred: failed to spawn job
    64fcd5a2-1234-abcd-e232-d1865ec6e13c: invalid peer group name

    This may be due to transient failure, or a software defect. If this problem persists, contact your service provider.

BUI System log contains "fork: Resource temporarily unavailable" messages:

Mar 11 13:10:52 nas402 svc.startd[76]: [ID 748625 daemon.error] network/ssh:default failed: transitioned to maintenance (see 'svcs -xv' for details)

Mar 11 13:11:15 nas402 sshd[2105]: [ID 800047 auth.error] error: fork: Resource temporarily unavailable

Mar 11 13:14:55 nas402 svc.startd[76]: [ID 748625 daemon.error] network/ssh:default failed: transitioned to maintenance (see 'svcs -xv' for details)

Mar 11 13:16:46 nas402 sshd[2105]: [ID 800047 auth.error] error: fork: Resource temporarily unavailable

Mar 11 13:21:28 nas402 svc.startd[76]: [ID 748625 daemon.error] network/ssh:default failed: transitioned to maintenance (see 'svcs -xv' for details)

Symptoms observable by the Oracle Support engineer:

'ctstat' shows that the number of contracts is close to the limit of 10000.

# svcs -pv akd

      -> Note the CTID (contract ID) value. Call it '$CTID'.

# ctstat | grep $CTID | wc -l        => returns the number of contracts used by akd

# ctstat | wc -l                     => returns the total number of contracts in use
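
The check above can also be scripted. The following is a minimal sketch, not a supported tool. The function name contract_usage is hypothetical. It assumes that 'ctstat' output is supplied on standard input, that the first line of that output is a column header, and that the limit is the 10000 quoted above unless a second argument overrides it. Unlike 'grep $CTID', it only counts lines where a whole column equals the CTID, so a CTID of 76 does not match 1076.

```shell
#!/bin/sh
# Hypothetical helper: summarise contract usage from `ctstat` output on stdin.
# Usage: ctstat | contract_usage <CTID> [limit]
contract_usage() {
    ctid=$1
    limit=${2:-10000}    # assumed limit, taken from the note above
    awk -v ctid="$ctid" -v limit="$limit" '
        NR == 1 { next }                 # assumption: first line is the header
        {
            total++
            # Count the line if any whole column equals the CTID.
            for (i = 1; i <= NF; i++)
                if ($i == ctid) { matched++; break }
        }
        END {
            printf "total=%d matched=%d limit=%d pct=%d\n",
                   total, matched, limit, (total * 100) / limit
        }'
}
```

On an affected system this might be run as 'ctstat | contract_usage $CTID'. A 'pct' value close to 100 would be consistent with the contract exhaustion described here.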

debug.sys shows "Resource temporarily unavailable" messages:

Mar 11 13:10:56 nas402 svc.startd[76]: [ID 462725 daemon.warning] svc:/network/ssh:default: Couldn't fork to execute method /lib/svc/bin/svcio -p -L ro -R /etc/svc/volatile -S /usr/lib/ak/svc/stencil -a && exec /lib/svc/method/sshd start: Resource temporarily unavailable

Mar 11 13:10:56 nas402 svc.startd[76]: [ID 748625 daemon.error] network/ssh:default failed: transitioned to maintenance (see 'svcs -xv' for details)

Mar 11 16:30:07 nas402 sshd[2105]: [ID 800047 auth.error] error: fork: Resource temporarily unavailable

Mar 11 16:46:26 nas402 svc.startd[76]: [ID 462725 daemon.warning] svc:/network/nfs/status:default: Couldn't fork to execute method exec /usr/lib/ak/svc/method/nfs-status stop 189706: Resource temporarily unavailable

Mar 11 16:46:26 nas402 svc.startd[76]: [ID 462725 daemon.warning] svc:/appliance/kit/nfsconf:default: Couldn't fork to execute method /lib/svc/bin/svcio -p -L ro -R /etc/svc/volatile -S /usr/lib/ak/svc/stencil -au: Resource temporarily unavailable

Mar 11 16:46:26 nas402 svc.startd[76]: [ID 462725 daemon.warning] svc:/network/nfs/server:default: Couldn't fork to execute method exec /usr/lib/ak/svc/method/nfs-server stop 189709: Resource temporarily unavailable

Mar 11 16:48:36 nas402 svc.startd[76]: [ID 462725 daemon.warning] svc:/appliance/kit/nfsconf:default: Couldn't fork to execute method /lib/svc/bin/svcio -p -L ro -R /etc/svc/volatile -S /usr/lib/ak/svc/stencil -a: Resource temporarily unavailable

Mar 11 16:48:36 nas402 svc.startd[76]: [ID 748625 daemon.error] appliance/kit/nfsconf:default failed: transitioned to maintenance (see 'svcs -xv' for details)

Mar 11 16:48:36 nas402 svc.startd[76]: [ID 462725 daemon.warning] svc:/network/nfs/cbd:default: Couldn't fork to execute method exec /usr/lib/nfs/nfs4cbd: Resource temporarily unavailable
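
To see which services are affected most often, the "Couldn't fork to execute method" lines can be tallied per service. This is a rough sketch only. The function name fork_failures_by_service is hypothetical, and it assumes the log lines have the same layout as the samples above, with the service FMRI as a whitespace-separated field starting with 'svc:/' and ending in a colon.

```shell
#!/bin/sh
# Hypothetical helper: count svc.startd fork failures per service FMRI.
# Usage: fork_failures_by_service < debug.sys
fork_failures_by_service() {
    awk '
        # "." stands in for the apostrophe in "Couldn" + "t".
        /Couldn.t fork to execute method/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^svc:\//) {
                    fmri = $i
                    sub(/:$/, "", fmri)   # drop the trailing colon
                    count[fmri]++
                    break
                }
        }
        END { for (f in count) print count[f], f }
    ' | LC_ALL=C sort -k1,1nr -k2,2       # most failures first, then by name
}
```

Each output line is a count followed by the FMRI. Lines from other sources, such as the sshd "fork: Resource temporarily unavailable" messages, are deliberately not counted.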

SSH service is in maintenance mode:

adc26stor02# svcs -xv

svc:/network/ssh:default (SSH server)

State: maintenance since Wed Mar 30 14:46:45 2011

Reason: Method failed repeatedly.

See: http://sun.com/msg/SMF-8000-8Q

See: man -M /usr/share/man -s 1M sshd

See: /var/svc/log/network-ssh:default.log

Impact: This service is not running.

Many (network) services are in the maintenance state:

nas01# svcs -a | grep main

maintenance     7:10:41 svc:/system/ndmpd:default

maintenance     7:12:22 svc:/network/ntp:default

maintenance     7:12:44 svc:/system/identity:domain

maintenance     7:12:44 svc:/appliance/kit/nsswitch:default

maintenance     7:12:48 svc:/network/dns/client:default

maintenance     7:12:50 svc:/network/sendmail-client:default

maintenance     7:12:52 svc:/appliance/kit/netconf:default

maintenance     7:13:12 svc:/appliance/kit/nfsconf:default

maintenance     7:13:15 svc:/network/nfs/cbd:default

maintenance     7:13:18 svc:/network/nfs/status:default

maintenance     7:13:32 svc:/appliance/kit/http:default
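
Note that 'grep main' also matches any line that merely contains the string "main", for example an online svc:/system/identity:domain. A stricter filter can be sketched as follows. The function name maintenance_services is hypothetical, and it assumes the 'svcs -a' layout shown above, with the state in the first column and the FMRI in the last.

```shell
#!/bin/sh
# Hypothetical helper: print the FMRI of every service whose state column
# is exactly "maintenance".
# Usage: svcs -a | maintenance_services
maintenance_services() {
    awk '$1 == "maintenance" { print $NF }'
}
```

Piping the result through 'wc -l' gives the number of services in maintenance.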

Many job 'objects' in the stash:

abc12# cd /var/ak/stash/com/sun/ak/job/

abc12# ls -l | wc -l

229
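
The figure from 'ls -l | wc -l' includes the "total" line that 'ls -l' prints first, so it is one higher than the number of entries. A small sketch that counts only the entries is shown below. The function name count_dir_entries is hypothetical, and the count assumes that no file name contains a newline.

```shell
#!/bin/sh
# Hypothetical helper: count the entries in a directory, excluding "." and "..".
# Usage: count_dir_entries /var/ak/stash/com/sun/ak/job
count_dir_entries() {
    # tr strips the padding that some wc implementations add.
    ls -A "$1" | wc -l | tr -d ' '
}
```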

Cause

Whenever a workflow terminates abnormally, it leaves behind an unused contract ID.
Once this situation arises, the system 'stash' also fills up with failed 'jobs'.

After a workflow is executed, if its sleep process is killed from the Solaris shell, the contract IDs are not cleaned up.

Known issue - <SunBug 7014175> - akd using maximum number of contracts
Additionally, <SunBug 7038390> - job spawn failures pollute the stash

Solution

Recommended action for the customer:

Engage Oracle Support by opening a Service Request, so that Oracle Support Services can confirm this issue and carry out the appropriate activities to resolve it.
For a permanent resolution, update to Appliance Firmware Release 2010.Q3.3.1 or later.

Recommended actions for the Oracle Support engineer:

Confirm the 'contract limit' issue and remove the stash 'job' objects - see the following wiki document:

https://stbeehive.oracle.com/teamcollab/wiki/AmberRoadSupport:Confirm+contract+limit+issue+and+remove+stash+jobs

If you cannot access this document, engage NAS Storage-TSC for assistance.

Back to <Document 1401282.1> Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface.

References

@ <BUG:7014175> - AKD USING MAXIMUM NUMBER OF CONTRACTS
@ <BUG:7038390> - JOB SPAWN FAILURES POLLUTE THE STASH
@ Support wiki - confirm contract limit: https://stbeehive.oracle.com/teamcollab/wiki/AmberRoadSupport:Confirm+contract+limit+issue+and+remove+stash+jobs
<NOTE:1401282.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface (BUI/CLI hang)

Attachments

This solution has no attachment