Sun Storage 7000 Unified Storage System: Scheduled snapshots fail to run for a share which has a "zombie" snapshot beneath it

Asset ID:	1-72-1379364.1
Update Date:	2012-05-13

Solution Type:	Problem Resolution Sure

Applies to:

Sun Storage 7110 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7720 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7210 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun Storage 7410 Unified Storage System - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7120 - Version Not Applicable to Not Applicable [Release N/A]
7000 Appliance OS (Fishworks)

Symptoms

Scheduled snapshots fail to run at the times they are scheduled to do so for certain shares.
Upon investigation there appears to be a "zombie" under the snapshot directory of the affected shares.
Once the zombie has disappeared the scheduled snapshots restart.

Cause

Scheduled snapshots have a "keep at most" property. If it is not specified, the scheduled snapshots are kept forever. However, to prevent huge numbers of frequently scheduled snapshots from building up, the property is capped at a maximum of 48 for half-hourly snapshots and 24 for hourly snapshots.
If the "keep at most" limit has been reached for a particular project or share, the next scheduled snapshot will not run until the snapshots that fall outside the retention policy have been deleted and the number of snapshots is back below the limit.
If there have been many changes to the parent filesystem since the snapshots were taken, the snapshots may be large and take a relatively long time to delete. Snapshots that are being destroyed can therefore spend longer than anticipated on the zombie list, which is why shares whose scheduled snapshots are not running are often seen to have zombie snapshots under the snapshot directory of the parent filesystem.

Solution

Most often this situation will correct itself. Once the snapshots have been completely destroyed and removed from the zombie list, the number of snapshots should be back below the "keep at most" limit and the scheduled snapshots will run again.
If scheduled snapshots are consistently failing, check whether a "keep at most" limit has been set and whether it has been set too low.

Remember that if the "keep at most" property is not set it will be limited to 48 or 24 for half-hourly and hourly snapshot schedules respectively. For other frequencies of scheduled snapshots no limit means that the snapshots will be kept forever.
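
The defaults described above can be summarised as a small shell sketch. It is illustrative only: the function names effective_keep and schedule_blocked and the frequency labels are hypothetical and are not part of the appliance software, and the sketch assumes that an explicitly set "keep at most" value is used as given.

```shell
#!/bin/sh
# Hypothetical helper: print the effective "keep at most" limit.
# $1 = schedule frequency (e.g. half-hourly, hourly, daily)
# $2 = configured "keep at most" value; empty string means "not set"
effective_keep() {
    freq=$1
    keep=$2
    if [ -n "$keep" ]; then
        # Assumption: an explicitly configured value is used unchanged.
        echo "$keep"
        return
    fi
    case $freq in
        half-hourly) echo 48 ;;        # default cap from this document
        hourly)      echo 24 ;;        # default cap from this document
        *)           echo unlimited ;; # other frequencies: kept forever
    esac
}

# Hypothetical helper: exit status 0 when the next scheduled snapshot
# would be held back because the snapshot count has reached the limit.
# $1 = current number of snapshots, $2 = limit from effective_keep
schedule_blocked() {
    count=$1
    limit=$2
    [ "$limit" != "unlimited" ] && [ "$count" -ge "$limit" ]
}
```

For example, "schedule_blocked 24 "$(effective_keep hourly "")"" succeeds, which corresponds to an hourly schedule that will not run again until the count drops below 24. Snapshots still on the zombie list count towards the total.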

Please see the ZFS Storage 7000 System Administration Guide for more information on setting the "frequency" and "keep at most" properties of scheduled snapshots:

https://docs.oracle.com/cd/E22471_01/html/820-4167/shares__shares__snapshots.html#shares__shares__snapshots__scheduled_snapshots

If the "keep at most" property is set appropriately and scheduled snapshots are still not running, there may be problems destroying some of the zombie snapshots. These zombies keep the total number of snapshots for the share or project over the "keep at most" limit.
You can check for zombie snapshots under the .zfs/snapshot directory of the parent filesystems of the NFS shares. They should show up as in the example below.

# ls -lAn .zfs/snapshot/

ls: cannot access .zfs/snapshot/.zombie-3a09c: Input/output error
...
drwxr-xr-x+ 29 20218 20218 81 2011-09-14 03:00 .auto-1315962000
drwxr-xr-x+ 29 20218 20218 81 2011-09-14 03:00 .auto-1315963800
drwxr-xr-x+ 29 20218 20218 85 2011-09-23 23:00 test
??????????? ? ? ? ? ? .zombie-3a09c
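
The check above can be narrowed to the zombie entries alone. The following is a minimal sketch, assuming it is run from a host where the share's .zfs/snapshot directory is visible; the function name list_zombies is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print only the ".zombie-*" entries found in a
# snapshot directory.
# $1 = path to the snapshot directory (defaults to .zfs/snapshot)
list_zombies() {
    snapdir=${1:-.zfs/snapshot}
    # No -l here, so ls is asked for entry names only; any error
    # messages are discarded. grep keeps names starting with ".zombie-".
    ls -A "$snapdir" 2>/dev/null | grep '^\.zombie-'
}
```

The function prints one zombie name per line and returns a non-zero status when no zombie entries are found, so it can be used in a condition such as "if list_zombies /path/to/share/.zfs/snapshot; then ...".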

There may be a ZFS hold on these snapshots that is preventing their destruction. To investigate and remedy this situation it will be necessary to raise a Service Request with your Oracle Global Support Centre.

To check for holds on zombies see the following example, from the Operating System shell:

# zfs list -t all | grep zombie

pool-01/local/filesystem/[email protected] 8.60M - 185G -
pool-01/zombie 62K 6.86T 31K none
pool-01/zombie/shares 31K 6.86T 31K none

# zfs holds pool-01/local/filesystem/[email protected]

NAME TAG TIMESTAMP
[email protected] .send-101485-0 Thu Jun 24 08:34:55 2010

The numbers following the ".send" are the PID and thread information of the process that has a hold on the zombie.
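
As a sketch of how the PID can be pulled out of such a tag, the shell function below strips the ".send-" prefix and the trailing thread field. The function name hold_pid is hypothetical, and the sketch assumes the tag has the ".send-PID-THREAD" form shown in the example above.

```shell
#!/bin/sh
# Hypothetical helper: extract the PID from a hold tag of the form
# ".send-<PID>-<thread>", e.g. ".send-101485-0".
hold_pid() {
    tag=$1
    case $tag in
        .send-*-*) ;;          # expected form, carry on
        *) return 1 ;;         # any other tag: report failure
    esac
    rest=${tag#.send-}         # drop the ".send-" prefix
    echo "${rest%%-*}"         # drop everything from the first "-" on
}
```

With the tag from the example, "hold_pid .send-101485-0" prints 101485.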

Check which process this is:

# ps -ef | grep [PID]

It is most likely a replication (the PID will correspond to akd) or an NDMP backup. To release the hold on the zombie, kill the process that has the hold, if and when it is appropriate to do so.

If no process has the hold, you can release it with the zfs release command. Before doing this, see the WORKAROUND section of <BUG:7166278>:
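
Because "ps -ef | grep" can also match the grep command itself or other lines that merely contain the same digits, a more targeted lookup may be preferable. The following is a minimal sketch using the standard -p and -o options of ps; the function name show_process is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: show the PID and command line of one process.
# $1 = PID taken from the hold tag
show_process() {
    pid=$1
    # -p selects exactly this PID; "pid=" and "args=" print those
    # columns without a header line.
    ps -p "$pid" -o pid= -o args=
}
```

If the process no longer exists, nothing is printed and the function returns a non-zero status, which suggests that nothing is actively using the hold.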

# zfs release .send-101485-0 [email protected]

References

<BUG:7166278> - NAS_CACHE_REAPER SPINS WITH EBUSY TRYING TO DELETE ZOMBIE SNAPSHOTS
