Description
For Sun Storage appliances S7110/S7210/S7310/S7310C/S7410/S7410C with firmware releases 2009.Q2, 2009.Q3 or 2010.Q1, large amounts of ZFS filesystem activity triggered by "snapshot destroy" can result in severe performance degradation or an apparent hang of the appliance.
Occurrence
This issue can occur on the following platforms:
- Sun Storage S7110/S7210/S7310/S7310C/S7410/S7410C
running firmware releases 2009.Q2, 2009.Q3 or 2010.Q1.
To determine the version of firmware on these systems, do the following:
From any UNIX client (able to do ssh):
# ssh -l root <appliance IP addr> "script run('configuration version'); print('version: '+get('version'))"
version: 2010.02.09.2.1,1-1.18
Or from the BUI:
Maintenance -> System -> Current Installation
and match the reported version string against the corresponding 2009 or 2010 release:
2009.Q2 <= 2009.04.10
2009.Q3 <= 2009.09.01
2010.Q1 <= 2010.02.09
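The lookup above can also be applied mechanically to the version string returned by the ssh command. The following is a minimal Python sketch, reading each entry above as a release name followed by the date prefix of its version string, and assuming the version string always begins with a YYYY.MM.DD prefix as in the example output. The names AFFECTED_RELEASES and release_family are illustrative only and are not part of the appliance software.

```python
# Date prefixes taken from the list above; all names here are illustrative.
AFFECTED_RELEASES = {
    "2009.04.10": "2009.Q2",
    "2009.09.01": "2009.Q3",
    "2010.02.09": "2010.Q1",
}


def release_family(version):
    """Return the release family for a version string, or None if not listed.

    Assumes the string starts with a YYYY.MM.DD prefix, as in
    '2010.02.09.2.1,1-1.18'.
    """
    prefix = ".".join(version.strip().split(".")[:3])
    return AFFECTED_RELEASES.get(prefix)


if __name__ == "__main__":
    print(release_family("2010.02.09.2.1,1-1.18"))
```

This sketch only identifies the release family. It does not distinguish later micro releases within a family (for example 2010.Q1.1.0), so the full version string should still be compared against the release notes.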
A snapshot destroy operation can be triggered in one of the following ways:
- As a result of regular snapshot expiry at the end of the specified snapshot retention period
- In response to user deletion or alteration of the snapshot policy (e.g. scheduled start time) via the BUI
- Following snapshot rollback
- As a result of replication, wherein the snapshot which is created prior to start of data replication is then destroyed upon sync completion
The impact of snapshot destroy activity may go undetected if the appliance completes deletion of the configured snapshots quickly enough relative to client-side I/O timeouts. However, the extent of filesystem activity triggered by snapshot destroy depends upon the number of data blocks which must be deleted and upon the other workloads running on the appliance at the time. It therefore depends upon the following factors:
- The number of projects/shares/LUNs which have the snapshot feature enabled.
- The number of distinct snapshots configured against each project/share/LUN.
- The number of data blocks which have changed in the time between snapshot creation and deletion
- Whether snapshot destroy occurs during a time of high/peak appliance I/O load.
- Whether many/all snapshots have been configured with the same start time (and therefore the same deletion time)
- Where iSCSI LUNs are in use, the issue is exacerbated when using small block sizes (e.g. 512 bytes, 1KB)
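As a rough illustration of the block size factor, the arithmetic below shows how the number of blocks to be freed grows as the block size shrinks, for the same amount of changed data. This is a simplified Python sketch only: the 10 GiB figure is hypothetical, the function name blocks_to_free is illustrative, and the calculation ignores metadata, compression and blocks shared between snapshots, so it does not reflect the appliance's actual accounting.

```python
def blocks_to_free(changed_bytes, block_size):
    """Approximate number of data blocks holding `changed_bytes` of data.

    Rounds up; ignores metadata, compression and sharing between snapshots.
    """
    return -(-changed_bytes // block_size)


GIB = 1024 ** 3
changed = 10 * GIB  # hypothetical amount of data changed during the retention period

for size in (512, 1024, 8 * 1024, 128 * 1024):
    print(f"{size:>7} bytes/block -> {blocks_to_free(changed, size):>12,} blocks")
```

Under these assumptions, 10 GiB of changed data corresponds to 20,971,520 blocks at 512 bytes but only 81,920 blocks at 128 KB, a factor of 256.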
Symptoms
Symptoms resulting from this scenario typically include much higher I/O latency seen by attached clients, possibly leading to I/O retries, timeouts and lost connectivity.
These symptoms typically occur at fixed or regular times, which correlate with the snapshot destroy schedule configured on the appliance.
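One informal way to check this correlation from the client side is to compare the times of observed latency spikes with the times of day at which snapshots are scheduled. The following Python sketch assumes the spike timestamps have already been extracted from client logs and that the schedule is known as a list of (hour, minute) entries. The function name, the 30-minute window and the sample data are all hypothetical, and a match is only suggestive; it is not a substitute for diagnosis by Oracle support.

```python
from datetime import datetime, timedelta


def spikes_near_schedule(spike_times, schedule_times, window_minutes=30):
    """Return the spikes that fall within `window_minutes` after a scheduled time of day.

    spike_times: datetime objects for observed latency spikes (from client logs).
    schedule_times: (hour, minute) tuples for the snapshot schedule entries.
    """
    window = timedelta(minutes=window_minutes)
    matched = []
    for spike in spike_times:
        for hour, minute in schedule_times:
            same_day = spike.replace(hour=hour, minute=minute, second=0, microsecond=0)
            # Also check the previous day so that windows crossing midnight match.
            candidates = (same_day, same_day - timedelta(days=1))
            if any(timedelta(0) <= spike - c <= window for c in candidates):
                matched.append(spike)
                break
    return matched


if __name__ == "__main__":
    spikes = [datetime(2010, 9, 13, 1, 5), datetime(2010, 9, 13, 14, 0)]
    print(spikes_near_schedule(spikes, [(1, 0)]))
```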
In extreme cases, appliance I/O response may ultimately appear to be hung when viewed from a client perspective. In addition, the appliance BUI may appear hung during the snapshot destroy process. Such persistent symptoms will not be cleared by a reboot, although normal performance levels will return once snapshot destroy has completed.
Note: Oracle support will be able to confirm the underlying cause by directly observing the relevant ZFS thread states, using dtrace(1M) from the appliance shell.
Workaround
As a temporary workaround for any given project/share/LUN, increasing the snapshot retention period (measured in days) will delay the point at which snapshot destroy next occurs, provided there is sufficient space available on the appliance. Following consultation with Oracle support, this may provide additional diagnosis/planning time if this issue is suspected as the root cause.
Impact may be reduced by spreading (staggering) the start times for configured snapshots (so for example they do not all begin at 01:00 or 09:00).
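As an illustration of staggering, the following Python sketch spreads a given number of snapshot schedules evenly across a time window and returns candidate start times. It is a planning aid only: the function name and parameters are hypothetical, the window chosen is an assumption, and the resulting times would still need to be entered manually in the appliance's snapshot schedule configuration.

```python
def staggered_start_times(count, window_start_hour=0, window_hours=6):
    """Spread `count` schedules evenly over a window, returning 'HH:MM' strings."""
    if count <= 0:
        return []
    step = (window_hours * 60) / count  # minutes between consecutive start times
    times = []
    for i in range(count):
        # Wrap around midnight if the window extends past 24:00.
        total = (window_start_hour * 60 + int(i * step)) % (24 * 60)
        times.append(f"{total // 60:02d}:{total % 60:02d}")
    return times


if __name__ == "__main__":
    # Four schedules spread over a four-hour window starting at 01:00.
    print(staggered_start_times(4, window_start_hour=1, window_hours=4))
```

Choosing a window outside peak appliance I/O load also addresses the peak-load factor listed under Occurrence.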
Customers who already depend on ZFS snapshot usage, or who expect to, are strongly advised to upgrade to firmware release 2010.Q1.1.0 (or later). This firmware release provides performance benefits to the snapshot destroy process over previous releases, and will reduce (but not altogether remove or resolve) the performance impact resulting from large amounts of snapshot destroy activity.
Contract Customers who have either recently enabled snapshots, or who have increased the overall degree of snapshot usage on the appliance and are now seeing severe performance degradation, are advised to raise a new Service Request.
This issue is addressed in the following release:
- Sun Storage 7000 firmware 2010.Q3
History
13-Sep-2010: Date of Workaround Release
01-Nov-2010: Date of Resolved Release - updated for firmware release
30-Jul-2012: Maintenance update, no change in content
Firmware 2010.Q1.1.0 (already available) resolves the following contributing issue:
6949730 spurious arc_free() can significantly exacerbate 6948890
Firmware 2010.Q3 will resolve the following contributing problems:
6948890 snapshot deletion can induce pathologically long spa_sync() times
6944388 dsl_dataset_snapshot_reserve_space() causes dp_write_limit=max
Responsible Engineer: [email protected]
Community: Sun NAS - Storage-Disk
Please send technical questions to the following email:
[email protected]
and copy the Responsible Engineer