Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1359269.1
Update Date: 2012-02-16
Keywords:

Solution Type: Problem Resolution

Solution 1359269.1: ZFS Write Performance Degrades With Threads Held Up By space_map_load_wait()


Related Items
  • OpenSolaris Operating System
  • Solaris x64/x86 Operating System
  • Sun Storage 7410 Unified Storage System
  • Sun ZFS Storage 7320
  • Solaris SPARC Operating System
  • Sun Storage 7720 Unified Storage System
  • Sun Storage 7310 Unified Storage System
  • Sun Storage 7210 Unified Storage System
  • Sun ZFS Storage 7420
  • Sun Storage 7110 Unified Storage System
  • Sun ZFS Storage 7120
Related Categories
  • PLA-Support>Sun Systems>SAND>Kernel>SN-SND: Sun Kernel Performance
  • .Old GCS Categories>Sun Microsystems>Operating Systems>Solaris Kernel


Write performance on ZFS filesystems can degrade on zpools which have, at some point in the past, been grown by adding more vdevs to the pool.

In this Document
  Symptoms
  Changes
  Cause
  Solution
  References


Applies to:

Solaris SPARC Operating System - Version: 10 3/05 and later [Release: 10.0 and later]
Sun Storage 7720 Unified Storage System - Version: Not Applicable and later [Release: N/A and later]
Sun ZFS Storage 7420 - Version: Not Applicable and later [Release: N/A and later]
Sun Storage 7310 Unified Storage System - Version: Not Applicable and later [Release: N/A and later]
OpenSolaris Operating System - Version: 2008.05 and later [Release: 10.0 and later]
Information in this document applies to any platform.

Symptoms

There are several actions required to confirm this issue. 

A threadlist taken from the system during the performance issue should contain threads with a stack similar to the following:

ffffff00f5253c60 fffffffffbc2bbb0 0 0 99 ffffff8215990f3e
PC: _resume_from_idle+0xf1 TASKQ: spa_zio_write_issue
stack pointer for thread ffffff00f5253c60: ffffff00f5253790
[ ffffff00f5253790 _resume_from_idle+0xf1() ]
  swtch+0x160()
  cv_wait+0x61()
  space_map_load_wait+0x2e()
  space_map_load+0x4c()
  metaslab_activate+0x6e()
  metaslab_group_alloc+0x269()
  metaslab_alloc_dva+0x287()
  metaslab_alloc+0x9b()
  zio_dva_allocate+0x3e()
  zio_execute+0xa0()
  taskq_thread+0x1b7()
  thread_start+8()
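
If a threadlist has not already been captured, one way to gather it (a sketch against the live kernel; the same dcmd also works against a crash dump opened with mdb) is:

$ echo "::threadlist -v" | mdb -k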

Alternatively, count the number of stacks containing space_map_load_wait() from within mdb:

> ::stacks -c space_map_load_wait

THREAD           STATE  SOBJ COUNT
ffffff007cd61c60 SLEEP  CV   8
         swtch+0x147
         cv_wait+0x61
         space_map_load_wait+0x2e
         metaslab_activate+0x60
         metaslab_group_alloc+0x246
         metaslab_alloc_dva+0x2a6
         metaslab_alloc+0x9c
         zio_dva_allocate+0x57
         zio_execute+0x89
         taskq_thread+0x1b7
         thread_start+8


Looking at the time spent in zio_dva_allocate() (in microseconds, i.e. timestamp/1000), there are several outliers that take a very long time:


$ dtrace -n 'fbt:zfs:zio_dva_allocate:entry {self->ts = timestamp;} fbt:zfs:zio_dva_allocate:return /self->ts/ {@[probefunc] = quantize((timestamp - self->ts)/1000); self->ts = 0;}'
^C

         value   ------------- Distribution ------------- count
      2048 |                                         0
      4096 |                                         1
      8192 |@                                        1944
     16384 |@@@@@@@@@@@@@@                           27018
     32768 |@@@@@@@@@@@@@@@@                         31038
     65536 |@@@@                                     7144
    131072 |@@                                       3470
    262144 |@                                        2331
    524288 |@                                        1150
   1048576 |                                         510
   2097152 |                                         211
   4194304 |                                         81
   8388608 |                                         103
  16777216 |                                         248
  33554432 |                                         424
  67108864 |                                         588
 134217728 |                                         636
 268435456 |                                         543
 536870912 |                                         223
1073741824 |                                         28
2147483648 |                                         0
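
As a further check, a similar one-liner (a sketch, assuming fbt entry/return probes for space_map_load are available in the zfs module on the running kernel) can show how long the space map loads themselves take:

$ dtrace -n 'fbt:zfs:space_map_load:entry {self->ts = timestamp;} fbt:zfs:space_map_load:return /self->ts/ {@[probefunc] = quantize((timestamp - self->ts)/1000); self->ts = 0;}'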


Running '::zio_state' from mdb will show several write zios in the WAIT_FOR_CHILDREN_READY and DVA_ALLOCATE stages:

> ::zio_state

ADDRESS                           TYPE   STAGE                    WAITER
60038a57130                       NULL   DONE                     3017d34daa0
301eb1ebc40                       NULL   DONE                     3000c5a14e0
60047aa0060                       NULL   DONE                     3011d097800
6004dd35c40                       NULL   WAIT_FOR_CHILDREN_READY  2a10627fca0
 600471ac450                      WRITE  WAIT_FOR_CHILDREN_READY  -
  600498aa5b8                     WRITE  WAIT_FOR_CHILDREN_READY  -
   300960f0008                    WRITE  WAIT_FOR_CHILDREN_READY  -
    302a7915888                   WRITE  WAIT_FOR_CHILDREN_READY  -
     300bbd28d78                  WRITE  WAIT_FOR_CHILDREN_READY  -
      3006e484ac0                 WRITE  WAIT_FOR_CHILDREN_READY  -
       3006289f1d0                WRITE  WAIT_FOR_CHILDREN_READY  -
        3009b726120               WRITE  WAIT_FOR_CHILDREN_READY  -
         30181d8f720              WRITE  WAIT_FOR_CHILDREN_READY  -
          6004a1f6f10             WRITE  DVA_ALLOCATE             -
          6004a1f6700             WRITE  DVA_ALLOCATE             -
         30387051168              WRITE  WAIT_FOR_CHILDREN_READY  -
          30047b738a0             WRITE  DVA_ALLOCATE             -
          30047b72b30             WRITE  DVA_ALLOCATE             -
          30097f45468             WRITE  DVA_ALLOCATE             -

Running 'zpool iostat -v' will show which vdevs have little space left.  The following example shows how the free capacity varies between the original vdevs (emcpower16g and emcpower17g) and the newer vdevs (emcpower1g and emcpower23c):

               capacity    operations  bandwidth
pool          alloc free  read  write read  write
------------- ----- ----- ----- ----- ----- -----
DATA2021-02   372G  94.8G 212   152   1.97M 1.68M
  emcpower1g  198G  50.4G 88    30    934K  553K
  emcpower16g 56.3G 3.22G 36    37    318K  179K
  emcpower17g 56.2G 3.34G 35    39    307K  202K
  emcpower23c 61.6G 37.9G 51    46    462K  789K
------------- ----- ----- ----- ----- ----- -----
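
To watch how free space and write bandwidth differ between the vdevs over time, the same command can be run with an interval (the pool name below is taken from the example above):

$ zpool iostat -v DATA2021-02 30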



Changes

This issue can occur when additional RAIDZ vdevs have been added to an existing zpool to increase storage.  This leaves the RAIDZ vdevs in the zpool imbalanced: the original vdevs hold more data than the new ones.  As the older vdevs fill up, ZFS spends a lot of time in space_map_load_wait() while searching for free space on the other vdevs.
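
For reference, this imbalance typically arises from an operation like the following (a sketch with hypothetical device names), which adds a new RAIDZ vdev to the pool without redistributing the existing data:

$ zpool add DATA2021-02 raidz c3t0d0 c3t1d0 c3t2d0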

Cause

Due to Defect# 6876962, this issue can occur in the following releases:

SPARC Platform:
  • Solaris 10 without the ZFS Stability IDR patch IDR147574-01 or Kernel Patch 147440-04

x86 Platform:
  • Solaris 10 without the ZFS Stability IDR patch IDR147575-01 or Kernel Patch 147441-04

Unified Storage Appliances (S7000):
  • Fishworks OS without ak-2010.08.17 or later

Oracle Solaris 11 Express is not affected by this issue.

Solution

This issue is addressed in the following releases:

SPARC Platform
  • Solaris 10 with patch 147440-04

x86 Platform
  • Solaris 10 with patch 147441-04

Unified Storage Appliances (S7000)
  • Fishworks OS ak-2010.08.17 or later
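
To verify whether the corresponding kernel patch is already installed on a Solaris 10 system (a sketch using the patch IDs listed above):

$ showrev -p | grep 147440   # SPARC
$ showrev -p | grep 147441   # x86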


Workaround(s)

1) Delete any old snapshots to free up space within the zpool

2) Manually rebalance the vdev space utilization as follows (see the sketch after this list):
  • Create a new dataset in the pool
  • Move the data from an old, larger dataset to the new dataset using mv(1)
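
A minimal sketch of this rebalancing step, assuming hypothetical dataset names and default mountpoints:

$ zfs create DATA2021-02/newfs
$ mv /DATA2021-02/olddata/* /DATA2021-02/newfs/

Because mv(1) between datasets copies the blocks rather than simply renaming them, the rewritten data is allocated according to the current free space and therefore tends to land on the newer, emptier vdevs.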


3) Force allocations to the newer vdev(s) using mdb

  • Use mdb(1) to set vdev_cant_write on the older top-level vdev(s) to B_TRUE so that all new allocations go to the newer vdev(s).
  • Use zpool(1M) to run 'zpool clear <poolname>' to reset the flag once the space utilization is rebalanced.

Setting vdev_cant_write to 1 can have a side effect: the root vdev state may change to VDEV_CANT_OPEN as a result of vdev_propagate_state() processing, so the pool state will be reported as UNAVAIL by 'zpool list', 'zpool status', etc.  This appears to be harmless, but running 'zpool clear' will clear all vdev_cant_write settings and return the pool to its normal state.

It is not advised to use this method in production unless all other options have been exhausted.  If this method is used, the customer must watch the space allocated to the newer vdev(s) and run 'zpool clear <poolname>' once all vdevs contain a similar amount of data; otherwise the newer vdevs may fill up.
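
One way to keep an eye on the vdev space utilization while this workaround is in place (a sketch, using the pool name from the earlier example):

$ while :; do zpool iostat -v DATA2021-02; sleep 300; done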

### First step is to find the spa address of the zpool

$ mdb -k
> ::spa -c ! grep <poolname>

### Take the address and display the vdev_cant_write value

<spa addr>::print -a spa_t spa_root_vdev->vdev_child[n]->vdev_cant_write

e.g.:

ffffff0189510540::print -a spa_t spa_root_vdev->vdev_child[0]->vdev_cant_write
ffffff017e2a559f spa_root_vdev->vdev_child[0]->vdev_cant_write = 0

### Now we change the value to B_TRUE

$ echo "<vdev_cant_write addr>/v 1" | mdb -kw

e.g.:

$ echo "ffffff017e2a559f/v 1" | mdb -kw
0xffffff017e2a559f: 0 = 0x1
$
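
### Once the vdevs hold a similar amount of data, clear the flag and return the pool to its normal state (pool name from the example above)

$ zpool clear DATA2021-02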



References

<BUG:6876962> - DEGRADED WRITE PERFORMANCE WITH THREADS HELD UP BY SPACE_MAP_LOAD_WAIT()

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.