Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-72-1359269.1
Update Date: 2012-02-16
Keywords:

Solution Type: Problem Resolution

Solution 1359269.1: ZFS Write Performance Degrades With Threads Held Up By space_map_load_wait()


Related Items
  • OpenSolaris Operating System
  • Solaris x64/x86 Operating System
  • Sun Storage 7410 Unified Storage System
  • Sun ZFS Storage 7320
  • Solaris SPARC Operating System
  • Sun Storage 7720 Unified Storage System
  • Sun Storage 7310 Unified Storage System
  • Sun Storage 7210 Unified Storage System
  • Sun ZFS Storage 7420
  • Sun Storage 7110 Unified Storage System
  • Sun ZFS Storage 7120
Related Categories
  • PLA-Support>Sun Systems>SAND>Kernel>SN-SND: Sun Kernel Performance
  • .Old GCS Categories>Sun Microsystems>Operating Systems>Solaris Kernel


Write performance on ZFS filesystems can degrade on zpools which have, at some point in the past, been grown by adding more vdevs to the pool.

In this Document
  Symptoms
  Changes
  Cause
  Solution
  References


Applies to:

Solaris SPARC Operating System - Version: 10 3/05 and later [Release: 10.0 and later]
Sun Storage 7720 Unified Storage System - Version: Not Applicable and later [Release: N/A and later]
Sun ZFS Storage 7420 - Version: Not Applicable and later [Release: N/A and later]
Sun Storage 7310 Unified Storage System - Version: Not Applicable and later [Release: N/A and later]
OpenSolaris Operating System - Version: 2008.05 and later [Release: 10.0 and later]
Information in this document applies to any platform.

Symptoms

There are several actions required to confirm this issue. 

A threadlist taken from the system during the performance issue should contain threads with a stack similar to the following:

ffffff00f5253c60 fffffffffbc2bbb0 0 0 99 ffffff8215990f3e
PC: _resume_from_idle+0xf1 TASKQ: spa_zio_write_issue
stack pointer for thread ffffff00f5253c60: ffffff00f5253790
[ ffffff00f5253790 _resume_from_idle+0xf1() ]
  swtch+0x160()
  cv_wait+0x61()
  space_map_load_wait+0x2e()
  space_map_load+0x4c()
  metaslab_activate+0x6e()
  metaslab_group_alloc+0x269()
  metaslab_alloc_dva+0x287()
  metaslab_alloc+0x9b()
  zio_dva_allocate+0x3e()
  zio_execute+0xa0()
  taskq_thread+0x1b7()
  thread_start+8()
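
If a threadlist has not already been captured, one way to gather it (a sketch against the live kernel; the same dcmd also works against a crash dump opened with mdb) is:

$ echo "::threadlist -v" | mdb -k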

Alternatively, count the number of stacks containing space_map_load_wait() from within mdb:

> ::stacks -c space_map_load_wait

THREAD           STATE  SOBJ COUNT
ffffff007cd61c60 SLEEP  CV   8
         swtch+0x147
         cv_wait+0x61
         space_map_load_wait+0x2e
         metaslab_activate+0x60
         metaslab_group_alloc+0x246
         metaslab_alloc_dva+0x2a6
         metaslab_alloc+0x9c
         zio_dva_allocate+0x57
         zio_execute+0x89
         taskq_thread+0x1b7
         thread_start+8


Looking at the time spent in zio_dva_allocate() (in microseconds, i.e. timestamp/1000), there are several outliers that take a very long time:


$ dtrace -n 'fbt:zfs:zio_dva_allocate:entry {self->ts = timestamp;} fbt:zfs:zio_dva_allocate:return /self->ts/ {@[probefunc] = quantize((timestamp - self->ts)/1000); self->ts = 0;}'
^C

         value   ------------- Distribution ------------- count
      2048 |                                         0
      4096 |                                         1
      8192 |@                                        1944
     16384 |@@@@@@@@@@@@@@                           27018
     32768 |@@@@@@@@@@@@@@@@                         31038
     65536 |@@@@                                     7144
    131072 |@@                                       3470
    262144 |@                                        2331
    524288 |@                                        1150
   1048576 |                                         510
   2097152 |                                         211
   4194304 |                                         81
   8388608 |                                         103
  16777216 |                                         248
  33554432 |                                         424
  67108864 |                                         588
 134217728 |                                         636
 268435456 |                                         543
 536870912 |                                         223
1073741824 |                                         28
2147483648 |                                         0
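
As a further check, a similar one-liner (a sketch, assuming fbt entry/return probes for space_map_load are available in the zfs module on the running kernel) can show how long the space map loads themselves take:

$ dtrace -n 'fbt:zfs:space_map_load:entry {self->ts = timestamp;} fbt:zfs:space_map_load:return /self->ts/ {@[probefunc] = quantize((timestamp - self->ts)/1000); self->ts = 0;}'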


Running '::zio_state' from mdb will show several write zios in the WAIT_FOR_CHILDREN_READY and DVA_ALLOCATE stages:

> ::zio_state

ADDRESS                           TYPE   STAGE                    WAITER
60038a57130                       NULL   DONE                     3017d34daa0
301eb1ebc40                       NULL   DONE                     3000c5a14e0
60047aa0060                       NULL   DONE                     3011d097800
6004dd35c40                       NULL   WAIT_FOR_CHILDREN_READY  2a10627fca0
 600471ac450                      WRITE  WAIT_FOR_CHILDREN_READY  -
  600498aa5b8                     WRITE  WAIT_FOR_CHILDREN_READY  -
   300960f0008                    WRITE  WAIT_FOR_CHILDREN_READY  -
    302a7915888                   WRITE  WAIT_FOR_CHILDREN_READY  -
     300bbd28d78                  WRITE  WAIT_FOR_CHILDREN_READY  -
      3006e484ac0                 WRITE  WAIT_FOR_CHILDREN_READY  -
       3006289f1d0                WRITE  WAIT_FOR_CHILDREN_READY  -
        3009b726120               WRITE  WAIT_FOR_CHILDREN_READY  -
         30181d8f720              WRITE  WAIT_FOR_CHILDREN_READY  -
          6004a1f6f10             WRITE  DVA_ALLOCATE             -
          6004a1f6700             WRITE  DVA_ALLOCATE             -
         30387051168              WRITE  WAIT_FOR_CHILDREN_READY  -
          30047b738a0             WRITE  DVA_ALLOCATE             -
          30047b72b30             WRITE  DVA_ALLOCATE             -
          30097f45468             WRITE  DVA_ALLOCATE             -

Running 'zpool iostat -v' will show which vdevs have little space left.  The following example shows how the free capacity varies between the original vdevs (emcpower16g and emcpower17g) and the newer vdevs (emcpower1g and emcpower23c):

               capacity    operations  bandwidth
pool          alloc free  read  write read  write
------------- ----- ----- ----- ----- ----- -----
DATA2021-02   372G  94.8G 212   152   1.97M 1.68M
  emcpower1g  198G  50.4G 88    30    934K  553K
  emcpower16g 56.3G 3.22G 36    37    318K  179K
  emcpower17g 56.2G 3.34G 35    39    307K  202K
  emcpower23c 61.6G 37.9G 51    46    462K  789K
------------- ----- ----- ----- ----- ----- -----
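
To watch how free space and write bandwidth differ between the vdevs over time, the same command can be run with an interval (the pool name below is taken from the example above):

$ zpool iostat -v DATA2021-02 30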



Changes

This issue can occur when additional RAIDZ vdevs have been added to an existing zpool to increase storage.  This leaves the RAIDZ vdevs in the zpool imbalanced: the original vdevs hold more data than the new ones.  As the older vdevs fill up, ZFS spends a lot of time in space_map_load_wait() while searching for free space on the other vdevs.
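
For reference, this imbalance typically arises from an operation like the following (a sketch with hypothetical device names), which adds a new RAIDZ vdev to the pool without redistributing the existing data:

$ zpool add DATA2021-02 raidz c3t0d0 c3t1d0 c3t2d0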

Cause

Due to Defect# 6876962, this issue can occur in the following releases:

SPARC Platform:
  • Solaris 10 without the ZFS Stability IDR patch IDR147574-01 or Kernel Patch 147440-04

x86 Platform:
  • Solaris 10 without the ZFS Stability IDR patch IDR147575-01 or Kernel Patch 147441-04

Unified Storage Appliances (S7000):
  • Fishworks OS without ak-2010.08.17 or later

Oracle Solaris 11 Express is not affected by this issue.

Solution

This issue is addressed in the following releases:

SPARC Platform
  • Solaris 10 with patch 147440-04

x86 Platform
  • Solaris 10 with patch 147441-04

Unified Storage Appliances (S7000)
  • Fishworks OS ak-2010.08.17 or later
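
To verify whether the corresponding kernel patch is already installed on a Solaris 10 system (a sketch using the patch IDs listed above):

$ showrev -p | grep 147440   # SPARC
$ showrev -p | grep 147441   # x86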


Workaround(s)

1) Delete any old snapshots to free up space within the zpool

2) Manually rebalance the vdev space utilization as follows (see the sketch after this list):
  • Create a new dataset in the pool
  • Move the data from an old, larger dataset to the new dataset using mv(1)
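
A minimal sketch of this rebalancing step, assuming hypothetical dataset names and default mountpoints:

$ zfs create DATA2021-02/newfs
$ mv /DATA2021-02/olddata/* /DATA2021-02/newfs/

Because mv(1) between datasets copies the blocks rather than simply renaming them, the rewritten data is allocated according to the current free space and therefore tends to land on the newer, emptier vdevs.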


3) Force allocations to the newer vdev(s) using mdb

  • Use mdb(1) to set vdev_cant_write on the older top-level vdev(s) to B_TRUE so that all new allocations go to the newer vdev(s).
  • Use zpool(1M) to run 'zpool clear <poolname>' to reset the flag once the space utilization is rebalanced.

Setting vdev_cant_write to 1 can have a side effect: the root vdev state may change to VDEV_CANT_OPEN as a result of vdev_propagate_state() processing, so the pool state will be reported as UNAVAIL by 'zpool list', 'zpool status', etc.  This appears to be harmless, but running 'zpool clear' will clear all vdev_cant_write settings and return the pool to its normal state.

It is not advised to use this method in production unless all other options have been exhausted.  If this method is used, the customer must watch the space allocated to the newer vdev(s) and run 'zpool clear <poolname>' once all vdevs contain a similar amount of data; otherwise the newer vdevs may fill up.
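
One way to keep an eye on the vdev space utilization while this workaround is in place (a sketch, using the pool name from the earlier example):

$ while :; do zpool iostat -v DATA2021-02; sleep 300; done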

### First step is to find the spa address of the zpool

$ mdb -k
> ::spa -c ! grep <poolname>

### Take the address and display the vdev_cant_write value

<spa addr>::print -a spa_t spa_root_vdev->vdev_child[n]->vdev_cant_write

e.g.:

ffffff0189510540::print -a spa_t spa_root_vdev->vdev_child[0]->vdev_cant_write
ffffff017e2a559f spa_root_vdev->vdev_child[0]->vdev_cant_write = 0

### Now we change the value to B_TRUE

$ echo "<vdev_cant_write addr>/v 1" | mdb -kw

e.g.:

$ echo "ffffff017e2a559f/v 1" | mdb -kw
0xffffff017e2a559f: 0 = 0x1
$
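
### Once the vdevs hold a similar amount of data, clear the flag and return the pool to its normal state (pool name from the example above)

$ zpool clear DATA2021-02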



References

<BUG:6876962> - DEGRADED WRITE PERFORMANCE WITH THREADS HELD UP BY SPACE_MAP_LOAD_WAIT()

Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.