Asset ID: 1-79-1213714.1
Update Date: 2012-08-31
Keywords:
Solution Type: Predictive Self-Healing Sure Solution
1213714.1: Sun ZFS Storage Appliance: Performance clues and considerations
Related Items
- Sun Storage 7410 Unified Storage System
- Sun ZFS Storage 7320
- Sun Storage 7310 Unified Storage System
- Sun Storage 7210 Unified Storage System
- Sun ZFS Storage 7420
- Sun ZFS Storage 7120
- Sun Storage 7110 Unified Storage System
Related Categories
- PLA-Support>Sun Systems>DISK>NAS>SN-DK: 7xxx NAS
- .Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage
In this Document
- Purpose
- Scope
- Details
- References
Applies to:
Sun ZFS Storage 7420 - Version Not Applicable and later
Sun Storage 7410 Unified Storage System - Version Not Applicable and later
Sun ZFS Storage 7120 - Version Not Applicable and later
Sun ZFS Storage 7320 - Version Not Applicable and later
Sun Storage 7310 Unified Storage System - Version Not Applicable and later
7000 Appliance OS (Fishworks)
NAS head revision : [not dependent]
BIOS revision : [not dependent]
ILOM revision : [not dependent]
Purpose
Giving accurate values for expected system-level performance on the ZFS Storage Appliance is not possible, as it depends on many parameters - such as client workload, feature set used, memory size, network interface type, number of disks, optional use of logzilla and/or readzilla (SSD), pool layout (mirror vs raidz2), filesystem recordsize and FC/iSCSI LUN volblocksize.
However, this document provides some basics that may help you estimate what can be expected, along with some best practices for advising the customer on the most suitable configuration.
Please note : this document is not intended to be used in isolation for performance troubleshooting. Refer to the ZFSSA Performance Troubleshooting resolution path document (see <Document 1331769.1>) to guide you through the prerequisite stages of performance problem separation, clarification, categorization, basic system health checks and further information gathering - before comparing customer system performance with the information described in this document.
Scope
This documentation provides some guidelines but does not replace a serious and reliable benchmark based on the customer's genuine workload.
Details
Disk performance
Actual performance depends heavily on the I/O workload. As a starting point, these numbers can be used :
7x10 series use SATA-1 disks : average of 150 IOPS per disk
7x20 series use SAS-2 disks :
- 7200rpm 1TB and 2TB disks : average of 150 IOPS per disk
- 15000rpm 300GB and 600GB disks : average of 300 IOPS per disk
Note : a nearly full disk may deliver fewer IOPS, because the data is spread over more of the platter and seeks get longer. For the same amount of stored data, the smaller 15000rpm disks are fuller than the larger 7200rpm disks, so the IOPS ratio between 7200rpm and 15000rpm disks might be 1.5 instead of 2.0.
We traditionally think of IOPS as "IOPS under fully random conditions", which is a worst-case situation. Fully random means that there is a seek and a rotational delay between every I/O, and that is lost time as far as transfers are concerned. If the workload is massively random, this can drop to as low as 70 IOPS per disk.
In practice, most I/O is not fully random. If there is some sequential I/O - which requires either no seek or a very short one - you can easily see vastly higher I/O rates from a disk. In the limit, sequential 512-byte I/Os can easily achieve rates in the thousands.
The NAS head is based on ZFS for data handling. Each pool is made of a certain number of vdevs of the same type (mirror, raidz, raidz2, raidz3), and the vdevs are striped in the pool. Each vdev contains disks. To simplify greatly : when an I/O comes in, it is written to the first vdev; the next I/O is written to the second vdev, and so on. The bandwidth (and IOPS) limit is per vdev, so the more vdevs a pool has, the more bandwidth can be expected.
For example, let's take a configuration with 16 disks. With a mirror pool layout, we will have 8 vdevs (2 disks per vdev), hence we can expect up to 8*150 = 1200 IOPS. With a raidz2 pool layout, we might have only 2 vdevs (each made of 8 disks), hence we can expect about 2*150 = 300 IOPS.
On large configurations, a raidz pool layout can end up with 5x or 10x more vdevs than a raidz2 layout. Mirroring is still faster, but the point is that raidz and raidz2 do not necessarily perform the same.
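As a rough illustration of the per-vdev model above, the following sketch (a hypothetical helper, not an appliance tool) reproduces the 16-disk example; it assumes the ~150 IOPS per-vdev figure and ignores caching, spares and read concurrency inside a mirror vdev:

# Ballpark IOPS estimate for a pool of striped vdevs, following the
# simplified per-vdev model described above. Not an appliance tool.

def estimate_pool_iops(total_disks, disks_per_vdev, iops_per_vdev=150):
    """Return (vdev_count, estimated_iops) for a striped-vdev pool."""
    vdevs = total_disks // disks_per_vdev
    return vdevs, vdevs * iops_per_vdev

# 16 disks, as in the example: mirror (2-disk vdevs) vs raidz2 (8-disk vdevs)
print(estimate_pool_iops(16, 2))   # -> (8, 1200)
print(estimate_pool_iops(16, 8))   # -> (2, 300)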
See also <Document 1315536.1> for a detailed example (RAIDZ2 Performance Issues With High I/O Wait Queues)
SSD
Two SSD types can be used in the NAS heads.
- SSD logzilla is used for synchronous I/O (iSCSI and files opened with the O_DSYNC attribute). When a synchronous I/O arrives at the NAS head, it is written to DRAM (memory) as well as to the logzilla. Within the next 5 seconds (or less), the grouped data is flushed from DRAM to the SATA or SAS-2 disks. The logzilla is never read except after a system crash. The logzilla size is 18GB on the 7x10 series and 73GB on the 7x20 series. In the 7120 series, a 96GB Flash Module card is integrated as a PCI device; it is divided into 4 modules, each usable as a log device.
Performance :
- 18GB : 120 MB/s of synchronous writes and up to 3300 4-Kbyte IOPS
- 73GB : 200 MB/s of synchronous writes and up to 25000 8-Kbyte IOPS
- SSD readzilla is used as an L2 cache for the ZFS ARC. After some time, old data/metadata are pushed from the L1 ARC (kernel memory) to the level-2 ARC; this is called "eviction". Accessing them there is still about 10 times faster than retrieving them from the SATA/SAS disks. Readzilla SSDs are written from and read into memory; there is no direct copy from the SATA/SAS disks to the readzilla. The readzilla size is 100GB on the 7x10 series and 512GB on the 7x20 series.
The readzillas have two workloads : the read requests they satisfy and the writes that fill the device. These trade off against each other. The write rate is a function of the eviction rate from the L1 cache and a variety of other factors, and it is explicitly throttled to avoid suppressing the ability to satisfy reads.
Performance : 3100 8-Kbyte IOPS, and up to 10000 IOPS with a synthetic benchmark.
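To judge whether a synchronous-write workload fits within the logzilla ratings quoted above, a quick back-of-the-envelope check such as the following sketch can help (the ratings are the ones listed above; the workload figures are made-up example inputs):

# Back-of-the-envelope check of a synchronous-write workload against the
# logzilla ratings quoted above. The workload figures are made-up inputs.

LOGZILLA_RATINGS = {
    "18GB": {"mb_per_s": 120, "iops": 3300},   # 4-Kbyte IOPS
    "73GB": {"mb_per_s": 200, "iops": 25000},  # 8-Kbyte IOPS
}

def fits_logzilla(device, workload_mb_per_s, workload_iops):
    """Return True if the workload stays under both throughput and IOPS ratings."""
    rating = LOGZILLA_RATINGS[device]
    return (workload_mb_per_s <= rating["mb_per_s"]
            and workload_iops <= rating["iops"])

# Example: 90 MB/s of O_DSYNC writes at 12000 IOPS against a 73GB logzilla
print(fits_logzilla("73GB", 90, 12000))   # -> True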
See also <Document 1213725.1> to learn when logzilla/readzilla devices can be added (search for "observing hardware bottlenecks in Analytics").
Some more details can be found here : https://stbeehive.oracle.com/teamcollab/wiki/Elite+Engineering+Exchange:ZFSSA+Sizing+Q+and+A#.26.2339.3B.26.2339.3BWhat+is+the+maximum+IOPS+per+15k+disk.26.2339.3B.26.2339.3B
Network
Two types of network interfaces can be used :
PCI Express Quad Gigabit Ethernet UTP
Dual 10-Gigabit Ethernet
A 1Gb device can push ~120 MBytes/sec.
A 10Gb device can push ~1.2 GBytes/sec.
Most of the time, LACP with two 10Gb interfaces will not allow a single client to reach 20Gb of bandwidth.
Some load balancing can be done at the protocol level (TCP/UDP port) by setting the LACP policy to L4, which uses the source and destination transport-level ports. This means a client can use different interfaces in the LACP group if different protocols are used at the same time. The more clients use the same LACP group, the better the efficiency.
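As a simplified illustration (not the appliance's actual hashing code), an L4 policy spreads flows across links based on the transport ports, so a single client with several concurrent connections can use more than one link while any single flow stays on one link:

# Simplified illustration of L4 (transport-port) LACP load balancing.
# This is not the appliance's real hash; it only shows why multiple flows
# from one client can land on different links while one flow stays put.

def pick_link(src_port, dst_port, num_links):
    """Choose an egress link from the transport-level ports."""
    return (src_port ^ dst_port) % num_links

links = 2  # two 10Gb interfaces in the LACP group
flows = [(50001, 2049), (50002, 2049), (50003, 3260)]  # NFS, NFS, iSCSI
for src, dst in flows:
    print(f"{src}->{dst} uses link {pick_link(src, dst, links)}")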
Jumbo frames can be used to increase the MTU (to 9000). The clients must also have jumbo frames enabled to see a performance improvement : the MTU is effectively negotiated between the client and the NAS head, and the lower value is used.
Some possible related bugs :
6977076 : Memory leak in nxge_start when >1Mb dblks with jumbo frames used
6981953 : Suggest lowering of max db_ref value for DEBUG kernels
6982878 : 7310 hangs during normal activity with no access to BUI or ak cli
6423877 : Found memory leaks in tcp_send
6423874 : found memory leaks in strmakedata
Pool layout
As introduced previously, many pool layouts can be used when configuring the storage. For low-latency workloads (VMware, VDI), a mirrored layout is highly recommended.
A raidz2 layout is an acceptable choice for sequential I/O and can perform quite well with many vdevs (which requires many disks).
A mirrored layout remains faster because the number of vdevs is far larger than with raidz2, and reads can be serviced by both disks of each mirror at the same time.
Recordsize/volblocksize
The recordsize specifies a suggested block size for files in a filesystem. It can be set up to 128 Kbytes (the default) and can be changed at any time, but the change only applies to files created afterwards.
The volblocksize specifies the block size of a volume (iSCSI, FC). The block size cannot be changed once the volume has been written, so set it at volume creation time. The default block size for volumes is 8 Kbytes; it can be set up to 128 Kbytes.
It is very important to match the client blocksize with the filesystem recordsize or volume blocksize.
Wrong sizing might lead to unexpected performance degradation, especially for random reads from a raidz2 vdev.
Even on iSCSI with a logzilla in use, wrong sizing may lead to bad performance : if the I/O does not match in terms of size (smaller than the volblocksize), ZFS has to retrieve the rest of the block from memory (if still present) or from disk (far slower) in order to write the entire block back to the logzilla. Block alignment has to be taken into consideration too; see the next section.
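As a rough illustration of why a mismatch hurts, the following sketch (hypothetical numbers, not a measurement of the appliance) shows the extra data ZFS must fetch back when a client write is smaller than the volblocksize:

# Rough illustration of the extra work caused by a client write smaller
# than the volblocksize : ZFS must fetch the rest of the block (from
# memory or disk) before it can write the whole block out again.
# Hypothetical numbers, not a measurement of the appliance.

def partial_write_overhead(client_io, volblocksize):
    """Bytes ZFS must read back to complete one client write."""
    if client_io >= volblocksize:
        return 0                      # full block, no read-modify-write
    return volblocksize - client_io   # rest of the block must be fetched

volblocksize = 8192                   # default 8-Kbyte volblocksize
for client_io in (2048, 4096, 8192):
    extra = partial_write_overhead(client_io, volblocksize)
    print(f"{client_io}-byte write -> {extra} extra bytes read back")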
Block alignment
This topic is detailed in a blog post by David Lutz.
With proper alignment, a single client block that is the same size as or smaller than the volume block size of a LUN will be contained entirely within a single volume block of the LUN. Without proper alignment, that same client block may span multiple volume blocks in the LUN. That can result in 2 appliance reads for a single client read, and 2 appliance reads plus 2 appliance writes for a single client write. This obviously has a big impact on performance if ignored.
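A small sketch of the effect (hypothetical offsets, following the alignment reasoning above): with an aligned partition an 8-Kbyte client block maps onto exactly one 8-Kbyte volume block, while a misaligned one spans two:

# Count how many 8-Kbyte volume blocks one client block touches, for an
# aligned and a misaligned partition start offset. Hypothetical offsets,
# following the alignment reasoning above.

def volume_blocks_touched(offset, client_block, volblocksize):
    """Number of volume blocks covered by one client block at this offset."""
    first = offset // volblocksize
    last = (offset + client_block - 1) // volblocksize
    return last - first + 1

volblocksize = 8192
client_block = 8192
print(volume_blocks_touched(0,   client_block, volblocksize))  # aligned    -> 1
print(volume_blocks_touched(512, client_block, volblocksize))  # misaligned -> 2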
For details, see Partition Alignment Guidelines for Unified Storage.
See also <Document 1175573.1> Sun Storage 7000 Unified Storage System: Configuration and Tuning for iSCSI performance. This shows some 'wmic' and 'diskpart.exe' commands for Windows.
Dedup
Dedup is good for capacity but has some known caveats for performance : throughput to and from shares with deduplication enabled is within 30% of the throughput available without deduplication enabled.
For details, see Dedup design and implementation guidelines : http://www.oracle.com/technetwork/articles/servers-storage-admin/zfsdedupguidelines-335537.html#Perf
Back to <Document 1331769.1> Sun Storage 7000 Unified Storage System: How to Troubleshoot Performance Issues.
References
<BUG:6981953> - Suggest lowering of max db_ref value for DEBUG kernels
<BUG:6982878> - 7310 hangs during normal activity with no access to BUI or ak cli
@ <BUG:6423877> - Found memory leaks in tcp_send
@ <BUG:6423874> - Found memory leaks in strmakedata
<NOTE:1175573.1> - Sun Storage 7000 Unified Storage System: Configuration and tuning for iSCSI performance
<NOTE:1213725.1> - Sun Storage 7000 Unified Storage System: Configuration and tuning for NFS performance
<NOTE:1229193.1> - Sun Storage 7000 Unified Storage System: Collecting analytics data for iSCSI performance issues
<NOTE:1230145.1> - Sun Storage 7000 Unified Storage System: Collecting analytics data for CIFS performance issues
<NOTE:1315536.1> - Sun Storage 7000 Unified Storage System: RAIDZ2 Performance Issues With High I/O Wait Queues
<NOTE:1331769.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Performance Issues
@ <BUG:6977076> - Memory leak in nxge_start when >1Mb dblks with jumbo frames used
Attachments
This solution has no attachment