Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-71-1472716.1
Update Date: 2012-08-16
Keywords:

Solution Type: Technical Instruction

Solution 1472716.1: Sun ZFS Storage Appliance: How to set up and use Analytics for benchmarking applications


Related Items
  • Sun ZFS Backup Appliance
  • Sun ZFS Storage 7120
  • Sun ZFS Storage 7420
  • Sun ZFS Storage 7320
Related Categories
  • PLA-Support>Sun Systems>DISK>NAS>SN-DK: 7xxx NAS


This document describes which ZFS Storage Appliance Analytics to enable in performance-related benchmarks or proofs of concept (POC) and how to interpret the results of those benchmarks.  It also references the client-side accounting tools that should be used to monitor the client system during benchmarking or performance testing.

In this Document
Goal
Fix


Applies to:

Sun ZFS Storage 7420 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7320 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7120 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Backup Appliance - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.

Goal

This document provides the reader with the minimum level of instrumentation needed to execute performance tests and benchmarks with the ZFS Storage Appliance or ZFS Backup Appliance.

Fix

Detailed performance monitoring of the ZFS Storage Appliance for general use cases can be accomplished by first enabling Advanced Analytics (Configuration->Preferences, check the Enable Advanced Analytics box) and then enabling the following Analytics (these can also be created from the appliance CLI; a sketch follows this list):

  • CPU: CPUs broken down by percent utilization
  • CPU: Percent utilization broken down by CPU mode
  • Cache: ARC accesses broken down by hit/miss
  • Cache: ARC adaptive parameter
  • Cache: ARC size
  • Cache: L2ARC accesses broken down by hit/miss
  • Cache: L2ARC size
  • Disk: Average number of I/O operations broken down by state
  • Disk: disks broken down by percent utilization
  • Disk: I/O bytes broken down by type
  • Disk: I/O bytes broken down by disk
  • Disk: I/O operations broken down by type
    • Drill down on reads by latency
    • Drill down on writes by latency
  • Disk: percent utilization broken down by disk
  • Add data set "Disks with utilization >95% broken down by disk"
  • Disk: ZFS logical I/O bytes broken down by type
  • Disk: ZFS logical I/O operations broken down by type
  • Memory: kernel memory in use broken down by kmem_cache
  • Network: device bytes broken down by device
  • Network: device bytes broken down by direction
  • Network: interface bytes broken down by interface
  • Network: interface bytes broken down by direction
  • Protocol: Fibre Channel bytes broken down by type
  • Protocol: SRP, iSCSI, and Fibre Channel operations broken down by type
    • Drill down on reads by size
    • Drill down on reads by latency: turn on 5% filter and pick the mode (most frequently occurring value) to estimate the average
    • Drill down on writes by size
    • Drill down on writes by latency: turn on 5% filter and pick the mode (most frequently occurring value) to estimate the average
  • Protocol: SMB, NFSv[3,4] operations broken down by type
    • Drill down on reads by size
    • Drill down on reads by latency: turn on 5% filter and pick the mode (most frequently occurring value) to estimate the average
    • Drill down on writes by size
    • Drill down on writes by latency: turn on 5% filter and pick the mode (most frequently occurring value) to estimate the average
    • Drill down on commits by latency: turn on 5% filter and pick the mode (most frequently occurring value) to estimate the average
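
The statistics above can also be enabled from the appliance CLI rather than the BUI.  The following is a minimal sketch, not a complete list: the prompt is illustrative, and the three dataset names shown (cpu.utilization, arc.accesses[hit/miss], io.ops[op]) follow the CLI naming convention for the corresponding BUI statistics.

  zfssa:> analytics datasets
  zfssa:analytics datasets> create cpu.utilization
  zfssa:analytics datasets> create arc.accesses[hit/miss]
  zfssa:analytics datasets> create io.ops[op]
  zfssa:analytics datasets> show

The show command lists the datasets currently being collected and archived.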

Details for interpreting each accounting statistic are available from the ZFS Storage Appliance and ZFS Backup Appliance browser user interface (BUI) help menu: Help->Analytics->Statistics.

Instrumentation on the client operating system accessing the ZFS Storage Appliance or ZFS Backup Appliance should also be enabled and recorded during performance evaluations or benchmarking exercises.

For Solaris, begin with the following instrumentation (note that the 5-second interval is a good first suggestion, but specific circumstances may dictate a different value); a wrapper script sketch follows this list:

  • Disk or NFS file system access: iostat -xnzcCMT d 5
  • CPU consumption: mpstat 5
  • Network consumption: netstat -i -I <interface> 5
  • TCP send/receive queue: netstat -a or netstat -a | grep nfsd
  • Locks: lockstat -A sleep 5
  • Hot kernel stacks: dtrace -n 'profile-1001 { @[stack()] = count() }'
  • Hot user stacks: dtrace -n 'profile-1001 { @[ustack()] = count() }'
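
To keep the client-side data aligned with the appliance Analytics, run these tools for the duration of the test and save their output.  A minimal shell sketch, assuming an illustrative ./perflogs output directory, a 10-minute test window, and a placeholder nxge0 network interface:

  #!/bin/sh
  # Capture Solaris client-side statistics for the duration of a benchmark.
  LOGDIR=./perflogs          # assumed output directory
  DURATION=600               # seconds; match your test window
  mkdir -p $LOGDIR

  iostat -xnzcCMT d 5 > $LOGDIR/iostat.out &
  IOSTAT=$!
  mpstat 5 > $LOGDIR/mpstat.out &
  MPSTAT=$!
  netstat -i -I nxge0 5 > $LOGDIR/netstat.out &   # substitute your interface
  NETSTAT=$!

  sleep $DURATION
  kill $IOSTAT $MPSTAT $NETSTAT                   # stop the collectors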

For the Linux operating system, begin with the following instrumentation (a snapshot loop sketch follows this list):

  • Disk (block) access: iostat -x 5
  • NFS access: nfsstat 5
  • RPC statistics: mountstats 5
  • CPU consumption: mpstat -P ALL 5
  • Network interface traffic: sar -n DEV 5
  • TCP send/receive queue: netstat -a 5 or netstat -a 5 | grep nfs
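
Note that nfsstat and mountstats report cumulative counters, so on some versions it is more reliable to snapshot them on a fixed cadence than to rely on an interval argument.  A minimal sketch, assuming an illustrative NFS mount point of /mnt/zfssa:

  #!/bin/sh
  # Snapshot cumulative NFS/RPC counters every 5 seconds with timestamps.
  MOUNT=/mnt/zfssa           # substitute your NFS mount point
  while true; do
      date
      nfsstat -c             # client-side NFS call counts
      mountstats $MOUNT      # per-mount RPC statistics
      sleep 5
  done > nfs_counters.out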

For the Windows operating system, start the PerfMon tool and enable the following counters (a logman sketch for unattended collection follows this list):

  • Disk seconds per read
  • Disk seconds per write
  • Disk seconds per transfer
  • Disk reads per second
  • Disk writes per second
  • Disk transfers per second
  • Disk read queue length
  • Disk write queue length
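
These counters can also be collected unattended with the logman command-line tool that ships with Windows; the counter paths below are the standard PhysicalDisk PerfMon counters corresponding to the list above.  A minimal sketch, with an illustrative collector name and output path:

  logman create counter zfssa_disk -si 5 -o C:\perflogs\zfssa_disk ^
    -c "\PhysicalDisk(*)\Avg. Disk sec/Read" ^
       "\PhysicalDisk(*)\Avg. Disk sec/Write" ^
       "\PhysicalDisk(*)\Avg. Disk sec/Transfer" ^
       "\PhysicalDisk(*)\Disk Reads/sec" ^
       "\PhysicalDisk(*)\Disk Writes/sec" ^
       "\PhysicalDisk(*)\Disk Transfers/sec" ^
       "\PhysicalDisk(*)\Avg. Disk Read Queue Length" ^
       "\PhysicalDisk(*)\Avg. Disk Write Queue Length"
  logman start zfssa_disk
  rem ... run the benchmark ...
  logman stop zfssa_disk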

In the specific use case of Oracle Database access to the ZFS Storage Appliance or ZFS Backup Appliance, the Automatic Workload Repository (AWR) report should be enabled and used during any benchmarking or performance testing effort.  AWR snapshots should be triggered at the beginning and end of the test workload, and the AWR report should be generated from these snapshots (a scripted sketch follows the list below).  Review the ZFS Storage Appliance Analytics and operating system accounting statistics in the context of the workload shown in the AWR load profile, the tablespace I/O statistics, and the storage-related wait events.  Pay specific attention to the following details:

  • If two tests run differently, verify that the logical and physical I/O per transaction is the same in both cases; if it is not, the two tests are testing different things
  • Check the Top 5 Wait Events section to quantify how much time the database is waiting on I/O compared to other resources; in the ideal storage benchmark, most of the wait should be on I/O
  • Correlate the time spent in I/O wait events with the physical I/O workload; if a system runs slower and the I/O wait events are shorter, there is probably a non-storage bottleneck; if you increase the load on the system and the I/O wait events increase, you are pushing the storage
  • The physical I/O shown in the Load Profile indicates physical blocks read - this does not include coalescing associated with multi-block reads; check the Tablespace I/O statistics to identify multi-block reads
  • The physical writes in the AWR report do not include mirroring at the ASM layer - if you have ASM or another LVM technology performing mirroring, you need to track that at the operating system level
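
The snapshot-and-report cycle can be scripted around the benchmark.  A minimal sketch from a shell on the database server, assuming SYSDBA access; DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT and the awrrpt.sql script are the standard AWR interfaces:

  # Snapshot immediately before the test workload
  echo "EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;" | sqlplus -s / as sysdba

  # ... run the benchmark workload here ...

  # Closing snapshot, then generate the report from the two snapshots
  echo "EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;" | sqlplus -s / as sysdba
  sqlplus / as sysdba "@?/rdbms/admin/awrrpt.sql"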

The Oracle Database documentation specific to your release, available at docs.oracle.com, contains descriptions of all of the wait events shown in the AWR report.  In the case of Oracle 11g, you can find the descriptions at this link: http://docs.oracle.com/cd/B28359_01/server.111/b28320/waitevents003.htm#BGGIBDJI.

By comparing the workload and response times reported by the test application, the database (if used), the operating system, and the storage system, bottlenecks can be quickly and accurately identified.  In practical systems, bottlenecks can be created by software bugs, misconfigured software, hardware faults, and misconfigured hardware.  By carefully instrumenting the system and analyzing the results, performance problems can be classified and solved.  In general, system throughput may be limited by several factors:

  • Hardware saturation: such as an insufficient amount of disk drives to support the target throughput
  • Supply side bottlenecks: such as configuration mistakes that limit the number of pending I/O available for the storage system to process
  • Hardware faults: such as broken disk drives or network infrastructure that lead to executing error-handling code and increased response times
  • Software bugs: such as inefficient processing that leads to poor throughput or poor hardware utilization

Reaching hardware saturation is an important goal of any benchmarking activity.  Network, CPU, and storage media all have well-defined limits on how much throughput they can process, and hitting those limits usually indicates a well-tuned system.  If the hardware is not saturated, the problem limiting throughput may be related to software processing on the hardware.  Troubleshooting this type of issue usually requires drawing a software block diagram of the components executed during the test, instrumenting those components, identifying which component is holding up the work, and finally reconfiguring or fixing that component.  This exercise is most easily undertaken by starting with the software/hardware block diagram of the system.

Supply-side bottlenecks show up when doubling the workload at the application leaves throughput unchanged, increases application response time, and leaves storage response time the same.  If there were no supply-side bottleneck, doubling the number of I/Os the application queues to saturated storage should roughly double the time it takes for the storage to respond to each I/O (by Little's Law, throughput equals the number of outstanding I/Os divided by the response time, so fixed throughput with twice the outstanding I/Os implies twice the response time).  If the storage response time does not increase, then something between the application and the storage bottlenecked the I/O.  This is a common problem when benchmarking systems, and it requires expert-level attention to the details of the system.

Hardware faults, while rare, are a possibility in any real system.  Always confirm that there are no errors in the operating system log or the network driver status, and that no faults are reported in the storage system's status pages.

Software bugs that are not apparent under lightly loaded conditions can show up under extreme load.  These include race conditions that cause I/O to hang and processing problems that cause I/O to appear to stall.  In cases where there are no obvious hardware faults, saturation points, or bottlenecks, pay careful attention to the application code.  In the context of the ZFSSA, if you hit a case where doubling the application workload does not double the front-end protocol response time, and the operating system reports the same response time as the ZFSSA, then you may have an application-level software bug.


Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.