Sun Microsystems, Inc.  Sun System Handbook - ISO 4.1 October 2012 Internal/Partner Edition

Asset ID: 1-71-1472716.1
Update Date: 2012-08-16
Keywords:

Solution Type: Technical Instruction

Solution 1472716.1: Sun ZFS Storage Appliance: How to set up and use Analytics for benchmarking applications


Related Items
  • Sun ZFS Backup Appliance
  • Sun ZFS Storage 7120
  • Sun ZFS Storage 7420
  • Sun ZFS Storage 7320
Related Categories
  • PLA-Support>Sun Systems>DISK>NAS>SN-DK: 7xxx NAS


This document describes which ZFS Storage Appliance Analytics to enable in performance-related benchmarks or proofs of concept (POC) and how to interpret the results of those benchmarks.  It also references the client-side accounting tools that should be used to monitor the client system during benchmarking or performance testing.

In this Document
Goal
Fix


Applies to:

Sun ZFS Storage 7420 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7320 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Storage 7120 - Version Not Applicable to Not Applicable [Release N/A]
Sun ZFS Backup Appliance - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.

Goal

This document provides the reader with the minimum level of instrumentation needed to execute performance tests and benchmarks with the ZFS Storage Appliance or ZFS Backup Appliance.

Fix

Detailed performance monitoring of the ZFS Storage Appliance for general use cases can be accomplished by first enabling Advanced Analytics (Configuration->Preferences, check the Enable Advanced Analytics box) and then enabling the following Analytics (these can also be created from the appliance CLI; a sketch follows this list):

  • CPU: CPUs broken down by percent utilization
  • CPU: Percent utilization broken down by CPU mode
  • Cache: ARC accesses broken down by hit/miss
  • Cache: ARC adaptive parameter
  • Cache: ARC size
  • Cache: L2ARC accesses broken down by hit/miss
  • Cache: L2ARC size
  • Disk: Average number of I/O operations broken down by state
  • Disk: disks broken down by percent utilization
  • Disk: I/O bytes broken down by type
  • Disk: I/O bytes broken down by disk
  • Disk: I/O operations broken down by type
    • Drill down on reads by latency
    • Drill down on writes by latency
  • Disk: percent utilization broken down by disk
  • Add data set "Disks with utilization >95% broken down by disk"
  • Disk: ZFS logical I/O bytes broken down by type
  • Disk: ZFS logical I/O operations broken down by type
  • Memory: kernel memory in use broken down by kmem_cache
  • Network: device bytes broken down by device
  • Network: device bytes broken down by direction
  • Network: interface bytes broken down by interface
  • Network: interface bytes broken down by direction
  • Protocol: Fibre Channel bytes broken down by type
  • Protocol: SRP, iSCSI, and Fibre Channel operations broken down by type
    • Drill down on reads by size
    • Drill down on reads by latency: turn on 5% filter and pick the mode (most frequently occurring value) to estimate the average
    • Drill down on writes by size
    • Drill down on writes by latency: turn on 5% filter and pick the mode (most frequently occurring value) to estimate the average
  • Protocol: SMB, NFSv[3,4] operations broken down by type
    • Drill down on reads by size
    • Drill down on reads by latency: turn on 5% filter and pick the mode (most frequently occurring value) to estimate the average
    • Drill down on writes by size
    • Drill down on writes by latency: turn on 5% filter and pick the mode (most frequently occurring value) to estimate the average
    • Drill down on commits by latency: turn on 5% filter and pick the mode (most frequently occurring value) to estimate the average
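
The statistics above can also be enabled from the appliance CLI rather than the BUI.  The following is a minimal sketch, not a complete list: the prompt is illustrative, and the three dataset names shown (cpu.utilization, arc.accesses[hit/miss], io.ops[op]) follow the CLI naming convention for the corresponding BUI statistics.

  zfssa:> analytics datasets
  zfssa:analytics datasets> create cpu.utilization
  zfssa:analytics datasets> create arc.accesses[hit/miss]
  zfssa:analytics datasets> create io.ops[op]
  zfssa:analytics datasets> show

The show command lists the datasets currently being collected and archived.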

Details for interpreting each accounting statistic are available from the ZFS Storage Appliance and ZFS Backup Appliance browser user interface (BUI) help menu: Help->Analytics->Statistics.

Instrumentation on the client operating system accessing the ZFS Storage Appliance or ZFS Backup Appliance should also be enabled and recorded during performance evaluations or benchmarking exercises.

For Solaris, begin with the following instrumentation (note that the 5-second interval is a good first suggestion, but specific circumstances may dictate a different value); a wrapper script sketch follows this list:

  • Disk or NFS file system access: iostat -xnzcCMT d 5
  • CPU consumption: mpstat 5
  • Network consumption: netstat -i -I <interface> 5
  • TCP send/receive queue: netstat -a or netstat -a | grep nfsd
  • Locks: lockstat -A sleep 5
  • Hot kernel stacks: dtrace -n 'profile-1001 { @[stack()] = count() }'
  • Hot user stacks: dtrace -n 'profile-1001 { @[ustack()] = count() }'
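
To keep the client-side data aligned with the appliance Analytics, run these tools for the duration of the test and save their output.  A minimal shell sketch, assuming an illustrative ./perflogs output directory, a 10-minute test window, and a placeholder nxge0 network interface:

  #!/bin/sh
  # Capture Solaris client-side statistics for the duration of a benchmark.
  LOGDIR=./perflogs          # assumed output directory
  DURATION=600               # seconds; match your test window
  mkdir -p $LOGDIR

  iostat -xnzcCMT d 5 > $LOGDIR/iostat.out &
  IOSTAT=$!
  mpstat 5 > $LOGDIR/mpstat.out &
  MPSTAT=$!
  netstat -i -I nxge0 5 > $LOGDIR/netstat.out &   # substitute your interface
  NETSTAT=$!

  sleep $DURATION
  kill $IOSTAT $MPSTAT $NETSTAT                   # stop the collectors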

For the Linux operating system, begin with the following instrumentation (a snapshot loop sketch follows this list):

  • Disk (block) access: iostat -x 5
  • NFS access: nfsstat 5
  • RPC statistics: mountstats 5
  • CPU consumption: mpstat -P ALL 5
  • Network interface traffic: sar -n DEV 5
  • TCP send/receive queue: netstat -a 5 or netstat -a 5 | grep nfs
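
Note that nfsstat and mountstats report cumulative counters, so on some versions it is more reliable to snapshot them on a fixed cadence than to rely on an interval argument.  A minimal sketch, assuming an illustrative NFS mount point of /mnt/zfssa:

  #!/bin/sh
  # Snapshot cumulative NFS/RPC counters every 5 seconds with timestamps.
  MOUNT=/mnt/zfssa           # substitute your NFS mount point
  while true; do
      date
      nfsstat -c             # client-side NFS call counts
      mountstats $MOUNT      # per-mount RPC statistics
      sleep 5
  done > nfs_counters.out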

For the Windows operating system, start the PerfMon tool and enable the following counters (a logman sketch for unattended collection follows this list):

  • Disk seconds per read
  • Disk seconds per write
  • Disk seconds per transfer
  • Disk reads per second
  • Disk writes per second
  • Disk transfers per second
  • Disk read queue length
  • Disk write queue length
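
These counters can also be collected unattended with the logman command-line tool that ships with Windows; the counter paths below are the standard PhysicalDisk PerfMon counters corresponding to the list above.  A minimal sketch, with an illustrative collector name and output path:

  logman create counter zfssa_disk -si 5 -o C:\perflogs\zfssa_disk ^
    -c "\PhysicalDisk(*)\Avg. Disk sec/Read" ^
       "\PhysicalDisk(*)\Avg. Disk sec/Write" ^
       "\PhysicalDisk(*)\Avg. Disk sec/Transfer" ^
       "\PhysicalDisk(*)\Disk Reads/sec" ^
       "\PhysicalDisk(*)\Disk Writes/sec" ^
       "\PhysicalDisk(*)\Disk Transfers/sec" ^
       "\PhysicalDisk(*)\Avg. Disk Read Queue Length" ^
       "\PhysicalDisk(*)\Avg. Disk Write Queue Length"
  logman start zfssa_disk
  rem ... run the benchmark ...
  logman stop zfssa_disk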

In the specific use case of Oracle Database access to the ZFS Storage Appliance or ZFS Backup Appliance, the Automatic Workload Repository (AWR) report should be enabled and used during any benchmarking or performance testing effort.  AWR snapshots should be triggered at the beginning and end of the test workload, and the AWR report should be generated from these snapshots (a scripted sketch follows the list below).  Review the ZFS Storage Appliance Analytics and operating system accounting statistics in the context of the workload shown in the AWR load profile, the tablespace I/O statistics, and the storage-related wait events.  Pay specific attention to the following details:

  • If two tests run differently, verify that the logical and physical I/O per transaction is the same in both cases; if it is not, the two tests are testing different things
  • Check the Top 5 Wait Events section to quantify how much time the database is waiting on I/O compared to other resources; in the ideal storage benchmark, most of the wait should be on I/O
  • Correlate the time spent in I/O wait events with the physical I/O workload; if a system runs slower and the I/O wait events are shorter, there is probably a non-storage bottleneck; if you increase the load on the system and the I/O wait events increase, you are pushing the storage
  • The physical I/O shown in the Load Profile indicates physical blocks read - this does not include coalescing associated with multi-block reads; check the Tablespace I/O statistics to identify multi-block reads
  • The physical writes in the AWR report do not include mirroring at the ASM layer - if you have ASM or another LVM technology performing mirroring, you need to track that at the operating system level
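
The snapshot-and-report cycle can be scripted around the benchmark.  A minimal sketch from a shell on the database server, assuming SYSDBA access; DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT and the awrrpt.sql script are the standard AWR interfaces:

  # Snapshot immediately before the test workload
  echo "EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;" | sqlplus -s / as sysdba

  # ... run the benchmark workload here ...

  # Closing snapshot, then generate the report from the two snapshots
  echo "EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;" | sqlplus -s / as sysdba
  sqlplus / as sysdba "@?/rdbms/admin/awrrpt.sql"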

The Oracle Database documentation specific to your release, available at docs.oracle.com, contains descriptions of all of the wait events shown in the AWR report.  In the case of Oracle 11g, you can find the descriptions at this link: http://docs.oracle.com/cd/B28359_01/server.111/b28320/waitevents003.htm#BGGIBDJI.

By comparing the workload and response times reported by the test application, the database (if used), the operating system, and the storage system, bottlenecks can be quickly and accurately identified.  In practical systems, bottlenecks can be created by software bugs, misconfigured software, hardware faults, and misconfigured hardware.  By carefully instrumenting the system and analyzing the results, performance problems can be classified and solved.  In general, system throughput may be limited by several factors:

  • Hardware saturation: such as an insufficient amount of disk drives to support the target throughput
  • Supply side bottlenecks: such as configuration mistakes that limit the number of pending I/O available for the storage system to process
  • Hardware faults: such as broken disk drives or network infrastructure that lead to executing error-handling code and increased response times
  • Software bugs: such as inefficient processing that leads to poor throughput or poor hardware utilization

Reaching hardware saturation is an important goal of any benchmarking activity.  Network, CPU, and storage media all have well-defined limits on how much throughput they can process, and hitting those limits usually indicates a well-tuned system.  If the hardware is not saturated, the problem limiting throughput may be related to software processing on the hardware.  Troubleshooting this type of issue usually requires drawing a software block diagram of the components executed during the test, instrumenting those components, identifying which component is holding up the work, and finally reconfiguring or fixing that component.  This exercise is most easily undertaken by starting with the software/hardware block diagram of the system.

Supply-side bottlenecks show up when doubling the workload at the application leaves throughput unchanged, increases application response time, and leaves storage response time the same.  If there were no supply-side bottleneck, doubling the number of I/Os the application queues to saturated storage should roughly double the time it takes for the storage to respond to each I/O (by Little's Law, throughput equals the number of outstanding I/Os divided by the response time, so fixed throughput with twice the outstanding I/Os implies twice the response time).  If the storage response time does not increase, then something between the application and the storage bottlenecked the I/O.  This is a common problem when benchmarking systems, and it requires expert-level attention to the details of the system.

Hardware faults, while rare, are a possibility in any real system.  Always confirm that there are no errors in the operating system log or the network driver status, and that no faults are reported in the storage system's status pages.

Software bugs that are not apparent under lightly loaded conditions can show up under extreme load.  These include race conditions that cause I/O to hang and processing problems that cause I/O to appear to stall.  In cases where there are no obvious hardware faults, saturation points, or bottlenecks, pay careful attention to the application code.  In the context of the ZFSSA, if you hit a case where doubling the application workload does not double the front-end protocol response time, and the operating system reports the same response time as the ZFSSA, then you may have an application-level software bug.


Attachments
This solution has no attachment
  Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.