Asset ID: 1-75-1010680.1
Update Date: 2011-05-31
Keywords:
Solution Type: Troubleshooting Sure Solution
1010680.1: Troubleshooting Disk Performance
Related Items
- Sun Storage MultiPack FC Desktop Array
- Sun Storage D1000 Array
- Sun Storage D2 Array
- Sun Storage UniPack Disk Drive
- Sun Storage A3500 SCSI Array
- Sun Storage MultiPack SCSI Desktop Array
- Sun Storage D240 (StorEdge) Media Tray
Related Categories
- GCS>Sun Microsystems>Storage - Disk>Modular Disk - Other
Previously Published As
214750
Applies to:
Sun Storage A3500 SCSI Array
Sun Storage D1000 Array
Sun Storage MultiPack SCSI Desktop Array
Sun Storage D2 Array
Sun Storage D240 (StorEdge) Media Tray
All Platforms
Purpose
This document highlights the steps needed to determine whether an issue is in fact a disk performance issue and, if it is, what data should be collected to help diagnose it.
Note that performance in general is not an easy issue to tackle, because it can be caused by many factors that are not disk related. For example, a disk will only perform as well as the application asks of it, so a slow application may mislead you into thinking that you have a disk performance issue.
The best approach in any troubleshooting method is to eliminate, as far as possible, the factors that can cause performance issues. For example, consider the layers involved in a simple UFS filesystem on a disk:
- Physical disk
- LUN, if it is an array
- Host Bus Adapter (HBA)
- HBA driver
- sd/ssd driver
- ufs driver
- Application performing I/O to the filesystem
These are some of the layers involved in the path of a single I/O, and the list is not exhaustive.
This document assumes that you can read iostat output; refer to the iostat man page for a description of the fields.
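For reference, a typical invocation used throughout this document looks like the following (the interval here is an arbitrary choice for illustration):
# iostat -xnz 5
This reports extended device statistics using logical device names, suppressing idle devices, every 5 seconds.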
Last Review Date
March 10, 2011
Instructions for the Reader
A Troubleshooting Guide is provided to assist
in debugging a specific issue. When possible, diagnostic tools are included in the document
to assist in troubleshooting.
Troubleshooting Details
1. Define the Problem.
Generally, a problem is identified when an application is not performing as expected.
That is a very general statement. The best method is to clearly define what is not performing as expected and then work out whether the expectation is a realistic one. For example, with an Ultra SCSI disk we would not expect a throughput of more than 20 MB/s, which is its theoretical limit. This is what is meant by expectation.
We need a realistic expectation before we can continue. A theoretical value of 20 MB/s for Ultra SCSI is not what we would get in real life, so we need to leave some room for that as well. There is no rule for how much lower the actual throughput will be; it depends on many factors such as I/O size and type. This is left to common sense, and the theoretical value should be used only as a guide.
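As a rough, hypothetical illustration of how I/O size shapes a realistic expectation (the IOPS figure is an assumed ballpark, not a measured value): a disk sustaining about 150 random 8k IOPS delivers only
150 x 8k = 1200 KB/s (about 1.2 MB/s)
which is far below the 20 MB/s bus limit, whereas large sequential I/O can come much closer to that limit.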
We also need to clearly define the problem. For example, a good problem statement may be:
"iostat shows high asvc_t (100 ms) while Oracle write performance is only 3 MB/s (iosize 8k) on the /oradata1 volume; we expect 15 MB/s with a lower asvc_t (<30 ms)."
Note that the above is specific to the type of operation (write), the iosize, and the actual volume experiencing the problem. We also stated the current and expected levels of throughput.
A bad example is:
"Oracle not performing well."
The good problem statement above could be improved further with time-based information, that is, whether the problem has happened since a given date (after a change to xxx) and whether it appears to recur every 'X' days/hours/minutes.
2. Identify the Bottleneck.
This is best done through a process of elimination. In all of the steps below you need to look at and interpret iostat output.
1. Replace the application with something else, such as vxbench or similar. Do not use dd or tar to test performance; that is not what they are designed for. So if you have Oracle running, you need to know what it is doing and try to simulate that with, say, vxbench, vdbench or similar. This will show you whether the application is a probable cause and hence give you a direction to follow.
An example would be an Oracle application performing 8k sequential reads:
# vxbench -m -w read -s -i nthreads=32,iosize=8k,iocount=5000 /dev/rdsk/c1t2d3s2
You can use raw, block, or filesystem devices with vxbench; they can be physical disks or logical devices such as VxVM or SVM volumes. vxbench is available from the Veritas website and is not limited to VxVM devices.
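A corresponding write test could be run in the same way; the sketch below simply swaps the workload type in the read example above, so verify the exact workload keywords against your vxbench version's usage output:
# vxbench -m -w write -s -i nthreads=32,iosize=8k,iocount=5000 /dev/rdsk/c1t2d3s2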
2. If the disk is still performing badly, find out whether individual disks are performing badly or the logical volume is. So if you have a VxVM volume, test the disks that make up this volume individually. See also the vxstat command to examine individual VxVM volume statistics, as in the sketch below.
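For example, a minimal sketch of watching a volume and its underlying disks with vxstat (mydg and oravol are placeholder names; adjust the interval and count as needed):
# vxstat -g mydg -i 5 -c 12 oravol
# vxstat -g mydg -d -i 5 -c 12
The first command samples the volume's statistics twelve times at 5-second intervals; the second reports per-disk statistics for the disk group.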
3. If it is neither of the above, take a look at Solaris itself and see whether it is under-resourced. A full treatment of this is beyond the scope of this document, but the commands below are a starting point.
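As a starting point only (these are standard Solaris utilities, not a complete methodology), overall CPU, memory, and process activity can be checked with:
# vmstat 5
# mpstat 5
# prstat
Sustained high run queues, scan rates, or CPU saturation here point away from the disks and towards a host resource problem.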
4. Also be aware that you may simply be getting all you can out of a particular disk and need to load balance. Everything has limitations, and you need to know them.
Note that none of the steps above is a definitive indicator of an issue on its own, but together they will give us a better idea of what we are looking at.
3. Collect Data.
Now that we know where the problem lies, what do we do?
1. If it is the application, go to the application vendor. Some applications are not designed for speed; utilities such as cp/mv/dd/ufsdump are designed for a specific purpose and will have limitations, because their primary function, not their speed, is what matters.
2. When a disk issue is found, it will need to be investigated and the following data collected:
- A clear description of the concern, such as high %b or high actv, with reference to the iostat output.
- Type of application.
- Type of I/O (e.g. 8k/random/read).
- Sun Explorer output from the host experiencing the issue.
- GUDS output from when the issue is occurring. Reference Document: 1285485.1 GUDS - A Script for Gathering Solaris Performance Data.
- Any vxbench or similar outputs from runs performed.
- iostat output showing the issues of concern.
- Any extractor output from the storage device having the issue.
All of the above data is only useful when it is collected while the issue of concern is occurring; otherwise there is no point in collecting it.
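For example, iostat data is most useful when it is timestamped and captured over an interval during which the problem is visible; a minimal sketch (the interval, count, and output file are arbitrary choices):
# iostat -xnz -T d 2 60 > /var/tmp/iostat.out
This collects 60 samples at 2-second intervals, each preceded by a timestamp.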
A Few iostat Examples:
# iostat -xpnz 1
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
293.1 0.0 37510.5 0.0 0.0 31.7 0.0 108.3 1 100 c0t0d0
293.1 0.0 37510.5 0.0 0.0 31.7 0.0 108.3 1 100 c0t0d0s2
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
294.0 0.0 37632.7 0.0 0.0 31.9 0.0 108.6 0 100 c0t0d0
294.0 0.0 37632.9 0.0 0.0 31.9 0.0 108.6 0 100 c0t0d0s2
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
293.0 0.0 37504.4 0.0 0.0 31.9 0.0 1032.0 0 100 c0t0d0
293.0 0.0 37504.4 0.0 0.0 31.9 0.0 1032.0 0 100 c0t0d0s2
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
294.0 0.0 37631.3 0.0 0.0 31.8 0.0 108.1 1 100 c0t0d0
294.0 0.0 37631.3 0.0 0.0 31.8 0.0 108.1 1 100 c0t0d0s2
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
294.0 0.0 37628.1 0.0 0.0 31.9 0.0 108.6 0 100 c0t0d0
294.0 0.0 37627.5 0.0 0.0 31.9 0.0 108.6 1 100 c0t0d0s2
Notice that the disk above shows a very high asvc_t of 1032.0 ms, but this is only a single spike (i.e. no pattern is found). This can happen, and if it is just a single spike like this it can safely be ignored. What we need to look for here is a pattern of performance degradation, not a single occurrence. In this case the spike has no visible impact on the disk's overall performance.
From the iostat output you will also note that this disk is still performing quite well, with approximately 36 MB/s of read throughput. However, the disk is at 100% busy, so we are close to its limit.
We can also use the information above to determine a few things. For example, we can estimate the iosize using the kr/s and r/s fields:
37628.1/294 = 128k reads.
This is not always accurate, since the reads could be of a variety of sizes, in which case we would not get the right profile of the I/O.
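For example (hypothetical numbers), if half of the reads are 4k and the other half are 256k, the kr/s divided by r/s calculation gives an average of about 130k, even though no individual read is anywhere near that size.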
You will also note that, in general, disks have more than one limitation: throughput of data per second, and IOPS (I/Os per second). Hence, the smaller the I/O, the lower the throughput we should expect; do not expect 36 MB/s for 4k I/O. This is because smaller I/O sizes carry more overhead in total, as there will be more of them for the same amount of data.
Here is a 4k read example for the same disk:
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
6096.1 0.0 24384.6 0.0 0.1 86.3 0.0 14.2 6 100 c0t0d0
6096.6 0.0 24386.4 0.0 0.1 86.3 0.0 14.1 6 100 c0t0d0s2
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
5826.2 0.0 23304.9 0.0 0.0 88.5 0.0 15.2 5 100 c0t0d0
5825.4 0.0 23301.6 0.0 0.0 88.5 0.0 15.2 5 100 c0t0d0s2
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
5947.4 0.0 23789.7 0.0 0.1 81.8 0.0 13.7 5 97 c0t0d0
5947.4 0.0 23789.5 0.0 0.1 81.8 0.0 13.7 5 97 c0t0d0s2
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
5647.3 0.0 22589.2 0.0 0.0 86.6 0.0 15.3 5 100 c0t0d0
5648.2 0.0 22592.6 0.0 0.1 86.6 0.0 15.3 5 100 c0t0d0s2
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
5835.7 0.0 23343.0 0.0 0.0 87.5 0.0 15.0 5 100 c0t0d0
5834.9 0.0 23339.7 0.0 0.1 87.5 0.0 15.0 5 100 c0t0d0s2
The point of the above is that different I/O will show different behavior, and hence it is important to know what the application is doing; this way you can diagnose an issue if there is one.
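Applying the same calculation to the 4k output above: 23343.0/5835.7 = 4k per read, and roughly 5800 reads/s at 4k each is about 23 MB/s, compared with approximately 36 MB/s for the 128k workload on the same disk.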
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.2 3.3 2.0 9.4 0.0 0.1 0.0 300.1 0 2 c0t1d0
0.2 3.3 2.0 9.4 0.0 0.1 0.0 300.1 0 2 c0t1d0s0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.2 3.3 2.0 9.4 0.0 0.1 0.0 600.4 0 2 c0t1d0
0.2 3.3 2.0 9.4 0.0 0.1 0.0 600.4 0 2 c0t1d0s0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.2 3.3 2.0 9.4 0.0 0.1 0.0 194.5 0 2 c0t1d0
0.2 3.3 2.0 9.4 0.0 0.1 0.0 194.5 0 2 c0t1d0s0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.2 3.3 2.0 9.4 0.0 0.1 0.0 340.0 0 2 c0t1d0
0.2 3.3 2.0 9.4 0.0 0.1 0.0 340.0 0 2 c0t1d0s0
The output above shows a disk with low throughput and low %b but a very high asvc_t. This would generally be a concern, as the service time is too high for a healthy disk, and the case would warrant further investigation.
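A quick additional check on such a disk is its error counters, which iostat can also display, for example:
# iostat -En c0t1d0
Non-zero soft, hard, or transport error counts here would strengthen the case for a failing device or path.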
The above are three examples of things that may help in identifying issues. As stated at the beginning, we cannot cover all aspects of performance problems; there are endless examples that could be given. Hence the three steps above are offered as a guide to tackling such issues. Understanding the issue is the first step towards solving it, and common sense will guide you further; if you are stuck, collect the information suggested above and seek assistance.
Conclusion:
Disk performance is a complex subject, and this document neither attempts nor is able to solve all issues. Its main point is a logical approach to problem solving. Understanding and defining the problem clearly will help resolve issues faster, and knowing your application and its limitations will generally take you a long way. Understand that a disk subsystem is made up of many layers, and never assume without proof. The only way to solve an issue is to pinpoint the cause, and finding the bottleneck is the key step towards a solution. This document therefore attempts to guide your thinking on how to tackle these issues.
Attachments
This solution has no attachment