Disk Performance Analysis on Linux

Really Into the Guts of Linux Filesystem Performance

It is possible to find out how hard your disks are running, and some you don't want to run as hard as others. For instance, if you want to save write cycles on your SSD, you can move small, frequent writes onto a RAM-backed filesystem, which many systems seem to do by default these days (with /tmp mounted as tmpfs).
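
(For reference, this is the sort of fstab entry that does it; a minimal sketch, where the size cap is just an arbitrary example:)

# /etc/fstab sketch: keep /tmp in RAM so small, frequent writes skip the SSD;
# size=2G is an illustrative cap, not a recommendation
tmpfs   /tmp   tmpfs   defaults,noatime,size=2G   0   0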

While rebuilding my backup server, I left it running a few large writes overnight to shake out some stability issues. Checking dmesg afterwards is always a good way to do that.

For instance, I can see that the drive on ata14 seems to be wonky:

Nov 16 22:44:13 beavertail kernel: [  692.590756] ata14.00: cmd 61/e0:18:20:0a:00/00:00:00:00:00/40 tag 3 ncq 114688 out
Nov 16 22:44:13 beavertail kernel: [  692.590756]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 16 22:44:13 beavertail kernel: [  692.598694] ata14.00: status: { DRDY }
Nov 16 22:44:14 beavertail kernel: [  693.093364] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 16 22:44:14 beavertail kernel: [  693.118236] ata14.00: configured for UDMA/100
Nov 16 22:44:14 beavertail kernel: [  693.118255] ata14.00: device reported invalid CHS sector 0
Nov 16 22:44:14 beavertail kernel: [  693.118273] ata14: EH complete

Which drive is this? I can see where it came up at boot using sudo dmesg | grep 'ata14' | less:

[    1.061214] ata14: SATA max UDMA/133 abar m2048@0xfea10000 port 0xfea10280 irq 49
[    1.556891] ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    1.589329] ata14.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100

I cannot pass ata14 as an argument to programs, so I will see if I can match a serial number:

root@beavertail /dev/disk/by-id
 > find -type l  -a \( -name 'wwn-*' -o -name '*-part*' \) -prune -o -printf "%l\t%f\n" | sort
	.
../../sda	ata-Hitachi_HTS547575A9E384_J2190059G9PDPC
../../sdb	ata-Hitachi_HTS547575A9E384_J2190059G9SBBC
../../sdc	ata-Hitachi_HTS547575A9E384_J2190059G6GMGC
../../sdd	ata-Hitachi_HTS547575A9E384_J2190059G95REC
../../sde	ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW
../../sdf	ata-Hitachi_HTS547575A9E384_J2190059G9LH9C
../../sdg	ata-Hitachi_HTS547575A9E384_J2190059G95JPC
../../sdh	ata-Hitachi_HTS547575A9E384_J2190059G6LUDC
../../sdi	ata-Hitachi_HTS547575A9E384_J2190059G5PXYC
../../sdj	ata-TOSHIBA_MQ01ABD050_933PTT17T
../../sdk	ata-TOSHIBA_MQ01ABD050_X3EJSVUOS
../../sdl	ata-TOSHIBA_MQ01ABD050_X3EJSVUNS
../../sdm	ata-TOSHIBA_MQ01ABD050_933PTT11T
../../sdn	ata-TOSHIBA_MQ01ABD050_933PTT15T
../../sdo	ata-TOSHIBA_MQ01ABD050_933PTT12T
../../sdp	ata-TOSHIBA_MQ01ABD050_933PTT13T
../../sdq	ata-TOSHIBA_MQ01ABD050_933PTT14T
../../sdr	ata-TOSHIBA_MQ01ABD050_933PTT0ZT
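
(As an aside, a newer util-linux can produce much the same model/serial mapping in one shot with lsblk; a sketch, assuming the MODEL and SERIAL output columns exist in your version:)

# whole disks only (-d), with just the columns we care about
lsblk -d -o NAME,MODEL,SERIAL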

That listing doesn't tell me which device is on ata14, so we search dmesg again, this time grepping for the ATA-8 identification lines:

[    1.538182] ata4.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    1.538213] ata2.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    1.538244] ata3.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    1.538565] ata5.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    1.563034] ata6.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.584336] ata12.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.589329] ata14.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.590477] ata11.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.590713] ata13.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.618162] ata15.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.621146] ata18.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.623070] ata16.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.625311] ata17.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    3.267833] ata7.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    5.469619] ata8.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    7.671669] ata9.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    9.890091] ata10.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133

I'm beginning to notice that I should have stuck with the Hitachi drives; they are going to be faster (UDMA/133 versus UDMA/100 on the Toshibas). Oh well, I'm doing the ZFS equivalent of RAID 50, striping across five raidz1 groups, and that alone is pretty good.

But where is this drive? Time to paw through /sys. "Paw" is about right; I don't have much of the /sys directory memorized, but with 'less' and 'ls' I bet I can find it… there we go: ls -l /sys/class/block/ | grep ata14:

root@beavertail /var/log
 > ls -l /sys/class/block/ | grep ata14
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sdn -> ../../devices/pci0000:00/0000:00:02.0/0000:01:00.0/ata14/host13/target13:0:0/13:0:0:0/block/sdn
lrwxrwxrwx 1 root root 0 Nov 17 06:43 sdn1 -> ../../devices/pci0000:00/0000:00:02.0/0000:01:00.0/ata14/host13/target13:0:0/13:0:0:0/block/sdn/sdn1
lrwxrwxrwx 1 root root 0 Nov 17 06:43 sdn9 -> ../../devices/pci0000:00/0000:00:02.0/0000:01:00.0/ata14/host13/target13:0:0/13:0:0:0/block/sdn/sdn9
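
(If I had to do this for more than one port, a small loop over the same /sys symlinks would print the whole mapping at once; a rough sketch:)

# resolve each block-device symlink and pull out its ataN segment
for dev in /sys/class/block/sd?; do
    printf '%s\t%s\n' "$(readlink -f "$dev" | grep -o 'ata[0-9]*' | head -1)" "${dev##*/}"
done | sort -V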

Now I can use hdparm on this guy, /dev/sdn:

root@beavertail /var/log
 > hdparm -I /dev/sdn | head

/dev/sdn:

ATA device, with non-removable media
        Model Number:       TOSHIBA MQ01ABD050                      
        Serial Number:      933PTT15T
        Firmware Revision:  AX001A  
        Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6

Now that I have the serial number for that guy, I will be able to find him.
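
(Given the command timeouts in dmesg earlier, it is probably also worth a SMART check on that drive before blaming cabling or the controller; a sketch, assuming smartmontools is installed:)

smartctl -a /dev/sdn           # dump SMART attributes, the error log, and the self-test log
smartctl -t short /dev/sdn     # queue a short self-test; re-run -a afterwards to read the result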

How is my new storage array doing in general? Well, iostat gives me some indication that the drives are being written to somewhat equally:

jreynolds@beavertail ~
 > iostat -N -x 2 2
Linux 3.11.0-12-generic (beavertail) 	11/17/2013 	_x86_64_	(2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.52    0.00   12.75   46.82    0.00   39.91

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.01     0.00    0.06   59.85     0.28  5350.97   178.65     6.73  112.27   23.92  112.36  13.23  79.25
sdb               0.01     0.00    0.05   59.86     0.27  5350.94   178.63     6.69  111.66   25.89  111.74  13.22  79.19
sdc               0.01     0.00    0.07   67.59     0.33  6029.96   178.27     7.05  104.27   21.14  104.35  12.08  81.72
sdd               0.01     0.00    0.13   65.75     0.59  5953.08   180.73     7.11  107.97   35.70  108.12  12.46  82.08
sde               0.02     1.13    3.29  319.92    20.22 39454.49   244.27     1.49    4.60    2.80    4.62   0.52  16.79
sdf               0.01     0.00    0.12   70.63     0.54  5953.20   168.30     0.70    9.91   16.08    9.90   2.17  15.38
sdg               0.01     0.00    0.06   70.40     0.30  5953.25   169.00     0.76   10.84   18.69   10.83   2.31  16.28
sdi               0.01     0.00    0.14   71.55     0.63  6030.08   168.23     0.82   11.43   23.98   11.41   2.34  16.76
sdj               0.02     0.00    0.14   75.14     0.66  5497.69   146.09     0.21    2.75    9.36    2.74   0.94   7.07
sdk               0.02     0.00    0.03    0.00     0.21     0.04    18.43     0.00    3.04    2.78    8.32   2.40   0.01
sdm               0.04     0.00    0.08   68.44     0.50  5477.06   159.90     0.27    3.98    8.62    3.97   0.98   6.72
sdo               0.04     0.00    0.07   66.84     0.49  5477.08   163.72     0.29    4.35    8.03    4.34   1.04   6.94
sdn               0.02     0.00    0.08   64.80     0.43  5497.76   169.51     0.41    6.35    8.18    6.35   1.31   8.51
sdl               0.02     0.00    0.03    0.00     0.21     0.04    18.43     0.00    3.37    3.10    8.74   2.52   0.01
sdp               0.04     0.00    0.07   66.85     0.49  5477.05   163.71     0.29    4.34    8.13    4.34   1.04   6.93
sdq               0.04     0.00    0.08   67.91     0.53  5497.71   161.72     0.29    4.27    8.44    4.27   1.02   6.92
sdh               0.01     0.00    0.13   72.34     0.58  6030.12   166.44     0.72    9.90   16.88    9.89   2.12  15.38
sdr               0.04     0.00    0.07   66.92     0.48  5351.01   159.77     0.29    4.37    7.13    4.36   1.12   7.53

From that I can tell that /dev/sde is getting the shit kicked out of it, with roughly 6x the writes the other drives are getting. What the hell? Let's check zpool:

root@beavertail ~
 > zpool iostat -v 30
                                                         capacity     operations    bandwidth
pool                                                  alloc   free   read  write   read  write
----------------------------------------------------  -----  -----  -----  -----  -----  -----
tank                                                  2.15T  5.99T      0    608    204  58.8M
  raidz1                                               470G  1.57T      0    130     34  12.7M
    ata-Hitachi_HTS547575A9E384_J2190059G5PXYC-part1      -      -      0     73      0  6.32M
    ata-Hitachi_HTS547575A9E384_J2190059G6GMGC-part1      -      -      0     70      0  6.43M
    ata-Hitachi_HTS547575A9E384_J2190059G6LUDC-part1      -      -      0     73    273  6.32M
  raidz1                                               464G  1.58T      0    128     85  12.7M
    ata-Hitachi_HTS547575A9E384_J2190059G95JPC-part1      -      -      0     72      0  6.32M
    ata-Hitachi_HTS547575A9E384_J2190059G95REC-part1      -      -      0     68    136  6.42M
    ata-Hitachi_HTS547575A9E384_J2190059G9LH9C-part1      -      -      0     72    273  6.32M
  raidz1                                               414G   978G      0    116     68  11.0M
    ata-Hitachi_HTS547575A9E384_J2190059G9PDPC-part1      -      -      0     61    136  5.56M
    ata-Hitachi_HTS547575A9E384_J2190059G9SBBC-part1      -      -      0     60    136  5.54M
    ata-TOSHIBA_MQ01ABD050_933PTT0ZT                      -      -      0     66      0  5.45M
  raidz1                                               425G   967G      0    115      0  11.1M
    ata-TOSHIBA_MQ01ABD050_933PTT11T                      -      -      0     70      0  5.61M
    ata-TOSHIBA_MQ01ABD050_933PTT12T                      -      -      0     66      0  5.61M
    ata-TOSHIBA_MQ01ABD050_933PTT13T                      -      -      0     67      0  5.61M
  raidz1                                               426G   966G      0    117     17  11.2M
    ata-TOSHIBA_MQ01ABD050_933PTT14T                      -      -      0     69      0  5.66M
    ata-TOSHIBA_MQ01ABD050_933PTT15T                      -      -      0     66      0  5.66M
    ata-TOSHIBA_MQ01ABD050_933PTT17T                      -      -      0     78    136  5.66M
logs                                                      -      -      -      -      -      -
  ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part5          128K  1.98G      0      0      0      0
cache                                                     -      -      -      -      -      -
  ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part6         64.1G  14.0M      0    342  4.80K  41.9M
----------------------------------------------------  -----  -----  -----  -----  -----  -----

That guy is my L2ARC (level 2 cache) SSD partition on my OCZ drive. (The astute will notice that I am only running one SSD in here, partitioned for both the log and the cache.) Since this is a backup pool, I doubt I actually need a cache on it, yet it is handling something like 50% of the write activity of the whole pool for no good reason. I'll disable him later.
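
(When I do, it should just be a one-liner, since cache vdevs can be removed from a live pool; a sketch using the device name from the listing above:)

zpool remove tank ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part6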

But (and this is quite easily seen using dstat as well) I have four drives just chugging away, doing more than their share. These guys are busier and I don't know why, unless it has to do with extra parity writes.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00   64.50     0.00  5354.00   166.02     5.62   88.93    0.00   88.93  11.63  75.00
sdb               0.00     0.00    0.00   63.50     0.00  5416.00   170.58     6.55  108.13    0.00  108.13  12.94  82.20
sdc               0.00     0.00    0.50   73.00     2.00  7144.00   194.45     7.33  102.07   20.00  102.63  11.40  83.80
sdd               0.00     0.00    0.00   63.00     0.00  7044.00   223.62     7.67  126.29    0.00  126.29  13.43  84.60

What kind of gibberish am I looking at? Trim it down to the request-size, queue-depth, await, and utilization columns:

Device:          avgrq-sz avgqu-sz   await  w_await   %util
sda                166.02     5.62   88.93    88.93   75.00
sdb                170.58     6.55  108.13   108.13   82.20
sdc                194.45     7.33  102.07   102.63   83.80
sdd                223.62     7.67  126.29   126.29   84.60
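
(That trimming can also be done on the fly; a rough awk sketch, with the field numbers matching the 14-column iostat -x layout shown above:)

iostat -N -x 2 2 | awk '/^Device|^sd/ { printf "%-10s %9s %9s %8s %9s %7s\n", $1, $8, $9, $10, $12, $14 }'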

We can see that the average request size, the queue depth, and the outstanding wait times on those devices are way
higher than on the rest of the drives. However, I have five raidz1 groups in the pool, so why would exactly four drives be
this poorly conditioned? Either ZFS has decided that it needs to write more bytes there, or, as I suspect, those four drives all sit behind a common controller that is slower.

A common controller? Let's check that /sys directory again. Yeah, with variations of ls -l /sys/class/block/ | egrep 'sd[a-d] ' I can definitely
see those four drives share a common path on the PCI bus: the bridge at 00:14.4, leading to the device at 03:05.0.

jreynolds@beavertail ~
 > ls -l /sys/class/block/ | egrep 'sd[a-d] '
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sda -> ../../devices/pci0000:00/0000:00:14.4/0000:03:05.0/ata7/host1/target1:0:0/1:0:0:0/block/sda
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sdb -> ../../devices/pci0000:00/0000:00:14.4/0000:03:05.0/ata8/host2/target2:0:0/2:0:0:0/block/sdb
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sdc -> ../../devices/pci0000:00/0000:00:14.4/0000:03:05.0/ata9/host3/target3:0:0/3:0:0:0/block/sdc
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sdd -> ../../devices/pci0000:00/0000:00:14.4/0000:03:05.0/ata10/host5/target5:0:0/5:0:0:0/block/sdd

However, using dmidecode, all I can tell at the moment is that I have four slots and that their addresses do not match, so this is unlikely to be a motherboard-connected item. Is it something connected over PCI?
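
(For reference, this is roughly the kind of poking that question involves; a sketch — note that the 'Bus Address' field in dmidecode's slot output only appears with newer BIOS/dmidecode combinations, so it may simply be missing:)

# list the physical expansion slots and, where reported, the PCI address wired to each
sudo dmidecode -t slot | egrep -i 'Designation|Type:|Bus Address'
# then see what actually lives at the suspect address from the /sys paths
sudo lspci -v -s 03:05.0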

 > lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h Processor Root Complex
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Sumo [Radeon HD 6410D]
00:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 12h Processor Root Port
00:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 12h Processor Root Port
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 40)
00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:12.2 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 11)
00:13.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:13.2 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 11)
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 14)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 11)
00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] FCH PCI Bridge (rev 40)
00:14.5 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:15.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Hudson PCI to PCI bridge (PCIE port 0)
00:15.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Hudson PCI to PCI bridge (PCIE port 1)
00:16.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:16.2 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 11)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 0 (rev 43)
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 6
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 5
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 7
01:00.0 SATA controller: Marvell Technology Group Ltd. Device 9235 (rev 10)
02:00.0 SATA controller: Marvell Technology Group Ltd. Device 9235 (rev 10)
03:05.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)
04:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)

DING! Found it: that Silicon Image SATA controller. It is likely the slow link in all of this, and the address matches what the /sys paths showed: 03:05.0

03:05.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)
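
(To be thorough, the bus tree view lines this up with the bridge the /sys symlinks pointed at; a sketch:)

lspci -tv    # tree view: 03:05.0 (the SiI 3124) should appear under the 00:14.4 PCI bridge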