
Really Into the Guts of Linux Filesystem Performance
It is possible to find out how hard your disks are running. Some you don't want to run as hard as others. For instance, if you wanted to save on write cycles for your SSD, you could move small, frequent writes onto a RAM-backed filesystem, and that seems to be done by default on many systems these days (with /tmp getting mounted as tmpfs).
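If you want to check whether your own box does this, a minimal sketch (the fstab line is just an example entry, not taken from this machine):

# is /tmp already a tmpfs on this system?
findmnt /tmp
# ...or, without findmnt:
mount | grep ' /tmp '

# if it isn't, an fstab entry along these lines would put it in RAM
# (the size= value here is an arbitrary example)
# tmpfs   /tmp   tmpfs   defaults,noatime,size=2G   0  0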
In the process of rebuilding my backup server, I left it running a few large writes overnight to check for some stability issues. Checking dmesg afterwards is always a good way to do that.
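If you would rather catch trouble while the writes are still going, something like this works (a sketch; the kern.log path is the Debian/Ubuntu default and may differ elsewhere):

# follow the kernel log live during the overnight run
tail -f /var/log/kern.log
# or just re-check the ring buffer every minute
watch -n 60 'dmesg | tail -n 20'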
For instance, I see that drive 14 seems to be wonky:
Nov 16 22:44:13 beavertail kernel: [  692.590756] ata14.00: cmd 61/e0:18:20:0a:00/00:00:00:00:00/40 tag 3 ncq 114688 out
Nov 16 22:44:13 beavertail kernel: [  692.590756]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 16 22:44:13 beavertail kernel: [  692.598694] ata14.00: status: { DRDY }
Nov 16 22:44:14 beavertail kernel: [  693.093364] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 16 22:44:14 beavertail kernel: [  693.118236] ata14.00: configured for UDMA/100
Nov 16 22:44:14 beavertail kernel: [  693.118255] ata14.00: device reported invalid CHS sector 0
Nov 16 22:44:14 beavertail kernel: [  693.118273] ata14: EH complete
Which drive is this? Oh, I see where it came up, using sudo dmesg | grep 'ata14' | less:
[    1.061214] ata14: SATA max UDMA/133 abar m2048@0xfea10000 port 0xfea10280 irq 49
[    1.556891] ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    1.589329] ata14.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
I cannot pass ata14 as an argument to programs, so I will see if I can match a serial number:
root@beavertail /dev/disk/by-id > find -type l -a \( -name 'wwn-*' -o -name '*-part*' \) -prune -o -printf "%l\t%f\n" | sort
.
../../sda	ata-Hitachi_HTS547575A9E384_J2190059G9PDPC
../../sdb	ata-Hitachi_HTS547575A9E384_J2190059G9SBBC
../../sdc	ata-Hitachi_HTS547575A9E384_J2190059G6GMGC
../../sdd	ata-Hitachi_HTS547575A9E384_J2190059G95REC
../../sde	ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW
../../sdf	ata-Hitachi_HTS547575A9E384_J2190059G9LH9C
../../sdg	ata-Hitachi_HTS547575A9E384_J2190059G95JPC
../../sdh	ata-Hitachi_HTS547575A9E384_J2190059G6LUDC
../../sdi	ata-Hitachi_HTS547575A9E384_J2190059G5PXYC
../../sdj	ata-TOSHIBA_MQ01ABD050_933PTT17T
../../sdk	ata-TOSHIBA_MQ01ABD050_X3EJSVUOS
../../sdl	ata-TOSHIBA_MQ01ABD050_X3EJSVUNS
../../sdm	ata-TOSHIBA_MQ01ABD050_933PTT11T
../../sdn	ata-TOSHIBA_MQ01ABD050_933PTT15T
../../sdo	ata-TOSHIBA_MQ01ABD050_933PTT12T
../../sdp	ata-TOSHIBA_MQ01ABD050_933PTT13T
../../sdq	ata-TOSHIBA_MQ01ABD050_933PTT14T
../../sdr	ata-TOSHIBA_MQ01ABD050_933PTT0ZT
Nope, nothing there tells me which device sits on ata14, so we search dmesg again, grepping for 'ATA-8' to list every port:
[    1.538182] ata4.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    1.538213] ata2.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    1.538244] ata3.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    1.538565] ata5.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    1.563034] ata6.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.584336] ata12.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.589329] ata14.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.590477] ata11.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.590713] ata13.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.618162] ata15.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.621146] ata18.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.623070] ata16.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.625311] ata17.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    3.267833] ata7.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    5.469619] ata8.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    7.671669] ata9.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    9.890091] ata10.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
We begin to notice that I should have stuck with the Hitachi drives; they negotiate UDMA/133 and are going to be faster. Oh well. I'm doing the equivalent of RAID 50, striping across five groups, and that right there is pretty good.
But where is this drive? Time to paw through /sys. "Paw" is about right: I don't have much of the /sys directory memorized, but with less and ls I bet… there we go: ls -l /sys/class/block/ | grep ata14:
root@beavertail /var/log > ls -l /sys/class/block/ | grep ata14
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sdn -> ../../devices/pci0000:00/0000:00:02.0/0000:01:00.0/ata14/host13/target13:0:0/13:0:0:0/block/sdn
lrwxrwxrwx 1 root root 0 Nov 17 06:43 sdn1 -> ../../devices/pci0000:00/0000:00:02.0/0000:01:00.0/ata14/host13/target13:0:0/13:0:0:0/block/sdn/sdn1
lrwxrwxrwx 1 root root 0 Nov 17 06:43 sdn9 -> ../../devices/pci0000:00/0000:00:02.0/0000:01:00.0/ata14/host13/target13:0:0/13:0:0:0/block/sdn/sdn9
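If you end up doing this for more than one port, the same sysfs layout lends itself to a little loop (a sketch, assuming the ataN component shows up in each device path the way it does above):

# print each whole-disk device alongside the ATA port in its sysfs path
for d in /sys/class/block/sd?; do
    printf '%s\t%s\n' "$(basename "$d")" "$(readlink -f "$d" | grep -o 'ata[0-9]*')"
done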
Now I can use hdparm on this guy:
root@beavertail /var/log > hdparm -I /dev/sdn | head

/dev/sdn:

ATA device, with non-removable media
	Model Number:       TOSHIBA MQ01ABD050
	Serial Number:      933PTT15T
	Firmware Revision:  AX001A
	Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6
Now that I have the serial number for that drive, I will be able to find it.
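It also closes the loop with the /dev/disk/by-id listing from earlier: the by-id names embed the model and serial, so with a serial in hand you can get back to the device node directly (a sketch using the serial shown above):

# the by-id symlink resolves straight back to /dev/sdn
readlink -f /dev/disk/by-id/ata-TOSHIBA_MQ01ABD050_933PTT15T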
How is my new storage array doing in general? Well, iostat gives me some indication that the drives are being written to somewhat equally:
jreynolds@beavertail ~ > iostat -N -x 2 2
Linux 3.11.0-12-generic (beavertail) 	11/17/2013 	_x86_64_	(2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.52    0.00   12.75   46.82    0.00   39.91

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.01     0.00    0.06   59.85     0.28  5350.97   178.65     6.73  112.27   23.92  112.36  13.23  79.25
sdb               0.01     0.00    0.05   59.86     0.27  5350.94   178.63     6.69  111.66   25.89  111.74  13.22  79.19
sdc               0.01     0.00    0.07   67.59     0.33  6029.96   178.27     7.05  104.27   21.14  104.35  12.08  81.72
sdd               0.01     0.00    0.13   65.75     0.59  5953.08   180.73     7.11  107.97   35.70  108.12  12.46  82.08
sde               0.02     1.13    3.29  319.92    20.22 39454.49   244.27     1.49    4.60    2.80    4.62   0.52  16.79
sdf               0.01     0.00    0.12   70.63     0.54  5953.20   168.30     0.70    9.91   16.08    9.90   2.17  15.38
sdg               0.01     0.00    0.06   70.40     0.30  5953.25   169.00     0.76   10.84   18.69   10.83   2.31  16.28
sdi               0.01     0.00    0.14   71.55     0.63  6030.08   168.23     0.82   11.43   23.98   11.41   2.34  16.76
sdj               0.02     0.00    0.14   75.14     0.66  5497.69   146.09     0.21    2.75    9.36    2.74   0.94   7.07
sdk               0.02     0.00    0.03    0.00     0.21     0.04    18.43     0.00    3.04    2.78    8.32   2.40   0.01
sdm               0.04     0.00    0.08   68.44     0.50  5477.06   159.90     0.27    3.98    8.62    3.97   0.98   6.72
sdo               0.04     0.00    0.07   66.84     0.49  5477.08   163.72     0.29    4.35    8.03    4.34   1.04   6.94
sdn               0.02     0.00    0.08   64.80     0.43  5497.76   169.51     0.41    6.35    8.18    6.35   1.31   8.51
sdl               0.02     0.00    0.03    0.00     0.21     0.04    18.43     0.00    3.37    3.10    8.74   2.52   0.01
sdp               0.04     0.00    0.07   66.85     0.49  5477.05   163.71     0.29    4.34    8.13    4.34   1.04   6.93
sdq               0.04     0.00    0.08   67.91     0.53  5497.71   161.72     0.29    4.27    8.44    4.27   1.02   6.92
sdh               0.01     0.00    0.13   72.34     0.58  6030.12   166.44     0.72    9.90   16.88    9.89   2.12  15.38
sdr               0.04     0.00    0.07   66.92     0.48  5351.01   159.77     0.29    4.37    7.13    4.36   1.12   7.53
From that I can tell that /dev/sde is getting the shit kicked out of it, with roughly 6x the writes the other drives are getting. What the hell? Let's check the zpool:
root@beavertail ~ > zpool iostat -v 30
                                                        capacity     operations    bandwidth
pool                                                  alloc   free   read  write   read  write
----------------------------------------------------  -----  -----  -----  -----  -----  -----
tank                                                  2.15T  5.99T      0    608    204  58.8M
  raidz1                                               470G  1.57T      0    130     34  12.7M
    ata-Hitachi_HTS547575A9E384_J2190059G5PXYC-part1      -      -      0     73      0  6.32M
    ata-Hitachi_HTS547575A9E384_J2190059G6GMGC-part1      -      -      0     70      0  6.43M
    ata-Hitachi_HTS547575A9E384_J2190059G6LUDC-part1      -      -      0     73    273  6.32M
  raidz1                                               464G  1.58T      0    128     85  12.7M
    ata-Hitachi_HTS547575A9E384_J2190059G95JPC-part1      -      -      0     72      0  6.32M
    ata-Hitachi_HTS547575A9E384_J2190059G95REC-part1      -      -      0     68    136  6.42M
    ata-Hitachi_HTS547575A9E384_J2190059G9LH9C-part1      -      -      0     72    273  6.32M
  raidz1                                               414G   978G      0    116     68  11.0M
    ata-Hitachi_HTS547575A9E384_J2190059G9PDPC-part1      -      -      0     61    136  5.56M
    ata-Hitachi_HTS547575A9E384_J2190059G9SBBC-part1      -      -      0     60    136  5.54M
    ata-TOSHIBA_MQ01ABD050_933PTT0ZT                      -      -      0     66      0  5.45M
  raidz1                                               425G   967G      0    115      0  11.1M
    ata-TOSHIBA_MQ01ABD050_933PTT11T                      -      -      0     70      0  5.61M
    ata-TOSHIBA_MQ01ABD050_933PTT12T                      -      -      0     66      0  5.61M
    ata-TOSHIBA_MQ01ABD050_933PTT13T                      -      -      0     67      0  5.61M
  raidz1                                               426G   966G      0    117     17  11.2M
    ata-TOSHIBA_MQ01ABD050_933PTT14T                      -      -      0     69      0  5.66M
    ata-TOSHIBA_MQ01ABD050_933PTT15T                      -      -      0     66      0  5.66M
    ata-TOSHIBA_MQ01ABD050_933PTT17T                      -      -      0     78    136  5.66M
logs                                                      -      -      -      -      -      -
  ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part5          128K  1.98G      0      0      0      0
cache                                                     -      -      -      -      -      -
  ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part6          64.1G  14.0M      0    342  4.80K  41.9M
----------------------------------------------------  -----  -----  -----  -----  -----  -----
That guy is the L2ARC (level 2 cache) partition on my OCZ SSD. (The astute will see that I am only running one SSD in here.) Since this is a backup server, I doubt I actually need a cache on it.
It is transferring something like 50% of the activity of the whole pool for no good reason. I'll disable it later.
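Cache devices can be removed from a pool without touching the data, so when I get around to it, it should be a one-liner along these lines (a sketch using the device name from the output above):

# drop the L2ARC partition from the pool
zpool remove tank ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part6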
But (and this is quite easily seen using dstat) I have four drives just chugging away, doing more than their share. These guys are busier and I don't know why, unless it has to do with extra parity writes.
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00   64.50     0.00  5354.00   166.02     5.62   88.93    0.00   88.93  11.63  75.00
sdb               0.00     0.00    0.00   63.50     0.00  5416.00   170.58     6.55  108.13    0.00  108.13  12.94  82.20
sdc               0.00     0.00    0.50   73.00     2.00  7144.00   194.45     7.33  102.07   20.00  102.63  11.40  83.80
sdd               0.00     0.00    0.00   63.00     0.00  7044.00   223.62     7.67  126.29    0.00  126.29  13.43  84.60
What kind of gibberish am I looking at? Trim it down to the request-size, queue-size, and "await" columns:
Device:        avgrq-sz avgqu-sz   await w_await  %util
sda              166.02     5.62   88.93   88.93  75.00
sdb              170.58     6.55  108.13  108.13  82.20
sdc              194.45     7.33  102.07  102.63  83.80
sdd              223.62     7.67  126.29  126.29  84.60
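If you would rather do that trimming on the fly than by hand, awk will pick the columns out of iostat's output for you (a sketch, assuming the same column layout as the output above):

# keep the header plus the sd* rows, printing device, avgrq-sz, avgqu-sz,
# await, w_await and %util (columns 1, 8, 9, 10, 12 and 14 in this layout)
iostat -N -x 2 2 | awk '/^Device|^sd/ { printf "%-8s %9s %9s %8s %8s %7s\n", $1, $8, $9, $10, $12, $14 }'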
We see that the write wait (w_await), the overall device wait (await), and the queue size (avgqu-sz) on those devices are way higher than on the rest of the drives. However, I have five raidz groups, so why would I have four drives so poorly conditioned? Either ZFS has decided that I need to write more bytes there, or, I suspect, those four drives all sit on a common controller that is slower.
Common controller? Let's check that /sys directory again. Yeah, with variations of ls -l /sys/class/block/ | egrep 'sd[a-d] ' I definitely see those drives share something on the PCI bus: they all hang off address 00:14.4.
jreynolds@beavertail ~ > ls -l /sys/class/block/ | egrep 'sd[a-d] '
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sda -> ../../devices/pci0000:00/0000:00:14.4/0000:03:05.0/ata7/host1/target1:0:0/1:0:0:0/block/sda
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sdb -> ../../devices/pci0000:00/0000:00:14.4/0000:03:05.0/ata8/host2/target2:0:0/2:0:0:0/block/sdb
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sdc -> ../../devices/pci0000:00/0000:00:14.4/0000:03:05.0/ata9/host3/target3:0:0/3:0:0:0/block/sdc
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sdd -> ../../devices/pci0000:00/0000:00:14.4/0000:03:05.0/ata10/host5/target5:0:0/5:0:0:0/block/sdd
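The same per-disk loop from earlier can be bent toward this question, pulling the PCI addresses out of each path instead of the ATA port (again a sketch that assumes the sysfs layout shown above):

# print each disk with the PCI bridge and controller it sits behind
for d in /sys/class/block/sd?; do
    printf '%s\t%s\n' "$(basename "$d")" "$(readlink -f "$d" | cut -d/ -f5-6)"
done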
However, using dmidecode, all I can tell at the moment is that I have four slots, and their addresses do not match. This is unlikely to be a motherboard-connected item. Is it a PCI-connected item?
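dmidecode can limit itself to the slot records, which is roughly what I was digging through (a sketch; the exact field names vary between BIOS vendors, and not every BIOS reports a bus address at all):

# list the physical expansion slots and, where the BIOS reports it, their PCI bus addresses
sudo dmidecode -t slot | grep -E 'Designation|Current Usage|Bus Address'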
> lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h Processor Root Complex
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Sumo [Radeon HD 6410D]
00:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 12h Processor Root Port
00:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 12h Processor Root Port
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 40)
00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:12.2 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 11)
00:13.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:13.2 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 11)
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 14)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 11)
00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] FCH PCI Bridge (rev 40)
00:14.5 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:15.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Hudson PCI to PCI bridge (PCIE port 0)
00:15.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Hudson PCI to PCI bridge (PCIE port 1)
00:16.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:16.2 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 11)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 0 (rev 43)
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 6
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 5
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 7
01:00.0 SATA controller: Marvell Technology Group Ltd. Device 9235 (rev 10)
02:00.0 SATA controller: Marvell Technology Group Ltd. Device 9235 (rev 10)
03:05.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)
04:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
DING! Found it: that Silicon Image SATA controller, a PCI-X part hanging off the FCH PCI bridge at 00:14.4. It is likely the slow link in all this. I can see the match at 03:05.0:
03:05.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)
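If you want to see that parent/child relationship without squinting at sysfs paths, lspci can draw the topology as a tree, so bridges and the devices behind them line up visually (the layout shown is whatever your machine reports, of course):

# tree view of the PCI bus, with device names
lspci -tv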