Raw ZVOL access in ZFS on Linux not an option for VirtualBox?

I have gotten about as far as I can go… I can boot a Fedora netinstall CD, and I can attempt to fdisk /dev/sda, but that device will not open in fdisk. This might be consistent with there being no “rdsk” device in the dev tree. Funny, though: I can fdisk /dev/zd0, and it lists partitions.

Could this be a problem with how I have my vmdk set up?

Converting a VirtualBox disk to a ZFS volume

I am trying to create zvol-backed VirtualBox guests using ZFS on Linux. Hopefully this is not too ambitious.

  1. Find the size of the original .VDI disk image: vboxmanage showhdinfo Fedora19.vdi
  2. Pass that size into zfs create: zfs create -V40960M tank/VMs/Fedora19
  3. Convert the .vdi file to raw; it is best if you can read from one drive and write to another. In my case I’m reading from a zvol and writing to an SSD: vboxmanage clonehd Fedora19.vdi ~/tmp/Fedora19.raw --format RAW
  4. Load that back onto the raw zvol: dd if=~/tmp/Fedora19.raw of=/dev/zvol/tank/VMs/9101-f19 bs=1280k
  5. Now we can check partitions: vboxmanage internalcommands listpartitions --rawdisk /dev/zvol/tank/VMs/9101-f19
  6. Then we can create a vmdk file for the raw partitions: vboxmanage internalcommands createrawvmdk -filename ~/VirtualBox\ VMs/9101-f19/9101-f19-B.vmdk -rawdisk /dev/zvol/tank/VMs/9101-f19 -partitions 1,2,3
  7. Create an IDE controller: vboxmanage storagectl 9101-f19-B --name IDE --add ide
  8. Add a blank version of the hard drive: vboxmanage storageattach 9101-f19-B --storagectl IDE --port 0 --device 0 --mtype normal --type hdd --medium none
  9. Add the hard drive: vboxmanage storageattach 9101-f19-B --storagectl IDE --port 0 --device 0 --mtype normal --type hdd --medium “/home/jreynolds/VirtualBox VMs/9101-f19-B/9101-f19-B.vmdk”
  10. …attach the CD-ROM and then repair the system…somehow my disk image stopped being bootable. :-( (The whole sequence is collected in the sketch below.)
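For my own reference, here is the sequence above collected into one rough shell sketch. The VM name, size, and paths are the ones from the steps, except that I have unified the zvol and vmdk names (they drifted between steps in my notes), and I have not run this end to end as a single script:

VDI=Fedora19.vdi
ZVOL=/dev/zvol/tank/VMs/9101-f19
VM=9101-f19-B
VMDK="$HOME/VirtualBox VMs/$VM/$VM.vmdk"

vboxmanage showhdinfo "$VDI"                                   # check the image size, then create a matching zvol
sudo zfs create -V40960M tank/VMs/9101-f19
vboxmanage clonehd "$VDI" ~/tmp/Fedora19.raw --format RAW      # convert to a raw image, ideally on another drive
sudo dd if=~/tmp/Fedora19.raw of="$ZVOL" bs=1280k              # copy the raw image onto the zvol
vboxmanage internalcommands listpartitions --rawdisk "$ZVOL"   # sanity-check the partitions
vboxmanage internalcommands createrawvmdk -filename "$VMDK" -rawdisk "$ZVOL" -partitions 1,2,3
vboxmanage storagectl "$VM" --name IDE --add ide               # IDE controller for the guest
vboxmanage storageattach "$VM" --storagectl IDE --port 0 --device 0 \
    --mtype normal --type hdd --medium "$VMDK"                 # attach the raw vmdk wrapper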

Replacing Four Drives at Once

I would not have considered this tactic possible, but then I come from small RAID setups or hardware RAID setups, where you are unlikely to want to replace multiple disks at a time. Replacing four at a time!

Replacing four bad drives at once.

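For reference, the mechanics are nothing fancier than issuing one zpool replace per disk and letting ZFS resilver them together. A sketch with made-up device names (I use the /dev/disk/by-id names on my pools):

 > sudo zpool replace tank ata-OLD_DISK_1 ata-NEW_DISK_1
 > sudo zpool replace tank ata-OLD_DISK_2 ata-NEW_DISK_2
 > sudo zpool replace tank ata-OLD_DISK_3 ata-NEW_DISK_3
 > sudo zpool replace tank ata-OLD_DISK_4 ata-NEW_DISK_4
 > zpool status -v tank    # each disk should show up under a "replacing" vdev while they resilver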

Two drives fail at once? Oh yeah…

Just as I thought I was all cool for having a sixteen-drive NAS, opening it up today to try a new network card (it did not fit) left me with bad news on the next power-up.

 > dmesg | grep ata | grep error:
[   23.223221] ata13.00: error: { ABRT }
[   23.234448] ata13.00: error: { ABRT }
[   31.262674] ata13.00: error: { ABRT }
[   31.275241] ata13.00: error: { ABRT }
[   31.288012] ata13.00: error: { ABRT }
[   39.073802] ata13.00: error: { ABRT }
[   50.815339] ata13.00: error: { ABRT }
[   50.827082] ata13.00: error: { ABRT }
[   57.606645] ata13.00: error: { ABRT }
[   69.616356] ata7.00: error: { ABRT }
[   69.616451] ata13.00: error: { ABRT }

That’s a failure of two drives. TWO at the same time! …and look at this:

 > zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 4h43m with 0 errors on Sat Sep  6 00:13:22 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G9PDPC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G9SBBC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G6GMGC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G95REC  ONLINE       0     0     0
          raidz1-1                                      ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G9LH9C  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G95JPC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G6LUDC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G5PXYC  ONLINE       0     0     0
          raidz1-2                                      ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_X3EJSVUOS            ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_X3EJSVUNS            ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_933PTT11T            ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_933PTT17T            ONLINE       0     0     0
          raidz1-3                                      ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_933PTT12T            ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_933PTT13T            ONLINE       0     0     2
            ata-TOSHIBA_MQ01ABD050_933PTT14T            ONLINE       0     0     2
            ata-TOSHIBA_MQ01ABD050_933PTT0ZT            ONLINE       0     0     0
        logs
          ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part5   ONLINE       0     0     0
        cache
          ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part6   ONLINE       0     0     0

errors: No known data errors

Two checksum errors in the same raidz1 vdev. That’s going to be a very tricky replacement. I think I’m going to either replace one disk at a time and hope for the best resilver possibilities, or maybe… add a PCI controller back in there, build another pool, and migrate data from one pool to the other? That’ll be a wild trick.
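If I go the one-disk-at-a-time route, the loop per disk is the usual one. A sketch; the Toshiba name is from the status output above and the replacement name is hypothetical:

 > sudo zpool clear tank                       # reset the checksum counters first
 > sudo zpool replace tank ata-TOSHIBA_MQ01ABD050_933PTT13T ata-NEW_REPLACEMENT_DISK
 > zpool status -v tank                        # wait out the resilver, then repeat for the next disk
 > sudo zpool scrub tank                       # scrub between swaps to re-verify parity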

It will frack up my backups for a while, that’s for sure. Oh, and those Toshiba drives? That’s three Toshiba failures, zero Hitachi failures.

ZFS on Linux machine

Beavertail Cactus [Wikipedia]



Here is my ZFS on Linux story, and some of you might have seen these pictures when I started this project last year: I recycled an old Athlon system from work and put a 35-watt AMD A2 processor with 8GB of 1600MHz RAM on an Asus mobo in it. I name my home systems after cacti, and after I installed Ubuntu 12.04 on it, I named this one Beavertail.bitratchet.net.

My previous experiences with storage systems involved them dying from heat, so I decided I would avoid full-sized drives, stick with laptop drives, and boot off an SSD. I have the SSD partitioned with /boot, root, and two more partitions for ZIL and L2ARC. The bulk of the storage is a mix of 750GB Hitachi and 500GB Toshiba laptop hard drives, 16 in total. I have lost two drives in this system, which I would label “normal drive attrition.” The boot drive is a 128GB OCZ Vertex 2.


Half the drives are on the bottom, and half are on top. At work I have access to gobs of full-height card brackets and this is what I built drive cages out of.

To get all the drives wired up, I started with a bunch of 1x and 2x PCIe SATA expanders and used up all my mobo SATA ports, but by the time I got to about 12 drives I only had a PCI slot left, so I had to use that. Looking at my disk utilization in iostat -Nx and dstat --disk-util, it was plainly clear that I had a swath of underperforming drives, and they were all connected to the slowest controller: the PCI one.
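For the record, the spot check is nothing exotic; watch the utilization columns over a few refresh intervals and the laggards stand out:

 > iostat -Nx 5          # extended stats every 5 seconds; look at the %util and await columns
 > dstat --disk-util 5   # per-disk utilization, refreshed every 5 seconds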

Supermicro HBA 8-port SAS controllers


I saved up and remedied that by purchasing two SuperMicro SAS HBAs with Marvell chipsets. They are only 3Gb/s SATA (equivalent), but they each control eight drives, and they do so consistently. They take eight PCIe lanes, which is a great use for the two 16x PCIe slots on the mobo.

02:00.0 RAID bus controller: Marvell Technology Group Ltd. 88SE9485 SAS/SATA 6Gb/s controller (rev c3)
04:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

It took me a while to track down my network bandwidth issues. The problem was my motherboard: it has an onboard Realtek chipset. It would max out at 500Mbps download and 250Mbps upload… and very often wedge the system. I got a 1x PCIe Intel card, and I got a good clean 955Mbps both ways out of that with one iperf stream, and 985+Mbps with two iperf streams. To actually achieve this, I needed to put an Intel NIC in my workstation as well. (My switch is a 16-port unmanaged Zyxel.)
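The measurement itself was just iperf between the two boxes, roughly like this (classic iperf 2 syntax; “workstation” stands in for whatever your client machine is called):

jreynolds@beavertail ~
 > iperf -s                        # listen on the NAS

jreynolds@workstation ~
 > iperf -c beavertail -P 2        # two parallel streams from the workstation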

picture of drives

Eight drives on top

I am able to push close to full network capacity to Beavertail. As you can see, the results speak for themselves: the screenie below shows iftop displaying better than 880Mbps, and I saw it grab 910Mbps during this backup. Clearly part of the success is having a Samsung 840EVO in my laptop, but having a stripe of four raidz vdevs clearly allows plenty of IO headroom.

screen capture of iftop

910Mbps transfer from laptop to NAS.

 

Here are some other nerdy stats, mostly on how my drives are arranged:

 > zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 4h43m with 0 errors on Sat Sep  6 00:13:22 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G9PDPC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G9SBBC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G6GMGC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G95REC  ONLINE       0     0     0
          raidz1-1                                      ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G9LH9C  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G95JPC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G6LUDC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G5PXYC  ONLINE       0     0     0
          raidz1-2                                      ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_X3EJSVUOS            ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_X3EJSVUNS            ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_933PTT11T            ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_933PTT17T            ONLINE       0     0     0
          raidz1-3                                      ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_933PTT12T            ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_933PTT13T            ONLINE       0     0     2
            ata-TOSHIBA_MQ01ABD050_933PTT14T            ONLINE       0     0     2
            ata-TOSHIBA_MQ01ABD050_933PTT0ZT            ONLINE       0     0     0
        logs
          ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part5   ONLINE       0     0     0
        cache
          ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part6   ONLINE       0     0     0

errors: No known data errors

And to finish up, this system has withstood a series of in-place Ubuntu upgrades. It is now running 14.04. My advice on this, and on Linux kernels, is this:

  • Do not rush to install new mainline kernels; you have to wait for the SPL and ZFS DKMS packages to sync up with mainline and for updates to come out through the ubuntu-zfs PPA.
  • If you do a dist-upgrade and reboot, and your zpool does not come back, this is easily fixed by reinstalling ubuntu-zfs: apt-get install --reinstall ubuntu-zfs. This will re-link your kernel modules and you should be good to go (see the quick sketch below).
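In practice that recovery looks something like this — a quick sketch, with the pool name from my box:

 > sudo apt-get install --reinstall ubuntu-zfs   # rebuilds the spl/zfs modules through dkms
 > dkms status | egrep 'spl|zfs'                 # confirm the modules built against the running kernel
 > sudo modprobe zfs && sudo zpool import -a     # load the module and bring the pools back
 > zpool status tank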

Like I said, this has been working for four releases of Ubuntu for me, along with replacing controllers and drives. My only complaint is that long sequences of small file operations tend to bring the speed down a lot (or did; I have not reproduced it on 14.04 yet). But for streaming large files I get massive throughput… which is great for my large photo collection!

Resilvering: Just what I want to see

I love replacing failed hard drives. Mostly because I hate having hard drive failures.

root@cholla ~
$ zpool attach -f tank ata-ST2000DM001-9YN164_W1E16BJB ata-ST2000DM001-1E6164_W1E5W99Y

root@cholla ~
$ zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Feb 21 08:41:52 2014
        440M scanned out of 2.19T at 55.0M/s, 11h35m to go
        217M resilvered, 0.02% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            ata-ST2000DM001-9YN164_W1E16BJB             ONLINE       0     0     0
            ata-ST2000DM001-1E6164_W1E5W99Y             ONLINE       0     0     0  (resilvering)
          mirror-1                                      ONLINE       0     0     0
            ata-ST2000DM001-1CH164_W1E15D5G             ONLINE       0     0     0
            ata-ST2000DM001-1CH164_W1E16ACY             ONLINE       0     0     0
        logs
          mirror-2                                      ONLINE       0     0     0
            ata-OCZ-VERTEX4_OCZ-IHFPW0WRG6202CJ6-part2  ONLINE       0     0     0
            ata-OCZ-VERTEX4_OCZ-8O3625YQSVR0993K-part2  ONLINE       0     0     0
        cache
          ata-OCZ-VERTEX4_OCZ-8O3625YQSVR0993K-part3    ONLINE       0     0     0
          ata-OCZ-VERTEX4_OCZ-IHFPW0WRG6202CJ6-part3    ONLINE       0     0     0

errors: No known data errors

And there we go!
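If you don’t feel like re-running zpool status by hand, something like this keeps an eye on the resilver (assuming GNU watch is installed):

 > watch -n 30 'zpool status tank | grep -A 2 scan:'   # refresh the progress lines every 30 seconds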

Disk Performance Analysis on Linux

Really Into the Guts of Linux Filesystem Performance

It is possible to find out how hard your disks are running. Some you don’t want to run as hard as others. For instance, if you wanted to save on write cycles for your SSD, you could move small, frequent writes into a RAM filesystem, and that seems to be done by default on many systems these days (with /tmp getting mounted as tmpfs).
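For example, a single /etc/fstab line is all it takes to keep /tmp in RAM — a sketch, and the 2G size cap is an arbitrary number of mine:

# /etc/fstab: mount /tmp as tmpfs so small scratch writes never touch the SSD
tmpfs   /tmp   tmpfs   defaults,noatime,size=2G   0   0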

In the process of rebuilding my backup server, I left it going with a few large writes overnight, to check for some stability issues. Checking dmesg is always a good way to do that.

For instance, I see that drive 14 seems to be wonky:

Nov 16 22:44:13 beavertail kernel: [  692.590756] ata14.00: cmd 61/e0:18:20:0a:00/00:00:00:00:00/40 tag 3 ncq 114688 out
Nov 16 22:44:13 beavertail kernel: [  692.590756]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 16 22:44:13 beavertail kernel: [  692.598694] ata14.00: status: { DRDY }
Nov 16 22:44:14 beavertail kernel: [  693.093364] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 16 22:44:14 beavertail kernel: [  693.118236] ata14.00: configured for UDMA/100
Nov 16 22:44:14 beavertail kernel: [  693.118255] ata14.00: device reported invalid CHS sector 0
Nov 16 22:44:14 beavertail kernel: [  693.118273] ata14: EH complete

Which drive is this? Oh, I see where it came up, using sudo dmesg | grep 'ata14' | less:

[    1.061214] ata14: SATA max UDMA/133 abar m2048@0xfea10000 port 0xfea10280 irq 49
[    1.556891] ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    1.589329] ata14.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100

I cannot pass ata14 as an argument to programs, so I will see if I can match a serial number:

root@beavertail /dev/disk/by-id
 > find -type l  -a \( -name 'wwn-*' -o -name '*-part*' \) -prune -o -printf "%l\t%f\n" | sort
	.
../../sda	ata-Hitachi_HTS547575A9E384_J2190059G9PDPC
../../sdb	ata-Hitachi_HTS547575A9E384_J2190059G9SBBC
../../sdc	ata-Hitachi_HTS547575A9E384_J2190059G6GMGC
../../sdd	ata-Hitachi_HTS547575A9E384_J2190059G95REC
../../sde	ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW
../../sdf	ata-Hitachi_HTS547575A9E384_J2190059G9LH9C
../../sdg	ata-Hitachi_HTS547575A9E384_J2190059G95JPC
../../sdh	ata-Hitachi_HTS547575A9E384_J2190059G6LUDC
../../sdi	ata-Hitachi_HTS547575A9E384_J2190059G5PXYC
../../sdj	ata-TOSHIBA_MQ01ABD050_933PTT17T
../../sdk	ata-TOSHIBA_MQ01ABD050_X3EJSVUOS
../../sdl	ata-TOSHIBA_MQ01ABD050_X3EJSVUNS
../../sdm	ata-TOSHIBA_MQ01ABD050_933PTT11T
../../sdn	ata-TOSHIBA_MQ01ABD050_933PTT15T
../../sdo	ata-TOSHIBA_MQ01ABD050_933PTT12T
../../sdp	ata-TOSHIBA_MQ01ABD050_933PTT13T
../../sdq	ata-TOSHIBA_MQ01ABD050_933PTT14T
../../sdr	ata-TOSHIBA_MQ01ABD050_933PTT0ZT

Nope. So we search dmesg again, grepping for the ATA-8 identification lines to see which model sits on each port:

[    1.538182] ata4.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    1.538213] ata2.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    1.538244] ata3.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    1.538565] ata5.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    1.563034] ata6.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.584336] ata12.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.589329] ata14.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.590477] ata11.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.590713] ata13.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.618162] ata15.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.621146] ata18.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.623070] ata16.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    1.625311] ata17.00: ATA-8: TOSHIBA MQ01ABD050, AX001A, max UDMA/100
[    3.267833] ata7.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    5.469619] ata8.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    7.671669] ata9.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133
[    9.890091] ata10.00: ATA-8: Hitachi HTS547575A9E384, JE4OA60A, max UDMA/133

We begin to notice I should have kept with the Hitachi drives; they are going to be faster. Oh well, I’m doing RAID 50-style striping across five raidz groups, and that right there is pretty good.

But where is this drive? Time to paw through /sys. “Paw” is just about right: I don’t have much memorized about the /sys directory, but with ‘less’ and ‘ls’ I bet… there we go: ls -l /sys/class/block/ | grep ata14:

root@beavertail /var/log
 > ls -l /sys/class/block/ | grep ata14
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sdn -> ../../devices/pci0000:00/0000:00:02.0/0000:01:00.0/ata14/host13/target13:0:0/13:0:0:0/block/sdn
lrwxrwxrwx 1 root root 0 Nov 17 06:43 sdn1 -> ../../devices/pci0000:00/0000:00:02.0/0000:01:00.0/ata14/host13/target13:0:0/13:0:0:0/block/sdn/sdn1
lrwxrwxrwx 1 root root 0 Nov 17 06:43 sdn9 -> ../../devices/pci0000:00/0000:00:02.0/0000:01:00.0/ata14/host13/target13:0:0/13:0:0:0/block/sdn/sdn9

Now I can use hdparm on this guy:

root@beavertail /var/log
 > hdparm -I /dev/sdn | head

/dev/sdn:

ATA device, with non-removable media
        Model Number:       TOSHIBA MQ01ABD050                      
        Serial Number:      933PTT15T
        Firmware Revision:  AX001A  
        Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6

Now that I have the serial number for that guy, I will be able to find him.
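One quick way to turn that serial number back into the by-id link and the sdX name (a sketch):

 > ls -l /dev/disk/by-id/ | grep 933PTT15T | grep -v -- '-part'   # should point at ../../sdn, matching the listing above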

How is my new storage array doing in general? Well, iostat gives me some indication that the drives are being written to somewhat equally:

jreynolds@beavertail ~
 > iostat -N -x 2 2
Linux 3.11.0-12-generic (beavertail) 	11/17/2013 	_x86_64_	(2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.52    0.00   12.75   46.82    0.00   39.91

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.01     0.00    0.06   59.85     0.28  5350.97   178.65     6.73  112.27   23.92  112.36  13.23  79.25
sdb               0.01     0.00    0.05   59.86     0.27  5350.94   178.63     6.69  111.66   25.89  111.74  13.22  79.19
sdc               0.01     0.00    0.07   67.59     0.33  6029.96   178.27     7.05  104.27   21.14  104.35  12.08  81.72
sdd               0.01     0.00    0.13   65.75     0.59  5953.08   180.73     7.11  107.97   35.70  108.12  12.46  82.08
sde               0.02     1.13    3.29  319.92    20.22 39454.49   244.27     1.49    4.60    2.80    4.62   0.52  16.79
sdf               0.01     0.00    0.12   70.63     0.54  5953.20   168.30     0.70    9.91   16.08    9.90   2.17  15.38
sdg               0.01     0.00    0.06   70.40     0.30  5953.25   169.00     0.76   10.84   18.69   10.83   2.31  16.28
sdi               0.01     0.00    0.14   71.55     0.63  6030.08   168.23     0.82   11.43   23.98   11.41   2.34  16.76
sdj               0.02     0.00    0.14   75.14     0.66  5497.69   146.09     0.21    2.75    9.36    2.74   0.94   7.07
sdk               0.02     0.00    0.03    0.00     0.21     0.04    18.43     0.00    3.04    2.78    8.32   2.40   0.01
sdm               0.04     0.00    0.08   68.44     0.50  5477.06   159.90     0.27    3.98    8.62    3.97   0.98   6.72
sdo               0.04     0.00    0.07   66.84     0.49  5477.08   163.72     0.29    4.35    8.03    4.34   1.04   6.94
sdn               0.02     0.00    0.08   64.80     0.43  5497.76   169.51     0.41    6.35    8.18    6.35   1.31   8.51
sdl               0.02     0.00    0.03    0.00     0.21     0.04    18.43     0.00    3.37    3.10    8.74   2.52   0.01
sdp               0.04     0.00    0.07   66.85     0.49  5477.05   163.71     0.29    4.34    8.13    4.34   1.04   6.93
sdq               0.04     0.00    0.08   67.91     0.53  5497.71   161.72     0.29    4.27    8.44    4.27   1.02   6.92
sdh               0.01     0.00    0.13   72.34     0.58  6030.12   166.44     0.72    9.90   16.88    9.89   2.12  15.38
sdr               0.04     0.00    0.07   66.92     0.48  5351.01   159.77     0.29    4.37    7.13    4.36   1.12   7.53

From that I can tell that /dev/sde is getting the shit kicked out of it with 6x the writes other drives are getting. What the hell? Let’s check zpool:

root@beavertail ~
 > zpool iostat -v 30
                                                         capacity     operations    bandwidth
pool                                                  alloc   free   read  write   read  write
----------------------------------------------------  -----  -----  -----  -----  -----  -----
tank                                                  2.15T  5.99T      0    608    204  58.8M
  raidz1                                               470G  1.57T      0    130     34  12.7M
    ata-Hitachi_HTS547575A9E384_J2190059G5PXYC-part1      -      -      0     73      0  6.32M
    ata-Hitachi_HTS547575A9E384_J2190059G6GMGC-part1      -      -      0     70      0  6.43M
    ata-Hitachi_HTS547575A9E384_J2190059G6LUDC-part1      -      -      0     73    273  6.32M
  raidz1                                               464G  1.58T      0    128     85  12.7M
    ata-Hitachi_HTS547575A9E384_J2190059G95JPC-part1      -      -      0     72      0  6.32M
    ata-Hitachi_HTS547575A9E384_J2190059G95REC-part1      -      -      0     68    136  6.42M
    ata-Hitachi_HTS547575A9E384_J2190059G9LH9C-part1      -      -      0     72    273  6.32M
  raidz1                                               414G   978G      0    116     68  11.0M
    ata-Hitachi_HTS547575A9E384_J2190059G9PDPC-part1      -      -      0     61    136  5.56M
    ata-Hitachi_HTS547575A9E384_J2190059G9SBBC-part1      -      -      0     60    136  5.54M
    ata-TOSHIBA_MQ01ABD050_933PTT0ZT                      -      -      0     66      0  5.45M
  raidz1                                               425G   967G      0    115      0  11.1M
    ata-TOSHIBA_MQ01ABD050_933PTT11T                      -      -      0     70      0  5.61M
    ata-TOSHIBA_MQ01ABD050_933PTT12T                      -      -      0     66      0  5.61M
    ata-TOSHIBA_MQ01ABD050_933PTT13T                      -      -      0     67      0  5.61M
  raidz1                                               426G   966G      0    117     17  11.2M
    ata-TOSHIBA_MQ01ABD050_933PTT14T                      -      -      0     69      0  5.66M
    ata-TOSHIBA_MQ01ABD050_933PTT15T                      -      -      0     66      0  5.66M
    ata-TOSHIBA_MQ01ABD050_933PTT17T                      -      -      0     78    136  5.66M
logs                                                      -      -      -      -      -      -
  ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part5          128K  1.98G      0      0      0      0
cache                                                     -      -      -      -      -      -
  ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part6         64.1G  14.0M      0    342  4.80K  41.9M
----------------------------------------------------  -----  -----  -----  -----  -----  -----

That guy is my L2ARC cache SSD partition on my OCZ drive. (The astute will see that I am only running one SSD on here.) Since this is a backup box, I doubt I actually need a cache on it.
It is handling something like 50% of the activity of the whole pool for no good reason. I’ll disable him later.
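When I get around to it, dropping the cache device should be a one-liner; cache devices can be removed from a live pool (device name from the zpool iostat above):

 > sudo zpool remove tank ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part6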

But (and this is quite easily seen using dstat) I have four drives just chugging away, doing more than their share. These guys are busier and I don’t know why, unless it has to do with extra parity writes.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00   64.50     0.00  5354.00   166.02     5.62   88.93    0.00   88.93  11.63  75.00
sdb               0.00     0.00    0.00   63.50     0.00  5416.00   170.58     6.55  108.13    0.00  108.13  12.94  82.20
sdc               0.00     0.00    0.50   73.00     2.00  7144.00   194.45     7.33  102.07   20.00  102.63  11.40  83.80
sdd               0.00     0.00    0.00   63.00     0.00  7044.00   223.62     7.67  126.29    0.00  126.29  13.43  84.60

What kind of gibberish am I looking at? Trim it down to the request size, queue size, await, and utilization columns:

Device:          avgrq-sz avgqu-sz   await  w_await   %util
sda                166.02     5.62   88.93    88.93   75.00
sdb                170.58     6.55  108.13   108.13   82.20
sdc                194.45     7.33  102.07   102.63   83.80
sdd                223.62     7.67  126.29   126.29   84.60

We see that the average wait (await), the write wait (w_await), and the queue size (avgqu-sz) on those devices are way higher
than on the rest of the drives. However, I have five raidz vdevs, so why would I have four drives so poorly conditioned? Either
ZFS has decided that I need to write more bytes there, or, I suspect, those four drives are all on a common controller that is slower.

Common controller? Let’s check that /sys/ directory again. Yeah, with variations of ls -l /sys/class/block/ | egrep 'sd[a-d] ' I definitely see that those
drives share something on the PCI bus: they all sit behind address 00:14.4.

jreynolds@beavertail ~
 > ls -l /sys/class/block/ | egrep 'sd[a-d] '
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sda -> ../../devices/pci0000:00/0000:00:14.4/0000:03:05.0/ata7/host1/target1:0:0/1:0:0:0/block/sda
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sdb -> ../../devices/pci0000:00/0000:00:14.4/0000:03:05.0/ata8/host2/target2:0:0/2:0:0:0/block/sdb
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sdc -> ../../devices/pci0000:00/0000:00:14.4/0000:03:05.0/ata9/host3/target3:0:0/3:0:0:0/block/sdc
lrwxrwxrwx 1 root root 0 Nov 16 22:32 sdd -> ../../devices/pci0000:00/0000:00:14.4/0000:03:05.0/ata10/host5/target5:0:0/5:0:0:0/block/sdd

However, using dmidecode, all I can tell at the moment is that I have four slots and their addresses do not match. This is unlikely to be a motherboard-connected item. Is it a PCI-connected item?

 > lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h Processor Root Complex
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Sumo [Radeon HD 6410D]
00:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 12h Processor Root Port
00:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 12h Processor Root Port
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 40)
00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:12.2 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 11)
00:13.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:13.2 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 11)
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 14)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 11)
00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] FCH PCI Bridge (rev 40)
00:14.5 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:15.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Hudson PCI to PCI bridge (PCIE port 0)
00:15.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Hudson PCI to PCI bridge (PCIE port 1)
00:16.0 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB OHCI Controller (rev 11)
00:16.2 USB controller: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller (rev 11)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 0 (rev 43)
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 6
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 5
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 12h/14h Processor Function 7
01:00.0 SATA controller: Marvell Technology Group Ltd. Device 9235 (rev 10)
02:00.0 SATA controller: Marvell Technology Group Ltd. Device 9235 (rev 10)
03:05.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)
04:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)

DING! Found it: that Silicon Image SATA controller. It is likely the slow link in all this. I can see the match: 03:05.0

03:05.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)
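A quick sanity check of which ATA ports and disks hang off that card — a sketch built from the sysfs paths above:

 > ls /sys/bus/pci/devices/0000:03:05.0/ | grep '^ata'   # should list ata7 through ata10
 > ls -l /sys/class/block/ | grep '03:05.0'              # and the sdX devices behind them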

Recreating my Backup Pool

Sing out if you know what I’m doing with this one command:

zpool create -f -o ashift=12 tank \
raidz ata-Hitachi_HTS547575A9E384_J2190059G5PXYC-part1 ata-Hitachi_HTS547575A9E384_J2190059G6GMGC-part1 ata-Hitachi_HTS547575A9E384_J2190059G6LUDC-part1 \
raidz ata-Hitachi_HTS547575A9E384_J2190059G95JPC-part1 ata-Hitachi_HTS547575A9E384_J2190059G95REC-part1 ata-Hitachi_HTS547575A9E384_J2190059G9LH9C-part1 \
raidz ata-Hitachi_HTS547575A9E384_J2190059G9PDPC-part1 ata-Hitachi_HTS547575A9E384_J2190059G9SBBC-part1 ata-TOSHIBA_MQ01ABD050_933PTT0ZT \
raidz ata-TOSHIBA_MQ01ABD050_933PTT11T ata-TOSHIBA_MQ01ABD050_933PTT12T ata-TOSHIBA_MQ01ABD050_933PTT13T \
raidz ata-TOSHIBA_MQ01ABD050_933PTT14T ata-TOSHIBA_MQ01ABD050_933PTT15T ata-TOSHIBA_MQ01ABD050_933PTT17T \
log /dev/sda5 \
cache /dev/sda6 \
spare ata-TOSHIBA_MQ01ABD050_X3EJSVUNS ata-TOSHIBA_MQ01ABD050_X3EJSVUOS

ZFS, Bash, and Piped Loops

Fellow Linux Nutjobs! Here is your evening lesson in how to reclaim space from a ZFS dataset by destroying snapshots. In my setup, my remote backup has a subset of the snapshots I’m automatically creating from my backup scripts. To reclaim space, I’m deleting the local snapshots that have already been backed up off-site.


Assume alias Zfs="sudo zfs" in the following example:

Zfs list -rHt snap tank/pictures \
| while read L ; do 
   [[ $L == "tank/pictures@20131008-191254"* ]] && break; 
   echo $L;
done \
| cut -d' ' -f1 \
| while read M ; do 
    Zfs destroy $M; 
done

Pop quiz. What other command could I use besides cut?

And guess what? I typed that command in almost one go, not pasted in from an editor. I’ve formatted it all fancy-like for your precious sanity.
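If you want a safety net before actually destroying anything, a dry-run variation works too — a sketch, assuming your zfs build supports the -n/-v flags on destroy:

sudo zfs list -rHt snap -o name tank/pictures \
| sed '/@20131008-191254/,$d' \
| while read -r snap ; do
    sudo zfs destroy -nv "$snap"   # -n: dry run, -v: report what would be freed
done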

Example of how Jed works his mojo.
