Two drives fail at once? Oh yeah…

Just as I thought I was all cool for having a sixteen-drive NAS, opening it up today to try a new network card (which did not fit) left me with bad news on the next power-up.

 > dmesg | grep ata | grep error:
[   23.223221] ata13.00: error: { ABRT }
[   23.234448] ata13.00: error: { ABRT }
[   31.262674] ata13.00: error: { ABRT }
[   31.275241] ata13.00: error: { ABRT }
[   31.288012] ata13.00: error: { ABRT }
[   39.073802] ata13.00: error: { ABRT }
[   50.815339] ata13.00: error: { ABRT }
[   50.827082] ata13.00: error: { ABRT }
[   57.606645] ata13.00: error: { ABRT }
[   69.616356] ata7.00: error: { ABRT }
[   69.616451] ata13.00: error: { ABRT }
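To see at a glance how the errors split across ports, a tally per ATA port helps. This is only a sketch: `count_ata_errors` is a helper name made up here, and it assumes the log lines look like the ones above.

```shell
# Count ATA error lines per port; reads kernel log text on stdin.
count_ata_errors() {
    awk '/ata[0-9]+\.[0-9]+: error:/ {
             # Find the "ataN.NN:" field and strip its trailing colon.
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^ata[0-9]+\.[0-9]+:$/) { sub(/:$/, "", $i); n[$i]++ }
         }
         END { for (p in n) print p, n[p] }' | sort
}

# Usage on a live system: dmesg | count_ata_errors
```

Run against the output above, it should put almost all of the errors on ata13 and a single one on ata7.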

That’s two drives failing. TWO at the same time! …and look at this:

 > zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 4h43m with 0 errors on Sat Sep  6 00:13:22 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G9PDPC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G9SBBC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G6GMGC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G95REC  ONLINE       0     0     0
          raidz1-1                                      ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G9LH9C  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G95JPC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G6LUDC  ONLINE       0     0     0
            ata-Hitachi_HTS547575A9E384_J2190059G5PXYC  ONLINE       0     0     0
          raidz1-2                                      ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_X3EJSVUOS            ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_X3EJSVUNS            ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_933PTT11T            ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_933PTT17T            ONLINE       0     0     0
          raidz1-3                                      ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_933PTT12T            ONLINE       0     0     0
            ata-TOSHIBA_MQ01ABD050_933PTT13T            ONLINE       0     0     2
            ata-TOSHIBA_MQ01ABD050_933PTT14T            ONLINE       0     0     2
            ata-TOSHIBA_MQ01ABD050_933PTT0ZT            ONLINE       0     0     0
        logs
          ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part5   ONLINE       0     0     0
        cache
          ata-OCZ-AGILITY4_OCZ-77Z13FI634825PNW-part6   ONLINE       0     0     0

errors: No known data errors
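With sixteen drives in the list, the non-zero counters are easy to miss. A small filter can pull out only the members that have one. Again a sketch: `zpool_error_devices` is a hypothetical helper, and it assumes plain integer counters as shown above (it would skip abbreviated counts such as `1.2K`).

```shell
# Print pool members whose READ, WRITE, or CKSUM counter is non-zero.
# Reads `zpool status` text on stdin.
zpool_error_devices() {
    awk 'NF >= 5 && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ &&
         ($3 + $4 + $5) > 0 { print $1, "read=" $3, "write=" $4, "cksum=" $5 }'
}

# Usage on a live system: zpool status -v | zpool_error_devices
```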

Two drives with checksum errors in the same raidz1 vdev (ZFS’s take on RAID 5). That’s going to be a very tricky replacement. I think I’m going to either replace one disk at a time and hope each resilver goes well, or maybe…add a PCI controller back in there, build a second pool, and migrate the data from one pool to the other? That’ll be a wild trick.
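For the one-disk-at-a-time route, the rough shape would be something like the dry run below. It only prints the commands and does not run them. `plan_replace` is a made-up helper and `NEW_DISK` is a placeholder, not a real device from this box; `zpool replace` itself is the command the status output above suggests.

```shell
# Dry run: print (do not execute) the commands for swapping one disk.
# NEW_DISK is a placeholder for whatever the replacement drive shows up as.
plan_replace() {
    pool=$1 old=$2 new=$3
    echo "zpool replace $pool $old $new"
    echo "zpool status -v $pool   # repeat until the resilver finishes"
}

plan_replace tank ata-TOSHIBA_MQ01ABD050_933PTT13T NEW_DISK
```

The idea is to let the first resilver finish completely before touching the second disk, since raidz1 can only survive losing one member of a vdev at a time.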

It will frack up my backups for a while, that’s for sure. Oh, and those Toshiba drives? That’s three Toshiba failures, zero Hitachi failures.
