Backups are great. Having terabytes of space for them is now completely necessary. Filling up those terabytes is …frustrating.
My photo collection is probably much like many photo enthusiasts’: well into the hundreds of thousands of pictures. But why is it so? I make thumbnails (that’s 2x pictures), and I keep a low-res and a full-res picture besides the thumbnails (3x). I’ve started keeping both my RAW (DNG) and my full-res JPG files (4x pictures).
And that is before any madness with the backups. What happens when you have copies of the same SD card on two computers, discover them months later, and don’t have time to un-dup those few hundred? Having multiple photo editing tools doesn’t necessarily make it easier. I use both GIMP and Darktable. I batch-create my thumbnails using a shell script that drives ImageMagick, of course. And then what piece of backup software would ever warn you if you started backing up your photos in overlapping locations? I might have 6x the pictures I actually took!
Look at this listing of photos, for example:

    val_memorial168.JPG;2484792;/tank/pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/big
    val_memorial168.JPG;2484792;/tank/pictures/Pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/big
    val_memorial168.JPG;2484792;/tank/pictures/tank/pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/big
    val_memorial168.JPG;2484792;/tank/pictures/tank/pictures/Pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/big
    val_memorial168-small.jpg;29530;/tank/pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/thumb
    val_memorial168-small.jpg;29530;/tank/pictures/Pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/thumb
    val_memorial168-small.jpg;29530;/tank/pictures/tank/pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/thumb
    val_memorial168-small.jpg;29530;/tank/pictures/tank/pictures/Pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/thumb
    val_memorial168-small.jpg;355505;/tank/pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts
    val_memorial168-small.jpg;355505;/tank/pictures/Pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts
    val_memorial168-small.jpg;355505;/tank/pictures/tank/pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts
    val_memorial168-small.jpg;355505;/tank/pictures/tank/pictures/Pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts

Shell scripting to the rescue, right? My tactics are roughly this:
- sort the photos by name (not by path)
- and by size (similarly sized photos are often dups)
- generate md5 sums of the start of the photo (the start is likely going to be as indicative as the whole photo)
- sort by all of the above
- try not to use perl :-)
Does it take a fancy data structure to sort this information? No. The trick is to re-arrange it in lines of text so that the things you most want to sort by are on the left side of the line.
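Here is a minimal sketch of that trick, using made-up sample lines in the same `name;size;directory` shape as the listing above. Because name and size lead each line, a plain `sort` parks likely duplicates next to each other, and `cut` plus `uniq -d` then reports every name/size pair that shows up more than once. The helper name `dup_keys` and the sample paths are assumptions for illustration only.

```shell
#!/bin/bash
# Sketch: lines look like "name;size;directory", so a plain sort groups by
# name first and size second. dup_keys is a hypothetical helper name.
dup_keys() {
  # Read "name;size;dir" lines on stdin and print each name;size pair
  # that occurs more than once.
  LC_ALL=C sort | cut -d';' -f1,2 | uniq -d
}

# Made-up sample lines, deliberately out of order.
printf '%s\n' \
  'b.jpg;200;/pics/2006' \
  'a.jpg;100;/pics/2006/big' \
  'a.jpg;100;/pics/copy/big' \
  'a.jpg;999;/pics/2006' \
  | dup_keys
# prints: a.jpg;100
```

Same name and same size is only a hint, of course, which is why the hashing step comes next.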
And to wit, I doth bust out said magic thusly:

    #!/bin/bash
    ## ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    ##
    ## Script intended to identify duplicately named
    ## files and also files with likely identical contents.
    ##
    ## ----- ----- ----- ----- ----- ----- ----- ----- ----- -----

    PIC_DIR=/tank/pictures
    VAR_DIR=/home/jreynolds/var
    # Clear out the lists and split chunks left over from the previous run
    rm -f "$VAR_DIR"/*.txt
    rm -f "$VAR_DIR"/*.a*
    F01_UNSORTED_NAMES="$VAR_DIR/unsorted.txt"
    F02_SORTED_NAMES="$VAR_DIR/sorted.txt"
    F03_WITH_HASHES="$VAR_DIR/hashes.txt"
    F04_SORTED_HASHES="$VAR_DIR/sorted-hashes.txt"
    # One line per file: name;size;directory
    # | head -100 \
    find "$PIC_DIR" -type f -printf "%f;%s;%h\n" \
      > "$F01_UNSORTED_NAMES"
    wc -l "$F01_UNSORTED_NAMES"

    sort "$F01_UNSORTED_NAMES" > "$F02_SORTED_NAMES"
    line_ct=$(wc -l < "$F02_SORTED_NAMES")
    lines_per=$(( (line_ct / 8) + 1 ))
    echo "This is the line count per file: $lines_per"

    i=0
    rm -f "$F03_WITH_HASHES".*
    # Split the sorted list into (at most) 8 chunks: sorted.aa, sorted.ab, ...
    split -l "$lines_per" "$F02_SORTED_NAMES" "${F02_SORTED_NAMES/.txt/}."
    ls -l "$VAR_DIR"/*

    # Hash each chunk in its own background process
    for J in "$VAR_DIR"/sorted.a* ; do
      echo "== $J == started"
      bash -c "$HOME/bin/hash_list.sh $J" &
    done
    # Poll once a second until no hash_list.sh processes remain
    n=8
    while [ $n -gt 0 ]; do
      sleep 1
      n=$(pgrep -lf hash_list.sh | wc -l )
      echo -n "$n processing "
    done

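The script above polls `pgrep` once a second to find out when the workers are done. A possible alternative, sketched here under my own assumptions, is the bash builtin `wait`, which blocks until every background job started by the current shell has exited. The `process_chunk` function is a hypothetical stand-in for `hash_list.sh`, and the 20-line input and the choice of 4 chunks are made up to keep the demo small.

```shell
#!/bin/bash
# Sketch: split a list into chunks, work on each chunk in the background,
# then block with `wait` instead of polling pgrep.
# process_chunk is a hypothetical stand-in for hash_list.sh.
process_chunk() {
  # Stand-in work: write the chunk's line count to <chunk>.count
  wc -l < "$1" > "$1.count"
}

work_dir=$(mktemp -d)
seq 1 20 > "$work_dir/sorted.txt"   # made-up 20-line "file list"

# Same arithmetic as the script above, with 4 chunks instead of 8.
line_ct=$(wc -l < "$work_dir/sorted.txt")
lines_per=$(( (line_ct / 4) + 1 ))
split -l "$lines_per" "$work_dir/sorted.txt" "$work_dir/sorted."

for chunk in "$work_dir"/sorted.a? ; do
  process_chunk "$chunk" &
done
wait   # returns once every background job has exited

cat "$work_dir"/sorted.a?.count
```

The trade-off is that `wait` stays silent while the workers run, so you lose the running "n processing" ticker.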
And the careful reader will wonder, wth is hash_list.sh? You’d better wonder. It shows one way to incorporate exactly one efficient disk read per file into a program. Behold:
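
    #!/bin/bash
    [ -z "$1" ] && echo "please specify file list, bye." && exit 1

    cat "$1" | while read -r F ; do
      # Split the "name;size;directory" line into its three parts
      IFS=\; read -ra hunks <<< "$F"
      # [...] part of the loop body is missing here; it ended by appending to $1.hmac
      i=$(( i + 1 ))
      # Print a progress dot every 50 files
      [ $(( i % 50 )) == 0 ] && echo -n "."
    done
    wc -l "$1.hmac"

    sort "$1.hmac" > "$1.sorted"
    wc -l "$1.sorted"
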
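For the curious, here is a hedged sketch of what the per-file step can look like: read only the first chunk of each file with `head -c`, pipe it to `md5sum`, and glue the result in front of the original list line, which gives the same `hash -,name;size;directory` shape as the output lines further down. The byte count `HEAD_BYTES` and the function name `hash_head` are my assumptions for illustration, not values taken from the script. `md5sum` is the GNU coreutils tool; on a Mac the equivalent is `md5`.

```shell
#!/bin/bash
# Sketch: hash only the first bytes of a file, so each file costs one short read.
# HEAD_BYTES and hash_head are assumptions for illustration.
HEAD_BYTES=65536

hash_head() {
  # $1 is one "name;size;directory" line from the file list
  local name size dir
  IFS=';' read -r name size dir <<< "$1"
  # md5sum reading stdin prints "<hash>  -"; append the list line after a comma
  printf '%s,%s\n' "$(head -c "$HEAD_BYTES" "$dir/$name" | md5sum)" "$1"
}

# Demo with two identical throwaway files and one different file.
demo_dir=$(mktemp -d)
printf 'same bytes' > "$demo_dir/a.jpg"
cp "$demo_dir/a.jpg" "$demo_dir/b.jpg"
printf 'other bytes' > "$demo_dir/c.jpg"

hash_head "a.jpg;10;$demo_dir"
hash_head "b.jpg;10;$demo_dir"
hash_head "c.jpg;11;$demo_dir"
```

A matching hash only says the first `HEAD_BYTES` bytes agree, so treat it as a strong hint and compare the full files before deleting anything.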
Can you even do this on Windows? I’d be surprised if you could do it without at least Cygwin. However, this is all that is required: the bash programming environment. Could you do this on a Mac? Yeah. Could you do it on an iPad? Pppppt…you’re joking. Could you do it on a Samsung Android tablet? Yeah, if you wanted to watch it…melt. You’d have to install the ssh client and terminal apps of course. Enough!
Now see what I discover? Broad swaths of duplication are waiting to be uncovered:

    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/0000-incoming/in-2012-12-09
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/0000-incoming/in-2012-12-09/big
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/9999-Source/2011/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/9999-Source/2011/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/9999-Source/2011/2011-08-31b
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG~;2301225;/tank/pictures/9999-Source/2012/in-2012-12-09
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/9999-Source/2012/in-2012-12-09
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/9999-Source/2012/in-2012-12-09b/big
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/9999-Source/2013/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/9999-Source/2013/2011/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/home/Pictures/9999-Source/2011/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/home/Pictures/9999-Source/2013/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/home/Pictures/9999-Source/2013/2011/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/home/Pictures/9999-Source/in-2012-12-09
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/Pictures/9999-Source/2011/2011-08-31b
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/Pictures/9999-Source/2011/ff/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/Pictures/9999-Source/2012/ff/in-2012-12-09
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/Pictures/9999-Source/2012/ff/in-2012-12-09/big
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/Pictures/9999-Source/2012/in-2012-12-09
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/Pictures/9999-Source/2012/in-2012-12-09b/big
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/Pictures/9999-Source/2012/zz/in-2012-12-09
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/9999-Source/2011/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/9999-Source/2011/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/9999-Source/2011/2011-08-31b
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/9999-Source/2011/2011-08-31-c
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/9999-Source/2011/2011-08-31-d
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/9999-Source/2011/2011-08-31-f
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG~;2301225;/tank/pictures/tank/pictures/9999-Source/2012/in-2012-12-09
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/9999-Source/2012/in-2012-12-09
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/9999-Source/2012/in-2012-12-09b/big
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/9999-Source/2013/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/9999-Source/2013/2011/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/9999-Source/in-2012-12-09
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/home/Pictures/9999-Source/2013/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/home/Pictures/9999-Source/in-2012-12-09
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2011/2011-08-31b
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2011/ff/2011-08-31
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2012/ff/in-2012-12-09
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2012/ff/in-2012-12-09/big
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2012/in-2012-12-09
    000173549decf06ae5c858ea1eccfcca -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2012/in-2012-12-09b/big
    000173549decf06ae5c858ea1eccfcca -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2012/zz/in-2012-12-09

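Scrolling through a listing like that gets old fast, so here is a small sketch for boiling it down to one line per repeated hash with a count. It assumes the `hash -,name;size;directory` layout shown above, where everything before the first comma is the `md5sum` output. The helper name `dup_counts` and the three sample lines are made up for illustration.

```shell
#!/bin/bash
# Sketch: count how many list lines share each hash.
# dup_counts is a hypothetical helper name; the sample data is made up.
dup_counts() {
  # Field 1 (before the first comma) is the md5sum output, e.g. "aaa  -".
  # Print "count hash" for every hash seen more than once.
  cut -d, -f1 | LC_ALL=C sort | uniq -c | awk '$1 > 1 { print $1, $2 }'
}

printf '%s\n' \
  'aaa  -,IMGP1.JPG;10;/pics/x' \
  'bbb  -,IMGP2.JPG;20;/pics/x' \
  'aaa  -,imgp1.jpg;10;/pics/y' \
  | dup_counts
# prints: 2 aaa
```

Fed the sorted hash file, this would give one line per duplicated photo, which makes it easier to see which photos are worst.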