Sorting through thousands of photos

Backups are great. Having terabytes of space for them is now completely necessary. Filling up those terabytes is … frustrating.

My photo collection is probably much like many photo enthusiasts': well into the hundreds of thousands of pictures. But why is it so? I make thumbnails (that’s 2x pictures), and I keep a low-res and a full-res picture besides the thumbnails (3x). I’ve started keeping both my RAW (DNG) and my full-res JPG files (4x pictures).

And that is before any madness with the backups. What happens when you have copies of the same SD card on two computers, you discover them months later, and you don’t have time to un-dup those few hundred? Having multiple photo editing tools doesn’t necessarily make it easier. I use both GIMP and Darktable. I batch create my thumbnails using a shell script that drives ImageMagick, of course. And then what piece of backup software would ever warn you if you started backing up your photos in overlapping locations? I might have 6x the pictures I actually took!

Look at this listing of photos, for example:

val_memorial168.JPG;2484792;/tank/pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/big
val_memorial168.JPG;2484792;/tank/pictures/Pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/big
val_memorial168.JPG;2484792;/tank/pictures/tank/pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/big
val_memorial168.JPG;2484792;/tank/pictures/tank/pictures/Pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/big
val_memorial168-small.jpg;29530;/tank/pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/thumb
val_memorial168-small.jpg;29530;/tank/pictures/Pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/thumb
val_memorial168-small.jpg;29530;/tank/pictures/tank/pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/thumb
val_memorial168-small.jpg;29530;/tank/pictures/tank/pictures/Pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts/thumb
val_memorial168-small.jpg;355505;/tank/pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts
val_memorial168-small.jpg;355505;/tank/pictures/Pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts
val_memorial168-small.jpg;355505;/tank/pictures/tank/pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts
val_memorial168-small.jpg;355505;/tank/pictures/tank/pictures/Pictures/9999-Source/2006/2006-05-13-memorial-tey-roberts

Shell scripting to the rescue, right? My tactics are roughly this:

  • sort the photos by name (not by path)
  • and by size (similarly sized photos are often dups)
  • generate md5 sums of the start of the photo (the start is likely to be as indicative as the whole photo)
  • sort by all of the above
  • try not to use perl :-)

Does it take a fancy data structure to sort this information? No. The trick is to re-arrange it in lines of text so that the things you most want to sort by are on the left side of the line.
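
A tiny sketch of that idea, using made-up records rather than real data from the photo disk: each line is "name;size;directory", so a plain `sort` puts same-named, same-sized files next to each other. The file names and paths below are hypothetical.

```shell
#!/bin/bash
# Hypothetical records in "name;size;directory" form (made up for illustration).
records='b.jpg;200;/pics/2012
a.jpg;100;/pics/backup/2011
b.jpg;200;/pics/backup/2012
a.jpg;100;/pics/2011'

# With the name leftmost, a plain byte-wise sort puts same-named files on adjacent lines.
sorted=$(printf '%s\n' "$records" | LC_ALL=C sort)
printf '%s\n' "$sorted"

# Collect every "name;size" pair that shows up on more than one line.
dups=$(printf '%s\n' "$sorted" | cut -d';' -f1,2 | uniq -c | awk '$1 > 1 {print $2}')
echo "seen more than once:"
printf '%s\n' "$dups"
```

No data structure beyond lines of text is involved; changing the field order in the records changes what the sort groups by.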

And to wit, I doth bust out said magic thusly:

    #!/bin/bash
    ##  ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    ##
    ##    Script intended to identify duplicately named
    ##    files and also files with likely identical contents.
    ##
    ##  ----- ----- ----- ----- ----- ----- ----- ----- ----- -----

    PIC_DIR=/tank/pictures
    VAR_DIR=/home/jreynolds/var
    # Clear out the results of any previous run.
    rm -f "$VAR_DIR"/*.txt
    rm -f "$VAR_DIR"/*.a*
     F01_UNSORTED_NAMES="$VAR_DIR/unsorted.txt"
       F02_SORTED_NAMES="$VAR_DIR/sorted.txt"
        F03_WITH_HASHES="$VAR_DIR/hashes.txt"
      F04_SORTED_HASHES="$VAR_DIR/sorted-hashes.txt"

    # List every file as "name;size;directory" so a plain sort groups by
    # name first, then size. (For a quick test run, pipe find through
    # "head -100" before the redirect.)
    find "$PIC_DIR" -type f -printf "%f;%s;%h\n" > "$F01_UNSORTED_NAMES"
    wc -l "$F01_UNSORTED_NAMES"

    sort "$F01_UNSORTED_NAMES" > "$F02_SORTED_NAMES"
    line_ct=$(wc -l < "$F02_SORTED_NAMES")
    lines_per=$(( (line_ct / 8) + 1 ))
    echo "This is the line count per file: $lines_per"

    # Split the sorted list into at most 8 chunks: sorted.aa, sorted.ab, ...
    rm -f "$F03_WITH_HASHES".*
    split -l "$lines_per" "$F02_SORTED_NAMES" "${F02_SORTED_NAMES%.txt}."
    ls -l "$VAR_DIR"/*

    # Hash each chunk in its own background process.
    for J in "$VAR_DIR"/sorted.a* ; do
       echo "== $J == started"
       "$HOME/bin/hash_list.sh" "$J" &
    done
    # Poll once a second until no hash_list.sh processes remain.
    n=8
    while [ "$n" -gt 0 ]; do
       sleep 1
       n=$(pgrep -f hash_list.sh | wc -l)
       echo -n "$n processing "
    done
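
The pgrep polling loop at the end does the job, but bash also has a built-in `wait` that blocks until every background job has exited. Here is a minimal sketch of the same split-then-fan-out pattern under stated assumptions: a temp directory stands in for $VAR_DIR, a list of numbers stands in for the sorted file list, and `process_chunk` is a hypothetical worker in place of hash_list.sh.

```shell
#!/bin/bash
# Sketch: split a list into chunks, process each chunk in the background,
# then block on the shell's built-in `wait` instead of polling with pgrep.
work=$(mktemp -d)                 # stand-in for $VAR_DIR
seq 1 20 > "$work/sorted.txt"     # stand-in for the sorted file list

# Hypothetical worker; the real script would call hash_list.sh here.
process_chunk() {
   wc -l < "$1" > "$1.done"
}

line_ct=$(wc -l < "$work/sorted.txt")
lines_per=$(( (line_ct / 8) + 1 ))
split -l "$lines_per" "$work/sorted.txt" "$work/sorted."

# The glob is expanded once, before any ".done" files exist.
for chunk in "$work"/sorted.a? ; do
   process_chunk "$chunk" &
done
wait                              # returns once every background job has exited

done_ct=$(ls "$work"/sorted.a?.done | wc -l)
echo "chunks processed: $done_ct"
rm -rf "$work"
```

The trade-off is that `wait` prints no progress while it blocks, so the polling loop is still the better fit if you like watching the counter tick down.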

And the careful reader will wonder, wth is hash_list.sh? You’d better wonder. It shows one way to get by with a single, small disk read per file. Behold:

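    #!/bin/bash
    [ -z "$1" ] && echo "please specify file list, bye." && exit 1

    # Hash only the first HEAD_BYTES of each file. The value here is an
    # assumed placeholder; pick whatever prefix length suits your photos.
    HEAD_BYTES=65536
    i=0
    while IFS= read -r F ; do
       # Each input line is "name;size;directory".
       IFS=\; read -ra hunks <<< "$F"
       # md5sum prints "<hash>  -" for stdin; append the original line after a comma.
       echo "$(head -c "$HEAD_BYTES" "${hunks[2]}/${hunks[0]}" | md5sum),$F" >> "$1.hmac"
       i=$(( i + 1 ))
       [ $(( i % 50 )) -eq 0 ] && echo -n "."
    done < "$1"
    wc -l "$1.hmac"

    # Sorting by the hash (leftmost on each line) groups likely duplicates.
    sort "$1.hmac" > "$1.sorted"
    wc -l "$1.sorted"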
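
One caveat worth spelling out: a matching hash of the start of a file only marks a candidate duplicate, since two files can share a header and differ later. A hedged sketch of confirming a suspected pair with `cmp` follows. The throwaway files, the `prefix_hash` helper, and the 16-byte prefix are all made up for illustration.

```shell
#!/bin/bash
# Sketch: a matching prefix hash marks a *candidate* duplicate;
# cmp (a full byte-for-byte comparison) confirms it.
tmp=$(mktemp -d)
printf 'same-header-bytes AAAA' > "$tmp/one.jpg"
printf 'same-header-bytes AAAA' > "$tmp/two.jpg"
printf 'same-header-bytes BBBB' > "$tmp/three.jpg"

# Hash of the first 16 bytes of a file; 16 is arbitrary, for illustration only.
prefix_hash() {
   head -c 16 "$1" | md5sum | cut -d' ' -f1
}

h1=$(prefix_hash "$tmp/one.jpg")
h2=$(prefix_hash "$tmp/two.jpg")
h3=$(prefix_hash "$tmp/three.jpg")

# All three files share their first 16 bytes, so all three prefix hashes match...
[ "$h1" = "$h2" ] && [ "$h1" = "$h3" ] && echo "prefix hashes match"

# ...but only one.jpg and two.jpg are really identical.
cmp -s "$tmp/one.jpg" "$tmp/two.jpg"   && same12=yes || same12=no
cmp -s "$tmp/one.jpg" "$tmp/three.jpg" && same13=yes || same13=no
echo "one vs two: $same12, one vs three: $same13"
rm -rf "$tmp"
```

Running `cmp` costs a full read of both files, so it makes sense only for the pairs the cheap prefix hash has already flagged, and certainly before deleting anything.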

Can you even do this on Windows? I’d be surprised if you could do it without at least Cygwin. However, this is all that is required: the bash programming environment. Could you do this on a Mac? Yeah. Could you do it on an iPad? Pppppt…you’re joking. Could you do it on a Samsung Android tablet? Yeah, if you wanted to watch it…melt. You’d have to install the ssh client and terminal apps of course. Enough!

Now see what I discover? Broad swaths of duplication are waiting to be uncovered:

000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/0000-incoming/in-2012-12-09
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/0000-incoming/in-2012-12-09/big
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/9999-Source/2011/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/9999-Source/2011/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/9999-Source/2011/2011-08-31b
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG~;2301225;/tank/pictures/9999-Source/2012/in-2012-12-09
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/9999-Source/2012/in-2012-12-09
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/9999-Source/2012/in-2012-12-09b/big
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/9999-Source/2013/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/9999-Source/2013/2011/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/home/Pictures/9999-Source/2011/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/home/Pictures/9999-Source/2013/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/home/Pictures/9999-Source/2013/2011/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/home/Pictures/9999-Source/in-2012-12-09
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/Pictures/9999-Source/2011/2011-08-31b
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/Pictures/9999-Source/2011/ff/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/Pictures/9999-Source/2012/ff/in-2012-12-09
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/Pictures/9999-Source/2012/ff/in-2012-12-09/big
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/Pictures/9999-Source/2012/in-2012-12-09
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/Pictures/9999-Source/2012/in-2012-12-09b/big
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/Pictures/9999-Source/2012/zz/in-2012-12-09
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/9999-Source/2011/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/9999-Source/2011/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/9999-Source/2011/2011-08-31b
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/9999-Source/2011/2011-08-31-c
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/9999-Source/2011/2011-08-31-d
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/9999-Source/2011/2011-08-31-f
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG~;2301225;/tank/pictures/tank/pictures/9999-Source/2012/in-2012-12-09
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/9999-Source/2012/in-2012-12-09
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/9999-Source/2012/in-2012-12-09b/big
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/9999-Source/2013/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/9999-Source/2013/2011/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/9999-Source/in-2012-12-09
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/home/Pictures/9999-Source/2013/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/home/Pictures/9999-Source/in-2012-12-09
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2011/2011-08-31b
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2011/ff/2011-08-31
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2012/ff/in-2012-12-09
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2012/ff/in-2012-12-09/big
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2012/in-2012-12-09
000173549decf06ae5c858ea1eccfcca  -,imgp2355.jpg;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2012/in-2012-12-09b/big
000173549decf06ae5c858ea1eccfcca  -,IMGP2355.JPG;2301225;/tank/pictures/tank/pictures/Pictures/9999-Source/2012/zz/in-2012-12-09
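
A listing like this gets long fast, so a per-hash count can help decide where to start. The sketch below assumes lines in the same "<hash>  -,name;size;directory" shape. The sample lines are shortened, made-up stand-ins, not real output.

```shell
#!/bin/bash
# Sketch: count how many entries share each hash in "<hash>  -,name;size;dir" lines.
# The sample lines are made-up stand-ins for the real output.
listing='aaaa0000aaaa0000aaaa0000aaaa0000  -,x.jpg;10;/pics/a
aaaa0000aaaa0000aaaa0000aaaa0000  -,x.jpg;10;/pics/b
aaaa0000aaaa0000aaaa0000aaaa0000  -,X.JPG;10;/pics/c
bbbb1111bbbb1111bbbb1111bbbb1111  -,y.jpg;20;/pics/a'

# An md5 hash is the first 32 characters; count repeats, biggest groups first.
counts=$(printf '%s\n' "$listing" | cut -c1-32 | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"

# Keep only the hashes that appear more than once.
dup_hashes=$(printf '%s\n' "$counts" | awk '$1 > 1 {print $2}')
echo "hashes with duplicates: $dup_hashes"
```

Because each chunk gets its own `.sorted` file, the real input would first need the chunks combined, for example with `sort -m`, which merges files that are already sorted.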