
A reminder of why I keep backups! A few days ago, I decided to finally archive my copies of the defunct blackmarket Drugslist; since I had multiple mirrors of it, I used my shell script for the sort-key trick ( http://www.gwern.net/Archiving%20URLs#sort---key-compression-trick ) which rearranges the order in which files are compressed so duplicates are put next to each other and take up ~0 space in the compressed archive. I noticed that the archive seemed almost too small, so I unpacked it and compared the size. It was smaller than the original. Oh shit.

I remembered that I had used that exact shell command to archive around a dozen defunct blackmarkets a few weeks ago when I wanted to free up some more disk space. Oh shit.

I've spent easily a hundred hours at this point mirroring the blackmarkets, and it would be... unfortunate if I was never able to run the analyses I wanted because I corrupted my data and didn't have backup copies and all the blackmarkets in question were 100% defunct and further data impossible to collect. Very unfortunate. Very very unfortunate.

I investigated the command, wondering if there was some sort of formatting error or newline-in-filename or something. It turned out that I had compressed all my black-market mirrors with a short script that included the command 'sort --unique --key=3', and the 'key=3' means that the uniquifying is done on the filename, not the full path as it usually would be. (Why did I include '--unique' at all? Force of habit, and normally it would be a no-op.) So each mirror of the same market is now effectively a diff, rather than a mirror. This renders them all partially useless, since now I know only the date of the first appearance of any seller or drug listing, and I don't know when they disappeared, so I can't estimate any lengths or durations. This cost me perhaps 1GB of uncompressed data.
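To make the failure concrete, here is a minimal sketch of how '--unique' combined with '--key=3' drops files. The file names are made up, and it assumes the script also set '/' as the field separator (so that field 3 is the path inside each dated mirror directory); neither detail is spelled out above.

```shell
# Hypothetical file list for two mirrors of the same market (paths are made up).
list_files() {
  printf '%s\n' \
    './2014-01-01/index.html' \
    './2014-01-01/listing-1.html' \
    './2014-02-01/index.html' \
    './2014-02-01/listing-2.html'
}

# Assumption: '/' is the field separator, so the sort key starts at the
# path *inside* each mirror directory and ignores the mirror's date.
safe=$(list_files | LC_ALL=C sort --key=3 --field-separator=/)
lossy=$(list_files | LC_ALL=C sort --unique --key=3 --field-separator=/)

echo "without --unique: $(printf '%s\n' "$safe" | wc -l) files"
echo "with --unique:    $(printf '%s\n' "$lossy" | wc -l) files"
printf '%s\n' "$lossy"
```

Both mirrors contain an index.html, but because '--unique' compares only the key, just one copy reaches the archive. Dropping '--unique' keeps the duplicates adjacent (which is the point of the trick) without discarding any of them.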

Good news: I keep backups! Phew!

Bad news: most of my backups are simple rsync mirrors, since the external drives aren't big enough to store any history. Because I did the archiving on May 10th, all of the rsync mirrors had been updated between then and when I realized my mistake, so they were useless. I did have one duplicity backup which kept a history; it took an hour or two to figure out the right invocation, chew through all the incremental backups, and extract them, but I managed to pull out the uncompressed directories.

Done? No. Turned out that after archiving I had also edited some of the mirrors (to delete junk and improve the SR1 forum compilation) and added the defunct Pigeon Marketplace to the collection, so I couldn't just delete my junk compressed files and swap in the recovered files.

Eventually I just extracted Pigeon Marketplace from the duplicity history too, recompressed all the recovered directories (sans the deadly `--unique` option), compared the sizes of pairs of archives, and kept the larger. Seems to've worked.
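The "keep the larger" step can be sketched roughly as below. This is an illustration under assumptions, not the commands actually used: the helper name `keep_larger` and the file names are hypothetical, and it relies on the heuristic that the bigger archive of a pair is the one that lost fewer files.

```shell
# keep_larger A B: print the path of whichever file is larger in bytes
# (A wins ties). Hypothetical helper, for illustration only.
keep_larger() {
  size_a=$(($(wc -c < "$1")))
  size_b=$(($(wc -c < "$2")))
  if [ "$size_a" -ge "$size_b" ]; then
    printf '%s\n' "$1"
  else
    printf '%s\n' "$2"
  fi
}

# Demo with two stand-in files of different sizes (not real archives).
tmp=$(mktemp -d)
printf 'aaaa' > "$tmp/old.tar.xz"
printf 'aaaaaaaa' > "$tmp/recovered.tar.xz"
winner=$(keep_larger "$tmp/old.tar.xz" "$tmp/recovered.tar.xz")
echo "keep: $winner"
```

Size is only a proxy; comparing the file listings of the two archives would be a stricter check.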

So, a subtle, hard-to-see error corrupted my data in an important way, but eventually, backups saved my ass. If you aren't keeping incremental backups of your own stuff, you either don't value what you're doing or don't realize your danger.

#backup #backups