If you are mirroring websites, or otherwise compiling a lot of directories with redundant data on a file-by-file level, there's a cute trick to massively improve your compression ratio: sort the file list by subdirectory rather than the usual alphabetical order. (I learned about this trick a few years ago while messing around with archiving my home directory using `find` and `tar`.) Take my black-market mirrors: I download a site each time as a separate `wget` run in a timestamped folder, so in my Silk Road 2 folder, I have `2013-12-25/` and `2014-01-15/`. These share a lot of similar, if not identical, files, so they compress together pretty well with `xz`, down from 1.8GB to 0.3GB.
But they could compress even better: similar files may be thousands of files and hundreds of megabytes apart in alphabetical or file-inode order, so even a top-notch compressor with a very large window will fail to spot many long-range redundancies. In between `2013-12-25/default.css` and `2014-01-15/default.css` are going to be all sorts of files which have nothing to do with CSS, like `2014-01-16/items/2-grams-of-pure-vanilla-ketamine-powder-dutchdope?vendor_feedback_page=5` and `2014-01-16/items/10-generic-percocet-10-325-plus-1-30mg-morphine`. You see the problem: because we sort the files as 'all files starting with "2013"' and then 'all files starting with "2014"', we lose all proximity. If instead we could sort by subfolder first and top-level folder second, everything would line up nicely. Fortunately, we can do this! `sort` supports exactly this functionality: we can feed it a file list, tell it to break filenames on "/", and tell it to start comparing at a lower-level field. Done right, we get output like `2013-12-25/default.css` immediately followed by `2014-01-15/default.css`, which does wonders for our compression, and which pays ever larger dividends as we accumulate more partially-redundant mirrors.
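As a toy illustration (the paths here are hypothetical, not from my mirrors), here is the field-based sort in action on GNU `sort`:

```shell
# With "/" as the field separator, --key=2 starts the comparison at
# the second field, i.e. the path *below* the date folder, so
# identical subpaths from different mirrors sort next to each other.
# (The commands later in this post use --key=3 because `find .`
# prefixes each path with "./", which adds an extra field.)
printf '%s\n' \
  '2013-12-25/default.css' \
  '2013-12-25/index.html' \
  '2014-01-15/default.css' \
  '2014-01-15/index.html' |
  sort --field-separator=/ --key=2
# Output groups the two default.css copies, then the two index.html
# copies:
#   2013-12-25/default.css
#   2014-01-15/default.css
#   2013-12-25/index.html
#   2014-01-15/index.html
```

Ties on the key (both lines end in `default.css`) fall back to `sort`'s whole-line last-resort comparison, which is why the dates still come out in order within each group.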
Here is an example of the output for my Pandora mirrors, of which, thanks to frequent rumors of its demise triggering mirroring on my part, I have 5; if we employ the sort-key trick (`find . -type f | sort --unique --key=3 --field-separator="/"`; the key is 3 rather than 2 because the leading `./` from `find` adds an extra field), we find a lot of similar-sounding files:
./2014-01-15/profile/5a66e5238421f0422706b267b735d2df/6
./2014-01-16/profile/5a9df4f5482d55fb5a8997c270a1e22d
./2013-12-25/profile/5a9df4f5482d55fb5a8997c270a1e22d/1
./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d.1
./2013-12-25/profile/5a9df4f5482d55fb5a8997c270a1e22d/2
./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d.2
./2013-12-25/profile/5a9df4f5482d55fb5a8997c270a1e22d/3
./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d/4
./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d/5
./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d/6
./2013-12-25/profile/5abb81db167294478a23ca110284c587
./2013-12-25/profile/5acc44d370e305e252dd4e2b91fda9d0/1
./2014-01-15/profile/5acc44d370e305e252dd4e2b91fda9d0.1
./2013-12-25/profile/5acc44d370e305e252dd4e2b91fda9d0/2
./2014-01-15/profile/5acc44d370e305e252dd4e2b91fda9d0.2
Note the interleaving of 3 different mirrors in this excerpt, impossible in a normal left-to-right alphabetical sort. You can bet that these 4 files (in 15 versions) are going to compress much better than if they were separated by a few thousand other profile pages.
So here's an example invocation (doing the whole thing in one pipeline to avoid slow disk I/O):
find . -type f -print0 | sort --zero-terminated --unique --key=3 --field-separator="/" | tar --no-recursion --null --files-from - -c | xz -9 --extreme --stdout > ../mirror.tar.xz
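One caveat worth knowing about (my observation, not a claim from the original post): GNU `sort`'s `--unique` deduplicates by the comparison key alone, so combined with `--key=3` it treats two mirrors' copies of the same subpath as duplicates and passes only one of them on to `tar`. A minimal demonstration with two hypothetical paths:

```shell
# With --unique plus a key, lines whose key (field 3 onward) compares
# equal are considered duplicates, even though the full paths differ:
printf '%s\n' \
  './2013-12-25/default.css' \
  './2014-01-15/default.css' |
  sort --unique --key=3 --field-separator=/
# Only one of the two lines is printed; omit --unique if the archive
# must contain every mirror's copy of every file.
```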
Used on my 2 Silk Road 2 mirrors, which together weigh 1800M, a normal run without the `--key`/`--field-separator` options yields a 308M archive. That's not too bad; certainly much better than hauling around almost 2GB. But if I switch to the sort-key trick, the `.tar.xz` is 271M, or 37M less. Same compression algorithm, same files, same unpacked results, same speed, just 2 obscure little `sort` options... and I get an archive 88% the size of the original.
Not impressed? Well, I did say that the advantage increases with the number of mirrors to extract redundancy from, and with only 2 mirrors, the SR2 results can't be too impressive. How about the Pandora mirrors? 5 of them give the technique more scope to shine, and as expected, the comparison is even more dramatic: 71M for the sort-keyed archive vs 162M for the regular one. The sort-keyed archive is only 44% the size of the regular archive!
#commandline #compression
Note that if most of the similar files are in fact identical (CSS stylesheets, unupdated content, etc.), you get this trick for free, plus a much nicer UI for looking at progress over time, by stacking the different mirrors on top of one another as a Git history. (Jan 18, 2014)
I suppose that's possible, but I see very few mirrors distributed as Git repos, and I'm not very comfortable using Git, so I prefer separate directories.
While we're discussing alternatives, there's another I didn't mention: `lrzip` (http://ck.kolivas.org/apps/lrzip/ ; https://wiki.archlinux.org/index.php/Lrzip), an LZMA-based compressor which tries to find long-range duplications, like the ones we're turning into short-range duplications with the sort-key trick.
I haven't actually tried it yet to see how it compares, but the main problem is that it's not widely used, and that's an issue if you want to distribute archives to other people; at least `xz` is now in a lot of default Linux installations! (Jan 18, 2014)
Git was just an example; most sane versioning tools will handle this case well. Are you not comfortable with any version control system? Because you need to be.
Mind you, no, mirror dumps don't need to be distributed this way, but I version-control pretty much everything I touch, because occasionally it proves incredibly useful. (Jan 18, 2014)
I think I tried putting website mirrors in a VCS once and found that all the binary assets really bloated the repo; it was more straightforward to just keep separate copies and let the compression utilities do their job. (Jan 18, 2014)
A general-purpose attempt at optimal sorting: http://neoscientists.org/~tmueller/binsort/
Very cool-looking, but oddly enough, it underperforms the sort-key trick on both the SR2 and Pandora datasets by 5-10MB. It may be more useful when the filenames aren't so helpful for identifying redundancy. (Jan 20, 2014)
A run on Tormarket suggests that the problem is the optimization setting: `-o15` is the default, but it can go up to `-o1000`, and that seems to produce a more optimized sort. (Feb 2, 2014)