Profile

Henrik Pauli
Works at UHU Systems Kft.
97 followers|16,557 views

Stream

Henrik Pauli

Shared publicly  - 
 
:D
 
To sum up this week's news coverage from the tech world ;) 
1 comment on original post

Henrik Pauli

Shared publicly  - 
 
Wow o.o
 
Studio Kousyuuya in Nikko, Japan has seen four generations of master painters creating "Hitofude Ryuu" (Dragon with one stroke) for decades.
78 comments on original post

Henrik Pauli

Shared publicly  - 
 
Henrik Pauli was out walking. He tracked 20.35 km in 3h:37m:22s.

Torrential downpour interrupted me not far from home :C

Henrik Pauli

Shared publicly  - 
 
It's a good thing MTP was invented as a way to interface with phones' file systems, except I've never seen it work reliably.

In that sense, it reminds me of Bluetooth, which is an awesome idea executed in a truly shittastic way.

Henrik Pauli

Shared publicly  - 
 
Dear +Gmail and +Google+ Developers, why must you make the site(s) scroll down/up a page when I press Ctrl+PgDn/Ctrl+PgUp? That's a common tab-switching shortcut, but some stupid JavaScript nonsense of yours keeps overriding it :(

Henrik Pauli

Shared publicly  - 
 
hah!

Henrik Pauli

Shared publicly  - 
 
nice trick
 
If you are mirroring websites, or otherwise compiling a lot of directories with redundant data on a file-by-file level, there's a cute trick to massively improve your compression ratios: don't sort the usual alphabetical way, but sort by a subdirectory. (I learned about this trick a few years ago while messing around with archiving my home directory using `find` and `tar`.) To take my black-market mirrors: I download a site each time as a separate `wget` run in a timestamped folder. So in my Silk Road 2 folder, I have `2013-12-25/` and `2014-01-15/`. These share a lot of similar, if not identical files, so they compress together pretty well with `xz` down from 1.8GB to 0.3GB.

But they could compress even better: the similar files may be thousands of files and hundreds of megabytes apart in alphabetical or file-inode order, so even a top-notch compressor with a very large window will fail to spot many long-range redundancies. In between `2013-12-25/default.css` and `2014-01-15/default.css` is going to be all sorts of files which have nothing to do with CSS, like `2014-01-16/items/2-grams-of-pure-vanilla-ketamine-powder-dutchdope?vendor_feedback_page=5` and `2014-01-16/items/10-generic-percocet-10-325-plus-1-30mg-morphine`. You see the problem. Because we sort the files by 'all files starting with "2013"' and then 'all files starting with "2014"', we lose all proximity.

If, instead, we could sort by subfolder and then by top-level folder, then we'd have everything line up nicely... Fortunately, we can do this! `sort` supports exactly this functionality: we can feed it a file list, tell it to break filenames by "/", and then to sort on a lower level, and if we did it right, we will indeed get output like `2013-12-25/default.css` just before `2014-01-15/default.css`, which will do wonders for our compression, and which will pay ever more dividends as we accumulate more partially-redundant mirrors.

Here is an example of output for my Pandora mirrors, where, due to frequent rumors of its demise triggering mirroring on my part, I have 5 full mirrors; and naturally, if we employ the sort-key trick (`find . -type f | sort --unique --key=3 --field-separator="/"`), we find a lot of similar-sounding files:

    ./2014-01-15/profile/5a66e5238421f0422706b267b735d2df/6
    ./2014-01-16/profile/5a9df4f5482d55fb5a8997c270a1e22d
    ./2013-12-25/profile/5a9df4f5482d55fb5a8997c270a1e22d/1
    ./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d.1
    ./2013-12-25/profile/5a9df4f5482d55fb5a8997c270a1e22d/2
    ./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d.2
    ./2013-12-25/profile/5a9df4f5482d55fb5a8997c270a1e22d/3
    ./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d/4
    ./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d/5
    ./2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d/6
    ./2013-12-25/profile/5abb81db167294478a23ca110284c587
    ./2013-12-25/profile/5acc44d370e305e252dd4e2b91fda9d0/1
    ./2014-01-15/profile/5acc44d370e305e252dd4e2b91fda9d0.1
    ./2013-12-25/profile/5acc44d370e305e252dd4e2b91fda9d0/2
    ./2014-01-15/profile/5acc44d370e305e252dd4e2b91fda9d0.2

Note the interleaving of 5 different mirrors, impossible in a normal left-to-right alphabetical sort. You can bet that these 4 files (in 15 versions) are going to compress much better than if they were separated by a few thousand other profile pages.
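The interleaving can be reproduced on a toy file list. This is only a sketch: the four paths and the function names `list_files` and `sort_by_subpath` are made up for illustration, and because the paths carry no leading `./` the key starts at field 2 rather than field 3. The short options `-t` and `-k` are the POSIX spellings of `--field-separator` and `--key`.

```shell
#!/bin/sh
# Hypothetical file list standing in for two timestamped mirrors.
list_files() {
    printf '%s\n' \
        '2013-12-25/default.css' \
        '2013-12-25/items/widget' \
        '2014-01-15/default.css' \
        '2014-01-15/items/widget'
}

# Sort on everything from the 2nd "/"-separated field onward.
# Lines with equal keys are then ordered by comparing the whole line,
# so within each path the mirrors come out in date order.
sort_by_subpath() {
    LC_ALL=C sort -t / -k 2
}

list_files | sort_by_subpath
# prints:
#   2013-12-25/default.css
#   2014-01-15/default.css
#   2013-12-25/items/widget
#   2014-01-15/items/widget
```

A plain `sort` on the same list would keep both `2013-12-25` entries together ahead of both `2014-01-15` entries, which is exactly the ordering that separates near-identical files.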

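So here's an example invocation (doing everything in pipelines to avoid disk IO, which is very slow). Unlike the listing command above, it omits `--unique`: combined with `--key`, `sort` compares only the key, so files with the same path in different mirrors would count as duplicates and all but one would be left out of the archive:

    find . -type f -print0 | sort --zero-terminated --key=3 --field-separator="/" | tar --no-recursion --null --files-from - -c | xz -9 --extreme --stdout > ../mirror.tar.xz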
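One way to sanity-check an invocation like this is to confirm that the tar stream lists as many members as there are files on disk. The sketch below assumes GNU `sort` and GNU `tar`; the function names `pack_sorted` and `check_complete` are made up for illustration, and the stream is left uncompressed so it can be piped into `xz` or inspected with `tar -t`.

```shell
#!/bin/sh
# Hypothetical helper: write an uncompressed tar stream of every file under
# the current directory to stdout, ordered by the path below the top-level
# folder (field 3 of "./<mirror>/<path>").
# No --unique here: with --key it would keep only one file per sub-path.
pack_sorted() {
    find . -type f -print0 |
        LC_ALL=C sort --zero-terminated --key=3 --field-separator='/' |
        tar --no-recursion --null --files-from - -c -f -
}

# Hypothetical check: succeed only if the stream holds as many members as
# there are files on disk (assumes no newlines in file names).
check_complete() {
    on_disk=$(find . -type f | wc -l)
    in_tar=$(pack_sorted | tar -t -f - | wc -l)
    [ "$on_disk" -eq "$in_tar" ]
}
```

Run from the folder holding the mirrors, `check_complete && echo complete` reports whether every file made it into the stream, and `pack_sorted | tar -t -f - | head` shows the member order.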

Used on my 2 Silk Road 2 mirrors, which together weigh 1800M, a normal run without the `--key`/`--field-separator` options yields a 308M archive. That's not too bad, and certainly much better than hauling around almost 2GB. However, if I switch to the sort-key trick, the .tar.xz is 271M, or 37M less. Same compression algorithm, same files, same unpacked results, same speed, just 2 little obscure `sort` options... and I get an archive about 88% the size of the regular one.

Not impressed? Well, I did say that the advantage increases with the number of mirrors to extract redundancy from. With only 2 mirrors, the SR2 results can't be too impressive. How about the Pandora mirrors? Having 5 of them gives the technique more scope to shine. And as expected, it's even more impressive when I compare the Pandora archives: 71M vs 162M. The sort-keyed archive is 44% of the regular archive!
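The percentages can be rechecked from the archive sizes quoted above. This is just the arithmetic, with a made-up helper name `ratio`; the sizes in megabytes are the ones given in the post.

```shell
#!/bin/sh
# Hypothetical helper: size of the sort-keyed archive as a percentage of
# the regular archive, printed to one decimal place.
ratio() {
    awk -v keyed="$1" -v regular="$2" \
        'BEGIN { printf "%.1f%%\n", 100 * keyed / regular }'
}

ratio 271 308   # Silk Road 2, 2 mirrors: prints 88.0%
ratio 71 162    # Pandora, 5 mirrors:     prints 43.8%
```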

#commandline #compression  
7 comments on original post
Have him in circles
97 people
Damien Sim's profile photo
Bence Sztanyik's profile photo
Adam Scott's profile photo
János Mészáros's profile photo
Cégvezetők Bálja's profile photo
Zsuzsanna Cserépné Kű's profile photo
Munkhtur Bayarsaikhan's profile photo
Szabadulós Játék's profile photo
László Remete's profile photo

Henrik Pauli

Shared publicly  - 
 
disgusting.

Henrik Pauli

Shared publicly  - 
 
Brilliant, utterly brilliant.
 
Wishlist: Cognitive Dissonance Light Bulb.
Join the Simple Science and Interesting Things Community and share interesting stuff!
https://plus.google.com/communities/117518490246975838002

http://i.imgur.com/PszIF3G.jpg
3 comments on original post

Henrik Pauli

Shared publicly  - 
 
Oh fokken yes!

Henrik Pauli

Shared publicly  - 
 
Update, revoke, replace!
 
This OpenSSL bug, WHICH CAN REVEAL YOUR PRIVATE SSL KEY, has been in the wild for TWO YEARS. TWO F***ING YEARS.

I need a drink.

http://heartbleed.com/
14 comments on original post
Work
Occupation
Perl & PostgreSQL (and less so: HTML, CSS, JavaScript, jQuery, qooxdoo, Ruby (on Rails)) programmer
Employment
  • UHU Systems Kft.
    developer, 2007 - present
  • Telima Kft.
    2005 - 2007
  • OSD Kft.
    editor, 2002 - 2005
  • OSD Kft.
    developer, 2001 - 2002
Contact Information
Home
Skype
Ralesk
ICQ
37046326
Google Talk
henrik.pauli@gmail.com
Work
Email
Skype
Ralesk
Story
Bragging rights
Back in the day the creator of the first Hungarian Digimon website, ex-translator of LiveJournal, MetaJournal, BitWise IM and Facebook. Also once the creator of a Python2/KDE3 LiveJournal client simply called ljKlient.
Basic Information
Gender
Male