A small update on one of the projects we are currently working on at The State and University Library.
The State and University Library has been harvesting the Danish internet since 2005, much like the Internet Archive (https://archive.org). We have also built a search engine on top of our 500 TB of harvested Arc/Warc files. The index now holds 4 billion documents, and that covers only about 20% of the raw data so far. With the search engine we can easily search for material, whereas with the Internet Archive you have to know the URL if you want to see a historic webpage.
The index contains only the text and metadata for the harvested pages; the binaries stay in the Arc/Warc files. But we have also indexed each document's offset in its Arc/Warc file. So after the search result has been returned, we can enrich it with the binaries loaded from the Arc/Warc files. And this is what I did: I made a simple mimetype servlet that returns the binaries (images, PDF, HTML, JS, etc.).
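To make the offset lookup concrete, here is a minimal sketch in Java of the idea, not the actual servlet code. It assumes an uncompressed WARC file holding an HTTP response record. Real archives are usually gzip-compressed per record, and the older ARC format has a different header, so both need extra handling. The class name WarcRecordSketch and its fields are made up for the illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper: reads one HTTP response record from an
// UNCOMPRESSED WARC file, starting at an offset taken from the index.
final class WarcRecordSketch {
    private static final String LENGTH = "Content-Length:";
    private static final String TYPE = "Content-Type:";

    final String contentType; // mimetype from the archived HTTP headers, may be null
    final byte[] payload;     // the raw binary (image, PDF, HTML, ...)

    private WarcRecordSketch(String contentType, byte[] payload) {
        this.contentType = contentType;
        this.payload = payload;
    }

    static WarcRecordSketch read(String path, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset); // jump straight to the record, no scanning

            // 1. WARC headers end at the first empty line.
            //    Content-Length covers HTTP headers plus payload.
            long blockLength = -1;
            String line;
            while ((line = raf.readLine()) != null && !line.isEmpty()) {
                if (line.regionMatches(true, 0, LENGTH, 0, LENGTH.length())) {
                    blockLength = Long.parseLong(line.substring(LENGTH.length()).trim());
                }
            }
            if (blockLength < 0) {
                throw new IOException("No Content-Length in WARC header at offset " + offset);
            }
            long blockStart = raf.getFilePointer();

            // 2. Archived HTTP headers also end at an empty line.
            String contentType = null;
            while ((line = raf.readLine()) != null && !line.isEmpty()) {
                if (line.regionMatches(true, 0, TYPE, 0, TYPE.length())) {
                    contentType = line.substring(TYPE.length()).trim();
                }
            }

            // 3. Whatever is left of the block is the binary payload.
            long payloadLength = blockLength - (raf.getFilePointer() - blockStart);
            if (payloadLength < 0 || payloadLength > Integer.MAX_VALUE) {
                throw new IOException("Unexpected payload length: " + payloadLength);
            }
            byte[] payload = new byte[(int) payloadLength];
            raf.readFully(payload);
            return new WarcRecordSketch(contentType, payload);
        }
    }
}
```

A servlet built on this would take the file name and offset from the search result, call WarcRecordSketch.read(path, offset), set the response content type from contentType and write payload to the response. The point of the design is that the offset makes every lookup a single seek, no matter how large the archive file is.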
This makes things a lot faster for the internet researchers, who now get the images included directly in the search result, or a download button for a PDF etc. Before, they had to load the webpage in the 'Wayback' engine.
Our Isilon storage, which holds the 500 TB of Arc/Warc data, has no problem returning the binaries almost instantly, even though a result page loads 20 images simultaneously through the servlet and every image is several MB. The servlet resizes the images for the search result, but a download is of course the original content.
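The resizing step could look roughly like the sketch below. It is an illustration using only the standard java.awt classes, not the servlet's own code. The class name ThumbnailSketch and the maxWidth/maxHeight parameters are assumptions made for the example.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper: shrinks an image to fit a bounding box
// while keeping its aspect ratio. Never enlarges small images.
final class ThumbnailSketch {

    static BufferedImage scaleToFit(BufferedImage source, int maxWidth, int maxHeight) {
        // Use the smaller of the two ratios so both sides fit; cap at 1.0.
        double factor = Math.min(1.0, Math.min(
                (double) maxWidth / source.getWidth(),
                (double) maxHeight / source.getHeight()));
        int width = Math.max(1, (int) Math.round(source.getWidth() * factor));
        int height = Math.max(1, (int) Math.round(source.getHeight() * factor));

        // TYPE_INT_RGB drops transparency, which is fine for a preview sketch.
        BufferedImage scaled = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(source, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return scaled;
    }
}
```

Scaling on the server means the browser receives a small preview for each of the 20 results instead of several MB per image, while the download link can still serve the untouched bytes from the archive.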
Disclaimer: The screenshot is just my proof-of-concept frontend.
For the real front-end we are aiming for "shine" (https://github.com/netarchivesuite/shine), which is also used by the British Library.
is working on it from The State and University Library.
I have mentioned this Danish Internet Archive project before, here etc. :
And for more technical details how we are improving the response time of the searches, especially for the facets you can read some of the blogs by .
This small mimetype servlet, like all our internet archive projects, is open source on GitHub: https://github.com/netarchivesuite/webarchivemimetypeservlet