Profile cover photo
Profile photo
Toke Eskildsen
217 followers
217 followers
About
Toke Eskildsen's posts

Post has attachment
Looking for a way to explore image collections, and as usual OpenSeadragon comes to the rescue. The idea is to make a single collage of all images and allow for seamless zoom down to the full-size images. Selected meta-data are shown for the images pointed to, with links to further information.

Technically the system is build to scale to millions of images. The more interesting question is how many images it makes sense to explore in this manner.

Proof of concept with 1000 images at http://labs.statsbiblioteket.dk/juxta/subject3795/ and source at https://github.com/tokee/juxta
Photo

Post has attachment
We index the Danish Net Archive in Solr, but we do not have a large budget for that. Fortunately our query load is limited and we have a few tricks to play.

By making the shards in our index immutable and tweaking the faceting code in Solr, we are now at a point where we have sub-second facet search response times on a (relatively) low budget.

Furthermore, as we are already using multiple machines, we know quite precisely what it takes to scale further up and when we need to add machines: The next limit is the number of shards that SolrCloud is comfortable handling, which is about 3-5 times more than we've got now.

Post has attachment
An image is worth a thousand words and a newspaper page is worth a million other pages? Seems like a circular definition. We build a zoomable mosaic of 20 terapixels of scanned material from our archive at the State and University Library, Denmark. For fun. Give it a go, zoom down and press reload if you want another one.

The total mosaic size is in the zettapixels (10²¹) range , give or take a few orders of magnitude. But that number does not really make sense: The unique mosaic building blocks are only 1 million images of about 20 megapixels each. I'll do a technical write-up later on how we did it.

Post has attachment
Part of a web archive is being able too efficiently lookup resources in it. A lot of such lookups are based on URLs with timestamps, with the current lookup technology being CDX.

Unfortunately CDX is not a well-defined thing. Fortunately, there is a lot of work going on to define it. But CDX as an API, as a format and as an implementation tend to get mixed together. Very confusing to a not-a-hardcore-web-archiver like me, so here's an attempt of presenting the different perspectives.

Post has attachment
So I tried improving grouping performance in Solr and failed. The idea was to delay resolving of the group value in the second part of the two-pass grouping code, but it only gave a win for a few of the test result sets.

All is not lost though. Maybe the group value resolving can be fully skipped, both for the first and second pass? Read on for the details.

It seems to me that someone ought to have written a forum or mailing list analyzer. Some Natural Language Processing tool that goes through a lot of correspondence between multiple participants, extracting common themes to a timeline, grouping participants into questioners & answerers, doing sentiment analysis for participants as well as overall for the group.

Anyone know of such a tool?

Post has attachment
The memory & performance gains from sparse faceting for Solr (http://tokee.github.io/lucene-solr/) left me with the impression that similar optimizations might work elsewhere in the Lucene/Solr code base. So I set out to improve on bitsets and large relevance ranked result sets.

Well, it failed. Some of it just did not really work and some of it worked but without any real-world benefits. It is not especially fun doing a write-up on experiments that does not yield the expected results, but that too is part of experimentation. So read to see what did not work.

At the heart of a standard Lucene/Solr search lies a PriorityQueue. The PriorityQueue (PQ from now on) is a simple beast: At any given time during the matching of documents, it keeps track of the Top-X documents. Each document is represented via a ScoreDoc, which contains score (float), docID (int) and shardIndex (int). The PQ in Solr is a heap and uses sentinel objects, meaning that it is pre-filled with guaranteed-to-be-less-than-any-real-match ScoreDocs.

Some problems immediately springs to mind.

* Using one object/entry means that if top-1M is requested and found, 1M objects will have to be allocated. That takes time. Worse, if they end up on the main heap, that taxes the garbage collector.
* The use of sentinel objects means that all searches for top-X will allocate X ScoreDocs, no matter the number of actual matches.
* A basic heap has horrible memory locality, which is worsened by using Objects. For at heap with 1M objects, the CPU L2/3 cache is constantly thrashed, leading to poor performance.
* The PQ in Solr maintains its internal state at all time, but the cycle of use is always allocate, fill, empty. No interleaving of fill & empty. For large heaps and especially heaps that does not overflow, it is cheaper to just add the data unordered and delay ordering until needed.

All this leads to a single conclusion: Don't ask for a lot of documents in a top-X request in Solr!

But... What is it wasn't so?
* First step would be to eliminate the sentinel Object thing for large values of X. That should make small result sets fast.
* Second step would be to do away with all those pesky ScoreDocs, replacing them with 3 arrays: score, docID & shardIndex. Or maybe 2 arrays as the docID & shardIndex can be packed into a single long.
* Third step would be to delay the ordering of the heap until needed, maybe using merge sort or another more locality positive sorting mechanism if it is detected that there will be no more values added before the arrays are full.

I have done some preliminary micro benchmarks with the arrays-idea (caveat: Micro benchmarks are hard to make reliable). It seems that there are quite a speed up from top-100K and upwards, not to mention memory usage being less than half of vanilla Solr PQ and GC being a lot lighter. More to come of course.

For the very curious, there is a proof of concept in the TestHitQueue class at https://github.com/tokee/lucene-solr/tree/hitdocs

Post has attachment
Lenient configuration handling strikes again. Badly.

Solr has schema.xml, which defines how the index is structured. Solr also has solrconfig.xml, which (rough generalization) defines how Solr behaves and how searches should be issued.

Among other things, one can define groups (user searches foo:bar, which under the hood expands to bas:bar OR ost:bar) and phrase fields (user searches hop bar , which under the hood expands to bas:"hop bar" OR ost:"hop bar"). As one can imagine, it is quite important to get the field names (bas and ost) right.

If the name of a phrase field is misspelled, Solr happily chugs along, not doing the phrase-search on that particular field. What happens is that search quality degrades a bit. That is bad as somewhat poorer quality is hard to detect.

If a field name in a group is misspelled, the same thing happens: Solr goes on, but searches on that group does not match the misspelled field. Bad enough, but enter SOLR-6376 (https://issues.apache.org/jira/browse/SOLR-6376): Searches with boosts, such as foo:bar^1.2, will not match anything at all if foo is a group with a misspelled field name inside. And if foo is used as a field in edismax, even the non-qualified search bar^1.2 will not match anything.

Due to a large clean up and a borked merge conflict, we had both of these errors in one of our Solr setups. Luckily we do the term-boosting thing, so we discovered that something was wrong by the absence of hits for a simple search.

All that adds up to the point that we really don't want leniency in the setup. All fields referenced in solrconfig.xml must be present in schema.xml (dynamic fields are a bit tricky). So a tiny script was born, which does simple field name validation in the Solr config files. Until a proper solution shows itself, that validation will be part of our release procedure.

Post has attachment
Kicking the heuristic faceting tires a bit harder revealed that the sampling method was somewhat flawed: Clusters of documents had too high a chance of having either too much weight or too little. Making the sampling finer grained yielded much better results (defined as more correct results).

For our setup, a fixed sampling size of 250K documents gives us a high degree of valid results and also means that the maximum response time for faceting on the particular field (600M unique values in a 250M document index with 25 values/document) is below 10 seconds. As the worst case time without heuristics is more than 10 minutes, this is quite a win.
Wait while more posts are being loaded