First the bounding polygon (10K points!) is drawn in red, then the BKD tree intersection takes over, recursively visiting all previously indexed cells (shown in gray), testing each point in the cell to see if it's inside the polygon. The cells were created during indexing by recursively partitioning space, alternating latitude then longitude, until the leaf cell has between 512 and 1024 points. Areas with a high point density result in very small cells.
It's interesting to me how many cells wind up "lopsided" as long slivers instead of being closer to squares; I didn't expect this, and it shows how important it is to visualize the things you work on. Or maybe it's just a bug!
The animation also makes one limitation clear: the search recursion now visits all cells that overlap the enclosing bounding box of the polygon, but this is clearly wasteful as you see cells outside the polygon, but inside its outer bounding box. To fix this, we need a fast way to check whether the shape overlaps an arbitrary axis-aligned rectangle.
The BKD approach differs from other space partitioning structures like quad trees (http://en.wikipedia.org/wiki/Quadtree) and geohash (http://en.wikipedia.org/wiki/Geohash) because it's data-driven, drawing lines depending on the data set, not static, drawing fixed lines in space regardless of what you are indexing. It makes it a bit more costly at indexing time, but then at search time it's very fast: ~5.7X faster than #Lucene 's geohash implementation for various bounding-box searches around London.
It can only index points, which should be the common case for spatial search with #Lucene .
Many thanks to http://openstreetmap.org for providing the base map image, bounding polygon for London, and the full database of points and relations (I indexed a ~60 million subset for this animation).
- Independent Consultant, 2012 - present
- MITRESoftware Engineer, 1997 - 2014
- Northeastern UniversityComputer Science, 1996 - 2000
3 lessons learned running an open source company | Opensource.com
One thing is certain: running an open source company has some unique challenges and opportunities.
Amazon’s CloudSearch gets Solr Powered! | LucidWorks
Amazon moves CloudSearch to Lucene/Solr
The Log: What every software engineer should know about real-time data's...
We're seeking intelligent problem solvers who are inspired and motivated to change the world.
Codd’s Relational Vision: Has NoSQL Come Full Circle? | Javalobby
Recently, I spoke at NoSQL Matters in Barcelona about database history. As somebody with a history background, I was pretty excited to dig i
Podcast: ratings, rankings, and the advantage of being born lucky - O'Re...
Is popularity just a matter of simple luck--of some early advantage compounded by human preference for things that are already popular? A pa
Optional Dependency Strategies for Java Libraries - Blog - Axel Fontaine...
In software, contrary to common belief, lines of code are a liability not an asset. As you gradually accumulate code, little by little the p
Salmon Run: Dictionary Backed Named Entity Recognition with Lucene and L...
Domain-specific Concept Search (such as ours) typically involves recognizing entities in the query and matching them up to entities that mak
There's No Economic Imperative to Reconsider an Open Internet by Benoît ...
The debate on the neutrality of Internet access isn’t new, and if its intensity varies over time, it has for a long while tainted the relati