If you've ever wondered how 2D spatial search algorithms work here's a simple animation, showing how the fast BKD tree addition to #Lucene
) finds all points within the greater city boundary of London, UK. The algorithm is described in https://www.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf
First the bounding polygon (10K points!) is drawn in red, then the BKD tree intersection takes over, recursively visiting all previously indexed cells (shown in gray), testing each point in the cell to see if it's inside the polygon. The cells were created during indexing by recursively partitioning space, alternating latitude then longitude, until the leaf cell has between 512 and 1024 points. Areas with a high point density result in very small cells.
It's interesting to me how many cells wind up "lopsided" as long slivers instead of being closer to squares; I didn't expect this, and it shows how important it is to visualize the things you work on. Or maybe it's just a bug!
The animation also makes one limitation clear: the search recursion now visits all cells that overlap the enclosing bounding box of the polygon, but this is clearly wasteful as you see cells outside the polygon, but inside its outer bounding box. To fix this, we need a fast way to check whether the shape overlaps an arbitrary axis-aligned rectangle.
The BKD approach differs from other space partitioning structures like quad trees (http://en.wikipedia.org/wiki/Quadtree
) and geohash (http://en.wikipedia.org/wiki/Geohash
) because it's data-driven, drawing lines depending on the data set, not static, drawing fixed lines in space regardless of what you are indexing. It makes it a bit more costly at indexing time, but then at search time it's very fast: ~5.7X faster than #Lucene
's geohash implementation for various bounding-box searches around London.
It can only index points, which should be the common case for spatial search with #Lucene
Many thanks to http://openstreetmap.org
for providing the base map image, bounding polygon for London, and the full database of points and relations (I indexed a ~60 million subset for this animation).