Efficient Database Index Design: Analysis of the RaptorDB v2 Index Format (Sorted Lists + Hash Tables) and how it could be enhanced.
RaptorDB v2 uses an interesting index format: a sorted list (O(logn) op time) of block pointers to blocks that contain an unsorted list of index records (key-pointer pairs) where the first record in the block is always greater than any record in the block before it and less than any record in the block next to it.
The primary sorted list is the most important thing to keep in memory and only contains that first key from each block. The records in that top level list look like:
[key][block num][block size]
I don't know the size of the records (he would be using the 32-bit or 128-bit MurMurHash, and then I don't know the size of the block # or block size counters) but I know out of the box RaptorDB v2 is using 10k block sizes for key records.
Assuming a worse case (that every index record is 256-bits (32-bytes)) and the top level sorted list only needed the first record from each block of keys, in 1 MB of memory you could store a list of 32,768 keys; pointing at that many blocks of 10k index entries for a total indexed record count of 327,680,000 (327 million).
NOTE: This assumes absolutely perfect hashing and no wasted space in the index; even with a 50% fill ratio before a hash collision (more info below) indexing 327 million records in 2 or 3MB of memory is impressive!
Mehdi claims there is an O(1) lookup once the page is found via a binary search and he mentions MurMurHash in the features list for RaptorDB, but never explicitly explains the block layout itself; I am assuming, given the pieces of the puzzle, that his blocks are hash indexes calculated using MurMur.
I benchmarked this design a few weeks ago for a similar approach to index design using hashing inside the blocks and while MurMur was a fantastic fit for fast-and-random hashing, using a hash index for a database had the following problems that I found:
== 1 ==
No ranged queries without additional structure to support ordering OR sorting the blocks on-the-fly in memory before pulling the data off disk and streaming it back to the client. Runs the potential risk of requiring constant resorting of your block contents in memory.
OPTIMIZATION: You can sort blocks once and write them out and then operate on the assumption that your blocks are always mostly
sorted and utilize a sort algo that favors mostly-sorted problem spaces, e.g. Bubble Sort (http://www.sorting-algorithms.com/
Alternatively you have to maintain a separate data structure over the blocks themselves that indicate sorting. RaptorDB mitigates this somewhat by linking blocks
to each other and not the top level index keys, but since the block contents are unsorted themselves, you still need a mechanism to sort and deliver the data pointed at from the block records.
== 2 ==
No matter the hashing algorithm I used, and assuming blocks are fixed sizes, I could never get better than a 17-22% fill rate before the first collision occurred. This lead to huge amounts of empty space (roughly 75%) in the index files. This occurred when trying to compact a 128-bit UUID down to a 64-bit long value using a cryptographically secure hashing algorithm into a hash table of size 128.
== 3 ==
Fixed key size. Hashing variable-length strings results in a hash value determined by the algorithm you are using (e.g. 64-bit, 128-bit, etc.). This is actually a benefit for index design (but a limitation for usability) because it makes traversal of the index data easier (e.g. you know every 20 bytes is a new record). It can either limit your support for variable-length index values OR lead you to problem #4 below.
== 4 ==
Reducing potential collision scope. You are taking an infinite key space for variable-length strings and forcing it into a smaller keyspace dictated by your hash algorithm. This increases the potential for collisions anytime you do this.http://www.codeproject.com/Articles/316816/RaptorDB-The-Key-Value-Store-V2
Overall I think Mehdi is on to something here by combining a global, sparse view of his data in sorted order and providing a hashed view of the data at the micro level.
I don't know how he is efficiently
getting around the ranged query issue at the moment; as I mentioned there are expensive or complex ways to do this, but I would really opt for an elegant solution.
An alternative design that I am currently toying with is basically a 2-level skip list similar to Mehdi's design; a master list that indexes all the blocks (sorted) and then all the blocks sorted as well.
In my design the blocks are much smaller (4 KB) and hold 250 keys on average, but they also support variable-length keys.
This design will take up more space than Mehdi's because I cannot rely on concisely sized hash values, but the up-shot is that efficient RANGED queries are still possible, variable-length keys of any length are supported and I can still do efficient index lookups by prefixing the key offsets which was an idea given to me by a SO user: http://dba.stackexchange.com/questions/11189/how-do-databases-store-index-key-values-on-disk-for-variable-length-fields
An alternative design I have been benchmarking is a write-optimized variation of this approach that leaves the blocks unsorted and simply appends to them in the order records are received (and indexed). This does require the block contents to be sorte before being searched however (and optionally written back out after being sorted).
This suffers from the same issue I mentioned above about Mehdi's design in-that you either sort the block every time before every request or you incrementally sort it; either way, your queries are getting more expensive.
It just depends on what you need your data store to do; be optimized for writes or optimized for reads.
Another thing to consider is that sorting a few hundred or even 1k keys is not that expensive of an operation. Sorting 10k or more on every read or write though could start to add up.
I do love the constant key size and O(1) lookup time that hashing gives you. I think Mehdi is on to something here by combining both ideas.
He tackles his smallest problem scope (the number of blocks) with a O(log n) search and then tackles his larger (10,000 items per block) search with a O(1) lookup.
He also keeps the block contents unsorted, which is great for write performance.
The only two problems I cannot get away from are:
1. RANGE queries require a sort of the block contents before replying to the client.
2. Hash collisions when adding items to a block likely invokes block splitting well before a block is ever full; this means a lot of wasted space in the index. In my testing, roughly 75%.
If perfect hashing existed and was inexpensive to calculate and the only problem was #1 above, I would say that RaptorDB v2 had the perfect index design (sorting before responding to a ranged query isn't that bad), but combining that with #2 can become a problem with huge data sets and trying to keep that index in memory unfortunately.