Efficient Database Index Design: Analysis of the RaptorDB v2 Index Format (Sorted Lists + Hash Tables) and How It Could Be Enhanced

RaptorDB v2 uses an interesting index format: a sorted list (O(log n) lookup time) of block pointers, where each block contains an unsorted list of index records (key-pointer pairs), and the first record in a block is always greater than every record in the block before it and less than every record in the block after it.

The primary sorted list is the most important structure to keep in memory, and it only contains that first key from each block. The records in that top-level list look like:

[key][block num][block size]

I don't know the exact record size (he could be using the 32-bit or 128-bit MurmurHash, and I don't know the width of the block number or block size counters), but I do know that out of the box RaptorDB v2 uses 10k-entry blocks for key records.

Assuming a worst case where every top-level record is 256 bits (32 bytes), and given that the top-level sorted list only needs the first record from each block of keys, 1 MB of memory could hold a list of 32,768 keys pointing at that many blocks of 10k index entries each, for a total indexed record count of 327,680,000 (roughly 327 million).

NOTE: This assumes absolutely perfect hashing and no wasted space in the index; even at a 50% fill ratio before the first hash collision (more on this below), indexing 327 million records in 2-3 MB of memory is impressive!
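A quick back-of-the-envelope check of that arithmetic in Python (the 32-byte record size and 10k-entry blocks are the assumptions stated above, not confirmed numbers from RaptorDB):

# Back-of-the-envelope capacity check using the assumptions above:
# 32-byte top-level records, 10,000 index entries per block.
RECORD_SIZE = 32                        # bytes per [key][block num][block size]
MEMORY_BUDGET = 1 * 1024 * 1024         # 1 MB reserved for the top-level list
ENTRIES_PER_BLOCK = 10_000

top_level_keys = MEMORY_BUDGET // RECORD_SIZE        # 32,768 addressable blocks
total_indexed = top_level_keys * ENTRIES_PER_BLOCK   # 327,680,000 records

print(top_level_keys, total_indexed)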

Mehdi claims there is an O(1) lookup once the page is found via binary search, and he mentions MurmurHash in the feature list for RaptorDB, but he never explicitly explains the block layout itself. Given the pieces of the puzzle, I am assuming his blocks are hash indexes calculated using Murmur.
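To make that assumed two-level structure concrete, here is a minimal Python sketch: a binary search over the sorted first-keys picks the block, then a linear-probing hash lookup runs inside it. The in-block probing scheme is my guess, not Mehdi's documented format:

import bisect

def lookup(first_keys, blocks, key_hash):
    # Step 1: O(log n) binary search over the sorted first keys to find
    # the block whose key range covers key_hash.
    i = bisect.bisect_right(first_keys, key_hash) - 1
    if i < 0:
        return None                      # smaller than every indexed key
    block = blocks[i]                    # fixed-size slot array: (hash, ptr) or None

    # Step 2: expected O(1) lookup inside the block via linear probing.
    slots = len(block)
    start = key_hash % slots
    for probe in range(slots):
        entry = block[(start + probe) % slots]
        if entry is None:
            return None                  # empty slot reached: key absent
        if entry[0] == key_hash:
            return entry[1]              # the record pointer
    return None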

I benchmarked this design a few weeks ago for a similar approach to index design using hashing inside the blocks, and while Murmur was a fantastic fit for fast, well-distributed hashing, I ran into the following problems using a hash index for a database:

== 1 ==

No ranged queries without either an additional structure to support ordering OR sorting the block contents on-the-fly in memory before pulling the data off disk and streaming it back to the client. This risks constant re-sorting of your block contents in memory.

OPTIMIZATION: You can sort blocks once, write them out, and then operate on the assumption that your blocks are always mostly sorted, using a sort algorithm that favors mostly-sorted inputs, e.g. Bubble Sort (http://www.sorting-algorithms.com/).
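As an illustration, here is a minimal insertion sort in Python -- like bubble sort, it is adaptive and runs in near-linear time on mostly-sorted input (my sketch, not RaptorDB code):

def insertion_sort(block):
    # Adaptive: costs O(n + d) where d is the number of inversions, so a
    # mostly-sorted block of a few hundred entries is close to one linear pass.
    for i in range(1, len(block)):
        entry = block[i]
        j = i - 1
        while j >= 0 and block[j] > entry:
            block[j + 1] = block[j]
            j -= 1
        block[j + 1] = entry
    return block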

Alternatively, you have to maintain a separate data structure over the blocks themselves that records their ordering. RaptorDB mitigates this somewhat by linking blocks to each other rather than to the top-level index keys, but since the block contents are themselves unsorted, you still need a mechanism to sort and deliver the data pointed at by the block records.

== 2 ==

No matter the hashing algorithm I used, and assuming fixed-size blocks, I could never get better than a 17-22% fill rate before the first collision occurred. This led to huge amounts of empty space (roughly 75%) in the index files. This happened even when compacting a 128-bit UUID down to a 64-bit long value with a cryptographically secure hashing algorithm into a hash table of size 128.
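This is the birthday problem at work: for a table with m slots, the expected number of uniform random insertions before the first collision is only about sqrt(pi * m / 2), roughly 15 insertions (~11% fill) for a 128-slot table. A quick simulation (my code, not the original benchmark) reproduces that ballpark:

import random

def avg_fill_at_first_collision(slots=128, trials=10_000):
    total = 0
    for _ in range(trials):
        used = set()
        while True:
            slot = random.getrandbits(64) % slots   # uniform random hash slot
            if slot in used:
                break                               # first collision: stop
            used.add(slot)
        total += len(used)
    return total / trials / slots                   # mean fill ratio

print(avg_fill_at_first_collision())                # ~0.11 for 128 slots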

== 3 ==

Fixed key size. Hashing variable-length strings yields a fixed-size hash value whose width is determined by the algorithm you use (e.g. 64-bit, 128-bit, etc.). This is actually a benefit for index design (but a limitation for usability) because it makes traversal of the index data trivial (e.g. you know every 20 bytes starts a new record). It either limits your support for variable-length index values OR leads you to problem #4 below.
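For example, with a hypothetical 20-byte record (a 16-byte hash plus a 4-byte pointer -- sizes chosen purely for illustration), a block can be walked by offset arithmetic alone:

import struct

RECORD = struct.Struct("<16sI")   # 16-byte hash + 4-byte pointer = 20 bytes

def iter_records(block_bytes):
    # Fixed stride: every RECORD.size (20) bytes starts a new record,
    # so no delimiters or length prefixes are needed.
    for offset in range(0, len(block_bytes), RECORD.size):
        key_hash, pointer = RECORD.unpack_from(block_bytes, offset)
        yield key_hash, pointer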

== 4 ==

Reducing potential collision scope. You are taking an effectively infinite key space (variable-length strings) and forcing it into the smaller, fixed keyspace dictated by your hash algorithm; by the pigeonhole principle, every such reduction increases the potential for collisions.
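The standard birthday approximation puts a number on this: hashing n distinct keys into a space of m possible values yields at least one collision with probability about 1 - e^(-n^2 / 2m). A quick way to evaluate it:

import math

def collision_probability(n_keys, hash_bits):
    # Birthday approximation: P(collision) ~= 1 - e^(-n^2 / 2m)
    m = 2.0 ** hash_bits
    return 1.0 - math.exp(-(n_keys ** 2) / (2.0 * m))

# 327 million keys hashed into a 64-bit space:
print(collision_probability(327_680_000, 64))   # ~0.003, already non-trivial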

http://www.codeproject.com/Articles/316816/RaptorDB-The-Key-Value-Store-V2


Overall I think Mehdi is on to something here by combining a global, sparse, sorted view of his data with a hashed view of the data at the micro level.

I don't know how he is efficiently getting around the ranged query issue at the moment; as I mentioned, there are expensive or complex ways to do it, but I would really prefer an elegant solution.

READ-OPTIMIZED VARIATION

An alternative design that I am currently toying with is basically a two-level skip list similar to Mehdi's design: a sorted master list that indexes all the blocks, with the contents of each block kept sorted as well.

In my design the blocks are much smaller (4 KB) and hold 250 keys on average, but they also support variable-length keys.

This design will take up more space than Mehdi's because I cannot rely on concisely sized hash values. The upshot is that efficient RANGED queries are still possible, variable-length keys of any length are supported, and I can still do efficient index lookups by prefixing the keys with fixed-width offsets -- an idea given to me by a Stack Exchange user: http://dba.stackexchange.com/questions/11189/how-do-databases-store-index-key-values-on-disk-for-variable-length-fields
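Here is a minimal sketch of how I read that offset-prefix idea (my interpretation, not production code): a fixed-width offset table at the front of the block lets you binary search variable-length keys in O(log n):

import struct

def build_block(sorted_keys):
    # Fixed-width offset table up front, variable-length key bytes after it.
    # Binary search indexes the offsets, never the raw key bytes.
    header = struct.pack("<I", len(sorted_keys))
    offsets, body, pos = [], b"", 0
    for key in sorted_keys:
        offsets.append(pos)
        body += struct.pack("<H", len(key)) + key   # length-prefixed key
        pos += 2 + len(key)
    table = b"".join(struct.pack("<I", o) for o in offsets)
    return header + table + body

def get_key(block, i):
    count = struct.unpack_from("<I", block, 0)[0]
    offset = struct.unpack_from("<I", block, 4 + 4 * i)[0]
    base = 4 + 4 * count                            # start of the key bytes
    length = struct.unpack_from("<H", block, base + offset)[0]
    return block[base + offset + 2 : base + offset + 2 + length]

def search(block, key):
    # Standard binary search over the offset table: O(log n) no matter
    # how long the individual keys are.
    lo, hi = 0, struct.unpack_from("<I", block, 0)[0] - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        probe = get_key(block, mid)
        if probe == key:
            return mid
        if probe < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

# Usage: block = build_block([b"apple", b"banana", b"cherry"])
#        search(block, b"banana")  -> 1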

WRITE-OPTIMIZED VARIATION

An alternative design I have been benchmarking is a write-optimized variation of this approach that leaves the blocks unsorted and simply appends to them in the order records are received (and indexed). This does require the block contents to be sorted before being searched, however (and optionally written back out after being sorted).

This suffers from the same issue I mentioned above about Mehdi's design, in that you either sort the block before every request or you sort it incrementally; either way, your queries get more expensive.

It just depends on what you need your data store to do; be optimized for writes or optimized for reads.

Another thing to consider is that sorting a few hundred or even 1k keys is not that expensive an operation. Sorting 10k or more on every read or write, though, could start to add up.
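A tiny sketch of that write-optimized trade-off -- appends are blind, and the first read after a write pays the sort cost (the dirty flag is my own addition for illustration):

import bisect

class LazyBlock:
    # Append-only writes; the block is sorted lazily on the first read
    # after a write, so write-heavy workloads never pay the sort cost.
    def __init__(self):
        self.entries = []
        self.dirty = False

    def insert(self, key, pointer):
        self.entries.append((key, pointer))   # O(1) append, no sorting
        self.dirty = True

    def find(self, key):
        if self.dirty:
            self.entries.sort()               # cost amortized over reads
            self.dirty = False
        i = bisect.bisect_left(self.entries, (key,))
        if i < len(self.entries) and self.entries[i][0] == key:
            return self.entries[i][1]
        return None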

CONCLUSION

I do love the constant key size and O(1) lookup time that hashing gives you. I think Mehdi is on to something here by combining both ideas.

He tackles the smallest problem scope (the number of blocks) with an O(log n) search and then tackles the larger scope (10,000 items per block) with an O(1) lookup.

He also keeps the block contents unsorted, which is great for write performance.

The only two problems I cannot get away from are:

1. RANGE queries require a sort of the block contents before replying to the client.

2. Hash collisions when adding items to a block likely invoke block splitting well before a block is ever full; this means a lot of wasted space in the index. In my testing, roughly 75%.

If perfect hashing existed and were inexpensive to compute, and the only problem were #1 above, I would say RaptorDB v2 had the perfect index design (sorting before responding to a ranged query isn't that bad). Combined with #2, though, it can become a problem for huge data sets when you are trying to keep that index in memory, unfortunately.
COMMENTS
 
Dave Dolan: First collision isn't necessarily an indicator of the fill. Statistically it's a clue that it could be better, but there may be local extrema making it hard to tell; perhaps it gravitates a certain way at certain points while the overall shape is more varied. Murmur2, 32-bit, is what he's using. Murmur3 supposedly has better fill, but I'm only going by what the author of the Murmur algorithm says (he's probably not lying in public!).
 
Riyad Kalla: Dave, I had another post about all my hash testing (different algorithms, including Murmur) and the fill rates before the first collision were all horrible; as Josiah pointed out, the birthday problem was unavoidable. I just didn't realize how persistent the issue was in practice (something like a 17% fill before the first collision). While you are absolutely right that it doesn't necessarily indicate the fill in the general case, if you cannot have collisions in your system it becomes a 1:1 map to fill -- and in the case of the RaptorDB v2 index design, it is.
 
Dave Dolan: I see. Well, not to be confrontational... but what's a better idea?
 
Riyad Kalla: +Dave Dolan great question; in theory a perfect hash would be phenomenal -- it would give you O(1) inserts and retrievals, but you'd have to come up with a clever way to handle duplicates. Given that perfect hashing requires some computation and isn't as simple as running a key through a "good" algorithm, I have found in some micro-tests that in sufficiently small buckets (say, hundreds of items) binary searching is extremely fast, even when the item you are looking for requires many reductions.

You can further increase performance by not sorting the bucket items on insert, only on retrieval. On an insert-heavy data set you get fully optimized "append only" writes, much like Cassandra, but as soon as a retrieval is issued the bucket state is checked; if unsorted, it is sorted (again, a quick operation) and then the item is quickly retrieved.

In a few quick implementations I've done of this approach, it gives the best of all worlds.
 
Tony Pickens: Holy cow, you guys are sickeningly smart. I was looking into using RaptorDB v2 in a mobile game due to what seemed like awesome performance, and also based on various reading I have done regarding NoSQL vs. relational databases. I was seriously considering using RaptorDB, but after reading your discussion here I got nervous. What's funny is that I am not anywhere near smart enough to even know what you said above. :)
Should I be worried about using RaptorDB based on what you said? Sorry for the silly question, but I am so new to the NoSQL way of thinking that I have no idea what your article was really implying.
By "in a mobile game" I mean RaptorDB itself would be hosted on a server near some web services, and an HTML5 game would use it as the storage solution.
 
Riyad Kalla: +Tony Pickens I know that feeling, sorry to get you all nervous.

The writeup isn't actually a bad thing, it is just a design approach that RaptorDB has chosen for its indexes (every DB makes trade-off choices in their design similar to this).

The bigger question is: is RaptorDB mature enough to do what you want it to do reliably? I don't know the answer to that because I don't know much about the project.

If you are creating a mobile game and need a DB, you might consider one of the Amazon Web Services databases (SimpleDB or DynamoDB for NoSQL -- or RDS for MySQL) just so you don't need to be in the business of database administration.

My last tip is that NoSQL is no panacea; it is just an alternative way to store and retrieve data. If you know SQL already, don't feel bad about using it :)
 
Tony Pickens: Awesome... more helpful than you know. The actual long-term goal is to go to DynamoDB for just the reason you mention. The RaptorDB solution is more of a temporary measure for completing a prototype and judging interest, while still using NoSQL in the hopes of having similar plumbing if it works out and we want to give it a real shot.
I truly appreciate your response; I anticipated having to check back for the next few weeks and had no idea you would respond so quickly. :)
 
Riyad Kalla: Thanks Howard, I wasn't aware of either of those.