Alternative link: http://research.google.com/pubs/pub42851.html
Paper with a gazillion Googlers as co-authors, to be presented at VLDB 2014, on their production Mesa data warehouse system that holds all their advertising data. It allows near-real-time updates and low-latency access to petabytes of data, with trillions of rows fetched per day, all of it geo-replicated and able to provide consistent and repeatable answers rapidly even if an entire datacenter fails.
F1 + Mesa would be pretty interesting. How one handles all those updates in real time while maintaining query performance, that is the question.
The paper has good details on how they handle updates in near real time while maintaining amazingly low latency on query performance for a data warehouse. Section 3.2.1 has a good short summary; more details are in Sections 2.2–2.4 and 3.1.2. In brief, they batch the updates, avoid locking between queries and updates because of the way they batch and version the updates, and route queries over the same data to the same servers whenever possible to improve cache performance. Remarkably, they claim almost all queries (99th percentile) are answered in hundreds of milliseconds.
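The batching-plus-versioning idea can be sketched in a few lines. This is a toy illustration of the general technique, not Mesa's actual design; all names and structures here are my own invention. Updates are buffered into a batch, the whole batch is applied atomically as a new immutable version, and queries read a committed version so they never block on writers:

```python
import threading

class VersionedStore:
    """Toy sketch of batched, versioned updates (not Google's code).

    Updates accumulate in a pending batch and become visible only when
    the batch is committed as a new immutable version. Queries read a
    committed version and never take the writer lock.
    """

    def __init__(self):
        self._versions = [{}]          # version 0: empty snapshot
        self._pending = {}             # batched updates awaiting commit
        self._lock = threading.Lock()  # serializes writers only

    def update(self, key, delta):
        # Buffer the update; nothing is visible to queries yet.
        with self._lock:
            self._pending[key] = self._pending.get(key, 0) + delta

    def commit_batch(self):
        # Apply the whole batch atomically as one new version.
        with self._lock:
            new_snapshot = dict(self._versions[-1])
            for key, delta in self._pending.items():
                new_snapshot[key] = new_snapshot.get(key, 0) + delta
            self._versions.append(new_snapshot)
            self._pending.clear()
            return len(self._versions) - 1  # new version number

    def query(self, key, version=None):
        # Readers pick a committed version; committed versions are
        # immutable, so no locking against writers is needed.
        snapshot = self._versions[-1 if version is None else version]
        return snapshot.get(key, 0)

store = VersionedStore()
store.update("clicks", 5)
before = store.query("clicks")       # batch not committed yet -> 0
v1 = store.commit_batch()
after = store.query("clicks", v1)    # visible after commit -> 5
```

The batch commit is what introduces the update delay the paper acknowledges; the payoff is that queries see a consistent snapshot with no read locks at all.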
I saw all that, but waiting up to five minutes for a batch update to finish is unacceptable for an operational system. I know they specifically said they are not tackling the operational database problem, but the goal should be to have a single database capable of handling both operational and analytical workloads. I encourage you to check out the Context Database we've been working on at MIOsoft since about 1998, for instance. ;-)