Shivakumar N.
Humanist :-)
Shivakumar's posts

Dremel: Discarding many bad ideas in a few seconds!

I am giving an invited talk at Stanford’s data center retreat about Dremel later this week. Their interest is piqued. They have been hearing why Dremel (née ~2005?) is an incredible system at Google, about the recent spike in attention it has been getting through early open source projects, and about why many companies are considering Dremel-like technologies as a better bet for their needs than classic MapReduce/Hadoop. They have read the technical papers and want to know more!

I started jotting down some notes as I was reminiscing about the technical beauty of Dremel and how it completely flew in the face of conventional wisdom at Google at the time. Then I was struck by how Dremel became a tool that helped unleash the creativity of so many engineers @ Google, and that may have really been its bigger impact. So here goes... :-) Old-time Googlers, if you remember other details from around this time, please remind me...


Quick history: The core of Dremel came out of the frustration of one smart, brash engineer, Andrey Gubarev, who was looking at Google search and ads data to test out some ideas in ~2005. He had dozens of ideas on how to improve ads quality and search quality. Meanwhile, Google had a spectacular gold mine of user searches from all across the world -- data in the tens of terabytes range, even a decade ago. Game on, for a creative engineer to build much better products and algorithms based on data!

Andrey was not unique in having tons of ideas. Google had hundreds of similar engineers with a passion for improving products based on data, each with dozens of hypotheses they wanted to test. Engineers were gleeful that they could use MapReduce and Sawmill (a data processing system on MapReduce) to run over tens of terabytes of data with a few hours of programming and a few hours of running. You could test maybe one or two hypotheses in a day. 99% of hypotheses would not work, and one or two would be pure gold, making a radical improvement to a product. Such iteration cycles were pretty much unheard of anywhere else, probably 10x faster than the rest of the industry, giving Google a great edge. Engineers were happy!

Andrey was unique in that he made a crucial observation (after testing out his 18th hypothesis, which did not work): “I do not understand why Google search can be so fast and return results in a second over billions of documents, while checking out our ideas over the same data sources takes a day or more.” I started saying all the usual things about indexing, pre-processing, and why query formats matter, as a typical computer scientist would. Then I realized he was right, and promised him a few months of air cover.

There were quite a few big challenges:
1. How do you build a system flexible enough to support ad hoc queries over tens of terabytes of data?
2. What is the sweet spot between indexing parts of the data and the complexity & overhead of keeping and updating all kinds of indices?
3. Google had three mandates around projects at the time: (1) MapReduce for all forms of computing over data, (2) Borg, a cluster management system, for running all jobs, with no individual control over machines, and (3) GFS/BigTable for all storage and Sawmill for all log processing. While these were broadly good ideas for a variety of security, scalability and auditability reasons, they crucially took a lot of systems-level flexibility and the ‘being close to the metal’ feel away from engineers.

So we did what any engineer would do -- we stole 10 machines from another project and built a small skunkworks team to prototype the system over a day’s worth of search data. The first results were very promising: a small 10-machine cluster with a carefully constructed data operator language was able to process a terabyte’s worth of search data in under 20 seconds. The equivalent query on MapReduce took over an hour, on a thousand-machine cluster.

I was happy to be the first and primary tester of this amazing system. For me, it was exciting to see the team build the power of ad hoc querying over massive data sets and get back results near-interactively. At the same time, I was privy to some of the broader questions we were struggling with at Google: what broad areas & products to invest in, and what not to invest in? For a few weeks, Googlers thought I had gotten unexpectedly ‘smarter’ because I knew a bunch of random factoids about their data. It did not last long -- engineers and product leaders quickly figured out I was using this new system, and started using Dremel too. Folks came crawling out of the woodwork to write these short 2-3 line queries and test hypotheses in a few minutes. This was Dremel0 in 2006, and you could churn through about 5-10 simple ideas/hour!

After the first promising prototypes came the second battle, one that is currently playing out in industry. What query language to use -- something declarative like SQL, or something with more developer control like MapReduce? How flexible can you make a system without losing its near-interactivity? Plus, quite a few senior Googlers expressed a strong distaste for SQL as a legacy language, and most engineers thought it was cool to agree.

Naturally, the Dremel team chose a subset of SQL as the core query language -- not built the typical way Oracle/MySQL built SQL engines, but re-adapted to the style of Google’s search engine, where searches were answered through a combination of mixers that aggregated results over shards, so thousands of machines could work in concert to answer a query.
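The mixer/shard idea can be sketched as a toy in Python. This is a minimal illustration of a multi-level aggregation tree, not Dremel's actual implementation; the function names and fan-in are hypothetical, and the real system distributes these roles across thousands of machines:

```python
# Toy sketch of a multi-level aggregation tree: leaves scan data shards,
# mixers merge partial results upward until the root holds the answer.

def leaf_server(shard, predicate):
    """Each leaf server scans only its own shard and returns a partial count."""
    return sum(1 for row in shard if predicate(row))

def mixer(children_results):
    """Intermediate mixers merge partial aggregates from their children."""
    return sum(children_results)

def root_query(shards, predicate, fan_in=2):
    """Root dispatches to leaves in parallel, then aggregates up the tree."""
    partials = [leaf_server(s, predicate) for s in shards]
    # Merge fan_in partials at a time, level by level, until one result remains.
    while len(partials) > 1:
        partials = [mixer(partials[i:i + fan_in])
                    for i in range(0, len(partials), fan_in)]
    return partials[0]

# Example: COUNT rows where value > 10, over 4 shards.
shards = [[5, 12, 30], [1, 2], [11, 13, 14], [9, 10, 100]]
print(root_query(shards, lambda v: v > 10))  # prints 6
```

The key property is that each level only sees small partial aggregates, never raw rows, so adding machines widens the tree without creating a bottleneck at the root.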

We described the fundamental technical insights and learnings about Dremel, as it evolved from Dremel0 to Dremel6, in a VLDB paper in 2010. By combining multi-level execution trees with a columnar data layout for tree-shaped, semi-structured data, it became capable of running aggregation queries over trillion-row tables in seconds. The system scaled to thousands of CPUs and petabytes of data, and most engineers & groups at Google now use it.
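The columnar-layout half of that combination can be illustrated with a tiny Python sketch. This is an assumption-laden toy (made-up field names, in-memory lists standing in for on-disk columns), not Dremel's actual storage format, which additionally encodes nesting for semi-structured records:

```python
# Toy illustration of row vs. columnar layout for an aggregation query.
# With a row layout, every whole record is touched to read one field;
# with a columnar layout, the query scans just the one column it needs.

rows = [
    {"query": "weather", "clicks": 3, "country": "US"},
    {"query": "dremel",  "clicks": 1, "country": "IN"},
    {"query": "vldb",    "clicks": 7, "country": "US"},
]

# Row layout: each full record is deserialized to extract "clicks".
total_row = sum(r["clicks"] for r in rows)

# Columnar layout: each field is stored contiguously as its own column,
# so SUM(clicks) reads only that column and skips everything else.
columns = {field: [r[field] for r in rows] for field in rows[0]}
total_col = sum(columns["clicks"])

assert total_row == total_col == 11
```

For wide logs where a query touches two or three fields out of hundreds, reading only the needed columns is what makes scanning a terabyte in seconds plausible.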

Over time, Dremel grew from a pure read-only system into a core part of Google’s computation fabric, and opened up a new way of computing within Google through a series of follow-on projects (including WebIQ, another exciting system on web graphs, which I’ll save for another day). Externally, it is now available in a limited form as BigQuery.

When you can do interesting queries over a trillion rows in a few seconds, that’s already pretty frigging cool as a system. I think more importantly, Dremel on Google’s logs unleashed the creativity of thousands of smart engineers and researchers daily to explore ideas, discard the bad ones, refine the promising ones in search of radical improvements to all Google products!

It seems that the color pink used to be associated with boys, and blue was associated with girls:

"An article in the trade publication Earnshaw's Infants' Department in June 1918 said: "The generally accepted rule is pink for the boys, and blue for the girls. The reason is that pink, being a more decided and stronger color, is more suitable for the boy, while blue, which is more delicate and dainty, is prettier for the girl." From then until the 1940s, pink was considered appropriate for boys because, being related to red, it was the more masculine and decided color, while blue was considered appropriate for girls because it was the more delicate and dainty color."

Which Federal budget proposal is better?

Terrible news, +Steve Lacey RIP.

Steve was a great guy, an excellent engineer and shaped many Google Apps products, including GMail and Talk at the Google Kirkland office.

Widescope: Building Consensus for Budgets

For a fun research project at Stanford, +Ashish Goel and I (with some students) have been experimenting with US Federal budgets. How we macro-invest as a society is an important problem that is O(trillions of $s) and affects the futures of millions of people.

Our dream is to help societies explore hard economic trade-offs, through shared data and lenses. Our 1st experiment is a website to help you explore federal & state budgets.

Check out

We need your help!
1. Before we open this more widely, we'd love your feedback on the tool. Play around, create your own budgets.
2. If you have related ideas or are interested in helping, please ping!

Watson & Crick would probably be more famous for discovering 'social DNA' in times like these. Sigh, YAPA (Is Social In Google's DNA?)

I expected Larry would do his earnings call with Fibonacci numbers. E.g., G+ has over Fib(35) users in < Fib(31) seconds! Congrats to G+ team, amazing growth.

The Social Network(s) -- Next few months

The Google+ launch was interesting to me, from a couple of perspectives.
1. The launch and super-fast growth so far show there is an appetite for another social network. It is not a "winner takes all" game anymore, and this was not obvious before the G+ launch. This is great news overall, for industry and consumers.

2. The universal bar across all Google properties. I’m pleasantly surprised it happened. I’ve personally been in too many meetings, where topics like the bar and colors have been discussed endlessly and deferred, because no one could say 'Yes' and anyone could say 'No.' From a Google perspective, this is an unusually big deal and bodes well for Google.

Now the next few months will be even more interesting.

0. I expect Google will innovate on the ranking of news feeds, and try to get a high signal-to-noise ratio. This will be good, because Facebook’s ranking for my news feed is usually pretty random and lossy.

1. Fast "embrace and extend" tactics from Facebook, Twitter, et al. to stop G+ before it gets going. With their sheer size and engagement, this is still their battle to lose. On the face of it, the current G+ launch is mostly new and cool UI, with easily replicable features. I don't see a leapfrog technology. Yet. I'd bet the design teams at the leading social networks have already figured out the best parts of the UX they want to adopt, and have a launch-and-iterate plan.

Of course, this is not a new tactic. For example, Google's search teams quickly followed Bing's fresh UI and side-bars when Bing increasingly became a threat in 2009-10. Ditto with Android's radically different UX pre- and post-iPhone launch. And both times, it worked.

2. More interestingly, I expect a flurry of activity from the other big consumer brands. Now that there is clearly an appetite for trying out new experiences, Amazon, Apple, Zynga (and local brands in each country) et al will engage fully on building new social experiences from scratch with their own compelling assets. Why rely on someone else’s social network, identity and platform, when you can roll your own?

Five interesting networks by mid 2012? Will I need uber-circles to organize my social networks next? Begun, the Clone War has!
