
Speaking of causal nets, an interesting Facebook paper, "The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services":

"Current debugging and optimization methods scale poorly to deal with the complexity of modern Internet services, in which a single request triggers parallel execution of numerous heterogeneous software components over a distributed set of computers. The Achilles’ heel of current methods is the need for a complete and accurate model of the system under observation: producing such a model is challenging because it requires either assimilating the collective knowledge of hundreds of programmers responsible for the individual components or restricting the ways in which components interact. Fortunately, the scale of modern Internet services offers a compensating benefit: the sheer volume of requests serviced means that, even at low sampling rates, one can gather a tremendous amount of empirical performance observations and apply “big data” techniques to analyze those observations. In this paper, we show how one can automatically construct a model of request execution from pre-existing component logs by generating a large number of potential hypotheses about program behavior and rejecting hypotheses contradicted by the empirical observations. We also show how one can validate potential performance improvements without costly implementation effort by leveraging the variation in component behavior that arises naturally over large numbers of requests to measure the impact of optimizing individual components or changing scheduling behavior. We validate our methodology by analyzing performance traces of over 1.3 million requests to Facebook servers. We present a detailed study of the factors that affect the end-to-end latency of such requests. We also use our methodology to suggest and validate a scheduling optimization for improving Facebook request latency."

Apparently the Internet giants like Facebook are now so large and so complex internally that it makes sense to start treating them as black boxes on par with human bodies and ecosystems, applying techniques like randomization & causal networks rather than continuing to rely on human engineers' mental models.

In this case, they're using their server logs to infer causal networks linking their servers & infrastructure, in order to pinpoint where performance problems lie.

Besides the unexpected connection of causal nets with server logs, one interesting aspect is that the algorithm they use to infer the causal net is about as dumb as possible: generate every possible causal hypothesis, run each log item past the set of hypotheses, and delete a hypothesis the instant a log item contradicts it. If that sounds computationally expensive, it is: this approach apparently requires an entire Hadoop cluster and a day of wall-clock time to produce useful results.
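The hypothesize-and-refute idea can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the event names and traces are invented, and real traces would be pairs of timestamped log events per request rather than simple ordered lists.

```python
from itertools import permutations

def infer_happens_before(traces):
    """Start with every possible happens-before edge between pairs of
    event types, then discard any edge contradicted by an observed trace
    (i.e. a request in which the supposedly 'later' event came first)."""
    event_types = {ev for trace in traces for ev in trace}
    # Hypothesize "a happens before b" for every ordered pair of types.
    hypotheses = set(permutations(event_types, 2))
    for trace in traces:
        position = {ev: i for i, ev in enumerate(trace)}
        for a, b in list(hypotheses):
            # Refute (a, b) if this trace shows b occurring before a.
            if a in position and b in position and position[a] > position[b]:
                hypotheses.discard((a, b))
    return hypotheses

# Each trace is one request's log, as an ordered list of event types.
traces = [
    ["recv", "parse", "db", "render", "send"],
    ["recv", "parse", "render", "db", "send"],  # db and render race
]
edges = infer_happens_before(traces)
# Surviving edges include ("recv", "parse") and ("parse", "send");
# neither ("db", "render") nor ("render", "db") survives, so the two
# are inferred to run concurrently.
```

The cost is what makes this "big compute": the hypothesis space is quadratic in the number of event types, and every log item is checked against every surviving hypothesis, which is why at Facebook's scale the pruning runs as a Hadoop job rather than a single-machine script.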

And they use only a million or so log items. This is not so much 'big data' as 'big compute'.

#bayesnet #statistics #facebook #causalinference  