Speaking of causal nets, an interesting Facebook paper, "The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services":
"Current debugging and optimization methods scale poorly to deal with the complexity of modern Internet services, in which a single request triggers parallel execution of numerous heterogeneous software components over a distributed set of computers. The Achilles’ heel of current methods is the need for a complete and accurate model of the system under observation: producing such a model is challenging because it requires either assimilating the collective knowledge of hundreds of programmers responsible for the individual components or restricting the ways in which components interact. Fortunately, the scale of modern Internet services offers a compensating benefit: the sheer volume of requests serviced means that, even at low sampling rates, one can gather a tremendous amount of empirical performance observations and apply “big data” techniques to analyze those observations. In this paper, we show how one can automatically construct a model of request execution from pre-existing component logs by generating a large number of potential hypotheses about program behavior and rejecting hypotheses contradicted by the empirical observations. We also show how one can validate potential performance improvements without costly implementation effort by leveraging the variation in component behavior that arises naturally over large numbers of requests to measure the impact of optimizing individual components or changing scheduling behavior. We validate our methodology by analyzing performance traces of over 1.3 million requests to Facebook servers. We present a detailed study of the factors that affect the end-to-end latency of such requests. We also use our methodology to suggest and validate a scheduling optimization for improving Facebook request latency."
Apparently Internet giants like Facebook are now so large and so internally complex that it makes sense to start treating them as black boxes on par with human bodies and ecosystems, and to start applying techniques like randomization & causal networks rather than continue relying on human engineers.
In this case, they're using their server logs to infer causal networks linking their servers/infrastructure, in order to locate where the performance problems are.
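The abstract's other trick, estimating what an optimization would buy end-to-end without implementing it, by leveraging the natural variation across huge numbers of requests, is easy to illustrate in miniature. A toy sketch (the field names, numbers, and data format are made up for illustration, not taken from the paper):

```python
# Toy illustration of estimating the end-to-end benefit of speeding up one
# component by exploiting natural variation across requests, rather than
# implementing the optimization. Hypothetical data, not the paper's format.
from statistics import mean

# Each request: duration of the component of interest and total latency (ms).
requests = [
    {"component_ms": 120, "total_ms": 900},
    {"component_ms": 40,  "total_ms": 850},
    {"component_ms": 200, "total_ms": 1100},
    {"component_ms": 35,  "total_ms": 840},
    {"component_ms": 150, "total_ms": 960},
]

# Split requests into those where the component happened to run fast vs. slow.
threshold = mean(r["component_ms"] for r in requests)
fast = [r["total_ms"] for r in requests if r["component_ms"] <= threshold]
slow = [r["total_ms"] for r in requests if r["component_ms"] > threshold]

# If the component is usually off the critical path, the two groups will have
# similar end-to-end latency and optimizing it buys little; a large gap
# suggests the optimization would actually help.
print(f"fast-component requests: mean total {mean(fast):.0f} ms")
print(f"slow-component requests: mean total {mean(slow):.0f} ms")
print(f"rough estimated benefit: {mean(slow) - mean(fast):.0f} ms")
```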
Besides the unexpected connection of causal nets with server logs, one interesting aspect is that they seem to be using an algorithm that is about as dumb as possible to infer the causal net: they generate every possible causal net, then run each log item past the set of nets, deleting a net the instant it contradicts a log item. If that sounds hard to you, well, this approach apparently requires an entire Hadoop cluster and a day of real time to generate useful results... so it is.
And they only use a million or so log items. This is not so much 'big data' as 'big compute'.
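To make the rejection loop concrete, here is a minimal sketch of that style of inference, simplified down to pairwise "A happens before B" hypotheses rather than whole causal nets (segment names, traces, and the in-memory representation are all hypothetical; the real system does this over Hadoop at far larger scale):

```python
# Minimal sketch of inferring causal (happens-before) relationships by
# hypothesis rejection: hypothesize an ordering between every pair of logged
# segments, then delete any hypothesis contradicted by an observed trace.
# Hypothetical segment names and traces, purely for illustration.
from itertools import permutations

segments = ["dns", "server", "network", "render"]
# Start with every possible "A happens before B" hypothesis.
hypotheses = set(permutations(segments, 2))

# Each trace maps a segment to its observed (start, end) times for one request.
traces = [
    {"dns": (0, 5), "server": (5, 40), "network": (40, 55), "render": (45, 90)},
    {"dns": (0, 3), "server": (3, 50), "network": (50, 70), "render": (60, 95)},
]

for trace in traces:
    for a, b in list(hypotheses):
        # "a happens before b" is contradicted if b starts before a ends.
        if trace[b][0] < trace[a][1]:
            hypotheses.discard((a, b))

# Whatever survives every trace is taken as an inferred causal constraint.
for a, b in sorted(hypotheses):
    print(f"{a} happens before {b}")
```

Pairs that get eliminated in both directions (here, network and render, which overlap in every trace) end up treated as concurrent rather than causally ordered.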
#bayesnet #statistics #facebook #causalinference
"Current debugging and optimization methods scale poorly to deal with the complexity of modern Internet services, in which a single request triggers parallel execution of numerous heterogeneous software components over a distributed set of computers. The Achilles’ heel of current methods is the need for a complete and accurate model of the system under observation: producing such a model is challenging because it requires either assimilating the collective knowledge of hundreds of programmers responsible for the individual components or restricting the ways in which components interact. Fortunately, the scale of modern Internet services offers a compensating benefit: the sheer volume of requests serviced means that, even at low sampling rates, one can gather a tremendous amount of empirical performance observations and apply “big data” techniques to analyze those observations. In this paper, we show how one can automatically construct a model of request execution from pre-existing component logs by generating a large number of potential hypotheses about program behavior and rejecting hypotheses contradicted by the empirical observations. We also show how one can validate potential performance improvements without costly implementation effort by leveraging the variation in component behavior that arises naturally over large numbers of requests to measure the impact of optimizing individual components or changing scheduling behavior. We validate our methodology by analyzing performance traces of over 1.3 million requests to Facebook servers. We present a detailed study of the factors that affect the end-to-end latency of such requests. We also use our methodology to suggest and validate a scheduling optimization for improving Facebook request latency."
Apparently the Internet giants like Facebook are now so large and so complex internally that it makes sense to start treating them as black boxes on par with human bodies and ecosystems and start applying techniques like randomization & causal networks rather than continue relying on human engineers.
In this case, they're using their server logs to infer causal networks linking their servers/infrastructure in order to infer where performance problems are.
Besides the unexpected connection of causal nets with server logs, one interesting aspect is that they seem to be using an algorithm which is about as dumb as possible to infer the causal net: they generate every possible causal net, and begin running each log item past the set of causal nets, deleting causal nets the instant they contradict a log item. If that sounds hard to you, well, this approach apparently requires an entire Hadoop cluster and a day of realtime to generate useful results... so it is.
And they only use a million or so log items. This is not so much 'big data' as 'big compute'.
#bayesnet #statistics #facebook #causalinference
That's kind of amazing. At first I thought it was an April Fools paper.
But it's somewhat similar to a project I saw when I was working at Google, which was using global RPC profiling data to try to identify dependencies between services. In fact, they mention my project at the time, Dapper, in the paper. Far from "a small set of middleware services", Dapper instrumented virtually every RPC and HTTP call that happened at Google (with downsampling, obviously). It was Dapper data that the other project -- whose name I forget -- was using to do dependency analysis. No idea whether it succeeded. (Apr 1, 2015)
It is amazing. The first time I read it, I found it hard to believe I was understanding it right: how could even a cluster possibly work through that many DAGs? I'm still not sure I understand it correctly, and that they're not doing a lot of heuristic or other kinds of net searches to make it feasible. (Apr 1, 2015)
Thanks for finding this paper. That this sort of exhaustive testing is possible is a consequence of cheap and massive computing power. That it is needed? <insert PHP joke here>. (Apr 1, 2015)