
**Why Probabilistic Programming Matters**

Last week, DARPA announced a new program to fund research in probabilistic programming languages. While the accompanying news stories gave a certain angle on why this was important, this is a new research area and it's still pretty mysterious to most people who care about machine intelligence.

So: what is probabilistic programming, and why does it matter? Read on for my thoughts.

A probabilistic programming language is a language which includes random events as first-class primitives. When the expressive power of a real programming language is brought to bear on random events, the developer can easily encode sophisticated **structured stochastic processes** - i.e., probabilistic models of the events that might have occurred in the world to produce a given collection of data or observations.
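To make this concrete, here is a minimal sketch in plain Python (illustrative only - not the syntax of any particular probabilistic language) of a structured stochastic process, where the program's random draws are ordinary first-class primitives:

```python
import random

def flip_model(n_flips):
    """A structured stochastic process: first draw an unknown coin
    bias from a uniform prior, then simulate n_flips observations."""
    bias = random.random()                       # latent random choice
    flips = [random.random() < bias for _ in range(n_flips)]
    return bias, flips

bias, data = flip_model(10)   # one possible execution of the model
```

Each call to `flip_model` is one way the world "might have gone"; the space of all possible executions is the model.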

But just writing down a probability model as a computer program wouldn't be particularly exciting - that's just a matter of syntax. The real power of a probabilistic programming language lies in its compiler or runtime environment (like other languages, probabilistic ones can be either compiled or interpreted). In addition to its usual duties, the compiler or runtime needs to figure out how to perform **inference** on the program. Inference answers the question: of all of the ways in which a program containing random choices could execute, which of those execution paths provides the best explanation for the data?

Another way of thinking about this: unlike a traditional program, which only runs in the forward direction, a probabilistic program is run in both the forward and backward direction. It runs forward to compute the consequences of the assumptions it contains about the world (i.e., the model space it represents), but it also runs backward from the data to constrain the possible explanations. In practice, many probabilistic programming systems will cleverly interleave these forward and backward operations to efficiently home in on the best explanations.
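As an illustrative sketch (my own toy, not any real system's machinery), the crudest way to combine the two directions is rejection sampling: run the generative program forward many times, and keep only those executions whose simulated data match the observations.

```python
import random

def run_model():
    """Forward direction: draw a latent coin bias, simulate 5 flips."""
    bias = random.random()
    return bias, tuple(random.random() < bias for _ in range(5))

def infer(observed, n_runs=100_000):
    """Backward direction, by rejection: among all executions, keep
    those that reproduce the data; their latent choices are the
    surviving explanations."""
    kept = [bias for bias, flips in (run_model() for _ in range(n_runs))
            if flips == observed]
    return sum(kept) / len(kept)   # posterior mean of the bias

# Observing five heads pulls the inferred bias well above 0.5.
estimate = infer((True,) * 5)
```

Real systems replace this brute-force search with far cleverer interleavings, but the semantics - score executions by how well they explain the data - is the same.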

So, that's probabilistic programming in a nutshell. It's clearly cool, but why does it matter?

**Probabilistic programming will unlock narrative explanations of data**, one of the holy grails of business analytics and the unsung hero of scientific persuasion. People think in terms of stories - thus the unreasonable power of the anecdote to drive decision-making, well-founded or not. But existing analytics largely fails to provide this kind of story; instead, numbers seemingly appear out of thin air, with little of the causal context that humans prefer when weighing their options.

But probabilistic programs can be written in an explicitly "generative" fashion - i.e., a program directly encodes a space of hypotheses, where each one comprises a candidate explanation of how the world produced the observed data. The specific solutions that are chosen from this space then constitute specific causal and narrative explanations for the data.

The dream here is to combine the best aspects of anecdotal and statistical reasoning - the persuasive power of story-telling, and the predictiveness and generalization abilities of the larger body of data.

**Probabilistic programming decouples modeling and inference.** Just as modern databases separate querying from indexing and storage, and high-level languages and compilers separate algorithmic issues from hardware execution, probabilistic programming languages provide a crucial abstraction boundary that is missing in existing learning systems.

While modeling and inference/learning have long been seen as conceptually separate activities, they have been tightly linked in practice. What this means is that the models that are applied to a given problem are very tightly constrained by the need to derive efficient inference schemes - and these schemes are incredibly time-consuming to derive and implement, requiring very specialized skills and expertise. And even small changes to the model can necessitate wholesale rethinking of the inference approach. This is no way to run a railroad.

If probabilistic programming reaches its potential, these two activities will finally be decoupled by a solid abstraction barrier: the programmer will write the program he wants, and then leave it to the compiler or runtime to sort out the inference (though see below for challenges). This should allow the probabilistic programmer to encode and leverage much more of his domain knowledge, ultimately leading to much more powerful reasoning and better data-driven decisions.
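A toy illustration of that abstraction barrier (all names hypothetical, and vastly simpler than what real compilers must handle): the model below is written once, and two different inference backends consume it. Note that the second backend re-derives the likelihood by hand; automating exactly that step is the compiler's job.

```python
import random

def model():
    """The modeler writes only this: a prior plus a simulator."""
    bias = random.random()
    return bias, [random.random() < bias for _ in range(5)]

def infer_rejection(observed, n=40_000):
    """Backend 1: keep the forward runs that reproduce the data."""
    kept = [b for b, flips in (model() for _ in range(n)) if flips == observed]
    return sum(kept) / len(kept)

def infer_weighted(observed, n=40_000):
    """Backend 2: weight each latent draw by the data's likelihood
    (re-derived by hand here; a real system would generate this)."""
    total = acc = 0.0
    for _ in range(n):
        bias, _ = model()
        w = 1.0
        for heads in observed:
            w *= bias if heads else 1.0 - bias
        total += w
        acc += w * bias
    return acc / total

obs = [True, True, True, False, True]   # same model, two backends
```

Swapping backends leaves the model untouched - the essence of the databases-and-compilers analogy above.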

**Probabilistic programming enables more general, abstract reasoning**, because these programs can directly encode many levels of reasoning and then perform inference across all of these levels simultaneously. Existing learning systems are largely confined to learning the model parameters that provide the best explanation of the data, but they are unable to reason about whether the model class itself is appropriate.

Probabilistic programs, on the other hand, can reason across all of these levels at the same time: learning model parameters, and also choosing the best models from a given class, and also deciding between entirely different model classes. While there is no silver bullet to modeling bias and the application of inappropriate models, this capability will nevertheless go a long way toward providing the best explanations for data, not only at the quantitative but also the qualitative level.
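As an illustrative sketch of cross-level reasoning (models and names my own, not from any cited system), the same sampling machinery can score entire model classes, not just parameters, by comparing how well each class explains the data on average:

```python
import random

def likelihood(bias, data):
    """Probability of the observed flips given a coin bias."""
    p = 1.0
    for heads in data:
        p *= bias if heads else (1.0 - bias)
    return p

def marginal_likelihood(model_class, data, n_samples=50_000):
    """Average likelihood over the class's prior on parameters:
    a Monte Carlo estimate of P(data | model class)."""
    return sum(likelihood(model_class(), data)
               for _ in range(n_samples)) / n_samples

fair_coin = lambda: 0.5                 # class 1: no free parameters
unknown_coin = lambda: random.random()  # class 2: bias drawn from a prior

data = [True] * 9 + [False]             # 9 heads in 10 flips
print(marginal_likelihood(fair_coin, data) <
      marginal_likelihood(unknown_coin, data))   # prints True
```

Here the lopsided data favors the more flexible class; with balanced data the comparison flips, since the marginal likelihood automatically penalizes flexibility that goes unused.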

Finally, I should mention some of the key challenges that researchers will have to solve if probabilistic programming is to have these kinds of impacts:

**Creating the compilers and runtimes that automatically generate good inference schemes for arbitrary probabilistic programs is very, very hard.** If the discussion above sounds too good to be true, that's because it just might be. Researchers have a lot of work ahead of them if they want to make all of this work in the general setting. The most successful systems to date (BUGS, Infer.NET, FACTORIE) have worked by limiting the expressive power of the language, and thus reducing the complexity of the inference problem.

**All abstractions are leaky, and the abstraction between modeling (i.e., writing a probabilistic program) and inference is no different.** What this means is that some probabilistic programs will result in much faster inference than others, for reasons that may not be easy for the programmer to understand. In traditional high-level programming languages, these problems are typically solved by profiling and optimization, and probabilistic systems will need to come with an analogous set of tools so that programmers can inspect and solve the performance problems they encounter.

Finally, **probabilistic programming requires a different combination of skills** from traditional development, on the one hand, and machine learning or statistical inference on the other. I think it would be a shame if these systems, once developed, remained at the fringes - used only by the kinds of programmers who are currently drawn to, say, languages like Lisp and Haskell.

While these challenges are indeed great, I have very high hopes for probabilistic programming over the next few years. Some of the very smartest people I've ever known or worked with are involved in these projects, and I think they're going to make a lot of progress.

Note: I have no involvement with any current probabilistic programming research, nor with the DARPA program; at this point, I'm just an interested observer.

Update: edited for typos, and to add title.


- +Joseph Catanzarite Absolutely. However, the hierarchical models are solved with inference methods that do NOT have a hierarchical character. They are nice algorithms, and I like them a lot. But it seems we're missing something in inference for Bayesian methods that is being discovered in the field of deep learning.

It seems similar to how time is often treated in machine learning. A lot of models basically "flatten time" to be able to do predictions. There are no dynamics within the system coupled to the dynamics outside the system. In computer science terms, time is a first-class citizen in the real world, but no longer in the way the model learns about that world. For example, in echo state networks the dynamics of the modeled phenomenon are directly reflected in temporal activity patterns within the network. I've not seen Bayesian inference methods that try to do something like that.

Bayesian inference methods seem divorced from their models. Pick a hierarchical model and inference is just done by Gibbs sampling or something else pretty standard. There are no automated methods that take into account the "nature of the model" and generate "an inference method that reflects this nature".

A paper that goes a bit in this direction is "Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions" by Brooks et al. (2003):

https://stat.duke.edu/~scs/Courses/Stat376/Papers/TransdimMCMC/BrooksRobertsRJ.pdf

*Oct 17, 2015*

- +Anne van Rossum +Vijay Varanasi I disagree strongly that probabilistic sampling approaches are that different from evolutionary techniques (ET). If you think about what particle filters do, the importance sampling imposes a fitness function on the population by deciding which of the particles to keep around. The biggest difference is that the fitness function is implicit in the distribution you are sampling rather than explicitly defined (which in turn implicitly imposes a distribution on your genetic population).

Additionally, the general complaints about ET (such as that it is ad-hoc) are often the same complaints people level against NPB (priors tend to be chosen in an ad-hoc manner). In this case the priors of ET are the features you pick. As long as the genetic description for your ET fully spans the problem space, you will converge to an answer (it might just be very slow). The same argument has been made about the priors you pick for NPB.

*Oct 28, 2015*

- +Abe Schneider The Bayesian viewpoint addresses optimal inference. Given the proper prior and likelihood, it describes the best way to cope with the data. There are no better reasoning methods for handling uncertainty.

To give up on optimality is everybody's own decision, but you can't just state that things are equivalent. Friston has a nice account of his own acceptance of Bayes optimality at http://www.sciencedirect.com/science/article/pii/S1053811911011657. He comes up with interesting models postulating Markov blankets between an organism and the environment, but also within the brain in his other papers. You can also give up on optimality in a controlled way, as is done for example by Braun and Ortega or Bert Kappen in their bounded rationality papers. These authors maintain the connections with Bayesian optimality.

That's definitely not the case with evolutionary algorithms in general. I know there are severely stripped down versions of evolutionary algorithms such as evolutionary strategies. Perhaps there are some connections with being optimal. Please, feel free to suggest a few you like most with respect to this topic.

By the way, I agree that there are sampling methods that are just as crappy and ad-hoc as genetic algorithms. Particle filters are one of them. That's my point: there is a divorce between the Bayesian optimality of the models and the sampling algorithms that are in use. The sampling methods come with additional constraints such as detailed balance. Or there are adaptive MCMC methods that shouldn't be called algorithms, but heuristics.

There is a place for heuristics, and I place a lot of sampling methods right there, especially the evolutionary/genetic ones. By the way, if an evolutionary algorithm is stripped down to an MCMC sampling method it does not converge. The term convergence doesn't make sense for MCMC. There is always a probability that it ends up in a low-probability region again.

*Oct 29, 2015*

- +Anne van Rossum My main point was that the original question asked if there was a link between evolutionary techniques (ET) and probabilistic programming. Most probabilistic programming techniques employ some type of sampling method to find the distribution (though variational inference is usually also an option), so I think it's fair to say there is a high degree of similarity in those approaches. Both have problem spaces that cannot be solved easily with a closed-form solution. Both search around the space in a semi-random fashion. Both have methods to create a new set of samples from the previous batch. And the biggest similarity between them is that they both use prior knowledge to update their current evidence. In the same way you can argue a Kalman filter is Bayesian, I think this also casts a Bayesian interpretation on ETs.

As for optimality, unless you have a nicely convex problem (generally the only types of problems I worry about), there is no way of knowing whether you have an optimal solution (i.e., a global maximum/minimum). If you don't know that you have an optimal solution, how do you know that you have an optimal method to solve it?

What you can't do (and I think this is your main point) is have a well-defined notion of probability. The distribution is implicitly set by a cost function, and you have no partition function. However, suppose the cost function is the actual thing you are trying to fit. Maybe there is no good way to describe it probabilistically due to its particular structure. You can of course still try to approximate it with a mixture of distributions, but that doesn't always guarantee the best fit. In that case, what is the optimal solution? The nice thing about ETs is that they make the fewest assumptions about the problem space, and thus can find solutions for many complicated problem spaces. That also means they give the fewest guarantees about their solutions.

I'm not sure what you mean when you say convergence doesn't make sense for MCMC. How else do you know that you've found a solution? It might not have an exact definition, but you still need a stopping test.

*Oct 29, 2015*

- +Abe Schneider A good example of the kind of work I like is this paper from yesterday that demonstrates the current divorce between models and inference: http://news.mit.edu/2015/shrinking-bulls-eye-algorithm-speeds-complex-modeling-days-hours-1116.

I agree with you that if the evolutionary algorithms are getting so abstract that they are called "mean-field interaction particle systems" or something like that, a Bayesian interpretation becomes possible. :-)

I kindly disagree with your statement that you cannot know an optimal solution for a nonconvex problem.

Zellner has several texts on the optimality of Bayesian algorithms. It doesn't require any convexity argument.

You're right that if you have a problem in which the cost function or loss function is actually the problem at hand, then introducing uncertainty where there is none is of course incorrect.

About MCMC and convergence: the chain weakly converges to a stationary distribution, but the samples themselves keep wandering around, of course. Thus, if you mean that all generations over time in an evolutionary algorithm sample the solution space, I agree. If you are talking about an evolutionary algorithm converging to a particular generation representing the solution on its own, I consider this concept of convergence disparate from convergence in MCMC. Not that important, because I assume you meant the former anyway.

*Nov 17, 2015*

- Thanks Abe. This article is very interesting. I haven't read any of the source material, but do you know if it's based on ABC (approximate Bayesian computation)?

*Nov 17, 2015*
