Profile cover photo
Profile photo
Sai Rahul
Communities and Collections
View all

Post has shared content
Why Probabilistic Programming Matters

Last week, DARPA announced a new program to fund research in probabilistic programming languages. While the accompanying news stories gave a certain angle on why this was important, this is a new research area and it's still pretty mysterious to most people who care about machine intelligence.

So: what is probabilistic programming, and why does it matter? Read on for my thoughts.

A probabilistic programming language is a language which includes random events as first-class primitives. When the expressive power of a real programming language is brought to bear on random events, the developer can easily encode sophisticated structured stochastic processes - i.e., probabilistic models of the events that might have occurred in the world to produce a given collection of data or observations.

But just writing down a probability model as a computer program wouldn't be particularly exciting - that's just a matter of syntax. The real power of a probabilistic programming language lies in its compiler or runtime environment (like other languages, probabilistic ones can be either compiled or interpreted). In addition to its usual duties, the compiler or runtime needs to figure out how to perform inference on the program. Inference answers the question: of all of the ways in which a program containing random choices could execute, which of those execution paths provides the best explanation for the data?

Another way of thinking about this: unlike a traditional program, which only runs in the forward directions, a probabilistic program is run in both the forward and backward direction. It runs forward to compute the consequences of the assumptions it contains about the world (i.e., the model space it represents), but it also runs backward from the data to constrain the possible explanations. In practice, many probabilistic programming systems will cleverly interleave these forward and backward operations to efficiently home in on the best explanations.

So, that's probabilistic programming in a nutshell. It's clearly cool, but why does it matter?

Probabilistic programming will unlock narrative explanations of data, one of the holy grails of business analytics and the unsung hero of scientific persuasion. People think in terms of stories - thus the unreasonable power of the anecdote to drive decision-making, well-founded or not. But existing analytics largely fails to provide this kind of story; instead, numbers seemingly appear out of thin air, with little of the causal context that humans prefer when weighing their options.

But probabilistic programs can be written in an explicitly "generative" fashion - i.e., a program directly encodes a space of hypotheses, where each one comprises a candidate explanation of how the world produced the observed data. The specific solutions that are chosen from this space then constitute specific causal and narrative explanations for the data.

The dream here is to combine the best aspects of anecdotal and statistical reasoning - the persuasive power of story-telling, and the predictiveness and generalization abilities of the larger body of data.

Probabilistic programming decouples modeling and inference. Just as modern databases separate querying from indexing and storage, and high-level languages and compilers separate algorithmic issues from hardware execution, probabilistic programming languages provide a crucial abstraction boundary that is missing in existing learning systems.

While modeling and inference/learning have long been seen as conceptually separate activities, they have been tightly linked in practice. What this means is that the models that are applied to a given problem are very tightly constrained by the need to derive efficient inference schemes - and these schemes are incredibly time-consuming to derive and implement, requiring very specialized skills and expertise. And even small changes to the model can necessitate wholesale rethinking of the inference approach. This is no way to run a railroad.

If probabilistic programming reaches its potential, these two activities will finally be decoupled by a solid abstraction barrier: the programmer will write the program he wants, and then leave it to the compiler or runtime to sort out the inference (though see below for challenges). This should allow the probabilistic programmer to encode and leverage much more of his domain knowledge, utimately leading to much more powerful reasoning and better data-driven decisions.

Probabilistic programming enables more general, abstract reasoning, because these programs can directly encode many levels of reasoning and then perform inference across all of these levels simultaneously. Existing learning systems are largely confined to learning the model parameters that provide the best explanation of the data, but they are unable to reason about whether the model class itself is appropriate.

Probabilistic programs, on the other hand, can reason across all of these levels at the same time: learning model parameters, and also choosing the best models from a given class, and also deciding between entirely different model classes. While there is no silver bullet to modeling bias and the application of inappropriate models, this capability will nevertheless go a long way toward providing the best explanations for data, not only at the quantitative but also the qualitative level.

Finally, I should mention some of the key challenges that researchers will have to solve if probabilistic programming is to have these kinds of impacts:

Creating the compilers and runtimes that automatically generate good inference schemes for arbitrary probabilistic programs is very, very hard. If the discussion above sounds too good to be true, that's because it just might be. Researchers have a lot of work ahead of them if they want to make all of this work in the general setting. The most successful systems to date (BUGS,, have worked by limiting the expressive power of the language, and thus reducing the complexity of the inference problem.

All abstractions are leaky, and the abstraction between modeling (i.e., writing a probabilistic program) and inference is no different. What this means is that some probabilistic programs will result in much faster inference than others, for reasons that may not be easy for the programmer to understand. In traditional high-level programming languages, these problems are typically solved by profiling and optimization, and probabilistic systems will need to come with the analogous set of tools so that programmers can inspect and solve the performance problems they encounter.

Finally, probabilistic programming requires a different combination of skills from traditional development, on the one hand, and machine learning or statistical inference on the other. I think it would be a shame if these systems, once they were developed, remained at the fringes - used only by the kinds of programmers who are currently drawn to, say, languages like Lisp and Haskell.

While these challenges are indeed great, I have very high hopes for probabilistic programming over the next few years. Some of the very smartest people I've ever known or worked with are involved in these projects, and I think they're going to make a lot of progress.

Note: I have no involvement with any current probabilistic programming research, nor with the DARPA program; at this point, I'm just an interested observer.

Update: edited for typos, and to add title.
Add a comment...

Post has shared content

Post has shared content
A really nice write-up of how to gently introduce some graph theory concepts to younger students.
Add a comment...

Post has shared content
Extinct Giant Insects Rediscovered and Rescued!

Even if you are grossed out by the picture, the story is wonderful -- that is if you like science, adventure, and people's desire to help non-human animals.

#science   #conservation   #entomology   #insects   #discovery  

Add a comment...

Post has attachment
Came across this wonderful talk by Tom Mitchell. It is about how our brain represents the language and whether we can decode it ?. Also whether the representation is same across different persons. Its very very early now. But we can easily image matrix kind of stuff. I.e learning how to ride the chopper without actual training :)

Some highlights. 

Trained neural activity on one person and try to decode the same from another one. Probability of success is more than a chance. Meaning the representation is similar. 

If a person sees Hammer then there is a significant activity in premotor cortex. Premotor cortex is responsible for movements. 

Neural representation is similar across people, language, word vs picture. Its easier to decode concrete and emotion nouns but difficult to decode abstract nouns and verbs. Ex abstract nouns like justice. 
Add a comment...

Post has attachment
World is changing ! Microsoft made .net opensource and it is MIT license ! 2 amazing things in 2 days ! 
.NET Blog
.NET Blog
Add a comment...

Post has attachment
Time to celebrate Humans ! Rosetta took 10 years to reach the comet.
Add a comment...

Post has attachment

Post has shared content
Learning to Execute and Neural Turing Machines

I'd like to draw your attention to two papers that have been posted in the last few days from some of my colleagues at Google that I think are pretty interesting and exciting:

  Learning to Execute:

  Neural Turing Machines:

The first paper, "Learning to Execute", by +Wojciech Zaremba and +Ilya Sutskever attacks the problem of trying to train a neural network to take in a small Python program, one character at a time, and to predict its output.  For example, as input, it might take:

print((c+8704) if 2641<8500 else 5308)"

During training, the model is given that the desired output for this program is "12185".  During inference, though, the model is able to generalize to completely new programs and does a pretty good of learning a simple Python interpreter from examples.

The second paper, "Neural Turing Machines", by +alex graves, Greg Wayne, and +Ivo Danihelka from Google's DeepMind group in London, couples an external memory ("the tape") with a neural network in a way that the whole system, including the memory access, is differentiable from end-to-end.  This allows the system to be trained via gradient descent, and the system is able to learn a number of interesting algorithms, including copying, priority sorting, and associative recall.

Both of these are interesting steps along the way of having systems learn more complex behavior, such as learning entire algorithms, rather than being used for just learning functions.

(Edit: changed link to Learning to Execute paper to point to the top-level Arxiv HTML page, rather than to the PDF).
Add a comment...

Post has shared content
24, the Monster, and quantum gravity

Think of a prime number other than 2 or 3. Multiply the number by itself and then subtract 1. The result is a multiple of 24. This observation might appear to be a curiosity, but it turns out to be the tip of an iceberg, with far-reaching connections to other areas of mathematics and physics.

This result works for more than just prime numbers. It works for any number that is relatively prime to 24. For example, 25 is relatively prime to 24, because the only positive number that is a factor of both of them is 1. (An easy way to check this is to notice that 25 is not a multiple of 2, or 3, or both.) Squaring 25 gives 625, and 624=(24x26)+1.

A mathematician might state this property of the number 24 as follows:
If m is relatively prime to 24, then m^2 is congruent to 1 modulo 24.
One might ask if any numbers other than 24 have this property. The answer is “yes”, but the only other numbers that exhibit this property are 12, 8, 6, 4, 3, 2 and 1; in other words, the factors of 24.

The mathematicians John H. Conway and Simon P. Norton used this property of 24 in their seminal 1979 paper entitled Monstrous Moonshine. In the paper, they refer to this property as “the defining property of 24”. The word “monstrous” in the title is a reference to the Monster group, which can be thought of as a collection of more than 8x10^53 symmetries; that is, 8 followed by 53 other digits. The word “moonshine” refers to the perceived craziness of the intricate relationship between the Monster group and the theory of modular functions.

The existence of the Monster group, M, was not proved until shortly after Conway and Norton wrote their paper. It turns out that the easiest way to think of M in terms of symmetries of a vector space over the complex numbers is to use a vector space of dimension 196883. This number is close to another number that is related to the Leech lattice. The Leech lattice can be thought of as a stunningly efficient way to pack unit spheres together in 24 dimensional space. In this arrangement, each sphere will touch 196560 others. The closeness of the numbers 196560 and 196883 is not a coincidence and can be explained using the theory of monstrous moonshine.

It is now known that lying behind monstrous moonshine is a certain conformal field theory having the Monster group as symmetries. In 2007, the physicist Edward Witten proposed a connection between monstrous moonshine and quantum gravity. Witten concluded that pure gravity with maximally negative cosmological constant is dual to the Monster conformal field theory. This theory predicts a value for the semiclassical entropy estimate for a given black hole mass, in the large mass limit. Witten's theory estimates the value of this quantity as the natural logarithm of 196883, which works out at about 12.19. As a comparison, the work of Jacob Bekenstein and Stephen Hawking gives an estimate of 4π, which is about 12.57.

Relevant links
Wikipedia on the Monster group:
Wikipedia on the Leech lattice:
Wikipedia on Monstrous Moonshine:
A 2004 survey paper about Monstrous Moonshine by Terry Gannon:

#mathematics #physics #sciencesunday  
Add a comment...
Wait while more posts are being loaded