Nicolas Chapados
Nicolas's posts

Post has attachment
Like the previous commenter, I don't really like either, but my vote goes to #2. I think that #1 lacks freshness and is too reminiscent of "just any old textbook". #2 is more lively and suggests interactivity. OTOH, it does look a little 90-ish (except for the typefaces, which look nice).

Post has shared content
Significant experiment in overhauling the conference peer-review process.
ICLR 2013 Program Available

The list of accepted papers to the International Conference on Learning Representations (ICLR 2013, which we pronounce "I clear") is ready.

As some of you know, ICLR is the first conference to use the open review model I have been advocating for a number of years, and the first conference to use the OpenReview system to handle the reviewing process.

I must say that the experiment is an unmitigated success. By going to the OpenReview ICLR link, you will be able to see all the papers (accepted or not accepted for presentation), all the reviews from (anonymous) designated reviewers, all the spontaneous reviews/comments, and all author responses. The process allowed for an essentially unlimited number of comments and paper revisions throughout the reviewing process. Some papers have gathered over 10 comments, with lively technical discussions. Many comments provided invaluable feedback to the authors (like "there is a bug in your proof" or "this paper is related to yours"), and some will prove useful for readers to understand the papers. Some papers went through some pretty dramatic revisions following comments.

There are 5 categories of papers: conference/oral, conference/poster, workshop/oral, workshop/poster, and not accepted for presentation at this time. All the papers are available (yes, including the non-accepted ones). Some papers submitted to the conference track were deemed intriguing enough to merit a presentation, but not in a sufficiently advanced state to receive a seal of approval "for the record". These papers were accepted in the workshop track.

Overall, it feels like there was considerably less "noise" in the submissions and in the reviews than with the traditional non-open process. There seem to be fewer false negatives, where papers were rejected because the reviewers just didn't get it, or because the paper didn't have the citations the reviewers deemed indispensable (because the authors had opportunities to add those in). The reviews and comments seemed more constructive to me than your typical conference paper review.

The process took a little longer than expected, as David Soergel and Andrew McCallum were implementing the required features on the fly.

The program chairs +Aaron Courville, +Rob Fergus, and +Christopher Manning did a wonderful job with the program (+Yoshua Bengio and I were not involved in the paper selection process). 

We will organize a formal survey to ask authors what they think of the process, but feel free to comment here.

Post has shared content
A significant move by DARPA to recognize and develop one of the most fascinating areas of current machine learning research, receiving at the moment far too little attention. 
Why Probabilistic Programming Matters

Last week, DARPA announced a new program to fund research in probabilistic programming languages. While the accompanying news stories gave a certain angle on why this was important, this is a new research area and it's still pretty mysterious to most people who care about machine intelligence.

So: what is probabilistic programming, and why does it matter? Read on for my thoughts.

A probabilistic programming language is a language which includes random events as first-class primitives. When the expressive power of a real programming language is brought to bear on random events, the developer can easily encode sophisticated structured stochastic processes - i.e., probabilistic models of the events that might have occurred in the world to produce a given collection of data or observations.
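As a concrete illustration, here is a minimal sketch in plain Python, not in any real probabilistic programming language; the spam model and all its parameters are invented for illustration. The point is that random choices are ordinary first-class expressions, so downstream control flow can depend on them.

```python
import random

# Hypothetical toy model, for illustration only: a latent coin flip
# decides whether a message is spam, and the number of links it
# contains depends on that hidden choice.
def generate():
    is_spam = random.random() < 0.4      # latent random primitive
    if is_spam:
        n_links = random.randint(3, 8)   # spam tends to be link-heavy
    else:
        n_links = random.randint(0, 2)
    return is_spam, n_links

# Running the program forward simulates one possible world.
world = generate()
```

A real probabilistic language would add inference machinery on top of exactly this kind of generative description.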

But just writing down a probability model as a computer program wouldn't be particularly exciting - that's just a matter of syntax. The real power of a probabilistic programming language lies in its compiler or runtime environment (like other languages, probabilistic ones can be either compiled or interpreted). In addition to its usual duties, the compiler or runtime needs to figure out how to perform inference on the program. Inference answers the question: of all of the ways in which a program containing random choices could execute, which of those execution paths provides the best explanation for the data?

Another way of thinking about this: unlike a traditional program, which only runs in the forward directions, a probabilistic program is run in both the forward and backward direction. It runs forward to compute the consequences of the assumptions it contains about the world (i.e., the model space it represents), but it also runs backward from the data to constrain the possible explanations. In practice, many probabilistic programming systems will cleverly interleave these forward and backward operations to efficiently home in on the best explanations.
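The crudest possible realization of this forward/backward interplay is rejection sampling, sketched below in plain Python (the coin model, the 4-of-5 observation, and the run count are all invented for illustration): run the program forward many times, keep only the executions that reproduce the observed data, and read the surviving latent choices as the explanation.

```python
import random

def model():
    bias = random.random()                  # latent: unknown coin bias, uniform prior
    flips = [random.random() < bias for _ in range(5)]
    return bias, flips

def infer(observed_heads=4, n_runs=20000):
    # Forward: simulate executions; "backward": reject those that
    # disagree with the observed data.
    accepted = [bias for bias, flips in (model() for _ in range(n_runs))
                if sum(flips) == observed_heads]
    return sum(accepted) / len(accepted)    # posterior mean of the bias

# Observing 4 heads in 5 flips, the exact posterior mean under a
# uniform prior is (4+1)/(5+2) = 5/7 ≈ 0.714 (rule of succession);
# the estimate above converges to it as n_runs grows.
```

Real systems replace this brute-force rejection with much cleverer interleavings (MCMC, sequential Monte Carlo, variational methods), but the forward/backward picture is the same.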

So, that's probabilistic programming in a nutshell. It's clearly cool, but why does it matter?

Probabilistic programming will unlock narrative explanations of data, one of the holy grails of business analytics and the unsung hero of scientific persuasion. People think in terms of stories - thus the unreasonable power of the anecdote to drive decision-making, well-founded or not. But existing analytics largely fails to provide this kind of story; instead, numbers seemingly appear out of thin air, with little of the causal context that humans prefer when weighing their options.

But probabilistic programs can be written in an explicitly "generative" fashion - i.e., a program directly encodes a space of hypotheses, where each one comprises a candidate explanation of how the world produced the observed data. The specific solutions that are chosen from this space then constitute specific causal and narrative explanations for the data.

The dream here is to combine the best aspects of anecdotal and statistical reasoning - the persuasive power of story-telling, and the predictiveness and generalization abilities of the larger body of data.

Probabilistic programming decouples modeling and inference. Just as modern databases separate querying from indexing and storage, and high-level languages and compilers separate algorithmic issues from hardware execution, probabilistic programming languages provide a crucial abstraction boundary that is missing in existing learning systems.

While modeling and inference/learning have long been seen as conceptually separate activities, they have been tightly linked in practice. What this means is that the models that are applied to a given problem are very tightly constrained by the need to derive efficient inference schemes - and these schemes are incredibly time-consuming to derive and implement, requiring very specialized skills and expertise. And even small changes to the model can necessitate wholesale rethinking of the inference approach. This is no way to run a railroad.

If probabilistic programming reaches its potential, these two activities will finally be decoupled by a solid abstraction barrier: the programmer will write the program he wants, and then leave it to the compiler or runtime to sort out the inference (though see below for challenges). This should allow the probabilistic programmer to encode and leverage much more of his domain knowledge, ultimately leading to much more powerful reasoning and better data-driven decisions.

Probabilistic programming enables more general, abstract reasoning, because these programs can directly encode many levels of reasoning and then perform inference across all of these levels simultaneously. Existing learning systems are largely confined to learning the model parameters that provide the best explanation of the data, but they are unable to reason about whether the model class itself is appropriate.

Probabilistic programs, on the other hand, can reason across all of these levels at the same time: learning model parameters, and also choosing the best models from a given class, and also deciding between entirely different model classes. While there is no silver bullet to modeling bias and the application of inappropriate models, this capability will nevertheless go a long way toward providing the best explanations for data, not only at the quantitative but also the qualitative level.

Finally, I should mention some of the key challenges that researchers will have to solve if probabilistic programming is to have these kinds of impacts:

Creating the compilers and runtimes that automatically generate good inference schemes for arbitrary probabilistic programs is very, very hard. If the discussion above sounds too good to be true, that's because it just might be. Researchers have a lot of work ahead of them if they want to make all of this work in the general setting. The most successful systems to date (such as BUGS) have worked by limiting the expressive power of the language, and thus reducing the complexity of the inference problem.

All abstractions are leaky, and the abstraction between modeling (i.e., writing a probabilistic program) and inference is no different. What this means is that some probabilistic programs will result in much faster inference than others, for reasons that may not be easy for the programmer to understand. In traditional high-level programming languages, these problems are typically solved by profiling and optimization, and probabilistic systems will need to come with the analogous set of tools so that programmers can inspect and solve the performance problems they encounter.

Finally, probabilistic programming requires a different combination of skills from traditional development, on the one hand, and machine learning or statistical inference on the other. I think it would be a shame if these systems, once they were developed, remained at the fringes - used only by the kinds of programmers who are currently drawn to, say, languages like Lisp and Haskell.

While these challenges are indeed great, I have very high hopes for probabilistic programming over the next few years. Some of the very smartest people I've ever known or worked with are involved in these projects, and I think they're going to make a lot of progress.

Note: I have no involvement with any current probabilistic programming research, nor with the DARPA program; at this point, I'm just an interested observer.

Update: edited for typos, and to add title.

Valentine's Postulate,

      P(copulation | flowers, food, wine, spa) = 1

remains one of humanity's most significant unresolved questions. First posed around 250 AD by Valentine, made Saint by the Catholic church for his work [1], this problem has since eluded attack by many of the species' most incisive minds [2,3]. Although proofs by induction [4] would, at first glance, appear effective, they have resisted a general inductive step beyond k=100 or so. Moreover, all attempts at proofs by contradiction turned out disastrously, the perpetrator forever barred from any further attempts at induction [5].

It is widely suspected that no general solution will emerge before fundamental progress in a Grand Unification Theory of Mars and Venus is accomplished.


Post has shared content
Symbolic vs sub-symbolic AI: the debate rages on!
My dear NYU colleague +Gary Marcus wrote a critical response to +John Markoff's front-page article on deep learning in the New York Times.

Gary is a professor in the psychology department at NYU and the author of a number of books, including a very nice little book entitled "Kluge: the haphazard construction of the human mind" in which he argues (very convincingly) that the brain is a collection of hacks (which were called kluges, back when cool things were mechanical and not software), the result of haphazard refinement through evolution.

Gary has been a long-time critic of non-symbolic (or sub-symbolic) approaches to AI, such as neural nets and connectionist models. He comes from the Chomsky/Fodor/Minsky/Pinker school of thought on the nature of intelligence, whose main tenet is that the mind is a collection of pre-wired modules that are largely determined by genetics. This is meant to contrast with the working hypothesis on which we, deep learning people, are basing our research: the cortex runs a somewhat "generic" and task-independent learning "algorithm" that will capture the structure of whatever signal it is fed with.

To be sure, none of us are extreme in our positions. I have been a long-time advocate for the necessity of some structure in learning architectures (such as convolutional nets). All of learning theory points to the fact that learning needs structure. Similarly, Gary doesn't claim that learning has no role to play.

In the end, it all comes down to two questions:
- how important of a role does learning play in building a human mind?
- how much prior structure is needed?

+Geoffrey Hinton and I have devoted most of our careers to devising learning algorithms that can do interesting feats with as little structure as possible (but still some). It's a matter of degree.

One important point in Gary's piece is the fact that neural nets are merely "a ladder on the way to the moon" because they are incapable of symbolic reasoning. I think there are two issues with that argument:
1. As I said on previous occasions, I'd be happy if, within my lifetime, we have machines as intelligent as a rat. I don't think Gary would argue that rats do symbolic reasoning, but they are pretty smart. I don't think human intelligence is considerably (qualitatively) different from that of a rat, and definitely not that different from that of an ape. We could do a lot without human-style symbolic reasoning.
2. There is not that much of a conceptual difference between some of the learning systems that we are building and the symbolic reasoning systems that Gary likes. Many modern ML systems produce their output by minimizing some sort of energy function, a process qualitatively equivalent to inference (Geoff and I call that "energy-based models", but Bayesian nets also fit in that framework). Training consists in shaping the energy function so that the inference process produces an acceptable answer (or a distribution over answers).
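To make the energy-based framing concrete, here is a deliberately trivial sketch (my own illustrative quadratic energy, not the authors' actual models): inference selects, among candidate answers, the one with the lowest energy, and training would amount to reshaping the energy surface so good answers score low.

```python
# Toy energy: how badly does answer y fit input x under parameter w?
# (Illustrative quadratic energy, chosen for simplicity.)
def energy(x, y, w):
    return (y - w * x) ** 2

# Inference as energy minimization over a set of candidate answers.
def infer(x, w, candidates):
    return min(candidates, key=lambda y: energy(x, y, w))

candidates = [i / 100 - 5 for i in range(1001)]   # grid on [-5, 5]
best = infer(2.0, 1.5, candidates)                # minimizer is y = w*x = 3.0
```

Swap in a richer energy over structured outputs and the same argmin becomes symbolic-looking inference.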

Gary points out that the second wave of neural nets in the late 80's and early 90's was pushed out by other methods. Interestingly, they were pushed out by methods such as Support Vector Machines, which are closer to the earliest Perceptrons and even further away from symbolic reasoning than deep learning systems. To some extent, it could be argued that the kernel trick allowed us to temporarily abandon the search for methods that could go significantly beyond linear classifiers and template matching.

There is one slightly confusing thing in Gary's piece (as well as in John Markoff's piece): the suggestion that all the recent successes of deep learning are due to unsupervised learning. That is not the case. Many of the stunning results use purely supervised learning, sometimes applied to convolutional network architectures, as in Geoff's ImageNet object recognizer, our scene parsing system, our house number recognizer (now used by Google) and IDSIA's traffic sign recognizer. The key idea of deep learning is to train deep multilayer architectures to learn pre-processing, low-level feature extraction, mid-level feature extraction, classification, and sometimes contextual post-processing in an integrated fashion. Back in the mid 90's, I used to call this "end-to-end learning" or "global training".

Gary makes the point that even deep learning modules are but one component of complex systems with lots of other components and moving parts. It's true of many systems. But the philosophy of deep learning is to progressively integrate all the modules in the learning process.
An example of that is the check reading system I built at Bell Labs in the early 1990's with +Leon Bottou, +Yoshua Bengio and +Patrick Haffner.  It integrated a low-level feature extractor, a mid-level feature extractor, a classifier (all parts of a convolutional net), and a graphical model (word and language model) all trained in an integrated fashion.

So, just wait a few years, Gary. Soon, deep learning systems will incorporate reasoning again.

The debate is open. Knock yourself out, dear readers.

Post has shared content
David MacKay is a brilliant researcher and author. His Information Theory book is among the most lucid ever written on the subject, connecting apparently completely disparate areas. Fantastically rewarding.

Post has shared content
A great experiment in peer-reviewing for a great new machine learning conference! I salute the instigators.
BEHOLD! We are creating a new conference on learning representations: ICLR (or ICLeaR, pronounced like "I clear"). The conference will use a radically new open reviewing process.

ICLR is sponsored by the Computational and Biological Learning Society. CBLS is the sponsor of the highly successful Learning Workshop, held at Snowbird since 1986. The creation of ICLR led CBLS to phase out the Snowbird Workshop.

Official announcement follows. Please redistribute.

1st International Conference on Learning Representations (ICLR2013)

Website: representationlearning2013

Held in conjunction with AISTATS2013, Scottsdale, Arizona, May 2nd-4th 2013

Submission deadline: January 15th 2013

It is well understood that the performance of machine learning methods is heavily dependent on the choice of data representation (or features) on which they are applied. The rapidly developing field of representation learning is concerned with questions surrounding how we can best learn meaningful and useful representations of data.  We take a broad view of the field, and include in it topics such as deep learning and feature learning, metric learning, kernel learning, compositional models, non-linear structured prediction, and issues regarding non-convex optimization.

Despite the importance of representation learning to machine learning and to application areas such as vision, speech, audio and NLP, there is currently no common venue for researchers who share a common interest in this topic. The goal of ICLR is to help fill this void.

A non-exhaustive list of relevant topics:
- unsupervised representation learning
- supervised representation learning
- metric learning and kernel learning
- dimensionality expansion, sparse modeling
- hierarchical models
- optimization for representation learning
- implementation issues, parallelization, software platforms, hardware
- applications in vision, audio, speech, and natural language processing.
- other applications

Submission Process

ICLR2013 will use a novel publication model that will proceed as follows:

- Authors post their submissions on arXiv and send us a link to the paper. A separate, permanent website will be set up to handle the reviewing process, to publish the reviews and comments, and to maintain links to the papers.

- The ICLR program committee designates anonymous reviewers as usual.

- The submitted reviews are published without the name of the reviewer, but with an indication that they are the designated reviews. Anyone can write and publish comments on the paper (non-anonymously). Anyone can ask the program chairs for permission to become an anonymous designated reviewer (open bidding). The program chairs have ultimate control over the publication of each anonymous review. Open commenters will have to use their real name, linked with their Google Scholar profile.

- Authors can post comments in response to reviews and comments. They can revise the paper as many times as they want, possibly citing some of the reviews.

- On March 15th 2013, the ICLR program committee will consider all submitted papers, comments, and reviews and will decide which papers are to be presented at the conference as oral or poster. Although papers can be modified after that date, there is no guarantee that the modifications will be taken into account by the committee.

- The best of the accepted papers (the top 25%-50%) will be given oral presentations at the conference. We have made arrangements for revised versions of selected papers from the conference to be published in a JMLR special topic issue.

- The other papers will be considered non-archival (like workshop presentations), and could be submitted elsewhere (modified or not), although the ICLR site will maintain the reviews, the comments, and the links to the arXiv versions.

Invited Speakers
Jeff Bilmes (U. Washington)
Geoffrey Hinton (U. Toronto)
Ruslan Salakhutdinov (U. Toronto)
Alan Yuille (UCLA)
Jason Eisner (JHU)

General Chairs
Yoshua Bengio, Université de Montreal
Yann LeCun, New York University

Program Chairs
Aaron Courville, Université de Montreal
Rob Fergus, New York University
Chris Manning, Stanford University

The organizers can be contacted at:

Post has attachment
Not a whole lot of math truly ought to become viral. But this one awesome piece is an exception: Matthias Vallentin's #Probability and #Statistics Cookbook:

Post has shared content
Philippe hits the nail on the head, again, by equating my work habits with the observation that rational numbers are countable.
Why I don't focus too hard too early

My friend +Seb Paquet linked [1] to a blog post this morning that said, in essence: If you want to get things done, do not focus on more than one habit at a time. Looking at my three browser windows and 24 open tabs I immediately thought "I'm screwed". Then Seb reminded me that, in fact, I do accomplish quite a bit even though an ability to focus on a single habit is definitely not my strong point. More importantly, maybe, I achieve the things that feel important to me.

In fact, I believe the source of these achievements may even lie in this tendency I have of jumping from one task to another. At any time I'll pick one task and execute on it, but if the task becomes too boring or if I don't think it's worth much to complete it, I'll gladly switch to something else.

"Heresy!" My elementary school teachers would say. "Focus on the task at hand, complete it, and then you can play with your Rubik's cube." 

The problem is, in the real world, tasks are hardly as straightforward as filling out a list of multiplications. The most interesting things I've done were never obvious from the outset. How long was it going to take to develop a boardgame? What will I get out of that PhD? Is it worth applying to Google a second time, or will it only make me angry?

Answering these questions before you get started can often become a challenge in itself. But if you are to focus all your energy on a single task it would be utterly stupid to blindly rush forward. You should consider it carefully, ponder the options, evaluate the potential reward, estimate how much time you'll spend completing it... This is setup time that is not spent accomplishing something.

Even worse, evaluating how long it will take to complete a task is sometimes unfeasible. And here I mean mathematically unfeasible. There is a beautiful result in computer science, called the halting problem, which says it's impossible to design an algorithm that can determine, for an arbitrary program, whether it will terminate.

If you're not convinced by this argument, just consider the number of dilettantes who wasted their lives trying to prove Fermat's Last Theorem. Granted, designing a boardgame is not as hard as proving a centuries-old conjecture. Still, the question of how long you, with your flaws, your obligations, your background, your imagination, will take to complete a non-trivial task is very hard to answer.

You may agree that focusing on a task either means blindly rushing forward or spending a lot of time evaluating the potential reward, but you may also wonder how not focusing too hard or too early can help.

Let me bring you back to my beloved Mathematics... Let's say you want to count all positive rational numbers (numbers of the form a / b), making sure you don't forget any. How would you do it? Think about it for a while...

If you set out enumerating all numbers where a = 1 you'll get 1, 1/2, 1/3,... but you'll never get 2/3. Let's try setting b = 3 instead. Then you'll get 1/3, 2/3, 3/3,... but you will never get 1/2.

The trick is to make a bit of progress with a = 1, then a bit of progress with a = 2, then go back to a = 1... Basically you work your way diagonally in the table [3]. You never get stuck anywhere. If there is a golden nugget somewhere in that table, you will eventually find it.
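That diagonal walk can be written down directly. The sketch below (plain Python, using Fraction so duplicates like 2/4 = 1/2 are spotted and skipped) enumerates every positive rational exactly once, reaching any given one after finitely many steps:

```python
from fractions import Fraction
from itertools import islice

def rationals():
    # Walk the a/b grid by diagonals a + b = 2, 3, 4, ...,
    # skipping values already produced in lower terms.
    seen = set()
    s = 2
    while True:
        for a in range(1, s):
            q = Fraction(a, s - a)
            if q not in seen:
                seen.add(q)
                yield q
        s += 1

first_ten = list(islice(rationals(), 10))
# first_ten == [1, 1/2, 2, 1/3, 3, 1/4, 2/3, 3/2, 4, 1/5]
```

No single row or column ever monopolizes the walk, which is exactly the point: every golden nugget in the table is eventually visited.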

That's basically how I convinced myself that this lack of early focus could be a good thing given that you eventually come back to the task you momentarily dropped to make progress elsewhere. [4]

Notice I said lack of early focus. When a task is nearly completed, the value of finishing it usually becomes easy to assess. This is when I roll up my sleeves, do the damn thing, land my frigging Chrome patch and get on with the next most interesting task in my ever-expanding diagonal walk.

[2] Skipping some nitty-gritty details, if you're interested check
[3] For a nice visual explanation, check this:
[4] I find a lot of parallel between this and the idea of failing early and pivoting that seems to be the new leitmotiv of hip entrepreneurs.

Questioning peer review. It seems that a great many communities are getting fed up with peer review as it is practiced. We get this recent comment by Larry Wasserman, a famous statistician, who envisions a "World Without Peer Review" ( A few years ago, the Machine Learning community had the same debate at NIPS as part of a "Future Publication Model" (

Yet, it goes beyond this. The editorial in the latest issue of the very serious Review of Financial Studies argues that we should be "Reviewing Less and Progressing More" ( — paywalled).

So what are examples of "Occupy Peer Review" in your research community? More importantly, what's being done in practice to "fix" it? It seems that the main reason for the continuing dominance and reflexive defence of this model is that it remains the only one universally understood by university selection and tenure committees. Until an easily-understood alternative for evaluating research contributions reaches critical mass, the venerable if archaic 350-year-old system of peer review will, for worse, remain the default.