**Deep Learning and Graphical Models**

I sometimes get questions like "how does deep learning compare with graphical models?". There is no answer to this question because deep learning and graphical models are orthogonal concepts that can be (and have been) combined.

Let me state this very clearly: there is no opposition between the two paradigms. They can be advantageously combined.

Of course, deep Boltzmann Machines are a form of probabilistic factor graph themselves. But there are other ways in which the concepts can be combined.

For example, you could imagine a factor graph in which the factors themselves contain a deep neural net. A good example would be a dynamical factor graph in which the state vector at time t, Z(t), is predicted from the states and inputs at previous times through a deep neural net (perhaps a temporal convolutional net). A simple instance is when the log factor is equal to ||Z(t) - G(Z(t-1), X(t))||^2, where G is a deep neural net.

This simply says that the conditional distribution of Z(t) given Z(t-1) and X(t) is a Gaussian of mean G(Z(t-1), X(t)) and covariance unity.
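Written out, this correspondence looks roughly as follows. This is a sketch that assumes the conventional factor of 1/2 (which the expression above leaves out) and uses d for the dimension of Z(t), a symbol introduced here only for illustration:

```latex
p\bigl(Z(t) \mid Z(t-1), X(t)\bigr)
  = \frac{1}{(2\pi)^{d/2}}
    \exp\!\Bigl(-\tfrac{1}{2}\,\bigl\| Z(t) - G\bigl(Z(t-1), X(t)\bigr) \bigr\|^2\Bigr)

-\log p\bigl(Z(t) \mid Z(t-1), X(t)\bigr)
  = \tfrac{1}{2}\,\bigl\| Z(t) - G\bigl(Z(t-1), X(t)\bigr) \bigr\|^2
    + \tfrac{d}{2}\log(2\pi)
```

The additive constant does not depend on Z(t) or on the parameters of G, so it can be dropped when minimizing.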

This type of dynamic factor graph can be used to model multi-dimensional time series. When a sequence X(t) is observed, one can infer the most likely sequence of hidden states Z(t) by minimizing the sum of the log factors (which we can call an energy function).

Once the optimal Z(t) is found, one can update the parameters of the network G() to make the energy smaller.
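The two steps above (infer the states, then update the network) can be sketched in a few lines of NumPy. This is only a toy illustration under loud assumptions: G is a single tanh layer standing in for a deep net, the names `A`, `B`, `infer_states` and `learn_step` are made up for this example, the data are random, and the factor 1/2 is added for convenience. With only this one kind of factor the energy can be driven very close to zero, so the point is the alternation, not the quality of the model.

```python
import numpy as np

def energy_and_grads(Z, X, A, B):
    """Energy 0.5 * sum_t ||Z[t] - G(Z[t-1], X[t])||^2 and its gradients.

    G(z, x) = tanh(A z + B x) is a one-layer stand-in for a deep net (assumption).
    """
    P = np.tanh(Z[:-1] @ A.T + X[1:] @ B.T)   # predictions of Z[1:], shape (T-1, d)
    R = Z[1:] - P                             # residuals
    energy = 0.5 * np.sum(R ** 2)
    D = -R * (1.0 - P ** 2)                   # dE/d(pre-activation)
    gZ = np.zeros_like(Z)
    gZ[1:] += R                               # Z[t] as the predicted target
    gZ[:-1] += D @ A                          # Z[t-1] as the input of G
    gA = D.T @ Z[:-1]
    gB = D.T @ X[1:]
    return energy, gZ, gA, gB

def infer_states(Z, X, A, B, steps=50, lr=0.1):
    """Approximate MAP inference: gradient descent on the energy w.r.t. Z."""
    for _ in range(steps):
        _, gZ, _, _ = energy_and_grads(Z, X, A, B)
        Z = Z - lr * gZ
    return Z

def learn_step(Z, X, A, B, lr=0.01):
    """One gradient step on the parameters of G, with the states held fixed."""
    _, _, gA, gB = energy_and_grads(Z, X, A, B)
    return A - lr * gA, B - lr * gB

# Toy data: random inputs, random initial states, small random weights.
rng = np.random.default_rng(0)
T, d, m = 20, 3, 2
X = rng.normal(size=(T, m))
Z0 = rng.normal(size=(T, d))
A = 0.1 * rng.normal(size=(d, d))
B = 0.1 * rng.normal(size=(d, m))

e_start = energy_and_grads(Z0, X, A, B)[0]
Z_map = infer_states(Z0, X, A, B)
e_inferred = energy_and_grads(Z_map, X, A, B)[0]
A1, B1 = learn_step(Z_map, X, A, B)
e_learned = energy_and_grads(Z_map, X, A1, B1)[0]
print(e_start, e_inferred, e_learned)
```

In practice one would repeat the two steps over many sequences; a deeper G only changes how the gradients are computed.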

A more sophisticated version of this could be used to learn the covariance of the Gaussians, or to marginalize over the Z(t) sequence instead of just doing MAP inference (only taking into account the sequence with the lowest energy).
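In symbols, the difference between these options can be sketched as follows. The notation is illustrative only: E(Z, X) stands for the sum of the log factors over time, and \Sigma for a covariance matrix to be learned.

```latex
% MAP inference: keep only the lowest-energy state sequence
Z^{*} = \arg\min_{Z} E(Z, X)

% Marginalization: a free energy that takes all state sequences into account
F(X) = -\log \int \exp\bigl(-E(Z, X)\bigr)\, dZ

% Gaussian factor with a learned covariance \Sigma instead of unit covariance
E_{\Sigma}(Z, X) = \sum_{t} \Bigl[
    \tfrac{1}{2}\, r_t^{\top} \Sigma^{-1} r_t
    + \tfrac{1}{2} \log\det\Sigma \Bigr],
\qquad r_t = Z(t) - G\bigl(Z(t-1), X(t)\bigr)
```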

An example of such a "factor graph with deep factors" was described in a 2009 ECML paper with my former student +Piotr Mirowski (who is now at Bell Labs), "Dynamic Factor Graphs for Time Series Modeling"

(Piotr Mirowski & Yann LeCun, ECML 2009): http://yann.lecun.com/exdb/publis/pdf/mirowski-ecml-09.pdf

A similar model used auto-encoder-type unsupervised pre-training to do language modeling: "Dynamic Auto-Encoders for Semantic Indexing" (Piotr Mirowski & Yann LeCun, NIPS Workshop on Deep Learning, 2010):

http://yann.lecun.com/exdb/publis/pdf/mirowski-nipsdl-10.pdf

Another way to combine deep learning with graphical models is through structured prediction. To some, this may sound like a new idea, but the history of this goes back to the early 90's. +Leon Bottou and Xavier Driancourt used a sequence alignment module on top of a temporal convolutional net to do spoken word recognition. They trained the convnet and the elastic word models simultaneously, at the word level, by back-propagating gradients through the time alignment module (which you can see as a kind of factor graph in which the time warping function is a latent variable).
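To give a feel for what back-propagating through a time alignment module means, here is a minimal sketch using plain dynamic time warping. It is not the system described above: the squared-distance local cost, the function names and the toy sequences are all assumptions made for illustration. The alignment path plays the role of the latent variable, and the gradient of the minimum alignment cost reaches the feature vectors only through the pairs on the best path.

```python
import numpy as np

def dtw(features, template):
    """Minimum-cost monotonic alignment between two sequences of vectors."""
    n, m = len(features), len(template)
    # Local cost: squared distance between every feature/template pair.
    local = ((features[:, None, :] - template[None, :, :]) ** 2).sum(-1)
    acc = np.full((n, m), np.inf)
    acc[0, 0] = local[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = local[i, j] + best_prev
    # Backtrack from the end to recover the best path (the latent warping).
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0:
            candidates.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i, j - 1], (i, j - 1)))
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        _, (i, j) = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    path.reverse()
    return acc[-1, -1], path

def grad_wrt_features(features, template, path):
    """Gradient of the alignment cost w.r.t. the features, along the best path."""
    grad = np.zeros_like(features)
    for i, j in path:
        grad[i] += 2.0 * (features[i] - template[j])
    return grad

# Toy example: the features are a time-stretched copy of the template.
template = np.array([[0.0], [1.0], [2.0]])
features = np.array([[0.0], [0.0], [1.0], [2.0], [2.0]])
cost, path = dtw(features, template)
grad = grad_wrt_features(features, template, path)
print(cost, path)
```

In a trainable system, `grad` would be passed back into the network that produced the features, and a similar gradient with respect to `template` would update the word model.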

In the early 90's, Leon, +Yoshua Bengio and +Patrick Haffner built "hybrid" speech recognition systems in which a temporal convolutional net and an HMM were trained simultaneously using a discriminative criterion at the word (or sentence) level.

A few years later, Leon, Yoshua, Patrick and I used similar ideas to train our handwriting recognition system. Instead of a normalized HMM, we used a kind of energy-based factor graph without normalization. The normalization is superfluous (even hurtful) when the training is discriminative. We called this "Graph Transformer Networks". This was first published at CVPR 1997 and ICASSP 1997, but the best explanation of it is in our 1998 Proc. IEEE paper: http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf

Some of the history of this with detailed bibliography is available in the paper "A Tutorial on Energy-Based Learning": http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf (starting around Section 6).

- And how about sum-product networks? Isn't it where DL and PGM are connected most closely? (Apr 28, 2013)
- Yann LeCun (Owner): The relationship between factor graphs and SPN is complicated. On the one hand, graphical models are a special case of SPN with a single product layer that computes the product of all the factors. On the other hand, SPNs must be trees, which means that a variable must be input to only one first-layer factor (which would make a single-layer SPN a rather trivial form of factor graph). (Apr 28, 2013)
- Yes, SPNs combine advantages of DL and PGMs. Yann, I couldn't help but notice some common misconceptions about SPNs (e.g., "must be trees", and that a PGM is just an SPN with "a single product layer"). We've compiled answers to these and other SPN FAQs here:

http://spn.cs.washington.edu/faq.shtml#a03 (Aug 28, 2013)