*Building High-level Features Using Large Scale Unsupervised Learning* 

Posted by Quoc Le, Software Engineer

Despite machine learning’s many successes, one of the key challenges in building machine learning applications is that typically a lot of work is needed to design “features” for each application.  By developing a large scale artificial neural network with 1.15 billion parameters, and training it on 16,000 CPU cores for a week, we recently showed that it is possible to automatically learn good feature representations that can capture even high-level concepts.  For example, trained on completely unlabeled YouTube data, our algorithm automatically developed separate neurons that were highly selective for faces, for cats, for pedestrians, and for other high-level concepts.

To take a concrete example, consider the problem of learning to recognize cars.  Images in computers are represented as a list of values [255, 128, 2, 20, …] comprising the pixel intensities of an image.  If we were to apply a traditional (supervised) machine learning algorithm to this raw data, it would be very difficult for the algorithm to tell whether these pixel values represent a car.  Thus, computer vision researchers have traditionally spent a significant amount of time coding up different “features”, meaning different ways to represent an image.  For example, one feature might try to find edges in the image, since edge locations are a higher-level concept than pixels, and more useful for recognizing objects.  Another feature might find different combinations of edges.  In applied machine learning as practiced in most software products today, a lot of the effort in developing new systems goes into hand-designing features for new problems.
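As an illustrative sketch (not the method from the paper), here is what one such hand-designed edge feature might look like: convolving pixel intensities with a small gradient filter that responds where brightness changes.  The tiny image and filter below are made up for the example.

```python
import numpy as np

# A hypothetical 4x4 grayscale "image": bright left half, dark right half.
image = np.array([
    [255, 255, 0, 0],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
], dtype=float)

# A simple horizontal-gradient filter: large response where intensity
# changes from left to right, zero where it is constant.
kernel = np.array([[-1.0, 1.0]])

def edge_feature_map(img, k):
    """Slide the filter over the image ("valid" 2D correlation)."""
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

fmap = edge_feature_map(image, kernel)
# The strongest responses land on the column where bright meets dark,
# i.e., the filter has found the vertical edge.
print(np.argmax(np.abs(fmap), axis=1))  # [1 1 1 1]
```

The point of learned features is that the network discovers filters like this one (and far more complex ones) on its own, instead of an engineer writing them by hand.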

Building on several ideas from the Deep Learning literature, including ICA (independent component analysis) and tiled convolutional networks, and by distributing the network across many CPU cores, we trained a network on 10 million images extracted from YouTube videos.  This training process used only unlabeled data.  We then probed different neurons in the network to see if any of them were highly selective for (i.e., responded strongly to) particular common objects.  Perhaps in line with a stereotype about what goes on YouTube, we found a neuron that was highly selective for images of cat faces.
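The probing step can be pictured as checking how well a single neuron's activation separates images of one object class from everything else.  A minimal sketch, with made-up activation values standing in for the trained network's outputs:

```python
import numpy as np

# Hypothetical activations of one neuron on two sets of images
# (in the real experiment these would come from the trained network).
acts_cat_faces = np.array([0.92, 0.85, 0.88, 0.95, 0.90])  # images with cat faces
acts_other     = np.array([0.10, 0.22, 0.05, 0.15, 0.18])  # distractor images

def selectivity(pos, neg):
    """A simple selectivity index: the gap between mean activations,
    normalized by their combined spread (epsilon avoids division by zero)."""
    return (pos.mean() - neg.mean()) / (pos.std() + neg.std() + 1e-8)

s = selectivity(acts_cat_faces, acts_other)
# A large positive index means the neuron fires much more strongly for
# the target class -- i.e., it is "highly selective" for cat faces.
print(s > 1.0)  # True for these made-up values
```

The specific index above is an assumption for illustration; the paper evaluates neurons with its own test sets and statistical controls.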

The trained network contained 3 million neurons and 1 billion synapses, most of which compute very complex functions of the input and for which we don’t have simple English explanations (such as “cat detector”).  The values computed by these neurons, however, comprise a very useful set of high-level features for describing an image.  For example, depending on the application--say, trying to automatically annotate an image with a set of labels--it’s far more useful to know if there are people/cats/etc. in an image than to know the locations of edges in an image.  We also experimented with the ImageNet dataset, a ~20,000-category object classification problem that is a major computer vision benchmark.  Random guessing thus achieves only 0.005% accuracy on this dataset, and because many labels are ambiguous (“triangle” vs. “isosceles triangle”) or easily confused with other labels (“sting ray” vs. “manta ray”), human performance on this task is also far from perfect.  By augmenting our learning system with a small amount of labeled data, we were able to take advantage of the high-level features learned by our network and improve the state-of-the-art accuracy from 9.3% to 15.8%--a 70% relative improvement.
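The headline numbers in this paragraph follow from simple arithmetic; the percentages come from the paper, and the few lines below are just a sanity check:

```python
# Random guessing over ~20,000 equally likely categories.
num_categories = 20_000
random_accuracy = 100.0 / num_categories   # accuracy in percent
print(random_accuracy)                     # 0.005

# Relative improvement from the previous state of the art (9.3%)
# to our result (15.8%).
previous, ours = 9.3, 15.8
relative_gain = (ours - previous) / previous * 100
print(round(relative_gain))                # 70
```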

By training larger models with larger amounts of labeled and unlabeled data, we expect that significant further improvements can be made.

This post is a high-level summary of a paper I co-authored with +Marc'Aurelio Ranzato, +Rajat Monga, +Matthieu Devin, +Kai Chen, +Greg Corrado, +Jeff Dean and +Andrew Ng.  The full paper is being presented at the International Conference on Machine Learning (ICML 2012) this week, and can be found here: http://research.google.com/archive/unsupervised_icml2012.html

You can also check out this article in The New York Times describing our method of object recognition as "a cybernetic cousin to what takes place in the brain’s visual cortex": http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html?_r=2&hpw&pagewanted=all.