Profile

Debasish Ghosh
Works at NRI Fintech
Attended Jadavpur University
Lives in Kolkata, India
1,559 followers | 506,926 views

Stream

Debasish Ghosh

Shared publicly
Critique of Paper by "Deep Learning Conspiracy" (Nature 521 p 436)

Machine learning is the science of credit assignment. The machine learning community itself profits from proper credit assignment to its members. The inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it). Relatively young research areas such as machine learning should adopt the honor code of mature fields such as mathematics: if you have a new theorem, but use a proof technique similar to somebody else's, you must make this very clear. If you "re-invent" something that was already known, and only later become aware of this, you must at least make it clear later.

As a case in point, let me now comment on a recent article in Nature (2015) about "deep learning" in artificial neural networks (NNs), by LeCun & Bengio & Hinton (LBH for short), three CIFAR-funded collaborators who call themselves the "deep learning conspiracy" (e.g., LeCun, 2015). They heavily cite each other. Unfortunately, however, they fail to credit the pioneers of the field, which originated half a century ago. All references below are taken from the recent deep learning overview (Schmidhuber, 2015), except for a few papers listed beneath this critique focusing on nine items.

1. LBH's survey does not even mention the father of deep learning, Alexey Grigorevich Ivakhnenko, who published the first general, working learning algorithms for deep networks (e.g., Ivakhnenko and Lapa, 1965). A paper from 1971 already described a deep learning net with 8 layers (Ivakhnenko, 1971), trained by a highly cited method still popular in the new millennium. Given a training set of input vectors with corresponding target output vectors, layers of additive and multiplicative neuron-like nodes are incrementally grown and trained by regression analysis, then pruned with the help of a separate validation set, where regularisation is used to weed out superfluous nodes. The numbers of layers and nodes per layer can be learned in problem-dependent fashion.
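For readers unfamiliar with Ivakhnenko's approach (the Group Method of Data Handling), here is a minimal, hedged Python/NumPy sketch of the layer-growing idea described above: candidate units formed from pairs of inputs are fitted by least-squares regression on a training set, the best ones (judged on a separate validation set) are kept as the next layer, and growth stops when validation error no longer improves. The function names and the quadratic candidate form are illustrative choices, not Ivakhnenko's exact formulation.

```python
import numpy as np
from itertools import combinations

def fit_candidate(xi, xj, y):
    """Least-squares fit of y ~ a + b*xi + c*xj + d*xi*xj + e*xi^2 + f*xj^2."""
    A = np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def eval_candidate(coef, xi, xj):
    A = np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])
    return A @ coef

def grow_gmdh(X_tr, y_tr, X_va, y_va, keep=8, max_layers=10):
    """Incrementally grow layers of regression units; prune with a validation set."""
    best_err = np.inf
    for layer in range(max_layers):
        scored = []
        for i, j in combinations(range(X_tr.shape[1]), 2):
            coef = fit_candidate(X_tr[:, i], X_tr[:, j], y_tr)
            pred_va = eval_candidate(coef, X_va[:, i], X_va[:, j])
            scored.append((np.mean((pred_va - y_va) ** 2), i, j, coef))
        scored.sort(key=lambda t: t[0])
        top = scored[:keep]
        if top[0][0] >= best_err:      # validation error stopped improving: stop growing
            break
        best_err = top[0][0]
        # outputs of the surviving units become the next layer's inputs
        X_tr = np.column_stack([eval_candidate(c, X_tr[:, i], X_tr[:, j]) for _, i, j, c in top])
        X_va = np.column_stack([eval_candidate(c, X_va[:, i], X_va[:, j]) for _, i, j, c in top])
    return best_err
```

The numbers of layers and surviving units per layer emerge from the data and the validation set rather than being fixed in advance, which is exactly the property the paragraph above highlights.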

2. LBH discuss the importance and problems of gradient descent-based learning through backpropagation (BP), and cite their own papers on BP, plus a few others, but fail to mention BP's inventors. BP's continuous form was derived in the early 1960s (Bryson, 1961; Kelley, 1960; Bryson and Ho, 1969). Dreyfus (1962) published the elegant derivation of BP based on the chain rule only. BP's modern efficient version for discrete sparse networks (including FORTRAN code) was published by Linnainmaa (1970). Dreyfus (1973) used BP to change weights of controllers in proportion to such gradients. By 1980, automatic differentiation could derive BP for any differentiable graph (Speelpenning, 1980). Werbos (1982) published the first application of BP to NNs, extending thoughts in his 1974 thesis (cited by LBH), which did not have Linnainmaa's (1970) modern, efficient form of BP. BP for NNs on computers 10,000 times faster per dollar than those of the 1960s can yield useful internal representations, as shown by Rumelhart et al. (1986), who also did not cite BP's inventors.
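To make the chain-rule derivation concrete, here is a minimal Python/NumPy sketch of backpropagation for a two-layer network with a quadratic loss. It illustrates the general principle only, not any particular historical formulation; the network sizes and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))            # inputs
Y = rng.normal(size=(64, 2))            # targets
W1 = rng.normal(size=(3, 5)) * 0.1
W2 = rng.normal(size=(5, 2)) * 0.1

for step in range(200):
    # forward pass
    h_pre = X @ W1                      # hidden pre-activations
    h = np.tanh(h_pre)                  # hidden activations
    y_hat = h @ W2                      # network output
    loss = 0.5 * np.sum((y_hat - Y) ** 2) / len(X)

    # backward pass: repeated application of the chain rule
    d_yhat = (y_hat - Y) / len(X)       # dL/dy_hat
    d_W2 = h.T @ d_yhat                 # dL/dW2
    d_h = d_yhat @ W2.T                 # dL/dh
    d_hpre = d_h * (1.0 - h ** 2)       # dL/dh_pre, using tanh'(x) = 1 - tanh(x)^2
    d_W1 = X.T @ d_hpre                 # dL/dW1

    W1 -= 0.1 * d_W1                    # plain gradient descent
    W2 -= 0.1 * d_W2
```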

3. LBH claim: "Interest in deep feedforward networks [FNNs] was revived around 2006 (refs 31-34) by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR)." Here they refer exclusively to their own labs, which is misleading. For example, by 2006, many researchers had used deep nets of the Ivakhnenko type for decades. LBH also ignore earlier, closely related work funded by other sources, such as the deep hierarchical convolutional neural abstraction pyramid (e.g., Behnke, 2003b), which was trained to reconstruct images corrupted by structured noise, enforcing increasingly abstract image representations in deeper and deeper layers. (BTW, the term "Deep Learning" (the very title of LBH's paper) was introduced to Machine Learning by Dechter (1986), and to NNs by Aizenberg et al. (2000), neither of them cited by LBH.)

4. LBH point to their own work (since 2006) on unsupervised pre-training of deep FNNs prior to BP-based fine-tuning, but fail to clarify that this was very similar in spirit and justification to the much earlier successful work on unsupervised pre-training of deep recurrent NNs (RNNs) called neural history compressors (Schmidhuber, 1992b, 1993b). Such RNNs are even more general than FNNs. A first RNN uses unsupervised learning to predict its next input. Each higher level RNN tries to learn a compressed representation of the information in the RNN below, to minimise the description length (or negative log probability) of the data. The top RNN may then find it easy to classify the data by supervised learning. One can even "distill" a higher, slow RNN (the teacher) into a lower, fast RNN (the student), by forcing the latter to predict the hidden units of the former. Such systems could solve previously unsolvable very deep learning tasks, and started our long series of successful deep learning methods since the early 1990s (funded by Swiss SNF, German DFG, EU and others), long before 2006, although everybody had to wait for faster computers to make very deep learning commercially viable. LBH also ignore earlier FNNs that profit from unsupervised pre-training prior to BP-based fine-tuning (e.g., Maclin and Shavlik, 1995). They cite Bengio et al.'s post-2006 papers on unsupervised stacks of autoencoders, but omit the original work on this (Ballard, 1987).
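The core trick of the history compressor is easy to state: a lower-level predictor tries to predict the next input, and only the inputs it gets wrong (the "surprising" ones) are passed upward, so the level above sees a shorter, more compressed sequence. The Python sketch below illustrates just that principle; it uses a trivial frequency-table predictor as a stand-in for the level-1 RNN, whereas the real method (Schmidhuber, 1992b) uses predictive RNNs and also forwards timing information.

```python
from collections import defaultdict, Counter

def compress_one_level(sequence):
    """Forward only unpredicted symbols to the next level.

    A frequency table over (previous symbol -> next symbol) stands in here
    for the predictive RNN of the real history compressor.
    """
    table = defaultdict(Counter)
    surprising = []
    prev = None
    for sym in sequence:
        predicted = table[prev].most_common(1)[0][0] if table[prev] else None
        if predicted != sym:            # prediction failed: pass the symbol upward
            surprising.append(sym)
        table[prev][sym] += 1           # keep learning online
        prev = sym
    return surprising

seq = list("abcabcabcabxabcabc")
level2_input = compress_one_level(seq)
print(len(seq), "->", len(level2_input))   # the higher level sees far fewer symbols
```

Once the lower level has learned the regularities, the higher level only has to model what remains unpredictable, which is the sense in which the stack minimises the description length of the data.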

5. LBH write that "unsupervised learning (refs 91-98) had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning." Again they almost exclusively cite post-2005 papers co-authored by themselves. By 2005, however, this transition from unsupervised to supervised learning was old hat, because back in the 1990s, our unsupervised RNN-based history compressors (see above) were largely phased out by our purely supervised Long Short-Term Memory (LSTM) RNNs, now widely used in industry and academia for processing sequences such as speech and video. Around 2010, history repeated itself, as unsupervised FNNs were largely replaced by purely supervised FNNs, after our plain GPU-based deep FNN (Ciresan et al., 2010) trained by BP with pattern distortions (Baird, 1990) set a new record on the famous MNIST handwritten digit dataset, suggesting that advances in exploiting modern computing hardware were more important than advances in algorithms. While LBH mention the significance of fast GPU-based NN implementations, they fail to cite the originators of this approach (Oh and Jung, 2004).
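"Pattern distortions" here simply means enlarging the training set with randomly deformed copies of each image (small rotations, shifts, scalings, and similar). Below is a hedged Python/SciPy sketch of such augmentation for MNIST-sized images; the distortion ranges are arbitrary illustrative choices, not the settings used in the cited work.

```python
import numpy as np
from scipy import ndimage

def distort(image, rng):
    """Return a randomly rotated, shifted and rescaled copy of a 28x28 image."""
    angle = rng.uniform(-15, 15)                 # degrees
    shift = rng.uniform(-2, 2, size=2)           # pixels
    scale = rng.uniform(0.9, 1.1)
    out = ndimage.rotate(image, angle, reshape=False, order=1)
    out = ndimage.shift(out, shift, order=1)
    out = ndimage.zoom(out, scale, order=1)
    # crop or zero-pad back to the original shape after zooming
    h, w = image.shape
    out = out[:h, :w]
    padded = np.zeros_like(image)
    padded[:out.shape[0], :out.shape[1]] = out
    return padded

rng = np.random.default_rng(0)
batch = rng.random((8, 28, 28))                  # stand-in for a batch of digit images
augmented = np.stack([distort(img, rng) for img in batch])
```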

6. In the context of convolutional neural networks (ConvNets), LBH mention pooling, but not its pioneer (Weng, 1992), who replaced Fukushima's (1979) spatial averaging by max-pooling, today widely used by many, including LBH, who write: "ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012," citing Hinton's 2012 paper (Krizhevsky et al., 2012). This is misleading. Earlier, committees of max-pooling ConvNets were accelerated on GPU (Ciresan et al., 2011a), and used to achieve the first superhuman visual pattern recognition in a controlled machine learning competition, namely, the highly visible IJCNN 2011 traffic sign recognition contest in Silicon Valley (relevant for self-driving cars). The system was twice as good as humans, and three times as good as the nearest artificial competitor (a system co-authored by LeCun of LBH). It also broke several other machine learning records, and surely was not "forsaken" by the machine-learning community. In fact, the later system (Krizhevsky et al., 2012) was very similar to the earlier 2011 system. Here one must also mention that the first official international contests won with the help of ConvNets actually date back to 2009 (three TRECVID competitions); compare Ji et al. (2013). A GPU-based max-pooling ConvNet committee was also the first deep learner to win a contest on visual object discovery in large images, namely, the ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images (Ciresan et al., 2013). A similar system was the first deep learning FNN to win a pure image segmentation contest (Ciresan et al., 2012a), namely, the ISBI 2012 Segmentation of Neuronal Structures in EM Stacks Challenge.
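Since max-pooling is central to the history recounted above, here is a tiny Python/NumPy sketch of the operation itself: each non-overlapping 2x2 patch of a feature map is replaced by its maximum, in contrast to the earlier spatial averaging. The helper name and window size are illustrative.

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """Downsample an (H, W) feature map over non-overlapping 2x2 windows."""
    h, w = feature_map.shape
    patches = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    if mode == "max":                     # max-pooling, as credited to Weng (1992) above
        return patches.max(axis=(1, 3))
    return patches.mean(axis=(1, 3))      # spatial averaging, as credited to Fukushima (1979)

fm = np.arange(16.0).reshape(4, 4)
print(pool2x2(fm, "max"))    # [[ 5.  7.] [13. 15.]]
print(pool2x2(fm, "mean"))   # [[ 2.5  4.5] [10.5 12.5]]
```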

7. LBH discuss their FNN-based speech recognition successes in 2009 and 2012, but fail to mention that deep LSTM RNNs had outperformed traditional speech recognizers on certain tasks already in 2007 (Fernández et al., 2007) (and traditional connected handwriting recognizers by 2009), and that today's speech recognition conferences are dominated by (LSTM) RNNs, not by the FNNs of 2009, etc. While LBH cite work co-authored by Hinton on LSTM RNNs with several LSTM layers, this approach was pioneered much earlier (e.g., Fernández et al., 2007).

8. LBH mention recent proposals such as "memory networks" and the somewhat misnamed "Neural Turing Machines" (which do not have an unlimited number of memory cells like real Turing machines), but ignore very similar proposals of the early 1990s, on neural stack machines, fast weight networks, self-referential RNNs that can address and rapidly modify their own weights during runtime, etc. (e.g., AMAmemory, 2015). They write that "Neural Turing machines can be taught algorithms," as if this were something new, although LSTM RNNs were taught algorithms many years earlier, even entire learning algorithms (e.g., Hochreiter et al., 2001b).

9. In their outlook, LBH mention "RNNs that use reinforcement learning to decide where to look" but not that they were introduced a quarter-century ago (Schmidhuber & Huber, 1991). Compare the more recent Compressed NN Search for large attention-directing RNNs (Koutnik et al., 2013).

One more little quibble: While LBH suggest that "the earliest days of pattern recognition" date back to the 1950s, the cited methods are actually very similar to linear regressors of the early 1800s, by Gauss and Legendre. Gauss famously used such techniques to recognize predictive patterns in observations of the asteroid Ceres.

LBH may be backed by the best PR machines of the Western world (Google hired Hinton; Facebook hired LeCun). In the long run, however, historic scientific facts (as evident from the published record) will be stronger than any PR. There is a long tradition of insights into deep learning, and the community as a whole will benefit from appreciating the historical foundations.

The contents of this critique may be used (also verbatim) for educational and non-commercial purposes, including articles for Wikipedia and similar sites.

References not yet in the survey (Schmidhuber, 2015):

Y. LeCun, Y. Bengio, G. Hinton (2015). Deep Learning. Nature 521, 436-444. http://www.nature.com/nature/journal/v521/n7553/full/nature14539.html

Y. LeCun (2015). IEEE Spectrum Interview by L. Gomes, Feb 2015: http://spectrum.ieee.org/automaton/robotics/artificial-intelligence/facebook-ai-director-yann-lecun-on-deep-learning

R. Dechter (1986). Learning while searching in constraint-satisfaction problems. University of California, Computer Science Department, Cognitive Systems Laboratory. First paper to introduce the term "Deep Learning" to Machine Learning.

I. Aizenberg, N. N. Aizenberg, and J. P. L. Vandewalle (2000). Multi-Valued and Universal Binary Neurons: Theory, Learning and Applications. Springer Science & Business Media. First work to introduce the term "Deep Learning" to Neural Networks. Compare a popular G+ post on this: https://plus.google.com/100849856540000067209/posts/7N6z251w2Wd?pid=6127540521703625346&oid=100849856540000067209.

J. Schmidhuber (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117. Preprint: http://arxiv.org/abs/1404.7828

AMAmemory (2015): Answer at reddit AMA (Ask Me Anything) on "memory networks" etc (with references): http://www.reddit.com/r/MachineLearning/comments/2xcyrl/i_am_j%C3%BCrgen_schmidhuber_ama/cp0q12t


#machinelearning
#artificialintelligence
#computervision
#deeplearning

Link: http://people.idsia.ch/~juergen/deep-learning-conspiracy.html

Debasish Ghosh

Shared publicly
I just learned with immense sadness of the passing of Paul Hudak.

Many people will praise him for his technical (and artistic) accomplishments. I instead wanted to share a personal story.

Over a decade ago, I co-wrote a paper that by rights should have been sent to him to review. It was, and he rightly panned it.

But he panned it with kindness, and he signed his review, and offered to talk about it (it was a single-blind submission). So after getting his response, I wrote him and he offered to have us visit him at Yale. We drove down and spent with him what was the single most informative day of my research career.

To me, that embodies Paul. Where he could have been harsh—with every justification—he was kind, open, and helpful. He helped me, in whom he had no stake, become a better researcher, and asked for nothing in return (and all I could offer was a bland acknowledgment that could not possibly do justice to what he'd done for us).

If we all could live up to that same spirit of generosity, science would be much better.

Debasish Ghosh

Shared publicly
A few months ago we had a small post here discussing different weight initializations, and I remember +Sander Dieleman and a few others had a good discussion. It is fairly important to do good weight initialization, as the rewards are non-trivial.
For example, AlexNet, which is fairly popular, from Alex's One Weird Trick paper, converges in 90 epochs (using Alex's 0.01 std-dev initialization).
I retrained it from scratch using the weight initialization from Yann's '98 paper, and it converges to the same error within just 50 epochs, so technically +Alex Krizhevsky could have rewritten the paper with even more stellar results (training AlexNet in 8 hours with 8 GPUs).
In fact, more interestingly, just by doing good weight initialization, I could even remove the Local Response Normalization layers in AlexNet with no drop in error.
I've noticed the same trend with several other ImageNet-size models, like Overfeat and OxfordNet: they converge in far fewer epochs than reported in their papers, just from this small change in weight initialization.
If you want the exact formulae, look at the two links below:
https://github.com/torch/nn/blob/master/SpatialConvolution.lua#L28
https://github.com/torch/nn/blob/master/Linear.lua#L18
And read Yann's '98 paper, Efficient Backprop: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
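The formula behind those links is simple: instead of a fixed-std-dev Gaussian that ignores how many inputs feed each unit, the weights are scaled by the fan-in so that every layer's outputs start with a comparable variance. Below is a hedged Python/NumPy paraphrase of the idea; the linked Torch code draws from a uniform distribution with bounds 1/sqrt(fan_in), and the exact details there may differ slightly from this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_fixed_std(fan_in, fan_out, std=0.01):
    """AlexNet-style: fixed-std Gaussian, independent of layer width."""
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def init_fan_in(fan_in, fan_out):
    """LeCun '98 / old Torch-style: uniform in [-1/sqrt(fan_in), 1/sqrt(fan_in)]."""
    bound = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

# Output scale for layers of different widths, given unit-variance inputs:
# the fixed-std init drifts with the width, the fan-in init stays put.
for fan_in in (256, 4096):
    x = rng.normal(size=(128, fan_in))
    print(fan_in,
          round(float(np.std(x @ init_fixed_std(fan_in, fan_in))), 3),  # 0.01*sqrt(fan_in)
          round(float(np.std(x @ init_fan_in(fan_in, fan_in))), 3))     # ~0.58 regardless of width
```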

On that note, Surya Ganguli's talk at this year's NIPS workshop on optimal weight initializations triggered this post. Check out his papers on the topic; great work.

Debasish Ghosh

Shared publicly
Videos represent an abundant and rich source of (unsupervised) visual information. Extracting meaningful representations from large volumes of unconstrained video sequences in an unsupervised fashion is quite challenging.

Here is one attempt at doing this, but I suspect more research will follow very shortly.

http://arxiv.org/abs/1502.04681
Abstract: We use multilayer Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different ...
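The basic shape of the model in the abstract is an LSTM autoencoder over frame sequences: an encoder LSTM reads the frames, its final state becomes the fixed-length representation, and a decoder LSTM unrolls that representation to reconstruct (or predict) frames. The PyTorch sketch below is only a simplified illustration of that structure, not the authors' code; the layer sizes and the way the code is fed to the decoder are arbitrary choices here.

```python
import torch
from torch import nn

class SeqAutoencoder(nn.Module):
    """Encoder LSTM -> fixed-length code -> decoder LSTM reconstructing frames."""

    def __init__(self, frame_dim=1024, hidden_dim=256):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames):                      # frames: (batch, time, frame_dim)
        _, (h, _) = self.encoder(frames)            # h: (1, batch, hidden_dim)
        code = h[-1]                                # the fixed-length representation
        t = frames.size(1)
        dec_in = code.unsqueeze(1).repeat(1, t, 1)  # feed the code at every decoding step
        dec_out, _ = self.decoder(dec_in)
        return self.readout(dec_out)                # reconstructed frame sequence

model = SeqAutoencoder()
frames = torch.randn(4, 16, 1024)                   # 4 clips of 16 flattened frames
loss = nn.functional.mse_loss(model(frames), frames)
loss.backward()
```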

Debasish Ghosh

Shared publicly
Thanks for sharing.
50 Years of Deep Learning and Beyond: an Interview with Jürgen Schmidhuber
INNS Big Data Conference website:  Over the last decades, Jürgen Schmidhuber has been one of the leading protagonists in the advancement of machine learning and neural networks. Alone or with his r...

Debasish Ghosh

Shared publicly
Has to be one of the best introductions to monads.
Answer by Tikhon Jelvis to "What are monads and why are they useful?" I found this to be the best in-depth explanation of what monads are.

Debasish Ghosh

Shared publicly
Functional Patterns in Domain Modeling - Composing a domain workflow with statically checked invariants
I have been doing quite a bit of domain modeling using functional programming mostly in Scala. And as it happens when you work on something for a long period of time you tend to identify more and more patterns that come up repeatedly within your implementat...
 
Sir, can you share examples of the retail domain, like online shopping?

Debasish Ghosh

Shared publicly
Baking a π can teach you a bit of Parametricity
Even though I got my copy of Prof. Eugenia Cheng's awesome How to Bake π a couple of weeks back, I started reading it only over this weekend. I am only on page 19 enjoying all the stuff regarding cookies that Prof. Cheng is using to explain abstraction. Thi...

Debasish Ghosh

Shared publicly
Randomization and Probabilistic Techniques to scale up Machine Learning
Some time back I blogged about the possibilities that probabilistic techniques and randomization bring on to the paradigm of stream computing. Architectures based on big data not only relate to high volume storage, but also on low latency velocities, and th...

Debasish Ghosh

Shared publicly
fed - co Ex
 
Co-FedEx

Debasish Ghosh

Shared publicly
There are several new ImageNet results floating around that beat my 5.1% error rate on ImageNet. Most recently, there is an interesting paper from Google that uses "batch normalization". I wanted to make a few comments regarding "surpassing human-level accuracy":

Optimistic human performance is ~3%
I reported 5.1%, but it is interesting to try to estimate an optimistic human performance on ILSVRC by removing what I call "silly errors":

1. Note that I trained myself on 500 images, and as I documented in my blog post and our ILSVRC paper, 18 of my errors (24%, about a quarter) were due to what I consider to be "class unawareness". That means that when I looked at my mistake, the answer felt relatively evident if only I had thought of that class. If I had trained longer, it is reasonable to suppose that I would have eliminated a large chunk of these, bringing my error to ~3.9%.
2. The other issue, which I call "insufficient training data" (since I was only shown 13 images / class), is also an error type that falls into this preventable category. Without this error type, the error would be ~3.6%.
3. The next error I'd be willing to argue I could have prevented was the fine-grained error. In the optimistic estimate, if I were willing to spend 15 minutes / terrier instead of ~5 minutes / terrier, the error would become ~3.2%.

The remainder of the errors were "multiple objects" and "incorrect annotations", which I consider near-insurmountable.

TL;DR:
- 5.1% is the error rate for a human who trained on 500 images and then spent up to ~5 minutes per image.
- About ~3% is an optimistic estimate with my "silly errors" removed (a short calculation reproducing these numbers follows below).
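For concreteness, here is a small Python calculation behind those estimates. The post only gives the fraction for the first error type, so only that step is computed exactly; the later figures are simply the ones stated above, repeated as constants.

```python
# Human top-5 error on ILSVRC as reported in the post: 5.1%
total_error = 0.051

# Step 1: 24% of the mistakes were "class unawareness" errors.
# Removing them leaves 76% of the original errors.
after_class_unawareness = total_error * (1 - 0.24)
print(round(after_class_unawareness * 100, 1))   # ~3.9%

# Steps 2 and 3 are given directly in the post rather than as fractions:
after_insufficient_training_data = 0.036         # ~3.6% after removing that error type too
after_fine_grained = 0.032                       # ~3.2% with more careful fine-grained labelling
```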

Human ensemble experiments
This ~3% conclusion is also consistent with our "optimistic human" experiments, which were based on ~250 images (and reported in the ILSVRC paper). We had two labelers and considered an image correct if at least one of us got it. Our optimistic human error was 2.4%, but this is a somewhat noisy result due to insufficient data. Moreover, we expect that an actual human ensemble would have a slightly higher error than the "optimistic human", so ~3% seems relatively consistent with this interpretation.

Top5/Top1 error
As a second point, I do think we should start to look at top-1 accuracy a bit more. I understand that there are problems with it, but I do believe there is some signal there. For example, there are only 5 snake species, so when I saw a snake image I just lazily labeled all 5 snake types and knew I got it right somewhere in the top 5. In other words, the top-5 error does not test differentiating between snake types. A few other categories share this property (a few fish and car types, for example).
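As a quick illustration of the difference between the two metrics (and of why lazily listing all five snake classes is enough to be "correct" under top-5), here is a small Python sketch computing both from predicted scores; the data are made up.

```python
import numpy as np

def topk_error(scores, labels, k):
    """Fraction of examples whose true label is NOT among the k highest-scored classes."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hit = np.any(topk == labels[:, None], axis=1)
    return 1.0 - hit.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 10))      # fake scores over 10 classes
labels = rng.integers(0, 10, size=1000)
print(topk_error(scores, labels, 1))      # ~0.9 for random guessing over 10 classes
print(topk_error(scores, labels, 5))      # ~0.5: top-5 is a much more forgiving target
```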

I don't at all intend this post to somehow take away from any of the recent results: I'm very impressed with how quickly multiple groups have improved from 6.6% down to ~5% and now also below! I did not expect to see such rapid progress. 

It seems that we're now surpassing a dedicated human labeler. And imo, when we are down to ~3%, we'd be matching the performance of a hypothetical super-dedicated fine-grained expert human ensemble of labelers.

My blog: 
http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
The ILSVRC paper that has more details on human optimistic results:
http://arxiv.org/abs/1409.0575
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, ...
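For reference, the "batch normalization" in the abstract above works by standardizing each feature over the mini-batch and then re-scaling with learned parameters, which reduces the layer-input drift the abstract describes. Below is a minimal, hedged NumPy sketch of the training-time forward pass only (inference uses running averages, omitted here).

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learned scale and shift restore expressiveness

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
out = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 and ~1 per feature
```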
Education
  • Jadavpur University
    1984
Basic Information
Gender
Male
Story
Tagline
Programmer, blogger, author, nerd, and Seinfeld fanboy
Introduction
Programming nerd with interests in functional programming, domain-specific languages, and NoSQL databases.

Debasish is a senior member of the ACM and the author of DSLs In Action, published by Manning in December 2010. He is also writing another book, Functional and Reactive Domain Modeling, to be published by Manning.
Work
Occupation
CTO
Employment
  • NRI Fintech
    CTO, 2012 - present
  • Anshin Software
    CTO, 2012
  • PricewaterhouseCoopers Ltd
  • Tanning Technologies
  • Techna International
Places
Currently
Kolkata, India