There are several new ImageNet results floating around that beat my 5.1% error rate, most recently an interesting paper from Google that uses "batch normalization". I wanted to make a few comments regarding "surpassing human-level accuracy". The most critical one is this:

Human accuracy is not a point. It lives on a tradeoff curve.

Estimating the lower bound on error
5.1% is an approximate upper bound on human error, achieved by a relatively dedicated labeler who trained on 500 images and then evaluated on 1500. It is interesting to go further and estimate the lower bound on human error. We can do this approximately since I have broken down my errors by category, some of which I feel are fixable (by more training, more expert knowledge of dogs, etc.), and some of which I believe to be relatively insurmountable (e.g. multiple correct answers per image, or an incorrect ground truth label).

In detail, my human error types were:
1. Multiple correct objects in the image (12 mistakes)
2. Clearly incorrect label ground truth (5 mistakes)
3. Fine-grained recognition error (28 mistakes)
4. Class unawareness error (18 mistakes)
5. Insufficient training data (4 mistakes)
6. Unsorted/misc category (9 mistakes)

For a total of 76 mistakes, giving 76/1500 ~= 0.051 error. From these, I would argue that 1. and 2. are near insurmountable, while the rest could be further reduced by fine-grained experts (3.) and a longer training period (4., 5.). For an optimistic lower bound, we could drop these errors down to 76 - 28 - 18 - 4 = 26, giving 26/1500 ~= 1.7% error, or even 1.1% if we also drop all of (6.).
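
To make the arithmetic explicit, here is a minimal Python sketch that reproduces the numbers above from the per-category mistake counts (the dictionary keys are just shorthand for the categories in the list):

```python
# Back-of-the-envelope human error bounds, using the mistake counts from this post.
total_images = 1500

mistakes = {
    "multiple correct objects":   12,
    "incorrect ground truth":      5,
    "fine-grained recognition":   28,
    "class unawareness":          18,
    "insufficient training data":  4,
    "unsorted/misc":               9,
}

total_mistakes = sum(mistakes.values())           # 76
upper_bound = total_mistakes / total_images       # ~5.1%

# Optimistic lower bound: assume fine-grained, class-unawareness and
# insufficient-training errors can be fixed with more effort/expertise.
fixable = (mistakes["fine-grained recognition"]
           + mistakes["class unawareness"]
           + mistakes["insufficient training data"])
lower_bound = (total_mistakes - fixable) / total_images              # ~1.7%

# Even more optimistic: drop the misc category as well.
lower_bound_opt = (total_mistakes - fixable
                   - mistakes["unsorted/misc"]) / total_images       # ~1.1%

print(f"upper bound:      {upper_bound:.1%}")
print(f"lower bound:      {lower_bound:.1%}")
print(f"optimistic bound: {lower_bound_opt:.1%}")
```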

In conclusion
When you read the "surpassing-human" headlines, keep in mind that human accuracy is not a point - it's a tradeoff curve. We trade off human effort and expertise against the error rate: I am one point on that curve, at 5.1%. My labmates, with almost no training, are another point, with errors as high as 15%. And based on the above hypothetical calculations, it's not unreasonable to suggest that a group of very dedicated humans might push this down to 2% or so.

That being said, I'm very impressed with how quickly multiple groups have improved from 6.6% down to ~5% and now even below! I did not expect to see such rapid progress. It seems that we're now surpassing a dedicated human labeler. And imo, when we are down to 3%, we'd be matching the performance of a hypothetical super-dedicated fine-grained expert human ensemble of labelers.

My blog: 
http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
The ILSVRC paper that has more details on the optimistic human results:
http://arxiv.org/abs/1409.0575
3 comments
 
+Andrej Karpathy thank you for the valuable research on ILSVRC2012. In our lab we also did some data mining on this dataset. We wanted to find out if the Top5 metric is reasonable for ILSVRC2012, and in our case the answer was yes. Actually we found many ambiguities and mistakes in the annotation. We were also surprised that you didn't mention them in your blog. For example, notebook and laptop have the same meaning and the same appearance. There are many other categories that are either the same, or have different names but in fact contain similar images. So it looks like either the dataset should be refined to adequately use the Top1 metric, or it is better to keep using Top5.
 
Thanks +Mihail Sirotenko for the comment! When I briefly alluded to issues with top1, examples like notebook vs. laptop are exactly what I had in mind. I think that even though it's impossible to get these categories right, you'd just end up with chance performance within some of them, and maybe that's okay. For many of the categories I mentioned above though, the current metric does not adequately test the fine-grained recognition capabilities of the classifier, which I would consider to be a more serious issue in comparison. I'm not 100% convinced about this; I'm just thinking out loud. Happy to hear your thoughts!
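
As an aside for readers less familiar with the two metrics being discussed, here is a minimal sketch of how top-1 vs. top-k error can be computed on toy scores; this is a generic illustration, not the official ILSVRC evaluation code.

```python
import numpy as np

def topk_error(scores, labels, k):
    # scores: (n_examples, n_classes) classifier scores
    # labels: (n_examples,) ground-truth class indices
    # Returns the fraction of examples whose true label is NOT among
    # the k highest-scoring classes.
    topk = np.argsort(scores, axis=1)[:, -k:]
    hits = (topk == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

# Toy example with 3 classes and 4 images (ILSVRC uses 1000 classes and k=5).
scores = np.array([[0.1, 0.7, 0.2],   # true class 1 -> top-1 hit
                   [0.5, 0.3, 0.2],   # true class 2 -> top-1 and top-2 miss
                   [0.2, 0.2, 0.6],   # true class 2 -> top-1 hit
                   [0.5, 0.3, 0.2]])  # true class 1 -> top-1 miss, top-2 hit
labels = np.array([1, 2, 2, 1])

print("top-1 error:", topk_error(scores, labels, 1))  # 0.5
print("top-2 error:", topk_error(scores, labels, 2))  # 0.25
```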
 
Let's all take a moment to appreciate that we are now doing peer-reviewed research using only Google+ and arXiv, with a turnaround time of less than a day. Chew on that, PAMI.