There are several new ImageNet results floating around that beat my 5.1% error rate on ImageNet, most recently an interesting paper from Google that uses "batch normalization". I wanted to make a few comments regarding "surpassing human-level accuracy":

Optimistic human performance is ~3%
I reported 5.1%, but it is interesting to try to estimate an optimistic human performance on ILSVRC by removing what I call "silly errors":
1. Note that I trained myself on 500 images, and as I documented in my blog post and our ILSVRC paper, 18 of my errors (24%, about a quarter) were due to what I consider "class unawareness": when I looked at those mistakes, the answer felt relatively evident once I thought of the right class. If I had trained longer, it's reasonable to suppose that I would have eliminated a large chunk of these, bringing my error down to ~3.9%.
2. The other error type, which I call "insufficient training data" (since I was only shown 13 images / class), also falls into the category of mistakes that more training would have prevented. Without this error type, the error would be ~3.6%.
3. The next error I'd be willing to argue I could have prevented is the fine-grained error. In the optimistic estimate, if I were willing to spend 15 minutes / terrier instead of ~5 minutes / terrier, the error would drop to ~3.2% (a rough worked sketch of this arithmetic follows below).
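For concreteness, here is a small back-of-the-envelope sketch of how these numbers fall out. It assumes a test set of roughly 1,500 labeled images (so 5.1% corresponds to about 77 errors); only the 5.1% starting point and the 18 "class unawareness" errors come from the text above, the other per-category counts are illustrative.

```python
# Back-of-the-envelope reconstruction of the "optimistic human" error rate.
# Assumed, not from the post: the ~1500-image test set size and the exact
# counts for the last three categories are illustrative placeholders.

n_images = 1500                                   # assumed size of the labeled test set
errors = {
    "class unawareness":                  18,     # stated above (~24% of all errors)
    "insufficient training data":          5,     # illustrative count
    "fine-grained (terriers etc.)":        6,     # illustrative count
    "multiple objects / bad annotations": 48,     # illustrative remainder
}

total = sum(errors.values())
print(f"starting error: {total / n_images:.1%}")  # ~5.1%

# Peel off the "silly" error categories one at a time, as in the list above.
remaining = total
for category in ["class unawareness",
                 "insufficient training data",
                 "fine-grained (terriers etc.)"]:
    remaining -= errors[category]
    print(f"without {category}: {remaining / n_images:.1%}")  # ~3.9%, ~3.6%, ~3.2%
```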
The remainder of the errors were "multiple objects" and "incorrect annotations", which I consider near insurmountable.

TLDR
- 5.1% is the error rate for a human who trained on 500 images and then spent up to ~5 minutes per image.
- About 3% is an optimistic estimate without my "silly errors".

Human ensemble experiments
This ~3% conclusion is also consistent with our "optimistic human" experiments based on ~250 images (reported in the ILSVRC paper). We had two labelers and considered an image correct if at least one of us got it. The optimistic human error was 2.4%, but this is a somewhat noisy result due to the small sample. Moreover, we expect that an actual human ensemble would have a slightly higher error than the "optimistic human", so ~3% seems relatively consistent with this interpretation.
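To make the "optimistic human" scoring rule explicit, here is a minimal sketch (the per-image data is made up, not from our experiment): an image counts as correct if at least one of the two labelers got it.

```python
# Illustrative sketch of the "optimistic human" rule: correct if either labeler is correct.
# All per-image values below are made-up toy data.
labeler_a_correct = [True, True, False, True, False]   # per-image correctness of labeler A
labeler_b_correct = [True, False, True, True, False]   # per-image correctness of labeler B

optimistic_correct = [a or b for a, b in zip(labeler_a_correct, labeler_b_correct)]
error = 1 - sum(optimistic_correct) / len(optimistic_correct)
print(f"optimistic human error: {error:.1%}")          # 20.0% on this toy data
```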
Top5/Top1 error

As a second point, I do think we should start to look at the top-1 error a bit more. I understand that there are problems with it, but I do believe there is some signal there. For example, there are only 5 snake species, so when I saw a snake image I just lazily labeled all 5 snake types and knew I got it right somewhere in the top 5. In other words, the top-5 error does not test differentiating between snake types. A few other categories share this property (some fish and car types, for example).
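To make the distinction concrete, here is a minimal sketch (not from the post) of how top-1 vs top-5 error would be computed from a model's class scores; the "lazy" snake strategy above exploits the fact that top-5 only needs the correct class to appear anywhere among the five guesses.

```python
import numpy as np

def topk_error(scores, labels, k):
    """Fraction of examples whose true label is NOT among the k highest-scored classes."""
    # scores: (n_examples, n_classes) class scores; labels: (n_examples,) true class indices
    topk = np.argsort(-scores, axis=1)[:, :k]       # indices of the k best guesses per example
    hit = (topk == labels[:, None]).any(axis=1)     # is the true class anywhere in the top k?
    return 1.0 - hit.mean()

# Toy illustration on random scores: top-5 error is always <= top-1 error,
# because the top-5 criterion only needs the true class somewhere among 5 guesses
# (the loophole the "label all 5 snakes" strategy exploits).
scores = np.random.rand(1000, 1000)
labels = np.random.randint(0, 1000, size=1000)
print("top-1 error:", topk_error(scores, labels, k=1))
print("top-5 error:", topk_error(scores, labels, k=5))
```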
I don't intend this post to take away from any of the recent results: I'm very impressed with how quickly multiple groups have improved from 6.6% down to ~5% and now below! I did not expect to see such rapid progress.
It seems that we're now surpassing a dedicated human labeler. And imo, when we get down to ~3%, we'd be matching the performance of a hypothetical super-dedicated fine-grained expert human ensemble of labelers.
My blog: http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
The ILSVRC paper that has more details on the optimistic human results: http://arxiv.org/abs/1409.0575