There are several new ImageNet results floating around that beat my 5.1% error rate on ImageNet. The most recent is an interesting paper from Google that uses "batch normalization". I wanted to make a few comments regarding "surpassing human-level accuracy". The most critical one is this:

Human accuracy is not a point. It lives on a tradeoff curve.

Estimating the lower bound error
5.1% is an approximate upper bound on human error, achieved by a relatively dedicated labeler who trained on 500 images and then evaluated on 1500. It is interesting to go further and estimate the lower bound on human error. We can do this approximately since I have broken down my errors by category. Some of these I feel are fixable (by more training, or more expert knowledge of dogs, etc.), and some I believe to be relatively insurmountable (e.g. multiple correct answers per image, or an incorrect ground truth label).

In detail, my human error types were:
1. Multiple correct objects in the image (12 mistakes)
2. Clearly incorrect ground truth label (5 mistakes)
3. Fine-grained recognition error (28 mistakes)
4. Class unawareness error (18 mistakes)
5. Insufficient training data (4 mistakes)
6. Unsorted/misc category (9 mistakes)

That is a total of 76 mistakes, giving 76/1500 ~= 5.1% error. Of these, I would argue that 1. and 2. are near insurmountable, while the rest could be further reduced by fine-grained experts (3.) and a longer training period (4., 5.). For an optimistic lower bound, we could drop these errors down to 76 - 28 - 18 - 4 = 26, giving 26/1500 ~= 1.7% error, or even 17/1500 ~= 1.1% if we also drop all of (6.).
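
For anyone who wants to check or tweak the arithmetic, here is a minimal sketch in Python. The counts are the ones from the list above. The names (`MISTAKES`, `N_EVALUATED`, `error_rate`) are just made up for illustration, and which categories count as "fixable" is my own judgment call rather than anything measured.

```python
# Mistake counts per error category, keyed by the numbers in the list above.
MISTAKES = {
    1: 12,  # multiple correct objects in the image
    2: 5,   # clearly incorrect ground truth label
    3: 28,  # fine-grained recognition error
    4: 18,  # class unawareness error
    5: 4,   # insufficient training data
    6: 9,   # unsorted/misc
}
N_EVALUATED = 1500  # number of images in the evaluation set


def error_rate(remaining_categories):
    """Error rate if only the given categories still produce mistakes."""
    return sum(MISTAKES[c] for c in remaining_categories) / N_EVALUATED


observed = error_rate([1, 2, 3, 4, 5, 6])  # everything I actually got wrong
optimistic = error_rate([1, 2, 6])         # assume 3., 4., 5. are fully fixed
most_optimistic = error_rate([1, 2])       # assume 6. is fully fixed as well

print(f"{observed:.1%}")         # 5.1%
print(f"{optimistic:.1%}")       # 1.7%
print(f"{most_optimistic:.1%}")  # 1.1%
```

Changing the list passed to `error_rate` gives other hypothetical points, e.g. if you think only part of the fine-grained errors are really fixable.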

In conclusion
When we read the "surpassing-human" headlines, we should all keep in mind that human accuracy is not a point - it's a tradeoff curve. We trade off human effort and expertise against the error rate: I am one point on that curve with 5.1%. My labmates with almost no training are another point, with up to 15% error. And based on the above hypothetical calculations, it's not unreasonable to suggest that a group of very dedicated humans might push this down to 2% or so.

That being said, I'm very impressed with how quickly multiple groups have improved from 6.6% down to ~5% and now even below that! I did not expect to see such rapid progress. It seems that we're now surpassing a dedicated human labeler. And imo, when we are down to 3%, we'd be matching the performance of a hypothetical super-dedicated ensemble of fine-grained expert human labelers.

My blog:
The ILSVRC paper that has more details on human optimistic results: