GoogLeNet is on par with trained human experts on ImageNet-1000 classification, according to this paper from the ILSVRC organizing team.

The task is too challenging for humans without training, so the authors had to rely on specially trained human experts who learned to recognize these 1000 categories. The task was also too tedious for one person to label the full test set, so each human annotator labeled only a random subset of the test data, on which GoogLeNet was also evaluated. One annotator did slightly better than GoogLeNet (5.1% error vs 6.8%), while the other made roughly twice as many errors as the network (12% error vs 5.8% for GoogLeNet on that annotator's subset).

They conclude: "It is clear that humans will soon outperform state-of-the-art image classification models only by use of significant effort, expertise, and time."

"We found the task of annotating images with one of 1000 categories to be an extremely challenging task for an untrained annotator. The most common error that an untrained annotator is susceptible to is a failure to consider a relevant class as a possible label because they are unaware of its existence.
Therefore, in evaluating the human accuracy we relied primarily on expert annotators who learned to recognize a large portion of the 1000 ILSVRC classes. During training, the annotators labeled a few hundred validation images for practice and later switched to the test set images."

"[...] we conclude that a significant amount of training time is necessary for a human to achieve competitive performance on ILSVRC. However, with a sufficient amount of training, a human annotator is still able to outperform the GoogLeNet result (p = 0.022) by approximately 1.7%."

This confirms a point I have made before: detection is the real measure to look at from now on, because classification is too ambiguous at this stage (top-1 even more so than top-5). As the paper puts it: "Both GoogLeNet and humans struggle with images that contain multiple ILSVRC classes (usually many more than five), with little indication of which object is the focus of the image." This type of error accounts for 24% of GoogLeNet's errors. A further 21% of GoogLeNet's errors come from small or thin objects. The remaining errors come from filtered pictures (e.g. Instagram), abstract representations (paintings, toys, etc.), or unconventional viewpoints. Human errors, however, are more likely to be due to fine-grained recognition, class unawareness, or insufficient training data.
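To make the top-1 vs top-5 distinction concrete, here is a minimal sketch of an ILSVRC-style top-k error computation, assuming a single ground-truth label per image (the usual simplification); the helper name and toy data are mine, not from the paper. On images that actually contain several valid classes, top-5 forgives the ambiguity while top-1 does not, which is why top-1 is the more ambiguous measure of the two.

```python
import numpy as np

def topk_error(scores, labels, k):
    """Fraction of images whose ground-truth label is not among the top-k scores.

    scores: (n_images, n_classes) array of classifier confidences
    labels: (n_images,) array of ground-truth class indices
    """
    # Indices of the k highest-scoring classes per image (order within top-k is irrelevant)
    topk = np.argpartition(scores, -k, axis=1)[:, -k:]
    hit = (topk == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# Toy example with random scores over 1000 classes, illustrative only
rng = np.random.default_rng(0)
scores = rng.random((100, 1000))
labels = rng.integers(0, 1000, size=100)
print("top-1 error:", topk_error(scores, labels, 1))
print("top-5 error:", topk_error(scores, labels, 5))
```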