Fascinating to see complex neural networks being layered on top of each other like components, which may be similar to how the human brain is organized (though the brain is not strictly feed-forward, as this system still is): "This idea comes from recent advances in machine translation between languages, where a Recurrent Neural Network (RNN) transforms, say, a French sentence into a vector representation, and a second RNN uses that vector representation to generate a target sentence in German ... What if we replaced that first RNN and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images? Normally, the CNN’s last layer is used in a final Softmax among known classes of objects, assigning a probability that each object might be in the image. But if we remove that final layer, we can instead feed the CNN’s rich encoding of the image into a RNN designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that descriptions it produces best match the training descriptions for each image."
A picture is worth a thousand (coherent) words: building a natural description of images
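The encoder-decoder pipeline the quote describes — a CNN turns the image into a fixed vector, and an RNN conditioned on that vector emits words one at a time — can be sketched in miniature. Everything below is an illustrative assumption, not the paper's actual architecture: the "CNN encoding" is a random vector standing in for the penultimate-layer features, the decoder is a tiny untrained vanilla RNN with made-up dimensions, and decoding is greedy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- illustrative, not from the paper.
feat_dim, hid_dim, vocab = 8, 16, 10
MAX_LEN, END = 5, 0  # token 0 plays the role of an end-of-sentence marker

# Stand-in for the CNN's penultimate-layer encoding of an image
# (in the real system this would come from a trained ConvNet
# with its final Softmax layer removed).
image_vec = rng.standard_normal(feat_dim)

# Randomly initialised decoder weights; training on (image, caption)
# pairs, as the quote describes, is omitted here.
W_ih = 0.1 * rng.standard_normal((hid_dim, feat_dim))  # image -> initial hidden
W_hh = 0.1 * rng.standard_normal((hid_dim, hid_dim))   # recurrent transition
W_xh = 0.1 * rng.standard_normal((hid_dim, vocab))     # previous word -> hidden
W_ho = 0.1 * rng.standard_normal((vocab, hid_dim))     # hidden -> vocab logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def caption(image_vec):
    """Greedy decoding: seed the RNN's hidden state with the image
    encoding, then emit the most probable word at each step until
    the end token appears or MAX_LEN is reached."""
    h = np.tanh(W_ih @ image_vec)
    tokens = []
    for _ in range(MAX_LEN):
        probs = softmax(W_ho @ h)      # distribution over the vocabulary
        tok = int(probs.argmax())
        if tok == END:
            break
        tokens.append(tok)
        prev = np.eye(vocab)[tok]      # one-hot of the word just emitted
        h = np.tanh(W_hh @ h + W_xh @ prev)
    return tokens

print(caption(image_vec))  # a short sequence of word indices
```

With untrained weights the output is of course gibberish indices rather than a caption; the point is only the data flow — image vector in, word sequence out — that lets the whole stack be trained end-to-end.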