Representing words as high-dimensional vectors
Making computers understand human language is an active area of research called Natural Language Processing (NLP). A widely used method in NLP research is the statistical modeling of N-grams (http://goo.gl/zSZZy), which are collected from freely available text corpora and treated as single "atomic" units. While this approach has the benefit of producing simple models that can be trained on large amounts of data, it suffers when a large dataset isn't available, such as high-quality transcribed speech for automatic speech recognition, or when one wants a notion of similarity between words.
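To make the N-gram idea concrete, here is a minimal sketch of a bigram (N=2) model in Python. The tiny corpus and the maximum-likelihood estimate are illustrative assumptions, not taken from the paper:

```python
# A minimal sketch of N-gram statistical modeling (here, bigrams):
# count adjacent word pairs in a corpus and estimate
# P(word | previous word) from the counts. The corpus is a made-up toy.
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()

# Each word is treated as an atomic unit; a bigram is an adjacent pair.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def prob(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(prob("cat", "the"))  # "the" is followed by "cat" 2 of 3 times -> ~0.67
```

Counts like these say nothing about whether two words are related in meaning, which is exactly the gap the vector representations below address.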
In the paper Efficient Estimation of Word Representations in Vector Space, Googlers Tomas Mikolov, Kai Chen, Greg Corrado, and Jeff Dean describe recent progress on applying neural networks to understanding human language. By representing words as high-dimensional vectors, they design and train models that learn the meaning of words in an unsupervised manner from large text corpora. In doing so, they find that similar words arrange themselves near each other in this high-dimensional vector space, allowing interesting results to arise from mathematical operations on the word representations. For example, this method makes it possible to solve simple analogies by performing arithmetic on the word vectors and examining the nearest words in the vector space: vector("King") - vector("Man") + vector("Woman") yields a vector close to vector("Queen").
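Here is a minimal sketch of that analogy-by-arithmetic idea. The 3-dimensional vectors below are made-up toy values chosen so the example works out; a real model learns vectors with hundreds of dimensions from large corpora:

```python
# A minimal sketch of analogy solving via word-vector arithmetic.
# The vectors are hypothetical toy values, not output of a trained model.
import numpy as np

vectors = {
    "king":  np.array([0.80, 0.30, 0.10]),
    "queen": np.array([0.75, 0.90, 0.12]),
    "man":   np.array([0.60, 0.25, 0.50]),
    "woman": np.array([0.55, 0.85, 0.52]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# "king" - "man" + "woman" should land near "queen".
target = vectors["king"] - vectors["man"] + vectors["woman"]

# Rank the remaining words by similarity to the target vector.
candidates = {w: cosine(target, v) for w, v in vectors.items()
              if w not in ("king", "man", "woman")}
print(max(candidates, key=candidates.get))  # -> "queen"
```

The same nearest-neighbor search over learned vectors is what surfaces the interesting regularities the paper reports.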
For more information, and to get the open-source toolkit for computing continuous distributed representations of words, which aims to promote research on how machine learning can be applied to natural language problems, head over to the Google Open Source Blog, linked below.