Representing words as high-dimensional vectors

Making computers understand human language is an active area of research, called Natural Language Processing (NLP). A widely used approach in NLP is the statistical modeling of N-grams, which are collected from freely available text corpora and treated as single “atomic” units. While this has the benefit of allowing simple models to be trained on large amounts of data, it suffers when a large dataset isn’t available, such as high-quality transcribed speech for automatic speech recognition, or when one wants a notion of similarity between words.
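
As a rough illustration (not from the post itself), here is a toy bigram model in Python: adjacent word pairs are counted from a tiny corpus and P(next word | word) is estimated from relative frequencies, with every word treated as an atomic unit. The corpus and function names are made up for this sketch.

```python
# Toy sketch of the N-gram approach described above: count bigrams from a
# small corpus and estimate P(w2 | w1) by maximum likelihood.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
unigrams = Counter(corpus)                  # counts of single words

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(bigram_prob("the", "cat"))  # 2 occurrences of "the cat" / 3 of "the"
```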

In the paper Efficient Estimation of Word Representations in Vector Space, Googlers Tomas Mikolov, Kai Chen, Greg Corrado, and Jeff Dean describe recent progress in applying neural networks to understanding human language. By representing words as high-dimensional vectors, they design and train models that learn the meaning of words in an unsupervised manner from large text corpora. In doing so, they find that similar words arrange themselves near each other in this high-dimensional vector space, allowing interesting results to emerge from mathematical operations on the word representations. For example, this method makes it possible to solve simple analogies by performing arithmetic on the word vectors and examining the nearest words in the vector space.
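
To make the analogy idea concrete, here is a minimal sketch (not the authors' code) of analogy solving by vector arithmetic: compute vec("king") - vec("man") + vec("woman") and return the nearest remaining word by cosine similarity. The 2-d vectors below are hand-made placeholders chosen only to show the mechanics; real word vectors have hundreds of dimensions learned from data.

```python
# Illustrative only: solve "man is to king as woman is to ?" with toy vectors.
import numpy as np

embeddings = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' by vector arithmetic + nearest neighbor."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "king", "woman"))  # -> "queen" with these toy vectors
```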

For more information, and to get the open-source toolkit for computing continuous distributed representations of words (released to promote research on applying machine learning to natural language problems), head over to the Google Open Source Blog, linked below.
Anna Sz
Incredible. Still impressed by the computational power this requires.
About 20 years ago, I speculated that if you apply a tool such as this to a dictionary (a self-referential document in which all its words have definitions), one could get a glimpse of the underlying semantics of that language; that it might reveal how more complex concepts were built from simpler ones.

The other major benefit of doing this is that it would result in a "standardized" result set (topic map / neural net-ish data set) that could be used as a basis for comparing all other information sets. Pretend that this standardized set was exposed to a new set of information and AI-weighted with the new information. The resulting data set could be compared with an original (dictionary-only) data set to determine the delta between them. This delta would become the semantic marker / representation of that new information. From there, any other raw information could be compared to that delta set to determine similarities to it.

Anyway, any work being done in this area? My sense is that using a standardized dictionary set/map could give the AI and semantic communities an un-looked-for way of establishing a basis for representing semantic content. Thoughts?