A Billion Words: Because today's language modeling standard should be higher

Posted by +Dave Orr, Product Manager, and +Ciprian Chelba, Research Scientist
Language is chock-full of ambiguity, and it can turn up in surprising places. Many words are hard to tell apart without context: most Americans pronounce “ladder” and “latter” identically, for instance.
One key way computers use context is with language models (http://goo.gl/IXVQCy). These are used in predictive keyboards, but also in speech recognition, machine translation, spelling correction, query suggestions, and more.
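To make this concrete, here is a minimal sketch of how even a tiny language model can use context to resolve the “ladder”/“latter” ambiguity above. The corpus, the counts, and the add-one smoothing are stand-ins invented for brevity; they are not the models or data discussed in the paper.

```python
from collections import Counter

# Toy training corpus, invented purely for illustration.
corpus = (
    "he climbed the ladder to the roof . "
    "she chose the latter option . "
    "the ladder leaned against the wall . "
    "the former is cheaper but the latter is faster ."
).split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus)                   # counts of single words
vocab_size = len(set(corpus))

def bigram_prob(prev, word):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# A speech recognizer's dilemma: was "the ___ option" spoken with
# "ladder" or "latter"? Score each candidate in context.
for candidate in ("ladder", "latter"):
    score = bigram_prob("the", candidate) * bigram_prob(candidate, "option")
    print(f"the {candidate} option -> {score:.5f}")
# "latter" scores higher because "latter option" occurs in the corpus.
```

Real systems use far richer models and vastly more data, which is exactly where a large, standard benchmark helps.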
We believe that the field could benefit from a large, standard data set with benchmarks for easy comparison and for experiments with new modeling techniques. To that end, we are releasing scripts that convert a set of public data into a language modeling benchmark with almost one billion words of training data and standardized training and test splits, described in an arXiv paper (http://goo.gl/eWPwx3).
Along with the scripts, we’re releasing the processed data, including the standardized training and test splits, in one convenient location. This will make it much easier for the research community to reproduce results quickly, and we hope it will speed up progress on these tasks. The benchmark scripts and data are freely available at http://goo.gl/CMOEqa
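If you do run experiments, the number models are compared on is perplexity over the held-out test set. The harness below is a minimal sketch of that computation, assuming a hypothetical log_prob_fn interface; it is not part of the released scripts.

```python
import math

def perplexity(log_prob_fn, test_sentences):
    """Corpus-level perplexity: exp of the average negative
    log-likelihood per word, end-of-sentence tokens included.

    log_prob_fn(history, word) returns log P(word | history) in nats;
    it is a placeholder for whatever model is being evaluated.
    """
    total_log_prob = 0.0
    total_words = 0
    for sentence in test_sentences:
        history = ["<s>"]
        for word in sentence + ["</s>"]:
            total_log_prob += log_prob_fn(history, word)
            history.append(word)
            total_words += 1
    return math.exp(-total_log_prob / total_words)

# Sanity check: a uniform model over a 10-word vocabulary
# should have perplexity exactly 10.
uniform = lambda history, word: math.log(1.0 / 10)
print(perplexity(uniform, [["a", "b", "c"]]))  # -> 10.0
```

Lower perplexity is better; it can be read as the average number of choices the model is effectively guessing between at each word.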
For all the researchers out there: try out the benchmark, run your experiments, and let us know how it goes. Or publish, and we’ll enjoy finding your results at conferences and in journals. Head over to the Google Research Blog to learn more.