Large Scale Language Modeling in Automatic Speech Recognition by +Ciprian Chelba 

At Google, we’re able to use the large amounts of data made available by the Web’s fast growth. Two such data sources are anonymized search queries and the web itself. They help improve automatic speech recognition through large language models: Voice Search makes use of the former, whereas YouTube speech transcription benefits significantly from the latter.

The language model is the component of a speech recognizer that assigns a probability to the next word in a sentence given the previous ones. As an example, if the previous words are “new york”, the model would assign a higher probability to “pizza” than to, say, “granola”. The n-gram approach to language modeling (predicting the next word based on the previous n-1 words) is particularly well-suited to such large amounts of data: it scales gracefully, and the non-parametric nature of the model allows it to grow with more data. For example, on Voice Search we were able to train and evaluate 5-gram language models consisting of 12 billion n-grams, built using large vocabularies (1 million words), and trained on as many as 230 billion words.
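To make the n-gram idea concrete, here is a minimal sketch of maximum-likelihood n-gram estimation: count each n-gram and its (n-1)-word context, then divide. The toy corpus and function names are illustrative, not the production system, which also uses smoothing for unseen n-grams.

```python
from collections import defaultdict

def train_ngram(tokens, n):
    """Count n-grams and their (n-1)-gram contexts from a token list."""
    counts = defaultdict(int)
    context_counts = defaultdict(int)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        word = tokens[i + n - 1]
        counts[(context, word)] += 1
        context_counts[context] += 1
    return counts, context_counts

def prob(counts, context_counts, context, word):
    """Maximum-likelihood P(word | context); 0.0 for unseen contexts."""
    context = tuple(context)
    c = context_counts.get(context, 0)
    return counts.get((context, word), 0) / c if c else 0.0

corpus = "new york pizza is good and new york pizza is cheap".split()
counts, ctx = train_ngram(corpus, 3)
print(prob(counts, ctx, ["new", "york"], "pizza"))  # 1.0 in this toy corpus
```

In the toy corpus “pizza” is the only word ever seen after “new york”, so the trigram model gives it all the probability mass; real systems smooth these estimates so unseen continuations keep some probability.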

The computational effort pays off, as highlighted by the plot below: both word error rate (a measure of speech recognition accuracy) and search error rate (a metric we use to evaluate the output of the speech recognition system when used in a search engine) decrease significantly with larger language models. A more detailed summary of results on Voice Search and a few YouTube speech transcription tasks, written by +Ciprian Chelba, +Dan Bikel, +Masha Shugrina, +Patrick Nguyen and Shankar Kumar, presents our results when increasing both the amount of training data and the size of the language model estimated from such data. Depending on the task, the availability and amount of training data used, as well as the language model size and the performance of the underlying speech recognizer, we observe relative reductions in word error rate between 6% and 10% for systems across a wide range of operating points.

Cross-posted with the Research Blog:
This chart also resembles an Apple stock chart for the last month. :D
Does the statistical approach mean that the calculation of probabilities is less precise for less popular languages?
More data beats better algorithm. :)
+Enrico Altavilla N-gram models let us balance the quality of the probability estimates with the power of the model.

For example, if there is less data we can use 3-grams instead of 5-grams, which will dramatically reduce the number of parameters in the model, thus allowing for reliable estimation on less data than what is needed for a 5-gram.

The trade-off is in the "modeling power", which for language models can be equated with perplexity (see link to PDF in the post): a 3-gram language model will usually have a higher perplexity than a 5-gram language model, and provide less accurate predictions for the next word.
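Perplexity, mentioned above as the usual measure of a language model's predictive power, can be computed directly as the exponential of the average negative log-probability the model assigns to each word. A minimal sketch (the probability list is a made-up example, not data from the post):

```python
import math

def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability per word."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

# A model that assigns uniform probability 1/4 to each word in a sequence
# has perplexity 4: it is as "confused" as a fair 4-way choice per word.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ≈ 4.0
```

Lower perplexity means the model concentrates more probability on the words that actually occur, which is why a 5-gram model typically scores better than a 3-gram model on the same test data.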
I hadn't realized the YouTube dataset is so challenging. 50% WER! Much research remains, I guess.
+Ciprian Chelba : thanks, I understand. Less data forces you to use a lower order, and a lower order will provide less accurate predictions of the next word compared to a higher-order model. Do you foresee the adoption of models based on 6-grams or higher orders in the near future? Is it possible to estimate how much PPL would change as the order of the model increases?
+Kevin Duh : Indeed! It would also be interesting (educational?) to understand why more or less the same technology works quite well for speech acquired from Android/iPhone smartphones.  There the user/speaker is both vested in the outcome (so s/he will articulate more clearly and avoid using it in a very noisy environment), and has the expectation set by an existing text app, e.g. Google search.
+Enrico Altavilla The best way is to build such models and see how they work. However one gets diminishing returns with increasing n-gram order. At the same time the model size increases quasi-exponentially with n, so the trade-off will be biased towards lower n values.
Nice report. It would be interesting to see the PPL of the YouTube dataset.