ETAOIN SRHLDCU

When Research@Google was young, we memorized the top English letter frequencies (ETAOINSHRDLU) and put them to good use, spending countless hours solving cryptograms.

Those frequencies date from 1965, when Mark Mayzner painstakingly tabulated 20,000 words from newspapers, magazines, and books.  

Google has now scanned over two trillion words—eight orders of magnitude more. Recently, Mark contacted Googler Peter Norvig to suggest that he use the Google Books Ngram corpus (http://books.google.com/ngrams) to perform an updated analysis for English.  The results are at http://norvig.com/mayzner.html.

Some highlights:
- R, L, and C are more common than originally thought.
- The average English word is 4.79 letters long.
- The most common 4-gram is "tion".
- The most common 7-gram is "present".
- The most common 9-gram is "different".
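
For anyone curious how counts like these are produced, here is a minimal Python sketch. It is not the code behind the analysis linked above. It assumes you have already reduced the corpus to a mapping from each word to its number of occurrences; the function names and the tiny `toy` table are made up for illustration. Letter n-grams are counted inside words only, and every word is weighted by how often it occurs.

```python
from collections import Counter


def letter_ngram_counts(word_counts, n):
    """Count letter n-grams inside words, weighting each word by its frequency."""
    counts = Counter()
    for word, freq in word_counts.items():
        word = word.lower()
        # Slide a window of n letters across the word; n-grams never span two words.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += freq
    return counts


def average_word_length(word_counts):
    """Mean word length in letters, weighted by how often each word occurs."""
    total = sum(word_counts.values())
    if total == 0:
        return 0.0
    return sum(len(word) * freq for word, freq in word_counts.items()) / total


# Hypothetical toy data, not real corpus counts.
toy = {"the": 5, "nation": 2, "station": 1, "of": 4}

letters = letter_ngram_counts(toy, 1)    # single-letter frequencies
four_grams = letter_ngram_counts(toy, 4)
print(letters.most_common(3))
print(four_grams["tion"])
print(average_word_length(toy))
```

Passing n=1 gives plain letter frequencies, so the same function covers both the ETAOIN ranking and the longer n-gram highlights.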

If anyone wants to perform a similar analysis for other languages, let Research@Google know—the data for German, French, Italian, Spanish, Chinese, Hebrew, and Russian is available for download at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html.