ETAOIN SRHLDCU

When Research@Google was young, we memorized the top English letter frequencies (ETAOINSHRDLU) and put them to good use, spending countless hours solving cryptograms.

Those frequencies date from 1965, when Mark Mayzner painstakingly tabulated 20,000 words from newspapers, magazines, and books.  

Google has now scanned over two trillion words—eight orders of magnitude more. Recently, Mark contacted Googler Peter Norvig to suggest that he use the Google Books Ngram corpus (http://books.google.com/ngrams) to perform an updated analysis for English.  The results are at http://norvig.com/mayzner.html.

Some highlights:
- R, L, and C are more common than originally thought.
- The average English word is 4.79 letters long.
- The most common 4-gram is "tion".
- The most common 7-gram is "present".
- The most common 9-gram is "different".
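
For anyone curious how figures like these can be computed, here is a minimal Python sketch. It is not the code behind the linked analysis: the function names are hypothetical and the word counts are made up for illustration. It assumes the input is a mapping from each word to how often it occurs, and it counts letter n-grams inside words, weighted by word frequency.

```python
from collections import Counter


def letter_ngram_counts(word_counts, n):
    """Count letter n-grams inside words, weighting each by its word's count."""
    grams = Counter()
    for word, count in word_counts.items():
        w = word.lower()
        # Slide a window of width n across the word.
        for i in range(len(w) - n + 1):
            grams[w[i:i + n]] += count
    return grams


def average_word_length(word_counts):
    """Mean word length in letters, weighted by how often each word occurs."""
    total_letters = sum(len(w) * c for w, c in word_counts.items())
    total_words = sum(word_counts.values())
    return total_letters / total_words


# Made-up toy counts, for illustration only (not real corpus data).
toy_counts = {"the": 50, "nation": 10, "station": 5, "different": 3}

letters = letter_ngram_counts(toy_counts, 1)      # single-letter frequencies
four_grams = letter_ngram_counts(toy_counts, 4)   # 4-letter sequences
print(letters.most_common(5))
print(four_grams["tion"])
print(average_word_length(toy_counts))
```

Setting n to 1 gives plain letter frequencies, and larger n gives the longer sequences listed above. A real run would read word counts from the downloadable Ngram files rather than a hand-written dictionary.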

If anyone wants to perform a similar analysis for other languages, let Research@Google know—the data for German, French, Italian, Spanish, Chinese, Hebrew, and Russian is available for download at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html.
8 comments
 
Interesting. But can you share the Turkish analysis? 
 
Well, can you prepare another one for Turkish? I'd be grateful.
 
+Mehmet Soydam We haven't scanned enough books in Turkish to have a representative sample, so it'd be of limited utility.
 
Well, at least thanks for your attention.
 
Dr. Mayzner did his research in 1965, but ETAOIN SHRDLU long predates that. Linotype machines had keyboards with ETAOIN SHRDLU (instead of QWERTYUIOP) in the late 19th or early 20th century.

http://en.wikipedia.org/wiki/Etaoin_shrdlu#Literature : "Etaoin Shrdlu, or a portion of the phrase, is the name of a character in works of fiction, including: Elmer Rice's 1923 play The Adding Machine ..."