Terence Tao

31,651 followers

Terence's posts

Post has attachment

Public

Hopefully...

Post has attachment

Public

Self-explanatory.

Post has shared content

Public

One of the big developments in complexity theory from ~~last year~~ late 2015 is unfortunately slightly less big than first thought, though still significant.

A retraction to the quasi-polynomial graph isomorphism claim of Babai (replaced by sub-exponential time).

Post has attachment

Public

This would have been a useful video to share with my (recently concluded) graduate complex analysis class.

Post has shared content

Public

**Jean Bourgain wins 2017 Breakthrough Prize**

Congratulations to Jean Bourgain (Princeton Institute for Advanced Study), who has won the 2017 Breakthrough Prize in mathematics!

https://www.ias.edu/press-releases/2016/bourgain-breakthrough

Post has shared content

Public

Open to students between 13 and 18 in most countries; submission deadline is October 10. First prize is a $250,000 college scholarship, together with an award to a teacher that inspired the student, and funding for a science laboratory at the student's school.

Submit a video about science.

https://breakthroughjuniorchallenge.org/

Public

When I became naturalised as a US citizen in 2009, I received at the ceremony a little bag containing (among other things) a small US flag and a pocket copy of the US constitution. A few years ago, while doing some housecleaning, I decided to discard the copy of the constitution, reasoning that the text was easily available online whenever I needed to read it. (Also, I had previously studied the constitution in order to pass the citizenship exam.)

... I think I now regret that decision.

Post has attachment

Public

There is a classic paper of Shannon from 1950 (linked below) in which he estimates the entropy of the English language to be about 1 bit per character. Roughly speaking, this means that for medium-sized n (e.g. 50 to 100), the distribution of n-grams (strings of n consecutive letters in an English text, after stripping punctuation and spacing) is concentrated in a set of size about 2^n (this can be compared with the number of all possible n-grams using the Roman alphabet, which is 26^n). In particular, any two randomly selected n-grams from two different English texts should have a probability of about 2^{-n} of agreeing with each other from purely random chance.

As a consequence, given two English texts of character length N and M, the probability that they share a matching n-gram purely from random chance is approximately N M 2^{-n} (assuming that n is large enough that this product is significantly less than 1, but still small compared to N and M). One can view this quantity NM 2^{-n} as a crude measure of the "p-value" when testing for the hypothesis that a match is occurring for a reason other than random chance.
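As a minimal sketch, the estimate N M 2^{-n} can be written as a small Python function. The function names are hypothetical, and the cap at 1 is an added assumption, since the formula is only meaningful when the product is well below 1.

```python
import math

def chance_match_log2(N: float, M: float, n: int) -> float:
    """Return log2 of the crude estimate N * M * 2**(-n).

    N, M: character lengths of the two texts.
    n: length of the matching n-gram, assuming about 1 bit of entropy per character.
    """
    return math.log2(N) + math.log2(M) - n

def chance_match_estimate(N: float, M: float, n: int) -> float:
    """Return N * M * 2**(-n), capped at 1 (the cap is an assumption of this sketch)."""
    return min(1.0, 2.0 ** chance_match_log2(N, M, n))

# Illustrative numbers only (not taken from the post): two texts of 1024 characters each.
print(chance_match_estimate(1024, 1024, 30))
```

Working with the base-2 logarithm keeps the arithmetic readable when N and M are given as powers of two.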

It is tempting to apply this formula as a test for plagiarism, when a matching n-gram is found between two texts (as was recently the case involving the convention speeches of Melania Trump and Michelle Obama). One can take N to be the number of characters of the text to be tested for plagiarism (in the case of Trump's speech, this is about 2^{12.5}). It would however be unfair to take M to be the length of the text in which a match was found (in this case, Obama's speech, which is about 2^13 characters), because this is not the only reference text that would be available for plagiarism (this is a variant of the "Texas sharpshooter fallacy"). For maximum benefit of the doubt, one could take M to be something like the total number of characters of English text on pages indexed by a major search engine; this is a bit difficult to estimate, but it seems that there are about 10^9 to 10^10 English web pages indexed by (say) Google, which on average contain about 10^3 to 10^4 characters of text, so one could very roughly take M to be about 10^13, or about 2^43. (This is probably an overestimate due to redundancy of text on the web, as well as the unsuitability of much of this corpus for use in a political speech, but I have no way to easily estimate the level of such redundancy or unsuitability.)

Using this corpus, we can then estimate the longest n-gram match one should expect by chance between a speech such as Trump's or Obama's and the entirety of the indexed internet to be about 65-70 characters, with the p-value of a match decreasing exponentially by roughly a factor of two for every additional character beyond this range.
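For reference, here is a small sketch of where the crude formula alone breaks even, i.e. the n at which N M 2^{-n} equals 1, which is n = log2(N) + log2(M). The function name is hypothetical. With the figures quoted above (N about 2^{12.5}, M about 2^43) the raw formula gives roughly 55.5 characters, so the 65-70 range in the post is more conservative than this literal reading.

```python
import math

def break_even_length(N: float, M: float) -> float:
    """Length n at which the crude estimate N * M * 2**(-n) equals 1."""
    return math.log2(N) + math.log2(M)

# Figures quoted in the post: speech of about 2**12.5 characters,
# reference corpus of about 2**43 characters.
N = 2 ** 12.5
M = 2 ** 43
print(round(break_even_length(N, M), 1))
```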

In the case of Trump's speech, it turns out that the longest n-gram match between Trump and Obama is 81 characters long ("values that you work hard for what you want in life your word is your bond and you do what you say", minus the spaces). This suggests a p-value of at most 0.001. But this is an extremely rough estimate, given the large margins of error in the quantities used above, though it is probably on the conservative side as it is using a rather large corpus for M, and is also ignoring the other matching or nearly matching n-grams between the two texts.
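One way to find the longest matching n-gram between two texts is sketched below. It is an illustration only, not necessarily how the 81-character match was found: it strips everything except the letters a-z and then runs a standard longest-common-substring dynamic programme. The function names and the sample sentences are hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Lowercase the text and keep only the letters a-z (drops spaces and punctuation)."""
    return re.sub(r"[^a-z]", "", text.lower())

def longest_shared_ngram(a: str, b: str) -> str:
    """Return the longest string of consecutive letters common to both texts."""
    a, b = normalize(a), normalize(b)
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # prev[j]: length of common suffix of a[:i-1] and b[:j]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

# Made-up sample sentences, not the actual speeches.
match = longest_shared_ngram("Your word is your bond, always.",
                             "She said: your word is your bond!")
print(match, len(match))
```

The running time is proportional to the product of the two text lengths, which is fine for a pair of speeches but not for a web-scale corpus.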

There have been some attempts to counter the accusations of plagiarism by pointing out that the speech of Obama (or of other speakers) also contains matches with prior texts; however the matches are much shorter in length (e.g. the phrase "the world as it should be" appears both in the speech of Obama and in a speech of Saul Alinsky), and given the exponentially decaying nature of the p-value, such shorter matches are completely explainable through random chance. A matching n-gram that is even just 10 characters shorter than the one in Trump's speech, for instance, would be about 1000 times more likely to occur by random chance; one that is 20 characters shorter would be about one million times more likely to occur by chance; and so forth. This is certainly a situation in which the level of significance does not behave linearly with the length of the matching text!
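The "1000 times" and "one million times" figures follow from the factor of two per character: a match that is k characters shorter is 2^k times more likely under the crude estimate. A quick check, with a hypothetical function name:

```python
def chance_ratio(shorter_by: int) -> int:
    """How many times more likely a chance match is when it is `shorter_by` characters shorter.

    Assumes the crude N * M * 2**(-n) estimate, i.e. a factor of two per character.
    """
    return 2 ** shorter_by

for k in (10, 20):
    print(k, chance_ratio(k))
```

Here 2^10 = 1024 and 2^20 = 1,048,576, which round to the figures in the text.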

Post has attachment

Public

Correlation does not imply causation; nevertheless, this is quite a striking correlation between trust in experts and Brexit voting intentions (from June 16). (Note: the actual data http://d25d2506sfb94s.cloudfront.net/cumulus_uploads/document/x4iynd1mn7/TodayResults_160614_EUReferendum_W.pdf has a more refined breakdown depending on what type of "expert" is being considered, but the correlation persists across types.)

Post has attachment

Public

An impressively huge lockdown here at UCLA.
