### Hong Ooi

There is a classic paper of Shannon from 1950 (linked below) in which he estimates the entropy of the English language to be about 1 bit per character. Roughly speaking, this means that for medium-sized n (e.g. 50 to 100), the distribution of n-grams (strings of n consecutive letters in an English text, after stripping punctuation and spacing) is concentrated in a set of size about 2^n (this can be compared with the number of all possible n-grams using the Roman alphabet, which is 26^n). In particular, any two randomly selected n-grams from two different English texts should have a probability of about 2^{-n} of agreeing with each other from purely random chance.
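To get a feel for the gap between 2^n and 26^n, here is a minimal sketch that compares the two counts on a log2 scale. It assumes exactly 1 bit per character, which is only Shannon's rough estimate, and the constant and function names are illustrative.

```python
import math

ALPHABET_SIZE = 26            # letters only, Roman alphabet
ENTROPY_BITS_PER_CHAR = 1.0   # assumption: Shannon's rough estimate


def log2_counts(n: int) -> tuple[float, float]:
    """Return (log2 of typical n-grams, log2 of all possible n-grams)."""
    typical = ENTROPY_BITS_PER_CHAR * n
    possible = math.log2(ALPHABET_SIZE) * n
    return typical, possible


for n in (50, 100):
    typical, possible = log2_counts(n)
    print(f"n={n}: typical ~ 2^{typical:.0f}, possible ~ 2^{possible:.0f}, "
          f"chance of agreement ~ 2^-{typical:.0f}")
```

Since log2(26) is about 4.7, the set of plausible English n-grams is a vanishing fraction of all letter strings once n reaches 50 or so.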

As a consequence, given two English texts of character length N and M, the probability that they share a matching n-gram purely from random chance is approximately N M 2^{-n} (assuming that n is large enough that this product is significantly less than 1, but still small compared to N and M). One can view this quantity NM 2^{-n} as a crude measure of the "p-value" when testing for the hypothesis that a match is occurring for a reason other than random chance.
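The estimate N M 2^{-n} is easiest to handle on a log2 scale, since the quantities involved are extremely large or small. A minimal sketch, with a hypothetical function name and made-up example lengths:

```python
import math


def log2_chance_match(num_chars_a: int, num_chars_b: int, n: int) -> float:
    """log2 of N * M * 2^(-n), the rough chance of a shared n-gram.

    Only meaningful as a probability when the result is well below 0,
    i.e. when N * M * 2^(-n) is significantly less than 1.
    """
    return math.log2(num_chars_a) + math.log2(num_chars_b) - n


# Hypothetical example: two 1,000-character texts sharing a 40-letter run.
log2_p = log2_chance_match(1_000, 1_000, 40)
print(f"log2(p) = {log2_p:.1f}, p = {2.0 ** log2_p:.1e}")
```

The formula counts roughly N M pairs of starting positions and charges each pair a 2^{-n} chance of agreeing, so it is an upper-bound style estimate rather than an exact probability.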

It is tempting to apply this formula as a test for plagiarism, when a matching n-gram is found between two texts (as was recently the case involving the convention speeches of Melania Trump and Michelle Obama). One can take N to be the number of characters of the text to be tested for plagiarism (in the case of Trump's speech, this is about 2^{12.5}). It would however be unfair to take M to be the length of the text in which a match was found (in this case, Obama's speech, which is about 2^13 characters), because this is not the only reference text that would be available for plagiarism (this is a variant of the "Texas sharpshooter fallacy"). For maximum benefit of the doubt, one could take M to be something like the total number of characters of English text on pages indexed by a major search engine; this is a bit difficult to estimate, but it seems that there are about 10^9 to 10^10 English web pages indexed by (say) Google, which on average contain about 10^3 to 10^4 characters of text, so one could very roughly take M to be about 10^13, or about 2^43. (This is probably an overestimate due to redundancy of text on the web, as well as the unsuitability of much of this corpus for use in a political speech, but I have no way to easily estimate the level of such redundancy or unsuitability.)
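The corpus estimate above can be reproduced with a few lines of arithmetic. This sketch takes the page and character ranges from the text at face value and uses the midpoint on a log scale as the central guess; the function name is illustrative.

```python
import math


def log2_corpus_chars(pages: float, chars_per_page: float) -> float:
    """log2 of the total character count for a corpus of web pages."""
    return math.log2(pages) + math.log2(chars_per_page)


low = log2_corpus_chars(1e9, 1e3)     # fewest pages, shortest pages
high = log2_corpus_chars(1e10, 1e4)   # most pages, longest pages
mid = (low + high) / 2                # midpoint on a log scale

print(f"log2(M) ranges from {low:.1f} to {high:.1f}, midpoint {mid:.1f}")
```

The range works out to roughly 2^40 to 2^46.5, so 2^43 sits near the middle, and the uncertainty in M alone shifts any match-length threshold by a few characters either way.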

Using this corpus, we can then estimate the longest n-gram match one should expect by chance between a speech such as Trump's or Obama's and the entirety of the indexed internet to be about 65-70 characters, with the p-value of a match decreasing exponentially by roughly a factor of two for every additional character beyond this range.

In the case of Trump's speech, it turns out that the longest n-gram match between Trump and Obama is 81 characters long ("values that you work hard for what you want in life your word is your bond and you do what you say", minus the spaces). This suggests a p-value of at most 0.001. This is an extremely rough estimate, given the large margins of error in the quantities used above; it is, however, probably on the conservative side, as it uses a rather large corpus for M and ignores the other matching or nearly matching n-grams between the two texts.
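Plugging the figures quoted above into the raw formula N M 2^{-n} can be sketched as follows. This is only a literal evaluation under the stated assumptions (log2 N = 12.5, log2 M = 43, n = 81), with illustrative variable names. The raw formula puts the break-even length at 55.5 characters, below the 65-70 range quoted above, so that range and the 0.001 bound appear to include additional safety margin that this sketch does not model.

```python
LOG2_N = 12.5        # Trump's speech, about 2^12.5 characters (from the text)
LOG2_M = 43.0        # indexed English web text, about 2^43 characters (from the text)
MATCH_LENGTH = 81    # longest matching n-gram reported in the text

# Length at which N * M * 2^(-n) equals 1 under the raw formula.
break_even = LOG2_N + LOG2_M

# Each character beyond break-even halves the estimate.
log2_p = break_even - MATCH_LENGTH
p = 2.0 ** log2_p

print(f"break-even length: {break_even} characters")
print(f"log2(p) = {log2_p}, p = {p:.1e}")
print("below the 0.001 bound:", p < 0.001)
```

Either way the conclusion is the same: the estimate is comfortably below 0.001, consistent with the "at most" in the text.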

There have been some attempts to counter the accusations of plagiarism by pointing out that the speech of Obama (or of other speakers) also contains matches with prior texts; however the matches are much shorter in length (e.g. the phrase "the world as it should be" appears both in the speech of Obama and in a speech of Saul Alinsky), and given the exponentially decaying nature of the p-value, such shorter matches are completely explainable through random chance. A matching n-gram that is even just 10 characters shorter than the one in Trump's speech, for instance, would be about 1000 times more likely to occur by random chance; one that is 20 characters shorter would be about one million times more likely to occur by chance; and so forth. This is certainly a situation in which the level of significance does not behave linearly with the length of the matching text!
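The exponential scaling is easy to check directly. The sketch below, with illustrative helper names, computes the factor by which a shorter match becomes more likely, and applies the raw N M 2^{-n} formula to the short phrase mentioned above, assuming the text's figures of 2^13 characters for Obama's speech and 2^43 for the corpus.

```python
def letters_only(text: str) -> int:
    """Count alphabetic characters, ignoring spaces and punctuation."""
    return sum(ch.isalpha() for ch in text)


def likelihood_ratio(chars_shorter: int) -> int:
    """How many times more likely a chance match is if it is this much shorter."""
    return 2 ** chars_shorter


for k in (10, 20):
    print(f"{k} characters shorter: about {likelihood_ratio(k):,} times more likely")

short_phrase = "the world as it should be"
n = letters_only(short_phrase)
# Raw formula on a log2 scale; a positive value means the formula gives an
# expected number of chance matches rather than a small probability.
log2_expected = 13 + 43 - n
print(f"'{short_phrase}': n = {n}, log2(N * M * 2^-n) = {log2_expected}")
```

For the short phrase the formula gives a value far above 1, which is another way of saying that chance matches of that length should be abundant.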
