Profile

Hong Ooi
Lives in Melbourne, Australia
95 followers|86,361 views

Stream

Hong Ooi

Shared publicly  - 
 
 
There is a classic paper of Shannon from 1950 (linked below) in which he estimates the entropy of the English language to be about 1 bit per character. Roughly speaking, this means that for medium sized n (e.g. 50 to 100), the distribution of n-grams (strings of n consecutive letters in an English text, after stripping punctuation and spacing) is concentrated in a set of size about 2^n (this can be compared with the number of all possible n-grams using the Roman alphabet, which is 26^n). In particular, any two randomly selected n-grams from two different English texts should have a probability of about 2^{-n} of agreeing with each other from purely random chance.

As a consequence, given two English texts of character length N and M, the probability that they share a matching n-gram purely from random chance is approximately NM 2^{-n} (assuming that n is large enough that this product is significantly less than 1, but still small compared to N and M). One can view this quantity NM 2^{-n} as a crude measure of the "p-value" when testing the hypothesis that a match is occurring for a reason other than random chance.

It is tempting to apply this formula as a test for plagiarism, when a matching n-gram is found between two texts (as was recently the case involving the convention speeches of Melania Trump and Michelle Obama). One can take N to be the number of characters of the text to be tested for plagiarism (in the case of Trump's speech, this is about 2^{12.5}). It would however be unfair to take M to be the length of the text in which a match was found (in this case, Obama's speech, which is about 2^13 characters), because this is not the only reference text which would be available for plagiarism (this is a variant of the "Texas sharpshooter fallacy"). For maximum benefit of the doubt, one could take M to be something like the total number of characters of English text on pages indexed by a major search engine; this is a bit difficult to estimate, but it seems that there are about 10^9 to 10^10 English web pages indexed by (say) Google, which on average contain about 10^3 to 10^4 characters of text, so one could very roughly take M to be about 10^13, or about 2^43. (This is probably an overestimate due to redundancy of text on the web, as well as the unsuitability of much of this corpus for use in a political speech, but I have no way to easily estimate the level of such redundancy or unsuitability.)

Using this corpus, we can then estimate the maximum n-gram match one should have between a speech such as Trump's or Obama's and the entirety of the indexed internet to be about 65-70, with the p-value of a match decreasing exponentially by roughly a factor of two for every additional character beyond this range.
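
The arithmetic behind this kind of estimate can be sketched in a few lines of Python. This is only an illustration of the crude formula NM 2^{-n}, not the calculation actually used in the post: the function names and the `bits_per_char` parameter are assumptions introduced here, with `bits_per_char = 1.0` standing in for Shannon's "about 1 bit per character".

```python
import math

def log2_chance_match(n_chars_a, n_chars_b, ngram_len, bits_per_char=1.0):
    """log2 of the crude chance-match estimate N * M * 2^(-h * n).

    h (bits_per_char) is an assumed entropy rate; h = 1 gives N * M * 2^(-n).
    """
    return math.log2(n_chars_a) + math.log2(n_chars_b) - bits_per_char * ngram_len

def crossover_length(n_chars_a, n_chars_b, bits_per_char=1.0):
    """n-gram length at which the estimate N * M * 2^(-h * n) equals 1."""
    return (math.log2(n_chars_a) + math.log2(n_chars_b)) / bits_per_char

# Figures quoted in the post: N about 2^12.5 (the speech), M about 2^43 (indexed web).
N = 2 ** 12.5
M = 2 ** 43

print(crossover_length(N, M))        # with exactly 1 bit per character
print(crossover_length(N, M, 0.8))   # with an assumed lower entropy rate
```

With exactly 1 bit per character the estimate reaches 1 at about log2(NM) = 55.5 characters. The 65-70 range quoted above is more conservative; an effective entropy rate somewhat below 1 bit per character (around 0.8, which is a guess on my part and not stated in the post) would put the crossover near 69.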

In the case of Trump's speech, it turns out that the longest n-gram match between Trump and Obama is 81 characters long ("values that you work hard for what you want in life your word is your bond and you do what you say", minus the spaces). This suggests a p-value of at most 0.001. But this is an extremely rough estimate, given the large margins of error in the quantities used above, though it is probably on the conservative side as it is using a rather large corpus for M, and is also ignoring the other matching or nearly matching n-grams between the two texts.
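
For readers who want to look for such matches themselves, here is a minimal sketch of one way to find the longest matching n-gram between two texts. It assumes the normalisation described earlier (letters only, punctuation and spacing stripped) and uses a standard dynamic-programming longest-common-substring search; the function names are hypothetical and this is not necessarily how the 81-character match was found.

```python
def normalise(text):
    """Lower-case the text and keep letters only (drops punctuation and spacing)."""
    return "".join(ch for ch in text.lower() if ch.isalpha())

def longest_common_ngram(text_a, text_b):
    """Return the longest run of consecutive letters shared by both texts.

    Row-by-row dynamic programming: curr[j] is the length of the common run
    ending at a[i-1] and b[j-1]. Time is O(len(a) * len(b)).
    """
    a, b = normalise(text_a), normalise(text_b)
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

# Toy inputs made up for illustration, not the actual speeches.
match = longest_common_ngram("Your word is your bond.", "My word is your bond, always.")
print(match, len(match))
```

The quadratic running time is fine for two speeches of a few thousand characters each; matching against a web-scale corpus would need an index-based approach instead.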

There have been some attempts to counter the accusations of plagiarism by pointing out that the speech of Obama (or of other speakers) also contains matches with prior texts; however the matches are much shorter in length (e.g. the phrase "the world as it should be" appears both in the speech of Obama and in a speech of Saul Alinsky), and given the exponentially decaying nature of the p-value, such shorter matches are completely explainable through random chance. A matching n-gram that is even just 10 characters shorter than the one in Trump's speech, for instance, would be about 1000 times more likely to occur by random chance; one that is 20 characters shorter would be about one million times more likely to occur by chance; and so forth. This is certainly a situation in which the level of significance does not behave linearly with the length of the matching text!
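
The factors of 1000 and one million follow directly from the crude estimate. Writing p(n) for the chance-match estimate NM 2^{-n} (the name p is introduced here only for convenience), shortening the match by k characters gives:

```latex
\[
  \frac{p(n-k)}{p(n)} \;=\; \frac{N M \, 2^{-(n-k)}}{N M \, 2^{-n}} \;=\; 2^{k},
  \qquad
  2^{10} = 1024 \approx 10^{3},
  \qquad
  2^{20} = 1\,048\,576 \approx 10^{6}.
\]
```

The ratio does not depend on N or M, so it holds regardless of which corpus size one settles on.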
17 comments on original post

Hong Ooi

Shared publicly  - 
 
Facebook has replaced Google+ as one of my frequently-visited tabs in Chrome. Resistance is futile....

Hong Ooi

Discussion  - 
 
ASSUMING DIRECT CONTROL
This "biobot" is half-computer, half-beetle, and you can control how it moves.
4 comments
 
Ewwwww!

Hong Ooi

Shared publicly  - 
 
It's not April 1. Scott Guthrie, executive vice president of Microsoft's Cloud and Enterprise Group, announced today that next year Microsoft will be releasing a version of SQL Server that runs on Linux. A private preview is available today that includes the core relational database features of SQL Server 2016.
Will follow the release of SQL Server 2016 for Windows later this year.

Hong Ooi

Shared publicly  - 
 
Zoe Quinn can't fight any more :(
Her home address, passwords and nude photos were widely distributed. She was threatened with rape. Zoe Quinn had a solid criminal case. But now, this high-profile victim of cyber-harassment has given up on the legal system.

Hong Ooi

Shared publicly  - 
 
In response to North Korea's latest nuke test, South Korea will begin broadcasting K-pop across the border. In the face of this onslaught, the North Korean leadership must be rethinking their...

Hmm.

Man, I got to get me some of those nukes.

Hong Ooi

Shared publicly  - 
 
 
It's not what you think. But it's also more interesting than you think.
interviewing.io is a platform where people can practice technical interviewing anonymously and, in the process, find jobs based on their interview performance rather than their resumes. Since we st…
3 comments on original post

Hong Ooi

Shared publicly  - 
 
"Today, a company can artificially disrupt an industry by using cash to make up for the inefficiencies of the business plan. If Uber has a big enough war chest, it can disrupt the taxi business by taking a loss on rides for several years. It does that to drive everyone else out of business and establish its monopoly. Is that disruption? Sure. It's not a creative sort, because it does not have a sustainable marketplace on the other end. It's just destructive destruction. Like what Amazon did to the book industry.

"So Uber can establish a monopoly in a taxi market without worrying about the long-term health of that market. It's OK if the drivers can't make a living, or the traffic patterns won't work, or the roads break down, or people with special needs can't get transportation. It's OK because Uber doesn't need the taxi market to stay viable; it simply needs a monopoly in order to leverage over to something else, like logistics, drone delivery, or robotic driving.

"Most of Uber's drivers can't make a living as an Uber driver. While taxi drivers and limo drivers could recoup enough money to keep themselves and their families alive, Uber drivers are not paid enough to do that. The platform keeps the money." And then lists on the stock exchange, making its founders unimaginably rich.
Doubts about the motivations of the new tech titans are piling up, writes Kelsey Munro.

Hong Ooi

Shared publicly  - 
 
Another day, another vote to close on Stack Overflow.


Hong Ooi

Shared publicly  - 
 
 
The new journal Discrete Analysis has just been launched.

Points to note.

1. It is an arXiv overlay journal. That means that to submit a paper you just post it to the arXiv and tell us that that is what you have done. If we accept it, we suggest revisions, you post those, and that's it. That enables us to keep our costs about two orders of magnitude lower than those of a traditional journal. Papers are free to read (obviously) and we make no charge to authors.

2. Contrary to what one might expect, the website of the journal adds considerable value to the articles, by presenting them properly. Thanks to the work of Scholastica, the platform we chose for the journal, the site looks beautiful and is laid out extremely thoughtfully (this will become clearer when we have more articles). We have also provided "Editorial introductions" for each article, which aim to place it in context and help you judge whether you might be interested in reading it. If you open one of these, it does not take you to a different page, but opens up in a box on the existing page. (In general, we have tried to minimize the amount of loading you have to do.)

I have written more about it in a blog post:

   https://gowers.wordpress.com/2016/03/01/discrete-analysis-launched/
8 comments on original post

Hong Ooi

Shared publicly  - 
 
On the one hand, Martin Shkreli is an A-grade dickbag.

On the other, how many of us have wanted to do the same thing to Lamar Smith or the rest of the Republican climate-change-is-a-conspiracy bunch?


'Pharma bro' said he was going to 'school Congress'. When he faced them, he uncharacteristically shut his mouth.

Hong Ooi

Shared publicly  - 
 
 
People getting jail sentences for burning public lands without permission in order to hide illegal poaching?

Government overreach.

People being executed by government representatives in the streets, for selling cigarettes? Or for having a toy gun? Or for minor traffic infractions?

Not government overreach.

Mandatory minimum jail sentences that leave no leeway for judges' discretion?

Appropriate and needed in order to deter crime. From non-white people.

Mandatory minimum jail sentences that leave no leeway for judges' discretion... when they affect white people?

Government tyranny.
2 comments on original post
Story
Introduction
I live in sunny Austria, the land down under.
Basic Information
Gender
Male
Places
Currently
Melbourne, Australia
Previously
Sydney, Australia - Canberra, Australia - Kuala Lumpur, Malaysia
Links
YouTube
Seriously awesome burgers.
Public - 3 years ago
reviewed 3 years ago
1 review