Hong Ooi
Hong's posts

Post has attachment

"You can no longer pause combat to aim and fire powers or tell your squadmates to do so, and combat arenas are designed to accommodate Andromeda's emphasis on increased character mobility and fast-paced third-person shooting."

I think I should be fine, since ME3MP cured me of the need to pause every 2 seconds to line up my shots. But damn, assuming the quote is about singleplayer, this is a pretty big change.

Post has shared content
There is a classic paper of Shannon from 1950 (linked below) in which he estimates the entropy of the English language to be about 1 bit per character. Roughly speaking, this means that for medium-sized n (e.g. 50 to 100), the distribution of n-grams (strings of n consecutive letters in an English text, after stripping punctuation and spacing) is concentrated in a set of size about 2^n (this can be compared with the number of all possible n-grams using the Roman alphabet, which is 26^n). In particular, any two randomly selected n-grams from two different English texts should have a probability of about 2^{-n} of agreeing with each other from purely random chance.

As a consequence, given two English texts of character length N and M, the probability that they share a matching n-gram purely from random chance is approximately N M 2^{-n} (assuming that n is large enough that this product is significantly less than 1, but still small compared to N and M). One can view this quantity NM 2^{-n} as a crude measure of the "p-value" when testing for the hypothesis that a match is occurring for a reason other than random chance.
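As a rough illustration of the quantity N M 2^{-n} described above, here is a minimal Python sketch. The function names and the sample values (two texts of 2^13 characters each, a 40-character match) are hypothetical and chosen only to make the arithmetic easy to follow; the 1 bit per character figure is the estimate quoted above, not something the code verifies.

```python
import math

def log2_chance_match(N, M, n):
    """Return log2 of N * M * 2**(-n), the crude chance-match estimate.

    N, M: character lengths of the two texts (assumed positive).
    n: length of the matching n-gram, assuming ~1 bit of entropy per character.
    Working in log2 avoids underflow when n is large.
    """
    return math.log2(N) + math.log2(M) - n

def chance_match_estimate(N, M, n):
    """Return N * M * 2**(-n) itself; only meaningful when it is well below 1."""
    return 2.0 ** log2_chance_match(N, M, n)

# Hypothetical example: two texts of 2**13 characters and a 40-character match.
print(log2_chance_match(2**13, 2**13, 40))      # exponent of 2 in the estimate
print(chance_match_estimate(2**13, 2**13, 40))  # the estimate itself
```

Working with the base-2 logarithm keeps the bookkeeping simple: the exponents of N and M add, and each extra matching character subtracts one.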

It is tempting to apply this formula as a test for plagiarism, when a matching n-gram is found between two texts (as was recently the case involving the convention speeches of Melania Trump and Michelle Obama). One can take N to be the number of characters of the text to be tested for plagiarism (in the case of Trump's speech, this is about 2^{12.5}). It would however be unfair to take M to be the length of the text in which a match was found (in this case, Obama's speech, which is about 2^13 characters), because this is not the only reference text that would be available for plagiarism (this is a variant of the "Texas sharpshooter fallacy"). For maximum benefit of the doubt, one could take M to be something like the total number of characters of English text on pages indexed by a major search engine; this is a bit difficult to estimate, but it seems that there are about 10^9 to 10^10 English web pages indexed by (say) Google, which on average contain about 10^3 to 10^4 characters of text, so one could very roughly take M to be about 10^13, or about 2^43. (This is probably an overestimate due to redundancy of text on the web, as well as the unsuitability of much of this corpus for use in a political speech, but I have no way to easily estimate the level of such redundancy or unsuitability.)

Using this corpus, we can then estimate the maximum n-gram match one should have between a speech such as Trump's or Obama's and the entirety of the indexed internet to be about 65-70, with the p-value of a match decreasing exponentially by roughly a factor of two for every additional character beyond this range.

In the case of Trump's speech, it turns out that the longest n-gram match between Trump and Obama is 81 characters long ("values that you work hard for what you want in life your word is your bond and you do what you say", minus the spaces). This suggests a p-value of at most 0.001. But this is an extremely rough estimate, given the large margins of error in the quantities used above, though it is probably on the conservative side as it is using a rather large corpus for M, and is also ignoring the other matching or nearly matching n-grams between the two texts.
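For anyone who wants to measure such a match themselves, the following is a small sketch of one way to find the longest shared n-gram between two texts after stripping punctuation and spacing. It is not how the 81-character figure above was obtained; the helper names are hypothetical, lowercasing is an added assumption, and it leans on Python's standard `difflib.SequenceMatcher`, which is convenient but not fast on very long inputs.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase the text and keep only letters (drops spaces and punctuation)."""
    return "".join(ch for ch in text.lower() if ch.isalpha())

def longest_shared_ngram(text_a, text_b):
    """Return the longest run of letters that appears in both normalized texts."""
    a, b = normalize(text_a), normalize(text_b)
    # autojunk=False disables difflib's "popular element" heuristic, which
    # would otherwise ignore frequent letters in sequences over 200 items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

# Toy example with made-up sentences.
shared = longest_shared_ngram(
    "Your word is your bond, they said.",
    "She knew that your word is your bond.",
)
print(shared, len(shared))
```

The length of the returned string is the n that would go into the estimate N M 2^{-n}.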

There have been some attempts to counter the accusations of plagiarism by pointing out that the speech of Obama (or of other speakers) also contains matches with prior texts; however the matches are much shorter in length (e.g. the phrase "the world as it should be" appears both in the speech of Obama and in a speech of Saul Alinsky), and given the exponentially decaying nature of the p-value, such shorter matches are completely explainable through random chance. A matching n-gram that is even just 10 characters shorter than the one in Trump's speech, for instance, would be about 1000 times more likely to occur by random chance; one that is 20 characters shorter would be about one million times more likely to occur by chance; and so forth. This is certainly a situation in which the level of significance does not behave linearly with the length of the matching text!
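The "1000 times" and "one million times" figures follow from the 2^{-n} scaling: shortening a match by k characters multiplies the chance estimate by 2^k. A tiny sketch, with a hypothetical function name and assuming the 1 bit per character figure used throughout:

```python
def chance_ratio(k):
    """Factor by which a match k characters shorter is more likely by chance,
    assuming the estimate scales as 2**(-n)."""
    return 2 ** k

print(chance_ratio(10))  # 1024, i.e. roughly 1000
print(chance_ratio(20))  # 1048576, i.e. roughly one million
```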

Post has shared content

Facebook has replaced Google+ as one of my frequently-visited tabs in Chrome. Resistance is futile....

Post has attachment
"Today, a company can artificially disrupt an industry by using cash to make up for the inefficiencies of the business plan. If Uber has a big enough war chest, it can disrupt the taxi business by taking a loss on rides for several years. It does that to drive everyone else out of business and establish its monopoly. Is that disruption? Sure. It's not a creative sort, because it does not have a sustainable marketplace on the other end. It's just destructive destruction. Like what Amazon did to the book industry.

"So Uber can establish a monopoly in a taxi market without worrying about the long-term health of that market. It's OK if the drivers can't make a living, or the traffic patterns won't work, or the roads break down, or people with special needs can't get transportation. It's OK because Uber doesn't need the taxi market to stay viable; it simply needs a monopoly in order to leverage over to something else, like logistics, drone delivery, or robotic driving.

"Most of Uber's drivers can't make a living as an Uber driver. While taxi drivers and limo drivers could recoup enough money to keep themselves and their families alive, Uber drivers are not paid enough to do that. The platform keeps the money." And then lists on the stock exchange, making its founders unimaginably rich.

Post has attachment

Another day, another vote to close on Stack Overflow.

Post has attachment
It's not April 1. Scott Guthrie, executive vice president of Microsoft's Cloud and Enterprise Group, announced today that next year Microsoft will be releasing a version of SQL Server that runs on Linux. A private preview is available today that includes the core relational database features of SQL Server 2016.

Post has shared content
The new journal Discrete Analysis has just been launched.

Points to note.

1. It is an arXiv overlay journal. That means that to submit a paper you just post it to the arXiv and tell us that that is what you have done. If we accept it, we suggest revisions, you post those, and that's it. That enables us to keep our costs about two orders of magnitude lower than those of a traditional journal. Papers are free to read (obviously) and we make no charge to authors.

2. Contrary to what one might expect, the website of the journal adds considerable value to the articles, by presenting them properly. Thanks to the work of Scholastica, the platform we chose for the journal, the site looks beautiful and is laid out extremely thoughtfully (this will become clearer when we have more articles). We have also provided "Editorial introductions" for each article, which aim to place it in context and help you judge whether you might be interested in reading it. If you open one of these, it does not take you to a different page, but opens up in a box on the existing page. (In general, we have tried to minimize the amount of loading you have to do.)

I have written more about it in a blog post:

Post has attachment
Zoe Quinn can't fight any more :(