### Çetin Kaya Koç

Shared publicly - In honor of David Duchovny's birthday - here's a look back at one of our favorite bits! #TheXFiles


Çetin Kaya Koç

Works at University of California Santa Barbara

Lives in Santa Barbara, California

233 followers | 38,124 views



Does anyone know what this is?


This needs to stop. What a waste.

#Bitcoin transactions could consume as much #energy as Denmark by the year 2020 http://boingboing.net/2016/03/31/projection-by-2020-bitcoin-t.html

The numbers in this study are very back-of-the-envelope and assume a worst case: widespread adoption of Bitcoin and not much improvement in Bitcoin mining activity, along with long replacement cycl…


I'm on the editorial board of Combinatorica. Whether I should be is another matter, since it is a journal owned by Springer, one of the big commercial publishers. But I am, and as a result I have a free subscription to the journal. Today I found the latest issue in my pigeonhole, and the last paper in the issue was a paper by Csaba Tóth, entitled, "The Szemerédi-Trotter theorem in the complex plane."

This paper is remarkable for two reasons. One, which provokes this post, is that at the beginning of the paper it says, "Received December 1999, Revised May 16 2014." So the paper is coming out over 15 years after it was submitted. Doubtless this isn't a record, but it's still a pretty big gap. I noticed it because my first reaction on seeing the title was, "But I thought this had been done a long time ago."

The other reason is the result itself. The Szemerédi-Trotter theorem states that if you have n points and m lines in the plane, then the number of incidences (that is, pairs (P,L) where P is a point in your collection, L is a line in your collection, and P is a point in L) is at most C(n + m + n^{2/3}m^{2/3}). This slightly curious looking bound is best possible up to the constant C and is more natural than it looks.
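As a quick illustration (my own addition, not part of the original post), here is a minimal Python sketch that brute-forces incidences on the standard grid construction showing the n^{2/3}m^{2/3} term cannot be removed: with k slopes and intercepts chosen so every line meets the point grid at every x-coordinate, the incidence count is a constant fraction of n^{2/3}m^{2/3}.

```python
from itertools import product

def incidences(points, lines):
    # A line is a pair (a, b) representing y = a*x + b; count pairs (P, L)
    # where point P lies on line L.
    return sum(1 for (x, y), (a, b) in product(points, lines) if y == a * x + b)

# Grid construction: points on a k x 2k^2 integer grid, lines y = a*x + b
# with slopes a in 1..k and intercepts b in 1..k^2. Every line then passes
# through one grid point for each of the k x-coordinates.
k = 4
points = [(x, y) for x in range(1, k + 1) for y in range(1, 2 * k * k + 1)]
lines = [(a, b) for a in range(1, k + 1) for b in range(1, k * k + 1)]

n, m = len(points), len(lines)   # n = 2k^3 = 128, m = k^3 = 64
I = incidences(points, lines)    # k points per line: I = k * m = k^4 = 256
bound = n + m + n ** (2 / 3) * m ** (2 / 3)
print(I, bound)
```

Here n^{2/3}m^{2/3} = 2^{2/3} k^4, so I is about 0.63 of that term, a constant fraction independent of k; this construction stays below n + m + n^{2/3}m^{2/3} even with C = 1.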

The known proofs of the theorem relied heavily on the topological properties of the plane, which meant that it was far from straightforward to generalize the result to lines and points in the complex plane (by which I mean C^2 and not C). Indeed, it was an open problem to do so, and that was what Tóth solved.

If you're feeling ambitious, there is also a lovely conjecture in the paper. Define a d-flat in R^{2d} to be an affine subspace of dimension d. Suppose now that you have n points and m d-flats with the property that no two of the d-flats intersect in more than a point. Is it the case that the number of incidences is at most C(n + m + n^{2/3}m^{2/3})? The constant C is allowed to depend on the dimension d but not on anything else. Note that even for d=2 this would be a new result, since Tóth's theorem is the special case where the d-flats are complex lines.

I should say that I haven't checked whether there has been any progress on this conjecture, so I don't guarantee that it is open. If anyone knows about its status, it would be great if you could comment below.

#spnetwork DOI: 10.1007/s00493-014-2686-2



Nice!

Here is the kit for my recent graph theory project:

Math for eight-year-olds: graph theory for kids!

http://jdh.hamkins.org/math-for-eight-year-olds/

Print the pages double-sided, then fold each page in half. Place the folded pages one after the other (not nested) inside the cover page.



NOOOOO!!!!!


There is a classic paper of Shannon from 1950 (linked below) in which he estimates the entropy of the English language to be about 1 bit per character. Roughly speaking, this means that for medium-sized n (e.g. 50 to 100), the distribution of n-grams (strings of n consecutive letters in an English text, after stripping punctuation and spacing) is concentrated in a set of size about 2^n (this can be compared with the number of all possible n-grams using the Roman alphabet, which is 26^n). In particular, any two randomly selected n-grams from two different English texts should have a probability of about 2^{-n} of agreeing with each other from purely random chance.

As a consequence, given two English texts of character length N and M, the probability that they share a matching n-gram purely from random chance is approximately N M 2^{-n} (assuming that n is large enough that this product is significantly less than 1, but still small compared to N and M). One can view this quantity NM 2^{-n} as a crude measure of the "p-value" when testing for the hypothesis that a match is occurring for a reason other than random chance.

It is tempting to apply this formula as a test for plagiarism, when a matching n-gram is found between two texts (as was recently the case involving the convention speeches of Melania Trump and Michelle Obama). One can take N to be the number of characters of the text to be tested for plagiarism (in the case of Trump's speech, this is about 2^{12.5}). It would however be unfair to take M to be the length of the text in which a match was found (in this case, Obama's speech, which is about 2^13 characters), because this is not the only reference text that would have been available for plagiarism (this is a variant of the "Texas sharpshooter fallacy"). For maximum benefit of the doubt, one could take M to be something like the total number of characters of English text on pages indexed by a major search engine; this is a bit difficult to estimate, but it seems that there are about 10^9 to 10^10 English web pages indexed by (say) Google, which on average contain about 10^3 to 10^4 characters of text, so one could very roughly take M to be about 10^13, or about 2^43. (This is probably an overestimate due to redundancy of text on the web, as well as the unsuitability of much of this corpus for use in a political speech, but I have no way to easily estimate the level of such redundancy or unsuitability.)

Using this corpus, we can then estimate the maximum n-gram match one should have between a speech such as Trump's or Obama's and the entirety of the indexed internet to be about 65-70, with the p-value of a match decreasing exponentially by roughly a factor of two for every additional character beyond this range.

In the case of Trump's speech, it turns out that the longest n-gram match between Trump and Obama is 81 characters long ("values that you work hard for what you want in life your word is your bond and you do what you say", minus the spaces). This suggests a p-value of at most 0.001. But this is an extremely rough estimate, given the large margins of error in the quantities used above, though it is probably on the conservative side as it is using a rather large corpus for M, and is also ignoring the other matching or nearly matching n-grams between the two texts.

There have been some attempts to counter the accusations of plagiarism by pointing out that the speech of Obama (or of other speakers) also contains matches with prior texts; however, these matches are much shorter in length (e.g. the phrase "the world as it should be" appears both in the speech of Obama and in a speech of Saul Alinsky), and given the exponentially decaying nature of the p-value, such shorter matches are completely explainable through random chance. A matching n-gram that is even just 10 characters shorter than the one in Trump's speech, for instance, would be about 1000 times more likely to occur by random chance; one that is 20 characters shorter would be about one million times more likely; and so forth. This is certainly a situation in which the level of significance does not behave linearly with the length of the matching text!
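A minimal numeric sketch of the estimates above (my own addition, not from the post): the entropy parameter h is something I have introduced for sensitivity analysis, with h = 1 being the rough Shannon figure used in the discussion.

```python
import math

def match_pvalue(N, M, n, h=1.0):
    # Crude chance of texts of lengths N and M sharing an n-character substring,
    # assuming English entropy of h bits per character (typical n-grams then
    # occupy a set of size ~2^(h*n)): p ~= N * M * 2^(-h*n).
    return N * M * 2.0 ** (-h * n)

N = 2 ** 12.5   # rough character count of one convention speech
M = 2 ** 43     # very rough size of an indexed English web corpus

# With h = 1 the estimate crosses p = 1 near n = log2(N*M) ~= 55.5; taking a
# slightly lower entropy (h ~= 0.8, still within Shannon's estimated range)
# pushes the break-even length to ~69, in the 65-70 range quoted above.
break_even = math.log2(N * M)

# Exponential decay: each extra matching character halves p (for h = 1), so a
# match 10 characters shorter is 2^10 ~= 1000 times more likely by chance.
ratio = match_pvalue(N, M, 71) / match_pvalue(N, M, 81)
print(break_even, match_pvalue(N, M, 81), ratio)
```

This is only a back-of-the-envelope reproduction of the arithmetic; the wide margins of error on N, M, and h dominate any precision in the output.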



Time is the indefinite continued progression of existence and events that occur in apparently irreversible succession from the past through the present to the future.


Educate your neighborhood trumpies -- at least try


The Riemann Hypothesis is one of the great unsolved problems of mathematics, and a $1,000,000 Clay Mathematics Institute prize awaits the person who solves it. But, with or without the money, its resolution is crucial for our understanding of the nature of numbers.


Collections Çetin Kaya is following

View all

Work

Occupation

Professor

Employment

- University of California Santa Barbara, Research Professor, 2008 - present
- Claveo Software, CEO, 2010 - 2013
- Cryptocode, CEO, 2003

Basic Information

Gender

Male

Story

Introduction

Places

Currently

Santa Barbara, California

Previously

Istanbul, Turkey

Links

YouTube


One of the best places to learn English. My niece tells me so :)

Public - a month ago


Near the Univ of Siena and one of the liveliest Siena streets with restaurants and stores. The host and staff are relaxed and helpful. I stay here every time I come to unisi <3

Public - 3 months ago


I have been to Le Logge every time I come to Siena, and it is always the best.

Public - 3 months ago
