Dan Piponi
2,672 followers
Homo Sapiens, Hominini, Hominidae, Primates, Mammalia, Chordata, Animalia

Dan Piponi's posts

I noticed that a basic electronic component allows you to turn digital electronic circuits into a model for provability logic. I wrote it up here: http://blog.sigfpe.com/2017/07/self-referential-logic-via-self.html

The main point is that a certain type of latching delay satisfies something a bit like Löb's theorem.

Quite often, in fields like statistics, machine learning, and 3D rendering, you want to optimize the expected value of something even though it's intractable to compute it exactly.

One approach is to use a Monte Carlo method and work with estimates of the expectation based on averaging samples. Unfortunately these estimates can be hard to differentiate with respect to the parameters you care about. For example, if you're trying to estimate E(f(X)) where X is a random variable from the Bernoulli distribution B(p) (i.e. biased coin tosses), then when you sample values from B(p) you get either 0 or 1. It's not always obvious how you can get the derivative of your estimate with respect to p when it depends on p via a number that takes values in {0, 1}. Now consider when you want to differentiate the result of a simulation that samples B(p) a million times and maybe also depends on p and other variables in other ways too.

So I wrote something about a way to stretch automatic differentiation to estimate these kinds of derivatives: http://blog.sigfpe.com/2017/06/a-relaxation-technique.html

The key idea is to use importance sampling [1], but only, in some sense, an infinitesimally small amount of it.
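To give the flavour of the trick, here's a minimal sketch in Haskell (my own toy, not the code from the post): draw the samples once from B(p0), reweight each sample by the likelihood ratio for B(p), and push a hand-rolled dual number through the whole thing so the derivative with respect to p, evaluated at p = p0, comes out alongside the estimate.

-- A tiny forward-mode dual number: a value paired with its derivative.
data Dual = Dual Double Double deriving Show

instance Num Dual where
  Dual a a' + Dual b b' = Dual (a + b) (a' + b')
  Dual a a' - Dual b b' = Dual (a - b) (a' - b')
  Dual a a' * Dual b b' = Dual (a * b) (a' * b + a * b')
  fromInteger n         = Dual (fromInteger n) 0
  abs (Dual a a')       = Dual (abs a) (a' * signum a)
  signum (Dual a _)     = Dual (signum a) 0

instance Fractional Dual where
  Dual a a' / Dual b b' = Dual (a / b) ((a' * b - a * b') / (b * b))
  fromRational r        = Dual (fromRational r) 0

constant :: Double -> Dual
constant c = Dual c 0

-- Estimate E(f(X)) for X ~ B(p) from samples xs drawn once from B(p0),
-- reweighting each sample by the likelihood ratio P_p(x)/P_p0(x).  The
-- derivative component is with respect to p, evaluated at p = p0.
estimate :: (Double -> Double) -> Double -> [Double] -> Dual
estimate f p0 xs =
    sum [ weight x * constant (f x) | x <- xs ] / fromIntegral (length xs)
  where
    p = Dual p0 1                      -- the variable we differentiate in
    weight x | x == 1    = p / constant p0
             | otherwise = (constant 1 - p) / constant (1 - p0)

The first component of the result is just the usual Monte Carlo average; the second is an estimate of d/dp E(f(X)).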

Note I've only tested this on toy optimisations and I don't know how well it scales to real problems.

[1] https://en.wikipedia.org/wiki/Importance_sampling
[2] Coin toss pic from https://commons.wikimedia.org/wiki/File:Coin_tossing.JPG

The definition of a Markov chain uses a useful notion that I find hard to express in English. It's the idea that P(X(n)|X(n-1), X(n-2), ..., X(0)) = P(X(n)|X(n-1)). I.e., knowing X(n-2), X(n-3) and so on might tell you about X(n), but once you know X(n-1), none of that other stuff is of any use to you at all. It comes up in ordinary everyday reasoning but I don't know what to call it.
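As a toy illustration of that equation (my own made-up two-state chain, nothing deep), you can compute both conditional probabilities directly from the joint distribution and watch the extra conditioning change nothing:

type State = Int

-- A hypothetical two-state chain: initial distribution and transition matrix.
start :: State -> Double
start 0 = 0.3
start _ = 0.7

trans :: State -> State -> Double
trans 0 0 = 0.9
trans 0 _ = 0.1
trans 1 0 = 0.4
trans _ _ = 0.6

-- Joint probability of the path X0 = x0, X1 = x1, X2 = x2.
joint :: State -> State -> State -> Double
joint x0 x1 x2 = start x0 * trans x0 x1 * trans x1 x2

-- P(X2 = x2 | X1 = x1, X0 = x0)
givenBoth :: State -> State -> State -> Double
givenBoth x0 x1 x2 = joint x0 x1 x2 / sum [joint x0 x1 y | y <- [0, 1]]

-- P(X2 = x2 | X1 = x1), with X0 marginalised out
givenLast :: State -> State -> Double
givenLast x1 x2 = sum [joint y x1 x2 | y <- [0, 1]]
                / sum [joint y x1 z | y <- [0, 1], z <- [0, 1]]

For every choice of x0, givenBoth x0 x1 x2 equals givenLast x1 x2 (both are just trans x1 x2): once you know X1, knowing X0 as well buys you nothing.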

Some examples:

1. I want my computer program to run as fast as possible. There's a step I can do using either function A or function B. I measure the speed of A and the speed of B and I know A is faster. Now someone comes along and provides me with profiling tools that can tell me exactly what's going on inside functions A and B. It's all very fancy. But in actual fact it's no use to me at all. After all, I've already measured the speed of A and B. Knowing precisely which instructions contributed what to the total time doesn't get me any further.

2. A friend of a friend is an aircraft mechanic. He refuses to fly because he's seen all kinds of errors in his daily work that could have turned into disasters. But in actual fact, once you know how often aircraft disasters happen, knowing the intimate details of aircraft maintenance doesn't make you any more informed about how safe it is to fly.

3. Someone performs a study of how aggressive video games make people. They find certain effects on the behaviour of game players. They also perform fMRI studies and find that after playing video games, parts of the brain associated with aggression "lit up" and use that to support the case that video games cause violence. But once you know people behave with a certain level of aggression, the fMRI study tells you nothing more about this. The fMRI result is, at best, a proxy for actually violent behaviour - but that's only useful if you don't actually know what the behaviour was.

4. A pharmaceutical product Y has been studied in detail and the risks are now well documented. A group of anti-Y campaigners now advertise that compound X is used in the production of product Y. X is known to have certain risks. But once you know the risks associated with product Y, you've already measured the risk including the risk from X. Noting additionally that X itself carries risks is double counting.

In each case, knowing the extra detail may teach you something useful that can be used in other situations. But it's of no use in answering the main point I raise in each example. And in each case the detail would be useful if you didn't already have the number you really need - like the actual risk of flying.

I'd lump the information from the biopsied cells in this NY Times article in the same category: https://www.nytimes.com/2017/03/23/well/move/the-best-exercise-for-aging-muscles.html?_r=0 Knowing how many genes have changed their activity level is almost certainly a very weak proxy for measuring how best to exercise. (And anyway, disease states change gene activity too.) But it may be good science for other applications.

I wish there was a name for the thing I'm pointing out. And the notion, I think, makes sense, even if you disagree with the examples.

I don't like the way the concept of entropy (in the sense of information theory) is taught. I'll explain by analogy.

Area is the unique property of geometric figures that satisfies a bunch of "axioms" like "is additive", "is invariant under Euclidean motions", "if you scale something by a, its area scales by a^2" and so on. But nobody teaches area like this. The earliest lesson on area I can remember involved placing my hand on graph paper, drawing the shape of my hand, and then counting the number of squares in the shape. What's more, that lesson still corresponds well to how I use area today and it has direct descendants in the form of the Riemann and Lebesgue integrals.

Entropy often gets taught using some set of axioms including additivity (in some sense). They give some good intuitions, but I feel like at bottom there's something missing. In particular, you don't get the formula for entropy emerging of its own accord. Everything seems a little contrived.

To me, the thing that brings the most clarity is known as the "Type Theorem". It's in many (but not all) books on information theory and I provide a link below [1]. I think it's the appropriate analogue of counting squares for relative entropy. It starts with a natural question: if I draw n independent samples from some probability distribution on a finite set, and count how many times each outcome comes up, what is the probability distribution on the set of possible counts? I.e. what is the distribution of possible empirical distributions?

The Type Theorem answers this question. It says that to a good approximation the probability of any particular empirical distribution is a simple expression in the relative entropy between the empirical and real distributions (modulo some details you can read about). To me this makes everything clear. The form of the expression for entropy just pops out. And numerous applications of entropy outside of communications follow from this result. I think it should come before anything else in a course teaching entropy.
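Here's a toy numerical version of that statement (my own sanity check, not taken from the linked notes): compare the exact multinomial probability of seeing particular counts with the expression 2^(-n D(q||p)), where q is the empirical distribution built from those counts.

factorial :: Integer -> Integer
factorial n = product [1 .. n]

-- Exact probability that n draws from distribution p produce counts ks.
exactProb :: [Double] -> [Integer] -> Double
exactProb p ks = fromIntegral (factorial n `div` product (map factorial ks))
               * product (zipWith (^) p ks)
  where n = sum ks

-- The type-theorem expression 2^(-n D(q||p)), with q the empirical distribution.
typeApprox :: [Double] -> [Integer] -> Double
typeApprox p ks = 2 ** (negate (fromIntegral n) * d)
  where
    n = sum ks
    q = [fromIntegral k / fromIntegral n | k <- ks]
    d = sum [qi * logBase 2 (qi / pi') | (qi, pi') <- zip q p, qi > 0]

With p = [0.5, 0.5] and counts [70, 30], both numbers are exponentially small (around 10^-5 and 10^-4 respectively) and differ only by the polynomial-in-n factor the theorem sweeps into its approximation.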

[1] https://blogs.princeton.edu/sas/2013/10/10/lecture-3-sanovs-theorem/

A little followup to my previous post on functional square roots.

If we have functional powers that suggests the idea of functional logarithms and exponentials too. For example, if f is a function, we'd expect something like log(f^m) = m log f.

It's tempting to use the formal power series for log. We could interpret log (1+f) as f-f^2/2+f^3/3-... where 1 here is the identity function and ^ is functional power. But this fails. It took me a little while to figure out why.

Formal power series form a ring under addition and multiplication. For example, if f, g and h are power series, we have f*(g+h) = f*g+f*h, where * is power series multiplication. It looks like we might also have a ring structure with the power series composition operator ○ as the multiplication. But we don't have distributivity. The formal power series of log(1+x) only satisfies the property log(f^m) = m log f because of distributivity. So if we want to compute a logarithm of a function we'll have to use a more direct definition of log that doesn't assume distributivity.

Here's an approach: define log (f) = lim (n->∞) n (f^(1/n) - 1). We can compute f^(1/n) by iterating the square root operator I previously defined. The catch is that it's problematic to compute this exactly because the coefficients of the resulting power series are all limits. So here's what I do: I compute a rational approximation to the limit with n = 2^256 (nice and big!) and then find rational numbers with small denominators that approximate the resulting coefficients. This makes the assumption that the exact answer has relatively small denominators. I put the code at [1].

When I run the code for f(x)=x+x^2, out comes a sequence of rationals. Sadly I can't find the sequence on OEIS. Nonetheless, a web search gets me to [2] and I see...oh...I'm just rediscovering some classical mathematics :-)

Anyway, you'd hope for a property like log f = -log(inverse f). Amazingly the code demonstrates that this seems to hold for the case log (x -> x+x^2) = - log (x -> (-1+√(1+4x))/2).

[1] https://gist.github.com/dpiponi/db72d898dea78e95579998b8c01befe0
[2] http://math.stackexchange.com/questions/208996/half-iterate-of-x2c

Hacker's Delight [1] contains all kinds of neat tricks. One family of tricks that is useful for some image processing tasks (for example) uses binary carry propagation to quickly find contiguous strings of set bits.

More precisely, consider adding the number 1 to a positive integer. If the lowest digit in its binary representation is 1, then there will be a carry to the next digit. If that is also 1 there will be a further carry. This process will continue until the first 0 is hit, and then the carry propagation stops.

So if you add 1 to a number, the bits that are changed are the contiguous string of 1s starting with the lowest digit, plus an extra digit corresponding to the first 0.

I'll use the usual C names for bitwise binary operators: | for OR, & for AND, ~ for NOT, ^ for XOR.

Then p^(p+1) has ones corresponding to the contiguous sequence of 1s starting at the lowest bit of p (possibly empty), plus an extra 1.

Define a unary operator □p = p^(p+1). We can consider this to be an operator on all of ℤ by representing integers, using the two's complement system [2], as infinite sequences of binary digits. The elements of ℤ correspond precisely to the infinite bit sequences where there are either a finite number of 0s or a finite number of 1s.

Armed with the bitwise operations and □ we can perform all kinds of useful operations efficiently. For example we can use □ to find the contiguous sequence of set bits starting at the lowest bit: we just compute ⊡p which I define to be □p&p. We can iterate this to find the nth contiguous sequence of 1s and perform other ‘topological’ operations.

Because the bitwise operations and addition correspond to single machine instructions on most computers we get lightning fast algorithms for working with contiguous bit sequences. I've made use of this sort of thing for image processing and to quickly evaluate positions in game AIs - for example in games where contiguous groups of pieces (or, in the case of Go, spaces) play an important role.
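In Haskell, for instance, these are one-liners, since the Data.Bits instance for Integer already behaves like the infinite two's complement representation described above (just a sketch of mine, not code from Hacker's Delight):

import Data.Bits

-- □p = p^(p+1): the run of 1s at the bottom of p (possibly empty), plus one extra bit.
box :: Integer -> Integer
box p = p `xor` (p + 1)

-- ⊡p = □p & p: the contiguous run of set bits starting at the lowest bit
-- (zero if the lowest bit of p is 0).
lowRun :: Integer -> Integer
lowRun p = box p .&. p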

Let's add a convenient operator: → defined by a → b = ~a | b.

Here are some facts about □:

□(p → q) → □p → □q = -1 (which is 1111… in the two's complement representation)
□(□ p → p) → □p = -1

If p → q = -1, and p = -1, then q = -1
If p = -1 then □p = -1.

These are close to the axioms and derivation rules of provability logic, aka GL [3]. The second fact above is essentially Löb's theorem.

(Very briefly: provability logic is the modal logic you get by introducing the operator □ which means "is provable in PA (= Peano arithmetic)". So, for example, ¬□⊥ is the claim that ⊥ is unprovable in PA, i.e. it is the claim that PA is consistent. Gödel's second incompleteness theorem can be written ¬□⊥ → ¬□(¬□⊥), which is basically shorthand for "if PA is consistent then PA can't prove that PA is consistent")
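Before tightening things up, here's a quick numerical spot check of the two identities above (my own sketch; checking a finite sample of integers obviously isn't a proof):

import Data.Bits

box :: Integer -> Integer               -- □p = p^(p+1), as above
box p = p `xor` (p + 1)

imp :: Integer -> Integer -> Integer    -- a → b = ~a | b
imp a b = complement a .|. b

-- Both identities come out as -1 (all ones) on every sampled pair.
spotCheck :: Bool
spotCheck = and [ imp (box (imp p q)) (imp (box p) (box q)) == -1
               && imp (box (imp (box p) p)) (box p) == -1
                | p <- samples, q <- samples ]
  where samples = [-20 .. 20] ++ [12345, -98765]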

Let me tighten up what I mean.

Consider the fragment of GL with no variables, so-called “letterless GL”. We can define a valuation [.] like this:

[⊥] = 0
[⊤] = -1
[p→q] = [p] → [q]
[p∨q] = [p] | [q]
[p∧q] = [p] & [q]
[¬p] = ~[p]
[□p] = □[p] = [p]^([p]+1)

(Where 'p' and 'q' aren't variables in GL, they're variables I'm using to represent subexpressions of propositions. And I’ve overloaded □.)

Not only is this a valuation, it is an 'isomorphism' in the sense that if [p] = -1, then p is a theorem of GL, and so if [p] = [q] then p ↔ q is a theorem. (Where a↔b is defined to be a→b ∧ b→a.) I'm just stating these properties of [.] but the proof takes some work.

So any question about letterless GL is a question about twiddling with bitwise binary operations and adding one.

There's nothing new here really. This is all in Boolos' book on Provability Logic. But Boolos talks about finite and cofinite subsets of the naturals. I noticed that these are binary representations of integers, and that the expression Boolos uses to define [□p] reminded me of the days when I used to write bit-twiddling code.

Here’s an example of what this means: Mihaly Barasz et al. show how machines can achieve a stable state in Prisoner's Dilemma if they can examine each other's source code [4]. Ultimately the argument boils down to showing that the simultaneous equations

□a → b and □b → a

have a unique solution (in some sense) in GL.

I.e. that

~(a^(a+1))|b = -1 and ~(b^(b+1))|a = -1

have a unique solution a = b = -1 in the integers.
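Checking the easy half of that in code is a one-liner (uniqueness is the part that actually needs the GL argument, though a brute-force scan over small integers at least fails to find any other solution):

import Data.Bits

-- The two equations above, with □x written out as x^(x+1).
solves :: Integer -> Integer -> Bool
solves a b = complement (a `xor` (a + 1)) .|. b == -1
          && complement (b `xor` (b + 1)) .|. a == -1

-- solves (-1) (-1) is True, and e.g.
-- and [ not (solves a b) | a <- [-256 .. 256], b <- [-256 .. 256], (a, b) /= (-1, -1) ]
-- is also True.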

And some Raymond Smullyan puzzles can be rephrased this way too.

BTW, we can assign a meaning to these integers. They label possible worlds we might live in. Is PA consistent or not? If you think it is not consistent then we live in a world whose label has the lowest bit set to 1.
I.e. a world where □⊥→⊥ is false has the lowest bit set to one.
A world where □□⊥→□⊥ is false has the second lowest bit set to one.
A world where □□□⊥→□□⊥ is false has the third lowest bit set to one.
And so on.
The bit sequence of [p] tells you which worlds p holds in.
For example, theorems of GL evaluate to -1=1111…. so they hold in all of these worlds.

Note also that p+1 = p^□p, p-1 = p^□~p = ~(p^◇p) where ◇p = ~□~p, and -p = ~p^□~p, but I don’t think we can write p+p = 2p using just □ and bitwise ops.

And I'm bound to have made a mistake somewhere above as modal logic is a long way from my field...

[1] https://doc.lagout.org/security/Hackers%20Delight.pdf
[2] https://en.wikipedia.org/wiki/Two%27s_complement
[3] https://en.wikipedia.org/wiki/Provability_logic
[4] http://lesswrong.com/lw/hmw/robust_cooperation_in_the_prisoners_dilemma/

Throw a differentiable manifold in the bath and turn on the water. As the bath fills, from time to time the surface of the water will be tangent to the surface of the manifold. The points where this happens are called critical points. There's a critical point the moment the water level hits the bottom of the manifold and there's one just as it covers the top. There are also critical points at the beginning and end of each hole in the manifold. So (modulo some jiggling [5]) the numbers of critical points provide constraints on things like the number of connected components and the number of holes of the manifold. This is the beginning of what are known as the Morse inequalities [1].

Back in 1982 Ed Witten published a classic paper [2] showing how Morse theory is related to supersymmetric quantum field theory, rederiving the inequalities using physics. This is an example of what Witten has a history of doing: reinterpreting some mathematics as a physical theory and then "proving" results with a physical argument. Sometimes the things he proves are old results (like the Morse inequalities) and sometimes he shows things that mathematicians haven't proved yet.

Anyway, I found this paper interesting: The Graph Laplacian and Morse Inequalities [3]. It simplifies Witten's setup by considering graphs rather than differentiable manifolds. In this setting the path integrals of quantum field theory become sums over walks on the graph. The result is a version of the Morse inequalities derived using graph Laplacians, and it turns the complexity of supersymmetric quantum field theory into something you can study with a bit of linear algebra on a computer. Along the way I learnt about the existence of "discrete Morse theory" [4].

[1] https://en.wikipedia.org/wiki/Morse_theory#Morse_inequalities
[2] https://projecteuclid.org/euclid.jdg/1214437492
[3] https://arxiv.org/abs/1704.08354
[4] https://en.wikipedia.org/wiki/Discrete_Morse_theory
[5] https://en.wikipedia.org/wiki/Sard%27s_theorem

There is a bit of folklore which says that in an imperative language you can always eliminate GOTOs from your code and replace them with structured WHILE or DO loops.

I was wondering about compiling BASIC programs and got interested in the problem of rearranging the GOTOs to use structured loops. There's an algebraic gadget for analysing this sort of thing called a "Kleene Algebra with Tests" [2]. So I thought I'd have a go at deriving an algorithm for the restructuring using it.

I quickly found that you end up with lots of code duplication. Worse, I found examples where I couldn't do the elimination at all.

I'd rediscovered that the "Böhm-Jacopini" theorem is false :-) [4] (Actually I rerediscovered it as I've been in a similar place before for a different reason...)

The correct statement of the folklore is in a paper "On Folk Theorems" [1]. You can eliminate all GOTOs if you're allowed to introduce new flag variables. In fact, this is often how GPUs eliminate branching [5].

If you want to eliminate GOTOs without introducing new variables you not only need WHILE (or DO, or FOR) loops, but you also need some kind of BREAK instruction. But that's still not enough. You need a multi-level BREAK N instruction that can get you out of N levels of loop. Or, as suggested in [6], you use returns from procedures.

From time to time I find myself introducing a new flag variable in imperative code to ensure a portion of a loop isn't executed. It always looks clunky. But I guess now I know there's a theorem to justify it I can feel less guilty.

By the way, check out link [3] for a fantastic example of why it's hard to compile BASIC and why interpreted languages can be very different from compiled ones.

[1] http://www.marcusramos.com.br/univasf/pc-2008-2/p379-harel.pdf
[2] https://www.cs.cornell.edu/~kozen/papers/kat.pdf
[3] https://www.riscosopen.org/forum/forums/11/topics/3890#posts-49640
[4] https://www.cs.cornell.edu/~kozen/papers/BohmJacopini.pdf
[5] http://docs.nvidia.com/cuda/parallel-thread-execution/#instruction-statements
[6] http://stackoverflow.com/questions/982595/how-to-break-out-of-2-loops-without-a-flag-variable-in-c


I eventually figured out how to exactly compute, and implement in Haskell, the "functional" logarithm and exponential of (some) formal power series. So, for example, we can get the formal power series for arcsin as exp(-log(sin)) where log and exp here are the functional versions.

In fact, exp here is just the exponential map from differential geometry, and log is its inverse.

I wrote it up here: http://blog.sigfpe.com/2017/02/logarithms-and-exponentials-of-functions.html

Let me try to distract you briefly from current affairs with a square root of the sin function, i.e. a function f such that f(f(x)) = sin(x).

I wrote some Haskell code to compute the formal power series of such a function [2]. (In fact, it computes the unique functional square root of any formal power series starting x+...)
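Here's a minimal version of the idea (a sketch of mine, not the linked gist): represent a truncated series by its list of Rational coefficients and solve for the coefficients of g, where g(g(x)) = f(x), one at a time, using the fact that the x^k coefficient of g∘g is 2·b_k plus terms involving only earlier coefficients.

-- Truncated formal power series as coefficient lists [c0, c1, c2, ...].
type Series = [Rational]

add :: Series -> Series -> Series
add = zipWith (+)

mul :: Series -> Series -> Series
mul f g = [ sum [ f !! i * g !! (k - i) | i <- [0 .. k] ] | k <- [0 .. n - 1] ]
  where n = min (length f) (length g)

-- f ∘ g by Horner's rule, assuming g has zero constant term.
compose :: Series -> Series -> Series
compose f g = foldr (\c acc -> add (c : replicate (n - 1) 0) (mul g acc)) zeros f
  where n     = length g
        zeros = replicate n 0

-- Functional square root of f = x + a2 x^2 + ... to n coefficients: the x^k
-- coefficient of g∘g is 2*b_k plus terms in b_2 .. b_(k-1), so solve for b_k.
funcSqrt :: Int -> Series -> Series
funcSqrt n f = go [0, 1]
  where
    fs = f ++ repeat 0
    go bs | length bs >= n = take n bs
          | otherwise      = go (bs ++ [bk])
      where
        k       = length bs
        partial = take (k + 1) (bs ++ repeat 0)   -- current g with b_k set to 0
        c       = compose partial partial !! k
        bk      = (fs !! k - c) / 2

For example, funcSqrt 6 [0, 1, 0, -1/6, 0, 1/120] gives [0, 1, 0, -1/12, 0, -1/160], i.e. the half-iterate of sin begins x - x^3/12 - x^5/160 - ...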

The coefficients grow large. Nonetheless, you can truncate the series and try evaluating the resulting polynomial. In the graph I've plotted:

In blue: the degree 38 truncation of the series. Let's call it h.
In green: the functional square of the truncation, i.e. x->h(h(x)).
In red: the sin function.

In most of the range [0,π/2] the approximation isn't bad, but you can see it dive bombs soon after.

There's a trick to squeeze more out of this approximation though. Clearly sin^(1/2)(x) = sin^(-1)(sin^(1/2)(sin(x))). What's more, sin(x) is closer to zero than x. So if we replace sin^(1/2) with h in the middle of the sandwich we should get better results. In [1] Thomas Curtright speculates that we can repeat this trick ad infinitum, i.e. that sin^(1/2)(x) = limit as n -> ∞ of sin^(-n)(h(sin^n(x))) where h is given by a truncated power series.

Also see the answers at [3]. To be honest, my reading comprehension skills aren't very good. The question asks if the series converges but I can't figure out what the answer is despite there being a lot of writing there.

You can tweak the code to compute things like the square root of the logistic map x -> x(1-x) to find out how many rabbits there are after any dyadic rational number of years - assuming you can truncate the resulting series appropriately.

[1] http://www.physics.miami.edu/~curtright/TheRootsOfSin.pdf
[2] https://gist.github.com/dpiponi/aa7b713da7c822c04193095397b70bb4
[3] http://mathoverflow.net/questions/45608/does-the-formal-power-series-solution-to-ffx-sin-x-converge