Could category theory help save privacy?

Jeremy Kun has a great blog article about encryption and privacy:

http://jeremykun.com/2013/06/10/why-theoretical-computer-scientists-arent-worried-about-privacy/

He quotes Obama as saying:

You can’t have 100% security and 100% privacy, and also zero inconvenience.

But he argues against this, pointing out:

There are two facts which are well known in theoretical computer science that the general public is not aware of. The first is about the privacy of databases:

There is a way to mine information from a database without the ability to inspect individual entries in the database.

This is known as differential privacy. The second is no less magical:

There are secure encryption schemes which allow one to run programs on encrypted data and produce encrypted results, without the ability to decrypt the data.

This is known as fully homomorphic encryption.

These ideas are exciting for lots of practical reasons.   The second one means that big companies and governments can do useful things with encrypted data you send them, and then send back the results for you to see, encoded in a way only you can decode... without them being able to know what your data was! 

But as a mathematician, I really love how this is an example of category theory in action! 

The funny diagram below expresses the key idea.  We start with a message m and encrypt it, getting the encrypted message enc(m).   This process is drawn as the vertical arrow at left, labelled enc.  Arrows are processes: things we can do.

Alternatively, we could run some program f on our message and get some output f(m) - that's what the horizontal arrow at bottom means.    Then we could encrypt that and get the encrypted output enc(f(m)).

But it would be really cool if anyone in the world could produce this encrypted output  starting with the encrypted message, but without knowing how to decode that encrypted message.

And they can.   There is a program, called eval in this picture, which takes enc(m) as input and produces the  encrypted output enc(f(m)).  

So: going up the left-hand arrow and then across the top of the rectangle is the same as going across the bottom and then going up the right-hand arrow.  This is a very standard idea in category theory. 

In short: eval(enc(m)) = enc(f(m)).

But the cool part is that anyone in the world can have access to the program eval but still not be able to decode your encrypted message.  So, people can do useful things with your data without ever knowing your data.  Privacy is preserved.
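
For programmers who'd like to see the square commute in code, here is a minimal sketch using textbook (unpadded) RSA, which happens to be multiplicatively homomorphic.  The tiny key, the message and the choice of f below are all invented purely for illustration; real fully homomorphic schemes are far more elaborate.

# A toy illustration of the commuting square eval(enc(m)) = enc(f(m)),
# using textbook (unpadded, deterministic) RSA, which is multiplicatively
# homomorphic.  Parameters are tiny and purely illustrative.

p, q = 61, 53           # toy primes; never use keys this small in practice
n = p * q               # public modulus
e = 17                  # public exponent
d = 2753                # private exponent: e*d = 1 (mod lcm(p-1, q-1))

def enc(m):             # the vertical arrows: encryption
    return pow(m, e, n)

def dec(c):             # only the key holder can do this
    return pow(c, d, n)

def f(m):               # the program we want run on the plaintext: squaring
    return (m * m) % n

def eval_f(c):          # the top arrow: operates on ciphertexts only
    return (c * c) % n

m = 42
assert eval_f(enc(m)) == enc(f(m))    # up-then-across equals across-then-up
assert dec(eval_f(enc(m))) == f(m)    # decrypting the result recovers f(m)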

Thanks to +Alexander Kruel for pointing this out.
48 comments
 
Nice, but still falls foul of false positives.  Orwell isn't the (immediate) risk:  Kafka is.  See Bruce Schneier:  http://bit.ly/129X5Io
 
So would it be an accurate summation to say that f(enc(m)) = enc(f(m)) ?
 
With caveats; only certain operations qualify as fully homomorphic.
 
Perhaps I'm mistaken, but this kind of analysis seems fantastic for working with something like Bitcoin.
 
+Andrew Trapp - if you read the diagram up and across you get eval(enc(m)).  If you read it across and up you get enc(f(m)).  So these are what's equal, just as +Robert Byrne points out.  I'll add the correct equation to my post, since that may help people.
 
" Orwell isn't the (immediate) risk:  Kafka is." Great comment. Not only is this true here, but I conjecture that it's true regardless of the bound.
 
Ah, got it.  Thanks for the clarification, everyone.
 
I couldn't believe what you just posted. The first fully homomorphic cryptosystem only came up in 2009 (Wikipedia says). I'm not sure if I read it correctly, but Craig Gentry seems to give a factor of about 10 compared to unencrypted computation, and a factor of about 100 for preparation (key generation), here:
http://eurocrypt2010rump.cr.yp.to/9854ad3cab48983f7c2c5a2258e27717.pdf
That sounds very reasonable for simple computations. Why am I thinking of Petri nets now...?
 
This kind of stuff is nifty but it answers the wrong question.  We've known since the mid-1970s how to build really great cryptosystems. Aside from a few supergeeks, we're not using them. Why not? Because the people who would deploy them either don't have any incentive to do so (commercial), or actually have an incentive to not do so (government). The right question is, how can we create an incentive to deploy excellent security systems?
 
+John Baez This post is very important. As always when it comes to mathematics, the challenge consists in communicating the relevance to the wider public. The key messages are: 

1. Privacy and convenience are technologically possible, and we need not relinquish security to attain them 
2. Efficient protocols for ensuring privacy are out there waiting to be implemented in software

This argument is worth discussing:

It’s easy to argue that these techniques will never become mainstream enough for individuals to benefit from it. ... And then there are questions of policy: why would any company relinquish the ability to work directly with user data? And the cost of rearchitecturing existing services to utilize these technologies would be enough to dissuade most business leaders.

Given the latest events, I'm not sure whether Silicon Valley is the best place to look for innovation.

It is a matter of packaging the implementation such that using the techniques becomes sufficiently convenient. Furthermore the way businesses and governments use data is not always benign. In the next few years more and more personal genetic information and health data is going to be stored online. Either we abolish any notion of privacy or most of this data will have to be fully controlled by individual citizens.

The arcane data architectures we use today are the result of a legacy of technological and economic constraints on data storage that applied several decades ago. Traditional databases with centralised control represent a form of legacy that is convenient for corporations and governments, by protecting established power structures and by stifling innovation.
 
On the contrary, we use heavy encryption on a daily basis in numerous aspects.

The problem with security isn't hardening the already secure bits, it's all the rest. Ergo, the problem with web security isn't the algorithms used to produce keys for HTTPS. It's with tricking users to authorize an action that is not in their best interest, it's in pretending to be an entity that you're not in order to have users authenticate to you, etc.

These are predominantly social and design issues; crypto is secure. Wikileaks was exemplary of this; Manning was arrested because he was outed by a trusted friend, not because anyone broke into Wikileaks' systems.

Hacktivismo used to do similar things with human rights organizations; data encrypted on a laptop is worthless to the authorities, it's infeasible to decrypt, and that's with algorithms 10+ years old.
 
You can't "mine" encrypted data, but you can perform operations on it. For instance, let's say you've encrypted a number in a database. I can't find out what number that is, but I can perform a multiplication operation on it and get a result (also encrypted) of that multiplication. I never know what your value is, nor can I ever find out.
 
The benefit is, many times you want someone who is storing a piece of your encrypted data to do something with it (increment/decrement your bank balance, let's say); ideally, you'd want to give them the least amount of control over your data in doing so. That's the holy grail of fully homomorphic encryption.
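
As a concrete toy sketch of that bank-balance idea, here is the Paillier cryptosystem, which is additively homomorphic (a partially homomorphic scheme, not a fully homomorphic one): multiplying two ciphertexts adds the underlying plaintexts, so a server can update an encrypted balance without ever seeing it.  The parameters and amounts below are invented for illustration.

import random
from math import gcd

# Toy Paillier setup; the primes are far too small for real use.
p, q = 47, 59
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                           # modular inverse (Python 3.8+)

def enc(m):
    r = random.randrange(2, n)                 # fresh randomness each time
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (((pow(c, lam, n2) - 1) // n) * mu) % n

balance = enc(100)                       # stored encrypted on the server
deposit = enc(25)                        # sent encrypted by the client
new_balance = (balance * deposit) % n2   # homomorphic addition of plaintexts
assert dec(new_balance) == 125           # only the key holder ever sees this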
 
The history of homomorphic encryption is fascinating.  From Wikipedia: "The utility of fully homomorphic encryption has been long recognized. The problem of constructing such a scheme was first proposed within a year of the development of RSA.[4] A solution proved more elusive; for more than 30 years, it was unclear whether fully homomorphic encryption was even possible. During this period, the best result was the Boneh-Goh-Nissim cryptosystem which supports evaluation of an unlimited number of addition operations but at most one multiplication."

Gentry made a fantastic breakthrough in 2009: first implement almost-homomorphic encryption with some chance of error, then implement an error-correcting code and apply it to itself!
 
I think you could mine encrypted data -- in the following sense -- based on what I've read so far. Example: Government (gov) wants email accounts of pairs or groups of people who are using certain terms in emails and communicating with certain frequencies. Tech companies don't want to do this mining and don't want to hand over their servers and decryption keys.

So: blocks of email data are specially encrypted and handed to gov. Gov runs the eval algorithm, a tuned mining algorithm that looks for the patterns they want. This returns encrypted results that gov can't interpret. Results are given to tech company. Tech company decrypts results (only they know how) and tell gov they may be interested in email accounts A, B and C. Gov obtains search warrant for A, B and C. Tech company hands over data specified in warrant for specific accounts, under terms of the law/court order/ warrant, etc. (e.g. Data may or may not include full inboxes, may only include emails in certain date ranges).

In this way you could still get false positives, but you avoid total surveillance whereby the government directly reads my emails to my mom.

Analyzing everything like this will create as many false positives as unencrypted total surveillance. That is a major downside.
 
Cool, I like a constructive solution. Simply put, if you see the diagram as a directed graph, there are two paths that reach the vertex enc(f(m)) from the vertex m, the plain message. The upper one has the property that no one can peep at your information in the first place. But the hard step in that path, I think, is the eval process; I will look into it more deeply.
  
 
I should add that what I outlined about data mining also falls foul of the early stage this research is at. E.g. it might be hard to come up with a mining eval function that doesn't require a trillion gigabytes to analyze 100 email addresses. That seems to be one of the research challenges currently. It's not a question of throwing more hardware at the problem, it's a question of getting clever with developing eval functions. I'm not sure we know what's possible yet.
 
+Robert Byrne That is what I am worried about. The resources needed to implement the eval process are a question.
 
The problem is that you're looking at this as a replacement for our current data mining techniques. Instead, it's an addition to our current techniques. What that means is that you're not going to see the end of data mining clear text data, but the addition of this new technique for those of us who use encryption on a constant basis.

Privacy is not enhanced, but reduced even further.
 
Again, this is a failure to understand; these operations do not grant access to anything you don't already have access to.

If you don't have the private key for encrypted content, you can perform operations on it but you cannot discover what it is (barring possibilities of extracting a private key through social engineering/coercion and assuming the private key isn't stupidly simplistic and vulnerable to a simple dictionary brute force attack).
 
+Jorn Bettin - thanks for your thoughtful comment!  By the way, when +Jordan Peacock wrote a comment directly after yours beginning:

"On the contrary, we use heavy encryption on a daily basis in numerous aspects..."

he wasn't replying to you, but rather the previous comment, where +Jef Poskanzer wrote:

"We've known since the mid-1970s how to build really great cryptosystems. Aside from a few supergeeks, we're not using them."
 
Would a randomized factor in f(m) prevent this being useful? 
 
+Felix He yes I didn't see your message before I sent my second one. Eval is tricky as you say, and the choice of encryption too, perhaps.
 
+Jordan Peacock - Of course we're using them. I just wanted spectators to be able to understand who is talking to whom.  Big conversations can get a bit confusing.
 
Cryptography is a theory of secret functions. Category theory is an abstract theory of functions. +John Baez, do you know of people other than Pavlovic at Oxford working on categorical cryptography? Both in the real-life sense, and in the abstract-nonsense sense.
 
+Refurio Anachro You're reading the table wrong. The table lists operation times for key generation, encryption, and decryption. There is no comparison to unencrypted operations. Fully homomorphic encryption is currently quite impractical, although things are improving fast. The state of the art system requires 600kB of key material, and O(k^3.5) running time per elementary operation (AND, XOR, etc.), where k is the security parameter. For comparison, the running time of RSA is O(k^3) in the security parameter, slightly faster than O(k^3.5). However, note that the only reason RSA is practical is because the amount of RSA computation can be kept small using hybrid encryption (https://en.wikipedia.org/wiki/Hybrid_cryptosystem). Without hybrid encryption there is serious doubt as to whether RSA would be usable anywhere near the scale that we use it today. Fully homomorphic encryption does not have any analogue of hybrid encryption that can save computation time, which means that we are far from being able to use it routinely. Although I believe the situation will improve, it is premature to declare that fully homomorphic encryption will have any practical applications.

As to Jeremy Kun's larger point, I believe he is far too optimistic. Past experience in cryptography has demonstrated convincingly that even perfect technological solutions are unlikely to be adopted. We've known how to do encrypted email since 1977; how many people are using that today? As another example, since about 1990 we have had the cryptographic technology to conduct (in a practically achievable manner) secure electronic elections, in which each voter can prove that their vote was counted, each voter can prove that the published vote totals are a correct count of the votes, and no voter/coalition can prove to anybody else what his/their vote was. When I tell my non-crypto friends about this system, they do not believe me -- they think this combination of properties is so magical as to be impossible. Tellingly, the number of voting jurisdictions that have ever deployed cryptographically secure electronic voting is zero. The barriers to privacy and security are not technological. They are political and social in nature.
 
+David Jao - While a lot of the problems are political and social, I think it's important to get more people to understand what's technically possible, especially decision-makers and bureaucrats.  Luckily, they don't need to understand how things work: to most people, all modern technology might as well be magic. They just need to know what magic is possible.
 
+John Baez I admire this sentiment, but honestly, I do not think it is going to work. I can't think of a single example where it has ever worked.

The most successful encryption technologies, such as SSH and iMessage, achieved success not by educating the public, but by finding ways to operate transparently, without user interaction. Even if such a design diminishes security in theory, it's more often than not worth doing, because it makes the difference between adoption or no adoption. Conversely, protocols such as HTTPS or encrypted email, which require user interaction for their security, have struggled to achieve widespread deployment, and ended up brittle and fragile.

Security is a chain whose strength is determined by its weakest link. Right now, that weak link is user error, so we should be optimizing our research towards minimizing opportunities for user error. Unfortunately, theoretical computer scientists have largely ignored human factors, perhaps because they are hard to model rigorously.
 
Thank-you, +Michael Bacon.

I would add to my comment that Orwell is misdirection, and the spectre of an Orwellian state isn't taken seriously, which is why folk are so blasé about the risk of surveillance.

If people considered a Kafkaesque scenario, they might be more worried.  People have universally dealt with the problems of bureaucracy.  Those who understand Bayesian inference and the effect of false positives across a huge population, relative to the proportion who are terrorists, will particularly have cause for concern.
 
+David Jao - It makes sense that encryption technologies most easily become popular when their users barely notice they're in use. 

Suppose we forget about educating the 'public', and focus on the tech types and decision-makers and bureaucrats who might be in the position to say "okay, let's use homomorphic encryption here".  Surely this is more likely if more of them know that homomorphic encryption exists?

Anyway, my own reason for writing this article was that I think homomorphic encryption is a cool example of category theory.  Even saying this is bound to instantly turn off 99% of the 'public'.  But that leaves hundreds of mathematicians, scientists, and computer programmers - the kind of people who read my stuff.
 
 
 
+John Baez Your own reasons are great for your audience which is of course interested in what you have to say, and you're doing a great job with engagement.

To get everyone else to use the technology, a different type of education is needed. The issue is not whom but what. Even if we concentrate on decision-makers, it's not enough to say this technology exists or what it can do. We need to emphasize the importance of deploying the technology transparently. The naive way of deploying most encryption technologies is end-to-end, which provides the greatest theoretical security, but is awful for the user, because it requires end user interaction. Most crypto research assumes this model of deployment, which I think is unfortunate, because it ignores the real world. Any other choice of deployment model will involve some theoretical security tradeoff; basically, the system designer is taking over some of the decisions that the end user would otherwise make. The question then becomes: which tradeoffs are worth making? That latter judgment is very hard to make correctly, involving a lot of guesswork and intuition. Programmers, bureaucrats, and even researchers have a hard time striking the right balance. Basically, yes, I think educating the powers that be will work, but we have to educate them on the right topics. They need education in social psychology more than technology.

And again, with respect to fully homomorphic encryption specifically, it's a cool application of category theory, but needs improvement before it can be used in practice.
 
There are a few things here that are different from the usual definitions. The most important one is that encryption is always probabilistic (to thwart simple attacks like 'I've seen an encryption of 0 before & this is not the same ciphertext => this is not an encryption of 0'). This means that enc(m) is defined as a distribution, as is of course enc(f(m)). What you want to achieve is that the probabilistic functions commute (i.e., enc(f(m)) and eval(enc(m), f) have the same distribution - or, sometimes, that those distributions are not distinguishable by efficient algorithms).
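
As a toy sketch of this point, here is ElGamal encryption, which is multiplicatively homomorphic and probabilistic; I write the evaluation map as eval_f.  The group, keys and choice of f below are invented for illustration.

import random

# Toy ElGamal: encrypting the same message twice gives different ciphertexts,
# so the square only commutes "up to distribution" / after decryption.
p, g = 23, 5            # tiny prime and a primitive root mod p (illustrative only)
x = 6                   # private key
h = pow(g, x, p)        # public key

def enc(m):
    r = random.randrange(0, p - 1)      # fresh randomness, uniform over Z_(p-1)
    return (pow(g, r, p), (m * pow(h, r, p)) % p)

def dec(c):
    c1, c2 = c
    return (c2 * pow(c1, p - 1 - x, p)) % p

def f(m):                               # the plaintext-side program
    return (m * 3) % p

def eval_f(c):                          # the ciphertext-side version of f
    c1, c2 = c
    e1, e2 = enc(3)                     # multiply in a fresh encryption of 3
    return ((c1 * e1) % p, (c2 * e2) % p)

m = 7
left = eval_f(enc(m))                   # up the left arrow, then across the top
right = enc(f(m))                       # across the bottom, then up the right
print(left, right)                      # almost surely different ciphertexts...
assert dec(left) == dec(right) == f(m)  # ...but equal after decryption, and in
                                        # fact identically distributed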
 
Shouldn't eval be a program that takes a program f and outputs another program eval(f)? (explicitly showing eval depends on f)

So it's not

eval(enc(m)) = enc(f(m))

but

eval(f)(enc(m)) = enc(f(m)).
 
+Philip Thrift my understanding is that the dependency of eval on f is implicit, so it's eval_f really. 
 
Eval takes a ciphertext and a description of a function, and outputs another ciphertext that is assumed, or at least wished, to be computationally indistinguishable from the original ciphertext. If it were not, it would leak information about f, which is usually not desired.
 
+Philip Thrift - yes, I might call the arrow on top of the square eval(f) or eval_f, not just eval. 
 
+John Baez I was looking at types, so if it is eval_f (as a subscript) there would be one for every f. (Is that right?) Could one then define a function eval':

eval'(f) = eval_f?

(The diagram suggested to me that eval stayed the same even if f changed, which seemed odd.)
 
Eval has f as an argument. In existing schemes, f is actually replaced by a description of a circuit C that computes it; alternatively, C can just be seen as a succinct description of f's computation. Eval then proceeds gate by gate, evaluating small elementary functions one by one.
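
As a rough sketch of that shape of interface, here eval takes a list of "gates" describing f and processes them one by one.  Textbook RSA stands in for a real fully homomorphic scheme, so only multiplicative gates are allowed; the gate list and parameters are invented for illustration.

# eval takes a description of f as an explicit argument, so that
# eval(f)(enc(m)) = enc(f(m)).  Toy parameters, multiplicative gates only.

p, q = 61, 53
n, e = p * q, 17

def enc(m):
    return pow(m, e, n)

# The "circuit": a list of gates describing f(m) = 3 * m^2.
f_gates = [("square",), ("mul_const", 3)]

def apply_plain(gates, m):              # run f directly on the plaintext
    for gate in gates:
        if gate[0] == "square":
            m = (m * m) % n
        else:                           # ("mul_const", k)
            m = (m * gate[1]) % n
    return m

def eval_(gates):                       # eval(f): returns a map on ciphertexts
    def run(c):
        for gate in gates:              # proceed gate by gate
            if gate[0] == "square":
                c = (c * c) % n
            else:
                c = (c * enc(gate[1])) % n   # multiply in an encrypted constant
        return c
    return run

m = 7
assert eval_(f_gates)(enc(m)) == enc(apply_plain(f_gates, m))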

 
"Eval has f as an argument."

Now I'm happy! :)
 
Another way to read this is: given enc(), dec() and a set of f()'s, what is the set of eval()'s which make the diagram commute (or, exchanging the roles of the f()'s and eval()'s)?  We could ask this question of any crypto scheme.  You certainly don't want f() == dec() in your set, as that would mean eval(enc(m)) = m.  Also, it would be a problem if a given eval() commuted with all of the f()'s in the set F = { is_bit_i_set_in_m(.) }.  Is there any systematic study of existing crypto schemes in this respect?
 
All this is fine and dandy, but maybe it is worth pointing out whether any data mining from online activity is an efficient way to detect malevolent activities to begin with.
See what Bruce Schneier had to say about this 7 years ago, and this was assuming access to plaintext user data: http://www.wired.com/politics/security/commentary/securitymatters/2006/03/70357?currentPage=all
tl;dr : Due to the base rate effect and the high cost of false positives, data mining is simply not designed for this kind of target.