 
I've been dismayed at the recent flap about privacy of your contact information (address book), and the apps that upload it to their servers without asking. Particularly, I was annoyed at industry media coverage which perpetuates a false dichotomy of privacy vs social functionality.

I can understand that journalists won't have any clue what hashing is, but I was disturbed at how many developers hadn't even heard of it either. It was definitely covered in my CS degree at the University of Glasgow!

I decided to write a suitable-for-all, non-technical piece on what hashing is, and how it can be used to (somewhat) address privacy concerns whilst still providing social-media functionality in apps. It's snarky, of course!

http://mattgemmell.com/2012/02/11/hashing-for-privacy-in-social-apps/
 
Lovely article. I'm busily sharing it with colleagues who work closely with journalists. I know (and understand and agree with your motivation) that you don't allow comments on your blog, but I'm going to cheat and provide some feedback here since you didn't disable comments on this G+ post. Two contributions:

1.
You missed one possible complaint and rebuttal which might otherwise confuse a less technical person and which might be worth adding to the FAQ at the end for completeness (after rewriting into your eloquent style of course):

Apparent problem: Suppose I know that foo@googlemail.com and foo@gmail.com are actually the same email address, for all foo, which is indeed the case. Or that foo.bar@gmail.com is the same as foobar@gmail.com. Then if I have the actual email addresses I could use that knowledge and form useful connections, which wouldn't be possible if I only have the hashed versions.

Resolution: There are relatively few such rewriting rules and they are quickly and easily expressed in a compact form using some more magic, called regular expressions. Instead of uploading the original email addresses and doing the matching at the server, it's perfectly possible to download those rewrite rules along with the hashing software and do a tiny magic trick, called normalisation, at the client's device immediately before applying the hashing. This would ensure that the hash values come out the same regardless of which of the equivalent forms was used in the address book.
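As a minimal sketch of that idea, the snippet below normalises an address on the client and only then hashes it. The rule list, the function names, and the choice of SHA-256 are assumptions made for illustration; they are not taken from the article, and a real app would download its rules from the service.

```python
import hashlib
import re

# Hypothetical rewrite rules as (pattern, replacement) pairs, applied in order.
REWRITE_RULES = [
    (re.compile(r"@googlemail\.com$"), "@gmail.com"),
]


def normalise_email(address: str) -> str:
    """Reduce equivalent spellings of an address to one canonical form."""
    address = address.strip().lower()
    for pattern, replacement in REWRITE_RULES:
        address = pattern.sub(replacement, address)
    if "@" not in address:
        return address  # not an email address; leave it alone
    local, _, domain = address.partition("@")
    if domain == "gmail.com":
        # Dots in the local part are ignored for gmail.com addresses.
        local = local.replace(".", "")
    return f"{local}@{domain}"


def hash_email(address: str) -> str:
    """Hash the normalised address; only this value would leave the device."""
    return hashlib.sha256(normalise_email(address).encode("utf-8")).hexdigest()
```

With these rules, `hash_email("foo@googlemail.com")` and `hash_email("foo@gmail.com")` give the same value, so the server can still match them without ever seeing the address.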

2.
I think your privacy example with the asymmetric link between Bob and Jane is flawed. Suppose that B has J's email address in their address book, but J does not have B's email address. You argue that if B joins the service, then J joins later, no link needs to be (or perhaps even should be) formed from B to J. But in this asymmetric case, surely the order in which B & J join the service has no bearing on whether or not it's appropriate for the link to be formed. In your specific example, if J had joined first, I think you're allowing a link to be formed when B joins later, since J's hashed email address is known (as a client of the service) and B has the hashed version of J's email address in their uploaded address book.

Order-invariance will give an easier-to-understand principle: B is linked to J (or not) regardless of the order in which they join the service. That is, links are either uni-directional (Twitter & G+) or bi-directional (Facebook).
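One way to get order-invariance, sketched here under assumptions, is to derive links purely from the current set of members and their hashed contact lists rather than at the moment someone joins. The `Service` class and its method names are hypothetical, placeholder strings stand in for real hash values, and the sketch assumes the service retains each member's hashed contact list.

```python
class Service:
    """Toy model: links are computed from current state, never from join order."""

    def __init__(self):
        # Maps a member's own hashed address to their set of hashed contacts.
        self.contacts_by_member = {}

    def join(self, own_hash, hashed_contacts):
        self.contacts_by_member[own_hash] = set(hashed_contacts)

    def one_way_links(self):
        """Uni-directional links: (a, b) if a's address book contains member b."""
        members = self.contacts_by_member.keys()
        return {
            (a, b)
            for a, contacts in self.contacts_by_member.items()
            for b in contacts
            if b in members and b != a
        }

    def mutual_links(self):
        """Bi-directional links: each side has the other in their address book."""
        one_way = self.one_way_links()
        return {frozenset((a, b)) for a, b in one_way if (b, a) in one_way}
```

Whether B joins before J or after, `one_way_links()` gives the same answer, and `mutual_links()` stays empty until J's own address book also contains B.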

Another privacy example worth thinking about is: what happens if B & C don't know each other and both join the service, J does not join, and both B & C have J's email address? It might in theory make sense to be able to tell them they have a friend in common, without being able to tell them who that person is; which requires retaining the hashed contacts lists. However, J might find it disconcerting if B & C have their attention drawn to the fact that they both know someone, and are then able to determine that the mutual friend is J by comparing contacts lists.
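To make the scenario concrete, here is a small sketch of how a service that retains hashed contact lists could detect such a shared contact. The function name and the placeholder hash strings are hypothetical.

```python
def shared_unjoined_contacts(contacts_b, contacts_c, member_hashes):
    """Hashed contacts that both users uploaded but that belong to no member."""
    return (set(contacts_b) & set(contacts_c)) - set(member_hashes)


# Placeholder strings stand in for real hash values.
contacts_b = {"hash-of-J", "hash-of-D"}
contacts_c = {"hash-of-J", "hash-of-E"}
members = {"hash-of-B", "hash-of-C"}

shared = shared_unjoined_contacts(contacts_b, contacts_c, members)
print(len(shared))  # the service could choose to reveal only this count
```

The service never learns who J is, but B and C each hold the unhashed address, which is why comparing their contact lists would identify J.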
 
Hi Peter,

Thanks for the response - I'm glad you enjoyed the piece. It might well have been you who taught me about hashing in the first place. :)

In response to your points:

1. Indeed, the issue of normalisation is well worth adding to the FAQ. In fact, it was raised on Twitter in response to the article, though the example was phone numbers and the various formats they can take (assorted punctuations of area codes, presence or absence of international country codes, etc). The person in question seemed to think that this was an insurmountable objection to hashing, but I did point out that well-defined client-side normalisation gets rid of the issue. Another example of a relatively basic Comp Sci concept that's worryingly little-known by some developers, perhaps?

2. Good point; the order of joining does expose a privacy (and indeed semantic) flaw in using address books to find "friends". My position is that, as you say, the order of joining should have no bearing on the inherent validity of the connection, and that for information-exposing services (like Facebook, Path, iOS' "Find My Friends", etc), the only valid connection is an explicitly symmetrical/bi-directional one.

Your point about the mutual acquaintance is well taken - I'd classify it as another example of "good for the service, (potentially) bad for the users".

I'll update the post with your points - thanks!
 
You're welcome; I'll look forward to the update.

Incidentally, I suspect it might have been Rob Irving or Ron Poet that covered hashing, but am not certain. I'm fairly sure my courses assumed it was elsewhere in the curriculum.
 
Every time I see your post I want some corned beef and hash. Thanks Matt.