Thursday, May 28, 2009

You are where you are

One of the basic facts of anonymity is that you're only as anonymous as the people you could be mistaken for. If I don't know who you are, but I know that you're a Grammy-nominated Nigerian guitarist who has acted in a Robert Altman film, I have a pretty good idea who you are. If all I know is that you're Nigerian, I don't really have a clue.

This notion was formalized in 2002 by Latanya Sweeney (or someone going by that name, at least) under the name anonymity set. Along with a host of other interesting research, Sweeney also studied the uniqueness of various combinations of readily available data and concluded (I'm quoting here from the abstract) that
  • 87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}.
  • About half of the U.S. population (132 million of 248 million or 53%) are likely to be uniquely identified by only {place, gender, date of birth}, where place is basically the city, town, or municipality in which the person resides.
  • And even at the county level, {county, gender, date of birth} are likely to uniquely identify 18% of the U.S. population.
  • In general, few characteristics are needed to uniquely identify a person.
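To make the idea of "uniquely identified by a combination of fields" concrete, here is a minimal sketch of how one might measure it. The five records are invented for illustration and have nothing to do with Sweeney's data; `fraction_unique` is a hypothetical helper, not anything from her paper.

```python
from collections import Counter

# Toy records, invented for illustration: (ZIP, gender, date of birth, town)
people = [
    ("94301", "F", "1970-01-02", "Palo Alto"),
    ("94301", "F", "1970-01-02", "Palo Alto"),
    ("94301", "M", "1982-07-15", "Palo Alto"),
    ("94306", "F", "1982-07-15", "Palo Alto"),
    ("94306", "M", "1982-07-15", "Palo Alto"),
]

def fraction_unique(records, fields):
    """Share of records whose values at the given field positions match no other record."""
    keys = [tuple(r[i] for i in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

zip_gender_dob = fraction_unique(people, (0, 1, 2))   # {ZIP, gender, date of birth}
town_gender_dob = fraction_unique(people, (3, 1, 2))  # {town, gender, date of birth}
print(zip_gender_dob, town_gender_dob)
```

In this toy data, swapping the ZIP for the coarser town field makes fewer people unique, which is the same direction as the drop from 87% to 53% in the abstract.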

Recently, Philippe Golle and Kurt Partridge, in a paper that's been making the rounds, have built on that work to see what one can glean from knowing approximately where a person lives and works. The result, based on US Census data, was that given a pair of census blocks in which someone is known to live and work (for example because they've contributed cell phone location data to an anonymized database), there is on average one person who lives and works in that particular pair.

Golle and Partridge adopt Sweeney's definition of privacy (or the lack thereof) based on anonymity sets. In the extreme case that there's only one person in the set, that is, there's only one person with some particular set of characteristics, one should assume it won't be hard to discover just which one person that is. This seems prudent.

Under that notion of privacy, if someone knows the census blocks where you live and work, they know who you are.

Now, a census block is fairly small, being about a city block (hence the name) in a large city and comprising from zero to a few hundred people, but it's not tiny. Moreover, even if all that's known are the census tracts (more or less a zipcode/postcode) you live and work in, there are probably about twenty people in the same situation, more if the two are the same, fewer if you live in a different tract from your workplace. Only at the county level does the anonymity set get reasonably large (into the tens of thousands on average).
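The effect of coarsening a home/work pair from block to tract to county can be sketched the same way. Everything here is an assumption made for illustration: the six commuters are made up, each location is written as a hypothetical (county, tract, block) triple, and the median is just one convenient way to summarize the set sizes, not necessarily the statistic the paper uses.

```python
from collections import Counter
from statistics import median

# Hypothetical commuters: (home, work), each location as (county, tract, block).
commuters = [
    (("A", 1, 1), ("A", 2, 1)),
    (("A", 1, 2), ("A", 2, 1)),
    (("A", 1, 2), ("A", 2, 2)),
    (("A", 3, 1), ("B", 1, 1)),
    (("A", 3, 1), ("B", 1, 1)),
    (("A", 1, 1), ("B", 1, 2)),
]

def set_sizes(pairs, depth):
    """Anonymity set size for each person when both locations are cut to `depth` levels."""
    keys = [(home[:depth], work[:depth]) for home, work in pairs]
    counts = Counter(keys)
    return [counts[k] for k in keys]

by_block = set_sizes(commuters, 3)   # full (county, tract, block)
by_tract = set_sizes(commuters, 2)   # (county, tract) only
by_county = set_sizes(commuters, 1)  # county only

for name, sizes in [("block", by_block), ("tract", by_tract), ("county", by_county)]:
    print(name, sizes, "median:", median(sizes))
```

With only six people the numbers are tiny, but the pattern is the one described above: most people are alone in their block pair, and the sets only grow as the locations get coarser.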

So: Give out your county, but not your zipcode. If you live in rural Alaska your mileage will vary, but in that case it's probably a good idea that everyone know who you are and vice versa.

