Thursday, June 12, 2008

A cautionary tale from AOL

Anyone doing research on, say, the locations of people's cell phones would have to be aware of, and keen not to repeat, the great AOL search data debacle of 2006.

I have to admit I didn't follow this closely at the time. Seemed like the kind of thing that was bound to happen sooner or later, and might happen a little less often given that AOL, despite repeated and doubtless sincere apologies, lost business and was generally humiliated for its troubles. But as fate would have it, two stories I wanted to comment on intersected exactly there. One was the piece on cell phones, and the other I'll get to in a bit.

There is certainly value in gathering anonymized bulk data and studying overall patterns. Paul Boutin has an interesting informal analysis of the AOL data, for example. Unfortunately, there are limits to how anonymous that data can be.

Anonymity depends critically on everyone being able to plausibly say "How do you know it was me? It could have been any of these people." I call this the "I'm Spartacus" effect, and it in turn depends on not giving away specific, unique data.

It turns out that people's internet searches can be very specific indeed. Sure, lots of people search for popular products, or celebrities, or any of a number of other things, but we also search for friends or acquaintances, or local businesses, or organizations we belong to or what-have-you. In the case of the AOL data, the New York Times had no trouble tracking down a lady in Georgia, who was kind enough to be interviewed, and several other searchers have also been identified.
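To make the "it could have been any of these people" idea a bit more concrete, here is a rough sketch in Python. The log is a made-up toy example, not anything from the AOL data, and the function names are my own invention for illustration. It assumes, simplistically, that each query is looked at on its own; combining several queries would narrow things down further still.

```python
from collections import defaultdict

# Hypothetical toy log of (user_id, query) pairs; not taken from the AOL data.
LOG = [
    (1, "britney spears"),
    (2, "britney spears"),
    (3, "britney spears"),
    (1, "cheap flights"),
    (2, "cheap flights"),
    (3, "plumbers in smalltown ga"),
    (3, "jane q. example"),
]


def anonymity_sets(log):
    """Map each query to the set of users who issued it (the 'crowd')."""
    users_by_query = defaultdict(set)
    for user_id, query in log:
        users_by_query[query].add(user_id)
    return dict(users_by_query)


def smallest_crowd(log):
    """For each user, the size of the smallest crowd any single query leaves them in."""
    sets = anonymity_sets(log)
    smallest = {}
    for user_id, query in log:
        size = len(sets[query])
        smallest[user_id] = min(size, smallest.get(user_id, size))
    return smallest


if __name__ == "__main__":
    for user_id, size in sorted(smallest_crowd(LOG).items()):
        print(f"user {user_id}: smallest crowd is {size}")
```

On this toy log, users 1 and 2 each share every query with at least one other person, while user 3's two specific searches leave a crowd of one. A crowd of one is no crowd at all, and there is nobody left to say "I'm Spartacus."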

At least one searcher, User 927, became notorious even without being identified, owing to a particularly disturbing search history, and is now the inspiration for a play of the same name. This was the other news item that led me to revisit the AOL fiasco. I haven't seen the play and doubt I will, just as I doubt User 927 will be laying claim to any of the royalties.

Naturally, AOL tried to put the genie back in the bottle, and naturally it failed. The raw data is available on several sites -- you can search for them, of course -- and at least one site lets you search the searches online. I wonder if they log that.

[The domain name for the original link for the play seems to have turned over since this was written.  The link I gave now points at a banking site somewhere in Scandinavia.  I've updated to an Ars Technica article on the play -- D.H. Sep 2018]

