Tuesday, February 22, 2011

Your browser's fingerprints

This has been out for about a year now, but I just stumbled onto it:

In order to provide a better browsing experience, your browser is prepared to tell any site it visits a number of things about itself.  For example, it may divulge what fonts it knows about, how big your screen is, which browser it is, and so forth.  This is a useful thing to do, and you might not think it would give away much information -- after all, it shouldn't be giving away too much to say you can print Zapf Dingbats and a bunch of other fonts.

Thing is, it gives away quite a bit.  The EFF provides a site, panopticlick, to let you test your own browser setup.  You might not find the results it gives particularly comforting.

In the particular case of fonts, it's not the fonts themselves, so much as the order they appear in.  This turns out to be an artifact of how your particular computer happened to stash the files when you or whoever set up your system installed them, and that's fairly random.  The odds that Zapf Dingbats happens to appear before American Typewriter Condensed Light are closer to 50% than the 0% you might expect assuming the list is sorted alphabetically.  The ordering doesn't seem to be completely arbitrary, but enough so that only a small percentage of browsers out there will actually have the exact same list, taking order into account.

Plugins are even worse.  It's quite possible that only you have your exact combination of plugins, even after sorting (for whatever reason, browsers don't seem to report plugins in a consistent order over time, so the order doesn't provide a stable fingerprint).  I haven't tracked down exactly why this is so, but I believe it's because some plugins are installed on demand as you visit sites.  Which ones you have and haven't collected will depend on which sites you've visited so far, and the exact versions will be affected by when you visited.

Some things that make little or no difference:
  • Whether you have cookies enabled
  • Whether you're using a normal or anonymous window (e.g., Chrome's incognito feature)
  • In at least some cases, which actual browser you're using -- different browsers may still send the same signature under the covers.
For bonus points, several popular privacy-protection mechanisms can actually make your fingerprint more unique, as they leave traces in the fingerprint and relatively few people use them.  Among them: at least some means of disabling JavaScript (see EFF's paper for details).

Have I mentioned supercookies?

All in all, 80-90% of the browsers that connected to the EFF's site had a unique fingerprint.  Fewer than 1% had a fingerprint shared by more than one other browser (more technically, had an anonymity set with more than two members).  For whatever it's worth, the distribution of fingerprints displays a fine example of a "long tail".
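To make the "anonymity set" terminology concrete, here is a minimal sketch of how one might tally such numbers from a pile of observed fingerprints. The sample data is made up for illustration and the function names are my own; this is not the EFF's code or its data.

```python
from collections import Counter

def anonymity_set_sizes(fingerprints):
    """Map each distinct fingerprint to the number of browsers that share it."""
    return Counter(fingerprints)

def summarize(fingerprints):
    """Report what fraction of browsers are unique, and what fraction
    sit in an anonymity set with more than two members."""
    sizes = anonymity_set_sizes(fingerprints)
    total = len(fingerprints)
    unique = sum(1 for fp in fingerprints if sizes[fp] == 1)
    larger_than_two = sum(1 for fp in fingerprints if sizes[fp] > 2)
    return {
        "total": total,
        "unique_fraction": unique / total,
        "larger_than_two_fraction": larger_than_two / total,
    }

# Hypothetical sample: each string stands in for one browser's fingerprint.
sample = ["a", "b", "c", "d", "d", "e", "e", "e", "f", "g"]
summary = summarize(sample)
print(summary)
```

In this toy sample half the browsers are unique, which is far less lopsided than what the EFF reported.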

So, yikes.

What to do?  In order of increasing effort:
  • Do nothing.  Roll over, go back to sleep.
  • Do nothing, but bear in mind that browsing is almost certainly not an anonymous activity.  Not a bad assumption in any case.
  • Read the EFF's (and anyone else's) summary of the situation and understand the pitfalls better.
  • Use an anonymizer (but first, read up a bit on the topic -- there are several good links scattered through the posts here tagged "anonymity")
  • Hack your browser to give out less specific information.  Sort those font lists.  Say "version 3.1" instead of the full, more specific version string.
  • Hack your browser to tell randomly varying harmless lies about its setup.  Randomness is important.  Fingerprints will drift naturally over time, but it turns out to be easy to connect a later version (X, Y, Z etc. plus a new plugin) with an earlier one (X, Y and Z).
  • Get the browser developers to change their APIs (e.g., don't give out lists of fonts at all)
  • Get the standards committees to make the underlying protocols more anonymous -- and then get the implementers to implement the standards.
Once you've done all that, sleep soundly knowing that panopticlick was just a proof-of-concept and that people seriously trying to fingerprint browsers have means at their disposal well beyond those mentioned here.
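The "give out less specific information" option can be sketched roughly as follows.  This is only an illustration of the idea, assuming a browser's report is a simple dictionary; the field names, font lists and version strings are hypothetical and don't correspond to any real browser API.

```python
def coarsen_version(version, keep=2):
    """Keep only the first `keep` dot-separated components of a version string."""
    return ".".join(version.split(".")[:keep])

def coarsen_report(report):
    """Return a less specific copy of a (hypothetical) browser report:
    lists are sorted so install order no longer leaks, and the version
    is truncated to major.minor."""
    return {
        "fonts": sorted(report["fonts"]),
        "plugins": sorted(report["plugins"]),
        "version": coarsen_version(report["version"]),
    }

# Two made-up browsers that differ only in font order and patch level.
browser_a = {
    "fonts": ["Zapf Dingbats", "American Typewriter Condensed Light"],
    "plugins": ["PluginY", "PluginX"],
    "version": "3.1.4.159",
}
browser_b = {
    "fonts": ["American Typewriter Condensed Light", "Zapf Dingbats"],
    "plugins": ["PluginX", "PluginY"],
    "version": "3.1.7",
}

print(coarsen_report(browser_a) == coarsen_report(browser_b))
```

The two reports are distinguishable as given and identical once coarsened.  That only helps if enough other browsers do the same thing, and it does nothing about the less obvious signals a serious fingerprinter would use.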

The name panopticlick is a play on Jeremy Bentham's panopticon, a prison design meant to induce "a sentiment of an invisible omniscience" or, in Bentham's own words, "a new mode of obtaining power of mind over mind, in a quantity hitherto without example."

Lovely stuff, that.

Thursday, February 17, 2011

Where Wikipedia pages went to die

While looking for something else (of course) I ran across Deletionpedia, an archive of pages that have been deleted from Wikipedia.  The idea is simple: siphon off pages deleted from Wikipedia, with exceptions such as copyright violations, libel and intentionally offensive pages.

Why do this?  Wikipedia is reasonably wide-open, but it does have well-known standards for inclusion.  If it's not notable, or contains original research, or creative writing, or anything else that doesn't really belong in an encyclopedia, it's out, regardless of its other merits.  Deletionpedia was an effort to preserve such pages.

I say "was" because, even though the site is still up, it hasn't been updated since mid-2008 (or 2012, if you believe the rather odd timestamps on the Recent Changes page).  All in all, Deletionpedia collected about 63,000 pages in the space of a few months.  Why did it stop?  The last status update, from 2008, apologizes for recent downtime, promises it will return in improved form and that "Full service will resume ASAP."

Famous last words, indeed.  Another cool idea that most likely just didn't have sufficient resources behind it, particularly the time required to administer the site and maintain the Python script that was meant to automate the process of sifting out pages that not even Deletionpedia should provide a home for.

The origins of the whole exercise may lie in the "Inclusionist/Deletionist" theological debate in the Wikipedia community.  I wouldn't say that a site like Deletionpedia necessarily supports one side or the other.  On the one hand, it perpetuates pages that would otherwise disappear.  On the other hand, it lowers the consequences of deleting a page.

Neither should such a site have much effect on Wikipedia's "Right to Vanish" which, as far as I can make out, is more of a Right to Make it Somewhat Harder to Associate Your Edits With Your Identity.  Invoking this right does entail deleting one's User: page (but not one's User talk: page), but I'm not sure how the average user page would make it easier or more difficult to track down who made a particular set of somewhat-anonymized edits.  But I'm not a Wikipedia expert, so I may have missed something.

Naturally, there is a Wikipedia page on Deletionpedia, and naturally, it has been nominated for deletion at least once.

Sunday, February 6, 2011

You joined the social network ... now see the movie

To be clear right off the bat: This is about the movie The Social Network.  It's not about Facebook, the company, or Mark Zuckerberg, the CEO, or any other actual person, place or thing.  True, there's a person called Mark Zuckerberg in the movie, there is a university called Harvard and a substance called beer, and probably a bit more than the usual amount of care was taken to align those depictions with their real-world counterparts, but it's a movie.  Likewise, my comments here are about the movie.

I liked it.  It's not a bad movie.  But then, I liked Hackers when I finally saw it.

The techspeak is reasonably believable.  In particular, the rapid-fire voiceover as Zuckerberg puts together Facemash is taken directly from the real-life Zuckerberg's online diary (which, however, gets tarted up a bit for the camera).  Using wget to fetch pictures off a web site with an index page full of them is not exactly cutting-edge, but the Zuckerberg character acknowledges as much.  Hacking a perl script with emacs -- or vi, if you prefer -- is a bit more like it.  None of it's neurosurgery, but this is a quick hack.  Judging by the timestamps, he is hacking reasonably quickly, so the guy knows how to get under the hood and get his hands dirty.
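For the curious, the "index page full of pictures" step looks roughly like this.  It's a minimal Python sketch of the general technique, not the script from the movie or the diary (which used wget and perl); the page content and the house.example URL are invented, and nothing here touches the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageLinkParser(HTMLParser):
    """Collect the absolute URL of every <img> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                # Resolve relative paths against the page's own URL.
                self.image_urls.append(urljoin(self.base_url, value))

def image_urls(index_html, base_url):
    """Return the image URLs found in index_html, in page order."""
    parser = ImageLinkParser(base_url)
    parser.feed(index_html)
    return parser.image_urls

# A made-up index page standing in for one house's online facebook.
INDEX_HTML = """
<html><body>
<img src="photos/alice.jpg">
<img src="/photos/bob.jpg">
<a href="next.html">next</a>
</body></html>
"""

urls = image_urls(INDEX_HTML, "http://house.example/facebook/index.html")
for url in urls:
    print(url)
```

From there, fetching each URL is the boring part, which is rather the point.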

What's more interesting is not the coding but the engineering.  In the process of pulling together mugshots of as many Harvard students as he can, Zuckerberg runs across a house whose particular setup makes the task difficult.  What does he do?  Does he down four cans of Jolt Cola and miraculously come up with a superhuman hack to break in?  No.  He punts.

Absolutely the right call.

To make Facemash work, he just needs a bunch of pictures.  He doesn't need every single one, which is fortunate because many aren't online at all.  So why waste time trying to pick the high-hanging fruit when the low-hanging fruit will do?  That little bit of realism actually making it into a big-budget film, even if it goes by so fast you have to think back to realize it happened, and the lack of the usually obligatory thirty-seconds-of-typing-and-the-magic-ACCESS-GRANTED-popup-fills-the-screen scene (yeah, I'm talking about you, Iron Man 2), make a refreshing change, to say the least.

Now, when the site actually goes up and the kids start having fun with it, the resulting traffic apparently brings the Harvard intranet to its knees.  Seriously?  According to the script there were on the order of twenty thousand hits in two hours, if I remember right.  That's about three hits a second, probably more at peak, but not a lot more, and these are fairly small pages -- a couple of mugshot images and some HTML.  It's all going to Zuckerberg's server, and that's not falling over.  The network can't keep up with one Linux box in someone's dorm room?  Sounds like a bit of dramatic license to me.
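Here's the back-of-the-envelope version, taking the twenty-thousand-hits-in-two-hours figure at face value.  The page weight is purely my assumption, chosen to be generous for a couple of 2003-era mugshots plus HTML; plug in your own guess.

```python
HITS = 20_000                 # figure recalled from the movie
WINDOW_SECONDS = 2 * 60 * 60  # two hours

# Average request rate over the whole window.
average_rate = HITS / WINDOW_SECONDS

# ASSUMPTION: about 100 KB per page view (two images plus markup).
ASSUMED_PAGE_BYTES = 100_000

# Average bandwidth that rate would need, in bits per second.
average_bits_per_second = average_rate * ASSUMED_PAGE_BYTES * 8

print(f"{average_rate:.2f} hits/second on average")
print(f"{average_bits_per_second / 1e6:.1f} Mbit/s on average")
```

Even allowing for peaks several times the average, that is a small load for a campus network.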

Similarly, what does the security chief care if some student was snarfing images off the other dorms' servers?  That's not a security threat, it's an annoyance for whoever's administering those servers and, as I understand it, a breach of the undergraduate code of conduct.  Judging by the complete mishmash of setups, the sysadmins are probably students themselves, not the university's IT department.  The security guy's job is to keep outside people from causing mischief, and probably to keep everyone from messing with the more sensitive administrative data, particularly grades.  But I digress somewhat.

Actually, one more bit of geekery: Zuckerberg sits, preoccupied, in an OS class while the professor talks about memory management, page tables and such.  Zuckerberg walks out.  The professor taunts him for giving up, at which point Zuckerberg rattles off the answer the professor was looking for.  Except the correct answer was "Sixteen bit virtual address space?  Do what now?  All you've got is 64K and you're going to swap some of it to disk?  It's 2003.  My phone can eat 64K for a light snack."

But starting a site with hundreds of millions of members isn't about coding, nor is it primarily about software engineering in the larger sense.  It's about pulling together the right ideas and getting the word out.  As Zuckerberg points out later, it's also important to have reliable servers, and (as the movie character doesn't mention but the real-life CEO probably would) things like an extensible platform for third parties, but none of that matters if you don't have something of interest running on them in the first place.

Which is why, at least in the movie version, it seems to me that the Winklevoss twins and their partner were more than fairly compensated for their trouble.  Did Zuckerberg deal badly with them by neglecting to mention that he wasn't really working on their site but was in fact working on his own take on a similar idea?  Of course.  Does that mean they invented Facebook and he stole it from them?  Not so much.

It's quite clear that the (movie) Winklevosses would have done the site differently.  For starters, "exclusivity" is not a great way to get to a hundred million members.  Nor did they seem to like the look of the site, though that might have been sour grapes.  For all the talk about the first mover advantage -- and one of these days I'd like to have a look at whether such a thing really exists -- MySpace was already around and known to all involved.  For that matter the Harvard house pages that Zuckerberg cribbed called themselves "facebooks".

If Zuckerberg had been stealing anything, it wasn't the idea of a social networking site, but the Winklevosses' vision of it.  But that wasn't the vision that Zuckerberg implemented.  Facebook (in the movies or real life) isn't just a social network.  It's a collection of features, like relationship status, the wall, the ability to tag photos, a privacy policy, and so forth.  The bulk of these features were implemented well after the split.

The Winklevosses had a concept of a social networking site and they wanted to hire the job done.  They hired the wrong guy and it cost them the few weeks it took them to realize their mistake.  $60 million seems ample compensation for that, even taking into account that their hired hand at least passively misled them.  It's not like Zuckerberg was the only techie in the Cambridge area in 2003 who could have put up a server, or like the twins would have had to scrounge for funding to hire someone new.

Again, just going by the movie account.  The real-life version has been hashed out in court.

That's probably deeper into that tar pit than I should have gone.  What's more interesting here is a larger question: which counts for more, the original idea or the implementation?  There are certainly egregious cases of unscrupulous operators outright stealing an idea and passing it off as their own, but the scenario put forth in the movie isn't such a case.  All other things equal, it's the implementation that counts, just as you can't copyright or patent an idea, only its expression.

At the end of the day, it's not the people with original ideas that tend to go on to business success.  I could rattle off a long list of computing pioneers who didn't become gazillionaires in startups, either because they didn't found startups or the startups didn't succeed.  It's the people who make those ideas into something that people actually use.  Of the mix that goes into that -- design, coding, financing, marketing, knowing the right people (social networking, that is), relentlessness, sheer dumb luck and whatever I left out -- the technical ingredient is arguably one of the most replaceable.

Which brings me back to the mugshot hacking.  The whole hack was nothing but pulling together existing pieces -- the pictures, Apache, wget, perl and yes, emacs -- to synthesize something new that people wanted.  Nicely foreshadowed.