There have been headlines lately about new FCC regulations allowing internet service providers to sell information about what sites you visit. From the summary I read in The Verge, which looks well put together and overall plausible, the situation is a bit more complicated than that, but certainly ISPs have access to quite a bit of information about what sites a particular IP address under their management connects to, and they have to have access to that information in order to provide good service.
I'm not going to offer an opinion here on whether this is good, bad, indifferent or some combination. Instead I wanted to take a look at privacy in general.
If you live in a house with separate rooms with doors that close and may even lock, it's easy to think of having a room of one's own as the natural state of things, but that's not universally the case. There are plenty of examples of people sharing space, whether in a one-room house or a portable structure such as a tent, yurt or tipi. Or think of an un-air-conditioned apartment block in summer. If everyone's window is open onto the same courtyard, privacy is going to be a bit limited. Enhanced privacy isn't the most obvious benefit of air conditioning, but it would certainly appear to be one.
Even if doors and windows can close, living in a small community, particularly one that has to be fairly self-sufficient, means getting to know more than one might care to about one's neighbors, and having them know details about one's own life. Arguably this is actually the normal state of things. Urbanization is a fairly recent phenomenon in human history.
Again, not saying any of this is good or bad, just that privacy is not necessarily something that we once had, but lost once technology came along.
For that matter, and back at the title, in the earlier days of telephony, many customers had a party line arrangement, meaning that a number of households shared the same physical phone line. This meant that if someone else was making a call and you picked up your phone, you would hear them talking, at which point you might hang up and try again later, or perhaps ask them if they would be done soon ... or just listen in for a while.
Even placing a call meant, at least in some cases, calling an operator and telling them whom you wanted to call, so they could patch the call through -- literally using a patch cord. That process was eventually automated, but the phone company still needed to keep records, at least of long-distance calls, in order to bill for them. Those records could be subpoenaed in the course of criminal investigations and in any case were available to at least some company employees.
People seemed largely OK with all this, perhaps because the convenience of the telephone outweighed the lack of privacy, perhaps because people figured out ways of minimizing the intrusion (some interesting game theory/economics there), and probably for other reasons.
We're also social animals. To some extent we want to share things about ourselves and have others share with us. It's not clear to me whether social media have amplified this kind of behavior so much as reflected it.
What seems different about modern technological privacy is that the people with access to one's private information are strangers with their own incentives and plans. In a small, tight-knit community information flows both ways. "Everybody knows everything about everybody." With a 20th-century phone company or a 21st-century ISP this isn't the case, and generally the entity in question is in business to make money.
One can argue that such businesses have a strong incentive to respect their customers' privacy on the grounds that failing to respect it would be bad for business, but that doesn't always seem particularly comforting. On the other hand, the basic issues are clearly older than the internet, so at least we've had some time to work them out. I could have added 19th-century telegraph companies or maybe even 18th-century messenger services to the paragraph above.
I think the problem decreases as you go back in time, since communicating via commercial services run by strangers becomes less pervasive, but the telephone was a pretty integral part of 20th-century life, particularly in the second half. It's not clear to me how much more integral the net is. I'm sure it is to some extent, but not how much.
I honestly don't know what to conclude from all that, but I did at least want to offer the perspective that, as in other cases, the internet doesn't necessarily change everything. Some things, almost certainly, but the real fun lies in figuring out exactly what.
Showing posts with label privacy. Show all posts
Friday, August 25, 2017
Thursday, January 12, 2017
Identity redux
Today I spent an embarrassing amount of time trying to figure out why I couldn't use SSH with my new GitHub account, before figuring out that I needed to log in as git@github.com and not myaccount@github.com. Evidently I'm not the first person to stub a toe on this, but it got me thinking about one of the earliest topics on this blog: identity.
A natural way to think about identities and logging in is that your username is your identity and there are various ways of authenticating that identity, for example
- a password
- a password and a second factor such as a magic number sent via SMS or generated by a smartcard
- a public/private key pair
- (in some SSH contexts) hostname or IP address and public/private key pair
Others are possible, of course. What the GitHub experience made clear to me is that the "username" part is secondary, at least as far as SSH is concerned. The important part of authenticating SSH is the key.
As far as I can tell, GitHub is taking the public key offered during the SSH handshake and looking it up to get the account, and thus the account name. That's probably also why when you try to upload a key you've already uploaded (e.g., to check that you haven't taken complete leave of your senses while trying to figure out why you can't log in), the error message is "already in use". It doesn't say by whom, even when it's you. The rule is one account per key (but potentially multiple keys per account).
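The key-to-account lookup I'm guessing at above can be sketched in a few lines of Python. This is my reconstruction of the model, not GitHub's actual implementation; the class and method names are invented for illustration. The essential constraints are the ones just described: one account per key, potentially many keys per account, and a deliberately uninformative "already in use" error.

```python
class KeyRegistry:
    """Sketch of a server-side registry mapping public keys to accounts."""

    def __init__(self):
        self._account_by_key = {}   # public key fingerprint -> account name
        self._keys_by_account = {}  # account name -> set of fingerprints

    def add_key(self, account, fingerprint):
        """Register a key for an account; refuse duplicates."""
        if fingerprint in self._account_by_key:
            # Note: the error deliberately doesn't say whose key it is.
            raise ValueError("key already in use")
        self._account_by_key[fingerprint] = account
        self._keys_by_account.setdefault(account, set()).add(fingerprint)

    def authenticate(self, fingerprint):
        """The handshake offers only the key; the account is derived from it."""
        return self._account_by_key.get(fingerprint)


registry = KeyRegistry()
registry.add_key("alice", "SHA256:abc123")
print(registry.authenticate("SHA256:abc123"))  # prints: alice
```

Note that `authenticate` never takes a username at all, which is exactly the behavior that tripped me up: as far as the server is concerned, the key *is* the login.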
This suggests a different approach to identity. As far as the web is concerned a key, and in general an authentication method, is an identity. This is more or less the case with Bitcoin wallets, and to some extent for PGP and other email privacy schemes, but even then for the most part we talk about using keys to establish an identity.
Let's run through some data modeling to see how this all fits:
- People, identities and resources to be accessed (such as accounts) are three different things.
- A person can have multiple identities
- Multiple people can use the same identity, though that's often not a good thing
- A resource can be accessed by multiple identities
- In general, though not in the case of GitHub accounts, an identity can access multiple resources
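The relationships in that list can be made concrete with a couple of small data classes. This is just a sketch of the model as I've described it; the type names (`Person`, `Identity`) and the example resources are mine, not drawn from any particular system.

```python
from dataclasses import dataclass, field

@dataclass
class Identity:
    key: str                                   # e.g., a public key fingerprint
    resources: set = field(default_factory=set)  # what this identity can access

@dataclass
class Person:
    name: str
    identities: list = field(default_factory=list)

# A person can have multiple identities...
work = Identity("SHA256:work-key", resources={"github:workrepo"})
home = Identity("SHA256:home-key",
                resources={"github:sideproject", "server:shell"})
alice = Person("Alice", identities=[work, home])

# ...and, in general, an identity can access multiple resources.
assert len(alice.identities) == 2
assert len(home.resources) == 2
```

The GitHub restriction from above would just be an extra invariant here: each `Identity.key` maps to exactly one account-resource.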
There are two reasons I find this key-is-identity model attractive. One is that your web server doesn't see you, it sees the credentials you present. It really only sees the key, or at least only ought to look at it when verifying identity. Yes, it may also know things like which IP address someone is connecting from, but even though that information can sometimes be a useful hint that something's not right, it's ephemeral, not part of the identity.
The other, maybe just the first with a different emphasis, is that it loosens the connection between resources and people. It might be nice to think that Gavin Belson logged in to your server with username gavin@hooli.com and the proper password, but it's better to think that someone logged in with those credentials. You know that gavin@hooli.com logged in. You don't know that that was Gavin Belson (I'm looking at you, Gilfoyle). The identity that matters here is gavin@hooli.com, not Gavin Belson.
Except that gavin@hooli.com is associated with a password, which can change without changing what we mean by the username (or, one would hope, it's associated with a password and a second factor, such as a phone or smartcard). Are we really going to say that if Gavin changes his password, we're dealing with a different identity?
Let's try "yes". The whole point of changing your password is that anyone who knew your old password won't know your new one. We presume that, at least at first, only Gavin knows the new password. From the point of view of the system Gavin is logging in to, (gavin@hooli.com, old password) is indeed a different identity from (gavin@hooli.com, new password) because there are potentially different sets of people who could be each.
What if Gavin uses his phone as a second factor? There are a number of ways to do that, so suppose that the server sends him a text with a magic number when he tries to log in and expects to get that number back as part of the login process. That provides a reasonable assurance that whoever's logging in has both Gavin's password and his phone (assuming the text isn't intercepted). If Gavin does have his phone, it also informs him that someone, hopefully it's actually him, is trying to log in.
Suppose Gavin switches phone numbers but keeps his password the same. Should we consider that a new identity? I think the same logic still applies. If Gavin's password has already been compromised and he changes his phone number, then someone might manage to grab the old phone number, and so forth. In any case, the set of people with access to the old phone number is potentially different from the set of people with access to the new one, so different identity.
If you're carefully tracking who did what to a resource, you need to track the authentication. A different means of identification, even for the same user name, means potentially different people.
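One way to picture that tracking: the audit record captures the full credential set in force at the time, not just the name. The sketch below is a toy, with invented credential labels standing in for a password hash and a phone number; the point is only that two entries can share a username while being different identities.

```python
audit_log = []

def record_action(username, credential_fingerprints, action):
    # The identity, for auditing purposes, is the (name, credentials) pair.
    identity = (username, frozenset(credential_fingerprints))
    audit_log.append((identity, action))

record_action("gavin@hooli.com", {"pw-hash-v1", "phone-555-0100"},
              "deleted repo")
# Gavin changes his password: same username, different identity.
record_action("gavin@hooli.com", {"pw-hash-v2", "phone-555-0100"},
              "restored repo")

first, second = audit_log[0][0], audit_log[1][0]
assert first != second        # different identities...
assert first[0] == second[0]  # ...same username
```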
One logical conclusion of this is that username is not identity. So what is it?
It's a name. Seems plausible, at least.
Names are yet another concept, distinct from person, identity or resource (personae are yet again distinct, but this is getting complicated enough as it is). For example, gavin@hooli.com sure looks like an identity, but when you send email, you're really just sending it to an address which is connected to an inbox (which is a resource).
There may be more than one address connected to a given inbox. Likewise, the name I use to log in to access that inbox may or may not be an email address (for example, my ISP provides me with an email address I never use, but if I want to see mail for that address I log in with a username that's different from the email address). Likewise for however you logged in to send me mail. If we're using secure mail then, regardless of everything just mentioned, you'll encrypt the mail to your recipient's public key and sign it using your private key. The keys are the real identity, because we trust them.
I'm comfortable with this, too. I've come to think that one of the most fertile sources of bugs is confusing names with identities (see this post for a bit more on names). Names are convenient, but ideally they're only used to look up what you're really interested in. I personally prefer systems in which renaming things is cheap, if only because I generally come to hate the names I come up with to start with, no matter how much careful thought I gave them.
The way you generally do this is to assign a unique id -- typically a hundred bits or so of random gibberish -- to each resource that can be named, and then maintain a map from name to id. When you access a resource, say an account, by name, you look up the name in that map, stash the id somewhere and use the id to access the object. If the name → id map changes, you still have the id and you can still find the same object. The system can maintain as many names for a given object as it sees fit, but each name corresponds to only one object (at least at any given time).
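Here's a minimal sketch of that name → id → object pattern. The function names and the example data are invented; `secrets.token_hex(16)` gives 128 random bits, which is "a hundred bits or so of random gibberish" in concrete form.

```python
import secrets

objects = {}        # id -> object
name_to_id = {}     # name -> id

def create(name, obj):
    """Assign a fresh random id, store the object, and map the name to it."""
    obj_id = secrets.token_hex(16)   # 128 random bits
    objects[obj_id] = obj
    name_to_id[name] = obj_id
    return obj_id

def rename(old_name, new_name):
    """Renaming is cheap: only the map changes; the id and object don't."""
    name_to_id[new_name] = name_to_id.pop(old_name)

account_id = create("dave-draft", {"balance": 42})
rename("dave-draft", "dave")

# The stashed id still finds the same object after the rename.
assert objects[account_id]["balance"] == 42
assert name_to_id["dave"] == account_id
```

Supporting multiple names per object is just a matter of adding more entries to `name_to_id` that point at the same id.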
Summing up
- Web servers don't see people directly. They see the credentials that people (and other servers, for that matter) present.
- The credentials are distinct from who (or what) uses them
- The names we use to refer to resources are distinct from the credentials that can be used to access them.
I think public key systems line up well with the key-is-identity model because the public key is the single identifying item. In a password scheme, whether you consider (username, password) or just username to be the identity, you are giving two pieces of information, both of which are durable, but which must generally be kept separate because one of them is meant to be secret. The password isn't truly private. The host you're logging into has to know it [technically, it only has to store a secure hash of the password together with a bit of randomly-generated "salt", but from a security point of view that's only a bit better since it still has to see the password during the login in order to do the hash and comparison, and password files can be stolen and attacked brute-force offline --D.H. Feb 2017].
In a public key scheme, there is still a public part and a private part, but you present only the public part. The private key remains truly private. It's generally stored encrypted, guarded by a passphrase that you only use locally. If you change the passphrase, no one else needs to know. Authenticating means exchanging ephemeral information, much of it randomly generated, that will never be used again. All of this makes it much easier to keep the private key secure for long periods of time, so the public key can serve as a durable identity. Since it's the only durable thing that other parties see, it's sufficient to serve as an identity by itself.
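The shape of that exchange can be illustrated with textbook RSA and deliberately tiny numbers. To be clear about what this is: a toy, using parameters far too small to be secure and none of the padding real systems require, meant only to show that signing needs the private exponent while verification needs only the public part. Real deployments use vetted libraries, never hand-rolled arithmetic like this.

```python
import hashlib

p, q = 61, 53
n = p * q                           # public modulus
e = 17                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (kept secret)

def sign(message: bytes) -> int:
    """Sign a message digest; requires the private key d."""
    digest = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(digest, d, n)

def verify(message: bytes, signature: int) -> bool:
    """Verify a signature; needs only the public pair (n, e)."""
    digest = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(signature, e, n) == digest

challenge = b"random-nonce-1234"    # ephemeral, never reused
sig = sign(challenge)
assert verify(challenge, sig)
```

The server stores `(n, e)` durably and sends a fresh `challenge` each time; everything it sees besides the public key is throwaway.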
There's a long way to go yet, but it seems likely that the world will gradually shift to key-as-identity, or something at least as strong.
Tuesday, October 25, 2016
Names and namespaces
I'm not the only David Hull in the world. I may not even be the only one with my exact name. I've met a couple of other David Hulls. There was also a prominent philosopher by the same name, not to mention musicians, athletes and others. Mine is not a particularly common name, but, at least if we look at first and last names only, it's definitely not unique. It would be interesting to know what portion of names in the world are unique. There are plenty of Pat Smiths or Zhang Weis, for example, but (as far as I know) only one Dweezil Zappa (born Ian Donald Calvin Euclid Zappa).
In daily life it's generally not a huge problem for two people to have the same name. If there are two David Hulls working in the same office, one might be "Dave" and the other "David", or one might be "Dave in sales", or one might go by "Walrus" for whatever reason. If we want to be more precise, we can always add more identifiers, such as middle name, birth date, place of birth and so forth. It's very unlikely that (to make up an example), there was more than one Patricia Terpsichore Smith born on February 29th, 1940 in Saint Paul, Minnesota.
If you want to be really sure, you assign everyone a unique identifier, such as the Social Security number in the US (SSN for short). In theory there should never be more than one person with a given social security number, regardless of their name, age or place of birth. In practice, that doesn't really hold up. There are actually millions of people with multiple SSNs and/or SSNs assigned to multiple people. Leaving that aside, though, it's worth taking a closer look at how SSNs and identifiers like them are built. I'll use US examples here since that's what I'm familiar with, but many other countries have similar schemes.
A Social Security number is split into three parts. Up until 2011, these followed a specific pattern. For example, LifeLock CEO Todd Davis's is 457-55-5462. The 457 part is in the range 449-467, which is assigned to Texas. The 55 means, in this case, that the card was issued in 1982. Exactly which year the middle digits map to depends on the first three digits, presumably because not all three-digit prefixes are used every year. The last four digits are issued in numerical order, so putting it all together, Davis would have been the 5462nd person to be issued a SSN starting with 457-55, and the card was issued in Texas in 1982.
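Decoding the parts of a pre-2011 number is mechanical. A sketch, using the Texas example above; the area-number table here is truncated to the single range discussed, and the function name is mine.

```python
# Illustrative subset of the pre-2011 area-number table: 449-467 -> Texas.
AREA_TO_STATE = {range(449, 468): "Texas"}

def split_ssn(ssn: str) -> dict:
    """Split a pre-2011 SSN into area, group, and serial parts."""
    area, group, serial = ssn.split("-")
    state = next((s for r, s in AREA_TO_STATE.items() if int(area) in r),
                 "unknown")
    return {"area": area, "group": group, "serial": serial, "state": state}

parts = split_ssn("457-55-5462")
print(parts["state"])   # prints: Texas
print(parts["serial"])  # prints: 5462
```

The group-to-year mapping is left out because, as noted, it can't be computed without a per-area table.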
This sort of scheme is not uncommon. US phone numbers are built in a similar fashion. The full history would be material for a separate post, but during the "long distance" era before the advent of cell phones, a US phone number consisted of a three-digit area code, having a middle digit of 0 or 1, followed by a three-digit exchange associated with a piece of equipment in a particular location, followed by a four-digit number. For example, the White House phone number is (202) 456-1111. The area code for Washington, D.C. is 202, 456 is one of the exchanges there and 1111 is easy to remember, easy to dial on old-fashioned rotary phones and hey, the White House can pick whatever number it likes.
Likewise, US Zone Improvement Plan codes (zip codes to most of us) use the first digit to denote a particular region of the country, the second two to denote a particular area within that region (and typically a particular sorting and distribution center), and the last two a smaller area within that region. Here's a nice illustration I mentioned in a previous post. The later ZIP+4 scheme takes that down to individual blocks, apartment buildings, large businesses and so forth.
The parts in schemes like this often nest. Area codes comprise exchanges which comprise individual phones. Zip code regions comprise distribution centers which comprise smaller areas. Social security numbers are a bit different, in that the first five digits together denote a region and year associated with a block of individual numbers, but you can't really say that there are years within regions and vice versa.
People's names are also a bit borderline. You could think of families comprising individuals, but the family name doesn't really correspond to biological families for a variety of reasons. The approximately 22% of Koreans named Kim (김) are not a biological family (though there are some 384 clans with the Kim name). It's not much more meaningful to talk of all the Hulls than it is to talk of all the Davids.
All of these schemes have special cases, for example:
- SSNs never start with 000, while 700-728 were originally reserved for railroad employees
- SSNs never start with 666, and this number does not even appear on the Social Security Administration's historical list.
- Area code 800 (along with several others now) is reserved for toll-free services
- Exchange 555 includes special numbers such as 555-1212 (directory assistance) and has blocks that are guaranteed not to be used for real phone numbers, which is why a US phone number you see in a movie almost always starts with 555.
- Zip codes starting with 569 are reserved for the USPS Parcel Return Service.
- In the US legal system, John Doe, Richard Roe and other names are used in various contexts for persons whose actual names are unknown or withheld.
It's easy to think of more of these identifiers made of parts, usually with special cases. Credit card numbers. Place names like Paris, Texas, USA as distinct from Paris, Kentucky, USA or Paris, Kiribati or Paris, France. Three cases are of particular interest on the web:
- domain names, which consist of parts nested from right to left (e.g., fieldnotesontheweb.blogspot.com)
- IP addresses, which (in version 4) consist of four parts (sort of) nested left to right (e.g., 216.58.194.78)
- URLs, which consist of a protocol (e.g., http), an authority (e.g., fieldnotesontheweb.blogspot.com), a path (e.g., /blogger.g), a query (e.g., ?blogID=21299...) and a fragment (e.g., #overview/src=dashboard), again nested (basically) from left to right
Again there are special cases, such as example.com, IP addresses starting with, e.g., 192.168. and the about: pseudo-protocol Chrome uses.
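Python's standard library will split a URL into exactly the parts listed above. The example below reuses the blog's hostname and path from the list; the query and fragment values are shortened stand-ins for the elided ones.

```python
from urllib.parse import urlsplit

url = ("http://fieldnotesontheweb.blogspot.com/blogger.g"
       "?blogID=21299#overview")
parts = urlsplit(url)

print(parts.scheme)    # protocol:  http
print(parts.netloc)    # authority: fieldnotesontheweb.blogspot.com
print(parts.path)      # path:      /blogger.g
print(parts.query)     # query:     blogID=21299
print(parts.fragment)  # fragment:  overview
```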
All of the naming schemes I've mentioned so far make some use of the idea of a namespace, that is, a context in which names are meant to be unique. Within a given family, siblings typically share a family name but have distinct first names. I say "typically" because there are plenty of exceptions in real life, ranging from ordinary blended families to George Foreman's five sons named George (who nonetheless appear to go by their own nicknames in daily life).
You could think of an area code as a sort of family name with exchanges as given names or, one level down, think of the exchange as a family composed of individual numbers. Some of these analogies make more sense than others. Sure, every SSN starting with 457 was assigned in Texas (if it was assigned before 2011), but there's no good way to get from the middle digits to a year without knowing the first three digits. Real life is a bit messy.
Even so, schemes like this are a decent fit for the way we think, which should not come as a great surprise. But this has its drawbacks. Maybe you don't really want someone to know in what state and year you got your social security card. Maybe you'd like to give out your phone number without giving away a reasonably good idea of where you live.
Besides the privacy implications, there are practical concerns. In theory there are a billion possible SSNs, enough to keep up with the US population for a while yet. In practice, not all numbers can be used. If only 500 numbers starting with XXX-YY have been assigned by the end of the year, the rest of that block will go unassigned, and I'm sure there are other inefficiencies. This is not unique to SSNs. Any numerical scheme that allocates blocks of numbers will tend to leave some blocks unfilled.
For these and other reasons, many kinds of ID numbers are assigned in a single "flat" namespace, as SSNs are now. One way to do this is with a serial number that's incremented with each new ID, but (again at least partly for privacy and security reasons), that's often not the case. For example, Blogger gives this blog post an ID of 8084382145281586649. The blog itself has an entirely different ID. The two have nothing (obvious) to do with each other. I certainly haven't written 8 quintillion posts for this blog, nor are there anywhere near that many posts in all of Blogger. The previous post on this blog (from, um, just a little while ago) has ID 236347809273236220.
This way of using longish, apparently random strings of digits has a few often-useful properties:
- Because the numbers are big enough, there is generally very little chance that the same ID number will be given out twice. And by "very little" I mean "not liable to happen in our lifetimes" and sometimes much longer, not "eh ... this'll happen from time to time but don't sweat it". As a rule of thumb, if there are N digits in an ID, the number of things you'd need to get a collision is an N/2 digit number. If blog post IDs are 18 digits or so long, you'd need billions of posts before there was a significant chance of a collision, even if they're not explicitly checking whether a supposedly new ID has already been used. Generally, "universal unique ID" (UUID) schemes use a lot more than 18 digits, making the chances of collision ridiculously small.
- Almost all UUID schemes use some sort of secure hash. This means that, generally speaking, changing even one bit of the input will change about half of the bits of the ID. This and other properties make it, as far as anyone currently knows, infeasible to learn anything about the thing being assigned the ID from the thing itself. For example, the IDs of the two posts give no clue that they identify adjacent posts in the same blog, much less what's in them. The URLs given to the posts, in contrast, make an effort to provide at least some useful information (e.g., http://blog.fieldnotesontheweb.com/2016/05/a-couple-of-updates-on-satoshi.html). But that's fine. As long as you have a unique ID you know exactly which item you're dealing with and you can give it any kind of friendly name you like.
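The N/2-digits rule of thumb is the birthday bound, and it's easy to check numerically. The standard approximation is P(collision) ≈ 1 − exp(−n²/2N) for n random draws from a space of N possible IDs; the function below is just that formula.

```python
import math

def collision_probability(n_items, id_space):
    """Birthday-bound approximation: P(collision) ~ 1 - exp(-n^2 / 2N)."""
    return 1 - math.exp(-n_items**2 / (2 * id_space))

# A million 18-digit IDs: essentially no risk.
print(collision_probability(10**6, 10**18))   # roughly 5e-7
# A billion (a 9-digit count, i.e. N/2 digits): collisions become likely.
print(collision_probability(10**9, 10**18))   # roughly 0.39
# A 122-bit random UUID space is far roomier.
print(collision_probability(10**9, 2**122))   # 0.0 (below float precision)
```

So "billions of posts" is right where an 18-digit space starts to get risky, which is exactly why UUID schemes use far more digits.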
You can still have namespaces of a sort with a hashing scheme. If I form my IDs by hashing the string "fieldnotesontheweb" followed by the title of the post and you use "myawesomeblog" instead of "fieldnotesontheweb", there is pretty much no chance we'll ever use the same ID, even if we happen to pick the same post title. This gives the same kind of uniqueness as the "given names within a group of siblings" model. You just can't tell from the IDs.
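That namespace-by-hashing trick takes only a couple of lines. A sketch, with the ID format (SHA-256, truncated to 16 hex digits) chosen arbitrarily for the example:

```python
import hashlib

def post_id(namespace, title):
    """Derive a post ID by hashing the namespace and title together."""
    return hashlib.sha256(f"{namespace}:{title}".encode()).hexdigest()[:16]

mine = post_id("fieldnotesontheweb", "Names and namespaces")
yours = post_id("myawesomeblog", "Names and namespaces")

# Same title, disjoint namespaces, (almost certainly) different IDs --
# and nothing about either ID reveals the namespace or the title.
assert mine != yours
```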
It's not uncommon for a naming scheme to evolve from a hierarchical structure, like SSNs before 2011, to a flat structure, like SSNs from 2011 onwards. Given that, there's a good argument to be made that you should just start with a big, flat namespace and save the headache of conversion.
Friday, August 3, 2012
Cookies in the UK (or should that be "biscuits"?)
I haven't tracked down whether Parliament decreed this, though it seems likely, but a number of UK sites I've visited in the past couple of months show you a brief popup or other announcement to the effect that they use cookies (small files that your browser stores on your disk and hands back to the site on later visits so the site can tell it's you). The announcement is typically a couple of simple sentences with a link for further information. For example:
This site uses cookies. By continuing to browse the site you are agreeing to our use of cookies. Find out more here.

The linked page details in clear, precise language what cookies are and what the site uses them for. It explains how to set your browser to disable cookies for the site, with the understanding that you might not have as nice an experience since the site won't be able to remember who you are. Once you dismiss the announcement you don't see it again, because -- of course -- it has set a cookie and knows not to come back (unless you disabled cookies or later clear the cookie).
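The dismiss-and-remember mechanism is simple enough to sketch server-side with Python's standard cookie parser; the cookie name and lifetime below are invented for illustration, not taken from any real site:

```python
# Show the cookie announcement only until the visitor dismisses it,
# at which point a marker cookie suppresses it on later visits.
from http.cookies import SimpleCookie

def should_show_notice(cookie_header: str) -> bool:
    """True if the request carries no 'already dismissed' marker."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    return "cookie_notice_dismissed" not in jar

def dismiss_notice() -> str:
    """Set-Cookie value the server sends when the notice is dismissed."""
    jar = SimpleCookie()
    jar["cookie_notice_dismissed"] = "1"
    jar["cookie_notice_dismissed"]["max-age"] = 60 * 60 * 24 * 365  # a year
    return jar["cookie_notice_dismissed"].OutputString()

print(should_show_notice(""))                           # True: first visit
print(should_show_notice("cookie_notice_dismissed=1"))  # False: dismissed
```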
Wow. They Got It Right. Well done!
Labels:
considerate software,
cookies,
neat hacks,
privacy,
UX
Wednesday, January 18, 2012
Does there have to be an app for that?
Weddings are generally public affairs, and they always have been. I doubt it's ever been particularly difficult to find out who's planning a public wedding and when in a given area. With the advent of online wedding planning it's now perhaps a bit easier yet, and if you're looking for a wedding to crash, well, there's an app for that.
Now, I'm with the author of the article in thinking that the kind of person who would use such a thing has -- how shall we say -- issues, but the flip side of that is, who's actually going to use it, as opposed to just having a laugh looking up one's friends and acquaintances? Or more precisely, who's going to use it who wouldn't have been willing and able to crash a given wedding anyway?
In general, there's a lot of gray area when it comes to "enabling technologies", not to mention the larger sticky issue of to what extent technology can or should be considered without considering its potential consequences. On the one hand, it's easy to say "The real problem is the wedding sites' privacy models. The app just pulls together information that's already available." But that's a cop-out. As we've seen, pulling together information that's already available and making it universally accessible (if not useful) can make a significant difference. Sometimes this is good, sometimes not, and just because something can be done doesn't mean it should be.
Just how much of a difference pulling together existing information and making it easy to get to can make depends on what the information is, how hidden it was, who wants to know and a host of other factors. In this particular case, I doubt the app will make much difference. That's not to condone wedding crashing, or the app, or to excuse its creators. If your wedding is crashed by some tech-savvy boor who would otherwise have missed out, you have my sympathies, for whatever that's worth.
Wednesday, March 30, 2011
Just how much should one be unsettled by all this unsettling stuff?
Sitting here as I type this, I can't see you reading it. With a couple of exceptions, I have little idea who you might be, and you may well know me only through my profile and posts here. In the absence of face-to-face interaction, it's easy to think oneself anonymous. For all you know, I could be a dog.
It's easy to think oneself anonymous on the web, but there are significant ways in which this just isn't so. As I've mentioned in a few recent posts, for example, a web server can find out if you're logged in to various sites, your browser quite likely has a unique fingerprint, and the location of your WiFi router is probably in several databases which can be used to locate people who aren't even connected to it (on a recent road trip, I helped install a brand-new wifi router, and did a double-take when Apple's location service knew where I was -- because of the half-dozen other WiFi routers in range).
Which brings us to the question: To what extent are these just somewhat unsettling facts of web.life and to what extent are they cause for real concern? And of course, the answer is "it depends".
Fine. What kinds of things does it depend on?
For such things to be more than theoretical concerns, someone has to do something harmful with the information that they wouldn't have been able to do anyway, or at least, the likelihood of someone doing harm has to increase (or if you really want to be technical, the expected harm has to outweigh any expected benefits, using "expected" in the probability sense). The downside will depend on what kinds of bad things can happen, which depends on the particular unsettling fact, and how likely they are to happen, which can depend on all sorts of things.
For example, there's probably not a lot of harm in most cases if a server can tell that you're logged into FaceSpace, but if your employer has a strict policy against that and someone decides to install a FaceSpace login detector in your company's internal homepage, the consequences could be serious. It's up to you to weigh how likely that is and how much you need the job (and how important it really is to browse FaceSpace at work).
If you live under a regime that bans unauthorized WiFi routers, the odds of something bad happening if you put one up anyway are pretty high. It's almost certainly not worth it. In most places, however, it shouldn't pose a problem. If your security is set up properly, most likely all that someone can determine is that there's a WiFi device with some particular SSID near some particular location. Given that it's easy to determine that there is, say, a house with a mailbox and electricity at a given location, that doesn't seem so dangerous, once you get past the creep-out factor of someone being able to detect something inside your home that they can't see.
To a large extent then, such things are just a part of conducting business on the web. There's nothing wrong with being concerned about privacy, or taking reasonable steps to try to protect one's privacy, but it's a mistake to expect one's online life to be perfectly private.
But the same can be said of one's offline life. Ultimately, issues of privacy are not technical, but social and legal. Expectations of privacy have always been around. So have breaches of those expectations, and so have various ways of trying to cope with such breaches. Which is worse: having your Awful Secret shouted in the town square of your medieval village, or to the whole tribal group around the prehistoric campfire, or having it posted to millions of people on the modern web? I don't see much difference. All three cases are serious. What matters isn't how many people find out, but how many people that you care about find out. I'm not convinced that technology changes that picture much.
Labels:
anonymity,
not-so-disruptive technology,
privacy,
WiFi
Thursday, March 3, 2011
Now I remember why I don't pay much attention to this kind of stuff
Recently I enabled location services on my MacBook. That couldn't do too much, right? The MacBook doesn't have a GPS attached. A quick check from whatsmyip (go there, then do the "IP Address Lookup") gave a location several miles from my actual one. Clearly that's as close as it can get.
Well, actually, it was accurate to within a hundred yards or so.
After double-checking that I really, really didn't have a GPS, I dug up what was really happening: Like many of us, I'm connected to the internet through a wireless router. That router has a unique MAC address by default. The MacBook's networking layer knows this (otherwise it can't function), so when I'm on my home network, I'm associated with a particular MAC address.
Several parties, including Apple, keep a database of locations of wireless routers. They get these locations by "wardriving" -- driving along looking for WiFi signals and using old-school radio techniques to pinpoint roughly where the signal is coming from. Location services simply contacts the mothership with the MAC address of the router to get the physical location out of the database. This has all been around a while. I'm just slow on the uptake.
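Conceptually, the database side of this is nothing fancy: a mapping from router MAC addresses (collected by wardriving) to coordinates, queried with whatever MAC a client reports. A toy version, with invented MACs and coordinates:

```python
# Toy model of a wardriving location database: router MAC -> (lat, lon).
# The entries are made up; real databases hold many millions of routers.
wardriving_db = {
    "3c:22:fb:12:34:56": (51.5136, -0.1586),   # logged near Oxford Street
    "a4:5e:60:ab:cd:ef": (37.3349, -122.0090),
}

def locate(mac: str):
    """Return (lat, lon) if a wardriver has ever logged this router."""
    return wardriving_db.get(mac.lower())

print(locate("3C:22:FB:12:34:56"))   # (51.5136, -0.1586)
print(locate("00:00:00:00:00:00"))   # None: never seen
```

A location service just runs this lookup for each MAC a device can hear, which is why even a brand-new router can be located if its neighbors are already in the database.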
Assuming everything's working as intended, this doesn't mean that any random person on the internet can find out your location. You decide whether to share that information, just like you decide whether to share a particular file (caveat: I'm not sure how secure a MacBook's default settings are). Still, it's another entry for the growing list of unsettling privacy issues (maybe I'll create a new tag).
As I understand it, there's not a whole lot you can do about this form of information gathering, though I didn't delve deep enough into the IEEE standards docs to be sure I had a definitive answer. From what I can make out, though, the source and destination MAC addresses have to be on every packet, unencrypted. Otherwise your router would have to try to decode every encrypted packet it received against the session keys for every active connection in order to see if it was the intended recipient.
Given that, turning off broadcasting of your SSID, using WPA2, whitelisting MAC addresses and so forth is not going to make a difference. The wardriver just has to sniff for packets and note the MAC addresses. That's not to say you shouldn't use WPA2 -- you absolutely should, as it provides decent protection against eavesdropping and unauthorized connections -- just that it won't prevent someone from knowing that a router with a given MAC address is in a given location.
There are some countermeasures you can take:
- Use your router's "MAC cloning" feature to set its MAC to something already in the database (if you choose a MAC not in the database, it will get added with your location next time the car comes by). Your friends and foes will then think you're in sunny Tahiti or wherever.
- Don't use WiFi
- String a bunch of Cat 5 cable and give up the convenience of wirelessness
- Use a smart phone as a tether -- this has its own set of privacy issues, but I'm not familiar with them and ignorance is bliss [I doubt this helps that much. A tethering phone looks like any other WiFi router to a wardriver, including having a MAC address. If you only use the tethering in one place, say your home, there's no real difference from using any other WiFi router. If you move around, there will be a record that your MAC address was seen at several places, which is probably not what you want --D.H. Sep 2015].
- Disable location services. The world will know that there's a wireless router at a given location, but won't be able to associate it with you or your computer (or at least, not quite as easily)
- Build a suitable Faraday cage around your network.
- Don't worry, be happy.
Please understand I'm not recommending any of these, except possibly the last.
Thursday, January 27, 2011
OK, this is a bit unsettling ...
File under unintended consequences. It all makes sense, and yet, it doesn't seem quite right.
Mike Cardwell blogs:
When you visit my website, I can automatically and silently determine if you're logged into Facebook, Twitter, GMail and Digg.

and sure enough, the page will say "Yes, you are logged in" or "No, you are not logged in" at the appropriate places. Eerie. What's going on here?
As Cardwell explains, whenever you send an HTTP request to a server, you get back a response code. That response code might say things like "Your request was OK, here's the data you asked for," or "Sorry, I don't have what you're looking for," or "Goodness, I seem to be having some sort of problem here." or any of a number of other things. So far, so good.
Modern browsers can keep track of whether you're logged in to particular sites, so you don't have to keep logging in. Fair enough. If you're logged in and you ask for something on a site, you'll get it (assuming you have the proper permissions, etc.). If not, you'll typically get an error.
HTML allows you to reference other web sites within your document -- that's pretty much what makes the web webby -- and modern browsers allow you to behave one way or another depending on what happens when you try to fetch something (it doesn't even have to be based on a status code -- pretty much any reliably observable difference in behavior will do).
Put it all together, and
- any web site
- can use a reference to another site
- to tell if you're logged in to that site
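The whole trick can be sketched end to end. The sketch below uses Python standing in for the browser-side JavaScript a real page would use: a toy server's "protected" resource answers 200 when the request carries a session cookie and 403 when it doesn't, and an observer classifies the two outcomes. The URL, cookie name and resource are all invented for illustration.

```python
# Minimal model of login detection via response differences:
# a resource that only loads for logged-in users leaks login state.
import http.server
import threading
import urllib.error
import urllib.request

class ProtectedHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Pretend /avatar.png is served only to logged-in users.
        if "session=valid" in (self.headers.get("Cookie", "") or ""):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"image-bytes")
        else:
            self.send_response(403)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

def probe(url, cookie=None):
    """True if the resource loads, i.e. the user looks logged in."""
    req = urllib.request.Request(url)
    if cookie:
        req.add_header("Cookie", cookie)
    try:
        urllib.request.urlopen(req)
        return True
    except urllib.error.HTTPError:
        return False

server = http.server.HTTPServer(("127.0.0.1", 0), ProtectedHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/avatar.png"

no_cookie = probe(url)                      # looks logged out
with_cookie = probe(url, "session=valid")   # looks logged in
print(no_cookie, with_cookie)               # False True
server.shutdown()
```

Note the probing page never sees the protected content itself; it only needs to observe whether the fetch succeeded, which is exactly the side channel Cardwell demonstrates.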
In Chrome, at least, if you open an incognito window to visit Cardwell's site, it can no longer tell whether you're logged in, because incognito windows don't share any state with other browser windows. But that's kind of throwing out the baby with the bathwater. You can also turn off JavaScript support (or only selectively turn it on), but that has its own problems.
To really solve the problem you have to be able to control what state is shared between, for example, different tabs or windows. Doing that simply and non-intrusively is easier said than done.
On the other hand, as a couple of commenters point out, such tricks have been around for a while. Whether anyone's exploiting them in a significant way is another matter. Before a site can find out if you're logged in, it has to get you to visit it (not that there aren't plenty of sneaky ways to do that), and then it just knows whether you're logged in or not to sites it knows how to check for (each site requires its own custom-tailored check). And then, if all you log into is, say, GMail and Twitter, then all your adversary can find out -- from this particular trick, at least -- is that (yawn) you use GMail and Twitter.
Worth losing sleep over? Probably not. Worth keeping in mind? Definitely.
Cardwell's site looks to have a lot of other fun and useful information on it as well ... and if you stop by for a visit, your browser will most likely tell his server I sent you.
Tuesday, November 16, 2010
Maybe I just don't understand this whole "privacy" thing
Today I had to get the full account number of a bank account. It was probably on some old paper statement at home, but I wasn't at home and besides, didn't I "go green with online statements" years ago? Everything's on the web these days, right?
Except when it's not. Many sites in a similar situation will provide a way of getting your full account number directly. Not this one. Most will at least provide PDFs of the statements they would have mailed, but again, no. Fine. I call them up and give them a bunch of identifiers (not quite as bad as this time, just the usual rigmarole). May I have my full account number now? Well, no, they don't give that out over the phone.
But no worries. What they can do is fax a recent statement, with account activity and all manner of other fun stuff, to any random fax number I choose. So that's all right, then.
For bonus points, they do read out a disclaimer warning that information sent to a public fax machine might be seen by anyone and everyone.
Noted.
Friday, May 7, 2010
Cameras everywhere
In the late 90s I worked in central London, near the Royal Courts of Justice. One day I looked up to see how many surveillance cameras I could count. It was at least a dozen, possibly two dozen.
Not all of them were safeguarding the Courts. They were stuck all over the place, on both public buildings and private. The intent was clearly not just to keep an eye on things, but also to make it abundantly clear that someone was doing so. Likewise in Oxford street, where the stated aim was to deter shoplifters and pickpockets (but not, so it seemed, shell games and sellers of counterfeit watches).
Surveillance cameras in major cities are not unusual, nor were they then, I'm sure, but the density of bristling, in-your-face obvious cameras seemed particularly high to me.
Fast forward back to the present and it seems like the rest of the world is catching up. One pattern I've noticed, in two different parts of the US, is a city adopting red light cameras at a few selected intersections, but sticking cameras on top of pretty much any piece of public infrastructure that will hold one.
As far as I can tell, most of the cameras are just webcams. The actual red-light cameras tend to be conspicuously big boxy affairs that flash a bright light when they nail someone. But for all a paranoid new driver in the area knows, they could be, and for that matter it doesn't seem a great feat of engineering to turn an ordinary webcam setup into a red-light camera.
Come to think of it, you wouldn't necessarily need any sophisticated software or full-time employees to monitor them. Just crowdsource it. Proud citizens of Anytown: Do you really want that new recreation center? Just stream the following URLs and click the "You're busted" button when you spot a transgression. Remember, every illegal left turn you spot brings us $X closer ...
Yikes. I should just stop typing now.
How did we get here, anyway? Is Big Brother taking over? Well, not exactly. It looks more like a combination of two pretty mundane factors:
- People want to see what's going on.
- Digital cameras and web connections are cheap and getting cheaper.
The limit is no longer technical, but what people will put up with. Interesting times, indeed.
Tuesday, January 12, 2010
Facebook privacy: Probably not dead either
There seem to be a lot of articles and posts lately about Facebook founder Mark Zuckerberg having announced the "end of privacy", so let's slow down a bit.
Here's a transcript of what Zuckerberg said on the matter, taken from Marshall Kirkpatrick's critique (the post also includes video of the original interview):
When I got started in my dorm room at Harvard, the question a lot of people asked was "Why would I want to put any information on the Internet at all? Why would I want to have a website?"

And then in the last 5 or 6 years, blogging has taken off in a huge way and all these different services that have people sharing all this information. People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people. That social norm is just something that has evolved over time.

We view it as our role in the system to constantly be innovating and be updating what our system is to reflect what the current social norms are.

A lot of companies would be trapped by the conventions and their legacies of what they've built, doing a privacy change - doing a privacy change for 350 million users is not the kind of thing that a lot of companies would do. But we viewed that as a really important thing, to always keep a beginner's mind and what would we do if we were starting the company now and we decided that these would be the social norms now and we just went for it.

On the face of it, this sounds like a CEO making a fairly narrow statement about his company's service. There's a bit of ambiguity as to which social norms he's talking about, but clearly said norms are those of people who are on or might like to be on Facebook. So why has it been repeatedly glossed as "Facebook CEO says privacy is obsolete" or similar?
From what I can make out (and I'm not on Facebook) Facebook is changing its default privacy settings for content users publish from opt-in (you have to explicitly say you're sharing information) to opt-out (you have to explicitly say you aren't). This is part of a larger shift to more fine-grained privacy control, and to manage the transition Facebook users have been given a tool to "empower people to personalize control over their information".
Before I go on, here's a little Bingo card you can use the next time Facebook puts out a press release like the one in the link:
| transparent | empower | roll out | easy-to-use | personalize |
| transform | control | message | evolution | innovate |
| serve users’ changing needs | tool | FREE | dynamic | simplify |
| community | model | intuitive | accessible | set a new standard |
| unprecedented | process | educate | network | iterative |
But I digress.
Reading past the marketspeak, this looks like a pretty reasonable cut at something any successful software product needs sooner or later: a migration tool. In particular, they appear to go to some lengths to preserve settings you already have. Where that can't be done, it looks like they tell you what the default is, and why, and how to change it. The devil is in the details, but it's clear that they've at least examined the kind of issues that inevitably crop up in such an exercise. Their claim to have done extensive user testing looks credible. With 350 million users, they better have.
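The preserve-what-you-chose behavior described above amounts to merging a user's explicit choices over the new defaults. A minimal sketch; the setting names and audience values here are invented, not Facebook's actual categories:

```python
# Migrate privacy settings: new defaults apply only where the user
# never made an explicit choice; explicit choices always win.
NEW_DEFAULTS = {
    "posts": "everyone",
    "photos": "everyone",
    "friend_list": "everyone",
}

def migrate(existing: dict) -> dict:
    """Merge a user's explicit settings over the new defaults."""
    migrated = dict(NEW_DEFAULTS)
    migrated.update(existing)  # user choices override defaults
    return migrated

merged = migrate({"photos": "friends-only"})
print(merged["photos"])  # friends-only: the explicit choice survives
print(merged["posts"])   # everyone: no choice was made, default applies
```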
One item does stand out, though, and it's probably the basis of Zuckerberg's comments above:
Common set of publicly available information: Facebook’s latest privacy policy, announced in October, indicated that certain basic information—a user’s name, profile picture, gender, current city, Friend List and Pages—would be categorized as “publicly available.” The overwhelming majority of users already make all of this information available to everyone and this label was chosen to ensure that users understand that it is possible for this information to be viewed by others. However, users can still avoid being found in searches or prevent contact from non-friends.

So, if you want to be on Facebook, you have to give out those basic items, but now that's an explicit policy rather than just something everyone was doing. You can choose who sees what else, but with finer-grained control now. People who didn't originally make the basic items above public (may?) now have them made public, but nothing else need change. Maybe I've missed something, but this doesn't seem earth-shaking, and Zuckerberg's comments don't seem to say much more than "It looks like people like to share stuff on Facebook more widely than we originally expected." [Five or six years on, it seems even less like the Earth has shaken --D.H. Dec 2015]
And it's all just Zuckerberg's opinion. As Derek Thompson argues, Zuckerberg may not even be right about people's attitudes in his own backyard of Facebook.
Labels:
facebook,
Mark Zuckerberg,
marketing,
privacy,
things that aren't dead
Sunday, December 20, 2009
This one has a little bit of everything
For quite a while, the Did you feel it? link on the USGS web site has given the general public a chance to report earthquakes. This allows the seismologists to get a quick fix on the location and intensity of a quake before their instruments can produce more precise results -- seismic waves take time to travel through the earth.
This is a nice bit of crowdsourcing, somewhat akin to Galaxy Zoo, but it depends on people getting to the USGS site soon after they feel an earthquake. Some people are happy to do just that, but it's not necessarily everyone's top priority. So now the USGS has started searching through Twitter for keywords like "earthquake" or "shaking", and they're finding enough to be useful. The tweets range from a simple "Earthquake! OMG!" to something more like "The ceiling fan is swaying and my aunt's vase just fell off the top shelf," which gives some idea of magnitude.
As with Twitter in Iran, tweets are a great primary source of information, but you need to sift through them to get useful data. As with Google Flu, mining tweets doesn't require active cooperation from the people supplying the data. Rather, it mines data that people have already chosen to make public. In the case of Google Flu, Google is trying to use its awesome power for good by mining information that people give up in exchange for being able to use Google. (You have read Google's privacy policy, haven't you?) With Twitter, the picture is much simpler: The whole point is that you're broadcasting your message to the world.
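As a toy illustration of the kind of keyword search described above, a first-pass filter might look like the sketch below. The keyword list and the sample tweets are invented for illustration; the USGS's actual criteria are surely more refined.

```python
import re

# Hypothetical first-pass filter: flag tweets containing quake-related
# keywords. Case-insensitive, whole words only.
QUAKE_KEYWORDS = re.compile(r"\b(earthquake|quake|shaking|tremor)\b", re.IGNORECASE)

def looks_like_quake_report(tweet_text):
    """Return True if a tweet plausibly reports seismic activity."""
    return bool(QUAKE_KEYWORDS.search(tweet_text))

# Invented sample tweets: two plausible reports, one false lead.
tweets = [
    "Earthquake! OMG!",
    "everything is shaking, and my aunt's vase just fell off the shelf",
    "my nerves are shot after that meeting",
]
reports = [t for t in tweets if looks_like_quake_report(t)]
```

A real system would then need the sifting step mentioned above: weeding out metaphorical uses, retweets, and jokes before the remaining reports say anything about location or magnitude.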
It should come as no surprise that tweets about seismic activity are much more useful if you know where they came from (though even the raw timestamp should be of some use). Recently (November 2009), Twitter announced that its geotagging API had gone live. This allows twitterers to choose to supply location information with their tweets. The opt-in approach is definitely called for here, but even so there are serious questions of privacy. Martin Bryant has a good summary, to which I'll add that information about your location over time is a very good indicator of who you are.
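To see why a location trace is such a strong identifier, consider a toy sketch (all place names and traces invented): the two cells a person visits most, typically home and work, recur week after week, and very few other people share that particular pair.

```python
from collections import Counter

def top_locations(trace, n=2):
    """Return the n most frequently visited cells in a location trace."""
    return tuple(cell for cell, _ in Counter(trace).most_common(n))

# Two weeks of a hypothetical user's coarse location history.
alice_week1 = ["home-A", "work-B", "home-A", "cafe-C", "work-B", "home-A"]
alice_week2 = ["home-A", "work-B", "home-A", "gym-D", "work-B", "home-A"]

# The same (home, work) pair recurs, acting as a quasi-identifier
# even though no name or account is attached to the trace.
signature1 = top_locations(alice_week1)
signature2 = top_locations(alice_week2)
```

Anyone who can link that stable (home, work) pair to public records, say a phone book plus an employer directory, has a good shot at a name, which is why "anonymous" location data is rarely as anonymous as it sounds.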
Labels:
anonymity,
crowdsourcing,
data mining,
geolocation,
Google,
Martin Bryant,
privacy,
Twitter,
USGS
Friday, August 22, 2008
Who owns the cloud?
Along the lines of "the usual yada yada," NPR recently ran a story on the downside of storing important personal data -- email, pictures, schedules, secret recipes, whatever -- "in the cloud", that is, online somewhere, you neither know nor care where, conveniently managed and backed up by someone else.
They mention three main problems:
- Your provider could fold, taking your data with it.
- Depending on the terms of its yada yada, your provider could shut down your account for any number of reasons beyond your control. For example, a random person could tell them, without proof, that they think you're committing a crime.
- Again depending on the terms of the yada yada, the provider might share your data with anyone and everyone.
The holy grail here is a service whereby your data is:
- Safe: It won't go away, barring disasters in multiple, geographically separated sites (in which case there are probably bigger fish to fry). You may lose access to it, whether because you don't have connectivity, or because your provider folded and the data is temporarily in escrow, or because you really are accused of a crime, or whatever.
- Secure: Only you can get at it. If you provider leaks your data, it's liable up to some fairly substantial point. If you lose your keys, you can have them replaced conveniently.
- Yours: You have the rights to whatever you store (provided you created it or otherwise had the rights to it in the first place) unless and until you explicitly sign them away. As I understand it, this is one of the key tenets behind personal datastores.
On the other hand, there is probably room for a few well-placed regulations to help things along here. In particular:
- That data held by a provider that goes out of business should go into escrow and be made available to former customers for a reasonable period.
- That data remains private unless specifically made public.
Labels:
bigger fish to fry,
personal datastore,
privacy,
security,
the cloud,
yada yada
Tuesday, July 29, 2008
The good driving monitor
A while ago I reported the dilemma of a teenager whose stepfather had insisted on his having a GPS unit installed in his car -- and who stood to get out of a speeding ticket as a result. Now car insurance companies are offering the same dilemma to the market at large. Put a monitoring device in your car, and if it shows you drive carefully they'll reduce your rates. If it doesn't -- and keep in mind that everyone thinks they're an above-average driver -- your rates will go up, though not by as much. Here's the breakdown for one company, in relative terms:
- If the monitor convinces them you drive the way they like, you pay $0.60
- If you don't need no stinking monitor, you pay $1.00
- If the monitor convinces them you drive badly, you pay $1.09
Privacy advocates, naturally, have a problem with this. The problem is not the monitoring per se, but that the company effectively owns the data. I don't see why it should have to be that way. Suppose you own your driving data. You can choose to sell access to it in return for cheaper insurance, or you can decline, in which case your insurer will presume you have a reason.
And that's the more subtle consequence: You could be a perfectly good driver, but not like the idea of turning that information over to The Man, and end up paying for the privilege. That's actually not the case at the moment. The non-monitored driver currently pays less than the monitored "bad" driver. That seems like an unstable situation, though. If such monitors become widespread, the presumptions change, and in any case the actuarial risk has to show up somewhere. Some possibilities:
- Require that everyone pay the same rate, no matter what. There's no need to gather driving data, but dangerous drivers pay less at the expense of safer drivers.
- Prohibit the use of monitor data in setting rates, but allow accidents, speeding tickets and such to count, as they do now (at least in the US).
- Allow the use of monitor data, but prohibit companies from charging more than X to customers who decline to supply the data. If X is the lowest rate, then no one will volunteer and we're back to the first case. If X is the worst rate, then the non-volunteer rate will come up and/or the worst rate will come down to eliminate the $0.09 difference.
- Do nothing and see what happens.
However, that doesn't mean people can't have control over whether they volunteer the information or not. There are at least two ways to do this. One, which I mentioned in a follow-up, is for you to own your own monitor and decide whether to let it yield up its secrets. The other, which would have much the same effect, would be for your monitor to send its data to your personal datastore, whence you could share it out as you saw fit. In either case the monitor needs to be tamper-resistant, but that's a given.
In any case, add this to the list of "Who owns the data?" cases where the initial answer is "a particular private company" but the eventual answer ought to be "you".
Labels:
driving monitors,
economics,
GPS,
insurance,
personal datastore,
privacy,
The Man
Wednesday, July 16, 2008
Privacy hot potato
It seems that Google and Viacom have reached an agreement on the YouTube usage data Google was ordered to turn over; Google will be allowed to anonymize the data by replacing user IDs and IP addresses with random tokens.
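The agreed-on anonymization is straightforward in outline: replace each user ID and IP address with a random token, consistently, so that per-user viewing patterns survive but the identities behind them don't. A minimal sketch of that idea follows; the field names are invented, and Google's actual procedure is surely more involved.

```python
import secrets

def anonymize_log(records):
    """Replace user IDs and IP addresses with consistent random tokens."""
    tokens = {}
    def token_for(value):
        # Same input always maps to the same token, so counts per
        # (anonymous) user are preserved; the token itself is random
        # and carries no link back to the original value.
        if value not in tokens:
            tokens[value] = secrets.token_hex(8)
        return tokens[value]
    return [
        {"user": token_for(r["user"]),
         "ip": token_for(r["ip"]),
         "video": r["video"],   # video ID kept: it's what the case is about
         "time": r["time"]}
        for r in records
    ]

log = [
    {"user": "alice", "ip": "1.2.3.4", "video": "v42", "time": 100},
    {"user": "alice", "ip": "1.2.3.4", "video": "v99", "time": 200},
]
anon = anonymize_log(log)
```

One caveat worth noting: the token map itself is as sensitive as the original data, since it can reverse the anonymization, so it has to be discarded or locked away once the job is done.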
This is good news (though not as good as, say "turns out the data was anonymous to begin with"), but not a surprise. Google and Viacom had both stated that they wanted to find a way to protect the anonymity of the data. Google's interest is obvious, but Viacom had an interest as well: If they get anonymized data, no one can accuse them of abusing personalized data or accidentally leaking it. AOL already saw what it's like to be the guy that leaks personalized data, even if only by accident. No one wants to be that guy.
Now, the whole reason this is a big deal is because personalized data is valuable, and that presents a temptation. But a rational player will realize that the high cost of getting caught, together with the difficulty of keeping a dozen terabytes of valuable data completely secret and the lack of anyone else but Google to blame a breach on, far outweighs any benefit there may be. Viacom is just being rational. If there's a breach now, the list of suspects is one, not two companies long.
Put another way, the personal content of the usage data has value in general, but it has less than no net value to Viacom. It's a hot potato they don't want to catch. Better to make sure it's not thrown in the first place.
[Re-reading my original post on this topic, I see I already made this point, but I still think it's a good point.]
Saturday, July 5, 2008
Google, Viacom and privacy
A certain amount of controversy over privacy is just part of being Google, not just because Google is a large software company, but also because it aims to make as much of the world's information available to as many of the world's people as it can, subject, of course, to the admonition not to be evil. "Don't be evil" is just three simple words, but just how those three words apply when the bits hit the wire is the stuff of dissertations and lawsuits.
Google is embroiled in at least two significant disputes lately: The ongoing rumbles over Street View, which seem to be getting worked out piece by piece as we go along, and a lawsuit by Viacom over YouTube which, while probably not as bad as the flap over Italian tax privacy, does involve at least a couple of echoes of the AOL search data debacle.
That one certainly looks bad at first blush. In his ruling, Judge Louis L. Stanton has granted Viacom access to "all data from the Logging database concerning each time a YouTube video has been viewed on the YouTube website or through embedding on a third-party website." That data includes "for each instance a video is watched, the unique “login ID” of the user who watched it, the time when the user started to watch the video, the internet protocol address other devices connected to the internet use to identify the user’s computer (“IP address”), and the identifier for the video." [The link above is via Wired. If the original ruling is on the District Court's website yet, I can't find it. If anyone has a better link, please send it]
So there's a bit of a privacy issue there.
Google had two objections to this, first that the database was big. Well, 12 terabytes is a lot of data, but as the judge points out, it's not too big for commodity disks these days. The more serious argument is that the database contains personally identifiable data and is more than Viacom needs to “recreate the number of views for any particular day of a video” and "compare the attractiveness of allegedly infringing videos with that of non-infringing videos."
The judge was not impressed, calling Google's concerns "speculative" and citing a (very reasonable) blog post by Google developer Alma Whitten arguing that an IP address by itself is generally not personally identifiable information. That seems a bit odd since in this case the IP address is not by itself, but linked with just the kind of information that Whitten claims would make it personally identifiable.
However, the main thrust of the judge's argument seems to be that Viacom's use of the data is limited to a particular purpose in the discovery phase of a particular civil case. Presumably, if Viacom is later found to be making other use of the data, or if the data leaks out into the larger world, Google or someone else can come back after them. In the case at hand, Viacom would probably also run afoul of the Video Privacy Protection Act. Well, maybe so, but a genie out of a bottle is a genie out of a bottle ...
It's not clear to me why Google couldn't have just been compelled to disclose what Viacom said it was after: detailed logs of how many people watched what videos at what time, but not which particular people or from what IP address. In the cases Viacom is interested in, where large numbers of people watched copyrighted material, there should be more than enough individuals involved to provide anonymity.
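The aggregate Viacom said it was after — how many people watched which videos, and when — can be computed without retaining who watched what at all. A sketch of that computation, with invented field names and a per-day granularity assumed for illustration:

```python
from collections import Counter

SECONDS_PER_DAY = 86400

def daily_view_counts(records):
    """Count views per (video, day), discarding user and IP entirely."""
    return Counter(
        (r["video"], r["time"] // SECONDS_PER_DAY) for r in records
    )

# Invented log entries; the user field never enters the aggregate.
log = [
    {"user": "alice", "video": "v42", "time": 10},
    {"user": "bob",   "video": "v42", "time": 20},
    {"user": "alice", "video": "v99", "time": 90000},  # the next day
]
counts = daily_view_counts(log)
```

Handing over output like this, rather than the raw log, is exactly the "disclose only what was asked for" alternative argued for above.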
On the bright side, Google and Viacom are now trying to work out how best to implement the court order without giving away personally identifiable information. Google's interest in this is obvious. Viacom also has an interest, though. They would like to be able to say "we only looked at the information we asked for, and we can prove it." No one wants to be seen as the company that inadvertently gave away information on millions of users. AOL went through that. It hurt.
In the background to all this is the long-standing complaint from privacy advocates that Google should have anonymized the YouTube data to begin with, as it has with its web search data. You can't divulge what you don't know, and in a case like the present one Google could have convincingly argued that it can only supply Viacom with what it asked for and no more. This is clearly easier and less error-prone than the present case.
In practice, it's probably not that simple. It's easy to think of a company as a monolith, but only the smallest companies really are. When you get to be Google's size, and the entity in question is a newly-acquired subsidiary, it's not a great surprise that rules and practices would differ.
Labels:
Alma Whitten,
anonymity,
Google,
law,
Louis L. Stanton,
privacy,
Viacom,
Video Privacy Protection Act
Wednesday, June 11, 2008
More on cell phones as tracking devices
It was this BBC piece on a recent study at Northeastern University that set me musing about tracking via cell phone.
The article is sort of a roller coaster ride of "yikes!":
It would be wonderful if every [mobile] carrier could give universities access to their data because it's so rich
The researchers said they were 'not at liberty' to disclose where the information had been collected.
... giving way to "that's not so bad":
[S]teps had been taken to guarantee the participants' anonymity
[W]e only know the coordinates of the tower routing the communication, hence a user's location is not known within a tower's service area
... and the occasional "hmm ...":
Nokia have put forward an idea to attach sensors to phones that could report back on air quality. The project would allow a large location-specific database to be built very quickly.
Ofcom is also planning to use mobiles to collect data about the quality of wi-fi connections around the UK.
Evidently the business of attaching interesting sensors to cell phones is expected to boom in the next few years.
The real punchline, though, was the unsurprising conclusion that most people's daily activities are pretty boring: "The study concludes that humans are creatures of habit, mostly visiting the same few spots time and time again. Most people also move less than 10km on a regular basis[.]" Even those that travel further still tend to visit a small number of places repeatedly.
It's natural to be concerned about the ever-increasing speed of communication, and the prospect that at some point everyone might have access to everything known about everyone. But on the other hand it's comforting to know that one's own activities are probably too boring for most people to care about.
Monday, June 9, 2008
It's 2:00 am. Does your cell phone know where you are?
As I understand it, cell phones are called cell phones because the area of coverage is divided into (generally overlapping) "cells", each covered by a given tower. This means that if you're connected to the network, the provider will be able to tell, at a minimum, which cells you're in. By looking at signal strength from the towers involved one can get a much more accurate estimate. And as if that's not enough -- and apparently it isn't -- GPS is becoming a standard feature.
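One simple way to refine a position estimate from several towers is a centroid of the tower coordinates weighted by signal strength: the strongest signal pulls the estimate toward its tower. This is a toy sketch with invented numbers; real networks use more sophisticated techniques (timing advance, triangulation, and so on).

```python
def weighted_centroid(towers):
    """Estimate position from a list of (x, y, signal_strength) tuples."""
    total = sum(s for _, _, s in towers)
    x = sum(xi * s for xi, _, s in towers) / total
    y = sum(yi * s for _, yi, s in towers) / total
    return (x, y)

# A phone hears three towers; the signal from the tower at the origin
# is three times as strong, so the estimate lands near that tower.
estimate = weighted_centroid([
    (0.0, 0.0, 3.0),
    (4.0, 0.0, 1.0),
    (0.0, 4.0, 1.0),
])
```

Even this crude method narrows "somewhere in this cell" down considerably, which is the point: the network can locate you reasonably well before GPS enters the picture at all.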
Having a precise, accurate location device on hand at all times can be handy and in some cases even life-saving. On the other hand, having an unobtrusive tracking device on one's person at all times raises some obvious privacy issues.
There are two contrasting extreme views on this sort of thing. The Utopian view plays up the "never lost" and "find a restaurant" features and goes on to argue that a world where everybody can locate everybody else is a fundamentally Good Thing.
The dystopian view plays up the privacy concerns, argues that The Man wants to know where you are and, further, wants to make it nearly-impossible to live without your personal tracker.
Naturally, I don't subscribe to either extreme view. I'm not really excited by the idea of a service that alerts me if a friend happens to wander into my vicinity (or vice versa), but neither do I see the whole thing as a step down the slippery slope towards Big Brother. I am a bit concerned that it's easy to forget, or never really realize, how locatable you are when you carry a cell phone, but that problem has been around for a while now.
On the balance I see it as technology taking yet another incremental step and life going on more or less as usual.
Wednesday, April 30, 2008
Italian tax privacy -- or lack thereof
Oh. My. Goodness.
It seems the outgoing government of Italy has seen fit to put up, with no prior warning, a web site with the tax details of every Italian taxpayer. The information includes, at least, income and tax paid.
Can they do that? Looks like they just did.
Human nature being what it is, the site was soon overwhelmed with traffic and, according to the BBC, has been or is to be taken down. It's not completely clear to me whether the "privacy watchdogs" mentioned have the authority to get it taken down, or are just demanding it be taken down, but the statement that the site was up for 24 hours suggests that someone took it down. In which case the interesting question becomes who was able to scrape and store how much data about whom during that window ... [My understanding from later articles is that the watchdogs in question were governmental agencies with the authority to have the site taken down, and that it was taken down ... but not before a lot of people had had their fun]
I'd be curious to know how all this squares with Article 8 of the EU Charter of Fundamental Rights, which covers the protection of personal data. My guess would be that it doesn't. But then, my understanding is that the Charter doesn't actually have any legally binding status.
Sunday, February 17, 2008
Health care and datastores
One possible killer app for personal datastores is medical record keeping. Right now (in the US, at least), every health care provider I use has its own copy of my medical history. It's generally not hard to get your old provider to send your new provider your records, but the point is that they need to be sent at all. The inevitable result is a multitude of small mistakes and discrepancies as essentially the same form gets filled in or transcribed over and over again, leaving you to wonder what sort of large mistakes and discrepancies might creep in while no one's looking.
Joe Andrieu relates a story of Doc Searls dealing with just such problems -- in a state with universal health care -- in his own case [The story in question starts with the section marked "User centrism as system architecture," but as usual the whole post is worth reading]. As he says, a personal datastore would make the whole situation much simpler. Your medical data is part of your datastore. You give your providers permission to read and update it. There's only one copy of it, so all the data replication problems go away. You control access to it.
If you want a new provider to have access, just say so. You could even give blanket permission to any accredited hospital, in case of emergency. This permission would, of course, live in the world-readable part of your datastore.
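The permission scheme described above can be sketched in a few lines of code. This is a toy illustration only, not any real datastore API: every name here (`Datastore`, `grant`, `grant_role`, `can_read`) is hypothetical, and a real system would need authentication, auditing, and much finer-grained access rules.

```python
# Toy sketch of a personal-datastore permission model (all names hypothetical):
# named providers get explicit grants; a blanket grant covers any party
# holding a given role, e.g. "accredited-hospital" for emergencies.

class Datastore:
    def __init__(self, owner):
        self.owner = owner
        self.records = {}            # e.g. {"medical": {...}}
        self.grants = {}             # provider id -> set of permissions
        self.blanket_roles = set()   # roles with blanket read access

    def grant(self, provider_id, *permissions):
        """Give a named provider specific permissions, e.g. read/update."""
        self.grants.setdefault(provider_id, set()).update(permissions)

    def grant_role(self, role):
        """Blanket permission for any holder of a role. In the scheme
        described above, this grant would itself live in the
        world-readable part of the datastore, so an ER can find it."""
        self.blanket_roles.add(role)

    def can_read(self, provider_id, roles=()):
        return ("read" in self.grants.get(provider_id, set())
                or any(r in self.blanket_roles for r in roles))

store = Datastore("alice")
store.grant("dr-smith", "read", "update")        # my regular doctor
store.grant_role("accredited-hospital")          # any ER, in an emergency
```

With this in place, `store.can_read("dr-smith")` succeeds via the named grant, and `store.can_read("er-7", roles=["accredited-hospital"])` succeeds via the blanket one, while an unknown party gets nothing.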
I have to say it sounds beautiful, and I'm confident that it, or something functionally equivalent, will eventually happen. But how do we get there?
Given that this is very personal medical information, privacy is a major concern. Health care providers (again in the US, at least) are bound by strict privacy rules. Without digging into the details of HIPAA, one of whose aims is actually to promote electronic data interchange, suffice it to say that achieving HIPAA compliance has been a long, expensive and sometimes painful process for the US medical industry.
One result of this process is that providers (and any other "covered entities") are limited in what they may disclose to other parties. While the intent of the privacy rules seems very much in harmony with the idea of a personal datastore, the realization laid out in the law is very much built on the idea of each provider having its own data fiefdom, with strictly limited interchange among the various fiefdoms.
By contrast, in a personal datastore world, providers would never have to worry about disclosing data to other providers. In fact, it would be best for a provider never even to take a local copy, except perhaps for emergency backup purposes, since the patient's datastore itself is the definitive, up-to-date version. This could be particularly important if, say, a patient in a hospital is also being treated by an unaffiliated specialist. Anything one of them does is automatically visible to the other, unless there is a particular reason to restrict access.
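The point about a single definitive copy can be made concrete with a deliberately trivial sketch (the function names here are made up for illustration): when both parties work against the one authoritative record, an update by one is visible to the other with no synchronization or record-sending step at all.

```python
# One authoritative record in the patient's datastore; no local copies.
record = {"medications": []}

def hospital_update(rec):
    # The hospital writes directly to the definitive record.
    rec["medications"].append("drug A")

def specialist_view(rec):
    # The unaffiliated specialist reads the same record and
    # immediately sees the hospital's update.
    return list(rec["medications"])

hospital_update(record)
current = specialist_view(record)   # contains "drug A", with no sync step
```

The replication problems the post describes arise precisely because today each provider holds its own copy of `record` and the copies must be reconciled by sending forms back and forth.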
The geek in me is fascinated to once again see the concepts of cache coherency and abstraction turning up in the larger world (whence we geeks are only borrowing them). But the health care consumer in me is concerned that the less-than-abstract form of the law, together with the need to implement it, has almost certainly produced a system with far too much inertia to switch to a datastore-centered approach anytime soon.
Obstacle one is that hospitals' data systems just aren't set up for it, and after going through the wringer to get the present setup in place, they are not going to be in any hurry to implement a new scheme. Even with that out of the way, it is the providers that are on the hook to ensure privacy. They will want some assurance that relying on personal datastores does not expose them to any new liability.
That in turn will depend on personal datastores having been shown to be secure and reliable. Which is why, although I have no doubt that medical record keeping would be a great application for personal datastores, it seems unlikely to be the first, "killer" app that breaks them into the mainstream.
On the other hand, in chasing down the links for the lead paragraph, I ran across this post on Joe Andrieu's blog about Microsoft HealthVault. It looks like a step in the right direction, but curiously enough, only health care providers can access your vault directly. You can't.
