Tuesday, October 25, 2016

Names and namespaces

I'm not the only David Hull in the world.  I may not even be the only one with my exact name.  I've met a couple of other David Hulls.  There was also a prominent philosopher by the same name, not to mention musicians, athletes and others.  Mine is not a particularly common name, but, at least if we look and first and last names only, it's definitely not unique.  It would be interesting to know what portion of names in the world are unique.  There are plenty of Pat Smiths or Zhang Weis, for example but (as far as I know) only one Dweezil Zappa (born Ian Donald Calvin Euclid Zappa).

In daily life it's generally not a huge problem for two people to have the same name.  If there are two David Hulls working in the same office, one might be "Dave" and the other "David", or one might be "Dave in sales", or one might go by "Walrus" for whatever reason.  If we want to be more precise, we can always add more identifiers, such as middle name, birth date, place of birth and so forth.  It's very unlikely that (to make up an example), there was more than one Patricia Terpsichore Smith born on February 29th, 1940 in Saint Paul, Minnesota.

If you want to be really sure, you assign everyone a unique identifier, such as the Social Security number in the US (SSN for short).  In theory there should never be more than one person with a given social security number, regardless of their name, age or place of birth.  In practice, that doesn't really hold up.  There are actually millions of people with multiple SSNs and/or SSNs assigned to multiple people.  Leaving that aside, though, it's worth taking a closer look at how SSNs and identifiers like them are built.  I'll use US examples here since that's what I'm familiar with, but many other countries have similar schemes.

A Social Security number is split into three parts.  Up until 2011, these followed a specific pattern.  For example, LifeLock CEO Todd Davis's is 457-55-5462.  The 457 part is in the range 449-467, which is assigned to Texas.  The 55 means, in this case, that the card was issued in 1982.  Exactly which year the middle digits map to depends on the first three digits, presumably because not all three-digit prefixes are used every year.  The last four digits are issued in numerical order, so putting it all together, Davis would have been the 5462nd person to be issued a SSN starting with 457-55, and the card was issued in Texas in 1982.

This sort of scheme is not uncommon.  US phone numbers are built in a similar fashion.  The full history would be material for a separate post, but during the "long distance" era before the advent of cell phones, a US phone number consisted of a three-digit area code, having a middle digit of 0 or 1, followed by a three-digit exchange associated with a piece of equipment in a particular location, followed by a four-digit number.  For example, the White House phone number is (202) 456-1111.  The area code for Washington, D.C. is 202, 456 is one of the exchanges there and 1111 is easy to remember, easy to dial on old-fashioned rotary phones and hey, the White House can pick whatever number it likes.

Likewise, US Zone Improvement Plan codes (zip codes to most of us) uses the first digit to denote a particular region of the country, the second two to denote a particular area within that region (and typically a particular sorting and distribution center, and the last two a smaller area within that region.   Here's a nice illustration I mentioned in a previous post. The later ZIP+4 scheme takes that down to individual blocks, apartment buildings, large businesses and so forth.

The parts in schemes like this often nest.  Area codes comprise exchanges which comprise individual phones.  Zip code regions comprise distribution centers which comprise smaller areas.  Social security numbers are a bit different, in that the first five digits together denote a region and year associated with a block of individual numbers, but you can't really say that there are years within regions and vice versa.

People's names are also a bit borderline.  You could think of families comprising individuals, but the family name doesn't really correspond to biological families for a variety of reasons.  The approximately 22% of Koreans named Kim () are not a biological family (though there are some 384 clans with the Kim name).  It's not much more meaningful to talk of all the Hulls than it is to talk of all the Davids.

All of these schemes have special cases, for example:
  • SSNs never start with 000, while 700-728 were originally reserved for railroad employees
  • SSNs never start with 666, and this number does not even appear on the Social Security Administration's historical list.
  • Area code 800 (along with several others now) is reserved for toll-free services
  • Exchange 555 includes special numbers such as 555-1212 (directory assistance) and has blocks that are guaranteed not to be used for real phone numbers, which is why a US phone number you see in a movie almost always starts with 555.
  • Zip codes starting with 569 are reserved for the USPS Parcel Return Service.
  • In the US legal system, John Doe, Richard Roe and other names are used in various contexts for persons whose actual names are unknown or withheld.
It's easy to think of more of these identifiers made of parts, usually with special cases.  Credit card numbers.  Place names like Paris, Texas, USA as distinct from Paris, Kentucky, USA or Paris, Kiribati or Paris, France.  Three cases are of particular interest on the web:
  • domain names, which consist of parts nested from right to left (e.g., fieldnotesontheweb.blogspot.com)
  • IP addresses, which (in version 4) consist of four parts (sort of) nested left to right (e.g., 216.58.194.78)
  • URLs, which consist of a protocol (e.g., http) an authority (e.g., fieldnotesontheweb.blogspot.com), a path (e.g, /blogger.g) a query (e.g., ?blogID=21299...) and a fragment (e.g., #overview/src=dashboard) again nested (basically) from left to right
Again there are special cases, such as example.com, IP addresses starting with, e.g., 192.168. and the about: pseudo-protocol Chrome uses.

All of the naming schemes I've mentioned so far make some use of the idea of a namespace, that is, a context in which names are meant to be unique.  Within a given family, siblings typically share a family name but have distinct first names.  I say "typically" because there are plenty of exceptions in real life, ranging from ordinary blended families to George Foreman's five sons named George (who nonetheless appear to go by their own nicknames in daily life).

You could think of an area code as a sort of family name with exchanges as given names or, one level down, think of the exchange as a family composed of individual numbers.  Some of these analogies make more sense than others.  Sure, every SSN starting with 457 was assigned in Texas (if it was assigned before 2011), but there's no good way to get from the middle digits to a year without knowing the first three digits.  Real life is a bit messy.

Even so, schemes like this are a decent fit for the way we think, which should not come as a great surprise.  But this has its drawbacks.  Maybe you don't really want someone to know in what state and year you got your social security card.  Maybe you'd like to give out your phone number without giving away a reasonably good idea of where you live.

Besides the privacy implications, there are practical concerns.  In theory there are a billion possible SSN's, enough to keep up with the US population for a while yet.  In practice, not all numbers can be used.  If there are only 500 people with numbers starting with XXX-YY at the end of the year, the other 500 numbers starting with XXX-YY will go unassigned, and I'm sure there are other inefficiencies.  This is not unique to SSNs.  Any numerical scheme that allocates blocks of numbers will tend to leave some blocks unfilled.

For these and other reasons, many kinds of ID numbers are assigned in a single "flat" namespace, as SSNs are now.  One way to do this is with a serial number that's incremented with each new ID, but (again at least partly for privacy and security reasons), that's often not the case.  For example, Blogger gives this blog post and ID of 8084382145281586649.  The blog itself has an entirely different ID.  The two have nothing (obvious) to do with each other.  I certainly haven't written 8 quintillion posts for this blog, nor are there anywhere near that many posts in all of Blogger.  The previous post on this blog (from, um, just a little while ago), has ID 236347809273236220.

This way of using longish, apparently random strings of digits has a few often-useful properties:
  • Because the numbers are big enough, there is generally very little chance that the same ID number will be given out twice.  And by "very little" I mean "not liable to happen in our lifetimes" and sometimes much longer, not "eh ... this'll happen from time to time but don't sweat it".  As a rule of thumb, if there are N digits in an ID, the number of things you'd need to get a collision is an N/2 digit number.  If blog post IDs are 18 digits or so long, you'd need billions of posts before there was a significant chance of a collision, even if they're not explicitly checking whether a supposedly new ID has already been used.  Generally, "universal unique ID" (UUID) schemes use a lot more than 18 digits, making the chances of collision ridiculously small.
  • Almost all UUID schemes use some sort of secure hash.  This means that, generally speaking, changing even one bit of the input will change about half of the bits of the ID.  This and other properties make it, as far as anyone currently knows, infeasible to learn anything about the thing being assigned the ID from the thing itself.  For example, the IDs of the two posts give no clue that they identify adjacent posts in the same blog, much less what's in them.  The URLs given to the posts, in contrast, make an effort to provide at least some useful information (e.g., http://blog.fieldnotesontheweb.com/2016/05/a-couple-of-updates-on-satoshi.html).  But that's fine.  As long as you have a unique ID you know exactly which item you're dealing with and you can give it any kind of friendly name you like.
You can still have namespaces of a sort with a hashing scheme.  If I form my IDs by hashing the string "fieldnotesontheweb" followed by the title of the post and you use "myawesomeblog" instead of "fieldnotesontheweb", there is pretty much no chance we'll every use the same ID, even if we happen to pick the same post title.  This gives the same kind of uniqueness as the "given names within a group of siblings" model.  You just can't tell from the IDs.

It's not uncommon for a naming scheme to evolve from a hierarchical structure, like SSNs before 2011, to a flat structure, like SSNs from 2011 onwards.  Given that, there's a good argument to be made that you should just start with a big, flat namespace and save the headache of conversion.