Monday, December 31, 2007

This code comes with ABSOLUTELY NO WARRANTY etc., etc.

Kinda lame, I know, but I didn't have time to make it really lame.


#include <stdio.h>
#include <unistd.h> /* for sleep() */
int main(){char *s="#tcg[\"ygP\"{rrcJ23";while(*s++);s--;int _=0,j=9,c=42;
while(*s!='#'){putchar(_++<2||_>298?*--s-2:_&16?8:_&15?46:j--+48);
if(!(_%32-10)){fflush(stdout);sleep(1);}}putchar(10);return 0;}

Saturday, December 29, 2007

Radiohead: Just what is going on here?

A few months ago Radiohead put out its latest album, In Rainbows, in two formats. You could get a "discbox" with a vinyl pressing, bonus CD, album art, etc. for a fixed price, or you could just download the tracks as .mp3 files and throw whatever you wanted (or nothing) in the tip jar.

So what happened? Who paid what? Do we have a whole new paradigm for music sales? An interesting one-off experiment? An out-and-out boondoggle? All or none of the above?

It's hard to see In Rainbows as a whole new paradigm, if only because there are so many special circumstances. Radiohead is an established band with a loyal following and a formidable reputation. The band happened to be between record contracts for this album. The downloadable version was a companion to a more traditional offering. And Radiohead is Radiohead. What works for them might not work for anyone else.

In fact, it may not even have worked for them. Certainly the band saw the online release as a one-off. They are currently in negotiations with both record labels and iTunes, and the download offer has been discontinued (effective New Year's Eve).

Beyond that, the picture gets very muddy very quickly. One report, by Gigwise, claims downloads of at least 1.2 million copies of the album. How much did people pay? Those who know aren't talking, but one survey indicates an average of 4 pounds (about $8), with 1/3 of downloaders paying nothing. Another, hotly disputed by the band, suggests that 62% paid nothing and the average price across all downloads was $2.26.

So basically, we don't know how many copies were downloaded, how much people paid for them, whether the price paid changed over time, why there's a discrepancy between the two surveys, or much of anything else. Except that the band most likely pulled in millions of dollars for the downloads, some further amount for the discboxes, and expects further income from traditional distribution. That's not even counting the T-shirt sales. Not bad for a bunch of guys from Oxfordshire.

If the tip-jar/download approach is not obviously the future of music distribution, but it's not a massive flop either, what is it, and why is the band discontinuing it? Is it the tip of the iceberg, or an evolutionary dead end like the million dollar homepage?

My guess, and it's only a guess, is that the tip-jar model is not going to dominate, though it might not disappear entirely. Rapper Saul Williams is currently spinning a variation on this with a low-bandwidth version of his latest release. The hi-fi version is available for a $5 donation.

Why is the band discontinuing the offer? My recollection from the original page is that it was never promised indefinitely in the first place. Most likely the band has gotten the good out of it. I would expect that people who care enough to pay also care enough to download early (which might explain some of the discrepancy between the two surveys). The band also seems not to have burned its bridges with traditional distribution channels, and continuing the "It's up to you" offer would only muddy the waters there.

Thursday, December 27, 2007

100 and (I hope) counting

According to the "Blog archive" heading, this will be my 100th post to this blog. Stephen Jay Gould took a similar opportunity to tell us, finally, about his field work on Bahamian land snails. I'm more with Eubie Blake, who celebrated a 100th birthday and said "If I'd known I was going to live this long, I would have taken better care of myself."

I won't be writing about my equivalent of Bahamian land snails -- I wish I had something so interesting to draw on -- but on something more apropos of Eubie Blake. Blake, as it turns out, really only lived to be 96 (most of us should be so lucky), and that seems as good a point as any to pick up a thread that's been running through this blog more or less from the beginning: imperfection.

When electronic computers first entered the popular consciousness sometime after World War II, their defining property was perfection. If the hero needed the answer to an intractable problem, the computer was always there, ticking away impassively. On the darker side, the flawless, emotionless and relentless android, aware of its own perfection and our human inferiority, was a stock villain.

The computer was the ultimate in modernism. Its rise coincides, perhaps not coincidentally, with the shift from modernism to whatever we're in now, variously called postmodernism or late modernism, depending on whether you want to emphasize change (how modern) or continuity.

The notion of the all-knowing perfect computer dissolves rapidly on contact with actual computers. One of my early experiences in computing was meeting my dad's friend Herb Harris, who ran a computing facility in the building. I vaguely recall watching cards being punched and read, but I definitely recall suggesting that you could use a computer to store everything in the encyclopedia (and therefore, all of human knowledge).

Herb loaned me a book I still have, somewhere, on programming the IBM 360. He also gently prodded me to consider what putting an encyclopedia in a computer would mean, particularly the question of how you would find the information once you got it there. To give you an idea of the hardware of the time, the book contained a recipe for doing decimal multiplication by use of a multiplication table you could read in from external storage. I concluded that the problem was harder than it looked, but still ought to be at least partially solvable, with somewhat better hardware. Maybe I'd get back to it later ...

Now we have vast collections of textual material available via computer, and we have at least one usable way of finding the information that's there. We even have encyclopedias on line. All this information, its storage and its retrieval deal intimately in imperfection. Some examples:
  • Dangling links are explicitly allowed in the web. This is not an accident but a basic tenet of web architecture. Allowing links to point to nothing means that you don't have to build a whole site at once, or even know that it will ever get built. Among other things, dangling links are a key part of the wiki editing experience (not as much fun if you just want the information, though).
  • The underlying protocols the web is built on assume that messages routinely get dropped or duplicated in transit (TCP), that the information you are looking for may in fact be somewhere else (HTTP), or that the server you're ultimately trying to reach may be down (HTTP again).
  • Documents are given logical addresses, not physical addresses, on the assumption that information may be physically moved, without notice, at any time. For that matter, computers themselves also generally go by logical names. There is no one perfect physical realization of the web.
  • The web inherently doesn't assume that any given document is the last word on a given subject. Search engines generally give you some idea of how well-connected a page is, but this can change over time and in any case it's only a hint. Anyone can comment on a page and incorporate that page by reference.
  • You can't take anything on the web at face value, or at least you shouldn't invest too much faith in a page without considering where it came from (which you don't always know) and how well it jibes with other sources of information. This sort of on-the-fly evaluation quickly becomes a reflex.
  • From a purely graphical point of view, there is no definitive format for a given web page. If you try to lock everything down to the last pixel it will generally look bad on displays you didn't have in mind. If you don't, it's up to the browser at the other end to decide what it looks like, and with CSS and other tools, the viewer can have almost unlimited leeway. Nothing is perfect for everyone, so we try to get close and allow for tweaks after the fact.
  • A key part of running a successful web site is managing details like backup, maintaining uptime in the face of hardware failures and (one hopes) dealing gracefully with large numbers of people pushing the limits of your bandwidth. This is hard enough that you generally want to farm it out.
There are many more examples, and probably much better ones as well. The point here is that even when it looks like the system is working just fine, imperfection is everywhere. The web tolerates this rather than trying to stamp out every last flaw, and in some fundamental ways even builds on imperfection. The result is far more powerful and useful than a computer that never loses at chess or never makes an arithmetical error.
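The protocol-level tolerance for failure shows up directly in ordinary client code. Here's a minimal sketch (the classification buckets are my own illustration, not any particular crawler's) of how a link-checker treats a dangling link or a downed server as an expected outcome rather than an error:

```python
# Classify an HTTP status code the way a tolerant crawler might.
# The web's protocols assume links dangle and servers go down, so a
# fetcher treats those outcomes as data, not as fatal errors.

def classify(status):
    """Map an HTTP status code to a crawler disposition."""
    if 200 <= status < 300:
        return "ok"              # document retrieved
    if status in (301, 302, 307, 308):
        return "moved"           # the information is somewhere else
    if status in (404, 410):
        return "dangling"        # link points at nothing -- allowed!
    if 500 <= status < 600:
        return "server down"     # try again later
    return "other"

for code in (200, 301, 404, 503):
    print(code, classify(code))
```

The point isn't the particular buckets; it's that "not found" is a return value, not an exception to the system's design.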


Postscript: Herb Harris is no longer with us, but the University of Kansas student computing lab bears his name.

Wednesday, December 26, 2007

80% of the solution in a fraction of the time

As can happen, I set out to write this piece once already, only to end up with a slightly different one. Here's another take, bringing Wikipedia into the picture.

First, let me say I like Wikipedia. A quick scan will show I refer to it all the time. I see it as a default starting point for information on a particular topic (as opposed to a narrowly-focused search for a given document or type of document). I don't see it as definitive, but I don't think that's really its job.

Wikipedia would seem a perfect test case for Eric S. Raymond's formulation of Linus's Law ("Given enough eyeballs, all bugs are shallow"). But -- as Wikipedia's page on Raymond dutifully reports -- Raymond himself has said, well, here's how it came out in a New Yorker article:
Even Eric Raymond, the open-source pioneer whose work inspired Wales, argues that “ ‘disaster’ is not too strong a word” for Wikipedia. In his view, the site is “infested with moonbats.” (Think hobgoblins of little minds, varsity division.) He has found his corrections to entries on science fiction dismantled by users who evidently felt that he was trespassing on their terrain. “The more you look at what some of the Wikipedia contributors have done, the better Britannica looks,” Raymond said. He believes that the open-source model is simply inapplicable to an encyclopedia. For software, there is an objective standard: either it works or it doesn’t. There is no such test for truth.
Let's start right there. Software doesn't simply either work or not. You can't even put it on some sort of linear, objective "goodness" scale. Even in cases where you'd think software is cut and dried, it isn't. Did you test that sort routine with all N! permutations of N elements? Of course you didn't. Did you rigorously prove its correctness? How do you know your correctness proof is correct? Don't laugh: Mathematicians routinely find holes in each other's proofs, in some cases even after publication.
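What we actually do with a sort routine is spot-check a correctness property on a random sample of inputs. A sketch of that idea (my own illustration of property-style testing, not anyone's actual test suite) -- far short of proof, but it catches most real bugs:

```python
import random

random.seed(0)  # make the demonstration repeatable

def is_sorted(xs):
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))

def check_sort(sort_fn, trials=1000):
    """Spot-check a sort on random inputs: evidence, not proof."""
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
        out = sort_fn(list(xs))
        if not is_sorted(out) or sorted(xs) != out:
            return False
    return True

print(check_sort(sorted))           # the built-in passes: True
print(check_sort(lambda xs: xs))    # the identity "sort" is caught: False
```

A thousand trials is a vanishing fraction of all N! orderings, which is exactly the point: we settle for confidence, not certainty.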

But most software is nowhere near this regime. Often we don't even know exactly what we're trying to write when we set out to write it (thus much of the emphasis on "agile" development techniques). In the case of something like a game, or even a website design, most of what we're after is a subjectively good experience, not something objectively testable (though ironically games seem to put a bigger premium on basic correctness, since bugs spoil the illusion).

It's not even completely clear when software doesn't work. If a piece of code is supposed to do X and Y, but in fact does Y and Z, does it work? It does if I need it to do Y or Z. What if it hangs when you try to do X, but there's an easy work-around? What if it hangs at random 10% of the time when you try to do X, but that's tolerable and nothing else does X at all? What if it does X if a coin flip comes up heads, but might not if it doesn't? I'm not making that one up. See this Wikipedia article (of course) for more info. What if it's an operating system and it just plain hangs some of the time? Not that that would ever happen.
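The coin-flip case isn't hypothetical: randomized algorithms are only probably correct by design. A standard textbook example (the Fermat primality test -- my choice of illustration, not necessarily the one that article discusses) looks like this:

```python
import random

def probably_prime(n, trials=20):
    """Fermat test: always right about primes, right about composites
    only with high probability (Carmichael numbers can fool it)."""
    if n < 4:
        return n in (2, 3)
    for _ in range(trials):
        a = random.randrange(2, n - 1)
        if pow(a, n - 1, n) != 1:
            return False    # a is a witness: n is definitely composite
    return True             # no witness found: n is *probably* prime

print(probably_prime(97))   # True  (97 is prime; every trial passes)
print(probably_prime(10))   # False (a witness turns up immediately)
```

Does this code "work"? It can, in principle, give a wrong answer on unlucky coin flips, yet people happily build cryptography on tests like it.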

All of this to say that I doubt that software and encyclopedia entries make such different demands on their development process. And as a corollary, I think the results are about the same. Namely, there are excellent results in some cases, reasonable but not excellent results in many cases, and occasional out-and-out garbage.

Here's what I think goes on, roughly, in both cases:
  • Someone comes up with an idea. That person may be an expert in the field, or may just have what looks like a neat idea.
  • The original person produces a first draft, or perhaps just a "stub", or an "enhancement request".
  • If no one with the expertise to take it further is persuaded to do so, it stays right there indefinitely, or may even be purged from the system (perhaps to re-appear later).
  • If the idea has legs, one or more people take it up and make improvements.
  • Typically, one of three stable states is reached:
    • It's perfect. Nothing more than minor cosmetic changes can be added. New ideas along the same lines typically become their own projects.
    • It's good enough for everyone currently involved. That may not be particularly good, but no one can be persuaded to go further. This may be the case right out of the gate, or after several rounds of fixes by a single originator.
    • It's not good enough, but there is no agreement on how to take it further. Work may grind to a halt as competing fixes go in and come out, or the project may split into two similar projects, with better or worse sharing of common material and effort.
Thinking it over, this process is not unique to open source. The magic of the open approach is that the bigger the pool of participants, the bigger the chance that an idea with legs will get supporters and get fleshed out, and the faster it will get to a stable state. In our imperfect world, that stable state is generally short of perfection. Put the two together and you have 80% of the solution in a fraction of the time.

That said, there are some differences between prose and software. I've argued above that software isn't hard and fast. It's soft, in other words. But prose is even softer. As a result, there is greater potential for disagreement on where to go, and in case of disagreement, there looks to be a better chance of thrashing back and forth with competing fixes, as opposed to moving forward but with separate (and to some extent redundant) solutions.

Wikipedia does seem to attract more vandals, but this is not necessarily because it's not software. It may also be because it openly invites frequent edits from a very large pool and changes are moderated after the fact. Open software projects, particularly critical pieces like kernels and basic tools, tend to require changes to pass by a small group of gatekeepers before being checked in. Conversely, some wikis are moderated.

As usual, this is all just my rough "figuring it out as I go along" guess, not anything with actual numbers behind it, but that's my story and I'm sticking to it for now.

Saturday, December 22, 2007

Eyeballs and shallow bugs

Eric S. Raymond has asserted that "given enough eyeballs, all bugs are shallow", a principle he calls Linus's law after Linus Torvalds (my fingers want to type "Linux Torvalds").

How many eyeballs are enough? How many eyeballs are available? What does it take to get a "shallow" bug fixed, checked in and tested? What's a bug, anyway? A few meditations:

How many eyeballs are enough?
Suppose an excellent kernel hacker has a 90% chance of nailing a given bug. What are the chances that two excellent kernel hackers can nail the bug? Well, it's not 180%. It will range from 90% (if the second doesn't know anything new about the particular problem) to 100% (if the second knows everything the first one doesn't).

If the two are completely independent sources of information there's a 99% chance one or the other will nail it. But what are the odds two people got to be excellent kernel hackers by completely independent routes? Now, what are the odds that a bunch of excellent kernel hackers, quite a few more reasonable kernel hackers and a horde of non-specialists could nail the bug? Pretty good, I'd say, but not 100%.
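The arithmetic behind "not 180%" is just the complement rule, under the (unrealistic) assumption that reviewers are independent. A quick sketch:

```python
# If each of n reviewers independently nails a bug with probability p,
# the chance that at least one nails it is 1 - (1 - p)**n.
# Independence is the big "if": correlated reviewers (people who
# learned the kernel the same way) make this an upper bound.

def p_nailed(p, n):
    return 1 - (1 - p) ** n

print(round(p_nailed(0.9, 1), 6))   # 0.9
print(round(p_nailed(0.9, 2), 6))   # 0.99 -- not 180%
print(round(p_nailed(0.5, 10), 6))  # 0.999023 -- mediocre odds compound
```

Note that the formula approaches but never reaches 1, which is the whole argument in miniature: more eyeballs help, but no number of them guarantees a catch.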

How many are there?
I've been doing software for a while now. I've built a few Linux kernels (roughly the equivalent of changing a tire on a car), and I've looked at small portions of the source (roughly the equivalent of opening the hood, pointing and grunting). If some subtle race condition should creep into the next version of the kernel, the odds that I could contribute something useful to the conversation are approximately zero (the automotive equivalent might be, say, being able to help fix a problem in a Formula One engine design).

The most qualified people in such a situation are the small and dedicated core of kernel maintainers and whoever put in the changes that turned up the problem. These may well be the same people. I know a fair bit about software in general, and even a bit about race conditions in general, but I know essentially nothing about the details of the kernel, the design decisions behind particular parts, the hidden pitfalls and so forth.

This being open source, much of that information is available, directly or indirectly. The limiting factor is the ability to absorb all of the above. This takes not only skill but time and dedication. The natural consequence is that, for at least some bugs, there just aren't enough eyeballs available to make them "shallow". Instead, someone will have to expend considerable brain sweat figuring out what happened.

[Another not-infrequent case: Lots of people see a bug, but no one can quite nail down what's causing it, much less suggest a fix. Filing good bug reports takes practice, just like writing good code does. Eliminating all the variables in a typical desktop environment takes time, even for someone with lots of practice. As a result, the people who could fix the bug don't have enough information to go on and probably have bigger fish to fry.]

What does it take to get a shallow bug fixed and tested?
Suppose that the broken code in question (kernel or otherwise) has passed by enough eyeballs that someone has said "Hey, that's easy, you just need to ..." That person puts in a fix. Are we done? No. At a minimum, someone needs to test the fix, preferably someone other than the fixer. Someone should also look over the code change and make sure it fits in well with the existing code. And so forth. Open source doesn't remove the need for good software hygiene. If anything, it increases it.

What's a bug, anyway?
Suppose not just one person steps up with a fix to some bug. Suppose two or three people do. Unfortunately, they don't exactly agree on the fix. Maybe one wants to patch around the problem, one has a small re-write that may also fix some other problems, and another thinks the application wouldn't have such bugs if it were structured differently. Someone else might even argue that nothing needs to be fixed at all.

Expediency will tend to favor the patch, and expediency is often right. The small re-write has a chance if the proponent can convince enough people that it's a good thing. The re-structured system will probably need to be a whole new project, potentially splitting up the pool of qualified eyeballs.


So does this mean that open source is a crock? Not at all. Most of the problems I've pointed out here aren't open source things. They're software things. Open source offers a number of potential advantages in dealing with them. One I think may be overlooked is that writing a system so that anyone anywhere can check it out and build it, and so that several people can work on it simultaneously and largely independently, enforces a certain discipline that's useful anyway. If your code's a mess or no one else can build it and run it, you're not going to get as many collaborators.

On the other hand, open source isn't a magical solution to all of life's problems, and there are arguably cases where you just need someone to say "Today we work on X," or "We will not do Y." Strictly speaking, that kind of control is a separate question from whether the source is freely available, but Linus's law assumes that eyeballs are not being commanded to look elsewhere.

So is Linus's law a crock? Not at all. It captures a useful principle. But like most snappy aphorisms, it only captures an ideal in a world that's considerably messier and more intricate.

Thursday, December 20, 2007

Arguing web architecture with myself

A while ago, talking about web sites as web services in the context of "Ten Future Web Trends," I said:
My guess is that tooling will gradually have more and more useful stuff baked in, so that when you put up, say, a list of favorite books it will be likely to have whatever "book" microformatting is appropriate without too much effort on your part. For example if you copy a book title from Amazon or wherever, it should automagically carry stuff like the ISBN and the appropriate tagging.
Um, why copy book data? This is the web. Make a link with the book title for text, pointing at Amazon or wherever. Anything crawling around the web trying to make sense of this ought to be able to recognize where the link is pointing, chase it and get the other data. All the usual arguments against copying (e.g., difficulty of keeping copies in sync) apply.

Monday, December 17, 2007

Who wants to be on .TV?

I seem to remember -- and maybe this is just my addled memories of Silicon Valley playing tricks on me -- that all the good .com names were supposed to have been snapped up years ago in some great virtual land rush. The only viable alternative was to grab a domain from one of the newly-minted top-level domains, like maybe .biz, or hey wait, there's an island nation called Tuvalu and guess what! Its TLD is .tv!

Station managers! Why use your-call-letters.com when you could use your-call-letters.tv? Why use your-favorite-show.com? Doesn't your-favorite-show.tv sound so much better? In those fin-de-siecle years, dreams and fortunes were made of less.

Current statistics from Name Intelligence, of course, tell a somewhat different story:

TLD     Registered domains (millions)
.COM    71
.NET    11
.ORG     6
.INFO    5
.BIZ     2
.US      1

Stats for other TLDs are harder to track down, but clearly they're at least 2 orders of magnitude behind .com.

There does appear to be another effort in the works to get people buying .tv (the page I linked is a redirect from www.tv). Certain "premium" domains are up for sale at premium prices. Annual fees range from $500,000 for business.tv to $100 for, say, fishness.tv. I was intrigued by rotten.tv, but not $3000 a year worth of intrigued. Non-premium names, I believe, go for a more usual fee of around $25. The full list of 52,000+ premium names makes for somewhat entertaining browsing.

What of Tuvalu? The nation of 11,000 gets a cut of revenues from its agreement with Verisign. Being extremely remote, having few natural resources and having its highest point around 5 meters above (current) sea level, it can certainly use the cash. However, there is concern in Tuvalu that the cut should be larger. From what I can make out, the total comes to around $2M a year, better than nothing, but at around $200 per person certainly not the bonanza one might have hoped for.

The official site for Tuvalu is www.gov.tv, but be aware that their server appears very slow, possibly due to high latency as much as low bandwidth.

UI inertia

I just had an irritating experience on a major retail site. It doesn't really matter which one, or exactly what problem, but here's a brief summary of the case in point: Fairly early in a several-step checkout process, I entered a new form of payment. Then I realized that I had to correct the shipping on one item. No problem, there was a button for that.

A few steps later, I ended up on a page asking me to enter a form of payment, which page seemed to have no memory of the new one I'd just entered. Or anything else I'd ever used, for that matter. Or a button to take me anywhere but forward. So I finally ended up starting over, losing most (but not all) of the other information I'd put in.

Whenever something like this happens, I make a mental checklist. Did the system have all the information it needed to let me fix the problem without losing what I'd put in? Would it have had to guess my intentions? Could the problem have been solved much better with known technology? In this case, and so many like it, the answers are clearly yes, no and yes.

Why does this keep happening? Why do we as an industry seem immune to experience? It's not from lack of trying. From what I can make out, having made most or all of the mistakes myself at one point or another, the cycle goes something like this:
  • The application needs some feature. It might be a shopping-cart UI, or a way to remember configuration, or a database, or whatever.
  • Early on, the requirements don't seem that demanding, and it's crucial to Get The Thing Out The Door. So someone puts together a good-enough first cut.
  • Pain results.
  • For most of the perennial problems, this happens again and again, leading people to develop toolkits. Typically each house grows its own.
  • More ambitious and successful houses venture out to bottle and sell theirs.
  • Again there is pressure to Get The Thing Out The Door (the toolkit this time), so the new solution solves just enough problems to constitute a clear improvement. The state of the art advances by a modest increment.
  • Except that all the apps that had to be pushed Out The Door before the toolkit and its improvements came along are already out the door and thus massively harder to change.
  • As a corollary, more quickly successful products tend to have clunkier interfaces, as there was less time to change them before they became hard to change.
  • Finally, a certain number of houses won't use the latest stuff anyway. Instead they'll use older stuff or roll their own, for a number of reasons, some valid and some not so valid.
It's not impossible to clean up something that's already out the door, but it requires a special blend of skill and patience. It's hard to make a business case for fixing something that doesn't appear badly broken. Generally it will require an upstart competitor to change the risk/reward balance.

In the particular case of web sites, the path of least resistance has been the "fill out a form, push a button, fill out the next form" routine, with a cookie or two to keep track of how far you've gotten in case you need to break off or backtrack. Even that may be ambitious. There are still surprisingly many sites that will make you re-enter an entire form if you mess up one field, or make you start from scratch if you have to go back to step 1. This is far behind what available tools will support.

I'm not entirely against step-by-step processes. Sometimes they work better than alternatives like "tweak stuff until you like what you see and then press 'go'". They at least leave little doubt as to what to do next. Which combination of approaches to use when is a matter of skill, taste and empirical testing with real people.

Whatever the approach, there is always a surprisingly high portion of stuff out there that just seems like it ought to have been better, that has problems that were identified, and solved, ten or twenty years ago. It's easy to conclude that people just must not know what they're doing, but I don't think that's a big part of the story. Rather, there seem to be fairly strong forces (in particular the door-outward force) tending to allocate resources to ensure that the end product is just good enough, and no better.

One of the best takes on this I've seen is in Richard Gabriel's classic Lisp: Good News, Bad News, How to Win Big. Gabriel is mainly talking about LISP, and he makes a lot of good and interesting points. In section 2.1, "The Rise of Worse is Better", however, he argues more generally that while we may want to do The Right Thing, a system that doesn't has much better survival characteristics. To throw a little more fuel on the fire, Gabriel's canonical "worse is better" system is UNIX. Naturally, it's section 2.1 that everyone quotes.

Thursday, December 13, 2007

But what if I don't want Coke to be my friend?

This is old news by now, but I wanted to explore it anyway.

A couple of years ago a younger relative, then still in college, showed me Facebook. At the time I didn't really get the concept, but I figured there was probably something to it since my relative thought it was pretty cool. On the other hand, my first instinct was to wonder about privacy.

At the risk of dating myself (but then, I do give my age in my profile), I cut my social net.teeth on BBSs and Usenet. On BBSs, people almost universally went by handles and not by their real names, a practice that almost certainly traces back to CB culture. On Usenet, you went by your email address and .sig, which could be revealing (david.hull@myschool.edu) or concealing (mysterious@whoknows.com or even an12345@anon.penet.fi) according to your choice and what you could get your local admins to go along with.

In either case you were faceless (at least at the time) and there was at least a good possibility if not an outright expectation that your name would be made-up. Coming from that perspective, and knowing that every single post to alt.stuff.hairy.hairy.hairy and rec.pigeon-fanciers is preserved in amber for all time, I was somewhat taken aback by the idea of a site that not only showed your real name, but your picture and whatever other personal information you chose to put up.

Why would people do this, I thought? Well, first, not everyone's as camera-shy as I am. From talking to people, there also seems to be the perception that if you're doing something on your computer in the privacy of your room, you're doing it in your room and not on the net. Finally, though, I suspect that, even though the information is readily available, people may not fully appreciate the distinction between one's friends (a human-sized hand-picked list) and one's network (a city-sized collection of people you mostly wouldn't know from Adam).

As an aside, there's some interesting graph theory to be explored in modeling social networks. Your PhD awaits ...

Where this all comes to a head is the economics of identity. I've argued already that anonymity services must be understood in economic terms, particularly regarding the value each party places on anonymity. Conversely, a service like Facebook, MySpace or LinkedIn is a veritable mother lode of marketing data, much of it apparently available for free.

I say "apparently" because if this data is really free, it's almost certainly mispriced. Mispricing means arbitrage opportunity, and arbitrage opportunity means new, fair price. At a first guess, the value of belonging to a social network is something like
  • Convenience of being able to keep in touch with your friends and vice versa
  • Minus hassle of being in touch with people you'd rather not be in touch with
  • Plus joy of discovering interesting things about other people
  • Minus embarrassment of realizing that your prospective employer can see those pictures of you doing jello shots which seemed like such a good idea to post
  • Plus value of learning about cool new stuff advertised on the site
  • Minus pain of advertisements you don't want
Users can manage the first two by tweaking their friend lists. People appear to be growing more savvy about divulging personal information. One source tells me that juniors are now advised to scrub their Facebook sites a good year in advance of graduating and entering the job market.

The last two are, as I understand it, going through an interesting period of adjustment. One method of adjustment is for users to vote with their feet and stop using the service (or stop using it as much). This makes the site as a whole, and advertising on it in particular, less valuable. Presumably, really cool advertisements could make a service more attractive, and annoying ones could have the reverse effect.
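For what it's worth, the back-of-the-envelope calculus in the list above is easy to caricature in code. Here's a toy model; every name and weight is invented purely for illustration, not measured from anything:

```python
# Toy model of the net value of belonging to a social network,
# following the plus/minus list above. All weights are made up.

def net_value(convenience, unwanted_contact, discovery,
              embarrassment, cool_ads, ad_pain):
    """Sum the pluses and subtract the minuses from the list above."""
    return (convenience - unwanted_contact + discovery
            - embarrassment + cool_ads - ad_pain)

# A hypothetical user, before and after scrubbing embarrassing photos
# and pruning the friend list:
before = net_value(convenience=5, unwanted_contact=2, discovery=3,
                   embarrassment=4, cool_ads=1, ad_pain=2)
after = net_value(convenience=5, unwanted_contact=1, discovery=3,
                  embarrassment=0, cool_ads=1, ad_pain=2)
print(before, after)  # -> 1 6; the user sticks around while this stays positive
```

The point of the sketch is just that users control some of the terms (the friend list, the photos) while the site controls others (the ads), which is where the period of adjustment comes in.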

I haven't really investigated any of this in detail, other than reading a couple of articles in the popular press, and I'd be particularly interested in comments.

Tuesday, December 11, 2007

Undead technology

I did a double-take just now when I followed a link to a PDF file and noticed the Kinkos/FedEx logo in my PDF viewer. When did that happen? [Imagine a quick Google search here] Looks like the deal was announced back in June.

OK, so if I'm reading something perfectly well online, I have the option of sending it off to my local copy shop, having it printed out and then venturing out to go pick it up. Or, I suppose, I could have it FedExed to my doorstep.

Something tells me I'm not in the right market niche for this.

I could, however, imagine FooCorp emailing a bunch of, say, nice glossy marketing collateral from the Oceania headquarters to the Eurasia headquarters, where it would then be printed at the local shop, collated and bound and delivered to the appropriate desks. Clearly Adobe and Kinkos/FedEx think that people will want this, and who am I to say them nay?

At the risk of sounding like an, um, broken record, it seems that certain technologies have not yet figured out that they're supposed to be dead. Text isn't dead. It's now a verb. Print appears to be doing just fine, as well.

Monday, December 10, 2007

Why is there still print?

The Newsweek article on Kindle quotes Jeff Bezos as saying "Books are the last bastion of analog." I take his point, but it seems an odd statement. Text, after all, is arguably the first real digital medium. What he means by "digital", of course, is "available to computers". Unlike music and video, which are now routinely released in computer-readable form, books are still released in a form you can't just download. Bezos aims to change this with the Kindle.

The interesting question is, why does print resist digitization so well? I've suggested that publishers like it because it provides copy protection, but why does it? The answer has to be economic, not technical. Technically, it's trivial to digitize a book. Just scan it in. Don't bother to try to convert the image back to text. If all that people want to do with the result is read it, the image should work fine.

There's an interesting subplot here. Optical character recognition (OCR) seems to do fairly well these days on well-printed books, judging by Google Books and Amazon's own "Search inside the book" feature. On the other hand, the fully general problem of reading anything a person can make out still appears to be hard, which is why sites use distorted text CAPTCHAs to try to stop bots. This seems like the equivalent of an X Prize for freelance OCR hackers, and indeed the inevitable arms race appears to be well under way. Finally, bringing us full circle, one source of these CAPTCHAs is printed text that failed to scan correctly.

In any case, the difficulty doesn't seem to be digitizing text in a readable form. The problem is, what do you do with it once you've got it? It's technically trivial to scan a book, but it still takes some time and effort to flip through all the pages, at least without expensive specialized equipment. So if I've done this, I'd like to see some compensation -- assuming I don't mind violating copyright laws.

Can I put it on the web and sell it? Well, um, I've just brought it into digital form, thereby making it hugely easier to copy. In other words, I've just put myself in the position of the publisher whose print-based copy protection I've just broken. If copy-protection is out, there's always advertising. Except that's maybe not such a good idea given that I've just broken the law.

This same argument would seem to act as a counterbalance to all sorts of unauthorized copying, but obviously it doesn't apply as effectively to audio and video. This is probably because copying CDs and DVDs is much, much easier than scanning books, and also because books are simply a different medium. I'd expect that PhDs have already been earned on just such matters.

Kindle and print

While looking for something else, I ran across the November 26 issue of Newsweek. The cover story was on Amazon's new Kindle e-book. Conveniently enough, the article is available online.

Overall I found the article pretty evenhanded, balancing the "print is inherently inefficient" side with the "books are inherently special" side. As usual, I think both sides have valid points. A few thoughts:
  • Yes, print is inherently inefficient. That doesn't mean it will die anytime soon. People still sent hand-delivered messages long after the telephone became widespread. Steam trains ran long after the diesel came along. Western Union only recently shut down its telegraph service.
  • On the other hand, it's hard to imagine print not giving way to bits over time and eventually reaching niche status. My completely unfounded guess is that it will end up more like blacksmithing than buggy whips.
  • Amazon is right to recognize that it's not enough just to have an electronic device that more or less looks like a book. The Kindle is not just a device but a service. Along with searchability and the potential for hyperlinks, Amazon hopes the killer app will be the "buy and read it right now" feature. Push a button (and pay Amazon a fee generally less than you'd pay for print) and the Kindle will download whatever book you like. Whether this is enough to pull people in remains to be seen, but it at least seems plausible.
  • The Kindle relies on copy protection, presumably using some Trusted Computing-like facility. I've argued that it's not unreasonable to expect a special-purpose device to give up programmability in an attempt to lock down copy protection. Again, it will be interesting to see how well this works.
  • Conversely, print has a nice, well-understood copy protection model. Copying a book means physically copying pages. In theory this is quite breakable. In practice it works well (so far). Publishers naturally like this. It would be interesting to try to quantify how much this convenience to publishers is extending the lifetime of the book, as opposed to the "nice to curl up with and read" aspect.

Monday, December 3, 2007

Text and technological change

In the previous post on literary texts and hypertexts, I had meant to make a fairly mundane point, but got sidetracked in the fascinating details of the particular texts I was using as examples. At least, I found it fascinating.

The mundane point was this: The web has given rise to new textual forms, things like wikis, blogs and for that matter ordinary HTML web sites. Some aspects of these are new, but the general notion of a text as a multi-layered, interlinked structure, possibly with multiple authors and a less-than-clear history, is not.

This is a general point, not limited to literary criticism. For example, the question of what constitutes a derived work, important in software copyrights, musical sampling and elsewhere, has a long history. This history happens to include Ulysses -- one of the charges leveled against the 1984 corrected text of Ulysses was that the publishers had pushed to include as many corrections as possible in hopes of obtaining a new copyright.

We sometimes like to think that a new technology changes everything, that the old rules cannot possibly apply because the game itself is so different. Technology does change things, but our social tools for coping with the change remain largely the same. The flip side of this is that the problems a new technology appears to raise are often older than one might think.

Sunday, December 2, 2007

Literary texts and hypertexts

An old literary chestnut: Just what is a text? Three texts to consider:

Beowulf: Often cited as the oldest extant text in the English language, this blood-soaked tale comes to us through a single manuscript dating to around the year 1000. The manuscript was damaged in a fire in 1731, and was not transcribed until 1786.

Since then it has deteriorated further, leaving the 1786 copy as the only source for some 2000 letters, though modern imaging techniques have also helped in reconstruction. The 11th-century manuscript itself is written in two different hands and appears to be a working copy. The story clearly draws on centuries of oral tradition. It is an open question to what extent the particular telling is a transcription of a spoken saga or a literary work in its own right.

The poem begins: "Hwæt! We Gardena in geardagum, þeodcyninga, þrym gefrunon, hu ða æþelingas ellen fremedon." The language has changed a bit since then, and as a result most people read it in modern English translation (or wait for the movie to come out).

What is the text of Beowulf? Is it an old saga, as captured by a long-ago author? The words written on the old manuscript? The manuscript itself, or as accurate a reconstruction as we can manage? The story, as well as we can render it in modern language?



Hamlet: We all know at least the beginning of Hamlet's famous soliloquy: "To be, or not to be? I, there's the point/To die, to sleep, is that all? I, all ..."

Wait, that can't be right, can it? Ah ... I must have been reading from the infamous "bad quarto". I should have been reading the first folio: "To be, or not to be, that is the question ..."

The first folio is generally considered the best starting point for Shakespeare's plays. Its own introduction warns of inferior editions, and indeed many of the plays, Hamlet included, exist in several different forms.

There have been attempts to harmonize the various sources into some sort of "ideal" version. Whether this works or not is a matter of opinion. Arguably, there is no one text in such cases, but unfortunately a stage production has to work from a single script, whether a harmonized one or one of the original editions.

On the other hand, if you consider the play as a play, and not just a text, whose production is definitive? The Globe theater is open for business again, but the original production company is no longer available for engagements.


Ulysses: Beowulf and Hamlet are old texts. It's no surprise that they might present problems. What about a work that was published in the 20th century by an author who lived another 19 years after its publication, producing corrections and commentary along the way? The text, of course, is James Joyce's Ulysses.

Ulysses presents a few special problems. It was originally published in serial form, until one of the installments ran afoul of US obscenity laws. It was then published as a book by Shakespeare and Company with, by most accounts, thousands of typos (just how many thousands depends on whom you ask).

The language of Ulysses is not as obscure as that of Joyce's last work, Finnegans Wake, but it is full of invented compounds like hismy and snotgreen, which traditional proofreading would tend to break up, directly against the author's wishes. Further complicating matters, the original setting relied heavily on Joyce's handwritten notes on the galleys. Many of these are now lost.

The result was a series of printings that everyone knew had significant numbers of errors, perhaps minor and perhaps not, but which no one knew exactly how to correct, particularly after Joyce's death. An attempt in 1984 drew worldwide attention but also scathing criticism.

The 1961 edition appears to be the most popular today, but further corrected editions are promised, and "genetic" study of the various versions of the text is a thriving field in its own right. You can find more here and here.

So once again, what is the text?


What does any of this have to do with the web (other than all three texts being available on the web in some form or other)? One common thread here is that a text is a deeper thing than just a series of words printed on a page. Scholarly editions have recognized this for centuries, by means of devices like footnotes, marginal comments, glossaries, bibliographies and so forth.

Some of these techniques predate printing, but all of them work even better as hypertext. It's not surprising that there are well-developed web sites available for all three works. The new technology is a natural fit for the old problems.

It's also nice that, with serious scholarship being made available on the web, the average reader can see up close the kind of dense, interlinked historical structure that used to be available only in sparsely-circulated journals. This is basically the literary equivalent of putting scientific material up on the web, and the benefits are similar as well.

Saturday, December 1, 2007

You could be Spartacus and not even know it.

All anonymity systems require large numbers of people to send traffic, or perhaps more accurately, the fewer the people sending traffic, the less anonymity the system can possibly provide. This is particularly a problem when setting up a system everyone has an incentive to use. Essentially, parties that value anonymity highly need people who value it less to use the system anyway in order to provide cover traffic.

This paper (also available on Freehaven's site) suggests ways of using various HTTP trickery to get people to send traffic, and to carry other people's traffic, on the system without even knowing they're doing so.

This isn't a matter of obscure loopholes. It's using things like redirects, cookies and JavaScript that pretty much everyone has enabled and which would be a royal hassle to turn off. On the other hand, you would have to visit particular sites controlled by the system.

It's not clear how much of a practical problem this is, but it's yet another thing to keep in mind when pondering just how secure the web might be.

On "On the economics of anonymity"

Since posting "On the economics of anonymity", I ran across a paper of the same title. Being an actual research paper, it goes into considerably more depth than my brief take. Some of the high points:
  • An anonymity service can be viewed as a "public good." A public good is something that everyone can use without diminishing someone else's ability to use it. Street lights are a classic example.
  • In most public good scenarios, "free-riding" (using the good without paying for it) is a problem. In anonymity systems, free riders can also be good, since they provide cover traffic.
  • You're more anonymous as a node than as a free-riding user.
  • Parties that value anonymity highly have good reason to become nodes, despite the higher costs. Everyone else might as well just use the system.
  • More nodes means less traffic per node, means less anonymity.
  • It appears difficult to set up an anonymity system that everyone will have an incentive to use, particularly if you're starting from scratch.
For more details, please see the paper itself. It, and many other goodies on anonymity, are available on Freehaven's excellent Selected Papers in Anonymity.
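To make the free-rider point above concrete, here's a toy sketch of my own (not the paper's actual model): if an observer can only narrow a message down to "one of the system's current users", then each user's anonymity can be crudely measured in bits of uncertainty, and free riders fatten that number for everyone:

```python
# Crude illustration of why free riders help an anonymity system:
# anonymity here is measured as log2 of the "anonymity set", i.e.
# the number of users a message could plausibly have come from.
# This is a sketch of the intuition, not a real metric.

import math

def anonymity_bits(num_users):
    """Bits of uncertainty an observer faces about a message's sender."""
    return math.log2(num_users)

core_users = 10      # hypothetical parties who value anonymity highly
for free_riders in (0, 90, 990):
    bits = anonymity_bits(core_users + free_riders)
    print(f"{free_riders:4d} free riders -> {bits:4.1f} bits")
    # prints 3.3, 6.6 and 10.0 bits respectively
```

Which is exactly the paradox: the people paying nothing are still contributing something, namely cover.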

Tuesday, November 27, 2007

On the history of the telephone

Here's a brief but hopefully not-too-distorted history of the telephone:

In the late 1800's various people hit on the idea of using electricity to transmit sound over wires. It's not entirely clear who did what first, and there is quite a bit of litigation at the time, but in 1875 Bell is granted a patent for "Transmitters and Receivers for Electric Telegraphs". By that point, the basic premise is in place: extend the existing communications technology (wires) to carry a new medium (sound).

At first, telephones are confined to early adopters in places like national capitals and Deadwood, South Dakota (there's a reason they called the dot-com madness a "gold rush"). Connections are originally point to point but exchanges are introduced very soon, providing a means of scaling the network up. To make a call via an exchange, you call the operator there, who then connects you to the person you're trying to call -- or to another exchange, ideally closer to your desired party, if that person doesn't belong to yours.

Adoption proceeds rapidly and vast fortunes are made, but full saturation takes decades even in industrialized countries. Beyond the basic technological leap of transmitting analog sound instead of bits, technological progress is incremental: better phones, switches to automate the exchanges, standards for phone numbers, area codes and so forth.

As the technology matures, reliability becomes a concern. Other features -- conference calls, call waiting, touch-tone dialing and such -- are nice to have, but can be dispensed with as long as you can just pick up a phone and expect it to work. The possible exceptions I'd cite would be the answering machine and voice mail, which are more in the "how did we ever do without that?" category. Caller ID is another possible candidate. It definitely changes the interaction, but if I had to pick I'd probably go with voice mail. Your mileage may vary.

Obviously there are several similarities and contrasts to be drawn between the telephone and the web. One that I'd like to draw out here is the pattern of a world-changing new invention followed by incremental refinements.

The idea of serving hypertext over an internet aimed more at file transfer and remote logins shook things up. Compared to that, Web 2.0 concepts like tagging, microformats and social networking seem more like refinements. Useful refinements, to be sure, and ones whose combined effect will help make the web in 2010 noticeably different from the 2000 edition, but I don't see them as revolutionary. One could make a case that improvements in bandwidth (I won't say "broadband" because current broadband will look like a joke in ten years) will have more effect.

Granted, if you put enough incremental improvements together you end up with a qualitative change. Long distance calling today is an entirely different thing from the "have my operator call your operator" scenario I described above, and as a result the world is in a certain sense a smaller place. Nonetheless, I would expect the future history of the web to have relatively few "on this date ..." major moments and more "by the 2010's ..." summaries of progress.

Saturday, November 24, 2007

On the economics of anonymity

I'm still trying to get a good handle on the economics of anonymizers [and I'm not alone -- see here for pointers to a discussion in greater depth]. The first clear point is that clients use the service to offload risk, namely the risk of being associated with some particular activity on the web (three guesses what the most popular activity appears to be). When risk is transferred, there will generally need to be some kind of compensation. This is a basic economic proposition, one that's been back in the headlines lately.

But just where is this risk going? The first guess is the exit nodes. After all, it's the exit nodes that actually contact the services being used and would seem to have the most explaining to do if The Man starts asking questions. They also appear relatively easy to find. For example, if I continually send anonymous messages to myself, I should expect to hear from every exit node sooner or later (if the routing prefers a particular path for a particular client or server, compared to random chance, that could be used to narrow down the identities of one or both).
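Incidentally, the "send messages to myself" probe is the classic coupon-collector problem: if each request exits through one of N nodes chosen uniformly at random (an assumption on my part; a real system may well weight the choice), seeing all of them takes about N ln N requests on average. A quick simulation sketch:

```python
# Sketch of the self-addressed-message probe described above, modeled
# as a coupon-collector process. Node selection is assumed uniform,
# which real anonymizers need not guarantee.

import random

def probes_to_see_all(num_exits, rng):
    """Count self-addressed requests until every exit node has appeared."""
    seen = set()
    probes = 0
    while len(seen) < num_exits:
        seen.add(rng.randrange(num_exits))  # which exit this request used
        probes += 1
    return probes

rng = random.Random(42)        # fixed seed so the sketch is repeatable
n = 50                         # hypothetical number of exit nodes
trials = [probes_to_see_all(n, rng) for _ in range(200)]
average = sum(trials) / len(trials)
expected = n * sum(1 / k for k in range(1, n + 1))  # n * H_n, about 225
print(f"average {average:.0f} probes; theory predicts about {expected:.0f}")
```

In other words, for a network of modest size, an attacker with patience and a mailbox can enumerate the exits in a few hundred requests.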

However, if The Man is really trying to find out who's on the other end of the connection, busting the exit node operator is not going to help, except perhaps to weaken the network as a whole. There may be jurisdictional problems as well. This pushes the search back to the clients.

Where we go from here probably depends on exactly how you analyze the anonymizer in question. Let's assume that The Man can make a better-than-random guess as to who is or isn't using the anonymizer. This seems very likely if relatively few people are using it. This will include pure clients, who only use the anonymizer but don't relay traffic or act as exit nodes, as well as the exits and relays themselves, who as far as I can tell have no way of proving they're not also clients.

Under this assumption, and all other things being equal, the risk is spread evenly among all the nodes, whatever their type. In that case, risk is certainly being transferred, namely from those with more to lose from exposure to those with less, but in a perfect anonymizer it's impossible to tell who is which. The basic arbitrage opportunity is there, but there appears to be no way to exploit it.

Or at least, no way for an outside observer to exploit it. If I'm, say, running a relay node but also using the anonymizer to do something truly hairy, I can be reasonably sure I have more to gain than someone just sitting at work perusing material that violates company policies. In effect, most of the clients are acting as a smokescreen for my activities. That in turn makes it worth my while to contribute greater-than-average resources to the network. At least, if I can do so without anyone noticing.

That seems a plausible story, but I'm not at all confident that I've understood the full implications here.

Wednesday, November 21, 2007

Clouds and onions

Hal Finney comments, correctly, that the story I told in Anonymous Three-Card Monte misses a couple of significant points. So here's an attempt to rectify that.

Generally, when you use an anonymizer, you talk to the anonymizer -- that is, to some set of hosts participating in the anonymizing. After several carefully managed intermediate steps the anonymizer -- that is, some participating host -- talks to the service you're really interested in. To that service, it looks like the request came from the IP address of that host, not yours, because that's indeed who's talking to it.

One way to look at this is to consider the anonymizer as a "cloud". You don't really know what goes on inside. An outside observer would see traffic between you and the cloud and between the cloud and the real service. It would also see a lot of random encrypted traffic among the hosts in the cloud, but as long as there are enough users (or at least computers spitting out random encrypted bits and pretending to be users) for the "I'm Spartacus" effect to kick in, that outside observer can't connect you to your real service.

Good anonymizers use multiple hops inside the cloud, each of which is unaware of the rest of the chain, to provide multiple layers of protection, like layers of an onion.
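For the curious, the layering itself can be sketched in a few lines. Fair warning: the XOR "cipher" below is a toy stand-in to keep the example self-contained, and real onion routing uses per-hop public-key cryptography. The structural point survives, though: the sender wraps the message once per hop, and each hop peels exactly one layer, learning only the next hop:

```python
# Toy onion layering. The keystream_xor "cipher" is NOT real
# cryptography; it just makes the wrapping and peeling concrete.

import hashlib

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256-derived keystream. The same call both
    encrypts and decrypts. A toy, not a secure cipher."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

def build_onion(message: bytes, hops):
    """hops is a list of (key, next_hop_name) pairs, first hop first.
    Each layer is encrypted with that hop's key and names the next hop
    (an 8-byte field, so names must be short in this toy)."""
    onion = message
    for key, next_hop in reversed(hops):
        onion = keystream_xor(key, next_hop.encode().ljust(8, b" ") + onion)
    return onion

def peel(onion: bytes, key: bytes):
    """A node removes its own layer, learning only the next hop."""
    plain = keystream_xor(key, onion)
    return plain[:8].decode().strip(), plain[8:]

hops = [(b"k-relay1", "relay2"),  # relay1's key; relay1 forwards to relay2
        (b"k-relay2", "exit"),    # relay2 forwards to the exit node
        (b"k-exit", "dest")]      # the exit node contacts the real service
packet = build_onion(b"GET /real-service", hops)
for key, _ in hops:               # each hop in turn peels one layer
    next_hop, packet = peel(packet, key)
    print("forward to:", next_hop)  # relay2, exit, dest in turn
print("exit node sees:", packet)    # only now is the request visible
```

Note that relay2 never sees the request, and relay1 never learns the destination, which is the whole trick.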

The hosts, called "exit nodes", that talk to real services have to use their own IP addresses. Because of this, an outside observer could say "at such and such time, someone at this IP address connected to this Very Bad Site." If you're just using the anonymizer, but not participating in the cloud, there's zero chance that your IP address will be connected directly to the Very Bad Site. In effect, the exit nodes have collectively taken on that particular risk for you.

On the other hand, if you're using an anonymizer, you should probably pessimistically assume that an outside observer could tell which hosts are in the cloud. That is, you should assume that people can tell you're using the service. You should also take care that you use an encrypted connection to your real service. The exit node can only do what you (indirectly) ask it to, and if you don't ask it to use encryption, someone watching could say "I don't know who's at the other end of this connection, but whoever it was logged into this Very Bad Site under the name of ..." Caveat browsor.

So where does that leave the original story?

Well, IP addresses are being pooled, but among exit nodes, not among exit nodes and users. If you're an exit node, your IP address will be directly associated with the activity of random people whom, if the system is working, you have no way of identifying. This means you may have some explaining to do, more or less as described in the punchline. And you may have less explaining to do if your node has an IP address from a country that doesn't keep close track of who's using what address. Assuming there are such.

If you're running an anonymizer and not charging money for it, you might consider requiring anyone who uses the service to be prepared to host an exit node as well [It's not clear how you'd convince someone they wanted to do that. See this later post, for example.]. This, arguably, distributes the risk fairly. As a corollary, it also produces the "big pot of IP addresses" scenario that I originally described.

However, if you're just using such a service and not acting as an exit node, you shouldn't have to explain much more than why you're using an anonymizer. Beyond that, you can shoot yourself in the foot in a variety of ways, whether by failing to encrypt your connection to your real service, by giving away more information than you think you are, by confiding in someone who turns out not to be who you thought they were, or by some similar mistake. But the anonymizer can't help you there.

The larger point here is that you should be good and sure you understand what an anonymizer can and cannot do before you decide to use one.

Tuesday, November 20, 2007

Barristers and bloggers

Picking up where I just left off ...

Some professions seem fairly immune to technological change. The law is one. As the man said, lawyers find out still litigious men. If automobiles supplant buggies and consign the buggy whip makers to a small niche, chances are everyone involved will want to consult a lawyer sooner or later.

Which brings up a question: In the spectrum from buggy whip makers through blacksmiths, brewers and bakers to lawyers, where do writers fit in? My fond hope is that it's closer to the lawyer end (at least in terms of viability), and I think there's some evidence for that.

The odds seem good that there will continue to be viable business models in which writers get paid, whether it's through advertising, or as part of the production of interactive games and experiences or perhaps some other way. Certainly people still seem interested in text and in scripted entertainment.

And yet the writing game must surely be changing. Consider blogging. That's something radically new and different, right? Well, it depends. Certainly the medium is new, but just how has it changed the game?

For example, this blog, along with many others, is basically a column. The genre has been around for quite a while. The present example owes as much to E. B. White (at least as a model to strive toward) as it does to the pioneers of the web (to whom it also owes much).

What about political bloggers, with their game-changing, king-making, deal-breaking influence? Is this a new phenomenon, or is it just political activists -- players in another very old game -- making use of the latest technology? (Let me add that when I say the game is old, I'm not claiming that all political bloggers are working for a particular party. Grass-roots activism has its own long pedigree.)

What about the celebrity and gossip blogs? Again, I'd argue that's an old genre in a new medium, and similarly for music blogs, personal journals and much if not all of the other material I've run across in the blogosphere.

What about the web of reactions among blogs? Surely this is new, could only have happened on the web. Well, no and yes. No, because deliberative exchanges in writing are most likely as old as writing itself. But yes, because the ability to quickly build up such a discussion, and to easily navigate through it later, is new and has a very web-ish flavor.

So what am I trying to say here?
  • Writing as a profession seems to benefit from the web, rather than being marginalized by it.
  • The web offers new media for writing, but the genres are probably largely the same.
  • Web media offer new possibilities but, IMHO, the similarities to old media are at least as significant as the differences.
On that last point, I might liken the situation to 3-D movies vs. traditional ones. Yes, there is a difference, but the basic experiences are more similar than different.

Buggy whips and blacksmiths

Pity the poor buggy whip, the icon of technology's scrap heap. Do you miss the sturdy heft of the old WE302 telephones? Alas, they've gone the way of the buggy whip. Can't stand the latest annoying gadget? Don't worry, the paradigm will soon shift and it, too, will go the way of the buggy whip.

Just what is a buggy whip? As the name implies, it's a small whip used to drive the horses pulling a buggy or carriage. Buggy whip manufacture used to be a prominent industry, but that changed when the automobile came along.

The implication is that when a new technology comes along, older ones are left in the dust. Consider blacksmithing. Look up at the older buildings in many cities and you're likely to see a lot of wrought iron (wrought iron is worked by hammers and such, while cast iron takes its shape from the mold it's poured into). That iron was likely worked by small armies of blacksmiths under the supervision of a master smith.

Blacksmithing was an important profession anywhere there was iron, which was a large portion of the world. In smaller towns, the smith would also act as a farrier, shoeing horses. But all that's gone the way of the buggy whip. With newer machining and manufacturing processes available, why would anyone take the time to work iron by hand, at least in the industrialized world?

Except ... blacksmiths are still very much around, and doing reasonably well for themselves. What do they do? Apart from producing pure sculpture, they build fences, handrails, window bars, fireplace tools, weather vanes and anything else that can usefully be made of wrought iron. Generally a hand-wrought item will cost more than something from the local big-box store, but it will also look better, custom-fit the site and provide a one-of-a-kind design. Enough people like that enough to keep modern blacksmiths in business.

The same has happened with many of the traditional crafts. Witness the resurgence of local breweries and bakeries, which are now called microbreweries and artisan bakeries, much as guitars are now called acoustic guitars. There are any number of other examples. Free associating from "acoustic guitars", drum machines were supposed to put drummers out of work, but they didn't.

That's not to say that new technology is necessarily good for old technology. There are, after all, many fewer blacksmiths, brewers and bakers than there used to be. But neither is it a death sentence. It's also worth noting that many modern blacksmiths use gas forges and power hammers, and state-of-the-art brewing and baking equipment is, well, state-of-the-art.

Not even the buggy whip has gone the way of the buggy whip, if that way is supposed to be extinction. They're still made, just not as many or by as many people.

What does this have to do with the web? I'm getting to that ...

Monday, November 19, 2007

Scented junk mail. Oh dear.

Apparently, the advent of email has reduced the volume of snail mail. I say "apparently" because my own mailbox never seems empty. In an effort to counteract this trend, the British Royal Mail has, on advice from an Oxford consulting firm, opted to try "reinventing" mail, taking it from a two-dimensional medium to a "three-, four- or five-dimensional medium."

I'm not making this up. You can read it here.

How to do this? Traditional mail is aimed at the visual system because, well, the visual system seems particularly well tuned to the kind of information we want to convey with mail. That's why we have text. But in this modern, digitized age, that's not enough. Modern mail must be enhanced by adding elements of sound, smell and taste.

The Royal Mail is on the case with a sales force of 300 dedicated to helping businesses develop "noisy, smelly junk mail" -- and to deciding that they need to send it in the first place (that's probably not the designation the consultants at Brand Sense had in mind, but it seems apt).

As far as I can tell, the underlying rationale is that the mail needs to compete with email, and its unique advantage lies in being able to engage all the senses. Since email can send sound and video just fine -- more conveniently, one could argue -- that really leaves smell, taste and touch.

On the radio piece I heard, the Brand Sense spokesman described using scent not just in a literal way, as with perfume or dish soap, but more abstractly. Use citrus scents if you want to convey freshness and excitement, for example. Intriguing, to be sure, but just what need are we trying to fill here?

The whole idea of the state-run mails competing with email seems strange. If there's less need to send paper around thanks to email, that's a good thing, not a problem that needs to be solved by inventing new kinds of paper to send around, much less expending resources actively trying to convince people to do so.

Mind, I expect the consultants would have a different take.

Sunday, November 18, 2007

An editorial note

On re-reading my first post on Richard Stallman and trusted computing, I found myself unsatisfied with the way I had represented FSF's position on software copyrights. My first impulse was to fix the text, and indeed I did just that.

The result was even more unsatisfying, not because it was wrong -- as far as I can tell it was better -- but because, even though I clearly noted that I'd made the change, it just didn't fit with my view of what a blog should be.

This is a blog, not a wiki. On a wiki, the edit history is readily available. On a blog, it's not (even to me, as far as I can tell). In this case, I could
  • Quietly change the text. I do this routinely with typos I catch on re-reading, or missing or inconsistent tags, or prose I just don't like. For example, I've tightened up the punchline of Anonymous three-card monte at least twice. But in this case, the change was substantive. Quiet, substantive changes seem out of bounds.
  • Make the change and mark it as such. That's what I originally tried, but that left only my description of the original text, and that didn't seem right, either.
  • Use strikethroughs, italics and such to show the changes explicitly. Frankly, by the time I considered that, I was tired enough not to want to bother with it. It would give the fullest disclosure, but it would also be well down the road of trying to make a blog into a wiki.
So instead I put the text back as best I could and put in a note linking to a later post that (in my opinion) handled the topic better. This seems like a good balance to me, and I think I'll stick with it. Purely editorial changes will continue to go in quietly, in a lame attempt to present myself as a more careful writer than I actually am.

Substantive mis-steps will stay in place [but I may comment on them later --DH 9 Sep 2010]. If a later post or comment adds something significant to an existing post -- whether the existing post is wrong or for some other reason -- I'll try to put in a note the next time I review. Naturally, the backlink feature is useful here as well, but a nice, visible [italicized note] should make things clearer.

That is all. We now return to our regularly scheduled programming.

Friday, November 16, 2007

Sixty-year-old computer slower than modern emulator. Film at 11.

I suppose that's not really a fair summary of this BBC article on a commemoration of the cracking of Nazi codes by Colossus, one of the first modern computers (depending on one's definition of "computer"). Still, it hardly seems surprising that an emulator running on a laptop would be faster than the 1940's original.

Re-assembling Colossus and getting it running, though -- that's a neat hack. Especially since the original machines were cut into pieces after the war.

Laptop orchestras. You read that right.

The University of York has been getting publicity lately for its Worldscape Laptop Orchestra, currently billed as the world's largest, though not the first. Others include the Moscow Laptop Cyber Orchestra and Princeton's PLOrk. Create Digital Music has a good summary. [There doesn't seem to be a good permalink for the Worldscape site yet -- I'll have to remember to fix the link when there is one][I've updated the link from York Music's home page to the press release for Worldscape. They still don't seem to have their own page, which leads me to wonder if they're still around].

So just what is a laptop orchestra? A bunch of people clicking "play" on some mp3 files and listening to the results? Not at all. Worldscape and its cousins are bona fide orchestras, making live music, often collaborating with more traditional instrumentalists and at least in the case of Worldscape, requiring a conductor. There is also at least one club sponsoring open jam sessions where anyone can show up with their gear, plug in and play.

The key here is the interactive element. An instrument in a laptop orchestra isn't just spewing out pre-programmed bits. It's responding to the musician's input, whether through specialized controllers, gestures grabbed by a video camera, or whatever else. As with any other orchestra, the musicians respond to each other, to the conductor (if any) and to the audience. The result is a genuinely live musical performance.

One telling detail: How do you record a laptop orchestra? You might think you'd just capture the digitized sounds the laptops are producing and mix them down. That's certainly possible, but if you want to capture the experience, it's better just to put mics in the house and record what the audience is hearing.

That's not to say you couldn't do the same thing online. I've heard of small-scale live musical collaborations over the net (though I can't remember where). I suspect, however, that keeping an orchestra of fifty in sync online is going to be a problem. I doubt you could just put everyone on one big Skype conference call, but if it's been done on that scale I'd be glad to be proved wrong.

Wednesday, November 14, 2007

A bit of clarification on anonymity

In previous posts (like this one, this one and maybe this one) I've taken a fairly skeptical tone concerning anonymizers and such. I wanted to take the opportunity here to clarify that a bit.

It might seem that I think that tools like anonymizers are a waste of time or that only miscreants are likely to use them. That's not what I think.

There are certain situations where anonymity is extremely valuable. Real journalism requires anonymous sources. Some crimes and abuses will only be exposed if those in the know -- including both victims and perpetrators -- can come forward without risk of identification. Political action sometimes requires anonymity. The Federalist Papers come to mind.

So when I take aim at certain quirks and pitfalls of anonymity, I'm not trying to write off anonymity entirely. I'm just trying to point out aspects of anonymity on the web that are trickier than they might seem (and therefore, frankly, fun to write about).

Wikipedia's angle on anonymous IP addresses

I'm not sure when this kicked in, but the message you now get when you edit a page anonymously is intriguing ...
You are not currently logged in. While you are free to edit without logging in, be aware that doing so will allow your IP address (which can be used to determine the associated network/corporation name) to be recorded publicly, along with the dates and times at which you made your edits, in this page's edit history. It is sometimes possible for others to identify you with this information. If you create an account, you can conceal your IP address and be provided with many other benefits. Messages sent to your IP can be viewed on your talk page.
So in other words, if you have a user name, you're more anonymous than if you don't. It's an interesting angle.

From its beginnings, Wikipedia has been beset by anonymous vandals who find out about Wikipedia's "anyone can edit" ethos and think "Whoa, dude, I can totally write 'My math teacher sucks' here and no one will know who did it", or something similar, but generally less sophisticated and coherent.

Fortunately, a number of Wikipedians have taken it upon themselves to make life better for the rest of us by constantly scanning the change logs for such drivel and reverting it. One does occasionally run across vandalized pages, but in general vandalism gets reverted within seconds. And may I join the rest of the community in repeating my sincere thanks for that.

With that for background, it's easy to see why the community would want to discourage anonymous editing in the first place. On the other hand, it wouldn't do to ban it entirely. Anonymous editing (and editing by registered users who, erm, forget to log in from time to time, not that anyone would do that ...) is a valuable part of the process. Trying to prevent it while still promoting anything like an open culture would be an exercise in frustration as vandals worked out ways of gaming the system anyway.

And thus the current formulation, part carrot -- register and you can create your own persona and reap other benefits -- and part stick -- misbehave and people may well be able to track you down. Oh look: all those nasty edits to the page on FooCorp are coming from BarCorp's IP addresses.

It also warns legitimate anonymous editors that they may not be as anonymous as they think. If you're blowing the whistle on FooCorp, do it from a cybercafe or public library, not from your office at FooCorp (well, you knew that anyway, didn't you).

I'm Spartacus!

It's one of the great scenes in film, one that, if you're like me, you've seen even though you haven't actually seen the film itself. After wreaking havoc on the Roman armies, Spartacus and his followers are finally defeated and captured. It's clear that Spartacus is destined to die a horrible death -- if only the Romans can figure out who he is.

To get his followers to rat Spartacus out, the Romans promise leniency to the person who will identify him. Instead, those assembled stand up one by one and shout "I'm Spartacus!"

That's anonymity in a nutshell. Spartacus might be anyone in the crowd. If the idea was to single Spartacus out, the Romans are no closer than they were to begin with. When you use an anonymizer, you're in much the same situation. It's not hard to establish that you might be the person who engaged in some particular interaction, but if the anonymizer is doing its work, there's no way to tell that it was you in particular and not some other user. Everyone's in the same boat.

This "everyone's in the same boat" factor lends anonymity a peculiar flavor. Looking at it from that angle, why would I use such a service if I didn't feel I had more to lose than the average user? This in turn will tend to throw the average user in with a fairly interesting crowd. I'm guessing here, of course. There aren't a lot of reliable usage statistics available.

I'm also guessing that most people using anonymizers aren't up to anything particularly nefarious and either value privacy on principle or just like the concept. How does that square with the previous point? Probably most people figure there's safety in numbers. Whatever those involved stand to lose, there is presumably a smaller chance they will lose it than if each operated alone.

"Sure, there may be some bad apples in the crowd, but they can't arrest all of us just to find them. And if they come for me, I can prove I'm not up to anything bad."

At which point it might be worth pointing out that in the film, Spartacus and his followers end up crucified.

(A side note: Not only do the Romans know no more than when they started, everyone now knows this. It's a neat case of common knowledge in action. By contrast, in the classic "question them separately" scenario, the person being interrogated has no idea who has said what to whom.)

(Another side note: The real Spartacus most likely died in battle. The whole scene is just a nice bit of dramatic license.)

[And finally ... this later post in the anonymity thread references Latanya Sweeney's work in anonymity, specifically the notion of an "anonymity set", which formalizes the intuition that the more people you could be mistaken for, the more anonymous you are. Another later post references Alessandro Acquisti, Roger Dingledine and Paul Syverson's work on the economics of anonymity, drawing on the economic notion of a public good.]

Tuesday, November 13, 2007

Middle ground on GPS and privacy

Let's assume that GPS evidence becomes generally admissible in court. It's already worked a few times. Besides the case I mentioned, there has been a similar case in Australia (my thanks to two anonymous commenters for the pointer).

So how is this going to work? I get busted for speeding. I bring a printout to court from my GPS saying I was doing the speed limit. The judge says "and how do I know you didn't just fabricate this?" That's not going to work.

On the other end, we have the case I first mentioned, where the GPS coordinates are getting beamed back to some third party for perusal. The GPS itself is presumably tamper-resistant. I'm presuming this because the evidence stood up in court, and because there are existing GPS applications, such as monitoring commerce and monitoring people under house arrest, where tamper-resistance is at a premium.

That ought to work just fine, but who wants to run around with a GPS reporting their every move, just to get out of a possible speeding ticket? The stepson in the case certainly didn't. He just didn't have much choice.

Fortunately, there's a middle ground. A tamper-resistant (and probably tamper-evident) unit that can provide its logs if asked (ideally with proper authentication), but doesn't just broadcast them. As far as I can tell (and again I haven't done the legwork here), that's what happened in the Australian case.

This seems like a decent paradigm for Trusted-Computing-like devices that use techniques like strong encryption and special hardware to try to ensure that everything is what it appears to be. As with the music/video case, the trusted device performs a specialized function and doesn't need to be highly upgradeable.

Unlike the classic TC scenario, the trusted device is not in frequent communication with the mothership. Its job is to hold sensitive data and divulge it only when I ask. Or more accurately, when someone who can prove they know a particular secret asks. Much like a personal datastore.

Monday, November 12, 2007

Anonymous three-card monte

Practically everything that happens on the net has an IP address attached to it. That's even a decent working definition of the net: anything that happens with an IP address attached.

You can find out a lot from an IP address (you can find out about yours here). IP addresses are typically tied at least to an ISP and a location near your actual address, typically in the same town or one nearby. In some cases they can be nailed down more exactly.

If The Man decides to subpoena your ISP, your ISP can generally provide your exact house address from their records. Even without cooperation from the ISP a dedicated snooper, working for The Man or otherwise, could compile a record of what sites your IP address connected to and, depending on the exact sites, find out all sorts of things about the person or persons using that address, possibly including their identities.
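As a quick illustration of how much a bare IP address gives away, even a simple reverse-DNS lookup -- sketched here with Python's standard `socket` module -- often hands back a hostname that names the ISP or organization outright:

```python
import socket

def whose_address(ip):
    """Reverse-DNS lookup: map an IP address back to a hostname.

    The hostname frequently names the ISP or organization behind
    the address, which is the first thing a dedicated snooper
    would check.
    """
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        return None  # no reverse record published for this address

# whose_address("127.0.0.1") typically returns "localhost"; a home
# broadband address usually maps to something like
# "cpe-xx-yy.region.example-isp.net" (a made-up name, but the shape
# is typical), revealing the ISP and often the rough location.
```

That lookup uses only public DNS records; a subpoena to the ISP, as described above, gets from the hostname to the exact account.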

Naturally, not everyone is comfortable with that. Even someone with little to hide may still want to keep it hid, if only out of principle. As a result, there are several services available promising anonymity.

This approach is not without its pitfalls. The site I mentioned above has a pretty good rundown on this. Basically, if you are using an anonymizing service, you are investing a pretty high level of trust in it. Good anonymizers recognize this and take steps to ensure that not even they know what's going on ... an interesting business to be in, to say the least. But hey, Swiss banks seem to do OK.

Now when you visit a site through an anonymizer, that site still has to see some IP address. Otherwise the protocols just don't work. Since you're anonymous, it can't be your IP address, so whose is it? They can't just make one up. Someone else might already be using it, resulting in various havoc. One approach is to grab a block from some lightly-regulated area. Hmm ... this site sure is getting a lot of traffic from Elbonia these days ...

Another is to take the IP addresses of all the people using the service (and there had better be a bunch -- an anonymizer with only one user is not fooling anybody) and throw them in a big pot. When you go to visit a site, you get an address out of the pot [As Hal Finney points out below, this is somewhat oversimplified, but let's go with it.  See this followup for a more accurate picture --D.H. May 2016].
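Taking the (admittedly oversimplified) pot metaphor at face value, the mechanics might look something like this sketch. The pool entries are made-up documentation addresses, and a real anonymizer works quite differently, but it shows why a site on the receiving end can't pin a visit on any one user:

```python
import random

# Hypothetical pool of user addresses (RFC 5737 documentation
# ranges, not real hosts). In the "big pot" picture, every user
# contributes an address and requests go out under a random one.
pool = ["203.0.113.5", "198.51.100.7", "192.0.2.9", "203.0.113.44"]

def visit(site):
    """Route a request through an address drawn at random from the pool."""
    exit_ip = random.choice(pool)
    return f"{site} sees a request from {exit_ip}"

# Any user's request may appear to come from any pooled address,
# so the site can't tell which user was behind a given visit --
# but, as the post goes on to note, your address may turn up on
# someone else's visit, too.
```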

So you decide to use such a service to, well, it's not any of my business, is it? Someone else decides to use this service to visit a Very Bad Site because, well, they don't want anyone to find out, now do they? When they do this, the service happens to pick your IP address out of the pot.

Then The Man comes a-knocking. Your story is: No sir, I was not using that site. Someone else was using my IP address to visit that site. No, I don't know who. You see, I use an anonymizer that switches my IP with other people's. Why? With all due respect, that's none of your business, sir.

Best of luck with that. Bear in mind that The Man is not always known to appreciate the subtleties of such arguments.

Sunday, November 11, 2007

One teenager's dilemma (and ours)

I heard this on the radio the other day ...

The stepfather of a teenage boy, concerned about the stepson's driving, has a GPS installed in his car. The GPS reports back to its mothership and Dad can log in to check up. It will also email Dad if the car exceeds a given speed. This happens once, resulting in a 10-day loss of car keys.

Not surprisingly, the stepson is not entirely thrilled with the arrangement.

Then one day the stepson gets a speeding ticket. Radar has him going 60+ in a 45. GPS says he was doing the speed limit. As the radio story airs, Dad is in the process of challenging the ticket in court, on the grounds that the GPS is much more reliable than radar. The stepson still hates the GPS, but admits that, just this once, maybe it's not such a bad thing.

And there's the whole privacy dilemma in a nutshell: We'd love to have the cameras running when it benefits us, but the rest of the time, whether we're misbehaving or just being our normal boring human selves, we'd just as soon be left alone.

This is not an entirely new problem, of course. Privacy concerns have been around as long as people have lived around each other, which is pretty much as long as there have been people.

Modern privacy concerns are not so much about whether your neighbor knows what you're up to, but about who gets to be your neighbor and the balance of power between eavesdropper and eavesdroppee. From time to time, technology disrupts that balance (anyone remember party lines?) and society has to work out new rules to reclaim it.

One could make a reasonable theoretical argument that in a rational society, everybody benefits if everybody knows everything about everyone. But society is made up of people and people aren't rational. If the only choices are complete surveillance and complete privacy, I would tend to side with the stepson on this one and go for privacy. Those aren't the only choices, though.

Saturday, November 10, 2007

Trusted computing: What could be better?

The fundamental tension behind trusted computing is over programmability. Someone sending out protected content wants to be sure that it can only be accessed on a restricted set of particular devices. This is a lot easier if the devices in question are not highly programmable. In the case of a portable music player or set-top box, the keys involved can be kept in special tamper-resistant hardware and otherwise protected from exposure or modification.

If your playback device is a general-purpose computer, the game becomes a lot harder. I could send you a player application with a key branded into it, but there are any number of ways to get such a player to yield up its secrets, or yield up the unprotected content without having to uncover the secrets themselves.

The trusted computing model tries to combat this by restricting access to the information on a given computer and tightly controlling all modification to such an otherwise-programmable device. In other words, the vendor asserts control over programmability. It is this idea, not the idea that the creator of content should have control over the content, that fundamentally conflicts with the ideas (and ideals) of personal computing in general and free software in particular.

The TC model, depending on tight control of all possible modifications, is inherently fragile. Compare it to the models used in modern cryptography (on which it heavily relies). In modern cryptography, one makes extremely pessimistic assumptions about what will happen in practice.

For example, in designing a cipher, one typically assumes an adaptive chosen-plaintext attack. This means that the attacker can repeatedly choose a message to be encrypted, look at the resulting ciphertext, choose another message to be encrypted and so on. This did not come about by accident. There are various ways a real-world attacker can perform such an attack on a real-world cipher.

Cipher design, and robust engineering in general, assumes that anything that can go wrong will. This generally means minimizing the number of dependencies and moving parts. The RSA cipher, for example, consists of raising a number representing the message to a known power and taking the remainder against a large number, called the modulus. The modulus (along with a second exponent used to decrypt the message) is derived from two large, randomly-chosen prime numbers by a simple recipe.

That's it. That's one of the most secure ciphers known. But even with that simple recipe there are known subtleties in choosing a good key and in preparing messages for encryption in order to avoid various attacks.
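To see just how simple the recipe is, here's a toy sketch using the classic textbook numbers -- real keys use primes hundreds of digits long plus careful message padding, exactly the subtleties mentioned above:

```python
# Toy RSA with textbook-sized numbers. This illustrates the recipe
# only; it is nowhere near secure at this key size.

p, q = 61, 53            # two (far too small!) randomly-chosen primes
n = p * q                # the modulus: 3233
phi = (p - 1) * (q - 1)  # 3120
e = 17                   # public exponent, chosen coprime to phi
d = pow(e, -1, phi)      # private exponent: 2753 (Python 3.8+)

message = 65                  # a number representing the message
cipher = pow(message, e, n)   # encrypt: m^e mod n -> 2790
plain = pow(cipher, d, n)     # decrypt: c^d mod n -> 65

assert plain == message
```

Raise to a power, take a remainder, done. The hard part, as the post says, is everything around that recipe: key generation and message preparation.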

Trusted computing relies on five key technologies, which interact in various ways to provide the full model. You need hardware support in several places to even have a chance at making it all work. There are legitimate questions about how all this will affect basic system functions like backup. It's quite clear that any TC system will be actively attacked by hackers in both senses (I shouldn't get started on this, but I still like to think of "hacker" as meaning someone who does clever things with technology for the sake of learning and having fun; the more popular meaning is someone who tries to break into systems).

It doesn't seem like a good bet.

Trying to prevent or control modifications to a general-purpose computer is swimming upstream. The main driver here is to protect content like music and video. That requires a tamper-resistant decoder (and faith that this is a worthwhile exercise, despite analog reconversion). From this point of view, TC tries to enable general-purpose computers to become decoders by first making them tamper-resistant.

The alternative is not to try to make general-purpose computers into decoders. If my computer has an encrypted-bits-to-sound-and-video decoder attached to it, then I can reprogram my computer all I want, and I can make as many copies of protected content as I want. When I want to play a song or video, I send it to my decoder, which has all the attributes TC wants: it's tamper-resistant, non-programmable and has a private key embedded in it as tightly as modern technology will allow.

I can use my favorite software to index the content that I've bought the rights to, to sequence it, to dispatch it to the various decoders I own and so forth. I can use my favorite non-media software without having to worry about what measures my OS vendor is taking to control my use of the content I bought the rights to.

This is not too far from how current content-delivery systems like cable and satellite boxes work, as I understand it. Given that, it's not clear to me how much farther we need to go down the TC road.