Monday, December 31, 2007

This code comes with ABSOLUTELY NO WARRANTY etc., etc.

Kinda lame, I know, but I didn't have time to make it really lame.


#include <stdio.h>
#include <unistd.h>
int main(){char *s="#tcg[\"ygP\"{rrcJ23";while(*s++);s--;int _=0,j=9,c=42;
while(*s!='#'){putchar(_++<2||_>298?*--s-2:_&16?8:_&15?46:j--+48);
if(!(_%32-10)){fflush(stdout);sleep(1);}}putchar(10);}

Saturday, December 29, 2007

Radiohead: Just what is going on here?

A few months ago Radiohead put out its latest album, In Rainbows, in two formats. You could get a "discbox" with a vinyl pressing, bonus CD, album art, etc. for a fixed price, or you could just download the tracks as .mp3 files and throw whatever you wanted (or nothing) in the tip jar.

So what happened? Who paid what? Do we have a whole new paradigm for music sales? An interesting one-off experiment? An out-and-out boondoggle? All or none of the above?

It's hard to see In Rainbows as a whole new paradigm, if only because there are so many special circumstances. Radiohead is an established band with a loyal following and a formidable reputation. The band happened to be between record contracts for this album. The downloadable version was a companion to a more traditional offering. And Radiohead is Radiohead. What works for them might not work for anyone else.

In fact, it may not even have worked for them. Certainly the band saw the online release as a one-off. They are currently in negotiations with both record labels and iTunes, and the download offer has been discontinued (effective New Year's Eve).

Beyond that, the picture gets very muddy very quickly. One report, by Gigwise, claims downloads of at least 1.2 million copies of the album. How much did people pay? Those who know aren't talking, but one survey indicates an average of 4 pounds (about $8), with 1/3 of downloaders paying nothing. Another, hotly disputed by the band, suggests that 62% paid nothing and the average price across all downloads was $2.26.

So basically, we don't know how many copies were downloaded, how much people paid for them, whether the price paid changed over time, why there's a discrepancy between the two surveys, or much of anything else. Except that the band most likely pulled in millions of dollars for the downloads, some further amount for the discboxes, and expects further income from traditional distribution. That's not even counting the T-shirt sales. Not bad for a bunch of guys from Oxfordshire.
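To put rough numbers on "millions of dollars", here is a back-of-the-envelope sketch using only the figures quoted above. It assumes, purely for illustration, that the Gigwise count of 1.2 million downloads holds under either survey's average price. Nothing here is confirmed by the band, and the function name is my own.

```python
# Back-of-the-envelope download revenue, using only figures quoted in this post.
# Assumption: the 1.2 million download count applies under both surveys.
DOWNLOADS = 1_200_000

# Average price per download across all downloaders, in US dollars.
AVERAGE_PRICE_USD = {
    "first survey (about $8)": 8.00,
    "second survey ($2.26)": 2.26,
}

def estimated_revenue(downloads, average_price):
    """Total revenue if every download brought in the average price."""
    return downloads * average_price

for label, price in AVERAGE_PRICE_USD.items():
    print(f"{label}: ${estimated_revenue(DOWNLOADS, price):,.0f}")
```

Under those assumptions the total lands somewhere between roughly $2.7 million and $9.6 million, which is about as precise as the available data allows.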

If the tip-jar/download approach is not obviously the future of music distribution, but it's not a massive flop either, what is it, and why is the band discontinuing it? Is it the tip of the iceberg, or an evolutionary dead end like the million dollar homepage?

My guess, and it's only a guess, is that the tip-jar model is not going to dominate, though it might not disappear entirely. Rapper Saul Williams is currently spinning a variation on this with a low-bandwidth version of his latest release. The hi-fi version is available for a $5 donation.

Why is the band discontinuing the offer? My recollection from the original page is that it was never promised indefinitely in the first place. Most likely the band has gotten the good out of it. I would expect that people who care enough to pay also care enough to download early (which might explain some of the discrepancy between the two surveys). The band also seems not to have burned its bridges with traditional distribution channels, and continuing the "It's up to you" offer would only muddy the waters there.

Thursday, December 27, 2007

100 and (I hope) counting

According to the "Blog archive" heading, this will be my 100th post to this blog. Stephen Jay Gould took a similar opportunity to tell us, finally, about his field work on Bahamian land snails. I'm more with Eubie Blake, who celebrated a 100th birthday and who said "If I'd known I was going to live this long, I would have taken better care of myself."

I won't be writing about my equivalent of Bahamian land snails -- I wish I had something so interesting to draw on -- but about something more apropos of Eubie Blake. Blake, as it turns out, really only lived to be 96 (most of us should be so lucky), and that seems as good a point as any to pick up a thread that's been running through this blog more or less from the beginning: imperfection.

When electronic computers first entered the popular consciousness sometime after World War II, their defining property was perfection. If the hero needed the answer to an intractable problem, the computer was always there, ticking away impassively. On the darker side, the flawless, emotionless and relentless android, aware of its own perfection and our human inferiority, was a stock villain.

The computer was the ultimate in modernism. Its rise coincides, perhaps not coincidentally, with the shift from modernism to whatever we're in now, variously postmodernism or late modernism, depending on whether you want to emphasize change (how modern) or continuity.

The notion of the all-knowing perfect computer dissolves rapidly on contact with actual computers. One of my early experiences in computing was meeting my dad's friend Herb Harris, who ran a computing facility in the building. I vaguely recall watching cards being punched and read, but I definitely recall suggesting that you could use a computer to store everything in the encyclopedia (and therefore, all of human knowledge).

Herb loaned me a book I still have, somewhere, on programming the IBM 360. He also gently prodded me to consider what putting an encyclopedia in a computer would mean, particularly the question of how you would find the information once you got it there. To give you an idea of the hardware of the time, the book contained a recipe for doing decimal multiplication by use of a multiplication table you could read in from external storage. I concluded that the problem was harder than it looked, but still ought to be at least partially solvable, with somewhat better hardware. Maybe I'd get back to it later ...

Now we have vast collections of textual material available via computer, and we have at least one usable way of finding the information that's there. We even have encyclopedias on line. All this information, its storage and its retrieval deal intimately in imperfection. Some examples:
  • Dangling links are explicitly allowed in the web. This is not an accident but a basic tenet of web architecture. Allowing links to point to nothing means that you don't have to build a whole site at once, or even know that it will ever get built. Among other things, dangling links are a key part of the wiki editing experience (not as much fun if you just want the information, though).
  • The underlying protocols the web is built on assume that messages routinely get dropped or duplicated in transit (TCP), that the information you are looking for may in fact be somewhere else (HTTP), or that the server you're ultimately trying to reach may be down (HTTP again).
  • Documents are given logical addresses, not physical addresses, on the assumption that information may be physically moved, without notice, at any time. For that matter, computers themselves also generally go by logical names. There is no one perfect physical realization of the web.
  • The web inherently doesn't assume that any given document is the last word on a given subject. Search engines generally give you some idea of how well-connected a page is, but this can change over time and in any case it's only a hint. Anyone can comment on a page and incorporate that page by reference.
  • You can't take anything on the web at face value, or at least you shouldn't invest too much faith in a page without considering where it came from (which you don't always know) and how well it jibes with other sources of information. This sort of on-the-fly evaluation quickly becomes a reflex.
  • From a purely graphical point of view, there is no definitive format for a given web page. If you try to lock everything down to the last pixel it will generally look bad on displays you didn't have in mind. If you don't, it's up to the browser at the other end to decide what it looks like, and with CSS and other tools, the viewer can have almost unlimited leeway. Nothing is perfect for everyone, so we try to get close and allow for tweaks after the fact.
  • A key part of running a successful web site is managing details like backup, maintaining uptime in the face of hardware failures and (one hopes) dealing gracefully with large numbers of people pushing the limits of your bandwidth. This is hard enough that you generally want to farm it out.
There are many more examples, and probably much better ones as well. The point here is that even when it looks like the system is working just fine, imperfection is everywhere. The web tolerates this rather than trying to stamp out every last flaw, and in some fundamental ways even builds on imperfection. The result is far more powerful and useful than a computer that never loses at chess or never makes an arithmetical error.
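As a small illustration of the HTTP items in the list above, here is a sketch of how a client might read the outcomes the protocol treats as routine. The function name and the wording of the messages are my own. A real client would also have to cope with failures below HTTP, such as timeouts and dropped connections.

```python
from http import HTTPStatus

def what_happened(status_code):
    """Describe, in the web's own terms, how a request for a page turned out."""
    status = HTTPStatus(status_code)
    if status in (HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.FOUND,
                  HTTPStatus.TEMPORARY_REDIRECT):
        return "somewhere else: follow the Location header"
    if status in (HTTPStatus.NOT_FOUND, HTTPStatus.GONE):
        return "dangling link: nothing is there"
    if status == HTTPStatus.SERVICE_UNAVAILABLE:
        return "server down or overloaded: try again later"
    if status == HTTPStatus.OK:
        return "found it"
    return f"something else: {status.phrase}"

for code in (200, 301, 404, 503):
    print(code, "->", what_happened(code))
```

Note that only one of the four outcomes is "it worked". The other three are ordinary, named cases rather than exceptional ones.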


Postscript: Herb Harris is no longer with us, but the University of Kansas student computing lab bears his name.

Wednesday, December 26, 2007

80% of the solution in a fraction of the time

As can happen, I set out to write this piece once already, only to end up with a slightly different one. Here's another take, bringing Wikipedia into the picture.

First, let me say I like Wikipedia. A quick scan will show I refer to it all the time. I see it as a default starting point for information on a particular topic (as opposed to a narrowly-focused search for a given document or type of document). I don't see it as definitive, but I don't think that's really its job.

Wikipedia would seem a perfect test case for Eric S. Raymond's formulation of Linus's Law ("Given enough eyeballs, all bugs are shallow"). But -- as Wikipedia's page on Raymond dutifully reports -- Raymond himself has said, well, here's how it came out in a New Yorker article:
Even Eric Raymond, the open-source pioneer whose work inspired Wales, argues that “ ‘disaster’ is not too strong a word” for Wikipedia. In his view, the site is “infested with moonbats.” (Think hobgoblins of little minds, varsity division.) He has found his corrections to entries on science fiction dismantled by users who evidently felt that he was trespassing on their terrain. “The more you look at what some of the Wikipedia contributors have done, the better Britannica looks,” Raymond said. He believes that the open-source model is simply inapplicable to an encyclopedia. For software, there is an objective standard: either it works or it doesn’t. There is no such test for truth.
Let's start right there. Software doesn't simply either work or not. You can't even put it on some sort of linear, objective "goodness" scale. Even in cases where you'd think software is cut and dried, it isn't. Did you test that sort routine with all N! combinations of N elements? Of course you didn't. Did you rigorously prove its correctness? How do you know your correctness proof is correct? Don't laugh: Mathematicians routinely find holes in each other's proofs, in some cases even after publication.
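To make the N! point concrete, here is a minimal sketch of what "testing every ordering" would look like. The sort routine and the helper names are stand-ins of my own, not anything from a real project.

```python
from itertools import permutations
from math import factorial

def insertion_sort(items):
    """A deliberately simple sort routine to put under test."""
    result = []
    for item in items:
        position = len(result)
        while position > 0 and result[position - 1] > item:
            position -= 1
        result.insert(position, item)
    return result

def passes_every_ordering(sort, n):
    """Run `sort` on all n! orderings of 0..n-1 and check each result."""
    expected = list(range(n))
    return all(sort(list(p)) == expected for p in permutations(range(n)))

print(passes_every_ordering(insertion_sort, 6))   # 720 orderings: feasible
print(factorial(20))                              # orderings for N = 20
```

Six elements means 720 runs, which is nothing. Twenty elements means about 2.4 quintillion, which is not going to happen. And even the exhaustive version only covers distinct values; it says nothing about duplicates, other element types or other lengths.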

But most software is nowhere near this regime. Often we don't even know exactly what we're trying to write when we set out to write it (thus much of the emphasis on "agile" development techniques). In the case of something like a game, or even a website design, most of what we're after is a subjectively good experience, not something objectively testable (though ironically games seem to put a bigger premium on basic correctness, since bugs spoil the illusion).

It's not even completely clear when software doesn't work. If a piece of code is supposed to do X and Y, but in fact does Y and Z, does it work? It does if I need it to do Y or Z. What if it hangs when you try to do X, but there's an easy work-around? What if it hangs at random 10% of the time when you try to do X, but that's tolerable and nothing else does X at all? What if it does X if a coin flip comes up heads, but might not if it doesn't? I'm not making that one up. See this Wikipedia article (of course) for more info. What if it's an operating system and it just plain hangs some of the time? Not that that would ever happen.
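One simple illustration of the coin-flip case (a textbook-style example, not necessarily the one the linked article uses) is a randomized search that probes a few positions at random and gives up if none of them pans out. The function names and the half-and-half test data are assumptions of mine.

```python
import random

def find_marked(items, target, tries, rng=random):
    """Probe `tries` random positions; return an index holding `target`, or None.

    A None result does not prove the target is absent, only that the coin
    flips did not go our way.
    """
    for _ in range(tries):
        index = rng.randrange(len(items))
        if items[index] == target:
            return index
    return None

def miss_probability(fraction_marked, tries):
    """Chance that every independent probe misses."""
    return (1 - fraction_marked) ** tries

items = ["a", "b"] * 500            # half the positions hold the target
print(find_marked(items, "a", tries=1))
print(miss_probability(0.5, 1), miss_probability(0.5, 20))
```

With one probe this fails half the time. With twenty it fails less than once in a million runs. Does it "work"? That depends entirely on how many misses you can live with.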

All of this to say that I doubt that software and encyclopedia entries make such different demands on their development process. And as a corollary, I think the results are about the same. Namely, there are excellent results in some cases, reasonable but not excellent results in many cases, and occasional out-and-out garbage.

Here's what I think goes on, roughly, in both cases:
  • Someone comes up with an idea. That person may be an expert in the field, or may just have what looks like a neat idea.
  • The original person produces a first draft, or perhaps just a "stub", or an "enhancement request".
  • If no one with the expertise to take it further is persuaded to do so, it stays right there indefinitely, or may even be purged from the system (perhaps to re-appear later).
  • If the idea has legs, one or more people take it up and make improvements.
  • Typically, one of three stable states is reached:
    • It's perfect. Nothing more than minor cosmetic changes can be added. New ideas along the same lines typically become their own projects.
    • It's good enough for everyone currently involved. That may not be particularly good, but no one can be persuaded to go further. This may be the case right out of the gate, or after several rounds of fixes by a single originator.
    • It's not good enough, but there is no agreement on how to take it further. Work may grind to a halt as competing fixes go in and come out, or the project may split into two similar projects, with better or worse sharing of common material and effort.
Thinking it over, this process is not unique to open source. The magic of the open approach is that the bigger the pool of participants, the bigger the chance that an idea with legs will get supporters and get fleshed out, and the faster it will get to a stable state. In our imperfect world, that stable state is generally short of perfection. Put the two together and you have 80% of the solution in a fraction of the time.

That said, there are some differences between prose and software. I've argued above that software isn't hard and fast. It's soft, in other words. But prose is even softer. As a result, there is greater potential for disagreement on where to go, and in case of disagreement, there looks to be a better chance of thrashing back and forth with competing fixes, as opposed to moving forward but with separate (and to some extent redundant) solutions.

Wikipedia does seem to attract more vandals, but this is not necessarily because it's not software. It may also be because it openly invites frequent edits from a very large pool and changes are moderated after the fact. Open software projects, particularly critical pieces like kernels and basic tools, tend to require changes to pass by a small group of gatekeepers before being checked in. Conversely, some wikis are moderated.

As usual, this is all just my rough "figuring it out as I go along" guess, not anything with actual numbers behind it, but that's my story and I'm sticking to it for now.

Saturday, December 22, 2007

Eyeballs and shallow bugs

Eric S. Raymond has asserted that "given enough eyeballs, all bugs are shallow", a principle he calls Linus's law after Linus Torvalds (my fingers want to type "Linux Torvalds").

How many eyeballs are enough? How many eyeballs are available? What does it take to get a "shallow" bug fixed, checked in and tested? What's a bug, anyway? A few meditations:

How many eyeballs are enough?
Suppose an excellent kernel hacker has a 90% chance of nailing a given bug. What are the chances that two excellent kernel hackers can nail the bug? Well, it's not 180%. It will range from 90% (if the second doesn't know anything new about the particular problem) to 100% (if the second knows everything the first one doesn't).

If the two are completely independent sources of information there's a 99% chance one or the other will nail it. But what are the odds two people got to be excellent kernel hackers by completely independent routes? Now, what are the odds that a bunch of excellent kernel hackers, quite a few more reasonable kernel hackers and a horde of non-specialists could nail the bug? Pretty good, I'd say, but not 100%.
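The arithmetic above can be sketched in a few lines. This assumes complete independence, which is the optimistic case; the 50% and 5% figures for the "reasonable" hackers and the non-specialists are numbers I made up for illustration.

```python
def chance_someone_nails_it(individual_chances):
    """Probability that at least one person finds the bug, assuming independence."""
    everyone_misses = 1.0
    for p in individual_chances:
        everyone_misses *= (1.0 - p)
    return 1.0 - everyone_misses

# One excellent hacker, then two, then two plus a crowd (made-up figures).
print(chance_someone_nails_it([0.9]))
print(chance_someone_nails_it([0.9, 0.9]))
print(chance_someone_nails_it([0.9, 0.9] + [0.5] * 5 + [0.05] * 100))
```

The total creeps toward 100% but never gets there, and any overlap in what people know pulls the real figure back toward the 90% end of the range.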

How many are there?
I've been doing software for a while now. I've built a few Linux kernels (roughly the equivalent of changing a tire on a car), and I've looked at small portions of the source (roughly the equivalent of opening the hood, pointing and grunting). If some subtle race condition should creep into the next version of the kernel, the odds that I could contribute something useful to the conversation are approximately zero (the automotive equivalent might be, say, being able to help fix a problem in a Formula One engine design).

The most qualified people in such a situation are the small and dedicated core of kernel maintainers and whoever put in the changes that turned up the problem. These may well be the same people. I know a fair bit about software in general, and even a bit about race conditions in general, but I know essentially nothing about the details of the kernel, the design decisions behind particular parts, the hidden pitfalls and so forth.

This being open source, much of that information is available, directly or indirectly. The limiting factor is the ability to absorb all of the above. This takes not only skill but time and dedication. The natural consequence is that, for at least some bugs, there just aren't enough eyeballs available to make them "shallow". Instead, someone will have to expend considerable brain sweat figuring out what happened.

[Another not-infrequent case: Lots of people see a bug, but no one can quite nail down what's causing it, much less suggest a fix. Filing good bug reports takes practice, just like writing good code does. Eliminating all the variables in a typical desktop environment takes time, even for someone with lots of practice. As a result, the people who could fix the bug don't have enough information to go on and probably have bigger fish to fry.]

What does it take to get a shallow bug fixed and tested?
Suppose that the broken code in question (kernel or otherwise) has passed by enough eyeballs that someone has said "Hey, that's easy, you just need to ..." That person puts in a fix. Are we done? No. At a minimum, someone needs to test the fix, preferably someone other than the fixer. Someone should also look over the code change and make sure it fits in well with the existing code. And so forth. Open source doesn't remove the need for good software hygiene. If anything, it increases it.

What's a bug, anyway?
Suppose not just one person steps up with a fix to some bug. Suppose two or three people do. Unfortunately, they don't exactly agree on the fix. Maybe one wants to patch around the problem, one has a small re-write that may also fix some other problems, and another thinks the application wouldn't have such bugs if it were structured differently. Someone else might even argue that nothing needs to be fixed at all.

Expediency will tend to favor the patch, and expediency is often right. The small re-write has a chance if the proponent can convince enough people that it's a good thing. The re-structured system will probably need to be a whole new project, potentially splitting up the pool of qualified eyeballs.


So does this mean that open source is a crock? Not at all. Most of the problems I've pointed out here aren't open source things. They're software things. Open source offers a number of potential advantages in dealing with them. One I think may be overlooked is that writing a system so that anyone anywhere can check it out and build it, and that several people can work on simultaneously and largely independently, enforces a certain discipline that's useful anyway. If your code's a mess or no one else can build it and run it, you're not going to get as many collaborators.

On the other hand, open source isn't a magical solution to all of life's problems, and there are arguably cases where you just need someone to say "Today we work on X," or "We will not do Y." Strictly speaking, that kind of control is a separate question from whether the source is freely available, but Linus's law assumes that eyeballs are not being commanded to look elsewhere.

So is Linus's law a crock? Not at all. It captures a useful principle. But like most snappy aphorisms, it only captures an ideal in a world that's considerably messier and more intricate.

Thursday, December 20, 2007

Arguing web architecture with myself

A while ago, talking about web sites as web services in the context of "Ten Future Web Trends," I said:
My guess is that tooling will gradually have more and more useful stuff baked in, so that when you put up, say, a list of favorite books it will be likely to have whatever "book" microformatting is appropriate without your doing too much on your part. For example if you copy a book title from Amazon or wherever, it should automagically carry stuff like the ISBN and the appropriate tagging.
Um, why copy book data? This is the web. Make a link with the book title for text, pointing at Amazon or wherever. Anything crawling around the web trying to make sense of this ought to be able to recognize where the link is pointing, chase it and get the other data. All the usual arguments against copying (e.g., difficulty of keeping copies in sync) apply.
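Here is a minimal sketch of the crawler's side of that bargain: pull the links out of a page and keep the ones that point at a bookseller, ready to be chased for the rest of the data. The page, the host name and the class and function names are all hypothetical. Actually fetching the target and reading its book data is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect (link text, target URL) pairs from a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

def book_links(html, bookseller_hosts):
    """Return links whose target lives on a known bookseller's site."""
    collector = LinkCollector()
    collector.feed(html)
    return [(text, url) for text, url in collector.links
            if urlparse(url).hostname in bookseller_hosts]

# Hypothetical page and bookseller host, for illustration only.
page = ('<ul><li><a href="https://books.example.com/item/123">Some Book Title</a></li>'
        '<li><a href="https://elsewhere.example.org/">Not a book</a></li></ul>')
print(book_links(page, {"books.example.com"}))
```

The page author types a title and a link, nothing more. Everything else stays at the other end of the link, where it only has to be kept up to date once.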

Monday, December 17, 2007

Who wants to be on .TV?

I seem to remember -- and maybe this is just my addled memories of Silicon Valley playing tricks on me -- that all the good .com names were supposed to have been snapped up years ago in some great virtual land rush. The only viable alternative was to grab a domain from one of the newly-minted top-level domains, like maybe .biz, or hey wait, there's an island nation called Tuvalu and guess what! Its TLD is .tv!

Station managers! Why use your-call-letters.com when you could use your-call-letters.tv? Why use your-favorite-show.com? Doesn't your-favorite-show.tv sound so much better? In those fin-de-siecle years, dreams and fortunes were made of less.

Current statistics from Name Intelligence, of course, tell a somewhat different story:

TLD      Registered domains (millions)
.COM     71
.NET     11
.ORG     6
.INFO    5
.BIZ     2
.US      1

Stats for other TLDs are harder to track down, but clearly they're at least 2 orders of magnitude behind .com.

There does appear to be another effort in the works to get people buying .tv (the page I linked is a redirect from www.tv). Certain "premium" domains are up for sale at premium prices. Annual fees range from $500,000 for business.tv to $100 for, say, fishness.tv. I was intrigued by rotten.tv, but not $3000 a year worth of intrigued. Non-premium names, I believe, go for a more usual fee of around $25. The full list of 52,000+ premium names makes for somewhat entertaining browsing.

What of Tuvalu? The nation of 11,000 gets a cut of revenues from its agreement with Verisign. Being extremely remote, having few natural resources and having its highest point around 5 meters above (current) sea level, it can certainly use the cash. However, there is concern that the cut could be larger. From what I can make out, the total comes to around $2M a year, better than nothing, but at around $200 per person certainly not the bonanza one might have hoped for.

The official site for Tuvalu is www.gov.tv, but be aware that their server appears very slow, possibly due to high latency as much as low bandwidth.

UI inertia

I just had an irritating experience on a major retail site. It doesn't really matter which one, or exactly what problem, but here's a brief summary of the case in point: Fairly early in a several-step checkout process, I entered a new form of payment. Then I realized that I had to correct the shipping on one item. No problem, there was a button for that.

A few steps later, I ended up on a page asking me to enter a form of payment, which page seemed to have no memory of the new one I'd just entered. Or anything else I'd ever used, for that matter. Or a button to take me anywhere but forward. So I finally ended up starting over, losing most (but not all) of the other information I'd put in.

Whenever something like this happens, I make a mental checklist. Did the system have all the information it needed to let me fix the problem without losing what I'd put in? Would it have had to guess my intentions? Could the problem have been solved much better with known technology? In this case, and so many like it, the answers are clearly yes, no and yes.

Why does this keep happening? Why do we as an industry seem immune to experience? It's not from lack of trying. From what I can make out, having made most or all of the mistakes myself at one point or another, the cycle goes something like this:
  • The application needs some feature. It might be a shopping-cart UI, or a way to remember configuration, or a database, or whatever.
  • Early on, the requirements don't seem that demanding, and it's crucial to Get The Thing Out The Door. So someone puts together a good-enough first cut.
  • Pain results.
  • For most of the perennial problems, this happens again and again, leading people to develop toolkits. Typically each house grows its own.
  • More ambitious and successful houses venture out to bottle and sell theirs.
  • Again there is pressure to Get The Thing Out The Door (the toolkit this time), so the new solution solves just enough problems to constitute a clear improvement. The state of the art advances by a modest increment.
  • Except that all the apps that had to be pushed Out The Door before the toolkit and its improvements came along are already out the door and thus massively harder to change.
  • As a corollary, more quickly successful products tend to have clunkier interfaces, as there was less time to change them before they became hard to change.
  • Finally, a certain number of houses won't use the latest stuff anyway. Instead they'll use older stuff or roll their own, for a number of reasons, some valid and some not so valid.
It's not impossible to clean up something that's already out the door, but it requires a special blend of skill and patience. It's hard to make a business case for fixing something that doesn't appear badly broken. Generally it will require an upstart competitor to change the risk/reward balance.

In the particular case of web sites, the path of least resistance has been the "fill out a form, push a button, fill out the next form" routine, with a cookie or two to keep track of how far you've gotten in case you need to break off or backtrack. Even that may be ambitious. There are still surprisingly many sites that will make you re-enter an entire form if you mess up one field, or make you start from scratch if you have to go back to step 1. This is far behind what available tools will support.
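For what it's worth, not losing the user's work takes very little. Here is a minimal sketch, assuming a plain dictionary stands in for server-side state keyed by a session cookie. The step name, field names and validators are hypothetical.

```python
def validate(fields, validators):
    """Return {field name: error message} for just the fields that fail."""
    errors = {}
    for name, check in validators.items():
        problem = check(fields.get(name, ""))
        if problem:
            errors[name] = problem
    return errors

def submit_step(session, step, fields, validators):
    """Save what the user entered for this step, then report any problems.

    The values are stored even when some are invalid, so the form can be
    redisplayed with everything filled in and only the bad fields flagged.
    """
    session.setdefault(step, {}).update(fields)
    return validate(session[step], validators)

# Hypothetical shipping step with two fields.
validators = {
    "quantity": lambda v: None if v.isdigit() and int(v) > 0 else "positive number required",
    "postcode": lambda v: None if v.strip() else "required",
}
session = {}   # stands in for server-side state keyed by a session cookie
print(submit_step(session, "shipping", {"quantity": "two", "postcode": "66045"}, validators))
print(submit_step(session, "shipping", {"quantity": "2"}, validators))
print(session)
```

The second submission corrects only the bad field; the postcode entered the first time is still there. Going back to an earlier step works the same way, since each step's values stay in the session until the user changes them.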

I'm not entirely against step-by-step processes. Sometimes they work better than alternatives like "tweak stuff until you like what you see and then press 'go'". They at least leave little doubt as to what to do next. Which combination of approaches to use when is a matter of skill, taste and empirical testing with real people.

Whatever the approach, there is always a surprisingly high portion of stuff out there that just seems like it ought to have been better, that has problems that were identified, and solved, ten or twenty years ago. It's easy to conclude that people just must not know what they're doing, but I don't think that's a big part of the story. Rather, there seem to be fairly strong forces (in particular the door-outward force) tending to allocate resources to ensure that the end product is just good enough, and no better.

One of the best takes on this I've seen is in Richard Gabriel's classic Lisp: Good News, Bad News, How to Win Big. Gabriel is mainly talking about LISP, and he makes a lot of good and interesting points. In section 2.1, "The Rise of Worse is Better", however, he argues more generally that while we may want to do The Right Thing, a system that settles for less has much better survival characteristics. To throw a little more fuel on the fire, Gabriel's canonical "worse is better" system is UNIX. Naturally, it's section 2.1 that everyone quotes.

Thursday, December 13, 2007

But what if I don't want Coke to be my friend?

This is old news by now, but I wanted to explore it anyway.

A couple of years ago a younger relative, then still in college, showed me Facebook. At the time I didn't really get the concept, but I figured there was probably something to it since my relative thought it was pretty cool. On the other hand, my first instinct was to wonder about privacy.

At the risk of dating myself (but then, I do give my age in my profile), I cut my social net.teeth on BBSs and Usenet. On BBSs, people almost universally went by handles and not by their real names, a practice that almost certainly traces back to CB culture. On Usenet, you went by your email address and .sig, which could be revealing (david.hull@myschool.edu) or concealing (mysterious@whoknows.com or even an12345@anon.penet.fi) according to your choice and what you could get your local admins to go along with.

In either case you were faceless (at least at the time) and there was at least a good possibility if not an outright expectation that your name would be made-up. Coming from that perspective, and knowing that every single post to alt.stuff.hairy.hairy.hairy and rec.pigeon-fanciers is preserved in amber for all time, I was somewhat taken aback by the idea of a site that not only showed your real name, but your picture and whatever other personal information you chose to put up.

Why would people do this, I thought? Well first, not everyone's as camera-shy as I am. From talking to people, there also seems to be the perception that if you're doing something on your computer in the privacy of your room, you're doing it in your room and not on the net. Finally, though, I suspect that, even though the information is readily available, people may not fully appreciate the distinction between one's friends (a human-sized hand-picked list) and one's network (a city-sized collection of people you mostly wouldn't know from Adam).

As an aside, there's some interesting graph theory to be explored in modeling social networks. Your PhD awaits ...

Where this all comes to a head is the economics of identity. I've argued already that anonymity services must be understood in economic terms, particularly regarding the value each party places on anonymity. Conversely, a service like Facebook, MySpace or LinkedIn is a veritable mother lode of marketing data, much of it apparently available for free.

I say "apparently" because if this data is really free, it's almost certainly mispriced. Mispricing means arbitrage opportunity, and arbitrage opportunity means new, fair price. At a first guess, the value of belonging to a social network is something like
  • Convenience of being able to keep in touch with your friends and vice versa
  • Minus hassle of being in touch with people you'd rather not be in touch with
  • Plus joy of discovering interesting things about other people
  • Minus embarrassment of realizing that your prospective employer can see those pictures of you doing jello shots which seemed like such a good idea to post
  • Plus value of learning about cool new stuff advertised on the site
  • Minus pain of advertisements you don't want
Users can manage the first two by tweaking their friend lists. People appear to be growing more savvy about divulging personal information. One source tells me that juniors are now advised to scrub their Facebook sites a good year in advance of graduating and entering the job market.

The last two are, as I understand it, going through an interesting period of adjustment. One method of adjustment is for users to vote with their feet and stop using the service (or stop using it as much). This makes the site as a whole and advertising on it in particular less valuable. Presumably, really cool advertisements could make a service more attractive, with the reverse effect.
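The plus and minus terms listed above amount to a simple ledger, and "voting with their feet" is just what happens when the total goes negative. A minimal sketch, with the caveat that every number here is invented purely for illustration and the net_value and stays helpers are hypothetical:

```python
def net_value(terms):
    """Sum the signed terms of the ledger; positive means the service is worth using."""
    return sum(terms.values())

def stays(terms):
    """A user keeps using the service only while the net value is positive."""
    return net_value(terms) > 0

# Hypothetical scores on an arbitrary scale; signs follow the plus/minus list.
terms = {
    "keeping in touch with friends": 5,
    "hassle of unwanted contacts": -2,
    "discovering interesting things": 3,
    "embarrassing pictures seen by employer": -4,
    "cool stuff learned from ads": 1,
    "pain of unwanted ads": -2,
}

print(net_value(terms), stays(terms))

# More intrusive advertising (an assumed change) can tip the balance.
more_ads = dict(terms, **{"pain of unwanted ads": -4})
print(net_value(more_ads), stays(more_ads))
```

The point of the sketch is only that the user's decision depends on the sum, so a change in any one term, including the advertising terms, can flip it.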

I haven't really investigated any of this in detail, other than reading a couple of articles in the popular press, and I'd be particularly interested in comments.

Tuesday, December 11, 2007

Undead technology

I did a double-take just now when I followed a link to a PDF file and noticed the Kinkos/FedEx logo in my PDF viewer. When did that happen? [Imagine a quick Google search here] Looks like the deal was announced back in June.

OK, so if I'm reading something perfectly well online, I have the option of sending it off to my local copy shop, having it printed out and then venturing out to go pick it up. Or, I suppose, I could have it FedExed to my doorstep.

Something tells me I'm not in the right market niche for this.

I could, however, imagine FooCorp emailing a bunch of, say, nice glossy marketing collateral from the Oceania headquarters to the Eurasia headquarters, where it would then be printed at the local shop, collated and bound and delivered to the appropriate desks. Clearly Adobe and Kinkos/FedEx think that people will want this, and who am I to say them nay?

At the risk of sounding like an, um, broken record, it seems that certain technologies have not yet figured out that they're supposed to be dead. Text isn't dead. It's now a verb. Print appears to be doing just fine, as well.

Monday, December 10, 2007

Why is there still print?

The Newsweek article on Kindle quotes Jeff Bezos as saying "Books are the last bastion of analog." I take his point, but it seems an odd statement. Text, after all, is arguably the first real digital medium. What he means by "digital", of course, is "available to computers". Unlike music and video, which are now routinely released in computer-readable form, books are still released in a form you can't just download. Bezos aims to change this with the Kindle.

The interesting question is, why does print resist digitization so well? I've suggested that publishers like it because it provides copy protection, but why does it? The answer has to be economic, not technical. Technically, it's trivial to digitize a book. Just scan it in. Don't bother to try to convert the image back to text. If all that people want to do with the result is read it, the image should work fine.

There's an interesting subplot here. Optical character recognition (OCR) seems to do fairly well these days on well-printed books, judging by Google books and Amazon's own "Search inside the book" feature. On the other hand, the fully general problem of reading anything a person can make out still appears to be hard, which is why sites use distorted text CAPTCHAs to try to stop bots. This seems like the equivalent of an X-prize for freelance OCR hackers, and indeed the inevitable arms race appears to be well under way. Finally, bringing us full circle, one source of these CAPTCHAs is printed text that failed to scan correctly.

In any case, the difficulty doesn't seem to be digitizing text in a readable form. The problem is, what do you do with it once you've got it? It's technically trivial to scan a book, but it still takes some time and effort to flip through all the pages, at least without expensive specialized equipment. So if I've done this, I'd like to see some compensation -- assuming I don't mind violating copyright laws.

Can I put it on the web and sell it? Well, um, I've just brought it into digital form, thereby making it hugely easier to copy. In other words, I've just put myself in the position of the publisher whose print-based copy protection I've just broken. If copy-protection is out, there's always advertising. Except that's maybe not such a good idea given that I've just broken the law.

This same argument would seem to act as a counterbalance to all sorts of unauthorized copying, but obviously it doesn't apply as effectively to audio and video. This is probably because copying CDs and DVDs is much, much easier than scanning books, and also because books are simply a different medium. I'd expect that PhDs have already been earned on just such matters.

Kindle and print

While looking for something else, I ran across the November 26 issue of Newsweek. The cover story was on Amazon's new Kindle e-book. Conveniently enough, the article is available online.

Overall I found the article pretty evenhanded, balancing the "print is inherently inefficient" side with the "books are inherently special" side. As usual, I think both sides have valid points. A few thoughts:
  • Yes, print is inherently inefficient. That doesn't mean it will die anytime soon. People still sent hand-delivered messages long after the telephone became widespread. Steam trains ran long after the diesel came along. Western Union only recently shut down its telegraph service.
  • On the other hand, it's hard to imagine print not giving way to bits over time and eventually reaching niche status. My completely unfounded guess is that it will end up more like blacksmithing than buggy whips.
  • Amazon is right to recognize that it's not enough just to have an electronic device that more or less looks like a book. The Kindle is not just a device but a service. Along with searchability and the potential for hyperlinks, Amazon hopes the killer app will be the "buy and read it right now" feature. Push a button (and pay Amazon a fee generally less than you'd pay for print) and the Kindle will download whatever book you like. Whether this is enough to pull people in remains to be seen, but it at least seems plausible.
  • The Kindle relies on copy protection, presumably using some Trusted Computing-like facility. I've argued that it's not unreasonable to expect a special-purpose device to give up programmability in an attempt to lock down copy protection. Again, it will be interesting to see how well this works.
  • Conversely, print has a nice, well-understood copy protection model. Copying a book means physically copying pages. In theory this is quite breakable. In practice it works well (so far). Publishers naturally like this. It would be interesting to try to quantify how much this convenience to publishers is extending the lifetime of the book, as opposed to the "nice to curl up with and read" aspect.

Monday, December 3, 2007

Text and technological change

In the previous post on literary texts and hypertexts, I had meant to make a fairly mundane point, but got sidetracked in the fascinating details of the particular texts I was using as examples. At least, I found it fascinating.

The mundane point was this: The web has given rise to new textual forms, things like wikis, blogs and for that matter ordinary HTML web sites. Some aspects of these are new, but the general notion of a text as a multi-layered, interlinked structure, possibly with multiple authors and a less-than-clear history, is not.

This is a general point, not limited to literary criticism. For example, the question of what constitutes a derived work, important in software copyrights, musical sampling and elsewhere, has a long history. This history happens to include Ulysses -- one of the charges leveled against the 1984 corrected text of Ulysses was that the publishers had pushed to include as many corrections as possible in hopes of obtaining a new copyright.

We sometimes like to think that a new technology changes everything, that the old rules cannot possibly apply because the game itself is so different. Technology does change things, but our social tools for coping with the change remain largely the same. The flip side of this is that the problems a new technology appears to raise are often older than one might think.

Sunday, December 2, 2007

Literary texts and hypertexts

An old literary chestnut: Just what is a text? Three texts to consider:

Beowulf: Often cited as the oldest extant text in the English language, this blood-soaked tale comes to us through a single manuscript dating to around the year 1000. The manuscript was damaged in a fire in 1731, and was not transcribed until 1786.

Since then it has deteriorated further, leaving the 1786 copy as the only source for some 2000 letters, though modern imaging techniques have also helped in reconstruction. The 11th-century manuscript itself is written in two different hands and appears to be a working copy. The story clearly draws on centuries of oral tradition. It is an open question to what extent the particular telling is a transcription of a spoken saga or a literary work in its own right.

The poem begins: "Hwæt! We Gardena in geardagum, þeodcyninga, þrym gefrunon, hu ða æþelingas ellen fremedon." The language has changed a bit since then, and as a result most people read it in modern English translation (or wait for the movie to come out).

What is the text of Beowulf? Is it an old saga, as captured by a long-ago author? The words written on the old manuscript? The manuscript itself, or as accurate a reconstruction as we can manage? The story, as well as we can render it in modern language?



Hamlet: We all know at least the beginning of Hamlet's famous soliloquy: "To be, or not to be? I, there's the point/To die, to sleep, is that all? I, all ..."

Wait, that can't be right, can it? Ah ... I must have been reading from the infamous "bad quarto". I should have been reading the first folio: "To be, or not to be, that is the question ..."

The first folio is generally considered the best starting point for Shakespeare's plays. Its own introduction warns of inferior editions, and indeed many of the plays, Hamlet included, exist in several different forms.

There have been attempts to harmonize the various sources into some sort of "ideal" version. Whether this works or not is a matter of opinion. Arguably, there is no one text in such cases, but unfortunately a stage production has to work from a single script, whether a harmonized one or one of the original editions.

On the other hand, if you consider the play as a play, and not just a text, whose production is definitive? The Globe theater is open for business again, but the original production company is no longer available for engagements.


Ulysses: Beowulf and Hamlet are old texts. It's no surprise that they might present problems. What about a work that was published in the 20th century by an author who lived another 19 years after its publication, producing corrections and commentary along the way? The text, of course, is James Joyce's Ulysses.

Ulysses presents a few special problems. It was originally published in serial form, until one of the installments ran afoul of US obscenity laws. It was then published as a book by Shakespeare and Company with, by most accounts, thousands of typos (just how many thousands depends on whom you ask).

The language of Ulysses is not as obscure as that of Joyce's last work, Finnegans Wake, but it is full of invented compounds like hismy and snotgreen, which traditional proofreading would tend to break up, directly against the author's wishes. Further complicating matters, the original setting relied heavily on Joyce's handwritten notes on the galleys. Many of these are now lost.

The result was a series of printings that everyone knew had significant numbers of errors, perhaps minor and perhaps not, but which no one knew exactly how to correct, particularly after Joyce's death. An attempt in 1984 drew worldwide attention but also scathing criticism.

The 1961 edition appears to be the most popular today, but further corrected editions are promised and "genetic studies" of the various versions of the text are a thriving field in their own right. You can find more here and here.

So once again, what is the text?


What does any of this have to do with the web (other than all three texts being available on the web in some form or other)? One common thread here is that a text is a deeper thing than just a series of words printed on a page. Scholarly editions have recognized this for centuries, by means of devices like footnotes, marginal comments, glossaries, bibliographies and so forth.

Some of these techniques predate printing, but all of them work even better as hypertext. It's not surprising that there are well-developed web sites available for all three works. The new technology is a natural fit for the old problems.

It's also nice that, with serious scholarship being made available on the web, the average reader can see up close the kind of dense, interlinked historical structure that used to be available only in sparsely-circulated journals. This is basically the literary equivalent of putting scientific material up on the web, and the benefits are similar as well.

Saturday, December 1, 2007

You could be Spartacus and not even know it.

All anonymity systems require large numbers of people to send traffic, or perhaps more accurately, the fewer people send traffic, the less anonymity the system can possibly provide. This is particularly a problem in setting up a system everyone has an incentive to use. Essentially, parties that value anonymity highly need people who value it less to use the system anyway in order to provide cover traffic.

This paper (also available on Freehaven's site) suggests ways of using various HTTP trickery to get people to send traffic, and to carry other people's traffic, on the system without even knowing they're doing so.

This is not using obscure loopholes. It's using things like redirects, cookies and JavaScript that pretty much everyone has enabled and which would be a royal hassle to turn off. On the other hand, you would have to visit particular sites controlled by the system.

It's not clear how much of a practical problem this is, but it's yet another thing to keep in mind when pondering just exactly how secure the web might be.

On "On the economics of anonymity"

Since posting "On the economics of anonymity", I ran across a paper of the same title. Being an actual research paper, it goes into considerably more depth than my brief take. Some of the high points:
  • An anonymity service can be viewed as a "public good." A public good is something that everyone can use without diminishing someone else's ability to use it. Street lights are a classic example.
  • In most public good scenarios, "free-riding" (using the good without paying for it) is a problem. In anonymity systems, free riders can also be good, since they provide cover traffic.
  • You're more anonymous as a node than as a free-riding user.
  • Parties that value anonymity highly have good reason to become nodes, despite the higher costs. Everyone else might as well just use the system.
  • More nodes means less traffic per node, means less anonymity.
  • It appears difficult to set up an anonymity system that everyone will have an incentive to use, particularly if you're starting from scratch.
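The "more nodes means less traffic per node" point is easy to see with a little arithmetic. This is only a back-of-the-envelope sketch of that one bullet, not the paper's model: it assumes a fixed total amount of traffic spread evenly across nodes, and the traffic figure and the traffic_per_node helper are made up for the example.

```python
def traffic_per_node(total_messages, nodes):
    """Average messages each node handles, assuming traffic is spread evenly."""
    if nodes < 1:
        raise ValueError("need at least one node")
    return total_messages / nodes

TOTAL_MESSAGES = 10_000  # hypothetical fixed amount of daily traffic

# With total traffic held constant, each added node thins out the crowd
# of other messages that any one message can hide among at that node.
for nodes in (1, 10, 100, 1000):
    print(nodes, "nodes:", traffic_per_node(TOTAL_MESSAGES, nodes), "messages per node")
```

In this simplified picture the crowd at each node shrinks in direct proportion to the number of nodes, which is why free riders who add traffic without adding nodes can actually help.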
For more details, please see the paper itself. It, and many other goodies on anonymity, are available on Freehaven's excellent Selected Papers in Anonymity.