Showing posts with label distributed computing.

Friday, April 15, 2011

Xanadu vs. the web: Part II - Xanadu the architecture

OK, so what is this Xanadu project?  First, if you want to explore for yourself, the project is at http://xanadu.net/.  Many of the ideas behind the project are expressed in its founder Ted Nelson's Computer Lib, of which I have only read small excerpts.  My source for the history of the project is Gary Wolf's Wired article The Curse of Xanadu.  As the title suggests, the article does not paint a rosy picture, and Nelson has objected strenuously to it in the letters column of Wired itself.

Over its lifetime Xanadu has been a lot of things to a lot of people, but I'll focus on two aspects here:  Xanadu as an architecture (in this post) and Xanadu as a business model (in the next), before going on to try to make some sort of overall sense of everything.  Along the way I'll also touch on Xanadu as a software engineering project (or not).


So ... those first two paragraphs don't really belong in this post.  They belong in the previous one.  Now, I could go back and quietly edit Part I to include them.  I explicitly asserted the right to make quiet editorial changes quite a while back when trying to deal with a mistake I'd made.  But the upshot of that experience was that it's generally better to leave anything more than minor mistakes uncorrected and supply further material on the subject if needed (this theme will recur in a moment).  The principle I settled on was: a blog is not a wiki.  In particular, it has no visible edit history, so the blog itself must fill that role.

That's actually not a digression.  Any hypertext system has to deal with exactly the questions my little editorial decision raises, particularly: How do you handle a dynamically changing interlinked set of documents?  If I edit something that someone has a link to, what should they see?

In a blog (or at least in Blogger blogs), a link to a post is a link to the latest revision.  Exactly what you see may depend on when you chase the link.  With a wiki you also have the option of linking to a particular version of an article, which will never change, however many later edits may come along.  The standards and guidelines behind HTTP and company recommend that what a link points to stay stable, but they don't require it.  The basic machinery of the web works the same either way.  Having results change over time is tolerated.

Xanadu takes a different approach to this and other issues.  In Xanadu, every object has a unique, secure identity.  This doesn't preclude keeping multiple copies for the usual reasons of performance and reliability, but these physical copies share the same identity and so represent a single logical object.  Objects in Xanadu are immutable.  If I revise a post, the revised post is a separate object from the original, which still remains.

Objects are addressable via lists of numbers called tumblers.  Tumblers are ordered, and given a tumbler it is always possible to find a tumbler after it but before any other existing tumbler.  This makes it easy to add new revisions.  Since tumblers are hierarchic by nature, it is possible to address parts within objects -- to address, say, the third paragraph of an article or a sentence in that paragraph, or a word in that sentence.
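
That ordering property is easier to see in code than in prose.  Here's a toy sketch in Python -- my own simplification with made-up names, not real tumbler arithmetic -- where an address is a tuple of digits compared as if padded with zeros, and between() manufactures an address that sorts strictly between any two existing ones.  That's what lets a new revision slot in anywhere without renumbering anything.

    def digit(t, i):
        return t[i] if i < len(t) else 0

    def lt(a, b):
        """True if address a sorts before address b (zero-padded comparison)."""
        n = max(len(a), len(b))
        return [digit(a, i) for i in range(n)] < [digit(b, i) for i in range(n)]

    def between(a, b):
        """An address strictly after a and strictly before b (assumes lt(a, b))."""
        i = 0
        while digit(a, i) == digit(b, i):
            i += 1
        if digit(b, i) - digit(a, i) >= 2:
            # room at this level: just bump the digit
            return tuple(digit(a, j) for j in range(i)) + (digit(a, i) + 1,)
        # no room: keep a's digit here (still below b) and descend a level
        n = max(len(a), i + 1)
        return tuple(digit(a, j) for j in range(n)) + (1,)

    v1, v2 = (1, 2), (1, 3)
    v1_revised = between(v1, v2)          # (1, 2, 1): after v1, before v2
    assert lt(v1, v1_revised) and lt(v1_revised, v2)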

Links between objects are two-way, and they are visible objects in their own right.  Links are non-intrusive, meaning that you can add a new link to or from an object without changing that object.  The endpoints of a link are just tumblers [if I've got it right].  Since tumblers can address arbitrary parts of an object, you can define, say, a link from the word "Xanadu" in some article to the Wikipedia article on Xanadu without changing either the source article or the Wikipedia article.
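
To make the non-intrusive part concrete, here's a hypothetical sketch -- my names and data model, not Xanadu's -- in which links live in their own registry, keyed by both endpoints.  Adding a link never touches the documents it connects, and the link can be found from either side.

    from collections import defaultdict

    class LinkRegistry:
        def __init__(self):
            self.by_object = defaultdict(list)   # object id -> links touching it

        def add_link(self, source, target):
            """source and target are (object_id, span) pairs -- addresses that
            can point *into* an object, not just at it."""
            link = {"source": source, "target": target}
            self.by_object[source[0]].append(link)
            self.by_object[target[0]].append(link)
            return link

        def links_touching(self, object_id):
            """Everything linking to or from this object, found without the
            object itself having to know anything about it."""
            return list(self.by_object[object_id])

    registry = LinkRegistry()
    registry.add_link(("my-article", "the word 'Xanadu'"),
                      ("wikipedia:Xanadu", "whole article"))
    print(registry.links_touching("wikipedia:Xanadu"))   # navigable from either end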

Since the system retains every version of every object, and links are addresses pointing to (or into) existing objects, links are never broken (the referent is no longer there) or dangling (the referent was never created at all).

Xanadu takes this one step further by positing that all edits point back to an immutable record of the unedited original.  For example, if I boldface a word in a text, the boldfacing is separate from the original text.  Documents become collections of editing commands, which may reference other such collections, and so forth.
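
Again as a toy illustration (my own simplification, not Xanadu's actual format): a "document" here is just a list of commands referencing spans of immutable stored content, so an edit produces a new command list while the original bytes stay put.

    ORIGINALS = {
        "post-1.v1": "Xanadu takes a different approach to this and other issues.",
    }

    document_v1 = [
        ("include", "post-1.v1", 0, None),   # show the whole original
    ]

    document_v2 = [
        ("include", "post-1.v1", 0, 7),      # "Xanadu "
        ("bold",    "post-1.v1", 7, 12),     # boldface "takes" without touching it
        ("include", "post-1.v1", 12, None),  # the rest, unchanged
    ]

    def render(commands):
        out = []
        for op, obj, start, end in commands:
            text = ORIGINALS[obj][start:end]          # end=None means "to the end"
            out.append("**" + text + "**" if op == "bold" else text)
        return "".join(out)

    print(render(document_v1))
    print(render(document_v2))   # v1 is still there, byte for byte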

Finally (for the purposes of this discussion at least), Xanadu includes a notion of transclusion.  Nelson defines transclusion as "the same content knowably in more than one place".  Transclusion isn't the same as quoting.  I just quoted Nelson, but there's no way to navigate from that quote to its source.  Even if I make the quote a link, as "the same content knowably in more than one place", that's still not transclusion, first because the link points to the whole document, not the quote, but more importantly because there's no way to navigate back, or to other uses of that quote.  [From illustrations of transclusion, it's easy to interpret it as "showing quoted text inline", but that's a matter of presentation.  Whether the front end chooses to show a link or the full quoted text is its business.  It's the navigability that matters from an architectural point of view, because that two-way navigability requires cooperation with the outside world.]

This ability to slide back and forth between (or among) different uses of the same text is fundamental to Xanadu.  Other features, such as immutability and tumbler addressing, exist to enable it.

This architecture has several key differences from The Web as We Know It:
  • Links in Xanadu are never broken.  Web links are routinely broken.
  • Both endpoints of a link are fixed.  If I edit a post, I've constructed a new collection of edits pointing into the original post.  Your link points to the original, unchanged post.  In the web, there is only one object, which has changed out from under a link.
  • Links in Xanadu are bidirectional.  If you link to my post, I (or you, or anyone else) can follow that link back to whatever you linked from.  I can easily and automatically navigate from a post to the comments on the post.  If you've commented on a particular sentence, I can see that because my end of the link refers to the sentence specifically.
  • Links in Xanadu never go away, because nothing ever goes away.
  • Markup lives outside a document.  If I want to boldface every example of the word "orange" in a document and you want to boldface every example of the word "banana", we can do this independently and without editing the original.
  • Xanadu doesn't exist.  The web does.
That last may come across as snide, but unfortunately it's true.  Xanadu as a concept has been around in one form or another since the 1960s -- coming up on half a century.  In all that time, there has not been one commercial implementation of anything more than superficially like it.  Not from Nelson, not from the dozens of programmers who have worked with Nelson, not from anyone inspired by Nelson to make it a reality, not by someone working in isolation and coming up with the same approach independently.  This wants explanation.


From a technical point of view, it's tempting to look for scalability issues and other architectural weaknesses.  For example:
  • If I decide, say, to link every word of every Field Notes post to its dictionary definition (applying some hack for words already occurring in links), that's my business.  In Xanadu, the dictionary, at least, has to know about thousands of new links.  [More precisely, if not the dictionary, then whatever's keeping track of the links, and anything accessing the dictionary needs access to them.]
  • Since Xanadu is meant to use redundant copies for performance and fault-tolerance, the keepers of every copy will have to know (or be able to find out about) all those links as well.
  • And it all has to be kept in sync as changes come along.  Cache coherence is one of the hard problems, though it certainly helps that objects are immutable and the set of objects only grows.  Web protocols allow for caching, but stale cache entries can and do happen.  This is just a fact of web.life, and web.life goes on.
  • Suppose I really did want to link to the latest revision of something, whatever that may be at the moment.  If you edit that something, then the link needs to be updated as well.  My document doesn't need to be, since the link lives outside it, but anything referencing that link, or more likely the combination of that link and my document, also needs to be updated, assuming it also wants the latest version of everything.  Updating means creating a new copy and ensuring that whatever wants to be pointing at the latest is pointing to it.  The resulting pile of corner cases and gotchas is probably resolvable, but the upshot is that the simple act of editing may have arbitrarily wide-ranging consequences.  On the web, no one but you has to know you edited a page.  That can cut both ways, but from experience it appears to be the right default [re-reading, I wonder if Xanadu defines, or could define, a special kind of tumbler meaning "the latest version of ...".  The meaning of this tumbler could change over time, even though individual objects are immutable -- D.H. Sep 2015].
  • Keeping every version of everything may be expensive in some cases, though in the case of, say, this blog it wouldn't be.
These are all valid concerns, and I'm sure there are more.  The various developers must have run across them, and it would be interesting to read over the resulting discussions, if they're still out there.  However, I think there are two more fundamental issues.

First, is Xanadu trying to solve the right problem?  It's very clear that transclusion would solve problems that Nelson finds pressing, but it's far from clear there's any general demand for it, and by now that's not because no one in the field has heard of it, or even that no one in the general populace has.  Nelson explains transclusion clearly enough on the site I linked to, and "the same content knowably in more than one place" is fairly clear all by itself.  But no one seems to be asking for it.  Nelson himself says that people rarely grasp the power and importance of transclusion.  Fair enough, but that kind of gap generally presents a barrier to widespread adoption (not always -- some things you just have to try for a while before you decide they're actually cool, and some of those catch on anyway).

But more than that, Xanadu is fundamentally a closed system.  Yes, it's possible to pull in, say, a web page and treat it as Xanadu content -- pull out a quote here, reformat there and create a Xanadu-style mash-up.  But that's not transclusion.  There is no way for the owner of the web page to know that its content is also elsewhere.  To do that, the web page itself would have to be part of Xanadu.

The converse is only slightly better.  Xanadu could present a web face allowing people to view it in a web browser and create documents with links to it.  The Xanadu server could look at referrer information to see who's visiting it via what page.  But that doesn't provide any assurance that any particular quote appears on any particular page, even in the absence of spoofing.  I might later delete a link, or I might simply cut and paste text in without making a link.  There may well be additional difficulties with, say, a Xanadu object pointing to a web page that links back to something else in Xanadu.  I can't be bothered to think that through at the moment.

In short, to actually realize the idea of transclusion, everyone has to cooperate.  That could work for a purely local application that never accesses the net, or for a collection of servers that all run Xanadu and speak whatever protocol it would use to maintain links coherently.  At this point, though, there is a lot of non-Xanadu information out there, and you'd have to persuade a huge number of existing systems to switch over or at least adopt Xanadu as an add-on.  Any new source of information would also have to be Xanadu-aware.  Not gonna happen.  The web, for its part, also requires computers to cooperate in using protocols, principally HTTP, but this is a much, much lower bar to clear.


The web, with its organically grown patchwork of standards and near-standards, its tolerance for missing pieces and other imperfections, and its lack of overarching authority, necessarily lacks the coherence and uniformity of something like Xanadu.  But these traits are exactly what allows it to thrive.

There's a moral to be drawn there, for those who wish information to be universally accessible to all.

Monday, November 2, 2009

60 Minutes and the MPAA: Part I - BitTorrent

Before I start: I'm in favor of copyrights, I believe movie makers, even big mainstream Hollywood studios, have a right to make a buck, and I think that sneaking out a lo-fi recording of a movie, complete with the guy in the next row snoring, besides being illegal, is just pretty lame. That said ...

CBS's long-running news show 60 Minutes just ran a piece on video piracy. I'm calling it a "piece" and not a news story because it's essentially the Motion Picture Association of America's position on the issue re-routed through a major news show. The MPAA certainly has a right to make its case, and the case is not without merits, but passing it off as news -- particularly old-school investigative reporting -- has considerably less merit.

In my first draft of this post I tried to hit all of my objections, but the result was unwieldy and it soon became clear I had some homework of my own to do. So before digging into the real meat of the issue, let's just talk about BitTorrent.

BitTorrent is widely used for exchanging large files. Said large files include new Linux distributions, legitimately produced free video and, of course, pirated videos. In other words, like pretty much any other information technology out there, it can be used for good or ill. 60 Minutes seems mildly shocked that such a thing could be "perfectly legal."

BitTorrent, as you may know, works by avoiding the bottleneck of a central server. Instead of thousands of home computers all bombarding the central site with requests and overloading even its capacity to fulfill them, BitTorrent uses a central computer to coordinate the home computers distributing the data to each other. The 60 Minutes piece gets this wrong by suggesting that the data itself is going both to and from a central server:
Tiny "bits" moving toward a blue column in the middle of Malcolm's screen are pieces of the movie we were getting from people all around the world. The bits moving away from the column are pieces we have and are sharing with someone else.
You can sort of see where the key concepts got lost in translation, but ... huh? Pieces I'm sending out are going away from the column? Am I the column then? Each piece goes both to and from the column? What's the point of that? If you want to see what's really going on, Wikipedia has a much less whizzy but more accurate picture.
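
If it helps, here's a deliberately tiny model of the idea in Python -- not the real protocol, and every name is made up. The tracker only hands out peer lists (metadata); the pieces themselves travel directly between peers and never through a central server.

    import random

    PIECES = set(range(8))               # the file, chopped into 8 pieces

    class Tracker:
        def __init__(self):
            self.peers = []
        def announce(self, peer):
            """Metadata only: register the newcomer and tell it who else is here."""
            others = list(self.peers)
            self.peers.append(peer)
            return others

    class Peer:
        def __init__(self, name, have=()):
            self.name = name
            self.have = set(have)
            self.known = []
        def join(self, tracker):
            self.known = tracker.announce(self)
        def step(self):
            """Ask a random known peer for one piece we're missing."""
            if not self.known:
                return
            other = random.choice(self.known)
            wanted = sorted(other.have - self.have)
            if wanted:
                self.have.add(wanted[0])  # data flows peer to peer

    tracker = Tracker()
    seed = Peer("seed", have=PIECES)
    seed.join(tracker)
    downloaders = [Peer("peer%d" % i) for i in range(4)]
    for d in downloaders:
        d.join(tracker)                   # later joiners also learn about earlier ones
    for _ in range(100):
        for d in downloaders:
            d.step()
    print({d.name: sorted(d.have) for d in downloaders})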

I should pause to point out here that BitTorrent's architecture is a classic example of the value of carefully considering the problem you're trying to solve. Instead of solving "How can you transmit a file quickly from one computer to another?" which basically has only one answer (high bandwidth on both ends), BitTorrent solves "How can you distribute copies of a file among a large number of computers?" Once you look at it that way, the answer of letting a given host pay back its download over time and to lots of other hosts seems fairly natural, but looking at it that way in the first place is an engineering masterstroke.

To use BitTorrent, you have to let a central tracker know what file you're retrieving. You're also committing to upload data to other BitTorrent users in return for downloading. This architecture makes BitTorrent different from other well-known ways of moving big files around:
  • Unlike the central server model, BitTorrent is distributed. There is no single bottleneck for data. There is a single place for handling metadata, but metadata is much, much smaller.
  • BitTorrent is much more traceable than older peer-to-peer file sharing systems (but see below).
  • It's also faster, because you're effectively downloading your file from many other places instead of being constrained by one person's upload speed.
In short, you trade off anonymity for speed. This is a perfectly good trade if you're conducting legitimate business, not so good if you're not. Even if you neglect to tell The Man you're setting up a BitTorrent tracker to share files, the pattern of lots of peer-to-peer traffic coupled with frequent low-volume traffic from everybody to a central node is pretty distinctive. Once the central server is located, watching it is much easier than watching everybody. It's also much easier to tell who's involved, since they're all talking to the tracker. All in all, this seems like much less of a headache for parties like the MPAA than, say, Napster and its descendants.

However ...

BitTorrent's speed is a definite headache. A typical cable-based home connection, at least in the States, has a download speed massively higher than its upload speed. Last time I measured mine, the ratio was 40:1. This makes it reasonably easy for me to download big files from a central, easily traceable server with huge bandwidth, but a real pain for me to send a copy of that file to my friend or whoever. That asymmetry matters much less, though, when ten thousand people are distributing copies of a file amongst each other: each peer only has to upload a small share of the whole, and the aggregate upload bandwidth is enormous.

More worrisome to folks like the MPAA, though, is that it is possible to run the same distribution scheme without a central tracker. As I said, metadata is small, so distributing it to everyone takes relatively little time. A particular mechanism, distributed hash tables, has been around for a while now. As with file sharing itself, distributed hash tables aren't inherently evil. In fact, they have some very generally useful properties. But a highly efficient file distribution system without a visible center presents a real problem if you see your job as preventing people from distributing files.
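
For the curious, here's a bare-bones sketch of the consistent-hashing idea behind many DHTs (Chord-style, with toy names and a toy ring size -- real designs are fancier). Node names and keys hash onto the same ring, and a key lives at the first node at or after its position, so any participant can work out who holds what without asking a central index.

    import hashlib
    from bisect import bisect_left

    def ring_position(name, bits=16):
        digest = hashlib.sha1(name.encode()).hexdigest()
        return int(digest, 16) % (2 ** bits)

    class ToyDHT:
        def __init__(self, node_names):
            # the sorted list of (position, name) pairs is the ring
            self.ring = sorted((ring_position(n), n) for n in node_names)

        def node_for(self, key):
            pos = ring_position(key)
            i = bisect_left(self.ring, (pos, ""))
            return self.ring[i % len(self.ring)][1]   # wrap around at the top

    dht = ToyDHT(["node-%d" % i for i in range(5)])
    print(dht.node_for("ubuntu-11.04.iso"))    # no tracker needed to find the keeper
    print(dht.node_for("some-other-file"))     # everyone computes the same answer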

In summary: Copyright violation is illegal, harmful and lame. BitTorrent can be used for copyright violation or for legitimate purposes, and it's a very neat hack* in any case. Preventing illegal file sharing by purely technical means looks like a tall order. Bashing BitTorrent or any other particular product is unlikely to help.

* I suppose in this context it's particularly important to point out I mean "hack" in the sense of "cool and unexpected thing done with technology," not in the more widespread sense of "act of computer vandalism".

Saturday, September 26, 2009

Wikipedia, voices and objectivity

In some sort of ideal world, we get our information purely from objective sources, apply cool judgment and act accordingly. In this world the ideal news article or reference text doesn't appear to have been written by anyone. It merely transmits facts, and only facts, to the reader directly and transparently.

This is a caricature, of course, but it's fairly close to what my high school journalism teacher taught, and it's woven deeply into Wikipedia's fabric under the label of Neutral Point of View (NPOV). On the other hand, Wikipedia is almost by definition a work in progress, constantly updated by a near-anarchy of mostly pseudonymous if not anonymous editors. No one can stop you from saying that hard-boiled eggs must only be cracked on the big end, and no one can stop me from correcting your heinous misconception. I mean, from expressing my personal opinion on the matter.

But it all works remarkably well, for several reasons:
  • Wikipedia is inclusive by nature. An encyclopedia aims to be all-inclusive to begin with. An online encyclopedia, without the limitations of physical ink and paper, doesn't have to worry about running out of space. More important, though, is the huge number of contributors. All the paper in the world is useless without someone to write on it. And revise. And re-revise. And so on. This is not to say that Wikipedia includes everything willy-nilly. There are definite policies for what can and cannot be included, but they're aimed towards notability and not someone's idea of correctness.
  • The guidelines like NPOV really do matter because they're supported by a strong culture. The community has long since reached a critical mass of active members that take Wikipedia policy seriously and act to reinforce it and to repair breaches, even if that means tediously reverting an endless stream of "MY MATH TEECHUR SUX DOOD" and worse vandalism.
  • It's generally easy to tell when someone is injecting opinion. It's even easier to tell when two (or more) people are trying to inject conflicting opinions. The occasional jumble of "Some authorities [who?] insist that ... however so-and-so[17] has stated that ... " doesn't necessarily make for smooth or pleasant reading, but it does tend to make clear who's grinding which ax.
  • Similarly, it's easy to spot a backwater article that hasn't seen a lot of editing. This is not necessarily a bad thing. Obscure math articles, for example, tend to read like someone's first draft of a textbook, full of "Let x ..." and "it then clearly follows that ..." The prose may be a bit chewy, but whoever wrote it almost certainly cared enough to get the details right. Articles on obscure bands generally read like liner notes and tend to slightly hype that band's achievements and their home-town music scene. That's fine. Take it with a grain of salt and enjoy the tidbits you wouldn't have heard otherwise.
  • Likewise, it's easy to tell when an article has had a good going-over. Articles on "controversial" topics may or may not have had their "on the other hand ... on the other other hand ..." back-and-forth smoothed out, but they do tend to accumulate copious footnotes. Just as one could argue that forums exist to generate FAQ lists, one could argue that such articles exist to gather references to primary sources.
Whenever I find myself too far out on my "web changes nothing" limb, it helps to consider Wikipedia and realize that there's really nothing quite like it. But it's also important, I think, to realize that Wikipedia works so well not because it works perfectly -- it clearly doesn't -- but because it's robust in the face of its imperfections. This is a property of good distributed systems in general, the distributed system in this case comprising not just the author/editors, but the reader taking Wikipedia's nature into account.


P.S.: While fetching up the link for NPOV above, I first tried "npov", figuring it would redirect to the right place, WP:NPOV, since I can never remember the right prefix for the special pages. Oddly enough, if you don't capitalize it the right way, npov redirects to Journalism. Not sure I buy that, but it's an interesting angle.

Wednesday, September 5, 2007

What killed parallel computing?

When I was an undergrad, parallel computing was the Next Big Thing. By "parallel computing" I mean a large number of CPUs that either share memory or have relatively little local memory but pass (generally small) messages on a very fast local message bus. This is as opposed to distributed computing, where CPUs have lots of local memory and communicate in larger chunks over relatively slow networks.

So what happened? Multiple choice:
  • What do you mean "what killed it?" Supercomputers today are all massively parallel. Next time you do a Google search, thank a cluster.
  • Distributed computing killed it. If you want to really crunch a lot of data, get a bunch of people on the net to do it with their spare cycles, a la SETI@home and GIMPS.
  • Moore's law killed it. Most people don't need more than one or two processors because they're so darn fast. Sure you can use parallel techniques if you really need to, but most people don't need to.
Personally, I'd go with "all of the above" (but then, I wrote the quiz).

Another worthwhile question is "What's the difference between parallel and distributed anyway?" The definitions I gave above are more than a bit weaselly. What's "relatively small"? What's the difference between a few dozen computers running GIMPS on the web and a few-dozen-node Beowulf? At any given time, the Beowulf ought to be faster, due to higher bandwidth and lower latency, but today's virtual Beowulf ought to be as fast as a real one from N years ago.

A distinction I didn't mention above is that classic parallel algorithms have all the nodes running basically the same code, while nodes in distributed systems specialize (client-server being the most popular case). From that point of view the architecture of the code is more important than the hardware running it. And that's probably about right.