
Wednesday, August 23, 2017

Lists and limitations

There are several things that Wikipedia does that you wouldn't necessarily guess just from a description like "online, world-editable encyclopedia."  One of my favorites is that it tends to accumulate lists.  All kinds of lists.  List of bridges.  List of Eastenders characters.  List of enzymes.  List of Russian poets.  List of screw drives.  List of numbers (always room for one more).  List of presidents of Brazil.  List of fictional primates.  List of fires.  And, of course, List of lists of lists.

Honestly, part of the attraction is the sheer fun of seeing just what sorts of things people have listed, but there's a more serious point, too.   Our brains are subject to all sorts of biases that lead us to remember things selectively.  We tend to remember things by their impact.  We remember more recent things better than less recent things (though we also tend to remember the beginning of a list of things better than the middle).  We remember things that we encounter more than similar things we don't.  We tend to remember things that have stressful emotional associations, and so forth.  In fact, there's a great list of these biases on Wikipedia.

One antidote to our natural biases is to lay out all the information in one place, for example, in a list.

To take a random example, what's the largest city in the US, meaning the one with the most area?  I've spent some time in LA, and I'd think that it, or one of its suburbs, must be pretty large.  Let's check the List of US Cities by Area.  And the winner is ... Sitka, Alaska, at 7434 sq km (2870 sq mi), population 8881.  Next on the list are Juneau, Wrangell and Anchorage, all in Alaska.

None of these is what we'd think of as a "big city", however large the city limits might be.  In fact, all of these are consolidated city/counties, and it's not surprising that counties in Alaska would be on the large side.  Likewise for Anaconda and Butte, both in Montana, a little farther down the list.  The first physically large city with a large population is number five: Jacksonville, Florida, area 1935 sq km (747 sq mi), population 821,784 (though Anchorage's population is over 200,000).  The first with a population over a million is Houston, Texas, area 1553 sq km (600 sq mi), population 2,099,451.

The next large cities with over a million are Phoenix and ... oh, there's LA at number 12.  Interestingly, New York, New York, which I tend to think of as the prototypical "lots of people packed into not much space", is number 24 on the list, between Kansas City, Missouri (Kansas City, Kansas is considerably smaller in both area and population), and Augusta, Georgia.

If all this matches up with your preconceptions, congratulations, your preconceptions are better than mine.

Or suppose there's been a major flood in your area (and, of course, I hope there hasn't).  Seems like it's happening more and more around the world?  Is it?  Well, one way to find out would be to look at the List of Floods.  And, indeed, it looks like there have been many more in this century than before.  In fact, in the 1990s, the list shifts from decade-by-decade to year-by-year.

But that's not really right.  We've just gotten better at reporting them, and more recently-reported events might be easier to link to on Wikipedia.  So there are limits.  In this case, there's a clear sampling bias (I wouldn't quite call it recency bias, since that has to do with an individual's memory).

Maybe I'll just go back to browsing the List of images on the cover of Sgt. Pepper's Lonely Hearts Club Band.  Interesting bunch, that lot.

Wednesday, December 30, 2015

Wikipedia considered harmful ... or not

In an old post on Wikipedia, I said that

[I]t's easy to spot a backwater article that hasn't seen a lot of editing. This is not necessarily a bad thing. Obscure math articles, for example, tend to read like someone's first draft of a textbook, full of "Let x ..." and "it then clearly follows that ..." The prose may be a bit chewy, but whoever wrote it almost certainly cared enough to get the details right.

My feeling was that if you were really interested in, say, the functoriality of singular homology groups,  you'd probably have enough context to chew through prose like "This generality implies that singular homology theory can be recast in the language of category theory."*

In a recent ars technica article, John Timmer argues that impenetrable technical articles are actively harmful: "The problematic entries reinforce the popular impression that science is impossible to understand and isn't for most people—they make science seem elitist. And that's an impression that we as a society really can't afford."

I think that's a good point, but I'm not sure how bad the problem really is in practice.  A hard-to-read article is most likely to be harmful if a lot of people are seeing it, which also makes it more likely that someone will be able to improve it.  This is a fundamental assumption of Wikipedia in general, I think.  As such, it would be interesting to see some data behind it -- is there a strong correlation between the number of times a page is landed on and the number of edits (or edits per word, or such)?
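For what it's worth, here is roughly what a first pass at that check might look like, as a sketch in Python.  The two Wikimedia endpoints are real as far as I know, but treat their exact shapes as assumptions, and the article sample, date range and user agent are made up for illustration.  A real study would want a large random sample of articles, not four hand-picked ones.

```python
# A rough sketch of the pageviews-vs-edits check suggested above.
# Assumptions (not verified here): the Wikimedia pageviews REST API and
# the MediaWiki edit-count endpoint behave as documented; the article
# sample and date range below are arbitrary placeholders.
import requests
from scipy.stats import pearsonr  # numpy.corrcoef would also do

HEADERS = {"User-Agent": "pageviews-vs-edits-sketch/0.1"}
ARTICLES = ["Physics", "Condensed_matter_physics",
            "Symmetry_breaking", "Bifurcation_theory"]  # toy sample

def total_views(title):
    # Monthly pageviews for one article over one (arbitrary) year
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
           f"per-article/en.wikipedia/all-access/user/{title}/"
           "monthly/20160101/20161231")
    items = requests.get(url, headers=HEADERS).json()["items"]
    return sum(item["views"] for item in items)

def edit_count(title):
    # Total number of edits to the page (the API caps this at 30,000)
    url = f"https://en.wikipedia.org/w/rest.php/v1/page/{title}/history/counts/edits"
    return requests.get(url, headers=HEADERS).json()["count"]

views = [total_views(t) for t in ARTICLES]
edits = [edit_count(t) for t in ARTICLES]
r, p = pearsonr(views, edits)
print(f"Pearson r = {r:.2f} (p = {p:.2f}) over {len(ARTICLES)} articles")
```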

Assuming that correlation holds, then someone coming to Wikipedia to learn about, say, physics should have a good chance at a gentle introduction.  Let's try:

  • The main article on Physics seems like a perfectly good Wikipedia page.  It starts with a general introduction, goes into history, core theories, relation to other fields and so forth.  Let's look at one of those fields:
  • Condensed matter physics still seems to be in good shape.  The first sentence doesn't seem completely useful at first: "Condensed matter physics is a branch of physics that deals with the physical properties of condensed phases of matter," but the next paragraph goes on to explain nicely what a condensed phase of matter is.  The rest of the article continues in a well-structured way to give an outline of the field.  Let's look at one of the theoretical aspects:
  • Symmetry breaking "needs attention from an expert in Physics".  I'd agree with that assessment.  The general idea is still there, but we're definitely getting technical: "In physics, symmetry breaking is a phenomenon in which (infinitesimally) small fluctuations acting on a system crossing a critical point decide the system's fate, by determining which branch of a bifurcation is taken."  For example, what's a bifurcation?  Well, we can at least chase the link and find out:
  • Bifurcation theory is actually a better-structured article, but no less technical.  It's a math article, not a physics article.
I would say that either of the last two would be intimidating to a non-physicist/mathematician.  I don't know if you could say the same about the first two.  Yes, there are still technical terms and concepts, but it's pretty hard to get away from that and still cover the material.  I would also say that a non-physicist interested in physics in general would be far more likely to land on the Physics article than the other three.

I also noticed that, while I have run across a few really impenetrable technical articles in Wikipedia, it didn't seem -- in this particular random walk, at least -- that the quality of the articles dropped off steadily as one went off the beaten path.  Fields intersect, and perhaps you're never too far from someone's beaten path.  I did chase a link from Bifurcation Theory to Stationary point, which was marked "The verifiability of all or part of this article is disputed", not something one expects in math articles, but it didn't seem particularly better or worse than the previous two, the warning notwithstanding.


Let's say that the random walk above is fairly representative -- and I think it is, based on other experience browsing Wikipedia.  What of Timmer's claim that general interest articles such as The Battle of the Wilderness are accessible, while technical articles such as Reproducing kernel Hilbert space are hostile?

I suspect selection bias.  Timmer (like me, and like anyone who browses a lot of technical articles) sees a lot more technical articles than the average reader.  In fact, we should broaden that a bit and say "specialized" instead of "technical".  Just as math geeks might read a lot of math articles, history geeks will read a lot of history articles, sports geeks a lot of sports articles and so forth.  Should we really expect Reproducing kernel Hilbert space to be as newbie-friendly as The Battle of the Wilderness when the battle site has a plaque on a public highway talking about it?

Let's try following History like we did for Physics above, since Timmer's example of an excellent article is from that field.  The first area of study is Periodization, and already we see a notice that the article's sources remain unclear.  A large chunk of the article covers Marxian periodisation, giving the almost certainly false impression that this is the only significant way that historians divide history into periods.  This section seems to be a copy of the corresponding section in Marx's theory of history, suggesting that one of the most basic Wikipedia cleanups -- making sure that the bulk of the information is in one definitive place and everything else links to that place -- hasn't been done.

Just as impenetrable math or physics articles may give the impression that scientists are a bunch of elitists, the article on Periodization -- the second history article I tried -- may give the impression that historians are a bunch of leftists.  So maybe Wikipedia as a whole is unfriendly to academics as a whole?  That would be a sad irony indeed, but I doubt that's really what's going on.  Wikipedia is filling more than one function, after all.  It's a general introduction in some places, and a specialized reference work in others.  That seems fine.


There's a lot more to be said here.  Maybe survey articles like the ones on physics and history are the wrong place to look -- not that we've really defined what we're looking for.  I tried putting "The battle of" into the search bar and clicking on a few of the battles that came up.  All the articles seemed quite well written.  Perhaps the Wikipedia process works better for documenting specific events?  Or perhaps the search bar was just showing me the most popular articles, which in turn would be popular for being well written and well written for being popular?

Overall, the proposition that Wikipedia articles on the sciences are bad for the sciences seems like a testable hypothesis, at least in principle, but properly testing it requires a lot more machinery -- methodology, statistical models, surveys, etc. -- than you'll find in a blog post or op-ed.  Without a more thorough study, we should take assertions like Timmer's as good starting points, not conclusions.



* I should point out that the article I linked to is not necessarily the kind of article that Timmer is complaining about.  I just picked a handy example of a technical article that would probably not be too familiar to most readers.  Yep, I ... just happened ... to be reading up on singular homology theory over my winter break.  What can I say?

Saturday, August 4, 2012

Answering my random question


I recently asked whether there were more than a Britannica worth of Britannica-quality articles in Wikipedia.  Looking into it a bit, I'd have to generally agree with Earl that no, there aren't.

Britannica has about half a million articles (according to Wikipedia's page on Britannica).  English Wikipedia has about four million.  I would not say that one in eight Wikipedia articles is up to Britannica standards.

Granted, the famous Nature study of 2005 found that Wikipedia science articles are nearly as accurate as Britannica articles -- and that Britannica is far from flawless.  One can dispute the methodology and conclusions of that study, and Britannica did, but the overall conclusion seems at least plausible.

However, science articles are only part of the picture, and in any case the writing in Wikipedia is uneven and full of Wikipedia tics.  Britannica, with full-time writers and editors, ought to be a bit better.  I tend to think this is where Wikipedia generally falls short. Factually, the two are comparable.  In style and organization, not so much.

Taking content and writing together, there are probably relatively few Britannica-quality articles in Wikipedia, but there are more than enough that are close enough.


Friday, August 3, 2012

Random question

Are there now more than a Britannica worth of Britannica-quality articles on Wikipedia?

Wednesday, July 6, 2011

Wikipedia tics

I'll say it again: Wikipedia is great.  I use it all the time.  It does its job astoundingly well, particularly given that when it was first getting started any sensible person could have told you it couldn't possibly work.  Anyone can edit it?  Anyone can write anything about anything?  And people are going to depend on it for information on a daily basis?  Riiiight.

But it does work, thanks to countless hours of effort from dedicated Wikipedians hammering out workable policies, nurturing the culture behind those policies and putting those policies into practice by editing a stupefying number of articles.   It is this endless stream of repairs and improvements that keeps Wikipedia from devolving into chaos.  It's a wonderful thing, but wonderful is not the same as absolutely perfect (for starters, one is achievable and the other isn't).  Anyone who's read Wikipedia more than casually will inevitably have a few pet peeves.  Here are some of mine (and yes, I do try to fix them when I come across them, time permitting):
  • Link drift: Article A includes a link to article B.  Article B gets merged into article C and the link is changed to point to article C -- not the section, but the whole article.
  • More link drift: Article A includes a link to article B.  Someone creates an article on a different meaning of B.  The article for B becomes a disambiguation page, and the article on A continues to point to it.
  • Digression:  Article A has some connection to topic B, which people Need to Know More About.  Instead of just providing a short summary and linking to the article on B, an enthusiastic editor gives the complete story of B, in nearly but not exactly the same form as in the original article (or, the digressive section moves to its own article, but the section later regrows).
  • I'm really into this: An article is stuffed with unsourced Things You Didn't Know about the topic, often to the point of downright creepiness.
  • Some say ... yes, but some other people say ... yes, but ... :  People feel strongly about topic A.  Generations of editors qualify each other's statements until the article reads like a ping-pong match. Usually an effort is made to collect the clashing statements into one section, but that doesn't always keep them from escaping into the article at large.
  • Actually, everybody gets this wrong:  An editor makes a great point of declaring some piece of common knowledge incorrect without bothering to check if this is really the case.
  • This is a very important distinction:  Instead of saying something on the order of "not to be confused with [link]" or such, an editor feels that it's worth including a sentence or two on either side of some valid but not earthshaking distinction emphasizing how crucial it is (see previous item if the distinction in question is invalid).
  • Take it to the discussion page, please: A discussion that ought to be lightly summarized is hashed out in excruciating detail before our eyes.
  • Oh look, I can write a textbook/conference paper, too!:  Editors seem to make a special effort to pepper their writing with the mannerisms of their professors or other authorities.  Math articles seem particularly prone to this ("clearly ... it turns out that ...").
  • My home town/band is the awesomest:  Material on a place or group reads like your cousin showing you around on a visit.  I actually don't mind this, so long as it doesn't go too far overboard, even though it generally runs somewhat afoul of Wikipedia's notability policy, because how else does one find out about the Anytown Moose-waxing festival or the real meaning of "incandescent oak" in that one song (don't go searching for those -- I made them up).
  • This article reads like it was written by dozens of different people over the course of several years:  Well, yeah.  The real magic of Wikipedia is that relatively few articles read like that, particularly if they really have had a chance for dozens of different people to work on them over the course of several years.
  • [One other tic occurred to me not long after I hit "Publish": Gratuitous wikification.  To "wikify", in wiki parlance, is to make an ordinary term into a link to the article for that term.  It's one of the things that makes wikis wikis, but sometimes people seem to go randomly overboard, occasionally with fairly odd results.]
Wikipedia's strength is in its transparency.  For the most part, you can see every draft of every article if you want to, every mistake, every correction, every paragraph in need of tightening, every statement in need of a reference, every quibble, every pointless edit war -- in short, everything that a normal publication, encyclopedic or otherwise, goes to great lengths to hide.  The downside is that flaws like the ones listed above are also there for all to see.

The upside is that we get Wikipedia.

Monday, April 18, 2011

Xanadu vs. the web: Part VI - Wikipedia and GNU

I've spent quite a few words on why Xanadu, sometimes called the original hypertext system (Vannevar Bush's Memex proposal and Doug Engelbart's work notwithstanding), is not, in fact, the hypertext system we use.  With that as background, I'd like to take a look at two pieces of what actually developed, namely Wikipedia and the GNU project.

Wikipedia would seem very much in the spirit of Xanadu.  It seeks to create an interconnected collection of documents surveying humanity's store of knowledge.  It is not only freely accessible to anyone with a net connection, it is editable by anyone (or at least, anyone who can get along in the Wikipedia community, which is a great many people).  It remembers past versions.  Indeed, the edit history is an integral part of Wikipedia, not only technically but culturally.  It can be approached from any angle, read in any order and quoted freely.  It even has two-way links of a sort ("what links here").  One might think that, for example, the "Transliterature open standard" would have something to say about it, or at least wikis in general.

Your search - wiki site:transliterature.org - did not match any documents.


That's not completely fair, as Nelson's work is scattered across many sites across the web and off it, and I'm quite sure he's had at least something to say on the project, but whatever it is doesn't exactly leap to the fore.  You can plug xanadu.net and other sites into the search above and still find nothing.

Wikis are probably the most Xanadu-like, most hypertextual parts of the web.  They're not Xanadu, though.  They do not provide true transclusion, or the side-by-side, interconnected views that Nelson advocates, much less "flying islands" or more exotic presentations.  All they do is provide millions of people the means to explore the world's knowledge in personalized, non-linear, cross-referenced and interconnected ways.

Two conclusions one could draw from this:
  1. Wikis, because they are not transclusive and appear much like traditional media on the screen, merely "simulate paper" and are thus detours on the true path to Xanadu.  We must fight on.
  2. Easy editing, wide access and the ability both to follow links and to search are more important than strictly adhering to any particular vision of hypertext.
Xanadu, unfortunately, seems firmly in camp (1).

(It occurs to me that the back button on a browser would probably get the same treatment.  A one-way link with a generic way to navigate back is obviously not the same as a two-way link, but it turns out to provide significant value nonetheless.)

How did Wikis (or at least Wikipedia, which has by far the lion's share of Wiki-related traffic) get to be what they are?  People put up servers and people used them.



Project GNU and Xanadu would seem to have much in common.  Persuasive, eccentric founders (by eccentric I mean simply "far from the center") with radical, ambitious visions.  A strongly-defined subculture with vocabulary and practices all its own.  A strongly ideological bent (GNU even has its own manifesto) and a willingness to say that mainstream thought and practice are simply wrong.  A conviction that computing must be liberated from narrow-minded corporate constraints.

I've taken issue with GNU founder Richard Stallman's ideas myself (particularly here and to some extent here), but I certainly don't dismiss them out of hand.  To this day Stallman commands a degree of respect and attention, for a simple reason:

GNU shipped useful code.

If you run Linux, you're almost certainly running the GNU tools (among others) on top of the Linux kernel.  The kernel itself is built with GNU tools.  Even if you're not a Linux geek, if you use the web you've interacted with any number of servers running GNU tools.

Actually, there's a second reason, not quite as simple but even more significant.  Tons of code outside the GNU project has shipped under the GNU General Public License (GPL) or licenses heavily influenced by it.  Because of this, it's easy to, say, download Eclipse and a bunch of Apache libraries and start doing interesting stuff.  Stallman literally pioneered a whole new form of software development and distribution [There had been "shareware" and similar arrangements already, but the GPL is a completely different beast, particularly in its brilliant use of copyright law].

There are a lot of reasons why GNU has been so influential, even if not every hope or prediction has panned out, but none of them would have had much effect at all if Stallman and company had not produced actual, useful running code, particularly GNU Emacs and gcc.

Ironically, the GNU operating system itself, which prompted the whole effort, has fared less well.  Stallman announced in the original manifesto that an initial implementation of the kernel had been written (as indeed it had), and the actual work started around 1983, even before the manifesto.  Nonetheless, to this day there is no stable release.  If that had been the whole story, we might well have another depressing tale of non-delivery (or in the case of GNU, not-quite-delivery).  Happily, though, that's not the whole story.  In the event, the kernel itself was effectively supplanted by the Linux kernel, which may or may not be the better result, but other development went on.


This willingness to change tack when the winds change is the hallmark of every successful project that I'm aware of.  It's a big reason we have the web we do.

Thursday, February 17, 2011

Where Wikipedia pages went to die

While looking for something else (of course) I ran across Deletionpedia, an archive of pages that have been deleted from Wikipedia.  The idea is simple: siphon off pages deleted from Wikipedia, with exceptions such as copyright violations, libel and intentionally offensive pages.

Why do this?  Wikipedia is reasonably wide-open, but it does have well-known standards for inclusion.  If it's not notable, or contains original research, or creative writing, or anything else that doesn't really belong in an encyclopedia, it's out, regardless of its other merits.  Deletionpedia was an effort to preserve such pages.

I saw "was" because, even though the site is still up, it hasn't been updated since mid-2008 (or 2012, if you believe the rather odd timestamps on the Recent Changes page).  All in all, Deletionpedia collected about 63,000 pages in the space of a few months.  Why did it stop?  The last status update, from 2008, apologizes for recent downtime, promises it will return in improved form and that  "Full service will resume ASAP."

Famous last words, indeed.  Another cool idea that most likely just didn't have sufficient resources behind it, particularly the time required to administer the site and maintain the Python script that was meant to automate the process of sifting out pages that not even Deletionpedia should provide a home for.


The origins of the whole exercise may lie in the "Inclusionist/Deletionist" theological debate in the Wikipedia community.  I wouldn't say that a site like Deletionpedia necessarily supports one side or the other.  On the one hand, it perpetuates pages that would otherwise disappear.  On the other hand, it lowers the consequences of deleting a page.

Neither should such a site have much effect on Wikipedia's "Right to Vanish" which, as far as I can make out, is more of a Right to Make it Somewhat Harder to Associate Your Edits With Your Identity.  Invoking this right does entail deleting one's User: page (but not one's User talk: page), but I'm not sure how the average user page would make it easier or more difficult to track down who made a particular set of somewhat-anonymized edits.  But I'm not a Wikipedia expert, so I may have missed something.


Naturally, there is a Wikipedia page on Deletionpedia, and naturally, it has been nominated for deletion at least once.

Sunday, August 15, 2010

Wikipedia 1.0: journey vs. destination

While browsing through the Wikipedia policy pages (it was either that or just tattoo "Geek" on my forehead and be done with it) I ran across something I remembered running across a while ago, more or less shrugging at and moving on, namely an offline edition of Wikipedia. There seem to be two approaches:
  • The "German model": Distribute a snapshot of Wikipedia on CD. Why, I'm not sure. Perhaps to reach that select audience of people who have heard of Wikipedia but don't have an internet connection to access it*?
  • The "Wikipedia 1.0" model: Select the best, most polished articles and publish them, whether on paper, CD/DVD, read-only web site, or whatever.
The Wikipedia 1.0 project was proposed in 2003. At this writing, several versions have been released and 0.8 will be out Real Soon Now. That's not to say that 1.0 will be two versions from that. The beauty of the x.y version numbering scheme is that you don't have to go from 0.9 to 1.0. You can release 0.91, 0.95 ..., you can release 0.10, 0.11 ..., you can release 0.9a, 0.9b ... [But it looks like we'll go into 2016 still on version 0.8 ... my guess is that 1.0 isn't going to happen -- D.H. Dec 2015]

For my money, it's not particularly important whether 1.0 ever comes out. Plenty of good has come out of attempting the exercise at all, in particular as a spur toward improving the quality of core articles and encouraging the development of Wikipedia's quality and importance ratings. These exhibit a nice division of labor: People rate articles and computers aggregate the best-rated ones.

The main reason not to just leave it at that and integrate the ratings more directly into the UI, is that vandalism still has to be filtered by hand and, despite the lack of imagination exhibited by most vandals, always will be. But most likely even that could be handled without an explicit release mechanism, by means of "flagged revisions," which allow editors to flag particular revisions as being free of vandalism and otherwise up to snuff. Apparently the mechanism has been in place for a while but the community is still figuring out how best to use it.

What's the proverbial "simplest thing that could possibly work" here? Perhaps just allowing anyone -- or anyone with an account -- to tag a revision however they like, and allow readers to filter what revisions they see. E.g., only show me revisions that the quality rating committee has rated "good" or better and my friend Jimbo has rated "funny". The proposal for "sighted revisions" looks pretty close to this, though less flexible.
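To make that concrete, here's a toy sketch of the data model I have in mind, in Python.  Everything in it -- the tagger names, the tags, the filter -- is invented for illustration; it's not how MediaWiki's flagged revisions actually work.

```python
# Toy model: anyone can attach (tagger, tag) pairs to a revision, and a
# reader filters the history with an arbitrary predicate over those tags.
from dataclasses import dataclass, field

@dataclass
class Revision:
    rev_id: int
    text: str
    tags: set = field(default_factory=set)  # set of (tagger, tag) pairs

def visible(history, predicate):
    """Return only the revisions whose tags satisfy the reader's filter."""
    return [rev for rev in history if predicate(rev.tags)]

history = [
    Revision(1, "first draft"),
    Revision(2, "MY MATH TEECHUR SUX DOOD", {("SomeBot", "suspect")}),
    Revision(3, "cleaned up", {("RatingCommittee", "good"), ("Jimbo", "funny")}),
]

# "Only show me revisions the rating committee rates good and Jimbo rates funny"
wanted = visible(history, lambda tags:
                 ("RatingCommittee", "good") in tags and
                 ("Jimbo", "funny") in tags)
print([rev.rev_id for rev in wanted])  # [3]
```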


* That's a bit glib, as there are communities with access to computers but with limited or no bandwidth, but given it was the German edition at 3 Euros per CD, I doubt this was the intended audience. Nonetheless, 40,000 people opted to buy it.

Wednesday, June 30, 2010

Wiki without the pedia

While tagging my previous post, I noticed that I had tags for both "Wikipedia" and "wiki". There are four articles (now five, of course) tagged "wiki," three of which are more or less to do with Wikipedia. The other is from the Baker's Dozen series, speculating about what role the wiki approach may play in the next generation of search engines.

What really stands out to me about wikis is that there's Wikipedia and then there's everything else.

Everybody's heard of Wikipedia by now and quite a few people have tried their hand at editing it. As a result, there is a well-known tool for editing Wikipedia (MediaWiki) along with a well-established culture and etiquette. There is also enough of a critical mass that, for the most part, articles tend to improve over time.

And then there's everything else. Don't get me wrong. There are some good wikis out there. But there are also an awful lot of half-baked ones. These tend to crop up when a small software shop or similar organization decides that it needs a wiki to, say, document its software architecture and development process. Well, why not? Wikipedia is pretty successful, and software shops are always looking for lightweight, dare I say "agile" ways of tracking what's going on.

In practice, there are several pitfalls:
  • Wikipedia has a lot of eyes. According to Wikipedia, Wal-Mart has about 2 million employees, while Wikipedia has close to 13 million registered users. Granted, Wikipedia claims only about 90,000 "active contributors", but that's still about the same headcount as Microsoft. Chances are, your company isn't that big*.
  • It used to be every computer science undergrad wanted to invent and implement a programming language. Somewhere around the turn of the century that ambition seems to have shifted to writing a wiki engine (which typically has at least a toy programming language in it somewhere). So many to choose from and, even though approximately one of the choices has a huge userbase and all that goes with it, the odds are that whoever set up your wiki chose something "better" than Mediawiki.
  • Wikis were designed for quickly throwing together webs of loosely structured text, and not for any of several other things they sometimes get used for. A wiki page generally doesn't know what role it has in a bigger picture. A wiki is not a bug tracker. It is not a release planning system. It doesn't know that feature X was promised to FooCorp for release 2.1 whose schedule has just slipped. No one told it any of that. Ah, but that's where the toy programming language comes in ...
  • Many shops are content to limit wikis to the smaller role of gathering together bits of wisdom that people tend to email each other as the occasion demands. "Why did you design it this way?" "Well ..." The problem is that this conversation tends to happen when, for any of myriad reasons, the design wasn't documented close to the code, so someone is now asking the author. Ideally, the original designer goes and documents the code and replies with a link to the new doc. Alternatively, if the conversation is taking place on an archived list, the answer will be in the archives for future generations. In either case, it's not clear that updating a wiki and replying with a link to that would be an improvement.
  • Wikis need gardening to combat various forms of rot. Typically there's even less time for this, particularly in a small shop, than there is for updating the wiki in the first place.
Wiki writing is not magically easier than any other kind of writing. Maintaining a wiki takes time and dedication. Wikipedia has a lot of dedicated contributors, including many who specialize in gardening and other less glamorous jobs. If your organization is not specifically in the business of producing wiki pages, chances are the wiki will reflect that.


* On the other hand, chances are your wiki is not going to be as big as Wikipedia. Nonetheless, (I claim) there are economies of scale that happen when the user base gets larger.  In a large community people can specialize, for example in maintenance tasks.

[Wikipedia continues to dominate the world of Wiki, even neglecting its sister projects.  The one notable exception I can think of is TV Tropes.  I doubt it has anywhere near the readership of Wikipedia, but it's still the rare example of a publicly-edited non-Wikipedia wiki with a significant readership -- D.H. Dec 2015]

Wikipedia moved my food dish (slightly)

Wikipedia has recently undergone a facelift. Just as a casual user I've noticed approximately two things:
  • The buttons and stuff are shinier.
  • The search field is now up top instead of over to the side.
I was somewhat annoyed by that second item for a bit, but I'm already used to it now, and I can see the UX value in putting such a vital, high-volume element in a more prominent place.

What else did they do? The new features link mentions a couple of new editing widgets, which I may explore next time I edit a page, a new version of the logo (part of the general new shininess) and "improved search suggestions". They've also made it clearer whether you're reading or editing a page, but I've never had a lot of trouble with that distinction.

Of these, the improved search suggestions are the real winner. Search suggestions rock, and I'd say that even if I didn't work for Google.

Sunday, November 22, 2009

Today is yesterday's tomorrow (sort of)

The other night I was watching Ghostbusters II (oh, don't ask why) and right in the middle of it Harold Ramis' character uses The Computer to look up information on a historical figure. I'll use GBII for reference here since it's handy, but I could have picked any number of others.

The Computer has been a staple of science fiction for decades. It's interesting that its role in such movies is very often not to compute but to look something up, as was the case here. Our hero gives the computer the name, and back comes a neatly formatted 80-column by 24-row answer, with underlines and everything, saying who the person is.

Of all the technological devices in such movies, The Computer always seemed among the less plausible. I'm not counting the ghost-zapping equipment as technology; it's magic and falls firmly under suspension of disbelief. The Computer counts as technology because it's assumed just to be there. At some point in the future, super-powerful all-knowing computers will be generally available. How do we know? Just look at the movies ...

There were a couple of reasons The Computer always seemed particularly implausible. First, knowing a bit about real computers makes it harder for me to gloss over the technical hurdles. Force fields? Jet packs? Sure, why not? That's physics. Physics is what you major in if you're too smart for anything else. They'll figure it out. But a computer you can just type some vague query into and get a sensible answer? Come on. Like that'll happen.

Second, it always seemed like a computer smart enough to, essentially, act like the Encyclopedia Galactica would surely have all kinds of other powers that the careful scriptwriter would have to take into account. If The Computer can tell you who the bad guy in the painting is, why can't it tell you how to take him out?

You can probably tell where I'm going with this. Today, about twenty years after GBII, you can sit down at your home computer, type in the name of a historical figure and very likely come up with a concise, well-formatted description of who the person was, thanks to the now ubiquitous browser/search-engine/Wikipedia setup.

As powerful as it is, though, the system is an idiot savant. It won't tell you how to neutralize a malevolent spirit (or rather, it won't tell you a single, clear way to do so) and it won't do a lot of other things. It just allows you to quickly locate useful information that's already been discovered and made publicly available. It's powerful, but not magic.

What particularly strikes me about the description above is the presence of Wikipedia. Large, fast networks of computers were already building out by the mid-1990s. Mosaic came out in 1993, a few years after GBII. The missing piece, and one that I don't recall very many people predicting, was the massively-collaborative human-powered Wikipedia, not a technical advance in itself, but something very much enabled by several technical advances.

The Internet, HTTP, browsers, scripting languages, broadband, email, databases, server farms, cell phones, etc. -- these are all technologies. Wikipedia isn't, and yet it fits easily and comfortably into the list of advances from the last few decades. It fills a niche that's been anticipated for decades, but -- fascinatingly -- not by the anticipated means of using sheer computing power to somehow divine the history of the world.

Thursday, October 8, 2009

Lulu, Wikipedia and vanity

I've been looking into Lulu.com lately, not because I plan to use it, but as part of my ongoing and mostly unsuccessful effort to understand how the web and print publishing interact. Along the way I had a look at the Wikipedia article on vanity presses. Immediately my spidey-sense tingled that something was amiss there. In particular, the article mixes vanity presses with on-demand printers. On-demand printers such as Lulu fit the definition given at the top of the article, since they don't screen authors, but they definitely don't fall under the more popular notion of a vanity press scam.

There's a pretty good summary of the problem on the discussion page, under the heading This article is entirely wrong and defamatory to some of the organisations it references (no, tell 'em how you really feel). To understand the basic distinction, follow the money:
  • In a vanity press scam, you pay the publisher. They run a small printing of your work at an exorbitant fee, send you the books and pocket the difference.
  • With an on-demand printer, you upload your book and pay nothing. When people order it, they print it, ship it, send you a cut and keep a cut for themselves.
Both of these are different from a commercial publisher. A publisher does much more than just print books. It also markets, distributes, and edits them, typically pays an advance to authors against future royalties and assumes the financial risk involved in doing all this.

A vanity press pretends to be a publisher, but charges you in advance for what a publisher would normally do while not actually doing any of it. An on-demand printer does not claim to be a publisher (except in a limited sense described below), tells you exactly what they do and don't do and makes its money by taking a cut of whatever's actually printed and purchased.

From what I can see Lulu and company occupy a legitimate niche, allowing an author to bypass the screening process at the cost of assuming the marketing and editing duties. The author also forgoes any advance on royalties, thereby assuming some financial risk even without paying out of pocket. Printing costs are higher for on-demand publishing, but I doubt that's a major part of the picture compared to the other factors.

That said, if you're looking to self-publish, don't underestimate the value of the traditional publishing services. If you expect to sell purely on-line, you won't have to pay anything, but if you want to, say, sell physical books on a speaking tour, you'll have to buy the physical books. If you want your book listed on Amazon, you'll have to buy a distribution package from Lulu for $25-$75 plus the cost of a proof copy and make sure that your book meets certain distribution requirements. In any case you'll have to decide where to price your book, what the cover will look like, where and how to advertise it (at your own expense), etc., etc.

Caveat scriptor.

[I was going to write a follow-up for this, but on re-reading more closely it didn't seem like much had changed.  Even the "entirely wrong and defamatory" screed is still on the talk page.  Not every Wikipedia article has lots of eyes on it.

The basic analysis still holds, I think.  There are three segments: traditional publishers, print-on-demand support for self-publishers and outright vanity presses.  Where you put a company like Lulu depends on which dividing line seems more important.  If you care about which way the money flows, Lulu is on the "real publishing" side.  If you care about editing and marketing services, it's just another form of vanity/self-publishing. 

This piece, linked by the current version of the Wikipedia article, falls rather caustically on the "just more vanity" side, but does give a list of several uses for print-on-demand, including yearbooks, technical how-to-manuals and "time limitations" --D.H. Dec 2015]

Saturday, September 26, 2009

Wikipedia, voices and objectivity

In some sort of ideal world, we get our information purely from objective sources, apply cool judgment and act accordingly. In this world the ideal news article or reference text doesn't appear to have been written by anyone. It merely transmits facts, and only facts, to the reader directly and transparently.

This is a caricature, of course, but it's fairly close to what my high school journalism teacher taught, and it's woven deeply into Wikipedia's fabric under the label of Neutral Point of View (NPOV). On the other hand, Wikipedia is almost by definition a work in progress, constantly updated by a near-anarchy of mostly pseudonymous if not anonymous editors. No one can stop you from saying that hard-boiled eggs must only be cracked on the big end, and no one can stop me from correcting your heinous misconception. I mean, from expressing my personal opinion on the matter.

But it all works remarkably well, for several reasons:
  • Wikipedia is inclusive by nature. An encyclopedia aims to be all-inclusive to begin with. An online encyclopedia, without the limitations of physical ink and paper, doesn't have to worry about running out of space. More important, though, is the huge number of contributors. All the paper in the world is useless without someone to write on it. And revise. And re-revise. And so on. This is not to say that Wikipedia includes everything willy-nilly. There are definite policies for what can and cannot be included, but they're aimed towards notability and not someone's idea of correctness.
  • The guidelines like NPOV really do matter because they're supported by a strong culture. The community has long since reached a critical mass of active members that take Wikipedia policy seriously and act to reinforce it and to repair breaches, even if that means tediously reverting an endless stream of "MY MATH TEECHUR SUX DOOD" and worse vandalism.
  • It's generally easy to tell when someone is injecting opinion. It's even easier to tell when two (or more) people are trying to inject conflicting opinions. The occasional jumble of "Some authorities [who?] insist that ... however so-and-so[17] has stated that ... " doesn't necessarily make for smooth or pleasant reading, but it does tend to make clear who's grinding which ax.
  • Similarly, it's easy to spot a backwater article that hasn't seen a lot of editing. This is not necessarily a bad thing. Obscure math articles, for example, tend to read like someone's first draft of a textbook, full of "Let x ..." and "it then clearly follows that ..." The prose may be a bit chewy, but whoever wrote it almost certainly cared enough to get the details right. Articles on obscure bands generally read like liner notes and tend to slightly hype that band's achievements and their home-town music scene. That's fine. Take it with a grain of salt and enjoy the tidbits you wouldn't have heard otherwise.
  • Likewise, it's easy to tell when an article has had a good going-over. Articles on "controversial" topics may or may not have had their "on the other hand ... on the other other hand ..." back-and-forth smoothed out, but they do tend to accumulate copious footnotes. Just as one could argue that forums exist to generate FAQ lists, one could argue that such articles exist to gather references to primary sources.
Whenever I find myself too far out on my "web changes nothing" limb, it helps to consider Wikipedia and realize that there's really nothing quite like it. But it's also important, I think, to realize that Wikipedia works so well not because it works perfectly -- it clearly doesn't -- but because it's robust in the face of its imperfections. This is a property of good distributed systems in general, the distributed system in this case comprising not just the author/editors, but the reader taking Wikipedia's nature into account.


P.S.: While fetching up the link for NPOV above, I first tried "npov", figuring it would redirect to the right place, WP:NPOV, since I can never remember the right prefix for the special pages. Oddly enough, if you don't capitalize it the right way, npov redirects to Journalism. Not sure I buy that, but it's an interesting angle.

Tuesday, June 16, 2009

Baker's Dozen: How many cities?

While I was putting together the previous post, on crowdsourcing, I tried to look up how many cities there were in the US (for some reasonable definition of "city"). This seemed right up Wolfram Alpha's alley, so I tried "How many cities are there in the US?" Alpha did something I hadn't seen it do before. It answered the question, but not well.

But at least it's pretty clear where it went astray. For whatever reason, it assumed I was interested in the largest cities in the US and gave me the top five. There was a "more" link, but that just expanded the list to the top 10. Who knew only nine have over a million people? (San Jose is tenth with about 900,000)

Unlike True Knowledge, Alpha chooses not to clutter its display with a list of web hits. That's probably why I missed the "web search" button the first time around. Chasing that takes me to Google. The top hit is WikiAnswers. The answer states that "Because many towns are considered counties in some States it gets complex" and points me at City-Data.com, which is supposed to have all of them. Maybe it does, but it doesn't seem to answer the question directly.

OK, where did Alpha get its information then? Apparently from a variety of sources, including the CIA fact book and the US Census Bureau. So maybe look there. There's also a link to the Wikipedia article, and visually skimming that I see that "In 2006, 254 incorporated places had populations over 100,000."

That's a decent answer. 100,000 is a commonly used if somewhat arbitrary cutoff point for cityhood. It wasn't really what I was after, though. I was looking for places one might use in a query of the distance from place A to place B, and I would expect that plenty of places with under 100,000 people would qualify. But that'll have to wait.

So ... Alpha to Google to WikiAnswers to a dead end, Alpha to Wikipedia to a plausible answer; two pointers toward the US Census, which is where I would have gone looking if I hadn't had search engines to guide me.

Friday, June 12, 2009

Baker's dozen: Crowdsourcing

As we've seen, getting a computer to understand a simple English question is not necessarily easy. People, on the other hand, are reasonably good at the task. So instead of trying to get a computer to answer a question, why not use the computer purely as a means of communication in order to connect a question with someone's direct answer? Two efforts along those lines come to mind.

The creation of Wikipedia founder Jimmy Wales, Wikia Search officially folded its tent last month. Naturally, Wikipedia has an article on the topic, not all of which has quite made it into past tense. The Wikia search site now redirects to Wikianswers, not to be confused with WikiAnswers.com, which I'll get to.

The first question of the baker's dozen to get an answer other than "This question has not been answered." is number 6: Who starred in 2001? This gets a "Magic answer", presented in a curtained frame with black background and a magician's top hat in one corner. The answer is attributed to Yahoo! answers and begins "It is an excellent movie. I give it four stars out of 5." The title of the movie is nowhere mentioned, but it appears to have starred Nicole Kidman and have been set during "gee umm WWI or WWII". A couple of minutes on IMDB identifies the film as The Others. Curiously, the more specific question Who starred in 2001: a Space Odyssey? gets no answer.

I also got a magic answer from Yahoo! on Who invented the hammock? and this time it's relevant: the hammock "originated in Central America more than 1,000 years ago." There seem to be two schools of thought on this one: Central America and Amazon basin. I say it was Colonel Mustard in the library with a lead pipe.

WikiAnswers.com is much the same beast as Wikianswers but commercial and -- according to Wikipedia -- more heavily trafficked. The results are not particularly different from those of Wikianswers, but it does answer How far is it from Bangor to New York?

Going a bit further afield, what about using Twitter as a search engine? If you've got a question, send it out as a tweet and see what comes back. There has apparently been some buzz about this concept, and indeed it's one of the options Wikianswers (the first one, not WikiAnswers.com) gives if it can't answer a question. Farhad Manjoo offers a contrasting viewpoint on Slate.com. The gist, if I understand aright, is that in order to sort through the responses, you need a real search engine, so why not just hook Twitter up with an existing search engine and be done with it?

All in all, crowdsourcing doesn't seem to deliver great results here. Why would that be?

Crowdsourcing, at least the free and open Wiki-style variety, depends on each person being able to get more out than they put in. This is possible because information is not consumed, only used -- if you learn something from a source, that doesn't prevent someone else from learning something from it later. It's also possible because sharing knowledge can be its own reward, but I suspect that's a smaller factor.

The classic case is Wikipedia. If 10,000 people read an article, and only 1/10th edit it, and only 1/10th of those edit it in a substantially useful way, you've still got a hundred people working on the article. Naturally I'm making up those numbers, but real experience suggests something of the kind is at work.

Single, discrete answers are not the same as in-depth articles. For example, suppose there are 10,000 places of interest. There are then 100,000,000 questions of the form "How far is it from X to Y?" You can get rid of the 10,000 cases where X and Y are the same and half of the rest because it's just as far from X to Y as from Y to X, but that still leaves about 50,000,000 possible questions.
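The arithmetic, as a quick sanity check in Python:

```python
n = 10_000                   # places of interest
ordered = n * n              # all "X to Y" questions: 100,000,000
distinct = (n * n - n) // 2  # drop the X == Y cases, halve the symmetric rest
print(f"{ordered:,} ordered pairs, {distinct:,} distinct questions")
# 100,000,000 ordered pairs, 49,995,000 distinct questions
```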

The odds of any particular question coming up more than once will depend on the prominence of the places.  It's quite possible that many people will be interested in how far it is from LA to New York, but if I'm doing a tour from Schenectady to Poughkeepsie to Paducah to Tehachapi to Tonopah, I'm probably not going to find that someone else has already asked and had answered those particular combinations.

If I keep striking out asking questions, why should I go to any trouble to pass along the answers I finally do dig up elsewhere? The canonical answer is for the good of the wiki as a whole, and more selfishly to improve the odds I find my answer next time on the assumption everyone is doing likewise. But if I can generally find the answer without the wiki, why do I care whether the wiki can also answer it? Wikipedia wins because it gathers information that's not readily found in one place elsewhere.

On the other hand, a map database, once it's learned the 10,000 places and the routes between them, will gladly answer any and all distance queries with equal ease.

Not every potential question for a crowdsourced engine has the odds stacked so strongly against it. Probably lots of people want to know celebrity du jour's birthday. Unfortunately, that's just the kind of information that's fairly easy to track down with existing tools.

The True Knowledge experience showed another potential problem. Making information easy to find means indexing it, and indexing is a different beast from asking questions. Wikipedia, for example, provides two basic means of structuring information, as distinct from just typing it in: categorizing (tagging) it and organizing the body text into articles, sections, subsections etc. The results are not perfect, but they're very helpful and probably about as much as we can expect from the crowd. Trying to have the crowd too intimately involved in the mechanics of a search mechanism itself is probably not a good fit.

On the other hand, crowd-generated content is great. A large portion, though not 100%, of the web is crowd-generated. As a result, just searching Wikipedia often works well. I prefer it when the result I'm after is something like an encyclopedia article. Along with its take, Wikipedia will provide links to sources and if that's not enough I can still Google. I'll use Wikipedia's native index if I know the particular topic (or can get close). Otherwise I use Google and happily read any relevant Wikipedia articles that show up.

This seems a good division of labor. People write the content and machines search and collate.

Sunday, April 19, 2009

Terms of wiki art

"Link rot" is the tendency for URLs to become invalid as the sites they point to go dead or move elsewhere (and any forwarding left behind goes dead). It's an annoying but necessary consequence of a very basic principle of the web: links don't have to point at anything, even though they generally should*. It's probably less of a problem than it used to be as more material comes to live on sites hosted by large, durable entities. Blogger.com, now a Google property, for example. As the man said, cool URIs don't change.

Wikipedia and similar wikis add a particular twist: Links within the wiki generally don't go dead; they go weird. Some ways this can happen:
  • The original link points to an article on, say, crickets. Per usual custom, the actual link reads [[Cricket|crickets]]. That is, it appears as "crickets" but actually points to the article entitled Cricket. This is originally about the insect, but soon someone adds an article on the game. The link now points at either the disambiguation page for the various possibilities of Cricket or at the article for the game, depending on how the process proceeds.
  • The original link points to a specialized article, say on cricket songs. This is later deemed not to be worth its own article and gets folded into Cricket (insect). Helpful bots redirect the link in the article, but the link is now considerably less useful, particularly if it was originally something like [[Cricket song|song]] and later edits rearrange the sentence the link appears in. You start with something like "The sound of the instrument has been compared to the [[Cricket song|song]] of crickets." and end up with something like "The sound of the instrument has been compared to insect [[Cricket (insect)|song]]," with the actual material on cricket song somewhere on the page.
  • In the previous case, the section on cricket song may later be removed, possibly completely or possibly to, say, a general page on insect sounds. The [[Cricket (insect)|song]] link now points to an article on the cricket, with at best a link in the general direction of the original material on its song, said link being in some random spot on what is now a very thorough and complete article on the cricket, its diet and habits, its appearance, its significance in human culture, etc. etc.
  • Or ... the first two cases can combine to leave a link that appears as "song", points to Cricket and lands you — huh?? — at an article on an inscrutable pastime of the Commonwealth.
I'm 90% sure the Wikipedia community has a term of art for this, but the obvious choices of "wikirot" and "wiki rot" don't seem to turn up anything. "Wiki gardening" is the practice of tending a wiki in order to counter such rot and generally improve the organization of the wiki.
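As a toy example of the mechanical end of gardening, here's a sketch that checks whether a batch of link targets have quietly turned into disambiguation pages.  It uses the standard MediaWiki query API (prop=pageprops) as I understand it; the target list is made up, and a real gardening bot would be considerably more careful.

```python
# Sketch: flag wiki-link targets that are now disambiguation pages.
# Assumes the enwiki action API exposes the "disambiguation" page prop
# (set by the Disambiguator extension); the target list is invented.
import requests

API = "https://en.wikipedia.org/w/api.php"
TARGETS = ["Cricket", "Cricket (insect)", "Bifurcation theory"]

def disambiguation_pages(titles):
    params = {
        "action": "query", "format": "json",
        "titles": "|".join(titles),
        "prop": "pageprops", "ppprop": "disambiguation",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    return [page["title"] for page in pages.values()
            if "disambiguation" in page.get("pageprops", {})]

for title in disambiguation_pages(TARGETS):
    print(f"gardening needed: links to '{title}' now hit a disambiguation page")
```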

While I'm at it, is there a term for the practice of "wikifying" (making links for) marginally relevant terms while leaving really relevant ones "unwikified"?

* For a little more on dangling links as a principle of web architecture, see this post and this one. Appropriately enough, the relevant snippets are buried in the middle of them.

Tuesday, December 23, 2008

All of human knowledge

In the annual (?) appeal for funding for the Wikimedia Foundation, Jimmy Wales asks us to
Imagine a world in which every single person on the planet is given free access to the sum of all human knowledge.
This seems like perfectly fine wording for a fundraising appeal, a decent description of what Wikipedia is about, and a noble ideal to boot. So let's rain on the parade by picking it apart, shall we?

Is it possible, even in principle, to give even one person access to the sum of all human knowledge? Actually, what does "the sum of human knowledge" even mean? Some time ago, I was convinced it was "everything in the encyclopedia". Now I'm not so sure. Wikipedia itself specifically excludes knowledge that isn't "notable" (what did I have for breakfast yesterday?) and "original research" such as tends to creep in as people summarize pieces of articles and draw conclusions from them. It also goes to great lengths to exclude or at least neutralize opinion (POV in the jargon (*)).

In other words, it aims to gather information generally accepted as "known". This is the kind of philosophical quicksand that holds up just fine so long as all you do is walk blithely across it. So let's just walk ...

Assuming there's such a thing as the sum of human knowledge, for some value of "knowledge", could anyone access it? Well, you don't really want to access all of it. You couldn't anyway. You want to be able to access the bit you need at the moment, right then and there.

This runs directly into the limits of human bandwidth. Not only is there only so much raw information you can process at one time, there is only so much metadata -- information about what and where other information is -- that you can process at one time. Sure, the knowledge you're looking for is in there, and you have both the careful work of editors and categorizers and the raw horsepower of text search at your disposal. But can you find it? Empirically, the answer so far is "often". I doubt it will ever be "always".

Nonetheless, an unachievable goal is still worth aiming for so long as we produce useful results along the way.

(*) The Wikipedia article on POV contains a very relevant bit of wisdom:

In Thought du Jour Harold Geneen has stated:[1]

The reliability of the person giving you the facts is as important as the facts themselves. Keep in mind that facts are seldom facts, but what people think are facts, heavily tinged with assumptions.

Friday, March 14, 2008

The Economist on the soul of Wikipedia

Lots of fun stuff in The Economist's technology quarterly, including a piece on "The Battle for Wikipedia's Soul".

Some time ago I came away from a debate with an anarchist/libertarian friend with the conclusion that, for better or worse, government is just something that people tend to do. Wikipedia seems a perfect case in point.

Wikipedia started like any wiki, lean and mean and (pretty much) free for all. Over time, however, it has developed rules, customs, social groupings and hierarchies just like any other society.

Also in common with governmental forms of other societies, these have taken on a life of their own. The article quotes a 2006 estimate that entries about governance and editorial policies were the fastest-growing segment and comprised around a quarter of the total content.

I'm curious as to how this was reckoned. Wikipedia claims over 2 million English articles, and even if most of these are rather small, it's hard to believe the WP: space is as big as half a million randomly chosen articles. I'm guessing the figure includes talk pages. In any case, the larger point stands: The Wikipedia community devotes significant resources to governing itself.

One major point of discussion, probably the major one, is what gets in and what stays out. There are two schools of thought. Inclusionists prefer to include as much as possible. Deletionists try to eliminate frivolous or badly-written material.

The heart of the problem is that there are no hard-and-fast rules for deciding what's worthy and what's not. Bad articles are like obscenity: you might not be able to define it, but you know it when you see it. And different people see it differently. In the absence of consensus, judgment comes into play, and with that, the question of who does the judging. There's simply no way to decide that will leave everyone happy.

Is this a problem? Not necessarily. Such imperfection is part of every human system I'm aware of. The more important question is how to deal with that imperfection. If there's an epic battle between inclusionists and deletionists, as opposed to just a normal give-and-take, the question is not who will win, but what damage the battle will do to the system as a whole.

Wednesday, February 6, 2008

Tourism today

Here's Richard Stallman talking to an audience in Stockholm, at the Kungliga Tekniska Hogskolan (Royal Institute of Technology), in October 1986. I've edited a bit to draw out the point I want to make. Please see the original transcript for further detail.
Now "tourism" is a very old tradition at the AI lab, that went along with our other forms of anarchy, and that was that we'd let outsiders come and use the machine. Now in the days where anybody could walk up to the machine and log in as anything he pleased this was automatic: if you came and visited, you could log in and you could work. Later on we formalized this a little bit, as an accepted tradition specially when the Arpanet began and people started connecting to our machines from all over the country.

Now what we'd hope for was that these people would actually learn to program and they would start changing the operating system. If you say this to the system manager anywhere else he'd be horrified. If you'd suggest that any outsider might use the machine, he'll say ``But what if he starts changing our system programs?'' But for us, when an outsider started to change the system programs, that meant he was showing a real interest in becoming a contributing member of the community.

We would always encourage them to do this. [...] So we would always hope for tourists to become system maintainers, and perhaps then they would get hired, after they had already begun working on system programs and shown us that they were capable of doing good work.

But the ITS machines had certain other features that helped prevent this from getting out of hand, one of these was the ``spy'' feature, where anybody could watch what anyone else was doing. And of course tourists loved to spy, they think it's such a neat thing, it's a little bit naughty you see, but the result is that if any tourist starts doing anything that causes trouble there's always somebody else watching him.

So pretty soon his friends would get very mad because they would know that the continued existence of tourism depended on tourists being responsible. So usually there would be somebody who would know who the guy was, and we'd be able to let him leave us alone. And if we couldn't, then what we would [do] was we would turn off access from certain places completely, for a while, and when we turned it back on, he would have gone away and forgotten about us. And so it went on for years and years and years.
In sum:
  • Everyone can change the system. In fact, everyone is openly encouraged to change the system.
  • People who make good changes rise in the ranks and eventually help run the place.
  • Everyone can see what people are up to, and in particular what changes they're making.
  • Vandals are locked out and generally go away after a bit (likely to be replaced by new, nearly identical vandals). Sites can be blocked if blocking individuals doesn't work.
One of rms's main themes is that this is The Way Things Should Be. Now, it's all well and good to have this sort of semi-anarchic meritocracy in the hallowed halls of academia, with access physically limited to those who worked in the lab or wandered in off the street. It might even work on the early Arpanet, several orders of magnitude smaller than today's wild and woolly web. But surely it'll never work on a big scale in the commercial world. After all, one of rms's other main themes is that commercialism killed the AI lab (at least in the form he describes) and threatens environments like it.

The obvious counter-argument is that open source software (a term, I should mention, that rms disfavors) works quite well on the same basic principles, albeit not always strictly according to the FSF model. It's a good point, but the average open source project is a fairly small system. I doubt there are many with more than a few dozen active participants at any given time. Such projects also tend to have a limited audience, a limited pool of potential contributors, or both.

However, there is at least one very large and prominent system, with hundreds of thousands of participants, that the bullet points above describe almost as though I'd written them with it in mind (which I did): Wikipedia.

Wednesday, December 26, 2007

80% of the solution in a fraction of the time

As can happen, I set out to write this piece once already, only to end up with a slightly different one. Here's another take, bringing Wikipedia into the picture.

First, let me say I like Wikipedia. A quick scan will show I refer to it all the time. I see it as a default starting point for information on a particular topic (as opposed to a narrowly-focused search for a given document or type of document). I don't see it as definitive, but I don't think that's really its job.

Wikipedia would seem a perfect test case for Eric S. Raymond's formulation of Linus's Law ("Given enough eyeballs, all bugs are shallow"). But -- as Wikipedia's page on Raymond dutifully reports -- Raymond himself has said, well, here's how it came out in a New Yorker article:
Even Eric Raymond, the open-source pioneer whose work inspired Wales, argues that “ ‘disaster’ is not too strong a word” for Wikipedia. In his view, the site is “infested with moonbats.” (Think hobgoblins of little minds, varsity division.) He has found his corrections to entries on science fiction dismantled by users who evidently felt that he was trespassing on their terrain. “The more you look at what some of the Wikipedia contributors have done, the better Britannica looks,” Raymond said. He believes that the open-source model is simply inapplicable to an encyclopedia. For software, there is an objective standard: either it works or it doesn’t. There is no such test for truth.
Let's start right there. Software doesn't simply either work or not. You can't even put it on some sort of linear, objective "goodness" scale. Even in cases where you'd think software is cut and dried, it isn't. Did you test that sort routine on all N! orderings of N elements? Of course you didn't. Did you rigorously prove its correctness? How do you know your correctness proof is correct? Don't laugh: mathematicians routinely find holes in each other's proofs, in some cases even after publication.
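For tiny N you actually can run the exhaustive test, which is exactly why the rhetorical question bites at realistic N. A quick sketch in Python (the sort under test here is just the built-in, standing in for whatever routine you wrote):

    # Exhaustively checking a sort on all N! orderings is feasible only
    # for tiny N: 10! is already 3,628,800 cases, and 20! is about 2.4e18.
    from itertools import permutations

    def check_sort(sort_fn, n):
        """Run sort_fn on every ordering of 0..n-1 and check the result."""
        expected = list(range(n))
        for perm in permutations(range(n)):
            assert sort_fn(list(perm)) == expected, f"failed on {perm}"

    check_sort(sorted, 7)    # 5,040 cases: instant
    # check_sort(sorted, 20) # ~2.4e18 cases: not in this lifetime

Past a dozen or so elements, exhaustive testing is off the table, and you're back to sampling, reasoning, and hoping your reasoning is right.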

But most software is nowhere near this regime. Often we don't even know exactly what we're trying to write when we set out to write it (thus much of the emphasis on "agile" development techniques). In the case of something like a game, or even a website design, most of what we're after is a subjectively good experience, not something objectively testable (though ironically games seem to put a bigger premium on basic correctness, since bugs spoil the illusion).

It's not even completely clear when software doesn't work. If a piece of code is supposed to do X and Y, but in fact does Y and Z, does it work? It does if I need it to do Y or Z. What if it hangs when you try to do X, but there's an easy work-around? What if it hangs at random 10% of the time when you try to do X, but that's tolerable and nothing else does X at all? What if it does X if a coin flip comes up heads, but might not if it doesn't? I'm not making that one up. See this Wikipedia article (of course) for more info. What if it's an operating system and it just plain hangs some of the time? Not that that would ever happen.
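I won't try to reconstruct which article that was, but randomized "Monte Carlo" algorithms are the textbook case of coin-flip-dependent behavior: they're allowed to answer wrong with some small probability in exchange for speed. Freivalds' check for matrix multiplication is a compact example, sketched here in plain Python (real code would use a numerics library):

    # Freivalds' algorithm: probabilistically check whether A x B == C.
    # If the product is wrong, each trial catches it with probability
    # >= 1/2, so k trials drive the error rate below 2^-k -- but any
    # single run really can come up "heads" and miss the bug.
    import random

    def freivalds(A, B, C, trials=20):
        n = len(A)
        for _ in range(trials):
            r = [random.randint(0, 1) for _ in range(n)]  # random 0/1 vector
            Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
            ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
            Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
            if ABr != Cr:
                return False   # definitely A x B != C
        return True            # almost certainly A x B == C

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(freivalds(A, B, [[19, 22], [43, 50]]))  # True: the product is right
    print(freivalds(A, B, [[19, 22], [43, 51]]))  # False, with overwhelming probability

Is that "working" software? It trades certainty for speed by design, and Raymond's either/or standard has no comfortable place to put it.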

All of this to say that I doubt that software and encyclopedia entries make such different demands on their development process. And as a corollary, I think the results are about the same. Namely, there are excellent results in some cases, reasonable but not excellent results in many cases, and occasional out-and-out garbage.

Here's what I think goes on, roughly, in both cases:
  • Someone comes up with an idea. That person may be an expert in the field, or may just have what looks like a neat idea.
  • The original person produces a first draft, or perhaps just a "stub", or an "enhancement request".
  • If no one with the expertise to take it further is persuaded to do so, it stays right there indefinitely, or may even be purged from the system (perhaps to re-appear later).
  • If the idea has legs, one or more people take it up and make improvements.
  • Typically, one of three stable states is reached:
    • It's perfect. Nothing more than minor cosmetic changes can be added. New ideas along the same lines typically become their own projects.
    • It's good enough for everyone currently involved. That may not be particularly good, but no one can be persuaded to go further. This may be the case right out of the gate, or after several rounds of fixes by a single originator.
    • It's not good enough, but there is no agreement on how to take it further. Work may grind to a halt as competing fixes go in and come out, or the project may split into two similar projects, with better or worse sharing of common material and effort.
Thinking it over, this process is not unique to open source. The magic of the open approach is that the bigger the pool of participants, the bigger the chance that an idea with legs will get supporters and get fleshed out, and the faster it will get to a stable state. In our imperfect world, that stable state is generally short of perfection. Put the two together and you have 80% of the solution in a fraction of the time.

That said, there are some differences between prose and software. I've argued above that software isn't hard and fast. It's soft, in other words. But prose is even softer. As a result, there is greater potential for disagreement on where to go, and in case of disagreement, there looks to be a better chance of thrashing back and forth with competing fixes, as opposed to moving forward but with separate (and to some extent redundant) solutions.

Wikipedia does seem to attract more vandals, but this is not necessarily because it's not software. It may also be because it openly invites frequent edits from a very large pool and changes are moderated after the fact. Open software projects, particularly critical pieces like kernels and basic tools, tend to require changes to pass by a small group of gatekeepers before being checked in. Conversely, some wikis are moderated.

As usual, this is all just my rough "figuring it out as I go along" guess, not anything with actual numbers behind it, but that's my story and I'm sticking to it for now.