Friday, June 12, 2009

Baker's dozen: Crowdsourcing

As we've seen, getting a computer to understand a simple English question is not necessarily easy. People, on the other hand, are reasonably good at the task. So instead of trying to get a computer to answer a question, why not use the computer purely as a means of communication, connecting a question with someone's direct answer? Two efforts along those lines come to mind.

Wikia Search, the creation of Wikipedia founder Jimmy Wales, officially folded its tent last month. Naturally, Wikipedia has an article on the topic, not all of which has quite made it into the past tense. The Wikia Search site now redirects to Wikianswers, not to be confused with WikiAnswers.com, which I'll get to.

The first question of the baker's dozen to get an answer other than "This question has not been answered." is number 6: Who starred in 2001? This gets a "Magic answer", presented in a curtained frame with a black background and a magician's top hat in one corner. The answer is attributed to Yahoo! Answers and begins "It is an excellent movie. I give it four stars out of 5." The title of the movie is nowhere mentioned, but it appears to have starred Nicole Kidman and to have been set during "gee umm WWI or WWII". A couple of minutes on IMDB identifies the film as The Others. Curiously, the more specific question Who starred in 2001: A Space Odyssey? gets no answer.

I also got a magic answer from Yahoo! on Who invented the hammock? and this time it's relevant: the hammock "originated in Central America more than 1,000 years ago." There seem to be two schools of thought on this one: Central America and the Amazon basin. I say it was Colonel Mustard in the library with a lead pipe.

WikiAnswers.com is much the same beast as Wikianswers but commercial and -- according to Wikipedia -- more heavily trafficked. The results are not particularly different from those of Wikianswers, but it does answer How far is it from Bangor to New York?

Going a bit further afield, what about using Twitter as a search engine? If you've got a question, send it out as a tweet and see what comes back. There has apparently been some buzz about this concept, and indeed it's one of the options Wikianswers (the first one, not WikiAnswers.com) gives if it can't answer a question. Farhad Manjoo offers a contrasting viewpoint on Slate.com. The gist, if I understand aright, is that in order to sort through the responses, you need a real search engine, so why not just hook Twitter up with an existing search engine and be done with it?

All in all, crowdsourcing doesn't seem to deliver great results here. Why would that be?

Crowdsourcing, at least the free and open Wiki-style variety, depends on each person being able to get more out than they put in. This is possible because information is not consumed, only used -- if you learn something from a source, that doesn't prevent someone else from learning something from it later. It's also possible because sharing knowledge can be its own reward, but I suspect that's a smaller factor.

The classic case is Wikipedia. If 10,000 people read an article, only a tenth of them edit it, and only a tenth of those edit it in a substantially useful way, you've still got a hundred people working on the article. Naturally I'm making up those numbers, but real experience suggests something of the kind is at work.

Single, discrete answers are not the same as in-depth articles. For example, suppose there are 10,000 places of interest. There are then 100,000,000 questions of the form "How far is it from X to Y?" You can get rid of the 10,000 cases where X and Y are the same, and half of the rest because it's just as far from X to Y as from Y to X, but that still leaves about 50,000,000 possible questions.
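Just to check the arithmetic, here's the back-of-the-envelope count as a few lines of Python (the 10,000 figure is, again, made up):

    # Rough count of distinct "How far is it from X to Y?" questions,
    # given an invented figure of 10,000 places of interest.
    places = 10000

    ordered = places * places        # 100,000,000 ordered pairs
    different = ordered - places     # drop the X == Y cases
    unordered = different // 2       # X-to-Y is the same distance as Y-to-X

    print(unordered)                 # 49,995,000 -- call it 50 million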

The odds of any particular question coming up more than once will depend on the prominence of the places. It's quite possible that many people will be interested in how far it is from LA to New York, but if I'm doing a tour from Schenectady to Poughkeepsie to Paducah to Tehachapi to Tonopah, I'm probably not going to find that someone else has already asked and had answered those particular combinations.

If I keep striking out asking questions, why should I go to any trouble to pass along the answers I finally do dig up elsewhere? The canonical answer is: for the good of the wiki as a whole, and, more selfishly, to improve the odds that I'll find my answer there next time, on the assumption that everyone else is doing likewise. But if I can generally find the answer without the wiki, why do I care whether the wiki can also answer it? Wikipedia wins because it gathers together information that isn't readily found in any one place elsewhere.

On the other hand, a map database, once it's learned the 10,000 places and the routes between them, will gladly answer any and all distance queries with equal ease.
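Here's a minimal sketch of what that looks like, assuming the database simply stores coordinates for each place. The entries and figures below are rough, and a real service would compute road-route distances rather than straight-line ones, but the point is the same: any pairing is answered on demand, whether or not anyone has asked about it before.

    from math import radians, sin, cos, asin, sqrt

    # A toy place table with approximate coordinates; a real database
    # would hold all 10,000 entries and the routes between them.
    PLACES = {
        "Schenectady": (42.81, -73.94),
        "Poughkeepsie": (41.71, -73.92),
        "Paducah": (37.08, -88.60),
        "Tehachapi": (35.13, -118.45),
    }

    def distance_miles(a, b):
        """Great-circle distance between two stored places, in miles."""
        lat1, lon1 = map(radians, PLACES[a])
        lat2, lon2 = map(radians, PLACES[b])
        h = sin((lat2 - lat1) / 2) ** 2 + \
            cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 3959 * asin(sqrt(h))  # 3959: Earth's radius in miles

    # Nobody had to ask about this particular pair beforehand.
    print(round(distance_miles("Schenectady", "Poughkeepsie")), "miles")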

Not every potential question for a crowdsourced engine has the odds stacked so strongly against it. Probably lots of people want to know the celebrity du jour's birthday. Unfortunately, that's just the kind of information that's fairly easy to track down with existing tools.

The True Knowledge experience showed another potential problem. Making information easy to find means indexing it, and indexing is a different beast from asking questions. Wikipedia, for example, provides two basic means of structuring information, as distinct from just typing it in: categorizing (tagging) it and organizing the body text into articles, sections, subsections and so forth. The results are not perfect, but they're very helpful and probably about as much as we can expect from the crowd. Trying to have the crowd too intimately involved in the mechanics of the search mechanism itself is probably not a good fit.

On the other hand, crowd-generated content is great. A large portion, though not 100%, of the web is crowd-generated. As a result, just searching Wikipedia often works well. I prefer it when the result I'm after is something like an encyclopedia article. Along with its take, Wikipedia will provide links to sources and if that's not enough I can still Google. I'll use Wikipedia's native index if I know the particular topic (or can get close). Otherwise I use Google and happily read any relevant Wikipedia articles that show up.

This seems a good division of labor. People write the content and machines search and collate.

1 comment:

earl said...

Did you catch this?

http://www.npr.org/templates/story/story.php?storyId=105438559