Thursday, June 26, 2008

Searching for a smarter search engine

One look at Google's quarterly reports should be enough to understand why people are still trying to build a better search engine. Google search does a great job. It will come as no shock that I've consulted it repeatedly in practically every post here. A friend once described it as adding (say) 25 points to his IQ, though not everyone agrees with that assessment.

I've cited Google as a classic case of "dumb is smarter". Google doesn't try to do anything one might consider "understanding" the material it's indexing. For the most part it just looks at words and the links between pages. There is some secret sauce involved, for example in handling inflections or making it harder to game the rankings. Mainly, though, Google wins because its PageRank algorithm turns out to do a good job of finding relevant pages and because it throws massive amounts of computing power at indexing everything in sight [There's a lot of secret sauce involved in getting that to work at the scale Google operates on].
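
[Geek note: to give a feel for how "dumb" the link-counting idea is, here is a minimal sketch of the textbook version of PageRank. This is emphatically not Google's production code. The function name, the 0.85 damping factor, the fixed iteration count and the three-page toy web are all assumptions made for illustration. Each page repeatedly passes its score along its outgoing links, and nothing in the loop looks at what any page actually says.]

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by repeated averaging (illustrative only).

    links maps each page to the list of pages it links to.
    Every link target must also appear as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's score evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its score over every page.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a and b both link to c, and c links back to a.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy web, page c comes out on top purely because two pages link to it, which is the whole point: relevance falls out of the link structure, not out of any understanding of the text.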

Google is the dominant search engine, but that doesn't mean there's no room for other engines, particularly engines that take a noticeably different approach or that try to solve a noticeably different problem. Powerset is one such engine. Rather than trying to index the entire web by keyword, Powerset answers English queries about material in Wikipedia. Without delving into a proper product review or comparison, which would have to include at least Google and, say, Ask (formerly Ask Jeeves), I'll just note a few impressions and head on to my real goal of blue-sky speculation [geek note: The "power set" of a set is the set of all that set's subsets; less formally, all the combinations of a given set of elements.].
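
[Geek note, continued: for anyone who wants the mathematical power set spelled out, here is a small sketch. It has nothing to do with how the Powerset engine works, and the function name power_set is my own. It simply lists every combination of every size, from the empty set up to the full set.]

```python
from itertools import chain, combinations


def power_set(items):
    """Return every subset of items as a list of tuples, smallest first."""
    pool = list(items)
    # Take the combinations of size 0, 1, ..., len(pool) and chain them together.
    return list(
        chain.from_iterable(combinations(pool, size) for size in range(len(pool) + 1))
    )


subsets = power_set(["a", "b", "c"])
print(len(subsets))  # a set of 3 elements has 2**3 = 8 subsets
```

A set of n elements has 2**n subsets, so the list doubles in length with every element added.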

Suppose you want to know when John von Neumann was born. You ask "When was John von Neumann born?" Hmm ... oddly enough, Powerset didn't answer that one directly. It did give the Wikipedia page for von Neumann, which gives the answer (December 28, 1903). "When was Mel Brooks born?" works more as intended, with a nice big "1926" at the top of the results. It also shows a link to a page that says 1928, but seems to know better than to believe it.

Other examples
  • "Where is the world's tallest building?" turns up the list of tallest buildings.
  • "What is the time zone for Afghanistan?" turns up a list of pages, the first of which mentions the right answer.
  • "How much money has been spent on cancer research?" turns up a link giving a figure for the UK, but nothing suggesting an overall figure.
  • "Why is there air?" brings up the Bill Cosby album of the same name.

Beyond accepting questions posed in plain English, Powerset also aims to give you a richer view of the results it finds. This includes an outline of the page contents and a list of "Factz" gleaned from the text. These take the form of short subject-verb-object near-sentences like (in the case of the "tallest building" article) "dozens measure meter" and "television broadcasts towers". Click on one of these and it highlights a relevant passage in the text. Both of these examples lead to the same sentence: "In terms of absolute height, the tallest structures are currently the dozens of radio and television broadcasting towers which measure over 600 meters (about 2,000 feet) in height."

It's not immediately clear what this is supposed to give me. Powerset says "For most people, places and things, Powerset shows a summary of Factz from across Wikipedia," and to illustrate this, it shows a section of a table of Factz about Henry VIII -- whom he married (wife, Anne Boleyn ...), what he dissolved (monasteries, Abbey ...) and so forth. Evidently Henry provides a better example than tall buildings do.

The Factz summary appears to be the sort of thing that Powerset is really driving at. It's certainly the sort of thing that initially drew me to take a look. Rather than just index words, Powerset attempts to extract meaning from the text and present it in a structured way. In other words, it tries to be smart and, in some limited sense, understand the material it's indexing. For example, along with the listing of three-part Factz, it will also display "things" and "actions", with items it deems more significant shown larger.

If we view this smarter approach as an attempt at understanding, however limited, then I'm not sure that the Powerset engine understands all that much. It seems pretty good at distinguishing nouns from verbs, but beyond that, I'm not sure what "dozens measure meter" really signifies. Even in a seemingly simple factual statement like the one quoted, there is more going on than "dozens" "measuring".

It matters that it's dozens of towers, not dozens of meters (or dozens of eggs). It matters that the towers measure more than 600m tall and not less. It matters that the towers are being judged tallest in the limited context of "absolute height". It matters that this is "current", since the Burj Dubai, when completed, will be the tallest completed structure, period. This matters particularly because much of the article is spent wrangling over the meaning of "tallest", a debate which will soon be moot, at least for a while. The Factz approach appears to miss all this, none of which is particularly subtle from a human point of view.

Google, in the meantime, doesn't try to do any of this, but seems to do just fine on the queries above, given verbatim (not in googlese, and without quotes). For "What is the time zone for Afghanistan?" for instance, it said "Time Zone: (UTC+4:30) according to Wikipedia" right at the top. And, of course, Google indexes the entire web (with "entire web" defined as "anything you can google"), in part because it doesn't spend a lot of time trying to extract meaning. As for the structured view, Wikipedia pages are already outlined, and I'm not sure what Factz give me that ordinary text search doesn't.

Ah well. Understanding natural language isn't just a hard problem, it's a collection of several hard problems, and not a particularly well-defined collection at that.

I don't want to leave the impression that Powerset is useless, and I particularly don't want to denigrate the effort behind it. In fact, I'd encourage people to at least try it. Tastes vary, and some may well find Powerset a nicer way to navigate Wikipedia. Nonetheless, Powerset only serves to confirm my impression that dumb is indeed smarter, and that Google's "we don't even pretend to understand what we're indexing" approach sets the bar remarkably high.
