Tuesday, June 9, 2009

Baker's dozen: What did I expect?

With three engines tested (or three and a half, if you include Bing), it's starting to look like another case of dumb is smarter. The pure text-based approach grinds away happily and, even if you don't try to cater to its whims and just give plain English questions, it almost always finds something relevant. Often you'll have to chase links, but all in all you can do quite well.

So far, the semantic approach seems to do no better and may do worse. This might be an artifact of the smaller text base (more or less Wikipedia), but the smaller, more select text base is deliberately part of the approach. If you're lucky, the answer you seek might arrive wrapped up in a bow, but usually not, just as with the pure text approach.

It's also interesting that Ask, which started life as a plain English, sophisticated alternative to Google, now seems to look and act pretty much like Google.

And yet ...

There is certainly information in the baker's dozen that could be exploited to give smarter results. The question is whether exploiting this requires magic, or just engineering. I'm not going to make a call on that, but here's what I see:
  • How much energy does the US consume?
"How much" says "look for a number" and "energy" says "look for a unit of energy" (BTU, joules, kWh, etc.). Anything that doesn't seem to give such a number in the context of "US" and "consumes" is probably not useful.
  • How many cell phones are there in Africa?
"How many" suggests a number. For bonus points, the number is probably going to be within an order of magnitude of the population.
  • When is the next Cal-Stanford game?
"When is the next ..." is a formula suggesting a schedule, timetable, or the like. A search for "Cal-Stanford game" will probably correlate highly with football (as opposed to, say, "Carolina-Duke game"). If so, that would suggest a football schedule.
  • When is the next Cal game?
I found this one interesting. Most Americans will know that "Cal" is one of the short forms of "California". The search engines know that, too, and it throws them off. "Cal" in the context of "game" almost certainly refers to the University of California (Berkeley) Golden Bears. And again, "When is the next ..." rules out the hits for "California Gaming". In this case, even dumb is too smart for its own good.
  • Who starred in 2001?
  • Who starred in 2001: a Space Odyssey?
"Who starred in X" indicates that X is a film, TV show or play. Powerset appears able to grok this.
  • Who has covered "Ruby Tuesday"?
"Who has covered" indicates a song, though not as strongly as "Who starred in ..." suggests an acting role.
  • What kinds of trees give red fruit?
"What kind(s) of X" indicates that the answers should be Xs.
  • Who invented the hammock?
"Who" indicates that the answer is a person or group of people, but in the case at hand the group is pretty abstract.
  • Who played with Miles Davis on Kind of Blue?
"Who played with X on Y" indicates that the answer is a person, probably a musician or member of a sports team ("Who played with Satchel Paige on the Monarchs?" -- another fairly impressive list).
  • How far is it from Bangor to Leeds?
  • How far is it from Bangor to New York?
  • How far is it from Paris to Dallas?
I wrote that "In these, of course, the implicit question is 'Which Bangor?' or 'Which Paris?'" Google Maps handles this well by simple heuristics based (I assume) mainly on size. The "how far is it from X to Y" formula indicates that X and Y are places and the answer is a distance.
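The heuristics above boil down to matching a question's opening formula and inferring what kind of answer to look for. As a rough illustration only -- the patterns and answer-type labels below are my own, not anything an actual search engine uses -- the idea might be sketched like this:

```python
import re

# Hypothetical question formulas and the answer type each one suggests,
# following the examples above. Order matters: more specific formulas
# ("who starred in") must come before more general ones ("who").
FORMULAS = [
    (r'^how much\b',           'quantity (a number plus a unit)'),
    (r'^how many\b',           'count (a number)'),
    (r'^when is the next\b',   'date/time (look for a schedule)'),
    (r'^who starred in\b',     'person (actor; the object is a film, show, or play)'),
    (r'^who has covered\b',    'person (musician; the object is probably a song)'),
    (r'^what kinds? of (\w+)', 'instances of the named kind'),
    (r'^who played with\b',    'person (bandmate or teammate)'),
    (r'^who\b',                'person or group'),
    (r'^how far is it from\b', 'distance (the objects are places)'),
]

def expected_answer_type(question: str) -> str:
    """Return the answer type suggested by the question's opening formula."""
    q = question.strip().lower()
    for pattern, answer_type in FORMULAS:
        if re.search(pattern, q):
            return answer_type
    return 'unknown'
```

Of course, a real engine would need far more than a table of openers -- disambiguating "Cal" or "Bangor" takes context, not just the formula -- but this is the flavor of information involved.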

Now let me be clear here that when I say that some formula indicates something about the answer, I'm not saying that a search engine ought to be able to exploit that information. I'm well aware it's not as simple as it might look. Rather, I'm saying that if search engines are really going to get smarter, this is the kind of information they'll need to be able to find and exploit.

As to the original question of what I expected, I'll just say that nothing so far has been surprising, except perhaps that Google handles plain English questions better than I thought it might.
