Tuesday, May 20, 2008

What a concept. Or rather, what's a concept?

One theme I've had kicking around in my head for a while, and may yet write up in earnest, is the concept of "dumb is smarter". The idea is that you can often do better by giving up on the idea of "understanding" the problem you're solving and using a blatant hack. For example, Google relies on PageRank -- the way a page is connected to other pages -- rather than any abstract understanding of a document, to decide what hits are likely to be "relevant".

The technique of Latent Semantic Analysis is an interesting case. LSA attempts to solve some of the well-known problems with searching based on words alone, particularly synonymy and polysemy.

Synonymy -- different words meaning the same thing -- means you can ask for "house" and miss pages that only say "home". Worse, you don't know what you're missing since, well, you missed it.

Polysemy -- the same word meaning different things -- means you can ask for "house" and get pages on the U.S. House of Representatives when you wanted real estate. This is probably more of an annoyance than a serious problem, particularly since you usually want the more popular sense of a word rather than the one it is drowning out.

LSA tries to mitigate these problems by starting with information on what words appear in what documents, then applying a little linear algebra to reduce the number of dimensions involved.

This means that, for example, instead of keeping separate scores for "house", "home" and "senate", there might be one combined score for "house" and "home" and another one for "house" and "senate". A document that contains "house" and "home" but not "senate" would be rated differently from one that contains "house" and "senate" but not "home", which is just the kind of thing we're looking for.
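
To make the "little linear algebra" concrete, here is a minimal sketch in Python with NumPy. Everything specific in it is an assumption for illustration: the four toy documents, the three-word vocabulary, the choice of keeping k = 2 dimensions, and the helper names to_concepts and cosine. It uses a plain singular value decomposition on raw word counts; real LSA setups typically weight the counts and work with far larger matrices. Note also that the axes the decomposition finds need not line up with tidy groupings like "house and home".

```python
import numpy as np

# Toy corpus (hypothetical): two senses of "house" plus two one-word documents.
vocab = ["house", "home", "senate"]
docs = [
    "house home",    # real-estate sense
    "house senate",  # political sense
    "home",
    "senate",
]

# Term-document count matrix: one row per word, one column per document.
A = np.array([[doc.split().count(w) for doc in docs] for w in vocab], dtype=float)

# Singular value decomposition; singular values come back largest first,
# so keeping the first k columns of U keeps the k strongest "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]


def to_concepts(word_counts):
    """Project a vector of word counts into the k-dimensional reduced space."""
    return U_k.T @ word_counts


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


query = np.array([1.0, 0.0, 0.0])  # a search for "house" alone
home_only = A[:, 2]                # the document that only says "home"

# By raw word counts the two share nothing; in the reduced space they overlap.
word_sim = cosine(query, home_only)
concept_sim = cosine(to_concepts(query), to_concepts(home_only))
print("word space:", word_sim)
print("concept space:", concept_sim)
```

In this toy, a search for "house" has zero similarity to the "home"-only document when you compare word counts directly, but a positive similarity once both are projected into the reduced space, which is the synonymy fix in miniature. The same search also picks up the "senate"-only document to the same degree, so the polysemy problem doesn't vanish just because the dimensions do.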

This combined system is called "concept space". Does it deserve the name?

On the one hand, yes, because intuitively it reflects the idea that "house" and "home" can represent the same, or at least related, concepts, and because it seems to do fairly well empirically in mimicking how people actually rate documents as "similar" or "different".

On the other hand, clearly no, since all we're doing is counting words and doing a little math, and also because the "concept space" can include combinations that don't have much to do with each other, but happen to fall out of the particular texts used -- maybe "house" and "eggnog" happen to appear together for whatever reason.

The last would be a case of "correlation doesn't necessarily mean cause", and the interesting thing here is that LSA seems to do a decent job of emulating faulty human reasoning. People make that particular mistake all the time, too. As always, one must distinguish "human-like" from "intelligent".
