While watching a miniseries on ancient history, I got to wondering how quickly people could move around in those days. The scriptwriters mostly glossed over this, except when it was important to the overall picture, which is fine, but it still seemed odd to see someone back in their capital city discussing a battle they'd taken part in a thousand kilometers away as though it had happened yesterday.
So I did a search for "How far can a horse travel in a day?". The answer was on the order of 40 kilometers for most horses, and closer to 150 for specially-bred endurance horses. At the endurance figure, that would make it about a week to cover 1000 km, assuming conditions were good, except that a horse, even a specially-bred one, needs to rest.
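For the record, here is the back-of-the-envelope arithmetic as a small Python sketch. The two daily distances are the rough figures from the search results above; the function name is mine, and the calculation deliberately ignores rest days, terrain and weather, so treat the output as a lower bound rather than a historical estimate.

```python
import math

DISTANCE_KM = 1000

# Rough figures from the search results quoted above.
TYPICAL_KM_PER_DAY = 40
ENDURANCE_KM_PER_DAY = 150

def days_needed(distance_km, km_per_day):
    """Whole days of travel needed, assuming no rest days at all."""
    return math.ceil(distance_km / km_per_day)

print(days_needed(DISTANCE_KM, TYPICAL_KM_PER_DAY))    # ordinary horse
print(days_needed(DISTANCE_KM, ENDURANCE_KM_PER_DAY))  # endurance horse
```

That comes out to 25 days for an ordinary horse and 7 for an endurance horse, before allowing for any rest.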
What if you could set up a relay and change horses, say, every hour? At this point we're well off into speculation, and it's probably best to go to historical sources and see how long it actually took, or just keep in mind that it probably took a small number of weeks to cross that kind of distance and leave it at that. But speculation is fun, so I searched for "How far can a horse travel in an hour?"
It may not surprise you that I didn't get the answer I was looking for, at least not without digging, but I did get answers to a different question: What is the top speed of a horse in km/hr? (Full disclosure: I actually got an answer in miles per hour, because US, but I try to write for a broader audience here.) How fast a person or animal can sprint is not the same as how far the same person or animal can go in an hour.
This seems to be the pattern now that we have large language models (LLMs) involved in web search. I don't know what the actual algorithms are (and couldn't tell you if I did), but it seems very much like:
- Look at the query and use an LLM to create a model of what the user really wants
- Do text-based searches based on that model
- Aggregate the results according to the model
It's not hard to see how an approach like this would (in some sense) infer that I'm asking "How many kilometers per hour can a horse run?", which is very similar in form to the original question, even though it's not the same question at all. There are probably lots of examples in the training data of asking how fast something can go in some unit per hour and not very many of asking how far something can go in an hour. My guess is that this goes on at both ends: the search is influenced by an LLM-driven estimate of what you're likely to be asking, and the results are prioritized by the same model's estimate of what kind of answers you want.
It's reasonable that questions like "How fast can a horse go?" or even "How fast is a horse?" would be treated the same as "How many km/hr can a horse run?". That's good to the extent that it makes the system more flexible and easier to communicate with in natural language. The problem is that the model doesn't seem good enough to realize that "How far can a horse travel in an hour?" is a distinct question and not just another way to phrase the more common question of a horse's top speed at a sprint.
I wish I could say that this was a one-off occurrence, but it doesn't seem to be. Search-with-LLM's estimate of what you're asking for is driven by the LLM, which doesn't really understand anything, because it's an LLM. It's just going off of what-tends-to-be-associated-with-what. LLMs are great at recognizing overall patterns, but not so good at fine distinctions. On the question side, "How far in an hour?" associates well with "How fast?" and on the answer side, "in an hour" associates strongly with "per hour," and there you go.
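To make "what-tends-to-be-associated-with-what" a bit more concrete, here is a deliberately crude toy in Python. It is emphatically not how an LLM or any real search engine works: it just scores questions by how many words they share (Jaccard similarity), and the list of "common questions" is made up for illustration. Even so, it shows the failure mode, with my question landing on the top-speed question simply because the two look alike.

```python
import re

def words(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(a, b):
    """Jaccard similarity: shared words divided by all words used."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)

# Hypothetical "likely questions", invented for this example.
COMMON_QUESTIONS = [
    "How many kilometers per hour can a horse run?",
    "How long does a horse live?",
]

def nearest_common_question(query):
    """Return the common question that shares the most wording with the query."""
    return max(COMMON_QUESTIONS, key=lambda q: overlap(query, q))

query = "How far can a horse travel in an hour?"
print(nearest_common_question(query))
```

The words "how", "can", "a", "horse" and "hour" all match, so the top-speed question wins, even though "far" and "travel" are the words that carry the distinction I actually cared about.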
That's great if you're looking for a likely answer to a likely question, but it's actively in the way if you're asking a much-less-likely question that happens to closely resemble a likely question, which is something I seem to be doing a lot of lately. This doesn't just apply to one company's particular search engine. I've seen the same failure to catch subtle but important distinctions with AI-enhanced interfaces across the board.
Before all this happened, I had pretty good luck fine-tuning queries to pick up the distinctions I was trying to make. This doesn't seem to work as well in a world where the AI will notice that your new carefully-reworded query looks a lot like your previous not-so-carefully-worded query, or maybe more accurately, it maps to something in the same neighborhood as whatever the original query mapped to, despite your careful changes.
Again, I'm probably wrong on the details of how things actually work, but there's no mystery about what the underlying technology is: a machine learning (ML) model based on neural networks trained with backpropagation. This variety of ML is good at finding patterns and similarities, in a particular mathematical sense, which is why there are plenty of specialized models finding useful results in areas like chemistry, medicine and astronomy by picking out patterns that humans miss.
But these ML models aren't even trying to form an explicit model of what any of it means, and the results I'm seeing from dealing with LLM-enhanced systems are consistent with that. There's a deeper philosophical question of to what extent "understanding" is purely formal, that is, can be obtained by looking only at how formal objects like segments of text relate to each other, but for my money the empirical answer is "not to any significant extent, at least not with this kind of processing".
Back in the olden days, "Do What I Mean", DWIM for short, was shorthand for a system's ability to catch minor errors like spelling mistakes and infer what you were actually trying to do. For example, the UNIX/GNU/Linux family of command-line tools includes a command ls (list files) and a command less (show text a page at a time, with a couple of other conveniences). If you type les, you'll get an error, because that's not a command, and nothing will ask you, or try to figure out from context, if you meant ls or less.
A DWIM capability would help you figure that out. In practice, this generally ended up as error messages with a "Did you mean ...?" based on what valid possibilities were close in spelling to what you typed. These are still around, of course, because they're useful enough to keep around, crude though they are.
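Here's a minimal sketch of that kind of spelling-based "Did you mean ...?" in Python, using the standard library's difflib.get_close_matches. The command list, the function name and the 0.6 cutoff are my own choices for illustration, not how any particular shell does it.

```python
import difflib

# A tiny, made-up set of valid commands for the example.
KNOWN_COMMANDS = ["ls", "less", "cat", "grep"]

def did_you_mean(typed, known=KNOWN_COMMANDS):
    """Return known commands spelled similarly to what was typed, best match first."""
    return difflib.get_close_matches(typed, known, n=3, cutoff=0.6)

suggestions = did_you_mean("les")
print(f"les: command not found. Did you mean: {', '.join(suggestions)}?")
```

Typing les gets you less and ls as suggestions, in that order, while cat and grep fall below the cutoff. Note that nothing here knows what any of the commands do. It's purely a matter of which strings look alike.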
There are now coding aids that will suggest corrections to compiler errors and offer to add pieces of code based on context. In my experience, these are a mixed bag. They work great in some contexts, but they are also good at suggesting plausible-but-wrong code, sometimes so plausible that you don't realize it's wrong until after you've tried it in a larger context, at which point you get to go back and undo it.
There's always been a tension between the literal way that computers operate and the much less literal way human brains think. For a computer, each instruction means exactly the same thing each time it executes and each bit pattern in memory stays exactly the same until it's explicitly changed (rare random failures due to cosmic rays and such can and do happen, but that doesn't really affect the argument here). This carries over into the way things like computer languages are defined. A while loop always executes the code in its body as long as its condition is true, ls always means "list files" and so forth.
Human brains deal in similarities and approximations. The current generation of ML represents a major advance in enabling computers to deal in similarities and approximations as well. We're currently in the early stages of figuring out what that's good for. One early result, I think, is that sometimes it's best just to talk to a computer like a computer.