Showing posts with label OCR. Show all posts

Tuesday, December 9, 2008

The OCR X-Prize

A while ago -- in fact, just about a year ago as it happens -- I remarked that the use of Captchas to try to keep bots away is effectively an X Prize for OCR hackers. If there's money to be made by using a bot, as there is with, say, online ticket sales for popular shows, then there ought to be reason for someone to write a better text recognizer to get past the Captcha. I was sort of right, kind of.

Sure enough, people have written better text recognizers. But from what I can make out, they've done it for fame and recognition (well, at least to publish and not perish) and not for money. There are several academic papers out there, but as far as I can tell no enterprising script kiddie has done the requisite research. There have been reports of real sites invaded by real Captcha-cracking bots, but most likely someone just cadged the work done in the research papers and put it to ill use.

So, Captchas have spurred research, and OCR has become a somewhat hot topic. They've also spurred actual scammers to crack Captchas, but not necessarily through OCR research. Actual scammers have little reason to invent new OCR algorithms, or even read the literature on the subject. That's not their strong suit. Their strong suit is social engineering. People are good at reading squiggly Captcha letters; spammers are good at getting people to do stuff; ergo, spammers get people to read the squiggly Captcha letters for them.

How do they do this? Put up a site featuring a thrilling pictorial presentation of, say, accounting standards through the ages (they actually used a slightly different subject matter). After each image is the promise of more ... if you can read the squiggly letters in the box. The letters, of course, are taken from a legitimate site that the scammer is trying to crack into at the moment, and the mark's response is fed directly back to that site. The awful beauty of this approach is that it will work for any "reverse Turing test" approach whatsoever.

If they're smart, they wait for a successful response back from the legitimate site before letting the mark proceed. Otherwise the mark could put in anything at all, for example "notarealticketbuyer", and by definition the scam site wouldn't know the difference.

Meanwhile, Captchas have become almost too hard for humans to read (I came up empty on one today, which is what spurred this post). In other words, they've almost reached the point at which they can no longer discriminate between humans and bots. Clearly rendered text can't discriminate, because both humans and bots can read it easily. Gibberish can't discriminate either, because no one can read it. There is less and less room left in the middle.
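
To make the squeeze concrete, here is a toy model in Python. Everything in it is an assumption made up for illustration: the logistic shape of the curves, the `steepness` value, and the "midpoint" distortion levels at which humans and bots start failing. None of it is measured data. It only shows the shape of the argument: a Captcha is useful to the extent that humans pass it more often than bots do, and that gap vanishes at both extremes and shrinks as bots catch up.

```python
import math


def read_rate(distortion, midpoint, steepness=20.0):
    """Hypothetical chance of reading a Captcha at a distortion level in [0, 1].

    The reader succeeds almost always well below `midpoint` and almost never
    well above it. The logistic curve is an assumed shape, not a measurement.
    """
    return 1.0 / (1.0 + math.exp(steepness * (distortion - midpoint)))


def discrimination(distortion, human_midpoint, bot_midpoint):
    """Human success rate minus bot success rate at one distortion level.

    A value near zero means the Captcha cannot tell the two apart.
    """
    return read_rate(distortion, human_midpoint) - read_rate(distortion, bot_midpoint)


def best_gap(human_midpoint, bot_midpoint, steps=101):
    """Largest gap available anywhere on an evenly spaced grid of distortions."""
    return max(
        discrimination(i / (steps - 1), human_midpoint, bot_midpoint)
        for i in range(steps)
    )


if __name__ == "__main__":
    human = 0.7  # assumed: humans start failing around this distortion
    for bot in (0.3, 0.5, 0.65, 0.7):  # assumed: bots improving over time
        print(f"bot midpoint {bot:.2f}: best gap {best_gap(human, bot):.3f}")
```

In this model, clean text (distortion 0) and gibberish (distortion 1) both give a gap close to zero, and the best achievable gap falls toward zero as the bot's midpoint approaches the human's. That is the "less and less room in the middle" effect, under the stated assumptions.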

Monday, December 10, 2007

Why is there still print?

The Newsweek article on Kindle quotes Jeff Bezos as saying "Books are the last bastion of analog." I take his point, but it seems an odd statement. Text, after all, is arguably the first real digital medium. What he means by "digital", of course, is "available to computers". Unlike music and video, which are now routinely released in computer-readable form, books are still released in a form you can't just download. Bezos aims to change this with the Kindle.

The interesting question is, why does print resist digitization so well? I've suggested that publishers like it because it provides copy protection, but why does it? The answer has to be economic, not technical. Technically, it's trivial to digitize a book. Just scan it in. Don't bother to try to convert the image back to text. If all that people want to do with the result is read it, the image should work fine.

There's an interesting subplot here. Optical character recognition (OCR) seems to do fairly well these days on well-printed books, judging by Google Books and Amazon's own "Search inside the book" feature. On the other hand, the fully general problem of reading anything a person can make out still appears to be hard, which is why sites use distorted text CAPTCHAs to try to stop bots. This seems like the equivalent of an X-prize for freelance OCR hackers, and indeed the inevitable arms race appears to be well under way. Finally, bringing us full circle, one source of these CAPTCHAs is printed text that failed to scan correctly.

In any case, the difficulty doesn't seem to be digitizing text in a readable form. The problem is, what do you do with it once you've got it? It's technically trivial to scan a book, but it still takes some time and effort to flip through all the pages, at least without expensive specialized equipment. So if I've done this, I'd like to see some compensation -- assuming I don't mind violating copyright laws.

Can I put it on the web and sell it? Well, um, I've just brought it into digital form, thereby making it hugely easier to copy. In other words, I've just put myself in the position of the publisher whose print-based copy protection I've just broken. If copy protection is out, there's always advertising. Except that's maybe not such a good idea given that I've just broken the law.

This same argument would seem to act as a counterbalance to all sorts of unauthorized copying, but obviously it doesn't apply as effectively to audio and video. This is probably because copying CDs and DVDs is much, much easier than scanning books, and also because books are simply a different medium. I'd expect that PhDs have already been earned on just such matters.