Tuesday, October 28, 2008

What do I mean "over the net"?

Re-reading, I notice I said in the post on tinyurl that URLs are "specifically designed to be written down, printed, put on billboards, read over the phone and otherwise transmitted non-electronically."

That's not quite right. Reading something over the phone is transmitting it electronically. How about "non-digitally"? As far as I'm concerned, text is digital, whether you put it on the wire, store it on a disk or spell it out in print. OK, how about "not transmitted over the net"? Well, OK, but what about VOIP? A significant portion of phone traffic goes over the net.

How did the RFC I was referencing get around this? It says a URL "may be represented in a variety of ways: e.g., ink on paper, or a sequence of octets in a coded character set." In other words, it talks about representation and not transmission. That's a good call for the RFC, but it leaves me out in the cold.

OK, so what's the difference between sending a URL in an HTTP request and reading it over the phone to someone over what happens to be a VOIP connection? Clearly the VOIP-ness of the connection is accidental, not essential, whereas HTTP has to run over a TCP connection. It's accidental because the audio stream is opaque to the software. The VOIP infrastructure will carry conversations that contain URLs and ones that don't with equal ease. In the case of an HTTP request, the URL is very much visible and needed.

Is this all just quibbling over definitions? Sort of. But the same problem crops up in other, more practical situations. An unstructured block of text is semi-opaque to the web infrastructure. You can scrape it and parse it and extract nuggets of meaning, but that takes a lot of effort. About the only thing you can really do with it easily is index it using what are largely crude but effective methods. Images and audio are even more recalcitrant.
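
To make "scrape it and parse it" a bit more concrete, here's a minimal sketch of pulling URLs back out of a block of plain text. The pattern and the name extract_urls are my own illustrative assumptions, and the regular expression is deliberately crude; it is nowhere near a full URL parser.

```python
import re

# Deliberately crude pattern (an assumption for illustration, not a real
# URL grammar): "http://" or "https://" followed by a run of characters
# that are not whitespace, angle brackets or quotes.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

def extract_urls(text):
    """Return URL-like substrings found in free text.

    Trailing punctuation is stripped, since in prose a URL is often
    followed directly by a comma or a full stop.
    """
    return [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text)]

sample = "See http://example.com/post, or call me. Also https://example.org/a?b=1."
print(extract_urls(sample))
```

Even this toy version has to guess where a URL ends, which is roughly the point: the text carries the URL, but nothing in the text says "this part is a URL."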

As I understand it, the semantic web and its cousins like microformats are attempts to dig out otherwise obscured information and present it in machine-friendly form. From that point of view, if I were able to, say, press a "mark URL" button on my phone and cause the URL to be extracted via speech recognition, the phone call would cease to be opaque and would become at least that bit more webby. If it were stored somewhere accessible by a URL along with any extracted information, it would be a full-fledged resource.

Not sure there's anything deep or notable in all that, but there it is. I should also note that I previously touched on the metadata-scraping theme from a slightly different angle in this post.

1 comment:

David Hull said...

Note to self: This (and maybe the linked post) seems worth a follow-up.