Wednesday, January 30, 2008

Blobs, metadata and web content

Every time I price hard drives, the cost for a given size looks ridiculously cheap, even by comparison to the ridiculously cheap price from the last time I looked. If you spend $X on a disk drive every year, this year's drive will generally be able to swallow last year's whole without great effort.

And yet, we still manage to fill it all up.

Many years ago, at my first geek job, we splurged a somewhat scary amount of money and bought The Ten Megabyte Disk. It was about the size of a beer fridge and I remember wondering what we could possibly do with all that storage. I mean, it could hold the entire 128K expansion memory of the computer it was attached to (a box about the size of a large set-top cable box) almost 100 times over ...

So what's going to fill up today's terabyte-plus disks (100,000 times the beer fridge, if you're keeping score)? It would have to be video, I'd think. A terabyte will hold about 80 hours of mini-DV video, 250 single-sided DVDs, 125 double-sided DVDs, or 40 Blu-ray discs (assuming they're more-or-less full).

I previously estimated 3D IMAX at around 1GB per second, so if that's more or less right a terabyte will hold about 17 minutes of really high-definition video. There's no good way you could watch that in most people's houses right now, but give it time (I'm thinking some sort of VR glasses, not a humongous IMAX screen in every home).
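For anyone who wants to check the video arithmetic, here's a rough sketch. The per-unit sizes are my assumptions, not exact specs: about 13 GB per hour of mini-DV, about 4 GB of content per DVD side, 25 GB per single-layer Blu-ray disc, and decimal units throughout (1 TB = 1000 GB). The function names are made up for illustration.

```python
# Back-of-the-envelope check of the video figures.
# All sizes are rough assumptions, in decimal units (1 TB = 1000 GB).
TB_IN_GB = 1000

GB_PER_UNIT = {
    "hour of mini-DV": 13,     # assumed: roughly 13 GB per hour of DV
    "single-sided DVD": 4,     # assumed: roughly 4 GB of content per side
    "double-sided DVD": 8,     # two sides at the assumed 4 GB each
    "Blu-ray disc": 25,        # assumed: single-layer disc
}


def units_per_terabyte(gb_per_unit):
    """How many items of the given size fit in one terabyte."""
    return TB_IN_GB / gb_per_unit


def minutes_of_video(gb_per_second):
    """Minutes of video per terabyte at a given data rate."""
    return TB_IN_GB / gb_per_second / 60


if __name__ == "__main__":
    for name, gb in GB_PER_UNIT.items():
        print(f"{name}: about {units_per_terabyte(gb):.0f} per terabyte")
    # The 1 GB/s figure is the estimate for 3D IMAX quoted in the post.
    print(f"3D IMAX at 1 GB/s: about {minutes_of_video(1.0):.0f} minutes")
```

Change any of the assumed sizes and the counts move accordingly; the point is the order of magnitude, not the last digit.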

What strikes me here is the vast difference between video and everything else.

A terabyte is about a million minutes of mp3 sound, or if you prefer, nearly 700 solid 24-hour days, or about 20 solid weeks if you prefer high-quality FLAC to mp3. It's hundreds of thousands of high-quality digital photos. It would hold all the actual software installed on my computer, including bitmap images, PDF versions of documentation and so forth, hundreds of times over. If you put every word and line of code I've ever typed in my entire life on a piece of the disk and painted that piece neon candy-apple red, you'd need a microscope to see it.
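The audio numbers can be worked through the same way. This is only a sketch under assumptions: mp3 at roughly 1 MB per minute (about 128 kbit/s), FLAC taken to be around five times the size of the equivalent mp3, and 1 TB = 1,000,000 MB. The names here are hypothetical.

```python
# Rough conversion from a terabyte of audio to days of continuous listening.
TB_IN_MB = 1_000_000

MP3_MB_PER_MINUTE = 1.0    # assumed: ~128 kbit/s mp3 is about 1 MB per minute
FLAC_TO_MP3_RATIO = 5.0    # assumed: FLAC is roughly 5x the size of mp3

MINUTES_PER_DAY = 60 * 24


def days_of_audio(mb_per_minute):
    """Days of round-the-clock audio that fit in one terabyte."""
    minutes = TB_IN_MB / mb_per_minute
    return minutes / MINUTES_PER_DAY


if __name__ == "__main__":
    mp3_days = days_of_audio(MP3_MB_PER_MINUTE)
    flac_days = days_of_audio(MP3_MB_PER_MINUTE * FLAC_TO_MP3_RATIO)
    print(f"mp3:  about {mp3_days:.0f} days")
    print(f"FLAC: about {flac_days:.0f} days ({flac_days / 7:.0f} weeks)")
```

Real bitrates vary a lot from file to file, so treat the output as a ballpark.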

In the business, we call things like movies and songs BLOBs (Binary Large OBjects). Blobs become much more useful if you attach other information to them, for example the title of a movie, an index of scenes, cast and crew credits, and all the other kinds of stuff you'd find in IMDB. This sort of descriptive information about another piece of data is called metadata.

Metadata is very important. Imagine having 1000 songs and 100 DVDs stored on your disk, listed only by a 4-digit number. Metadata is also absolutely tiny. The entire IMDB entry for a typical movie would fit into a small fraction of a single frame of that movie. You couldn't even use it as a cue dot. It would flash by too quickly.

By comparison, typical web content is very rich in metadata. For example, this blog entry contains a couple of links and tags (that I put in) and some other indexing information (that Blogger puts in). It sits in a page full of other links and indexing information. Overall there is more text in this blog than metadata, but not thousands or millions of times more.

Many (but not all) web offerings are similarly metadata-rich. A typical social-networking site is all about the links. Google makes its money handing you piles of links. Even something like flickr adds value by categorizing blobs and otherwise making them easy to find.

All of which brings me back to my recurring theme of human bandwidth. We humans can consume vast amounts of visual information, large amounts of audio information, but only so much metadata. As a corollary, the amount of space per user on a web site will be tiny (from the computer's point of view), unless it happens to be rich in audio or video.

David Hull said...

Note to self: as later posts point out, scientific data is even bigger.