Monday, June 30, 2008

Pinging across the blogosphere

A few days ago I tracked down and added proper credit for the photo in my profile. It was taken a couple of years ago by Paul Downey, a fellow standards committee member at the time (and, among other things, a worthy blogger and a keen-eyed photographer).

In line with his efforts to make the web more generally useful, Paul maintains a mashup of various feeds relevant to his blog. Since I had pointed the photo credit at his blog, my post showed up on his Technorati page, which in turn showed up in Paul's feed mashup. That mention then showed up on Bloglines (which I always imagine has an extra 's', as in cleanliness is next to Blogliness).

Interestingly, the mention didn't show up on my Technorati page, probably because Technorati doesn't count Paul's feed list as a blog. But it's also interesting how much happened without either of us doing anything, beyond Paul setting up his list.

Saturday, June 28, 2008

Hunting the elusive voorwerp

Since it's been slashdotted and has appeared on the major news feeds, there's a good chance you've heard of Hanny's Voorwerp by now. It's a we-haven't-seen-anything-quite-like-this object found by a Dutch schoolteacher named Hanny as part of the Galaxy Zoo project (I've seen voorwerp variously translated as "object" or "thing" -- the etymology suggests there might be a shade of meaning we're missing, so better not to translate).

The Galaxy Zoo is an interesting bit of crowdsourcing. Unlike SETI@home, GIMPS or similar projects it relies on human processing power rather than the idle cycles of millions of PCs. In the typical distributed-computing project, the algorithms are well-understood but require massive amounts of computing power. In this case, no one knows a good algorithm for classifying galaxies, but people can do it reasonably well, even with no training in astronomy.

Galaxy Zoo literally gives the rest of us a look at what kinds of things astronomers spend their time looking at, plus the chance of turning up something truly new. It also relieves the astronomers of the effort of taking a first look at millions and millions of images and turns up oddities, like the voorwerp, that would otherwise have gone unnoticed. Going by the entries on the Galaxy Zoo blog and the acknowledgments in the submitted papers, they're very grateful for the assist.

Apart from the human visual system's ability to recognize shapes, Galaxy Zoo takes advantage of another facet of human perception, namely its imprecision. While most people could easily distinguish a well-defined spiral galaxy from an elliptical one, many actual galaxies are harder to characterize.

In such cases a particular algorithm distributed to everyone's PC would always give the same answer, whether it's right or wrong. With some effort, you could distribute several different algorithms [or several versions of the same one, or one with some random "fuzz"] and look for discrepancies, but you'd have to write several different algorithms, or at least discover tuning parameters that made a significant difference. This is not impossible, by any means, but you get it for free by having several people look at the same image (or, quite possibly, just from having the same person look at the image every so often).

Objects that produce varying answers are likely to be the interesting borderline cases, whether because there's more going on with them, or simply because we humans have more trouble figuring it out.
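The disagreement idea can be sketched in a few lines. Here's a toy Python sketch -- vote data, object IDs and the threshold are all invented for illustration -- that flags objects where volunteer classifications disagree beyond some cutoff:

```python
from collections import Counter

def disagreement(votes):
    """Fraction of votes that differ from the majority label."""
    counts = Counter(votes)
    majority = counts.most_common(1)[0][1]
    return 1 - majority / len(votes)

def flag_borderline(classifications, threshold=0.3):
    """Return IDs of objects whose volunteer votes disagree heavily."""
    return [obj_id for obj_id, votes in classifications.items()
            if disagreement(votes) >= threshold]

votes = {
    "galaxy_001": ["spiral"] * 9 + ["elliptical"],         # clear-cut
    "galaxy_002": ["spiral"] * 5 + ["elliptical"] * 5,     # borderline
    "galaxy_003": ["elliptical"] * 7 + ["irregular"] * 3,  # borderline
}
print(flag_borderline(votes))  # → ['galaxy_002', 'galaxy_003']
```

The clear-cut spiral sails through; the split decisions are exactly the ones worth a closer look.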

Thursday, June 26, 2008

Three times the commercials! That's progress for you ...

There seems to be a buzz developing over phone providers' "three-screen strategy." The idea is that you'll be able to get video -- TV shows, "user generated" videos, whatever -- delivered anywhere you like, whether your TV, your PC or your cell phone. Underpinning this is a new kind of distribution deal. Previously, carriers and providers had negotiated separate deals for the three kinds of device, but the latest deals cover all three.

This is a perfect example of convergence from the vendors' point of view, and a small but necessary step from the consumer's point of view. Again, there's a tension between competing interests. On the one hand, the carriers would like to be able to provide a broad selection of programming, since that will attract customers. But if everyone is providing the same broad selection then the service becomes a commodity, consumers can easily switch providers and the providers' margins shrink.

Right now, we seem to be at the "baby steps" stage, with deals covering content like five-minute excerpts from popular shows. Perhaps it's a bit early to make too many predictions.

Myself, I don't feel a great need to watch TV on my tiny cell phone screen, particularly since advertising seems to be a major part of the current plans, but advertising isn't necessarily bad, and I'm sure a lot of people won't mind greatly. As always, it will be interesting to see how things shake out.

Searching for a smarter search engine

One look at Google's quarterly reports should be enough to understand why people are still trying to build a better search engine. Google search does a great job. It will come as no shock that I've consulted it repeatedly in practically every post here. A friend once described it as adding (say) 25 points to his IQ, though not everyone agrees with that assessment.

I've cited Google as a classic case of "dumb is smarter". Google doesn't try to do anything one might consider "understanding" the material it's indexing. For the most part it just looks at words and the links between pages. There is some secret sauce involved, for example in handling inflections or making it harder to game the rankings. Mainly, though, Google wins because its PageRank algorithm turns out to do a good job of finding relevant pages and because it throws massive amounts of computing power at indexing everything in sight [There's a lot of secret sauce involved in getting that to work at the scale Google operates on].
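For the curious, the core of the published PageRank algorithm is surprisingly compact. Here's a toy power-iteration version in Python -- a sketch of the textbook idea on a three-page graph, nothing like the engineering needed to run it at Google's scale:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in pages:
                    new[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
# 'c' collects links from both 'a' and 'b', so it ranks highest
```

The "dumb" part is apparent: nothing here looks at what the pages say, only at who links to whom.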

Google is the dominant search engine, but that doesn't mean there's no room for other engines, particularly engines that take a noticeably different approach or that try to solve a noticeably different problem. Powerset is one such engine. Rather than trying to index the entire web by keyword, Powerset answers English queries about material in Wikipedia. Without delving into a proper product review or comparison, which would have to include at least Google and, say, Ask (formerly Ask Jeeves), I'll just note a few impressions and head on to my real goal of blue-sky speculation [geek note: The "power set" of a set is the set of all that set's subsets; less formally, all the combinations of a given set of elements.].

Suppose you want to know when John von Neumann was born. You ask "When was John von Neumann born?" Hmm ... oddly enough, it didn't answer that one directly. It did give the Wikipedia page for von Neumann, which gives the answer (December 28, 1903). "When was Mel Brooks born?" works more as intended, with a nice big "1926" at the top of the results. It also shows a link to a page that says 1928, but seems to know better than to believe it.

Other examples:
  • "Where is the world's tallest building?" turns up the list of tallest buildings.
  • "What is the time zone for Afghanistan?" turns up a list of pages, the first of which mentions the right answer.
  • "How much money has been spent on cancer research?" turns up a link giving a figure for the UK, but nothing suggesting an overall figure.
  • "Why is there air?" brings up the Bill Cosby album of the same name.
Beyond accepting questions posed in plain English, Powerset also aims to give you a richer view of the results it finds. This includes an outline of the page contents and a list of "Factz" gleaned from the text. These take the form of short subject-verb-object near-sentences like (in the case of the "tallest building" article) "dozens measure meter" and "television broadcasts towers". Click on one of these and it highlights a relevant passage in the text, for example "In terms of absolute height, the tallest structures are currently the dozens of radio and television broadcasting towers which measure over 600 meters (about 2,000 feet) in height."

It's not immediately clear what this is supposed to give me. Powerset says "For most people, places and things, Powerset shows a summary of Factz from across Wikipedia," and to illustrate this, it shows a section of a table of Factz about Henry VIII -- whom he married (wife, Anne Boleyn ...) what he dissolved (monasteries, Abbey ...) and so forth. Evidently Henry provides a better example than tall buildings do.

The Factz summary appears to be the sort of thing that Powerset is really driving at. It's certainly the sort of thing that initially drew me to take a look. Rather than just index words, Powerset attempts to extract meaning from the text and present it in a structured way. In other words, it tries to be smart and, in some limited sense, understand the material it's indexing. For example, along with the listing of three-part Factz, it will also display "things" and "actions", with items it deems more significant shown larger.
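To see how extraction can produce near-sentences like "dozens measure meter", here's a deliberately crude Python sketch -- emphatically not Powerset's actual method, which surely involves real parsing -- that just grabs the single word on either side of a known verb:

```python
import re

# A tiny hand-picked verb list; a real system would use a tagger.
VERBS = {"measure", "measures", "broadcasts", "are", "is"}

def naive_factz(sentence):
    """Crude subject-verb-object triples from one word either side
    of a recognized verb. Context-free, hence often nonsensical."""
    words = re.findall(r"[a-z]+", sentence.lower())
    triples = []
    for i, w in enumerate(words):
        if w in VERBS and 0 < i < len(words) - 1:
            triples.append((words[i - 1], w, words[i + 1]))
    return triples

print(naive_factz("the dozens of towers measure over 600 meters"))
# → [('towers', 'measure', 'over')]
```

Even on a simple sentence the "object" slot picks up a preposition, which is roughly the flavor of oddity the real Factz sometimes show.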

If we view this smarter approach as an attempt at understanding, however limited, then I'm not sure that the Powerset engine understands all that much. It seems pretty good at distinguishing nouns from verbs, but beyond that, I'm not sure what "dozens measure meter" really signifies. Even in a seemingly simple factual statement like the one quoted, there is more going on than "dozens" "measuring".

It matters that it's dozens of towers, not dozens of meters (or dozens of eggs). It matters that the towers measure more than 600m tall and not less. It matters that the towers are being judged tallest in the limited context of "absolute height". It matters that this is "current", since the Burj Dubai, when completed, will be the tallest completed structure, period. This matters particularly because much of the article is spent wrangling over the meaning of "tallest", a debate which will soon be moot, at least for a while. The Factz approach appears to miss all this, none of which is particularly subtle from a human point of view.

Google, in the meantime, doesn't try to do any of this, but seems to do just fine on the queries above, given verbatim (not in googlese, and without quotes). For "What is the time zone for Afghanistan?" for instance, it said "Time Zone: (UTC+4:30) according to Wikipedia" right at the top. And, of course, Google indexes the entire web ("entire web" defined as "anything you can google", of course), in part because it doesn't spend a lot of time trying to extract meaning. As for the structured view, Wikipedia pages are already outlined, and I'm not sure what Factz give me that ordinary text search doesn't.

Ah well. Understanding natural language isn't just a hard problem, it's a collection of several hard problems, and not a particularly well-defined collection at that.

I don't want to leave the impression that Powerset is useless, and I particularly don't want to denigrate the effort behind it. In fact, I'd encourage people to at least try it. Tastes vary, and some may well find Powerset a nicer way to navigate Wikipedia. Nonetheless, Powerset only serves to confirm my impression that dumb is indeed smarter, and that Google's "we don't even pretend to understand what we're indexing" approach sets the bar remarkably high.

Wednesday, June 25, 2008

Challenges of the 20th century

This isn't about anything particular on the web, but I mean to refer back to it, so bear with me ...

The 20th century saw major technological advances and massive efforts to solve problems that were barely even conceivable in the 19th. I want to look at two such problems: Sending a rocket to the moon, and curing cancer.

In both cases, there was broad agreement that humanity (or in the case of the moonshot, at least the U.S.) both could and should take up the task. By the mid 20th century rocket technology had been advancing steadily, to the point that when the Soviet Union announced the launch of Sputnik, the U.S. felt bound to respond with Apollo, whether out of a feeling of destiny or just to keep from losing face. In the field of medicine, antibiotics, vaccines and other medical developments had seen diseases that had been inevitable tragedies of life largely controlled or eradicated. Why not suppose, then, that cancer could be similarly dispatched?

A few decades on, sending a rocket to the moon is a solved problem, so much so that we now set our sights on Mars and the outer solar system and it doesn't seem outlandish to sponsor a prize for the first privately-sponsored lunar landing. Success rates in space ventures aren't perfect, by any means, and neither is our knowledge of space travel, but at least we have a general idea of why things fail and what can be done about it.

On the other hand, not only is cancer still a fact of life, there are still basic questions unanswered or poorly understood despite all the massive effort expended. This is not to say that the effort has been wasted. There have been dramatic improvements in treatment and prevention and in many cases there is now much hope where there had been little. Nonetheless, we can definitively say "We can send a rocket to the moon," but we are not even close to saying "We have cured cancer."

What's the difference, then? The clearest one is this: Before the space race even started, everyone knew what a rocket was and where the moon was. We didn't know what cancer was. In fact, one of the major results of the decades of cancer research is that there isn't really any such thing as "cancer" per se. There are a number of diseases with a number of different causes, symptoms, treatments and prognoses, all of which can potentially vary from person to person.

The two quests are qualitatively different. One was a feat of engineering, for the most part applying well-understood principles to a well-defined end. The other has been an ongoing field of research in which new principles and goals are still being uncovered.

Now, coming back to the field of computing, consider two very similar-looking problems that are getting significant attention: speech recognition and natural-language processing. Speech recognition is like sending a rocket to the moon. It has required major engineering effort and the results aren't perfect. However, there are commercial applications both for recognizing a small vocabulary from a broad range of speakers and for recognizing a broad vocabulary from a small range of speakers. Success rates aren't perfect, but at least we have a general idea of why things fail and what can be done about it.

Natural language processing, on the other hand, is an area of research comprising any number of different problems (including speech recognition, as it happens), requiring any number of different techniques and principles, some of which we haven't discovered yet. As with cancer, one of the major results has been understanding that there is much more to the problem than was originally thought. By the same token, there has been significant progress in some areas.

As things currently stand, one could plausibly argue "Computers can recognize spoken words," but we're not even close to saying "Computers can process language as well as humans," or even clearly defining what "process language" or "as well as humans" might mean here.

All of which sets the stage for a bit of musing on search engines ...

Tuesday, June 24, 2008

Goodbye, TLDs?

When I saw the headline "'Shake-up' for internet proposed," I immediately thought "IPv6", but in fact it's about a proposal to relax the restrictions on the top-level domains (TLDs). Currently there are just under 300, including the well-known .com, national domains (.us etc.), the moderately obscure (.coop) and the experimental (.xn--deba0ad).

As it stands, ICANN strictly controls this list and seldom touches it. If you want foo.com, all you have to do is make sure no one else has it (that one's taken, of course) and pay a small fee. But if you want foo.bar, you're out of luck. There is no .bar on the official TLD list.

Under the new scheme, you could claim .bar for your very own, so long as you could show a "business plan and technical capacity". And, um, pay several thousand dollars at a minimum and more if you get into a bidding war. In theory, this opens up vast new areas of virtual real estate and greatly blurs the line between TLDs and domains in general.
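Mechanically, checking whether a name ends in a registered TLD is trivial; the interesting part is who controls the list. A Python sketch, using a tiny invented stand-in for the official list IANA maintains:

```python
# A handful of entries standing in for the ~300 real TLDs.
KNOWN_TLDS = {"com", "org", "net", "us", "coop", "xn--deba0ad"}

def tld_of(hostname):
    """Return the top-level domain of a hostname, lowercased."""
    return hostname.rstrip(".").rsplit(".", 1)[-1].lower()

def is_registered_tld(hostname):
    return tld_of(hostname) in KNOWN_TLDS

print(is_registered_tld("foo.com"))  # → True
print(is_registered_tld("foo.bar"))  # → False
```

Under the proposed scheme, the second answer could flip: .bar becomes a matter of a business plan and a large check rather than a fixed list.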

While it will be nice to have internationalized names (.xn--deba0ad spells "test" in Yiddish), I expect that .com will continue to rule the roost for quite some time, though .xxx will doubtless spark a land rush. For all the talk of running out of domain names, .com and the web have become inextricably linked. See .tv for a cautionary tale.

Monday, June 23, 2008

Why advertising won't die

Alex Papadimoulis of The Daily WTF, in the course of not apologizing for running ads, said it:
I really don’t hate advertising. And neither do most of you. We just hate obnoxious advertising.
I don't really have anything to add to that, except that if you've ever been tormented by the sneaking feeling that software isn't always quite what it ought to be, the WTF offers solace and insight. And it's trenchantly funny.

Saturday, June 21, 2008

Media convergence and divergence

Technically, it's game over.

There's no reason why a box like Roku's Netflix box [reviewed here] couldn't be used to deliver current TV shows, live events, premium cable content, whatever. And in fact, it looks like you can get just that kind of service from DSL providers (well, maybe you can -- it's not available where I am yet). Boxes like Apple TV are also in the mix; the streaming aspect is there, albeit downplayed.

There are some concerns about bandwidth on the backbone, but Cisco says it's manageable, and they're allowing for a sizable chunk of peer-to-peer traffic, which may (or may not) become less of an issue if people are happy with what they can get directly from the providers. If I can see a given movie anytime I want as part of a $10/month subscription, why would I hassle with copying it peer-to-peer? But maybe that's just me.

So somewhere around now, give or take a couple of years, is the point where in major markets it's technically reasonably easy to get all the media you want via the net, media meaning audio, video, phone (i.e., two-way point-to-point audio) and the web. At some point not too much later -- again, whether it's already happened or is about to happen depends on your definitions -- it's all available in a mobile environment. Just take a more-or-less broadband mobile connection and use it instead of your wired connection. QED.

Technically, media convergence is here, and if technology were all that mattered we'd be about done. For better or worse, however, technology is only part of the show, and often not the part in charge. In this case, the real drivers are two divergent views of convergence: the consumer's view and the providers' view.

As a consumer, I want a complete mix-and-match free-for-all with the flexibility of the "pocket-thing" scenario. I can get my bits delivered any way I like and don't care which particular means is in effect at any given time. I can get whatever bits I like delivered without caring too much who's providing them, and in all cases I can easily pick how they're actually rendered useful to me.

If that's a bit too abstract (it seems a bit too abstract to me and I wrote it), here are some concrete examples of what that means:
  • I could switch from, say, cable to DSL or WiMax or whatever tomorrow and, apart from performance, not notice the difference -- I could watch the same shows, keep the same phone number, listen to the same music, etc., etc.
  • I get the same services when traveling as when at home, though again I may be dealing with a better or worse internet connection on the road. My services are tied to me, not to a particular location or device.
  • If I want a particular bit of content -- a movie or a TV show, for example -- it doesn't matter whether I've got cable, DSL or something else, or where I am.
Providers, on the other hand, tend to take a different view. Convergence means you get all your media through them. They'd just as soon not have it be too easy to switch to a different provider tomorrow, and if you have to pay them again to access something on the road, so much the better. Hey, they're in business to make money and their interests are only partially aligned with yours.

My particular view is that the consumer will tend to win in the long run, but it will take a while and proceed in fits and starts. If what's provided gets too far out of line with what people want and what the technology can do, new players will step in and steal business from the existing ones. This tends to bring things back in line. The new players start to lose their competitive advantage, commoditization sets in, competition becomes harsher, weaker players are shaken out and competition wanes.

This lets the remaining players cash in, and service once again gets out of line with what people want and what the technology will do. Which is where we came in ...

Interestingly, in the bullet points I gave above, video is the odd one out. Switching both land-line and mobile phone providers is fairly transparent. If you switch ISPs, your email accounts still work, and the web is the same, well, world-wide.

My guess is that this is at least partly because the technology for reasonable video over the net is only starting to become widely available, and that the Netflix/Roku box is an early participant in this particular cycle of innovation and shakeout. It's going to be an interesting time to be a studio or TV network, not to mention a cable or satellite TV company.

Tuesday, June 17, 2008

Netflix/Roku review

The Netflix set-top box I ordered came in this weekend. It's about the size of a small CD box set, all clean straight lines, black with a single white LED on the front. Ordering it, I'd had two concerns: technology and selection. Would the box be able to deliver a clear, steady picture, or would there be dropouts as it dealt with our local broadband connection? Can you have 10,000 titles to choose from but nothing to watch?

First, the technology works. Nicely. The hardest part of setting up the box was moving my honking big old-school TV so I could hook up the cable (and then negotiating the TV's menu to get it to show the right input). The second hardest part was putting in my WEP key in its full hexadecimal glory. This is really a problem with the router -- I've had to do that with everything I've hooked up. Roku makes even this about as easy as can be, especially considering that all you have to work with at that point is a screen and a remote with nine buttons.

Once it's connected, watching movies is easy. Netflix gives you an "Instant" queue, parallel to the regular one. You can flip through the contents of that queue on the screen. The titles show up as DVD cover images with a text caption. When you select a title, you see the description. If you pick something with multiple parts, like a TV series, you can easily select the episode you want, or the next episode if you've just finished one. Once you make a selection, the box spends a few seconds filling its buffers and you're off and running.

I'm not a "show every last sub-pixel" videophile, so your mileage may vary, but for my money the picture was just great: sharp and steady and at least as good as the regular cable feed. I kept waiting for some sort of glitch but so far it hasn't happened.

You can pause just like a DVD. Fast-forward/rewind is a bit different. You see a series of stills taken every so often either side of where you left off. You can scan through these just like you can scan through your queue. It would be nice to be able to scan through the contents of the buffer, so you could go back a second or two to replay that line you just missed, but for finding your place it seems to work well.

What about the selection? That's largely a matter of taste, of course. The current catalog tends heavily toward older releases and direct-to-DVD fare, lots of Night of the Obscurities II: The Relapse and such. Even so, with 10,000 titles to choose from, a high chaff/wheat ratio isn't necessarily a problem. That latest release you want almost certainly won't be there, but many older hits will be, along with back episodes of popular TV series and a smattering of interesting-but-not-blockbuster offerings.

Netflix makes no bones about the "Instant" selection being more limited. From their point of view it's a supplement to their DVD service, and that's still the main attraction. If you want that latest release, you'll have to put it in your regular queue and wait for the DVD to show up.

On the other hand, they say that they're working to expand the instant offerings. From the looks of it, they're doing this even as I write. The bottleneck seems mostly to be negotiating with the studios, who largely seem to be playing wait-and-see. My guess is that if the box turns out to be a hit -- which it might very well -- the studios will be more forthcoming, at least with older releases. Realistically, though, the box will continue to be the poor cousin for a while.

I could also imagine a "premium instant" service, where for more than the base $10/month you could get more choices, but that probably will have to wait until the studios are comfortable with the technology.

So should you get one? I'd say take a good look at the selections on Netflix and if it looks like you can find enough there to justify $100 for the box (and $10/month if you don't already belong to Netflix), go for it. Netflix says that this Roku box is "likely to be the lowest cost Netflix ready device for the foreseeable future". I hate that kind of sales pitch, but in this case it's probably true.

Saturday, June 14, 2008

Nobody was really sure if he was from the House of Lords

My focus these days seems to be across the pond ...

The House of Lords doesn't exactly have a cutting-edge reputation, so if you see them on YouTube addressing the question of "What's it all about?", that says something about YouTube's status as a mainstream medium, not to mention something about the House of Lords.

One could take such a clip as a stuffy bunch of old geezers trying to use the web to spiff up their image and convince the world that they're a hip, happenin' kind of legislative body like, say, the U.S. Senate, but that would be unfair. Actually, it's a nicely put together, brief and cogent explanation of what the House of Lords actually is, what it does and why people, and Britons in particular, should care -- if they do say so themselves. I certainly learned a couple of things from it.

In other words, it's an entirely legitimate use of modern media to help people understand what their government is up to. This being the web, if you want an alternative viewpoint, such are also available. Interestingly enough, the House itself maintains a listing of recent proposals to reform or abolish it (I think I'm safe in calling "since 1900" recent in this context).

Just make sure you follow the link above, or go through the House of Lords' own web site. The "What's it all about?" video seems to have made its way up the rankings since I first looked, but if you search YouTube directly you're just as likely to end up with a band by the same name, or footage of a peer appearing to sleep through a speech ...

Friday, June 13, 2008

The slowest fiber service in Europe

British phone/internet provider BT is rolling out fiber-optic service to homes in Ebbsfleet, Kent. BT says the speeds are "Higher in fact than anyone currently needs." Critics say it's "The slowest fiber service in Europe". Both statements are factual, but neither seems very helpful.

On the one hand, how much bandwidth does one need? The answer could range from zero, given that thousands of generations of humans survived before the internet, to "enough to saturate the senses of everyone in the house simultaneously" (not currently available anywhere I know of).

The BT offering is somewhere in between. Sustained bandwidth is 2.5Mb/s, shy of full DVD, but it can also handle "bursts" of up to 100Mb/s. That's a fairly broad range, and it's not clear how long a burst can be. If it's, say, 10 seconds, then in that time you could buffer up about four minutes of DVD video, and if you could do it again every few minutes you should be able to keep the buffers full. Of course, if you "need" HD, you'll need to buffer more at the outset. If you "need" to watch two or more different live HD offerings at once, you're probably out of luck.
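The arithmetic behind that four-minute figure is easy to check. A quick Python back-of-envelope, with an assumed nominal DVD bitrate of 4Mb/s (actual discs vary, and BT hasn't specified burst duration -- the 10 seconds is my hypothetical):

```python
# All rates in megabits per second.
BURST_RATE = 100     # BT's advertised burst speed
BURST_SECONDS = 10   # assumed burst duration
DVD_RATE = 4         # assumed nominal DVD video bitrate

buffered_mb = BURST_RATE * BURST_SECONDS       # megabits banked per burst
playback_seconds = buffered_mb / DVD_RATE      # how long that lasts at DVD rate
print(playback_seconds / 60)  # → ~4.2 minutes of DVD-rate video
```

So one 10-second burst every four minutes or so would keep a DVD-rate stream fed, which is presumably the kind of pattern BT has in mind.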

On the other hand, if your bandwidth is adequate to your needs, what does it matter how big a pipe they have on the Continent, or in Asia or wherever? But maybe that's just sour grapes from a (relatively) bandwidth-constrained Yank.

Thursday, June 12, 2008

Virgin Media, copying and use

BBC Blogger Bill Thompson is concerned that his ISP, Virgin Media, appears to be monitoring its customers' activity for signs of illegal copying. Along with the obvious general question of who can monitor whose activity when and for what, he has a specific concern about copying via BitTorrent:

Like almost every technically-competent internet user of my acquaintance I've used BitTorrent to get my hands on a copy of a TV show that I missed, taking advantage of the kindness of strangers who bothered to record and upload the shows for fans because the companies that make and broadcast them choose not to.

However I also go out and buy the DVD box sets as soon as I can.

And I don't feel like a criminal, because I don't see why downloading a copy of a show that someone else has recorded should be seen as a breach of copyright while recording it myself onto a DVD is not.
It's certainly not wrong to make a copy of something you've already paid for, and by that argument it doesn't seem wrong to make a copy of something that you fully intend to pay for and do pay for reasonably soon. But the question here is not whether something is right or wrong, but whether it is legal. Ideally, these are closely related questions, but we're talking about digital media here.

A while ago I had a small epiphany from Linus's assertion that copyright is about distribution and not use. The copyright holder can't and shouldn't be able to control how you use something you've bought, but it does have some say over who can copy it and how.

Thompson's two cases look the same from the point of view of use. In either case, he's watching something and paying for it. They look completely different from the point of view of copying and distribution. In one case he's following the licensed method and in the other case he's not.

It's also pretty clear why the distributor would care. In the licensed case, it knows exactly who copied what and it knows it's getting paid. In the unlicensed case, it knows neither, particularly not the latter. Many people, like Thompson, are honest, but even an honest person might be tempted to think "Oh, I only watched it once" or "I'll pay for it next week" and next week never comes.

Of course, this being the law, things are not always so clear-cut. It's generally a copyright violation to charge people admission to watch a DVD you bought. That looks more like "use" than "distribution", unless you squint just right.

As always, remember I'm not a lawyer.

A cautionary tale from AOL

Anyone doing research on, say, the locations of people's cell phones would have to be aware of, and keen not to repeat, the great AOL search data debacle of 2006.

I have to admit I didn't follow this closely at the time. Seemed like the kind of thing that was bound to happen sooner or later, and might happen a little less often given that AOL, despite repeated and doubtless sincere apologies, lost business and was generally humiliated for its troubles. But as fate would have it, two stories I wanted to comment on intersected exactly there. One was the piece on cell phones, and the other I'll get to in a bit.

There is certainly value in gathering anonymized bulk data and studying overall patterns. Paul Boutin has an interesting informal analysis of the AOL data, for example. Unfortunately, there are limits to how anonymous that data can be.

Anonymity depends critically on everyone being able to plausibly say "How do you know it was me? It could have been any of these people." I call this the "I'm Spartacus" effect, and it in turn depends on not giving away specific, unique data.

It turns out that people's internet searches can be very specific indeed. Sure, lots of people search for popular products, or celebrities, or any of a number of other things, but we also search for friends or acquaintances, or local businesses, or organizations we belong to or what-have-you. In the case of the AOL data, the New York Times had no trouble tracking down a lady in Georgia, who was kind enough to be interviewed, and several other searchers have also been identified.
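To make the mechanics concrete, here's a toy sketch in Python (the data is entirely made up): a common query leaves you hidden in the crowd, while a single distinctive query singles you out.

```python
# Hypothetical illustration of why specific searches break anonymity:
# a handful of distinctive queries can narrow "anonymous" users to one person.
search_logs = {
    "user_001": {"weather", "celebrity gossip", "pizza near me"},
    "user_002": {"weather", "landscapers in my town", "family reunion planning"},
    "user_003": {"weather", "pizza near me", "celebrity gossip"},
}

def candidates(logs, known_queries):
    """Users whose search history contains all the queries we know about."""
    return [user for user, queries in logs.items() if known_queries <= queries]

# A common query leaves plenty of cover ("It could have been any of us") ...
print(candidates(search_logs, {"weather"}))  # all three users
# ... but one specific query collapses the crowd to a single suspect.
print(candidates(search_logs, {"landscapers in my town"}))  # just user_002
```

The "I'm Spartacus" effect in a nutshell: anonymity holds exactly as long as every query you reveal is one that lots of other people also made.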

At least one searcher, User 927, became notorious even without being identified, owing to a particularly disturbing search history, and is now the inspiration for a play of the same name. This was the other news item that led me to revisit the AOL fiasco. I haven't seen the play and doubt I will, just as I doubt User 927 will be laying claim to any of the royalties.


Naturally, AOL tried to put the genie back in the bottle, and naturally it failed. The raw data is available on several sites -- you can search for them, of course -- and at least one site lets you search the searches online. I wonder if they log that.


[The domain name for the original link for the play seems to have turned over since this was written.  The link I gave now points at a banking site somewhere in Scandinavia.  I've updated to an Ars Technica article on the play -- D.H. Sep 2018]

Wednesday, June 11, 2008

More on cell phones as tracking devices

It was this BBC piece on a recent study at Northeastern University that set me musing about tracking via cell phone.

The article is sort of a roller coaster ride of "yikes!":

It would be wonderful if every [mobile] carrier could give universities access to their data because it's so rich

The researchers said they were 'not at liberty' to disclose where the information had been collected.

... giving way to "that's not so bad":

[S]teps had been taken to guarantee the participants' anonymity

[W]e only know the coordinates of the tower routing the communication, hence a user's location is not known within a tower's service area

... and the occasional "hmm ...":

Nokia have put forward an idea to attach sensors to phones that could report back on air quality. The project would allow a large location-specific database to be built very quickly.

Ofcom is also planning to use mobiles to collect data about the quality of wi-fi connections around the UK.

Evidently the business of attaching interesting sensors to cell phones is expected to boom in the next few years.

The real punchline, though, was the unsurprising conclusion that most people's daily activities are pretty boring: "The study concludes that humans are creatures of habit, mostly visiting the same few spots time and time again. Most people also move less than 10km on a regular basis[.]" Even those that travel further still tend to visit a small number of places repeatedly.
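For what it's worth, the "few spots" effect is easy to quantify once you have a log of location pings. A toy sketch with made-up data (real studies work from tower coordinates, not labels, but the arithmetic is the same):

```python
# Toy illustration of the "creatures of habit" finding: given a user's
# location pings (hypothetical data), measure how much of their time
# is spent at the few most-visited spots.
from collections import Counter

pings = ["home"] * 50 + ["office"] * 40 + ["gym"] * 6 + ["airport"] * 2 + ["beach"] * 2

top_two = Counter(pings).most_common(2)
share = sum(count for _, count in top_two) / len(pings)
print(f"Top two locations account for {share:.0%} of pings")  # 90%
```

Even a long tail of occasional destinations barely dents the share claimed by the usual haunts.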

It's natural to be concerned about the ever-increasing speed of communication, and the prospect that at some point everyone might have access to everything known about everyone. But on the other hand it's comforting to know that one's own activities are probably too boring for most people to care about.

Monday, June 9, 2008

It's 2:00 am. Does your cell phone know where you are?

As I understand it, cell phones are called cell phones because the area of coverage is divided into (generally overlapping) "cells", each covered by a given tower. This means that if you're connected to the network, the provider will be able to tell, at a minimum, which cells you're in. By looking at signal strength from the towers involved one can get a much more accurate estimate. And as if that's not enough -- and apparently it isn't -- GPS is becoming a standard feature.
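Just to illustrate the idea -- this is not how carriers actually do it; real systems use timing measurements, triangulation and GPS assist -- here's a crude weighted-centroid estimate in Python, with made-up coordinates and signal strengths:

```python
# A crude sketch of locating a handset from tower signals: take a
# weighted centroid of the tower coordinates, using signal strength
# as the weight. Coordinates and strengths below are invented.

def weighted_centroid(towers):
    """towers: list of (x_km, y_km, signal_strength) tuples."""
    total = sum(strength for _, _, strength in towers)
    x = sum(x * strength for x, _, strength in towers) / total
    y = sum(y * strength for _, y, strength in towers) / total
    return (x, y)

# A strong signal from the tower at (0, 0) pulls the estimate toward it,
# giving a point much closer to that tower than to the other two.
print(weighted_centroid([(0.0, 0.0, 0.8), (3.0, 0.0, 0.1), (0.0, 3.0, 0.1)]))
```

The point is just that overlapping cells plus relative signal strength pin you down to much better than cell-sized resolution, even before GPS enters the picture.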

Having a precise, accurate location device on hand at all times can be handy and in some cases even life-saving. On the other hand, having an unobtrusive tracking device on one's person at all times raises some obvious privacy issues.

There are two contrasting extreme views on this sort of thing. The Utopian view plays up the "never lost" and "find a restaurant" features and goes on to argue that a world where everybody can locate everybody else is a fundamentally Good Thing.

The dystopian view plays up the privacy concerns, argues that The Man wants to know where you are and, further, wants to make it nearly impossible to live without your personal tracker.

Naturally, I don't subscribe to either extreme view. I'm not really excited by the idea of a service that alerts me if a friend happens to wander into my vicinity (or vice versa), but neither do I see the whole thing as a step down the slippery slope towards Big Brother. I am a bit concerned that it's easy to forget, or never really realize, how locatable you are when you carry a cell phone, but that problem has been around for a while now.

On the balance I see it as technology taking yet another incremental step and life going on more or less as usual.

The web as distinct from its applications

In a fairly interesting article pondering what the next big platform, or platforms, might be, Josh Quittner parenthesizes:
(Yes, the Web is nothing more than a big layer of code; all those websites we visit are merely applications that sit atop it.)
Now, I think I get what angle this is coming from. I've argued myself that you don't interact with "the web" but with a web application. Even so, I don't think the picture above is quite right, or even quite consistent. The web applications are indeed a layer of code. But if they sit on top of the web itself, then what is the web itself?

Muddying the waters a bit is the recent swing towards fatter clients, represented by the AJAX head of the Web 2.0 hydra, but for my purposes here it doesn't much matter where the code is sitting, whether in the browser or at the other end of the connection. Wherever the code is running, there's you, there are "all those websites" and either
  • Nothing else, in which case the applications aren't a layer on top of the web, they are the web.
  • Something else, in which case what?
I'm reasonably comfortable with either view. Either one can be made to fit my earlier working definition of the web as "all resources accessible on the net".

That definition is deliberately vague on what a resource is. If by resource you mean "web application", then you have the "nothing else" view. This works as long as you include the application's data with the application. That's reasonable, and good if you have an "active data" point of view.

On the other hand, the original web was (largely but not completely) about hypertext documents linked to each other, and the modern web is still very much about documents (or other data sets) and links between them. Much of the modern machinery is about either finding documents more effectively or about presenting the findings in a more useful or interactive way.

Following this line of reasoning a bit, you can use different applications to get at the same resources, and the same application to get at different resources. If the resources are the web, then the applications stand in an M:N relation to them, certainly not 1:1, and are thus clearly a different thing.

That's not to say that a resource can't itself be an application, say an annoying popup-filled multimedia experience. Rather, a resource isn't always an application, an application that accesses the web isn't necessarily a web resource, and the web does not appear to be a big layer of code with websites sitting on top of it.

Tuesday, June 3, 2008

Filters through a personal datastore lens

Returning to the theme of personal datastores ...

Services like The Filter consist of several parts:
  • A social networking component. Who knows whom? Everybody and their dog has this now, which is unfortunate, since each then has its own slightly different copy.
  • A database of things you've done through the service. Music services, for example, have a record of what you've bought and perhaps of what you play. Your email service, whether it's web-based or not, has a record of whom you've emailed and who's in your address book. Unlike the case of social networks, different kinds of services have different kinds of associated data, but for a particular kind of data, each service still has its own slightly different copy.
  • An engine that pulls the rest of the data together and tries to give you something of value from it. This is the secret sauce. Different music services will have different ways of recommending music (and different music collections to draw from).
One of the core tenets of personal datastores, as I understand the concept, is that data about you should be centered around you and you should control it. In the current "data fiefdom" model, every service has its own record of whom you know and what you've done. That's of value to them, of course, so they treat it as their data, but this means that, leaving aside the obvious privacy concerns, each service has a different, incomplete view of your world. Such an incomplete view is less valuable to the service provider than a complete one would be.

On the other hand, it's about you, so why shouldn't you control it? If you control data about your history and connections, you benefit because you control access to it and because all your data is in one place. In such a world, if you subscribe to a service that wants to know about your listening habits, you give it permission to see your music choices -- and nothing else. Then when you listen to a song using whatever means you like, the recommendation service knows about it. You don't have to use their web site or do things their way. If you switch services, you don't have to somehow move your history from the old service to the new one.

Whether the service providers would go for such a scheme remains to be seen. On the one hand, it lets them give better service, because they have better information and can get to it easily and uniformly. On the other hand, it eliminates a form of "vendor lock-in". If you can easily switch from their service to someone else's, maybe you will. This is a problem for established players, but a good thing for upstarts.

If you buy the premise that data about connections and history should move from the services to the person using the service (or more likely a datastore provider acting on that person's behalf), then all that's left is the engine. How would this work?
  • I launch a new service promising to recommend, say, lawn care products based on the music you listen to. Hey, stranger things have happened.
  • If you want to subscribe, you give me access to your music database (or, if you don't want me to know about your secret obsession for accordion music, you give me access to a sanitized view of it). This access would typically include a feed of updates, so that if you bought a new song I would know about it. My engine crunches that data and gives you recommendations.
  • You might also choose to give me access to your "casual acquaintances" data. In that case, if an acquaintance also joins and also grants the necessary permissions, my engine will know about it and (perhaps) make better recommendations to both of you.
Technically, this seems pretty nice. Each party deals with what it's most equipped to deal with. You maintain your personal data, I run my engine. There's only one copy of the personal data. You can give out or withhold permission as you see fit.
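A sketch of how that scheme might look in code, with all the names being my own invention:

```python
# A sketch of the permission model described above: one personal datastore,
# with each service granted a scoped (possibly sanitized) view of one
# category of data. Everything here is hypothetical.

class Datastore:
    def __init__(self):
        self.data = {"music": [], "acquaintances": []}
        self.grants = {}  # service name -> (category, filter function)

    def grant(self, service, category, view=lambda item: True):
        """Give a service access to one category, optionally filtered."""
        self.grants[service] = (category, view)

    def read(self, service):
        """Return only what the service was granted; KeyError otherwise."""
        category, view = self.grants[service]
        return [item for item in self.data[category] if view(item)]

store = Datastore()
store.data["music"] = ["polka song", "rock song"]
# Grant the recommender a sanitized view: the accordion obsession stays private.
store.grant("lawncare-recommender", "music", lambda title: "polka" not in title)
print(store.read("lawncare-recommender"))  # ['rock song']
```

One copy of the data, permissions granted and revoked in one place, and the service sees exactly the view you chose to give it -- which is the whole pitch.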

As I've said before, with more or less cynicism, it's not clear how or whether we get to such a world from here, but it does seem like a pretty sensible world.

Peter Gabriel adds a wrinkle or two (sorry, couldn't resist)

Actually, despite the somewhat dismissive tone of my last post, there are a couple of interesting wrinkles to The Filter.

First is that it does away with user ratings. It doesn't ask you to give a number of stars or whatever to a selection. Instead, it notes what you do with it. If you keep purchasing something by an artist, it concludes that you must like that artist.

In other words, actions speak louder than words.

The Filter also weights more recent actions more heavily than older ones, sort of a "what have you done for me lately?" approach. Your rave about your favorite boy band is not going to come back and haunt you in your thirties (unless you're still into them and buy the reunion DVD, of course).
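For the curious, here's one plausible way to do that kind of weighting -- The Filter's actual algorithm isn't public, so this is just a sketch: an exponential moving average over play events, where each new event counts for more than everything that came before it.

```python
# A recency-weighted implicit rating, sketched as an exponential moving
# average over play events (1 = played, 0 = skipped), oldest first.
# This is an illustration, not The Filter's actual method.

def implicit_score(events, alpha=0.3):
    """Higher alpha means recent events dominate more quickly."""
    score = 0.0
    for event in events:
        score = alpha * event + (1 - alpha) * score
    return score

# Heavy play long ago, ignored lately: the score has decayed.
old_favorite = [1, 1, 1, 1, 0, 0, 0, 0]
# Recently discovered: the recent plays dominate.
new_interest = [0, 0, 0, 0, 1, 1, 1, 1]
print(implicit_score(old_favorite) < implicit_score(new_interest))  # True
```

Your teenage boy-band phase literally fades away unless you keep refreshing it with new purchases.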

Ratings have a couple of theoretical weaknesses. One is how to normalize the scores. I might tend to reserve a five-star rating for that rare near-perfect item, while you might tend to give it to everything you like. Another is that either of our ratings of an item might change after the initial enthusiasm or reluctance wears off, but most of us won't be bothered to change a rating except perhaps in extreme cases. Another is that many of us can rarely be bothered to assign a rating in the first place.

One of the tenets of the whole "wisdom of crowds" approach is that, given enough data points, such discrepancies will tend to even out. Fair enough, but that's true whether the raw data comes from people's recommendations or from their mouse clicks. If ratings are redundant or even inconsistent with more accurate indicators of what people are thinking, then they may well just be in the way.

The only way to know for sure is to try the experiment. Either way The Filter remains a crowd-based service. As far as I can tell you're not getting PG's picks, per se, or anybody's in particular.

I still don't think any of this is particularly novel. As I recall, the various music playing apps track how much you play a song and can infer a rating from it, and weighting recent activity more heavily is an old idea (compare exponential moving averages in technical trading, for example). On the other hand, not everything has to be a breakthrough. Much -- some could argue all -- progress is made by incremental tinkering and trying existing ideas in slightly new combinations.

Monday, June 2, 2008

Another old rocker on the web

Neil Young isn't the only aging rocker trying to establish a presence on the web. Well, most of them probably are, really, but Peter Gabriel in particular seems to be at it. He's been at it for a while, actually, what with OD2 and Real World, but ...

OK, where was the news item here? Apart from PG's servers getting nicked? Ah yes ...

PG is about to launch The Filter, a web-based recommendation service that seems quite a bit like other recommendation services, except, well, PG's behind it. The site does aim to be fairly broad in scope, including not just music but movies, web videos and such, and aiming eventually to include features like restaurant recommendations for tourists. Good stuff if you're a PG fan, but probably not so web-shaking in the larger scheme of things.

BTW, I actually like PG's stuff, by and large, and his efforts in distributing music are interesting ... just having second thoughts about the notability of this particular item. But the major news services had no such doubts, and (as someone else of that vintage said) who am I to disagree?

Netflix Roku box on order

I've gone ahead and ordered Netflix's set-top box, made by Roku. Due to high demand, it'll be a couple of weeks before it arrives, but I'll post a review when I've had a chance to play with it [as promised, the review is here].

Flashing your virtual headlights

It's not as common a practice as it once was, but if you drive in the US, you've probably heard of the custom of flashing your headlights to warn oncoming traffic of a speed trap you've just passed.

Now you can do the equivalent on the web, using your cell phone. Once you've set up Trapster, you can speed dial a phone number (without taking your eyes off the road, assuming you're good at speed-dialing) to report a speed trap you drive past. Other Trapster-aware drivers will then be alerted as they approach the trap. You can also go online (while parked, presumably) to see speed traps in your area.

Is it legal? It seems like it ought to be, particularly if it's a passenger and not the driver playing with the phone. It's not clear that The Man minds all that much either, the question being whether The Man is more interested in getting you to slow down or in handing out tickets.