Saturday, August 29, 2009

How's that Babelfish coming along?

I'm not going to do a full baker's dozen on this, at least not at the moment, but just to test the waters let's try round-tripping the last paragraph of my last post through Yahoo! Babelfish. In the past, this exercise (with various on-line translation engines, not just Yahoo!) has been good for much amusement, but these things have a way of incrementally improving.

The original paragraph was
Personally, I still don't see I, Robot coming to life any time soon, but I do see things that got written off as impossible during the dead-and-debunked phase starting to stir to life again. I'm thinking, say, competent machine translation or robots that can pick things up and carry them around a house, happening gradually in the next decade or so.
For whatever reason, Babelfish has trouble with "don't" and "I'm", passing them through to the target language undigested, so for the test I've changed them to "do not" and "I am". I've also quoted "I Robot" to reduce confusion. This seems to help.

First, let's try Dutch. Dutch, along with its cousin Frisian, is just about as closely related as extant languages get to English.
Persoonlijk, zie ik nog geen "I Robot" de komst aan het leven wanneer spoedig, maar ik dingen zien die onmogelijk tijdens afgeschreven werden dood-en-debunked fase die aan het leven begint opnieuw te bewegen. Ik denk, namelijk, bekwame automatische vertaling of robots die kunnen dingen oprapen en hen dragen die rond een huis, geleidelijk aan in het volgende decennium gebeuren of zo.
From my limited knowledge of Dutch, this is actually not too far off. I'm pretty sure it's ungrammatical in places, and I'd be interested to hear from a native Dutch speaker, but I'm also pretty sure such a person could get the intent of what I was saying. Of course, most Dutch speakers would just read the Engels versie and be done with it.

Now for the back-translation of the translated text. For several reasons, one shouldn't expect the original to come back word-for-word, but ideally it should come back in good English with the sense substantially intact:
Personally, to see I "still no; I Robot" the arrival to living when shortly, but I things see that impossible during was written down dood-en-debunked phase move which to living starts again. I think, as it happens, competent automatic translation or robots which can pick up things and them carry that around a house, gradually in the next decade happen or this way.
Um ... I'd say this bears a strong statistical resemblance to what I said, but I wouldn't say it hews closely to the original sense.
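
To put a rough number on "statistical resemblance", here is a minimal Python sketch that scores word-level overlap between an original text and its round-tripped version. The function names are made up for illustration, and the two translator arguments are placeholders for whatever engine you care to plug in; nothing here calls Babelfish or any real translation service.

```python
import difflib
import re
from typing import Callable


def words(text: str) -> list[str]:
    """Lower-case the text and split it into words, dropping punctuation."""
    return re.findall(r"[\w']+", text.lower())


def word_similarity(original: str, round_tripped: str) -> float:
    """Score in [0, 1] for how many words the two texts share, in order."""
    matcher = difflib.SequenceMatcher(None, words(original), words(round_tripped))
    return matcher.ratio()


def round_trip(text: str,
               there: Callable[[str], str],
               back: Callable[[str], str]) -> str:
    """Translate out and back again. `there` and `back` are hypothetical
    stand-ins for calls to a real translation engine."""
    return back(there(text))
```

A score like this only counts shared words in matching order. It says nothing about whether the sense survived, so a back-translation can score well and still be word salad.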

Several things have always puzzled me about Babelfish. Why does it not at least attempt to come up with grammatical statements in the target language? I suspect this is because it takes a local, phrase-at-a-time approach, but then why can it translate "dead-and-debunked" to "dood-en-debunked", that is to say "Dutch for 'dead'-Dutch for 'and'-Don't know what this is, just pass it through", but not take that back to "dead-and-debunked" by the same logic? Why can it rewrite English word order into more-or-less Dutch word order, but not vice versa?

I'm less surprised that it would take "say" into "namelijk" and then back to "as it happens". The failure of senses of words to map one-to-one even between closely related languages is the bane of all translation. In this case "bij voorbeeld" [or maybe "met name"?] would have worked better, coming back as "for example", but a human translator would likely have seen that "for example" was more appropriate in any case, even with the Dutch text as it stands.

This sort of determination would seem at least to require a more global view of the text, which global view Babelfish clearly lacks. It's just the sort of thing I've been waiting to see signs of in online translation. It might be out there somewhere, but not in Babelfish, and I expect it would take significant structural changes in the engine for Babelfish to acquire it.

Ah well. What about Spanish? Spanish is more widely spoken than Dutch, and particularly more widely spoken as an only language, so perhaps the English-Spanish-English translators get more attention. It's also still in the same general family as English and one could make a plausible argument that English owes at least as much of its structure to the Romance languages as to the Germanic. In other words, the winds appear at least as favorable here as for the last attempt. Without further ado:
Personalmente, todavía no veo el "I, Robot" viniendo la vida en caulquier momento pronto, pero a mí vea las cosas que consiguieron amortizadas como imposible durante la fase muerto-y-debunked que comenzaba a revolver a la vida otra vez. Estoy pensando la traducción automática o las robustezas, por ejemplo, competente que pueden coger cosas y llevarlas alrededor de una casa, sucediendo gradualmente en la década próxima o tan.
Again, I'm not a native speaker, but this seems like reasonable Spanish, although still with what look to my untrained eye like a couple of grammatical errors. But at least "say" came across as "por ejemplo", which should come back as "for example". How does the whole thing come back?
Personally, still I do not see "I, Robot" coming the life in caulquier moment soon, but it sees the things that obtained amortized like impossible during the die-and-debunked phase that revolver to the life began again. I am thinking the automatic translation or the robustezas, for example, competent that can take things and to take them around a house, happening gradually about the next decade or so.
Urk. Again, why can it produce "cualquier" but not at least come back with "whatever", which would sound a bit weird but at least be English? Likewise "revolver" to "return" or "robustezas" to "robots"? Even with those filled in, it's still a bit of a word salad, leading me to think I was probably too generous about the quality of the Dutch and Spanish. Not being particularly expert in those languages, I'm probably better able to gloss over gross errors.

OK, one more try: Simplified Chinese. My ignorance of written Chinese is profound, so the translation here could read "the square on the hypotenuse is equal to the sum of the other two squares" for all I know, but here it is:
亲自,我仍然不看" 我, Robot" 很快来到生活,但是我看得到注销的事物,不可能在开始死和被揭穿的阶段期间再搅动到生活。 我认为可能拾起事和在房子附近运载他们,在下十年或如此逐渐发生的能干机器翻译或机器人。
And back again:
Personally, I still did not look at "I, Robot" Arrives at the life very quickly, but I looked that obtains the logging out thing, is impossible to start to die with stage period which reveals mixes the life again. I thought that possibly ascends to stage a rebellion with delivers them nearby the house, either has the competent machine translation or the robot so gradually in the next ten years.
I'll just let that speak for itself.

Wednesday, August 26, 2009

Hey look, I advanced human knowledge!

Back during the Baker's Dozen series on search engines (a.k.a. "the topic that ate my blog"), I threw questions like "Who starred in 2001?" at various search engines. The idea was to see how well they would deal with questions beyond just matching up words statistically. Mind, I'm a fan of the statistical approach. It's easy to explain and, with a little googly special sauce, produces good results quickly.

I was particularly intrigued by True Knowledge (and by Wolfram Alpha). True Knowledge uses a fairly classic AI knowledge base approach to store facts in a structured way and draw inferences. For example, it might be able to glean from "starred in" that we're talking about a film or play, and it might know that there was a film called "2001". This sort of real-world, can't-be-derived-from-general-rules knowledge was one of the larger rocks against which the exuberant early predictions of AI -- I'm talking 1960s here -- were dashed. These days, with orders of magnitude more storage and processing power available, the parameters have changed and so the game has too.

At the time, True Knowledge was able to provide a good answer to "Who starred in 2001: A Space Odyssey?", but it couldn't quite connect the dots and realize that "Who starred in 2001?" was probably the same question. However, it did find a possible link, and offered
2001 can also be used as a way of referring to 2001: A Space Odyssey, the 1968 science fiction film directed by Stanley Kubrick, written by Kubrick and Arthur C. Clarke. If this is actually the recordable medium you are adding, please click the button below.
I did so, but the answer still came up the same. In the post I said:
Most likely the new facts are still rattling through the various caches, or perhaps someone's moderating the input. But if the search succeeds for you later, you'll know whom to thank.
Just now, I wondered whether the new knowledge had been assimilated into the database. And voila, True Knowledge can now answer the question. And the credit is mine! All mine! Bwahahahaha! (and, um, maybe a little bit to the nice folks at True Knowledge for putting the engine together in the first place, and all the people who contributed related facts to the database, and ...).

Flippant comments aside, this is actually pretty cool. Partly it's cool to see one's contribution, however minor, go into the Big Mix. But that's been a feature of web.life pretty much since the start. Mostly it's cool that True Knowledge was able to assimilate it the way it did.
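
I have no idea how True Knowledge is actually built, but the general shape of "store facts in a structured way, then accept a new fact that one name refers to another" can be sketched in a few lines of Python. Everything here (the class, its methods, the relation name "starred") is invented for illustration and is not True Knowledge's API.

```python
class ToyKnowledgeBase:
    """A toy fact store: (subject, relation) -> values, plus name aliases."""

    def __init__(self):
        self.facts = {}    # (subject, relation) -> set of values
        self.aliases = {}  # alternate name -> canonical name

    def add_fact(self, subject, relation, value):
        self.facts.setdefault((subject, relation), set()).add(value)

    def add_alias(self, alias, canonical):
        """Record that `alias` can also be used to refer to `canonical`."""
        self.aliases[alias] = canonical

    def ask(self, subject, relation):
        # Resolve the alias first, then look up the fact as usual.
        canonical = self.aliases.get(subject, subject)
        return sorted(self.facts.get((canonical, relation), set()))


kb = ToyKnowledgeBase()
kb.add_fact("2001: A Space Odyssey", "starred", "Keir Dullea")
kb.add_fact("2001: A Space Odyssey", "starred", "Gary Lockwood")

before = kb.ask("2001", "starred")             # alias not yet known
kb.add_alias("2001", "2001: A Space Odyssey")  # the button I clicked, more or less
after = kb.ask("2001", "starred")
```

The point of the sketch is that the one added alias changes the answer to the short question without touching any of the facts about the film itself.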

In very broad and oversimplified strokes, the whole AI/robotics thing has gone through several phases:
  • (very early, but I suspect still very much present in the popular view) Hey, these computers can be programmed to do anything! They can solve equations in seconds that humans could never figure out. Simple stuff like walking and talking should only be a couple of years away.
  • Oh my. This walking and talking is much more complicated than it looks (again, this was a pretty early realization). You need some specialized knowledge.
  • A long period of building tools and solving specialized problems ensues. It becomes clear that you don't need "some" specialized knowledge. You need a whole lot. It also becomes clear that there are not just "some" specialized problems to solve, but lots and lots. To the outside world, nothing's happening. It's all dead, debunked (again, I suspect this is a fairly prevalent view in the world at large).
  • In reality, the research is paying benefits. It's just not producing I, Robot scenarios. This is the decades-long "If we know how to do it, it's not AI (any more)" phase on the computing side. Cognitive science (or "natural computation") is blossoming as a field and producing all kinds of interesting findings about how brains work.
  • And now, stuff is actually starting to appear. Computers are winning chess matches against top humans (albeit mostly through sheer computation). Demos like Big Dog are appearing. The computer end of human-computer interaction is getting smarter.
Personally, I still don't see I, Robot coming to life any time soon, but I do see things that got written off as impossible during the dead-and-debunked phase starting to stir to life again. I'm thinking, say, competent machine translation or robots that can pick things up and carry them around a house, happening gradually in the next decade or so.

Sunday, August 23, 2009

Some highlights of the year

Here are some tidbits gleaned from Google Analytics:
  • The top five pages are
    1. The main Field Notes page. I think that's generally people dropping back in to check the site.
    2. Go ahead and talk to strangers, about the intriguing and popular Omegle anonymous chat server.
    3. Information age: Not dead yet, a reply to Joe Andrieu's contention that the Information Age is behind us now, has been picking up steady traffic from searches like "When did the information age begin?" ever since. It doesn't answer the question. Rather, it argues that the question itself is ill-defined, which I still think is a valid response.
    4. "Hackers crack SSL", one of a series of articles in which I tried to track down what actually happened and discovered an interesting tale. I suspect that at least some visitors were looking for advice on how to crack SSL, in which case they would likely have come away disappointed.
    5. Now what happened to my bookmarks?, another post likely to disappoint searchers. It's about why I don't seem to use the bookmarks feature on my browser much at all, not about how to recover lost bookmarks, so I've added a note at the top with links to several more useful pages and/or searches.
  • The top five searches are
    1. omega talk to strangers (see above)
    2. how to guess someones password on worldscape. It turns out that the words in question appear in close enough proximity for Google to think I might have something to say on the topic. I don't.
    3. field notes on the web, fairly enough
    4. what happened to my google bookmarks (see above)
    5. when did the information age begin (likewise)
  • Other searches that caught my eye, for whatever reason
    • how many threes are in a dozen?
    • powerset -"power set" -powersets
    • hammock kenotic torrent
    • "poach tickets"
    • "david hull" -aerosmith +microsoft
    • "david hull" -aerosmith -humanities
    • "what's a concept"
    • accelerometer, disable the device,driving
    • al gore lisp
    • all human knowledge how much information
    • hand ciphers touch-tone phone
    • how can the concept of a wovel be used in ecommerce systems
    • j. k. rowling lisp
    • photo fedex man eating lunch
    • quick before you change your mind david hull
    • salt lake city trip checked into hotel
    • was kathleen antonelli interested in music
    • what is that chat room called? omega something? chat with strangers?
    • why did the information age occur
["How many threes are in a dozen?" has a particularly interesting history with this blog ... I'll let you do the searching, particularly since results may change from time to time ... --D.H. Dec 2015]

Happy Birthday, Field Notes!

It's been two years now since I finally succumbed to the implicit peer pressure -- I don't recall that anyone was explicitly asking me, but it seemed all the cool kids were doing it -- and started a blog. At the time the idea was to draw on my vast knowledge of networking standards (well, I'd sat on a couple of committees, so that's something at least) and make my deep thoughts available to the world at large. I would then spread my fame far and wide via social networking and, well, only good things could happen after that, right?

I had a working title, Morphisms, an obscure reference to category theory, a fairly obscure and extremely abstract branch of mathematics that has been used to derive obscure, extremely abstract but yet useful results in computer science. I had a rough idea of a theme or topic. I forget exactly what it was, but it was something to do with emphasizing the relationships between things on the web, rather than the things themselves. I hope you'll take my word for it that this has something to do with the concept of morphisms.

I had put together a sketch or two, at least in my head, of an initial post or posts. Somewhere in that process I took time off to go to a concert and got to thinking about how e-ticketing can work when the admission pass is any piece of paper with the right bar code on it. By the time I'd figured it out, Morphisms was out the window and I realized what I was really doing: figuring out the web as I went along.

I had been reading Darwin's Voyage of the Beagle (which I should get back to one of these days). While there are certainly more thorough and rigorous field notes around, Darwin's really let you see his mind at work as he alternates between, say, speculating about why the geology of some island has the form it has, and actually looking at the rocks in question to see if the theory makes sense. So the idea of field notes had to come into the mix somewhere, if only as a goad to better finding-out.

I don't presume to have come anywhere close to my model in depth of analysis, but keeping its curious spirit in mind has made writing Field Notes a lot of fun, even when I'm casting a jaundiced eye at some new form of web.hype or just generally waxing curmudgeonly about some annoying site or phenomenon. Occasionally I even learn something.

If FeedBurner is to be believed, I've also picked up a handful of readers. If you are one, I extend my sincere thanks but can only offer more-or-less more of the same: sporadic musings on whatever topic springs to mind, at an overall rate of one every other day or so (I've written just over 365 in the past two years; my informal goal is at least ten in any given month). Certain themes will no doubt continue to come up. Whether that's consistency, laziness or both is up to you.

[I did eventually finish Beagle, and highly recommend it.  Field Notes is still going, of course, but much more sporadically.  The ten-post-a-month idea went away about the time I joined Google -- they tend to keep you busy -- as did material for "Hmm ... I wonder what Google is up to with this" posts, since it's easiest to recuse myself from anything Google.  I also started the other blog for less webby musings.  I have no plans to stop with either --D.H. Jan 2016]

Thursday, August 20, 2009

How not to implement updates

One of my development tools is out of date. It kindly tells me this because it has contacted the mothership and determined there is a newer version available. Nice. Increasingly commonplace, but no less nice for that. Nor is it unreasonable, particularly for a development tool, to require me to actively get the latest update instead of having it installed automagically. Which scheme works better depends on preference and context.

On the popup is a web-link looking message saying something like "Visit the website for the latest download". Mind, it knows how to contact the mothership to see that it's out of date. Contacting the mothership to download the new version is not appreciably harder.

I click on the link. It puts me on the home page, not the download page. The download area is fairly inconspicuous amongst several other buttons at the top of the page. Fine.

Up comes a download page with a bunch of Google ads. For similar but competing tools. The actual download I want is clear off the screen below the decoy ads. How messed up is that?

To summarize:
  1. No automatic download. Please visit our website.
  2. You don't go to the downloads page but to a fairly cluttered home page with the downloads somewhere on it.
  3. When you get there, the real download is hidden by decoys advertising competitors.
That's about three steps too many -- a trifecta of annoyances. Good thing the tool itself is quite useful and well-written. And rarely needs updates.
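
For what it's worth, here is a minimal Python sketch of the friendlier behavior: if the tool already knows the latest version, it can just as easily hand over a direct link to the file. The manifest format, its field names and the example URL are all assumptions of mine, not anything this particular tool actually uses.

```python
def parse_version(text):
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def check_for_update(installed, manifest):
    """Return a direct download URL if the manifest advertises a newer
    version than the one installed, else None.

    `manifest` is a hypothetical dict the mothership might return, e.g.
    {"latest": "1.10.0", "download_url": "https://example.com/tool.zip"}.
    """
    if parse_version(manifest["latest"]) > parse_version(installed):
        return manifest["download_url"]
    return None
```

Comparing tuples of integers rather than raw strings matters here, since as strings "1.9.2" would sort after "1.10.0".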

Tuesday, August 18, 2009

The birth of the FAQ

Well, I won't swear this is the origin of the term, and I'm certain that FAQs had been around under other names long before the internet, but ...

Back in the mists of time, when bang paths roamed the earth, there was USENET. Originally delivered via uucp and still around to this day at least in form, USENET began life as an improved way for Carolina and Duke to send each other announcements on academic topics, and presumably also of each one's deep respect and love for the other's basketball team and traditions.

As with many innovations, it soon took on a life of its own. In September 1993 it spun completely out of control as hordes of AOL users joined the fun. Somewhere along the way, news groups (or "forums" as they're called these days) began to attract more and more "newbie" questions and regular denizens began to get more and more annoyed. The eventual response was to regularly post lists of Frequently Asked Questions and to gently, or not-so-gently, admonish posters of such questions to read the FAQ post.

After a while, the nice folks at MIT took it upon themselves to gather such posts for easy FTP access at rtfm.mit.edu. For a while, before Google took over the world, I would often troll the FAQs for useful information. I have no doubt there's still a trove of useful information there.

And that was when it struck me that USENET had two completely different purposes. The first was its intended purpose: to provide a way for people with common interests to communicate. The second was a fine example of unintended consequences: to generate FAQ lists for people like me.

Thursday, August 13, 2009

More fun from Galaxy Zoo

The Galaxy Zoo project has been so successful, it's been expanded. Along with classifying images of galaxies -- which is still ongoing -- you can help them screen candidates for supernovae. Astronomers are standing by in the Canary Islands to examine the likeliest ones. Evidently said astronomers are happy with the results so far, having confirmed several supernovae last night based on the Zoo's findings.

As with the galaxy classification, it's oddly fascinating and somewhat addictive. Betcha can't classify just one ...

Busy, busy

My day job is going to be taking up serious amounts of time for a while, so posts will be less frequent and you may notice a drop-off in attention to detail and depth of analysis. Or, worryingly, you may notice no difference at all.

Be that as it may, some items have caught my attention recently. I hope to explore the first two in greater depth, time permitting:
  1. Rupert Murdoch's News Corp, owner of the Wall Street Journal, News of the World, The Times (of London), BSkyB, Fox, Hulu, MySpace and a zillion other properties, has come down firmly on the side of not-free content. This should hardly come as a surprise, particularly since the Journal, at least, has been charging for its online content forever and Murdoch is nothing if not a capitalist. However, it will be interesting to see just how this plays out. I would expect a series of experiments and adjustments, but beyond that rather obvious prediction, who knows?
  2. The Associated Press has caused a ruckus with its announcement that it will charge up to $2.50 per word for excerpts posted online (the per-word rate drops off rapidly from that $2.50 worst case). There is some uncertainty as to whom this applies to. Large commercial sites can and probably should pay for such excerpts, just as major print magazines do. Smaller non-commercial sites (read, blogs) shouldn't. AP assures the public the policy is not aimed at small fish, but without a clear, binding statement to that effect, I would expect most bloggers to stay well clear of quoting AP on anything. Which, of course, is the exact opposite of what AP might want. Yet another example (see the previous post for another) of old media needing to be careful in the process of adapting their existing practice to the web.
  3. A middle-schooler has recently informed me that Yahoo! mail is cool and gmail is uncool, duh. So I guess I won't be sitting with the cool kids at lunch anytime soon.

Thursday, August 6, 2009

Is this the story of Johnny Rotten?

Now this is just plain silly.

I'm browsing through some songs using Rhythmbox, one of the Linux song players, having just discovered the "Lyrics" feature. I select an oldie but goodie and instead of lyrics I get
Unfortunately, due to licensing restrictions from some of the major music publishers we can no longer return lyrics through the LyricWiki API (where this application gets some or all of its lyrics).

The lyrics for this song can be found at the following URL:
http://lyricwiki.org/Public_Image_Ltd.:Public_Image

<a href='http://lyricwiki.org/Public_Image_Ltd.:Public_Image'>Public Image Ltd.:Public Image</a> [sic -- the raw HTML appeared verbatim in the text; probably there's some setting to tweak to fix this]


(Please note: this is not the fault of the developer who created this application, but is a restriction imposed by the music publishers themselves.)

Lyrics provided by lyricwiki.org
Huh? I can go browse the web and look at the lyrics all I want, but the music publishers want to make good and sure I can't do something nefarious with them, like display them in a popup on a music player? This sort of thing does nothing to quiet the accusations that music companies Just Don't Get The Web and are instead shooting themselves repeatedly in their collective foot fighting pointless legal battles.

I believe it's Fair Use for me to quote a small portion of the lyrics in question:
What you wanted was never made clear
Behind the image was ignorance and fear
You hide behind this public machine
Still follow same old scheme


Goodbye.

Wednesday, August 5, 2009

Blogging is not a genre

I think this is probably one of those that seems less profound when you start to write it down, but here goes:

A news report this morning introduced one of its subjects as "a blogger" from a country currently in political turmoil. Automatically an image formed of a member of the opposition bravely reporting conditions and advocating for the cause, at considerable personal risk. Salam Pax would be an archetype here.

As it happened, that particular image was basically correct. But as with all snap judgments, it need not have been. Run down the following list of labels and see if a particular image doesn't form involuntarily:
  • Political blogger
  • Entertainment blogger
  • Mommy blogger
  • Blogger
If you're like me (and if your mileage varies, great!), all but the last invoke not just the literal meaning but a particular kind of blogger. "Political blogger" suggests a partisan of whatever party. "Entertainment blogger" suggests tabloid-style gossip. "Mommy blogger" suggests a "soccer mom." I would venture to guess that for many people "blogger" in general suggests a particular genre of interest. For example, a politician might equate "political blogger" with "those irresponsible muckrakers making my life miserable" or "those hard-working souls selflessly putting the word out," depending on the day.

Interestingly, those characterizations don't seem a particularly good fit for the handful of blogs I actually read semi-regularly (These in turn are fairly disjoint from the blogs I've referenced here. This blog is about the web, not so much about my reading habits per se). That's not a complete surprise. When I try to decode someone's shorthand description, I'm trying to figure out what they mean, not what I might mean by the same thing.

But blogging is not a particular genre. Blogging is fundamentally a structure. Its distinguishing feature is not that it is about a particular brand of politics, or gossip, or parenting or whatever. Its distinguishing feature is that it is written serially, in small segments.

There does not need to be a great deal of continuity across those segments. The topic can shift with each post. Characters and scenery may or may not recur. Any action described in one post may or may not relate to action in any other post. On the other hand, because a blog is generally written by a single author or at most a small group, there will generally be some continuity of theme and style.

Within those constraints -- short, serialized segments and general continuity of theme and style -- pretty much anything is possible. Just as there are many genres of novel, play, movie, TV show, magazine article or newspaper column (one of the blog's closest relatives), there can be, and are, any number of genres built on the blog structure.

Nope, that wasn't particularly profound, but I guess I had to get it out of my system anyway. What can I say? I used to be an English major (for all of a semester).

Monday, August 3, 2009

If you can read this, thank a philosophy major

A Windows box in the house recently had a nasty case of the scareware, one of those fake virus removal thingies that pops up all kinds of frightening messages about how your computer is infected and you need to act now because, well, if your computer is acting like some rogue program has taken over, the only sensible thing to do is follow that program's instructions to remove the threat, right?

Sigh.

It's not just some web site gone amok. This thing has installed itself in the start menu. Shutting down the browser does no good. Rebooting does no good. Big-name virus checker is up-to-date, but says the computer is still vulnerable to intrusion. Offers a handy "fix" button for that. Which does nothing.

Sigh.

So, shut down the machine and go googling. A bunch of people are recommending roughly the same thing: download XYZ removal tool (a couple of people advise using some sort of "remove me" feature that the scareware authors have thoughtfully provided -- yep). One of the major publications appears to have given XYZ a good review. I visit the publication's site -- directly, not through a link, of course -- and the review seems to be there. So I go to the XYZ site. Um, is there an SSL certificate on this download site? Um, no. For that matter, was there one on the major publication's site? Um, no.

Sigh.

Further down the list of hits, a couple of sites have instructions for manual removal: delete a suspicious-looking entry from the Windows registry. Delete some files that don't look like they belong there anyway. Fortunately, this isn't one of those "delete *.dll from your SYSTEM32 folder and everything will be fine" scams. So I restart the machine in "safe mode", fire up regedit, delete the registry entry and the files in question, alias a few useless-looking sites to 127.0.0.1 for good measure and reboot.

Problem solved. Unless it only looks like it's solved.

Sigh.

You know what really bothers me here? It's not the annoyance of the malware itself. It's the epistemological nightmare that ensues. How would I know that that download site was legit? Probably it was, but you'd think a security software provider would think to buy a certificate. But even if they had, how sure could I be?

What makes me think the manual removal instructions were legit (besides a rudimentary knowledge of how Windows works and the fact that the annoyance seemed to stop)? Do I know that the malware is really gone and not just gone into stealth mode? Was it a decoy for something else? Do I cut the red wire or the blue one?

Who knows?

Who wrote the malware? The straightforward theory is a bunch of criminals just trying to harvest credit card numbers. The sneakier theory would be the upstart security provider with the removal tool. Subtheory A: They're just trying to steal market share from the big guys. Subtheory B: They're distributing malware themselves, disguised as a removal tool for a fake removal tool. Clever, what?

But I say it was a philosopher. Somewhere in the basement of some liberal arts department, a bitter post-doc is howling with laughter as all the computer geeks that went on to lucrative engineering jobs get what's coming to them.

Well played, sir or madam.