Thursday, January 27, 2011

Google and Yad Vashem

Organizing the world's information and making it universally accessible and useful means many things.  Among the more solemn:

In observance of Yom HaShoah (Holocaust Remembrance Day), Google and the Israeli Holocaust museum, Yad Vashem, have made available a searchable archive of some 130,000 photos from the museum's collection, as a step toward putting the museum's entire archives online.

[The museum's online digital collections have expanded since this was written.  I changed the link above to the photos collection since the old link was broken.  I'm not sure if it's the same collection or not.  There are several other collections as well, which can be found at -- D.H. Nov 2018]

OK, this is a bit unsettling ...

File under unintended consequences.  It all makes sense, and yet, it doesn't seem quite right.

Mike Cardwell blogs:
When you visit my website, I can automatically and silently determine if you're logged into Facebook, Twitter, GMail and Digg.
and sure enough, the page will say "Yes, you are logged in" or "No, you are not logged in" at the appropriate places.  Eerie.  What's going on here?

As Cardwell explains, whenever you send an HTTP request to a server, you get back a response code.  That response code might say things like "Your request was OK, here's the data you asked for," or "Sorry, I don't have what you're looking for," or "Goodness, I seem to be having some sort of problem here." or any of a number of other things.  So far, so good.

Modern browsers can keep track of whether you're logged in to particular sites, so you don't have to keep logging in.  Fair enough.  If you're logged in and you ask for something on a site, you'll get it (assuming you have the proper permissions, etc.).  If not, you'll typically get an error.
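As a rough sketch of what those responses look like from the browser's side (the status codes below are the real standard HTTP ones; the wording and the helper function are just mine, for illustration):

```javascript
// Rough translations of some common HTTP status codes (the codes are
// standard; the phrasing is mine):
function describeStatus(code) {
  if (code >= 200 && code < 300) return "Your request was OK, here's the data";
  if (code === 401 || code === 403) return "You're not allowed to see that (e.g. not logged in)";
  if (code === 404) return "Sorry, I don't have what you're looking for";
  if (code >= 500) return "Goodness, I seem to be having some sort of problem";
  return "Something else entirely";
}
```

The point is just that a logged-in request and a logged-out request to the same URL can come back looking different, and that difference is observable.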

HTML allows you to reference other web sites within your document -- that's pretty much what makes the web webby -- and modern browsers allow you to behave one way or another depending on what happens when you try to fetch something (it doesn't even have to be based on a status code -- pretty much any reliably observable difference in behavior will do).

Put it all together, and
  • any web site
  • can use a reference to another site
  • to tell if you're logged in to that site
In Chrome, at least, if you open an incognito window to visit Cardwell's site, it can no longer tell whether you're logged in, because incognito windows don't share any state with other browser windows.  But that's kind of throwing out the baby with the bathwater.  You can also turn off JavaScript support (or only selectively turn it on), but that has its own problems.

To really solve the problem you have to be able to control what state is shared between, for example, different tabs or windows.  Doing that simply and non-intrusively is easier said than done.

On the other hand, as a couple of commenters point out, such tricks have been around for a while.  Whether anyone's exploiting them in a significant way is another matter.  Before a site can find out whether you're logged in, it has to get you to visit it (not that there aren't plenty of sneaky ways to arrange that), and even then it only learns about the sites it knows how to check for (each site requires its own custom-tailored check).  And then, if all you log into is, say, GMail and Twitter, then all your adversary can find out -- from this particular trick, at least -- is that (yawn) you use GMail and Twitter.

Worth losing sleep over?  Probably not.  Worth keeping in mind?  Definitely.

Cardwell's site looks to have a lot of other fun and useful information on it as well ... and if you stop by for a visit, your browser will most likely tell his server I sent you.

The no-tag tag

I recently ran across a blog with a tag I don't think I'd seen before: "No particular tag".

What's the point?  Well, for one thing it gives you an easy way to bring up all the posts that don't have any other tag, and which otherwise couldn't be reached at all through the tag list.

This distinction between nothing and a label for nothing comes up again and again: The empty set vs. no set at all; a null value vs. an empty string or other collection; Odysseus getting Polyphemus to say that "Noman" was attacking him ...

It's a double-edged sword.  It's certainly useful, probably even necessary, to have a something-that-stands-for-nothing, but it can also cause no end of confusion.  Any number of bugs come down to losing track of the distinction between no value and an empty value.
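In JavaScript terms, for instance (a minimal illustration of the distinction, sticking with the tag example):

```javascript
const noTags = null;   // no value: this post has no tag list at all
const emptyTags = [];  // a value: a tag list with nothing in it

const a = emptyTags.length;       // 0 -- perfectly fine
// noTags.length                  // TypeError -- the classic lost-track-of-null bug
const b = (noTags ?? []).length;  // 0 -- after explicitly supplying a label for nothing
```

Half the point of a "no particular tag" tag is that it turns the first case into the second, so everything downstream can assume there's always *some* tag to look at.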

It's a neat idea, adding a tag for no tag, but I'm not sure how much demand there is for it.  If there were much, I'd expect to see more of it.  But perhaps I should leave the definitive statement to the experts:
Everybody knows that more wars have been won with a shovel than a sword. Give a man a hole and what does he have? Nothing, but give a man a shovel and he can dig a hole to contain the nothing.

The not-so-dumb terminal and the cycle of reincarnation

One of the longest-running spectacles in computing is the migration of computing power back and forth between the CPU and its peripherals, particularly the graphics processor:

Start with a CPU and a dumb piece of hardware.  Pretty soon someone notices that the CPU is always telling the dumb piece of hardware the same basic things, over and over.  It would really be more efficient if the hardware could be a bit smarter and just do those basic things itself when the CPU told it to.  So the piece of hardware gets its own computing power, generally some specialized set of chips, to help out with the routine operations.  Just something simple to interpret simple commands and offload some of the busywork.

Over time, the peripheral gets more and more powerful as more functionality is offloaded, and someone realizes that what started out as a few components has effectively become a general-purpose computer, but implemented in an ad-hoc, expensive and unmaintainable fashion.  Might as well use an off-the-shelf CPU.  That works pretty well.  The peripheral is fast, sophisticated and wonderfully customizable.

Then someone notices there are two basically identical CPUs in the system, and people start to write hacks to use the peripheral CPU as a spare, doing things that have little or nothing to do with the original hardware function.  Why not just bring that extra CPU back onto the motherboard and let the hardware device be dumb?

Lather, rinse, repeat ...

With all that in mind, I was going to talk about another prominent cycle, and then I realized that it wasn't really a cycle.  For that matter, the CPU <--> peripheral cycle is only a cycle in the relative amount of horsepower in one place or the other, but even taking that into account ... well, let's just get into it:

Start with a pile of computing power.  It's not much good by itself, so connect something up to it so you can talk to it.  Nothing fancy.  In some of my first computing experiences it was a paper-fed Teletype (TTY) with a 110 baud modem connection to the local computing center.  Later it was a "glass TTY" -- a CRT and a keyboard and a supercharged 2400 baud serial connection to a VAX a couple of rooms over.

Even the dumbest of these CRT terminals could do a couple of things -- clear the screen, display a character, move to the next line -- but not necessarily much of anything more.  But why not?  It's a CRT we're putting characters on, not paper.  We ought to be able to go back and change the characters we've already put up without having to clear the screen and start over.  A couple of improvements, and now you've got a proper video terminal that will let you move the cursor up and down, maybe insert and delete characters, certainly overwrite what's there.

Now, at 2400 baud (about 20 times slower than the "dialup" that everything's faster than), bandwidth is precious, putting pressure on terminal designers to encode more and more elaborate functionality into "escape sequences" -- magic strings of characters that do things like change colors, apply underlines, turn off echoing of characters typed or, if some of the magic characters get dropped, spew gibberish on a perfectly good screen.  For bonus points, let the application actually program macros -- new escape sequences put together out of the existing ones -- getting even more out of just a few characters on the wire.
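For a taste of what those magic strings look like, here are a few in the ANSI/VT100 style that eventually won out (terminals of the era each had their own dialects, so take these as representative rather than universal):

```javascript
// ESC (0x1b) followed by '[' introduces a control sequence; the characters
// after it say what to do.  A few standard ANSI examples:
const ESC = '\x1b';
const underlineOn = ESC + '[4m';  // start underlining
const attrsOff    = ESC + '[0m';  // back to normal
const clearScreen = ESC + '[2J';  // wipe the whole screen
const moveTo = (row, col) => ESC + '[' + row + ';' + col + 'H';  // cursor position

// Printed to a real terminal, this underlines exactly one word:
const line = 'this is ' + underlineOn + 'underlined' + attrsOff;
```

A handful of bytes on a 2400 baud line buys you a repositioned cursor or a restyled screen -- and if the ESC gets dropped in transit, the rest of the sequence lands on the screen as gibberish, just as described above.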

That's not a glass Teletype any more.  That's a "smart terminal".  Inside the smart terminal is a microprocessor, some RAM for storing things like macro definitions and for tracking what's on the screen, and a ROM full of code telling the microprocessor how to interpret all the special characters and sequences.

In other words, it's a computer.

Well, if it's a computer, it might as well act like one.  Why limit yourself to putting characters on the screen for someone else when roughly the same hardware plus some extra RAM and a disk could do most of the things your wee share of the time-sharing system at the other end of the modem could do?  Thus began the PC revolution that is only just now reaching its endgame.  Sort of [If you think of "PC" as "big boxy thing that sits under your desk", then it's pretty clear PCs are well past their prime.  If you think of "PC" literally as "Personal Computer", we have more of them than ever before -- D.H. Dec 2015].

The problem with cutting the umbilical cord to the central server is that while you may have a pretty useful box, it's no longer connected to anything.  Unless, of course, you buy a modem.  Then you can connect to the local BBS to chat, play games, maybe even transfer some files.

At this point, the box you're talking to may not be particularly more powerful than yours.  Even if you're dialing in to a corporate or university site, there's a good chance that you're still connecting to somebody's workstation, not some central mainframe.  Gone are the days when you connected to "the computer" and it did all the magic.  Now you're connecting your computer to something else it can use.  Relations are much more peer-to-peer, even though there's still a lot of client-server architecture going on.

More importantly, the data has moved outwards.  Instead of one central data store, you've got an ever-growing number of little data stores, which means an ever-growing backlog of routine maintenance -- upgrades, backups, configuration and the like.

If you're using a personal computer at home and you need something that you don't have locally, you have to find the data you need in an ever-increasing collection of places it could be.  If you're a larger institution with a number of workstations, you have the additional problem of making sure everyone sees the same view of important data and configurations.

These basic pressures spur on two major developments: the internet (which is already underway before PCs come along) and the web.

Before too long, things are connected again, except now there are huge numbers of things to connect to, not just one central computer (hmm ... maybe someone could start a business supplying an index to all the stuff out there ...).  With the advent of the web, you have a gazillion web sites all telling a bunch of early-generation browsers the same basic things over and over again.  So ...

... the intelligence starts moving out to the browsers.  Browsers grow scripting languages so that they can be programmed to respond quickly instead of waiting for instructions from the server.  That cuts out at least three bottlenecks: limited bandwidth, latency between the browser and the server, and the ability of the server to respond to a growing number of connections.  AJAX is born.  Browsers start looking like full-fledged platforms with much the same functionality as the operating system underneath.

On the other hand, data starts moving the other way -- "into the cloud".  For example, email shifts from "download messages to your one and only disk" (POP) to "leave the messages on the server so you can see them from everywhere" (IMAP or webmail).  The more bandwidth you have, the easier this sort of thing is to do, and bandwidth is coming along.  Even so, I'm pretty sure Peter Deutsch will get the last laugh one way or another.

Let's step back a bit and try to figure out what's going on here in broad strokes:
  • From one point of view we have a long cycle
    • In the beginning, all the real work is happening at the other end of a communications link
    • In the middle, all the real work is happening locally
    • These days, more and more real work is happening remotely again -- OK, I haven't run down the numbers on that one, but everybody says it is and I'll take their word for it.
  • On the other hand, today is not a repeat of the old mainframe days
    • A browser is not a dumb terminal.  Even a basic netbook running a minimal configuration has orders of magnitude more CPU, memory and disk than the mainframe of old.
    • There is no center any more.  Even displaying a single web page often involves communicating with several servers in several different locations -- often run by separate entities.
  • The pattern looks different depending on what resource you look at
    • You can make a pretty good argument that data has in fact largely cycled from remote (and centralized) to local to remote again (but decentralized)
    • Computation, on the other hand, has increased all around, and the exact share between local and remote varies depending on the particular application.  I'd hesitate to declare an overall trend.
  • The key drivers are most likely economic
    • Maintaining and administering a bunch of applications locally is more expensive than doing so on a server
    • If bandwidth is expensive relative to computing and storage, you want to do things locally; in the reverse case, you want to do things remotely
Where do we stand with the analogy that started all this, namely the notion that the shift from remote computing to local and back is like the shift of (relative) computing power from CPU to peripheral and back?  Superficially, there's at least a resemblance, but on closer examination, not so much.

Tuesday, January 4, 2011

A strange attractor in web search

Looking through the stats, I see that one of the search terms that landed someone here at Field Notes recently was "How many threes are in a dozen?"

That sentence does appear in this blog, unlikely though that might seem, in a post in which I summarized, among other things, odd search terms that had brought people to Field Notes.  At the time it was sheer coincidence that that search happened to work, but of course since I mentioned it, it's no longer coincidence.  Moreover, since I'm mentioning it again here, I am practically putting myself forth as an expert on the subject.  Search engines are still figuring out the use-mention distinction.

The exact phrase "How many threes are in a dozen?" turns up only two hits (soon to be three).  Since the other one is a discourse on riddles, I should mention that I don't know the intended answer.  The only ones I can come up with are:

  • Um, four, right?
  • None -- there are no threes in "a dozen"
  • 220 (that's the math degree talking)
[And, of course, right after I hit the publish button I realized the right answer is almost certainly 12]
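For what it's worth, that 220 is C(12,3) -- the number of ways to choose three things out of a dozen.  A quick sanity check:

```javascript
// C(n, k) computed multiplicatively; the running result is an integer
// at every step, since after i steps it equals C(n-k+i, i).
function choose(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = result * (n - k + i) / i;
  }
  return result;
}
// choose(12, 3) -- how many (sets of) threes are in a dozen
```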