Saturday, August 22, 2015

Margaret Hamilton: 1 New Horizons: 0

A bit more on Pluto, from a compugeek perspective if not a full-on web perspective ...

The New Horizons flyby was not completely without incident.  Shortly before the flyby itself, the craft went into "safe mode", contact was lost for a little over an hour and a small amount of scientific data was lost.  The underlying problem was "a hard-to-detect timing flaw in the spacecraft command sequence".  This quite likely means what's known in the biz as a "race condition", where two operations are going on at the same time, the software behaves incorrectly if the wrong one finishes first and the developers didn't realize it mattered.
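For the curious, here's a toy illustration of a race condition.  It has nothing to do with the actual New Horizons flight software; the names and the "schedule" strings are made up for the example.  Two operations each read a shared counter and then write back the value plus one, and the only thing that changes between runs is the order in which their steps happen.

```python
def increment(shared):
    """One 'operation': read the counter, pause, then write back counter + 1."""
    seen = shared["count"]       # read
    yield                        # the other operation may run here
    shared["count"] = seen + 1   # write, based on a possibly stale read


def run(schedule):
    """Run two increment operations, stepping them in the order given."""
    shared = {"count": 0}
    ops = {"A": increment(shared), "B": increment(shared)}
    for name in schedule:
        next(ops[name], None)    # advance that operation by one step
    return shared["count"]


one_after_other = run("AABB")    # A finishes before B starts
interleaved = run("ABAB")        # both read before either writes
print(one_after_other, interleaved)
```

When A finishes before B starts, the counter ends up at 2.  When both read before either writes, one update is silently lost and the counter ends up at 1.  Nothing in either operation looks wrong on its own, which is a large part of why these bugs are hard to spot.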

Later investigation concluded that the problem happened when "The computer was tasked with receiving a large command load at the same time it was engaged in compressing previous science data."  This means that the CPU would have been both heavily loaded and multitasking, making it more likely that various "multithreading issues" such as race conditions would be exposed.

Now, before I go on, let me emphasize that bugs like this are notoriously easy to introduce by accident and notoriously hard to find if they do creep in, even though there are a number of well-known tools and techniques for finding them and keeping them out in the first place.

The incident does not in any way indicate that the developers involved can't code.  Far from it.  New Horizons made it through a ten-year, five billion kilometer journey, arriving within 72 seconds of the expected time, and was able to beam back spectacularly detailed images.  That speaks for itself.  It's particularly significant that the onboard computers were able to recover from the error condition instead of presenting the ground crew with an interplanetary Blue Screen of Death.  More on that in a bit.

Still ...

It's July 20, 1969.  The Apollo 11 lunar lander is three minutes from landing on the Moon when several alarms go off.  According to a later recounting by the leader of the team involved:
Due to an error in the checklist manual, the rendezvous radar switch was placed in the wrong position. This caused it to send erroneous signals to the computer. The result was that the computer was being asked to perform all of its normal functions for landing while receiving an extra load of spurious data which used up 15% of its time.
This is a serious issue.  If the computer can't function, the landing has to be aborted.  However,
The computer (or rather the software in it) was smart enough to recognize that it was being asked to perform more tasks than it should be performing. It then sent out an alarm, which meant to the astronaut, I'm overloaded with more tasks than I should be doing at this time and I'm going to keep only the more important tasks; i.e., the ones needed for landing ... Actually, the computer was programmed to do more than recognize error conditions. A complete set of recovery programs was incorporated into the software. The software's action, in this case, was to eliminate lower priority tasks and re-establish the more important ones.
This is awesome.  Since "awesome" is generally taken to mean "kinda cool" these days, I'll reiterate: The proper response to engineering on this level is awe.  Let me try to explain why.

Depending on where you start counting, modern computing was a decade or two old at the time.  The onboard computer had "approximately 64Kbyte of memory and operated at 0.043MHz".  Today, you can buy a system literally a million times faster and with a million times more memory for a few hundred dollars.

While 64K is tiny by today's standards, it still leaves plenty of room for sophisticated code, which is exactly what was in there.  It does, however, mean that every byte and every machine cycle counts, and for that reason among others the code itself was written in assembler (hand-translated from a language called MAC and put on punch cards for loading).  Assembler is as low-level as it gets, short of putting in raw numbers, flipping switches or fiddling with the wiring by hand.

Here's a printout of that code if you're curious.  The dark bands are from printing out the listing on green-and-white-striped fanfold paper with a line printer such as used to be common at computer centers around the world.  The stripes were there to help the eye follow the 132-character lines.  Good times.  But I digress.

Just in case writing in assembler with an eye towards extremely tight code isn't enough, the software is asynchronous.   What does that mean?  There are two basic ways to structure a program such as this one that has to deal with input from a variety of sources simultaneously: the synchronous approach and the asynchronous approach.

Synchronous code essentially does one thing at a time.  If it's reading temperature and acceleration (or whatever), it will first read one input, say temperature from the temperature sensor, then read acceleration from the accelerometer (or whatever).  If it's asking some part of the engine to rotate 5 degrees, it sends the command to the engine part, then waits for confirmation that the part really did turn.  For example, it might read the position sensor for that part over and over until it reads five degrees different, or raise an alarm if it doesn't get the right reading after a certain number of tries.
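A minimal sketch of the rotate-and-wait example might look something like this.  FakePart is a hypothetical stand-in for the engine part and its position sensor, invented purely so the example can run, and the retry limit of 10 is arbitrary.

```python
class FakePart:
    """Hypothetical engine part that arrives after a fixed number of readings."""

    def __init__(self, readings_until_done):
        self.position = 0
        self.target = 0
        self.readings_left = readings_until_done

    def command(self, degrees):
        self.target = self.position + degrees

    def read_position(self):
        self.readings_left -= 1
        if self.readings_left <= 0:
            self.position = self.target
        return self.position


def rotate_sync(part, degrees, max_tries=10):
    """Send the command, then poll the sensor until it agrees or we give up."""
    expected = part.position + degrees
    part.command(degrees)
    for attempt in range(1, max_tries + 1):
        if part.read_position() == expected:
            return attempt       # number of readings it took
    raise TimeoutError(f"part did not reach {expected} degrees "
                       f"after {max_tries} tries")


tries_needed = rotate_sync(FakePart(readings_until_done=3), degrees=5)
print(tries_needed)
```

Note that while rotate_sync is looping, nothing else in the program gets to run.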

Code like this is easy to reason about and easy to read.  You can tell immediately that, say, it's an error if you try to move something and its position doesn't reach the desired value after a given number of tries.  However, it's no way to run a spaceship.  For example, suppose you need to be monitoring temperature continuously and raise a critical alarm if it gets outside its acceptable range.  You can't do that if you're busy reading the position sensor.

This is why high-performance, robust systems tend to be asynchronous.  In an asynchronous system, commands can be sent and data can arrive at any time.  There will generally be a number of event handlers, each for a given type of event.  The temperature event handler might record the temperature somewhere and then check to make sure it's in range.

If it's not, it will want to raise an alarm.  Suppose the alarm is a beep every five seconds.  In the asynchronous world, that means creating a timer to trigger events every five seconds, and creating an event handler that sends a beep command to the beeper when the timer fires (or, you can set a "one-shot" timer and have the handler create a new one-shot timer after it sends the beep command).

While all this is going on, other sensors will be triggering events.  In between "the temperature sensor just reported X" and "the timer for your beeper just went off", the system might get events like "the accelerometer just reported Y" and "the position sensor for such-and-such-part just read Z".

To move an engine part in this setup, you need to send it a command to move, and also create a handler for the position sensor's event.  That handler has to include a counter to remember how many position readings have come in since the command to move, along with the position the part is supposed to get to (or better, a time limit and the expected position).
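Here's roughly what the same rotate-a-part logic looks like in the asynchronous style, alongside a temperature monitor.  This is only a sketch: EventLoop, the event names and the temperature limits are all invented for the example, and a real system would get its events from hardware rather than a hard-coded list.

```python
from collections import defaultdict


class EventLoop:
    """Tiny dispatcher: handlers register for a kind of event."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []

    def on(self, kind, handler):
        self.handlers[kind].append(handler)

    def off(self, kind, handler):
        self.handlers[kind].remove(handler)

    def dispatch(self, kind, value):
        for handler in list(self.handlers[kind]):   # copy: handlers may remove themselves
            handler(value)


def watch_temperature(loop, low, high):
    def on_temperature(reading):
        if not low <= reading <= high:
            loop.log.append(f"ALARM: temperature {reading} out of range")
    loop.on("temperature", on_temperature)


def rotate_async(loop, expected, max_readings):
    state = {"readings": 0}      # the counter the handler has to remember

    def on_position(reading):
        state["readings"] += 1
        if reading == expected:
            loop.log.append(f"part reached {expected} degrees")
            loop.off("position", on_position)
        elif state["readings"] >= max_readings:
            loop.log.append(f"ALARM: part stuck at {reading} degrees")
            loop.off("position", on_position)

    loop.on("position", on_position)
    # A real system would also send the move command to the part here.


loop = EventLoop()
watch_temperature(loop, low=-20, high=40)
rotate_async(loop, expected=5, max_readings=3)

# Events arrive interleaved, in whatever order the sensors produce them.
events = [("position", 0), ("temperature", 21), ("position", 2),
          ("temperature", 55), ("position", 5), ("position", 5)]
for kind, value in events:
    loop.dispatch(kind, value)

print(loop.log)
```

The temperature alarm fires in the middle of the move without either piece of logic knowing about the other.  The price is that the "move the part" logic is now a handler plus some remembered state, rather than a simple loop.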

A system like this is very flexible and doesn't spend time "blocked" waiting for things to happen, but it's also harder to read and reason about, since things can happen in any order and the logic is spread across a number of handlers, which can come and go depending on what the system is doing.

And then, on top of all this, the system has code to detect and recover from error conditions, not just in the ship it's controlling but in its own operation.  Do-it-yourself brain surgery, in other words.
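To give a flavor of the recovery behavior described in the quote above (drop the lower priority tasks, keep the ones needed for landing), here's a deliberately crude caricature.  It is in no way the Apollo code.  The task names, priorities and costs are invented; only the 15% figure comes from the quote.

```python
def shed_load(tasks, budget):
    """Keep the most important tasks that fit in the available time.

    tasks: list of (name, priority, cost) with lower priority = more important
    and cost as a percentage of the computer's time.
    Returns (kept, dropped) lists of task names.
    """
    kept, dropped, used = [], [], 0
    for name, priority, cost in sorted(tasks, key=lambda t: t[1]):
        if used + cost <= budget:
            kept.append(name)
            used += cost
        else:
            dropped.append(name)
    return kept, dropped


# Hypothetical workload that fits comfortably when nothing is going wrong.
tasks = [
    ("display update", 2, 15),
    ("landing guidance", 0, 50),
    ("engine control", 1, 30),
]

SPURIOUS_LOAD = 15               # percent of time eaten by the bad radar data
kept, dropped = shed_load(tasks, budget=100 - SPURIOUS_LOAD)
if dropped:
    print("ALARM: overloaded, dropping", dropped)
print("still running:", kept)
```

With the full 100% available, all three tasks fit.  With 15% stolen, the least important one is dropped and the alarm goes off, while the landing-critical tasks carry on.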


I report my occupation as "software engineer" for tax purposes and such, but that's on a good day.  Most of us spend most of our time coding, that is, writing detailed instructions for machines to carry out.  True software engineering means designing a robust and efficient system to solve a practical problem.  The term was coined by Margaret Hamilton, the architect of the Apollo 11 control systems quoted above and a pioneer in the design of asynchronous systems.  As the story of the lunar landing demonstrates, she and her team set a high bar for later work.

New Horizons ran into essentially the same sort of problem that Apollo 11 did, but handled it less robustly (going to "safe mode" and then recovering, as opposed to automatically re-prioritizing), all building on techniques that Hamilton and her team helped develop, and using vastly more powerful equipment and development tools based on decades of collective experience.  So, with all due respect to the New Horizons team, I'd have to say Apollo 11 wins that one.

Friday, August 21, 2015

Latency, bandwidth and Pluto

As you may well know, the New Horizons spacecraft flew by Pluto last month, dramatically increasing our knowledge of Pluto and its moons (let's not even get into whether Pluto and Charon jointly constitute a "binary dwarf planet" or whatever).  There are even a few pictures on the web.

But wait ... that's not very many pictures for a ten-year mission.  Even worse, if you were watching at the time you'll know that New Horizons went completely dark for most of a day, right when it was flying by Pluto.  Isn't this the modern web, where everything is available everywhere instantly?  What gives, NASA?

Part of the problem is the way New Horizons is designed.  It's expensive to accelerate mass to the speed New Horizons is going, and since you can't exactly send a repair crew out to Pluto, it's good to have as few moving parts as possible.  As a result, the ship has a small battery and both the antenna and the cameras are mounted firmly in place.  If you want to turn the antenna toward Earth, you have to move the whole ship, using some of the small store of remaining fuel dedicated to course corrections and attitude control.  If you want to point the cameras toward Pluto, you have to turn the ship that way.  You can't do both.

That explains why the ship went dark for the duration of the flyby, but actually it effectively went dark for considerably longer than that.  It takes about 4.5 hours for a signal to travel the distance between Earth and Pluto.  That means the sequence of events is, more or less:
  • t + 0: Flight control sends commands to New Horizons to point the cameras at Pluto, take pictures, orient the antenna toward Earth, and report back.
  • t + 4.5 hours: New Horizons gets the command and starts re-orienting and taking pictures
  • t + 9 hours: Last time at which any signal from before the pointing operation will reach Earth
  • t + 25.5 hours (more or less): New Horizons, now with the antenna pointed back toward Earth, sends "Phone home" message reporting status.
  • t + 30 hours (more or less): "Phone home" message arrives
The important thing to note here is that, while the ship is actually out of contact for 21 hours, it's 25.5 hours from the time the command is sent to the time the ship is reachable again, and 30 hours before the ground crew knows it's reachable again.   If the phone home signal hadn't arrived, it would be 9 more hours, at a minimum, before they knew if any corrective action they'd taken had worked.  By internet standards this is ridiculously high latency, but anyone who's played a laggy video game or been on a conference call with people on the opposite side of the world has experienced the same problem on a smaller scale.
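The arithmetic behind that timeline is simple enough to spell out.  The two inputs are the figures given above (4.5 hours one way, 21 hours out of contact); the variable names are just mine.

```python
ONE_WAY_HOURS = 4.5    # signal travel time between Earth and Pluto
DARK_HOURS = 21.0      # time the ship spends pointed away from Earth

command_sent = 0.0
command_received = command_sent + ONE_WAY_HOURS
last_old_signal_on_earth = command_received + ONE_WAY_HOURS
phone_home_sent = command_received + DARK_HOURS
phone_home_received = phone_home_sent + ONE_WAY_HOURS

# If the ground crew has to try something, they wait a full round trip
# before learning whether it worked.
retry_round_trip = 2 * ONE_WAY_HOURS

for label, hours in [
    ("command received", command_received),
    ("last pre-flyby signal reaches Earth", last_old_signal_on_earth),
    ("phone home sent", phone_home_sent),
    ("phone home received", phone_home_received),
    ("extra wait per corrective action", retry_round_trip),
]:
    print(f"{label}: {hours} hours")
```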

So part of the reason pictures have been slow in coming is latency.  It's going to take a minimum of 4.5 hours to beam anything back, and longer if the ground crew has to send instructions.

The other problem is bandwidth.  Pluto is about 5 billion kilometers (about 3 billion miles) away.  Signal strength drops off as the square of the distance, so, for example, a signal from Pluto is about 160 times weaker than a signal from Mars and, since the power source has to be small (around 15 watts), the signal is not going to be extremely powerful to start with.  Lower power means a lower signal to noise ratio and less bandwidth (or at least that's my dim software-engineer understanding of it -- in real life "bandwidth" doesn't exactly mean "how many bits you can transmit", and I'm sure there's lots more I'm glossing over).
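The "about 160 times" figure is just the inverse square law at work.  As a rough check, assuming Mars is near its farthest from Earth at around 400 million kilometers (my assumption; the distance varies a lot over time):

```python
PLUTO_KM = 5.0e9   # from above
MARS_KM = 4.0e8    # assumption: Mars near its maximum distance from Earth

# Signal strength falls off as the square of the distance, so the ratio
# of strengths is the square of the ratio of distances.
ratio = (PLUTO_KM / MARS_KM) ** 2
print(round(ratio))
```

That comes out a little over 150, in the same ballpark as the figure above.  With Mars closer to Earth the ratio would be considerably larger.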

Put all that together and on a good day we have about 2Kbps coming back from Pluto.  That's about what you could get out of a modem in the mid 1980s.   Internet technology has progressed just a bit since then, but internet technology doesn't have to cope with vast distances and stringent mass limitations.  At 2Kbps, one raw image from LORRI (the hi-res black-and-white camera) takes close to two hours to transmit.  This is why, if all goes well, we'll be getting Pluto pictures (and other data) well into 2016.
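The "close to two hours" figure can be sanity-checked with a quick calculation.  The image size here is my assumption, not something stated above: a 1024 x 1024 pixel frame at 12 bits per pixel, with no compression and no protocol overhead.

```python
PIXELS = 1024 * 1024     # assumed LORRI frame size
BITS_PER_PIXEL = 12      # assumed bit depth
LINK_BPS = 2000          # the 2Kbps figure from above

image_bits = PIXELS * BITS_PER_PIXEL
seconds = image_bits / LINK_BPS
hours = seconds / 3600
print(f"{image_bits} bits, about {hours:.2f} hours")
```

Under those assumptions a single raw image takes about an hour and three quarters.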

I'd still say that New Horizons is "on the web" in some meaningful sense, but the high latency and low bandwidth make it a great example of Deutsch's fallacies in action.

[Update: Not only is New Horizons sending back data slowly, it's not sending back particularly pretty data at the moment.  From their main page:
Why hasn’t this website included any new images from New Horizons since July? As planned, New Horizons itself is on a bit of a post-flyby break, currently sending back lower data-rate information collected by the energetic particle, solar wind and space dust instruments. It will resume sending flyby images and other data in early September.
-- D.H. 24 Aug 2015]

Friday, June 12, 2015

A decoy within a decoy ...

It's not news that more-or-less legitimate news sites like to surround their articles with enticing links to somewhat-less-legitimate news sites: 20 reasons your dog may be considering a new career ... You'll never believe this simple secret for removing egg yolk from pine trees ... Read this before you buy a moose ... you know, those sorts of things.

Click through on one of those and there's a good chance there will be some sort of slide show surrounded by a minefield of garish ads.  Sure, why not?  Putting together pictures of whatever sort of fluff for people to page through is somewhere around neutral on my personal scale of good or bad ways to make a living.

Except, when you lay out the page with a large ad at the top with an arrow icon next to it, and white space between, looking for all the world like it's going to take you to the article you're really interested in (well, at least maybe kinda interested in).  Except that arrow icon takes you to the advertiser's page.  That tilts ever so slightly toward "bad" in my book.  For the reader, it's a straight-up bait and switch.  For the advertiser, it's a steady stream of annoyed readers and, if it's a pay-per-click model, they get to pay for the privilege.  True, there's not supposed to be any such thing as bad publicity, but still.

Sunday, May 10, 2015

OK, this is cool, and it's on the web:

Hyrodium is the pseudonym (presumably) of a Japanese student who has put together a really neat blog of visualizations of mathematical proofs and objects.  Exactly the kind of thing the Web was originally supposed to be good for, and sometimes still is.

Sunday, March 22, 2015

Now available on its own domain

At least for the next year, more or less, you'll be able to reach this blog not only at fieldnotesontheweb.blogger.com, but at plain old fieldnotesontheweb.com.  I doubt this will really affect anyone's life greatly, but hey, why not?

Monday, March 16, 2015

Protocol basics: Packets and connections

Not so long ago, the easiest way, and in some cases the only practical way, to get in touch with someone a long distance away was to send a message by mail.  Physical mail, that is.  I found it handy when traveling in Europe, um a while ago, for example.  Rather than hunt down a payphone (really, spell check, you don't recognize "payphone"?), double-check that the other person was likely to be awake, buy a card or assemble the right pile of coins, dial, wait to see if the other person was actually home, and then worry about cramming a conversation into an alarmingly small number of minutes, it was often easier and more fun to buy a few postcards, scrawl "Having a lovely time, wish you were" on them, address them, stamp them and throw them in the nearest post box.

It's not that one couldn't make an international phone call, or that it was never worth it, just that it was often cheaper and easier not to.

Mail and the telephone are canonical examples of two basic ways of sending information.  These two ways have gone by various names, but the two that stick in my head are "packet-switched" (mail) vs. "circuit-switched" (phone), and "connection-oriented" (phone) vs. "connectionless" (mail).  Each approach has its own pros and cons and neither is perfect for every situation, which is why both are still going strong.

Connection-oriented protocols require time to set up the connection.  In the case of the telephone, this means identifying the other party (dialing the number) and establishing a connection (whatever the phone company does, plus waiting for the other party to pick up).  Both parties must participate in the connection at the same time.  Once this is done, however, communication is efficient.  I say something to you and you can reply immediately.  Depending on the exact protocol, we may even be able to both talk at once ("full-duplex") or we may have to take turns ("half-duplex"), but this distinction is more important in computer protocols than human conversations (human conversations are effectively half-duplex, as anyone who's ever said "no, you go ahead" at the same time as the other person will realize).

Connectionless protocols are the mirror image of this.  Setup is minimal.  You only have to identify the other party and send the message.  Depending on the protocol, your party may or may not have to be present to receive the message when it arrives.  On the other hand, you generally don't know whether your message arrived at all, and if you send two messages to the same person, they may or may not arrive in the order you sent them, and in some cases the same message might get delivered more than once (though this is not likely to happen through the mails).

Of the common ways we exchange information as people, some are connection-oriented, for example, phone calls and video conferences, and some are connectionless, such as texting and email.

The underlying protocol of most of the internet, IP (for ... wait for it ... Internet Protocol) is connectionless.  You build a packet of data, slap an address on it, send it and hope for the best.  Generally there is no specific connection involved, though it used to be more common (in the days of dedicated T1 lines, PLIP and such) for IP to run on top of something that happened to use a point-to-point connection at a lower level.

However, most actual applications of the internet are connection-oriented.  When you browse a web page, for example, your browser uses a connection to the server serving the page.  That way it doesn't care whether the page fits in a single packet (which it almost certainly doesn't) and it knows for sure when it's retrieved the entire page.  Likewise, when you send mail, your mail client makes a connection to a mail server, when you join a videoconference your device makes a connection to the videoconference server, and so forth.

To bridge this gap, we use what is hands-down one of the most successful inventions in the field of computing: the Transmission Control Protocol, or TCP.  TCP assumes that you have some way -- it doesn't much care what -- to deliver packets of information -- it doesn't care what's in them and doesn't much care how big they are -- to a given address.  IP is one way of doing this, so you'll see the term TCP/IP for TCP using IP to transport packets, but it's not the only way.

From these basic assumptions, TCP allows you to pretend you have a direct connection to another party.  If you send a series of messages to that other party, you know that either they will all arrive, in order, or the whole connection will fail and you will know it failed.

This is a big improvement over simply sending messages and hoping they arrive.  The cost is that it takes time to set up a TCP connection, TCP adds extra traffic, particularly ACK (acknowledgement) messages to be sure that messages have arrived, it may have to re-send messages, and it requires each side of the connection to remember a certain amount of previous traffic.  For example, if I'm sending you a bunch of data, TCP will break that down into packets, and then keep track of which packets have and haven't been acknowledged.  If too many haven't been acknowledged, it will stop sending new data, slowing things down.  Each side also has to keep around any packets that it's sent but hasn't had acknowledged yet, since it may have to resend them.
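To make the bookkeeping a bit more concrete, here's a toy receiver that turns packets arriving out of order, and sometimes twice, into an in-order stream.  This is a big simplification and not real TCP: it numbers whole packets rather than bytes, and it leaves out timers, windows and connection setup entirely.  The Receiver class and its fields are invented for the example.

```python
class Receiver:
    """Toy receiver: delivers each payload exactly once, in sequence order."""

    def __init__(self):
        self.next_seq = 0     # the next sequence number we're waiting for
        self.pending = {}     # packets that arrived ahead of their turn
        self.delivered = []

    def receive(self, seq, payload):
        if seq >= self.next_seq:
            self.pending[seq] = payload      # a duplicate just overwrites itself
        # Anything older than next_seq has already been delivered: ignore it.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return self.next_seq  # ACK: "I have everything before this number"


receiver = Receiver()
arrivals = [(1, "b"), (0, "a"), (0, "a"), (3, "d"), (2, "c"), (1, "b")]
acks = [receiver.receive(seq, payload) for seq, payload in arrivals]

print(receiver.delivered)
print(acks)
```

The sender's side of the deal is to hang on to every packet numbered at or above the latest ACK, since any of those might need to be sent again.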

Since setting up connections is expensive, we try to avoid it.  The worst case is establishing a fresh connection for each message to be sent.  In the early days of the web, HTTP/1.0 required a fresh connection for each object retrieved.  If your web page included an image, the browser would have to set up one connection to get the text of the web page, tear it down, and then set up a fresh connection to get the image, even if it was stored on the same server.  HTTP/1.1 fixed this with its "keep-alive" mechanism, but the same problem comes up in other contexts, with the same solution: re-use connections where possible.

A great many "application layer" protocols are built on TCP.  They all start by making a connection to a remote host and then sending messages back and forth on that connection.  These include
  • HTTP and HTTPS, the fundamental protocols of the web
  • SMTP, POP and IMAP, used for email
  • SSL/TLS, for making secure connections to other hosts (HTTPS is essentially HTTP using SSL/TLS)
  • SSH, also used for making secure connections to other hosts
  • FTP, SFTP and FTPS, used for transferring files (technically, SFTP is built on SSH, which is in turn built on TCP)
  • XMPP, one way of sending short messages like texts
  • Any number of special purpose protocols developed over the years
Several of these (particularly SSH and HTTP) allow for "tunneling" of TCP, meaning you can establish a TCP connection on top of SSH or HTTP, which will in turn be using a different TCP connection.  There are actually legitimate reasons to do this.

To bring things full circle, email protocols and file transfer protocols are essentially packet protocols, just with nicer addresses and bigger packets.  You could, in principle, use them as an underlying protocol for TCP.  This would be hugely inefficient for a number of reasons.  Among others, you'd generally be paying the cost of setting up a connection for each "packet", which is pretty much the worst of both worlds ... but it's certainly possible.  In a more abstract way, an email thread is essentially an ad-hoc connection on top of email (packets), which in turn is delivered via TCP (connection), which in turn (generally) uses IP (packets).

Thursday, January 1, 2015

Bitcoin: Volume matters

It's easy to get fixated on prices, particularly if the price you're interested in keeps going up and up and up (or down and down and down).  In early December of 2013, for example, when Bitcoin was trading over $1000 with no limit in sight, the only logical conclusion was that it would soon take over the world.

Since then, the dollar price of BTC has fallen into the 300s.  Considering that the famous 10,000 BTC pizza would be $3 million at today's prices (if just a bit stale), that's still pretty good.  On the other hand, it's around a quarter of the peak price.

The exact exchange rate between BTC and reserve currencies shouldn't really matter greatly, though, so long as it's reasonably stable and there's enough money supply to handle whatever it is BTC is being used for.  But what about volume?  How much BTC is changing hands on a given day?  How much Bitcoin, measured in dollars or some other reserve currency, is there in all?

As an aside, there's a whole body of established techniques for using volume in conjunction with price to make buying and selling decisions.  I won't even pretend to know much about that.  I'm more interested in how Bitcoin stacks up against other currencies (if you consider it a currency) and money transfer networks (if you consider it that).

According to bitcoincharts.com, around half a million Bitcoins are "sent" daily, or about $150 million worth at current prices.  I believe this is taken from the blockchain that records all Bitcoin transactions, and so would include activity on the various exchanges, miners collecting and redeeming their bounties, direct transfers of BTC to and from long-term holdings (the "pure" Bitcoin economy) and whatever else.

If I buy a Bitcoin for $320 or whatever, that would show up as one Bitcoin sent to me from the exchange, and if I use that Bitcoin to pay you for some good or service, that would in turn show up as one Bitcoin sent to you, and if you turn around and sell that Bitcoin for $320, that's another Bitcoin sent, to the exchange from you.  In other words, a typical transaction using Bitcoin as an intermediary between reserve currencies will be counted three times.  If we want to consider Bitcoin as a money transfer network, we should probably allow for some amount of non money-transfer activity, then divide by three, but in what follows let's count all $150 million, if only because it's simpler.

The Bitcoin money supply is deliberately easy to calculate.  While there are several measures of "how many dollars are there?", the total number of Bitcoins at any given time is a known parameter of the protocol.  Currently, the BTC money supply is about 14 million, or $4 billion.

By contrast:

  • About 100 nations have more than $4 billion in currency
  • The US M1 money supply is around $2 trillion, or about 500 times the Bitcoin money supply
  • The US M2 money supply is around $12 trillion, or about 3000 times the Bitcoin money supply
  • The US GDP is close to $50 billion per day, or about 300 times total Bitcoin activity (1000 times or more if you buy the triple-counting argument above)
  • VISA processes about $9 billion in transactions every day, or about 60 times total Bitcoin activity (or about 200+ with triple-counting)
and so forth.
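If you want to check my arithmetic, or redo it with a different price, the figures above boil down to a few lines.  I'm using $300 per BTC as a round number for "into the 300s", so the ratios land slightly off the rounded ones in the list.

```python
BTC_PRICE_USD = 300
BTC_SENT_PER_DAY = 500_000
BTC_SUPPLY = 14_000_000

pizza_usd = 10_000 * BTC_PRICE_USD
daily_volume_usd = BTC_SENT_PER_DAY * BTC_PRICE_USD
money_supply_usd = BTC_SUPPLY * BTC_PRICE_USD

# If a typical transfer shows up three times on the blockchain,
# the "real" money-transfer volume is about a third of the total.
transfer_volume_usd = daily_volume_usd / 3

comparisons = {
    "US M1 vs. BTC supply": 2e12 / money_supply_usd,
    "US M2 vs. BTC supply": 12e12 / money_supply_usd,
    "US GDP per day vs. BTC volume": 50e9 / daily_volume_usd,
    "US GDP per day vs. BTC transfers": 50e9 / transfer_volume_usd,
    "Visa per day vs. BTC volume": 9e9 / daily_volume_usd,
    "Visa per day vs. BTC transfers": 9e9 / transfer_volume_usd,
}

print(f"pizza: ${pizza_usd:,}")
print(f"daily volume: ${daily_volume_usd:,}")
print(f"money supply: ${money_supply_usd:,}")
for label, ratio in comparisons.items():
    print(f"{label}: about {ratio:,.0f}x")
```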

I'm actually struck that the Bitcoin economy, or whatever you want to call it, is as big as it is.  There may be 100 countries with more money, but there are also 50 or so with less.  You could literally run a small country on Bitcoin.  I'm not saying it would be good policy, but in some sense the numbers are there.  A small fraction of US GDP is still a fair bit, as is a small fraction of Visa.  For a while Bitcoin was arguably bigger than Western Union.  Your call whether that's a matter of Bitcoin being big, WU being small, or some of both.

The main point remains: If you want to get an idea of how important Bitcoin might be, don't look at the price alone.  Look at volume as well.



Tuesday, December 30, 2014

That CAPTCHA moved!

While recovering a password for a site -- that is, my real password was whatever information the recovery page needed -- I noticed a new wrinkle on CAPTCHA: Moving CAPTCHA.  Instead of the usual smeared-out or obscured letters, three plainly readable letters, somewhat tilted, on a clearly contrasting background, but wiggling slightly back and forth.

Seems like an interesting step in the whole OCR arms race, except ...

The problem for an attacker to solve here isn't recognizing a moving character, which might or might not be harder than recognizing a still one.  It's grabbing a frame of the animation to examine.  If you can do that at all, then recognizing one particular arrangement of the letters is no harder than recognizing any other CAPTCHA.  Easier, in fact, since you have nice, legible letters, and you can re-run the OCR on each frame and go with the consensus.
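The "go with the consensus" step is about as simple as it sounds.  In this sketch the per-frame guesses are made up; in practice they'd come from running whatever OCR you like on each captured frame.

```python
from collections import Counter


def consensus(readings):
    """Majority vote at each character position across per-frame OCR guesses."""
    return "".join(
        Counter(chars).most_common(1)[0][0]
        for chars in zip(*readings)
    )


# Hypothetical OCR output for five frames of the same three-letter CAPTCHA,
# each with the occasional misread character.
frame_guesses = ["WGM", "WCM", "WGM", "VGM", "WGN"]
answer = consensus(frame_guesses)
print(answer)
```

Each frame is wrong somewhere or other, but no single mistake shows up in most frames, so the vote recovers all three letters.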

Again, I haven't looked at this in detail, but there would seem to be two main ways of putting the moving image up in the first place: A .gif or other animated image format, which is no problem to decode into its images, or some sort of JavaScript animation.  That might be harder to grab, but not because of the animation.  You can just as well use JavaScript to put up a still image, and in either case the answer is to render the JavaScript and then grab the pixels.

In other words, it seems unlikely that the moving image adds any real difficulty for an attacker.  It does look harder, intuitively, to the human eye, but the attacker isn't using a human eye -- that's the whole point of the exercise to begin with.

Tuesday, September 30, 2014

Heartbleed, Shellshock and Raymond's Linus's Law

You have probably heard by now that bash, one of the basic tools in the Linux/GNU toolkit, has had a glaring vulnerability for the last, oh, twenty-plus years, now dubbed Shellshock.  You've probably also heard of the Heartbleed vulnerability in OpenSSL.  Apart from making international press and raising serious questions about computer security, these two bugs have a number of features in common:
  • They're implementation bugs.  Bash, as defined in its documentation, does not allow the sort of behavior that Shellshock allows, and likewise for SSL (the protocol) and OpenSSL (an implementation of SSL).  In both cases, the implementations were doing things they shouldn't have.
  • They're basic implementation bugs.  In Shellshock, text which should be ignored or discarded is instead interpreted as a command.  In Heartbleed, a reply message which is supposed to have a given length instead has another.
  • No one noticed them for a long time.  In the case of Shellshock, a very long time.  Or at least, no one seems to have visibly exploited them.
It's that last item I want to focus on.  In his famous essay The Cathedral and the Bazaar, extolling the virtues of open source development, Eric Raymond claimed that "given enough eyeballs, all bugs are shallow," or in other words, if you had enough people looking at the source code to a system, any serious issues would be flushed out and fixed quickly.  He called this principle Linus's Law, in honor of Linux creator Linus Torvalds (Linus didn't come up with it.  Linus did put forth his own Linus's Law, but it doesn't seem to have garnered much attention).

In any case, despite bash and OpenSSL being two of the most widely used tools in the software world, these basic and serious bugs don't seem to have been flushed out quickly at all.  Now, it is possible that multiple people noticed the problems, shrugged and went on with their lives, or that some entity or another discovered the bugs and exploited them very quietly, but that's not how Raymond's Linus's Law is supposed to work.

I think there are two reasons for this.

First, as many have pointed out, there's no convincing evidence that more eyeballs really do mean more bugs found.  Rather, it seems that you quickly hit diminishing returns.  Four people may or may not find about twice as many bugs as two people, but forty people probably won't find twice as many bugs as twenty.  Forty people may not even find twice as many bugs as two.

Exactly why this might be is a good research topic, but I'd guess that a lot of it is because some bugs are easy to find, some aren't, and once you've found the easy bugs throwing more eyeballs at the problem (now there's an image) won't necessarily help find the hard bugs.

One of the sobering implications of Shellshock and Heartbleed is that even simple bugs can be hard to find, but that's not news to anyone who's done much coding.
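For a sense of just how simple, here's a caricature of the Heartbleed pattern: a reply whose length comes from what the request claims rather than from what was actually sent.  This is not the OpenSSL code, which is in C and considerably more involved.  MEMORY and the function names are invented, and the "secret" is obviously fake.

```python
# Pretend this is the server's memory: the 5-byte payload the client sent,
# followed by unrelated data that happens to sit next to it.
MEMORY = bytearray(b"hello" + b"SECRET-KEY-MATERIAL")
PAYLOAD_START = 0
PAYLOAD_LEN = 5


def heartbeat_buggy(claimed_len):
    """Echo back as many bytes as the request claims it sent."""
    return bytes(MEMORY[PAYLOAD_START:PAYLOAD_START + claimed_len])


def heartbeat_checked(claimed_len):
    """Same, but refuse to reply if the claim exceeds what actually arrived."""
    if claimed_len > PAYLOAD_LEN:
        return b""
    return bytes(MEMORY[PAYLOAD_START:PAYLOAD_START + claimed_len])


honest = heartbeat_buggy(5)
leaked = heartbeat_buggy(12)
refused = heartbeat_checked(12)
print(honest, leaked, refused)
```

The entire difference between the two versions is one comparison.  Easy to write, easy to leave out, and not something that jumps off the page unless you happen to be looking for it.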

I think there's a second reason, though, more subtle than the first but worth noting:  There probably aren't really that many eyeballs on the source code to begin with.

In theory, millions of people could have found either of these two bugs.  If you've installed Linux, you have the bash and OpenSSL source code, or if it didn't come with your install, you can easily get at it.  Odds are you never looked at it, though, unless you were actively developing one of those packages.  Why would you?  I use Linux systems all the time.  I don't want to study the source code.  I just want it to work.  I have looked at various parts of the Linux/GNU source, but generally just to see how it works, not with a particular eye toward finding bugs.  Maybe that makes me a bad net.citizen, but if so, I'm pretty sure I'm in good company.

OK, but there have still been hundreds of contributors to each of those projects.  Surely one of them would have seen the problem code and fixed it?  Not necessarily.  A tool like bash consists of a large number of modules (more or less), and the whole point of breaking things down into modules is that you can work on one without caring (much) about (many of) the others.  Someone who worked on job control in bash would not necessarily have even looked at the environment variable parsing, which is where the problem actually was.

In other words, there might only have been a handful of people who even had the opportunity to find Shellshock or Heartbleed in the source code, and they didn't happen to spot the problems, probably because they were trying to get something else done at the time.


There's another kind of eyeball, though: testers.  Even if only a few people were looking closely at the source, lots of people actually use bash, OpenSSL and other open-source tools.

Fair enough, but again, their attention is not necessarily focused where the bugs are.  Most people logging into a Linux box and using bash are not going to be defining functions in environment variables.  Most script writers aren't either (though git, originally written by Linus himself, seems to like to).  It's a moderately tricky thing to do.  Likewise, almost no one using OpenSSL is even going to be in a position to look at heartbeat packets.  Most of us don't even know if we're using OpenSSL or not, though if you've visited an https:// URL, you probably have.

In short, Raymond's implicit assumption that bug-finding is a matter of many independent trials, in the statistical sense, evenly distributed over the space of all possible bugs, looks to be wrong on both counts: "many" and "independent".

[The current Wikipedia article on Linus's law cites Robert Glass's Facts and Fallacies about Software Engineering, which made similar observations in 2003, over a decade before this was posted.  It also no longer seems to mention any version of Linus's law due to Linus himself.  That was removed in this edit  --D.H. Oct 2018]

Wednesday, July 16, 2014

Protocol basics -- heartbeats, pings and acks

For no particular reason, I thought I'd start an occasional series on the basics of computer protocols such as those, like TCP and HTTP, that the web is built on.  Also for no particular reason, the basic principle that came to mind first is the idea of heartbeats.

But first, what's a protocol?

The word itself derives from Greek protos (first) and kolla (glue), so that ought to be clear enough.

No?

The trail is something like: prōtokollon really refers to the first draft of an official agreement (the first one glued into a binding), and thence more generally to an official set of rules and procedures, and thence finally to the computing meaning: A set of rules for exchanging messages between computers (often called hosts).

One of the most basic problems in computer protocols is determining whether the other party is there or not.  How hard can that be, right?

Unlike in the physical world, you can't just look.  All you have is some means of sending messages, typically a relay of several steps mixing wired and wireless transmission, high-volume and low-volume connections, and so forth.  I'll go into deeper detail in some later post, but the point is that all you can do is send a message, and any particular message might or might not arrive at its destination in any particular amount of time.

One simple way to tell if the other party is there is just to ask.  Send a message saying "If you get this, please send it back to me."  You send that message, the other host sends back a reply and voila, you know they're there.

This is a perfectly good approach.  The first message is generally called a ping, probably taken from SONAR terminology, and the reply packet is generally called an ack (or ACK), short for "acknowledgement".  (There's also such a thing as a nack (or NAK), short for "negative acknowledgement", which means "yes, I got that, but I couldn't understand it," or "yes, I got that, but you're sending me messages too fast, so please stop for a bit".  I'll admit to occasionally having said "NAK" in response to an explanation that went over my head.)

But what if you don't get your ack?  Is your connection bad?  Has the other host crashed?  Did it receive your ping but fail to reply?  Did it reply, but the return connection was bad?  How long should you wait before you decide that the ack isn't coming?

To help get around problems like this, you can send a series of pings and listen for a series of acks.  To help tell what's going on, you can number them so you can match the acks to the pings.  If the connection is flaky, you might miss an ack from time to time, but overall if the other host is there and you have at least some sort of connection, you'll get at least some acks back.
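The bookkeeping for numbered pings is small enough to sketch.  This is only an illustration: the function name is made up, and it assumes you already have the list of ping numbers you sent and the list of ping numbers that came back in acks.

```python
def unanswered_pings(sent, acked):
    """Return the numbers of the pings that never got an ack back."""
    acked = set(acked)
    return [seq for seq in sent if seq not in acked]


sent = [1, 2, 3, 4, 5]
acked = [1, 2, 4, 5]      # the ack for ping 3 never arrived
print(unanswered_pings(sent, acked))
```

Note that this tells you an exchange failed, not which half of it failed.  Ping 3 may never have arrived, or its ack may have been lost on the way back.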

You might even have the other host tell you how many pings it's heard.  That will give you some idea of whether any problems are on the outbound connection, inbound connection, or both.  For example, if the return connection is bad but the outbound connection is fine, you'll hear something like "Ack for ping 1, I've heard 1 ping", "Ack for ping 3, I've heard 3 pings" ...  The other host heard ping 2, but its ack never made it back to you.  If instead you hear "Ack for ping 1, I've heard 1 ping" followed by "Ack for ping 3, I've heard 2 pings", you know that it missed ping 2.  Most bad connections will affect both directions, but that doesn't always have to be the case -- the other host's network layer is part of the incoming connection, and it's possible that it's able to send messages but sometimes has trouble hearing them.
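Here is a rough sketch of that deduction.  It assumes pings are numbered from 1, the other host sends exactly one ack per ping it hears, acks arrive in order without duplicates, and the other host hasn't restarted in the meantime.  The function name and the (ping number, pings heard) representation are invented for the example.

```python
def diagnose(acks):
    """Estimate losses in each direction from the acks received so far.

    `acks` is a list of (ping_number, pings_heard) pairs, in the order
    they arrived.  Returns None if nothing has been heard at all.
    """
    if not acks:
        return None  # silence: nothing can be deduced yet
    last_ping, heard = acks[-1]
    return {
        # Pings up to `last_ping` that the other host never heard.
        "pings_lost_outbound": last_ping - heard,
        # Acks the other host sent (one per ping heard) that never reached us.
        "acks_lost_on_return": heard - len(acks),
    }


print(diagnose([(1, 1), (3, 3)]))   # ping 2 was heard, but its ack was lost
print(diagnose([(1, 1), (3, 2)]))   # ping 2 itself was lost
```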

If the other host crashes and restarts, you might hear something like "Ack for ping 1, I've heard 1 ping", "Ack for ping 2, I've heard 2 pings", and then eventually, once the other host is up again, "Ack for ping 50, I've heard 1 ping".  This may or may not be useful information.  It's a basic principle of networking that during that eerie silence, there's no way to know whether the other host is crashing and restarting, the network is down, the other host is running slowly, there's a bug in whatever's handling the pings, the network is up but messages are being delayed, or whatever.
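Spotting the restart after the fact is simple, at least in a sketch.  This assumes acks arrive in order and that the other host's count starts over from zero when it restarts; the function name and the (ping number, pings heard) pairs are made up for the example.

```python
def restart_detected(acks):
    """True if the 'pings heard' count ever goes down between acks.

    `acks` is a list of (ping_number, pings_heard) pairs in arrival order.
    """
    counts = [heard for _, heard in acks]
    return any(later < earlier for earlier, later in zip(counts, counts[1:]))


print(restart_detected([(1, 1), (2, 2), (50, 1)]))   # count dropped from 2 to 1
print(restart_detected([(1, 1), (2, 2), (3, 3)]))    # count only ever goes up
```

This only works once the other host is talking again.  While nothing is arriving, the function has nothing to look at, which is the point of the paragraph above.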

By the point you hear back that the rebooted host has only heard one ping, you may not greatly care.  You can't begin to figure out what's going on until you get a message from the other host, and even then what you can deduce depends on the exact messages, that is, on the protocol.  On the other hand, you can decide that if you haven't heard replies for N pings in a row, something is wrong.  That's often a good bet, but you have to be prepared for the possibility that things are just slow and the other host was there all along.
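The "N pings in a row" rule might look something like this.  The class name and the threshold are hypothetical, and deciding whether an ack "arrived in time" is left to the caller.

```python
class MissCounter:
    """Flags a problem after `threshold` pings in a row go unanswered."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.missed_in_a_row = 0

    def record(self, got_ack):
        """Record the outcome of one ping; return True if something looks wrong."""
        if got_ack:
            self.missed_in_a_row = 0      # any reply resets the count
        else:
            self.missed_in_a_row += 1
        return self.missed_in_a_row >= self.threshold


counter = MissCounter(threshold=3)
for got_ack in [True, False, False, False, True]:
    print(got_ack, counter.record(got_ack))
```

The last line of output is the case to be prepared for: the alarm went off after three misses, and then a reply turned up anyway.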

In some kinds of network, messages are always sent to everyone who could be listening.  In most such cases, the networking layer will filter out messages that aren't addressed to a particular system, but it's also possible to mark them "broadcast", meaning that everyone should listen.  In such setups, a broadcast ping is a good way to find out who's on the network.  This process is called discovery, and since not all networks have broadcasting built in, there are discovery protocols for networks that don't.

If you're having an actual conversation with another host, say, sending requests and getting replies, you're automatically pinging and acking.  However, you may reach a point where you don't have anything to say at the moment, but you want the other host to know you're still there.  In that case, you could send a ping, either as a do-nothing request or as a special kind of message.  It doesn't much matter which, so long as you and the other host agree on the protocol.  Such a message is generally called a keep-alive, since it's meant to keep the hosts from killing the connection (which basically means forgetting about it) on the assumption the other has gone away.

In some cases, only one host cares if the other is there.  For example, imagine a weather station where the main host is listening for data coming from a bunch of sensors -- thermometer, anemometer, hygrometer, manometer, and so forth.  It's fine for the sensors to blindly send out their information no matter what, but the main host would like to be able to report if a sensor is faulty.  Or in an even simpler example, you just want to know if another host is there at all, without needing it to send you any particular information.

In such cases, you shouldn't have to ping (and you might not even be able to, for example if the sensors have transmitters but no receivers), but you want the things you're monitoring to send acks regularly as though you had.  You can then decide that if you miss N messages, you'll report a problem.  Since they're not actually acknowledging anything, such a message is generally called a heartbeat rather than an ack.
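A minimal sketch of the weather-station case, assuming each sensor is supposed to send a heartbeat every `interval` seconds and the main host gives up after `max_missed` intervals of silence.  The class, its method names and the numbers are invented for illustration, and times are passed in explicitly rather than read from a clock so the example is easy to follow.

```python
class HeartbeatMonitor:
    """Tracks when each sensor was last heard from."""

    def __init__(self, interval, max_missed):
        self.interval = interval        # seconds between expected heartbeats
        self.max_missed = max_missed    # the N in "if you miss N messages"
        self.last_seen = {}             # sensor name -> time of last heartbeat

    def heartbeat(self, sensor, now):
        """Call whenever a message arrives from `sensor`."""
        self.last_seen[sensor] = now

    def faulty(self, now):
        """Sensors that have been silent for more than N intervals."""
        limit = self.interval * self.max_missed
        return sorted(s for s, t in self.last_seen.items() if now - t > limit)


monitor = HeartbeatMonitor(interval=10, max_missed=3)
monitor.heartbeat("thermometer", now=0)
monitor.heartbeat("anemometer", now=0)
monitor.heartbeat("thermometer", now=30)
print(monitor.faulty(now=45))   # the anemometer has been silent for 45 seconds
```

One limitation of this sketch is that a sensor that has never sent anything is invisible to it.  A real monitor would need to be told which sensors to expect.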

In fact, any series of regular messages meant to determine if a host is present or not can be called a heartbeat.  The heartbeats in the famous Heartbleed bug, for example, were a series of pings and acks.  The bug was that a badly constructed ping would cause the ack to contain information that shouldn't have been there.
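For the curious, here is a toy model of that kind of bug.  It is emphatically not the OpenSSL code: the function names, the "memory" layout and the secret are all invented.  The one thing it borrows from the real bug is the general shape, namely that the ping says how long its payload is and the buggy receiver believes it.

```python
# Toy model only: nothing here is taken from OpenSSL.
SECRET = b"hunter2"   # stands in for whatever happens to sit next to the payload


def handle_ping_buggy(payload, claimed_length):
    """Echo the payload back, trusting the length the sender claimed."""
    memory = payload + SECRET          # payload stored right beside other data
    return memory[:claimed_length]     # bug: may run past the real payload


def handle_ping_fixed(payload, claimed_length):
    """Echo the payload back only if the claimed length is believable."""
    if claimed_length > len(payload):
        return None                    # malformed ping: ignore it
    return payload[:claimed_length]


print(handle_ping_buggy(b"hi", 2))     # an honest ping gets its payload back
print(handle_ping_buggy(b"hi", 9))     # a dishonest one gets the secret too
print(handle_ping_fixed(b"hi", 9))     # the fixed version refuses to answer
```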


This post has turned out longer than I expected.  I had expected to write a couple of paragraphs about heartbeats, but to get there I ended up delving a bit deeper.  As is often the case, there's more to even the simple pieces than might meet the eye.  I would like to make one last point, though.  Heartbeats, pings, acks and indeed most of the basics of computer protocols have been around much longer than computers.  It would be interesting to hunt down early examples, but one that springs to mind is a team on an isolated, dangerous mission agreeing to send out regular radio messages.  If some number don't arrive, send in the rescue squad (or just assume the mission has, sadly, failed).

The basic idea of "make a noise if you're still here" is, of course, considerably older than radio.