Saturday, August 22, 2015

Margaret Hamilton: 1 New Horizons: 0

A bit more on Pluto, from a compugeek perspective if not a full-on web perspective ...

The New Horizons flyby was not completely without incident.  Shortly before the flyby itself, the craft went into "safe mode", contact was lost for a little over an hour and a small amount of scientific data was lost.  The underlying problem was "a hard-to-detect timing flaw in the spacecraft command sequence".  This quite likely means what's known in the biz as a "race condition", where two operations are going on at the same time, the software behaves incorrectly if the wrong one finishes first and the developers didn't realize it mattered.

Later investigation concluded that the problem happened when "The computer was tasked with receiving a large command load at the same time it was engaged in compressing previous science data."  This means that the CPU would have been both heavily loaded and multitasking, making it more likely that various "multithreading issues" such as race conditions would be exposed.

Now, before I go on, let me emphasize that bugs like this are notoriously easy to introduce by accident and notoriously hard to find if they do creep in, even though there are a number of well-known tools and techniques for finding them and keeping them out in the first place.

The incident does not in any way indicate that the developers involved can't code.  Far from it.  New Horizons made it through a ten-year, five billion kilometer journey, arriving within 72 seconds of the expected time, and was able to beam back spectacularly detailed images.  That speaks for itself.  It's particularly significant that the onboard computers were able to recover from the error condition instead of presenting the ground crew with an interplanetary Blue Screen of Death.  More on that in a bit.

Still ...

It's July 20, 1969.  The Apollo 11 lunar lander is three minutes from landing on the Moon when several alarms go off.  According to a later recounting by the leader of the team involved
Due to an error in the checklist manual, the rendezvous radar switch was placed in the wrong position. This caused it to send erroneous signals to the computer. The result was that the computer was being asked to perform all of its normal functions for landing while receiving an extra load of spurious data which used up 15% of its time.
This is a serious issue.  If the computer can't function, the landing has to be aborted.  However,
The computer (or rather the software in it) was smart enough to recognize that it was being asked to perform more tasks than it should be performing. It then sent out an alarm, which meant to the astronaut, I'm overloaded with more tasks than I should be doing at this time and I'm going to keep only the more important tasks; i.e., the ones needed for landing ... Actually, the computer was programmed to do more than recognize error conditions. A complete set of recovery programs was incorporated into the software. The software's action, in this case, was to eliminate lower priority tasks and re-establish the more important ones.
This is awesome.  Since "awesome" is generally taken to mean "kinda cool" these days, I'll reiterate: The proper response to engineering on this level is awe.  Let me try to explain why.

Depending on where you start counting, modern computing was a decade or two old at the time.  The onboard computer had "approximately 64Kbyte of memory and operated at 0.043MHz".  Today, you can buy a system literally a million times faster and with a million times more memory for a few hundred dollars.

While 64K is tiny by today's standards, it still leaves plenty of room for sophisticated code, which is exactly what was in there.  It does, however, mean that every byte and every machine cycle counts, and for that reason among others the code itself was written in assembler (hand-translated from a language called MAC and put on punch cards for loading).  Assembler is as low-level as it gets, short of putting in raw numbers, or flipping switches or fiddling with the wiring by hand.

Here's a printout of that code if you're curious.  The dark bands are from printing out the listing on green-and-white-striped fanfold paper with a line printer such as used to be common at computer centers around the world.  The stripes were there to help the eye follow the 132-character lines.  Good times.  But I digress.

Just in case writing in assembler with an eye towards extremely tight code isn't enough, the software is asynchronous.   What does that mean?  There are two basic ways to structure a program such as this one that has to deal with input from a variety of sources simultaneously: the synchronous approach and the asynchronous.

Synchronous code essentially does one thing at a time.  If it's reading temperature and acceleration (or whatever), it will first read one input, say temperature from the temperature sensor, then read acceleration from the accelerometer (or whatever).  If it's asking some part of the engine to rotate 5 degrees, it sends the command to the engine part, then waits for confirmation that the part really did turn.  For example, it might read the position sensor for that part over and over until it reads five degrees different, or raise an alarm if doesn't get the right reading after a certain number of tries.

Code like this is easy to reason about and easy to read.  You can tell immediately that, say, it's an error if you try to move something and its position doesn't reach the desired value after a given number of tries.  However, it's no way to run a spaceship.  For example, suppose you need to be monitoring temperature continuously and raise a critical alarm if it gets outside its acceptable range.  You can't do that if you're busy reading the position sensor.

This is why high-performance, robust systems tend to be asynchronous.  In an asynchronous system, commands can be sent and data can arrive at any time.  There will generally be a number of event handlers, each for a given type of event.  The temperature event handler might record the temperature somewhere and then check to make sure it's in range.

If it's not, it will want to raise an alarm.  Suppose the alarm is a beep every five seconds.  In the asynchronous world, that means creating a timer to trigger events every five seconds, and creating an event handler that sends a beep command to the beeper when the timer fires (or, you can set a "one-shot" timer and have the handler create a new one-shot timer after it sends the beep command).

While all this is going on, other sensors will be triggering events.  In between "the temperature sensor just reported X" and "the timer for your beeper just went off", the system might get events like "the accelerometer just reported Y" and "the position sensor for such-and-such-part just read Z".

To move an engine part in this setup, you need to send it a command to move, and also create a handler for the position sensor's event.  That handler has to include a counter to remember how many position readings have come in since the command to move, along with the position the part is supposed to get to (or better, a time limit and the expected position).

A system like this is very flexible and doesn't spend time "blocked" waiting for things to happen, but it's also harder to read and reason about, since things can happen in any order and the logic is spread across a number of handlers, which can come and go depending on what the system is doing.

And then, on top of all this, the system has code to detect and recover from error conditions, not just in the ship it's controlling but in its own operation.  Do-it-yourself brain surgery, in other words.


I report my occupation as "software engineer" for tax purposes and such, but that's on a good day.  Most of us spend most of our time coding, that is, writing detailed instructions for machines to carry out.  True software engineering means designing a robust and efficient system to solve a practical problem.  The term was coined by Margaret Hamilton, the architect of the Apollo 11 control systems quoted above and a pioneer in the design of asynchronous systems.  As the story of the lunar landing demonstrates, she and her team set a high bar for later work.

New Horizons ran into essentially the same sort of problem that Apollo 11 did, but handled it less robustly (going to "safe mode" and then recovering, as opposed to automatically re-prioritizing), all building on techniques that Hamilton and her team helped develop, and using vastly more powerful equipment and development tools based on decades of collective experience.  So, with all due respect to the New Horizons team, I'd have to say Apollo 11 wins that one.

Friday, August 21, 2015

Latency, bandwidth and Pluto

As you may well know, the New Horizons spacecraft flew by Pluto last month, dramatically increasing our knowledge of Pluto and its moons (let's not even get into whether Pluto and Charon jointly constitute a "binary dwarf planet" or whatever).  There are even a few pictures on the web.

But wait ... that's not very many pictures for a ten-year mission.  Even worse, if you were watching at the time you'll know that New Horizons went completely dark for most of a day, right when it was flying by Pluto.  Isn't this the modern web, where everything is available everywhere instantly?  What gives, NASA?

Part of the problem is the way New Horizons is designed.  It's expensive to accelerate mass to the speed New Horizons is going, and since you can't exactly send a repair crew out to Pluto, it's good to have as few moving parts as possible.  As a result, the ship has a small battery and both the antenna and the cameras are mounted firmly in place.  If you want to turn the antenna toward Earth, you have to move the whole ship, using some of the small store of remaining fuel dedicated to course corrections and attitude control.  If you want to point the cameras toward Pluto, you have to turn the ship that way.  You can't do both.

That explains why the ship went dark for the duration of the flyby, but actually it effectively went dark for considerably longer than that.  It takes about 4.5 hours for a signal to travel the distance between Earth and Pluto.  That means the sequence of events is, more or less
  • t + 0: Flight control sends commands to New Horizons to point the cameras at Pluto, take pictures, orient the antenna toward Earth, and report back.
  • t + 4.5 hours: New Horizons gets the command and starts re-orienting and taking pictures
  • t + 9 hours: Last time at which any signal from before the pointing operation will reach Earth
  • t + 25.5 hours (more or less): New horizons, now with the antenna pointed back toward earth, sends "Phone home" message reporting status.
  • t + 30 hours (more or less): "Phone home" message arrives
The important thing to note here is that, while the ship is actually out of contact for 21 hours, it's 25.5 hours from the time the command is sent to the time the ship is reachable again, and 30 hours before the ground crew knows it's reachable again.   If the phone home signal hadn't arrived, it would be 9 more hours, at a minimum, before they knew if any corrective action they'd taken had worked.  By internet standards this is ridiculously high latency, but anyone who's played a laggy video game or been on a conference call with people on the opposite side of the world has experienced the same problem on a smaller scale.

So part of the reason pictures have been slow in coming is latency.  It's going to take a minimum of 4.5 hours to beam anything back, and longer if the ground crew has to send instructions.

The other problem is bandwidth.  Pluto is about 5 billion kilometers (about 3 billion miles) away.  Signal strength drops off as the square of the distance, so, for example, a signal from Pluto is about 160 times weaker than a signal from Mars and, since the power source has to be small, it's not going to be extremely powerful to start with.  Lower power means a lower signal to noise ratio and less bandwidth (or at least that's my dim software-engineer understanding of it -- in real life "bandwidth" doesn't exactly mean "how many bits you can transmit", and I'm sure there's lots more I'm glossing over).

Put all that together and on a good day we have about 2Kbps coming back from Pluto.  That's about what you could get out of a modem in the mid 1980s.   Internet technology has progressed just a bit since then, but internet technology doesn't have to cope with vast distances and stringent mass limitations.  At 2Kbps, one raw image from LORRI (the hi-res black-and-white camera) takes close to two hours to transmit.  This is why, if all goes well, we'll be getting Pluto pictures (and other data) well into 2016.

I'd still say that New Horizons is "on the web" in some meaningful sense, but the high latency and low bandwidth make it a great example of Deutsch's fallacies in action.

[Update: Not only is New Horizons sending back data slowly, it's not sending back particularly pretty data at the moment.  From their main page:
Why hasn’t this website included any new images from New Horizons since July? As planned, New Horizons itself is on a bit of a post-flyby break, currently sending back lower data-rate information collected by the energetic particle, solar wind and space dust instruments. It will resume sending flyby images and other data in early September.
-- D.H. 24 Aug 2015]