Saturday, August 22, 2015

Margaret Hamilton: 1 New Horizons: 0

A bit more on Pluto, from a compugeek perspective if not a full-on web perspective ...

The New Horizons flyby was not completely without incident.  Shortly before the flyby itself, the craft went into "safe mode", contact was lost for a little over an hour and a small amount of scientific data was lost.  The underlying problem was "a hard-to-detect timing flaw in the spacecraft command sequence".  This quite likely means what's known in the biz as a "race condition", where two operations are going on at the same time, the software behaves incorrectly if the wrong one finishes first and the developers didn't realize it mattered.

Later investigation concluded that the problem happened when "The computer was tasked with receiving a large command load at the same time it was engaged in compressing previous science data."  This means that the CPU would have been both heavily loaded and multitasking, making it more likely that various "multithreading issues" such as race conditions would be exposed.

Now, before I go on, let me emphasize that bugs like this are notoriously easy to introduce by accident and notoriously hard to find if they do creep in, even though there are a number of well-known tools and techniques for finding them and keeping them out in the first place.

The incident does not in any way indicate that the developers involved can't code.  Far from it.  New Horizons made it through a ten-year, five billion kilometer journey, arriving within 72 seconds of the expected time, and was able to beam back spectacularly detailed images.  That speaks for itself.  It's particularly significant that the onboard computers were able to recover from the error condition instead of presenting the ground crew with an interplanetary Blue Screen of Death.  More on that in a bit.

Still ...

It's July 20, 1969.  The Apollo 11 lunar lander is three minutes from landing on the Moon when several alarms go off.  According to a later recounting by the leader of the team involved
Due to an error in the checklist manual, the rendezvous radar switch was placed in the wrong position. This caused it to send erroneous signals to the computer. The result was that the computer was being asked to perform all of its normal functions for landing while receiving an extra load of spurious data which used up 15% of its time.
This is a serious issue.  If the computer can't function, the landing has to be aborted.  However,
The computer (or rather the software in it) was smart enough to recognize that it was being asked to perform more tasks than it should be performing. It then sent out an alarm, which meant to the astronaut, I'm overloaded with more tasks than I should be doing at this time and I'm going to keep only the more important tasks; i.e., the ones needed for landing ... Actually, the computer was programmed to do more than recognize error conditions. A complete set of recovery programs was incorporated into the software. The software's action, in this case, was to eliminate lower priority tasks and re-establish the more important ones.
This is awesome.  Since "awesome" is generally taken to mean "kinda cool" these days, I'll reiterate: The proper response to engineering on this level is awe.  Let me try to explain why.

Depending on where you start counting, modern computing was a decade or two old at the time.  The onboard computer had "approximately 64Kbyte of memory and operated at 0.043MHz".  Today, you can buy a system literally a million times faster and with a million times more memory for a few hundred dollars.

While 64K is tiny by today's standards, it still leaves plenty of room for sophisticated code, which is exactly what was in there.  It does, however, mean that every byte and every machine cycle counts, and for that reason among others the code itself was written in assembler (hand-translated from a language called MAC and put on punch cards for loading).  Assembler is as low-level as it gets, short of putting in raw numbers, or flipping switches or fiddling with the wiring by hand.

Here's a printout of that code if you're curious.  The dark bands are from printing out the listing on green-and-white-striped fanfold paper with a line printer such as used to be common at computer centers around the world.  The stripes were there to help the eye follow the 132-character lines.  Good times.  But I digress.

Just in case writing in assembler with an eye towards extremely tight code isn't enough, the software is asynchronous.   What does that mean?  There are two basic ways to structure a program such as this one that has to deal with input from a variety of sources simultaneously: the synchronous approach and the asynchronous.

Synchronous code essentially does one thing at a time.  If it's reading temperature and acceleration (or whatever), it will first read one input, say temperature from the temperature sensor, then read acceleration from the accelerometer (or whatever).  If it's asking some part of the engine to rotate 5 degrees, it sends the command to the engine part, then waits for confirmation that the part really did turn.  For example, it might read the position sensor for that part over and over until it reads five degrees different, or raise an alarm if doesn't get the right reading after a certain number of tries.

Code like this is easy to reason about and easy to read.  You can tell immediately that, say, it's an error if you try to move something and its position doesn't reach the desired value after a given number of tries.  However, it's no way to run a spaceship.  For example, suppose you need to be monitoring temperature continuously and raise a critical alarm if it gets outside its acceptable range.  You can't do that if you're busy reading the position sensor.

This is why high-performance, robust systems tend to be asynchronous.  In an asynchronous system, commands can be sent and data can arrive at any time.  There will generally be a number of event handlers, each for a given type of event.  The temperature event handler might record the temperature somewhere and then check to make sure it's in range.

If it's not, it will want to raise an alarm.  Suppose the alarm is a beep every five seconds.  In the asynchronous world, that means creating a timer to trigger events every five seconds, and creating an event handler that sends a beep command to the beeper when the timer fires (or, you can set a "one-shot" timer and have the handler create a new one-shot timer after it sends the beep command).

While all this is going on, other sensors will be triggering events.  In between "the temperature sensor just reported X" and "the timer for your beeper just went off", the system might get events like "the accelerometer just reported Y" and "the position sensor for such-and-such-part just read Z".

To move an engine part in this setup, you need to send it a command to move, and also create a handler for the position sensor's event.  That handler has to include a counter to remember how many position readings have come in since the command to move, along with the position the part is supposed to get to (or better, a time limit and the expected position).

A system like this is very flexible and doesn't spend time "blocked" waiting for things to happen, but it's also harder to read and reason about, since things can happen in any order and the logic is spread across a number of handlers, which can come and go depending on what the system is doing.

And then, on top of all this, the system has code to detect and recover from error conditions, not just in the ship it's controlling but in its own operation.  Do-it-yourself brain surgery, in other words.


I report my occupation as "software engineer" for tax purposes and such, but that's on a good day.  Most of us spend most of our time coding, that is, writing detailed instructions for machines to carry out.  True software engineering means designing a robust and efficient system to solve a practical problem.  The term was coined by Margaret Hamilton, the architect of the Apollo 11 control systems quoted above and a pioneer in the design of asynchronous systems.  As the story of the lunar landing demonstrates, she and her team set a high bar for later work.

New Horizons ran into essentially the same sort of problem that Apollo 11 did, but handled it less robustly (going to "safe mode" and then recovering, as opposed to automatically re-prioritizing), all building on techniques that Hamilton and her team helped develop, and using vastly more powerful equipment and development tools based on decades of collective experience.  So, with all due respect to the New Horizons team, I'd have to say Apollo 11 wins that one.

2 comments:

earl said...

Wow! There is a certain attitude among really effective craftsmen, that a) you can tell if it's right or not (you know there's a problem somewhere, though you may have no idea where) and you simply cannot leave it until the problem is gone. In the Apollo 11 case I suspect that there were also about 40 checker per coder.

Also the very fact that they were working under constraints of memory and speed may have forced them to elegance.

earl said...

By the bye, that was very good writing. Enough explanation, clear, and in just the right places.