Tuesday, July 16, 2019

Space Reliability Engineering

In a previous post on the Apollo 11 mission, I emphasized the role of software architecture, and the architect Margaret Hamilton in particular, in ensuring the success of the Apollo 11 lunar landing.  I stand by that, including the assessment of the whole thing as "awesome" in the literal sense, but as usual there's more to the story.

Since that non-particularly-webby post was on Field Notes, so is this one.  What follows is mostly taken from the BBC's excellent if majestically paced podcast 13 Minutes to the Moon [I hope to go back and recheck the details directly at some point, but searching through a dozen or so hours of podcast is time-consuming and I don't know if there's a transcript available -- D.H.], which in turn draws heavily on NASA's Johnson Space Center Oral History Project.

I've also had a look at Ars Technica's No, a "checklist error" did not almost derail the Apollo 11 mission, which takes issue with Hamilton's characterization of the incident and also credits Hal Laning as a co-author of the Executive portion of the guidance software which ultimately saved the day (to me, the main point Hamilton was making was that the executive saved the day, regardless of the exact cause of the 1202 code).

Before getting too far into this, it's worth reiterating just how new computing was at the time.  The term "software engineer" didn't exist (Hamilton coined it during the project -- Paul Niquette claims to have coined the term "software" itself and I see no reason to doubt him).  There wasn't any established job title for what we now call software engineers.  The purchase order for the navigation computer, which was the very first order in the whole Apollo project, didn't mention software, programming or anything of the sort.  The computer was another piece of equipment to be made to work just like an engine, window, gyroscope or whatever.  Like them it would have to be installed and have whatever other things done to it to make it functional.  Like "programming" (whatever that was).

In a way, this was a feature rather than a bug.  The Apollo spacecraft have been referred to, with some justification, as the first fly-by-wire vehicles.  The navigational computer was an unknown quantity.  At least one astronaut promised to turn the thing off at the first opportunity.  Flying was for pilots, not computers.

This didn't happen, of course.  Instead, as the podcast describes so well, control shifted back and forth between human and computer depending on the needs of the mission at the time, but it was far from obvious at the beginning that this would be the case.

Because the computer wasn't trusted implicitly, but treated as just another unknown to be dealt with, -- in other words, another risk to be mitigated -- ensuring its successful operation was seen as a matter of engineering, just like making sure that the engines were efficient and reliable, and not a matter of computer science.  This goes a long way toward explaining the self-monitoring design of the software.

Mitigating the risk of using the computer included figuring out how to make it as foolproof as possible for the astronauts to operate.  The astronauts would be wearing spacesuits with bulky gloves, so they wouldn't exactly be swiping left or right, even if the hardware of the time could have supported it.  Basically you had a numeric display and a bunch of buttons.  The solution was to break the commands down to a verb and a noun (or perhaps more accurately a predicate and argument), each expressed numerically.  It would be a ridiculous interface today.  At the time it was a highly effective use of limited resources [I don't recall the name of the designer who came up with this, but it's in the podcast --D.H.].

But the only way to really know if an interface will work is to try it out with real users.  Both the astronauts and the mission control staff needed to practice the whole operation as realistically as possible, including the operation of the computer.  This was for a number of reasons, particularly to learn how the controls and indicators worked, to be prepared for as many contingencies as possible and to try to flush out unforeseen potential problems.  The crew and mission control conducted many of these and they were generally regarded as just as demanding and draining as the real thing, perhaps moreso.

It was during one of these simulations that the computer displayed a status code that no one had ever seen before and therefore didn't know how to react to.  After the session was over, flight director Gene Kranz instructed guidance software expert Jack Garman to look up and memorize every possible code and determine what course of action to take when it came up.  This would take a lot of time searching through the source code, with the launch date approaching, but it had to be done and it was.  Garmin produced a handwritten list of every code and what to do about it.

As a result, when the code 1202 came up with the final opportunity to turn back fast approaching, capsule communicator (CAPCOM) Charlie Duke was able to turn to guidance controller Steve Bales, who could turn to Garman and determine that the code was OK if it didn't happen continuously.  There's a bit of wiggle room in what constitutes "continuously", but knowing that the code wasn't critical was enough to keep the mission on track.  Eventually, Buzz Aldrin noticed that the code seemed to happen when a particular radar unit was being monitored.  Mission Control took over the monitoring and the code stopped happening.


I now work for a company that has to keep large fleets of computers running to support services that billions of people use daily.  If a major Google service is down for five minutes, it's headline news, often on multiple continents.  It's not the same as making sure a plane or a spaceship lands safely or a hospital doesn't lose power during a hurricane, but it's still high-stakes engineering.

There is a whole profession, Site Reliability Engineer, or SRE for short, dedicated to keeping the wheels turning.  These are highly-skilled people who would have little problem doing my job instead of theirs if they preferred to.  Many of their tools -- monitoring, redundancy, contingency planning, risk analysis, and so on -- can trace their lineage through the Apollo program.  I say "through" because the concepts themselves are considerably older than space travel, but it's remarkable how many of them were not just employed, but significantly advanced, as a consequence of the effort to send people to the moon and bring them back.

One tool in particular, Garman's list of codes, played a key role at a that critical juncture.  Today we would call it a playbook.  Anyone who's been on call for a service has used one (I know I have).



In the end, due to a bit of extra velocity imparted during the maneuver to extract the lunar module and dock it to the command module, the lunar module ended up overshooting its intended landing place.  In order to avoid large boulders and steep slopes in the area they were now approaching, Neil Armstrong ended up flying the module by hand in order to find a good landing spot, aided by a switch to increase or decrease the rate of descent.

The controls were similar to those of a helicopter, except the helicopter was flying sideways through (essentially) a vacuum over the surface of the moon, steered by precisely aimed rocket thrust while continuing to descend, and was made of material approximately the thickness of a soda can which could have been punctured by a good jab with a ball-point pen.  So not really like a helicopter at all.

The Eagle landed with eighteen seconds of fuel to spare.  It helps to have a really, really good pilot.