Tuesday, July 16, 2019

Space Reliability Engineering

In a previous post on the Apollo 11 mission, I emphasized the role of software architecture, and the architect Margaret Hamilton in particular, in ensuring the success of the Apollo 11 lunar landing.  I stand by that, including the assessment of the whole thing as "awesome" in the literal sense, but as usual there's more to the story.

Since that non-particularly-webby post was on Field Notes, so is this one.  What follows is mostly taken from the BBC's excellent if majestically paced podcast 13 Minutes to the Moon [I hope to go back and recheck the details directly at some point, but searching through a dozen or so hours of podcast is time-consuming and I don't know if there's a transcript available -- D.H.], which in turn draws heavily on NASA's Johnson Space Center Oral History Project.  I've also had a look at Ars Technica's No, a "checklist error" did not almost derail the Apollo 11 mission, which takes issue with Hamilton's characterization of the incident and also credits Hal Laning as a co-author of the Executive portion of the guidance software which ultimately saved the day (to me, the main point Hamilton was making was that the executive saved the day, regardless of the exact cause of the 1202 code).

Before getting too far into this, it's worth reiterating just how new computing was at the time.  The term "software engineer" didn't exist (Hamilton coined it during the project -- Paul Niquette claims to have coined the term "software" itself and I see no reason to doubt him).  There wasn't any established job title for what we now call software engineers.  The order for the navigation computer, which was the very first order in the whole Apollo project, didn't mention software, programming or anything of the sort.  The computer was another piece of equipment to be made to work just like an engine, window, gyroscope or whatever.  Like them it would have to be installed and have whatever other things done to it to make it functional.  Like "programming".

In a way, this was a feature rather than a bug.  The navigational computer was an unknown quantity.  The Apollo spacecraft have been referred to, with some justification, as the first fly-by-wire vehicles.  At least one astronaut promised to turn the thing off at the first opportunity.  Flying was for pilots, not computers.

This didn't happen, of course.  Instead, as the podcast describes so well, control shifted back and forth between human and computer depending on the needs of the mission at the time, but it was far from obvious at the beginning that this would be the case.

Because the computer wasn't trusted implicitly, but treated as just another unknown to be dealt with, -- in other words, another risk to be mitigated -- ensuring its successful operation was seen as a matter of engineering, just like making sure that the engines were efficient and reliable, and not a matter of computer science.  This goes a long way toward explaining the self-monitoring design of the software.

Mitigating the risk of using the computer included figuring out how to make it as foolproof as possible for the astronauts to operate.  The astronauts would be wearing spacesuits with bulky gloves, so they wouldn't exactly be swiping left or right, even if the hardware of the time could have supported it.  Basically you had a numeric display and a bunch of buttons.  The solution was to break the commands down to a verb and a noun (or perhaps more accurately a predicate and argument), each expressed numerically.  It would be a ridiculous interface today, but at the time it was a highly effective use of limited resources.

But the only way to really know if an interface will work is to try it out with real users.  Both the astronauts and the mission control staff needed to practice the whole operation as realistically as possible, including the operation of the computer.  This was for a number of reasons, particularly to learn how the controls and indicators worked, to be prepared for as many contingencies as possible and to try to flush out potential problems.  The crew and mission control conducted many of these and they were generally regarded as just as demanding and draining as the real thing, perhaps moreso.

It was during one of these simulations that the computer displayed a status code that no one had ever seen before and therefore didn't know how to react to.  After the session was over, flight director Gene Kranz instructed guidance software expert Jack Garman to look up and memorize every possible code and determine what course of action to take when it came up.  This would take a lot of time searching through the source code, with the launch date approaching, but it had to be done and it was.  Garmin produced a handwritten list of every code and what to do about it.

As a result, when the code 1202 came up with the final opportunity to turn back fast approaching, capsule communicator (CAPCOM) Charlie Duke was able to turn to guidance controller Steve Bales, who could turn to Garman and determine that the code was OK if it didn't happen continuously.  There's a bit of wiggle room in what constitutes "continuously", but knowing that the code wasn't critical was enough to keep the mission on track.  Eventually, Buzz Aldrin noticed that the code seemed to happen when a particular radar unit was being monitored.  Mission Control took over the monitoring and the code stopped happening.


I now work for a company that has to keep large fleets of computers running to support services that billions of people use daily.  If a major Google service is down for five minutes, it's headline news, often on multiple continents.  It's not the same as making sure a plane or a spaceship lands safely or a hospital doesn't lose power during a hurricane, but it's still high-stakes engineering.

There is a whole profession (Site Reliability Engineer, or SRE for short) dedicated to keeping the wheels turning.  These are highly-skilled people who would have little problem doing my job instead of theirs if they preferred to.  Many of their tools -- monitoring, redundancy, contingency planning, risk analysis, and so on -- can trace their lineage through the Apollo program -.  I say "through" because the concepts themselves are considerably older than space travel, but it's remarkable how many of them were not just employed, but significantly advanced, as a consequence of the effort to send people to the moon and bring them back.

One tool in particular, Garman's list of codes, played a key role at a that critical juncture.  Today we would call it a playbook.  Anyone who's been on call for a service has used one (I know I have).



In the end, due to a bit of extra velocity imparted during the maneuver to extract the lunar module and dock it to the command module, the lunar module ended up overshooting its intended landing place.  In order to avoid large boulders and steep slopes in the area they were now approaching, Neil Armstrong ended up flying the module by hand in order to find a good landing spot, aided by a switch to increase or decrease the rate of descent.

The controls were similar to those of a helicopter, except the helicopter was flying sideways through (essentially) a vacuum over the surface of the moon, steered by precisely aimed rocket thrust while continuing to descend, and was made of material approximately the thickness of a soda can which could have been punctured by a good jab with a ball-point pen.  So not really like a helicopter at all.

The Eagle landed with eighteen seconds of fuel to spare.  It helps to have a really, really good pilot.

Tuesday, April 16, 2019

Distributed astronomy

Recently, news sources all over the place have been reporting on the imaging of a black hole, or  more precisely, the immediate vicinity of a black hole.  The black hole itself, more or less by definition, can't be imaged (as far as we know so far).  Confusing things a bit more, any image of a black hole will look like a black disc surrounded by a distorted image of what's actually in the vicinity, but this is because the black hole distorts space-time due to its gravitational field, not because you're looking at something black.  It's the most natural thing in the world to look at the image and think "Oh, that round black area in the middle is the black hole", but it's not.

Full disclosure: I don't completely understand what's going on here.  Katie Bouman has done a really good lecture on how the images were captured, and Matt Strassler has an also really good, though somewhat long overview of how to interpret all this.  I'm relying heavily on both.

Imaging a black hole in a nearby galaxy has been likened to "spotting a bagel on the moon".  A supermassive black hole at the middle of a galaxy is big, but even a "nearby" galaxy is far, far away.

To do such a thing you don't just need a telescope with a high degree of magnification.  The laws of optics place a limit on how detailed an image you can get from a telescope or similar instrument, regardless of the magnification.  The larger the telescope, the higher the resolution, that is, the sharper the image.  This applies equally well to ordinary optical telescopes, X-ray telescopes, radio telescopes and so forth.  For purposes of astronomy these are all considered "light", since they're all forms of electromagnetic radiation and so all follow the same laws.

Actual telescopes can only be built so big, so in order to get sharper images astronomers use interferometry to combine images from multiple telescopes.  If you have a telescope at the South Pole and one in the Atacama desert in Chile, you can combine their images to get the same resolution you would with a giant telescope that spanned from Atacama to the pole.  The drawback is that since you're only sampling a tiny fraction of the light falling on that area, you have to reconstruct the rest of the image using highly sophisticated image processing techniques.  It helps to have more than two telescopes.  The Event Horizon Telescope project that produced the image used eight, across six sites.

Even putting together images from several telescopes, you don't have enough information to precisely know what the full image really would be and you have to be really careful to make sure that the image you reconstruct shows things that are actually there and not artifacts of the processing itself (again, Bouman's lecture goes into detail).  In this case, four teams worked with the raw data independently for seven weeks, using two fundamentally different techniques, to produce the images that were combined into the image sent to the press.  In preparation for that, the image processing techniques themselves were thoroughly tested for their ability to recover images accurately from test data.  All in all, a whole lot of good, careful work by a large number of people went into that (deliberately) somewhat blurry picture.

All of this requires very precise synchronization among the individual telescopes, because interferometry only works for images taken at the same time, or at least to within very small tolerances (once again, the details are ... more detailed).  The limiting factor is the frequency of the light used in the image, which for radio telescopes is on the order of gigahertz. This means that images from the telescopes have to be recorded on the order of a billion times a second.  The total image data ran into the petabytes (quadrillions of bytes), with the eight telescopes producing hundreds of terabytes (that is, hundreds of trillions of bytes) each.

That's a lot of data, which brings us back to the web (as in "Field notes on the ...").  I haven't dug up the exact numbers, but accounts in the popular press say that the telescopes used to produce the black hole images produced "as much data as the LHC produces in a year", which in approximate terms is a staggering amount of data.  A radio interferometer comprising multiple radio telescopes at distant points on the globe is essentially an extremely data-heavy distributed computing system.

Bear in mind that one of the telescopes in question is at the south pole.  Laying cable there isn't a practical option, nor is setting up and maintaining a set of radio relays.  Even satellite communication is spotty.  According to the Wikipedia article, the total bandwidth available is under 10MB/s (consisting mostly of a 50 megabit/second link), which is nowhere near enough for the telescope images, even if stretched out over days or weeks.  Instead, the data was recorded on physical media and flown back to the site where it was actually processed.

I'd initially thought that this only applied to the south pole station, but in fact all six sites flew their data back rather than try to send it over the internet (just to throw numbers out, receiving a petabyte of data over a 10GB/s link would take about a day).   The south pole data just took longer because they had to wait for the antarctic summer.

Not sure if any carrier pigeons were involved.

Thursday, April 4, 2019

Martian talk

This morning I was on the phone with a customer service representative about emails I was getting from an insurance company and which were clearly meant for someone else with a similar name (fortunately nothing earth-shaking, but still something this person would probably like to know about).  As is usually the case, the reply address was a bit bucket, but there were a couple of options in the body of the email: a phone number and a link.  I'd gone with the phone number.

The customer service rep politely suggested that I use the link instead.  I chased the link, which took me to a landing page for the insurance company.  Crucially, it was just a plain link, with nothing to identify where it had come from*.  I wasn't sure how best to try to get that across to the rep, but I tried to explain that usually there are a bunch of magic numbers or "hexadecimal gibberish" on a link like that to tie it back to where it came from.

"Oh yeah ... I call that 'Martian talk'," the rep said.

"Exactly.  There's no Martian talk on the link.  By the way, I think I'm going to start using that."

We had a good laugh and from that point on we were on the same page.  The rep took all the relevant information I could come up with and promised to follow up with IT.

What I love about the term 'Martian talk' is that it implies that there's communication going on, but not in a way that will be meaningful to the average human, and that's exactly what's happening.

And it's fun.

I'd like to follow up at some point and pull together some of the earlier posts on Martian talk -- magic numbers, hexadecimal gibberish and such -- but that will take more attention than I have at the moment.


* From a strict privacy point of view there would be plenty of clues, but there was nothing to tie the link to a particular account for that insurance company, which was what we needed.

Thursday, January 3, 2019

Hats off to New Horizons

A few years ago, around the time of the New Horizons encounter with Pluto (or if you're really serious about the demotion thing, minor planet 134340 Pluto), I gave the team a bit of grief over the probe having to go into "safe mode" with only days left before the flyby, though I also tried to make clear that this was still engineering of a very high order.

Early on New Year's Day (US Eastern time), New Horizons flew by a Kuiper Belt object nicknamed Ultima Thule (two syllables in Thule: THOO-lay).  I'm posting to recognize the accomplishment, and this post will be grief-free.

The Ultima Thule encounter was much like the Pluto encounter with a few minor differences:
  • Ultima Thule is much smaller.  Its long axis is about 1-2% of Pluto's diameter
  • Ultima Thule is darker, reflecting about 10% of light that reaches, compared to around 50% for Pluto. Ultima Thule is about as dark as potting soil.  Pluto is more like old snow.
  • Ultima Thule is considerably further away (about 43 AU from the sun as opposed to about 33 AU for Pluto at the time of encounter -- an AU is the average distance from the Sun to the Earth)
  • New Horizons passed much closer to Ultima Thule than it did to Pluto (3,500 km vs. 12,500 km).  This requires more accurate navigation and to some extent increased the chances of a disastrous collision with either Ultima Thule or, more likely, something near it that there was no way to know about.  At 50,000 km/h, even a gravel-sized chunk would cause major if not fatal damage.
  • Because Ultima Thule is further away, radio signals take proportionally longer to travel between Earth and the probe, about six hours vs. about four hours.
  • Because Ultima Thule is much smaller, much darker and significantly further away, it's much harder to spot from Earth.  Before New Horizons, Pluto itself was basically a tiny dot, with a little bit of surface light/dark variation inferred by taking measurements as it rotated.  Ultima Thule was nothing more than a tinier dot, and a hard-to-spot dot at that.
  • We've had decades to work out exactly where Pluto's orbit goes and where its moons are.  Ultima Thule wasn't even discovered until after New Horizons was launched.  Until a couple of days ago we didn't even know whether it had moons, rings or an atmosphere (it appears to have none).  [Neither Pluto nor Ultima Thule is a stationary object, just to add that little additional degree of difficulty.  The Pluto flyby might be considered a bit more difficult in that respect, though.  Pluto's orbital speed at the time of the flyby was around 20,000 km/h, while Ultima Thule's is closer to 16,500 km/h.  I'd think this would mainly affect the calculations for rotating to keep the cameras pointed, so it probably doesn't make much practical difference.]
In both cases, New Horizons had to shift from pointing its radio antenna at Earth to pointing its cameras at the target.  As it passes by the target at around 50,000 km/h, it has to rotate to keep the cameras pointed correctly, while still out of contact with Earth (which is light-hours away in any case).  It then needs to rotate its antenna back toward Earth, "phone home" and start downloading data at around 1,000 bits per second.  Using a 15-watt transmitter slightly more powerful than a CB radio.  Since this is in space, rotating means firing small rockets attached to the probe in a precise sequence (there are also gyroscopes on New Horizons, but they're not useful for attitude changes).

So, a piece of cake, really.

Seriously, though, this is amazing engineering and it just gets more amazing the more you look at it.  The Pluto encounter was a major achievement, and this was significantly more difficult in nearly every possible way.

So far there don't seem to be any close-range images of Ultima Thule on the mission's web site (see, this post is actually about the web after all), but the team seems satisfied that the flyby went as planned and more detailed images will be forthcoming over the next 20 months or so.  As I write this, New Horizons is out of communication, behind the Sun from Earth's point of view for a few days, but downloads are set to resume after that.  [The images started coming in not long after this was posted, of course --D.H. Jul 2019]