Tuesday, July 16, 2019

Space Reliability Engineering

In a previous post on the Apollo 11 mission, I emphasized the role of software architecture, and the architect Margaret Hamilton in particular, in ensuring the success of the Apollo 11 lunar landing.  I stand by that, including the assessment of the whole thing as "awesome" in the literal sense, but as usual there's more to the story.

Since that non-particularly-webby post was on Field Notes, so is this one.  What follows is mostly taken from the BBC's excellent if majestically paced podcast 13 Minutes to the Moon [I hope to go back and recheck the details directly at some point, but searching through a dozen or so hours of podcast is time-consuming and I don't know if there's a transcript available -- D.H.], which in turn draws heavily on NASA's Johnson Space Center Oral History Project.  I've also had a look at Ars Technica's No, a "checklist error" did not almost derail the Apollo 11 mission, which takes issue with Hamilton's characterization of the incident and also credits Hal Laning as a co-author of the Executive portion of the guidance software which ultimately saved the day (to me, the main point Hamilton was making was that the executive saved the day, regardless of the exact cause of the 1202 code).

Before getting too far into this, it's worth reiterating just how new computing was at the time.  The term "software engineer" didn't exist (Hamilton coined it during the project -- Paul Niquette claims to have coined the term "software" itself and I see no reason to doubt him).  There wasn't any established job title for what we now call software engineers.  The order for the navigation computer, which was the very first order in the whole Apollo project, didn't mention software, programming or anything of the sort.  The computer was another piece of equipment to be made to work just like an engine, window, gyroscope or whatever.  Like them it would have to be installed and have whatever other things done to it to make it functional.  Like "programming".

In a way, this was a feature rather than a bug.  The Apollo spacecraft have been referred to, with some justification, as the first fly-by-wire vehicles.  The navigational computer was an unknown quantity.  At least one astronaut promised to turn the thing off at the first opportunity.  Flying was for pilots, not computers.

This didn't happen, of course.  Instead, as the podcast describes so well, control shifted back and forth between human and computer depending on the needs of the mission at the time, but it was far from obvious at the beginning that this would be the case.

Because the computer wasn't trusted implicitly, but treated as just another unknown to be dealt with -- in other words, another risk to be mitigated -- ensuring its successful operation was seen as a matter of engineering, just like making sure that the engines were efficient and reliable, and not a matter of computer science.  This goes a long way toward explaining the self-monitoring design of the software.

Mitigating the risk of using the computer included figuring out how to make it as foolproof as possible for the astronauts to operate.  The astronauts would be wearing spacesuits with bulky gloves, so they wouldn't exactly be swiping left or right, even if the hardware of the time could have supported it.  Basically you had a numeric display and a bunch of buttons.  The solution was to break the commands down to a verb and a noun (or perhaps more accurately a predicate and argument), each expressed numerically.  It would be a ridiculous interface today.  At the time it was a highly effective use of limited resources.
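To make the verb/noun idea concrete, here is a minimal sketch in Python.  Everything in it is invented for illustration: the numeric codes, the state values and the function names are assumptions, not the actual Apollo assignments.  It only shows the shape of the interface: two small numbers select an action and the thing it acts on.

```python
from typing import Callable, Dict

# Invented spacecraft state, purely for illustration.
STATE: Dict[str, float] = {"altitude_m": 1500.0, "fuel_pct": 42.0}

# Hypothetical noun table: a numeric code names a quantity.
NOUNS: Dict[int, str] = {10: "altitude_m", 20: "fuel_pct"}


def display(key: str) -> str:
    """Show the current value of the named quantity."""
    return f"{key}={STATE[key]}"


def reset(key: str) -> str:
    """Zero the named quantity and show the result."""
    STATE[key] = 0.0
    return display(key)


# Hypothetical verb table: a numeric code names an action.
VERBS: Dict[int, Callable[[str], str]] = {1: display, 2: reset}


def execute(verb: int, noun: int) -> str:
    """Dispatch a (verb, noun) pair keyed in on a numeric keypad."""
    if verb not in VERBS or noun not in NOUNS:
        return "OPERATOR ERROR"
    return VERBS[verb](NOUNS[noun])
```

With these made-up tables, `execute(1, 10)` displays the altitude, and any unknown pair is rejected instead of being guessed at.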

But the only way to really know if an interface will work is to try it out with real users.  Both the astronauts and the mission control staff needed to practice the whole operation as realistically as possible, including the operation of the computer.  This was for a number of reasons, particularly to learn how the controls and indicators worked, to be prepared for as many contingencies as possible and to try to flush out potential problems.  The crew and mission control conducted many of these simulations, which were generally regarded as just as demanding and draining as the real thing, perhaps more so.

It was during one of these simulations that the computer displayed a status code that no one had ever seen before and therefore didn't know how to react to.  After the session was over, flight director Gene Kranz instructed guidance software expert Jack Garman to look up and memorize every possible code and determine what course of action to take when it came up.  This would take a lot of time searching through the source code, with the launch date approaching, but it had to be done and it was.  Garman produced a handwritten list of every code and what to do about it.

As a result, when the code 1202 came up with the final opportunity to turn back fast approaching, capsule communicator (CAPCOM) Charlie Duke was able to turn to guidance controller Steve Bales, who could turn to Garman and determine that the code was OK if it didn't happen continuously.  There's a bit of wiggle room in what constitutes "continuously", but knowing that the code wasn't critical was enough to keep the mission on track.  Eventually, Buzz Aldrin noticed that the code seemed to happen when a particular radar unit was being monitored.  Mission Control took over the monitoring and the code stopped happening.

I now work for a company that has to keep large fleets of computers running to support services that billions of people use daily.  If a major Google service is down for five minutes, it's headline news, often on multiple continents.  It's not the same as making sure a plane or a spaceship lands safely or a hospital doesn't lose power during a hurricane, but it's still high-stakes engineering.

There is a whole profession, Site Reliability Engineer, or SRE for short, dedicated to keeping the wheels turning.  These are highly-skilled people who would have little problem doing my job instead of theirs if they preferred to.  Many of their tools -- monitoring, redundancy, contingency planning, risk analysis, and so on -- can trace their lineage through the Apollo program.  I say "through" because the concepts themselves are considerably older than space travel, but it's remarkable how many of them were not just employed, but significantly advanced, as a consequence of the effort to send people to the moon and bring them back.

One tool in particular, Garman's list of codes, played a key role at that critical juncture.  Today we would call it a playbook.  Anyone who's been on call for a service has used one (I know I have).
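For readers who haven't seen one, a playbook boils down to a lookup table from a code to a pre-decided action, plus a rule for when to stop trusting the table and escalate.  The sketch below is a hypothetical illustration only: the window length, the recurrence threshold and the wording of the entry are my assumptions, standing in for the "OK if it didn't happen continuously" judgment described above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Entry:
    action: str           # what to do when the code shows up
    max_per_window: int   # hypothetical cutoff for "continuously"


# Hypothetical playbook with a single entry modeled on the 1202 story.
PLAYBOOK: Dict[int, Entry] = {
    1202: Entry(action="go, unless it recurs continuously", max_per_window=3),
}

WINDOW_S = 60.0  # hypothetical look-back window, in seconds


def advise(code: int, seen_at: List[float], now: float) -> str:
    """Return the pre-decided action for a code, or escalate."""
    entry = PLAYBOOK.get(code)
    if entry is None:
        return "unknown code: escalate"
    # Count how often the code has fired inside the look-back window.
    recent = [t for t in seen_at if now - t <= WINDOW_S]
    if len(recent) > entry.max_per_window:
        return "recurring too often: escalate"
    return entry.action
```

The point of writing the decision down ahead of time is that nobody has to derive it under pressure; the person on call only has to recognize the code.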

In the end, due to a bit of extra velocity imparted during the maneuver to extract the lunar module and dock it to the command module, the lunar module ended up overshooting its intended landing place.  In order to avoid large boulders and steep slopes in the area they were now approaching, Neil Armstrong ended up flying the module by hand in order to find a good landing spot, aided by a switch to increase or decrease the rate of descent.

The controls were similar to those of a helicopter, except the helicopter was flying sideways through (essentially) a vacuum over the surface of the moon, steered by precisely aimed rocket thrust while continuing to descend, and was made of material approximately the thickness of a soda can which could have been punctured by a good jab with a ball-point pen.  So not really like a helicopter at all.

The Eagle landed with eighteen seconds of fuel to spare.  It helps to have a really, really good pilot.

Tuesday, April 16, 2019

Distributed astronomy

Recently, news sources all over the place have been reporting on the imaging of a black hole, or, more precisely, the immediate vicinity of a black hole.  The black hole itself, more or less by definition, can't be imaged (as far as we know so far).  Confusing things a bit more, any image of a black hole will look like a black disc surrounded by a distorted image of what's actually in the vicinity, but this is because the black hole distorts space-time due to its gravitational field, not because you're looking at something black.  It's the most natural thing in the world to look at the image and think "Oh, that round black area in the middle is the black hole", but it's not.

Full disclosure: I don't completely understand what's going on here.  Katie Bouman has done a really good lecture on how the images were captured, and Matt Strassler also has a really good, though somewhat long, overview of how to interpret all this.  I'm relying heavily on both.

Imaging a black hole in a nearby galaxy has been likened to "spotting a bagel on the moon".  A supermassive black hole at the middle of a galaxy is big, but even a "nearby" galaxy is far, far away.

To do such a thing you don't just need a telescope with a high degree of magnification.  The laws of optics place a limit on how detailed an image you can get from a telescope or similar instrument, regardless of the magnification.  The larger the telescope, the higher the resolution, that is, the sharper the image.  This applies equally well to ordinary optical telescopes, X-ray telescopes, radio telescopes and so forth.  For purposes of astronomy these are all considered "light", since they're all forms of electromagnetic radiation and so all follow the same laws.

Actual telescopes can only be built so big, so in order to get sharper images astronomers use interferometry to combine images from multiple telescopes.  If you have a telescope at the South Pole and one in the Atacama desert in Chile, you can combine their images to get the same resolution you would with a giant telescope that spanned from Atacama to the pole.  The drawback is that since you're only sampling a tiny fraction of the light falling on that area, you have to reconstruct the rest of the image using highly sophisticated image processing techniques.  It helps to have more than two telescopes.  The Event Horizon Telescope project that produced the image used eight, across six sites.
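To put rough numbers on "the larger the telescope, the higher the resolution", here is a back-of-the-envelope sketch using the Rayleigh criterion, theta ~ 1.22 * wavelength / aperture.  The assumptions are mine, not taken from the sources above: an observing wavelength of about 1.3 mm, a hypothetical 30 m single dish, and an idealized baseline equal to the Earth's diameter (real baselines are shorter, and the 1.22 factor strictly applies to a filled circular aperture, so treat this as order-of-magnitude only).

```python
import math

MICROARCSEC_PER_RADIAN = 180.0 * 3600.0 / math.pi * 1e6


def diffraction_limit_uas(wavelength_m: float, aperture_m: float) -> float:
    """Smallest resolvable angle in microarcseconds (Rayleigh criterion)."""
    theta_rad = 1.22 * wavelength_m / aperture_m
    return theta_rad * MICROARCSEC_PER_RADIAN


WAVELENGTH_M = 1.3e-3        # assumed observing wavelength (~1.3 mm)
SINGLE_DISH_M = 30.0         # hypothetical single radio dish
EARTH_DIAMETER_M = 1.2742e7  # idealized longest possible baseline

single_dish = diffraction_limit_uas(WAVELENGTH_M, SINGLE_DISH_M)
earth_sized = diffraction_limit_uas(WAVELENGTH_M, EARTH_DIAMETER_M)

print(f"single dish: {single_dish:.3g} microarcseconds")
print(f"Earth-sized: {earth_sized:.3g} microarcseconds")
```

Under these assumptions the Earth-sized baseline resolves angles in the tens of microarcseconds, several hundred thousand times finer than the single dish.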

Even putting together images from several telescopes, you don't have enough information to precisely know what the full image really would be and you have to be really careful to make sure that the image you reconstruct shows things that are actually there and not artifacts of the processing itself (again, Bouman's lecture goes into detail).  In this case, four teams worked with the raw data independently for seven weeks, using two fundamentally different techniques, to produce the images that were combined into the image sent to the press.  In preparation for that, the image processing techniques themselves were thoroughly tested for their ability to recover images accurately from test data.  All in all, a whole lot of good, careful work by a large number of people went into that (deliberately) somewhat blurry picture.

All of this requires very precise synchronization among the individual telescopes, because interferometry only works for images taken at the same time, or at least to within very small tolerances (once again, the details are ... more detailed).  The limiting factor is the frequency of the light used in the image, which for radio telescopes is on the order of gigahertz. This means that images from the telescopes have to be recorded on the order of a billion times a second.  The total image data ran into the petabytes (quadrillions of bytes), with the eight telescopes producing hundreds of terabytes (that is, hundreds of trillions of bytes) each.

That's a lot of data, which brings us back to the web (as in "Field notes on the ...").  I haven't dug up the exact numbers, but accounts in the popular press say that the telescopes used to produce the black hole images produced "as much data as the LHC produces in a year", which in approximate terms is a staggering amount of data.  A radio interferometer comprising multiple radio telescopes at distant points on the globe is essentially an extremely data-heavy distributed computing system.

Bear in mind that one of the telescopes in question is at the south pole.  Laying cable there isn't a practical option, nor is setting up and maintaining a set of radio relays.  Even satellite communication is spotty.  According to the Wikipedia article, the total bandwidth available is under 10MB/s (consisting mostly of a 50 megabit/second link), which is nowhere near enough for the telescope images, even if stretched out over days or weeks.  Instead, the data was recorded on physical media and flown back to the site where it was actually processed.

I'd initially thought that this only applied to the south pole station, but in fact all six sites flew their data back rather than try to send it over the internet (just to throw numbers out, receiving a petabyte of data over a 10GB/s link would take about a day).   The south pole data just took longer because they had to wait for the antarctic summer.
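The transfer-time figures are easy to check.  A small sketch, assuming decimal units (1 PB = 10^15 bytes) and a link that runs flat out with no overhead, which no real link does:

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # decimal petabyte, in bytes


def transfer_days(n_bytes: float, bytes_per_second: float) -> float:
    """Days needed to move n_bytes at a sustained rate, ignoring overhead."""
    return n_bytes / bytes_per_second / SECONDS_PER_DAY


fast_link = transfer_days(PETABYTE, 10 * 10**9)   # 10 GB/s
polar_link = transfer_days(PETABYTE, 10 * 10**6)  # 10 MB/s

print(f"10 GB/s: {fast_link:.2f} days")
print(f"10 MB/s: {polar_link / 365:.1f} years")
```

At 10 GB/s a petabyte takes a bit over a day; at 10 MB/s the same petabyte takes roughly three years, which is why shipping disks wins.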

Not sure if any carrier pigeons were involved.

Thursday, April 4, 2019

Martian talk

This morning I was on the phone with a customer service representative about emails I was getting from an insurance company and which were clearly meant for someone else with a similar name (fortunately nothing earth-shaking, but still something this person would probably like to know about).  As is usually the case, the reply address was a bit bucket, but there were a couple of options in the body of the email: a phone number and a link.  I'd gone with the phone number.

The customer service rep politely suggested that I use the link instead.  I chased the link, which took me to a landing page for the insurance company.  Crucially, it was just a plain link, with nothing to identify where it had come from*.  I wasn't sure how best to try to get that across to the rep, but I tried to explain that usually there are a bunch of magic numbers or "hexadecimal gibberish" on a link like that to tie it back to where it came from.

"Oh yeah ... I call that 'Martian talk'," the rep said.

"Exactly.  There's no Martian talk on the link.  By the way, I think I'm going to start using that."

We had a good laugh and from that point on we were on the same page.  The rep took all the relevant information I could come up with and promised to follow up with IT.

What I love about the term 'Martian talk' is that it implies that there's communication going on, but not in a way that will be meaningful to the average human, and that's exactly what's happening.

And it's fun.

I'd like to follow up at some point and pull together some of the earlier posts on Martian talk -- magic numbers, hexadecimal gibberish and such -- but that will take more attention than I have at the moment.

* From a strict privacy point of view there would be plenty of clues, but there was nothing to tie the link to a particular account for that insurance company, which was what we needed.

Thursday, January 3, 2019

Hats off to New Horizons

A few years ago, around the time of the New Horizons encounter with Pluto (or if you're really serious about the demotion thing, minor planet 134340 Pluto), I gave the team a bit of grief over the probe having to go into "safe mode" with only days left before the flyby, though I also tried to make clear that this was still engineering of a very high order.

Early on New Year's Day (US Eastern time), New Horizons flew by a Kuiper Belt object nicknamed Ultima Thule (two syllables in Thule: THOO-lay).  I'm posting to recognize the accomplishment, and this post will be grief-free.

The Ultima Thule encounter was much like the Pluto encounter with a few minor differences:
  • Ultima Thule is much smaller.  Its long axis is about 1-2% of Pluto's diameter
  • Ultima Thule is darker, reflecting about 10% of the light that reaches it, compared to around 50% for Pluto. Ultima Thule is about as dark as potting soil.  Pluto is more like old snow.
  • Ultima Thule is considerably further away (about 43 AU from the sun as opposed to about 33 AU for Pluto at the time of encounter -- an AU is the average distance from the Sun to the Earth)
  • New Horizons passed much closer to Ultima Thule than it did to Pluto (3,500 km vs. 12,500 km).  This requires more accurate navigation and to some extent increased the chances of a disastrous collision with either Ultima Thule or, more likely, something near it that there was no way to know about.  At 50,000 km/h, even a gravel-sized chunk would cause major if not fatal damage.
  • Because Ultima Thule is further away, radio signals take proportionally longer to travel between Earth and the probe, about six hours vs. about four hours.
  • Because Ultima Thule is much smaller, much darker and significantly further away, it's much harder to spot from Earth.  Before New Horizons, Pluto itself was basically a tiny dot, with a little bit of surface light/dark variation inferred by taking measurements as it rotated.  Ultima Thule was nothing more than a tinier dot, and a hard-to-spot dot at that.
  • We've had decades to work out exactly where Pluto's orbit goes and where its moons are.  Ultima Thule wasn't even discovered until after New Horizons was launched.  Until a couple of days ago we didn't even know whether it had moons, rings or an atmosphere (it appears to have none).  [Neither Pluto nor Ultima Thule is a stationary object, just to add that little additional degree of difficulty.  The Pluto flyby might be considered a bit more difficult in that respect, though.  Pluto's orbital speed at the time of the flyby was around 20,000 km/h, while Ultima Thule's is closer to 16,500 km/h.  I'd think this would mainly affect the calculations for rotating to keep the cameras pointed, so it probably doesn't make much practical difference.]
In both cases, New Horizons had to shift from pointing its radio antenna at Earth to pointing its cameras at the target.  As it passes by the target at around 50,000 km/h, it has to rotate to keep the cameras pointed correctly, while still out of contact with Earth (which is light-hours away in any case).  It then needs to rotate its antenna back toward Earth, "phone home" and start downloading data at around 1,000 bits per second.  Using a 15-watt transmitter slightly more powerful than a CB radio.  Since this is in space, rotating means firing small rockets attached to the probe in a precise sequence (there are also gyroscopes on New Horizons, but they're not useful for attitude changes).
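The signal delays quoted above follow directly from the distances.  A quick check, using the distance from the Sun as a stand-in for the distance from Earth (an approximation that can be off by up to about 1 AU either way):

```python
AU_KM = 149_597_870.7  # one astronomical unit, in km
C_KM_S = 299_792.458   # speed of light, in km/s


def one_way_light_hours(distance_au: float) -> float:
    """One-way signal travel time in hours over the given distance."""
    return distance_au * AU_KM / C_KM_S / 3600.0


thule_hours = one_way_light_hours(43)  # Ultima Thule, about 43 AU
pluto_hours = one_way_light_hours(33)  # Pluto at encounter, about 33 AU

print(f"Ultima Thule: {thule_hours:.1f} h, Pluto: {pluto_hours:.1f} h")
```

That works out to just under six hours for Ultima Thule and about four and a half for Pluto, each way.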

So, a piece of cake, really.

Seriously, though, this is amazing engineering and it just gets more amazing the more you look at it.  The Pluto encounter was a major achievement, and this was significantly more difficult in nearly every possible way.

So far there don't seem to be any close-range images of Ultima Thule on the mission's web site (see, this post is actually about the web after all), but the team seems satisfied that the flyby went as planned and more detailed images will be forthcoming over the next 20 months or so.  As I write this, New Horizons is out of communication, behind the Sun from Earth's point of view for a few days, but downloads are set to resume after that.  [The images started coming in not long after this was posted, of course --D.H. Jul 2019]

Thursday, December 13, 2018

Common passwords are bad ... by definition

It's that time of the year again, time for the annual lists of worst passwords.  Top of at least one list: 123456, followed by password.  It just goes to show how people never change.  Silly people!

Except ...

A good password has a very high chance of being unique, because a good password is selected randomly from a very large space of possible passwords.  If you pick your password at random from a trillion possibilities, then the odds that a particular person who did the same also picked your password are one in a trillion, as are the odds that any particular two people picked the same password.  The odds that one of a million other such people picked your password are about one in a million.  If a million people used the same scheme as you did, there's a good chance that some pair of them accidentally share a password, but almost certainly almost all of those passwords are unique.
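These odds can be checked in a few lines.  The sketch below assumes passwords are drawn uniformly and independently, and uses the standard birthday-problem approximation for the "some pair" case:

```python
import math

SPACE = 10**12   # a trillion possible passwords
PEOPLE = 10**6   # a million people using the same scheme


def p_someone_matches_mine(others: int, space: int) -> float:
    """Chance that at least one of `others` people picked my password."""
    # 1 - (1 - 1/space)**others, written to avoid floating-point loss.
    return -math.expm1(others * math.log1p(-1.0 / space))


def p_any_pair_collides(people: int, space: int) -> float:
    """Birthday approximation: chance that some two people share a password."""
    return -math.expm1(-people * (people - 1) / (2.0 * space))


mine = p_someone_matches_mine(PEOPLE, SPACE)
any_pair = p_any_pair_collides(PEOPLE, SPACE)

print(f"someone matches mine: {mine:.3g}")
print(f"some pair collides:   {any_pair:.3g}")
```

The first number comes out at about one in a million; the second is a little under 40%, which is the "good chance" of at least one accidental collision.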

If you count up the most popular passwords in this scenario, you'll get a fairly tedious list:
  • 1: some string of random gibberish, shared by two people
  • 2 - 999,999: Other strings of random gibberish, 999,998 in all
Now suppose that seven people didn't get the memo.  Four of them choose 123456 and three of them choose password.  The list now looks like
  • 1: 123456,  shared by four people
  • 2: password,  shared by three people
  • 3: some string of random gibberish, shared by two people
  • 4-999,994:  Other strings of random gibberish, 999,991 in all
Those seven people are pretty likely to have their passwords hacked, but overall password hygiene is still quite good -- 99.9993% of people picked a good password.  It's certainly better than if 499,999 people picked 123456 and 499,998 picked password, two happened to pick the same strong password and the other person picked a different strong password, even though the resulting rankings are the same as above.
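The two scenarios can be put side by side in code.  This sketch builds both populations of a million users directly from the counts given above (the placeholder names for the random passwords are mine) and shows that the top of the ranking is identical while the share of weak passwords is wildly different:

```python
from collections import Counter
from typing import List

WEAK = ("123456", "password")


def population(n_123456: int, n_password: int, n_unique: int) -> Counter:
    """Password counts: two weak passwords, one shared strong one, the rest unique."""
    counts = Counter({"123456": n_123456, "password": n_password, "<shared gibberish>": 2})
    for i in range(n_unique):
        counts[f"<gibberish {i}>"] = 1
    return counts


def top_names(counts: Counter, k: int = 3) -> List[str]:
    """The k most popular passwords, most popular first."""
    return [name for name, _ in counts.most_common(k)]


def weak_share(counts: Counter) -> float:
    """Fraction of all users who picked one of the weak passwords."""
    return sum(counts[w] for w in WEAK) / sum(counts.values())


mostly_good = population(4, 3, 999_991)
mostly_bad = population(499_999, 499_998, 1)

print(top_names(mostly_good), weak_share(mostly_good))
print(top_names(mostly_bad), weak_share(mostly_bad))
```

A "worst passwords" list only reports the first of those two outputs, which is why it can't distinguish the two worlds.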

Likewise, if you see a list of 20 worst passwords taken from 5 million leaked passwords, that could mean anything from a few hundred people having picked bad passwords to everyone having done so.  It would be more interesting to report how many people picked popular passwords as opposed to unique ones, but that doesn't seem to make its way into the "wow, everyone's still picking bad passwords" stories.

From what I was able to dig up, that portion is probably around 10%.  Not great, but not horrible, and probably less than it was ten years ago.  But as long as some people are picking bad passwords, the lists will stay around and the headlines will be the same, regardless of whether most people are doing a better job.

(I would have provided a link for that 10%, but the site I found it on had a bunch of broken links and didn't seem to have a nice tabular summary of bad passwords vs other passwords from year to year, so I didn't bother)

Saturday, December 8, 2018

Software cities

In the previous post I stumbled on the idea that software projects are like cities.  The more I thought about it, I said, the more I liked the idea.  Now that I've had some more time to think about it, I like the idea even more, so I'd like to try to draw the analogy out a little bit, ideally not past the breaking point.

What first drew me to the concept was realizing that software projects, like cities, are neither completely planned nor completely unplanned.  Leaving aside the question of what level of planning is best -- which surely varies -- neither of the extremes is likely to actually happen in real life.

If you try to plan every last detail, inevitably you run across something you didn't anticipate and you'll have to adjust.  Maybe it turns out that the place you wanted to put the city park is prone to flooding, or maybe you discover that the new release of some platform you're depending on doesn't actually support what you thought it did, or at least not as well as you need it to.

Even if you could plan out every last detail of a city, once people start living in it, they're going to make changes and deviate from your assumptions.  No one actually uses that beautiful new footbridge, or if they do, they cut across a field to get to it and create a "social trail" thereby bypassing the carefully designed walkways.  People start using an obscure feature of one of the protocols to support a use case the designers never thought of.  Cities develop and evolve over time, with or without oversight, and in software there's always a version 2.0 ... and 2.1, and 2.2, and 2.2b (see this post for the whole story).

On the other hand, even if you try to avoid planning and let everything "just grow", planning happens anyway.  If nothing else, we codify patterns that seem to work -- even if they arose organically with no explicit planning -- as customs and traditions.

In a distant time in the Valley, I used to hear the phrase "paving the cow paths" quite a bit.  It puzzled me at first -- Why pave a perfectly good cow path?  Cattle are probably going to have a better time on dirt, and that pavement probably isn't going to hold up too well if you're marching cattle on it all the time ...  Eventually I came to understand that it wasn't about the cows.  It was about taking something that people had been doing already and upgrading the infrastructure for it.  Plenty of modern-day highways (or at least significant sections of them) started out as smaller roads which in turn used to be dirt roads for animals, foot traffic and various animal-drawn vehicles.

Upgrading a road is a conscious act requiring coordination across communities all along the roadway.  Once it's done, it has a significant impact on communities on the road, which expect to benefit from increased trade and decreased effort of travel, but also communities off the road, which may lose out, or may alter their habits now that the best way to get to some important place is by way of the main road and not the old route.  This sort of thing happens both inside and outside cities, but for the sake of the analogy think of ordinary streets turning into arterials or bypasses and ring roads diverting traffic around areas people used to have to cross through.

One analogue of this in software is standards.  Successful standards tend to arise when people get together to codify existing practice, with the aim of improving support for things people were doing before the standard, just in a variety of similar but still needlessly different ways.  Basically pick a route and make it as smooth and accessible as possible.  This is a conscious act requiring coordination across communities, and once it's done it has a significant impact on the communities involved, and on communities not directly involved.

This kind of thing isn't always easy.  A business district thrives and grows, and more and more people want to get to it.  Traffic becomes intolerable and the city decides to develop a new thoroughfare to carry traffic more efficiently (thereby, if it all works, accelerating growth in the business district and increasing traffic congestion ...).  Unfortunately, there's no clear space for building this new thoroughfare.  An ugly political fight ensues over whose houses should get condemned to make way and eventually the new road is built, cutting through existing communities and forever changing the lives of those nearby.

One analogue of this in software is the rewrite.  A rewrite almost never supports exactly the same features as the system being rewritten.  The reasons for this are probably material for a separate post, but the upshot is that some people's favorite features are probably going to break with the rewrite, and/or be replaced by something different that the developers believe will solve the same problem in a way compatible with the new system.  Even if the developers are right about this, which they often are, there's still going to be significant disruption (albeit nothing on the order of having one's house condemned).

Behind all this, and tying the two worlds of city development and software development together, is culture.  Cities have culture, and so do major software projects.  Each has its own unique culture, but, whether because the same challenges recur over and over again, leading to similar solutions, or because some people are drawn to large communities while others prefer smaller, the cultures of different cities tend to have a fair bit in common, perhaps more in common with each other than with life outside them.  Likewise with major software projects.

Cities require a certain level of infrastructure -- power plants, traffic lights, parking garages, public transport, etc. -- that smaller communities can mostly do without.  Likewise, a major software project requires some sort of code repository with version control, some form of code review to control what gets into that repository, a bug tracking system and so forth.  This infrastructure comes at a price, but also with significant benefits.  You don't have to do everything yourself, and at a certain point you can't do everything yourself.  That means people can specialize, and to some extent have to specialize.  This both requires a certain kind of culture and tends to foster that same sort of culture.

It's worth noting that even large software projects are pretty small by the standards of actual cities.  Somewhere around 15,000 people have contributed to the git repository for the Linux kernel.  There appear to be a comparable (but probably smaller) number of Apache committers.  As with anything else, some of these are more active in the community than others.  On the corporate side, large software companies have tens of thousands of engineers, all sharing more or less the same culture.

Nonetheless, major software projects somehow seem to have more of the character of large cities than one might think based on population.  I'm not sure why that might be, or even if it's really true once you start to look more closely, but it's interesting that the question makes sense at all.

Sunday, November 4, 2018

Waterfall vs. agile

Near where I work is a construction project for a seven-floor building involving dozens of people on site and suppliers from all over the place, supplying items ranging from local materials to an impressively tall crane from a company on another continent.  There are literally thousands of things to keep track of, from the concrete in the foundation to the location of all the light switches to the weatherproofing for the roof.  The project will take over a year, and there are significant restrictions on what can happen when.  Obviously you can't put windows on before there's a wall in place to put them in, but less obviously there are things you can't do during the several weeks that the structural concrete needs to cure, and so forth.

Even digging the hole for the parking levels took months and not a little planning.  You have to have some place to put all that dirt and the last part takes longer since you no longer have a ramp to drive things in and out with, so whatever you use for that last part has to be small enough you can lift it out.

I generally try to keep in mind that no one else's job is as simple as it looks when you don't have to actually do it, but this is a particularly good example.  Building a building this size, to say nothing of an actual skyscraper, is real engineering.  Add into the mix that lives are literally on the line -- badly designed or built structures do fail and kill people, not to mention the abundance of hazards on a construction site -- and you have a real challenge on your hands.

And yet, the building in question has been proceeding steadily and there's every reason to expect that it will be finished within a few weeks of the scheduled date and at a cost reasonably close to the initial estimate.

We can't do that in my line of work.

Not to say it's never happened, but it's not the norm.  For example, I'm trying to think of a major software provider that still gives dates and feature lists for upcoming releases.  Usually you have some idea of what month the next release might be in, and maybe a general idea of the major features that the marketing is based around, but beyond that, it comes out when it comes out and whatever's in it is in it.  That fancy new feature might be way cooler and better than anyone expected, or it might be a half-baked collection of somewhat-related upgrades that only looks like the marketing if you squint hard enough.

The firmer the date, the vaguer the promised features and vice versa ("schedule, scope, staff: pick two").  This isn't isolated to any one provider (I say "provider" rather than "company" so as to include open source).  Everyone does it in their own way.

In the construction world, this would be like saying "The new building will open on November 1st, but we can't say how many floors it will have or whether there will be knobs on the doors" or "This building will be completely finished somewhere between 12 and 30 months from now."  It's not that construction projects never overrun or go over budget, just that the normal outcomes are in a much tighter range and people's expectations are set accordingly.

[Re-reading this, I realize I didn't mention small consultants doing projects like putting up a website and social media presence for a local store.  I haven't been close to that end of the business for quite a while, but my guess is that delivering essentially on time and within the budget is more common.  However, I'm more interested here in larger projects, like, say, upgrading trade settlement for a major bank.  I don't have a lot of data points for large consultants in such situations, but what I have seen tends to bear out my main points here]

Construction is a classic waterfall process.  In fact, the use of roles like "architect" and "designer" in a waterfall software methodology gives a pretty good hint where the ideas came from.  In construction, you spend a lot of time up front working with an architect and designer to develop plans for the building.  These then get turned into more detailed plans and drawings for the people actually doing the construction.  Once that's done and construction is underway, you pretty much know what you're supposed to be getting.

In between design and construction there's a fair bit of planning that the customer doesn't usually see once the design itself is complete.  For example, if your building will have steel beams, as many do, someone has to produce the drawing that says exactly what size and grade of beam to use, how long to cut it and (often) exactly where and what size to drill the holes so it can be bolted together with the other steel pieces.  Much of this process is now automated with CAD software, and for that matter more and more of the actual cutting and drilling is automated, but the measurements still have to be specified and communicated.

Even if there's a little bit of leeway for changes later in the game -- you don't necessarily have to select all the paint colors before they pour concrete for the underground levels -- for the most part you're locked in once the plans are finalized.  You're not going to decide that your seven-level building needs to be a ten-level building while the concrete is curing, or if you do, you'll need to be ready to shell out a lot of money and throw the schedule out the window (if there's one to throw it out of).

Interwoven with all this is a system of zoning, permitting and inspections designed to ensure that your building is safe and usable, and fits in well with the neighborhood and the local infrastructure.  Do you have enough sewer capacity?  Is the building about the same height as the buildings around it (or is the local government on board with a conspicuously tall or short one)?  Will the local electrical grid handle your power demand, and is your wiring sufficient?  This will typically involve multiple checks: The larger-scale questions like how much power you expect to use are addressed during permitting, the plans will be inspected before construction, the actual wiring will be inspected after it's in, and the contractor will need to be able to show that all the electricians working on the job are properly licensed.

This may seem like a lot of hassle, and it is, but most regulations are in place because people learned the hard way.  Wiring from the early 1900s would send most of today's licensed electricians running to the fuse box (if there is one) to shut off the power, or maybe just running out of the immediate area.  There's a reason you Don't Do Things That Way any more: buildings burned down and people got electrocuted.

Putting all this together, large-scale construction uses a waterfall process for two reasons: First, you can't get around it.  It's effectively required by law.  Second, and more interesting here, is that it works.

Having a standard process for designing and constructing a building, standard materials and parts and a standard regime of permits, licenses and inspections gives everyone involved a good idea of what to expect and what to do.  Having the plans finalized allows the builder to order exactly the needed materials (or more realistically, to within a small factor) and precisely track the cost of the project.  Standards and processes allow people to specialize.  A person putting up drywall knows that the studs will be placed with a standard spacing compatible with a standard sheet of drywall without having to know or care exactly how those studs got there.

Sure, there are plenty of cases of construction projects going horribly off the rails, taking several times as long as promised and coming in shockingly over budget, or turning out to be unusable or having major safety issues requiring expensive retrofitting.  It happens, and when it does lots of people tend to hear about it.  But drive into a major city some time.  All around you will be buildings that went up without significant incident, built more or less as I described above.  On your way in will be suburbs full of cookie-cutter houses built the same way on a smaller scale.  The vast majority of them will stand for generations.

Before returning to the field of software, it's worth noting that this isn't the only way to build.  Plenty of houses are built by a small number of people without a lot of formal planning.  Plenty of buildings, particularly houses, have been added onto in increments over time.  People rip out and replace cabinetry or take out walls (ideally non-structural ones) with little more than general knowledge of how houses are put together.  At a larger scale, towns and cities range from carefully planned (Brasilia, Washington, D.C.) to agglomerations of ancient villages and newer developments with the occasional Grand Plan forced through (London comes to mind).

So why does a process that works fine for projects like buildings, roads and bridges or, with appropriate variations, Big Science projects like CERN, the Square Kilometer Array or the Apollo missions, not seem to carry over to software?  I've been pondering this from time to time for decades now, but I can't say that I've arrived at any definitive answer.  Some possibilities:
  • Software is soft.  Software is called "software" because, unlike the hardware, which is wired together and typically shipped off to a customer site, you can change any part of the software whenever you want.  Just restart the server with a new binary, or upgrade the operating system and reboot.  You can even do this with the "firmware" -- code and data that aren't meant to be changed often -- though it will generally require a special procedure.  Buildings and bridges are quintessential hardware.
Yes, but ... in practice you can't change anything you want at any time.  Real software consists of a number of separate components acting together.  When you push a button on a web site, the browser interprets the code for the web page and, after a series of steps, makes a call to the operating system to send a message to the server on the other end.  The server on the other end is typically a "front-end" whose job is to dispatch the message to another server, which may then talk to several others, or to a database, or possibly to a server elsewhere on the web or on the internal network, or most likely a combination of these, in order to do the real work.  The response comes back, the operating system notifies the browser and the browser interprets more of the web page's code in order to update the image on the screen.

This is actually a highly simplified view.  The point here is that all these pieces have to know how to talk to each other and that puts fairly strict limits on what you can change how fast.  Some things are easy.  If I figure out a way to make a component do its current job faster, I can generally just roll that out and the rest of the world is fine, and ideally happier since things that depend on it should get faster for free (otherwise, why make the change?).  If I want to add a new feature, I can generally do that easily too, since no one has to use it until it's ready.

But if I try to change something that lots of people are already depending on, that's going to require some planning.  If it's a small change, I might be able to get by with a "flag" or "mode switch" that says whether to do things the old way or the new way.  I then try to get everyone using the service to adapt to the new way and set the switch to "new".  When everyone's using the "new" mode, I can turn off the "old" mode and brace for an onslaught of "hey, why doesn't this work any more?" from whoever I somehow missed.

Larger changes require much the same thing on a larger scale.  If you hear terms like "backward compatibility" and "lock-in", this is what they're talking about.  There are practices that try to make this easier and even to anticipate future changes ("forward compatibility" or "future-proofing").  Nonetheless, software is not as soft as one might think.
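The mode-switch migration described above can be sketched in a few lines.  This is only a toy under stated assumptions: the Service class, the client names and the two greeting formats are all made up for illustration, and a real rollout would involve configuration systems, monitoring and a lot more care.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Toy service that can answer in an old or a new format (hypothetical)."""
    old_mode_enabled: bool = True
    # Per-client switch, "old" or "new".  Clients not listed default to "old".
    client_modes: dict = field(default_factory=dict)

    def set_mode(self, client: str, mode: str) -> None:
        if mode not in ("old", "new"):
            raise ValueError(f"unknown mode: {mode}")
        self.client_modes[client] = mode

    def handle(self, client: str, name: str) -> str:
        mode = self.client_modes.get(client, "old")
        if mode == "new":
            return f"Hello, {name}!"
        if not self.old_mode_enabled:
            # The "hey, why doesn't this work any more?" case.
            raise RuntimeError(f"{client} still uses the retired old mode")
        return f"HELLO {name.upper()}"

    def clients_on_old_mode(self, known_clients) -> list:
        """List the clients that have not yet flipped the switch to "new"."""
        return sorted(c for c in known_clients
                      if self.client_modes.get(c, "old") == "old")


service = Service()
clients = ["billing", "search", "reports"]  # made-up client names
service.set_mode("billing", "new")
service.set_mode("search", "new")
stragglers = service.clients_on_old_mode(clients)
print(stragglers)
```

The catch is in clients_on_old_mode: it can only report on clients you know about.  Setting old_mode_enabled to False is how you find the ones you somehow missed.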
  • Software projects are smaller.  Just as it's a lot easier to build at least some kind of house than a seven-level office building, it's easy for someone to download a batch of open source tools and put together an app or an open source tool or any of a number of other things (my post on Paul Downey's hacking on a CSV of government data shows how far you can get with not-even-big-enough-to-call-a-project).  There are projects just like this all over GitHub.
Yes, but ... there are lots of small projects around, some of which even see a large audience, but if this theory were true you'd expect to see a more rigid process as projects get larger, as you do in moving from small DIY projects to major construction. That's not necessarily the case.  Linux grew from Linus's two processes printing "A" and "B" on the screen to millions of lines of kernel code and millions more of tools and applications.  The development process may have gradually become more structured over time, but it's not anything like a classic waterfall.  The same could be said of Apache and any number of similar efforts.  Projects do tend to add some amount of process as they go ("So-and-so will now be in charge of the Frobulator component and will review all future changes"), but not nearly to the point of a full-fledged waterfall.

For that matter, it's not clear that Linux or Apache should be considered single projects.  They're more like a collection of dozens or hundreds of projects, each with its own specific standards and practices, but nonetheless they fit together into a more or less coherent whole.  The point being, it can be hard to say how big a particular project is or isn't.

I think this is more or less by design.  Software engineering has a strong drive to "decouple" so that different parts can be developed largely independently.  This generally requires being pretty strict about the boundaries and interfaces between the various components, but that's a consequence of the desire to decouple, not a driving principle in and of itself.  To the extent that decoupling is successful, it allows multiple efforts to go on in parallel without a strictly-defined architecture or overall plan.  The architecture, such as it is, is more an emergent property of the smaller-scale decisions about what the various components do and how they interact.

The analogy here is more with city planning than building architecture.  While it's generally good to have someone taking the long view in order to help keep things working smoothly overall, this doesn't mean planning out every piece in advance, or even having a master plan.  You can get along surprisingly well with the occasional "Wait, how is this going to work with that" or "There are already two groups working on this problem -- maybe you should talk to them before starting a third one".

Rather than a single construction project, a project like Linux or Apache is much like a city growing in stages over time.  In fact, the more I think about it, the more I like that analogy.  I'd like to develop it further.  I had wanted to add a few more bullet points, particularly "Software is innovative" -- claiming that people generally write software in order to do something new, while a new skyscraper, however innovative the design, is still a skyscraper (and more innovative buildings tend to cost more, take longer and have a higher risk of going over budget) -- but I think I'll leave that for now.  This post is already on the longish side and developing the city analogy seems more interesting at the moment.

Sunday, October 7, 2018

Economics of anonymity -- but whose economics?

Re-reading the posts I've tagged with the anonymity label, I think I finally put my finger on something that had been bothering me pretty much from the beginning.  I'd argued (here, for example) that economic considerations should make it hard to successfully run an anonymizer -- a service that allows you to connect with sites on the internet without giving away who or where you are.  And yet they exist.

The argument was that the whole point of an anonymizer is that whoever's connecting to a given site could be anyone using the anonymizer, that is, if you're using the anonymizer, you could be associated with anything that anyone on the anonymizer was doing.  Since some things that people want to do anonymously are risky, an anonymizer is, in some sense, distributing risk.  People doing risky things should be happy to participate, but people doing less risky things may not be so willing.  As a result, there may not be enough people participating to provide adequate cover.

However, this assumes that anyone can be (mis)taken for anyone else.  At the technical level of IP addresses, this is true, but at the level of who's actually doing what, which is what really matters if The Man comes knocking, it's less applicable.

There are lots of reasons to want anonymity -- the principle of the thing, the ability to do stuff you're not supposed to be able to do at work, wanting to hide embarrassing activity from the people around you, wanting to blow the whistle on an organization, communicating to the outside world from a repressive regime, dealing in illicit trade, planning acts of violence and many others.  The fact of using an anonymizer says little about why you might be doing it.

If I'm anonymously connecting to FaceSpace at work, there's little chance that the authorities in whatever repressive regime will come after me for plotting to blow up their government buildings, and vice versa (mutatis mutandis, et cetera).  In other words, there's probably not much added risk for people doing relatively innocuous things in places where using an anonymizer is not itself illegal.

On the other hand, this is little comfort to someone trying to, say, send information out of a place where use of the internet, and probably anonymizers in particular, is restricted.  The local authorities will probably know exactly which hosts are connecting with the anonymizer's servers and make it their business to find out who's associated with those hosts -- a much smaller task than tracking down all users of the anonymizer.

This is much the same situation as, say, spies in WWII trying to send radio messages out before the local authorities can triangulate their position.   Many of the same techniques should apply -- never setting up in the same place twice, limiting the number of people you communicate with and how much you know about them, limiting the amount of information you know, and so forth.

So I suppose I'll be filing this under not-so-disruptive technology as well as anonymity.

Saturday, September 29, 2018

Agile vs. waterfall

Another comment reply that outgrew the comment box.  Earl comments
This speaks to my prejudice in favor of technique over technology. And the concept of agility seems to be just the attitude of any good designer that your best weapon is your critical sense, and your compulsion to discard anything that isn't going to work.
To which I would reply:

Sort of ... "agile" is a term of art that refers to a collection of practices aimed at reducing the lag between finding out people want something and giving it to them.  Arguably the core of it is "launch and iterate", meaning "put your best guess out there, find out what it still needs, fix the most important stuff and try again".

This is more process than design, but there are definitely some design rules that tend to go with agile development, particularly "YAGNI", short for "You ain't gonna need it", which discourages trying to anticipate a need you don't yet know that you have.  In more technical terms, this means not trying to build a general framework for every possible use case, but being prepared to "refactor" later on if you find out that you need to do more than you thought you did.  Or, better, designing in such a way that later functionality can be added with minimum disruption to what's already there, often by having less to disrupt to begin with, because ... YAGNI.

Downey refers to "agile" both generally and in the more specific context of "agile vs. waterfall".  The "waterfall" design process called for exhaustively gathering all requirements up front, then producing a design to meet those requirements, then implementing the design, then independently testing the implementation against the requirements, fixing any bugs, retesting and eventually delivering a product to the customer.  Each step of the process flows into the next, and you only go forward, much like water flowing over a series of cascades.  Only the test/fix/retest/... cycle was meant to be iterative, and ideally with as few iterations as possible.  Waterfall projects can take months at the least and more typically years to get through all the steps, at which point there's a significant chance that the customer's understanding of what they want has evolved -- but don't worry, we can always gather more requirements, produce an improved design ...

(As an aside, Downey alludes to discussion over whether "customer" is an appropriate term for someone, say, accessing a public data website.  A fair point.  I'm using "customer" here because in this case this is someone paying money for the service of producing software.   The concept of open source cuts against this, but that's a whole other discussion.)

The waterfall approach can be useful in situations like space missions and avionics.  In the first case, when you launch, you're literally launched and there is no "iterate".  In the second, the cost of an incomplete or not-fully-vetted implementation is too high to risk.  However, there's a strong argument to be made that "launch and iterate" works in more cases than one might think.

In contrast to waterfall approaches, agile methodologies think more in terms of weeks.  A series of two-week "sprints", each producing some number of improvements from a list, is a fairly common approach. Some web services go further and use a "push on green" process where anything that passes the tests (generally indicated by a green bar on a test console) goes live immediately.  Naturally, part of adding a new feature is adding tests that it has to pass, but that should generally be the case anyway.
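As a rough illustration of the "push on green" idea, here is a minimal sketch.  The function name push_on_green, the test names and the deploy callback are all hypothetical, and real systems hang a great deal more off this gate (staged rollouts, monitoring, rollback), but the core rule is just "deploy only if every test passes".

```python
from typing import Callable, Dict


def push_on_green(tests: Dict[str, Callable[[], bool]],
                  deploy: Callable[[], None]) -> bool:
    """Run every test; call deploy() only if all of them pass."""
    failures = [name for name, test in tests.items() if not test()]
    if failures:
        print("red:", ", ".join(failures))
        return False
    deploy()
    return True


deployed = []  # stands in for "what is live", for illustration only
tests = {
    "adds_numbers": lambda: 1 + 1 == 2,
    # A new feature arrives together with a test it has to pass.
    "new_feature_works": lambda: "agile".upper() == "AGILE",
}
went_live = push_on_green(tests, lambda: deployed.append("build-1"))
```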

Superficially, a series of two-week sprints may seem like a waterfall process on a shorter time scale, but I don't think that's a useful comparison.  In a classic waterfall, you talk to your customer up front and then go dark for months, or even a year or more while the magic happens, though the development managers may produce a series of progress reports with an aggregate number of requirements implemented or such.  Part of the idea of short sprints, on the other hand, is to stay in contact with your customer in order to get frequent feedback on whether you're doing the right thing.   Continuous feedback is one of the hallmarks of a robust control system, whether in software or steam engines.

There are also significant differences in the details of the processes.  In an agile process, the list of things to do (often organized by "stories") can and does get updated at any time.  The team will generally pick a set of things to implement at the beginning of a sprint in order to coordinate their efforts, but this is more a tactical decision, and "requirements gathering" is not blocked while the developers are implementing.

Work in agile shops tends to be estimated in relative terms like "small", "medium" or "large", since people are much better at estimating relative sizes than absolute ones, and there's generally an effort to break "large" items into smaller pieces, since small pieces are easier to estimate.  Since this is done frequently, everyone ends up doing a bunch of fairly small-scale estimates on a regular basis, and hopefully skills improve.

Waterfall estimates are generally done up front by specialists.  By the end of the design phase, you should have a firm estimate of how long the rest will take (and, a cynic might add, a firm expectation of putting in serious overtime as the schedule begins to slip).

It's not clear how common a true waterfall process is in practice.  I've personally only seen it once up close, and the result was a slow-motion trainwreck the likes of which I hope never to see again.  Among other things, the process called for designers to reduce their designs to "pseudocode", which is basically a detailed description of an algorithm using words instead of a formal computer language.

This was to be done in such detail that the actual coder hired to produce the code would not have to make any decisions in translating the pseudocode to actual code.  This was explicitly stated in the (extensive) process documentation.  But if you can explain something in that much detail, you've essentially coded it and you're just using the coder as an expensive human typewriter, not a good proposition for anyone involved.  You've also put a layer of scheduling and paperwork between designing an algorithm and finding out whether it works.

We did, however, produce an impressive volume of paper binders full of documentation.  I may still have a couple somewhere.  I'm not sure I or anyone else has ever needed to read them.

This is an extreme case, but the mindset behind it is pervasive enough to make "agile vs. waterfall" a real controversy.  As with all such controversies, at least some of the waterfallish practices actually out there have more merit than the extreme case.  The extreme case, even though it does exist in places, functions more as a strawman.  Nonetheless, I tend to favor the sort of "admirable impatience" that Downey exemplifies.  Like anything else it can be taken too far, but not in the case at hand.

Friday, September 28, 2018

One CSV, 30 stories (for small values of 30)

While re-reading some older posts on anonymity (of which more later, probably), and updating the occasional broken link, I happened to click through on the credit link on my profile picture.  Said link is still in fine fettle and, while it hasn't been updated in a while, and one of the more recent posts is Paul Downey chastising himself for just that, there's still plenty of interesting material there, including the (current) last post, now nearly three years old, with a brilliant observation on "scope creep".

What caught my attention in particular was the series One CSV, thirty stories, which took on the "do 30 Xs in 30 days" kind of challenge in an effort to kickstart the blog.  Taken literally, it wasn't a great success -- there only ended up being 21 stories, and there hasn't been much on the blog since -- but purely from a blogging point of view I'd say the experiment was indeed a success.

Downey takes a single, fairly large, CSV file containing records of land sales transactions from the UK and proceeds to turn this raw data into useful and interesting information.  The analysis starts with basic statistics such as how many transactions there are (about 19 million), how many years they cover (20) and how much money changed hands (about £3 trillion) and ends up with some nifty visualizations showing changes in activity from day to day within the week, over the course of the year and over decades.
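To give a feel for that first step, here is a minimal sketch in Python of the same kind of basic statistics.  It is not Downey's code (the stories start with plain Unix commands), and the three sample rows and the column names price, date and postcode are made up for illustration; the real file's layout may well differ.

```python
import csv
import io

# Made-up sample rows; the column names are assumptions for illustration.
SAMPLE = """price,date,postcode
125000,1995-03-14,NE1 4XF
89000,1995-11-02,LS6 2AB
310000,2014-07-21,SW1A 1AA
"""


def summarize(lines):
    """Return (transaction count, number of distinct years, total price)."""
    count = 0
    total = 0
    years = set()
    for row in csv.DictReader(lines):
        count += 1
        total += int(row["price"])
        years.add(row["date"][:4])  # the year is the first four characters
    return count, len(years), total


count, n_years, total = summarize(io.StringIO(SAMPLE))
print(count, n_years, total)
```

Because summarize reads one row at a time, the same loop should cope with a file of millions of rows on a laptop by passing it an open file instead of the in-memory sample.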

This is all done with off-the-shelf tools, starting with old-school Unix commands that date back to the 70s and then pulling together various free and open source tools from off the web.  Two of Downey's recurring themes, which were very much evident to me when we worked together on standards committees, um, a few years ago, are also very much in evidence here: A deep commitment to open data and software, and an equally strong conviction that one can and should be able to do significant things with data using basic and widely available tools.

A slogan that pops up a couple of times in the stories is "Making things open makes them better".  In this spirit, all the code and data used is publicly available.  Even better, though, the last story, Mistakes were made, catches the system in the act of improving itself due to its openness.  On a smaller scale, reader suggestions are incorporated in real time and several visualizations benefit from collaboration with colleagues.

There's even a "hack day" in the middle.  If anything sums up Downey's ideal of how technical collaboration should work, it's this: "My two favourite hacks had multidisciplinary teams build something, try it with users, realise it was the wrong thing, so built something better as a result. All in a single day!"  It's one thing to believe in open source, agile development and teamwork in the abstract.  The stories show them in action.

As to the second theme, the whole series, from the frenetic "30 things in 30 days" pace through to the actual results, shows an admirable sort of impatience:  Let's not spend a lot of time spinning up the shiniest tools on a Big Data server farm.  I've got a laptop.  It's got some built-in commands.  I've got some data.  Let's see what we can find out.

Probably my favorite example is the use of geolocation in Postcodes.  It would be nice to see sales transactions plotted on a map of the UK.  Unfortunately, we don't have one of those handy, and they're surprisingly hard to come by and integrate with, but never mind.  Every transaction is tagged with a "northing" and "easting", basically latitude and longitude, and there are millions of them.  Just plot them spatially and, voila, a map of England and Wales, incidentally showing clearly that the data set doesn't cover Scotland or Northern Ireland.
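The trick is simple enough to sketch without any plotting library at all.  The following is my own toy version, not the code from the stories: the function name ascii_map and the four coordinate pairs are made up, and with only four points you get four marks rather than a coastline.  Feed it millions of points and a finer grid and the outline should start to appear.

```python
def ascii_map(points, width=20, height=10):
    """Bin (easting, northing) pairs into a grid of characters."""
    eastings = [e for e, _ in points]
    northings = [n for _, n in points]
    min_e, max_e = min(eastings), max(eastings)
    min_n, max_n = min(northings), max(northings)
    span_e = (max_e - min_e) or 1  # avoid dividing by zero
    span_n = (max_n - min_n) or 1
    grid = [[" "] * width for _ in range(height)]
    for e, n in points:
        col = min(width - 1, int((e - min_e) / span_e * width))
        # Row 0 is the top of the picture, so larger northings come first.
        row = min(height - 1, int((max_n - n) / span_n * height))
        grid[row][col] = "#"
    return ["".join(r) for r in grid]


# Made-up coordinates in metres, for illustration only.
points = [(530000, 180000), (430000, 565000), (385000, 398000), (320000, 90000)]
rows = ascii_map(points)
print("\n".join(rows))
```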

I wouldn't say that just anyone could do the same analyses in 30 days, but neither is there any deep wizardry going on.  If you've taken a couple of courses in computing, or done a moderate amount of self-study, you could almost certainly figure out how the code in the stories works and do some hacking on it yourself (in which case, please contribute anything interesting back to the repository).  And then go forth and hack on other interesting public data sets, or, if you're in a position to do so, make some interesting data public yourself (but please consult with your local privacy expert first).

In short, these stories are an excellent model of what the web was meant to be: open, collaborative, lightweight and fast.

Technical content aside, there are also several small treasures in the prose, from Wikipedia links on a variety of subjects to a bit on the connection between the cover of Joy Division's Unknown Pleasures and the discovery of pulsars by Jocelyn Bell Burnell et al.

Finally, one of the pleasures of reading the stories was their sheer Englishness (and, if I understand correctly, their Northeast Englishness in particular).   The name of the blog is whatfettle.  I've already mentioned postcodes, eastings and northings, but the whole series is full of Anglicisms -- whilst, a spot of breakfast, cock-a-hoop, if you are minded, splodgy ... Not all of these may be unique to the British Isles, but the aggregate effect is unmistakeable.

I hesitate to even mention this for fear of seeming to make fun of someone else's way of speaking, but that's not what I'm after at all.   This isn't cute or quaint, it's just someone speaking in their natural manner.  The result is located or even embodied.  On the internet, anyone could be anywhere, and we all tend to pick up each other's mannerisms.  But one fundamental aspect of the web is bringing people together from all sorts of different backgrounds.  If you buy that, then what's the point if no one's background shows through?

Thursday, May 31, 2018

Cookies, https and OpenId

I finally got around to looking at the various notices that have accumulated on the admin pages for this blog.  As a result:
  • This blog is supposed to display a notice regarding cookies if you access it from the EU.  I'm not sure that this notice is actually appearing when it should (I've sent feedback to try to clarify), but as far as I can tell blogspot is handling cookies for this blog just like any other.  I have not tried to explicitly change that behavior.
  • This blog has for some time used "redirect to https".  This means that if you try to access this blog via http://, it will be automatically changed to https://.  This shouldn't make any difference.  On the one hand, https has been around for many years, all browsers I know of handle it just fine and in any case this blog has been redirecting to https for a long time without incident.  On the other hand, this is a public blog, so there's no sensitive private information here.  It might maybe make a difference if you have to do some sort of login to leave comments, but I doubt it.
  • Blogger no longer supports OpenID.  I think this would only matter if I'd set up "trust these web sites" under the OpenID settings, but I didn't.
In other words, this should all be a whole lot of nothing, but I thought I'd let people know.