Showing posts with label software engineering. Show all posts

Tuesday, July 16, 2019

Space Reliability Engineering

In a previous post on the Apollo 11 mission, I emphasized the role of software architecture, and the architect Margaret Hamilton in particular, in ensuring the success of the Apollo 11 lunar landing.  I stand by that, including the assessment of the whole thing as "awesome" in the literal sense, but as usual there's more to the story.

Since that non-particularly-webby post was on Field Notes, so is this one.  What follows is mostly taken from the BBC's excellent if majestically paced podcast 13 Minutes to the Moon [I hope to go back and recheck the details directly at some point, but searching through a dozen or so hours of podcast is time-consuming and I don't know if there's a transcript available -- D.H.], which in turn draws heavily on NASA's Johnson Space Center Oral History Project.

I've also had a look at Ars Technica's No, a "checklist error" did not almost derail the Apollo 11 mission, which takes issue with Hamilton's characterization of the incident and also credits Hal Laning as a co-author of the Executive portion of the guidance software which ultimately saved the day (to me, the main point Hamilton was making was that the executive saved the day, regardless of the exact cause of the 1202 code).

Before getting too far into this, it's worth reiterating just how new computing was at the time.  The term "software engineer" didn't exist (Hamilton coined it during the project -- Paul Niquette claims to have coined the term "software" itself and I see no reason to doubt him).  There wasn't any established job title for what we now call software engineers.  The purchase order for the navigation computer, which was the very first order in the whole Apollo project, didn't mention software, programming or anything of the sort.  The computer was another piece of equipment to be made to work just like an engine, window, gyroscope or whatever.  Like them it would have to be installed and have whatever other things done to it to make it functional.  Like "programming" (whatever that was).

In a way, this was a feature rather than a bug.  The Apollo spacecraft have been referred to, with some justification, as the first fly-by-wire vehicles.  The navigational computer was an unknown quantity.  At least one astronaut promised to turn the thing off at the first opportunity.  Flying was for pilots, not computers.

This didn't happen, of course.  Instead, as the podcast describes so well, control shifted back and forth between human and computer depending on the needs of the mission at the time, but it was far from obvious at the beginning that this would be the case.

Because the computer wasn't trusted implicitly, but treated as just another unknown to be dealt with -- in other words, another risk to be mitigated -- ensuring its successful operation was seen as a matter of engineering, just like making sure that the engines were efficient and reliable, and not a matter of computer science.  This goes a long way toward explaining the self-monitoring design of the software.

Mitigating the risk of using the computer included figuring out how to make it as foolproof as possible for the astronauts to operate.  The astronauts would be wearing spacesuits with bulky gloves, so they wouldn't exactly be swiping left or right, even if the hardware of the time could have supported it.  Basically you had a numeric display and a bunch of buttons.  The solution was to break the commands down to a verb and a noun (or perhaps more accurately a predicate and argument), each expressed numerically.  It would be a ridiculous interface today.  At the time it was a highly effective use of limited resources [I don't recall the name of the designer who came up with this. It's in the podcast --D.H.].
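To make the verb/noun idea concrete, here is a minimal sketch of how a pair of numbers can stand in for a command.  The codes and names below are made up for illustration; they are not the actual Apollo assignments, and the real system surely differed in the details.

```python
from typing import Dict

# Hypothetical codes, chosen only for illustration. They are NOT the
# actual Apollo Guidance Computer verb and noun assignments.
VERBS: Dict[int, str] = {1: "display", 2: "monitor"}
NOUNS: Dict[int, str] = {10: "clock", 20: "altitude"}


def parse_command(verb: int, noun: int) -> str:
    """Translate a numeric verb/noun pair into a readable command."""
    if verb not in VERBS:
        raise ValueError(f"unknown verb {verb:02d}")
    if noun not in NOUNS:
        raise ValueError(f"unknown noun {noun:02d}")
    return f"{VERBS[verb]} {NOUNS[noun]}"
```

With two small tables, a handful of numeric keys covers every combination of action and subject, which is about as much as you can ask of a numeric display and a bunch of buttons.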

But the only way to really know if an interface will work is to try it out with real users.  Both the astronauts and the mission control staff needed to practice the whole operation as realistically as possible, including the operation of the computer.  This was for a number of reasons, particularly to learn how the controls and indicators worked, to be prepared for as many contingencies as possible and to try to flush out unforeseen potential problems.  The crew and mission control conducted many of these simulations and they were generally regarded as just as demanding and draining as the real thing, perhaps more so.

It was during one of the simulations that the computer displayed a status code that no one had ever seen before and therefore didn't know how to react to.  After the session was over, flight director Gene Kranz instructed guidance software expert Jack Garman to look up and memorize every possible code and determine what course of action to take when it came up.  This would take a lot of time searching through the source code, with the launch date imminent, but it had to be done and it was.  Garman produced a handwritten list of every code and what to do about it.

As a result, when the code 1202 came up with the final opportunity to turn back fast approaching, capsule communicator (CAPCOM) Charlie Duke was able to turn to guidance controller Steve Bales, who could turn to Garman and determine that the code was OK if it didn't happen continuously.  There's a bit of wiggle room in what constitutes "continuously", but knowing that the code wasn't critical was enough to keep the mission on track.  Eventually, Buzz Aldrin noticed that the code only seemed to happen when a particular radar unit was being monitored.  Mission Control took over the monitoring and the code stopped happening.


I now work for a company that has to keep large fleets of computers running to support services that billions of people use daily.  If a major Google service is down for five minutes, it's headline news, often on multiple continents.  It's not the same as making sure a plane or a spaceship lands safely or a hospital doesn't lose power during a hurricane, but it's still high-stakes engineering.

There is a whole profession, Site Reliability Engineer, or SRE for short, dedicated to keeping the wheels turning.  These are highly-skilled people who would have little problem doing my job instead of theirs if they preferred to.  Many of their tools -- monitoring, redundancy, contingency planning, risk analysis, and so on -- can trace their lineage through the Apollo program.  I say "through" because the concepts themselves are considerably older than space travel, but it's remarkable how many of them were not just employed, but significantly advanced, as a consequence of the effort to send people to the moon and bring them back.

One tool in particular, Garman's list of codes, played a key role at that critical juncture.  Today we would call it a playbook.  Anyone who's been on call for a service has used one (I know I have).
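For anyone who hasn't used one, a playbook boils down to a lookup from a code to an agreed response, plus some rule for when the agreed response no longer applies.  Here's a rough sketch of the "OK if it didn't happen continuously" rule.  The entry wording, the threshold and the time window are all my own assumptions for illustration, not anything from Garman's actual list.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PlaybookEntry:
    action: str            # what to do when the code shows up
    max_per_window: int    # hypothetical threshold for "continuously"


# Hypothetical playbook with a single entry; the action text and the
# threshold are assumptions, not taken from the real list.
PLAYBOOK: Dict[int, PlaybookEntry] = {
    1202: PlaybookEntry(action="continue", max_per_window=3),
}


def advise(code: int, seen_at: List[float], now: float,
           window_s: float = 60.0) -> str:
    """Return the playbook response for a code, given when it was seen."""
    entry = PLAYBOOK.get(code)
    if entry is None:
        return "escalate: code not in playbook"
    # Count only the occurrences inside the recent time window.
    recent = [t for t in seen_at if now - t <= window_s]
    if len(recent) > entry.max_per_window:
        return "escalate: code is recurring too often"
    return entry.action
```

The important part isn't the code, it's that the decisions were made ahead of time, when there was time to think.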



In the end, due to a bit of extra velocity imparted during the maneuver to extract the lunar module and dock it to the command module, the lunar module ended up overshooting its intended landing place.  In order to avoid large boulders and steep slopes in the area they were now approaching, Neil Armstrong ended up flying the module by hand in order to find a good landing spot, aided by a switch to increase or decrease the rate of descent.

The controls were similar to those of a helicopter, except the helicopter was flying sideways through (essentially) a vacuum over the surface of the moon, steered by precisely aimed rocket thrusts while continuing to descend, and was made of material approximately the thickness of a soda can which could have been punctured by a good jab with a ball-point pen.  So not really like a helicopter at all.

The Eagle landed with eighteen seconds of fuel to spare.  It helps to have a really, really good pilot.

Saturday, December 8, 2018

Software cities

In the previous post I stumbled on the idea that software projects are like cities.  The more I thought about it, I said, the more I liked the idea.  Now that I've had some more time to think about it, I like the idea even more, so I'd like to try to draw the analogy out a little bit further, ideally not past the breaking point.

What first drew me to the concept was realizing that software projects, like cities, are neither completely planned nor completely unplanned.  Leaving aside the question of what level of planning is best -- which surely varies -- neither of the extremes is likely to actually happen in real life.

If you try to plan every last detail, inevitably you run across something you didn't anticipate and you'll have to adjust.  Maybe it turns out that the place you wanted to put the city park is prone to flooding, or maybe you discover that the new release of some platform you're depending on doesn't actually support what you thought it did, or at least not as well as you need it to.

Even if you could plan out every last detail of a city, once people start living in it, they're going to make changes and deviate from your assumptions.  No one actually uses that beautiful new footbridge, or if they do, they cut across a field to get to it and create a "social trail" thereby bypassing the carefully designed walkways.  People start using an obscure feature of one of the protocols to support a use case the designers never thought of.  Cities develop and evolve over time, with or without oversight, and in software there's always a version 2.0 ... and 2.1, and 2.2, and 2.2b (see this post for the whole story).

On the other hand, even if you try to avoid planning and let everything "just grow", planning happens anyway.  If nothing else, we codify patterns that seem to work -- even if they arose organically with no explicit planning -- as customs and traditions.

In a distant time in the Valley, I used to hear the phrase "paving the cow paths" quite a bit.  It puzzled me at first.  Why pave a perfectly good cow path?  Cattle are probably going to have a better time on dirt, and that pavement probably isn't going to hold up too well if you're marching cattle on it all the time ...  Eventually I came to understand that it wasn't about the cows.  It was about taking something that people had been doing already and upgrading the infrastructure for it.  Plenty of modern-day highways (or at least significant sections of them) started out as smaller roads which in turn used to be dirt roads for animals, foot traffic and various animal-drawn vehicles.

Upgrading a road is a conscious act requiring coordination across communities all along the roadway.  Once it's done, it has a significant impact on communities on the road, which expect to benefit from increased trade and decreased effort of travel, but also communities off the road, which may lose out, or may alter their habits now that the best way to get to some important place is by way of the main road and not the old route.  This sort of thing happens both inside and outside cities, but for the sake of the analogy think of ordinary streets turning into arterials or bypasses and ring roads diverting traffic around areas people used to have to cross through.

One analogue of this in software is standards.  Successful standards tend to arise when people get together to codify existing practice, with the aim of improving support for things people were doing before the standard, just in a variety of similar but still needlessly different ways.  Basically pick a route and make it as smooth and accessible as possible.  This is a conscious act requiring coordination across communities, and once it's done it has a significant impact on the communities involved, and on communities not directly involved.

This kind of thing isn't always easy.  A business district thrives and grows, and more and more people want to get to it.  Traffic becomes intolerable and the city decides to develop a new thoroughfare to carry traffic more efficiently (thereby, if it all works, accelerating growth in the business district and increasing traffic congestion ...).  Unfortunately, there's no clear space for building this new thoroughfare.  An ugly political fight ensues over whose houses should get condemned to make way and eventually the new road is built, cutting through existing communities and forever changing the lives of those nearby.

One analogue of this in software is the rewrite.  A rewrite almost never supports exactly the same features as the system being rewritten.  The reasons for this are probably material for a separate post, but the upshot is that some people's favorite features are probably going to break with the rewrite, and/or be replaced by something different that the developers believe will solve the same problem in a way compatible with the new system.  Even if the developers are right about this, which they often are, there's still going to be significant disruption (albeit nowhere near the magnitude of having one's house condemned).


Behind all this, and tying the two worlds of city development and software development together, is culture.  Cities have culture, and so do major software projects.  Each has its own unique culture, but, whether because the same challenges recur over and over again, leading to similar solutions, or because some people are drawn to large communities while others prefer smaller, the cultures of different cities tend to have a fair bit in common, perhaps more in common with each other than with life outside them.  Likewise with major software projects.

Cities require a certain level of infrastructure -- power plants, coordinated traffic lights, parking garages, public transport, etc. -- that smaller communities can mostly do without.  Likewise, a major software project requires some sort of code repository with version control, some form of code review to control what gets into that repository, a bug tracking system and so forth.  This infrastructure comes at a price, but also with significant benefits.  In a large project as in a large city, you don't have to do everything yourself, and at a certain point you can't do everything yourself.  That means people can specialize, and to some extent have to specialize.  This both requires a certain kind of culture and tends to foster that same sort of culture.


It's worth noting that even large software projects are pretty small by the standards of actual cities.  Somewhere around 15,000 people have contributed to the git repository for the Linux kernel.  There appear to be a comparable (but probably smaller) number of Apache committers.  As with anything else, some of these are more active in the community than others.  On the corporate side, large software companies have tens of thousands of engineers, all sharing more or less the same culture.

Nonetheless, major software projects somehow seem to have more of the character of large cities than one might think based on population.  I'm not sure why that might be, or even if it's really true once you start to look more closely, but it's interesting that the question makes sense at all.

Sunday, November 4, 2018

Waterfall vs. agile

Near where I work is a construction project for a seven-floor building involving dozens of people on site and suppliers from all over the place, supplying items ranging from local materials to an impressively tall crane from a company on another continent.  There are literally thousands of things to keep track of, from the concrete in the foundation to the location of all the light switches to the weatherproofing for the roof.  The project will take over a year, and there are significant restrictions on what can happen when.  Obviously you can't put windows on before there's a wall in place to put them in, but less obviously there are things you can't do during the several weeks that the structural concrete needs to cure, and so forth.

Even digging the hole for the parking levels took months and not a little planning.  You have to have some place to put all that dirt and the last part takes longer since you no longer have a ramp to drive things in and out with, so whatever you use for that last part has to be small enough you can lift it out.

I generally try to keep in mind that no one else's job is as simple as it looks when you don't have to actually do it, but this is a particularly good example.  Building a building this size, to say nothing of an actual skyscraper, is real engineering.  Add into the mix that lives are literally on the line -- badly designed or built structures do fail and kill people, not to mention the abundance of hazards on a construction site -- and you have a real challenge on your hands.

And yet, the building in question has been proceeding steadily and there's every reason to expect that it will be finished within a few weeks of the scheduled date and at a cost reasonably close to the initial estimate.

We can't do that in my line of work.

Not to say it's never happened, but it's not the norm.  For example, I'm trying to think of a major software provider that still gives dates and feature lists for upcoming releases.  Usually you have some idea of what month the next release might be in, and maybe a general idea of the major features that the marketing is based around, but beyond that, it comes out when it comes out and whatever's in it is in it.  That fancy new feature might be way cooler and better than anyone expected, or it might be a half-baked collection of somewhat-related upgrades that only looks like the marketing if you squint hard enough.

The firmer the date, the vaguer the promised features and vice versa ("schedule, scope, staff: pick two").  This isn't isolated to any one provider (I say "provider" rather than "company" so as to include open source).  Everyone does it in their own way.

In the construction world, this would be like saying "The new building will open on November 1st, but we can't say how many floors it will have or whether there will be knobs on the doors" or "This building will be completely finished somewhere between 12 and 30 months from now."  It's not that construction projects never overrun or go over budget, just that the normal outcomes are in a much tighter range and people's expectations are set accordingly.

[Re-reading this, I realize I didn't mention small consultants doing projects like putting up a website and social media presence for a local store.  I haven't been close to that end of the business for quite a while, but my guess is that delivering essentially on time and within the budget is more common.  However, I'm more interested here in larger projects, like, say, upgrading trade settlement for a major bank.  I don't have a lot of data points for large consultants in such situations, but what I have seen tends to bear out my main points here]

Construction is a classic waterfall process.  In fact, the use of roles like "architect" and "designer" in a waterfall software methodology gives a pretty good hint where the ideas came from.  In construction, you spend a lot of time up front working with an architect and designer to develop plans for the building.  These then get turned into more detailed plans and drawings for the people actually doing the construction.  Once that's done and construction is underway, you pretty much know what you're supposed to be getting.

In between design and construction there's a fair bit of planning that the customer doesn't usually see.  For example, if your building will have steel beams, as many do, someone has to produce the drawing that says exactly what size and grade of beam to use, how long to cut it and (often) exactly where and what size to drill the holes so it can be bolted together with the other steel pieces.  Much of this process is now automated with CAD software, and for that matter more and more of the actual cutting and drilling is automated, but the measurements still have to be specified and communicated.

Even if there's a little bit of leeway for changes later in the game -- you don't necessarily have to select all the paint colors before they pour concrete for the underground levels -- for the most part you're locked in once the plans are finalized.  You're not going to decide that your seven-level building needs to be a ten-level building while the concrete is curing, or if you do, you'll need to be ready to shell out a lot of money and throw the schedule out the window (if there's one to throw it out of).

Interwoven with all this is a system of zoning, permitting and inspections designed to ensure that your building is safe and usable, and fits in well with the neighborhood and the local infrastructure.  Do you have enough sewer capacity?  Is the building about the same height as the buildings around it (or is the local government on board with a conspicuously tall or short one)?  Will the local electrical grid handle your power demand, and is your wiring sufficient?  This will typically involve multiple checks: The larger-scale questions like how much power you expect to use are addressed during permitting, the plans will be inspected before construction, the actual wiring will be inspected after it's in, and the contractor will need to be able to show that all the electricians working on the job are properly licensed.

This may seem like a lot of hassle, and it is, but most regulations are in place because people learned the hard way.  Wiring from the early 1900s would send most of today's licensed electricians running to the fuse box (if there is one) to shut off the power, or maybe just running out of the immediate area.  There's a reason you Don't Do Things That Way any more: buildings burned down and people got electrocuted.

Putting all this together, large-scale construction uses a waterfall process for two reasons: First, you can't get around it.  It's effectively required by law.  Second, and more interesting here, is that it works.

Having a standard process for designing and constructing a building, standard materials and parts and a standard regime of permits, licenses and inspections gives everyone involved a good idea of what to expect and what to do.  Having the plans finalized allows the builder to order exactly the needed materials (or more realistically, to within a small factor) and precisely track the cost of the project.  Standards and processes allow people to specialize.  A person putting up drywall knows that the studs will be placed with a standard spacing compatible with a standard sheet of drywall without having to know or care exactly how those studs got there.

Sure, there are plenty of cases of construction projects going horribly off the rails, taking several times as long as promised and coming in shockingly over budget, or turning out to be unusable or having major safety issues requiring expensive retrofitting.  It happens, and when it does lots of people tend to hear about it.  But drive into a major city some time.  All around you will be buildings that went up without significant incident, built more or less as I described above.  On your way in will be suburbs full of cookie-cutter houses built the same way on a smaller scale.  The vast majority of them will stand for generations.

Before returning to the field of software, it's worth noting that this isn't the only way to build.  Plenty of houses are built by a small number of people without a lot of formal planning.  Plenty of buildings, particularly houses, have been added onto in increments over time.  People rip out and replace cabinetry or take out walls (ideally non-structural ones) with little more than general knowledge of how houses are put together.  At a larger scale, towns and cities range from carefully planned (Brasilia, Washington, D.C.) to agglomerations of ancient villages and newer developments with the occasional Grand Plan forced through (London comes to mind).


So why does a process that works fine for projects like buildings, roads and bridges or, with appropriate variations, Big Science projects like CERN, the Square Kilometer Array or the Apollo missions, not seem to carry over to software?  I've been pondering this from time to time for decades now, but I can't say that I've arrived at any definitive answer.  Some possibilities:
  • Software is soft.  Software is called "software" because, unlike the hardware, which is wired together and typically shipped off to a customer site, you can change any part of the software whenever you want.  Just restart the server with a new binary, or upgrade the operating system and reboot.  You can even do this with the "firmware" -- code and data that aren't meant to be changed often -- though it will generally require a special procedure.  Buildings and bridges are quintessential hardware.
Yes, but ... in practice you can't change anything you want at any time.  Real software consists of a number of separate components acting together.  When you push a button on a web site, the browser interprets the code for the web page and, after a series of steps, makes a call to the operating system to send a message to the server on the other end.  The server on the other end is typically a "front-end" whose job is to dispatch the message to another server, which may then talk to several others, or to a database (probably running on its own set of servers), or possibly to a server elsewhere on the web or on the internal network, or most likely a combination of some or all of these, in order to do the real work.  The response comes back, the operating system notifies the browser and the browser interprets more of the web page's code in order to update the image on the screen.

This is actually a highly simplified view.  The point here is that all these pieces have to know how to talk to each other and that puts fairly strict limits on what you can change how fast.  Some things are easy.  If I figure out a way to make a component do its current job faster, I can generally just roll that out and the rest of the world is fine, and ideally happier since things that depend on it should get faster for free (otherwise, why make the change?).  If I want to add a new feature, I can generally do that easily too, since no one has to use it until it's ready.

But if I try to change something that lots of people are already depending on, that's going to require some planning.  If it's a small change, I might be able to get by with a "flag" or "mode switch" that says whether to do things the old way or the new way.  I then try to get everyone using the service to adapt to the new way and set the switch to "new".  When everyone's using the "new" mode, I can turn off the "old" mode and brace for an onslaught of "hey, why doesn't this work any more?" from whoever I somehow missed.

Larger changes require much the same thing on a larger scale.  If you hear terms like "backward compatibility" and "lock-in", this is what they're talking about.  There are practices that try to make this easier and even to anticipate future changes ("forward compatibility" or "future-proofing").  Nonetheless, software is not as soft as one might think.
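The "flag" or "mode switch" approach described above can be sketched in a few lines.  This is only an illustration under my own assumptions: the Service class, its method names and the old and new behaviors are all hypothetical, and a real migration would track callers in logs or metrics rather than in memory.

```python
from typing import Set


class Service:
    """Hypothetical service that supports an old and a new behavior."""

    def __init__(self) -> None:
        self.old_mode_enabled = True
        # Callers still seen using the old mode.
        self.old_mode_callers: Set[str] = set()

    def handle(self, caller: str, payload: str, mode: str = "old") -> str:
        if mode == "new":
            # The caller has migrated, so stop counting it as an old-mode user.
            self.old_mode_callers.discard(caller)
            return payload.strip().lower()
        if not self.old_mode_enabled:
            raise RuntimeError("old mode has been turned off")
        self.old_mode_callers.add(caller)
        return payload.lower()

    def retire_old_mode(self) -> None:
        """Turn off the old mode, but only once no known caller uses it."""
        if self.old_mode_callers:
            raise RuntimeError(
                f"still on old mode: {sorted(self.old_mode_callers)}")
        self.old_mode_enabled = False
```

Note that the check in retire_old_mode only knows about callers it has actually seen, which is exactly how you end up with the "hey, why doesn't this work any more?" from whoever was missed.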
  • Software projects are smaller.  Just as it's a lot easier to build at least some kind of house than a seven-level office building, it's easy for someone to download a batch of open source tools and put together an app or an open source tool or any of a number of other things (my post on Paul Downey's hacking on a CSV of government data shows how far you can get with not-even-big-enough-to-call-a-project).  There are projects just like this all over GitHub.
Yes, but ... there are lots of small projects around, some of which even see a large audience, but if this theory were true you'd expect to see a more rigid process as projects get larger, as you do in moving from small DIY projects to major construction. That's not necessarily the case.  Linux grew from Linus's two processes printing "A" and "B" on the screen to millions of lines of kernel code and millions more of tools and applications.  The development process may have gradually become more structured over time, but it's not anything like a classic waterfall.  The same could be said of Apache and any number of similar efforts.  Projects do tend to add some amount of process as they go ("So-and-so will now be in charge of the Frobulator component and will review all future changes"), but not nearly to the point of a full-fledged waterfall.

For that matter, it's not clear that Linux or Apache should be considered single projects.  They're more like a collection of dozens or hundreds of projects, each with its own specific standards and practices, but nonetheless they fit together into a more or less coherent whole.  The point being, it can be hard to say how big a particular project is or isn't.

I think this is more or less by design.  Software engineering has a strong drive to "decouple" so that different parts can be developed largely independently.  This generally requires being pretty strict about the boundaries and interfaces between the various components, but that's a consequence of the desire to decouple, not a driving principle in and of itself.  To the extent that decoupling is successful, it allows multiple efforts to go on in parallel without a strictly-defined architecture or overall plan.  The architecture, such as it is, is more an emergent property of the smaller-scale decisions about what the various components do and how they interact.

The analogy here is more with city planning than building architecture.  While it's generally good to have someone taking the long view in order to help keep things working smoothly overall, this doesn't mean planning out every piece in advance, or even having a master plan.  You can get by surprisingly well with the occasional "Wait, how is this going to work with that" or "There are already two groups working on this problem -- maybe you should talk to them before starting a third one".

Rather than a single construction project, a project like Linux or Apache is much like a city growing in stages over time.  In fact, the more I think about it, the more I like that analogy.  I'd like to develop it further.

I had wanted to add a few more bullet points, particularly "Software is innovative" -- claiming that people generally write software in order to do something new, while a new skyscraper, however innovative the design, is still a skyscraper (and more innovative buildings tend to cost more, take longer and have a higher risk of going over budget) -- but I think I'll leave that for now.  This post is already on the longish side and developing the city analogy seems more interesting at the moment.

Saturday, September 29, 2018

Agile vs. waterfall

Another comment reply that outgrew the comment box.  Earl comments
This speaks to my prejudice in favor of technique over technology. And the concept of agility seems to be just the attitude of any good designer that your best weapon is your critical sense, and your compulsion to discard anything that isn't going to work.
To which I would reply:

Sort of ... "agile" is a term of art that refers to a collection of practices aimed at reducing the lag between finding out people want something and giving it to them.  Arguably the core of it is "launch and iterate", meaning "put your best guess out there, find out what it still needs, fix the most important stuff and try again".

This is more process than design, but there are definitely some design rules that tend to go with agile development, particularly "YAGNI", short for "You ain't gonna need it", which discourages trying to anticipate a need you don't yet know that you have.  In more technical terms, this means not trying to build a general framework for every possible use case, but being prepared to "refactor" later on if you find out that you need to do more than you thought you did.  Or, better, designing in such a way that later functionality can be added with minimum disruption to what's already there, often by having less to disrupt to begin with, because ... YAGNI.

Downey refers to "agile" both generally and in the more specific context of "agile vs. waterfall".  The "waterfall" design process called for exhaustively gathering all requirements up front, then producing a design to meet those requirements, then implementing the design, then independently testing the implementation against the requirements, fixing any bugs, retesting and eventually delivering a product to the customer.  Each step of the process flows into the next, and you only go forward, much like water flowing over a series of cascades.  Only the test/fix/retest/... cycle was meant to be iterative, and ideally with as few iterations as possible.  Waterfall projects can take months at the least and more typically years to get through all the steps, at which point there's a significant chance that the customer's understanding of what they want has evolved -- but don't worry, we can always gather more requirements, produce an improved design ...

(As an aside, Downey alludes to discussion over whether "customer" is an appropriate term for someone, say, accessing a public data website.  A fair point.  I'm using "customer" here because in this case this is someone paying money for the service of producing software.   The concept of open source cuts against this, but that's a whole other discussion.)

The waterfall approach can be useful in situations like space missions and avionics.  In the first case, when you launch, you're literally launched and there is no "iterate".  In the second, the cost of an incomplete or not-fully-vetted implementation is too high to risk.  However, there's a strong argument to be made that "launch and iterate" works in more cases than one might think.

In contrast to waterfall approaches, agile methodologies think more in terms of weeks.  A series of two-week "sprints", each producing some number of improvements from a list, is a fairly common approach. Some web services go further and use a "push on green" process where anything that passes the tests (generally indicated by a green bar on a test console) goes live immediately.  Naturally, part of adding a new feature is adding tests that it has to pass, but that should generally be the case anyway.

Superficially, a series of two-week sprints may seem like a waterfall process on a shorter time scale, but I don't think that's a useful comparison.  In a classic waterfall, you talk to your customer up front and then go dark for months, or even a year or more while the magic happens, though the development managers may produce a series of progress reports with an aggregate number of requirements implemented or such.  Part of the idea of short sprints, on the other hand, is to stay in contact with your customer in order to get frequent feedback on whether you're doing the right thing.   Continuous feedback is one of the hallmarks of a robust control system, whether in software or steam engines.

There are also significant differences in the details of the processes.  In an agile process, the list of things to do (often organized by "stories") can and does get updated at any time.  The team will generally pick a set of things to implement at the beginning of a sprint in order to coordinate their efforts, but this is more a tactical decision, and "requirements gathering" is not blocked while the developers are implementing.

Work in agile shops tends to be estimated in relative terms like "small", "medium" or "large", since people are much better at estimating relative sizes, and there's generally an effort to break "large" items into smaller pieces since people are better at estimating small items.  Since this is done frequently, everyone ends up doing a bunch of fairly small-scale estimates on a regular basis, and hopefully skills improve.

Waterfall estimates are generally done up front by specialists.  By the end of the design phase, you should have a firm estimate of how long the rest will take (and, a cynic might add, a firm expectation of putting in serious overtime as the schedule begins to slip).

It's not clear how common a true waterfall process is in practice.  I've personally only seen it once up close, and the result was a slow-motion trainwreck the likes of which I hope never to see again.  Among other things, the process called for designers to reduce their designs to "pseudocode", which is basically a detailed description of an algorithm using words instead of a formal computer language.

This was to be done in such detail that the actual coder hired to produce the code would not have to make any decisions in translating the pseudocode to actual code.  This was explicitly stated in the (extensive) process documentation.  But if you can explain something in that much detail, you've essentially coded it and you're just using the coder as an expensive human typewriter, not a good proposition for anyone involved.  You've also put a layer of scheduling and paperwork between designing an algorithm and finding out whether it works.

We did, however, produce an impressive volume of paper binders full of documentation.  I may still have a couple somewhere.  I'm not sure I or anyone else has ever needed to read them.

This is an extreme case, but the mindset behind it is pervasive enough to make "agile vs. waterfall" a real controversy.  As with all such controversies, at least some of the waterfallish practices actually out there have more merit than the extreme case.  The extreme case, even though it does exist in places, functions more as a strawman.  Nonetheless, I tend to favor the sort of "admirable impatience" that Downey exemplifies.  Like anything else it can be taken too far, but not in the case at hand.

Saturday, August 22, 2015

Margaret Hamilton: 1 New Horizons: 0

A bit more on Pluto, from a compugeek perspective if not a full-on web perspective ...

The New Horizons flyby was not completely without incident.  Shortly before the flyby itself, the craft went into "safe mode", contact was lost for a little over an hour and a small amount of scientific data was lost.  The underlying problem was "a hard-to-detect timing flaw in the spacecraft command sequence".  This quite likely means what's known in the biz as a "race condition", where two operations are going on at the same time, the software behaves incorrectly if the wrong one finishes first and the developers didn't realize it mattered.

Later investigation concluded that the problem happened when "The computer was tasked with receiving a large command load at the same time it was engaged in compressing previous science data."  This means that the CPU would have been both heavily loaded and multitasking, making it more likely that various "multithreading issues" such as race conditions would be exposed.
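For anyone who hasn't run into one, here is a toy sketch of a race condition in Python.  To be clear, this is not the New Horizons flight software, which I haven't seen; the function name, the shared buffer and the two operations are all made up for illustration.  The only point is that the final result depends on which of two operations happens to finish first.

```python
def run(finish_order):
    """Apply the completion of two operations in the given order.

    Both operations share one buffer (a hypothetical stand-in for any
    shared resource). Returns what the compression step ended up with.
    """
    state = {"buffer": "old-data", "compressed": None}

    def finish_load():
        # The incoming command load overwrites the shared buffer.
        state["buffer"] = "new-commands"

    def finish_compress():
        # Compression captures whatever is in the buffer when it completes.
        state["compressed"] = "zip(" + state["buffer"] + ")"

    steps = {"load": finish_load, "compress": finish_compress}
    for name in finish_order:
        steps[name]()
    return state["compressed"]


if __name__ == "__main__":
    print(run(["compress", "load"]))  # compression finished first
    print(run(["load", "compress"]))  # command load finished first
```

The two calls give different answers from the same inputs.  In a real system nobody passes in the order explicitly; it falls out of timing and load, which is why such bugs tend to show up only when the machine is busy.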

Now, before I go on, let me emphasize that bugs like this are notoriously easy to introduce by accident and notoriously hard to find if they do creep in, even though there are a number of well-known tools and techniques for finding them and keeping them out in the first place.

The incident does not in any way indicate that the developers involved can't code.  Far from it.  New Horizons made it through a ten-year, five billion kilometer journey, arriving within 72 seconds of the expected time, and was able to beam back spectacularly detailed images.  That speaks for itself.  It's particularly significant that the onboard computers were able to recover from the error condition instead of presenting the ground crew with an interplanetary Blue Screen of Death.  More on that in a bit.

Still ...

It's July 20, 1969.  The Apollo 11 lunar lander is three minutes from landing on the Moon when several alarms go off.  According to a later recounting by the leader of the team involved:
Due to an error in the checklist manual, the rendezvous radar switch was placed in the wrong position. This caused it to send erroneous signals to the computer. The result was that the computer was being asked to perform all of its normal functions for landing while receiving an extra load of spurious data which used up 15% of its time.
This is a serious issue.  If the computer can't function, the landing has to be aborted.  However,
The computer (or rather the software in it) was smart enough to recognize that it was being asked to perform more tasks than it should be performing. It then sent out an alarm, which meant to the astronaut, I'm overloaded with more tasks than I should be doing at this time and I'm going to keep only the more important tasks; i.e., the ones needed for landing ... Actually, the computer was programmed to do more than recognize error conditions. A complete set of recovery programs was incorporated into the software. The software's action, in this case, was to eliminate lower priority tasks and re-establish the more important ones.
This is awesome.  Since "awesome" is generally taken to mean "kinda cool" these days, I'll reiterate: The proper response to engineering on this level is awe.  Let me try to explain why.

Depending on where you start counting, modern computing was a decade or two old at the time.  The onboard computer had "approximately 64Kbyte of memory and operated at 0.043MHz".  Today, you can buy a system literally a million times faster and with a million times more memory for a few hundred dollars.

While 64K is tiny by today's standards, it still leaves plenty of room for sophisticated code, which is exactly what was in there.  It does, however, mean that every byte and every machine cycle counts, and for that reason among others the code itself was written in assembler (hand-translated from a language called MAC and put on punch cards for loading).  Assembler is as low-level as it gets, short of putting in raw numbers, flipping switches or fiddling with the wiring by hand.

Here's a printout of that code if you're curious.  The dark bands are from printing out the listing on green-and-white-striped fanfold paper with a line printer such as used to be common at computer centers around the world.  The stripes were there to help the eye follow the 132-character lines.  Good times.  But I digress.

Just in case writing in assembler with an eye towards extremely tight code isn't enough, the software is asynchronous.   What does that mean?  There are two basic ways to structure a program such as this one that has to deal with input from a variety of sources simultaneously: the synchronous approach and the asynchronous approach.

Synchronous code essentially does one thing at a time.  If it's reading temperature and acceleration (or whatever), it will first read one input, say temperature from the temperature sensor, then read acceleration from the accelerometer (or whatever).  If it's asking some part of the engine to rotate 5 degrees, it sends the command to the engine part, then waits for confirmation that the part really did turn.  For example, it might read the position sensor for that part over and over until it reads five degrees different, or raise an alarm if it doesn't get the right reading after a certain number of tries.

Code like this is easy to reason about and easy to read.  You can tell immediately that, say, it's an error if you try to move something and its position doesn't reach the desired value after a given number of tries.  However, it's no way to run a spaceship.  For example, suppose you need to be monitoring temperature continuously and raise a critical alarm if it gets outside its acceptable range.  You can't do that if you're busy reading the position sensor.
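A minimal sketch of the synchronous style might look like the following.  Everything here is hypothetical: `send_command` and `read_position` stand in for whatever the hardware interface would really be, and `FakePart` is a pretend engine part so the example can run on its own.

```python
class MoveTimeout(Exception):
    """Raised when the part never reports the expected position."""


def move_part_sync(send_command, read_position, degrees, max_tries=10):
    """Command a rotation, then poll the position sensor until it is done."""
    target = read_position() + degrees
    send_command(degrees)
    for _ in range(max_tries):
        # Nothing else runs while we sit in this loop: no temperature
        # checks, no other sensors. That is the cost of the simplicity.
        if read_position() == target:
            return target
    raise MoveTimeout(f"part did not reach {target} after {max_tries} tries")


class FakePart:
    """Stand-in hardware: moves one degree each time its position is read."""

    def __init__(self, stuck=False):
        self.position = 0
        self.remaining = 0
        self.stuck = stuck

    def send(self, degrees):
        self.remaining = degrees

    def read(self):
        if self.remaining > 0 and not self.stuck:
            self.position += 1
            self.remaining -= 1
        return self.position


if __name__ == "__main__":
    part = FakePart()
    print(move_part_sync(part.send, part.read, 5))
```

The whole story of the move is in one function, top to bottom, which is what makes it easy to read.  It's also why the temperature goes unwatched for as long as the loop runs.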

This is why high-performance, robust systems tend to be asynchronous.  In an asynchronous system, commands can be sent and data can arrive at any time.  There will generally be a number of event handlers, each for a given type of event.  The temperature event handler might record the temperature somewhere and then check to make sure it's in range.

If it's not, it will want to raise an alarm.  Suppose the alarm is a beep every five seconds.  In the asynchronous world, that means creating a timer to trigger events every five seconds, and creating an event handler that sends a beep command to the beeper when the timer fires (or, you can set a "one-shot" timer and have the handler create a new one-shot timer after it sends the beep command).

While all this is going on, other sensors will be triggering events.  In between "the temperature sensor just reported X" and "the timer for your beeper just went off", the system might get events like "the accelerometer just reported Y" and "the position sensor for such-and-such-part just read Z".
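Here is a rough sketch of that arrangement, assuming a tiny home-made event loop with a simulated clock rather than any real flight system or library.  The event names, the 10-to-40 temperature range and the `build_monitor` helper are all invented for the example.  It uses the one-shot timer variant: the beep handler re-arms its own timer each time it fires.

```python
import heapq
import itertools


class EventLoop:
    """Tiny event loop with a simulated clock (time is just a number)."""

    def __init__(self):
        self.queue = []                    # heap of (time, sequence, name, payload)
        self.handlers = {}                 # event name -> handler function
        self.counter = itertools.count()   # tie-breaker for equal times
        self.now = 0

    def on(self, name, handler):
        self.handlers[name] = handler

    def post(self, delay, name, payload=None):
        entry = (self.now + delay, next(self.counter), name, payload)
        heapq.heappush(self.queue, entry)

    def run(self, until):
        while self.queue and self.queue[0][0] <= until:
            self.now, _, name, payload = heapq.heappop(self.queue)
            self.handlers[name](payload)


def build_monitor(loop, log, low=10, high=40):
    """Register temperature, accelerometer and beeper handlers on the loop."""
    state = {"alarm": False, "timer_armed": False}

    def on_temperature(value):
        log.append((loop.now, "temp", value))
        state["alarm"] = not (low <= value <= high)
        if state["alarm"] and not state["timer_armed"]:
            state["timer_armed"] = True
            loop.post(0, "beep_timer")       # first beep fires immediately

    def on_beep_timer(_payload):
        if state["alarm"]:
            log.append((loop.now, "beep", None))
            loop.post(5, "beep_timer")       # re-arm the one-shot timer
        else:
            state["timer_armed"] = False     # back in range: stop beeping

    def on_acceleration(value):
        log.append((loop.now, "accel", value))

    loop.on("temperature", on_temperature)
    loop.on("beep_timer", on_beep_timer)
    loop.on("acceleration", on_acceleration)


if __name__ == "__main__":
    loop, log = EventLoop(), []
    build_monitor(loop, log)
    loop.post(0, "temperature", 25)
    loop.post(1, "temperature", 55)      # out of range: alarm starts
    loop.post(3, "acceleration", 1.2)    # arrives between beeps
    loop.post(8, "temperature", 30)      # back in range
    loop.run(until=20)
    for entry in log:
        print(entry)
```

Note that no handler ever waits for anything.  Each one does a small amount of work, possibly schedules something for later, and returns, which is what lets the accelerometer reading slip in between two beeps.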

To move an engine part in this setup, you need to send it a command to move, and also create a handler for the position sensor's event.  That handler has to include a counter to remember how many position readings have come in since the command to move, along with the position the part is supposed to get to (or better, a time limit and the expected position).
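As a sketch of what such a handler has to carry around, here is a small Python class, again with made-up names.  It assumes some event system calls `on_position` whenever the position sensor reports; the class only holds the bookkeeping (target, counter, outcome) that the synchronous version kept implicitly in its loop.

```python
class MoveWatcher:
    """Position-event handler that remembers an outstanding move command."""

    def __init__(self, target, max_readings):
        self.target = target              # position the part should reach
        self.max_readings = max_readings  # readings allowed before alarming
        self.readings_seen = 0
        self.status = "pending"

    def on_position(self, reading):
        """Called by the event system each time the position sensor reports."""
        if self.status != "pending":
            return self.status            # move already resolved; ignore
        self.readings_seen += 1
        if reading == self.target:
            self.status = "done"
        elif self.readings_seen >= self.max_readings:
            self.status = "alarm"
        return self.status


if __name__ == "__main__":
    watcher = MoveWatcher(target=5, max_readings=3)
    for reading in (2, 4, 5):
        print(reading, watcher.on_position(reading))
```

The state that used to live in local variables of one function now has to survive between calls, and something has to create the watcher when the move is commanded and retire it afterwards.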

A system like this is very flexible and doesn't spend time "blocked" waiting for things to happen, but it's also harder to read and reason about, since things can happen in any order and the logic is spread across a number of handlers, which can come and go depending on what the system is doing.

And then, on top of all this, the system has code to detect and recover from error conditions, not just in the ship it's controlling but in its own operation.  Do-it-yourself brain surgery, in other words.
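To give a feel for the "keep only the more important tasks" idea, here is a deliberately simplified sketch.  It is not a reconstruction of the Apollo software, whose actual mechanism I'm not describing here; the task names, priorities, costs and the `plan_cycle` function are all invented.  It just shows one way a system might notice it has been asked to do too much and shed the least important work.

```python
def plan_cycle(tasks, capacity):
    """Pick which tasks to run this cycle, given limited capacity.

    tasks: list of (name, priority, cost) tuples, where a lower priority
           number means more important. All values are made up.
    Returns (kept_names, dropped_names, overloaded).
    """
    demand = sum(cost for _, _, cost in tasks)
    overloaded = demand > capacity        # this is where an alarm would go
    kept, dropped, used = [], [], 0
    # Consider the most important tasks first. A cheaper, less important
    # task can still be kept after a costlier one has been dropped.
    for name, _, cost in sorted(tasks, key=lambda task: task[1]):
        if used + cost <= capacity:
            kept.append(name)
            used += cost
        else:
            dropped.append(name)
    return kept, dropped, overloaded


if __name__ == "__main__":
    tasks = [
        ("display_update", 3, 20),
        ("landing_guidance", 1, 50),
        ("radar_counts", 4, 15),
        ("attitude_control", 1, 30),
    ]
    print(plan_cycle(tasks, capacity=100))
```

With a capacity of 100 and a total demand of 115, the sketch flags the overload, keeps the two priority-1 tasks plus the display update, and drops the radar counts.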


I report my occupation as "software engineer" for tax purposes and such, but that's on a good day.  Most of us spend most of our time coding, that is, writing detailed instructions for machines to carry out.  True software engineering means designing a robust and efficient system to solve a practical problem.  The term was coined by Margaret Hamilton, the architect of the Apollo 11 control systems quoted above and a pioneer in the design of asynchronous systems.  As the story of the lunar landing demonstrates, she and her team set a high bar for later work.

New Horizons ran into essentially the same sort of problem that Apollo 11 did, but handled it less robustly (going to "safe mode" and then recovering, as opposed to automatically re-prioritizing), all building on techniques that Hamilton and her team helped develop, and using vastly more powerful equipment and development tools based on decades of collective experience.  So, with all due respect to the New Horizons team, I'd have to say Apollo 11 wins that one.

Friday, August 3, 2012

Is there a UX crisis?

Back in the early days of computing, a software crisis was declared.  Projects were being launched with high expectations -- this was back when computers could do absolutely anything -- only to end up late, over budget, disappointingly lacking in features, buggy to the point of uselessness, or not delivered at all.

Many solutions were proposed.  Software should be written in such a way that it could be mechanically proved correct.  Software engineering should become a proper engineering discipline with licenses required to practice.  Methodologies should be developed to control the development process and make it regular and predictable.  There were many others.

None of these things has happened on a significant scale.  A proof of correctness assumes you understand the problem well enough to state the requirements mathematically, which is not necessarily easier than writing the code itself.  For whatever reason, degrees and certificates have not turned out to be particularly important, at least in the places I've worked for the past decades.

Methodologies have come and gone, and while most working engineers can recognize and understand a process problem when they see it ("Why did I not know that API was about to change?" ... "How did we manage to release that without testing feature X??"), there is a high degree of skepticism about methodologies in general.

This isn't to say that there aren't any software methodologies -- there are hundreds -- or that they're not used in practice.  I've personally seen up close a highly-touted methodology that used hundreds of man-years and multiple calendar years to replace an old mainframe system with a new, state-of-the-art distributed solution that the customer -- which had changed ownership at least once during the wait -- was clearly unhappy with.  And well they should have been.  Several months in it had been scaled down as it became clear that the original objectives weren't going to be met.

I've also seen "agile" methodologies put in place, with results that were less disastrous but not exactly miraculous either.  Personally I'm not at all convinced that a formal methodology is as helpful as a good development culture (you know it when you see it), frequent launches, good modularity and lots of testing.

Several things have happened instead of a cure, or cures, for the software crisis.  Languages and tools have improved.  Standards, generally de facto, have emerged.  Now that a lot of software is out, both customers and developers have more realistic expectations about what it can and cannot do.  Best practices have emerged (Unit tests are your friend.  Huge monoliths of code aren't.).  Projects get delivered, often late, over budget, lacking features and buggy, but good enough.  And it's just code.  We can always fix it.  I can sense the late Edsger Dijkstra shaking his head in disapproval as I write this, but nonetheless the code is running and a strong case can be made that the world is better for it.

We don't have, nor did we have, a crisis.  What we have is consistent disappointment.  We can see what software could be, and we see what it is, and the gap between the two, particularly in the mistakes we get to make over and over again, is disheartening.


Which leads me back to a persistent complaint: UXen, in general, suck.

Yes, there are plenty of examples of apps and web sites that are easy to use and even beautiful, but there are tons and tons that are annoying, if not downright infuriating, and ugly to boot.  For that matter, there are a fair number of pretty-but-useless interfaces.  Despite decades of UX experience and extensive research, basic flaws keep coming back again and again.  Off the top of my head without trying too hard:
  • Forms that make you re-enter everything if you make a mistake with anything (these actually seem to be getting rarer, and a good browser will bail you out by remembering things for you -- and in many cases that's a perfectly fine solution).
  • Lists of one item that you have to pick from anyway as though there were an actual choice.
  • "Next" buttons that don't go away when you get to the last item (likewise for "Previous")
  • Links to useless pages that just link you to where you wanted to go in the first place.
  • Security theater that pretends to make things safer.  Please make it stop.
  • Forms that require you use a special format for things like phone numbers.  Do I include the dashes or not?
  • Wacky forms for things like dates that throw everything you know about keys like backspace and tab out the window.
  • Error handling that tells you nothing about how to fix the problem.
  • Layouts that only line up right on a particular browser.
  • Pages that tell you to "upgrade" if you're not running a particular browser.
  • General garish design. Text that doesn't contrast with the background, which is too busy anyway.  Text that contrasts too much.  Cutely unreadable fonts.  Animated GIFs that cycle endlessly.
  • Things that pop up in front of what you're trying to look at for no good reason.
  • Editors that assume, a la Heisenberg, that the mere act of opening an edit window on a document causes unspecified "unsaved changes" that you must then decide whether or not to save (yeah, Blogger, you're guilty here).
And so forth.  This is just off the top of my head.  I've ranted about several of these already, though for some reason the industry doesn't seem to have taken heed.

How does this happen?

How does any less-than-satisfactory design ever happen?  One answer is that reality sets in.  Any real project is a compromise between the desire to produce something great and the need to get something out in front of the customer.  Perfect is the enemy of good enough.

In an ideal world, people would be able to describe exactly what they want and designers could just give it to them.  In the real world, people don't always know what they want, or what's reasonably feasible, and designers don't always know how to give it to them.  In the ideal world a designer has at hand all possible solutions and is never swayed by the desire to use some clever new technique whether it really applies or not.  In the real world designers are humans with limited resources.

This isn't unique to software by any means.  Doors have been around for millennia, and people still don't always know how to design them.

I should pause here to acknowledge that UX is difficult.  There are rules and methods, and tons of tools, but putting together a truly excellent UX that's both pleasant and fully functional, that makes easy things easy and hard things possible, takes a lot of thought, effort and back-and-forth with people actually trying to use it.

Again, though, that's not a property of UX.  It's a property of good design.  The question here is why are UX things that seem simple enough -- like avoiding useless buttons and links -- so often wrong in practice.  A few possible answers:
  • Actually, UX designers get it more-or-less right most of the time.  We just notice the failures because they're really, really annoying.
  • It's harder than it looks.  It's not always easy to figure out (in terms even a computer can understand) that a link or button is useless, or how to lay something out consistently on widely different screens.
  • The best tools aren't always available.  Maybe there's a really good widget for handling a changing list of items that allows for both quick and fine-grained scrolling and so forth.  But it's something your competitor wrote, or it's freely available but not on the platform you're using.
  • Dogma.  Occasionally guidelines require foolish consistency and UX is not in a position to bend them.  This may explain some tomfoolery regarding dates, social security numbers and such.
  • Plausible-sounding reasoning that never gets revisited.  It may seem like a great idea to make sure you have a valid social security number by requiring the user to put in the dashes as well.  That way you know they're paying attention.  Well, no.
  • Reinvented wheels.  The person doing the UX hasn't yet developed the "this must already exist somewhere" Spidey sense, or thinks it would be Really Cool to write yet another text editing widget.
  • Software rot.  The page starts out really nicely, but changes are jammed in without regard to an overall plan.  Inconsistencies develop and later changes are built on top of them.
Hmm ... once again, none of these seems particularly unique to UX.  Time to admit it: UX is a branch of software engineering, liable to all the faults of other software engineering endeavors.  Yes, there is an element of human interaction, but if you think about it, designing a library for people to code to is also a kind of UX design, just not one with screens and input devices.  You could just as well say the same things that make UX development error prone make library design error prone as the other way around.

To answer the original question, there is no UX crisis, no more than there was a software crisis.  We just have the same kinds of consistent disappointment.

But who asked?  Well, I did, in the title of this post.  Interestingly enough, no one actually seems to have declared a UX crisis, or at least the idea doesn't seem to have taken off.  Maybe we have learned a bit in the past few decades after all.

Sunday, April 17, 2011

Xanadu vs. the web: Part V - Fail.

This is the least pleasant segment to write.  Xanadu the architecture and business model are interesting to write about.  They may not have panned out, but they deserve to be studied and kept in mind.  It's never good to assume that the present way of doing things is the only or best way.  Xanadu provides alternatives that, if not viable, are at least worth thinking through.

However, there is a sadder side to the story: Xanadu the software project, which by any reasonable standard has been an unmitigated failure.  In particular, it never shipped anything of consequence.  Gary Wolf's piece, which again Nelson has strongly disputed, at least seems to square with the lack of notable Xanadu applications.  Autodesk founder John Walker's assessment of the four years and millions of dollars spent when Xanadu was affiliated with his company and given ample resources -- much more, for example, than were used to found Autodesk itself -- corroborates this.  Xanadu had literally become a footnote.

Looking through the various Xanadu websites, I can't say I've turned over every stone, but I've only managed to find three tangible results:
  • A Windows demo dated 2007 that I haven't yet run because I don't have a suitable Windows box handy at the moment (and frankly, given everything else, I may never run)
  • A Python script called the Transquoter
  • A link to a site not directly affiliated with Xanadu but clearly influenced by it.
  • [Since I wrote this, Xanadu has put up a demo of the UX, showing several texts side by side with the ability to jump from one to another.  Judge for yourself.  For my money it's not that much different from browsing Wikipedia or TV Tropes, despite the different visual presentation -- one could easily imagine a web browser with a browsing mode that looks like this.  Still, it is something tangible, and kudos to Xanadu for that.]

The Transquoter seems indicative of the overall state of things.  It essentially takes a list of links in a file and pastes their contents together.  Each quotation gets its own highlight color when moused over, and each quotation is a link.  The links in the file are of a special form, with query parameters indicating which version of the document in question to use and what range of characters to extract from it.  These both assume the stable "write once, never edit" model of documents that Xanadu uses.  The scheme would work for, say, Wikipedia articles, if you were careful to link to a particular version, but not for a lot of other things.
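To make the idea of such a link concrete, here is a hedged sketch in Python.  I'm not reproducing the Transquoter's actual syntax: the `version` and `charrange` parameter names follow the description in this post, but the `START,LENGTH` encoding, the example URL and both function names are assumptions made purely for illustration.

```python
from urllib.parse import urlparse, parse_qs


def parse_quote_link(url):
    """Split a quote link into (base_url, version, start, length).

    Assumes a made-up encoding: ?version=V&charrange=START,LENGTH
    """
    parts = urlparse(url)
    query = parse_qs(parts.query)
    version = query["version"][0]
    start, length = (int(n) for n in query["charrange"][0].split(","))
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return base, version, start, length


def extract_quote(document_text, start, length):
    """Return the quoted span; only meaningful if the text never changes."""
    return document_text[start:start + length]


if __name__ == "__main__":
    link = "http://example.com/doc.txt?version=3&charrange=4,5"
    base, version, start, length = parse_quote_link(link)
    print(base, version)
    print(extract_quote("The quick brown fox", start, length))
```

The sketch also shows why the write-once model matters: a character offset only picks out the same words as long as nothing earlier in the document has been edited.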

Never mind, though.  A sample is provided.  The sample links exclusively to Nelson's sites.  It's Nelson's content, Nelson's chosen servers and an officially blessed Python script.  Nonetheless, the links don't work.  The script generates plausible-looking links, with version and charrange parameters saying what to look for, but the servers just ignore them and serve the whole page.  The whole point of this transclusion thing, I thought, was that when you navigate from one occurrence of the content to another, you actually get to that content in its context, not to the document it came from with no indication of where that content might be.

Seriously?

OK then, suppose you have a server that actually understands the syntax of the links.  Perhaps one that can take a link anywhere on the web, perhaps cache a copy of it to ensure stability, and serve that page up, showing the quoted text, suitably highlighted and in context.  Not something I'm inclined to work on personally, but certainly feasible.  How do you put together the document that references it?

You put together a text file containing an "edit difference list".  The only edit difference supported seems to be directly pulling in a quote by giving the position of the first character in the source document and a size in characters (working around HTML tags as needed).  No tool is provided for, say, highlighting some text in your browser and dragging it into a document-in-progress.  But hey, you can use any text editor you like to produce a list of specially-tweaked URLs to give to the script on the command line, to get an HTML file to upload to your site.

Seriously?


There's a principle in software engineering of "eating your own dogfood," that is, using your own tools wherever possible.  For example, Linux became "self-hosting", meaning that further development of Linux was done using Linux systems, at an early age.  Becoming self-hosting is a major rite of passage for many kinds of tool, particularly compilers.  Among other things, it's a fairly convincing demonstration that the tool can be used for something real.

Does Xanadu.net use the transquoter?  Not as far as I can tell.  If it did, there would probably be a real editor.

OK, enough.  At some point this is just piling on.


The most plausible demonstration of how Xanadu might work comes not from within Xanadu, but from Jason Rohrer's token_word system which Nelson mentions.  It attempts to show what a payment system based on Xanadu's pay-per-first-access model would look like.  You can even put in real money via a PayPal account, though fortunately the first 50,000 "tokens" are free.  [I can no longer find a running version of token_word.  Its developer, Jason Rohrer, appears to be developing video games now, though there is quite a bit of other interesting stuff on Rohrer's SourceForge site, including some material on token_word -- D.H. Oct 2018]

There are documents on the site, though not very many, and you can construct new documents by writing text or by pulling quotations in from other documents in the system.  The process is inconvenient but doable: To extract quotes you put <q> and </q> tags around the text you want to quote in an edit box and push a button. Then you put <q N> into your own text: use <q 0> for the first quote you extracted, <q 1> for the next ... Stone-age tools, but I can forgive that here because 1) the site looks to be fairly old and done as a spare-time project by one person, 2) other parts of the demo, particularly tracking how much text you've accessed, work and 3) the stone-age axe at least has a handle and a blade.

Have a look around if you like.  At least it's something.  It's the kind of demo that, if there were more of them and they had together been developed into a prototype system, might have gotten people to take the project seriously.  Personally, I believe I got enough of the flavor to see that I would prefer not to access documents in such a way, even with a slicker interface, but that's a good thing.  Allowing people to make that kind of determination is exactly why we have demos, and it's what puts the site light-years ahead of the transquoter, which would probably leave most people scratching their heads.




Xanadu has been around as a concept since the 1960s.  People have devoted years of their lives to working on it.  Millions of dollars have been spent.  In that time the entire computing industry has been turned upside down multiple times.   [Did I really say that?  Well, the technology, at least, has undergone major changes.] Thousands of new companies have emerged, some even surviving.  Billions of lines of code have been written.  Protocols have been defined.  Apps have been written and shipped, fortunes made, lost, remade.  Even taking into account the inherent architectural difficulties of the Grand Scheme, even if every word of Wolf's story is absolutely accurate, even taking into account code lost at various stages during the development of the project, it still boggles the mind just how little there is to show for it.

Wednesday, June 30, 2010

Wiki without the pedia

While tagging my previous post, I noticed that I had tags for both "Wikipedia" and "wiki". There are four articles (now five, of course) tagged "wiki," three of which are more or less to do with Wikipedia. The other is from the Baker's Dozen series, speculating about what role the wiki approach may play in the next generation of search engines.

What really stands out to me about wikis is that there's Wikipedia and then there's everything else.

Everybody's heard of Wikipedia by now and quite a few people have tried their hand at editing it. As a result, there is a well-known tool for editing Wikipedia (MediaWiki) along with a well-established culture and etiquette. There is also enough of a critical mass that, for the most part, articles tend to improve over time.

And then there's everything else. Don't get me wrong. There are some good wikis out there. But there are also an awful lot of half-baked ones. These tend to crop up when a small software shop or similar organization decides that it needs a wiki to, say, document its software architecture and development process. Well, why not? Wikipedia is pretty successful, and software shops are always looking for lightweight, dare I say "agile" ways of tracking what's going on.

In practice, there are several pitfalls:
  • Wikipedia has a lot of eyes. According to Wikipedia, Wal-Mart has about 2 million employees, while Wikipedia has close to 13 million registered users. Granted, Wikipedia claims only about 90,000 "active contributors", but that's still about the same headcount as Microsoft. Chances are, your company isn't that big.*
  • It used to be every computer science undergrad wanted to invent and implement a programming language. Somewhere around the turn of the century that ambition seems to have shifted to writing a wiki engine (which typically has at least a toy programming language in it somewhere). So many to choose from and, even though approximately one of the choices has a huge user base and all that goes with it, the odds are that whoever set up your wiki chose something "better" than MediaWiki.
  • Wikis were designed for quickly throwing together webs of loosely structured text, and not for any of several other things they sometimes get used for. A wiki page generally doesn't know what role it has in a bigger picture. A wiki is not a bug tracker. It is not a release planning system. It doesn't know that feature X was promised to FooCorp for release 2.1 whose schedule has just slipped. No one told it any of that. Ah, but that's where the toy programming language comes in ...
  • Many shops are content to limit wikis to the smaller role of gathering together bits of wisdom that people tend to email each other as the occasion demands. "Why did you design it this way?" "Well ..." The problem is that this conversation tends to happen when, for any of myriad reasons, the design wasn't documented close to the code, so someone is now asking the author. Ideally, the original designer goes and documents the code and replies with a link to the new doc. Alternatively, if the conversation is taking place on an archived list, the answer will be in the archives for future generations. In either case, it's not clear that updating a wiki and replying with a link to that would be an improvement.
  • Wikis need gardening to combat various forms of rot. Typically there's even less time for this, particularly in a small shop, than there is for updating the wiki in the first place.
Wiki writing is not magically easier than any other kind of writing. Maintaining a wiki takes time and dedication. Wikipedia has a lot of dedicated contributors, including many who specialize in gardening and other less glamorous jobs. If your organization is not specifically in the business of producing wiki pages, chances are the wiki will reflect that.


* On the other hand, chances are your wiki is not going to be as big as Wikipedia. Nonetheless, (I claim) there are economies of scale that happen when the user base gets larger.  In a large community people can specialize, for example in maintenance tasks.

[Wikipedia continues to dominate the world of Wiki, even neglecting its sister projects.  The one notable exception I can think of is TV Tropes.  I doubt it has anywhere near the readership of Wikipedia, but it's still the rare example of a publicly-edited non-Wikipedia wiki with a significant readership -- D.H. Dec 2015]

Thursday, September 10, 2009

Classic software engineering and the mighty pigeon

A diligent member of my army of stringers, researchers, fact-checkers and miscellaneous hangers-on forwarded me a BBC article on a contest between South Africa's broadband carrier and a carrier pigeon. The pigeon carried a 4GB memory stick 60 miles in 2 hours. In the same amount of time, the broadband connection had transmitted 2% of the data. Pigeon 1, broadband nil.

Is this really news? First, how fast was the broadband connection? 2% of 4GB is 80MB. 80MB/7200s = 11KB/s. OK, that's pretty slow. For comparison, I just ran a speed test against a server about 60 miles away. The download speed was about 14Mb/s, or about 2MB/s. That would bring me my 4GB in about 40 minutes. Pigeon 1, broadband 1.

But wait. I can only download as fast as the other end can upload. If the other end has my broadband connection, it can upload at about 360Kb/s or about 45KB/s. You read that right. My upload speed would appear to be about a fortieth of my download speed. That's about four times the speed of the South African connection, meaning I couldn't even get 10% of the data transmitted by wire before the pigeon reached its destination. Pigeon 2, broadband 1.
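The back-of-the-envelope figures above can be checked with a few lines of Python. This is only a sketch: the function and variable names are made up for illustration, and it assumes decimal units (1 GB = 1,000,000 KB) and 8 bits per byte, which is what the rounded numbers in the text appear to use.

```python
# Back-of-the-envelope check of the pigeon vs. broadband figures.
# Assumptions (not from the BBC article): decimal units, 8 bits per byte.

KB_PER_GB = 1_000_000
TWO_HOURS = 2 * 60 * 60  # seconds


def rate_kb_per_s(gigabytes, seconds):
    """Average throughput in KB/s for moving `gigabytes` in `seconds`."""
    return gigabytes * KB_PER_GB / seconds


def kbit_to_kbyte(kilobits_per_s):
    """Convert a line speed in Kb/s to KB/s."""
    return kilobits_per_s / 8


def transfer_minutes(gigabytes, kb_per_s):
    """Minutes needed to move `gigabytes` at `kb_per_s`."""
    return gigabytes * KB_PER_GB / kb_per_s / 60


south_african_line = rate_kb_per_s(0.02 * 4, TWO_HOURS)  # 2% of 4 GB in 2 hours
pigeon = rate_kb_per_s(4, TWO_HOURS)                     # all 4 GB in 2 hours
my_download = kbit_to_kbyte(14_000)                      # 14 Mb/s
my_upload = kbit_to_kbyte(360)                           # 360 Kb/s
upload_share = my_upload * TWO_HOURS / (4 * KB_PER_GB)   # fraction of 4 GB sent

print(f"South African line: {south_african_line:.0f} KB/s")
print(f"Pigeon:             {pigeon:.0f} KB/s")
print(f"My download:        {my_download:.0f} KB/s, "
      f"4 GB in {transfer_minutes(4, my_download):.0f} min")
print(f"My upload:          {my_upload:.0f} KB/s, "
      f"{upload_share:.0%} of 4 GB in 2 hours")
```

Without rounding 14Mb/s up to 2MB/s, the download takes about 38 minutes rather than 40, and the upload link gets through roughly 8% of the stick's contents in the pigeon's two hours. Neither changes the score.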

Hmm ... my ability to send large quantities of data -- movies, for example -- to the world at large is severely limited, but my ability to access said data from whoever can make it available quickly isn't. And my internet connection is provided by a Cable TV/old-school media company ... but I digress.

When I got the link about the South African pigeon, I immediately thought of Jon Bentley's classic Programming Pearls and More Programming Pearls. If you're interested in software engineering you could do worse than to stop reading this right now, go hunt down copies of these books and inhale them. The code samples are variously given in C, C++ and a procedural pseudocode reminiscent of Old High Algol, but just as Chaucer is worth the trouble of reading in the original, so too Bentley.

If you don't want to hunt down paper copies, check out the web site [The web site doesn't carry the entire book, but it's still well worth visiting. The book itself has been extensively updated in the recent second edition. There's even a little Java here and there, but only a little. The Labs is still the Labs, after all].

I assume you're back now and along the way noticed problem 11 in column 1 of Pearls:
11. In the early 1980's Lockheed engineers transmitted daily a dozen drawings from a Computer Aided Design (CAD) system in their Sunnyvale, California, plant to a test station in Santa Cruz. Although the facilities were just 25 miles apart, an automobile courier service took over an hour (due to traffic jams and mountain roads) [Ah, Highway 17. Good times, good times.] and cost a hundred dollars per day. Propose alternative data transmission schemes and estimate their cost.
The solution Lockheed came up with?
The computers at the two facilities were linked by microwave, but printing the drawings at the test base would have required a printer that was very expensive at the time. The team therefore drew the pictures at the main plant, photographed them, and sent 35mm film to the test station by carrier pigeon, where it was enlarged and printed photographically. The pigeon's 45-minute flight took half the time of the car, and cost only a few dollars per day. During the 16 months of the project the pigeons transmitted several hundred rolls of film, and only two were lost (hawks inhabit the area; no classified data was carried). Because of the low price of modern printers, a current solution to the problem would probably use the microwave link.
Pigeon 3, broadband 1.

For several obvious reasons I doubt that pigeons are going to be the optimum solution for most high-volume data transmission problems, but it certainly gives one pause to note that a generation after the Lockheed story, in the face of Moore's law and all that, pigeon power is still a plausible solution. At least compared to what passes for broadband here in the States.

Why should this be? Moore's law cuts both ways. In fact, it currently favors the pigeon. The South African test was done with a 4GB memory stick. Sticks of 16GB are now available. Leaving aside the question of whether a pigeon (or two) could carry more than one stick, even a single pigeon with a single 16GB stick could beat my download speed over a 60 mile course.
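To put a number on that claim, here is a minimal sketch that treats the pigeon as a link whose speed is simply stick capacity divided by flight time. The two-hour flight and the 14Mb/s download come from the figures above; the function name, the decimal units (1 GB = 8,000 megabits) and the idea of a "break-even" stick size are assumptions added for illustration.

```python
# Effective "bandwidth" of one pigeon carrying one memory stick over the
# 2-hour, 60-mile course, compared with a 14 Mb/s download.
# Assumption: decimal units, so 1 GB = 8000 megabits.

FLIGHT_SECONDS = 2 * 60 * 60
DOWNLOAD_MBIT_PER_S = 14
MBIT_PER_GB = 8000


def pigeon_mbit_per_s(stick_gb, flight_seconds=FLIGHT_SECONDS):
    """Stick capacity divided by flight time, in megabits per second."""
    return stick_gb * MBIT_PER_GB / flight_seconds


# Smallest stick for which the pigeon matches the download speed.
break_even_gb = DOWNLOAD_MBIT_PER_S * FLIGHT_SECONDS / MBIT_PER_GB

for stick_gb in (4, 16):
    speed = pigeon_mbit_per_s(stick_gb)
    verdict = "beats" if speed > DOWNLOAD_MBIT_PER_S else "loses to"
    print(f"{stick_gb} GB stick: {speed:.1f} Mb/s, {verdict} the download")

print(f"Break-even stick size: {break_even_gb:.1f} GB")
```

Under these assumptions the 4GB pigeon manages about 4.4Mb/s and the 16GB pigeon about 17.8Mb/s, with the crossover at a little under 13GB.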

Memory sticks are getting bigger much faster than broadband is getting faster, if only because switching to a larger stick is much, much easier than switching broadband technologies.

This is all reminding me of my early posts on Jim Gray.

Wednesday, April 30, 2008

Engineering, in general

Today I put up a tent. This wasn't some high-tech mountain climbing special, but an ordinary one from one of the major retail chains, the same brand as the ones we used when I was growing up. It was reasonably light and compact for what it was, certainly lighter and more compact than the comparable model a generation ago.

It wasn't too hard to put together, even a year after last having done the exercise and without looking at the instructions. There were two colors of poles, black and gray. The black poles slipped through black openings and attached to the tent with black clips. The gray poles slipped through gray openings and attached with gray clips.

At any point, there were only a few sensible things to do. Most of them worked, and if you got something wrong -- tried to seat a pole at the wrong spot, for example -- it would soon be clear that it wasn't going to work. If you got the three main poles bent into place, you had something resembling a tent in shape. Everything after that made it a better tent -- roomier, more stable, easier to get into and out of, shaded, what-have-you.

If you neglected to do something small -- clip a particular clip or fasten a particular little strap -- the result was only slightly less good than if you hadn't. If you left out something major -- left the fly off or decided not to stake it down -- there would be a more noticeable effect. The result might be sub-optimal, but you could tell something was missing and choose to fix it if you wanted.

This particular tent was old enough (and cheap enough) that the elastic cord holding two of the poles together had broken. That made it harder to keep the sections of the poles together at first, but once the poles were under load there was no functional difference. The design still worked even with a less-than-perfect implementation.

In other words, it was a very nice piece of engineering. You could tell that the company had been making tents for quite a long time. The lessons to be drawn, say regarding user interface and design for robustness, are so obvious I won't bother to point them out.

Or as I used to say, if the average software shop ran as well as the average sandwich shop, the world would be a better place.

Saturday, January 12, 2008

Another rule of thumb

In these pages I've quoted Linus' law, Moore's law and probably others, so why not have a go at it myself? I don't believe I've heard this one put quite this way, though I'm sure it's been said before, particularly in other fields of engineering:
Development time depends largely on how quickly you can get accurate results. [Maybe a bit more snappily, on how quickly you can see what happened]
This can depend on any number of things, for example:
  • How long it takes to find out you've made a trivial mistake like misspelling a name. IDEs shorten this time by doing much of the bookkeeping before even officially compiling.
  • How long it takes to find out whether you've fixed a bug or properly implemented a new feature. Processes that put more testing in the developer's hands help shorten this.
  • How long it takes to verify that what worked in your sandbox also works for the official version. Tight build management and release processes help eliminate surprises in what should ideally be a quick step.
  • How long it takes to be sure that you've done what your customer wanted. Processes that put development teams in touch with customers frequently aim to shorten this.
At heart, this is just saying that good feedback loops require good feedback, but it still seems worth noting.