Sunday, November 4, 2018

Waterfall vs. agile

Near where I work is a construction project for a seven-floor building involving dozens of people on site and suppliers from all over the place, supplying items ranging from local materials to an impressively tall crane from a company on another continent.  There are literally thousands of things to keep track of, from the concrete in the foundation to the location of all the light switches to the weatherproofing for the roof.  The project will take over a year, and there are significant restrictions on what can happen when.  Obviously you can't put windows on before there's a wall in place to put them in, but less obviously there are things you can't do during the several weeks that the structural concrete needs to cure, and so forth.

Even digging the hole for the parking levels took months, and not a little planning.  You have to have some place to put all that dirt, and the last part takes longer since you no longer have a ramp to drive things in and out on, so whatever you use for that last part has to be small enough to lift out.

I generally try to keep in mind that no one else's job is as simple as it looks when you don't have to actually do it, but this is a particularly good example.  Building a building this size, to say nothing of an actual skyscraper, is real engineering.  Add into the mix that lives are literally on the line -- badly designed or built structures do fail and kill people, not to mention the abundance of hazards on a construction site -- and you have a real challenge on your hands.

And yet, the building in question has been proceeding steadily and there's every reason to expect that it will be finished within a few weeks of the scheduled date and at a cost reasonably close to the initial estimate.

We can't do that in my line of work.

Not to say it's never happened, but it's not the norm.  For example, I'm trying to think of a major software provider that still gives dates and feature lists for upcoming releases.  Usually you have some idea of what month the next release might be in, and maybe a general idea of the major features that the marketing is based around, but beyond that, it comes out when it comes out and whatever's in it is in it.  That fancy new feature might be way cooler and better than anyone expected, or it might be a half-baked collection of somewhat-related upgrades that only looks like the marketing if you squint hard enough.

The firmer the date, the vaguer the promised features and vice versa ("schedule, scope, staff: pick two").  This isn't isolated to any one provider (I say "provider" rather than "company" so as to include open source).  Everyone does it in their own way.

In the construction world, this would be like saying "The new building will open on November 1st, but we can't say how many floors it will have or whether there will be knobs on the doors" or "This building will be completely finished somewhere between 12 and 30 months from now."  It's not that construction projects never run late or go over budget, just that the normal outcomes fall in a much tighter range and people's expectations are set accordingly.

[Re-reading this, I realize I didn't mention small consultants doing projects like putting up a website and social media presence for a local store.  I haven't been close to that end of the business for quite a while, but my guess is that delivering essentially on time and within budget is more common there.  However, I'm more interested here in larger projects, like, say, upgrading trade settlement for a major bank.  I don't have a lot of data points for large consultants in such situations, but what I have seen tends to bear out my main points here.]

Construction is a classic waterfall process.  In fact, the use of roles like "architect" and "designer" in a waterfall software methodology gives a pretty good hint where the ideas came from.  In construction, you spend a lot of time up front working with an architect and designer to develop plans for the building.  These then get turned into more detailed plans and drawings for the people actually doing the construction.  Once that's done and construction is underway, you pretty much know what you're supposed to be getting.

In between design and construction there's a fair bit of planning that the customer doesn't usually see.  For example, if your building will have steel beams, as many do, someone has to produce the drawing that says exactly what size and grade of beam to use, how long to cut it and (often) exactly where and what size to drill the holes so it can be bolted together with the other steel pieces.  Much of this process is now automated with CAD software, and for that matter more and more of the actual cutting and drilling is automated, but the measurements still have to be specified and communicated.

Even if there's a little bit of leeway for changes later in the game -- you don't necessarily have to select all the paint colors before they pour concrete for the underground levels -- for the most part you're locked in once the plans are finalized.  You're not going to decide that your seven-level building needs to be a ten-level building while the concrete is curing, or if you do, you'll need to be ready to shell out a lot of money and throw the schedule out the window (if there's one to throw it out of).

Interwoven with all this is a system of zoning, permitting and inspections designed to ensure that your building is safe and usable, and fits in well with the neighborhood and the local infrastructure.  Do you have enough sewer capacity?  Is the building about the same height as the buildings around it (or is the local government on board with a conspicuously tall or short one)?  Will the local electrical grid handle your power demand, and is your wiring sufficient?  This will typically involve multiple checks: The larger-scale questions like how much power you expect to use are addressed during permitting, the plans will be inspected before construction, the actual wiring will be inspected after it's in, and the contractor will need to be able to show that all the electricians working on the job are properly licensed.

This may seem like a lot of hassle, and it is, but most regulations are in place because people learned the hard way.  Wiring from the early 1900s would send most of today's licensed electricians running to the fuse box (if there is one) to shut off the power, or maybe just running out of the immediate area.  There's a reason you Don't Do Things That Way any more: buildings burned down and people got electrocuted.

Putting all this together, large-scale construction uses a waterfall process for two reasons: First, you can't get around it.  It's effectively required by law.  Second, and more interesting here, it works.

Having a standard process for designing and constructing a building, standard materials and parts, and a standard regime of permits, licenses and inspections gives everyone involved a good idea of what to expect and what to do.  Having the plans finalized allows the builder to order exactly the needed materials (or more realistically, to within a small factor) and to precisely track the cost of the project.  Standards and processes allow people to specialize.  A person putting up drywall knows that the studs will be placed with a standard spacing compatible with a standard sheet of drywall, without having to know or care exactly how those studs got there.

Sure, there are plenty of cases of construction projects going horribly off the rails, taking several times as long as promised and coming in shockingly over budget, or turning out to be unusable or having major safety issues requiring expensive retrofitting.  It happens, and when it does lots of people tend to hear about it.  But drive into a major city some time.  All around you will be buildings that went up without significant incident, built more or less as I described above.  On your way in will be suburbs full of cookie-cutter houses built the same way on a smaller scale.  The vast majority of them will stand for generations.

Before returning to the field of software, it's worth noting that this isn't the only way to build.  Plenty of houses are built by a small number of people without a lot of formal planning.  Plenty of buildings, particularly houses, have been added onto in increments over time.  People rip out and replace cabinetry or take out walls (ideally non-structural ones) with little more than general knowledge of how houses are put together.  At a larger scale, towns and cities range from carefully planned (Brasília, Washington, D.C.) to agglomerations of ancient villages and newer developments with the occasional Grand Plan forced through (London comes to mind).


So why does a process that works fine for projects like buildings, roads and bridges or, with appropriate variations, Big Science projects like CERN, the Square Kilometer Array or the Apollo missions, not seem to carry over to software?  I've been pondering this from time to time for decades now, but I can't say that I've arrived at any definitive answer.  Some possibilities:
  • Software is soft.  Software is called "software" because, unlike the hardware, which is wired together and typically shipped off to a customer site, you can change any part of the software whenever you want.  Just restart the server with a new binary, or upgrade the operating system and reboot.  You can even do this with the "firmware" -- code and data that aren't meant to be changed often -- though it will generally require a special procedure.  Buildings and bridges are quintessential hardware.
Yes, but ... in practice you can't change anything you want at any time.  Real software consists of a number of separate components acting together.  When you push a button on a web site, the browser interprets the code for the web page and, after a series of steps, makes a call to the operating system to send a message to the server on the other end.  That server is typically a "front-end" whose job is to dispatch the message to another server, which may then talk to several others, or to a database (probably running on its own set of servers), or possibly to a server elsewhere on the web or on the internal network, or most likely some combination of these, in order to do the real work.  The response comes back, the operating system notifies the browser and the browser interprets more of the web page's code in order to update the image on the screen.

This is actually a highly simplified view.  The point here is that all these pieces have to know how to talk to each other and that puts fairly strict limits on what you can change how fast.  Some things are easy.  If I figure out a way to make a component do its current job faster, I can generally just roll that out and the rest of the world is fine, and ideally happier since things that depend on it should get faster for free (otherwise, why make the change?).  If I want to add a new feature, I can generally do that easily too, since no one has to use it until it's ready.

But if I try to change something that lots of people are already depending on, that's going to require some planning.  If it's a small change, I might be able to get by with a "flag" or "mode switch" that says whether to do things the old way or the new way.  I then try to get everyone using the service to adapt to the new way and set the switch to "new".  When everyone's using the "new" mode, I can turn off the "old" mode and brace for an onslaught of "hey, why doesn't this work any more?" from whoever I somehow missed.
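To make the flag idea concrete, here's a minimal sketch in Python.  The names (RENDER_MODE, render_old, render_new) are made up for illustration, and a real system would more likely read the flag from a configuration service than from an environment variable, but the shape is the same:

    import os

    def render_old(items):
        # Legacy output that existing callers depend on.
        return ", ".join(items)

    def render_new(items):
        # The replacement behavior we want everyone to move to.
        return " | ".join(sorted(items))

    def render(items):
        # The flag picks the implementation.  Once every caller has
        # switched to "new", the "old" branch (and the flag) can go away.
        if os.environ.get("RENDER_MODE", "old") == "new":
            return render_new(items)
        return render_old(items)

    print(render(["b", "a"]))  # "b, a" by default; "a | b" when RENDER_MODE=new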

Larger changes require much the same thing on a larger scale.  If you hear terms like "backward compatibility" and "lock-in", this is what they're talking about.  There are practices that try to make this easier and even to anticipate future changes ("forward compatibility" or "future-proofing").  Nonetheless, software is not as soft as one might think.
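As a small illustration of future-proofing, here's one common convention, sketched in Python with made-up field names: a reader pulls out only the fields it knows about and quietly ignores the rest, so a newer sender can add fields without breaking an older reader, while a missing field gets a default so an older sender doesn't break a newer reader.  The same idea underlies formats like protocol buffers:

    import json

    def parse_order(raw):
        msg = json.loads(raw)
        # Read only the fields we know about and ignore the rest, so a
        # newer sender can add fields without breaking this old reader.
        return {
            "item": msg["item"],
            "quantity": msg.get("quantity", 1),  # default keeps older senders working
        }

    # This sender includes a "grade" field the reader has never heard of.
    print(parse_order('{"item": "beam", "quantity": 4, "grade": "A36"}'))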
  • Software projects are smaller.  Just as it's a lot easier to build at least some kind of house than a seven-level office building, it's easy for someone to download a batch of open source tools and put together an app, a utility or any of a number of other things (my post on Paul Downey's hacking on a CSV of government data shows how far you can get with something not even big enough to call a project).  There are projects just like this all over GitHub.
Yes, but ... there are lots of small projects around, some of which even see a large audience, but if this theory were true you'd expect to see a more rigid process as projects get larger, as you do in moving from small DIY projects to major construction. That's not necessarily the case.  Linux grew from Linus's two processes printing "A" and "B" on the screen to millions of lines of kernel code and millions more of tools and applications.  The development process may have gradually become more structured over time, but it's not anything like a classic waterfall.  The same could be said of Apache and any number of similar efforts.  Projects do tend to add some amount of process as they go ("So-and-so will now be in charge of the Frobulator component and will review all future changes"), but not nearly to the point of a full-fledged waterfall.

For that matter, it's not clear that Linux or Apache should be considered single projects.  They're more like a collection of dozens or hundreds of projects, each with its own specific standards and practices, but nonetheless they fit together into a more or less coherent whole.  The point being, it can be hard to say how big a particular project is or isn't.

I think this is more or less by design.  Software engineering has a strong drive to "decouple" so that different parts can be developed largely independently.  This generally requires being pretty strict about the boundaries and interfaces between the various components, but that's a consequence of the desire to decouple, not a driving principle in and of itself.  To the extent that decoupling is successful, it allows multiple efforts to go on in parallel without a strictly-defined architecture or overall plan.  The architecture, such as it is, is more an emergent property of the smaller-scale decisions about what the various components do and how they interact.
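Here's a sketch of what that strictness buys, in Python with invented names.  As long as both sides honor the interface, the storage component can be developed, replaced or sped up independently, without the caller knowing or caring:

    from abc import ABC, abstractmethod

    class KeyValueStore(ABC):
        # The agreed-on boundary: callers code against this interface,
        # not against any particular implementation.
        @abstractmethod
        def get(self, key, default=""): ...

        @abstractmethod
        def put(self, key, value): ...

    class InMemoryStore(KeyValueStore):
        # One implementation.  A database-backed one could replace it
        # without any caller changing a line.
        def __init__(self):
            self._data = {}

        def get(self, key, default=""):
            return self._data.get(key, default)

        def put(self, key, value):
            self._data[key] = value

    def remember_greeting(store, name):
        # Written against the interface; neither knows nor cares how
        # the storage actually works.
        greeting = store.get(name, "Hello, " + name)
        store.put(name, greeting)
        return greeting

    print(remember_greeting(InMemoryStore(), "world"))  # Hello, world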

The analogy here is more with city planning than building architecture.  While it's generally good to have someone taking the long view in order to help keep things working smoothly overall, this doesn't mean planning out every piece in advance, or even having a master plan.  You can get surprisingly far with the occasional "Wait, how is this going to work with that?" or "There are already two groups working on this problem -- maybe you should talk to them before starting a third one".

A project like Linux or Apache is less like a single construction project and more like a city growing in stages over time.  In fact, the more I think about it, the more I like that analogy.  I'd like to develop it further.

I had wanted to add a few more bullet points, particularly "Software is innovative" -- claiming that people generally write software in order to do something new, while a new skyscraper, however innovative the design, is still a skyscraper (and more innovative buildings tend to cost more, take longer and have a higher risk of going over budget) -- but I think I'll leave that for now.  This post is already on the longish side and developing the city analogy seems more interesting at the moment.