Monday, December 12, 2016

The topological sort

I can imagine two reactions to that title:
  • "Topological sort?"  What's that?  Which I aim to answer shortly.
  • "You really have run out of things to say about the web, haven't you?  Now you're just dragging out stuff from your CS classes."  To which I can confidently reply ... maybe.  But let's see where this goes.
Suppose I'm baking a cake.  Before I do anything, I need to buy ingredients.  Then I need to mix up the batter.  I'll also need to warm up the oven and grease the pan.  I'll probably want to make frosting.  After the oven is warm I can put the cake in and bake it, then let it cool and frost it.  Also I should clean up, so let's say that after I mix up the batter and bake the cake I do the dishes.

We can make this a little more formal.  I'll use a letter with an arrow to another letter to show that one task has to happen before the other.  So:
  • I → M (you have to have the ingredients before you mix up the batter)
  • M → B (you have to mix up the batter before you bake it, duh)
  • W → B (warm up the oven before baking)
  • G → B (grease before baking)
  • F → A (make frosting before, um, A is for applying the frosting)
  • B → C (bake before cooling because, um, that's how cooling works)
  • C → A (cool the cake before applying the frosting)
  • M → D (no point in doing the dishes until you're done mixing up the batter)
  • B → D (gotta clean up that cake pan as well)
We can put this all together into a little ASCII-art diagram (or I suppose I could draw a picture with boxes and arrows, but this is quicker):

     + -------+ -> D
I -> M -> |   |
     W -> |-> B -> C -> |-> A
          |             |
     G -> |        F -> |

This kind of diagram is generally called a dependency graph, since it shows what depends on what.  It leaves out some important things like how long it takes to do things, and for this example I've quietly required that you have to do all the dishes at once, but this ought to be enough for purposes of illustration.  Some key features of a diagram like this are:
  • The tasks are labeled, as opposed to just being featureless dots, so we can tell W from G and so forth.
  • A task might depend on more than one thing, as B depends on M, W and G.
  • A task might have more than one thing depending on it, as M has both B and D depending on it.
  • There are no cycles, where two tasks each depend on the other, whether directly or indirectly.
  • In particular this means that each line can only have an arrow on one end.  You can't have M depend on I and also have I depend on M.
A diagram like this, whether it's describing dependencies between tasks or something else, is called a directed acyclic graph or DAG.  Directed because the lines have a direction to them, and acyclic because there are no cycles, whether direct or indirect.

If you take any two tasks, they can't each depend on the other (that is, the rules say no cycles), so either
  • One depends on the other, whether directly or indirectly.  For example A depends on W, because W has to happen before B, which has to happen before C, which has to happen before A.  This can be true for more than one reason.  For example D depends on M directly, but also because D depends on B and B depends on M.
  • Neither depends on the other.  For example, you can mix up batter and grease the pan in either order.  You can start the oven warming before you buy ingredients (if you don't mind having the oven on while you're out shopping), or vice versa.
The mathematical name for this is, reasonably enough, a partial order.  Partial orders and DAGs are just two different ways of describing the same situation.

Now suppose there's only one of you in the kitchen and you can only do one task at once.1 You need to make this partial order into a total order, where you can always say which task comes next and as a consequence each task except the first has exactly one task immediately after it.  Here's how you do that:
  • Find a task that doesn't depend on any other task in the diagram
  • Write it down in the answer and remove it from the diagram
  • Do this until there are no tasks left.
In this example, you have your choice of I, W or G.  Once you remove I, M also doesn't depend on anything in the diagram so you have your choice of M, W or G.  Keep this up until there's nothing left and you might get I, M, G, W, B, D, F, C, A.  You might also get any of a number of different answers, but in every case you'll get something that you could actually do in a kitchen.  You'll never be asked to bake the cake before you have the ingredients or anything else impossible.

This is about as simple an algorithm as you can get.  The only tricky bit is keeping track of which tasks do and don't have other tasks before them, and in most cases that's not particularly hard either.  Flattening a partial order into a total order is a lot simpler than sorting a bunch of items that are totally ordered.  Books have been written about that.

OK, so what is this good for and what does it have to do with the web?  Applications for topological sorting crop up all over the place.  Planning tasks, like in the baking example.  Figuring out how to build a large application that uses libraries that depend on other libraries (web browser depends on HTTP depends on TCP depends on low-level networking plus a few hundred others).  Recalculating cells in a spreadsheet.   Finding the best route from point A to point B. Typesetting.

Typesetting?  Yep.  If you're trying to find the best places to break a paragraph into lines, which places you can break the second line depends on where you break the first line, and so forth.  That gives you a DAG of dependencies between lines and we're off to the races.  Here's a pretty good summary (if a bit technical) of several ways to solve the problem, including a topological sort.

Suppose you're trying to learn a new subject.  If I want to learn about Frobnitizian Mechanics (not actually a thing), I might need to know about Hamiltonians in physics (actually a thing).  To understand that, I need to know some multivariate calculus and Newton's laws.  Both of those depend on single-variable calculus, and so on down to basic algebra.  Where should I start?  Start with anything that doesn't require anything you don't already know.  Repeat.

You already knew that, of course, but it's nice to know that that's all you have to do.

Of course in the real world, particularly on the web, not everything breaks cleanly into a DAG.  There are plenty of cases where, say, page A points to page B points to page C points back to A and there's no obvious place to start.  How you handle a situation like this in computing depends on what exactly you're trying to do, but it can be handled.  In real life, you should probably just start reading because hey, you gotta start somewhere.

There's one other piece of information that comes out of a topological sort that can be just as important as what has to happen before what: what doesn't have to happen before what.  In the baking example, I, W and G are independent of each other and we can remove any of them from the dependency graph first.

But in many situations you don't have to remove just one.  If I had someone helping me, I could send them out for groceries while I got the pan and the oven ready.  Most likely in that case I'd still be done with that before they got back and I'd have to wait before I could go further, but I'll still get done sooner than if I had to make the grocery run myself.

There are two reasons.2 that computers can do so much more today than they could in the past.  The one everyone knows about is Moore's law.  Processors are faster, memory is bigger and cheaper and so forth.  The other is parallel processing.  If I can break a problem down into pieces that can run independently of each other, I can throw more of this faster cheaper hardware at the problem.  If the structure of the problem limits me to having one thing happening at a time, it doesn't matter how many processors I have.  A topological sort will show you when you can parallelize and when you can't.

1 Yes, in real life you could and probably should mix up batter while the oven is warming, but I'm not going to fix the diagram to reflect this.  You could do that by splitting W into W1 for "start the oven warming" and W2 for "the oven finishes warming up", talking about "events" instead of "tasks" and adding extra dependencies  W1 →  W2 and W2 → B.  Then you could have the order W1,M,W2 with only one thing happening at a time.  I don't think this really helps the discussion.

2 Well, there are more than two, really.  Every so often someone will find a significantly faster algorithm for a key problem, for example.  One could also point to improvements in software engineering.  I mean, you'd think we'd be better at this than we used to be, and I'm pretty sure we are. Nailing down exactly how that's true, with proper respect to the very good engineering that's been done in the past, is a trickier problem.  Quantifying processor speed, parallelism or algorithmic complexity is generally straightforward.