Showing posts with label graph theory. Show all posts

Monday, December 12, 2016

The topological sort

I can imagine two reactions to that title:
  • "Topological sort?  What's that?"  Which I aim to answer shortly.
  • "You really have run out of things to say about the web, haven't you?  Now you're just dragging out stuff from your CS classes."  To which I can confidently reply ... maybe.  But let's see where this goes.
Suppose I'm baking a cake.  Before I do anything, I need to buy ingredients.  Then I need to mix up the batter.  I'll also need to warm up the oven and grease the pan.  I'll probably want to make frosting.  After the oven is warm I can put the cake in and bake it, then let it cool and frost it.  Also I should clean up, so let's say that after I mix up the batter and bake the cake I do the dishes.

We can make this a little more formal.  I'll use a letter with an arrow to another letter to show that one task has to happen before the other.  So:
  • I → M (you have to have the ingredients before you mix up the batter)
  • M → B (you have to mix up the batter before you bake it, duh)
  • W → B (warm up the oven before baking)
  • G → B (grease before baking)
  • F → A (make frosting before, um, A is for applying the frosting)
  • B → C (bake before cooling because, um, that's how cooling works)
  • C → A (cool the cake before applying the frosting)
  • M → D (no point in doing the dishes until you're done mixing up the batter)
  • B → D (gotta clean up that cake pan as well)
We can put this all together into a little ASCII-art diagram (or I suppose I could draw a picture with boxes and arrows, but this is quicker):

     I -> M ---+--------> D
               |          ^
     W --------+--> B ----+
               |    |
     G --------+    +--> C ---+--> A
                              |
                    F --------+

This kind of diagram is generally called a dependency graph, since it shows what depends on what.  It leaves out some important things like how long it takes to do things, and for this example I've quietly required that you have to do all the dishes at once, but this ought to be enough for purposes of illustration.  Some key features of a diagram like this are:
  • The tasks are labeled, as opposed to just being featureless dots, so we can tell W from G and so forth.
  • A task might depend on more than one thing, as B depends on M, W and G.
  • A task might have more than one thing depending on it, as M has both B and D depending on it.
  • There are no cycles, where two tasks each depend on the other, whether directly or indirectly.
  • In particular this means that each line can only have an arrow on one end.  You can't have C depend on B and also have B depend on C.
A diagram like this, whether it's describing dependencies between tasks or something else, is called a directed acyclic graph or DAG.  Directed because the lines have a direction to them, and acyclic because there are no cycles, whether direct or indirect.

If you take any two tasks, they can't each depend on the other (that is, the rules say no cycles), so either
  • One depends on the other, whether directly or indirectly.  For example A depends on W, because W has to happen before B, which has to happen before C, which has to happen before A.  This can be true for more than one reason.  For example D depends on M directly, but also because D depends on B and B depends on M.
  • Neither depends on the other.  For example, you can mix up batter and grease the pan in either order.  You can start the oven warming before you buy ingredients (if you don't mind having the oven on while you're out shopping), or vice versa.
The mathematical name for this is, reasonably enough, a partial order.  Partial orders and DAGs are just two different ways of describing the same situation.

Now suppose there's only one of you in the kitchen and you can only do one task at a time.1 You need to make this partial order into a total order, where you can always say which task comes next and, as a consequence, each task except the last has exactly one task immediately after it.  Here's how you do that:
  • Find a task that doesn't depend on any other task in the diagram
  • Write it down in the answer and remove it from the diagram
  • Do this until there are no tasks left.
In this example, you have your choice of I, W or G.  Once you remove I, M also doesn't depend on anything in the diagram so you have your choice of M, W or G.  Keep this up until there's nothing left and you might get I, M, G, W, B, D, F, C, A.  You might also get any of a number of different answers, but in every case you'll get something that you could actually do in a kitchen.  You'll never be asked to bake the cake before you have the ingredients or anything else impossible.
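The three steps above can be sketched directly in code. This is a minimal version of what's usually called Kahn's algorithm, using the tasks and arrows from the baking example; the only bookkeeping is a count of how many unremoved tasks each task still depends on:

```python
from collections import deque

# Edges from the baking example: X -> Y means X must happen before Y.
edges = [("I", "M"), ("M", "B"), ("W", "B"), ("G", "B"),
         ("F", "A"), ("B", "C"), ("C", "A"), ("M", "D"), ("B", "D")]
tasks = {t for e in edges for t in e}

def topological_sort(tasks, edges):
    """Repeatedly remove a task that depends on nothing still in the diagram."""
    depends_on = {t: 0 for t in tasks}   # incoming arrows not yet removed
    enables = {t: [] for t in tasks}     # outgoing arrows
    for before, after in edges:
        depends_on[after] += 1
        enables[before].append(after)
    ready = deque(t for t in tasks if depends_on[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for nxt in enables[t]:
            depends_on[nxt] -= 1
            if depends_on[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(tasks):
        raise ValueError("cycle detected: no valid order exists")
    return order

print(topological_sort(tasks, edges))  # one valid order; the choice among
                                       # ready tasks at each step is arbitrary
```

As the text says, different choices among the ready tasks give different (equally valid) total orders, and a cycle in the input would leave tasks stuck with nonzero counts, which is how the error case is detected.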

This is about as simple an algorithm as you can get.  The only tricky bit is keeping track of which tasks do and don't have other tasks before them, and in most cases that's not particularly hard either.  Flattening a partial order into a total order is a lot simpler than sorting a bunch of items that are totally ordered.  Books have been written about that.

OK, so what is this good for and what does it have to do with the web?  Applications for topological sorting crop up all over the place.  Planning tasks, like in the baking example.  Figuring out how to build a large application that uses libraries that depend on other libraries (web browser depends on HTTP depends on TCP depends on low-level networking plus a few hundred others).  Recalculating cells in a spreadsheet.   Finding the best route from point A to point B. Typesetting.

Typesetting?  Yep.  If you're trying to find the best places to break a paragraph into lines, which places you can break the second line depends on where you break the first line, and so forth.  That gives you a DAG of dependencies between lines and we're off to the races.  Here's a pretty good summary (if a bit technical) of several ways to solve the problem, including a topological sort.
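As a toy illustration of that idea (a sketch, not the algorithm from the summary linked above): treat "break after word i" as a node, with an edge from i to j whenever words i+1 through j fit on one line. The nodes, taken in index order, are already in topological order, so we can fill in a best-cost table in a single pass. The squared-leftover-space "badness" here is a made-up stand-in for what real typesetters compute:

```python
def break_lines(words, width):
    """Minimum-badness line breaking over the break-point DAG."""
    n = len(words)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[j]: min total badness with a break after word j
    best[0] = 0
    prev = [0] * (n + 1)
    for j in range(1, n + 1):            # nodes in topological order
        for i in range(j - 1, -1, -1):   # candidate previous break points
            length = sum(len(w) for w in words[i:j]) + (j - i - 1)  # spaces
            if length > width:
                break  # adding earlier words only makes the line longer
            slack = 0 if j == n else (width - length) ** 2  # last line is free
            if best[i] + slack < best[j]:
                best[j] = best[i] + slack
                prev[j] = i
    # Walk back through the chosen break points.
    lines, j = [], n
    while j > 0:
        i = prev[j]
        lines.append(" ".join(words[i:j]))
        j = i
    return lines[::-1]

text = "the quick brown fox jumps over the lazy dog".split()
for line in break_lines(text, 12):
    print(line)
```

Because each edge only goes from a smaller index to a larger one, processing break points left to right visits every node after all of its predecessors, which is exactly the topological-sort guarantee the paragraph above describes.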

Suppose you're trying to learn a new subject.  If I want to learn about Frobnitizian Mechanics (not actually a thing), I might need to know about Hamiltonians in physics (actually a thing).  To understand that, I need to know some multivariate calculus and Newton's laws.  Both of those depend on single-variable calculus, and so on down to basic algebra.  Where should I start?  Start with anything that doesn't require anything you don't already know.  Repeat.

You already knew that, of course, but it's nice to know that that's all you have to do.

Of course in the real world, particularly on the web, not everything breaks cleanly into a DAG.  There are plenty of cases where, say, page A points to page B points to page C points back to A and there's no obvious place to start.  How you handle a situation like this in computing depends on what exactly you're trying to do, but it can be handled.  In real life, you should probably just start reading because hey, you gotta start somewhere.


There's one other piece of information that comes out of a topological sort that can be just as important as what has to happen before what: what doesn't have to happen before what.  In the baking example, I, W and G are independent of each other and we can remove any of them from the dependency graph first.

But in many situations you don't have to remove just one.  If I had someone helping me, I could send them out for groceries while I got the pan and the oven ready.  Most likely in that case I'd still be done with that before they got back and I'd have to wait before I could go further, but I'll still get done sooner than if I had to make the grocery run myself.

There are two reasons2 that computers can do so much more today than they could in the past.  The one everyone knows about is Moore's law.  Processors are faster, memory is bigger and cheaper and so forth.  The other is parallel processing.  If I can break a problem down into pieces that can run independently of each other, I can throw more of this faster cheaper hardware at the problem.  If the structure of the problem limits me to having one thing happening at a time, it doesn't matter how many processors I have.  A topological sort will show you when you can parallelize and when you can't.
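Here's a sketch of that parallel reading of the baking diagram: group the tasks into "rounds," where everything in a round has all of its prerequisites in earlier rounds and could, with enough helpers, happen at once:

```python
# Edges from the baking example: X -> Y means X must happen before Y.
edges = [("I", "M"), ("M", "B"), ("W", "B"), ("G", "B"),
         ("F", "A"), ("B", "C"), ("C", "A"), ("M", "D"), ("B", "D")]
tasks = sorted({t for e in edges for t in e})

prereqs = {t: [a for a, b in edges if b == t] for t in tasks}

level = {}
def round_of(t):
    """A task's round is one more than the latest round of its prerequisites."""
    if t not in level:
        level[t] = 1 + max((round_of(p) for p in prereqs[t]), default=0)
    return level[t]

rounds = {}
for t in tasks:
    rounds.setdefault(round_of(t), []).append(t)

for r in sorted(rounds):
    print(f"round {r}: {rounds[r]}")
# round 1: ['F', 'G', 'I', 'W']
# round 2: ['M']
# round 3: ['B']
# round 4: ['C', 'D']
# round 5: ['A']
```

Round 1 is exactly the set of independent starting tasks (plus F, which nothing blocks), and the five rounds show the limit of parallelism: no matter how many helpers you have, this diagram takes at least five steps, because M, B, C and A form a chain.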


1 Yes, in real life you could and probably should mix up batter while the oven is warming, but I'm not going to fix the diagram to reflect this.  You could do that by splitting W into W1 for "start the oven warming" and W2 for "the oven finishes warming up", talking about "events" instead of "tasks" and adding extra dependencies W1 → W2 and W2 → B.  Then you could have the order W1, M, W2 with only one thing happening at a time.  I don't think this really helps the discussion.

2 Well, there are more than two, really.  Every so often someone will find a significantly faster algorithm for a key problem, for example.  One could also point to improvements in software engineering.  I mean, you'd think we'd be better at this than we used to be. I'm pretty sure we are. Nailing down exactly how that's true, with proper respect to the very good engineering that's been done in the past, is a trickier problem.  Quantifying processor speed, parallelism or algorithmic complexity is generally straightforward.

Wednesday, June 30, 2010

More random graph theory on the web

Picking up an earlier thread, I went looking for papers on non-uniform random graphs, that is to say, connected networks where some members have many more connections than others, but that's about all we know. I turned up an interesting one by Fan Chung and Linyuan Lu describing the characteristics of graphs more like ones you'll find in the context of the web.

In particular, social networks and others commonly encountered appear to follow a power law, meaning that, for some particular number b, the number of people with n connections is proportional to 1/n^b. In social networks, b tends to be between 2 and 3, so for example there might be around 1000 members with just one connection, roughly 250 with two, about 110 with three, and so on, tapering off to relatively few people with many connections.
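To see the shape of that falloff, here's the arithmetic for an illustrative b = 2 and 1000 members with exactly one connection (numbers made up for illustration, not taken from the paper):

```python
# Power-law falloff: members with n connections proportional to n**(-b).
b = 2.0
base = 1000  # members with exactly one connection (illustrative)
for n in range(1, 9):
    print(n, round(base * n ** (-b)))
# prints 1000 for n=1, 250 for n=2, 111 for n=3, and so on
```

The contrast with the halving pattern the airline discussion below describes is that a power law falls off polynomially rather than exponentially, so the "long tail" of moderately connected members is fatter.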

Chung and Lu show that under such conditions, there will very likely be a core of closely connected people, that two people chosen at random will very likely be close to each other, but there will also be some significantly less-connected members. Connecting two people outside the core will generally take considerably more steps than connecting most pairs of people.

So: six degrees of separation (or whatever it really is) in most cases, but considerably more in a few cases.

The vague terms like "very likely," "some" and "significantly" have precise mathematical definitions. The details are in the paper if you're interested.



On a recent plane flight, I looked at the route map to see if the airline's network behaved like a social networking graph. Airline networks are specifically designed to connect destination cities well but as cheaply as possible, keeping in mind that cost depends on a number of complex factors. In particular, you want to minimize the number of stops needed to get from point A to point B.

Did the airline's network follow a power law? It did, but only up to a point. Most cities had only a single connection. Fewer had two and about the same proportion fewer had three. And then there were the two hubs, each with dozens of connections. There was nothing in between. You could get from almost anywhere in the network to almost anywhere else by flying to a hub and then on to the final destination — not a big surprise.

The interesting thing is that airline networks specifically don't behave like social networks, precisely because social networks tend to leave some members out in the cold and in air travel, that just won't do. It may be worthwhile in many cases to emulate some naturally-arising structure, but it's not always the best choice. Sometimes you actually have to plan.

Saturday, October 10, 2009

Six degrees, more or less, sort of

In reference to a previous post on degrees of separation, I went looking through Wikipedia and found what I was pretty sure I'd seen before about graph theory and the "small world" phenomenon. A few points:
  • Actual social networks, and a wide variety of similar networks from all sorts of different fields, don't act like classic random graphs, where each object in the network has about as many connections as any other. Rather, there tend to be a few objects with lots of connections and a lot with relatively few connections. But ...
  • ... not a lot is known for certain, especially when it comes to networks of real live people. How connected people are depends on what kind of connections you count. There have been various efforts to measure connections in networks like Facebook and from looking at instant messaging traffic (anonymized, I would hope!), but it's not clear what you can or can't tell about people in general from that. However ...
  • ... networks in general seem to behave roughly similarly, keeping in mind that nice, regular networks are the exception both in theory and real life. In particular, as you add to the network, the diameter (the largest number of hops required to connect two objects in the network) tends to increase logarithmically. To double the diameter, you generally have to square the number of objects. One intriguing result is that if you start with a nice, regular network with a high diameter and add just a few connections here and there, the diameter drops sharply. So ...
  • ... the "small world" phenomenon looks to be a general property of networks and not necessarily a product of our modern age. My hunch about human bandwidth is that there are only so many connections a person can keep up and that limit was hit quite some time ago. The question is more to what extent scattered groups of people might be more connected than in times past. And finally ...
  • ... the famous "six degrees of separation" is not some sort of deep magic but rather a shorthand description of some early and not particularly rigorous experiments that suggested that most people are probably more or less six hops from each other. There's nothing particularly special about the number six, there are still small pockets of people who are not meaningfully connected to the outside world at all, and in any case the actual upper limit might just as well be twelve or five, depending on how you count (leaving isolates aside).
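The "add just a few connections and the diameter drops sharply" result is easy to check experimentally. A sketch: start with a ring of 200 dots, whose diameter is 100, add ten random chords, and measure the diameter by breadth-first search:

```python
import random
from collections import deque

def diameter(n, extra_edges):
    """Diameter of a ring of n dots plus the given chord edges, via BFS."""
    adj = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
    for a, b in extra_edges:
        adj[a].add(b)
        adj[b].add(a)
    best = 0
    for start in range(n):
        dist = {start: 0}
        q = deque([start])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

random.seed(0)
n = 200
print(diameter(n, []))  # plain ring: 100
chords = [tuple(random.sample(range(n), 2)) for _ in range(10)]
print(diameter(n, chords))  # typically drops to a few dozen
```

Ten chords is a 5% addition to the ring's 200 edges, but they act as shortcuts that cut the worst-case hop count by well over half, which is the small-world effect in miniature.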

Thursday, October 1, 2009

Linking my way to fame and fortune

As I may have mentioned, I'm not exactly a social networking party animal (more a party vegetable, really), but I happened to log back into LinkedIn after I'm-not-sure-how-long and lo and behold, I'm a 3rd-degree connection to the President of the United States himself.

I doubt this puts me in particularly distinguished company. LinkedIn itself appears to be on a first-name basis with the prez, and tells me that "Barack's connections" number "500+".
Alas, "Barack Obama is not currently open to receiving Introductions or InMail™" so I doubt I'll be joining that select 500+ anytime soon.

A few statistics about my own network may make the three degrees of separation somewhat less surprising. Keep in mind that I haven't tried particularly hard to make connections. They just seem to accumulate:
  • I have 83 direct connections.
  • About a dozen of those have 500+ connections.
  • 43 of the 83, so just over half, are two degrees from Obama.
  • I have 14,000+ 2nd-degree connections and about 1.4 million 3rd-degree connections.
The president's statistics are probably more like
  • So many direct connections you need a staffer to keep them straight
  • Given the nature of politics, dozens if not hundreds of those have 500+ connections. Granted, there's going to be a lot of overlap.
Now, how many 2nd-degree connections, and how many of those direct connections are two degrees from me?

There are approximately 45 million LinkedIn users overall, making it a proper megacommunity by my earlier definition. If we assume that being two degrees from one person is independent of being two degrees from another -- a dicey assumption for a number of reasons -- then the POTUS's 2nd-degree connections would number about half that, or 22 million+, based on about half of my connections being his 2nd-degree connections.

That's not completely implausible, but it's also possible that through the luck of the draw I'm connected to a cluster of people with somewhat closer than usual ties to the president. Nonetheless, if you're an active social networker with 500+ connections yourself, your odds of being 2nd-degree or closer look pretty good. The odds of any random LinkedIn member being in my position also look quite good.

Likewise, the purely random model would suggest that about 1/3,000th of Barack's connections would be two degrees from me, so likely a handful of people, maybe just one, depending on just how far over 500 the number really is. Again plausible, but again, we're not dealing with strictly random samples.
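For what it's worth, here is the back-of-envelope arithmetic from the last two paragraphs spelled out. All inputs are the rough figures from the post; nothing here is rigorous:

```python
# Rough figures from the post.
members = 45_000_000        # total LinkedIn users
my_connections = 83
two_deg_from_obama = 43     # of my 83 direct connections
my_2nd_degree = 14_000      # my own 2nd-degree network

# If "two degrees from Obama" were independent across members, the fraction
# of my connections that qualify estimates the fraction of all members:
frac = two_deg_from_obama / my_connections
print(round(members * frac / 1e6, 1), "million")  # ~23 million 2nd-degree connections

# Fraction of members within two degrees of me: about 1 in this many.
print(round(members / my_2nd_degree))  # ~3000
```

Both numbers match the estimates above; the caveat, as noted, is that the independence assumption is dicey, since connections cluster.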

Paul Erdős, Alfréd Rényi, Béla Bollobás and others proved some very interesting results about random graphs starting in 1959, but social network graphs don't appear to fit the usual model. I recall running across more relevant work while trolling through Wikipedia a while ago. I might have to go back for another look. [There's more on social network graphs here and here --D.H. Dec 2015]

So, does this demonstrate the awesome power of social networking, that a random none-too-social geek can find himself three steps away from one of the world's most influential offices? Well, just what use am I meant to make of this connectivity beyond getting a longer-than-expected blog post out of it? Whatever use I might want to put it to, pretty much the rest of LinkedIn has the same shot if not better. I and the rest of the teeming masses can't see Obama's connections, or send him InMail, or do anything else particularly impressive. In other words, we're in about the same situation as any other private citizen of the US, which is where we were without LinkedIn.

There's a general principle at work here: You can't deliver privileged access to everyone.

Thursday, May 8, 2008

Connect the dots and the world will follow

Musing about connectivity and intelligence put me in mind of one of my favorite branches of mathematics (yep, I'm a geek -- that's "favorite branches of mathematics", plural): graph theory.

For those unfamiliar with the game, here's how it works. Take a bunch of dots and connect them with lines. The lines don't have to be straight, and they can cross each other. You can move the dots around however you like to make the picture clearer. All that matters is what's connected to what. If dot A is connected to dot B, it doesn't matter here how many lines you drew between the two, so let's just say there's never more than one. From this simple setup come many deep and interesting results (interesting to a math geek, at least).

Suppose you can connect each dot to at most one other dot (if you can't connect them at all, you've just got dots). In that case, you'll always end up with some number (maybe zero) of loose dots, and some number (maybe zero) of pairs of dots connected by a line.

Suppose you can connect each dot to at most two other dots. Then (after maybe re-arranging to get a clearer picture) you'll get three different things: loose dots, chains of two or more dots connected in a line, and rings of three or more dots.

Now suppose you can connect each dot to at most three other dots. There are now infinitely many different possibilities, almost all of which are completely unknown except in a broad statistical sense. Even telling whether two arbitrary arrangements are the same or different (the graph isomorphism problem) has no known efficient algorithm in general, though, as it happens, the degree-limited case turns out to be tractable.

What if you can connect each dot to up to four (or five, or a million) different dots?  Have you gained much?  Not really.  You can take any dot with more than three lines out and replace it with a ring of dots, each with one line out to the rest of the world.

In short, there are effectively only four levels of connectivity: zero (boring), one (boring), two (pretty boring) and three or more (infinitely complex). How connected the world is really only matters at one critical point.