Wednesday, July 16, 2014

Protocol basics -- heartbeats, pings and acks

For no particular reason, I thought I'd start an occasional series on the basics of computer protocols, such as TCP and HTTP, that the web is built on.  Also for no particular reason, the basic principle that came to mind first is the idea of heartbeats.

But first, what's a protocol?

The word itself derives from Greek protos (first) and kolla (glue), so that ought to be clear enough.

No?

The trail is something like: prōtokollon really refers to the first draft of an official agreement (the first one glued into a binding), and thence more generally to an official set of rules and procedures, and thence finally to the computing meaning: A set of rules for exchanging messages between computers (often called hosts).

One of the most basic problems in computer protocols is determining whether the other party is there or not.  How hard can that be, right?

Unlike the physical world, you can't just look.  All you have is some means of sending messages, typically a relay of several steps mixing wired and wireless transmission, high-volume and low-volume connections, and so forth.  I'll go into deeper detail in some later post, but the point is that all you can do is send a message, and any particular message might or might not arrive at its destination in any particular amount of time.

One simple way to tell if the other party is there is just to ask.  Send a message saying "If you get this, please send it back to me."  You send that message, the other host sends back a reply and voila, you know they're there.

This is a perfectly good approach.  The first message is generally called a ping, probably taken from SONAR terminology, and the reply packet is generally called an ack (or ACK), short for "acknowledgement".  (There's also such a thing as a nack (or NAK), short for "negative acknowledgement", which means "yes, I got that, but I couldn't understand it," or "yes, I got that, but you're sending me messages too fast, so please stop for a bit".  I'll admit to occasionally having said "NAK" in response to an explanation that went over my head.)
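To make the vocabulary concrete, here's a minimal sketch in Python of the side that receives the ping.  The `Message` type and its `kind` strings are made up for illustration; real protocols encode this sort of thing in bits and bytes, each in its own way.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical message format, for illustration only.
    kind: str       # "PING", "ACK" or "NAK"
    body: str = ""


def respond(msg: Message) -> Message:
    """How the other host might answer a single incoming message."""
    if msg.kind == "PING":
        # "If you get this, please send it back to me."
        return Message("ACK", msg.body)
    # Yes, I got that, but I couldn't understand it.
    return Message("NAK", f"don't know what {msg.kind!r} means")
```

So `respond(Message("PING", "are you there?"))` returns `Message(kind='ACK', body='are you there?')`, while anything it doesn't recognize gets a NAK.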

But what if you don't get your ack?  Is your connection bad?  Has the other host crashed?  Did it receive your ping but fail to reply?  Did it reply, but the return connection was bad?  How long should you wait before you decide that the ack isn't coming?

To help get around problems like this, you can send a series of pings and listen for a series of acks.  To help tell what's going on, you can number them so you can match the acks to the pings.  If the connection is flaky, you might miss an ack from time to time, but overall if the other host is there and you have at least some sort of connection, you'll get at least some acks back.
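Here's a rough sketch of the numbering idea.  `PingTracker` is a hypothetical name, and actually carrying the number in the ping and getting it back in the ack is left to whatever protocol you're using.

```python
class PingTracker:
    """Number outgoing pings and match incoming acks to them."""

    def __init__(self):
        self.next_number = 1
        self.outstanding = set()  # pings sent but not yet acked

    def send_ping(self) -> int:
        number = self.next_number
        self.next_number += 1
        self.outstanding.add(number)
        return number  # this number would travel inside the ping

    def on_ack(self, number: int) -> bool:
        """True if the ack matches a ping we're still waiting on."""
        if number in self.outstanding:
            self.outstanding.discard(number)
            return True
        return False  # a duplicate, stale or bogus ack

    def unacked(self) -> list:
        return sorted(self.outstanding)
```

If you send five pings and only get acks for 1, 3 and 5, `unacked()` gives `[2, 4]`: a flaky connection, but clearly someone's there.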

You might even have the other host tell you how many pings it's heard.  That will give you some idea of whether any problems are on the outbound connection, inbound connection, or both.  For example, if the return connection is bad but the outbound connection is fine, you'll hear something like "Ack for ping 1, I've heard 1 ping", "Ack for ping 3, I've heard 3 pings" ...  If instead you hear "Ack for ping 1, I've heard 1 ping", "Ack for ping 3, I've heard 2 pings", you know that it missed ping 2 on the way out.  Most bad connections will affect both directions, but that doesn't always have to be the case -- the other host's network layer is part of the path your pings travel, and it's possible for it to send messages just fine but sometimes fail to hear them.
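The arithmetic is simple enough to sketch.  Assuming pings are numbered 1, 2, 3, ..., arrive in order if they arrive at all, and the other host doesn't restart, the latest ack tells you how many pings went missing on the way out (pings sent minus pings heard), and comparing the host's count with the number of acks you actually got tells you how many went missing on the way back.  The function name and the (ping number, pings heard) pair format are assumptions for illustration.

```python
def diagnose(acks):
    """
    acks: (ping_number, pings_heard) pairs, in the order they arrived.
    Assumes pings are numbered 1, 2, 3, ..., arrive in order if at all,
    and the other host hasn't restarted.
    Returns (lost_outbound, lost_inbound) as of the latest ack, or None.
    """
    if not acks:
        return None  # total silence: nothing can be deduced
    latest_ping, heard = acks[-1]
    lost_outbound = latest_ping - heard  # pings the other host never heard
    lost_inbound = heard - len(acks)     # acks it sent that never reached us
    return lost_outbound, lost_inbound
```

With the two runs of acks in the example, `diagnose([(1, 1), (3, 3)])` gives `(0, 1)`, one ack lost on the way back, and `diagnose([(1, 1), (3, 2)])` gives `(1, 0)`, one ping lost on the way out.  Reordered messages or a restart would throw these counts off.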

If the other host crashes and restarts, you might hear something like "Ack for ping 1, I've heard 1 ping", "Ack for ping 2, I've heard 2 pings", and then eventually, once the other host is up again, "Ack for ping 50, I've heard 1 ping".  This may or may not be useful information.  It's a basic principle of networking that during that eerie silence, there's no way to know whether the other host is crashing and restarting, the network is down, the other host is running slowly, there's a bug in whatever's handling the pings, the network is up but messages are being delayed, or whatever.
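One way to notice the restart after the fact is to watch for the other host's count going backwards, since a host that stays up can only ever hear more pings.  It's only a heuristic: if the restarted host has already heard more pings than before by the time an ack gets through, the count won't look lower and this check will miss it.  The acks are again hypothetical (ping number, pings heard) pairs.

```python
def find_restarts(acks):
    """Ping numbers at which the other host's 'pings heard' count went down."""
    restarts = []
    prev_heard = 0
    for ping_number, heard in acks:
        # A host that stayed up would report the same count or a higher one.
        if heard < prev_heard:
            restarts.append(ping_number)
        prev_heard = heard
    return restarts
```

For the sequence above, `find_restarts([(1, 1), (2, 2), (50, 1)])` returns `[50]`, which tells you something happened during the silence, though not what.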

By the time you hear back that the rebooted host has only heard one ping, you may not greatly care.  You can't begin to figure out what's going on until you get a message from the other host, and even then what you can deduce depends on the exact messages, that is, on the protocol.  On the other hand, you can decide that if you haven't heard replies for N pings in a row, something is wrong.  That's often a good bet, but you have to be prepared for the possibility that things are just slow and the other host was there all along.
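A minimal sketch of the "N in a row" rule follows.  How long to wait before counting a single ping as timed out is left to the caller, and `SilenceDetector` and its methods are invented names.  In keeping with the caveat about slow hosts, it only marks the other host as suspect rather than declaring it gone.

```python
class SilenceDetector:
    """Flag the other host once `limit` pings in a row go unanswered."""

    def __init__(self, limit):
        self.limit = limit  # the "N" in "N pings in a row"
        self.missed_in_a_row = 0

    def ping_timed_out(self):
        self.missed_in_a_row += 1

    def ack_received(self):
        # Any ack at all means the other host was there after all.
        self.missed_in_a_row = 0

    @property
    def suspect(self):
        return self.missed_in_a_row >= self.limit
```

A late ack resets the count, so a host that was merely slow gets cleared as soon as it's heard from.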

In some kinds of network, messages are always sent to everyone who could be listening.  In most such cases, the networking layer will filter out messages that aren't addressed to its own host, but it's also possible to mark them "broadcast", meaning that everyone should listen.  In such setups, a broadcast ping is a good way to find out who's on the network.  This process is called discovery, and since not all networks have broadcasting built in, there are discovery protocols for networks that don't.
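Here's a toy simulation of broadcast discovery on a shared medium, where every host sees every message and its networking layer throws away anything not addressed to it or to the broadcast address.  Everything in it (the `"*"` broadcast address, the `Host` class, the tuple message format) is invented for illustration; on a real IP network you'd be dealing with sockets and something like UDP datagrams sent to a broadcast address instead.

```python
BROADCAST = "*"  # made-up "everyone listen" address


class Host:
    def __init__(self, name):
        self.name = name

    def receive(self, dest, kind, sender):
        """Called for every message on the shared wire; returns a reply or None."""
        # The networking layer's filter: drop anything addressed elsewhere.
        if dest not in (self.name, BROADCAST):
            return None
        if kind == "PING":
            return (sender, "ACK", self.name)  # (dest, kind, sender)
        return None


def discover(me, wire):
    """Broadcast one ping to every host on the wire and collect who answers."""
    found = []
    for host in wire:  # on a shared medium, every host sees every message
        reply = host.receive(BROADCAST, "PING", me)
        if reply is not None and reply[0] == me:
            found.append(reply[2])
    return found
```

With `wire = [Host("station"), Host("printer"), Host("laptop")]`, `discover("me", wire)` returns all three names, while a ping addressed only to `"laptop"` is ignored by the printer.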

If you're having an actual conversation with another host, say, sending requests and getting replies, you're automatically pinging and acking.  However, you may reach a point where you don't have anything to say at the moment, but you want the other host to know you're still there.  In that case, you could send a ping, either as a do-nothing request or as a special kind of message.  It doesn't much matter which, so long as you and the other host agree on the protocol.  Such a message is generally called a keep-alive, since it's meant to keep the hosts from killing the connection (which basically means forgetting about it) on the assumption the other has gone away.
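Deciding when to send a keep-alive can be as simple as "whenever the line has been quiet for too long".  The sketch below works out, for a given list of times at which real requests go out, when keep-alives would be needed so the connection never goes quiet for more than `idle_limit` seconds.  The function and its parameters are hypothetical, and a real implementation would run off a timer rather than a list worked out in advance.

```python
def keepalive_times(request_times, idle_limit, end):
    """
    Times at which a keep-alive would be sent, assuming the connection
    opens at time 0 and we watch it until time `end`. Real requests
    count as proof of life, so keep-alives only fill the quiet gaps.
    """
    sent = []
    last = 0.0  # time of the last thing we sent, of any kind
    for t in sorted(request_times) + [end]:
        # Fill any gap longer than idle_limit with keep-alives.
        while t - last > idle_limit:
            last += idle_limit
            sent.append(last)
        last = t
    return sent
```

For instance, `keepalive_times([1, 2, 10], 3, 12)` returns `[5, 8]`: the requests themselves keep things alive early on, and the long quiet stretch between 2 and 10 gets filled in.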

In some cases, only one host cares if the other is there.  For example, imagine a weather station where the main host is listening for data coming from a bunch of sensors -- thermometer, anemometer, hygrometer, manometer, and so forth.  It's fine for the sensors to blindly send out their information no matter what, but the main host would like to be able to report if a sensor is faulty.  Or in an even simpler example, you just want to know if another host is there at all, without needing it to send you any particular information.

In such cases, you shouldn't have to ping (and you might not even be able to, for example if the sensors have transmitters but no receivers), but you want the things you're monitoring to send acks regularly as though you had.  You can then decide that if you miss N messages, you'll report a problem.  Since they're not actually acknowledging anything, such a message is generally called a heartbeat rather than an ack.
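Here's a sketch of the weather station's side of that arrangement.  The sensors aren't asked anything; the monitor just remembers when it last heard from each one and reports any that have been silent for more than `allowed_misses` heartbeat intervals.  The class name, the parameters, and the assumption that every sensor is expected to check in from time `start` onward are mine, for illustration.

```python
class HeartbeatMonitor:
    """Track heartbeats from sensors that send without being asked."""

    def __init__(self, sensors, interval, allowed_misses, start=0.0):
        self.interval = interval              # how often each sensor should send
        self.allowed_misses = allowed_misses  # the "N" in "miss N messages"
        # Treat every expected sensor as last heard at `start`, so a sensor
        # that never sends anything at all also gets reported.
        self.last_seen = {s: start for s in sensors}

    def heartbeat(self, sensor, now):
        self.last_seen[sensor] = now

    def faulty(self, now):
        """Names of sensors silent for longer than allowed_misses intervals."""
        limit = self.interval * self.allowed_misses
        return sorted(s for s, t in self.last_seen.items() if now - t > limit)
```

With three sensors expected every 10 seconds and 3 misses allowed, a hygrometer that never sends anything shows up in `faulty(35)`, while a thermometer heard at time 25 doesn't.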

In fact, any series of regular messages meant to determine if a host is present or not can be called a heartbeat.  The heartbeats in the famous Heartbleed bug, for example, were a series of pings and acks.  The bug was that a badly constructed ping, one claiming to carry more data than it actually did, would cause the ack to echo back that much data anyway, filled out with whatever happened to be sitting in the other host's memory.
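The shape of the bug fits in a few lines.  This is a toy model, not OpenSSL's code: the request is just a 2-byte claimed payload length followed by the payload (the real TLS heartbeat message also has a type byte and padding), and `memory_after` stands in for whatever happens to sit next to the request in the responding host's memory.  The function name and parameters are hypothetical.  The real fix amounted to the length check shown here; the heartbeat specification, RFC 6520, already said such requests must be silently discarded.

```python
def heartbeat_response(request: bytes, memory_after: bytes, check_length: bool) -> bytes:
    """
    Toy model of a heartbeat handler, not real TLS code.
    request: a 2-byte big-endian claimed payload length, then the payload.
    memory_after: stands in for whatever sits next to the request in memory.
    """
    claimed = int.from_bytes(request[:2], "big")
    payload = request[2:]
    if check_length and claimed > len(payload):
        return b""  # the fix: silently discard a request that lies about its length
    # The bug: copy `claimed` bytes starting at the payload, trusting the sender.
    memory = payload + memory_after
    return memory[:claimed]
```

Without the check, a request claiming 15 bytes but carrying only `hello` gets back `hello` followed by the next 10 bytes of "memory".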

This post has turned out longer than I expected.  I had meant to write a couple of paragraphs about heartbeats, but to get there I ended up delving a bit deeper.  As is often the case, there's more to even the simple pieces than might meet the eye.  I would like to make one last point, though.  Heartbeats, pings, acks, and indeed most of the basics of computer protocols, have been around much longer than computers.  It would be interesting to hunt down early examples, but one that springs to mind is a team on an isolated, dangerous mission agreeing to send out regular radio messages.  If some number of them don't arrive, send in the rescue squad (or just assume the mission has, sadly, failed).

The basic idea of "make a noise if you're still here" is, of course, considerably older than radio.