Friday, August 22, 2008

The "bigger fish to fry" rule

A long while back I was talking to a developer about a messaging system that was meant to deliver messages reliably even if the network was misbehaving (in other words, it was like TCP, but on a different time scale). I asked the obvious question: "What happens if the network goes down?"

"The sender keeps track of what messages have been sent. It retries if it doesn't get an acknowledgment back that a message was received."

"But what if, say, the sending machine loses power?"

"There's an option to log to disk. It's slower, but the system will recover when the machine comes back up."

"But what if there's a disk crash when the power goes off?"

"You can configure sending processes on more than one machine. It'll cost you more speed, but if one sender fails, the others will take over."

"But what if they all go down?"

At this point the developer started to lose patience. There's really not much more you can do beyond keeping more copies and reducing the chances of them all being destroyed at once. Eventually it's just not worth it, if only because keeping everyone sufficiently in sync takes more and more effort, especially since you're assuming an unreliable network in the first place. There is no 100% reliable system -- in computing or anywhere else that I know of.

On the other hand, if you have, say, three senders, all logging to disk, keeping in sync over a fast, decently reliable local network (it's the outside network we're most concerned with here), and they all crash unrecoverably, what's going on? Most likely the building is on fire or something similarly bad is happening, and whatever is trying to produce the messages in the first place is probably not able to do its job either. You'd better have enough off-site backups to get things going again after the fire trucks leave.
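To put rough numbers on this, here is a minimal Python sketch of a hypothetical failure model (an illustration, not something from the original conversation): each sender fails independently with probability `p_machine`, and a common-cause event such as a fire takes out all of them at once with probability `p_common`. The function name and the numbers are assumptions chosen for illustration, not measurements.

```python
def p_all_fail(p_machine: float, replicas: int, p_common: float) -> float:
    """Probability that every sender is lost (hypothetical model).

    Assumes each of `replicas` senders fails independently with
    probability `p_machine`, plus a common-cause event (e.g. the
    building burning down) with probability `p_common` that destroys
    all of them at once.
    """
    # Either the common-cause event happens, or it doesn't and every
    # sender still fails on its own.
    return p_common + (1 - p_common) * p_machine ** replicas


if __name__ == "__main__":
    # Illustrative numbers only: 1% per-machine failure, 0.01% chance of a fire.
    for k in range(1, 6):
        print(k, p_all_fail(p_machine=0.01, replicas=k, p_common=0.0001))
```

However many senders you add, the result never drops below `p_common`. After a few copies, the independent failures are negligible and the fire is what's left.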

In such a case, the messaging system is certainly going to fail. But its job is not to be perfect. Its job, and everyone else's, is to be good enough that if it fails there are bigger fish to fry.
