It’s well known that abstracting away network failure in inter-process communication is a bad thing. but there are other similarly harmful abstractions one might adopt when handling networks such as assuming uniformity.

In recent times there’s been a resurgence of interest in using messaging between processes as a mechanism for taming concurrency rather than the (possibly) more conventional approach of using threads and locking. This model is very appealing in it’s simplicity and some variations even allow for process failure (though I think there’s still some interesting discussion to be had around being certain that a process has failed rather than become partitioned away by network failure – split brain scenarios etc).

Some are wondering if this messaging approach could be extended beyond concurrent programming across multiple cores in a single box to deal with concurrent programming across networked machines. I think there’s maybe a small fly in the ointment – latency. If all processes are communicating via messaging inside of a single SMP box we will likely have at least approximately uniform latency between processes which is reasonably easy to manage. The same cannot be said of messaging across processes in a NUMA system or on a network. Things get still more tricky if one has processes running on a mix of SMP and NUMA machines all living on a network and messaging each other.

Managing such a mix is difficult – one must consider carefully where to deploy things and the nature of the messages you send (what you’d consider moving around an SMP system’s bus is probably not the same sort of payload you’d want to place on a network). When a process fails, one potentially cannot start up a replacement anywhere rather it must be placed carefully and appropriately.

Technorati Tags: , , ,

2 Responses to “Abstracting the Network Still Harmful”
  1. Process failure vs. network partitioning — from the point of view of the sender, should there be a distinction between these two kinds of failures?

    Maybe there’s a similar discussion re: latency. The more “local” processes assume near-zero, uniform latency the more difficult moving one or more processes to distant nodes. So it’s not like you can write a significant application “locally” and then just move the processes here and there.

    In that case the programmer may have to make some assumptions about what is local, and should be; what is local, but may not be in the future; and what is not local even today.

    If the “local and should be” assumption falls apart down the road, tough luck. A bit more rework is in order. That still seems better than getting local code all tangles up in shared memory.

    The OTP “behavior” frameworks for Erlang help out a bit in that they do not try to make the system look as if it is all local. Instead they provide mechanisms to account for network and process failures without having to rewrite those all the time. So if a process is sending a message to a gen_server and that server is local, some failures will occur less often. Otherwise there are a number of mechanisms in place that could help with little extra effort.

    And stuff.

  2. Dan Creswell says:

    “from the point of view of the sender, should there be a distinction between these two kinds of failures?”

    From the point of view of the sender there likely is no difference but I think the distinction from a receiver perspective may be more significant especially if it’s still reachable by at least some other senders.

    “So it’s not like you can write a significant application “locally” and then just move the processes here and there.”

    Indeed.

    “If the “local and should be” assumption falls apart down the road, tough luck. A bit more rework is in order. That still seems better than getting local code all tangles up in shared memory.”

    Possibly though I think there’s two things that still need more thought:

    (1) I don’t think we know yet just how far we can take these messaging models – silver bullet syndrome is still alive and well?

    (2) Arguably shared memory paradigms make it fairly obvious what is “local” (well except for the likes of Terracotta and I think it’s well known I’m not a supporter of that approach) and therefore make it clear when your judgement about local versus remote (yes, I accept that most of the remote comms frameworks inappropriately hide that distinction).

  3.