Neglecting to account for failure is an age-old problem. Consider this common error (Purify, anybody?):

#include <stdio.h>
#include <stdlib.h>

struct rhubarb { int aVal; int anotherVal; char* aString; };

......
struct rhubarb* mystruct;
mystruct = malloc(sizeof(struct rhubarb));
mystruct->aVal = 55;
......

Of course, the following check should have been included immediately after the malloc, before mystruct is dereferenced:

if (mystruct == NULL) {
  /* If memory wasn't allocated, do something appropriate. */
}
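Putting the allocation and the check together, a minimal sketch might look like the following. The helper name rhubarb_new and the choice to report failure by returning NULL are assumptions made for illustration; what counts as "appropriate" depends on the program.

```c
#include <stdio.h>
#include <stdlib.h>

struct rhubarb {
    int aVal;
    int anotherVal;
    char *aString;
};

/*
 * Hypothetical helper (not from the original post): allocates and
 * initialises a struct rhubarb. Returns NULL if malloc fails, so the
 * caller is forced to decide what "appropriate" means.
 */
struct rhubarb *rhubarb_new(int aVal)
{
    struct rhubarb *mystruct = malloc(sizeof *mystruct);

    if (mystruct == NULL) {
        /* Memory wasn't allocated: report it and hand the failure back. */
        perror("malloc");
        return NULL;
    }

    /* Only touch the struct once we know the allocation succeeded. */
    mystruct->aVal = aVal;
    mystruct->anotherVal = 0;
    mystruct->aString = NULL;
    return mystruct;
}
```

A caller would then test the result of rhubarb_new(55) against NULL before using it, and free() it when done.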

An equivalent mistake is easy to make when building a distributed system over HTTP or RMI: ignoring the error codes or exceptions that are designed to communicate failures we ought to handle. It’s similarly easy to ignore latency, to implement brittle, naive retry logic, or to assume something is reliable (a message queue, say) when it isn’t. Many have managed to concoct systems over HTTP that breach the idempotency “constraints” of REST, and whilst Erlang provides link() and receive timeouts, we’re not forced to use them.
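As a rough illustration of retry logic that is a little less brittle, here is a sketch in C that inspects the status of every call, retries only transient failures, bounds the number of attempts and backs off between them. Everything named here (call_status, remote_call_fn, call_with_retry, flaky_call) is hypothetical, and the sleep uses POSIX nanosleep. It assumes the operation being retried is idempotent; retrying a non-idempotent call this way would be exactly the kind of mistake described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Hypothetical outcome of a remote call. */
enum call_status { CALL_OK = 0, CALL_TRANSIENT = 1, CALL_PERMANENT = 2 };

/* Hypothetical signature for an idempotent remote operation. */
typedef enum call_status (*remote_call_fn)(void *ctx);

/* Sleep for roughly ms milliseconds (may return early if interrupted). */
static void sleep_ms(long ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/*
 * Call `call` up to max_attempts times. Stops at once on success or on a
 * permanent failure; retries transient failures with a doubling delay
 * capped at max_delay_ms. Returns the last status seen, or CALL_PERMANENT
 * if max_attempts is not positive (the call is never made).
 */
enum call_status call_with_retry(remote_call_fn call, void *ctx,
                                 int max_attempts, long base_delay_ms,
                                 long max_delay_ms)
{
    enum call_status status = CALL_PERMANENT;
    long delay = base_delay_ms;

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        status = call(ctx);
        if (status != CALL_TRANSIENT) {
            return status; /* success or permanent failure: do not retry */
        }
        if (attempt < max_attempts) {
            sleep_ms(delay);
            /* Double the delay without exceeding (or overflowing past) the cap. */
            delay = (delay > max_delay_ms / 2) ? max_delay_ms : delay * 2;
        }
    }
    return status; /* still failing transiently: the caller must handle it */
}

/* Stand-in for a remote call: fails transiently *ctx times, then succeeds. */
enum call_status flaky_call(void *ctx)
{
    int *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return CALL_TRANSIENT;
    }
    return CALL_OK;
}
```

Note that the function still returns a status the caller has to check: bounded retries reduce the impact of transient failures but do not make the remote call reliable.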

In essence, there is no way to ensure developers do the right thing, whether in a single-process or a distributed context. No technology, tool or design approach can prevent developers from making poor implementation decisions, which limits the value of re-hashing (Steve, Steve and Stu) RPC rights and wrongs.

I believe our best chance of doing distributed right lies not in providing some de facto standard toolset but in education[1] and mentoring that encourage the correct mindset. Such a mindset allows a developer building a distributed system to choose the most appropriate tools and use them correctly.

[1] The material to be covered would be substantially broader than the fallacies, failure handling and latency, and should probably include: logical time, FLP, failure detectors, global snapshots and Paxos.

One Comment

  1. Education is precisely the intent of my latest column, Dan. I’ve been surprised by the number of programmers I’ve met in my recent travels who’ve never read or even heard of Waldo’s paper and who’ve never heard of there being any fundamental problems with RPC. They certainly can’t avoid those inherent problems if they don’t know they’re there.

    What you say in your final paragraph is much like what the primary message of my latest column was, which is to be aware that there are other ways of doing these things that are often better — both convenient *and* correct — so developers should do themselves a favor and learn those ways, rather than just sticking with what’s convenient and ignoring the “correctness” part, whether ignorantly or willfully.
