A little known fact in distributed systems is that once you make them resilient in the face of failure, handling upgrades is relatively simple. Why is this?

To make a distributed system resilient in the face of failure requires that we eliminate all single sources of truth. Truth must be maintained by and sourced from a collective which can maintain all relevant state in the face of nodes joining, leaving and failing.

There’s an additional subtlety to resilience which is that it requires scalability and in particular we need to be able to spread load across the collective whilst the members change otherwise the loss of a node will potentially mean we can no longer handle the current load in our system (we’d no longer have enough processors to cope). Likely as not, to make this work will require us to relax system constraints enough to allow truth to propagate asynchronously.

How does all this help with upgrading though? Because the typical upgrade cycle for a node looks like this:

  1. Shut down a node.
  2. Upgrade software on node.
  3. Restart node.

Such a sequence of events when considered from the point of view of other nodes looks like failure - the node disappears, later returns and needs to be re-knitted into the collective.

Sadly, nothing comes for free so to make upgrading work we need to ensure a level of backward compatibility between versions of the software on our nodes and we also need to account for this in our communication protocols, however there are plenty of examples to help us here such as http and DNS.

Some references:

  • Tom Limoncelli of Google mentions this in a talk at Lisa ‘06.
  • Bill Clementson in a recent blog about Erlang but it’s largely relevant to any kind of distributed system.

Technorati Tags: , , ,

Comments are closed.