Archive for July, 2008

Neglecting to account for failure is an age old problem. Consider this common error (Purify anybody?):

#include <stdio.h>
#include <stdlib.h>
struct rhubarb {
  int aVal;
  int anotherVal;
  char* aString;
};
......
  struct rhubarb* mystruct;
  mystruct = malloc(sizeof(struct rhubarb));
  mystruct->aVal = 55;
......

Of course the following code should have been included after the malloc:

/*
  If memory wasn't allocated, do something appropriate.
*/
if (mystruct == NULL) {
  .....
}

An equivalent mistake is easily possible when building a distributed system in http or RMI by ignoring error codes or exceptions that are designed to communicate failures that we ought to handle. It’s similarly easy to ignore latency, or implement brittle and dumb retry logic or assume something is reliable (like a message queue) when it isn’t. Many have managed to concoct systems with http that breach the idempotent “constraints” of REST and whilst Erlang provides link() and receive timeouts, we’re not forced to use them.

In essence there is no way to ensure developers do the right thing in a single-process or distributed context. No technology, tool or design approach can prevent developers from making poor implementation decisions which limits the value in re-hashing (Steve, Steve and Stu) RPC rights and wrongs.

I believe the best chance we have for doing distributed right is not by providing some de-facto standard toolset, rather it’s through education[1] and mentoring to encourage the correct mindset. Such a mindset allows a developer building a distributed system to choose the most appropriate tools and use them right.

[1] Material to be covered would be substantially broader then the fallacies, failure handling, latency and should probably include: logical time, FLP, failure detectors, global snapshots and Paxos.

Comments 1 Comment »

Amazon has had a few problems of late, one of the more interesting ones being something S3 users encountered. It took Amazon a little while to identify the root cause:

We’ve isolated this issue to a single load balancer that was brought into service at 10:55pm PDT on Friday, 6/20. It was taken out of service at 11am PDT Sunday, 6/22. While it was in service it handled a small fraction of Amazon S3′s total requests in the US. Intermittently, under load, it was corrupting single bytes in the byte stream.

Perhaps they had anticipated this scenario as the S3 API features explicit support for software-level check-summing via MD5:

For all PUT requests, Amazon S3 computes its own MD5, stores it with the object, and then returns the computed MD5 as part of the PUT response code in the ETag. By validating the ETag returned in the response, customers can verify that Amazon S3 received the correct bytes even if the Content MD5 header wasn’t specified in the PUT request. Because network transmission errors can occur at any point between the customer and Amazon S3, we recommend that all customers use the Content-MD5 header and/or validate the ETag returned on a PUT request to ensure that the object was correctly transmitted. This is a best practice that we’ll emphasize more heavily in our documentation to help customers build applications that can handle this situation.

Some developers were surprised that any of this was necessary, expecting TCP/UDP checksums to be sufficient however Stevens points out in TCP/IP Illustrated Vol I:

Also, if your data is valuable, you might not want to trust the UDP or the TCP checksum, since these are simple checksums and were not meant to catch all possible errors.

Takeaways:

  1. Not all types of failure are binary – working or not working.
  2. Leaving the responsibility of data-safety to software layers further down the stack may not be best.
  3. Mechanisms for failure handling must be embedded in APIs.

Comments Comments Off