Amazon has had a few problems of late; one of the more interesting was a data corruption issue that S3 users encountered. It took Amazon a little while to identify the root cause:

We’ve isolated this issue to a single load balancer that was brought into service at 10:55pm PDT on Friday, 6/20. It was taken out of service at 11am PDT Sunday, 6/22. While it was in service it handled a small fraction of Amazon S3’s total requests in the US. Intermittently, under load, it was corrupting single bytes in the byte stream.

Perhaps they had anticipated this scenario, as the S3 API features explicit support for software-level checksumming via MD5:

For all PUT requests, Amazon S3 computes its own MD5, stores it with the object, and then returns the computed MD5 as part of the PUT response code in the ETag. By validating the ETag returned in the response, customers can verify that Amazon S3 received the correct bytes even if the Content MD5 header wasn’t specified in the PUT request. Because network transmission errors can occur at any point between the customer and Amazon S3, we recommend that all customers use the Content-MD5 header and/or validate the ETag returned on a PUT request to ensure that the object was correctly transmitted. This is a best practice that we’ll emphasize more heavily in our documentation to help customers build applications that can handle this situation.
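
As a rough illustration of the check Amazon recommends, here is a minimal Python sketch using only the standard library. It assumes a simple PUT whose ETag is the hex MD5 of the object, as the passage above describes, and that the Content-MD5 header carries the base64-encoded digest. The function names and the simulated corruption are hypothetical; no real S3 call is made.

```python
import base64
import hashlib


def content_md5_header(body: bytes) -> str:
    """Return the base64-encoded MD5 digest of the body (Content-MD5 form)."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def etag_matches(body: bytes, etag: str) -> bool:
    """Check a returned ETag against the hex MD5 of the bytes we meant to send.

    Assumes the ETag is the hex MD5 of the object, optionally wrapped in quotes.
    """
    return etag.strip('"').lower() == hashlib.md5(body).hexdigest()


# What the client intended to upload.
body = b"hello, S3"

# Simulate a single byte being corrupted in transit (hypothetical scenario).
corrupted = bytearray(body)
corrupted[0] ^= 0x01

# ETags a server would compute for the intact and the corrupted bytes.
good_etag = '"' + hashlib.md5(body).hexdigest() + '"'
bad_etag = '"' + hashlib.md5(bytes(corrupted)).hexdigest() + '"'

print(etag_matches(body, good_etag))  # True: the server stored what we sent
print(etag_matches(body, bad_etag))   # False: the upload should be retried
```

Sending the Content-MD5 header lets the server reject a corrupted upload itself, while checking the ETag lets the client catch the problem after the fact; doing both covers either side forgetting to check.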

Some developers were surprised that any of this was necessary, expecting TCP/UDP checksums to be sufficient. However, as Stevens points out in TCP/IP Illustrated, Vol. I:

Also, if your data is valuable, you might not want to trust the UDP or the TCP checksum, since these are simple checksums and were not meant to catch all possible errors.
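
To make Stevens's point concrete, here is a small Python sketch of the 16-bit ones' complement sum used by the TCP and UDP checksums, applied to payload bytes only (a real TCP checksum also covers the header and a pseudo-header, which this sketch omits). The example payloads are made up. It shows two kinds of corruption the sum cannot see, reordered 16-bit words and errors that cancel each other out, both of which MD5 does detect.

```python
import hashlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries out of the low 16 bits back in (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"\x12\x34\xab\xcd"
swapped = b"\xab\xcd\x12\x34"      # the same two words in the opposite order
compensated = b"\x12\x35\xab\xcc"  # one word +1, the other -1

for damaged in (swapped, compensated):
    same_sum = internet_checksum(damaged) == internet_checksum(original)
    same_md5 = hashlib.md5(damaged).digest() == hashlib.md5(original).digest()
    print(same_sum, same_md5)  # checksum unchanged, MD5 differs
```

Addition is commutative, so any reordering of whole 16-bit words leaves the sum unchanged, and any pair of errors that add up to zero goes unnoticed as well.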

Takeaways:

  1. Not all types of failure are binary – working or not working.
  2. Leaving responsibility for data integrity to layers further down the stack may not be enough.
  3. Mechanisms for failure handling must be embedded in APIs.