Archive for July 3rd, 2007

My notes on the talk by Werner Vogels and Swami Sivasubramanian:

State management is the dominant factor in scaling: state is the stuff that is tough to look after, whereas stateless is easy.

There’s a tight, complex interplay between scalability, availability, consistency, efficiency, management and performance.

Consider that billions of your body’s cells commit suicide every day and yet you continue to function unimpaired. This process (apoptosis) is essential for the health and stability of the overall organism and can be usefully applied in distributed systems. There are other interesting aspects of our biology that are relevant: check out the paper "The Limits of the Alpha Male".

Amazon is a collection of seven websites. It started as a website and a database but is now a distributed system. These changes were driven by the natural brittleness of integration via the database, and by performance and scaling issues. It was noted that database technology is many years old (reference was made to this article in ACM Queue) and we really need to move on.

For Amazon, incremental scalability is key, and it’s desirable to be able to scale dynamically both up and down with demand. Improved performance can be defined in many ways, including serving more units or serving larger units, as is required when datasets grow.

An always-on service is said to be scalable if adding resources to facilitate redundancy does not result in a loss of performance. Other aspects of a scalable service are that it:

  • handles heterogeneity
  • is operationally efficient
  • is resilient
  • becomes more cost effective when it grows

We should never expect systems to be stable:

  • things leave, join and fail continuously
  • perturbations and disruptions happen
  • failures are highly correlated and systems do not fail by stopping

A key part of Amazon’s approach to defining service contracts is SLAs. Conventional wisdom holds that SLAs are one-way contracts, but in fact they should be considered two-way contracts (what the service promises and how it is to be used). The contract might well include factors around:

  • latency for a single service or for paths through the system
  • durability and availability
  • cost

SLAs give a service the right to throttle in the face of various conditions, and they should be defined with ranges rather than single numbers.
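As a rough illustration of a two-way contract expressed with ranges, here is a minimal Python sketch. Nothing in it comes from the talk: the `Sla` class, its field names, the choice of a 99th percentile and the three status strings are all assumptions made for the example. The caller's side of the contract is a request-rate limit (exceeding it gives the service the right to throttle), and the service's side is a latency range rather than a single number.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sla:
    """Hypothetical two-way contract; every field name is an assumption."""
    max_requests_per_sec: float   # what the caller promises about usage
    latency_target_ms: float      # lower end of the promised latency range
    latency_ceiling_ms: float     # upper end of the promised latency range
    percentile: float = 99.0      # assumed: judge a high percentile, not the mean


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def may_throttle(sla, observed_rps):
    """The service may throttle once the caller exceeds its agreed rate."""
    return observed_rps > sla.max_requests_per_sec


def latency_status(sla, samples_ms):
    """Place the observed percentile latency relative to the promised range."""
    observed = percentile(samples_ms, sla.percentile)
    if observed <= sla.latency_target_ms:
        return "within target"
    if observed <= sla.latency_ceiling_ms:
        return "within range"
    return "breach"
```

A caller agreeing to 500 requests per second with a 100 to 300 ms range would be built as `Sla(500, 100, 300)`; `may_throttle` then covers the caller's obligation and `latency_status` covers the service's.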

The remainder of the talk was concerned with Dynamo, which has previously been known as HASS due to constraints in respect of an upcoming unreleased paper (titled "Dynamo: Amazon’s Highly Available Key-Value Store", to be presented at SOSP 2007; my notes say it will be released on August 9th). Dynamo embodies much of what was talked about above, achieving its functional and non-functional targets with a mixture of:

  • Sloppy quorum and hinted handoff (Werner’s own terms)
  • Vector clocks for versioning and consistency, exposed to the client application (which is expected to define the model for merges etc.)
  • Consistent hashing and other p2p techniques for scalability (I’d recommend examination of examples such as Chord or Bamboo)
  • Anti-entropy using Merkle Trees
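
To make two of the techniques in the list above concrete, here is a minimal Python sketch. It is not code from the talk or the paper. `HashRing` illustrates consistent hashing with virtual nodes, and `descends` and `reconcile` illustrate how vector clocks separate superseded versions from concurrent ones that the client application has to merge. All names (`HashRing`, `replicas_for`, `reconcile`), the use of MD5, and the defaults of 8 virtual nodes and 3 replicas are assumptions for illustration only.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hashing: each node owns several points (virtual nodes) on a ring."""

    def __init__(self, nodes, vnodes=8):
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def replicas_for(self, key, n=3):
        """Walk clockwise from the key's position and collect n distinct nodes."""
        start = bisect.bisect_right(self._keys, _hash(key))
        found = []
        for offset in range(len(self._points)):
            node = self._points[(start + offset) % len(self._points)][1]
            if node not in found:
                found.append(node)
            if len(found) == n:
                break
        return found


def descends(a: dict, b: dict) -> bool:
    """True if vector clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def reconcile(versions):
    """Drop versions superseded by another; the rest are siblings to merge.

    versions is a list of (clock, value) pairs, where clock maps node -> counter.
    """
    survivors = []
    for i, (clock, value) in enumerate(versions):
        dominated = any(
            j != i and other != clock and descends(other, clock)
            for j, (other, _) in enumerate(versions)
        )
        if not dominated and (clock, value) not in survivors:
            survivors.append((clock, value))
    return survivors
```

With this layout, removing a node from the ring only moves the keys that node owned, which is what makes incremental scaling cheap. If `reconcile` returns more than one version, the clocks are concurrent and it falls to the application to decide how to merge them.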

Update: The paper is now available


This scenario gets played out all the time in IT. The young guns claim that the old guys are out of it, don’t get the latest tech and aren’t smart enough, whilst the old dogs smile and are heard to say they’ve seen it all before.

There’s a fundamental tradeoff at work here:

  1. Intelligence allows us to at least potentially progress faster
  2. Experience allows us to avoid making mistakes as we make progress

Thus a bright inexperienced person may make fast progress but they’re much more likely to make mistakes which will slow them down. In contrast the experienced person may make slower progress but fewer mistakes. Classic hare and tortoise. Who wins?

The nature of software is such that the mistakes we make can take a long time to manifest and when they do, they cost us big time. Thus:

  1. Mistakes don’t result in short-term, localized damage; rather, they are far more disruptive, causing long-term damage that is difficult to clean up
  2. The time between the root cause of the problem and its costly manifestation is large

It follows that for our intelligence to count we must be able to see sufficiently far ahead to spot our mistakes coming before they get out of hand. Is this achievable? I think software history says it’s not, and thus experience is our only tool for understanding root causes and spotting the early signs of an approaching asteroid.

I reckon there’s a lot to be said for the old tradition of master craftsmen handing down their knowledge and experience to apprentices… (and perhaps the old dogs can learn a few new tricks along the way).
