Seattle Scalability Conference: Amazon on Data Storage
Posted by: Dan Creswell in Distributed Systems
My notes on the talk by Werner Vogels and Swami Sivasubramanian:
State management is the dominant factor in scaling: state is the stuff that is tough to look after, whereas stateless is easy.
There’s a tight, complex interplay between scalability, availability, consistency, efficiency, management and performance.
Consider that billions of your body’s cells commit suicide every day and yet you continue to function uninhibited. This process (apoptosis) is essential for the health and stability of the overall organism, and the same idea can be usefully applied in distributed systems. There are other interesting aspects of our biology that are relevant - check out the paper "The Limits of the Alpha Male".
Amazon is a collection of seven web-sites. It started as a website and a database but is now a distributed system. These changes were driven by the natural brittleness of integration via the database, and by performance and scaling issues. It was noted that database technology is many years old (reference was made to this article in ACM Queue) and we really need to move on.
For Amazon, incremental scalability is key, and it’s desirable to be able to scale dynamically both up and down with demand. Improved performance can be defined in many ways, including serving more units or serving larger units, as is required when datasets grow.
An always-on service is said to be scalable if adding resources to facilitate redundancy does not result in a loss of performance. Other aspects of a scalable service are that it:
- handles heterogeneity
- is operationally efficient
- is resilient
- becomes more cost effective when it grows
We should never expect systems to be stable:
- things leave, join and fail continuously
- perturbations and disruptions happen
- failures are highly correlated and systems do not fail by stopping
A key part of Amazon’s approach to defining service contracts is SLAs. Conventional wisdom is that an SLA is a one-way contract, but in fact SLAs should be considered two-way contracts (what the service promises and how it is to be used). The contract might well include factors around:
- latency, for a single service or for paths through the system
- durability and availability
- cost
SLAs introduce the right for a service to throttle in the face of various conditions. They should be defined with ranges rather than single numbers.
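To make the two-way idea concrete, here is a minimal sketch of my own, not anything shown in the talk. It assumes "ranges" means latency bounds at a couple of percentiles, and that the caller's side of the contract is a request-rate limit. All of the names (LatencyRange, ServiceContract, max_requests_per_sec) are hypothetical.

```python
import math
from dataclasses import dataclass


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_samples)))
    return sorted_samples[rank - 1]


@dataclass(frozen=True)
class LatencyRange:
    """Assumption: the promise is a range across percentiles, not one number."""
    p50_ms: float
    p99_ms: float


@dataclass(frozen=True)
class ServiceContract:
    promised: LatencyRange       # what the service promises
    max_requests_per_sec: int    # how the caller agrees to use the service

    def should_throttle(self, observed_requests_per_sec):
        """The service may throttle once the caller exceeds its side of the deal."""
        return observed_requests_per_sec > self.max_requests_per_sec

    def latency_met(self, samples_ms):
        """Check observed latencies against both ends of the promised range."""
        ordered = sorted(samples_ms)
        return (percentile(ordered, 50) <= self.promised.p50_ms
                and percentile(ordered, 99) <= self.promised.p99_ms)


# Hypothetical numbers, purely for illustration.
contract = ServiceContract(LatencyRange(p50_ms=20, p99_ms=100),
                           max_requests_per_sec=1000)
samples = [12, 15, 11, 14, 13, 90, 12, 13, 14, 12]
within_sla = contract.latency_met(samples)
throttle_heavy_caller = contract.should_throttle(1200)
throttle_normal_caller = contract.should_throttle(800)
```

The point of the sketch is only that both parties have obligations: the service is measured against a range, and a caller that exceeds the agreed usage can legitimately be throttled.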
The remainder of the talk was concerned with Dynamo, which has previously been known as HASS due to constraints around an as yet unreleased paper (titled "Dynamo: Amazon’s Highly Available Key-Value Store", to be presented at SOSP 2007; my notes say it will be released on August 9th). Dynamo embodies much of what was talked about above, achieving its functional and non-functional targets with a mixture of:
- Sloppy quorum and hinted handoff (Werner’s own terms)
- Vector clocks for versioning and consistency, exposed to the client application (which is expected to define the model for merges etc.)
- Consistent hashing and other p2p techniques for scalability (I’d recommend examination of examples such as Chord or Bamboo)
- Anti-entropy using Merkle Trees
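Of the techniques listed, vector clocks with client-side merging are probably the easiest to illustrate in a few lines. What follows is a rough sketch of the general technique, not Dynamo's actual implementation or API. The function names, the node names "A", "B" and "C", and the list-of-items values are all made up for illustration.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def descends(a, b):
    """True if clock a has seen every update recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def reconcile(versions):
    """Drop versions that are strict ancestors of another; keep concurrent siblings."""
    survivors = []
    for i, (clock, value) in enumerate(versions):
        dominated = any(
            j != i and other != clock and descends(other, clock)
            for j, (other, _) in enumerate(versions)
        )
        if not dominated and (clock, value) not in survivors:
            survivors.append((clock, value))
    return survivors


def merge_clocks(clocks):
    """Per-node maximum across clocks, so the result descends from all of them."""
    merged = {}
    for clock in clocks:
        for node, count in clock.items():
            merged[node] = max(merged.get(node, 0), count)
    return merged


# One write through node A, then two conflicting writes through B and C.
v1 = (increment({}, "A"), ["book"])
v2 = (increment(v1[0], "B"), ["book", "pen"])
v3 = (increment(v1[0], "C"), ["book", "cd"])

# v1 is an ancestor of both later versions, so only v2 and v3 survive.
siblings = reconcile([v1, v2, v3])

# The store cannot pick a winner; the client supplies the merge rule.
# Here the (assumed) rule is a simple union of the items.
merged_value = sorted({item for _, value in siblings for item in value})
merged_clock = increment(merge_clocks([clock for clock, _ in siblings]), "A")
```

The store itself only detects that v2 and v3 are concurrent. Deciding that a union is the right merge is application knowledge, which is why the clocks have to be exposed to the client.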
Update: The paper is now available