Archive for the “Distributed Systems” Category
I mentioned a while back that one could exploit DNS to ease some of the common static configuration issues around hostnames, ports etc. What follows is a simple outline solution, we’ve moved a long way beyond this at Betfair but the details will have to remain secret for now (sorry).
Let’s assume that we have several different releases in testing at any one time such that we wish to segment our development/testing systems into separate enclaves (each handling a separate release) and may wish to add more enclaves over time. Assume also that production is an enclave in its own right.
Firstly we define a set of logical hostnames that refer to the significant components of our system such as databases, file servers etc. Other elements such as webservers are probably independent and not referenced from other parts of the system and thus do not need names. These logical hostnames are what feature in our configuration files and do not need to change from enclave to enclave because we are going to use DNS to map from these logical hostnames to real physical machines.
Thus we want is a separate namespace for hosts in each of these enclaves so as to prevent leakage. To that end we map each namespace onto a separate domain within our DNS setup.
[Note our DNS setup would typically consist of a set of servers that maintain records for our own internal domains and possibly forward other requests for say external web address to other servers.]
Each enclave therefore has:
- A separate namespace represented as a unique domain
- A set of services deployed onto physical machines
- A mapping from logical machine names to physical machine names (or IP addresses)
- A collection of configuration files all referencing logical machine names
Each domain (namespace) contains the logical to physical mapping of machines for its associated enclave. Each domain can be a separate zone and is thus kept in a separate file read by our DNS master. This allows us to maintain a template file which can be quickly edited to create a new domain (namespace). Thus whenever we wish to create a new enclave we setup a new zone, containing the definition of a new domain which is the namespace for that enclave.
To actually resolve a logical hostname we must ensure that it is concatenated with the domain appropriate to the enclave’s namespace. Before discussing options, note that each machine will be allocated to an enclave and must be configured accordingly which we can exploit to our advantage:
- Simple configuration - ensure that the application has access to the domain to concatenate. This could be done via command-line argument but better is to source it from a well-known file on the machine which could be setup as part of allocating it to an enclave.
- Default search domain - any name not fully qualified has the default search domain appended to it. This default is typically part of the resolver configuration of the operating system and again can be setup as part of allocating a machine to an enclave.
Missing from the above is the handling of ports which might change from one enclave to the next. This can be tackled with a similar logical/physical mapping approach but must be based on the use of DNS SRV records rather than simple hostname mappings. The JDK provides little help out of the box for querying these records so something like dnsjava will be required.
Technorati Tags: distributed systems, dns, testing
Comments Off
Why do people still use static addressing in configuration files? Fixed hostnames or worse IP addresses?
These things make one’s life a nightmare when moving from one environment to another e.g. desktop to QA, QA to staging or staging to production.
With each transition, one must wade through all the relevant configuration files, find all these addresses and edit them. This creates many an opportunity for error such as missing one configuration variable or mistyping an address. It’s also a nightmare to maintain accurate documentation for all these scattered settings.
And yet this is so unnecessary if one exploits the abilities of DNS (and maybe Bonjour) properly. Just look at some of the cool stuff one can do. Better still most (all?) of it is supported in BIND.
Technorati Tags: architecture, distributed systems, systems
3 Comments »
Much is being made of a comment from Subodh Bapat especially in conjunction with further words from Greg Papadopoulos.
It’s believable that many a company will choose to host in a so-called “megacentre” but that doesn’t have to mean disaster come the day one of these fails. One can only get so much power into one place, so much cooling etc. Then there’s latency challenges such that if you’re hosted in the wrong place your customers will be displeased with the performance of your system. Which is a long-winded way of saying that whilst one might expect to see consolidation of cloud providers they’ll still need an awful lot of data-centres to hold all the kit required and provide the appropriate speed-of-light tradeoffs for those they host.
What about resilience? We know that to solve a useful class of problem (byzantine failure) one requires a minimum of n > 3f where f is the number of failures one wishes to tolerate and n is the number of nodes required. If we lower our sights a little, the minimum to handle a data-centre failure requires an active-passive approach with remote replication. Some companies however are moving to active-active models to solve problems of data-centre outage in recognition of the fact that simpler approaches work but mean significant downtime whilst the DR (disaster recovery) site is brought online.
Why if there are techniques available that address these nastier classes of failure are we losing so many “big” sites when we lose data-centres? Because most software houses (enterprise, web or otherwise) assume that failure can be prevented using backup network providers, clusters, replicated disk networks etc. i.e. hardware-based approaches that allow our software writers to pretend that nothing ever breaks leaving them to just write the important business logic.
To allow for data-centre fallure, the clouds of the future will require us to make considerably fewer assumptions in our software, network addresses might change, storage can become unavailable, processes might move and weaker consistency models must be exploited. One such cloud has already arrived in the form of Amazon and it’s notable that many developers are struggling with the new model it offers (they can’t for example find a suitable traditional database solution).
The challenges of the cloud are not in data-centre failure or consolidation of hosting solutions but in our own ability to write software that runs in these environments.
Technorati Tags: amazon, architecture, availability, distributed systems
2 Comments »
Going to be at the Google London Open Source Jam on Thursday 29th November 2007, 6pm - 9.30pm for an evening of distributed systems hackery.
1 Comment »
The release of the Dynamo paper has generated a lot of interest around the net. That’s more than appropriate because I don’t think there can be any doubt that Dynamo is a great piece of work.
It seems there might be a further bonus that’s largely gone unmentioned (even Greg seems to have missed it) but has been hinted at by Werner at various points in the past. Read carefully and you’ll find some details of a custom invocation infrastructure:
“Both get and put operations are invoked using Amazon’s infrastructure-specific request processing framework over HTTP. There are two strategies that a client can use to select a node: (1) route its request through a generic load balancer that will select a node based on load information, or (2) use a partition-aware client library that routes requests directly to the appropriate coordinator nodes. The advantage of the first approach is that the client does not have to link any code specific to Dynamo in its application, whereas the second strategy can achieve lower latency because it skips a potential forwarding step.”
Notice how they have support for both smart and dumb clients with the smart client setup being somewhat akin to a pattern that’s been seen in Google’s software including Chubby. The choice to reuse http would give them an option to leverage many a load balancer’s capability to apply custom routing by URL which would assist in service invocation routing.
Other interesting tidbits include:
- A mention at Google Scalability Conference of a lightweight rendering engine that might invoke upwards of 150 requests per page. Given some of the latencies discussed in the dynamo paper I am wondering if this custom framework might have some support for making collections of requests in parallel.
- Common service types are stateless aggregator services that can perform a lot of caching (wondering how much the use of http helps here) or stateful services.
- A statement from a past interview with Vogels:
“The first category is the services that make up the Amazon platform. There we use interface specifications such as WSDL but we use optimized transport and marshalling technology to ensure efficient use of CPU and network resources.”.
See the mention of the custom framework again but also a possible hint that they make use of a variety of interface specifications (perhaps including something homebrew).
Food for thought?
Technorati Tags: distributed systems, performance, amazon, scalability
Comments Off
It’s well known that abstracting away network failure in inter-process communication is a bad thing. but there are other similarly harmful abstractions one might adopt when handling networks such as assuming uniformity.
In recent times there’s been a resurgence of interest in using messaging between processes as a mechanism for taming concurrency rather than the (possibly) more conventional approach of using threads and locking. This model is very appealing in it’s simplicity and some variations even allow for process failure (though I think there’s still some interesting discussion to be had around being certain that a process has failed rather than become partitioned away by network failure - split brain scenarios etc).
Some are wondering if this messaging approach could be extended beyond concurrent programming across multiple cores in a single box to deal with concurrent programming across networked machines. I think there’s maybe a small fly in the ointment - latency. If all processes are communicating via messaging inside of a single SMP box we will likely have at least approximately uniform latency between processes which is reasonably easy to manage. The same cannot be said of messaging across processes in a NUMA system or on a network. Things get still more tricky if one has processes running on a mix of SMP and NUMA machines all living on a network and messaging each other.
Managing such a mix is difficult - one must consider carefully where to deploy things and the nature of the messages you send (what you’d consider moving around an SMP system’s bus is probably not the same sort of payload you’d want to place on a network). When a process fails, one potentially cannot start up a replacement anywhere rather it must be placed carefully and appropriately.
Technorati Tags: concurrency, distributed systems, messaging, networks
2 Comments »
Distributed systems practitioners get really excited by swarm theory because it holds the promise of being able to assert a state change in a system without centralized control or a global interaction (e.g. two-phase commit). Bees, ants and the like manage to conduct a co-ordinated life within a co-operative using only local interactions. They don’t have to communicate with the entire colony to agree a good place to nest or a good source of raw materials. It should be no surprise then that distributed systems practitioners also get excited about epidemic behaviours given that they possess similar qualities.
These natural systems contain a good measure of resilience as well. Each bee for example carries a little bit of state with it that it communicates (gossips) with other bees it meets. This state is the result of something it’s encountered in it’s environment. Should this bee die losing the state it’s unlikely to matter as another bee will probably encounter the same environmental conditions and thus the state will be recovered. Victory through weight of numbers.
So nature has provided us a mechanism that can make a binding decision using very loose co-ordination and eventual consistency, whilst functioning entirely off local interactions - no centralized control. Naturally more scalable. What we don’t get is a predictable time for the point at which this decision will be made (concurrent transactional systems aren’t entirely predictable in this respect either). In addition for certain kinds of decision, it’s entirely possible to have race conditions leading to a less than optimal choice. For some systems this doesn’t matter (nature is happy to be a little sloppy) but in cases where it is important there are solutions available.
One key challenge in building these systems is that much of our existing communications middleware is either point-to-point (e.g. RPC or RMI) with fixed addresses or broadcast (message queues or multicast) neither of which is well suited to the random one-to-one nature of gossip driven approaches such as those described above. What’s needed is some form of registry which can cope with dynamically changing membership that takes account of locality of nodes and an efficient means of directed inter-node communication algorithm which mimics the relatively random properties of gossiping.
Technorati Tags: biology, distributed systems, epidemic, swarm, theory
Comments Off
A little known fact in distributed systems is that once you make them resilient in the face of failure, handling upgrades is relatively simple. Why is this?
To make a distributed system resilient in the face of failure requires that we eliminate all single sources of truth. Truth must be maintained by and sourced from a collective which can maintain all relevant state in the face of nodes joining, leaving and failing.
There’s an additional subtlety to resilience which is that it requires scalability and in particular we need to be able to spread load across the collective whilst the members change otherwise the loss of a node will potentially mean we can no longer handle the current load in our system (we’d no longer have enough processors to cope). Likely as not, to make this work will require us to relax system constraints enough to allow truth to propagate asynchronously.
How does all this help with upgrading though? Because the typical upgrade cycle for a node looks like this:
- Shut down a node.
- Upgrade software on node.
- Restart node.
Such a sequence of events when considered from the point of view of other nodes looks like failure - the node disappears, later returns and needs to be re-knitted into the collective.
Sadly, nothing comes for free so to make upgrading work we need to ensure a level of backward compatibility between versions of the software on our nodes and we also need to account for this in our communication protocols, however there are plenty of examples to help us here such as http and DNS.
Some references:
- Tom Limoncelli of Google mentions this in a talk at Lisa ‘06.
- Bill Clementson in a recent blog about Erlang but it’s largely relevant to any kind of distributed system.
Technorati Tags: availability, design, distributed systems, versioning
Comments Off
When we write programs one of the things we seek to do is encapsulate our data so as to allow us to manage our dependencies and keep our code clean. Most languages OO or otherwise provide mechanisms to support this way of working.
The thing about the average database is that it doesn’t really encourage similar behaviour. It is all too tempting (and easy) to just allow everyone to access everything. Whilst we confine ourselves to a single application using the database, the problem is to some extent contained but often what we actually do is allow multiple applications access to the same database. The exact way in which this is done varies:
- Sometimes we bundle all our middle tier code together even though it has separate roles and responsibilities and integrate all of it via a single database.
- Sometimes we have multiple applications each running in a different process.
With each application we put on top of the database the problem gets worse increasing the number of invisible dependencies tying unrelated elements of code together by virtue of accessing a shared schema.
What’s happening is we’re sharing too much intimate knowledge across our system, something we’re all taught to fear. The solution is as always to prevent direct access to this intimate knowledge by interposing layers of abstraction. One way to do this is by requiring access to data to be wrapped up behind an interface. Historically we’ve done this by having a system own the database and expose interfaces that other systems can use to get the data.
Unfortunately there is a well-known issue with this approach which is that the level of granularity is wrong and these additional integration interfaces rapidly balloon into complex beasts. What we need is a a database wrapping entity that has a finer level of granularity than an entire system. Then the integration interfaces will be simpler because there will naturally be a less complex schema underpinning this more limited functionality.
What are we talking about? Services. We end up with a system of lots of discrete services each wrapping up their own data storage.
There are other benefits to this approach:
- Each service can utilize the most appropriate storage option for it’s contained data whilst having zero impact on other services that might have different needs.
- Each service is an independent entity that can be managed (monitored, deployed etc) separately.
- Centralized access patterns are more easily broken down which is useful in cases where we deploy across multiple data-centres.
Who would do such a thing?
Technorati Tags: architecture, database, distributed systems, enterprise
3 Comments »
…..Three-tier architecture. Three tier architecture is really a logical partitioning of system functions into presentation, business logic and data-access. Of course, some frameworks have attempted to turn this into a physical reality which is fine but many people believe that such systems are or can be distributed which makes little sense - why?
Because as has been said elsewhere placing code far away from the data source makes little sense. If one is to place code away from the data source it’s going to be for one or more of the following reasons:
- The computational weight demands separate scaling from the computational load inherent in managing the storage of the data.
- The computational load in the data-storage layer can be better scaled elsewhere.
- We can offset the additional latency introduced by network roundtrips.
Most three-tier architectures fail to satisfy any of the above criteria and thus aren’t good candidates for distribution.
…..Synchronous. This is because, in order to offset latency we must exploit asynchronous behaviour. Note that this does not imply the use of messaging rather it means adopting suitable asynchronous design patterns which can be implemented via RPC or messaging.
…..Completely consistent all of the time. Trying to enforce ACID properties across a distributed system is opening an enormous can of worms where one constantly attempts to defy the nature of the network. It’s not impossible to achieve but there is a tradeoff to be made. It’s often better to prefer eventual consistency in good-enough time i.e. something that approximates total consistency under most circumstances whilst degrading (hopefully gracefully) under load or in the presence of failure.
Many have attempted to implement distributed systems whilst falling into one or more of the traps above. Many have paid the price and many have consequently pronounced that distributed is inappropriate, impossible or insane. This is the motoring equivalent of strapping a massive turbo to an unmodified engine and complaining when the pistons explode through the bonnet and the oil is dumped all over the floor. In both worlds the remedy is the same, talk to an expert and be prepared to throw out a few beliefs.
Technorati Tags: architecture, distributed systems
Comments Off
My notes on the talk by Werner Vogels and Swami Sivasubramanian:
State management is the dominant factor in scaling - this is the stuff that is tough to look after, stateless is easy.
There’s a tight, complex interplay between scalability, availability, consistency, efficiency, management and performance.
Consider that billions of your body’s cells commit suicide in a day and yet you continue to function uninhibited. This process (Apoptosis) is essential for the health and stability of the overall organism and can be usefully applied in distributed systems. There are other interesting aspects of our biology that are relevant - check out the paper "The Limits of the Alpha Male"
Amazon is a collection of seven web-sites, it started as a website and a database but is now a distributed system. These changes were driven by the natural brittleness of integration via the database, performance and scaling issues. It was noted that database technology is many years old (reference was made to this article in ACM Queue) and we really need to move on.
For Amazon, incremental scalability is key and it’s desirable to be able to scale dynamically both up and down with demand. Improved performance can be defined in many ways including serving more units or serving larger units such as is required when datasets grow.
An always-on service is said to be scalable if adding resources to facilitate redundancy does not result in a loss of performance. Other aspects of a scalable service are that it:
- handles heterogeneity
- is operationally efficient
- is resilient
- becomes more cost effective when it grows
We should never expect systems to be stable:
- things leave, join and fail continuously
- perturbations and disruptions happen
- failures are highly correlated and systems do not fail by stopping
A key part of Amazon’s approach to defining service contracts is SLA’s. Conventional wisdom for SLA’s is that they are a one-way contract but in fact they should be considered as two-way contracts (what the service promises and how it is to be used). The contract might well include factors around:
- latency in respect of single service or paths through the system
- durability and availability
- cost
SLA’s introduce the right for a service to throttle in the face of various conditions and should not be defined with single numbers, rather they should be defined with ranges.
The remainder of the talk was concerned with Dynamo which has been previously known as HASS due to constraints in respect of an upcoming unreleased paper (titled "Dynamo: Amazon’s Highly Available Key-Value Store" to be presented at SOSP 2007 and my notes say it will be released on August 9th). Dynamo embodies much of what was talked about above, achieving it’s functional and non-functional targets with a mixture of:
- Sloppy quorum and hinted handoff (Werner’s own terms)
- Vector clocks for versioning and consistency, and exposed to the client application which is expected to define the model for merges etc)
- Consistent hashing and other p2p techniques for scalability (I’d recommend examination of examples such as Chord or Bamboo)
- Anti-entropy using Merkle Trees
Update: The paper is now available
Technorati Tags: distributed systems, google, conference, scalability, amazon, dynamo
2 Comments »
Woohoo, after jumping through some hoops I’m going to be able to attend the conference. I will of course be representing Betfair.
I’d be happy to attend any of the talks but my personal favourite is the storage talk by Swami Sivasubramanian and Werner Vogels. I’m really looking forward to making some new contacts. If anyone can recommend some good Seattle beers to try, drop me a line.
1 Comment »
It’s the end of my working day, time to head for the train station and make my way home. Because I’ve done this trip many a time it’s a fairly well honed process:
- I have a rough idea of how long each leg of my journey takes so if I need to be home for a given time, I can backtrack from there and figure out roughly when I need to get a train.
- I leave work on most days at about the same time so I know which trains are when, how many stops there are and which will stop at Reading.
But there are days when this doesn’t work because the trains aren’t running to time. So I fall back to a much simpler approach which is to:
- Look at each train departure
- See if it’s going to Reading
- See if it is making fewer stops than my current choice
- See how soon it’s leaving
Based on the above I make a guess as to which train is best and climb aboard. Of course, the train I’m on can break down in which case I’ll be dumped at some intermediate station and the process starts again.
In these modern times things are made easier because departures, destinations and so on are announced and displayed on big electronic scoreboards but that’s an optimization. My process doesn’t require much in the way of extra information beyond what I could get from the driver and a map of the available stations and routes (often some subset information is posted at individual stations which means I don’t need to carry all this information around in my head).
Notice also how I don’t actually need a sense of time because in the worst cases I can dump my needs for predicting/controlling my arrival time and just take the first train I can find that gets me towards my destination.
Thankfully most trains do run to time but on any particular day, some don’t. Sometimes there’s an announcement that tells me why the train hasn’t arrived, whether it’s late or cancelled and I can then decide to wait or make other arrangements. What if there isn’t an announcement? Well I’ll wait a while and then assume the train is not going to arrive. Whether I get an announcement or not I can still make progress by virtue of my self-imposed waiting limit. Of course I might miss an announcement because I didn’t wait long enough but it doesn’t matter, I’ll still find a train. Even in the worst case where I miss the last train home, I can curl up somewhere and wait until the first useful train the next day.
You probably aren’t worried about my daily journey or whether I make it home but you might want to think about the above in the context of polling, timeouts and events. And you might want to consider that only polling and timeouts are really necessary for me to find a train to get me home. Event’s help me optimize but aren’t necessary. And you may notice that I require only a minimum amount of knowledge and it’s usually available locally from train driver or station wall.
Distributed fault tolerant systems are everywhere and seem only to need the simplest of underpinning mechanisms to make them work.
Technorati Tags: architecture, availability, distributed systems, software
3 Comments »
Dan Pritchett presents some good insight in respect of large scale systems.
I’m left to ponder the fact that whilst Dan is deliberately creating these systems….
I find myself looking at nondeterministic systems a lot lately. Many solutions for the challenges of extreme scale involve relaxing constraints and coping with the ensuing chaos.
….there’s many a techie out there building large systems unaware of the fact that they can’t assert order on such a beast. All they can do is hold back the wave of chaos whilst having to clear up the odd drop that made it over the levee (e.g. consequences of a network outage or operation ordering problems). They are a long way from appreciating that chaos is a given, let alone actively managing it.
In a further twist of irony for all that chaos seems to introduce complexity, if you accept it’s existence and work with it you often end up with something simpler than was ever possible with “old world” thinking.
Technorati Tags: distributed systems, architecture, chaos
Comments Off
All seats are taken for what should be a really interesting day’s worth of interaction. But fear not, I’m reliably informed that the presentations will all be posted up on Google Video afterwards.
Nice one Google, many other conferences aren’t that willing to share.
Technorati Tags: google, scalability, conference, distributed systems
Comments Off
|