Archive for the “Architecture” Category

How big does a website have to get before custom infrastructure becomes necessary? When a website reaches this stage, what infrastructure gets built? Before trying to answer these questions we must have some means of measuring the size of a website. I’ve settled on the number of machines as a reasonable approximation because:

  • As a codebase grows it must be split up along functional boundaries, and spread across multiple processes. More code equals more processes and more machines to run them on.
  • More customers mean more load, which requires more machines to handle it.
  • More data means more storage and more processors to chew through it.

Now let’s see how many machines some of the big players are running and what infrastructure they’re talking about:

TicketMaster have at least 3000 machines and have built Spine to help them manage configuration of their infrastructure.

eBay have built a custom deployment tool (Roller), logging infrastructure, configuration management for their software services, messaging software and more. They’re running around 15000 machines across four geographical locations.

Microsoft have built a custom deployment, configuration and monitoring infrastructure called Autopilot focused on many thousands of machines. In fact we’re talking hundreds of thousands.

Google are dealing in a million or more machines and expending effort on software to handle staged, automatic upgrades. Of course they’ve already built GFS, Chubby etc.

Twitter have moved beyond the half-dozen or so machines they used to have to “a lot of servers” (hundreds?). They are seemingly still hiring operations staff but have built a custom queue server.

Facebook have at least 10000 webservers, 800 MemcacheD instances and 1800 MySQL instances. They’ve built a custom configuration-serving infrastructure, management and monitoring tools. They also contribute to MemcacheD and have built Cassandra and Thrift. They also appear to be busy building their own optimized webservers and a replacement for squid.

Amazon have tens of thousands of servers (surely more?) and have constructed Dynamo, S3, EC2, SQS etc.

A few tentative conclusions:

  1. It would seem that by the time a website has moved into the thousands of boxes it will have had to address configuration and monitoring, which suggests development efforts started before this threshold (perhaps at a couple of hundred boxes?).
  2. As the machine count moves towards the tens of thousands, automated deployment becomes essential and there’s a need to develop more service-specific infrastructure.

It seems it’s generally accepted[1] that SOA means breaking up your system into a set of co-operating components partitioned by business process. If you’re not doing that, you’re not doing SOA. It never ceases to amaze me how we get so zealous about fixed methods for architecting a system. I suspect it’s because we’d like to believe that architecture (and much of the act of development) can be done with fixed rules, cookie-cutter style: get your catalog of patterns and technology, apply them – job done. The ultimate embodiment of this behaviour is deployment of a piece of technology in the belief that once the integration is complete the system has radically shifted in terms of its architecture (e.g. deploying an ESB suddenly makes your system SOA).

So if the fixed methods of SOA are thrown out and technology is not the solution, how do we build a system? Let’s first consider some of the things we’d like from our architecture:

  1. Avoid integration via the database – otherwise data coupling will cripple us
  2. Support for granular updates – taking down the whole system is not desirable
  3. Fast rollback of changes – in case an update breaks
  4. In-production testing – there’s no substitute for real traffic in tests
  5. Minimal shared resources such as storage – so should there be an outage, impact is minimised
  6. Horizontal scaling – more boxes equals more power
  7. Support for scalable development – dev teams should be able to act in isolation most of the time
  8. Support for appropriate CAP tradeoffs – making everything consistent can be bad for availability

Although we wish to avoid coupling via the database, the reality is that our code still requires access to the data in some form or another. The best we can do under this circumstance is to limit the amount of code that directly accesses the data. We achieve this by vertically slicing (as opposed to horizontal sharding) our data and consolidating the code that is most closely related to it (e.g. performs updates) into a single encapsulated unit. All other access to the data must go via the code element of its associated unit (note that one needn’t always go to a unit for the data, it’s perfectly acceptable to cache).

In this way we limit the impact of data-schema changes to the associated unit; other parts of the system need not be concerned. There’s still some work to do, though. If the code within a unit were to be co-located within all processes containing code that wishes to make use of it, we’d need to restart all those processes whenever we wish to deploy a new version of that code (for whatever reason). Such a deployment model also encourages several bad habits:

  1. Ignoring the remoteness of the data – it’s hidden behind some form of interface and it’s tempting to attempt to hide failure behind that interface
  2. Focus on synchronous method calls – it’s natural for a developer to write synchronous method calls when the code being called looks local (note that method calls can support asynchronous behaviours)

To avoid these issues, we deploy each unit in its own process, accessed via some network endpoint that dependants use to interact with it, thus:

  • Each unit can now easily be allocated its own independent storage, apply its own sharding policy etc.
  • The network endpoint can support multiple protocol versions or we can opt to terminate multiple network endpoints onto a unit, a powerful primitive for supporting several versions of a remote interface simultaneously.
  • The network endpoint can be terminated onto some form of load balancer or custom routing implementation (which might be part of the code within the unit itself perhaps because it’s P2P based) facilitating horizontal scaling, hot upgrades, A/B testing, in-production tests etc.
  • Each unit can be assigned to a development team and much work can be done independently of development efforts elsewhere, making for less contention in development.
  • Each unit can implement whatever CAP tradeoff makes sense.
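
As a rough illustration of the points above, here is a minimal sketch of a unit that keeps its data private and serves two protocol versions from one entry point. The unit name (InventoryUnit), the request shape and the two versions are all hypothetical, invented for this example; a real unit would sit behind a network endpoint rather than a method call.

```python
class InventoryUnit:
    """Hypothetical unit: owns its data, everything else goes via handle()."""

    def __init__(self):
        # Private storage; no other unit reads or writes this directly.
        self._stock = {"widget": 4}

    def _v1(self, request):
        # Assumed version 1 of the protocol: reply with a bare count.
        return self._stock.get(request["item"], 0)

    def _v2(self, request):
        # Assumed version 2 of the protocol: reply with a structured message.
        count = self._stock.get(request["item"], 0)
        return {"item": request["item"], "count": count, "in_stock": count > 0}

    def handle(self, request):
        # Dispatch on the protocol version carried in the request,
        # so old and new dependants can be served side by side.
        handlers = {1: self._v1, 2: self._v2}
        handler = handlers.get(request.get("version", 1))
        if handler is None:
            return {"error": "unsupported version"}
        return handler(request)


unit = InventoryUnit()
print(unit.handle({"version": 1, "item": "widget"}))
print(unit.handle({"version": 2, "item": "widget"}))
```

Because the storage never leaks out of the unit, the schema behind _stock could change without either group of callers noticing, provided both handlers keep returning the same shapes.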

If we arrange for the network endpoint of each unit to be discovered dynamically at runtime we gain the ability to move our units around (e.g. for DR reasons) and have means for our system to dynamically knit itself together reducing configuration issues. Such an arrangement can also make it easier to deal with ordered startup issues (where some set of things must be available before others).
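
One way to picture runtime discovery is a small in-memory registry where units register their endpoint under a lease and dependants look it up by name. This is only a sketch under assumptions: the Registry class, the lease mechanism, the service names and the addresses are all made up for illustration, and time is passed in explicitly to keep the example deterministic.

```python
class Registry:
    """Hypothetical lookup service mapping unit names to endpoints."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.entries = {}  # name -> (endpoint, lease expiry time)

    def register(self, name, endpoint, now):
        # Registering again renews the lease and may change the endpoint,
        # which is how a unit "moves" without reconfiguring its dependants.
        self.entries[name] = (endpoint, now + self.lease_seconds)

    def lookup(self, name, now):
        entry = self.entries.get(name)
        if entry is None or entry[1] <= now:
            return None  # never registered, or the lease has lapsed
        return entry[0]


def missing_dependencies(registry, names, now):
    # A unit can poll this at startup and wait until the list is empty.
    return [n for n in names if registry.lookup(n, now) is None]


registry = Registry(lease_seconds=30)
registry.register("orders", "10.0.0.5:9000", now=0)
print(registry.lookup("orders", now=10))    # 10.0.0.5:9000
registry.register("orders", "10.1.0.8:9000", now=20)  # unit moved, e.g. for DR
print(registry.lookup("orders", now=25))    # 10.1.0.8:9000
print(registry.lookup("orders", now=100))   # None
print(missing_dependencies(registry, ["orders", "billing"], now=25))  # ['billing']
```

The lease means a unit that dies without deregistering eventually drops out of the lookup results, at the cost of dependants seeing a stale endpoint until it expires.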

Of course it’s not all good news. We will have to manage our desire for ACID guarantees because many of the mechanisms (such as two-phase commit) for achieving this in a distributed system are fraught with problems. Fortunately, people have been thinking about this for a while. We’ll also have to take care of the fallacies of distributed computing, but even this has some positive aspects, as failure and upgrade can in some cases be considered the same (noting that abstractions for message passing, failure detectors and the like can be implemented in many languages, not just Erlang).

So what remoting approaches might we use? REST/HTTP, WS-*, RMI, CORBA, messages, custom protocol – whatever is suitable for our situation (noting that some choices impact the means by which we can handle evolution of protocols etc.). What guidelines might we follow in determining how to split our code and data? There are a number of different approaches including:

  1. Considering similarities in consistency, availability and partitioning (CAP) requirements
  2. Data access localities
  3. Data relationships
  4. Jurisdictional requirements
  5. Roles and responsibilities (at coarser level than OO)
  6. Features (e.g. recommendations)
  7. Business processes
  8. Constituent elements of an overall business process

Most systems likely require a combination of these rather than one fixed approach; taste and gut instinct count for a lot. And what might we call these units I speak of? I prefer to call them services, as do a few other people, but there’s no doubt that’ll be confusing. I’ll have to think of something else…

[1] I know that Steve might well argue otherwise.


There are many distributed algorithms and they vary in lots of ways including:

Communication Method: Possibilities include shared memory, point-to-point or broadcast messages etc.

Failure Model: Perhaps the algorithm assumes complete reliability. Perhaps it copes with some types of processor failure (including stop, transient failure or Byzantine, where the processor behaves arbitrarily). It might cope with problems in its communications layers (including message loss and duplication).

Timing Model: The algorithm might require computation and communication to progress in lock-step (synchronous) or it might cope with steps in arbitrary order with arbitrary speed (asynchronous). In between these two extremes exists an area of algorithms that have partial timing information (e.g. processors can access partially synchronised clocks). Asynchronous/Synchronous is independently applied to processors and communication channels.

The easiest to program are the synchronous algorithms. Asynchronous algorithms are harder to program because the order of events is uncertain; however, they have the advantage of needing no consideration of timing. Asynchronous algorithms also present some unique challenges for consensus, which can be addressed by means of a failure detector. Many distributed systems provide stronger timing guarantees than are assumed in the asynchronous model, which brings us to the partially synchronous model; perhaps surprisingly, this is the most difficult to program. Algorithms in this class are potentially efficient and the most realistic, but care must be taken to ensure the timing assumptions they make are not violated (perhaps by failing to arrange for some aspect of process behaviour to act within the assumptions).
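
To make the failure detector idea a little more concrete, here is a minimal sketch of a timeout-based detector driven by heartbeats. It is an assumption-laden illustration rather than any particular published design: the class name, the fixed timeout and the explicit `now` argument (used instead of a real clock so the example is deterministic) are all invented here.

```python
class HeartbeatFailureDetector:
    """Hypothetical detector: suspects any process that has gone quiet."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # process name -> time of last heartbeat

    def heartbeat(self, process, now):
        # Record that we heard from the process at time `now`.
        self.last_seen[process] = now

    def suspects(self, now):
        # A process is suspected when its last heartbeat is older than
        # the timeout. This is a guess about failure, not proof of it.
        return sorted(p for p, seen in self.last_seen.items()
                      if now - seen > self.timeout)


fd = HeartbeatFailureDetector(timeout=3.0)
fd.heartbeat("a", now=0.0)
fd.heartbeat("b", now=0.0)
fd.heartbeat("a", now=2.0)
print(fd.suspects(now=4.0))   # ['b']
fd.heartbeat("b", now=4.5)    # b was only slow, not dead
print(fd.suspects(now=5.0))   # []
```

The second call shows the catch: a suspicion can turn out to be wrong, so the timeout is effectively a timing assumption that the algorithm built on top must be able to tolerate being violated.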

Such a classification helps us choose algorithms appropriate to our network environment (which should include consideration of how often manual intervention will be required). A popular leader election algorithm simply requires each process to broadcast its UID across the network and maintain a lease. If a process doesn’t receive a UID higher than its own, it can assume it is the leader. This algorithm works in a synchronous network with no failures. It can also be adapted to work in an asynchronous network with reliable FIFO channels and no failures. However, it can fail in the presence of a network partition or packet loss, leading to split-brain behaviour, which would need to be addressed with manual action or additional fault handling in other parts of the system.
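
The election rule described above can be simulated in a few lines. This is a simplified sketch, assuming a single synchronous round and leaving out the lease entirely; the `reachable` function is a hypothetical stand-in for the network that says whether a broadcast from one process arrives at another.

```python
def elect(uids, reachable):
    """Return the sorted list of processes that believe they are leader.

    uids: unique process identifiers.
    reachable(sender, receiver): True if sender's broadcast is delivered.
    """
    leaders = []
    for me in uids:
        # UIDs this process actually received during the round.
        heard = [u for u in uids if u != me and reachable(u, me)]
        # No higher UID heard, so this process assumes leadership.
        if all(u < me for u in heard):
            leaders.append(me)
    return sorted(leaders)


def reliable(sender, receiver):
    # Every broadcast reaches every process.
    return True


def partitioned(sender, receiver):
    # Processes 3 and 5 are cut off from process 7.
    side = {3: "a", 5: "a", 7: "b"}
    return side[sender] == side[receiver]


print(elect([3, 7, 5], reliable))      # [7]
print(elect([3, 7, 5], partitioned))   # [5, 7]
```

With a reliable network exactly one process wins. Under the partition both 5 and 7 conclude they are leader, which is the split-brain outcome the text warns about.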


Jim Waldo writes:

My own conclusion is that system design is really a matter of technique, a way of thinking rather than a subject that can be taught in a particular course. It might be possible to build a program that teaches system design by putting students through a series of courses that hone their system design skills as they move through the subject matter of the courses. Such a series of courses would, in effect, be a formalized version of the apprenticeship that is now the way people acquire their system design technique…..

…..Even worse than not being visible to the customer, work done on designing the system is not visible to the management of the company that is developing the system. Even though managers will pay lip service to the teaching of The Mythical Man Month, there is still the worry that engineers who aren’t producing code are not doing anything useful. While there are few companies that explicitly measure productivity in lines-of-code per week, there is still pressure to produce something that can be seen. The notion that design can take weeks or months and that during that time little or no code will be written is hard to sell to managers. Harder still is selling the notion that any code that does get written will be thrown away, which often appears to be regression rather than progress.

In such an environment lip service often extends to technical strategy as well.


From On Designing and Deploying Internet-Scale Services (James Hamilton – Windows Live Services Platform):

We have long believed that 80% of operations issues originate in design and development, so this section on overall service design is the largest and most important. When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there. Throughout the sections that follow, a consensus emerges that firm separation of development, test, and operations isn’t the most effective approach in the services world. The trend we’ve seen when looking across many services is that low-cost administration correlates highly with how closely the development, test, and operations teams work together.

Yep, I’m a firm believer too….
