Changing The Way We Build Systems

Introduction

Many systems are still being built on top of an abstraction which attempts to aggregate all hardware so that it looks like a single machine. This results in a tendency to create a system which processes all data everywhere (theoretically it could be) as if the system were still running on a single CPU with a single hard-disk. Building systems this way leads to many problems. For example, global synchronization of state becomes a major engineering headache and can become a performance bottleneck.

System Design Principles

  1. It's accepted amongst CPU designers that most programs exhibit a property known as the principle of locality3. This principle essentially states that programs have a tendency to execute various sequences of instructions repeatedly. It's also been noted that a similar behaviour is seen in terms of data access (though this pattern occurs less often).
  2. Jim Gray recently wrote a technical note concerning distributed computing economics1 which concludes with a recommendation to locate computation near to the data it acts upon such that in cases where one must aggregate data from several sources, it makes sense to filter as much as possible, close to the data source, before merging the results together elsewhere.
  3. Whilst CPU performance is doubling approximately every eighteen months, network traffic is doubling approximately every twelve months2. The basic implication being that any single machine is unlikely to be able to scale such that it can continue to handle all requests to a particular system as time goes on.

The Problem Of Today

Most systems are engineered to take account of the principles stated above but it's done in a centralized fashion where everything is held in one or two databases with the executing processes running nearby. This leads, over time, to a very dense, closely inter-connected web of systems which is difficult to provision and change according to demand.

Solving The Problem For Tomorrow

The solution is actually relatively simple from an architectural perspective. Rather than centralize our data, we scatter it across a number of repositories and place the appropriate processes close to those repositories creating enclaves of related components. This exploits the tendency for local communication to predominate but doesn't prevent these enclaves from communicating with each other (however, if the system is built correctly, this inter-enclave communication should occur much less often than intra-enclave communication). It also provides a useful side-effect in that these systems become more modular making maintenance, in it's many forms, much easier.

The outstanding issue, of course, is resilience/scaling. This is solved by ensuring that each enclave has the appropriate level of redundancy and replication required (a task that is made much easier because we've broken our system down into enclaves). Such replication and redundancy can be achieved using simple data scatter/gather, failover, RAID and hot swap CPU techniques all of which are relatively accessible solutions today. There's little need for expensive cluster or large-scale shared memory solutions which tend to be required when adopting the more centralized approach. Note that it's entirely possible that some enclaves require little in the form of "backup" as they are either not critical or operate only on transient as opposed to long-lived state.

Conveniently, this approach to hardware usage dovetails well with the latest trends toward commodity computing such as racks, blade servers etc.

One problem still remains which is related to the available tools and frameworks we use to build our systems. They too are targeted at the construction of centralized systems and will need replacing or, at least supplementing. There will also be a demand for new architectural patterns which will likely be more asynchronous in nature.  This applies equally to the operations we apply and the staleness of data.

An Aside - Too Narrow Focus

When talking to many developers, I find they focus on a very narrow part of the problem space in which they work. The result is that they do not 'see' the higher-level design issues such as lifecycle problems or growing complexity. Many of them see all issues from the perspective of database access or code inside an application server. When confronted with new technological options, they attempt to classify them according to components they are already familiar with, such as databases or message queues. This is a design-mistake of massive proportions. Humans learn best by doing, which makes it more useful to take new technology and try things with it to see what works and what doesn't rather than thinking in terms such as 'is it a better message queue?' or 'can I replace my database with it?'

Where Are We Going?

There will be an increasing number of devices accessing systems from the network edges, which have less horsepower than a PC and require additional work to be co-ordinated on their behalf. In this climate of ever more complex integration between systems, the traditional centralized approach to architecture/systems design will not be viable.

A growing proportion of system access will be performed by intelligent devices working alone with no user-supervision, increasing load and complexity in the systems they talk to. The number of transactions will, in these cases, not be limited by the typical human 'think time' between operations. The web interface will become one of a number of means of access to systems. Current technologies will not handle this form of scale given they are largely either based on centralization or fixed peer-to-peer infrastructures.

Spontaneous, ubiquitous, networking will be the new order and software infrastructure will have to change to accommodate this. Issues in need of addressing will include:

References

  1. Distributed Computing Economics, Jim Gray, PDF, Word
  2. Internet traffic growth: Sources and implications, Andrew Odlyzko, PDF
  3. Computer Architecture: A Quantitative Approach, Book
  4. Asynchronous Design Homepage, Website
  5. Big Systems from Simple, Unreliable Components Weblog

© Copyright 2003 Dan Creswell

Distributed Systems