Archive for the 'Distributed Systems' Category

Trains, Timetables and Getting Home

June 01st, 2007 | Category: Distributed Systems

It’s the end of my working day, time to head for the train station and make my way home. Because I’ve done this trip many a time it’s a fairly well honed process:

  1. I have a rough idea of how long each leg of my journey takes so if I need to be home for a given time, I can backtrack from there and figure out roughly when I need to get a train.
  2. I leave work on most days at about the same time so I know which trains are when, how many stops there are and which will stop at Reading.

But there are days when this doesn’t work because the trains aren’t running to time. So I fall back to a much simpler approach which is to:

  1. Look at each train departure
  2. See if it’s going to Reading
  3. See if it is making fewer stops than my current choice
  4. See how soon it’s leaving

Based on the above I make a guess as to which train is best and climb aboard. Of course, the train I’m on can break down in which case I’ll be dumped at some intermediate station and the process starts again.

In these modern times things are made easier because departures, destinations and so on are announced and displayed on big electronic scoreboards but that’s an optimization. My process doesn’t require much in the way of extra information beyond what I could get from the driver and a map of the available stations and routes (often some subset information is posted at individual stations which means I don’t need to carry all this information around in my head).

Notice also how I don’t actually need a sense of time because in the worst cases I can dump my needs for predicting/controlling my arrival time and just take the first train I can find that gets me towards my destination.

Thankfully most trains do run to time but on any particular day, some don’t. Sometimes there’s an announcement that tells me why the train hasn’t arrived, whether it’s late or cancelled and I can then decide to wait or make other arrangements. What if there isn’t an announcement? Well I’ll wait a while and then assume the train is not going to arrive. Whether I get an announcement or not I can still make progress by virtue of my self-imposed waiting limit. Of course I might miss an announcement because I didn’t wait long enough but it doesn’t matter, I’ll still find a train. Even in the worst case where I miss the last train home, I can curl up somewhere and wait until the first useful train the next day.

You probably aren’t worried about my daily journey or whether I make it home but you might want to think about the above in the context of polling, timeouts and events. And you might want to consider that only polling and timeouts are really necessary for me to find a train to get me home. Event’s help me optimize but aren’t necessary. And you may notice that I require only a minimum amount of knowledge and it’s usually available locally from train driver or station wall.

Distributed fault tolerant systems are everywhere and seem only to need the simplest of underpinning mechanisms to make them work.

Technorati Tags: , , ,

3 comments

Unappreciated Chaos

May 25th, 2007 | Category: Architecture, Distributed Systems

Dan Pritchett presents some good insight in respect of large scale systems.

I’m left to ponder the fact that whilst Dan is deliberately creating these systems….

I find myself looking at nondeterministic systems a lot lately. Many solutions for the challenges of extreme scale involve relaxing constraints and coping with the ensuing chaos.

….there’s many a techie out there building large systems unaware of the fact that they can’t assert order on such a beast. All they can do is hold back the wave of chaos whilst having to clear up the odd drop that made it over the levee (e.g. consequences of a network outage or operation ordering problems). They are a long way from appreciating that chaos is a given, let alone actively managing it.

In a further twist of irony for all that chaos seems to introduce complexity, if you accept it’s existence and work with it you often end up with something simpler than was ever possible with “old world” thinking.

Technorati Tags: , ,

Comments are off for this post

Google Scalability Day

May 24th, 2007 | Category: Conference, Distributed Systems

All seats are taken for what should be a really interesting day’s worth of interaction. But fear not, I’m reliably informed that the presentations will all be posted up on Google Video afterwards.

Nice one Google, many other conferences aren’t that willing to share.

Technorati Tags: , , ,

Comments are off for this post

Distributed Systems Reading List

May 23rd, 2007 | Category: Distributed Systems

I realized a while ago that I have this bunch of material I keep to hand at all times - thought it might be worth sharing. It’s in the sidebar on the right. I’ll be adding to it over time, for a start there’s some stuff over on my website that should be in there.

Enjoy.

Update: Added some more links and fixed all the one’s the stupid HTML editor didn’t correctly tag, mutter.

2 comments

Making a Brighter Browser

May 08th, 2007 | Category: Web

Patrick and Fuzzy jumped on something from Mark Baker about Apollo and whether or not it’s built on top of the web.

Mark’s argument essentially amounts to “make it work in the existing browser” as opposed to “extend beyond the browser”. There’s a fine dividing line here because the current environment provided by the browser is limited.

There are a multitude of ways to deal with the fact that browsers don’t support local file access etc. One might be to properly “web-enable” the local computer’s operating system having it provide access to resources etc via 127.0.0.1 in a browser friendly fashion.

Another would be something like Apollo or Java or …..

More interesting for me is that whilst the web is supposed to be this de-centralized thing (and it is indeed in terms of services and combining them together) we’re not really de-centralizing the implementation of individual services such that we are able to push substantial logic into the client-side.

The current model, as I’ve said before, can make maintenance of services easier because new code for the browser is but a page refresh away. I think we could maintain a good deal of this constraint whilst getting closer to something like Apollo.

[ And how Jini like is the web-deployment model? Seems very similar - push the impl to the client just when it needs it. ]

So maybe Apollo isn’t the solution, maybe the existing browser isn’t the solution, maybe we need to consider the browser to be a work in progress, in need of extension so we can build better “on top of the web”.

Who knows?

Why do we want this kind of logic in the client anyway? Well because whilst statelessness is great and all, the reality is that if the state is not held inside of a human being’s head, it’s going to be in the browser and manipulating that state in efficient fashions (not pushing it all back to the server in every call) dictates a fair amount of client-side intelligence. We might also want to batch things up and so on for the purposes of making better use of the network. Lastly we might want enhanced user-interaction but eye-candy is sweet, you can never get enough and your browser could end up obese.

[For more on state in web apps, see Dave Orchard's note]

Technorati Tags: , , ,

Comments are off for this post

Victims Of J2EE Success

April 27th, 2007 | Category: Distributed Systems, Java

The vast majority of server-side Java programmers have J2EE on their resumes, they pride themselves for being experts in this particular technology but there’s a problem. Many of these programmers have their minds warped into the J2EE way of thinking:

  1. There is nothing beyond the database
  2. POJOs focused purely on business logic
  3. This is distributed programming
  4. Ops is someone elses problem
  5. Deploy more or bigger boxes to scale

Most enterprises can comfortably tolerate systems built this way but what if you’re not most enterprises? What if you are an eBay or a MySpace? eBay for example have thrown out almost all of J2EE and built their own libraries to tackle the problems they face around:

  1. Monitoring
  2. Hot Upgrades
  3. Scaling

Basically once you’re beyond a certain level of challenge the J2EE way of thought and patterns of design don’t work. So where does one find Java programmers that can cope with such a challenge? They’re going to need serious knowledge of:

  1. Deployment
  2. Monitoring
  3. Networking
  4. FLP
  5. SEDA
  6. Threads
  7. REST
  8. …..

But put that on a job advert and see how many responses you get! J2EE is a raging success to be sure but if you’re a company that can’t use it you’re likely going to be a victim of that success when looking to hire server-side Java programmers.

All of this has me wondering how one should frame job adverts of this nature. Should we even bother asking for Java experience or simply drop the language/platform constraint entirely? What should we be asking for? Multi-user online game programming perhaps? What else?

Technorati Tags: , , , ,

Update: I’ve added REST to the list as I suspect that it won’t fit well with existing J2EE-derived thinking.

Update 2: For an idea of MySpace’s challenges see here and here.  And then grab a copy of the slides from the Mix06 site entitled “Running a Mega-Site on Microsoft Technologies” (under Breakout, Next Generation Browsing Experience).

36 comments

Significant Gravity

April 25th, 2007 | Category: Distributed Systems

And I quote:

Project Darkstar is a research effort focused on the design of massive-scale, latency-optimized systems like online games. Written entirely in the Java programming language, the server platform provides a simple but powerful interface for defining server-side application logic. It takes care of persistence, load balancing, consistency and communications, leaving developers free to focus on their applications.

As of Fall 2006 we are in the final stages of a re-design and implementation of the system. Check back here soon for new details, programming interfaces, and news about public releases.

There’s some pretty serious people involved in that effort one of whom I was talking to over email earlier today, sounds like they’re doing some fun stuff.

Technorati Tags: , ,

Comments are off for this post

Manny Joins The Betfair Blog Party

April 25th, 2007 | Category: Web

Manny is by his own admission a “pretty pictures guy” hence his interest in web clients, Ajax, REST etc.

His first posting covers the basics of the Betfair business model which will set the scene for future tech postings and discussion of the brutal challenges Betfair faces (expect to see some serious performance numbers).

Technorati Tags: , ,

Comments are off for this post

Distributed Systems in Practice: The Human Body

April 04th, 2007 | Category: Distributed Systems, Technology

Patrick has noticed the parallels as has Werner Vogels.

There’s an awful lot of stuff to be learnt about building distributed systems from the human body including how control isn’t centralized, how reliability is achieved and how the constant replacement of cells contributes to overall health.

A couple of books worth a read:

Technorati Tags: , ,

Comments are off for this post

Blurring The Remote Call Boundary

March 28th, 2007 | Category: Distributed Systems, Java

I see many a comment talking about concerns with roundtrips in SOA/distributed architectures. For example in “The Fractal Nature of Web Services” the author starts totalling up additional round trips taken once we disassemble what was a single app into separate remote services. He then goes on to walk down the well worn performance path expressing concern about performance etc.

His concern is certainly valid but it makes a fundamental assumption about how you implement your services:

Each service is represented by some stub that communicates via some protocol to some remote back-end. Clients access the service via the stub and thus incur remote call costs.

Further whether he meant it or not, the author gives the impression that SOA can only be done with Web Services which is absolutely not true. Your choice of method for implementation for SOA is determined by your chosen architectural constraints and can often (and maybe, should always) lead to a hybrid model with “public” interfaces exposed via WS-* or REST and others implemented in Java.

And this is where it gets interesting because with Java we can break the stub/remote back-end constraint. Why? Because we have code-downloading. We can construct a service implementation that offers an interface and is advertised in a service registry as usual but at the point of access, downloads it’s whole self to the client. The service then runs in the client’s address space.

i.e. We have what appears to be a remote service with none of the roundtrip costs because in fact it runs locally.

From a practical perspective that means we can do things like:

  1. Create a service that wraps a database and contains all the SQL etc
  2. Advertise that service
  3. Client downloads service and all it’s dependent code and configuration
  4. Client invokes on service
  5. SQL connection is made to database
  6. Queries are run against database

Now, were the service genuinely remote we would incur two roundtrips:

  1. Client to Service
  2. Service to Database

But because we can “inject” the service into the client we actually only have the roundtrip to the database. Things to note:

  • We keep all the benefits of late binding of services and clients at runtime.
  • We taught the client a new protocol on the fly - we downloaded all the client/database communication code at runtime as part of service lookup.

Technorati Tags: , ,

Comments are off for this post

Availability is a Feature

March 25th, 2007 | Category: Architecture, Web

Normally we class availability as a non-functional requirement of the systems we build but I’m starting to think that’s a mistake because often when we get to project scheduling it’s forgotten. We allocate time, budget and resource to make sure we’re feature complete, that this button or that tick box is visible or that we can gather credit card details but rarely do we treat availability similarly.

Unlike a functional requirement which has some form of physical manifestation availability tends to be invisible dotted around the system in all sorts of different places from architecture and code to hardware. In contrast real features are a logical sequence of inter-linked pieces of code and design that are naturally traceable from entry to exit point. No entry point means the feature is not complete and this can be trivially determined prior to deployment. The first time we know availability isn’t “feature complete” is when our deployed website is down.

Worse, with every single feature or subsystem we add and the corresponding increase in load and size of dependency graph there’s an increased chance that we compromise availability. This can make for some serious growing pains where we add more features, attract more customers and thus more load all the time paying no attention to availability until it’s too late.

Consider the wiring in a house, we add a garage which needs lighting, maybe heating and a motor-powered door. Then we build an extension for guests, more lighting, heating and wall sockets. Eventually, something gives, maybe it’s a piece of wiring that should’ve been replaced or a bad junction but the net effect is to risk burning out all the wiring in the house. Replacing all the wiring would’ve been hard enough without the extensions but now the house is twice as big and the wiring is four times more complex. Worse, we cannot do a piecemeal upgrade instead we must replace all the wiring at once and until it’s done we can’t have all the heating on or we can’t cook at the same time as using the garage.

Like the wiring, availability needs constant and sometimes significant attention and putting the work off can lead to significant cost down the line. Similar arguments apply to other non-functional aspects of our systems such as scalability (it’s no good adding features that attract more customers if the system won’t cope with the additional load).

In light of the above, I believe it might be better to classify availability as a feature if not the feature. After all if the system is not available, none of the cool features we’ve implemented are worth anything because they can’t be accessed. It might also be a catalyst for discussion (especially amongst non-technical staff) of availability’s costs and benefits as compared to those of other features leading to improved strategic thinking.

Relevant Links:

Technorati Tags: , , ,

2 comments

Compulsory Attendance

March 22nd, 2007 | Category: Architecture, Distributed Systems

I came across a new concept in my travels this week, an invitation with compulsory acceptance.

Is an invitation that is compulsory to accept an invitation at all? Surely most people would see it as nothing more than a dressed up way of demanding attendance? Surely such “garnish” (Ben Elton fan’s will know exactly what I’m talking about) just engenders bad feeling in many recipients?

Meanwhile in the world of systems…….

We often do the equivalent of compulsory invitations in the way we build software - consider for example the whole dependency injection thing.

In many cases we build some object via a constructor that expects to be injected with a whole heap of other things. How do we guarantee all those things will attend and remain present for the lifetime of our object? After all without these things our object cannot perform it’s intended task. About the only way this works is by insisting that all these things exist within the same JVM as the object to be injected - then they all either exist or don’t exist.

Clearly, this doesn’t work so well in distributed systems where achieving the same availability assurance is considerably more difficult. A slightly more subtle issue is that if our systems are written in the above manner, upgrades are also more difficult because we must manage a dependency tree. e.g. We wish to upgrade System A and it has dependants, Systems B and C both of which will need to be taken off-line whilst we upgrade A.

What is required is a more dynamic form of injection, something that can be reinitialised or changed which at least invalidates injection via constructors (well, unless we throw the whole object out and rebuild it from scratch). In fact, we probably require some kind of event-based solution such that systems can get liveness information about the systems they rely upon and take appropriate action (which will likely include dealing with in flight operations).

Technorati Tags: , , ,

6 comments

Still Can’t Pin the Tail on the Donkey

March 15th, 2007 | Category: Distributed Systems, Technology

I attended Sun’s London Tech Days conference yesterday. I came away with a couple of overwhelming impressions.

We have serious problems in terms of the quality of our developers. Some of the questions being asked should be reserved for those getting to know the trade at school or university and not be uttered by “paid professionals”. Simply horrifying……and here’s some further food for thought.

Sun have this saying “The Network is The Computer” - based on yesterday I’d say that’s true for all the wrong reasons. See what they really mean is “The Network is One Computer”. There was lots of talk about multiple cores, multiple threads (no mention of the likes of SEDA), race condition detectors (a cool piece of tech by the way), optimizing compilers and so on.

But there wasn’t a single mention of what they would do when you grew beyond a single machine. The closest they came was a few brief mentionings of clustering which only counts for tightly-coupled, uniform computational scenarios. All those web 2.0 sites, Amazon, Google and so on are building systems on top of multiple boxes which need to work together in ad-hoc, changing configurations.

In our current systems construction doctrine we still focus on building our application inside of a single machine out of bits (e.g. Spring or App Server style). Witness how we strive to allow developers to run all that’s required on their own machine and rarely force them to run remotely. Do we really feel this is a good thing given that when the code is given to ops the first thing they do is put it on lots of machines? What incentive does a developer have to write monitoring tools to help ops out when they don’t see the pain in development?

Standard development practices are not aligned with what goes on in deployment, Sun and others need to start bridging that gap and developers need to start working in a style that fits much better with what goes on “over the wall”. Certainly it’s not going to be easy given we have such a long legacy of confining ourselves to a single machine but it’s necessary.

The fact of the matter is we know developers don’t like threads, transactional memory might be a nicer model but it won’t scale any better and virtualization makes it easier to spread software across multiple “machines”. The future looks like it will be multiple, co-operating, separate (possibly remote) processes - now where are the tools etc to make that happen?

Technorati Tags: , , ,

4 comments

At The Edge of Control

March 05th, 2007 | Category: Architecture, Distributed Systems

In the last couple of years we’ve seen the arrival of the mashup which is at least on some level nothing more than the latest in a long line of terms for integration. Thus far most mashups consist of a simple amalgamation of a couple of services which leads to a very flat graph of service dependencies. Service dependencies are things like:

  1. Data Schemas - structure of data provided by services
  2. Endpoints - location of service be it a URL or a WSDL endpoint
  3. Availability - whether a service is available for use
  4. Reliability - whether a service that is available behaves as expected

These mashups are already at the mercy of the underlying services they are built on. The mashup provider has little control over these services. For the most part these mashups work but it’s because they have only a few moving parts such that the likelihood of issues is low.

Many enterprises have considerably deeper dependency graphs inside their firewalls and have to work hard to keep them stable. There’s probably some limit to what can be achieved once the dependency graph gets beyond a certain depth and it might well be that the maximum depth is smaller once external services are brought into the mix. The maximum depth is likely further reduced because these enterprises wish to treat external services as if they are part of their organization. They want to be able to integrate them using transactions, they want them to have the same level of reliability as what lives within their own data-centres, they want integrated security options etc.

I suspect that a lot of what can be done in a single enterprise (at substantial cost) such as high reliability is going to be considerably more (prohibitively?) complex to achieve across organizations. This is because the level of control required to achieve these targets is beyond that available across enterprises.

Right now I think there’s much effort being made to paper over these issues such as features in WS-*, SLA’s etc. I wonder if it might it be better to give up on this idea of control and build some simpler solutions….

Technorati Tags: , , ,

1 comment

Wondering Just What SOAP Is For….

March 02nd, 2007 | Category: Distributed Systems, Technology

I’m not talking personal hygiene which of course is important (your Mum is right, brush your teeth, bathe regularly etc) I’m talking protocols.

Pete’s highlighting the issue which comes down to this strange comment from Paul.

SOAP isn’t RPC? Well sure if I read this I can see the statement Paul references:

SOAP is fundamentally a stateless, one-way message exchange paradigm, but applications can create more complex interaction patterns (e.g., request/response, request/multiple responses, etc.) by combining such one-way exchanges with features provided by an underlying protocol and/or application-specific information.

Trouble is the same document also says:

[SOAP Part2] defines a data model for SOAP, a particular encoding scheme for data types which may be used for conveying remote procedure calls (RPC), as well as one concrete realization of the underlying protocol binding framework defined in [SOAP Part1].”

And then (here):

One of the design goals of SOAP Version 1.2 is to encapsulate remote procedure call functionality using the extensibility and flexibility of XML. SOAP Part 2 section 4 has defined a uniform representation for RPC invocations and responses carried in SOAP messages.

What does [SOAP Part2] say?

One of the design goals of SOAP is to facilitate the exchange of messages that map conveniently to definitions and invocations of method and procedure calls in commonly used programming languages. For that purpose, this section defines a uniform representation of remote procedure call (RPC) requests and responses.

If we go by the letter of the base-spec SOAP isn’t about RPC but if we look at the spirit, it’s stated that a design goal for SOAP is to support RPC. Note the number of times RPC is mentioned in the base and adjunct specs. Seems like RPC is a fairly fundamental part of this whole thing. Confused?

Update: Don Box wrote up this brief history of SOAP which also makes the link to RPC

Technorati Tags: , , ,

2 comments

« Previous PageNext Page »