Building a concurrent system ultimately boils down to:

  1. Partitioning the data into chunks that can be separately acted upon
  2. Applying computations against those chunks to produce results

The smaller or more fine-grained the chunks, the more concurrent activity is possible. In theory, the closer one can get to one chunk per core the better. In reality it’s rare (a function of throughput and size of calculation) that one needs to do computation across all chunks simultaneously, so a core can be assigned many chunks and dispatch operations against any one of them at a given moment in time.
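
As a rough illustration of the two steps, here is a minimal Python sketch. The names (partition, process_chunk, run) and the sum-of-squares calculation are my own placeholders, not part of any particular framework. It uses a thread pool purely for brevity; CPU-bound work in CPython would more likely want a process pool.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, chunk_size):
    """Step 1: split the data into chunks that can be acted upon separately."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def process_chunk(chunk):
    """Step 2: a placeholder computation applied to a single chunk."""
    return sum(x * x for x in chunk)


def run(data, chunk_size=4, workers=2):
    chunks = partition(data, chunk_size)
    # Deliberately more chunks than workers: each worker is handed many
    # chunks over time and works on one of them at any given moment.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


total = run(list(range(10)))
```

Shrinking chunk_size gives the pool more freedom to spread work around, at the cost of more dispatch overhead per item.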

There are many solutions for building concurrent systems but those that provide some abstraction which makes request routing easy to implement are likely to work best as it makes re-balancing of computation easier. One shouldn’t immediately assume that message passing is the answer as there are many ways to achieve routing (e.g. via DNS).
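
To make the point about routing concrete, here is a small sketch of a routing abstraction, assuming chunks are identified by integers and workers by name. The Router class and its methods are hypothetical. What matters is that callers ask the router where a chunk lives rather than hard-coding it, so re-balancing is a change to the router alone.

```python
class Router:
    """Maps chunk ids to worker names so callers never hard-code a worker."""

    def __init__(self, workers):
        self._workers = list(workers)
        self._overrides = {}  # chunk_id -> worker, set during re-balancing

    def route(self, chunk_id):
        # An explicit re-balancing decision wins over the default placement.
        if chunk_id in self._overrides:
            return self._overrides[chunk_id]
        return self._workers[chunk_id % len(self._workers)]

    def move(self, chunk_id, worker):
        """Re-balance by pinning a chunk to a different worker."""
        self._overrides[chunk_id] = worker


router = Router(["box-a", "box-b"])
before = router.route(1)   # default placement
router.move(1, "box-a")    # shift load away from box-b
after = router.route(1)
```

The same lookup could just as easily be answered by DNS or a configuration service; the in-memory table is only the simplest thing that shows the idea.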

Any solution represents a transparency tradeoff. If, for example, routing is hidden inside the solution, it can be easy to get something up and running but difficult to transition from one box to a multi-box deployment. There are many tradeoffs to be made, and for any case where control is given to the developer/architect it’s likely there will be libraries/frameworks to ease the initial implementation burden. Programming languages alone will not be enough (Scala makes such a differentiation quite difficult given its language extension capabilities).

One aspect discussed less often is the difference between processing on a set of cores all in one box versus processing across a set of cores on many boxes. The latter brings the following challenges all related to the fallacies of distributed computing:

  1. Cores are more likely to become inaccessible
  2. The latency of an operation can become substantially more variable
  3. Any centralised functions (e.g. job scheduler or watchdogs) are more vulnerable to becoming isolated from the resources they manage such that processing ceases.

The latency factor is particularly challenging as few concurrent approaches make it sufficiently explicit that developers/architects are encouraged to be appropriately mindful.
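
One way to make latency explicit is to force every caller to state how long it is prepared to wait. The sketch below is only an illustration of that idea: call_with_deadline and slow_remote_op are made-up names, and the sleep stands in for a slow network hop.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def slow_remote_op(x):
    """Stand-in for an operation on another box; the sleep simulates latency."""
    time.sleep(0.3)
    return x


def call_with_deadline(pool, fn, arg, timeout_s):
    """Run fn(arg) on the pool, making the caller choose a deadline."""
    future = pool.submit(fn, arg)
    try:
        return ("ok", future.result(timeout=timeout_s))
    except FutureTimeout:
        # The work may still be running; we have only stopped waiting for it.
        return ("timeout", None)


with ThreadPoolExecutor(max_workers=2) as pool:
    fast = call_with_deadline(pool, lambda x: x + 1, 1, timeout_s=1.0)
    slow = call_with_deadline(pool, slow_remote_op, 1, timeout_s=0.05)
```

Because the deadline is a required argument, the question "what happens if this takes too long?" has to be answered at every call site rather than discovered in production.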

Thus far, as has been the case throughout our history, the solutions are polarising into those that work within the confines of a single box and those that work across multiple boxes with the emphasis on the former. I fully expect developers and architects to fall into the old trap of using a single-box solution to solve a multi-box problem with all the associated issues. Of the solutions that work across multiple boxes, very few account fully for the impact of the network.


As soon as we give something a name, it becomes open to abuse and misuse.

Vendors can claim they are doing it and support it, developers can claim they do it, use it or implement it. There are a bunch of ready examples: Agile, XP, SOA and REST. Naming something makes it easy to ignore or forget its underpinnings, the elements that deliver value.

As a martial artist, I’m familiar with this pattern of behaviour: various people claim to practice and teach authentic Silat, Karate, Kung Fu, Escrima and so on. Inevitably some of them are exposed as pretenders. One of the more notable martial artists, Bruce Lee, was sufficiently concerned about this that he gave serious consideration to leaving his approach to martial art (Jeet Kune Do) unnamed*.

Is it worth naming things? Might we be better served by making our knowledge, approaches and philosophies visible, without naming them, for others to adopt or not as they see fit? Would it reduce the number of valueless certifications, buzzword CVs and endless wars over which way is the way and who’s doing it right?


* Jeet Kune Do (1997) ‘Actually, I never wanted to give a name to the kind of Chinese gung fu that I have invented, but for convenience sake, I still call it “Jeet Kune Do”. However, I want to emphasize that there is no distinction between jeet kune do and any other kind of gung fu, for I strongly object to formality, and to the idea of distinction of branches.’


I’ve spent a significant amount of my career helping to unpick messed up architectures and wondering how they ever come to be. Certainly it can’t be because they’re appealing to work with:

  1. Making changes becomes increasingly expensive – make one small change and it spiders into changes across many other areas and gets into corners one least expects.
  2. Replacing components of the system (because, for example, they’re no longer supported, don’t perform adequately or can’t scale) requires significant reverse engineering to understand dependencies etc.
  3. It only takes one piece of the system failing to bring everything to its knees.
  4. Isolating the root cause of a bug takes significant amounts of effort because it’s difficult to quickly eliminate large chunks of the system.

More often than not it’s believed (I’m guilty) that these systems come into being through incompetence or indiscipline on the part of the developers involved, but I think there may be another contributory factor: much of the advice on design and architecture is couched in terms of design from scratch, and there’s less guidance on working with an existing architecture.

The result is that when developers start out building a system they have a lot of advice they can apply but as it grows, it becomes more difficult to apply the advice and discern what changes are appropriate, so the architecture unravels. Is there a way to avoid this unravelling? I believe there is and it’s derived from the process for fixing up an errant architecture.

These architectures have smells equivalent to the code-level examples Fowler discusses in his book on refactoring such as:

  1. Some area of the system is too tightly coupled, making changes harder.
  2. Some part of the system contains an assumption that there is only one resource of some type (e.g. a database) limiting scaling.
  3. Many components of the system are reliant upon one key component being constantly available such that if it fails, nothing works.

Having identified these smells we need to perform appropriate cleanup which, for the list of examples above might include:

  1. Placing additional APIs (interfaces) within the tightly coupled area of the system to reduce shared implementation knowledge and create well-bounded islands of data.
  2. Introducing a resource discovery pattern to abstract away the assumption of a single resource at a single address.
  3. Introducing concepts such as acceptable staleness of data (which allows caching for a period of time), eventual consistency (which supports making updates and resolving the outcome at a later date) or asynchronous operations.
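
Two of the cleanups above lend themselves to a short sketch: resource discovery and acceptable staleness. Everything here is illustrative. ResourceRegistry and StaleTolerantCache are hypothetical names, the addresses are invented, and a real discovery mechanism would live outside the process.

```python
import time


class ResourceRegistry:
    """Callers ask for a resource by role instead of assuming one fixed address."""

    def __init__(self):
        self._addresses = {}

    def register(self, role, address):
        self._addresses.setdefault(role, []).append(address)

    def lookup(self, role):
        addresses = self._addresses.get(role, [])
        if not addresses:
            raise LookupError(f"no resource registered for {role!r}")
        return list(addresses)


class StaleTolerantCache:
    """Serves a cached value until it is older than max_age_s seconds."""

    def __init__(self, loader, max_age_s, clock=time.monotonic):
        self._loader = loader
        self._max_age_s = max_age_s
        self._clock = clock          # injectable so staleness can be tested
        self._entries = {}           # key -> (value, loaded_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] <= self._max_age_s:
            return entry[0]          # acceptably stale: skip the key component
        value = self._loader(key)
        self._entries[key] = (value, now)
        return value


registry = ResourceRegistry()
registry.register("database", "db1.example:5432")  # invented addresses
registry.register("database", "db2.example:5432")
databases = registry.lookup("database")
```

Neither class needs to be introduced everywhere at once; each can be slipped in behind one call site at a time, which suits the gradual approach described next.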

It’s important to realise that in any substantial system we will be unable to eradicate a smell completely in a single update because it’s too risky. There will be many places in the code we might forget to patch up, a high likelihood we’ll miss something in testing, low probability we’ll get API designs exactly right etc. We must gradually introduce modifications over a period of time (months or even years) rather than perform significant rewrites. This isn’t as bad as it seems because no architecture is perfect for very long once it’s exposed to users. It also suggests that perhaps we need to focus on documenting techniques for gradual evolution of an architecture.

If we were to get better at spotting these architectural smells early (slight odour as opposed to horrific stench) and working to address them sooner rather than later, it might be possible to avoid having a system’s architecture unravel, leading to something more sustainable.

Updated: to include additional commentary on APIs and perfection.


A bad habit I’ve noticed in many a techie:

The tendency to thrash around and wildly speculate about the root cause of whatever production issue they’re facing. They tweak code and configuration following some random hypothesis or another, hoping that the issue will magically go away. It must surely be clear that this is a horribly inefficient way to solve a problem?

What’s required is data, data that we can use to home in on the source of the fault. We could wade through log files but this is inefficient and ought to be the last resort. Ideally we’d have some idea of what to look for beforehand.

Instrumentation is one tool we can use to guide our efforts. It can tell us things like how much memory is used, how much load there is, how many users are logged in, rate and types of request, cache hits etc.

Self-tests are also useful as they can exercise common operations, perform internal consistency checks and provide feedback on what’s working and not.
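
To show how little code instrumentation and a self-test can require, here is a minimal sketch. The Metrics and InstrumentedCache classes are invented for the example, and the counter names are arbitrary; a real system would export the counters to whatever monitoring it already has.

```python
from collections import Counter


class Metrics:
    """A tiny in-process counter store."""

    def __init__(self):
        self._counters = Counter()

    def incr(self, name, amount=1):
        self._counters[name] += amount

    def snapshot(self):
        return dict(self._counters)


class InstrumentedCache:
    """A hypothetical component that reports on itself."""

    def __init__(self, metrics):
        self._data = {}
        self._metrics = metrics

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        if key in self._data:
            self._metrics.incr("cache.hit")
            return self._data[key]
        self._metrics.incr("cache.miss")
        return None

    def self_test(self):
        """Exercise a store/read round trip without touching the counters."""
        probe = "__self_test__"
        self._data[probe] = 1
        ok = self._data.get(probe) == 1
        del self._data[probe]
        return {"round_trip": ok}


metrics = Metrics()
cache = InstrumentedCache(metrics)
cache.put("user:1", "alice")
cache.get("user:1")   # counted as a hit
cache.get("user:2")   # counted as a miss
report = cache.self_test()
```

With this in place, "is the cache working, and is it actually being hit?" becomes a question we can answer from data rather than by guesswork.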

We can also get online memory dumps and there are tools like dtrace and tcpdump.

Given all these possibilities, why do we indulge in wild speculation? Perhaps it’s because we’ve foolishly left ourselves no choice:

  1. Instrumentation that should be a rich source of useful information is often limited to what is available from the operating system because we neglect to instrument our own code.
  2. As with instrumentation, we don’t make the time to implement self-test facilities.
  3. Only a few of us bother to learn about tools such as dtrace.
  4. Logging, even if we could wade through it all, is implemented in such a fashion that it cannot be turned on in production because the performance cost is too high.
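
On the last point, much of the cost of logging comes from building messages that are then thrown away. A sketch using Python’s standard logging module shows one way to keep detailed logging cheap enough to leave in production code. The logger name, expensive_summary and handle are placeholders, and summary_calls exists only to make the effect visible.

```python
import logging

logger = logging.getLogger("orders")  # placeholder logger name

summary_calls = []  # records each time the expensive path actually runs


def expensive_summary(order):
    """Stand-in for a costly rendering of internal state."""
    summary_calls.append(order["id"])
    return ", ".join(str(item) for item in order["items"])


def handle(order):
    # Guard the expensive work so it only happens when DEBUG is switched on.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("order detail: %s", expensive_summary(order))
    # Passing arguments separately defers string formatting until the
    # record is actually emitted.
    logger.info("handled order %s", order["id"])
```

Raising the level to DEBUG at runtime then turns the detail on for as long as it is needed, and lowering it again removes the cost.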


We’ve all seen it: customers change their requirements, add a few more features and yet expect the project deadline to stay the same even though there are no additional resources.

For some reason they act as if a software team has infinite, cost-free capacity. The psychology that drives this behaviour is somewhat unclear because there are various potential motivators such as political ambition, naivety or willful ignorance.

One might expect to see this problem occurring in waterfall projects but it can also plague early agile projects. Typically the backlog grows and grows, the customer has a desired release date in mind and expresses horror when it becomes clear that the whole backlog cannot possibly be implemented in the timeframe (accompanied by cries of “but I followed the process”).

It shouldn’t be possible to make this mistake given real-world experiences. For example:

We put our car in for an oil change, we get a quote for cost and an estimate for how long the work will take. We drop the car in at the garage and then a little later phone up and request additional work such as fixing the air-conditioning, replacing two tires, sorting the exhaust and swapping out the brake pads. Not for a second do we entertain the idea that the cost and time for the work will be the same as originally quoted.

Yet we still persist in the notion that a software development team is a bottomless pit of resource.
