My current company has for obvious business reasons got a serious interest in delivering a quality website experience during the World Cup and thus I’ve been spending a lot of time focused on our own performance and capacity management of late.

P&C is one of those 80/20 tradeoffs. There’s always more one can do or measure or test, equally getting the basics in place will deliver substantial benefit. I’d go further and argue that without a solid grasp of the basics, one cannot easily determine what else beyond that might be required. Here then are the basics that I’ve found myself repeating over and over:

  • Have an enquiring mind – anomalies are not to be ignored or dismissed on the basis of pure speculation. Determining root cause is essential to prevent surprises in production. Some recent examples:
    1. In one test we noticed that every so often we’d get a substantial blip in disk I/O on servers that should be processing entirely out of memory. Along with that blip there’d be a corresponding reduction in throughput, we could have ignored it, after all things sorted themselves out relatively quickly but we chose to investigate. All these servers were periodically running a cleanup job the developers were unaware of and had not factored into their capacity calculations. The implications for production would have been a regularly overloaded, badly performing website. We’ve since tuned the jobs, adjusted their schedules and increased our capacity to ensure we can always spread the load around enough to accommodate them.
    2. An examination of the distribution of load on the boxes behind our load-balancers revealed a higher than expected amount of variance in CPU and connections. A review of the application revealed that any particular user’s traffic is sticky to one box, unfortunate as it’s stateless, time for a code change. We also spent time looking at the monitoring infrastructure and discovered that in certain cases we’d get false reports of 100% CPU utilisation, that one will be fixed with an OS patch.
  • Gather the right data – there’s no value in allowing oneself to be limited by what is easily available via some set of tools people are comfortable with. One tool we were using had an unreasonably low ceiling on the number and rate of samples it could handle such that any graphs it produced showed hardly anything of the true profile of e.g. CPU utilisation, memory consumption or I/O. Forming any opinion about system behaviour in respect of load was going to be an exercise in speculation. We junked the tool and are looking for a replacement, in the meantime we’ve fallen back to making use of low level performance counters which we can sample local to the machine and whack onto disk for later analysis via scripts, opensource tools etc.
  • Design tests that support reasoning – One should indeed try and replicate production load behaviours to judge overall system behaviour. The challenge of such testing is that it can be difficult to relate performance data back to exactly what was going on during some period of a test and make a diagnosis or be confident of an improvement. There are a number of things we can do to improve the situation:
    1. Ensure tests are deterministic such that any given run can be compared against other runs. This isn’t as simple as it looks when e.g. you wish to gradually increase load at a fixed rate that is being produced by more than one box.
    2. Have tests produce sufficient logging that one can easily identify what was going on at particular points in the sampled data. Logging of course can actually affect test behaviour and that isn’t always desirable.
    3. Build additional tests that target particular user journey’s through the system. Doing this for all possible journey’s can be costly so it makes sense to focus on testing those which are most popular with users. These kinds of tests restrict the reasoning tree making analysis, diagnosis and solution identification much easier.
  • Measure what customers care about – they don’t care about CPUs, I/O or memory, they worry about things like response times. It is important to focus on maintaining a quality user experience not endlessly improving system efficiency. Considering user factors such as response times stops us expending huge effort on CPU utilisation when we should be focusing on say, network I/O, browser performance or reducing the amount of data we push to the browser before a page can render.
  • Beware of averages – it is very tempting to combine datasets via the use of averaging unfortunately such a practice can easily hide spikes that might be indicative of a problem. On more than one occasion an engineer has presented a graph that tracks the average CPU and a table that summarises min, avg and max. After which they’ve pronounced load testing was a success and yet they have no explanation for why the average is never more than 50% but the max is 100% and whether or not this is good or bad.

  • More than load – excessive focus on measuring the effect of a particular load can make us blind to another important metric, resource cost per unit of work – these are the collection of tests and analysis that help us understand what to tune and how much to keep our appetite for boxes and bandwidth reasonable. One simple thing teams can do per sprint (assuming you’re agile, why wouldn’t you be?) is point a profiler at each component and look for the low hanging fruit that is poor algorithm selection or inefficient code (e.g. repeated scanning of lists where a hashmap would be better or repeatedly computing something that could be cached).
  • Share/Bookmark

Comments 2 Comments »

Prioritisation is a solution that can be used in a few situations:

  • Messaging – where some class of messages needs to be processed before one or more other classes.
  • Job execution – where the results of some set of jobs need to be available before others.
  • Levelling – where satisfying peak demand would require lots of hardware that in other periods would be significantly under-utilised.

It’s a very useful pattern but there are a few dark corners to think about:

  1. Even low priority items have some importance, otherwise they wouldn’t exist at all. If there are too many high priority items passing through the system there is significant risk the low priority items will not be processed in an acceptable time period.
  2. If there are too many high priority items passing through the system, the low priority items might not get processed at all leading to huge backlogs that take an age to process.
  3. If the high priority items begin taking a large amount of time to process, low priority items are delayed with resulting in a huge backlog as above.

In essence, a certain workload mix can mean that one must wait infinitely for low priority items to be processed and that is rarely acceptable. Making prioritisation work effectively means ensuring that there is sufficient capacity to process all work within their respective acceptable time periods.

For some applications there is a convenient “quiet” period overnight where low priority items can be cleared out of the system as there’s a dearth of high priority items to process. In other cases processing of priority classes must be interleaved e.g. process 100 high priority items, then 5 low priority items and repeat. Alternatively one can dedicate varying sized pools of resource (partitioning) to processing priority classes with each pool scaled according to their timeliness requirements.

Some technical staff naively use priority to solve a throughput problem where capacity is insufficient to cope with all work in parallel. This can appear to work for a while if there are lulls in demand as mentioned above but ultimately, as workload increases such an approach will fail unless care is taken in profiling the workload and ensuring there is sufficient capacity to satisfy all priorities.

  • Share/Bookmark

Comments Comments Off

A programming language is a tool. These days in fact it’s more a toolbox as there’s an entire ecosystem associated with a language that makes it more or less suitable for a particular discipline (e.g. website development). There are many other tools beyond languages of course: CORBA, J2EE, SOAP, AJAX, Visual Studio .NET, Emacs etc

The obsession we have with our tools is verging on the sexual. We worship them, we endlessly compare them, we get excited about this or that extension. It drives much conversation in corridors and at conferences but it’s largely worthless because there’s no context.

Does a carpenter get excited about a saw, a power-drill or the latest hammer? Not really, because long ago they realised that whilst one must know how to make effective use of a tool and how to maintain it whilst it goes unused, what really matters is figuring out what the job itself actually is. This is the context that dictates which tools are appropriate.

We speculate about concurrency, we speculate about building websites, we speculate about writing this or that application but it’s all pointless until we actually set about a specific task with intent.

The smart techie has a good grasp of a wide range of tools, knows when to use them and ensures they have meaningful escape plans (that may never be implemented) in case the day comes when those tools turn out to be the wrong choice or need replacing. Most of all a smart techie puts thinking and planning well before worrying about tools.

In simple terms, we need to stop playing with our tools and focus on the real challenge, tackling real-world problems with elegant, simple, well thought out, maintainable, cost-effective solutions. Tools help you build such things but they aren’t the essence of it.

  • Share/Bookmark

Comments 4 Comments »

Building a concurrent system ultimately boils down to:

  1. Partitioning the data into chunks that can be separately acted upon
  2. Applying computations against those chunks to produce results

The smaller or more fine-grained the chunks, the more concurrent activity will be possible. In theory the closer one can get to one chunk per core the better but in reality it’s rare (a function of throughput and size of calculation) one needs to do computation across all chunks simultaneously such that a core can be assigned many chunks any one of which it will dispatch operations against at a moment in time.

There are many solutions for building concurrent systems but those that provide some abstraction which makes request routing easy to implement are likely to work best as it makes re-balancing of computation easier. One shouldn’t immediately assume that message passing is the answer as there are many ways to achieve routing (e.g. via DNS).

Any solution represents a transparency tradeoff. If for example routing is hidden inside of the solution, this can make it easy to get something up and running but we might find it difficult to transition from one box to a multi-box deployment. There are many tradeoffs to be made and for any case where control is given to the developer/architect it’s likely there will be libraries/frameworks to ease the initial implementation burden, programming languages alone will not be enough (Scala makes such a differentiation quite difficult given it’s language extension capabilities).

One aspect discussed less often is the difference between processing on a set of cores all in one box versus processing across a set of cores on many boxes. The latter brings the following challenges all related to the fallacies of distributed computing:

  1. Cores are more likely to become inaccessible
  2. The latency of an operation can become substantially more variable
  3. Any centralised functions (e.g. job scheduler or watchdogs) are more vulnerable to becoming isolated from the resources they manage such that processing ceases.

The latency factor is particularly challenging as few concurrent approaches make it sufficiently explicit that developers/architects are encouraged to be appropriately mindful.

Thus far, as has been the case throughout our history, the solutions are polarising into those that work within the confines of a single box and those that work across multiple boxes with the emphasis on the former. I fully expect developers and architects to fall into the old trap of using a single-box solution to solve a multi-box problem with all the associated issues. Of the solutions that work across multiple boxes, very few account fully for the impact of the network.

  • Share/Bookmark

Comments Comments Off

As soon as we give something a name, it becomes open to abuse and misuse.

Vendors can claim they are doing it and support it, developers can claim they do it, use it or implement it. There are a bunch of ready examples: Agile, XP, SOA and REST. Naming something makes it easy to ignore or forget its underpinnings, the elements that deliver value.

As a martial artist, I’m familiar with this pattern of behaviour: various people claim to practice and teach authentic Silat, Karate, Kung Fu, Escrima and so on. Inevitably some of them are exposed as pretenders. One of the more notable martial artists, Bruce Lee was sufficiently concerned about this that he gave serious consideration to leaving his approach to martial art (Jeet Kune Do) unnamed*.

Is it worth naming things? Might we be better served by making our knowledge, approaches and philosophies visible for others without naming them to adopt or not as they see fit? Would it reduce the number of valueless certifications, buzzword cv’s and endless wars over which way is the way and who’s doing it right?


* Jeet Kune Do (1997) ‘Actually, I never wanted to give a name to the kind of Chinese gung fu that I have invented, but for convenience sake, I still call it “Jeet Kune Do”. However, I want to emphasize that there is no distinction between jeet kune do and any other kind of gung fu, for I strongly object to formality, and to the idea of distinction of branches.’

  • Share/Bookmark

Comments 3 Comments »