Posts Tagged “Engineering”

When I arrived at Sporting Index, three or so years ago, my early tasks included the planning of a programme of work to overhaul the existing trading system.

To call it a trading system was, at least architecturally, a gross lie, as in fact it was an everything system: payments, accounts, customer profile, reporting (yes, OLAP and OLTP on the same database, madness) and the bet engine.

So the programme broke down into two parts:

  • Clean up and separate out the components
  • Replace certain components with new implementations

The programme of work started about 18 months ago and if we deliver the entire roadmap there’s another 18 months to go.

Thus far we’ve separated the B2B elements out (yes, they were hanging off the side of the “trading system” as well) and put a new data delivery infrastructure in place with considerably reduced latency and increased reliability. We’ve also just about completed the moving of all reporting into our OLAP systems with real-time updates from the OLTP elements (we used to do reporting refresh every 24 hours with all the painful load spike issues that go with that). The other essential element has been to eliminate the intimate relationship between website and trading system (most website content should not live in the betting engine).

The next major step we’re focused on is the splitting out of customer and account handling. Once this is done we’ll be in the happy situation where we can introduce our new bet engine and run in parallel with the old one so a customer placing bets on markets in either engine continues to get a complete and accurate view of their position (as do our traders).

Our other major area of focus is the development of a new betting engine and a key innovation there will be that we don’t use RDBMS’en for storing that information. We maintain auditing trails and DR abilities but with a faster, far lighter weight solution that will cost much less than what we’re running now.

Some technical details:

  • We’ve opted for a service-based implementation, mostly with RESTful interfaces and always with smart stubs. Fact is we have to do a distributed solution to support our regulatory requirements effectively and efficiently (PCI, FSA and various gambling authority needs).
  • We’ve implemented a service lookup mechanism from scratch based on gossip algorithms. This allows us sophisticated load and failure management strategies tuned on a service by service basis. It also gives us scope for admission control.
  • We’re building up a new multicast infrastructure to deliver updates from the bet engine to desktops, other systems etc in real-time.
  • Our bet engine is partitioned such that we can up or down scale on demand via virtualisation (no we can’t use most forms of cloud infrastructure as that breaches a number of regulations).
  • We’ve got some nice automated recovery protocols that make recovery from hardware or component failures straightforwards for operational staff. In essence, they replace the broken element and it automatically knits itself back into the system and supports an SLA for recovery. For example, we can say that a cache will contain all relevant data within 5 minutes assuming a certain set of constraints are met (failure recovery times are difficult to guarantee 100%).
  • Everything is monitored including stubs, services and infrastructure. Our operational teams get to routinely use what’s being developed and be involved in the specification of the data generated and the writing of the manuals. We’ve standardised the protocols/methods of exposure for both the monitoring data and logging output.

You’ll notice I haven’t talked about languages used and such. That’s because it doesn’t matter as with our service-based approach we can use whatever suits us best on a per service basis. That’s a key part of our general engineering philosophy, “right tool for the right job”, we don’t do fashion, buzz or hype influenced work, the pragmatic, practical, effective and efficient space is where we’re focused.

Comments 2 Comments »

Many in IT would have you believe that they’re progressive, constantly advancing and learning. They will provide examples such as:

  • We’re adopting lean (or agile)
  • We’re using Scala
  • We’re doing large-scale

Dig a little deeper though and you’ll quickly find:

  • The agile or lean they’ve adopted and trumpet is merely a fixed set of ceremonies such as time-boxed cycles, limiting in-flight work, pair programming and automated testing.
  • The Scala code they’re writing is largely imperative with no use of any of the obvious functional aspects.
  • The supposed large-scale is in fact a couple of thousand customers running on a single database with an operation count per minute of no more than 60.

These individuals and companies are paying lip-service, they haven’t learnt anything other than what the latest buzz is and they’re operating some parody of the real thing.

The fundamental problem is, they don’t do their research. They don’t actually examine what has been done previously and understand its fundamentals, apply it and sharpen it. This is the skill that I gained at university, not some set of buzz technologies or processes. This is the fundamental skill that allows us to progress, to learn, adapt and act.

[ There's a supreme irony in an individual or company claiming to be doing agile or lean and not being able to research, learn and adapt because they follow a fixed set of ceremonies. ]

Unfortunately this skill isn’t generally recognised as valuable in the IT industry, it’s rarely taught and not seen in requests for curriculum changes from business to universities. This sort of feedback is more typical:


…startups need graduates who can hit the ground running, who are proficient in PHP, Python and Ruby (among other modern programming languages), and who, ultimately, understand the practical side of software engineering as opposed to just the theoretical side which they learn at university.

Being proficient in a language does not make you good, it just means you can crank (poor) code fast. Further, understanding the practical side of software engineering means learning from what you do, analysing and adjusting. If we’re all so good at that, why do we repeatedly get burnt by the same classic mistakes?

Far too many within IT act on hype or pay lip-service, they don’t do research, they don’t adopt the basic disciplines of learning. Whilst that continues, progress is nothing but a pretence.

Comments Comments Off

A frequent problem I observe when reviewing system designs is they are built atop one or more libraries, frameworks or products that are poorly suited to the intended task. Fitting the design to these underpinnings warps it in undesirable ways incurring all sorts of costs:

  • It takes an increasing number of staff just to deploy and run the system.
  • Customers face an increasingly bad experience in terms of interaction, performance and stability.
  • One spends more time refactoring than developing new features – although in many cases developers will simply not bother with this effort which accelerates the drop in quality for customers.
  • The level of coupling increases impacting the integrity of the design and making future change more difficult.

I call this the “design-by-product anti-pattern”. There are a couple of things that cause it to manifest:

  1. Absence of a real design prior to product/framework/library selection – most of those given the remit for design cannot construct proper abstractions that are adequately divorced from implementation. That is they do not understand the core entities and operations that exist within the domain they are building a system for. Thus when products/libraries/frameworks are selected there is limited structure to assist in evaluating their appropriateness.
  2. These products are used because they are on the list of “company approved technologies”. The justification for the existence of such a list is that it “reduces cost” which it might well do if all one accounts for is licenses and product support. Unfortunately, the cost equation is not nearly so simple (see above re: costs).
  3. A related problem to “company approved technologies” is hot or favourite technologies preferred by the development team regardless of their appropriateness for use in any particular design situation.

Any product/library/framework is created by an individual who has their own view of how their customers design their systems and builds APIs accordingly. In the worst cases these individuals design APIs in total isolation, focused on making them theoretically perfect (for some definition of perfect). If we as customers create designs that do not align well with the views of these individuals, the result will be costly as we force the two designs together. The cost is magnified for each additional conflicting product/library/framework design.

Loose coupling as the result of proper definition of roles and responsibilities is the only tool we have to allow for future design evolution. Poor selection of products/libraries/frameworks erodes this property and should be avoided otherwise death-march awaits.

Comments Comments Off

The difficulty of constructing remote services is often not in writing them but testing and debugging whilst ensuring that some of the nastier types of failure (e.g. packet loss or machine failure) are adequately handled.

The norm for these kinds of testing scenarios is to have a full, mocked-up test environment with a bunch of servers. Such a setup needs sysadmin and repeated deployment steps which for most organisations are slow, ponderous things. Incremental test cycles in such an environment become costly which leads to onerous, last-minute testing and the late discovery of difficult to fix bugs that introduce endless release delays.

Over the years I’ve developed an approach for pushing all these testing scenarios back toward the unit level so they can be run regularly per build as they take mere minutes to complete. The core philosophy is to design the software in such a fashion that it runs on a single machine using all the network protocols it would use when deployed across many servers (ah, the power of localhost/127.0.0.1).

Preliminaries

Putting this philosophy into practice requires that we adopt certain design practices:

  1. Clean separation between the transport/remote layers and the core service logic. This makes it easy to develop tests that verify the core logic without any remoteness concerns and a second set of tests that perform the more heavyweight remote tests. The benefit is that we can more easily isolate issues when they occur. For example, if the core logic tests pass but the remote tests fail we can be pretty confident the issue is in the remote layers.
  2. Clean separation of configuration source from core service and transport/remote layer. This ensures all our software requests configuration using a consistent API which could then be implemented via LDAP, flat-files, in-memory etc. Such a setup allows us to easily build up configuration inside of our tests and make it available to the services we’re building.
  3. Runtime discovery of endpoints. To allow us to dynamically allocate port/ip combinations and make them available to whichever services require them. One can achieve this via the abstracted configuration source but it’s often cleaner to have a dynamic lookup/discovery mechanism.
  4. Configurable log file locations. So that we can avoid path clashes between services.

Once these things are in place, unit tests can construct transports, endpoints and configuration dynamically at run time in whatever combination is required for a test. It is thus possible to instantiate a collection of services inside of a single process and have them talk to each other as if they were all running remotely. This is somewhat at odds with other design practices where we typically look to remove remoteness when running services locally for purposes of performance.

Failure Scenarios

By virtue of the unit tests having control of all the services and their transports/endpoints it becomes possible to stop or disable services thus simulating machine failures but it’s also possible to extend the approach to cover problems such as packet loss, corruption or increased latency.

These more advanced scenarios are more readily handled with server construction toolkits such as Netty which allows tight control of packet processing and protocol. Using Netty, one can build up the protocol stack per service exactly as required and introduce Decoder/Encoder pairs, Handlers or wrappers around core service implementation that can randomly (and silently without severing the connection) lose messages or packets, break connections etc.

Example

I’ve been working on a Paxos implementation which breaks down into:

  • State machines – Leader, Acceptor and Learner and associated elements such as leader election and failure detection.
  • Persistent storage layer – as various state must be remembered across Paxos instances.
  • Remote communications layer – including cluster membership and remote communications.

The state machines accept messages, make appropriate state transitions and produce messages. These are then passed around between participants via the remote communications layer. The persistent storage layer allows for specification of file locations at construction time which allows test code to allocate separate directories on a single-disk to hold respective state.

The remote layer is built such that none of the members need static/well-known ports to operate off. There is one exception which is a fixed multicast address that is used to do initial cluster discovery. It is implemented using Netty and consists of some codecs for the various messages and a handler that passes messages to and from the state machines.

There are several different implementations of the handler. There is the normal version that dispatches messages reliably and several others that randomly drop messages or lose them at critical moments in an instance of Paxos. The exact behaviour of these handlers is configured at runtime which allows unit tests to construct random or specific failure scenarios and ensure the state machines behave appropriately.

All these elements together allow unit tests to construct, in a single-process, fully remote services that communicate via TCP and UDP/Multicast as if they were running on a network and simulate failure scenarios. Alongside these tests are a collection to verify correct behaviour of the state machines and a set that validates their failure handling via timeouts, leader election behaviours etc. The entire suite including the failure scenarios runs in less than five minutes. That leaves one long-running test that exercises a collection of state machines concurrently for long periods, a necessary soak test run separately.

Alternative Implementations

A similar testing approach is possible with the likes of Jetty 7 as the lower IO layers are open enough to be customised to support these test scenarios. This can be a better option than Netty if services are Servlet based.

More challenging are the RPC-based services as these tend to run atop closed stacks that limit the amount of customisation possible and often have horrid configuration methods. However Thrift, by virtue of it’s Processor/Protocol abstraction can be readily modified to support such testing.

Sidenotes:

  1. Applications that use databases for state storage can make this sort of testing tricky but it’s not impossible. One solution to the problem is to make use of virtual machines where one instantiates an image containing a pre-defined database and shuts it down afterwards alongside some scripts to prepare and tear down data within the database
  2. I’ve recently applied this approach to several other systems including a trade management system written in Clojure, a trading platform written in Scala and a gossip-based directory service also written in Scala

Comments 1 Comment »

My current company has for obvious business reasons got a serious interest in delivering a quality website experience during the World Cup and thus I’ve been spending a lot of time focused on our own performance and capacity management of late.

P&C is one of those 80/20 tradeoffs. There’s always more one can do or measure or test, equally getting the basics in place will deliver substantial benefit. I’d go further and argue that without a solid grasp of the basics, one cannot easily determine what else beyond that might be required. Here then are the basics that I’ve found myself repeating over and over:

  • Have an enquiring mind – anomalies are not to be ignored or dismissed on the basis of pure speculation. Determining root cause is essential to prevent surprises in production. Some recent examples:
    1. In one test we noticed that every so often we’d get a substantial blip in disk I/O on servers that should be processing entirely out of memory. Along with that blip there’d be a corresponding reduction in throughput, we could have ignored it, after all things sorted themselves out relatively quickly but we chose to investigate. All these servers were periodically running a cleanup job the developers were unaware of and had not factored into their capacity calculations. The implications for production would have been a regularly overloaded, badly performing website. We’ve since tuned the jobs, adjusted their schedules and increased our capacity to ensure we can always spread the load around enough to accommodate them.
    2. An examination of the distribution of load on the boxes behind our load-balancers revealed a higher than expected amount of variance in CPU and connections. A review of the application revealed that any particular user’s traffic is sticky to one box, unfortunate as it’s stateless, time for a code change. We also spent time looking at the monitoring infrastructure and discovered that in certain cases we’d get false reports of 100% CPU utilisation, that one will be fixed with an OS patch.
  • Gather the right data – there’s no value in allowing oneself to be limited by what is easily available via some set of tools people are comfortable with. One tool we were using had an unreasonably low ceiling on the number and rate of samples it could handle such that any graphs it produced showed hardly anything of the true profile of e.g. CPU utilisation, memory consumption or I/O. Forming any opinion about system behaviour in respect of load was going to be an exercise in speculation. We junked the tool and are looking for a replacement, in the meantime we’ve fallen back to making use of low level performance counters which we can sample local to the machine and whack onto disk for later analysis via scripts, opensource tools etc.
  • Design tests that support reasoning – One should indeed try and replicate production load behaviours to judge overall system behaviour. The challenge of such testing is that it can be difficult to relate performance data back to exactly what was going on during some period of a test and make a diagnosis or be confident of an improvement. There are a number of things we can do to improve the situation:
    1. Ensure tests are deterministic such that any given run can be compared against other runs. This isn’t as simple as it looks when e.g. you wish to gradually increase load at a fixed rate that is being produced by more than one box.
    2. Have tests produce sufficient logging that one can easily identify what was going on at particular points in the sampled data. Logging of course can actually affect test behaviour and that isn’t always desirable.
    3. Build additional tests that target particular user journey’s through the system. Doing this for all possible journey’s can be costly so it makes sense to focus on testing those which are most popular with users. These kinds of tests restrict the reasoning tree making analysis, diagnosis and solution identification much easier.
  • Measure what customers care about – they don’t care about CPUs, I/O or memory, they worry about things like response times. It is important to focus on maintaining a quality user experience not endlessly improving system efficiency. Considering user factors such as response times stops us expending huge effort on CPU utilisation when we should be focusing on say, network I/O, browser performance or reducing the amount of data we push to the browser before a page can render.
  • Beware of averages – it is very tempting to combine datasets via the use of averaging unfortunately such a practice can easily hide spikes that might be indicative of a problem. On more than one occasion an engineer has presented a graph that tracks the average CPU and a table that summarises min, avg and max. After which they’ve pronounced load testing was a success and yet they have no explanation for why the average is never more than 50% but the max is 100% and whether or not this is good or bad.

  • More than load – excessive focus on measuring the effect of a particular load can make us blind to another important metric, resource cost per unit of work – these are the collection of tests and analysis that help us understand what to tune and how much to keep our appetite for boxes and bandwidth reasonable. One simple thing teams can do per sprint (assuming you’re agile, why wouldn’t you be?) is point a profiler at each component and look for the low hanging fruit that is poor algorithm selection or inefficient code (e.g. repeated scanning of lists where a hashmap would be better or repeatedly computing something that could be cached).

Comments 2 Comments »