Overhaul
Posted by Dan Creswell in Technology, tags: Distributed Systems, Engineering, sporting index
When I arrived at Sporting Index, three or so years ago, my early tasks included the planning of a programme of work to overhaul the existing trading system.
To call it a trading system was, at least architecturally, a gross lie, as in fact it was an everything system: payments, accounts, customer profile, reporting (yes, OLAP and OLTP on the same database, madness) and the bet engine.
So the programme broke down into two parts:
- Clean up and separate out the components
- Replace certain components with new implementations
The programme of work started about 18 months ago and if we deliver the entire roadmap there’s another 18 months to go.
Thus far we’ve separated the B2B elements out (yes, they were hanging off the side of the “trading system” as well) and put a new data delivery infrastructure in place with considerably reduced latency and increased reliability. We’ve also just about completed the moving of all reporting into our OLAP systems with real-time updates from the OLTP elements (we used to do reporting refresh every 24 hours with all the painful load spike issues that go with that). The other essential element has been to eliminate the intimate relationship between website and trading system (most website content should not live in the betting engine).
The next major step we’re focused on is the splitting out of customer and account handling. Once this is done we’ll be in the happy situation where we can introduce our new bet engine and run in parallel with the old one so a customer placing bets on markets in either engine continues to get a complete and accurate view of their position (as do our traders).
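To make the "complete view across both engines" idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the system described in this post: the `InMemoryEngine` class, the `open_positions` method and the "total stake per market" model of a position are all hypothetical stand-ins chosen to keep the example small.

```python
from collections import defaultdict
from decimal import Decimal


class InMemoryEngine:
    """Hypothetical stand-in for a bet engine; a real one would sit behind a service."""

    def __init__(self, name):
        self.name = name
        self._bets = []  # list of (customer_id, market, stake)

    def place_bet(self, customer_id, market, stake):
        # Decimal avoids float rounding on money amounts.
        self._bets.append((customer_id, market, Decimal(stake)))

    def open_positions(self, customer_id):
        """Return {market: total stake} for one customer in this engine only."""
        totals = defaultdict(Decimal)
        for owner, market, stake in self._bets:
            if owner == customer_id:
                totals[market] += stake
        return dict(totals)


def combined_position(customer_id, engines):
    """Merge one customer's per-market totals across every engine given."""
    totals = defaultdict(Decimal)
    for engine in engines:
        for market, stake in engine.open_positions(customer_id).items():
            totals[market] += stake
    return dict(totals)
```

Passing both the old and the new engine to `combined_position` yields a single per-market view regardless of where each bet was placed. That only works cleanly if both engines agree on who the customer is, which is why customer and account handling has to be split out first.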
Our other major area of focus is the development of a new betting engine and a key innovation there will be that we don’t use RDBMS’en for storing that information. We maintain auditing trails and DR abilities but with a faster, far lighter weight solution that will cost much less than what we’re running now.
Some technical details:
- We’ve opted for a service-based implementation, mostly with RESTful interfaces and always with smart stubs. Fact is we have to do a distributed solution to support our regulatory requirements effectively and efficiently (PCI, FSA and various gambling authority needs).
- We’ve implemented a service lookup mechanism from scratch based on gossip algorithms. This allows us sophisticated load and failure management strategies tuned on a service by service basis. It also gives us scope for admission control.
- We’re building up a new multicast infrastructure to deliver updates from the bet engine to desktops, other systems etc in real-time.
- Our bet engine is partitioned such that we can up or down scale on demand via virtualisation (no we can’t use most forms of cloud infrastructure as that breaches a number of regulations).
- We’ve got some nice automated recovery protocols that make recovery from hardware or component failures straightforward for operational staff. In essence, they replace the broken element and it automatically knits itself back into the system and supports an SLA for recovery. For example, we can say that a cache will contain all relevant data within 5 minutes assuming a certain set of constraints is met (failure recovery times are difficult to guarantee 100%).
- Everything is monitored including stubs, services and infrastructure. Our operational teams get to routinely use what’s being developed and be involved in the specification of the data generated and the writing of the manuals. We’ve standardised the protocols/methods of exposure for both the monitoring data and logging output.
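To illustrate the gossip-based service lookup mentioned in the list above, here is a minimal Python sketch. It is not the implementation described in this post: the `Entry` and `Node` classes, the heartbeat counter and the "least loaded first" ordering are all assumptions made for the example. Failure detection, admission control and any network transport are deliberately left out.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    """One advertised service instance (all field names are illustrative)."""
    service: str    # logical service name, e.g. "accounts"
    address: str    # where the instance listens
    heartbeat: int  # counter bumped by the advertiser; higher means newer
    load: float     # self-reported load, used for load-aware selection


@dataclass
class Node:
    name: str
    # (service, address) -> freshest Entry this node has heard about
    view: dict = field(default_factory=dict)

    def advertise(self, service, address, load=0.0):
        """Publish or refresh a local instance with a bumped heartbeat."""
        key = (service, address)
        previous = self.view.get(key)
        beat = previous.heartbeat + 1 if previous else 1
        self.view[key] = Entry(service, address, beat, load)

    def merge(self, remote_view):
        """For each key, keep whichever entry has the higher heartbeat."""
        for key, entry in remote_view.items():
            mine = self.view.get(key)
            if mine is None or entry.heartbeat > mine.heartbeat:
                self.view[key] = entry

    def gossip_with(self, peer):
        """Push-pull exchange: both sides end up with the same merged view."""
        snapshot = dict(self.view)
        self.merge(peer.view)
        peer.merge(snapshot)

    def lookup(self, service):
        """Return known addresses for a service, least loaded first."""
        matches = [e for e in self.view.values() if e.service == service]
        matches.sort(key=lambda e: (e.load, e.address))
        return [e.address for e in matches]


def gossip_round(nodes, rng=random):
    """Each node exchanges state with one randomly chosen other node."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        node.gossip_with(peer)
```

Each node only ever talks to one peer per round, yet fresh entries spread to every node after a handful of rounds, with no central registry to fail. Because each entry carries load information, `lookup` can apply a per-service selection policy, which is the kind of hook that load and failure management strategies would build on.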
You’ll notice I haven’t talked about languages used and such. That’s because it doesn’t matter: with our service-based approach we can use whatever suits us best on a per-service basis. That’s a key part of our general engineering philosophy, “right tool for the right job”. We don’t do fashion-, buzz- or hype-influenced work; the pragmatic, practical, effective and efficient space is where we’re focused.

Interesting stuff, have you thought about blogging more about the automated recovery protocols and the monitoring?
Hi Colin
I hadn’t thought about blogging that stuff but, given I’ve had some prompting from my audience, I’ll try and cover some more of that ground in the future.
Thanks for the feedback,
Dan.