Posts Tagged “design”

A frequent problem I observe when reviewing system designs is that they are built atop one or more libraries, frameworks or products that are poorly suited to the intended task. Fitting the design to these underpinnings warps it in undesirable ways, incurring all sorts of costs:

  • It takes an increasing number of staff just to deploy and run the system.
  • Customers face an increasingly bad experience in terms of interaction, performance and stability.
  • One spends more time refactoring than developing new features – although in many cases developers will simply not bother with this effort, which accelerates the drop in quality for customers.
  • The level of coupling increases, impacting the integrity of the design and making future change more difficult.

I call this the “design-by-product anti-pattern”. There are a couple of things that cause it to manifest:

  1. Absence of a real design prior to product/framework/library selection – most of those given the remit for design cannot construct proper abstractions that are adequately divorced from implementation. That is, they do not understand the core entities and operations that exist within the domain they are building a system for. Thus, when products/libraries/frameworks are selected, there is limited structure to assist in evaluating their appropriateness.
  2. These products are used because they are on the list of “company approved technologies”. The justification for the existence of such a list is that it “reduces cost” which it might well do if all one accounts for is licenses and product support. Unfortunately, the cost equation is not nearly so simple (see above re: costs).
  3. A related problem to “company approved technologies” is hot or favourite technologies preferred by the development team regardless of their appropriateness for use in any particular design situation.

Any product/library/framework is created by an individual who has their own view of how their customers design their systems and builds APIs accordingly. In the worst cases these individuals design APIs in total isolation, focused on making them theoretically perfect (for some definition of perfect). If we as customers create designs that do not align well with the views of these individuals, the result will be costly as we force the two designs together. The cost is magnified for each additional conflicting product/library/framework design.

Loose coupling, as the result of proper definition of roles and responsibilities, is the only tool we have to allow for future design evolution. Poor selection of products/libraries/frameworks erodes this property and should be avoided; otherwise a death march awaits.
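
As a minimal sketch of what "design before product selection" can look like in code, the role is written down first as an abstraction and any product is kept behind it. Every name here (Order, OrderStore, InMemoryOrderStore, place_order) is hypothetical and chosen purely for illustration; the domain is an assumption, not something taken from a real system.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class Order:
    order_id: str
    symbol: str
    quantity: int


class OrderStore(Protocol):
    """Role: remember orders. It says nothing about which product does it."""

    def save(self, order: Order) -> None: ...

    def find(self, order_id: str) -> Optional[Order]: ...


class InMemoryOrderStore:
    """One possible rendering of the role. A database-backed version would
    implement the same two methods and nothing else would need to change."""

    def __init__(self) -> None:
        self._orders: Dict[str, Order] = {}

    def save(self, order: Order) -> None:
        self._orders[order.order_id] = order

    def find(self, order_id: str) -> Optional[Order]:
        return self._orders.get(order_id)


def place_order(store: OrderStore, order: Order) -> Order:
    """Core logic depends only on the abstraction, never on a product API."""
    if order.quantity <= 0:
        raise ValueError("quantity must be positive")
    store.save(order)
    return order
```

With the role defined like this, a candidate product can be judged by how naturally it implements OrderStore, rather than the design being bent to fit the product.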


There are some design basics that development teams routinely fail to account for:

  1. Roles
  2. Responsibilities
  3. Coupling

Role

The basic justification for the existence of some API, interface or class. A summary of what it’s for. Just as importantly, the role defines what a particular entity is not for.

Responsibility

The things that some entity can do/knows in support of a role.

Coupling

An expression of the dependencies between roles. This property tells us a lot about the state of our design.

Two things that are heavily dependent upon each other might well be serving individual parts of a single role and thus should be consolidated. If everything ends up in a single role, it can suggest that the current approach to classifying behaviours is missing some factors.

Coupling can be temporal such that, for example, one entity cannot dispatch its responsibilities without the presence of another at the same time. This might indicate the need for some work on handling availability issues in a distributed system.

Limited coupling is a sign of cohesion, clarity in roles and responsibilities which can be indicative of a clean, maintainable design.

Platform Neutral

These basics apply regardless of the platform one chooses to develop upon. Roles, responsibilities and coupling apply just as well to service architectures, databases (tables and associated triggers and packages) and applications in Java, Scala, Clojure, C# or any other programming environment.

Warning Signs

It is very common for individual developers or development teams to allocate additional functions to existing elements of a design unthinkingly, thus eroding its quality. This manifests in many ways including:

  1. Some element of the system becomes the source of all information in respect of e.g. configuration or the entirety of customer data.
  2. A single cache contains all data regardless of its nature (e.g. customer, account details, market price).
  3. Some element of the system must always be running otherwise nothing else works.
  4. Some element of the system has functions that span many different bits of data (e.g. customer, account, market price).

Rule of Thumb

Any entity within a system should do only one thing and it should do it well (often credited as Unix Philosophy). This applies to everything from applications and products to services and individual classes.
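
A small sketch of the rule, assuming a trading-style domain similar to the examples in the warning signs above. The class and function names (CustomerDirectory, PriceBoard, position_value) are hypothetical. Each class has one role, and the coupling of a function is visible in its signature.

```python
from typing import Dict, Optional


class CustomerDirectory:
    """Role: answer questions about customers. Not for prices or accounts."""

    def __init__(self) -> None:
        self._names: Dict[str, str] = {}

    def register(self, customer_id: str, name: str) -> None:
        self._names[customer_id] = name

    def name_of(self, customer_id: str) -> Optional[str]:
        return self._names.get(customer_id)


class PriceBoard:
    """Role: hold the latest market price per symbol. Not for customer data."""

    def __init__(self) -> None:
        self._prices: Dict[str, float] = {}

    def update(self, symbol: str, price: float) -> None:
        self._prices[symbol] = price

    def latest(self, symbol: str) -> Optional[float]:
        return self._prices.get(symbol)


def position_value(prices: PriceBoard, symbol: str, quantity: int) -> Optional[float]:
    """Needs prices only, so it is handed a PriceBoard and nothing more."""
    price = prices.latest(symbol)
    return None if price is None else price * quantity
```

Had both kinds of data lived in one shared cache, position_value would silently depend on customer data as well, which is the kind of erosion described under Warning Signs.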


Design is not rules, it’s not patterns, it’s not technological choices or indeed code. Design is tradeoffs, driven by data where possible and gut instinct. It’s about identifying the core challenges of a problem domain (which might ultimately be one or many systems) and addressing them through creation of appropriate abstractions. These abstractions embody:

  • Functions to be performed
  • Data to be discovered, consumed and produced
  • Non-functionals (e.g. SLAs)

The abstractions are then rendered into the real-world using appropriate hardware, technologies, patterns and languages. A good design:

  • Exhibits few exception cases
  • Has logic and/or data located neatly and predictably
  • Applies a small set of core constructs repeatedly
  • Addresses operational needs
  • Considers cost versus value delivered
  • Is as simple as possible
  • Has the minimum of implementation assumption

There are several key failing points in the design process:

  • No adjustment in the face of implementation feedback – No design is complete or perfect. There will always be missed details leading to brittle code, complex corner cases or convoluted solutions. It is critical that we monitor our progress and adapt the design accordingly.
  • No up-front design – Design is the skeleton upon which we hang technology choices and code structure. In its absence we rapidly descend into a world of difficult-to-navigate code and costly constraints set by uninformed product choices.
  • No care in following the design – A key element of design is to place the right things in the right places. Failing to do this at code time increases coupling, makes maintenance difficult and can impact both performance and scalability. Similar effects occur as the result of poor technology selection.

Design and implementation go hand in hand yet many of us lack awareness of where the boundary between these two elements lies. We don’t understand how these elements interact with each other or appreciate the impact of decisions we make in respect of one element on the other.

 


The difficulty of constructing remote services is often not in writing them but testing and debugging whilst ensuring that some of the nastier types of failure (e.g. packet loss or machine failure) are adequately handled.

The norm for these kinds of testing scenarios is to have a full, mocked-up test environment with a bunch of servers. Such a setup needs sysadmin effort and repeated deployment steps, which for most organisations are slow, ponderous things. Incremental test cycles in such an environment become costly, which leads to onerous, last-minute testing and the late discovery of difficult-to-fix bugs that introduce endless release delays.

Over the years I’ve developed an approach for pushing all these testing scenarios back toward the unit level so they can be run regularly per build as they take mere minutes to complete. The core philosophy is to design the software in such a fashion that it runs on a single machine using all the network protocols it would use when deployed across many servers (ah, the power of localhost/127.0.0.1).

Preliminaries

Putting this philosophy into practice requires that we adopt certain design practices:

  1. Clean separation between the transport/remote layers and the core service logic. This makes it easy to develop tests that verify the core logic without any remoteness concerns and a second set of tests that perform the more heavyweight remote tests. The benefit is that we can more easily isolate issues when they occur. For example, if the core logic tests pass but the remote tests fail we can be pretty confident the issue is in the remote layers.
  2. Clean separation of configuration source from core service and transport/remote layer. This ensures all our software requests configuration using a consistent API which could then be implemented via LDAP, flat-files, in-memory etc. Such a setup allows us to easily build up configuration inside of our tests and make it available to the services we’re building.
  3. Runtime discovery of endpoints. This allows us to dynamically allocate port/IP combinations and make them available to whichever services require them. One can achieve this via the abstracted configuration source but it’s often cleaner to have a dynamic lookup/discovery mechanism.
  4. Configurable log file locations. This lets us avoid path clashes between services.

Once these things are in place, unit tests can construct transports, endpoints and configuration dynamically at run time in whatever combination is required for a test. It is thus possible to instantiate a collection of services inside of a single process and have them talk to each other as if they were all running remotely. This is somewhat at odds with other design practices where we typically look to remove remoteness when running services locally for purposes of performance.
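
The following is a minimal sketch of these practices using only the Python standard library, not the code from the systems described here. The Registry class, the upper_logic service and the helper names are all hypothetical. It shows core logic kept apart from the transport, ports allocated dynamically, endpoints discovered at run time, and two services running in one process yet talking over real TCP on 127.0.0.1.

```python
import socket
import socketserver
import threading
from typing import Dict, Tuple

Endpoint = Tuple[str, int]


class Registry:
    """Hypothetical in-memory discovery: service name -> (host, port)."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, Endpoint] = {}
        self._lock = threading.Lock()

    def register(self, name: str, endpoint: Endpoint) -> None:
        with self._lock:
            self._endpoints[name] = endpoint

    def lookup(self, name: str) -> Endpoint:
        with self._lock:
            return self._endpoints[name]


def upper_logic(text: str) -> str:
    """Core service logic, testable with no network at all."""
    return text.upper()


class UpperHandler(socketserver.StreamRequestHandler):
    """Transport layer: one line in, one line out, delegating to the core."""

    def handle(self) -> None:
        line = self.rfile.readline().decode("utf-8").rstrip("\n")
        self.wfile.write((upper_logic(line) + "\n").encode("utf-8"))


def start_service(name: str, registry: Registry) -> socketserver.TCPServer:
    # Port 0 asks the OS for any free port, so tests never clash on ports.
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), UpperHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    registry.register(name, server.server_address)
    return server


def stop_service(server: socketserver.TCPServer) -> None:
    """Stopping a service from the test is how a machine failure is simulated."""
    server.shutdown()
    server.server_close()


def call(registry: Registry, name: str, text: str) -> str:
    """Client side: discover the endpoint at run time, then use real TCP."""
    with socket.create_connection(registry.lookup(name), timeout=5) as conn:
        conn.sendall((text + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reader:
            return reader.readline().rstrip("\n")
```

A test would call start_service for each service it needs, exercise them through call, and finish with stop_service. The core logic can still be tested directly through upper_logic, which helps isolate whether a failure lies in the logic or in the remote layer.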

Failure Scenarios

By virtue of the unit tests having control of all the services and their transports/endpoints, it becomes possible to stop or disable services, thus simulating machine failures. It is also possible to extend the approach to cover problems such as packet loss, corruption or increased latency.

These more advanced scenarios are more readily handled with server construction toolkits such as Netty, which allow tight control of packet processing and protocol. Using Netty, one can build up the protocol stack per service exactly as required and introduce Decoder/Encoder pairs, Handlers or wrappers around the core service implementation that can randomly (and silently, without severing the connection) lose messages or packets, break connections etc.

Example

I’ve been working on a Paxos implementation which breaks down into:

  • State machines – Leader, Acceptor and Learner and associated elements such as leader election and failure detection.
  • Persistent storage layer – as various state must be remembered across Paxos instances.
  • Remote communications layer – including cluster membership and remote communications.

The state machines accept messages, make appropriate state transitions and produce messages. These are then passed around between participants via the remote communications layer. The persistent storage layer allows file locations to be specified at construction time, which lets test code allocate separate directories on a single disk to hold each participant's state.
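
To make the shape concrete, here is a sketch of a textbook single-decree Paxos acceptor written in that style. It is not the implementation described in this post. The message tuples, the JSON file format and all names (AcceptorStore, Acceptor, make_acceptors, demo) are assumptions for illustration, and the storage is not crash-safe since it does not fsync.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Tuple


class AcceptorStore:
    """Persists acceptor state under a directory chosen at construction time."""

    def __init__(self, directory: Path) -> None:
        self._path = Path(directory) / "acceptor.json"

    def load(self) -> dict:
        if self._path.exists():
            return json.loads(self._path.read_text())
        return {"promised": -1, "accepted_n": -1, "accepted_value": None}

    def save(self, state: dict) -> None:
        # Illustrative only: a real store would write atomically and fsync.
        self._path.write_text(json.dumps(state))


class Acceptor:
    """Message in, state transition, message out. No sockets involved."""

    def __init__(self, store: AcceptorStore) -> None:
        self._store = store
        self._state = store.load()

    def on_prepare(self, n: int) -> Tuple:
        if n > self._state["promised"]:
            self._state["promised"] = n
            self._store.save(self._state)
            return ("PROMISE", n, self._state["accepted_n"], self._state["accepted_value"])
        return ("NACK", self._state["promised"])

    def on_accept(self, n: int, value: str) -> Tuple:
        if n >= self._state["promised"]:
            self._state.update(promised=n, accepted_n=n, accepted_value=value)
            self._store.save(self._state)
            return ("ACCEPTED", n, value)
        return ("NACK", self._state["promised"])


def make_acceptors(root: Path, count: int) -> List[Acceptor]:
    """Give each acceptor its own directory under one root, as a test would."""
    acceptors = []
    for i in range(count):
        directory = root / f"acceptor-{i}"
        directory.mkdir()
        acceptors.append(Acceptor(AcceptorStore(directory)))
    return acceptors


def demo() -> List[Tuple]:
    """Two acceptors on one disk, then a restart that reloads stored state."""
    with tempfile.TemporaryDirectory() as root:
        first, second = make_acceptors(Path(root), 2)
        replies = [first.on_prepare(1), first.on_accept(1, "x"), second.on_prepare(5)]
        restarted = Acceptor(AcceptorStore(Path(root) / "acceptor-0"))
        replies.append(restarted.on_prepare(2))
        return replies
```

Because the state machine only consumes and produces messages, the same class can be driven directly by a unit test or wired behind a network handler without change.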

The remote layer is built such that none of the members need static/well-known ports to operate off. There is one exception which is a fixed multicast address that is used to do initial cluster discovery. It is implemented using Netty and consists of some codecs for the various messages and a handler that passes messages to and from the state machines.

There are several different implementations of the handler. There is the normal version that dispatches messages reliably and several others that randomly drop messages or lose them at critical moments in an instance of Paxos. The exact behaviour of these handlers is configured at runtime which allows unit tests to construct random or specific failure scenarios and ensure the state machines behave appropriately.
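
The idea can be sketched without Netty. The handlers below are plain Python stand-ins, not Netty classes, and every name (ReliableHandler, RandomDropHandler, DropMatchingHandler, send_with_retry) is hypothetical. Each handler wraps a delivery function, and the test chooses at run time which one to install.

```python
import random
from typing import Callable, List

Deliver = Callable[[str], None]


class ReliableHandler:
    """Passes every message on to the next stage."""

    def __init__(self, deliver: Deliver) -> None:
        self._deliver = deliver

    def handle(self, message: str) -> None:
        self._deliver(message)


class RandomDropHandler:
    """Silently loses a fraction of messages. The seed makes runs repeatable."""

    def __init__(self, deliver: Deliver, drop_rate: float, seed: int) -> None:
        self._deliver = deliver
        self._drop_rate = drop_rate
        self._rng = random.Random(seed)
        self.dropped = 0

    def handle(self, message: str) -> None:
        if self._rng.random() < self._drop_rate:
            self.dropped += 1  # Lost without any error reaching the sender.
            return
        self._deliver(message)


class DropMatchingHandler:
    """Loses messages at a chosen moment, e.g. every message of one type."""

    def __init__(self, deliver: Deliver, should_drop: Callable[[str], bool]) -> None:
        self._deliver = deliver
        self._should_drop = should_drop

    def handle(self, message: str) -> None:
        if not self._should_drop(message):
            self._deliver(message)


def send_with_retry(handler, message: str, received: List[str], attempts: int) -> bool:
    """Recovery logic under test: resend until the message is seen to arrive."""
    for _ in range(attempts):
        handler.handle(message)
        if message in received:
            return True
    return False
```

A fixed seed keeps a "random" failure scenario reproducible from build to build, while a predicate-based handler targets a specific critical moment in a protocol round.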

All these elements together allow unit tests to construct, in a single process, fully remote services that communicate via TCP and UDP/Multicast as if they were running on a network and simulate failure scenarios. Alongside these tests is a collection to verify correct behaviour of the state machines and a set that validates their failure handling via timeouts, leader election behaviours etc. The entire suite including the failure scenarios runs in less than five minutes. That leaves one long-running test that exercises a collection of state machines concurrently for long periods, a necessary soak test run separately.

Alternative Implementations

A similar testing approach is possible with the likes of Jetty 7 as the lower IO layers are open enough to be customised to support these test scenarios. This can be a better option than Netty if services are Servlet based.

More challenging are the RPC-based services as these tend to run atop closed stacks that limit the amount of customisation possible and often have horrid configuration methods. However, Thrift, by virtue of its Processor/Protocol abstraction, can be readily modified to support such testing.

Sidenotes:

  1. Applications that use databases for state storage can make this sort of testing tricky but it’s not impossible. One solution to the problem is to make use of virtual machines where one instantiates an image containing a pre-defined database and shuts it down afterwards, alongside some scripts to prepare and tear down data within the database.
  2. I’ve recently applied this approach to several other systems including a trade management system written in Clojure, a trading platform written in Scala and a gossip-based directory service also written in Scala.


Building a concurrent system ultimately boils down to:

  1. Partitioning the data into chunks that can be separately acted upon
  2. Applying computations against those chunks to produce results

The smaller or more fine-grained the chunks, the more concurrent activity will be possible. In theory, the closer one can get to one chunk per core the better. In reality it’s rare (it is a function of throughput and size of calculation) that one needs to do computation across all chunks simultaneously, so a core can be assigned many chunks, any one of which it will dispatch operations against at a given moment in time.

There are many solutions for building concurrent systems but those that provide some abstraction which makes request routing easy to implement are likely to work best as it makes re-balancing of computation easier. One shouldn’t immediately assume that message passing is the answer as there are many ways to achieve routing (e.g. via DNS).
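
A minimal sketch of partitioning plus routing, under the assumption that keys are strings and the computation is a simple sum. All names (chunk_of, Router, partition, total) are hypothetical, and a thread pool stands in for real cores; in CPython, threads will not give CPU parallelism for this kind of work. The point is that rebalancing only touches the routing table and never the data-to-chunk mapping.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def chunk_of(key: str, chunk_count: int) -> int:
    """Stable hash, so a key always lands in the same chunk."""
    return zlib.crc32(key.encode("utf-8")) % chunk_count


class Router:
    """Routing table: chunk -> worker. Many chunks may share one worker."""

    def __init__(self, chunk_count: int, workers: List[str]) -> None:
        self.chunk_count = chunk_count
        self.assignment = {c: workers[c % len(workers)] for c in range(chunk_count)}

    def worker_for(self, key: str) -> str:
        return self.assignment[chunk_of(key, self.chunk_count)]

    def move(self, chunk: int, worker: str) -> None:
        """Rebalance by re-pointing one chunk. Keys keep their chunk."""
        self.assignment[chunk] = worker


def partition(items: Dict[str, int], chunk_count: int) -> Dict[int, Dict[str, int]]:
    """Step 1: split the data into chunks that can be acted on separately."""
    chunks: Dict[int, Dict[str, int]] = {c: {} for c in range(chunk_count)}
    for key, value in items.items():
        chunks[chunk_of(key, chunk_count)][key] = value
    return chunks


def total(items: Dict[str, int], chunk_count: int, max_workers: int) -> int:
    """Step 2: apply a computation per chunk, then combine the results."""
    chunks = partition(items, chunk_count)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(lambda chunk: sum(chunk.values()), chunks.values()))
```

Using more chunks than workers, as here, is what leaves room to move load around later without re-partitioning the data.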

Any solution represents a transparency tradeoff. If, for example, routing is hidden inside of the solution, this can make it easy to get something up and running but we might find it difficult to transition from one box to a multi-box deployment. There are many tradeoffs to be made and for any case where control is given to the developer/architect it’s likely there will be libraries/frameworks to ease the initial implementation burden; programming languages alone will not be enough (Scala makes such a differentiation quite difficult given its language extension capabilities).

One aspect discussed less often is the difference between processing on a set of cores all in one box versus processing across a set of cores on many boxes. The latter brings the following challenges all related to the fallacies of distributed computing:

  1. Cores are more likely to become inaccessible
  2. The latency of an operation can become substantially more variable
  3. Any centralised functions (e.g. job scheduler or watchdogs) are more vulnerable to becoming isolated from the resources they manage such that processing ceases.

The latency factor is particularly challenging as few concurrent approaches make it sufficiently explicit that developers/architects are encouraged to be appropriately mindful.
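
One way to make latency explicit, offered only as a sketch and not as a description of any particular framework, is to require a deadline on every call and to return the elapsed time alongside the result. The Outcome type and call_with_deadline function are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Outcome:
    value: Optional[int]
    timed_out: bool
    elapsed_seconds: float


def call_with_deadline(
    pool: ThreadPoolExecutor, fn: Callable[[], int], deadline_seconds: float
) -> Outcome:
    """The caller must state how long it is prepared to wait and must
    handle the timed-out case; neither is hidden behind a plain call."""
    started = time.monotonic()
    future = pool.submit(fn)
    try:
        value = future.result(timeout=deadline_seconds)
        return Outcome(value, False, time.monotonic() - started)
    except FutureTimeout:
        # The work may still be running; only the waiting has stopped.
        return Outcome(None, True, time.monotonic() - started)
```

Note that a timeout here says nothing about whether the operation eventually completed, which is exactly the ambiguity a multi-box deployment introduces.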

Thus far, as has been the case throughout our history, the solutions are polarising into those that work within the confines of a single box and those that work across multiple boxes with the emphasis on the former. I fully expect developers and architects to fall into the old trap of using a single-box solution to solve a multi-box problem with all the associated issues. Of the solutions that work across multiple boxes, very few account fully for the impact of the network.
