Remoting
Posted by Dan Creswell in Distributed Systems, tags: design, Distributed Systems, Engineering, testingThe difficulty of constructing remote services is often not in writing them but testing and debugging whilst ensuring that some of the nastier types of failure (e.g. packet loss or machine failure) are adequately handled.
The norm for these kinds of testing scenarios is to have a full, mocked-up test environment with a bunch of servers. Such a setup needs sysadmin and repeated deployment steps which for most organisations are slow, ponderous things. Incremental test cycles in such an environment become costly which leads to onerous, last-minute testing and the late discovery of difficult to fix bugs that introduce endless release delays.
Over the years I’ve developed an approach for pushing all these testing scenarios back toward the unit level so they can be run regularly per build as they take mere minutes to complete. The core philosophy is to design the software in such a fashion that it runs on a single machine using all the network protocols it would use when deployed across many servers (ah, the power of localhost/127.0.0.1).
Preliminaries
Putting this philosophy into practice requires that we adopt certain design practices:
- Clean separation between the transport/remote layers and the core service logic. This makes it easy to develop tests that verify the core logic without any remoteness concerns and a second set of tests that perform the more heavyweight remote tests. The benefit is that we can more easily isolate issues when they occur. For example, if the core logic tests pass but the remote tests fail we can be pretty confident the issue is in the remote layers.
- Clean separation of configuration source from core service and transport/remote layer. This ensures all our software requests configuration using a consistent API which could then be implemented via LDAP, flat-files, in-memory etc. Such a setup allows us to easily build up configuration inside of our tests and make it available to the services we’re building.
- Runtime discovery of endpoints. To allow us to dynamically allocate port/ip combinations and make them available to whichever services require them. One can achieve this via the abstracted configuration source but it’s often cleaner to have a dynamic lookup/discovery mechanism.
- Configurable log file locations. So that we can avoid path clashes between services.
Once these things are in place, unit tests can construct transports, endpoints and configuration dynamically at run time in whatever combination is required for a test. It is thus possible to instantiate a collection of services inside of a single process and have them talk to each other as if they were all running remotely. This is somewhat at odds with other design practices where we typically look to remove remoteness when running services locally for purposes of performance.
Failure Scenarios
By virtue of the unit tests having control of all the services and their transports/endpoints it becomes possible to stop or disable services thus simulating machine failures but it’s also possible to extend the approach to cover problems such as packet loss, corruption or increased latency.
These more advanced scenarios are more readily handled with server construction toolkits such as Netty which allows tight control of packet processing and protocol. Using Netty, one can build up the protocol stack per service exactly as required and introduce Decoder/Encoder pairs, Handlers or wrappers around core service implementation that can randomly (and silently without severing the connection) lose messages or packets, break connections etc.
Example
I’ve been working on a Paxos implementation which breaks down into:
- State machines – Leader, Acceptor and Learner and associated elements such as leader election and failure detection.
- Persistent storage layer – as various state must be remembered across Paxos instances.
- Remote communications layer – including cluster membership and remote communications.
The state machines accept messages, make appropriate state transitions and produce messages. These are then passed around between participants via the remote communications layer. The persistent storage layer allows for specification of file locations at construction time which allows test code to allocate separate directories on a single-disk to hold respective state.
The remote layer is built such that none of the members need static/well-known ports to operate off. There is one exception which is a fixed multicast address that is used to do initial cluster discovery. It is implemented using Netty and consists of some codecs for the various messages and a handler that passes messages to and from the state machines.
There are several different implementations of the handler. There is the normal version that dispatches messages reliably and several others that randomly drop messages or lose them at critical moments in an instance of Paxos. The exact behaviour of these handlers is configured at runtime which allows unit tests to construct random or specific failure scenarios and ensure the state machines behave appropriately.
All these elements together allow unit tests to construct, in a single-process, fully remote services that communicate via TCP and UDP/Multicast as if they were running on a network and simulate failure scenarios. Alongside these tests are a collection to verify correct behaviour of the state machines and a set that validates their failure handling via timeouts, leader election behaviours etc. The entire suite including the failure scenarios runs in less than five minutes. That leaves one long-running test that exercises a collection of state machines concurrently for long periods, a necessary soak test run separately.
Alternative Implementations
A similar testing approach is possible with the likes of Jetty 7 as the lower IO layers are open enough to be customised to support these test scenarios. This can be a better option than Netty if services are Servlet based.
More challenging are the RPC-based services as these tend to run atop closed stacks that limit the amount of customisation possible and often have horrid configuration methods. However Thrift, by virtue of it’s Processor/Protocol abstraction can be readily modified to support such testing.
Sidenotes:

Entries (RSS)