Posts Tagged “concurrency”

Queues and Threads

An actor is essentially a queue of requests fed into some piece of single-threaded logic. Only one thread at any moment in time can be dispatching a request through the actor.

In essence this is a SingleThreadExecutor as found in java.util.concurrent.

The model is wonderfully simple at the surface but is hiding a number of challenging issues which aren’t easily solved without assistance from the actor/app developer.

Flow Control

The speed at which an actor’s queue is drained is related to the time taken for the actor to dispatch a request taken from the queue. There are also a number of runtime factors that can cause the queue to empty more slowly than might be expected including:

  • Limited size thread of actor thread pool.
  • The work being performed within the actor takes a long time generally.
  • An actor doing I/O is slowed down due to contention on the device or because of a fault (e.g. a RAID enclosure slowing down whilst it restores a disk).

If actor queues fill faster than they can be emptied we will eventually exhaust resources and fail. It must be possible to do explicit flow-control and generate back-pressure on other components in a system which requires an actor infrastructure to expose metrics about queue sizes or provide other feedback about the state of the system (e.g. by posting events to a listener).

Partitioning

If we partition our actors by function then only one actor can perform some set of actions against some set of state owned by that function which represents a throughput limitation. An alternative is to partition the state across multiple actors each capable of performing the function. Or we can pass all state to be worked upon into an actor as part of the request such that we can run many stateless actors. The resultant and prior state must then be stored somewhere else (database or global memory or some other actor). In summary, we must either:

Ensure state is partitioned inside of actors in sufficiently granular fashion that we can provide enough throughput of operations to prevent overload.

or:

Ensure that we have sufficient actors capable of performing some function and can spread state across requests without contention.

[Sidenote: Achieving this efficiently in a system with dynamic load is not easy, one can of course take the easy option of over-provisioning ]

Network

The kind of network used for actor communication has significant effects on system characteristics. For the purposes of this conversation we’ll consider two kinds of network:

  1. Processor/Memory bus – essentially guaranteed delivery, high throughput and low latency. Actors and requesters communicating via a bus are local to each other.
  2. Ethernet – no delivery guarantee (a machine failure now doesn’t stop the entire system and packets can be lost silently for variable periods of time), low throughput, higher and more variable latency as compared to processor/memory bus. Actors and requesters communicating via a network are remote from each other.

[Sidenote: A lot of the literature considers both communication over buses and networks to be styles of distributed system. Generally, algorithms for the bus-based style are more plentiful, considerably less complex and easier to construct than for the network-based style see e.g. Lynch ]

An actor that is remote from an entity passing it requests requires two queues. One at the originator and one at the actor. The outbound queue in particular has substantially different characteristics from local actors and requesters:

  1. Throughput – maximum throughput of the queue is closely related to the throughput of the ethernet between originating and remote machines. This is considerably lower than the throughput for a local queue.
  2. Latency – the remote queue has the latency of the underlying ethernet and transport stack. Again, substantially different from the local queue.

Why does this matter? Because the number of requests one can dispatch through a remote queue per unit time is substantially different from a local queue. One cannot write a naive algorithm to dispatch requests evenly across a set of actors and assume they make progress at the same rate. Care must be taken to control flow in and out of queues to prevent resource exhaustion. Actor infrastructures that encourage these flow-naive algorithms are likely to exhibit difficult to debug failures in the face of unexpected/unanticipated load.

If one chooses an approach of passing state into an actor, then one must ensure the state is not large because it will take significant time to pull it across the network from storage (or another machine) and put it back. Further whilst the data is being transferred it cannot be modified by some other entity for fear of modifying old-state and losing some changes thus some form of conflict resolution is required (locks, vector clock based merge etc).

Availability

By virtue of machine and network failures (packet loss, outright failure etc), remote actors may become unavailable temporarily or permanently. A stateless actor can be respawned on another machine with little effort but a stateful actor requires some storage that is available across a number of nodes. Note that use of stateless actors will only eradicate the need for storage in the case where nothing else is required to hold state and pass it to actors for processing.

[Sidenote: The need for storage and the costly disk-syncing/checkpointing required to look after state so it can be made available to relocated actors impacts the relevance of actor benchmarks that concentrate on request throughput particularly in the multi-machine scenario. This is because storage and network become the dominant performance factors rather than the speed of request dispatch in real-world scenarios. Such a change may require us to adjust our actor/state partitioning approach ]

Should a machine be running a large number of actors and become unavailable, there is now a need to do significant and costly work to get these actors onto another machine. State must be moved, actors must be re-started and potentially re-balanced across remaining machines. Over the course of a re-balancing, various actor queues must be temporarily suspended which can lead to an increased backlog of requests which as above can cause resource exhaustion. Note also that when a machine becomes unavailable all queues on remote machines that were feeding it must be paused until new resource becomes available.

[Sidenote: Once one is running actors across multiple machines, it becomes necessary to have some form of directory service which can be used to find an actor. Without this feature it becomes difficult to relocate an actor (how do you find it's new address?). We must now keep this service up to date, make it tolerant of failures and scalable to potentially large numbers of actors ]

Summary

In a single-machine/multi-core environment, actor models can work well subject to being able to achieve an appropriate partitioning of function and/or state. This mode of operation also permits substantial simplicity in APIs but flow-control cannot be ignored.

Actor models will only work well in the multi-machine environment if they trade their simplicity for more complicated APIs and heightened developer awareness (see e.g. A Note on Distributed Computing) or are deployed into a very strictly controlled environment (increasingly difficult as a system grows).

Actors like many other concurrency models have their uses (though I’d claim they aren’t anything new and can be built in e.g. Java using an executor) but are not a panacea and their much-vaunted simplicity can be a double-edged sword.

Comments 2 Comments »

Building a concurrent system ultimately boils down to:

  1. Partitioning the data into chunks that can be separately acted upon
  2. Applying computations against those chunks to produce results

The smaller or more fine-grained the chunks, the more concurrent activity will be possible. In theory the closer one can get to one chunk per core the better but in reality it’s rare (a function of throughput and size of calculation) one needs to do computation across all chunks simultaneously such that a core can be assigned many chunks any one of which it will dispatch operations against at a moment in time.

There are many solutions for building concurrent systems but those that provide some abstraction which makes request routing easy to implement are likely to work best as it makes re-balancing of computation easier. One shouldn’t immediately assume that message passing is the answer as there are many ways to achieve routing (e.g. via DNS).

Any solution represents a transparency tradeoff. If for example routing is hidden inside of the solution, this can make it easy to get something up and running but we might find it difficult to transition from one box to a multi-box deployment. There are many tradeoffs to be made and for any case where control is given to the developer/architect it’s likely there will be libraries/frameworks to ease the initial implementation burden, programming languages alone will not be enough (Scala makes such a differentiation quite difficult given it’s language extension capabilities).

One aspect discussed less often is the difference between processing on a set of cores all in one box versus processing across a set of cores on many boxes. The latter brings the following challenges all related to the fallacies of distributed computing:

  1. Cores are more likely to become inaccessible
  2. The latency of an operation can become substantially more variable
  3. Any centralised functions (e.g. job scheduler or watchdogs) are more vulnerable to becoming isolated from the resources they manage such that processing ceases.

The latency factor is particularly challenging as few concurrent approaches make it sufficiently explicit that developers/architects are encouraged to be appropriately mindful.

Thus far, as has been the case throughout our history, the solutions are polarising into those that work within the confines of a single box and those that work across multiple boxes with the emphasis on the former. I fully expect developers and architects to fall into the old trap of using a single-box solution to solve a multi-box problem with all the associated issues. Of the solutions that work across multiple boxes, very few account fully for the impact of the network.

Comments Comments Off