Acting
Posted by Dan Creswell in Concurrency, Distributed Systems, tags: actors, concurrency, Distributed SystemsQueues and Threads
An actor is essentially a queue of requests fed into some piece of single-threaded logic. Only one thread at any moment in time can be dispatching a request through the actor.
In essence this is a SingleThreadExecutor as found in java.util.concurrent.
The model is wonderfully simple at the surface but is hiding a number of challenging issues which aren’t easily solved without assistance from the actor/app developer.
Flow Control
The speed at which an actor’s queue is drained is related to the time taken for the actor to dispatch a request taken from the queue. There are also a number of runtime factors that can cause the queue to empty more slowly than might be expected including:
- Limited size thread of actor thread pool.
- The work being performed within the actor takes a long time generally.
- An actor doing I/O is slowed down due to contention on the device or because of a fault (e.g. a RAID enclosure slowing down whilst it restores a disk).
If actor queues fill faster than they can be emptied we will eventually exhaust resources and fail. It must be possible to do explicit flow-control and generate back-pressure on other components in a system which requires an actor infrastructure to expose metrics about queue sizes or provide other feedback about the state of the system (e.g. by posting events to a listener).
Partitioning
If we partition our actors by function then only one actor can perform some set of actions against some set of state owned by that function which represents a throughput limitation. An alternative is to partition the state across multiple actors each capable of performing the function. Or we can pass all state to be worked upon into an actor as part of the request such that we can run many stateless actors. The resultant and prior state must then be stored somewhere else (database or global memory or some other actor). In summary, we must either:
Ensure state is partitioned inside of actors in sufficiently granular fashion that we can provide enough throughput of operations to prevent overload.
or:
Ensure that we have sufficient actors capable of performing some function and can spread state across requests without contention.
[Sidenote: Achieving this efficiently in a system with dynamic load is not easy, one can of course take the easy option of over-provisioning ]
Network
The kind of network used for actor communication has significant effects on system characteristics. For the purposes of this conversation we’ll consider two kinds of network:
- Processor/Memory bus – essentially guaranteed delivery, high throughput and low latency. Actors and requesters communicating via a bus are local to each other.
- Ethernet – no delivery guarantee (a machine failure now doesn’t stop the entire system and packets can be lost silently for variable periods of time), low throughput, higher and more variable latency as compared to processor/memory bus. Actors and requesters communicating via a network are remote from each other.
[Sidenote: A lot of the literature considers both communication over buses and networks to be styles of distributed system. Generally, algorithms for the bus-based style are more plentiful, considerably less complex and easier to construct than for the network-based style see e.g. Lynch ]
An actor that is remote from an entity passing it requests requires two queues. One at the originator and one at the actor. The outbound queue in particular has substantially different characteristics from local actors and requesters:
- Throughput – maximum throughput of the queue is closely related to the throughput of the ethernet between originating and remote machines. This is considerably lower than the throughput for a local queue.
- Latency – the remote queue has the latency of the underlying ethernet and transport stack. Again, substantially different from the local queue.
Why does this matter? Because the number of requests one can dispatch through a remote queue per unit time is substantially different from a local queue. One cannot write a naive algorithm to dispatch requests evenly across a set of actors and assume they make progress at the same rate. Care must be taken to control flow in and out of queues to prevent resource exhaustion. Actor infrastructures that encourage these flow-naive algorithms are likely to exhibit difficult to debug failures in the face of unexpected/unanticipated load.
If one chooses an approach of passing state into an actor, then one must ensure the state is not large because it will take significant time to pull it across the network from storage (or another machine) and put it back. Further whilst the data is being transferred it cannot be modified by some other entity for fear of modifying old-state and losing some changes thus some form of conflict resolution is required (locks, vector clock based merge etc).
Availability
By virtue of machine and network failures (packet loss, outright failure etc), remote actors may become unavailable temporarily or permanently. A stateless actor can be respawned on another machine with little effort but a stateful actor requires some storage that is available across a number of nodes. Note that use of stateless actors will only eradicate the need for storage in the case where nothing else is required to hold state and pass it to actors for processing.
[Sidenote: The need for storage and the costly disk-syncing/checkpointing required to look after state so it can be made available to relocated actors impacts the relevance of actor benchmarks that concentrate on request throughput particularly in the multi-machine scenario. This is because storage and network become the dominant performance factors rather than the speed of request dispatch in real-world scenarios. Such a change may require us to adjust our actor/state partitioning approach ]
Should a machine be running a large number of actors and become unavailable, there is now a need to do significant and costly work to get these actors onto another machine. State must be moved, actors must be re-started and potentially re-balanced across remaining machines. Over the course of a re-balancing, various actor queues must be temporarily suspended which can lead to an increased backlog of requests which as above can cause resource exhaustion. Note also that when a machine becomes unavailable all queues on remote machines that were feeding it must be paused until new resource becomes available.
[Sidenote: Once one is running actors across multiple machines, it becomes necessary to have some form of directory service which can be used to find an actor. Without this feature it becomes difficult to relocate an actor (how do you find it's new address?). We must now keep this service up to date, make it tolerant of failures and scalable to potentially large numbers of actors ]
Summary
In a single-machine/multi-core environment, actor models can work well subject to being able to achieve an appropriate partitioning of function and/or state. This mode of operation also permits substantial simplicity in APIs but flow-control cannot be ignored.
Actor models will only work well in the multi-machine environment if they trade their simplicity for more complicated APIs and heightened developer awareness (see e.g. A Note on Distributed Computing) or are deployed into a very strictly controlled environment (increasingly difficult as a system grows).
Actors like many other concurrency models have their uses (though I’d claim they aren’t anything new and can be built in e.g. Java using an executor) but are not a panacea and their much-vaunted simplicity can be a double-edged sword.

Entries (RSS)