Archive for August 21st, 2007

There’s been some renewed discussion on the relative merits of push and pull for circulating changes.

What I find fascinating is how there’s often a tendency to polarize solutions one way or the other – either we’re entirely push (with failover support etc because we absolutely cannot afford for it to fail) or we must be entirely pull (and worry about what speed of polling to use and build infrastructure that can scale with it etc).

The Good and Bad of Push

Push allows for timely delivery of information updates. If the rate is high enough it makes sense to batch updates together for more efficient delivery. Significantly from the perspective of most, push ensures that we burn CPU cycles as and when there’s something worth doing in contrast to pull where we can waste cycles (though some can be saved with e.g. appropriate use of caching) finding out nothing has changed.

The downside to push comes when clients can’t receive their updates due to network partition or their own downtime (failure, running out of battery power, whatever). When this happens, if we stay push focused we must build appropriate mechanisms for tracking what messages a client has or has not received and hold on to them which can get messy/complex.

And how do we know the client is back? Because it will reconnect, it will pull if you will…..

The Need to Pull

Pull allows a client to dictate when it receives it’s updates and can be particularly attractive in the case of slow update rates. Pull also allows us to recover from various lost event scenarios like:

  1. Delivery failure – given a rough idea of rate of event delivery and a period of silence (that is no event has been received) we can perform a check for lost events by performing a pull. And a failed pull tells us quite clearly something is broken.
  2. Client offline or dead for some period of time.

Recovery is performed by going back to the "event archive" and finding all the events we missed (we can easily do this so long as we have noted the last event we’ve seen, this works really nicely if we do batching of events) after which we can return to the push mode of operation.

We can limit the size of the archive somewhat by bounding the maximum amount of time a client can be down for whilst still being able to restore itself.

To make this work requires that we provide some way to identify each event uniquely and the ability to page through the "event archive" efficiently.

The Best Of Both

Rather than focus solely on either approach in isolation, I think the best solution is to use a combination. This has a couple of advantages:

  1. Clients can potentially use whichever method is more appropriate for them.
  2. It provides significant opportunity for fine tuning.
  3. It provides a nice simple recovery model.
  4. Responsibility is balanced throughout the system keeping complexity down.

[ I'm not alone in this belief as Bill describes exactly such a hybrid approach from the perspective of his favourite technologies (I quite like them too). What I wanted to do was describe the underpinning patterns because I believe this allows us to be technology agnostic and build a working system in whatever environment we're faced with (for example JavaSpace05 could be used as a substrate). ]

Update: A variation on the scheme allows a client to pull some base state and a set of events from the archive after which it resumes listening to events. The size of the archive can then be managed by every so often updating the base state and storing events since then – basically we’re checkpointing.

Technorati Tags: , , ,

  • Share/Bookmark

Comments 4 Comments »