Author Archive

Planning and estimation discussions always come back to:

  1. Agreeing what will be done
  2. Agreeing how much it will cost
  3. Agreeing a deadline by which the what will be done

These questions are a problem because answering them requires that one knows everything down to the deepest detail, that all possible happenings are known (which means knowing when people will be ill and for how long, how much time they’ll need to take off to deal with family troubles, problems with the plumbing etc) and that the risks are mitigated such that they absolutely will not affect your project in unpredictable ways.

That’s not to say that one can’t set deadlines but one has to expect to trade features away, adjust resourcing etc. Of course none of this is news and yet so many places claim to be agile whilst continuing to have the what, how much, when discussion.


Colin McRae

Colin McRae
1968 - 2007


There’s a lot of noise about transactional memory, so I thought I should do a bit of research. Having read a number of papers, I’m left wondering just what all the racket is for. At least for me the benefits are unclear.

Let’s consider this paper which discusses amongst other things “transactifying” Berkeley Db (a piece of software I know quite well). It contains a comparison of the original version of Db’s locking system (which used a global lock) and the paper’s authors’ modified version. Initial changes were to replace all uses of the global lock with a set of transactions. A test was run and the transactional version was worse all around than the original - ooops.

The root cause boiled down to three issues:

  1. False sharing - a problem which occurs when variables accessed by different threads happen to fall in the same cacheline - this was solved with a traditional approach known as padding.
  2. Statistics collection - Db collected a bunch of statistics keeping them accurate by using the global lock. Rather than address what is surely a common problem, the authors simply turned this feature off.
  3. Object pooling - the pooling associated with lock descriptors and their related objects had to be changed from single linked-lists to collections of linked-lists to improve potential for concurrent access.
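
To make the third change concrete, here is a minimal sketch of a pool split across several free lists, each with its own lock. It is not Berkeley Db's code nor the paper's; `ShardedPool`, its methods and the thread-id based shard choice are all hypothetical names and choices made for illustration, and Python is used only for brevity.

```python
import threading
from collections import deque


class ShardedPool:
    """Hypothetical object pool split across several independent free lists.

    Each shard has its own lock, so threads that land on different shards
    do not contend the way they would on a single shared list.
    """

    def __init__(self, factory, shards=8):
        self._factory = factory
        self._shards = [(threading.Lock(), deque()) for _ in range(shards)]

    def _shard(self):
        # Illustrative assumption: choose the shard from the caller's thread id.
        return self._shards[threading.get_ident() % len(self._shards)]

    def acquire(self):
        lock, free = self._shard()
        with lock:
            if free:
                return free.pop()
        # Nothing pooled in this shard: build a fresh object outside the lock.
        return self._factory()

    def release(self, obj):
        lock, free = self._shard()
        with lock:
            free.append(obj)
```

Note that nothing here depends on transactions: the same restructuring helps a lock-based design just as much.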

The tests were re-run and beyond a certain level of scale the transactional memory version was now better but wait, there’s a problem. Notice that all the work being done to make the transactional version better is broadly the same as the work one would do to make the locking version better. How much of the scalability gain is due to better concurrent structure and how much is down to transactional memory? Is the work we’ve just done any simpler than what we already have to do for conventional thread/lock based systems?

Another under-discussed factor across many papers in this area is related to the assertion that transactional memory is better than locking due to it being more efficient in the non-conflict case. However many modern lock primitives are now also optimized for this circumstance.

What about the fact that one must make sure to correctly isolate the atomic actions in a system and bound them appropriately with transactions, just as one currently does with locking? We still have to make sure we do that consistently across the entire system or risk the usual concurrency debugging nightmares.

Many of the transactional memory systems appear to be based on optimistic approaches - does that make sense for all algorithms and systems we might build? Other transactional systems have evolved to provide both optimistic and pessimistic options (in an attempt to cover all design possibilities) and the programmer must make the appropriate choice for their application. Will transactional memory systems also need to move this way and if so, how will the programmer work with that?
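
For readers unfamiliar with the distinction, the optimistic style can be sketched roughly as follows: read without holding anything, compute, then commit only if nobody else got there first, retrying on conflict. This is not how any particular transactional memory system is implemented; `VersionedCell` and `optimistic_update` are hypothetical names, and a small lock stands in for the atomic compare-and-commit step.

```python
import threading


class VersionedCell:
    """Hypothetical shared value carrying a version number."""

    def __init__(self, value):
        self._lock = threading.Lock()  # guards only the short read/commit steps
        self.value = value
        self.version = 0

    def read(self):
        with self._lock:
            return self.value, self.version

    def try_commit(self, expected_version, new_value):
        # Commit succeeds only if nobody else committed since our read.
        with self._lock:
            if self.version != expected_version:
                return False
            self.value = new_value
            self.version += 1
            return True


def optimistic_update(cell, fn):
    """Apply fn to the cell's value, retrying on conflict. Returns retry count."""
    retries = 0
    while True:
        value, version = cell.read()
        if cell.try_commit(version, fn(value)):
            return retries
        retries += 1
```

Under low contention the retry loop rarely spins; under heavy contention much of the work is thrown away, which is where a pessimistic (lock first, then work) option starts to look attractive.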

Asserting Order

I’m not going to write off transactional memory but it seems that, even should it turn out to be more scalable than the conventional lock-based approaches we use:

  1. It’s really not much simpler to program with.
  2. It’s no use in the distributed case.

Meanwhile:

  1. There are other approaches around that do work across both multi-core and multi-box/distributed cases with little change (some would argue the amount of change is zero but I don’t buy that).
  2. Dealing with concurrency is about much more than whether you use locks or transactions.


The longer one holds onto the single shared memory, multi-core, big box approach, the harder and more costly it gets to shift to distributed.

Every time we buy a bigger box for increased load we’re wasting money come the day there isn’t a bigger box to buy (something that is looking increasingly likely for many of us). All that money would have been better spent on buying racks of smaller boxes. It’s possible we can recover some of our losses by repurposing that big iron via virtualization rather than throwing it away (like all our previous big boxes) but of course, if that box dies it takes an awful lot of VMs with it.

Every time we assume we can keep all our data in a single memory or database (even if it’s a cluster) we’re embedding assumptions into our software that will be broken come the day we must partition across multiple memories or databases.

Each time we choose an algorithm that doesn’t easily partition or assumes a single memory/database we’re storing up trouble in our data and computational models.

In big monolithic systems it’s possible to create (by force) a never-fails environment which allows developers to ignore various edge cases. The move to a system built out of many separate parts makes failure almost impossible to avoid. This requires us to adjust our system design to take account of all those edge cases we previously ignored.

The time we spend gaining experience in building big monolithic systems has limited application when we switch to building distributed systems. We must learn new habits and adopt new modes of thought and that costs time.

In the worst cases, an organization’s processes, tools and departmental structure become heavily optimized for managing these big monolithic software and hardware systems such that it needs serious revision to cope with the move to horizontal, many box scaling. Typical problem areas include:

  1. Monitoring - suddenly there’s a much greater number of machines to gather stats from. Existing gui representations mightn’t cope with such a large number.
  2. Diagnosis - no longer does a single timestamp imply an order on events making analysis of logging information and root cause identification harder.
  3. Deployment - previous methods simply break as the level of automation provided is inadequate for the number of machines and software components involved.
  4. Testing - existing testing practices where everything can live on the developer’s desktop or in a single VM are no longer viable. There are too many moving parts and the convenience of isolation provided by testing at the desktop or in a single VM is lost.
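
On the diagnosis point, one well-known way to recover an ordering of events without trusting wall-clock timestamps is a logical clock in the style of Lamport. The sketch below is an illustration I've swapped in rather than anything the list above prescribes; the class and method names are hypothetical.

```python
class LamportClock:
    """Minimal logical clock: orders events by causality, not wall time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Call for every local event worth logging.
        self.time += 1
        return self.time

    def send(self):
        # Stamp an outgoing message with the current logical time.
        return self.tick()

    def receive(self, remote_time):
        # Jump past whatever the sender had seen, then count the receive itself.
        self.time = max(self.time, remote_time) + 1
        return self.time
```

Logging the logical time alongside each entry means that if one event could have caused another, the cause always carries the smaller number, whatever the machines' clocks say.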

I doubt threads will ever go away but learning to build and manage systems constructed in any of the following ways might be worthwhile:

  1. Multiple communicating reliable processes on a reliable bus
  2. Multiple communicating unreliable processes on a reliable bus
  3. Multiple communicating unreliable processes on an unreliable bus

[ Where bus is typically a backplane or a network ]


One got addicted and the other ran away
Some settle down a familiar place
One lets go the wheel while the other one steers

One got the money that the other put away
Some held the world and the others couldn’t stay
A few just follow their dreams while the others stood clear
After all these years
After all these years

All These Years - Adema - Kill the Headlights


Bobby Woolf penned a great article:

Clients often want to build only an ESB because that involves a technology challenge without the need for messy business requirements. Building just an ESB becomes an IT field of dreams, where IT builds an ESB and then hopes some SOA will come along and use it. Such an ESB-oriented architecture loses the benefits of SOA. It does not create business value. In fact, it incurs cost without reaping immediate benefit.

There in black and white is what happens when the focus becomes purely technological - no (business) value. The answer is an ESB, now what’s the question? A few years back we were saying:

The answer is WS-Deathstar, now what’s the question?

Before that it was:

The answer is an application server, now what’s the question?

Still further back we had:

The answer is a database, now what’s the question?

Silver bullet after silver bullet, when will we learn? Not for as long as we pursue technology for technology’s sake (death by technology). No doubt it’s rewarding for many a geek, fun for sure, but it’s the appropriate and innovative application of technology to a problem that matters (vision followed by implementation reality). Deploying a piece of technology on the basis that one might use or benefit from its presence in the future is simply foolish. It goes against the YAGNI principle for starters.

INATT is heard in the SOA realm partly because some parties are attempting to move beyond the endless hype curve/silver bullet delusion. Andrew McAfee appears to have been fortunate in having clients that don’t suffer from death by technology. However, judging by the experiences of Mr Woolf and myself, not everyone is immune. So let’s talk tech but please only in the context of achieving a greater goal (unproven cost reduction promises don’t count). By way of example, consider this quote from Vogels:

Growth is core to Amazon.com’s business strategy, and that has had a significant impact on the way we use technology: growth through more categories, a larger selection, more services, more buying customers, more sellers, more merchants, more developers, increasing the different access methods, and expanding delivery mechanisms. The impact has been on many areas: larger data sets, faster update rates, more requests, more services, tighter SLAs (service-level agreements), more failures, more latency challenges, more service interdependencies, more developers, more documentation, more programs, more servers, more networks, more data centers. A large part of Amazon.com’s technology evolution has been driven to enable this continuing growth, to be ultra-scalable while maintaining availability and performance.


The question is, can we see it? If this article is anything to go by the answer would be no.

SOA is an approach to building systems, it certainly couldn’t be called a style (much to the annoyance of some) but it sure isn’t a technology.

And this is the problem - so many view everything to do with building systems as being about deploying the right technologies rather than adopting an approach and driving technology selection from there.

No surprise then that SOA adoption is isolated to small parts of various businesses - that’s the maximum level of use that can be achieved whilst it is treated as a technological shift. Change across the entire business is essential for SOA to get real traction - systems shouldn’t be viewed as necessary evils that cost, rather they should be considered as means for delivering enhanced business value. Processes and culture are the real challenge not hardware and software.

IT/Systems Development needs to be considered a first class citizen within an organization rather than simply the poor cousin that mops the floors and cleans the toilets. Fewer conversations like "here’s what we want" and more discussion around "here’s what we’re trying to do, how can you help?". Switched on readers may notice an interesting parallel with Web vs Enterprise….

Web is (amongst other things) about enabling users and agents to do interesting (interconnected/social) stuff more effectively, whilst Enterprise is often treated as little more than automating laborious tasks with strict controls. Loosely coupled versus tightly coupled, granular and cooperative versus monolithic and uncooperative.


There’s been some renewed discussion on the relative merits of push and pull for circulating changes.

What I find fascinating is how there’s often a tendency to polarize solutions one way or the other - either we’re entirely push (with failover support etc because we absolutely cannot afford for it to fail) or we must be entirely pull (and worry about what speed of polling to use and build infrastructure that can scale with it etc).

The Good and Bad of Push

Push allows for timely delivery of information updates. If the rate is high enough it makes sense to batch updates together for more efficient delivery. Significantly from the perspective of most, push ensures that we burn CPU cycles as and when there’s something worth doing in contrast to pull where we can waste cycles (though some can be saved with e.g. appropriate use of caching) finding out nothing has changed.

The downside to push comes when clients can’t receive their updates due to network partition or their own downtime (failure, running out of battery power, whatever). When this happens, if we stay push focused we must build appropriate mechanisms for tracking what messages a client has or has not received and hold on to them which can get messy/complex.

And how do we know the client is back? Because it will reconnect, it will pull if you will…..

The Need to Pull

Pull allows a client to dictate when it receives its updates and can be particularly attractive in the case of slow update rates. Pull also allows us to recover from various lost event scenarios like:

  1. Delivery failure - given a rough idea of rate of event delivery and a period of silence (that is no event has been received) we can perform a check for lost events by performing a pull. And a failed pull tells us quite clearly something is broken.
  2. Client offline or dead for some period of time.

Recovery is performed by going back to the "event archive" and finding all the events we missed, after which we can return to the push mode of operation. We can easily do this so long as we have noted the last event we’ve seen, and it works really nicely if we batch events.

We can limit the size of the archive somewhat by bounding the maximum amount of time a client can be down for whilst still being able to restore itself.

To make this work requires that we provide some way to identify each event uniquely and the ability to page through the "event archive" efficiently.
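
The recovery model above can be sketched in a few lines. This is a minimal illustration under assumptions of my own: event ids are contiguous integers starting at 1, the archive is written before any push goes out, and `EventArchive`, `Client`, `page_after` and `catch_up` are hypothetical names rather than any real API.

```python
class EventArchive:
    """Hypothetical archive: every event gets a unique, increasing id."""

    def __init__(self):
        self._events = []  # list of (event_id, payload); ids are 1, 2, 3, ...

    def append(self, payload):
        event_id = len(self._events) + 1
        self._events.append((event_id, payload))
        return event_id

    def page_after(self, last_seen_id, limit=100):
        # Ids are contiguous from 1, so the id doubles as a list offset.
        return self._events[last_seen_id:last_seen_id + limit]


class Client:
    def __init__(self, archive):
        self.archive = archive
        self.last_seen_id = 0
        self.received = []

    def _apply(self, event_id, payload):
        self.received.append(payload)
        self.last_seen_id = event_id

    def on_push(self, event_id, payload):
        if event_id > self.last_seen_id + 1:
            self.catch_up()  # gap detected: pull what we missed
        if event_id == self.last_seen_id + 1:
            self._apply(event_id, payload)
        # Anything else is a duplicate and is ignored.

    def catch_up(self, page_size=2):
        # Page through the archive until we are level with it.
        while True:
            page = self.archive.page_after(self.last_seen_id, page_size)
            if not page:
                break
            for event_id, payload in page:
                self._apply(event_id, payload)
```

A client that has been silent for too long can call `catch_up()` on its own timer as well, which covers the delivery failure case where no later push ever arrives to reveal the gap.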

The Best Of Both

Rather than focus solely on either approach in isolation, I think the best solution is to use a combination. This has several advantages:

  1. Clients can potentially use whichever method is more appropriate for them.
  2. It provides significant opportunity for fine tuning.
  3. It provides a nice simple recovery model.
  4. Responsibility is balanced throughout the system keeping complexity down.

[ I’m not alone in this belief as Bill describes exactly such a hybrid approach from the perspective of his favourite technologies (I quite like them too). What I wanted to do was describe the underpinning patterns because I believe this allows us to be technology agnostic and build a working system in whatever environment we’re faced with (for example JavaSpace05 could be used as a substrate). ]

Update: A variation on the scheme allows a client to pull some base state and a set of events from the archive after which it resumes listening to events. The size of the archive can then be managed by every so often updating the base state and storing events since then - basically we’re checkpointing.
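
A rough sketch of that checkpointing variation, assuming for illustration that each event simply sets a key to a value (real events will be richer). `CheckpointedArchive` and `restore` are hypothetical names.

```python
class CheckpointedArchive:
    """Hypothetical archive holding a base state plus the events since it."""

    def __init__(self):
        self.base_state = {}
        self.base_id = 0   # id of the last event folded into base_state
        self.events = []   # (event_id, key, value) newer than base_id
        self._next_id = 1

    def append(self, key, value):
        event_id = self._next_id
        self._next_id += 1
        self.events.append((event_id, key, value))
        return event_id

    def checkpoint(self):
        # Fold stored events into the base state, then drop them.
        for event_id, key, value in self.events:
            self.base_state[key] = value
            self.base_id = event_id
        self.events = []

    def snapshot(self):
        # Copies, so a client can replay without disturbing the archive.
        return dict(self.base_state), self.base_id, list(self.events)


def restore(archive):
    """Rebuild current state: pull the base, then replay events since it."""
    state, last_id, events = archive.snapshot()
    for event_id, key, value in events:
        state[key] = value
        last_id = event_id
    return state, last_id
```

After `restore` the client knows the last event id it has seen and can go back to listening for pushes from that point.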


I was particularly interested in this part of a recent post from Coté:

James knew that there was a vast schism here between the simple-heads and the complex-heads. The REST and the SOA heads. The WS-Deathstar vs. the WS-* people. Essentially, the two camps don’t want to have anything to do with the other because each thinks the other is fatally flawed:

  • The simple-heads think the complex-heads make everything too difficult in the name of…well, they’re not quite sure why. Maybe to sell more software, because they like punishing themselves, or because they don’t know any better.
  • The complex-heads think the simple-heads are using “toy” technologies that surely won’t work in “the real world.” Simple software can’t be used for large scale, high-dollar problems: surely you couldn’t transact billions of dollars a day with “simple” software.

I think the root cause of the conflict comes down to differences in underlying design philosophies…..

Challenges

In building their systems both sides must tackle challenges such as:

  • Failure
  • Scaling
  • Data Partitioning
  • Persistence
  • State Management
  • Security

Two Different Approaches

Most of enterprise buys into the middleware concept where all the challenges are to be handled inside of an all-encompassing, tightly integrated framework leaving the “trivial” matter of writing some custom logic to satisfy their specific requirements. Cramming all these things into a single piece of infrastructure inevitably leads to complexity in many forms including large surface area (see the section titled “Simplicity as its Own Virtue”), big hardware, big configuration files and so on.

In contrast stands the web crowd that prefers de-centralized, distributed, layered solutions where all participants share responsibility in handling these challenges and application developers are expected to design accordingly (say by implementing retry strategies, idempotency and the like rather than transactions).
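
As a small illustration of what "design accordingly" can mean, here is a sketch of a retry strategy made safe by idempotency: the client reuses one request id across retries and the service remembers ids it has already applied. Everything here (`PaymentService`, `deposit_with_retry`, the simulated lost replies) is hypothetical and exists only to show the shape of the idea.

```python
import uuid


class PaymentService:
    """Hypothetical service that applies each request id at most once."""

    def __init__(self):
        self.balance = 0
        self._seen = {}             # request_id -> result already computed
        self.fail_next_replies = 0  # test hook: simulate replies lost in transit

    def deposit(self, request_id, amount):
        if request_id not in self._seen:
            self.balance += amount
            self._seen[request_id] = self.balance
        if self.fail_next_replies > 0:
            # The work was done but the caller never hears about it.
            self.fail_next_replies -= 1
            raise ConnectionError("reply lost")
        return self._seen[request_id]


def deposit_with_retry(service, amount, attempts=3):
    request_id = str(uuid.uuid4())  # the same id is reused on every retry
    for attempt in range(attempts):
        try:
            return service.deposit(request_id, amount)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
```

A real client would also back off between attempts; that is left out to keep the sketch short. The point is that the caller can retry blindly without a distributed transaction and the deposit still happens exactly once.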

Incompatible Views

Essentially the enterprise crowd tackle these tough challenges with technology whilst the web crowd and others tackle them with design. One side complains of toy technologies because they expect the difficult stuff to be tackled in their underpinning middleware monolith and the other complains of complexity because they prefer to emphasise design and simple components.

Bullets

SOA isn’t (or at least shouldn’t be) about technology. The systems aspects of SOA can be constructed using many different technologies (even a mixture) but many still think SOA requires WS-Deathstar. Similar issues exist around REST which many mistakenly associate purely with HTTP. This speaks to a much bigger ill that is rife in our industry - the tendency to conflate concept and technology.

It can surely be no surprise that the middleware mob find distributed hard, given that all they know and all that they do is pretty much in complete opposition to what is required for such systems.

An oft-cited reason for the existence of complex middleware is enterprise’s desire to utilise commodity programmers. Little wonder that so called rock star programmers have zero interest in such a world and are to be found playing around with the web.


That’s what everyone craves. That’s why lines of code is still seen as a useful measure in various quarters. That’s why we obsess over IDEs and other so-called productivity tools.

Trouble is every decent software engineer knows that actually you want less code. It’s easier to maintain, easier to change, easier to debug, easier to build on. They’ll also tell you that the best way to build big codebases is out of lots of small, well isolated, loosely coupled bits with minimal knowledge of each other.

The less code philosophy requires doing some design, pausing for thought and not cranking code. It can provide massive benefit but it’s difficult to measure in any simplistic fashion and thus is seen as pure cost by many.

More code is the hare, less code is the tortoise - I know which one I’d bet on.


For a long time, software development within many an enterprise has been treated as a subservient entity: something that is expected to comply with the demands of the business without complaint and with limited options for pushback.

I believe the software we produce should be viewed as a stakeholder in its own right. It has its own needs for survival and ongoing growth and if these are always placed behind everyone else’s (i.e. the business) considerations, the result will be a long, slow, painful death where the software becomes more and more brittle, less and less maintainable and staff productivity drifts down as staff turnover creeps up.

Consider a car - it has needs, new tyres, oil flushes, new exhausts, paint chip removal, new springs, new clutch etc. Fail to address those needs and your car will turn into a pile of rust before your eyes with all the attendant issues of depreciation, lost investment, breakdowns etc.

Why do we refuse to accept that software has a need for the motoring equivalent of oil changes and the like? Probably because software is an invisible, abstract thing such that only those working with it see the damage being done; problem cars are easier to spot for a greater percentage of the population. This isn’t really an excuse, however, because software gives warning signs just as cars do. If there’s steam coming from under the bonnet (hood) you’d go to a mechanic; if your software keeps crashing, or upgrades keep failing, or development takes longer and longer, it surely follows that it’s time to visit your software engineer.

[ You may have noticed I like car analogies - here’s another: You can’t endlessly tune a car, it will only go so fast. Any further increase in speed can only come from starting with a new car that has better basic performance ingredients. The same is true for software, eventually you need a redesign to make further progress because the initial assumptions you made have all been invalidated. ]


Distributed systems practitioners get really excited by swarm theory because it holds the promise of being able to assert a state change in a system without centralized control or a global interaction (e.g. two-phase commit). Bees, ants and the like manage to conduct a co-ordinated life within a co-operative using only local interactions. They don’t have to communicate with the entire colony to agree a good place to nest or a good source of raw materials. It should be no surprise then that distributed systems practitioners also get excited about epidemic behaviours given that they possess similar qualities.

These natural systems contain a good measure of resilience as well. Each bee for example carries a little bit of state with it that it communicates (gossips) with other bees it meets. This state is the result of something it’s encountered in its environment. Should this bee die, losing the state, it’s unlikely to matter as another bee will probably encounter the same environmental conditions and thus the state will be recovered. Victory through weight of numbers.

So nature has provided us a mechanism that can make a binding decision using very loose co-ordination and eventual consistency, whilst functioning entirely off local interactions - no centralized control. Naturally more scalable. What we don’t get is a predictable time for the point at which this decision will be made (concurrent transactional systems aren’t entirely predictable in this respect either). In addition for certain kinds of decision, it’s entirely possible to have race conditions leading to a less than optimal choice. For some systems this doesn’t matter (nature is happy to be a little sloppy) but in cases where it is important there are solutions available.
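
A toy simulation gives a feel for how quickly purely local, random exchanges spread a piece of state. It is a sketch only: membership is fixed and every node can reach every other, which real systems cannot assume, and `gossip_rounds` is a hypothetical name. The random generator is seeded so runs are repeatable.

```python
import random


def gossip_rounds(node_count, seed=0, max_rounds=1000):
    """Count rounds until a rumour started at node 0 has reached every node.

    In each round every informed node tells one randomly chosen peer.
    Returns (rounds_taken, nodes_informed).
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < node_count and rounds < max_rounds:
        rounds += 1
        # Snapshot the set so nodes informed this round start talking next round.
        for _node in list(informed):
            peer = rng.randrange(node_count)
            informed.add(peer)
    return rounds, len(informed)
```

The informed set can at most double each round, so the spread takes at least a logarithmic number of rounds; how many it takes in a given run varies, which is the unpredictability in timing mentioned above.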

One key challenge in building these systems is that much of our existing communications middleware is either point-to-point (e.g. RPC or RMI) with fixed addresses or broadcast (message queues or multicast), neither of which is well suited to the random one-to-one nature of gossip driven approaches such as those described above. What’s needed is some form of registry which can cope with dynamically changing membership and takes account of locality of nodes, plus an efficient directed inter-node communication algorithm which mimics the relatively random properties of gossiping.


I was reading this from Bill, which follows on from Joe, and had a few thoughts:

  1. Google seems to have applied N > 1 to everything, not just storage - that’s pure distributed thinking, not the norm for the majority of software heads.
  2. Eventual consistency might be kinda like concurrent programming for most people - i.e. many like to program sequentially and with the certainty that x has been completed immediately, in order and within a known timespan. Concurrency, eventual consistency and friends aren’t terribly amenable to this programming approach and it consequently melts many a brain.
  3. We need a lot more people to understand CAP.

[Note for Bill should he read this: it seems like your comments are broken, I’m seeing server errors when I hit post]


Check out this job spec.

Notice anything interesting? It’s for a seriously heavyweight distributed systems engineer sure but look deeper. Do you see mention of a single piece of technology? J2EE? JavaScript? Ruby? No, right? How weird is that? How many job specs do you see like that? Surely what matters is whether you know JBoss or Websphere, Java or Erlang or Ruby?

What’s the deal? It’s recognition of the fact that building systems is about how you think and reason which requires sound understanding of theory and how to apply it. It doesn’t matter how much code you can write or in what language because delivering a project is about a whole lot more than code.

So often I see companies create job specs for engineers where the key requirement is to hire someone who can hit the deck coding like mad using whatever tools have been selected. To that end they load the specs up with endless tech hubris and at interview ask the details of this or that bit of syntax or API call. But what about the next project within the company where the tech is different? All those engineers that just got hired are now useless, they don’t have the skills and we lose time whilst they learn. Or we could fire them and hire another lot?

Of course what happens more often than not is that companies ensure they don’t use new tech. Instead they force new projects into using all the same stuff they used before. This is a design disaster as now technology is dictating, not design or suitability to requirements. A company that follows the hit the deck coding mantra just has deathmarch and no career progression stamped all over it.

Keeping an eye on trends and keeping abreast of new technology is a good thing to do but the larger context of what to use when, when to build rather than buy, when to dump something because it’s warping the design, when to dump one design approach for another (e.g. going from centralized to distributed) and so on is what really matters. This requires thinking, not an encyclopedic knowledge of a huge number of technologies.

Tech is for sissies - Concepts, principles, patterns, measurement, theory and so on are what matter.

[Confession: The title for this entry was inspired by a recent piece from Pat Helland, one of my favourite thinkers]


All my life
I’ve denied
Ever knowing what its like
You came around
You shook my ground
Now i’m searching for a drug to come down
You’re where I thought I’d never go
I can’t believe I did

Look out below
I’m letting go
Look out below
I’m falling completely
I lost control
I let it go
Now I can see so clearly around me
You’re everything I need

Closure (featured on the soundtrack for Billabong Odyssey - check out those waves).
