The longer one holds onto the single shared memory, multi-core, big box approach, the harder and more costly it gets to shift to distributed.
Every time we buy a bigger box for increased load we’re wasting money come the day there isn’t a bigger box to buy (something that is looking increasingly likely for many of us). All that money would have been better spent on buying racks of smaller boxes. It’s possible we can recover some of our losses by repurposing that big iron via virtualization rather than throwing it away (like all our previous big boxes) but of course, if that box dies it takes an awful lot of VM‘s with it.
Every time we assume we can keep all our data in a single memory or database (even if it’s a cluster) we’re embedding assumptions into our software that will be broken come the day we must partition across multiple memories or databases.
Each time we choose an algorithm that doesn’t easily partition or assumes a single memory/database we’re storing up trouble in our data and computational models.
In big monolithic systems it’s possible to create (by force) a never-fails environment which allows developers to ignore various edge cases. The move to a system built out of many separate parts makes failure almost impossible to avoid. This requires us to adjust our system design to take account of all those edge cases we previously ignored.
The time we spend gaining experience in building big monolithic systems has limited application when we switch to building distributed systems. We must learn new habits and adopt new modes of thought and that costs time.
In the worst cases, an organization’s processes, tools and departmental structure become heavily optimized for managing these big monolithic software and hardware systems such that it needs serious revision to cope with the move to horizontal, many box scaling. Typical problem areas include:
- Monitoring – suddenly there’s a much greater number of machines to gather stats from. Existing gui representations mightn’t cope with such a large number.
- Diagnosis – no longer does a single timestamp imply an order on events making analysis of logging information and root cause identification harder.
- Deployment – previous methods simply break as the level of automation provided is inadequate for the number of machines and software components involved.
- Testing – existing testing practices where everything can live on the developer’s desktop or in a single VM are no longer viable. There are too many moving parts and the convenience of isolation provided by testing at the desktop or in a single VM is lost.
I doubt threads will ever go away but learning to build and manage systems constructed in any of the following ways might be worthwhile:
- Multiple communicating reliable processes on a reliable bus
- Multiple communicating unreliable processes on a reliable bus
- Multiple communicating unreliable processes on an unreliable bus
[ Where bus is typically a backplane or a network ]
Technorati Tags: architecture, distributed systems, process, scalability