Much is being made of a comment from Subodh Bapat, especially in conjunction with further words from Greg Papadopoulos.
It’s believable that many a company will choose to host in a so-called “megacentre” but that doesn’t have to mean disaster come the day one of these fails. One can only get so much power and so much cooling into one place. Then there are latency challenges: if you’re hosted in the wrong place, your customers will be displeased with the performance of your system. Which is a long-winded way of saying that whilst one might expect to see consolidation of cloud providers, they’ll still need an awful lot of data-centres to hold all the kit required and provide the appropriate speed-of-light tradeoffs for those they host.
What about resilience? We know that to solve a useful class of problem (Byzantine failure) one requires n > 3f nodes, that is at least 3f + 1, where f is the number of failures one wishes to tolerate and n is the total number of nodes. If we lower our sights a little, the minimum needed to handle a data-centre failure is an active-passive approach with remote replication. Some companies, however, are moving to active-active models to solve the problem of data-centre outage, in recognition of the fact that simpler approaches work but mean significant downtime whilst the DR (disaster recovery) site is brought online.
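To make the n > 3f bound concrete, here is a minimal sketch of the arithmetic in Python. The function names are my own for illustration and don't come from any particular library; the code only restates the inequality above.

```python
def min_nodes_for_byzantine(f: int) -> int:
    """Smallest n satisfying n > 3f, i.e. 3f + 1."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_byzantine_faults(n: int) -> int:
    """Largest f satisfying n > 3f for a given number of nodes n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


if __name__ == "__main__":
    for f in range(4):
        print(f"tolerating {f} failure(s) needs at least "
              f"{min_nodes_for_byzantine(f)} nodes")
```

So tolerating even a single Byzantine failure needs four nodes, and three nodes tolerate none, which is part of why most shops settle for the weaker active-passive or active-active arrangements.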
Why, if there are techniques available that address these nastier classes of failure, are we losing so many “big” sites when we lose data-centres? Because most software houses (enterprise, web or otherwise) assume that failure can be prevented using backup network providers, clusters, replicated disk networks and so on: hardware-based approaches that allow our software writers to pretend that nothing ever breaks, leaving them to just write the important business logic.
To allow for data-centre failure, the clouds of the future will require us to make considerably fewer assumptions in our software: network addresses might change, storage can become unavailable, processes might move and weaker consistency models must be exploited. One such cloud has already arrived in the form of Amazon, and it’s notable that many developers are struggling with the new model it offers (they can’t, for example, find a suitable traditional database solution).
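As a rough illustration of what "fewer assumptions" might look like in code, here is a small Python sketch of a call that looks up its target address afresh on every attempt and expects the target to be unavailable some of the time. Everything named here (`Unavailable`, `call_with_reresolve`, the `resolve` and `operation` callables) is hypothetical and invented for the example; it is not Amazon's API or anyone else's, and it says nothing about consistency models.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class Unavailable(Exception):
    """Raised (in this sketch) when an endpoint cannot be reached."""


def call_with_reresolve(
    resolve: Callable[[], str],
    operation: Callable[[str], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(address), re-resolving the address before each try.

    Assumes nothing about the address staying the same between attempts
    and treats unavailability as a normal, retryable event.
    """
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        address = resolve()  # never cache: the service may have moved
        try:
            return operation(address)
        except Unavailable as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise Unavailable(f"gave up after {attempts} attempts") from last_error
```

In use, `resolve` would be whatever lookup the environment provides (a name service, a registry) and `operation` the actual request. The point is only that the caller, rather than the hardware, owns the possibility of failure.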
The challenges of the cloud are not in data-centre failure or consolidation of hosting solutions but in our own ability to write software that runs in these environments.
Technorati Tags: amazon, architecture, availability, distributed systems
