Much is being made of a comment from Subodh Bapat, especially in conjunction with further words from Greg Papadopoulos.

It’s believable that many a company will choose to host in a so-called “megacentre”, but that doesn’t have to mean disaster come the day one of these fails. One can only get so much power and so much cooling into one place. Then there are latency challenges: if you’re hosted in the wrong place, your customers will be displeased with the performance of your system. Which is a long-winded way of saying that, whilst one might expect to see consolidation of cloud providers, they’ll still need an awful lot of data-centres to hold all the kit required and provide the appropriate speed-of-light tradeoffs for those they host.

What about resilience? We know that to solve a useful class of problem (Byzantine failure) one requires n > 3f, that is at least 3f + 1 nodes, where f is the number of faulty nodes one wishes to tolerate and n is the total number of nodes. If we lower our sights a little, the minimum needed to handle a data-centre failure is an active-passive approach with remote replication. Some companies however are moving to active-active models to solve problems of data-centre outage in recognition of the fact that simpler approaches work but mean significant downtime whilst the DR (disaster recovery) site is brought online.
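To make the n > 3f bound concrete, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the code only evaluates the inequality; it does not implement any agreement protocol.

```python
def min_nodes_for_byzantine(f: int) -> int:
    """Smallest n that satisfies n > 3f (hypothetical helper name)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_byzantine_faults(n: int) -> int:
    """Largest f for which n > 3f still holds (hypothetical helper name)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


if __name__ == "__main__":
    for faults in range(4):
        print(f"tolerating {faults} faulty node(s) needs at least "
              f"{min_nodes_for_byzantine(faults)} nodes")
```

Read the other way round, three nodes tolerate no Byzantine faults at all under this bound, and it takes a fourth to survive a single one.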

Why, if there are techniques available that address these nastier classes of failure, are we losing so many “big” sites when we lose data-centres? Because most software houses (enterprise, web or otherwise) assume that failure can be prevented using backup network providers, clusters, replicated disk networks and so on, i.e. hardware-based approaches that allow our software writers to pretend that nothing ever breaks, leaving them to just write the important business logic.

To allow for data-centre failure, the clouds of the future will require us to make considerably fewer assumptions in our software: network addresses might change, storage can become unavailable, processes might move and weaker consistency models must be exploited. One such cloud has already arrived in the form of Amazon, and it’s notable that many developers are struggling with the new model it offers (they can’t, for example, find a suitable traditional database solution).
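As a rough illustration of what "fewer assumptions" can look like in code, the sketch below treats an unreachable replica as an ordinary event and looks addresses up afresh on every attempt rather than caching them. Everything here is hypothetical: `call_with_failover`, `resolve`, `operation` and `AllReplicasFailed` are names invented for the example, not part of any particular cloud's API.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request (hypothetical type)."""


def call_with_failover(
    resolve: Callable[[], Iterable[str]],
    operation: Callable[[str], T],
    attempts: int = 3,
) -> T:
    """Try the operation against each currently known replica in turn."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        # Resolve again on every attempt: addresses are assumed to change.
        for address in resolve():
            try:
                return operation(address)
            except (ConnectionError, TimeoutError) as exc:
                # An unreachable replica is expected, so move to the next one.
                last_error = exc
    raise AllReplicasFailed(
        f"no replica answered after {attempts} attempt(s)"
    ) from last_error
```

The caller supplies both the lookup and the operation, so the same loop works whether the replicas sit in one data-centre or several. It says nothing about consistency between replicas, which is the harder part of the problem.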

The challenges of the cloud are not in data-centre failure or consolidation of hosting solutions but in our own ability to write software that runs in these environments.

2 Responses to “Dark Skies”
  1. Bob Warfield says:

    When I look at the examples of datacenter failures bringing down many sites that you mention (Rackspace in Texas and 365 Main in San Francisco), two thoughts come to mind.

    First, take out any major datacenter and probably there is a list of famous sites that are affected that will make the news. In other words, it sounds major but may not be that big a deal.

    Second, the sites that got taken out were physically close to their datacenters. This implies old-school thinking which is that you want to be able to touch the hardware. Reality is you want to design so you’d never touch the hardware. You’d also want redundancy in multiple locations to avoid this sort of thing. Cloud computing done right offers both.

    Best,

    BW

  2. Dan Creswell says:

    Hi Bob,

    “First, take out any major datacenter and probably there is a list of famous sites that are affected that will make the news. In other words, it sounds major but may not be that big a deal.”

    I agree with the analysis. I believe I’m driving at a slightly different point, which is that for all these sites are “famous” and “big”, they look quite weak DR-wise, all things considered. Perhaps they aren’t as mature as most think they are?

    “You’d also want redundancy in multiple locations to avoid this sort of thing.”

    Exactly – as I said in the posting:

    “Some companies however are moving to active-active models to solve problems of data-centre outage in recognition of the fact that simpler approaches work but mean significant downtime whilst the DR (disaster recovery) site is brought online.”

    Happy Christmas,

    Dan