How big does a website have to get before custom infrastructure becomes necessary? When a website reaches this stage, what infrastructure gets built? Before trying to answer these questions we must have some means of measuring the size of a website. I’ve settled on the number of machines as a reasonable approximation because:
- As a codebase grows it must be split up along functional boundaries, and spread across multiple processes. More code equals more processes and more machines to run them on.
- More customers mean more load, which requires more machines to handle it.
- More data means more storage and more processors to chew through it.
Now let’s see how many machines some of the big players are running and what infrastructure they’re talking about:
TicketMaster have at least 3000 machines and have built Spine to help them manage configuration of their infrastructure.
eBay have built a custom deployment tool (Roller), logging infrastructure, configuration management for their software services, messaging software and more. They’re running around 15000 machines across four geographical locations.
Microsoft have built a custom deployment, configuration and monitoring infrastructure called Autopilot focused on many thousands of machines. In fact we’re talking hundreds of thousands.
Google are dealing in a million or more machines and expending effort on software to handle staged, automatic upgrades. Of course they’ve already built GFS, Chubby etc.
Twitter have moved beyond the half-dozen or so machines they used to have to “a lot of servers” (hundreds?). They are seemingly still hiring operations staff, but have already built a custom queue server.
Facebook have at least 10000 webservers, 800 MemcacheD instances and 1800 MySQL instances. They’ve built a custom configuration-serving infrastructure plus management and monitoring tools. They also contribute to MemcacheD, have built Cassandra and Thrift, and appear to be busy building their own optimized webservers and a replacement for squid.
Amazon have tens of thousands of servers (surely more?) and have constructed Dynamo, S3, EC2, SQS etc.
A few tentative conclusions:
- It would seem that by the time a website has moved into the thousands of boxes it will have had to address configuration and monitoring, which suggests development efforts started before this threshold (perhaps at a couple of hundred boxes?).
- As the machine count moves towards the tens of thousands, automated deployment becomes essential and there’s a need to develop more service-specific infrastructure.
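As a rough illustration, the tentative conclusions above can be written down as a small lookup. This is only a sketch: the function name `infrastructure_needs` and the exact cut-offs (200, 1000 and 10000 machines) are my own assumptions, picked to match "a couple of hundred", "thousands" and "tens of thousands", not measured thresholds.

```python
from typing import List

# Hypothetical cut-offs, loosely read off the tentative conclusions above.
START_CONFIG_MONITORING = 200    # "a couple of hundred boxes" (assumed)
CONFIG_MONITORING_DONE = 1000    # "thousands of boxes" (assumed)
AUTOMATED_DEPLOYMENT = 10000     # "tens of thousands" (assumed)


def infrastructure_needs(machine_count: int) -> List[str]:
    """Return the custom infrastructure a site of this size has likely had to address."""
    needs = []
    if machine_count >= START_CONFIG_MONITORING:
        needs.append("configuration and monitoring (in development)")
    if machine_count >= CONFIG_MONITORING_DONE:
        # By this point the earlier development effort should be in production.
        needs[-1] = "configuration and monitoring (in place)"
    if machine_count >= AUTOMATED_DEPLOYMENT:
        needs.append("automated deployment")
        needs.append("service-specific infrastructure")
    return needs


if __name__ == "__main__":
    for count in (50, 300, 3000, 15000):
        print(count, infrastructure_needs(count))
```

Fed the figures quoted earlier, 3000 machines lands in the "configuration and monitoring" band and 15000 in the "automated deployment" band, which is at least consistent with what TicketMaster and eBay report having built.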

A good set of observations, Dan.
Beyond number of servers, another factor is the rate of infrastructure growth. Bottom line is that investing in a provisioning & monitoring infrastructure early (say going from 50 to 100 nodes) would pay off in a 100 to 500 surge and beyond. My postulation is that the flatter the growth curve, the longer you can avoid building custom infrastructure.
Sure, the argument can be made that growth rates are not a hard science, but one can improve that by monitoring growth trends and intangibles such as word-of-mouth, press & infosphere sentiment. Sometimes, one can just sense the impending detonation; and the sooner the investment is made, the faster the return and the smoother the experience. That’s assuming whatever is built actually works :).
The other factor that must be taken seriously is that the anomalies one experiences at the 10/100/1000/10000/100000/1000000 node levels are different. I don’t think anybody beyond Yahoo/MS/Google has experienced all the circles of hell.