Archive for May 27th, 2007

I’m a firm believer in making the minimum number of mistakes, and one of the most effective ways to achieve this is to learn from the history of others, particularly those leading the field. In distributed systems, one of the leaders is Amazon. It could be argued that Amazon is unlike anything else out there and that its experience is therefore not applicable to the rest of us. However, it’s surely a mistake not to look at Amazon and see if there’s anything we might find useful, simply because, if we are successful, we could face the same problems. Here, then, is my perspective on some of the more interesting published data points:

Amazon.com started 10 years ago as a monolithic application, running on a Web server, talking to a database on the back end. This application, dubbed Obidos, evolved to hold all the business logic, all the display logic, and all the functionality that Amazon eventually became famous for: similarities, recommendations, Listmania, reviews, etc. For years the scaling efforts at Amazon were focused on making the back-end databases scale to hold more items, more customers, more orders, and to support multiple international sites. This went on until 2001 when it became clear that the front-end application couldn’t scale anymore…….

……..The many things that you would like to see happening in a good software environment couldn’t be done anymore; there were many complex pieces of software combined into a single system. It couldn’t evolve anymore. The parts that needed to scale independently were tied into sharing resources with other unknown code paths. There was no isolation and, as a result, no clear ownership.

At the same time, there was continued difficulty in the back-end database scaling effort. Databases—and by that time we were using several databases—were a shared resource, which made it very hard to scale-out the overall business. So both the front-end and back-end processes were restricted in their evolution because they were shared by many different teams and processes.

Notice how the monolithic, single-database architecture constrained not just scalability and performance but also the speed with which new features could be added. An enterprise might certainly argue that Amazon’s level of scale is irrelevant to it, but the ability to add features or make changes? That sounds like something we should all be interested in. Here’s how they changed things:

We went through a period of serious introspection and concluded that a service-oriented architecture would give us the level of isolation that would allow us to build many software components rapidly and independently. By the way, this was way before service-oriented was a buzzword. For us service orientation means encapsulating the data with the business logic that operates on the data, with the only access through a published service interface. No direct database access is allowed from outside the service, and there’s no data sharing among the services.

Over time, this grew into hundreds of services and a number of application servers that aggregate the information from the services. The application that renders the Amazon.com Web pages is one such application server, but so are the applications that serve the Web-services interface, the customer service application, the seller interface, and the many third-party Web sites that run on our platform.

If you hit the Amazon.com gateway page, the application calls more than 100 services to collect data and construct the page for you….

…….It depends a bit on what kind of page you visit—whether it is a product page, a checkout page, etc. It also depends on how effective caching is for the objects on that page, as well as some other factors.

Many are deeply concerned with avoiding remoteness whenever possible. Notice that Amazon don’t run away screaming; rather, they engineer pragmatically, including the use of caching.
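To make the caching point concrete, here is a minimal sketch of a page builder that fans out to several services in parallel and puts a small time-based cache in front of each call. Everything in it is my own illustration rather than anything Amazon has published: `TTLCache`, `build_page`, the two stand-in fetch functions and the 60-second lifetime are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class TTLCache:
    """Tiny time-based cache; an entry is reused until it is ttl seconds old."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # key -> (time stored, value)

    def get_or_call(self, key, fetch):
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        # Miss or expired: make the remote call and remember the answer.
        # No locking here: two concurrent misses on one key would both fetch.
        value = fetch()
        self._entries[key] = (now, value)
        return value


def build_page(services, cache, pool):
    """Call every service in parallel, via the cache, and merge the results."""
    futures = {
        name: pool.submit(cache.get_or_call, name, fetch)
        for name, fetch in services.items()
    }
    return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for remote services; each real call is counted.
calls = {"reviews": 0, "recommendations": 0}


def fetch_reviews():
    calls["reviews"] += 1
    return ["great book"]


def fetch_recommendations():
    calls["recommendations"] += 1
    return ["another book"]


services = {"reviews": fetch_reviews, "recommendations": fetch_recommendations}
cache = TTLCache(ttl=60)

with ThreadPoolExecutor(max_workers=8) as pool:
    first = build_page(services, cache, pool)
    second = build_page(services, cache, pool)  # answered from the cache
```

The second page build makes no remote calls at all, which is the whole point: remoteness is accepted, and its cost is then engineered down rather than avoided.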

The first and foremost lesson is a meta-lesson: If applied, strict service orientation is an excellent technique to achieve isolation; you come to a level of ownership and control that was not seen before. A second lesson is probably that by prohibiting direct database access by clients, you can make scaling and reliability improvements to your service state without involving your clients. Other lessons are related to how you access services: If you want to be able to aggregate services easily, if you want to insert advanced infrastructure techniques such as decentralized request routing or distributed request tracking, you need a single unified service-access mechanism.

Another lesson we’ve learned is that it’s not only the technology side that was improved by using services. The development and operational process has greatly benefited from it as well. The services model has been a key enabler in creating teams that can innovate quickly with a strong customer focus. Each service has a team associated with it, and that team is completely responsible for the service—from scoping out the functionality, to architecting it, to building it, and operating it.

There is another lesson here: Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.

There’s a hint here as to how Amazon do what they do technically. Notice, too, how the benefits of this approach extend to process and quality. For example, development teams can make more independent progress than they could were they all forced to work in lockstep via centralized approaches.
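Two of the lessons above, no direct access to a service’s data and a single unified access mechanism, can be sketched in a few lines. This is only an illustration under my own assumptions: `ReviewService`, `ServiceGateway` and the request-id scheme are hypothetical names, and an in-process dictionary stands in for whatever storage and transport a real service would use.

```python
import uuid


class ReviewService:
    """Owns its data; callers only ever see the published methods."""

    def __init__(self):
        self._reviews = {}  # private state; could be swapped for a database

    def add_review(self, item_id, text):
        self._reviews.setdefault(item_id, []).append(text)

    def list_reviews(self, item_id):
        # Return a copy so callers cannot reach into the service's state.
        return list(self._reviews.get(item_id, []))


class ServiceGateway:
    """Single access path: every call is routed and recorded the same way."""

    def __init__(self):
        self._services = {}
        self.trace = []  # (request_id, service, operation) records

    def register(self, name, service):
        self._services[name] = service

    def call(self, name, operation, *args, request_id=None):
        if operation.startswith("_"):
            raise AttributeError("not a published operation: " + operation)
        request_id = request_id or uuid.uuid4().hex
        self.trace.append((request_id, name, operation))
        return getattr(self._services[name], operation)(*args)


gateway = ServiceGateway()
gateway.register("reviews", ReviewService())
gateway.call("reviews", "add_review", "book-1", "great book", request_id="req-1")
reviews = gateway.call("reviews", "list_reviews", "book-1", request_id="req-1")
```

Because every call passes through `call`, request tracking comes for free, and because clients never touch `_reviews`, the team that owns the service can change how the data is stored without involving anyone else.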

There is quite a bit of development happening in Eclipse, but IntelliJ’s IDEA is also popular for Java development. Some development happens in Visual Studio. Developers of our services can use any tools they see fit to build their services. Developers themselves know best which tools make them most productive and which tools are right for the job. If that means using C++, then so be it. Whatever tools are necessary, we provide them, and then get the hell out of the way of the developers so that they can do their jobs…..

…….I think part of the chaotic nature—the emerging nature—of Amazon’s platform is that there are many tools available, and we try not to impose too many constraints on our engineers. We provide incentives for some things, such as integration with the monitoring system and other infrastructure tools. But for the rest, we allow teams to function as independently as possible. Developers are like artists; they produce their best work if they have the freedom to do so, but they need good tools. As a result of this principle, we have many support tools that are of a self-help nature. The support environment around the service development should never get in the way of the development itself.

Here’s further evidence of how Amazon “decouple” their development teams. The teams are empowered to choose the tools that make them most effective for the job, but at the same time they are expected to follow some guidelines on monitoring and other infrastructure.

We have a very good understanding of how customers interact with the site as is. When we expose new features we measure how they change the customer’s behavior. For example, does it take the customer fewer steps to find what he or she needs? This is hard because you are measuring human behavior; there are some things that customers are delighted about immediately and there are other things that they have to get used to…..

…..We measure whether or not a new feature is successful in terms of customer satisfaction: Do people find things more easily? If we can improve the convenience of shopping on Amazon, then we have booked a major success. If we can help them find things that they might not have thought of before, that is also excellent. Customers tend to vote with their wallets, so if there is a clear negative result, we know what to do with that service.

Measurement is king, and we’re not just talking performance stats!

First thing, I think there’s a whole list of good practices that we have in terms of design, in terms of architecture, in terms of building. And one of those points — one of the bullets on that list is that you have to design for failure, meaning that failure of components, whether they’re hardware, software, humans, is a fact of life, and you have to architect as if they are continuously happening to you. And if you do that and you happen to hit a good streak, then you’re fine. But failure in any large-scale system is the normal case, not the exception. So build, for example, for fast recovery. That’s an essential part. You know, stuff fails, comes back up, and you have to make sure that it can be inserted back into the functioning set as soon as possible.

Unreliability of all sorts means the same thing: no service. It’s not just about network failure or machine failure but also about problems in software at multiple levels. Tackling these issues whilst maintaining service is challenging and typically requires the application to co-operate actively. This runs counter to the established norm, which is to write the application to be unaware of such issues and run it on top of a cluster.
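As a rough sketch of what “co-operating actively” might look like, the code below keeps track of a functioning set of replicas, routes around one that fails, and reinserts it once it passes a health check. It is an illustration under my own assumptions: `ReplicaPool`, `probe`, the replica names and the use of `ConnectionError` to signal failure are all hypothetical.

```python
class ReplicaPool:
    """Tracks which replicas are currently in the functioning set."""

    def __init__(self, replicas):
        self.replicas = dict(replicas)     # name -> callable taking a request
        self.healthy = set(self.replicas)  # the functioning set

    def call(self, request):
        # Try healthy replicas in a stable order; drop any that fail.
        for name in sorted(self.healthy):
            try:
                return self.replicas[name](request)
            except ConnectionError:
                self.healthy.discard(name)
        raise RuntimeError("no healthy replica available")

    def probe(self, health_check):
        """Reinsert any failed replica that passes its health check."""
        for name in sorted(set(self.replicas) - self.healthy):
            if health_check(name):
                self.healthy.add(name)


down = {"a"}  # hypothetical: replica "a" is failing at the moment


def make_replica(name):
    def handle(request):
        if name in down:
            raise ConnectionError(name + " is down")
        return name + " handled " + request
    return handle


pool = ReplicaPool({"a": make_replica("a"), "b": make_replica("b")})
result = pool.call("order-1")       # "a" fails, "b" answers instead
after_failure = set(pool.healthy)   # only "b" is left in the set

down.clear()                        # "a" comes back up
pool.probe(lambda name: name not in down)
```

The caller never sees the failure of "a", and the later `probe` puts it straight back into service, which is the fast-recovery behaviour described in the quote above.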

I think if you talk to anybody in industry that is responsible for running a very large-scale, geographically distributed, distributed system, such as Amazon is, relying on third parties, on vendors, to actually deliver this availability for you is very dangerous. We’ve seen that there are a number of vendors out there that are exceptional in providing highly available systems in very contained environments. There aren’t that many systems out there of the scale of Amazon. The problems that we have-I won’t say the problems-the challenges that we have in delivering this very highly available system…there are not that many others that have these kinds of challenges. And so third party software is clearly not geared to meeting our challenges there.

Great success brings with it great challenges. If you are a large, successful service provider (be that web, enterprise, mobile, etc.), there may come a time when the vendors simply don’t cater for you. You have moved outside their target market and beyond the scope of both their product and their experience. Migrating away from vendors cleanly is going to be a heck of a hairy thing to achieve.


The author of Code Complete and of my favourite project management book, Rapid Development, is blogging over at Construx.

This should be good…..
