Archive for May, 2007

IMHO, this is terrible for the sport. If one wishes to ban team orders, one must enforce the rule ruthlessly; otherwise a gray area is left open to exploitation. Ron Dennis stated that his reason for issuing orders was to ensure his drivers finished the race. The trouble is that a convenient side effect of this action is that Alonso takes the ten points, which neatly helps to clean up the “issues” around the championship lead. Herein lies the problem: one cannot be certain that Ron Dennis had pure intentions. We must simply take his word for it, and that leaves the ruling open to a number of not-so-nice interpretations:

  1. Lewis Hamilton brings a lot of extra interest to F1 and we wouldn’t wish to poison that with a team orders embarrassment.
  2. The stakeholders in the Monte Carlo race might not want its reputation tarnished.
  3. Nobody wants another Ferrari win; anything else is preferable.

So here we are. Ferrari could be judged to have acted, all those years ago, to give themselves the best chance of having one of their drivers win the championship, and they were punished. McLaren could be judged to have done exactly the same thing and weren’t punished. Just where is the line? Perhaps it would be better to accept that the team orders ban cannot be applied effectively and junk it, so as to avoid this kind of mess?


Intel wades in on the “we can’t do any more magic concurrency for software” issue.

It’s been debated often enough, and it always seems to come down to the fact that the average programmer isn’t able to cope with concurrency and needs higher levels of abstraction to handle it for them. The thing is, we already have such abstractions, e.g. transactions, and we know they can only take us so far. Worse, there are other abstractions out there, such as blackboard systems, which these average programmers either can’t or won’t try to cope with.

So what is to be done? Well, if the last couple of decades are anything to go by, absolutely nothing! Why? It’s the talent limit. How many chip designers are there in the world? How many motherboard designers? How many car designers? How many developers? I’m willing to bet that there are considerably more people in the developer category than in any of the others. This is because we’ve lowered the bar on developer quality to meet the wide demand for bums on seats, and let’s face it, the average enterprise is unlikely to change its policies in this respect. It’s interesting to note that it’s much harder to lower the bar in, for example, chip design: a chip either works or it doesn’t, whilst software is almost expected to be flaky these days.

Until we decide to clean house in software land, we’ll make no progress on these thorny issues, because they only matter to the few. When all is said and done, there is a school of thought that would suggest software is good enough, concurrency, efficiency, quality and green’ness be damned.

Update: An example of how challenging it can be to make concurrency easy can be found here. On the surface we’ve done some good things, and yet we are still open to the simplest of errors (in this case, a missing synchronized clause).
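To make the “missing synchronized clause” class of bug concrete, here is a minimal sketch. The `HitCounter` and `HitCounterDemo` classes are hypothetical and are not taken from the linked example. `count++` is a read-modify-write sequence. Without `synchronized`, two threads can read the same value, both add one and both write it back, which silently loses an update.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: starts several threads that all bump one shared counter.
class HitCounterDemo {
    // Runs 'threads' workers, each calling increment() 'perThread' times,
    // and returns the final count.
    static int run(int threads, int perThread) throws InterruptedException {
        HitCounter counter = new HitCounter();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread w = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    // Swapping this for counter.incrementUnsafe() can make the
                    // total come out below threads * perThread.
                    counter.increment();
                }
            });
            workers.add(w);
            w.start();
        }
        for (Thread w : workers) {
            w.join(); // wait for every worker before reading the result
        }
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 100_000));
    }
}

// Hypothetical counter illustrating the bug and the fix.
class HitCounter {
    private int count = 0;

    // BUG: unsynchronized read-modify-write; concurrent callers can lose updates.
    void incrementUnsafe() {
        count++;
    }

    // FIX: holding this object's monitor makes the increment atomic with
    // respect to other synchronized methods on the same instance.
    synchronized void increment() {
        count++;
    }

    synchronized int get() {
        return count;
    }
}
```

The fix amounts to one keyword, which is precisely why it is so easy to forget. Nothing in the compiler or the type system flags the unsafe version.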


I’m a firm believer in making the minimum number of mistakes, and one of the most effective ways to achieve this is to learn from the history of others, particularly those leading the field. In distributed systems, one of the leaders is Amazon, which, it could be argued, is unlike anything else out there and thus not applicable. However, it’s surely a mistake not to look at Amazon for anything we might find useful, simply because if we are successful we could face the same problems. Here, then, is my perspective on some of the more interesting published data points:

Amazon.com started 10 years ago as a monolithic application, running on a Web server, talking to a database on the back end. This application, dubbed Obidos, evolved to hold all the business logic, all the display logic, and all the functionality that Amazon eventually became famous for: similarities, recommendations, Listmania, reviews, etc. For years the scaling efforts at Amazon were focused on making the back-end databases scale to hold more items, more customers, more orders, and to support multiple international sites. This went on until 2001 when it became clear that the front-end application couldn’t scale anymore…….

……..The many things that you would like to see happening in a good software environment couldn’t be done anymore; there were many complex pieces of software combined into a single system. It couldn’t evolve anymore. The parts that needed to scale independently were tied into sharing resources with other unknown code paths. There was no isolation and, as a result, no clear ownership.

At the same time, there was continued difficulty in the back-end database scaling effort. Databases—and by that time we were using several databases—were a shared resource, which made it very hard to scale-out the overall business. So both the front-end and back-end processes were restricted in their evolution because they were shared by many different teams and processes.

Notice how the monolithic, single-database architecture constrained not just scalability and performance but also the speed with which new features could be added. An enterprise might well argue that Amazon’s level of scale is irrelevant to it, but the ability to add features or change? That sounds like something we should all be interested in. Here’s how they changed things:

We went through a period of serious introspection and concluded that a service-oriented architecture would give us the level of isolation that would allow us to build many software components rapidly and independently. By the way, this was way before service-oriented was a buzzword. For us service orientation means encapsulating the data with the business logic that operates on the data, with the only access through a published service interface. No direct database access is allowed from outside the service, and there’s no data sharing among the services.

Over time, this grew into hundreds of services and a number of application servers that aggregate the information from the services. The application that renders the Amazon.com Web pages is one such application server, but so are the applications that serve the Web-services interface, the customer service application, the seller interface, and the many third-party Web sites that run on our platform.

If you hit the Amazon.com gateway page, the application calls more than 100 services to collect data and construct the page for you….

…….It depends a bit on what kind of page you visit—whether it is a product page, a checkout page, etc. It also depends on how effective caching is for the objects on that page, as well as some other factors.

Many are deeply concerned with avoiding remoteness whenever possible. Notice that Amazon don’t run away screaming; rather, they engineer pragmatically, making use of caching and the like.
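Below is a minimal sketch of the two ideas in the quoted passages. The first is a service that owns its data and exposes it only through a published interface. The second is an aggregating “application server” that caches what it fetches instead of avoiding remote calls altogether. Every name here (`ReviewService`, `InMemoryReviewService`, `ProductPage`) is hypothetical. The in-memory map stands in for whatever store the service really uses, and a plain interface call stands in for the remote call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo: render the same product twice and count service calls.
class ProductPageDemo {
    public static void main(String[] args) {
        InMemoryReviewService service = new InMemoryReviewService();
        ProductPage page = new ProductPage(service);
        System.out.println(page.render("book-42"));
        System.out.println(page.render("book-42")); // answered from the page's cache
        System.out.println(service.calls());
    }
}

// Published interface: the only way other components may reach review data.
interface ReviewService {
    int reviewCount(String productId);
}

// The service encapsulates its data; no outside code touches 'store' directly.
class InMemoryReviewService implements ReviewService {
    private final Map<String, Integer> store = new HashMap<>();
    private int calls = 0;

    InMemoryReviewService() {
        store.put("book-42", 17); // illustrative data only
    }

    @Override
    public synchronized int reviewCount(String productId) {
        calls++;
        return store.getOrDefault(productId, 0);
    }

    synchronized int calls() {
        return calls;
    }
}

// Aggregator that builds a page from service calls and caches results,
// trading some staleness for fewer (potentially remote) calls.
class ProductPage {
    private final ReviewService reviews;
    private final Map<String, Integer> cache = new ConcurrentHashMap<>();

    ProductPage(ReviewService reviews) {
        this.reviews = reviews;
    }

    String render(String productId) {
        int count = cache.computeIfAbsent(productId, reviews::reviewCount);
        return productId + " (" + count + " reviews)";
    }
}
```

The cache has no expiry policy, which is an obvious simplification. A real design would have to decide how stale each kind of page data is allowed to be.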

The first and foremost lesson is a meta-lesson: If applied, strict service orientation is an excellent technique to achieve isolation; you come to a level of ownership and control that was not seen before. A second lesson is probably that by prohibiting direct database access by clients, you can make scaling and reliability improvements to your service state without involving your clients. Other lessons are related to how you access services: If you want to be able to aggregate services easily, if you want to insert advanced infrastructure techniques such as decentralized request routing or distributed request tracking, you need a single unified service-access mechanism.

Another lesson we’ve learned is that it’s not only the technology side that was improved by using services. The development and operational process has greatly benefited from it as well. The services model has been a key enabler in creating teams that can innovate quickly with a strong customer focus. Each service has a team associated with it, and that team is completely responsible for the service—from scoping out the functionality, to architecting it, to building it, and operating it.

There is another lesson here: Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.

There’s a hint as to how Amazon do what they do technically, and notice how the benefits of this approach also touch on process and quality. For example, development teams can make more independent progress than they could if they were all forced to work in lockstep via centralized approaches.
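The quoted lessons also call for “a single unified service-access mechanism”, so that infrastructure such as distributed request tracking can be inserted in one place. The sketch below shows that idea under assumed names (`ServiceBus`, `Request`), which are hypothetical and say nothing about Amazon’s actual mechanism. Because every call goes through `invoke`, the per-request trace is recorded once rather than by every caller. It is a single-threaded illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.Function;

// Hypothetical demo: two services called for one page view, sharing a request id.
class ServiceBusDemo {
    public static void main(String[] args) {
        ServiceBus bus = new ServiceBus();
        bus.register("pricing", r -> "price:" + r.payload);
        bus.register("reviews", r -> "reviews:" + r.payload);
        Request req = new Request(UUID.randomUUID().toString(), "book-42");
        System.out.println(bus.invoke("pricing", req));
        System.out.println(bus.invoke("reviews", req));
        bus.trace().forEach(System.out::println);
    }
}

// Every call carries a request id so one page view can be followed across services.
final class Request {
    final String requestId;
    final String payload;

    Request(String requestId, String payload) {
        this.requestId = requestId;
        this.payload = payload;
    }
}

// Single access path: cross-cutting concerns (tracing here; routing, metrics,
// etc. in principle) live in invoke() instead of in each caller. Not thread-safe.
class ServiceBus {
    private final Map<String, Function<Request, String>> services = new HashMap<>();
    private final List<String> trace = new ArrayList<>();

    void register(String name, Function<Request, String> handler) {
        services.put(name, handler);
    }

    String invoke(String name, Request request) {
        Function<Request, String> handler = services.get(name);
        if (handler == null) {
            throw new IllegalArgumentException("unknown service: " + name);
        }
        trace.add(request.requestId + " -> " + name); // request tracking hook
        return handler.apply(request);
    }

    List<String> trace() {
        return new ArrayList<>(trace); // defensive copy
    }
}
```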

There is quite a bit of development happening in Eclipse, but IntelliJ’s IDEA is also popular for Java development. Some development happens in Visual Studio. Developers of our services can use any tools they see fit to build their services. Developers themselves know best which tools make them most productive and which tools are right for the job. If that means using C++, then so be it. Whatever tools are necessary, we provide them, and then get the hell out of the way of the developers so that they can do their jobs…..

…….I think part of the chaotic nature—the emerging nature—of Amazon’s platform is that there are many tools available, and we try not to impose too many constraints on our engineers. We provide incentives for some things, such as integration with the monitoring system and other infrastructure tools. But for the rest, we allow teams to function as independently as possible. Developers are like artists; they produce their best work if they have the freedom to do so, but they need good tools. As a result of this principle, we have many support tools that are of a self-help nature. The support environment around the service development should never get in the way of the development itself.

Here’s further evidence of how Amazon “decouple” their development teams. The teams are empowered to choose the tools that make them most effective, but at the same time they are expected to follow some guidelines in respect of monitoring and other infrastructure.

We have a very good understanding of how customers interact with the site as is. When we expose new features we measure how they change the customer’s behavior. For example, does it take the customer fewer steps to find what he or she needs? This is hard because you are measuring human behavior; there are some things that customers are delighted about immediately and there are other things that they have to get used to…..

…..We measure whether or not a new feature is successful in terms of customer satisfaction: Do people find things more easily? If we can improve the convenience of shopping on Amazon, then we have booked a major success. If we can help them find things that they might not have thought of before, that is also excellent. Customers tend to vote with their wallets, so if there is a clear negative result, we know what to do with that service.

Measurement is king and we’re not just talking performance stats!

First thing, I think there’s a whole list of good practices that we have in terms of design, in terms of architecture, in terms of building. And one of those points — one of the bullets on that list is that you have to design for failure, meaning that failure of components, whether they’re hardware, software, humans, is a fact of life, and you have to architect as if they are continuously happening to you. And if you do that and you happen to hit a good streak, then you’re fine. But failure in any large-scale system is the normal case, not the exception. So build, for example, for fast recovery. That’s an essential part. You know, stuff fails, comes back up, and you have to make sure that it can be inserted back into the functioning set as soon as possible.

Unreliability of all sorts means the same thing: no service. It’s not just about network or machine failure but also about problems in software at multiple levels. Tackling these issues whilst maintaining service is challenging and typically requires the application to co-operate actively. This runs counter to the established norm, which is to write the application to be naive of such issues and run it on top of a cluster.
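Here is one way an application can “co-operate in an active fashion” with failure rather than relying on a cluster to hide it. It is a minimal sketch combining retries, exponential backoff and a degraded fallback answer. The `Resilient.callWithRetry` helper and its parameters are hypothetical, and the numbers in the demo are arbitrary.

```java
import java.util.function.Supplier;

// Hypothetical demo: a dependency that fails twice, then recovers.
class ResilientDemo {
    public static void main(String[] args) {
        int[] calls = {0};
        String result = Resilient.callWithRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "fresh recommendations";
        }, 5, 10, () -> "cached recommendations");
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}

// The application decides how to react to failure instead of assuming the
// infrastructure underneath will make it disappear.
final class Resilient {
    private Resilient() {}

    static <T> T callWithRetry(Supplier<T> call, int maxAttempts,
                               long initialBackoffMillis, Supplier<T> fallback) {
        long backoff = initialBackoffMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    break; // out of attempts
                }
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    break;
                }
                backoff *= 2; // exponential backoff between attempts
            }
        }
        return fallback.get(); // degrade gracefully rather than fail outright
    }
}
```

Whether a stale or partial fallback is acceptable is an application-level decision. That is the point of the paragraph above: a generic cluster underneath cannot make that call for you.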

I think if you talk to anybody in industry that is responsible for running a very large-scale, geographically distributed, distributed system, such as Amazon is, relying on third parties, on vendors, to actually deliver this availability for you is very dangerous. We’ve seen that there are a number of vendors out there that are exceptional in providing highly available systems in very contained environments. There aren’t that many systems out there of the scale of Amazon. The problems that we have-I won’t say the problems-the challenges that we have in delivering this very highly available system…there are not that many others that have these kinds of challenges. And so third party software is clearly not geared to meeting our challenges there.

Great success brings with it great challenges. If you are a large, successful service provider (be that web, enterprise, mobile etc.), there may come a time when the vendors simply don’t cater for you. You have moved outside their target market and beyond the scope of both their products and their experience. Migrating away from vendors cleanly is going to be a heck of a hairy thing to achieve.


The author of Code Complete and of my favourite project management book, Rapid Development, is blogging over at Construx.

This should be good…..


Dan Pritchett presents some good insight in respect of large scale systems.

I’m left to ponder the fact that whilst Dan is deliberately creating these systems….

I find myself looking at nondeterministic systems a lot lately. Many solutions for the challenges of extreme scale involve relaxing constraints and coping with the ensuing chaos.

….there’s many a techie out there building large systems unaware that they can’t impose order on such a beast. All they can do is hold back the wave of chaos whilst clearing up the odd drop that makes it over the levee (e.g. the consequences of a network outage or operation-ordering problems). They are a long way from appreciating that chaos is a given, let alone actively managing it.

In a further twist of irony, for all that chaos seems to introduce complexity, if you accept its existence and work with it, you often end up with something simpler than was ever possible with “old world” thinking.
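Here is one illustration of “accepting the chaos” around operation ordering. Instead of forcing updates into a single global order, design the state so that order and duplicate delivery don’t matter. The sketch below uses a grow-only set whose replicas merge by set union. Union is commutative, associative and idempotent, so replicas that receive the same updates converge, whatever the order. The `GrowOnlySet` class is a hypothetical illustration of this general technique, not something Dan Pritchett describes. The relaxed constraint is that items can never be removed.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo: two replicas see the same updates in different orders.
class GrowOnlySetDemo {
    public static void main(String[] args) {
        GrowOnlySet a = new GrowOnlySet();
        GrowOnlySet b = new GrowOnlySet();
        a.add("book-42");
        b.add("dvd-7");

        GrowOnlySet r1 = new GrowOnlySet();
        r1.merge(a);
        r1.merge(b);
        r1.merge(a); // duplicate delivery is harmless

        GrowOnlySet r2 = new GrowOnlySet();
        r2.merge(b);
        r2.merge(a); // different arrival order

        System.out.println(r1.view().equals(r2.view())); // replicas converge
    }
}

// Replica state that only ever grows; merging is plain set union.
final class GrowOnlySet {
    private final Set<String> items = new HashSet<>();

    void add(String item) {
        items.add(item);
    }

    // Union is order-insensitive and idempotent, so no coordination is needed.
    void merge(GrowOnlySet other) {
        items.addAll(other.items);
    }

    Set<String> view() {
        return Collections.unmodifiableSet(items);
    }
}
```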
