Archive for the “Systems” Category

Intel wades in on the “we can’t do any more magic concurrency for software” issue.

It’s been debated often enough and always seems to come down to the fact that the average programmer isn’t able to cope with concurrency and needs higher levels of abstraction to do it for them. The thing is, we already have such abstractions, e.g. transactions, and we know they can only take us so far. Worse, there are other abstractions out there, such as blackboard systems, which these average programmers either can’t or won’t try to cope with.

So what is to be done? Well, if the last couple of decades are anything to go by, absolutely nothing! Why? It’s the talent limit. How many chip designers are there in the world? How many motherboard designers? How many car designers? How many developers? I’m willing to bet that there are considerably more people in the developer category than any of the others. This is because we’ve lowered the bar in terms of developer quality to cope with the wide demand for bums on seats and, let’s face it, it’s unlikely that the average enterprise is going to change its policies in this respect. It’s interesting to note that it’s much harder to lower the bar in, for example, chip design: a chip either works or it doesn’t, whilst software is almost expected to be flaky these days.

Until we decide to clean house in software land, we’ll not get progress on these thorny issues because they only matter to the few and, when all is said and done, there is a school of thought that might suggest that software is good enough, concurrency, efficiency, quality and greenness be damned.

Update: An example of how challenging it can be to make concurrency easy can be found here. On the surface we’ve done some good things and yet we are still open to the simplest of errors (in this case a missing synchronized clause).
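To make that kind of slip concrete, here is a minimal Java sketch. It is not the code from the example mentioned above; `UnsafeCounter`, `SafeCounter` and `CounterDemo` are names invented purely for illustration. The only difference between the two counters is the `synchronized` keyword, yet the unsynchronized one can silently lose updates under contention.

```java
// Hypothetical counters for illustration; not the code from the example linked above.
class UnsafeCounter {
    private int count;
    void increment() { count++; }              // unsynchronized read-modify-write
    int get() { return count; }
}

class SafeCounter {
    private int count;
    synchronized void increment() { count++; } // one thread at a time
    synchronized int get() { return count; }
}

class CounterDemo {
    // Starts `threads` threads that each run `action` `perThread` times, then waits for all of them.
    static void hammer(int threads, int perThread, Runnable action) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    action.run();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        UnsafeCounter unsafe = new UnsafeCounter();
        SafeCounter safe = new SafeCounter();
        hammer(4, 100_000, unsafe::increment);
        hammer(4, 100_000, safe::increment);
        System.out.println("unsafe: " + unsafe.get() + " (can be lower than 400000)");
        System.out.println("safe:   " + safe.get());
    }
}
```

Nothing in the compiler or the type system flags the first class as wrong, which is rather the point: the abstraction is there, but it is still down to the developer to remember to use it.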


I’m a firm believer in making the minimum number of mistakes, and one of the most effective ways to achieve this is to learn from the history of others, particularly those leading the field. In distributed systems, one of the leaders is Amazon, who it could be argued are unlike anything else out there and whose lessons are thus not applicable. However, it’s surely a mistake not to look at Amazon and see if there’s anything we might find useful, simply because if we are successful we could face the same problems. Here then is my perspective on some of the more interesting published data-points:

Amazon.com started 10 years ago as a monolithic application, running on a Web server, talking to a database on the back end. This application, dubbed Obidos, evolved to hold all the business logic, all the display logic, and all the functionality that Amazon eventually became famous for: similarities, recommendations, Listmania, reviews, etc. For years the scaling efforts at Amazon were focused on making the back-end databases scale to hold more items, more customers, more orders, and to support multiple international sites. This went on until 2001 when it became clear that the front-end application couldn’t scale anymore…….

……..The many things that you would like to see happening in a good software environment couldn’t be done anymore; there were many complex pieces of software combined into a single system. It couldn’t evolve anymore. The parts that needed to scale independently were tied into sharing resources with other unknown code paths. There was no isolation and, as a result, no clear ownership.

At the same time, there was continued difficulty in the back-end database scaling effort. Databases—and by that time we were using several databases—were a shared resource, which made it very hard to scale-out the overall business. So both the front-end and back-end processes were restricted in their evolution because they were shared by many different teams and processes.

Notice how the monolithic, single-database architecture hasn’t just constrained scalability and performance but also the speed with which new features could be added. An enterprise might certainly argue that Amazon’s level of scale is irrelevant to them, but the ability to add features or change? That sounds like something we should all be interested in. Here’s how they changed things:

We went through a period of serious introspection and concluded that a service-oriented architecture would give us the level of isolation that would allow us to build many software components rapidly and independently. By the way, this was way before service-oriented was a buzzword. For us service orientation means encapsulating the data with the business logic that operates on the data, with the only access through a published service interface. No direct database access is allowed from outside the service, and there’s no data sharing among the services.

Over time, this grew into hundreds of services and a number of application servers that aggregate the information from the services. The application that renders the Amazon.com Web pages is one such application server, but so are the applications that serve the Web-services interface, the customer service application, the seller interface, and the many third-party Web sites that run on our platform.

If you hit the Amazon.com gateway page, the application calls more than 100 services to collect data and construct the page for you….

…….It depends a bit on what kind of page you visit—whether it is a product page, a checkout page, etc. It also depends on how effective caching is for the objects on that page, as well as some other factors.

Many are deeply concerned with avoiding remoteness whenever possible. Notice that Amazon don’t run away screaming; rather, they engineer pragmatically, making use of caching etc.
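As a rough illustration of that pragmatism, here is a small Java sketch of an aggregator that builds a page from several services, each reached through one uniform interface and optionally wrapped in a time-to-live cache. This is not Amazon’s code: `RemoteService`, `CachingService` and `PageAggregator` are hypothetical names, and the "remote" calls are plain lambdas standing in for network hops.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical uniform service interface; fetch() stands in for a remote call.
interface RemoteService {
    String fetch(String key);
}

// Wraps a service with a time-to-live cache so repeat requests skip the remote hop.
class CachingService implements RemoteService {
    private static final class Cached {
        final String value;
        final long expiresAt;
        Cached(String value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final RemoteService delegate;
    private final long ttlMillis;
    private final LongSupplier clock;   // injectable so expiry can be tested without sleeping
    private final Map<String, Cached> cache = new ConcurrentHashMap<>();

    CachingService(RemoteService delegate, long ttlMillis, LongSupplier clock) {
        this.delegate = delegate;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    @Override
    public String fetch(String key) {
        long now = clock.getAsLong();
        Cached hit = cache.get(key);
        if (hit == null || hit.expiresAt <= now) {
            // Miss or stale entry: pay for the remote call and remember the answer.
            hit = new Cached(delegate.fetch(key), now + ttlMillis);
            cache.put(key, hit);
        }
        return hit.value;
    }
}

// Builds a page by asking every registered service for its fragment.
class PageAggregator {
    private final Map<String, RemoteService> services = new LinkedHashMap<>();

    void register(String name, RemoteService service) {
        services.put(name, service);
    }

    Map<String, String> render(String pageKey) {
        Map<String, String> fragments = new LinkedHashMap<>();
        for (Map.Entry<String, RemoteService> entry : services.entrySet()) {
            fragments.put(entry.getKey(), entry.getValue().fetch(pageKey));
        }
        return fragments;
    }
}

class AggregatorDemo {
    public static void main(String[] args) {
        PageAggregator page = new PageAggregator();
        page.register("reviews",
                new CachingService(key -> "reviews for " + key, 60_000L, System::currentTimeMillis));
        page.register("price", key -> "price of " + key);   // uncached: always a fresh call
        System.out.println(page.render("book-1"));
    }
}
```

The aggregator neither knows nor cares which services are cached; that decision sits behind the service interface, which is roughly the isolation the quoted passages describe.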

The first and foremost lesson is a meta-lesson: If applied, strict service orientation is an excellent technique to achieve isolation; you come to a level of ownership and control that was not seen before. A second lesson is probably that by prohibiting direct database access by clients, you can make scaling and reliability improvements to your service state without involving your clients. Other lessons are related to how you access services: If you want to be able to aggregate services easily, if you want to insert advanced infrastructure techniques such as decentralized request routing or distributed request tracking, you need a single unified service-access mechanism.

Another lesson we’ve learned is that it’s not only the technology side that was improved by using services. The development and operational process has greatly benefited from it as well. The services model has been a key enabler in creating teams that can innovate quickly with a strong customer focus. Each service has a team associated with it, and that team is completely responsible for the service—from scoping out the functionality, to architecting it, to building it, and operating it.

There is another lesson here: Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.

There’s a hint here of how Amazon do what they do technically, and notice how the benefits of this approach also touch on process and quality. For example, development teams can make more independent progress than they might were they all forced to work in lockstep via centralized approaches.

There is quite a bit of development happening in Eclipse, but IntelliJ’s IDEA is also popular for Java development. Some development happens in Visual Studio. Developers of our services can use any tools they see fit to build their services. Developers themselves know best which tools make them most productive and which tools are right for the job. If that means using C++, then so be it. Whatever tools are necessary, we provide them, and then get the hell out of the way of the developers so that they can do their jobs…..

…….I think part of the chaotic nature—the emerging nature—of Amazon’s platform is that there are many tools available, and we try not to impose too many constraints on our engineers. We provide incentives for some things, such as integration with the monitoring system and other infrastructure tools. But for the rest, we allow teams to function as independently as possible. Developers are like artists; they produce their best work if they have the freedom to do so, but they need good tools. As a result of this principle, we have many support tools that are of a self-help nature. The support environment around the service development should never get in the way of the development itself.

Here’s further evidence of how Amazon “decouple” their development teams. The teams are empowered to choose the tools for the job that make them most effective but at the same time they are expected to follow some guidelines in respect of monitoring and other infrastructure.

We have a very good understanding of how customers interact with the site as is. When we expose new features we measure how they change the customer’s behavior. For example, does it take the customer fewer steps to find what he or she needs? This is hard because you are measuring human behavior; there are some things that customers are delighted about immediately and there are other things that they have to get used to…..

…..We measure whether or not a new feature is successful in terms of customer satisfaction: Do people find things more easily? If we can improve the convenience of shopping on Amazon, then we have booked a major success. If we can help them find things that they might not have thought of before, that is also excellent. Customers tend to vote with their wallets, so if there is a clear negative result, we know what to do with that service.

Measurement is king and we’re not just talking performance stats!

First thing, I think there’s a whole list of good practices that we have in terms of design, in terms of architecture, in terms of building. And one of those points — one of the bullets on that list is that you have to design for failure, meaning that failure of components, whether they’re hardware, software, humans, is a fact of life, and you have to architect as if they are continuously happening to you. And if you do that and you happen to hit a good streak, then you’re fine. But failure in any large-scale system is the normal case, not the exception. So build, for example, for fast recovery. That’s an essential part. You know, stuff fails, comes back up, and you have to make sure that it can be inserted back into the functioning set as soon as possible.

Unreliability of all sorts means the same thing: no service. It’s not just about network failure or machine failure but problems in software at multiple levels. Tackling these issues whilst maintaining service is challenging and typically requires the application to co-operate in an active fashion. This is counter to the established norm, which would be to write the application to be naive of issues and run it on top of a cluster.
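One small way an application can co-operate is to treat a failed call as routine: retry it a bounded number of times and then degrade to something cheaper rather than fail the whole request. The Java sketch below is only an illustration of that idea under my own assumptions; `Resilient.callWithFallback` is a made-up helper, not anything Amazon has published, and a real system would add back-off and would report each failure to monitoring.

```java
import java.util.function.Supplier;

class Resilient {
    // Tries `primary` up to `attempts` times; if every attempt throws, returns the fallback's value.
    static <T> T callWithFallback(Supplier<T> primary, int attempts, Supplier<T> fallback) {
        for (int i = 0; i < attempts; i++) {
            try {
                return primary.get();
            } catch (RuntimeException e) {
                // Failure is the normal case: swallow it here and try again.
            }
        }
        return fallback.get();
    }
}

class ResilientDemo {
    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical service that fails twice before recovering.
        Supplier<String> flaky = () -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("service unavailable");
            }
            return "personalised recommendations";
        };
        // Hypothetical service that never recovers.
        Supplier<String> down = () -> {
            throw new IllegalStateException("service down");
        };
        System.out.println(Resilient.callWithFallback(flaky, 3, () -> "bestsellers"));
        System.out.println(Resilient.callWithFallback(down, 3, () -> "bestsellers"));
    }
}
```

The page still renders in both cases; in the second it simply shows a less interesting fragment, which is usually preferable to no service at all.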

I think if you talk to anybody in industry that is responsible for running a very large-scale, geographically distributed, distributed system, such as Amazon is, relying on third parties, on vendors, to actually deliver this availability for you is very dangerous. We’ve seen that there are a number of vendors out there that are exceptional in providing highly available systems in very contained environments. There aren’t that many systems out there of the scale of Amazon. The problems that we have-I won’t say the problems-the challenges that we have in delivering this very highly available system…there are not that many others that have these kinds of challenges. And so third party software is clearly not geared to meeting our challenges there.

Great success brings with it great challenge. If you are a large, successful service-provider (be that web, enterprise, mobile etc.) there may come a time when the vendors simply don’t cater for you. You have moved outside of their target market and the scope of both their product and their experience. Migrating away from vendors cleanly is going to be a heck of a hairy thing to achieve.



Dare has this to say about API versioning. His basic approach can be summarized as: don’t change anything at all, and separate your URL space using some version moniker. This is a good piece of advice, but I’m more wary of other aspects of the posting, such as the suggestion that we should support all versions over time to maintain backward compatibility.

I’m tempted to suggest that this is a classic Microsoft (and enterprise) mindset: pile up a stack of legacy and backward compatibility requirements, which can potentially lead to a mess of complexity and brittleness underneath the APIs. I feel the reasoning around client migration also needs re-examination. This is because Live Messenger doesn’t live in a browser. It is a native desktop binary application and therefore has completely different characteristics from a genuine web application in terms of dispersal, upgrading and updates.

Consider that for many a browser-based application, upgrades are but a reload away. Further, many a web service that is mashed up as part of another offering will ultimately present to a user as a browser-based UI, i.e. we don’t need all our desktops to upgrade; we need to upgrade the integration point in the backend services of the mashups that use the web service. That’s at least potentially considerably fewer entities that need to upgrade than is normal in the desktop world.

Dare’s posting also seems to have an underlying current of belief that one can define an API perfectly in the first place. We have known for a long time that such a feat of prediction is often very difficult, though it can be made more manageable via functional partitioning.

Limited lifetime is one of the best antidotes we have to the complexity of legacy and the brittleness of code. We resist it far too often, leading to the mess we have in many an enterprise:

  1. Instability
  2. Slower Feature Delivery
  3. Difficult maintenance

The web, the human body and many real-world systems (cars for example) exploit the concept of limited lifetime successfully. Perhaps this philosophy should be more widely applied in software?
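For what it’s worth, combining a version moniker in the URL with a limited lifetime needn’t be complicated. The Java sketch below is my own illustration rather than anything from Dare’s posting: `VersionedApi`, the `/v1/...` path layout and the retirement dates are all assumptions. Each published version carries the date on which it stops being served, after which callers get a "410" (the HTTP status for a resource that is gone) instead of indefinite backward compatibility.

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical router: one handler per version moniker, each with a retirement date.
class VersionedApi {
    private static final class Version {
        final Function<String, String> handler;
        final LocalDate retiresOn;
        Version(Function<String, String> handler, LocalDate retiresOn) {
            this.handler = handler;
            this.retiresOn = retiresOn;
        }
    }

    private final Map<String, Version> versions = new TreeMap<>();

    void publish(String moniker, LocalDate retiresOn, Function<String, String> handler) {
        versions.put(moniker, new Version(handler, retiresOn));
    }

    // Routes paths such as "/v1/contacts" and returns an HTTP-like status plus body.
    String handle(String path, LocalDate today) {
        String[] parts = path.split("/", 3);            // "", "v1", "contacts"
        if (parts.length < 3) {
            return "400 malformed path";
        }
        Version version = versions.get(parts[1]);
        if (version == null) {
            return "404 unknown version";
        }
        if (!today.isBefore(version.retiresOn)) {
            return "410 version retired";               // limited lifetime enforced here
        }
        return "200 " + version.handler.apply(parts[2]);
    }
}

class VersionedApiDemo {
    public static void main(String[] args) {
        VersionedApi api = new VersionedApi();
        api.publish("v1", LocalDate.of(2009, 1, 1), resource -> "v1:" + resource);
        api.publish("v2", LocalDate.of(2010, 1, 1), resource -> "v2:" + resource);
        LocalDate today = LocalDate.of(2009, 6, 1);
        System.out.println(api.handle("/v1/contacts", today));
        System.out.println(api.handle("/v2/contacts", today));
    }
}
```

The old version’s code can be deleted once its date passes, so the legacy pile never gets a chance to build up underneath the API.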



Push versus pull is a constant debate in techie land. Interestingly, a lot of the web is adopting pull techniques. I wonder if some of that is leaking into these other areas or whether it’s just a general trend away from centralized solutions which we know don’t scale.

In any case, this is an interesting read and recommended by none other than Werner Vogels.

Scary reading for the control freaks and another example of decentralized control/intelligence. Of course, a few minutes’ thought makes it clear that pull is not good for everything.



How often do we admit something’s wrong and take corrective action? How often do we admit something’s wrong before it gets well out of hand?

There are two basic things we need before we can admit something’s wrong and address the problem:

  1. Be strong enough to admit there’s something wrong
  2. Be aware that something is wrong

The first item is about having sufficient strength of character to admit our mistakes, learn and move on. Inhibitors for this include ego or pride, fear and external environmental factors (e.g. a violent spouse).

The second item is about recognizing the indicators of a problem. For us humans, indicators can be anger, frustration, lack of drive, fatigue, health problems or a constant need to distract oneself via activity. In terms of computer systems, indicators might be Java exceptions under odd circumstances, strange lockups, weird performance issues etc.

The art, whether in the human or systems world, is detecting these problems early on and reacting to them before they get out of control. In the systems world, so many times the first we know of a problem is when we see an exception trace or the pager beeps at 2am, but is this really necessary?

Could we not have been aware of the problem earlier? We put a lot of effort into testing but considerably less effort into endowing our systems with monitoring and feedback mechanisms that can provide us with useful statistics. These statistics are the things that can give us prior warning of a problem and come in many forms:

  1. Too many tasks
  2. Too many threads
  3. Queues that are too long
  4. Time per task increasing

Frequently we build logging into our software, but this is really a postmortem facility, something that is only useful after the problem has occurred. Our OS’en, hardware and software platforms often have monitoring infrastructure built in. Why is it that we so often choose to build our applications without similar facilities?
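Building such facilities in can be fairly cheap. As a minimal sketch, assuming a Java application whose work runs on a thread pool, the class below extends the standard `ThreadPoolExecutor` and uses its `beforeExecute`/`afterExecute` hooks to expose the statistics listed above: queue length, busy threads, tasks finished and mean time per task. `MonitoredPool`, `drain` and the `overloaded` thresholds are names and choices of my own for illustration; a real system would publish these figures to whatever monitoring infrastructure it already has.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical pool that records early-warning statistics as it works.
class MonitoredPool extends ThreadPoolExecutor {
    private final ThreadLocal<Long> startNanos = new ThreadLocal<>();
    private final AtomicLong totalNanos = new AtomicLong();
    private final AtomicLong finished = new AtomicLong();

    MonitoredPool(int threads) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    @Override
    protected void beforeExecute(Thread thread, Runnable task) {
        super.beforeExecute(thread, task);
        startNanos.set(System.nanoTime());      // runs on the worker thread, just before the task
    }

    @Override
    protected void afterExecute(Runnable task, Throwable failure) {
        totalNanos.addAndGet(System.nanoTime() - startNanos.get());
        finished.incrementAndGet();
        super.afterExecute(task, failure);
    }

    int queuedTasks() { return getQueue().size(); }     // queues that are too long
    int busyThreads() { return getActiveCount(); }      // too many threads in use
    long finishedTasks() { return finished.get(); }

    double meanTaskMillis() {                           // time per task increasing
        long count = finished.get();
        return count == 0 ? 0.0 : totalNanos.get() / 1_000_000.0 / count;
    }

    // Early warning: true when the backlog or the mean task time exceeds the given thresholds.
    boolean overloaded(int maxQueued, double maxMeanMillis) {
        return queuedTasks() > maxQueued || meanTaskMillis() > maxMeanMillis;
    }

    // Stops accepting work and waits for in-flight tasks; returns false on timeout or interrupt.
    boolean drain(long seconds) {
        shutdown();
        try {
            return awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

class MonitoredPoolDemo {
    public static void main(String[] args) {
        MonitoredPool pool = new MonitoredPool(2);
        for (int i = 0; i < 10; i++) {
            pool.execute(() -> Math.sqrt(42.0));
        }
        System.out.println("queued now: " + pool.queuedTasks() + ", busy: " + pool.busyThreads());
        pool.drain(5);
        System.out.println("finished: " + pool.finishedTasks() + ", mean ms: " + pool.meanTaskMillis());
    }
}
```

Polling `overloaded` from a timer, or wiring the same numbers into an alert, gives a warning while the queue is still growing rather than an exception trace after the fact.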

