Archive for the “Systems” Category


In general, writing software is made hard by our inability to predict the future. We’re always caught between the two stools of building what we need now and what we might need in the future. Writing infrastructure is even harder because it’s an expression of common patterns and challenges of implementation in a specific context such as one company’s systems or services. The broader the context (e.g. all enterprises), the harder it gets to account for all possible permutations of usage.

This brings us to a core trouble with building infrastructure: common patterns and challenges of implementation can only be established by reviewing history. Basically, until some systems have been built, we can’t tell with certainty what the infrastructure should look like.

It’s all too easy to believe that some problem we’re currently faced with is generic and therefore best tackled by:

  • Custom Infrastructure or …
  • A third-party technology stack or …
  • Defining some standard format, code convention etc.

How do we know something is generic? Experience. To be confident something is generic we must have seen it across our systems universe (that is, the collection of systems in our domain of concern). The danger is that multiple teams adopt their own solution to the same problem, but this isn’t necessarily a bad thing. Each team is doing valuable investigation work that will help identify the most promising options for a solution. The trick, of course, is to have sufficient cross-team discussion about architecture so as to avoid excessive proliferation of independent solutions.

    I have a little mantra that I use to remind myself and others of this thorny issue: Experience leads infrastructure.

    Comments 1 Comment »

    …to Steve McConnell for distilling out the key issues around technical debt:

    “The reason most often cited by technical staff for avoiding debt altogether is the challenge of communicating the existence of technical debt to business staff and the challenge of helping business staff remember the implications of the technical debt that has previously been incurred. Everyone agrees that it’s a good idea to incur debt late in a release cycle, but business staff can sometimes resist accounting for the time needed to pay off the debt on the next release cycle. The main issue seems to be that, unlike financial debt, technical debt is much less visible, and so people have an easier time ignoring it.”

    I would quibble with the “easier to ignore” aspect though, as I think for the most part both kinds of debt attract the same behaviour: sticking our heads in the sand and allowing things to get worse.

    Comments Comments Off

    So we’re building an Internet service, not a Web service: decide on an address, allocate a port, write our daemon (yes, I like Unix), job done. One might think so, but there’s a killer deployment issue lurking in the background: the firewall.

    The average corporate security policy really doesn’t like opening ports on external firewalls (and often the same goes for internal ones). Best case we’ll have to wade through masses of red tape, worst case we’ll be given a flat no to our request for an open port. What to do?

    Find an open port, then find a way to tunnel through it. Which is the port most likely to be open? Port 80, and we all know which protocol runs over that. Sure as night follows day, we end up building a solution that tunnels over HTTP.
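To make the tunnelling idea concrete, here is a minimal sketch of wrapping an opaque frame from our own protocol in an HTTP POST request and unwrapping it at the other end. The host name, the `/tunnel` path and both function names are hypothetical, chosen purely for illustration; a real tunnel would also have to cope with proxies, chunking and responses.

```python
def wrap_in_http(frame: bytes, host: str = "service.example.com", path: str = "/tunnel") -> bytes:
    """Wrap an opaque protocol frame in an HTTP/1.1 POST request (hypothetical endpoint)."""
    headers = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/octet-stream\r\n"
        f"Content-Length: {len(frame)}\r\n"
        "\r\n"
    )
    return headers.encode("ascii") + frame


def unwrap_from_http(request: bytes) -> bytes:
    """Recover the original frame from a request produced by wrap_in_http."""
    # Headers end at the first blank line; everything after it is the body.
    head, _, body = request.partition(b"\r\n\r\n")
    length = None
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip().decode("ascii"))
    if length is None:
        raise ValueError("missing Content-Length header")
    return body[:length]
```

Even in this toy form the mismatch shows: our frame becomes a passenger inside a request/response protocol it was never designed around.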

    Do we want to be “of the web”? No.

    Would we choose to tunnel over HTTP? No. It’s not designed for that purpose, and it’ll likely let us know come implementation time.

    Should we re-design our service to fit with ROA? No. Hopefully we did the research and looked at this option before choosing to implement an Internet service as opposed to a Web service.

    So Internet and Web services are different but can end up looking similar enough to lead to confusion. Ain’t reality ugly? Tradeoffs must be made, the results are often less than pretty and there might well be a lot of complaining.

    Comments 1 Comment »

    Everybody’s pitch is about the great things that can be done with their technology, method, architectural approach etc. They’ll sing it from the roof tops, put it up on websites, spam us with email and so on.

    But they rarely (if ever) discuss the flip-side - what their stuff is not good for. Sometimes:

    • selecting an appropriate solution is more about what something is bad at than good at
    • the easiest way to understand something is to know it in terms of what it can’t do

    It’s amazing how often asking someone about the negatives of their stuff results in silence.

    Comments Comments Off

    There’s a lot of noise about transactional memory, so I thought I should do a bit of research. Having read a number of papers, I’m left wondering just what all the racket is for. At least for me, the benefits are unclear.

    Let’s consider this paper which discusses, amongst other things, “transactifying” Berkeley DB (a piece of software I know quite well). It contains a comparison of the original version of Db’s locking system (which used a global lock) and the authors’ modified version. The initial change was to replace all uses of the global lock with a set of transactions. A test was run and the transactional version was worse all round than the original. Oops.

    The root cause boiled down to three issues:

    1. False sharing - a problem which occurs when variables accessed by different threads happen to fall in the same cache line. This was solved with a traditional approach known as padding.
    2. Statistics collection - Db collected a bunch of statistics, keeping them accurate by using the global lock. Rather than address what is surely a common problem, the authors simply turned this feature off.
    3. Object pooling - the pooling associated with lock descriptors and their related objects had to be changed from single linked lists to collections of linked lists to improve the potential for concurrent access.
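The third change is the easiest to picture in code. What follows is a minimal sketch of my own, not the paper’s or Berkeley DB’s code: a free-object pool split into several independently locked free lists, so that threads mostly contend on different locks. The class name, the shard count and the choice of shard by thread identity are all assumptions made for illustration.

```python
import threading


class ShardedPool:
    """A free-object pool split into independently locked shards (illustrative only)."""

    def __init__(self, factory, shards=8):
        self._factory = factory
        # Each shard is a (free list, lock) pair instead of one global list and lock.
        self._shards = [([], threading.Lock()) for _ in range(shards)]

    def _shard(self):
        # Pick a shard from the calling thread's identity so that different
        # threads tend to use different locks.
        return self._shards[threading.get_ident() % len(self._shards)]

    def acquire(self):
        free, lock = self._shard()
        with lock:
            if free:
                return free.pop()
        # Nothing pooled in this shard: build a fresh object outside the lock.
        return self._factory()

    def release(self, obj):
        free, lock = self._shard()
        with lock:
            free.append(obj)
```

Note that nothing here depends on transactions: the same restructuring helps a lock-based design just as much. In CPython the global interpreter lock limits the real gain, so treat this as a picture of the structure rather than a benchmark.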

    The tests were re-run and, beyond a certain level of scale, the transactional memory version was now better. But wait, there’s a problem. Notice that all the work being done to make the transactional version better is broadly the same as the work one would do to make the locking version better. How much of the scalability gain is due to better concurrent structure and how much is down to transactional memory? Is the work we’ve just done any simpler than what we already have to do for conventional thread/lock based systems?

    Another under-discussed factor across many papers in this area is related to the assertion that transactional memory is better than locking due to it being more efficient in the non-conflict case. However many modern lock primitives are now also optimized for this circumstance.

    What about the fact that one must make sure to correctly isolate the atomic actions in a system and bound them appropriately with transactions, just as one currently does with locking? We still have to make sure we do that consistently across the entire system or risk the usual concurrency debugging nightmares.

    Many of the transactional memory systems appear to be based on optimistic approaches - does that make sense for all algorithms and systems we might build? Other transactional systems have evolved to provide both optimistic and pessimistic options (in an attempt to cover all design possibilities) and the programmer must make the appropriate choice for their application. Will transactional memory systems also need to move this way and if so, how will the programmer work with that?
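For readers who haven’t met the distinction, here is a small sketch of the two styles applied to a single shared value. It is only an analogy for the choice a programmer faces, not a model of any particular transactional memory system; the class and method names are made up for the example.

```python
import threading


class VersionedCell:
    """A shared value that can be updated pessimistically or optimistically."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()

    def update_pessimistic(self, fn):
        # Hold the lock for the whole read-compute-write sequence.
        with self._lock:
            self.value = fn(self.value)
            self.version += 1

    def update_optimistic(self, fn):
        # Compute without holding the lock, then commit only if nobody else
        # committed in the meantime; otherwise throw the work away and retry.
        retries = 0
        while True:
            seen_version, seen_value = self.version, self.value
            new_value = fn(seen_value)
            with self._lock:
                if self.version == seen_version:
                    self.value = new_value
                    self.version += 1
                    return retries
            retries += 1
```

The optimistic path is cheap when conflicts are rare and wasteful when they are common, since `fn` may run many times. That is exactly why one style is unlikely to suit every algorithm.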

    Asserting Order

    I’m not going to write off transactional memory, but it seems that even should it turn out to be more scalable than the conventional lock-based approaches we use:

    1. It’s really not much simpler to program with.
    2. It’s no use in the distributed case.

    Meanwhile:

    1. There are other approaches around that do work across both multi-core and multi-box/distributed cases with little change (some would argue the amount of change is zero but I don’t buy that).
    2. Dealing with concurrency is about much more than whether you use locks or transactions.

    Comments 3 Comments »

    The question is, can we see it? If this article is anything to go by the answer would be no.

    SOA is an approach to building systems, it certainly couldn’t be called a style (much to the annoyance of some) but it sure isn’t a technology.

    And this is the problem - so many view everything to do with building systems as being about deploying the right technologies rather than adopting an approach and driving technology selection from there.

    No surprise then that SOA adoption is isolated to small parts of various businesses - that’s the maximum level of use that can be achieved whilst it is treated as a technological shift. Change across the entire business is essential for SOA to get real traction - systems shouldn’t be viewed as necessary evils that cost, rather they should be considered as means for delivering enhanced business value. Processes and culture are the real challenge not hardware and software.

    IT/Systems Development needs to be considered a first class citizen within an organization rather than simply the poor cousin that mops the floors and cleans the toilets. Fewer conversations like "here’s what we want" and more discussion around "here’s what we’re trying to do, how can you help?". Switched on readers may notice an interesting parallel with Web vs Enterprise….

    Web is (amongst other things) about enabling users and agents to do interesting (interconnected/social) stuff more effectively, whilst Enterprise is often treated as little more than automating laborious tasks with strict controls. Loosely coupled versus tightly coupled, granular and cooperative versus monolithic and uncooperative.

    Comments Comments Off

    The debate about all these many-core processors continues to circle the blogosphere. Tim Bray had this to say, which set me thinking (always a bad thing):

    Any time we have a piece of state that needs to be accessed concurrently we hit problems. One can hide this problem using messaging (or similar) but the key aspect in these solutions is that we can partition operations into streams against discrete elements of data (a discrete element could be a group of things) that don’t interfere with each other. Partitioning however can be problematic:

    1. Our data has to be amenable to partitioning via hashing or some other method.
    2. It gets tricky when we need to deal with availability and disaster recovery.
    3. Getting the correct granularity of partitioning can be challenging.
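As a minimal sketch of the first and third points, here is the simplest form of hash partitioning, along with a helper that shows what happens when we change our mind about the number of partitions. The function names and the key format in the usage note are assumptions for illustration.

```python
import hashlib


def partition_for(key: str, partitions: int) -> int:
    """Map a key to a partition number using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def moved_keys(keys, old_count, new_count):
    """Return the keys whose partition changes when the partition count changes."""
    return [
        key for key in keys
        if partition_for(key, old_count) != partition_for(key, new_count)
    ]
```

Calling `moved_keys` on a batch of keys such as `customer-0`, `customer-1` and so on, going from 4 to 5 partitions, typically reports that most of them have to move. Getting the granularity wrong up front is therefore expensive to fix later.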

    Which is interesting because whilst we’ve eliminated the concurrency issue, we’re now faced with a different one (partitioning) which could be just as hard to cope with and requires just as much thought from a developer and/or architect. Coincidentally, Werner Vogels (Amazon) is going to be talking about an internal data store (HASS) at the Google Scalability Conference and specifically the problems of partitioning and consistent hashing (my original interest with respect to this talk was in the context of the CAP conjecture).
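For anyone unfamiliar with consistent hashing, the following is a generic textbook-style sketch and makes no claim about how Amazon’s store works. Nodes are placed at many points on a ring of hash values and each key belongs to the first node point at or after its own hash, wrapping around at the end. The class name, the node names and the replica count are assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """A consistent hash ring with several virtual points per node."""

    def __init__(self, nodes, replicas=100):
        self._replicas = replicas
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._replicas):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def node_for(self, key):
        # First ring position after the key's hash, wrapping to the start.
        index = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[index]]
```

The useful property is that adding a node only takes keys away from existing nodes and hands them to the new one; keys never shuffle between the old nodes. It doesn’t make availability or granularity any easier, though.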

    Another means of avoiding all these concurrency issues is to push them somewhere else. More often than not this becomes an exercise in creating a supposedly stateless system which in reality simply puts all the state in one place, usually the database. The argument is that this is acceptable because it’s only the likes of databases that should deal with these hard issues.

    The rub with having the database handle it is that the concurrency model it uses will only scale across so many processors (more if you’re read-mostly, less if you’re not) and cope with so many concurrent accesses from the stateless component. Once again, to get our database layer to scale, we’ll need to partition our data into shards across multiple databases (an approach adopted by a number of top-line websites) or find some other way to reduce concurrent load on the database instance.

    The act of partitioning can mean we reach a point where we can no longer expect to have atomic updates because the mechanisms for achieving it (e.g. two-phase commit) stop us scaling. When this happens we must construct complex or at least exotic solutions such as that proposed by Pat Helland.

    Okay we got rid of our concurrency problem and swapped it for a partitioning problem which then turned into something of an exotic problem. Are we any better off? It seems no matter which way we go we end up with some tough problems to solve.

    Perhaps there’s a sweet-spot tradeoff where the combination of a CMT box, with data partitioned across a number of processes and each process containing a simple concurrency model covers most situations. Even if that’s the case it seems developers will have to learn a few new tricks.

    Update: A good comment over on Reddit.

    Comments 2 Comments »

    Intel wades in on the “we can’t do any more magic concurrency for software” issue.

    It’s been debated often enough and always seems to come down to the fact that the average programmer isn’t able to cope with concurrency and needs higher levels of abstraction to do it for them. The thing is, we already have such abstractions (e.g. transactions) and we know they can only take us so far. Worse, there are other abstractions out there, such as blackboard systems, which these average programmers either can’t or won’t try to cope with.

    So what is to be done? Well, if the last couple of decades are anything to go by, absolutely nothing! Why? It’s the talent limit. How many chip designers are there in the world? How many motherboard designers? How many car designers? How many developers? I’m willing to bet that there are considerably more people in the developer category than any of the others. This is because we’ve lowered the bar in terms of developer quality to cope with the wide demand for bums on seats and, let’s face it, it’s unlikely that the average enterprise is going to change its policies in this respect. It’s interesting to note that it’s much harder to lower the bar in, for example, chip design: it either works or it doesn’t, whilst software is almost expected to be flaky these days.

    Until we decide to clean house in software land, we’ll not get progress on these thorny issues because they only matter to the few and, when all is said and done, there is a school of thought that might suggest that software is good enough, concurrency, efficiency, quality and greenness be damned.

    Update: An example of how challenging it can be to make concurrency easy can be found here. On the surface we’ve done some good things and yet we are still open to the simplest of errors (in this case a missing synchronized clause).
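The same class of mistake is easy to reproduce in any language. Below is a rough Python analogue of a missing `synchronized`: a counter with one method that forgets to take the lock and one that remembers. This is my own illustration, not the code from the linked example, and the names are hypothetical.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_unsafe(self):
        # Read-modify-write with no lock: two threads can read the same
        # value, and one of the two updates is then lost.
        current = self.value
        self.value = current + 1

    def increment_safe(self):
        # The equivalent of remembering the synchronized clause.
        with self._lock:
            self.value += 1


def run(method_name, threads=4, per_thread=10_000):
    """Hammer one Counter method from several threads and return the final count."""
    counter = Counter()
    method = getattr(counter, method_name)

    def work():
        for _ in range(per_thread):
            method()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value
```

`run("increment_safe")` always returns the full 40,000 with the defaults. `run("increment_unsafe")` may or may not, depending on timing, which is precisely what makes the bug so easy to ship.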

    Comments 2 Comments »

    I’m a firm believer in making the minimum number of mistakes, and one of the most effective ways to achieve this is to learn from the history of others, particularly those leading the field. In distributed systems, one of the leaders is Amazon who, it could be argued, are unlike anything else out there and thus not applicable. However, it’s surely a mistake not to look at Amazon and see if there’s anything we might find useful, simply because if we are successful we could face the same problems. Here then is my perspective on some of the more interesting published data points:

    Amazon.com started 10 years ago as a monolithic application, running on a Web server, talking to a database on the back end. This application, dubbed Obidos, evolved to hold all the business logic, all the display logic, and all the functionality that Amazon eventually became famous for: similarities, recommendations, Listmania, reviews, etc. For years the scaling efforts at Amazon were focused on making the back-end databases scale to hold more items, more customers, more orders, and to support multiple international sites. This went on until 2001 when it became clear that the front-end application couldn’t scale anymore…….

    ……..The many things that you would like to see happening in a good software environment couldn’t be done anymore; there were many complex pieces of software combined into a single system. It couldn’t evolve anymore. The parts that needed to scale independently were tied into sharing resources with other unknown code paths. There was no isolation and, as a result, no clear ownership.

    At the same time, there was continued difficulty in the back-end database scaling effort. Databases—and by that time we were using several databases—were a shared resource, which made it very hard to scale-out the overall business. So both the front-end and back-end processes were restricted in their evolution because they were shared by many different teams and processes.

    Notice how the monolithic single database architecture hasn’t just constrained scalability and performance but also the speed with which new features could be added. As an enterprise, one might certainly argue that Amazon’s level of scale is irrelevant, but the ability to add features or change? That sounds like something we should all be interested in. Here’s how they changed things:

    We went through a period of serious introspection and concluded that a service-oriented architecture would give us the level of isolation that would allow us to build many software components rapidly and independently. By the way, this was way before service-oriented was a buzzword. For us service orientation means encapsulating the data with the business logic that operates on the data, with the only access through a published service interface. No direct database access is allowed from outside the service, and there’s no data sharing among the services.

    Over time, this grew into hundreds of services and a number of application servers that aggregate the information from the services. The application that renders the Amazon.com Web pages is one such application server, but so are the applications that serve the Web-services interface, the customer service application, the seller interface, and the many third-party Web sites that run on our platform.

    If you hit the Amazon.com gateway page, the application calls more than 100 services to collect data and construct the page for you….

    …….It depends a bit on what kind of page you visit—whether it is a product page, a checkout page, etc. It also depends on how effective caching is for the objects on that page, as well as some other factors.

    Many are deeply concerned with avoiding remoteness whenever possible. Notice that Amazon don’t run away screaming; rather, they engineer pragmatically and make use of caching and the like.

    The first and foremost lesson is a meta-lesson: If applied, strict service orientation is an excellent technique to achieve isolation; you come to a level of ownership and control that was not seen before. A second lesson is probably that by prohibiting direct database access by clients, you can make scaling and reliability improvements to your service state without involving your clients. Other lessons are related to how you access services: If you want to be able to aggregate services easily, if you want to insert advanced infrastructure techniques such as decentralized request routing or distributed request tracking, you need a single unified service-access mechanism.

    Another lesson we’ve learned is that it’s not only the technology side that was improved by using services. The development and operational process has greatly benefited from it as well. The services model has been a key enabler in creating teams that can innovate quickly with a strong customer focus. Each service has a team associated with it, and that team is completely responsible for the service—from scoping out the functionality, to architecting it, to building it, and operating it.

    There is another lesson here: Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.

    There’s a hint here of how Amazon do what they do technically, and notice how the benefits of this approach also touch on process and quality. For example, development teams can make more independent progress than they might were they all forced to work in lockstep via centralized approaches.

    There is quite a bit of development happening in Eclipse, but IntelliJ’s IDEA is also popular for Java development. Some development happens in Visual Studio. Developers of our services can use any tools they see fit to build their services. Developers themselves know best which tools make them most productive and which tools are right for the job. If that means using C++, then so be it. Whatever tools are necessary, we provide them, and then get the hell out of the way of the developers so that they can do their jobs…..

    …….I think part of the chaotic nature—the emerging nature—of Amazon’s platform is that there are many tools available, and we try not to impose too many constraints on our engineers. We provide incentives for some things, such as integration with the monitoring system and other infrastructure tools. But for the rest, we allow teams to function as independently as possible. Developers are like artists; they produce their best work if they have the freedom to do so, but they need good tools. As a result of this principle, we have many support tools that are of a self-help nature. The support environment around the service development should never get in the way of the development itself.

    Here’s further evidence of how Amazon “decouple” their development teams. The teams are empowered to choose the tools for the job that make them most effective but at the same time they are expected to follow some guidelines in respect of monitoring and other infrastructure.

    We have a very good understanding of how customers interact with the site as is. When we expose new features we measure how they change the customer’s behavior. For example, does it take the customer fewer steps to find what he or she needs? This is hard because you are measuring human behavior; there are some things that customers are delighted about immediately and there are other things that they have to get used to…..

    …..We measure whether or not a new feature is successful in terms of customer satisfaction: Do people find things more easily? If we can improve the convenience of shopping on Amazon, then we have booked a major success. If we can help them find things that they might not have thought of before, that is also excellent. Customers tend to vote with their wallets, so if there is a clear negative result, we know what to do with that service.

    Measurement is king and we’re not just talking performance stats!

    First thing, I think there’s a whole list of good practices that we have in terms of design, in terms of architecture, in terms of building. And one of those points — one of the bullets on that list is that you have to design for failure, meaning that failure of components, whether they’re hardware, software, humans, is a fact of life, and you have to architect as if they are continuously happening to you. And if you do that and you happen to hit a good streak, then you’re fine. But failure in any large-scale system is the normal case, not the exception. So build, for example, for fast recovery. That’s an essential part. You know, stuff fails, comes back up, and you have to make sure that it can be inserted back into the functioning set as soon as possible.

    Unreliability of all sorts means the same thing, no service. It’s not just about network failure or machine failure but problems in software at multiple levels. Tackling these issues whilst maintaining service is challenging and typically requires the application to co-operate in an active fashion. This is counter to the established norm which would be to write the application to be naive of issues and run it on top of a cluster.

    I think if you talk to anybody in industry that is responsible for running a very large-scale, geographically distributed, distributed system, such as Amazon is, relying on third parties, on vendors, to actually deliver this availability for you is very dangerous. We’ve seen that there are a number of vendors out there that are exceptional in providing highly available systems in very contained environments. There aren’t that many systems out there of the scale of Amazon. The problems that we have-I won’t say the problems-the challenges that we have in delivering this very highly available system…there are not that many others that have these kinds of challenges. And so third party software is clearly not geared to meeting our challenges there.

    Great success brings with it great challenge. If you are a large successful service-provider (be that web, enterprise, mobile etc) there may come a time where the vendors simply don’t cater for you. You have moved outside of their target market and the scope of both their product and their experience. Migrating away from vendors cleanly is going to be a heck of a hairy thing to achieve.

    Comments 5 Comments »

    Dare has this to say about API versioning. His basic approach can be summarized as don’t change anything at all and separate your URL space using some version moniker. This is a good piece of advice but I’m more wary of other aspects of the posting such as the suggestion that we should support all versions over time to maintain backward compatibility.
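To show what separating the URL space by version looks like, here is a minimal routing sketch. It is not taken from Dare’s post; the handler names, the `v1`/`v2` monikers and the response shapes are hypothetical.

```python
def contacts_v1(resource):
    # The original behaviour, frozen: existing clients keep getting this shape.
    return {"api": "v1", "resource": resource}


def contacts_v2(resource):
    # A changed response shape lives under a new version prefix instead.
    return {"api": "v2", "resource": resource, "links": []}


HANDLERS = {"v1": contacts_v1, "v2": contacts_v2}


def route(path):
    """Dispatch on the leading version segment of a path such as /v1/contacts/42."""
    parts = path.strip("/").split("/", 1)
    version = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    handler = HANDLERS.get(version)
    if handler is None:
        return 404, {"error": f"unknown API version: {version}"}
    return 200, handler(rest)
```

Every entry in `HANDLERS` is code that has to be kept alive, which is where the worry about supporting all versions forever comes from.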

    I’m tempted to suggest that this is a classic Microsoft (and enterprise) mindset: pile up a stack of legacy and backward compatibility requirements which can potentially lead to a mess of complexity and brittleness underneath the APIs. I feel the reasoning around client migration also needs re-examination. This is because Live Messenger doesn’t live in a browser. It is a desktop native binary application and therefore has completely different characteristics from a genuine web application in terms of dispersal, upgrading and updates.

    Consider that for many a browser-based application, upgrades are but a reload away. Further, many a web service that is mashed up as part of another offering will ultimately present to a user as a browser-based UI, i.e. we don’t need all our desktops to upgrade, we need to upgrade the integration point in the backend services of the mashups that use the web service. That’s at least potentially considerably fewer entities that need to upgrade than is normal in the desktop world.

    Dare’s posting also seems to have an underlying current of belief that one can define an API perfectly in the first place. We have known for a long time that such a feat of prediction is often very difficult though it can be made more manageable via functional partitioning.

    Limited lifetime is one of the best antidotes we have to the complexity of legacy and the brittleness of code. We resist it far too often leading to the mess we have in many an enterprise:

    1. Instability
    2. Slower Feature Delivery
    3. Difficult maintenance

    The web, the human body and many real-world systems (cars for example) exploit the concept of limited lifetime successfully. Perhaps this philosophy should be more widely applied in software?

    Comments Comments Off

    Push versus pull is a constant debate in techie land. Interestingly, a lot of the web is adopting pull techniques. I wonder if some of that is leaking into these other areas or whether it’s just a general trend away from centralized solutions which we know don’t scale.

    In any case, this is an interesting read and recommended by none other than Werner Vogels.

    Scary reading for the control freaks and another example of decentralized control/intelligence. Of course, a few minutes’ thought makes it clear that pull is not good for everything.

    Comments 2 Comments »

    How often do we admit something’s wrong and take corrective action? How often do we admit something’s wrong before it gets well out of hand?

    There are two basic things we need before we can admit something’s wrong and address the problem:

    1. Be strong enough to admit there’s something wrong
    2. Be aware that something is wrong

    The first item is about having sufficient strength of character to admit our mistakes, learn and move on. Inhibitors for this include ego or pride, fear and external environmental factors (e.g. a violent spouse).

    The second item is about recognizing the indicators of a problem. For us humans indicators can be anger, frustration, lack of drive, fatigue, health problems or a constant need to distract oneself via activity. In terms of computer systems, indicators might be Java exceptions under odd circumstances, strange lockups, weird performance issues etc.

    The art, whether in the human or systems world, is detecting these problems early on and reacting to them before they get out of control. In the systems world, so many times the first we know of a problem is when we see an exception trace or the pager beeps at 2am, but is this really necessary?

    Could we not have been aware of the problem earlier? We put a lot of effort into testing but considerably less effort into endowing our systems with monitoring and feedback mechanisms that can provide us with useful statistics. These statistics are the things that can give us prior warning of a problem and come in many forms:

    1. Too many tasks
    2. Too many threads
    3. Queues that are too long
    4. Time per task increasing
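As a sketch of how little it takes to get started, here is a tiny monitor that tracks two of the indicators above, queue length and time per task, and reports warnings once they cross a threshold. The class name, the thresholds and the window size are assumptions for illustration; a real system would feed these numbers into whatever monitoring infrastructure is already in place.

```python
from collections import deque


class TaskMonitor:
    """Tracks queue length and recent task durations, and flags early warnings."""

    def __init__(self, max_queue=100, max_avg_seconds=0.5, window=50):
        self.max_queue = max_queue
        self.max_avg_seconds = max_avg_seconds
        self.queue_length = 0
        # Only the most recent `window` durations count towards the average.
        self._durations = deque(maxlen=window)

    def enqueued(self):
        self.queue_length += 1

    def dequeued(self):
        self.queue_length -= 1

    def record_duration(self, seconds):
        self._durations.append(seconds)

    def warnings(self):
        found = []
        if self.queue_length > self.max_queue:
            found.append(f"queue too long: {self.queue_length}")
        if self._durations:
            average = sum(self._durations) / len(self._durations)
            if average > self.max_avg_seconds:
                found.append(f"average task time too high: {average:.3f}s")
        return found
```

The application calls `enqueued`, `dequeued` and `record_duration` as it works, and something periodically polls `warnings()` so that trouble shows up before the pager does.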

    Frequently we build logging into our software, but this is really a postmortem facility, something that is only useful after the problem has occurred. Our OS’en, hardware and software platforms often have monitoring infrastructure built in, so why is it that we so often choose to build our applications without similar facilities?

    Comments 1 Comment »