Archive for the “Architecture” Category

Jim Waldo writes:

My own conclusion is that system design is really a matter of technique, a way of thinking rather than a subject that can be taught in a particular course. It might be possible to build a program that teaches system design by putting students through a series of courses that hone their system design skills as they move through the subject matter of the courses. Such a series of courses would, in effect, be a formalized version of the apprenticeship that is now the way people acquire their system design technique…..

…..Even worse than not being visible to the customer, work done on designing the system is not visible to the management of the company that is developing the system. Even though managers will pay lip service to the teaching of The Mythical Man Month, there is still the worry that engineers who aren’t producing code are not doing anything useful. While there are few companies that explicitly measure productivity in lines-of-code per week, there is still pressure to produce something that can be seen. The notion that design can take weeks or months and that during that time little or no code will be written is hard to sell to managers. Harder still is selling the notion that any code that does get written will be thrown away, which often appears to be regression rather than progress.

In such an environment lip service often extends to technical strategy as well.

Comments Comments Off

From On Designing and Deploying Internet-Scale Services (James Hamilton – Windows Live Services Platform):

We have long believed that 80% of operations issues originate in design and development, so this section on overall service design is the largest and most important. When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there. Throughout the sections that follow, a consensus emerges that firm separation of development, test, and operations isn’t the most effective approach in the services world. The trend we’ve seen when looking across many services is that low-cost administration correlates highly with how closely the development, test, and operations teams work together.

Yep, I’m a firm believer too….

Comments 1 Comment »

I commented in my previous post on the fact that working with message brokers can lead to tensions where we’re forced into letting the broker do more than we’d like.

I suspect that the bigger the list of features provided in a single broker the greater the probability we’ll be forced to comply with (and attempt to work around) undesirable models and behaviours that aren’t core to our requirements. Thus under certain circumstances a specific solution that addresses one core problem whilst asserting a minimum of additional constraints can be more attractive than a jack-of-all-trades solution.

It also occurred to me that brokers might not be the only piece of software to exhibit this characteristic. RPC systems tend to suffer similar problems having a tendency to tightly bind endpoint addressing, transport and argument marshalling. I also wonder if these issues contribute to some of the backlash against the behemoth that is an RDBMS implementation. For example whilst we might like our RDBMS to look after our data, we might not wish to put up with the querying, concurrency, deployment and management models it also asserts even with a myriad of configuration options.

Technorati Tags: , ,

Comments 2 Comments »

Say the word messaging to a subset of developers and for some reason the immediate knee-jerk is to assume that means using some kind of message broker (Tibco, ActiveMQ or whatever). Utter the term “asynchronous communication” and that is typically equated to messaging and thus also implies use of a message broker.

I find this strange because messaging is possible via a myriad of methods including carrier pigeon, pony express, tcp, multicast, http and many other transports. As for supporting asynchronous operations well that’s governed by the API(s) provided by the transport. In fact this is only partially true because a lower-level synchronous API can be wrapped in an additional layer to produce an asynchronous API. Transports and layering are often seen together, for example it allows us to introduce support for reliable delivery if the underlying transport is not robust (often it isn’t).

There’s no denying that all (or most) of these features (asynchronous APIs, messaging, reliable delivery) are provided in various of the message broker implementations but rarely in the form of a set of composable layers. The preferred approach is usually a myriad of configuration options leaving us at the mercy of the vendor to provide and support just the right combination of configuration possibilities to match our design challenge.

The indivisible nature of these brokers hampers us in other ways too. It can be difficult to use the broker purely for messaging without being forced to work with say its models for routing control and security. I’m sure this is appealing to the vendors but it doesn’t seem like such a great deal to me.

Technorati Tags: , ,

Comments 1 Comment »

Bill ponders:

“Sometimes I wonder how a deployed n-tiered scale up monolith can be gradually refactored to a scale out model or run on scale out infrastructure. It’s well documented that companies like Amazon and Ebay have done just that, only how they did it tends to get left out of the slide decks. I suspect it involves thinking quite differently about what ‘good’ application code is….”

I think one reason for the absence of this information in slide decks is that it’s a competitive advantage, a barrier to entry. Mastering scale-out allows one company to handle more load and more customers whilst offering more function than another. Why teach other companies how to compete? For me though the real issue is that the transition is a major effort that defies easy explanation via slideware.

Makeup of a Monolith

The rough form of a monolithic system is a logical (maybe physical) n-tier with simple scale out in the places where it’s easy e.g. a stateless web tier. The remainder is typically a single database containing all data which is fronted by some kind of cache (some systems will also talk to legacy systems via various mechanisms and some will have an extra database or two). Some may view their database replication strategy/cluster to be horizontal scaling but to be the real deal would require explicit partitioning or sharding. All code has access to all of the data in the database and it’s typically not well controlled which leads to what I term the “integration via database” anti-pattern.

In essence to move to a scale out architecture requires us to partition up all the state we have held in the database (admitting there’s incentive to break up the middle tier to avoid it getting overly large and unwieldy). Sounds easy but of course is anything but because there’s just so much beyond “thinking quite differently about what ‘good’ application code is” to learn and change. One might expect that we build things up in horizontal slices, create some new infrastructure, port to the new infrastructure, the usual “upgrade the middleware” type approach but this doesn’t work due to lack of hindsight and the sheer size of the challenge. The antidote to this is to take one little step at a time.

Basic Approach

Instead of the “big scary re-architecture” we take vertical slices through the monolith, separating them out so that we can evolve each with a limited amount to consider. We push these pieces all the way through into deployment so as to expose all our processes and tools to this new type of entity. This is not easy but:

  1. At least it’s being done against a small vertical slice which makes it a more manageable problem.
  2. The transition cannot be managed in one big step, it will be necessary to manage old and new alongside each other so we need to get used to that.

Another option for learning how to work outside of the monolith is to implement some new set of features in this new style and drive that through the delivery chain. Obviously whilst this develops knowledge it’s not breaking down the existing monolith, so let’s move on to consider the basic steps (order is not fixed) for extracting a vertical element:

  1. Identify some reasonable sized function or chunk of data to partition.
  2. Define an interface which will wrap this function and/or data.
  3. Move code and data behind the interface.
  4. Modify other code to access code and data via the interface rather than e.g. direct access to the database.

Step (1) often requires us to examine both data and function. Ideally the interface we introduce in step (2) would look like a remote service. It may for some period of time still be part of the monolith and therefore local. In some cases we can’t make the interface remote initially because the refactor would be too complex so we must later come back and recast the interface as a remote service. What form does this new interface take? Ignoring religious issues it could be WS-*, REST, some other form of remote invocation (sometimes custom but not fine-grained method calls) or messaging.

The arrival of this new interface allows us to introduce behaviours such as asynchronous operation and eventual consistency into a controlled area of our codebase. It’s possible that the data we’ve wrapped behind the interface is still being drawn from the original database so we can also consider moving this data to it’s own separate storage mechanism (which may or may not be another database) and introduce sharding and partitioning.

Challenges

Some developers will fall back on “remoteness dogma” stating that remoteness is a performance inhibitor. Indeed they are right but scale is at least (if not more) important and becomes a key focus when we can’t buy a bigger box to make the monolith perform. For me the real issue at hand is the corruption of the mind that comes from thinking transactionally. In this world we focus on minimising the amount of time a transaction takes for fear of lock contention and blocking threads too long. We become obsessed with consistency, feeling compelled to do all work ahead of the actual transaction. This thinking naturally leads to optimising the execution paths heavily and eschewing remoteness. Key to tackling these issues is socialising use of asynchronous techniques and the fact that consistency is in the eye of the beholder (and determined by how often they can observe the relevant data).

During the recasting of a vertical slice into a standalone element it’s usually necessary to ensure it can cope with existing load (and a bit more). Establishing what the current load looks like can be challenging as monitoring and statistics are often centred around information available from the database, application servers, load-balancers and operating systems i.e. the infrastructure. This data is certainly useful but says little about what’s actually happening in the application code. A good way to address this problem and improve the lot of the operational teams is to introduce a programme of application instrumentation that will deliver appropriate statistics and high-level diagnostics.

Inevitably we will be adopting and/or building some new infrastructure but we must be careful how much we try to acquire and/or custom build in advance of real-world experience (see the hindsight link above). Fortunately there’s little that’s actually required up front other than some mechanisms for service location and statistics gathering so the impact of mistakes is usefully limited. Follow-on candidates might include messaging, deployment and security. Remember also that many vendors are still producing infrastructure suited to big-iron monolithic development and single data-centre environments (which can make resilience in the face of certain kinds of failure difficult).

Static configuration (per-machine or in tools) can be extremely troublesome containing a lot of URL’s, server addresses, machine names and database references. All of these items need changing at each stage from development, through testing, QA, staging and production. The move toward a more distributed approach will only make this worse as it creates a need to copy and tweak more configuration on more machines. It’s important from the early stages to focus on eliminating as much of this as possible. In an ideal world a machine would be configured with at most its designated duty and environment (testing, development, production) obtaining everything else it needs from services in the environment.

Wrap Up

To go horizontal is extremely challenging and cannot be addressed by the typical re-architecting initiatives many companies indulge in. There’s too much to learn and change such that the only option is a slow, step by step, learn as you go transition that gradually chips away at the monolith. For Amazon it seems this transition has taken at least five years and over at eBay it looks like they started the transition sometime around 1999/2000.

Technorati Tags: ,

Comments Comments Off