Archive for the “Architecture” Category

A frequent problem I observe when reviewing system designs is that they are built atop one or more libraries, frameworks or products that are poorly suited to the intended task. Fitting the design to these underpinnings warps it in undesirable ways, incurring all sorts of costs:

  • It takes an increasing number of staff just to deploy and run the system.
  • Customers face an increasingly bad experience in terms of interaction, performance and stability.
  • One spends more time refactoring than developing new features, although in many cases developers simply don’t bother with this effort, which accelerates the drop in quality for customers.
  • The level of coupling increases, impacting the integrity of the design and making future change more difficult.

I call this the “design-by-product anti-pattern”. There are a few things that cause it to manifest:

  1. Absence of a real design prior to product/framework/library selection: most of those given the remit for design cannot construct proper abstractions that are adequately divorced from implementation. That is, they do not understand the core entities and operations that exist within the domain they are building a system for. Thus, when products/libraries/frameworks are selected, there is little structure to assist in evaluating their appropriateness.
  2. These products are used because they are on the list of “company approved technologies”. The justification for such a list is that it “reduces cost”, which it might well do if all one accounts for is licenses and product support. Unfortunately, the cost equation is not nearly so simple (see the costs above).
  3. Related to “company approved technologies” is the use of hot or favourite technologies preferred by the development team, regardless of their appropriateness for any particular design situation.

Any product/library/framework is created by an individual who has their own view of how customers design their systems, and who builds APIs accordingly. In the worst cases, these individuals design APIs in total isolation, focused on making them theoretically perfect (for some definition of perfect). If we as customers create designs that do not align well with these individuals’ views, the result will be costly as we force the two designs together. The cost is magnified for each additional product/library/framework whose design conflicts with ours.

Loose coupling, as the result of a proper definition of roles and responsibilities, is the only tool we have to allow for future design evolution. Poor selection of products/libraries/frameworks erodes this property and should be avoided; otherwise a death march awaits.
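One common way to stop a product’s view of the world leaking into a design is to define the role first and confine the product behind an adapter (the “ports and adapters” idea; this is my choice of technique, not something named in the post). The minimal Python sketch below assumes a hypothetical CRM product: `VendorCrmClient`, its `fetch_record` method and the `FIRST_NM`/`LAST_NM` field names are invented stand-ins, not a real API.

```python
from typing import Protocol


class CustomerStore(Protocol):
    """Role defined by our own design: look up a customer's display name."""

    def display_name(self, customer_id: str) -> str: ...


class VendorCrmClient:
    """Hypothetical third-party product with its own table/field vocabulary."""

    def fetch_record(self, table: str, key: str) -> dict:
        records = {("CUSTOMERS", "c1"): {"FIRST_NM": "Ada", "LAST_NM": "Lovelace"}}
        return records[(table, key)]


class CrmCustomerStore:
    """Adapter: the only class that knows the vendor's API and field names."""

    def __init__(self, client: VendorCrmClient) -> None:
        self._client = client

    def display_name(self, customer_id: str) -> str:
        record = self._client.fetch_record("CUSTOMERS", customer_id)
        return f"{record['FIRST_NM']} {record['LAST_NM']}"


def greeting(store: CustomerStore, customer_id: str) -> str:
    # Domain code depends only on the role, never on the vendor client.
    return f"Hello, {store.display_name(customer_id)}"


print(greeting(CrmCustomerStore(VendorCrmClient()), "c1"))  # Hello, Ada Lovelace
```

If the product is later replaced, only the adapter changes; code written against `CustomerStore` is untouched.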

There are some design basics that development teams routinely fail to account for:

  1. Roles
  2. Responsibilities
  3. Coupling

Role

The basic justification for the existence of some API, interface or class: a summary of what it’s for. Just as importantly, the role defines what a particular entity is not for.

Responsibility

The things that some entity does or knows in support of a role.

Coupling

An expression of the dependencies between roles. This property tells us a lot about the state of our design.

Two things that are heavily dependent upon each other might well be serving individual parts of a single role and thus should be consolidated. If everything ends up in a single role, it can suggest that the current approach to classifying behaviours is missing some factors.

Coupling can be temporal such that, for example, one entity cannot dispatch its responsibilities without the presence of another at the same time. This might indicate the need for some work on handling availability issues in a distributed system.

Limited coupling is a sign of cohesion and of clarity in roles and responsibilities, which can be indicative of a clean, maintainable design.
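Some of these coupling checks can be mechanised. The Python sketch below assumes you can extract a map of each component to the components it needs (the `DEPENDS_ON` data is invented for illustration). It flags pairs that depend on each other, which per the reasoning above may be serving parts of a single role, and counts how many components depend on each one; a component with very high fan-in is one that everything else needs.

```python
from collections import Counter

# Invented dependency map: component -> components it needs in order to do its job.
DEPENDS_ON = {
    "orders": {"pricing", "config"},
    "pricing": {"orders", "config"},
    "billing": {"config"},
    "reports": {"config", "billing"},
    "config": set(),
}


def mutual_pairs(deps):
    """Pairs of components that depend on each other (candidates for consolidation)."""
    return sorted(
        (a, b)
        for a, targets in deps.items()
        for b in targets
        if a < b and a in deps.get(b, set())
    )


def fan_in(deps):
    """Count how many components depend on each component."""
    return Counter(t for targets in deps.values() for t in targets)


print(mutual_pairs(DEPENDS_ON))           # [('orders', 'pricing')]
print(fan_in(DEPENDS_ON).most_common(1))  # [('config', 4)]
```

These numbers are prompts for a design conversation, not verdicts: a shared configuration component with high fan-in may be acceptable, whereas a mutually dependent pair usually deserves a closer look.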

Platform Neutral

These basics apply regardless of the platform one chooses to develop upon. Roles, responsibilities and coupling apply just as well to service architectures, databases (tables and associated triggers and packages) and applications in Java, Scala, Clojure, C# or any other programming environment.

Warning Signs

It is very common for individual developers or development teams to thoughtlessly allocate additional functions to existing elements of a design, thus eroding its quality. This manifests in many ways, including:

  1. Some element of the system becomes the source of all information about, e.g., configuration or the entirety of customer data.
  2. A single cache contains all data regardless of its nature (e.g. customer, account details, market price).
  3. Some element of the system must always be running otherwise nothing else works.
  4. Some element of the system has functions that span many different bits of data (e.g. customer, account, market price).

Rule of Thumb

Any entity within a system should do only one thing, and do it well (an idea often credited to the Unix philosophy). This applies to everything from applications and products to services and individual classes.

Prioritisation is a solution that can be used in a few situations:

  • Messaging – where some class of messages needs to be processed before one or more other classes.
  • Job execution – where the results of some set of jobs need to be available before others.
  • Levelling – where satisfying peak demand would require lots of hardware that in other periods would be significantly under-utilised.

It’s a very useful pattern but there are a few dark corners to think about:

  1. Even low priority items have some importance; otherwise they wouldn’t exist at all. If there are too many high priority items passing through the system, there is a significant risk that low priority items will not be processed in an acceptable time period.
  2. If there are too many high priority items passing through the system, the low priority items might not get processed at all, leading to huge backlogs that take an age to clear.
  3. If the high priority items begin taking a long time to process, low priority items are delayed, again resulting in a huge backlog.

In essence, a certain workload mix can mean that one must wait indefinitely for low priority items to be processed, and that is rarely acceptable. Making prioritisation work effectively means ensuring there is sufficient capacity to process each class of work within its acceptable time period.
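To make the starvation risk concrete, here is a toy Python simulation. The tick-based model and all parameter values are assumptions chosen for illustration, not measurements: a single worker with a fixed capacity per tick always serves high priority work first.

```python
def strict_priority(ticks, capacity, high_per_tick, low_per_tick):
    """Simulate a worker that always drains high priority work first.

    Returns (low priority items processed, low priority backlog at the end).
    """
    high_backlog = low_backlog = low_done = 0
    for _ in range(ticks):
        high_backlog += high_per_tick
        low_backlog += low_per_tick
        # High priority work takes as much of this tick's capacity as it needs.
        took_high = min(high_backlog, capacity)
        high_backlog -= took_high
        # Low priority work only gets whatever capacity is left over.
        took_low = min(low_backlog, capacity - took_high)
        low_backlog -= took_low
        low_done += took_low
    return low_done, low_backlog


# Capacity exactly matches high priority demand: low priority work starves.
print(strict_priority(ticks=100, capacity=10, high_per_tick=10, low_per_tick=2))  # (0, 200)
# Capacity covers both classes: nothing is left behind.
print(strict_priority(ticks=100, capacity=12, high_per_tick=10, low_per_tick=2))  # (200, 0)
```

The prioritisation logic is identical in both runs; only capacity differs, and that alone decides whether low priority work ever completes.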

For some applications there is a convenient “quiet” period overnight when low priority items can be cleared out of the system because there’s a dearth of high priority items to process. In other cases, processing of priority classes must be interleaved, e.g. process 100 high priority items, then 5 low priority items, and repeat. Alternatively, one can dedicate differently sized pools of resources (partitioning) to the priority classes, with each pool scaled according to its class’s timeliness requirements.
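The interleaving scheme (100 high priority items, then 5 low priority items, repeat) can be sketched as a Python generator over two queues. This is an illustrative sketch over finite in-memory lists; a real system would pull from live queues that keep refilling, and the batch sizes are simply the example figures from the text.

```python
from collections import deque


def interleave(high, low, high_batch=100, low_batch=5):
    """Yield up to high_batch high priority items, then up to low_batch
    low priority items, repeating until both queues are empty."""
    if high_batch < 1 or low_batch < 1:
        raise ValueError("batch sizes must be positive")
    high, low = deque(high), deque(low)
    while high or low:
        for _ in range(min(high_batch, len(high))):
            yield high.popleft()
        # Low priority items are guaranteed a slot every cycle, so they cannot starve.
        for _ in range(min(low_batch, len(low))):
            yield low.popleft()


print(list(interleave(["H1", "H2", "H3"], ["L1", "L2"], high_batch=2, low_batch=1)))
# ['H1', 'H2', 'L1', 'H3', 'L2']
```

The ratio bounds how long low priority work can wait, but it does not create capacity: if total demand exceeds what the workers can process, both queues still grow.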

Some technical staff naively use priority to solve a throughput problem where capacity is insufficient to cope with all the work in parallel. This can appear to work for a while if there are lulls in demand, as mentioned above, but ultimately, as workload increases, such an approach will fail unless care is taken to profile the workload and ensure there is sufficient capacity to satisfy all priorities.
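A first sanity check on capacity is basic utilisation arithmetic from queueing theory (my addition, not the author’s method): at peak, the sum over all classes of arrival rate × service time, divided by the number of workers, must stay below 1. Staying below 1 is necessary but not sufficient, since meeting each class’s time limit needs further headroom. The class names and figures below are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PriorityClass:
    name: str
    arrivals_per_sec: float  # average arrival rate during the busiest period
    service_secs: float      # average processing time per item on one worker


def utilisation(classes: List[PriorityClass], workers: int) -> float:
    """Share of total worker capacity consumed by all classes combined.

    Priority only changes the order in which work is done, not the total
    amount, so every class counts against the same capacity.
    """
    demand = sum(c.arrivals_per_sec * c.service_secs for c in classes)
    return demand / workers


peak = [PriorityClass("high", 40.0, 0.2), PriorityClass("low", 10.0, 0.5)]
print(utilisation(peak, workers=10))  # 1.3: demand exceeds capacity, backlog grows without bound
print(utilisation(peak, workers=16))  # 0.8125: sustainable, with some headroom
```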

I’ve spent a significant amount of my career helping to unpick messed-up architectures and wondering how they ever came to be. Certainly it can’t be because they’re appealing to work with:

  1. Making changes becomes increasingly expensive – make one small change and it spiders into changes across many other areas and gets into corners one least expects.
  2. Replacing components of the system (because, for example, they’re no longer supported, don’t perform adequately or can’t scale) requires significant reverse engineering to understand dependencies etc.
  3. It only takes one piece of the system failing to bring everything to its knees.
  4. Isolating the root cause of a bug takes significant amounts of effort because it’s difficult to quickly eliminate large chunks of the system.

More often than not, it’s believed (I’m guilty of this) that these systems come into being through incompetence or indiscipline on the part of the developers involved, but I think there may be another contributory factor: much of the advice on design and architecture is couched in terms of design from scratch, and there’s less guidance on working with an existing architecture.

The result is that when developers start out building a system they have a lot of advice they can apply, but as the system grows it becomes more difficult to apply that advice and discern which changes are appropriate, so the architecture unravels. Is there a way to avoid this unravelling? I believe there is, and it’s derived from the process for fixing up an errant architecture.

These architectures have smells equivalent to the code-level smells Fowler discusses in his book on refactoring, for example:

  1. Some area of the system is too tightly coupled, making changes harder.
  2. Some part of the system contains an assumption that there is only one resource of some type (e.g. a database) limiting scaling.
  3. Many components of the system are reliant upon one key component being constantly available such that if it fails, nothing works.

Having identified these smells, we need to perform appropriate cleanup which, for the examples above, might include:

  1. Placing additional APIs (interfaces) within the tightly coupled area of the system to reduce shared implementation knowledge and create well-bounded islands of data.
  2. Introducing a resource discovery pattern to abstract away the assumption of a single resource at a single address.
  3. Introducing concepts such as acceptable staleness of data (which allows caching for a period of time), eventual consistency (which supports making updates and resolving the outcome at a later date) or asynchronous operations.
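As an illustration of “acceptable staleness”, here is a minimal Python cache that only consults the authoritative source when its copy is older than an agreed limit, so callers stop depending on that source for every single request. `StaleTolerantCache`, `max_staleness` and the injectable `clock` are hypothetical names for this sketch; serving the last known value while the source is down would be a further step not shown here.

```python
import time
from typing import Callable, Dict, Generic, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class StaleTolerantCache(Generic[K, V]):
    """Serve cached values until they are older than max_staleness seconds."""

    def __init__(self, loader: Callable[[K], V], max_staleness: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader            # call to the authoritative source
        self._max_staleness = max_staleness
        self._clock = clock              # injectable so tests can control time
        self._entries: Dict[K, Tuple[float, V]] = {}

    def get(self, key: K) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] <= self._max_staleness:
            return entry[1]              # fresh enough: no call to the source
        value = self._loader(key)        # missing or too stale: reload
        self._entries[key] = (now, value)
        return value
```

Usage, with a fake clock standing in for real time:

```python
calls = []

def load_price(symbol):
    calls.append(symbol)
    return 100.0

fake_now = [0.0]
cache = StaleTolerantCache(load_price, max_staleness=30.0, clock=lambda: fake_now[0])
cache.get("ABC")                       # loads from the source
fake_now[0] = 10.0; cache.get("ABC")   # 10s old: served from the cache
fake_now[0] = 45.0; cache.get("ABC")   # 45s old: too stale, reloads
print(len(calls))  # 2
```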

It’s important to realise that in any substantial system we will be unable to eradicate a smell completely in a single update because it’s too risky: there will be many places in the code we might forget to patch up, a high likelihood that we’ll miss something in testing, a low probability that we’ll get API designs exactly right, and so on. We must gradually introduce modifications over a period of time (months or even years) rather than perform significant rewrites. This isn’t as bad as it seems, because no architecture is perfect for very long once it’s exposed to users. It also suggests that perhaps we need to focus on documenting techniques for the gradual evolution of an architecture.

If we were to get better at spotting these architectural smells early (a slight odour as opposed to a horrific stench) and addressing them sooner rather than later, it might be possible to avoid having a system’s architecture unravel, leading to something more sustainable.

Updated: to include additional commentary on APIs and perfection.

Cloud computing platforms offer many benefits including:

  1. Lower operational costs.
  2. Dynamic scaling in response to load spikes.
  3. Roll-on, roll-off deployments for e.g. newspaper archive processing.

These platforms exist as the result of investment by companies such as Amazon, Google and Microsoft in developing cost-effective infrastructure with system-to-administrator ratios of 2500:1 (whilst the average enterprise manages around 150:1 and inefficient properties manage maybe 10:1).

Key to allowing these infrastructures to be efficient and in turn deliver the benefits above is having applications architected such that:

  1. They don’t require masses of administrator intervention when they go wrong.
  2. They can be installed with minimal administrator effort because there’s no need to worry about tweaking URLs, IP addresses, database connections etc.
  3. They readily support horizontal scaling, e.g. because they contain an abstraction that can support sharding of data storage.
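The last two properties can be sketched together in Python: storage endpoints are discovered at start-up rather than hard-coded, and a small routing abstraction maps each key to a shard instead of assuming a single database. The `STORAGE_SHARDS` environment variable and the endpoint names are assumptions for illustration; a real platform might supply the list through its own discovery or registry service.

```python
import hashlib
import os
from typing import List


def discover_endpoints(var: str = "STORAGE_SHARDS") -> List[str]:
    """Hypothetical discovery: read a comma-separated endpoint list injected by the platform."""
    raw = os.environ.get(var, "")
    return [e.strip() for e in raw.split(",") if e.strip()]


class ShardRouter:
    """Map each key to one of several storage endpoints instead of assuming one database."""

    def __init__(self, endpoints: List[str]) -> None:
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)

    def endpoint_for(self, key: str) -> str:
        # Use a stable hash: Python's built-in hash() of str varies between processes.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._endpoints)
        return self._endpoints[index]


router = ShardRouter(discover_endpoints() or ["db-local"])  # hypothetical fallback for local runs
print(router.endpoint_for("customer-42"))
```

Simple modulo hashing moves most keys when the number of shards changes; consistent hashing is a common refinement when shards are added or removed frequently.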

In essence, an application must be designed for zero administrator intervention and fully automated deployment. It should also have a variable workload component, which magnifies the savings from the architectural properties above.

It’s strange, then, that many a developer expects to move an existing application, full of enterprise DNA (static configuration, vertical clusters, no horizontal scaling, high administration costs), to such an offering with minimal change. They even complain when it proves difficult because all those “enterprise features” aren’t present. Why does this happen?

I believe it’s because these developers have fundamentally misunderstood how cloud computing delivers its benefits. They see the cheap prices but don’t stop to consider where the cost saving comes from. Some of it is achieved by cloud platform vendors getting large discounts on huge hardware orders but a significant proportion comes from the fact that they don’t need to provide (via human resources or APIs) the sysadmin functions required for conventional hosting solutions.

Quite simply, typical applications, their architectures and associated administration practices are not set up for cloud platforms. Some of them may be able to run on these platforms with sufficient hackery, brute force and associated cost. However, if the motivation for a move to the cloud is merely to reduce kit costs, one might well be better off looking for a cheaper conventional hosting solution.

In summary, making the best of the cloud requires that we take an architectural view, something that we’ve proven remarkably bad at over and over. Simply deploying an application unchanged to the cloud is unlikely to deliver much benefit.
