Archive for the “Architecture” Category

Prioritisation is a solution that can be used in a few situations:

  • Messaging – where some class of messages needs to be processed before one or more other classes.
  • Job execution – where the results of some set of jobs need to be available before others.
  • Levelling – where satisfying peak demand would require lots of hardware that in other periods would be significantly under-utilised.

It’s a very useful pattern but there are a few dark corners to think about:

  1. Even low priority items have some importance, otherwise they wouldn’t exist at all. If there are too many high priority items passing through the system there is significant risk the low priority items will not be processed in an acceptable time period.
  2. If there are too many high priority items passing through the system, the low priority items might not get processed at all, leading to huge backlogs that take an age to process.
  3. If the high priority items begin taking a large amount of time to process, low priority items are delayed, resulting in a huge backlog as above.

In essence, a certain workload mix can mean that one must wait indefinitely for low priority items to be processed and that is rarely acceptable. Making prioritisation work effectively means ensuring that there is sufficient capacity to process each class of work within its acceptable time period.

For some applications there is a convenient “quiet” period overnight where low priority items can be cleared out of the system as there’s a dearth of high priority items to process. In other cases processing of priority classes must be interleaved, e.g. process 100 high priority items, then 5 low priority items and repeat. Alternatively one can dedicate pools of resource of varying size (partitioning) to processing priority classes, with each pool scaled according to the timeliness requirements of its class.
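The interleaving option can be sketched in a few lines of Python. This is a minimal illustration rather than a production scheduler: the function name `interleave` and its parameters are made up for the example, and the batch sizes default to the 100/5 split mentioned above.

```python
from collections import deque


def interleave(high, low, high_batch=100, low_batch=5):
    """Yield work items in processing order.

    Takes up to `high_batch` high priority items, then up to `low_batch`
    low priority items, and repeats until both queues are empty.
    (Hypothetical helper, for illustration only.)
    """
    high, low = deque(high), deque(low)
    while high or low:
        # Serve a bounded batch of high priority work first...
        for _ in range(min(high_batch, len(high))):
            yield high.popleft()
        # ...then guarantee the low priority class a small slice.
        for _ in range(min(low_batch, len(low))):
            yield low.popleft()


order = list(interleave(["h1", "h2", "h3", "h4", "h5"],
                        ["l1", "l2", "l3"],
                        high_batch=2, low_batch=1))
```

The ratio between the two batch sizes is what bounds how long a low priority item can wait, so it would need tuning against the acceptable time period for each class.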

Some technical staff naively use priority to solve a throughput problem where capacity is insufficient to cope with all work in parallel. This can appear to work for a while if there are lulls in demand, as mentioned above, but ultimately, as workload increases, such an approach will fail unless care is taken in profiling the workload and ensuring there is sufficient capacity to satisfy all priorities.
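A rough way to see why priority cannot fix a capacity shortfall is to compare the average offered load against what the workers can handle. The sketch below is a back-of-envelope check only; the function names and the example figures are invented, and it says nothing about latency or bursts, only whether the backlog can ever drain on average.

```python
def offered_load(arrivals_per_second, seconds_per_item, workers=1):
    """Fraction of total worker time a class of work consumes on average."""
    return arrivals_per_second * seconds_per_item / workers


def assess(high, low, workers=1):
    """Classify a workload mix.

    `high` and `low` are (arrivals_per_second, seconds_per_item) pairs.
    (Hypothetical helper; the thresholds are simple averages.)
    """
    high_load = offered_load(*high, workers=workers)
    total_load = high_load + offered_load(*low, workers=workers)
    if high_load >= 1.0:
        # High priority work alone saturates the workers.
        return "low priority starved"
    if total_load >= 1.0:
        # Low priority work arrives faster than spare capacity clears it.
        return "low priority backlog grows"
    return "keeps up on average"


# Made-up figures: 8 high and 5 low priority items per second, 0.1s each.
one_worker = assess((8, 0.1), (5, 0.1))
two_workers = assess((8, 0.1), (5, 0.1), workers=2)
```

With these figures one worker falls behind on low priority work however the items are ordered, while two workers keep up; priority changes the order of work, not the amount of it.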


I’ve spent a significant amount of my career helping to unpick messed up architectures and wondering how they ever come to be. Certainly it can’t be because they’re appealing to work with:

  1. Making changes becomes increasingly expensive – make one small change and it spiders into changes across many other areas and gets into corners one least expects.
  2. Replacing components of the system (because, for example, they’re no longer supported, don’t perform adequately or can’t scale) requires significant reverse engineering to understand dependencies etc.
  3. It only takes one piece of the system failing to bring everything to its knees.
  4. Isolating the root cause of a bug takes significant amounts of effort because it’s difficult to quickly eliminate large chunks of the system.

More often than not it’s believed (I’m guilty) these systems come into being through incompetence or indiscipline on the part of the developers involved, but I think there’s maybe another contributory factor: much of the advice on design and architecture is couched in terms of design from scratch, and there’s less guidance in regard to working with an existing architecture.

The result is that when developers start out building a system they have a lot of advice they can apply but as it grows, it becomes more difficult to apply the advice and discern what changes are appropriate, so the architecture unravels. Is there a way to avoid this unravelling? I believe there is and it’s derived from the process for fixing up an errant architecture.

These architectures have smells equivalent to the code-level examples Fowler discusses in his book on refactoring such as:

  1. Some area of the system is too tightly coupled, making changes harder.
  2. Some part of the system contains an assumption that there is only one resource of some type (e.g. a database) limiting scaling.
  3. Many components of the system are reliant upon one key component being constantly available such that if it fails, nothing works.

Having identified these smells we need to perform appropriate cleanup which, for the list of examples above, might include:

  1. Placing additional APIs (interfaces) within the tightly coupled area of the system to reduce shared implementation knowledge and create well-bounded islands of data.
  2. Introducing a resource discovery pattern to abstract away the assumption of a single resource at a single address.
  3. Introducing concepts like acceptable staleness of data (which allows caching for a period of time), eventual consistency (which supports making updates and resolving the outcome at a later date) or asynchronous operations.
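To make the third item a little more concrete, here is a minimal sketch of “acceptable staleness” as a cache in front of a key component. The class name, the `loader` callback and the injectable `clock` are all assumptions made for the example; a real system would also need to decide what to do when the component is down and the cached copy has expired.

```python
import time


class StaleTolerantCache:
    """Serve cached values for up to `max_age_seconds` before reloading.

    `loader` is any callable that fetches a fresh value for a key, for
    example a call to the key component. (Hypothetical class, for
    illustration only.)
    """

    def __init__(self, loader, max_age_seconds, clock=time.monotonic):
        self._loader = loader
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}  # key -> (time loaded, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] <= self._max_age:
            # Still within the agreed staleness window: no remote call.
            return entry[1]
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

The useful design conversation is about the number passed as `max_age_seconds`: it forces an explicit agreement on how out of date each piece of data is allowed to be.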

It’s important to realise that in any substantial system we will be unable to eradicate a smell completely in a single update because it’s too risky. There will be many places in the code we might forget to patch up, a high likelihood we’ll miss something in testing, low probability we’ll get API designs exactly right etc. We must gradually introduce modifications over a period of time (months or even years) rather than perform significant rewrites. This isn’t as bad as it seems because no architecture is perfect for very long once it’s exposed to users. It also suggests that perhaps we need to focus on documenting techniques for gradual evolution of an architecture.

If we were to get better at spotting these architectural smells early (slight odour as opposed to horrific stench) and working to address them sooner rather than later, it might be possible to avoid having a system’s architecture unravel, leading to something more sustainable.

Updated: to include additional commentary on APIs and perfection.


Cloud computing platforms offer many benefits including:

  1. Cheaper operational costs.
  2. Dynamic scaling in response to load spikes.
  3. Roll-on, roll-off deployments for e.g. newspaper archive processing.

These platforms exist as the result of the investment of companies such as Amazon, Google and Microsoft in developing cost-effective infrastructure with system to administrator ratios of 2500:1 (whilst the average enterprise manages around 150:1 and inefficient properties manage maybe 10:1).

Key to allowing these infrastructures to be efficient and in turn deliver the benefits above is having applications architected such that:

  1. They don’t require masses of administrator intervention when they go wrong.
  2. They can be installed with minimal administrator effort because there’s no need to worry about tweaking URLs, IP addresses, database connections etc.
  3. They readily support horizontal scaling e.g. because they contain an abstraction that can support sharding of data-storage.
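As an illustration of the third point, the sketch below shows one possible shape for an abstraction that hides sharding from the rest of the application. The class name and the shard names are hypothetical, and simple hash-modulo placement is just one option, chosen here for brevity.

```python
import hashlib


class ShardRouter:
    """Map a record key to one of several data stores.

    Callers ask the router where a key lives instead of assuming a
    single database. (Hypothetical class, for illustration only.)
    """

    def __init__(self, shard_names):
        if not shard_names:
            raise ValueError("at least one shard is required")
        self._shards = list(shard_names)

    def shard_for(self, key):
        # Use a stable hash so every process agrees on the placement.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._shards)
        return self._shards[index]


router = ShardRouter(["db-0", "db-1", "db-2"])
home = router.shard_for("user:42")
```

Note that modulo placement moves most keys when the number of shards changes, so a scheme that needs to grow while live would likely want something more sophisticated behind the same interface.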

In essence an application must be designed for zero administrator intervention and fully automated deployment. It should also have a variable workload component that magnifies the savings of the architectural properties above.

Strange then that many a developer expects to move their existing application, full of enterprise DNA (static configuration, vertical clusters, no horizontal scaling, high administration costs) to such an offering with minimal change. They even complain when it proves difficult because all those “enterprise features” aren’t present. Why does this happen?

I believe it’s because these developers have fundamentally misunderstood how cloud computing delivers its benefits. They see the cheap prices but don’t stop to consider where the cost saving comes from. Some of it is achieved by cloud platform vendors getting large discounts on huge hardware orders but a significant proportion comes from the fact that they don’t need to provide (via human resources or APIs) the sysadmin functions required for conventional hosting solutions.

Quite simply, typical applications, their architectures and associated administration practices are not set up for cloud platforms. Some of them may be able to run on these platforms with sufficient hackery, brute force and associated cost. However, if the motivation for a move to the cloud is merely to reduce kit costs, one might well be better off looking for a cheaper conventional hosting solution.

In summary, making the best of the cloud requires that we take an architectural view, something that we’ve proven remarkably bad at over and over. Simply deploying an application unchanged to the cloud is unlikely to deliver much benefit.


The codebase of a subsystem or maybe the whole system has turned into a big ball of mud. It’s claimed to be too brittle, too complex and too costly to continue developing. It’s at this point that a grand rewrite is proposed, accompanied by statements of how things will be different:

  • We’ll eliminate static wiring using Spring
  • We’ll model everything as a service
  • We’ll adopt test-driven development and make use of jMock
  • We’ll build everything using a RESTful approach
  • We’ll avoid using RPC in favour of messaging
  • …….

Things will be so much better in this brave new world but……they won’t. The reason the codebase has got into a mess is that we failed to execute on important principles such as:

  1. Take account of coupling and cohesion.
  2. Be clear about people’s roles and responsibilities to avoid unqualified or inappropriate decision making.
  3. Keep the roles and responsibilities of design elements clear and simple.
  4. Maintain modular, well-isolated code and conceptual integrity.
  5. Avoid shared data-schemas or integration via the database.
  6. Make the software testable and maintain the tests.
  7. Select technology based on appropriate design work.
  8. No broken windows.
  9. Track and maintain appropriate metrics.
  10. Review projects to identify and disseminate useful lessons to developers, architects and customers.
  11. Account for the operational aspects of our software in requirements and design.
  12. Review to ensure code aligns with appropriate design principles.
  13. Surface, balance and mitigate risks.

It’s these principles and others that enable superior engineering which in turn delivers a good-quality, maintainable codebase. Any rewrite will end up a ball of mud just like its predecessor unless the style of engineering is adapted to incorporate principles such as these.

Some propose that frameworks can prevent mistakes, ensure a quality design and deliver testable code. I think experience suggests otherwise as we routinely (by accident or design) bend frameworks to fit some problem they weren’t really designed for leading to ugly, broken, poorly designed, brittle code. What would stop us doing it with new frameworks delivered as part of a grand re-write?

Should we successfully revise our engineering practices, would we then have sufficient leverage to restructure our ball of mud into something nicer to work with? Maybe, maybe not, but we might be better equipped to answer the question: re-write or re-factor?


When building systems, there are some operational elements that it pays to get to grips with sooner rather than later:

  • Deployment
  • Packaging
  • Configuration
  • Monitoring
  • Logging

Failing to address these elements is detrimental to core aspects of what we need to do from day one:

  • Get changes out – ship a new feature, deploy an urgent bug-fix or make a tweak to handle a load-spike.
  • Determine if things have started up and been configured properly.
  • Be sure things are still running right.
  • Identify and react to problems quickly.
  • Obtain data important to future architectural decisions.

Even in light of the above, many of us are still tempted into leaving this until later, by which time:

  1. Our software will have grown substantially making it difficult and expensive to adapt when we do decide to address the operational issues.
  2. We’ll be losing inordinate amounts of time on manual trouble-shooting and dealing with the consequences of human error (a key contributor to downtime and other problems).
  3. Operations will likely have become tightly bound to whatever our software currently looks like such that when we start addressing the issues, we’ll break all their assumptions (and the tooling they built around them).

Some Specifics

Having configuration buried inside your binaries where it cannot be easily managed is an inconvenience. We don’t really want to have to do a whole new build just to change configuration settings (though one might want to do a re-deploy of the whole lot together to allow for audit-trails and have half a chance of having all boxes configured similarly at the same time).
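A minimal sketch of keeping configuration outside the binaries might look like the following. The setting names, the defaults and the use of a JSON file are assumptions for the example; the point is only that changing a setting means editing (and redeploying) a file rather than rebuilding.

```python
import json

# Hypothetical built-in defaults; anything in the external file wins.
DEFAULTS = {"listen_port": 8080, "log_level": "INFO"}


def load_config(path):
    """Return the defaults overlaid with settings read from `path`.

    The file is plain JSON shipped alongside the binaries, so it can be
    versioned and audited with the rest of a deployment.
    """
    config = dict(DEFAULTS)
    with open(path, encoding="utf-8") as handle:
        config.update(json.load(handle))
    return config
```

Shipping the file as part of the same deployment unit keeps the audit trail mentioned above, while still avoiding a fresh build for every tweak.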

When it comes to deployment and packaging it pays to adopt something akin to the xcopy install approach. Everything required is contained inside of the distribution with minimal external dependencies (necessary external dependencies should ideally be satisfied dynamically at runtime rather than with static configuration). Such an approach for desktop software would be unattractive but with servers and an imperative to automate installation it’s very attractive.

What about all those existing packaging systems such as rpm? Many of these mechanisms have a design assumption around a single version of something on a machine. This can inhibit fast rollback because, rather than stopping one process and starting another, one has to (in simple terms):

  1. Stop the running process.
  2. Uninstall its binaries and dependencies.
  3. Install the binaries and dependencies of the previous version.
  4. Start the previous version’s process.

In some cases it will also be necessary to perform further configuration (did we back it up?). Suddenly it’s looking like a lot of work to buy ourselves appropriate risk-mitigation for broken upgrades.
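By contrast, if each release is a self-contained directory, several versions can sit side by side and rollback becomes a matter of repointing at the old one. The sketch below assumes a layout that is not described above: one directory per version under a releases folder and a `current` symbolic link that the start script follows. The function name is hypothetical, and the approach relies on symbolic links and an atomic rename as found on POSIX systems.

```python
import os


def activate(releases_dir, version, link_name="current"):
    """Point `<releases_dir>/<link_name>` at an already installed version.

    Nothing is uninstalled: the previous release stays on disk, so
    rolling back is just another call with the old version name.
    """
    target = os.path.join(releases_dir, version)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    link = os.path.join(releases_dir, link_name)
    temp_link = link + ".tmp"
    if os.path.lexists(temp_link):
        os.remove(temp_link)
    # Build the new link beside the old one, then swap it into place.
    os.symlink(version, temp_link)
    os.replace(temp_link, link)
    return os.path.realpath(link)
```

Upgrade and rollback then both reduce to: stop the process, call `activate` with the desired version, start the process.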

Monitoring often requires an amount of configuration which can make for a bootstrap problem where one needs monitoring to detect a configuration issue but the monitoring isn’t configured yet. Thus it can be useful to have some very simple monitoring based on a primitive that can run without explicit configuration such as multicast.
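As a rough sketch of such a primitive, each process could periodically multicast a small heartbeat that any listener on the local network can pick up without being told where the senders are. The group address, port and message fields below are invented for the example, and whether multicast is actually routed depends on the network.

```python
import json
import socket
import time

GROUP = "239.255.42.99"  # hypothetical administratively scoped group
PORT = 45000             # hypothetical port


def encode_heartbeat(service, status, now=None):
    """Build the datagram payload announcing a service's state."""
    message = {
        "host": socket.gethostname(),
        "service": service,
        "status": status,
        "time": time.time() if now is None else now,
    }
    return json.dumps(message).encode("utf-8")


def decode_heartbeat(data):
    """Turn a received datagram back into a dictionary."""
    return json.loads(data.decode("utf-8"))


def send_heartbeat(service, status, group=GROUP, port=PORT):
    """Multicast one heartbeat; a TTL of 1 keeps it on the local network."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM,
                       socket.IPPROTO_UDP) as sock:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.sendto(encode_heartbeat(service, status), (group, port))


def open_listener(group=GROUP, port=PORT):
    """Return a socket subscribed to the heartbeat group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM,
                         socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Join the group on the default interface.
    membership = socket.inet_aton(group) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
    return sock
```

A listener would loop on `decode_heartbeat(sock.recv(4096))` and flag any service whose heartbeats stop arriving or report a bad status; only the shared group and port need to be known in advance.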

Important Step

These key operational elements should be accounted for early on in the design of a system and grown alongside other functional aspects.* There’s plenty of information on this topic publicly available.

* Initially the implementation can be simple scripts, but at some point it becomes necessary to take a more serious approach in respect of tools and infrastructure development. This means investing in properly skilled architects and engineers, performing appropriate testing etc.
