I’ve spent a significant amount of my career helping to unpick messed up architectures and wondering how they ever come to be. Certainly it can’t be because they’re appealing to work with:

  1. Making changes becomes increasingly expensive – make one small change and it spiders into changes across many other areas and gets into corners one least expects.
  2. Replacing components of the system (because, for example, they’re no longer supported, don’t perform adequately or can’t scale) requires significant reverse engineering to understand dependencies.
  3. It only takes one piece of the system failing to bring everything to its knees.
  4. Isolating the root cause of a bug takes significant amounts of effort because it’s difficult to quickly eliminate large chunks of the system.

More often than not it’s believed (I’m guilty) that these systems come into being through incompetence or indiscipline on the part of the developers involved, but I think there may be another contributory factor: much of the advice on design and architecture is couched in terms of design from scratch, and there’s less guidance on working with an existing architecture.

The result is that when developers start out building a system they have a lot of advice they can apply but as it grows, it becomes more difficult to apply the advice and discern what changes are appropriate, so the architecture unravels. Is there a way to avoid this unravelling? I believe there is and it’s derived from the process for fixing up an errant architecture.

These architectures have smells equivalent to the code-level examples Fowler discusses in his book on refactoring such as:

  1. Some area of the system is too tightly coupled, making changes harder.
  2. Some part of the system contains an assumption that there is only one resource of some type (e.g. a database) limiting scaling.
  3. Many components of the system are reliant upon one key component being constantly available such that if it fails, nothing works.

Having identified these smells we need to perform appropriate cleanup which, for the list of examples above might include:

  1. Placing additional APIs (interfaces) within the tightly coupled area of the system to reduce shared implementation knowledge and create well-bounded islands of data.
  2. Introducing a resource discovery pattern to abstract away the assumption of a single resource at a single address.
  3. Introducing concepts like acceptable staleness of data (which allows caching for a period of time), eventual consistency (which supports making updates and resolving the outcome at a later date) or asynchronous operations.
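
To make the second and third cleanups a little more concrete, here is a minimal Python sketch. The names (`ResourceRegistry`, `StaleTolerantCache`, `max_staleness`) are my own inventions for illustration, not taken from any particular library, and a real system would likely back the registry with a proper discovery service.

```python
import itertools
import time


class ResourceRegistry:
    """Hands out addresses for a named resource instead of hard-coding one."""

    def __init__(self):
        self._addresses = {}
        self._cursors = {}

    def register(self, name, address):
        self._addresses.setdefault(name, []).append(address)
        # Rebuild the round-robin cursor so the new address is included.
        self._cursors[name] = itertools.cycle(list(self._addresses[name]))

    def lookup(self, name):
        if name not in self._cursors:
            raise LookupError(f"no instances registered for {name!r}")
        return next(self._cursors[name])


class StaleTolerantCache:
    """Serves a cached value until it is older than max_staleness seconds."""

    def __init__(self, loader, max_staleness, clock=time.monotonic):
        self._loader = loader
        self._max_staleness = max_staleness
        self._clock = clock  # injectable so staleness can be tested
        self._entries = {}   # key -> (time loaded, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] <= self._max_staleness:
            return entry[1]  # still acceptably fresh
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value
```

Callers ask the registry for "a database" rather than knowing the one address, and the cache makes the tolerated staleness an explicit, tunable number rather than an accident of implementation.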

It’s important to realise that in any substantial system we will be unable to eradicate a smell completely in a single update because it’s too risky. There will be many places in the code we might forget to patch up, a high likelihood we’ll miss something in testing, low probability we’ll get API designs exactly right etc. We must gradually introduce modifications over a period of time (months or even years) rather than perform significant rewrites. This isn’t as bad as it seems because no architecture is perfect for very long once it’s exposed to users. It also suggests that perhaps we need to focus on documenting techniques for gradual evolution of an architecture.

If we were to get better at spotting these architectural smells early (slight odour as opposed to horrific stench) and working to address them sooner rather than later, it might be possible to avoid having a system’s architecture unravel, leading to something more sustainable.

Updated: to include additional commentary on APIs and perfection.


A bad habit I’ve noticed in many a techie:

The tendency to thrash around and wildly speculate about the root cause of whatever production issue they’re facing. They tweak code and configuration following some random hypothesis or another, hoping that the issue will magically go away. It must surely be clear that this is a horribly inefficient way to solve a problem?

What’s required is data, data that we can use to home in on the source of the fault. We could wade through log files but this is inefficient and ought to be the last resort. Ideally we’d have some idea of what to look for beforehand.

Instrumentation is one tool we can use to guide our efforts. It can tell us things like how much memory is used, how much load there is, how many users are logged in, rate and types of request, cache hits etc.
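
As a rough illustration of instrumenting our own code, the sketch below keeps counters and timings in-process. The `Instrumentation` class and its method names are hypothetical; in practice one would probably publish these numbers through whatever metrics infrastructure is already in place.

```python
import time
from collections import Counter
from contextlib import contextmanager


class Instrumentation:
    """In-process counters and timers that can be dumped on demand."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.counters = Counter()
        self.timings = {}  # name -> list of durations in seconds

    def increment(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timed(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the timed block raised.
            self.timings.setdefault(name, []).append(self._clock() - start)

    def snapshot(self):
        return {
            "counters": dict(self.counters),
            "timings": {
                name: {
                    "count": len(values),
                    "max_s": max(values),
                    "mean_s": sum(values) / len(values),
                }
                for name, values in self.timings.items()
            },
        }
```

Wrapping a request handler in `with inst.timed("render"):` and calling `inst.increment("cache_hits")` at the relevant points is usually enough to start answering "what changed?" with data.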

Self-tests are also useful as they can exercise common operations, perform internal consistency checks and provide feedback on what’s working and not.
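
A self-test facility need not be elaborate. The following is a minimal sketch under my own naming (`SelfTest`, `check`, `run` are illustrative, not an existing framework): each check is a function that raises if something is wrong, and the report says what is working and what is not.

```python
class SelfTest:
    """Registry of named checks that can be run against a live process."""

    def __init__(self):
        self._checks = {}

    def check(self, name):
        """Decorator that registers a function as a named check."""
        def decorator(fn):
            self._checks[name] = fn
            return fn
        return decorator

    def run(self):
        report = {}
        for name, fn in self._checks.items():
            try:
                fn()
                report[name] = "ok"
            except Exception as exc:
                # One failing check must not stop the others from running.
                report[name] = f"FAILED: {exc}"
        return report
```

Exposing `run()` behind an admin-only endpoint or command gives operations staff a quick way to eliminate large parts of the system from an investigation.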

We can also get online memory dumps and there are tools like dtrace and tcpdump.

Given all these possibilities, why do we indulge in wild speculation? Perhaps it’s because we’ve foolishly left ourselves no choice:

  1. Instrumentation that should be a rich source of useful information is often limited to what is available from the operating system because we neglect to instrument our own code.
  2. As with instrumentation, we don’t make the time to implement self-test facilities.
  3. Only a few of us bother to learn about tools such as dtrace.
  4. Logging, even if we could wade through it all, is implemented in such a fashion that it cannot be turned on in production because the performance cost is too high.
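
On the last point, logging can often be made cheap enough to leave in place by guarding the expensive part and allowing the level to be switched at runtime. A small sketch using Python’s standard `logging` module follows; the logger name and the `expensive_dump`, `process` and `set_debug` functions are hypothetical.

```python
import logging

log = logging.getLogger("example.orders")  # hypothetical logger name


def expensive_dump(order):
    """Stands in for costly serialisation we only want when debugging."""
    expensive_dump.calls += 1
    return ", ".join(f"{k}={v}" for k, v in sorted(order.items()))


expensive_dump.calls = 0  # counts how often the costly path actually ran


def process(order):
    # Only pay for the dump when debug logging is switched on.
    if log.isEnabledFor(logging.DEBUG):
        log.debug("processing order: %s", expensive_dump(order))
    return order["qty"] * order["price"]


def set_debug(enabled):
    """Could be wired to an admin command so the level changes at runtime."""
    log.setLevel(logging.DEBUG if enabled else logging.INFO)
```

With debug off, the cost per call is a single level check, which is usually cheap enough for production.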

We’ve all seen it: customers change their requirements, add a few more features and yet expect the project deadline to stay the same even though there are no additional resources.

For some reason they act as if a software team has infinite, cost-free capacity. The psychology that drives this behaviour is somewhat unclear because there are various potential motivators such as political ambition, naivety or willful ignorance.

One might expect to see this problem occurring in waterfall projects but it can also plague early agile projects. Typically the backlog grows and grows, the customer has a desired release date in mind and expresses horror when it becomes clear that the whole backlog cannot possibly be implemented in the timeframe (accompanied by cries of “but I followed the process”).

It shouldn’t be possible to make this mistake given real-world experiences. For example:

We put our car in for an oil change and get a quote for cost and an estimate for how long the work will take. We drop the car in at the garage and then a little later phone up and request additional work such as fixing the air-conditioning, replacing two tires, sorting the exhaust and swapping out the brake pads. Not for a second do we entertain the idea that the cost and time for the work will be the same as originally quoted.

Yet we still persist in the notion that a software development team is a bottomless pit of resource.


It’s tempting when trying to be customer-centric to focus on delivering lots of functionality quickly. Supposedly features win the race and can increase revenue, but is that all that matters? Evidence such as the troubles Twitter have had in the past and this anecdote from Google about search time suggests there are other qualities of our website that matter like:

  • Service charges
  • Responsiveness
  • Availability
  • Quality of interaction

Whilst these qualities are all about the customer experience, success in maintaining them at an appropriate level is related to how well a company performs internally:

  • It’s undesirable to be charging excessively to cover development inefficiencies caused for example, by a tightly coupled architecture that makes even a small change a multi-month death-march.
  • A service that runs slowly at peak times, due to insufficient focus on performance and scaling in our architecture and code, appears sluggish or even down, which can drive customers away.
  • Prolonged outages, resulting from trivial problems that take operational staff excessive time to fix because of poor monitoring and diagnostic tools, will impact customer satisfaction.
  • If we routinely roll back upgrades, or they’re brittle or bug-ridden, we will negatively impact the quality of interaction.

Thus being more customer-centric requires a company to quantify its performance and work to improve it. In the case of the examples above, things like response time, site downtime, number of failed upgrades, time to perform a release, bug counts and feature count against cost of delivery can be used as metrics to indicate how we’re doing in our mission to make the customer happy. Methods for improving these metrics, though not always easy to apply, are relatively well-understood and include:

  • Ensuring architecture/design includes well-defined interfaces, avoiding integration via databases etc.
  • Considering scalability: how many machines can be thrown at a problem and are they used efficiently? Essentially, balancing horizontal-scale and straight-line optimisation.
  • Removing computation from the critical path to generating a user-response e.g. by using asynchronous methods.
  • Publishing software and hardware telemetry, gathering it all up (using the right infrastructure) and performing appropriate analysis via tools etc.
  • Focusing on simplicity, isolation of components, failure tolerance, in-live testing, versioning and the ability to roll back rapidly.
  • Applying an appropriate testing regime.
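
As one illustration of taking work off the critical path, the sketch below answers the user immediately and hands the slow part to a background thread. The `BackgroundWorker` class and `handle_signup` function are hypothetical; a production system would more likely use a durable queue so that work survives a process restart.

```python
import queue
import threading


class BackgroundWorker:
    """Runs submitted tasks on a single daemon thread, off the request path."""

    def __init__(self):
        self._tasks = queue.Queue()
        self.failures = []  # exceptions raised by tasks, kept for inspection
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, fn, *args):
        self._tasks.put((fn, args))

    def _run(self):
        while True:
            fn, args = self._tasks.get()
            try:
                fn(*args)
            except Exception as exc:
                # Keep the worker alive if a single task fails.
                self.failures.append(exc)
            finally:
                self._tasks.task_done()

    def drain(self):
        """Blocks until every submitted task has been processed."""
        self._tasks.join()


worker = BackgroundWorker()
sent = []  # stands in for a slow side effect such as sending email


def handle_signup(email):
    # The slow work is queued; the user gets a response straight away.
    worker.submit(sent.append, f"welcome email for {email}")
    return {"status": "accepted", "email": email}
```

The trade-off is that the user is told "accepted" before the side effect has happened, so failures need to be recorded and surfaced some other way.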

Ultimately everything a company does internally has implications for customers. This includes what might normally be notoriously subjective, such as technology selection. In this particular case we ought to test the technology and assess the effect on relevant metrics to verify that it does provide meaningful benefits. Also, as most technology has its downsides, we can quantify these too and ensure there’s an appropriate trade-off.


Some notes on areas of software development where obsession with perfection can be costly.

Standards – where everything is standardised up front so as to avoid unnecessary diversity or cost. Standards, like design and code, are based on assumptions about the environment in which they will be used. Thus to be sure that a standard is appropriate, one must have experience in the environment. Without the experience, selection or formation of a standard is to a reasonable degree guesswork. Asserting a standard up front often leads organisations into incurring significant costs (e.g. manpower, slow release cycles) as they twist what they do to fit some standard they’ve selected instead of recognising that the standard is inappropriate for a given situation. Standardisation should be done after a period of diversity/investigation to identify what is or is not applicable.

Estimates – where an organisation engages in grand analyses to deliver up accurate estimates for all pieces of work required so that resources can be correctly determined, budgets set and timelines agreed prior to the start of the implementation phase with the assumption that nothing will change. The futility of these efforts can be found in a dictionary:

es·ti·mate

tr.v. es·ti·mat·ed, es·ti·mat·ing, es·ti·mates

  1. To calculate approximately (the amount, extent, magnitude, position, or value of something).
  2. To form an opinion about; evaluate: “While an author is yet living we estimate his powers by his worst performance” Samuel Johnson.

n.

  1. The act of evaluating or appraising.
  2. A tentative evaluation or rough calculation, as of worth, quantity, or size.
  3. A statement of the approximate cost of work to be done, such as a building project or car repairs.
  4. A judgment based on one’s impressions; an opinion.

Estimates are implicitly guesswork; they can only ever be made accurate in hindsight. Agile is one way to get realistic about estimates.

Architecture – where organisations and individuals engage in the illusion that one can design an endlessly flexible architecture that copes with all possible unknown future demands. Just as with estimates, symptoms include grand, long-running rituals to examine every detail, verify that the architecture is sound and that nothing has been forgotten. Businesses change their processes, unexpected integrations are undertaken, products are dropped and new ones are dreamt up, hosting options change and operational challenges appear unexpectedly. The weapons for dealing with this challenge are not do-it-all containers or grand architectures, rather they are things like:

  • Principles and guidelines for keeping an architecture adaptable: loose coupling, high cohesion, no broken windows, isolation etc.
  • Validation metrics: to indicate when an architectural assumption has been breached and needs re-addressing.
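
A validation metric can be as simple as an architectural assumption written down next to the measurement that would disprove it. The sketch below is illustrative only: the `Assumption` type, the `find_breaches` function, the metric names and the thresholds are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Assumption:
    name: str                       # the assumption, in plain words
    metric: str                     # which measurement validates it
    holds: Callable[[float], bool]  # True while the assumption is safe


def find_breaches(assumptions, metrics):
    """Returns (name, value) for each assumption that no longer holds.

    An assumption whose metric is missing is reported with value None,
    on the basis that an unmeasured assumption is itself a warning sign.
    """
    breaches = []
    for assumption in assumptions:
        value = metrics.get(assumption.metric)
        if value is None or not assumption.holds(value):
            breaches.append((assumption.name, value))
    return breaches
```

Run periodically against live telemetry, a check like this turns "the architecture feels strained" into a specific assumption that needs re-addressing.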

Implementation – where developers engage in the writing of complex, brittle, difficult to maintain and debug code to address challenging problems that can be better dealt with via process or architectural approach. Hot-update of configuration is one such example. At least in the Java world, many a developer will be tempted to tackle this problem with:

  • Clever thread management and barriers to pause work whilst configuration changes are made
  • A remote admin interface to permit lodging of configuration changes
  • Event listeners to respond to the configuration change and patch up various bits of state
  • Sundry bits of classloader magic

From an architectural perspective however, one realises that:

  1. Configuration is most easily done when a process boots: there’s minimal state to patch and no active workload to manage.
  2. At least in the case of a website, one must have failure handling mechanisms to cope with lost boxes, broken networks and failing processes.
  3. Hot re-configuration actually means provision of service without disruption to the user.

Thus hot re-configuration becomes much simpler: kill process, change configuration, re-start process.
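
In code, this approach amounts to reading configuration exactly once at boot and treating it as immutable thereafter. The sketch is in Python purely for brevity (the same idea applies in Java), and the `Config` fields and `load_config` function are made up for the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Immutable once built: there is no state to patch up later."""
    db_url: str
    pool_size: int = 10


def load_config(text):
    """Parses configuration once, at process boot."""
    raw = json.loads(text)
    return Config(db_url=raw["db_url"], pool_size=int(raw.get("pool_size", 10)))
```

Because nothing can change the configuration in a running process, the listeners, barriers and classloader tricks disappear. Changing a setting means restarting the process, which the failure handling mechanisms already have to tolerate.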

Hardware – where expensive kit never fails, making software easier to write but ultimately compromised when it comes to availability as customer usage grows. Hardware does fail. In fact, once an organisation accrues enough hardware there will be failures daily, and it’s not cost-effective to pay for operational staff to run around trying to keep all this hardware running all the time (and it increases the chance of human error, another key contributor to availability problems). In the early days of a system, use of redundant hardware solutions is acceptable because it’s more important to get things up and running, but it pays to:

  1. Monitor service availability.
  2. Track hardware failure patterns.
  3. For each outage, maintain an estimate of cost in both operational and revenue terms.
  4. Track user behaviour after an outage: do they return and if so, how quickly?

These sorts of metrics allow an organisation to determine when to start shifting failure tolerance into the software layers. Marcus and Stern provide a good treatment of systems availability practices.
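
The arithmetic behind the first and third of these metrics is straightforward. The sketch below assumes a hypothetical `Outage` record with costs already estimated, and the figures used with it would be the organisation’s own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    minutes: float       # how long the service was unavailable
    ops_cost: float      # estimated operational cost of the fix
    lost_revenue: float  # estimated revenue lost while down


def availability_percent(period_minutes, outages):
    """Percentage of the period during which the service was up."""
    downtime = sum(outage.minutes for outage in outages)
    return 100.0 * (period_minutes - downtime) / period_minutes


def total_outage_cost(outages):
    """Combined operational and revenue cost across all outages."""
    return sum(outage.ops_cost + outage.lost_revenue for outage in outages)
```

For scale, a 30-day month has 43,200 minutes, so roughly 43 minutes of downtime in that month corresponds to 99.9% availability. Setting the running cost total against the price of building failure tolerance into the software gives a basis for deciding when to make the shift.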
