Posts Tagged “Architecture”
There are some design basics that development teams routinely fail to account for:
- Roles
- Responsibilities
- Coupling
Role
The basic justification for the existence of some API, interface or class: a summary of what it’s for. Just as importantly, the role defines what a particular entity is not for.
Responsibility
The things that an entity can do, or knows, in support of a role.
Coupling
An expression of the dependencies between roles. This property tells us a lot about the state of our design.
Two things that are heavily dependent upon each other might well be serving individual parts of a single role and thus should be consolidated. If everything ends up in a single role, it can suggest that the current approach to classifying behaviours is missing some factors.
Coupling can be temporal such that, for example, one entity cannot dispatch its responsibilities without the presence of another at the same time. This might indicate the need for some work on handling availability issues in a distributed system.
Limited coupling is a sign of cohesion and of clarity in roles and responsibilities, which can be indicative of a clean, maintainable design.
Platform Neutral
These basics apply regardless of the platform one chooses to develop upon. Roles, responsibilities and coupling apply just as well to service architectures, databases (tables and associated triggers and packages) and applications in Java, Scala, Clojure, C# or any other programming environment.
Warning Signs
It is very common for individual developers or development teams to allocate additional functions to existing elements of a design unthinkingly, thus eroding its quality. This manifests in many ways including:
- Some element of the system becomes the source of all information in respect of e.g. configuration or the entirety of customer data.
- A single cache contains all data regardless of its nature (e.g. customer, account details, market price).
- Some element of the system must always be running otherwise nothing else works.
- Some element of the system has functions that span many different bits of data (e.g. customer, account, market price).
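One rough way to spot the "everything depends on it" warning sign is to count, for each component, how many others depend on it directly (its fan-in). The sketch below is only an illustration: the component names and the dependency map are made up, and the 50% threshold is an arbitrary assumption rather than a recommended value.

```python
from collections import Counter

# Hypothetical dependency map: component -> components it calls directly.
DEPENDENCIES = {
    "web": ["config", "customer_cache"],
    "pricing": ["config", "customer_cache"],
    "accounts": ["config", "customer_cache"],
    "reports": ["config", "accounts"],
    "config": [],
    "customer_cache": [],
}


def fan_in(dependencies):
    """Count how many components depend directly on each component."""
    counts = Counter({name: 0 for name in dependencies})
    for targets in dependencies.values():
        for target in set(targets):  # ignore duplicate entries
            counts[target] += 1
    return dict(counts)


def hotspots(dependencies, threshold=0.5):
    """Return components that more than `threshold` of the others depend on."""
    others = len(dependencies) - 1
    if others <= 0:
        return []
    return sorted(
        name
        for name, count in fan_in(dependencies).items()
        if count / others > threshold
    )


if __name__ == "__main__":
    print(hotspots(DEPENDENCIES))
```

A high fan-in is not automatically a fault, but a component that most of the system leans on is worth checking against the list above: is it serving one role, or has it quietly picked up several?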
Rule of Thumb
Any entity within a system should do only one thing and it should do it well (a rule often credited to the Unix philosophy). This applies to everything from applications and products to services and individual classes.
Design is not rules, it’s not patterns, it’s not technological choices or indeed code. Design is tradeoffs, driven by data where possible and by gut instinct. It’s about identifying the core challenges of a problem domain (which might ultimately be one or many systems) and addressing them through creation of appropriate abstractions. These abstractions embody:
- Functions to be performed
- Data to be discovered, consumed and produced
- Non-functionals (e.g. SLAs)
The abstractions are then rendered into the real-world using appropriate hardware, technologies, patterns and languages. A good design:
- Exhibits few exception cases
- Has logic and/or data located neatly and predictably
- Applies a small set of core constructs repeatedly
- Addresses operational needs
- Considers cost versus value delivered
- Is as simple as possible
- Makes the minimum of implementation assumptions
There are several key failing points in the design process:
- No adjustment in the face of implementation feedback – No design is complete or perfect. There will always be missed details leading to brittle code, complex corner cases or convoluted solutions. It is critical that we monitor our progress and adapt the design accordingly.
- No up front design – Design is the skeleton upon which we hang technology choices and code structure. In its absence we rapidly descend into a world of difficult-to-navigate code and costly constraints set by uninformed product choices.
- No care in following the design – A key element of design is to place the right things in the right places. Failing to do this at code time increases coupling, makes maintenance difficult and can impact both performance and scalability. Similar effects occur as the result of poor technology selection.
Design and implementation go hand in hand yet many of us lack awareness of where the boundary between these two elements lies. We don’t understand how these elements interact with each other or appreciate the impact of decisions we make in respect of one element on the other.
My current company has, for obvious business reasons, a serious interest in delivering a quality website experience during the World Cup, and thus I’ve been spending a lot of time of late focused on our own performance and capacity (P&C) management.
P&C is one of those 80/20 tradeoffs. There’s always more one can do or measure or test; equally, getting the basics in place will deliver substantial benefit. I’d go further and argue that without a solid grasp of the basics, one cannot easily determine what else beyond that might be required. Here then are the basics that I’ve found myself repeating over and over:
- Have an enquiring mind – anomalies are not to be ignored or dismissed on the basis of pure speculation. Determining root cause is essential to prevent surprises in production. Some recent examples:
  - In one test we noticed that every so often we’d get a substantial blip in disk I/O on servers that should be processing entirely out of memory, along with a corresponding reduction in throughput. We could have ignored it (after all, things sorted themselves out relatively quickly) but we chose to investigate. All these servers were periodically running a cleanup job that the developers were unaware of and had not factored into their capacity calculations. The implication for production would have been a regularly overloaded, badly performing website. We’ve since tuned the jobs, adjusted their schedules and increased our capacity to ensure we can always spread the load around enough to accommodate them.
  - An examination of the distribution of load on the boxes behind our load-balancers revealed a higher than expected amount of variance in CPU and connections. A review of the application revealed that any particular user’s traffic is sticky to one box, which is unfortunate as the application is stateless, so it’s time for a code change. We also spent time looking at the monitoring infrastructure and discovered that in certain cases we’d get false reports of 100% CPU utilisation; that one will be fixed with an OS patch.
- Gather the right data – there’s no value in allowing oneself to be limited by what is easily available via some set of tools people are comfortable with. One tool we were using had an unreasonably low ceiling on the number and rate of samples it could handle, such that any graphs it produced showed hardly anything of the true profile of e.g. CPU utilisation, memory consumption or I/O. Forming any opinion about system behaviour in respect of load was going to be an exercise in speculation. We junked the tool and are looking for a replacement; in the meantime we’ve fallen back to low-level performance counters, which we can sample local to the machine and whack onto disk for later analysis via scripts, open source tools etc.
- Design tests that support reasoning – One should indeed try and replicate production load behaviours to judge overall system behaviour. The challenge of such testing is that it can be difficult to relate performance data back to exactly what was going on during some period of a test and make a diagnosis or be confident of an improvement. There are a number of things we can do to improve the situation:
  - Ensure tests are deterministic such that any given run can be compared against other runs. This isn’t as simple as it looks when e.g. you wish to gradually increase load at a fixed rate that is being produced by more than one box.
  - Have tests produce sufficient logging that one can easily identify what was going on at particular points in the sampled data. Logging of course can actually affect test behaviour and that isn’t always desirable.
  - Build additional tests that target particular user journeys through the system. Doing this for all possible journeys can be costly, so it makes sense to focus on those which are most popular with users. These kinds of tests restrict the reasoning tree, making analysis, diagnosis and solution identification much easier.
- Measure what customers care about – they don’t care about CPUs, I/O or memory, they worry about things like response times. It is important to focus on maintaining a quality user experience not endlessly improving system efficiency. Considering user factors such as response times stops us expending huge effort on CPU utilisation when we should be focusing on say, network I/O, browser performance or reducing the amount of data we push to the browser before a page can render.
- Beware of averages – it is very tempting to combine datasets by averaging. Unfortunately, such a practice can easily hide spikes that might be indicative of a problem. On more than one occasion an engineer has presented a graph that tracks the average CPU and a table that summarises min, avg and max, and then pronounced load testing a success. Yet they have no explanation for why the average is never more than 50% while the max is 100%, or whether this is good or bad.
- More than load – excessive focus on measuring the effect of a particular load can make us blind to another important metric: resource cost per unit of work. This covers the collection of tests and analyses that help us understand what to tune, and by how much, to keep our appetite for boxes and bandwidth reasonable. One simple thing teams can do per sprint (assuming you’re agile, why wouldn’t you be?) is point a profiler at each component and look for the low hanging fruit that is poor algorithm selection or inefficient code (e.g. repeated scanning of lists where a hashmap would be better or repeatedly computing something that could be cached).
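To make the "beware of averages" point concrete, here is a minimal sketch that summarises a set of CPU samples with percentiles alongside the average. The sample data is invented for illustration, and the nearest-rank percentile and the 90% "busy" cut-off are my own assumptions; other percentile definitions give slightly different numbers.

```python
import statistics


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, avoiding floating point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarise(samples, busy_threshold=90):
    """Summarise samples so that spikes stay visible next to the average."""
    busy = sum(1 for value in samples if value >= busy_threshold)
    return {
        "avg": statistics.mean(samples),
        "p50": nearest_rank_percentile(samples, 50),
        "p95": nearest_rank_percentile(samples, 95),
        "max": max(samples),
        "busy_share": busy / len(samples),
    }


# Invented CPU utilisation samples (%): mostly quiet, with bursts at 100%.
CPU_SAMPLES = [30] * 17 + [100] * 3

if __name__ == "__main__":
    for name, value in summarise(CPU_SAMPLES).items():
        print(f"{name}: {value}")
```

With data shaped like this the average sits comfortably below 50% while a noticeable share of the samples are pegged at 100%. The percentiles and the busy share turn "the max is 100%" into a question one can actually reason about: how often, and for how long?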
I’ve spent a significant amount of my career helping to unpick messed up architectures and wondering how they ever come to be. Certainly it can’t be because they’re appealing to work with:
- Making changes becomes increasingly expensive – make one small change and it spiders into changes across many other areas and gets into corners one least expects.
- Replacing components of the system because for example they’re no longer supported, don’t perform adequately or can’t scale requires significant reverse engineering to understand dependencies etc.
- It only takes one piece of the system failing to bring everything to its knees.
- Isolating the root cause of a bug takes significant amounts of effort because it’s difficult to quickly eliminate large chunks of the system.
More often than not it’s believed (I’m guilty) that these systems come into being through incompetence or indiscipline on the part of the developers involved, but I think there’s maybe another contributory factor: much of the advice on design and architecture is couched in terms of design from scratch, and there’s less guidance on working with an existing architecture.
The result is that when developers start out building a system they have a lot of advice they can apply but as it grows, it becomes more difficult to apply the advice and discern what changes are appropriate, so the architecture unravels. Is there a way to avoid this unravelling? I believe there is and it’s derived from the process for fixing up an errant architecture.
These architectures have smells equivalent to the code-level examples Fowler discusses in his book on refactoring such as:
- Some area of the system is too tightly coupled, making changes harder.
- Some part of the system contains an assumption that there is only one resource of some type (e.g. a database) limiting scaling.
- Many components of the system are reliant upon one key component being constantly available such that if it fails, nothing works.
Having identified these smells, we need to perform appropriate cleanup which, for the list of examples above, might include:
- Placing additional APIs (interfaces) within the tightly coupled area of the system to reduce shared implementation knowledge and create well-bounded islands of data.
- Introducing a resource discovery pattern to abstract away the assumption of a single resource at a single address.
- Introducing concepts such as acceptable staleness of data (which allows caching for a period of time), eventual consistency (which supports making updates and resolving the outcome at a later date) and asynchronous operations.
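As an illustration of acceptable staleness, the sketch below caches values for a bounded period so that callers don’t need the key component to be available on every request. The class name, the `loader` callable and the injected clock are all hypothetical, and serving a stale value when the source fails is one possible policy that I’ve assumed here, not the only reasonable one.

```python
import time


class StaleTolerantCache:
    """Serve cached values younger than `max_age` seconds.

    If the source cannot be reached, fall back to the last known value
    (an assumed policy: stale data is taken to be better than no data).
    """

    def __init__(self, loader, max_age, clock=time.monotonic):
        self._loader = loader    # callable(key) -> value; may raise if the source is down
        self._max_age = max_age  # acceptable staleness, in seconds
        self._clock = clock      # injectable so tests can control time
        self._entries = {}       # key -> (value, fetched_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] <= self._max_age:
            return entry[0]      # fresh enough, no call to the source
        try:
            value = self._loader(key)
        except Exception:
            if entry is not None:
                return entry[0]  # source unavailable, serve the stale value
            raise                # nothing cached, surface the failure
        self._entries[key] = (value, now)
        return value
```

The interesting design decision is the value of `max_age`. That is a conversation to have per type of data (a customer’s address tolerates far more staleness than a market price), which is another reason not to push everything through a single cache.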
It’s important to realise that in any substantial system we will be unable to eradicate a smell completely in a single update because it’s too risky. There will be many places in the code we might forget to patch up, a high likelihood we’ll miss something in testing, low probability we’ll get API designs exactly right etc. We must gradually introduce modifications over a period of time (months or even years) rather than perform significant rewrites. This isn’t as bad as it seems because no architecture is perfect for very long once it’s exposed to users. It also suggests that perhaps we need to focus on documenting techniques for gradual evolution of an architecture.
If we were to get better at spotting these architectural smells early (slight odour as opposed to horrific stench) and working to address them sooner rather than later, it might be possible to avoid having a system’s architecture unravel, leading to something more sustainable.
Updated: to include additional commentary on APIs and perfection.
Cloud computing platforms offer many benefits including:
- Cheaper operational costs.
- Dynamic scaling in response to load spikes.
- Roll-on, roll-off deployments for e.g. newspaper archive processing.
These platforms exist as the result of the investment of companies such as Amazon, Google and Microsoft in developing cost-effective infrastructure with system-to-administrator ratios of 2500:1 (whilst the average enterprise manages around 150:1 and inefficient properties manage maybe 10:1).
Key to allowing these infrastructures to be efficient and in turn deliver the benefits above is having applications architected such that:
- They don’t require masses of administrator intervention when they go wrong.
- They can be installed with minimal administrator effort because there’s no need to worry about tweaking URLs, IP addresses, database connections etc.
- They readily support horizontal scaling e.g. because they contain an abstraction that can support sharding of data-storage.
In essence an application must be designed for zero administrator intervention and fully automated deployment. It should also have a variable workload component that magnifies the savings of the architectural properties above.
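A minimal sketch of what such an abstraction might look like: the application asks a small locator for the storage shard that owns a key, and the list of shards comes from the environment at deploy time rather than from hand-edited configuration. The `CUSTOMER_DB_SHARDS` variable, the address format and the function names are all assumptions made for this example.

```python
import hashlib
import os


def load_shards(env=os.environ, variable="CUSTOMER_DB_SHARDS"):
    """Read a comma-separated list of shard addresses from the environment."""
    raw = env.get(variable, "")
    shards = [item.strip() for item in raw.split(",") if item.strip()]
    if not shards:
        raise RuntimeError(f"{variable} is not set or is empty")
    return shards


def shard_for(key, shards):
    """Map a key to a shard using a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    # Hypothetical addresses, passed in explicitly instead of via os.environ.
    shards = load_shards({"CUSTOMER_DB_SHARDS": "db-a:5432, db-b:5432, db-c:5432"})
    print(shard_for("customer-42", shards))
```

Simple modulo hashing like this moves most keys to a different shard whenever the number of shards changes, so a real system would likely want consistent hashing or a lookup table behind the same interface. The point is that callers only ever see `shard_for`, which leaves room to make that change later.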
Strange then that many a developer expects to move their existing application, full of enterprise DNA (static configuration, vertical clusters, no horizontal scaling, high administration costs) to such an offering with minimal change. They even complain when it proves difficult because all those “enterprise features” aren’t present. Why does this happen?
I believe it’s because these developers have fundamentally misunderstood how cloud computing delivers its benefits. They see the cheap prices but don’t stop to consider where the cost saving comes from. Some of it is achieved by cloud platform vendors getting large discounts on huge hardware orders but a significant proportion comes from the fact that they don’t need to provide (via human resources or APIs) the sysadmin functions required for conventional hosting solutions.
Quite simply, typical applications, their architectures and associated administration practices are not set up for cloud platforms. Some of them may be able to run on these platforms with sufficient hackery, brute force and associated cost. However, if the motivation for a move to the cloud is merely to reduce kit costs, one might well be better off looking for a cheaper conventional hosting solution.
In summary, making the best of the cloud requires that we take an architectural view, something that we’ve proven remarkably bad at over and over. Simply deploying an application unchanged to the cloud is unlikely to deliver much benefit.