Archive for the “Engineering” Category

Generally, the longer a defect remains undetected in a system, the more costly it will be to fix. I’ve seen this proven true over and over, but you don’t have to take my word for it: ask Steve McConnell.

I’ve always assumed this was well understood yet many organisations adopt processes, approaches and structures that guarantee certain kinds of defects will be undiscovered for substantial periods of time. One of the more common faults is the separation of Development and Operations.

Each side has its own view of what’s important and what they’re responsible for:

  • Operations more often than not seeks to own non-functional aspects (performance, stability, scalability etc).
  • Development more often than not seeks to own the functional aspects (features).

Such a mindset often leads to a classic process mistake: issues with the functional aspects get dealt with early, while those linked to the non-functional and operational aspects stay hidden until last-minute grand testing regimes (P&C, User Acceptance Testing) dig them out or, worse, until they are discovered at the point of release into production.

The warning signs are usually there if only we paid attention to them:

  1. Developers work in isolation building, deploying and configuring the components they develop in ways that suit them. It follows that deployment and configuration are not optimised for production and do not account for any hard won operational experience.
  2. Operations staff demand huge handover documents be written by developers and passed over with the product. Inevitably the documentation fails to account for operational concerns (what would a developer know about operations?)
  3. There are separate environments for the purposes of validating correctness and accuracy of handover documents. After all developers can’t be trusted to get the documentation right so it must be checked.
  4. The development environments are ad hoc with no resemblance to production (certainly they aren’t a scale unit of production). This leads to large numbers of problems at release time: files can’t be found, configurations are broken and various versioning issues present themselves.

The antidote is relatively straightforward: all development activity should be performed in a production-like situation. For example:

  1. Deployment and configuration of software components under development should be routinely performed by operational staff. The result is early knowledge transfer and the documentation can now be written by those best able to produce it (operations staff, not developers).
  2. Development environments should contain appropriate network topology. Often production setups contain segregated networks for security or availability reasons. Ensuring developers are exposed early to these issues means software is more likely to account for these demands.
  3. Monitoring and logging infrastructure should be as per production and used routinely for debugging and capture of data relevant to testing (performance, failure etc)
  4. Development environments should be scale units of production. This permits early production-like performance testing. This should be backed up with routine robustness testing e.g. to identify memory leaks early.

A typical reaction is for development and operations staff to say this cannot possibly work and will slow development to a crawl. They aren’t actually wrong but they’re missing a key insight:

If development has slowed to a crawl it’s an early warning of future production troubles.

For example, if deployment is taking too much effort and time, something needs tweaking, simplifying or automating. What we’ve done is best summarised by a proverb from Toyota (via Eric Ries):

“Stop production so that production never has to stop”.

We’ve created a feedback loop that highlights defects spanning all concerns (functional, non-functional and operational) early which keeps costs down.

Clearly, delivering a given feature will take a little longer as we must account for all aspects from functional through non-functional and operational. That’s acceptable because if we don’t cover all these aspects we’re asking for trouble in many forms including:

  • If we cannot adequately monitor the performance of a newly delivered feature there’s a direct impact on customer experience. They will know before we do that something is broken which leads to irate phone calls, lost revenue etc.
  • If we cannot adequately track the effect of a new feature on customer behaviours, we cannot evolve it appropriately.

Needless to say developing features in this fashion fits well with lean and agile approaches.

So the antidote is relatively straightforward and there are development approaches that fit well with what needs to be done. The toughest challenge remains, though: effecting the necessary mindset shift to get it done. It ought to be a little easier with the rise of DevOps, but there are already early signs of trouble, as has been seen with lean and agile adoption.

There are many who claim to know and practice each of these disciplines but most are paying only lip service, picking out the bits of process, mindset or tooling that suit them and ignoring the rest.

Sporting Index is right in the middle of making this tricky jump from Dev and Ops to DevOps. I’ll let you know how we get on.


Design is not rules, it’s not patterns, it’s not technological choices or indeed code. Design is tradeoffs, driven by data where possible and gut instinct. It’s about identifying the core challenges of a problem domain (which might ultimately be one or many systems) and addressing them through creation of appropriate abstractions. These abstractions embody:

  • Functions to be performed
  • Data to be discovered, consumed and produced
  • Non-functionals (e.g. SLAs)

The abstractions are then rendered into the real-world using appropriate hardware, technologies, patterns and languages. A good design:

  • Exhibits few exception cases
  • Has logic and/or data located neatly and predictably
  • Applies a small set of core constructs repeatedly
  • Addresses operational needs
  • Considers cost versus value delivered
  • Is as simple as possible
  • Has the minimum of implementation assumption

There are several key failing points in the design process:

  • No adjustment in the face of implementation feedback – No design is complete or perfect. There will always be missed details leading to brittle code, complex corner cases or convoluted solutions. It is critical that we monitor our progress and adapt the design accordingly.
  • No up front design – Design is the skeleton upon which we hang technology choices and code structure. In its absence we rapidly descend into a world of difficult-to-navigate code and costly constraints set by uninformed product choices.
  • No care in following the design – A key element of design is to place the right things in the right places. Failing to do this at code time increases coupling, makes maintenance difficult and can impact both performance and scalability. Similar effects occur as the result of poor technology selection.

Design and implementation go hand in hand yet many of us lack awareness of where the boundary between these two elements lies. We don’t understand how these elements interact with each other or appreciate the impact of decisions we make in respect of one element on the other.


Point the average development team at a problem and in very little time:

  • IDEs have been fired up and code is being cranked out
  • The same well-worn non-process is being followed as before
  • The testers and developers don’t talk to each other

The average development manager (and by implication their superiors) stands behind the team exhorting them to hit the keyboards and crank it out regardless:

  • They actively or passively encourage late night coding
  • There’s no concern over quality of development environment
  • Getting purchases signed-off takes far too long

There’s an endless parade of vendors promising this or that coding acceleration product that is always cheap and easy to integrate:

  • Swiss army knife frameworks
  • Automated metrics
  • Do it all database solutions
  • Testing tools
  • Code generators

The result is a self-reinforcing, software disaster generator endlessly thrashing around the same old cycle producing the same poor results but always with the expectation that “this time things will be different”. I’m pretty sure that falls under at least one definition of insanity.

I call this “code myopia” and believe it to be at the root of many ongoing industry problems including:

  • The failure to focus on customer value – why are we developing at all? What’s the minimum we need to deliver? What do our customers actually need?
  • Spiralling costs – if we’re intent on delivering value to our customers can we do it practically?
  • Operational nightmares – endless production rollbacks, painful deployment, no anticipation of production issues and slow resolution.
  • The absence of real design – the shape of a system is entirely dictated by favourite or already licensed technologies and maintenance is a nightmare.
  • Minimal advancement – focus instead is on poor reinventions of decades old algorithms or designs because few do their research or simply prefer to reinvent for intellectual entertainment.
  • Management of process – one manages people, the environment and work, not process.

Let’s be clear: The best code we can write is no code at all. We want maximum customer value for least effort and best possible profit. When that requires us to deliver code:

  • We want to leverage past experience
  • We want sustainable design
  • We want minimal code be it our own or vendors’
  • Nothing beats real-world testing
  • If we are to make mistakes, they should be new ones
  • We want products that work for us not our vendors
  • We want to be operationally effective
  • We want to get pragmatic about deadlines
  • We want active management
  • We want to help our customers innovate

That’s quite a challenge for all concerned (developers, testers, operational staff, management, architects) yet in a twist of irony, say the above to most people and they run back to what they know; those same people complain that their jobs are mostly hassle and there’s no real challenge! If however, you’re up for it, drop me a line.


My current company has, for obvious business reasons, a serious interest in delivering a quality website experience during the World Cup, so I’ve been spending a lot of time of late on our performance and capacity (P&C) management.

P&C is one of those 80/20 tradeoffs. There’s always more one can do, measure or test; equally, getting the basics in place will deliver substantial benefit. I’d go further and argue that without a solid grasp of the basics, one cannot easily determine what else might be required. Here then are the basics that I’ve found myself repeating over and over:

  • Have an enquiring mind – anomalies are not to be ignored or dismissed on the basis of pure speculation. Determining root cause is essential to prevent surprises in production. Some recent examples:
    1. In one test we noticed that every so often we’d get a substantial blip in disk I/O on servers that should be processing entirely out of memory. Along with that blip there’d be a corresponding reduction in throughput. We could have ignored it (after all, things sorted themselves out relatively quickly) but we chose to investigate. All these servers were periodically running a cleanup job that the developers were unaware of and had not factored into their capacity calculations. The implication for production would have been a regularly overloaded, badly performing website. We’ve since tuned the jobs, adjusted their schedules and increased our capacity to ensure we can always spread the load around enough to accommodate them.
    2. An examination of the distribution of load on the boxes behind our load-balancers revealed a higher than expected amount of variance in CPU and connections. A review of the application revealed that any particular user’s traffic is sticky to one box, which is unfortunate as the application is stateless. Time for a code change. We also spent time looking at the monitoring infrastructure and discovered that in certain cases we’d get false reports of 100% CPU utilisation; that one will be fixed with an OS patch.
  • Gather the right data – there’s no value in allowing oneself to be limited by what is easily available via some set of tools people are comfortable with. One tool we were using had an unreasonably low ceiling on the number and rate of samples it could handle, such that the graphs it produced showed hardly anything of the true profile of e.g. CPU utilisation, memory consumption or I/O. Forming any opinion about system behaviour under load was going to be an exercise in speculation. We junked the tool and are looking for a replacement. In the meantime we’ve fallen back to low level performance counters, which we can sample local to the machine and whack onto disk for later analysis via scripts, opensource tools etc.
  • Design tests that support reasoning – One should indeed try and replicate production load behaviours to judge overall system behaviour. The challenge of such testing is that it can be difficult to relate performance data back to exactly what was going on during some period of a test and make a diagnosis or be confident of an improvement. There are a number of things we can do to improve the situation:
    1. Ensure tests are deterministic such that any given run can be compared against other runs. This isn’t as simple as it looks when e.g. you wish to gradually increase load at a fixed rate that is being produced by more than one box.
    2. Have tests produce sufficient logging that one can easily identify what was going on at particular points in the sampled data. Logging of course can actually affect test behaviour and that isn’t always desirable.
    3. Build additional tests that target particular user journeys through the system. Doing this for all possible journeys can be costly, so it makes sense to focus on testing those which are most popular with users. These kinds of tests restrict the reasoning tree, making analysis, diagnosis and solution identification much easier.
  • Measure what customers care about – they don’t care about CPUs, I/O or memory, they worry about things like response times. It is important to focus on maintaining a quality user experience not endlessly improving system efficiency. Considering user factors such as response times stops us expending huge effort on CPU utilisation when we should be focusing on say, network I/O, browser performance or reducing the amount of data we push to the browser before a page can render.
  • Beware of averages – it is very tempting to combine datasets by averaging. Unfortunately, such a practice can easily hide spikes that might be indicative of a problem. On more than one occasion an engineer has presented a graph that tracks the average CPU and a table that summarises min, avg and max, after which they’ve pronounced load testing a success. Yet they have no explanation for why the average is never more than 50% but the max is 100%, or whether this is good or bad.
  • More than load – excessive focus on measuring the effect of a particular load can blind us to another important metric: resource cost per unit of work. This covers the collection of tests and analysis that helps us understand what to tune, and by how much, to keep our appetite for boxes and bandwidth reasonable. One simple thing teams can do each sprint (assuming you’re agile, and why wouldn’t you be?) is point a profiler at each component and look for the low-hanging fruit of poor algorithm selection or inefficient code (e.g. repeated scanning of lists where a hashmap would be better, or repeatedly computing something that could be cached).
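To make the “beware of averages” point concrete, here is a minimal sketch in Python. The sample data, the function names (`summarise`, `fraction_at_or_above`) and the choice of a nearest-rank 95th percentile are all assumptions made for illustration; they are not taken from our actual tooling. It shows how a CPU trace can report a comfortable average while still spending a tenth of its time saturated.

```python
import statistics


def summarise(samples):
    """Summarise CPU utilisation samples (percent) beyond a plain average."""
    ordered = sorted(samples)
    # Nearest-rank 95th percentile, using integer arithmetic for the rank:
    # rank = ceil(0.95 * n), clamped to at least 1.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


def fraction_at_or_above(samples, threshold):
    """Fraction of samples at or above a threshold, e.g. time spent saturated."""
    return sum(1 for s in samples if s >= threshold) / len(samples)


# Hypothetical trace: 90 quiet samples and a burst of 10 saturated ones.
samples = [40.0] * 90 + [100.0] * 10
summary = summarise(samples)
saturated = fraction_at_or_above(samples, 100.0)

print(summary)
print(f"time at 100% CPU: {saturated:.0%}")
```

With this made-up trace the mean stays under 50% even though the 95th percentile and the max both sit at 100%. The percentile and the time-at-saturation figure are what prompt the questions the average hides: when did the burst occur, and what was the system doing at that point?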


A programming language is a tool. These days, in fact, it’s more a toolbox, as there’s an entire ecosystem associated with a language that makes it more or less suitable for a particular discipline (e.g. website development). There are many other tools beyond languages of course: CORBA, J2EE, SOAP, AJAX, Visual Studio .NET, Emacs etc.

The obsession we have with our tools is verging on the sexual. We worship them, we endlessly compare them, we get excited about this or that extension. It drives much conversation in corridors and at conferences but it’s largely worthless because there’s no context.

Does a carpenter get excited about a saw, a power-drill or the latest hammer? Not really, because long ago they realised that whilst one must know how to make effective use of a tool and how to maintain it whilst it goes unused, what really matters is figuring out what the job itself actually is. This is the context that dictates which tools are appropriate.

We speculate about concurrency, we speculate about building websites, we speculate about writing this or that application but it’s all pointless until we actually set about a specific task with intent.

The smart techie has a good grasp of a wide range of tools, knows when to use them and ensures they have meaningful escape plans (that may never be implemented) in case the day comes when those tools turn out to be the wrong choice or need replacing. Most of all a smart techie puts thinking and planning well before worrying about tools.

In simple terms, we need to stop playing with our tools and focus on the real challenge: tackling real-world problems with elegant, simple, well thought out, maintainable, cost-effective solutions. Tools help you build such things but they aren’t the essence of it.
