We’ve all seen it: customers change their requirements, add a few more features and yet expect the project deadline to stay the same, even though there are no additional resources.

For some reason they act as if a software team has infinite, cost-free capacity. The psychology that drives this behaviour is somewhat unclear because there are various potential motivators such as political ambition, naivety or willful ignorance.

One might expect to see this problem occurring in waterfall projects but it can also plague early agile projects. Typically the backlog grows and grows, the customer has a desired release date in mind and expresses horror when it becomes clear that the whole backlog cannot possibly be implemented in the timeframe (accompanied by cries of “but I followed the process”).
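The mismatch is easy to make concrete with a little arithmetic. The sketch below is illustrative only: the function names and all the figures (240 points of backlog, 20 points per sprint, 8 sprints to the release date) are invented for the example, and it assumes a steady velocity.

```python
import math


def sprints_needed(backlog_points, velocity):
    """Sprints required to clear the backlog at a steady velocity."""
    return math.ceil(backlog_points / velocity)


def points_that_fit(sprints_available, velocity):
    """Backlog points that can be delivered before the release date."""
    return sprints_available * velocity


# Invented numbers: 240 points of backlog, 20 points per sprint,
# 8 sprints left before the desired release date.
print(sprints_needed(240, 20))   # 12
print(points_that_fit(8, 20))    # 160
```

With these made-up numbers a third of the backlog cannot ship by the desired date, however faithfully the process was followed.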

It shouldn’t be possible to make this mistake given real-world experiences. For example:

We put our car in for an oil change and get a quote for the cost and an estimate of how long the work will take. We drop the car at the garage and then, a little later, phone up and request additional work such as fixing the air-conditioning, replacing two tires, sorting the exhaust and swapping out the brake pads. Not for a second do we entertain the idea that the cost and time for the work will be the same as originally quoted.

Yet we still persist in the notion that a software development team is a bottomless pit of resource.

It’s tempting, when trying to be customer-centric, to focus on delivering lots of functionality quickly. Supposedly features win the race and can increase revenue, but is that all that matters? Evidence such as the troubles Twitter have had in the past and this anecdote from Google about search time suggests there are other qualities of our website that matter, such as:

  • Service charges
  • Responsiveness
  • Availability
  • Quality of interaction

Whilst these qualities are all about the customer experience, success in maintaining them at an appropriate level is related to how well a company performs internally:

  • It’s undesirable to charge excessively to cover development inefficiencies caused, for example, by a tightly coupled architecture that makes even a small change a multi-month death-march.
  • A service that runs slowly at peak times, because our architecture and code pay insufficient attention to performance and scaling, appears sluggish or even down, which can drive customers away.
  • Prolonged outages caused by trivial problems that take operational staff excessive time to fix, because of poor monitoring and diagnostic tools, will impact customer satisfaction.
  • If we routinely roll back upgrades, or they’re brittle or bug-ridden, we will negatively impact the quality of interaction.

Thus being more customer-centric requires a company to quantify its performance and work to improve it. In the case of the examples above, things like response time, site downtime, number of failed upgrades, time to perform a release, bug counts and feature count against cost of delivery can be used as metrics to indicate how we’re doing in our mission to make the customer happy. Methods for improving these metrics, though not always easy to apply, are relatively well-understood and include:

  • Ensuring architecture/design includes well-defined interfaces, avoiding integration via databases etc.
  • Considering scalability: how many machines can be thrown at a problem and are they used efficiently? Essentially, balancing horizontal-scale and straight-line optimisation.
  • Removing computation from the critical path to generating a user-response, e.g. by using asynchronous methods.
  • Publishing software and hardware telemetry, gathering it all up (using the right infrastructure) and performing appropriate analysis via tools etc.
  • Focusing on simplicity, isolation of components, failure tolerance, in-live testing, versioning and the ability to roll back rapidly.
  • Applying an appropriate testing regime.
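As one illustration of taking work off the critical path, the sketch below hands slow follow-up work to a background worker via a queue so the user-facing call can return straight away. It is a minimal, single-process sketch using Python's standard `queue` and `threading` modules; the names `handle_request` and `send_receipt` are hypothetical, and a production system would more likely use a durable message queue.

```python
import queue
import threading

# Hypothetical example: handle_request and send_receipt are illustrative
# names, not taken from any particular system.
work_queue = queue.Queue()
sent_receipts = []


def send_receipt(order_id):
    """Stand-in for slow work (email, reporting) the user need not wait for."""
    sent_receipts.append(order_id)


def worker():
    """Drain the queue in the background until a None sentinel arrives."""
    while True:
        order_id = work_queue.get()
        if order_id is None:
            break
        send_receipt(order_id)


def handle_request(order_id):
    """Do only the essential work inline, then defer the rest."""
    work_queue.put(order_id)                # deferred: off the critical path
    return f"order {order_id} accepted"     # returned without waiting


background = threading.Thread(target=worker)
background.start()
response = handle_request("A-1")
work_queue.put(None)   # ask the worker to stop once the queue is drained
background.join()
```

The trade-off is that the user gets an acknowledgement before the deferred work has completed, so failures in that work need their own handling and monitoring.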

Ultimately everything a company does internally has implications for customers. This includes decisions that are normally notoriously subjective, such as technology selection. In this particular case we ought to test the technology and assess its effect on relevant metrics to verify that it does provide meaningful benefits. Also, as most technology has its downsides, we can quantify these too and ensure there’s an appropriate trade-off.

Some notes on areas of software development where obsession with perfection can be costly.

Standards – where everything is standardised up front so as to avoid unnecessary diversity or cost. Standards, like design and code, are based on assumptions about the environment in which they will be used. Thus to be sure that a standard is appropriate, one must have experience in the environment. Without the experience, selection or formation of a standard is to a reasonable degree guesswork. Asserting a standard up front often leads organisations into incurring significant costs (e.g. manpower, slow release cycles) as they twist what they do to fit some standard they’ve selected instead of recognising that the standard is inappropriate for a given situation. Standardisation should be done after a period of diversity/investigation to identify what is or is not applicable.

Estimates – where an organisation engages in grand analyses to deliver up accurate estimates for all pieces of work required so that resources can be correctly determined, budgets set and timelines agreed prior to the start of the implementation phase with the assumption that nothing will change. The futility of these efforts can be found in a dictionary:

es·ti·mate (ĕs′tə-māt′)

tr.v. es·ti·mat·ed, es·ti·mat·ing, es·ti·mates

  1. To calculate approximately (the amount, extent, magnitude, position, or value of something).
  2. To form an opinion about; evaluate: “While an author is yet living we estimate his powers by his worst performance” Samuel Johnson.

n. (-mĭt)

  1. The act of evaluating or appraising.
  2. A tentative evaluation or rough calculation, as of worth, quantity, or size.
  3. A statement of the approximate cost of work to be done, such as a building project or car repairs.
  4. A judgment based on one’s impressions; an opinion.

Estimates are by definition guess-work; they can only ever be made accurate in hindsight. Agile is one way to get realistic about estimates.

Architecture – where organisations and individuals engage in the illusion that one can design an endlessly flexible architecture that copes with all possible unknown future demands. Just as with estimates, symptoms include grand, long-running rituals to examine every detail, verify that the architecture is sound and that nothing has been forgotten. Businesses change their processes, unexpected integrations are undertaken, products are dropped and new ones are dreamt up, hosting options change and operational challenges appear unexpectedly. The weapons for dealing with this challenge are not do-it-all containers or grand architectures, rather they are things like:

  • Principles and guidelines for keeping an architecture adaptable: loose coupling, high cohesion, no broken windows, isolation etc.
  • Validation metrics: to indicate when an architectural assumption has been breached and needs re-addressing.
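A validation metric can be as simple as an explicit limit attached to each architectural assumption, checked against what is actually observed. The sketch below is one hypothetical way to express that; the `Assumption` class, the metric names and the limits are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumption:
    """An architectural assumption expressed as a limit on a metric."""
    metric: str
    limit: float
    description: str


def breached(assumptions, observed):
    """Return the assumptions whose observed metric exceeds its limit."""
    return [a for a in assumptions
            if observed.get(a.metric, 0.0) > a.limit]


# Illustrative numbers only; real limits come from the design's own assumptions.
assumptions = [
    Assumption("peak_requests_per_sec", 500.0, "one database node is enough"),
    Assumption("p95_response_ms", 300.0, "synchronous rendering is fast enough"),
]
observed = {"peak_requests_per_sec": 620.0, "p95_response_ms": 180.0}

for a in breached(assumptions, observed):
    print(f"Revisit: '{a.description}' ({a.metric} > {a.limit})")
```

Writing the assumption down next to its limit is the useful part: when the check fires, it names the design decision that needs re-addressing rather than just reporting a number.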

Implementation – where developers engage in the writing of complex, brittle code that is difficult to maintain and debug, in order to address challenging problems that can be better dealt with via a process or architectural approach. Hot-update of configuration is one such example. At least in the Java world, many a developer will be tempted to tackle this problem with:

  • Clever thread management and barriers to pause work whilst configuration changes are made
  • A remote admin interface to permit lodging of configuration changes
  • Event listeners to respond to the configuration change and patch up various bits of state
  • Sundry bits of classloader magic

From an architectural perspective however, one realises that:

  1. Configuration is most easily done when a process boots: there’s minimal state to patch and no active workload to manage.
  2. At least in the case of a website, one must have failure handling mechanisms to cope with lost boxes, broken networks and failing processes.
  3. Hot re-configuration actually means provision of service without disruption to the user.

Thus hot re-configuration becomes much simpler: kill process, change configuration, re-start process.
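The kill, change, re-start cycle can be sketched as a rolling restart across a pool of processes. This is a simulation under stated assumptions, not a deployment tool: `Instance` and `rolling_reconfigure` are hypothetical names, the pool has at least two members, and the failure-handling mechanisms from point 2 are assumed to route traffic away from whichever instance is stopped.

```python
class Instance:
    """A stand-in for one server process; it reads its config only at boot."""

    def __init__(self, name, config):
        self.name = name
        self.running = False
        self.config = None
        self.start(config)

    def start(self, config):
        self.config = dict(config)   # configuration is applied at boot
        self.running = True

    def stop(self):
        self.running = False


def rolling_reconfigure(instances, new_config):
    """Restart instances one at a time; return the fewest serving at any point."""
    min_serving = len(instances)
    for inst in instances:
        inst.stop()                                  # kill process
        serving = sum(i.running for i in instances)
        min_serving = min(min_serving, serving)
        inst.start(new_config)                       # change configuration, re-start
    return min_serving


pool = [Instance(f"web-{n}", {"timeout_s": 5}) for n in range(3)]
lowest = rolling_reconfigure(pool, {"timeout_s": 10})
```

No instance ever has to patch live state: each one simply boots with the new configuration while the others carry the load.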

Hardware – where expensive kit never fails, making software easier to write but ultimately compromised when it comes to availability as customer usage grows. Hardware does fail; in fact, once an organisation accrues enough hardware there will be failures daily, and it’s not cost-effective to pay for operational staff to run around trying to keep all this hardware running all the time (and it increases the chance of human error, another key contributor to availability problems). In the early days of a system, use of redundant hardware solutions is acceptable, as it’s more important to get things up and running, but it pays to:

  1. Monitor service availability.
  2. Track hardware failure patterns.
  3. For each outage, maintain an estimate of cost in both operational and revenue terms.
  4. Track user behaviour after an outage: do they return and if so, how quickly?

These sorts of metrics allow an organisation to determine when to start shifting failure tolerance into the software layers. Marcus and Stern provide a good treatment of systems availability practices.
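As a rough sketch of how the first and third of these measures might be recorded, the snippet below computes availability over a period and the combined operational and revenue cost of its outages. The `Outage` record and every figure in it are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    minutes: float
    ops_cost: float       # staff time spent on the fix
    lost_revenue: float   # estimated revenue lost during the outage


def availability(outages, period_minutes):
    """Fraction of the period during which the service was up."""
    down = sum(o.minutes for o in outages)
    return (period_minutes - down) / period_minutes


def total_cost(outages):
    """Combined operational and revenue cost of all outages."""
    return sum(o.ops_cost + o.lost_revenue for o in outages)


# Made-up figures for a 30-day month (43,200 minutes).
month = [Outage(90, 400.0, 2500.0), Outage(126, 600.0, 3500.0)]
print(f"availability: {availability(month, 43_200):.2%}")
print(f"outage cost:  {total_cost(month):.2f}")
```

Tracked month on month, the cost figure is what tells an organisation when paying for failure tolerance in software becomes cheaper than continuing to absorb outages.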

Cloud computing platforms offer many benefits including:

  1. Cheaper operational costs.
  2. Dynamic scaling in response to load spikes.
  3. Roll-on, roll-off deployments, e.g. for newspaper archive processing.

These platforms exist as the result of the investment of companies such as Amazon, Google and Microsoft in developing cost-effective infrastructure with system to administrator ratios of 2500:1 (whilst the average enterprise manages around 150:1 and inefficient properties manage maybe 10:1).
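To see what those ratios mean in headcount, the short calculation below applies them to a fleet of 10,000 systems. The ratios are the ones quoted above; the fleet size is an assumption picked purely for illustration.

```python
import math

# Ratios quoted in the text; the fleet size of 10,000 is an assumption.
RATIOS = {"cloud-scale": 2500, "average enterprise": 150, "inefficient": 10}


def admins_needed(systems, systems_per_admin):
    """Administrators required, rounded up to whole people."""
    return math.ceil(systems / systems_per_admin)


for label, ratio in RATIOS.items():
    print(f"{label:>20}: {admins_needed(10_000, ratio)} administrators")
```

For this hypothetical fleet the three ratios work out to 4, 67 and 1,000 administrators respectively.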

Key to allowing these infrastructures to be efficient and in turn deliver the benefits above is having applications architected such that:

  1. They don’t require masses of administrator intervention when they go wrong.
  2. They can be installed with minimal administrator effort because there’s no need to worry about tweaking URLs, IP addresses, database connections etc.
  3. They readily support horizontal scaling e.g. because they contain an abstraction that can support sharding of data-storage.

In essence an application must be designed for zero administrator intervention and fully automated deployment. It should also have a variable workload component that magnifies the savings of the architectural properties above.
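The third property is the easiest to show in code. The sketch below is a minimal, in-memory illustration of an abstraction that hides sharding from its callers; `ShardedStore` is a hypothetical class, and the simple hash-modulo placement is an assumption chosen for brevity (changing the shard count remaps most keys, which is why real systems often prefer consistent hashing).

```python
import hashlib


class ShardedStore:
    """Key-value store facade that hides which shard holds a key."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate storage node.
        self._shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash (not Python's per-process hash()) so every
        # process maps the same key to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._shards)
        return self._shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedStore(shard_count=4)
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))
```

Because callers only ever see `put` and `get`, storage can be spread over more machines without touching application code.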

Strange then that many a developer expects to move their existing application, full of enterprise DNA (static configuration, vertical clusters, no horizontal scaling, high administration costs) to such an offering with minimal change. They even complain when it proves difficult because all those “enterprise features” aren’t present. Why does this happen?

I believe it’s because these developers have fundamentally misunderstood how cloud computing delivers its benefits. They see the cheap prices but don’t stop to consider where the cost saving comes from. Some of it is achieved by cloud platform vendors getting large discounts on huge hardware orders but a significant proportion comes from the fact that they don’t need to provide (via human resources or APIs) the sysadmin functions required for conventional hosting solutions.

Quite simply, typical applications, their architectures and associated administration practices are not set up for cloud platforms. Some of them may be able to run on these platforms with sufficient hackery, brute force and associated cost. However, if the motivation for a move to the cloud is merely to reduce kit costs, one might well be better off looking for a cheaper conventional hosting solution.

In summary, making the best of the cloud requires that we take an architectural view, something that we’ve proven remarkably bad at over and over. Simply deploying an application unchanged to the cloud is unlikely to deliver much benefit.

The codebase of a subsystem, or maybe the whole system, has turned into a big ball of mud. It’s claimed to be too brittle, too complex and too costly to continue developing. It’s at this point that a grand rewrite is proposed, accompanied by statements of how things will be different:

  • We’ll eliminate static wiring using Spring
  • We’ll model everything as a service
  • We’ll adopt test-driven development and make use of jMock
  • We’ll build everything using a RESTful approach
  • We’ll avoid using RPC in favour of messaging
  • …….

Things will be so much better in this brave new world but... they won’t. The reason the codebase has got into a mess is that we failed to execute on important principles such as:

  1. Take account of coupling and cohesion.
  2. Be clear about people’s roles and responsibilities to avoid unqualified or inappropriate decision making.
  3. Keep the roles and responsibilities of design elements clear and simple.
  4. Maintain modular, well-isolated code and conceptual integrity.
  5. Avoid shared data-schemas or integration via the database.
  6. Make the software testable and maintain the tests.
  7. Select technology based on appropriate design work.
  8. No broken windows.
  9. Track and maintain appropriate metrics.
  10. Review projects to identify and disseminate useful lessons to developers, architects and customers.
  11. Account for the operational aspects of our software in requirements and design.
  12. Review to ensure code aligns with appropriate design principles.
  13. Surface, balance and mitigate risks.

It’s these principles and others that enable superior engineering, which in turn delivers a good-quality, maintainable codebase. Any rewrite will end up a ball of mud just like its predecessor unless the style of engineering is adapted to incorporate principles such as these.

Some propose that frameworks can prevent mistakes, ensure a quality design and deliver testable code. I think experience suggests otherwise, as we routinely (by accident or design) bend frameworks to fit some problem they weren’t really designed for, leading to ugly, broken, poorly designed, brittle code. What would stop us doing it with new frameworks delivered as part of a grand re-write?

Should we successfully revise our engineering practices, would we then have sufficient leverage to restructure our ball of mud into something nicer to work with? Maybe, maybe not, but we might be better equipped to answer the question: re-write or re-factor?
