Posts Tagged “development”

We’ve all seen it: customers change their requirements, add a few more features and yet expect the project deadline to stay the same, even though no additional resources are on offer.

For some reason they act as if a software team has infinite, cost-free capacity. The psychology that drives this behaviour is somewhat unclear because there are various potential motivators such as political ambition, naivety or willful ignorance.

One might expect to see this problem occurring in waterfall projects but it can also plague early agile projects. Typically the backlog grows and grows, the customer has a desired release date in mind and expresses horror when it becomes clear that the whole backlog cannot possibly be implemented in the timeframe (accompanied by cries of “but I followed the process”).

It shouldn’t be possible to make this mistake given real-world experiences. For example:

We put our car in for an oil change and get a quote for the cost and an estimate of how long the work will take. We drop the car in at the garage and then a little later phone up and request additional work such as fixing the air-conditioning, replacing two tires, sorting the exhaust and swapping out the brake pads. Not for a second do we entertain the idea that the cost and time for the work will be the same as originally quoted.

Yet we still persist in the notion that a software development team is a bottomless pit of resource.


Those specifying requirements often express them without consideration for the passing of time, assuming that actions are instantaneous. A naive development team with limited experience in distributed systems will then make the classic mistake of attempting to implement those requirements to the letter. This can lead to a bunch of undesirable outcomes including:

  • Brittleness in the face of failure.
  • High cost solutions.
  • Poor scaling properties.
  • Disappointment as the expectations of the requirements source aren’t met.

Consider a system where an observer is two (network) hops away and the initiator of an action is one hop away (assuming uniform network latency for each hop). In the time a single observation takes to reach the observer, the initiator can potentially complete two further actions. Thus each observation of the system is out of date by the time it reaches the observer.

Administrative actions can suffer similar problems, in that it could take several hops for the request to arrive at the system. A user may be only one hop away and could be performing many operations in the time it takes for one of our actions to reach the system. For example, if we wish to block a user, they might perform several operations whilst our request is in transit.
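The arithmetic behind both examples can be sketched in a few lines of C. This is a deliberately crude model, not a measurement tool: it assumes every hop costs the same latency and that the nearer party issues a new operation as soon as the previous one has arrived. The function name ops_during_transit is made up for illustration.

```c
#include <assert.h>

/*
 * Hypothetical model: uniform latency per hop, and the nearer party
 * issues its next operation as soon as the previous one has arrived.
 * Returns how many operations the nearer party can land on the system
 * while a single message from the farther party is still in transit.
 */
unsigned ops_during_transit(unsigned far_hops, unsigned near_hops)
{
    assert(near_hops > 0);        /* zero hops away means "is the system" */
    return far_hops / near_hops;  /* whole operations only */
}
```

With an observer two hops away and an initiator one hop away, ops_during_transit(2, 1) gives 2, matching the two-actions-per-observation case above. An administrator three hops away blocking a user one hop away has to allow for three user operations before the block lands.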

Things are made worse by network failures which can further delay or prevent execution of an action and slow down the rate of updates to an observer.

How then do we account for these troubles when specifying requirements? By qualifying them with appropriate SLAs. In the example above, appropriate SLAs might include:

  • Time for propagation of an administrative action.
  • Maximum acceptable time after the action is triggered for a user to be blocked.

SLAs such as the above:

  1. Help us to identify appropriate solutions (e.g. do we need to pay for multiple independent routes between data-centres?).
  2. Allow us to make appropriate use of asynchronous operations and eventual consistency.
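To see how a number attached to the "block a user" requirement steers the design, here is a minimal sketch in C. It assumes a worst case in which every hop times out a fixed number of times before succeeding. The names worst_case_delay_ms and meets_block_sla, and all the parameters, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical worst case for one administrative action: every hop
 * times out max_retries_per_hop times before finally succeeding.
 */
unsigned worst_case_delay_ms(unsigned hops, unsigned hop_timeout_ms,
                             unsigned max_retries_per_hop)
{
    assert(hops > 0);
    return hops * hop_timeout_ms * (1 + max_retries_per_hop);
}

/* True if this design can promise the block takes effect within the SLA. */
bool meets_block_sla(unsigned sla_ms, unsigned hops, unsigned hop_timeout_ms,
                     unsigned max_retries_per_hop)
{
    return worst_case_delay_ms(hops, hop_timeout_ms, max_retries_per_hop)
           <= sla_ms;
}
```

With three hops, a 200 ms timeout and two retries per hop the worst case is 1800 ms. A one second SLA would therefore push us towards fewer hops, shorter timeouts or an independent second route, which is exactly the conversation to have with the requirements source.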

Since SLAs have a significant impact on the way in which a requirement will be implemented, it is essential to perform appropriate expectation management, discussing and communicating the implications with the requirements source. SLAs cannot be solely the domain of techies. Remember also that in many situations customers prefer availability over consistency.


Neglecting to account for failure is an age-old problem. Consider this common error (Purify anybody?):

#include <stdio.h>
#include <stdlib.h>
struct rhubarb {
  int aVal;
  int anotherVal;
  char* aString;
};
/* ... */
  struct rhubarb* mystruct;
  mystruct = malloc(sizeof(struct rhubarb));
  mystruct->aVal = 55;  /* undefined behaviour if malloc returned NULL */
/* ... */

Of course the following code should have been included after the malloc:

/*
  If memory wasn't allocated, do something appropriate.
*/
if (mystruct == NULL) {
  /* ... */
}
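What "something appropriate" means depends on the program, but one common shape is to check every allocation, undo any partial work and report the failure to the caller. The sketch below assumes that approach. rhubarb_new and rhubarb_free are names invented for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct rhubarb {
  int aVal;
  int anotherVal;
  char* aString;
};

/*
 * Hypothetical constructor: returns a fully initialised struct, or NULL
 * if either allocation fails. Nothing is leaked on the failure path.
 */
struct rhubarb* rhubarb_new(int aVal, const char* aString)
{
  assert(aString != NULL);  /* a caller bug, not a runtime failure */

  struct rhubarb* r = malloc(sizeof *r);
  if (r == NULL) {
    return NULL;
  }

  size_t len = strlen(aString) + 1;
  r->aString = malloc(len);
  if (r->aString == NULL) {
    free(r);  /* undo the first allocation before reporting failure */
    return NULL;
  }
  memcpy(r->aString, aString, len);

  r->aVal = aVal;
  r->anotherVal = 0;
  return r;
}

/* Releases everything rhubarb_new allocated; safe to call with NULL. */
void rhubarb_free(struct rhubarb* r)
{
  if (r != NULL) {
    free(r->aString);
    free(r);
  }
}
```

The caller still has to check the result of rhubarb_new for NULL, of course. The check has only moved to the place where a sensible decision can be made.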

An equivalent mistake is easily made when building a distributed system with HTTP or RMI: ignoring the error codes or exceptions that are designed to communicate failures we ought to handle. It’s similarly easy to ignore latency, implement brittle and dumb retry logic, or assume something is reliable (like a message queue) when it isn’t. Many have managed to concoct systems with HTTP that breach the idempotency “constraints” of REST, and whilst Erlang provides link() and receive timeouts, we’re not forced to use them.
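For contrast, here is a rough sketch of retry logic that is at least not dumb: it inspects the status code, retries only failures that might be transient, backs off between attempts and gives up after a bounded number of tries. Everything here is hypothetical (the call_status values, the 100 ms base and 5000 ms cap, the fails_n_times stand-in for a remote call), and retrying at all is only safe if the remote operation is idempotent.

```c
#include <assert.h>
#include <stddef.h>

enum call_status { CALL_OK, CALL_TRANSIENT_FAILURE, CALL_PERMANENT_FAILURE };

typedef enum call_status (*remote_call_fn)(void* ctx);
typedef void (*sleep_fn)(unsigned ms);

/* Delay before retry number `attempt` (0-based): doubles each time, capped. */
unsigned backoff_ms(unsigned attempt, unsigned base_ms, unsigned cap_ms)
{
    unsigned delay = base_ms;
    while (attempt-- > 0) {
        if (delay > cap_ms / 2) {
            return cap_ms;  /* doubling would exceed the cap (or overflow) */
        }
        delay *= 2;
    }
    return delay < cap_ms ? delay : cap_ms;
}

/* Calls `call` up to max_attempts times; `pause` may be NULL to skip waiting. */
enum call_status call_with_retry(remote_call_fn call, void* ctx,
                                 unsigned max_attempts, sleep_fn pause)
{
    assert(call != NULL && max_attempts > 0);

    enum call_status status = CALL_TRANSIENT_FAILURE;
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        status = call(ctx);
        if (status != CALL_TRANSIENT_FAILURE) {
            return status;  /* success, or a failure retrying cannot fix */
        }
        if (pause != NULL && attempt + 1 < max_attempts) {
            pause(backoff_ms(attempt, 100, 5000));
        }
    }
    return status;  /* still failing: surface it rather than loop forever */
}

/* Stand-in for a remote call: fails *ctx more times, then succeeds. */
enum call_status fails_n_times(void* ctx)
{
    unsigned* remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return CALL_TRANSIENT_FAILURE;
    }
    return CALL_OK;
}
```

Passing the sleep function in as a parameter is a design choice that keeps the logic testable without real delays. None of this forces anybody to call it, which is rather the point of the paragraph above.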

In essence there is no way to ensure developers do the right thing in a single-process or distributed context. No technology, tool or design approach can prevent developers from making poor implementation decisions, which limits the value in re-hashing (Steve, Steve and Stu) RPC rights and wrongs.

I believe the best chance we have of doing distributed right lies not in providing some de facto standard toolset but in education[1] and mentoring to encourage the correct mindset. Such a mindset allows a developer building a distributed system to choose the most appropriate tools and use them right.

[1] Material to be covered would be substantially broader than the fallacies, failure handling and latency, and should probably include: logical time, FLP, failure detectors, global snapshots and Paxos.


The act of design includes:

  • Consideration of many possible technology options
  • Examination and identification of constraints
  • Thought about the pros and cons of using various patterns and styles
  • Comparing various splits of role and responsibility
  • Looking at various tradeoffs of complexity versus function
  • Formulation of opinion on possible future directions of system growth

A large proportion of this information is lost when the design document is written, because the focus is typically on providing a (notionally) definitive view of how a system should be structured. That view might take the form of a bunch of UML diagrams, or merely a collection of Visio-type diagrams with an explanation of what each box does. At the code level there is almost no chance that any of this information will have been retained. Yet this information is of high value since it:

  • is the explanation as to why a design is the way it is
  • provides reviewers with a clearer view of what was and was not considered
  • forms the basis for assessment of the maturity of a designer and can be used for coaching/mentoring
  • can provide insight for those with less experience
  • contains assumptions which if breached by changes in circumstance would dictate a re-design
  • dictates to a large extent how suitable for purpose a design might be

Thus I believe it’s important to expose elements of the act of design by documenting them alongside the design itself, capturing conversations held during the design work, etc.


Some of the more common software development mistakes I’ve seen…

Ignoring the triangle – The triangle represents a trade-off between three core elements of software delivery – resource, product (features, non-functionals, quality) and schedule. One can only ever control two elements, the third being determined by the decisions regarding the other two. So if one wishes to dictate product and schedule, sufficient resource must be made available to complete the task in the allotted time. If one wishes to dictate product and resource, then the schedule cannot be limited. It is simply “as long as it takes”. And if one wishes to dictate resource and schedule, then product features, quality etc. must be traded away to allow completion of development within the time allotted.

It’s amazing how often organisations attempt to dictate all three elements and are then surprised when a project gets messy. Of course, development processes have evolved in recognition of this trade-off – agile for example is great for prioritising, dropping features and getting something useful out the door in a reasonable timeframe with limited resources.
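The "pick two, derive the third" rule can be written down as toy arithmetic. The sketch below is a deliberately naive model: it assumes scope can be measured in person-weeks and that output scales linearly with headcount, which real teams rarely manage. All three function names are invented for illustration.

```c
#include <assert.h>

/* Product and resource fixed: the schedule is whatever it takes. */
unsigned weeks_needed(unsigned scope_person_weeks, unsigned people)
{
    assert(people > 0);
    return (scope_person_weeks + people - 1) / people;  /* round up */
}

/* Product and schedule fixed: this much resource must be found. */
unsigned people_needed(unsigned scope_person_weeks, unsigned weeks)
{
    assert(weeks > 0);
    return (scope_person_weeks + weeks - 1) / weeks;    /* round up */
}

/* Resource and schedule fixed: this is all the product we can have. */
unsigned scope_deliverable(unsigned people, unsigned weeks)
{
    return people * weeks;
}
```

Under this model, forty person-weeks of scope with four people takes ten weeks. Demand it in eight weeks and either a fifth person appears or only thirty-two person-weeks of product ships.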

Heroic efforts – these are a bad sign. A regular pattern of projects turning into mad hack-fests, saved by some apparently super-talented individual(s) is indicative of broken processes. One step in addressing this problem involves an honest surgery immediately after the project to determine root causes (e.g. inadequate risk management) of the meltdown and methods of prevention for future projects (e.g. regular risk review and identification of appropriate mitigations).

In the very worst cases, management actively encourages such heroism via recognition and reward. Worshipping this kind of carnage and supposed miracle recovery is tantamount to approving bad project management. Note that well-intentioned management can unknowingly drive this behaviour. From McConnell’s Rapid Development:

Some managers encourage heroic behaviour when they focus too strongly on can-do attitudes. By elevating can-do attitudes above accurate and sometimes gloomy status reporting, such project managers undercut their ability to take corrective action. They don’t even know they need to take corrective action until the damage is done. As Tom DeMarco says, can-do attitudes escalate minor setbacks into true disasters.

No Risk Management – One can never predict or spot all the risks but there are some obvious ones that get missed over and over. For example, we’re building a piece of software that relies on a component we’ve not used before. This is a big risk, one that can be mitigated by writing a test-harness or simulation of the way in which we plan to use the component.

The simulation should include realistic load, failure conditions, maintenance etc. and should be run as close to the beginning of the project as possible to surface any issues early (we cannot afford to wait until final QA or deployment testing). There can be no shirking here because, should our chosen component fail, we will need these tests in place so we can validate potential replacements as quickly and easily as possible.
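As a rough idea of the skeleton such a harness might have, here is a sketch in C. It assumes we can hide the component behind a single function pointer, so a replacement can be dropped in later and run through the same tests. The component_put_fn interface, the harness_report struct and the flaky_store stand-in are all hypothetical. A real harness would wrap the actual component and add realistic load and maintenance scenarios.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slice of the component's API that our design depends on. */
typedef int (*component_put_fn)(void* component, int key, int value); /* 0 = ok */

struct harness_report {
    unsigned attempted;
    unsigned succeeded;
    unsigned failed;
};

/* Drives `operations` writes through the component and tallies the outcomes. */
struct harness_report run_load(component_put_fn put, void* component,
                               unsigned operations)
{
    assert(put != NULL);

    struct harness_report report = {0, 0, 0};
    for (unsigned i = 0; i < operations; i++) {
        report.attempted++;
        if (put(component, (int)i, (int)i * 2) == 0) {
            report.succeeded++;
        } else {
            report.failed++;  /* never ignore the error code */
        }
    }
    return report;
}

/* Stand-in component for trying out the harness: fails every Nth call. */
struct flaky_store {
    unsigned calls;
    unsigned fail_every;  /* 0 means never fail */
};

int flaky_store_put(void* component, int key, int value)
{
    struct flaky_store* store = component;
    (void)key;
    (void)value;
    store->calls++;
    if (store->fail_every != 0 && store->calls % store->fail_every == 0) {
        return -1;
    }
    return 0;
}
```

Because run_load only sees the function pointer, validating a replacement component means writing one small adapter and re-running the same scenarios.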

Waterfall Agile – We’re supposedly “doing” agile but one or more of the following are true:

  1. There’s a fixed deadline, with fixed features and fixed resources.
  2. All negotiation/trade-off is done prior to project commencement with no review between sprints.
  3. All sprints have been planned out in advance right up to the release date with no spare time.
  4. There are no risks to manage (because there aren’t any apparently).
  5. No-one is entertaining the idea of unknowns.
  6. When a sprint doesn’t deliver as anticipated, outstanding work is simply crammed into the remaining sprints.
