Need

It’s not about profit – making money does not help customers.

It’s not about cost-saving – saving money does not help customers.

It’s not about shareholders – pleasing shareholders does not help customers.

It’s not about features – delivering features does not help customers.

It’s not about testability – that something is tested does not help customers.

Satisfying a need does help customers. Amongst other things it might make their lives easier, make something possible, educate them or entertain them. It engages them, enthrals them, creates emotion within them. From all of this, many good things will come.

If you are developing a system without a focus on satisfying needs, you’ve already lost. First question, then:

Who are your customers?

And if you think the only customers are those paying for what you build, you’ve lost once again. In fact, you’ve signed your own happiness away.

On The Practice of Design

Technology is not architecture or indeed design; it is a means of implementing a design. Various technologies (e.g. languages or frameworks) will be more or less compatible with implementing a specific design.

Design is an abstract exercise. It becomes constrained by our own choices, which can include using existing technology or creating anew. By default it should not be constrained; this is closer to the ideal. The more constraint technology exerts, the less ideal things are likely to be. Less ideal can be acceptable (there are always cost limits and such) but it should never be accepted without consideration of the consequences.

Some argue that design cannot be an abstract exercise at all because real-world considerations demand otherwise. Performance is often cited as being too significant to ignore. Are they right?

The nature of performance in a system can be generalised into a set of guiding principles. In the case of computing, there’s a well known hierarchy driven by locality (starting with the fastest component):

  1. Register-based CPU instruction
  2. On CPU cache access
  3. Off CPU cache access
  4. Main memory access
  5. I/O (network, conventional disk, SSD)

Jeff Dean and others have expressed this in a table of “Numbers every programmer should know”. The performance relationship amongst these components is sufficient guidance for design work. Clearly, incrementing a number across a network connection is something to be avoided (though there are ways to make this work). It would generate significant chatter as would naive distribution of an OO design.
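To make the point about locality concrete, here is a rough sketch of the arithmetic. The latency figures are approximate, order-of-magnitude values in the spirit of the commonly circulated table (they vary with hardware and year), and the names LATENCY_NS, total_cost_ns and slowdown are made up for this illustration.

```python
# Approximate, order-of-magnitude latencies in nanoseconds.
# Illustrative only: real values depend on hardware and change over time.
LATENCY_NS = {
    "l1_cache_reference": 0.5,
    "main_memory_reference": 100,
    "datacenter_round_trip": 500_000,
    "disk_seek": 10_000_000,
}


def total_cost_ns(operation: str, count: int) -> float:
    """Cost of performing `operation` `count` times, one after another."""
    return LATENCY_NS[operation] * count


def slowdown(slow: str, fast: str) -> float:
    """How many times slower one operation is than another."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


if __name__ == "__main__":
    increments = 1_000
    local = total_cost_ns("main_memory_reference", increments)
    remote = total_cost_ns("datacenter_round_trip", increments)
    print(f"{increments} increments in memory: {local / 1e6:.3f} ms")
    print(f"{increments} increments over the network: {remote / 1e6:.3f} ms")
    ratio = slowdown("datacenter_round_trip", "main_memory_reference")
    print(f"network round trip is ~{ratio:,.0f}x slower per operation")
```

With these assumed figures, a thousand chatty remote increments cost about half a second against a tenth of a millisecond locally. That ratio, rather than any particular technology, is what the design needs to respect.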

So to answer the question: Performance is too important to ignore in a design but the amount of consideration required is no more than other aspects such as coupling and cohesiveness.

Apple decide what they want to build first, then create and/or select the technologies they need, and the result is great products. They ask themselves: how do we make this idea, this concept we have in mind, real? This is the point at which technology becomes relevant. NASA, when set the moon-shot challenge, created the technologies they needed to deliver the end result over a period of years. They iterated on engine, flight control and many other aspects to get to the ultimate embodiment. 37Signals ended up creating Rails, embodying a new way of building product to deliver their vision.

The best designs start out as concepts or ideas and are largely unconstrained by technology (there are limits of course, e.g. a phone must have certain components possessing certain properties). They retain their elegance, a sense of style and taste. Designs that are forced to fit with early, uninformed technological choices are likely to be brittle and die.

Developers have a bad habit of selecting tools and technology well ahead of any consideration of the problem and potential design approaches. History is littered with examples of the consequences: balls of mud and expensive “surprise” failures of projects that should have been “easily dispatched” because of this or that silver-bullet technology.

There is nothing harmful in the general discussion of technology tradeoffs; it’s what leads to useful guidance such as that of Jeff Dean above. It also makes sense in one’s early career to work with a variety of technologies to help gain an understanding of tradeoffs and of patterns that work or don’t. However, excessive technology fixation is destructive for quality design work.

One can certainly design from a technology-driven perspective (choose your language, frameworks etc. and constrain your design to fit them) but that won’t be good enough for a moonshot, a class-leading product or a high-quality solution.

200

In aviation circles there is a thing known as the 200th hour rule. It goes something like this:

After 200 hours of flight time you are expert enough to feel confident in what you do but amateur enough to still screw up. Most worryingly it’s said that come the 200th hour you will screw up and quite possibly in grand style.

I’m figuring that applies in many other situations and there’s probably more than one 200th hour event in many cases. One would hope that pilots, should they survive the experience, learn from the mistake and improve. Can we all say we do the same?

Blueprints

For a long time, I’ve wanted to write something about the state of our software practices. It’s always proven quite challenging as I find myself unerringly drawn towards philosophy, creativity, engineering and a myriad of other voluminous subjects. Producing something succinct has proven consistently elusive. They say you can’t force these things and so it would appear as it’s taken some writing from Leslie Lamport to help me distil out some specific points that I want to make.

The article that started this chain of events is Blueprints, in which Dr Lamport discusses the practice of coding. I found it thought-provoking yet, judging by the comments, many felt it was irrelevant, out of date or simply wrong. Reading through those comments and a tweet discussion with Nic Ferrier led me to a bunch of observations which appear below.

Foundations

It appears that the focus on “practical” aspects of systems building (e.g. knowing how to code in popular industry languages rather than the fundamentals that underpin them all) has significantly impacted the corpus of common knowledge. Specification as a practice is not well understood:

  • It can be formal or informal – ultimately the end-goal determines what is appropriate. Formal specifications provide the opportunity for proof and verification which in critical systems is highly desirable. The relevance goes beyond critical systems though to any situation where high confidence in a piece of behaviour is required.
  • Specification in its various forms isn’t a theoretical exercise – there are a number of examples of its application in real systems. Google mention it in a variety of circumstances including Chubby and Spanner.
  • Proving correctness can be done via formal specification; it cannot be achieved by testing. Imagine standing in a dark room with a pencil-beam torch, trying to establish what’s in the room, its dimensions etc. This will take a long time and things are easily missed unless you cover the entire room with that pencil beam (which will take forever). Formal specs allow you to simply turn the light on in the room.
  • BDD, TDD and the like are testing processes and thus cannot be directly compared with specification (certainly not the formal variety) which is a tool.

Abstraction

The ability to deal in abstraction is important for the disciplines of architecture and design. However, there is a more fundamental need to satisfy when coding: coping with the limits of the individual mind to retain and reason about detail. Our only tool for coping with systems of detail larger than can be held in an individual mind is abstraction. Abstraction creates coarser constructs that hide some of the detail and allow us to scale our reasoning to broader levels. It also makes it possible to communicate and test our reasoning with others.

Reading the responses to Lamport’s article shows that some of us are too literal in our interpretation, not pausing to consider the more abstract possibilities. We are unbalanced in our view, focused on detail and specifics in a world where grey and uncertainty (a natural consequence of dropping detail for the sake of abstraction) play a critical role. Some examples:

  • Software systems have nothing in common with buildings. Consider for a moment the challenge of changing an old or large system to cope with a radical new requirement, say, going from single machine to massively distributed. Is that so different from taking an old Victorian-age school and putting in the trunking and cabling required for modern systems development? In both cases, there will be a desire for an understanding of the current structure (dare I say blueprints?), then some consideration of options, perhaps some testing out of tools and practices before actually doing the work (which undoubtedly will be iterative, component by component or room by room).
  • Skyscrapers are big systems, toolsheds are small ones. Lamport himself states otherwise in the article: “While the specs I write are almost all informal, occasionally a piece of code is sufficiently subtle, or sufficiently critical, that it should be specified formally — either for precision or for using tools to check it. I’ve only had to do that about a half dozen times during the past dozen years.” He’s clearly talking about pieces of code, could be one method or a couple of classes or indeed entire systems.

Being too literal, ignoring the grey and reducing abstractions to strict constructs (ironic considering the vehement resistance to formal specification because it’s too constraining) has ramifications beyond design quality for aspects such as human communication, essential in any good team, agile or otherwise. We stop ourselves from considering the greater context, the bigger possibilities which might explain why some techies cannot easily relate to customer needs.

Research

A couple of Google searches reveal that Lamport has a body of work (including tools) related to reasoning about concurrent systems using specifications. This is notable because it focuses on the non-functional, something not often seen in discussions pertaining to TDD or BDD. For example, have you ever run across a test specification like this?

Will sort n items, distributed in any of the following orders (already sorted, exponential etc), in O(n log n) time subject to the availability of memory being sufficient to hold 4 * n.
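For illustration, a specification of that shape could be turned into an executable check along the following lines. This is only a sketch under stated assumptions: it uses a hand-written top-down merge sort as the thing being specified, counts comparisons as a stand-in for time, and does not check the memory clause at all. The names merge_sort, comparison_budget, input_orders and check_sort_spec are invented for the example.

```python
import math
import random


def merge_sort(items, counter):
    """Return a sorted copy of `items`; add the comparisons made to counter[0]."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], counter)
    right = merge_sort(items[mid:], counter)
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        counter[0] += 1  # one element comparison per loop iteration
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def comparison_budget(n):
    """Upper bound on comparisons for top-down merge sort: n * ceil(log2 n)."""
    return n * math.ceil(math.log2(n)) if n > 1 else 0


def input_orders(n, seed=0):
    """Yield (name, data) pairs covering several input distributions."""
    rng = random.Random(seed)
    yield "already sorted", list(range(n))
    yield "reversed", list(range(n, 0, -1))
    yield "uniform random", [rng.random() for _ in range(n)]
    yield "exponential", [rng.expovariate(1.0) for _ in range(n)]


def check_sort_spec(n):
    """Check ordering and the comparison budget for every input order."""
    for name, data in input_orders(n):
        counter = [0]
        result = merge_sort(data, counter)
        assert result == sorted(data), name
        assert counter[0] <= comparison_budget(n), (name, counter[0])
    return True


if __name__ == "__main__":
    print(check_sort_spec(1000))
```

Even this only samples a few inputs of one size, so it remains a test and not a proof. What it does show is that a non-functional bound can be written down precisely enough to be checked.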

Returning to Lamport’s work, how often would you see any explicit treatment of concurrency or parallelism in applications of BDD or TDD? Isn’t consideration of these non-functionals relevant?

Some have argued that Lamport as the author has the responsibility for including all of this in his article. Really? Are we saying that an audience has a right to a complete, finished work that they can just apply verbatim, without thought or further development? Do all the best films have a definitive ending? Of course not, because there’s value in allowing a viewer to invent and go further.

Lamport opened the door to an opportunity for personal development and an improvement in the quality of one’s work, maybe some innovation too. Those who sought no further reading (a mere google away) and pronounced what he was writing irrelevant or covered by BDD or TDD have missed out.

These failures to dig deeper and put into context lead to stagnation of our practice (ironic given the focus on “practical” aspects such as coding). Research is essential to learning and growth but seemingly is becoming a lost art to many.

Wrapping Up

Lamport wrote in his article: “Few programmers write even a rough sketch of what their programs will do before they start coding. Most programmers regard anything that doesn’t generate code to be a waste of time.”

Meanwhile, Wired headed the piece with this statement: “With widespread access to free, online coding courses and tools, ‘coding’ has become the new writing – the everyman’s skill.”

Given many of the responses to the article, it seems that the readership proved Lamport right and Wired wrong. Not everyone possesses the skill to code competently, and those who think it’s just about code are missing the key factors that would make them so.

Motivations

Almost as soon as the discussion of building services starts, there are questions about latency of remote calls and what is or is not a service.

I have a basic rule of thumb for remote call performance tradeoffs:

If the total compute time (that includes background/offline work to keep things up to date) required to service a remote request is greater than the round-trip time + a little fudge, it’s acceptable to make it a service.
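Written as a predicate, the rule of thumb above might look like the sketch below. The function name, the millisecond units and the default fudge value are all assumptions for illustration; the rule itself only says “a little fudge”.

```python
def worth_making_a_service(compute_ms: float,
                           round_trip_ms: float,
                           fudge_ms: float = 0.5) -> bool:
    """Rule of thumb: a remote call is acceptable when the work it does
    (including background/offline work) outweighs the cost of getting there.

    fudge_ms is an arbitrary illustrative margin, not a recommended value.
    """
    return compute_ms > round_trip_ms + fudge_ms


if __name__ == "__main__":
    # 20 ms of work behind a 1 ms round trip: the hop is noise.
    print(worth_making_a_service(compute_ms=20, round_trip_ms=1))
    # 0.2 ms of work behind a 1 ms round trip: the hop dominates.
    print(worth_making_a_service(compute_ms=0.2, round_trip_ms=1))
```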

However, there are other things to consider outside of local performance concerns. An optimised protocol running over even 100 Mbps Ethernet can manage round-trips of 1ms or less. To put that into perspective, consider that Google reports average worldwide round-trips to their site are ~100ms and within the US ~50-60ms. Google has presence across the globe and does all the right things in terms of content serving etc. Most other sites don’t do nearly that well. These numbers don’t cover mobile internet either, which can have substantially worse performance. The point is that losing a couple of ms on a remote call in the context of round-trips outside the firewall is no big deal.

That said, many point to the fact that once we make a reasonable number of service calls, the latency of customer response can increase significantly. Which is true, if we choose a synchronous model of remote invocation where we wait for each call to complete before we dispatch the next. The thing is, that’s not necessary. Asynchronous requests make more sense as they allow for better support of timeouts (important for failure and load handling) to protect thread pools and such. Further, asynchronous requests give us the ability to dispatch work in parallel, which is exactly what Amazon does in order to ensure that the 100+ service calls they make don’t severely impact page rendering.
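A minimal sketch of the difference, using Python’s asyncio with simulated calls. call_service is a stand-in that sleeps instead of touching a network, and the service names, latencies and timeout are invented for the example; it is not a description of how Amazon or anyone else implements this.

```python
import asyncio
import time


async def call_service(name: str, latency_s: float) -> str:
    """Stand-in for a remote call: sleeps to simulate the round trip."""
    await asyncio.sleep(latency_s)
    return f"{name}: ok"


async def call_with_timeout(name: str, latency_s: float, timeout_s: float) -> str:
    """Bound the wait so one slow dependency cannot hold the caller hostage."""
    try:
        return await asyncio.wait_for(call_service(name, latency_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        return f"{name}: timed out"


async def sequential(calls, timeout_s):
    """Wait for each call to finish before dispatching the next."""
    return [await call_with_timeout(n, lat, timeout_s) for n, lat in calls]


async def parallel(calls, timeout_s):
    """Dispatch every call up front, then wait for all of them together."""
    tasks = (call_with_timeout(n, lat, timeout_s) for n, lat in calls)
    return list(await asyncio.gather(*tasks))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    calls = [("catalogue", 0.05), ("pricing", 0.05), ("reviews", 0.05), ("slow", 1.0)]
    seq_result, seq_time = timed(sequential(calls, timeout_s=0.2))
    par_result, par_time = timed(parallel(calls, timeout_s=0.2))
    print(seq_result, f"{seq_time:.2f}s")
    print(par_result, f"{par_time:.2f}s")
```

In the sequential version the total wait is roughly the sum of the individual waits; in the parallel version it is roughly the longest single wait, capped by the timeout.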

Beyond round-trip performance there are a bunch of other motivations that contribute to a decision to make something a service or not. In my experience, these sorts of things rarely come up in architectural conversations but are absolutely essential to getting good results:

  • Performance – a single application that contains disparate functionality can become difficult to tune meaningfully as load patterns interfere with each other – e.g. caching policies, storage performance characteristics, I/O versus compute intensive
  • Scalability – an architecture that supports substantial scale for one type of function may not work well for another – e.g. a read-mostly load can be served from disk with a filesystem cache. This is entirely inappropriate for volatile, rapidly-changing, transactional information. These conflicts are further exacerbated by geographical dispersion.
  • Availability – a single application containing disparate functionality can be brought to its knees by a single fault affecting a shared resource. The result is a situation where one functional problem ensures nothing functions at all – e.g. memory leaks, all data for all functions is retained in a single database or a single code fault generating exceptions that eventually exhausts thread-pools or causes a JVM to exit.

All the above have implications for our ability to support high-quality SLAs.

  • Manageability – dependency chains become deep or wide or both such that it becomes impossible to accurately predict the consequences of a change. Refactoring is difficult, automated testing options are limited. Builds take longer and longer. A single update of a library requires all teams to co-ordinate the change to ensure all code is brought up to date. All technical staff must become expert in all aspects and all technologies. This creates long training cycles and makes recruitment difficult, or alternatively requires tight control/standardisation of tools to limit technological proliferation, which in turn inhibits correct solution construction and innovation.
  • Data Management – some data is externally regulated and requires specific policy and infrastructure (e.g. PCI). When this data is mixed with other data (as is typical with monolithic codebases), the entire codebase and all data become subject to the same stringent requirements slowing development, increasing infrastructure costs etc.
  • Operational – as for development, ops staff are compelled to understand all aspects of everything for the purposes of diagnosis. The sheer quantity of information produced from the single codebase can make separating signal from noise challenging – e.g. log files containing all messages for all functionality. Releases must necessarily be slow and careful because there can never be much certainty of code quality and rollback is difficult. Further, staging out of updates in small chunks is made challenging by virtue of the number of things that can cause compatibility problems during upgrade. 

The basic force underlying all of this commentary is:

A one-size-fits-all approach where everything sits in one big build and is replicated across many app-servers all backed by a single database fronted by caches (and other variant architectures) will only get you so far.

Thus there is a point in growth (load, features, infrastructure, availability demands etc) at which this one-size-fits-all approach starts to hinder progress and increase cost. Smart techies will watch out for relevant trends and make plans to transition to a more service-oriented style as appropriate.

[ Sidenote: I'd normally advocate not doing an SOA from the get-go. However, infrastructures with properties like EC2 make typical one-size-fits-all architectures less appealing. They don't cope well with failure or relocation like a decent SOA can. ]