When building systems, there are some operational elements that it pays to get to grips with sooner rather than later:

  • Deployment
  • Packaging
  • Configuration
  • Monitoring
  • Logging

Failing to address these elements is detrimental to core aspects of what we need to do from day one:

  • Get changes out – ship a new feature, deploy an urgent bug-fix or make a tweak to handle a load-spike.
  • Determine if things have started up and configured properly.
  • Be sure things are still running right.
  • Identify and react to problems quickly.
  • Obtain data important to future architectural decisions.

Even in light of the above, many of us are still tempted to leave this until later, by which time:

  1. Our software will have grown substantially, making it difficult and expensive to adapt when we do decide to address the operational issues.
  2. We’ll be losing inordinate amounts of time on manual troubleshooting and dealing with the consequences of human error (a key contributor to downtime and other problems).
  3. Operations will likely have become tightly bound to whatever our software currently looks like, such that when we start addressing the issues, we’ll break all their assumptions (and the tooling they built around them).

Some Specifics

Having configuration buried inside your binaries, where it cannot be easily managed, is an inconvenience. We don’t really want to have to do a whole new build just to change configuration settings (though one might want to redeploy the whole lot together to allow for audit trails and to have half a chance of all boxes being configured similarly at the same time).
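As a rough illustration of keeping configuration out of the build, here is a minimal Python sketch that layers an external JSON file and environment variables over built-in defaults. The setting names, the APP_ prefix and the JSON format are all assumptions made for the example, not a recommendation of any particular scheme.

```python
import json
import os
from pathlib import Path

# Hypothetical defaults shipped with the application.
DEFAULTS = {"port": 8080, "log_level": "INFO"}


def load_config(path, environ=None):
    """Merge built-in defaults, an external JSON file and environment overrides."""
    environ = os.environ if environ is None else environ
    config = dict(DEFAULTS)

    file_path = Path(path)
    if file_path.is_file():
        # Settings in the external file win over the built-in defaults.
        config.update(json.loads(file_path.read_text(encoding="utf-8")))

    for key in list(config):
        # An environment variable such as APP_PORT wins over the file.
        override = environ.get("APP_" + key.upper())
        if override is None:
            continue
        current = config[key]
        if isinstance(current, int) and not isinstance(current, bool):
            config[key] = int(override)  # keep integer settings as integers
        else:
            config[key] = override  # everything else is taken as a string
    return config
```

Changing a setting then means editing a file or an environment variable and restarting, with no rebuild. The file itself can still be versioned and shipped alongside the binaries if an audit trail is wanted.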

When it comes to deployment and packaging, it pays to adopt something akin to the xcopy install approach. Everything required is contained inside the distribution, with minimal external dependencies (necessary external dependencies should ideally be satisfied dynamically at runtime rather than with static configuration). Such an approach would be unattractive for desktop software, but with servers and an imperative to automate installation it’s very attractive.

What about all those existing packaging systems such as rpm? Many of these mechanisms are designed around the assumption that only a single version of something is installed on a machine. This can inhibit fast rollback because, rather than stopping one process and starting another, one has to (in simple terms):

  1. Stop the process.
  2. Uninstall its binaries and dependencies.
  3. Install the binaries and dependencies for the old version.
  4. Start the old version up.

In some cases it will also be necessary to perform further configuration (did we back it up?). Suddenly it’s looking like a lot of work to buy ourselves appropriate risk mitigation for broken upgrades.
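By way of contrast, here is a minimal Python sketch of the side-by-side alternative: each version is copied into its own directory and a "current" symlink selects the active one, so rollback becomes a link switch rather than an uninstall and reinstall. The releases/ layout, the "current" link name and both function names are hypothetical, and the sketch assumes a POSIX file system. Stopping and starting the process is left out.

```python
import os
from pathlib import Path


def install(root, version, files):
    """Copy a release into its own directory, leaving other versions untouched."""
    release = Path(root) / "releases" / version
    release.mkdir(parents=True, exist_ok=True)
    for name, content in files.items():
        (release / name).write_text(content, encoding="utf-8")
    return release


def activate(root, version):
    """Point the 'current' symlink at an installed release."""
    root = Path(root)
    target = root / "releases" / version
    if not target.is_dir():
        raise FileNotFoundError(f"release {version} is not installed")

    # Build the new link under a temporary name, then rename it over the
    # old one. The rename replaces the link itself and is atomic on POSIX.
    tmp = root / "current.tmp"
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    tmp.symlink_to(target, target_is_directory=True)
    os.replace(tmp, root / "current")
    return root / "current"
```

With this layout a rollback is: stop the process, call activate(root, old_version), start the process. The newer release stays on disk, so rolling forward again is just as cheap.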

Monitoring often requires a certain amount of configuration, which can make for a bootstrap problem: one needs monitoring to detect a configuration issue, but the monitoring isn’t configured yet. Thus it can be useful to have some very simple monitoring based on a primitive that can run without explicit configuration, such as multicast.
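One way such configuration-free monitoring might look is a heartbeat sent to a well-known multicast group, so that neither sender nor listener needs to be told about the other. The Python sketch below is only that, a sketch: the group address, port and message fields are made up for illustration, and it assumes the network actually forwards multicast between the machines involved.

```python
import json
import socket
import time

# Hypothetical group and port, chosen purely for illustration.
GROUP = "239.255.42.99"
PORT = 45454


def encode_heartbeat(service, status, now=None):
    """Build a small JSON datagram describing the sender."""
    payload = {
        "service": service,
        "status": status,
        "host": socket.gethostname(),
        "ts": time.time() if now is None else now,
    }
    return json.dumps(payload).encode("utf-8")


def decode_heartbeat(data):
    """Turn a received datagram back into a dictionary."""
    return json.loads(data.decode("utf-8"))


def send_heartbeat(service, status, ttl=1):
    """Send one heartbeat to the group; a TTL of 1 keeps it on the local network."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(encode_heartbeat(service, status), (GROUP, PORT))


def listen(handler):
    """Join the group and pass every decoded heartbeat to handler (runs forever)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", PORT))
        # ip_mreq: group address followed by the local interface (any).
        membership = socket.inet_aton(GROUP) + socket.inet_aton("0.0.0.0")
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
        while True:
            data, sender = sock.recvfrom(65535)
            handler(decode_heartbeat(data), sender)
```

A service would call send_heartbeat("billing", "up") on a timer, while anything interested runs listen(print) or similar. If the heartbeats stop or report a bad status, something is wrong, and nobody had to configure anything to find out.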

Important Step

These key operational elements should be accounted for early on in the design of a system and grown alongside its functional aspects.* There’s plenty of information on this topic publicly available.

* Initially, the implementation can be simple scripts, but at some point it becomes necessary to take a more serious approach to tools and infrastructure development. This means investing in properly skilled architects and engineers, performing appropriate testing, etc.
