How often do we admit something’s wrong and take corrective action? How often do we admit something’s wrong before it gets well out of hand?

There are two basic things we need before we can admit something’s wrong and address the problem:

  1. Be strong enough to admit there’s something wrong
  2. Be aware that something is wrong

The first item is about having sufficient strength of character to admit our mistakes, learn and move on. Inhibitors for this include ego or pride, fear and external environmental factors (e.g. a violent spouse).

The second item is about recognizing the indicators of a problem. For us humans, indicators can be anger, frustration, lack of drive, fatigue, health problems or a constant need to distract oneself via activity. In computer systems, indicators might be Java exceptions under odd circumstances, strange lockups or weird performance issues.

The art, whether in the human or systems world, is detecting these problems early and reacting to them before they get out of control. In the systems world, the first we know of a problem is too often an exception trace or a pager beeping at 2am. Is this really necessary?

Could we not have been aware of the problem earlier? We put a lot of effort into testing but considerably less into endowing our systems with monitoring and feedback mechanisms that provide useful statistics. These statistics can give us early warning of a problem, and the warning signs come in many forms:

  1. Too many tasks
  2. Too many threads
  3. Queues that are too long
  4. Time per task increasing
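
As a rough illustration of how an application might keep track of indicators like these, here is a minimal Java sketch. The `TaskHealth` class, its method names and the thresholds are all hypothetical, invented for this example. It also checks mean time per task against a fixed limit rather than detecting an upward trend, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical health tracker; names and thresholds are illustrative only. */
class TaskHealth {
    private final AtomicInteger queued = new AtomicInteger();   // tasks waiting to run
    private final AtomicInteger running = new AtomicInteger();  // tasks in progress
    private final AtomicLong finished = new AtomicLong();       // tasks completed
    private final AtomicLong totalMillis = new AtomicLong();    // summed time of completed tasks

    // The application calls these as a task moves through its lifecycle.
    void taskQueued()  { queued.incrementAndGet(); }
    void taskStarted() { queued.decrementAndGet(); running.incrementAndGet(); }
    void taskFinished(long elapsedMillis) {
        running.decrementAndGet();
        finished.incrementAndGet();
        totalMillis.addAndGet(elapsedMillis);
    }

    /** Mean time per completed task, or 0.0 if nothing has finished yet. */
    double meanTaskMillis() {
        long n = finished.get();
        return n == 0 ? 0.0 : (double) totalMillis.get() / n;
    }

    /** Returns one warning per exceeded threshold; an empty list means healthy. */
    List<String> warnings(int maxQueued, int maxRunning, double maxMeanMillis) {
        List<String> out = new ArrayList<>();
        if (queued.get() > maxQueued) {
            out.add("queue too long: " + queued.get());
        }
        if (running.get() > maxRunning) {
            out.add("too many tasks running: " + running.get());
        }
        if (meanTaskMillis() > maxMeanMillis) {
            out.add("time per task too high: " + meanTaskMillis() + " ms");
        }
        return out;
    }
}
```

Something like `warnings(...)` could be polled periodically and fed to whatever alerting is already in place, so that the pager goes off while the queue is merely growing rather than after the system has fallen over.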

Frequently we build logging into our software, but this is really a postmortem facility, something that is only useful after the problem has occurred. Our operating systems, hardware and software platforms often have monitoring infrastructure built in. Why is it that we so often choose to build our applications without similar facilities?
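
The JVM is one example of a platform with this kind of infrastructure built in. The sketch below reads a few of the statistics it exposes through the standard `java.lang.management` API. The `PlatformSnapshot` class and its output format are made up for illustration. An application could publish its own numbers alongside these.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper that summarizes a few JVM-provided statistics. */
class PlatformSnapshot {
    /** Returns a one-line summary of live threads, heap in use and system load. */
    static String describe() {
        // Number of live threads in this JVM, including daemon threads.
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        // Current heap usage as reported by the memory subsystem.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // System load average for the last minute; negative if the platform cannot report it.
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        return "threads=" + threads
                + " heapUsedBytes=" + heap.getUsed()
                + " loadAvg=" + load;
    }
}
```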


One Response to “Something is Wrong”
  1. [...] In our current systems construction doctrine we still focus on building our application inside of a single machine out of bits (e.g. Spring or App Server style). Witness how we strive to allow developers to run all that’s required on their own machine and rarely force them to run remotely. Do we really feel this is a good thing given that when the code is given to ops the first thing they do is put it on lots of machines? What incentive does a developer have to write monitoring tools to help ops out when they don’t see the pain in development? [...]
