Archive for March 12th, 2007

How often do we admit something’s wrong and take corrective action? How often do we admit something’s wrong before it gets well out of hand?

Two conditions must hold before we can admit something’s wrong and address the problem. We need to:

  1. Be strong enough to admit there’s something wrong
  2. Be aware that something is wrong

The first item is about having sufficient strength of character to admit our mistakes, learn from them and move on. Inhibitors include ego or pride, fear, and external factors in our environment (e.g. a violent spouse).

The second item is about recognizing the indicators of a problem. For us humans, indicators can be anger, frustration, lack of drive, fatigue, health problems or a constant need to distract oneself via activity. In computer systems, indicators might be Java exceptions under odd circumstances, strange lockups, weird performance issues and so on.

The art, whether in the human or the systems world, is detecting these problems early and reacting to them before they get out of control. In the systems world, the first we know of a problem is all too often an exception trace or the pager beeping at 2am. Is this really necessary?

Could we not have been aware of the problem earlier? We put a lot of effort into testing but considerably less into endowing our systems with monitoring and feedback mechanisms that provide us with useful statistics. These statistics can give us prior warning of a problem, and the warning signs come in many forms:

  1. Too many tasks
  2. Too many threads
  3. Queues that are too long
  4. Time per task increasing
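
As a rough illustration of the warning signs listed above, here is a minimal sketch in Python of a monitor that an application could update as it works and poll for early warnings. Everything in it is an assumption made for the example: the class name TaskMonitor, its method names, the threshold values, and the rule that "time per task increasing" means the recent average exceeds the older average by a fixed ratio. Real limits would have to come from observing the system in question.

```python
import threading
import time
from collections import deque


class TaskMonitor:
    """Hypothetical helper that tracks a few health statistics."""

    def __init__(self, max_queue_len=100, max_active=50,
                 max_threads=200, window=20, slowdown_ratio=1.5):
        # All thresholds are illustrative defaults, not recommendations.
        self.max_queue_len = max_queue_len
        self.max_active = max_active
        self.max_threads = max_threads
        self.window = window
        self.slowdown_ratio = slowdown_ratio
        self.queue_len = 0
        self.active = 0
        # Keep two windows of samples: an older one and a recent one.
        self.durations = deque(maxlen=2 * window)
        self._lock = threading.Lock()

    def enqueued(self):
        """Call when a task is placed on the work queue."""
        with self._lock:
            self.queue_len += 1

    def started(self):
        """Call when a task leaves the queue; returns its start time."""
        with self._lock:
            self.queue_len -= 1
            self.active += 1
        return time.monotonic()

    def finished(self, start_time):
        """Call when a task completes, passing the value from started()."""
        self.record_duration(time.monotonic() - start_time)
        with self._lock:
            self.active -= 1

    def record_duration(self, seconds):
        """Store how long one task took."""
        with self._lock:
            self.durations.append(seconds)

    def warnings(self):
        """Return the list of warning signs currently present."""
        found = []
        with self._lock:
            if self.active > self.max_active:
                found.append("too many tasks")
            if threading.active_count() > self.max_threads:
                found.append("too many threads")
            if self.queue_len > self.max_queue_len:
                found.append("queue too long")
            if len(self.durations) == 2 * self.window:
                samples = list(self.durations)
                older = sum(samples[:self.window]) / self.window
                recent = sum(samples[self.window:]) / self.window
                if older > 0 and recent > older * self.slowdown_ratio:
                    found.append("time per task increasing")
        return found
```

A worker would call enqueued(), started() and finished() around each task, while a background timer or an admin page polls warnings() and raises the alarm well before the 2am pager does.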

Frequently we build logging into our software, but this is really a postmortem facility, something that is only useful after the problem has occurred. Our operating systems, hardware and software platforms often have monitoring infrastructure built in. Why is it that we so often choose to build our applications without similar facilities?
