Wild Speculation
Posted by Dan Creswell in Engineering, Uncategorized, tags: development, Engineering, operations
A bad habit I’ve noticed in many a techie:
The tendency to thrash around and wildly speculate about the root cause of whatever production issue they’re facing. They tweak code and configuration following some random hypothesis or another, hoping that the issue will magically go away. It must surely be clear that this is a horribly inefficient way to solve a problem?
What’s required is data, data that we can use to home in on the source of the fault. We could wade through log files but this is inefficient and ought to be the last resort. Ideally we’d have some idea of what to look for beforehand.
Instrumentation is one tool we can use to guide our efforts. It can tell us things like how much memory is used, how much load there is, how many users are logged in, rate and types of request, cache hits etc.
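As a rough illustration of what instrumenting our own code might look like, here is a minimal Java sketch of in-process counters for request counts and cache hits. The class and method names (`RequestStats`, `recordRequest`, `recordCacheLookup`) are invented for this example and are not taken from any real system.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical in-process counters; all names here are illustrative only. */
final class RequestStats {
    // LongAdder keeps contention low when many threads update the same counter.
    private final LongAdder requests = new LongAdder();
    private final LongAdder cacheHits = new LongAdder();
    private final LongAdder cacheMisses = new LongAdder();

    /** Call once per incoming request. */
    void recordRequest() {
        requests.increment();
    }

    /** Call once per cache lookup, saying whether it was a hit. */
    void recordCacheLookup(boolean hit) {
        if (hit) {
            cacheHits.increment();
        } else {
            cacheMisses.increment();
        }
    }

    long requestCount() {
        return requests.sum();
    }

    /** Fraction of lookups that were hits, or 0.0 if there have been no lookups. */
    double cacheHitRatio() {
        long hits = cacheHits.sum();
        long total = hits + cacheMisses.sum();
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```

Counters like these cost very little to update, so they can stay switched on in production and be read whenever something looks wrong.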
Self-tests are also useful as they can exercise common operations, perform internal consistency checks and provide feedback on what’s working and not.
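A self-test facility does not need to be elaborate. The sketch below, in Java, assumes a hypothetical `SelfTest` registry of named checks that can be run on demand and reports which ones passed. The names and the idea of treating a thrown exception as a failure are choices made for this example, not a prescribed design.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical registry of named self-test checks. */
final class SelfTest {
    // LinkedHashMap keeps checks in registration order for readable reports.
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    /** Registers a check that returns true when the component is healthy. */
    void register(String name, BooleanSupplier check) {
        checks.put(name, check);
    }

    /** Runs every check; a check that throws is reported as failed. */
    Map<String, Boolean> run() {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (Map.Entry<String, BooleanSupplier> entry : checks.entrySet()) {
            boolean ok;
            try {
                ok = entry.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                ok = false;
            }
            results.put(entry.getKey(), ok);
        }
        return results;
    }
}
```

A check might write and read back a known record, or verify that two internal indexes agree, and the resulting map gives immediate feedback on what’s working and what’s not.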
We can also get online memory dumps and there are tools like dtrace and tcpdump.
Given all these possibilities, why do we indulge in wild speculation? Perhaps it’s because we’ve foolishly left ourselves no choice:
- Instrumentation that should be a rich source of useful information is often limited to what is available from the operating system because we neglect to instrument our own code.
- As with instrumentation, we don’t make the time to implement self-test facilities.
- Only a few of us bother to learn about tools such as dtrace.
- Logging, even if we could wade through it all, is implemented in such a fashion that it cannot be turned on in production because the performance cost is too high.
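On the logging point, one common way to reduce the cost is to defer building the message until the logger has confirmed that the level is enabled. The Java sketch below uses `java.util.logging`, whose `Logger.fine(Supplier<String>)` overload (Java 8 and later) only invokes the supplier when FINE is loggable. The logger name `orders`, the `describe` helper and the call counter are hypothetical and exist only to make the effect visible. This addresses message construction cost only, not the cost of writing the output.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical example of logging whose cost is paid only when enabled. */
final class CheapLogging {
    private static final Logger LOG = Logger.getLogger("orders"); // assumed logger name

    // Counts how often the expensive message is actually built (for illustration).
    static final AtomicInteger expensiveCalls = new AtomicInteger();

    /** Stands in for an expensive rendering of internal state. */
    static String describe(int orderId) {
        expensiveCalls.incrementAndGet();
        return "processing order " + orderId;
    }

    /** Switches detailed logging on or off at runtime. */
    static void setVerbose(boolean on) {
        LOG.setLevel(on ? Level.FINE : Level.INFO);
    }

    static void handle(int orderId) {
        // The lambda runs only if FINE is enabled for this logger.
        LOG.fine(() -> describe(orderId));
    }
}
```

With verbose logging off, `handle` never calls `describe`, so the statement can stay in the production code path and be enabled only while investigating a fault.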

I think that instrumentation and tooling is often eaten by the same dragon that eats testing: it doesn’t feel like a productive use of time. If you have crappy management, then it’s also difficult to justify to them. Of course, it usually does pay off in the end, and pays off handsomely.
I think one reason that the Java ecosystem is so successful is because there’s a well developed toolchain and good profiling tools.
I’m not even sure how to approach improving instrumentation at the level of code besides logging. What did you have in mind beyond dtrace?
The previous job I worked on generally never had enough logging information in a production system to arrive at a conclusive answer. What I recognized happening might spill into the realm of ‘wild speculation’ … when working on something that didn’t have an apparent answer we generally formulated one or more theories, and designed experiments to elicit logs or data that supported or discredited each theory as quickly as possible, while building tools that in the future would more quickly pinpoint the suspected hypothesis. (most theories eventually ‘came back’)
When we recognized what we were doing, we formalized it a bit, which ended up making the entire process faster and the code we produced considerably more focused.
At the end of the day “It was broke, we needed to fix it” … but it added a bit of enjoyment if we tried to apply a bit of scientific method to the whole process.
Hi Andy,
I was specifically thinking about having code expose statistics via e.g. JMX. My own Javaspace implementation has a growing number of such statistics that can help identify all sorts of ills. The most recent addition was a stat to track how much cache has been utilised versus free memory, to provide an indication of whether performance will be adequate and the likelihood that heap space is sufficient. Other statistics show number of outstanding operations, active transactions, various queue sizes etc., all of which can be used to infer what’s causing problems.
Of course there are limits: if you haven’t got a stat for it, you’ll have to try other methods to deduce the problem. At the same time, I’ve seen an awful lot of enterprise software that doesn’t provide any instrumentation/statistics beyond what can be got from the app server.
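To make the JMX idea concrete, here is a minimal Java sketch of a standard MBean exposing a few statistics of the kind described above. It is not the Javaspace implementation’s actual code: the names `CacheStats`, `CacheStatsMBean`, `cacheGrew`, `transactionStarted` and the `ObjectName` used with it are all assumptions for this example, and `Runtime.freeMemory()` is only a crude stand-in for a free-heap figure. The JMX calls themselves (`ManagementFactory.getPlatformMBeanServer`, `registerMBean`, `getAttribute`) are standard JDK APIs.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class StatsExample {
    /** Standard MBean convention: public interface named after the class plus "MBean". */
    public interface CacheStatsMBean {
        long getCacheBytesUsed();
        long getFreeHeapBytes();
        int getActiveTransactions();
    }

    /** Hypothetical statistics holder updated by the application code. */
    public static final class CacheStats implements CacheStatsMBean {
        private final AtomicLong cacheBytesUsed = new AtomicLong();
        private final AtomicInteger activeTransactions = new AtomicInteger();

        void cacheGrew(long bytes) { cacheBytesUsed.addAndGet(bytes); }
        void transactionStarted() { activeTransactions.incrementAndGet(); }
        void transactionEnded() { activeTransactions.decrementAndGet(); }

        @Override public long getCacheBytesUsed() { return cacheBytesUsed.get(); }
        // Assumption: free memory in the currently allocated heap is a good enough proxy.
        @Override public long getFreeHeapBytes() { return Runtime.getRuntime().freeMemory(); }
        @Override public int getActiveTransactions() { return activeTransactions.get(); }
    }

    /** Registers the stats with the platform MBean server under the given name. */
    static ObjectName register(CacheStats stats, String name) {
        try {
            ObjectName objectName = new ObjectName(name);
            ManagementFactory.getPlatformMBeanServer().registerMBean(stats, objectName);
            return objectName;
        } catch (JMException e) {
            throw new IllegalStateException("could not register " + name, e);
        }
    }

    /** Reads one attribute back, as a JMX console would. */
    static Object read(ObjectName name, String attribute) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            return server.getAttribute(name, attribute);
        } catch (JMException e) {
            throw new IllegalStateException("could not read " + attribute, e);
        }
    }
}
```

Once registered, the attributes `CacheBytesUsed`, `FreeHeapBytes` and `ActiveTransactions` show up in any JMX client such as JConsole, with no log wading required.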
Hi John,
“At the end of the day “It was broke, we needed to fix it” … but it added a bit of enjoyment if we tried to apply a bit of scientific method to the whole process.”
I thoroughly approve! I wouldn’t consider what you’re doing wild speculation, given your methodology for eliminating theories (along with the tool building, additional instrumentation etc.). That methodology is the ingredient most people forget to include, which leads to the thrashing I’ve seen.
Word, the problem is exacerbated when those who are allowed to support the application are not those who developed it. In these cases, unless a comprehensive handover process has been followed, the support team is generally unaware of application-specific diagnostics even if they have been included, and often of diagnostic techniques such as dtrace that are available in the OS.
Also, how calm and analytic can one be when a senior exec is standing over one’s shoulder screaming about lost revenue?
Hi James,
“Also, how calm and analytic can one be when a senior exec is standing over one’s shoulder screaming about lost revenue?”
We’ve all seen that sort of thing haven’t we?
To take a positive view: it’s worth finding the courage to exploit the opportunity and explain how things can be changed to ensure it doesn’t happen again. One method I’ve used is to perform a postmortem, which of course will identify all those cut corners as the root cause and prescribe solutions.
Hope you’re keeping well!