<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Pragmatic Dictator &#187; operations</title>
	<atom:link href="http://www.dancres.org/blitzblog/tag/operations/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dancres.org/blitzblog</link>
	<description></description>
	<lastBuildDate>Sat, 31 Dec 2011 19:08:23 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Divided We Fall</title>
		<link>http://dancres.org/feeder/?FeederAction=clicked&#038;feed=Articles+%28RSS2%29&#038;seed=http%3A%2F%2Fwww.dancres.org%2Fblitzblog%2F2011%2F11%2F09%2Fdivided-we-fall%2F&#038;seed_title=Divided+We+Fall</link>
		<comments>http://dancres.org/feeder/?FeederAction=clicked&#038;feed=Articles+%28RSS2%29&#038;seed=http%3A%2F%2Fwww.dancres.org%2Fblitzblog%2F2011%2F11%2F09%2Fdivided-we-fall%2F&#038;seed_title=Divided+We+Fall#comments</comments>
		<pubDate>Tue, 08 Nov 2011 22:26:29 +0000</pubDate>
		<dc:creator>Dan Creswell</dc:creator>
				<category><![CDATA[Engineering]]></category>
		<category><![CDATA[Agile]]></category>
		<category><![CDATA[development]]></category>
		<category><![CDATA[devops]]></category>
		<category><![CDATA[lean]]></category>
		<category><![CDATA[operations]]></category>

		<guid isPermaLink="false">http://www.dancres.org/blitzblog/?p=382</guid>
		<description><![CDATA[Generally, the longer a defect remains undetected in a system, the more costly it will be to fix. I&#8217;ve seen this fact proven true over and over but you don&#8217;t have to take my word for it, ask Steve McConnell. I&#8217;ve always assumed this was well understood yet many organisations adopt processes, approaches and structures [...]]]></description>
			<content:encoded><![CDATA[<p>Generally, the longer a defect remains undetected in a system, the more costly it will be to fix. I&#8217;ve seen this fact proven true over and over but you don&#8217;t have to take my word for it: ask <a href="http://www.stevemcconnell.com/articles/art09.htm">Steve</a> <a href="http://www.stevemcconnell.com/rd.htm">McConnell</a>.</p>
<p>I&#8217;ve always assumed this was well understood yet many organisations adopt processes, approaches and structures that guarantee certain kinds of defects will be undiscovered for substantial periods of time. One of the more common faults is the separation of Development and Operations.</p>
<p>Each side has its own view of what&#8217;s important and what they&#8217;re responsible for:</p>
<ul>
<li>Operations more often than not seeks to own non-functional aspects (performance, stability, scalability etc).</li>
<li>Development more often than not seeks to own the functional aspects (features).</li>
</ul>
<p>Such a mindset often leads to a classic process mistake: issues with the functional aspects get dealt with early, while those linked to the non-functional and operational aspects stay hidden until last-minute grand testing regimes (P&amp;C, User Acceptance Testing) dig them out or, worse, until they are discovered at the point of release into production.</p>
<p>The warning signs are usually there if only we paid attention to them:</p>
<ol>
<li>Developers work in isolation, building, deploying and configuring the components they develop in ways that suit them. It follows that deployment and configuration are not optimised for production and do not account for any hard-won operational experience.</li>
<li>Operations staff demand huge handover documents be written by developers and passed over with the product. Inevitably the documentation fails to account for operational concerns (what would a developer know about operations?).</li>
<li>There are separate environments for the purposes of validating correctness and accuracy of handover documents. After all, developers can&#8217;t be trusted to get the documentation right so it must be checked.</li>
<li>The development environments are ad hoc with no resemblance to production (certainly they aren&#8217;t a scale unit of production). This leads to large numbers of problems at release time: files can&#8217;t be found, configurations are broken and various versioning issues present themselves.</li>
</ol>
<p>The antidote is relatively straightforward: all development activity should be performed in a production-like situation. For example:</p>
<ol>
<li>Deployment and configuration of software components under development should be routinely performed by operational staff. The result is early knowledge transfer and the documentation can now be written by those best able to produce it (operations staff, not developers).</li>
<li>Development environments should contain appropriate network topology. Often production setups contain segregated networks for security or availability reasons. Ensuring developers are exposed early to these issues means software is more likely to account for these demands.</li>
<li>Monitoring and logging infrastructure should be as per production and used routinely for debugging and capture of data relevant to testing (performance, failure etc).</li>
<li>Development environments should be scale units of production. This permits early production-like performance testing. This should be backed up with routine robustness testing e.g. to identify memory leaks early.</li>
</ol>
<p>A typical reaction is for development and operations staff to say this cannot possibly work and will slow development to a crawl. They aren&#8217;t actually wrong but they&#8217;re missing a key insight:</p>
<p><strong>If development has slowed to a crawl it&#8217;s an early warning of future production troubles.</strong></p>
<p>For example, if deployment is taking too much effort and time, something needs tweaking, simplifying or automating. What we&#8217;ve done is best summarised by a proverb from Toyota (<a href="http://theleanstartup.com/book">via Eric Ries</a>):</p>
<p><strong>&#8220;Stop production so that production never has to stop&#8221;.</strong></p>
<p>We&#8217;ve created a feedback loop that highlights defects spanning all concerns (functional, non-functional and operational) early, which keeps costs down.</p>
<p>Clearly, delivering a given feature will take a little longer as we must account for all aspects from functional through non-functional and operational. That&#8217;s acceptable because if we don&#8217;t cover all these aspects we&#8217;re asking for trouble in many forms including:</p>
<ul>
<li>If we cannot adequately monitor the performance of a newly delivered feature there&#8217;s a direct impact on customer experience. They will know before we do that something is broken, which leads to irate phone calls, lost revenue etc.</li>
<li>If we cannot adequately track the effect of a new feature on customer behaviours, we cannot evolve it appropriately.</li>
</ul>
<p>Needless to say, developing features in this fashion fits well with lean and agile approaches.</p>
<p>So the antidote is relatively straightforward and there are development approaches that fit well with what needs to be done. The toughest challenge remains, though: effecting the necessary mindset shift to get it done. It ought to be a little easier with the rise of DevOps but there are already early signs of trouble, as were seen with lean and agile adoption.</p>
<p>There are many who claim to know and practise each of these disciplines but most are paying only lip service, picking out the bits of process, mindset or tooling that suit them and ignoring the rest.</p>
<p>Sporting Index is right in the middle of making this tricky jump from Dev and Ops to DevOps; I&#8217;ll let you know how we get on.</p>
]]></content:encoded>
			<wfw:commentRss>http://dancres.org/feeder/?FeederAction=clicked&#038;feed=Articles+%28RSS2%29&#038;seed=http%3A%2F%2Fwww.dancres.org%2Fblitzblog%2F2011%2F11%2F09%2Fdivided-we-fall%2F&#038;seed_title=Divided+We+Fall/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wild Speculation</title>
		<link>http://dancres.org/feeder/?FeederAction=clicked&#038;feed=Articles+%28RSS2%29&#038;seed=http%3A%2F%2Fwww.dancres.org%2Fblitzblog%2F2009%2F09%2F09%2Fwild-speculation%2F&#038;seed_title=Wild+Speculation</link>
		<comments>http://dancres.org/feeder/?FeederAction=clicked&#038;feed=Articles+%28RSS2%29&#038;seed=http%3A%2F%2Fwww.dancres.org%2Fblitzblog%2F2009%2F09%2F09%2Fwild-speculation%2F&#038;seed_title=Wild+Speculation#comments</comments>
		<pubDate>Wed, 09 Sep 2009 20:15:48 +0000</pubDate>
		<dc:creator>Dan Creswell</dc:creator>
				<category><![CDATA[Engineering]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[development]]></category>
		<category><![CDATA[operations]]></category>

		<guid isPermaLink="false">http://www.dancres.org/blitzblog/?p=284</guid>
		<description><![CDATA[A bad habit I&#8217;ve noticed in many a techie: The tendency to thrash around and wildly speculate about the root cause of whatever production issue they&#8217;re facing. They tweak code and configuration following some random hypothesis or another, hoping that the issue will magically go away. It must surely be clear that this is a [...]]]></description>
			<content:encoded><![CDATA[<p>A bad habit I&#8217;ve noticed in many a techie:</p>
<p>The tendency to thrash around and wildly speculate about the root cause of whatever production issue they&#8217;re facing.  They tweak code and configuration following some random hypothesis or another, hoping that the issue will magically go away. It must surely be clear that this is a horribly inefficient way to solve a problem?</p>
<p>What&#8217;s required is data, data that we can use to home in on the source of the fault. We could wade through log files but this is inefficient and ought to be the last resort. Ideally we&#8217;d have some idea of what to look for beforehand.</p>
<p>Instrumentation is one tool we can use to guide our efforts. It can tell us things like how much memory is used, how much load there is, how many users are logged in, rate and types of request, cache hits etc.</p>
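<p>As a rough illustration of instrumenting our own code, here is a minimal Python sketch. The <code>Metrics</code> class, the <code>lookup</code> function and the metric names are all hypothetical, invented for this example; a real system would more likely export such figures through whatever monitoring stack it already has.</p>

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Tiny in-process metrics registry (hypothetical, for illustration only)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, amount=1):
        # Count an event such as a request or a cache hit.
        self.counters[name] += amount

    @contextmanager
    def timed(self, name):
        # Record how long the wrapped block took, even if it raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)

    def snapshot(self):
        # Flatten counters and timing summaries into one dictionary.
        summary = dict(self.counters)
        for name, samples in self.timings.items():
            summary[name + ".count"] = len(samples)
            summary[name + ".max_seconds"] = max(samples)
        return summary


metrics = Metrics()
_cache = {}


def lookup(key):
    """Example operation instrumented with a timer and cache hit/miss counters."""
    with metrics.timed("lookup"):
        if key in _cache:
            metrics.incr("cache.hit")
            return _cache[key]
        metrics.incr("cache.miss")
        _cache[key] = key.upper()  # stand-in for an expensive computation
        return _cache[key]
```

<p>Calling <code>metrics.snapshot()</code> after some traffic gives the request count, the slowest call and the cache hit and miss totals, which is the kind of data that narrows a search before anyone opens a log file.</p>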
<p>Self-tests are also useful as they can exercise common operations, perform internal consistency checks and provide feedback on what&#8217;s working and what&#8217;s not.</p>
<p>We can also get online memory dumps and there are tools like <code>dtrace</code> and <code>tcpdump</code>.</p>
<p>Given all these possibilities, why do we indulge in wild speculation? Perhaps it&#8217;s because we&#8217;ve foolishly left ourselves no choice:</p>
<ol>
<li>Instrumentation that should be a rich source of useful information is often limited to what is available from the operating system because we neglect to instrument our own code.</li>
<li>As with instrumentation, we don&#8217;t make the time to implement self-test facilities.</li>
<li>Only a few of us bother to learn about tools such as <code>dtrace</code>.</li>
<li>Logging, even if we could wade through it all, is often implemented in such a fashion that it cannot be turned on in production because the performance cost is too high.</li>
</ol>
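<p>On the logging point, one way to keep the cost down is to defer expensive message construction until a record is actually going to be written. The sketch below uses Python&#8217;s standard <code>logging</code> module; the <code>orders</code> logger name, the <code>ExpensiveDump</code> helper and the <code>process</code> function are assumptions made up for the example.</p>

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name
logger.setLevel(logging.INFO)         # production-style level: DEBUG is off


class ExpensiveDump:
    """Wraps a value whose rendering is costly; renders only when formatted."""

    calls = 0  # counts how often the costly rendering actually ran

    def __init__(self, order):
        self.order = order

    def __str__(self):
        ExpensiveDump.calls += 1
        return ", ".join(f"{k}={v}" for k, v in sorted(self.order.items()))


def process(order):
    # Passing the argument separately (rather than pre-formatting the string)
    # means __str__ runs only if a handler formats the record.
    logger.debug("processing order: %s", ExpensiveDump(order))

    # For work that cannot be deferred this way, guard it explicitly.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("item count: %d", len(order))

    return len(order)
```

<p>With the level at <code>INFO</code> the debug calls return almost immediately and the dump is never rendered, so the statements can stay in the code and be switched on when a problem needs investigating.</p>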
]]></content:encoded>
			<wfw:commentRss>http://dancres.org/feeder/?FeederAction=clicked&#038;feed=Articles+%28RSS2%29&#038;seed=http%3A%2F%2Fwww.dancres.org%2Fblitzblog%2F2009%2F09%2F09%2Fwild-speculation%2F&#038;seed_title=Wild+Speculation/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Sooner Than Later</title>
		<link>http://dancres.org/feeder/?FeederAction=clicked&#038;feed=Articles+%28RSS2%29&#038;seed=http%3A%2F%2Fwww.dancres.org%2Fblitzblog%2F2008%2F10%2F01%2Fsooner-than-later%2F&#038;seed_title=Sooner+Than+Later</link>
		<comments>http://dancres.org/feeder/?FeederAction=clicked&#038;feed=Articles+%28RSS2%29&#038;seed=http%3A%2F%2Fwww.dancres.org%2Fblitzblog%2F2008%2F10%2F01%2Fsooner-than-later%2F&#038;seed_title=Sooner+Than+Later#comments</comments>
		<pubDate>Wed, 01 Oct 2008 16:29:00 +0000</pubDate>
		<dc:creator>Dan Creswell</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[operations]]></category>

		<guid isPermaLink="false">http://www.dancres.org/blitzblog/?p=235</guid>
		<description><![CDATA[When building systems, there are some operational elements that it pays to get to grips with sooner than later: Deployment Packaging Configuration Monitoring Logging Failing to address these elements is detrimental to core aspects of what we need to do from day one: Get changes out &#8211; ship a new feature, deploy an urgent bug-fix [...]]]></description>
			<content:encoded><![CDATA[<p>When building systems, there are some operational elements that it pays to get to grips with sooner than later:</p>
<ul>
<li>Deployment</li>
<li>Packaging</li>
<li>Configuration</li>
<li>Monitoring</li>
<li>Logging</li>
</ul>
<p>Failing to address these elements is detrimental to core aspects of what we need to do from day one:</p>
<ul>
<li>Get changes out &#8211; ship a new feature, deploy an urgent bug-fix or make a tweak to handle a load-spike.</li>
<li>Determine if things have started up and been configured properly.</li>
<li>Be sure things are still running right.</li>
<li>Identify and react to problems quickly.</li>
<li>Obtain data important to future architectural decisions.</li>
</ul>
<p>Even in light of the above, many of us are still tempted into leaving this until later, by which time:</p>
<ol>
<li>Our software will have grown substantially, making it difficult and expensive to adapt when we do decide to address the operational issues.</li>
<li>We&#8217;ll be losing inordinate amounts of time on manual trouble-shooting and dealing with the consequences of human error (a <a href="http://research.microsoft.com/~gray/papers/TandemTR85.7_WhyDoComputersStop.doc">key contributor to downtime</a> and other problems).</li>
<li>Operations will likely have become tightly bound to whatever our software currently looks like, such that when we start addressing the issues, we&#8217;ll break all their assumptions (and the tooling they built around them).</li>
</ol>
<h2>Some Specifics</h2>
<p>Having configuration buried inside your binaries where it cannot be easily managed is an inconvenience.  We don&#8217;t really want to have to do a whole new build just to change configuration settings (though one might want to do a re-deploy of the whole lot together to allow for audit-trails and have half a chance of having all boxes configured similarly at the same time).</p>
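<p>A minimal sketch of keeping configuration outside the binaries, in Python. The setting names, the <code>APP_</code> environment prefix and the JSON file format are assumptions chosen for illustration, not a recommendation of any particular scheme.</p>

```python
import json
import os

# Built-in defaults; the keys and values here are purely illustrative.
DEFAULTS = {"port": 8080, "cache_size": 1000}


def load_config(path, environ=None):
    """Merge defaults, an external JSON file and environment overrides."""
    environ = os.environ if environ is None else environ
    config = dict(DEFAULTS)

    # Settings in the external file take precedence over built-in defaults.
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            config.update(json.load(handle))

    # APP_<KEY> environment variables (hypothetical prefix) win over the file
    # and are converted to the type of the corresponding default.
    for key, default in DEFAULTS.items():
        override = environ.get("APP_" + key.upper())
        if override is not None:
            config[key] = type(default)(override)

    return config
```

<p>Changing a setting then means editing a file or an environment variable rather than producing a new build, and the file itself can still be shipped and versioned with each deployment for the audit trail.</p>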
<p>When it comes to deployment and packaging it pays to adopt something akin to the <a href="http://en.wikipedia.org/wiki/XCOPY_deployment">xcopy install approach</a>. Everything required is contained inside of the distribution with minimal external dependencies (necessary external dependencies should ideally be satisfied dynamically at runtime rather than with static configuration).  Such an approach for desktop software would be unattractive but with servers and an imperative to automate installation it&#8217;s very attractive.</p>
<p>What about all those existing packaging systems such as rpm? Many of these mechanisms have a design assumption around a single version of something on a machine. This can inhibit fast rollback because, rather than stopping one process and starting another, one has to (in simple terms):</p>
<ol>
<li>Stop a process.</li>
<li>Uninstall its binaries and dependencies.</li>
<li>Install the binaries for the old process and its dependencies.</li>
<li>Start the old process up.</li>
</ol>
<p>In some cases it will also be necessary to perform further configuration (did we back it up?). Suddenly it&#8217;s looking like a lot of work to buy ourselves appropriate risk-mitigation for broken upgrades.</p>
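<p>By contrast, if each version is a self-contained directory, several can sit side by side and rollback becomes a matter of repointing at the previous one. The Python sketch below shows one way this might look; the <code>releases</code> and <code>current</code> layout is an assumption for the example, and it relies on POSIX symlink and rename behaviour.</p>

```python
import os


def activate(root, version):
    """Point <root>/current at <root>/releases/<version>.

    Assumes a hypothetical layout where every release is unpacked into its
    own directory under <root>/releases and processes start from
    <root>/current.
    """
    target = os.path.join(root, "releases", version)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)

    link = os.path.join(root, "current")
    staging = link + ".new"

    # Build the new symlink under a temporary name first.
    if os.path.lexists(staging):
        os.remove(staging)
    os.symlink(target, staging)

    # Renaming over the existing link swaps versions in a single step on POSIX.
    os.replace(staging, link)
    return os.path.realpath(link)
```

<p>Upgrade and rollback are then the same operation: stop the process, call <code>activate</code> with the version wanted, start the process again. Nothing is uninstalled, so the old binaries and their configuration are still there if the new version misbehaves.</p>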
<p>Monitoring often requires an amount of configuration, which can make for a bootstrap problem where one needs monitoring to detect a configuration issue but the monitoring isn&#8217;t configured yet.  Thus it can be useful to have some very simple monitoring based on a primitive that can run without explicit configuration, such as multicast.</p>
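<p>A minimal sketch of such a multicast heartbeat in Python, using only the standard <code>socket</code> module. The group address, port and message fields are assumptions picked for the example; the only thing sender and listener need to share is that well-known group and port.</p>

```python
import json
import socket
import time

# Hypothetical well-known group and port, baked in so nothing needs configuring.
GROUP, PORT = "239.255.0.1", 50000


def encode_heartbeat(service, status, now=None):
    """Serialise a heartbeat message as UTF-8 JSON bytes."""
    payload = {
        "service": service,
        "status": status,
        "ts": time.time() if now is None else now,
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def decode_heartbeat(data):
    """Parse a heartbeat received from the network."""
    return json.loads(data.decode("utf-8"))


def send_heartbeat(service, status, group=GROUP, port=PORT):
    """Send one heartbeat datagram to the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        # TTL of 1 keeps the datagram on the local network segment.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.sendto(encode_heartbeat(service, status), (group, port))
    finally:
        sock.close()


def open_listener(group=GROUP, port=PORT):
    """Return a UDP socket joined to the multicast group on all interfaces."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    membership = socket.inet_aton(group) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
    return sock
```

<p>A process would call <code>send_heartbeat</code> periodically from startup onwards, and any box on the segment can call <code>open_listener</code> and pass what it receives through <code>decode_heartbeat</code> to see who is alive, before any richer monitoring has been set up.</p>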
<h2>Important Step</h2>
<p>These key operational elements should be accounted for early on in the design of a system and grown alongside other functional aspects.<sup>*</sup> There&#8217;s plenty of information on this topic publicly available including:</p>
<ul>
<li>Randy Shoup &#8211; <a href="http://www.infoq.com/presentations/shoup-ebay-architectural-principles">eBay Marketplace Architecture</a>.</li>
<li>Dan Pritchett &#8211; <a href="http://www.infoq.com/presentations/operational-manageability">Architecture Quality: Operational Manageability</a>.</li>
<li>Wayne Fenton &#8211; <a href="http://www.infoq.com/presentations/Operational-Scalability-Wayne-Fenton">Operational Stability in The Next Generation Web World</a> (a variation of the talk above).</li>
<li>Michael Isard &#8211; <a href="http://research.microsoft.com/users/misard/abstracts/osr2007.html">Autopilot: Automatic Data Centre Management</a>.</li>
<li>James Hamilton &#8211; <a href="http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf">Designing and Deploying Internet Scale Services</a>.</li>
</ul>
<p>* Initially, the implementation can be simple scripts, but at some point it becomes necessary to take a more serious approach in respect of tools and infrastructure development.  This means investing in properly skilled architects and engineers, performing appropriate testing etc.</p>
]]></content:encoded>
			<wfw:commentRss>http://dancres.org/feeder/?FeederAction=clicked&#038;feed=Articles+%28RSS2%29&#038;seed=http%3A%2F%2Fwww.dancres.org%2Fblitzblog%2F2008%2F10%2F01%2Fsooner-than-later%2F&#038;seed_title=Sooner+Than+Later/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

