Archive for the “Architecture” Category
Dan Pritchett presents some good insight in respect of large scale systems.
I’m left to ponder the fact that whilst Dan is deliberately creating these systems….
I find myself looking at nondeterministic systems a lot lately. Many solutions for the challenges of extreme scale involve relaxing constraints and coping with the ensuing chaos.
….there’s many a techie out there building large systems unaware of the fact that they can’t assert order on such a beast. All they can do is hold back the wave of chaos whilst having to clear up the odd drop that made it over the levee (e.g. consequences of a network outage or operation ordering problems). They are a long way from appreciating that chaos is a given, let alone actively managing it.
In a further twist of irony, for all that chaos seems to introduce complexity, if you accept its existence and work with it you often end up with something simpler than was ever possible with “old world” thinking.
Technorati Tags: distributed systems, architecture, chaos
Dare has this to say about API versioning. His basic approach can be summarized as don’t change anything at all and separate your URL space using some version moniker. This is a good piece of advice but I’m more wary of other aspects of the posting such as the suggestion that we should support all versions over time to maintain backward compatibility.
I’m tempted to suggest that this is a classic Microsoft (and enterprise) mindset: pile up a stack of legacy and backward compatibility requirements, which can potentially lead to a mess of complexity and brittleness underneath the APIs. I feel the reasoning around client migration also needs re-examination. This is because Live Messenger doesn’t live in a browser. It is a desktop native binary application and therefore has completely different characteristics from a genuine web application in terms of dispersal, upgrading and updates.
Consider that for many a browser based application, upgrades are but a reload away. Further, many a web service that is mashed up as part of another offering will ultimately present to a user as a browser-based UI. That is, we don’t need all our desktops to upgrade; we need to upgrade the integration point in the backend services of the mashups that use the web service. That’s at least potentially considerably fewer entities that need to upgrade than is normal in the desktop world.
Dare’s posting also seems to have an underlying current of belief that one can define an API perfectly in the first place. We have known for a long time that such a feat of prediction is often very difficult though it can be made more manageable via functional partitioning.
Limited lifetime is one of the best antidotes we have to the complexity of legacy and the brittleness of code. We resist it far too often leading to the mess we have in many an enterprise:
- Instability
- Slower Feature Delivery
- Difficult maintenance
The web, the human body and many real-world systems (cars for example) exploit the concept of limited lifetime successfully. Perhaps this philosophy should be more widely applied in software?
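As a minimal sketch of how a version moniker in the URL space might combine with a limited lifetime, consider the following. The version names, the retirement date and the `resolve` helper are all hypothetical, invented purely for illustration; none of it is taken from Dare’s post or any real API.

```python
from datetime import date

# Hypothetical registry: version moniker -> retirement date (None = no date set yet).
API_VERSIONS = {
    "v1": date(2008, 1, 1),
    "v2": None,
}

def resolve(path, today):
    """Map a request path such as /v1/contacts to (status, version, resource)."""
    parts = path.strip("/").split("/", 1)
    version = parts[0]
    resource = parts[1] if len(parts) > 1 else ""
    if version not in API_VERSIONS:
        return (404, None, resource)       # unknown version moniker
    retires = API_VERSIONS[version]
    if retires is not None and today >= retires:
        return (410, version, resource)    # Gone: this version reached end of life
    return (200, version, resource)
```

The point of the sketch is that retirement is part of the design from day one: each version carries its own end date, and old clients get an explicit “gone” answer instead of the service carrying every version forever.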
Technorati Tags: architecture, browser, enterprise, evolution, versioning, web
Leave cement lying still for too long and it sets. Once it sets, making adjustments becomes exceedingly difficult. You’ll need pneumatic drills to break it down and bulldozers to remove it. Then you’ll need to bring in a whole new batch to be re-cast in accordance with your wishes.
So it is that software left unstirred for too long becomes brittle and will demand wholesale replacement. One can build a certain amount on such a foundation but eventually it will need a complete reconstruction affecting everything built atop. In this world stirring is refactoring.
Cement has other properties that determine just how quickly it sets and it is the same for software, in the form of dos and don’ts like loose coupling and separation of concerns. It’s fascinating to note that we spend a massive amount of time focusing on making software malleable at compile/build time (Spring anyone?) but considerably less effort on ensuring similar flexibility post deployment, making for brittleness in the face of failure, upgrade, configuration changes, scaling etc.
Some aspects of this analogy (not sure what the equivalent of stirring would be) can also be applied to organizational behaviours. For example, if one focuses a business heavily on tactical operation for an extended period of time, adapting to more strategic operations will be impeded by its structure, modus operandi and culture. Necessary changes might include hiring, firing, reorganization (with significant impact on existing hierarchy), process revision etc.
Technorati Tags: architecture, software
[Blame the child in me for the gratuitous photos of construction tools]
We rely a lot on hardware, tools or software optimizing automatically. We expect it from processors, compilers, databases, javaspaces, caches and so on.
But there’s a limit to what can be achieved because these elements can only make a best guess at what we are trying to achieve and optimize on that basis.
These guesses are based on limited context - e.g. an instruction stream or the recency and/or frequency with which something is updated. The big picture as to what’s really going on is in the surrounding application or maybe even just the programmer’s head.
Thus without programmer involvement, there’s always a limit to what we can deduce from limited contextual information. The problems come when our systems can’t deduce enough to be effective in their optimization. This can happen because there isn’t enough information (it may have been removed or was never present) or more interestingly because the programmer writes code that isn’t amenable to the optimization process.
Thus it is quite interesting that we spend much time hiding details like concurrency or distribution from our programmers (via frameworks or tools) actually preventing or at minimum discouraging them from getting involved in and specifying the necessary details to optimize effectively.
Technorati Tags: architecture, design, distributed systems, performance
There’s been a lot of chatter recently in the blogosphere about the technical direction of the database. This discussion has been ongoing for some time and dates back to at least Bosworth’s utterings.
It kind of reminds me of the old “we’ll never use that much memory” argument: “you’ll never need more than one database”. I wonder, have we got to the point where most decent-size systems, be they web or enterprise, will need to store more data than can be held in a single database?
Perhaps partitioning and use of dumber storage mechanisms [1] will become the norm? It strikes me that this might be a more promising approach [2] than attempting to build clusters when working with utility compute platforms like EC2/S3. Seems like we have some design patterns appearing, just need the frameworks to catch up?
[1] Might include RDBMS used purely for storage, containing no business logic.
[2] Partitioning might provide simpler and cheaper growth strategies than the aggravation of upgrading the server hardware for the database
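A minimal sketch of the partitioning idea, assuming the “dumb” storage is nothing more than a key/value store (plain dicts stand in for it here). The `PartitionedStore` class and its methods are hypothetical names made up for this example.

```python
import hashlib

class PartitionedStore:
    """Spread keys over several dumb key/value stores (plain dicts here)."""

    def __init__(self, partitions):
        self.stores = [dict() for _ in range(partitions)]

    def _index(self, key):
        # Stable hash so the same key always lands on the same partition,
        # across processes and restarts (unlike Python's built-in hash()).
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.stores)

    def put(self, key, value):
        self.stores[self._index(key)][key] = value

    def get(self, key, default=None):
        return self.stores[self._index(key)].get(key, default)
```

Note that simple modulo placement like this moves most keys whenever the number of partitions changes; schemes such as consistent hashing exist to soften that, which is the sort of thing one would hope the frameworks eventually take care of.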
Technorati Tags: architecture, storage
A transaction is an abstraction that provides some combination of the ACID properties.
This is just a concept. The trouble is we’ve created implementations of the concept that carry the same name. The result has been that we’ve forgotten about the concept and associate the term “transaction” typically with the RDBMS implementation of the concept.
This confusion can have significant impact when we get to design of a transactional system. What you really want to build is a system which provides the appropriate balance of ACID properties but what you usually end up doing is constructing your system around a database - i.e. we dumped design and went straight to a common implementation with no consideration for what we really set out to achieve.
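To make the concept/implementation distinction concrete, here is a minimal sketch of something that is transactional without being a database: an in-memory map that offers atomicity and isolation but deliberately no durability. `AtomicMap` and `transfer` are hypothetical names invented for this illustration, and the approach (copy, mutate, publish) is just one simple way to get all-or-nothing behaviour.

```python
import copy
import threading

class AtomicMap:
    """In-memory map offering atomicity and isolation only (no durability)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def transact(self, fn):
        # Isolation: one transaction at a time.
        with self._lock:
            working = copy.deepcopy(self._data)
            fn(working)               # may raise; the copy is then discarded
            self._data = working      # publish: all of the changes or none

    def get(self, key):
        with self._lock:
            return self._data.get(key)


def transfer(accounts, src, dst, amount):
    """Move an amount between two keys as a single transaction."""
    def work(d):
        d[src] -= amount
        d[dst] += amount
        if d[src] < 0:
            raise ValueError("insufficient funds")
    accounts.transact(work)
```

A failed `transfer` leaves both balances untouched even though `work` had already modified its working copy. Whether that particular balance of properties is appropriate is exactly the design question that gets skipped when we jump straight to an RDBMS.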
And in a twist of irony, I was going to link to the ACID page over at Wikipedia but it makes the very mistake I’m blogging about! So instead I’ll point to Page 3 of Transaction Processing: Concepts and Techniques by Gray and Reuter.
Technorati Tags: architecture, design, distributed systems
Normally we class availability as a non-functional requirement of the systems we build but I’m starting to think that’s a mistake because often when we get to project scheduling it’s forgotten. We allocate time, budget and resource to make sure we’re feature complete, that this button or that tick box is visible or that we can gather credit card details but rarely do we treat availability similarly.
Unlike a functional requirement, which has some form of physical manifestation, availability tends to be invisible, dotted around the system in all sorts of different places from architecture and code to hardware. In contrast, real features are a logical sequence of inter-linked pieces of code and design that are naturally traceable from entry to exit point. No entry point means the feature is not complete and this can be trivially determined prior to deployment. The first time we know availability isn’t “feature complete” is when our deployed website is down.
Worse, with every single feature or subsystem we add and the corresponding increase in load and size of dependency graph there’s an increased chance that we compromise availability. This can make for some serious growing pains where we add more features, attract more customers and thus more load all the time paying no attention to availability until it’s too late.
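A rough worked example of why a growing dependency graph eats availability. It assumes every component on a request path must be up and that failures are independent, which is a simplification; the function names are made up for the illustration.

```python
def serial_availability(availabilities):
    """Availability of a request path that needs every listed component up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def downtime_hours_per_year(availability):
    """Expected unavailable hours in a 365-day year."""
    return (1.0 - availability) * 365 * 24
```

Under those assumptions, a single component at 99.9% is down for about 8.76 hours a year, whereas a feature that depends on five such components comes out at roughly 99.5%, or about 43.7 hours a year. Each added dependency quietly spends some of the availability budget.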
Consider the wiring in a house: we add a garage which needs lighting, maybe heating and a motor-powered door. Then we build an extension for guests, with more lighting, heating and wall sockets. Eventually something gives; maybe it’s a piece of wiring that should’ve been replaced or a bad junction, but the net effect is to risk burning out all the wiring in the house. Replacing all the wiring would’ve been hard enough without the extensions but now the house is twice as big and the wiring is four times more complex. Worse, we cannot do a piecemeal upgrade; instead we must replace all the wiring at once and until it’s done we can’t have all the heating on or we can’t cook at the same time as using the garage.
Like the wiring, availability needs constant and sometimes significant attention and putting the work off can lead to significant cost down the line. Similar arguments apply to other non-functional aspects of our systems such as scalability (it’s no good adding features that attract more customers if the system won’t cope with the additional load).
In light of the above, I believe it might be better to classify availability as a feature if not the feature. After all if the system is not available, none of the cool features we’ve implemented are worth anything because they can’t be accessed. It might also be a catalyst for discussion (especially amongst non-technical staff) of availability’s costs and benefits as compared to those of other features leading to improved strategic thinking.
Technorati Tags: architecture, distributed systems, availability, web
I came across a new concept in my travels this week, an invitation with compulsory acceptance.
Is an invitation that is compulsory to accept an invitation at all? Surely most people would see it as nothing more than a dressed up way of demanding attendance? Surely such “garnish” (Ben Elton fans will know exactly what I’m talking about) just engenders bad feeling in many recipients?
Meanwhile in the world of systems…….
We often do the equivalent of compulsory invitations in the way we build software - consider for example the whole dependency injection thing.
In many cases we build some object via a constructor that expects to be injected with a whole heap of other things. How do we guarantee all those things will attend and remain present for the lifetime of our object? After all, without these things our object cannot perform its intended task. About the only way this works is by insisting that all these things exist within the same JVM as the object to be injected - then they all either exist or don’t exist.
Clearly, this doesn’t work so well in distributed systems where achieving the same availability assurance is considerably more difficult. A slightly more subtle issue is that if our systems are written in the above manner, upgrades are also more difficult because we must manage a dependency tree. For example, we wish to upgrade System A and it has dependants, Systems B and C, both of which will need to be taken off-line whilst we upgrade A.
What is required is a more dynamic form of injection, something that can be reinitialised or changed which at least invalidates injection via constructors (well, unless we throw the whole object out and rebuild it from scratch). In fact, we probably require some kind of event-based solution such that systems can get liveness information about the systems they rely upon and take appropriate action (which will likely include dealing with in flight operations).
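A minimal sketch of what such dynamic injection could look like, written in Python for brevity. Dependants are handed a rebindable reference in place of the dependency itself and get told when it comes or goes. Every name here (`ServiceRef`, `Reporter`, the pricing stubs) is hypothetical, and returning `None` while the dependency is away is only a placeholder policy.

```python
class ServiceRef:
    """A rebindable handle that dependants hold instead of the service itself."""

    def __init__(self, name):
        self.name = name
        self._target = None
        self._listeners = []

    def on_change(self, listener):
        # listener(name, is_up) is called on every liveness change
        self._listeners.append(listener)

    def bind(self, target):
        self._target = target
        self._notify(True)

    def unbind(self):
        self._target = None
        self._notify(False)

    def _notify(self, is_up):
        for listener in list(self._listeners):
            listener(self.name, is_up)

    def call(self, method, *args):
        if self._target is None:
            raise LookupError(f"{self.name} is not currently available")
        return getattr(self._target, method)(*args)


class Reporter:
    """A dependant that degrades rather than dying when its dependency leaves."""

    def __init__(self, pricing_ref):
        self.pricing = pricing_ref
        self.events = []
        pricing_ref.on_change(lambda name, up: self.events.append((name, up)))

    def quote(self, item):
        try:
            return self.pricing.call("price", item)
        except LookupError:
            return None   # placeholder policy: a real system might queue or retry


class PricingV1:
    def price(self, item):
        return 10


class PricingV2:
    def price(self, item):
        return 12
```

With this shape, upgrading the pricing service is an `unbind()` followed by a `bind(PricingV2())`; the `Reporter` stays up throughout and sees the change as events instead of being torn down and rebuilt.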
Technorati Tags: architecture, design, distributed systems, philosophy
In the last couple of years we’ve seen the arrival of the mashup which is at least on some level nothing more than the latest in a long line of terms for integration. Thus far most mashups consist of a simple amalgamation of a couple of services which leads to a very flat graph of service dependencies. Service dependencies are things like:
- Data Schemas - structure of data provided by services
- Endpoints - location of service be it a URL or a WSDL endpoint
- Availability - whether a service is available for use
- Reliability - whether a service that is available behaves as expected
These mashups are already at the mercy of the underlying services they are built on. The mashup provider has little control over these services. For the most part these mashups work but it’s because they have only a few moving parts such that the likelihood of issues is low.
Many enterprises have considerably deeper dependency graphs inside their firewalls and have to work hard to keep them stable. There’s probably some limit to what can be achieved once the dependency graph gets beyond a certain depth and it might well be that the maximum depth is smaller once external services are brought into the mix. The maximum depth is likely further reduced because these enterprises wish to treat external services as if they are part of their organization. They want to be able to integrate them using transactions, they want them to have the same level of reliability as what lives within their own data-centres, they want integrated security options etc.
I suspect that a lot of what can be done in a single enterprise (at substantial cost) such as high reliability is going to be considerably more (prohibitively?) complex to achieve across organizations. This is because the level of control required to achieve these targets is beyond that available across enterprises.
Right now I think there’s much effort being made to paper over these issues, with features in WS-*, SLAs etc. I wonder if it might be better to give up on this idea of control and build some simpler solutions….
Technorati Tags: distributed systems, enterprise, utility computing, web
“Make as much stateless as possible” is the mantra but I wonder if we’re being a little over-zealous in our application? Consider this note in Fielding’s REST thesis:
“Like most architectural choices, the stateless constraint reflects a design trade-off. The disadvantage is that it may decrease network performance by increasing the repetitive data (per-interaction overhead) sent in a series of requests, since that data cannot be left on the server in a shared context. In addition, placing the application state on the client-side reduces the server’s control over consistent application behavior, since the application becomes dependent on the correct implementation of semantics across multiple client versions.”
Thus while statelessness is often claimed to achieve scalability, in certain applications that may not be the case due to the resultant load on the network.
Our pursuit of statelessness leads us to behaviours such as making a single entity responsible for the maintenance of all state. Often it’s a database that becomes a black hole sucking up hardware, network bandwidth, admin time and endless tuning effort. It also becomes the focus of our reliability concerns, with a need for clustering, RAID arrays etc. Stand around long enough and you’ll hear terrified utterings from staff such as “if we ever lose the database….”
Making some single thing responsible for all these aspects of our system is asking for trouble. Having all these heavyweight concerns squeezing down on a single element ultimately leads to breakage.
History shows that we aren’t entirely happy with this “single point of responsibility for all state”. We have cookies in browsers, local storage in browsers, thin clients that rely on servers to store all state and so on.
Perhaps we’re ignoring an underlying message: Maintenance of state is a shared responsibility for a system. We should seek to place that responsibility in appropriate places at appropriate times and be much more aware of responsibility boundaries and when it’s appropriate, share that responsibility amongst components.
Generally we consider TCP to be responsible for ensuring that state makes it to the other end of the connection. One hands some data to the TCP layer and we expect that it will ensure the data reaches the recipient. But is this true? What happens if we suffer a power outage before TCP transmits the data? When the machine restarts, is TCP going to restart and resend all that unsent data? Clearly not, whoever delegated responsibility to TCP for this data will now need to take steps to recover the situation.
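One way the delegating party might keep hold of its share of the responsibility is sketched below: record each message locally before handing it to the network, and forget it only when the receiving application (not TCP) confirms receipt. The `Outbox` class, its JSON journal file and the method names are all assumptions made for this example, not a description of any particular product.

```python
import json
import os

class Outbox:
    """Keeps each message on disk until the receiving application confirms it."""

    def __init__(self, path):
        self.path = path
        self.pending = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.pending = json.load(f)

    def _save(self):
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.pending, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)   # atomic swap of the journal file

    def enqueue(self, msg_id, payload):
        self.pending[msg_id] = payload
        self._save()                 # record it before handing it to the network

    def acknowledge(self, msg_id):
        # Called only when the remote application (not TCP) confirms receipt.
        if msg_id in self.pending:
            del self.pending[msg_id]
            self._save()

    def unacknowledged(self):
        # After a restart, these are the messages to send again.
        return dict(self.pending)
```

After a power outage the sender reloads the journal and resends whatever is still unacknowledged, which in turn means the receiver has to tolerate duplicates. The responsibility ends up shared between both ends and the transport.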
What about a message queue? Typically we place some data in the queue and demand that it absolutely must deliver that message and not lose it in the meantime. That’s an awful lot of responsibility for a single component to carry! As an aside we’re also potentially making that queue a performance bottleneck of the future.
Then there’s the Web which in many cases puts responsibility on the client for maintenance of state. This is achieved through retries, restoring backups, re-entering details etc. Notably, this is the case even if the client “fails” e.g. your home router goes down or the PC overheats. There’s a certain amount of illusion here too where we believe the responsibility for state maintenance has been placed elsewhere e.g. Flickr. Ideally they don’t want to lose all your precious pictures but if they do, who will have to restore all that information?
I think it’s interesting that placing such responsibility with some single entity is perceived as the easy solution but it has a lot of hidden costs like redundant hardware, clustering, strict data-centre environment control, backups etc.
Spreading responsibility might ultimately be easier and fit with our desire for utility computing but it’s not commonplace and thus we’re lacking well documented patterns, software components etc. We are seeing some examples, however: I would speculate that S3’s API is the way it is precisely because it relies on spreading responsibility for state across a co-operative shared-nothing system rather than placing it all in a single shared-everything cluster.
Technorati Tags: amazon, distributed systems, enterprise, web, utility computing
The “what is a JavaSpace” debate rages on………….
Fundamentally, I don’t see this as being an exercise in pragmatists vs purists. It’s a difference in engineering philosophy/taste. Personally, I see a JavaSpace as well, errr, a JavaSpace. There are lots of things you can do with it:
- Co-ordination mechanisms
- Systems of co-operating entities for e.g. locking, queues etc
- Compute Servers
- Messaging
- Caching
Note that all of these things can be built on top of JavaSpaces. This leads to two different schools of thought:
- Make all these different things a part of a JavaSpaces implementation
- Make all these things layered frameworks that live on top of JavaSpaces
Arguments for the first option are often on the basis of improving performance or “making JavaSpaces more useful”. But JavaSpaces is already useful, acting as good simple underpinnings for all these different things.
And this is why I personally prefer the second option. I’m a minimalist, I like my JavaSpaces nice and simple and I like being able to construct cleanly layered frameworks on top of JavaSpaces. This allows me to have nicely separated responsibilities at each layer leading to (IMHO) better, more understandable, more maintainable design in my systems. I also like to avoid building such layers on top of JavaSpaces if there’s something out there already that can do the job better in a specific scenario.
[ Yep, read that paragraph again and realize that I am both pragmatist and purist. How annoying that I cannot be easily pigeon-holed as one or the other! ]
There’s no accounting for taste but let’s be clear that it has nothing to do with being pragmatic or purist. It seems like Cameron also has a taste for minimalism……
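The second option above, layering, can be sketched in a few lines. What follows is a toy, single-process stand-in written in Python, not the JavaSpaces API: real JavaSpaces adds leases, transactions and notification, and `TinySpace` and `SpaceQueue` are names invented for this example.

```python
class TinySpace:
    """A toy, single-process stand-in for a space: write, read and take only."""

    def __init__(self):
        self._entries = []

    def write(self, entry):
        self._entries.append(dict(entry))

    def _match(self, template):
        # A template matches an entry when every field it specifies is equal.
        for entry in self._entries:
            if all(entry.get(k) == v for k, v in template.items()):
                return entry
        return None

    def read(self, template):
        entry = self._match(template)
        return dict(entry) if entry is not None else None

    def take(self, template):
        entry = self._match(template)
        if entry is not None:
            self._entries.remove(entry)
        return entry


class SpaceQueue:
    """A FIFO queue layered on top of the space; the space knows nothing of it."""

    def __init__(self, space, name):
        self.space, self.name = space, name
        self.head = self.tail = 0

    def put(self, item):
        self.space.write({"queue": self.name, "seq": self.tail, "item": item})
        self.tail += 1

    def get(self):
        entry = self.space.take({"queue": self.name, "seq": self.head})
        if entry is None:
            return None
        self.head += 1
        return entry["item"]
```

The queue is built purely from the space’s three operations, so the space itself stays small. In a genuinely shared setting the head and tail counters would themselves have to live in the space as entries; they are kept local here only to keep the sketch short.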
Technorati Tags: design, engineering, philosophy
Hot news is the proposal for a REST API in Java. Equally interesting is that various members of the community are asking for a simple API with none of the usual cruft.
I’m left wondering if such an approach will just leave you with all the existing http libraries and web infrastructure (servers, caches and proxies) such as Apache’s HttpClient.
It might be that the real core of doing better REST is in tools for converting from REST-based designs into code and infrastructure rather than an API per se. Could be time to go look at what DHH and Rails have been doing in this area???
Technorati Tags: design, REST, Rails
Observations on the long running “browser as a platform” debate.
We often talk about browser-based apps but really what we’re talking about (at least for now) is browser-accessed apps (web apps). This is because the browser doesn’t really house much app logic - it mostly does the presentation and bundles up data for processing by remote logic. AJAX has made it easier to do this in more user-friendly/intuitive ways but the computational load and location of logic is clearly at the back-end.
The browser is a powerful force in that it’s pretty much on every desktop. It’s become part of a computer’s DNA. Delivery of an “app” via the browser means it can permeate to virtually all desktops and is endowed with the appearance of “zero install time”. Of course, we forget about the time spent downloading and installing browser upgrades and plugins because it’s been made “low-pain”.
It’s widely held that the bulk of innovation goes on in the browser but that’s at least in part inaccurate - the innovation is in the apps themselves which typically exploit the increasing level of networked’ness that is a hallmark of today.
The current divide of web app logic between browser and web back-end has another benefit. Essentially it endows the app builder with the ability to choose their deployment platform. The front-end (browser) has a limited but well understood set of presentation technologies (HTML, JavaScript etc) that the back-end must deliver against but one can use pretty much any back-end technology to generate the content to be consumed by the front-end. Thus the app builder can select their preferred webserver, libraries, languages etc for the task knowing that it has almost zero effect and dependency on the front-end. However, note there is some pain out there caused by variations in browser behaviour which mimics the kinds of trouble desktop-based apps face daily.
The nature of networking (bandwidth constraints and latency) exerts some force on the architecture of web apps, which means we must use a single bulk refresh to do the work rather than lots of querying. And bulk refresh is slow. AJAX helps here because we can hide that bulk update in the background to some extent, but there’s a computational penalty in the form of polling.
One thing web apps aren’t so good at right now is exploiting the typically huge number of spare CPU cycles available at the client-side. All the load is in the data-centres and networks, a classic centralized architecture with all the usual scaling issues. Perhaps the most significant aspect though is how un-green this is. Surely there’s some sense in utilizing that client CPU for more than just presentation logic?
Expanding use of client-side compute power beyond presentation will begin to change the architecture of the typical web-app. For it to work we’ll need to do more serious app logic client-side, which might take us back to longer downloads of that logic and threaten the “zero cost of install” perception. It’s also possible that browsers will need to support more sophisticated programming languages. The browser as a platform could become a more difficult concept to execute well because we will encounter challenges such as providing consistent API behaviour across filesystems etc. We’ve seen these kinds of problems played out previously - Java anyone? One possible way around the API issue might be to adopt a more abstract approach such as having the browser provide all this stuff via a REST-based web interface available to in-browser app logic.
In complete contrast to the web app is the desktop app where rather than do almost all work in the back-end on the net somewhere we do it all in the front-end which limits computation and storage to what’s “near” (on or accessed via networks local to) the desktop machine.
Likely as not the ultimate solution is a hybrid of web and desktop app - seamless integration into the browser with sufficient client-side desktop API access to make genuine client-side processing easier whilst accessing network resources like servers and storage so users can “break out” from the limits of their own machine. The trick will be in defining a workable limit to browser APIs and programming languages which balances the need to avoid the cross platform consistency issues against sufficient functionality to make client-side apps appropriately powerful.
Technorati Tags: architecture, distributed systems, web
No doubt many are aware of the fact that bird flu has hit East Anglia and there’s plenty of discussion about future implications so I shan’t be dwelling on that.
From a purely techie standpoint there are some interesting parallels with the way we build and deploy IT systems. Consider that the poultry farm where the outbreak occurred claimed to be keeping all its birds in buildings providing “secure biological containment” such that this sort of thing shouldn’t happen and yet it did happen and look at the damage! Does this remind you of the average corporate with all its systems centralized and hidden behind a firewall?
Since the original outbreak there’s been follow on discussion about what would have happened had the birds in question been free-range. These birds would, for a start, have stronger immune systems and not be so closely packed thus potentially limiting the impact. Sounds a little like a distributed system maybe?
What about issues related to the growth of London? Transport is over-stretched because there are too many people trying to get to work morning and evening. We build more roads or attempt to cram more trains into an already congested timetable which temporarily fixes the problem and attracts more people leading to a further transport disaster and so the cycle continues. Similar effects can be seen in locating suitable housing, refuse management and water systems. All of which sounds to me like the regular stresses and strains we suffer attempting to scale our centralized IT systems to cope with load.
One suggestion for addressing the problems involves moving business out of London to locations in Wales or the Midlands or Scotland. Of course, we’d need to improve the transport network making it easier to get to airports from these other locations, perhaps building additional roads to make access easier. The argument being that such a distributed approach might be easier to build, maintain and scale because it encourages more local commuting with fewer people converging on the same place. Sounds a little like the sort of approach used by MySpace or Google?
There are other examples of distributed systems lying around like the internal workings of our own bodies. And yet, in spite of a host of counter examples and failures we still feel compelled to pursue policies of centralization in towns, poultry farms and IT systems. Hmmmmm.
Dunno what it means, dunno who’s right (or indeed if anyone is) but it’s interesting for sure.
Technorati Tags: distributed systems, engineering, technology
Common practice these days is to have a firewall between corporate’dom and the internet and only open port 80. It’s claimed this is secure but is it and what is this approach costing us? And is the price worth paying?
We tend to argue against open ports at the firewall for two reasons:
- Denial Of Service Attacks
- Hacks
Denial Of Service Attacks
The mere act of opening a port leaves one open to such attacks. The difficult bit is having access to sufficient resources with which to attack the target. This drives some to look for more sophisticated attacks that don’t require a “stampeding herd” of machines but there are very few of these people because:
- You have to be expert
- You have to have lots of time
In conclusion, denial of service attacks are a part of being on the net and firewalls help very little in this regard. What’s really needed is intrusion detection software and dynamic response/adaptation, something the firewall doesn’t facilitate.
Hacks
The reason a hack works at all is because it exploits some weakness in some service somewhere. But these weaknesses vary by OS, machine and software version. Blocking everything at the firewall works but really and truthfully we should:
- Keep our machines up to date
- Manage this machine by machine because they each have different needs and services
How Secure Are We?
Given the above we can see that boundary firewalls actually provide little effective protection against attack because to provide service, we must open the firewall and as soon as we do, the hordes can rush in. Certainly a firewall can stop certain kinds of packet spoofing, remote login etc but we don’t need to close all ports to achieve this goal.
In summary, firewalls are hammers and we are attempting to treat all security issues as nails to beat on.
The Cost of the “No open ports” Strategy
Because port 80 is the only one open, we are seeing everything tunnelled via http. This leads to a whole series of complications:
- Not everything we would like to use has support for tunnelling over http
- http is not suitable for all applications, it can be made to work but it’s complex and fragile
- Given that all traffic looks like http it is difficult to manage/monitor traffic for individual services
- Multiplexing and demultiplexing all this network traffic manifests in our architectures - e.g. configuring our front-line httpd’s to redirect traffic based on URL components
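To illustrate the last point in the list above, here is a minimal sketch of the kind of URL-based demultiplexing that ends up baked into an architecture when everything arrives on port 80. The routing table, host names and ports are hypothetical and exist only for the example.

```python
# Hypothetical routing table: URL prefix -> internal (host, port).
ROUTES = {
    "/orders/": ("orders.internal", 9001),
    "/orders/archive/": ("archive.internal", 9003),
    "/search/": ("search.internal", 9002),
}

def demultiplex(path):
    """Pick a backend by longest matching URL prefix, as a front-line httpd might."""
    best = None
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, backend)
    return best[1] if best else None
```

Every new service means another entry in a table like this, and the URL space becomes a stand-in for what port numbers would otherwise have told us.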
Connection building in certain directions is impossible in most cases which denies us architectural options:
- Callbacks are not viable - leading us to deploy polling solutions which perform badly and prevent us from building solutions that offer timely updates
- Protocols other than http cannot be used
The border firewall becomes a point of contention trying to serve conflicting policies for individual services. Changes to the policy are slow owing to the need for agreement from so many parties or, in the worst case, no change happens at all.
We Really Need to Change
Firewalls as they are typically used are the software equivalent of a “Maginot Line” and history is littered with examples of failed border-control policy such that the military at least have learnt that defense in depth is a more appropriate solution. The “Maginot Line philosophy” denies us various software architecture options that would make our systems less complex and more scalable.
Certain elements might argue that all of this should lead us to base our services solely on the web architecture but that denies us certain kinds of desirable abstraction such that we expend lots of effort close to the metal attempting to force our solutions into a single generic approach. This would be somewhat akin to the pain we suffer trying to tunnel everything over http, not good.
Amazon’s EC2 provides some interesting possibilities in that it provides a more flexible firewall policy than straight border control. Perhaps the most significant aspect of EC2 is that the firewall configuration can be controlled programmatically, thus it would be possible for a service to configure the firewall to suit its needs on demand. If nothing else, such an option can speed up deployment, removing the need to fiddle with http tunnelling and de-multiplexing or to discuss policy with a centralized admin authority.
In conclusion, we need to look at changing our policy and other approaches such as providing infrastructure that can secure individual services with appropriate policies from machine through to software level. Much of our current software infrastructure has little support for such a strategy and will need to change but surely it will be for the better?
Technorati Tags: architecture, distributed systems, technology