Automation Architecture: build the capacity for change

From my experience in IT, mainly in what we call IT operations, or more specifically platform engineering, there's an emphasis on noble design principles such as reliability, resiliency, high availability, monitoring, MTTR, MTBF and so on. But something is missing.

None of them should be ignored, obviously, but the capacity for change is often overlooked. I’m convinced it should be a core design principle, re-evaluated for each change, especially for automation architecture.

It is not the strongest of the species that will survive, nor the most intelligent, but the one most responsive to change. – Charles Darwin

To understand the problem, let’s highlight the current model and how it can fall short of our expectations.
When a need for a (technological) service emerges, the IT Ops team is tasked with building up that capability, and a project is initiated to track the delivery. Only in the best, and rare, case does anyone consider the lifecycle of that service after delivery, or accept that it will have to evolve and change over time.

Requirements will be gathered, discussions will be had, permissions requested, approvals begged for, before the work can be done and forgotten, all while maintaining a plethora of disparate services, the legacy of past projects long extinct.

Over time, the number of supported services (and scenarios) will have reached an incredible, but unaccounted-for, size.

In isolation those services can be simple, but the intricate melding of one with another, layered over time, creates a huge Jenga game that IT Ops folks too often play on weekends, in case they need the time to recover from a disaster.


As the number of services increases, the awareness and knowledge of them will have decreased since their release into production. This is where the lack of documentation has the biggest cost: someone needs to regain enough knowledge to fix a bug or make a change. It's time-consuming and frustrating.

By now the maintenance and interruptions have crippled the available time of the supporting team, leaving very few resources for improvement. On the other hand, the business has grown ever more dependent on those services, and the digitalisation of the economy means it will only increase its requests for changes and additional features.

The usual response to this crisis is threefold:

– Cut corners (who needs a test environment anyway?), compromising quality
– Hire more support (we need more people to deliver)
– Automate (script the tedious)

Because this is a reaction to the visible problem “we’re too busy”, the solution is rarely engineered to scale beyond the status quo. The effects are multiple, increasing inertia in the following downward spiral.

The corners that have been cut previously, a form of technical debt, will create growing arrears, and before you know it you're working around the workaround of something you should probably get rid of anyway. Any seemingly simple piece of work will require shaving yaks!


Hiring more people to grow capacity is a fallacy in many cases: it assumes either that you won't need to keep hiring, or that you can hire fast enough.

And I'm not even factoring in the growing pains of overworked and frustrated IT pros, or the increasing reliance on heroics, which contributes to turnover while decreasing the overall knowledge of the systems.

The growing team size has other impacts on team dynamics: it takes more communication to align towards a shared goal, side communication (overhead) increases, and the signal-to-noise ratio decreases.

The usual approach is to split the teams further, increasing siloisation and its associated problems, such as hand-offs and increased cycle time, in the name of local efficiency.

While it may seem to work for a while with small teams, it does not scale well. Even if the admin-to-server ratio, an obsolete metric, is stable or decreasing, the cost (cost of delay, cost of change) increases. Simply put, on average the same change that used to be easy and quick becomes more expensive in time and effort. And it's not a proportional increase, but a compounding one. As an illustration, a team of 5 looking to double its throughput by hiring alone may be able to do so by doubling its size; but from there, to double again, they may need to triple the workforce as they split the value stream into more sub-teams (see the quick calculation below).
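To put a rough number on that communication overhead, here is a back-of-the-envelope sketch using the classic pairwise communication-paths formula, n(n-1)/2. The team sizes are arbitrary examples of mine, not figures from any study.

```python
# Back-of-the-envelope: distinct one-to-one communication channels in a team
# of n people, using the classic n*(n-1)/2 formula. A deliberate
# simplification, only meant to show that coordination overhead grows much
# faster than headcount.

def communication_paths(team_size: int) -> int:
    """Number of distinct pairwise communication channels in a team."""
    return team_size * (team_size - 1) // 2

for size in (5, 10, 15, 20):
    print(f"{size:>2} people -> {communication_paths(size):>3} communication paths")

# Output: 5 -> 10, 10 -> 45, 15 -> 105, 20 -> 190.
# Headcount grows 4x, coordination overhead grows ~19x.
```

That is the crux of it: as the team grows, an ever larger share of everyone's time goes to coordination rather than delivery.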

As those effects become more visible, the desire to trim down the low-value-add tasks grows. The natural approach is automation of the tedious tasks: the manual tasks performed by operations are replaced by scripts, either run manually or sequenced by another system.

This usually frees up significant bandwidth initially, but it is quickly eaten away by further requests. The risk is not factoring in the required maintenance and quality of that code, transferring the downward-spiral dynamic to another silo (automation) with even scarcer resources. They usually take the name of Automation team, DevOps team, Cloud team or something similar… Note that I'm not saying having a team with such a name is an anti-pattern by itself either.

Although such automation has lots of benefits and is a definite improvement, it does not necessarily scale or age well, and it can become the next constraint.

The formerly disparate systems tend to be stacked and glued together by an automation layer, without clear interfaces being maintained between systems, creating strong coupling between components via that automation.

As the automation of a system is built, run, and sequenced separately from the system itself, its dependencies and management interface spread, abstracting and hiding away the configurations. This makes changes more complex, involving different systems possibly handled by different work centres or teams. The same is true when integrating with other systems such as monitoring, IAM and so on…
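To make that coupling concern concrete, here is a minimal, hypothetical sketch; the tool names, commands, and interfaces below are invented purely for illustration and don't refer to any real product or to a specific recommendation from this post.

```python
# Hypothetical sketch of coupling through a glue layer (all names invented).
from typing import Protocol


def run_cmd(command: str) -> None:
    """Stand-in for shelling out to some tool; here it just prints."""
    print(f"running: {command}")


# Tightly coupled glue: the automation embeds the internals (flags, naming
# conventions) of every system it touches, so a change in any of those
# systems ripples through this script.
def provision_service_glued(name: str) -> None:
    run_cmd(f"hypervisor-cli new-vm --name {name} --template std-2019")
    run_cmd(f"monitoring-cli add-host {name} --profile default")
    run_cmd(f"iam-cli grant svc-{name} --role operator")


# Looser coupling: the automation depends on small, explicit interfaces;
# the implementation behind each one can change without touching this code.
class Provisioner(Protocol):
    def create(self, name: str) -> None: ...


class Monitor(Protocol):
    def watch(self, name: str) -> None: ...


def provision_service(name: str, provisioner: Provisioner, monitor: Monitor) -> None:
    provisioner.create(name)
    monitor.watch(name)
```

The second version is not free either, but the knowledge of each system's internals stays behind its interface instead of leaking into every script that touches it.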

Moreover, because of ingrained practices and biases, the most common being "we've always done it like this" (aka confirmation bias), we tend to attempt making ineffective systems more efficient.

In the end, I've merely highlighted some core challenges an IT Ops team may face, but to improve and truly scale the whole system, not just the technology, it needs to be designed for change. Quick, small, iterative, tested, quality changes. In other words, designed for agility!

By not designing our systems and principles around the capacity for change, we have built a downward spiral around ourselves, such that the more we change, the harder it gets.

So, what does an Automation Architecture built for change look like? For that, we need to get back to John Boyd and the OODA loop. It's a Perception-Cognition-Action (PCA) problem in the end… Maybe material for a future post, if there's interest…
