Are These Infrastructure as Code Symptoms Costing Your Business?

The Symptoms

Infrastructure as Code (IaC) has been a game-changer in the world of “DevOps”. The practice has matured and gained enough acceptance that many organizations now consider their IaC code to be part of their DR (Disaster Recovery) strategy and a key component of their change management process.

In this post, I’d like to discuss some of the troublesome side-effect symptoms of IaC adoption that I have seen at the various organizations I have worked with. I’m not going to go over solutions or improvements in this post; instead, I’ll focus on the first step of identifying and articulating the problems.

The “symptoms” I’d like to put a spotlight on are:

  • Slow iteration times
  • A flurry of fixes
  • Avoiding refactoring

Before I get into them, let’s quickly recap the 10,000-foot view of the typical IaC process at a typical organization.

A 10,000-Foot View of the Process

The idea is simple: define your infrastructure in code, version it, and deploy it just like you would with application code. This should lead to more reliable, repeatable, and scalable infrastructure management.

In practice, this means you have a repository (often Git) where your IaC files live. You make changes to these files, commit them, and then “deploy” the changes using a CI/CD pipeline. Ideally, no human is in the loop for this deployment step in production environments.

Let’s take the above process and unpack what it actually entails in practice:

  • You make a change in your IaC code, and commit it to a branch in your repository.
  • You then create a pull request (PR) to merge your changes into the main branch.
  • A pipeline runner runs a plan or a “no-op” to validate the changes and produces a report of the proposed changes for review.
  • You get approvals from your team members after incorporating any feedback, and then merge the PR into the main branch.
  • The changes are applied to the environment, often through another pipeline run that is gated by the CI/CD process.

Here’s where things start getting interesting. Let’s start with the first symptom:

Symptom 1: Slow Iteration Times

Most organizations I have worked with have a process that looks like this because, for security reasons, engineers are not permitted to access the environment directly.

For example, in a Terraform-heavy environment the tfstate can contain sensitive information, so engineers are not allowed to access it to run terraform plan on their local machines. The organization therefore ends up with a process like this to make changes of any kind:

  • The pipeline runner runs the plan again, and you get a new report.
    • There’s a problem with your code, so you fix it and push the changes to the PR.
  • The pipeline runner runs the plan again, and you get a new report.
    • There’s a problem with your code, so you fix it and push the changes to the PR.
  • The pipeline runner runs the plan again, and you get a new report.
    • The report is now good, indicating the changes are as expected, valid and can be merged.

This process can take a long time. Imagine each pipeline run takes at least 5 minutes (which in my experience is relatively quick).

Now, imagine you also have to wait for a “pipeline runner” to become available; in a typical situation this adds anywhere from 0-5+ minutes on average. So, with a few iterations to fix your code, you can easily end up waiting 10-20 minutes on a single change.

This is a problem because it slows down the iteration process significantly. Compare this experience with other coding disciplines, where a developer can iterate quickly on a backend or frontend change, run tests, and get immediate feedback on their local machine.

Symptom 2: Flurry of Fixes

Once your PR is approved and merged, the next step is to apply the changes. This is where things can get even more complicated. In many organizations, the apply step is also gated by a pipeline runner. This means that you can’t just run terraform apply or pulumi up locally.

Anyone who has been doing this work can tell you that the apply step is often where the real issues arise.

The cloud provider API will give you an error message that the plan didn’t reveal. These are often data or environment-specific situations where, until you call the actual cloud API to manage the resource, you don’t actually know if your code change is going to succeed.
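A classic example of this, as a hypothetical sketch rather than a reference to any particular incident: S3 bucket names are globally unique, and terraform plan has no way of knowing whether another account already owns the name, so the failure only shows up at apply time.

```hcl
# Hypothetical sketch: this plans cleanly, but apply can fail with
# "BucketAlreadyExists" because S3 bucket names are globally unique and
# the collision is only discovered when the AWS API is actually called.
resource "aws_s3_bucket" "reports" {
  bucket = "acme-reports" # may already be owned by another AWS account
}
```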

So now the primary branch of the IaC repository is in a state where it has been merged, but the changes break the pipeline. Any new changes to the IaC code will now fail the pipeline because the previous changes have not been applied successfully.

If the organization has been mindful in developing its CI/CD process, it will have a “rollback” step in place, or at least enough isolation that only part of the pipeline is considered “broken” and other changes can still be applied via the IaC code to other areas.

Despite the above, the show must go on, so the next step is to fix the issues that were not caught in the plan step. This results in what I call the “flurry of fixes” symptom.

The “flurry of fixes” is where you have to make a series of small changes to the IaC code to get the apply step to succeed. This means you have to make a new PR, get a successful plan run, get approval(s), and then merge it again. Sometimes this takes several round trips to the main branch before the apply step finally succeeds.

I need to point this out: this ritual of making a PR, getting a plan run, merging (with approvals), and then waiting for apply to succeed turns a simple fix into a multi-step process that can take hours or even days to complete.

Symptom 3: Avoiding Refactoring

Because of the above process, refactoring, simply put, is a bit of a nightmare when you’re operating in a mature environment.

“Refactor” is still a borderline dirty word: it’s often hard to sell the business on the value of refactoring. Your mileage may vary by business or team culture, but hear me out:

Refactoring is often met with resistance from management or project managers, who are more focused on delivering new features and capabilities. Therein lies the problem: infrastructure, and the “code” that defines it, is often seen as a “set it and forget it” situation, and not a product feature or capability that generates revenue. Therefore, it’s even that much harder to justify the time and effort to refactor.

This is a bit tragic to me:

  • The languages and toolchains for IaC change more dramatically than, say, Java, C#, or Python (though it’s gotten better!)
  • Cloud providers keep changing rapidly, which impacts the toolchains and languages
  • Refactoring is a “dirty” word that typically needs buy-in
  • IaC refactoring is some of the slowest and most expensive to do, which makes it even harder to justify

Yet: the code for infrastructure is some of the most important code in many organizations, as it defines the infrastructure that the business runs on.

Mature, long-lived IaC code often becomes complicated:

  • Because we need XYZ capability added to the Terraform module…
    • We add yet another parameter (or structure of parameters) to the module, so we can conditionally enable it and not disrupt existing usages of the module.
    • (repeat this for three or more capabilities; a sketch follows below)…
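A hedged sketch of what that accretion tends to look like; the module layout and every name here are hypothetical:

```hcl
# variables.tf of a long-lived module after a few rounds of
# "just add a flag so existing callers aren't disturbed".
variable "enable_logging" {
  type    = bool
  default = false
}

variable "enable_replication" {
  type    = bool
  default = false
}

variable "enable_lifecycle_rules" {
  type    = bool
  default = false
}

# main.tf: each capability is bolted on behind its own flag
# (assumes an aws_s3_bucket.this defined elsewhere in the module).
resource "aws_s3_bucket_logging" "this" {
  count  = var.enable_logging ? 1 : 0
  bucket = aws_s3_bucket.this.id

  target_bucket = "acme-central-logs" # hypothetical
  target_prefix = "log/"
}
```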

Now for the fun part:

  • One of the parameters SHOULD be changed from a bool to an object/data structure, so we can add more configurability to the capability. This is a breaking change, so we SHOULD make sure that all the existing usages of the module are updated to use the new parameter structure.
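Continuing the hypothetical sketch from above, the disciplined change would look something like this, and it breaks every existing caller that still passes the bare bool:

```hcl
# Replace the bare bool with an object so the capability can carry its
# own settings (optional() attribute defaults require Terraform 1.3+).
variable "logging" {
  type = object({
    enabled       = bool
    target_bucket = optional(string)
    target_prefix = optional(string, "log/")
  })
  default = {
    enabled = false
  }
}

# Every existing caller now has to change from:
#   enable_logging = true
# to:
#   logging = { enabled = true, target_bucket = "acme-central-logs" }
```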

This is where the “avoiding refactoring” symptom comes into play:

  • Teams avoid making proper, disciplined changes to the IaC code because they know that it will be a long and painful process to get the changes merged and applied, and it’s not valuable enough to warrant the effort.

  • Instead, they will add a new parameter to the module, or create a new module altogether, to avoid the pain of refactoring.

    • Which in turn leads to more complexity: you now have two parameters for the same thing, and likely conditional logic in the module to handle the combinatorial complexity of the two (or more!) parameters.

We’ve been blessed with things like the import and moved blocks in Terraform, but the reality is that when you’re locked out of state data and only the CI/CD pipeline runner has access to it, scripting the generation of import and moved blocks for a refactor is off the table. One might be able to grovel for temporary access to state data to generate these blocks, but that’s beside my point, which is the tendency to avoid refactoring because of the headaches.
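For reference, a hedged sketch of the kind of blocks I mean, with hypothetical resource addresses (moved requires Terraform 1.1+, declarative import requires 1.5+):

```hcl
# Tell Terraform that a resource's address changed as part of a refactor,
# so it updates state instead of destroying and recreating the bucket.
moved {
  from = aws_s3_bucket.reports
  to   = module.storage.aws_s3_bucket.reports
}

# Adopt an existing, manually created bucket into state.
import {
  to = aws_s3_bucket.legacy_reports
  id = "acme-legacy-reports" # for S3 buckets, the import ID is the bucket name
}
```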

As the languages we use to express IaC themselves change (e.g.: Terraform’s HCL, Pulumi, etc.) and introduce breaking changes, we are left with a problem of refactoring for the new version of the languages and toolchain, or staying on an older version.

I’ve seen this happen with Terraform, where teams are reluctant to refactor their IaC code to accommodate the new version of the language/toolchain, and instead choose to stay on an older version, which leads to technical debt and a lack of support for new features and capabilities.

How do they address it? By avoiding refactoring and making new modules or full-on rewrites.

Infrastructure Code !== Application Code

Fortunately, infrastructure code is typically not as large and complex as application code, so at least there’s that, but I must belabor this point:

It seems we’ve forgotten that cloud infrastructure “code” is (attempted) state declaration, transfigured through API wrappers. It’s not the same as application code, where business logic and algorithms are the primary focus and things are comparatively mechanical and deterministic.

The problem is that many organizations treat IaC like application code and try to govern it with the same (or largely similar) processes. In practice, the game we’re playing with IaC is sometimes more akin to gambling than to programming business logic.

As I’ve outlined above, the processes for making changes to IaC code are often slow and cumbersome, and the “Flurry of Fixes” pattern mentioned above is one way to articulate what I mean by “gambling”. The cloud provider APIs are largely non-deterministic - and right now, I’m finding organizations have adopted processes that are not designed to efficiently handle the non-determinism.

The Bottom Line

It’s my hope that just looking at the names of the symptoms makes it clear why they may be costing your business. Slow iteration times, a flurry of fixes, and avoidance of refactoring might not be completely eliminated, but they can likely be improved if you stay cognizant of them, or even implement KPIs to measure them and guide action for improvement.

It’s not my intention to suggest we dispense with the practices that create these symptoms, but rather to highlight that we need to be aware of them and work towards improving the processes that lead to these symptoms. I certainly have my opinions on how to improve these processes, but that’s a topic for another post.