The Subtle Differences in Apply-Before-Merge and Apply-After-Merge

My infrastructure team, as part of a larger Terraform Restructure initiative, is moving our Terraform repositories from apply-before-merge to apply-after-merge. The two approaches differ significantly in process, and not all of the differences are obvious.

I won’t try to hide it: I like Apply-After-Merge more. I will attempt to show the strengths and problems of both processes, but you should know that I have already formed an opinion.

Let’s start by looking at the two processes side-by-side, what humans need to do, and what the CICD pipeline will do.

The Apply-Before-Merge (ABM) process is like this:

The submitter creates a Merge Request (MR)
The CICD pipeline:
1. Creates a plan for the required changes and logs it for review (Plan Phase)
2. Waits for manual intervention
The reviewer looks at the code changes and the plan file
The reviewer clicks the "Apply" button to trigger the pipeline to continue
The pipeline fetches the plan file persisted during the Plan Phase and executes it (Apply Phase)
The pipeline merges the MR into the default branch

The Apply-After-Merge (AAM) process looks like this:

The submitter creates an MR
The CICD pipeline creates a plan for the required changes and logs it for review
The reviewer looks at the code changes and the plan file
The reviewer clicks the "Merge" button and the code is merged into the default branch
The pipeline – using the default branch – creates and executes a plan (Apply Phase)

Now keep in mind that CICD pipelines are async processes – there will be multiple, conflicting, merge requests in various stages of the process at any given time.

"Everybody has a plan until they get punched in the mouth" – Mike Tyson

In each approach, we need to create a plan file prior to MR review, but in ABM the plan file has another purpose: it contains the exact changes that will be made during the apply step. That means the plan file needs to be stored somewhere so that it can be retrieved later (during the Apply Phase). It also means that a plan has the opportunity to go stale.

Storing a process-critical, but still ephemeral, bit of data can be tricky to get right and keep reliable. In our case, we’re using GitLab CI caching, but that comes with potential pitfalls. The cache entries need a key that’s scoped to the MR. Are there race conditions that could cause the wrong plan file to occupy the cache? What happens if the cache entry is evicted? Many times these questions go unasked.
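To make those questions concrete, here is a minimal sketch of what "scoped to the MR" could mean. It is not our pipeline: `PlanStore` and `PlanUnavailable` are hypothetical names, and the in-memory dictionary is only a stand-in for the real cache. The idea is to key each plan on both the MR and the commit it was built from, and to fail loudly when the entry is missing instead of applying whatever happens to be there.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


class PlanUnavailable(Exception):
    """Raised when no plan exists for this exact MR and commit."""


@dataclass
class PlanStore:
    """Hypothetical in-memory stand-in for the cache that holds plan files."""

    _entries: Dict[Tuple[int, str], bytes] = field(default_factory=dict)

    def save(self, mr_iid: int, commit_sha: str, plan: bytes) -> None:
        # Keying on both values keeps one MR from overwriting another MR's
        # plan, and keeps a plan built from an older commit from being reused.
        self._entries[(mr_iid, commit_sha)] = plan

    def evict(self, mr_iid: int, commit_sha: str) -> None:
        # Simulates the cache dropping an entry on its own.
        self._entries.pop((mr_iid, commit_sha), None)

    def load(self, mr_iid: int, commit_sha: str) -> bytes:
        try:
            return self._entries[(mr_iid, commit_sha)]
        except KeyError:
            # A missing plan should stop the Apply Phase, not fall back
            # to some other plan.
            raise PlanUnavailable(
                f"no plan for MR {mr_iid} at {commit_sha}; re-run the Plan Phase"
            ) from None
```

With this shape, an evicted entry or a new commit pushed after review both end in `PlanUnavailable`, which forces a fresh plan and a fresh review.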

"The starting point for all achievement is desire." – Napoleon Hill

A much more subtle difference lies in two questions: where is the desired state? And what does the default branch represent?

In Terraform, you are always dealing with two states: the desired state and the state of reality. When Terraform creates a plan, the first thing it does is investigate the state of reality (what resources exist in the target environment and how they are configured). It then compares that to the desired state (the Terraform code, in totality) and builds a series of steps to change reality so that it matches the desired state. The plan file is that set of steps.
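As a toy illustration of that comparison (not Terraform's actual algorithm), a plan can be pictured as the difference between two dictionaries. The function name `build_plan` and the resource names below are made up for the example, and `actual` stands for the resources Terraform is tracking.

```python
from typing import Dict, List, Tuple


def build_plan(desired: Dict[str, dict], actual: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Return the steps needed to make `actual` match `desired`."""
    steps: List[Tuple[str, str]] = []
    # In the code but not in reality: create it.
    for name in sorted(desired.keys() - actual.keys()):
        steps.append(("create", name))
    # In both, but configured differently: update it.
    for name in sorted(desired.keys() & actual.keys()):
        if desired[name] != actual[name]:
            steps.append(("update", name))
    # In reality but no longer in the code: delete it.
    for name in sorted(actual.keys() - desired.keys()):
        steps.append(("delete", name))
    return steps


desired = {"bucket": {"versioning": True}, "queue": {"fifo": False}}
actual = {"bucket": {"versioning": False}, "vm": {"size": "small"}}
print(build_plan(desired, actual))
# [('create', 'queue'), ('update', 'bucket'), ('delete', 'vm')]
```

Note the last branch: anything that exists but is absent from the desired state gets deleted. That detail matters a great deal once more than one desired state is in play.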

So, in ABM, where is the desired state stored? It’s not in the default branch – you’ve already applied a new desired state to the environment before merging the code. Does that mean it’s the unmerged branch? What happens if you have multiple unmerged branches? Do you have multiple desired states?

This might also lead you to think that the default branch represents reality when using ABM, but that’s incorrect. It’s common for resources created by Terraform to be modified outside of Terraform, and for reality to morph in ways that are entirely appropriate. Even if your environment shouldn’t change outside of your Terraform code, there’s no guarantee that it hasn’t. Your default branch becomes a Schrödinger’s cat: until you run a plan, it both does and does not represent reality.

In fact, in ABM, the default branch represents neither reality (reliably), nor the desired state. If you squint and try real hard, you could say that the default branch represents "the last applied and quickly and successfully merged desired state assuming there are no other MRs in or after the Apply Phase that more accurately describe the desired state." That’s a LOT of caveats.

In AAM, the desired state is much clearer: it’s the default branch. Full stop. The default branch is your desired state, even if you haven’t (yet) reached it and reality doesn’t match.

"When things go wrong, don't go with them." – Elvis Presley

Let’s consider a common failure state: your terraform apply fails.

I’ve mentioned that in AAM the default branch represents the desired state, and not necessarily the state of reality. So a failure to reach the desired state (a failed apply) results in an ongoing separation between the desired state and reality. But that’s pretty much it. At any point, you can re-apply the default branch to attempt to reach your desired state again. In practice, this results in failed pipelines that are easy to remedy, and so not a big deal.

Consider another scenario: the git merge fails.

The merge itself can fail for a few reasons, but the most common is that there are multiple merge requests in flight at a time, resulting in a slow-moving race condition: MR1 is merged while you're looking at MR2. In ABM, this can be devastating. If MR1 and MR2 are both approved, the applies happen simultaneously, and MR1 is merged first, this can – and will – result in MR2 failing to merge because it requires a rebase. You are now in a state where MR2 has been applied, but not merged.

Any other MR could now be applied and REVERT the changes from MR2 that are already applied. Again, this can be devastating to an infrastructure. It's very easy for resources to be deleted because they're not defined in MR3's desired state.
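The race is easier to see when played out step by step. The sketch below is a toy model under stated assumptions: resources are just names in a set, the resource names and the `simulate_abm_race` function are hypothetical, and it ignores state locking and any stale-plan checks Terraform may perform. It only shows why an applied-but-unmerged MR is dangerous.

```python
from typing import List, Set, Tuple

Plan = Tuple[List[str], List[str]]  # (resources to create, resources to delete)


def make_plan(desired: Set[str], reality: Set[str]) -> Plan:
    """Diff the desired state against reality, as in the Plan Phase."""
    return sorted(desired - reality), sorted(reality - desired)


def execute(plan: Plan, reality: Set[str]) -> None:
    """Run a previously saved plan against reality, as in the Apply Phase."""
    creates, deletes = plan
    reality.update(creates)
    reality.difference_update(deletes)


def simulate_abm_race() -> Plan:
    default_branch = {"vpc"}
    reality = {"vpc"}

    # MR1 and MR2 branch from the same commit and are planned before either applies.
    mr1 = default_branch | {"bucket"}
    mr2 = default_branch | {"database"}
    plan1 = make_plan(mr1, reality)
    plan2 = make_plan(mr2, reality)

    # Both are approved and applied; MR1 merges first.
    execute(plan1, reality)
    default_branch = mr1
    execute(plan2, reality)
    # MR2 now needs a rebase, so its merge fails: applied, but not merged.

    # MR3 branches from the default branch, which knows nothing about MR2.
    mr3 = default_branch | {"queue"}
    return make_plan(mr3, reality)


print(simulate_abm_race())
# (['queue'], ['database'])
```

MR3's author only asked for a queue, but the plan also deletes the database that MR2 created, because nothing in MR3's desired state mentions it.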

In AAM, any required rebases are found and addressed before the apply – no harm, no foul.