Drift is the silent tax on every Terraform estate. Someone clicks a button in the portal during an incident, an autoscaler rewrites a field Terraform thinks it owns, and three weeks later a routine apply proposes to destroy something nobody meant to touch. This is a practitioner’s guide to seeing drift before it bites, reading a drift plan correctly, and reconciling it without turning a small divergence into an outage.
1. What drift actually is: three things that must agree
There are three sources of truth in any Terraform workflow, and “drift” is any disagreement between them:
- Configuration: your .tf files, what you declared should exist.
- State: terraform.tfstate, what Terraform last recorded as existing.
- The live cloud: the actual resources, as the provider API reports them right now.
config <-- plan compares these two --> state <-- refresh compares these two --> cloud
(.tf) (.tfstate) (real API)
Two distinct gaps get lumped together as “drift,” and conflating them is the root of most bad reconciliation decisions:
- State-vs-cloud drift: the world changed underneath Terraform. A subnet was deleted in the portal, a tag was added by a governance policy, an SKU was bumped manually. State is stale.
- Config-vs-state drift: your code changed but hasn’t been applied, or someone ran terraform state commands by hand. This is just a pending change, not “drift” in the dangerous sense.
The pipeline you build later cares almost entirely about the first kind: detecting state-vs-cloud divergence early, deciding intentionally how to reconcile it, and never letting an apply make that decision for you by accident.
Mental model: plan shows you the gap between config and state. A refresh updates state from the cloud. Real drift detection requires both: refresh first, then compare.
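The two gaps can be told apart mechanically. The sketch below is one way to do it, assuming terraform is on PATH and the working directory is already initialized. The names describe_rc and classify_gaps are hypothetical helpers, not Terraform commands. It relies on the -detailed-exitcode flag, which makes plan exit with 2 when it finds differences.

```shell
# Hypothetical helpers; neither name is part of Terraform.
describe_rc() {
  # With -detailed-exitcode: 0 = no differences, 2 = differences, else error.
  case "$1" in
    0) echo "in sync" ;;
    2) echo "differs" ;;
    *) echo "error" ;;
  esac
}

classify_gaps() {
  drift_rc=0
  pending_rc=0
  # Gap 1, state vs cloud: refresh-only never proposes changes to real resources.
  terraform plan -refresh-only -detailed-exitcode -input=false -no-color >&2 || drift_rc=$?
  # Gap 2, config vs state: skip the refresh so config is compared with recorded state.
  terraform plan -refresh=false -detailed-exitcode -input=false -no-color >&2 || pending_rc=$?
  echo "state-vs-cloud: $(describe_rc "$drift_rc")"
  echo "config-vs-state: $(describe_rc "$pending_rc")"
}
```

Calling classify_gaps prints one line per gap and sends the plan output itself to stderr. A "differs" on the first line is the dangerous kind of drift; a "differs" on the second line alone is just unapplied code.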
2. How refresh, plan, and -refresh-only behave
Understanding the exact mechanics here is non-negotiable, because the defaults changed and a lot of folklore is wrong.
terraform plan refreshes by default. Before computing the diff, plan reads every resource from the provider API to update its in-memory view of state, then compares your config against that refreshed view. Crucially, this refresh is in-memory for the plan — it does not persist to the state file. So a normal plan already sees drift; it just folds it into the proposed changes.
-refresh=false skips that API read entirely. Plan trusts whatever is in state. This is faster and is what you want when you’ve just refreshed and don’t want to pay the API round-trips again — but run it blind and you can apply against a stale picture.
-refresh-only is the purpose-built drift detector. It refreshes state from the cloud and reports what changed, but it will never propose to modify, create, or destroy a real resource to match your config. Its only possible action is updating the state file to match reality.
# Pure drift detection: compare state against the live cloud, change nothing real.
terraform plan -refresh-only
# Persist the refreshed reality into state (adopts drift INTO state, not the other way).
terraform apply -refresh-only
The distinction that trips people up: terraform apply -refresh-only does not push your config to the cloud. It does the reverse — it accepts the cloud’s current values as the new contents of state. That’s sometimes right (a manual change you want to keep) and sometimes wrong (one you want to revert). You decide; the flag just makes state agree with reality.
The old standalone terraform refresh command still exists but is deprecated. Use terraform apply -refresh-only, which is the same operation with a plan and an approval prompt in front of it.
Here is the decision in one table:
| Command | Reads cloud? | Can change cloud? | Can change state? | Use for |
|---|---|---|---|---|
| plan | Yes | No | No | Normal change review |
| plan -refresh=false | No | No | No | Fast plan against trusted state |
| plan -refresh-only | Yes | No | No | Detecting drift safely |
| apply -refresh-only | Yes | No | Yes | Adopting reality into state |
| apply | Yes | Yes | Yes | Making the cloud match config |
3. Reading a drift plan: noise versus real divergence
Run terraform plan -refresh-only against a workspace that has drifted and you’ll see a block headed with a note that objects have changed outside of Terraform. The body uses the same diff symbols as a normal plan, but the meaning is inverted: it’s showing how the cloud differs from state, not how your config differs from state.
Note: Objects have changed outside of Terraform
# azurerm_storage_account.this has changed
~ resource "azurerm_storage_account" "this" {
id = "/subscriptions/.../storageAccounts/kvprodsa"
~ tags = {
+ "CostCenter" = "FIN-4417"
}
# (38 unchanged attributes hidden)
}
Now the skill: telling benign provider noise apart from real divergence. Not every ~ is a problem.
- Real drift you must address: a changed SKU, a deleted rule, a security setting flipped, a tag your org policy added that your config doesn’t know about. These represent the world disagreeing with your intent.
- Provider normalization noise: the API returns a value in a different but equivalent form than you wrote — a casing difference, an attribute the provider computes and rewrites, an ordering change in a set the provider serializes as a list, a default the API fills in. These aren’t drift; they’re the provider being chatty.
- Perpetual diffs: an attribute that shows up as changed on every single run no matter what you do. That almost always means the resource schema and the API are fighting (an azurerm resource exposing a value also managed by a separate *_association resource is the classic case), or you’re missing an ignore_changes.
The diagnostic question for any drift line is: if I ran terraform apply right now, would the proposed action be correct? If Terraform wants to remove a tag a compliance policy legitimately added, the answer is no — reconcile the config, don’t apply. The -refresh-only plan is safe precisely because it forces that question before anything touches production.
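When a drift plan is long, it helps to reduce it to a list of affected addresses before asking that question of each one. The sketch below is a hypothetical helper, drifted_addresses, that reads plan text on stdin. It assumes the header wording shown in the sample above ("has changed") plus the "has been deleted" variant. Human-readable plan output is not a stable interface, so for anything long-lived the resource_drift array in terraform show -json output is the sturdier source.

```shell
# Hypothetical helper: list resource addresses flagged in a refresh-only plan.
# Matches only "# <address> has changed" and "# <address> has been deleted"
# header lines; every other line of the plan is ignored.
drifted_addresses() {
  sed -nE 's/^[[:space:]]*# (.+) has (changed|been deleted)$/\1/p'
}
```

Typical use would be terraform plan -refresh-only -no-color | drifted_addresses, which gives one address per line to triage as real drift, normalization noise, or a perpetual diff.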
4. Importing existing resources with import blocks
When a resource exists in the cloud but not in state — created by hand, by another tool, or by a different Terraform configuration — you import it. As of Terraform 1.5+, prefer import blocks over the imperative terraform import command: they’re declarative, reviewable in a pull request, and they run as part of plan/apply so you can see exactly what will happen first.
# imports.tf
import {
  to = azurerm_resource_group.hotfix
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-hotfix"
}
You still need configuration for the resource being imported. Terraform 1.5+ can generate it for you so you don’t hand-write 40 attributes:
# Generate HCL for everything referenced by an import block.
terraform plan -generate-config-out=generated.tf
This writes a best-effort resource block into generated.tf. Treat it as a draft: it often includes read-only or deprecated attributes, won’t carry your variables or for_each, and may need defaults trimmed. Move the cleaned-up resource into your real .tf files, then plan again.
terraform plan # with import block + config present
# Expect: "1 to import, 0 to add, 0 to change, 0 to destroy"
terraform apply
A clean import shows 0 to change after the import. If the plan wants to change attributes immediately after importing, your config doesn’t yet match the live resource — reconcile the config until the post-import plan is a no-op, then remove the import block (it’s idempotent, but leaving it is clutter).
For bulk or scripted scenarios the imperative form still works and is sometimes more convenient in a loop:
terraform import 'azurerm_resource_group.hotfix' \
"/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-hotfix"
Import IDs are provider- and resource-specific. For AzureRM it’s almost always the full ARM resource ID; for AWS it varies by resource (an instance ID, an ARN, a name). Check the resource’s docs — guessing the ID format is the number one cause of failed imports.
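Bulk imports do not have to fall back to the imperative command. As a rough sketch, the hypothetical helper below turns a list of "address id" pairs into import blocks that can be reviewed in a pull request. It assumes one pair per line, separated by whitespace, with no spaces inside the address; the IDs themselves still have to come from the resource's documentation.

```shell
# Hypothetical helper: turn "address id" lines on stdin into import blocks.
emit_import_blocks() {
  while read -r addr id || [ -n "$addr" ]; do
    # Skip blank lines and comment lines.
    case "$addr" in
      '' | '#'*) continue ;;
    esac
    printf 'import {\n  to = %s\n  id = "%s"\n}\n' "$addr" "$id"
  done
}
```

For example, emit_import_blocks < imports.txt > imports.tf, followed by terraform plan, shows every pending import in a single reviewable plan instead of mutating state one resource at a time.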
5. Reconciliation strategies: adopt, override, or ignore with intent
Once you’ve identified real drift, there are exactly three honest responses. Pick deliberately; the worst outcome is picking by reflex.
Adopt the change (the world was right). The manual change should stand. Bring it into your configuration so code matches reality, then make state agree. Update the .tf to reflect the new value, run terraform plan and confirm it’s a no-op, and you’re reconciled. If the change is one Terraform can’t express in config (a value the cloud now owns), terraform apply -refresh-only adopts it into state.
Override the change (your config was right). The manual change was a mistake — a hotfix that bypassed review, an unauthorized edit. Leave your config as-is and let Terraform restore the intended state:
terraform plan # shows Terraform reverting the manual change
terraform apply # cloud is brought back into compliance with config
This is the case where a normal apply is the correct reconciliation, because here clobbering the drift is the goal. The danger is only doing it accidentally on drift you meant to keep.
Ignore the attribute (it’s legitimately managed elsewhere). Some fields are owned by another system by design: tags applied by Azure Policy, replica counts owned by an autoscaler, a secret rotated by a separate pipeline. Tell Terraform to stop fighting over them with lifecycle { ignore_changes = [...] }.
resource "azurerm_kubernetes_cluster" "this" {
  name                = "kvprod-aks"
  resource_group_name = azurerm_resource_group.this.name
  location            = var.location
  dns_prefix          = "kvprod"
  default_node_pool {
    name       = "system"
    node_count = 3
    vm_size    = "Standard_D4s_v5"
  }
  identity {
    type = "SystemAssigned"
  }
  lifecycle {
    # The cluster autoscaler owns node_count; do not revert it on every apply.
    ignore_changes = [
      default_node_pool[0].node_count,
      tags["CostCenter"], # applied by Azure Policy, not by us
    ]
  }
}
ignore_changes is a scalpel, not a mute button. Two rules keep it honest: scope it to specific attributes, never all; and comment why every entry exists. An undocumented ignore_changes is how the next engineer learns that Terraform has silently stopped managing a security-relevant field. If you can’t name the system that legitimately owns the attribute, you shouldn’t be ignoring it — you should be adopting or overriding it.
| Strategy | When the drift is… | Mechanism |
|---|---|---|
| Adopt | A change you want to keep | Edit config to match; apply -refresh-only for state-only values |
| Override | An unauthorized or mistaken change | Plain terraform apply restores config’s intent |
| Ignore | Owned by another system by design | lifecycle { ignore_changes = [attr] }, scoped and commented |
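The "never all" rule is easy to enforce mechanically. The following is a minimal sketch of a pre-merge lint, assuming plain .tf files on disk; check_no_ignore_all is a hypothetical name, and the regular expression is a text match rather than a real HCL parse, so it will also flag a commented-out line.

```shell
# Hypothetical lint: fail if any .tf file under a directory uses ignore_changes = all.
check_no_ignore_all() {
  dir="${1:-.}"
  pattern='ignore_changes[[:space:]]*=[[:space:]]*all([^[:alnum:]_]|$)'
  # /dev/null forces grep to print file names even when only one file is searched.
  matches="$(find "$dir" -type f -name '*.tf' -exec grep -nE "$pattern" /dev/null {} + 2>/dev/null || true)"
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches"
    echo "ignore_changes = all found; scope it to named attributes" >&2
    return 1
  fi
  return 0
}
```

Run as check_no_ignore_all ./infra in CI, it returns non-zero and prints the offending file and line, which is enough to block a merge until the entry is scoped and commented.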
6. A scheduled drift-detection pipeline that reports, not clobbers
The point of automation here is to surface drift on a schedule and notify a human — never to auto-apply. An automated reconciliation that runs terraform apply on a cron is how you turn one manual change into a 3 a.m. page.
The mechanism is terraform plan -refresh-only plus the -detailed-exitcode flag, which makes the result machine-readable:
- 0 = no changes (no drift)
- 1 = error
- 2 = changes present (drift detected)
# .github/workflows/drift-detection.yml
name: drift-detection
on:
  schedule:
    - cron: "0 6 * * 1-5" # 06:00 UTC, weekdays
  workflow_dispatch: {}
permissions:
  id-token: write # OIDC to the cloud, no long-lived secrets
  contents: read
  issues: write # to open a drift report issue
jobs:
  detect:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false
      # Authenticate to Azure via OIDC (federated credentials), read-only role.
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: terraform init -input=false
      - name: Refresh-only drift check
        id: drift
        run: |
          set +e
          terraform plan -refresh-only -detailed-exitcode -no-color -input=false -out=drift.tfplan
          echo "exitcode=$?" >> "$GITHUB_OUTPUT"
          set -e
      - name: Render drift summary
        if: steps.drift.outputs.exitcode == '2'
        run: terraform show -no-color drift.tfplan > drift-report.txt
      - name: Open drift issue
        if: steps.drift.outputs.exitcode == '2'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = "Drift detected by scheduled refresh-only plan.\n\n```\n"
              + fs.readFileSync('drift-report.txt','utf8').slice(0, 60000)
              + "\n```";
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `Terraform drift detected: ${context.workflow} ${new Date().toISOString().slice(0,10)}`,
              body,
              labels: ['drift']
            });
      - name: Fail the run on drift or plan error
        # Exit code 1 (plan error) must also turn the run red, not only drift.
        if: steps.drift.outputs.exitcode != '0'
        run: exit 1
The load-bearing design decisions:
- The cloud credential is read-only. Give the drift job a role with read/list permissions only (Azure Reader, AWS ReadOnlyAccess-equivalent). Even if someone fat-fingers -refresh-only out of the command, the identity cannot mutate production. Defense in depth beats trusting the flag.
- It reports via an issue/alert, gated behind exit code 2. No diff, no noise. A real diff opens a ticket a human triages.
- It fails the workflow on drift so the red signal is visible in dashboards, not buried in logs.
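Outside GitHub Actions, the same exit-code gate fits in a few lines of shell. This is a sketch under the assumption that terraform is on PATH and the directory is initialized; drift_check is a hypothetical wrapper name, and how the verdict gets delivered to a human is left to whatever alerting you already run.

```shell
# Hypothetical wrapper: run a refresh-only plan and print a one-word verdict.
drift_check() {
  rc=0
  # Plan output goes to stderr so stdout carries only the verdict.
  terraform plan -refresh-only -detailed-exitcode -no-color -input=false "$@" >&2 || rc=$?
  case "$rc" in
    0) echo "clean" ;;
    2) echo "drift"; return 2 ;;
    *) echo "error"; return 1 ;;
  esac
}
```

Only a return of 0 means clean. Both "drift" and "error" return non-zero so that a broken plan cannot pass for a healthy workspace, and the wrapper never calls apply.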
If you run HCP Terraform / Terraform Cloud, you don’t have to build this. It has native, scheduled drift detection at the workspace level (a Plus feature) that runs refresh-only health assessments and surfaces drift in the UI and via notifications. Use it if you have it; the pipeline above is for everyone self-managing state.
7. Preventing drift at the source: policy and RBAC
Detection is treating the symptom. The cure is making out-of-band changes hard to do in the first place, so Terraform stays the only writer of the resources it owns.
Least-privilege RBAC for humans. People should not hold standing write access to Terraform-managed production. Grant Reader by default and require just-in-time elevation (Azure PIM, or the AWS equivalent via assumed roles with approval) for the rare break-glass write. The only identity with durable write access is the CI/CD service principal that runs apply.
# Humans get Reader on the managed subscription/resource group.
az role assignment create \
--assignee "<user-or-group-object-id>" \
--role "Reader" \
--scope "/subscriptions/<sub-id>/resourceGroups/rg-prod"
Deny-by-policy for the changes that matter. Use cloud-native policy to block the mutations you never want made by hand. Azure Policy with a deny effect, or an Azure resource lock (CanNotDelete / ReadOnly) on critical resources, stops the portal click before it happens.
# A management lock that prevents deletion of a production resource group.
az lock create \
--name "no-delete-prod" \
--lock-type CanNotDelete \
--resource-group "rg-prod"
A ReadOnly lock blocks Terraform too. Use CanNotDelete for resources Terraform still manages day-to-day, and reserve ReadOnly for things that are genuinely frozen. Locks protect against accidents; they are not a substitute for RBAC.
Policy-as-code in the pipeline. Gate apply behind OPA/Conftest or Sentinel so even reviewed changes can’t violate org standards (no public blob, mandatory tags, approved regions). This doesn’t stop manual portal drift, but it stops config drift — the slow rot where each PR bends the rules a little.
The combination is what works: RBAC removes the ability to drift, policy and locks remove the easy path, and the scheduled detector catches whatever slips through (a higher-privileged automation, a break-glass session that wasn’t reverted).
8. Incident playbook: reconciling an emergency hotfix
The realistic scenario: production is down, an engineer with break-glass access changes a setting in the portal to restore service, the incident closes. Now the cloud is ahead of Terraform, and your next apply would revert the fix. Here is the controlled sequence.
# 1. SEE the drift before touching anything. Read-only, never an apply.
terraform plan -refresh-only
# 2. CAPTURE what changed, for the record and the post-mortem.
terraform plan -refresh-only -out=incident.tfplan
terraform show -no-color incident.tfplan > incident-drift.txt
3. Decide per-attribute, not per-resource. The hotfix changed a firewall rule (keep it) and also happened to bump a tag (irrelevant). Adopt the firewall change, ignore the tag.
4. Codify the adopted change in config. Edit the .tf so the firewall rule reflects what the engineer set during the incident. This is the step people skip — and skipping it means the next engineer reverts the hotfix, re-triggering the outage. The fix isn’t real until it’s in code and merged.
# 5. PROVE config now matches reality: the plan must be a no-op for the hotfix.
terraform plan
# Expect: no proposed change to the firewall rule. If Terraform wants to
# modify it, your config still disagrees with the live resource -> keep editing.
# 6. RECONCILE state for any cloud-owned values that aren't expressible in config.
terraform apply -refresh-only
7. Open the PR, get the post-incident review, merge. The drift is only truly closed when the change has gone through the same gate every other change goes through. Until then you’ve got an undocumented hotfix that exists in exactly one person’s memory.
The cardinal rule of incident reconciliation: never run a plain terraform apply first. Your opening move is always -refresh-only to observe. A reflexive apply against a freshly hotfixed production is how the cure becomes the second outage.
Verify
Confirm your drift workflow is sound end to end:
# 1. Refresh-only detects drift and changes nothing real.
terraform plan -refresh-only -detailed-exitcode
echo "exit code: $?" # 0 = clean, 2 = drift, 1 = error
# 2. An imported resource lands cleanly (no immediate change).
terraform plan # expect "1 to import ... 0 to change" before apply
# 3. After reconciliation, the workspace is a true no-op.
terraform plan # expect "No changes. Your infrastructure matches the configuration."
# 4. The drift job’s identity genuinely cannot mutate prod (expect only read-only roles, e.g. Reader, in the listing).
az role assignment list --assignee "<drift-sp-object-id>" \
--scope "/subscriptions/<sub-id>" -o table
Green means: drift is detected without side effects, imports are non-destructive, reconciliation converges to a no-op, and the detector’s credentials are read-only by construction.
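The last check can be turned from eyeballing a table into an assertion. The sketch below assumes role names arrive one per line on stdin and that the allow-list is a comma-separated string; assert_read_only_roles and the ALLOWED_ROLES variable are hypothetical names, and the default allow-list of just Reader is an assumption to adjust for your estate.

```shell
# Hypothetical check: fail if any role name on stdin is outside the allow-list.
assert_read_only_roles() {
  allowed="${ALLOWED_ROLES:-Reader}"
  bad=0
  while IFS= read -r role || [ -n "$role" ]; do
    if [ -z "$role" ]; then
      continue
    fi
    # Wrap both sides in commas so multi-word role names match exactly.
    case ",$allowed," in
      *",$role,"*) ;;
      *) echo "unexpected role: $role" >&2; bad=1 ;;
    esac
  done
  return "$bad"
}
```

One way to feed it is az role assignment list --assignee "<drift-sp-object-id>" --scope "/subscriptions/<sub-id>" --query "[].roleDefinitionName" -o tsv | assert_read_only_roles. Assignments inherited from a higher scope may not appear in that listing unless --include-inherited is added.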
Pitfalls and next steps
The recurring failure modes are predictable. Running terraform apply as the first response to drift, clobbering a change someone made on purpose. Treating apply -refresh-only as if it pushes config to the cloud, when it does the opposite. Reaching for ignore_changes = all and quietly abandoning resources Terraform should still manage. Auto-applying drift on a schedule and discovering at 3 a.m. that “reconciliation” and “outage” can be the same event. And the slow one: never merging hotfixes back into code, so config and reality diverge a little more with every incident.
From here, wire state-change alerting to the source — Azure Activity Log or AWS CloudTrail alerts on write operations against Terraform-managed resource groups give near-real-time drift signal instead of a daily batch. Layer policy-as-code at the apply gate to stop config drift, and tune detection cadence per environment (hourly for production, daily elsewhere). The end state: drift is rare because it’s hard to create, visible within minutes when it happens, and reconciled by a deliberate human decision — never by an automation that mistakes your production for a thing it’s allowed to overwrite.