Detecting and Reconciling Terraform Drift Without Nuking Production

Drift is the silent tax on every Terraform estate. Someone clicks a button in the portal during an incident, an autoscaler rewrites a field Terraform thinks it owns, and three weeks later a routine apply proposes to destroy something nobody meant to touch. This is a practitioner’s guide to seeing drift before it bites, reading a drift plan correctly, and reconciling it without turning a small divergence into an outage.

1. What drift actually is: three things that must agree

There are three sources of truth in any Terraform workflow, and “drift” is any disagreement between them:

Configuration — your .tf files: what you declared should exist.
State — terraform.tfstate: what Terraform last recorded as existing.
The live cloud — the actual resources, as the provider API reports them right now.

   config  <-- plan compares these two -->  state  <-- refresh compares these two -->  cloud
   (.tf)                                  (.tfstate)                                 (real API)

Two distinct gaps get lumped together as “drift,” and conflating them is the root of most bad reconciliation decisions:

State-vs-cloud drift — the world changed underneath Terraform. A subnet was deleted in the portal, a tag was added by a governance policy, an SKU was bumped manually. State is stale.
Config-vs-state drift — your code changed but hasn’t been applied, or someone ran terraform state commands by hand. This is just a pending change, not “drift” in the dangerous sense.

The pipeline you build later cares almost entirely about the first kind: detecting state-vs-cloud divergence early, deciding intentionally how to reconcile it, and never letting an apply make that decision for you by accident.

Mental model: plan shows you the gap between config and state. A refresh updates Terraform’s view of state from the cloud. Real drift detection requires both: the refresh first, then the comparison.
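One way to keep the two gaps apart in practice is to measure each with its own plan. The sketch below is illustrative, not a Terraform feature. It assumes you have already run terraform plan -refresh-only -detailed-exitcode (state vs. cloud) and terraform plan -refresh=false -detailed-exitcode (config vs. stored state), where -detailed-exitcode returns 0 for no changes, 1 for an error, and 2 for changes. The function name classify_gaps is hypothetical.

```shell
#!/usr/bin/env bash
# classify_gaps REFRESH_RC CONFIG_RC
#   REFRESH_RC: exit code of `terraform plan -refresh-only -detailed-exitcode`
#   CONFIG_RC:  exit code of `terraform plan -refresh=false -detailed-exitcode`
# Prints which of the two gaps is open. Any other exit code is treated as an error.
classify_gaps() {
  local refresh_rc="$1" config_rc="$2"
  case "${refresh_rc}:${config_rc}" in
    0:0) echo "in-sync" ;;
    2:0) echo "state-vs-cloud drift" ;;
    0:2) echo "pending config change" ;;
    2:2) echo "both" ;;
    *)   echo "error"; return 1 ;;
  esac
}
```

For example, classify_gaps 2 0 prints state-vs-cloud drift: the world moved, but nothing in your code is waiting to be applied.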

2. How refresh, plan, and -refresh-only behave

Understanding the exact mechanics here is non-negotiable, because the defaults changed and a lot of folklore is wrong.

terraform plan refreshes by default. Before computing the diff, plan reads every resource from the provider API to update its in-memory view of state, then compares your config against that refreshed view. Crucially, this refresh is in-memory for the plan — it does not persist to the state file. So a normal plan already sees drift; it just folds it into the proposed changes.

-refresh=false skips that API read entirely. Plan trusts whatever is in state. This is faster and is what you want when you’ve just refreshed and don’t want to pay the API round-trips again — but run it blind and you can apply against a stale picture.

-refresh-only is the purpose-built drift detector. It refreshes state from the cloud and reports what changed, but it will never propose to modify, create, or destroy a real resource to match your config. Its only possible action is updating the state file to match reality.

# Pure drift detection: compare state against the live cloud, change nothing real.
terraform plan -refresh-only

# Persist the refreshed reality into state (adopts drift INTO state, not the other way).
terraform apply -refresh-only

The distinction that trips people up: terraform apply -refresh-only does not push your config to the cloud. It does the reverse — it accepts the cloud’s current values as the new contents of state. That’s sometimes right (a manual change you want to keep) and sometimes wrong (one you want to revert). You decide; the flag just makes state agree with reality.

The old standalone terraform refresh command still exists but is deprecated. Use terraform apply -refresh-only, which is the same operation with a plan and an approval prompt in front of it.

Here is the decision in one table:

Command	Reads cloud?	Can change cloud?	Can change state?	Use for
`plan`	Yes	No	No	Normal change review
`plan -refresh=false`	No	No	No	Fast plan against trusted state
`plan -refresh-only`	Yes	No	No	Detecting drift safely
`apply -refresh-only`	Yes	No	Yes	Adopting reality into state
`apply`	Yes	Yes	Yes	Making the cloud match config

3. Reading a drift plan: noise versus real divergence

Run terraform plan -refresh-only against a workspace that has drifted and you’ll see a block headed with a note that objects have changed outside of Terraform. The body uses the same diff symbols as a normal plan, but the meaning is inverted: it’s showing how the cloud differs from state, not how your config differs from state.

Note: Objects have changed outside of Terraform

  # azurerm_storage_account.this has changed
  ~ resource "azurerm_storage_account" "this" {
        id                = "/subscriptions/.../storageAccounts/kvprodsa"
      ~ tags              = {
          + "CostCenter" = "FIN-4417"
        }
        # (38 unchanged attributes hidden)
    }

Now the skill: telling benign provider noise apart from real divergence. Not every ~ is a problem.

Real drift you must address: a changed SKU, a deleted rule, a security setting flipped, a tag your org policy added that your config doesn’t know about. These represent the world disagreeing with your intent.
Provider normalization noise: the API returns a value in a different but equivalent form than you wrote — a casing difference, an attribute the provider computes and rewrites, an ordering change in a set the provider serializes as a list, a default the API fills in. These aren’t drift; they’re the provider being chatty.
Perpetual diffs: an attribute that shows up as changed on every single run no matter what you do. That almost always means the resource schema and the API are fighting (an azurerm resource exposing a value also managed by a separate *_association resource is the classic case), or you’re missing an ignore_changes.

The diagnostic question for any drift line is: if I ran terraform apply right now, would the proposed action be correct? If Terraform wants to remove a tag a compliance policy legitimately added, the answer is no — reconcile the config, don’t apply. The -refresh-only plan is safe precisely because it forces that question before anything touches production.
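To ask that question one resource at a time, it helps to pull the drifted addresses out of the report first. This is a minimal sketch that assumes the header wording shown in the sample above ("# ADDRESS has changed"). It also matches "has been deleted", on the assumption that the CLI words deletions that way. The function name drifted_addresses is hypothetical.

```shell
#!/usr/bin/env bash
# drifted_addresses
# Reads the text of `terraform plan -refresh-only -no-color` on stdin and
# prints each drifted resource address once, one per line.
# Only header lines ending in "has changed" or "has been deleted" are used;
# attribute lines and "# (N unchanged attributes hidden)" notes are skipped.
drifted_addresses() {
  sed -n -E 's/^[[:space:]]*# (.+) has (changed|been deleted)$/\1/p' | sort -u
}
```

Fed the sample report above, this prints azurerm_storage_account.this, which gives you a short list to triage instead of a wall of diff.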

4. Importing existing resources with import blocks

When a resource exists in the cloud but not in state (created by hand, by another tool, or by a different Terraform configuration), you import it. From Terraform 1.5 onward, prefer import blocks over the imperative terraform import command: they’re declarative, reviewable in a pull request, and they run as part of plan/apply so you can see exactly what will happen first.

# imports.tf
import {
  to = azurerm_resource_group.hotfix
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-hotfix"
}

You still need configuration for the resource being imported. Terraform 1.5+ can generate it for you so you don’t hand-write 40 attributes:

# Generate HCL for everything referenced by an import block.
terraform plan -generate-config-out=generated.tf

This writes a best-effort resource block into generated.tf. Treat it as a draft: it often includes read-only or deprecated attributes, won’t carry your variables or for_each, and may need defaults trimmed. Move the cleaned-up resource into your real .tf files, then plan again.

terraform plan   # with import block + config present
# Expect: "1 to import, 0 to add, 0 to change, 0 to destroy"
terraform apply

A clean import shows 0 to change after the import. If the plan wants to change attributes immediately after importing, your config doesn’t yet match the live resource — reconcile the config until the post-import plan is a no-op, then remove the import block (it’s idempotent, but leaving it is clutter).
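The "clean import" check can be made mechanical. The sketch below assumes the plan summary uses the wording quoted above ("1 to import, 0 to add, 0 to change, 0 to destroy"). The function name import_is_clean is hypothetical.

```shell
#!/usr/bin/env bash
# import_is_clean
# Reads `terraform plan -no-color` output on stdin. Succeeds only when the
# summary shows at least one import and nothing to add, change, or destroy.
import_is_clean() {
  grep -Eq '[1-9][0-9]* to import, 0 to add, 0 to change, 0 to destroy'
}
```

A typical use would be terraform plan -no-color | import_is_clean as a gate in review: a non-zero exit means the config still disagrees with the live resource.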

For bulk or scripted scenarios the imperative form still works and is sometimes more convenient in a loop:

terraform import 'azurerm_resource_group.hotfix' \
  "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-hotfix"

Import IDs are provider- and resource-specific. For AzureRM it’s almost always the full ARM resource ID; for AWS it varies by resource (an instance ID, an ARN, a name). Check the resource’s docs: guessing the ID format is one of the most common causes of failed imports.
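Bulk imports don’t have to fall back to the imperative command. As a sketch, a small generator can turn a list of address/ID pairs into import blocks that still go through review. It assumes one pair per line, separated by whitespace, and IDs that contain no double quotes or ${ sequences. The function name gen_import_blocks is hypothetical.

```shell
#!/usr/bin/env bash
# gen_import_blocks
# Reads "ADDRESS ID" pairs on stdin and prints one import block per pair.
# Blank lines and lines starting with # are skipped.
gen_import_blocks() {
  local addr id
  while read -r addr id || [ -n "$addr" ]; do
    case "$addr" in
      ''|'#'*) continue ;;
    esac
    printf 'import {\n  to = %s\n  id = "%s"\n}\n\n' "$addr" "$id"
  done
}
```

Redirect the output into a file such as imports.tf, then run the plan with -generate-config-out as described above and review the result as usual.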

5. Reconciliation strategies: adopt, override, or ignore with intent

Once you’ve identified real drift, there are exactly three honest responses. Pick deliberately; the worst outcome is picking by reflex.

Adopt the change (the world was right). The manual change should stand. Bring it into your configuration so code matches reality, then make state agree. Update the .tf to reflect the new value, run terraform plan and confirm it’s a no-op, and you’re reconciled. If the change is one Terraform can’t express in config (a value the cloud now owns), terraform apply -refresh-only adopts it into state.

Override the change (your config was right). The manual change was a mistake — a hotfix that bypassed review, an unauthorized edit. Leave your config as-is and let Terraform restore the intended state:

terraform plan    # shows Terraform reverting the manual change
terraform apply   # cloud is brought back into compliance with config

This is the case where a normal apply is the correct reconciliation, because here clobbering the drift is the goal. The danger is only doing it accidentally on drift you meant to keep.

Ignore the attribute (it’s legitimately managed elsewhere). Some fields are owned by another system by design: tags applied by Azure Policy, replica counts owned by an autoscaler, a secret rotated by a separate pipeline. Tell Terraform to stop fighting over them with lifecycle { ignore_changes = [...] }.

resource "azurerm_kubernetes_cluster" "this" {
  name                = "kvprod-aks"
  resource_group_name = azurerm_resource_group.this.name
  location            = var.location
  dns_prefix          = "kvprod"

  default_node_pool {
    name       = "system"
    node_count = 3
    vm_size    = "Standard_D4s_v5"
  }

  identity {
    type = "SystemAssigned"
  }

  lifecycle {
    # The cluster autoscaler owns node_count; do not revert it on every apply.
    ignore_changes = [
      default_node_pool[0].node_count,
      tags["CostCenter"],   # applied by Azure Policy, not by us
    ]
  }
}

ignore_changes is a scalpel, not a mute button. Two rules keep it honest: scope it to specific attributes, never all; and comment why every entry exists. An undocumented ignore_changes is how the next engineer learns that Terraform has silently stopped managing a security-relevant field. If you can’t name the system that legitimately owns the attribute, you shouldn’t be ignoring it — you should be adopting or overriding it.
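Both rules are easy to check in CI. The sketch below is a rough lint, not a full HCL parser: it flags ignore_changes = all, and any ignore_changes line with no comment on that line or the line directly above it. It does not verify a comment per entry. The function name lint_ignore_changes is hypothetical.

```shell
#!/usr/bin/env bash
# lint_ignore_changes FILE...
# Prints "FILE:LINE: message" for each finding and returns 1 if there were any.
lint_ignore_changes() {
  awk '
    FNR == 1 { prev = "" }                      # do not carry context across files
    /^[ \t]*ignore_changes[ \t]*=/ {
      if ($0 ~ /=[ \t]*all([ \t]|$)/) {
        printf "%s:%d: ignore_changes = all\n", FILENAME, FNR
        bad = 1
      } else if ($0 !~ /#|\/\// && prev !~ /^[ \t]*(#|\/\/)/) {
        printf "%s:%d: ignore_changes without a comment\n", FILENAME, FNR
        bad = 1
      }
    }
    { prev = $0 }
    END { exit (bad ? 1 : 0) }
  ' "$@"
}
```

Run it as lint_ignore_changes *.tf. The cluster example above passes, because the ignore_changes line sits directly under a comment naming the autoscaler.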

Strategy	When the drift is…	Mechanism
Adopt	A change you want to keep	Edit config to match; `apply -refresh-only` for state-only values
Override	An unauthorized or mistaken change	Plain `terraform apply` restores config’s intent
Ignore	Owned by another system by design	`lifecycle { ignore_changes = [attr] }`, scoped and commented

6. A scheduled drift-detection pipeline that reports, not clobbers

The point of automation here is to surface drift on a schedule and notify a human — never to auto-apply. An automated reconciliation that runs terraform apply on a cron is how you turn one manual change into a 3 a.m. page.

The mechanism is terraform plan -refresh-only plus the -detailed-exitcode flag, which makes the result machine-readable:

0 = no changes (no drift)
1 = error
2 = changes present (drift detected)

# .github/workflows/drift-detection.yml
name: drift-detection
on:
  schedule:
    - cron: "0 6 * * 1-5"   # 06:00 UTC, weekdays
  workflow_dispatch: {}

permissions:
  id-token: write           # OIDC to the cloud, no long-lived secrets
  contents: read
  issues: write             # to open a drift report issue

jobs:
  detect:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false

      # Authenticate to Azure via OIDC (federated credentials), read-only role.
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - run: terraform init -input=false

      - name: Refresh-only drift check
        id: drift
        run: |
          set +e
          terraform plan -refresh-only -detailed-exitcode -no-color -input=false -out=drift.tfplan
          echo "exitcode=$?" >> "$GITHUB_OUTPUT"
          set -e

      - name: Render drift summary
        if: steps.drift.outputs.exitcode == '2'
        run: terraform show -no-color drift.tfplan > drift-report.txt

      - name: Open drift issue
        if: steps.drift.outputs.exitcode == '2'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = "Drift detected by scheduled refresh-only plan.\n\n```\n"
              + fs.readFileSync('drift-report.txt','utf8').slice(0, 60000)
              + "\n```";
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `Terraform drift detected: ${context.workflow} ${new Date().toISOString().slice(0,10)}`,
              body,
              labels: ['drift']
            });

      # Exit code 1 means the plan itself failed; that must go red too.
      - name: Fail the run on drift or on a plan error
        if: steps.drift.outputs.exitcode != '0'
        run: exit 1

The load-bearing design decisions:

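The cloud credential is read-only. Give the drift job a role with read/list permissions only (Azure Reader, AWS ReadOnlyAccess-equivalent). Even if someone later drops -refresh-only from the command or swaps plan for apply, the identity cannot mutate production. Defense in depth beats trusting the flag. One caveat: plan takes a state lock by default, so the identity still needs whatever backend access locking requires, or the job has to pass -lock=false and accept an unlocked read.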
It reports via an issue/alert, gated behind exit code 2. No diff, no noise. A real diff opens a ticket a human triages.
It fails the workflow on drift so the red signal is visible in dashboards, not buried in logs.
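The read-only credential is the hard stop. A cheap second layer is a wrapper that refuses any Terraform invocation other than a refresh-only plan. This is a sketch under assumptions: the function name assert_report_only is hypothetical, and it expects the subcommand as the first argument, so global options such as -chdir placed before it are refused as well.

```shell
#!/usr/bin/env bash
# assert_report_only ARGS...
# ARGS are the arguments that would follow `terraform`. Succeeds only for a
# plan that carries -refresh-only; apply, destroy, and a plain plan are refused.
assert_report_only() {
  local subcommand="${1:-}" arg refresh_only=no
  for arg in "$@"; do
    case "$arg" in
      -refresh-only|--refresh-only) refresh_only=yes ;;
    esac
  done
  if [ "$subcommand" = "plan" ] && [ "$refresh_only" = "yes" ]; then
    return 0
  fi
  echo "refused: the drift job may only run 'terraform plan -refresh-only'" >&2
  return 1
}
```

In a job script this would sit in front of the real call, for example assert_report_only "$@" && terraform "$@".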

If you run HCP Terraform / Terraform Cloud, you don’t have to build this. It has native, scheduled drift detection at the workspace level (a Plus feature) that runs refresh-only health assessments and surfaces drift in the UI and via notifications. Use it if you have it; the pipeline above is for everyone self-managing state.

7. Preventing drift at the source: policy and RBAC

Detection is treating the symptom. The cure is making out-of-band changes hard to do in the first place, so Terraform stays the only writer of the resources it owns.

Least-privilege RBAC for humans. People should not hold standing write access to Terraform-managed production. Grant Reader by default and require just-in-time elevation (Azure PIM, or the AWS equivalent via assumed roles with approval) for the rare break-glass write. The only identity with durable write access is the CI/CD service principal that runs apply.

# Humans get Reader on the managed subscription/resource group.
az role assignment create \
  --assignee "<user-or-group-object-id>" \
  --role "Reader" \
  --scope "/subscriptions/<sub-id>/resourceGroups/rg-prod"

Deny-by-policy for the changes that matter. Use cloud-native policy to block the mutations you never want made by hand. Azure Policy with a deny effect, or an Azure resource lock (CanNotDelete / ReadOnly) on critical resources, stops the portal click before it happens.

# A management lock that prevents deletion of a production resource group.
az lock create \
  --name "no-delete-prod" \
  --lock-type CanNotDelete \
  --resource-group "rg-prod"

A ReadOnly lock blocks Terraform too. Use CanNotDelete for resources Terraform still manages day-to-day, and reserve ReadOnly for things that are genuinely frozen. Locks protect against accidents; they are not a substitute for RBAC.

Policy-as-code in the pipeline. Gate apply behind OPA/Conftest or Sentinel so even reviewed changes can’t violate org standards (no public blob, mandatory tags, approved regions). This doesn’t stop manual portal drift, but it stops config drift — the slow rot where each PR bends the rules a little.

The combination is what works: RBAC removes the ability to drift, policy and locks remove the easy path, and the scheduled detector catches whatever slips through (a higher-privileged automation, a break-glass session that wasn’t reverted).

8. Incident playbook: reconciling an emergency hotfix

The realistic scenario: production is down, an engineer with break-glass access changes a setting in the portal to restore service, the incident closes. Now the cloud is ahead of Terraform, and your next apply would revert the fix. Here is the controlled sequence.

# 1. SEE the drift before touching anything. Read-only, never an apply.
terraform plan -refresh-only

# 2. CAPTURE what changed, for the record and the post-mortem.
terraform plan -refresh-only -out=incident.tfplan
terraform show -no-color incident.tfplan > incident-drift.txt

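3. Decide per-attribute, not per-resource. The hotfix changed a firewall rule (keep it) and also happened to bump a tag (irrelevant to the fix). Adopt the firewall change. For the tag, pick one of the other two responses on purpose: let the next apply override it, or add a scoped ignore_changes if another system legitimately owns it.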

4. Codify the adopted change in config. Edit the .tf so the firewall rule reflects what the engineer set during the incident. This is the step people skip — and skipping it means the next engineer reverts the hotfix, re-triggering the outage. The fix isn’t real until it’s in code and merged.

# 5. PROVE config now matches reality: the plan must be a no-op for the hotfix.
terraform plan
# Expect: no proposed change to the firewall rule. If Terraform wants to
# modify it, your config still disagrees with the live resource -> keep editing.

# 6. RECONCILE state for any cloud-owned values that aren't expressible in config.
terraform apply -refresh-only

7. Open the PR, get the post-incident review, merge. The drift is only truly closed when the change has gone through the same gate every other change goes through. Until then you’ve got an undocumented hotfix that exists in exactly one person’s memory.

The cardinal rule of incident reconciliation: never run a plain terraform apply first. Your opening move is always -refresh-only to observe. A reflexive apply against a freshly hotfixed production is how the cure becomes the second outage.
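Step 5 of the playbook can be checked mechanically rather than by eye. The sketch below assumes the usual action headers in plan output ("# ADDRESS will be ..." and "# ADDRESS must be replaced"). The function name plan_touches and the resource address used in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# plan_touches ADDRESS
# Reads `terraform plan -no-color` output on stdin and succeeds if the plan
# proposes any action on ADDRESS (create, update, destroy, or replace).
plan_touches() {
  local addr="$1"
  grep -Fq -e "# ${addr} will be " -e "# ${addr} must be replaced"
}
```

For a hotfixed rule at the hypothetical address azurerm_network_security_rule.allow_app, terraform plan -no-color | plan_touches azurerm_network_security_rule.allow_app should exit non-zero once the config matches what the engineer set. If it exits zero, Terraform still intends to change the rule, so keep editing before anyone applies.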

Verify

Confirm your drift workflow is sound end to end:

# 1. Refresh-only detects drift and changes nothing real.
terraform plan -refresh-only -detailed-exitcode
echo "exit code: $?"   # 0 = clean, 2 = drift, 1 = error

# 2. An imported resource lands cleanly (no immediate change).
terraform plan        # expect "1 to import ... 0 to change" before apply

# 3. After reconciliation, the workspace is a true no-op.
terraform plan        # expect "No changes. Your infrastructure matches the configuration."

# 4. The drift job's identity cannot mutate prod: the list should show only read-only roles (e.g. Reader).
az role assignment list --assignee "<drift-sp-object-id>" \
  --scope "/subscriptions/<sub-id>" -o table

Green means: drift is detected without side effects, imports are non-destructive, reconciliation converges to a no-op, and the detector’s credentials are read-only by construction.
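The fourth check can assert instead of relying on someone reading a table. As a sketch, pipe the role names into an allowlist filter. The allowlist here (Reader, Storage Blob Data Reader) is an assumption to adjust to whatever your drift identity is meant to hold, and the function name only_read_roles is hypothetical.

```shell
#!/usr/bin/env bash
# only_read_roles
# Reads role names on stdin, one per line. Returns 1 and names the offender
# on stderr if any role is outside the allowlist.
only_read_roles() {
  local role bad=0
  while IFS= read -r role || [ -n "$role" ]; do
    case "$role" in
      ''|'Reader'|'Storage Blob Data Reader') ;;   # assumed allowlist
      *) echo "unexpected role: $role" >&2; bad=1 ;;
    esac
  done
  return "$bad"
}
```

To feed it, add --query "[].roleDefinitionName" -o tsv to the az role assignment list command above and pipe the result into only_read_roles.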

Checklist

Pitfalls and next steps

The recurring failure modes are predictable. Running terraform apply as the first response to drift, clobbering a change someone made on purpose. Treating apply -refresh-only as if it pushes config to the cloud, when it does the opposite. Reaching for ignore_changes = all and quietly abandoning resources Terraform should still manage. Auto-applying drift on a schedule and discovering at 3 a.m. that “reconciliation” and “outage” can be the same event. And the slow one: never merging hotfixes back into code, so config and reality diverge a little more with every incident.

From here, wire state-change alerting to the source — Azure Activity Log or AWS CloudTrail alerts on write operations against Terraform-managed resource groups give near-real-time drift signal instead of a daily batch. Layer policy-as-code at the apply gate to stop config drift, and tune detection cadence per environment (hourly for production, daily elsewhere). The end state: drift is rare because it’s hard to create, visible within minutes when it happens, and reconciled by a deliberate human decision — never by an automation that mistakes your production for a thing it’s allowed to overwrite.

Written by Vinod
