Policy-as-Code for Terraform with OPA and Conftest on the Plan JSON

A terraform plan someone skims in a PR review is not a guardrail; it is a hope. The moment your estate has more than a handful of contributors, “no public storage,” “everything is tagged with a cost center,” and “no instance bigger than this SKU” stop being things you can enforce by reading diffs. You enforce them by evaluating the plan with code, in CI, before apply. This guide does exactly that: it takes Terraform’s machine-readable plan JSON and gates it with Open Policy Agent (OPA) and Conftest, including reusable Rego, unit tests, waivers, and versioned policy bundles distributed over an OCI registry.

The key insight that makes all of this tractable: the plan JSON is a stable, documented contract. You write policy against that contract once, and it works across every provider, module, and team.

1. Generate machine-readable plan output

OPA does not understand HCL. It understands JSON. Terraform’s job here is to turn a proposed change into a JSON document that describes every resource it intends to create, update, or destroy. That is a two-step dance: produce a binary plan, then render it as JSON.

# 1. Produce a saved, binary plan file.
terraform plan -out=tfplan.binary

# 2. Render that exact plan as JSON (this does NOT re-plan).
terraform show -json tfplan.binary > tfplan.json

The separation matters. terraform show -json against the saved binary renders the same plan you would apply: as long as you later run terraform apply tfplan.binary, there is no fresh refresh and no chance of the world shifting between evaluation and apply. Do not pipe terraform plan -json instead; that emits a stream of log-line JSON objects (machine-readable UI events), not the plan representation Conftest expects. You want terraform show -json of a saved plan file.

For a quick look at what you are about to feed the policy engine:

# Pretty-print the top-level keys.
jq 'keys' tfplan.json

# Just the resource addresses and their planned actions.
jq -r '.resource_changes[] | "\(.address): \(.change.actions | join(","))"' tfplan.json

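Run terraform init first. For plan-only policy checks in CI, terraform init -backend=false keeps the job from touching remote state, but it has two consequences worth knowing. A plan built without state shows every resource as a create, so policy sees the whole configuration rather than the incremental change. And a configuration that declares a backend block generally needs a real terraform init (ideally with read-only credentials) before terraform plan will run at all.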

2. Understand resource_changes, before/after, and the plan schema

The plan JSON has a documented format version (currently 1.x, surfaced as format_version at the top). Check it; do not assume. The pieces you actually write policy against:

| Field | What it holds |
| --- | --- |
| resource_changes[] | The change set: one entry per resource being created, updated, deleted, or read. This is your primary surface. |
| resource_changes[].address | The full module-qualified address, e.g. module.network.aws_subnet.private[0]. Use it in messages. |
| resource_changes[].type / .name | Provider resource type and local name. |
| resource_changes[].change.actions | An array: ["create"], ["read"], ["update"], ["delete"], ["no-op"], or a replace, which is ["delete","create"] or ["create","delete"] depending on create_before_destroy. |
| resource_changes[].change.before | State of the resource before the change (null on create). |
| resource_changes[].change.after | State after the change (null on destroy). |
| resource_changes[].change.after_unknown | A mirror of after in which values not known until apply are true and known values are omitted. |
| configuration | The parsed config (references, expressions). Rarely needed for guardrails. |
| prior_state / planned_values | The full resource trees. Convenient but less stable; prefer resource_changes. |
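
Checking format_version can itself be a policy. The following is a minimal sketch, assuming your rules were only validated against 1.x plans. The package name plan_format and the file name are hypothetical, and the rule is kept out of package main so it does not change the deny counts in the unit tests later in this guide (evaluate it with --all-namespaces).

```rego
# policy/plan_format.rego (hypothetical file and package name)
package plan_format

import rego.v1

# Assumption: these policies were written and tested against 1.x plans only.
supported_prefix := "1."

deny contains msg if {
	# Fall back to a marker string so a missing field is reported, not skipped.
	version := object.get(input, "format_version", "<missing>")
	not startswith(version, supported_prefix)
	msg := sprintf("plan format_version is %v; these policies assume %sx", [version, supported_prefix])
}

test_accepts_1_x if {
	count(deny) == 0 with input as {"format_version": "1.2"}
}

test_rejects_missing_version if {
	count(deny) == 1 with input as {"resource_changes": []}
}
```

Failing on an unexpected major version is a deliberate choice here: a silent schema change is harder to notice than a red build.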

Two rules save you from the most common policy bugs:

  1. Walk resource_changes, not planned_values. resource_changes is the documented, change-oriented surface and it tells you the action. planned_values is a flattened end-state tree that omits deletions and is easy to over-trust.
  2. Respect after_unknown. A value computed at apply time (an assigned ARN, a generated password, an autoscaled count) is omitted from after and appears as true in after_unknown. If your rule reads change.after.some_field and that field is unknown, the reference is undefined in Rego, so the rule body silently stops matching: a deny written as a plain comparison never fires, which is a false pass. When a field can be computed, check after_unknown before asserting on after.

Here is the canonical helper you will reuse everywhere: filter to resources of a type that are being created or updated (ignore destroys and no-ops, which usually should not trip a “must be configured correctly” rule).

package lib.tf

import rego.v1

# All resource_changes of a given type that are being created or updated.
resources(type) := [r |
	some r in input.resource_changes
	r.type == type
	is_managed_change(r)
]

is_managed_change(r) if {
	some action in r.change.actions
	action in {"create", "update"}
}
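
The after_unknown rule can live in a helper as well. This is a sketch rather than part of the library above: the package name lib.unknown and both function names are hypothetical, and it only inspects top-level attributes (nested blocks carry nested after_unknown structures that would need their own handling).

```rego
# policy/lib/unknown.rego (hypothetical file and package name)
package lib.unknown

import rego.v1

# True when top-level attribute `field` of resource change `r` is only known at apply.
is_unknown(r, field) if {
	object.get(r.change, ["after_unknown", field], false) == true
}

# Fail closed: acceptable only if the value is known at plan time and equals `want`.
known_equal(r, field, want) if {
	not is_unknown(r, field)
	r.change.after[field] == want
}

test_unknown_value_is_not_accepted if {
	r := {"change": {"after": {}, "after_unknown": {"encrypted": true}}}
	is_unknown(r, "encrypted")
	not known_equal(r, "encrypted", true)
}

test_known_true_is_accepted if {
	r := {"change": {"after": {"encrypted": true}, "after_unknown": {}}}
	known_equal(r, "encrypted", true)
}
```

A deny rule can then negate known_equal for a security-relevant attribute, so that an unknown value is treated the same as a wrong one.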

3. Write your first Rego deny rule

Conftest’s default convention is a package named main containing rules named deny, violation, or warn. deny and violation both fail the run (violation additionally lets a rule return structured data instead of a bare message), and a failing run exits non-zero, which is what fails your pipeline. We use Rego v1 syntax (import rego.v1, contains, if), which is the current, OPA-1.0-aligned dialect; the older deny[msg] { ... } partial-set form still works on pre-1.0 tooling, but the v1 form is what you should write new code in.

Two guardrails everyone needs first: required tags and encryption at rest. This example targets AWS, but the structure is provider-agnostic.

# policy/tagging.rego
package main

import rego.v1
import data.lib.tf

required_tags := {"environment", "owner", "cost_center"}

# Resource types we expect to be tagged. Extend as needed.
taggable := {"aws_instance", "aws_s3_bucket", "aws_db_instance", "aws_ebs_volume"}

deny contains msg if {
	some type in taggable
	some r in tf.resources(type)
	# Terraform renders unset tags as an explicit null; iterating null yields no keys.
	provided := {k | some k, _ in object.get(r.change.after, "tags", {})}
	missing := required_tags - provided
	count(missing) > 0
	msg := sprintf("%s is missing required tags: %v", [r.address, missing])
}

# policy/encryption.rego
package main

import rego.v1
import data.lib.tf

deny contains msg if {
	some r in tf.resources("aws_ebs_volume")
	# `not ... == true` also fires when the attribute is absent or unknown (fail closed);
	# a plain `!= true` would be undefined in that case and never deny.
	not r.change.after.encrypted == true
	msg := sprintf("%s must have encryption enabled (encrypted = true)", [r.address])
}

deny contains msg if {
	some r in tf.resources("aws_db_instance")
	not r.change.after.storage_encrypted == true
	msg := sprintf("%s must enable storage_encrypted", [r.address])
}

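object.get(r.change.after, "tags", {}) is deliberate: if tags is absent the resource still violates the rule, and the default keeps the expression defined instead of letting an undefined reference silently drop the rule. The default only applies when the key is missing, though; Terraform commonly renders unset tags as an explicit null, so the rule must also treat null as “no tags.” Run it: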

conftest test tfplan.json --policy policy/
FAIL - tfplan.json - main - aws_ebs_volume.data must have encryption enabled (encrypted = true)
FAIL - tfplan.json - main - aws_s3_bucket.assets is missing required tags: {"cost_center", "owner"}

2 tests, 0 passed, 0 warnings, 2 failures

A subtle trap with tags: many resources support default_tags at the provider level, so a bucket can be compliant at apply time even though its own tags block is empty. If you use provider default_tags, either merge them in your module so they appear on the resource, or relax the rule to account for them. Policy that ignores how your modules actually assign tags produces noise, and noisy policy gets disabled.
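
One way to relax the rule is sketched below. It assumes the AWS provider’s tags_all attribute (resource tags merged with provider default_tags) is populated in your plan JSON, which you should confirm against a real plan; the package name tagging_sketch and the helper names are hypothetical.

```rego
package tagging_sketch

import rego.v1

required_tags := {"environment", "owner", "cost_center"}

# Keys from tags_all (assumed to include provider default_tags) plus the
# resource's own tags. A null or absent map contributes no keys.
effective_tag_keys(after) := {k | some k, _ in after.tags_all} | {k | some k, _ in after.tags}

deny contains msg if {
	some r in input.resource_changes
	r.type == "aws_s3_bucket"
	some action in r.change.actions
	action in {"create", "update"}
	missing := required_tags - effective_tag_keys(r.change.after)
	count(missing) > 0
	msg := sprintf("%s is missing required tags: %v", [r.address, missing])
}

# Test fixture: a one-bucket plan with the given `after` object.
bucket_plan(after) := {"resource_changes": [{
	"address": "aws_s3_bucket.assets",
	"type": "aws_s3_bucket",
	"change": {"actions": ["create"], "after": after},
}]}

test_provider_default_tags_satisfy_rule if {
	after := {
		"tags": {"owner": "platform"},
		"tags_all": {"owner": "platform", "environment": "prod", "cost_center": "cc-42"},
	}
	count(deny) == 0 with input as bucket_plan(after)
}

test_null_tags_still_denied if {
	count(deny) == 1 with input as bucket_plan({"tags": null})
}
```

If tags_all turns out to be unknown at plan time in your setup, this falls back to the resource’s own tags, which is the stricter behavior.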

4. Structure reusable libraries, helpers, and unit tests

A folder of copy-pasted some r in input.resource_changes blocks rots fast. Put shared logic in a lib package and import it. The layout that scales:

policy/
  lib/
    tf.rego          # resources(), is_managed_change(), tag helpers
    tf_test.rego     # unit tests for the helpers
  tagging.rego       # package main: deny rules
  encryption.rego
  instances.rego
  network.rego
  exceptions.rego    # waiver logic (section 6)

Now the part most teams skip and then regret: policy is code, so it gets unit tests. OPA has a first-class test runner. Test files live next to the policy, in a package, with rules prefixed test_. You feed them synthetic input with with input as ... and assert the rule fires (or does not).

# policy/tagging_test.rego
package main

import rego.v1

# A minimal plan fragment: one bucket missing two tags.
mock_plan(after) := {"resource_changes": [{
	"address": "aws_s3_bucket.assets",
	"type": "aws_s3_bucket",
	"name": "assets",
	"change": {"actions": ["create"], "after": after},
}]}

test_denies_bucket_missing_tags if {
	result := deny with input as mock_plan({"tags": {"environment": "prod"}})
	count(result) == 1
}

test_allows_fully_tagged_bucket if {
	tags := {"environment": "prod", "owner": "platform", "cost_center": "cc-42"}
	result := deny with input as mock_plan({"tags": tags})
	count(result) == 0
}

test_ignores_destroyed_resource if {
	plan := {"resource_changes": [{
		"address": "aws_s3_bucket.old",
		"type": "aws_s3_bucket",
		"change": {"actions": ["delete"], "after": null},
	}]}
	result := deny with input as plan
	count(result) == 0
}
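
The helpers deserve the same treatment. Below is a sketch of what policy/lib/tf_test.rego might contain, assuming the lib.tf package from section 2 is loaded alongside it; the rc fixture function and its placeholder address are made up for the test.

```rego
# policy/lib/tf_test.rego
package lib.tf_test

import rego.v1
import data.lib.tf

# Builds a minimal resource_changes entry; the address is a placeholder.
rc(type, actions) := {
	"address": sprintf("%s.example", [type]),
	"type": type,
	"change": {"actions": actions},
}

test_keeps_create_update_and_replace if {
	plan := {"resource_changes": [
		rc("aws_instance", ["create"]),
		rc("aws_instance", ["update"]),
		rc("aws_instance", ["delete", "create"]),
	]}
	found := tf.resources("aws_instance") with input as plan
	count(found) == 3
}

test_drops_delete_noop_and_other_types if {
	plan := {"resource_changes": [
		rc("aws_instance", ["delete"]),
		rc("aws_instance", ["no-op"]),
		rc("aws_s3_bucket", ["create"]),
	]}
	found := tf.resources("aws_instance") with input as plan
	count(found) == 0
}
```

The replace case is the one worth pinning down: a replacement carries both delete and create, and the helper should still treat it as a managed change.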

Run the whole suite. The --explain fails mode of opa test prints the trace for any failing test, which is the fastest way to debug a rule that is not matching what you think it is.

opa test policy/ -v
opa fmt -w policy/        # format, like terraform fmt
opa check policy/         # type-check / compile without running

These three commands belong in CI as their own job. A policy that has not been tested against a synthetic plan is a policy you do not actually trust.

5. Enforce instance types, regions, and public-access prevention

With the pattern established, the high-value guardrails fall out quickly. Keep allow-lists in data so they are configuration, not logic; you can then override them per environment with --data files. The examples here inline the sets to stay short.

Allowed instance types. Deny anything outside an approved set; this is your primary cost guardrail.

# policy/instances.rego
package main

import rego.v1
import data.lib.tf

allowed_instance_types := {
	"t3.micro", "t3.small", "t3.medium",
	"m6i.large", "m6i.xlarge",
}

deny contains msg if {
	some r in tf.resources("aws_instance")
	itype := r.change.after.instance_type
	not allowed_instance_types[itype]
	msg := sprintf("%s uses disallowed instance_type %q (allowed: %v)", [r.address, itype, allowed_instance_types])
}
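
To move the allow-list out of the policy, something like the following could work. It is a sketch under stated assumptions: the data path data.config.allowed_instance_types, the package name instances_data, and the per-environment file it implies are all hypothetical, and the block repeats the create/update filter inline so it stands alone.

```rego
# policy/instances_data.rego (hypothetical file and package name)
package instances_data

import rego.v1

# Assumption: a file passed with --data provides config.allowed_instance_types.
# If the list is missing, this set is empty and every instance is denied.
allowed_instance_types := {t | some t in data.config.allowed_instance_types}

deny contains msg if {
	some r in input.resource_changes
	r.type == "aws_instance"
	some action in r.change.actions
	action in {"create", "update"}
	itype := r.change.after.instance_type
	not itype in allowed_instance_types
	msg := sprintf("%s uses disallowed instance_type %q", [r.address, itype])
}

sample_plan := {"resource_changes": [{
	"address": "aws_instance.web",
	"type": "aws_instance",
	"change": {"actions": ["create"], "after": {"instance_type": "m6i.large"}},
}]}

test_allows_listed_type if {
	cfg := {"allowed_instance_types": ["t3.micro", "m6i.large"]}
	count(deny) == 0 with input as sample_plan with data.config as cfg
}

test_denies_when_list_is_missing if {
	count(deny) == 1 with input as sample_plan
}
```

Building the set with a comprehension means a missing data file fails closed instead of leaving the rule undefined.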

Allowed regions. Region usually comes from the provider block, not the resource, so the most reliable signal is the provider configuration in the plan. A pragmatic alternative many teams prefer: pass the target region in as data and assert on it, since the region is known in CI.

# policy/region.rego
package main

import rego.v1

allowed_regions := {"us-east-1", "us-west-2", "eu-west-1"}

# data.region is supplied via a --data file at evaluation time (see Verify).
# Files passed with --data are loaded under data, not input.
deny contains msg if {
	region := data.region
	not allowed_regions[region]
	msg := sprintf("region %q is not in the approved list %v", [region, allowed_regions])
}

Public-access prevention. The classic. Block public S3 buckets and security groups that open the world to sensitive ports.

# policy/network.rego
package main

import rego.v1
import data.lib.tf

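# An S3 public access block must enable all four settings.
required_pab_settings := {
	"block_public_acls", "block_public_policy",
	"ignore_public_acls", "restrict_public_buckets",
}

deny contains msg if {
	some r in tf.resources("aws_s3_bucket_public_access_block")
	# The deprecated all() built-in is rejected under rego.v1, so collect the
	# settings that are not explicitly true (absent or unknown counts as not true).
	not_enabled := {s | some s in required_pab_settings; not r.change.after[s] == true}
	count(not_enabled) > 0
	msg := sprintf("%s must enable all four public-access-block settings (not enabled: %v)", [r.address, not_enabled])
}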

sensitive_ports := {22, 3389, 3306, 5432}

# No security group ingress from 0.0.0.0/0 on a sensitive port.
deny contains msg if {
	some r in tf.resources("aws_security_group")
	some rule in r.change.after.ingress
	"0.0.0.0/0" in rule.cidr_blocks
	some port in sensitive_ports
	rule.from_port <= port
	rule.to_port >= port
	msg := sprintf("%s allows 0.0.0.0/0 to sensitive port %d", [r.address, port])
}

That last rule shows why you walk structured data instead of grepping HCL: it correctly handles a single rule that spans a port range (from_port/to_port) covering a sensitive port, which a regex would miss.
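
The range check is easy to get subtly wrong, so it is worth isolating and testing on its own. A minimal sketch, using a hypothetical package port_range_sketch and a hypothetical helper exposed_ports that mirrors the comparison in the rule above:

```rego
package port_range_sketch

import rego.v1

sensitive_ports := {22, 3389, 3306, 5432}

# Sensitive ports that fall inside an ingress rule's from_port..to_port range.
exposed_ports(rule) := {p |
	some p in sensitive_ports
	rule.from_port <= p
	p <= rule.to_port
}

test_wide_range_exposes_every_sensitive_port if {
	exposed_ports({"from_port": 0, "to_port": 65535}) == sensitive_ports
}

test_range_containing_ssh if {
	exposed_ports({"from_port": 20, "to_port": 25}) == {22}
}

test_https_only_exposes_nothing if {
	count(exposed_ports({"from_port": 443, "to_port": 443})) == 0
}
```

One caveat to check against your own plans: an AWS rule with protocol "-1" (all traffic) is written with from_port and to_port both 0, which a pure range comparison does not flag, so it likely needs a separate condition.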

6. Soft-fail warnings vs hard-fail denials and waivers

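Not every finding should block a merge. Conftest gives you two rule families: deny rules, which fail the run, and warn rules, which are reported but leave the exit code at zero unless you pass --fail-on-warn. A warn rule looks like this: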

# policy/cost_warn.rego
package main

import rego.v1
import data.lib.tf

# Nudge, don't block: gp2 is legacy; prefer gp3.
warn contains msg if {
	some r in tf.resources("aws_ebs_volume")
	r.change.after.type == "gp2"
	msg := sprintf("%s uses gp2; gp3 is cheaper and faster", [r.address])
}

The hard part of any real policy program is the exception/waiver workflow. Blanket rules will eventually be wrong for one legitimate resource, and if your only escape hatch is “disable the rule,” people disable the rule. Build waivers into the policy so exceptions are explicit, attributable, and expirable.

A clean pattern: keep a checked-in waivers.yaml, load it as data, and skip a violation only if there is a matching, unexpired waiver.

# waivers.yaml
waivers:
  - address: "aws_security_group.legacy_bastion"
    rule: "sensitive-port-public"
    reason: "Vendor appliance requires 0.0.0.0/0:22 until migration KV-1422"
    approved_by: "vinod"
    expires: "2026-09-30"

# policy/exceptions.rego
package main

import rego.v1

# Is there a valid (unexpired) waiver for this address+rule?
# waivers.yaml is loaded with --data, so it appears under data, not input.
waived(address, rule) if {
	some w in data.waivers
	w.address == address
	w.rule == rule
	# A waiver lapses at 00:00 UTC on its expires date.
	time.parse_rfc3339_ns(sprintf("%sT00:00:00Z", [w.expires])) > time.now_ns()
}

Then replace the earlier security-group rule with a version gated through it, so the ungated copy does not keep firing (note the explicit rule ID so waivers reference something stable):

deny contains msg if {
	some r in tf.resources("aws_security_group")
	some rule in r.change.after.ingress
	"0.0.0.0/0" in rule.cidr_blocks
	some port in sensitive_ports
	rule.from_port <= port
	rule.to_port >= port
	not waived(r.address, "sensitive-port-public")
	msg := sprintf("%s allows 0.0.0.0/0 to sensitive port %d", [r.address, port])
}

Expiry is the whole point. A waiver without a date is a permanent hole nobody revisits; one that expires forces a conscious renewal and shows up in CI the day it lapses.
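
Because expiry carries so much weight, the date comparison is worth unit-testing in isolation. The sketch below uses a hypothetical package waiver_sketch and a hypothetical function waiver_active that takes the current time as an argument, which keeps the tests deterministic; it assumes expires is a YYYY-MM-DD string as in waivers.yaml.

```rego
package waiver_sketch

import rego.v1

# A waiver is active while `now_ns` is before 00:00 UTC on its expires date.
waiver_active(w, now_ns) if {
	expires_ns := time.parse_rfc3339_ns(sprintf("%sT00:00:00Z", [w.expires]))
	now_ns < expires_ns
}

test_future_waiver_is_active if {
	now := time.parse_rfc3339_ns("2026-01-15T12:00:00Z")
	waiver_active({"expires": "2026-09-30"}, now)
}

test_expired_waiver_is_inactive if {
	now := time.parse_rfc3339_ns("2026-10-01T00:00:00Z")
	not waiver_active({"expires": "2026-09-30"}, now)
}

test_waiver_lapses_at_start_of_expiry_date if {
	now := time.parse_rfc3339_ns("2026-09-30T00:00:00Z")
	not waiver_active({"expires": "2026-09-30"}, now)
}
```

Passing the clock in as a parameter is a design choice for testability; in a real rule the caller would supply time.now_ns().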

7. Wire Conftest into GitHub Actions and pre-commit

Run policy in two places: locally before the commit (fast feedback, fewer round-trips) and in CI as the authoritative gate (cannot be skipped). They share the same policy/ directory, so behavior is identical.

pre-commit. Use the official Conftest hook to evaluate any plan JSON a developer generates. A thin wrapper script keeps the plan generation and the policy call together:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/open-policy-agent/conftest
    rev: v0.56.0
    hooks:
      - id: conftest-test
        files: 'tfplan\.json$'
        args: ["--policy", "policy/", "--all-namespaces"]

GitHub Actions. Generate the plan JSON, then evaluate it. This stages cleanly after validate and before any expensive integration job.

# .github/workflows/policy.yml
name: policy
on:
  pull_request:

jobs:
  opa-unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: open-policy-agent/setup-opa@v2
        with:
          version: latest
      - run: opa fmt --list --fail policy/   # fail if unformatted
      - run: opa check policy/
      - run: opa test policy/ -v

  conftest:
    needs: opa-unit
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Plan and render JSON
        run: |
          terraform init -backend=false
          terraform plan -out=tfplan.binary
          terraform show -json tfplan.binary > tfplan.json
      - uses: open-policy-agent/setup-conftest@v0
      - name: Evaluate policy
        run: |
          conftest test tfplan.json \
            --policy policy/ \
            --all-namespaces \
            --data waivers.yaml

--all-namespaces evaluates every package, not just main, which you want once your policies are organized into sub-packages. Add --no-color for clean log scraping and --output github to get annotations rendered inline on the PR diff.

8. Distribute and version policy bundles via OCI registries

A folder of Rego in one repo is fine for one team. Across an organization you want one source of policy, versioned and pulled by every pipeline, not copied. Conftest can push and pull policy bundles as OCI artifacts to any registry that supports them (GHCR, ECR, ACR, Artifactory).

Publish from the policy repo’s CI on a tagged release:

# Push the contents of policy/ as an OCI artifact, versioned by tag.
conftest push ghcr.io/kloudvin/policies:1.4.0 policy/

# Optionally also move a floating tag for "latest stable".
conftest push ghcr.io/kloudvin/policies:stable policy/

Consume it in any downstream pipeline. conftest pull fetches the bundle into a local policy/ directory, then you evaluate as normal:

conftest pull ghcr.io/kloudvin/policies:1.4.0
conftest test tfplan.json --all-namespaces --data waivers.yaml

Pin to an immutable version tag (1.4.0), never stable or latest, in the pipelines that gate production. A floating tag means your merge gate can change behavior with no commit in your repo — exactly the kind of invisible drift policy-as-code exists to prevent. Treat the policy bundle like any other dependency: pin it, bump it deliberately, and let the bump go through review.

OPA itself can also serve and pull bundles (the OPA bundle protocol, or OCI via the oci:// service), which is the right path if you run OPA as a long-lived service for admission control elsewhere. For Terraform CI specifically, Conftest’s push/pull is the lighter-weight, more direct fit.

Verify

Confirm the whole chain actually blocks bad changes and lets good ones through.

# 1. Helpers and rules pass their own unit tests.
opa test policy/ -v

# 2. Generate fresh plan JSON from your config.
terraform init -backend=false
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json

# 3. Evaluate. A compliant plan exits 0; a violating one exits non-zero.
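#    --data takes files, so the region goes into a small JSON document.
printf '{"region": "us-east-1"}\n' > region.json
conftest test tfplan.json --policy policy/ --all-namespaces \
  --data waivers.yaml --data region.json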
echo "exit code: $?"

# 4. Prove a deny fires: point an EBS volume at encrypted = false,
#    re-plan, re-render, and confirm a FAIL line + non-zero exit.

# 5. Prove a waiver works: add a matching, unexpired entry to
#    waivers.yaml and confirm the previously-failing rule now passes.

# 6. Prove expiry works: set that waiver's `expires` to a past date
#    and confirm the rule fails again.

The decisive test is #6. If an expired waiver does not re-break the build, your time comparison is wrong and every waiver is effectively permanent.

Pitfalls and next steps

The failure mode that quietly destroys a policy program is false confidence from unknown values. A rule that compares change.after.storage_encrypted != true passes happily when that field is absent from after because it is computed at apply time: the reference is undefined, the rule body never matches, and a resource that will be unencrypted sails through the gate. Audit every rule that touches a possibly-computed attribute and make it consult after_unknown; treat “unknown” as “fail closed” for anything security-relevant.

The second is policy that fights your modules. Provider default_tags, computed names, and wrapper modules all mean the raw resource may not carry the attribute your rule inspects, even though the applied resource is compliant. The fix is to test policy against the plan JSON your actual modules produce, not hand-written fragments alone, so the gate matches reality and does not generate noise that trains people to ignore it.

From here, the high-value extensions are: a dedicated, separately tested policy repo published as a versioned OCI bundle (so the gate is an artifact, not a folder you copy); cost-aware policy by feeding an Infracost breakdown alongside the plan and denying changes whose monthly delta exceeds a threshold; and, once Terraform CI is solid, reusing the same Rego against Kubernetes admission and CI image policy so one Rego skill set covers the whole platform.
