Most Terraform pipelines die one of two deaths. The first is the long-lived cloud access key baked into a repository secret, rotated never, scoped to everything, and one log leak away from owning your account. The second is the apply that runs against a plan nobody saw, because plan and apply were two independent jobs that each re-planned, and the world moved in between. This guide builds a pipeline that closes both: GitHub’s OpenID Connect (OIDC) issuer hands the workflow a short-lived cloud role with no stored secret, the plan is rendered as a sticky PR comment a human can read, and the exact binary plan is saved as an artifact and replayed on apply so what you reviewed is what you ship.
The examples target AWS because the role-assumption story is the most explicit there, but the OIDC pattern is identical on Azure (azure/login with federated credentials) and GCP (Workload Identity Federation). The pipeline mechanics are cloud-agnostic.
1. Keyless cloud auth with GitHub OIDC
Every GitHub Actions run can request a signed JSON Web Token from GitHub’s OIDC provider, https://token.actions.githubusercontent.com. Your cloud trusts that issuer, validates the token’s claims (which repo, which branch, which environment), and returns short-lived credentials. No AWS_ACCESS_KEY_ID ever touches a secret store.
On AWS this is a two-part setup: an IAM OIDC identity provider, and a role whose trust policy pins the sub claim. Here is the trust policy. The sub condition is the entire security boundary, so be precise.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:my-org/infra:environment:prod"
        }
      }
    }
  ]
}
Scope the sub to an environment (repo:my-org/infra:environment:prod), not to a wildcard like repo:my-org/infra:*. A wildcard subject means any workflow in the repository, on any branch or pull request, can assume your production role. Binding to a GitHub environment lets you layer reviewer approval (Step 6) on top of the cloud trust. Use a separate role per environment with a sub that matches that environment exactly.
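To see why the wildcard is dangerous, the sketch below approximates wildcard matching of the sub claim with shell glob patterns. It is an illustration only, not how AWS evaluates policies: the helper names sub_allowed and check are made up for this example, and the sample subjects follow the sub formats GitHub documents for branch, pull request, and environment runs.

```shell
# Approximates wildcard matching ('*' and '?') of a sub claim with shell globs.
# Shell patterns also give [...] a special meaning that IAM does not, so the
# patterns used here avoid brackets.
sub_allowed() {
  # $1 = pattern from the trust policy, $2 = sub claim presented by a run
  case "$2" in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}

check() {
  if sub_allowed "$1" "$2"; then echo "ALLOW  $2"; else echo "DENY   $2"; fi
}

wildcard='repo:my-org/infra:*'
pinned='repo:my-org/infra:environment:prod'

for pattern in "$wildcard" "$pinned"; do
  echo "pattern: $pattern"
  check "$pattern" 'repo:my-org/infra:ref:refs/heads/feature-x'
  check "$pattern" 'repo:my-org/infra:pull_request'
  check "$pattern" 'repo:my-org/infra:environment:prod'
done
```

Under the wildcard pattern all three subjects are allowed; under the pinned value only the environment:prod subject is.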
You can create the OIDC provider once with the CLI. AWS no longer relies on a certificate thumbprint for this provider (it validates the issuer’s certificate against its own trust store), so the thumbprint can be omitted on recent CLI versions; if yours still insists on --thumbprint-list, a placeholder value is accepted:
aws iam create-open-id-connect-provider \
--url https://token.actions.githubusercontent.com \
--client-id-list sts.amazonaws.com
In the workflow, you request the token by granting id-token: write permission and then exchanging it. The official aws-actions/configure-aws-credentials action does the AssumeRoleWithWebIdentity call for you:
permissions:
  id-token: write   # REQUIRED to mint the OIDC token
  contents: read
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-terraform-prod
          aws-region: eu-west-1
          role-session-name: gha-${{ github.run_id }}
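The credentials default to a one-hour lifetime and are exported only into the job that requested them, so there is no stored key to rotate or leak. Because the trust policy above pins sub to environment:prod, only a job that declares environment: prod can assume this role; a plain PR job presents repo:my-org/infra:pull_request instead and is denied, so plan jobs that run without an environment need their own, ideally read-only, role.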
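When a role assumption is denied, it helps to look at the claims the run actually presented. The following is a minimal sketch that decodes the payload segment of a JWT without verifying its signature. decode_jwt_payload is a hypothetical helper, and the sample token is built locally for illustration rather than issued by GitHub.

```shell
# Decode the middle (payload) segment of a JWT. Debugging aid only:
# it does NOT verify the signature.
decode_jwt_payload() {
  local payload
  # JWTs use base64url: map '-' and '_' back to '+' and '/'.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr -- '-_' '+/')
  # Restore the '=' padding that base64url strips.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  printf '%s' "$payload" | base64 -d
}

# Locally built sample token (assumption: header and signature are dummies).
# Inside a job with id-token: write, a real token can be requested using the
# ACTIONS_ID_TOKEN_REQUEST_URL and ACTIONS_ID_TOKEN_REQUEST_TOKEN variables.
claims='{"sub":"repo:my-org/infra:environment:prod","aud":"sts.amazonaws.com"}'
sample="header.$(printf '%s' "$claims" | base64 | tr -d '=\n' | tr -- '+/' '-_').signature"

decode_jwt_payload "$sample"
echo
```

Comparing the printed sub with the value in the trust policy usually explains a failed AssumeRoleWithWebIdentity call faster than reading the error message.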
2. Repository structure and a matrix for multiple environments
Keep each environment’s root module separate so its state, variables, and backend are unambiguous. A layout that scales:
infra/
  modules/                # reusable, versioned modules
    network/
    eks/
  environments/
    dev/
      main.tf
      backend.tf          # backend "s3" { key = "dev/terraform.tfstate" }
      dev.auto.tfvars
    staging/
    prod/
.github/workflows/
  terraform-plan.yml
  terraform-apply.yml
  drift.yml
The matrix lets one workflow fan out across environments while still giving each its own role and backend. Set fail-fast: false so a broken dev does not mask a prod problem:
jobs:
  plan:
    strategy:
      fail-fast: false
      matrix:
        include:
          - env: dev
            role: arn:aws:iam::111111111111:role/gha-terraform-dev
          - env: staging
            role: arn:aws:iam::222222222222:role/gha-terraform-staging
          - env: prod
            role: arn:aws:iam::333333333333:role/gha-terraform-prod
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra/environments/${{ matrix.env }}
For real pipelines, add a paths filter or a change-detection step so a PR touching only dev/ does not run a prod plan. The dorny/paths-filter action is the common choice; for brevity I keep the full matrix here.
3. Caching providers and the plugin cache
Terraform re-downloads providers on every init unless you point it at a shared plugin cache and then persist that cache between runs. Two things have to line up: the TF_PLUGIN_CACHE_DIR environment variable, and a cache key derived from the lockfile so the cache invalidates only when provider versions actually change.
env:
  TF_PLUGIN_CACHE_DIR: ${{ github.workspace }}/.terraform.d/plugin-cache
  TF_IN_AUTOMATION: "true"   # quiets interactive-only hints in CI output
steps:
  - run: mkdir -p ${{ env.TF_PLUGIN_CACHE_DIR }}
  - uses: actions/cache@v4
    with:
      path: ${{ env.TF_PLUGIN_CACHE_DIR }}
      key: tf-plugins-${{ runner.os }}-${{ hashFiles('infra/environments/**/.terraform.lock.hcl') }}
      restore-keys: |
        tf-plugins-${{ runner.os }}-
Two correctness notes that trip people up:
- The plugin cache and .terraform.lock.hcl are partners. Commit the lockfile (with multi-platform hashes via terraform providers lock -platform=linux_amd64 ...) so init validates against pinned checksums instead of silently pulling new versions. The cache speeds up the download; the lockfile guarantees integrity.
- Do not cache the .terraform/ directory itself. It contains backend-initialized state config and absolute symlinks into the plugin cache that do not survive a cache restore cleanly. Cache only TF_PLUGIN_CACHE_DIR.
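The idea behind a lockfile-derived cache key can be sketched in plain shell. This is not how hashFiles is implemented; it only shows the property that matters, namely that the key moves when a lockfile changes and stays put when anything else does. lockfile_cache_key is a hypothetical helper, the lockfile content is a stand-in, and the sketch assumes sha256sum is available.

```shell
# Build a cache key from the contents of every .terraform.lock.hcl under a root.
lockfile_cache_key() {
  # $1 = directory to search, e.g. infra/environments
  local digest
  digest=$(
    find "$1" -type f -name .terraform.lock.hcl | LC_ALL=C sort |
      while IFS= read -r f; do cat "$f"; done |
      sha256sum | cut -d' ' -f1
  )
  echo "tf-plugins-$(uname -s)-${digest}"
}

# Throwaway layout with stand-in file contents (illustrative only).
root=$(mktemp -d)
mkdir -p "$root/dev"
echo 'provider "registry.terraform.io/hashicorp/aws" { version = "5.0.0" }' \
  > "$root/dev/.terraform.lock.hcl"
echo 'resource "null_resource" "a" {}' > "$root/dev/main.tf"

before=$(lockfile_cache_key "$root")
echo '# unrelated edit' >> "$root/dev/main.tf"
after_tf_edit=$(lockfile_cache_key "$root")
echo '# provider bump' >> "$root/dev/.terraform.lock.hcl"
after_lock_edit=$(lockfile_cache_key "$root")

[ "$before" = "$after_tf_edit" ] && echo "main.tf edit: key unchanged"
[ "$before" != "$after_lock_edit" ] && echo "lockfile edit: key changed"
```

The restore-keys prefix in the workflow covers the second case: after a provider bump the exact key misses, but the previous cache is still restored as a starting point.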
If you use hashicorp/setup-terraform, set terraform_wrapper: false when you need to capture raw stdout (the wrapper rewrites output and can interfere with parsing plan text). It is fine to leave the wrapper on for jobs that only need exit codes.
4. Run plan on PRs and render a sticky comment
The PR job does three things in order: init, fmt -check plus validate as fast gates, then plan with a detailed exit code. The -detailed-exitcode flag is the linchpin — it returns 0 for no changes, 2 for a non-empty diff, and 1 for an error, which lets the comment distinguish “nothing to do” from “here is what will change.”
- name: Terraform Init
  run: terraform init -input=false
- name: Format check
  run: terraform fmt -check -recursive
- name: Validate
  run: terraform validate -no-color
- name: Terraform Plan
  id: plan
  run: |
    set +e
    terraform plan -input=false -no-color -lock-timeout=300s \
      -out=tfplan.binary -detailed-exitcode 2>&1 | tee plan.txt
    echo "exitcode=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
  continue-on-error: true
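set +e plus PIPESTATUS[0] is deliberate. With tee at the end of the pipeline, the step would otherwise report tee’s exit status and lose Terraform’s, and under pipefail the 2 from -detailed-exitcode would be read as a failure. We capture the true code and decide what to do with it in the comment step. Because continue-on-error keeps the job green either way, finish the job with a step that fails when the captured code is 1, so a broken plan cannot pass as a clean one.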
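The exit-code capture can be tried locally without Terraform. The sketch below assumes bash (PIPESTATUS is a bash feature); run_and_capture, plan_outcome, and fake_plan are names invented for this example, with fake_plan standing in for terraform plan -detailed-exitcode.

```shell
# Run a command, mirror its output to a log file, and return the command's
# own exit status instead of tee's.
run_and_capture() {
  local logfile=$1; shift
  "$@" 2>&1 | tee "$logfile"
  return "${PIPESTATUS[0]}"
}

# Translate a -detailed-exitcode status into the wording used in the PR comment.
plan_outcome() {
  case "$1" in
    0) echo "no changes" ;;
    2) echo "changes pending" ;;
    *) echo "error" ;;
  esac
}

# Stand-in for a plan that found a diff (hypothetical, for illustration).
fake_plan() {
  echo "Plan: 1 to add, 0 to change, 0 to destroy."
  return 2
}

log=$(mktemp)
code=0
run_and_capture "$log" fake_plan || code=$?
echo "exit code $code -> $(plan_outcome "$code")"
```

Swapping fake_plan for a command that exits 1 shows the error branch, which is the case the final gate step has to turn into a failed job.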
For the comment itself, use a sticky comment so each new push updates one comment instead of spamming the PR. marocchino/sticky-pull-request-comment keys off a header and edits in place. Wrap the plan in a collapsible block and trim it — a 4000-line plan blows past the GitHub comment size limit (~65 KB), so truncate the captured text.
- name: Trim plan output
  if: always()
  run: |
    # Keep the tail; the summary line and resource changes live near the end.
    tail -c 60000 plan.txt > plan.trimmed.txt || cp plan.txt plan.trimmed.txt
- name: Comment plan on PR
  if: github.event_name == 'pull_request'
  uses: marocchino/sticky-pull-request-comment@v2
  with:
    header: terraform-plan-${{ matrix.env }}
    message: |
      ### Terraform Plan: `${{ matrix.env }}`
      Outcome: `${{ steps.plan.outputs.exitcode == '2' && 'changes pending' || steps.plan.outputs.exitcode == '0' && 'no changes' || 'error' }}`
      <details><summary>Show plan</summary>

      ```hcl
      ${{ ... }}
      ```

      </details>
      *Pusher: @${{ github.actor }} | Commit: `${{ github.sha }}`*
GitHub Actions cannot interpolate a multi-line file into a YAML scalar with ${{ }} directly. The clean way is to read the trimmed file in a prior step into a multiline GITHUB_OUTPUT (using a random delimiter), then reference that output. Here is that read step, which you place before the comment and reference as steps.read.outputs.plan:
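- name: Read trimmed plan
  id: read
  run: |
    # Random delimiter, so a line in the plan text can never end the value early.
    delim="TFPLAN_$(openssl rand -hex 8)"
    {
      echo "plan<<${delim}"
      cat plan.trimmed.txt
      echo "${delim}"
    } >> "$GITHUB_OUTPUT"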
Then the comment body uses ${{ steps.read.outputs.plan }} inside the fenced block. One more guard: if you ever accept PRs from forks, do not use pull_request_target with checkout of the head ref and live credentials — that hands a fork your cloud role. Run plans from forks with read-only credentials or behind a manual gate.
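One rough edge in a byte-based trim is that the cut can land in the middle of a line, or in the middle of one of the multi-byte box-drawing characters Terraform prints. A possible refinement, sketched here with a hypothetical trim_plan helper, keeps the tail but drops the first partial line and says that truncation happened.

```shell
# Keep roughly the last MAX bytes of a plan log, starting on a line boundary.
trim_plan() {
  # $1 = input file, $2 = output file, $3 = byte budget (default 60000,
  # leaving headroom under the comment size limit for the notice and markup)
  local in=$1 out=$2 max=${3:-60000} size
  size=$(( $(wc -c < "$in") ))
  if [ "$size" -le "$max" ]; then
    cp "$in" "$out"
  else
    {
      echo "[plan truncated: showing the end of a $size-byte log]"
      # tail -c may begin mid-line; tail -n +2 discards that first fragment.
      tail -c "$max" "$in" | tail -n +2
    } > "$out"
  fi
}

# Demo on generated input with a small budget.
demo=$(mktemp)
seq 1 2000 | sed 's/^/line /' > "$demo"
trim_plan "$demo" "$demo.trimmed" 500
head -n 3 "$demo.trimmed"
```

In the workflow this would replace the tail -c one-liner in the trim step; the read step and the comment step stay as they are.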
5. Save the plan artifact and apply it exactly
This is the step that makes the pipeline trustworthy. The binary plan you produced, tfplan.binary, is a frozen description of the change against a specific state serial. Upload it as an artifact, and on apply, download that same file and run terraform apply tfplan.binary. Terraform refuses to apply a saved plan if the state has changed since the plan was created, so you get a hard guarantee: the apply is the review, or it errors.
# In the plan job (push to main, post-merge):
- name: Upload plan artifact
  uses: actions/upload-artifact@v4
  with:
    name: tfplan-${{ matrix.env }}
    path: infra/environments/${{ matrix.env }}/tfplan.binary
    retention-days: 5
# In the apply job (after checkout and terraform init):
- name: Download plan artifact
  uses: actions/download-artifact@v4
  with:
    name: tfplan-${{ matrix.env }}
    # Artifacts land in the workspace root by default; place the plan next to
    # the root module, where the job's working-directory points.
    path: infra/environments/${{ matrix.env }}
- name: Terraform Apply
  run: terraform apply -input=false -lock-timeout=300s tfplan.binary
Note there are no -var flags on the apply. A saved plan already has every variable value baked in; passing variables to apply <planfile> is an error. This is a feature: it removes the entire class of “the plan used one value and the apply used another.”
The artifact contains the literal planned changes and any sensitive values that appear in the plan. Treat the artifact store as sensitive, keep retention-days short, and restrict who can download workflow artifacts via repository permissions.
6. Environment protection rules and manual approvals
The apply job runs in a GitHub environment, and environments carry protection rules that GitHub enforces before the job’s runner starts. This is where required reviewers and wait timers live — and crucially, the environment is also what the OIDC sub claim binds to, so the cloud trust and the approval gate reinforce each other.
Reference the environment in the job:
jobs:
  apply:
    environment: prod   # gates on environment protection rules
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
Configure the rules under Settings -> Environments -> prod:
| Rule | Setting | Why |
|---|---|---|
| Required reviewers | 1-6 named people or teams | A human approves the apply; the job pauses until then |
| Wait timer | e.g. 5 minutes | A cool-off window to cancel a bad merge |
| Deployment branches | “Selected branches” -> main | Only main can deploy to prod; PR branches cannot |
| Environment secrets | scoped to this env | Secrets here are unreadable from dev/staging jobs |
When the apply job is reached, GitHub posts a “Review pending deployments” prompt and freezes the job. No runner spins up, no OIDC token is minted, and nothing in the cloud happens until a designated reviewer clicks approve. Combine “Deployment branches: main” with the sub claim pinned to :environment:prod and you have defense in depth: even a workflow change cannot reach prod from a feature branch.
7. Concurrency groups to prevent overlapping applies
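Two applies against the same state at the same time is how you corrupt a state file or fight over a lock. GitHub’s concurrency key serializes runs that share a group name. For applies, you want later runs to wait rather than cancel the one in flight: never abort an apply mid-run. Be aware that GitHub keeps at most one pending run per group, so if a third run arrives while one is running and one is waiting, the waiting run is replaced by the newer one.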
concurrency:
  group: terraform-apply-${{ matrix.env }}
  cancel-in-progress: false   # NEVER cancel a running apply
For PR plans, the opposite is correct: an outdated plan is useless, so cancel superseded runs to save minutes:
concurrency:
  group: terraform-plan-${{ github.ref }}
  cancel-in-progress: true
concurrency controls GitHub-side scheduling; Terraform’s backend state lock (a DynamoDB table for the S3 backend, or native locking on newer backends) is the authoritative guard at the cloud layer. Keep both. The -lock-timeout=300s on plan and apply means a run waits up to five minutes for a held lock to be released instead of failing instantly, which is useful when a previous job is still finishing its unlock. A lock left behind by a crashed run does not expire by itself; that case needs terraform force-unlock.
8. Drift detection on a schedule, surfaced as issues
State drifts: someone clicks in the console, an autoscaler changes a tag, a sister pipeline edits a shared resource. Catch it on a cadence with a scheduled plan that asserts “no changes” and opens a GitHub issue when that assertion fails.
name: Drift detection
on:
  schedule:
    - cron: "0 6 * * 1-5"   # 06:00 UTC, weekdays
  workflow_dispatch: {}     # allow manual runs
permissions:
  id-token: write
  contents: read
  issues: write             # required to open/update issues
jobs:
  drift:
    strategy:
      fail-fast: false
      matrix:
        env: [dev, staging, prod]
    runs-on: ubuntu-latest
    environment: ${{ matrix.env }}
    defaults:
      run:
        working-directory: infra/environments/${{ matrix.env }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/gha-terraform-${{ matrix.env }}
          aws-region: eu-west-1
      - run: terraform init -input=false
      - name: Detect drift
        id: drift
        run: |
          set +e
          terraform plan -input=false -no-color -lock-timeout=120s -detailed-exitcode
          echo "exitcode=$?" >> "$GITHUB_OUTPUT"
      - name: Open or update drift issue
        if: steps.drift.outputs.exitcode == '2'
        uses: actions/github-script@v7
        with:
          script: |
            const env = '${{ matrix.env }}';
            const title = `Drift detected in ${env}`;
            const body = `\`terraform plan\` returned changes for **${env}** at ${new Date().toISOString()}.\n` +
              `Run: ${context.serverUrl}/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`;
            const existing = await github.rest.issues.listForRepo({
              owner: context.repo.owner, repo: context.repo.repo,
              state: 'open', labels: 'drift'
            });
            const match = existing.data.find(i => i.title === title);
            if (match) {
              await github.rest.issues.createComment({
                owner: context.repo.owner, repo: context.repo.repo,
                issue_number: match.number, body
              });
            } else {
              await github.rest.issues.create({
                owner: context.repo.owner, repo: context.repo.repo,
                title, body, labels: ['drift']
              });
            }
      # Exit code 1 means the plan itself broke; surface it as a failed job.
      - name: Fail on plan error
        if: steps.drift.outputs.exitcode == '1'
        run: exit 1
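Reading the exit code: 2 means drift (a non-empty plan), 0 means clean, and 1 means the run itself broke, which you also want to know about, so make sure the job fails on it rather than finishing green. Deduplicating on issue title means a persistent drift accrues comments on one issue instead of opening a new one every weekday morning. Create the drift label once in the repo so labels: ['drift'] resolves. One interaction to plan for: the job declares environment: ${{ matrix.env }} so that its sub claim matches each role’s trust policy, which means any required reviewers or wait timer on prod will also hold the scheduled run. A separate environment with a read-only role for drift checks is one way around that.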
Verify
Confirm each layer independently before you trust the pipeline end to end.
# 1. OIDC trust: decode the role's trust policy and confirm the sub is pinned,
# not wildcarded.
aws iam get-role --role-name gha-terraform-prod \
--query 'Role.AssumeRolePolicyDocument' --output json | jq .
# 2. No long-lived keys remain in repo secrets (should NOT list AWS_ACCESS_KEY_ID).
gh secret list --repo my-org/infra
# 3. Saved-plan apply guarantee: prove a stale plan is rejected.
# Plan, then mutate state out-of-band, then apply the OLD plan -> must error.
terraform plan -out=tfplan.binary
terraform apply -auto-approve # makes a change, bumps state serial
terraform apply tfplan.binary # EXPECT: "Saved plan is stale" error
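# 4. Plugin cache hit: run init twice, clearing .terraform/ in between; the
#    second run should report providers taken from the cache, not re-downloaded.
mkdir -p "$PWD/.cache"   # Terraform does not create the cache directory itself
TF_PLUGIN_CACHE_DIR=$PWD/.cache terraform init
rm -rf .terraform        # otherwise the second init just reuses the local install
TF_PLUGIN_CACHE_DIR=$PWD/.cache terraform init   # look for "from the shared cache directory"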
In the GitHub UI, confirm the behavioral gates:
- Open a PR that changes infra and check that exactly one sticky plan comment appears per environment and updates (does not duplicate) on a second push.
- Merge it and watch the apply job pause on “Review pending deployments” until a required reviewer approves.
- Trigger two applies to the same environment back to back; the second must queue (not cancel) behind the first.
- Run the drift workflow via workflow_dispatch against an environment you know is clean (exit 0, no issue) and one you have deliberately drifted (exit 2, issue opened).
Checklist
A pipeline built this way has no key to steal, no plan to second-guess, and no door to prod that a feature branch can open. The plan a reviewer reads is the plan that applies, an approval stands between merge and mutation, and anything that drifts overnight is waiting in your issue tracker by the time you open your laptop.