A Production Terraform CI/CD Pipeline on GitHub Actions with OIDC

Most Terraform pipelines die one of two deaths. The first is the long-lived cloud access key baked into a repository secret, rotated never, scoped to everything, and one log leak away from owning your account. The second is the apply that runs against a plan nobody saw, because plan and apply were two independent jobs that each re-planned, and the world moved in between. This guide builds a pipeline that closes both: GitHub’s OpenID Connect (OIDC) issuer hands the workflow a short-lived cloud role with no stored secret, the plan is rendered as a sticky PR comment a human can read, and the exact binary plan is saved as an artifact and replayed on apply so what you reviewed is what you ship.

The examples target AWS because the role-assumption story is the most explicit there, but the OIDC pattern is identical on Azure (azure/login with federated credentials) and GCP (Workload Identity Federation). The pipeline mechanics are cloud-agnostic.

1. Keyless cloud auth with GitHub OIDC

Every GitHub Actions run can request a signed JSON Web Token from GitHub’s OIDC provider, https://token.actions.githubusercontent.com. Your cloud trusts that issuer, validates the token’s claims (which repo, which branch, which environment), and returns short-lived credentials. No AWS_ACCESS_KEY_ID ever touches a secret store.
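To see which claims a run actually presents, it can help to decode the token's payload in a throwaway debug step. The following is a minimal sketch, not part of the pipeline: jwt_payload is a hypothetical helper name, and the commented usage assumes the ACTIONS_ID_TOKEN_REQUEST_URL and ACTIONS_ID_TOKEN_REQUEST_TOKEN variables that GitHub exposes to jobs with id-token: write. It only decodes the payload; it does not verify the signature.

```bash
#!/usr/bin/env bash
# jwt_payload TOKEN: print the decoded JSON payload (middle segment) of a JWT.
# Hypothetical helper for debugging only; it does NOT verify the signature.
jwt_payload() {
  local payload
  payload="${1#*.}"          # strip the header segment
  payload="${payload%%.*}"   # strip the signature segment
  # base64url -> base64: swap the URL-safe alphabet back, then restore padding.
  payload="$(printf '%s' "$payload" | tr '_-' '/+')"
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  printf '%s' "$payload" | base64 -d
}

# Usage inside a job with id-token: write (assumed debug step):
#   resp="$(curl -sS -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
#     "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=sts.amazonaws.com")"
#   jwt_payload "$(printf '%s' "$resp" | jq -r .value)" | jq '{sub, aud, iss}'
```

Print only the selected claims, never the raw token, since the token itself is a bearer credential for as long as it is valid.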

On AWS this is a two-part setup: an IAM OIDC identity provider, and a role whose trust policy pins the sub claim. Here is the trust policy. The sub condition is the entire security boundary, so be precise.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:my-org/infra:environment:prod"
        }
      }
    }
  ]
}

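Scope the sub to an environment (repo:my-org/infra:environment:prod), not to a wildcard like repo:my-org/infra:*. A wildcard subject means any workflow run in the repository, on any branch, tag, or pull request, can assume your production role. Binding to a GitHub environment lets you layer reviewer approval (Step 6) on top of the cloud trust. Use a separate role per environment with a sub that matches that environment exactly. Keep in mind that a job presents the environment subject only if it declares environment: prod; a pull-request job without an environment presents repo:my-org/infra:pull_request instead, so PR plans need their own (ideally read-only) role whose trust policy matches that subject.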
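To make the difference concrete, the sketch below approximates how a StringLike condition treats the sub claim, using shell glob matching as a stand-in for IAM's * wildcard. sub_matches is a hypothetical helper, not an AWS API, and the sample subjects follow the sub formats GitHub documents for environments, branch refs, and pull requests.

```bash
#!/usr/bin/env bash
# sub_matches PATTERN SUBJECT: succeed if SUBJECT matches PATTERN.
# Approximation only: shell globbing stands in for IAM StringLike ('*', '?').
sub_matches() {
  # $1 is deliberately unquoted in the case pattern so '*' acts as a wildcard.
  case "$2" in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}

pinned='repo:my-org/infra:environment:prod'
wildcard='repo:my-org/infra:*'

for sub in \
  'repo:my-org/infra:environment:prod' \
  'repo:my-org/infra:ref:refs/heads/feature-x' \
  'repo:my-org/infra:pull_request'
do
  sub_matches "$pinned" "$sub"   && p=allow || p=deny
  sub_matches "$wildcard" "$sub" && w=allow || w=deny
  printf '%-45s pinned=%s wildcard=%s\n' "$sub" "$p" "$w"
done
```

Only the first subject passes the pinned pattern, while all three pass the wildcard, which is exactly the exposure the pinned sub is meant to remove.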

You can create the OIDC provider once with the CLI. AWS no longer requires a thumbprint for this provider (it validates the issuer’s certificate against its own trusted root CAs), so --thumbprint-list can be omitted on recent CLI versions; passing one is still accepted:

aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com

In the workflow, you request the token by granting id-token: write permission and then exchanging it. The official aws-actions/configure-aws-credentials action does the AssumeRoleWithWebIdentity call for you:

permissions:
  id-token: write   # REQUIRED to mint the OIDC token
  contents: read

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-terraform-prod
          aws-region: eu-west-1
          role-session-name: gha-${{ github.run_id }}

The credentials default to a one-hour session, live only in the job’s environment, and expire on their own. There is no stored secret to rotate or leak.

2. Repository structure and a matrix for multiple environments

Keep each environment’s root module separate so its state, variables, and backend are unambiguous. A layout that scales:

infra/
  modules/                # reusable, versioned modules
    network/
    eks/
  environments/
    dev/
      main.tf
      backend.tf          # backend "s3" { key = "dev/terraform.tfstate" }
      dev.auto.tfvars
    staging/
    prod/
  .github/workflows/
    terraform-plan.yml
    terraform-apply.yml
    drift.yml

The matrix lets one workflow fan out across environments while still giving each its own role and backend. Set fail-fast: false so a failure in dev does not cancel the staging and prod plans and hide a prod problem:

jobs:
  plan:
    strategy:
      fail-fast: false
      matrix:
        include:
          - env: dev
            role: arn:aws:iam::111111111111:role/gha-terraform-dev
          - env: staging
            role: arn:aws:iam::222222222222:role/gha-terraform-staging
          - env: prod
            role: arn:aws:iam::333333333333:role/gha-terraform-prod
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra/environments/${{ matrix.env }}

For real pipelines, add a paths filter or a change-detection step so a PR touching only dev/ does not run a prod plan. The dorny/paths-filter action is the common choice; for brevity I keep the full matrix here.

3. Caching providers and the plugin cache

Terraform re-downloads providers on every init unless you point it at a shared plugin cache and then persist that cache between runs. Two things have to line up: the TF_PLUGIN_CACHE_DIR environment variable, and a cache key derived from the lockfile so the cache invalidates only when provider versions actually change.

env:
  TF_PLUGIN_CACHE_DIR: ${{ github.workspace }}/.terraform.d/plugin-cache
  TF_IN_AUTOMATION: "true"      # quiets interactive-only hints in CI output

steps:
  - run: mkdir -p ${{ env.TF_PLUGIN_CACHE_DIR }}

  - uses: actions/cache@v4
    with:
      path: ${{ env.TF_PLUGIN_CACHE_DIR }}
      key: tf-plugins-${{ runner.os }}-${{ hashFiles('infra/environments/**/.terraform.lock.hcl') }}
      restore-keys: |
        tf-plugins-${{ runner.os }}-
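The property that matters is that the key is a pure function of the lockfiles. The sketch below is a local analogue of that derivation, not the algorithm hashFiles uses internally: lock_cache_key is a hypothetical helper that hashes every .terraform.lock.hcl under a directory in sorted order, and it assumes sha256sum from GNU coreutils is available.

```bash
#!/usr/bin/env bash
# lock_cache_key ROOT: print a cache key derived only from the Terraform
# lockfiles under ROOT. Hypothetical helper; assumes GNU coreutils sha256sum.
lock_cache_key() {
  local root="$1" digest
  digest="$(
    find "$root" -name .terraform.lock.hcl -type f -print \
      | LC_ALL=C sort \
      | while IFS= read -r f; do sha256sum < "$f"; done \
      | sha256sum \
      | cut -d' ' -f1
  )"
  printf 'tf-plugins-%s-%s\n' "$(uname -s)" "$digest"
}

# Usage (path taken from the layout in Step 2):
#   lock_cache_key infra/environments
```

Editing main.tf leaves the key untouched, while a provider upgrade that rewrites any lockfile produces a new key and therefore a fresh cache entry.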

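One correctness note that trips people up:

If you use hashicorp/setup-terraform, set terraform_wrapper: false when you need to capture raw stdout or the real exit status. The wrapper rewrites output, which can interfere with parsing plan text, and it reports the exit code through its own step outputs, so it may mask the shell-level status that the PIPESTATUS capture in Step 4 depends on. Leave the wrapper on only for jobs that read its stdout and exitcode outputs instead.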

4. Run plan on PRs and render a sticky comment

The PR job does three things in order: init, fmt -check plus validate as fast gates, then plan with a detailed exit code. The -detailed-exitcode flag is the linchpin — it returns 0 for no changes, 2 for a non-empty diff, and 1 for an error, which lets the comment distinguish “nothing to do” from “here is what will change.”

- name: Terraform Init
  run: terraform init -input=false

- name: Format check
  run: terraform fmt -check -recursive

- name: Validate
  run: terraform validate -no-color

- name: Terraform Plan
  id: plan
  run: |
    set +e
    terraform plan -input=false -no-color -lock-timeout=300s \
      -out=tfplan.binary -detailed-exitcode 2>&1 | tee plan.txt
    echo "exitcode=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
  continue-on-error: true

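set +e plus PIPESTATUS[0] is deliberate. A pipeline reports the status of its last command, so tee would otherwise mask Terraform’s real exit code; and if the step runs under shell: bash, which enables pipefail, -detailed-exitcode’s 2 would be read as a failure. We capture the true code and decide what to do with it in the comment step. Because the step now always succeeds, an exit code of 1 has to be turned back into a failure by a later step, after the comment is posted.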
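The same capture-then-decide logic can be tried locally without Terraform. In the sketch below, fake_plan is a stand-in that mimics a plan exiting with 2, and run_logged, plan_outcome, and plan_gate are hypothetical helper names; the final gate shows one way to fail the job on a real error once the comment has been posted. It requires bash, since PIPESTATUS is a bash array.

```bash
#!/usr/bin/env bash
# run_logged LOGFILE CMD [ARGS...]: run CMD, tee its output to LOGFILE, and
# return CMD's own exit status rather than tee's.
run_logged() {
  local log="$1"
  shift
  "$@" 2>&1 | tee "$log"
  return "${PIPESTATUS[0]}"
}

# plan_outcome CODE: map a -detailed-exitcode status to a label.
plan_outcome() {
  case "$1" in
    0) echo "no changes" ;;
    2) echo "changes pending" ;;
    *) echo "error" ;;
  esac
}

# plan_gate CODE: succeed for 0 or 2, fail for anything else.
plan_gate() {
  [ "$1" -eq 0 ] || [ "$1" -eq 2 ]
}

# Stand-in for `terraform plan -detailed-exitcode` reporting a diff.
fake_plan() {
  echo "Plan: 1 to add, 0 to change, 0 to destroy."
  return 2
}

log="$(mktemp)"
code=0
run_logged "$log" fake_plan || code=$?
echo "exitcode=${code} outcome=$(plan_outcome "$code")"
plan_gate "$code" || exit 1
```

In the workflow, the equivalent of plan_gate would be a final step that reads steps.plan.outputs.exitcode and exits non-zero when it is neither 0 nor 2.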

For the comment itself, use a sticky comment so each new push updates one comment instead of spamming the PR. marocchino/sticky-pull-request-comment keys off a header and edits in place. Wrap the plan in a collapsible block and trim it: a 4000-line plan blows past GitHub’s 65,536-character comment limit, so truncate the captured text.

- name: Trim plan output
  if: always()
  run: |
    # Keep the tail; the summary line and resource changes live near the end.
    tail -c 60000 plan.txt > plan.trimmed.txt || cp plan.txt plan.trimmed.txt

- name: Comment plan on PR
  if: github.event_name == 'pull_request'
  uses: marocchino/sticky-pull-request-comment@v2
  with:
    header: terraform-plan-${{ matrix.env }}
    message: |
      ### Terraform Plan: `${{ matrix.env }}`

      Outcome: `${{ steps.plan.outputs.exitcode == '2' && 'changes pending' || steps.plan.outputs.exitcode == '0' && 'no changes' || 'error' }}`

      <details><summary>Show plan</summary>

      ```hcl
      ${{ steps.read.outputs.plan }}
      ```
      </details>

      *Pusher: @${{ github.actor }} | Commit: `${{ github.sha }}`*
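The plain tail -c trim silently drops the top of a long plan. A small variant, sketched below, prepends a notice when truncation actually happened so reviewers know the comment is partial. trim_plan and the notice wording are hypothetical; only the standard wc -c and tail -c utilities are assumed.

```bash
#!/usr/bin/env bash
# trim_plan FILE MAX_BYTES: print FILE unchanged if it fits in MAX_BYTES,
# otherwise print a one-line notice followed by the last MAX_BYTES bytes.
trim_plan() {
  local file="$1" max="$2" size
  size="$(wc -c < "$file" | tr -d '[:space:]')"
  if [ "$size" -le "$max" ]; then
    cat "$file"
  else
    printf '[plan truncated: last %s of %s bytes shown]\n' "$max" "$size"
    tail -c "$max" "$file"
  fi
}

# Usage in the trim step:
#   trim_plan plan.txt 60000 > plan.trimmed.txt
```

Keeping the byte budget at 60000 leaves headroom under the comment limit for the notice and the surrounding Markdown.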

GitHub Actions cannot interpolate a multi-line file into a YAML scalar with ${{ }} directly. The clean way is to read the trimmed file in a prior step into a multiline GITHUB_OUTPUT (using a random delimiter), then reference that output. Here is that read step, which you place before the comment and reference as steps.read.outputs.plan:

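- name: Read trimmed plan
  id: read
  run: |
    # Random delimiter so plan text can never terminate the multiline value early.
    delim="TFPLAN_$(dd if=/dev/urandom bs=15 count=1 status=none | base64 | tr -d '/+=')"
    {
      echo "plan<<${delim}"
      cat plan.trimmed.txt
      echo                 # guarantee the delimiter starts on its own line
      echo "${delim}"
    } >> "$GITHUB_OUTPUT"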

Then the comment body uses ${{ steps.read.outputs.plan }} inside the fenced block. One more guard: if you ever accept PRs from forks, do not use pull_request_target with checkout of the head ref and live credentials — that hands a fork your cloud role. Run plans from forks with read-only credentials or behind a manual gate.

5. Save the plan artifact and apply it exactly

This is the step that makes the pipeline trustworthy. The binary plan you produced, tfplan.binary, is a frozen description of the change against a specific state snapshot. Upload it as an artifact, and on apply, download that same file and run terraform apply tfplan.binary. Terraform refuses to apply a saved plan if the state has changed since the plan was created, so you get a hard guarantee: the apply is the review, or it errors.

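# In the plan job (push to main, post-merge):
- name: Upload plan artifact
  uses: actions/upload-artifact@v4
  with:
    name: tfplan-${{ matrix.env }}
    path: infra/environments/${{ matrix.env }}/tfplan.binary
    retention-days: 5
    if-no-files-found: error

# In the apply job (same workflow run, same commit checked out,
# same defaults.run.working-directory as the plan job):
- name: Download plan artifact
  uses: actions/download-artifact@v4
  with:
    name: tfplan-${{ matrix.env }}
    # Without path, the file lands in the workspace root, not in the
    # environment directory where the run steps execute.
    path: infra/environments/${{ matrix.env }}

- name: Terraform Init
  run: terraform init -input=false

- name: Terraform Apply
  run: terraform apply -input=false -lock-timeout=300s tfplan.binary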

Note there are no -var flags on the apply. A saved plan already has every variable value baked in; passing variables to apply <planfile> is an error. This is a feature: it removes the entire class of “the plan used one value and the apply used another.”

The artifact contains the literal planned changes and any sensitive values that appear in the plan. Treat the artifact store as sensitive, keep retention-days short, and restrict who can download workflow artifacts via repository permissions.

6. Environment protection rules and manual approvals

The apply job runs in a GitHub environment, and environments carry protection rules that GitHub enforces before the job’s runner starts. This is where required reviewers and wait timers live — and crucially, the environment is also what the OIDC sub claim binds to, so the cloud trust and the approval gate reinforce each other.

Reference the environment in the job:

jobs:
  apply:
    environment: prod          # gates on environment protection rules
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read

Configure the rules under Settings -> Environments -> prod:

| Rule | Setting | Why |
| --- | --- | --- |
| Required reviewers | 1-6 named people or teams | A human approves the apply; the job pauses until then |
| Wait timer | e.g. 5 minutes | A cool-off window to cancel a bad merge |
| Deployment branches | “Selected branches” -> main | Only main can deploy to prod; PR branches cannot |
| Environment secrets | scoped to this env | Secrets here are unreadable from dev/staging jobs |

When the apply job is reached, GitHub posts a “Review pending deployments” prompt and freezes the job. No runner spins up, no OIDC token is minted, and nothing in the cloud happens until a designated reviewer clicks approve. Combine “Deployment branches: main” with the sub claim pinned to :environment:prod and you have defense in depth: even a workflow change cannot reach prod from a feature branch.

7. Concurrency groups to prevent overlapping applies

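Two applies against the same state at the same time is how you corrupt a state file or fight over a lock. GitHub’s concurrency key serializes runs that share a group name. For applies, you want to queue, not cancel: never abort an apply mid-flight. Set it at the job level so the group can use matrix.env, and note that GitHub holds at most one pending run per group, so a newer queued apply replaces an older one that is still waiting.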

concurrency:
  group: terraform-apply-${{ matrix.env }}
  cancel-in-progress: false     # NEVER cancel a running apply

For PR plans, the opposite is correct: an outdated plan is useless, so cancel superseded runs to save minutes:

concurrency:
  group: terraform-plan-${{ github.ref }}
  cancel-in-progress: true

concurrency controls GitHub-side scheduling; Terraform’s backend state lock (a DynamoDB table for the S3 backend, or the backend’s native lockfile via use_lockfile on Terraform 1.10 and later) is the authoritative guard at the cloud layer. Keep both. The -lock-timeout=300s on plan and apply means a run will wait up to five minutes for a held lock to be released instead of failing instantly, which is useful when a previous job is finishing its unlock. It will not clear a lock abandoned by a crashed run; that still needs terraform force-unlock.

8. Drift detection on a schedule, surfaced as issues

State drifts: someone clicks in the console, an autoscaler changes a tag, a sister pipeline edits a shared resource. Catch it on a cadence with a scheduled plan that asserts “no changes” and opens a GitHub issue when that assertion fails.

name: Drift detection
on:
  schedule:
    - cron: "0 6 * * 1-5"   # 06:00 UTC, weekdays
  workflow_dispatch: {}       # allow manual runs

permissions:
  id-token: write
  contents: read
  issues: write               # required to open/update issues

jobs:
  drift:
    strategy:
      fail-fast: false
      matrix:
        env: [dev, staging, prod]
    runs-on: ubuntu-latest
    environment: ${{ matrix.env }}
    defaults:
      run:
        working-directory: infra/environments/${{ matrix.env }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/gha-terraform-${{ matrix.env }}
          aws-region: eu-west-1
      - run: terraform init -input=false

      - name: Detect drift
        id: drift
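        run: |
          set +e
          terraform plan -input=false -no-color -lock-timeout=120s -detailed-exitcode
          code=$?
          echo "exitcode=${code}" >> "$GITHUB_OUTPUT"
          # 0 (clean) and 2 (drift) pass; anything else is a broken run and fails the step.
          if [ "$code" -ne 0 ] && [ "$code" -ne 2 ]; then exit "$code"; fi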

      - name: Open or update drift issue
        if: steps.drift.outputs.exitcode == '2'
        uses: actions/github-script@v7
        with:
          script: |
            const env = '${{ matrix.env }}';
            const title = `Drift detected in ${env}`;
            const body = `\`terraform plan\` returned changes for **${env}** at ${new Date().toISOString()}.\n` +
                         `Run: ${context.serverUrl}/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`;
            const existing = await github.rest.issues.listForRepo({
              owner: context.repo.owner, repo: context.repo.repo,
              state: 'open', labels: 'drift'
            });
            const match = existing.data.find(i => i.title === title);
            if (match) {
              await github.rest.issues.createComment({
                owner: context.repo.owner, repo: context.repo.repo,
                issue_number: match.number, body
              });
            } else {
              await github.rest.issues.create({
                owner: context.repo.owner, repo: context.repo.repo,
                title, body, labels: ['drift']
              });
            }

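Reading the exit code: 2 means drift (a non-empty plan), 0 means clean, 1 means the run itself broke (which you also want to know about). Deduplicating on issue title means a persistent drift accrues comments on one issue instead of opening a new one every weekday morning. Create the drift label once in the repo so labels: ['drift'] resolves. Because the job declares environment: ${{ matrix.env }} so that its OIDC subject matches each role, the prod leg is also subject to that environment’s protection rules; if a daily approval prompt is unwanted, give drift detection its own read-only role and environment.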

Verify

Confirm each layer independently before you trust the pipeline end to end.

# 1. OIDC trust: decode the role's trust policy and confirm the sub is pinned,
#    not wildcarded.
aws iam get-role --role-name gha-terraform-prod \
  --query 'Role.AssumeRolePolicyDocument' --output json | jq .

# 2. No long-lived keys remain in repo secrets (should NOT list AWS_ACCESS_KEY_ID).
gh secret list --repo my-org/infra

# 3. Saved-plan apply guarantee: prove a stale plan is rejected.
#    Plan, then mutate state out-of-band, then apply the OLD plan -> must error.
terraform plan -out=tfplan.binary
terraform apply -auto-approve            # makes a change, bumps state serial
terraform apply tfplan.binary            # EXPECT: "Saved plan is stale" error

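# 4. Plugin cache hit: run init twice from a clean working directory; the second
#    run should report providers taken from the cache, not re-downloaded.
mkdir -p "$PWD/.cache"                   # Terraform will not create the cache dir itself
TF_PLUGIN_CACHE_DIR=$PWD/.cache terraform init
rm -rf .terraform                        # force providers to be installed again
TF_PLUGIN_CACHE_DIR=$PWD/.cache terraform init   # look for "from the shared cache directory"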

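In the GitHub UI, confirm the behavioral gates: a merge to main should pause the apply job at “Review pending deployments” until a designated reviewer approves, a run from a feature branch should be refused by the prod environment’s deployment-branch rule, and a second apply for the same environment should wait behind the running one rather than cancel it.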

A pipeline built this way has no key to steal, no plan to second-guess, and no door to prod that a feature branch can open. The plan a reviewer reads is the plan that applies, an approval stands between merge and mutation, and anything that drifts overnight is waiting in your issue tracker by the time you open your laptop.
