Scaling Terragrunt Monorepos with Dependency Graphs and run-all

A two-account Terragrunt repo is easy. Forty accounts across three regions, with two hundred units whose apply order is dictated by a dependency graph nobody can hold in their head, is a different machine. The wrapper that removed your backend.tf duplication is now the thing standing between a one-line PR and a thirty-minute serialized apply that fails on unit 137 and leaves the run half-applied.

This article is about that machine: how Terragrunt builds the dependency DAG, how run --all and run --graph traverse it, how to make plans survive a greenfield where downstream outputs do not yet exist, and — the part that actually matters at scale — how to run only the units a change touched, in parallel, safely, in CI. It assumes you already know include, remote_state generation, and the live/modules split. If you do not, start with the DRY multi-environment article and come back.

A note on CLI syntax. Terragrunt v0.88.0 redesigned the CLI. terragrunt run-all apply is now terragrunt run --all apply; graph-dependencies is now dag graph; --terragrunt-include-dir is now --queue-include-dir; --terragrunt-non-interactive is now --non-interactive. The legacy forms still work as deprecated aliases, so older pipelines will not break, but every command below uses the current syntax. If your CI logs warn about deprecated commands, that is what they mean.

1. Structure the live tree so the path is the identity

The hierarchy is account / region / environment / component, and it is not cosmetic. Terragrunt derives the state key from path_relative_to_include(), the provider role from the account file you are standing under, and the DAG from the config_path references between sibling directories. The directory layout is the deployment topology.

infra/
  modules/                         # versioned, reusable TF/OpenTofu modules
  live/
    root.hcl                       # backend + provider generation, version pins
    _envcommon/                    # per-component config shared across all envs
      network.hcl
      eks.hcl
    prod/
      account.hcl                  # account_id, account_name
      us-east-1/
        region.hcl                 # aws_region
        platform/                  # the "environment" layer
          network/
            terragrunt.hcl
          eks/
            terragrunt.hcl
          rds/
            terragrunt.hcl
      eu-west-1/
        region.hcl
        platform/
          network/
          eks/
    staging/
      account.hcl
      us-east-1/ ...
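
For orientation, the account.hcl and region.hcl files in the tree can be as small as one locals block each. The sketch below assumes the prod account ID that appears elsewhere in this article and a hypothetical account_name of "prod"; the keys are chosen to match what the root config reads from them.

```hcl
# live/prod/account.hcl
locals {
  account_id   = "222222222222" # prod account ID used elsewhere in this article
  account_name = "prod"         # assumed value; substitute your own naming
}
```

```hcl
# live/prod/us-east-1/region.hcl
locals {
  aws_region = "us-east-1"
}
```

Because each file sits at exactly one level of the tree, a unit picks up the right account and region purely from where it lives; nothing in the unit itself names either.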

Two structural rules pay off at scale:

2. Keep the root DRY with locals and read_terragrunt_config

The root config is read by every unit, so it carries the expensive-to-repeat facts exactly once: backend, provider, and the version pins that keep a 200-unit run reproducible.

# live/root.hcl
locals {
  account_vars = read_terragrunt_config(find_in_parent_folders("account.hcl"))
  region_vars  = read_terragrunt_config(find_in_parent_folders("region.hcl"))

  account_id = local.account_vars.locals.account_id
  aws_region = local.region_vars.locals.aws_region
}

# Pin the toolchain. A 200-unit run is only reproducible if every unit
# runs the same OpenTofu and Terragrunt versions.
terraform_version_constraint  = ">= 1.9.0, < 2.0.0"
terragrunt_version_constraint = ">= 0.88.0"

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket       = "acme-tfstate-${local.account_id}"
    key          = "${path_relative_to_include()}/terraform.tfstate"
    region       = local.aws_region
    encrypt      = true
    use_lockfile = true            # S3-native locking (needs OpenTofu/Terraform 1.10+); no DynamoDB table needed
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    provider "aws" {
      region = "${local.aws_region}"
      assume_role {
        role_arn = "arn:aws:iam::${local.account_id}:role/terraform-exec"
      }
      default_tags {
        tags = { ManagedBy = "terragrunt", Account = "${local.account_vars.locals.account_name}" }
      }
    }
  EOF
}

terraform_version_constraint makes Terragrunt fail fast if the binary on the runner drifts from the pinned range, instead of letting a newer OpenTofu silently rewrite your state format mid-run. Pin the Terragrunt binary itself in CI tooling (a mise/asdf .tool-versions file or a container tag) — the terragrunt_version_constraint attribute is a guardrail, not a version manager.

3. The dependency graph: dependency vs dependencies

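Two blocks build the DAG, and the distinction is load-bearing. A dependency block adds an edge to one unit and exposes that unit's outputs to the current config. A dependencies block adds edges to a list of paths and carries no data at all.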

You almost always want dependency — if a unit needs ordering, it usually needs an output too. Reach for dependencies only for pure sequencing with no data (for example, “do not touch the app tier until the IAM bootstrap unit has run”).

# live/prod/us-east-1/platform/eks/terragrunt.hcl
include "root" {
  path = find_in_parent_folders("root.hcl")
}
include "envcommon" {
  path   = "${dirname(find_in_parent_folders("root.hcl"))}/_envcommon/eks.hcl"
  expose = true                       # make its locals readable here
}

dependency "network" {
  config_path = "../network"

  mock_outputs = {
    vpc_id             = "vpc-mock00000000000"
    private_subnet_ids = ["subnet-mock1", "subnet-mock2", "subnet-mock3"]
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan", "init"]
  mock_outputs_merge_strategy_with_state  = "shallow"
}

# Pure ordering, no data: wait for the org-wide IAM baseline.
dependencies {
  paths = ["../../../_baseline/iam"]
}

inputs = {
  cluster_name = "prod-platform"
  vpc_id       = dependency.network.outputs.vpc_id
  subnet_ids   = dependency.network.outputs.private_subnet_ids
}
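
The include "envcommon" block above points at a shared per-component file. A minimal sketch of what that file might contain follows; the module URL, the local named base_source_url, and the cluster_version input are all hypothetical and only illustrate the shape.

```hcl
# live/_envcommon/eks.hcl (illustrative sketch)
locals {
  # Hypothetical module repository; replace with your own.
  # Deliberately has no ?ref= so each environment can pin its own version.
  base_source_url = "git::https://example.com/acme/infra-modules.git//eks"
}

# Defaults shared by every environment. A unit's own inputs are merged
# over these, so the unit wins on conflicting keys.
inputs = {
  cluster_version = "1.31" # hypothetical module variable
}
```

With expose = true on the include, the unit can read these locals as include.envcommon.locals.<name>, which is what keeps per-environment differences down to a few lines in the leaf terragrunt.hcl.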

Terragrunt reads every terragrunt.hcl under the run root, resolves each config_path, and assembles a directed acyclic graph. For plan/apply it walks the graph so dependencies run before dependents; for destroy it walks it in reverse, tearing down dependents first. You never write the order. A cycle (A depends on B depends on A) is a hard error at graph-construction time, which is exactly when you want to find it.

4. Mock outputs: surviving plan-time and greenfield applies

This is the single most misunderstood mechanic in Terragrunt, and the one that breaks naive CI.

When you plan the EKS unit but network has never been applied, network has no outputs. Reading dependency.network.outputs.vpc_id would fail and abort the plan — so on a fresh repo you could never produce a full-stack plan. mock_outputs supplies placeholder values so plan, validate, and init proceed against fakes.

The mock_outputs_allowed_terraform_commands allowlist is the safety interlock. It must exclude apply and destroy. With the allowlist above, an apply that cannot find real outputs will fail rather than feed a fake subnet ID into a real cluster. That failure is correct: it means you tried to apply a dependent before its dependency, and Terragrunt’s run --all ordering exists precisely so you never hit it in practice.

mock_outputs_merge_strategy_with_state controls what happens once the dependency has partial real state — common when you add a new output to an already-applied module:

| Strategy | Behavior |
|---|---|
| no_merge (default) | If real state exists, use it as-is and ignore mocks. A newly-added output that the applied state lacks will be missing, failing the plan. |
| shallow | Real outputs win; mocks fill only top-level keys the state does not yet have. The usual choice. |
| deep_map_only | Like shallow, but recurses into map-typed outputs, filling absent keys inside maps. |

Use shallow as your default. The failure mode of no_merge — add an output to a module, and every downstream plan breaks until you re-apply the dependency first — is a needless ordering constraint on a read-only operation. shallow lets the plan proceed on a mock for the one new key while using real values for everything else.
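
A sketch of that situation, assuming the network module has just gained a hypothetical output named nat_gateway_ids that the applied state does not contain yet:

```hcl
dependency "network" {
  config_path = "../network"

  mock_outputs = {
    vpc_id             = "vpc-mock00000000000"
    private_subnet_ids = ["subnet-mock1", "subnet-mock2", "subnet-mock3"]
    nat_gateway_ids    = ["nat-mock1"] # hypothetical new output, absent from applied state
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan", "init"]
  mock_outputs_merge_strategy_with_state  = "shallow"
}

inputs = {
  vpc_id          = dependency.network.outputs.vpc_id          # real value from state
  nat_gateway_ids = dependency.network.outputs.nat_gateway_ids # mock until network is re-applied
}
```

On plan, vpc_id resolves from real state and only nat_gateway_ids falls back to the mock. On apply the allowlist blocks mocks, so the unit still cannot be applied until network has been re-applied and exposes the real value.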

A separate knob, skip_outputs = true, tells Terragrunt to never call terragrunt output on the dependency (it still enforces ordering). Do not combine it with mock_outputs expecting “mocks only when real outputs are absent”: skip_outputs means “always mock,” mock_outputs means “mock only as a fallback.” They answer different questions.
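
To make the difference concrete, here is a sketch of a dependency that sets skip_outputs alongside mock_outputs. The values are placeholders; the point is only which source the outputs come from.

```hcl
dependency "network" {
  config_path  = "../network"
  skip_outputs = true # never run `terragrunt output` against ../network

  # With skip_outputs set, these values are what dependency.network.outputs
  # returns, even after ../network has been applied and real outputs exist.
  mock_outputs = {
    vpc_id = "vpc-mock00000000000"
  }
}
```

If the intent is "real outputs when they exist, mocks otherwise", drop skip_outputs and rely on mock_outputs with an allowlist, as in the EKS unit shown earlier.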

5. Orchestrate with run --all and run --graph

run --all is the workhorse: it discovers every unit under the current directory, builds the DAG, and executes your command in topological order, parallelizing independent units.

# Stand up an entire region in dependency order.
cd infra/live/prod/us-east-1
terragrunt run --all plan
terragrunt run --all apply --non-interactive

Two operational truths:

For destroys, the graph runs in reverse — and this is where run --graph earns its place. run --all destroy from a region root tears down everything under it. When you want to destroy one unit and everything that depends on it (its downstream cone), without touching unrelated units, use run --graph:

# Destroy the network unit AND every unit that depends on it,
# in the correct reverse order. Run from inside the target unit.
cd infra/live/prod/us-east-1/platform/network
terragrunt run --graph destroy

run --graph is anchored to the current unit and traverses the dependency edges out from it; run --all is anchored to a directory and processes everything beneath it. Knowing which you mean is the difference between deleting a VPC’s dependents cleanly and deleting an entire region.

Inspect the graph itself before trusting any of this:

cd infra/live/prod/us-east-1
terragrunt dag graph | dot -Tsvg > dag.svg     # Graphviz DOT to a diagram

6. Selective execution: only the units a change touched

At 200 units, run --all plan over the whole repo on every PR is minutes of wasted compute and a wall of noise. The goal is to run only the affected units.

Terragrunt gives you two mechanisms. The blunt one is glob inclusion:

# Plan only the EKS units across every prod region.
terragrunt run --all plan --queue-include-dir "prod/*/platform/eks"

# Plan everything in us-east-1 except RDS.
terragrunt run --all plan \
  --queue-include-dir "prod/us-east-1/*" \
  --queue-exclude-dir "prod/us-east-1/platform/rds"

The sharper one is change-aware and is what you actually want in CI. --filter-affected targets the units modified between the default branch and HEAD:

# Plan only units whose code changed vs the default branch — and,
# because it respects the DAG, the dependents of those units too.
terragrunt run --all plan --filter-affected

There is a subtlety the blunt globs miss: a change to a shared file (_envcommon/eks.hcl, or a local module under modules/) affects every unit that reads it, even though no leaf terragrunt.hcl changed. --queue-include-units-reading catches exactly that class:

# If _envcommon/eks.hcl changed, plan every unit that includes/reads it.
terragrunt run --all plan \
  --queue-include-units-reading "_envcommon/eks.hcl"

The three --queue-* flags above are now aliases for the newer --filter query language, so current docs may show --filter; the queue-prefixed forms remain valid and read more clearly for directory-shaped selection. Two formerly-common flags are now deprecated because their behavior is the default: --queue-strict-include (inclusion is strict now) and --queue-exclude-external (external dependencies are excluded by default).

7. CI: detect affected units and parallelize safely

The naive pipeline runs run --all over the whole repo and serializes. The scalable one computes the affected set, plans it on PRs, and applies it on merge, bounded by parallelism so you do not exhaust provider rate limits or the runner.

# .github/workflows/terragrunt.yml
name: terragrunt
on:
  pull_request:
  push:
    branches: [main]

jobs:
  terragrunt:
    runs-on: ubuntu-latest
    permissions:
      id-token: write          # OIDC; no long-lived AWS keys
      contents: read
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0       # --filter-affected needs full history to diff

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::222222222222:role/ci-terraform
          aws-region: us-east-1

      - uses: gruntwork-io/terragrunt-action@v3
        with:
          tofu_version: 1.10.0   # first OpenTofu release that accepts use_lockfile on the S3 backend
          tg_version: 0.88.0
          tg_dir: infra/live/prod
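          # PRs plan the affected set; pushes to main apply it.
          # --queue-ignore-errors is added for plans only, so a failed
          # apply still stops its dependents.
          tg_command: >-
            run --all
            ${{ github.event_name == 'pull_request' && 'plan' || 'apply' }}
            --filter-affected
            --non-interactive
            --parallelism 8
            ${{ github.event_name == 'pull_request' && '--queue-ignore-errors' || '' }}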

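The flags that make this safe at scale: --filter-affected limits the run to units changed against the default branch, which is why the checkout uses fetch-depth: 0; --non-interactive suppresses prompts that would otherwise hang a runner; --parallelism 8 caps how many units run at once so the job stays under provider rate limits and the runner's capacity; and --queue-ignore-errors keeps the queue moving past a failed unit so one broken plan does not hide the rest.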

There is one trap worth stating plainly: --queue-ignore-errors does not mean “apply what you can and skip the rest” in a way that is safe for apply. On apply, a failed dependency means its dependents should not proceed — they would apply against stale or mock data. Keep --queue-ignore-errors for plan/validate; on the apply job, prefer the default fail-fast behavior so a failed network unit stops its EKS dependent rather than applying it blind.

Pin both binaries in the action (tofu_version, tg_version) so the runner cannot drift from your *_version_constraint pins and fail the whole run on a version check. For environments beyond a sandbox, also pin every module source to a tag — source = "git::...//eks?ref=v1.5.0" — so a plan today and an apply on merge run identical module code. Promotion then becomes a reviewed one-line ref= bump per environment.
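
As a sketch of that one-line bump, assume _envcommon/eks.hcl defines a hypothetical local named base_source_url holding the module URL without a ref. The unit then pins its own version by appending one:

```hcl
# live/prod/us-east-1/platform/eks/terragrunt.hcl (excerpt)
terraform {
  # base_source_url is an assumed local exposed by the "envcommon" include
  # (expose = true); only the ref differs between environments.
  source = "${include.envcommon.locals.base_source_url}?ref=v1.5.0"
}
```

Staging can sit on a newer tag than prod for as long as it takes to gain confidence, and the promotion diff is a single changed string.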

Verify

Confirm the orchestration behaves before you trust it on prod.

cd infra/live/prod/us-east-1

# 1. The DAG is acyclic and ordered as you expect.
terragrunt dag graph | dot -Tsvg > /tmp/dag.svg
#   network has no inbound edges; eks and rds depend on it.

# 2. A full-stack validate touches no cloud state but exercises every unit.
terragrunt run --all validate --non-interactive

# 3. Change detection selects the right set. From a feature branch:
git checkout -b verify/affected
echo "# verify" >> platform/eks/terragrunt.hcl   # simulate an EKS-only change
git commit -am "verify: eks-only change"         # touch alone only changes mtime, which git ignores
terragrunt run --all plan --filter-affected --non-interactive
#   Expect: eks (and any dependents) planned; network and rds skipped.

# 4. Shared-file fan-out works.
terragrunt run --all plan \
  --queue-include-units-reading "$(git rev-parse --show-toplevel)/infra/live/_envcommon/eks.hcl" \
  --non-interactive
#   Expect: every unit that includes _envcommon/eks.hcl appears.

# 5. State keys are isolated per unit.
aws s3 ls s3://acme-tfstate-222222222222/prod/us-east-1/ --recursive
#   Expect distinct keys: .../platform/network/terraform.tfstate, .../platform/eks/...

You are checking four properties: the graph is acyclic and ordered correctly, --filter-affected plans only what changed plus its dependents, a shared-file edit fans out to every reader, and each unit owns a distinct state key.

Checklist

Where this approach stops scaling

run --all over a single repo has a ceiling. Two patterns push it out. First, partition the run root: never run run --all from the repo root in production — anchor it at an account or region so the DAG and blast radius stay bounded. Second, when units genuinely form a deployable bundle (a whole environment promoted at once), evaluate Terragrunt stacks (terragrunt.stack.hcl), which compose units into a higher-level node you version and run as one. Stacks are newer and some flag interactions still have rough edges, so validate on a non-critical environment first.
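
For a sense of the shape, a minimal terragrunt.stack.hcl might look like the sketch below. The unit source paths point at a hypothetical infra/units/ catalog that is not part of the tree shown earlier, and the values map is illustrative.

```hcl
# live/prod/us-east-1/terragrunt.stack.hcl (illustrative sketch)
unit "network" {
  source = "../../../units/network" # hypothetical unit catalog path
  path   = "network"                # where the generated unit is placed
}

unit "eks" {
  source = "../../../units/eks"     # hypothetical unit catalog path
  path   = "eks"

  # Passed to the generated unit, which reads it as values.cluster_name.
  values = {
    cluster_name = "prod-platform"
  }
}
```

Running terragrunt stack generate materializes the units, and terragrunt stack run apply then applies them as one bundle. The DAG inside the stack is still derived from the dependency blocks in each unit.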

The discipline underneath all of it is the same one that makes the live/modules split worth maintaining: the directory path is the identity, the DAG is derived not authored, and every selective-execution flag just runs a subset of that derived graph. Get the graph right and run --all is a detail. Get it wrong and no flag will save the apply.
