Scaling Terragrunt Monorepos with Dependency Graphs and run-all

A two-account Terragrunt repo is easy. Forty accounts across three regions, with two hundred units whose apply order is dictated by a dependency graph nobody can hold in their head, is a different machine. The wrapper that removed your backend.tf duplication is now the thing standing between a one-line PR and a thirty-minute serialized apply that fails on unit 137 and leaves the run half-applied.

This article is about that machine: how Terragrunt builds the dependency DAG, how run --all and run --graph traverse it, how to make plans survive a greenfield where downstream outputs do not yet exist, and — the part that actually matters at scale — how to run only the units a change touched, in parallel, safely, in CI. It assumes you already know include, remote_state generation, and the live/modules split. If you do not, start with the DRY multi-environment article and come back.

A note on CLI syntax. Terragrunt v0.88.0 redesigned the CLI. terragrunt run-all apply is now terragrunt run --all apply; graph-dependencies is now dag graph; --terragrunt-include-dir is now --queue-include-dir; --terragrunt-non-interactive is now --non-interactive. The legacy forms still work as deprecated aliases, so older pipelines will not break, but every command below uses the current syntax. If your CI logs warn about deprecated commands, that is what they mean.

1. Structure the live tree so the path is the identity

The hierarchy is account / region / environment / component, and it is not cosmetic. Terragrunt derives the state key from path_relative_to_include(), the provider role from the account file you are standing under, and the DAG from the config_path references between sibling directories. The directory layout is the deployment topology.

infra/
  modules/                         # versioned, reusable TF/OpenTofu modules
  live/
    root.hcl                       # backend + provider generation, version pins
    _envcommon/                    # per-component config shared across all envs
      network.hcl
      eks.hcl
    prod/
      account.hcl                  # account_id, account_name
      us-east-1/
        region.hcl                 # aws_region
        platform/                  # the "environment" layer
          network/
            terragrunt.hcl
          eks/
            terragrunt.hcl
          rds/
            terragrunt.hcl
      eu-west-1/
        region.hcl
        platform/
          network/
          eks/
    staging/
      account.hcl
      us-east-1/ ...

Two structural rules pay off at scale:

One module instantiation per leaf directory. A leaf with a terragrunt.hcl is a unit, Terragrunt’s atomic node in the DAG. Never put two modules in one directory; you lose the ability to plan, target, and roll them back independently.
_envcommon/ for component-level DRY. The EKS inputs that never differ between staging and prod (addon versions, IRSA wiring, log retention) live in _envcommon/eks.hcl and are pulled in with a second include. Only the genuinely environment-specific values (cluster name, node counts) stay in the leaf. This is what keeps the promotion diff to a handful of lines across two hundred units.
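
To make the path-as-identity point concrete, here is a minimal sketch of how a state key follows from a unit's location. The state_key function is a hypothetical helper written for this illustration, not a Terragrunt command; it only mimics the effect of path_relative_to_include() for a unit that includes a root.hcl sitting directly in live/.

```shell
# Hypothetical helper (not part of Terragrunt): derive the state key a unit
# would get from its path relative to the directory that holds root.hcl.
state_key() (
  live_root=${1%/}                 # directory containing root.hcl
  unit_dir=${2%/}                  # leaf directory containing terragrunt.hcl
  rel=${unit_dir#"$live_root"/}    # strip the live root prefix
  printf '%s/terraform.tfstate\n' "$rel"
)

state_key infra/live infra/live/prod/us-east-1/platform/eks
state_key infra/live infra/live/staging/us-east-1/platform/eks
```

The two calls print prod/us-east-1/platform/eks/terraform.tfstate and staging/us-east-1/platform/eks/terraform.tfstate: moving or renaming a directory changes the key, which is why the layout has to be treated as part of the deployment's identity.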

2. Keep the root DRY with locals and read_terragrunt_config

The root config is read by every unit, so it carries the expensive-to-repeat facts exactly once: backend, provider, and the version pins that keep a 200-unit run reproducible.

# live/root.hcl
locals {
  account_vars = read_terragrunt_config(find_in_parent_folders("account.hcl"))
  region_vars  = read_terragrunt_config(find_in_parent_folders("region.hcl"))

  account_id = local.account_vars.locals.account_id
  aws_region = local.region_vars.locals.aws_region
}

# Pin the toolchain. A 200-unit run is only reproducible if every unit
# runs the same OpenTofu and Terragrunt versions.
terraform_version_constraint  = ">= 1.9.0, < 2.0.0"
terragrunt_version_constraint = ">= 0.88.0"

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket       = "acme-tfstate-${local.account_id}"
    key          = "${path_relative_to_include()}/terraform.tfstate"
    region       = local.aws_region
    encrypt      = true
    use_lockfile = true            # S3-native locking (needs Terraform/OpenTofu 1.10+); no DynamoDB table needed
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    provider "aws" {
      region = "${local.aws_region}"
      assume_role {
        role_arn = "arn:aws:iam::${local.account_id}:role/terraform-exec"
      }
      default_tags {
        tags = { ManagedBy = "terragrunt", Account = "${local.account_vars.locals.account_name}" }
      }
    }
  EOF
}

terraform_version_constraint makes Terragrunt fail fast if the binary on the runner drifts from the pinned range, instead of letting a newer OpenTofu silently rewrite your state format mid-run. Pin the Terragrunt binary itself in CI tooling (a mise/asdf .tool-versions file or a container tag) — the terragrunt_version_constraint attribute is a guardrail, not a version manager.

3. The dependency graph: dependency vs dependencies

Two blocks build the DAG, and the distinction is load-bearing.

dependencies (plural) declares ordering only. It is a list of paths that must be applied before this unit. No data crosses the edge.
dependency (singular) declares ordering and data flow. It reads the target unit’s outputs and exposes them as dependency.<name>.outputs.<key>.

You almost always want dependency — if a unit needs ordering, it usually needs an output too. Reach for dependencies only for pure sequencing with no data (for example, “do not touch the app tier until the IAM bootstrap unit has run”).

# live/prod/us-east-1/platform/eks/terragrunt.hcl
include "root" {
  path = find_in_parent_folders("root.hcl")
}
include "envcommon" {
  path   = "${dirname(find_in_parent_folders("root.hcl"))}/_envcommon/eks.hcl"
  expose = true                       # make its locals readable here
}

dependency "network" {
  config_path = "../network"

  mock_outputs = {
    vpc_id             = "vpc-mock00000000000"
    private_subnet_ids = ["subnet-mock1", "subnet-mock2", "subnet-mock3"]
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan", "init"]
  mock_outputs_merge_strategy_with_state  = "shallow"
}

# Pure ordering, no data: wait for the org-wide IAM baseline.
dependencies {
  paths = ["../../../_baseline/iam"]
}

inputs = {
  cluster_name = "prod-platform"
  vpc_id       = dependency.network.outputs.vpc_id
  subnet_ids   = dependency.network.outputs.private_subnet_ids
}

Terragrunt reads every terragrunt.hcl under the run root, resolves each config_path, and assembles a directed acyclic graph. For plan/apply it walks the graph so dependencies run before dependents; for destroy it walks it in reverse, tearing down dependents first. You never write the order. A cycle (A depends on B depends on A) is a hard error at graph-construction time, which is exactly when you want to find it.
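
The ordering rule can be reproduced outside Terragrunt with the standard tsort utility. This is only an analogy, assuming a hand-written edge list; the unit names are illustrative and Terragrunt's own scheduler is not implemented this way as far as this sketch is concerned.

```shell
# Each input line is "dependency dependent": the left unit must apply first.
# The edges below are illustrative, not read from any real repo.
apply_order() {
  tsort
}

# Destroy order is the apply order reversed.
destroy_order() {
  apply_order | awk '{ line[NR] = $0 } END { for (i = NR; i > 0; i--) print line[i] }'
}

edges='network eks
network rds
iam eks'

echo "apply order:"
printf '%s\n' "$edges" | apply_order
echo "destroy order:"
printf '%s\n' "$edges" | destroy_order

# A cycle makes tsort exit non-zero, analogous to Terragrunt rejecting it
# when it builds the graph.
if ! printf 'a b\nb a\n' | tsort >/dev/null 2>&1; then
  echo "cycle detected"
fi
```

Several valid orders can exist for the same graph (rds and eks are independent of each other here), which is exactly the slack a parallel run exploits.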

4. Mock outputs: surviving plan-time and greenfield applies

This is the single most misunderstood mechanic in Terragrunt, and the one that breaks naive CI.

When you plan the EKS unit but network has never been applied, network has no outputs. Reading dependency.network.outputs.vpc_id would fail and abort the plan — so on a fresh repo you could never produce a full-stack plan. mock_outputs supplies placeholder values so plan, validate, and init proceed against fakes.

The mock_outputs_allowed_terraform_commands allowlist is the safety interlock. It must exclude apply and destroy. With the allowlist above, an apply that cannot find real outputs will fail rather than feed a fake subnet ID into a real cluster. That failure is correct: it means you tried to apply a dependent before its dependency, and Terragrunt’s run --all ordering exists precisely so you never hit it in practice.

mock_outputs_merge_strategy_with_state controls what happens once the dependency has partial real state — common when you add a new output to an already-applied module:

Strategy	Behavior
`no_merge` (default)	If real state exists, use it as-is and ignore mocks. A newly-added output that the applied state lacks will be missing, failing the plan.
`shallow`	Real outputs win; mocks fill only top-level keys the state does not yet have. The usual choice.
`deep_map_only`	Like `shallow`, but recurses into map-typed outputs, filling absent keys inside maps.

Use shallow as your default. The failure mode of no_merge — add an output to a module, and every downstream plan breaks until you re-apply the dependency first — is a needless ordering constraint on a read-only operation. shallow lets the plan proceed on a mock for the one new key while using real values for everything else.
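
As a toy model of the shallow strategy, the sketch below merges two flat key=value lists so that real values win and mocks fill only the missing keys. It is an assumption-laden simplification: Terragrunt merges HCL values rather than text lines, and shallow_merge is a hypothetical name used only here.

```shell
# Toy model only: real outputs are read first, so they win; mock entries are
# kept only for keys the real outputs do not define.
shallow_merge() {
  # $1 = real outputs, $2 = mock outputs, one key=value per line
  printf '%s\n%s\n' "$1" "$2" | awk -F= 'NF && !seen[$1]++'
}

# The applied network unit exposes vpc_id but not the newly added output yet.
real='vpc_id=vpc-0abc123'
mocks='vpc_id=vpc-mock00000000000
private_subnet_ids=subnet-mock1,subnet-mock2'

shallow_merge "$real" "$mocks"
```

The result keeps vpc_id=vpc-0abc123 from real state and takes private_subnet_ids from the mocks, which mirrors the situation where one new output has not been applied yet.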

A separate knob, skip_outputs = true, tells Terragrunt to never call terragrunt output on the dependency (it still enforces ordering). Do not combine it with mock_outputs expecting “mocks only when real outputs are absent”: skip_outputs means “always mock,” mock_outputs means “mock only as a fallback.” They answer different questions.

5. Orchestrate with run --all and run --graph

run --all is the workhorse: it discovers every unit under the current directory, builds the DAG, and executes your command in topological order, parallelizing independent units.

# Stand up an entire region in dependency order.
cd infra/live/prod/us-east-1
terragrunt run --all plan
terragrunt run --all apply --non-interactive

Two operational truths:

A greenfield run --all plan is approximate, not byte-exact. Downstream units plan against mocked outputs, so their plans show placeholder ARNs and counts. Read it as a sanity check on intent and ordering, not as the literal diff that apply will produce. The real plan for a downstream unit is only exact after its dependency has applied.
run --all apply auto-approves by default. Across many units there is no sane way to interactively confirm each one, so Terragrunt adds -auto-approve. If that makes you nervous, --no-auto-approve restores per-unit confirmation (rarely what you want in CI, frequently what you want for a hand-driven prod teardown).

For destroys, the graph runs in reverse — and this is where run --graph earns its place. run --all destroy from a region root tears down everything under it. When you want to destroy one unit and everything that depends on it (its downstream cone), without touching unrelated units, use run --graph:

# Destroy the network unit AND every unit that depends on it,
# in the correct reverse order. Run from inside the target unit.
cd infra/live/prod/us-east-1/platform/network
terragrunt run --graph destroy

run --graph is anchored to the current unit and traverses the dependency edges out from it; run --all is anchored to a directory and processes everything beneath it. Knowing which you mean is the difference between deleting a VPC’s dependents cleanly and deleting an entire region.
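
The difference in scope can be sketched by computing a unit's downstream cone from an edge list. The downstream_cone function and the edges are hypothetical, written for illustration; they model which units a graph-anchored destroy would reach, not how Terragrunt computes it.

```shell
# Hypothetical helper: print the target unit plus every unit that depends on
# it, directly or transitively. Stdin holds "dependency dependent" edges.
downstream_cone() {
  awk -v start="$1" '
    { dep[NR] = $1; child[NR] = $2 }
    END {
      seen[start] = 1
      changed = 1
      while (changed) {            # repeat until no new unit is reached
        changed = 0
        for (i = 1; i <= NR; i++) {
          if ((dep[i] in seen) && !(child[i] in seen)) {
            seen[child[i]] = 1
            changed = 1
          }
        }
      }
      for (u in seen) print u
    }' | sort
}

edges='network eks
network rds
eks app
iam audit'

printf '%s\n' "$edges" | downstream_cone network
```

For this edge list the cone of network is app, eks, network, and rds; iam and audit sit under the same directory root but are untouched, which is the case where a directory-anchored run would be too broad.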

Inspect the graph itself before trusting any of this:

cd infra/live/prod/us-east-1
terragrunt dag graph | dot -Tsvg > dag.svg     # Graphviz DOT to a diagram

6. Selective execution: only the units a change touched

At 200 units, run --all plan over the whole repo on every PR is minutes of wasted compute and a wall of noise. The goal is to run only the affected units.

Terragrunt gives you two mechanisms. The blunt one is glob inclusion:

# Plan only the EKS units across every prod region.
terragrunt run --all plan --queue-include-dir "prod/*/platform/eks"

# Plan everything in us-east-1 except RDS.
terragrunt run --all plan \
  --queue-include-dir "prod/us-east-1/*" \
  --queue-exclude-dir "prod/us-east-1/platform/rds"

The sharper one is change-aware and is what you actually want in CI. --filter-affected targets the units modified between the default branch and HEAD:

# Plan only units whose code changed vs the default branch — and,
# because it respects the DAG, the dependents of those units too.
terragrunt run --all plan --filter-affected

There is a subtlety the blunt globs miss: a change to a shared file (_envcommon/eks.hcl, or a local module under modules/) affects every unit that reads it, even though no leaf terragrunt.hcl changed. --queue-include-units-reading catches exactly that class:

# If _envcommon/eks.hcl changed, plan every unit that includes/reads it.
terragrunt run --all plan \
  --queue-include-units-reading "_envcommon/eks.hcl"
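
Why shared files need their own selection mechanism is easier to see with a naive mapper from changed files to units. The sketch below is a hypothetical helper, not Terragrunt's implementation: it walks up from a changed file to the nearest directory holding a terragrunt.hcl, and it deliberately finds nothing for a file under _envcommon/.

```shell
# Hypothetical helper: map a changed file to the nearest ancestor directory
# that contains a terragrunt.hcl. Fails if no such directory exists.
unit_for_file() (
  dir=$(dirname "$1")
  while :; do
    if [ -f "$dir/terragrunt.hcl" ]; then
      printf '%s\n' "$dir"
      exit 0
    fi
    case $dir in .|/) exit 1 ;; esac
    dir=$(dirname "$dir")
  done
)

# Demo against a throwaway tree shaped like the layout in section 1.
demo=$(mktemp -d)
mkdir -p "$demo/prod/us-east-1/platform/eks" "$demo/_envcommon"
touch "$demo/prod/us-east-1/platform/eks/terragrunt.hcl" "$demo/_envcommon/eks.hcl"
cd "$demo" || exit 1

unit_for_file prod/us-east-1/platform/eks/terragrunt.hcl
unit_for_file _envcommon/eks.hcl || echo "no owning unit: needs reader-based selection"
```

The first call prints prod/us-east-1/platform/eks; the second falls through to the message, because no directory on the way up owns the shared file. Path-based detection alone would miss every unit that includes it.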

The three --queue-* flags above are now aliases for the newer --filter query language, so current docs may show --filter; the queue-prefixed forms remain valid and read more clearly for directory-shaped selection. Two formerly-common flags are now deprecated because their behavior is the default: --queue-strict-include (inclusion is strict now) and --queue-exclude-external (external dependencies are excluded by default).

7. CI: detect affected units and parallelize safely

The naive pipeline runs run --all over the whole repo and serializes. The scalable one computes the affected set, plans it on PRs, and applies it on merge, bounded by parallelism so you do not exhaust provider rate limits or the runner.

# .github/workflows/terragrunt.yml
name: terragrunt
on:
  pull_request:
  push:
    branches: [main]

jobs:
  terragrunt:
    runs-on: ubuntu-latest
    permissions:
      id-token: write          # OIDC; no long-lived AWS keys
      contents: read
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0       # --filter-affected needs full history to diff

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::222222222222:role/ci-terraform
          aws-region: us-east-1

      - uses: gruntwork-io/terragrunt-action@v3
        with:
          tofu_version: 1.9.0
          tg_version: 0.88.0
          tg_dir: infra/live/prod
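          # PRs plan the affected set; pushes to main apply it.
          # --queue-ignore-errors is passed on plan only, so apply stays fail-fast.
          tg_command: >-
            run --all
            ${{ github.event_name == 'pull_request' && 'plan' || 'apply' }}
            --filter-affected
            --non-interactive
            --parallelism 8
            ${{ github.event_name == 'pull_request' && '--queue-ignore-errors' || '' }}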

The flags that make this safe at scale:

fetch-depth: 0 — --filter-affected diffs against the default branch, which requires real git history. A shallow checkout silently makes it find nothing (and plan nothing, which looks like success).
--parallelism 8 caps concurrent units. The DAG width can be dozens of independent units; without a cap you will hit AWS API throttling and OOM the runner. Tune to your account’s rate limits; 4-10 is a sane band.
--non-interactive forces non-prompting behavior — mandatory in CI, where a hung prompt is a hung job.
--queue-ignore-errors on the plan job surfaces every broken unit in one run instead of aborting on the first. You get the full list of failures per PR rather than fixing them one slow round-trip at a time.
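
Because a shallow checkout fails silently, a small guard step ahead of the Terragrunt step can turn it into a loud failure. This is a sketch under the assumption that the runner has git 2.15 or newer, where git rev-parse --is-shallow-repository is available; require_full_history is a hypothetical function name.

```shell
# Hypothetical CI guard: refuse to continue on a shallow clone, where a
# diff against the default branch would come back empty.
require_full_history() {
  if [ "$(git rev-parse --is-shallow-repository 2>/dev/null)" = "true" ]; then
    echo "shallow clone detected: change detection would find no diff" >&2
    return 1
  fi
  return 0
}

# In CI, call it before any change-aware run, for example:
#   require_full_history && terragrunt run --all plan --filter-affected --non-interactive
```

Failing the job here is cheaper than discovering later that a green plan covered zero units.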

There is one trap worth stating plainly: --queue-ignore-errors does not mean “apply what you can and skip the rest” in a way that is safe for apply. On apply, a failed dependency means its dependents should not proceed — they would apply against stale or mock data. Keep --queue-ignore-errors for plan/validate; on the apply job, prefer the default fail-fast behavior so a failed network unit stops its EKS dependent rather than applying it blind.

Pin both binaries in the action (tofu_version, tg_version) so the runner cannot drift from your *_version_constraint pins and fail the whole run on a version check. For environments beyond a sandbox, also pin every module source to a tag — source = "git::...//eks?ref=v1.5.0" — so a plan today and an apply on merge run identical module code. Promotion then becomes a reviewed one-line ref= bump per environment.

Verify

Confirm the orchestration behaves before you trust it on prod.

cd infra/live/prod/us-east-1

# 1. The DAG is acyclic and ordered as you expect.
terragrunt dag graph | dot -Tsvg > /tmp/dag.svg
#   network has no inbound edges; eks and rds depend on it.

# 2. A full-stack validate touches no cloud state but exercises every unit.
terragrunt run --all validate --non-interactive

# 3. Change detection selects the right set. From a feature branch:
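git checkout -b verify/affected
# touch alone produces no git diff; commit a real content change instead.
echo "# verify: simulated EKS-only change" >> platform/eks/terragrunt.hcl
git commit -am "verify: simulate EKS-only change"
terragrunt run --all plan --filter-affected --non-interactive
#   Expect: eks (and any dependents) planned; network and rds skipped.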

# 4. Shared-file fan-out works.
terragrunt run --all plan \
  --queue-include-units-reading "$(git rev-parse --show-toplevel)/infra/live/_envcommon/eks.hcl" \
  --non-interactive
#   Expect: every unit that includes _envcommon/eks.hcl appears.

# 5. State keys are isolated per unit.
aws s3 ls s3://acme-tfstate-222222222222/prod/us-east-1/ --recursive
#   Expect distinct keys: .../platform/network/terraform.tfstate, .../platform/eks/...

You are checking five properties: the graph is acyclic and ordered correctly, every unit validates without touching cloud state, --filter-affected plans only what changed plus its dependents, a shared-file edit fans out to every reader, and each unit owns a distinct state key.

Checklist

Where this approach stops scaling

run --all over a single repo has a ceiling. Two patterns push it out. First, partition the run root: never run run --all from the repo root in production — anchor it at an account or region so the DAG and blast radius stay bounded. Second, when units genuinely form a deployable bundle (a whole environment promoted at once), evaluate Terragrunt stacks (terragrunt.stack.hcl), which compose units into a higher-level node you version and run as one. Stacks are newer and some flag interactions still have rough edges, so validate on a non-critical environment first.

The discipline underneath all of it is the same one that makes the live/modules split worth maintaining: the directory path is the identity, the DAG is derived not authored, and every selective-execution flag just runs a subset of that derived graph. Get the graph right and run --all is a detail. Get it wrong and no flag will save the apply.

Written by Vinod
