Building a Platform Layer with Azure Verified Modules and Terraform

Most teams that adopt Azure Verified Modules (AVM) stop at “I called a module and got a resource.” That is the demo, not the value. The real win is using AVM as the substrate for an opinionated platform layer that your application teams consume without ever touching a raw azurerm_ resource. That layer bakes in private endpoints, diagnostic settings, mandatory tags and naming as types, validated at plan, so an app team cannot ship a publicly exposed Key Vault by accident: the lever does not exist. This guide builds that layer end to end: composing AVM resource modules into your own pattern modules, pinning them sanely (avoiding the pre-1.0 ~> trap that nukes 40 data planes in one Renovate merge), testing them at two altitudes, and shipping them through a private registry, with the state-migration discipline that keeps every upgrade a zero-destroy event.

The thing that makes this hard is not Terraform syntax; it is that AVM modules are pre-1.0, so the version-constraint intuition every engineer carries (~> 0.9 is “patches only”) is exactly wrong, and the place that intuition fails is the shared wrapper that 40 repos depend on. The thing that makes it worth doing is that once the wrapper exists, your platform team upgrades the entire estate by merging one pinned-version PR with a reviewed plan diff, and your app teams ship spokes, vaults and storage that are private, tagged and observable by construction. This article is the reference you keep open while you build that: every interface input, every pin rule, every test layer, every migration block, laid out as scannable tables so you read the prose once and then work from the tables.

By the end you will stop treating AVM as a fancier resource and start treating it as the brick library underneath your org’s non-negotiables. You will know precisely which version constraint pins a 0.x module without admitting a breaking minor, which inputs to expose and which to weld shut, how to assert a wrapper’s shape at plan without deploying, how to run a real apply/destroy against an ephemeral subscription with keyless OIDC, how to publish to a private registry on a semver tag, and — the part that separates a senior platform engineer from someone who read the README — how to absorb AVM’s internal resource-address churn inside the wrapper so a minor bump never shows up as destroy/create in a consumer’s plan.

What problem this solves

The pain this solves is module sprawl plus silent drift from your standards. Before a platform layer, every app team copies a different community module (or hand-rolls azurerm_ resources), each with different input names, none of which reliably support diagnostic settings, locks, role assignments or private endpoints. Security finds a publicly-accessible storage account in a quarterly review; the team that built it points at a wiki page nobody read. “Please tag things” and “please use private endpoints” live as documentation, which means they live as suggestions. Multiply by 40 repos and you have an estate you cannot reason about, where the answer to “is everything private?” is “let me go check, repo by repo.”

What breaks without this layer: governance becomes archaeology. You cannot upgrade a defaulting convention across the estate because there is no single place that owns it. A new compliance rule (force-Entra-auth on storage, deny public Key Vault) becomes 40 pull requests against 40 inconsistent codebases instead of one version bump. And when you do try to standardise by swapping hand-rolled modules for AVM, the naive attempt shows every storage account scheduled for destroy/create in the plan — because the resource address changed — so the migration gets reverted and “AVM doesn’t work for us” enters the team’s folklore.

Who hits this: every platform / cloud-engineering team operating Terraform at more than a handful of repos, especially under a landing-zone program where Azure Cloud Adoption Framework landing zones define the guardrails but leave how app teams provision inside a spoke to you. It bites hardest where pre-1.0 AVM modules are pinned with ~> (the breaking-minor trap), where wrappers are accidental passthroughs of the full AVM surface (no guardrail value), and where nobody reads the migration plan before merging a Renovate AVM bump.

To frame the whole field before the deep dive, here is every layer of the module supply chain this article builds, who owns it, and the single failure that bites at each:

| Layer | What it is | Who owns it | The failure that bites here |
| --- | --- | --- | --- |
| Upstream AVM resource module (avm-res-*) | One logical resource + its children, WAF-aligned | Microsoft | Pre-1.0: ~> 0.x admits a breaking minor |
| Upstream AVM pattern module (avm-ptn-*) | A multi-resource architecture (hub-spoke, LZ) | Microsoft | Heavier blast radius on a bad bump |
| Your platform wrapper | Org pattern composed from AVM bricks + injected policy | Platform team | Accidental passthrough = no guardrail |
| Test + release gate | terraform test + Terratest + replace-gate in CI | Platform team | Unacknowledged destroy/create ships |
| Private registry / git ref | Versioned, semver-tagged distribution | Platform team | Copy-paste instead of source/version |
| App-team consumption | Narrow inputs only; deploy into Azure | App teams | A leaked lever lets them go public |

Learning objectives

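By the end of this article you can:

- choose the version constraint that pins a 0.x AVM module without admitting a breaking minor;
- decide which inputs a wrapper exposes and which it welds shut;
- assert a wrapper’s shape at plan without deploying;
- run a real apply/destroy against an ephemeral subscription with keyless OIDC;
- publish a wrapper to a private registry on a semver tag;
- absorb AVM’s internal resource-address churn inside the wrapper so a minor bump never shows up as destroy/create in a consumer’s plan.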

Prerequisites & where this fits

You should be comfortable with core Terraform: HCL, providers, the init/plan/apply workflow, modules with inputs and outputs, and remote state. If any of that is shaky, “Terraform fundamentals: HCL, providers, state & workflow” and “Terraform state deep dive” come first. Module authoring conventions (inputs, outputs, versioning) are assumed from “Authoring Terraform modules: structure, inputs, outputs, versioning”. You should know what a version constraint means in principle (we will sharpen it for 0.x), and have an Azure subscription plus the azurerm provider configured.

This sits at the infrastructure-as-code / platform-engineering layer of an Azure estate. It assumes the landing-zone scaffolding above it (management groups, policy, the hub) from “Azure Cloud Adoption Framework landing zones”, and it produces the spokes app teams deploy into. It pairs with “Terraform module design: composition, versioning” (the composition theory), “Terraform testing: native & Terratest” (the test mechanics), and “Terraform refactoring: moved, import & removed blocks” (the migration mechanics this article applies to AVM specifically). For teams that prefer Bicep, the equivalent distribution story is “Bicep private module registry with ACR & CI/CD”.

A quick map of who confirms what when something goes wrong, so you route a problem to the right layer fast:

| Concern | Where it lives | Confirm with | Owns the fix |
| --- | --- | --- | --- |
| “Which AVM version actually resolved?” | .terraform/modules/modules.json (the lock file, .terraform.lock.hcl, records providers only) | terraform init output / read modules.json | Platform team |
| “Why is the plan showing a replace?” | Wrapper resource addresses | terraform show -json + jq | Platform team |
| “Why did the plan error on a deployment?” | enable_telemetry in a locked sub | Plan error text | Platform + governance |
| “Why can this team go public?” | Wrapper variables.tf surface | grep for the exposed lever | Platform team |
| “Is the published version right?” | Registry / git tag | terraform init in a consumer | Platform + app team |
| “Did the migration churn state?” | Plan actions on adopt | replace-gate in CI | Platform team |

Core concepts

Five mental models make every later decision obvious.

AVM is a specification, not just a module set. The reason AVM is worth building on is not “Microsoft published modules”; it is that every module conforms to the same interface contract: consistent input names, mandatory support for diagnostic settings, locks, role assignments, and (where the service supports them) private endpoints, plus Well-Architected Framework (WAF) defaults rather than the bare minimum that compiles. You learn one shape and it generalises across services. That shared shape is what lets you write generic org policy (force diagnostics everywhere) instead of bespoke wiring per resource.

There are two AVM module classes, and your wrapper is a third tier. AVM ships resource modules (Azure/avm-res-<service>-<resource>/azurerm) — one logical resource plus its directly-dependent children — and pattern modules (Azure/avm-ptn-<pattern>/azurerm) — a whole multi-resource architecture. The mental model: resource modules are LEGO bricks; pattern modules are pre-built assemblies. Your platform layer is neither — it is a third tier: your own pattern modules, composed from AVM resource bricks, that encode your org’s non-negotiables. You generally do not fork AVM; you wrap it.

Pre-1.0 changes the meaning of ~>. This is the single most consequential fact in the article. AVM resource modules are below 1.0, and AVM treats the minor segment as the breaking-change segment while below 1.0. So ~> 0.9 (which feels like “0.9.x only”) actually expands to >= 0.9.0, < 1.0.0 and will happily pull a breaking 0.10.0. The constraint that pins to a non-breaking range is the three-part ~> 0.9.1 (allows 0.9.1 .. 0.9.x, blocks 0.10.0). If you remember one thing, remember this.

The wrapper’s value is what it does not expose. A platform module is valuable in proportion to the levers it removes. If your variables.tf mirrors the AVM module’s inputs, you have built a passthrough, not a platform — an app team can still ship a public Key Vault. The discipline is to expose a narrow contract (workload name, tags, the central LAW id) and inject the rest (public_network_access_enabled = false, enable_telemetry = false, forced diagnostics and private endpoints) as constants the caller cannot override. Guardrails as types, validated at plan, not as a wiki page.

Every AVM upgrade is a potential state migration. Exact version pins control when you take an upgrade, not whether it is safe. A minor bump can move a resource under a for_each map, changing its resource address — and a changed address means Terraform plans destroy + create, which on a storage account is a data-plane deletion. The senior move is to read every AVM bump as a possible state migration, absorb the address change inside the wrapper with a moved block shipped in the same version, and gate CI so an unacknowledged replace can never merge.
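
To make the moved mechanic concrete, here is a minimal sketch of the swap described earlier: a wrapper that used to declare its own storage account and now delegates to the AVM brick. Both addresses are assumptions for illustration (the old name azurerm_storage_account.main is hypothetical, and azurerm_storage_account.this is only the name the AVM module is assumed to use internally), so read the real addresses from your own plan output before writing the block.

```hcl
# Inside the wrapper module, shipped in the same release that swaps in the AVM brick.
# ASSUMPTION: the previous wrapper version declared the account as
# azurerm_storage_account.main, and the AVM storage module names its
# primary resource azurerm_storage_account.this. Confirm both in `terraform plan`.
moved {
  from = azurerm_storage_account.main
  to   = module.sa.azurerm_storage_account.this
}
```

With the block in place, the plan should report the account as moved rather than as a destroy plus a create. Treat a clean plan against a copy of real state as the proof, not the block itself.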

The vocabulary in one table

Pin down every moving part before the deep sections; the glossary repeats these for lookup, this is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters |
| --- | --- | --- | --- |
| Resource module (avm-res-*) | One logical resource + children | Public Terraform registry | The brick you compose |
| Pattern module (avm-ptn-*) | A multi-resource architecture | Public registry | A pre-built assembly |
| Platform wrapper | Your pattern over AVM bricks | Your private registry / repo | Encodes org non-negotiables |
| AVM interface | The shared optional input contract | Each module’s variables.tf | Lets you write generic policy |
| enable_telemetry | Empty ARM deployment for usage metrics | An AVM input (default true) | Fails plan in locked subs |
| ~> 0.9.1 vs ~> 0.9 | Three-part vs two-part 0.x pin | Module version arg | One blocks breaking minors, one doesn’t |
| validation block | Custom input precondition | Wrapper variables.tf | Turns conventions into plan failures |
| terraform test | Native plan/apply assertion runner | tests/*.tftest.hcl | Fast contract checks, no deploy |
| Terratest | Go E2E apply/assert/destroy | test/*.go | Real Azure validation, nightly |
| moved block | Declares old→new resource address | Wrapper .tf | Absorbs AVM address churn |
| import block | Brings existing Azure into state | Consumer (root module) .tf; import blocks are only accepted in the root module | Brownfield adoption, no recreate |
| Replace gate | CI check rejecting destroy+create | Pipeline step | Stops accidental data-plane loss |
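
For the import row of that table, a sketch of an import block that adopts an existing storage account into the wrapper’s AVM brick without recreating it. The module label spoke, the internal resource address and the zeroed subscription ID are placeholders for illustration, not values from a real estate.

```hcl
# Root (consumer) configuration. Import blocks need Terraform 1.5 or later.
# ASSUMPTION: the wrapper is called as module "spoke", and the AVM storage
# module names its primary resource azurerm_storage_account.this.
import {
  to = module.spoke.module.sa.azurerm_storage_account.this
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-checkout-eus/providers/Microsoft.Storage/storageAccounts/stcheckouteus"
}
```

The plan should then list the account as an import, with any in-place changes shown for review, instead of a create.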

Why AVM exists: the resource vs. pattern split

AVM is Microsoft’s effort to replace the sprawl of inconsistent community modules with a single, owned, specification-driven set. Two things make it worth building on. A shared specification: every module conforms to the same interface contracts — consistent input names, mandatory support for diagnostic settings, locks, role assignments, and (where relevant) private endpoints; you learn one shape and it generalises. WAF alignment: modules encode Well-Architected defaults rather than the bare minimum that compiles. The two module classes you actually compose with are the resource and pattern modules — and your platform layer is a third tier over them.

| Class | Terraform registry prefix | Scope | When you reach for it |
| --- | --- | --- | --- |
| Resource module | Azure/avm-res-<service>-<resource>/azurerm | One logical resource + its directly dependent child resources | The brick for your own pattern |
| Pattern module | Azure/avm-ptn-<pattern>/azurerm | A multi-resource architecture (hub-spoke, AKS landing zone) | A whole assembly you accept as-is |
| Platform wrapper (yours) | your private registry / git ref | Org pattern composed from AVM resource bricks + injected policy | What app teams actually consume |

The decision of which tier to consume, by situation:

| If you need… | Consume | Why |
| --- | --- | --- |
| A single Key Vault with org defaults | Resource module, wrapped | You inject policy the bare brick doesn’t enforce |
| A whole hub-spoke exactly as Microsoft ships it | Pattern module directly | No org-specific deltas; accept the assembly |
| A spoke with your naming, tags, PE, diagnostics | Your wrapper over resource bricks | The pattern module won’t encode your non-negotiables |
| A one-off experiment / spike | Resource module directly | Not worth a wrapper yet |
| To change a default across 40 repos | Your wrapper (one bump) | The only place that owns the convention |

Why build on AVM at all rather than community modules or raw resources — the three approaches side by side on the axes that matter at estate scale:

| Axis | Raw azurerm_ resources | Community modules | AVM (wrapped) |
| --- | --- | --- | --- |
| Interface consistency | None (you write it all) | Varies wildly per author | Mandated, identical shape across services |
| Diagnostics / locks / PE support | Hand-wired each time | Sometimes, inconsistently | First-class, standard inputs |
| Defaults | Whatever you type | Author’s opinion | WAF-aligned (good baseline) |
| Ownership / maintenance | You own everything | Author may abandon it | Microsoft-owned, supported |
| Upstream fixes | N/A | If the author ships them | Flow to you (you compose, not fork) |
| Org policy injection | Manual, per resource | Fork or pray | Inject once in your wrapper tier |
| Estate-wide change | N PRs, N codebases | N PRs | One wrapper bump |

A bare resource-module call looks like this — the starting point you will deliberately narrow and harden in your wrapper:

module "kv" {
  source  = "Azure/avm-res-keyvault-vault/azurerm"
  version = "0.9.1"

  name                = "kv-platform-eus-01"
  resource_group_name = azurerm_resource_group.platform.name
  location            = "eastus"
  tenant_id           = data.azurerm_client_config.current.tenant_id
}

That call gets you a vault, but with AVM’s defaults and the full AVM surface exposed — neither of which is what you ship to app teams. The whole rest of this article is turning that into a guarded, distributed, upgrade-safe platform brick.

Reading an AVM module’s interface

Before wrapping anything, read the interface — not the README prose, the actual variables. Because the AVM spec mandates a shared shape, resource modules share a recognisable set of optional inputs beyond the resource-specific ones. Know this set cold; it is the surface you decide to expose, inject or forbid in your wrapper.

| AVM input | Type (shape) | What it does | Your wrapper’s stance |
| --- | --- | --- | --- |
| tags | map(string) | Tags applied to the resource | Expose (validated for mandatory keys) |
| lock | object | Apply CanNotDelete / ReadOnly management lock | Inject (org default) or expose narrowly |
| role_assignments | map(object) | RBAC assignments, keyed for add/remove without reindexing | Inject baseline; optionally extend |
| diagnostic_settings | map(object) | Log/metric categories → workspace/storage/Event Hub | Inject (non-negotiable → central LAW) |
| private_endpoints | map(object) | PE definitions (subnet, private DNS zone group) | Inject (non-negotiable on PE-capable services) |
| managed_identities | object | System- and/or user-assigned identity wiring | Inject or expose per pattern |
| enable_telemetry | bool (default true) | Tiny empty ARM deployment for usage metrics | Inject false org-wide |
| <resource>-specific | varies | e.g. public_network_access_enabled, sku_name | Mostly forbid; expose only the safe ones |

That enable_telemetry row deserves a callout because it fails in a way that wastes an afternoon:

enable_telemetry: AVM modules deploy a tiny, empty ARM deployment whose name encodes the module and version. It sends no resource data to Microsoft — it lets the team measure module usage. It is harmless, but in locked-down subscriptions where Microsoft.Resources/deployments is policy-denied, it will fail a plan with a confusing error. Decide your org default once (we set it false and bake that into our wrappers) rather than per-call.

Inspect the real inputs instead of guessing — pull the module and read its variables directly:

terraform init
terraform providers schema -json > /dev/null   # sanity-check provider wiring
# Read the module's own variables directly:
find .terraform/modules/kv -name 'variables.tf' -exec grep -E '^variable' {} +

The AVM-standard inputs and their direct-resource equivalents, so you know what the brick is wiring under the hood:

| AVM input | Underlying azurerm mechanism it wraps | Why the wrapper is nicer |
| --- | --- | --- |
| diagnostic_settings | azurerm_monitor_diagnostic_setting | One map vs N resource blocks + category enumeration |
| private_endpoints | azurerm_private_endpoint + DNS zone group | Subnet + zone wiring abstracted, keyed |
| role_assignments | azurerm_role_assignment | Keyed map survives reordering; no index churn |
| lock | azurerm_management_lock | Single object, attached to the resource scope |
| managed_identities | identity {} block + azurerm_user_assigned_identity | System/user identity wiring normalised |

Pinning and dependency strategy

AVM resource modules are pre-1.0, and this breaks the intuition most people have about ~>. The constraint that feels safe is the one that bites.

# DANGEROUS for a 0.x module:
version = "~> 0.9"   # allows 0.9.x AND 0.10.0, 0.11.0, ...

For 0.x releases, ~> 0.9 is equivalent to >= 0.9.0, < 1.0.0. Because AVM treats the minor segment as the breaking-change segment while below 1.0, that constraint happily pulls in a breaking 0.10.0. The constraint that actually pins to a non-breaking range is the three-part form:

# Allows 0.9.1 .. 0.9.x, blocks 0.10.0:
version = "~> 0.9.1"

The full constraint-operator behaviour, made explicit so you never guess what a given string admits:

| Constraint written | Expands to | Admits a breaking 0.10.0? | Verdict for a 0.x AVM module |
| --- | --- | --- | --- |
| 0.9.1 | exactly 0.9.1 | No | Best in a wrapper: deliberate, reviewed bumps |
| = 0.9.1 | exactly 0.9.1 | No | Same as above, explicit form |
| ~> 0.9.1 | >= 0.9.1, < 0.10.0 | No | Good: allows safe patch drift inside 0.9.x |
| ~> 0.9 | >= 0.9.0, < 1.0.0 | Yes | Dangerous: the classic AVM mistake |
| >= 0.9.0 | >= 0.9.0 (unbounded) | Yes (and beyond) | Never: unbounded, will break |
| >= 0.9.0, < 0.10.0 | that range | No | Verbose near-equivalent of ~> 0.9.1 (it also admits 0.9.0) |
| (omitted) | latest available | Yes | Never in shared code: irreproducible |

My rule across the platform repo, and why each tier pins differently:

| Repo tier | Pin AVM dependencies as | Pin your wrapper as | Rationale |
| --- | --- | --- | --- |
| Wrapper (platform) modules | exact (version = "0.9.1") | n/a (this is the wrapper) | The platform layer is where you absorb upgrade risk deliberately, in a PR, with a reviewed plan diff |
| Consuming (app) repos | inherited from the wrapper | ~> X.Y.Z on your wrapper | Your wrappers are semver-disciplined, so ~> is safe here; app teams inherit the AVM versions you chose |
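
As an illustration of the consuming-repo row, a sketch of how an app team might call the wrapper. The registry host, namespace and version number are hypothetical placeholders; the inputs are the narrow contract this article’s wrapper exposes.

```hcl
# App-team repo: consume the platform wrapper, never AVM directly.
# ASSUMPTION: registry address and version below are placeholders.
data "azurerm_client_config" "current" {}

variable "log_analytics_workspace_id" {
  type = string
}

variable "kv_private_dns_zone_id" {
  type = string
}

module "spoke" {
  source  = "app.terraform.io/contoso/spoke-landing-zone/azurerm"
  version = "~> 2.3.0" # three-part pin: patch drift only, minors via a reviewed PR

  workload                   = "checkout"
  location                   = "eastus"
  location_short             = "eus"
  resource_group_name        = "rg-checkout-eus"
  tenant_id                  = data.azurerm_client_config.current.tenant_id
  address_space              = ["10.20.0.0/24"]
  pe_subnet_prefix           = "10.20.0.0/27"
  log_analytics_workspace_id = var.log_analytics_workspace_id
  kv_private_dns_zone_id     = var.kv_private_dns_zone_id
  tags                       = { costCenter = "1234", owner = "team@contoso.com" }
}
```

Nothing in this call names an AVM module or version: the app team inherits whatever the pinned wrapper release chose.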

Automate the bumps with Renovate so you review upgrades instead of chasing them. Renovate understands Terraform registry sources natively:

{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:recommended"],
  "terraform": { "enabled": true },
  "packageRules": [
    {
      "matchManagers": ["terraform"],
      "matchPackageNames": ["/^Azure/avm-/"],
      "groupName": "azure-verified-modules",
      "schedule": ["before 9am on monday"]
    }
  ]
}

Each Renovate PR becomes a single reviewable unit: the version bump plus the terraform plan your CI attaches as a comment. Exact module pins are what make module resolution reproducible; the dependency lock file does the same job for providers and does not record module versions. What each artifact pins and where:

| Artifact | Pins | Committed? | Bumped by |
| --- | --- | --- | --- |
| version = in module block | Module version constraint | Yes (in code) | Renovate PR / manual |
| .terraform.lock.hcl | Provider versions + checksums | Yes (always commit) | terraform init -upgrade |
| required_version (versions.tf) | Terraform CLI version range | Yes | Manual, deliberate |
| required_providers (versions.tf) | Provider source + version range | Yes | Manual / Renovate |
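
The last two rows live in the wrapper’s versions.tf. A minimal sketch follows; the version ranges are illustrative only and must be compatible with whatever the AVM modules you pin declare for themselves.

```hcl
# versions.tf in the wrapper. Ranges are illustrative, not recommendations.
terraform {
  # CLI range: moved deliberately, by hand.
  required_version = ">= 1.9.0, < 2.0.0"

  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
      # The provider is past 1.0, so a two-part ~> stays within one major.
      version = "~> 4.0"
    }
  }
}
```

The contrast with the module pins is the point: ~> with two parts is reasonable for a 1.0+ provider, and is exactly the form to avoid on a 0.x AVM module.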

Wrapping resource modules into pattern modules

Here is the core of the platform layer. We want app teams to ask for “a spoke” and get a VNet, a Key Vault, and a storage account — all with private endpoints, diagnostics, and tags already correct. They should not be able to opt out of those. The directory layout that scales:

platform-modules/
└── spoke-landing-zone/
    ├── main.tf          # composes AVM resource modules
    ├── variables.tf     # the narrow contract app teams see
    ├── outputs.tf
    ├── versions.tf      # required_providers + required_version
    └── tests/
        └── defaults.tftest.hcl

What each file owns, and the rule that keeps the wrapper a platform and not a passthrough:

| File | Owns | The discipline |
| --- | --- | --- |
| main.tf | Composition of AVM bricks + injected policy | Inject enable_telemetry, diagnostic_settings, private_endpoints here; never pass them through |
| variables.tf | The narrow caller contract | Expose only safe inputs; validation on naming + mandatory tags |
| outputs.tf | Stable outputs (ids, URIs) | Treat as API: renaming an output is a major version bump |
| versions.tf | required_version + required_providers | Pin the CLI and provider ranges deliberately |
| tests/*.tftest.hcl | Plan-level contract assertions | Assert the locked-down defaults resolve as expected |

The wrapper’s main.tf composes AVM bricks and injects org policy. Note enable_telemetry, diagnostic_settings, and private_endpoints are set by us, not passed through from the caller:

locals {
  base_tags = merge(var.tags, {
    managedBy = "platform-team"
    module    = "spoke-landing-zone"
  })
}

module "vnet" {
  source  = "Azure/avm-res-network-virtualnetwork/azurerm"
  version = "0.8.1"

  name                = "vnet-${var.workload}-${var.location_short}"
  resource_group_name = var.resource_group_name
  location            = var.location
  address_space       = var.address_space
  tags                = local.base_tags
  enable_telemetry    = false

  subnets = {
    pe = {
      name             = "snet-private-endpoints"
      address_prefixes = [var.pe_subnet_prefix]
    }
  }
}

module "kv" {
  source  = "Azure/avm-res-keyvault-vault/azurerm"
  version = "0.9.1"

  name                = "kv-${var.workload}-${var.location_short}"
  resource_group_name = var.resource_group_name
  location            = var.location
  tenant_id           = var.tenant_id
  tags                = local.base_tags
  enable_telemetry    = false

  # Org default: no public access, ever.
  public_network_access_enabled = false

  diagnostic_settings = {
    central = {
      name                  = "to-law"
      workspace_resource_id = var.log_analytics_workspace_id
    }
  }

  private_endpoints = {
    vault = {
      subnet_resource_id            = module.vnet.subnets["pe"].resource_id
      private_dns_zone_resource_ids = [var.kv_private_dns_zone_id]
    }
  }
}

module "sa" {
  source  = "Azure/avm-res-storage-storageaccount/azurerm"
  version = "0.6.4"

  name                = "st${var.workload}${var.location_short}"
  resource_group_name = var.resource_group_name
  location            = var.location
  tags                = local.base_tags
  enable_telemetry    = false

  public_network_access_enabled = false
  shared_access_key_enabled     = false   # force Entra auth

  diagnostic_settings = {
    central = {
      name                  = "to-law"
      workspace_resource_id = var.log_analytics_workspace_id
    }
  }
}

The version numbers above are illustrative pins from the time of writing. Resolve the current ones for your repo from the registry and pin them exactly — never copy version strings from a blog post into production. (Yes, including this one.)
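
The layout lists an outputs.tf that the walkthrough does not show. A sketch of what it might contain, assuming each AVM resource module exposes a resource_id output (an assumption to verify against the versions you pin); the subnet reference reuses the expression already used in main.tf.

```hcl
# outputs.tf: the wrapper's stable API. Renaming any of these is a major bump.
# ASSUMPTION: module.kv, module.sa and module.vnet each expose `resource_id`.
output "key_vault_id" {
  description = "Resource ID of the workload Key Vault."
  value       = module.kv.resource_id
}

output "storage_account_id" {
  description = "Resource ID of the workload storage account."
  value       = module.sa.resource_id
}

output "virtual_network_id" {
  description = "Resource ID of the spoke virtual network."
  value       = module.vnet.resource_id
}

output "private_endpoint_subnet_id" {
  description = "Resource ID of the subnet reserved for private endpoints."
  value       = module.vnet.subnets["pe"].resource_id
}
```

Exposing IDs rather than whole module objects keeps the contract narrow, so an AVM-internal change to an output shape does not leak to consumers.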

The naming convention the wrapper encodes (so app teams never hand-name a resource), with the Azure abbreviation and a worked example:

| Resource | Pattern in the wrapper | Azure abbrev. | Example (workload=checkout, eus) | Constraint to respect |
| --- | --- | --- | --- | --- |
| Resource group | rg-${workload}-${loc} | rg | rg-checkout-eus | ≤ 90 chars |
| Virtual network | vnet-${workload}-${loc} | vnet | vnet-checkout-eus | ≤ 64 chars |
| Subnet (PE) | snet-private-endpoints | snet | snet-private-endpoints | ≤ 80 chars |
| Key Vault | kv-${workload}-${loc} | kv | kv-checkout-eus | 3–24, globally unique |
| Storage account | st${workload}${loc} | st | stcheckouteus | 3–24, lowercase alnum only |
| Private endpoint | pe-${resource}-${workload} | pe | pe-kv-checkout | ≤ 80 chars |
| Log Analytics ws | law-${scope} | law | law-central | ≤ 63 chars |

Note the storage-account row is why the wrapper drops the hyphens and restricts workload to lowercase alphanumerics: storage account names reject hyphens and uppercase, so encoding the rule in the module stops a whole class of naming failures before anything deploys.
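
One way to make the storage naming rule explicit rather than implied is to normalise the name in a local and guard its length. This is a sketch under the article’s naming convention; the local and output names are illustrative, and output preconditions assume Terraform 1.2 or later.

```hcl
variable "workload" {
  type = string
}

variable "location_short" {
  type = string
}

locals {
  # Strip hyphens and force lowercase so the result is legal for a storage account.
  storage_account_name = lower(replace("st${var.workload}${var.location_short}", "-", ""))
}

output "storage_account_name" {
  description = "Normalised storage account name."
  value       = local.storage_account_name

  precondition {
    condition     = length(local.storage_account_name) >= 3 && length(local.storage_account_name) <= 24
    error_message = "Storage account name must be 3-24 characters; shorten workload or location_short."
  }
}
```

With workload = "checkout" and location_short = "eus" this yields stcheckouteus, matching the table row.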

The three bricks this wrapper composes, and the policy injected onto each — the table app teams never see but every reviewer should:

| Brick (AVM resource module) | Pinned | Injected non-negotiable | What it would default to bare |
| --- | --- | --- | --- |
| avm-res-network-virtualnetwork | 0.8.1 | PE subnet pre-created; telemetry off | No PE subnet; telemetry on |
| avm-res-keyvault-vault | 0.9.1 | public_network_access_enabled=false; PE + diag forced | Public access allowed; no PE/diag wired |
| avm-res-storage-storageaccount | 0.6.4 | shared_access_key_enabled=false; public off; diag forced | Key auth on; public allowed |

Why these specific defaults are the non-negotiables, in plain risk terms:

| Injected default | Risk it removes | Equivalent Azure Policy (defence in depth) |
| --- | --- | --- |
| public_network_access_enabled = false (KV) | Vault reachable from the internet | Deny public network access on Key Vault |
| private_endpoints = { vault = … } | Secrets traffic leaving the backbone | Audit/deny resources without a PE |
| shared_access_key_enabled = false (SA) | Long-lived account keys to steal | Deny storage account key access |
| diagnostic_settings → central LAW | Blind spot: no audit trail | Deploy-if-not-exists diagnostic settings |
| enable_telemetry = false | Plan failure in locked subs | (operational, not security) |

Enforcing org defaults as non-negotiable inputs

The discipline that makes a platform layer valuable is what the wrapper does not expose. Compare the AVM surface (dozens of inputs) to your variables.tf:

variable "workload" {
  type        = string
  description = "Short workload name, used in resource naming."
  validation {
    condition     = can(regex("^[a-z0-9]{2,12}$", var.workload))
    error_message = "workload must be 2-12 lowercase alphanumeric chars."
  }
}

variable "tags" {
  type        = map(string)
  description = "Caller tags; merged with mandatory platform tags."
  validation {
    condition     = contains(keys(var.tags), "costCenter") && contains(keys(var.tags), "owner")
    error_message = "tags must include costCenter and owner."
  }
}

variable "log_analytics_workspace_id" {
  type        = string
  description = "Central LAW resource ID for diagnostic settings."
}
# ... resource_group_name, location, location_short, tenant_id,
#     address_space, pe_subnet_prefix, kv_private_dns_zone_id

There is no public_network_access_enabled, no enable_telemetry, no way to skip diagnostics. App teams cannot ship a publicly exposed Key Vault through this module because the lever does not exist. That is the entire point — guardrails as types, validated at plan, not as a wiki page nobody reads. The validation blocks turn “please remember to tag things” into a hard failure.

The full contract — every input the wrapper exposes, its type, whether it is validated, and why it is safe to expose:

| Input | Type | Validated? | Why it’s safe to expose |
| --- | --- | --- | --- |
| workload | string | regex ^[a-z0-9]{2,12}$ | Drives naming only; bounded charset |
| tags | map(string) | must contain costCenter, owner | Merged with platform tags; can’t drop mandatory keys |
| location | string | (optional: allow-list of regions) | Placement choice, not a security lever |
| location_short | string | (optional: regex) | Naming suffix |
| resource_group_name | string |  | Where it lands; caller owns the RG |
| tenant_id | string |  | Required by KV; not a guardrail |
| address_space | list(string) | (optional: CIDR check) | IPAM choice, governed upstream |
| pe_subnet_prefix | string | (optional: CIDR check) | Must fit inside address_space |
| log_analytics_workspace_id | string |  | Forces diagnostics to your LAW |
| kv_private_dns_zone_id | string |  | The PE zone; injecting PE needs it |

And the inputs the wrapper deliberately forbids (does not expose), with what each would let an app team do if leaked:

| Forbidden lever | What leaking it would allow | Kept as |
| --- | --- | --- |
| public_network_access_enabled | Ship a public KV / storage account | Hard-coded false in main.tf |
| shared_access_key_enabled (SA) | Re-enable stealable account keys | Hard-coded false |
| enable_telemetry | Break plans in locked subs by accident | Hard-coded false |
| diagnostic_settings | Skip the audit trail / point elsewhere | Injected → central LAW |
| private_endpoints | Deploy without a PE | Injected from the wrapper’s PE subnet |
| role_assignments (raw) | Grant arbitrary RBAC inline | Baseline injected; extensions reviewed |

The validation patterns worth standardising, with the message your colleague sees at plan:

| Validate | Condition (sketch) | Error message |
| --- | --- | --- |
| Workload name shape | can(regex("^[a-z0-9]{2,12}$", var.workload)) | “workload must be 2-12 lowercase alphanumeric chars.” |
| Mandatory tags present | contains(keys(var.tags), "costCenter") && … | “tags must include costCenter and owner.” |
| Region allow-list | contains(["eastus","centralindia"], var.location) | “location must be an approved region.” |
| PE subnet inside VNet | cidrhost(var.address_space[0], 0) != "" (+ range check) | “pe_subnet_prefix must fall inside address_space.” |
| Env name enum | contains(["dev","test","prod"], var.environment) | “environment must be dev, test, or prod.” |
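
The “range check” in the PE-subnet row is left as a sketch above. One possible way to write it is shown below, alongside the region allow-list. It makes several assumptions: Terraform 1.9 or later (so a validation may reference another variable), IPv4 only, and only the first entry of address_space is checked.

```hcl
variable "location" {
  type = string

  validation {
    condition     = contains(["eastus", "centralindia"], var.location)
    error_message = "location must be an approved region."
  }
}

variable "address_space" {
  type = list(string)
}

variable "pe_subnet_prefix" {
  type = string

  validation {
    # Containment test: the subnet mask must be at least as long as the VNet
    # mask, and the subnet's first address, re-masked with the VNet's prefix
    # length, must equal the VNet's network address.
    # try() turns a malformed CIDR into an ordinary validation failure.
    condition = try(
      tonumber(split("/", var.pe_subnet_prefix)[1]) >= tonumber(split("/", var.address_space[0])[1]) &&
      cidrhost("${cidrhost(var.pe_subnet_prefix, 0)}/${split("/", var.address_space[0])[1]}", 0) == cidrhost(var.address_space[0], 0),
      false
    )
    error_message = "pe_subnet_prefix must fall inside address_space."
  }
}
```

For address_space = ["10.20.0.0/24"], a prefix of 10.20.0.0/27 passes and 10.21.0.0/27 is rejected at plan.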

Testing modules: terraform test and Terratest

Two layers, two tools — and they answer different questions. Native terraform test answers “does the wrapper produce the right shape?” cheaply and without deploying; Terratest answers “does it actually work in Azure?” expensively and occasionally.

| Dimension | terraform test (native) | Terratest (Go) |
| --- | --- | --- |
| Altitude | plan-level (also apply if you ask) | Real apply against live Azure |
| Speed | Seconds | Minutes (deploy + destroy) |
| Cost | Free (no resources) | Real Azure spend on an ephemeral sub |
| Deploys resources? | No (for command = plan) | Yes: apply then destroy |
| Best for | Contract / shape assertions, guardrail proofs | End-to-end behaviour, real PE/DNS resolution |
| Runs in CI… | On every push / PR | Nightly / pre-release |
| Language | HCL | Go |
| Failure means | Wrapper composed the wrong shape | Azure rejected or behaviour drifted |

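Native terraform test is fast and well suited to plan-level contract assertions, with no deployment needed. The run block below feeds the wrapper a realistic set of caller inputs and asserts one guardrail; the attribute path in the condition is elided and must point at the value your module actually resolves: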

# tests/defaults.tftest.hcl
run "defaults_are_locked_down" {
  command = plan

  variables {
    workload                   = "checkout"
    location                   = "eastus"
    location_short             = "eus"
    resource_group_name        = "rg-checkout"
    tenant_id                  = "00000000-0000-0000-0000-000000000000"
    address_space              = ["10.20.0.0/24"]
    pe_subnet_prefix           = "10.20.0.0/27"
    log_analytics_workspace_id = "/subscriptions/.../workspaces/law-central"
    kv_private_dns_zone_id     = "/subscriptions/.../privateDnsZones/privatelink.vaultcore.azure.net"
    tags                       = { costCenter = "1234", owner = "team@contoso.com" }
  }

  assert {
    condition     = module.kv.... == false  # assert the resolved public-access value
    error_message = "Key Vault must never allow public network access."
  }
}

Run it with:

terraform init
terraform test

The contract assertions worth writing — each proves a guardrail holds, at plan, for free:

| Test (run block) | command | Asserts | Catches |
| --- | --- | --- | --- |
| defaults_are_locked_down | plan | KV public_network_access_enabled == false | A future edit re-exposing the vault |
| storage_forces_entra | plan | SA shared_access_key_enabled == false | Key auth creeping back in |
| diagnostics_present | plan | Each resource has a diagnostic_settings entry | Someone dropping the audit trail |
| tags_merged | plan | Output tags include managedBy + caller’s | Tag-merge logic regressions |
| mandatory_tags_rejected | plan (expect fail) | Missing costCenter fails validation | The validation block being removed |
| bad_workload_rejected | plan (expect fail) | CHECKOUT! fails the regex | Naming rule regressions |
| pe_wired_to_subnet | plan | KV PE references the pe subnet id | PE wiring breaking on refactor |

The terraform test building blocks you actually use, so you can read and write .tftest.hcl fluently:

| Construct | Goes in | Purpose | Note |
| --- | --- | --- | --- |
| run "<name>" {} | Test file | One test case (a plan or apply) | Runs in order; later runs see earlier state |
| command = plan | run block | Assert without deploying | Set it explicitly for contract tests; command defaults to apply |
| command = apply | run block | Deploy then assert (real resources) | Needs creds + a real/ephemeral sub |
| variables {} | run block | Inputs for this case | Override per-run |
| assert {} | run block | condition + error_message | The check itself |
| expect_failures | run block | Assert a validation should fail | Proves guardrails reject bad input |
| provider {} / providers | File / run | Wire/alias providers for the test | Mock or real |
| override_module {} | File / run block | Swap a child module for a stub | Isolate the unit under test (Terraform ≥ 1.7); a plain module {} block in a run instead loads an alternate module for that run |
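
To make the isolation constructs concrete, here is a hedged sketch of a test file that needs no Azure credentials. It assumes Terraform 1.7 or later (where mock_provider and override_module exist), the small four-input wrapper built in the hands-on lab, and a child module call named module.sa; the file name and the stubbed resource_id value are placeholders, not anything the wrapper ships.

```hcl
# tests/isolated.tftest.hcl  (hypothetical file name)

# Replace the real provider with a mock so the run needs no credentials.
mock_provider "azurerm" {}

# Shared inputs for every run block in this file.
variables {
  workload            = "checkout"
  location            = "eastus"
  resource_group_name = "rg-checkout"
  tags                = { costCenter = "1234", owner = "team@contoso.com" }
}

# Stub the child module so only the wrapper's own logic is exercised.
override_module {
  target = module.sa
  outputs = {
    # Placeholder value; resource_id is assumed to be an output of the pinned module.
    resource_id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-stub/providers/Microsoft.Storage/storageAccounts/ststub"
  }
}

# With no assert blocks, the run passes when the plan itself succeeds,
# which already proves the inputs satisfy every validation rule.
run "valid_inputs_plan_cleanly" {
  command = plan
}
```

Stubbing trades fidelity for speed: a test like this proves the wrapper's validation and wiring, but says nothing about what the real AVM module would plan, so keep at least one un-mocked contract test alongside it.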

Terratest (Go) is for real end-to-end validation against an ephemeral subscription — apply, assert against live Azure, destroy. Use it in CI nightly, not on every push:

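package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

// TestSpokeLandingZone applies the default example, checks one output, then destroys.
func TestSpokeLandingZone(t *testing.T) {
    opts := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../examples/default",
    })
    defer terraform.Destroy(t, opts) // runs even if an assertion fails
    terraform.InitAndApply(t, opts)

    kvURI := terraform.Output(t, opts, "key_vault_uri")
    assert.Contains(t, kvURI, "vault.azure.net")
}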

In CI, authenticate with OIDC workload identity federation (no stored secrets), and target a disposable subscription so a failed destroy never pollutes a real environment:

az login --service-principal -u "$ARM_CLIENT_ID" \
  --tenant "$ARM_TENANT_ID" --federated-token "$IDTOKEN"
export ARM_USE_OIDC=true
export ARM_SUBSCRIPTION_ID="$EPHEMERAL_SUB_ID"
cd test && go test -timeout 45m ./...

The Terratest assertions worth the spend — behaviour you cannot prove at plan:

| Terratest assertion | Proves | Why plan-level can’t catch it |
| --- | --- | --- |
| key_vault_uri contains vault.azure.net | The vault actually came up | Plan doesn’t materialise computed URIs reliably |
| Private DNS resolves the KV PE name | PE + DNS zone group wired correctly | DNS resolution is runtime behaviour |
| Storage data-plane rejects shared-key auth | Entra-only enforcement is real | Plan asserts intent, not Azure enforcement |
| Diagnostic setting visible in LAW | Logs flow to the workspace | Ingestion is runtime |
| terraform destroy leaves zero resources | Clean teardown (no orphans) | Only an apply/destroy cycle exposes orphans |

The CI auth model, made explicit — OIDC keyless is the right default; the same pattern powers GitHub Actions + Terraform OIDC plan/PR automation:

| Auth approach | Secret stored? | Blast radius | Verdict |
| --- | --- | --- | --- |
| OIDC workload identity federation | None | Short-lived, scoped token | Use this |
| Service principal + client secret | Yes (long-lived) | Leaked secret = standing access | Avoid; rotate if unavoidable |
| Managed identity (self-hosted runner) | None | Scoped to the runner identity | Good for self-hosted agents |
| Personal az login on a runner | Yes (interactive) | The human’s full access | Never in CI |

Publishing to a private registry

Wrappers are useless if teams copy-paste them. Publish them and consume by source/version. Two common backends, and the trade-off between them:

| Backend | Native registry? | Source string consumers use | Versioning mechanism | Best when |
| --- | --- | --- | --- | --- |
| Terraform Cloud / Enterprise | Yes | app.terraform.io/<org>/<name>/azurerm | Git tags (valid semver) | You’re on TFC/TFE already |
| Azure DevOps (git ref) | No | git::https://dev.azure.com/...?ref=v1.3.0 | Tag ref in the URL | Azure DevOps shop, no TFC |
| Private VCS git ref (generic) | No | git::ssh://...//module?ref=v1.3.0 | Tag ref in the URL | Any git host, lowest setup |
| Storage/HTTP archive | No | https://.../module-1.3.0.zip | Versioned artifact name | Air-gapped / artifact-store shops |

Terraform Cloud / Enterprise private registry. Modules must live in repos named terraform-<provider>-<name> and are published from git tags that are valid semver. Tag, push, and the registry ingests the version:

git tag v1.3.0
git push origin v1.3.0

Consumers then reference it through the registry hostname:

module "spoke" {
  source  = "app.terraform.io/contoso/spoke-landing-zone/azurerm"
  version = "~> 1.3.0"
  # ... only the narrow contract inputs
}

Azure DevOps. There is no native Terraform registry product, so the pragmatic pattern is consuming wrappers as versioned git sources (a tag ref) pointed at Azure Repos, fronted by a CI pipeline that runs validate/test on tag:

module "spoke" {
  source = "git::https://dev.azure.com/contoso/_git/platform-modules//spoke-landing-zone?ref=v1.3.0"
}

Consumption contract: semver is a promise. Bump patch for fixes, minor for additive inputs/outputs, major for anything that changes or removes an input or alters resource addresses. The moment you rename a wrapper variable, that’s a major — app teams pinned with ~> must opt in.

The semver decision table — what each kind of change costs in version terms:

| Change you made to the wrapper | Bump | Why | Consumer impact (~> X.Y.Z) |
| --- | --- | --- | --- |
| Fix a bug, no interface change | patch | Behaviour-preserving | Auto-picked up |
| Add a new optional input/output | minor | Additive, backward-compatible | Auto-picked up |
| Tighten a validation (stricter) | major | May reject previously-valid input | Must opt in |
| Rename / remove an input | major | Breaks callers | Must opt in |
| Change a resource address (for_each, etc.) | major (+ moved) | State migration for consumers | Must opt in; needs moved |
| Bump an internal AVM dep (no surface change) | patch/minor | Depends on AVM’s own change | Usually transparent |
| Change a default value | major | Silent behaviour change | Must opt in |
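
As an illustration of the minor-versus-major line, the fragment below sketches what each looks like in the wrapper's variables.tf. The input names extra_tags and workload_name are hypothetical, chosen only to show the shape of the change.

```hcl
# variables.tf in the wrapper (sketch)

# MINOR (for example 1.3.0 -> 1.4.0): a new optional input. Existing callers
# omit it and plan exactly as before, because the default preserves the
# current behaviour.
variable "extra_tags" { # hypothetical input name
  type        = map(string)
  default     = {}
  description = "Optional tags merged on top of the mandatory set."
}

# MAJOR (1.x -> 2.0.0): renaming an input. Callers that still pass `workload`
# fail with an unsupported-argument error until they adopt the new name.
# variable "workload_name" { type = string } # hypothetical rename
```

The test to apply before tagging: would a consumer's existing module block, unchanged, still plan to the same result? If yes it is at most a minor; if not it is a major.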

The publish pipeline gates that should run on a tag, in order:

| Stage | Command | Gate |
| --- | --- | --- |
| Format | terraform fmt -check -recursive | Block on diff |
| Validate | terraform init -backend=false && terraform validate | Block on error |
| Lint | tflint (+ ruleset) | Block on error |
| Contract tests | terraform test | Block on any failed run |
| Security scan | checkov / tfsec / trivy | Block on high severity (see scanning article) |
| Tag → publish | git tag vX.Y.Z && git push --tags | Registry ingests the version |

Migration path: replacing hand-rolled modules without state churn

The objection that kills AVM adoption: “we have hundreds of resources in state; switching modules means destroy/recreate.” It does not — if you use moved blocks. When you swap your old module "storage" for the AVM wrapper, the resource address changes (e.g. module.storage.azurerm_storage_account.this becomes module.sa.azurerm_storage_account.this[0]). Tell Terraform it’s the same object:

moved {
  from = module.storage.azurerm_storage_account.this
  to   = module.sa.azurerm_storage_account.this[0]
}

moved blocks are declarative and version-controlled — they survive across the whole team, unlike a one-off terraform state mv. For resources that AVM creates as a child but you previously managed standalone (or that exist in Azure but not in state), use an import block instead:

import {
  to = module.sa.azurerm_storage_account.this[0]
  id = "/subscriptions/<sub>/resourceGroups/rg-checkout/providers/Microsoft.Storage/storageAccounts/stcheckouteus"
}

Migrate one module type at a time, behind a PR, and read the plan. A correct migration shows the resource moving with zero destroy/create lines — only in-place diffs for AVM’s added defaults (diagnostics, etc.). The mechanism-to-situation map — pick the right tool for what you’re migrating:

| Situation | Tool | What it does | Plan should show |
| --- | --- | --- | --- |
| Same resource, address changed (your module → AVM) | moved block | Re-points state at the new address | Move, no destroy/create |
| Resource in Azure but not in Terraform state | import block | Brings the existing object under management | Import + in-place diffs |
| AVM moved a resource under for_each (internal) | moved block (indexed key) | Maps old address → keyed address | Move, no destroy/create |
| Resource truly being replaced (rename forces new) | (accept) | Genuine destroy/create | Acknowledge explicitly in the PR |
| One-off local fix, not for the team | terraform state mv (avoid) | Imperative, non-versioned | (use moved instead) |
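
Two further moved shapes come up during adoption and both are declared entirely in the calling configuration. The sketch below assumes the old call was named module.storage, the new one module.sa, and a hypothetical instance key "checkout"; the first block only helps when the resource addresses inside the two modules are identical.

```hcl
# 1) The module call was renamed and the addresses inside it are unchanged:
#    one block re-points every resource under the old call.
moved {
  from = module.storage
  to   = module.sa
}

# 2) for_each was later added to the module call itself: map the former
#    single instance onto the key it now lives under.
moved {
  from = module.sa
  to   = module.sa["checkout"] # hypothetical instance key
}
```

Chained statements like these are resolved in sequence, so state recorded under module.storage ends up at module.sa["checkout"]. Confirm in the plan that every affected resource is reported as moved, not replaced.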

The migration playbook as a table — symptom in the plan, what it means, how to confirm, and the fix:

| # | Plan symptom on adopting AVM | Root cause | Confirm with | Fix |
| --- | --- | --- | --- | --- |
| 1 | Every storage account shows destroy + create | Resource address changed (your module → AVM) | terraform plan lists -/+ ... this[0] | moved block from old → new address |
| 2 | A resource shows create though it exists in Azure | It’s in Azure but not in state | Portal/CLI shows the resource live | import block with the resource id |
| 3 | Replace appears only after a minor AVM bump | AVM moved the resource under for_each | Diff the module’s main.tf across versions | moved to the keyed address, same wrapper bump |
| 4 | Diagnostic settings show as new (in-place add) | AVM injects diagnostics you didn’t have | Plan shows + azurerm_monitor_diagnostic_setting | Expected; accept the additive diff |
| 5 | Plan errors: deployment denied | enable_telemetry = true in a locked sub | Error names Microsoft.Resources/deployments | Set enable_telemetry = false |
| 6 | RBAC assignment churns on reorder | Unkeyed role_assignments list reindexed | Plan shows delete+add of identical roles | Use the keyed map form AVM expects |
| 7 | Private endpoint shows replace | PE subnet id changed under the hood | Compare subnet_resource_id old vs new | moved the PE resource; align the subnet |
| 8 | Whole module shows replace after provider bump | Provider major changed a schema | .terraform.lock.hcl provider delta | Pin provider; migrate per the provider guide |

The CI gate that makes an unacknowledged replace impossible to merge — read every plan for delete+create and fail the build:

terraform plan -no-color -out tfplan
terraform show -json tfplan \
  | jq -e '[.resource_changes[]
            | select(.change.actions == ["delete","create"]
                  or .change.actions == ["create","delete"])] | length == 0' \
  || { echo "::error::Unacknowledged replace in plan"; exit 1; }

Architecture at a glance

The diagram traces the module supply chain left to right — the path a resource definition travels from Microsoft’s public registry to a deployed, private, tagged spoke in your subscription. Read it as five zones. At the far left, upstream AVM ships the avm-res-* bricks (pre-1.0, the version trap lives here) and the avm-ptn-* assemblies. Those bricks flow by source + version into your platform wrapper tier — the heart of the system — where the spoke-landing-zone module composes a VNet, a Key Vault and a storage account and injects the org non-negotiables: public_network_access_enabled = false, private endpoints, forced diagnostics to the central LAW, Entra-only storage auth, and enable_telemetry = false. The wrapper’s narrow variables.tf is the membrane: app teams pass a workload name and tags, nothing dangerous.

From the wrapper, the path runs through the test + release gate: terraform test for plan-level contract assertions, Terratest for a real apply/destroy against an ephemeral subscription over keyless OIDC, and the replace gate that fails CI on any unacknowledged destroy/create. Only a green build tags a version into the private registry (Terraform Cloud or a git ref), from which the rightmost zone (40+ app repos) consumes the wrapper by source/version with narrow inputs only, and terraform apply lands a spoke that is private and observable by construction. The five numbered badges mark the real hazards on this path: the ~> 0.x pin trap on the upstream brick, telemetry failing in a locked subscription, a passthrough wrapper that leaks a public lever, the state-address churn a minor bump can cause, and the brownfield-import gap when adopting AVM over existing Azure resources. Follow the numbers and you have both the architecture and the failure map in one view.

Figure: Module supply chain for an Azure Verified Modules Terraform platform layer, flowing left to right through five zones (upstream AVM bricks and pattern assemblies, the platform-team wrapper tier, the test and release gate, the private registry, and 40-plus app repos), with five numbered badges marking the hazards listed above.

Real-world scenario

Northwind Cloud Platform is the four-engineer central team behind a retailer’s Azure estate: 40+ application repos, each owning one or more spokes inside a CAF-aligned landing zone, all on Terraform with state in Azure Storage and CI in Azure DevOps. Eighteen months ago every app team hand-rolled azurerm_ resources; a security review found nine publicly-accessible storage accounts and a Key Vault open to the internet, and “fix it” meant nine separate PRs against nine codebases. The platform team’s mandate after that review: make “private, tagged, observable” the only way to ship a spoke, and make estate-wide convention changes a one-PR operation. Their answer was the spoke-landing-zone wrapper over AVM resource bricks, distributed as a versioned git ref, pinned Azure/avm-res-storage-storageaccount/azurerm at an exact 0.6.x, consumed by every repo with ?ref=v1.x.

It worked beautifully for six months — until a routine Renovate PR bumped a single AVM minor in the wrapper, and the terraform plan that CI attached showed every storage account across 40 repos scheduled for destroy/create. The exact pin had not saved them, because the upgrade itself was the breaking event: that AVM release had moved the storage account resource under a for_each map, changing its address from ...this to ...this["default"]. A naive merge — and the team’s normal flow was “Renovate is green-ish, approve” — would have nuked 40 production data planes in a single apply. The engineer reviewing it noticed the plan was suspiciously long, scrolled, and saw the -/+ lines. That was the whole margin: one human reading a plan.

The breakthrough was reframing the problem. The issue was never “which version” — exact pins control when you take an upgrade, not whether it is safe. The issue was that an AVM bump in a shared wrapper is, structurally, a state migration, and they had been treating it as a dependency bump. The fix was to absorb the address change inside the wrapper with a moved block, shipped in the same version bump so all 40 consumers inherited it transparently:

moved {
  from = module.sa.azurerm_storage_account.this
  to   = module.sa.azurerm_storage_account.this["default"]
}

Then they made this class of failure impossible to miss rather than relying on a tired reviewer: a CI step that parses terraform show -json and fails the build if the plan contains any destroy/create not explicitly acknowledged in the PR description. They also moved the wrapper’s exact-pin bumps behind a dedicated review checklist (“is this AVM minor a possible address change? diff the module’s main.tf”) and added a nightly Terratest run against an ephemeral subscription so behaviour regressions surfaced before a tag, not after.

Six months on, the estate is in a different posture. A new compliance rule (force customer-managed keys on storage) landed as one wrapper PR with a reviewed plan and propagated to all 40 repos as each bumped its ?ref= pin to the new v1.x tag, with the replace-gate guaranteeing zero data-plane loss. The lesson on their wall: “Every AVM bump in a shared wrapper is a state migration until a moved-aware plan proves otherwise.” The incident, as a timeline, because the order of moves is the lesson:

| Time | Event | Action taken | Effect | What it should have been |
| --- | --- | --- | --- | --- |
| Day 0 | Renovate bumps an AVM minor | (PR opened, CI green-ish) | | Treat every AVM bump as a possible migration |
| Day 0 +5 min | Plan attached to PR | Reviewer scrolls the long plan | Spots 40× destroy/create | The save, but luck, not process |
| Day 0 +20 min | Root cause found | Diff the module main.tf across versions | Resource moved under for_each | |
| Day 0 +1 h | Fix drafted | Add moved block in the same wrapper bump | Plan now shows move, zero replace | Correct fix |
| Day 1 | Shipped | Tag v1.4.0; consumers inherit transparently | 40 repos migrate with no churn | |
| Day 2 | Hardened | Add jq replace-gate to CI | Unacknowledged replace can’t merge | The durable fix |
| +1 week | Institutionalised | Nightly Terratest + AVM-bump checklist | Regressions caught pre-tag | The process change |

Advantages and disadvantages

The wrap-don’t-fork, AVM-as-substrate model both enables a real platform layer and carries sharp edges you must respect. Weigh it honestly:

| Advantages (why this model helps) | Disadvantages (why it bites) |
| --- | --- |
| One shared spec means generic org policy (force diagnostics/PE everywhere) instead of bespoke per-resource wiring | The shared spec is still pre-1.0: ~> semantics are inverted and the trap is in the shared wrapper |
| WAF-aligned defaults: you inherit good defaults instead of the bare minimum that compiles | “Good defaults” still aren’t your defaults; you must inject org policy, or it’s just a fancier resource |
| Estate-wide convention changes become one pinned-version wrapper PR | A bad wrapper bump has a 40-repo blast radius; discipline is non-optional |
| Migration is non-destructive with moved/import: adopt over existing state, zero recreate | A wrong moved target silently destroys and recreates; you must read the plan |
| Guardrails as types (validation, omitted levers) catch violations at plan, not in a quarterly review | Narrowing the surface is work; the lazy path (passthrough) gives none of the value |
| Upstream fixes flow to you for free because you compose, don’t fork | You’re coupled to AVM’s release cadence and its internal address choices |
| Keyed maps (role_assignments, diagnostic_settings) survive reordering, with no index churn | Telemetry’s empty deployment fails plans in locked subscriptions until you set it false |
| Two-tier testing (terraform test + Terratest) proves both shape and behaviour | Terratest costs real Azure spend and time; you must run it judiciously |

The model is right when you operate Terraform at scale (many repos, many spokes) and need conventions enforced by construction rather than by review. It is overkill for a single-team, handful-of-resources estate where a wrapper is more ceremony than value — there, consume AVM resource modules directly. It bites hardest on teams that pin pre-1.0 modules with ~>, that build passthrough “wrappers” with no injected policy, and that merge AVM bumps without reading the plan. Every one of those is a manageable failure — but only if you know it exists, which is the point of the deep sections above.

Hands-on lab

Build a minimal spoke-landing-zone wrapper over an AVM brick, prove its guardrail holds at plan with terraform test, and prove the ~> 0.x trap is real — all without deploying a thing (free). Run in any shell with Terraform ≥ 1.7 (for terraform test) and the azurerm provider available. No Azure spend: every step is plan-level.

Step 1 — Scaffold the wrapper.

mkdir -p spoke-landing-zone/tests && cd spoke-landing-zone
cat > versions.tf <<'EOF'
terraform {
  required_version = ">= 1.7.0"
  required_providers {
    azurerm = { source = "hashicorp/azurerm", version = "~> 4.0" }
  }
}
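provider "azurerm" {
  features {}
  # plan-only, but azurerm 4.x still expects ARM_SUBSCRIPTION_ID and a credential (for example az login) at plan time
  resource_provider_registrations = "none"  # 4.x replacement for the deprecated skip_provider_registration
}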
EOF

Step 2 — A narrow contract with a validated input. This is the membrane app teams see.

cat > variables.tf <<'EOF'
variable "workload" {
  type = string
  validation {
    condition     = can(regex("^[a-z0-9]{2,12}$", var.workload))
    error_message = "workload must be 2-12 lowercase alphanumeric chars."
  }
}
variable "location"            { type = string }
variable "resource_group_name" { type = string }
variable "tags" {
  type = map(string)
  validation {
    condition     = contains(keys(var.tags), "costCenter") && contains(keys(var.tags), "owner")
    error_message = "tags must include costCenter and owner."
  }
}
EOF

Step 3 — Compose one AVM brick with injected, non-negotiable policy. A storage account: public off, Entra-only, telemetry off — none of it exposed to the caller.

cat > main.tf <<'EOF'
locals { base_tags = merge(var.tags, { managedBy = "platform-team" }) }

module "sa" {
  source  = "Azure/avm-res-storage-storageaccount/azurerm"
  version = "0.6.4"   # exact pin — the wrapper absorbs upgrade risk deliberately

  name                = "st${var.workload}eus"
  resource_group_name = var.resource_group_name
  location            = var.location
  tags                = local.base_tags

  enable_telemetry              = false   # injected, not exposed
  public_network_access_enabled = false   # injected — the lever app teams DON'T get
  shared_access_key_enabled     = false   # force Entra auth
}
EOF
terraform init

Expected: terraform init downloads Azure/avm-res-storage-storageaccount/azurerm at 0.6.4 and the azurerm provider; “Terraform has been successfully initialized!”.
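
Step 3 hard-codes the eus suffix in the storage account name, which is only right for eastus. A small lookup keeps the name in step with var.location; this is a sketch, and the short codes below are illustrative assumptions (cin in particular), not an official abbreviation list.

```hcl
locals {
  # Hypothetical short codes; substitute your organisation's naming standard.
  region_short = {
    eastus       = "eus"
    centralindia = "cin"
  }

  # Indexing the map fails at plan for any region missing from it, so the
  # lookup doubles as a region allow-list.
  sa_name = "st${var.workload}${local.region_short[var.location]}"
}
```

To use it, set name = local.sa_name in the module "sa" block. With the workload capped at 12 characters, the result stays within the 24-character limit for storage account names.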

Step 4 — Prove the guardrail at plan with a contract test.

cat > tests/defaults.tftest.hcl <<'EOF'
run "storage_is_locked_down" {
  command = plan
  variables {
    workload            = "checkout"
    location            = "eastus"
    resource_group_name = "rg-checkout"
    tags                = { costCenter = "1234", owner = "team@contoso.com" }
  }
  assert {
    condition     = module.sa.... == false   # resolve the public-access output your AVM version exposes
    error_message = "Storage must never allow public network access."
  }
}
EOF
terraform test

Expected: terraform test runs the run block at plan level and reports the assertion result — no resources created. (Adjust the module.sa.... reference to the actual output your pinned AVM version surfaces; terraform output/the module’s outputs.tf tells you the name.)

Step 5 — Prove the validation rejects bad input. Feed an illegal workload name and watch it fail at plan, not in production.

terraform plan -var 'workload=CHECKOUT!' \
  -var 'location=eastus' -var 'resource_group_name=rg-x' \
  -var 'tags={costCenter="1",owner="a@b.com"}'
# Expected: Error — "workload must be 2-12 lowercase alphanumeric chars."
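
The same rejection can be captured as a repeatable contract test, so CI proves the guardrail on every push instead of relying on a manual plan. A minimal sketch, assuming the lab's four-input wrapper; the file name is arbitrary, and like Step 4 the run still initialises the azurerm provider.

```hcl
# tests/validation.tftest.hcl  (hypothetical file name)

# Valid baseline shared by every run; each run overrides one input.
variables {
  workload            = "checkout"
  location            = "eastus"
  resource_group_name = "rg-checkout"
  tags                = { costCenter = "1234", owner = "team@contoso.com" }
}

run "bad_workload_rejected" {
  command = plan

  variables {
    workload = "CHECKOUT!" # violates ^[a-z0-9]{2,12}$
  }

  # The run passes only if this variable's validation fails.
  expect_failures = [var.workload]
}

run "mandatory_tags_rejected" {
  command = plan

  variables {
    tags = { costCenter = "1234" } # owner is missing
  }

  expect_failures = [var.tags]
}
```

If someone later deletes a validation block, the matching run stops failing as expected and terraform test reports it, which is exactly the regression these two tests exist to catch.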

Step 6 — Prove the ~> 0.x trap is real. Loosen the pin and watch init -upgrade reach for a higher minor than you intended.

# Temporarily change the version line to the DANGEROUS form and re-init:
#   version = "~> 0.6"     # allows 0.7.0, 0.8.0, ... a BREAKING minor
sed -i.bak 's/version = "0.6.4"/version = "~> 0.6"/' main.tf
terraform init -upgrade
# Read which version actually resolved:
jq -r '.Modules[] | select((.Source // "") | test("avm-res-storage")) | "\(.Key) \(.Version)"' \
  .terraform/modules/modules.json
# Restore the safe exact pin:
mv main.tf.bak main.tf && terraform init -upgrade

The point: ~> 0.6 silently admits a breaking 0.7.x/0.8.x; only 0.6.4 or ~> 0.6.4 holds the line.
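
For reference, the three pin forms side by side, with the range each one expands to. This is a fragment of the Step 3 module block, not a complete configuration.

```hcl
module "sa" {
  source = "Azure/avm-res-storage-storageaccount/azurerm"

  # version = "~> 0.6"    # >= 0.6.0, < 1.0.0 : admits 0.7.0, 0.8.0, ...
  # version = "~> 0.6.4"  # >= 0.6.4, < 0.7.0 : patch releases only
  version = "0.6.4"       # exactly one release; upgrades happen only by PR

  # ... remaining arguments as in Step 3
}
```

The pessimistic operator lets only the rightmost component you wrote increase, so the number of segments in the constraint decides how far it can drift.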

Validation checklist. You built a wrapper that injects three security non-negotiables the caller cannot override, proved the guardrail at plan with terraform test (no deploy), proved the validation block rejects bad naming, and demonstrated the pre-1.0 ~> trap first-hand. The lab steps mapped to what each proves:

| Step | What you did | What it proves | Real-world analogue |
| --- | --- | --- | --- |
| 3 | Inject public_network_access_enabled=false | The lever app teams don’t get | Every guarded brick in the wrapper |
| 4 | terraform test the locked-down default | Guardrails verified at plan, for free | The CI contract suite |
| 5 | Feed an illegal workload | Conventions are hard failures, not wiki text | Naming policy enforced as a type |
| 6 | Loosen to ~> 0.6, re-init | The pre-1.0 ~> trap is real | The Renovate-bump near-miss |

Cleanup. No Azure resources were created (everything was plan-level), so just remove the directory.

cd .. && rm -rf spoke-landing-zone

Cost note. Zero — every command is init/plan/test, which create nothing in Azure. (A real Terratest run would cost a few rupees of ephemeral-subscription spend for the minutes resources exist; this lab deliberately avoids apply.)

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table, then the entries that bite hardest with the full reasoning underneath.

| # | Symptom | Root cause | Confirm (exact cmd / check) | Fix |
| --- | --- | --- | --- | --- |
| 1 | A breaking module version got pulled despite ~> | Pre-1.0: ~> 0.9 admits 0.10.0 | Read resolved version in .terraform/modules/modules.json / lock | Pin ~> 0.9.1 (three-part) or exact in wrappers |
| 2 | terraform plan errors on a deployment resource | enable_telemetry=true in a policy-locked sub | Plan error names Microsoft.Resources/deployments | Set enable_telemetry=false in every wrapper |
| 3 | App team shipped a public Key Vault / storage | Wrapper exposes the lever (passthrough) | grep wrapper for public_network_access_enabled | Remove the input; hard-code false in main.tf |
| 4 | Plan shows 40× destroy/create after a bump | AVM moved the resource under for_each | terraform show -json → actions ["delete","create"] | moved block to the keyed address, same wrapper bump |
| 5 | A resource shows create but exists in Azure | In Azure, not in Terraform state | Portal/CLI shows the live resource | import block with the resource id |
| 6 | RBAC assignments churn every plan | Unkeyed role_assignments list reindexed | Plan shows delete+add of identical roles | Use the keyed map form AVM expects |
| 7 | version constraint won’t resolve | Constraint impossible (e.g. >= 0.9, < 0.9) | terraform init “no available releases” error | Fix the range; check the registry for real versions |
| 8 | Consumer can’t find the published module | Repo not named terraform-<provider>-<name> or no semver tag | Registry shows no versions | Rename repo; push a valid vX.Y.Z tag |
| 9 | terraform test passes but apply fails in Azure | Plan-level test can’t catch runtime behaviour | Terratest apply surfaces the real error | Add a Terratest assertion for that behaviour |
| 10 | Wrapper variables.tf is huge | It mirrors the AVM surface (no narrowing) | Count inputs vs the AVM module’s | Narrow deliberately; inject the rest |
| 11 | OIDC login fails in CI | Federated credential / subject mismatch | az login --federated-token error | Fix the federated credential subject/audience |
| 12 | Provider major bump replaces everything | azurerm v3→v4 schema change | .terraform.lock.hcl provider delta | Pin provider; follow the provider upgrade guide |
| 13 | moved block did nothing (still replaces) | from/to address wrong | Plan still shows -/+ | Correct the exact source/target address strings |
| 14 | Renovate raises no AVM PRs | packageRules pattern doesn’t match | Renovate logs / dry-run | Fix matchPackageNames to /^Azure/avm-/ |

The expanded form, for the entries that cause the most damage:

1. A breaking module version got pulled despite a ~> constraint. Root cause: The module is pre-1.0 and ~> 0.9 expands to >= 0.9.0, < 1.0.0, so it admits a breaking 0.10.0 because AVM treats the minor segment as breaking below 1.0. Confirm: Read the resolved version in .terraform/modules/modules.json (or the registry-backed lock) and compare to what you intended. Fix: In wrappers, pin exact (version = "0.9.1"); if you must allow drift, use the three-part ~> 0.9.1 (allows 0.9.x, blocks 0.10.0). Never ~> 0.9 on a 0.x module.

2. terraform plan fails on a deployment resource in a locked subscription. Root cause: enable_telemetry = true (the AVM default) deploys a tiny empty Microsoft.Resources/deployments; in subscriptions where that operation is policy-denied, the plan fails with a confusing error that doesn’t obviously point at telemetry. Confirm: The plan/apply error names a Microsoft.Resources/deployments operation being denied by policy. Fix: Bake enable_telemetry = false into every wrapper, decided once org-wide — not per call.

3. An app team shipped a publicly-exposed Key Vault or storage account. Root cause: The wrapper exposed public_network_access_enabled (a passthrough), so the team could set it true — the wrapper added no guardrail. Confirm: grep -r public_network_access_enabled in the wrapper finds it in variables.tf (exposed) rather than only hard-coded in main.tf. Fix: Remove the input from variables.tf; hard-code public_network_access_enabled = false in main.tf. Guardrails are the levers you omit. Back it with an Azure Policy deny for defence in depth.

4. After a minor AVM bump, the plan shows every instance of a resource scheduled for destroy/create. Root cause: The AVM release changed the resource’s address (typically moving it under a for_each map), so Terraform sees the old address gone and a new one created. That is a destroy/create, which for data-bearing resources such as storage accounts means losing the data. Confirm: terraform show -json tfplan | jq '.resource_changes[].change.actions' shows ["delete","create"]. Fix: Add a moved block from the old address to the new keyed address, shipped in the same wrapper version so consumers inherit it transparently; gate CI to reject unacknowledged replaces.

5. A resource shows create in the plan even though it already exists in Azure. Root cause: The resource exists in Azure but is not in Terraform state (created out-of-band, or being adopted into AVM as a child it didn’t manage before). Confirm: The portal/CLI shows the resource live; terraform state list doesn’t include it. Fix: Use an import block (to = the AVM resource address, id = the Azure resource id) so Terraform adopts it instead of creating a duplicate; read the plan for in-place-only diffs.

6. RBAC role assignments churn (delete + re-add identical roles) on every plan. Root cause: role_assignments passed as an unkeyed list gets reindexed when the order changes, so Terraform sees deletes and adds of the same assignments. Confirm: The plan shows azurerm_role_assignment deletes and creates with identical role/scope. Fix: Pass role_assignments as the keyed map AVM expects, so add/remove never reindexes the survivors.

10. The wrapper’s variables.tf is nearly a copy of the AVM module’s inputs. Root cause: The “wrapper” is a passthrough — it forwards the full AVM surface, so it provides no narrowing and no guardrails (the entire reason it exists). Confirm: The input count roughly matches the AVM module’s, and security levers (public_network_access_enabled, shared_access_key_enabled) appear in variables.tf. Fix: Narrow deliberately to the small contract app teams need; inject the rest as constants. A platform layer is defined by what it refuses to expose.
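
For entry 6, the keyed form looks roughly like the fragment below. It is a sketch that assumes the role_assignments interface AVM resource modules commonly expose (role_definition_id_or_name and principal_id per entry); the map keys and the two principal-id variables are hypothetical, so check the pinned module's variables.tf for the exact object shape.

```hcl
module "sa" {
  source  = "Azure/avm-res-storage-storageaccount/azurerm"
  version = "0.6.4"
  # ... other arguments as in the wrapper

  role_assignments = {
    # Keys are arbitrary but must stay stable: Terraform tracks each
    # assignment by its key, so adding or removing one entry leaves the
    # others untouched.
    blob_reader_app = {
      role_definition_id_or_name = "Storage Blob Data Reader"
      principal_id               = var.app_principal_id # hypothetical input
    }
    blob_contributor_ci = {
      role_definition_id_or_name = "Storage Blob Data Contributor"
      principal_id               = var.ci_principal_id # hypothetical input
    }
  }
}
```

Renaming a key is itself an address change, so treat key names as part of the wrapper's state contract and pick them once.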

Best practices

The practices as a pre-flight checklist for any new wrapper:

| Check | Pass criterion | Why it matters |
|---|---|---|
| AVM deps pinned exactly | No ~> 0.x anywhere in the wrapper | Avoids the breaking-minor trap |
| enable_telemetry injected false | Not exposed; constant in main.tf | No plan failure in locked subs |
| Security levers omitted | public_network_access_enabled etc. not in variables.tf | Guardrail by construction |
| Diagnostics + PE injected | Forced to central LAW / PE subnet | Observability + isolation non-optional |
| Naming + tags validated | validation blocks present | Conventions are hard failures |
| Contract tests exist | terraform test covers the guardrails | Regressions caught at plan |
| Replace gate in CI | jq check fails on destroy/create | No accidental data-plane loss |
| Published by semver | Tag + registry/git ref, not copy-paste | One bump propagates everywhere |
| Lock file committed | .terraform.lock.hcl in VCS | Reproducible provider versions |
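
Several rows of the checklist can be seen together in a small wrapper skeleton. The sketch below is an assumption-laden illustration, not a reference implementation: the input names workload and tags follow the contract described in this article, while the mandatory tag keys (owner, cost_center), the naming rule, the st prefix and the version number 0.9.1 are placeholders. The AVM input names shown are the ones this article refers to; check them against the module version you actually pin. Private endpoint and diagnostics injection are left out to keep the sketch short.

```hcl
# variables.tf: the narrow contract app teams see. No security levers here.
variable "workload" {
  type        = string
  description = "Short workload name used to derive resource names."

  validation {
    condition     = can(regex("^[a-z0-9]{3,12}$", var.workload))
    error_message = "workload must be 3-12 lowercase letters or digits."
  }
}

variable "tags" {
  type        = map(string)
  description = "Resource tags; owner and cost_center are mandatory (hypothetical keys)."

  validation {
    condition     = alltrue([for k in ["owner", "cost_center"] : contains(keys(var.tags), k)])
    error_message = "tags must include the keys owner and cost_center."
  }
}

variable "resource_group_name" {
  type = string
}

variable "location" {
  type = string
}

# main.tf: AVM brick pinned exactly, guardrails injected as constants.
module "storage" {
  source  = "Azure/avm-res-storage-storageaccount/azurerm"
  version = "0.9.1" # exact pin; placeholder version number

  name                = "st${var.workload}" # hypothetical naming convention
  resource_group_name = var.resource_group_name
  location            = var.location
  tags                = var.tags

  # Not exposed as variables, so a caller cannot override them.
  enable_telemetry              = false
  public_network_access_enabled = false
  shared_access_key_enabled     = false
}
```

The design choice is that the guardrails are literals in main.tf. There is no variable for an app team to flip, so the checklist row "Security levers omitted" is satisfied by the absence of code, not by a review step.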

Security notes

The security controls mapped to the threat each removes and the policy backstop:

| Control (in the wrapper) | Threat removed | Azure Policy backstop |
|---|---|---|
| Omit public_network_access_enabled | Internet-exposed KV/SA | Deny public network access |
| shared_access_key_enabled = false | Long-lived account-key theft | Deny storage key access |
| Injected private_endpoints | Data/secrets off the backbone | Audit/deny resources without PE |
| Injected diagnostic_settings → LAW | Unaudited resource | DeployIfNotExists diagnostics |
| OIDC keyless CI | Stolen long-lived pipeline secret | (conditional access on the identity) |
| Disposable test subscription | Blast radius of a CI compromise | Management-group scoping |
| Baseline keyed role_assignments | Over-broad inline RBAC | Deny role assignments at wrong scope |
| Supply-chain scan in publish CI | Shipping a misconfigured wrapper | (gate is the policy here) |

Cost & sizing

There is no per-hour charge for “an AVM module” — the cost story here is operational spend plus the resources your wrappers deploy, and the way the platform layer saves money is by making convention changes one PR instead of forty. The drivers, what each costs, and how the platform layer moves the number:

| Cost driver | What you pay for | Rough INR / month | How the platform layer affects it |
|---|---|---|---|
| Terratest on an ephemeral subscription | Minutes of real resources during apply→destroy | ~₹500–2,000 (nightly, small spokes) | Keep spokes minimal; destroy reliably; run nightly, not per-push |
| CI compute (plan/test/publish) | Pipeline minutes | Often free tier / ~₹0–1,000 | Plan-level terraform test is cheap; gate Terratest to nightly |
| Terraform Cloud / Enterprise | Per-user or per-run, if used | Varies (free tier exists) | Optional; git-ref distribution is ₹0 |
| Private registry storage | Negligible (git tags) | ~₹0 | Tags cost nothing; storage/HTTP archive is tiny |
| The deployed spoke itself | VNet (free), KV (per-op), SA, PE | Per the resources (PE ~₹600–900/PE/mo) | Wrapper standardises sizing; PEs add a per-endpoint hourly charge |
| Renovate (self-hosted or app) | Compute / free GitHub app | ~₹0 | Saves engineer-hours chasing bumps |
| Engineer time (the real cost) | Hours per estate-wide change | (the big one) | One wrapper PR vs 40 hand-edits: the whole ROI |

Right-sizing guidance: the only recurring infra cost the platform layer adds is the ephemeral-subscription Terratest spend — keep example spokes minimal (one VNet, one KV, one SA, the PEs under test) and ensure destroy is reliable (the replace-gate and a defer terraform.Destroy prevent orphans that quietly bill). Private endpoints are the one line item to watch in the deployed spoke: each PE carries a small hourly charge plus per-GB processing, so don’t inject PEs on services that don’t need them. Everything else — the registry (git tags), CI plan/test, Renovate — is effectively free. The justification is engineer-time: a single compliance change (force CMK, deny public) that used to be 40 PRs becomes one reviewed wrapper bump, and the replace-gate makes that bump safe — which is worth far more than the few hundred rupees of nightly test spend.

A rough monthly picture for a 40-repo estate: nightly Terratest (~₹1,000–2,000), CI (free tier to ~₹1,000), registry/Renovate (~₹0), plus whatever the spokes themselves cost (dominated by PEs and the storage/KV operations, not the platform layer). The platform layer’s line on the bill is small; its line on the risk and engineer-hours ledger is where it pays for itself.

Interview & exam questions

1. What is the difference between an AVM resource module and a pattern module, and where does your platform wrapper fit? A resource module (avm-res-*) provisions one logical resource plus its directly-dependent children; a pattern module (avm-ptn-*) provisions a whole multi-resource architecture. Resource modules are LEGO bricks, pattern modules are pre-built assemblies. Your wrapper is a third tier — your own pattern module composed from AVM resource bricks that injects your org’s non-negotiables. You wrap, not fork.

2. Why is version = "~> 0.9" dangerous for an AVM module, and what should you use instead? AVM modules are pre-1.0, and AVM treats the minor segment as breaking below 1.0. ~> 0.9 expands to >= 0.9.0, < 1.0.0, so it admits a breaking 0.10.0. In wrappers, pin exact (0.9.1); if you need patch drift, use the three-part ~> 0.9.1 (allows 0.9.x, blocks 0.10.0).

3. What is enable_telemetry and why does it sometimes break a plan? AVM modules deploy a tiny, empty Microsoft.Resources/deployments whose name encodes module + version, used to measure usage — it sends no resource data. In subscriptions where that deployment operation is policy-denied, the plan fails with a confusing error. Set enable_telemetry = false once, org-wide, in your wrappers.

4. What makes a wrapper a real platform layer rather than a passthrough? What it does not expose. A passthrough forwards the full AVM surface, so an app team can still ship a public Key Vault. A platform layer exposes a narrow, validated contract (workload, tags, central LAW id) and injects the rest (public_network_access_enabled = false, forced PE/diagnostics, telemetry off) as constants the caller cannot override — guardrails as types, validated at plan.

5. How do you migrate a hand-rolled module to AVM without destroying resources? Use a moved block to re-point state from the old resource address to the new AVM address (the address changes, the object doesn’t), and an import block for resources that exist in Azure but not in state. Migrate one module type per PR and read the plan — a correct migration shows moves and in-place diffs with zero destroy/create.

6. A minor AVM bump in a shared wrapper shows every storage account scheduled for destroy/create. What happened and how do you fix it? The AVM release changed the resource’s address (moved it under a for_each map), so Terraform sees the old address removed and a new one created. Absorb the change with a moved block to the new keyed address, shipped in the same wrapper version so consumers inherit it transparently, and gate CI to reject unacknowledged replaces.

7. Why pin AVM exactly in wrappers but ~> X.Y.Z in app repos? The wrapper is where you absorb upgrade risk deliberately, in a reviewed PR with a plan diff, so it takes exact pins. Your wrappers are semver-disciplined, so app repos can safely use ~> X.Y.Z on your wrapper and inherit the AVM versions you chose, getting your patch releases automatically without ever pinning AVM directly. Note that the three-part form admits patches only (>= X.Y.Z, < X.(Y+1).0); an app repo that should also take your non-breaking minors needs the two-part ~> X.Y.

8. When do you use terraform test versus Terratest? terraform test runs plan-level (or apply) assertions in-process — fast, free, no deploy — perfect for contract/shape checks (“does the wrapper produce the locked-down shape?”) on every push. Terratest runs a real apply/assert/destroy against an ephemeral subscription — slow and costs spend — for behaviour you can’t see at plan (PE/DNS resolution, Entra-only enforcement); run it nightly.

9. How do you authenticate Terraform CI to Azure without storing secrets? OIDC workload identity federation: the CI system presents a short-lived federated token (az login --federated-token, ARM_USE_OIDC=true) scoped to a specific repo/branch/environment, so there’s no long-lived service-principal secret to leak. Target a disposable subscription for any apply.

10. What semver bump does renaming a wrapper input require, and why? A major — renaming or removing an input is a breaking change to the contract; app teams pinned with ~> X.Y.Z won’t pick it up until they opt in. The same applies to changing a resource address (also needs a moved), tightening a validation, or changing a default value.

11. How does Renovate fit the AVM upgrade workflow? Renovate understands Terraform registry sources natively. A packageRule matching /^Azure/avm-/ groups AVM bumps into one PR on a schedule; CI attaches the terraform plan so each upgrade is a single reviewable unit — you review upgrades instead of chasing them, and the replace-gate guards the merge.

12. Why layer Azure Policy under the wrapper if the wrapper already enforces guardrails? Defence in depth. The wrapper enforces at author time for anything provisioned through it — but resources created out-of-band (portal, another tool, a non-wrapper module) bypass it. Azure Policy deny/deployIfNotExists enforces at the platform regardless of how a resource was created, catching what the wrapper can’t see.
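
The plan-level layer from question 8 can be sketched as a terraform test file. This is a minimal sketch under assumptions: it presumes the wrapper declares workload, tags, resource_group_name and location inputs, with validation blocks on workload and tags (hypothetical names and rules), and it uses mock_provider, which needs Terraform 1.7 or later. Any other providers the pinned AVM module pulls in may need mocking too.

```hcl
# tests/contract.tftest.hcl
# Mocking azurerm lets the plan run without credentials or a real subscription.
mock_provider "azurerm" {}

# A caller that omits a mandatory tag must be rejected at plan.
run "rejects_missing_mandatory_tag" {
  command = plan

  variables {
    workload            = "payments"
    resource_group_name = "rg-contract-test"
    location            = "centralindia"
    tags                = { owner = "team-a" } # cost_center deliberately missing
  }

  # The run passes only if the validation on var.tags fails.
  expect_failures = [var.tags]
}

# A workload name that breaks the naming convention must be rejected too.
run "rejects_invalid_workload_name" {
  command = plan

  variables {
    workload            = "Payments_API" # upper case and underscore are not allowed
    resource_group_name = "rg-contract-test"
    location            = "centralindia"
    tags                = { owner = "team-a", cost_center = "cc-1001" }
  }

  expect_failures = [var.workload]
}
```

Running terraform test from the wrapper's root picks up files under tests/. Because both runs stop at plan, they are cheap enough to run on every push, leaving Terratest for the nightly apply/destroy.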

These map to the HashiCorp Terraform Associate (modules, version constraints, state, moved/import) and Azure platform/DevOps exams: AZ-400 (IaC, release pipelines, secure CI) and AZ-104/AZ-305 (governance, landing zones, Policy). A compact cert mapping for revision:

| Question theme | Primary cert | Objective area |
|---|---|---|
| Module classes, composition, wrapping | Terraform Associate | Use and create modules |
| Version constraints (~>, pre-1.0) | Terraform Associate | Module versioning & sources |
| moved / import migration | Terraform Associate | State & refactoring |
| terraform test / Terratest | Terraform Associate / AZ-400 | Testing IaC; CI |
| OIDC keyless CI, replace-gate | AZ-400 | Secure pipelines; release gates |
| Guardrails, Policy backstop, landing zones | AZ-305 / AZ-104 | Governance & design |

Quick check

  1. You pin an AVM module with version = "~> 0.9". A teammate’s terraform init -upgrade pulls 0.10.0 and the plan goes haywire. Why, and what should the constraint have been?
  2. A terraform plan fails in a locked-down subscription with an error about a Microsoft.Resources/deployments being denied. What AVM setting is the likely cause and what’s the fix?
  3. True or false: the more inputs your wrapper’s variables.tf exposes, the more useful it is to app teams.
  4. After a Renovate AVM bump in a shared wrapper, the plan shows 40 storage accounts as destroy/create. Name the root cause and the two-part fix.
  5. You’re adopting AVM over a storage account that already exists in Azure but isn’t in Terraform state. Which block do you use, and what should a correct plan show?

Answers

  1. The module is pre-1.0, and ~> 0.9 expands to >= 0.9.0, < 1.0.0; because AVM treats the minor segment as breaking below 1.0, that admits a breaking 0.10.0. The constraint should have been exact (0.9.1) in a wrapper, or the three-part ~> 0.9.1 (allows 0.9.x, blocks 0.10.0).
  2. enable_telemetry = true (the AVM default) — it deploys a tiny empty Microsoft.Resources/deployments, which a policy-locked subscription denies, failing the plan. Fix: set enable_telemetry = false in the wrapper, once, org-wide.
  3. False. A wrapper’s value is what it doesn’t expose. Exposing the full AVM surface makes it a passthrough with no guardrails — an app team could ship a public Key Vault. Expose a narrow, validated contract and inject the rest.
  4. Root cause: the AVM bump changed the resource’s address (moved it under a for_each map), so Terraform plans destroy+create. Fix: (a) add a moved block to the new keyed address in the same wrapper version, and (b) gate CI to reject any unacknowledged destroy/create in the plan.
  5. Use an import block (to = the AVM resource address, id = the Azure resource id). A correct plan shows the resource imported with only in-place diffs (e.g. AVM’s added diagnostics) and zero destroy/create.
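
On the consuming side, the pin rules from answer 1 look roughly like this in an app repo. The registry address, wrapper name, version number and input values are all hypothetical; the point is only the shape of the constraint and the absence of any direct AVM reference.

```hcl
# App repo: consumes the platform wrapper and never pins AVM directly.
module "storage" {
  # Hypothetical private-registry address; substitute your own.
  source = "registry.example.com/platform/storage/azurerm"

  # Three-part pessimistic constraint: >= 2.3.1, < 2.4.0 (patch releases only).
  # The two-part "~> 2.3" would mean >= 2.3.0, < 3.0.0 (minors as well).
  version = "~> 2.3.1"

  # The narrow contract: no network, key-access or telemetry levers exist here.
  workload            = "payments"
  resource_group_name = "rg-payments-prod"
  location            = "centralindia"

  tags = {
    owner       = "team-payments"
    cost_center = "cc-1001"
  }
}
```

The same two-part form is the one that bites on a pre-1.0 AVM module, because there the minor segment is the breaking one; on a post-1.0, semver-disciplined wrapper it is a reasonable choice.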

Glossary

Next steps

You can now build, guard, test, distribute and safely upgrade an AVM-based Terraform platform layer. Build outward:
