Azure Policy as Code: A Git-Driven Governance Pipeline

Portal-clicked policy is governance you cannot review, diff, or roll back. A rule assigned by hand in a blade has no pull request, no reviewer, and no recorded why — and the day an auditor asks “who approved exempting this subscription from disk encryption, and when does the waiver expire?”, clicking through blades does not produce an answer. The fix is to treat policy the way you treat infrastructure: definitions, initiatives, and assignments live in Git, get tested in a pipeline, and deploy to management groups through a promotion ring. This guide builds that pipeline end to end with EPAC (Enterprise Policy as Code), and it handles the part the quickstarts skip — remediating thousands of existing resources without melting the ARM control plane.

You will learn the four-object model (definition, initiative, assignment, exemption) and why keeping them separate is the entire discipline; how to choose an effect without either blocking legitimate deploys or auditing forever; how to validate cheaply with lint → What-If → an Audit ring before you ever enforce Deny; and how to run remediation tasks in throttled, per-landing-zone batches so a 429 Too Many Requests storm never leaves you with a half-fixed fleet. Because this is a reference you will return to mid-rollout, the effects, the modes, the EPAC commands, the pipeline gates, the failure modes and the cost drivers are all laid out as scannable tables — read the prose once, then keep the tables open while the pipeline runs.

By the end you will stop governing by mouse. When a control needs to change you will open a PR, read the What-If, watch the Audit blast radius in a sandbox management group, promote the same definition outward by flipping one parameter, and prove Git and Azure are in sync with a clean EPAC plan that reports no changes. That last property — a no-op plan as the definition of “in sync” — is what separates a governed estate from a pile of orphaned assignments nobody can account for.

What problem this solves

Governance that lives only in the portal rots in three predictable ways. It is unreviewable: there is no diff showing that someone downgraded a deny to an audit, no approver on the change, no commit message explaining the threshold. It is undeployable: you cannot stamp the same baseline across 600 subscriptions by hand without drift creeping in, and you certainly cannot recreate it after a tenant rebuild. It is unaccountable: exemptions become permanent holes because a portal exemption with no expiry and no ticket reference is indistinguishable from “someone turned this off and forgot.”

What breaks without a pipeline: a platform team ships a guardrail straight to Deny in production, discovers it blocks 4,000 legitimate deployments, and rolls it back in a panic, teaching everyone that governance is the enemy. Or a DeployIfNotExists policy with a wrong existenceCondition never evaluates as compliant, so its template redeploys after every write to a matching resource and in every remediation run, quietly burning ARM quota and money for months. Or a remediation task fans out tenant-wide at default concurrency, hits 429, and leaves half the fleet fixed and half not, so the next compliance scan re-flags everything and on-call cannot tell what actually changed.

Who hits this: any platform or cloud-governance team operating at landing-zone scale — the Azure Cloud Adoption Framework landing zones crowd, anyone running an enterprise-scale management-group hierarchy, and every shop that has graduated past “click a built-in initiative and hope.” It pairs with Azure Policy governance at scale (the conceptual ground this pipeline automates) and the Azure DevOps YAML multistage approvals patterns that gate it. The reward is governance you can review, diff, roll back, and prove — the same properties you already demand of your infrastructure code.

To frame the whole field before the deep dive, here is every failure class this pipeline can hit, the question it forces, and the one place to look first:

Failure class	What you observe	First question to ask	First place to look	Most common single cause
Drift / orphaned assignment	EPAC plan wants to delete something live	Did someone change it in the portal?	`Build-DeploymentPlans` plan output	A hand-edited assignment not in Git
Effect too aggressive	New deploys suddenly blocked	Did we flip to `Deny` before measuring?	`az policy state summarize` audit count	Skipped the `Audit` ring
DINE/Modify won’t deploy	“required role assignments” error	Does the identity exist and have roles?	Assignment `identity` + role list	MSI replication lag or missing role
Remediation 429s	Half the fleet fixed, half re-flagged	Was the task throttled and scoped?	Remediation task failure column	Tenant-wide blast, default concurrency
Exemption sprawl	Compliance “clean” but holes everywhere	Are exemptions time-bound and in Git?	`az policy exemption list` expiry column	Portal exemption with no `expiresOn`
Compliance shows NotStarted	Dashboard empty after deploy	Has a scan run yet?	`az policy state summarize`	No on-demand scan triggered

Learning objectives

By the end of this article you can:

Separate the four policy object types — definition, initiative, assignment, exemption — and explain which describes capability, which describes enforcement, and which describes the documented exception.
Author a custom policy definition correctly: distinguish field from value, enumerate aliases before writing a rule, and pick Indexed vs All mode deliberately.
Choose the right effect for a control — Audit, Deny, Modify, Append, DeployIfNotExists, AuditIfNotExists, Disabled — and name what each can and cannot fix.
Structure a policy repo by logical identity (never by scope) and wrap every definition in an initiative for a stable assignment surface and one compliance roll-up.
Drive the EPAC workflow — Build-DeploymentPlans → Deploy-PolicyPlan → Deploy-RolesPlan — across a plan/deploy pipeline with an approval gate, and configure global-settings.jsonc with a pacOwnerId so drift removal is safe.
Validate before rollout with lint → What-If → an Audit ring, read the audit count as your blast radius, and promote the same definition through sandbox → nonprod → prod by flipping one parameter.
Remediate existing fleets with DINE/Modify + remediation tasks, control concurrency with --parallel-deployments and --resource-count, and keep 429 storms from leaving a half-fixed estate.
Write time-bound, ticket-referenced exemptions for break-glass and waivers, and prove Git/Azure parity with a no-op EPAC plan.

Prerequisites & where this fits

You should already understand the building blocks of Azure governance: a management group (MG) is a container above subscriptions that policy and RBAC inherit down through; an Azure Policy assignment binds a rule to a scope (MG, subscription, or resource group) with parameter values; and RBAC (role assignments) is how the policy engine is granted permission to act on your behalf for Modify/DINE. You should be comfortable running az in Cloud Shell, reading JSON output, writing basic Bicep, and reading a YAML pipeline. PowerShell familiarity helps because EPAC is a PowerShell module.

This sits in the Governance & Platform Automation track. It assumes the conceptual ground from Azure Policy governance at scale and the hierarchy design in enterprise-scale management-group hierarchy design. It depends on the identity model in Entra RBAC governance, because the deploying principal’s permissions are the whole security boundary. In mindset it sits one rung above infrastructure as code 101 with Terraform on Azure, and it pairs with Bicep deployment stacks, What-If & CI for the validation mechanics and with Azure DevOps YAML multistage approvals for the gates.

A quick map of who owns what during a policy change, so you route the right approval fast:

Layer	What lives here	Who usually owns it	What it can block / cause
Git repo (definitions, initiatives)	The capability — the rules themselves	Platform / governance team	Bad rule logic; broken alias reference
Assignment manifests	The enforcement — scope + effect param	Platform team + control owner	Wrong scope; effect too aggressive
`global-settings.jsonc`	PaC env → MG/sub mapping + `pacOwnerId`	Platform lead	Drift removal scope; safe-delete boundary
Pipeline (plan/deploy stages)	The promotion ring + approval gate	DevOps / platform	Who can merge to prod; gate bypass
Deploying service principal	The permission to write policy + roles	Identity team	`PrincipalNotFound`; missing UAA role
Exemptions tree	The documented exceptions	Control owner + approver	Sprawl; un-expiring holes

Core concepts

Five mental models make every later decision obvious.

The four object types describe four different things, and mixing them is how repos rot. A definition is a single rule — an if/then — authored once and assignable anywhere. An initiative (policy set) bundles definitions and hoists their parameters so an assignment sets values once. An assignment binds a definition or initiative to a scope with concrete parameter values; this is the only object that knows where enforcement happens. An exemption is a time-bound, audited waiver of an assignment, down to the resource. Definitions and initiatives describe capability; assignments describe where it is enforced; exemptions describe the documented exceptions. Baking a subscription ID into a definition collapses two of these into one and you can never reuse or promote that rule again.

Policy intercepts the request, then re-checks on a scan. Most effects evaluate at resource create/update — the request is intercepted before it commits, which is exactly why Deny can block it and Modify/Append can mutate the payload in flight. Separately, a roughly 24-hour background compliance scan re-evaluates existing resources. The two *IfNotExists effects (DINE, AINE) only ever fire on writes and on that scan — never inline — because they must inspect related resources that already exist. Knowing which trigger an effect uses tells you whether it can fix the fleet or only stop new violations.

Aliases are the contract, and field ≠ value. field reads a property of the resource being evaluated and is alias-aware — Microsoft.Storage/storageAccounts/allowBlobPublicAccess is an alias mapping to the resource’s real property path. value evaluates an arbitrary expression (a parameter, [resourceGroup()], a template function) that has nothing to do with the target resource. Use field for “what is this resource’s property”; use value for “what does this expression compute to.” Crucially, a field condition lets deny/modify reach into the request payload before commit; a value condition cannot. If there is no alias for a property, you cannot write policy against it — so you enumerate aliases first.

EPAC reconciles desired state, and that is the whole point. You can hand-roll deployment with az policy commands, but reconciling the repo against what is live, including deleting assignments you removed from Git, is what EPAC automates. It reads your repo, builds a plan, and applies it idempotently with full drift detection: anything in Azure stamped with your pacOwnerId that is not in Git gets flagged and (optionally) removed. The pacOwnerId is the safety boundary: under the ownedOnly desired-state strategy EPAC manages only objects carrying its stamp, so it cannot delete another team’s assignments, and Microsoft’s built-in definitions and initiatives are only ever referenced, never deleted. Switching the strategy to full removes that protection for unowned custom objects in scope.
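
To make the reconciliation idea concrete, here is a minimal Python sketch of an owner-scoped diff. It illustrates the concept only and is not EPAC's implementation: `PolicyObject`, `build_plan`, `in_sync`, and the `content_hash` field are hypothetical names, and the sketch models just the rule that only objects carrying your own stamp are deletion candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyObject:
    name: str          # logical identity, e.g. an assignment name
    owner_id: str      # owner stamp found on the object ("" if none)
    content_hash: str  # stand-in for "the properties that matter"


def build_plan(desired, deployed, pac_owner_id):
    """Classify objects as new / update / delete / unmanaged.

    `desired` and `deployed` are dicts keyed by object name.
    """
    plan = {"new": [], "update": [], "delete": [], "unmanaged": []}
    for name, want in desired.items():
        have = deployed.get(name)
        if have is None:
            plan["new"].append(name)
        elif have.content_hash != want.content_hash:
            plan["update"].append(name)
    for name, have in deployed.items():
        if name in desired:
            continue
        # Only objects carrying our stamp are deletion candidates.
        bucket = "delete" if have.owner_id == pac_owner_id else "unmanaged"
        plan[bucket].append(name)
    return plan


def in_sync(plan):
    """A no-op plan (nothing new, changed, or to delete) means Git and Azure agree."""
    return not (plan["new"] or plan["update"] or plan["delete"])


if __name__ == "__main__":
    owner = "f0000000-1111-2222-3333-444444444444"  # placeholder pacOwnerId
    repo = {"baseline": PolicyObject("baseline", owner, "v2")}
    live = {
        "baseline": PolicyObject("baseline", owner, "v1"),
        "hand-made": PolicyObject("hand-made", "", "x"),
    }
    print(build_plan(repo, live, owner))
```

The `unmanaged` bucket is the point of the design: an object without your stamp is reported but never scheduled for deletion, which is what lets several teams share one management-group hierarchy.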

The effect is the most consequential single decision. Pick wrong and you either block legitimate deployments (Deny too early) or audit forever while nothing improves (Audit with no promotion plan). The parameterized-effect pattern — ship the same definition as Audit in sandbox, Audit/Deny in nonprod, Deny/DINE in prod — is the entire value of keeping the effect a parameter on the assignment rather than baked into the definition.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters to the pipeline
Definition	A single `if`/`then` rule	Git → MG scope	The reusable capability; never scope-bound
Initiative	A bundle of definitions + params	Git → MG scope	Stable assignment surface; one roll-up
Assignment	Definition/initiative bound to a scope	Git → applied at scope	The only object that says where + effect
Exemption	Time-bound waiver of an assignment	Git → down to resource	The documented exception, not a hole
Effect	What the rule does when matched	On the definition, param on assignment	`Audit`/`Deny`/`Modify`/DINE…
Alias	Map from policy field → resource property	Resource provider	No alias → no policy on that property
Mode	Which resource types are evaluated	On the definition	`Indexed` vs `All`
EPAC	PowerShell module that reconciles repo↔Azure	Pipeline agent	Plan/deploy + drift detection
`pacOwnerId`	The stamp EPAC manages by	`global-settings.jsonc`	Safe-delete boundary for drift
Remediation task	Bulk fix of existing non-compliant resources	ARM control plane	One deployment per resource; throttled
Compliance scan	~24h re-evaluation of existing resources	Platform-managed	When DINE/AINE/audit data refreshes
What-If	Preview of what a deployment would change	ARM API / pipeline	Cheapest pre-merge validation

The effects reference

The effect is the single most consequential field in a policy. This is the lookup table you scan first — every effect, when it fires, what it can fix, and the non-obvious requirement that bites. The traps are that Deny cannot fix existing resources, that Modify and DINE both need a managed identity with concrete roles, and that DINE’s existenceCondition is what makes it idempotent or a money pit.

Effect	Fires when	Can fix existing?	Needs identity?	Use it for	Key gotcha
`Audit`	On write + on scan; only flags	No (reports only)	No	Measuring a new rule’s blast radius	The audit count is your blast radius
`Deny`	On write, before commit	No — stops new only	No	Hard guardrails (“no public IPs in Corp”)	Existing drift untouched; pair with Modify/DINE
`Modify`	On write; mutates payload	With remediation task	Yes (MSI + roles)	Add/replace tags, enforce properties	Needs `location` on the assignment
`Append`	On write; adds fields	No (write-time only)	No	Inject a default where none supplied	Cannot change an existing value, only add
`DeployIfNotExists` (DINE)	On write + on scan, if related resource missing	With remediation task	Yes (MSI + roles)	Auto-onboard diagnostics, Defender, backup	Wrong `existenceCondition` → never compliant, so every write and remediation run redeploys
`AuditIfNotExists` (AINE)	On write + on scan	No (reports only)	No	Report on missing companion resources	Same `existenceCondition` care, no money risk
`Manual`	Sets compliance you attest manually	N/A (attestation)	No	Controls Azure can’t technically check	Compliance is set by a human, not the engine
`Disabled`	Never	No	No	Kill one rule inside an initiative	No audit trail — prefer an exemption
`DenyAction`	On a delete (or specified operation)	N/A	No	Block deletion of protected resources	Newer; scope which operations carefully

Two reading notes that save the most time:

Distinction	The trap	How to tell them apart
`Deny` vs `Modify` for the same property	Teams `Deny` a missing tag, then can’t deploy anything	`Deny` blocks the deploy; `Modify` adds the tag for you — usually what you want for tags
`Disabled` vs exemption	Disabling kills the rule for everyone, silently	`Disabled` = no scope, no expiry, no trail; an exemption is scoped, expiring, audited

And the per-effect structural requirements, because Modify, DINE, and AINE each need extra blocks under then.details:

Effect	Requires `roleDefinitionIds` in definition?	Requires `details.operations` (Modify)?	Requires `existenceCondition` (DINE/AINE)?	Requires `deployment` template (DINE)?
`Audit` / `Deny` / `Append`	No	No	No	No
`Modify`	Yes	Yes (add/replace/remove ops)	No	No
`AuditIfNotExists`	No	No	Yes	No
`DeployIfNotExists`	Yes	No	Yes	Yes (ARM template)
`Manual` / `Disabled`	No	No	No	No

Anatomy of a custom policy definition

A definition is JSON with a policyRule (the logic) and parameters (the knobs). The rule’s if block evaluates resource properties; the matched resources get the then.effect.

{
  "properties": {
    "displayName": "Storage accounts must disable public blob access",
    "mode": "Indexed",
    "parameters": {
      "effect": {
        "type": "String",
        "defaultValue": "Deny",
        "allowedValues": ["Audit", "Deny", "Disabled"]
      }
    },
    "policyRule": {
      "if": {
        "allOf": [
          { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
          {
            "field": "Microsoft.Storage/storageAccounts/allowBlobPublicAccess",
            "notEquals": false
          }
        ]
      },
      "then": { "effect": "[parameters('effect')]" }
    }
  }
}
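
A definition like the one above can be sanity-checked before it ever reaches a plan. The following is a minimal Python lint sketch under stated assumptions: `lint_definition` and its messages are hypothetical, and the three checks (mode is set, no subscription scope baked in, effect is a declared parameter whose default is allowed) are only the rules this guide argues for, not an official validator.

```python
import json
import re

# Matches an effect written as [parameters('someName')].
EFFECT_PARAM_REF = re.compile(r"^\[parameters\('([^']+)'\)\]$")
# Matches a concrete subscription scope anywhere in the document.
SUBSCRIPTION_SCOPE = re.compile(r"/subscriptions/[0-9a-fA-F-]{36}")


def lint_definition(definition):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    props = definition.get("properties", {})
    if not props.get("mode"):
        problems.append("mode is not set")
    if SUBSCRIPTION_SCOPE.search(json.dumps(definition)):
        problems.append("a subscription scope is baked into the definition")
    effect = props.get("policyRule", {}).get("then", {}).get("effect", "")
    ref = EFFECT_PARAM_REF.match(effect)
    if ref is None:
        problems.append("effect is hard-coded instead of parameterized")
        return problems
    param = props.get("parameters", {}).get(ref.group(1))
    if param is None:
        problems.append(f"effect references undeclared parameter '{ref.group(1)}'")
    elif param.get("defaultValue") not in param.get("allowedValues", []):
        problems.append("effect defaultValue is not listed in allowedValues")
    return problems
```

Run against every file under the definitions folder in the PR stage, a non-empty result is a cheap reason to fail the build before any Azure call is made.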

Before you write the rule, enumerate the aliases — if there is no alias for a property, you cannot target it:

# List aliases for a resource type and confirm they're modifiable (needed for modify/append)
az provider show --namespace Microsoft.Storage \
  --expand "resourceTypes/aliases" \
  --query "resourceTypes[?resourceType=='storageAccounts'].aliases[].{alias:name, modifiable:defaultMetadata.attributes}" \
  -o table

The `if` block: conditions, operators, and logical structure

The if block is a tree of conditions joined by allOf/anyOf/not. Each leaf compares a field or value against an operator. Knowing the full operator set — and which ones are case-sensitive or accept wildcards — is what lets you write a precise rule instead of an over-broad one that flags half the estate.

Operator	Compares	Wildcards?	Typical use	Gotcha
`equals` / `notEquals`	Exact scalar	No	Resource type, a boolean property	Case-insensitive on strings; only `match`/`notMatch` are case-sensitive
`like` / `notLike`	String with `*`	Yes (`*`)	`name like "prod-*"`	Single `*`; not regex
`match` / `notMatch`	`#`=digit `?`=letter `.`=any	Yes (glyph)	Naming patterns by char class	Case-sensitive
`matchInsensitively` / `notMatchInsensitively`	Same as `match`, case-insensitive	Yes (glyph)	Naming patterns, any case	Note the `-ly` suffix; `matchInsensitive` is not a valid operator
`contains` / `notContains`	Substring	No	Tag value contains a token	Substring, not membership
`in` / `notIn`	Membership in an array	No	Allowed locations/SKUs list	Array must be a param or literal
`containsKey` / `notContainsKey`	Object has a key	No	`tags containsKey "cost-center"`	Key presence, not value
`greater` / `less` / `greaterOrEquals` / `lessOrEquals`	Numeric / date	No	Retention days, minTLS version	Type must compare cleanly
`exists`	`"true"`/`"false"`	No	Property present at all	String boolean, not bare bool
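
The two pattern operators are easy to confuse, so here is a small Python sketch that emulates their documented behaviour locally. It is an approximation for reasoning about rules, not the policy engine itself: `policy_match` and `policy_like` are hypothetical helper names, and treating `?` as "any alphabetic character" and requiring a full-length match are assumptions.

```python
def policy_match(value, pattern, case_sensitive=True):
    """Emulate `match`: '#' is a digit, '?' is a letter, '.' is any character.

    Every other pattern character must equal the value character at that position.
    """
    if len(value) != len(pattern):
        return False
    for ch, p in zip(value, pattern):
        if p == "#":
            ok = ch in "0123456789"
        elif p == "?":
            ok = ch.isalpha()
        elif p == ".":
            ok = True
        elif case_sensitive:
            ok = ch == p
        else:
            ok = ch.lower() == p.lower()
        if not ok:
            return False
    return True


def policy_like(value, pattern):
    """Emulate `like`: at most one '*' wildcard, compared case-insensitively."""
    if pattern.count("*") > 1:
        raise ValueError("like accepts a single * wildcard")
    value, pattern = value.lower(), pattern.lower()
    if "*" not in pattern:
        return value == pattern
    prefix, suffix = pattern.split("*")
    return (
        len(value) >= len(prefix) + len(suffix)
        and value.startswith(prefix)
        and value.endswith(suffix)
    )
```

Calling `policy_match("vm-001", "vm-###")` succeeds while `policy_match("VM-001", "vm-###")` does not, which is the case-sensitivity difference that separates `match` from its case-insensitive variant.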

The logical operators and how they nest:

Logical op	Semantics	When to reach for it	Pitfall
`allOf`	AND — every child must match	The default; scope a rule to a type + condition	Without it `if` holds exactly one condition; two sibling conditions need a wrapper
`anyOf`	OR — at least one child matches	“TLS < 1.2 OR public access on”	Easy to make too broad
`not`	Negate the wrapped condition	“NOT in the allowed-SKU list”	Double negatives get unreadable fast
`count`	Count array elements matching a condition	“≥1 NSG rule allows 0.0.0.0/0”	The most powerful and the easiest to misread
`field` (in count)	Iterate an array alias `[*]`	Inspect each subnet/rule/IP config	Needs an `[*]` alias to exist
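
Because `count` is the operator most often misread, a plain-Python analogy may help: it counts the array members for which an inner condition holds, and the outer comparison is applied to that number. The sketch below is only an analogy, assuming NSG security rules flattened into dictionaries; `count_where`, `allows_any_source`, and `nsg_is_noncompliant` are hypothetical names.

```python
def count_where(items, predicate):
    """Number of array members for which the condition holds (the `count` idea)."""
    return sum(1 for item in items if predicate(item))


def allows_any_source(rule):
    """Inner condition: an inbound allow rule open to any source address."""
    return (
        rule.get("direction") == "Inbound"
        and rule.get("access") == "Allow"
        and rule.get("sourceAddressPrefix") in {"*", "0.0.0.0/0", "Internet"}
    )


def nsg_is_noncompliant(security_rules):
    """Outer comparison: at least one member matched the inner condition."""
    return count_where(security_rules, allows_any_source) >= 1
```

The inner condition is evaluated per member and the threshold once per resource; swapping those two levels is the usual way a `count` rule ends up flagging everything or nothing.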

`field` vs `value`, and `mode`

field reads a property of the target resource and is alias-aware; value evaluates an arbitrary expression. Use field to inspect the resource, value to compute something independent of it. The distinction also governs power: a field condition can drive deny/modify into the request payload, a value condition cannot. The mode then decides which resource types are even evaluated.

Mode	Evaluates	Use it for	Skips	Gotcha
`Indexed`	Resource types that support tags + location	The vast majority of resource policies	RGs, subscriptions, type-less resources	Default-correct; avoids false non-compliance on type-less resources
`All`	Every resource, plus RGs and subscriptions	Policies that must evaluate RGs/subs themselves	Nothing	Use only when you genuinely target containers
`Microsoft.Kubernetes.Data`	AKS in-cluster objects (via Gatekeeper)	Pod-level constraints on AKS	Non-AKS	Pairs with the Gatekeeper/OPA admission model
`Microsoft.KeyVault.Data`	Objects inside Key Vault (certs, keys)	Cert/key policy within a vault	Non-KV-data	Data-plane mode, different alias set
`Microsoft.Network.Data`	Specific network data-plane objects	Niche network controls	Others	Rarely needed

# field vs value in practice: 'field' reads the resource; 'value' evaluates an expression.
# This audits resources whose location is outside allowedLocations. The value condition is
# computed from the resource group (not the resource) and skips groups located in "global".
# The rule lives in a file so the single quotes that policy functions require survive the shell.
cat > rules.json <<'EOF'
{
  "if": { "allOf": [
    { "field": "location", "notIn": "[parameters('allowedLocations')]" },
    { "value": "[resourceGroup().location]", "notEquals": "global" }
  ]},
  "then": { "effect": "audit" }
}
EOF
az policy definition create --name "loc-must-match-rg" \
  --rules rules.json \
  --params '{ "allowedLocations": { "type": "Array" } }' \
  --mode Indexed

Evaluation order, restated as a rule: policy runs on resource create/update (the request is intercepted before commit, which is why deny blocks and modify/append mutate), and again on the ~24-hour background compliance scan. auditIfNotExists/deployIfNotExists are evaluated only after a write completes and on that scan, never inline, because they inspect related resources that already exist. If your dashboard shows NotStarted, no scan has run yet; trigger one with az policy state trigger-scan rather than assuming the policy is broken.

Choosing and parameterizing the effect

The effect decides whether a control measures, blocks, or fixes. The reference table above enumerates all of them; the discipline is to make the effect a parameter on the assignment, not a constant in the definition, so the same rule can ship Audit then Deny per ring without a code change.

The non-obvious rules, restated for the decision you actually face:

You want to…	Wrong effect (and why)	Right effect	Extra requirement
Stop new public storage	`Modify` (can’t remove a missing property cleanly)	`Deny`	None
Tag every new resource with cost-center	`Deny` (blocks the deploy)	`Modify`	MSI + Tag Contributor; `location` set
Onboard diagnostics to Log Analytics	`Deny`/`Append` (can’t create a child)	`DeployIfNotExists`	MSI + roles; correct `existenceCondition`
Report which VMs lack backup	`Deny` (forward-only)	`AuditIfNotExists`	Correct `existenceCondition`
Measure a brand-new control’s reach	`Deny` (breaks teams immediately)	`Audit`	None — read the count first
Fix existing untagged resources	`Deny` (never touches them)	`Modify` + remediation task	MSI + roles + throttled remediation

A worked parameterized definition — one rule, three ring behaviours from a single effect parameter:

{
  "properties": {
    "displayName": "Resources must carry a cost-center tag",
    "mode": "Indexed",
    "parameters": {
      "effect": {
        "type": "String",
        "defaultValue": "Audit",
        "allowedValues": ["Audit", "Modify", "Disabled"]
      },
      "tagName": { "type": "String", "defaultValue": "cost-center" }
    },
    "policyRule": {
      "if": { "field": "[concat('tags[', parameters('tagName'), ']')]", "exists": "false" },
      "then": {
        "effect": "[parameters('effect')]",
        "details": {
          "roleDefinitionIds": [
            "/providers/Microsoft.Authorization/roleDefinitions/4a9ae827-6dc8-4573-8ac7-8239d42aa03f"
          ],
          "operations": [
            { "operation": "add", "field": "[concat('tags[', parameters('tagName'), ']')]", "value": "unassigned" }
          ]
        }
      }
    }
  }
}

Sandbox assigns effect=Audit and reads the count; prod assigns effect=Modify and pairs it with a remediation task. Same Git, same definition.
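
The per-ring promotion can be expressed as data plus one check. This Python sketch assumes a ring-to-effect map you maintain yourself; `assignment_for_ring`, `IDENTITY_EFFECTS`, and the `needsManagedIdentity` key are hypothetical, while the `{"effect": {"value": ...}}` shape follows how assignment parameter values are written.

```python
# Effects whose assignments need a managed identity with roles.
IDENTITY_EFFECTS = {"Modify", "DeployIfNotExists"}


def assignment_for_ring(definition, ring, ring_effects):
    """Build the assignment parameter payload for one ring.

    Fails fast if the ring asks for an effect the definition does not allow.
    """
    effect_param = definition["properties"]["parameters"]["effect"]
    effect = ring_effects[ring]
    if effect not in effect_param["allowedValues"]:
        raise ValueError(f"{effect!r} is not an allowed effect for ring {ring!r}")
    return {
        "parameters": {"effect": {"value": effect}},
        "needsManagedIdentity": effect in IDENTITY_EFFECTS,
    }


if __name__ == "__main__":
    cost_center = {
        "properties": {
            "parameters": {
                "effect": {"allowedValues": ["Audit", "Modify", "Disabled"]}
            }
        }
    }
    rings = {"sandbox": "Audit", "prod": "Modify"}  # illustrative ring plan
    for ring in rings:
        print(ring, assignment_for_ring(cost_center, ring, rings))
```

Keeping the ring map outside the definition is the design choice that matters: promotion becomes a one-line change to data, and an effect the definition does not allow is rejected before a plan is built.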

Structuring the repo

Keep the four object types in separate trees, named by their logical identity, never by scope. Scope-naming (prod-sub-storage.json) is how you end up unable to promote or reuse anything.

policy/                            # EPAC expects these exact folder names
├── policyDefinitions/
│   └── deny-storage-public-access.json
├── policySetDefinitions/          # initiatives
│   └── security-baseline.json
├── policyAssignments/
│   ├── platform-mg.json           # one manifest per management group
│   └── landing-zones-mg.json
├── policyExemptions/
│   └── epac-prod/                 # exemptions, one folder per pacSelector
└── global-settings.jsonc          # PaC env -> MG/subscription mapping

An initiative groups definitions and hoists their parameters so an assignment sets values once:

{
  "properties": {
    "displayName": "Security Baseline",
    "policyType": "Custom",
    "parameters": {
      "storageEffect": { "type": "String", "defaultValue": "Deny" }
    },
    "policyDefinitions": [
      {
        "policyDefinitionReferenceId": "denyStoragePublic",
        "policyDefinitionId": "/providers/Microsoft.Management/managementGroups/contoso/providers/Microsoft.Authorization/policyDefinitions/deny-storage-public-access",
        "parameters": { "effect": { "value": "[parameters('storageEffect')]" } }
      }
    ]
  }
}
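
Hoisting only works if every reference inside a member's `parameters` block points at a parameter the initiative actually declares, and a typo there tends to surface late. As a minimal Python sketch (the function names are hypothetical and the check is a local lint, not part of EPAC), the wiring can be verified like this:

```python
import re

# Finds every [parameters('name')] reference inside a value string.
PARAM_REF = re.compile(r"\[parameters\('([^']+)'\)\]")


def unresolved_references(initiative):
    """List (referenceId, memberParam, missingName) for dangling references."""
    props = initiative["properties"]
    declared = set(props.get("parameters", {}))
    missing = []
    for member in props.get("policyDefinitions", []):
        for name, binding in member.get("parameters", {}).items():
            for ref in PARAM_REF.findall(str(binding.get("value", ""))):
                if ref not in declared:
                    missing.append((member["policyDefinitionReferenceId"], name, ref))
    return missing


def unused_parameters(initiative):
    """Initiative parameters that no member definition consumes."""
    props = initiative["properties"]
    used = set()
    for member in props.get("policyDefinitions", []):
        for binding in member.get("parameters", {}).values():
            used.update(PARAM_REF.findall(str(binding.get("value", ""))))
    return sorted(set(props.get("parameters", {})) - used)
```

An empty result from both functions means every hoisted parameter is declared once and consumed at least once, so an assignment that sets `storageEffect` really does reach the member rule.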

Always assign initiatives, not loose definitions — even an initiative of one. It gives you a stable assignment surface (add rules later without re-pointing the assignment) and one compliance roll-up per business control. The repo-layout decisions and what each buys you:

Directory / file	Holds	Named by	Why it matters
`policyDefinitions/`	Single rules	The capability (`deny-storage-public-access`)	Reusable anywhere; promotion-safe
`policySetDefinitions/`	Initiatives	The business control (`security-baseline`)	One roll-up; stable assignment target
`policyAssignments/`	One manifest per MG	The scope it targets (`platform-mg`)	The only place scope appears
`policyExemptions/<pacSelector>/`	Exemptions	Ticket + resource	Lifecycle-managed; expires on its own
`global-settings.jsonc`	PaC env → scope + `pacOwnerId`	The environment selector	Safe-delete boundary; per-ring mapping

The object hierarchy as a quick reference for what nests in what:

Object	Contains	Contained by	Carries parameters?	Carries scope?
Definition	`policyRule` + `parameters`	Initiative (by reference) or assigned directly	Declares them	No
Initiative	References to definitions	An assignment	Hoists definition params	No
Assignment	A definition or initiative ID + param values	Applied at a scope	Sets values	Yes
Exemption	A reference to an assignment (+ optional definition refs)	A scope, down to a resource	No	Yes + expiry

Built-in vs custom definitions — when to author your own vs reference Microsoft’s, and how EPAC treats each:

Aspect	Built-in definition	Custom definition
Authored by	Microsoft	You (in Git)
`policyType`	`BuiltIn`	`Custom`
Lives at	Tenant root (always available)	The MG scope you deploy it to
Versioned / updated by	Microsoft (can change under you)	You — diffable in PRs
EPAC manages it?	References only (never deletes)	Yes — owned via `pacOwnerId`
Use when	A standard control already exists (e.g. CIS, Defender)	No built-in matches your exact rule
Gotcha	A Microsoft update can shift behaviour	You own maintenance and alias drift
Composes in an initiative with custom?	Yes — mix freely under one roll-up	Yes

The EPAC workflow

You can hand-roll deployment with az policy commands, but reconciling desired state against what is live — including deleting assignments you removed from Git — is exactly what EPAC (Enterprise Policy as Code) solves. It is a maintained PowerShell module that reads your repo, builds a plan, and applies it idempotently, with full drift detection: anything in Azure stamped with your owner ID that is not in Git is flagged and (optionally) removed.

EPAC’s three commands map cleanly onto a pipeline:

Install-Module -Name EnterprisePolicyAsCode -Scope CurrentUser

# 1. PLAN — diff desired (repo) vs. deployed; emit plan artifacts, change nothing
Build-DeploymentPlans `
  -DefinitionsRootFolder ./policy `
  -OutputFolder ./output `
  -PacEnvironmentSelector epac-prod

# 2. DEPLOY definitions, initiatives, and assignments from the plan
Deploy-PolicyPlan `
  -DefinitionsRootFolder ./policy `
  -InputFolder ./output `
  -PacEnvironmentSelector epac-prod

# 3. DEPLOY the role assignments DINE/modify identities need
Deploy-RolesPlan `
  -DefinitionsRootFolder ./policy `
  -InputFolder ./output `
  -PacEnvironmentSelector epac-prod

The three commands, what each does, what it touches, and the artifact it produces or consumes:

Command	Phase	Reads	Writes	Changes Azure?	Runs in pipeline stage
`Build-DeploymentPlans`	Plan	Repo + live Azure state	`policy-plan.json`, `roles-plan.json`	No	Plan (on PR)
`Deploy-PolicyPlan`	Deploy objects	Policy plan artifact	Definitions, initiatives, assignments	Yes	Deploy (on merge)
`Deploy-RolesPlan`	Deploy roles	Roles plan artifact	Role assignments for MSIs	Yes	Deploy-roles (after deploy)

Why EPAC over hand-rolling — the three ways to ship policy as code, side by side:

Capability	Raw `az policy` scripts	ARM/Bicep templates	EPAC
Create definitions/initiatives/assignments	Yes (imperative)	Yes (declarative)	Yes (declarative)
Delete what you removed from Git (drift)	No — you script deletes by hand	No — orphans linger	Yes — automatic, owner-scoped
Safe-delete boundary (`pacOwnerId`)	None	None	Yes
Plan/preview before apply	No native diff	What-If	Build-DeploymentPlans diff
Manages DINE/Modify role assignments	Manual	Manual	Deploy-RolesPlan
Multi-ring promotion by selector	Hand-rolled	Per-env templates	PacEnvironmentSelector
Idempotent re-run	Depends on your scripts	Mostly	Yes
Best for	One-off / tiny estates	Mid estates without drift needs	Landing-zone scale, audited

The global-settings.jsonc ties a selector to a real scope and identity:

{
  "pacOwnerId": "f0000000-1111-2222-3333-444444444444",
  "pacEnvironments": [
    {
      "pacSelector": "epac-prod",
      "cloud": "AzureCloud",
      "tenantId": "<tenant-guid>",
      "deploymentRootScope": "/providers/Microsoft.Management/managementGroups/contoso"
    }
  ]
}

pacOwnerId is what makes drift detection safe: with desiredState.strategy set to ownedOnly, EPAC manages only objects stamped with that owner ID, so it never deletes assignments created by another team, and Microsoft’s built-in definitions and initiatives are only referenced. The settings that govern reconciliation behaviour:

`global-settings.jsonc` key	Controls	Default / typical	When to change	Risk if wrong
`pacOwnerId`	Which objects EPAC manages/deletes	A unique GUID per repo	Never reuse across repos	Deletes another repo’s objects
`pacSelector`	The environment/ring name	`epac-dev`/`epac-prod`	Per ring	Deploys to the wrong MG
`deploymentRootScope`	The MG the plan targets	Root or intermediate MG	Per ring scope	Over-broad enforcement
`managedIdentityLocation`	Region for MSI-bearing assignments	e.g. `eastus`	Match your estate	Identity-bearing deploy fails
`globalNotScopes`	Scopes EPAC never manages	Decommissioned subs	Carve-outs	Manages a scope you meant to exclude
`desiredState.strategy`	`full` vs `ownedOnly` deletion	`ownedOnly` (safe)	Rarely → `full`	`full` can delete unowned objects

In Azure Pipelines, split plan from deploy across stages with an environment approval gate between them — plan on PR, deploy on merge:

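stages:
  - stage: Plan
    jobs:
      - job: BuildPlan
        steps:
          - task: AzurePowerShell@5                # EPAC needs an Az PowerShell context, not an az CLI login
            inputs:
              azureSubscription: epac-spn          # workload identity federation
              azurePowerShellVersion: LatestVersion
              pwsh: true
              ScriptType: InlineScript
              Inline: |
                Install-Module -Name EnterprisePolicyAsCode -Scope CurrentUser -Force
                Build-DeploymentPlans -DefinitionsRootFolder ./policy `
                  -OutputFolder $(Build.ArtifactStagingDirectory) `
                  -PacEnvironmentSelector epac-prod
          - publish: $(Build.ArtifactStagingDirectory)
            artifact: policy-plan

  - stage: Deploy
    dependsOn: Plan
    condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))   # plan on PR, deploy on merge
    jobs:
      - deployment: ApplyPolicy
        environment: policy-prod                   # add an approval check here
        strategy:
          runOnce:
            deploy:
              steps:
                - checkout: self                   # deployment jobs do not check out the repo by default
                - download: current
                  artifact: policy-plan
                - task: AzurePowerShell@5
                  inputs:
                    azureSubscription: epac-spn
                    azurePowerShellVersion: LatestVersion
                    pwsh: true
                    ScriptType: InlineScript
                    Inline: |
                      Install-Module -Name EnterprisePolicyAsCode -Scope CurrentUser -Force
                      Deploy-PolicyPlan -DefinitionsRootFolder ./policy `
                        -InputFolder $(Pipeline.Workspace)/policy-plan `
                        -PacEnvironmentSelector epac-prod
                      # Deploy-RolesPlan follows in its own stage with the same shape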
The deploying identity needs Resource Policy Contributor at the root management group for policy objects, plus User Access Administrator (or Owner) to create the role assignments DINE identities require. Grant it to the federated service principal, not a human, and gate it behind PR review. The exact roles the pipeline principal needs and why:

Role	Scope	Why the pipeline needs it	If missing
Resource Policy Contributor	Root MG (or per-ring MG)	Create/update definitions, initiatives, assignments	Policy objects fail to deploy
User Access Administrator	Root MG	Create role assignments for DINE/Modify MSIs	`Deploy-RolesPlan` fails; DINE can’t act
Reader (implied by above)	Root MG	Read live state for the plan diff	Plan can’t compute drift
Managed Identity Operator (sometimes)	MG/sub	If using user-assigned identities	UAMI-based DINE can’t bind

The CI/CD platform choice does not change the model: the same plan/deploy split works in GitHub Actions with OIDC federation (the pattern covered in GitHub Actions + Terraform OIDC plan/PR automation) or in Azure DevOps with multistage YAML approvals.

Testing before rollout

Three layers of validation, cheapest first. The discipline is to never let a policy reach Deny in production without passing all three.

Layer	What it catches	Cost	Speed	Where it runs
1. Lint + What-If	Bad JSON shape; what the merge would create/change	Free	Seconds	On PR
2. `Audit` ring + scan	The real-world blast radius (how many resources non-compliant)	Free	Minutes (on-demand scan)	Sandbox MG
3. MG promotion ring	Whether enforcement breaks real teams	Free	Per ring	sandbox → nonprod → prod

1. Lint and What-If on PR. Validate JSON shape, then run a What-If of the policy deployment to confirm what objects the merge would create or change — without touching production:

# PowerShell: structural sanity for every definition/initiative JSON
Get-ChildItem ./policy -Recurse -Include *.json |
  ForEach-Object { $null = Get-Content $_ -Raw | ConvertFrom-Json }

# Bash: What-If the policy artifacts at the management-group scope
az deployment mg what-if \
  --management-group-id contoso \
  --location eastus \
  --template-file ./policy-bicep/assignments.bicep

What-If change types and what each tells you about the merge:

What-If change type	Means	Safe to merge?	Watch for
`Create`	A new policy object will be added	Usually	An unexpected duplicate of an existing rule
`Modify`	An existing object’s properties change	Review the diff	A scope or effect change you didn’t intend
`Delete`	An object will be removed	Pause	Drift removal of something still in use
`NoChange`	Already matches desired state	Yes	A clean plan should be mostly this
`Ignore`	Out of scope for this deployment	Yes	—
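
A PR check can apply the table mechanically. The sketch below assumes the What-If result is available as JSON with a top-level `changes` array whose items carry `changeType` and `resourceId` (the shape I would expect from running the command with `--no-pretty-print`; verify against your CLI version). `gate_what_if` and the verdict names are hypothetical, and treating `Create` as "review" is a deliberately conservative reading of "Usually".

```python
import json

# Verdict per change type, a conservative reading of the table above.
VERDICTS = {
    "Create": "review",
    "Modify": "review",
    "Delete": "pause",
    "NoChange": "ok",
    "Ignore": "ok",
}
SEVERITY = {"ok": 0, "review": 1, "pause": 2}


def gate_what_if(what_if_json):
    """Return (overall verdict, resource IDs that need a human look)."""
    changes = json.loads(what_if_json).get("changes", [])
    worst = "ok"
    flagged = []
    for change in changes:
        # Unknown change types are treated as 'pause' so nothing slips through.
        verdict = VERDICTS.get(change.get("changeType"), "pause")
        if verdict != "ok":
            flagged.append(change.get("resourceId", "<unknown>"))
        if SEVERITY[verdict] > SEVERITY[worst]:
            worst = verdict
    return worst, flagged


# Hypothetical What-If output with three changes.
sample = json.dumps({"changes": [
    {"resourceId": "/a", "changeType": "NoChange"},
    {"resourceId": "/b", "changeType": "Modify"},
    {"resourceId": "/c", "changeType": "Delete"},
]})
print(gate_what_if(sample))
```

A pipeline could fail the PR check on `pause` and post the flagged IDs as a review comment on `review`.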

2. Assign as Audit in a ring, read compliance. Every effect-parameterized policy ships to a non-prod management group as Audit first. Trigger an on-demand scan and read the result instead of waiting ~24 hours:

# Force an evaluation (trigger-scan takes a subscription or resource group, not a management group), then summarize compliance
az policy state trigger-scan --resource-group rg-sandbox

az policy state summarize \
  --management-group mg-sandbox \
  --query "policyAssignments[].{name:policyAssignmentId, nonCompliant:results.nonCompliantResources}" \
  -o table

If Audit flags 4,000 resources, flipping straight to Deny would have broken those teams. The audit count is your blast radius. The compliance states you’ll read and what each demands:

Compliance state	Meaning	Your next move
`Compliant`	Resource satisfies the rule	None
`NonCompliant`	Resource violates the rule	This is the blast radius — remediate or accept before `Deny`
`NotStarted`	No scan has evaluated it yet	Trigger a scan; don’t conclude it’s broken
`Exempt`	An exemption covers it	Verify the exemption is time-bound and ticketed
`Conflicting`	Two assignments disagree on effect	Resolve overlapping assignments
`Unknown` (Manual effect)	Awaiting human attestation	Attest via the compliance API
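
The "remediate or accept before `Deny`" rule can be turned into a promotion check. This is a minimal sketch, assuming the summarize query above was run with `-o json` so each row has a `name` and a `nonCompliant` count; `deny_blockers` and the `accepted` mapping (counts a reviewer has explicitly signed off) are hypothetical.

```python
def deny_blockers(rows, accepted=None):
    """Return {assignment: unaccepted non-compliant count} for rows that block Deny."""
    accepted = accepted or {}
    blockers = {}
    for row in rows:
        count = row.get("nonCompliant") or 0  # a missing/null count reads as zero
        allowed = accepted.get(row["name"], 0)
        if count > allowed:
            blockers[row["name"]] = count - allowed
    return blockers


# Hypothetical summary rows for three assignments.
rows = [
    {"name": "security-baseline", "nonCompliant": 4000},
    {"name": "tag-baseline", "nonCompliant": 0},
    {"name": "diag-baseline", "nonCompliant": None},
]
print(deny_blockers(rows))
print(deny_blockers(rows, accepted={"security-baseline": 4000}))
```

An empty result means every flagged resource was either fixed or consciously accepted, which is the condition the ring promotion depends on.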

What actually triggers a compliance re-evaluation, and how fast each is — so you know whether to wait or force a scan:

Trigger	What causes it	Latency	Force it manually?
Resource create/update	Any write to a governed resource	Inline (immediate)	N/A — it’s the write itself
New/changed assignment	Assigning or editing a policy	Evaluation kicks off within ~30 min	`az policy state trigger-scan`
Background compliance scan	Platform-scheduled sweep	~24 hours	`az policy state trigger-scan`
On-demand scan	You request it	Minutes (scope-dependent)	Yes — the one you use in CI
Remediation `ReEvaluateCompliance`	A remediation task with re-scan mode	Per task	Via `--resource-discovery-mode`

3. Promote through a management-group ring. Use distinct EPAC environment selectors per ring and promote the same definitions outward:

mg-sandbox  ->  mg-nonprod  ->  mg-prod
 (Audit)        (Audit/Deny)     (Deny/DINE)

Same Git, same definitions; only the assignment’s effect parameter and target scope change between selectors. That is the entire value of parameterizing the effect. The ring promotion matrix:

Ring	EPAC selector	Effect param	Approval gate	What it proves
Sandbox	`epac-sandbox`	`Audit`	None (auto on merge)	The rule is syntactically live; measures reach
Nonprod	`epac-nonprod`	`Audit` → `Deny`	Team lead	Enforcement doesn’t break realistic workloads
Prod	`epac-prod`	`Deny` / DINE	Change board	The control holds at scale with real traffic

Remediation at scale

Deny is forward-looking. For existing fleets you need DINE or Modify plus remediation tasks, and at scale the ARM control plane is the bottleneck.

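A DINE assignment must carry a managed identity, plus a location for that identity; the roles the identity needs are listed in the roleDefinitionIds of the policy definition. In Bicep: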

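targetScope = 'managementGroup'

param lawResourceId string            // Log Analytics workspace resource ID

resource diagAssignment 'Microsoft.Authorization/policyAssignments@2024-04-01' = {
  name: 'deploy-diag-to-law'          // created at the management group being deployed to
  location: 'eastus'                  // required when identity is set
  identity: { type: 'SystemAssigned' }
  properties: {
    // custom initiatives are stored at the management group, not the tenant
    policyDefinitionId: managementGroupResourceId(
      'Microsoft.Authorization/policySetDefinitions', 'diagnostics-baseline')
    parameters: {
      logAnalytics: { value: lawResourceId }
    }
  }
}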

After the identity exists, grant it the roles its template needs (for diagnostics-to-Log-Analytics that is typically Monitoring Contributor + Log Analytics Contributor), then create the remediation task:

# Remediate one initiative member across the assignment's scope
az policy remediation create \
  --name remediate-diag-2026q2 \
  --management-group contoso \
  --policy-assignment deploy-diag-to-law \
  --definition-reference-id deployDiagnostics \
  --resource-discovery-mode ReEvaluateCompliance

Throttling is the real engineering problem. A remediation task fans out one template deployment per non-compliant resource. Across thousands of resources that hammers ARM, and you will hit 429 Too Many Requests. Control concurrency with --parallel-deployments (how many remediations run at once) and --resource-count (the cap per task), then run multiple smaller, scoped tasks rather than one tenant-wide blast.

az policy remediation create \
  --name remediate-diag-batch-01 \
  --management-group contoso \
  --policy-assignment deploy-diag-to-law \
  --definition-reference-id deployDiagnostics \
  --parallel-deployments 10 \
  --resource-count 500

The remediation knobs, their defaults, and how to reason about each:

Setting	What it controls	Default	Range / values	When to change
`--parallel-deployments`	Concurrent template deployments	10	1–30	Lower it the moment you see `429`s
`--resource-count`	Max resources fixed per task	500	1–50000	Cap per landing zone to bound blast
`--resource-discovery-mode`	Whether to re-scan before fixing	`ExistingNonCompliant`	`ExistingNonCompliant` / `ReEvaluateCompliance`	`ReEvaluate` after a definition change
`--location-filters`	Restrict to regions	none	region list	Stage region-by-region
Scope (`--management-group`/`--resource-group`)	The set of resources targeted	The assignment scope	MG / sub / RG	Narrow to one landing zone per task

The throttling reality as a sizing table — why one tenant-wide task fails and batches succeed:

Approach	Deployments issued	ARM pressure	Failure mode	Outcome
One tenant-wide task, default concurrency	Thousands at once	Spikes past ARM write limits	`429` mid-run	Half-fixed fleet, re-flagged next scan
Per-landing-zone, `--resource-count 500`	≤500 per task	Bounded	Rare; isolated to one LZ	Clean batch; widen next
Per-LZ + `--parallel-deployments 5` after a `429`	≤500, throttled	Low	Almost none	Slow and steady; fully remediated

Roll remediation out per landing zone, watch the failure column, and only widen concurrency once a batch lands clean. A remediation task that 429s halfway leaves a half-fixed fleet that the next compliance scan will re-flag — slow and steady wins. The remediation lifecycle states you’ll watch:

Remediation state	Meaning	Action
`Evaluating`	Discovering non-compliant resources	Wait
`InProgress`	Issuing deployments	Watch the failure count
`Succeeded`	All targeted resources remediated	Widen scope / next LZ
`Failed`	One or more deployments failed (often `429`)	Lower concurrency; re-run (idempotent)
`Canceled`	Manually stopped	Re-create scoped tighter
`Complete` (with failures)	Finished but some resources unfixed	Inspect failures; re-run the remainder
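
The batching discipline can be planned before any task is created. The sketch below splits a per-landing-zone backlog into tasks of at most `resource_count` resources and adjusts concurrency between batches. The halve-on-`429`, add-one-on-clean rule is an assumption of this sketch, not something the remediation service does for you; the function names and landing-zone names are hypothetical.

```python
def plan_batches(noncompliant_by_lz, resource_count=500):
    """Return (landing_zone, batch_number, batch_size) tuples, one per task."""
    if resource_count < 1:
        raise ValueError("resource_count must be at least 1")
    plan = []
    for lz in sorted(noncompliant_by_lz):
        remaining = noncompliant_by_lz[lz]
        batch = 1
        while remaining > 0:
            size = min(remaining, resource_count)
            plan.append((lz, batch, size))
            remaining -= size
            batch += 1
    return plan


def next_parallelism(current, saw_429, floor=1, ceiling=30):
    """Halve concurrency after a 429; widen by one after a clean batch."""
    if saw_429:
        return max(floor, current // 2)
    return min(ceiling, current + 1)


# Hypothetical backlog: non-compliant resource counts per landing zone.
backlog = {"lz-payments": 1200, "lz-analytics": 300}
for task in plan_batches(backlog):
    print(task)
print(next_parallelism(10, saw_429=True))
```

Each tuple corresponds to one `az policy remediation create` call scoped to that landing zone, with the batch size passed as `--resource-count` and the current concurrency as `--parallel-deployments`.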

Exemptions and break-glass

An exemption is the documented exception — and unlike disabling a policy, it is scoped, audited, and can expire on its own. Make every exemption time-bound:

az policy exemption create \
  --name "waiver-legacy-sa-encryption" \
  --policy-assignment "/providers/Microsoft.Management/managementGroups/contoso/providers/Microsoft.Authorization/policyAssignments/security-baseline" \
  --exemption-category Waiver \
  --scope "/subscriptions/<legacy-sub-id>" \
  --policy-definition-reference-ids denyStoragePublic \
  --expires-on "2026-09-30T23:59:59Z" \
  --description "INC-4821: legacy app migrating to managed identity; owner: platform-team"

Two categories exist: Waiver (you accept the risk and are not fixing it now) and Mitigated (the risk is handled by a compensating control outside policy). Putting the ticket number and owner in --description is what turns an exemption from a hole into an auditable decision. Commit exemption JSON to policy/policy-exemptions/<env>/ so EPAC manages their lifecycle and removes them from Azure on the next deploy after they leave Git.

The two categories and when each is honest:

Category	Means	Use when	Audit expectation
`Waiver`	Risk accepted, not remediating now	A dated migration is underway	A ticket + a real expiry + an owner
`Mitigated`	Risk handled by a control outside policy	A compensating control covers it	A pointer to the compensating control

The exemption fields that make it auditable vs a silent hole:

Field	Purpose	Make it…	Smell if…
`--expires-on`	Auto-revoke date	Always set	Omitted → permanent hole
`--description`	The why + ticket + owner	`INC-####: reason; owner:`	Empty or “temp”
`--exemption-category`	Waiver vs Mitigated	Honest about the situation	Always `Waiver` with no plan
`--policy-definition-reference-ids`	Narrow to specific rules in an initiative	Scope to the one rule	Exempting the whole initiative
`--scope`	The narrowest scope that works	Resource, not subscription	Subscription-wide for one resource
Git location	Lifecycle management	In `policy-exemptions/<env>/`	Created in the portal, untracked
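
The field checks in the table lend themselves to a PR-time lint. This is a sketch under stated assumptions: it reads an exemption's properties using the ARM property names (`expiresOn`, `description`, `exemptionCategory`, `policyDefinitionReferenceIds`), the description pattern encodes the `INC-####: reason; owner:` convention from the table, and the 180-day ceiling is an arbitrary illustrative threshold. `lint_exemption` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed house convention from the table: "INC-####: reason; owner: name".
DESCRIPTION_PATTERN = re.compile(r"^[A-Z]+-\d+: .+; owner: \S+")


def lint_exemption(exemption, now, max_days=180):
    """Return a list of audit smells for one exemption's properties dict."""
    problems = []

    expires_on = exemption.get("expiresOn")
    if not expires_on:
        problems.append("no expiresOn: permanent hole")
    else:
        # Older Python versions reject a trailing "Z" in fromisoformat().
        expiry = datetime.fromisoformat(expires_on.replace("Z", "+00:00"))
        if expiry <= now:
            problems.append("already expired")
        elif expiry - now > timedelta(days=max_days):
            problems.append(f"expires more than {max_days} days out")

    if not DESCRIPTION_PATTERN.match(exemption.get("description") or ""):
        problems.append("description lacks ticket and owner")

    if not exemption.get("policyDefinitionReferenceIds"):
        problems.append("no reference IDs: exempts the whole initiative")

    if exemption.get("exemptionCategory") not in ("Waiver", "Mitigated"):
        problems.append("category must be Waiver or Mitigated")

    return problems


now = datetime(2026, 6, 1, tzinfo=timezone.utc)  # fixed clock for the example
good = {
    "exemptionCategory": "Waiver",
    "expiresOn": "2026-09-30T23:59:59Z",
    "description": "INC-4821: legacy app migrating to managed identity; owner: platform-team",
    "policyDefinitionReferenceIds": ["denyStoragePublic"],
}
bad = {"exemptionCategory": "Waiver", "description": "temp"}

print(lint_exemption(good, now))
print(lint_exemption(bad, now))
```

The reference-ID check only makes sense for exemptions against an initiative assignment; for an assignment of a single definition it would need to be relaxed.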

For break-glass, never delete a policy assignment to unblock an incident — that silently removes the guardrail for everyone and leaves no trail. Instead, create a tightly scoped, short-expiresOn exemption through the emergency-change path, and let it self-revoke. The break-glass decision table:

Incident pressure	Wrong move	Right move	Why
“This deploy is blocked, prod is down”	Delete the assignment	Scoped exemption, expiry hours away	Keeps the guardrail for everyone else; leaves a trail
“Disable the whole initiative”	Set effect `Disabled`	Exempt the one resource + rule	`Disabled` removes the control silently
“Just give the team Owner”	Broaden RBAC	Time-bound exemption	Exemption is reversible and audited
“We’ll clean it up later”	Permanent exemption, no expiry	`expiresOn` + ticket	Sprawl is the failure mode

Architecture at a glance

The diagram traces the policy-as-code path the way it actually runs, left to right, and marks the five places it most often breaks. Read it as a pipeline: an author opens a PR in the Git repo (definitions, initiatives, assignments, exemptions). The CI/CD pipeline runs Build-DeploymentPlans to produce a plan artifact on PR, then — behind an approval gate — Deploy-PolicyPlan and Deploy-RolesPlan on merge, authenticating as a federated service principal that holds Resource Policy Contributor + User Access Administrator at the root MG. EPAC writes into the Azure control plane: custom definitions and initiatives at the management-group scope, assignments that carry a system-assigned managed identity for Modify/DINE, and exemptions down to the resource. Finally the target estate — subscriptions and resource groups under the MG hierarchy — is where enforcement bites: Deny intercepts new writes, Modify/DINE plus remediation tasks fix the existing fleet in throttled batches, and the compliance scan feeds results back to the control plane.

Follow the numbered badges to read the failure map. Badge ① on the pipeline marks drift — EPAC’s plan wants to delete a live object because someone changed it in the portal; the pacOwnerId boundary is what keeps that deletion safe. Badge ② on the assignment node marks the MSI replication lag that surfaces as PrincipalNotFound when Deploy-RolesPlan outruns Azure AD. Badge ③ on the DINE/Modify node marks a wrong existenceCondition that redeploys every scan and burns quota. Badge ④ on the remediation path marks the 429 storm from an unthrottled tenant-wide task. Badge ⑤ on the exemptions node marks exemption sprawl — un-expiring, untracked waivers that make compliance lie. Every path converges on the same proof: a clean Build-DeploymentPlans re-run reporting no changes means Git and Azure are in sync.

Real-world scenario

A platform team at Northwind Logistics runs ~600 subscriptions under a single root management group, governed by a small set of custom initiatives. The team is five engineers; the governance estate had grown organically in the portal and nobody could answer audit questions cleanly, so they moved it to EPAC with a plan/deploy pipeline and the standard sandbox → nonprod → prod rings.

The migration itself went smoothly. The trouble started with a new control: a Modify policy to enforce a cost-center tag, shipped as Audit first and promoted per ring as usual. Sandbox and non-prod were clean: the Audit ring flagged ~3,100 untagged resources, which the team triaged and accepted as the remediation backlog. Then the production deploy failed every assignment with The policy assignment ... does not have the required role assignments, even though Deploy-RolesPlan had run successfully in the same pipeline.

The breakthrough came from asking what was different about scale. Modify and DINE identities are system-assigned, so the principal does not exist until the assignment is created. EPAC creates assignments in Deploy-PolicyPlan, then grants roles in Deploy-RolesPlan — but Azure AD replication of each new service principal lags by seconds to minutes. At 600 assignments, role creation outran replication: the roleAssignments PUT hit principals that were not yet visible tenant-wide, and Azure surfaced it as PrincipalNotFound wrapped in the generic “required role assignments” policy error. In sandbox with a dozen assignments, replication always finished first, which is why it never reproduced below production scale.

The fix was ordering plus idempotent retry, not more permissions. The team split the two deploys into separate pipeline stages with a deliberate gap, and let EPAC’s own retry reconcile the stragglers:

- stage: DeployRoles
  dependsOn: DeployPolicy
  jobs:
    - deployment: ApplyRoles
      environment: policy-prod
      strategy:
        runOnce:
          deploy:
            steps:
              - download: current
                artifact: policy-plan
              - pwsh: Start-Sleep -Seconds 120   # let AAD replicate new MSIs
              - task: AzureCLI@2
                inputs:
                  azureSubscription: epac-spn
                  scriptType: pscore
                  scriptLocation: inlineScript
                  inlineScript: |
                    Deploy-RolesPlan -DefinitionsRootFolder ./policy `
                      -InputFolder $(Pipeline.Workspace)/policy-plan `
                      -PacEnvironmentSelector epac-prod

A re-run of Deploy-RolesPlan is a no-op for already-granted identities, so the second pass only cleans up what replication missed, without re-deploying a single policy object. With the gap in place, the prod deploy went green, and the team then remediated the ~3,100-resource tag backlog in per-landing-zone batches of 500 at --parallel-deployments 10, widening only after each batch landed clean. The whole estate reached compliance over a week with zero 429-induced half-states. The lesson on the wall: “At scale, the bug is rarely permissions; it’s that you raced a distributed system. Order the stages and let idempotency clean up the lag.”

The incident as a timeline, because the order of moves is the lesson:

Time	Symptom	Action taken	Effect	What it should have been
Day 1	Sandbox/nonprod clean	Ship `Audit`, read count	3,100 flagged — backlog known	Correct
Day 2, 10:00	Prod fails every assignment	Re-run `Deploy-RolesPlan`	Same error	Ask: what changed at scale?
10:30	Still failing	Add more roles to the SPN	No change (already had them)	Not a permissions problem
11:15	Root cause found	Recognize MSI replication lag at 600 assignments	Two coupled facts: system-assigned MSI + AAD lag	—
12:00	Mitigated	Split stages + 120s gap + idempotent retry	Prod goes green	Correct fix
+1 week	Fully governed	Per-LZ remediation, 500 × 10 concurrency	0 `429` half-states, all compliant	The actual fix is batching
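
The "ordering plus idempotent retry" idea generalizes beyond EPAC. The sketch below is a generic illustration, not EPAC's retry code: `PrincipalNotFound` is a stand-in exception, `grant` is any callable that is safe to repeat, and the delay schedule is an assumption. The `sleep` parameter is injectable so the example runs instantly.

```python
import time


class PrincipalNotFound(Exception):
    """Stand-in for the error returned while a new identity is still replicating."""


def grant_with_retry(grant, principal_id, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call an idempotent grant(), backing off exponentially on PrincipalNotFound."""
    for attempt in range(1, attempts + 1):
        try:
            return grant(principal_id)
        except PrincipalNotFound:
            if attempt == attempts:
                raise  # replication never caught up; surface the real error
            sleep(base_delay * 2 ** (attempt - 1))


# Simulate a principal that becomes visible on the third attempt.
calls = {"n": 0}


def fake_grant(principal_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise PrincipalNotFound(principal_id)
    return f"granted:{principal_id}"


delays = []
result = grant_with_retry(fake_grant, "msi-123", sleep=delays.append)
print(result, delays)
```

Retrying is only safe because the grant is idempotent: repeating it for an identity that already holds the role changes nothing, which is the same property the re-run of Deploy-RolesPlan relies on.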

Advantages and disadvantages

The git-driven, EPAC-reconciled model both fixes the unreviewability of portal governance and introduces its own operational edges. Weigh it honestly:

Advantages (why this model helps you)	Disadvantages (why it bites)
Every policy change is a PR with a reviewer, a diff, and a recorded why	Standing up the pipeline (EPAC, federation, rings) is real upfront effort
EPAC drift detection flags and removes orphaned assignments — the estate stays in sync with Git	Drift removal is dangerous if `pacOwnerId`/`desiredState` is misconfigured — it can delete live objects
One definition promotes `Audit → Deny` across rings by flipping a parameter — no code change	Parameterizing everything adds indirection; a junior reader can’t see the effective effect at a glance
What-If + an `Audit` ring make the blast radius measurable before enforcement	The compliance scan lag (~24h) means feedback isn’t instant unless you trigger scans
Remediation tasks fix existing fleets at scale	Unthrottled remediation `429`s and leaves half-fixed states; throttling is your responsibility
Exemptions in Git are time-bound, ticketed, and lifecycle-managed	A portal exemption created out-of-band becomes untracked drift the moment it’s made
Built-in initiatives compose with custom ones under one roll-up	Overlapping assignments can produce `Conflicting` compliance that’s confusing to resolve

The model is right for any team at landing-zone scale that must prove governance — regulated industries, multi-subscription estates, anyone facing audits. It is overkill for a single subscription with three policies, where the portal is genuinely faster. The disadvantages are all manageable — a correct pacOwnerId, throttled remediation, exemptions-in-Git — but only if you know they exist, which is the point of this article.

Hands-on lab

Stand up a minimal policy-as-code loop without EPAC or a management group, so it runs free in any subscription: author a custom definition, assign it as Audit at a resource-group scope, trigger a scan, read compliance, then flip to Deny and watch it block. Run in Cloud Shell (Bash). Teardown at the end.

Step 1 — Variables and a sandbox resource group.

SUB=$(az account show --query id -o tsv)
RG=rg-policy-lab
LOC=centralindia
az group create -n $RG -l $LOC -o table

Step 2 — Author a custom definition (deny public blob access), parameterized effect.

cat > rule.json <<'JSON'
{
  "if": { "allOf": [
    { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
    { "field": "Microsoft.Storage/storageAccounts/allowBlobPublicAccess", "notEquals": false }
  ]},
  "then": { "effect": "[parameters('effect')]" }
}
JSON

cat > params.json <<'JSON'
{ "effect": { "type": "String", "defaultValue": "Audit",
  "allowedValues": ["Audit","Deny","Disabled"] } }
JSON

az policy definition create \
  --name "lab-deny-public-blob" \
  --display-name "Lab: deny public blob access" \
  --mode Indexed \
  --rules @rule.json \
  --params @params.json \
  --subscription $SUB -o table

Expected: a definition row with policyType = Custom.

Step 3 — Assign it as Audit at the resource-group scope.

az policy assignment create \
  --name "lab-audit-public-blob" \
  --policy "lab-deny-public-blob" \
  --scope "/subscriptions/$SUB/resourceGroups/$RG" \
  --params '{ "effect": { "value": "Audit" } }' -o table

Step 4 — Create a deliberately non-compliant storage account, then scan.

SA=stpolicylab$RANDOM
az storage account create -n $SA -g $RG -l $LOC --sku Standard_LRS \
  --allow-blob-public-access true -o table   # intentionally non-compliant

az policy state trigger-scan --resource-group $RG   # ~1-2 min
az policy state summarize --resource-group $RG \
  --query "policyAssignments[].{name:policyAssignmentId, nonCompliant:results.nonCompliantResources}" \
  -o table

Expected after the scan: nonCompliant: 1 — the storage account is flagged but not blocked, because the effect is Audit. That count is your blast radius.

Step 5 — Flip the assignment to Deny and prove it blocks.

az policy assignment update \
  --name "lab-audit-public-blob" \
  --scope "/subscriptions/$SUB/resourceGroups/$RG" \
  --params '{ "effect": { "value": "Deny" } }' -o table

# Now try to create another public storage account — it should be REJECTED
az storage account create -n stpolicylab$RANDOM -g $RG -l $LOC --sku Standard_LRS \
  --allow-blob-public-access true 2>&1 | grep -i "disallowed\|RequestDisallowedByPolicy" \
  && echo "Policy denial seen above: Deny is working." \
  || echo "No policy denial seen. The updated assignment may not have taken effect yet; wait a few minutes and retry."

Expected: the create fails with RequestDisallowedByPolicy naming the assignment. Note that the existing public account from Step 4 is still there — Deny is forward-only, which is exactly why a fleet needs Modify/DINE + remediation.

Step 6 — (Optional) Add a time-bound exemption for the legacy account.

az policy exemption create \
  --name "lab-waiver" \
  --policy-assignment "/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.Authorization/policyAssignments/lab-audit-public-blob" \
  --exemption-category Waiver \
  --scope "/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.Storage/storageAccounts/$SA" \
  --expires-on "2026-12-31T23:59:59Z" \
  --description "LAB-001: demo waiver; owner: you" -o table

Step 7 — Teardown (delete everything so there’s no spend or lingering policy).

az policy exemption delete --name "lab-waiver" \
  --scope "/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.Storage/storageAccounts/$SA" 2>/dev/null
az policy assignment delete --name "lab-audit-public-blob" \
  --scope "/subscriptions/$SUB/resourceGroups/$RG"
az policy definition delete --name "lab-deny-public-blob" --subscription $SUB
az group delete -n $RG --yes --no-wait

The lab steps and what each proves:

Step	What you did	What it proves
2	Authored a parameterized definition	Effect is a knob, not a constant
3	Assigned as `Audit`	Same definition, scope chosen at assignment
4	Created a bad resource + scanned	Audit measures without blocking
5	Flipped to `Deny`	Enforcement blocks new violations only
6	Added a time-bound exemption	The documented, expiring exception
7	Deleted everything	Clean teardown; no lingering guardrails

Common mistakes & troubleshooting

Each failure mode below is laid out as symptom → root cause → how to confirm (exact command) → fix. First the playbook table you scan mid-rollout, then the detail on the ones that need it.

#	Symptom	Root cause	Confirm (exact command / path)	Fix
1	EPAC plan wants to `Delete` a live assignment	Drift — object changed/created in the portal, not in Git	`Build-DeploymentPlans` → read the plan’s deletes	Re-import to Git, or confirm `pacOwnerId` ownership before allowing the delete
2	`Deny` blocked 4,000 deploys on day one	Skipped the `Audit` ring	`az policy state summarize` shows the count you never read	Roll back to `Audit`; promote after triage
3	“does not have the required role assignments”	DINE/Modify MSI replication lag (or missing role)	`az policy assignment show --query identity`; `az role assignment list --assignee <principalId>`	Split deploy/roles stages + gap; re-run `Deploy-RolesPlan`
4	DINE redeploys its template every scan	Wrong `existenceCondition` — never matches an already-compliant resource	`az policy state list` shows perpetual non-compliance on compliant resources	Fix the `existenceCondition`; test against a known-good resource
5	Remediation `Failed` halfway, fleet half-fixed	`429 Too Many Requests` from an unthrottled tenant-wide task	Remediation task → failure count; activity log `429`s	Lower `--parallel-deployments`/`--resource-count`; scope per LZ; re-run (idempotent)
6	Identity-bearing assignment fails to deploy	Missing `location` on a `Modify`/DINE assignment	Deploy error names a missing region	Add `location:` to the assignment
7	Compliance dashboard empty / `NotStarted`	No scan has run since deploy	`az policy state summarize` shows `NotStarted`	`az policy state trigger-scan`; wait the scan window
8	Policy “does nothing” on a property	No alias exists, or `field` typo’d	`az provider show --namespace <rp-namespace> --expand "resourceTypes/aliases"` lacks the alias	Use an existing alias, or pick a different enforcement point
9	Two assignments fight; compliance `Conflicting`	Overlapping assignments with different effects	`az policy assignment list --scope ...` shows both	Consolidate; one initiative per control
10	Exemption “covers” a resource but it’s still flagged	Wrong `policy-definition-reference-ids` or scope	`az policy exemption show` vs the failing definition ref	Match the exact reference ID + narrowest scope
11	Effect changed in Git but Azure still old	Plan not re-run, or wrong selector deployed	`Build-DeploymentPlans` diff shows the change as pending	Re-run plan/deploy with the correct `PacEnvironmentSelector`
12	`Modify` doesn’t change existing resources	`Modify` is write-time; existing fleet needs remediation	`az policy state list` shows them still non-compliant	Create a remediation task for the `Modify` assignment

Drift wants to delete a live object (#1)

EPAC’s whole value is reconciliation, which means its plan will propose deleting anything stamped with your pacOwnerId that isn’t in Git. Confirm: read the Delete entries in the Build-DeploymentPlans output and check whether the object carries your owner stamp. Fix: if it’s a legitimate object someone created in the portal, import it back into the repo so Git becomes the source of truth; if it genuinely should go, let the plan remove it. Never set desiredState.strategy to full unless you have deliberately decided EPAC owns every policy object under the scope — full will delete unowned objects too.

“Required role assignments” at scale (#3)

The production-scale classic from the scenario. Confirm the identity exists and was granted its roles:

PRINCIPAL=$(az policy assignment show --name deploy-diag-to-law \
  --scope /providers/Microsoft.Management/managementGroups/contoso \
  --query identity.principalId -o tsv)
az role assignment list --assignee "$PRINCIPAL" \
  --scope /providers/Microsoft.Management/managementGroups/contoso -o table   # empty during replication lag

Fix: split Deploy-PolicyPlan and Deploy-RolesPlan into separate stages with a deliberate gap so Azure AD finishes replicating the new system-assigned principals, then let EPAC’s idempotent retry reconcile any stragglers. A re-run of Deploy-RolesPlan is a no-op for already-granted identities.

A wrong DINE `existenceCondition` (#4)

DINE evaluates an existenceCondition to decide whether the companion resource already exists. If that condition can never match an already-compliant resource, the engine concludes the resource is missing on every scan and redeploys the template forever — noisy and expensive. Confirm: a resource you know has diagnostics configured still shows NonCompliant. Fix: test the existenceCondition against a known-good resource and confirm it reports compliant before you assign at scale.

# Are resources you believe are compliant still flagged non-compliant? (smell test)
az policy state list --resource-group rg-known-good \
  --query "[?complianceState=='NonCompliant'].{res:resourceId, policy:policyDefinitionName}" -o table
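
Why a wrong condition redeploys forever is easier to see in a toy model. The sketch below evaluates only an `allOf` of `field`/`equals` pairs against plain dictionaries, which is a small subset of the real policy language; the field names (`workspaceId`, `logsEnabled`) and both function names are hypothetical, and the bug shown is a simple field-name typo.

```python
def matches(resource, condition):
    """Evaluate a tiny subset of policy conditions: allOf of field/equals pairs."""
    return all(resource.get(c["field"]) == c["equals"] for c in condition["allOf"])


def dine_would_deploy(related_resources, existence_condition):
    """DINE deploys its template when no related resource satisfies the condition."""
    return not any(matches(r, existence_condition) for r in related_resources)


# A resource that already has the desired diagnostic setting (hypothetical shape).
existing = [{"workspaceId": "law-1", "logsEnabled": True}]

good_condition = {"allOf": [
    {"field": "workspaceId", "equals": "law-1"},
    {"field": "logsEnabled", "equals": True},
]}
# Same intent, but the field name is misspelled, so it can never match.
wrong_condition = {"allOf": [
    {"field": "workspaceID", "equals": "law-1"},
    {"field": "logsEnabled", "equals": True},
]}

print(dine_would_deploy(existing, good_condition))
print(dine_would_deploy(existing, wrong_condition))
```

With the correct condition the already-configured resource suppresses the deployment; with the typo the engine never finds a match, so every evaluation concludes the companion resource is missing.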

The `429` remediation storm (#5)

A remediation task fans out one deployment per non-compliant resource. Confirm: the task state is Failed/Complete with failures, and the activity log shows 429 Too Many Requests:

az monitor activity-log list --offset 1h \
  --query "[?contains(to_string(httpRequest), '429') || status.value=='Failed'].{op:operationName.value, status:status.value, time:eventTimestamp}" \
  -o table

Fix: lower --parallel-deployments and --resource-count, scope the task to one landing zone, and re-run — remediation is idempotent, so the re-run only fixes what’s still non-compliant. Widen concurrency only after a batch lands clean.

Best practices

Crisp, production-grade rules — most of these are the difference between a governed estate and a pile of orphaned assignments.

Keep the four object types in separate Git trees, named by capability, never by scope. A scope-named definition can never be promoted or reused.
Always assign initiatives, not bare definitions — even an initiative of one. Retrofitting an initiative later means re-creating assignments and losing compliance history.
Parameterize the effect so the same definition ships Audit then Deny per ring. The effect should live on the assignment, not in the definition.
Give EPAC a unique pacOwnerId per repo and leave desiredState.strategy at the safe ownedOnly unless you have deliberately decided to own everything.
Split the pipeline into plan (PR) and deploy (merge) with an approval gate, and split Deploy-PolicyPlan from Deploy-RolesPlan with a deliberate gap at scale.
Never enforce Deny without first reading the Audit count in a sandbox MG. The audit count is your blast radius.
Pair every Deny with a remediation plan (Modify/DINE + remediation task) for anything that already exists — Deny is forward-only.
Throttle remediation and scope it per landing zone. Run small, watch the failure column, widen only on clean batches.
Make every exemption time-bound, ticket-referenced, and version-controlled. Commit them so EPAC removes them on the next deploy after they leave Git.
Grant policy permissions to a federated service principal, never a human, and gate it behind PR review.
Test a DINE existenceCondition against a known-good resource before assigning at scale, or it redeploys forever.
Wire compliance roll-up to a dashboard and run a nightly Build-DeploymentPlans as scheduled drift detection — a no-op plan is your proof of parity.

Security notes

Policy-as-code is a security control, and its own attack surface is the deploying identity and the drift boundary.

Least privilege for the pipeline principal. It needs exactly Resource Policy Contributor (policy objects) and User Access Administrator (role assignments for DINE/Modify MSIs) at the root MG — nothing broader. UAA is powerful (it can grant any role), so scope it to the policy root MG and gate every change behind PR review. Treat this principal the way you’d treat any high-privilege identity in Entra RBAC governance.
Federated, not secret-based, auth. Use workload identity federation (OIDC) for the pipeline so there is no client secret to leak — the same secretless pattern as GitHub Actions + Terraform OIDC.
The pacOwnerId is a safety boundary, not a secret. Its job is to stop EPAC deleting objects it doesn’t own; misconfiguring it (or setting desiredState.strategy to full) is the most dangerous mistake in the whole pipeline because it can remove live guardrails.
DINE/Modify identities are themselves privileged. A DINE policy that deploys diagnostics gets Monitoring Contributor; one that configures backup gets backup roles. Enumerate exactly what each assignment’s roleDefinitionIds grant and scope them to the assignment’s MG — an over-granted DINE identity is a lateral-movement path.
Exemptions are security exceptions — treat them as such. An un-expiring Waiver is an open hole. Require a ticket, an owner, an expiry, and the narrowest scope; review the live exemption list as part of every audit.
Don’t leak resource internals in displayName/description. Policy metadata is broadly readable; keep secrets and sensitive identifiers out of it.
Protect the most destructive resources with DenyAction on delete where appropriate, and never use a blanket Disabled to silence a control — an exemption is the auditable equivalent.
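Enumerating what each DINE/Modify assignment’s roleDefinitionIds grant can be automated with a short check. The sketch below flags three broad built-in roles by their well-known GUIDs; the input shape (assignment name mapped to a list of role definition IDs) and the function names are assumptions for illustration, and a real check would need to extend the deny-list to suit the estate.

```python
from typing import Dict, List

# Built-in role GUIDs that are almost never appropriate for a DINE/Modify identity.
BROAD_ROLES = {
    "8e3af657-a8ff-443c-a75c-2fe8c4bcb635": "Owner",
    "b24988ac-6180-42a0-ab88-20f7382dd24c": "Contributor",
    "18d7d88d-d35e-4fb5-a5c3-7773c20a72d9": "User Access Administrator",
}


def role_guid(role_definition_id: str) -> str:
    """Extract the trailing GUID from a role definition resource ID."""
    return role_definition_id.rstrip("/").rsplit("/", 1)[-1].lower()


def over_granted(assignments: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each assignment name to the broad roles it requests, omitting clean ones."""
    findings = {}
    for name, role_ids in assignments.items():
        broad = [BROAD_ROLES[role_guid(r)] for r in role_ids
                 if role_guid(r) in BROAD_ROLES]
        if broad:
            findings[name] = broad
    return findings


prefix = "/providers/Microsoft.Authorization/roleDefinitions/"
sample = {
    # Placeholder GUID standing in for some narrowly scoped role.
    "deploy-diagnostics": [prefix + "00000000-0000-0000-0000-000000000001"],
    "configure-backup": [prefix + "B24988AC-6180-42A0-AB88-20F7382DD24C"],
}
print(over_granted(sample))
```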

Cost & sizing

Azure Policy itself is free: there is no charge for definitions, assignments, evaluations, or compliance scans. The bill comes from what your policies cause to be deployed and from how you remediate. Get these wrong and a governance pipeline quietly runs up a Log Analytics bill and burns ARM throughput on redundant deployments.

Cost driver	What it is	Rough magnitude	How to control it
Azure Policy service	Definitions, assignments, scans	Free	N/A — never the cost
DINE-deployed resources	Diagnostics → Log Analytics ingestion	Per-GB ingested; can dominate at fleet scale	Scope diagnostics; sample; tier the workspace
Wrong `existenceCondition` redeploys	Template redeployed every scan	Wasted ARM ops + any resource cost	Fix the condition (mistake #4)
Remediation deployments	One deployment per resource	Compute/time, not a per-deploy fee	Batch + throttle; one-time backlog
Log Analytics for compliance	Storing compliance/activity logs	Per-GB + retention	Right-size retention; archive tier
Pipeline agent minutes	CI/CD running EPAC	Cheap (minutes per run)	Run plan on PR, deploy on merge only

The DINE-to-Log-Analytics path is where real money hides: a deployIfNotExists that turns on every diagnostic category for every resource across 600 subscriptions can ingest enormous volumes. Size it deliberately — pick the categories you actually query, consider sampling, and route to a workspace tiered for the volume, the same discipline as in Azure Monitor & Application Insights observability. In INR terms, the policy pipeline’s own footprint is negligible (pipeline minutes, a few rupees a run); the variable cost is entirely the ingestion and retention your DINE policies generate, which can run from near-zero on a small estate to lakhs per month if you onboard full diagnostics fleet-wide without sampling.
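A back-of-envelope estimate makes the ingestion risk concrete before a diagnostics policy goes fleet-wide. The function below is a rough sketch: every input (resource count, GB per resource per day, price per GB, sampling rate) is a placeholder you must supply from your own measurements and your workspace’s actual pricing, and the figures in the usage lines are invented purely to show the arithmetic.

```python
def monthly_ingestion_cost(resources: int,
                           gb_per_resource_per_day: float,
                           price_per_gb: float,
                           sample_rate: float = 1.0,
                           days: int = 30) -> float:
    """Estimate monthly ingestion spend: resources x GB/day x sampling x days x price."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    gb_per_month = resources * gb_per_resource_per_day * sample_rate * days
    return gb_per_month * price_per_gb


# Invented inputs: 1,000 resources, 0.5 GB/day each, 2.0 currency units per GB.
full = monthly_ingestion_cost(1000, 0.5, 2.0)
sampled = monthly_ingestion_cost(1000, 0.5, 2.0, sample_rate=0.1)
print(full, sampled)
```

Because the cost is linear in every factor, trimming diagnostic categories (lower GB per resource) and sampling compound: halving each cuts the bill to a quarter.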

Sizing the rollout itself — how long and how risky each phase is:

Phase	Effort / duration	Cost	Risk if rushed
Stand up EPAC + pipeline	Days (one-time)	Negligible	Misconfigured `pacOwnerId`
Author + lint definitions	Hours per control	Free	Bad alias / over-broad rule
`Audit` ring + read blast radius	Minutes per control (+ scan window)	Free	Skipping it → day-one `Deny` outage
Promote to `Deny`/DINE	Per ring, gated	Free (policy)	Enforcement breaks teams
Remediate existing fleet	Days (throttled batches)	ARM time + downstream ingestion	`429` half-states; ingestion blowout

Interview & exam questions

Mapped to AZ-104 (governance), AZ-305 (design governance), AZ-400 (pipeline and drift detection), and the AZ-500/SC-100 security-design angle. Which exam emphasises which slice of this topic:

Exam	What it tests on policy-as-code	The questions below that map
AZ-104	Create/assign policy + initiatives; remediation basics; exemptions	Q1, Q2, Q3, Q11
AZ-305	Design governance: MG hierarchy, ring promotion, effect choice at scale	Q1, Q7, Q9, Q12
AZ-500	Security guardrails, least-privilege deploy identity, DenyAction	Q8, Q11
SC-100	Governance strategy, exemption discipline, audit posture	Q6, Q11, Q12
AZ-400	The CI/CD pipeline, gates, idempotent deploy, drift detection	Q6, Q9, Q10, Q12

Q1. What are the four Azure Policy object types and how do they differ? Definition (a single if/then rule), initiative/policy set (a bundle of definitions with hoisted parameters), assignment (a definition or initiative bound to a scope with parameter values), and exemption (a time-bound, audited waiver). Definitions/initiatives describe capability; assignments describe where it’s enforced; exemptions describe the documented exception.

Q2. When does a policy evaluate, and which effects can’t run inline? On resource create/update (intercepted before commit) and on a ~24-hour background compliance scan. auditIfNotExists and deployIfNotExists never run inline: they evaluate only after a write has succeeded and during the scan, because they must inspect related resources that already exist.

Q3. Why can’t Deny fix an existing non-compliant fleet, and what do you use instead? Deny only blocks new non-compliant writes; it never touches resources that already exist. To fix the existing fleet you use Modify or DeployIfNotExists plus a remediation task, which fans out one deployment per non-compliant resource.

Q4. What’s the difference between field and value in a policy rule? field reads an alias-mapped property of the target resource and can drive deny/modify into the request payload; value evaluates an arbitrary expression (a parameter, a template function) independent of the resource and cannot reach into the payload.

Q5. Why must you enumerate aliases before writing a rule? A policy can only target a property that has an alias exposed by the resource provider. If no alias exists for the property, you cannot write a field condition against it, so you list aliases first (az provider show --namespace <provider> --expand "resourceTypes/aliases") and pick a different enforcement point if needed.

Q6. What does pacOwnerId do in EPAC and why is it a safety mechanism? It stamps every object EPAC creates so the tool only manages and deletes objects bearing that ID. This makes drift removal safe — EPAC never deletes another team’s assignments or Microsoft’s built-in initiatives.

Q7. Why parameterize the effect on the assignment instead of hard-coding it in the definition? So the same definition can ship Audit in sandbox and Deny/DINE in prod by changing only the assignment’s parameter and scope — promotion through rings becomes a parameter flip, not a code change.

Q8. What permissions does the policy-deploying principal need, and why two? Resource Policy Contributor at the root MG to create policy objects, plus User Access Administrator to create the role assignments that DINE/Modify managed identities require (Owner also works but is broader than needed). Grant both to a federated service principal, never a human.

Q9. Why do Modify/DINE assignments fail with “required role assignments” at large scale, and how do you fix it? Their identities are system-assigned, so the principal doesn’t exist until the assignment is created; at hundreds of assignments, role creation can outrun Microsoft Entra ID (formerly Azure AD) replication and hit PrincipalNotFound. Fix by splitting deploy and role stages with a gap and relying on EPAC’s idempotent retry.

Q10. How do you remediate thousands of resources without a 429 storm? Throttle with --parallel-deployments and --resource-count, scope tasks per landing zone rather than tenant-wide, watch the failure column, and widen concurrency only after a batch lands clean. Remediation is idempotent, so re-runs only fix what’s still non-compliant.

Q11. What’s the difference between disabling a policy and exempting a resource? Setting the effect to Disabled silently switches the control off for every resource under the assignment, with no narrower scope, no expiry, and no trail. An exemption is scoped to specific resources/rules, carries a category, expiry, and description, is audited, and lapses on its own at expiry: the reviewable equivalent.

Q12. How do you prove that Git and Azure are actually in sync? Re-run Build-DeploymentPlans after a deploy; a plan reporting no changes is the proof of parity. A nightly plan run is your scheduled drift detection.
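The ideas behind Q6 and Q12 (ownership-scoped deletes, and a no-op plan as proof of parity) can be modelled in a few lines. This is a deliberately simplified toy, not EPAC’s actual planning algorithm or plan-file format: the dictionary shapes, `plan_changes`, and `in_sync` are all hypothetical names for illustration.

```python
from typing import Dict, List


def plan_changes(desired: Dict[str, dict],
                 deployed: Dict[str, dict],
                 pac_owner_id: str) -> Dict[str, List[str]]:
    """Compare Git (desired) with Azure (deployed) under an owned-only strategy.

    desired:  name -> object body from the repo
    deployed: name -> {"body": ..., "pacOwnerId": ...} as found in Azure
    """
    create = sorted(n for n in desired if n not in deployed)
    update = sorted(n for n in desired
                    if n in deployed and deployed[n]["body"] != desired[n])
    # Only objects stamped with our owner ID are ever candidates for deletion.
    delete = sorted(n for n, obj in deployed.items()
                    if n not in desired and obj.get("pacOwnerId") == pac_owner_id)
    return {"create": create, "update": update, "delete": delete}


def in_sync(plan: Dict[str, List[str]]) -> bool:
    """A plan with no creates, updates, or deletes is the definition of parity."""
    return not any(plan.values())


desired = {"require-tags": {"effect": "Audit"}}
deployed = {
    "require-tags": {"body": {"effect": "Audit"}, "pacOwnerId": "team-a"},
    "legacy-rule": {"body": {}, "pacOwnerId": "team-a"},      # ours, gone from Git
    "other-team-rule": {"body": {}, "pacOwnerId": "team-b"},  # not ours, untouched
}
plan = plan_changes(desired, deployed, "team-a")
print(plan, in_sync(plan))
```

Note that the other team’s object never appears in the plan, which is exactly why a wrong owner ID, or switching to a full strategy, is so dangerous: the delete filter is the only thing protecting it.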

Quick check

Which of the four object types is the only one that carries a scope and parameter values?
You ship a new tag-enforcement control. What effect do you use first, and what does its result number tell you?
A DeployIfNotExists policy redeploys its template on every compliance scan. What is almost certainly wrong?
Your production policy deploy fails with “does not have the required role assignments” at 600 assignments, even though Deploy-RolesPlan ran. What’s the cause?
A remediation task ends in Failed with the fleet half-fixed. What single setting do you change first, and what makes the re-run safe?

Answers

The assignment — definitions and initiatives describe capability; the assignment binds them to a scope with concrete parameter values.
Audit. The non-compliant count is your blast radius — flipping straight to Deny would have broken exactly that many deployments.
The existenceCondition is wrong — it never matches an already-compliant resource, so the engine thinks the companion is missing every scan. Test it against a known-good resource.
Managed-identity replication lag: system-assigned identities don’t exist until the assignment is created, and at scale role creation outruns Microsoft Entra ID replication (PrincipalNotFound). Split the deploy/roles stages with a gap and let EPAC’s idempotent retry reconcile.
Lower --parallel-deployments (and/or --resource-count) and scope per landing zone; the re-run is safe because remediation is idempotent — it only fixes what’s still non-compliant.

Glossary

Policy definition — A single governance rule (if/then) authored once and assignable to any scope.
Initiative (policy set) — A bundle of definitions with hoisted parameters, giving one compliance roll-up per business control.
Assignment — A definition or initiative bound to a scope (MG/subscription/RG) with concrete parameter values; the only object that knows where enforcement happens.
Exemption — A time-bound, audited waiver of an assignment, scoped down to a resource; either Waiver or Mitigated.
Effect — What a policy does when its rule matches: Audit, Deny, Modify, Append, DeployIfNotExists, AuditIfNotExists, Manual, Disabled, DenyAction.
Alias — A mapping from a policy field to a resource provider’s real property path; without one, you can’t write policy against that property.
Mode — Which resource types a definition evaluates: Indexed (taggable/locatable resources) or All (plus RGs/subscriptions), with data-plane modes for AKS/Key Vault.
field vs value — field reads the target resource’s (alias-mapped) property and can drive payload mutation; value evaluates an arbitrary expression independent of the resource.
EPAC (Enterprise Policy as Code) — A PowerShell module that reads a policy repo, builds a plan, and applies it idempotently with drift detection.
pacOwnerId — The GUID EPAC stamps on objects it manages, so drift removal only ever touches objects it owns.
existenceCondition — The DINE/AINE check for whether a related resource already exists; a wrong one causes perpetual redeploys.
Remediation task — A bulk operation that fans out one template deployment per non-compliant resource to fix an existing fleet.
Compliance scan — The ~24-hour background re-evaluation of existing resources that refreshes audit/DINE/AINE state.
What-If — A preview of what a deployment would create, modify, or delete, without changing anything — the cheapest pre-merge validation.
Management group (MG) — A container above subscriptions through which policy and RBAC inherit downward; the usual policy deployment root scope.

Next steps

Azure Policy governance at scale — the conceptual model this pipeline automates, including initiative design and compliance roll-ups.
Enterprise-scale management-group hierarchy design — how to shape the MG rings this pipeline promotes through.
Bicep deployment stacks, What-If & CI — the validation and deployment mechanics that pair with the policy pipeline.
Azure DevOps YAML multistage environments & approvals — the gate-and-promotion patterns for the plan/deploy split.
Gatekeeper / OPA policy as code for admission control — the in-cluster equivalent for AKS, which Azure Policy’s Microsoft.Kubernetes.Data mode integrates with.

Written by Vinod
