Azure Policy as Code: A Git-Driven Governance Pipeline

Portal-clicked policy is governance you cannot review, diff, or roll back. The fix is treating policy the way you treat infrastructure: definitions, initiatives, and assignments live in Git, get tested in a pipeline, and deploy to management groups through a promotion ring. This guide builds that pipeline end to end and handles the hard part nobody mentions in the quickstarts — remediating thousands of existing resources without melting the control plane.

The governance gap

A policy assigned by hand in the portal has no pull request, no reviewer, no history of why the deny threshold is what it is. When an auditor asks “who approved exempting this subscription from disk encryption, and when does the waiver expire?”, clicking through blades does not produce an answer. Version-controlled intent does.

The model has four object types, and keeping them separate is the whole discipline:

Object	What it is	Scope it targets
Policy definition	A single rule (the `if`/`then`)	Authored once, assigned anywhere
Initiative (policy set)	A bundle of definitions with shared params	Authored once
Assignment	A definition/initiative bound to a scope, with parameter values	Management group, subscription, RG
Exemption	A time-bound, audited waiver of an assignment	Down to the resource

Definitions and initiatives describe capability. Assignments describe where it is enforced. Exemptions describe the documented exceptions. Mixing them — for example, baking a subscription ID into a definition — is how policy repos rot.
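A cheap way to enforce that separation is a repo check that fails when a definition embeds a concrete scope. The sketch below is one possible lint, not part of any Azure tooling; the function name `find_hardcoded_scopes` and the sample documents are hypothetical.

```python
import json
import re

# Matches a subscription-scoped path segment that carries a concrete GUID.
SUBSCRIPTION_ID = re.compile(
    r"/subscriptions/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

def find_hardcoded_scopes(definition: dict) -> list[str]:
    """Return every subscription path embedded anywhere in a definition."""
    text = json.dumps(definition)
    return sorted(set(SUBSCRIPTION_ID.findall(text)))

# A definition that stays scope-free, and one that bakes a subscription in.
clean = {"properties": {"policyRule": {
    "if": {"field": "type", "equals": "Microsoft.Storage/storageAccounts"},
    "then": {"effect": "[parameters('effect')]"}}}}
leaky = {"properties": {"policyRule": {
    "if": {"field": "id",
           "contains": "/subscriptions/11111111-2222-3333-4444-555555555555/"},
    "then": {"effect": "deny"}}}}

print(find_hardcoded_scopes(clean))
print(find_hardcoded_scopes(leaky))
```

Run against every file under the definitions and initiatives trees, a non-empty result is a reason to fail the PR and move the scope into an assignment.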

Step 1 — Anatomy of a custom policy

A definition is JSON with a policyRule (the logic) and parameters (the knobs). The rule’s if block evaluates resource properties; the matched resources get the then.effect.

{
  "properties": {
    "displayName": "Storage accounts must disable public blob access",
    "mode": "Indexed",
    "parameters": {
      "effect": {
        "type": "String",
        "defaultValue": "Deny",
        "allowedValues": ["Audit", "Deny", "Disabled"]
      }
    },
    "policyRule": {
      "if": {
        "allOf": [
          { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
          {
            "field": "Microsoft.Storage/storageAccounts/allowBlobPublicAccess",
            "notEquals": false
          }
        ]
      },
      "then": { "effect": "[parameters('effect')]" }
    }
  }
}
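To make the matching logic concrete, here is a toy Python model of how the if block above selects resources. It is a sketch for building intuition, not the Azure Policy engine: it supports only allOf, equals, and notEquals, compares values case-sensitively, and the `ALIASES` table and resource shapes are hypothetical stand-ins.

```python
# Hypothetical alias table: policy alias -> key under the resource's "properties".
ALIASES = {
    "Microsoft.Storage/storageAccounts/allowBlobPublicAccess": "allowBlobPublicAccess",
}

def get_field(resource: dict, field: str):
    """Resolve a field name to a value; a missing property resolves to None."""
    if field == "type":
        return resource.get("type")
    return resource.get("properties", {}).get(ALIASES[field])

def matches(condition: dict, resource: dict) -> bool:
    """Return True when the resource satisfies the condition (the effect applies)."""
    if "allOf" in condition:
        return all(matches(c, resource) for c in condition["allOf"])
    actual = get_field(resource, condition["field"])
    if "equals" in condition:
        return actual == condition["equals"]
    if "notEquals" in condition:
        return actual != condition["notEquals"]
    raise ValueError(f"unsupported condition: {condition}")

RULE_IF = {"allOf": [
    {"field": "type", "equals": "Microsoft.Storage/storageAccounts"},
    {"field": "Microsoft.Storage/storageAccounts/allowBlobPublicAccess",
     "notEquals": False},
]}

locked = {"type": "Microsoft.Storage/storageAccounts",
          "properties": {"allowBlobPublicAccess": False}}
unset = {"type": "Microsoft.Storage/storageAccounts", "properties": {}}
vm = {"type": "Microsoft.Compute/virtualMachines", "properties": {}}

for name, resource in [("locked", locked), ("unset", unset), ("vm", vm)]:
    print(name, matches(RULE_IF, resource))
```

The case worth noticing is `unset`: because the rule uses notEquals false rather than equals true, a storage account that never set the property still matches and gets the effect.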

Two things trip people up here:

field vs. value. field reads a property of the resource being evaluated and is alias-aware: Microsoft.Storage/storageAccounts/allowBlobPublicAccess is an alias that maps to the resource’s actual property path. value evaluates an arbitrary expression (a parameter, a [resourceGroup()] call, a template function) rather than reading a resource property directly, although the expression can wrap one with the field() function. Use field for “what is this resource’s property”; use value for “what does this expression compute to”. A field condition also lets deny/modify reach into the request payload before it is committed; a value condition cannot.

Aliases are the contract. If there is no alias for a property, you cannot write policy against it. Enumerate what is available before you write the rule:

# List aliases for a resource type and confirm they're modifiable (needed for modify/append)
az provider show --namespace Microsoft.Storage \
  --expand "resourceTypes/aliases" \
  --query "resourceTypes[?resourceType=='storageAccounts'].aliases[].{alias:name, modifiable:defaultMetadata.attributes}" \
  -o table

mode matters too: use Indexed for policies that target taggable/locatable resources (it skips resource types that do not support tags/location, avoiding false non-compliance), and All when you must evaluate resource groups and subscriptions themselves.

Evaluation order: policy runs on resource create/update (the request is intercepted before it commits, which is why deny can block and modify/append can mutate the payload), and again on a roughly 24-hour background compliance scan for existing resources. auditIfNotExists/deployIfNotExists are the exception: they are never evaluated inline. They run only after the resource provider has accepted the write, and again on the background scan, because they have to look at related resources that already exist.

Step 2 — Choosing the right effect

The effect is the single most consequential decision in a policy. Pick wrong and you either block legitimate deployments or audit forever while nothing improves.

Effect	Fires when	Use it for
`Audit`	On write and on scan; only flags	Measuring a new rule’s blast radius before enforcing
`Deny`	On write, before commit	Hard guardrails (“no public IPs in Corp”)
`Modify`	On write; mutates the payload	Adding/replacing tags, enforcing properties (needs MSI + roles)
`Append`	On write; adds fields	Injecting a default value where none was supplied
`DeployIfNotExists` (DINE)	On write and on scan, if a related resource is missing	Auto-onboarding diagnostics, Defender, backup
`AuditIfNotExists` (AINE)	On write and on scan	Reporting on missing companion resources
`Disabled`	Never	Killing one rule inside an initiative without removing it

The non-obvious rules: Deny cannot fix existing resources; it only stops new bad ones, so existing drift needs Modify/DINE plus a remediation task. DeployIfNotExists and Modify both require a managed identity on the assignment, holding the roles the definition lists in roleDefinitionIds, because the policy engine acts on your behalf. And DINE evaluates an existenceCondition: if it is wrong, the resource never reports compliant, so every write and every remediation run redeploys the template, which is both noisy and expensive.
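The existenceCondition failure mode is easier to see in a small model. The following Python sketch assumes the documented behaviour that a resource counts as compliant when at least one related resource satisfies the condition; the setting shape and both condition functions are hypothetical.

```python
from typing import Callable

def dine_is_compliant(related: list[dict],
                      existence_condition: Callable[[dict], bool]) -> bool:
    """Compliant when at least one related resource satisfies the condition."""
    return any(existence_condition(r) for r in related)

# What the DINE template actually deploys (illustrative shape only).
deployed_settings = [{"workspaceId": "law-prod", "logsEnabled": True}]

def correct_condition(setting: dict) -> bool:
    # Checks exactly the properties the template sets.
    return setting.get("workspaceId") == "law-prod" and setting.get("logsEnabled") is True

def wrong_condition(setting: dict) -> bool:
    # Bug: checks a property the template never sets, so it can never pass.
    return setting.get("metricsEnabled") is True

print(dine_is_compliant(deployed_settings, correct_condition))
print(dine_is_compliant(deployed_settings, wrong_condition))
```

With the wrong condition the resource stays non-compliant even right after a successful deployment, which is the signature to look for when a DINE policy keeps redeploying.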

Step 3 — Structuring the repo

Keep the four object types in separate trees, named by their logical identity, never by scope:

policy/
├── policy-definitions/
│   └── deny-storage-public-access.json
├── policy-set-definitions/        # initiatives
│   └── security-baseline.json
├── policy-assignments/
│   ├── platform-mg.json           # one manifest per management group
│   └── landing-zones-mg.json
├── policy-exemptions/
│   └── prod/                      # exemptions, segregated by environment
└── global-settings.jsonc          # PaC env -> MG/subscription mapping

An initiative groups definitions and hoists their parameters so an assignment sets values once:

{
  "properties": {
    "displayName": "Security Baseline",
    "policyType": "Custom",
    "parameters": {
      "storageEffect": { "type": "String", "defaultValue": "Deny" }
    },
    "policyDefinitions": [
      {
        "policyDefinitionReferenceId": "denyStoragePublic",
        "policyDefinitionId": "/providers/Microsoft.Management/managementGroups/contoso/providers/Microsoft.Authorization/policyDefinitions/deny-storage-public-access",
        "parameters": { "effect": { "value": "[parameters('storageEffect')]" } }
      }
    ]
  }
}

Always assign initiatives, not loose definitions — even an initiative of one. It gives you a stable assignment surface so you can add rules later without re-pointing the assignment, and one compliance roll-up per business control.
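Hoisting parameters introduces one new failure mode: a member references an initiative parameter that was never declared. A minimal sketch of a pre-merge check follows, assuming the initiative JSON shape shown above; the function name `undeclared_parameters` is hypothetical.

```python
import re

# Captures the name inside a "[parameters('name')]" expression.
PARAM_REF = re.compile(r"\[parameters\('([^']+)'\)\]")

def undeclared_parameters(initiative: dict) -> set[str]:
    """Return parameter names that members reference but the initiative never declares."""
    props = initiative["properties"]
    declared = set(props.get("parameters", {}))
    referenced: set[str] = set()
    for member in props.get("policyDefinitions", []):
        for param in member.get("parameters", {}).values():
            value = param.get("value")
            if isinstance(value, str):
                referenced.update(PARAM_REF.findall(value))
    return referenced - declared

good = {"properties": {
    "parameters": {"storageEffect": {"type": "String", "defaultValue": "Deny"}},
    "policyDefinitions": [{
        "policyDefinitionReferenceId": "denyStoragePublic",
        "parameters": {"effect": {"value": "[parameters('storageEffect')]"}},
    }],
}}
# Same initiative with a typo in the reference.
broken = {"properties": {
    "parameters": {"storageEffect": {"type": "String", "defaultValue": "Deny"}},
    "policyDefinitions": [{
        "policyDefinitionReferenceId": "denyStoragePublic",
        "parameters": {"effect": {"value": "[parameters('storageEfect')]"}},
    }],
}}

print(undeclared_parameters(good))
print(undeclared_parameters(broken))
```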

Step 4 — The EPAC workflow

You can hand-roll deployment with az policy commands, but reconciling desired state against what is live — including deleting assignments you removed from Git — is exactly what EPAC (Enterprise Policy as Code) solves. It is a maintained PowerShell module that reads your repo, builds a plan, and applies it idempotently, with full drift detection: anything in Azure that is not in Git is flagged and (optionally) removed.

EPAC’s three commands map cleanly onto a pipeline:

Install-Module -Name EnterprisePolicyAsCode -Scope CurrentUser

# 1. PLAN — diff desired (repo) vs. deployed; emit plan artifacts, change nothing
Build-DeploymentPlans `
  -DefinitionsRootFolder ./policy `
  -OutputFolder ./output `
  -PacEnvironmentSelector epac-prod

# 2. DEPLOY definitions, initiatives, and assignments from the plan
Deploy-PolicyPlan `
  -DefinitionsRootFolder ./policy `
  -InputFolder ./output `
  -PacEnvironmentSelector epac-prod

# 3. DEPLOY the role assignments DINE/modify identities need
Deploy-RolesPlan `
  -DefinitionsRootFolder ./policy `
  -InputFolder ./output `
  -PacEnvironmentSelector epac-prod

The global-settings.jsonc ties a selector to a real scope and identity:

{
  "pacOwnerId": "f0000000-1111-2222-3333-444444444444",
  "pacEnvironments": [
    {
      "pacSelector": "epac-prod",
      "cloud": "AzureCloud",
      "tenantId": "<tenant-guid>",
      "deploymentRootScope": "/providers/Microsoft.Management/managementGroups/contoso"
    }
  ]
}

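pacOwnerId is what makes drift detection safe: EPAC stamps it into the metadata of every object it deploys and uses it to tell its own objects from everyone else’s. How strictly that protects foreign objects depends on the desired-state strategy set in global-settings.jsonc: with ownedOnly, EPAC deletes only objects carrying its owner ID; with full, it also removes unowned custom objects inside the deployment root scope. Built-in definitions are never deleted. Start with ownedOnly while other teams still create assignments by hand.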
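The ownership rule can be pictured as a filter over what is deployed. This Python sketch models only the idea of owner-scoped deletion; it is not EPAC's implementation, and the object shape (a `metadata.pacOwnerId` key) and function name are assumptions made for illustration.

```python
def deletion_candidates(deployed: list[dict], desired_names: set[str],
                        pac_owner_id: str) -> list[str]:
    """Names that are live, absent from Git, and stamped with our owner ID."""
    return sorted(
        obj["name"]
        for obj in deployed
        if obj["name"] not in desired_names
        and obj.get("metadata", {}).get("pacOwnerId") == pac_owner_id
    )

OWNER = "f0000000-1111-2222-3333-444444444444"
deployed = [
    {"name": "security-baseline", "metadata": {"pacOwnerId": OWNER}},  # still in Git
    {"name": "old-tag-policy", "metadata": {"pacOwnerId": OWNER}},     # removed from Git
    {"name": "team-x-manual", "metadata": {}},                         # created by hand
]

print(deletion_candidates(deployed, {"security-baseline"}, OWNER))
```

Only `old-tag-policy` qualifies: it is ours and no longer desired. The hand-made assignment is left alone because it carries no matching stamp.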

In Azure Pipelines, split plan from deploy across stages with an environment approval gate between them — plan on PR, deploy on merge:

stages:
  - stage: Plan
    jobs:
      - job: BuildPlan
        steps:
          - task: AzurePowerShell@5                # EPAC cmdlets need an Az PowerShell context
            inputs:
              azureSubscription: epac-spn          # workload identity federation
              azurePowerShellVersion: LatestVersion
              pwsh: true
              ScriptType: InlineScript
              Inline: |
                Install-Module EnterprisePolicyAsCode -Scope CurrentUser -Force
                Build-DeploymentPlans -DefinitionsRootFolder ./policy `
                  -OutputFolder $(Build.ArtifactStagingDirectory) `
                  -PacEnvironmentSelector epac-prod
          - publish: $(Build.ArtifactStagingDirectory)
            artifact: policy-plan

  - stage: Deploy
    dependsOn: Plan
    jobs:
      - deployment: ApplyPolicy
        environment: policy-prod                   # add an approval check here
        strategy:
          runOnce:
            deploy:
              steps:
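                - checkout: self                   # deployment jobs do not check out the repo by default
                - download: current
                  artifact: policy-plan
                - task: AzurePowerShell@5
                  inputs:
                    azureSubscription: epac-spn
                    azurePowerShellVersion: LatestVersion
                    pwsh: true
                    ScriptType: InlineScript
                    Inline: |
                      Install-Module EnterprisePolicyAsCode -Scope CurrentUser -Force
                      Deploy-PolicyPlan -DefinitionsRootFolder ./policy `
                        -InputFolder $(Pipeline.Workspace)/policy-plan `
                        -PacEnvironmentSelector epac-prod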

The deploying identity needs Resource Policy Contributor at the root management group for policy objects, plus User Access Administrator (or Owner) to create the role assignments DINE identities require. Grant it to the federated service principal, not a human, and gate it behind PR review.

Step 5 — Testing before rollout

Three layers of validation, cheapest first.

1. Lint and What-If on PR. Validate JSON shape, then run a What-If of the policy deployment to confirm what objects the merge would create or change — without touching production:

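# Structural sanity for every definition/initiative JSON (PowerShell); fail on the first bad file
Get-ChildItem ./policy -Recurse -Include *.json |
  ForEach-Object {
    $file = $_
    try { $null = Get-Content $file -Raw | ConvertFrom-Json }
    catch { throw "Invalid JSON in $($file.FullName): $_" }
  }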

# What-If the policy artifacts at the management-group scope
az deployment mg what-if \
  --management-group-id contoso \
  --location eastus \
  --template-file ./policy-bicep/assignments.bicep

2. Assign as Audit in a ring, read compliance. Every effect-parameterized policy should ship to a non-prod management group as Audit first. Trigger an on-demand scan and read the result instead of waiting ~24 hours:

# Force an evaluation at a scope, then summarize compliance
az policy state trigger-scan --resource-group rg-sandbox

az policy state summarize \
  --management-group mg-sandbox \
  --query "policyAssignments[].{name:policyAssignmentId, nonCompliant:results.nonCompliantResources}" \
  -o table

If Audit flags 4,000 resources, flipping straight to Deny would have broken those teams. The audit count is your blast radius.

3. Promote through a management-group ring. Use distinct EPAC environment selectors per ring and promote the same definitions outward:

mg-sandbox  ->  mg-nonprod  ->  mg-prod
 (Audit)        (Audit/Deny)     (Deny/DINE)

Same Git, same definitions; only the assignment’s effect parameter and target scope change between selectors. That is the entire value of parameterizing the effect.
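The blast-radius reading from layer 2 can be automated as a gate before an effect is promoted from Audit to Deny. The sketch below ranks assignments by non-compliant count; it assumes a list of records carrying the same two fields queried in the summarize command above, and the function name and sample IDs are hypothetical.

```python
def blast_radius(assignments: list[dict], threshold: int = 0) -> list[tuple[str, int]]:
    """Rank assignments whose non-compliant count exceeds the threshold, worst first."""
    rows = [
        (a["policyAssignmentId"].rsplit("/", 1)[-1],
         a["results"]["nonCompliantResources"])
        for a in assignments
    ]
    return sorted((r for r in rows if r[1] > threshold),
                  key=lambda r: r[1], reverse=True)

summary = [
    {"policyAssignmentId": "/providers/Microsoft.Management/managementGroups/mg-sandbox"
                           "/providers/Microsoft.Authorization/policyAssignments/tag-baseline",
     "results": {"nonCompliantResources": 12}},
    {"policyAssignmentId": "/providers/Microsoft.Management/managementGroups/mg-sandbox"
                           "/providers/Microsoft.Authorization/policyAssignments/security-baseline",
     "results": {"nonCompliantResources": 4000}},
    {"policyAssignmentId": "/providers/Microsoft.Management/managementGroups/mg-sandbox"
                           "/providers/Microsoft.Authorization/policyAssignments/clean-one",
     "results": {"nonCompliantResources": 0}},
]

for name, count in blast_radius(summary):
    print(f"{name}: {count} resources would be blocked under Deny")
```

A pipeline step could refuse promotion to the next ring while any row exceeds an agreed threshold.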

Step 6 — Remediation at scale

Deny is forward-looking. For existing fleets you need DINE or Modify plus remediation tasks, and at scale the control plane is the bottleneck.

A DINE assignment must declare its identity and the roles it grants. In Bicep:

targetScope = 'managementGroup'       // deploy with: az deployment mg create

param lawResourceId string

resource diagAssignment 'Microsoft.Authorization/policyAssignments@2024-04-01' = {
  name: 'deploy-diag-to-law'
  location: 'eastus'                  // required when identity is set
  identity: { type: 'SystemAssigned' }
  properties: {
    // custom initiative stored at this management group, not a built-in
    policyDefinitionId: managementGroupResourceId(
      'Microsoft.Authorization/policySetDefinitions', 'diagnostics-baseline')
    parameters: {
      logAnalytics: { value: lawResourceId }
    }
  }
}

After the identity exists, grant it the roles its template needs (for diagnostics-to-Log-Analytics that is typically Monitoring Contributor + Log Analytics Contributor), then create the remediation task:

# Remediate one initiative member across the assignment's scope
az policy remediation create \
  --name remediate-diag-2026q2 \
  --management-group contoso \
  --policy-assignment deploy-diag-to-law \
  --definition-reference-id deployDiagnostics \
  --resource-discovery-mode ReEvaluateCompliance

Throttling is the real engineering problem. A remediation task fans out one template deployment per non-compliant resource. Across thousands of resources that hammers ARM, and you will hit 429 Too Many Requests. Control concurrency with --parallel-deployments (how many remediations run at once) and --resource-count (the cap per task), then run multiple smaller, scoped tasks rather than one tenant-wide blast.

az policy remediation create \
  --name remediate-diag-batch-01 \
  --management-group contoso \
  --policy-assignment deploy-diag-to-law \
  --definition-reference-id deployDiagnostics \
  --parallel-deployments 10 \
  --resource-count 500

Roll remediation out per landing zone, watch the failure column, and only widen concurrency once a batch lands clean. A remediation task that 429s halfway leaves a half-fixed fleet that the next compliance scan will re-flag — slow and steady wins.
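The "scope, throttle, then widen" loop above reduces to two small calculations: how many capped tasks a landing zone needs, and what concurrency to try next. This is a planning sketch only; the doubling and halving policy and the `ceiling` default are illustrative choices, not Azure defaults.

```python
import math

def tasks_needed(non_compliant: int, resource_count: int) -> int:
    """Sequential remediation tasks required when each task is capped at resource_count."""
    return math.ceil(non_compliant / resource_count)

def next_parallelism(current: int, failed: int, ceiling: int = 30) -> int:
    """Widen concurrency after a clean batch, back off after any failure."""
    if failed == 0:
        return min(current * 2, ceiling)
    return max(current // 2, 1)

# 4,000 non-compliant resources at 500 per task.
print(tasks_needed(4000, 500))
# A clean batch at 10 parallel deployments earns a wider next batch.
print(next_parallelism(10, failed=0))
# A batch with failures halves concurrency instead.
print(next_parallelism(10, failed=3))
```

Feed the result of `next_parallelism` into --parallel-deployments for the following batch, and stop widening as soon as failures appear.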

Step 7 — Exemptions and break-glass

An exemption is the documented exception — and unlike disabling a policy, it is scoped, audited, and can expire on its own. Make every exemption time-bound:

az policy exemption create \
  --name "waiver-legacy-sa-encryption" \
  --policy-assignment "/providers/Microsoft.Management/managementGroups/contoso/providers/Microsoft.Authorization/policyAssignments/security-baseline" \
  --exemption-category Waiver \
  --scope "/subscriptions/<legacy-sub-id>" \
  --policy-definition-reference-ids denyStoragePublic \
  --expires-on "2026-09-30T23:59:59Z" \
  --description "INC-4821: legacy app migrating to managed identity; owner: platform-team"

Two categories exist: Waiver (you accept the risk and are not fixing it now) and Mitigated (the risk is handled by a compensating control outside policy). Putting the ticket number and owner in --description is what turns an exemption from a hole into an auditable decision. Commit exemption JSON to policy/policy-exemptions/<env>/ so EPAC manages their lifecycle and removes them from Azure on the next deploy after they leave Git.

For break-glass, never delete a policy assignment to unblock an incident — that silently removes the guardrail for everyone and leaves no trail. Instead, create a tightly scoped, short-expiresOn exemption through the emergency-change path, and let it self-revoke.
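The "time-bound, with a ticket" rule is easy to enforce in the PR pipeline. Below is a minimal sketch of such a check, assuming exemption files carry `properties.expiresOn` and `properties.description`; the 180-day limit and the ticket pattern are hypothetical policy choices, not Azure requirements.

```python
import re
from datetime import datetime, timedelta, timezone

TICKET = re.compile(r"\b[A-Z]+-\d+\b")  # hypothetical ticket format, e.g. INC-4821

def exemption_problems(exemption: dict, now: datetime, max_days: int = 180) -> list[str]:
    """Return the reasons an exemption should be rejected in review (empty if none)."""
    props = exemption.get("properties", {})
    problems = []
    expires = props.get("expiresOn")
    if not expires:
        problems.append("missing expiresOn")
    else:
        when = datetime.fromisoformat(expires.replace("Z", "+00:00"))
        if when <= now:
            problems.append("already expired")
        elif when - now > timedelta(days=max_days):
            problems.append(f"expires more than {max_days} days out")
    if not TICKET.search(props.get("description", "")):
        problems.append("description has no ticket reference")
    return problems

now = datetime(2026, 6, 1, tzinfo=timezone.utc)
good = {"properties": {"expiresOn": "2026-09-30T23:59:59Z",
                       "description": "INC-4821: legacy app migrating; owner: platform-team"}}
bad = {"properties": {"description": "temporary unblock"}}

print(exemption_problems(good, now))
print(exemption_problems(bad, now))
```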

Enterprise scenario

A platform team running ~600 subscriptions under a single root MG shipped a Modify policy to enforce a cost-center tag, parameterized Audit -> Deny per ring as usual. Sandbox and non-prod were clean. Production deploy failed every assignment with The policy assignment ... does not have the required role assignments — even though Deploy-RolesPlan had run.

The gotcha: Modify and DINE identities are system-assigned, so the principal does not exist until the assignment is created. EPAC creates assignments in Deploy-PolicyPlan, then grants roles in Deploy-RolesPlan, but Microsoft Entra ID (formerly Azure AD) replication of the new service principal lags. At 600 assignments, role creation outran replication and the roleAssignments PUT hit a principal that was not yet visible, surfacing as PrincipalNotFound wrapped in a generic policy error.

The fix was ordering plus idempotent retry, not more permissions. Split the two deploys into separate pipeline stages with a deliberate gap, and let EPAC’s own retry reconcile the stragglers:

- stage: DeployRoles
  dependsOn: DeployPolicy
  jobs:
    - deployment: ApplyRoles
      environment: policy-prod
      strategy:
        runOnce:
          deploy:
            steps:
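              - checkout: self                   # deployment jobs do not check out the repo by default
              - download: current
                artifact: policy-plan
              - pwsh: Start-Sleep -Seconds 120   # let Entra ID replicate new MSIs
              - task: AzurePowerShell@5          # EPAC cmdlets need an Az PowerShell context
                inputs:
                  azureSubscription: epac-spn
                  azurePowerShellVersion: LatestVersion
                  pwsh: true
                  ScriptType: InlineScript
                  Inline: |
                    Install-Module EnterprisePolicyAsCode -Scope CurrentUser -Force
                    Deploy-RolesPlan -DefinitionsRootFolder ./policy `
                      -InputFolder $(Pipeline.Workspace)/policy-plan `
                      -PacEnvironmentSelector epac-prod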

A re-run of Deploy-RolesPlan is a no-op for already-granted identities, so the second pass only cleans up what replication missed — without re-deploying a single policy object.
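The general pattern behind that fix is retry with backoff around an idempotent grant. The following is a generic Python sketch of the pattern, not EPAC's code: `PrincipalNotFound`, `grant_with_retry`, and the fake grant function are all hypothetical stand-ins.

```python
import time

class PrincipalNotFound(Exception):
    """Stand-in for the error raised while a new identity is still replicating."""

def grant_with_retry(grant, principal_id, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call grant(principal_id), backing off exponentially while the principal is missing."""
    for attempt in range(attempts):
        try:
            return grant(principal_id)
        except PrincipalNotFound:
            if attempt == attempts - 1:
                raise  # replication never caught up; surface the real error
            sleep(base_delay * 2 ** attempt)

# Fake grant that fails twice, as if replication needed two more rounds.
calls = {"count": 0}
def flaky_grant(principal_id):
    calls["count"] += 1
    if calls["count"] <= 2:
        raise PrincipalNotFound(principal_id)
    return f"granted:{principal_id}"

delays = []  # record the waits instead of actually sleeping
result = grant_with_retry(flaky_grant, "msi-001", sleep=delays.append)
print(result, delays)
```

The retry is only safe because the grant is idempotent: repeating it for an identity that already holds the role changes nothing.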

Verify

Confirm the pipeline actually produced governed, observable state:

# Custom definitions exist at the management-group scope
az policy definition list --management-group contoso \
  --query "[?policyType=='Custom'].{name:name, mode:mode}" -o table

# Assignments are live and carry an identity where DINE/modify is used
az policy assignment list --scope \
  /providers/Microsoft.Management/managementGroups/contoso \
  --query "[].{name:name, effectScope:scope, identity:identity.type}" -o table

# Compliance has data (not 'NotStarted') after a scan
az policy state summarize --management-group contoso \
  --query "results.resourceDetails" -o table

# Exemptions are present and have an expiry
az policy exemption list --scope /subscriptions/<legacy-sub-id> \
  --query "[].{name:name, category:exemptionCategory, expires:expiresOn}" -o table

A clean EPAC Build-DeploymentPlans re-run after deploy should report no changes — that is your proof Git and Azure are in sync.

Readiness checklist

Pitfalls

Assigning bare definitions. Always wrap in an initiative; retrofitting later means re-creating assignments and losing compliance history.
Deny with no remediation plan. It stops new violations but never fixes the fleet — pair it with Modify/DINE for anything that already exists.
A wrong DINE existenceCondition. The resource never reports compliant, so every write and every remediation run redeploys the template, burning quota and money. Test the condition against an already-compliant resource and confirm it reports compliant.
Forgetting location on an identity-bearing assignment. Deployment fails outright — system-assigned identities need a region.
Unbounded remediation. Tenant-wide tasks with default concurrency will 429; scope and throttle, then widen.
Disabling instead of exempting. Disabled kills the rule for everyone with no audit trail; a scoped, expiring exemption is the reviewable equivalent.

Written by Vinod
