Azure Policy and Governance at Scale: Enforce the Rules Automatically

Quick take: Azure Policy is your automated cloud referee. It evaluates every resource against rules you author once and assign high in the hierarchy — and it can prevent a bad deployment before it exists, audit drift you already have, modify a request in flight, or deploy the missing piece. The art is not writing JSON; it is choosing the right effect, assigning at the right scope, wiring the right identity, and reading compliance without chasing a number that hasn’t refreshed yet.

A security audit lands on your desk. It found public IPs on virtual machines, storage accounts with public network access, unencrypted managed disks, resources in non-approved regions, and a thousand resource groups with no CostCenter tag. Your team fixes them by hand over a weekend — and the next Monday the report is dirty again, because nothing stopped the next engineer from doing exactly the same thing. Manual review is a treadmill: at any real scale you cannot click through every resource, in every subscription, every week, forever. Azure Policy is how you get off the treadmill. It is the Azure-native governance engine that evaluates each resource against rules you define, and — depending on the effect you choose — denies the noncompliant deployment outright, flags it for a report, rewrites the request to add the missing setting, or fires a remediation that deploys what was absent. You author the rule once, assign it at a management group, and it governs every subscription beneath it.

This is the practitioner’s playbook for running Policy at scale, not a tour of the portal. We go effect by effect (deny, audit, append, modify, deployIfNotExists, auditIfNotExists, denyAction, manual and disabled), because choosing the wrong one is the single most common mistake: people set audit and wonder why nothing got fixed, or set a broad deny and break every pipeline in the tenant at 2pm on a Friday. We cover the assignment and inheritance model (definitions live high, assignments inherit down the management-group → subscription → resource-group tree), the managed identity and RBAC wiring that deployIfNotExists and modify need (forget it and remediation fails with Forbidden), the compliance evaluation lifecycle (on-change, plus a roughly 24-hour full scan, so the dashboard lags and chasing a stale number wastes an afternoon), and the difference between an exclusion (notScopes, scope-level) and an exemption (a tracked, expiring waiver with a reason). Every operation gets both an az CLI snippet and Bicep/JSON, and because this is a reference you will return to mid-incident, the effects, the limits, the SDK errors and the playbook are all laid out as scannable tables.

By the end you will stop firefighting compliance and start preventing it. When the audit comes you will show a green dashboard you can explain — every red exception is a tracked exemption with an owner and an expiry, every deny has been through an audit phase, every deployIfNotExists has an identity with least-privilege roles, and every assignment sits at the highest scope that makes sense. Good governance is not about saying no. It is about making the right choice the only easy choice, automatically, at the scale of a whole tenant.

What problem this solves

Cloud at scale fails open by default. Anyone with Contributor on a subscription can create a storage account with public access, spin a VM in a region your data-residency rules forbid, deploy an un-tagged resource group that no cost report can attribute, or open an NSG to 0.0.0.0/0 on port 22. None of that is a bug in their permissions — Contributor is supposed to let them deploy. The gap is that “what you are allowed to do” (RBAC) and “what you are allowed to deploy like this” (governance) are different questions, and RBAC answers only the first. Azure Policy answers the second: it constrains the shape of what gets deployed, regardless of who is deploying it.

What breaks without it is a slow, expensive grind. A regulated company fails an audit because a forgotten dev subscription has unencrypted disks. A finance team cannot do showback because 40% of resources have no cost tags. A platform team spends every sprint chasing drift tickets — “someone enabled public access on the prod storage account again” — that a single deny policy would have made impossible. And the manual remediation that does happen is itself a risk: an engineer hand-editing a thousand resources at 1am makes mistakes a deployIfNotExists would not. The cost is real money (a misconfigured public endpoint is a breach waiting to happen), real audit findings, and real engineering hours burned on work a rule should do for free.

Who hits this: every organisation past the “one subscription, five people” stage. It bites hardest on regulated workloads (where “we’ll fix it later” is an audit finding), multi-subscription landing zones (where you cannot manually govern 50 subscriptions), cost-conscious teams (untagged resources are invisible spend), and anyone running a platform that hands subscriptions to other teams. The fix is almost never “review harder.” It is to encode the rule as policy, assign it high, and let the engine enforce it on every deployment, forever, including the ones that happen while you sleep.

To frame the whole field before the deep dive, here is every governance question Policy answers, the effect that answers it, and where it bites if you get it wrong:

Governance question | The policy mechanism | The effect to reach for | If you get it wrong
“Stop this bad thing from ever being deployed” | Prevention at the ARM PUT | deny (or denyAction on delete) | Too broad → blocks legitimate pipelines
“Tell me what’s already wrong” | Detection / reporting | audit / auditIfNotExists | People expect it to fix; it only flags
“Force a required setting onto every deploy” | Mutation of the request | modify / append | Needs identity (modify); silent no-op without
“Deploy the missing piece automatically” | Remediation of drift | deployIfNotExists (DINE) | Needs MI + RBAC; fails Forbidden silently
“Govern every subscription at once” | Assignment at a management group | Any effect, assigned high | Assigned too low → siblings ungoverned
“Waive a rule for one team, on the record” | A tracked, expiring waiver | exemption (not exclusion) | Untracked notScopes → a permanent hole
“Prove compliance to an auditor” | The compliance store + scans | (reporting, all effects) | Reading a stale number (24h scan lag)

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand the Azure resource hierarchy: a tenant root management group at the top, management groups nesting beneath it, subscriptions inside those, resource groups inside subscriptions, and resources inside groups. (If that tree is fuzzy, read Azure Resource Hierarchy Explained: Subscriptions, Resource Groups and Resources first — it is the substrate this whole article assigns policy onto.) You should know RBAC basics (role assignments, scopes, Contributor vs Owner), be comfortable running az in Cloud Shell, and read JSON output. Familiarity with ARM/Bicep deployments helps, because policy intercepts the Resource Manager request that a Bicep deploy produces.

This sits in the Governance & Landing Zones track. It is the enforcement layer underneath an Azure Enterprise-Scale Landing Zone: Foundation for Large Organizations — the landing-zone management-group tree is where these assignments live, and the landing-zone “policy-driven governance” principle is this article in practice. It pairs with Azure FinOps and Cost Management: Controlling Cloud Spend at Scale, because tag-enforcement policies are what make cost allocation possible, and with Azure Key Vault: Secrets, Keys and Certificates Done Right and Azure Monitor and Application Insights: Full-Stack Observability, since deployIfNotExists policies are the canonical way to force diagnostic settings and Defender onto every resource. RBAC is the complement: policy governs the shape of resources, RBAC governs who can act.

A quick map of who owns what during a governance rollout, so you know who to call when a policy bites:

Layer | What lives here | Who usually owns it | What Policy does here
Tenant root MG | Top-of-tree assignments | Platform / cloud CoE | Tenant-wide baselines (locations, tags)
Platform MGs (Identity, Connectivity, Management) | Shared-service guardrails | Platform team | Hub/networking and logging policies
Landing-zone MGs (Corp, Online) | Workload guardrails | Platform + app teams | Deny public access, require encryption
Subscription | The blast-radius unit | App / workload team | Inherited policy + sub-specific assignments
Resource group | The deploy unit | App / workload team | Finest assignment scope; exclusions
Resource | The thing evaluated | App / workload team | The target of deny/audit/modify/DINE

Core concepts

Six mental models make every later decision obvious.

Policy governs shape; RBAC governs access. RBAC answers “may this principal perform this action on this scope?” Policy answers “is this resource allowed to look like this, no matter who deployed it?” They are independent and complementary. A user with Owner can still be denied by a policy; a user with read-only access never triggers a deny because they never deploy. When something is blocked, the first fork is: was it RBAC (AuthorizationFailed) or Policy (RequestDisallowedByPolicy)? Different error, different owner, different fix.
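As a rough illustration of that first fork, the sketch below routes a failed deployment by its error code. The two codes come from the paragraph above; the `triage` helper and the suggested next steps are hypothetical, not part of any Azure SDK.

```python
# Hypothetical triage table: which layer blocked the request, and what to do next.
TRIAGE = {
    "AuthorizationFailed": ("RBAC", "fix the role assignment at the right scope"),
    "RequestDisallowedByPolicy": ("Policy", "fix the template or request an exemption"),
}

def triage(error_code: str) -> tuple[str, str]:
    """Return (owning layer, suggested next step) for an ARM error code."""
    return TRIAGE.get(error_code, ("Unknown", "read the full error body before escalating"))

layer, next_step = triage("RequestDisallowedByPolicy")
print(f"{layer}: {next_step}")
```

The point of keeping the mapping explicit is that the two failures go to different owners: an RBAC failure goes to whoever manages role assignments, a Policy failure to whoever owns the assignment.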

Definition → initiative → assignment is the whole object model. A policy definition is a single rule in JSON: an if condition (over resource fields) and a then effect. An initiative (a.k.a. policy set definition) is a bundle of definitions you manage and assign as one unit — e.g. a 200-rule regulatory baseline. An assignment attaches a definition or initiative to a scope (management group, subscription, or resource group), supplies parameters (e.g. the list of allowed locations), and sets options like enforcement mode and exclusions. The definition is the rule; the assignment is the rule applied here, with these parameters.

Assignment inherits down the hierarchy. Assign at a management group and every child management group, subscription, resource group and resource beneath it is in scope — one assignment can govern a thousand subscriptions. This is the entire reason Policy scales. Definitions themselves are stored at a scope too (you can only assign a definition at or below where it is defined), but it is the assignment’s scope that determines what gets evaluated. Assign high to govern broadly; assign low only for genuinely local rules.
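The inheritance rule can be modelled in a few lines. This is a toy sketch, not how the Policy engine is implemented: `PARENT` is a made-up hierarchy, and the subscription and management-group names are placeholders.

```python
# Hypothetical hierarchy: each subscription or management group maps to its parent.
MG = "/providers/Microsoft.Management/managementGroups/"
PARENT = {
    MG + "corp": MG + "root",
    "/subscriptions/sub-a": MG + "corp",
    "/subscriptions/sub-b": MG + "root",
}

def ancestors(resource_id: str) -> list[str]:
    """Every scope that contains the resource, nearest first."""
    parts = resource_id.strip("/").split("/")
    chain = []
    # ARM IDs nest textually: subscription -> resource group -> resource.
    if len(parts) >= 4 and parts[2].lower() == "resourcegroups":
        chain.append("/" + "/".join(parts[:4]))
    scope = "/" + "/".join(parts[:2])      # the subscription
    while scope:                            # then walk up the management-group tree
        chain.append(scope)
        scope = PARENT.get(scope)
    return chain

def governed_by(assignment_scope: str, resource_id: str) -> bool:
    """True if an assignment at assignment_scope is inherited by the resource."""
    return assignment_scope.lower() in (s.lower() for s in ancestors(resource_id))

vm = "/subscriptions/sub-a/resourceGroups/rg1/providers/Microsoft.Compute/virtualMachines/vm1"
print(governed_by(MG + "corp", vm))
```

An assignment at `corp` reaches `sub-a` but not `sub-b`, which sits directly under `root`; only an assignment at `root` covers both.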

The effect decides what happens at the ARM request. When a resource is created or updated, the Resource Manager PUT is evaluated against every in-scope assignment. deny rejects the request before the resource exists (cheapest, strongest — nothing bad is ever created). audit lets it through and records non-compliance. append/modify rewrite the request (add a tag, set a property). deployIfNotExists (DINE) lets the resource through, then deploys a related resource (a diagnostic setting, a Defender plan) if it is missing. auditIfNotExists (AINE) checks for a related resource and flags if absent. “Prevent” (deny) vs “report” (audit) vs “fix” (modify/DINE) is the choice that defines your governance posture.

Existing resources are a separate problem from new ones. deny only affects new or updated resources — it never touches what already exists. To fix what is already there you either (a) let audit report it and remediate manually, or (b) use deployIfNotExists/modify plus a remediation task that re-evaluates existing resources and brings them into line. A deny assignment makes a clean future; remediation cleans the dirty past. Most real governance needs both, and forgetting that “deny doesn’t fix existing” is a classic surprise.

Compliance is eventually consistent. The compliance state you see is computed by evaluation triggered on resource change, on assignment change, and by a periodic full scan roughly every 24 hours. So right after you assign a policy, the dashboard may show “0 / 0” or stale numbers for a while — not because the policy isn’t working, but because evaluation hasn’t run. Force it with az policy state trigger-scan when you need a fresh answer. Chasing a number that hasn’t refreshed is the most common time-waster in this whole topic.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept | One-line definition | Where it lives | Why it matters
Policy definition | One rule: if condition → then effect | A scope (MG/sub) | The atom of governance
Initiative (policy set) | A bundle of definitions assigned as one | A scope | Manage 200 rules as one unit
Assignment | A definition/initiative applied to a scope, with params | A scope | The rule in force here
Scope | MG, subscription, or resource group | The hierarchy | Determines what’s evaluated
Effect | What happens: deny/audit/modify/DINE/… | In the definition | Prevent vs report vs fix
Parameter | A value supplied at assignment (e.g. allowed regions) | The assignment | One definition, many uses
Alias | A path to a resource property Policy can read | In the definition | What you can write rules against
Compliance state | Compliant / Non-compliant / Exempt / N/A | The compliance store | The audit answer
Remediation task | Re-evaluates existing resources for modify/DINE | Per assignment | Fixes the dirty past
Managed identity | Identity the assignment uses to deploy/modify | On the assignment | DINE/modify need it or no-op
Exclusion (notScopes) | A scope carved out of an assignment | On the assignment | Quiet, untracked carve-out
Exemption | A tracked, expiring waiver with a reason | On a resource/scope | The auditable carve-out
Enforcement mode | Default (effects fire) vs DoNotEnforce (evaluate only) | On the assignment | Safe rollout switch

The effects reference — every effect, end to end

The effect is the most important choice you make. Pick audit when you mean deny and nothing gets prevented; pick deny when you mean audit and you break a pipeline. Here is the complete set, what each does at deploy time, whether it touches existing resources, and whether it needs an identity:

Effect | What it does | At deploy (new/updated) | Existing resources | Needs managed identity? | Order in pipeline
deny | Reject non-compliant requests | Blocks the PUT (request fails) | Not changed (prevent only); flagged non-compliant on scan | No | After append/modify, before audit
audit | Flag non-compliant, allow it | Allowed; marked non-compliant | Marked non-compliant on scan | No | Reporting only
append | Add fields to the request | Adds the property/tag if missing | Not retroactive (remediate via modify) | No | Before deny
modify | Add/update/remove properties or tags | Patches the request | Yes, via remediation task | Yes (role to write the property) | Before deny
deployIfNotExists (DINE) | Deploy a related resource if absent | Resource allowed, then template deployed | Yes, via remediation task | Yes (roles to deploy the template) | After the resource is created
auditIfNotExists (AINE) | Audit if a related resource is absent | Allowed; flagged if related missing | Flagged on scan | No | Reporting only
denyAction | Block specific actions (e.g. delete) | Blocks the action (e.g. DELETE) | Protects existing from the action | No | Action-level
disabled | Turn a definition off without unassigning | No effect (evaluation skipped) | N/A | No | Used to toggle off
manual | Track an attestation you set by hand | No automatic check; you attest | You set the state manually | No | For non-technical controls

The evaluation order matters when several effects target the same request — mutating effects run before deny so the request is shaped, then judged:

Evaluation stage | Effects that run here | Why this order
1. Disabled check | disabled | A disabled definition is skipped entirely
2. Append / modify | append, modify | Rewrite the request before it’s judged
3. Deny | deny, denyAction | Judge the (now-mutated) request; block if non-compliant
4. Audit | audit | Record compliance on the allowed request
5. Post-provision | deployIfNotExists, auditIfNotExists | Run after the resource exists, against related resources
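To see why the order matters, here is a toy simulation of stages 1 to 4. It is a sketch under loud assumptions: real rules are JSON conditions evaluated by Resource Manager, whereas here each rule is a hypothetical dict carrying a Python predicate (`matches`, meaning "this request is non-compliant") and, for mutating effects, a `mutate` function.

```python
def evaluate(request: dict, rules: list[dict]) -> dict:
    """Run one request through a simplified disabled -> mutate -> deny -> audit pipeline."""
    active = [r for r in rules if r["effect"] != "disabled"]           # stage 1
    for r in active:                                                    # stage 2
        if r["effect"] in ("append", "modify") and r["matches"](request):
            r["mutate"](request)
    for r in active:                                                    # stage 3
        if r["effect"] == "deny" and r["matches"](request):
            return {"allowed": False, "audited": [], "request": request}
    audited = [r["name"] for r in active                                # stage 4
               if r["effect"] == "audit" and r["matches"](request)]
    return {"allowed": True, "audited": audited, "request": request}

def missing_tag(req: dict) -> bool:
    return "CostCenter" not in req["tags"]

rules = [
    {"name": "add-tag", "effect": "modify", "matches": missing_tag,
     "mutate": lambda req: req["tags"].setdefault("CostCenter", "unassigned")},
    {"name": "require-tag", "effect": "deny", "matches": missing_tag},
]

result = evaluate({"tags": {}}, rules)
print(result["allowed"], result["request"]["tags"])
```

With both rules in place the untagged request is allowed, because the modify rule adds the tag before the deny rule judges it. Drop the modify rule and the same request is denied.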

Three reading rules that save the most time:

Distinction | The trap | How to choose correctly
deny vs audit | Setting audit and expecting prevention | deny to stop, audit to measure; almost always roll out as audit first, then flip to deny
modify vs deployIfNotExists | Using DINE to set a property on the same resource | modify changes a property on the resource itself (tags, TLS version); DINE deploys a separate related resource (diag setting, Defender)
append vs modify | Using append to change an existing value | append only adds a missing field; modify can add, replace, or remove, and is the one you remediate with

And the choice as a decision table — match the goal to the effect:

If your goal is… | Reach for… | Because…
Block storage with public access at create time | deny | Nothing bad is ever created; strongest posture
Know how many VMs lack encryption today | audit | Reports without breaking anything
Force a CostCenter tag inherited from the RG | modify (add tag) | Rewrites the request; remediable for existing
Ensure every resource sends logs to Log Analytics | deployIfNotExists | Deploys the missing diagnostic setting
Confirm a Defender plan exists on each sub | auditIfNotExists | Flags subs missing the related config
Stop anyone deleting a locked key vault | denyAction (on delete) | Blocks the destructive action specifically
Track a manual SOC-2 control with no API | manual | You attest; Policy records the state
Temporarily disable a noisy rule | disabled or DoNotEnforce mode | Keeps the assignment, suppresses the effect

deny — prevention at the request

deny is the strongest, cheapest effect: the noncompliant resource is never created, so there is nothing to remediate and no window of exposure. The deploy fails with RequestDisallowedByPolicy and the response names the offending policyDefinitionId. Use it for hard rules: allowed locations, allowed SKUs, mandatory encryption, no public network access. The danger is breadth: a deny assigned at the tenant root with a too-narrow allowed-locations list will fail every deployment to a region missing from that list, across the whole tenant, the moment it goes into Default mode.

# Assign the built-in "Allowed locations" policy (deny) at a management group, parameterised
az policy assignment create \
  --name "allowed-locations" \
  --display-name "Allow only India regions" \
  --policy "e56962a6-4747-49cd-b67b-bf8b01975c4c" \
  --scope "/providers/Microsoft.Management/managementGroups/corp" \
  --params '{ "listOfAllowedLocations": { "value": ["centralindia","southindia"] } }'
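
// Bicep equivalent. Deploy this file at management-group scope (e.g. with az deployment mg create).
targetScope = 'managementGroup'
resource allowedLocations 'Microsoft.Authorization/policyAssignments@2024-04-01' = {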
  name: 'allowed-locations'
  properties: {
    displayName: 'Allow only India regions'
    policyDefinitionId: tenantResourceId('Microsoft.Authorization/policyDefinitions', 'e56962a6-4747-49cd-b67b-bf8b01975c4c')
    enforcementMode: 'Default'   // 'DoNotEnforce' to evaluate without blocking
    parameters: {
      listOfAllowedLocations: { value: [ 'centralindia', 'southindia' ] }
    }
  }
}

The built-in deny policies you reach for most, and what each blocks:

Built-in (deny) | Blocks | Common parameter | Gotcha
Allowed locations | Resources outside the region list | listOfAllowedLocations | The built-in rule skips resources whose location is global (e.g. some networking), so the list does not restrict them
Allowed locations for resource groups | RGs outside the list | listOfAllowedLocations | Separate from the resource-level policy; assign both
Allowed virtual machine size SKUs | VM sizes off the list | listOfAllowedSKUs | Long list; maintain as a parameter
Storage accounts should disable public network access | Public-network storage | (effect param) | Breaks legit public storage; exempt deliberately
Storage account public access should be disallowed (blob anon) | Anonymous blob access | (effect param) | Different from network access; both matter
Not allowed resource types | Whole resource types | listOfResourceTypesNotAllowed | Strong; great for blocking classic/legacy types
Allowed resource types | Everything except a list | listOfResourceTypesAllowed | Inverse; very restrictive, use narrowly

audit / auditIfNotExists — measure before you prevent

audit allows the deployment but records the resource as non-compliant so it shows up in the dashboard and in az policy state list. auditIfNotExists is the “related-resource” variant: it checks whether a related resource exists (e.g. a diagnostic setting on a VM, a Defender plan on a subscription) and flags non-compliance if it is absent. Audit is your reconnaissance phase — assign the rule as audit, look at the real-world blast radius, then decide whether to flip it to deny.

# What's non-compliant for an assignment, one row per resource
az policy state list \
  --filter "PolicyAssignmentName eq 'require-disk-encryption'" \
  --query "[?complianceState=='NonCompliant'].{res:resourceId, state:complianceState}" -o table
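When the table output gets long, it can help to roll the same records up per assignment. The sketch below assumes you have saved the command's JSON output (`-o json`) and that each record carries `policyAssignmentName`, `resourceId` and `complianceState`; the sample records and resource IDs are invented for illustration.

```python
import json
from collections import Counter

# Invented sample, trimmed to the three fields used here; real records carry many more.
sample = json.loads("""[
  {"policyAssignmentName": "require-disk-encryption", "complianceState": "NonCompliant",
   "resourceId": "/subscriptions/s1/resourceGroups/rg-prod/providers/Microsoft.Compute/disks/d1"},
  {"policyAssignmentName": "require-disk-encryption", "complianceState": "NonCompliant",
   "resourceId": "/subscriptions/s1/resourceGroups/RG-PROD/providers/Microsoft.Compute/disks/d1"},
  {"policyAssignmentName": "require-disk-encryption", "complianceState": "Compliant",
   "resourceId": "/subscriptions/s1/resourceGroups/rg-prod/providers/Microsoft.Compute/disks/d2"},
  {"policyAssignmentName": "allowed-locations", "complianceState": "NonCompliant",
   "resourceId": "/subscriptions/s1/resourceGroups/rg-dev/providers/Microsoft.Storage/storageAccounts/st1"}
]""")

def noncompliant_by_assignment(states: list[dict]) -> Counter:
    """Count distinct non-compliant resources per assignment (IDs compared case-insensitively)."""
    seen = {(s["policyAssignmentName"], s["resourceId"].lower())
            for s in states if s["complianceState"] == "NonCompliant"}
    return Counter(name for name, _ in seen)

for name, count in sorted(noncompliant_by_assignment(sample).items()):
    print(f"{name}: {count}")
```

Deduplicating on the lower-cased resource ID is a defensive choice in this sketch, since the same resource can appear with different casing across records.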

auditIfNotExists and deployIfNotExists share the same “look for a related resource” engine — the only difference is the verb (report vs deploy). The fields that define what “related” means:

details field | What it checks | Example | Used by
type (in details) | The related resource type to look for | Microsoft.Insights/diagnosticSettings | AINE + DINE
existenceCondition | The condition the related resource must meet | logs[*].enabled == true | AINE + DINE
resourceGroupName | Where to look (defaults to the target’s RG) | a hub RG for shared resources | DINE mostly
evaluationDelay | Wait before evaluating (let deploys settle) | AfterProvisioning | AINE + DINE
roleDefinitionIds | Roles the assignment MI needs | Monitoring Contributor | modify + DINE

append and modify — rewrite the request

append adds fields to a request that is missing them, e.g. add a default tag or set an allowedHeaders value. It only adds; it never overwrites an existing value. modify is the powerful one: it can add, replace, or remove tags and certain properties, and crucially it is remediable, meaning a remediation task can apply the modification to existing resources. modify needs a managed identity with a role that can write the property (e.g. Tag Contributor). The canonical use is tag governance: inherit a CostCenter tag from the resource group onto every resource so cost allocation actually works.

# Built-in: "Inherit a tag from the resource group" (modify), assigned with an identity.
# Pair --role with --identity-scope so the CLI creates the role assignment for the identity.
az policy assignment create \
  --name "inherit-costcenter" \
  --policy "cd3aa116-8754-49c9-a813-ad46512ece54" \
  --scope "/subscriptions/$SUB_ID" \
  --params '{ "tagName": { "value": "CostCenter" } }' \
  --mi-system-assigned --location centralindia \
  --role "Contributor" --identity-scope "/subscriptions/$SUB_ID"

// Bicep equivalent (subscription-scope deployment)
targetScope = 'subscription'
resource inheritTag 'Microsoft.Authorization/policyAssignments@2024-04-01' = {
  name: 'inherit-costcenter'
  location: 'centralindia'                 // required when there's an identity
  identity: { type: 'SystemAssigned' }     // modify needs an MI
  properties: {
    policyDefinitionId: tenantResourceId('Microsoft.Authorization/policyDefinitions', 'cd3aa116-8754-49c9-a813-ad46512ece54')
    parameters: { tagName: { value: 'CostCenter' } }
  }
}
// Then grant the assignment's MI a role to write tags (e.g. Tag Contributor) at the scope.

The modify operations and when each applies:

modify operation | What it does | Typical use | Note
addOrReplace | Add the property, or replace its value | Force minimumTlsVersion = TLS1_2 | Overwrites; the strong form
add | Add only if absent | Add a tag if not present | Won’t clobber an existing value
remove | Delete a property/tag | Strip a forbidden tag | Useful for cleanup policies
(tag inherit) | Copy a tag from RG/subscription | CostCenter, Environment | The classic cost-governance pattern
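The three operations are easiest to compare side by side on a plain tags dictionary. This is a simplified model: in a real definition the `field` is an alias expression such as `tags['CostCenter']`, whereas the hypothetical `apply_operations` helper below takes bare tag names.

```python
def apply_operations(tags: dict, operations: list[dict]) -> dict:
    """Apply modify-style operations to a copy of a tags dict and return the result."""
    result = dict(tags)
    for op in operations:
        kind, name = op["operation"], op["field"]
        if kind == "addOrReplace":
            result[name] = op["value"]            # overwrite whatever is there
        elif kind == "add":
            result.setdefault(name, op["value"])  # keep an existing value
        elif kind == "remove":
            result.pop(name, None)                # no error if already absent
        else:
            raise ValueError(f"unsupported operation: {kind}")
    return result

existing = {"CostCenter": "1001", "Temp": "yes"}
ops = [
    {"operation": "add", "field": "CostCenter", "value": "9999"},
    {"operation": "addOrReplace", "field": "Environment", "value": "prod"},
    {"operation": "remove", "field": "Temp"},
]
print(apply_operations(existing, ops))
```

Here `add` leaves the existing CostCenter value alone, `addOrReplace` sets Environment regardless, and `remove` strips the Temp tag.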

append vs modify at a glance — pick by whether you must change a value and whether you need to fix existing:

Need | append | modify
Add a missing field on new deploys | Yes | Yes
Replace/remove an existing value | No (add only) | Yes
Remediate existing resources | No | Yes (remediation task)
Requires a managed identity | No | Yes
Can set tags | Yes (add) | Yes (add/replace/remove)

deployIfNotExists (DINE) — remediate the missing piece

DINE is how you make “every resource must have X” true rather than merely audited. It lets the resource through, then checks for a related resource (a diagnostic setting, a Defender plan, a backup config) and, if absent, deploys an ARM template to create it. This is the engine behind landing-zone “auto-everything”: auto-enable diagnostic logging, auto-deploy Microsoft Defender for Cloud plans, auto-associate a route table or NSG. DINE needs a managed identity with the roles listed in the definition’s roleDefinitionIds — without it, the deploy is a silent no-op and the remediation task fails Forbidden.

# Create the remediation task to bring EXISTING resources into line (DINE/modify)
az policy remediation create \
  --name "remediate-diag-settings" \
  --policy-assignment "send-vm-logs-to-law" \
  --resource-group "rg-prod" \
  --resource-discovery-mode ReEvaluateCompliance

// Bicep: the DINE assignment that the remediation task above targets
resource dineDiag 'Microsoft.Authorization/policyAssignments@2024-04-01' = {
  name: 'send-vm-logs-to-law'
  location: 'centralindia'
  identity: { type: 'SystemAssigned' }   // DINE mandates an identity
  properties: {
    policyDefinitionId: tenantResourceId('Microsoft.Authorization/policyDefinitions', '<diag-settings-DINE-id>')
    parameters: {
      logAnalytics: { value: lawResourceId }
    }
  }
}
// Grant the MI the roleDefinitionIds the policy declares (e.g. Log Analytics Contributor +
// Monitoring Contributor) at the assignment scope, or remediation fails Forbidden.

The DINE remediation flow, step by step and where each step fails:

Step | What happens | Fails if… | Confirm
1. Assign DINE with identity | MI created/attached to the assignment | identity omitted | az policy assignment show --query identity
2. Grant MI the roleDefinitionIds | MI can deploy the template | Role not granted at scope | az role assignment list --assignee <principalId>
3. New resource deployed | Existence condition evaluated (delay) | evaluationDelay not elapsed | Compliance shows after settle
4. Related resource missing | DINE deploys the template | Template params wrong | Deployment error in remediation detail
5. Remediation task (existing) | Re-evaluates and deploys for old resources | Step 2 missing → Forbidden | az policy remediation show --query 'deploymentStatus'
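Step 2 is the one most often skipped, so a preflight check before creating the remediation task can save a failed run. The sketch below is a minimal model of that check, assuming you have already gathered the identity's role assignments into a dict keyed by scope (for example from the `az role assignment list` output). The `missing_roles` helper is hypothetical, and the prefix match only covers subscription and resource-group scopes, not management groups.

```python
def missing_roles(required: set[str], granted_by_scope: dict[str, set[str]],
                  assignment_scope: str) -> set[str]:
    """Roles the policy declares that the identity lacks at or above the assignment scope."""
    target = assignment_scope.lower().rstrip("/")
    effective: set[str] = set()
    for scope, roles in granted_by_scope.items():
        parent = scope.lower().rstrip("/")
        # A role granted at a parent scope is inherited by its child scopes.
        if target == parent or target.startswith(parent + "/"):
            effective |= roles
    return required - effective

required = {"Log Analytics Contributor", "Monitoring Contributor"}
granted = {"/subscriptions/s1": {"Monitoring Contributor"}}
print(missing_roles(required, granted, "/subscriptions/s1/resourceGroups/rg-prod"))
```

An empty result means the identity holds every declared role; anything else is the list to grant before remediation has a chance of succeeding.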

Common DINE/modify built-ins and the role each managed identity needs:

DINE/modify built-in | Deploys / sets | Role the MI needs | Scope to assign
Configure diagnostic settings to Log Analytics | Diagnostic setting per resource | Log Analytics Contributor, Monitoring Contributor | MG or subscription
Deploy Microsoft Defender for Cloud plans | Defender pricing tier on the sub | Security Admin / Owner | Subscription
Configure backup on VMs | Recovery Services vault protection | Backup Contributor, VM Contributor | MG or subscription
Inherit a tag from the resource group | Tag on the resource | Tag Contributor (or Contributor) | Subscription / RG
Configure subnets to use an NSG | NSG association on subnets | Network Contributor | MG or subscription
Enforce TLS 1.2 on storage (modify) | minimumTlsVersion property | Storage Account Contributor | MG or subscription

Assignment, scope and inheritance

Where you assign matters as much as what you assign. Assign too high and a niche rule blocks unrelated teams; assign too low and the sibling subscriptions you forgot stay ungoverned. The model: a definition is stored at a scope (you can only assign it at or below that scope), but the assignment’s scope is what determines evaluation, and it inherits downward to everything beneath.

The three assignable scopes, and what each is good for:

Scope | Governs | Best for | Watch-out
Management group | Every child MG, sub, RG, resource | Tenant/landing-zone baselines (locations, tags, encryption) | Broad blast radius; test in audit/DoNotEnforce first
Subscription | Every RG and resource in the sub | Sub-specific rules; a workload’s own guardrails | Doesn’t cover sibling subs; assign at MG for that
Resource group | Every resource in the RG | Genuinely local rules; pilots | Easy to forget; many RGs = many assignments

Inheritance and exclusion behaviour you must internalise:

Behaviour | Rule | Consequence
Downward inheritance | Assignment applies to the scope and all descendants | One MG assignment governs all child subs
No upward effect | A sub-level assignment never affects the parent MG | Assign high to go broad
notScopes exclusion | A child scope listed in notScopes is carved out | Quiet hole: untracked, easy to forget
Cumulative effects | All in-scope assignments apply together | deny from any one assignment blocks the deploy
Most-restrictive wins for deny | Any matching deny blocks, regardless of other audits | You cannot “allow over” a deny with another policy
Definition location | You can only assign at/below where the definition lives | Store shared defs high (intermediate root MG)

Initiatives — manage many rules as one

An initiative (policy set) groups definitions so you assign, parameterise and report on them as a unit. Regulatory baselines (e.g. CIS, ISO 27001, the Microsoft Cloud Security Benchmark) ship as large built-in initiatives. Assign the initiative once at a management group and you get one compliance roll-up across all its member policies. Initiatives also let you share a parameter (e.g. one allowed-locations list flowing to every member policy that needs it).

# Assign a built-in initiative (e.g. the security benchmark) at a management group
az policy assignment create \
  --name "mcsb" \
  --policy-set-definition "1f3afdf9-d0c9-4c3d-847f-89da613e70a8" \
  --scope "/providers/Microsoft.Management/managementGroups/tenant-root"

Definition vs initiative vs assignment — the object model in one grid:

Aspect | Policy definition | Initiative (set) | Assignment
What it is | One rule (if/then) | A bundle of definitions | A rule/initiative applied to a scope
Holds parameters? | Declares them | Maps + can share them | Supplies their values
Has an effect? | Yes (then.effect) | Per member definition | Inherits members’ effects
Assignable? | Yes | Yes | (it is the assignment)
Reports compliance? | Per policy | Rolled up across members | Per assignment
Typical count at scale | Hundreds | Tens | Tens–hundreds

Enforcement mode and exemptions — rolling out safely

Two safety valves separate a careful rollout from an outage. Enforcement mode is per-assignment: Default means effects fire (a deny blocks); DoNotEnforce means the assignment evaluates and reports but does not enforce, so you can see exactly what a deny would block before it blocks anything. Exemptions are the auditable carve-out: a tracked waiver on a specific scope/resource, with a category (Waiver or Mitigated), an optional expiry, and a reason. Unlike a notScopes exclusion, an exemption shows up in compliance as Exempt and can expire.

# Evaluate a deny WITHOUT enforcing it (see the blast radius first)
az policy assignment create --name "deny-public-ip" \
  --policy "<no-public-ip-def>" --scope "/subscriptions/$SUB_ID" \
  --enforcement-mode DoNotEnforce

# Grant a tracked, expiring exemption for one resource group
az policy exemption create \
  --name "legacy-app-waiver" \
  --policy-assignment "/subscriptions/$SUB_ID/providers/Microsoft.Authorization/policyAssignments/deny-public-ip" \
  --exemption-category Waiver \
  --scope "/subscriptions/$SUB_ID/resourceGroups/rg-legacy" \
  --expires-on "2026-12-31T00:00:00Z" \
  --description "Legacy app needs a public IP until migration (JIRA-1421)"

Exclusion vs exemption — the distinction auditors care about:

Aspect | Exclusion (notScopes) | Exemption
What it is | A scope removed from the assignment | A tracked waiver for a scope/resource
Shows in compliance? | No (just not evaluated) | Yes, as Exempt
Has an expiry? | No | Yes (optional expiresOn)
Has a reason/category? | No | Yes (Waiver/Mitigated + description)
Auditable? | Poorly (a silent hole) | Yes, designed for it
Use when… | Carving out a whole environment by design | Granting a temporary, justified pass
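The difference shows up in what the compliance view reports. The sketch below models it for a single assignment; it is an illustration, not the engine's logic. `reported_state` and the "NotEvaluated" label are hypothetical names (an excluded resource simply has no record for the assignment), and the scope match is a simple ID-prefix check.

```python
from datetime import datetime, timezone

def _covers(scope: str, resource_id: str) -> bool:
    """True if resource_id is the scope itself or sits beneath it."""
    s, r = scope.lower().rstrip("/"), resource_id.lower()
    return r == s or r.startswith(s + "/")

def reported_state(resource_id, compliant, not_scopes, exemptions, now=None):
    """How one resource shows up for one assignment.
    exemptions: list of {'scope', 'category', 'expires_on' (aware datetime or None)}."""
    now = now or datetime.now(timezone.utc)
    if any(_covers(s, resource_id) for s in not_scopes):
        return "NotEvaluated"            # excluded: no compliance record at all
    for ex in exemptions:
        still_valid = ex["expires_on"] is None or ex["expires_on"] > now
        if still_valid and _covers(ex["scope"], resource_id):
            return "Exempt"              # visible, categorised, and time-boxed
    return "Compliant" if compliant else "NonCompliant"

pip = "/subscriptions/s1/resourceGroups/rg-legacy/providers/Microsoft.Network/publicIPAddresses/pip1"
waiver = [{"scope": "/subscriptions/s1/resourceGroups/rg-legacy", "category": "Waiver",
           "expires_on": datetime(2026, 12, 31, tzinfo=timezone.utc)}]
print(reported_state(pip, False, [], waiver, now=datetime(2026, 6, 1, tzinfo=timezone.utc)))
```

Once the expiry passes, the same resource falls back to NonCompliant on its own, which is exactly the behaviour a notScopes entry never gives you.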

Enforcement-mode and rollout phases — the safe path from idea to enforced:

Phase Setting What you learn / get Move on when
1. Audit Effect audit (or initiative default) Real count of non-compliant resources You understand the blast radius
2. DoNotEnforce deny def, enforcementMode=DoNotEnforce What a deny would block, with no breakage No surprising would-be denials remain
3. Remediate Remediation tasks for modify/DINE Existing drift cleaned up Compliance trending green
4. Enforce enforcementMode=Default The deny now prevents at create Steady-state; review exemptions
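
Moving from phase 2 to phase 4 is a single property change on the assignment that already exists. A minimal sketch, assuming the deny-public-ip assignment from the earlier example; the promote_to_enforced name is hypothetical, and the body relies only on az policy assignment update with --enforcement-mode:

```shell
# Hypothetical helper: flip an existing assignment from DoNotEnforce to Default.
# $1 = assignment name, $2 = assignment scope
promote_to_enforced() {
  az policy assignment update \
    --name "$1" --scope "$2" \
    --enforcement-mode Default \
    --query enforcementMode -o tsv
}

# Usage (prints the mode now stored on the assignment):
#   promote_to_enforced "deny-public-ip" "/subscriptions/$SUB_ID"
```

Updating in place keeps the assignment name, parameters, and any exemptions that reference it, which is why the rollout table treats enforcement as a setting rather than a new assignment.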

Authoring custom policies

Built-in policies are organised into categories — browse these first, because the rule you want almost certainly already exists. The categories you reach for most:

Built-in category Covers Example built-in Typical effect
General Allowed locations/types, audit basics Allowed locations deny
Tags Require/inherit/append tags Inherit a tag from the resource group modify
Storage Public access, TLS, encryption Storage accounts should disable public network access deny
Compute VM SKUs, disk encryption, extensions Allowed virtual machine SKUs deny
Network NSGs, public IPs, private endpoints Subnets should be associated with a Network Security Group auditIfNotExists
Monitoring Diagnostic settings, agents Configure diagnostic settings to a Log Analytics workspace deployIfNotExists
Security Center Defender plans, secure-config Configure Microsoft Defender for Cloud plans deployIfNotExists
Key Vault Vault firewall, purge protection, cert/key rules Key vaults should have purge protection enabled audit / deny
Regulatory Compliance CIS, ISO, MCSB initiatives Microsoft Cloud Security Benchmark (initiative)
Kubernetes In-cluster Gatekeeper/OPA rules Kubernetes clusters should not allow privileged containers audit / deny

Always check first (az policy definition list --query "[?policyType=='BuiltIn']"). When you do write custom, a definition is JSON with parameters, a policyRule (if condition + then effect), and it reads resource properties through aliases. The if block supports field, logical operators (allOf, anyOf, not), and count for array properties.

{
  "properties": {
    "displayName": "Deny storage accounts without HTTPS-only",
    "mode": "Indexed",
    "parameters": {
      "effect": { "type": "String", "allowedValues": ["Deny","Audit","Disabled"], "defaultValue": "Deny" }
    },
    "policyRule": {
      "if": {
        "allOf": [
          { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
          { "field": "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly", "notEquals": true }
        ]
      },
      "then": { "effect": "[parameters('effect')]" }
    }
  }
}
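
# Create the custom definition at a management group, then assign it.
# rule.json holds only the policyRule object (the if/then); params.json holds the parameters object.
az policy definition create \
  --name "deny-storage-http" \
  --rules @rule.json \
  --params @params.json \
  --management-group "corp" \
  --mode Indexed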

The condition operators you actually use, and what each is for:

Operator Meaning Example use
equals / notEquals Exact match field type equals Microsoft.Storage/...
in / notIn Value in a (parameter) list location in allowed list
like / notLike Wildcard match name like 'prod-*'
match / matchInsensitively Pattern (# digit, ? letter) enforce a naming pattern
contains / containsKey Substring / tag-key presence tags containsKey 'CostCenter'
exists Field present (true/false) a property must be set
allOf / anyOf / not Boolean composition combine several conditions
count Count array elements meeting a condition “all NSG rules where…”
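
count is the least intuitive operator in the table, so here is a sketch of it in a rule file. It follows the NSG pattern the table alludes to: count the security rules that are inbound, allow traffic, and target port 3389, and match when that count is greater than zero. The file name nsg-rdp-rule.json, the port, and the audit effect are illustrative assumptions.

```shell
# Write a policyRule that uses count over an array alias ([*]).
cat > nsg-rdp-rule.json <<'JSON'
{
  "if": {
    "allOf": [
      { "field": "type", "equals": "Microsoft.Network/networkSecurityGroups" },
      {
        "count": {
          "field": "Microsoft.Network/networkSecurityGroups/securityRules[*]",
          "where": {
            "allOf": [
              { "field": "Microsoft.Network/networkSecurityGroups/securityRules[*].direction", "equals": "Inbound" },
              { "field": "Microsoft.Network/networkSecurityGroups/securityRules[*].access", "equals": "Allow" },
              { "field": "Microsoft.Network/networkSecurityGroups/securityRules[*].destinationPortRange", "equals": "3389" }
            ]
          }
        },
        "greater": 0
      }
    ]
  },
  "then": { "effect": "audit" }
}
JSON
```

This sketch only catches rules that name 3389 exactly; rules that use a port range or the plural destinationPortRanges property would need extra conditions. The file can be registered the same way as the storage example, with --rules @nsg-rdp-rule.json --mode Indexed.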

mode controls what a definition evaluates — get this wrong and your rule silently never matches:

mode Evaluates Use for
Indexed Resources that support tags and location Most resource policies (the common default)
All Every resource + resource groups + subscriptions RG/sub-level rules (e.g. RG must have a tag)
Microsoft.Kubernetes.Data AKS in-cluster objects (via add-on) Gatekeeper/OPA policies on Kubernetes
Microsoft.KeyVault.Data Objects inside Key Vault (certs/keys/secrets) Key Vault data-plane governance
Microsoft.Network.Data Azure Virtual Network Manager network groups Custom (dynamic) network-group membership policies
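
The Indexed-versus-All trap is easiest to see with a resource-group rule. A sketch, assuming a hypothetical definition name require-rg-costcenter and a hard-coded CostCenter tag: the rule targets the resource-group type, which Indexed mode skips, so the definition is created with mode All.

```shell
# A rule that denies resource groups that have no CostCenter tag.
cat > rg-tag-rule.json <<'JSON'
{
  "if": {
    "allOf": [
      { "field": "type", "equals": "Microsoft.Resources/subscriptions/resourceGroups" },
      { "field": "tags['CostCenter']", "exists": "false" }
    ]
  },
  "then": { "effect": "deny" }
}
JSON

# Hypothetical helper: create the definition with mode All so resource groups are evaluated.
create_rg_tag_definition() {
  az policy definition create \
    --name "require-rg-costcenter" \
    --rules @rg-tag-rule.json \
    --mode All
}
```

Created with --mode Indexed instead, the same JSON would be accepted and assigned without error, and would simply never match a resource group.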

Aliases are the crux of custom authoring: a policy can only test a property that has an alias. If the property you want isn’t aliased, no rule can read it — a frequent dead end. Find them with the CLI before you write the if:

# List aliases for a resource type so you know what you can write rules against
az provider show --namespace Microsoft.Storage \
  --expand "resourceTypes/aliases" \
  --query "resourceTypes[?resourceType=='storageAccounts'].aliases[].name" -o tsv | grep -i tls

Custom-authoring pitfalls and how each manifests:

Pitfall Symptom Fix
Property has no alias Rule never matches; resource stays compliant Check az provider show ... aliases; use the aliased path or pick another property
Wrong mode (Indexed for an RG rule) RG-level rule never evaluates Use mode: All for RG/subscription rules
Effect hard-coded, not parameterised Can’t switch audit↔deny without editing the def Parameterise effect with allowedValues
count misused on a non-array Evaluation error / no match Use count only over array aliases ([*])
Custom dup of a built-in Maintenance burden, drift from MS updates Search built-ins first; only author the genuine gap

Compliance evaluation and reporting

The compliance store answers the audit question — but it is eventually consistent, and not understanding the timing wastes more time than any other single thing in this topic. Evaluation is triggered three ways, and the on-demand scan is your friend when you need a fresh answer now.

What triggers an evaluation, and how fast:

Trigger When it fires Latency Note
Resource change A resource is created/updated Minutes The deploy itself is evaluated synchronously for deny/modify
Assignment change You create/update/delete an assignment ~30 min for full effect New assignment’s compliance appears after a scan
Periodic full scan Background, roughly every 24 h Up to ~24 h Why the dashboard lags
On-demand scan You run trigger-scan Minutes (async) Force it instead of waiting
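
# Force an on-demand compliance scan for one resource group (omit --resource-group to scan
# the whole subscription). The command waits for the scan to finish unless you add --no-wait.
az policy state trigger-scan --resource-group "rg-prod"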

# Read the summarised compliance for an assignment
# (the CLI returns the summary object directly, so results sits at the top level)
az policy state summarize \
  --filter "PolicyAssignmentName eq 'require-disk-encryption'" \
  --query "results" -o json

The compliance states a resource can be in, and what each means:

State Meaning Counts against you? Typical cause
Compliant Meets every in-scope policy No Correctly configured
NonCompliant Violates ≥1 audit/deny-evaluated policy Yes Drift, or a new audit rule
Exempt Covered by an exemption No (tracked) A justified, expiring waiver
Conflicting Conflicting effects across assignments Investigate Two policies fighting over a property
NotStarted Evaluation hasn’t run yet N/A Just-assigned; pre-scan
Unknown (manual) manual effect, not yet attested N/A Awaiting an attestation

Why your number looks wrong — the reading traps:

You see… It’s probably… What to do
“0 of 0” right after assigning Evaluation hasn’t run (NotStarted) az policy state trigger-scan, wait minutes
Non-compliant but you “fixed it” Last scan predates your fix Trigger a scan; re-read
A resource missing from the report Wrong scope, or mode excludes it Verify assignment scope and definition mode
Count differs portal vs CLI Different time windows / filters Align the --filter and timestamp
Suddenly all non-compliant A new initiative member rule landed Check recent assignment/initiative updates
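
Most of the traps above come down to reading before refreshing. A sketch of a "refresh, then read" helper for one resource group; the fresh_noncompliant name is hypothetical, and it assumes the caller has rights to trigger an evaluation at that scope:

```shell
# Hypothetical helper: refresh compliance for one resource group, then list
# the distinct resources that are still non-compliant.
# $1 = resource group name
fresh_noncompliant() {
  # Waits for the on-demand scan to finish (deliberately no --no-wait).
  az policy state trigger-scan --resource-group "$1" || return 1
  az policy state list --resource-group "$1" \
    --filter "ComplianceState eq 'NonCompliant'" \
    --query "[].resourceId" -o tsv | sort -u
}

# Usage:
#   fresh_noncompliant "rg-prod"
```

The sort -u matters because a resource that violates several policies appears once per policy in the state list.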

The az policy commands you actually live in, grouped by what you’re doing:

Task Command Note
List built-in definitions az policy definition list --query "[?policyType=='BuiltIn']" Search before authoring custom
Create a custom definition az policy definition create --rules @rule.json --mode Indexed Add --management-group to store it high
Assign a policy/initiative az policy assignment create --policy <id> --scope <scope> --policy-set-definition for initiatives
Assign with identity (modify/DINE) az policy assignment create ... --mi-system-assigned --location <r> Then grant the declared roles
See non-compliant resources az policy state list --filter "PolicyAssignmentName eq '<n>'" Filter by assignment/resource
Summarise compliance az policy state summarize --filter ... Roll-up counts for an auditor
Force an evaluation az policy state trigger-scan --resource-group <rg> Beat the ~24h scan lag
Remediate existing drift az policy remediation create -n <name> --policy-assignment <n> -g <rg> For modify/DINE only
Grant an exemption az policy exemption create --exemption-category Waiver --expires-on <t> Tracked, expiring waiver
List exemptions az policy exemption list --scope <scope> Review near-expiry ones monthly

Architecture at a glance

The diagram traces governance the way it actually flows, left to right, and marks the five places it goes wrong. On the left is the control plane where you author: a policyDefinition (the if/then JSON rule) and an initiative that bundles many definitions. Authoring is harmless — nothing is enforced yet. The second zone is the management-group hierarchy, where you assign: you attach the definition or initiative to a management group with parameters and exclusions, and that assignment inherits down through every child subscription and resource group — one assignment, thousands of resources. The assignment also carries the managed identity that deployIfNotExists and modify need to act.

The third zone is the evaluation path, where the rubber meets the Resource Manager PUT: a deny blocks the request before anything is created (a 403 at create time), an audit/auditIfNotExists flags it without blocking, and modify/deployIfNotExists either rewrites the request or deploys the missing related resource. The fourth zone is the result: the compliance store aggregates state (refreshed on change, then a roughly 24-hour full scan), and remediation tasks use the assignment’s identity to drag existing drift back into line. The five numbered badges sit on the real failure points — wrong scope or a forgotten exclusion (1), a deny that blocks a legitimate deploy (2), an audit that only flags when people expected a fix (3), a DINE/modify that no-ops because its identity lacks RBAC (4), and a compliance number that looks stale because the 24-hour scan hasn’t run (5). Read the badge, run the named confirm command, apply the fix — that is the whole operating loop.

Azure Policy governance flowing left to right across four zones — an author/control-plane zone with a policyDefinition (if/then JSON) and an initiative bundle; a management-group hierarchy zone where the definition is assigned with parameters and a managed identity and inherits down through subscriptions; an evaluation zone on the Resource Manager PUT path showing the deny effect blocking the request, audit/auditIfNotExists flagging without blocking, and modify/deployIfNotExists rewriting the request or deploying the missing related resource; and a govern/result zone with a compliance store (on-change plus a 24-hour full scan) and remediation tasks that use the managed identity to fix existing drift — overlaid with five numbered failure badges for wrong scope or forgotten exclusion, a deny blocking a legitimate deploy, audit flagging without fixing, a DINE/modify no-op from missing RBAC on the identity, and a stale compliance number from scan lag

Real-world scenario

Medindi Health is a fictional but realistic Indian health-tech company running a regulated workload across 38 subscriptions under an enterprise-scale landing zone in Central India and South India. The platform team is six engineers; the compliance team needs to pass a payer audit in eight weeks. The starting state was ugly: a quarterly scan found 410 storage accounts with public network access enabled, 1,200 resources with no CostCenter tag, 60 VMs with unencrypted OS disks, diagnostic logging configured on barely a third of resources, and a handful of resources quietly running in non-approved regions because a contractor had deployed to eastus to “test something.” Manual remediation had been attempted twice and failed — every fix decayed within a fortnight.

The platform lead’s first instinct was the right idea and the wrong execution: she drafted a deny initiative — allowed-locations, no-public-storage, require-encryption — and very nearly assigned it at the tenant root in Default mode on a Friday afternoon. A senior architect stopped the rollout with one question: “Do you know what that denies today?” They didn’t. So they ran the entire initiative as audit first at the landing-zone management group, forced a scan with az policy state trigger-scan, and read the real blast radius. The audit revealed the surprise: a billing-integration subscription legitimately needed public storage for a partner SFTP drop, and two subscriptions ran workloads in eastus by design for a US-hosted dependency. A blind deny at root would have broken both and triggered a Sev-1.

With the blast radius known, the rollout went in phases. Phase 1 (week 1–2): the whole initiative as audit, plus DoNotEnforce on the deny components, to confirm exactly what would block. Phase 2 (week 2–3): modify to inherit CostCenter from each resource group (managed identity granted Tag Contributor), and deployIfNotExists to push diagnostic settings to Log Analytics (identity granted Log Analytics Contributor + Monitoring Contributor), followed by remediation tasks that cleaned the 1,200 untagged resources and the under-logged two-thirds in place, with no hand-editing. The DINE remediation initially failed Forbidden on one subscription; the cause was a missing role assignment for the assignment’s identity, fixed with one az role assignment create. Phase 3 (week 4): for the two legitimate exceptions, they wrote exemptions (Waiver category, a JIRA reference, and a 90-day expiry) rather than silent notScopes exclusions, so the auditor could see why each hole existed and that it was time-boxed. Phase 4 (week 5): flipped the deny components to Default. From that moment, a new public-access storage account simply could not be created.

The result eight weeks later: the compliance dashboard read 97% compliant, and every remaining red item was a tracked, expiring exemption with an owner, which is exactly what an auditor wants to see. The payer audit passed with zero governance findings. Cost allocation, previously impossible, now covered 98% of spend because the tag-inheritance modify had backfilled CostCenter everywhere. The lesson the team wrote on the wall: “audit before deny, remediate before you enforce, and a hole you can’t explain is worse than the violation it hides.” The whole rollout, as an order-of-operations table:

Phase Action Effect / mode Outcome What would have gone wrong otherwise
0 Draft deny initiative (about to enforce at root) Friday Sev-1 from blind deny
1 Run initiative as audit at LZ MG audit + DoNotEnforce Real blast radius known Two legit workloads would’ve broken
2a Inherit CostCenter tag modify + remediation 1,200 resources tagged in place Cost allocation stays impossible
2b Push diagnostic settings deployIfNotExists + remediation Logging on ~all resources Audit finding on observability
2c Fix DINE Forbidden grant MI the role Remediation succeeds Silent no-op, drift persists
3 Exempt the 2 legit exceptions exemption (Waiver, 90d) Auditable, time-boxed holes Permanent untracked notScopes
4 Flip deny to enforce enforcementMode=Default New violations impossible

Advantages and disadvantages

Policy-driven governance is the only thing that scales to a multi-subscription estate — but it has sharp edges that bite teams who assign first and think later. Weigh it honestly:

Advantages (why this model wins) Disadvantages (why it bites)
Prevention over detection: deny stops misconfiguration before the resource exists; nothing bad is ever created A too-broad deny blocks legitimate work tenant-wide the instant it goes to Default mode
One assignment governs thousands of resources via management-group inheritance Inheritance cuts both ways — assign at the wrong scope and you over-reach or under-cover silently
Auditable by design — compliance state + exemptions feed straight into governance reviews Compliance is eventually consistent (24h scan); the dashboard lags reality and people chase stale numbers
Remediation (modify/DINE) fixes existing drift automatically, not in a backlog DINE/modify silently no-op without the right managed identity + RBAC — failures are quiet (Forbidden)
Built-ins cover most needs — regulatory initiatives ship ready to assign Custom policy JSON gets intricate; missing aliases can make a desired rule impossible to write
Effects are granular — prevent, report, mutate, or deploy as the situation needs Choosing the wrong effect (audit when you meant deny) means nothing actually gets prevented
Exemptions give a tracked, expiring escape hatch with a reason notScopes exclusions create silent, permanent holes that auditors hate
Decouples governance (shape) from RBAC (access) — clean separation of concerns Two systems to reason about; “blocked” could be RBAC or Policy, and the errors differ

The model is right for any estate past a single team: regulated workloads, landing zones, cost governance, and anywhere “review harder” has already failed. It is over-engineering for a single throwaway sandbox subscription, where a couple of audit policies suffice. The disadvantages are all manageable — audit-first rollouts tame the deny risk, remediation identities are a one-time wiring job, and trigger-scan defeats the lag — but only if you know they exist, which is the entire point of this article.

Hands-on lab

Create a custom deny policy, watch it block a non-compliant deployment, then add a tag-inheritance modify with a remediation task. The cost is negligible: Policy itself has no charge, and the one storage account we deploy exists only for a few minutes before it is deleted. Run in Cloud Shell (Bash). You need permission to create policy definitions/assignments at a subscription (Resource Policy Contributor or Owner).

Step 1 — Variables and a sandbox resource group.

SUB_ID=$(az account show --query id -o tsv)
RG=rg-policy-lab
LOC=centralindia
az group create -n $RG -l $LOC -o table

Step 2 — Author a custom deny policy (storage must require HTTPS-only).

cat > rule.json <<'JSON'
{
  "if": {
    "allOf": [
      { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
      { "field": "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly", "notEquals": true }
    ]
  },
  "then": { "effect": "deny" }
}
JSON

az policy definition create --name "lab-deny-storage-http" \
  --display-name "Lab: deny storage without HTTPS-only" \
  --rules @rule.json --mode Indexed -o table

Expected: a definition row with policyType: Custom.

Step 3 — Assign it to the sandbox resource group.

az policy assignment create --name "lab-deny-storage-http" \
  --policy "lab-deny-storage-http" \
  --scope "/subscriptions/$SUB_ID/resourceGroups/$RG" -o table

Step 4 — Try to deploy a non-compliant storage account and watch it fail.

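# --https-only false sets supportsHttpsTrafficOnly=false → should be DENIED by the policy.
# A new assignment can take a few minutes to take effect; if the create succeeds, wait and retry.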
az storage account create -n stlab$RANDOM -g $RG -l $LOC \
  --sku Standard_LRS --https-only false 2>&1 | tail -5
# Expect: "RequestDisallowedByPolicy" naming lab-deny-storage-http

The deployment fails with RequestDisallowedByPolicy — the resource is never created. That is deny doing its job at the request.

Step 5 — Deploy a compliant storage account (HTTPS-only) and watch it succeed.

SA=stlab$RANDOM
az storage account create -n $SA -g $RG -l $LOC \
  --sku Standard_LRS --https-only true -o table
# Succeeds — it satisfies the policy.

Step 6 — Add a tag-inheritance modify and remediate the existing account.

# Tag the resource group so there's something to inherit
az group update -n $RG --set tags.CostCenter=CC-1001 -o none

# Assign the built-in "Inherit a tag from the resource group if missing" (modify) WITH an identity
az policy assignment create --name "lab-inherit-tag" \
  --policy "cd3aa116-8754-49c9-a813-ad46512ece54" \
  --scope "/subscriptions/$SUB_ID/resourceGroups/$RG" \
  --params '{ "tagName": { "value": "CostCenter" } }' \
  --mi-system-assigned --location $LOC --role "Contributor" --identity-scope "/subscriptions/$SUB_ID/resourceGroups/$RG" -o table

# Force a scan, then remediate the EXISTING (untagged) storage account
az policy state trigger-scan --resource-group $RG
az policy remediation create --name "lab-remediate-tag" \
  --policy-assignment "lab-inherit-tag" --resource-group $RG -o table

After remediation, the storage account inherits CostCenter=CC-1001. Verify:

az storage account show -n $SA -g $RG --query "tags" -o json
# Expect: { "CostCenter": "CC-1001" }
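
Remediation tasks run asynchronously, so the verify step can show empty tags if it runs too soon. A sketch of a poll loop that waits for the task to reach a terminal state; the wait_for_remediation name, the POLL_SECONDS variable, and the exact set of terminal states checked here are assumptions for illustration:

```shell
# Hypothetical helper: poll a remediation task until it finishes.
# $1 = remediation name, $2 = resource group, $3 = max attempts (default 20)
wait_for_remediation() {
  attempts=${3:-20}
  while [ "$attempts" -gt 0 ]; do
    state=$(az policy remediation show --name "$1" --resource-group "$2" \
      --query provisioningState -o tsv)
    case "$state" in
      Succeeded) echo "$state"; return 0 ;;
      Failed|Canceled) echo "$state"; return 1 ;;
    esac
    attempts=$((attempts - 1))
    sleep "${POLL_SECONDS:-15}"   # any other state: keep waiting
  done
  echo "timeout"; return 1
}

# Usage:
#   wait_for_remediation "lab-remediate-tag" "$RG" && \
#     az storage account show -n $SA -g $RG --query "tags" -o json
```

If the task ends in Failed, the usual cause in this lab is that the new identity's role assignment has not propagated yet; re-running the remediation a few minutes later is a reasonable first step.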

Validation checklist. You authored a custom deny, proved it blocks the bad deploy and allows the good one (the RequestDisallowedByPolicy line is the whole point), then used modify + a remediation task to fix an existing resource in place. The steps mapped to what each proves:

Step What you did What it proves Real-world analogue
2 Author custom deny JSON A rule is just if/then over fields Encoding a hard guardrail
4 Deploy non-compliant SA deny blocks at the request (RequestDisallowedByPolicy) The 2pm pipeline failure
5 Deploy compliant SA The policy allows correct config Normal deploys are unaffected
6 modify + remediation Existing drift is fixed in place, not by hand Backfilling tags across a tenant

Cleanup (no lingering cost).

az policy assignment delete --name "lab-deny-storage-http" --scope "/subscriptions/$SUB_ID/resourceGroups/$RG"
az policy assignment delete --name "lab-inherit-tag" --scope "/subscriptions/$SUB_ID/resourceGroups/$RG"
az policy definition delete --name "lab-deny-storage-http"
az group delete -n $RG --yes --no-wait

Cost note. Azure Policy has no charge — you pay only for resources it deploys/remediates. The lone storage account in this lab costs a few paise for the minutes it exists; deleting the resource group stops everything.

Common mistakes & troubleshooting

This is the playbook you bookmark — first as a scannable table to read mid-incident, then the same entries with full confirm-command detail underneath.

# Symptom Root cause Confirm (exact cmd / portal path) Fix
1 A deploy fails with RequestDisallowedByPolicy A deny policy matched the request Read the error — it names the policyDefinitionId and assignment Parameterise the allowed set, add an exemption, or run DoNotEnforce while triaging
2 Assigned audit, expected it to fix things audit/AINE only flags; it never changes a resource az policy state list shows NonCompliant, no remediation Switch to deny (prevent) or deployIfNotExists/modify (fix)
3 DINE/modify “ran” but nothing changed; remediation Forbidden Assignment has no managed identity, or the MI lacks the roleDefinitionIds az policy assignment show --query identity; az role assignment list --assignee <principalId> Add --mi-system-assigned; grant the declared roles at the scope; re-run remediation
4 Compliance dashboard looks wrong/stale Eventually consistent — last full scan predates your change Check lastEvaluated; compare to your change time az policy state trigger-scan, then re-read
5 A sibling subscription stays ungoverned Assignment made at one subscription/RG, not the parent MG az policy assignment list --scope <MG> shows nothing Re-assign at the management group to inherit down
6 Custom policy never matches; resource stays compliant The property has no alias, or wrong mode (Indexed for an RG rule) az provider show ... aliases; check definition mode Use the aliased path / mode: All; or pick an aliased property
7 A whole environment is silently uncovered A notScopes exclusion you forgot az policy assignment show --query notScopes Remove the exclusion, or convert to a tracked exemption with expiry
8 A legit pipeline breaks the moment deny goes live Deny flipped to Default without an audit phase Deployment errors spike; error names the def Roll back to DoNotEnforce/audit, fix params/exemptions, re-enforce
9 Two policies fight; resource shows Conflicting Conflicting effects (e.g. one modify adds, another removes the same tag) Compliance state Conflicting; review both assignments Reconcile the rules; keep one source of truth per property
10 Tag-inherit modify did nothing on a child resource Modify isn’t retroactive without a remediation task; or MI lacks tag-write role Resource missing the tag after scan; az policy remediation list Run a remediation task; grant the MI Tag Contributor/Contributor
11 deny blocks a resource you thought was exempt Exemption scoped wrong, or expired az policy exemption show --query "{scope:scope,expires:expiresOn}" Re-scope the exemption / extend the expiry
12 New initiative suddenly shows everything non-compliant A member policy with a strict effect landed on existing drift Diff the initiative’s member definitions / recent updates Expected — remediate, or set the member effect to audit first

The expanded form, with the full reasoning for the entries that bite hardest:

1. A deployment fails with RequestDisallowedByPolicy. Root cause: A deny assignment matched the request — over-broad allowed-locations, an allowed-SKU list missing the size, or a no-public-access rule on a resource that legitimately needs it. Confirm: The error body names the policyDefinitionId, the policyAssignmentId, and often the failing field. In the portal, Policy → Compliance → (the assignment) → Deny events, or the deployment’s error detail. Fix: If the resource is legitimate, parameterise the allowed set (add the region/SKU), or grant a scoped exemption. If you’re mid-rollout, drop the assignment to enforcementMode=DoNotEnforce while you triage so deploys aren’t blocked.

2. You assigned audit and expected it to fix things. Root cause: audit and auditIfNotExists are report-only — they mark non-compliance and never modify a resource. Confirm: az policy state list --filter "PolicyAssignmentName eq '<name>'" shows NonCompliant with no associated change. Fix: Decide your posture: deny to prevent future violations, or modify/deployIfNotExists (plus a remediation task) to fix existing ones. audit is a measurement phase, not an end state.

3. A deployIfNotExists/modify policy “ran” but nothing changed; remediation fails Forbidden. Root cause: The assignment has no managed identity, or the identity lacks the roles the definition declares in roleDefinitionIds. DINE/modify deploy/patch as that identity; with no rights, it’s a silent no-op. Confirm: az policy assignment show -n <name> --scope <scope> --query identity (is it null?); az role assignment list --assignee <principalId> --scope <scope> (are the declared roles present?). The remediation detail shows Forbidden. Fix: Re-create the assignment with --mi-system-assigned --location <region>; grant the identity each roleDefinitionId at the assignment scope (az role assignment create); re-run az policy remediation create.

4. The compliance dashboard looks wrong or stale. Root cause: Compliance is eventually consistent — evaluation runs on change and via a background full scan roughly every 24 hours, so a number can predate your fix or a new assignment. Confirm: Check the assignment’s last evaluation time; compare to when you made the change. Fix: az policy state trigger-scan --resource-group <rg> (or at subscription scope) forces an on-demand scan; re-read after it completes. Don’t make decisions off a number you haven’t refreshed.

5. A sibling subscription stays ungoverned. Root cause: The assignment was made at one subscription or resource group, which never affects siblings or the parent — inheritance is downward only. Confirm: az policy assignment list --scope "/providers/Microsoft.Management/managementGroups/<mg>" returns nothing for the rule. Fix: Assign at the management group that is the common ancestor of all the subscriptions you mean to govern; it inherits down to all of them.

6. A custom policy never matches; the resource stays compliant no matter what. Root cause: The property you’re testing has no alias (Policy can only read aliased properties), or the definition mode is wrong (Indexed won’t evaluate resource-group- or subscription-level rules). Confirm: az provider show --namespace <ns> --query "resourceTypes[?resourceType=='<type>'].aliases[].name" — is your path there? Check the definition’s mode. Fix: Use the exact aliased path; for RG/subscription rules set mode: All; if no alias exists, the rule isn’t expressible — pick an aliased property or a different control.

7. A whole environment is silently uncovered by a policy you thought was tenant-wide. Root cause: A notScopes exclusion on the assignment carves that scope out, quietly and without expiry. Confirm: az policy assignment show -n <name> --scope <scope> --query notScopes. Fix: Remove the exclusion if it was a mistake; if the carve-out is justified, replace it with a tracked exemption (category + reason + expiry) so it shows as Exempt and is reviewed.

8. A legitimate pipeline breaks the instant a deny goes live. Root cause: A deny was flipped to Default enforcement without an audit / DoNotEnforce phase, so its first encounter with reality is a production block. Confirm: Deployment failures spike right after the assignment change; the error names the definition. Fix: Roll the assignment back to DoNotEnforce (or audit), measure the real blast radius, parameterise/exempt the legitimate cases, then re-enforce. This is exactly the phased rollout the scenario above followed.

9. Two policies fight and a resource shows Conflicting. Root cause: Conflicting effects across assignments — e.g. one modify adds a tag another modify removes, or two policies set the same property to different values. Confirm: Compliance state Conflicting; inspect both assignments’ definitions and parameters. Fix: Establish a single source of truth per property; reconcile or remove the duplicate. Don’t run two modify policies that touch the same field in opposite directions.

10. A tag-inheritance modify didn’t tag an existing child resource. Root cause: modify rewrites new requests; existing resources need a remediation task. And the assignment’s identity may lack tag-write rights. Confirm: The resource is still untagged after a scan; az policy remediation list --resource-group <rg> shows none for it. Fix: Run az policy remediation create for the assignment; ensure the MI has Tag Contributor (or Contributor) at the scope.

11. A deny blocks a resource you thought was exempt. Root cause: The exemption is scoped wrong (it covers a sibling RG, not this one) or has expired. Confirm: az policy exemption show -n <name> --scope <scope> --query "{scope:scope,expires:expiresOn,cat:exemptionCategory}". Fix: Re-scope the exemption to the exact resource/RG, or extend expiresOn. Exemptions are deliberately time-boxed — an expiry firing is the system working.

12. A newly assigned initiative suddenly reports everything non-compliant. Root cause: A member policy with a strict effect just evaluated against existing drift — the resources were always non-compliant; now something is measuring them. Confirm: Diff the initiative’s member definitions; check what changed in the latest version. Fix: This is expected, not a bug. Remediate the drift, or set the noisy member effect to audit first and tighten later. A spike in non-compliance after assignment is reconnaissance, not failure.
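
Entry 3 is the one most often fixed by hand, so here is the confirm-and-fix sequence as one sketch. The grant_remediation_role name is hypothetical, and it assumes a system-assigned identity whose principalId comes back empty when the assignment has no identity:

```shell
# Hypothetical helper: look up an assignment's managed identity and grant it a role.
# $1 = assignment name, $2 = assignment scope, $3 = role name or role definition id
grant_remediation_role() {
  principal_id=$(az policy assignment show --name "$1" --scope "$2" \
    --query identity.principalId -o tsv)
  if [ -z "$principal_id" ]; then
    echo "assignment $1 has no managed identity; re-create it with --mi-system-assigned" >&2
    return 1
  fi
  # ServicePrincipal is the principal type of a managed identity.
  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "$3" --scope "$2"
}

# Usage:
#   grant_remediation_role "lab-inherit-tag" "/subscriptions/$SUB_ID/resourceGroups/$RG" "Tag Contributor"
```

Run it once per role listed in the definition's roleDefinitionIds, then re-run the remediation task.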

Best practices

The governance cadence worth committing to — what to review, how often, and why:

| Cadence | Review | Why it’s leading |
| Weekly | New non-compliant resources by assignment | Catch drift and new rules’ blast radius early |
| Weekly | Remediation tasks (failed / pending) | A Forbidden remediation is silent otherwise |
| Monthly | Exemptions nearing expiry | Holes re-open on expiry; renew or close deliberately |
| Monthly | Custom definitions vs new built-ins | Retire custom dups Microsoft now ships |
| Quarterly | Assignment scopes and notScopes | Find over-reach and silent exclusions |
| Per release | Policy-as-code diff in PR | The change is reviewed and recorded |

Security notes

The security-relevant policy controls and what each one buys you:

| Control | Policy mechanism | Secures against | Effect to use |
| No public storage | Built-in deny (network + anon access) | Data exfiltration via public endpoints | deny |
| Encryption everywhere | Require encryption (disks/storage) | Data-at-rest exposure | deny / audit then remediate |
| Region residency | Allowed locations | Data leaving an approved geography | deny |
| Mandatory logging | Diagnostic settings to Log Analytics | Blind spots during incident response | deployIfNotExists |
| TLS floor | Enforce minimumTlsVersion = TLS1_2 | Downgrade / cleartext transport | modify |
| Protect critical resources | Block delete on locked resources | Accidental/malicious deletion | denyAction |
| Least-priv remediation | Scoped MI with declared roles only | Over-privileged automation identity | (assignment identity wiring) |

The RBAC roles for operating Policy, and the roles remediation identities commonly need — grant the narrowest that fits:

| Role | Lets the principal… | Give to |
| Resource Policy Contributor | Create/edit definitions, initiatives, assignments, exemptions | Platform/governance engineers |
| Policy Insights Data Writer | Trigger scans, write attestations | Automation that forces evaluation |
| Reader | View compliance and definitions | Auditors, app teams (read-only) |
| Tag Contributor | Write tags (no other changes) | The MI of a tag-inheritance modify |
| Log Analytics Contributor + Monitoring Contributor | Create diagnostic settings | The MI of a diagnostic-settings DINE |
| Network Contributor | Associate NSGs/route tables | The MI of a network DINE |
| Security Admin | Set Defender plans | The MI of a Defender-plan DINE |
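
To check that a remediation identity holds exactly what its definition declares, one option is to diff the definition's roleDefinitionIds against the roles actually granted to the identity. The Python sketch below is illustrative only: the function names are hypothetical, the GUIDs in the usage are placeholders rather than real role IDs, and it deliberately ignores the scope of each role assignment (a role granted at a narrower scope than the assignment would still look "granted" here).

```python
from typing import Iterable, List


def role_guid(role_definition_id: str) -> str:
    """Reduce a full roleDefinitionId path to its trailing GUID, lower-cased."""
    return role_definition_id.rstrip("/").split("/")[-1].lower()


def missing_roles(declared: Iterable[str], granted: Iterable[str]) -> List[str]:
    """Return the declared role GUIDs that the identity has not been granted."""
    have = {role_guid(r) for r in granted}
    need = {role_guid(r) for r in declared}
    return sorted(need - have)


if __name__ == "__main__":
    # Placeholder GUIDs, not real Azure role definitions.
    declared = [
        "/providers/Microsoft.Authorization/roleDefinitions/aaaa1111-0000-0000-0000-000000000001",
        "/providers/Microsoft.Authorization/roleDefinitions/bbbb2222-0000-0000-0000-000000000002",
    ]
    granted = [
        "/subscriptions/0000/providers/Microsoft.Authorization/roleDefinitions/aaaa1111-0000-0000-0000-000000000001",
    ]
    print(missing_roles(declared, granted))
```

An empty result means the identity covers every declared role. Anything left in the list is a likely cause of a Forbidden remediation. The reverse diff (granted minus declared) is the least-privilege check.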

Cost & sizing

A rough cost picture for governance on a mid-size estate (a few dozen subscriptions):

| Cost driver | What you pay for | Rough INR / month | What it buys | Watch-out |
| Azure Policy engine | Nothing; evaluation is free | ₹0 | All evaluation, assignment, compliance | |
| Diagnostic-settings DINE | Log Analytics ingestion (per GB) | ~₹8,000–60,000+ | Tenant-wide logging for IR | Scope + retention + sampling drive it |
| Defender-plan DINE | Defender for Cloud per resource | ~₹10,000–50,000+ | Threat protection across subs | Enable per-plan deliberately |
| Backup DINE | Recovery Services storage | Workload-dependent | Auto-protected VMs | GRS vs LRS changes the bill |
| Tag-inherit modify | Nothing (saves on FinOps) | ₹0 (net negative) | Cost allocation becomes possible | One-time remediation effort |
| Engineering time (good rollout) | Audit-first phasing | Hours, not a Sev-1 | No broken pipelines | Skipping it is the expensive path |

Interview & exam questions

1. What is the difference between a policy definition, an initiative, and an assignment? A definition is a single rule (if condition → then effect) in JSON. An initiative (policy set) bundles many definitions to manage, assign, parameterise and report on as one unit (e.g. a regulatory baseline). An assignment attaches a definition or initiative to a scope (MG/sub/RG) with parameter values and options like enforcement mode. The definition is the rule; the assignment is the rule in force here, with these parameters.

2. Name the Azure Policy effects and when you’d use each. deny (block non-compliant deploys at the request), audit (allow but flag), append/modify (rewrite the request — add/replace fields/tags), deployIfNotExists (deploy a missing related resource), auditIfNotExists (flag if a related resource is absent), denyAction (block specific actions like delete), disabled (turn off), and manual (attest a non-technical control). Prevent → deny; report → audit; mutate → modify; remediate → deployIfNotExists.

3. Why does a deployIfNotExists policy sometimes do nothing, and how do you fix it? DINE deploys its template as the assignment’s managed identity. If the assignment has no identity, or the identity lacks the roles declared in roleDefinitionIds, the deployment is a silent no-op and remediation fails Forbidden. Fix by creating the assignment with a managed identity (--mi-system-assigned) and granting it exactly those roles at the assignment scope, then re-running the remediation task.

4. How does policy assignment scope and inheritance work? An assignment applies to its scope and every descendant (management group → subscription → resource group → resource), inheriting downward only — a subscription-level assignment never affects a sibling or the parent. Assign at a management group to govern many subscriptions at once. You can only assign a definition at or below the scope where it is defined.

5. You assigned an audit policy and expected it to fix resources. What happened? Nothing was fixed — audit (and auditIfNotExists) are report-only; they mark resources non-compliant but never change them. To prevent future violations use deny; to fix existing ones use modify/deployIfNotExists plus a remediation task. Audit is a measurement phase.

6. Difference between an exclusion and an exemption? An exclusion (notScopes) removes a scope from the assignment — it is simply not evaluated, with no record or expiry (a silent hole). An exemption is a tracked waiver on a scope/resource with a category (Waiver/Mitigated), an optional expiry, and a reason; it shows in compliance as Exempt. Use exemptions for justified, time-boxed passes — they’re the auditable choice.

7. How do you safely roll out a strict deny policy across a tenant? Phase it: assign as audit (or enforcementMode=DoNotEnforce) first, force a scan, and read the real blast radius; parameterise allowed-lists and add exemptions for legitimate exceptions; remediate existing drift; then flip to enforcementMode=Default. Never assign a broad deny at the tenant root in enforce mode on day one — it can break every pipeline.

8. Why does the compliance dashboard sometimes show stale or wrong numbers? Compliance is eventually consistent — evaluation runs on resource change, on assignment change, and via a background full scan roughly every 24 hours. So a number can predate your fix or a just-made assignment. Force a fresh result with az policy state trigger-scan and read after it completes; don’t decide off an un-refreshed number.

9. What is an alias in a custom policy and why does it matter? An alias is a path that exposes a resource property for Policy to evaluate. A rule can only test properties that have aliases — if the property you want isn’t aliased, the rule is not expressible. Discover them with az provider show --namespace <ns> --query "...aliases". A missing alias is a common reason a custom policy “never matches.”

10. How does Azure Policy relate to RBAC? They’re complementary and independent: RBAC governs who can perform which actions on which scope; Policy governs the allowed shape of resources, regardless of who deploys them. A user with Owner can still be denied by policy; a blocked deploy is either AuthorizationFailed (RBAC) or RequestDisallowedByPolicy (Policy). You need both.

11. What does mode control in a definition, and when do you use All vs Indexed? mode decides what the definition evaluates. Indexed (the common default) evaluates resources that support tags and location — most resource policies. All additionally evaluates resource groups and subscriptions, so use it for RG-/subscription-level rules (e.g. “every resource group must have a CostCenter tag”). There are also data-plane modes for Kubernetes, Key Vault, and network manager.

12. A finance team can’t allocate 40% of cloud spend. Which policy mechanism helps, and how? A modify policy that inherits a tag (e.g. CostCenter) from the resource group onto every resource — assigned with a managed identity that has tag-write rights — plus a remediation task to backfill existing resources. After remediation, nearly all resources carry the cost tag and showback/chargeback works. A deny-require-tag policy then keeps new resources compliant.
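
For question 12, it can help to see roughly what a tag-inheritance definition looks like. The Python sketch below assembles one as a dictionary and prints it as JSON. It follows the general shape of the built-in "inherit a tag from the resource group" policy as I understand it, so treat it as an illustration to compare against the built-in rather than a drop-in definition. The function name is hypothetical, and the role ID used is the Contributor built-in; a narrower role such as Tag Contributor could be substituted where it fits.

```python
import json

# Contributor built-in role; swap in a narrower role ID if one fits your tenant.
CONTRIBUTOR = (
    "/providers/Microsoft.Authorization/roleDefinitions/"
    "b24988ac-6180-42a0-ab88-20f7382dd24c"
)


def inherit_tag_definition(role_definition_id: str = CONTRIBUTOR) -> dict:
    """Build a modify definition that copies a tag from the resource group."""
    tag_field = "[concat('tags[', parameters('tagName'), ']')]"
    rg_value = "[resourceGroup().tags[parameters('tagName')]]"
    return {
        "mode": "Indexed",  # only resource types that support tags and location
        "parameters": {"tagName": {"type": "String"}},
        "policyRule": {
            "if": {
                "allOf": [
                    # The resource's tag differs from the resource group's tag...
                    {"field": tag_field, "notEquals": rg_value},
                    # ...and the resource group actually has a value to inherit.
                    {"value": rg_value, "notEquals": ""},
                ]
            },
            "then": {
                "effect": "modify",
                "details": {
                    # Roles the assignment's managed identity needs to be granted.
                    "roleDefinitionIds": [role_definition_id],
                    "operations": [
                        {
                            "operation": "addOrReplace",
                            "field": tag_field,
                            "value": rg_value,
                        }
                    ],
                },
            },
        },
    }


print(json.dumps(inherit_tag_definition(), indent=2))
```

Assigned with tagName set to CostCenter, a rule of this shape tags new and updated resources on the way in; the backfill of existing resources still comes from a remediation task.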

These questions map to three certifications. AZ-104 (Administrator): implement and manage Azure governance (policies, initiatives, RBAC, management groups). AZ-305 (Solutions Architect): design governance and identity, landing-zone guardrails. AZ-500 (Security Engineer): Policy as a preventive security control, regulatory compliance, Defender for Cloud integration. A compact cert-mapping for revision:

| Question theme | Primary cert | Exam objective area |
| Definition / initiative / assignment model | AZ-104 | Implement and manage governance |
| Effects (deny/audit/modify/DINE) | AZ-104 / AZ-305 | Governance design & implementation |
| Scope, management groups, inheritance | AZ-305 | Design governance; landing zones |
| DINE identity + RBAC wiring | AZ-104 / AZ-500 | Remediation; secure automation |
| Exemptions, enforcement mode, safe rollout | AZ-305 | Governance operations |
| Policy as a security control + compliance | AZ-500 | Regulatory compliance; Defender |

Quick check

  1. You assign an audit policy for “VMs must have encryption” and a week later the report still shows non-compliant VMs — and nothing has been fixed. Why, and what do you change to actually fix them?
  2. A deployIfNotExists policy for diagnostic settings shows resources as non-compliant and the remediation task fails Forbidden. Name the two-part root cause and the fix.
  3. True or false: assigning a policy at a subscription governs that subscription’s sibling subscriptions too.
  4. You need to waive a deny-public-storage rule for one legacy resource group, and you want the auditor to see why and for it to expire automatically. Exclusion or exemption, and which field gives the expiry?
  5. Right after creating an assignment the compliance dashboard shows “0 of 0” / stale numbers. What’s happening and what one command gives you a fresh answer?

Answers

  1. audit is report-only — it flags non-compliance but never changes a resource, so the VMs stay as they are. To fix them, switch to a modify/deployIfNotExists effect (with a managed identity and a remediation task) to remediate existing VMs, and/or deny to prevent new unencrypted ones. Audit measures; it doesn’t remediate.
  2. (a) The assignment has no managed identity, or (b) the identity lacks the roles declared in roleDefinitionIds — DINE deploys as that identity, so without rights it no-ops/Forbidden. Fix: create the assignment with --mi-system-assigned, grant the declared roles at the assignment scope, then re-run az policy remediation create.
  3. False. Inheritance is downward only — a subscription-level assignment covers that subscription’s resource groups and resources but never its siblings or the parent. To govern multiple subscriptions, assign at the management group that is their common ancestor.
  4. An exemption (not an exclusion) — it shows in compliance as Exempt, carries a category (Waiver/Mitigated) and a reason, and expires via the expiresOn field. A notScopes exclusion would be a silent, permanent hole the auditor can’t see.
  5. Compliance is eventually consistent — evaluation hasn’t run yet (the full scan is ~every 24h), so the number is NotStarted/stale, not a broken policy. Run az policy state trigger-scan to force an on-demand scan, then re-read after it completes.
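
The downward-only inheritance in answer 3, together with the notScopes exclusion from answer 4, can be modelled as a walk up the hierarchy. This is a toy Python sketch under stated assumptions: the scope labels (mg-root, sub-a and so on) are hypothetical short names standing in for real scope IDs, the hierarchy is a hand-written parent map rather than anything read from Azure, and the map is assumed to be a tree with no cycles.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical hierarchy: child scope -> parent scope (None at the top).
PARENT: Dict[str, Optional[str]] = {
    "mg-root": None,
    "mg-prod": "mg-root",
    "sub-a": "mg-prod",
    "sub-b": "mg-prod",
    "rg-app": "sub-a",
}


def self_and_ancestors(scope: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return the scope followed by each ancestor, nearest first."""
    chain: List[str] = []
    current: Optional[str] = scope
    while current is not None:
        chain.append(current)
        current = parent_of.get(current)
    return chain


def assignment_applies(assignment_scope: str, target: str,
                       parent_of: Dict[str, Optional[str]],
                       not_scopes: Iterable[str] = ()) -> bool:
    """True if an assignment at assignment_scope is evaluated against target."""
    chain = self_and_ancestors(target, parent_of)
    # Inheritance flows down only: the assignment must sit at or above the target.
    if assignment_scope not in chain:
        return False
    # A notScopes entry at or above the target removes it from evaluation.
    return not any(excluded in chain for excluded in not_scopes)


if __name__ == "__main__":
    print(assignment_applies("mg-prod", "rg-app", PARENT))  # common ancestor
    print(assignment_applies("sub-a", "sub-b", PARENT))     # sibling subscription
```

The sibling case returns False because sub-a never appears on the path from sub-b to the root, which is the whole argument for assigning at the management group that is the common ancestor.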

Glossary

Next steps

You can now author, assign, scope, remediate and report Azure Policy across a tenant. Build outward:
