Designing the Enterprise-Scale Landing Zone Management Group Hierarchy and Policy Layering

The management group hierarchy is the load-bearing wall of an enterprise-scale landing zone. Get it right and governance scales for free: every new subscription inherits the correct guardrails the moment it lands in the tree. Get it wrong and you spend the next two years untangling policy assignments, exemptions, and one-off RBAC grants that nobody dares to touch. This is the part of the platform you change least and reason about most, so it deserves to be designed deliberately, not grown organically. Here is how I architect the hierarchy, layer policy onto it, and operate it once workloads start arriving.

1. Enterprise-scale design principles and the management group taxonomy

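Enterprise-scale (the architecture pattern behind Azure Landing Zones) rests on a few principles that directly shape the hierarchy, including subscription democratization and policy-driven governance, both covered later in this article.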

The mistake almost everyone makes first is modelling the org chart in management groups — a node per business unit, per region, per environment. It feels intuitive and it is wrong. Management groups exist to scope policy and access, and reorgs happen far more often than your security baseline changes. The taxonomy should answer one question: what set of guardrails does a subscription need? Two subscriptions that need the same policy set and the same RBAC belong in the same management group, full stop — regardless of which department owns them.

Rule of thumb: if you cannot articulate a distinct policy or RBAC difference for a management group, it should not exist. Keep the tree shallow. Azure supports up to six levels of depth under the root, but a healthy landing zone uses three or four.

2. Designing the hierarchy: platform, landing zones, sandbox, and decommissioned

Every tenant has a built-in Tenant Root Group. You do not assign policy there for day-to-day governance — it is hard to scope cleanly and a mistake there hits everything including the platform itself. Instead, create a single top-level (intermediate root) management group directly under the tenant root, and build everything below it. I will call it contoso.

Under that intermediate root, four children carry the canonical model:

Tenant Root Group
└── contoso                      (intermediate root: org-wide baseline policy)
    ├── platform                 (shared platform services)
    │   ├── identity             (domain controllers / Entra Domain Services, identity tooling)
    │   ├── management           (Log Analytics, automation, central monitoring)
    │   └── connectivity         (hub vNet / vWAN, firewall, ExpressRoute, DNS)
    ├── landingzones             (application workloads)
    │   ├── corp                 (internal, hub-routed, no public ingress by default)
    │   └── online               (internet-facing, public ingress allowed)
    ├── sandbox                  (innovation/dev, intentionally loose, no peering to prod)
    └── decommissioned           (subscriptions being torn down, deny-most)

The split between platform and landingzones is the most important boundary in the whole design. Platform subscriptions run the shared services that landing zones depend on — hub networking, central logging, identity. They are operated by the platform team, have different RBAC, and carry policies that landing zones do not (and vice versa). Never put a workload in a platform subscription, and never let a workload owner near the connectivity subscription.

sandbox exists so that experimentation has a sanctioned home with looser policy, kept off the production network so a sandbox mistake cannot reach corp data. decommissioned is where you move a subscription when a workload retires: it gets a deny-heavy policy set that blocks new resource creation while you wind down, so a “dead” subscription cannot quietly come back to life.

Provisioning the skeleton is a handful of az calls. Management groups are idempotent on create, so this is safe to re-run:

# Intermediate root under the tenant root group
az account management-group create --name contoso --display-name "Contoso"

# Platform branch
az account management-group create --name platform     --display-name "Platform"     --parent contoso
az account management-group create --name identity      --display-name "Identity"      --parent platform
az account management-group create --name management    --display-name "Management"    --parent platform
az account management-group create --name connectivity  --display-name "Connectivity"  --parent platform

# Landing zones branch
az account management-group create --name landingzones  --display-name "Landing Zones" --parent contoso
az account management-group create --name corp          --display-name "Corp"          --parent landingzones
az account management-group create --name online        --display-name "Online"        --parent landingzones

# Sandbox and decommissioned
az account management-group create --name sandbox        --display-name "Sandbox"        --parent contoso
az account management-group create --name decommissioned --display-name "Decommissioned" --parent contoso

3. Policy layering strategy and choosing the right assignment scope

Policy in this model is layered: each level of the tree adds guardrails, and a subscription is governed by the union of every assignment from the intermediate root down to its own management group. Assigning at the right scope is the entire game — too high and you over-constrain the platform, too low and you repeat yourself and leave gaps.
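To make "the union of every assignment from the intermediate root down" concrete, here is a minimal sketch that walks the tree from this article and prints the chain of scopes governing a management group. It is a local model only: parent_of and scope_chain are hypothetical helper names, the parent map is hard-coded from the diagram above, and nothing here calls Azure.

```shell
# Illustrative model: the parent map mirrors the tree drawn in section 2.
# parent_of and scope_chain are hypothetical helpers, not az commands.
parent_of() {
  case "$1" in
    platform|landingzones|sandbox|decommissioned) echo contoso ;;
    identity|management|connectivity)             echo platform ;;
    corp|online)                                  echo landingzones ;;
    *)                                            echo "" ;;  # contoso or unknown: stop walking
  esac
}

# Print every scope whose assignments apply to a management group,
# ordered from the intermediate root down to the group itself.
scope_chain() {
  chain="$1"
  current="$(parent_of "$1")"
  while [ -n "$current" ]; do
    chain="$current $chain"
    current="$(parent_of "$current")"
  done
  echo "$chain"
}
```

Calling scope_chain corp prints contoso landingzones corp, which is exactly the list of scopes you would audit when asking why a corp subscription is constrained the way it is.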

My layering convention:

| Scope | What belongs here | Examples |
|---|---|---|
| contoso (intermediate root) | Org-wide, non-negotiable, applies to platform and workloads alike | Allowed locations, deny public IP on NICs by exception, require diagnostic settings, deploy Defender for Cloud |
| platform | Guardrails specific to shared services | Stricter network rules, key-rotation enforcement on platform Key Vaults |
| landingzones | Everything every workload needs | Enforce TLS, require private endpoints for PaaS, tag governance, deny classic resources |
| corp | Corp-only constraints | Deny public inbound / public IP, force traffic through the hub firewall |
| online | Online-only constraints | Require WAF on public ingress, restrict which public services are allowed |
| sandbox | Loosened but bounded | Audit instead of deny, but hard caps on SKU size and a budget |

Three operational rules make layering survivable:

  1. Assign initiatives (policy sets), not loose individual policies. One assignment of a curated initiative per scope is auditable; forty individual assignments are not.
  2. Prefer DeployIfNotExists and Modify over Deny where remediation is possible. Deny blocks the deployment and generates a support ticket; deploy/modify fixes it and keeps the platform invisible. Reserve Deny for things that must never exist (public IPs in corp, disallowed regions).
  3. Assign at the highest scope where the rule is universally true. If a rule is true for every workload, it goes on landingzones, not duplicated on corp and online.
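Rules 1 and 3 combine into one habit: a single initiative assignment at the highest scope where it holds. A minimal sketch of that, wrapped in a function so it can be reused per scope. assign_initiative is a hypothetical helper name, the initiative ID is whatever you pass in, and the scope string follows the standard management group resource ID format.

```shell
# Hypothetical wrapper: one curated initiative, one assignment, one scope.
# Arguments: <management-group-name> <assignment-name> <initiative-name-or-id>
assign_initiative() {
  mg_name="$1"
  assignment_name="$2"
  initiative_id="$3"

  az policy assignment create \
    --name "$assignment_name" \
    --scope "/providers/Microsoft.Management/managementGroups/$mg_name" \
    --policy-set-definition "$initiative_id" \
    --enforcement-mode Default
}
```

For a rule that is true for every workload, that would be something like assign_initiative landingzones lz-baseline "<initiative-id>", run once, rather than one call each for corp and online.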

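Here is the corp public-IP guardrail assigned at the corp scope. The built-in policy 83a86a26-fd1f-447c-b59d-e51f44264114 (Network interfaces should not have public IPs) denies any network interface that has a public IP attached. It does not block creating a standalone public IP address resource; if you need that as well, also deny the Microsoft.Network/publicIPAddresses resource type. The assignment: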

CORP_ID=$(az account management-group show --name corp --query id -o tsv)

az policy assignment create \
  --name "deny-public-ip-corp" \
  --display-name "Corp: deny public IP addresses" \
  --scope "$CORP_ID" \
  --policy "83a86a26-fd1f-447c-b59d-e51f44264114" \
  --enforcement-mode Default

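For a DeployIfNotExists policy, the assignment's managed identity performs the remediation, so that identity needs rights at the assignment scope. The CLI only creates the role assignment for you if you pass --role and --identity-scope to az policy assignment create; here it is done explicitly as a separate step so the grant is visible: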

LZ_ID=$(az account management-group show --name landingzones --query id -o tsv)

# Assign a DINE policy with a system-assigned identity (needs a location)
az policy assignment create \
  --name "deploy-diag-to-la" \
  --display-name "Landing Zones: route diagnostics to central LA workspace" \
  --scope "$LZ_ID" \
  --policy "<policy-definition-id>" \
  --mi-system-assigned \
  --location westeurope \
  --params '{ "logAnalytics": { "value": "<central-la-workspace-resource-id>" } }'

# Grant the policy identity the role it needs to remediate, at the MG scope
PRINCIPAL=$(az policy assignment show --name deploy-diag-to-la --scope "$LZ_ID" --query identity.principalId -o tsv)
az role assignment create --assignee-object-id "$PRINCIPAL" --assignee-principal-type ServicePrincipal \
  --role "Log Analytics Contributor" --scope "$LZ_ID"

4. Landing zone archetypes: corp, online, and confidential workloads

An archetype is the pairing of a management group with the policy initiative and RBAC model assigned to it. A subscription does not get “configured”; it gets placed into an archetype and inherits everything. Three archetypes cover the vast majority of enterprises: corp, online, and confidential (the last split into confidential corp and confidential online).

The point of archetypes is to resist per-team customisation. Offer a small fixed menu. A workload owner picks corp or online; they do not get to negotiate their policy set. If a real new requirement appears that the menu cannot serve, you add an archetype deliberately — a reviewed change to the platform — rather than special-casing one subscription.

Confidential archetypes, when present, slot in cleanly:

az account management-group create --name confidentialcorp \
  --display-name "Confidential Corp"   --parent landingzones
az account management-group create --name confidentialonline \
  --display-name "Confidential Online" --parent landingzones
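One way to keep the menu fixed in practice is to refuse placements outside it in whatever script or pipeline handles subscription requests. The sketch below assumes the menu is exactly the archetype management groups named in this article; is_offered_archetype and place_subscription are hypothetical helper names.

```shell
# Hypothetical guard: the menu is the set of archetype management groups
# named in this article. Anything else is a platform change, not a request.
is_offered_archetype() {
  case "$1" in
    corp|online|confidentialcorp|confidentialonline) return 0 ;;
    *) return 1 ;;
  esac
}

# Place a subscription only if the requested archetype is on the menu.
# Arguments: <archetype> <subscription-id>
place_subscription() {
  archetype="$1"
  subscription_id="$2"

  if ! is_offered_archetype "$archetype"; then
    echo "archetype '$archetype' is not offered; raise a platform change instead" >&2
    return 1
  fi

  az account management-group subscription add \
    --name "$archetype" --subscription "$subscription_id"
}
```

The design choice is that adding an archetype means editing the guard, which forces the reviewed change described above instead of a quiet one-off.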

5. Subscription democratization and the policy-driven governance model

The payoff of all this structure is subscription democratization: because the tree enforces guardrails, you can hand out subscriptions liberally and cheaply. A team that needs isolation gets its own subscription instead of sharing one and fighting over RBAC and quotas. This is only safe because policy is doing the governing — the guardrails travel with the subscription automatically.

Placing a subscription is a single operation, and it is where governance actually attaches:

# Move an existing subscription into the corp archetype
az account management-group subscription add \
  --name corp \
  --subscription "<subscription-id>"

Once the move completes, the subscription is subject to every assignment from contoso -> landingzones -> corp; allow a short delay for the new inheritance and the first compliance scan to catch up. There is no separate “apply policy” step. That is the whole model: governance is a property of position in the tree. Onboarding a hundred subscriptions is a hundred placements, not a hundred bespoke configurations, which is exactly why this scales when an org-chart-shaped hierarchy does not.

This is also why the platform team can say yes to subscription requests by default. The cost of an extra subscription is near zero, and the risk is bounded by the archetype. Democratization without policy-driven guardrails is chaos; with them, it is the operating model.
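"A hundred placements" is literally a loop. As a rough sketch, assuming you keep the requests as lines of the form subscription-id,management-group (a format invented here for illustration), a hypothetical onboard_from_csv helper could read them from standard input and place each one:

```shell
# Hypothetical helper: reads "<subscription-id>,<management-group>" lines
# from stdin and places each subscription. Stops at the first failure.
onboard_from_csv() {
  while IFS=, read -r subscription_id mg_name; do
    [ -z "$subscription_id" ] && continue   # skip blank lines

    # </dev/null keeps the command from consuming the loop's input
    az account management-group subscription add \
      --name "$mg_name" --subscription "$subscription_id" </dev/null || return 1
  done
}
```

Usage would look like onboard_from_csv < placements.csv, where placements.csv is your own file. Nothing per-subscription is configured; the only decision recorded is where each subscription sits.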

6. Diff-and-merge customization without forking the reference architecture

Microsoft ships a reference implementation (the ALZ Bicep/Terraform modules and policy library). The temptation is to fork it and edit in place. Do not. The reference library updates constantly — new built-in policies, fixes, new initiatives — and a fork strands you on a snapshot you can never cleanly update.

The supported pattern is diff-and-merge: keep the upstream reference intact and express your organisation’s changes as a thin overlay that is merged on top. Concretely, with the Azure Landing Zones Terraform module (avm-ptn-alz here; the older caf-enterprise-scale module follows the same idea) you supply a custom library directory and let the tooling merge your definitions with the built-ins, rather than editing theirs:

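# Assumption: with avm-ptn-alz the library overlay is configured on the alz
# provider, which loads the upstream library and merges your lib/ on top.
provider "alz" {
  library_references = [
    {
      path = "platform/alz"               # upstream reference library, left untouched
      ref  = "<alz-library-release>"      # pin to the release you have tested
    },
    {
      custom_url = "${path.root}/lib"     # only your deltas live here
    }
  ]
}

data "azurerm_client_config" "core" {}

module "alz" {
  source  = "Azure/avm-ptn-alz/azurerm"
  version = "~> 0.11"

  architecture_name  = "contoso"          # your architecture definition, expected in lib/
  parent_resource_id = data.azurerm_client_config.core.tenant_id
  location           = "westeurope"       # assumed default region for policy identities
}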

Your lib/ folder holds only the deltas — a custom initiative, a tightened parameter default, an extra archetype assignment. When Microsoft updates the upstream library, you bump the module version and your overlay re-merges on top. You get upstream improvements for free and your customisations survive the upgrade. The same philosophy applies even if you hand-roll Bicep: import the reference policy set definitions as-is and add your own alongside, never by editing the originals.

7. Brownfield adoption: moving existing subscriptions into the hierarchy

Greenfield is easy. The hard, real job is brownfield — moving existing, live subscriptions full of running workloads into the tree without breaking them. The failure mode is obvious: you move a busy subscription under corp, the deny-public-IP policy applies, and the next deployment that needs a public IP fails. Do this carefully and in stages.

  1. Inventory first. Pull every subscription and its current management group so you know your starting point and can prove the move later.

    az account management-group entities list \
      --query "[?type=='/subscriptions'].{name:displayName, id:name, parent:parent.id}" -o table
    
  2. Land soft, not hard. Move the subscription into a holding management group whose policies are all in audit / DoNotEnforce mode. This tells you what would break without breaking it. You can also flip a specific assignment to non-enforcing during migration:

    az policy assignment update --name "deny-public-ip-corp" \
      --scope "$CORP_ID" --enforcement-mode DoNotEnforce
    
  3. Read the compliance report. After a couple of evaluation cycles, look at non-compliant resources. This is your remediation backlog — the existing resources that violate the archetype you are moving toward.

    az policy state summarize --management-group corp \
      --query "policyAssignments[].{policy:policyAssignmentId, nonCompliant:results.nonCompliantResources}" -o table
    
  4. Remediate, then enforce. Fix or grant exemptions for the legitimate exceptions, run remediation tasks for DeployIfNotExists policies, and only then flip enforcement back to Default. Move the subscription to its final archetype management group last.

Never skip straight to step 4. A hard cut on a production subscription is how you cause an outage with a governance change — the worst possible reason to take down a workload.
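The "only then" in step 4 is worth encoding as a gate rather than trusting to memory. A minimal sketch, assuming you are willing to block on any non-compliant resource under the management group (stricter than checking one assignment, since the summary covers every assignment in scope). enforce_if_clean is a hypothetical helper name.

```shell
# Hypothetical gate: flip an assignment back to Default only when the
# compliance summary for the management group shows nothing non-compliant.
# Arguments: <management-group-name> <assignment-name>
enforce_if_clean() {
  mg_name="$1"
  assignment_name="$2"
  mg_scope="/providers/Microsoft.Management/managementGroups/$mg_name"

  # Count of non-compliant resources across all assignments under this MG
  non_compliant="$(az policy state summarize --management-group "$mg_name" \
    --query "results.nonCompliantResources" -o tsv)" || return 1

  if [ "$non_compliant" != "0" ]; then
    echo "$non_compliant non-compliant resources remain under $mg_name; not enforcing" >&2
    return 1
  fi

  az policy assignment update --name "$assignment_name" \
    --scope "$mg_scope" --enforcement-mode Default
}
```

Exempted resources are not counted as non-compliant, so granting the legitimate exemptions first is what lets the gate open.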

Verify

Confirm the hierarchy, assignments, and inheritance actually behave before you call it done.

# 1. The tree is shaped correctly
az account management-group show --name contoso --expand --recursive \
  --query "{name:displayName, children:children[].displayName}" -o json

# 2. A subscription sees the policies you expect (inherited + direct)
SUB_SCOPE="/subscriptions/<subscription-id>"
az policy assignment list --scope "$SUB_SCOPE" --disable-scope-strict-match \
  --query "[].{name:displayName, scope:scope, enforcement:enforcementMode}" -o table

# 3. Compliance is being evaluated at the archetype scope
az policy state summarize --management-group corp \
  --query "results.{nonCompliantResources:nonCompliantResources, nonCompliantPolicies:nonCompliantPolicies}" -o json

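Then prove the guardrail bites: from a corp subscription, attempt the action the policy denies and confirm it is rejected. The built-in assigned earlier evaluates network interfaces, so the test is attaching a public IP to a NIC; a bare az network public-ip create is only blocked if you also deny the Microsoft.Network/publicIPAddresses resource type.

az network nic create -g rg-test -n nic-should-fail \
  --vnet-name <vnet-name> --subnet <subnet-name> \
  --public-ip-address <existing-public-ip-name>
# Expect: RequestDisallowedByPolicy referencing "Corp: deny public IP addresses"

If the NIC is created with its public IP, your assignment scope or inheritance is wrong. Stop and fix it before onboarding anything.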

Enterprise scenario

A financial-services platform team had inherited 140 subscriptions created over five years with no hierarchy — every subscription sat directly under the tenant root, governed by nothing but a few ad-hoc per-subscription policy assignments. Audit flagged it: there was no demonstrable, uniform control plane, and they could not prove that production workloads were isolated from internet-facing ones. The constraint was brutal: zero workload downtime during the migration, and a regulator deadline.

A big-bang move under enforcing policies was off the table — flipping Deny on 140 live subscriptions at once would have caused dozens of deployment failures. So they ran a phased diff-and-merge brownfield migration. They stood up the contoso -> platform / landingzones / sandbox / decommissioned tree, but assigned the full corp initiative to landingzones in DoNotEnforce mode first. Subscriptions were moved in batches of ten, each batch left in audit for one week while the compliance report built the remediation backlog. The killer detail: roughly 30 subscriptions had legitimate public-facing workloads that belonged in online, not corp. The audit phase surfaced them precisely — they were the subscriptions whose public IPs showed as non-compliant against the corp deny — so the team re-pointed those into the online archetype instead of remediating them away.

The pattern they leaned on was the audit-first assignment, which let policy report before it could block:

# Phase 1: assign corp guardrails in report-only mode across landing zones
az policy assignment create \
  --name "corp-baseline" \
  --display-name "Corp baseline (audit phase)" \
  --scope "$(az account management-group show --name landingzones --query id -o tsv)" \
  --policy-set-definition "<corp-initiative-id>" \
  --enforcement-mode DoNotEnforce

After eight weeks, every subscription was in its correct archetype, the backlog was remediated or exempted, and only then did they flip enforcement to Default. Not a single workload took downtime, and the audit closed because the hierarchy gave them one provable answer to “how is this governed”: look at where the subscription sits in the tree.
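The batching itself is simple bookkeeping. As an illustration only (the team's actual tooling is not described here), a hypothetical print_batches helper could take one subscription ID per line and label each with its batch number, so every week's move list is explicit:

```shell
# Hypothetical helper: reads one subscription ID per line from stdin and
# prints "batch-<n> <subscription-id>", starting a new batch every
# <batch-size> entries.
# Arguments: <batch-size>
print_batches() {
  batch_size="$1"
  count=0
  batch=1

  while read -r subscription_id; do
    [ -z "$subscription_id" ] && continue   # skip blank lines
    echo "batch-$batch $subscription_id"
    count=$((count + 1))
    if [ "$count" -eq "$batch_size" ]; then
      count=0
      batch=$((batch + 1))
    fi
  done
}
```

With a batch size of ten, print_batches 10 < subscriptions.txt (your own inventory file) yields the move plan, and each batch then goes through the audit, remediate, enforce cycle before the next one starts.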
