The platform team becomes the bottleneck the moment a subscription request turns into a ticket, a meeting, and three days of someone hand-running scripts. Subscription vending is the fix: a workload owner requests a landing zone, and a pipeline mints a governed Azure subscription — peered to the hub, policy-bound, RBAC-scoped, budget-capped — in minutes, with zero clickops. This is how I build that machine so it scales to hundreds of subscriptions without scaling the platform team.
Why manual subscription onboarding breaks the cloud operating model
The Cloud Adoption Framework (CAF) draws a clean line between the platform (management groups, connectivity, identity, governance) and application landing zones (the subscriptions where workloads live). The model assumes the landing zone is a commodity — fast, identical, disposable. Manual onboarding breaks that in three ways:
- It does not scale. Every request consumes a senior engineer. At 20 subscriptions a quarter you are underwater, and the work is pure toil.
- It drifts. Two engineers onboarding two subscriptions produce two subtly different results — different policy assignments, a forgotten DNS link, an over-broad role grant. Drift is a security and audit problem, not a tidiness one.
- It centralises risk. A human with Owner at a management group, running ad-hoc scripts, is your blast radius. The pipeline identity should be the only thing holding that power, with every action in source control.
The mental shift: a subscription is not a project, it is a deployable artifact. You version it, test it, and roll it out like any other. If you cannot recreate a landing zone from code, you do not have a landing zone — you have a pet.
Step 1 — The application landing zone contract: what every workload gets
Before any pipeline, write the contract. This is the single most important artifact, because it is the promise the platform makes to every workload and the surface you have to keep stable. Mine, by default:
| Capability | What is provisioned |
|---|---|
| Subscription | Created via alias under the right billing scope, placed in the correct management group |
| Networking | A spoke virtual network, peered bidirectionally to the regional hub, with platform-managed DNS |
| Identity | RBAC role assignments for the workload’s groups; PIM-eligible, not standing, for privileged roles |
| Governance | Inherited Azure Policy from the management group, plus a deny on disallowed regions/SKUs |
| Cost | A consumption budget with alert thresholds wired to the owning team |
| Observability | Diagnostic settings routed to the central Log Analytics workspace |
Two design rules keep this maintainable. First, the spoke inherits, it does not redefine. Policy, DNS, and logging come down from the management group and connectivity subscription; the vending module only attaches the spoke to them. Second, archetypes, not snowflakes. Offer a small fixed set — corp (routed to on-prem via the hub), online (internet-facing, no corp routing), maybe sandbox. Each archetype maps to a management group with its own policy set. Resist per-team customisation.
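One way to keep the archetype list closed is to resolve archetype to management group in a single place and reject everything else. This is a minimal sketch, assuming bash in the pipeline; archetype_to_mg is a hypothetical helper, and mg-online and mg-sandbox are hypothetical names (only mg-corp appears in this article).

```bash
#!/usr/bin/env bash
# Map an archetype to its management group; anything off the list is an error.
archetype_to_mg() {
  case "$1" in
    corp)    echo "mg-corp" ;;
    online)  echo "mg-online" ;;    # hypothetical management group name
    sandbox) echo "mg-sandbox" ;;   # hypothetical management group name
    *)       echo "unknown archetype: $1" >&2; return 1 ;;
  esac
}

archetype_to_mg corp
```

Deriving the management group from the archetype, rather than accepting both as free-form inputs, removes one way for a request to contradict itself.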
Step 2 — Designing the vending pipeline and request intake
The machine has three moving parts: intake, the module, and the pipeline that glues them. Keep intake dumb and declarative — a structured request that a human or a service-desk integration can produce, validated before it ever touches Azure.
# requests/orders-api-prod.yaml: one file per landing zone, PR-reviewed
landingZone:
  name: orders-api-prod
  archetype: corp # corp | online | sandbox
  billingScope: "/providers/Microsoft.Billing/billingAccounts/1234567/enrollmentAccounts/567890"
  managementGroupId: mg-corp
  location: westeurope
  owners:
    - groupObjectId: "11111111-1111-1111-1111-111111111111" # Entra security group
      role: Contributor
  network:
    addressSpace: "10.40.8.0/22"
    hubResourceId: "/subscriptions/<conn-sub>/resourceGroups/rg-hub-we/providers/Microsoft.Network/virtualNetworks/vnet-hub-we"
  budget:
    amount: 5000
    contactGroups: ["team-checkout"]
The flow I run on every merge to main:
- Validate the YAML against a JSON Schema (required fields, CIDR is a valid non-overlapping block, archetype is allowed, owners reference real group object IDs).
- Plan the module against the request and post the plan to the PR for review.
- Apply on merge, using a pipeline identity that authenticates with OIDC workload identity federation — no stored secrets to rotate.
- Record the resulting subscription ID into an inventory (state, a CMDB, or a simple table) so lifecycle operations have a source of truth.
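JSON Schema can check that a CIDR is well-formed, but the non-overlap check in the validation step needs a comparison against what has already been allocated. The following is a rough sketch of that comparison in plain bash, assuming IPv4 only and well-formed input. The function names and the allocated list are illustrative, not part of any tool named here.

```bash
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to its 32-bit integer value.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# Print "first last" (as integers) for a CIDR block such as 10.40.8.0/22.
cidr_range() {
  local ip=${1%/*} prefix=${1#*/}
  local size=$(( 1 << (32 - prefix) ))
  local base=$(( $(ip_to_int "$ip") / size * size ))  # align to the block boundary
  echo "$base $(( base + size - 1 ))"
}

# Succeed (exit 0) when two CIDR blocks share at least one address.
cidrs_overlap() {
  local a_first a_last b_first b_last
  read -r a_first a_last <<< "$(cidr_range "$1")"
  read -r b_first b_last <<< "$(cidr_range "$2")"
  (( a_first <= b_last && b_first <= a_last ))
}

# Hypothetical inventory of address spaces already handed out.
allocated=("10.40.0.0/22" "10.40.4.0/22")
requested="10.40.8.0/22"

for existing in "${allocated[@]}"; do
  if cidrs_overlap "$requested" "$existing"; then
    echo "REJECT: ${requested} overlaps ${existing}"
    exit 1
  fi
done
echo "OK: ${requested} is free"
```

In a real pipeline the allocated list would come from the inventory written in the last step of the flow, so the check and the record stay in sync.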
The identity is the crux. The pipeline’s federated identity needs Owner at the parent management group (to create and move subscriptions and assign roles) plus billing-scope rights (Subscription Creator on the EA enrollment account or MCA billing profile). That is a lot of power; it is acceptable only because every action is gated by a reviewed pull request and the identity has no interactive login.
Step 3 — Provisioning with the subscription-vending module
Do not write subscription creation from scratch. The CAF program ships a maintained, opinionated module that does the hard parts — alias creation, management-group placement, peering, role assignments, budgets — in one pass. It exists for both toolchains: Azure/lz-vending/azurerm on the Terraform Registry, and the Bicep module published to the public registry as br/public:lz/sub-vending.
Under the hood, subscription creation is the Microsoft.Subscription/aliases resource, which requires a billing scope (EA enrollment account, MCA billing profile, or MPA). You cannot vend a subscription without one, and that scope dictates the rights your pipeline identity needs.
Idempotency note: a subscription alias is keyed by its alias name, not by display name. Re-running with the same alias is a no-op; it does not create a duplicate. But deleting the alias resource does not delete the subscription — it only removes the pointer. Treat subscription deletion as a deliberate, separate lifecycle step (Step 6).
Terraform
module "orders_api_prod" {
  source  = "Azure/lz-vending/azurerm"
  version = "~> 5.0"

  location = "westeurope"

  # --- Subscription creation ---
  subscription_alias_enabled = true
  subscription_alias_name    = "orders-api-prod"
  subscription_display_name  = "orders-api-prod"
  subscription_billing_scope = "/providers/Microsoft.Billing/billingAccounts/1234567/enrollmentAccounts/567890"
  subscription_workload      = "Production"

  # --- Management group placement (governance inheritance) ---
  subscription_management_group_association_enabled = true
  subscription_management_group_id                  = "mg-corp"

  # --- Spoke networking, peered to the hub ---
  virtual_network_enabled = true
  virtual_networks = {
    spoke = {
      name                    = "vnet-orders-api-prod"
      address_space           = ["10.40.8.0/22"]
      resource_group_name     = "rg-orders-api-prod-network"
      hub_peering_enabled     = true
      hub_network_resource_id = "/subscriptions/<conn-sub>/resourceGroups/rg-hub-we/providers/Microsoft.Network/virtualNetworks/vnet-hub-we"
    }
  }

  # --- RBAC: workload owners get Contributor on the new subscription ---
  role_assignment_enabled = true
  role_assignments = {
    owners = {
      principal_id   = "11111111-1111-1111-1111-111111111111" # Entra group object ID
      definition     = "Contributor"
      relative_scope = "" # empty = subscription root
    }
  }

  # --- Budget guardrail ---
  budget_enabled = true
  budgets = {
    monthly = {
      amount     = 5000
      time_grain = "Monthly"
      notifications = {
        actual80 = {
          enabled        = true
          operator       = "GreaterThan"
          threshold      = 80
          threshold_type = "Actual"
          contact_groups = ["/subscriptions/<sub>/resourceGroups/rg-platform/providers/microsoft.insights/actionGroups/ag-team-checkout"]
        }
      }
    }
  }
}
The Terraform variant has a subtle but critical wrinkle: the AzureRM provider it uses to configure the spoke must target a subscription that does not exist until apply time. The module solves this with subscription_use_azapi = true, which uses the subscription-agnostic AzAPI provider for the creation step so a single apply both mints the subscription and configures inside it. Enable it; it removes the classic two-phase apply.
Bicep
targetScope = 'managementGroup'

module orders_api_prod 'br/public:lz/sub-vending:5.2.1' = {
  name: 'vend-orders-api-prod'
  params: {
    subscriptionAliasEnabled: true
    subscriptionAliasName: 'orders-api-prod'
    subscriptionDisplayName: 'orders-api-prod'
    subscriptionBillingScope: '/providers/Microsoft.Billing/billingAccounts/1234567/enrollmentAccounts/567890'
    subscriptionWorkload: 'Production'
    subscriptionManagementGroupAssociationEnabled: true
    subscriptionManagementGroupId: 'mg-corp'
    virtualNetworkEnabled: true
    virtualNetworkName: 'vnet-orders-api-prod'
    virtualNetworkLocation: 'westeurope'
    virtualNetworkResourceGroupName: 'rg-orders-api-prod-network'
    virtualNetworkAddressSpace: ['10.40.8.0/22']
    virtualNetworkPeeringEnabled: true
    hubNetworkResourceId: '/subscriptions/<conn-sub>/resourceGroups/rg-hub-we/providers/Microsoft.Network/virtualNetworks/vnet-hub-we'
    roleAssignmentEnabled: true
    roleAssignments: [
      {
        principalId: '11111111-1111-1111-1111-111111111111'
        definition: 'Contributor'
        relativeScope: ''
      }
    ]
  }
}
Deploy a management-group-scoped Bicep file with az deployment mg create:
az deployment mg create \
--name "vend-orders-api-prod" \
--management-group-id "mg-platform" \
--location "westeurope" \
--template-file ./vend-orders-api-prod.bicep
Step 4 — Auto-wiring networking, peering, and DNS to the platform
Peering is the part people get wrong because it is bidirectional and crosses a subscription boundary. The spoke side is created by the vending module; the hub side must also be created, in the connectivity subscription. The lz-vending module’s hub_peering_enabled handles both directions, but it needs rights in the hub subscription to do so — give the pipeline identity at least Network Contributor on the hub resource group, or the spoke-to-hub link will succeed while the return link silently does not, and traffic will black-hole.
DNS is the second trap. In a hub-and-spoke, workloads must resolve Private Link records (privatelink.blob.core.windows.net, privatelink.vaultcore.azure.net, and friends) through the platform’s private DNS zones. Do not create per-spoke private DNS zones — that fractures resolution. Pick one of:
- Azure DNS Private Resolver (or central forwarders) in the hub, with spoke VNets using the resolver IPs as their DNS servers. This is my default now; it scales without per-zone link sprawl.
- Central private DNS zones in the connectivity subscription, with a Microsoft.Network/privateDnsZones/virtualNetworkLinks resource from each new spoke to each zone, driven by an Azure Policy with DeployIfNotExists rather than by the vending module, so links self-heal even for VNets created outside the pipeline.
Either way, the vending module sets the spoke’s DNS servers to the hub resolver, and policy handles the zone links. Keeping DNS-zone management in policy rather than the per-spoke module is what stops it from rotting.
Step 5 — Injecting policy, RBAC, PIM, and budget guardrails at creation
The elegant thing about the CAF model is how little of this the vending module does — the heavy guardrails live at the management group, and the subscription inherits them the instant it is placed there (Step 3’s subscription_management_group_id). That single association pulls in every policy assigned to mg-corp and its ancestors: allowed locations and SKUs, required tags, deny of public IPs, mandatory diagnostic settings. You assign those once per archetype, not per subscription.
What the vending module does inject per-subscription is the workload-specific layer:
- RBAC — group-based role assignments (shown above). Assign to Entra groups, never users, never broader than the workload needs.
- Budgets — the consumption budget and its alert thresholds.
- Subscription-level deny/audit — any policy that must be scoped to this single subscription (rare; prefer the management group).
PIM is the one piece to deliberately keep outside the create-time module. The vending module grants standing role assignments; privileged access should be eligible, activated just-in-time. The clean separation: the module assigns the day-to-day role (e.g. Contributor to the dev group), and a separate process configures PIM eligibility for elevated roles (Owner, User Access Administrator) via the Microsoft.Authorization/roleEligibilityScheduleRequests API or the azurerm_pim_eligible_role_assignment resource. Mixing JIT elevation into the bulk vending run couples two things that change on very different cadences and tempts you toward standing privilege.
Guardrail philosophy: prevent at the management group with deny, detect everywhere with audit and DeployIfNotExists, and grant least privilege at the subscription. The vending module is the attachment point, not the policy author.
Step 6 — Lifecycle: decommissioning, drift detection, and re-baselining
Vending is the easy half. A platform that can only create is a platform that accumulates. Three lifecycle operations matter:
Decommission. Removing the request file and applying does not delete an Azure subscription — by design, both toolchains leave it intact to prevent catastrophic accidental deletion. Decommissioning is a deliberate runbook: cancel the subscription (az account subscription cancel --id <sub-id> moves it to Disabled, recoverable for up to 90 days), strip role assignments, remove the hub peering, then drop it from inventory. Automate the runbook, but gate it behind explicit approval — never a side effect of a file deletion.
Drift detection. Run terraform plan (or az deployment mg what-if) for every managed landing zone on a schedule — nightly is fine — and alert on any non-empty diff. Drift means someone clickopsed a change: an opened NSG, a deleted peering, a hand-edited budget. You want to know within a day, not at audit time.
# Nightly drift sweep across all vended landing zones (CI scheduled job)
for f in requests/*.yaml; do
  lz=$(basename "$f" .yaml)
  # Assumes a requests/<name>.tfvars rendered from each request YAML
  terraform plan -detailed-exitcode -var-file="requests/${lz}.tfvars"
  case $? in
    2) echo "DRIFT DETECTED in ${lz}" ;;   # exit code 2 = changes present
    1) echo "PLAN FAILED for ${lz}" >&2 ;; # exit code 1 = error, not drift
  esac
done
-detailed-exitcode is the key flag: it returns 0 for no changes, 2 for a non-empty plan, and 1 for an error — so the loop distinguishes drift from failure.
Re-baselining. When the contract evolves — a new mandatory policy, a tighter budget default, an added DNS zone — you must roll the change across every existing subscription, not just new ones. This is why the contract being code matters: bump the module version, run the plan sweep, review the aggregate diff, apply. Re-baselining 200 subscriptions is a for loop, not a project, precisely because they were all vended identically.
Step 7 — Operating the platform: versioning, testing, and team enablement
Version the module like the product it is. Pin consumers to a minor range (~> 5.0), publish a changelog, and never make a breaking change to the contract without a major bump and a migration note. Your “customers” are other engineering teams; treat the interface with the same discipline you would a public API.
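To make the pin concrete: a constraint of ~> 5.0 accepts any 5.x release at or above 5.0 and rejects the next major. The sketch below restates that rule in bash for the two-component form only. satisfies_minor_range is a hypothetical helper for illustration, and it ignores pre-release suffixes.

```bash
#!/usr/bin/env bash
# Succeed when VERSION (MAJOR.MINOR.PATCH) satisfies a "~> MAJOR.MINOR" constraint:
# same major version, minor at or above the constraint's minor.
satisfies_minor_range() {
  local v_major v_minor _patch c_major c_minor
  IFS=. read -r v_major v_minor _patch <<< "$1"
  IFS=. read -r c_major c_minor <<< "$2"
  (( v_major == c_major && v_minor >= c_minor ))
}

for v in 4.9.0 5.0.0 5.3.1 6.0.0; do
  if satisfies_minor_range "$v" "5.0"; then
    echo "$v allowed"
  else
    echo "$v rejected"
  fi
done
```

Run as written, this allows 5.0.0 and 5.3.1 and rejects 4.9.0 and 6.0.0, which is why a breaking contract change has to ship as a major bump: pinned consumers will not pick it up by accident.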
Test before you ship a version. The cheapest insurance against vending a broken landing zone to 50 teams:
- Static: terraform validate / bicep build, plus tflint and policy linting (Checkov, az policy what-if) on every PR.
- Contract: terraform test (the HCL-native test framework) asserting the module produces the right resource shape from a representative request, on every push without touching Azure (plan-only runs with mocked providers).
- End-to-end: nightly, vend a real landing zone into a disposable canary billing scope, assert peering and policy compliance are live, then decommission it. This is the only test that catches billing-scope and cross-subscription peering failures, which never show up in a plan.
Enable the teams. The platform succeeds when workload owners self-serve without reading your Terraform. Ship a one-page “how to request a landing zone” doc, a commented request template, and a Backstage (or equivalent IDP) form that emits the request YAML so the average developer never writes HCL or Bicep. The pipeline, not the platform team, is the interface.
Enterprise scenario
A retail platform team I worked with vended ~180 subscriptions cleanly, then started getting sporadic “subscription quota exceeded” failures from the pipeline — but only for corp archetype tenants, and only intermittently. The plan was green every time; the apply died at the Microsoft.Subscription/aliases step. The trap: an EA enrollment account has a hard cap on subscriptions, and they were brushing it. Worse, every failed alias still consumed quota until garbage-collected, so retries dug the hole deeper. The actual fix had two parts. First, they spread vending across multiple enrollment accounts and selected the billing scope per-archetype in intake, so a single account never saturated. Second — the real lesson — they added a pre-flight quota gate to the pipeline that hard-fails before touching the alias resource:
# Pre-flight: refuse to vend if the target enrollment account is near its cap
BILLING="/providers/Microsoft.Billing/billingAccounts/1234567/enrollmentAccounts/567890"
# Note: az rest returns one page only; follow nextLink if the response includes one
USED=$(az rest --method get \
--url "https://management.azure.com${BILLING}/billingSubscriptions?api-version=2024-04-01" \
--query "length(value)" -o tsv)
LIMIT=5000 # confirm with your EA agreement; raise via support, not retries
if [ "$USED" -ge $((LIMIT - 20)) ]; then
echo "FAIL: enrollment account at ${USED}/${LIMIT} — route to another scope"; exit 1
fi
The broader principle: subscription creation is subject to billing-scope quotas that no plan or what-if will surface, because they are enforced by the control plane at apply time, not evaluated from your template. Treat billing-scope capacity as a first-class input to intake, alarm on it well before the ceiling, and never let the pipeline retry into a quota wall.
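Alarming "well before the ceiling" suggests a second, softer threshold in addition to the hard gate. A minimal sketch of that two-tier decision follows. quota_status is a hypothetical helper, and the 80 percent warning level and 20-subscription headroom are assumptions to tune against your own vending rate.

```bash
#!/usr/bin/env bash
# Classify billing-scope usage as ok, warn, or block.
# Arguments: used, limit, optional warn percentage, optional hard-stop headroom.
quota_status() {
  local used=$1 limit=$2 warn_pct=${3:-80} headroom=${4:-20}
  if (( used >= limit - headroom )); then
    echo "block"   # too close to the cap: refuse to vend
  elif (( used * 100 >= limit * warn_pct )); then
    echo "warn"    # still vend, but raise an alert
  else
    echo "ok"
  fi
}

quota_status 3200 5000   # 64% used
quota_status 4100 5000   # 82% used
quota_status 4985 5000   # within 20 of the cap
```

The three calls print ok, warn, and block in that order. Wiring the warn result to an alert gives the team time to open another billing scope before the gate starts failing requests.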
Verify
After vending a landing zone, confirm the contract was actually fulfilled — do not trust a green pipeline alone:
SUB_ID="<new-subscription-id>"
# 1. Subscription exists, is enabled, and sits in the right management group
az account subscription show --id "$SUB_ID" --query "{name:displayName,state:state}" -o table
az account management-group subscription show --name mg-corp --subscription "$SUB_ID" -o table
# 2. Spoke peering is Connected in BOTH directions (run against each side)
az network vnet peering list \
--resource-group rg-orders-api-prod-network \
--vnet-name vnet-orders-api-prod \
--subscription "$SUB_ID" \
--query "[].{name:name,state:peeringState,gateway:useRemoteGateways}" -o table
# 3. Inherited policy is present and the subscription is compliant
az policy state summarize --subscription "$SUB_ID" \
--query "results.{nonCompliantResources:nonCompliantResources,nonCompliantPolicies:nonCompliantPolicies}" -o json
# 4. Budget and its alert thresholds exist
az consumption budget list --subscription "$SUB_ID" -o table
# 5. RBAC is group-scoped, not user-scoped or over-broad
az role assignment list --subscription "$SUB_ID" --include-inherited \
--query "[].{principal:principalName,role:roleDefinitionName,type:principalType,scope:scope}" -o table
The peering check is the one to watch: peeringState must read Connected from both the spoke and the hub. If the spoke side is stuck at Initiated and the hub side is missing, your pipeline identity lacked rights in the connectivity subscription, the most common silent failure here.
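The both-sides rule is easy to encode so that the pipeline, not a person, reads the two states. The following is a small sketch under stated assumptions: peering_verdict is a hypothetical helper, and an empty string stands for "no peering object found on that side".

```bash
#!/usr/bin/env bash
# Turn the two observed peeringState values into a single verdict.
peering_verdict() {
  local spoke_state=$1 hub_state=$2
  if [[ $spoke_state == "Connected" && $hub_state == "Connected" ]]; then
    echo "ok"
  elif [[ -z $hub_state ]]; then
    echo "hub-side-missing"     # likely missing rights in the connectivity subscription
  elif [[ -z $spoke_state ]]; then
    echo "spoke-side-missing"
  else
    echo "not-connected"        # e.g. Initiated or Disconnected on either side
  fi
}

peering_verdict "Connected" "Connected"
peering_verdict "Initiated" ""
```

Each state can be fetched by running az network vnet peering list against the relevant side with --query "[0].peeringState" -o tsv, then passed in as the two arguments. Anything other than ok should fail the verification stage.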
Pitfalls
- Assuming alias deletion deletes the subscription. It does not. Removing the request leaves a live, billed subscription. Decommissioning is a separate, deliberate runbook.
- One-directional peering. The pipeline identity lacking Network Contributor in the connectivity subscription produces a half-built peering that plans clean but black-holes traffic. Verify both sides.
- Per-spoke private DNS zones. They fracture resolution and multiply maintenance. Centralise DNS in the hub and link via policy.
- Standing privileged access baked into vending. Granting Owner at create time defeats zero-standing-privilege. Keep day-to-day roles in the module and elevated roles in PIM.
- Skipping the end-to-end canary. Billing-scope permission gaps and cross-subscription peering failures only surface on a real apply. A nightly vend-and-destroy is the only test that catches them before a team does.
Get the contract right, make the pipeline the only path, and verify the boring things — peering direction, policy inheritance, least-privilege RBAC — and landing zones become a commodity that scales with a for loop.