Azure FinOps and Cost Management: Controlling Cloud Spend at Scale

A fast-growing SaaS company opens its Azure invoice and finds ₹2.4 crore, roughly triple the forecast, and nobody can say whose spend it is. The bill is real and the resources are real, yet half of it lands in an “unallocated” bucket. Four things went wrong. The resources shipped without tags. The cost data was read in ActualCost, so a one-off Reservation purchase made one team look as if its spend had jumped tenfold for a month. Idle non-production environments ran 24×7 through a weekend because the budget alert was never wired to an action. And production VMs were sized for a peak that lasts ninety minutes a day. None of that is a finance problem you discover when the bill arrives. Every rupee of it was an engineering design decision made, or skipped, at provision time. This is the gap FinOps closes: a cultural and operational practice that brings engineering, finance and product into one feedback loop so the people who spend the money also see and own it, in near-real time, while the workload is still running.

This article is the operating model, not a feature tour. Azure Cost Management — the native, free billing-analytics service built into every subscription — is the data plane; the discipline around it is what makes spend predictable at scale. You will learn to run the loop the diagram below traces: govern and tag at the management-group root so every resource is attributable; ingest usage as amortized daily exports in the FOCUS schema; allocate 100% of the invoice (including shared hub costs) back to teams as showback or chargeback; optimize the rate (Reservations, Savings Plans, Azure Hybrid Benefit) and the usage (right-sizing, auto-stop); and act through budgets that alert on forecast and trigger automation, not just email. Because cost work at scale is a reference discipline — you return to it every month-end and every anomaly — the tag schema, the export options, the commitment matrix, the budget knobs and the failure modes are all laid out as scannable tables. Read the prose once; keep the tables open at month-end.

By the end you will stop being surprised by the invoice. You will know why a chart says a team’s spend tripled and went to zero (ActualCost vs Amortized), why showback never reconciles to the bill (unsplit shared cost), why a Reservation discount landed on a team that never paid for it (Shared scope), and why a budget “fired” but the spend kept climbing (no action group). Knowing which leak you are looking at — and the one az command or Cost Analysis view that confirms it — is what turns a quarterly bill-shock into a Tuesday adjustment.

What problem this solves

Cloud’s pay-as-you-go model inverts the old capex control. There is no purchase order, no procurement gate, no “the server is full” ceiling. An engineer types az vm create and the meter starts; a misconfigured autoscale rule or a forgotten P3v3 in a dev resource group bleeds money silently until the invoice lands a month later. The spend is decentralized (hundreds of engineers can provision), continuous (per-second metering), and opaque after the fact (the invoice is a single number unless you built the attribution beforehand). Without FinOps, finance sees a number it can’t question and engineering sees uptime it’s proud of, and the two never reconcile.

What breaks without this: the unallocated bucket grows until “who owns this?” is unanswerable; reserved-capacity decisions get made on gut feel (or not at all), leaving a 30–50% pay-as-you-go premium on steady-state compute; idle resources (dev environments, orphaned disks, unattached public IPs, over-provisioned databases) accrete because nobody is accountable for switching them off; and anomalies (a runaway query, a leaked credential spinning up crypto-mining VMs, a log-ingestion explosion) are discovered weeks late, on the invoice, instead of within hours by an alert. The damage is both money and trust: when finance can’t predict the bill, they cap cloud spend bluntly, and engineering velocity dies under approval gates.

Who hits this: every organization past a single team on Azure. It bites hardest on multi-subscription enterprises (where the Azure resource hierarchy of management groups, subscriptions and resource groups is the cost-allocation boundary), on platform teams running shared services (hub firewall, Log Analytics, gateways) that no single product wants to pay for, and on anyone who bought commitments before understanding their baseline. The fix is never “spend less” as a blanket order — it’s making spend visible, attributable, and optimizable so each team trims its own waste while shipping faster.

To frame the whole field before the deep dive, here is every cost-leak class this article covers, the question it forces, and the one place to look first:

| Leak class | What you observe | First question to ask | First place to look | Most common single cause |
| --- | --- | --- | --- | --- |
| Unallocated spend | A large “untagged/no CostCenter” bucket | Are resources born tagged, or tagged later (never)? | Cost Analysis grouped by CostCenter tag | No tag-inheritance policy at MG scope |
| Skewed cost trends | A team “tripled then went to zero” | Am I reading ActualCost or AmortizedCost? | Cost Analysis metric selector | Reporting in ActualCost; an RI/SP landed |
| Showback ≠ invoice | Per-team sum < total bill | Is shared/hub cost being split to teams? | Cost allocation rules; amortized totals | Shared services have no allocation rule |
| Commitment waste | Low RI/SP utilization, or discount on wrong team | Is the commitment scoped and sized to a real baseline? | Reservations → Utilization; appliedScopeType | Over-bought, or Shared scope on a single workload |
| Idle / over-provisioned | Advisor flags right-sizing; non-prod runs 24×7 | Does this resource’s size/uptime match real load? | Advisor Cost recommendations | No auto-stop; SKU sized for rare peak |
| Silent overrun | Bill spikes, found weeks late | Did anything alert before the invoice? | Budgets + anomaly alerts | Budget with email but no action/forecast |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand the Azure control-plane shape: a tenant contains management groups (an inheritance tree), under which sit subscriptions (the billing and policy boundary), each holding resource groups and resources — the model covered in Azure Resource Hierarchy Explained. You should know that Azure Policy evaluates and can deny or modify resources (see Azure Policy and Governance at Scale), and be comfortable running az in Cloud Shell, reading JSON, and basic KQL. Familiarity with reservations and savings plans mechanics from Azure Cost: Reservations, Savings Plans & Hybrid Benefit Strategy lets you go deeper on the commitment math; this article uses them as one lever in the larger loop.

This sits in the Governance & FinOps track and is the cost counterpart to the security/identity governance you apply in an Enterprise-Scale Landing Zone. It depends on the resource hierarchy (your allocation boundary) and on policy (your enforcement engine), and it feeds observability — the same Azure Monitor and Application Insights telemetry that tells you a workload is slow also tells you it’s expensive (log-ingestion cost is a real line item). For the deep commitment-engineering mechanics, The Azure FinOps Engineering Guide is the companion to this operating-model view.

A quick map of who owns what in the cost loop, so you route a question to the right team fast:

| Layer | What lives here | Who usually owns it | Cost-leak classes it can cause |
| --- | --- | --- | --- |
| Management group / policy | Tag inheritance, deny rules, initiative | Platform / governance | Unallocated spend (no tag policy) |
| Subscription | Billing boundary, budgets, RBAC | Platform + finance | Showback gaps; over-broad budgets |
| Resource group | Workload grouping, tags, lifecycle | App / product team | Idle resources; untagged RGs |
| Cost Management data | Usage records, amortization, exports | FinOps / data | Skewed trends (Actual vs Amortized) |
| Commitment layer | RI / SP / AHB scope and utilization | FinOps + finance | Commitment waste; wrong-scope discount |
| Automation / alerting | Budgets, anomaly, action groups, runbooks | Platform + SRE | Silent overruns (no action wired) |

Core concepts

Six mental models make every later decision obvious.

Cost is created at provision time, not billed time. The invoice is a lagging report of decisions already made. Every control that matters — tagging, sizing, commitment, auto-stop — is applied before or during provisioning, in the same IaC and policy plane you use for everything else. FinOps is “shift-left” for money: the cheapest place to fix a cost is in the pull request that created the resource, not in the meeting that reviews the bill.

The resource hierarchy is the cost-allocation hierarchy. Spend rolls up exactly the way the management group → subscription → resource group → resource tree does. Tags add an orthogonal dimension (CostCenter, Owner, Environment, Product) so you can slice cost by team across subscriptions, or by environment within one. If a resource is untagged and lives in a shared subscription, it is effectively un-ownable — which is why tag governance is the foundation, not a nicety.

Amortized cost is the truth for trends; actual cost is the truth for cash. ActualCost records the charge on the day it hits the account — so a 1-year Reservation paid upfront shows the entire year’s cost on the purchase day, then ₹0 for that resource for 12 months. AmortizedCost spreads that commitment evenly across its term, so a team’s monthly trend reflects consumption, not payment timing. Use Amortized for showback, budgets and trend analysis; use Actual only when reconciling to the cash invoice. Reading the wrong one is the single most common analysis mistake.

Cost optimization has two independent axes: rate and usage. You reduce the rate (price per unit) with commitments — Reservations, Savings Plans, Hybrid Benefit, Spot — without changing what you run. You reduce the usage (units consumed) with right-sizing, auto-stop, deleting orphans, and architectural changes (serverless, autoscale-to-zero). They compose: right-size first (so you don’t commit to oversized capacity), then commit to the smaller, stable baseline.
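To see how the two axes compose, here is a minimal Python sketch with made-up numbers. The rate of ₹3,000 per vCPU-month and the 40% commitment discount are illustrative assumptions, not Azure prices. It compares committing alone, right-sizing alone, and right-sizing before committing.

```python
def monthly_cost(vcpus: int, rate_per_vcpu: int, discount_pct: int = 0) -> int:
    """Cost = usage (vCPUs) x rate, reduced by a commitment discount in whole percent."""
    return vcpus * rate_per_vcpu * (100 - discount_pct) // 100


RATE = 3_000    # hypothetical rupees per vCPU-month
DISCOUNT = 40   # hypothetical commitment discount, in percent

baseline = monthly_cost(8, RATE)                # oversized VM on pay-as-you-go
commit_only = monthly_cost(8, RATE, DISCOUNT)   # rate axis only
rightsize_only = monthly_cost(4, RATE)          # usage axis only
both = monthly_cost(4, RATE, DISCOUNT)          # right-size first, then commit

print(baseline, commit_only, rightsize_only, both)
```

Under these assumptions, committing before right-sizing locks in a discount on eight vCPUs when only four are needed, which is why the order matters.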

Showback informs; chargeback enforces. Showback shows each team its cost without moving money (visibility, low friction, the usual starting point). Chargeback actually bills the cost back to the team’s budget (real accountability, real friction, needs trust and clean allocation first). Both require that you can attribute ~100% of the invoice — including shared services — or teams reject the numbers as unfair.

Budgets are alerts, not limits — until you wire them to action. An Azure budget does not stop spending when breached; by default it sends an email. It becomes a control only when its alert triggers an action group that runs automation (a Function or Automation runbook that deallocates non-prod). Alerting on forecast (predicted month-end) rather than actual buys you the days needed to act before the overage.
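The difference between alerting on actual and alerting on forecast can be sketched in a few lines of Python. The forecast here is a naive linear run-rate, which is an assumption standing in for Cost Management's own forecast model, and the budget and spend figures are hypothetical.

```python
import calendar
from datetime import date


def forecast_month_end(spend_to_date: float, today: date) -> float:
    """Naive run-rate forecast: average daily spend so far times days in the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def should_alert(spend_to_date: float, budget: float, today: date,
                 threshold: float = 1.0) -> bool:
    """True when the forecast month-end spend reaches threshold x budget."""
    return forecast_month_end(spend_to_date, today) >= threshold * budget


# Hypothetical: a 10,00,000 budget with 4,50,000 already spent by 10 June.
today = date(2026, 6, 10)
forecast = forecast_month_end(450_000, today)
actual_breached = 450_000 >= 1_000_000
forecast_breached = should_alert(450_000, 1_000_000, today)

print(forecast, actual_breached, forecast_breached)
```

With these numbers an actual-cost alert stays silent on day 10, while the forecast alert already fires, leaving about twenty days to act.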

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters to cost at scale |
| --- | --- | --- | --- |
| Cost Management | Native, free cost-analysis/budgets/exports service | Every subscription / billing account | The data plane for the whole loop |
| ActualCost | Charge on the day it hits the account | Cost Analysis metric | Skews trends when a commitment lands |
| AmortizedCost | Commitment spread across its term | Cost Analysis metric | The truth for showback and trends |
| Tag | Key/value metadata on a resource/RG/sub | Resource Manager | The cost-attribution dimension |
| CostCenter / Owner tag | Who pays / who is responsible | Tag schema | Turns spend into team-level cost |
| Budget | A spend threshold with alert rules | Cost Management | Catches overrun (only if wired to action) |
| Anomaly alert | ML-detected spend deviation | Cost Management | Catches unexpected spend early |
| Reservation (RI) | 1- or 3-year capacity pre-commit (up to ~72% off) | Reservations | Rate cut on steady-state compute |
| Savings Plan (SP) | $/hr compute commitment (flexible) | Savings Plans | Rate cut with SKU/region flexibility |
| Azure Hybrid Benefit | Use owned Windows/SQL licenses | Resource config | Removes license cost on eligible SKUs |
| Spot | Evictable surplus capacity (deep discount) | VM/VMSS/AKS config | Cheap for interruptible workloads |
| Cost allocation rule | Splits shared cost to teams | Cost Management | Makes showback reconcile to 100% |
| Export | Scheduled cost data to storage | Cost Management | Lakehouse-scale analysis (FOCUS) |
| Advisor (Cost) | Right-sizing/idle recommendations | Azure Advisor | The usage-reduction worklist |

Tag governance: making every cost attributable

If you fix nothing else, fix tagging — it is the foundation every other control stands on. The goal: every resource is born with the tags that let you attribute its cost, enforced by policy, with existing resources remediated. Manual tagging always decays; the only durable approach is deny what’s untagged and inherit tags down from the resource group/subscription.

The tag schema

Decide the schema once and enforce it everywhere. A pragmatic, cost-focused minimum:

| Tag key | Purpose | Example values | Enforcement | Allocation use |
| --- | --- | --- | --- | --- |
| CostCenter | Finance code that pays | CC-4412, CC-7781 | Deny if missing | Chargeback line |
| Owner | Accountable person/DL | team-payments@, vinod.h@ | Deny if missing | Who to ping on overrun |
| Environment | Lifecycle stage | prod, staging, dev, sandbox | Allowed-values + deny | Non-prod auto-stop targeting |
| Product / Service | App/workload name | checkout, search, billing | Deny if missing | Per-product unit economics |
| BusinessUnit | Org rollup | retail, platform | Inherit from MG | Executive showback |
| Project | Initiative / funding line | migration-2026, bau | Optional, allowed-values | Project-based budgets |
| DataClass | Sensitivity (governance) | public, confidential | Audit (not cost, but ride-along) | Compliance filtering |
| ExpiryDate | Auto-cleanup date | 2026-12-31 | Audit + automation reads it | Drives orphan/sandbox sweep |
| ManagedBy | IaC vs manual | terraform, bicep, portal | Audit | Flags click-ops drift |

Two rules keep the schema usable: keep it small (5–7 cost tags; every extra mandatory tag is friction at create time and a source of deny-failures) and lowercase, fixed-vocabulary values (use Azure Policy allowedValues for Environment, or prod and Prod and PROD fracture your reports).
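The fracturing effect is easy to reproduce. The Python sketch below groups a handful of hypothetical cost rows by the Environment tag, first as-is and then after normalizing to the fixed vocabulary from the schema above. The row values and the "invalid" bucket name are assumptions for illustration; in Azure the enforcement itself belongs in an allowed-values policy, not in report code.

```python
from collections import defaultdict

ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev", "sandbox"}


def normalize_environment(value: str) -> str:
    """Trim and lowercase a tag value; anything outside the vocabulary becomes 'invalid'."""
    cleaned = value.strip().lower()
    return cleaned if cleaned in ALLOWED_ENVIRONMENTS else "invalid"


def cost_by_environment(rows, normalize=False):
    """Sum cost per Environment value, optionally normalizing the value first."""
    totals = defaultdict(int)
    for env, cost in rows:
        key = normalize_environment(env) if normalize else env
        totals[key] += cost
    return dict(totals)


# Hypothetical (tag value, cost) rows with inconsistent casing and a stray value.
rows = [("prod", 500), ("Prod", 300), ("PROD ", 200), ("dev", 100), ("qa", 50)]

raw = cost_by_environment(rows)
clean = cost_by_environment(rows, normalize=True)
print(raw)
print(clean)
```

The raw grouping reports production as three separate lines; the normalized one reports a single total and surfaces the out-of-vocabulary value for cleanup.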

Enforce with Azure Policy: deny + inherit + remediate

Three policy patterns work together. Deny blocks creation of a resource missing a required tag. Modify (inherit) copies a tag from the resource group (or subscription) onto the resource if absent — invaluable because many resource types are created by services that don’t set tags. Audit reports non-compliance without blocking (use while you roll out, before flipping to deny).

# Assign the built-in "Require a tag on resources" (deny) at a management group
az policy assignment create \
  --name "require-costcenter" \
  --display-name "Require CostCenter tag (deny)" \
  --scope "/providers/Microsoft.Management/managementGroups/mg-landingzones" \
  --policy "871b6d14-10aa-478d-b590-94f262ecfa99" \
  --params '{ "tagName": { "value": "CostCenter" } }'

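# Assign the built-in inherit-tag-from-resource-group policy (modify). The catalog has an "if missing" variant and an
# always-overwrite variant, so confirm which one this definition ID is. The identity needs a tag-writing role for remediation.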
az policy assignment create \
  --name "inherit-costcenter" \
  --display-name "Inherit CostCenter from RG" \
  --scope "/providers/Microsoft.Management/managementGroups/mg-landingzones" \
  --policy "cd3aa116-8754-49c9-a813-ad46512ece54" \
  --params '{ "tagName": { "value": "CostCenter" } }' \
  --mi-system-assigned --location centralindia

In Bicep, ship the assignment as code so it lives in the landing-zone repo, reviewed in PRs:

// Inherit-tag (modify) assignment at a management group, with a managed identity for remediation
targetScope = 'managementGroup' // deploy with: az deployment mg create

resource inheritCostCenter 'Microsoft.Authorization/policyAssignments@2024-04-01' = {
  name: 'inherit-costcenter'
  location: 'centralindia'
  identity: { type: 'SystemAssigned' }
  properties: {
    displayName: 'Inherit CostCenter from RG'
    policyDefinitionId: tenantResourceId('Microsoft.Authorization/policyDefinitions', 'cd3aa116-8754-49c9-a813-ad46512ece54')
    parameters: { tagName: { value: 'CostCenter' } }
    enforcementMode: 'Default' // 'DoNotEnforce' = audit-only while rolling out
  }
}

Existing resources stay non-compliant until you run a remediation task (the modify effect only fires on create/update otherwise):

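# Remediate existing resources under the management group.
# ExistingNonCompliant is used because ReEvaluateCompliance is documented as unavailable at management-group scope.
az policy remediation create \
  --name "remediate-inherit-costcenter" \
  --management-group mg-landingzones \
  --policy-assignment "inherit-costcenter" \
  --resource-discovery-mode ExistingNonCompliant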

Confirm coverage with two numbers: the policy compliance percentage and the size of the unallocated bucket in Cost Analysis:

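# Summarize compliance for the require-tag assignment across the MG.
# resourceDetails holds one count per compliance state, so project the whole list
# instead of treating its first element as a total.
az policy state summarize \
  --management-group mg-landingzones \
  --filter "PolicyAssignmentName eq 'require-costcenter'" \
  --query "policyAssignments[].{assignment:policyAssignmentId, nonCompliant:results.nonCompliantResources, byState:results.resourceDetails}"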

The policy effects, what each does, and when to use which:

| Policy effect | What it does at evaluation | Blocks creation? | Fixes existing? | Use it when |
| --- | --- | --- | --- | --- |
| Audit | Logs non-compliance, no change | No | No | Rolling out; measuring before enforcing |
| Deny | Rejects the create/update request | Yes | No | The tag is mandatory going forward |
| Modify (add/inherit) | Adds/replaces the tag | No (allows + fixes) | Yes (via remediation) | Backfilling + auto-tagging from RG/sub |
| Append | Adds a property if missing (legacy) | No | No (create-time) | Older tag-add scenarios; prefer Modify |
| DeployIfNotExists | Deploys a related resource | No | Yes (remediation) | Tagging via a deployed config, advanced |
| Disabled | Turns the rule off | No | No | Temporarily suspend without deleting |

The classic tag-governance failure modes and how each shows up:

| Symptom | Root cause | Confirm | Fix |
| --- | --- | --- | --- |
| Big “no CostCenter” bucket in Cost Analysis | No require-tag policy, or only at sub scope | az policy state summarize shows low compliance | Assign deny + inherit at MG; remediate |
| Some resource types still untagged | Created by a service that ignores tags | Resource Graph: where isnull(tags.CostCenter) | Add Modify-inherit; remediation task |
| Reports fracture across prod/Prod | No fixed vocabulary on values | Group by Environment shows duplicates | allowedValues policy; normalize existing |
| Deny breaks pipelines | A required tag isn’t set by IaC | Deployment error names the tag | Set the tag in the module’s tags block |
| Tags exist but cost still unattributed | Tag added after billing period | Cost predates the tag | Remediate early; tags apply forward only |

A subtlety that bites: tags are not retroactive in cost data. Tagging a resource today does not re-attribute last month’s spend for it. This is why you enforce tagging before a workload accrues cost, and why the remediation task should run as soon as the policy is assigned — every untagged day is a day of unallocated spend you can’t fix later.

Reading Cost Management correctly: analysis, amortization, and the Query API

Cost Management is free and built in, but it answers different questions depending on the scope, the metric, and the grouping you choose. Getting these three right is most of the skill.

Scopes: where you point the analysis

Cost data exists at several scopes; you analyze and budget at the one that matches your accountability boundary:

| Scope | What it aggregates | Who uses it | Note |
| --- | --- | --- | --- |
| Billing account (EA/MCA) | Everything under the agreement | Finance, central FinOps | Highest level; invoice reconciliation |
| Billing profile / invoice section (MCA) | A billing slice | Finance | MCA-specific grouping |
| Management group | All subs beneath it | Platform / BU leads | Org/BU rollups |
| Subscription | One sub’s resources | App + finance | Most common budget scope |
| Resource group | One workload | Product team | Fine-grained showback |
| Tag filter (cross-scope) | All resources with a tag value | Per-team across subs | The team view |

Metric: AmortizedCost vs ActualCost (the most important toggle)

The metric selector in Cost Analysis silently changes the answer. Internalize this table:

| Question you’re asking | Use this metric | Why |
| --- | --- | --- |
| What did this team consume this month? | AmortizedCost | Spreads commitments; reflects usage |
| What is the monthly trend per product? | AmortizedCost | Trend isn’t distorted by purchase dates |
| What will I be billed in cash this period? | ActualCost | Matches the invoice cash-flow |
| Did a Reservation purchase hit this month? | ActualCost | The upfront charge shows on its day |
| Showback / chargeback numbers | AmortizedCost | Fair per-team consumption |
| Reconciling my export to the PDF invoice | ActualCost | The invoice is actual charges |

The trap, concretely: a team buys a ₹12,00,000 1-year upfront VM Reservation on 5 June. In ActualCost, June shows ₹12,00,000+ for that team and July through the following May show ~₹0 for those VMs, a chart that looks like a 10× spike followed by a collapse. In AmortizedCost, every month shows ~₹1,00,000, which is the real consumption. Every showback report, budget, and trend should be Amortized; reserve Actual for cash reconciliation.
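The same ₹12,00,000 purchase can be laid out as two monthly series. This is a simplified Python sketch: it amortizes in equal monthly slices, whereas the real amortization is spread by day, and the function names are hypothetical.

```python
def actual_cost_series(upfront: int, months: int, purchase_month: int = 0) -> list[int]:
    """ActualCost view: the whole upfront charge lands in the purchase month."""
    return [upfront if m == purchase_month else 0 for m in range(months)]


def amortized_cost_series(upfront: int, months: int) -> list[int]:
    """AmortizedCost view, simplified to equal monthly slices of the commitment."""
    return [upfront // months] * months


actual = actual_cost_series(1_200_000, 12)
amortized = amortized_cost_series(1_200_000, 12)

print("actual:   ", actual[:3], "...")
print("amortized:", amortized[:3], "...")
```

Both series sum to the same amount over the term; only the timing differs, which is exactly why the metric choice changes the shape of a trend chart without changing the total.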

Grouping and filtering

Group by the dimension that answers your question; the common ones:

| Group by | Answers | Typical use |
| --- | --- | --- |
| Service name | Where is the money going by service? | “Storage is 40%: why?” |
| Resource type | Which resource kind dominates? | VMs vs disks vs DBs split |
| Resource group | Which workload costs most? | Per-team RG showback |
| Resource | Which exact resource? | Hunting the expensive single thing |
| Tag (CostCenter/Owner/Environment) | Which team/env? | Showback; non-prod ratio |
| Location | Which region? | Egress and region-price analysis |
| Meter | Which billed unit? | RU/s, GB-month, vCPU-hours detail |
| Reservation | Commitment utilization | Are we using what we bought? |
| Subscription | Which sub drives cost? | Per-sub budget vs actual |
| Charge type | Usage / purchase / refund | Separate commitments from usage |

Pull data with the Query API, not clicks

At scale you do not click through Cost Analysis monthly — you query. The Cost Management Query API returns aggregated, server-side-grouped cost so you build dashboards and month-end packs programmatically:

# Amortized cost this month, grouped by the CostCenter tag, at a subscription scope
SUB=$(az account show --query id -o tsv)
az rest --method post \
  --uri "https://management.azure.com/subscriptions/$SUB/providers/Microsoft.CostManagement/query?api-version=2024-08-01" \
  --body '{
    "type": "AmortizedCost",
    "timeframe": "MonthToDate",
    "dataset": {
      "granularity": "None",
      "aggregation": { "totalCost": { "name": "Cost", "function": "Sum" } },
      "grouping": [ { "type": "TagKey", "name": "CostCenter" } ]
    }
  }'
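The response to a query like this carries a column list and a row list under properties. The Python sketch below turns that shape into per-CostCenter totals. The sample payload is hypothetical: the values are invented, and the column names may differ by agreement type (for example PreTaxCost instead of Cost), so the parser looks columns up by name and takes the cost column as a parameter.

```python
import json

# Hypothetical response body, trimmed to the fields the parser uses.
SAMPLE_RESPONSE = json.loads("""
{
  "properties": {
    "columns": [
      {"name": "Cost", "type": "Number"},
      {"name": "TagKey", "type": "String"},
      {"name": "TagValue", "type": "String"},
      {"name": "Currency", "type": "String"}
    ],
    "rows": [
      [120000, "costcenter", "cc-4412", "INR"],
      [80000, "costcenter", "cc-7781", "INR"],
      [40000, "costcenter", "", "INR"]
    ]
  }
}
""")


def rows_as_dicts(response: dict) -> list[dict]:
    """Zip each row with the column names so fields are read by name, not position."""
    props = response["properties"]
    names = [column["name"] for column in props["columns"]]
    return [dict(zip(names, row)) for row in props["rows"]]


def cost_by_tag_value(response: dict, cost_column: str = "Cost") -> dict:
    """Sum the cost column per tag value; an empty value is reported as '(untagged)'."""
    totals = {}
    for record in rows_as_dicts(response):
        key = record.get("TagValue") or "(untagged)"
        totals[key] = totals.get(key, 0) + record[cost_column]
    return totals


totals = cost_by_tag_value(SAMPLE_RESPONSE)
print(totals)
```

The "(untagged)" line is the unallocated bucket for this scope, so tracking it month over month shows whether tag governance is taking hold.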

For ad-hoc CLI summaries, az consumption usage list reads metered records, and Azure Resource Graph finds the resources behind the cost (e.g. every untagged or orphaned thing):

// Resource Graph: resources missing a CostCenter tag (the unallocated bucket's membership)
Resources
| where isempty(tags['CostCenter'])
| project name, type, resourceGroup, subscriptionId, location
| order by type asc

// Resource Graph: orphaned managed disks (Unattached) are pure waste; delete or snapshot
Resources
| where type == 'microsoft.compute/disks' and properties.diskState == 'Unattached'
| project name, resourceGroup, sizeGB = toint(properties.diskSizeGB), sku = tostring(sku.name)
| order by sizeGB desc

The data sources and what each is best for:

| Source | Granularity | Best for | Note |
| --- | --- | --- | --- |
| Cost Analysis (portal) | Aggregated, interactive | Ad-hoc exploration | Click; not for automation |
| Query API | Aggregated, scriptable | Dashboards, month-end packs | Server-side grouping; respects Amortized |
| az consumption usage list | Per usage record | Quick CLI checks | Metered detail; rate-limited |
| Exports (FOCUS) | Full per-record dataset | Lakehouse analysis at scale | Daily/monthly to storage |
| Azure Resource Graph | Resource inventory | Finding the resources (orphans, untagged) | Not cost numbers, but the targets |
| Advisor (Cost) | Recommendations | The right-size/idle worklist | Actionable, prioritized |

Cost exports and the FOCUS schema

When cost analysis outgrows the portal — you want to join cost to your own data (deployments, business KPIs), retain history beyond the portal’s window, or run it through a lakehouse — you configure a scheduled export. An export writes the full, per-record cost dataset to an ADLS Gen2 / Storage container on a daily or monthly cadence.

The current best practice is to export in the FOCUS schema (FinOps Open Cost and Usage Specification) — a vendor-neutral column set so the same Spark/SQL works across clouds and the same dashboards survive a billing change. Configure it:

# Create a daily export of amortized cost to a storage container.
# Assumption to verify against your CLI version: the costmanagement extension takes
# --type (ActualCost | AmortizedCost | Usage) and has no schema or file-format flags,
# so a FOCUS dataset is configured from the portal's Exports blade or the Exports REST API.
az costmanagement export create \
  --name "daily-amortized-export" \
  --scope "/subscriptions/$SUB" \
  --storage-account-id "/subscriptions/$SUB/resourceGroups/rg-finops/providers/Microsoft.Storage/storageAccounts/stfinopsexports" \
  --storage-container "cost-focus" \
  --type AmortizedCost \
  --timeframe MonthToDate \
  --recurrence Daily \
  --recurrence-period from="2026-06-01T00:00:00Z" to="2027-06-01T00:00:00Z"

The export options and when each matters:

| Option | Values | When to change | Note |
| --- | --- | --- | --- |
| Schema | FOCUS / legacy Actual / Amortized | FOCUS for new pipelines | FOCUS is cross-cloud, future-proof |
| Timeframe | MonthToDate / previous month / custom | MTD for a rolling daily push | Daily MTD overwrites the month file |
| Recurrence | Daily / Weekly / Monthly | Daily for fresh dashboards | Monthly for invoice-close snapshots |
| Format | CSV / Parquet | Parquet for lakehouse | Smaller, typed; better for Spark |
| Partitioning | On / off (file partitioning) | On for very large accounts | Splits big months into chunks |
| Destination | Storage account + container | | Use a locked-down FinOps storage acct |
| Scope | Sub / MG / billing account | Billing account for org-wide | Higher scope = full picture in one file |
| Overwrite vs append | Replace or add daily file | Overwrite for MTD; append for history | Decide retention strategy upfront |
| Compression | None / gzip (with CSV) | gzip for large CSV | Smaller egress/storage footprint |

When to use an export instead of the portal or Query API:

| Need | Portal | Query API | Export |
| --- | --- | --- | --- |
| Quick “where’s the money” look | Best | OK | No |
| Automated daily dashboard refresh | No | Good | Good |
| Join cost to deployments / KPIs | No | Hard | Best |
| Retain >13 months history | No | No | Best |
| Run through Spark / SQL warehouse | No | No | Best |
| Cross-cloud unified schema | No | No | Best (FOCUS) |

A practical note on the destination: put exports in a dedicated FinOps storage account with restricted RBAC, lifecycle rules to tier old months to cool/archive, and (ideally) a private endpoint — cost data is sensitive (it reveals architecture and scale). The same storage fundamentals you’d apply to any data apply here.
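Once an export lands in storage, the analysis is ordinary file processing. The Python sketch below reads a tiny, hypothetical FOCUS-style CSV and sums EffectiveCost per CostCenter tag. The column names (BilledCost, EffectiveCost, Tags) come from the FOCUS specification; the rows and the "unallocated" label are assumptions, and a real export has many more columns.

```python
import csv
import io
import json
from collections import defaultdict
from decimal import Decimal

# Hypothetical three-row extract using FOCUS column names.
SAMPLE_CSV = '''ChargePeriodStart,ServiceName,BilledCost,EffectiveCost,Tags
2026-06-01,Virtual Machines,0,1000,"{""CostCenter"": ""CC-4412""}"
2026-06-01,Storage,500.50,500.50,"{""CostCenter"": ""CC-4412""}"
2026-06-01,Azure Firewall,300,300,{}
'''


def effective_cost_by_tag(csv_text: str, tag_key: str = "CostCenter") -> dict:
    """Sum EffectiveCost per tag value; rows without the tag go to 'unallocated'."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(io.StringIO(csv_text)):
        tags = json.loads(row["Tags"] or "{}")
        totals[tags.get(tag_key, "unallocated")] += Decimal(row["EffectiveCost"])
    return dict(totals)


totals = effective_cost_by_tag(SAMPLE_CSV)
print(totals)
```

The first row shows why EffectiveCost is the showback column: usage covered by a prepaid commitment can carry a BilledCost of zero while still having a non-zero effective cost.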

Allocation: showback, chargeback, and splitting shared cost

Attribution is only useful if it reconciles to the invoice. The hardest part at scale is shared cost — the hub firewall, Bastion, Log Analytics workspace, DDoS plan, and gateways that serve everyone and are owned by the platform team’s subscription. If you ignore them, the sum of per-team showback is always less than the bill, and teams (rightly) distrust numbers that don’t add up.

Showback vs chargeback

| Dimension | Showback | Chargeback |
| --- | --- | --- |
| What it does | Shows each team its cost | Bills cost to the team’s budget |
| Money moves? | No | Yes (internal cross-charge) |
| Friction | Low | High |
| Accountability | Awareness | Real ownership |
| Prerequisite | Tagging | Tagging + clean shared-cost split + trust |
| Good starting point | Yes (start here) | After showback is trusted |
| Risk if done early | Low | Teams reject “unfair” numbers |

Cost allocation rules: split the shared cost

Cost Management supports cost allocation rules that take a source (a shared resource group or subscription) and distribute its cost to target teams by a chosen basis — proportional to compute spend, proportional to total cost, or a fixed percentage. This is how showback reaches 100%.

| Allocation basis | How it splits shared cost | Best when | Watch-out |
| --- | --- | --- | --- |
| Proportional to total cost | By each team’s share of total spend | Default, “fair” general split | Large spenders absorb most of the shared cost, whether or not they drive it |
| Proportional to compute | By compute (vCPU) spend | Shared cost tracks compute (e.g. logs) | Storage-heavy teams under-charged |
| Proportional to a specific tag/metric | By a chosen dimension | A clear cost driver exists | Needs a clean driver metric |
| Fixed percentage | Hard-coded splits per team | Stable, negotiated agreements | Drifts from reality; revisit quarterly |
| Even split | Equal shares | Few teams, similar size | Penalizes small teams |
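As an illustration of the proportional basis, here is a small Python sketch that splits a shared cost by each team's own spend. It is not how Cost Management implements allocation rules; the team names and amounts are hypothetical, and the choice to hand the rounding remainder to the largest team is an assumption made so the split always sums back to the shared total.

```python
def allocate_shared_cost(shared: int, team_costs: dict[str, int]) -> dict[str, int]:
    """Split `shared` across teams in proportion to each team's own cost (whole rupees)."""
    total = sum(team_costs.values())
    if total == 0:
        raise ValueError("cannot allocate proportionally when all team costs are zero")
    shares = {team: shared * cost // total for team, cost in team_costs.items()}
    # Hand any rounding remainder to the largest team so the split sums to `shared`.
    remainder = shared - sum(shares.values())
    largest = max(team_costs, key=team_costs.get)
    shares[largest] += remainder
    return shares


teams = {"checkout": 600_000, "search": 300_000, "billing": 100_000}
split = allocate_shared_cost(200_000, teams)
print(split)
```

Whatever basis is chosen, the property worth testing is the last one: the allocated shares must add up to exactly the shared cost, or showback will not reconcile.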

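The reconciliation check that proves allocation works: the per-team amortized totals must sum to the amortized total for the same scope and period (the invoice itself is an actual-cost document, so tie that out separately in ActualCost):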

# Per-CostCenter amortized totals (sum these; the result must equal the amortized total for the same scope and period)
az rest --method post \
  --uri "https://management.azure.com/subscriptions/$SUB/providers/Microsoft.CostManagement/query?api-version=2024-08-01" \
  --body '{
    "type": "AmortizedCost", "timeframe": "TheLastMonth",
    "dataset": { "granularity": "None",
      "aggregation": { "total": { "name": "Cost", "function": "Sum" } },
      "grouping": [ { "type": "TagKey", "name": "CostCenter" } ] }
  }' --query "properties.rows"
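The comparison itself is a one-line sum, but it is worth scripting so it runs every month-end. A minimal Python sketch, with hypothetical totals and a one-rupee rounding tolerance chosen as an assumption:

```python
def reconcile(per_team: dict[str, int], scope_total: int, tolerance: int = 1) -> dict:
    """Compare summed per-team cost with the scope total for the same period."""
    allocated = sum(per_team.values())
    gap = scope_total - allocated
    return {"allocated": allocated, "gap": gap, "reconciles": abs(gap) <= tolerance}


# Hypothetical month: the hub's 2,00,000 was never split to teams.
before = reconcile({"CC-4412": 600_000, "CC-7781": 300_000}, scope_total=1_100_000)

# Same month after an allocation rule distributes the shared cost.
after = reconcile({"CC-4412": 733_333, "CC-7781": 366_667}, scope_total=1_100_000)

print(before)
print(after)
```

A non-zero gap is the amount of shared or untagged cost still outside the showback, which is the number to chase before publishing the report.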

The allocation failure modes:

| Symptom | Root cause | Confirm | Fix |
| --- | --- | --- | --- |
| Per-team sum < invoice | Shared cost not allocated | Compare grouped total vs account total | Add a cost allocation rule for shared RGs |
| One team’s cost jumped, no usage change | A new shared service got split to them | Diff the allocation rule’s basis/period | Re-examine the basis; pin a fairer metric |
| “Unallocated” still large after rules | Untagged resources upstream | Resource Graph untagged query | Fix tagging first; allocation can’t fix tags |
| Teams dispute the split | Basis doesn’t match their driver | Review which basis is configured | Switch to a driver-aligned basis; socialize it |
| Chargeback rejected by finance | Numbers don’t tie to GL | Reconcile amortized export to invoice | Use Actual for cash tie-out; Amortized for showback |

Rate optimization: reservations, savings plans, Hybrid Benefit and Spot

The rate axis cuts price-per-unit without changing what you run. Azure offers four overlapping levers; choosing among them is the core commitment decision. (For the full commitment-engineering math, see Azure Cost: Reservations, Savings Plans & Hybrid Benefit Strategy; here is the operating-model view.)

The four levers compared

| Lever | What you commit to | Discount (rough) | Flexibility | Best for |
| --- | --- | --- | --- | --- |
| Reservation (RI) | A specific SKU family + region, 1 or 3 yr | Up to ~72% vs PAYG | Low (instance-size flex within family) | Stable, known SKU baseline (VMs, SQL, Cosmos RU, Storage) |
| Savings Plan (SP) | A fixed $/hour of compute, 1 or 3 yr | Up to ~65% vs PAYG | High (any region/SKU compute) | Steady compute spend, changing shapes |
| Azure Hybrid Benefit (AHB) | Nothing (use owned licenses) | Up to ~40% on Windows VMs; larger on SQL | N/A (eligibility-based) | You own Windows Server / SQL Server licenses |
| Spot | Nothing (take evictable capacity) | Up to ~90% vs PAYG | N/A (can be evicted with 30s notice) | Interruptible: batch, CI, dev, stateless scale |
| Dev/Test pricing | A Dev/Test subscription offer | Reduced Windows/some rates | N/A (subscription-type gated) | Non-prod environments under EA/Dev-Test |
| Pay-as-you-go (no commit) | Nothing | 0% (list price) | Maximum | Spiky/unknown/short-lived workloads |

These stack: apply AHB to remove license cost, cover the steady compute baseline with a Savings Plan or RIs, and burst on Spot for interruptible work. Right-size before committing, or you lock in oversized capacity.

Term, payment and break-even

Choice | Options | Trade-off
Term | 1-year vs 3-year | 3-yr deeper discount, less flexibility/longer lock-in
Payment | Upfront vs monthly | Same total cost for Azure reservations and savings plans; monthly preserves cash & avoids ActualCost spike
RI vs SP | Specific SKU vs flexible $/hr | RI deeper for a known shape; SP forgiving as shapes change
Coverage target | % of baseline committed | Commit the floor (e.g. P50 of steady usage), leave headroom on PAYG/Spot
Scope | Single vs Shared vs MG | Single = predictable ownership; Shared = max utilization but messy attribution
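The coverage-target row can be made concrete with a percentile over hourly usage. The sketch below is only an illustration: the function name `commit_level` and the sample figures are hypothetical, and it uses a simple nearest-rank percentile rather than any method Azure's own recommendations use.

```python
import math


def commit_level(hourly_usage: list[float], percentile: float) -> float:
    """Nearest-rank percentile of hourly usage (same unit as the input).

    A lower percentile sits closer to the true floor: higher utilization of
    the commitment, but less of the bill covered by it.
    """
    if not hourly_usage:
        raise ValueError("need at least one usage sample")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(hourly_usage)
    rank = math.ceil(percentile / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


# Hypothetical hourly compute spend (any currency) for five sample hours.
usage = [4, 6, 8, 10, 12]
print("P50 commit level:", commit_level(usage, 50))
print("P10 commit level:", commit_level(usage, 10))
```

Committing at P50 leaves roughly half the hours with usage above the commitment (billed at PAYG or served from Spot); committing at P10 is more conservative.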

The break-even rule of thumb: a 3-year RI/SP typically pays back versus pay-as-you-go in roughly 8–14 months depending on SKU and discount, so it only makes sense for capacity you are confident will run past that window. Commit the stable floor of usage, not the peak.
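The arithmetic behind that rule of thumb is short enough to sketch. This assumes you owe the whole discounted term whether or not you use it, and that the pay-as-you-go rate stays constant; both are simplifications, and `break_even_months` is a hypothetical helper name.

```python
def break_even_months(term_months: int, discount: float) -> float:
    """Months of use after which a full-term commitment beats pay-as-you-go.

    Commitment cost  = term_months * (1 - discount) * monthly PAYG cost
    PAYG cost so far = months_used * monthly PAYG cost
    The two are equal when months_used = term_months * (1 - discount).
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return term_months * (1 - discount)


for d in (0.62, 0.65, 0.72):
    months = break_even_months(36, d)
    print(f"3-year commitment at {d:.0%} discount: break-even after ~{months:.1f} months")
```

If the workload is likely to be retired or re-architected before the break-even month, staying on pay-as-you-go is the cheaper choice under these assumptions.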

Scope: the leak that lands a discount on the wrong team

A commitment’s scope decides which resources receive its discount. Shared scope auto-applies the discount to any matching resource across the billing account — maximizing utilization but meaning the discount can land on a team that never paid for the commitment. Single scope ties it to one subscription. For clean chargeback, default to single scope unless you deliberately want pooled utilization.

# List reservation orders with their term and billing plan (applied scope and utilization live on the individual reservations)
az reservations reservation-order list --query "[].{name:displayName, term:term, billingPlan:billingPlan}" -o table

# Change a reservation's applied scope to a single subscription (clean attribution)
az reservations reservation update \
  --reservation-order-id <orderId> --reservation-id <reservationId> \
  --applied-scope-type Single --applied-scopes "/subscriptions/$SUB"

Monitor utilization — an under-used commitment is wasted money, the inverse of the problem you bought it to solve:

Commitment metric | What it tells you | Healthy | Action if unhealthy
Utilization % | How much of the commit is used | >90% sustained | Re-scope (Single→Shared) or resize down at renewal
Coverage % | How much eligible usage is committed | 60–80% of baseline | Buy more if PAYG hours are high and stable
Applied scope | Single / Shared / MG | Matches chargeback model | Re-scope to Single for clean attribution
Expiry date | When the term ends | Tracked + alerted | Renew or let lapse deliberately, never by surprise
PAYG hours above commit | Uncommitted steady usage | Low | Candidate for an additional commitment
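Utilization and coverage in the table above are two different ratios over the same hourly data, which is easy to mix up. A minimal sketch, assuming usage and commitment are expressed in the same unit per hour; `commitment_health` and the sample numbers are hypothetical.

```python
def commitment_health(hourly_usage: list[float], commit_per_hour: float) -> dict[str, float]:
    """Compute utilization, coverage and uncommitted usage for one commitment."""
    # Each hour, the commitment absorbs usage up to its size; the rest is PAYG.
    covered = sum(min(used, commit_per_hour) for used in hourly_usage)
    committed = commit_per_hour * len(hourly_usage)
    eligible = sum(hourly_usage)
    return {
        # Share of what you paid for that was actually consumed.
        "utilization": covered / committed if committed else 0.0,
        # Share of eligible usage that received the committed rate.
        "coverage": covered / eligible if eligible else 0.0,
        # Usage billed at pay-as-you-go because it exceeded the commitment.
        "payg_above_commit": eligible - covered,
    }


health = commitment_health([10, 10, 4, 12], commit_per_hour=8)
print(health)
```

In this sample the hour with usage 4 drags utilization below the 90% target even though coverage is inside the 60–80% band, which is the pattern that argues for resizing down rather than buying more.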

The commitment failure modes:

Symptom | Root cause | Confirm | Fix
A team’s spend “tripled then zeroed” | Upfront commitment read in ActualCost | Spike aligns with purchase date | Report in AmortizedCost everywhere
Discount on a team that didn’t buy | Shared scope auto-applying org-wide | appliedScopeType == Shared | Re-scope to Single; default new buys Single
Low RI/SP utilization | Over-bought, or baseline shrank | Reservations → Utilization < 90% | Re-scope Shared for pooling; resize at renewal
Committed but still high PAYG bill | Coverage too low vs stable usage | PAYG hours high and flat | Increase coverage on the stable floor
Bought RI then re-architected to serverless | Committed to capacity you no longer run | Utilization drops post-migration | Prefer SP (flexible) when shapes may change
Windows VMs at full price | AHB not enabled despite owned licenses | VM shows PAYG Windows rate | Enable Hybrid Benefit on eligible SKUs
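The first failure mode is easiest to see with numbers. The sketch below contrasts the two views of one upfront purchase; it spreads the cost evenly per month, which is a simplification of how amortized cost is actually computed, and `monthly_views` is a hypothetical helper.

```python
def monthly_views(upfront_cost: float, term_months: int) -> tuple[list[float], list[float]]:
    """Return (actual, amortized) monthly series for one upfront commitment."""
    if term_months < 1:
        raise ValueError("term_months must be at least 1")
    # ActualCost view: the whole charge posts in the purchase month.
    actual = [upfront_cost] + [0.0] * (term_months - 1)
    # AmortizedCost view (simplified): the charge is spread over the term.
    amortized = [upfront_cost / term_months] * term_months
    return actual, amortized


actual, amortized = monthly_views(1_400_000, 12)
print("Actual first three months:   ", actual[:3])
print("Amortized first three months:", [round(x) for x in amortized[:3]])
```

Both series sum to the same total; only the amortized one gives a trend that a forecast or a team-level budget can be compared against.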

Usage optimization: right-sizing, auto-stop, and killing orphans

The usage axis reduces units consumed. It is where the fastest wins live, because most fleets carry obvious waste: over-sized SKUs, non-production running 24×7, and orphaned resources nobody deletes.

Right-sizing with Advisor

Azure Advisor continuously analyzes utilization and recommends downsizing or shutting down underused resources, with the estimated saving attached. It is your prioritized worklist.

# List Advisor Cost recommendations with estimated annual savings
az advisor recommendation list --category Cost \
  --query "[].{resource:impactedValue, problem:shortDescription.problem, savings:extendedProperties.annualSavingsAmount}" -o table

The usage-reduction worklist, by lever and typical payoff:

Usage lever | What it targets | Typical saving | Effort | Risk
Right-size VMs/DBs | Over-provisioned SKUs | 20–50% on those resources | Low | Validate headroom for peaks
Auto-stop non-prod | Dev/test running 24×7 | ~65% on non-prod compute | Low | Schedule must respect work hours
Delete orphans | Unattached disks, unused IPs, stale snapshots | Pure waste removed | Low | Confirm truly unused first
Autoscale / scale-to-zero | Fixed capacity for variable load | Tracks demand | Medium | Tune min/max; cold-start cost
Serverless / consumption | Idle always-on services | Pay-per-use | Medium | Re-architecture; cold starts
Storage tiering | Hot data that’s actually cold | 50%+ on cold blobs | Low | Retrieval cost/latency on archive
Log-ingestion control | Verbose/duplicated logs | Often large | Low | Don’t drop signal you need
Disk SKU downgrade | Premium SSD on low-IOPS disks | 30–60% on those disks | Low | Validate IOPS/throughput need
Egress reduction | Cross-region/internet traffic | Varies | Medium | Private Link, same-region, CDN
Snapshot lifecycle | Snapshots never pruned | Pure waste removed | Low | Keep a retention policy

Auto-stop non-production

Non-production compute that runs nights and weekends is the most common easy win. Target it by the Environment tag and deallocate on a schedule (deallocated VMs stop compute charges; you still pay for disks). Azure Automation, a Logic App, or a scheduled Function all work:

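# Deallocate every VM tagged Environment=dev (run on a schedule via Automation/Functions)
ids=$(az vm list --query "[?tags.Environment=='dev'].id" -o tsv)
# Skip the call when nothing matches; --ids with an empty list is an error
[ -n "$ids" ] && az vm deallocate --ids $ids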

The key distinction that catches people: Stop (deallocate) releases the compute and stops billing for it; Stop (from inside the OS) leaves the VM allocated and still billing. Always deallocate.

VM power state | Compute billed? | Disk billed? | Public IP (static) billed?
Running | Yes | Yes | Yes
Stopped (OS shutdown, still allocated) | Yes | Yes | Yes
Stopped (deallocated) | No | Yes | Yes
Deleted | No | No (if disk deleted) | No (if IP deleted)
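The ~65% saving quoted for auto-stop follows from schedule arithmetic. A minimal sketch, assuming a 12-hour, five-day working window; it covers compute only, since disks and static IPs keep billing while a VM is deallocated. `auto_stop_saving` is a hypothetical helper name.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def auto_stop_saving(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on compute cost removed by running only on a schedule."""
    on_hours = hours_per_day * days_per_week
    if not 0 <= on_hours <= HOURS_PER_WEEK:
        raise ValueError("schedule must fit inside one week")
    return 1 - on_hours / HOURS_PER_WEEK


saving = auto_stop_saving(hours_per_day=12, days_per_week=5)
print(f"Compute saving vs 24x7: {saving:.1%}")
```

Widening the window for teams in other time zones reduces the saving proportionally, which is the trade-off behind "schedule must respect work hours".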

Hunt the orphans

Orphaned resources are silent, pure waste. The usual suspects and how to find them:

Orphan type | Why it lingers | Find it | Action
Unattached managed disks | VM deleted, disk kept | Resource Graph diskState == 'Unattached' | Snapshot then delete
Unassociated public IPs (static) | NIC/LB deleted | Graph ipConfiguration == null | Delete
Stale snapshots | Backups never pruned | Graph by age on snapshots | Lifecycle-prune
Idle/empty App Service plans | App removed, plan kept | Plans with 0 sites | Delete the plan
Old disks of deallocated VMs | “We might need it” | Deallocated VM age | Review + delete
Unused NAT Gateways / gateways | Workload retired | Graph by association | Delete
Over-provisioned DB tiers | Sized for launch peak | Advisor + DTU/RU metrics | Scale down
Idle load balancers (no backends) | Backend pool emptied | Graph: empty backend pool | Delete
Orphaned NICs (no VM) | VM deleted, NIC kept | Graph virtualMachine == null | Delete
Premium disks on stopped VMs | Dev disks left Premium SSD | Disk SKU on deallocated VMs | Downgrade to Standard

Budgets, anomaly detection, and closing the loop with automation

Visibility and optimization are nothing without a control loop that catches overruns before the invoice. Azure gives you budgets, anomaly alerts, and action groups; the discipline is wiring them to forecast and action, not just email.

Budgets that actually control spend

A budget is a threshold at a scope with notification rules. By itself it only emails — it does not cap spending. Two design choices make it useful: alert on forecasted spend (predicted month-end, so you act early) and attach an action group that runs automation.

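# Create a subscription budget that alerts at 80% actual and 100% forecast, to an action group
# (az consumption is a preview command group: run `az consumption budget create --help` to confirm your CLI version accepts --notifications)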
az consumption budget create \
  --budget-name "sub-monthly-cap" \
  --amount 500000 --time-grain Monthly \
  --start-date 2026-06-01 --end-date 2027-06-01 \
  --category Cost \
  --notifications '{
    "actual80": { "enabled": true, "operator": "GreaterThan", "threshold": 80,
                  "contactGroups": ["/subscriptions/'$SUB'/resourceGroups/rg-finops/providers/microsoft.insights/actionGroups/ag-finops"] },
    "forecast100": { "enabled": true, "operator": "GreaterThan", "threshold": 100, "thresholdType": "Forecasted",
                  "contactGroups": ["/subscriptions/'$SUB'/resourceGroups/rg-finops/providers/microsoft.insights/actionGroups/ag-finops"] }
  }'

In Bicep, ship budgets as code per landing zone so every new subscription is born with a guardrail:

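// Deploy at subscription scope so the budget guards the whole subscription
targetScope = 'subscription'

// Resource ID of the action group that receives both alerts
param actionGroupId string

resource budget 'Microsoft.Consumption/budgets@2023-11-01' = {
  name: 'sub-monthly-cap'
  properties: {
    category: 'Cost'
    amount: 500000
    timeGrain: 'Monthly'
    timePeriod: { startDate: '2026-06-01', endDate: '2027-06-01' }
    notifications: {
      // contactEmails is part of the notification schema; left empty here because the action group handles delivery
      actual80: { enabled: true, operator: 'GreaterThan', threshold: 80, thresholdType: 'Actual', contactEmails: [], contactGroups: [ actionGroupId ] }
      forecast100: { enabled: true, operator: 'GreaterThan', threshold: 100, thresholdType: 'Forecasted', contactEmails: [], contactGroups: [ actionGroupId ] }
    }
  }
}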

The budget knobs and how to reason about each:

Setting | What it does | Default / typical | When to change
Amount | The threshold value | Your monthly cap | Set per scope from baseline + growth
Time grain | Reset cadence | Monthly | Quarterly/Annual for capex-style caps
Scope | Where it measures | Subscription | RG for team-level; MG for BU
Threshold % | Alert trip points | 50/80/100 | Add an early 50% for fast-growing subs
Threshold type | Actual vs Forecasted | Actual | Forecasted to act before overage
Action group | What fires on breach | Email only | Attach automation to control, not just notify
Filters | Restrict to a tag/RG/service | None | Budget a single team/product via tag filter
Reset / recurrence period | Start & end of the budget window | 1 year | Re-baseline annually as the estate grows
Notification recipients | Emails / contact roles / groups | Owner email | Route to the team that can act, not a shared inbox
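Why a Forecasted threshold fires earlier than an Actual one can be shown with a toy evaluation. This sketch uses a naive linear run-rate as the forecast, which is an assumption for illustration only; Cost Management's forecast is its own model. `fired_alerts` and `RULES` are hypothetical names that mirror the two notifications used in the budget examples.

```python
# (name, threshold type, threshold as % of the budget amount)
RULES = [("actual80", "Actual", 80), ("forecast100", "Forecasted", 100)]


def fired_alerts(amount: float, month_to_date: float, day_of_month: int,
                 days_in_month: int, rules=RULES) -> list[str]:
    """Return the names of the notification rules that would trip today."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month must be within the month")
    # Naive forecast: assume the rest of the month spends at the same daily rate.
    forecast = month_to_date / day_of_month * days_in_month
    fired = []
    for name, threshold_type, percent in rules:
        measured = forecast if threshold_type == "Forecasted" else month_to_date
        if measured > amount * percent / 100:
            fired.append(name)
    return fired


# Mid-month, 60% of a 500,000 budget spent: actual is under 80%, the run-rate is not.
print(fired_alerts(500_000, 300_000, day_of_month=15, days_in_month=30))
```

In the mid-month case only the forecast rule trips, which leaves about two weeks to act before the actual threshold would have said anything.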

Anomaly detection: catch the unexpected

Budgets catch known limits; anomaly alerts catch unexpected deviations (a leaked key spinning up VMs, a log explosion, a runaway query) using Cost Management’s built-in ML. Subscribe to anomaly alerts so a 3× day-over-day jump reaches you within about a day, not on the invoice.

Detection mechanism | Catches | Latency | Best for
Budget (actual) | Crossing a known threshold | Hours–day | Hard caps you set
Budget (forecast) | Predicted to cross threshold | Days early | Acting before the overage
Anomaly alert | Statistically unusual spend | ~Daily | Unknown unknowns (leaks, runaways)
Scheduled export + query | Anything you script a check for | Daily | Custom rules (per-team caps, ratios)
Advisor (cost) | Right-size/idle opportunities | Continuous | Proactive savings, not overruns
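The "Scheduled export + query" row is where a custom rule such as a 3× day-over-day check would live. The sketch below is that kind of scripted rule over a daily cost series; it is not how Cost Management's built-in anomaly detection works, and `day_over_day_spikes` is a hypothetical helper.

```python
def day_over_day_spikes(daily_cost: list[float], factor: float = 3.0) -> list[int]:
    """Return indexes of days whose cost is at least `factor` times the previous day."""
    spikes = []
    for i in range(1, len(daily_cost)):
        previous, today = daily_cost[i - 1], daily_cost[i]
        # Skip days after a zero-cost day: the ratio is undefined there.
        if previous > 0 and today >= factor * previous:
            spikes.append(i)
    return spikes


costs = [100, 110, 105, 340, 120]  # hypothetical daily amortized cost
print("Spike on day index:", day_over_day_spikes(costs))
```

A fixed ratio is noisy on small or bursty subscriptions, so a rule like this usually needs a minimum absolute amount as well before it is worth paging anyone.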

Close the loop: action groups → automation

The control becomes real when the alert does something. Wire the budget/anomaly action group to an Automation runbook or Function that takes a safe action — deallocate non-prod, or post to the team channel with the offending resource and a one-click stop.

# Action group that triggers an Automation webhook on budget/anomaly breach
az monitor action-group create \
  --name ag-finops --resource-group rg-finops \
  --action webhook stopNonProd "https://<automation-webhook-url>" \
  --action email finops finops@example.com

The escalation ladder — match the action to the severity and the environment:

Trigger | Severity | Safe automated action | Human action
Non-prod budget 80% (actual) | Low | Post to channel | Review what’s running
Non-prod budget 100% (forecast) | Medium | Deallocate Environment=dev VMs | Confirm nothing legit broke
Prod budget 100% (forecast) | High | Notify only (never auto-kill prod) | Investigate; scale/optimize
Anomaly: 3× day-over-day | High | Snapshot context, page on-call | Identify the runaway/leak
Anomaly in a sandbox sub | Medium | Throttle / deallocate sandbox | Find who/what spun up

The cardinal rule: automate destructive actions only in non-production. A budget breach in prod is a notify-and-investigate event — never let automation deallocate production because a forecast crossed a line.
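The cardinal rule is simple to encode as a guard in whatever runbook or Function the action group calls. A minimal sketch: the trigger names, action names and environment labels are all hypothetical, and anything not explicitly listed as non-production is treated as production.

```python
NON_PROD = {"dev", "test", "sandbox"}
DESTRUCTIVE = {"deallocate"}

# Simplified version of the escalation ladder: trigger -> preferred action.
LADDER = {
    "budget_actual_80": "post_to_channel",
    "budget_forecast_100": "deallocate",
    "anomaly_3x": "page_on_call",
}


def choose_action(environment: str, trigger: str) -> str:
    """Pick the automated action, downgrading destructive ones outside non-prod."""
    action = LADDER.get(trigger, "notify")
    if action in DESTRUCTIVE and environment not in NON_PROD:
        return "notify"  # production (or unknown) is notify-and-investigate only
    return action


print(choose_action("dev", "budget_forecast_100"))
print(choose_action("prod", "budget_forecast_100"))
```

Defaulting unknown environments to the production path means a missing or misspelled Environment tag fails safe rather than triggering a deallocation.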

Architecture at a glance

The diagram traces spend the way it actually moves through a mature cost program — left to right as a closed loop — and marks the five places it leaks. Read it as a pipeline. In GOVERN + TAG, the management group anchors policy that flows down to subscriptions and resource groups; Azure Policy denies untagged resources and inherits CostCenter/Owner/Environment so every resource is born attributable (badge 1 marks the leak: weak enforcement → an unallocated bucket). Those tagged resources emit usage records into INGEST, where Cost Management amortizes commitments and a daily FOCUS export lands in ADLS Gen2 (badge 2: reading ActualCost instead of Amortized skews every trend). The amortized data feeds ALLOCATE, where showback slices cost per team by tag and a shared-split allocation rule distributes the hub firewall, Bastion and Log Analytics back to teams so the numbers reconcile to 100% (badge 3: unsplit shared cost makes showback under-count).

From allocation you derive a savings target that drives OPTIMIZE: Reservations / Savings Plans / Hybrid Benefit cut the rate on the stable baseline (badge 4: a Shared-scope commitment discounts a team that never paid), while right-sizing + auto-stop cut the usage. Finally ACT closes the loop: budgets alert on forecast and anomaly, and an action group triggers a Function/runbook that remediates, either deallocating non-prod or remediating tags back at the GOVERN stage (badge 5: a budget that emails but triggers no action lets non-prod burn over a weekend). Notice that the loop closes: the remediate flow runs from ACT back to GOVERN, because the output of every overrun is a tightened policy or a stopped resource at the origin. The whole method is: govern so it’s attributable, ingest amortized, allocate to 100%, optimize rate and usage, and act on forecast with automation. Every numbered badge is a specific, confirmable leak with a one-command check.

Figure: Azure cost-control operating model as a left-to-right closed loop: a GOVERN + TAG zone where a management group anchors Azure Policy that denies untagged resources and inherits CostCenter, Owner and Environment tags onto subscriptions and resource groups; an INGEST zone where Cost Management amortizes commitments and a daily FOCUS-schema export lands in ADLS Gen2; an ALLOCATE zone where showback slices cost per team by tag and a cost allocation rule splits shared hub firewall and Log Analytics back to teams to reconcile to 100 percent of the invoice; an OPTIMIZE zone with Reservations, Savings Plans and Hybrid Benefit cutting the rate plus Advisor right-sizing and auto-stop cutting usage; and an ACT zone where budgets alert on forecast and anomaly and an action group triggers a Function or Automation runbook that remediates non-production and loops back to remediate tags at the GOVERN stage. Five numbered badges mark the leaks: untagged spend, ActualCost-skewed trends, unsplit shared cost, wrong-scope commitments, and budgets with no action wired.

Real-world scenario

Northwind Commerce runs a multi-tenant retail platform on Azure across 38 subscriptions organized under an mg-landingzones management group: a shared platform subscription (hub VNet, Azure Firewall, Bastion, a central Log Analytics workspace, an Application Gateway), and per-product subscriptions for checkout, search, catalog, billing and a dozen others, plus sandbox subs per team. The FinOps function is two people inside the platform team. Monthly Azure spend had grown to about ₹1.9 crore and the forecast was wrong by 30–40% every month — finance had started talking about a hard spend freeze.

The first audit was brutal. Cost Analysis grouped by CostCenter showed 41% “unallocated” — nearly half the bill belonged to no team. A chart of the billing product showed a 9× spike in March then near-zero in April, which finance had flagged as “a billing bug”; it was actually a ₹14,00,000 1-year SQL Reservation bought upfront and read in ActualCost. The platform subscription — firewall, Bastion, Log Analytics, gateway — was ₹38,00,000/month and charged to nobody, so every product’s showback was wildly understated and no team believed the numbers. Sandbox subscriptions ran 24×7; one had spun up eight Standard_NC GPU VMs for “a quick experiment” six weeks earlier and left them running — about ₹6,00,000 of pure waste discovered only because someone finally grouped by resource.

The remediation ran in three waves over a quarter. Wave 1 — make it attributable. They assigned require-tag (deny) for CostCenter, Owner, Product, Environment and inherit-tag (modify) for the same at mg-landingzones, then ran a remediation task that backfilled tags from resource groups; the unallocated bucket fell from 41% to under 4% in two weeks. They switched every report, budget and export to AmortizedCost, and the “billing bug” vanished — the SQL Reservation now showed as a flat ~₹1,16,000/month. Wave 2 — allocate to 100%. A cost allocation rule split the platform subscription proportionally to each product’s compute spend; for the first time per-team showback summed to the invoice, and the product teams accepted the numbers because they could see why they owed a slice of the firewall.

Wave 3 — optimize and close the loop. With clean tags and trusted allocation, they right-sized 60+ over-provisioned VMs and databases off Advisor (~₹11,00,000/month), put a scheduled Function on Environment=sandbox/dev to deallocate nightly and weekends (~₹9,00,000/month), and — only after right-sizing — bought a 3-year Savings Plan sized to the P50 of steady compute at Single scope per product (so each team’s discount stayed with that team), layering Azure Hybrid Benefit on their owned Windows/SQL licenses. Finally they shipped budgets-as-Bicep into every landing zone: a per-sub budget alerting at 80% actual and 100% forecast, anomaly alerts, and an action group that posts to the team channel and (for non-prod only) triggers the deallocate runbook. The next quarter’s spend landed at ₹1.34 crore — a ~30% reduction while compute capacity grew 18% — and, more importantly, the forecast came within 6% every month, so finance dropped the freeze. The lesson on the wall: “You can’t optimize what you can’t attribute — fix tags and amortization first, or every later number is a fight.”

The program as a before/after, because the order of the fixes is the lesson:

Stage | Before | Action | After
Attribution | 41% unallocated | Deny + inherit tags at MG; remediate | <4% unallocated
Trend accuracy | “9× then zero” billing chart | Switch all reports to AmortizedCost | Flat, real consumption trend
Allocation | Platform sub charged to nobody | Allocation rule splits shared cost | Showback sums to 100% of invoice
Usage | 60+ oversized; sandbox 24×7; GPU orphans | Advisor right-size + auto-stop + delete | ~₹20,00,000/mo removed
Rate | All PAYG | Right-size then 3-yr SP (Single) + AHB | Deep discount on stable floor
Control | Surprise on the invoice | Budgets (forecast) + anomaly + runbook | Overruns caught in hours
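The Wave 2 allocation step, splitting the platform subscription in proportion to each product's compute spend, comes down to a few lines of arithmetic. The sketch uses the ₹38,00,000 platform figure from the scenario; the per-product compute amounts are invented for illustration, and `allocate_shared` is a hypothetical helper rather than the Cost Management allocation-rule API.

```python
def allocate_shared(shared_cost: float, basis: dict[str, float]) -> dict[str, float]:
    """Split a shared cost across teams in proportion to a basis (e.g. compute spend)."""
    total = sum(basis.values())
    if total <= 0:
        raise ValueError("basis must sum to a positive amount")
    return {team: shared_cost * amount / total for team, amount in basis.items()}


# Hypothetical monthly compute spend per product, in rupees.
compute_spend = {"checkout": 5_000_000, "search": 3_000_000, "catalog": 2_000_000}
shares = allocate_shared(3_800_000, compute_spend)

for team, share in shares.items():
    print(f"{team}: {share:,.0f}")
print("Allocated total:", f"{sum(shares.values()):,.0f}")
```

Because the shares always sum back to the shared cost, per-team showback plus the allocation reconciles to the invoice, which is the property finance checks first.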

Advantages and disadvantages

The FinOps operating model both prevents a class of expensive surprises and imposes real discipline. Weigh it honestly:

Advantages (why this model helps you) | Disadvantages (why it bites)
Cost Management is native and free, so no third-party tool is needed to start | Doing it well (allocation, automation, FOCUS lakehouse) is real engineering effort
Tag governance via policy makes every cost attributable and reconcilable to the invoice | Tag discipline is unforgiving: one missing policy and the unallocated bucket grows; tags aren’t retroactive
Amortized reporting gives finance a stable, trustworthy trend to forecast against | The Actual-vs-Amortized distinction is subtle and silently breaks analysis if misread
Reservations/SP/AHB/Spot cut steady-state cost 30–70% without changing the workload | Commitments lock you in (term, region, SKU/scope); over-buying or wrong scope wastes money
Budgets + anomaly + automation catch overruns in hours, not on the invoice | Budgets don’t cap spend by default; automation on prod is dangerous, so careful scoping is required
Showback creates accountability so each team trims its own waste, preserving velocity | Chargeback adds cross-charging friction and needs trust + clean allocation first
Right-sizing and auto-stop are fast, low-risk wins off Advisor’s prioritized list | Aggressive right-sizing without headroom causes performance incidents under peak

The model is right for any organization past a single team — the cost of not doing it is paid in bill-shock, blunt spend freezes, and unattributable waste. It bites hardest when treated as a finance afterthought rather than an engineering practice: tags applied late (so cost is already unallocated), commitments bought before a baseline exists, and automation bolted onto production where it can do damage. Every disadvantage is manageable — and the whole point is to make cost a continuous, low-friction part of how you build, not a quarterly fire-drill.

Hands-on lab

Stand up the core controls on one subscription — tag enforcement, a budget with forecast alerting, an amortized query, and an orphan hunt — all using free Cost Management and a near-zero-cost test resource. Run in Cloud Shell (Bash).

Step 1 — Variables and a resource group.

RG=rg-finops-lab
LOC=centralindia
SUB=$(az account show --query id -o tsv)
az group create -n $RG -l $LOC -o table

Step 2 — Assign a require-tag (deny) policy at the resource-group scope (we scope to the RG for a safe, reversible lab; in production you’d scope to a management group).

az policy assignment create \
  --name "lab-require-costcenter" \
  --display-name "Lab: require CostCenter (deny)" \
  --scope "/subscriptions/$SUB/resourceGroups/$RG" \
  --policy "871b6d14-10aa-478d-b590-94f262ecfa99" \
  --params '{ "tagName": { "value": "CostCenter" } }'

Step 3 — Prove the deny works. Try to create a public IP without the tag (expect a policy denial), then with it (expect success):

# Expect: RequestDisallowedByPolicy — the deny fired
az network public-ip create -g $RG -n pip-untagged -o table

# Expect: success — the required tag is present
az network public-ip create -g $RG -n pip-tagged --tags CostCenter=CC-LAB Owner=you Environment=dev -o table

The first command failing with RequestDisallowedByPolicy is the lab’s core lesson: untagged spend can’t be created, so it can’t become unallocated.

Step 4 — Create a budget with a forecast alert. A ₹1,000 monthly budget alerting at 80% actual and 100% forecast (swap in an action group ID if you have one):

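# az consumption is a preview command group; confirm --notifications is accepted with: az consumption budget create --help
az consumption budget create \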
  --budget-name "lab-budget" \
  --amount 1000 --time-grain Monthly \
  --start-date 2026-06-01 --end-date 2026-12-01 \
  --category Cost \
  --notifications '{
    "actual80": { "enabled": true, "operator": "GreaterThan", "threshold": 80, "contactEmails": ["you@example.com"] },
    "forecast100": { "enabled": true, "operator": "GreaterThan", "threshold": 100, "thresholdType": "Forecasted", "contactEmails": ["you@example.com"] }
  }'

Step 5 — Query amortized cost for the subscription, grouped by CostCenter. This is the month-end pack in one call:

az rest --method post \
  --uri "https://management.azure.com/subscriptions/$SUB/providers/Microsoft.CostManagement/query?api-version=2024-08-01" \
  --body '{ "type": "AmortizedCost", "timeframe": "MonthToDate",
    "dataset": { "granularity": "None",
      "aggregation": { "total": { "name": "Cost", "function": "Sum" } },
      "grouping": [ { "type": "TagKey", "name": "CostCenter" } ] } }' \
  --query "properties.rows"

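Expected: one row per CostCenter value carrying the cost, the tag key and value, and the currency (read properties.columns in the response for the exact column order). Your CC-LAB resources appear under their tag once cost data has been processed, which typically lags resource creation by 8 to 24 hours; anything untagged appears as a blank bucket, which the Step 2 policy stops from growing inside the lab resource group.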
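Turning that response into a per-team table and an unallocated percentage takes little code. The sketch assumes the response carries a properties.columns list that names a Cost column and a TagValue column, and that untagged cost arrives with an empty tag value; check both against a real payload. The sample payload and both function names are hypothetical.

```python
def cost_by_tag(response: dict) -> dict[str, float]:
    """Sum cost per tag value, using the column names rather than fixed positions."""
    props = response["properties"]
    names = [column["name"] for column in props["columns"]]
    cost_i, value_i = names.index("Cost"), names.index("TagValue")
    totals: dict[str, float] = {}
    for row in props["rows"]:
        key = row[value_i] or "(untagged)"  # empty or null tag value
        totals[key] = totals.get(key, 0.0) + float(row[cost_i])
    return totals


def unallocated_share(totals: dict[str, float]) -> float:
    """Fraction of total cost that carries no tag value."""
    total = sum(totals.values())
    return totals.get("(untagged)", 0.0) / total if total else 0.0


# Hypothetical payload shaped like a Query API response.
SAMPLE = {"properties": {
    "columns": [{"name": "Cost"}, {"name": "TagKey"}, {"name": "TagValue"}, {"name": "Currency"}],
    "rows": [[820.5, "CostCenter", "CC-LAB", "INR"], [179.5, "CostCenter", "", "INR"]],
}}

totals = cost_by_tag(SAMPLE)
print(totals, f"unallocated = {unallocated_share(totals):.1%}")
```

Tracking the unallocated share month over month gives one number that shows whether tag governance is holding.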

Step 6 — Hunt orphans with Resource Graph. Find unattached disks and unassociated static IPs across the subscription:

az graph query -q "Resources
| where (type == 'microsoft.compute/disks' and properties.diskState == 'Unattached')
   or (type == 'microsoft.network/publicipaddresses' and isnull(properties.ipConfiguration) and isnull(properties.natGateway))
| project name, type, resourceGroup, location" -o table

Validation checklist. You enforced tagging (deny blocked an untagged create), created a budget that alerts on forecast before the overage, pulled amortized cost grouped by team in one API call, and inventoried orphaned waste — the four pillars of the loop, on one subscription.

Step | What you did | What it proves | Real-world analogue
2–3 | Deny untagged create | Untagged spend can’t be born | MG-scope tag governance
4 | Budget with forecast alert | You act before the overage | Per-sub budgets-as-code
5 | Amortized query by tag | The correct metric, scripted | Month-end showback pack
6 | Resource Graph orphan hunt | Waste is findable and deletable | Monthly orphan sweep

Cleanup (avoid lingering charges).

az policy assignment delete --name "lab-require-costcenter" --scope "/subscriptions/$SUB/resourceGroups/$RG"
az consumption budget delete --budget-name "lab-budget"
az group delete -n $RG --yes --no-wait

Cost note. A static public IP is a few paise per hour; the whole lab runs well under ₹20, and deleting the resource group plus the budget/policy stops everything. Cost Management, budgets, and the Query API are free.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark and reopen at month-end. First as a scannable table, then the entries that bite hardest with full confirm-detail.

# | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix
1 | Large “unallocated / no CostCenter” bucket | No tag-governance policy, or only at sub scope | Cost Analysis → group by CostCenter; az policy state summarize shows low compliance | Deny + inherit tags at MG scope; remediation task
2 | A team “tripled then dropped to zero” | Reading ActualCost; a commitment landed | Cost Analysis metric = Actual; spike aligns with RI/SP buy date | Switch all reports/budgets/exports to AmortizedCost
3 | Per-team showback < invoice total | Shared/hub cost not allocated to teams | Sum grouped amortized < account amortized | Add a cost allocation rule for shared RGs
4 | Reservation discount on a team that didn’t buy | Shared scope auto-applies org-wide | az reservations reservation list → appliedScopeType == Shared | Re-scope to Single; default new buys Single
5 | Low RI/SP utilization, money wasted | Over-bought, or baseline shrank/re-architected | Reservations → Utilization < 90% | Re-scope Shared to pool; resize/let lapse at renewal
6 | Budget “fired” but spend kept climbing | Budget emails only; no action group / no forecast | Budget has notifications, no action group | Wire to action group → runbook; alert on forecast
7 | Tagged today but last month still unallocated | Tags aren’t retroactive in cost data | Cost predates the tag application | Enforce + remediate early; can’t backfill old cost
8 | Non-prod bill high despite “stopped” VMs | VMs stopped from OS, still allocated | az vm get-instance-view → PowerState/stopped (not deallocated) | Deallocate (az vm deallocate), not OS shutdown
9 | Storage cost creeping with little new data | Hot tier for cold data; orphaned snapshots/disks | Cost by meter; Resource Graph orphan query | Lifecycle-tier to cool/archive; delete orphans
10 | Log Analytics / App Insights bill exploded | Verbose or duplicated ingestion | Cost by service = Monitor; ingestion volume spike | Sampling, table-level retention, drop noisy logs
11 | Export job produces no/partial files | Wrong scope, storage RBAC, or schema mismatch | az costmanagement export show; storage container empty | Fix scope/RBAC; re-run; verify FOCUS schema
12 | Anomaly/overrun found weeks late on invoice | No anomaly alert; no forecast budget | No anomaly subscription; budgets actual-only | Enable anomaly alerts + forecast budgets
13 | Chargeback numbers rejected by finance | Amortized used for cash tie-out (or vice-versa) | Numbers don’t tie to GL/invoice | Actual for cash tie-out, Amortized for showback
14 | Right-sized then performance incidents | Downsized without peak headroom | Advisor applied blindly; p95 CPU/RU pinned post-change | Re-size up; validate against real peak before cutting

The expanded form, for the entries that cost the most time and money:

1. A large “unallocated / no CostCenter” bucket in Cost Analysis. Root cause: No require-tag (deny) and no inherit-tag (modify) policy, or they’re only at subscription scope so new subs and service-created resources slip through. Confirm: Cost Analysis → group by CostCenter shows a big blank bucket; az policy state summarize --management-group <mg> --filter "PolicyAssignmentName eq 'require-costcenter'" shows low compliance; Resource Graph Resources | where isnull(tags['CostCenter']) lists the offenders. Fix: Assign deny + inherit at the management-group root; run a remediation task to backfill existing resources; add allowedValues on Environment to stop value fragmentation.

2. A team “tripled then went to zero.” Root cause: Reporting in ActualCost, so an upfront Reservation/Savings Plan purchase posts its whole charge on the buy date, then ₹0 for that resource over the term. Confirm: The Cost Analysis metric selector reads Actual; the spike date matches a reservation order’s purchase date (az reservations reservation-order list). Fix: Switch every report, budget, and export to AmortizedCost; reserve Actual only for cash-invoice reconciliation.

3. Per-team showback sums to less than the invoice. Root cause: Shared services (hub firewall, Bastion, Log Analytics, gateways) in the platform subscription aren’t allocated to teams. Confirm: Sum the per-CostCenter amortized totals from the Query API; it’s less than the account amortized total. Fix: Add a cost allocation rule that splits the shared RGs/subscription to teams by a basis (proportional to compute is usually fairest); re-check that the per-team sum now equals the invoice.

4. A reservation discount landed on a team that never bought it. Root cause: The commitment was purchased with Shared applied-scope, so its discount auto-applies to any matching resource across the billing account. Confirm: az reservations reservation list --reservation-order-id <orderId> shows appliedScopeType == Shared. Fix: az reservations reservation update --reservation-order-id <orderId> --reservation-id <reservationId> --applied-scope-type Single --applied-scopes /subscriptions/<id>; make Single the default for new commitments unless you deliberately want pooled utilization.

6. A budget “fired” but spend kept climbing. Root cause: Budgets don’t cap spend — by default they email. The alert wasn’t wired to an action group that runs automation, and it alerted on actual (too late) rather than forecast. Confirm: The budget shows notifications but no contactGroups/action group; threshold type is Actual. Fix: Attach an action group → Automation runbook/Function that deallocates non-prod; add a Forecasted threshold so you act days before the overage. Never auto-deallocate production.

8. Non-prod bill stays high even though VMs are “stopped.” Root cause: The VMs were stopped from inside the OS (or “Stop” that leaves them allocated) — compute is still billed. Only deallocated VMs stop compute charges. Confirm: az vm get-instance-view --ids <id> --query "instanceView.statuses[?starts_with(code,'PowerState')].code" shows PowerState/stopped rather than PowerState/deallocated. Fix: Use az vm deallocate (or the auto-stop runbook) — and remember disks and static IPs still bill even when deallocated.

10. Log Analytics / Application Insights bill exploded. Root cause: Verbose, duplicated, or unsampled ingestion, such as a chatty app, debug logging left on, or multiple agents shipping the same data. Confirm: Cost by service shows Monitor climbing and the workspace’s ingestion volume spikes; this is the same telemetry-volume story covered in Azure Monitor and Application Insights. Fix: Turn on adaptive sampling, set table-level retention (keep verbose tables short), drop noisy logs at the data collection rule, and consolidate duplicate agents, without dropping signal you need for incidents.

Best practices

Security notes

Cost & sizing

FinOps tooling is itself nearly free; the spend is in the workload, and the practice is what right-sizes it. What drives the (small) tooling cost and the (large) savings:

A rough monthly picture of the tooling cost for a large multi-subscription estate (the workload is separate and is what you’re optimizing):

| Tooling cost driver | What you pay for | Rough INR / month | What it enables | Watch-out |
| --- | --- | --- | --- | --- |
| Cost Management + budgets + anomaly | Native service | ₹0 | The entire analysis + alert loop | None (it’s free) |
| Query API | Native API | ₹0 | Scripted month-end packs, dashboards | Throttled at high call rates |
| Daily FOCUS export storage | ADLS Gen2 GB-month | ~₹100–1,000 | Lakehouse-scale analysis, history | Lifecycle-tier old months |
| Auto-stop automation | Function/Automation runs | ~₹50–500 | ~65% off non-prod compute | Schedule must respect work hours |
| Lakehouse compute (optional) | Spark/SQL for exports | Varies | Cost joined to KPIs / unit economics | Only if you outgrow the portal |
| Net effect | Tooling | ≈ ₹1k–2k | Savings ≈ 25–40% of the bill | Effort is the real cost, not money |
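The "~65% off non-prod compute" figure for auto-stop follows from simple arithmetic on running hours. Here is a minimal Python sketch, assuming a hypothetical schedule of 12 hours a day, 5 days a week; the function name auto_stop_saving and the schedule are illustrative, not taken from any Azure tool.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours in an always-on week


def auto_stop_saving(hours_per_day, days_per_week):
    """Fraction of always-on compute hours avoided by a work-hours schedule."""
    if not (0 <= hours_per_day <= 24 and 0 <= days_per_week <= 7):
        raise ValueError("schedule must fit inside a 24x7 week")
    on_hours = hours_per_day * days_per_week
    return 1 - on_hours / HOURS_PER_WEEK


# Assumed schedule: VMs allocated 12 h/day on weekdays, deallocated otherwise.
saving = auto_stop_saving(12, 5)
print(f"{saving:.1%}")  # 64.3%
```

The saving applies only to the compute meter; disks and static IPs keep billing while the VMs are deallocated, so the reduction on the whole non-prod bill will be somewhat lower.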

Interview & exam questions

1. What is the difference between AmortizedCost and ActualCost, and which do you use for showback? ActualCost records a charge on the day it hits the account, so an upfront Reservation shows its whole cost on the purchase day then ₹0 over the term. AmortizedCost spreads commitments evenly across their term, reflecting consumption. Use Amortized for showback, budgets and trends; use Actual only to reconcile to the cash invoice.

2. You see a large “unallocated” bucket in Cost Analysis. What’s the cause and the durable fix? Resources shipped untagged because there’s no tag-governance policy (or only at subscription scope). The durable fix is deny (require-tag) plus modify (inherit-tag) at the management-group root, then a remediation task to backfill existing resources — and accept that tags aren’t retroactive, so cost before tagging stays unallocated.

3. A reservation’s discount is landing on a team that never paid for it. Why, and how do you fix it? The reservation was bought with Shared applied-scope, which auto-applies its discount to any matching resource across the billing account. Re-scope it to Single (the subscription that owns the baseline) with az reservations reservation update --applied-scope-type Single, and default new commitments to Single for clean chargeback.

4. How do you make a budget actually control spend rather than just notify? A budget only emails by default. Attach an action group that triggers an Automation runbook/Function to take a safe action (deallocate non-prod), and alert on Forecasted spend so you act before the overage. Critically, never auto-deallocate production — that’s a notify-and-investigate event.

5. Reservations vs Savings Plans — when do you pick which? Reservations commit to a specific SKU family + region and give the deepest discount (up to ~72%) for a known, stable shape. Savings Plans commit to a fixed $/hour of compute with full region/SKU flexibility (up to ~65%) and are forgiving when your shapes change. Pick RIs for a fixed baseline you’re confident in; SPs when the workload mix evolves.

6. Why must you right-size before buying commitments? A commitment locks in a rate for whatever capacity you commit to; if you commit to oversized resources, you pay for that waste, at a discount, for the whole term. Right-size off Advisor first, then commit to the smaller, stable floor of usage, never the pre-optimization or peak number.

7. How do you allocate shared services (hub firewall, Log Analytics) so showback reconciles to the invoice? Use a cost allocation rule that splits the shared resource group/subscription to teams by a basis — proportional to compute spend is usually fairest. Without it, the sum of per-team showback is always less than the bill and teams reject the numbers. Validate by checking the per-team amortized total equals the account amortized total.

8. A dev VM was “stopped” but still costs money. Why? It was stopped from inside the OS (or otherwise left allocated) — Azure still bills compute for allocated VMs. Only deallocated VMs stop compute charges (az vm deallocate); even then, disks and static public IPs continue to bill.

9. What does the FOCUS schema give you over the legacy export formats? FOCUS (FinOps Open Cost and Usage Specification) is a vendor-neutral, standardized column set, so the same queries and dashboards work across clouds and survive a billing-format change. It future-proofs a lakehouse cost pipeline and eases multi-cloud unit-economics.

10. How do you catch a runaway cost (a leaked key spinning up VMs) before the invoice? Budgets catch known thresholds; anomaly alerts (Cost Management’s built-in ML) catch statistically unusual day-over-day spend and flag it within a day or two, long before the invoice. Wire both to an action group, and route anomaly alerts to security too: a spend spike is often the first visible sign of a compromise.

11. What’s the difference between showback and chargeback, and which do you start with? Showback shows each team its cost without moving money (low friction — start here). Chargeback actually bills the cost to the team’s budget (real accountability, more friction). Move to chargeback only after showback is trusted and you can attribute ~100% of the invoice, including shared cost.

12. Which Azure roles separate “viewing cost” from “spending money,” and why does it matter? Cost Management Reader views cost; Cost Management Contributor manages budgets/exports; purchasing Reservations/Savings Plans needs billing/owner-level rights. Separating them enforces least privilege — viewing a chart shouldn’t grant the ability to make a 3-year financial commitment.
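The ActualCost versus AmortizedCost distinction in question 1 is easiest to see with numbers. Below is a minimal Python sketch under stated assumptions: a hypothetical upfront commitment of ₹3,65,000 on a 365-day term, spread evenly per day. The function names are illustrative, and the even daily split is a simplification of how Cost Management amortizes real commitments.

```python
def actual_cost(upfront, term_days):
    """ActualCost view: the whole charge lands on the purchase day."""
    if term_days <= 0:
        raise ValueError("term_days must be positive")
    return [float(upfront)] + [0.0] * (term_days - 1)


def amortized_cost(upfront, term_days):
    """AmortizedCost view: the same charge spread evenly across the term."""
    if term_days <= 0:
        raise ValueError("term_days must be positive")
    return [upfront / term_days] * term_days


# Hypothetical figures chosen so the daily amount is a round number.
actual = actual_cost(365_000, 365)
amortized = amortized_cost(365_000, 365)

print(actual[0], actual[1])        # 365000.0 0.0
print(amortized[0], amortized[1])  # 1000.0 1000.0
print(sum(actual) == sum(amortized))  # True: same money, different timing
```

Both views total the same amount over the term; only the timing differs, which is why the amortized series suits showback and trends while the actual series reconciles to the cash invoice.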

These map to AZ-104 (Administrator): monitor and manage Azure resources, cost management, budgets, tags; and to AZ-305 (Solutions Architect): design a cost-optimized architecture, governance, and the resource-organization/allocation model. The commitment and billing depth touches the Microsoft FinOps guidance and the FinOps Framework certification. A compact mapping for revision:

| Question theme | Primary cert | Objective area |
| --- | --- | --- |
| Tags, policy governance, allocation | AZ-104 / AZ-305 | Governance; resource organization |
| Cost Management, budgets, alerts | AZ-104 | Monitor & manage resources |
| Amortized vs Actual, exports/FOCUS | FinOps Framework | Inform / data |
| Reservations, Savings Plans, AHB, Spot | AZ-305 / FinOps | Cost-optimized design; Optimize |
| Right-sizing, auto-stop, anomalies | AZ-305 / FinOps | Optimize / Operate |
| Showback vs chargeback, scope/roles | FinOps Framework | Operate; allocation |

Quick check

  1. A chart shows one team’s spend at 9× in March and near-zero in April. Which cost metric are you almost certainly reading, and what should you switch to?
  2. Your per-team showback sums to ₹1.2 crore but the invoice says ₹1.6 crore. What’s the most likely cause and the fix?
  3. True or false: an Azure budget will stop spending once it’s breached.
  4. A 3-year Reservation’s discount is applying to teams that didn’t buy it. What setting caused this and what do you change it to?
  5. You “stopped” all dev VMs from inside the OS but the non-prod bill barely moved. Why, and what’s the correct action?

Answers

  1. You’re reading ActualCost, which posts an upfront Reservation/Savings Plan charge entirely on its purchase day and ₹0 over the rest of the term. Switch every report, budget and export to AmortizedCost, which spreads the commitment across its term and reflects real consumption.
  2. Shared cost isn’t being allocated — the hub firewall, Log Analytics, Bastion and gateways in the platform subscription aren’t split back to teams, so the per-team sum is short of the invoice. Add a cost allocation rule to distribute the shared RGs/subscription to teams (proportional to compute is usually fairest), then re-check that the per-team amortized total equals the account amortized total.
  3. False. A budget only alerts (emails by default); it does not cap spend. It becomes a control only when its alert triggers an action group → automation that takes action, and you alert on forecast to act before the overage. Never auto-deallocate production.
  4. The reservation was bought with Shared applied-scope, which auto-applies the discount org-wide. Change it to Single scope (az reservations reservation update --applied-scope-type Single --applied-scopes /subscriptions/<id>) tied to the subscription that owns the baseline, and default new commitments to Single.
  5. VMs stopped from inside the OS stay allocated, and Azure still bills compute for allocated VMs. Use az vm deallocate (or an auto-stop runbook) to release the compute — though disks and static public IPs continue to bill even when deallocated.
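The reconciliation in answer 2 can be checked with a few lines of arithmetic. This is a minimal Python sketch, assuming the figures from the quick check (₹1.2 crore of direct team cost, ₹1.6 crore invoice, so ₹0.4 crore shared) expressed in lakh. The team names, the per-team split, and the function allocate_shared are hypothetical; a real cost allocation rule is configured in Cost Management rather than computed by hand.

```python
import math


def allocate_shared(team_costs, shared_cost):
    """Split shared cost across teams in proportion to their direct spend."""
    total_direct = sum(team_costs.values())
    if total_direct <= 0:
        raise ValueError("need positive direct spend to allocate against")
    return {
        team: cost + shared_cost * cost / total_direct
        for team, cost in team_costs.items()
    }


# Hypothetical direct (amortized) spend per team, in lakh INR: sums to 120.
direct = {"payments": 60.0, "search": 40.0, "data": 20.0}
invoice_total = 160.0
shared = invoice_total - sum(direct.values())  # 40 lakh of hub/shared cost

showback = allocate_shared(direct, shared)
print(round(showback["payments"], 2))  # 80.0

# Reconciliation check: per-team total must equal the invoice total.
print(math.isclose(sum(showback.values()), invoice_total))  # True
```

If the final check is False, some cost is still unallocated, and teams have a legitimate reason to reject the showback numbers.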

Glossary

Next steps

You can now stand up the full cost-control loop — attribute, amortize, allocate, optimize and act. Build outward:
