Cost Optimization Without Wrecking Reliability: Navigating WAF Tradeoffs

The Well-Architected cost pillar is the one everyone quotes when they want to cut something, and the one nobody quotes when they want to defend it. The result is the same predictable failure mode: a cost sprint trims redundancy on instinct, an incident follows, and the organization swings back to over-provisioning everything “to be safe.” Both states are unmanaged. The job of a principal architect is not to minimize cost or maximize reliability; it is to make the tradeoff between them explicit, quantified, and reversible. This is the framework I use to do that, ending in a tradeoff decision record you can actually defend in a review.

Why pillar tradeoffs are unavoidable and how to make them explicit

The five Well-Architected pillars (Cost Optimization, Reliability, Performance Efficiency, Security, Operational Excellence) are not independent dials. They are coupled, and the strongest couplings are between cost and the other four, because redundancy, performance headroom, security controls, and operational tooling all show up on the bill.

Cloud vendors publish this as guidance. Azure’s WAF explicitly frames cost optimization as a balancing act against the other pillars, and AWS says the same. The mistake teams make is treating “optimize cost” as an unconditional good. It is a tradeoff, and a tradeoff has a counterparty.

The fix is procedural, not heroic. Every cost change that touches a reliability, performance, or security control must (1) name the pillar it is trading against, (2) quantify both sides in the same units where possible (dollars/month vs minutes of risk/recovery), and (3) be tied to a workload criticality tier so the decision is anchored to business value rather than the loudest engineer in the room. The rest of this article is the mechanism for each of those.

A tradeoff you did not write down is not a decision, it is an accident you have not had yet. The deliverable of this whole process is the TDR in Step 6. Steps 1-5 just generate the numbers that go in it.

Step 1 — Quantify the cost of redundancy per reliability tier

You cannot trade cost against reliability until you can price reliability. Start by pricing the marginal cost of each redundancy step for the components in your critical journey. (If you have not mapped that journey and classified dependencies as hard or soft yet, do that first; I covered it in the reliability pillar deep dive.)

Take a concrete pair: a database tier under increasing redundancy, and an app tier going single-region to active-active. For each step, record the delta cost and the failure mode it removes.

| Redundancy step | Removes failure mode | Cost delta (illustrative) |
| --- | --- | --- |
| Single instance -> zone-redundant | Single-zone outage | + storage/compute for ZR SKU |
| Zone-redundant -> geo-replicated (passive) | Regional outage (RPO minutes) | + read replica + egress |
| Passive geo -> active-active multi-region | Regional outage (near-zero RTO) | + full second stack + global routing + cross-region data |

The point is not the exact numbers, it is the shape: each step typically multiplies a portion of the bill while removing a less-frequent, higher-impact failure. Use the cloud pricing APIs to get real deltas instead of guessing. The Azure Retail Prices API needs no auth and is the fastest way to diff two SKUs:

# Compare locally-redundant vs zone-redundant managed disk prices in one region
# (-G with --data-urlencode keeps the spaces and quotes in the OData filter valid)
curl -sG "https://prices.azure.com/api/retail/prices" \
  --data-urlencode "\$filter=serviceName eq 'Storage' and armRegionName eq 'eastus' and (skuName eq 'P30 LRS' or skuName eq 'P30 ZRS')" \
  | jq -r '.Items[] | "\(.meterName) [\(.skuName)]: \(.retailPrice) \(.currencyCode)/\(.unitOfMeasure)"'

# AWS: list On-Demand pricing dimensions for an RDS instance class
aws pricing get-products \
  --region us-east-1 \
  --service-code AmazonRDS \
  --filters "Type=TERM_MATCH,Field=instanceType,Value=db.r6g.large" \
            "Type=TERM_MATCH,Field=deploymentOption,Value=Multi-AZ" \
  --query 'PriceList[0]' --output text | head -c 800

Now convert reliability into the same conversation by pairing each step with the downtime it buys back. An error budget for a 99.9% SLO is roughly 43 minutes per 30 days. If a regional-failover step removes a failure mode that historically costs you two 4-hour outages a year, that is 480 minutes of downtime a year, or about 40 minutes a month: nearly the whole monthly budget. Set that against the step's known monthly cost, with both sides on the same period. That ratio, dollars per minute of avoided downtime, is the unit that makes the tradeoff arguable instead of emotional.
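The arithmetic behind that ratio is small enough to keep next to the pricing queries. The sketch below is illustrative only: the function names are made up for this example, and the $2,000/month cost delta is a hypothetical input, not a real price.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def dollars_per_avoided_minute(monthly_cost_delta: float,
                               outages_per_year: float,
                               minutes_per_outage: float) -> float:
    """Yearly cost of a redundancy step divided by the yearly downtime it removes."""
    avoided_minutes_per_year = outages_per_year * minutes_per_outage
    if avoided_minutes_per_year <= 0:
        raise ValueError("the step must remove some expected downtime")
    # Put cost on the same period (one year) as the avoided downtime.
    return (monthly_cost_delta * 12) / avoided_minutes_per_year


# 99.9% SLO over a 30-day window.
budget = error_budget_minutes(0.999)

# Hypothetical step: $2,000/month, removing two 4-hour outages a year.
ratio = dollars_per_avoided_minute(2000.0, outages_per_year=2, minutes_per_outage=240)

print(f"error budget: {budget:.1f} min per 30 days")
print(f"cost per avoided minute: ${ratio:.2f}")
```

Compare the result with what a minute of downtime costs the business for that journey; if the business figure is lower, the redundancy step is hard to justify for that tier.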

Step 2 — Right-tiering: matching SKUs and redundancy to workload criticality

Most overspend is not waste in the FinOps sense (idle resources); it is mis-tiering, paying mission-critical prices for tier-3 workloads. Define a small, fixed criticality taxonomy and bind each tier to a default reliability and SKU posture. Three or four tiers is plenty.

| Tier | Example workload | Target | Redundancy default | Compute posture |
| --- | --- | --- | --- | --- |
| 0 - Mission critical | Checkout, auth | 99.95%+ | Multi-region active/active | Reserved + on-demand burst |
| 1 - Business critical | Core API, primary DB | 99.9% | Zone-redundant + passive geo | Reserved baseline |
| 2 - Standard | Internal tools, reporting | 99.5% | Zone-redundant | On-demand / savings plan |
| 3 - Best effort | Batch, dev jobs | none | Single instance | Spot / scale-to-zero |

The discipline is that a workload’s tier is a business decision, recorded as a tag, and the architecture follows from the tier rather than the other way around. Enforce it so drift is visible. In Azure, a policy can require the tag and (optionally) deny resources that exceed the tier’s allowed SKUs:

{
  "if": {
    "allOf": [
      { "field": "tags['criticality']", "exists": "false" }
    ]
  },
  "then": { "effect": "deny" }
}

# Surface zone-redundant databases that are untagged or tagged tier 2/3 (drift check)
az graph query -q "
Resources
| where type =~ 'microsoft.sql/servers/databases'
| where properties.zoneRedundant == true
| where isnull(tags['criticality']) or tostring(tags['criticality']) in ('2','3')
| project name, resourceGroup, tier=tostring(tags['criticality'])
" -o table

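That Resource Graph query is the single highest-leverage thing in this article. It finds untagged and tier-2/tier-3 databases paying for zone redundancy. For tier-3 and untagged workloads this is almost always pure savings, because you are removing protection the business never asked to pay for. Tier 2 defaults to zone-redundant in the table above, so treat those hits as review candidates and record any downgrade in a TDR (Step 6).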
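The same check can be expressed as a comparison between a resource's actual redundancy and its tier's default. This is a minimal sketch, not a tool: the `Resource` type, the redundancy labels, and the ranking are assumptions invented for illustration, with the tier defaults copied from the table in this step.

```python
from dataclasses import dataclass
from typing import Optional

# Redundancy levels ordered from least to most protective (illustrative labels).
REDUNDANCY_RANK = {"single": 0, "zone": 1, "zone+geo-passive": 2, "multi-region": 3}

# Default redundancy per criticality tier, mirroring the tier table.
TIER_DEFAULT = {"0": "multi-region", "1": "zone+geo-passive", "2": "zone", "3": "single"}


@dataclass
class Resource:
    name: str
    tier: Optional[str]   # value of the criticality tag, None if untagged
    redundancy: str       # one of the REDUNDANCY_RANK keys


def classify(resource: Resource) -> str:
    """Compare actual redundancy with the default for the resource's tier."""
    if resource.tier not in TIER_DEFAULT:
        return "untagged"
    actual = REDUNDANCY_RANK[resource.redundancy]
    expected = REDUNDANCY_RANK[TIER_DEFAULT[resource.tier]]
    if actual > expected:
        return "over-provisioned"   # candidate saving
    if actual < expected:
        return "under-protected"    # candidate reliability gap
    return "ok"


inventory = [
    Resource("reportdb", "3", "zone"),
    Resource("coreapi-db", "1", "zone"),
    Resource("scratchdb", None, "zone"),
]
for r in inventory:
    print(r.name, classify(r))
```

Reporting both directions matters: a drift check that only looks for overspend will miss the tier-1 database that quietly lost its geo replica.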

Step 3 — Spot, autoscale, and scale-to-zero without breaking SLOs

Spot/preemptible capacity is the largest single compute discount available, often 60-90% off on-demand. The tradeoff is eviction: the platform can reclaim the node with little notice. The rule that keeps this safe is simple: spot is a tier-2/tier-3 tool, or a burst-only tool for tier 0/1. Never put a stateful, hard-dependency, single-replica workload on spot.

On AKS, isolate spot into its own node pool with a taint so only tolerant workloads land there:

az aks nodepool add \
  --resource-group rg-prod \
  --cluster-name aks-prod \
  --name spotpool \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler --min-count 0 --max-count 20 \
  --node-taints "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"

# Only batch/stateless workloads tolerate the spot taint and prefer the pool
spec:
  tolerations:
    - key: "kubernetes.azure.com/scalesetpriority"
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: "kubernetes.azure.com/scalesetpriority"
                operator: In
                values: ["spot"]

The --min-count 0 is what enables scale-to-zero: when no tolerant pods are pending, the autoscaler drains the pool to nothing and you pay zero. Pair it with event-driven scaling (KEDA) so a queue depth or cron schedule wakes the workload. Scale-to-zero is free money for bursty, latency-tolerant work and a reliability hazard for anything serving live user traffic, because the cold start is added latency on the critical path. That is the tradeoff: you are trading p99 latency and a small availability risk for cost. Keep it off tier-0 request paths.

For the steady-state baseline that must always be on, do not use spot at all. Commit it. Reserved Instances and savings plans discount the predictable floor; let on-demand and spot absorb the variable top. The composite posture for a tier-1 service is: reserved baseline + on-demand autoscale for normal peaks + spot for batch side-work.
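To see what that composite posture is worth, price the three layers separately and compare against running everything on-demand. The sketch below uses hypothetical numbers throughout: the hourly rate, the 40% commitment discount, and the 70% spot discount are placeholders to be replaced with figures from your own pricing queries.

```python
def blended_monthly_cost(baseline_hours: float, peak_hours: float, batch_hours: float,
                         on_demand_rate: float, reserved_discount: float,
                         spot_discount: float) -> dict:
    """Price a reserved baseline, on-demand peaks, and spot batch work separately."""
    reserved = baseline_hours * on_demand_rate * (1 - reserved_discount)
    on_demand = peak_hours * on_demand_rate
    spot = batch_hours * on_demand_rate * (1 - spot_discount)
    return {"reserved": reserved, "on_demand": on_demand, "spot": spot,
            "total": reserved + on_demand + spot}


# Hypothetical workload: 4 always-on instances (730 h/month each),
# 500 instance-hours of autoscaled peak, 1,000 instance-hours of batch.
baseline_hours = 4 * 730
cost = blended_monthly_cost(baseline_hours, peak_hours=500, batch_hours=1000,
                            on_demand_rate=0.20, reserved_discount=0.40,
                            spot_discount=0.70)
all_on_demand = (baseline_hours + 500 + 1000) * 0.20

print(f"blended:       ${cost['total']:.2f}/month")
print(f"all on-demand: ${all_on_demand:.2f}/month")
```

The reliability side of this tradeoff is not in the numbers: the reserved layer is a commitment you pay for even if the workload shrinks, and the spot layer can vanish, which is why only interruptible work belongs on it.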

Step 4 — Data tiering and lifecycle policies vs recovery requirements

Storage is where cost optimization quietly trades against recovery, and the bill compounds because data only grows. The tradeoff has two distinct axes that teams conflate:

  1. Access tiering (hot -> cool -> cold -> archive) trades retrieval latency and cost for storage cost. Archive retrieval can take hours.
  2. Retention/lifecycle (delete after N days) trades storage cost directly against your recovery point and compliance obligations.

The hard constraint that bounds both: a lifecycle rule must never delete or archive data faster than your RPO/RTO and legal hold allow. Archiving backups you might need for a fast restore turns a 15-minute RTO into a multi-hour one. I have seen a “cost win” that moved DR backups to archive tier and silently broke the recovery runbook nobody re-tested.

Encode lifecycle as policy so it is reviewable, not a console click:

{
  "rules": [
    {
      "name": "logs-tier-and-expire",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": { "blobTypes": ["blockBlob"], "prefixMatch": ["logs/"] },
        "actions": {
          "baseBlob": {
            "tierToCool":    { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 90 },
            "delete":        { "daysAfterModificationGreaterThan": 365 }
          }
        }
      }
    }
  ]
}

az storage account management-policy create \
  --account-name stproddata \
  --resource-group rg-prod \
  --policy @lifecycle.json

The 365-day delete above is appropriate for logs. It is wrong for backups governed by RPO or for anything under legal hold, so those get separate rules (or immutable, versioned containers with no delete action). The reviewable artifact forces the question “what is the recovery requirement for this prefix?” before the cost optimization ships, which is exactly the explicitness this whole framework is after.
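One way to make that question hard to skip is a pre-merge check that reads the lifecycle JSON and compares it with declared recovery requirements. The following is a rough sketch under stated assumptions: the `requirements` structure (`min_retention_days`, `allow_archive`) is invented for this example, and only `daysAfterModificationGreaterThan` conditions on `baseBlob` actions are inspected.

```python
def lifecycle_violations(policy: dict, requirements: dict) -> list:
    """Flag lifecycle rules that archive or delete data sooner than a prefix allows."""
    violations = []
    for rule in policy.get("rules", []):
        definition = rule.get("definition", {})
        # A rule without prefixMatch applies to the whole account.
        prefixes = definition.get("filters", {}).get("prefixMatch") or [""]
        actions = definition.get("actions", {}).get("baseBlob", {})
        for prefix in prefixes:
            for protected, req in requirements.items():
                # The rule touches protected data if either prefix contains the other.
                if not (prefix.startswith(protected) or protected.startswith(prefix)):
                    continue
                if "tierToArchive" in actions and not req["allow_archive"]:
                    violations.append(f"{rule['name']}: archives '{protected}'")
                if "delete" in actions:
                    days = actions["delete"].get("daysAfterModificationGreaterThan", 0)
                    if days < req["min_retention_days"]:
                        violations.append(
                            f"{rule['name']}: deletes '{protected}' after {days} days")
    return violations


# Hypothetical requirements: logs may be archived; DR backups may not.
requirements = {
    "logs/":    {"min_retention_days": 365, "allow_archive": True},
    "backups/": {"min_retention_days": 35,  "allow_archive": False},
}
bad_policy = {"rules": [{
    "name": "archive-backups",
    "definition": {
        "filters": {"prefixMatch": ["backups/dr/"]},
        "actions": {"baseBlob": {
            "tierToArchive": {"daysAfterModificationGreaterThan": 7}}},
    },
}]}
print(lifecycle_violations(bad_policy, requirements))
```

Run against the log rule shown earlier with these requirements, the check passes; run against a rule that archives DR backups, it reports the rule by name before the change ships.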

Step 5 — Environment-aware architecture: prod vs non-prod cost posture

The fastest, safest cost wins live in non-prod, because non-prod has near-zero reliability SLO and yet frequently inherits prod-grade redundancy by copy-paste IaC. Make environment a first-class input to the architecture so the same module produces a lean dev stack and a hardened prod stack.

locals {
  # Posture is derived from environment, not hand-set per resource.
  # SKU names are examples; zone redundancy is only offered on some SKUs, so verify before copying.
  redundancy = {
    prod    = { sku = "BC_Gen5_2", zone_redundant = true,  geo_backup = true,  min_replicas = 3 }
    staging = { sku = "GP_Gen5_2", zone_redundant = true,  geo_backup = false, min_replicas = 2 }
    dev     = { sku = "Basic",     zone_redundant = false, geo_backup = false, min_replicas = 1 }
  }
  cfg = local.redundancy[var.environment]
}

resource "azurerm_mssql_database" "app" {
  name                 = "appdb"
  server_id            = azurerm_mssql_server.this.id
  sku_name             = local.cfg.sku
  zone_redundant       = local.cfg.zone_redundant
  storage_account_type = local.cfg.geo_backup ? "Geo" : "Local"
}

The second non-prod lever is time: dev and test environments do not need to run nights and weekends. A scheduled deallocation of VMs, dev databases, and non-prod AKS pools recovers roughly two-thirds of the clock. The tradeoff (engineers occasionally wait for an environment to spin up) is trivial against the savings, provided you exclude anything someone is actively load-testing.

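# Tag-driven nightly shutdown of non-prod compute (run from an Automation runbook / scheduled task)
ids=$(az vm list --query "[?tags.env=='dev' || tags.env=='staging'].id" -o tsv)
# Skip the call when nothing matches; --ids with an empty list is an error
[ -n "$ids" ] && az vm deallocate --no-wait --ids $ids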
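The "two-thirds of the clock" figure for scheduled shutdown is easy to verify for your own working hours. A small sketch, assuming a 12-hours-a-day, 5-days-a-week schedule (the schedule is an example, not a recommendation):

```python
HOURS_PER_WEEK = 7 * 24  # 168


def off_fraction(hours_on_per_day: float, days_on_per_week: int) -> float:
    """Fraction of the week a scheduled environment is deallocated."""
    hours_on = hours_on_per_day * days_on_per_week
    if not 0 <= hours_on <= HOURS_PER_WEEK:
        raise ValueError("schedule does not fit in a week")
    return 1 - hours_on / HOURS_PER_WEEK


# Environments up 07:00-19:00, Monday to Friday.
fraction = off_fraction(hours_on_per_day=12, days_on_per_week=5)
print(f"compute hours recovered: {fraction:.0%}")
```

The saving applies to compute billed by the hour. Disks and other storage attached to a deallocated VM keep billing, so the drop on the invoice is smaller than the drop in running hours.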

Be deliberate about what non-prod is for. A staging environment that exists to validate failover behavior must keep zone redundancy, or you are testing a different system than you ship. “Non-prod is cheap” is a default, not a law; the criticality tier still wins where it matters.

Step 6 — A tradeoff decision record (TDR) template and review cadence

Everything above produces numbers. The TDR is where they become a decision with an owner, a counterparty pillar, and an expiry. Treat it like an ADR (architecture decision record): one Markdown file per material tradeoff, committed next to the code, immutable once accepted, superseded rather than edited. The expiry date and trigger are the parts most teams omit and the parts that prevent yesterday’s good call from becoming tomorrow’s incident.

# TDR-014: Move tier-2 reporting DB from zone-redundant to single-zone

- Status: Accepted
- Date: 2026-06-04
- Owner: platform-team
- Workload tier: 2 (Standard)

## Tradeoff
- Pillar gained: Cost Optimization
- Pillar traded: Reliability

## Decision
Drop zone redundancy on the reporting database (read-only, regenerable
from the OLTP store within ~30 min).

## Quantification
- Cost saved: ~$X/month (ZR SKU premium, from Retail Prices API)
- Reliability cost: exposes single-zone outage. Blast radius = reporting
  only; no impact to checkout/auth (tier 0). Recovery = redeploy + reload.
- SLO impact: reporting has no committed SLO. Error budget unaffected.

## Alternatives considered
- Keep ZR (rejected: paying tier-1 price for tier-2 data)
- Geo-passive (rejected: over-provisioned for regenerable data)

## Review / expiry
- Revisit: 2026-12-04, OR immediately if workload is re-tiered to 1,
  or if reporting becomes a hard dependency of a tier-0 journey.

The review cadence is two-track. Per-change: no cost optimization that touches a reliability, performance, or security control merges without a TDR linked in the PR. Periodic: a monthly or quarterly FinOps review walks open TDRs, checks expiries, and re-validates the numbers against actual spend and actual incidents. The cloud cost-management exports feed this directly:

# Pull last month's actual cost grouped by the criticality tag to validate TDR assumptions
az costmanagement query \
  --type ActualCost --timeframe TheLastMonth \
  --scope "/subscriptions/$SUB_ID" \
  --dataset-grouping type=TagKey name=criticality \
  --dataset-aggregation '{"totalCost":{"name":"PreTaxCost","function":"Sum"}}'
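The periodic review is easier to run if expired TDRs surface on their own. Below is a minimal sketch of such a check, assuming TDR files follow the template above with a `Pillar traded:` line and a `Revisit: YYYY-MM-DD` line; the function name and the wording of the messages are illustrative.

```python
import re
from datetime import date

# Matches the first ISO date after "Revisit:" in the template's expiry section.
REVISIT = re.compile(r"Revisit:\s*(\d{4}-\d{2}-\d{2})")


def tdr_problems(markdown: str, today: date) -> list:
    """Return reasons a TDR fails review: missing fields or a passed revisit date."""
    problems = []
    if "Pillar traded:" not in markdown:
        problems.append("missing traded pillar")
    match = REVISIT.search(markdown)
    if match is None:
        problems.append("missing revisit date")
    elif date.fromisoformat(match.group(1)) <= today:
        problems.append("revisit date has passed")
    return problems


sample = """# TDR-014: Move tier-2 reporting DB from zone-redundant to single-zone
- Pillar gained: Cost Optimization
- Pillar traded: Reliability
## Review / expiry
- Revisit: 2026-12-04, OR immediately if workload is re-tiered to 1
"""
print(tdr_problems(sample, today=date(2026, 6, 4)))   # no problems yet
print(tdr_problems(sample, today=date(2027, 1, 15)))  # flagged for review
```

A check like this only covers the date-based expiry. The event-based triggers (re-tiering, a new hard dependency) still need a human to notice them, which is what the periodic walk-through is for.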

Enterprise scenario

A payments platform ran a cost sprint and flagged NAT Gateway egress as a top-five line item. The “win” proposed was collapsing three zonal NAT Gateways (one per AZ) down to a single shared one to drop two $0.045/hr gateways plus their data-processing charges. It shipped to the tier-1 VPC. Six weeks later an AZ partial outage took the NAT-hosting zone offline, and every outbound call to Stripe and the KMS endpoint from the surviving two AZs failed, because a NAT Gateway is zonal and the cross-zone route had no fallback. The blast radius was the entire checkout path, not the modest bill it was meant to trim.

The gotcha: NAT Gateway cost is mostly data processing, not the hourly rate, so consolidating gateways barely moved spend while quietly turning an AZ-redundant design into a single-zone dependency. The real fix was two-fold. First, the durable win was killing NAT data-processing charges for AWS-API traffic entirely via VPC gateway/interface endpoints, which keep S3, DynamoDB, and KMS traffic off the NAT path. Second, restore per-AZ NAT so no cross-zone dependency exists:

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.prod.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"            # no NAT, no hourly, no per-GB
  route_table_ids   = aws_route_table.private[*].id
}

resource "aws_nat_gateway" "per_az" {
  for_each      = toset(["a", "b", "c"])   # one NAT per AZ, no shared SPOF
  allocation_id = aws_eip.nat[each.key].id
  subnet_id     = aws_subnet.public[each.key].id
}

The TDR recorded it correctly: the gateway endpoint saved real dollars with zero reliability cost, while NAT consolidation was rejected because it saved almost nothing and added a single-zone failure mode to a tier-0 path.
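The claim that consolidation barely moved spend can be checked with a few lines of arithmetic. This sketch uses the $0.045/hr gateway rate from the scenario; the $0.045 per-GB processing rate, the 20,000 GB/month of traffic, and the 60% share going to S3 and DynamoDB are assumptions chosen for illustration. Cross-AZ data transfer charges are not modeled.

```python
HOURS_PER_MONTH = 730


def nat_monthly_cost(gateways: int, gb_processed: float,
                     hourly_rate: float = 0.045, per_gb_rate: float = 0.045) -> float:
    """Hourly charge for each gateway plus data processing for all traffic."""
    return gateways * hourly_rate * HOURS_PER_MONTH + gb_processed * per_gb_rate


traffic_gb = 20_000  # hypothetical monthly NAT traffic

three_gateways = nat_monthly_cost(3, traffic_gb)
one_gateway = nat_monthly_cost(1, traffic_gb)          # the rejected consolidation
# Hypothetical: 60% of traffic is S3/DynamoDB and moves to gateway endpoints.
with_endpoints = nat_monthly_cost(3, traffic_gb * 0.4)

print(f"consolidation saves: ${three_gateways - one_gateway:.2f}/month")
print(f"endpoints save:      ${three_gateways - with_endpoints:.2f}/month")
```

Under these assumptions the endpoint change saves several times what consolidation does, and it does so without touching the per-AZ layout.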

Verify

A tradeoff framework is only real if it shows up in the running system. Verify:

# 1. Every production-touching resource carries a criticality tier (no untagged drift)
az graph query -q "
Resources
| where subscriptionId == '$SUB_ID'
| where isnull(tags['criticality'])
| summarize untagged=count() by type
| order by untagged desc
" -o table

# 2. Spot is isolated and scales to zero (min-count 0, correct taint)
az aks nodepool show -g rg-prod --cluster-name aks-prod -n spotpool \
  --query "{priority:scaleSetPriority, min:minCount, taints:nodeTaints}" -o json

# 3. Lifecycle policy exists on data accounts and matches retention requirements
az storage account management-policy show \
  --account-name stproddata -g rg-prod \
  --query "policy.rules[].{name:name, delete:definition.actions.baseBlob.delete}" -o json

# 4. Each accepted cost-vs-reliability TDR names a traded pillar and has an expiry
grep -L "Pillar traded" docs/tdr/*.md   # should print nothing
grep -L "Revisit" docs/tdr/*.md          # should print nothing

If the untagged count is non-zero, your right-tiering (Step 2) has gaps. If a TDR lacks a traded pillar or an expiry, it is documentation, not a decision.

Tradeoff checklist

Pitfalls

The throughline is the same one that runs through every Well-Architected pillar: replace instinct with numbers, write the decision down, and put an expiry on it. Cost optimization done this way is not the enemy of reliability. It is the discipline that tells you exactly how much reliability you are buying, and lets you prove it was worth the price.
