Active-Active Multi-Region on Azure: Building for RTO Near Zero

“Multi-region” gets written into architecture decision records far more often than it gets exercised under fire. Standing up a second region is a weekend; making both regions take live traffic, replicating state continuously, and failing over with no human in the loop is the actual engineering. This is the blueprint for a true active-active Azure system whose recovery time and recovery point both sit near zero — and it is blunt about where “near zero” stops being free.

Defining real RTO/RPO targets and the cost of each nine

Before any topology, fix two numbers and defend them with money, not adjectives.

RTO (Recovery Time Objective): how long the service may be unavailable. Done right, this is the time the edge takes to stop routing to a sick region: seconds to tens of seconds, depending on probe settings.
RPO (Recovery Point Objective): how much data you may lose, governed entirely by your replication model, not your routing. Synchronous replication gives RPO 0 but taxes write latency; asynchronous gives single-digit-second RPO but admits loss on a hard regional failure.

The trap is conflating the two. Front Door can give you a 10-second RTO while your asynchronously replicated database still loses the last few seconds of writes. You do not get RPO 0 from a load balancer — you buy it in the data tier and pay in latency or dollars.
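The gap between the two numbers is easy to put a figure on. The following is a back-of-the-envelope sketch, assuming a steady write rate and a constant asynchronous replication lag; the function name and both inputs are illustrative, not measured values.

```python
def writes_at_risk(writes_per_second: float, replication_lag_seconds: float) -> float:
    """Estimate acknowledged writes not yet replicated when the primary region is lost.

    Assumes a steady write rate and a constant asynchronous replication lag.
    """
    if writes_per_second < 0 or replication_lag_seconds < 0:
        raise ValueError("inputs must be non-negative")
    return writes_per_second * replication_lag_seconds
```

At 400 writes per second and 3 seconds of lag, roughly 1,200 acknowledged writes are exposed, however quickly the edge reroutes traffic.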

Availability target	Downtime / year	Realistic architecture
99.9%	~8.8 hours	Single region, zone-redundant
99.95%	~4.4 hours	Single region + warm DR region
99.99%	~52 minutes	Active-passive multi-region, automated failover
99.999%	~5 minutes	Active-active multi-region, multi-write data

Each additional nine cuts the downtime budget by a factor of ten and raises cost and complexity sharply. Active-active at five nines means full capacity in two regions, cross-region replication egress, and the muscle to fail over on demand. Decide which nine the business will actually fund before you design for it.
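The downtime column in the table follows directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime budget, in minutes per year, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction
```

Calling it with 99.99 gives about 52.6 minutes, and with 99.999 about 5.3 minutes, which is the entire yearly budget for detection, failover and any human decision combined.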

Reference topology: paired regions, global front door, and per-region stamps

The pattern that holds up is the deployment stamp: a complete, self-sufficient copy of the application in one region, with a thin global layer above.

                      Clients
                         |
              Azure Front Door (anycast, WAF)
                 /                      \
   ----- Stamp: West Europe -----   ----- Stamp: North Europe -----
   | App tier (AKS / App Svc)   |   | App tier (AKS / App Svc)    |
   | Regional cache (Redis)     |   | Regional cache (Redis)      |
   | Private endpoints          |   | Private endpoints           |
   ------------------------------   -------------------------------
                 \                      /
            Globally-replicated data layer
        (Cosmos DB multi-write  /  SQL failover group)

Two rules make this work.

A stamp is independently healthy. Every dependency a request touches — compute, cache, config, secrets, private DNS — exists in-region. A request served by West Europe must never make a synchronous call to North Europe; that fuses one regional outage into two.
Pick paired or near regions deliberately. Azure region pairs (e.g. West Europe / North Europe) get sequential platform updates and prioritized recovery; for latency-sensitive active-active you may instead pick two low-RTT regions on one continent. Both are valid — know which you optimized for.

The global layer is stateless and Microsoft-operated, so it is not your failure domain. The stamp is — and the whole design is about making one disposable.

Step 1 — Global ingress with Azure Front Door and health-probe-driven failover

Front Door is the right front for HTTP because it is anycast and decides per request at the edge: when an origin fails its probes, the edge POP stops routing to it with no DNS TTL to wait out, so RTO is bounded by probe settings, not client resolver caches. (For non-HTTP protocols, Traffic Manager does DNS-level steering, but its failover is bound by DNS TTL — keep it off the critical path.) For active-active, put both stamps in one origin group at equal priority and let latency routing send each client to its nearest healthy origin.

RG=rg-aa-prod
PROFILE=afd-aa-prod
ENDPOINT=app-aa

az afd profile create \
  --resource-group $RG \
  --profile-name $PROFILE \
  --sku Premium_AzureFrontDoor

az afd endpoint create \
  --resource-group $RG \
  --profile-name $PROFILE \
  --endpoint-name $ENDPOINT \
  --enabled-state Enabled

The probe cadence is your RTO dial. It lives on the origin group, not the origins.

az afd origin-group create \
  --resource-group $RG --profile-name $PROFILE \
  --origin-group-name og-app \
  --probe-request-type GET \
  --probe-protocol Https \
  --probe-path /health/deep \
  --probe-interval-in-seconds 30 \
  --sample-size 4 \
  --successful-samples-required 3 \
  --additional-latency-in-milliseconds 50

The math that matters: with a 30s interval and 3-of-4 samples required to flip state, worst-case detection is two to three probe cycles, roughly 60 to 90 seconds. Tightening the interval shortens RTO but multiplies probe load, because every edge POP probes independently. Tune interval and sample counts together against a real drill.
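The probe arithmetic can be sketched with a simplified model: an origin counts as healthy while at least the required number of the last N samples succeeded, and the window starts out fully healthy. Probe timeouts, per-POP skew and edge propagation are deliberately not modelled, and the function names are illustrative.

```python
from typing import Tuple


def failed_probes_to_flip(sample_size: int, successes_required: int) -> int:
    """Failed probes needed before fewer than `successes_required` of the last
    `sample_size` samples are successful, starting from an all-healthy window."""
    if not 0 < successes_required <= sample_size:
        raise ValueError("need 0 < successes_required <= sample_size")
    return sample_size - successes_required + 1


def detection_window_seconds(
    interval_s: int, sample_size: int, successes_required: int
) -> Tuple[int, int]:
    """Return (best_case, worst_case) seconds from failure onset to the flip.

    Best case: the origin fails just before a probe fires.
    Worst case: it fails just after one, so a full interval passes first.
    Probe timeouts and edge propagation are NOT included.
    """
    failures = failed_probes_to_flip(sample_size, successes_required)
    return (failures - 1) * interval_s, failures * interval_s
```

With the settings above (30s, 4 samples, 3 required) this model needs two failed probes, so the flip lands between 30 and 60 seconds after the failure begins. Timeouts and propagation come on top, which is why budgeting two to three probe cycles is a sensible planning figure.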

Register both stamps at equal priority and weight so latency routing governs. Repeat this for the second region with its own --origin-name and --host-name:

az afd origin create \
  --resource-group $RG --profile-name $PROFILE \
  --origin-group-name og-app --origin-name westeurope \
  --host-name app-we.example.internal \
  --origin-host-header app-we.example.internal \
  --http-port 80 --https-port 443 \
  --priority 1 --weight 1000 --enabled-state Enabled
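How health, priority and the 50 ms latency allowance interact can be sketched as a filter chain. This is a simplified model of the documented Front Door decision order (healthy origins, then best priority, then origins within the latency allowance of the fastest), not the service's implementation. The `Origin` class and the latency figures are illustrative, and the case where every origin is unhealthy is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Origin:
    name: str
    healthy: bool
    priority: int        # lower value = preferred
    latency_ms: float    # latency measured from one edge location
    weight: int


def candidate_origins(origins: List[Origin], additional_latency_ms: float) -> List[Origin]:
    """Return the origins a single edge location would spread traffic across."""
    available = [o for o in origins if o.healthy]
    if not available:
        return []  # behaviour with zero healthy origins is not modelled here
    best_priority = min(o.priority for o in available)
    tier = [o for o in available if o.priority == best_priority]
    fastest = min(o.latency_ms for o in tier)
    # Every origin within the allowance of the fastest one stays eligible;
    # traffic is then distributed among them according to weight.
    return [o for o in tier if o.latency_ms <= fastest + additional_latency_ms]
```

With both stamps healthy at 20 ms and 45 ms, both remain candidates under a 50 ms allowance; if one fails its probes, only the survivor is returned.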

The single most important decision here is the probe path. /health/deep must exercise the in-region dependencies that make a request succeed — database, cache, a critical downstream — and return non-200 when any is broken; a shallow probe quietly destroys RTO.
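A deep probe can be as small as the sketch below, which uses only the Python standard library. The dependency checks are hypothetical placeholders to be replaced with real in-region calls, and the names `evaluate_health`, `CHECKS` and `HealthHandler` are illustrative.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def evaluate_health(checks: Dict[str, Check]) -> Tuple[int, Dict[str, bool]]:
    """Run every dependency check; return 200 only if all of them pass."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises (timeout, refused connection) is a failure.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


# Hypothetical placeholders: swap in real in-region probes, for example a
# SELECT 1 against the regional database and a PING to the regional cache.
CHECKS: Dict[str, Check] = {
    "database": lambda: True,
    "cache": lambda: True,
}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health/deep":
            self.send_error(404)
            return
        status, results = evaluate_health(CHECKS)
        body = json.dumps(results).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To serve it locally: HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Keep each check cheap and bounded by a short timeout: every edge POP calls this endpoint on every probe interval, so a slow check turns into load on the very dependencies it is meant to watch.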

Disable session affinity for active-active unless you genuinely need sticky sessions — it pins a client to one origin and undercuts the point of two live regions. If the app needs session state, externalize it (Step 2) rather than pinning at the edge.

Step 2 — Stateless tier replication and config drift control across regions

Active-active only works if either stamp can serve any request — which requires a stateless app tier and two stamps that are byte-for-byte identical except for region-specific values.

Externalize all session state. No in-process sessions, no sticky local disk — push session and ephemeral state to the regional cache or global data tier so a client can land on either stamp between requests.

Deploy one artifact to both regions, from one region-parameterized module. Build once, then fan out the same immutable image digest (never a floating tag) to both regions in a single pipeline run. Terraform (or Bicep) with a per-region variable set keeps the rest from drifting: the module is identical, only the inputs differ.

module "stamp" {
  source   = "../modules/regional-stamp"
  for_each = toset(["westeurope", "northeurope"])

  location            = each.value
  resource_group_name = "rg-stamp-${each.value}"
  image_digest        = var.image_digest # same digest to every stamp
  app_config_endpoint = var.app_config_endpoint
}

Centralize configuration. Use Azure App Configuration so both stamps read the same flags from one source of truth, with regional overrides as labels. Feature-flag drift is a classic active-active bug: the same user gets different behavior depending on which region the edge picked.

Detect drift, don’t hope for its absence. Run terraform plan against both stamps on a schedule and alert on any non-empty diff. A drifted stamp behaves differently the moment it takes failover traffic — exactly when you can least afford a surprise.
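A scheduled drift check can be a thin wrapper around the plan's exit code. The sketch below assumes the Terraform CLI is on the PATH and that `workdir` (a hypothetical path) is the root module that instantiates both stamps; wiring the result into an alerting system is left out.

```python
import subprocess
from enum import Enum


class DriftStatus(Enum):
    IN_SYNC = "in_sync"
    DRIFTED = "drifted"
    ERROR = "error"


def classify_plan_exit_code(code: int) -> DriftStatus:
    """Map `terraform plan -detailed-exitcode` results to a drift status.

    0 = no changes, 2 = changes present (drift), anything else = plan failed.
    """
    if code == 0:
        return DriftStatus.IN_SYNC
    if code == 2:
        return DriftStatus.DRIFTED
    return DriftStatus.ERROR


def check_stamps(workdir: str) -> DriftStatus:
    """Run a read-only plan in `workdir` and classify the outcome."""
    result = subprocess.run(
        ["terraform", f"-chdir={workdir}", "plan",
         "-detailed-exitcode", "-input=false", "-no-color"],
        capture_output=True,
        text=True,
        check=False,  # exit code 2 is an expected outcome, not an exception
    )
    return classify_plan_exit_code(result.returncode)
```

Treat `ERROR` as an alert too: a plan that cannot run says nothing about whether the stamps still match.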

Step 3 — Data layer choices: zone-redundant vs geo-replicated vs multi-write

This is where RPO is won or lost — three tiers of resilience, in increasing cost and capability:

Model	Scope	RPO	Write topology	When to use
Zone-redundant	Within one region	0 (zone loss)	Single region	Baseline HA; survives a datacenter, not a region
Geo-replicated (async)	Cross-region, one writer	Seconds (lag)	Active-passive	Most apps; simple, cheap, accepts tiny data-loss window
Multi-write	Cross-region, all writers	Near 0 with conflict handling	Active-active	True active-active where both regions accept writes

Zone redundancy is the floor, not the ceiling. Configure every regional resource as zone-redundant first; it is cheap and removes the single-datacenter failure mode, but it still dies with its region.

Geo-replicated, single-writer (Azure SQL failover group) is the pragmatic default for active-active reads with one write region. The secondary takes read traffic and is promotable on failover; replication is asynchronous, so plan for an RPO measured in seconds.

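# Create a failover group spanning the primary and secondary SQL servers.
# --grace-period is in hours: with the Automatic policy, the service waits this
# long before it initiates a failover that may lose data.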
az sql failover-group create \
  --name app-fog \
  --resource-group $RG \
  --server sql-we-primary \
  --partner-server sql-ne-secondary \
  --failover-policy Automatic \
  --grace-period 1 \
  --add-db appdb

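Route read-only workloads to the geo-secondary by connecting to the failover group's read-only listener (<failover-group-name>.secondary.database.windows.net) with ApplicationIntent=ReadOnly; writes go through the read-write listener and always reach whichever server currently holds the primary role.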

Multi-write (Cosmos DB) is the only model that lets both regions accept writes with single-digit-millisecond latency and near-zero RPO. Enable multiple write regions and choose a conflict resolution policy deliberately, because the default has real semantics.

az cosmosdb create \
  --name cosmos-aa-prod \
  --resource-group $RG \
  --locations regionName=westeurope failoverPriority=0 isZoneRedundant=true \
  --locations regionName=northeurope failoverPriority=1 isZoneRedundant=true \
  --enable-multiple-write-locations true \
  --default-consistency-level Session

Consistency level is the RPO-vs-latency knob. Strong is unavailable across multiple write regions; Session (the default) gives read-your-writes while keeping cross-region writes fast. A weaker level widens the window in which two regions can disagree — the exact problem Step 4 solves.

Step 4 — Handling split-brain and write conflicts in active-active

The moment both regions accept writes, two users can update the same record within the replication window. The system will produce conflicts; your only choice is to decide their resolution or discover it in production. Cosmos DB multi-write gives two policies:

Last Writer Wins (LWW). The default. The item with the highest value on a chosen path wins — a system timestamp by default, or any numeric property you nominate (e.g. a monotonic version) — and all regions converge on the same winner. One sharp edge: in delete-vs-update conflicts, delete always wins. Correct when writes are idempotent or last-update-wins is genuinely the business rule.

{
  "conflictResolutionPolicy": {
    "mode": "LastWriterWins",
    "conflictResolutionPath": "/_ts"
  }
}
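The LWW rules described above can be simulated in a few lines to see why the delete edge case bites. This is a toy model of the stated semantics, not Cosmos DB code; the `Write` type is illustrative and the tie-break on equal timestamps is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Write:
    region: str
    ts: int                      # value at the conflict resolution path, e.g. /_ts
    body: Optional[dict] = None  # None models a delete


def lww_winner(a: Write, b: Write) -> Write:
    """Pick the surviving write under the LWW rules described in the text."""
    a_is_delete, b_is_delete = a.body is None, b.body is None
    # Delete-vs-update: the delete wins regardless of the timestamps.
    if a_is_delete != b_is_delete:
        return a if a_is_delete else b
    # Otherwise the highest value on the resolution path wins.
    # Assumption: ties go to the first argument.
    return a if a.ts >= b.ts else b
```

A delete stamped 100 beats an update stamped 120 in this model, whichever region it arrives from. That is exactly the case to think through before accepting the default.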

Custom (merge procedure). Where silently dropping the losing write is unacceptable — inventory counts, financial balances — register a merge stored procedure that reconciles conflicts under a server transaction. If it is absent or throws, conflicts land in the conflicts feed for your application to resolve out of band. An unread conflicts feed is unresolved data loss in waiting — monitor it.

{
  "conflictResolutionPolicy": {
    "mode": "Custom",
    "conflictResolutionProcedure": "dbs/appdb/colls/orders/sprocs/resolveConflict"
  }
}

Two design principles blunt the problem before resolution:

Partition to keep an entity’s writes regional. Route a customer or tenant predominantly to one region (sharded ownership) so conflicts become rare — active-active across the fleet, single-writer per entity in practice.
Prefer commutative operations. Model state changes as appends or counters that merge rather than destructive overwrites. An event-sourced ledger avoids write conflicts because nothing is overwritten, provided every event carries a globally unique ID.
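The second principle can be illustrated with an append-only event stream whose balance is derived by a fold. This is a minimal sketch assuming every event has a globally unique `id` and an integer `amount_cents`; both field names are illustrative.

```python
from typing import Dict, Iterable


def merge_event_streams(*streams: Iterable[dict]) -> Dict[str, dict]:
    """Union events from several regions, keyed by unique event id.

    Set union is commutative and idempotent, so replicas converge no matter
    in which order, or how many times, events arrive.
    """
    merged: Dict[str, dict] = {}
    for stream in streams:
        for event in stream:
            merged.setdefault(event["id"], event)
    return merged


def balance_cents(events: Dict[str, dict]) -> int:
    """Derive the balance by folding (summing) every event amount."""
    return sum(event["amount_cents"] for event in events.values())
```

Merging West Europe's events into North Europe's, or the other way round, or replaying a batch twice, yields the same balance, so there is nothing left for a conflict policy to decide.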

Split-brain is not only a database concern. A partition can leave both stamps believing they are primary for a coordination task (a scheduler, a leader-elected job). Use a single global coordination authority for anything that must run exactly once, and make regional jobs idempotent.

Step 5 — Automating failover and failback with runbooks and health gates

In active-active, the HTTP data-plane failover is automatic — Front Door drains a sick origin on its own. What still needs orchestration is the stateful failover (promoting a database) and the failback, both too consequential for reflex.

The decision that needs a runbook is whether to force a database failover that may incur data loss. Azure SQL failover groups distinguish planned (set-primary alone, succeeds only with zero data loss) from forced (--allow-data-loss, completes even if the primary is gone) — and that difference is your RPO. Wrap the forced path in a runbook with a health gate so it cannot fire on a transient blip:

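#!/usr/bin/env bash
set -euo pipefail
# Resource group of the failover group; abort early if the caller did not set it.
: "${RG:?RG must be set to the resource group of the failover group}"

# Confirm the primary region is actually down before any data-loss failover.
# Three checks, 10 seconds apart: a single healthy response aborts the runbook.
for attempt in 1 2 3; do
  # curl writes 000 itself when no response arrives; "|| true" only keeps
  # set -e from aborting on the failed connection.
  PRIMARY_HEALTH=$(curl -s -o /dev/null -w "%{http_code}" \
    --max-time 5 https://app-we.example.internal/health/deep || true)
  if [[ "$PRIMARY_HEALTH" == "200" ]]; then
    echo "Primary still healthy ($PRIMARY_HEALTH) - refusing forced failover."
    exit 1
  fi
  sleep 10
done

echo "Primary unhealthy ($PRIMARY_HEALTH) - promoting North Europe with data-loss accepted."
az sql failover-group set-primary \
  --name app-fog --resource-group "$RG" \
  --server sql-ne-secondary --allow-data-loss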

Failback is deliberate and planned — never automatic. Failing back the instant a probe flips green turns one outage into two: the region reports healthy at the edge while its data tier is still re-syncing. Make failback a planned failover (no --allow-data-loss) during a quiet window, only after replication lag is confirmed zero.
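The failback conditions can be encoded as an explicit gate so the runbook refuses to proceed until all of them hold. This is a sketch: the soak period, its 30-minute default and the parameter names are assumptions layered on top of the rules above, and the lag figure would come from your own replication monitoring.

```python
from datetime import datetime, timedelta


def failback_allowed(
    replication_lag_seconds: float,
    healthy_since: datetime,
    now: datetime,
    in_quiet_window: bool,
    soak: timedelta = timedelta(minutes=30),  # assumed soak period, tune per system
) -> bool:
    """Return True only when a planned failback is safe to start."""
    if replication_lag_seconds > 0:
        return False          # data tier is still catching up
    if not in_quiet_window:
        return False          # failback is planned work, not a reflex
    # Require the region to have stayed healthy for the whole soak period,
    # so a probe that just flipped green cannot trigger the move.
    return now - healthy_since >= soak
```

A region that turned green five minutes ago fails the gate even with zero lag, which is the behaviour that prevents a second outage.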

Codify both runbooks as Azure Automation runbooks or pipeline jobs, version them alongside the infrastructure, and require human approval on the data-loss path. The goal is not zero humans, but zero improvisation.

Step 6 — Chaos game days: proving the failover before the outage does

A failover path you have never executed is a hypothesis, not a capability. Start cheap and reversible: disable one origin at the edge and watch traffic continue from the survivor.

# Game-day step 1: take West Europe out of rotation at the edge.
az afd origin update \
  --resource-group $RG --profile-name $PROFILE \
  --origin-group-name og-app --origin-name westeurope \
  --enabled-state Disabled

# Drive load (e.g. a 60s curl loop) and confirm 200s keep flowing from North Europe,
# then restore.
az afd origin update \
  --resource-group $RG --profile-name $PROFILE \
  --origin-group-name og-app --origin-name westeurope \
  --enabled-state Enabled

Then escalate the blast radius, measuring time-to-recovery at each tier:

Single instance / pod — proves in-region redundancy.
One origin disabled — proves edge failover (above).
Forced database failover in a drill — proves the stateful runbook and measures real RPO by reconciling what was written just before the cut.
Full regional simulation — block the region’s inbound at the NSG or fault every probe, and let the whole mechanism react.

Run game days on a schedule (quarterly at minimum). A drill revealing the real RTO is 90 seconds against a 30-second target is a success — you found the gap in a controlled window, not during a real outage. For in-line fault injection, Azure Chaos Studio applies faults (VM shutdown, NSG block, AKS pod failure) as repeatable experiments.

Enterprise scenario

A payments platform we ran went active-active across West Europe and North Europe on Cosmos DB multi-write, Session consistency, LWW on /_ts. The edge failover drilled clean for months. Then a real West Europe degradation flipped both stamps live under full write load, and the ledger started disagreeing: a handful of wallet balances settled to the wrong value after convergence. The gotcha was LWW semantics colliding with our delete path. Reversal records were modeled as deletes; under LWW, delete always wins a delete-vs-update conflict, so a concurrent legitimate debit in the other region lost to a stale reversal even when the debit carried the newer timestamp. No conflict surfaced in the app, because LWW converges silently, and the conflicts feed was empty because LWW never populates it.

The fix was two-layered. First, we stopped overwriting balances at all: the wallet became an append-only event stream, balance derived by fold, so there is nothing to conflict on. Second, for the few collections that genuinely needed reconciliation, we moved off LWW to a custom merge sproc and alerted on conflict-feed depth like a dead-letter queue.

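function resolveConflict(incomingItem, existingItem, isTombstone, conflictingItems) {
  // Cosmos DB calls the resolver with four arguments; incomingItem is null
  // when the incoming operation is a delete.
  var collection = getContext().getCollection();
  var done = function (err) { if (err) throw err; };
  // Never let a delete silently win over a live monetary record.
  if (!incomingItem) return;                                    // keep existingItem
  // Keep the existing record unless the incoming update is strictly newer.
  if (existingItem && existingItem._ts >= incomingItem._ts) return;
  if (existingItem) {
    collection.replaceDocument(existingItem._self, incomingItem, done);
  } else {
    collection.createDocument(collection.getSelfLink(), incomingItem, done);
  }
  // Sketch only: isTombstone and conflictingItems (clashes on other unique
  // keys) still need handling that matches your business rules.
}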

The lesson: in active-active, the default conflict policy is a business decision, not a database setting — and delete-wins LWW is wrong for money.

Verify

Confirm the system behaves as designed.

Both stamps serve, and the deep probe is honest. The X-Azure-Ref header proves a request transited Front Door. Then break the in-region database in a non-prod stamp and confirm /health/deep returns non-200 and the edge drains that origin.

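# Resolve the endpoint's generated hostname (it carries a hash suffix), check edge
# transit, then hit the deep probe (expect non-200 when a dependency is down).
AFD_HOST=$(az afd endpoint show --resource-group $RG --profile-name $PROFILE \
  --endpoint-name $ENDPOINT --query hostName -o tsv)
curl -sSI "https://$AFD_HOST/" | grep -iE 'x-azure-ref|x-cache'
curl -s -o /dev/null -w "%{http_code}\n" https://app-we.example.internal/health/deep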

Data is replicating with the expected lag. Inspect the failover group’s replication state; for Cosmos, write in one region and read in the other to observe convergence.

az sql failover-group show \
  --name app-fog --resource-group $RG \
  --server sql-we-primary \
  --query "{role:replicationRole, state:replicationState}"

Stamps have not drifted. A scheduled terraform plan -detailed-exitcode against both should return exit 0; exit 2 signals drift to alert on, and exit 1 means the plan itself failed.

Failover meets the target. During a drill, capture the wall-clock from fault injection to the survivor’s first 200, and the writes lost across a forced database failover — both inside your stated RTO/RPO.
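Both drill measurements reduce to simple bookkeeping. The sketch below assumes a single synthetic client logging (timestamp, HTTP status) pairs through Front Door, plus a record of write IDs the application acknowledged before the cut; the function names and data shapes are illustrative.

```python
from typing import Iterable, List, Optional, Set, Tuple


def measured_rto_seconds(
    fault_at: float, samples: List[Tuple[float, int]]
) -> Optional[float]:
    """Seconds from fault injection to the first 200 that follows a failure.

    Returns 0.0 if the client never saw a failure, and None if there are no
    samples after the fault or the client never recovered. A single 200 is
    taken as recovery; flapping is not detected.
    """
    after_fault = sorted(s for s in samples if s[0] >= fault_at)
    if not after_fault:
        return None
    seen_failure = False
    for timestamp, status in after_fault:
        if status != 200:
            seen_failure = True
        elif seen_failure:
            return timestamp - fault_at
    return None if seen_failure else 0.0


def lost_writes(acknowledged_ids: Iterable[str], surviving_ids: Iterable[str]) -> Set[str]:
    """Write IDs the app acknowledged that are missing after a forced failover."""
    return set(acknowledged_ids) - set(surviving_ids)
```

Record both figures for every drill next to the stated targets; the trend across game days is more informative than any single run.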

Production checklist

Pitfalls

Buying RTO and assuming you got RPO. Front Door gives fast routing failover; it does nothing for data loss. An RPO at or near zero is purchased in the data tier, through synchronous replication or multi-write with conflict handling, and paid for in latency or money.
Shallow health probes. A /health that returns 200 while the database is unreachable keeps a dead stamp serving errors and quietly multiplies RTO. Probe the real dependencies; return non-200 the instant the stamp can’t serve a request.
Ignoring the conflicts feed. Custom conflict resolution that nobody monitors is silent data loss with extra steps. Alert on conflict-feed depth like a dead-letter queue.
Config drift between stamps. The same user getting different behavior depending on which region the edge picked is maddening to debug. Deploy one artifact to both regions and alert on any non-empty terraform plan.

Next steps

Wire origin-health-flip and replication-lag alerts into the on-call rotation, add a synthetic canary exercising the full path through Front Door every minute from multiple geographies, and put a recurring game day on the calendar. Once active-active reads are solid, decide honestly whether the write side needs multi-write at all — for many systems a geo-replicated single-writer with a tight, well-rehearsed forced-failover runbook hits the target at a fraction of the cost of multi-write conflict handling. Architect for the nine the business will actually pay for, prove it on a schedule, and let the measured numbers — not the ADR — be the source of truth.