Azure Multi-Region Active-Active Architecture: Designing for Zero-Downtime

Quick take — An active-active design runs your application in two (or more) Azure regions at the same time, with a global front door splitting traffic across both and data replicated between them. When a region fails, users barely notice. The price you pay is real: distributed-data consistency, doubled run-rate, and the operational discipline to keep two live stacks identical. This article shows the architecture, the failover sequence, the data-tier choice that makes or breaks it, and exactly when the complexity is worth it — laid out as scannable tables you can keep open during a game-day or an outage.

At 03:14 a regional networking incident takes Central India’s load balancers offline. A payments platform that runs entirely in that one region goes dark. The on-call engineer has a runbook to “fail over to the DR region” — but DR is a cold copy: VMs are off, the last database restore point is twenty minutes old, and DNS still points at the dead region. By the time DNS TTLs expire and the standby database is promoted, 47 minutes have passed and a handful of in-flight transactions are lost. The post-incident review asks one question: why did a single region’s bad night become our customers’ bad night?

Multi-region active-active is the architecture that answers that question. Instead of a primary that fails over to a standby, you run both regions hot, take real traffic in each, and treat a region loss as the removal of capacity rather than a disaster you scramble to recover from. A single Azure region is a remarkably reliable unit, but it is still a shared fault domain — a control-plane bug, a fibre cut, a bad config push, or a capacity shortfall can degrade an entire region at once, and Availability Zones (which protect against a single datacentre failure within a region) do not help when the whole region is impaired.

By the end of this article you will stop treating “the region” as a single point of failure. You will know how Azure Front Door health-probes and steers traffic globally, how the data tier — not the stateless web tier — is the real decision, how to pick between Azure SQL auto-failover groups, Cosmos DB multi-region writes, and event-driven replication, how to budget RPO/RTO honestly, what each choice costs in rupees and consistency, and how to run a region-kill game-day that proves the design instead of merely hoping. Every decision comes with a table that enumerates the options end-to-end, plus the az/Bicep to implement it and the KQL to watch it.

What problem this solves

A single-region workload couples your availability to the worst day of one Azure region. Most of the time that is excellent — a well-architected single region with zone redundancy clears three to four nines. But the tail risk is brutal: when a whole region degrades, everything you run there degrades together, and the blast radius is your entire customer base. Active-passive disaster recovery softens this but does not remove it — the standby is cold or warm, the failover is manual or semi-automatic, and the recovery is measured in tens of minutes during which you are losing money and trust.

What breaks without active-active: the failover gap (promote the standby, repoint DNS, warm the caches — minutes you do not have), the cold-standby surprise (the standby that has never taken production load is the one that fails when you finally need it), and the all-or-nothing capacity cliff (a region loss takes you to zero, not to half). Teams discover all three at 3 a.m., in that order, with an audience.

Who hits this: customer-facing, revenue- or safety-critical workloads where an hour of downtime costs more than a second region costs per month — payments, ordering, authentication, real-time APIs, anything with a contractual SLA above ~99.9%. It is not for an internal tool that can tolerate a 30-minute recovery; that workload wants active-passive with a warm standby, which is far cheaper and simpler. The art is knowing which workload you have, and engineering the data tier honestly once you commit.
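To make the "~99.9%" threshold and the cost comparison above concrete, here is a minimal arithmetic sketch. The function names and the sample figures are illustrative assumptions, not values from any Azure SLA document.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignores leap years)


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year that an availability target allows."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def second_region_pays_off(downtime_cost_per_hour: float,
                           hours_avoided_per_year: float,
                           second_region_cost_per_month: float) -> bool:
    """True when the downtime you expect to avoid costs more than the extra region."""
    return downtime_cost_per_hour * hours_avoided_per_year > second_region_cost_per_month * 12


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.99):
        print(target, round(downtime_budget_minutes(target), 1), "min/year")
    # Hypothetical numbers: 2 hours of outage avoided per year, in any currency unit.
    print(second_region_pays_off(5_000_000, 2, 400_000))
```

A 99.9% target leaves roughly 526 minutes a year; the 47-minute incident in the opening story would spend most of a 99.99% budget in one night.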

To frame the whole field before the deep dive, here is what active-active removes from your single-point-of-failure list, and what it adds to your problem list in exchange:

| Single-region risk it removes | How active-active removes it | New problem it hands you in exchange |
| --- | --- | --- |
| Region is a single fault domain | Both regions serve live traffic concurrently | Data must be replicated and reconciled across regions |
| Minutes-long failover gap | Front Door evicts the bad origin in seconds, no DNS change | Health probes must be meaningful or you route to a dead region |
| Cold standby that has never run | Both stacks are continuously exercised by real users | Config/schema/secret drift between regions surfaces as failover bugs |
| Capacity drops to zero on outage | Capacity drops to ~half; survivors autoscale | You must provision survivors to absorb 100% load, not 50% |
| Untested recovery path | Region-drain becomes a routine deploy/maintenance move | You must run real region-kill game-days, not tabletop ones |
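The "absorb 100% load, not 50%" row is simple arithmetic, sketched below. The peak figure and the per-instance throughput are hypothetical; measure your own under load.

```python
import math


def per_region_capacity(peak_rps: float, regions: int, tolerated_losses: int = 1) -> float:
    """Capacity each region needs so the surviving regions still carry the full peak."""
    survivors = regions - tolerated_losses
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return peak_rps / survivors


def instances_needed(peak_rps: float, regions: int, rps_per_instance: float,
                     tolerated_losses: int = 1) -> int:
    """Instances per region, rounded up, for the capacity computed above."""
    return math.ceil(per_region_capacity(peak_rps, regions, tolerated_losses) / rps_per_instance)


if __name__ == "__main__":
    peak = 1200.0  # hypothetical global peak, requests per second
    print(per_region_capacity(peak, regions=2))   # each of two regions must carry the whole peak
    print(per_region_capacity(peak, regions=3))   # with three regions, each carries half
    print(instances_needed(peak, regions=2, rps_per_instance=150))
```

With two regions, each must be able to reach full-peak capacity (pre-provisioned or via autoscale headroom that has actually been tested); a third region cuts the per-region requirement to half the peak.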

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should be comfortable with single-region Azure architecture: a resource group, a VNet with subnets and NSGs, an ingress (App Gateway or Front Door), a stateless compute tier (App Service / AKS / Functions), and a managed data store (Azure SQL or Cosmos DB). You should know what an Availability Zone is and why it is not a multi-region story — if that distinction is fuzzy, read Azure Regions & Availability Zones Explained first, because it is the floor this article builds on. You should be able to run az in Cloud Shell, read JSON output, and deploy a Bicep module.

This sits at the top of the Resiliency & Business Continuity track. It assumes the regional fundamentals and extends them across regions. It is the active-active sibling of Azure Backup & Site Recovery: Protection Strategies (which covers the active-passive / DR end of the spectrum), it leans on the global routing concepts in Azure Load Balancer vs Application Gateway, and its data tier connects to anything you have learned about Azure SQL and Cosmos DB. When a region does fail, the per-service diagnosis lives in companions like Troubleshooting Azure SQL: Connectivity, Timeouts, Throttling & Blocking and Troubleshooting App Service: 502/503, Cold Starts & Restart Loops.

A quick map of who owns which layer of an active-active stack, so during an incident you page the right person fast:

| Layer | What lives here | Who usually owns it | What it can break during a region loss |
| --- | --- | --- | --- |
| Global routing (Front Door) | TLS, WAF, health probes, origin steering | Network / platform team | Routes to a dead origin (bad probe), or fails to evict it |
| DNS / naming | Apex/CNAME to the front door, TTLs | Network team | Stale TTLs delay any DNS-based failover (avoid DNS in the path) |
| Regional ingress | App Gateway / regional LB, private DNS | Network + app team | One region’s ingress down; global tier must steer away |
| Compute (stateless) | App Service / AKS / Functions per region | App / dev team | Survivor under-provisioned for 100% load → 503 under failover |
| Data tier (stateful) | SQL failover group / Cosmos multi-write | Data + app team | Split-brain, write loss, lag spike (the real risk) |
| Messaging | Service Bus / Event Hubs geo-DR | Integration team | Duplicate or lost events if not idempotent |
| Config / secrets / certs | Key Vault, App Config per region | Platform team | Drift makes the survivor behave differently (subtle bugs) |

Core concepts

Six mental models make every later decision obvious. Define each one precisely now; the deep sections then go option-by-option.

Active-active means every region serves live traffic concurrently. None is idle, none is a standby waiting to be promoted. A region loss is the removal of capacity, absorbed by the survivors and (ideally) by autoscale — not a disaster you recover from. Contrast this with active-passive (one region serves, the other waits) and pilot-light (the other region exists only as data + minimal scaffolding, scaled up on demand).

The global routing tier is the thing that makes “active-active” real. It is a layer-7 (or layer-4) entry point that health-probes each regional origin and steers each user to a healthy, nearby one. Azure Front Door is the default (anycast edge, TLS termination, WAF, fast origin failover); Traffic Manager is DNS-based (slower, TTL-bound); cross-region Load Balancer is the layer-4 option. The probe is the brain: it decides which origins are eligible, and a lying probe (one that returns 200 from a region that cannot actually serve) is the single most dangerous bug in the design.

RPO and RTO are budgets, and the data tier sets them. RPO (Recovery Point Objective) is how much data you can afford to lose — driven by replication mode (synchronous = ~0, asynchronous = your replication lag). RTO (Recovery Time Objective) is how fast you recover — driven by probe interval, failover automation, and (for the writer) promotion time. Stateless tiers give you RTO in seconds for free; the data tier is where both numbers are earned.

The stateless tiers are trivially duplicated; state is the entire problem. Ingress and compute are stateless and identical in both regions — you stamp them from one IaC module with the region as a parameter. State does not duplicate for free: you choose single global writer (SQL failover groups — one writable primary, async geo-replicas), multi-region writes (Cosmos DB — every region writable, with conflict resolution and tunable consistency), or event-driven (idempotent operations, replicated events, eventual consistency). This choice, not the web tier, determines your RPO, your consistency, and most of your cost.

Paired regions are Azure’s curated couples. Azure pairs most regions (e.g. Central India ↔ South India, East US ↔ West US) with two properties that matter: sequential platform updates (Azure won’t patch both halves of a pair at once) and geo-replication affinity (some services default their geo-replica to the pair). You are not required to use the pair, but pinning to it buys update isolation and is the conventional choice. Note the asterisk: a few regions (notably Brazil South, and some newer regions) have non-reciprocal or no pairs — verify before you assume.

Consistency is a dial, not a switch. Cosmos DB exposes five consistency levels (Strong → Bounded Staleness → Session → Consistent Prefix → Eventual). Azure SQL geo-replicas are read-only and asynchronous (so cross-region reads can lag the primary). Choosing weaker consistency buys availability and latency; choosing stronger buys correctness at the cost of cross-region round-trips (and, for Strong in Cosmos, giving up multi-region writes, because Strong requires a single write region). Active-active forces you to pick a level here rather than ignore it.

Pin the vocabulary side by side before the deep dive:

| Term | One-line definition | Where it lives | Why it matters to active-active |
| --- | --- | --- | --- |
| Active-active | Both regions serve live traffic at once | Whole architecture | The premise; region loss = capacity loss, not outage |
| Active-passive / DR | One serves, the other waits to be promoted | Whole architecture | Cheaper, simpler, but has a failover gap |
| Global routing | Health-probed L7/L4 entry steering users | Front Door / TM / cross-region LB | The mechanism that hides a region loss |
| RPO | Max data loss tolerated | Data tier | Set by sync vs async replication |
| RTO | Max time to recover | Routing + data tier | Set by probe interval + promotion time |
| Paired region | Azure’s curated region couple | Platform | Update isolation + geo-replica affinity |
| Failover group | SQL’s auto-failover listener + replica set | Azure SQL | One writer, async read replicas, auto-promote |
| Multi-region writes | Every region accepts writes | Cosmos DB | True active-active writes; needs conflict resolution |
| Consistency level | The staleness/availability dial | Cosmos DB (5 levels) | Trades correctness vs latency/availability |
| Replication lag | How far the replica trails the primary | Data tier | Your real RPO; treat as an SLO |
| Health probe | The check that marks an origin eligible | Front Door / LB | A lying probe routes users to a dead region |
| Idempotency key | Makes a repeated operation safe | App code | Turns cross-region retries from “double” to “once” |

Region pairs you’ll actually use

Pin your two regions to an Azure pair for sequential platform updates and geo-replica affinity. The common pairs are below; some pairings are one-way, so verify reciprocity before you assume it:

| Primary region | Paired with | Geo (for geo-redundant storage) | Notes |
| --- | --- | --- | --- |
| Central India | South India | India | The default Indian pair; West India pairs one-way with South India, so verify before relying on it |
| East US | West US | United States | Classic US pair |
| East US 2 | Central US | United States | Common US-East pairing |
| West Europe | North Europe | Europe | The default European pair |
| UK South | UK West | United Kingdom | In-country pair for data residency |
| Southeast Asia | East Asia | Asia Pacific | Singapore ↔ Hong Kong |
| Australia East | Australia Southeast | Australia | In-country pair |
| Brazil South | South Central US (one-way) |  | Non-reciprocal; verify before relying on it |

Which Azure services support multi-region (and how)

Active-active is only as resilient as your weakest stateful service. The mechanism differs per service — know each one’s native cross-region story:

| Service | Native multi-region mechanism | Active-active writes? | RPO story |
| --- | --- | --- | --- |
| Azure Front Door | Global by design (anycast) | N/A (edge) | N/A |
| App Service / AKS / Functions | Stateless; stamp per region | Yes (stateless) | N/A (state is elsewhere) |
| Azure SQL Database | Auto-failover groups (async) | Single-writer (partition for AA) | ~replication lag |
| Cosmos DB | Multi-region writes | Yes | Near zero, but cross-region replication is async, so the un-replicated tail is at risk (conflicts possible) |
| Azure Storage | RA-GRS / RA-GZRS (read-only secondary) | No (read secondary) | Async; manual/account failover |
| Service Bus | Geo-DR alias / Premium geo-replication | Alias repoint | Metadata (or data on Premium) |
| Event Hubs | Geo-replication | Promote secondary | Stream-position handling |
| Key Vault | Auto-replicated within geo | Reads everywhere | Platform-managed |
| Azure Cache for Redis | Active geo-replication (Enterprise) | Yes (Enterprise) | Near-real-time |

Resiliency patterns end to end (choose your altitude)

Before drilling into active-active specifically, locate it on the spectrum — because the most common architecture mistake is reaching for active-active when warm-standby would do, or shipping “active-passive with extra steps” and calling it active-active. The four canonical patterns, by what they cost and what they buy:

| Pattern | Second region runs… | Typical RTO | Typical RPO | Relative cost | Failover trigger |
| --- | --- | --- | --- | --- | --- |
| Backup & restore | Nothing (restore from backup) | Hours | Hours (last backup) | ~1.05× | Manual restore |
| Pilot light | Data replica + minimal core | 10s of min | Minutes | ~1.2× | Manual/scripted scale-up |
| Warm standby (active-passive) | Full stack, scaled down, no traffic | Minutes | Seconds–minutes | ~1.4–1.7× | Auto/semi-auto promote + repoint |
| Active-active (multi-site) | Full stack, scaled up, live traffic | Seconds | ~0 (committed) | ~1.8–2.1× | Probe evicts origin; writer auto-promotes |

The same four, judged on the qualities that decide which one a workload actually needs:

| Quality | Backup & restore | Pilot light | Warm standby | Active-active |
| --- | --- | --- | --- | --- |
| Is the recovery path continuously tested? | No | Partly | Partly | Yes (real traffic) |
| Cold-start risk on failover | High | High | Medium | None |
| Capacity after one region lost | 0 → restore | Low → scale | Full (if pre-scaled) | ~Half → autoscale |
| Data-tier complexity you own | Low | Low–med | Medium | High |
| Suits revenue/safety-critical? | No | Marginal | Often | Yes |
| Suits an internal back-office tool? | Yes | Yes | Sometimes | Overkill |

Read the decision as a table — match the workload to the smallest pattern that meets its budget:

| If the workload… | Tolerable RTO | Right pattern | Why not go higher |
| --- | --- | --- | --- |
| Internal report, nightly batch | Hours | Backup & restore | Active-active wastes money on a tool nobody pages for |
| Line-of-business app, business hours | 10s of min | Pilot light / warm standby | A short recovery is acceptable; halve the bill |
| Customer portal, modest SLA | Minutes | Warm standby | Auto-promote covers it without dual-write complexity |
| Payments / auth / ordering / real-time API | Seconds | Active-active | Every minute down is revenue/trust; the data model can be partitioned or tuned |
| Strongly-consistent single-writer that cannot partition | Minutes | Warm standby (not active-active) | Multi-write would break correctness; don’t force it |

The global routing tier — Front Door and the failover brain

Everything reaching your stacks passes through the global routing tier, so it is where “active-active” is won or lost. The default is Azure Front Door Standard/Premium: an anycast edge that terminates TLS, runs a WAF, health-probes each regional origin, and steers each request to a healthy, low-latency one. Because Front Door does this at the connection level (not via DNS), eviction of a failed origin happens in seconds, with no TTL to wait out — the property active-passive DNS failover lacks.

You configure an origin group containing your two regional origins (e.g. the public hostnames of each region’s App Gateway or App Service), a health probe (path, protocol, interval), and load-balancing settings (sample size, successful-sample threshold, latency-sensitivity). The probe is the brain. Point it at a /healthz that checks downstream dependencies (DB reachable, cache reachable) — not a static 200 from the web tier — or Front Door will happily keep routing to a region whose web tier is up but whose database is unreachable.

# Front Door Standard/Premium: an endpoint, an origin group with two regional origins,
# a meaningful health probe, and a route. (profile already created)
PROFILE=afd-pay-prod ; RG=rg-pay-global
az afd origin-group create -g $RG --profile-name $PROFILE \
  --origin-group-name og-app \
  --probe-path /healthz --probe-protocol Https --probe-request-type GET \
  --probe-interval-in-seconds 30 \
  --sample-size 4 --successful-samples-required 3 --additional-latency-in-milliseconds 50

az afd origin create -g $RG --profile-name $PROFILE --origin-group-name og-app \
  --origin-name central-india --host-name app-pay-ci.azurewebsites.net \
  --origin-host-header app-pay-ci.azurewebsites.net --priority 1 --weight 1000 --enabled-state Enabled

az afd origin create -g $RG --profile-name $PROFILE --origin-group-name og-app \
  --origin-name south-india --host-name app-pay-si.azurewebsites.net \
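  --origin-host-header app-pay-si.azurewebsites.net --priority 1 --weight 1000 --enabled-state Enabled

// Bicep equivalent of the origin group above ('profile' is the existing Front Door profile resource)
resource og 'Microsoft.Cdn/profiles/originGroups@2024-02-01' = {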
  parent: profile
  name: 'og-app'
  properties: {
    loadBalancingSettings: {
      sampleSize: 4
      successfulSamplesRequired: 3
      additionalLatencyInMilliseconds: 50
    }
    healthProbeSettings: {
      probePath: '/healthz'
      probeProtocol: 'Https'
      probeRequestType: 'GET'
      probeIntervalInSeconds: 30
    }
  }
}
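
What a "deep" /healthz means in practice can be sketched as follows. This is a framework-agnostic outline, not the article's implementation: the check functions are hypothetical stand-ins for a real database ping and cache ping, and you would wire the returned status code into whatever web framework serves the probe path.

```python
from typing import Callable, Dict, Tuple


def healthz(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check; return 200 only if all of them pass.

    A check that raises is treated as a failure, so a hung or unreachable
    dependency marks the region unhealthy instead of crashing the probe.
    """
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception as exc:  # any dependency error means "not healthy"
            results[name] = f"error: {type(exc).__name__}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results


# Hypothetical checks: replace with a cheap real query / ping with a short timeout.
def db_reachable() -> bool:
    return True


def cache_reachable() -> bool:
    raise TimeoutError("cache did not answer in time")


if __name__ == "__main__":
    print(healthz({"db": db_reachable}))
    print(healthz({"db": db_reachable, "cache": cache_reachable}))
```

Keep each check cheap and time-bounded: a probe that is too deep or too slow can flap the origin, which is the opposite failure from the static-200 probe.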

Routing methods — what “active-active” actually means at the edge

Front Door (and the other global options) support several steering methods. With equal priority and weight, Front Door uses latency-based routing among healthy origins — that is true active-active: every region takes the traffic nearest to it. Set different priorities and you get active-passive (priority 2 only serves when priority 1 is unhealthy). Weights let you do canary / gradual shift. Knowing which knob produces which behaviour stops you from accidentally building active-passive:

| Routing method | How it picks an origin | Resulting topology | Use it for | Gotcha |
| --- | --- | --- | --- | --- |
| Latency (equal priority/weight) | Lowest measured latency among healthy | Active-active | The default for active-active | Both regions must handle their share and the other’s on failover |
| Priority | Highest-priority healthy origin only | Active-passive | Cheap DR with auto-failover | Standby cold-ish unless you also send synthetic traffic |
| Weighted | Proportional to weights | Canary / gradual shift | Blue-green at region level, traffic splitting | Not for HA on its own; pair with health |
| Session affinity (cookie) | Pins a client to one origin | Sticky active-active | Legacy stateful apps | Defeats even spread; avoid for stateless |
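
How priority, latency and weight combine can be sketched as a small selection function. This is a simplified model of the behaviour described above, not Front Door's actual implementation; the `Origin` type, the latency figures and the 50 ms window are assumptions for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Origin:
    name: str
    priority: int      # lower number = preferred
    weight: int
    latency_ms: float  # measured from the edge
    healthy: bool


def eligible_origins(origins: List[Origin], latency_window_ms: float = 50) -> List[Origin]:
    """Healthy origins in the best priority tier, within the latency window of the fastest."""
    healthy = [o for o in origins if o.healthy]
    if not healthy:
        return []
    best_priority = min(o.priority for o in healthy)
    tier = [o for o in healthy if o.priority == best_priority]
    fastest = min(o.latency_ms for o in tier)
    return [o for o in tier if o.latency_ms <= fastest + latency_window_ms]


def pick_origin(origins: List[Origin], latency_window_ms: float = 50,
                rng=random) -> Optional[Origin]:
    """Weighted random choice among the eligible origins (None if nothing is healthy)."""
    candidates = eligible_origins(origins, latency_window_ms)
    if not candidates:
        return None
    return rng.choices(candidates, weights=[o.weight for o in candidates], k=1)[0]


if __name__ == "__main__":
    ci = Origin("central-india", priority=1, weight=1000, latency_ms=20, healthy=True)
    si = Origin("south-india", priority=1, weight=1000, latency_ms=45, healthy=True)
    print([o.name for o in eligible_origins([ci, si])])   # both share traffic
    ci.healthy = False
    print(pick_origin([ci, si]).name)                     # survivor takes everything
```

Giving `south-india` priority 2 in this model removes it from the eligible set while `central-india` is healthy, which is exactly how an intended active-active setup quietly becomes active-passive.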

Probe and eviction timing — where your routing-tier RTO comes from

The time Front Door takes to evict a failed origin is roughly the probe interval multiplied by the number of failed samples needed to drop below the success threshold, plus a few seconds of propagation. With a 30-second interval and “3 of 4 samples healthy,” two failed probes are enough, so a hard-down origin is evicted within roughly a minute and a half worst case; tighten the interval and the sampling to shave that down (at the cost of more probe traffic and more sensitivity to blips). These are the knobs that set your routing-tier RTO; the data tier adds its own promotion time on top for writes:

| Setting | What it controls | Typical value | Lower it to… | Trade-off of lowering |
| --- | --- | --- | --- | --- |
| probeIntervalInSeconds | How often each origin is probed | 30 s | Detect failure faster | More probe load; more sensitive to transient blips |
| sampleSize | How many recent probes are considered | 4 |  | Smaller = jumpier decisions |
| successfulSamplesRequired | How many must pass to stay eligible | 3 | Tolerate more failed probes (raise it, relative to sampleSize, to evict faster) | Lower = slower eviction; higher = more false evictions on a flaky network |
| Probe path | What “healthy” means | /healthz (deep) |  | Too shallow = route to a dead region; too deep = flap |
| Latency sensitivity (additionalLatency) | Tie-breaking window for “near” | 50 ms | Prefer the strictly nearest origin (raise it to spread more evenly) | Too tight = ping-pong between regions |
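
The eviction arithmetic can be written down as a rough model. It assumes a single prober on a fixed interval and an origin that fails hard; real Front Door probes come from many edge locations, so treat the result as an upper-bound estimate, and the `propagation_s` allowance as your own assumption.

```python
def failures_to_evict(sample_size: int, successful_required: int) -> int:
    """Failed probes needed before fewer than `successful_required` of the last
    `sample_size` samples are healthy."""
    if not 0 < successful_required <= sample_size:
        raise ValueError("need 0 < successful_required <= sample_size")
    return sample_size - successful_required + 1


def worst_case_eviction_seconds(interval_s: float, sample_size: int,
                                successful_required: int,
                                propagation_s: float = 0.0) -> float:
    """Worst case: the origin dies just after a successful probe, so each needed
    failure costs one full interval, plus a propagation allowance."""
    return failures_to_evict(sample_size, successful_required) * interval_s + propagation_s


if __name__ == "__main__":
    # The article's settings: 30 s interval, 3 of 4 samples must be healthy.
    print(failures_to_evict(4, 3))
    print(worst_case_eviction_seconds(30, 4, 3))                    # probing alone
    print(worst_case_eviction_seconds(30, 4, 3, propagation_s=30))  # with a generous allowance
    print(worst_case_eviction_seconds(10, 4, 3, propagation_s=5))   # tightened interval
```

With the settings shown earlier, probing alone accounts for about a minute; the rest of the minute-and-a-half figure is headroom for probe timeouts and propagation.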

Choosing the global front door — Front Door vs Traffic Manager vs cross-region LB

Front Door is the right default for HTTP(S) active-active, but it is not the only global option, and L4 or non-HTTP workloads change the answer:

| Capability | Front Door Std/Premium | Traffic Manager | Cross-region Load Balancer |
| --- | --- | --- | --- |
| OSI layer | L7 (HTTP/S) | DNS (steers names) | L4 (TCP/UDP) |
| Failover speed | Seconds (connection-level) | TTL-bound (tens of s–min) | Seconds |
| TLS termination + WAF | Yes | No | No |
| Caching / CDN | Yes | No | No |
| Non-HTTP protocols | No | Yes (any, via DNS) | Yes (TCP/UDP) |
| Health probe depth | HTTP path, deep | HTTP/TCP endpoint | TCP/HTTP |
| Best for | Web/API active-active | Legacy/any-protocol, DNS steering | L4 / regional LB front-ends |
| Anti-pattern |  | Anything needing sub-TTL failover | HTTP apps wanting WAF/caching |

The decision in one line per case:

| If your front-facing workload is… | Choose | Because |
| --- | --- | --- |
| HTTPS web app or API | Front Door | Seconds-level failover, WAF, TLS, caching at the edge |
| TCP/UDP or non-HTTP | Cross-region LB (or Traffic Manager) | Front Door is HTTP-only |
| Legacy that only understands DNS | Traffic Manager | DNS steering with health, any protocol |
| HTTP but you also need regional L4 LB | Front Door over regional Standard LBs | Global L7 in front, regional L4 behind |

If you do land on Traffic Manager (non-HTTP, or a protocol Front Door can’t terminate), its routing methods map to the same active-active vs active-passive choice — but every decision is DNS-resolution-bound, so failover is only as fast as your shortest safe TTL:

| Traffic Manager method | How it steers | Active-active? | Use it for | TTL caveat |
| --- | --- | --- | --- | --- |
| Performance | Lowest network latency to the user | Yes | Latency-optimal multi-region | Failover waits out the record TTL |
| Priority | Top healthy endpoint only | No (active-passive) | DNS-based DR with auto-failover | Standby cold unless warmed |
| Weighted | Proportional to weights | Yes (split) | Canary / gradual region shift | Not HA on its own |
| Geographic | By the user’s source geography | Yes (data-residency) | Compliance / data-residency routing | Mis-geo’d clients pinned wrongly |
| MultiValue | Returns multiple healthy IPs | Yes | Client-side failover across A records | Client picks; uneven spread |
| Subnet | By caller IP range mapping | Yes | Routing specific networks to specific regions | Maintenance of the IP map |

The data tier — the decision that actually defines the design

Stateless tiers fail over for free. State does not. This section is the heart of the article: you pick one of three patterns, and that choice sets your RPO, your consistency story, your conflict-handling burden, and most of your bill. Get this right and the rest is plumbing; get it wrong and you have either a correctness bug (lost or conflicting writes) or an availability bug (a single writer that takes the whole system down when its region dies).

The three patterns, side by side, on the axes that matter:

| Axis | A. Single global writer (SQL failover groups) | B. Multi-region writes (Cosmos DB) | C. Event-driven / async (Service Bus + idempotency) |
| --- | --- | --- | --- |
| Where writes land | One region’s primary; other is read-only replica | Every region accepts writes | Each region writes locally; events replicate |
| Consistency | Strong within primary; async (lagging) replica | Tunable: 5 levels (Strong→Eventual) | Eventual |
| RPO (data loss on region loss) | ~replication lag (async) | ~0 within a region; conflicts possible | ~in-flight events (idempotent retries) |
| RTO for writes | Promotion time (~30–120 s typical) once failover starts; automatic failover first waits out the grace period | ~0 (other regions already writable) | ~0 (other region already producing) |
| Conflict resolution | N/A (single writer) | Required (LWW or custom) | Designed away via idempotency |
| Cross-region cost | Geo-replica + egress | Multi-write RU surcharge + egress | Egress + dup processing |
| Best fit | Relational, partitionable, single-writer-friendly | Globally distributed, write-anywhere | Async/queue-based, retry-safe operations |
| The trap | “Active-active” reads but writes still single-region | Weak default consistency surprises devs | Eventual consistency leaks into UX |

Pattern A — Single global writer with Azure SQL auto-failover groups

An auto-failover group wraps one or more Azure SQL databases with a read-write listener and a read-only listener, a writable primary in one region, and an asynchronously replicated secondary in the other. Apps connect to the read-write listener; on failover the group promotes the secondary and the listener now points there — connection strings do not change. Because replication is asynchronous, your RPO is the replication lag (typically seconds, but it is not zero), and a forced failover during an outage can lose the un-replicated tail.

The crucial subtlety for active-active: the secondary is read-only until promoted. So true active-active writes with SQL means either (a) partition data by home region — a customer’s writes always go to their home region’s primary, and the other region holds their read replica — or (b) accept that one region owns all writes and the other serves reads + stands ready to promote. Pattern A is “active-active reads, single-writer writes” unless you do the partitioning work.
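
Option (a), partitioning by home region, comes down to a routing decision in the application. The sketch below assumes one failover group per home region, each with its primary in that region; the customer IDs and the `fog-pay-ci` / `fog-pay-si` listener names are hypothetical and do not appear in the setup shown in this article.

```python
from typing import Dict, NamedTuple

# Hypothetical tenant directory; in practice this lives in a config store or lookup table.
HOME_REGION: Dict[str, str] = {
    "cust-1001": "centralindia",
    "cust-2002": "southindia",
}

# Hypothetical: one failover group per home region, primary located in that region.
RW_LISTENER: Dict[str, str] = {
    "centralindia": "fog-pay-ci.database.windows.net",
    "southindia": "fog-pay-si.database.windows.net",
}


class WriteRoute(NamedTuple):
    listener: str          # read-write listener that owns this customer's writes
    crosses_region: bool   # True when the serving region is not the home region


def route_write(customer_id: str, serving_region: str) -> WriteRoute:
    """Send a customer's write to their home region's primary, wherever the request landed."""
    home = HOME_REGION[customer_id]  # KeyError for unknown customers is deliberate
    return WriteRoute(RW_LISTENER[home], home != serving_region)


if __name__ == "__main__":
    print(route_write("cust-1001", "centralindia"))  # local write
    print(route_write("cust-1001", "southindia"))    # cross-region write, slower
```

Cross-region routes should be rare if the global routing tier also steers each customer towards their home region; when they are common, you are back to single-writer latency for those users.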

# Create an auto-failover group across the paired regions (primary server already exists)
az sql failover-group create --name fog-pay --resource-group rg-pay-data \
  --server sql-pay-ci --partner-server sql-pay-si \
  --add-db payments \
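  --failover-policy Automatic --grace-period 1   # hours of unavailability before auto-failover

// Bicep equivalent of the failover group above (primaryServer, secondaryServer and payments are declared elsewhere)
resource fog 'Microsoft.Sql/servers/failoverGroups@2023-08-01-preview' = {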
  parent: primaryServer            // sql-pay-ci (Central India)
  name: 'fog-pay'
  properties: {
    partnerServers: [ { id: secondaryServer.id } ]   // sql-pay-si (South India)
    readWriteEndpoint: {
      failoverPolicy: 'Automatic'
      failoverWithDataLossGracePeriodMinutes: 60     // wait before forced (lossy) failover
    }
    readOnlyEndpoint: { failoverPolicy: 'Enabled' }
    databases: [ payments.id ]
  }
}

Connect each region’s app to the right listener — writers to the read-write endpoint, read-heavy paths to the read-only endpoint for local latency:

| Listener | Hostname pattern | Points at | Use it for | Behaviour on failover |
| --- | --- | --- | --- | --- |
| Read-write | fog-pay.database.windows.net | Current primary | All writes; read-after-write | Repoints to promoted secondary automatically |
| Read-only | fog-pay.secondary.database.windows.net | Current secondary (replica) | Local-region reads, reports | Follows the role swap |

The failover-group knobs you must understand, because their defaults decide whether you lose data or availability:

| Setting | What it does | Default / typical | When to change | Trade-off |
| --- | --- | --- | --- | --- |
| failoverPolicy | Automatic vs Manual failover | Manual (set to Automatic for HA) | Set Automatic for true auto-failover | Automatic can fail over on a transient region blip |
| failoverWithDataLossGracePeriodMinutes (grace period) | How long to wait before a forced, possibly lossy failover | 60 min (one hour is also the documented minimum; verify against current docs) | Raise it to avoid lossy flips; for a tighter RTO than the grace period allows, trigger a manual forced failover | Shorter wait = faster recovery but higher data-loss risk |
| Read-only endpoint | Whether the RO listener is enabled | Enabled | Keep enabled to offload reads | Replica reads can be stale (async lag) |
| Replica count / regions | One secondary (FOG); more via active geo-replication | 1 secondary | Add geo-replicas for more read regions | Each replica costs ~full DB price |
| Service tier (DTU/vCore) | Compute on both primary and secondary | Match prod | Size secondary = primary | You pay full price for the secondary |

The failure modes specific to SQL failover groups — confirm and fix:

| Symptom | Likely cause | Confirm | Fix |
| --- | --- | --- | --- |
| Writes fail after a region blip; app on RW listener | Auto-failover triggered, app cached old IP / DNS | az sql failover-group show --query replicationRole; check listener resolution | Use the listener name, not the server name; set short client DNS TTL; retry transient errors |
| Replica lag climbing, RPO at risk | Write-heavy load or throttled secondary | sys.dm_geo_replication_link_status / portal replication-lag metric | Scale the secondary to match; reduce write burst; alert on lag |
| Failover didn’t happen during a real outage | Grace period not elapsed, or policy Manual | failoverWithDataLossGracePeriodMinutes; failoverPolicy | Set Automatic; trigger a manual forced failover when you cannot wait out the grace period |
| Cross-region writes are slow | App in region B writing to primary in region A | App Insights dependency latency to SQL | Partition by home region, or accept B as read-only until promotion |
| Split data after forced failover | Lossy failover dropped un-replicated tail | Reconcile against event log / idempotency store | Use idempotent writes + an event log to replay the tail |
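
"Alert on lag" is easiest to reason about as an RPO budget check. The sketch below assumes you already collect replication-lag samples in seconds from whatever metric source you use; the budget and the three-in-a-row rule are illustrative choices, not recommended values.

```python
from typing import Iterable


def sustained_breach(lag_samples_s: Iterable[float], rpo_budget_s: float,
                     consecutive: int = 3) -> bool:
    """True if lag exceeded the RPO budget for `consecutive` samples in a row.

    Requiring a run of breaches avoids paging on a single write burst.
    """
    run = 0
    for lag in lag_samples_s:
        run = run + 1 if lag > rpo_budget_s else 0
        if run >= consecutive:
            return True
    return False


if __name__ == "__main__":
    budget = 5.0  # hypothetical RPO budget in seconds
    print(sustained_breach([1, 2, 8, 9, 7, 1], budget))  # three breaches in a row
    print(sustained_breach([1, 8, 2, 9, 3], budget))     # isolated spikes only
```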

Pattern B — True multi-region writes with Cosmos DB

Azure Cosmos DB is the cleanest fit for active-active writes: enable multi-region writes and every region you add becomes a writable replica. Writes are accepted locally (single-digit-millisecond latency in-region), replicated to the others, and conflicts (two regions writing the same item) are resolved by a policy — Last-Writer-Wins (LWW) on a timestamp by default, or a custom merge procedure. The price is a higher RU cost for multi-write and a default consistency (Session) that surprises developers expecting strong reads everywhere.

# Cosmos account with two regions and multi-region writes enabled
az cosmosdb create --name cosmos-pay --resource-group rg-pay-data \
  --locations regionName=centralindia failoverPriority=0 isZoneRedundant=true \
  --locations regionName=southindia  failoverPriority=1 isZoneRedundant=true \
  --enable-multiple-write-locations true \
  --default-consistency-level Session

// Bicep equivalent of the Cosmos DB account above
resource cosmos 'Microsoft.DocumentDB/databaseAccounts@2024-05-15' = {
  name: 'cosmos-pay'
  location: 'centralindia'
  properties: {
    databaseAccountOfferType: 'Standard'
    enableMultipleWriteLocations: true
    consistencyPolicy: {
      defaultConsistencyLevel: 'Session'           // pick deliberately; see table
      maxStalenessPrefix: 100000                    // only used for BoundedStaleness
      maxIntervalInSeconds: 300
    }
    locations: [
      { locationName: 'centralindia', failoverPriority: 0, isZoneRedundant: true }
      { locationName: 'southindia',  failoverPriority: 1, isZoneRedundant: true }
    ]
  }
}

The five consistency levels are the dial you must set on purpose. Stronger = more correct, more latency/cost, less availability under partition; weaker = faster, cheaper, more anomalies your code must tolerate:

| Level | Guarantee | Read latency | Availability under partition | Multi-write friendly? | Use it when |
| --- | --- | --- | --- | --- | --- |
| Strong | Linearizable; reads see latest committed | Highest (cross-region quorum) | Lowest | No (single write region only) | You truly need global linearizability (rare) |
| Bounded Staleness | Lags by at most K versions or T seconds | High | Medium | Yes | You need “close to fresh” with a bounded window |
| Session (default) | Read-your-writes within a session | Low | High | Yes | Most apps; per-user consistency is enough |
| Consistent Prefix | Never see out-of-order writes | Low | High | Yes | Order matters but absolute freshness doesn’t |
| Eventual | Converges, no order guarantee | Lowest | Highest | Yes | Counters, likes, telemetry (anomalies are fine) |

Conflict resolution when two regions write the same item:

Mode | How it resolves | Configure | Best for | Limitation
Last-Writer-Wins (LWW) | Highest value of a chosen property (default: _ts) wins | conflictResolutionPolicy: LastWriterWins | Most cases; simple, automatic | Silently drops the “loser” write
Custom (stored procedure) | Your merge logic runs on conflict | conflictResolutionPolicy: Custom + sproc | Mergeable state (carts, sets) | You must write & maintain the merge
Conflict feed (manual) | Conflicts surfaced to you to resolve | Read the conflicts feed | Audit / human-in-loop reconciliation | You build the resolver and the UX
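To make the LWW limitation concrete, here is a minimal Python simulation of two regions writing the same cart item. It is not Cosmos SDK code: Cosmos applies its conflict policy server-side, and the `Version` type and the `resolve_lww` / `resolve_merge` functions are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    region: str
    ts: int            # stands in for the item's _ts
    items: frozenset   # cart contents as written by this region

def resolve_lww(a: Version, b: Version) -> Version:
    """Last-Writer-Wins: keep the version with the higher timestamp, drop the other."""
    return a if a.ts >= b.ts else b

def resolve_merge(a: Version, b: Version) -> Version:
    """Custom merge for set-like state: union the items so neither write is lost."""
    winner = resolve_lww(a, b)
    return Version(region=winner.region, ts=winner.ts, items=a.items | b.items)

# Two regions update the same cart during a partition.
central = Version("centralindia", ts=1000, items=frozenset({"book"}))
south = Version("southindia", ts=1001, items=frozenset({"pen"}))

lww = resolve_lww(central, south)
merged = resolve_merge(central, south)
print(sorted(lww.items))     # only the later write survives
print(sorted(merged.items))  # both writes survive
```

Under LWW the earlier "book" write disappears without an error, which is acceptable for overwrite-style state and wrong for anything that should accumulate.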

Cosmos failure modes in active-active:

Symptom | Likely cause | Confirm | Fix
“Lost” updates after a partition healed | LWW dropped a concurrent write | Conflicts feed; compare _ts | Use custom merge for mergeable state; design idempotent ops
Reads look stale in region B | Session/Eventual consistency as configured | Check defaultConsistencyLevel; use session tokens | Raise to Bounded Staleness, or pass session tokens through
RU costs jumped after enabling multi-write | Multi-write RU surcharge + cross-region replication | Cosmos metrics: RU/s by region | Right-size RU/s; use autoscale RU; partition hot keys
429 throttling under failover load | Survivor region RU/s sized for half the traffic | TotalRequestUnits + 429 count | Autoscale RU; provision survivor for 100%
Hot partition in one region | Poor partition key choice | Cosmos metrics: per-partition RU | Re-key for even distribution; spread the hot tenant

Pattern C — Event-driven replication with idempotency

When the data model resists both single-writer and multi-write — or when operations are naturally asynchronous — you sidestep distributed transactions entirely: make every operation idempotent (a repeated “charge order #123” applies once, not twice), have each region act on local state, and replicate the events between regions with Service Bus geo-disaster-recovery (alias-based namespace pairing) or Event Hubs geo-replication. You accept eventual consistency but you never have a distributed-lock or conflict problem, because retries are safe by construction.

# Service Bus geo-DR: pair a primary and secondary namespace under a stable alias.
# Apps connect to the alias; on failover the alias repoints to the secondary.
az servicebus georecovery-alias set --resource-group rg-pay-msg \
  --namespace-name sb-pay-ci --alias sb-pay \
  --partner-namespace $(az servicebus namespace show -g rg-pay-msg -n sb-pay-si --query id -o tsv)

The two messaging geo options differ in what they replicate — pick by whether you need the data or just the metadata to survive:

Option | What replicates | Failover model | Data loss risk | Use it for
Service Bus geo-DR (alias) | Entities/metadata (not in-flight messages) | Manual alias repoint to secondary | In-flight messages not replicated | Queue/topic topology survival; idempotent consumers
Service Bus Premium geo-replication | Metadata and message data | Promote replica | Lower (data replicated) | When losing queued messages is unacceptable
Event Hubs geo-replication | Namespace metadata (+ data in newer tiers) | Promote secondary | Stream position handling needed | Telemetry/stream pipelines

Idempotency is the load-bearing discipline. The patterns that make cross-region retries safe:

Technique | How it works | Where to store the key | Good for
Idempotency key (client-supplied) | Caller sends a unique key; server dedups | Cosmos/SQL unique index on the key | Payments, order submission
Dedup window (broker) | Broker drops duplicate message IDs in a window | Service Bus duplicate detection | At-least-once delivery to exactly-once effect
Upsert by natural key | Write is an upsert (e.g. MERGE in Azure SQL, INSERT ... ON CONFLICT DO UPDATE in PostgreSQL) | The store itself | State-convergent updates
Outbox pattern | Write state + event in one local tx; relay later | Local DB outbox table | Avoiding dual-write inconsistency
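The first row of that table is the one payments paths lean on most, so here is a minimal sketch of it. It uses an in-memory SQLite table as a stand-in for the Cosmos/SQL store; the `charges` table, the `charge` function, and the key format are illustrative assumptions, not part of any Azure SDK.

```python
import sqlite3

def make_store() -> sqlite3.Connection:
    """Create a store whose primary key enforces one row per idempotency key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE charges ("
        " idempotency_key TEXT PRIMARY KEY,"
        " order_id TEXT NOT NULL,"
        " amount_paise INTEGER NOT NULL)"
    )
    return conn

def charge(conn: sqlite3.Connection, idempotency_key: str,
           order_id: str, amount_paise: int) -> bool:
    """Apply the charge once; return False when the key was already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO charges VALUES (?, ?, ?)",
                (idempotency_key, order_id, amount_paise),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate key: the retry is a no-op

conn = make_store()
first = charge(conn, "key-123", "order-123", 49900)
retry = charge(conn, "key-123", "order-123", 49900)  # e.g. a cross-region retry
total = conn.execute("SELECT COUNT(*), SUM(amount_paise) FROM charges").fetchone()
print(first, retry, total)
```

The dedup decision is made by the store's uniqueness constraint rather than by a read-then-write check in application code, so two concurrent retries cannot both pass.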

Picking the data pattern — the decision table

The whole data-tier decision in one grid — start at the top and stop at the first row that matches:

If your data… | And you need… | Pick | Active-active writes?
Is relational and partitions cleanly by tenant/region | Familiar SQL, strong in-region consistency | SQL failover groups + home-region partitioning | Yes (per-partition)
Is relational but cannot partition; one writer is fine | Simplicity over write-locality | SQL failover group (single writer) | No (active-active reads)
Is document/key-value, globally distributed | Write-anywhere, tunable consistency | Cosmos DB multi-region writes | Yes
Tolerates eventual consistency; ops are retry-safe | No distributed transactions | Event-driven + idempotency | Yes (async)
Needs global linearizability on every read | Correctness above all | Single strong writer (not active-active writes) | No
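Because the grid is evaluated top-down and stops at the first match, row order matters. The sketch below encodes that first-match rule; the function name and its four boolean flags are simplifying assumptions that compress each row's conditions, not a complete model of the decision.

```python
def pick_pattern(relational: bool, partitionable: bool, document_kv: bool,
                 eventual_ok_and_retry_safe: bool) -> str:
    """Return the first matching row of the decision table, top to bottom."""
    if relational and partitionable:
        return "SQL failover groups + home-region partitioning"
    if relational:
        return "SQL failover group (single writer)"
    if document_kv:
        return "Cosmos DB multi-region writes"
    if eventual_ok_and_retry_safe:
        return "Event-driven + idempotency"
    # Nothing else matched: fall back to the last row.
    return "Single strong writer (not active-active writes)"

# A ledger that partitions by merchant home region matches the first row.
ledger = pick_pattern(relational=True, partitionable=True,
                      document_kv=False, eventual_ok_and_retry_safe=False)
# A write-anywhere key-value store matches the third row.
kv_store = pick_pattern(relational=False, partitionable=False,
                        document_kv=True, eventual_ok_and_retry_safe=True)
print(ledger)
print(kv_store)
```

Note that the key-value store also tolerates eventual consistency, yet it resolves to Cosmos because that row comes first.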

Keeping two regions identical — IaC, drift, and parity

Active-active fails in subtle ways when the two regions are not byte-for-byte equivalent: a feature flag set in one region and not the other, a TLS cert renewed in region A but expired in B, an app setting that differs by a typo. The discipline is non-negotiable: deploy both regions from one IaC module with the region as a parameter, and treat any drift as an incident. The compute and ingress tiers are stateless, so this is mechanical: a for loop over a region list in Bicep, or a Terraform module called twice.

// One module, stamped per region. main.bicep:
param regions array = [ 'centralindia', 'southindia' ]

module stamp 'region-stack.bicep' = [for r in regions: {
  name: 'stack-${r}'
  params: {
    location: r
    appName: 'app-pay-${take(r,2)}'   // app-pay-ce / app-pay-so
    skuName: 'P1v3'
    // identical everything else — only location changes
  }
}]

The parity checklist — everything that must match across regions, how it drifts, and how you catch it:

Component | Must be identical because… | Common drift cause | How to detect drift
App settings / config | Survivor behaves differently otherwise | Hotfix applied to one region | Diff az webapp config appsettings list both regions in CI
Secrets / Key Vault refs | A missing secret crash-loops the survivor | Secret rotated in one vault only | Compare secret names/versions; use one rotation pipeline
TLS certificates | Expired cert in the standby fails on failover | Renewed in A, not B | Cert-expiry alert on both; automate renewal
Schema / migrations | A write to an un-migrated replica fails | Migration ran in one region | Migration gate in CI applies to both / shared DB
Compute SKU & count | Survivor can’t absorb 100% load | Scaled up one region manually | IaC drift detection (what-if / terraform plan)
WAF / NSG rules | One region blocks legit traffic | Rule added ad hoc | Policy-as-code; deny manual portal edits
Feature flags | Behaviour diverges under failover | Flag toggled per region | Centralise flags (App Config) with region-agnostic targeting
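For the app-settings row, a CI job usually needs a structured answer (which keys differ, and how) rather than a raw text diff. A minimal sketch follows, assuming the input is the JSON list of name/value objects that az webapp config appsettings list emits; the `diff_settings` function and the sample settings are hypothetical.

```python
import json

def diff_settings(region_a: list, region_b: list) -> dict:
    """Return {setting name: (value in A, value in B)} for every mismatch.

    A value of None means the setting is missing in that region.
    """
    a = {s["name"]: s["value"] for s in region_a}
    b = {s["name"]: s["value"] for s in region_b}
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }

# Sample payloads standing in for the two regions' CLI output.
ci = json.loads('[{"name": "FEATURE_X", "value": "on"},'
                ' {"name": "TIMEOUT_MS", "value": "3000"}]')
si = json.loads('[{"name": "FEATURE_X", "value": "off"},'
                ' {"name": "TIMEOUT_MS", "value": "3000"},'
                ' {"name": "HOTFIX", "value": "1"}]')

drift = diff_settings(ci, si)
print(drift)  # a non-empty result should fail the pipeline
```

Settings that legitimately differ per region (a region name, a regional connection string) would need an explicit allow-list so that real drift still stands out.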

Drift-detection commands you run on a schedule:

# Bicep what-if against both regions' resource groups; any reported change is drift
# (array parameters are passed as JSON, so the inner quotes must be double quotes)
az deployment group what-if -g rg-pay-ci -f main.bicep -p regions='["centralindia"]'
az deployment group what-if -g rg-pay-si -f main.bicep -p regions='["southindia"]'

# Diff app settings between the two regions (should produce no differences)
diff <(az webapp config appsettings list -g rg-pay-ci -n app-pay-ce --query "sort_by([],&name)" -o json) \
     <(az webapp config appsettings list -g rg-pay-si -n app-pay-so --query "sort_by([],&name)" -o json)

RPO/RTO budgeting and the SLO maths

You do not get to wish for “RPO ≈ 0, RTO ≈ seconds”; you compute it from the mechanisms you chose. As much as two nines of availability depends on getting these numbers right. Build the budget from the parts:

Budget component | Set by | Active-active typical | What worsens it
Routing-tier RTO | Probe interval × sample threshold + propagation | ~30–90 s to evict a bad origin | Long probe interval; shallow probe that lies
Writer RTO (SQL) | Failover-group promotion time | ~30–120 s | Long grace period; manual policy
Writer RTO (Cosmos multi-write) | None; other region already writable | ~0 | N/A
RPO (SQL async) | Replication lag at moment of loss | Seconds (not zero) | Write burst; throttled secondary
RPO (Cosmos) | In-region durability; conflict outcome | ~0 for writes already replicated; writes not yet replicated out of the lost region are at risk, and LWW may drop a loser | Concurrent cross-region writes
RPO (event-driven) | In-flight, un-replicated events | ~ a few events (idempotent retries recover) | Geo-DR that doesn’t replicate message data
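The routing-tier row can be turned into arithmetic. The sketch below assumes a simplified sliding-window model: an origin that was fully healthy is evicted once fewer than the required number of the last N samples succeed. Real Front Door probes originate from many edge locations, so treat the result as a rough upper bound. The function names and the propagation figure are assumptions.

```python
def failures_to_evict(sample_size: int, successful_required: int) -> int:
    """Consecutive failed probes needed before the success count drops below the threshold."""
    return sample_size - successful_required + 1

def routing_rto_seconds(probe_interval_s: float, sample_size: int,
                        successful_required: int, propagation_s: float = 0.0) -> float:
    """Estimate eviction time: interval x failures needed, plus propagation."""
    return probe_interval_s * failures_to_evict(sample_size, successful_required) + propagation_s

# 30 s interval, 3-of-4 samples required, 15 s assumed propagation.
slow = routing_rto_seconds(30, 4, 3, propagation_s=15)
# Halving the interval roughly halves the probe-driven part of the budget.
fast = routing_rto_seconds(15, 4, 3, propagation_s=15)
print(slow, fast)
```

Shortening the probe interval buys RTO at the cost of more probe traffic against each origin, which matters when the probe path does real dependency checks.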

The budget tells you how long; this tells you what fires the failover and whether a human is in the loop — the second half of the RTO story:

Failover trigger | Who/what initiates it | Automatic? | Data-loss risk | Typical time
Front Door origin eviction | Probe failing sample threshold | Yes | None (routing only) | ~30–90 s
SQL FOG planned failover | Operator (failover) | No (manual) | None (sync drain) | ~30–60 s
SQL FOG forced failover | Operator or auto after grace period | Auto (Automatic policy) | Possible (async tail) | ~30–120 s once initiated (the automatic policy first waits out the grace period)
Cosmos region failover | Operator or auto-failover priority | Auto (if enabled) | ~0 (multi-write) | Seconds
Service Bus geo-DR | Operator alias repoint | Manual | In-flight messages | Seconds–minutes
Storage account failover | Operator | Manual | Async tail | Up to ~1 h

Translate availability targets into what they allow per year, so the business chooses with eyes open:

Availability | Downtime / year | Downtime / month | Realistic with…
99.9% | 8.77 h | 43.8 min | Single region + zones, good ops
99.95% | 4.38 h | 21.9 min | Warm standby with auto-failover
99.99% | 52.6 min | 4.38 min | Active-active, tuned probes, auto-promote
99.999% | 5.26 min | 26.3 s | Active-active + flawless data tier + drills (hard)
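The downtime figures in that table follow directly from the availability percentage. A small sketch for recomputing them for any target, assuming an average year of 365.25 days (8,766 hours) and a month of one twelfth of that; the function name is illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours in an average year

def downtime_minutes(availability_pct: float) -> tuple:
    """Return (allowed minutes per year, allowed minutes per month) for a target."""
    unavailable_fraction = 1 - availability_pct / 100
    per_year = unavailable_fraction * HOURS_PER_YEAR * 60
    return per_year, per_year / 12

for target in (99.9, 99.95, 99.99, 99.999):
    per_year, per_month = downtime_minutes(target)
    print(f"{target}%: {per_year:.1f} min/year, {per_month:.2f} min/month")
```

A single 47-minute incident like the one in the introduction consumes nearly the whole yearly allowance at 99.99%, which is why that tier is only realistic with automated failover.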

Treat replication lag as a first-class SLO and alert on it — it is your real RPO. KQL over the lag metric:

// Alert when SQL geo-replication lag (seconds) breaches the RPO budget
AzureMetrics
| where ResourceProvider == "MICROSOFT.SQL" and MetricName == "replication_lag_sec"
| summarize maxLag = max(Maximum) by bin(TimeGenerated, 5m), Resource
| where maxLag > 30   // RPO budget = 30 s
| order by TimeGenerated desc

The signals to wire before the next failover — the leading indicators that catch trouble before users do:

Signal | Source metric | Alert threshold (starting point) | Why it’s leading
SQL replication lag | replication_lag_sec | > RPO budget (e.g. 30 s) | Predicts data loss on a lossy failover
Cosmos staleness / conflicts | Conflicts feed count; staleness | Any sustained conflicts | LWW may be dropping real writes
Origin health flapping | Front Door origin health % | < 100% intermittently | A region is becoming ineligible
Survivor saturation | App Service CPU% / Cosmos RU% per region | > 70% sustained | Survivor can’t take 100% on failover
429 throttling | Cosmos 429 count by region | > 0 sustained | RU/s under-provisioned for failover load
5xx at the edge | Front Door Http5xx | > 1% of requests | Routing to a region that can’t serve
Cross-region egress | Inter-region data transfer (GB) | Trend vs budget | Chatty cross-region calls / runaway cost
Cert expiry (both regions) | Key Vault cert expiry | < 30 days, either region | Standby cert expiry only bites on failover
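The survivor-saturation signal is easier to reason about as a projection: what would each remaining region's utilisation be if a peer vanished right now? The sketch below assumes regions of equal, fixed capacity (no autoscale) and an even split of the failed region's load among survivors; the function name and the sample figures are illustrative.

```python
def projected_survivor_pct(utilisation_pct: dict, failed_region: str) -> dict:
    """Project survivor utilisation after `failed_region` is lost.

    Assumes equal fixed capacity per region and an even split of the lost load.
    """
    survivors = {r: u for r, u in utilisation_pct.items() if r != failed_region}
    extra = utilisation_pct[failed_region] / len(survivors)
    return {region: used + extra for region, used in survivors.items()}

now = {"centralindia": 45.0, "southindia": 40.0}
after_loss = projected_survivor_pct(now, failed_region="centralindia")
print(after_loss)

# Anything above 100 means the survivor throttles unless it can scale out in time.
overloaded = [r for r, pct in after_loss.items() if pct > 100]
print(overloaded)
```

Under these assumptions, two regions that each sit comfortably at 55% already project to 110% on a single survivor, so a per-region threshold only protects you if autoscale headroom covers the gap.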

Composite SLA is the other number leadership asks for, and active-active changes its shape. For services in series (a request must pass all of them), multiply the SLAs: adding components lowers the composite. For a workload deployed redundantly across two regions (either can serve), the combined availability is 1 − (1 − A)², which raises it, provided the two regions fail independently. That asymmetry is the whole financial argument for active-active:

Configuration | Formula | Example (per-component A = 99.9%) | Composite
Two components in series | A₁ × A₂ | 0.999 × 0.999 | 99.80% (worse)
Three components in series | A₁ × A₂ × A₃ | 0.999³ | 99.70% (worse)
Same stack in two regions (redundant) | 1 − (1 − A)² | 1 − (0.001)² | 99.9999% (better)
Front Door SLA (the edge gate) | Stated SLA | Front Door availability SLA | ~99.99%
Realistic end-to-end active-active | edge × redundant stack | 0.9999 × 0.999999 | ~99.99% (the edge caps it)

The practical reading: the redundant stack math gives you headroom, but your composite is capped by the single global front door in front of it — so the edge SLA, not the doubled stack, is usually your ceiling. Adding more series components (extra hops, extra dependencies) erodes it; adding region redundancy to each tier restores it.
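The composite figures can be reproduced in a few lines. This sketch assumes component failures are independent, which real regions only approximate, and uses the article's example values (99.9% per component, ~99.99% for the edge); the `series` and `redundant` helper names are illustrative.

```python
from math import prod

def series(*availabilities: float) -> float:
    """Every component must be up: multiply the availabilities."""
    return prod(availabilities)

def redundant(availability: float, copies: int = 2) -> float:
    """Any one copy may serve: 1 minus the chance that all copies are down."""
    return 1 - (1 - availability) ** copies

two_in_series = series(0.999, 0.999)
two_regions = redundant(0.999)
end_to_end = series(0.9999, two_regions)  # the edge in series with the redundant stack

print(f"series:     {two_in_series:.4%}")
print(f"redundant:  {two_regions:.4%}")
print(f"end-to-end: {end_to_end:.4%}")
```

The end-to-end value lands just under the edge figure, which is the "capped by the front door" effect expressed numerically.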

Architecture at a glance

Read the diagram left to right as the request and data paths an active-active payments stack actually uses. At the far left, users arrive over HTTPS and hit Azure Front Door at the edge, which terminates TLS, runs the WAF, and, because both regional origins share equal priority and weight, uses latency-based routing to steer each user to the nearest healthy region. Front Door continuously probes a deep /healthz in each region; once a region’s probe fails the sample threshold, Front Door evicts that origin (typically within 30–90 seconds, per the RTO budget above) and serves everyone from the survivor, with no DNS change and no human in the loop.

The middle of the diagram is the two regional stacks — Central India and South India — each a self-contained, identical stamp: a regional ingress, an App Service / AKS compute tier, and a regional view of the data store. These tiers are stateless and IaC-identical, so either region can serve a full request alone. The right-hand zone is the data tier, the real decision: an Azure SQL auto-failover group (one writable primary, an async read-only replica, a read-write listener that auto-repoints on promotion) for relational state, and Cosmos DB with multi-region writes (every region writable, Session consistency, LWW conflict resolution) for document state. Idempotent payment events flow through Service Bus geo-DR so a retried charge is never double-applied. The numbered badges mark the four places this design most often breaks — a lying health probe, a single-writer cross-region write, replication lag blowing the RPO, and an under-provisioned survivor — and the legend narrates each as symptom → confirm → fix.

[Figure: Active-active Azure architecture for a payments platform. Azure Front Door at the edge health-probes and latency-routes to two identical regional stacks (Central India, South India); the data tier combines an Azure SQL auto-failover group, Cosmos DB multi-region writes, and Service Bus geo-DR; numbered badges mark a lying health probe, a single-writer cross-region write, replication lag breaching RPO, and an under-provisioned survivor.]

Real-world scenario

Paython, a fictional but realistic Indian payments processor, runs an authorization API: a .NET 8 service on App Service P1v3 behind Application Gateway + WAF, with Azure SQL for the ledger and Cosmos DB for the idempotency/transaction store. Traffic averages 900 requests/second, spiking to ~2,400 rps on salary-day evenings. The platform team is six engineers; the original single-region (Central India) stack cost about ₹95,000/month all-in. After a 47-minute regional incident cost them a six-figure chargeback dispute and a hard conversation with their largest merchant, they committed to active-active across Central India ↔ South India.

The rebuild took the three-pattern decision seriously. The ledger (relational, must be auditable, writes must be strongly consistent in-region) went onto an auto-failover group, with data partitioned by merchant home-region: a merchant’s authorizations always write to their home region’s SQL primary, while the other region holds a read-only replica for low-latency reads and instant promotion. The idempotency store (write-anywhere, must survive a region loss with zero RPO) went onto Cosmos DB multi-region writes at Session consistency with LWW on _ts — a duplicate “authorize txn #X” from a cross-region retry resolves to one record. Idempotent authorization events flow through Service Bus geo-DR so a retried charge is applied exactly once. Front Door fronts both regions with equal priority (true latency-based active-active) and a /healthz that checks SQL and Cosmos reachability — not a static 200.

The first game-day exposed the classic bug. The team killed the Central India stack at 14:30 on a Tuesday. Front Door correctly evicted the origin in ~40 seconds and South India took all traffic — and then started throwing 429 and 503. Cause: the survivor’s Cosmos container and App Service plan were each sized for half the load, so when they suddenly carried 100% they throttled. The fix was a sizing rule that became policy: each region must be provisioned (or able to autoscale) to the full load, not its steady-state share. They moved Cosmos to autoscale RU/s with a max set to peak-total, set App Service autoscale max to cover full peak, and re-ran the drill. Second game-day: clean. South India absorbed 2,400 rps with p95 holding at 240 ms.

The second bug was sneakier, and only the third game-day caught it. During a forced (lossy) SQL failover, a handful of authorizations written to the Central India primary in the final seconds before the kill were not yet replicated: the async lag was ~4 seconds under the salary-day write burst, blowing the team’s stated RPO of 1 second. Two changes fixed it. They scaled the secondary to match the primary (a throttled secondary had been the lag source), bringing steady-state lag under 1 second, and they made the authorization path fully idempotent against the Cosmos store + event log, so the un-replicated tail could be replayed from events after promotion rather than lost. They also added a replication-lag SLO alert at 1 second so lag creep is caught before an outage, not during one.

When the next real regional incident hit Central India eight weeks later, the numbers told the story: Front Door evicted the unhealthy origin in 34 seconds, South-India merchants saw nothing, Central-India merchants were served read-only from South India for ~70 seconds while the failover group promoted, then writes resumed — RTO ≈ 70 s, RPO ≈ 0 for committed authorizations, replayed tail included. Monthly cost landed at ₹178,000 (≈1.9× single-region) — and the chargeback-dispute risk that had triggered the whole project went to near zero. The lesson on the team’s wall: “Active-active is not ‘deploy it twice.’ It’s ‘provision the survivor for the whole load, make the writes idempotent, and prove it with a kill switch.’”

The game-day timeline, because the order of discovery is the lesson:

Drill | What they killed | What happened | Root cause | Fix that became policy
#1 | Central India stack | Front Door evicted in 40 s ✓, then survivor 429/503 | Survivor sized for half load | Provision/autoscale each region for full load
#2 | Central India stack | Clean traffic shift, p95 held | n/a | (validated the sizing fix)
#3 | Forced lossy SQL failover | ~4 s of writes lost; RPO breached | Throttled secondary → 4 s lag | Match secondary SKU; idempotent replay from events
#4 | Both data + compute | Clean; tail replayed from event log | n/a | Lag SLO alert at 1 s
Real incident (Azure) | Central India networking | RTO 70 s, RPO 0; merchants unaffected | n/a | The design held

Advantages and disadvantages

The active-active model both removes your single largest availability risk and hands you the hardest distributed-systems problems. Weigh it honestly:

Advantages (why you build it) | Disadvantages (why it’s expensive and hard)
Near-zero RTO/RPO for a single-region loss, with automatic, human-free failover | Cost roughly doubles: full capacity in two regions, plus cross-region egress and replication
Both stacks take real traffic, so there is no untested standby waiting to disappoint you | Distributed-data complexity (conflict resolution, partitioning, or eventual consistency) is now your problem
Lower latency globally, with users routed to the nearest healthy region | Operational discipline: config/schema/secret/cert parity or failover surfaces subtle bugs
Maintenance/deploys can drain one region at a time, a natural region-level blue-green | Testing burden: real region-kill game-days, not tabletop ones; an untested path is a liability
Capacity degrades to ~half on a region loss, not to zero | Survivor must be provisioned for 100% load, eroding the “pay for what you use” saving
Failover is a routine operation, not a once-a-year scramble | More moving parts (Front Door, failover groups, geo-DR) = more to monitor and more to break

When each side dominates: the advantages dominate for revenue- or safety-critical, customer-facing workloads where an hour of downtime costs more than the second region costs per month, and where the data model can be partitioned or tolerates tunable consistency. The disadvantages dominate for internal tools that tolerate a 30-minute recovery (use warm standby), for data models that demand a single strongly-consistent writer and cannot be partitioned, and for teams that lack the maturity to operate two live stacks — a poorly-run active-active is less reliable than a well-run single region, because it adds failure modes (split-brain, drift, conflict bugs) without the discipline to contain them.

Hands-on lab

You will stand up the global routing skeleton of an active-active app: two regional web apps stamped identically and a Front Door in front with a health probe and latency routing. Then you kill one region and watch Front Door fail over in about a minute. Low-cost rather than free (B1 plans and Front Door are billed; delete everything at the end). Run in Cloud Shell (Bash).

Step 1 — Variables and resource groups (two regions).

SUFFIX=$RANDOM
RG=rg-aa-lab
APP_CI=app-aa-ci-$SUFFIX     # Central India
APP_SI=app-aa-si-$SUFFIX     # South India
az group create -n $RG -l centralindia -o table

Step 2 — Stamp two identical B1 web apps in two regions.

az appservice plan create -n plan-ci -g $RG -l centralindia --is-linux --sku B1 -o table
az appservice plan create -n plan-si -g $RG -l southindia  --is-linux --sku B1 -o table
az webapp create -n $APP_CI -g $RG -p plan-ci --runtime "NODE:20-lts" -o table
az webapp create -n $APP_SI -g $RG -p plan-si --runtime "NODE:20-lts" -o table

Expected: two apps, identical except for region. Both respond on https://<app>.azurewebsites.net.

Step 3 — Give each a /healthz that returns 200 (the probe target). For the lab, the platform’s default page suffices as a stand-in; in production this path checks downstream dependencies. Enable health-check so the platform itself also tracks it:

az webapp config set -n $APP_CI -g $RG --generic-configurations '{"healthCheckPath": "/"}'
az webapp config set -n $APP_SI -g $RG --generic-configurations '{"healthCheckPath": "/"}'

Step 4 — Create a Front Door Standard profile and an endpoint.

PROFILE=afd-aa-$SUFFIX
az afd profile create -g $RG --profile-name $PROFILE --sku Standard_AzureFrontDoor -o table
az afd endpoint create -g $RG --profile-name $PROFILE --endpoint-name ep-aa --enabled-state Enabled -o table

Step 5 — Origin group with a 30 s probe, then both regions as equal-priority origins (active-active).

az afd origin-group create -g $RG --profile-name $PROFILE --origin-group-name og-aa \
  --probe-path / --probe-protocol Https --probe-request-type GET --probe-interval-in-seconds 30 \
  --sample-size 4 --successful-samples-required 3 --additional-latency-in-milliseconds 50

for pair in "ci:$APP_CI" "si:$APP_SI"; do
  name=${pair%%:*}; host=${pair##*:}.azurewebsites.net
  az afd origin create -g $RG --profile-name $PROFILE --origin-group-name og-aa \
    --origin-name $name --host-name $host --origin-host-header $host \
    --priority 1 --weight 1000 --enabled-state Enabled --https-port 443
done

Step 6 — Add a route so the endpoint serves from the origin group.

# Http is accepted only so the route can redirect it to HTTPS
az afd route create -g $RG --profile-name $PROFILE --endpoint-name ep-aa \
  --route-name route-aa --origin-group og-aa \
  --supported-protocols Http Https --https-redirect Enabled --forwarding-protocol HttpsOnly --link-to-default-domain Enabled

Find your endpoint hostname and curl it a few times — you are being served from a healthy region:

az afd endpoint show -g $RG --profile-name $PROFILE --endpoint-name ep-aa --query hostName -o tsv
# curl https://<that-host>/  → 200, served from the nearest healthy origin

Step 7 — Kill one region and watch failover. Stop the Central India app to simulate a regional loss:

az webapp stop -n $APP_CI -g $RG
# Within ~1–1.5 min (30 s interval × 3-of-4 samples), Front Door evicts CI and serves only SI.
# Keep curling the endpoint host — it keeps returning 200, now from South India.
watch -n 5 "curl -s -o /dev/null -w '%{http_code}\n' https://$(az afd endpoint show -g $RG --profile-name $PROFILE --endpoint-name ep-aa --query hostName -o tsv)/"

Expected: the endpoint keeps returning 200 throughout — no DNS change, no human action. That is the active-active property in one observation.

Step 8 — Restore and confirm the region rejoins.

az webapp start -n $APP_CI -g $RG
# After the next healthy samples, Front Door re-admits CI and resumes latency routing across both.

Validation checklist — what each step proved:

Step | What you did | What it proves
2 | Stamped two identical regional apps | Stateless tiers are trivially duplicated
5 | Equal priority/weight origins + health probe | This is active-active (latency), not active-passive
7 | Stopped one region, kept curling | Front Door evicts a dead origin within about a minute; users see 200 throughout
8 | Restarted the region | Failback is automatic when probes pass again

Cleanup (avoid lingering plan + Front Door charges).

az group delete -n $RG --yes --no-wait

Cost note. Two B1 plans plus a Standard Front Door for an hour is well under ₹100; deleting the resource group stops everything. This lab covered only the routing tier — the data tier (failover groups / Cosmos multi-write) is where production cost and complexity live.

Common mistakes & troubleshooting

This is the playbook — bookmark it for the next game-day or incident. First as a scannable table, then the entries that bite hardest expanded with the exact confirm path.

# | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix
1 | Front Door keeps routing to a region that can’t actually serve | Probe is shallow (static 200), region’s DB/cache is down | Compare probe path vs real dependency health; App Insights failures in that region | Make /healthz check downstream deps; never a static 200
2 | Survivor 429/503 right after failover | Survivor sized for half the load | Cosmos 429 count / App Service Http503; plan CPU pinned | Provision/autoscale each region for full load
3 | Cross-region writes slow (region B writing to region A primary) | Single global writer + no partitioning | App Insights dependency latency to SQL RW listener | Partition by home region, or move that data to Cosmos multi-write
4 | Data lost after a forced SQL failover | Async replica lag at the moment of loss | replication_lag_sec metric just before failover | Match secondary SKU; idempotent replay from event log; tighten lag SLO
5 | “Lost”/overwritten updates after a partition heals (Cosmos) | LWW dropped a concurrent write | Cosmos conflicts feed; compare _ts | Custom merge sproc for mergeable state; idempotent ops
6 | Reads stale in one region | Consistency level weaker than the UX assumes | az cosmosdb show --query consistencyPolicy | Raise to Bounded Staleness, or pass session tokens through
7 | Failover didn’t fire during a real outage | Grace period not elapsed, or policy Manual | failoverWithDataLossGracePeriodMinutes; failoverPolicy | Set Automatic; shorten grace period; or trigger forced failover
8 | Failover fired on a transient blip (flapping) | Probe too sensitive / grace period too short | Front Door probe history; FOG events | Raise sample threshold; lengthen grace period slightly
9 | Survivor behaves differently (a feature is off / breaks) | Config / flag / secret drift between regions | Diff app settings & flags across regions | One IaC module; centralise flags; drift detection in CI
10 | TLS errors only after failover to the standby region | Cert expired/renewed in one region only | Cert expiry on both origins | Automate renewal; alert on both; one cert pipeline
11 | Duplicate charges/effects after a cross-region retry | Operations not idempotent | Search for duplicate effects by business key | Idempotency keys + unique index; broker dedup; outbox
12 | Egress/replication bill far higher than expected | Cross-region data transfer + multi-write RU surcharge | Cost analysis by meter (inter-region egress, Cosmos RU) | Reduce chatty cross-region calls; right-size RU; keep reads local
13 | One region’s writes never reach the other | Geo-DR replicates metadata, not in-flight messages | Service Bus geo-DR mode; message-count both namespaces | Use Premium geo-replication (data) or rely on idempotent replay
14 | DNS-based failover takes minutes | Traffic Manager TTL, or app caches DNS | TTL on the TM profile; client DNS cache | Use Front Door (connection-level), not DNS, in the hot path; lower TTLs
15 | Both regions patched/restarted at once | Not pinned to a paired region | az account list-locations; check the pair | Pin both to an Azure pair so updates are sequential
16 | Split-brain: both regions think they’re primary | Forced failover while old primary returned | SQL FOG replicationRole on both servers | One source of truth for promotion; fence the old primary; reconcile via event log
17 | Private endpoint resolves wrong region’s PaaS | Cross-region private DNS not zone-linked | nslookup the PaaS FQDN in each region | Link the private DNS zone to both VNets; per-region records
18 | Failover works in drills but not real outages | Game-days only kill compute, not the data tier | Compare drill scope vs real failure modes | Drill the forced data-tier failover, not just webapp stop
19 | App in region B can’t read its own recent write | Read routed to a lagging replica / wrong consistency | Trace the read path; check session token use | Read-your-writes: Session consistency + propagate session tokens; or read RW listener
20 | Cost spikes only during failover events | Survivor autoscales to 2× to absorb full load | Cost analysis during the incident window | Expected; budget for it: reserve baseline, autoscale the burst

The entries that cause the most 3 a.m. confusion, expanded:

1. Front Door keeps sending users to a region that returns errors. Root cause: The health probe is shallow — it hits a path that returns 200 from the web tier even when the region’s database or cache is unreachable. Front Door thinks the origin is healthy; users get 5xx from the broken downstream. Confirm: Compare the probe path against what a real request needs. If /healthz returns 200 but dependencies in App Insights for that region show the DB failing, the probe is lying. Fix: Make /healthz a deep check — verify the database and any must-have dependency are reachable, return non-200 if not. The probe must answer “can this region serve a real request?”, not “is the web process up?”.

2. The surviving region throttles (429/503) the instant it takes all the traffic. Root cause: Each region was sized for its steady-state share (~half), so when one dies the survivor suddenly carries 100% and exceeds its provisioned compute or RU/s. Confirm: Cosmos 429 count or App Service Http503 spikes exactly at the failover moment; plan CPU pinned at 100%. Fix: Provision (or autoscale) each region for the full expected load, not half. Use Cosmos autoscale RU/s with a max at peak-total, and App Service autoscale max at full peak. This is the single most common active-active mistake.

3. Writes from one region are slow because they cross the WAN to the other region’s primary. Root cause: You chose a single global writer (SQL failover group) without partitioning, so region B’s writes travel to region A’s primary every time. Confirm: App Insights dependency latency from region B to the SQL read-write listener is consistently ~the inter-region RTT. Fix: Partition data by home region (each region writes its own data locally), or move that workload to Cosmos multi-region writes where every region writes locally.

4. A forced SQL failover lost the last few seconds of writes. Root cause: SQL geo-replication is asynchronous, so a forced (lossy) failover during an outage drops whatever hadn’t replicated — your RPO is the lag, not zero. Confirm: replication_lag_sec just before the failover shows the gap; the missing rows correspond to that window. Fix: Match the secondary’s SKU to the primary (a throttled secondary is the usual lag source), alert on lag against your RPO budget, and make the write path idempotent against an event log so the un-replicated tail can be replayed after promotion.

5. Concurrent writes in two regions “lost” one of them (Cosmos). Root cause: With multi-region writes and Last-Writer-Wins, two regions writing the same item resolve to one — the “loser” is silently dropped, which is wrong for mergeable state (e.g. a shopping cart). Confirm: The conflicts feed shows the conflict; the surviving item’s _ts is the later one. Fix: Use a custom merge stored procedure for mergeable state, or design the operation to be idempotent/commutative so order doesn’t matter.
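To make entry 5 concrete, here is a minimal Python model of the two resolution strategies. It is an illustration only: in Cosmos DB a custom policy is written as a JavaScript merge stored procedure, and the `CartVersion` type, the region names, and the timestamps below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CartVersion:
    """One region's copy of a cart item; ts stands in for Cosmos' _ts."""
    region: str
    ts: int
    skus: frozenset


def last_writer_wins(a: CartVersion, b: CartVersion) -> CartVersion:
    # The version with the higher timestamp survives; the other is discarded.
    return a if a.ts >= b.ts else b


def merge_carts(a: CartVersion, b: CartVersion) -> CartVersion:
    # Set union is commutative and idempotent, so the merged contents do not
    # depend on which region's write arrives first.
    winner = last_writer_wins(a, b)
    return CartVersion(region=winner.region, ts=max(a.ts, b.ts), skus=a.skus | b.skus)


# Two regions update the same cart concurrently (hypothetical data).
central = CartVersion("centralindia", ts=100, skus=frozenset({"book", "pen"}))
south = CartVersion("southindia", ts=101, skus=frozenset({"book", "lamp"}))

lww = last_writer_wins(central, south)
merged = merge_carts(central, south)
print(sorted(lww.skus))     # ['book', 'lamp']: the "pen" added in centralindia is gone
print(sorted(merged.skus))  # ['book', 'lamp', 'pen']: both regions' additions survive
```

A union like this only handles additions. Removing an item from the cart would need extra state (for example a tombstone per removed SKU), which is why the merge logic has to be designed per data type rather than applied generically.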

Best practices

Security notes

The security controls that double as resilience controls — they pull in the same direction:

| Control | Mechanism | Secures against | Also prevents |
| --- | --- | --- | --- |
| Managed identity in both regions | System/user-assigned MI + RBAC | Secrets in plaintext config | Survivor crash-loop from a missing secret |
| WAF at the edge (one policy) | Front Door WAF, policy-as-code | OWASP attacks, bots | Divergent regional WAF rules |
| Private Endpoints per region | Private Link + private DNS | Public exposure of data tier | Replication/app traffic over the internet |
| TLS + CMK parity | minTlsVersion, CMK in both vaults | Downgrade / cleartext | CMK-missing failure on the survivor |
| Identical network rules | IaC-managed NSG/IP rules | Bypass via a relaxed DR rule | “DR-only” holes exposed under stress |

Cost & sizing

The bill drivers and how they interact with the design:

A rough monthly picture for a mid-size API (the Paython shape, ~900 rps), single-region baseline vs active-active:

| Cost driver | Single-region baseline | Active-active | What the delta buys |
| --- | --- | --- | --- |
| Compute (App Service P1v3 × N) | ~₹40,000 | ~₹80,000 (both regions, full-load) | Survivor absorbs 100% load |
| Azure SQL (primary; +secondary in AA) | ~₹25,000 | ~₹52,000 (primary + matched secondary) | RPO≈seconds, auto-promote |
| Cosmos DB (single → multi-write) | ~₹18,000 | ~₹30,000 (multi-write RU + 2nd region) | Write-anywhere, RPO≈0 |
| Cross-region egress + replication | n/a | ~₹6,000 | Keeping both regions in sync |
| Front Door Std/Premium | ~₹2,000 (or single ingress) | ~₹5,000 | Seconds-level global failover + WAF |
| Service Bus geo-DR | included | ~₹3,000 (Premium for data) | Events survive a region loss |
| Rough total (sum of the rows above) | ~₹85,000 | ~₹176,000 (≈2.1×) | Region loss becomes a non-event |

Right-sizing rules: only go active-active where an outage costs more per hour than the second region costs per month — otherwise warm-standby halves the bill. Use autoscale aggressively so the “full-load survivor” capacity is available but not always paid for. And re-measure after you fix bugs: Paython, like many teams, found that fixing connection reuse and partitioning let them run smaller SKUs than the panicked first cut, landing the active-active bill well below the worst-case 2.1×. For the FinOps discipline around tagging, budgets, and reservations that make a two-region bill predictable, see Azure FinOps & Cost Management at Scale.
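The first right-sizing rule can be turned into a quick calculation. The sketch below is a rough aid, not a pricing tool: it uses the per-row monthly figures from the table above, treats the blank and "included" cells as zero, reads "what the second region costs per month" as the delta over the single-region baseline, and the outage-cost figures are hypothetical.

```python
# Monthly figures in INR, copied from the rows of the cost table.
# Assumption: blank and "included" cells count as 0.
SINGLE_REGION = {"compute": 40_000, "azure_sql": 25_000, "cosmos_db": 18_000,
                 "egress_replication": 0, "front_door": 2_000, "service_bus": 0}
ACTIVE_ACTIVE = {"compute": 80_000, "azure_sql": 52_000, "cosmos_db": 30_000,
                 "egress_replication": 6_000, "front_door": 5_000, "service_bus": 3_000}


def monthly_total(costs: dict) -> int:
    return sum(costs.values())


def second_region_premium(single: dict, active_active: dict) -> int:
    """Extra monthly spend that active-active adds over the single-region baseline."""
    return monthly_total(active_active) - monthly_total(single)


def active_active_justified(outage_cost_per_hour: int, single: dict, active_active: dict) -> bool:
    # Rule of thumb from the text: one hour of outage must cost more than
    # one month of the second region.
    return outage_cost_per_hour > second_region_premium(single, active_active)


premium = second_region_premium(SINGLE_REGION, ACTIVE_ACTIVE)
ratio = monthly_total(ACTIVE_ACTIVE) / monthly_total(SINGLE_REGION)
print(f"premium: INR {premium:,}/month, ratio: {ratio:.2f}x")

# Hypothetical outage costs per hour, for illustration only.
print(active_active_justified(500_000, SINGLE_REGION, ACTIVE_ACTIVE))
print(active_active_justified(20_000, SINGLE_REGION, ACTIVE_ACTIVE))
```

Re-running this with measured SKUs after a tuning pass shows how far the real premium sits below the first-cut estimate.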

Interview & exam questions

1. What is the difference between active-active and active-passive, and when do you choose each? Active-active runs both regions hot, serving live traffic concurrently, so a region loss is a capacity reduction (~half) with seconds-level RTO and an RPO set by the data tier (≈0 with Cosmos multi-region writes, ≈ the replication lag with SQL failover groups). Active-passive keeps one region serving and the other warm/cold, with a failover gap measured in minutes. Choose active-active for revenue/safety-critical workloads where downtime costs more per hour than the second region per month and the data model can be partitioned or tolerate tunable consistency; choose active-passive for workloads that tolerate a short recovery, because it’s far cheaper and simpler.

2. Why don’t Availability Zones make a workload multi-region resilient? Zones protect against the failure of a single datacentre within a region — they give intra-region HA. A whole-region impairment (control-plane bug, regional networking incident, capacity shortfall) takes all zones in that region together. Multi-region active-active (or DR) is the only thing that removes the region as a single fault domain.

3. Why is the health probe the most important part of the global routing tier? Because it decides which origins are eligible. A shallow probe that returns 200 from the web tier even when the region’s database is down makes Front Door route users to a region that cannot serve, producing 5xx. The probe must be a deep /healthz that verifies downstream dependencies and answers “can this region serve a real request?”.

4. You enabled active-active but writes from one region are slow. Why, and how do you fix it? You likely chose a single global writer (SQL auto-failover group) without partitioning, so the non-primary region’s writes cross the WAN to the primary every time. Fix by partitioning data by home region (each region writes locally) or moving that data to Cosmos DB multi-region writes, where every region accepts writes locally.

5. Compare the three data-tier patterns for active-active. (a) SQL auto-failover groups: one writable primary, async read-only replica, auto-promotion; RPO ≈ replication lag, writes are single-region unless you partition. (b) Cosmos multi-region writes: every region writable with conflict resolution (LWW/custom) and five consistency levels; RPO ≈ 0, relaxed consistency (Session by default). (c) Event-driven: idempotent operations with replicated events (Service Bus/Event Hubs geo-DR); eventual consistency, no distributed transactions. The choice sets your RPO, consistency, and most of your cost.

6. What determines RPO and RTO in an active-active design? RPO comes from replication mode: synchronous ≈ 0, asynchronous ≈ the replication lag at the moment of loss (so SQL failover groups have non-zero RPO). RTO comes from the routing tier (probe interval × sample threshold to evict a bad origin, ~30–90 s) plus, for writes, the writer’s promotion time (SQL ~30–120 s; Cosmos ~0 because other regions are already writable).

7. What are the five Cosmos DB consistency levels and how do they relate to active-active? Strong, Bounded Staleness, Session, Consistent Prefix, Eventual — from most to least consistent. Strong forbids multi-region writes (single write region only); the other four are multi-write friendly. Session (read-your-writes per session) is the default and suits most apps. Stronger levels add cross-region latency and reduce availability under partition; weaker levels are faster/cheaper but expose anomalies your code must tolerate.

8. How does Cosmos resolve conflicting writes from two regions, and what’s the catch? By a conflict-resolution policy: Last-Writer-Wins (highest _ts or a chosen property wins) by default, or a custom stored procedure for merge logic, or a conflicts feed for manual resolution. The catch with LWW is that it silently drops the loser, which is wrong for mergeable state (carts, sets) — use custom merge or commutative/idempotent operations there.

9. After failover the surviving region starts throwing 429/503. What happened and how do you prevent it? The survivor was sized for its steady-state share (~half the load), so when it suddenly carries 100% it exceeds its provisioned compute or Cosmos RU/s and throttles. Prevent it by provisioning (or autoscaling) each region for the full expected load — Cosmos autoscale RU/s with max at peak-total, App Service autoscale max at full peak. This is the most common active-active mistake and the first thing a game-day exposes.

10. Why must the two regions be kept byte-for-byte identical, and how? Because drift (a config/flag/secret/cert/schema difference) means the survivor behaves differently than the region that failed — a subtle bug that only appears under failover, when you can least afford it. Keep them identical by deploying both from one IaC module with the region as a parameter, centralising feature flags, automating cert/secret rotation across both, and running scheduled drift detection (what-if / terraform plan).

11. What is a region-kill game-day and why is it non-negotiable? It’s a scheduled drill where you actually take a region offline (stop its stack or force a failover) and verify the design recovers within RTO/RPO. It’s non-negotiable because an untested failover path is a liability dressed as resilience — the first game-day almost always finds the survivor-sizing bug or a replication-lag breach, both invisible until you pull the trigger.

12. When should you NOT use active-active? When the workload tolerates a 30-minute recovery (warm standby is far cheaper), when the data model demands a single strongly-consistent writer and cannot be partitioned (multi-write would break correctness), or when the team lacks the maturity to operate two live stacks and keep them in parity — a poorly-run active-active adds failure modes (split-brain, drift, conflict bugs) and is less reliable than a well-run single region.
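The deep probe from question 3 is easier to reason about with a sketch. The following is a minimal, framework-free Python outline of the decision logic behind a `/healthz` endpoint. The `evaluate_health` helper and the two check functions are hypothetical stand-ins; a real handler would run something like a cheap query or a cache ping, each with a short timeout so the probe itself cannot hang.

```python
from typing import Callable, Dict, Tuple


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every must-have dependency check; return (HTTP status, per-check detail)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a check that raises counts as a failed dependency
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    # 200 only when the region can serve a real request; anything else tells
    # the global router to stop sending traffic here.
    return (200 if healthy else 503), results


# Hypothetical dependency checks for illustration.
def db_reachable() -> bool:
    return True


def cache_reachable() -> bool:
    raise ConnectionError("cache timeout")


status, detail = evaluate_health({"database": db_reachable, "cache": cache_reachable})
print(status, detail)  # 503 {'database': 'ok', 'cache': 'error: cache timeout'}
```

Only include dependencies the region truly cannot serve without. Adding an optional dependency to the check list would evict a region that could still handle most requests.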

These map to AZ-305 (Designing Microsoft Azure Infrastructure Solutions), which covers design for high availability, business continuity, and disaster recovery, region pairs, Front Door, failover groups, and Cosmos consistency, and to AZ-700 (Network Engineer) for the global routing tier. The data-tier specifics touch DP-420 (Cosmos DB). A compact cert mapping for revision:

| Question theme | Primary cert | Objective area |
| --- | --- | --- |
| Active-active vs DR; RTO/RPO design | AZ-305 | Design BC/DR; resiliency patterns |
| Front Door, Traffic Manager, routing | AZ-305 / AZ-700 | Design network connectivity; global load balancing |
| SQL failover groups, geo-replication | AZ-305 | Design data storage; high availability |
| Cosmos multi-region writes & consistency | DP-420 / AZ-305 | Distributed data design; consistency models |
| Paired regions, AZ vs multi-region | AZ-305 / AZ-104 | Resiliency fundamentals |
| IaC parity, drift, governance | AZ-305 / AZ-400 | Infrastructure as code; reliable deployment |

Quick check

  1. Your app is “active-active” but writes from the second region are slow and all land on the first region’s database. What did you most likely skip, and what are the two fixes?
  2. True or false: Availability Zones give you multi-region resilience.
  3. Front Door is still routing users to a region that’s returning 5xx. What’s wrong with your design, and where exactly do you fix it?
  4. You force a SQL auto-failover-group failover during an outage and lose four seconds of writes. Why was RPO not zero, and name two fixes.
  5. After a region fails, the survivor immediately throttles with 429/503. What sizing rule did you violate?

Answers

  1. You skipped data partitioning by home region while using a single global writer (SQL failover group), so region B’s writes cross the WAN to region A’s primary. Fixes: partition by home region so each region writes locally, or move that data to Cosmos DB multi-region writes where every region is writable.
  2. False. Zones protect against a single-datacentre failure within a region; a whole-region impairment takes all zones together. Only multi-region (active-active or DR) removes the region as a single point of failure.
  3. The health probe is too shallow — it returns 200 from the web tier even though a downstream (DB/cache) in that region is down, so Front Door keeps the origin eligible. Fix it in the /healthz endpoint: make it a deep check that verifies downstream dependencies and returns non-200 when the region can’t actually serve.
  4. SQL geo-replication is asynchronous, so a forced failover drops whatever hadn’t replicated — RPO equals the replication lag (four seconds here), not zero. Fixes: match the secondary’s SKU to the primary (a throttled secondary causes lag) and make writes idempotent against an event log so the un-replicated tail can be replayed after promotion; also alert on lag against the RPO budget.
  5. You provisioned each region for its steady-state share (~half the load) instead of the full load. The rule: every region must be provisioned or able to autoscale to 100% of expected load, because a single-region loss makes the survivor carry everything.
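The sizing rule in answer 5 reduces to one division. This small Python sketch generalises it to N regions under the assumption that traffic is spread evenly across the survivors. The 900 rps peak is the mid-size API shape from the cost section; the 150 rps per instance figure is hypothetical.

```python
import math


def per_region_capacity(peak_total_rps: float, regions: int, tolerated_losses: int = 1) -> float:
    """Capacity each region must reach so the survivors still carry the full peak."""
    survivors = regions - tolerated_losses
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return peak_total_rps / survivors


def instances_needed(capacity_rps: float, rps_per_instance: float) -> int:
    # Round up: a fractional instance still has to be provisioned as a whole one.
    return math.ceil(capacity_rps / rps_per_instance)


peak = 900  # rps
print(per_region_capacity(peak, regions=2))  # 900.0: each region must handle the whole peak
print(per_region_capacity(peak, regions=3))  # 450.0: two survivors share the peak
print(instances_needed(per_region_capacity(peak, regions=2), rps_per_instance=150))  # 6
```

The result is the autoscale maximum for each region, not necessarily its steady-state instance count.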

Glossary

Next steps

You can now design, cost, and prove an active-active Azure workload. Build outward:
