Azure Multi-Region Active-Active Architecture: Designing for Zero-Downtime

Quick take — An active-active design runs your application in two (or more) Azure regions at the same time, with a global front door splitting traffic across both and data replicated between them. When a region fails, users barely notice. The price you pay is real: distributed-data consistency, doubled run-rate, and the operational discipline to keep two live stacks identical. This article shows the architecture, the failover sequence, the data-tier choice that makes or breaks it, and exactly when the complexity is worth it — laid out as scannable tables you can keep open during a game-day or an outage.

At 03:14 a regional networking incident takes Central India’s load balancers offline. A payments platform that runs entirely in that one region goes dark. The on-call engineer has a runbook to “fail over to the DR region” — but DR is a cold copy: VMs are off, the last database restore point is twenty minutes old, and DNS still points at the dead region. By the time DNS TTLs expire and the standby database is promoted, 47 minutes have passed and a handful of in-flight transactions are lost. The post-incident review asks one question: why did a single region’s bad night become our customers’ bad night?

Multi-region active-active is the architecture that answers that question. Instead of a primary that fails over to a standby, you run both regions hot, take real traffic in each, and treat a region loss as the removal of capacity rather than a disaster you scramble to recover from. A single Azure region is a remarkably reliable unit, but it is still a shared fault domain — a control-plane bug, a fibre cut, a bad config push, or a capacity shortfall can degrade an entire region at once, and Availability Zones (which protect against a single datacentre failure within a region) do not help when the whole region is impaired.

By the end of this article you will stop treating “the region” as a single point of failure. You will know how Azure Front Door health-probes and steers traffic globally, how the data tier — not the stateless web tier — is the real decision, how to pick between Azure SQL auto-failover groups, Cosmos DB multi-region writes, and event-driven replication, how to budget RPO/RTO honestly, what each choice costs in rupees and consistency, and how to run a region-kill game-day that proves the design instead of merely hoping. Every decision comes with a table that enumerates the options end-to-end, plus the az/Bicep to implement it and the KQL to watch it.

What problem this solves

A single-region workload couples your availability to the worst day of one Azure region. Most of the time that is excellent — a well-architected single region with zone redundancy clears three to four nines. But the tail risk is brutal: when a whole region degrades, everything you run there degrades together, and the blast radius is your entire customer base. Active-passive disaster recovery softens this but does not remove it — the standby is cold or warm, the failover is manual or semi-automatic, and the recovery is measured in tens of minutes during which you are losing money and trust.

What breaks without active-active: the failover gap (promote the standby, repoint DNS, warm the caches — minutes you do not have), the cold-standby surprise (the standby that has never taken production load is the one that fails when you finally need it), and the all-or-nothing capacity cliff (a region loss takes you to zero, not to half). Teams discover all three at 3 a.m., in that order, with an audience.

Who hits this: customer-facing, revenue- or safety-critical workloads where an hour of downtime costs more than a second region costs per month — payments, ordering, authentication, real-time APIs, anything with a contractual SLA above ~99.9%. It is not for an internal tool that can tolerate a 30-minute recovery; that workload wants active-passive with a warm standby, which is far cheaper and simpler. The art is knowing which workload you have, and engineering the data tier honestly once you commit.

To frame the whole field before the deep dive, here is what active-active removes from your single-point-of-failure list, and what it adds to your problem list in exchange:

Single-region risk it removes	How active-active removes it	New problem it hands you in exchange
Region is a single fault domain	Both regions serve live traffic concurrently	Data must be replicated and reconciled across regions
Minutes-long failover gap	Front Door evicts the bad origin in seconds, no DNS change	Health probes must be meaningful or you route to a dead region
Cold standby that has never run	Both stacks are continuously exercised by real users	Config/schema/secret drift between regions surfaces as failover bugs
Capacity drops to zero on outage	Capacity drops to ~half; survivors autoscale	You must provision survivors to absorb 100% load, not 50%
Untested recovery path	Region-drain becomes a routine deploy/maintenance move	You must run real region-kill game-days, not tabletop ones
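
The "provision survivors for 100% load" row is simple arithmetic, and it is worth writing down before sizing anything. The sketch below is a rough planning model, not an Azure API: the function name, the request rates, and the 20% headroom are all illustrative assumptions.

```python
import math


def per_region_instances(peak_rps: float, rps_per_instance: float,
                         regions: int = 2, regions_lost: int = 1,
                         headroom: float = 0.2) -> int:
    """Instances each region needs so the surviving regions carry full peak load.

    All inputs are planning assumptions; measure rps_per_instance with a load test.
    """
    survivors = regions - regions_lost
    if survivors < 1:
        raise ValueError("at least one region must survive")
    # Each survivor takes an equal share of the whole peak, plus headroom.
    load_per_survivor = peak_rps / survivors * (1 + headroom)
    return math.ceil(load_per_survivor / rps_per_instance)


if __name__ == "__main__":
    # Two regions, sized to survive the loss of one.
    print(per_region_instances(peak_rps=4000, rps_per_instance=250))
    # Same workload sized only for the steady-state 50/50 split (the trap).
    print(per_region_instances(peak_rps=4000, rps_per_instance=250, regions_lost=0))
```

The gap between the two numbers is the capacity you either pre-provision or trust autoscale to deliver during the failover itself.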

Learning objectives

By the end of this article you can:

Distinguish active-active from active-passive and pilot-light/warm-standby, and pick the right one for a given RTO/RPO budget and cost ceiling.
Design the global routing tier with Azure Front Door (or Traffic Manager / cross-region Load Balancer) — health probes, priority vs latency vs weighted routing, session affinity, and the failover timing maths.
Choose the data-tier pattern — single global writer (SQL failover groups), true multi-region writes (Cosmos DB), or event-driven/async — and explain the consistency, conflict, and cost trade-off of each.
Budget RPO and RTO from replication mode (sync vs async), probe interval, and failover automation, and turn replication lag into a first-class SLO.
Keep two regions identical via IaC (one Bicep/Terraform module, region as a parameter) and reason about deployment, secrets, and certificate parity.
Run a region-kill game-day and read the failover in Front Door, SQL failover-group, and Cosmos metrics — and diagnose the common failure modes when failover misbehaves.
Right-size the design so you pay for resilience only where an outage genuinely hurts, with rough INR/USD figures for each tier.

Prerequisites & where this fits

You should be comfortable with single-region Azure architecture: a resource group, a VNet with subnets and NSGs, an ingress (App Gateway or Front Door), a stateless compute tier (App Service / AKS / Functions), and a managed data store (Azure SQL or Cosmos DB). You should know what an Availability Zone is and why it is not a multi-region story — if that distinction is fuzzy, read Azure Regions & Availability Zones Explained first, because it is the floor this article builds on. You should be able to run az in Cloud Shell, read JSON output, and deploy a Bicep module.

This sits at the top of the Resiliency & Business Continuity track. It assumes the regional fundamentals and extends them across regions. It is the active-active sibling of Azure Backup & Site Recovery: Protection Strategies (which covers the active-passive / DR end of the spectrum), it leans on the global routing concepts in Azure Load Balancer vs Application Gateway, and its data tier connects to anything you have learned about Azure SQL and Cosmos DB. When a region does fail, the per-service diagnosis lives in companions like Troubleshooting Azure SQL: Connectivity, Timeouts, Throttling & Blocking and Troubleshooting App Service: 502/503, Cold Starts & Restart Loops.

A quick map of who owns which layer of an active-active stack, so during an incident you page the right person fast:

Layer	What lives here	Who usually owns it	What it can break during a region loss
Global routing (Front Door)	TLS, WAF, health probes, origin steering	Network / platform team	Routes to a dead origin (bad probe), or fails to evict it
DNS / naming	Apex/CNAME to the front door, TTLs	Network team	Stale TTLs delay any DNS-based failover (avoid DNS in the path)
Regional ingress	App Gateway / regional LB, private DNS	Network + app team	One region’s ingress down; global tier must steer away
Compute (stateless)	App Service / AKS / Functions per region	App / dev team	Survivor under-provisioned for 100% load → 503 under failover
Data tier (stateful)	SQL failover group / Cosmos multi-write	Data + app team	Split-brain, write loss, lag spike — the real risk
Messaging	Service Bus / Event Hubs geo-DR	Integration team	Duplicate or lost events if not idempotent
Config / secrets / certs	Key Vault, App Config per region	Platform team	Drift makes the survivor behave differently (subtle bugs)

Core concepts

Six mental models make every later decision obvious. Define each one precisely now; the deep sections then go option-by-option.

Active-active means every region serves live traffic concurrently. None is idle, none is a standby waiting to be promoted. A region loss is the removal of capacity, absorbed by the survivors and (ideally) by autoscale — not a disaster you recover from. Contrast this with active-passive (one region serves, the other waits) and pilot-light (the other region exists only as data + minimal scaffolding, scaled up on demand).

The global routing tier is the thing that makes “active-active” real. It is a layer-7 (or layer-4) entry point that health-probes each regional origin and steers each user to a healthy, nearby one. Azure Front Door is the default (anycast edge, TLS termination, WAF, fast origin failover); Traffic Manager is DNS-based (slower, TTL-bound); cross-region Load Balancer is the layer-4 option. The probe is the brain: it decides which origins are eligible, and a lying probe (one that returns 200 from a region that cannot actually serve) is the single most dangerous bug in the design.

RPO and RTO are budgets, and the data tier sets them. RPO (Recovery Point Objective) is how much data you can afford to lose — driven by replication mode (synchronous = ~0, asynchronous = your replication lag). RTO (Recovery Time Objective) is how fast you recover — driven by probe interval, failover automation, and (for the writer) promotion time. Stateless tiers give you RTO in seconds for free; the data tier is where both numbers are earned.

The stateless tiers are trivially duplicated; state is the entire problem. Ingress and compute are stateless and identical in both regions — you stamp them from one IaC module with the region as a parameter. State does not duplicate for free: you choose single global writer (SQL failover groups — one writable primary, async geo-replicas), multi-region writes (Cosmos DB — every region writable, with conflict resolution and tunable consistency), or event-driven (idempotent operations, replicated events, eventual consistency). This choice, not the web tier, determines your RPO, your consistency, and most of your cost.

Paired regions are Azure’s curated couples. Azure pairs most regions (e.g. Central India ↔ South India, East US ↔ West US) with two properties that matter: sequential platform updates (Azure won’t patch both halves of a pair at once) and geo-replication affinity (some services default their geo-replica to the pair). You are not required to use the pair, but pinning to it buys update isolation and is the conventional choice. Note the caveat: a few regions (notably Brazil South, and some newer regions) have non-reciprocal or no pairs, so verify before you assume.

Consistency is a dial, not a switch. Cosmos DB exposes five consistency levels (Strong → Bounded Staleness → Session → Consistent Prefix → Eventual). Azure SQL geo-replicas are read-only and asynchronous (so cross-region reads can lag the primary). Choosing weaker consistency buys availability and latency; choosing stronger buys correctness at the cost of cross-region round-trips (and, for Strong in Cosmos, a single-write-region constraint, which rules out multi-region writes). Active-active forces you to pick a number here rather than ignore it.

Pin the vocabulary side by side before the deep dive:

Term	One-line definition	Where it lives	Why it matters to active-active
Active-active	Both regions serve live traffic at once	Whole architecture	The premise; region loss = capacity loss, not outage
Active-passive / DR	One serves, the other waits to be promoted	Whole architecture	Cheaper, simpler, but has a failover gap
Global routing	Health-probed L7/L4 entry steering users	Front Door / TM / cross-region LB	The mechanism that hides a region loss
RPO	Max data loss tolerated	Data tier	Set by sync vs async replication
RTO	Max time to recover	Routing + data tier	Set by probe interval + promotion time
Paired region	Azure’s curated region couple	Platform	Update isolation + geo-replica affinity
Failover group	SQL’s auto-failover listener + replica set	Azure SQL	One writer, async read replicas, auto-promote
Multi-region writes	Every region accepts writes	Cosmos DB	True active-active writes; needs conflict resolution
Consistency level	The staleness/availability dial	Cosmos DB (5 levels)	Trades correctness vs latency/availability
Replication lag	How far the replica trails the primary	Data tier	Your real RPO; treat as an SLO
Health probe	The check that marks an origin eligible	Front Door / LB	A lying probe routes users to a dead region
Idempotency key	Makes a repeated operation safe	App code	Turns cross-region retries from “double” to “once”

Region pairs you’ll actually use

Pin your two regions to an Azure pair for sequential platform updates and geo-replica affinity. The common pairs (read the Notes column, and verify reciprocity before you assume it):

Primary region	Paired with	Geo (for geo-redundant storage)	Notes
Central India	South India	India	The default Indian pair; West India pairs one-way with South India
East US	West US	United States	Classic US pair
East US 2	Central US	United States	Common US-East pairing
West Europe	North Europe	Europe	The default European pair
UK South	UK West	United Kingdom	In-country pair for data residency
Southeast Asia	East Asia	Asia Pacific	Singapore ↔ Hong Kong
Australia East	Australia Southeast	Australia	In-country pair
Brazil South	South Central US (one-way)	N/A	Non-reciprocal; verify before relying on it

Which Azure services support multi-region (and how)

Active-active is only as resilient as your weakest stateful service. The mechanism differs per service — know each one’s native cross-region story:

Service	Native multi-region mechanism	Active-active writes?	RPO story
Azure Front Door	Global by design (anycast)	N/A (edge)	N/A
App Service / AKS / Functions	Stateless; stamp per region	Yes (stateless)	N/A (state is elsewhere)
Azure SQL Database	Auto-failover groups (async)	Single-writer (partition for AA)	~replication lag
Cosmos DB	Multi-region writes	Yes	~0 (conflicts possible)
Azure Storage	RA-GRS / RA-GZRS (read-only secondary)	No (read secondary)	Async; manual/account failover
Service Bus	Geo-DR alias / Premium geo-replication	Alias repoint	Metadata (or data on Premium)
Event Hubs	Geo-replication	Promote secondary	Stream-position handling
Key Vault	Auto-replicated within geo	Reads everywhere	Platform-managed
Azure Cache for Redis	Active geo-replication (Enterprise)	Yes (Enterprise)	Near-real-time

Resiliency patterns end to end (choose your altitude)

Before drilling into active-active specifically, locate it on the spectrum — because the most common architecture mistake is reaching for active-active when warm-standby would do, or shipping “active-passive with extra steps” and calling it active-active. The four canonical patterns, by what they cost and what they buy:

Pattern	Second region runs…	Typical RTO	Typical RPO	Relative cost	Failover trigger
Backup & restore	Nothing (restore from backup)	Hours	Hours (last backup)	~1.05×	Manual restore
Pilot light	Data replica + minimal core	10s of min	Minutes	~1.2×	Manual/scripted scale-up
Warm standby (active-passive)	Full stack, scaled down, no traffic	Minutes	Seconds–minutes	~1.4–1.7×	Auto/semi-auto promote + repoint
Active-active (multi-site)	Full stack, scaled up, live traffic	Seconds	Near zero to seconds (async replication lag)	~1.8–2.1×	Probe evicts origin; writer auto-promotes

The same four, judged on the qualities that decide which one a workload actually needs:

Quality	Backup & restore	Pilot light	Warm standby	Active-active
Is the recovery path continuously tested?	No	Partly	Partly	Yes (real traffic)
Cold-start risk on failover	High	High	Medium	None
Capacity after one region lost	0 → restore	Low → scale	Full (if pre-scaled)	~Half → autoscale
Data-tier complexity you own	Low	Low–med	Medium	High
Suits revenue/safety-critical?	No	Marginal	Often	Yes
Suits an internal back-office tool?	Yes	Yes	Sometimes	Overkill

Read the decision as a table — match the workload to the smallest pattern that meets its budget:

If the workload…	Tolerable RTO	Right pattern	Why not go higher
Internal report, nightly batch	Hours	Backup & restore	Active-active wastes money on a tool nobody pages for
Line-of-business app, business hours	10s of min	Pilot light / warm standby	A short recovery is acceptable; halve the bill
Customer portal, modest SLA	Minutes	Warm standby	Auto-promote covers it without dual-write complexity
Payments / auth / ordering / real-time API	Seconds	Active-active	Every minute down is revenue/trust; the data model can be partitioned or tuned
Strongly-consistent single-writer that cannot partition	Minutes	Warm standby (not active-active)	Multi-write would break correctness; don’t force it
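
The table above can be read as a small decision function. This is only a sketch of that reading: the numeric thresholds (one hour, ten minutes, one minute) are assumptions chosen to match the "Hours / 10s of min / Minutes / Seconds" buckets, and `choose_pattern` is a hypothetical helper, not part of any Azure tooling.

```python
def choose_pattern(tolerable_rto_seconds: float,
                   data_can_partition_or_multiwrite: bool = True) -> str:
    """Pick the smallest resiliency pattern that meets the RTO budget.

    Thresholds are illustrative; tune them to your own RTO buckets.
    """
    if tolerable_rto_seconds >= 3600:      # "Hours"
        return "backup & restore"
    if tolerable_rto_seconds >= 600:       # "10s of min"
        return "pilot light / warm standby"
    if tolerable_rto_seconds >= 60:        # "Minutes"
        return "warm standby"
    # Seconds-level RTO: active-active, unless the data model is a
    # strongly consistent single writer that cannot be partitioned.
    if data_can_partition_or_multiwrite:
        return "active-active"
    return "warm standby"


if __name__ == "__main__":
    print(choose_pattern(8 * 3600))   # nightly batch
    print(choose_pattern(10))         # payments API
    print(choose_pattern(10, data_can_partition_or_multiwrite=False))
```

Note the last case: a seconds-level wish does not override a data model that cannot take multi-region writes, which is the point of the final row of the table.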

The global routing tier — Front Door and the failover brain

Everything reaching your stacks passes through the global routing tier, so it is where “active-active” is won or lost. The default is Azure Front Door Standard/Premium: an anycast edge that terminates TLS, runs a WAF, health-probes each regional origin, and steers each request to a healthy, low-latency one. Because Front Door does this at the connection level (not via DNS), eviction of a failed origin happens in seconds, with no TTL to wait out — the property active-passive DNS failover lacks.

You configure an origin group containing your two regional origins (e.g. the public hostnames of each region’s App Gateway or App Service), a health probe (path, protocol, interval), and load-balancing settings (sample size, successful-sample threshold, latency-sensitivity). The probe is the brain. Point it at a /healthz that checks downstream dependencies (DB reachable, cache reachable) — not a static 200 from the web tier — or Front Door will happily keep routing to a region whose web tier is up but whose database is unreachable.

# Front Door Standard/Premium: an origin group with two regional origins and a
# meaningful health probe. (Profile already created; endpoint and route not shown.)
PROFILE=afd-pay-prod ; RG=rg-pay-global
az afd origin-group create -g $RG --profile-name $PROFILE \
  --origin-group-name og-app \
  --probe-path /healthz --probe-protocol Https --probe-request-type GET \
  --probe-interval-in-seconds 30 \
  --sample-size 4 --successful-samples-required 3 --additional-latency-in-milliseconds 50

az afd origin create -g $RG --profile-name $PROFILE --origin-group-name og-app \
  --origin-name central-india --host-name app-pay-ci.azurewebsites.net \
  --origin-host-header app-pay-ci.azurewebsites.net --priority 1 --weight 1000 --enabled-state Enabled

az afd origin create -g $RG --profile-name $PROFILE --origin-group-name og-app \
  --origin-name south-india --host-name app-pay-si.azurewebsites.net \
  --origin-host-header app-pay-si.azurewebsites.net --priority 1 --weight 1000 --enabled-state Enabled

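// The same origin group in Bicep; `profile` is assumed to be the existing Microsoft.Cdn/profiles resource symbol.
resource og 'Microsoft.Cdn/profiles/originGroups@2024-02-01' = {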
  parent: profile
  name: 'og-app'
  properties: {
    loadBalancingSettings: {
      sampleSize: 4
      successfulSamplesRequired: 3
      additionalLatencyInMilliseconds: 50
    }
    healthProbeSettings: {
      probePath: '/healthz'
      probeProtocol: 'Https'
      probeRequestType: 'GET'
      probeIntervalInSeconds: 30
    }
  }
}
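
What a "deep" /healthz might compute behind that probe path can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed implementation: `health_status`, `db_reachable`, and `cache_reachable` are hypothetical names, and real checks would open a database connection or ping the cache with a short timeout.

```python
import json
from typing import Callable, Dict, Tuple


def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every dependency check; return (HTTP status, JSON body).

    Any failing or raising check makes the whole region report 503, so the
    global probe stops routing users to a region that cannot actually serve.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that throws (timeout, DNS failure) counts as unhealthy.
            results[name] = False
    healthy = all(results.values())
    body = json.dumps({"status": "ok" if healthy else "unhealthy", "checks": results})
    return (200 if healthy else 503), body


# Hypothetical stand-ins for real dependency checks.
def db_reachable() -> bool:
    return True


def cache_reachable() -> bool:
    return False


if __name__ == "__main__":
    code, body = health_status({"db": db_reachable, "cache": cache_reachable})
    print(code, body)
```

Keep each check cheap and time-boxed: a health endpoint that is slower than the probe interval, or that fails on a non-critical dependency, causes the flapping that the "too deep" warning refers to.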

Routing methods — what “active-active” actually means at the edge

Front Door (and the other global options) support several steering methods. With equal priority and weight, Front Door uses latency-based routing among healthy origins — that is true active-active: every region takes the traffic nearest to it. Set different priorities and you get active-passive (priority 2 only serves when priority 1 is unhealthy). Weights let you do canary / gradual shift. Knowing which knob produces which behaviour stops you from accidentally building active-passive:

Routing method	How it picks an origin	Resulting topology	Use it for	Gotcha
Latency (equal priority/weight)	Lowest measured latency among healthy	Active-active	The default for active-active	Both regions must handle their share and the other’s on failover
Priority	Highest-priority healthy origin only	Active-passive	Cheap DR with auto-failover	Standby cold-ish unless you also send synthetic traffic
Weighted	Proportional to weights	Canary / gradual shift	Blue-green at region level, traffic splitting	Not for HA on its own; pair with health
Session affinity (cookie)	Pins a client to one origin	Sticky active-active	Legacy stateful apps	Defeats even spread; avoid for stateless
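
How those knobs combine can be modelled as a filter chain: keep the healthy origins, keep the best priority tier, keep those within the latency window of the fastest, then split by weight. The sketch below is a simplified model of that order for reasoning about configuration; the `Origin` class and function names are illustrative assumptions, and it ignores real-world details such as what happens when every origin is unhealthy.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Origin:
    name: str
    priority: int
    weight: int
    latency_ms: float
    healthy: bool = True
    enabled: bool = True


def eligible_origins(origins: List[Origin], additional_latency_ms: float = 50) -> List[Origin]:
    """Filter by health, then priority tier, then latency window."""
    available = [o for o in origins if o.enabled and o.healthy]
    if not available:
        return []  # simplification: the real service has its own fallback behaviour
    top_priority = min(o.priority for o in available)
    tier = [o for o in available if o.priority == top_priority]
    fastest = min(o.latency_ms for o in tier)
    return [o for o in tier if o.latency_ms <= fastest + additional_latency_ms]


def traffic_share(origins: List[Origin], additional_latency_ms: float = 50) -> Dict[str, float]:
    """Split traffic across eligible origins in proportion to weight."""
    pool = eligible_origins(origins, additional_latency_ms)
    total = sum(o.weight for o in pool)
    if total == 0:
        return {}
    return {o.name: o.weight / total for o in pool}


if __name__ == "__main__":
    ci = Origin("central-india", priority=1, weight=1000, latency_ms=20)
    si = Origin("south-india", priority=1, weight=1000, latency_ms=45)
    print(traffic_share([ci, si]))   # both within the 50 ms window
    ci.healthy = False
    print(traffic_share([ci, si]))   # survivor takes everything
```

Changing `si.priority` to 2 in this model turns the same pair into active-passive, which is exactly the accidental misconfiguration the table warns about.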

Probe and eviction timing — where your routing-tier RTO comes from

The time Front Door takes to evict a failed origin depends on the probe interval and on how many failed samples it takes to fall below the success threshold, plus probe timeouts and a few seconds of propagation. With a 30-second interval and “3 of 4 samples healthy,” two failed probes are enough to drop a hard-down origin below the threshold, so it is evicted within roughly a minute to a minute and a half in the worst case; tighten the interval and the sampling to shave that down (at the cost of more probe traffic and more sensitivity to blips). These are the knobs that set your routing-tier RTO; the data tier adds its own promotion time on top for writes:

Setting	What it controls	Typical value	Lower it to…	Trade-off of lowering
`probeIntervalInSeconds`	How often each origin is probed	30 s	Detect failure faster	More probe load; more sensitive to transient blips
`sampleSize`	How many recent probes are considered	4	—	Smaller = jumpier decisions
`successfulSamplesRequired`	How many must pass to stay eligible	3	Tolerate more failed probes before evicting	Lowering delays eviction; raising evicts faster but causes more false evictions on a flaky network
Probe path	What “healthy” means	`/healthz` (deep)	—	Too shallow = route to a dead region; too deep = flap
Latency sensitivity (`additionalLatencyInMilliseconds`)	Tie-breaking window for “near”	50 ms	Send each user more strictly to the fastest origin	Too tight = ping-pong between regions; a wider window spreads load more evenly
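
The timing arithmetic behind these settings can be sketched as a small planning model. It assumes a sliding window that starts full of successes and an origin that goes hard-down; real probes arrive from many edge locations, so treat the result as a budget estimate, not measured behaviour. The function names and the timeout and propagation defaults are assumptions for illustration.

```python
def failed_probes_to_evict(sample_size: int, successful_samples_required: int) -> int:
    """Failed samples needed before successes in the window drop below the threshold."""
    if not 0 < successful_samples_required <= sample_size:
        raise ValueError("need 0 < successful_samples_required <= sample_size")
    return sample_size - successful_samples_required + 1


def worst_case_eviction_seconds(probe_interval_s: float, sample_size: int,
                                successful_samples_required: int,
                                probe_timeout_s: float = 0.0,
                                propagation_s: float = 0.0) -> float:
    """Upper bound on time from failure to eviction under the simplified model.

    Worst case: the origin dies just after a successful probe, so the k-th
    failing probe lands up to k intervals after the failure.
    """
    k = failed_probes_to_evict(sample_size, successful_samples_required)
    return k * probe_interval_s + probe_timeout_s + propagation_s


if __name__ == "__main__":
    # The settings used in this article: 30 s interval, 3 of 4 samples.
    print(worst_case_eviction_seconds(30, 4, 3, propagation_s=10))
    # A tighter probe for a lower routing-tier RTO.
    print(worst_case_eviction_seconds(10, 4, 3, propagation_s=10))
```

Run it against your own settings before a game-day, then compare the estimate with the eviction time you actually observe.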

Choosing the global front door — Front Door vs Traffic Manager vs cross-region LB

Front Door is the right default for HTTP(S) active-active, but it is not the only global option, and L4 or non-HTTP workloads change the answer:

Capability	Front Door Std/Premium	Traffic Manager	Cross-region Load Balancer
OSI layer	L7 (HTTP/S)	DNS (steers names)	L4 (TCP/UDP)
Failover speed	Seconds (connection-level)	TTL-bound (tens of s–min)	Seconds
TLS termination + WAF	Yes	No	No
Caching / CDN	Yes	No	No
Non-HTTP protocols	No	Yes (any, via DNS)	Yes (TCP/UDP)
Health probe depth	HTTP path, deep	HTTP/TCP endpoint	TCP/HTTP
Best for	Web/API active-active	Legacy/any-protocol, DNS steering	L4 / regional LB front-ends
Anti-pattern	—	Anything needing sub-TTL failover	HTTP apps wanting WAF/caching

The decision in one line per case:

If your front-facing workload is…	Choose	Because
HTTPS web app or API	Front Door	Seconds-level failover, WAF, TLS, caching at the edge
TCP/UDP or non-HTTP	Cross-region LB (or Traffic Manager)	Front Door is HTTP-only
Legacy that only understands DNS	Traffic Manager	DNS steering with health, any protocol
HTTP but you also need regional L4 LB	Front Door over regional Standard LBs	Global L7 in front, regional L4 behind

If you do land on Traffic Manager (non-HTTP, or a protocol Front Door can’t terminate), its routing methods map to the same active-active vs active-passive choice — but every decision is DNS-resolution-bound, so failover is only as fast as your shortest safe TTL:

Traffic Manager method	How it steers	Active-active?	Use it for	TTL caveat
Performance	Lowest network latency to the user	Yes	Latency-optimal multi-region	Failover waits out the record TTL
Priority	Top healthy endpoint only	No (active-passive)	DNS-based DR with auto-failover	Standby cold unless warmed
Weighted	Proportional to weights	Yes (split)	Canary / gradual region shift	Not HA on its own
Geographic	By the user’s source geography	Yes (data-residency)	Compliance / data-residency routing	Mis-geo’d clients pinned wrongly
MultiValue	Returns multiple healthy IPs	Yes	Client-side failover across A records	Client picks; uneven spread
Subnet	By caller IP range mapping	Yes	Routing specific networks to specific regions	Maintenance of the IP map

The data tier — the decision that actually defines the design

Stateless tiers fail over for free. State does not. This section is the heart of the article: you pick one of three patterns, and that choice sets your RPO, your consistency story, your conflict-handling burden, and most of your bill. Get this right and the rest is plumbing; get it wrong and you have either a correctness bug (lost or conflicting writes) or an availability bug (a single writer that takes the whole system down when its region dies).

The three patterns, side by side, on the axes that matter:

Axis	A. Single global writer (SQL failover groups)	B. Multi-region writes (Cosmos DB)	C. Event-driven / async (Service Bus + idempotency)
Where writes land	One region’s primary; other is read-only replica	Every region accepts writes	Each region writes locally; events replicate
Consistency	Strong within primary; async (lagging) replica	Tunable: 5 levels (Strong→Eventual)	Eventual
RPO (data loss on region loss)	~replication lag (async)	~0 within a region; conflicts possible	~in-flight events (idempotent retries)
RTO for writes	Promotion time (~30–120 s typical once failover starts; automatic failover first waits out the grace period)	~0 (other regions already writable)	~0 (other region already producing)
Conflict resolution	N/A (single writer)	Required (LWW or custom)	Designed away via idempotency
Cross-region cost	Geo-replica + egress	Multi-write RU surcharge + egress	Egress + dup processing
Best fit	Relational, partitionable, single-writer-friendly	Globally distributed, write-anywhere	Async/queue-based, retry-safe operations
The trap	“Active-active” reads but writes still single-region	Weak default consistency surprises devs	Eventual consistency leaks into UX
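
Pattern C's "designed away via idempotency" cell is easier to trust once you see how little code it takes. The sketch below is illustrative only: `IdempotentProcessor` is a hypothetical class, and the in-memory dictionary stands in for a durable store shared by both regions (for example, a table with a unique constraint on the key).

```python
from typing import Any, Callable, Dict, Tuple


class IdempotentProcessor:
    """Apply each operation at most once per idempotency key."""

    def __init__(self) -> None:
        # Assumption: in production this is a durable, cross-region store.
        self._results: Dict[str, Any] = {}

    def handle(self, idempotency_key: str, operation: Callable[[], Any]) -> Tuple[Any, bool]:
        """Return (result, applied). A replayed key returns the stored result."""
        if idempotency_key in self._results:
            return self._results[idempotency_key], False
        result = operation()
        self._results[idempotency_key] = result
        return result, True


if __name__ == "__main__":
    balance = {"acct-1": 0}

    def credit_100() -> int:
        balance["acct-1"] += 100
        return balance["acct-1"]

    processor = IdempotentProcessor()
    print(processor.handle("pay-0001", credit_100))  # first delivery applies
    print(processor.handle("pay-0001", credit_100))  # cross-region retry is a no-op
    print(balance)
```

A real implementation must record the key and apply the effect atomically (one transaction), otherwise a crash between the two steps reintroduces the duplicate.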

Pattern A — Single global writer with Azure SQL auto-failover groups

An auto-failover group wraps one or more Azure SQL databases with a read-write listener and a read-only listener, a writable primary in one region, and an asynchronously replicated secondary in the other. Apps connect to the read-write listener; on failover the group promotes the secondary and the listener now points there — connection strings do not change. Because replication is asynchronous, your RPO is the replication lag (typically seconds, but it is not zero), and a forced failover during an outage can lose the un-replicated tail.

The crucial subtlety for active-active: the secondary is read-only until promoted. So true active-active writes with SQL means either (a) partition data by home region — a customer’s writes always go to their home region’s primary, and the other region holds their read replica — or (b) accept that one region owns all writes and the other serves reads + stands ready to promote. Pattern A is “active-active reads, single-writer writes” unless you do the partitioning work.
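
Option (a), partitioning by home region, comes down to a routing decision in the data-access layer. The sketch below assumes one failover group per home region, each with its primary in that home region and its secondary in the other; the group names `fog-pay-ci` and `fog-pay-si` and the helper `sql_host` are hypothetical, while the hostname suffixes follow the failover-group listener pattern.

```python
# Hypothetical: one failover group per home region (primary lives in that region).
FAILOVER_GROUPS = {
    "centralindia": "fog-pay-ci",
    "southindia": "fog-pay-si",
}


def sql_host(customer_home_region: str, app_region: str, is_write: bool) -> str:
    """Pick the listener hostname for a request.

    Writes always go to the read-write listener of the customer's home group.
    Reads served from the other region use that group's read-only listener,
    which points at the replica local to the app (and may lag the primary).
    """
    group = FAILOVER_GROUPS[customer_home_region]
    if is_write or app_region == customer_home_region:
        return f"{group}.database.windows.net"
    return f"{group}.secondary.database.windows.net"


if __name__ == "__main__":
    # A Central India customer landing on the South India stack:
    print(sql_host("centralindia", "southindia", is_write=True))
    print(sql_host("centralindia", "southindia", is_write=False))
```

Because the code only ever uses listener names, a failover of either group needs no change here; the cross-region write in the first call is the latency cost you accept for that customer until they are routed back home.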

# Create an auto-failover group across the paired regions (primary server already exists)
az sql failover-group create --name fog-pay --resource-group rg-pay-data \
  --server sql-pay-ci --partner-server sql-pay-si \
  --add-db payments \
  --failover-policy Automatic --grace-period 1   # hours of unavailability before auto-failover

resource fog 'Microsoft.Sql/servers/failoverGroups@2023-08-01-preview' = {
  parent: primaryServer            // sql-pay-ci (Central India)
  name: 'fog-pay'
  properties: {
    partnerServers: [ { id: secondaryServer.id } ]   // sql-pay-si (South India)
    readWriteEndpoint: {
      failoverPolicy: 'Automatic'
      failoverWithDataLossGracePeriodMinutes: 60     // wait before forced (lossy) failover
    }
    readOnlyEndpoint: { failoverPolicy: 'Enabled' }
    databases: [ payments.id ]
  }
}

Connect each region’s app to the right listener — writers to the read-write endpoint, read-heavy paths to the read-only endpoint for local latency:

Listener	Hostname pattern	Points at	Use it for	Behaviour on failover
Read-write	`fog-pay.database.windows.net`	Current primary	All writes; read-after-write	Repoints to promoted secondary automatically
Read-only	`fog-pay.secondary.database.windows.net`	Current secondary (replica)	Local-region reads, reports	Follows the role swap

The failover-group knobs you must understand, because their defaults decide whether you lose data or availability:

Setting	What it does	Default / typical	When to change	Trade-off
`failoverPolicy`	Automatic vs Manual failover	Automatic in `az sql failover-group create`; set it explicitly in Bicep	Choose Manual if a human must make the failover call	Automatic can fail over on a transient region blip
`failoverWithDataLossGracePeriodMinutes` (grace period)	How long to wait before a forced, possibly lossy failover	60 min (also the minimum)	Raise it to avoid lossy flips; for a tighter RTO, trigger the failover yourself	Shorter wait = faster recovery but higher data-loss risk
Read-only endpoint	Whether the RO listener is enabled	Enabled	Keep enabled to offload reads	Replica reads can be stale (async lag)
Replica count / regions	One secondary (FOG); more via active geo-replication	1 secondary	Add geo-replicas for more read regions	Each replica costs ~full DB price
Service tier (DTU/vCore)	Compute on both primary and secondary	Match prod	Size secondary = primary	You pay full price for the secondary

The failure modes specific to SQL failover groups — confirm and fix:

Symptom	Likely cause	Confirm	Fix
Writes fail after a region blip; app on RW listener	Auto-failover triggered, app cached old IP / DNS	`az sql failover-group show --query replicationRole`; check listener resolution	Use the listener name, not the server name; set short client DNS TTL; retry transient errors
Replica lag climbing, RPO at risk	Write-heavy load or throttled secondary	`sys.dm_geo_replication_link_status` / portal replication-lag metric	Scale the secondary to match; reduce write burst; alert on lag
Failover didn’t happen during a real outage	Grace period not elapsed, or policy Manual	`failoverWithDataLossGracePeriodMinutes`; `failoverPolicy`	Trigger a manual forced failover when you cannot wait out the grace period; set Automatic if unattended failover is required
Cross-region writes are slow	App in region B writing to primary in region A	App Insights dependency latency to SQL	Partition by home region, or accept B as read-only until promotion
Split data after forced failover	Lossy failover dropped un-replicated tail	Reconcile against event log / idempotency store	Use idempotent writes + an event log to replay the tail
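
"Alert on lag" means treating replication lag as an SLO with a number attached. The sketch below shows one way to evaluate a window of lag samples; it assumes the samples were collected from the `replication_lag_sec` column of `sys.dm_geo_replication_link_status`, and the function name, the nearest-rank percentile method, and the 5-second SLO are illustrative choices.

```python
import math
from typing import Dict, List, Union


def lag_slo_report(lag_samples_s: List[float], slo_s: float,
                   percentile: float = 0.95) -> Dict[str, Union[float, bool]]:
    """Summarise replication lag against an SLO using a nearest-rank percentile."""
    if not lag_samples_s:
        raise ValueError("need at least one lag sample")
    ordered = sorted(lag_samples_s)
    rank = max(1, math.ceil(percentile * len(ordered)))  # 1-based nearest rank
    observed = ordered[rank - 1]
    return {
        "percentile_lag_s": observed,
        "max_lag_s": ordered[-1],
        "breached": observed > slo_s,
    }


if __name__ == "__main__":
    # Hypothetical samples in seconds, one per polling interval.
    samples = [1, 2, 2, 3, 40]
    print(lag_slo_report(samples, slo_s=5))
```

The maximum is reported alongside the percentile because the worst sample is what a forced failover at that moment would have cost you in RPO.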

Pattern B — True multi-region writes with Cosmos DB

Azure Cosmos DB is the cleanest fit for active-active writes: enable multi-region writes and every region you add becomes a writable replica. Writes are accepted locally (single-digit-millisecond latency in-region), replicated to the others, and conflicts (two regions writing the same item) are resolved by a policy — Last-Writer-Wins (LWW) on a timestamp by default, or a custom merge procedure. The price is a higher RU cost for multi-write and a default consistency (Session) that surprises developers expecting strong reads everywhere.

# Cosmos account with two regions and multi-region writes enabled.
# Zone redundancy only applies where the region has Availability Zones; South India is
# assumed here to have none, so verify current zone support before you deploy.
az cosmosdb create --name cosmos-pay --resource-group rg-pay-data \
  --locations regionName=centralindia failoverPriority=0 isZoneRedundant=true \
  --locations regionName=southindia  failoverPriority=1 isZoneRedundant=false \
  --enable-multiple-write-locations true \
  --default-consistency-level Session

resource cosmos 'Microsoft.DocumentDB/databaseAccounts@2024-05-15' = {
  name: 'cosmos-pay'
  location: 'centralindia'
  properties: {
    databaseAccountOfferType: 'Standard'
    enableMultipleWriteLocations: true
    consistencyPolicy: {
      defaultConsistencyLevel: 'Session'           // pick deliberately; see table
      maxStalenessPrefix: 100000                    // only used for BoundedStaleness
      maxIntervalInSeconds: 300
    }
    locations: [
      { locationName: 'centralindia', failoverPriority: 0, isZoneRedundant: true }
      { locationName: 'southindia',  failoverPriority: 1, isZoneRedundant: true }
    ]
  }
}

The five consistency levels are the dial you must set on purpose. Stronger = more correct, more latency/cost, less availability under partition; weaker = faster, cheaper, more anomalies your code must tolerate:

Level	Guarantee	Read latency	Availability under partition	Multi-write friendly?	Use it when
Strong	Linearizable; reads see latest committed	Highest (cross-region quorum)	Lowest	No (single write region only)	You truly need global linearizability (rare)
Bounded Staleness	Lags by at most K versions or T seconds	High	Medium	Yes	You need “close to fresh” with a bounded window
Session (default)	Read-your-writes within a session	Low	High	Yes	Most apps; per-user consistency is enough
Consistent Prefix	Never see out-of-order writes	Low	High	Yes	Order matters but absolute freshness doesn’t
Eventual	Converges, no order guarantee	Lowest	Highest	Yes	Counters, likes, telemetry — anomalies are fine

Conflict resolution when two regions write the same item:

Mode	How it resolves	Configure	Best for	Limitation
Last-Writer-Wins (LWW)	Highest value of a chosen property (default: `_ts`) wins	`conflictResolutionPolicy: LastWriterWins`	Most cases; simple, automatic	Silently drops the “loser” write
Custom (stored procedure)	Your merge logic runs on conflict	`conflictResolutionPolicy: Custom` + sproc	Mergeable state (carts, sets)	You must write & maintain the merge
Conflict feed (manual)	Conflicts surfaced to you to resolve	Read the conflicts feed	Audit / human-in-loop reconciliation	You build the resolver and the UX
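
To make the difference between the first two modes concrete, here is a minimal sketch that models only the outcome of a conflict. It is not the Cosmos DB SDK: `ItemVersion`, `last_writer_wins`, and `merge_carts` are hypothetical names, `ts` stands in for `_ts`, and a real custom policy would be a server-side stored procedure rather than Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemVersion:
    """One region's copy of the same item (hypothetical model, not the Cosmos SDK)."""
    region: str
    ts: int            # stand-in for the item's _ts
    cart: frozenset    # mergeable state, e.g. product IDs in a cart


def last_writer_wins(a: ItemVersion, b: ItemVersion) -> ItemVersion:
    """Keep the version with the higher ts; the other write is discarded."""
    return a if a.ts > b.ts else b


def merge_carts(a: ItemVersion, b: ItemVersion) -> ItemVersion:
    """Custom-merge stand-in: union the carts so neither write is lost."""
    return ItemVersion(region="merged", ts=max(a.ts, b.ts), cart=a.cart | b.cart)


# Two regions update the same cart during a partition.
central = ItemVersion("centralindia", ts=1_700_000_100, cart=frozenset({"book"}))
south = ItemVersion("southindia", ts=1_700_000_101, cart=frozenset({"pen"}))

lww = last_writer_wins(central, south)
merged = merge_carts(central, south)
print(sorted(lww.cart))     # ['pen']  (the "book" write is gone)
print(sorted(merged.cart))  # ['book', 'pen']
```

LWW is fine when the later write should simply replace the earlier one; for state like a cart, the dropped "loser" is a real customer action, which is the case for a merge.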

Cosmos failure modes in active-active:

Symptom	Likely cause	Confirm	Fix
“Lost” updates after a partition healed	LWW dropped a concurrent write	Conflicts feed; compare `_ts`	Use custom merge for mergeable state; design idempotent ops
Reads look stale in region B	Session/Eventual consistency as configured	Check `defaultConsistencyLevel`; use session tokens	Raise to Bounded Staleness, or pass session tokens through
RU costs jumped after enabling multi-write	Multi-write RU surcharge + cross-region replication	Cosmos metrics: RU/s by region	Right-size RU/s; use autoscale RU; partition hot keys
429 throttling under failover load	Survivor region RU/s sized for half the traffic	`TotalRequestUnits` + `429` count	Autoscale RU; provision survivor for 100%
Hot partition in one region	Poor partition key choice	Cosmos metrics: per-partition RU	Re-key for even distribution; spread the hot tenant

Pattern C — Event-driven replication with idempotency

When the data model resists both single-writer and multi-write — or when operations are naturally asynchronous — you sidestep distributed transactions entirely: make every operation idempotent (a repeated “charge order #123” applies once, not twice), have each region act on local state, and replicate the events between regions with Service Bus geo-disaster-recovery (alias-based namespace pairing) or Event Hubs geo-replication. You accept eventual consistency but you never have a distributed-lock or conflict problem, because retries are safe by construction.

# Service Bus geo-DR: pair a primary and secondary namespace under a stable alias.
# Apps connect to the alias; on failover the alias repoints to the secondary.
az servicebus georecovery-alias set --resource-group rg-pay-msg \
  --namespace-name sb-pay-ci --alias sb-pay \
  --partner-namespace $(az servicebus namespace show -g rg-pay-msg -n sb-pay-si --query id -o tsv)

The two messaging geo options differ in what they replicate — pick by whether you need the data or just the metadata to survive:

Option	What replicates	Failover model	Data loss risk	Use it for
Service Bus geo-DR (alias)	Entities/metadata (not in-flight messages)	Manual alias repoint to secondary	In-flight messages not replicated	Queue/topic topology survival; idempotent consumers
Service Bus Premium geo-replication	Metadata and message data	Promote replica	Lower (data replicated)	When losing queued messages is unacceptable
Event Hubs geo-replication	Namespace metadata (+ data in newer tiers)	Promote secondary	Stream position handling needed	Telemetry/stream pipelines

Idempotency is the load-bearing discipline. The patterns that make cross-region retries safe:

Technique	How it works	Where to store the key	Good for
Idempotency key (client-supplied)	Caller sends a unique key; server dedups	Cosmos/SQL unique index on the key	Payments, order submission
Dedup window (broker)	Broker drops duplicate message IDs in a window	Service Bus duplicate detection	At-least-once delivery to exactly-once effect
Upsert by natural key	Write is an upsert (`MERGE` in Azure SQL, the upsert operation in Cosmos DB)	The store itself	State-convergent updates
Outbox pattern	Write state + event in one local tx; relay later	Local DB outbox table	Avoiding dual-write inconsistency
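
As a rough illustration of the first and last rows working together, the sketch below uses an in-memory SQLite database as a stand-in for the regional store. The table names, the `charge` and `relay_outbox` functions, and the event format are all assumptions made for the example; the point is only that the unique key rejects a retried charge, and that state and event commit in one local transaction.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """In-memory stand-in for the regional database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE charges (
            idempotency_key TEXT PRIMARY KEY,   -- unique index does the dedup
            order_id        TEXT NOT NULL,
            amount_paise    INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            id      INTEGER PRIMARY KEY AUTOINCREMENT,
            event   TEXT NOT NULL,
            relayed INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def charge(conn: sqlite3.Connection, key: str, order_id: str, amount_paise: int) -> bool:
    """Apply the charge once; return False if this key was already processed."""
    try:
        with conn:  # one local transaction: state row and event row commit together
            conn.execute("INSERT INTO charges VALUES (?, ?, ?)", (key, order_id, amount_paise))
            conn.execute("INSERT INTO outbox (event) VALUES (?)", (f"charged:{order_id}",))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate key: the retry is a no-op


def relay_outbox(conn: sqlite3.Connection, publish) -> int:
    """Publish unrelayed events, then mark them; returns how many were sent."""
    rows = conn.execute("SELECT id, event FROM outbox WHERE relayed = 0 ORDER BY id").fetchall()
    for row_id, event in rows:
        publish(event)  # at-least-once: a crash before the UPDATE re-sends next pass
        with conn:
            conn.execute("UPDATE outbox SET relayed = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Calling `charge(conn, "key-1", "123", 5000)` twice returns `True` and then `False`, leaving one charge row and one outbox event. Because the relay is at-least-once, the consumer on the other side still needs its own dedup (for example the broker dedup window from the table).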

Picking the data pattern — the decision table

The whole data-tier decision in one grid — start at the top and stop at the first row that matches:

If your data…	And you need…	Pick	Active-active writes?
Is relational and partitions cleanly by tenant/region	Familiar SQL, strong in-region consistency	SQL failover groups + home-region partitioning	Yes (per-partition)
Is relational but cannot partition; one writer is fine	Simplicity over write-locality	SQL failover group (single writer)	No (active-active reads)
Is document/key-value, globally distributed	Write-anywhere, tunable consistency	Cosmos DB multi-region writes	Yes
Tolerates eventual consistency; ops are retry-safe	No distributed transactions	Event-driven + idempotency	Yes (async)
Needs global linearizability on every read	Correctness above all	Single strong writer (not active-active writes)	No

Keeping two regions identical — IaC, drift, and parity

Active-active fails in subtle ways when the two regions are not byte-for-byte equivalent: a feature flag set in one region and not the other, a TLS cert renewed in region A but expired in B, an app setting that differs by a typo. The discipline is non-negotiable: deploy both regions from one IaC module with the region as a parameter, and treat any drift as an incident. The compute and ingress tiers are stateless, so this is mechanical: a for loop over a region list in Bicep, or a Terraform module called twice.

// One module, stamped per region. main.bicep:
param regions array = [ 'centralindia', 'southindia' ]

module stamp 'region-stack.bicep' = [for r in regions: {
  name: 'stack-${r}'
  params: {
    location: r
    appName: 'app-pay-${take(r,2)}'   // app-pay-ce / app-pay-so
    skuName: 'P1v3'
    // identical everything else — only location changes
  }
}]

The parity checklist — everything that must match across regions, how it drifts, and how you catch it:

Component	Must be identical because…	Common drift cause	How to detect drift
App settings / config	Survivor behaves differently otherwise	Hotfix applied to one region	Diff `az webapp config appsettings list` both regions in CI
Secrets / Key Vault refs	A missing secret crash-loops the survivor	Secret rotated in one vault only	Compare secret names/versions; use one rotation pipeline
TLS certificates	Expired cert in the standby fails on failover	Renewed in A, not B	Cert-expiry alert on both; automate renewal
Schema / migrations	A write to an un-migrated replica fails	Migration ran in one region	Migration gate in CI applies to both / shared DB
Compute SKU & count	Survivor can’t absorb 100% load	Scaled up one region manually	IaC drift detection (`what-if` / `terraform plan`)
WAF / NSG rules	One region blocks legit traffic	Rule added ad hoc	Policy-as-code; deny manual portal edits
Feature flags	Behaviour diverges under failover	Flag toggled per region	Centralise flags (App Config) with region-agnostic targeting

Drift-detection commands you run on a schedule:

# Bicep what-if against both regions' resource groups — anything non-empty is drift
az deployment group what-if -g rg-pay-ci -f main.bicep -p regions="['centralindia']"
az deployment group what-if -g rg-pay-si -f main.bicep -p regions="['southindia']"

# Diff app settings between the two regions (should produce no differences)
diff <(az webapp config appsettings list -g rg-pay-ci -n app-pay-ce --query "sort_by([],&name)" -o json) \
     <(az webapp config appsettings list -g rg-pay-si -n app-pay-so --query "sort_by([],&name)" -o json)
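
A plain `diff` prints setting values, which may include secrets, and it flags settings that are meant to differ per region. A possible alternative for CI is sketched below: it takes the two JSON lists that `az webapp config appsettings list -o json` returns (objects with `name` and `value`) and reports only the names that drifted. The function name `settings_drift` and the `ignore` parameter are assumptions for illustration.

```python
def settings_drift(region_a: list, region_b: list, ignore: frozenset = frozenset()) -> dict:
    """Compare two app-settings lists and report drifted setting names only.

    Each list holds dicts with at least "name" and "value" keys. Names in
    `ignore` are settings that are expected to differ between regions.
    """
    a = {s["name"]: s["value"] for s in region_a if s["name"] not in ignore}
    b = {s["name"]: s["value"] for s in region_b if s["name"] not in ignore}
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "different": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


def has_drift(report: dict) -> bool:
    """True when any of the three drift buckets is non-empty."""
    return any(report.values())


# Inline sample data standing in for the two CLI outputs.
central = [
    {"name": "FEATURE_X", "value": "on", "slotSetting": False},
    {"name": "LOG_LEVEL", "value": "info", "slotSetting": False},
]
south = [
    {"name": "FEATURE_X", "value": "off", "slotSetting": False},
    {"name": "TIMEOUT_S", "value": "30", "slotSetting": False},
]
report = settings_drift(central, south)
```

A CI job could fail the build whenever `has_drift(report)` is true, which turns "treat any drift as an incident" into an enforced gate.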

RPO/RTO budgeting and the SLO maths

You do not get to wish for “RPO ≈ 0, RTO ≈ seconds”; you compute it from the mechanisms you chose. Getting these numbers right can be worth two nines of availability. Build the budget from the parts:

Budget component	Set by	Active-active typical	What worsens it
Routing-tier RTO	Probe interval × sample threshold + propagation	~30–90 s to evict a bad origin	Long probe interval; shallow probe that lies
Writer RTO (SQL)	Failover-group promotion time	~30–120 s	Long grace period; manual policy
Writer RTO (Cosmos multi-write)	None — other region already writable	~0	N/A
RPO (SQL async)	Replication lag at moment of loss	Seconds (not zero)	Write burst; throttled secondary
RPO (Cosmos)	In-region durability; conflict outcome	~0 committed; LWW may drop a loser	Concurrent cross-region writes
RPO (event-driven)	In-flight, un-replicated events	~ a few events (idempotent retries recover)	Geo-DR that doesn’t replicate message data
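
The routing-tier row can be turned into arithmetic. The sketch below is a simplified model, not Front Door's actual algorithm: it reads "sample threshold" as the number of failed probes needed before fewer than the required successes remain in the sample window, and it treats propagation as a flat number you supply. Real eviction times vary because probes arrive from many edge locations. The function name and parameters are assumptions.

```python
def eviction_seconds(
    probe_interval_s: float,
    sample_size: int,
    successful_required: int,
    propagation_s: float = 0.0,
) -> float:
    """Estimate time to evict a dead origin under a simplified single-prober model."""
    if not 0 < successful_required <= sample_size:
        raise ValueError("successful_required must be between 1 and sample_size")
    # The origin turns unhealthy once the window can no longer hold enough successes.
    failures_needed = sample_size - successful_required + 1
    return probe_interval_s * failures_needed + propagation_s


# 30 s interval, 3-of-4 samples, 10 s assumed propagation
print(eviction_seconds(30, 4, 3, propagation_s=10))  # 70.0
# Same thresholds with a 5 s interval
print(eviction_seconds(5, 4, 3, propagation_s=10))   # 20.0
```

The model makes the trade-off visible: the probe interval is the dominant term, so shortening it buys RTO at the price of more probe traffic against each origin.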

The budget tells you how long; this tells you what fires the failover and whether a human is in the loop — the second half of the RTO story:

Failover trigger	Who/what initiates it	Automatic?	Data-loss risk	Typical time
Front Door origin eviction	Probe failing sample threshold	Yes	None (routing only)	~30–90 s
SQL FOG planned failover	Operator (`az sql failover-group set-primary`)	No (manual)	None (sync drain)	~30–60 s
SQL FOG forced failover	Operator or auto after grace period	Auto (Automatic policy)	Possible (async tail)	~30–120 s
Cosmos region failover	Operator or auto-failover priority	Auto (if enabled)	~0 (multi-write)	Seconds
Service Bus geo-DR	Operator alias repoint	Manual	In-flight messages	Seconds–minutes
Storage account failover	Operator	Manual	Async tail	Up to ~1 h

Translate availability targets into what they allow per year, so the business chooses with eyes open:

Availability	Downtime / year	Downtime / month	Realistic with…
99.9%	8.77 h	43.8 min	Single region + zones, good ops
99.95%	4.38 h	21.9 min	Warm standby with auto-failover
99.99%	52.6 min	4.38 min	Active-active, tuned probes, auto-promote
99.999%	5.26 min	26.3 s	Active-active + flawless data tier + drills (hard)
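
The figures in this table follow from a one-line calculation. The sketch below reproduces them, assuming a 365.25-day year and a month of one twelfth of that; the function names are illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 h, averaging in leap years


def downtime_hours_per_year(availability_pct: float) -> float:
    """Allowed downtime per year, in hours, for a given availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def downtime_minutes_per_month(availability_pct: float) -> float:
    """Allowed downtime per average month, in minutes."""
    return downtime_hours_per_year(availability_pct) * 60 / 12


for pct in (99.9, 99.95, 99.99, 99.999):
    print(
        f"{pct}%: {downtime_hours_per_year(pct):.2f} h/year, "
        f"{downtime_minutes_per_month(pct):.2f} min/month"
    )
```

Running the same function against a proposed target before it goes into a contract is a quick way to check whether the RTO budget above can fit inside it: a single 70-second failover already consumes more than the 26.3 seconds a month that 99.999% allows.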

Treat replication lag as a first-class SLO and alert on it — it is your real RPO. KQL over the lag metric:

// Alert when SQL geo-replication lag (seconds) breaches the RPO budget
AzureMetrics
| where ResourceProvider == "MICROSOFT.SQL" and MetricName == "replication_lag_sec"
| summarize maxLag = max(Maximum) by bin(TimeGenerated, 5m), Resource
| where maxLag > 30   // RPO budget = 30 s
| order by TimeGenerated desc

The signals to wire before the next failover — the leading indicators that catch trouble before users do:

Signal	Source metric	Alert threshold (starting point)	Why it’s leading
SQL replication lag	`replication_lag_sec`	> RPO budget (e.g. 30 s)	Predicts data loss on a lossy failover
Cosmos staleness / conflicts	Conflicts feed count; staleness	Any sustained conflicts	LWW may be dropping real writes
Origin health flapping	Front Door origin health %	< 100% intermittently	A region is becoming ineligible
Survivor saturation	App Service CPU% / Cosmos RU% per region	> 70% sustained	Survivor can’t take 100% on failover
429 throttling	Cosmos `429` count by region	> 0 sustained	RU/s under-provisioned for failover load
5xx at the edge	Front Door `Http5xx`	> 1% of requests	Routing to a region that can’t serve
Cross-region egress	Inter-region data transfer (GB)	Trend vs budget	Chatty cross-region calls / runaway cost
Cert expiry (both regions)	Key Vault cert expiry	< 30 days, either region	Standby cert expiry only bites on failover

Composite SLA is the other number leadership asks for, and active-active changes its shape. For services in series (a request must pass all of them), multiply the SLAs — adding components lowers the composite. For a workload deployed redundantly across two regions (either can serve), the combined availability is 1 − (1 − A)², which raises it. That asymmetry is the whole financial argument for active-active:

Configuration	Formula	Example (per-component A = 99.9%)	Composite
Two components in series	A₁ × A₂	0.999 × 0.999	99.80% (worse)
Three components in series	A₁ × A₂ × A₃	0.999³	99.70% (worse)
Same stack in two regions (redundant)	1 − (1 − A)²	1 − (0.001)²	99.9999% (better)
Front Door SLA (the edge gate)	Stated SLA	Front Door availability SLA	~99.99%
Realistic end-to-end active-active	A_edge × (1 − (1 − A)²)	0.9999 × 0.999999	~99.99% (edge is the ceiling)

The practical reading: the redundant stack math gives you headroom, but your composite is capped by the single global front door in front of it — so the edge SLA, not the doubled stack, is usually your ceiling. Adding more series components (extra hops, extra dependencies) erodes it; adding region redundancy to each tier restores it.
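
The series and redundancy formulas are easy to check numerically. The sketch below assumes failures are independent, which real regions only approximate because they share dependencies such as identity and the edge itself, so treat the outputs as upper bounds. The function names are illustrative, and the 0.9999 edge figure is the approximate Front Door SLA quoted in the table.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every component must be up: multiply the availabilities."""
    return prod(availabilities)


def redundant(availability: float, copies: int = 2) -> float:
    """Any one copy is enough: 1 minus the chance that all copies are down."""
    return 1 - (1 - availability) ** copies


two_in_series = series(0.999, 0.999)      # about 0.998001
three_in_series = series(0.999, 0.999, 0.999)
two_regions = redundant(0.999)            # about 0.999999
end_to_end = series(0.9999, two_regions)  # edge in front of the redundant stack

print(f"{two_in_series:.4%} {three_in_series:.4%} {two_regions:.4%} {end_to_end:.4%}")
```

Treating the edge as one more series component in front of the redundant stack gives a result just under the edge's own figure, which is why the edge SLA ends up as the practical ceiling.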

Architecture at a glance

Read the diagram left to right as the request and data paths an active-active payments stack actually uses. At the far left, users arrive over HTTPS and hit Azure Front Door at the edge, which terminates TLS, runs the WAF, and, because both regional origins share equal priority and weight, uses latency-based routing to steer each user to the nearest healthy region. Front Door continuously probes a deep /healthz in each region; once one region’s probe fails the sample threshold, Front Door evicts that origin (typically within the 30–90 seconds of the routing-tier budget) and serves everyone from the survivor, with no DNS change and no human in the loop.

The middle of the diagram is the two regional stacks — Central India and South India — each a self-contained, identical stamp: a regional ingress, an App Service / AKS compute tier, and a regional view of the data store. These tiers are stateless and IaC-identical, so either region can serve a full request alone. The right-hand zone is the data tier, the real decision: an Azure SQL auto-failover group (one writable primary, an async read-only replica, a read-write listener that auto-repoints on promotion) for relational state, and Cosmos DB with multi-region writes (every region writable, Session consistency, LWW conflict resolution) for document state. Idempotent payment events flow through Service Bus geo-DR so a retried charge is never double-applied. The numbered badges mark the four places this design most often breaks — a lying health probe, a single-writer cross-region write, replication lag blowing the RPO, and an under-provisioned survivor — and the legend narrates each as symptom → confirm → fix.

Real-world scenario

Paython, a fictional but realistic Indian payments processor, runs an authorization API: a .NET 8 service on App Service P1v3 behind Application Gateway + WAF, with Azure SQL for the ledger and Cosmos DB for the idempotency/transaction store. Traffic averages 900 requests/second, spiking to ~2,400 rps on salary-day evenings. The platform team is six engineers; the original single-region (Central India) stack cost about ₹95,000/month all-in. After a 47-minute regional incident cost them a six-figure chargeback dispute and a hard conversation with their largest merchant, they committed to active-active across Central India ↔ South India.

The rebuild took the three-pattern decision seriously. The ledger (relational, must be auditable, writes must be strongly consistent in-region) went onto an auto-failover group, with data partitioned by merchant home-region: a merchant’s authorizations always write to their home region’s SQL primary, while the other region holds a read-only replica for low-latency reads and instant promotion. The idempotency store (write-anywhere, must survive a region loss with zero RPO) went onto Cosmos DB multi-region writes at Session consistency with LWW on _ts — a duplicate “authorize txn #X” from a cross-region retry resolves to one record. Idempotent authorization events flow through Service Bus geo-DR so a retried charge is applied exactly once. Front Door fronts both regions with equal priority (true latency-based active-active) and a /healthz that checks SQL and Cosmos reachability — not a static 200.

The first game-day exposed the classic bug. The team killed the Central India stack at 14:30 on a Tuesday. Front Door correctly evicted the origin in ~40 seconds and South India took all traffic — and then started throwing 429 and 503. Cause: the survivor’s Cosmos container and App Service plan were each sized for half the load, so when they suddenly carried 100% they throttled. The fix was a sizing rule that became policy: each region must be provisioned (or able to autoscale) to the full load, not its steady-state share. They moved Cosmos to autoscale RU/s with a max set to peak-total, set App Service autoscale max to cover full peak, and re-ran the drill. Second game-day: clean. South India absorbed 2,400 rps with p95 holding at 240 ms.

The second bug was sneakier, and only the third game-day caught it. During a forced (lossy) SQL failover, a handful of authorizations written to the Central India primary in the final seconds before the kill had not yet replicated: the async lag was ~4 seconds under salary-day write burst, blowing the team’s stated RPO of 1 second. Two changes fixed it. They scaled the secondary to match the primary (a throttled secondary had been the lag source), bringing steady-state lag under 1 second, and they made the authorization path fully idempotent against the Cosmos store + event log, so the un-replicated tail could be replayed from events after promotion rather than lost. They also added a replication-lag SLO alert at 1 second so lag creep is caught before an outage, not during one.

When the next real regional incident hit Central India eight weeks later, the numbers told the story: Front Door evicted the unhealthy origin in 34 seconds, South-India merchants saw nothing, Central-India merchants were served read-only from South India for ~70 seconds while the failover group promoted, then writes resumed — RTO ≈ 70 s, RPO ≈ 0 for committed authorizations, replayed tail included. Monthly cost landed at ₹178,000 (≈1.9× single-region) — and the chargeback-dispute risk that had triggered the whole project went to near zero. The lesson on the team’s wall: “Active-active is not ‘deploy it twice.’ It’s ‘provision the survivor for the whole load, make the writes idempotent, and prove it with a kill switch.’”

The game-day timeline, because the order of discovery is the lesson:

Drill	What they killed	What happened	Root cause	Fix that became policy
#1	Central India stack	Front Door evicted in 40 s ✓, then survivor 429/503	Survivor sized for half load	Provision/autoscale each region for full load
#2	Central India stack	Clean traffic shift, p95 held	—	— (validated the sizing fix)
#3	Forced lossy SQL failover	~4 s of writes lost; RPO breached	Throttled secondary → 4 s lag	Match secondary SKU; idempotent replay from events
#4	Both data + compute	Clean; tail replayed from event log	—	Lag SLO alert at 1 s
Real incident	(Azure) Central India networking	RTO 70 s, RPO 0; merchants unaffected	—	The design held

Advantages and disadvantages

The active-active model both removes your single largest availability risk and hands you the hardest distributed-systems problems. Weigh it honestly:

Advantages (why you build it)	Disadvantages (why it’s expensive and hard)
Near-zero RTO/RPO for a single-region loss, with automatic, human-free failover	Cost roughly doubles — full capacity in two regions, plus cross-region egress and replication
Both stacks take real traffic, so there is no untested standby waiting to disappoint you	Distributed-data complexity (conflict resolution, partitioning, or eventual consistency) is now your problem
Lower latency globally — users routed to the nearest healthy region	Operational discipline: config/schema/secret/cert parity or failover surfaces subtle bugs
Maintenance/deploys can drain one region at a time — a natural region-level blue-green	Testing burden: real region-kill game-days, not tabletop ones; an untested path is a liability
Capacity degrades to ~half on a region loss, not to zero	Survivor must be provisioned for 100% load, eroding the “pay for what you use” saving
Failover is a routine operation, not a once-a-year scramble	More moving parts (Front Door, failover groups, geo-DR) = more to monitor and more to break

When each side dominates: the advantages dominate for revenue- or safety-critical, customer-facing workloads where an hour of downtime costs more than the second region costs per month, and where the data model can be partitioned or tolerates tunable consistency. The disadvantages dominate for internal tools that tolerate a 30-minute recovery (use warm standby), for data models that demand a single strongly-consistent writer and cannot be partitioned, and for teams that lack the maturity to operate two live stacks — a poorly-run active-active is less reliable than a well-run single region, because it adds failure modes (split-brain, drift, conflict bugs) without the discipline to contain them.

Hands-on lab

You will stand up the global routing skeleton of an active-active app: two regional web apps stamped identically, with a Front Door in front that health-probes both and routes by latency. Then you kill one region and watch Front Door fail over within about a minute, with no DNS change. The lab is low-cost rather than free (B1 plans and Front Door Standard are both billed, so delete everything at the end). Run in Cloud Shell (Bash).

Step 1 — Variables and one resource group (it holds the resources for both regions).

SUFFIX=$RANDOM
RG=rg-aa-lab
APP_CI=app-aa-ci-$SUFFIX     # Central India
APP_SI=app-aa-si-$SUFFIX     # South India
az group create -n $RG -l centralindia -o table

Step 2 — Stamp two identical B1 web apps in two regions.

az appservice plan create -n plan-ci -g $RG -l centralindia --is-linux --sku B1 -o table
az appservice plan create -n plan-si -g $RG -l southindia  --is-linux --sku B1 -o table
az webapp create -n $APP_CI -g $RG -p plan-ci --runtime "NODE:20-lts" -o table
az webapp create -n $APP_SI -g $RG -p plan-si --runtime "NODE:20-lts" -o table

Expected: two apps, identical except for region. Both respond on https://<app>.azurewebsites.net.

Step 3 — Set the health-check path (the probe target). In production this is a deep /healthz that checks downstream dependencies; for the lab, the platform’s default page at / suffices as a stand-in. Enable health check so the platform itself also tracks it:

az webapp config set -n $APP_CI -g $RG --generic-configurations '{"healthCheckPath": "/"}'
az webapp config set -n $APP_SI -g $RG --generic-configurations '{"healthCheckPath": "/"}'
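
Step 3 notes that in production the probe path checks downstream dependencies. A minimal sketch of that logic follows, kept free of any web framework: `healthz` takes a mapping of dependency names to check functions and returns the status code and a per-dependency report. The function name, the check names, and the choice of 503 are assumptions; the lab app itself is a Node runtime, so this only illustrates the shape of a deep check.

```python
from typing import Callable, Dict, Tuple


def healthz(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Return (status_code, report): 200 only if every dependency check passes."""
    if not checks:
        # No checks configured would be a static 200 in disguise, so fail closed.
        return 503, {}
    report: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "fail"
        except Exception as exc:  # an unreachable dependency usually raises
            report[name] = f"error: {type(exc).__name__}"
    status = 200 if all(result == "ok" for result in report.values()) else 503
    return status, report


def failing_sql_check() -> bool:
    """Hypothetical check that simulates an unreachable database."""
    raise TimeoutError("sql listener did not answer")


healthy = healthz({"sql": lambda: True, "cosmos": lambda: True})
degraded = healthz({"sql": failing_sql_check, "cosmos": lambda: True})
print(healthy)   # (200, {'sql': 'ok', 'cosmos': 'ok'})
print(degraded)  # (503, {'sql': 'error: TimeoutError', 'cosmos': 'ok'})
```

In a real handler each check should carry its own short timeout, so that a hung dependency turns into a failed probe instead of a hung probe.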

Step 4 — Create a Front Door Standard profile and an endpoint.

PROFILE=afd-aa-$SUFFIX
az afd profile create -g $RG --profile-name $PROFILE --sku Standard_AzureFrontDoor -o table
az afd endpoint create -g $RG --profile-name $PROFILE --endpoint-name ep-aa --enabled-state Enabled -o table

Step 5 — Origin group with a 30 s probe, then both regions as equal-priority origins (active-active).

az afd origin-group create -g $RG --profile-name $PROFILE --origin-group-name og-aa \
  --probe-path / --probe-protocol Https --probe-request-type GET --probe-interval-in-seconds 30 \
  --sample-size 4 --successful-samples-required 3 --additional-latency-in-milliseconds 50

for pair in "ci:$APP_CI" "si:$APP_SI"; do
  name=${pair%%:*}; host=${pair##*:}.azurewebsites.net
  az afd origin create -g $RG --profile-name $PROFILE --origin-group-name og-aa \
    --origin-name $name --host-name $host --origin-host-header $host \
    --priority 1 --weight 1000 --enabled-state Enabled --https-port 443
done

Step 6 — Add a route so the endpoint serves from the origin group.

az afd route create -g $RG --profile-name $PROFILE --endpoint-name ep-aa \
  --route-name route-aa --origin-group og-aa \
  --supported-protocols Http Https --https-redirect Enabled --forwarding-protocol HttpsOnly --link-to-default-domain Enabled

Find your endpoint hostname and curl it a few times — you are being served from a healthy region:

az afd endpoint show -g $RG --profile-name $PROFILE --endpoint-name ep-aa --query hostName -o tsv
# curl https://<that-host>/  → 200, served from the nearest healthy origin

Step 7 — Kill one region and watch failover. Stop the Central India app to simulate a regional loss:

az webapp stop -n $APP_CI -g $RG
# Within ~1–1.5 min (30 s interval × 3-of-4 samples), Front Door evicts CI and serves only SI.
# Keep curling the endpoint host — it keeps returning 200, now from South India.
watch -n 5 "curl -s -o /dev/null -w '%{http_code}\n' https://$(az afd endpoint show -g $RG --profile-name $PROFILE --endpoint-name ep-aa --query hostName -o tsv)/"

Expected: apart from a possible brief window of errors while the failed probe samples accumulate, the endpoint keeps returning 200, with no DNS change and no human action. That is the active-active property in one observation.

Step 8 — Restore and confirm the region rejoins.

az webapp start -n $APP_CI -g $RG
# After the next healthy samples, Front Door re-admits CI and resumes latency routing across both.

Validation checklist — what each step proved:

Step	What you did	What it proves
2	Stamped two identical regional apps	Stateless tiers are trivially duplicated
5	Equal priority/weight origins + health probe	This is active-active (latency), not active-passive
7	Stopped one region, kept curling	Front Door evicts a dead origin in about a minute, with no DNS change; the endpoint keeps answering from the survivor
8	Restarted the region	Failback is automatic when probes pass again

Cleanup (avoid lingering plan + Front Door charges).

az group delete -n $RG --yes --no-wait

Cost note. Two B1 plans plus a Standard Front Door for an hour is well under ₹100; deleting the resource group stops everything. This lab covered only the routing tier — the data tier (failover groups / Cosmos multi-write) is where production cost and complexity live.

Common mistakes & troubleshooting

This is the playbook — bookmark it for the next game-day or incident. First as a scannable table, then the entries that bite hardest expanded with the exact confirm path.

#	Symptom	Root cause	Confirm (exact cmd / portal path)	Fix
1	Front Door keeps routing to a region that can’t actually serve	Probe is shallow (static 200), region’s DB/cache is down	Compare probe path vs real dependency health; App Insights failures in that region	Make `/healthz` check downstream deps; never a static 200
2	Survivor 429/503 right after failover	Survivor sized for half the load	Cosmos `429` count / App Service `Http503`; plan CPU pinned	Provision/autoscale each region for full load
3	Cross-region writes slow (region B writing to region A primary)	Single global writer + no partitioning	App Insights dependency latency to SQL RW listener	Partition by home region, or move that data to Cosmos multi-write
4	Data lost after a forced SQL failover	Async replica lag at the moment of loss	`replication_lag_sec` metric just before failover	Match secondary SKU; idempotent replay from event log; tighten lag SLO
5	“Lost”/overwritten updates after a partition heals (Cosmos)	LWW dropped a concurrent write	Cosmos conflicts feed; compare `_ts`	Custom merge sproc for mergeable state; idempotent ops
6	Reads stale in one region	Consistency level weaker than the UX assumes	`az cosmosdb show --query consistencyPolicy`	Raise to Bounded Staleness, or pass session tokens through
7	Failover didn’t fire during a real outage	Grace period not elapsed, or policy Manual	`failoverWithDataLossGracePeriodMinutes`; `failoverPolicy`	Set Automatic; shorten grace period; or trigger forced failover
8	Failover fired on a transient blip (flapping)	Probe too sensitive / grace period too short	Front Door probe history; FOG events	Raise sample threshold; lengthen grace period slightly
9	Survivor behaves differently (a feature is off / breaks)	Config / flag / secret drift between regions	Diff app settings & flags across regions	One IaC module; centralise flags; drift detection in CI
10	TLS errors only after failover to the standby region	Cert expired/renewed in one region only	Cert expiry on both origins	Automate renewal; alert on both; one cert pipeline
11	Duplicate charges/effects after a cross-region retry	Operations not idempotent	Search for duplicate effects by business key	Idempotency keys + unique index; broker dedup; outbox
12	Egress/replication bill far higher than expected	Cross-region data transfer + multi-write RU surcharge	Cost analysis by meter (inter-region egress, Cosmos RU)	Reduce chatty cross-region calls; right-size RU; keep reads local
13	One region’s writes never reach the other	Geo-DR replicates metadata, not in-flight messages	Service Bus geo-DR mode; message-count both namespaces	Use Premium geo-replication (data) or rely on idempotent replay
14	DNS-based failover takes minutes	Traffic Manager TTL, or app caches DNS	TTL on the TM profile; client DNS cache	Use Front Door (connection-level), not DNS, in the hot path; lower TTLs
15	Both regions patched/restarted at once	Not pinned to a paired region	`az account list-locations`; check the pair	Pin both to an Azure pair so updates are sequential
16	Split-brain: both regions think they’re primary	Forced failover while old primary returned	SQL FOG `replicationRole` on both servers	One source of truth for promotion; fence the old primary; reconcile via event log
17	Private endpoint resolves wrong region’s PaaS	Cross-region private DNS not zone-linked	`nslookup` the PaaS FQDN in each region	Link the private DNS zone to both VNets; per-region records
18	Failover works in drills but not real outages	Game-days only kill compute, not the data tier	Compare drill scope vs real failure modes	Drill the forced data-tier failover, not just `webapp stop`
19	App in region B can’t read its own recent write	Read routed to a lagging replica / wrong consistency	Trace the read path; check session token use	Read-your-writes: Session consistency + propagate session tokens; or read RW listener
20	Cost spikes only during failover events	Survivor autoscales to 2× to absorb full load	Cost analysis during the incident window	Expected — budget for it; reserve baseline, autoscale the burst

The entries that cause the most 3 a.m. confusion, expanded:

1. Front Door keeps sending users to a region that returns errors. Root cause: The health probe is shallow — it hits a path that returns 200 from the web tier even when the region’s database or cache is unreachable. Front Door thinks the origin is healthy; users get 5xx from the broken downstream. Confirm: Compare the probe path against what a real request needs. If /healthz returns 200 but dependencies in App Insights for that region show the DB failing, the probe is lying. Fix: Make /healthz a deep check — verify the database and any must-have dependency are reachable, return non-200 if not. The probe must answer “can this region serve a real request?”, not “is the web process up?”.

2. The surviving region throttles (429/503) the instant it takes all the traffic. Root cause: Each region was sized for its steady-state share (~half), so when one dies the survivor suddenly carries 100% and exceeds its provisioned compute or RU/s. Confirm: Cosmos 429 count or App Service Http503 spikes exactly at the failover moment; plan CPU pinned at 100%. Fix: Provision (or autoscale) each region for the full expected load, not half. Use Cosmos autoscale RU/s with a max at peak-total, and App Service autoscale max at full peak. This is the single most common active-active mistake.

3. Writes from one region are slow because they cross the WAN to the other region’s primary. Root cause: You chose a single global writer (SQL failover group) without partitioning, so region B’s writes travel to region A’s primary every time. Confirm: App Insights dependency latency from region B to the SQL read-write listener sits consistently at or above the inter-region RTT. Fix: Partition data by home region (each region writes its own data locally), or move that workload to Cosmos multi-region writes where every region writes locally.

4. A forced SQL failover lost the last few seconds of writes. Root cause: SQL geo-replication is asynchronous, so a forced (lossy) failover during an outage drops whatever hadn’t replicated — your RPO is the lag, not zero. Confirm: replication_lag_sec just before the failover shows the gap; the missing rows correspond to that window. Fix: Match the secondary’s SKU to the primary (a throttled secondary is the usual lag source), alert on lag against your RPO budget, and make the write path idempotent against an event log so the un-replicated tail can be replayed after promotion.

5. Concurrent writes in two regions “lost” one of them (Cosmos). Root cause: With multi-region writes and Last-Writer-Wins, two regions writing the same item resolve to one — the “loser” is silently dropped, which is wrong for mergeable state (e.g. a shopping cart). Confirm: The conflicts feed shows the conflict; the surviving item’s _ts is the later one. Fix: Use a custom merge stored procedure for mergeable state, or design the operation to be idempotent/commutative so order doesn’t matter.
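
To make entry 5 concrete, here is a minimal sketch in plain Python that contrasts the two resolution strategies on a shopping cart. It is a simulation only, not the Cosmos DB SDK or a real stored procedure; `CartVersion`, `resolve_lww`, and `resolve_merge` are hypothetical names, and `ts` stands in for the item's `_ts`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CartVersion:
    region: str
    ts: int               # stand-in for the item's _ts (hypothetical values)
    items: frozenset      # product ids in the cart


def resolve_lww(a: CartVersion, b: CartVersion) -> CartVersion:
    """Last-Writer-Wins: keep the version with the higher timestamp, discard the other."""
    return a if a.ts >= b.ts else b


def resolve_merge(a: CartVersion, b: CartVersion) -> CartVersion:
    """Custom merge for an add-only cart: set union is commutative, so order does not matter."""
    return CartVersion(region="merged", ts=max(a.ts, b.ts), items=a.items | b.items)


# Two regions update the same cart at almost the same moment.
region_a = CartVersion("central-india", 1000, frozenset({"book"}))
region_b = CartVersion("south-india", 1001, frozenset({"pen"}))

lww = resolve_lww(region_a, region_b)
merged = resolve_merge(region_a, region_b)

print(sorted(lww.items))     # ['pen']  (the "book" write is gone)
print(sorted(merged.items))  # ['book', 'pen']
```

The union only covers add-only state. A cart that also supports removals would need tombstones or a similar scheme, which this sketch does not attempt.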

Best practices

Make health probes meaningful. /healthz must check downstream dependencies (DB, cache), not just return 200 from the web tier — otherwise the global tier keeps routing to a region that cannot actually serve. This is the load-bearing rule of the whole design.
Provision every region for the full load, not its share. A survivor sized for 50% throttles the instant it carries 100%. Autoscale max = peak-total, on both compute (App Service/AKS) and data (Cosmos RU/s).
Keep regions identical via IaC. One Bicep/Terraform module, region as a parameter. Run drift detection (what-if / plan) on a schedule and treat drift as an incident.
Pick the data pattern deliberately. Single-writer + home-region partitioning, Cosmos multi-write, or event-driven — each has a different RPO and consistency story. Don’t default into “active-active reads, single-writer writes” by accident.
Design every write to be idempotent. Cross-region retries are inevitable; an idempotency key + unique index turns “maybe double-charged” into “exactly once,” and lets you replay an un-replicated tail.
Treat replication lag as an SLO. It is your real RPO. Alert when SQL replication_lag_sec or Cosmos staleness exceeds the budget, before an outage forces a lossy failover.
Prefer Front Door (connection-level) over DNS (Traffic Manager) in the hot path. DNS failover is TTL-bound and slow; Front Door evicts a dead origin in seconds.
Pin to Azure paired regions for sequential platform updates and geo-replica affinity — but verify the pair exists and is reciprocal (a few regions aren’t paired).
Run real region-kill game-days on a schedule. Resilience you haven’t tested by actually killing a region is a hypothesis, not a guarantee. The first drill almost always exposes the survivor-sizing bug.
Separate liveness from readiness, and from the global probe. The platform health-check, your liveness, and Front Door’s “can I serve?” probe answer different questions; conflating them either evicts good regions or routes to bad ones.
Keep cross-region chatter down. Read locally, write locally where the pattern allows; every synchronous cross-region call adds latency and egress cost and couples the two regions.
Automate certificate and secret rotation across both regions with one pipeline, and alert on expiry in both — a cert that’s valid in A and expired in B is invisible until failover.
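
The idempotency rule above can be sketched with nothing more than a unique index. The example uses Python's built-in `sqlite3` purely as a stand-in for the real data tier; the table, the `apply_payment` helper, and the key format are assumptions made for illustration.

```python
import sqlite3


def apply_payment(conn: sqlite3.Connection, idempotency_key: str,
                  account: str, amount: int) -> bool:
    """Apply a payment once. Returns True if applied, False if the key was already seen."""
    try:
        with conn:  # commits on success, rolls back if the insert raises
            conn.execute(
                "INSERT INTO payments (idempotency_key, account, amount) VALUES (?, ?, ?)",
                (idempotency_key, account, amount),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique index rejected a duplicate: a retry or a replayed event.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    "idempotency_key TEXT NOT NULL, account TEXT NOT NULL, amount INTEGER NOT NULL)"
)
conn.execute("CREATE UNIQUE INDEX ux_payments_key ON payments (idempotency_key)")

first = apply_payment(conn, "req-7f3a", "acct-1", 500)   # original request
retry = apply_payment(conn, "req-7f3a", "acct-1", 500)   # cross-region retry, same key
total = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]

print(first, retry, total)  # True False 500
```

The same shape applies when replaying an un-replicated tail after promotion: events that did reach the new primary are rejected by the index, and only the missing ones are applied.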

Security notes

Managed identity over secrets, in both regions. Each region’s compute uses its own (or a shared user-assigned) managed identity to reach Key Vault, SQL, and Cosmos — no plaintext connection strings. Grant least privilege (Key Vault Secrets User, scoped SQL/Cosmos RBAC), and ensure the identity exists and is granted in both regions or the survivor crash-loops.
WAF at the global edge, consistently. Run the WAF on Front Door so both regions are protected by one policy — managing two divergent regional WAFs is how a rule lands in one region and not the other. Keep the policy in code.
Private connectivity for the data tier. Reach SQL and Cosmos over Private Endpoints in each region’s VNet so replication and app traffic stay on the Microsoft backbone, not the public internet. See Azure Private Link & Private DNS for PaaS for the cross-region private-DNS pattern that makes this work in both regions.
Encrypt in transit and at rest, both regions. Enforce TLS 1.2+ end-to-end (Front Door → origin → data), and confirm encryption-at-rest (and, where required, customer-managed keys) is configured identically in each region — a CMK present in one region’s Key Vault and not the other breaks the survivor.
Lock down the health endpoint. /healthz returns a status, not a system map — it must not leak dependency hostnames, versions, or internal topology to an anonymous caller, even though Front Door (and you) need it reachable.
Don’t let failover bypass controls. The standby path must enforce the same IP restrictions, authentication, and network rules as the primary; a relaxed rule “just for the DR region” is a hole that’s only exposed when you’re already stressed.
Audit the conflict/replay path. Lossy failovers and LWW conflict resolution touch financial or sensitive state — log every conflict outcome and every replayed event so the reconciliation is auditable after the incident.
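
The "deep but locked down" health endpoint can be sketched as a pure function, independent of any web framework. The names (`deep_health`, `db_ok`, `cache_down`) are hypothetical, and the dependency checks are stubs; a real implementation would run short-timeout calls against the region's database and cache.

```python
from typing import Callable, Dict, Tuple


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict, dict]:
    """Run every dependency check and return (http_status, public_body, internal_detail).

    Only public_body is returned to the caller of /healthz; internal_detail is
    meant for logs, so hostnames and topology never reach an anonymous client.
    """
    detail = {}
    for name, check in checks.items():
        try:
            detail[name] = bool(check())
        except Exception:
            # A check that throws (timeout, refused connection) counts as unhealthy.
            detail[name] = False
    healthy = all(detail.values())
    status = 200 if healthy else 503
    public_body = {"status": "ok" if healthy else "unavailable"}
    return status, public_body, detail


def db_ok() -> bool:          # stub: pretend the regional database answered
    return True


def cache_down() -> bool:     # stub: pretend the regional cache timed out
    raise TimeoutError("cache unreachable")


status, body, detail = deep_health({"sql": db_ok, "cache": cache_down})
print(status, body)  # 503 {'status': 'unavailable'}
```

Returning a non-200 status is what lets the global probe take the origin out of rotation, while the terse body keeps the endpoint from doubling as a system map.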

The security controls that double as resilience controls — they pull in the same direction:

Control	Mechanism	Secures against	Also prevents
Managed identity in both regions	System/user-assigned MI + RBAC	Secrets in plaintext config	Survivor crash-loop from a missing secret
WAF at the edge (one policy)	Front Door WAF, policy-as-code	OWASP attacks, bots	Divergent regional WAF rules
Private Endpoints per region	Private Link + private DNS	Public exposure of data tier	Replication/app traffic over the internet
TLS + CMK parity	`minTlsVersion`, CMK in both vaults	Downgrade / cleartext	CMK-missing failure on the survivor
Identical network rules	IaC-managed NSG/IP rules	Bypass via a relaxed DR rule	“DR-only” holes exposed under stress
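
Several rows in this table come down to "the same setting in both regions". A minimal parity check, assuming you can export each region's effective settings as a flat dictionary, might look like the sketch below. The setting names and values are illustrative, and this does not replace what-if or terraform plan.

```python
def parity_gaps(region_a: dict, region_b: dict, ignore=("location",)) -> dict:
    """Return {setting: (value_in_a, value_in_b)} for every setting that differs."""
    keys = (set(region_a) | set(region_b)) - set(ignore)
    return {
        key: (region_a.get(key), region_b.get(key))
        for key in sorted(keys)
        if region_a.get(key) != region_b.get(key)
    }


# Hypothetical exported settings for two regional stacks.
central = {
    "location": "centralindia",
    "minTlsVersion": "1.2",
    "cmkKeyConfigured": True,
    "allowedIpRules": ["10.0.0.0/16"],
}
south = {
    "location": "southindia",
    "minTlsVersion": "1.2",
    "cmkKeyConfigured": False,                      # CMK missing in the survivor
    "allowedIpRules": ["10.0.0.0/16", "0.0.0.0/0"],  # relaxed "DR-only" rule
}

gaps = parity_gaps(central, south)
print(list(gaps))  # ['allowedIpRules', 'cmkKeyConfigured']
```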

Cost & sizing

The bill drivers and how they interact with the design:

Two full stacks dominate. You pay for production-grade compute and data in both regions, sized for full load (not half — see best practices), so the floor is roughly 1.8–2.1× a single-region stack. Active-active’s cost premium is mostly this, not the extras.
Cross-region data egress is metered per GB. Replication traffic (SQL geo-replica, Cosmos multi-write), plus any synchronous cross-region application calls, all cross the WAN. Chatty cross-region patterns can quietly add a meaningful line item — keep reads and writes local where the pattern allows.
Cosmos multi-write carries an RU surcharge versus single-write, and you pay RU/s in every write region; size with autoscale RU/s so you’re not paying peak-total around the clock, but set the max high enough that a survivor carrying 100% doesn’t throttle.
SQL failover groups mean paying ~full price for the secondary, because it must match the primary’s tier to keep replication lag (your RPO) low. A throttled, under-sized secondary is a false economy that breaks RPO.
Front Door Standard/Premium adds a base fee plus per-GB and per-request charges — small relative to two stacks, and it replaces per-region public ingress complexity.

A rough monthly picture for a mid-size API (the Paython shape, ~900 rps), single-region baseline vs active-active:

Cost driver	Single-region baseline	Active-active	What the delta buys
Compute (App Service P1v3 × N)	~₹40,000	~₹80,000 (both regions, full-load)	Survivor absorbs 100% load
Azure SQL (primary; +secondary in AA)	~₹25,000	~₹52,000 (primary + matched secondary)	RPO≈seconds, auto-promote
Cosmos DB (single → multi-write)	~₹18,000	~₹30,000 (multi-write RU + 2nd region)	Write-anywhere, RPO≈0
Cross-region egress + replication	—	~₹6,000	Keeping both regions in sync
Front Door Std/Premium	~₹2,000 (or single ingress)	~₹5,000	Seconds-level global failover + WAF
Service Bus geo-DR	included	~₹3,000 (Premium for data)	Events survive a region loss
Rough total	~₹95,000	~₹178,000 (≈1.9×)	Region loss becomes a non-event

Right-sizing rules: only go active-active where an outage costs more per hour than the second region costs per month — otherwise warm-standby halves the bill. Use autoscale aggressively so the “full-load survivor” capacity is available but not always paid for. And re-measure after you fix bugs: Paython, like many teams, found that fixing connection reuse and partitioning let them run smaller SKUs than the panicked first cut, landing the active-active bill well below the worst-case 2.1×. For the FinOps discipline around tagging, budgets, and reservations that make a two-region bill predictable, see Azure FinOps & Cost Management at Scale.
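
The first right-sizing rule is simple arithmetic. The sketch below applies it to the rough totals from the table; the outage cost per hour is a made-up input, and `active_active_worth_it` is a hypothetical helper name.

```python
def active_active_worth_it(outage_cost_per_hour: float,
                           single_region_monthly: float,
                           active_active_monthly: float):
    """Apply the rule of thumb: go active-active only if one hour of outage
    costs more than the monthly premium for the second region."""
    premium = active_active_monthly - single_region_monthly
    return outage_cost_per_hour > premium, premium


# Table totals: ~95,000 (single region) vs ~178,000 (active-active), in rupees.
# The 250,000 per-hour outage cost is an assumed figure for illustration.
worth, premium = active_active_worth_it(250_000, 95_000, 178_000)
ratio = round(178_000 / 95_000, 2)

print(worth, premium, ratio)  # True 83000 1.87
```

With an outage cost of, say, ₹20,000 per hour the same function returns False, which is the case where warm standby is the better fit.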

Interview & exam questions

1. What is the difference between active-active and active-passive, and when do you choose each? Active-active runs both regions hot, serving live traffic concurrently, so a region loss is a capacity reduction (~half) with seconds-level RTO and an RPO set by the data tier (≈ 0 with Cosmos multi-region writes, ≈ the replication lag with SQL failover groups). Active-passive keeps one region serving and the other warm/cold, with a failover gap measured in minutes. Choose active-active for revenue/safety-critical workloads where downtime costs more per hour than the second region per month and the data model can be partitioned or tolerate tunable consistency; choose active-passive for workloads that tolerate a short recovery, because it’s far cheaper and simpler.

2. Why don’t Availability Zones make a workload multi-region resilient? Zones protect against the failure of a single datacentre within a region — they give intra-region HA. A whole-region impairment (control-plane bug, regional networking incident, capacity shortfall) takes all zones in that region together. Multi-region active-active (or DR) is the only thing that removes the region as a single fault domain.

3. Why is the health probe the most important part of the global routing tier? Because it decides which origins are eligible. A shallow probe that returns 200 from the web tier even when the region’s database is down makes Front Door route users to a region that cannot serve, producing 5xx. The probe must be a deep /healthz that verifies downstream dependencies and answers “can this region serve a real request?”.

4. You enabled active-active but writes from one region are slow. Why, and how do you fix it? You likely chose a single global writer (SQL auto-failover group) without partitioning, so the non-primary region’s writes cross the WAN to the primary every time. Fix by partitioning data by home region (each region writes locally) or moving that data to Cosmos DB multi-region writes, where every region accepts writes locally.

5. Compare the three data-tier patterns for active-active. (a) SQL auto-failover groups: one writable primary, async read-only replica, auto-promotion; RPO ≈ replication lag, writes are single-region unless you partition. (b) Cosmos multi-region writes: every region writable with conflict resolution (LWW/custom) and five consistency levels; RPO ≈ 0, weak default consistency. (c) Event-driven: idempotent operations with replicated events (Service Bus/Event Hubs geo-DR); eventual consistency, no distributed transactions. The choice sets your RPO, consistency, and most of your cost.

6. What determines RPO and RTO in an active-active design? RPO comes from replication mode: synchronous ≈ 0, asynchronous ≈ the replication lag at the moment of loss (so SQL failover groups have non-zero RPO). RTO comes from the routing tier (probe interval × sample threshold to evict a bad origin, ~30–90 s) plus, for writes, the writer’s promotion time (SQL ~30–120 s; Cosmos ~0 because other regions are already writable).

7. What are the five Cosmos DB consistency levels and how do they relate to active-active? Strong, Bounded Staleness, Session, Consistent Prefix, Eventual — from most to least consistent. Strong forbids multi-region writes (single write region only); the other four are multi-write friendly. Session (read-your-writes per session) is the default and suits most apps. Stronger levels add cross-region latency and reduce availability under partition; weaker levels are faster/cheaper but expose anomalies your code must tolerate.

8. How does Cosmos resolve conflicting writes from two regions, and what’s the catch? By a conflict-resolution policy: Last-Writer-Wins (highest _ts or a chosen property wins) by default, or a custom stored procedure for merge logic, or a conflicts feed for manual resolution. The catch with LWW is that it silently drops the loser, which is wrong for mergeable state (carts, sets) — use custom merge or commutative/idempotent operations there.

9. After failover the surviving region starts throwing 429/503. What happened and how do you prevent it? The survivor was sized for its steady-state share (~half the load), so when it suddenly carries 100% it exceeds its provisioned compute or Cosmos RU/s and throttles. Prevent it by provisioning (or autoscaling) each region for the full expected load — Cosmos autoscale RU/s with max at peak-total, App Service autoscale max at full peak. This is the most common active-active mistake and the first thing a game-day exposes.

10. Why must the two regions be kept byte-for-byte identical, and how? Because drift (a config/flag/secret/cert/schema difference) means the survivor behaves differently than the region that failed — a subtle bug that only appears under failover, when you can least afford it. Keep them identical by deploying both from one IaC module with the region as a parameter, centralising feature flags, automating cert/secret rotation across both, and running scheduled drift detection (what-if / terraform plan).

11. What is a region-kill game-day and why is it non-negotiable? It’s a scheduled drill where you actually take a region offline (stop its stack or force a failover) and verify the design recovers within RTO/RPO. It’s non-negotiable because an untested failover path is a liability dressed as resilience — the first game-day almost always finds the survivor-sizing bug or a replication-lag breach, both invisible until you pull the trigger.

12. When should you NOT use active-active? When the workload tolerates a 30-minute recovery (warm standby is far cheaper), when the data model demands a single strongly-consistent writer and cannot be partitioned (multi-write would break correctness), or when the team lacks the maturity to operate two live stacks and keep them in parity — a poorly-run active-active adds failure modes (split-brain, drift, conflict bugs) and is less reliable than a well-run single region.
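
The RTO/RPO reasoning from question 6 can be written down as a small budget calculator. This is a simplified model under stated assumptions: eviction time is taken as probe interval times the number of failed samples, and the 30-second interval, 2 samples, 60-second SQL promotion, and 4-second lag are example inputs, not measured values.

```python
def routing_rto_seconds(probe_interval_s: float, failed_samples_to_evict: int) -> float:
    """Time for the routing tier to evict a bad origin (simplified model)."""
    return probe_interval_s * failed_samples_to_evict


def write_rto_seconds(routing_s: float, writer_promotion_s: float) -> float:
    """Writes recover only after routing has shifted and a writer is available."""
    return routing_s + writer_promotion_s


def rpo_seconds(replication: str, lag_s: float) -> float:
    """Synchronous replication loses nothing; asynchronous loses the current lag."""
    return 0.0 if replication == "sync" else lag_s


routing = routing_rto_seconds(probe_interval_s=30, failed_samples_to_evict=2)
sql_write_rto = write_rto_seconds(routing, writer_promotion_s=60)    # SQL promotes a new primary
cosmos_write_rto = write_rto_seconds(routing, writer_promotion_s=0)  # other regions already writable
rpo = rpo_seconds("async", lag_s=4.0)

print(routing, sql_write_rto, cosmos_write_rto, rpo)  # 60 120 60 4.0
```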

These map to AZ-305 (Designing Microsoft Azure Infrastructure Solutions) — design for high availability, business continuity, and disaster recovery, region pairs, Front Door, failover groups, Cosmos consistency — and to AZ-700 (Network Engineer) for the global routing tier. The data-tier specifics touch DP-420 (Cosmos DB). A compact cert mapping for revision:

Question theme	Primary cert	Objective area
Active-active vs DR; RTO/RPO design	AZ-305	Design BC/DR; resiliency patterns
Front Door, Traffic Manager, routing	AZ-305 / AZ-700	Design network connectivity; global load balancing
SQL failover groups, geo-replication	AZ-305	Design data storage; high availability
Cosmos multi-region writes & consistency	DP-420 / AZ-305	Distributed data design; consistency models
Paired regions, AZ vs multi-region	AZ-305 / AZ-104	Resiliency fundamentals
IaC parity, drift, governance	AZ-305 / AZ-400	Infrastructure as code; reliable deployment

Quick check

1. Your app is “active-active” but writes from the second region are slow and all land on the first region’s database. What did you most likely skip, and what are the two fixes?
2. True or false: Availability Zones give you multi-region resilience.
3. Front Door is still routing users to a region that’s returning 5xx. What’s wrong with your design, and where exactly do you fix it?
4. You force a SQL auto-failover-group failover during an outage and lose four seconds of writes. Why was RPO not zero, and name two fixes.
5. After a region fails, the survivor immediately throttles with 429/503. What sizing rule did you violate?

Answers

1. You skipped data partitioning by home region while using a single global writer (SQL failover group), so region B’s writes cross the WAN to region A’s primary. Fixes: partition by home region so each region writes locally, or move that data to Cosmos DB multi-region writes where every region is writable.
2. False. Zones protect against a single-datacentre failure within a region; a whole-region impairment takes all zones together. Only multi-region (active-active or DR) removes the region as a single point of failure.
3. The health probe is too shallow: it returns 200 from the web tier even though a downstream (DB/cache) in that region is down, so Front Door keeps the origin eligible. Fix it in the /healthz endpoint: make it a deep check that verifies downstream dependencies and returns non-200 when the region can’t actually serve.
4. SQL geo-replication is asynchronous, so a forced failover drops whatever hadn’t replicated; RPO equals the replication lag (four seconds here), not zero. Fixes: match the secondary’s SKU to the primary (a throttled secondary causes lag) and make writes idempotent against an event log so the un-replicated tail can be replayed after promotion; also alert on lag against the RPO budget.
5. You provisioned each region for its steady-state share (~half the load) instead of the full load. The rule: every region must be provisioned or able to autoscale to 100% of expected load, because a single-region loss makes the survivor carry everything.
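
The sizing rule in the last answer is easy to check before a game-day does it for you. A minimal sketch, assuming you can express each region's autoscale ceiling and the peak total load in the same unit (requests per second here); the capacities are invented, and `undersized_regions` is a hypothetical helper.

```python
def undersized_regions(autoscale_max_by_region: dict, peak_total: float) -> list:
    """Return the regions whose autoscale ceiling cannot carry 100% of peak load."""
    return sorted(
        region
        for region, capacity in autoscale_max_by_region.items()
        if capacity < peak_total
    )


# Peak total of ~900 rps, as in the cost example; per-region ceilings are assumed.
ceilings = {"central-india": 900, "south-india": 500}

print(undersized_regions(ceilings, peak_total=900))  # ['south-india']
```

The same check applies to the data tier by swapping requests per second for the Cosmos autoscale maximum RU/s.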

Glossary

Active-active (multi-site) — an architecture where every region serves live traffic concurrently; a region loss is a capacity reduction, not an outage.
Active-passive / warm standby — one region serves; the other waits (warm or cold) to be promoted on failover, with a failover gap of minutes.
Pilot light — the second region holds only the data replica and minimal core, scaled up on demand; cheapest cross-region option with the longest RTO.
Global routing tier — the health-probed L7/L4 entry point (Azure Front Door, Traffic Manager, or cross-region Load Balancer) that steers users to a healthy region.
Azure Front Door — anycast L7 edge with TLS termination, WAF, caching, deep health probes, and connection-level (seconds) origin failover; the default for HTTP active-active.
RPO (Recovery Point Objective) — the maximum data loss tolerated; set by replication mode (sync ≈ 0, async ≈ replication lag).
RTO (Recovery Time Objective) — the maximum time to recover; set by probe interval × sample threshold plus writer promotion time.
Paired region — Azure’s curated region couple with sequential platform updates and geo-replica affinity (e.g. Central India ↔ South India); a few regions are unpaired.
Auto-failover group (FOG) — an Azure SQL construct with a read-write and read-only listener over a writable primary and an async replica, auto-promoting on failover without connection-string changes.
Read-write / read-only listener — the stable DNS names a SQL failover group exposes; the RW listener follows the current primary, the RO listener follows the replica.
Multi-region writes — a Cosmos DB mode where every region is writable, with conflict resolution and tunable consistency; the cleanest fit for active-active writes.
Consistency level — Cosmos DB’s staleness/availability dial: Strong, Bounded Staleness, Session (default), Consistent Prefix, Eventual.
Conflict resolution (LWW / custom) — how Cosmos reconciles two regions writing the same item: Last-Writer-Wins on a timestamp, or a custom merge stored procedure.
Replication lag — how far a replica trails its primary; for async replication it is your real RPO and should be an SLO.
Service Bus geo-DR — alias-based pairing of two namespaces so the alias repoints to the secondary on failover; replicates metadata (Premium geo-replication also replicates message data).
Idempotency key — a unique value that makes a repeated operation apply once, making cross-region retries safe and enabling tail replay.
Home-region partitioning — partitioning data so each tenant/customer’s writes land in their home region’s primary, enabling local writes with a single-writer engine.
Region-kill game-day — a scheduled drill that actually takes a region offline to prove the design meets RTO/RPO before a real incident does.
Drift — divergence in config, flags, secrets, certs, or schema between regions; the cause of subtle failover-only bugs.

Next steps

You can now design, cost, and prove an active-active Azure workload. Build outward:

Foundation: Azure Regions & Availability Zones Explained — the intra-region resilience this article builds on; know exactly where zones stop and multi-region starts.
The DR sibling: Azure Backup & Site Recovery: Protection Strategies — the active-passive end of the spectrum for workloads that don’t need active-active.
Routing depth: Azure Load Balancer vs Application Gateway and Standard Load Balancer: Outbound Rules, Cross-Region HA & HA Ports — the regional and cross-region L4 building blocks under the global tier.
Private data tier: Azure Private Link & Private DNS for PaaS — make SQL/Cosmos reachable privately in both regions, including the cross-region private-DNS pattern.
When a region does fail: Troubleshooting Azure SQL: Connectivity, Timeouts, Throttling & Blocking and Troubleshooting App Service: 502/503, Cold Starts & Restart Loops — the per-service diagnosis during a failover.
Pay for it predictably: Azure FinOps & Cost Management at Scale — tagging, budgets, and reservations that keep a two-region bill from surprising you.