Cosmos DB Multi-Region Writes: Consistency Levels and Conflict Resolution

Multi-region writes are the feature that makes Azure Cosmos DB look like magic in a demo and like a distributed-systems trap in production. Azure Cosmos DB is Microsoft’s globally distributed, multi-model database with single-digit-millisecond reads and a turnkey 99.999% SLA; multi-region writes (formerly “multi-master”) let every region you add accept writes for the same logical data, instead of one primary write region and a fleet of read replicas. The moment two regions can both accept writes for the same logical partition, you have surrendered the comfortable single-writer world and signed up for conflict resolution, weaker consistency, and a much harder mental model. None of that is a reason to avoid it: for globally distributed, write-heavy, low-latency workloads it is the right tool. But you have to configure it deliberately.

This guide walks the full path: enabling multi-region writes, picking a consistency level you can actually defend in an SLA review, and building both last-writer-wins (LWW) and custom conflict resolution that behaves correctly when a region drops. Because this is a reference you will return to at 02:00 during a regional incident — or three weeks later when reconciliation flags a ledger that disagrees with the payment processor — the option matrices, the consistency comparison, the conflict-type reference, the limits and the symptom→cause→confirm→fix playbook are all laid out as scannable tables. Read the prose once, then keep the tables open.

Everything here assumes the Cosmos DB for NoSQL API. The consistency model is API-agnostic, but conflict-resolution policies and the conflicts feed are specific to the NoSQL API; Cassandra, MongoDB and Gremlin handle conflicts differently (typically LWW only, with no pluggable resolver). By the end you will stop guessing: you will know which consistency level buys you which RPO, why Strong is off the table the instant you enable multi-write, why default LWW on _ts quietly loses money in an ordered domain, and exactly which az command confirms each of those facts.

What problem this solves

A single-write-region Cosmos DB account is simple to reason about: one region orders every write, the rest catch up, and a failover promotes a replica. That simplicity costs you write latency for users far from the primary. An order placed in Singapore against a primary in East US 2 pays a cross-Pacific round trip on every write — 180–250 ms when the local read was 5 ms. For a write-heavy, latency-sensitive workload (carts, sessions, telemetry ingest, IoT device state, collaborative editing) that is the difference between a snappy app and a sluggish one, and no amount of read-replica scaling fixes it because the write still crosses the ocean.

Multi-region writes fix the latency by letting the nearest region accept the write and acknowledge locally. What breaks without understanding the trade is correctness. Teams flip enableMultipleWriteLocations because a blog said it improves availability, leave the container on its default LWW-on-_ts policy, and ship. Months later a partial network partition lets two regions edit the same document in the same second; _ts ties at one-second granularity; Cosmos deterministically (but arbitrarily) keeps one and silently discards the other. The loser never appears in the conflicts feed. In a stateful, ordered domain — a payment that went authorized in one region and captured in another — that is real money moving with the ledger disagreeing, found only by an out-of-band reconciliation days later.

Who hits this: anyone running a globally distributed, write-active workload on Cosmos DB. It bites hardest on ordered state machines (payments, inventory, bookings) where LWW-on-_ts is almost never correct; on teams who chose Strong consistency for safety and then can’t enable multi-write at all; and on anyone who set Custom (no sproc) resolution and never built a drainer for the conflicts feed, so divergence accumulates invisibly. The fix is never “turn multi-write off” — it’s “make the consistency level and the conflict-resolution policy deliberate parts of your data model, and prove they behave under a region loss.” Here is the whole field in one frame before the deep dive:

Decision | The trap | What “right” looks like | Where it’s set
Multi-write on/off | “More regions = always better”, at 3× write RU cost | On only where you need write locality; read replicas elsewhere | Account (enableMultipleWriteLocations)
Consistency level | Picking Strong, then can’t enable multi-write | Bounded Staleness / Session for most; relax per-request | Account default + per-request override
Conflict policy | Default LWW on _ts in an ordered domain | LWW on a monotonic /version, or a deterministic sproc | Container (set at creation, immutable)
Conflicts feed owner | Custom-no-sproc with nobody draining it | A continuous drainer + a depth alert | Application + Azure Monitor
Failover behaviour | Assuming a promotion step on the write path | RTO≈0 for writes; client PreferredRegions retries locally | Account + CosmosClientOptions
RPO awareness | “Multi-region = no data loss” | RPO is non-zero except at Strong (unavailable here) | Consistency level governs it

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand the Cosmos DB basics: an account is the top-level resource that owns regions and the consistency policy; a database is a namespace; a container holds items and owns the partition key, indexing policy, throughput, and the conflict-resolution policy. You should know that throughput is measured in Request Units per second (RU/s) (provisioned or autoscale), that every item lives in a logical partition keyed by your partition-key path, and that the .NET/Java/JS SDKs talk to Cosmos in Direct or Gateway mode. Comfort with az cosmosdb, reading JSON output in Cloud Shell, and basic distributed-systems vocabulary (quorum, linearizability, RPO/RTO) will make this land faster.

This sits in the Data & Global Distribution track. It assumes the modeling fundamentals from Cosmos DB Partition Key Design & RU Optimization — a bad partition key amplifies every problem here, because hot partitions and cross-partition fan-out get worse, not better, with more write regions. It is the database-layer companion to Azure Multi-Region Active-Active Architecture and pairs with global front-ends from Azure Front Door & Traffic Manager: Global Failover. The RPO/RTO framing comes from High Availability vs Disaster Recovery: RTO & RPO, and the consistency theory generalizes in Multi-Region Data Replication & Consistency Strategies. A quick map of who owns which decision during a design or an incident:

Layer | What lives here | Who usually owns it | What it can cause
Account regions & failover | locations, failoverPriority, automatic failover | Platform / SRE | Wrong write topology; surprise RU cost
Consistency policy | Default level + min staleness window | Architect + app lead | Too-weak RPO, or Strong blocking multi-write
Container conflict policy | LWW path / sproc / manual feed | App / data team | Silent data loss; divergence
Conflicts feed | Drainer job + depth alert | App + ops | Accumulating, invisible divergence
Client SDK | PreferredRegions, session token, consistency override | App / dev team | No local failover; lost read-your-writes
Observability | ReplicationLatency, conflict metrics, alerts | Ops / SRE | Blind to lag and conflicts

Core concepts

Five mental models make every later decision obvious.

Multi-region writes means every region is a write region. Once you flip enableMultipleWriteLocations, failoverPriority no longer decides who can write (everyone can) — it only orders how regions are reprioritized during automatic failover. A write lands in the region nearest the client, commits and acknowledges locally, and replicates asynchronously to the others. That local ACK is the whole point — and the source of every conflict, because two regions can both ACK a write to the same document before they have heard from each other.

Consistency is a tunable, linear spectrum, and multi-write removes the strongest option. Cosmos exposes five levels from strongest to weakest. Stronger means reads see more recent, more ordered data at higher latency and lower availability; weaker means lower latency and higher availability at the cost of recency and ordering. The hard rule: Strong is incompatible with multi-region writes because linearizability requires one global order of writes, which independent write regions cannot provide. So a multi-write account chooses among Bounded Staleness, Session, Consistent Prefix, Eventual.

A conflict is two live versions of one document that meet during replication. With multiple writers, two clients can mutate the same id + partition key concurrently in different regions. When async replication brings those versions together, Cosmos detects a conflict. It does not panic and it does not block the write path — the regions already ACK’d locally. What happens next is governed entirely by the container’s conflict-resolution policy, chosen at container creation and effectively immutable.

The resolution policy is part of your data model, not an afterthought. Three policies exist. LWW auto-resolves on a numeric path (default _ts) — highest value wins, losers vanish silently. Custom (stored procedure) runs your JavaScript resolver on every conflict so you can merge or apply business rules. Custom (manual) writes every conflicting version to a per-container conflicts feed and stops, leaving your app to drain and reconcile. Choosing the wrong one for your domain — LWW-on-_ts for an ordered state machine — is a correctness bug, not a tuning miss.

RTO is near-zero; RPO is non-zero. Because every region already writes, losing a region does not require a promotion step on the write path — the SDK simply stops routing there, so RTO for writes is effectively zero. But whatever had not yet replicated when the region died is lost: that is your RPO, and it is non-zero for every multi-write consistency level. Bounded Staleness caps it to your configured window; Session/Consistent Prefix/Eventual leave it unbounded in the worst case. You buy RTO with multi-region writes and pay for it in RPO — internalize that sentence.
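
The local-ACK and RPO models above can be made concrete with a toy simulation. This is only a sketch: the Region class and its replicateTo and fail methods are hypothetical names, and conflict detection is simplified to "both regions hold an unreplicated write to the same id", which is much cruder than what Cosmos DB does internally.

```javascript
// Toy model of asynchronous replication between two write regions.
class Region {
  constructor(name) {
    this.name = name;
    this.store = new Map(); // committed items, keyed by id
    this.outbox = [];       // writes acknowledged locally but not yet replicated
  }

  // Commit and acknowledge locally; replication happens later.
  write(id, doc) {
    this.store.set(id, doc);
    this.outbox.push({ id, doc });
    return "ack";
  }

  // Ship pending writes to a peer and report ids that both sides changed.
  replicateTo(peer) {
    const conflicts = [];
    const peerPending = new Set(peer.outbox.map((w) => w.id));
    for (const { id, doc } of this.outbox.splice(0)) {
      if (peerPending.has(id)) conflicts.push(id); // both regions ACK'd this id
      else peer.store.set(id, doc);
    }
    return conflicts;
  }

  // A region failure loses whatever was still waiting in the outbox.
  fail() {
    return this.outbox.splice(0).map((w) => w.id);
  }
}

const east = new Region("East US 2");
const west = new Region("West Europe");
east.write("order-1", { status: "authorized" });
west.write("order-1", { status: "captured" }); // concurrent write, also ACK'd
east.write("order-2", { status: "authorized" });
const conflicts = east.replicateTo(west);
console.log(conflicts); // ids that need the container's resolution policy
```

Both writes to order-1 were acknowledged before either region heard from the other, which is exactly the situation the conflict-resolution policy exists for. Calling fail() on a region returns the ids that never left it, which is the non-zero RPO in miniature.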

The vocabulary in one table

Pin down every moving part before the deep sections; the glossary repeats these for lookup, this is the mental model side by side:

Term | One-line definition | Where it lives | Why it matters here
Multi-region writes | Every region accepts writes for the same data | Account toggle | Enables conflicts; ~N× write RU
Write region | A region that locally commits + ACKs writes | Account locations | Under multi-write, all of them
failoverPriority | Order regions are reprioritized on failover | Per location | Only orders failover, not who writes
Consistency level | The read recency/ordering guarantee | Account default + per request | Governs latency and RPO
Bounded Staleness | Lag capped by K versions OR T seconds | Consistency policy | The only bounded RPO under multi-write
Session token | x-ms-session-token scoping read-your-writes | SDK / response header | Must be flowed across tiers
Conflict | Two live versions of one id+PK meeting | Replication path | The thing the policy resolves
LWW | Highest numeric path value wins, silently | Container policy | Default; wrong for ordered state
Conflict-resolution path | The numeric property LWW compares | Container policy | _ts by default; prefer /version
Resolver sproc | JS that resolves each conflict your way | Registered on container | Must be deterministic + idempotent
Conflicts feed | Where unresolved versions land | Per container | Needs an owner + a depth alert
RPO | Data lost on a region failure | Consequence of level | Non-zero except at Strong (unavailable)
RTO | Time to recover write capability | Consequence of multi-write | ≈0 for writes

1. Add regions and enable multi-region writes

Multi-region writes is an account-level toggle. You first need at least two regions associated with the account, then you flip enableMultipleWriteLocations. Adding regions is an online operation; enabling multi-write is not always online and can briefly affect availability, so do it in a maintenance window the first time.

With Azure CLI, add the read regions first, then enable multi-write:

# Add a second (and third) read region first
az cosmosdb update \
  --name kv-cosmos-prod \
  --resource-group rg-data-prod \
  --locations regionName="East US 2" failoverPriority=0 isZoneRedundant=true \
  --locations regionName="West Europe" failoverPriority=1 isZoneRedundant=true \
  --locations regionName="Southeast Asia" failoverPriority=2 isZoneRedundant=true

# Then enable multi-region writes
az cosmosdb update \
  --name kv-cosmos-prod \
  --resource-group rg-data-prod \
  --enable-multiple-write-locations true

A few things that bite people:

Declaratively in Bicep, which is how this should live in your repo:

resource account 'Microsoft.DocumentDB/databaseAccounts@2024-11-15' = {
  name: 'kv-cosmos-prod'
  location: 'East US 2'
  kind: 'GlobalDocumentDB'
  properties: {
    databaseAccountOfferType: 'Standard'
    enableMultipleWriteLocations: true
    enableAutomaticFailover: true
    consistencyPolicy: {
      defaultConsistencyLevel: 'BoundedStaleness'
      maxStalenessPrefix: 100000
      maxIntervalInSeconds: 300
    }
    locations: [
      { locationName: 'East US 2',     failoverPriority: 0, isZoneRedundant: true }
      { locationName: 'West Europe',    failoverPriority: 1, isZoneRedundant: true }
      { locationName: 'Southeast Asia', failoverPriority: 2, isZoneRedundant: true }
    ]
  }
}

Each account-level knob, what it does, and the gotcha — read your row before you toggle anything:

Setting | Values | Default | When to change | Trade-off / gotcha
enableMultipleWriteLocations | true / false | false | You need write locality in >1 region | ~N× write RU; introduces conflicts; not always online to enable
enableAutomaticFailover | true / false | false | Always, in prod | Harmless under multi-write; essential under single-write
defaultConsistencyLevel | Strong / BoundedStaleness / Session / ConsistentPrefix / Eventual | Session | Match your RPO/latency budget | Strong forbidden with multi-write
maxStalenessPrefix | ≥100000 (multi-region) | - | Bounded Staleness only | Below the floor is rejected on a multi-region account
maxIntervalInSeconds | ≥300 (multi-region) | - | Bounded Staleness only | Tighter (smaller) window costs latency/availability
locations[].failoverPriority | 0…N-1, contiguous, unique | - | Reorder failover preference | Under multi-write, ordering only, not who writes
locations[].isZoneRedundant | true / false | false | Want AZ resilience in-region | Set at add-time only; not toggleable in place
locations[].locationName | any Azure region | - | Add/remove a region | Removing the last region of data deletes that copy

The flags that look similar but mean very different things — the distinctions that waste the most time:

Distinction | The trap | How to tell them apart
Add region vs enable multi-write | “I added a region, so it can write”; no, it’s a read replica until you flip the toggle | writeLocations lists every region only after enableMultipleWriteLocations: true
failoverPriority under single vs multi write | Assuming priority gates writes under multi-write | Under multi-write, priority only orders automatic failover; all regions write
Automatic failover vs multi-region writes | Thinking automatic failover gives you active-active writes | Automatic failover promotes a single write region; multi-write makes them all write
Zone redundant vs multi-region | Conflating in-region AZ HA with cross-region | isZoneRedundant is AZ-level inside one region; regions are the geo-level

Cost note: enabling multi-region writes roughly multiplies your provisioned RU/s cost for writes by the number of write regions, because writes replicate everywhere. Three write regions is three times the write throughput cost. Decide whether you genuinely need write locality in all three or whether one or two write regions plus read replicas is enough — read replicas cost RU too, but you control them independently and they never accept a conflicting write.

The hard limits and real numbers you should know before designing the topology — these are the boundaries that turn a clean design into a 429 storm or a rejected operation:

Limit / quota | Real value | Applies to | What hitting it looks like | Note
maxStalenessPrefix floor (multi-region) | 100,000 operations | Bounded Staleness, ≥2 regions | Operation rejected with a min-value error | Single-region floor is 10 operations
maxIntervalInSeconds floor (multi-region) | 300 seconds | Bounded Staleness, ≥2 regions | Operation rejected with a min-value error | Single-region floor is 5 seconds
Strong + multi-write | Not allowed | Account | --enable-multiple-write-locations rejected | Drop to a weaker level first
Per-physical-partition throughput | 10,000 RU/s | Container | One partition 429s while container idle | Re-key, not more RU/s
Per-logical-partition storage | 20 GB | Container | Writes to that PK fail at the cap | Choose a higher-cardinality PK
_ts granularity | 1 second | LWW default path | Same-second writes tie → silent loss | Use a monotonic /version
Write RU multiplier | ~N× (N write regions) | Account billing | Costs and 429s scale with region count | Use read replicas where write locality isn’t needed
Maximum item size | 2 MB | Item | Write rejected above the cap | Split large docs
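
Several of these limits can be checked before anything is deployed. The following is a minimal sketch of such a preflight check, assuming an account object shaped like the Bicep properties shown earlier; the preflight function and the FLOORS constant are hypothetical names, and the numbers are taken from the table above rather than read from the service.

```javascript
// Staleness floors from the limits table (multi-region vs single-region).
const FLOORS = {
  multiRegion: { maxStalenessPrefix: 100000, maxIntervalInSeconds: 300 },
  singleRegion: { maxStalenessPrefix: 10, maxIntervalInSeconds: 5 },
};

// Returns a list of human-readable problems; an empty list means "looks deployable".
function preflight(account) {
  const problems = [];
  const regions = account.locations.length;
  const policy = account.consistencyPolicy;

  if (account.enableMultipleWriteLocations && regions < 2) {
    problems.push("multi-region writes needs at least two regions");
  }
  if (account.enableMultipleWriteLocations && policy.defaultConsistencyLevel === "Strong") {
    problems.push("Strong is not allowed with multi-region writes");
  }
  if (policy.defaultConsistencyLevel === "BoundedStaleness") {
    const floor = regions >= 2 ? FLOORS.multiRegion : FLOORS.singleRegion;
    for (const key of Object.keys(floor)) {
      if (policy[key] < floor[key]) problems.push(`${key} is below the floor of ${floor[key]}`);
    }
  }
  // failoverPriority must be unique and contiguous, starting at 0.
  const priorities = account.locations.map((l) => l.failoverPriority).sort((a, b) => a - b);
  if (priorities.some((p, i) => p !== i)) {
    problems.push("failoverPriority values must be 0..N-1, unique and contiguous");
  }
  return problems;
}
```

A check like this belongs in CI next to the Bicep file, so a rejected operation shows up as a failed build rather than a failed deployment.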

2. The five consistency levels and their trade-offs

Cosmos DB exposes a tunable, linear consistency spectrum. Stronger is to the left, more available and lower latency to the right:

Strong  >  Bounded Staleness  >  Session  >  Consistent Prefix  >  Eventual

The full comparison — the table you scan first when placing a workload:

Level | What it guarantees | Read latency | Write availability on partition | Multi-region writes? | RPO under region loss
Strong | Linearizable; reads see the latest committed write | Highest (cross-region quorum) | Lowest | Not allowed | 0
Bounded Staleness | Lag bounded by K versions or T seconds; consistent-prefix within the bound | Higher | High | Allowed | Bounded by the staleness window
Session | Read-your-writes, monotonic reads/writes within a session token | Low | High | Allowed | Non-zero (unbounded worst case)
Consistent Prefix | Never see out-of-order writes; no recency bound | Low | High | Allowed | Non-zero (unbounded worst case)
Eventual | Replicas converge eventually; reads may be out of order | Lowest | Highest | Allowed | Non-zero (unbounded worst case)

The hard constraint, stated plainly: Strong consistency is incompatible with multi-region writes. Linearizability requires a single global ordering of writes, which you cannot have when multiple regions accept writes independently. If you try to enable multi-region writes on a Strong account, the operation is rejected. So the real choice for multi-write accounts is among Bounded Staleness, Session, Consistent Prefix and Eventual.

The default consistency level is set on the account, but a client can relax (never tighten) it per request. A Session-default account can issue an Eventual read for a cheap, fast lookup; it cannot request Strong. The relax-only rule and what each combination yields:

Account default | Per-request override allowed | Per-request override rejected | Typical use of the override
Strong (single-write only) | Bounded, Session, Prefix, Eventual | (none, already strongest) | Cheap reads on tolerant data
Bounded Staleness | Session, Consistent Prefix, Eventual | Strong | Lower-latency reads on cold paths
Session | Consistent Prefix, Eventual | Strong, Bounded | Fire-and-forget lookups
Consistent Prefix | Eventual | Strong, Bounded, Session | Telemetry / feed reads
Eventual | (none weaker) | everything stronger | n/a

// Relax to Eventual for a non-critical read (lower RU, lower latency)
var options = new ItemRequestOptions { ConsistencyLevel = ConsistencyLevel.Eventual };
var resp = await container.ReadItemAsync<Product>(
    id, new PartitionKey(tenantId), options);
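
The relax-only rule reduces to a position check on the spectrum. A minimal sketch, with canOverride as a hypothetical helper name; it treats a request for the same level as the account default as allowed, since that is a no-op.

```javascript
// Strongest first, mirroring the spectrum shown earlier.
const LEVELS = ["Strong", "BoundedStaleness", "Session", "ConsistentPrefix", "Eventual"];

// True when a per-request level is the same as, or weaker than, the account default.
function canOverride(accountDefault, requested) {
  const d = LEVELS.indexOf(accountDefault);
  const r = LEVELS.indexOf(requested);
  if (d < 0 || r < 0) throw new Error("unknown consistency level");
  return r >= d;
}

console.log(canOverride("Session", "Eventual"));         // true
console.log(canOverride("Session", "BoundedStaleness")); // false
```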

The concrete read anomalies each level does and does not permit — this is the table that turns abstract guarantees into “can my code see X?”:

Anomaly a reader could observe | Strong | Bounded Staleness | Session | Consistent Prefix | Eventual
Stale read (misses latest write) | Never | Up to the window | Never in-session; possible cross-session | Possible | Possible
Out-of-order writes (see B before A) | Never | Never | Never in-session | Never | Possible
Non-monotonic reads (go backward in time) | Never | Never | Never in-session | Never | Possible
Read-your-own-writes fails | Never | Never (in-region strong) | Only without the token | Possible cross-session | Possible
Lag quantified / bounded | N/A (0) | Yes (K / T) | No | No | No

What each level costs and fixes, so the choice is an engineering decision not a vibe:

Level | RU cost relative | Latency profile | Availability | Fixes / good for | Risk it carries
Strong | Highest (reads ~2× RU) | Cross-region quorum on read | Lowest (no multi-write) | Single-region linearizable reads | Cannot do multi-write at all
Bounded Staleness | High | In-region = strong; cross-region bounded | High | Contractual freshness SLA | Min window forced (100000/300)
Session | Low (default) | Local, fast | High | Per-user apps with token flow | Cross-session reads can miss
Consistent Prefix | Low | Local, fast | High | Ordered feeds, no recency need | No recency bound at all
Eventual | Lowest | Local, fastest | Highest | Counters, telemetry, idempotent | Out-of-order reads

3. Bounded staleness vs session: choosing per workload

For multi-region writes, the two levels worth most of your attention are Bounded Staleness and Session, because they cover the majority of real requirements without paying full latency cost.

Bounded Staleness gives you a quantified staleness budget. You configure a maximum lag as both a version count (maxStalenessPrefix) and a time window (maxIntervalInSeconds); reads in any region are guaranteed to be no more stale than the tighter of the two. This is the level you want when you need a contractual freshness bound you can put in an SLA: “replicas are never more than 5 minutes behind.” For a multi-region-write account spanning two-plus regions, the minimums are maxStalenessPrefix >= 100000 and maxIntervalInSeconds >= 300. Inside a single region it still behaves like strong consistency, which is a useful property: clients pinned to one region get read-your-writes for free.

# Set Bounded Staleness with the multi-region minimums
az cosmosdb update --name kv-cosmos-prod --resource-group rg-data-prod \
  --default-consistency-level BoundedStaleness \
  --max-staleness-prefix 100000 \
  --max-interval 300

Session is the pragmatic default for most applications, and it is the actual Cosmos DB default. It guarantees consistency within a session — typically one user’s connection — via a session token (x-ms-session-token). The same client that wrote a document will read it back; it gets monotonic reads and writes. The catch is that the guarantee is scoped to the session token. If request A writes in East US 2 and request B (a different client, different token) reads in West Europe a few milliseconds later, B can miss the write. To preserve read-your-writes across tiers, you must flow the session token between services.

// Write returns a session token; capture and propagate it
var write = await container.CreateItemAsync(order, new PartitionKey(order.TenantId));
string sessionToken = write.Headers.Session; // pass to downstream via header/cookie

// A later read in another tier honors that token -> read-your-writes preserved
var read = await container.ReadItemAsync<Order>(
    order.Id, new PartitionKey(order.TenantId),
    new ItemRequestOptions { SessionToken = sessionToken });
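
Why the token matters is easier to see in a toy model. This sketch is not the SDK: Replica, write and read are hypothetical names, and the token is a plain log position standing in for the opaque x-ms-session-token string. It only illustrates the rule that a read carrying a token must not be served by a replica that has not caught up to it.

```javascript
// A replica is a log position (lsn) plus the items it has applied so far.
class Replica {
  constructor() {
    this.lsn = 0;
    this.items = new Map();
  }
  apply(lsn, item) {
    this.items.set(item.id, item);
    this.lsn = lsn;
  }
}

// A write returns a token the caller must carry to later reads.
function write(replica, item) {
  const lsn = replica.lsn + 1;
  replica.apply(lsn, item);
  return { sessionToken: lsn };
}

// Without a token the read is served from whatever the replica has.
// With a token, a lagging replica must refuse instead of answering stale.
function read(replica, id, sessionToken = 0) {
  if (replica.lsn < sessionToken) return { status: "retry" };
  return { status: "ok", item: replica.items.get(id) };
}

const eastReplica = new Replica();
const westReplica = new Replica();
const { sessionToken } = write(eastReplica, { id: "o1", total: 5 });

const withoutToken = read(westReplica, "o1");            // served, but misses the write
const withToken = read(westReplica, "o1", sessionToken); // refused until West catches up
```

The request without the token silently gets an empty answer, while the one with the token is told to retry. That difference is the read-your-writes guarantee, and it exists only for callers that carry the token.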

The two levels head-to-head on the properties you actually choose between:

Property | Bounded Staleness | Session
Scope of guarantee | Global, every reader | Per session token only
Read-your-writes | Yes, globally within the window; strong in-region | Yes, only if the token is carried
Freshness bound | Quantified (K versions / T seconds) | None across sessions
Minimums (multi-write) | prefix>=100000, interval>=300 | none
RU cost | Higher | Lowest (default)
Best for | Multiple independent readers; SLA freshness | Per-user app you control end to end
Failure mode | Reads up to the window stale | Cross-session reader misses recent write
Token plumbing required | No | Yes (header/cookie across tiers)

The Bounded Staleness window parameters in detail — both bounds apply, the tighter one wins:

Parameter | Meaning | Multi-region minimum | Effect of decreasing it | Effect of increasing it
maxStalenessPrefix | Max number of versions a read can lag | 100000 ops | Tighter freshness, more cross-region coordination | Looser freshness, cheaper, larger RPO
maxIntervalInSeconds | Max wall-clock lag | 300 s | Tighter freshness, higher latency/availability cost | Looser freshness, larger RPO window
(single-region account) | Same params, smaller floors | 10 ops / 5 s | n/a | n/a
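
Which of the two bounds actually binds depends on write rate. A back-of-envelope sketch, assuming a steady write rate against the data being read; effectiveStalenessSeconds is a hypothetical helper, not something the service exposes.

```javascript
// Seconds of lag permitted before the tighter of the two bounds is reached.
function effectiveStalenessSeconds(maxStalenessPrefix, maxIntervalInSeconds, writesPerSecond) {
  if (writesPerSecond <= 0) return maxIntervalInSeconds; // only the time bound can bind
  const secondsToHitPrefix = maxStalenessPrefix / writesPerSecond;
  return Math.min(maxIntervalInSeconds, secondsToHitPrefix);
}

console.log(effectiveStalenessSeconds(100000, 300, 100));  // 300: the time bound binds
console.log(effectiveStalenessSeconds(100000, 300, 1000)); // 100: the prefix bound binds
```

At the multi-region minimums the crossover sits at roughly 333 writes per second (100000 / 300). Below that rate the 300-second interval is the real limit; above it, the prefix is.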

Rule of thumb I apply, as a decision table:

If the workload is… | And readers… | Pick | Because
Per-user (cart, profile, session) | are the same user, token flows | Session | Cheapest correct read-your-writes
Multi-reader (dashboards, cache warmers) | cannot carry a session token | Bounded Staleness | Global bounded freshness without tokens
Needs a freshness SLA | external consumers read it | Bounded Staleness | You can promise “≤5 min stale”
Tolerant (counters, telemetry, feeds) | reconcile out of band | Consistent Prefix / Eventual | Lowest latency, highest availability
Must be linearizable | single region only | Strong | Only if you give up multi-write

4. Conflict types under multi-region writes

With multiple write regions, two clients can mutate the same document (same id + partition key) concurrently in different regions. When replication brings those versions together, Cosmos DB detects a conflict. There are three kinds, and how each surfaces depends entirely on the conflict-resolution policy you set on the container.

The three conflict types and how each behaves under each policy:

Conflict type | What happened | Under LWW | Under Custom (sproc) | Under Custom (manual)
Insert | Two regions create a doc with the same id+PK | Higher path value committed; loser discarded silently | Sproc receives both; decides winner/merge | Both land in the conflicts feed
Replace / update | Two regions update the same existing doc concurrently | Higher path value wins; loser discarded silently | Sproc receives incoming + existing + other conflicting items | Losers land in the conflicts feed
Delete | One region deletes a doc another region updates | The delete wins regardless of the path value | Sproc sees incomingItem=null for an incoming delete, or isTombstone=true when the incoming write meets an earlier delete | Both versions surface in the feed

How a conflict surfaces depends entirely on the policy:

The three policies compared on the properties that decide which one your domain needs:

Property | LWW (default) | Custom (stored procedure) | Custom (manual feed)
Who resolves | Cosmos, automatically | Your JS sproc, automatically | Your app, on its own schedule
Losers visible? | No (silently dropped) | Only if sproc routes them | Yes (in the conflicts feed)
Can it merge versions? | No (winner-takes-all) | Yes | Yes
Business rules possible? | No | Yes | Yes
Operational burden | Lowest | Medium (write + monitor sproc) | Highest (build + run a drainer)
Failure safety net | None | Sproc failure → routed to feed | Feed is the mechanism
Right for | Tolerant, last-write-truly-wins data | Ordered state, mergeable docs | Maximum control, audit-heavy domains
Latency on resolution | Inline, invisible | Inline, invisible | Deferred until drained

You set the policy at container creation. It cannot be changed after creation through most SDKs/portal, so choose deliberately — switching strategy generally means a new container and a migration. The immutability is the single most important fact in this article: the conflict-resolution policy is a data-model decision you make once.
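
For the manual-feed policy, the drainer is the part teams forget to build. Here is a minimal sketch of one pass, written against a hypothetical feed adapter so it can run without an account. In the @azure/cosmos SDK the adapter would presumably wrap container.conflicts.readAll() and container.conflict(id, partitionKey).delete(); treat that mapping as an assumption to check against the SDK reference. The names drainConflicts, reconcile, depthAlert and onAlert are all illustrative.

```javascript
// feed.list()    -> Promise of an array of conflict entries
// feed.remove(e) -> Promise that deletes one entry from the conflicts feed
// reconcile(e)   -> your domain logic; must be safe to run twice for one entry
async function drainConflicts(feed, reconcile, { depthAlert = 100, onAlert = () => {} } = {}) {
  const entries = await feed.list();
  if (entries.length >= depthAlert) onAlert(entries.length);

  let drained = 0;
  let failed = 0;
  for (const entry of entries) {
    try {
      await reconcile(entry);   // apply the merge or business rule first
      await feed.remove(entry); // only then acknowledge by deleting the entry
      drained += 1;
    } catch (err) {
      failed += 1;              // leave it in the feed so the next pass retries it
    }
  }
  return { seen: entries.length, drained, failed };
}
```

Deleting only after reconcile succeeds means a crash between the two steps replays the entry instead of losing it, which is why reconcile has to be idempotent.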

5. Last-writer-wins with a custom path property

The default LWW policy resolves on the system property _ts (last-modified timestamp, second granularity). Second granularity is coarse: two writes in the same second tie, and Cosmos picks deterministically but not in a way you control. For correctness you often want LWW over a property you own — a monotonic version number, an epoch-millis timestamp, or a sequence assigned by your write path.

# Create a container with LWW resolving on a custom numeric path
az cosmosdb sql container create \
  --account-name kv-cosmos-prod \
  --resource-group rg-data-prod \
  --database-name shop \
  --name orders \
  --partition-key-path "/tenantId" \
  --conflict-resolution-policy '{"mode":"LastWriterWins","conflictResolutionPath":"/version"}'

The path must point to a numeric field; the document with the higher value wins. Keep these invariants or LWW will silently lose data:

The LWW path options ranked from worst to best for correctness:

Path choice | Granularity | Monotonic across regions? | Data-loss risk | Verdict
_ts (default) | 1 second | Server-set, but ties in a second | High in ordered domains | Avoid for stateful/ordered data
Client wall-clock millis | 1 ms | No: clock skew between regions | High (skew = lost writes) | Never; skew silently loses writes
Epoch millis from a single clock | 1 ms | Only if one clock issues them | Medium | OK if you truly have one clock source
Per-doc version counter (RMW) | Per write | Yes, if increment is correct | Low | Good: the common correct choice
Hybrid logical clock (HLC) | Logical+physical | Yes, by construction | Lowest | Best for true causal ordering
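
The last row is the least familiar, so here is a sketch of a hybrid logical clock whose value fits in one number and could therefore be written to /version. The packing scheme (physical milliseconds times 4096 plus a logical counter) is an assumption of this example, not a Cosmos DB convention; it stays within JavaScript's safe-integer range until roughly the late 2030s. The class and method names are hypothetical.

```javascript
const LOGICAL_SLOTS = 4096; // room for 4096 events per millisecond

class HybridLogicalClock {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.physical = 0;
    this.logical = 0;
  }

  // Call before every local write; returns the value to store in /version.
  tick() {
    const wall = this.now();
    if (wall > this.physical) {
      this.physical = wall;
      this.logical = 0;
    } else {
      this.logical += 1; // wall clock stalled or went backwards: count logically
    }
    return this.pack();
  }

  // Call when reading a document written elsewhere, passing its packed version.
  observe(remotePacked) {
    const remotePhysical = Math.floor(remotePacked / LOGICAL_SLOTS);
    const remoteLogical = remotePacked % LOGICAL_SLOTS;
    const maxPhysical = Math.max(this.now(), this.physical, remotePhysical);
    if (maxPhysical === this.physical && maxPhysical === remotePhysical) {
      this.logical = Math.max(this.logical, remoteLogical) + 1;
    } else if (maxPhysical === this.physical) {
      this.logical += 1;
    } else if (maxPhysical === remotePhysical) {
      this.logical = remoteLogical + 1;
    } else {
      this.logical = 0;
    }
    this.physical = maxPhysical;
    return this.pack();
  }

  pack() {
    if (this.logical >= LOGICAL_SLOTS) throw new Error("logical counter overflow");
    return this.physical * LOGICAL_SLOTS + this.logical;
  }
}
```

A writer whose wall clock runs behind still produces a larger version than anything it has observed, which is the property plain client timestamps lack. It only holds if the write path calls observe on the version it read before calling tick.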

What “treated as 0” and “higher wins” mean for real edge cases:

Scenario | /version values | LWW outcome | Is it what you want?
Normal update | existing 7, incoming 8 | incoming (8) wins | Yes
Missing path on one write | existing 7, incoming absent (=0) | existing (7) wins | Usually yes, but a real write with no version is a bug
Both absent | 0 vs 0 | deterministic-but-arbitrary | Dangerous: make version mandatory
Stale retry | existing 9, incoming 5 | existing (9) wins | Yes: old retry correctly loses
Tie | 8 vs 8 | one wins arbitrarily | Only safe if 8==8 truly means “same”
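
These rows are simple enough to encode as a unit test for your own write path. A minimal sketch that mirrors the table, with lwwWinner as a hypothetical helper; it models the outcomes described above and does not call Cosmos DB.

```javascript
// Returns "incoming", "existing" or "tie" for a numeric LWW property.
// A missing or non-numeric value is treated as 0, as in the table above.
function lwwWinner(existing, incoming, path = "version") {
  const value = (doc) => (doc && typeof doc[path] === "number" ? doc[path] : 0);
  const e = value(existing);
  const i = value(incoming);
  if (i > e) return "incoming";
  if (e > i) return "existing";
  return "tie"; // one version still wins, but not in a way you control
}

console.log(lwwWinner({ version: 7 }, { version: 8 })); // "incoming"
console.log(lwwWinner({ version: 9 }, { version: 5 })); // "existing"
console.log(lwwWinner({ _ts: 1700000000 }, { _ts: 1700000000 }, "_ts")); // "tie"
```

The last call is the default-policy failure mode: two writes in the same second carry the same _ts, so the comparison cannot tell them apart.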

Equivalent in Bicep, which is where this belongs for reproducibility:

resource ordersContainer 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers@2024-11-15' = {
  parent: shopDatabase
  name: 'orders'
  properties: {
    resource: {
      id: 'orders'
      partitionKey: { paths: [ '/tenantId' ], kind: 'Hash' }
      conflictResolutionPolicy: {
        mode: 'LastWriterWins'
        conflictResolutionPath: '/version'
      }
    }
  }
}

6. Custom conflict resolution via stored procedure and the conflicts feed

When LWW is too blunt — you need to merge concurrent edits, or apply business rules about which write wins — switch to custom resolution. There are two flavors.

6a. Stored-procedure resolution

You register a JavaScript sproc as the resolver. On every conflict Cosmos invokes it with the incoming document, the existing committed document, a tombstone flag, and any other committed documents that conflict with the incoming one. Your sproc decides the final state and writes it. The sproc signature is fixed:

// resolver sproc: merges line items, keeps the max status rank
function resolver(incomingItem, existingItem, isTombstone, conflictingItems) {
  var collection = getContext().getCollection();
  var response = getContext().getResponse();

  // incomingItem is null when the incoming operation was a delete.
  // This example lets the delete win and removes the committed copy.
  if (!incomingItem) {
    if (existingItem) {
      // Use the item's own _self link; it cannot be built from the user-facing id.
      collection.deleteDocument(existingItem._self, {},
        function (e) { if (e) throw e; });
    }
    return;
  }

  // isTombstone === true means the incoming write conflicts with an item
  // that was already deleted. Delete wins here too, so write nothing.
  if (isTombstone) return;

  // Start from a full document so the partition key and other fields survive.
  var resolved = existingItem || incomingItem;
  resolved.id = incomingItem.id;
  resolved.lineItems = mergeById(
    (existingItem && existingItem.lineItems) || [],
    incomingItem.lineItems || []);
  resolved.status = Math.max(
    (existingItem && existingItem.status) || 0,
    incomingItem.status || 0);

  // Other conflicting versions must be folded in too
  (conflictingItems || []).forEach(function (c) {
    resolved.lineItems = mergeById(resolved.lineItems, c.lineItems || []);
    resolved.status = Math.max(resolved.status, c.status || 0);
  });

  collection.upsertDocument(collection.getSelfLink(), resolved,
    function (e) { if (e) throw e; });
  response.setBody(resolved);

  function mergeById(a, b) { /* union by line id, prefer higher qty */
    var m = {};
    a.concat(b).forEach(function (x) {
      if (!m[x.id] || x.qty > m[x.id].qty) m[x.id] = x;
    });
    return Object.keys(m).map(function (k) { return m[k]; });
  }
}

The four arguments Cosmos passes the resolver, and what each is for:

Argument Type What it carries Watch-out
incomingItem object / null The item being inserted or updated by the write that caused the conflict Null when the incoming op was a delete
existingItem object / null The currently committed version in this region Null on an insert conflict or when the item was deleted
isTombstone boolean True if the incoming item conflicts with a previously deleted item existingItem is null when true; decide delete-wins vs update-wins explicitly
conflictingItems array Committed versions of items that conflict with the incoming item on id or a unique-index property Must fold these in or you lose them

Create the container with its policy pointing at the sproc path, then register the sproc inside it (the container has to exist before anything can be registered in it):

# 1) Create the container, pointing its policy at the sproc path
az cosmosdb sql container create \
  --account-name kv-cosmos-prod --resource-group rg-data-prod \
  --database-name shop --name orders \
  --partition-key-path "/tenantId" \
  --conflict-resolution-policy '{"mode": "Custom", "conflictResolutionProcedure": "dbs/shop/colls/orders/sprocs/resolver"}'

# 2) Register the sproc in that container
az cosmosdb sql stored-procedure create \
  --account-name kv-cosmos-prod \
  --resource-group rg-data-prod \
  --database-name shop \
  --container-name orders \
  --name resolver \
  --body @resolver.js

Key constraints on the resolver sproc, each with the consequence of getting it wrong:

Constraint Why it exists Consequence if violated How to satisfy it
Scoped to one partition key per invocation Sprocs run within a single logical partition Cannot resolve cross-partition conflicts Keep conflicts within a partition by design
Must be deterministic Cosmos may invoke it more than once Divergent regional state Same inputs → same output, always
Must be idempotent Re-invocation must be safe Double-applied merges, drift Resolve to an absolute state, not a delta
Failure → routed to conflicts feed Safety net, not a happy path Silent divergence if you don’t monitor Alert on feed depth; treat throws as incidents
Bound at container creation Policy is immutable Can’t swap strategy in place New container + migration to change
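
Determinism and idempotence are cheap to check on a workstation before a resolver reaches the account. The sketch below is plain JavaScript for Node.js, not code that runs inside Cosmos: resolveAbsolute and the two sample versions are hypothetical, and mergeById is a stricter variant of the helper in the resolver above (it adds a content tie-break and sorted output, both assumptions made so that input order cannot matter).

```javascript
// Union line items by id, preferring the higher qty. Equal qty is broken by a
// stable comparison of the serialized item so input order never matters.
function mergeById(a, b) {
  const byId = {};
  a.concat(b).forEach((x) => {
    const cur = byId[x.id];
    if (!cur || x.qty > cur.qty ||
        (x.qty === cur.qty && JSON.stringify(x) > JSON.stringify(cur))) {
      byId[x.id] = x;
    }
  });
  return Object.keys(byId).sort().map((k) => byId[k]);
}

// Hypothetical pure core of a resolver: an absolute state from any set of versions.
function resolveAbsolute(versions) {
  return versions.reduce(
    (acc, v) => ({
      lineItems: mergeById(acc.lineItems, v.lineItems || []),
      status: Math.max(acc.status, v.status || 0),
    }),
    { lineItems: [], status: 0 }
  );
}

// Sample versions (illustrative data only)
const east = { status: 1, lineItems: [{ id: "a", qty: 2 }] };
const west = { status: 2, lineItems: [{ id: "a", qty: 1 }, { id: "b", qty: 5 }] };

const once = resolveAbsolute([east, west]);
const swapped = resolveAbsolute([west, east]);     // the other region's view
const again = resolveAbsolute([once, east, west]); // a re-invocation

console.log(JSON.stringify(once));
console.log(JSON.stringify(once) === JSON.stringify(swapped)); // order-independent
console.log(JSON.stringify(once) === JSON.stringify(again));   // idempotent
```

Both comparisons print true for this input, which is the property the table asks for: the same set of versions resolves to the same absolute state regardless of order or repetition.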

6b. Manual resolution via the conflicts feed

Set the policy to Custom with no resolver procedure. Now Cosmos writes every conflicting version to the per-container conflicts feed and stops. Your application drains it and resolves on its own terms.

# Custom policy with NO sproc => manual feed resolution
az cosmosdb sql container create \
  --account-name kv-cosmos-prod --resource-group rg-data-prod \
  --database-name shop --name ledger \
  --partition-key-path "/accountId" \
  --conflict-resolution-policy '{"mode": "Custom"}'

The drainer itself, in C# with the .NET SDK:

// Drain the conflicts feed and resolve in application code
using var iterator = container.Conflicts.GetConflictQueryIterator<ConflictProperties>();
while (iterator.HasMoreResults)
{
    foreach (var conflict in await iterator.ReadNextAsync())
    {
        // The losing version that landed in the feed
        Order conflicting = container.Conflicts.ReadConflictContent<Order>(conflict);
        // The currently committed version
        Order committed = await container.ReadItemAsync<Order>(
            conflicting.Id, new PartitionKey(conflicting.TenantId));

        Order winner = Merge(committed, conflicting); // your business rule
        await container.ReplaceItemAsync(winner, winner.Id, new PartitionKey(winner.TenantId));

        // Delete the entry from the feed once handled
        await container.Conflicts.DeleteAsync(conflict, new PartitionKey(conflicting.TenantId));
    }
}

Manual mode is the most flexible and the most operationally demanding: if nobody drains the feed, conflicts accumulate and your data quietly diverges from what users expect. Run the drainer as a continuously scheduled job and alert if the feed depth grows. The operational obligations of manual mode, in order of how often they are missed:

Obligation Why it matters If skipped How to meet it
A running drainer Feed doesn’t drain itself Divergence accumulates forever Continuous Function / worker on a timer
Idempotent merge logic Drainer may reprocess entries Double-applied resolutions Resolve to absolute state; delete after handling
Delete after resolving Entries persist until removed Feed grows unbounded Conflicts.DeleteAsync per handled entry
Depth alerting Silent backlog is invisible Stale data, no signal Alert on conflict activity / feed depth
Per-partition scoping Feed is per container/partition Missed conflicts in other partitions Iterate all partitions or use feed ranges
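
Most of these obligations come down to one loop shape. The following is an in-memory sketch in plain JavaScript (Node.js), not SDK code: store, feed, drainOnce and merge are hypothetical stand-ins for the container, the conflicts feed, the drainer and your business rule. It illustrates why an absolute-state merge plus delete-after-handling makes a reprocessed entry harmless.

```javascript
// Hypothetical business rule: the higher status wins (absolute state, not a delta).
function merge(committed, conflicting) {
  return conflicting.status > committed.status ? conflicting : committed;
}

// One drain pass: resolve every feed entry, then remove it from the feed.
// Returns the feed depth left over, which is the gauge worth alerting on.
function drainOnce(store, feed) {
  for (const entry of feed.slice()) {       // iterate a copy; feed is mutated below
    const committed = store.get(entry.doc.id);
    if (committed) {
      store.set(entry.doc.id, merge(committed, entry.doc));
    }
    // If the committed doc no longer exists, this sketch just drops the entry.
    feed.splice(feed.indexOf(entry), 1);     // delete after handling
  }
  return feed.length;
}

// Illustrative data: one committed doc and one losing version in the feed.
const store = new Map([["p1", { id: "p1", status: 1 }]]);
const feed = [{ conflictId: "c1", doc: { id: "p1", status: 2 } }];

// Simulate a crash after resolving but before deleting: the entry is seen twice.
store.set("p1", merge(store.get("p1"), feed[0].doc));
const depth = drainOnce(store, feed);

console.log(store.get("p1").status, depth); // 2 0
```

The second pass over the same entry leaves the status at 2 because the merge compares absolute values; a merge that added or incremented something would have applied it twice.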

How to host the drainer, with the trade-offs of each option:

Host Trigger Scaling Cost When to choose
Azure Function (timer) Cron (e.g. every 1 min) Consumption/Flex auto Lowest; pay per run Default for most teams
Azure Function (Cosmos trigger) Change feed Lease-based parallelism Low When you already process the change feed
Container App job Scheduled / KEDA KEDA queue/cron Low–medium Already on Container Apps
AKS CronJob Kubernetes cron Pod replicas Medium Already on AKS
Always-on worker (App Service) Continuous loop Manual instances Medium Need sub-second drain latency

7. Automatic vs manual failover and testing outages

Regional failover comes in two modes, service-managed (automatic) and manual. How the two differ in practice:

Aspect Automatic (service-managed) failover Manual failover
Trigger Cosmos detects region unavailability You run failover-priority-change
Use case Real outages, unattended Rehearsals, planned maintenance drains
Write impact (single-write) Promotes next priority region You choose the new priority 0
Write impact (multi-write) None — all regions already write Reorders priority only
Risk None to enable; recommended Re-prioritizes for real — use a window
Data loss Up to RPO of the consistency level Same; rehearse to measure it

Trigger a controlled failover to rehearse an outage. This actually reprioritizes regions; run it in a test account or a planned window:

# Promote West Europe to priority 0 (simulate losing East US 2 as primary)
az cosmosdb failover-priority-change \
  --name kv-cosmos-prod \
  --resource-group rg-data-prod \
  --failover-policies "West Europe=0" "East US 2=1" "Southeast Asia=2"

On the client side, your CosmosClient should be configured with an explicit preferred-regions list so it fails over locally without a config change:

var client = new CosmosClient(connectionString, new CosmosClientOptions
{
    ApplicationPreferredRegions = new List<string>
    {
        "East US 2", "West Europe", "Southeast Asia"  // ordered preference
    },
    ConnectionMode = ConnectionMode.Direct
});

With ApplicationPreferredRegions set, the SDK automatically retries the next region on a regional failure — you do not redeploy to fail over. The client-side knobs that make failover transparent:

Client option What it does Default Set it to Why
ApplicationPreferredRegions Ordered region preference for routing/retry account default order Your latency-ordered region list Local failover with no redeploy
ApplicationRegion Single preferred region (older API) none Prefer ApplicationPreferredRegions instead List allows ordered fallback
ConnectionMode Direct (TCP) vs Gateway (HTTPS) Direct (SDK v3) Direct for lowest latency Fewer hops; honors region routing
ConsistencyLevel (client) Relax account default per client account default Only to relax Cheaper reads where tolerable
MaxRetryAttemptsOnRateLimited... Throttle retry behavior SDK default Tune for 429 storms Smooths transient throttling
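
As a mental model for what the ordered list buys you, region selection can be sketched in a few lines. This is a deliberate simplification in plain JavaScript, not the SDK's routing code; pickRegion and the unavailable set are hypothetical names.

```javascript
// Return the first preferred region that is not currently marked unavailable,
// or null when every preferred region is down.
function pickRegion(preferred, unavailable) {
  const region = preferred.find((r) => !unavailable.has(r));
  return region === undefined ? null : region;
}

const preferred = ["East US 2", "West Europe", "Southeast Asia"];

console.log(pickRegion(preferred, new Set()));              // East US 2
console.log(pickRegion(preferred, new Set(["East US 2"]))); // West Europe
console.log(pickRegion(preferred, new Set(preferred)));     // null
```

The point of the ordering is the second call: losing the first region changes the answer without any configuration change on the client.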

Test this for real: block egress to the primary region’s Cosmos endpoint (NSG rule or local firewall) and confirm your service keeps serving from the next region within the SDK’s retry window.

The three regional topologies side by side — pick the cheapest one that meets your write locality need, not your read need:

Property Single-write + replicas Multi-write (2 regions) Multi-write (3+ regions)
Who accepts writes One primary only Both regions All regions
Write latency (far users) Cross-region round trip Local in either region Local everywhere
Conflicts possible No Yes Yes (more likely)
Strong consistency Allowed No No
Write RU cost 1× write + read RU ~2× write RU ~N× write RU
RTO (writes) Promotion time ≈0 ≈0
Best for Global reads, single writer Two-continent writes True global active-active
Operational complexity Lowest Medium (conflicts) Highest (conflicts + cost)

8. Validating RPO/RTO and monitoring replication latency

Numbers you should be able to quote for a multi-region-write account, as RPO/RTO by configuration. This is the table you put in the DR runbook:

Configuration RTO (writes) RTO (reads) RPO Notes
Single-write, Strong Failover promotion time ~0 (other replicas) 0 Linearizable; no multi-write
Single-write, Bounded Staleness Failover promotion time ~0 ≤ staleness window Common single-write DR posture
Multi-write, Bounded Staleness ≈0 (all write) ~0 ≤ staleness window Best bounded-RPO active-active
Multi-write, Session ≈0 ~0 Non-zero, unbounded worst case Cheapest; per-user correctness only
Multi-write, Consistent Prefix ≈0 ~0 Non-zero, unbounded Ordered feeds
Multi-write, Eventual ≈0 ~0 Non-zero, unbounded Most available, least fresh

Monitor replication latency continuously. The relevant metric is Replication Latency (P50/P99 by source/target region) in Azure Monitor:

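// P99 cross-region replication latency per account, last 6h
// (AzureMetrics does not carry the source/target region dimensions; split by region pair in Metrics Explorer)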
AzureMetrics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB"
| where MetricName == "ReplicationLatency"
| where TimeGenerated > ago(6h)
| summarize p99 = percentile(Average, 99) by bin(TimeGenerated, 5m), Resource
| order by TimeGenerated desc

Also alert on the conflict path so silent divergence cannot hide:

// Surfacing custom/manual conflict activity
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB"
| where Category == "DataPlaneRequests"
| where OperationName has "Conflict"
| summarize count() by bin(TimeGenerated, 15m), requestResourceType_s

The signals worth wiring as alerts — leading indicators, not lagging “users complained”:

Signal Metric / source Starting threshold Why it’s leading
Replication lag ReplicationLatency P99 by region > your RPO budget Predicts data-at-risk before a region loss
Conflict activity DataPlaneRequests conflict ops any sustained > 0 in manual mode Divergence is happening now
Conflicts-feed depth App-emitted gauge from the drainer > 0 for 5 min Nobody is reconciling
Throttling (429) TotalRequestUnits / 429 rate > 1% throttled Multi-write amplifies write RU
Region availability Service Health / ServiceAvailability any region degraded Triggers the RPO clock
Provisioned vs used RU ProvisionedThroughput vs TotalRequestUnits sustained > 80% Multi-write writes cost N×
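
The "depth > 0 for 5 min" style of threshold is a sustained-breach rule rather than a point-in-time one. A minimal sketch of that evaluation in plain JavaScript follows; sustainedAbove and the sample shape are hypothetical, and in practice the same rule would normally be expressed in an Azure Monitor alert rather than in application code.

```javascript
// samples: [{ t: epochMs, value: number }] sorted by time ascending.
// Returns true when the most recent unbroken run of samples above the
// threshold started at least windowMs before nowMs.
function sustainedAbove(samples, threshold, windowMs, nowMs) {
  let breachStart = null;
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].value > threshold) breachStart = samples[i].t;
    else break; // the run ends at the first sample at or below the threshold
  }
  return breachStart !== null && nowMs - breachStart >= windowMs;
}

const MIN = 60 * 1000;
// Illustrative feed-depth gauge, one sample per minute starting at t = 0.
const depthSamples = [0, 2, 2, 4, 4, 4, 1].map((value, i) => ({ t: i * MIN, value }));

console.log(sustainedAbove(depthSamples, 0, 5 * MIN, 6 * MIN));             // true
console.log(sustainedAbove(depthSamples.slice(0, 6), 0, 5 * MIN, 5 * MIN)); // false
```

The first call fires because the feed has been non-empty since minute 1; the second does not because only four minutes have elapsed.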

Architecture at a glance

The diagram traces a write as it actually flows through a multi-region-write account, then maps each place data can diverge or be lost as a numbered badge. Read it left to right. On the far left, the App + SDK issues a write with an ordered ApplicationPreferredRegions list and (under Session) a session token; multiple writers can target the same id + partition key from different regions. The write hits the account gateway on :443, which routes it to the nearest write region — and the consistency knob here is where badge 1 lives: you cannot select Strong on this path, only Bounded Staleness, Session, Consistent Prefix or Eventual. The middle zone is the heart of multi-write: East US 2, West Europe and Southeast Asia each commit and ACK locally (badge 2 marks West Europe accepting a concurrent edit to a document East US 2 just changed). From there the replication zone ships those local commits asynchronously; badge 3 sits on the replication hop because whatever has not yet replicated when a region is lost is exactly your RPO. When two live versions of one document meet, the detect-clash node fires, and the flow turns into the resolution zone.

The resolution zone is the design decision the whole article is about. The LWW path node (badge 4) resolves on a numeric property, and the warning is that the default path, _ts, ties at one-second granularity and can drop a real write silently; a /version you own avoids that. The sproc / feed node (badge 5) is the deterministic alternative: a custom resolver that merges or applies business rules, or a manual conflicts feed your app must drain and alert on. The legend narrates each number as symptom · how to confirm · fix: read the badge, run the named az/Azure Monitor confirm step, apply the fix. The single sentence to carry away from the picture: the request path buys you write locality and near-zero RTO, and every badge is a place you pay for it in consistency, RPO, or conflict-resolution correctness.

Figure: Azure Cosmos DB multi-region-write architecture, traced left to right from the application SDK through the account gateway, the three write regions (East US 2 priority 0, West Europe priority 1, Southeast Asia priority 2), asynchronous replication and the resolution zone, with five numbered badges narrated in the legend as symptom, confirm and fix.

Real-world scenario

Aurelia Pay, a fictional global payments platform, ran a payments-ledger container with three write regions (East US 2, West Europe, Southeast Asia) at Session consistency to meet a sub-50 ms write SLO across the Americas, EU and APAC. Their idempotency layer keyed on a client-supplied paymentId, and the write path did a read-modify-write to advance a status field (0=pending, 1=authorized, 2=captured, 3=refunded). The container used the default LWW on _ts. The platform team was six engineers; the Cosmos spend was about ₹240,000/month (three write regions multiply the write RU).

The incident began during a partial network partition between East US 2 and West Europe — a real BGP event lasting about nine minutes. A retrying client authorized a payment in West Europe while a parallel capture landed in East US 2 for the same paymentId. Both committed locally and ACK’d; the partition kept them apart. When replication healed, the two versions met and Cosmos resolved the conflict on _ts. Because both writes fell in the same second, _ts tied, Cosmos kept the authorize as the winner, and the capture was silently discarded — money had moved, the ledger said “authorized.” Nothing surfaced in any feed (LWW never populates it). They found it 31 hours later when the daily reconciliation against the processor disagreed by a five-figure sum.

The breakthrough was framing the bug correctly. This was not a Cosmos defect and not an application race they could lock away — under multi-region writes, concurrent same-document edits across regions are expected. The defect was the resolution policy: resolving an ordered state machine on a timestamp. The constraint made it harder: they could not tolerate any state regression, and they could not drop to single-region writes (the APAC latency SLO would break). The fix was a custom resolver sproc that resolves on the business state machine instead of a timestamp — the higher status rank always wins, and a refund (3) is terminal (absorbing):

function resolveLedger(incoming, existing, isTombstone, conflicts) {
  var ctx = getContext(), coll = ctx.getCollection(), res = ctx.getResponse();
  var all = [existing, incoming].concat(conflicts || []).filter(Boolean);
  // Terminal states win; otherwise the highest status rank wins.
  var winner = all.reduce(function (best, c) {
    if (best === null) return c;
    if (c.status === 3) return c;            // refund is absorbing
    return (c.status > best.status) ? c : best;
  }, null);
  coll.upsertDocument(coll.getSelfLink(), winner, function (e) { if (e) throw e; });
  res.setBody(winner);
}
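
The ranking rule can be exercised outside Cosmos before it is trusted with a ledger. The sketch below is a pure-function variant in plain JavaScript (Node.js); pickWinner is a hypothetical name, the sample documents are illustrative, and the content-based tie-break for equal ranks is an assumption added here so that two regions seeing the versions in opposite order still agree.

```javascript
const REFUNDED = 3; // 0=pending, 1=authorized, 2=captured, 3=refunded

// Pick the winning version: refund is absorbing, otherwise the higher rank wins.
function pickWinner(versions) {
  return versions.filter(Boolean).reduce((best, c) => {
    if (best === null) return c;
    if (best.status === REFUNDED && c.status !== REFUNDED) return best;
    if (c.status === REFUNDED && best.status !== REFUNDED) return c;
    if (c.status !== best.status) return c.status > best.status ? c : best;
    // Same rank: break the tie on content so every region picks the same one.
    return JSON.stringify(c) > JSON.stringify(best) ? c : best;
  }, null);
}

// Illustrative versions of one payment, written in different regions.
const authorize = { id: "pay-1", status: 1, lastTouchedBy: "weu-api" };
const capture = { id: "pay-1", status: 2, lastTouchedBy: "eus2-api" };
const refund = { id: "pay-1", status: 3, lastTouchedBy: "sea-api" };

console.log(pickWinner([authorize, capture]).status); // 2
console.log(pickWinner([capture, authorize]).status); // 2
console.log(pickWinner([refund, capture]).status);    // 3
```

Running each case in both orders is the useful habit: a resolver that gives a different answer when existing and incoming are swapped will diverge between regions.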

They also moved the LWW-style fields they could safely auto-merge (audit tags, lastTouchedBy) into the same sproc so nothing fell back to _ts, and they switched the policy to Custom so a sproc failure would route to the conflicts feed rather than silently dropping a write. Post-change, a six-month reconciliation run showed zero ledger regressions. The conflicts-feed alert (depth > 0 for more than five minutes) plus a ReplicationLatency P99 alert gave them the early-warning signals they had been missing entirely. The before/after, because the contrast is the lesson:

Dimension Before (default LWW on _ts) After (custom resolver sproc)
Resolution basis Last-modified timestamp, 1 s granularity Business status rank, refund absorbing
Same-second conflict Tie → arbitrary winner, capture lost Higher status wins deterministically
Loser visibility None (LWW never populates feed) Sproc folds all versions; failures → feed
Reconciliation result Five-figure mismatch after 31 h Zero regressions over six months
Detection signal Out-of-band daily reconciliation Feed-depth + replication-latency alerts
Write SLO (APAC) Met (Session, multi-write) Still met — no topology change
Cost ₹240,000/mo (3 write regions) Unchanged; the fix was the policy

The line the team wrote into their design guide: on a multi-region-write account, the conflict-resolution policy is part of your data model, not an afterthought — and default LWW on _ts is almost never correct for stateful, ordered domains.

Advantages and disadvantages

Multi-region writes both unlock global low-latency write workloads and introduce the distributed-systems tax. Weigh it honestly:

Advantages (why you reach for it) Disadvantages (why it bites)
Local write latency everywhere — nearest region ACKs, no cross-ocean write round trip Concurrent same-doc edits across regions are now possible; you must resolve conflicts
RTO for writes ≈ 0 — every region already writes, no promotion step on a region loss RPO is non-zero for every multi-write level; only Bounded Staleness caps it
Higher write availability — a region loss doesn’t zero out write capability Strong consistency is off the table entirely; you lose linearizability
Pluggable resolution (LWW path, sproc, manual feed) fits ordered or mergeable domains The policy is set at container creation and effectively immutable — a wrong choice means a migration
Built-in conflicts feed is a safety net for custom/sproc failures Manual mode silently diverges if nobody drains the feed
Session consistency keeps per-user read-your-writes cheap Cross-session reads can miss recent writes unless you flow the token
Bounded Staleness gives a contractual freshness SLA you can promise Enforced minimums (100000 ops / 300 s) may be looser than you’d like
Scales globally without app-level sharding of the write path Provisioned write RU cost ≈ N× for N write regions

The model is right for globally distributed, write-active, latency-sensitive workloads where the data is either tolerant (telemetry, counters, sessions) or has a resolvable conflict story (an ordered state machine you can rank, or documents you can merge). It is wrong when you need true linearizability (use single-write Strong), when the data has no sane merge and any loss is unacceptable without heavy custom work, or when only one region actually writes (then you want read replicas, not the N× write-RU bill). The disadvantages are all manageable — but only if you treat consistency and conflict resolution as first-class design, which is the entire point of this article.

Hands-on lab

Stand up a two-region multi-write account, create one container on the default LWW-on-_ts policy and one on LWW-on-/version, and verify which policy each container is actually bound to. Everything runs on a single account (we add a second region briefly; delete at the end to stop the RU/region cost). Run in Cloud Shell (Bash) unless noted.

Step 1 — Variables and resource group.

RG=rg-cosmos-lab
ACC=kvcosmoslab$RANDOM        # globally-unique account name
LOC1=eastus2
LOC2=westeurope
az group create -n $RG -l $LOC1 -o table

Step 2 — Create a single-region account at Session consistency.

az cosmosdb create -n $ACC -g $RG \
  --locations regionName=$LOC1 failoverPriority=0 isZoneRedundant=false \
  --default-consistency-level Session -o table

Expected: an account row; enableMultipleWriteLocations defaults to false.

Step 3 — Add a second region and enable multi-region writes.

az cosmosdb update -n $ACC -g $RG \
  --locations regionName=$LOC1 failoverPriority=0 isZoneRedundant=false \
  --locations regionName=$LOC2 failoverPriority=1 isZoneRedundant=false
az cosmosdb update -n $ACC -g $RG --enable-multiple-write-locations true
az cosmosdb show -n $ACC -g $RG --query "enableMultipleWriteLocations" -o tsv

Expected: the final command prints true, and writeLocations now lists both regions.

Step 4 — Create a DB and two containers: one default-LWW, one LWW-on-/version.

az cosmosdb sql database create -a $ACC -g $RG -n shop -o table

# Container A: default LWW (resolves on _ts)
az cosmosdb sql container create -a $ACC -g $RG -d shop -n orders_ts \
  --partition-key-path "/tenantId" --throughput 400 -o table

# Container B: LWW on a numeric /version you own
az cosmosdb sql container create -a $ACC -g $RG -d shop -n orders_ver \
  --partition-key-path "/tenantId" \
  --conflict-resolution-policy '{"mode": "LastWriterWins", "conflictResolutionPath": "/version"}' \
  --throughput 400 -o table

Step 5 — Inspect the bound policy on each container (this is the verification that matters).

az cosmosdb sql container show -a $ACC -g $RG -d shop -n orders_ts \
  --query "resource.conflictResolutionPolicy" -o json
az cosmosdb sql container show -a $ACC -g $RG -d shop -n orders_ver \
  --query "resource.conflictResolutionPolicy" -o json

Expected: orders_ts shows "conflictResolutionPath": "/_ts" (the default); orders_ver shows "conflictResolutionPath": "/version". This single difference is the whole lesson — the ordered-domain container must not resolve on _ts.
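
Why the path matters can be shown with two versions written in the same second. This is a toy model in plain JavaScript, not how the service is implemented: lwwOutcome is a hypothetical helper that compares the value at a numeric path, and it reports a tie rather than guessing how the service would break one. The sample documents are illustrative.

```javascript
// Compare two versions the way an LWW policy on a numeric path would:
// the larger value at `path` wins; equal values are reported as a tie.
function lwwOutcome(a, b, path) {
  const key = path.replace(/^\//, ""); // "/version" -> "version"
  if (a[key] === b[key]) return "tie";
  return a[key] > b[key] ? a : b;
}

// Same second on the server clock, but the writer advanced its own version.
const authorize = { id: "pay-1", status: 1, _ts: 1700000000, version: 1 };
const capture = { id: "pay-1", status: 2, _ts: 1700000000, version: 2 };

console.log(lwwOutcome(authorize, capture, "/_ts"));            // tie
console.log(lwwOutcome(authorize, capture, "/version").status); // 2
```

On /_ts the two writes are indistinguishable, so which one survives is not something your data model controls; on /version the capture wins because the writer made the order explicit.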

Step 6 — Confirm consistency and write topology, the production gate.

az cosmosdb show -n $ACC -g $RG \
  --query "{multiWrite:enableMultipleWriteLocations, consistency:consistencyPolicy.defaultConsistencyLevel, writeRegions:writeLocations[].locationName}" -o json

Expected: multiWrite: true, consistency: Session, and both regions under writeRegions. (To observe a real cross-region conflict resolve you would write the same id+tenantId through two clients, each pinned to a different region, at nearly the same moment or during a brief endpoint block. The local Cosmos DB emulator does not replicate across regions, so it cannot reproduce this; on a single live account the policy inspection in Step 5 is the deterministic check.)
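
If you want the gate to be mechanical rather than eyeballed, the JSON from Steps 5 and 6 can be fed to a small check. The sketch below is plain JavaScript (Node.js); gateFailures and the orderedDomain flag are hypothetical, and the input shapes assume exactly the --query projections used above.

```javascript
// account: the JSON printed by Step 6. policy: the JSON printed by Step 5 for
// one container. orderedDomain: true when the container holds ordered/stateful data.
function gateFailures(account, policy, orderedDomain) {
  const failures = [];
  if (account.multiWrite !== true) failures.push("multi-region writes not enabled");
  if (account.consistency === "Strong") failures.push("Strong is incompatible with multi-write");
  if ((account.writeRegions || []).length < 2) failures.push("fewer than two write regions");
  if (orderedDomain && policy.mode === "LastWriterWins" &&
      policy.conflictResolutionPath === "/_ts") {
    failures.push("ordered data resolves on /_ts");
  }
  return failures;
}

// Illustrative inputs shaped like the lab's expected output.
const account = {
  multiWrite: true,
  consistency: "Session",
  writeRegions: ["East US 2", "West Europe"],
};
const tsPolicy = { mode: "LastWriterWins", conflictResolutionPath: "/_ts" };
const versionPolicy = { mode: "LastWriterWins", conflictResolutionPath: "/version" };

console.log(gateFailures(account, tsPolicy, true));      // one failure
console.log(gateFailures(account, versionPolicy, true)); // []
```

An empty list is the sign-off; anything else names the setting to fix before go-live.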

The lab steps mapped to what each proves:

Step What you did What it proves Real-world analogue
3 Enable multi-write on a 2-region account Every region becomes a write region The decision that introduces conflicts
4 Two containers, two LWW paths The policy is per-container and set at creation Choosing the policy as a data-model decision
5 Inspect conflictResolutionPolicy _ts default vs /version is visible and real The 90-second “is this safe?” check
6 Confirm multi-write + consistency The production gate before go-live Pre-prod sign-off

Cleanup (stop the per-region RU cost):

az group delete -n $RG --yes --no-wait

Cost note. Two 400-RU/s containers across two write regions for under an hour is a few rupees; the multi-region multiplier is what you are watching, and deleting the resource group stops all of it. Always delete lab accounts — an idle multi-region account still bills provisioned RU in every region.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table you can read mid-incident, then the entries that bite hardest expanded with the full confirm detail.

# Symptom Root cause Confirm (exact cmd / portal path) Fix
1 Reconciliation finds missing updates; no errors anywhere Default LWW on _ts dropped a write on a same-second tie az cosmosdb sql container show --query resource.conflictResolutionPolicy shows /_ts New container with LWW on /version, or a custom sproc
2 enable-multiple-write-locations true is rejected Account is at Strong consistency az cosmosdb show --query consistencyPolicy.defaultConsistencyLevel = Strong Set Bounded Staleness/Session first, then enable multi-write
3 Setting Bounded Staleness fails on a multi-region account Window below the multi-region floor Error cites maxStalenessPrefix/maxIntervalInSeconds --max-staleness-prefix 100000 --max-interval 300 (or larger)
4 Data quietly diverges between regions over days Custom (no sproc) feed nobody drains Conflicts.GetConflictQueryIterator returns entries; depth alert never built Build a continuous drainer + a feed-depth alert
5 Cross-session reader in region B misses a write from region A Session consistency, token not flowed App tiers don’t pass x-ms-session-token Propagate the session token, or use Bounded Staleness
6 LWW “loses” the newer write under clock skew LWW path is client wall-clock time Path is a client timestamp; regions’ clocks differ Use a monotonic per-doc version (RMW) or HLC
7 Sproc resolver produces different state in different regions Non-deterministic / non-idempotent resolver Resolver reads Date.now()/random; outputs differ Make it deterministic + idempotent (absolute state)
8 Can’t change the conflict policy on a live container Policy is immutable after creation az cosmosdb sql container show policy is fixed New container + change-feed migration, cut over behind a flag
9 Writes throttle (429) after enabling multi-write Write RU is now N× across regions TotalRequestUnits high; 429 rate up Raise provisioned/autoscale RU, or reduce write regions
10 Client keeps hitting a down region after failover No ApplicationPreferredRegions set SDK options lack the ordered region list Set ApplicationPreferredRegions; Direct mode
11 “Multi-write is on” but one region won’t accept writes A region is still a read replica (toggle not applied) writeLocations lacks that region Re-run --enable-multiple-write-locations true; verify
12 RPO bigger than expected after a region loss Consistency is Session/Eventual (unbounded RPO) consistencyPolicy not Bounded Staleness Move to Bounded Staleness to cap the lag window
13 Deletes “come back” after replication Delete lost the conflict to a concurrent update Doc reappears; LWW path favored the update Decide delete-wins in a sproc (isTombstone)
14 Cassandra/Mongo API: no custom resolver available Custom sproc/feed is NoSQL-API only Wrong API for pluggable resolution Use NoSQL API, or accept LWW on those APIs
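
Row 6 names a hybrid logical clock (HLC) as one way out of clock skew. A minimal sketch of the idea in plain JavaScript follows; createHlc, pack and the injected now function are hypothetical names, physical time is in whole seconds, and packing into a single number (so it can serve as a numeric LWW path) assumes the counter stays below 65536. Treat it as an illustration of the technique, not a production clock.

```javascript
// Minimal hybrid logical clock. `now` is injected so the sketch is deterministic.
function createHlc(now) {
  let l = 0; // highest physical time seen so far (seconds)
  let c = 0; // logical counter within that second
  return {
    // Stamp a local write.
    tick() {
      const pt = now();
      if (pt > l) { l = pt; c = 0; } else { c += 1; }
      return { l, c };
    },
    // Stamp a write made after reading a document stamped by another writer.
    observe(remote) {
      const pt = now();
      const max = Math.max(l, remote.l, pt);
      if (max === l && max === remote.l) c = Math.max(c, remote.c) + 1;
      else if (max === l) c += 1;
      else if (max === remote.l) c = remote.c + 1;
      else c = 0;
      l = max;
      return { l, c };
    },
  };
}

// Pack a stamp into one number usable as a numeric LWW path value.
const pack = (stamp) => stamp.l * 65536 + stamp.c;

const regionA = createHlc(() => 1000); // this clock runs 10 s ahead
const regionB = createHlc(() => 990);

const first = regionA.tick();          // A writes first
const second = regionB.observe(first); // B reads A's version, then writes

console.log(pack(second) > pack(first)); // true: the later write carries the larger value
console.log(990 > 1000);                 // false: raw client time would have lost
```

The later write wins even though its wall clock is behind, because the stamp never moves backwards past anything the writer has already seen.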

The expanded form for the entries that bite hardest:

1. Reconciliation finds missing updates; nothing errored. Root cause: Default LWW on _ts resolved a conflict on a one-second timestamp tie and silently discarded a real write; LWW never populates the conflicts feed, so there is no error trail. Confirm: az cosmosdb sql container show --account-name <acc> -g <rg> -d <db> -n <container> --query "resource.conflictResolutionPolicy" shows mode: LastWriterWins and conflictResolutionPath: /_ts. Fix: For ordered/stateful data, create a new container with LWW on a monotonic /version you own, or a custom resolver sproc that ranks on business state. Migrate via the change feed; the policy can’t be changed in place.

2. Enabling multi-region writes is rejected. Root cause: The account is at Strong consistency, which is incompatible with multi-region writes (linearizability needs one global order). Confirm: az cosmosdb show -n <acc> -g <rg> --query "consistencyPolicy.defaultConsistencyLevel" returns Strong. Fix: Lower to Bounded Staleness (or Session) first with az cosmosdb update -n <acc> -g <rg> --default-consistency-level BoundedStaleness --max-staleness-prefix 100000 --max-interval 300, then run az cosmosdb update -n <acc> -g <rg> --enable-multiple-write-locations true.

3. Setting Bounded Staleness fails on a multi-region account. Root cause: The window is below the multi-region floor (maxStalenessPrefix >= 100000, maxIntervalInSeconds >= 300). Confirm: The CLI error names the parameter that is too small. Fix: Pass values at or above the floor: --max-staleness-prefix 100000 --max-interval 300. Tighter windows are only allowed on single-region accounts.

4. Data quietly diverges between regions over days. Root cause: Custom (no sproc) policy writes conflicting versions to the conflicts feed, and nobody drains it — so divergence accumulates invisibly. Confirm: container.Conflicts.GetConflictQueryIterator<ConflictProperties>() returns entries; you have no alert on feed depth or conflict activity. Fix: Run a continuous drainer (Function/worker) that resolves each entry, deletes it after handling, and emits a depth gauge; alert on depth > 0 sustained.

5. A cross-session reader misses a recent write. Root cause: Session consistency scopes read-your-writes to the session token; a different client/tier reading in another region without the token can miss a just-written value. Confirm: The write tier captures response.Headers.Session but downstream readers don’t pass it back as SessionToken. Fix: Flow the session token across tiers (header/cookie), or move readers that can’t carry it to Bounded Staleness for a global bounded guarantee.

6. LWW loses the newer write under clock skew. Root cause: The LWW path is a client wall-clock timestamp; regional clock skew means the “later” write can carry the smaller number and lose. Confirm: The conflictResolutionPath points at a client-set time field; regions’ clocks differ by more than the conflict window. Fix: Resolve on a monotonic per-document version advanced by read-modify-write, or a hybrid logical clock — never raw client time.

8. Can’t change the conflict policy on a live container. Root cause: The conflict-resolution policy is set at container creation and effectively immutable. Confirm: az cosmosdb sql container show ... --query "resource.conflictResolutionPolicy" shows the old policy and no SDK/portal path changes it. Fix: Create a new container with the right policy, drain the change feed into it with a Function (live backfill), and cut over behind a feature flag — the same pattern as a partition-key change.

9. Writes throttle (429) right after enabling multi-write. Root cause: Write RU is now multiplied across write regions; the provisioned/autoscale ceiling that was fine for one write region is now insufficient. Confirm: TotalRequestUnits climbs and the 429 rate rises; the account shows N write regions. Fix: Raise provisioned or autoscale max RU/s to cover N× write cost, or reduce the number of write regions (keep some as read replicas).
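
The Bounded Staleness floor from entry 3 is easy to check before the CLI rejects it. The sketch below is plain JavaScript (Node.js); stalenessErrors is a hypothetical pre-flight helper. The multi-region floor is the one quoted in this article, while the single-region values (10 operations, 5 seconds) are not stated here and should be treated as an assumption to verify against current Azure documentation.

```javascript
// Minimum Bounded Staleness windows by account shape.
const FLOORS = {
  multiRegion: { prefix: 100000, intervalSeconds: 300 }, // from this article
  singleRegion: { prefix: 10, intervalSeconds: 5 },      // assumption: verify
};

// Returns a list of problems; an empty list means the window is acceptable.
function stalenessErrors(prefix, intervalSeconds, regionCount) {
  const floor = regionCount > 1 ? FLOORS.multiRegion : FLOORS.singleRegion;
  const errors = [];
  if (prefix < floor.prefix) {
    errors.push(`maxStalenessPrefix ${prefix} < ${floor.prefix}`);
  }
  if (intervalSeconds < floor.intervalSeconds) {
    errors.push(`maxIntervalInSeconds ${intervalSeconds} < ${floor.intervalSeconds}`);
  }
  return errors;
}

console.log(stalenessErrors(100000, 300, 3)); // []
console.log(stalenessErrors(50000, 60, 3));   // two errors
console.log(stalenessErrors(50000, 60, 1));   // []
```

The same window that is valid on a single-region account fails as soon as a second region is added, which is why this check belongs in the step that adds regions.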

Best practices

Security notes

The security controls that also prevent operational incidents — secure and resilient pull the same way here:

| Control | Setting / mechanism | Secures against | Also prevents |
|---|---|---|---|
| Entra RBAC data plane | Built-in Data Contributor + managed identity | Account-key sprawl and leakage | Rotation breakage from hard-coded keys |
| Disable public access | publicNetworkAccess: Disabled + private endpoints | Internet-reachable data | Exfiltration over public endpoints |
| Per-region private endpoint | PE + private DNS per region | Cross-region traffic on the public internet | DNS misresolution sending writes off-region |
| Scoped RBAC roles | Custom data roles per container | Lateral movement across containers | A bad app touching unrelated data |
| CMK encryption | Key Vault-held key | Provider-side data exposure | Loss of crypto-shred / revoke capability |
| Data-plane diagnostics | DataPlaneRequests to Log Analytics | Undetected anomalous writes | Silent conflict divergence going unseen |

Cost & sizing

The bill drivers and how they interact with multi-region writes:

A rough monthly picture (INR, indicative — verify with the Azure pricing calculator for your regions):

| Configuration | Write RU model | Storage | Rough INR / month | When it’s the right shape |
|---|---|---|---|---|
| 1 write region, 10k RU/s | 1× write RU | 1× per GB | ~₹85,000 | Single-region write, global reads not needed |
| 1 write + 2 read replicas | 1× write + read RU | 3× per GB | ~₹160,000 | Global low-latency reads, single writer |
| 2 write regions, 10k RU/s each | ~2× write RU | 2× per GB | ~₹170,000 | Two-continent write locality |
| 3 write regions, 10k RU/s each | ~3× write RU | 3× per GB | ~₹255,000 | True global active-active writes |
| Autoscale 10k max, 3 write regions | ~3× at 1.5× rate | 3× per GB | ~₹290,000 peak | Spiky global writes; avoids 429 |

Right-sizing rules:

| If you observe… | It usually means… | Do this |
|---|---|---|
| One region writes >> others | You’re paying N× for 1× benefit | Make the quiet regions read replicas |
| Sustained 429 after multi-write | Write RU ceiling too low for N× | Raise RU or autoscale max |
| RU far below provisioned but 429s | A hot partition, not a region issue | Fix the partition key (see partition-key article) |
| Bill dominated by storage | Many regions, large dataset | Trim regions or archive cold data |
| Bounded Staleness window very tight | Higher coordination latency/cost | Loosen the window; the multi-region floor is 100000 operations / 300 s |
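
The two 429 rows above are the ones most often confused, so here is a minimal sketch of that triage. Everything in it is an assumption for illustration: the function name `triage_429`, the 70% and 90% thresholds, and the idea that you already have per-partition consumed RU/s figures from your metrics. The only platform fact it relies on is that provisioned throughput is divided evenly across physical partitions.

```python
def triage_429(provisioned_ru: float, partition_ru: list[float]) -> str:
    """Classify a 429 pattern from consumed RU/s per physical partition.

    Thresholds are illustrative, not Cosmos DB defaults.
    """
    per_partition_cap = provisioned_ru / len(partition_ru)  # even split per partition
    total = sum(partition_ru)
    hottest = max(partition_ru)

    # One partition is pinned at its share while the container as a whole is idle.
    if hottest >= per_partition_cap and total < 0.7 * provisioned_ru:
        return "hot-partition"
    # The whole container is close to the provisioned ceiling.
    if total >= 0.9 * provisioned_ru:
        return "raise-ru"
    return "inconclusive"


print(triage_429(10_000, [2500, 300, 200, 100]))     # hot-partition
print(triage_429(10_000, [2500, 2400, 2450, 2500]))  # raise-ru
print(triage_429(10_000, [500, 400, 300, 200]))      # inconclusive
```

The first case maps to "fix the partition key", the second to "raise RU or autoscale max"; an inconclusive result means the throttling needs a closer look at the metrics before changing either.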

Interview & exam questions

1. Why is Strong consistency incompatible with multi-region writes? Strong guarantees linearizability, which requires a single global ordering of all writes. With multiple regions independently accepting and ACKing writes locally, no single global order exists, so the guarantee cannot hold. Cosmos therefore rejects enabling multi-region writes on a Strong account; you must drop to Bounded Staleness, Session, Consistent Prefix or Eventual first.

2. What does multi-region writes do to your provisioned RU cost, and why? It roughly multiplies the write RU cost by the number of write regions, because every write is committed and replicated in each write region and billed there. The mitigation is to make only the regions that truly need write locality into write regions and keep the rest as read replicas, which scale independently.

3. Default LWW resolves on _ts. Why is that dangerous for an ordered domain? _ts is a last-modified timestamp at one-second granularity; two concurrent writes in the same second tie, and Cosmos keeps one deterministically but arbitrarily, silently discarding the other (LWW never populates the conflicts feed). For a state machine (e.g. payments), this can drop a capture in favor of an authorize. Resolve on a monotonic /version you own or a custom sproc that ranks on business state.

4. Compare Bounded Staleness and Session for a multi-region-write account. Bounded Staleness gives a global, quantified freshness bound (no more than K versions or T seconds stale; minimums 100000 ops / 300 s on multi-region) and behaves like Strong within a region — good for multi-reader and SLA freshness. Session gives read-your-writes only within a session token, is the cheapest level, and is right for per-user workloads where you control token propagation; a different session in another region can miss a recent write.

5. What are the three conflict types, and how does each surface under LWW vs Custom? Insert (two regions create the same id+PK), replace/update (concurrent edits), delete (delete vs concurrent update). Under LWW all three resolve on the numeric path, winner committed, loser discarded silently. Under Custom-sproc your resolver receives the versions (with isTombstone for deletes) and decides. Under Custom-manual the conflicting versions land in the conflicts feed for your app to reconcile.

6. Walk through configuring LWW on a custom path and the invariants you must hold. Create the container with conflictResolutionPolicy.mode = LastWriterWins and conflictResolutionPath = /version. Invariants: the path is always present and numeric (missing = 0), monotonically increasing per document (so a stale retry loses), and unique enough to avoid ties on writes you care about. Prefer a version counter advanced by read-modify-write or a hybrid logical clock over client wall-clock time, which turns clock skew into data loss.

7. What’s the RTO and RPO of a multi-region-write account, and what governs each? RTO for writes is ≈0 because every region already writes — there’s no promotion step on a region loss; the SDK just stops routing to the down region. RPO is non-zero and is governed by the consistency level: Bounded Staleness caps it to the staleness window; Session/Consistent Prefix/Eventual leave it unbounded in the worst case; Strong (RPO 0) is unavailable here. You buy RTO and pay in RPO.

8. A resolver sproc produces different results in different regions. What’s wrong and how do you fix it? The sproc is non-deterministic or non-idempotent — likely reading Date.now(), a random value, or applying a delta rather than computing an absolute state. Cosmos may invoke the resolver more than once and in each region, so identical inputs must yield identical outputs. Fix by making it deterministic and idempotent, resolving to a fully-specified final document, and folding in conflictingItems.

9. You set Custom (no sproc) and data is quietly diverging. What did you forget? The conflicts feed has no owner. In manual mode Cosmos writes conflicting versions to the feed and stops; your application must drain it (read, resolve with a business rule, replace the committed doc, delete the feed entry) on a continuous schedule, and alert on feed depth. Without a drainer, divergence accumulates invisibly.

10. How do you make client failover transparent during a regional outage? Configure CosmosClientOptions.ApplicationPreferredRegions with an ordered region list and use Direct connection mode. On a regional failure the SDK automatically retries the next preferred region without a redeploy. Rehearse it by blocking egress to the primary region’s endpoint or running az cosmosdb failover-priority-change, and confirm the service keeps serving.

11. Can you change a container’s conflict-resolution policy after creation? What’s the migration if not? No — the policy is set at container creation and effectively immutable. To change it you create a new container with the desired policy, backfill via the change feed (an Azure Function draining the old container into the new one live), and cut over behind a feature flag — the same pattern as changing a partition key.

12. Which APIs support custom (sproc/feed) conflict resolution? Only the Cosmos DB for NoSQL API supports pluggable resolution (LWW path, stored-procedure resolver, and the manual conflicts feed). Cassandra, MongoDB and Gremlin APIs typically support LWW only, with no custom resolver — a key reason to choose the NoSQL API when you need ordered/mergeable conflict semantics.
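
The determinism and idempotence requirements from question 8 can be checked with a plain function before any stored procedure is written. The sketch below is a Python model, not the JavaScript resolver itself: `STATE_RANK`, the `status`, `version` and `region` fields, and the rule that a tombstone simply wins are all hypothetical choices made for illustration. The parameter order follows the resolver inputs described above (incoming item, existing item, tombstone flag, conflicting items).

```python
from typing import Optional

# Hypothetical business ranking: a later state always beats an earlier one.
STATE_RANK = {"authorized": 1, "captured": 2, "refunded": 3}


def sort_key(doc: dict) -> tuple:
    # Rank on business state first, then version, then region name as a fixed
    # tie-breaker so every region picks the same winner from the same inputs.
    return (STATE_RANK[doc["status"]], doc["version"], doc["region"])


def resolve(incoming: Optional[dict], existing: Optional[dict],
            is_tombstone: bool, conflicting: list) -> Optional[dict]:
    """Return the full final document; no clocks, randomness or deltas."""
    if is_tombstone:
        return None  # assumption for this sketch: a delete wins outright
    candidates = [d for d in (incoming, existing, *conflicting) if d is not None]
    return dict(max(candidates, key=sort_key))


a = {"id": "p1", "status": "authorized", "version": 3, "region": "eastus2"}
b = {"id": "p1", "status": "captured", "version": 2, "region": "westeurope"}

once = resolve(a, b, False, [])
print(once["status"])                         # captured
print(resolve(b, a, False, []) == once)       # True: argument order does not matter
print(resolve(once, b, False, [a]) == once)   # True: re-running changes nothing
```

Because the result depends only on the documents passed in, repeated invocations and invocations in different regions converge on the same document, which is the property the stored procedure has to preserve.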

These map to DP-420 (Designing and Implementing Cloud-Native Applications Using Microsoft Azure Cosmos DB): consistency, global distribution, conflict resolution and change feed. They also touch AZ-305 (Solutions Architect Expert) for the multi-region HA/DR and RPO/RTO design. A compact cert-mapping for revision:

| Question theme | Primary cert | Objective area |
|---|---|---|
| Consistency levels & trade-offs | DP-420 | Design and implement data distribution |
| Conflict types & resolution policies | DP-420 | Implement conflict resolution |
| LWW path / resolver sprocs / change feed | DP-420 | Integrate and optimize; server-side programming |
| Multi-region HA/DR, RPO/RTO | AZ-305 | Design business continuity solutions |
| RU cost of multi-write, sizing | DP-420 / AZ-305 | Optimize cost; design data platform |
| Entra RBAC, private endpoints, CMK | AZ-305 / AZ-500 | Secure the data platform |

Quick check

  1. You try to enable multi-region writes and the operation is rejected. What is the single most likely cause, and the one command that confirms it?
  2. A reconciliation job finds a missing update but every log is clean and nothing is in the conflicts feed. What policy is almost certainly in play, and why is the feed empty?
  3. True or false: scaling provisioned RU/s higher is the right fix when writes throttle (429) immediately after you enable multi-region writes.
  4. Your app uses Session consistency. A user’s write in East US 2 isn’t visible to a different service reading in West Europe. Name two valid fixes.
  5. You need to change a container’s conflict-resolution policy from LWW to a custom sproc. Can you do it in place? If not, what’s the migration?

Answers

  1. The account is at Strong consistency, which is incompatible with multi-region writes (linearizability needs one global order). Confirm with az cosmosdb show -n <acc> -g <rg> --query "consistencyPolicy.defaultConsistencyLevel" returning Strong; lower it to Bounded Staleness/Session, then enable multi-write.
  2. Default LWW on _ts. LWW resolves conflicts automatically and discards losers silently — they never appear in the conflicts feed — so a same-second _ts tie can drop a real write with no error trail. Fix with LWW on a monotonic /version or a custom resolver.
  3. Partly true but usually the wrong framing. If the 429s come from the N× write multiplier of multi-region writes, raising provisioned/autoscale RU (or reducing write regions) is correct. But if RU is far below provisioned while one partition 429s, it’s a hot partition — fix the partition key, not the RU.
  4. (a) Flow the session token (x-ms-session-token) from the writing tier to the reading service via header/cookie so read-your-writes is preserved across tiers; or (b) move the cross-region reader to Bounded Staleness for a global, bounded freshness guarantee that doesn’t need a token.
  5. No — the policy is set at container creation and is effectively immutable. Migrate by creating a new container with the custom sproc policy, draining the change feed from the old container into it with a Function (live backfill), and cutting over behind a feature flag.

Glossary

Next steps

You can now configure multi-region writes deliberately, pick a defensible consistency level, and build conflict resolution that survives a region loss. Build outward:
