Azure Cache for Redis Enterprise: Clustering, Active Geo-Replication, and Resilient Failover Patterns

Most “Redis is down” pages I have been dragged into were not Redis failing. They were a client library that opened a single connection to a single node, hardcoded a regional hostname, and treated MOVED as a fatal error instead of a routing hint. Azure Cache for Redis Enterprise – the tier built on the commercial Redis Enterprise runtime rather than OSS Redis – gives you clustering, multi-region active-active replication, durable persistence, and the Redis modules (Search, JSON, TimeSeries, Bloom). But every one of those features changes the contract your client must honor. Cross-slot multi-key commands are no longer free. A node can move under you mid-request. Two regions can both accept a write to the same key and you have to decide who wins. This guide wires up the Enterprise tier correctly and, just as importantly, builds the client-side behavior that survives the day the topology shifts.

Everything here targets the Enterprise and Enterprise Flash tiers, with notes on where Premium diverges. The provider resource is Microsoft.Cache/redisEnterprise, a different ARM resource type from the Microsoft.Cache/redis you use for Basic/Standard/Premium. That distinction trips up Terraform and Bicep modules constantly, and it is the first thing to get right because a module written for one resource type silently does nothing useful against the other.

By the end you will stop treating “the cache is down” as a single event. You will know whether you are looking at a CROSSSLOT error from a keyspace that ignored hash tags, a MOVED that escaped to application code because cluster mode is off, an OOM because NoEviction met an undersized cluster, a connection storm because the client pools per-request instead of multiplexing, or a genuine regional outage that an active-active CRDB should have made a non-event. Knowing which in ninety seconds is what separates a five-minute blip from a two-hour incident.

What problem this solves

A cache that holds session state, idempotency keys, feature flags, rate-limit counters, or a hot read-through layer is on the critical path of every request. When it stalls, the application stalls behind it – or worse, falls back to the database and runs at a fraction of normal throughput until the database itself buckles. The naive answer (“just add a replica”) solves node loss but not the three failure classes that actually page senior engineers: a client that cannot follow a topology change, a single-region cache that cannot accept writes during a regional incident, and a cache used as an LRU eviction store in a geo-replicated topology where evictions silently diverge the regions.

What breaks without the Enterprise patterns: a passive geo-replica that is read-only at exactly the moment you need to write to it (during the primary region’s outage), turning an eleven-minute regional blip into eleven minutes of either blocked writes or duplicate processing. A MULTI/EXEC transaction that worked in dev against a single node and throws CROSSSLOT the first time it runs against a real cluster. A ConnectionMultiplexer created with AbortOnConnectFail = true (the StackExchange.Redis default outside endpoints it recognizes as Azure) that throws at startup when the first connect lands in a maintenance blip, leaving the process with no multiplexer to reconnect in the background. A cache sized for the average working set that OOMs the moment a campaign doubles the keyspace, returning OOM command not allowed on every write because the policy is NoEviction.

Who hits this: any team running Redis on the critical path at scale. It bites hardest on multi-region applications (where passive replication is mistaken for an availability tool), on stateful workloads that model everything as opaque strings (and so eat last-write-wins data loss), on teams that adopt clustering without auditing their client library’s cluster support, and on anyone who treats Redis persistence as a backup. The fix is rarely “buy a bigger SKU” – it is “model the data as the right CRDT, pick the clustering policy the client can actually drive, size for the full working set under NoEviction, and make the client survive a node move.”

To frame the field before the deep dive, here is every failure class this article covers, the question it forces, and where to look first:

Failure class	What it looks like	First question to ask	First place to look	Most common single cause
CROSSSLOT error	Multi-key command rejected	Are these keys in one hash slot?	Client exception text	Keyspace ignores hash tags
MOVED in app code	App sees a `MOVED 1234 ip:port`	Is the client in cluster mode?	Client config / cluster flag	Cluster-unaware client on OSS policy
OOM on write	`OOM command not allowed`	Memory % vs eviction policy	`usedmemorypercentage` metric	`NoEviction` + undersized cluster
Connection storm	Thousands of conns/sec	One multiplexer or pool-per-request?	`connectionscreatedpersecond`	New connection per operation
Regional write outage	One region can’t write	Passive replica or active-active?	Topology / `geoReplication`	Passive geo-replica used for HA
Replica divergence	Two regions disagree	Is eviction on in a geo group?	Eviction policy + metrics	LRU eviction on an active-active DB

Learning objectives

By the end of this article you can:

Choose between Standard, Premium, Enterprise, and Enterprise Flash from durability and topology requirements rather than raw memory size, and explain why Enterprise is a separate runtime, not the next rung on a ladder.
Pick the right clustering policy (OSS vs Enterprise/proxy) at creation time, knowing it is permanent, and design a keyspace with hash tags so transactions, Lua, and MGET/MSET stay single-slot.
Stand up active geo-replication as a conflict-free replicated database (CRDB) across regions, and model write data as the correct CRDT (counter, set, hash) where last-write-wins on strings is unacceptable.
Configure RDB and AOF persistence with the right durability trade-off, and explain why persistence is not a backup and how it relates to (but differs from) replication.
Lock the cache into a VNet with a Private Endpoint, Private DNS, TLS-only on port 10000, and Microsoft Entra ID token auth instead of a shared access key.
Build a resilient client: a singleton multiplexer with AbortOnConnectFail = false, periodic health checks, and a jittered operation-level retry that survives scaling, patching, and node moves.
Run online scale-up and scale-out operations, validate them with a load test through a live reshard, and alert on the leading indicators (used-memory %, eviction rate, server load, client-side p99) before they become outages.

Prerequisites & where this fits

You should already understand Redis at the data-structure level: strings, hashes, sets, sorted sets, TTL/EXPIRE, and the basic command set (SET, GET, INCR, MGET, MULTI/EXEC). You should know how to run az in Cloud Shell, read JSON output, and reason about a VNet, subnet, NSG, and Private DNS zone. Familiarity with at least one Redis client library (StackExchange.Redis, Lettuce/Jedis, redis-py, go-redis) helps, because half of the resilience story lives in client configuration.

This sits in the Data & Caching track of the Zero-to-Hero program. It assumes the networking fundamentals from Azure Virtual Network Basics: Subnets, NSGs, and Peering and the private-connectivity patterns from Azure Private Endpoints and Private DNS at Scale. The identity and secret-handling side leans on Azure Key Vault: Secret Rotation with Managed Identity. For multi-region thinking beyond the cache, it pairs with Cosmos DB Multi-Region Writes and Conflict Resolution (a useful contrast: Cosmos lets you plug in conflict logic; Redis CRDTs resolve automatically per type) and Azure Multi-Region Active-Active Disaster Recovery. When the cache fronts a relational store, Azure SQL Database: Hyperscale, Elastic Pools, Ledger is the backing tier the cache protects.

A quick map of who owns what during a cache incident, so you call the right person fast:

Layer	What lives here	Who usually owns it	Failure classes it can cause
Client library	Multiplexer, retries, cluster mode	App / dev team	`MOVED` leaks, connection storms, no reconnect
Keyspace design	Hash tags, TTLs, data types	App / dev team	`CROSSSLOT`, LWW data loss, divergence
Cache database	Eviction, persistence, clustering	Platform / data team	OOM, restart data loss, slot routing
Geo-replication mesh	CRDB links, group nickname	Platform / data team	Write outage (if passive), replication lag
Network	Private Endpoint, DNS, NSG, TLS	Network team	Public exposure, DNS resolves to public IP
Identity	Access key, Entra token, RBAC	Security / platform	Leaked key, token expiry, auth failure

Core concepts

Six mental models make every later decision obvious.

Enterprise is a different runtime, not a bigger Premium. Basic/Standard/Premium run OSS Redis under Microsoft.Cache/redis. Enterprise and Enterprise Flash run the commercial Redis Enterprise software under Microsoft.Cache/redisEnterprise, with a parent cluster resource and a child database resource. The database is the thing your client connects to. This split is why a Bicep module that creates a Microsoft.Cache/redis resource cannot produce an Enterprise cache, and why the endpoint, port, and capabilities differ.

The clustering policy is a permanent client contract. You choose OSS or Enterprise clustering policy at database creation and you cannot change it without recreating the database. OSS exposes the native Redis Cluster API: the client discovers every shard, computes the CRC16 hash slot (16384 slots) for each key, and connects directly to the owning node – lowest latency, but it requires a cluster-aware client and the client sees every node address. Enterprise policy puts a proxy in front of the shards so the client talks to a single endpoint like a standalone Redis – simplest for clients and networking, at the cost of a proxy hop.

Multi-key commands need one hash slot. In any clustered Redis, a command touching multiple keys requires all of them in the same hash slot. MSET user:1001 a user:1002 b hashes the two keys to different slots and fails with CROSSSLOT under OSS policy. Hash tags – the substring inside the first {} – force co-location: MSET user:{t42}:1001 a user:{t42}:1002 b hashes both on t42. Design the keyspace so everything you co-access (a tenant, an order, a session) shares a hash tag; otherwise transactions, Lua scripts, and MGET/MSET break.

Active-active means no primary and automatic conflict resolution. Enterprise active geo-replication builds an Active-Active CRDB (conflict-free replicated database). Every region accepts reads and writes; changes replicate full-mesh to all peers. A region outage means you keep serving from the survivors with no failover step. Concurrent writes converge deterministically because each data type is reimplemented as a CRDT (conflict-free replicated data type): strings are last-write-wins by timestamp, counters are additive (both increments apply), and sets/hashes/sorted-sets merge per element. You do not write conflict logic; you choose the data type that gives the convergence you need.

Persistence survives restart; replication survives node loss; neither is a backup. RDB snapshots the dataset on an interval; you lose everything since the last snapshot on a hard failure. AOF logs every write, and with fsync every second the worst-case loss is ~1 second. Both protect against a full cluster restart. Replication protects against losing a node or a region. Neither protects against a bad FLUSHALL or a logic bug – that is what an export to a storage account is for.

The client is where outages are made or prevented. A correctly provisioned, multi-region, persistent cluster behind a broken client is still an outage. The client must multiplex (one long-lived connection, not pool-per-request), keep retrying instead of throwing on first-connect failure, follow MOVED/ASK redirects (OSS policy), health-check idle connections, and retry the operation with jittered backoff through a node move. Every one of those is a configuration choice, and the defaults are frequently wrong for Azure.
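
As a rough illustration of the last item, here is a minimal Python sketch of an operation-level retry with capped exponential backoff and full jitter. The names retry_with_jitter and TransientCacheError are hypothetical stand-ins, not part of any Redis client; map the exception to whatever timeout or connection error your library raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientCacheError(Exception):
    """Hypothetical stand-in for the timeout/connection error your client raises."""


def retry_with_jitter(
    op: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.05,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op(), retrying transient failures with capped backoff and full jitter."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return op()
        except TransientCacheError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The cap doubles each attempt (0.05s, 0.1s, 0.2s, ...) up to max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random slice of the cap so clients do not retry in lockstep.
            sleep(random.uniform(0.0, cap))
    raise AssertionError("unreachable")
```

Wrap only operations that are safe to repeat. A GET or an idempotent SET can be retried freely; an INCR that timed out may already have been applied, so retrying it can double-count.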

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters
Enterprise cluster	The `redisEnterprise` parent resource (nodes)	Subscription / RG	Holds the SKU, zones, capacity
Database	The child `databases/default` Redis endpoint	On the cluster	The thing the client connects to
Clustering policy	OSS (native) vs Enterprise (proxy) routing	Database property (permanent)	Dictates client mode + networking
Hash slot	One of 16384 CRC16 buckets for keys	Across shards	Multi-key cmds need one slot
Hash tag	`{...}` substring that is hashed	In the key name	Forces co-location of keys
CRDB	Conflict-free replicated (geo) database	Across regions	Active-active, no primary
CRDT	Conflict-free data type (per Redis type)	Per key/value	Deterministic merge of writes
RDB / AOF	Snapshot / append-only persistence	Database property	Restart durability (not backup)
Eviction policy	What happens at memory limit	Database property	`NoEviction` → OOM; LRU → loss/divergence
Private Endpoint	Private IP projection into a VNet	In your subnet	Removes public exposure
Multiplexer	One long-lived client connection object	In the app	Prevents connection storms
Server Load	% time the Redis main thread is busy	Azure Monitor metric	Leading CPU-bound indicator

Tier selection: Standard, Premium, Enterprise, Enterprise Flash

Pick the tier from your durability and topology requirements, not from raw memory size. The tiers are not a linear ladder – Enterprise is a separate runtime, and Flash trades RAM for NVMe to cut cost on large skewed datasets.

Capability	Standard	Premium	Enterprise	Enterprise Flash
Runtime	OSS Redis	OSS Redis	Redis Enterprise	Redis Enterprise
ARM resource type	`Microsoft.Cache/redis`	`Microsoft.Cache/redis`	`Microsoft.Cache/redisEnterprise`	`Microsoft.Cache/redisEnterprise`
SLA (single region)	99.9%	99.9% (99.99% zone-redundant)	up to 99.999%	up to 99.999%
Clustering	no	OSS only	OSS or Enterprise policy	OSS or Enterprise policy
Active geo-replication	no	passive (geo-replica link)	active-active (CRDB)	active-active (CRDB)
Persistence (RDB + AOF)	no	yes	yes	yes
Redis modules (Search/JSON/etc.)	no	no	yes	yes
Storage medium	RAM	RAM	RAM	RAM + NVMe tier
Zone redundancy	no	yes	yes	yes
Default TLS port	6380	6380	10000	10000

The deciding factors, as a decision table:

If you need…	Then choose…	Because…
Cheapest possible cache, regenerable data	Standard	No persistence, no clustering, lowest price
Single-region HA + persistence, no modules	Premium	OSS clustering + RDB/AOF, 99.99% zone-redundant
Multi-region read failover (manual)	Premium + passive geo-replica	One-way link; DR tool, not availability
Multi-region write (both regions accept writes)	Enterprise	Active-active CRDB, no failover step
RediSearch / RedisJSON / TimeSeries / Bloom	Enterprise	Modules exist only on the Enterprise runtime
Large dataset, skewed access, cost-sensitive	Enterprise Flash	Hot keys in RAM, cold values on NVMe
Uniformly hot, latency-critical, large	Enterprise (not Flash)	The flash hop adds p99 latency
Highest SLA (99.999%)	Enterprise / Flash	Only the Enterprise runtime offers it

A note on each tier’s sweet spot, because the table compresses real trade-offs:

Tier	Best for	Avoid when	Key limit / gotcha
Standard	Dev, regenerable read caches	You need persistence or HA	No clustering; single node failure = data loss
Premium	Prod single-region with persistence	You need multi-region writes or modules	Passive geo-replica is read-only + manual failover
Enterprise	Modules, active-active, 99.999%	Tiny caches where cost dominates	Distinct resource type; even-numbered capacity
Enterprise Flash	Large session/cache stores, skewed	Uniformly hot or latency-critical	NVMe hop visible at p99 on hot keys

# Enterprise tier uses a distinct resource: redisenterprise, with a child database.
# --capacity must be EVEN for Enterprise SKUs (nodes deploy in HA pairs).
az redisenterprise create \
  --name kv-redis-prod \
  --resource-group rg-data-prod \
  --location eastus2 \
  --sku Enterprise_E10 \
  --capacity 2 \
  --zones 1 2 3 \
  --no-database   # skip the auto-created default database; it is created explicitly below

# The database (the actual Redis endpoint) is a child resource.
az redisenterprise database create \
  --cluster-name kv-redis-prod \
  --resource-group rg-data-prod \
  --client-protocol Encrypted \
  --clustering-policy EnterpriseCluster \
  --eviction-policy NoEviction \
  --persistence aof-enabled=true aof-frequency=1s

The SKU name encodes both the engine family (Enterprise_E10, EnterpriseFlash_F300) and the size of each capacity unit. For Enterprise_E* SKUs, --capacity must be an even number (2, 4, 6, ...) because nodes deploy in pairs for HA; EnterpriseFlash_F* SKUs scale in a different series (3, 9, 15, ...). Always pass --zones 1 2 3 at create time; you cannot add zone redundancy to an existing cluster in place.

The SKU families and what the suffix encodes:

SKU family	Example SKUs	Storage	Scales by	Typical use
`Enterprise_E*`	`E5`, `E10`, `E20`, `E50`, `E100`	All RAM	SKU (up) + capacity (out)	Modules, active-active, latency-critical
`EnterpriseFlash_F*`	`F300`, `F700`, `F1500`	RAM + NVMe	SKU (up) + capacity (out)	Large skewed datasets, lower cost/GB

Clustering policies and key distribution

This is the single most consequential decision and it is permanent for the database’s lifetime: you choose it at creation and it dictates how your client connects and how multi-key operations behave.

OSS clustering policy exposes the native Redis Cluster API. The client discovers all shards, computes the CRC16 hash slot for each key, and connects directly to the owning node. Lowest latency and highest throughput because there is no proxy hop – but it requires a cluster-aware client, and the client sees every node’s address, which complicates private networking (every node IP must be routable).

Enterprise clustering policy puts a proxy in front of the shards. The client connects to a single endpoint as if it were a standalone Redis; the proxy routes commands to the correct shard. Far simpler for clients (any standard client works, no cluster mode) and for networking (one endpoint), at the cost of a proxy hop.

OSS vs Enterprise policy, side by side

Dimension	OSS clustering policy	Enterprise clustering policy
Client requirement	Cluster-aware client + cluster mode on	Any standard client; no cluster mode
Topology visibility	Client sees every shard/node address	Client sees one proxy endpoint
Network complexity	Every node IP must be reachable	One endpoint to route/whitelist
Latency	Lowest (direct to shard)	One extra proxy hop
Max throughput	Highest	Slightly lower (proxy overhead)
Multi-key across slots	Fails `CROSSSLOT`	Proxy may fan out simple cmds
`MULTI/EXEC` + Lua across slots	Requires single slot	Still requires single slot
`MOVED`/`ASK` handling	Client must follow redirects	Absorbed by the proxy
Best when	You control the client, want max perf	Simpler networking, non-cluster client
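
To make the MOVED/ASK row concrete, here is a small Python sketch that parses a redirect error into its parts. The reply format (kind, slot, host:port) follows the Redis Cluster protocol; the Redirect type and parse_redirect function are illustrative names, and a real cluster-aware client does this internally.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str  # "MOVED" (slot has a new owner) or "ASK" (slot is mid-migration)
    slot: int
    host: str
    port: int


def parse_redirect(error_message: str) -> Optional[Redirect]:
    """Parse 'MOVED <slot> <host>:<port>' or 'ASK <slot> <host>:<port>'; None otherwise."""
    parts = error_message.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    # rpartition splits on the last colon, so the port is always the final segment.
    host, _, port = parts[2].rpartition(":")
    if not host or not port.isdigit() or not parts[1].isdigit():
        return None
    return Redirect(parts[0], int(parts[1]), host, int(port))
```

The two kinds call for different handling: on MOVED the client updates its slot map and sends future commands for that slot to the new node, while on ASK it retries only the current command on the target node (after sending ASKING) and leaves the slot map alone. If either string reaches application code under OSS policy, the client is not running in cluster mode.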

The multi-key contract

The behavior that surprises people is multi-key commands. A command touching multiple keys requires all those keys to live in the same hash slot:

# This fails across slots -- the keys hash to different slots
MSET user:1001 alice user:1002 bob   # CROSSSLOT error under OSS policy

# Hash tags force keys into the same slot using the {...} substring
MSET user:{tenant42}:1001 alice user:{tenant42}:1002 bob   # both hash on "tenant42"

Only the substring inside the first {} is hashed. Design your keyspace with hash tags around the entity you co-access so transactions and MGET/MSET stay single-slot. Under the Enterprise policy, the proxy makes some cross-slot multi-key commands appear to work by fanning out, but MULTI/EXEC transactions and Lua scripts still require single-slot keys – so the hash-tag discipline is non-negotiable either way.
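
The slot rule can be checked offline. The following Python sketch mirrors the calculation described above: take the hash tag if one exists, run CRC16 over it, and reduce modulo 16384. It relies on the standard library's binascii.crc_hqx, which with an initial value of 0 is the CRC16 variant (XMODEM, polynomial 0x1021) that Redis Cluster uses. The function names are illustrative.

```python
import binascii

HASH_SLOTS = 16384


def hashed_part(key: bytes) -> bytes:
    """Return the bytes that are actually hashed: the first non-empty {tag}, else the whole key."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:  # an empty "{}" does not count as a tag
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """Compute the cluster hash slot (0..16383) for a key."""
    return binascii.crc_hqx(hashed_part(key.encode("utf-8")), 0) % HASH_SLOTS


# Keys that share a hash tag land in the same slot as the bare tag itself.
same = key_slot("user:{tenant42}:1001") == key_slot("user:{tenant42}:1002") == key_slot("tenant42")
```

Running key_slot over a sample of real key names in a unit test is a cheap way to catch a keyspace that forgot its hash tags before it meets a real cluster.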

Which operations are slot-sensitive, and what each policy does:

Operation	OSS policy	Enterprise policy	Make it safe by…
Single-key `GET`/`SET`/`INCR`	Always fine	Always fine	(nothing)
`MGET`/`MSET` same slot	Fine	Fine	Hash tag the keys
`MGET`/`MSET` cross slot	`CROSSSLOT`	Proxy fans out	Hash tag the keys
`MULTI/EXEC` cross slot	`CROSSSLOT`	`CROSSSLOT`	Hash tag every key in the txn
Lua `EVAL` with `KEYS[]` cross slot	`CROSSSLOT`	`CROSSSLOT`	All `KEYS` share a hash tag
`SUNIONSTORE`/`ZUNIONSTORE` across keys	`CROSSSLOT`	`CROSSSLOT`	Hash tag source + dest keys
`SCAN`	Per-node (OSS)	Single endpoint	Aggregate across shards (OSS)
`KEYS *`	Per-node, blocking	Per-node, blocking	Avoid in prod entirely
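
When hash-tagging is not an option for a read path (for example, a bulk MGET over unrelated keys under OSS policy), one workaround is to split the key list by slot and issue one MGET per group. The Python sketch below shows only the grouping step; group_by_slot is a hypothetical helper, and the compact key_slot repeats the CRC16-plus-hash-tag rule so the block stands alone.

```python
import binascii
from collections import defaultdict
from typing import Dict, Iterable, List


def key_slot(key: str) -> int:
    """Cluster hash slot: CRC16 (XMODEM) of the hash tag or whole key, modulo 16384."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end > start + 1:  # non-empty {tag}: hash only the tag
            raw = raw[start + 1:end]
    return binascii.crc_hqx(raw, 0) % 16384


def group_by_slot(keys: Iterable[str]) -> Dict[int, List[str]]:
    """Bucket keys by hash slot so each bucket is safe for one multi-key command."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        groups[key_slot(key)].append(key)
    return dict(groups)
```

This only helps independent reads and writes. It does nothing for MULTI/EXEC or Lua, where atomicity across groups is exactly what you lose, so hash tags remain the fix there.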

Hash-tag design patterns that keep co-accessed keys together:

Access pattern	Key template	Hashed substring	Guarantees
Per-tenant data	`t:{tenantId}:orders`	`tenantId`	All tenant keys one slot
Per-session bundle	`sess:{sessionId}:cart`	`sessionId`	Cart + session co-located
Per-order aggregate	`ord:{orderId}:lines`	`orderId`	Order + line items together
Per-user counters	`u:{userId}:counters`	`userId`	Atomic multi-counter updates
Global singleton set	`g:{flags}:enabled`	`flags`	All flag keys one slot

Choose OSS policy when you control the client and want maximum performance, and you are comfortable with cluster-aware libraries (StackExchange.Redis, Lettuce, redis-py with cluster mode, go-redis ClusterClient). Choose Enterprise policy when you need a single endpoint for private-networking simplicity, or your client cannot do cluster mode. You cannot change it later without recreating the database – so this decision deserves a design review, not a default.

Active geo-replication topologies and conflict handling

Enterprise active geo-replication builds an active-active database (an Active-Active CRDB). Every participating cluster accepts both reads and writes, and changes replicate to all peers. There is no primary. A region outage means you keep serving from the survivors with no failover step.

The mechanism that makes concurrent writes safe is CRDTs. Redis Enterprise reimplements each data type as a CRDT so concurrent writes in different regions converge deterministically. The key insight: you do not get to plug in custom conflict logic the way Cosmos DB does. You pick the data type whose built-in convergence matches your correctness requirement.

CRDT semantics per data type

Data type	Conflict resolution	Concurrent-write outcome	Use it for	Pitfall
String (`SET`)	Last-write-wins by timestamp	One of two concurrent writes is lost	Single-writer keys, idempotency flags	Silent loss on true concurrent writes
Counter (`INCR`/`DECRBY`)	Additive merge	Both increments apply (no lost update)	Metrics, rate limits, vote tallies	Cannot set an absolute value safely
Set (`SADD`/`SREM`)	Observed-remove, element merge	Adds + removes converge per element	Idempotency keys, tags, membership	Concurrent add+remove favors add
Hash (`HSET`)	Per-field LWW merge	Different fields merge; same field LWW	Profiles, multi-field records	Same-field concurrent write loses one
Sorted set (`ZADD`)	Per-element score merge	Members merge; score is LWW	Leaderboards, time-ordered queues	Concurrent score update is LWW
String as counter (`SET n`)	LWW	Lost increments	(anti-pattern)	Use `INCR`, never `SET` for counts

Choosing the data type for the convergence you need

Requirement	Wrong model (loses data)	Right model	Why
“Count signups across regions”	`SET count <n>`	`INCR count`	Additive merge, no lost updates
“Has this request been seen anywhere?”	`SET seen:<id> 1`	`SADD seen <id>`	Observed-remove set converges
“User’s last-known cart”	two `SET cart`	`HSET cart field val`	Per-field merge keeps both fields
“Top players this hour”	`SET score:<u> <n>`	`ZADD board <n> <u>`	Per-member score, members merge
“Single-writer config flag”	(fine) `SET flag on`	`SET flag on`	LWW acceptable; one writer
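
The difference between additive and last-write-wins convergence is easier to see in a toy model. The Python sketch below simulates two regions with a grow-only counter and an LWW register. It is a simplified illustration of the merge rules in the tables above, not Redis Enterprise's actual implementation (which, for instance, also supports decrements); all class names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class CounterReplica:
    """Grow-only counter: each replica tracks every region's running total."""
    region: str
    totals: Dict[str, int] = field(default_factory=dict)

    def incr(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("this simplified model only supports increments")
        self.totals[self.region] = self.totals.get(self.region, 0) + amount

    def merge(self, other: "CounterReplica") -> None:
        # Per-region max is safe because a region only ever raises its own total.
        for region, total in other.totals.items():
            self.totals[region] = max(self.totals.get(region, 0), total)

    def value(self) -> int:
        return sum(self.totals.values())


@dataclass
class LwwRegister:
    """Last-write-wins string: the highest (timestamp, region) pair survives a merge."""
    region: str
    state: Optional[Tuple[int, str, str]] = None  # (timestamp, region, value)

    def set(self, value: str, timestamp: int) -> None:
        self.state = (timestamp, self.region, value)

    def merge(self, other: "LwwRegister") -> None:
        if other.state is not None and (self.state is None or other.state > self.state):
            self.state = other.state

    def value(self) -> Optional[str]:
        return None if self.state is None else self.state[2]


# Concurrent increments in two regions: both survive the merge.
east, west = CounterReplica("eastus2"), CounterReplica("westeurope")
east.incr(3)
west.incr(2)
east.merge(west)
west.merge(east)

# Concurrent string writes: the later timestamp wins and the other value is gone.
cart_east, cart_west = LwwRegister("eastus2"), LwwRegister("westeurope")
cart_east.set("cart-A", timestamp=100)
cart_west.set("cart-B", timestamp=101)
cart_east.merge(cart_west)
cart_west.merge(cart_east)
```

After the merges both counter replicas report 5, while both registers hold "cart-B" and "cart-A" has silently disappeared. That asymmetry is the whole argument for choosing the data type by the convergence you need.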

# Create an active geo-replication group spanning two regions.
# Each region is its own redisenterprise cluster + database; you link them
# via a shared group nickname and mutual linkedDatabase references.

az redisenterprise database create \
  --cluster-name kv-redis-eastus2 \
  --resource-group rg-data-eastus2 \
  --client-protocol Encrypted \
  --clustering-policy EnterpriseCluster \
  --eviction-policy NoEviction \
  --group-nickname global-sessions \
  --linked-databases id="/subscriptions/<sub>/resourceGroups/rg-data-eastus2/providers/Microsoft.Cache/redisEnterprise/kv-redis-eastus2/databases/default" \
  --linked-databases id="/subscriptions/<sub>/resourceGroups/rg-data-westeurope/providers/Microsoft.Cache/redisEnterprise/kv-redis-westeurope/databases/default"

The --linked-databases list must include this database plus every peer, and the same group nickname must be used on every member. Designing the topology:

Keep the geo group to regions you can tolerate replicating all writes to – replication is full mesh, so N regions means each write fans out to N-1 peers. Bandwidth and cross-region egress cost scale with the mesh.
Treat an active-active database as NoEviction by design: do not run it as an LRU eviction cache, because evictions are local to each region and create divergence. Use it for data you intend to keep (sessions, counters, feature flags), and size for the full working set.
Conflict resolution is per data type and automatic. If LWW on strings is unacceptable for a key, model it as a counter, set, or hash instead.

Topology trade-offs as you add regions:

Regions in mesh	Write fan-out per write	Cross-region links	When it makes sense	Cost driver
2 (active-active pair)	1 peer	1	Most apps: two write regions, each the other's DR	Egress between 2 regions
3 (triangle)	2 peers	3	Three-continent latency reduction	3× cross-region egress
4	3 peers	6	Rare; global low-latency writes	6 links, bandwidth + latency
5+	N-1 peers	N(N-1)/2	Almost never; reconsider design	Quadratic link growth
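
The numbers in that table come straight from full-mesh arithmetic, which a few lines of Python can reproduce for any region count. The function name mesh_cost is illustrative.

```python
from typing import Tuple


def mesh_cost(regions: int) -> Tuple[int, int]:
    """Return (write fan-out per write, cross-region links) for a full mesh of N regions."""
    if regions < 2:
        raise ValueError("a geo-replication group needs at least 2 regions")
    fan_out = regions - 1                 # every write replicates to each peer
    links = regions * (regions - 1) // 2  # one bidirectional link per region pair
    return fan_out, links
```

Going from 3 to 5 regions raises the per-write fan-out from 2 to 4 but the link count from 3 to 10, which is why the last row of the table suggests reconsidering the design.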

Active geo-replication vs the Premium passive geo-replica:

Dimension	Premium passive geo-replica	Enterprise active geo-replication
Direction	One-way (primary → secondary)	Full mesh, bidirectional
Secondary writes	Read-only; no writes	Full read + write
Failover	Manual (DNS / link unlink)	None – survivors keep serving
Conflict handling	N/A (single writer)	Automatic CRDT merge
RPO on region loss	Replication lag at failover	Near-zero; writes accepted locally
Use as	DR tool	Availability tool
Tier	Premium	Enterprise / Enterprise Flash

Data persistence: RDB, AOF, and durability trade-offs

Enterprise supports both persistence mechanisms, and they answer different questions. Persistence is about surviving a full cluster restart; it is orthogonal to replication, which is about surviving node loss, and to backup, which is about surviving a bad command or logic bug.

RDB (snapshot) writes a point-in-time dump on an interval (e.g., every 1h/6h/12h). Cheap, low overhead, but you lose everything since the last snapshot on a hard failure.

AOF (append-only file) logs every write. With fsync every second (aof-frequency=1s), worst-case data loss is ~1 second. The cost is write amplification and larger files. This is the right default for anything you cannot regenerate.

RDB vs AOF, with the numbers

Dimension	RDB (snapshot)	AOF (append-only)
What it stores	Point-in-time dataset dump	Every write operation, replayed
Worst-case data loss	Since last snapshot (e.g. up to 1h)	~1 second (`aof-frequency=1s`)
Write overhead	Low (periodic fork)	Higher (continuous append)
File size	Compact	Larger (full op log)
Restart/restore speed	Fast (load one dump)	Slower (replay the log)
CPU/memory spike	Fork at snapshot time	Steady, lower spikes
Right for	Regenerable caches, restart speed	Stateful data you cannot recreate

AOF fsync frequency trade-off

`aof-frequency`	Worst-case loss	Write throughput impact	When to use
`1s`	~1 second	Modest	The resilient default for stateful caches
`always` (where supported)	~0 (per-write fsync)	High (every write blocks on disk)	Only when even 1s loss is unacceptable

# AOF with per-second fsync -- the resilient default for stateful caches
az redisenterprise database update \
  --cluster-name kv-redis-prod \
  --resource-group rg-data-prod \
  --persistence aof-enabled=true aof-frequency=1s

# RDB hourly -- acceptable only for regenerable caches where restart speed matters
az redisenterprise database update \
  --cluster-name kv-redis-prod \
  --resource-group rg-data-prod \
  --persistence rdb-enabled=true rdb-frequency=1h

The RDB snapshot intervals you can choose, and what each implies:

`rdb-frequency`	Worst-case loss on hard failure	Overhead	Use when
`1h`	Up to 1 hour	Lowest	Regenerable cache; restart speed matters most
`6h`	Up to 6 hours	Lowest	Bulk read cache, easy to rebuild
`12h`	Up to 12 hours	Lowest	Rarely-changing reference data
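
One way to keep the choice honest is to derive the persistence setting from a stated loss tolerance rather than from habit. The Python sketch below encodes the trade-offs from the tables above as a hypothetical helper, persistence_args, that returns the key=value tokens for --persistence. The thresholds are this article's guidance turned into code, not an Azure rule, and always is only available where the service supports it.

```python
from typing import List

# RDB intervals from the table above, longest first, in seconds.
RDB_INTERVALS = (("12h", 12 * 3600), ("6h", 6 * 3600), ("1h", 3600))


def persistence_args(rpo_seconds: float, regenerable: bool) -> List[str]:
    """Pick --persistence tokens whose worst-case loss stays within rpo_seconds."""
    if rpo_seconds < 1:
        # Even one second of loss is too much: per-write fsync, at a throughput cost.
        return ["aof-enabled=true", "aof-frequency=always"]
    if regenerable:
        # Regenerable data can use the cheapest snapshot interval that fits the RPO.
        for label, seconds in RDB_INTERVALS:
            if seconds <= rpo_seconds:
                return ["rdb-enabled=true", f"rdb-frequency={label}"]
    # Stateful data, or an RPO tighter than the shortest snapshot interval.
    return ["aof-enabled=true", "aof-frequency=1s"]
```

For example, a regenerable read cache that can tolerate seven hours of loss maps to rdb-frequency=6h, while anything you cannot recreate maps to AOF with per-second fsync regardless of how generous the stated RPO is.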

Persistence vs replication vs backup – three different jobs:

Mechanism	Protects against	Does NOT protect against	Where data lives
AOF/RDB persistence	Full cluster restart	Bad `FLUSHALL`, logic bug, region loss	Cluster-local managed disks
Zone/node replication	Single node or zone loss	Region loss, bad command	Across nodes/zones in-region
Active geo-replication	Region loss	Bad command replicated to all peers	Across regions (full mesh)
Export to storage	Bad command, point-in-time recovery	Real-time loss (it is periodic)	Your storage account

Two correctness notes. First, in an active-active geo group you generally rely on the peer regions for recovery and persistence is a secondary safety net – a surviving region rehydrates a recovered one. Second, persistence is not a backup: it protects against process restart, not against a bad FLUSHALL or a logic bug that corrupts data; a corrupting command replicates to every peer in the mesh. Enterprise persists to the cluster’s local managed disks, not to your storage account, so treat exports separately if you need point-in-time backups.

Private endpoint, VNet injection, and TLS hardening

Never expose a production cache to the public internet. The Enterprise tier supports Private Link, which projects the cache into your VNet via a private endpoint and a private IP – the public FQDN resolves to a private address through Private DNS.

// Resource ID of the subnet that hosts the private endpoint.
param dataSubnetId string

resource cache 'Microsoft.Cache/redisEnterprise@2024-09-01-preview' = {
  name: 'kv-redis-prod'
  location: 'eastus2'
  sku: { name: 'Enterprise_E10', capacity: 2 }
  zones: ['1', '2', '3']
}

resource db 'Microsoft.Cache/redisEnterprise/databases@2024-09-01-preview' = {
  parent: cache
  name: 'default'
  properties: {
    clientProtocol: 'Encrypted'        // TLS-only; rejects plaintext
    clusteringPolicy: 'EnterpriseCluster'
    evictionPolicy: 'NoEviction'
    port: 10000
    persistence: { aofEnabled: true, aofFrequency: '1s' }
  }
}

resource pe 'Microsoft.Network/privateEndpoints@2024-05-01' = {
  name: 'pe-kv-redis-prod'
  location: 'eastus2'
  properties: {
    subnet: { id: dataSubnetId }
    privateLinkServiceConnections: [
      {
        name: 'redis'
        properties: {
          privateLinkServiceId: cache.id
          groupIds: ['redisEnterprise']
        }
      }
    ]
  }
}

Hardening checklist that actually matters:

clientProtocol: 'Encrypted' forces TLS. The Enterprise tier listens on port 10000 (not 6380 like Premium) – a frequent connection-string bug when migrating. Set your client’s TLS port accordingly.
Wire a Private DNS zone (privatelink.redisenterprise.cache.azure.net) linked to the VNet so the public FQDN resolves privately. Without the zone link, in-VNet clients still resolve the public IP and the private endpoint does nothing for them.
Prefer Microsoft Entra ID (token) authentication over the access key where your client supports it; it removes the long-lived shared secret. The access key still exists as a fallback – rotate it and store it in Key Vault, never in app config.
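
The DNS item is the one most often left unverified. A minimal Python sketch of a check you could run from a host inside the VNet follows: resolve the cache FQDN and confirm every returned address is private. The helper names are hypothetical, and the port defaults to the Enterprise TLS port from the checklist.

```python
import ipaddress
import socket
from typing import Iterable, List


def resolve(hostname: str, port: int = 10000) -> List[str]:
    """Return the distinct IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def all_private(addresses: Iterable[str]) -> bool:
    """True only if there is at least one address and every one is in a private range."""
    addrs = list(addresses)
    return bool(addrs) and all(ipaddress.ip_address(a).is_private for a in addrs)
```

Calling all_private(resolve(fqdn)) with your cache's FQDN from an in-VNet host should return True once the Private DNS zone is linked. A False result means clients are still being handed the public address and are bypassing the private endpoint.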

Networking and TLS settings reference

Setting	Values	Default	When to change	Limit / gotcha
`clientProtocol`	`Encrypted`, `Plaintext`	`Encrypted`	Never use plaintext in prod	Plaintext exposes data + key on the wire
`port`	TLS listen port	10000 (Enterprise)	Rarely	Premium is 6380; mismatch = connect failure
Public network access	Enabled / Disabled	Enabled	Disable once PE is live	Forgetting to disable leaves a public path
Private DNS zone	`privatelink.redisenterprise.cache.azure.net`	(none)	Always with PE	Unlinked zone → resolves to public IP
Min TLS version	1.2 / 1.3	1.2	Raise to 1.3 if clients support	Old clients may only do 1.2
Access keys	Primary / Secondary	Both active	Rotate regularly	Long-lived secret; prefer Entra token
Entra (AAD) auth	Enabled / Disabled	Disabled	Enable where client supports	Removes shared-secret risk

The port-10000 trap and other connection-string mistakes

Symptom	Likely cause	How to confirm	Fix
Connect times out from in-VNet client	DNS resolves to public IP	`nslookup <fqdn>` returns public IP	Link the Private DNS zone to the VNet
Connect refused / handshake error	Wrong port (6380 vs 10000)	Check client port setting	Use 10000 for Enterprise
Plaintext “connection reset”	TLS not enabled on client	Client `Ssl=false`	Set `Ssl=true`; server is TLS-only
Auth fails after rotation	Stale key in app config	Compare key in Key Vault vs app	Pull key from Key Vault at runtime
Token auth `NOAUTH`/expired	Entra token not refreshed	Token lifetime exceeded	Use SDK that auto-refreshes the token
Works locally, fails in Azure	Public access disabled, no PE route	Test from inside the VNet	Reach via the private endpoint only
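
Several rows in the table above can be caught before the first connection attempt. As a rough illustration, the sketch below lints three client settings against the Enterprise contract described in this section (Enterprise FQDN suffix, port 10000, TLS on). `CacheEndpoint` and `lint_endpoint` are hypothetical names made up for this example, not part of any client library.

```python
from dataclasses import dataclass

ENTERPRISE_SUFFIX = ".redisenterprise.cache.azure.net"
ENTERPRISE_TLS_PORT = 10000


@dataclass
class CacheEndpoint:
    host: str
    port: int
    ssl: bool


def lint_endpoint(ep):
    """Return a list of problems; an empty list means the settings look right."""
    problems = []
    if not ep.host.endswith(ENTERPRISE_SUFFIX):
        problems.append(f"host does not look like an Enterprise FQDN: {ep.host}")
    if ep.port != ENTERPRISE_TLS_PORT:
        problems.append(f"port is {ep.port}; Enterprise listens on {ENTERPRISE_TLS_PORT}")
    if not ep.ssl:
        problems.append("TLS is off on the client; the server is TLS-only")
    return problems
```

A check like this fits naturally into application startup or a deployment smoke test, where a reused Premium connection string (port 6380) would be flagged immediately.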

Identity and secret-handling options, ranked:

Auth method	Secret lifetime	Rotation effort	Best for	Trade-off
Entra ID token	Short (auto-refreshed)	None (managed identity)	Modern clients on Azure	Client/SDK must support it
Access key in Key Vault	Long; rotate on a schedule	Manual rotation + redeploy/refresh	Clients without Entra support	Long-lived secret to guard
Access key in app config	Long; often never rotated	(don’t do this)	Nothing in prod	Secret leaks via config/source

Client resilience: multiplexing, retries, and reconnect

This is where most outages are actually caused or prevented. A correctly provisioned cluster behind a broken client is still an outage.

Multiplex one connection, do not pool-per-request. Redis clients like StackExchange.Redis are built around a single long-lived multiplexer that pipelines all commands over a few connections. Opening a connection per operation exhausts ports and ignores the library’s pipelining. Create the multiplexer once as a singleton:

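// Requires the StackExchange.Redis, Microsoft.Azure.StackExchangeRedis, and Azure.Identity packages.
using Azure.Identity;
using StackExchange.Redis;

// Singleton ConnectionMultiplexer -- created once, shared process-wide.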
var config = new ConfigurationOptions
{
    EndPoints = { "kv-redis-prod.eastus2.redisenterprise.cache.azure.net:10000" },
    Ssl = true,
    AbortOnConnectFail = false,          // keep retrying instead of throwing at startup
    ConnectRetry = 5,
    ConnectTimeout = 15000,
    KeepAlive = 30,
    ReconnectRetryPolicy = new ExponentialRetry(5000)
};
// Token auth (Entra ID) instead of an access key:
await config.ConfigureForAzureWithTokenCredentialAsync(new DefaultAzureCredential());

var muxer = await ConnectionMultiplexer.ConnectAsync(config);

The non-obvious settings that matter on Azure, enumerated:

Setting (StackExchange.Redis)	Default	Recommended on Azure	Why it matters
`AbortOnConnectFail`	`true`	`false`	`true` makes the initial connect throw if it fails (e.g. maintenance), leaving the app with no multiplexer to recover with
`Ssl`	`false`	`true`	Server is TLS-only; plaintext is rejected
`ConnectRetry`	3	5	Initial connect attempts before giving up
`ConnectTimeout`	5000 ms	15000 ms	Cross-region/private-link first connect can be slow
`KeepAlive`	60 s	30 s	Detects dead sockets sooner (idle LB timeouts)
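`ReconnectRetryPolicy`	exponential in current releases (linear in older ones)	`ExponentialRetry(5000)`	Backoff instead of hammering during an outage; set it explicitly so behavior does not depend on library version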
`SyncTimeout`	5000 ms	tune to p99	Too low → false `RedisTimeoutException` under load
`AsyncTimeout`	5000 ms	tune to p99	Same, for async paths
`allowAdmin`	`false`	`false`	Keep off unless you run admin commands

The non-obvious behaviors:

AbortOnConnectFail = false is mandatory. With the default true, the initial connect throws if it fails (e.g., during a maintenance window) and hands back no multiplexer, so an app that does not retry creation itself never recovers. With false, the multiplexer is returned anyway and keeps reconnecting in the background.
During scaling and patching, Azure issues a brief connection blip per node. Your code must retry the operation, not just rely on the multiplexer reconnecting. Wrap commands in a bounded retry (Polly) that handles RedisConnectionException and RedisTimeoutException with jittered backoff.
Under OSS clustering policy, the client must follow MOVED/ASK redirects automatically – every mainstream cluster client does, but only if you enabled cluster mode. A MOVED reaching your application code means the client is misconfigured.

# redis-py against the Enterprise (proxy) policy -- a single endpoint, TLS, retry on timeout
import os

from redis import Redis
from redis.retry import Retry
from redis.backoff import ExponentialBackoff
from redis.exceptions import ConnectionError, TimeoutError

r = Redis(
    host="kv-redis-prod.eastus2.redisenterprise.cache.azure.net",
    port=10000, ssl=True,
    # Assumed env var name; load the access key from Key Vault at runtime, never from app config.
    password=os.environ["REDIS_ACCESS_KEY"],
    socket_timeout=5, socket_connect_timeout=5,
    retry=Retry(ExponentialBackoff(cap=2, base=0.1), retries=3),
    retry_on_error=[ConnectionError, TimeoutError],
    health_check_interval=30,
)

health_check_interval makes redis-py send a PING before reusing any connection that has sat idle longer than the interval (30 seconds here), so a connection that was silently dropped (by a node move or an Azure load-balancer idle timeout) is detected and rebuilt before the real command goes out on the dead socket. Without it, the first request after an idle period eats the failure.

Client library cluster support matrix

Library	Cluster-aware mode (OSS policy)	Follows `MOVED`/`ASK`	Entra token auth	Notes
StackExchange.Redis (.NET)	Yes (auto on cluster)	Yes	Yes (`ConfigureForAzure…`)	Use a singleton multiplexer
Lettuce (Java)	Yes (`RedisClusterClient`)	Yes	Via token credential	Reactive + async; topology refresh
Jedis (Java)	Yes (`JedisCluster`)	Yes	Manual token plumbing	Pool sizing matters
redis-py (Python)	Yes (`RedisCluster`)	Yes	Via token provider	`health_check_interval` is key
go-redis (Go)	Yes (`ClusterClient`)	Yes	Via credential hook	Routing + read-from-replica options
node-redis / ioredis	Yes (ioredis cluster)	Yes	Via token	ioredis preferred for cluster

Retry policy design

Exception	Retry?	Backoff	Cap	Idempotency concern
`RedisConnectionException`	Yes	Exponential + jitter	3–5 tries	Reconnect; operation may not have run
`RedisTimeoutException`	Yes (bounded)	Exponential + jitter	2–3 tries	A timed-out write may have applied
`RedisServerException` (`OOM`)	No	—	—	Fix capacity/eviction, not retry
`MOVED`/`ASK`	Client-internal	—	—	Should never reach app code
`CROSSSLOT`	No	—	—	Fix keyspace (hash tags), not retry
`NOAUTH`/auth error	No (refresh token)	—	—	Refresh credential, then reconnect
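
The split in the table above (retry transient errors, fail fast on everything else) can be sketched in a few lines. This is a library-agnostic illustration in Python, not the Polly or redis-py implementation: `TransientCacheError` and `PermanentCacheError` are hypothetical stand-ins for the connection/timeout exceptions and for `OOM`/`CROSSSLOT`-style errors, and the backoff is capped exponential with full jitter.

```python
import random
import time


class TransientCacheError(Exception):
    """Stand-in for connection and timeout errors (hypothetical name)."""


class PermanentCacheError(Exception):
    """Stand-in for OOM, CROSSSLOT and similar errors that a retry cannot fix."""


def retry_with_jitter(op, attempts=3, base=0.1, cap=2.0, sleep=time.sleep):
    """Run op(); retry transient failures with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return op()
        except TransientCacheError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the error to the caller
            # Full jitter: pick a random delay up to the capped exponential step.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            sleep(delay)
    # PermanentCacheError is deliberately not caught, so it propagates on the first try.
```

The `sleep` parameter exists so tests can record delays instead of waiting. Remember the idempotency column: a retried write after a timeout may already have applied, so only wrap operations that are safe to repeat.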

Scaling, reshard operations, and zero-downtime maintenance

Enterprise scales two ways: scale up (a bigger SKU – E10 to E20) and scale out (more capacity units, which add shards and rebalance slots). Both are online operations, but “online” assumes a resilient client (previous section).

# Scale up the SKU (more memory/throughput per node)
az redisenterprise update --name kv-redis-prod --resource-group rg-data-prod \
  --sku Enterprise_E20

# Scale out capacity (adds nodes/shards; triggers a reshard/rebalance)
az redisenterprise update --name kv-redis-prod --resource-group rg-data-prod \
  --capacity 4

Scale-up vs scale-out

Dimension	Scale up (bigger SKU)	Scale out (more capacity)
What changes	More RAM/CPU per node	More shards; slots rebalance
Operation	`--sku Enterprise_E20`	`--capacity 4` (even number)
Fixes	OOM, CPU on a single shard	Throughput ceiling, larger dataset
Client impact	Brief blip per node	Reshard; `MOVED`/`ASK` (OSS) or proxy-absorbed
Online?	Yes (rolling)	Yes (rolling)
Limit	SKU ceiling per family	Capacity must stay even on Enterprise (Flash SKUs use 3, 9, 15)

What happens during a reshard, and how to survive it:

Hash slots migrate between shards. Under OSS policy, in-flight keys briefly answer ASK/MOVED and the client re-routes – transparent only if the client handles redirects. Under Enterprise policy, the proxy absorbs this and clients see at most brief latency.
A small number of connections drop as nodes are added. This is exactly the blip your retry policy exists for. Validate by running a scale operation in a load test and confirming zero application errors, only a latency bump.
Maintenance windows. Enterprise patches the OS and Redis software with rolling, one-node-at-a-time updates so the database stays available. Configure a maintenance window aligned to your low-traffic hours, and never assume “no failover during maintenance” – assume a connection reset per node and make the client idempotent.

Idempotency under retried writes

Reads are naturally idempotent; on write paths, retrying after a write that succeeded but was never acknowledged can apply the change twice and corrupt data. Map the operation to a safe pattern:

Write operation	Retry hazard	Safe pattern
`INCR counter`	Double-count on retry	Idempotency key, or compute then `SET` known value
`SET k v`	Safe (same value)	Plain `SET` is idempotent if value is fixed
`LPUSH queue item`	Duplicate item on retry	Dedup on consume, or `SET`-based dedup key
`SADD set member`	Safe (set semantics)	`SADD` is naturally idempotent
`HSET h f v`	Safe (same value)	Idempotent for a fixed field/value
`INCRBY balance n`	Double-apply on retry	Idempotency key per transaction id

Events that cause a connection blip

Event	Trigger	Client-visible effect	Mitigation
Scale up (SKU)	`--sku` change	Brief reset per node	`AbortOnConnectFail=false` + op retry
Scale out (capacity)	`--capacity` change	Reshard; redirects/proxy hop	Op retry; cluster-aware client
OS/Redis patch	Maintenance window	One node reset at a time	Low-traffic window; health checks
Node failure	Hardware/zone fault	Failover to replica shard	Idempotent writes; retry
Geo link change	Add/remove region	Replication catch-up	Tolerate brief replication lag

Monitoring memory pressure, evictions, and latency percentiles

Redis fails loudly on CPU and silently on memory. Watch both, and alert on the leading indicators rather than the outage.

The metrics that predict incidents (all available in Azure Monitor for the Enterprise resource):

Metric	What it measures	Leading indicator of	Alert threshold	Why
Used Memory Percentage	RAM used vs limit	OOM (NoEviction) or eviction loss	75%	Leaves headroom to scale before the limit, where writes fail or keys evict
Evicted Keys	Keys removed at memory limit	Undersized cache / divergence (geo)	> 0 sustained	On active-active this is a correctness bug
Expired Keys	Keys removed by TTL	Normal churn (context for evictions)	(baseline)	Distinguishes TTL churn from eviction
Server Load	% main thread busy	CPU-bound cluster	80%	A slow `KEYS`/big `MGET` stalls everything
Connections Created/sec	New conns per second	Pool-per-request client bug	sustained high	Healthy clients reuse a handful
Cache Hit / Miss	Read hit ratio	Cache too small / wrong TTLs	falling hit rate	Misses push load to the backend
Total Operations/sec	Throughput	Approaching shard ceiling	near capacity	Scale out before saturation
Replication latency (geo)	Cross-region lag	Mesh bandwidth / region issue	rising	Stale reads in the lagging region

The metrics to alert on, with the action each alert should trigger:

Alert	Condition	Severity	Immediate action
Memory pressure	`usedmemorypercentage` > 75% for 5m	Warning	Scale up RAM or scale out shards
OOM imminent	`usedmemorypercentage` > 90% for 1m	Critical	Scale now; check for runaway keyspace
Eviction (geo)	`evictedkeys` > 0 on active-active DB	Critical	Size up; eviction = divergence
CPU-bound	`serverLoad` > 80% for 5m	Warning	Scale out; hunt slow commands
Connection storm	`connectionscreatedpersecond` high sustained	Warning	Audit client for pool-per-request
Hit ratio drop	hit ratio falls > 20%	Warning	Review TTLs / key sizing
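
To make the thresholds concrete, here is a small sketch that turns one window of metric values into the alerts from the table. It is an illustration of the decision logic only: `evaluate_alerts` is a hypothetical function, the duration conditions (5m, 1m) are assumed to be handled by whatever aggregates the window, and the connection-storm and hit-ratio rows are left out because the table gives no numeric threshold for them.

```python
def evaluate_alerts(sample, geo=False):
    """Map one window of metric values to (severity, alert) pairs.

    sample: dict of metric name -> aggregated value for the window.
    geo:    True if the database is part of an active-active group.
    """
    alerts = []
    mem = sample.get("usedmemorypercentage", 0)
    if mem > 90:
        alerts.append(("Critical", "OOM imminent"))
    elif mem > 75:
        alerts.append(("Warning", "Memory pressure"))
    if geo and sample.get("evictedkeys", 0) > 0:
        alerts.append(("Critical", "Eviction (geo)"))
    if sample.get("serverLoad", 0) > 80:
        alerts.append(("Warning", "CPU-bound"))
    return alerts
```

In practice these rules live in Azure Monitor alert rules rather than application code; the function is mainly useful for unit-testing a runbook or a dashboard annotation.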

// Memory pressure trend + eviction correlation over the last 24h
AzureMetrics
| where TimeGenerated > ago(24h)
| where ResourceProvider == "MICROSOFT.CACHE"
| where ResourceId contains "kv-redis-prod"
| where MetricName in ("usedmemorypercentage", "evictedkeys", "serverLoad")
| summarize avg(Average), max(Maximum) by MetricName, bin(TimeGenerated, 5m)
| order by TimeGenerated desc

For latency, do not trust server-side averages – measure client-side percentiles, because an average of 1ms hides a p99 of 200ms caused by a single hot shard or a GC pause in your own process. Track p50/p99 per operation from the application, and correlate p99 spikes against serverLoad and reshard events. A latency cliff that lines up with a scaling operation is your retry policy working; one that does not is a hot key or a cross-slot fan-out.
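
A short worked example shows how far a mean can sit from the tail. The sketch below uses a nearest-rank percentile over made-up latencies (980 requests at 1 ms and 20 at 200 ms, purely illustrative numbers); `percentile` is a hypothetical helper, and a real service would normally use its metrics library's histogram instead.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative client-side measurements in milliseconds.
latencies_ms = [1.0] * 980 + [200.0] * 20

mean_ms = statistics.fmean(latencies_ms)   # 4.98
p50_ms = percentile(latencies_ms, 50)      # 1.0
p99_ms = percentile(latencies_ms, 99)      # 200.0
```

The mean and the median both look healthy here, while one request in fifty is two hundred times slower; only the p99 exposes it.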

What a latency spike is telling you, by what it correlates with:

p99 spike correlates with…	It’s probably…	Confirm with	Fix
A scale-out / reshard event	Retry policy absorbing a blip	Activity log timing vs spike	Nothing — working as intended
High `serverLoad`	CPU-bound (slow command / hot shard)	`serverLoad` metric, `SLOWLOG`	Scale out; remove `KEYS`/big `MGET`
One `cloud_RoleInstance` only	Hot key on one shard	Per-instance client metrics	Re-key to spread; add a local cache
Cross-slot fan-out commands	Proxy fanning out (Enterprise policy)	Command audit	Hash-tag keys to single slot
GC pauses in the app	Client-side, not Redis	App GC logs vs Redis latency	Tune app GC / allocation
Rising replication lag (geo)	Cross-region bandwidth/incident	Geo replication metric	Check region health; reduce mesh

Architecture at a glance

The diagram traces the data and control path of a two-region active-active deployment, left to right, and maps each failure class to the exact hop where it bites. Read it as four zones. On the left, clients (App Service / AKS pods) hold a singleton multiplexer and reach the cache over TLS on port 10000 through a Private Endpoint – never the public internet. The endpoint resolves via a Private DNS zone linked to the VNet, which is the hop where a missing zone link silently sends you to the public IP. In the middle, the East US 2 Enterprise cluster terminates the connection: under Enterprise policy a proxy fronts the shards; under OSS policy the client talks to shards directly and must follow MOVED/ASK. The shards enforce the eviction policy (NoEviction here) and write AOF at one-second fsync to local managed disk.

On the right, the West Europe Enterprise cluster is the active-active peer: a full-mesh CRDB link replicates every write both directions, so both regions accept reads and writes with no primary and no failover step. The numbered badges mark the five places this design fails and how you confirm each: a CROSSSLOT from a keyspace that ignored hash tags, an OOM where NoEviction met an undersized cluster, a connection storm from a pool-per-request client, replica divergence from running eviction on a geo database, and a stale-read window from replication lag. The legend narrates each as symptom, the metric or command that confirms it, and the fix. The whole method: localize the symptom to a hop, read the cause, run the named check, apply the fix.

Real-world scenario

Paywell, a fictional payments platform, ran a global idempotency cache on Premium with a passive geo-replica: EU writes went to West Europe, a one-way replica fed East US for reads, and failover was a manual DNS swap. The cache held one thing that mattered above all – the answer to “have I already processed this request id?” – and the platform’s entire duplicate-protection guarantee rested on it. Average load was 3,000 ops/sec, the monthly cache spend about ₹95,000, and the team was four engineers.

During a West Europe zone incident the replica was read-only, so for eleven minutes every in-flight payment in the US that needed to check idempotency either blocked on the manual failover or fell back to the database and ran at a fraction of normal throughput. Worse, after failover a handful of duplicate captures slipped through because the idempotency keys written in the US during the gap had not replicated back – the one-way link only flowed EU→US, so US writes during the outage were invisible to West Europe when it recovered. Two customers were double-charged. The post-incident review put it bluntly: passive replication is a DR tool, not an availability tool.

The first instinct was to “add another replica” – which would have solved nothing, because the problem was not node loss, it was that the secondary could not accept writes. The breakthrough was reframing the requirement: a system that must never lose a write across regions has to be active-active and conflict-free by construction. They moved to Enterprise active-active geo-replication across West Europe and East US, modeling idempotency state two ways. The “have I seen this request” check became a CRDT set (SADD seen <id>) so adds in either region converge with observed-remove semantics, and the per-request lock used SET payment:idem:<id> processing NX EX 86400. NX rejects any id the local region has already seen, whether written locally or replicated in, and string LWW makes both regions converge on a single value for the key, which gives “first writer wins” for every request that arrives after the first write has replicated. The one gap to keep in mind is that NX is evaluated against the local replica, so two regions racing on the same id inside the replication-lag window can both get OK.

They kept AOF at 1s as a restart safety net and sized for NoEviction: idempotency keys carry a 24-hour TTL via SET ... EX and are never removed by LRU eviction, so eviction cannot cause divergence. They moved both databases behind Private Endpoints with a linked Private DNS zone, switched clients to Entra token auth, and rebuilt the .NET client as a singleton multiplexer with AbortOnConnectFail = false and a Polly retry. The one subtlety they hit in testing: an early version used INCR for a per-merchant attempt counter and double-counted on a retried-after-timeout write; they switched the critical counter to an idempotency-keyed SET of a computed value.

# Idempotency check, region-local, on an active-active CRDB.
# SET NX EX is the primitive: succeeds only if the key is new, with a TTL.
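# Both regions converge on one value for the key (string LWW); NX rejects ids already visible locally.
# Caveat: NX is checked against the local replica, so a concurrent first write in the other
# region inside the replication-lag window can also return OK.
SET payment:idem:7f3c-9a21 processing NX EX 86400
# -> OK    (id not yet visible in this region: proceed)
# -> nil   (id already written here or replicated in: this is a duplicate, reject)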

The measurable result: the next regional zone failure was a non-event – no manual step, p99 unchanged at 1.1 ms, zero duplicate captures. Monthly spend rose to about ₹1,40,000 for the two Enterprise clusters, which the team judged trivial against a single double-capture chargeback plus reputational cost. The lesson on the wall: the duplicate-capture class of bug was designed out by construction, not monitored for.

The incident as a timeline, because the order of moves is the lesson:

Time	Symptom	Action taken	Effect	What it should have been
14:02	West Europe zone incident	(alert fires)	—	Recognize: secondary is read-only
14:04	US idempotency checks blocking	Wait on manual DNS failover	Writes stalled in US	Don’t depend on a manual step
14:07	Throughput fallback to DB	Apps fall back to database	Fraction of normal speed	—
14:13	Manual failover completes	DNS swapped to US	US can write again	11 minutes too late
+1 day	Two double-charges found	RCA: US writes never replicated back	One-way link gap exposed	Active-active needed
+1 week	Redesigned	Enterprise active-active + CRDT set	Region loss = non-event	The actual fix
+1 week	Counter double-count in test	`INCR` retried after timeout	Caught before prod	Idempotency-keyed `SET`

Advantages and disadvantages

The Enterprise active-active model both removes the regional-write failure class and adds operational discipline you must respect. Weigh it honestly:

Advantages (why Enterprise active-active helps)	Disadvantages (why it costs and constrains)
Both regions accept writes; region loss is a non-event with no failover step	A full cluster runs in every region, so two regions cost roughly double
CRDTs resolve conflicts automatically per data type — no custom merge code	You must model data as the right CRDT; opaque strings silently lose writes (LWW)
Redis modules (Search, JSON, TimeSeries, Bloom) unlock secondary-index workloads	Modules and active-active are Enterprise-only; you can’t get them on Premium
Up to 99.999% SLA on a managed runtime	Higher floor cost than Standard/Premium even for small caches
Persistence (AOF 1s) + replication + geo gives layered durability	Persistence is not a backup; a bad command replicates to every peer
Enterprise (proxy) policy gives a single endpoint — simple private networking	The proxy adds a hop; OSS policy is faster but needs a cluster-aware client and routable node IPs
Online scale-up and scale-out with rolling, one-node maintenance	“Online” assumes a resilient client; a naive client still sees errors on reshard
`NoEviction` + sizing for the full working set keeps geo regions convergent	Eviction in a geo group is a correctness bug (divergence), not just a hit-rate dip

The model is right when you genuinely need multi-region writes, conflict-free convergence, or Redis modules, and you can size for the full working set under NoEviction. It is the wrong tool for a cheap, regenerable, single-region read cache – that is what Standard or Premium are for. The disadvantages are all manageable, but only if you respect them: model the data type deliberately, size for the working set, lock down the network, and make the client resilient.

Hands-on lab

Stand up an Enterprise cache, prove the clustering and persistence behavior, and exercise the additive counter that active-active convergence builds on – then tear it down. Enterprise is not free-tier, so this lab uses the smallest Enterprise SKU and deletes everything at the end; budget a small hourly charge while it runs. Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

RG=rg-redis-lab
LOC=eastus2
CLUSTER=kv-redis-lab-$RANDOM   # cluster name must be unique
az group create -n $RG -l $LOC -o table

Step 2 — Create the smallest Enterprise cluster (zone-redundant).

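# --no-database skips the default database that `create` would otherwise add,
# so Step 3 can create it with explicit settings.
az redisenterprise create \
  -n $CLUSTER -g $RG -l $LOC \
  --sku Enterprise_E5 --capacity 2 --zones 1 2 3 --no-database -o table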

Expected: a cluster row with provisioningState: Succeeded (this takes several minutes).

Step 3 — Create the database with Enterprise (proxy) policy, NoEviction, AOF.

az redisenterprise database create \
  --cluster-name $CLUSTER -g $RG \
  --client-protocol Encrypted \
  --clustering-policy EnterpriseCluster \
  --eviction-policy NoEviction \
  --persistence aof-enabled=true aof-frequency=1s -o table

Step 4 — Get the host and access key, then connect with TLS on port 10000.

HOST=$(az redisenterprise show -n $CLUSTER -g $RG --query hostName -o tsv)
KEY=$(az redisenterprise database list-keys --cluster-name $CLUSTER -g $RG \
  --query primaryKey -o tsv)

redis-cli -h $HOST -p 10000 --tls -a "$KEY" PING
# -> PONG

Step 5 — Prove persistence is on and eviction is off.

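redis-cli -h $HOST -p 10000 --tls -a "$KEY" CONFIG GET appendonly   # -> appendonly yes
redis-cli -h $HOST -p 10000 --tls -a "$KEY" CONFIG GET maxmemory-policy  # -> noeviction
# If CONFIG is restricted on your database, read the same settings from the control plane:
az redisenterprise database show --cluster-name $CLUSTER -g $RG --query "{eviction:evictionPolicy, persistence:persistence}"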

Step 6 — Prove the slot contract (under Enterprise policy the proxy fans out simple MSET, but transactions still need one slot). Demonstrate hash-tag co-location:

# Co-located keys via hash tag -- guaranteed one slot, safe for MULTI/EXEC and Lua
redis-cli -h $HOST -p 10000 --tls -a "$KEY" MSET 'u:{t1}:a' 1 'u:{t1}:b' 2
redis-cli -h $HOST -p 10000 --tls -a "$KEY" MGET 'u:{t1}:a' 'u:{t1}:b'   # -> 1, 2
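
To see why the hash tag guarantees co-location, the slot for a key can be computed offline. The sketch below follows the Redis Cluster key-hashing rule as I understand it: CRC16 (XMODEM variant) of the key, or of the substring inside the first non-empty `{...}`, modulo 16384. `crc16_xmodem` and `key_slot` are illustrative helpers written for this example, not functions from any client library.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (the XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Hash slot for a key, honoring a {hash tag} when one is present and non-empty."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]  # only the tag contents are hashed
    return crc16_xmodem(raw) % 16384


# Both lab keys hash only "t1", so they land in the same slot.
same_slot = key_slot("u:{t1}:a") == key_slot("u:{t1}:b")
```

Without the braces, `u:t1:a` and `u:t1:b` are hashed in full and will usually land in different slots, which is exactly the situation that produces `CROSSSLOT` in a transaction.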

Step 7 — Counter additive behavior (single region here; in active-active this is what converges).

redis-cli -h $HOST -p 10000 --tls -a "$KEY" INCR global:signups
redis-cli -h $HOST -p 10000 --tls -a "$KEY" INCR global:signups
redis-cli -h $HOST -p 10000 --tls -a "$KEY" GET  global:signups   # -> 2

Step 8 — Teardown (stop the meter).

az group delete -n $RG --yes --no-wait

What each lab step proves, mapped to a section above:

Step	Proves	Section it validates
2	Enterprise is a distinct resource; even capacity; zones at create	Tier selection
3	Policy/eviction/persistence chosen at DB creation	Clustering, Persistence
4	TLS-only on port 10000	Networking & TLS
5	AOF on, NoEviction set	Persistence
6	Hash tags co-locate keys for multi-key safety	Clustering
7	Counters are additive (CRDT convergence basis)	Geo-replication
8	Clean teardown stops the charge	Cost

To make this a real active-active test, create the cluster and database in a second region as well (steps 2–4), giving both databases a shared --group-nickname and mutual --linked-databases; it is simplest to set these when each database is first created. Then INCR global:signups once in each region and GET from either – the value converges to the sum, not a lost update. That is the closest thing to a real cross-region convergence test you can run on demand.
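
The difference between "converges to the sum" and "a lost update" can be shown with a toy model. The sketch below is a deliberately simplified illustration in Python, not how Redis Enterprise implements its CRDTs internally: `merge_lww` models a last-writer-wins register as a (timestamp, value) pair, and `merge_counter` models a grow-only counter as per-region totals. The region names are the ones used in this guide.

```python
def merge_lww(a, b):
    """Last-writer-wins register: keep the (timestamp, value) pair with the later timestamp."""
    return max(a, b)  # tuples compare by timestamp first


def merge_counter(a, b):
    """Grow-only counter: per-region totals, merged by taking the max per region."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


# Each region performs one concurrent increment starting from zero.
# Modeled as a plain string (read 0, write 1), the later write simply overwrites the earlier one.
lww_result = merge_lww((1, 1), (2, 1))[1]

# Modeled as a counter, each region tracks its own increments and the merge keeps both.
merged = merge_counter({"eastus2": 1}, {"westeurope": 1})
counter_result = sum(merged.values())
```

`lww_result` ends at 1 (one increment lost) while `counter_result` ends at 2, which is the behavior the two-region `INCR` test above is meant to demonstrate.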

Common mistakes & troubleshooting

Eleven real failure modes, each as symptom → root cause → how to confirm → fix. This is the playbook to keep open at 02:14.

#	Symptom	Root cause	Confirm (exact cmd / metric)	Fix
1	`CROSSSLOT Keys ... don't hash to the same slot`	Multi-key cmd / txn across slots	Read the exception; check key names for `{}`	Hash-tag co-accessed keys: `k:{tag}:...`
2	`MOVED 1234 10.0.0.5:10000` reaches app code	Cluster-unaware client on OSS policy	Client config has no cluster mode	Enable cluster mode (`RedisCluster`/`ClusterClient`)
3	`OOM command not allowed when used memory > maxmemory`	`NoEviction` + undersized cluster	`usedmemorypercentage` near 100%	Scale up RAM or scale out shards
4	Keys vanish unexpectedly in a geo group	LRU eviction on active-active DB	`evictedkeys` > 0; policy = `allkeys-lru`	Set `NoEviction`; size for full working set
5	Thousands of `connectionscreatedpersecond`	Client opens a connection per request	The metric is high and sustained	Use a singleton multiplexer; reuse it
6	App throws on startup during maintenance, never recovers	`AbortOnConnectFail = true`	Multiplexer config default	Set `AbortOnConnectFail = false`
7	Connect times out from in-VNet client	Private DNS zone not linked → public IP	`nslookup <fqdn>` returns public IP	Link `privatelink.redisenterprise...` zone to VNet
8	Connect refused / TLS handshake error	Wrong port (6380) or `Ssl=false`	Client port/SSL settings	Use port 10000, `Ssl=true`
9	Concurrent cross-region writes lose data	Modeled as `SET` string (LWW)	Two regions, same key, one value survives	Model as counter/set/hash CRDT
10	Counter over-counts after a timeout	`INCR` retried after unacked success	Retry on `RedisTimeoutException` + `INCR`	Idempotency key, or compute then `SET`
11	One region serves stale reads	Cross-region replication lag	Geo replication latency metric rising	Check region health; tolerate or reduce mesh

Decision table: which failure am I looking at?

If you see…	It’s probably…	Do this first
`CROSSSLOT` in the exception	Keyspace ignores hash tags	Add `{tag}` to co-accessed keys
`MOVED`/`ASK` in app logs	Client not in cluster mode	Turn on cluster mode
`OOM command not allowed`	Memory full + `NoEviction`	Scale up/out; check runaway keyspace
`evictedkeys` > 0 on a geo DB	Eviction enabled in active-active	Switch to `NoEviction`
Conns/sec spiking	Pool-per-request client	Multiplex one connection
Permanent failure after a blip	`AbortOnConnectFail = true`	Flip it to `false`
In-VNet timeouts	DNS resolves to public IP	Link the Private DNS zone
Data loss across regions	Wrong CRDT (string LWW)	Re-model as counter/set/hash
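
The first few rows of the decision table key off the leading token of the server's error reply, so they are easy to encode for log triage. The following is a small sketch under that assumption. `first_action` and the `TRIAGE` mapping are hypothetical names for this example, and only the rows that carry a literal error code are covered.

```python
TRIAGE = {
    "CROSSSLOT": "Add {tag} to co-accessed keys",
    "MOVED": "Turn on cluster mode in the client",
    "ASK": "Turn on cluster mode in the client",
    "OOM": "Scale up/out; check for a runaway keyspace",
}


def first_action(error_text):
    """Return the first thing to do for a Redis error reply, or None if it is not in the table."""
    stripped = error_text.strip()
    if not stripped:
        return None
    code = stripped.split(" ", 1)[0]  # Redis error replies start with an error code token
    return TRIAGE.get(code)
```

Fed with exception messages from application logs, this gives an on-call engineer the "do this first" column without opening the table.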

The error/limit reference

Error / limit	Meaning	Likely cause	How to confirm	Fix
`CROSSSLOT`	Keys span multiple hash slots	No hash tag on multi-key op	Exception text + key names	Hash-tag the keys
`MOVED <slot> <ip:port>`	Slot owned by another node	Cluster-unaware client (OSS)	Appears in app logs	Enable cluster mode
`ASK <slot> <ip:port>`	Slot mid-migration	Reshard in progress	During scale-out	Client follows redirect (auto)
`OOM command not allowed`	At memory limit, `NoEviction`	Undersized cluster	`usedmemorypercentage` ~100%	Scale RAM/shards
`NOAUTH` / auth required	Missing/expired credential	Stale key or expired token	Auth response	Refresh token / correct key
`WRONGTYPE`	Op on wrong data type	Key reused as different type	`TYPE <key>`	Use the right type / re-key
`READONLY`	Write to a read-only target	Passive replica (Premium)	Topology / replica role	Write to primary; or go active-active
Capacity must be even	Enterprise nodes deploy in pairs (Flash SKUs use 3, 9, 15)	Odd `--capacity` value	CLI rejects the value	Use an even number on Enterprise
Port 6380 vs 10000	Wrong TLS port	Premium connection string reused	Client port setting	Use 10000 for Enterprise
Max 16384 slots	Hard cluster slot count	(design constant)	—	Design keyspace within it

Best practices

Production-grade rules, learned the hard way:

#	Rule	Why
1	Choose the tier from durability/topology, not memory size	Enterprise is a different runtime, not a bigger Premium
2	Decide the clustering policy at design time; it is permanent	Changing it later means recreating the database
3	Hash-tag every set of co-accessed keys	Keeps transactions, Lua, `MGET`/`MSET` single-slot
4	Use active-active (not passive) when both regions must write	Passive is DR; active-active is availability
5	Model write data as the right CRDT	Strings are LWW and silently lose concurrent writes
6	Run geo databases as `NoEviction`, sized for the full working set	Eviction in a mesh = divergence, a correctness bug
7	Enable AOF `1s` for non-regenerable data	Caps restart loss at ~1 second
8	Treat persistence as restart safety, not a backup	A bad command replicates to every peer
9	Lock the cache behind a Private Endpoint + linked Private DNS	No public exposure; in-VNet resolves privately
10	Prefer Entra token auth; keep keys in Key Vault, rotated	Removes the long-lived shared secret
11	Use a singleton multiplexer with `AbortOnConnectFail = false`	Prevents connection storms and permanent-failure-after-blip
12	Retry the operation with jittered backoff, idempotently	Survives reshard/patch blips without corrupting writes
13	Set a maintenance window in low-traffic hours	Rolling patches reset one node at a time
14	Alert on used-memory % (75%), eviction, server load (80%), conns/sec	Catch the leading indicator, not the outage
15	Measure client-side p99, correlate to reshard/server-load	Server averages hide hot keys and GC pauses

Security notes

The cache often holds the most sensitive transient data you have – session tokens, idempotency keys, PII in flight. Lock it down on every axis:

Control	Setting / action	Why
Encryption in transit	`clientProtocol: Encrypted`, TLS 1.2+ on port 10000	No plaintext; reject unencrypted clients
Network isolation	Private Endpoint + disable public network access	Cache reachable only from the VNet
Private name resolution	Linked `privatelink.redisenterprise...` zone	FQDN resolves to the private IP, not public
Identity-based auth	Microsoft Entra ID token (managed identity)	No long-lived shared secret in the app
Secret handling	Access key in Key Vault, rotated; never in app config	Limits blast radius if a config leaks
Least privilege	Scope the managed identity / RBAC to only what it needs	Avoid over-broad data-plane access
Data minimization	Short TTLs on sensitive keys (`SET ... EX`)	Sensitive data self-expires
Audit & logging	Diagnostic settings to a Log Analytics workspace	Trace connections, auth, config changes
Defense vs bad commands	Restrict/disable dangerous admin commands	`FLUSHALL`/`KEYS` blast radius
Geo data residency	Choose peer regions with compliance in mind	Writes replicate to every mesh region

Least-privilege auth options, ranked from most to least secure:

Approach	Secret exposure	Rotation	Recommendation
Entra token via managed identity	None (short-lived)	Automatic	Preferred where the client supports it
Access key from Key Vault at runtime	Low (never on disk in app)	Scheduled rotation	Acceptable fallback
Access key in environment/app settings	Medium (visible in config)	Often forgotten	Avoid in production
Access key in source/connection string literal	High (leaks via VCS)	Never	Never

Cost & sizing

Enterprise costs more than Standard/Premium because you are paying for a commercial runtime, and active-active doubles the footprint because both regions run full clusters. The bill is driven by SKU (RAM/CPU per node), capacity (number of nodes/shards), the number of regions in the mesh, and cross-region egress for replication.

What drives the bill, and how to control each lever:

Cost driver	Scales with	Control it by	Note
SKU (E5…E100, F300…)	RAM/CPU per node	Right-size to the working set	Bigger SKU = higher hourly rate
Capacity (node count)	Shards / throughput	Scale out only when needed	Enterprise SKUs scale in even node counts; each pair adds cost
Regions in geo mesh	N clusters running	Keep the mesh to needed regions	Each region is a full cluster
Cross-region egress	Write volume × peers	Reduce mesh; batch where possible	Every write fans out to N-1 peers
Persistence disk	Dataset size	(managed)	Local managed disk, modest
Flash NVMe (Flash SKUs)	Cold-tier size	Use Flash for large skewed data	Cheaper per GB than all-RAM
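
The "every write fans out to N-1 peers" row is easy to turn into arithmetic. The following is a minimal sketch, assuming replication traffic is roughly equal to the raw write volume (it ignores compression and protocol overhead, so treat the result as an order-of-magnitude estimate). The function name and parameters are illustrative, not part of any Azure API.

```python
def replication_egress_gb(write_gb_per_day: float, regions: int, days: int = 30) -> float:
    """Estimate mesh-wide cross-region replication volume, in GB, over `days`.

    write_gb_per_day is the total write volume accepted across all regions.
    Each write originates in one region and is shipped to the other N-1.
    """
    if regions < 1:
        raise ValueError("regions must be at least 1")
    peers = regions - 1  # a single-region database replicates to nobody
    return write_gb_per_day * peers * days


# Same write load, growing mesh: egress grows linearly with the peer count.
two_region = replication_egress_gb(10, regions=2)
three_region = replication_egress_gb(10, regions=3)
```

Going from two regions to three doubles the replicated volume for the same write load, which is why the table says to keep the mesh to the regions that actually need it.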

Right-sizing approach:

Question	Method	Action
How big is the working set?	Key count × average entry size (key + value + overhead) at peak, plus headroom	Pick the SKU whose RAM covers it under `NoEviction`
RAM or Flash?	Is access skewed (hot/cold)?	Skewed + large → Flash; uniform/hot → RAM
Up or out?	CPU-bound (`serverLoad`) vs memory-bound	High server load → out (shards); OOM → up (RAM)
How many regions?	Where must writes happen?	Only regions that must accept writes
Headroom target	Alert at 75% used memory	Size so steady state sits below that
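
The sizing rows above reduce to one calculation: estimate the working set, then divide by the alert threshold so steady state sits below it. This is a rough sketch under stated assumptions. The SKU names and RAM figures in `candidate_skus` are placeholders I made up for illustration; substitute the usable memory figures from the current Azure documentation.

```python
def required_ram_gb(key_count: int, avg_entry_bytes: int, alert_fraction: float = 0.75) -> float:
    """RAM needed so the working set stays below the alert threshold.

    avg_entry_bytes should include key, value, and per-key overhead.
    """
    working_set_gb = key_count * avg_entry_bytes / 1024**3
    return working_set_gb / alert_fraction


def pick_sku(required_gb: float, sku_ram_gb: dict):
    """Return the smallest SKU whose RAM covers required_gb, or None if none fits."""
    for name, ram_gb in sorted(sku_ram_gb.items(), key=lambda item: item[1]):
        if ram_gb >= required_gb:
            return name
    return None


# Hypothetical catalogue: names and sizes are placeholders, not real SKUs.
candidate_skus = {"sku-a": 12, "sku-b": 25, "sku-c": 50}

needed = required_ram_gb(key_count=30_000_000, avg_entry_bytes=1024)
choice = pick_sku(needed, candidate_skus)
```

Thirty million 1 KiB entries come to about 28.6 GB of raw working set, but the 75% headroom target pushes the requirement to roughly 38 GB, so the 25 GB placeholder SKU is too small even though the raw data nearly fits.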

Rough figures (list-price ballpark, varies by region and commitment – always confirm with the pricing calculator):

Scenario	Approx monthly (USD)	Approx monthly (INR)	Notes
Single Enterprise `E5`, 2 nodes, one region	~$600–900	~₹50,000–75,000	Smallest prod Enterprise footprint
Single `E10`, 2 nodes, one region	~$1,200–1,800	~₹1,00,000–1,50,000	Common single-region prod size
Active-active `E10` × 2 regions	~$2,400–3,600 + egress	~₹2,00,000–3,00,000 + egress	Double clusters + cross-region egress
Enterprise Flash `F300`, minimum capacity, one region	~$900–1,400	~₹75,000–1,15,000	Large dataset, cheaper per GB
Standard `C1` (contrast)	~$40–60	~₹3,500–5,000	No persistence/clustering/geo

There is no free tier for Enterprise. For dev and learning, practice client patterns on an inexpensive Standard C0/C1, and reserve Enterprise spend for the features that require it (modules, active-active). Once the steady-state size is known, commit to a reservation to cut the hourly rate.

Interview & exam questions

These questions map to AZ-204 (developing solutions), AZ-305 (designing infrastructure), and the Redis-specific knowledge expected in senior Azure roles.

1. Why is Azure Cache for Redis Enterprise a different ARM resource type from Premium, and why does it matter? Enterprise/Enterprise Flash use Microsoft.Cache/redisEnterprise (a parent cluster + child database), running the commercial Redis Enterprise runtime, while Basic/Standard/Premium use Microsoft.Cache/redis running OSS Redis. It matters because IaC modules, endpoints, ports (10000 vs 6380), and capabilities (modules, active-active) differ; a Bicep/Terraform module for one type does nothing for the other.

2. Contrast OSS and Enterprise clustering policies. OSS exposes the native Redis Cluster API – the client discovers shards, computes CRC16 hash slots, and connects directly (lowest latency, needs a cluster-aware client and routable node IPs). Enterprise puts a proxy in front so the client uses one endpoint like a standalone Redis (simplest networking, any client, one extra hop). The policy is chosen at creation and is permanent.

3. What is a CROSSSLOT error and how do you prevent it? A multi-key command (or MULTI/EXEC/Lua) whose keys hash to different slots. Prevent it by co-locating keys with a hash tag – the {...} substring is what’s hashed – so all co-accessed keys (e.g. t:{tenant}:...) land in one slot. Even the Enterprise proxy requires single-slot keys for transactions and Lua.

4. Difference between Premium passive geo-replica and Enterprise active geo-replication? Passive is a one-way link to a read-mostly secondary with manual failover – a DR tool. Active-active is a full-mesh CRDB where every region accepts reads and writes with automatic CRDT conflict resolution and no failover step – an availability tool. If both regions must accept writes, you need Enterprise active-active.

5. How do CRDTs resolve concurrent writes, and why can’t you use a string counter in active-active? Each data type is reimplemented as a CRDT: strings are last-write-wins, counters are additive (both increments apply), sets/hashes merge per element/field. A string SET n for a count is LWW, so concurrent increments in two regions lose updates; use INCR (additive) instead.

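6. Why model an idempotency check as a CRDT set, and how does SET NX EX behave across regions? A set’s observed-remove semantics make concurrent adds of the same id converge cleanly. SET key val NX EX succeeds only if the key does not yet exist in the region that receives the command. Once the first write has replicated, a retry in any region is rejected, which is the duplicate protection idempotency needs. The check is local, though: two regions that race inside the replication window can both succeed before string LWW converges them to one value, so treat it as retry protection rather than a strict cross-region lock.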

7. RDB vs AOF – when do you choose each, and what’s the data-loss window? RDB snapshots periodically (loss up to the interval, e.g. 1h) – cheap, fast restart, good for regenerable caches. AOF logs every write; at 1s fsync, worst-case loss is ~1 second – the default for data you can’t recreate. Both protect against restart, not against bad commands.

8. Why is NoEviction mandatory for an active-active geo cache? Evictions are local to each region, so an LRU eviction in one region but not another silently diverges the dataset – a correctness bug, not just a hit-rate dip. Run geo databases as NoEviction and size for the full working set.

9. Name three client configurations that prevent Redis outages on Azure. A singleton multiplexer (not pool-per-request) to avoid connection storms; AbortOnConnectFail = false so the client recovers after a maintenance blip instead of throwing permanently; and an operation-level retry with jittered backoff so reshard/patch blips don’t surface as errors. Add periodic health checks to detect dead idle sockets.

10. What happens during a scale-out reshard, and how do you make it invisible? Hash slots migrate between shards. Under the OSS policy, requests for keys in a migrating slot briefly receive ASK/MOVED redirects and a cluster-aware client re-routes them, while under the Enterprise policy the proxy absorbs the move. Make it invisible with a resilient client (redirect-following, AbortOnConnectFail=false, op retries) and validate with a load test through a live scale-out, expecting zero errors and only a p99 bump.

11. Why measure client-side latency percentiles instead of server averages? A server-side average of 1 ms hides a p99 of 200 ms from a hot shard, a cross-slot fan-out, or a client GC pause. Track p50/p99 per operation from the app and correlate spikes with serverLoad and reshard events to tell “retry working” from “real hot key.”

12. How do you secure an Enterprise cache end to end? TLS-only (clientProtocol: Encrypted) on port 10000; Private Endpoint with public access disabled and a linked Private DNS zone; Entra token auth via managed identity (key in Key Vault, rotated, as fallback); short TTLs on sensitive keys; diagnostic logs to Log Analytics; and compliance-aware region choice since writes replicate to every mesh peer.
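
Question 9's operation-level retry with jittered backoff can be sketched without any Redis client at all. This is a minimal illustration, not a drop-in for a specific library: `retry_with_jitter` and its parameters are names I chose, the exception types stand in for whatever transient errors your client raises, and `sleep` and `rng` are injectable only so the behavior can be tested deterministically.

```python
import random
import time


def retry_with_jitter(op, attempts=4, base_delay=0.05, max_delay=1.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep, rng=random.random):
    """Call op(); on a transient error, wait a random slice of an
    exponentially growing cap ("full jitter") and try again."""
    for attempt in range(attempts):
        try:
            return op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # uniform in [0, cap) spreads clients apart
```

The jitter matters as much as the backoff: without it, every client that hit the same blip retries at the same instant and recreates the spike. Only wrap operations that are safe to repeat, since a retried INCR whose first attempt actually landed will count twice.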

Quick check

1. Which ARM resource type backs the Enterprise tier, and what is the default TLS port?
2. You need both East US and West Europe to accept writes to the same key with no failover step. Which tier and replication mode?
3. A MULTI/EXEC transaction throws CROSSSLOT. What single keyspace change fixes it?
4. Your active-active counter is losing increments across regions. What’s the likely modeling mistake?
5. Your app throws permanently the first time a maintenance window blips the connection. Which one client setting fixes it?

Answers

1. Microsoft.Cache/redisEnterprise (a parent cluster + child database), and the default TLS port is 10000 (Premium uses 6380).
2. Enterprise (or Enterprise Flash) with active-active geo-replication (a CRDB) – both regions accept reads and writes with automatic CRDT conflict resolution and no failover.
3. Add a hash tag so every key in the transaction shares the {...} substring (e.g. k:{order123}:...), forcing them into one hash slot.
4. The counter is modeled as a string SET (last-write-wins), so concurrent writes lose updates. Use INCR (additive CRDT) instead.
5. Set AbortOnConnectFail = false so the multiplexer reconnects in the background instead of throwing permanently after the first failed connect.
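
The lost-increment answer is easier to believe once you watch the two merge rules side by side. The sketch below uses textbook CRDT models (a last-write-wins register and a per-region additive counter), not Redis Enterprise's internal implementation; the region names, timestamps, and the `seed` entry are invented for the illustration.

```python
def lww_merge(a, b):
    """Last-write-wins register: each state is (timestamp, region, value).
    The later timestamp wins; region breaks ties."""
    return max(a, b)


def counter_merge(a, b):
    """Additive counter: each state maps region -> increments made there.
    Merging keeps the highest tally seen per region."""
    return {region: max(a.get(region, 0), b.get(region, 0))
            for region in a.keys() | b.keys()}


def counter_value(state):
    return sum(state.values())


# Both regions read n = 10 and concurrently "increment" with SET n 11.
east_set = (101, "east", 11)
west_set = (102, "west", 11)
lww_result = lww_merge(east_set, west_set)[2]  # one of the two updates is lost

# Same scenario with an additive counter: each region records its own +1.
east_incr = {"seed": 10, "east": 1}
west_incr = {"seed": 10, "west": 1}
counter_result = counter_value(counter_merge(east_incr, west_incr))
```

The register converges to 11 because both writers computed the same value from a stale read, while the counter converges to 12 because the merge combines the increments themselves rather than their results.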

Glossary

Term	Definition
Enterprise tier	Azure Cache for Redis built on the commercial Redis Enterprise runtime (`Microsoft.Cache/redisEnterprise`), adding modules, active-active geo-replication, and up to 99.999% SLA.
Enterprise Flash	An Enterprise SKU family that keeps hot keys in RAM and tiers colder values to local NVMe for cheaper large-dataset storage.
Clustering policy	The permanent choice of OSS (native cluster API, direct-to-shard) or Enterprise (proxy, single endpoint) routing, set at database creation.
Hash slot	One of 16384 CRC16 buckets across which keys are distributed in a cluster; multi-key commands require all keys in one slot.
Hash tag	The substring inside the first `{}` in a key name; only it is hashed, so keys sharing a tag co-locate in one slot.
CROSSSLOT	The error when a multi-key command, transaction, or Lua script spans more than one hash slot.
MOVED / ASK	Cluster redirects telling a client which node owns a slot (`MOVED`) or that a slot is mid-migration (`ASK`); a cluster-aware client follows them transparently.
CRDB	Conflict-free replicated database – the active-active, full-mesh, multi-write database Enterprise builds across regions.
CRDT	Conflict-free replicated data type; each Redis type (string/counter/set/hash/sorted set) merges concurrent writes deterministically.
Active-active	A topology where every region accepts reads and writes with no primary; region loss requires no failover.
Passive geo-replica	A Premium one-way link to a read-mostly secondary with manual failover – a DR tool, not an availability tool.
RDB	Point-in-time snapshot persistence; cheap, but loses everything since the last snapshot on a hard failure.
AOF	Append-only-file persistence logging every write; at `1s` fsync, worst-case loss is ~1 second.
NoEviction	The policy that rejects writes (returns OOM) at the memory limit instead of evicting keys; mandatory for geo caches to avoid divergence.
Multiplexer	A single long-lived client connection object (e.g. StackExchange.Redis `ConnectionMultiplexer`) that pipelines all commands.
Server Load	The Azure Monitor metric for the percentage of time the Redis main thread is busy; the leading CPU-bound indicator.
Private Endpoint	A private IP projection of the cache into your VNet via Private Link, removing public exposure.
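
The hash slot and hash tag entries above describe a calculation you can run offline to check a keyspace before it ever throws CROSSSLOT. The sketch below follows the slot rule from the open-source Redis Cluster specification (CRC16 of the key, or of the hash tag if one is present, modulo 16384), using Python's `binascii.crc_hqx`, which implements the same CRC16 variant when seeded with 0. The key names are examples only.

```python
import binascii


def hash_slot(key: str) -> int:
    """Return the Redis Cluster hash slot (0-16383) for a key."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384


def same_slot(*keys: str) -> bool:
    """True if a multi-key command over these keys would stay in one slot."""
    return len({hash_slot(k) for k in keys}) == 1


tagged_ok = same_slot("t:{tenant42}:cart", "t:{tenant42}:profile")
```

Running `same_slot` over the keys a transaction or Lua script touches is a cheap unit test for a key-naming scheme: tagged keys such as `t:{tenant42}:cart` and `t:{tenant42}:profile` always agree because only `tenant42` is hashed.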

Next steps

Azure Private Endpoints and Private DNS at Scale – lock the cache into your VNet correctly so in-VNet clients resolve the private IP.
Azure Key Vault: Secret Rotation with Managed Identity – store and rotate the access key, or move to managed-identity token auth.
Cosmos DB Multi-Region Writes and Conflict Resolution – contrast Redis automatic CRDT merge with Cosmos’s pluggable conflict policies.
Azure Multi-Region Active-Active Disaster Recovery – fit the active-active cache into a full multi-region architecture.
Azure Monitor Deep Dive: Every Option – build the alerts on used-memory %, eviction, server load, and client-side p99.