Azure Cache for Redis Enterprise: Clustering, Active Geo-Replication, and Resilient Failover Patterns

Most “Redis is down” pages I have been dragged into were not Redis failing. They were a client library that opened a single connection to a single node, hardcoded a regional hostname, and treated MOVED as a fatal error instead of a routing hint. Azure Cache for Redis Enterprise – the tier built on the commercial Redis Enterprise runtime rather than OSS Redis – gives you clustering, multi-region active-active replication, durable persistence, and the Redis modules (Search, JSON, TimeSeries, Bloom). But every one of those features changes the contract your client must honor. Cross-slot multi-key commands are no longer free. A node can move under you mid-request. Two regions can both accept a write to the same key and you have to decide who wins. This guide wires up the Enterprise tier correctly and, just as importantly, builds the client-side behavior that survives the day the topology shifts.

Everything here targets the Enterprise and Enterprise Flash tiers, with notes on where Premium diverges. The provider resource is Microsoft.Cache/redisEnterprise, a different ARM resource type from the Microsoft.Cache/redis you use for Basic/Standard/Premium. That distinction trips up Terraform and Bicep modules constantly, and it is the first thing to get right because a module written for one resource type cannot create or manage the other.

By the end you will stop treating “the cache is down” as a single event. You will know whether you are looking at a CROSSSLOT error from a keyspace that ignored hash tags, a MOVED that escaped to application code because cluster mode is off, an OOM because NoEviction met an undersized cluster, a connection storm because the client pools per-request instead of multiplexing, or a genuine regional outage that an active-active CRDB should have made a non-event. Knowing which in ninety seconds is what separates a five-minute blip from a two-hour incident.

What problem this solves

A cache that holds session state, idempotency keys, feature flags, rate-limit counters, or a hot read-through layer is on the critical path of every request. When it stalls, the application stalls behind it – or worse, falls back to the database and runs at a fraction of normal throughput until the database itself buckles. The naive answer (“just add a replica”) solves node loss but not the three failure classes that actually page senior engineers: a client that cannot follow a topology change, a single-region cache that cannot accept writes during a regional incident, and a cache used as an LRU eviction store in a geo-replicated topology where evictions silently diverge the regions.

What breaks without the Enterprise patterns: a passive geo-replica that is read-only at exactly the moment you need to write to it (during the primary region’s outage), turning an eleven-minute regional blip into eleven minutes of either blocked writes or duplicate processing. A MULTI/EXEC transaction that worked in dev against a single node and throws CROSSSLOT the first time it runs against a real cluster. A ConnectionMultiplexer created with AbortOnConnectFail = true that throws when its first connect attempt lands in a maintenance window, leaving the process without a multiplexer that could have reconnected on its own. A cache sized for the average working set that OOMs the moment a campaign doubles the keyspace, returning OOM command not allowed on every write because the policy is NoEviction.

Who hits this: any team running Redis on the critical path at scale. It bites hardest on multi-region applications (where passive replication is mistaken for an availability tool), on stateful workloads that model everything as opaque strings (and so eat last-write-wins data loss), on teams that adopt clustering without auditing their client library’s cluster support, and on anyone who treats Redis persistence as a backup. The fix is rarely “buy a bigger SKU” – it is “model the data as the right CRDT, pick the clustering policy the client can actually drive, size for the full working set under NoEviction, and make the client survive a node move.”

To frame the field before the deep dive, here is every failure class this article covers, the question it forces, and where to look first:

| Failure class | What it looks like | First question to ask | First place to look | Most common single cause |
| --- | --- | --- | --- | --- |
| CROSSSLOT error | Multi-key command rejected | Are these keys in one hash slot? | Client exception text | Keyspace ignores hash tags |
| MOVED in app code | App sees a MOVED 1234 ip:port | Is the client in cluster mode? | Client config / cluster flag | Cluster-unaware client on OSS policy |
| OOM on write | OOM command not allowed | Memory % vs eviction policy | usedmemorypercentage metric | NoEviction + undersized cluster |
| Connection storm | Thousands of conns/sec | One multiplexer or pool-per-request? | connectionscreatedpersecond | New connection per operation |
| Regional write outage | One region can’t write | Passive replica or active-active? | Topology / geoReplication | Passive geo-replica used for HA |
| Replica divergence | Two regions disagree | Is eviction on in a geo group? | Eviction policy + metrics | LRU eviction on an active-active DB |
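
The first three rows can usually be read straight off the exception text. The sketch below is a minimal illustration, assuming the client surfaces the standard Redis error reply prefixes (CROSSSLOT, MOVED, ASK, OOM). The function name and the returned labels are hypothetical and simply mirror the table rows.

```python
def classify_redis_error(message: str) -> str:
    """Map a Redis error reply to one of the failure classes in the table.

    Hypothetical helper: the prefixes are standard Redis error replies,
    but the returned labels are only this article's table rows.
    """
    text = message.strip()
    words = text.split()
    # Client libraries sometimes prepend their own wording, so search for
    # the token rather than requiring it at position zero.
    if "CROSSSLOT" in words:
        return "CROSSSLOT error: co-accessed keys do not share a hash tag"
    if "MOVED" in words or "ASK" in words:
        return "MOVED in app code: client is not following cluster redirects"
    if "OOM command not allowed" in text:
        return "OOM on write: compare memory % with the eviction policy"
    return "unclassified: check connection metrics and geo-replication topology"


if __name__ == "__main__":
    print(classify_redis_error("MOVED 1234 10.0.0.5:10001"))
```

Connection storms, regional write outages, and replica divergence do not announce themselves in an error string, which is why the fallback points at metrics and topology instead.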

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand Redis at the data-structure level: strings, hashes, sets, sorted sets, TTL/EXPIRE, and the basic command set (SET, GET, INCR, MGET, MULTI/EXEC). You should know how to run az in Cloud Shell, read JSON output, and reason about a VNet, subnet, NSG, and Private DNS zone. Familiarity with at least one Redis client library (StackExchange.Redis, Lettuce/Jedis, redis-py, go-redis) helps, because half of the resilience story lives in client configuration.

This sits in the Data & Caching track of the Zero-to-Hero program. It assumes the networking fundamentals from Azure Virtual Network Basics: Subnets, NSGs, and Peering and the private-connectivity patterns from Azure Private Endpoints and Private DNS at Scale. The identity and secret-handling side leans on Azure Key Vault: Secret Rotation with Managed Identity. For multi-region thinking beyond the cache, it pairs with Cosmos DB Multi-Region Writes and Conflict Resolution (a useful contrast: Cosmos lets you plug in conflict logic; Redis CRDTs resolve automatically per type) and Azure Multi-Region Active-Active Disaster Recovery. When the cache fronts a relational store, Azure SQL Database: Hyperscale, Elastic Pools, Ledger is the backing tier the cache protects.

A quick map of who owns what during a cache incident, so you call the right person fast:

| Layer | What lives here | Who usually owns it | Failure classes it can cause |
| --- | --- | --- | --- |
| Client library | Multiplexer, retries, cluster mode | App / dev team | MOVED leaks, connection storms, no reconnect |
| Keyspace design | Hash tags, TTLs, data types | App / dev team | CROSSSLOT, LWW data loss, divergence |
| Cache database | Eviction, persistence, clustering | Platform / data team | OOM, restart data loss, slot routing |
| Geo-replication mesh | CRDB links, group nickname | Platform / data team | Write outage (if passive), replication lag |
| Network | Private Endpoint, DNS, NSG, TLS | Network team | Public exposure, DNS resolves to public IP |
| Identity | Access key, Entra token, RBAC | Security / platform | Leaked key, token expiry, auth failure |

Core concepts

Six mental models make every later decision obvious.

Enterprise is a different runtime, not a bigger Premium. Basic/Standard/Premium run OSS Redis under Microsoft.Cache/redis. Enterprise and Enterprise Flash run the commercial Redis Enterprise software under Microsoft.Cache/redisEnterprise, with a parent cluster resource and a child database resource. The database is the thing your client connects to. This split is why a Bicep module that creates a Microsoft.Cache/redis resource cannot produce an Enterprise cache, and why the endpoint, port, and capabilities differ.

The clustering policy is a permanent client contract. You choose OSS or Enterprise clustering policy at database creation and you cannot change it without recreating the database. OSS exposes the native Redis Cluster API: the client discovers every shard, computes the CRC16 hash slot (16384 slots) for each key, and connects directly to the owning node – lowest latency, but it requires a cluster-aware client and the client sees every node address. Enterprise policy puts a proxy in front of the shards so the client talks to a single endpoint like a standalone Redis – simplest for clients and networking, at the cost of a proxy hop.

Multi-key commands need one hash slot. In any clustered Redis, a command touching multiple keys requires all of them in the same hash slot. MSET user:1001 a user:1002 b hashes the two keys to different slots and fails with CROSSSLOT under OSS policy. Hash tags – the substring inside the first {} – force co-location: MSET user:{t42}:1001 a user:{t42}:1002 b hashes both on t42. Design the keyspace so everything you co-access (a tenant, an order, a session) shares a hash tag; otherwise transactions, Lua scripts, and MGET/MSET break.

Active-active means no primary and automatic conflict resolution. Enterprise active geo-replication builds an Active-Active CRDB (conflict-free replicated database). Every region accepts reads and writes; changes replicate full-mesh to all peers. A region outage means you keep serving from the survivors with no failover step. Concurrent writes converge deterministically because each data type is reimplemented as a CRDT (conflict-free replicated data type): strings are last-write-wins by timestamp, counters are additive (both increments apply), and sets/hashes/sorted-sets merge per element. You do not write conflict logic; you choose the data type that gives the convergence you need.

Persistence survives restart; replication survives node loss; neither is a backup. RDB snapshots the dataset on an interval; you lose everything since the last snapshot on a hard failure. AOF logs every write, and with fsync every second the worst-case loss is ~1 second. Both protect against a full cluster restart. Replication protects against losing a node or a region. Neither protects against a bad FLUSHALL or a logic bug – that is what an export to a storage account is for.

The client is where outages are made or prevented. A correctly provisioned, multi-region, persistent cluster behind a broken client is still an outage. The client must multiplex (one long-lived connection, not pool-per-request), keep retrying instead of throwing on first-connect failure, follow MOVED/ASK redirects (OSS policy), health-check idle connections, and retry the operation with jittered backoff through a node move. Every one of those is a configuration choice, and the defaults are frequently wrong for Azure.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters |
| --- | --- | --- | --- |
| Enterprise cluster | The redisEnterprise parent resource (nodes) | Subscription / RG | Holds the SKU, zones, capacity |
| Database | The child databases/default Redis endpoint | On the cluster | The thing the client connects to |
| Clustering policy | OSS (native) vs Enterprise (proxy) routing | Database property (permanent) | Dictates client mode + networking |
| Hash slot | One of 16384 CRC16 buckets for keys | Across shards | Multi-key cmds need one slot |
| Hash tag | {...} substring that is hashed | In the key name | Forces co-location of keys |
| CRDB | Conflict-free replicated (geo) database | Across regions | Active-active, no primary |
| CRDT | Conflict-free data type (per Redis type) | Per key/value | Deterministic merge of writes |
| RDB / AOF | Snapshot / append-only persistence | Database property | Restart durability (not backup) |
| Eviction policy | What happens at memory limit | Database property | NoEviction → OOM; LRU → loss/divergence |
| Private Endpoint | Private IP projection into a VNet | In your subnet | Removes public exposure |
| Multiplexer | One long-lived client connection object | In the app | Prevents connection storms |
| Server Load | % time the Redis main thread is busy | Azure Monitor metric | Leading CPU-bound indicator |

Tier selection: Standard, Premium, Enterprise, Enterprise Flash

Pick the tier from your durability and topology requirements, not from raw memory size. The tiers are not a linear ladder – Enterprise is a separate runtime, and Flash trades RAM for NVMe to cut cost on large skewed datasets.

| Capability | Standard | Premium | Enterprise | Enterprise Flash |
| --- | --- | --- | --- | --- |
| Runtime | OSS Redis | OSS Redis | Redis Enterprise | Redis Enterprise |
| ARM resource type | Microsoft.Cache/redis | Microsoft.Cache/redis | Microsoft.Cache/redisEnterprise | Microsoft.Cache/redisEnterprise |
| SLA (single region) | 99.9% | 99.9% (99.99% zone-redundant) | up to 99.999% | up to 99.999% |
| Clustering | no | OSS only | OSS or Enterprise policy | OSS or Enterprise policy |
| Active geo-replication | no | passive (geo-replica link) | active-active (CRDB) | active-active (CRDB) |
| Persistence (RDB + AOF) | no | yes | yes | yes |
| Redis modules (Search/JSON/etc.) | no | no | yes | yes |
| Storage medium | RAM | RAM | RAM | RAM + NVMe tier |
| Zone redundancy | no | yes | yes | yes |
| Default TLS port | 6380 | 6380 | 10000 | 10000 |

The deciding factors, as a decision table:

| If you need… | Then choose… | Because… |
| --- | --- | --- |
| Cheapest possible cache, regenerable data | Standard | No persistence, no clustering, lowest price |
| Single-region HA + persistence, no modules | Premium | OSS clustering + RDB/AOF, 99.99% zone-redundant |
| Multi-region read failover (manual) | Premium + passive geo-replica | One-way link; DR tool, not availability |
| Multi-region write (both regions accept writes) | Enterprise | Active-active CRDB, no failover step |
| RediSearch / RedisJSON / TimeSeries / Bloom | Enterprise | Modules exist only on the Enterprise runtime |
| Large dataset, skewed access, cost-sensitive | Enterprise Flash | Hot keys in RAM, cold values on NVMe |
| Uniformly hot, latency-critical, large | Enterprise (not Flash) | The flash hop adds p99 latency |
| Highest SLA (99.999%) | Enterprise / Flash | Only the Enterprise runtime offers it |

A note on each tier’s sweet spot, because the table compresses real trade-offs:

| Tier | Best for | Avoid when | Key limit / gotcha |
| --- | --- | --- | --- |
| Standard | Dev, regenerable read caches | You need persistence or HA | No clustering; single node failure = data loss |
| Premium | Prod single-region with persistence | You need multi-region writes or modules | Passive geo-replica is read-only + manual failover |
| Enterprise | Modules, active-active, 99.999% | Tiny caches where cost dominates | Distinct resource type; even-numbered capacity |
| Enterprise Flash | Large session/cache stores, skewed | Uniformly hot or latency-critical | NVMe hop visible at p99 on hot keys |

# Enterprise tier uses a distinct resource: redisenterprise, with a child database.
# --capacity must be EVEN for Enterprise SKUs (nodes deploy in HA pairs).
az redisenterprise create \
  --name kv-redis-prod \
  --resource-group rg-data-prod \
  --location eastus2 \
  --sku Enterprise_E10 \
  --capacity 2 \
  --zones 1 2 3

# The database (the actual Redis endpoint) is a child resource.
az redisenterprise database create \
  --cluster-name kv-redis-prod \
  --resource-group rg-data-prod \
  --client-protocol Encrypted \
  --clustering-policy EnterpriseCluster \
  --eviction-policy NoEviction \
  --persistence aof-enabled=true aof-frequency=1s

The SKU name encodes both the engine (Enterprise_E10, EnterpriseFlash_F300) and a capacity unit. --capacity must be an even number for Enterprise SKUs because nodes deploy in pairs for HA. Always pass --zones 1 2 3 at create time; you cannot add zone redundancy to an existing cluster in place.

The SKU families and what the suffix encodes:

| SKU family | Example SKUs | Storage | Scales by | Typical use |
| --- | --- | --- | --- | --- |
| Enterprise_E* | E5, E10, E20, E50, E100 | All RAM | SKU (up) + capacity (out) | Modules, active-active, latency-critical |
| EnterpriseFlash_F* | F300, F700, F1500 | RAM + NVMe | SKU (up) + capacity (out) | Large skewed datasets, lower cost/GB |

Clustering policies and key distribution

This is the single most consequential decision and it is permanent for the database’s lifetime: you choose it at creation and it dictates how your client connects and how multi-key operations behave.

OSS clustering policy exposes the native Redis Cluster API. The client discovers all shards, computes the CRC16 hash slot for each key, and connects directly to the owning node. Lowest latency and highest throughput because there is no proxy hop – but it requires a cluster-aware client, and the client sees every node’s address, which complicates private networking (every node IP must be routable).

Enterprise clustering policy puts a proxy in front of the shards. The client connects to a single endpoint as if it were a standalone Redis; the proxy routes commands to the correct shard. Far simpler for clients (any standard client works, no cluster mode) and for networking (one endpoint), at the cost of a proxy hop.

OSS vs Enterprise policy, side by side

| Dimension | OSS clustering policy | Enterprise clustering policy |
| --- | --- | --- |
| Client requirement | Cluster-aware client + cluster mode on | Any standard client; no cluster mode |
| Topology visibility | Client sees every shard/node address | Client sees one proxy endpoint |
| Network complexity | Every node IP must be reachable | One endpoint to route/whitelist |
| Latency | Lowest (direct to shard) | One extra proxy hop |
| Max throughput | Highest | Slightly lower (proxy overhead) |
| Multi-key across slots | Fails CROSSSLOT | Proxy may fan out simple cmds |
| MULTI/EXEC + Lua across slots | Requires single slot | Still requires single slot |
| MOVED/ASK handling | Client must follow redirects | Absorbed by the proxy |
| Best when | You control the client, want max perf | Simpler networking, non-cluster client |

The multi-key contract

The behavior that surprises people is multi-key commands. A command touching multiple keys requires all those keys to live in the same hash slot:

# This fails across slots -- the keys hash to different slots
MSET user:1001 alice user:1002 bob   # CROSSSLOT error under OSS policy

# Hash tags force keys into the same slot using the {...} substring
MSET user:{tenant42}:1001 alice user:{tenant42}:1002 bob   # both hash on "tenant42"

Only the substring inside the first {} is hashed. Design your keyspace with hash tags around the entity you co-access so transactions and MGET/MSET stay single-slot. Under the Enterprise policy, the proxy makes some cross-slot multi-key commands appear to work by fanning out, but MULTI/EXEC transactions and Lua scripts still require single-slot keys – so the hash-tag discipline is non-negotiable either way.
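
To audit a keyspace before it meets a real cluster, the slot calculation can be reproduced offline. The sketch below assumes the Redis Cluster key-hashing rule (CRC16 in its XMODEM variant, modulo 16384, applied to the first non-empty {...} substring when one exists). The helper names hash_slot and same_slot are illustrative, not part of any client library.

```python
from binascii import crc_hqx

SLOT_COUNT = 16384


def hash_slot(key: str) -> int:
    """Return the cluster hash slot for a key, honoring hash tags."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty {...} counts as a hash tag; with "{}" or an
        # unclosed "{" the whole key is hashed.
        if end > start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16 (XMODEM) variant
    # that the Redis Cluster specification uses for key hashing.
    return crc_hqx(data, 0) % SLOT_COUNT


def same_slot(*keys: str) -> bool:
    """True when every key maps to one slot, so a multi-key command is safe."""
    return len({hash_slot(k) for k in keys}) <= 1


if __name__ == "__main__":
    for pair in (("user:1001", "user:1002"),
                 ("user:{tenant42}:1001", "user:{tenant42}:1002")):
        print(pair, [hash_slot(k) for k in pair], same_slot(*pair))
```

Running a check like same_slot over the key sets used by each transaction or Lua script in a test suite catches CROSSSLOT regressions without needing a clustered cache in CI.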

Which operations are slot-sensitive, and what each policy does:

| Operation | OSS policy | Enterprise policy | Make it safe by… |
| --- | --- | --- | --- |
| Single-key GET/SET/INCR | Always fine | Always fine | (nothing) |
| MGET/MSET same slot | Fine | Fine | Hash tag the keys |
| MGET/MSET cross slot | CROSSSLOT | Proxy fans out | Hash tag the keys |
| MULTI/EXEC cross slot | CROSSSLOT | CROSSSLOT | Hash tag every key in the txn |
| Lua EVAL with KEYS[] cross slot | CROSSSLOT | CROSSSLOT | All KEYS share a hash tag |
| SUNIONSTORE/ZUNIONSTORE across keys | CROSSSLOT | CROSSSLOT | Hash tag source + dest keys |
| SCAN | Per-node (OSS) | Single endpoint | Aggregate across shards (OSS) |
| KEYS * | Per-node, blocking | Per-node, blocking | Avoid in prod entirely |

Hash-tag design patterns that keep co-accessed keys together:

| Access pattern | Key template | Hashed substring | Guarantees |
| --- | --- | --- | --- |
| Per-tenant data | t:{tenantId}:orders | tenantId | All tenant keys one slot |
| Per-session bundle | sess:{sessionId}:cart | sessionId | Cart + session co-located |
| Per-order aggregate | ord:{orderId}:lines | orderId | Order + line items together |
| Per-user counters | u:{userId}:counters | userId | Atomic multi-counter updates |
| Global singleton set | g:{flags}:enabled | flags | All flag keys one slot |
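
These templates are easiest to keep consistent when one function builds every key. The sketch below is a hypothetical helper (tagged_key is not a library function) that wraps the entity id in braces and rejects ids that would silently change which substring gets hashed.

```python
def tagged_key(prefix: str, entity_id: str, *parts: str) -> str:
    """Build '<prefix>:{<entity_id>}:<part>...' with the id as the hash tag."""
    entity_id = str(entity_id)
    if not entity_id:
        # "{}" is not treated as a hash tag, so the whole key would be hashed.
        raise ValueError("empty id would produce '{}', which is not a hash tag")
    if "{" in prefix or "{" in entity_id or "}" in entity_id:
        # Only the FIRST {...} is hashed, so a stray brace would change
        # which substring decides the slot.
        raise ValueError("braces in the prefix or id would change the hash tag")
    return ":".join([prefix, "{" + entity_id + "}", *parts])


if __name__ == "__main__":
    print(tagged_key("t", "42", "orders"))       # t:{42}:orders
    print(tagged_key("sess", "abc123", "cart"))  # sess:{abc123}:cart
```

Validating at construction time is a design choice: a user-supplied id containing "}" would otherwise shorten the tag and scatter keys that were meant to live together.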

Choose OSS policy when you control the client and want maximum performance, and you are comfortable with cluster-aware libraries (StackExchange.Redis, Lettuce, redis-py with cluster mode, go-redis ClusterClient). Choose Enterprise policy when you need a single endpoint for private-networking simplicity, or your client cannot do cluster mode. You cannot change it later without recreating the database – so this decision deserves a design review, not a default.

Active geo-replication topologies and conflict handling

Enterprise active geo-replication builds an active-active database (an Active-Active CRDB). Every participating cluster accepts both reads and writes, and changes replicate to all peers. There is no primary. A region outage means you keep serving from the survivors with no failover step.

The mechanism that makes concurrent writes safe is CRDTs. Redis Enterprise reimplements each data type as a CRDT so concurrent writes in different regions converge deterministically. The key insight: you do not get to plug in custom conflict logic the way Cosmos DB does. You pick the data type whose built-in convergence matches your correctness requirement.

CRDT semantics per data type

| Data type | Conflict resolution | Concurrent-write outcome | Use it for | Pitfall |
| --- | --- | --- | --- | --- |
| String (SET) | Last-write-wins by timestamp | One of two concurrent writes is lost | Single-writer keys, idempotency flags | Silent loss on true concurrent writes |
| Counter (INCR/DECRBY) | Additive merge | Both increments apply (no lost update) | Metrics, rate limits, vote tallies | Cannot set an absolute value safely |
| Set (SADD/SREM) | Observed-remove, element merge | Adds + removes converge per element | Idempotency keys, tags, membership | Concurrent add+remove favors add |
| Hash (HSET) | Per-field LWW merge | Different fields merge; same field LWW | Profiles, multi-field records | Same-field concurrent write loses one |
| Sorted set (ZADD) | Per-element score merge | Members merge; score is LWW | Leaderboards, time-ordered queues | Concurrent score update is LWW |
| String as counter (SET n) | LWW | Lost increments | (anti-pattern) | Use INCR, never SET for counts |
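
The difference between the first and last rows and the counter row is easier to see in a toy model. The sketch below is not how Redis Enterprise implements its CRDTs. It is a minimal textbook last-write-wins register and grow-only counter, with hypothetical class names, showing why a counter stored with SET loses an update while an additive counter does not.

```python
from dataclasses import dataclass, field


@dataclass
class LwwRegister:
    """Last-write-wins register: the highest (timestamp, region) stamp wins."""
    value: object = None
    stamp: tuple = (0, "")

    def write(self, value, timestamp: int, region: str) -> None:
        self.merge(LwwRegister(value, (timestamp, region)))

    def merge(self, other: "LwwRegister") -> None:
        if other.stamp > self.stamp:
            self.value, self.stamp = other.value, other.stamp


@dataclass
class GrowCounter:
    """Grow-only counter: each region only ever raises its own entry."""
    per_region: dict = field(default_factory=dict)

    def incr(self, region: str, amount: int = 1) -> None:
        self.per_region[region] = self.per_region.get(region, 0) + amount

    def merge(self, other: "GrowCounter") -> None:
        # Taking the per-region maximum makes merge order irrelevant.
        for region, count in other.per_region.items():
            self.per_region[region] = max(self.per_region.get(region, 0), count)

    @property
    def value(self) -> int:
        return sum(self.per_region.values())


# Both regions read 100 and record one signup at nearly the same moment.
east_reg, west_reg = LwwRegister(), LwwRegister()
east_reg.write(101, timestamp=10, region="east")
west_reg.write(101, timestamp=11, region="west")
east_reg.merge(west_reg)
west_reg.merge(east_reg)

east_ctr, west_ctr = GrowCounter({"seed": 100}), GrowCounter({"seed": 100})
east_ctr.incr("east")
west_ctr.incr("west")
east_ctr.merge(west_ctr)
west_ctr.merge(east_ctr)

print("SET-style counter:", east_reg.value, west_reg.value)    # one signup lost
print("additive counter: ", east_ctr.value, west_ctr.value)    # both signups kept
```

Both replicas converge in each case; the register converges on 101 and the counter on 102, which is the whole argument for choosing the data type by its merge rule.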

Choosing the data type for the convergence you need

| Requirement | Wrong model (loses data) | Right model | Why |
| --- | --- | --- | --- |
| “Count signups across regions” | SET count <n> | INCR count | Additive merge, no lost updates |
| “Has this request been seen anywhere?” | SET seen:<id> 1 | SADD seen <id> | Observed-remove set converges |
| “User’s last-known cart” | two SET cart | HSET cart field val | Per-field merge keeps both fields |
| “Top players this hour” | SET score:<u> <n> | ZADD board <n> <u> | Per-member score, members merge |
| “Single-writer config flag” | (fine) SET flag on | SET flag on | LWW acceptable; one writer |

# Create an active geo-replication group spanning two regions.
# Each region is its own redisenterprise cluster + database; you link them
# via a shared group nickname and mutual linkedDatabase references.

az redisenterprise database create \
  --cluster-name kv-redis-eastus2 \
  --resource-group rg-data-eastus2 \
  --client-protocol Encrypted \
  --clustering-policy EnterpriseCluster \
  --group-nickname global-sessions \
  --linked-databases id="/subscriptions/<sub>/resourceGroups/rg-data-eastus2/providers/Microsoft.Cache/redisEnterprise/kv-redis-eastus2/databases/default" \
  --linked-databases id="/subscriptions/<sub>/resourceGroups/rg-data-westeurope/providers/Microsoft.Cache/redisEnterprise/kv-redis-westeurope/databases/default"

The --linked-databases list must include this database plus every peer, and the same group nickname must be used on every member. Designing the topology:

Topology trade-offs as you add regions:

| Regions in mesh | Write fan-out per write | Cross-region links | When it makes sense | Cost driver |
| --- | --- | --- | --- | --- |
| 2 (active-active pair) | 1 peer | 1 | Most apps: 1 primary + 1 DR-as-active | Egress between 2 regions |
| 3 (triangle) | 2 peers | 3 | Three-continent latency reduction | 3× cross-region egress |
| 4 | 3 peers | 6 | Rare; global low-latency writes | 6 links, bandwidth + latency |
| 5+ | N-1 peers | N(N-1)/2 | Almost never; reconsider design | Quadratic link growth |
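
The fan-out and link counts in this table are plain full-mesh arithmetic. A small sketch, with hypothetical function names, makes the quadratic growth concrete:

```python
def write_fan_out(regions: int) -> int:
    """Peers that must receive each local write in a full mesh: N-1."""
    if regions < 2:
        raise ValueError("a geo-replication group needs at least two regions")
    return regions - 1


def mesh_links(regions: int) -> int:
    """Bidirectional links in a full mesh of N members: N(N-1)/2."""
    if regions < 2:
        raise ValueError("a geo-replication group needs at least two regions")
    return regions * (regions - 1) // 2


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 6):
        print(f"{n} regions: {write_fan_out(n)} peers per write, {mesh_links(n)} links")
```

Going from three regions to five more than triples the link count (3 to 10) while fan-out only doubles, which is why the last row suggests reconsidering the design.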

Active geo-replication vs the Premium passive geo-replica:

| Dimension | Premium passive geo-replica | Enterprise active geo-replication |
| --- | --- | --- |
| Direction | One-way (primary → secondary) | Full mesh, bidirectional |
| Secondary writes | Read-only; no writes accepted | Full read + write |
| Failover | Manual (DNS / link unlink) | None; survivors keep serving |
| Conflict handling | N/A (single writer) | Automatic CRDT merge |
| RPO on region loss | Replication lag at failover | Near-zero; writes accepted locally |
| Use as | DR tool | Availability tool |
| Tier | Premium | Enterprise / Enterprise Flash |

Data persistence: RDB, AOF, and durability trade-offs

Enterprise supports both persistence mechanisms, and they answer different questions. Persistence is about surviving a full cluster restart; it is orthogonal to replication, which is about surviving node loss, and to backup, which is about surviving a bad command or logic bug.

RDB (snapshot) writes a point-in-time dump on an interval (e.g., every 1h/6h/12h). Cheap, low overhead, but you lose everything since the last snapshot on a hard failure.

AOF (append-only file) logs every write. With fsync every second (aof-frequency=1s), worst-case data loss is ~1 second. The cost is write amplification and larger files. This is the right default for anything you cannot regenerate.

RDB vs AOF, with the numbers

| Dimension | RDB (snapshot) | AOF (append-only) |
| --- | --- | --- |
| What it stores | Point-in-time dataset dump | Every write operation, replayed |
| Worst-case data loss | Since last snapshot (e.g. up to 1h) | ~1 second (aof-frequency=1s) |
| Write overhead | Low (periodic fork) | Higher (continuous append) |
| File size | Compact | Larger (full op log) |
| Restart/restore speed | Fast (load one dump) | Slower (replay the log) |
| CPU/memory spike | Fork at snapshot time | Steady, lower spikes |
| Right for | Regenerable caches, restart speed | Stateful data you cannot recreate |

AOF fsync frequency trade-off

| aof-frequency | Worst-case loss | Write throughput impact | When to use |
| --- | --- | --- | --- |
| 1s | ~1 second | Modest | The resilient default for stateful caches |
| always (where supported) | ~0 (per-write fsync) | High (every write blocks on disk) | Only when even 1s loss is unacceptable |

# AOF with per-second fsync -- the resilient default for stateful caches
az redisenterprise database update \
  --cluster-name kv-redis-prod \
  --resource-group rg-data-prod \
  --persistence aof-enabled=true aof-frequency=1s

# RDB hourly -- acceptable only for regenerable caches where restart speed matters
az redisenterprise database update \
  --cluster-name kv-redis-prod \
  --resource-group rg-data-prod \
  --persistence rdb-enabled=true rdb-frequency=1h

The RDB snapshot intervals you can choose, and what each implies:

| rdb-frequency | Worst-case loss on hard failure | Overhead | Use when |
| --- | --- | --- | --- |
| 1h | Up to 1 hour | Lowest | Regenerable cache; restart speed matters most |
| 6h | Up to 6 hours | Lowest | Bulk read cache, easy to rebuild |
| 12h | Up to 12 hours | Lowest | Rarely-changing reference data |

Persistence vs replication vs backup – three different jobs:

| Mechanism | Protects against | Does NOT protect against | Where data lives |
| --- | --- | --- | --- |
| AOF/RDB persistence | Full cluster restart | Bad FLUSHALL, logic bug, region loss | Cluster-local managed disks |
| Zone/node replication | Single node or zone loss | Region loss, bad command | Across nodes/zones in-region |
| Active geo-replication | Region loss | Bad command replicated to all peers | Across regions (full mesh) |
| Export to storage | Bad command, point-in-time recovery | Real-time loss (it is periodic) | Your storage account |

Two correctness notes. First, in an active-active geo group you generally rely on the peer regions for recovery and persistence is a secondary safety net – a surviving region rehydrates a recovered one. Second, persistence is not a backup: it protects against process restart, not against a bad FLUSHALL or a logic bug that corrupts data; a corrupting command replicates to every peer in the mesh. Enterprise persists to the cluster’s local managed disks, not to your storage account, so treat exports separately if you need point-in-time backups.

Private endpoint, VNet injection, and TLS hardening

Never expose a production cache to the public internet. The Enterprise tier supports Private Link, which projects the cache into your VNet via a private endpoint and a private IP – the public FQDN resolves to a private address through Private DNS.

resource cache 'Microsoft.Cache/redisEnterprise@2024-09-01-preview' = {
  name: 'kv-redis-prod'
  location: 'eastus2'
  sku: { name: 'Enterprise_E10', capacity: 2 }
  zones: ['1', '2', '3']
}

resource db 'Microsoft.Cache/redisEnterprise/databases@2024-09-01-preview' = {
  parent: cache
  name: 'default'
  properties: {
    clientProtocol: 'Encrypted'        // TLS-only; rejects plaintext
    clusteringPolicy: 'EnterpriseCluster'
    evictionPolicy: 'NoEviction'
    port: 10000
    persistence: { aofEnabled: true, aofFrequency: '1s' }
  }
}

resource pe 'Microsoft.Network/privateEndpoints@2024-05-01' = {
  name: 'pe-kv-redis-prod'
  location: 'eastus2'
  properties: {
    subnet: { id: dataSubnetId }
    privateLinkServiceConnections: [
      {
        name: 'redis'
        properties: {
          privateLinkServiceId: cache.id
          groupIds: ['redisEnterprise']
        }
      }
    ]
  }
}

Hardening checklist that actually matters:

Networking and TLS settings reference

| Setting | Values | Default | When to change | Limit / gotcha |
| --- | --- | --- | --- | --- |
| clientProtocol | Encrypted, Plaintext | Encrypted | Never use plaintext in prod | Plaintext exposes data + key on the wire |
| port | TLS listen port | 10000 (Enterprise) | Rarely | Premium is 6380; mismatch = connect failure |
| Public network access | Enabled / Disabled | Enabled | Disable once PE is live | Forgetting to disable leaves a public path |
| Private DNS zone | privatelink.redisenterprise.cache.azure.net | (none) | Always with PE | Unlinked zone → resolves to public IP |
| Min TLS version | 1.2 / 1.3 | 1.2 | Raise to 1.3 if clients support | Old clients may only do 1.2 |
| Access keys | Primary / Secondary | Both active | Rotate regularly | Long-lived secret; prefer Entra token |
| Entra (AAD) auth | Enabled / Disabled | Disabled | Enable where client supports | Removes shared-secret risk |

The port-10000 trap and other connection-string mistakes

| Symptom | Likely cause | How to confirm | Fix |
| --- | --- | --- | --- |
| Connect times out from in-VNet client | DNS resolves to public IP | nslookup <fqdn> returns public IP | Link the Private DNS zone to the VNet |
| Connect refused / handshake error | Wrong port (6380 vs 10000) | Check client port setting | Use 10000 for Enterprise |
| Plaintext “connection reset” | TLS not enabled on client | Client Ssl=false | Set Ssl=true; server is TLS-only |
| Auth fails after rotation | Stale key in app config | Compare key in Key Vault vs app | Pull key from Key Vault at runtime |
| Token auth NOAUTH/expired | Entra token not refreshed | Token lifetime exceeded | Use SDK that auto-refreshes the token |
| Works locally, fails in Azure | Public access disabled, no PE route | Test from inside the VNet | Reach via the private endpoint only |
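
The DNS row is the one most worth automating, because it fails silently until the public path is disabled. The sketch below is a minimal startup check, assuming the private endpoint hands out addresses from a private range. The function names are hypothetical, and the resolver is injectable so the check can be exercised without real DNS.

```python
import ipaddress
import socket
from typing import Callable, Iterable


def system_resolver(hostname: str) -> Iterable[str]:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    return socket.gethostbyname_ex(hostname)[2]


def resolves_privately(
    hostname: str,
    resolver: Callable[[str], Iterable[str]] = system_resolver,
) -> bool:
    """True only if the name resolves and every address is in a private range."""
    addresses = list(resolver(hostname))
    return bool(addresses) and all(
        ipaddress.ip_address(address).is_private for address in addresses
    )


if __name__ == "__main__":
    # Stub resolvers stand in for a linked and an unlinked Private DNS zone.
    print(resolves_privately("cache.example", lambda _: ["10.20.0.5"]))   # True
    print(resolves_privately("cache.example", lambda _: ["20.49.1.1"]))   # False
```

Run from inside the VNet with the real cache FQDN, a False result points at an unlinked or missing Private DNS zone rather than at Redis itself.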

Identity and secret-handling options, ranked:

| Auth method | Secret lifetime | Rotation effort | Best for | Trade-off |
| --- | --- | --- | --- | --- |
| Entra ID token | Short (auto-refreshed) | None (managed identity) | Modern clients on Azure | Client/SDK must support it |
| Access key in Key Vault | Long; rotate on a schedule | Manual rotation + redeploy/refresh | Clients without Entra support | Long-lived secret to guard |
| Access key in app config | Long; often never rotated | (don’t do this) | Nothing in prod | Secret leaks via config/source |

Client resilience: multiplexing, retries, and reconnect

This is where most outages are actually caused or prevented. A correctly provisioned cluster behind a broken client is still an outage.

Multiplex one connection, do not pool-per-request. Redis clients like StackExchange.Redis are built around a single long-lived multiplexer that pipelines all commands over a few connections. Opening a connection per operation exhausts ports and ignores the library’s pipelining. Create the multiplexer once as a singleton:

// Packages: StackExchange.Redis, Microsoft.Azure.StackExchangeRedis, Azure.Identity.
using Azure.Identity;
using StackExchange.Redis;

// Singleton ConnectionMultiplexer -- created once, shared process-wide.
var config = new ConfigurationOptions
{
    EndPoints = { "kv-redis-prod.eastus2.redisenterprise.cache.azure.net:10000" },
    Ssl = true,
    AbortOnConnectFail = false,          // keep retrying instead of throwing at startup
    ConnectRetry = 5,                    // initial connect attempts
    ConnectTimeout = 15000,              // milliseconds
    KeepAlive = 30,                      // seconds between keep-alive pings
    ReconnectRetryPolicy = new ExponentialRetry(5000)
};
// Token auth (Entra ID) instead of an access key:
await config.ConfigureForAzureWithTokenCredentialAsync(new DefaultAzureCredential());

var muxer = await ConnectionMultiplexer.ConnectAsync(config);

The non-obvious settings that matter on Azure, enumerated:

Setting (StackExchange.Redis) Default Recommended on Azure Why it matters
AbortOnConnectFail true false true throws permanently if first connect fails (e.g. maintenance) and never recovers
Ssl false true Server is TLS-only; plaintext is rejected
ConnectRetry 3 5 Initial connect attempts before giving up
ConnectTimeout 5000 ms 15000 ms Cross-region/private-link first connect can be slow
KeepAlive 60 s 30 s Detects dead sockets sooner (idle LB timeouts)
ReconnectRetryPolicy linear ExponentialRetry(5000) Backoff instead of hammering during an outage
SyncTimeout 5000 ms tune to p99 Too low → false RedisTimeoutException under load
AsyncTimeout 5000 ms tune to p99 Same, for async paths
allowAdmin false false Keep off unless you run admin commands

The same resilience settings carry over to other clients. In redis-py they look like this:

# redis-py against the Enterprise (proxy) policy -- a single endpoint, TLS, retry on timeout
import os

from redis import Redis
from redis.retry import Retry
from redis.backoff import ExponentialBackoff
from redis.exceptions import ConnectionError, TimeoutError

r = Redis(
    host="kv-redis-prod.eastus2.redisenterprise.cache.azure.net",
    port=10000, ssl=True,
    # Assumed env var name; populate it from Key Vault at runtime, not from app config.
    password=os.environ["REDIS_ACCESS_KEY"],
    socket_timeout=5, socket_connect_timeout=5,
    retry=Retry(ExponentialBackoff(cap=2, base=0.1), retries=3),
    retry_on_error=[ConnectionError, TimeoutError],
    health_check_interval=30,
)

health_check_interval makes redis-py send a PING before a command whenever the connection has been idle longer than the interval, so a connection that was silently dropped (by a node move or an Azure load-balancer idle timeout) is detected and rebuilt before the real command goes out on the dead socket. Without it, the first request after an idle period eats the failure.

Client library cluster support matrix

Library Cluster-aware mode (OSS policy) Follows MOVED/ASK Entra token auth Notes
StackExchange.Redis (.NET) Yes (auto on cluster) Yes Yes (ConfigureForAzure…) Use a singleton multiplexer
Lettuce (Java) Yes (RedisClusterClient) Yes Via token credential Reactive + async; topology refresh
Jedis (Java) Yes (JedisCluster) Yes Manual token plumbing Pool sizing matters
redis-py (Python) Yes (RedisCluster) Yes Via token provider health_check_interval is key
go-redis (Go) Yes (ClusterClient) Yes Via credential hook Routing + read-from-replica options
node-redis / ioredis Yes (ioredis cluster) Yes Via token ioredis preferred for cluster

Retry policy design

Exception | Retry? | Backoff | Cap | Idempotency concern
RedisConnectionException | Yes | Exponential + jitter | 3–5 tries | Reconnect; operation may not have run
RedisTimeoutException | Yes (bounded) | Exponential + jitter | 2–3 tries | A timed-out write may have applied
RedisServerException (OOM) | No | - | - | Fix capacity/eviction, not retry
MOVED/ASK | Client-internal | - | - | Should never reach app code
CROSSSLOT | No | - | - | Fix keyspace (hash tags), not retry
NOAUTH/auth error | No (refresh token) | - | - | Refresh credential, then reconnect
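The retry rules in the table can be sketched as a small operation-level wrapper. This is a minimal illustration, not any client library's built-in policy: TransientError and PermanentError are hypothetical stand-ins for the connection/timeout and OOM/CROSSSLOT classes, and the base, cap, and retry count are assumed values.

```python
import random
import time


class TransientError(Exception):
    """Stand-in (hypothetical) for connection or timeout errors: worth retrying."""


class PermanentError(Exception):
    """Stand-in (hypothetical) for OOM or CROSSSLOT: retrying cannot help."""


def backoff_delays(retries, base=0.1, cap=2.0, rng=random.random):
    """Full-jitter exponential backoff: a random delay below min(cap, base * 2**attempt)."""
    return [rng() * min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def call_with_retry(op, retries=3, base=0.1, cap=2.0, sleep=time.sleep):
    """Run op(); retry only TransientError, at most `retries` extra times."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return op()
        except TransientError:
            if attempt == retries:
                raise  # retry budget exhausted: surface the error
            sleep(delays[attempt])
        # PermanentError is deliberately not caught, so it propagates on the first failure.
```

Because a timed-out write may already have applied, a wrapper like this is only safe around operations that are idempotent or carry an idempotency key.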

Scaling, reshard operations, and zero-downtime maintenance

Enterprise scales two ways: scale up (a bigger SKU – E10 to E20) and scale out (more capacity units, which add shards and rebalance slots). Both are online operations, but “online” assumes a resilient client (previous section).

# Scale up the SKU (more memory/throughput per node)
az redisenterprise update --name kv-redis-prod --resource-group rg-data-prod \
  --sku Enterprise_E20

# Scale out capacity (adds nodes/shards; triggers a reshard/rebalance)
az redisenterprise update --name kv-redis-prod --resource-group rg-data-prod \
  --capacity 4

Scale-up vs scale-out

Dimension Scale up (bigger SKU) Scale out (more capacity)
What changes More RAM/CPU per node More shards; slots rebalance
Operation --sku Enterprise_E20 --capacity 4 (even number)
Fixes OOM, CPU on a single shard Throughput ceiling, larger dataset
Client impact Brief blip per node Reshard; MOVED/ASK (OSS) or proxy-absorbed
Online? Yes (rolling) Yes (rolling)
Limit SKU ceiling per family Capacity must stay even

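What happens during a reshard, and how to survive it: hash slots migrate between shards. Under OSS policy, affected keys briefly answer MOVED/ASK and a cluster-aware client re-routes; under Enterprise policy the proxy absorbs the move. Either way, a client that follows redirects, sets AbortOnConnectFail = false, and retries operations should see a p99 bump rather than errors.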

Idempotency under retried writes

Caches are naturally idempotent for reads; for write paths, a retried operation after a successful-but-unacknowledged write can corrupt data. Map the operation to a safe pattern:

Write operation Retry hazard Safe pattern
INCR counter Double-count on retry Idempotency key, or compute then SET known value
SET k v Safe (same value) Plain SET is idempotent if value is fixed
LPUSH queue item Duplicate item on retry Dedup on consume, or SET-based dedup key
SADD set member Safe (set semantics) SADD is naturally idempotent
HSET h f v Safe (same value) Idempotent for a fixed field/value
INCRBY balance n Double-apply on retry Idempotency key per transaction id
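The INCRBY row is the one that bites in practice, so here is a minimal sketch of the idempotency-key pattern against an in-memory stand-in. FakeStore and the key names are hypothetical; the point is only that a retried call with the same transaction id does not apply twice.

```python
class FakeStore:
    """In-memory stand-in for a Redis database (hypothetical, for illustration only)."""

    def __init__(self):
        self.data = {}

    def incrby(self, key, n):
        self.data[key] = self.data.get(key, 0) + n
        return self.data[key]

    def set_nx(self, key, value):
        """Like SET key value NX: returns True only if the key was newly created."""
        if key in self.data:
            return False
        self.data[key] = value
        return True


def naive_credit(store, account, amount):
    """Plain INCRBY: every call applies, including a retry of an unacknowledged success."""
    return store.incrby(f"balance:{account}", amount)


def idempotent_credit(store, account, amount, txn_id):
    """Claim the transaction id first; only the first claim applies the increment."""
    if store.set_nx(f"txn:{account}:{txn_id}", "applied"):
        store.incrby(f"balance:{account}", amount)
    return store.data.get(f"balance:{account}", 0)
```

Against a real cluster the claim and the increment would have to be atomic (for example a Lua script or MULTI/EXEC over keys that share a hash tag); the sketch ignores that and shows only the retry behavior.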

Events that cause a connection blip

Event Trigger Client-visible effect Mitigation
Scale up (SKU) --sku change Brief reset per node AbortOnConnectFail=false + op retry
Scale out (capacity) --capacity change Reshard; redirects/proxy hop Op retry; cluster-aware client
OS/Redis patch Maintenance window One node reset at a time Low-traffic window; health checks
Node failure Hardware/zone fault Failover to replica shard Idempotent writes; retry
Geo link change Add/remove region Replication catch-up Tolerate brief replication lag

Monitoring memory pressure, evictions, and latency percentiles

Redis fails loudly on CPU and silently on memory. Watch both, and alert on the leading indicators rather than the outage.

The metrics that predict incidents (all available in Azure Monitor for the Enterprise resource):

Metric What it measures Leading indicator of Alert threshold Why
Used Memory Percentage RAM used vs limit OOM (NoEviction) or eviction loss 75% Above ~80% writes fail or keys evict
Evicted Keys Keys removed at memory limit Undersized cache / divergence (geo) > 0 sustained On active-active this is a correctness bug
Expired Keys Keys removed by TTL Normal churn (context for evictions) (baseline) Distinguishes TTL churn from eviction
Server Load % main thread busy CPU-bound cluster 80% A slow KEYS/big MGET stalls everything
Connections Created/sec New conns per second Pool-per-request client bug sustained high Healthy clients reuse a handful
Cache Hit / Miss Read hit ratio Cache too small / wrong TTLs falling hit rate Misses push load to the backend
Total Operations/sec Throughput Approaching shard ceiling near capacity Scale out before saturation
Replication latency (geo) Cross-region lag Mesh bandwidth / region issue rising Stale reads in the lagging region

The metrics to alert on, with the action each alert should trigger:

Alert Condition Severity Immediate action
Memory pressure usedmemorypercentage > 75% for 5m Warning Scale up RAM or scale out shards
OOM imminent usedmemorypercentage > 90% for 1m Critical Scale now; check for runaway keyspace
Eviction (geo) evictedkeys > 0 on active-active DB Critical Size up; eviction = divergence
CPU-bound serverLoad > 80% for 5m Warning Scale out; hunt slow commands
Connection storm connectionscreatedpersecond high sustained Warning Audit client for pool-per-request
Hit ratio drop hit ratio falls > 20% Warning Review TTLs / key sizing

// Memory pressure trend + eviction correlation over the last 24h
AzureMetrics
| where TimeGenerated > ago(24h)
| where ResourceProvider == "MICROSOFT.CACHE"
| where ResourceId contains "kv-redis-prod"
| where MetricName in ("usedmemorypercentage", "evictedkeys", "serverLoad")
| summarize avg(Average), max(Maximum) by MetricName, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
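The alert table's thresholds can also be expressed as a small evaluation function, which is handy for unit-testing alert logic before it goes into Azure Monitor. This is a sketch under assumptions: it evaluates a single sample and ignores the "for 5m" and "for 1m" duration windows, and the function and parameter names are illustrative.

```python
def evaluate_alerts(used_memory_pct, evicted_keys, server_load_pct, is_active_active):
    """Return (severity, alert name) pairs for one metrics sample."""
    alerts = []
    # Memory: the critical threshold takes precedence over the warning.
    if used_memory_pct > 90:
        alerts.append(("Critical", "OOM imminent"))
    elif used_memory_pct > 75:
        alerts.append(("Warning", "Memory pressure"))
    # Any eviction on an active-active database means the regions can diverge.
    if is_active_active and evicted_keys > 0:
        alerts.append(("Critical", "Eviction (geo)"))
    if server_load_pct > 80:
        alerts.append(("Warning", "CPU-bound"))
    return alerts
```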

For latency, do not trust server-side averages – measure client-side percentiles, because an average of 1ms hides a p99 of 200ms caused by a single hot shard or a GC pause in your own process. Track p50/p99 per operation from the application, and correlate p99 spikes against serverLoad and reshard events. A latency cliff that lines up with a scaling operation is your retry policy working; one that does not is a hot key or a cross-slot fan-out.
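To see how an average can hide a tail, here is a minimal client-side percentile sketch using the nearest-rank method. The sample latencies are made up for illustration; in a real application they would come from timing each Redis call.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 98 fast requests and 2 slow ones, in milliseconds (illustrative numbers only).
latencies_ms = [0.8] * 98 + [200.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

With these samples the mean stays in single-digit milliseconds and the median sits at the fast value, while p99 lands on the slow requests. That gap is why the alerting should key on the percentile.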

What a latency spike is telling you, by what it correlates with:

p99 spike correlates with… It’s probably… Confirm with Fix
A scale-out / reshard event Retry policy absorbing a blip Activity log timing vs spike Nothing — working as intended
High serverLoad CPU-bound (slow command / hot shard) serverLoad metric, SLOWLOG Scale out; remove KEYS/big MGET
One cloud_RoleInstance only Hot key on one shard Per-instance client metrics Re-key to spread; add a local cache
Cross-slot fan-out commands Proxy fanning out (Enterprise policy) Command audit Hash-tag keys to single slot
GC pauses in the app Client-side, not Redis App GC logs vs Redis latency Tune app GC / allocation
Rising replication lag (geo) Cross-region bandwidth/incident Geo replication metric Check region health; reduce mesh

Architecture at a glance

The diagram traces the data and control path of a two-region active-active deployment, left to right, and maps each failure class to the exact hop where it bites. Read it as four zones. On the left, clients (App Service / AKS pods) hold a singleton multiplexer and reach the cache over TLS on port 10000 through a Private Endpoint – never the public internet. The endpoint resolves via a Private DNS zone linked to the VNet, which is the hop where a missing zone link silently sends you to the public IP. In the middle, the East US 2 Enterprise cluster terminates the connection: under Enterprise policy a proxy fronts the shards; under OSS policy the client talks to shards directly and must follow MOVED/ASK. The shards enforce the eviction policy (NoEviction here) and write AOF at one-second fsync to local managed disk.

On the right, the West Europe Enterprise cluster is the active-active peer: a full-mesh CRDB link replicates every write both directions, so both regions accept reads and writes with no primary and no failover step. The numbered badges mark the five places this design fails and how you confirm each: a CROSSSLOT from a keyspace that ignored hash tags, an OOM where NoEviction met an undersized cluster, a connection storm from a pool-per-request client, replica divergence from running eviction on a geo database, and a stale-read window from replication lag. The legend narrates each as symptom, the metric or command that confirms it, and the fix. The whole method: localize the symptom to a hop, read the cause, run the named check, apply the fix.

Figure: Two-region active-active Azure Cache for Redis Enterprise architecture. App Service and AKS clients connect over TLS on port 10000 through a Private Endpoint and Private DNS to the East US 2 cluster, which is linked by full-mesh CRDB geo-replication to a West Europe peer; numbered badges mark the five failure points.

Real-world scenario

Paywell, a fictional payments platform, ran a global idempotency cache on Premium with a passive geo-replica: EU writes went to West Europe, a one-way replica fed East US for reads, and failover was a manual DNS swap. The cache held one thing that mattered above all – the answer to “have I already processed this request id?” – and the platform’s entire duplicate-protection guarantee rested on it. Average load was 3,000 ops/sec, the monthly cache spend about ₹95,000, and the team was four engineers.

During a West Europe zone incident the replica was read-only, so for eleven minutes every in-flight payment in the US that needed to check idempotency either blocked on the manual failover or fell back to the database and ran at a fraction of normal throughput. Worse, after failover a handful of duplicate captures slipped through because the idempotency keys written in the US during the gap had not replicated back – the one-way link only flowed EU→US, so US writes during the outage were invisible to West Europe when it recovered. Two customers were double-charged. The post-incident review put it bluntly: passive replication is a DR tool, not an availability tool.

The first instinct was to “add another replica” – which would have solved nothing, because the problem was not node loss, it was that the secondary could not accept writes. The breakthrough was reframing the requirement: a system that must never lose a write across regions has to be active-active and conflict-free by construction. They moved to Enterprise active-active geo-replication across West Europe and East US, modeling idempotency state two ways. The “have I seen this request” check became a CRDT set (SADD seen <id>) so adds in either region converge with observed-remove semantics, and the per-request lock used SET payment:idem:<id> processing NX EX 86400 – string LWW plus NX gives “first writer in either region wins,” which is exactly the duplicate-protection semantic they needed.

They kept AOF at 1s as a restart safety net and sized for NoEviction: idempotency keys carry a 24-hour TTL via SET ... EX, never LRU eviction, so divergence is impossible. They moved both databases behind Private Endpoints with a linked Private DNS zone, switched clients to Entra token auth, and rebuilt the .NET client as a singleton multiplexer with AbortOnConnectFail = false and a Polly retry. The one subtlety they hit in testing: an early version used INCR for a per-merchant attempt counter and double-counted on a retried-after-timeout write; they switched the critical counter to an idempotency-keyed SET of a computed value.

# Idempotency check, region-local, on an active-active CRDB.
# SET NX EX is the primitive: succeeds only if the key is new, with a TTL.
# NX is evaluated against the local replica; string LWW then converges the key across
# regions, giving "first writer wins" once the write has replicated.
SET payment:idem:7f3c-9a21 processing NX EX 86400
# -> OK    (key not yet present in this region: proceed)
# -> nil   (already seen locally or replicated from a peer: this is a duplicate, reject)
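In application code the same check is a one-line gate. The sketch below uses a hypothetical in-memory FakeRedis whose set() mirrors the shape of redis-py's Redis.set (True on success, None when NX blocks the write), so it runs without a server; claim_request is an illustrative name.

```python
import time


class FakeRedis:
    """Minimal in-memory test double that mimics SET with NX and EX (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, name, value, nx=False, ex=None):
        now = self._clock()
        current = self._data.get(name)
        if current is not None and current[1] is not None and current[1] <= now:
            current = None  # the TTL has passed, so treat the key as absent
        if nx and current is not None:
            return None  # NX blocks the write because the key already exists
        expires_at = now + ex if ex is not None else None
        self._data[name] = (value, expires_at)
        return True


def claim_request(client, request_id, ttl_seconds=86400):
    """Return True if this request id is new and the caller should process it."""
    key = f"payment:idem:{request_id}"
    return client.set(key, "processing", nx=True, ex=ttl_seconds) is True
```

A real redis-py client can be passed in place of FakeRedis, since only set(name, value, nx=..., ex=...) is used.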

The measurable result: the next regional zone failure was a non-event – no manual step, p99 unchanged at 1.1 ms, zero duplicate captures. Monthly spend rose to about ₹1,40,000 for the two Enterprise clusters, which the team judged trivial against a single double-capture chargeback plus reputational cost. The lesson on the wall: the duplicate-capture class of bug was designed out by construction, not monitored for.

The incident as a timeline, because the order of moves is the lesson:

Time | Symptom | Action taken | Effect | What it should have been
14:02 | West Europe zone incident | (alert fires) | - | Recognize: secondary is read-only
14:04 | US idempotency checks blocking | Wait on manual DNS failover | Writes stalled in US | Don’t depend on a manual step
14:07 | Throughput fallback to DB | Apps fall back to database | Fraction of normal speed | -
14:13 | Manual failover completes | DNS swapped to US | US can write again | 11 minutes too late
+1 day | Two double-charges found | RCA: US writes never replicated back | One-way link gap exposed | Active-active needed
+1 week | Redesigned | Enterprise active-active + CRDT set | Region loss = non-event | The actual fix
+1 week | Counter double-count in test | INCR retried after timeout | Caught before prod | Idempotency-keyed SET

Advantages and disadvantages

The Enterprise active-active model both removes the regional-write failure class and adds operational discipline you must respect. Weigh it honestly:

Advantages (why Enterprise active-active helps) Disadvantages (why it costs and constrains)
Both regions accept writes; region loss is a non-event with no failover step Two full clusters running everywhere = roughly double the spend
CRDTs resolve conflicts automatically per data type — no custom merge code You must model data as the right CRDT; opaque strings silently lose writes (LWW)
Redis modules (Search, JSON, TimeSeries, Bloom) unlock secondary-index workloads Modules and active-active are Enterprise-only; you can’t get them on Premium
Up to 99.999% SLA on a managed runtime Higher floor cost than Standard/Premium even for small caches
Persistence (AOF 1s) + replication + geo gives layered durability Persistence is not a backup; a bad command replicates to every peer
Enterprise (proxy) policy gives a single endpoint — simple private networking The proxy adds a hop; OSS policy is faster but needs a cluster-aware client and routable node IPs
Online scale-up and scale-out with rolling, one-node maintenance “Online” assumes a resilient client; a naive client still sees errors on reshard
NoEviction + sizing for the full working set keeps geo regions convergent Eviction in a geo group is a correctness bug (divergence), not just a hit-rate dip

The model is right when you genuinely need multi-region writes, conflict-free convergence, or Redis modules, and you can size for the full working set under NoEviction. It is the wrong tool for a cheap, regenerable, single-region read cache – that is what Standard or Premium are for. The disadvantages are all manageable, but only if you respect them: model the data type deliberately, size for the working set, lock down the network, and make the client resilient.

Hands-on lab

Stand up an Enterprise cache, prove the clustering and persistence behavior, and see the additive counter behavior that active-active convergence builds on – then tear it down. Enterprise is not free-tier, so this lab uses the smallest Enterprise SKU and deletes everything at the end; budget a small hourly charge while it runs. Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

RG=rg-redis-lab
LOC=eastus2
CLUSTER=kv-redis-lab-$RANDOM   # cluster name must be unique
az group create -n $RG -l $LOC -o table

Step 2 — Create the smallest Enterprise cluster (zone-redundant).

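# --no-database stops the CLI from auto-creating a default database,
# so Step 3 can create one with explicit settings.
az redisenterprise create \
  -n $CLUSTER -g $RG -l $LOC \
  --sku Enterprise_E5 --capacity 2 --zones 1 2 3 --no-database -o table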

Expected: a cluster row with provisioningState: Succeeded (this takes several minutes).

Step 3 — Create the database with Enterprise (proxy) policy, NoEviction, AOF.

az redisenterprise database create \
  --cluster-name $CLUSTER -g $RG \
  --client-protocol Encrypted \
  --clustering-policy EnterpriseCluster \
  --eviction-policy NoEviction \
  --persistence aof-enabled=true aof-frequency=1s -o table

Step 4 — Get the host and access key, then connect with TLS on port 10000.

HOST=$(az redisenterprise show -n $CLUSTER -g $RG --query hostName -o tsv)
KEY=$(az redisenterprise database list-keys --cluster-name $CLUSTER -g $RG \
  --query primaryKey -o tsv)

redis-cli -h $HOST -p 10000 --tls -a "$KEY" PING
# -> PONG

Step 5 — Prove persistence is on and eviction is off.

redis-cli -h $HOST -p 10000 --tls -a "$KEY" CONFIG GET appendonly   # -> appendonly yes
redis-cli -h $HOST -p 10000 --tls -a "$KEY" CONFIG GET maxmemory-policy  # -> noeviction

Step 6 — Prove the slot contract (under Enterprise policy the proxy fans out simple MSET, but transactions still need one slot). Demonstrate hash-tag co-location:

# Co-located keys via hash tag -- guaranteed one slot, safe for MULTI/EXEC and Lua
redis-cli -h $HOST -p 10000 --tls -a "$KEY" MSET 'u:{t1}:a' 1 'u:{t1}:b' 2
redis-cli -h $HOST -p 10000 --tls -a "$KEY" MGET 'u:{t1}:a' 'u:{t1}:b'   # -> 1, 2
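Why the two keys above are guaranteed to share a slot can be checked offline. The sketch below follows the Redis Cluster key-slot rule (CRC16 of the key, or of the first non-empty {...} section, modulo 16384) using the standard library's binascii.crc_hqx; the function name hash_slot is illustrative.

```python
from binascii import crc_hqx

SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def hash_slot(key: str) -> int:
    """Compute the cluster hash slot for a key, honoring hash tags."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first "{" and the next "}" is hashed.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return crc_hqx(data, 0) % SLOTS


same_slot = hash_slot("u:{t1}:a") == hash_slot("u:{t1}:b")
```

Here same_slot is True because both keys hash only the tag t1. Two keys without a shared tag land wherever their full names hash, and nothing guarantees those slots match.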

Step 7 — Counter additive behavior (single region here; in active-active this is what converges).

redis-cli -h $HOST -p 10000 --tls -a "$KEY" INCR global:signups
redis-cli -h $HOST -p 10000 --tls -a "$KEY" INCR global:signups
redis-cli -h $HOST -p 10000 --tls -a "$KEY" GET  global:signups   # -> 2

Step 8 — Teardown (stop the meter).

az group delete -n $RG --yes --no-wait

What each lab step proves, mapped to a section above:

Step Proves Section it validates
2 Enterprise is a distinct resource; even capacity; zones at create Tier selection
3 Policy/eviction/persistence chosen at DB creation Clustering, Persistence
4 TLS-only on port 10000 Networking & TLS
5 AOF on, NoEviction set Persistence
6 Hash tags co-locate keys for multi-key safety Clustering
7 Counters are additive (CRDT convergence basis) Geo-replication
8 Clean teardown stops the charge Cost

To make this a real active-active test, repeat steps 2–4 in a second region with a shared --group-nickname and mutual --linked-databases, then INCR global:signups once in each region and GET from either – the value converges to the sum, not a lost update. That is the closest thing to a real cross-region failover you can run on demand.
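The "sum, not a lost update" behavior can be illustrated without a second region. The sketch below is a textbook grow-only counter next to a last-write-wins register; it is a toy model of the convergence idea, not a description of how Redis Enterprise implements its counters internally, and all names are illustrative.

```python
class RegionCounter:
    """Toy grow-only counter replica: one slot per region, merged by element-wise max."""

    def __init__(self, region):
        self.region = region
        self.counts = {}

    def incr(self, n=1):
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other):
        """Absorb another replica's state; safe to repeat and order-independent."""
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())


def lww_merge(a, b):
    """Last-write-wins register: each write is (timestamp, value); the later one survives."""
    return a if a[0] >= b[0] else b


east, west = RegionCounter("eastus2"), RegionCounter("westeurope")
east.incr()          # one signup recorded in each region, concurrently
west.incr()
east.merge(west)
west.merge(east)

# The same two writes modeled as string SETs of "1": only one survives.
lww_result = lww_merge((1, 1), (2, 1))
```

After the merges both replicas report 2, while the LWW register still holds 1. That difference is the reason counts are modeled with INCR and not with a string SET.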

Common mistakes & troubleshooting

Eleven real failure modes, each as symptom → root cause → how to confirm → fix. This is the playbook to keep open at 02:14.

# Symptom Root cause Confirm (exact cmd / metric) Fix
1 CROSSSLOT Keys ... don't hash to the same slot Multi-key cmd / txn across slots Read the exception; check key names for {} Hash-tag co-accessed keys: k:{tag}:...
2 MOVED 1234 10.0.0.5:10000 reaches app code Cluster-unaware client on OSS policy Client config has no cluster mode Enable cluster mode (RedisCluster/ClusterClient)
3 OOM command not allowed when used memory > maxmemory NoEviction + undersized cluster usedmemorypercentage near 100% Scale up RAM or scale out shards
4 Keys vanish unexpectedly in a geo group LRU eviction on active-active DB evictedkeys > 0; policy = allkeys-lru Set NoEviction; size for full working set
5 Thousands of connectionscreatedpersecond Client opens a connection per request The metric is high and sustained Use a singleton multiplexer; reuse it
6 App throws on startup during maintenance, never recovers AbortOnConnectFail = true Multiplexer config default Set AbortOnConnectFail = false
7 Connect times out from in-VNet client Private DNS zone not linked → public IP nslookup <fqdn> returns public IP Link privatelink.redisenterprise... zone to VNet
8 Connect refused / TLS handshake error Wrong port (6380) or Ssl=false Client port/SSL settings Use port 10000, Ssl=true
9 Concurrent cross-region writes lose data Modeled as SET string (LWW) Two regions, same key, one value survives Model as counter/set/hash CRDT
10 Counter over-counts after a timeout INCR retried after unacked success Retry on RedisTimeoutException + INCR Idempotency key, or compute then SET
11 One region serves stale reads Cross-region replication lag Geo replication latency metric rising Check region health; tolerate or reduce mesh

Decision table: which failure am I looking at?

If you see… It’s probably… Do this first
CROSSSLOT in the exception Keyspace ignores hash tags Add {tag} to co-accessed keys
MOVED/ASK in app logs Client not in cluster mode Turn on cluster mode
OOM command not allowed Memory full + NoEviction Scale up/out; check runaway keyspace
evictedkeys > 0 on a geo DB Eviction enabled in active-active Switch to NoEviction
Conns/sec spiking Pool-per-request client Multiplex one connection
Permanent failure after a blip AbortOnConnectFail = true Flip it to false
In-VNet timeouts DNS resolves to public IP Link the Private DNS zone
Data loss across regions Wrong CRDT (string LWW) Re-model as counter/set/hash
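Part of the decision table keys off the leading word of the Redis error reply, which makes it easy to encode for log triage. A minimal sketch follows; the rule list copies the table's wording, and the function name triage is an assumption.

```python
# (error prefix, probable cause, first action), taken from the decision table.
RULES = [
    ("CROSSSLOT", "Keyspace ignores hash tags", "Add {tag} to co-accessed keys"),
    ("MOVED", "Client not in cluster mode", "Turn on cluster mode"),
    ("ASK", "Client not in cluster mode", "Turn on cluster mode"),
    ("OOM command not allowed", "Memory full + NoEviction",
     "Scale up/out; check runaway keyspace"),
]


def triage(error_message):
    """Map a Redis error reply to (probable cause, first action), or None if unknown."""
    text = error_message.strip()
    for prefix, cause, action in RULES:
        if text.startswith(prefix):
            return cause, action
    return None
```

Rows in the table that are identified by a metric and not by an error string (connections per second, evicted keys) belong in alert rules, so they are left out here.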

The error/limit reference

Error / limit Meaning Likely cause How to confirm Fix
CROSSSLOT Keys span multiple hash slots No hash tag on multi-key op Exception text + key names Hash-tag the keys
MOVED <slot> <ip:port> Slot owned by another node Cluster-unaware client (OSS) Appears in app logs Enable cluster mode
ASK <slot> <ip:port> Slot mid-migration Reshard in progress During scale-out Client follows redirect (auto)
OOM command not allowed At memory limit, NoEviction Undersized cluster usedmemorypercentage ~100% Scale RAM/shards
NOAUTH / auth required Missing/expired credential Stale key or expired token Auth response Refresh token / correct key
WRONGTYPE Op on wrong data type Key reused as different type TYPE <key> Use the right type / re-key
READONLY Write to a read-only target Passive replica (Premium) Topology / replica role Write to primary; or go active-active
Capacity must be even Enterprise nodes deploy in pairs Odd --capacity value CLI rejects the value Use an even number
Port 6380 vs 10000 Wrong TLS port Premium connection string reused Client port setting Use 10000 for Enterprise
Max 16384 slots Hard cluster slot count (design constant) Design keyspace within it

Best practices

Production-grade rules, learned the hard way:

# Rule Why
1 Choose the tier from durability/topology, not memory size Enterprise is a different runtime, not a bigger Premium
2 Decide the clustering policy at design time; it is permanent Changing it later means recreating the database
3 Hash-tag every set of co-accessed keys Keeps transactions, Lua, MGET/MSET single-slot
4 Use active-active (not passive) when both regions must write Passive is DR; active-active is availability
5 Model write data as the right CRDT Strings are LWW and silently lose concurrent writes
6 Run geo databases as NoEviction, sized for the full working set Eviction in a mesh = divergence, a correctness bug
7 Enable AOF 1s for non-regenerable data Caps restart loss at ~1 second
8 Treat persistence as restart safety, not a backup A bad command replicates to every peer
9 Lock the cache behind a Private Endpoint + linked Private DNS No public exposure; in-VNet resolves privately
10 Prefer Entra token auth; keep keys in Key Vault, rotated Removes the long-lived shared secret
11 Use a singleton multiplexer with AbortOnConnectFail = false Prevents connection storms and permanent-failure-after-blip
12 Retry the operation with jittered backoff, idempotently Survives reshard/patch blips without corrupting writes
13 Set a maintenance window in low-traffic hours Rolling patches reset one node at a time
14 Alert on used-memory % (75%), eviction, server load (80%), conns/sec Catch the leading indicator, not the outage
15 Measure client-side p99, correlate to reshard/server-load Server averages hide hot keys and GC pauses

Security notes

The cache often holds the most sensitive transient data you have – session tokens, idempotency keys, PII in flight. Lock it down on every axis:

Control Setting / action Why
Encryption in transit clientProtocol: Encrypted, TLS 1.2+ on port 10000 No plaintext; reject unencrypted clients
Network isolation Private Endpoint + disable public network access Cache reachable only from the VNet
Private name resolution Linked privatelink.redisenterprise... zone FQDN resolves to the private IP, not public
Identity-based auth Microsoft Entra ID token (managed identity) No long-lived shared secret in the app
Secret handling Access key in Key Vault, rotated; never in app config Limits blast radius if a config leaks
Least privilege Scope the managed identity / RBAC to read what it needs Avoid over-broad data-plane access
Data minimization Short TTLs on sensitive keys (SET ... EX) Sensitive data self-expires
Audit & logging Diagnostic settings to a Log Analytics workspace Trace connections, auth, config changes
Defense vs bad commands Restrict/disable dangerous admin commands FLUSHALL/KEYS blast radius
Geo data residency Choose peer regions with compliance in mind Writes replicate to every mesh region

Least-privilege auth options, ranked from most to least secure:

Approach Secret exposure Rotation Recommendation
Entra token via managed identity None (short-lived) Automatic Preferred where the client supports it
Access key from Key Vault at runtime Low (never on disk in app) Scheduled rotation Acceptable fallback
Access key in environment/app settings Medium (visible in config) Often forgotten Avoid in production
Access key in source/connection string literal High (leaks via VCS) Never Never

Cost & sizing

Enterprise costs more than Standard/Premium because you are paying for a commercial runtime, and active-active doubles the footprint because both regions run full clusters. The bill is driven by SKU (RAM/CPU per node), capacity (number of nodes/shards), the number of regions in the mesh, and cross-region egress for replication.

What drives the bill, and how to control each lever:

Cost driver | Scales with | Control it by | Note
SKU (E5…E100, F300…) | RAM/CPU per node | Right-size to the working set | Bigger SKU = higher hourly rate
Capacity (node count) | Shards / throughput | Scale out only when needed | Must be even; each pair adds cost
Regions in geo mesh | N clusters running | Keep the mesh to needed regions | Each region is a full cluster
Cross-region egress | Write volume × peers | Reduce mesh; batch where possible | Every write fans out to N-1 peers
Persistence disk | Dataset size | (managed) | Local managed disk, modest
Flash NVMe (Flash SKUs) | Cold-tier size | Use Flash for large skewed data | Cheaper per GB than all-RAM

Right-sizing approach:

Question Method Action
How big is the working set? Sum key sizes × count at peak, + headroom Pick the SKU whose RAM covers it under NoEviction
RAM or Flash? Is access skewed (hot/cold)? Skewed + large → Flash; uniform/hot → RAM
Up or out? CPU-bound (serverLoad) vs memory-bound High server load → out (shards); OOM → up (RAM)
How many regions? Where must writes happen? Only regions that must accept writes
Headroom target Alert at 75% used memory Size so steady state sits below that
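The first and third rows reduce to simple arithmetic. The sketch below assumes avg_entry_bytes already includes per-key overhead and uses the 75% and 80% thresholds from the tables; the function names and the example numbers are illustrative, and the result still has to be matched to a SKU by hand against the current SKU list.

```python
def required_ram_gib(key_count, avg_entry_bytes, target_used_fraction=0.75):
    """RAM needed so the working set sits at or below the target used-memory fraction."""
    working_set_gib = key_count * avg_entry_bytes / 1024 ** 3
    return working_set_gib / target_used_fraction


def scale_direction(server_load_pct, used_memory_pct):
    """Apply the 'Up or out?' row: CPU pressure scales out, memory pressure scales up."""
    moves = []
    if server_load_pct > 80:
        moves.append("out")   # more shards spread the CPU load
    if used_memory_pct > 75:
        moves.append("up")    # more RAM per node relieves memory pressure
    return moves


# Example: 50 million idempotency keys at roughly 512 bytes each (assumed figures).
needed = required_ram_gib(50_000_000, 512)
```

For the assumed figures the working set is just under 24 GiB, so keeping steady state below 75% calls for a little under 32 GiB of usable memory.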

Rough figures (list-price ballpark, varies by region and commitment – always confirm with the pricing calculator):

Scenario Approx monthly (USD) Approx monthly (INR) Notes
Single Enterprise E5, 2 nodes, one region ~$600–900 ~₹50,000–75,000 Smallest prod Enterprise footprint
Single E10, 2 nodes, one region ~$1,200–1,800 ~₹1,00,000–1,50,000 Common single-region prod size
Active-active E10 × 2 regions ~$2,400–3,600 + egress ~₹2,00,000–3,00,000 + egress Double clusters + cross-region egress
Enterprise Flash F300, 3 nodes (Flash capacity starts at 3) ~$900–1,400 ~₹75,000–1,15,000 Large dataset, cheaper per GB
Standard C1 (contrast) ~$40–60 ~₹3,500–5,000 No persistence/clustering/geo

There is no free tier for Enterprise. For dev and learning, use a Standard C0/C1 (which is cheap) to practice client patterns, and reserve Enterprise spend for the features that require it (modules, active-active). Commit to a reservation once steady-state size is known to cut the hourly rate.

Interview & exam questions

Maps to the AZ-204 (developing solutions), AZ-305 (designing infrastructure), and Redis-specific knowledge expected of senior Azure roles.

1. Why is Azure Cache for Redis Enterprise a different ARM resource type from Premium, and why does it matter? Enterprise/Enterprise Flash use Microsoft.Cache/redisEnterprise (a parent cluster + child database), running the commercial Redis Enterprise runtime, while Basic/Standard/Premium use Microsoft.Cache/redis running OSS Redis. It matters because IaC modules, endpoints, ports (10000 vs 6380), and capabilities (modules, active-active) differ; a Bicep/Terraform module for one type does nothing for the other.

2. Contrast OSS and Enterprise clustering policies. OSS exposes the native Redis Cluster API – the client discovers shards, computes CRC16 hash slots, and connects directly (lowest latency, needs a cluster-aware client and routable node IPs). Enterprise puts a proxy in front so the client uses one endpoint like a standalone Redis (simplest networking, any client, one extra hop). The policy is chosen at creation and is permanent.

3. What is a CROSSSLOT error and how do you prevent it? A multi-key command (or MULTI/EXEC/Lua) whose keys hash to different slots. Prevent it by co-locating keys with a hash tag – the {...} substring is what’s hashed – so all co-accessed keys (e.g. t:{tenant}:...) land in one slot. Even the Enterprise proxy requires single-slot keys for transactions and Lua.

4. Difference between Premium passive geo-replica and Enterprise active geo-replication? Passive is a one-way link to a read-only secondary with manual failover – a DR tool. Active-active is a full-mesh CRDB where every region accepts reads and writes with automatic CRDT conflict resolution and no failover step – an availability tool. If both regions must accept writes, you need Enterprise active-active.

5. How do CRDTs resolve concurrent writes, and why can’t you use a string counter in active-active? Each data type is reimplemented as a CRDT: strings are last-write-wins, counters are additive (both increments apply), sets/hashes merge per element/field. A string SET n for a count is LWW, so concurrent increments in two regions lose updates; use INCR (additive) instead.

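6. Why model an idempotency check as a CRDT set, and how does SET NX EX behave across regions? A set’s observed-remove semantics make concurrent adds of the same id converge cleanly. SET key val NX EX succeeds only if the key does not yet exist in the region that receives the command. The NX check runs against local state, so two regions can both succeed before replication catches up, after which string LWW keeps a single value. That is “first writer wins” within a region but only best-effort duplicate protection across regions, so handlers should still tolerate a rare cross-region duplicate.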

7. RDB vs AOF – when do you choose each, and what’s the data-loss window? RDB snapshots periodically (loss up to the interval, e.g. 1h) – cheap, fast restart, good for regenerable caches. AOF logs every write; at 1s fsync, worst-case loss is ~1 second – the default for data you can’t recreate. Both protect against restart, not against bad commands.

8. Why is NoEviction mandatory for an active-active geo cache? Evictions are local to each region, so an LRU eviction in one region but not another silently diverges the dataset – a correctness bug, not just a hit-rate dip. Run geo databases as NoEviction and size for the full working set.

9. Name three client configurations that prevent Redis outages on Azure. A singleton multiplexer (not pool-per-request) to avoid connection storms; AbortOnConnectFail = false so the client recovers after a maintenance blip instead of throwing permanently; and an operation-level retry with jittered backoff so reshard/patch blips don’t surface as errors. Add periodic health checks to detect dead idle sockets.

10. What happens during a scale-out reshard, and how do you make it invisible? Hash slots migrate between shards. Under the OSS policy, nodes briefly answer with ASK/MOVED redirects for migrating slots and a cluster-aware client re-routes; under the Enterprise policy, the proxy absorbs the move. Make it invisible with a resilient client (redirect-following, AbortOnConnectFail=false, op retries) and validate with a load test through a live scale-out, expecting zero errors and only a p99 bump.

11. Why measure client-side latency percentiles instead of server averages? A server-side average of 1 ms hides a p99 of 200 ms from a hot shard, a cross-slot fan-out, or a client GC pause. Track p50/p99 per operation from the app and correlate spikes with serverLoad and reshard events to tell “retry working” from “real hot key.”

12. How do you secure an Enterprise cache end to end? TLS-only (clientProtocol: Encrypted) on port 10000; Private Endpoint with public access disabled and a linked Private DNS zone; Entra token auth via managed identity (key in Key Vault, rotated, as fallback); short TTLs on sensitive keys; diagnostic logs to Log Analytics; and compliance-aware region choice since writes replicate to every mesh peer.
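
The operation-level retry with jittered backoff from questions 9 and 10 can be sketched in a few lines. This is a minimal, library-agnostic illustration in Python rather than the StackExchange.Redis configuration itself: `TransientRedisError`, `retry_with_jitter`, and the delay values are hypothetical names and numbers chosen for the example, and a real client would catch its own connection and timeout exception types instead.

```python
import random
import time


class TransientRedisError(Exception):
    """Hypothetical stand-in for a client's connection/timeout error types."""


def retry_with_jitter(op, attempts=5, base_delay=0.05, max_delay=1.0,
                      sleep=time.sleep, rng=random.random):
    """Run op(); on a transient error, wait a randomized delay and retry.

    Uses "full jitter": each wait is uniform in
    [0, min(max_delay, base_delay * 2**attempt)], so many clients recovering
    from the same blip do not retry in lockstep.
    """
    for attempt in range(attempts):
        try:
            return op()
        except TransientRedisError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)


# Usage: an operation that fails twice (as during a reshard), then succeeds.
calls = {"n": 0}


def flaky_get():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientRedisError("connection reset")
    return "value"


waits = []  # record the delays instead of actually sleeping
result = retry_with_jitter(flaky_get, sleep=waits.append)
print(result, len(waits))
```

Only retry operations that are safe to repeat. A retried INCR that actually succeeded server-side before the connection dropped would count twice, which is one reason idempotency keys matter.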

Quick check

  1. Which ARM resource type backs the Enterprise tier, and what is the default TLS port?
  2. You need both East US and West Europe to accept writes to the same key with no failover step. Which tier and replication mode?
  3. A MULTI/EXEC transaction throws CROSSSLOT. What single keyspace change fixes it?
  4. Your active-active counter is losing increments across regions. What’s the likely modeling mistake?
  5. Your app throws permanently the first time a maintenance window blips the connection. Which one client setting fixes it?

Answers

  1. Microsoft.Cache/redisEnterprise (a parent cluster + child database), and the default TLS port is 10000 (Premium uses 6380).
  2. Enterprise (or Enterprise Flash) with active-active geo-replication (a CRDB) – both regions accept reads and writes with automatic CRDT conflict resolution and no failover.
  3. Add a hash tag so every key in the transaction shares the {...} substring (e.g. k:{order123}:...), forcing them into one hash slot.
  4. The counter is modeled as a string SET (last-write-wins), so concurrent writes lose updates. Use INCR (additive CRDT) instead.
  5. Set AbortOnConnectFail = false so the multiplexer reconnects in the background instead of throwing permanently after the first failed connect.
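
Answer 4 is easier to believe once you watch the update disappear. The sketch below is a toy model, not Redis Enterprise's actual CRDT implementation: `LwwRegister` imitates a last-write-wins string with explicit timestamps, and `GCounter` is the textbook grow-only counter (the real counter type also handles decrements). Both class names and the region labels are assumptions made for illustration.

```python
class LwwRegister:
    """Last-write-wins string: merge keeps the value with the newest timestamp."""

    def __init__(self):
        self.value, self.ts = 0, 0

    def set(self, value, ts):
        self.value, self.ts = value, ts

    def merge(self, other):
        if other.ts > self.ts:
            self.value, self.ts = other.value, other.ts


class GCounter:
    """Grow-only counter: each region tracks its own increments, and merge
    takes the per-region maximum, so no increment is ever overwritten."""

    def __init__(self, region):
        self.region = region
        self.counts = {}

    def incr(self, n=1):
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other):
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    @property
    def value(self):
        return sum(self.counts.values())


# String modeled as GET-then-SET: both regions read 0 and write 1 concurrently.
east_s, west_s = LwwRegister(), LwwRegister()
east_s.set(east_s.value + 1, ts=1)
west_s.set(west_s.value + 1, ts=2)
east_s.merge(west_s)
west_s.merge(east_s)

# Counter modeled as INCR: each region increments its own slot.
east_c, west_c = GCounter("eastus"), GCounter("westeurope")
east_c.incr()
west_c.incr()
east_c.merge(west_c)
west_c.merge(east_c)

print("LWW string after two increments:", east_s.value)
print("CRDT counter after two increments:", east_c.value)
```

Both replicas converge in each case, which is the point: convergence alone does not mean correctness. The string converges to a value that has lost one increment, while the counter converges to the sum.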

Glossary

| Term | Definition |
| --- | --- |
| Enterprise tier | Azure Cache for Redis built on the commercial Redis Enterprise runtime (Microsoft.Cache/redisEnterprise), adding modules, active-active geo-replication, and up to 99.999% SLA. |
| Enterprise Flash | An Enterprise SKU family that keeps hot keys in RAM and tiers colder values to local NVMe for cheaper large-dataset storage. |
| Clustering policy | The permanent choice of OSS (native cluster API, direct-to-shard) or Enterprise (proxy, single endpoint) routing, set at database creation. |
| Hash slot | One of 16384 CRC16 buckets across which keys are distributed in a cluster; multi-key commands require all keys in one slot. |
| Hash tag | The substring inside the first {} in a key name; only it is hashed, so keys sharing a tag co-locate in one slot. |
| CROSSSLOT | The error when a multi-key command, transaction, or Lua script spans more than one hash slot. |
| MOVED / ASK | Cluster redirects telling a client which node owns a slot (MOVED) or that a slot is mid-migration (ASK); a cluster-aware client follows them transparently. |
| CRDB | Conflict-free replicated database – the active-active, full-mesh, multi-write database Enterprise builds across regions. |
| CRDT | Conflict-free replicated data type; each Redis type (string/counter/set/hash/sorted set) merges concurrent writes deterministically. |
| Active-active | A topology where every region accepts reads and writes with no primary; region loss requires no failover. |
| Passive geo-replica | A Premium one-way link to a read-mostly secondary with manual failover – a DR tool, not an availability tool. |
| RDB | Point-in-time snapshot persistence; cheap, but loses everything since the last snapshot on a hard failure. |
| AOF | Append-only-file persistence logging every write; at 1s fsync, worst-case loss is ~1 second. |
| NoEviction | The policy that rejects writes (returns OOM) at the memory limit instead of evicting keys; mandatory for geo caches to avoid divergence. |
| Multiplexer | A single long-lived client connection object (e.g. StackExchange.Redis ConnectionMultiplexer) that pipelines all commands. |
| Server Load | The Azure Monitor metric for the percentage of time the Redis main thread is busy; the leading CPU-bound indicator. |
| Private Endpoint | A private IP projection of the cache into your VNet via Private Link, removing public exposure. |
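
The hash slot and hash tag entries describe a calculation small enough to run by hand. The sketch below follows the Redis Cluster key-hashing rule (CRC16, XMODEM variant, modulo 16384, hashing only the content of the first non-empty {...}). The function names `crc16_xmodem` and `hash_slot` and the sample keys are illustrative, and it is meant for checking a key design offline, not as a replacement for a cluster-aware client.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (the XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_slot(key: str) -> int:
    """Return the cluster slot (0..16383) for a key, honoring hash tags."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" means the whole key is hashed.
        if end != -1 and end > start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


# Keys that share a tag land in one slot, so MULTI/EXEC across them is legal.
same_tag = hash_slot("t:{tenant42}:cart") == hash_slot("t:{tenant42}:profile")
# Untagged keys are hashed whole and will usually land in different slots.
untagged = hash_slot("t:tenant42:cart") == hash_slot("t:tenant42:profile")
print("tagged keys co-located:", same_tag)
print("untagged keys co-located:", untagged)
```

Running a candidate keyspace through a function like this before deployment is a cheap way to catch CROSSSLOT errors at design time rather than in production.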

Next steps


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.