Azure Service Bus at Scale: Sessions, Deduplication, and Dead-Letter Handling

Azure Service Bus is the broker you reach for when “fire a message and hope” is no longer acceptable — when you need ordering per customer, no duplicate side effects, and a place for poison messages to land instead of taking down a consumer in a tight retry loop. The primitives that deliver this (sessions, duplicate detection, PeekLock, dead-letter queues) are individually simple and collectively easy to misuse. Get the lock model wrong and you double-process under load; get session affinity wrong and your “ordered” queue silently interleaves; forget the DLQ and a single malformed message stalls a partition for hours while the delivery count climbs.

This guide builds the patterns the way they survive production. Examples use the Azure.Messaging.ServiceBus .NET SDK (the supported successor to Microsoft.Azure.ServiceBus and WindowsAzure.ServiceBus) plus az servicebus CLI and Bicep for provisioning. The concepts map directly to the Java, Python, and JavaScript SDKs — the broker semantics are identical; only the method names change. By the end you will be able to stand up a sessioned, duplicate-detected work queue, drive it from a session processor that holds order without double-processing, and operate a dead-letter re-drive loop that never loses a message — and you will know the exact az query and metric that confirms each guarantee.

Tiers matter. Sessions, duplicate detection, and topics all require the Standard or Premium tier; the Basic tier gives you queues only, with no sessions, no dedup, and no topics. Anything throughput- or latency-sensitive belongs on Premium, which gives dedicated capacity (messaging units), predictable latency, no noisy-neighbour effect, and messages of up to 100 MB (Standard caps a message at 256 KB). This guide assumes Standard at minimum and calls out where Premium changes a limit.

What problem this solves

Without a broker that enforces ordering, dedup, and poison-message isolation, the failures are specific and expensive. Two debits on the same wallet processed concurrently both read the same starting balance and both succeed — an overdraft your ledger can’t explain. A gateway times out, the upstream resends, and you charge a card twice. A consumer crashes mid-process and the message is gone (ReceiveAndDelete) or redelivered forever (a handler slower than its lock). One malformed payload — a schema your deserializer can’t parse — gets abandoned, redelivered, abandoned again, and pins a consumer in a retry loop instead of stepping aside.

These are not theoretical. They are the four incidents every team running async messaging eventually hits, and the reason Service Bus exists rather than a plain queue. The cost of getting it wrong is measured in reconciliation hours, chargebacks, and a 2 a.m. page when the DLQ — which fills silently because nothing alerts on it by default — finally backs up the source entity. Who hits this: any team moving from synchronous request/response to event-driven processing, anyone with a per-key ordering requirement (wallets, devices, aggregates), and anyone whose producers retry (which is all of them, because at-least-once is the default delivery contract). The fix is never “add more consumers” — it is choosing the right primitive for each guarantee and wiring the safety net before the incident, not during it.

To frame the whole field before the deep dive, here is each guarantee this article delivers, the primitive that provides it, and the single most common way teams break it:

| Guarantee you need | Primitive that provides it | Required tier | Most common way it breaks |
| --- | --- | --- | --- |
| Per-key ordering | Sessions (SessionId) | Standard+ | SessionId is a constant (serializes all) or unique-per-message (no ordering) |
| Idempotent enqueue | Duplicate detection (MessageId) | Standard+ | Fresh GUID per send instead of a deterministic business key |
| No message loss on crash | PeekLock receive mode | Basic+ | Used ReceiveAndDelete; or lock expired mid-handler |
| Poison-message isolation | Dead-letter queue + re-drive | Basic+ | DLQ never alerted on; no re-drive processor exists |
| Content-based routing | Topic + subscription filters | Standard+ | Default $Default rule left in place alongside a custom rule |
| Delayed / chained delivery | Scheduled / auto-forward / defer | Standard+ | Deferred message’s SequenceNumber not persisted → leaked |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should be comfortable with the idea of asynchronous, decoupled services — a producer that hands off work and a consumer that processes it on its own clock — and with basic .NET (or your SDK’s language). You need an Azure subscription, the az CLI in Cloud Shell or locally, and the ability to grant a managed identity an RBAC role. Familiarity with at-least-once vs exactly-once delivery semantics helps, as does a passing knowledge of AMQP (the protocol Service Bus speaks over TCP 5671).

This sits in the integration & event-driven track. It is downstream of Message Queues vs Pub/Sub: Choosing an Async Pattern (which frames when to use a queue at all) and pairs tightly with Designing Idempotent APIs and Deduplication for Reliable Distributed Systems — because dedup at the broker is only half of exactly-once; the handler must be idempotent too. If your ordering need is really a long-running workflow, Durable Functions in Production: Orchestrations, Fan-out/Fan-in, and Entity State may be the better tool. For autoscaling consumers by queue depth, see Deploy KEDA for Event-Driven Autoscaling on Kafka and Azure Service Bus Workloads.

A quick map of who owns what when a messaging incident lands, so you escalate to the right person:

| Layer | What lives here | Who usually owns it | Failure classes it causes |
| --- | --- | --- | --- |
| Producer service | MessageId, SessionId, payload, retries | App / dev team | Duplicate enqueue, wrong ordering key, oversized message |
| Namespace (broker) | Tier, messaging units, entities, quotas | Platform team | Throttling (429), entity-full, dedup/session disabled |
| Entity (queue/topic) | Lock duration, max delivery, TTL, filters | App + platform | DLQ growth, redelivery, filter mismatch |
| Consumer service | PeekLock, concurrency, prefetch, idempotency | App / dev team | Double-process, lock loss, poison loops |
| Operations | DLQ alerts, re-drive, metrics, dashboards | SRE / platform | Silent DLQ backup, missed throttling |
| Identity / network | Managed identity, RBAC, Private Endpoint | Security + platform | Unauthorized (401), egress blocked |

Core concepts

Six mental models make every later decision obvious.

A queue is point-to-point; a topic is publish/subscribe. A queue delivers each message to exactly one competing consumer. A topic delivers a copy to every subscription, and each subscription has its own cursor, DLQ, and filters — a subscription is just a queue with a filter in front. The decision is not “which is better”; it is how many independent readers does this message need. One consumer group → queue. Multiple teams reacting independently → topic.

Ordering exists only within a session. Service Bus does not guarantee global FIFO on a plain queue — competing consumers and redelivery break it. A session is a logical group identified by the SessionId on each message; all messages sharing a SessionId are delivered in order, to one consumer at a time, holding an exclusive session lock. Ordering is per-session, and concurrency scales with the number of active sessions, not the message count.

At-least-once is the floor, and dedup raises the enqueue to exactly-once-inside-a-window. A producer that times out and retries can enqueue the same logical message twice. Duplicate detection drops any message whose MessageId the entity has already seen within the configured window. That makes the enqueue idempotent; it does nothing for the consumer side, which can still see redelivery via PeekLock. “Exactly-once-ish” is the honest framing: exactly-once enqueue, at-least-once delivery, so the handler must also be idempotent.

PeekLock is a lease, not a removal. ReceiveAndDelete removes a message the instant it is delivered — fastest, zero redelivery, total loss on a crash. PeekLock (the default) leases the message with a time-bound lock; you then Complete, Abandon, DeadLetter, or Defer. If the lock expires before you act, the message is redelivered and its delivery count increments. Lock duration maxes at 5 minutes — long handlers must renew.

The dead-letter queue is a real sub-queue, and it does not empty itself. Every entity has a system sub-queue at <entity>/$DeadLetterQueue. Messages land there for exceeding max delivery count, expiring (if dead-lettering on expiration is enabled), hitting a subscription filter evaluation error, or by your handler’s explicit DeadLetter call. A message that simply does not match a filter is not dead-lettered; it is never copied into that subscription. The DLQ has its own depth and does not auto-expire by default; a silently filling DLQ is one of the most common Service Bus incidents.

Several immutable choices are made at creation. requiresSession, requiresDuplicateDetection, and enablePartitioning cannot be toggled on an existing entity — you create a new one and migrate. Decide them up front, in code, reviewed.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters |
| --- | --- | --- | --- |
| Namespace | The container + capacity unit (a tier, a name, MUs) | Resource group | Tier decides what features exist at all |
| Queue | Point-to-point entity, one consumer per message | In the namespace | The default work-distribution primitive |
| Topic / subscription | Pub/sub: one publish, N independent subscriber copies | In the namespace | Fan-out to many readers |
| Session | Ordered message group keyed by SessionId | Property on a message | The only ordering guarantee |
| SessionId | The ordering boundary (e.g. CustomerId) | Set by the producer | Wrong value = no order or no parallelism |
| MessageId | Identity used by duplicate detection | Set by the producer | Must be deterministic, not a GUID |
| PeekLock | Lease-then-settle receive mode | Receiver option | Safe delivery; the lock can expire |
| Lock duration | How long the lease lasts (max 5 min) | Entity setting | Too long = slow crash recovery |
| Delivery count | Times a message was delivered | Per message, server-side | Hits MaxDeliveryCount → DLQ |
| Dead-letter queue | $DeadLetterQueue sub-queue for poison/expired | Per entity | Fills silently if not alerted |
| Messaging unit (MU) | Premium’s isolated capacity slice (1–16) | Namespace (Premium) | Exceed it → throttling, not failure |
| Auto-forward | Server-side chaining of one entity to another | Entity setting | Build pipelines with no consumer code |

Queues vs topics/subscriptions: choose the fan-out first

A queue is point-to-point: many senders, many competing consumers, each message delivered to exactly one consumer. A topic is publish/subscribe: senders publish once, and every subscription gets its own independent copy with its own cursor, DLQ, and filters.

The decision is not “which is better” — it is how many independent readers does this message need.

| Need | Use |
| --- | --- |
| One logical consumer group competing on work | Queue |
| Multiple teams/services react to the same event independently | Topic + subscriptions |
| Routing the same event differently by content | Topic with SQL/correlation filters per subscription |
| Per-key ordering | Either: enable sessions on the queue or subscription |
| A single ingestion endpoint that fans out server-side | Topic with auto-forward to per-team queues |

The two models compared on the axes that actually decide an architecture:

| Dimension | Queue | Topic + subscriptions |
| --- | --- | --- |
| Delivery fan-out | Exactly one consumer per message | One copy per subscription |
| Independent cursors | No (one shared cursor) | Yes (per subscription) |
| Per-subscription DLQ | One DLQ | One DLQ each |
| Filtering | None (take everything) | SQL / correlation rules per subscription |
| Sessions supported | Yes | Yes (per subscription) |
| Typical use | Work distribution / load levelling | Event broadcast / content routing |
| Cost shape | One entity | Topic + N subscriptions (storage per copy) |

A subscription behaves like a queue with a filter in front. Everything below about PeekLock, sessions, lock renewal, and dead-lettering applies identically to a subscription’s receiver. Provision a namespace, a sessioned queue, and a topic:

RG=rg-sb-orders
NS=sb-orders-prod          # must be globally unique
LOC=eastus

az group create -n $RG -l $LOC
az servicebus namespace create -g $RG -n $NS -l $LOC --sku Premium --capacity 1

# Sessioned, duplicate-detected work queue
az servicebus queue create -g $RG --namespace-name $NS -n orders \
  --enable-session true \
  --enable-duplicate-detection true \
  --duplicate-detection-history-time-window PT10M \
  --max-delivery-count 10 \
  --lock-duration PT1M \
  --default-message-time-to-live P14D

# Topic with two subscriptions
az servicebus topic create -g $RG --namespace-name $NS -n order-events \
  --enable-duplicate-detection true
az servicebus topic subscription create -g $RG --namespace-name $NS \
  --topic-name order-events -n billing --max-delivery-count 10
az servicebus topic subscription create -g $RG --namespace-name $NS \
  --topic-name order-events -n analytics --max-delivery-count 10

The same entity as Bicep, so the immutable flags are reviewed in a PR rather than typed at 2 a.m.:

param nsName string                                   // must be globally unique
param location string = resourceGroup().location

resource ns 'Microsoft.ServiceBus/namespaces@2022-10-01-preview' = {
  name: nsName
  location: location
  sku: { name: 'Premium', tier: 'Premium', capacity: 1 }   // 1 messaging unit
}

resource orders 'Microsoft.ServiceBus/namespaces/queues@2022-10-01-preview' = {
  parent: ns
  name: 'orders'
  properties: {
    requiresSession: true                 // IMMUTABLE
    requiresDuplicateDetection: true      // IMMUTABLE
    duplicateDetectionHistoryTimeWindow: 'PT10M'
    maxDeliveryCount: 10
    lockDuration: 'PT1M'
    defaultMessageTimeToLive: 'P14D'
    deadLetteringOnMessageExpiration: true
  }
}

--enable-session, --enable-duplicate-detection, and partitioning are immutable after creation. You cannot toggle them on an existing entity — you create a new one and migrate. Decide up front.

The settings that are locked at creation versus the ones you can change live — knowing the difference saves a painful migration:

| Setting | CLI / Bicep key | Mutable after create? | If you got it wrong |
| --- | --- | --- | --- |
| Sessions required | requiresSession | No | Create a new sessioned entity, migrate traffic |
| Duplicate detection | requiresDuplicateDetection | No | New entity with dedup on; drain old one |
| Partitioning | enablePartitioning | No | New entity; re-point producers/consumers |
| Dedup window | duplicateDetectionHistoryTimeWindow | Yes | Update in place |
| Lock duration | lockDuration | Yes | Update in place |
| Max delivery count | maxDeliveryCount | Yes | Update in place |
| Default TTL | defaultMessageTimeToLive | Yes | Update in place |
| DLQ on expiration | deadLetteringOnMessageExpiration | Yes | Update in place |
| Max size | maxSizeInMegabytes | Yes (Premium dynamic) | Resize |

Ordered processing with sessions

Service Bus does not guarantee global FIFO on a plain queue — competing consumers and redelivery break ordering. Ordering is guaranteed only within a session. A session is a logical group identified by the SessionId you set on each message. All messages sharing a SessionId are delivered in order, to a single consumer at a time, who holds an exclusive lock on that session.

The right session key is your ordering boundary: CustomerId, AggregateId, DeviceId — never a constant (that serializes everything) and never unique-per-message (that defeats the point).

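using Azure.Identity;
using Azure.Messaging.ServiceBus;

// fullyQualifiedNamespace is the namespace host name, e.g. "sb-orders-prod.servicebus.windows.net"
await using var client = new ServiceBusClient(fullyQualifiedNamespace,
    new DefaultAzureCredential());
var sender = client.CreateSender("orders");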

var msg = new ServiceBusMessage(BinaryData.FromObjectAsJson(order))
{
    SessionId = order.CustomerId,           // ordering boundary
    MessageId = order.OrderId,              // drives dedup (next section)
    ContentType = "application/json",
    Subject = "OrderPlaced",
};
await sender.SendMessageAsync(msg);

On the consumer side, use a session processor. It locks one session, drains it in order, then moves to the next free session — concurrency scales by number of active sessions, not message count:

var processor = client.CreateSessionProcessor("orders", new ServiceBusSessionProcessorOptions
{
    MaxConcurrentSessions = 8,              // 8 sessions in parallel
    MaxConcurrentCallsPerSession = 1,       // keep order within a session
    AutoCompleteMessages = false,           // complete explicitly on success
    SessionIdleTimeout = TimeSpan.FromSeconds(30),
});

processor.ProcessMessageAsync += async args =>
{
    var order = args.Message.Body.ToObjectFromJson<Order>();
    await HandleAsync(order, args.CancellationToken);
    await args.CompleteMessageAsync(args.Message);   // advance the session cursor
};
processor.ProcessErrorAsync += args =>
{
    log.LogError(args.Exception, "Session error on {Entity}", args.EntityPath);
    return Task.CompletedTask;
};
await processor.StartProcessingAsync();

Choosing the session key

The SessionId choice is the single most consequential decision in a sessioned design — it sets both your ordering boundary and your parallelism ceiling. The table makes the trade-off concrete:

| Candidate SessionId | Ordering you get | Parallelism you get | Verdict |
| --- | --- | --- | --- |
| A constant (e.g. "all") | Total global order | 1 (everything serialized) | Almost never right: a throughput cliff |
| CustomerId / WalletId | Per-customer order | = active customers (high) | The usual correct choice |
| AggregateId (DDD) | Per-aggregate order | = active aggregates | Right for event-sourced systems |
| DeviceId / TenantId | Per-device / per-tenant | = active devices/tenants | Right for IoT / multi-tenant |
| OrderId (unique per msg) | None (one msg per session) | Maximal | Defeats the purpose: no ordering |
| Region (low cardinality) | Per-region order | = number of regions (low) | A hidden throughput ceiling |

Session processor options that matter

Every knob on the session processor and how to reason about it:

| Option | What it controls | Default | When to change | Trade-off / gotcha |
| --- | --- | --- | --- | --- |
| MaxConcurrentSessions | Sessions locked in parallel by one instance | 8 | Raise for high session cardinality | Each holds a session lock + resources |
| MaxConcurrentCallsPerSession | Parallel handlers within one session | 1 | Keep at 1 for ordering | >1 breaks per-session order |
| SessionIdleTimeout | Idle time before releasing a session | ~1 min | Lower to rotate to new sessions faster | Too low = thrash re-acquiring sessions |
| MaxAutoLockRenewalDuration | How long to auto-renew the session lock | 5 min | Set to worst-case handler time | Renewal stops past this; message redelivers |
| PrefetchCount | Messages buffered locally | 0 | Short, high-rate handlers only | Buffered locks expire if handlers are slow |
| AutoCompleteMessages | Auto-complete on handler return | true | Set false for explicit control | Auto-complete hides partial failures |

Session state

Each session carries a small session state blob — server-side scratch space keyed to the SessionId, surviving across consumers and redeliveries. Use it as a checkpoint or saga cursor so a consumer that picks up an existing session knows where it left off:

processor.ProcessMessageAsync += async args =>
{
    var stateBytes = await args.GetSessionStateAsync();
    var cursor = stateBytes is null
        ? new SagaCursor()
        : stateBytes.ToObjectFromJson<SagaCursor>();

    cursor = await AdvanceAsync(cursor, args.Message);

    await args.SetSessionStateAsync(BinaryData.FromObjectAsJson(cursor));
    await args.CompleteMessageAsync(args.Message);
};

Session state counts against the entity’s storage quota, so keep it to a cursor or a few IDs — not the whole aggregate. What session state is and is not for:

| Use session state for | Do NOT use session state for |
| --- | --- |
| A saga / workflow cursor (which step am I on) | The full aggregate or domain object |
| A persisted SequenceNumber for a deferred message | Large payloads (counts against quota) |
| A small set of processed-IDs for in-session idempotency | A substitute for a real database |
| A checkpoint that must survive a consumer swap | Anything you need to query across sessions |

Duplicate detection for idempotent producers

At-least-once delivery means a sender that times out and retries can enqueue the same logical message twice. Duplicate detection makes the enqueue idempotent: within the configured history window, Service Bus drops any message whose MessageId it has already seen on that entity, silently and server-side.

# 10-minute dedup window set at creation:
#   --enable-duplicate-detection true
#   --duplicate-detection-history-time-window PT10M

The contract is simple and strict:

How to size the dedup window against what you are actually defending against:

| Window | CLI duration | Catches | Costs | When to pick |
| --- | --- | --- | --- | --- |
| 30 seconds | PT30S | Fast SDK retries only | Minimal | Tight, high-throughput, low-risk |
| 10 minutes | PT10M | SDK retries + brief broker blips | Low | The sensible default |
| 1 hour | PT1H | Gateway-driven re-sends, short replays | Moderate | Upstream that retries for minutes |
| 1 day | P1D | Consumer-driven replay | Higher storage/throughput | Replay tooling |
| 7 days | P7D | Long replays (the maximum window) | Highest | Audit/replay windows |

What makes a good MessageId versus a bad one — the difference between dedup working and silently doing nothing:

| MessageId source | Deterministic? | Dedup works? | Notes |
| --- | --- | --- | --- |
| Guid.NewGuid() per send | No | No | Every retry has a new Id: the classic bug |
| Business key (OrderId, TxId) | Yes | Yes | The right answer |
| Hash of the canonical payload | Yes | Yes | Use when no natural key exists |
| CustomerId alone | Yes but not unique | Drops legit messages | Too coarse: collapses distinct events |
| Timestamp | No | No | Changes every send |
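
The "hash of the canonical payload" option can be sketched as below. This is an illustration under stated assumptions, not part of the SDK: DeterministicMessageId is a hypothetical helper, and the three field values stand in for whatever fields define "the same logical message" in your domain.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: derives a stable MessageId from the fields that define
// "the same logical message". Identical inputs always yield the same id, so a
// retried send carries an id the broker has already seen inside the window.
static string DeterministicMessageId(params string[] identityFields)
{
    // Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
    var canonical = new StringBuilder();
    foreach (var field in identityFields)
        canonical.Append(field.Length).Append(':').Append(field).Append('|');

    byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonical.ToString()));
    return Convert.ToHexString(hash);   // 64 hex characters
}

string first = DeterministicMessageId("wallet-42", "debit", "tx-1001");
string retry = DeterministicMessageId("wallet-42", "debit", "tx-1001");
string other = DeterministicMessageId("wallet-42", "debit", "tx-1002");

Console.WriteLine(first == retry);   // True: the retry is detectable as a duplicate
Console.WriteLine(first == other);   // False: a distinct event keeps a distinct id
```

If a natural business key such as OrderId already exists, prefer it directly; hashing is only worth the extra step when identity is spread across several fields.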

“Exactly-once-ish” is the honest framing. Dedup gives you exactly-once enqueue inside the window. End-to-end you still get at-least-once delivery (PeekLock can redeliver), so the consumer side must also be idempotent — typically an upsert keyed by MessageId or a processed-IDs table. Dedup and an idempotent handler are complementary, not redundant. The deduplication mechanics here are the broker-side half of the pattern in Designing Idempotent APIs and Deduplication for Reliable Distributed Systems.
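
The consumer half of that pairing can be as small as a "have I already applied this MessageId?" check. The sketch below assumes an in-memory dictionary standing in for the processed-IDs table; HandleOnce and the wallet balance are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Concurrent;

// In-memory stand-in for a processed-IDs table. Assumption: a real system stores
// this durably, in the same transaction as the side effect it guards.
var processed = new ConcurrentDictionary<string, byte>();
decimal balance = 100m;

// Applies the side effect only the first time a MessageId is seen.
// Returns true if work was done, false if the delivery was a duplicate.
bool HandleOnce(string messageId, Action sideEffect)
{
    if (!processed.TryAdd(messageId, 0))
        return false;            // redelivery: already applied, just settle the message
    sideEffect();
    return true;
}

bool firstDelivery = HandleOnce("tx-1001", () => balance -= 30m);
bool redelivery    = HandleOnce("tx-1001", () => balance -= 30m);   // PeekLock redelivered it

Console.WriteLine($"{firstDelivery} {redelivery} {balance}");   // True False 70
```

In either case the handler still completes the message; the only difference is whether the side effect runs. Recording the id and applying the effect must be atomic in a real store, otherwise a crash between the two reintroduces the problem.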

PeekLock vs ReceiveAndDelete, and lock renewal

There are two receive modes, and the choice is a data-safety decision.

The two modes head to head:

| Aspect | ReceiveAndDelete | PeekLock |
| --- | --- | --- |
| Network round-trips | 1 (delivered = gone) | 2+ (deliver, then settle) |
| Redelivery on crash | None; message lost | Yes; lock expires, redelivered |
| Throughput | Highest | High, slightly lower |
| Safe for critical work | No | Yes |
| Delivery-count tracking | N/A | Yes (drives DLQ) |
| Typical use | Best-effort telemetry | Everything you can’t lose |

Once you hold a lock, you must settle it. The four settlement verbs and what each does:

| Settlement | SDK call | Effect | Delivery count | When to use |
| --- | --- | --- | --- | --- |
| Complete | CompleteMessageAsync | Removes the message | n/a (done) | Handler succeeded |
| Abandon | AbandonMessageAsync | Releases lock immediately | +1 | Transient failure, retry now |
| Dead-letter | DeadLetterMessageAsync | Moves to $DeadLetterQueue | n/a (out) | Unprocessable / poison payload |
| Defer | DeferMessageAsync | Sets aside; fetch by SequenceNumber | unchanged | Can’t process yet (out-of-order step) |

The trap is the lock duration. LockDuration maxes out at 5 minutes. A handler that runs longer than the lock loses it mid-flight, the message is redelivered, and now two consumers process it — the classic double-processing bug. Do not crank the lock to 5 minutes and hope; renew the lock for genuinely long handlers.

The processor renews automatically up to MaxAutoLockRenewalDuration — set it to your realistic worst-case handler time:

var processor = client.CreateProcessor("orders", new ServiceBusProcessorOptions
{
    ReceiveMode = ServiceBusReceiveMode.PeekLock,
    MaxConcurrentCalls = 16,
    PrefetchCount = 0,                                     // see the scaling section
    AutoCompleteMessages = false,
    MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(10), // renew past LockDuration
});

If you receive messages manually instead of via the processor, renew explicitly before the lock window closes:

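var receiver = client.CreateReceiver("orders");
var message = await receiver.ReceiveMessageAsync();
if (message is null) return;                         // nothing arrived within the wait time

using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(8));   // hard cap on the work
var token = cts.Token;
// Renew every 30 s (well inside the PT1M lock) for as long as the work is running.
var renewal = Task.Run(async () =>
{
    while (true)
    {
        await Task.Delay(TimeSpan.FromSeconds(30), token);
        await receiver.RenewMessageLockAsync(message, token);
    }
});
try
{
    await DoLongWorkAsync(message, token);
    await receiver.CompleteMessageAsync(message);
}
catch (Exception ex)
{
    // Surface the reason on the DLQ so the re-drive processor can triage it.
    await receiver.DeadLetterMessageAsync(message,
        deadLetterReason: "ProcessingFailed",
        deadLetterErrorDescription: ex.Message);
}
finally
{
    cts.Cancel();                                    // stop the renewal loop
}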

How to set LockDuration against handler duration — the matrix that prevents both double-processing and slow crash recovery:

| Handler duration | Base LockDuration | MaxAutoLockRenewalDuration | Why |
| --- | --- | --- | --- |
| < 30 s, reliable | PT1M | leave default | Lock comfortably covers the work |
| 30 s – 5 min | PT1M | set to ~10 min | Short base = fast crash recovery; renewal covers slow runs |
| 5 – 30 min (e.g. long DB tx) | PT1M | set to worst case | The base cannot exceed 5 min; renewal is the tool |
| Highly variable | PT1M | generous (e.g. 15 min) + PrefetchCount=0 | No buffered locks; renew while healthy |
| Best-effort, loss OK | n/a | n/a | Consider ReceiveAndDelete instead |

Rule of thumb: keep LockDuration at 1 minute and let renewal extend it. A short base lock means a crashed consumer’s messages free up fast; renewal keeps a healthy slow consumer from losing its lock. Setting a 5-minute base lock gets you the worst of both — slow recovery from crashes with no protection past 5 minutes.

Dead-letter queues and a re-drive processor

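Every queue and subscription has a system-managed dead-letter sub-queue at the address <entity>/$DeadLetterQueue. Messages land there for a handful of reasons: exceeding the max delivery count, expiring while dead-lettering on expiration is enabled, a subscription filter evaluation error, or an explicit DeadLetter call from your handler.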

The DLQ is a real queue: it does not auto-expire by default and it does not auto-empty. A DLQ filling up silently is one of the most common Service Bus incidents. Alert on its depth and build a re-drive processor to inspect, fix, and replay.
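
A depth check is the smallest possible version of that alert. The sketch below reads the queue's runtime counters through ServiceBusAdministrationClient; the namespace host name and the zero threshold are assumptions for illustration, and the identity running it needs a role that can read entity metadata. It needs a live namespace to run.

```csharp
using System;
using Azure.Identity;
using Azure.Messaging.ServiceBus.Administration;

// Assumption: host name of the sb-orders-prod namespace used in this guide.
var admin = new ServiceBusAdministrationClient(
    "sb-orders-prod.servicebus.windows.net", new DefaultAzureCredential());

// Runtime properties carry the live message counts for the entity.
QueueRuntimeProperties runtime =
    (await admin.GetQueueRuntimePropertiesAsync("orders")).Value;

Console.WriteLine(
    $"active={runtime.ActiveMessageCount} deadLettered={runtime.DeadLetterMessageCount}");

const long threshold = 0;   // assumed policy: any dead-lettered message is worth a look
if (runtime.DeadLetterMessageCount > threshold)
{
    Console.Error.WriteLine("ALERT: orders/$DeadLetterQueue is not empty");
    Environment.ExitCode = 1;   // lets a scheduler or pipeline step fail on a non-empty DLQ
}
```

A polling check like this suits a smoke test or a scheduled job; for paging, a platform-level metric alert on the entity's dead-lettered message count is usually the more robust choice.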

Every reason a message dead-letters, the DeadLetterReason you’ll read, and how to prevent it:

| Dead-letter reason | What triggered it | Where set / source | Prevent / handle by |
| --- | --- | --- | --- |
| MaxDeliveryCountExceeded | Abandoned/lock-expired > MaxDeliveryCount | Entity maxDeliveryCount | Fix the handler bug; re-drive once fixed |
| TTLExpiredException | Message outlived its TTL | defaultMessageTimeToLive + DLQ-on-expiry | Faster consumers; longer TTL; alert |
| HeaderSizeExceeded | Too many/large app properties | Producer | Trim properties; move data to body |
| Session... / lock errors | Session handling failure | Consumer | Fix session settlement logic |
| Filter evaluation error | Subscription rule threw | Subscription rule | Fix the SQL filter; enable DLQ-on-filter-error |
| Application (custom) | Handler called DeadLetter... | Your code | Validate schema upstream; re-drive after fix |

// Read the DLQ, log the reason, and either re-drive or discard.
var dlqReceiver = client.CreateReceiver("orders", new ServiceBusReceiverOptions
{
    SubQueue = SubQueue.DeadLetter,        // resolves to orders/$DeadLetterQueue
});
var resender = client.CreateSender("orders");

await foreach (var dead in dlqReceiver.ReceiveMessagesAsync())
{
    var reason = dead.DeadLetterReason;
    var desc   = dead.DeadLetterErrorDescription;
    log.LogWarning("DLQ {MessageId}: {Reason} / {Desc}", dead.MessageId, reason, desc);

    if (IsTransient(reason))
    {
        // Copy a NEW message from the dead one and resubmit to the main queue.
        var replay = new ServiceBusMessage(dead)        // copies body + app properties
        {
            MessageId = dead.MessageId,                 // preserve dedup identity
            SessionId = dead.SessionId,                 // preserve ordering boundary
        };
        await resender.SendMessageAsync(replay);
        await dlqReceiver.CompleteMessageAsync(dead);   // remove from DLQ only after re-send
    }
    else
    {
        await ArchiveForManualReviewAsync(dead);
        await dlqReceiver.CompleteMessageAsync(dead);
    }
}

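You cannot move a message out of the DLQ in place; there is no “resubmit” verb. The pattern is always receive from $DeadLetterQueue, send a fresh copy to the source, then complete the dead-lettered one. Use new ServiceBusMessage(deadMessage) so the body and application properties carry over, and complete the dead-lettered message only after the new message is accepted so a crash mid-redrive never loses the message. One caveat on a duplicate-detected queue: a replay that reuses the original MessageId while that Id is still inside the detection window is silently dropped as a duplicate, so re-drive only once the window has passed since the original enqueue.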

The re-drive decision itself, as a table you can encode directly into IsTransient:

| DLQ reason / signal | Classification | Re-drive action |
| --- | --- | --- |
| MaxDeliveryCountExceeded after a deploy that fixed the bug | Transient (now) | Resubmit fresh copy, preserve MessageId/SessionId |
| TTLExpiredException due to a consumer outage | Transient | Resubmit if still relevant; else archive |
| Bad schema / deleted referenced entity | Non-transient | Archive for manual review; complete |
| Repeated dead-letter of the same MessageId | Poison | Quarantine; do not loop re-drive |
| Filter evaluation error | Config bug | Fix the rule first, then re-drive |
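
One way to encode that table is sketched below. It is a hypothetical classifier, not a prescribed implementation: the previousRedrives counter is assumed to come from something the re-drive loop maintains itself (for example an application property it stamps on each replay), and maxRedrives is an assumed policy value. Both parameters are optional, so a call that passes only the reason still works.

```csharp
using System;

// Hypothetical classifier for the re-drive loop. The reason strings are the
// DeadLetterReason values listed earlier; everything else defaults to "archive".
static bool IsTransient(string reason, int previousRedrives = 0, int maxRedrives = 3)
{
    // The same message keeps coming back: treat it as poison and stop looping.
    if (previousRedrives >= maxRedrives)
        return false;

    return reason switch
    {
        "MaxDeliveryCountExceeded" => true,    // worth a retry once the handler bug is fixed
        "TTLExpiredException"      => true,    // consumer outage; caller decides if still relevant
        _                          => false,   // schema errors, custom reasons, unknowns
    };
}

Console.WriteLine(IsTransient("MaxDeliveryCountExceeded"));                    // True
Console.WriteLine(IsTransient("ProcessingFailed"));                            // False
Console.WriteLine(IsTransient("TTLExpiredException", previousRedrives: 3));    // False
```

Defaulting unknown reasons to "not transient" is deliberate: an unrecognized failure goes to manual review instead of being replayed blindly.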

Subscription filters: SQL and correlation rules

On topics, each subscription decides which published messages it keeps via rules. A subscription created without an explicit rule gets a default 1=1 (match-all). For routing, attach filters:

The three filter types side by side:

| Filter type | Matches on | Operators | Cost | Use when |
| --- | --- | --- | --- | --- |
| CorrelationFilter | System props + named app props, exact equality | = only (implicit AND) | Cheapest (indexed) | Routing by a known property value |
| SQLFilter | System + app props | =, <>, <, >, LIKE, IN, AND, OR | Higher | Ranges, partial matches, compound logic |
| TrueFilter / FalseFilter | Everything / nothing | n/a | Trivial | $Default (match-all) or temporarily mute |

# billing only wants high-value OrderPlaced events -> SQL filter
az servicebus topic subscription rule create -g $RG --namespace-name $NS \
  --topic-name order-events --subscription-name billing -n high-value \
  --filter-sql-expression "sys.Label = 'OrderPlaced' AND amount > 1000"

# analytics wants everything with region = 'emea' -> cheap correlation filter
az servicebus topic subscription rule create -g $RG --namespace-name $NS \
  --topic-name order-events --subscription-name analytics -n emea \
  --correlation-filter '{"properties": {"region": "emea"}}'

The sender sets those properties so filters have something to match:

var topicSender = client.CreateSender("order-events");   // sender bound to the topic
var evt = new ServiceBusMessage(BinaryData.FromObjectAsJson(order))
{
    Subject = "OrderPlaced",
    CorrelationId = order.CorrelationId,
};
evt.ApplicationProperties["amount"] = order.Total;   // visible to SQL filters
evt.ApplicationProperties["region"] = order.Region;  // visible to correlation filters
await topicSender.SendMessageAsync(evt);

Which message properties a filter can actually see — the ones producers must set for routing to work:

| Property | Type | Set by | Visible to filters |
| --- | --- | --- | --- |
| Subject (Label) | System | Producer | Correlation + SQL (as sys.Label; an unprefixed name is read as an application property) |
| CorrelationId | System | Producer | Correlation + SQL |
| MessageId | System | Producer | Correlation + SQL |
| To / ReplyTo | System | Producer | Correlation + SQL |
| ApplicationProperties[...] | Custom | Producer | Correlation (equality) + SQL (any op) |
| Message body | Payload | Producer | Not visible; filters never read the body |

If you add a custom rule, delete the default $Default rule — otherwise the subscription matches everything and your filter, and you wonder why analytics is getting low-value orders. New custom rule, drop the default. This same content-routing model, applied to Event Grid’s push delivery, appears in Event-Driven Architectures with Azure Event Grid: MQTT, Routing, and Reliable Delivery.
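
Dropping the default rule can be sketched with the SDK's administration client. The namespace host name is an assumption, and the identity running this needs a role with manage rights on the namespace (for example Azure Service Bus Data Owner). It needs a live namespace to run.

```csharp
using System;
using Azure.Identity;
using Azure.Messaging.ServiceBus.Administration;

// Assumption: host name of the sb-orders-prod namespace used in this guide.
var admin = new ServiceBusAdministrationClient(
    "sb-orders-prod.servicebus.windows.net", new DefaultAzureCredential());

const string topic = "order-events";
const string subscription = "analytics";

// Remove the match-all rule only if it is still present, so the script is re-runnable.
// RuleProperties.DefaultRuleName is the SDK constant for "$Default".
if ((await admin.RuleExistsAsync(topic, subscription, RuleProperties.DefaultRuleName)).Value)
{
    await admin.DeleteRuleAsync(topic, subscription, RuleProperties.DefaultRuleName);
}

// List what remains: only the custom rule(s) should be printed.
await foreach (RuleProperties rule in admin.GetRulesAsync(topic, subscription))
{
    Console.WriteLine($"{rule.Name}: {rule.Filter}");
}
```

Order matters: create the custom rule first and delete $Default second. A subscription with no rules at all matches nothing, so reversing the steps drops messages in the gap.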

Auto-forwarding, scheduled messages, and deferral

Three features cover most “I need to delay or chain this” requirements without external infrastructure. At a glance:

| Feature | What it does | Server-side? | Persist anything? | Typical use |
| --- | --- | --- | --- | --- |
| Auto-forward | Chains an entity to another in the namespace | Yes | No | Fan a topic into per-team queues |
| Scheduled message | Enqueues now, visible at a future time | Yes | The returned sequence number (to cancel) | Reminders, delayed retries |
| Deferral | Sets a received message aside for later | Yes | The SequenceNumber (mandatory) | Out-of-order saga steps |

Auto-forwarding chains an entity to another in the same namespace — a subscription forwards to a queue, or a queue to a topic — fully server-side. Use it to fan a topic’s matched messages into per-team work queues, or to build a single ingestion endpoint:

az servicebus topic subscription update -g $RG --namespace-name $NS \
  --topic-name order-events -n billing \
  --forward-to billing-work        # matched messages flow straight to the billing queue

Scheduled messages are enqueued now but become visible only at a future time — native delayed delivery, no Quartz or cron loop:

var seq = await sender.ScheduleMessageAsync(
    reminderMessage,
    DateTimeOffset.UtcNow.AddHours(24));   // visible in 24h
// Cancel before it fires if the situation changes:
await sender.CancelScheduledMessageAsync(seq);

Deferral is for “I received this, but I cannot process it yet” — an out-of-order step in a saga, or a dependency not ready. The message is set aside (kept off the active stream) and can only be retrieved later by its sequence number, which you must persist:

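if (!ReadyToProcess(message))
{
    // Persist first: if the process dies after deferring but before saving,
    // the message is unreachable until its TTL expires.
    await SaveForLaterAsync(message.SessionId, message.SequenceNumber); // you own this
    await receiver.DeferMessageAsync(message);
    return;
}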
// Later, once the dependency arrives:
var deferred = await receiver.ReceiveDeferredMessageAsync(savedSequenceNumber);
await Process(deferred);
await receiver.CompleteMessageAsync(deferred);

Deferral’s catch: a deferred message is invisible to normal receive. If you lose the sequence number you have effectively leaked the message until its TTL expires. Persist SequenceNumber durably (the session state from the sessions section is a natural home) before you defer.

Scaling consumers, prefetch, and Premium throttling

Throughput on Service Bus is a function of consumer concurrency, prefetch, and — on Premium — provisioned capacity.

How to set PrefetchCount against handler shape — the buffered-lock trap in table form:

| Handler profile | Suggested PrefetchCount | Rationale |
|---|---|---|
| Long / variable (DB tx, external calls) | 0 | No buffered locks to expire mid-wait |
| Short, high-rate, idempotent | MaxConcurrentCalls × 1–3 | Hides round-trip latency, locks settle fast |
| Sessioned, ordered | 0 (or very small) | Buffering across sessions risks lock loss |
| Unknown / new workload | 0 | Start safe; raise only with metrics |

Premium messaging-unit sizing — a starting map, not a guarantee (always validate with load):

| Messaging units | Relative capacity | Indicative scale | When to step up |
|---|---|---|---|
| 1 MU | Baseline isolated capacity | Small/steady workloads | ThrottledRequests sustained > 0 |
| 2 MU | ~2× | Moderate, spiky | Throttling during normal peaks |
| 4 MU | ~4× | Busy multi-entity namespace | Throttling outside flash events |
| 8–16 MU | ~8–16× | High-throughput backbones | Sustained throttling at 4 MU under real load |

# Scale Premium capacity up to 4 messaging units under sustained load
az servicebus namespace update -g $RG -n $NS --capacity 4
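Before scaling, it helps to separate sustained throttling from a single spike. The sketch below is one way to do that in Bash: throttling_is_sustained is a hypothetical helper that reads one metric bucket per line and succeeds only when a run of consecutive buckets is above zero. The run length (3 buckets of 5 minutes in the usage comment) is an assumed starting point, not a documented threshold.

```shell
# Hypothetical helper: exit 0 when stdin (one bucket total per line) contains
# at least $1 consecutive buckets above zero, i.e. throttling is sustained.
throttling_is_sustained() {
  local need="$1" run=0 v
  while read -r v; do
    # Blank or non-numeric buckets (az prints "None" for gaps) count as zero.
    case "$v" in ''|*[!0-9.]*) v=0 ;; esac
    if awk -v x="$v" 'BEGIN { exit !(x > 0) }' </dev/null; then
      run=$((run + 1))
      if [ "$run" -ge "$need" ]; then return 0; fi
    else
      run=0
    fi
  done
  return 1
}

# Usage (run by hand; SCOPE is the namespace resource ID):
#   az monitor metrics list --resource "$SCOPE" --metric ThrottledRequests \
#       --interval PT5M --aggregation Total \
#       --query "value[0].timeseries[0].data[].total" -o tsv |
#     throttling_is_sustained 3 &&
#     az servicebus namespace update -g "$RG" -n "$NS" --capacity 4
```

Keeping the decision in a pure function means you can replay last week's numbers through it before trusting it to trigger a capacity change.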

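The default SDK retry policy already handles transient throttling (a ServiceBusException whose Reason is ServiceBusy in Azure.Messaging.ServiceBus; ServerBusyException in the older SDKs) with exponential backoff; tune it only with evidence: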

var client = new ServiceBusClient(fullyQualifiedNamespace, new DefaultAzureCredential(),
    new ServiceBusClientOptions
    {
        RetryOptions = new ServiceBusRetryOptions
        {
            Mode = ServiceBusRetryMode.Exponential,
            MaxRetries = 5,
            MaxDelay = TimeSpan.FromSeconds(30),
        },
    });

The retry-policy knobs and sane starting values:

| Retry option | What it controls | Default | Tune when |
|---|---|---|---|
| Mode | Fixed vs exponential backoff | Exponential | Almost never change from exponential |
| MaxRetries | Attempts before surfacing the error | 3 | Raise for flaky networks; lower for fail-fast |
| Delay | Base back-off delay | 0.8 s | Increase under sustained throttling |
| MaxDelay | Cap on back-off | 60 s | Lower if you need bounded latency |
| TryTimeout | Per-attempt timeout | 60 s | Lower for short ops, raise for large messages |

For depth on scaling consumers automatically by queue depth (rather than a fixed instance count), see Deploy KEDA for Event-Driven Autoscaling on Kafka and Azure Service Bus Workloads.

Tiers, limits, and the error reference

Pick the tier before you write a line of code — it decides which features even exist. The three tiers on the axes that matter:

| Capability | Basic | Standard | Premium |
|---|---|---|---|
| Queues | Yes | Yes | Yes |
| Topics / subscriptions | No | Yes | Yes |
| Sessions | No | Yes | Yes |
| Duplicate detection | No | Yes | Yes |
| Max message size | 256 KB | 256 KB | 100 MB |
| Dedup window max | n/a | 1 day | 7 days |
| Capacity model | Shared | Shared | Dedicated (MUs) |
| Predictable latency | No | No | Yes |
| Private Endpoint / VNet | No | No | Yes |
| Geo-disaster recovery | No | No | Yes (pairing) |

The concrete limits you will actually bump into:

| Limit | Standard | Premium | Notes |
|---|---|---|---|
| Max message size | 256 KB | 100 MB | Body + properties count |
| Max LockDuration | 5 min | 5 min | Renew for longer work |
| Dedup history window | ≤ 1 day | ≤ 7 days | Storage/throughput trade-off |
| Max delivery count | 1–2000 | 1–2000 | Typical setting 5–10 |
| Default TTL max | 14 days (configurable) | longer | defaultMessageTimeToLive |
| Sessions per entity | very high | very high | Cardinality = parallelism |
| Throughput | best-effort shared | per-MU, predictable | Scale MUs on throttling |

The errors and statuses you’ll see, what they mean on Service Bus, and the fix. The names below are the classic exception names; in Azure.Messaging.ServiceBus most of them surface as a ServiceBusException whose Reason carries the matching ServiceBusFailureReason (ServiceBusy, MessageLockLost, SessionLockLost, MessageSizeExceeded, MessagingEntityNotFound, QuotaExceeded, MessageNotFound), while UnauthorizedAccessException is thrown as itself:

| Error / exception | Meaning | Likely cause | How to confirm | Fix |
|---|---|---|---|---|
| ServerBusyException (429-equiv) | Throttled | Exceeded MU/throughput | ThrottledRequests metric > 0 | SDK retries; scale MUs |
| MessageLockLostException | Lock expired before settle | Handler > lock; large prefetch | deliveryCount rising; redelivery | Renew lock; PrefetchCount=0 |
| SessionLockLostException | Session lock expired | Slow session handler | Session redelivered | Raise MaxAutoLockRenewalDuration |
| MessageSizeExceededException | Message too big | > 256 KB (Std) / 100 MB (Prem) | Send fails immediately | Trim payload; claim-check to Blob; Premium |
| MessagingEntityNotFoundException | Entity missing | Typo, wrong namespace, not created | az servicebus queue show | Create entity; fix name |
| UnauthorizedAccessException (401) | Auth failed | Missing RBAC / wrong identity | az role assignment list | Grant Service Bus Data * role |
| QuotaExceededException | Entity full | Backlog hit maxSizeInMegabytes | activeMessageCount near cap | Drain backlog; raise size; add consumers |
| MessageNotFoundException | Deferred msg not found | Wrong/stale SequenceNumber | Persisted seq mismatch | Persist seq correctly; check TTL |
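The size error is the one you can rule out before sending. As a rough sketch, the hypothetical needs_claim_check helper below compares a payload file against the tier limit and signals when the claim-check route (store the body in Blob, send a pointer) is the safer choice. The 16 KB headroom reserved for properties is an arbitrary assumption; size it to what your producers actually attach.

```shell
# Hypothetical pre-send guard: exit 0 when the payload is too big to send inline.
# Arguments: file, limit in KB (default 256, the Standard tier), headroom in KB.
needs_claim_check() {
  local file="$1" limit_kb="${2:-256}" headroom_kb="${3:-16}" size
  # Arithmetic expansion strips the padding some wc implementations add.
  size=$(( $(wc -c < "$file") ))
  [ "$size" -gt $(( (limit_kb - headroom_kb) * 1024 )) ]
}
```

A producer script might run needs_claim_check payload.json and, on success, upload the file to a Storage container and send only the blob name; on Premium, pass the larger limit as the second argument.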

Architecture at a glance

The diagram traces a single message through the system left to right, and pins each guarantee to the exact hop where it can break. On the left, two producers — an App Service Order API and a Function — send over AMQP (TCP 5671) with a deterministic MessageId and a SessionId set to the ordering boundary. They hit the Premium namespace (1–16 messaging units), where the first stop is the conceptual dedup gate: inside the configured window, any repeat MessageId is dropped server-side (badge 1 — the place a fresh-GUID-per-retry bug silently defeats dedup). Surviving messages land in the sessioned orders queue with a one-minute base lock (badge 2 — where a constant or unique SessionId turns “ordered” into “interleaved”). Anything that exceeds max delivery count, expires, or is explicitly rejected falls into the $DeadLetter sub-queue (badge 3 — which fills silently if nothing alerts on its depth).

On the consumer side, a session processor drains one session at a time under PeekLock, renewing its lock for up to ten minutes and writing through an idempotent database keyed on MessageId (badge 4 — where a handler slower than its lock loses it and a second consumer double-processes). The operate zone closes the loop: a re-drive processor reads the DLQ, sends a fresh copy to the source, and completes the dead-lettered message only after the resend is accepted (badge 5 — the ordering that prevents losing a message mid-redrive), while Azure Monitor watches DeadletteredMessages and ThrottledRequests. Read the five legend entries as a diagnostic map: each is a symptom, the exact property or metric that confirms it, and the fix.

Figure: Architecture of ordered, deduplicated, dead-letter-safe messaging on Azure Service Bus Premium. App Service and Function producers send over AMQP 5671 through a duplicate-detection gate into a sessioned orders queue with a dead-letter sub-queue; a session processor under PeekLock writes to an idempotent database; a re-drive processor and Azure Monitor form the operate loop; five numbered badges mark where dedup, ordering, the DLQ, lock renewal, and re-drive each break.

Real-world scenario

A payments platform — call it WalletForge — processed wallet transactions through a single Standard-tier queue with competing consumers. Each transaction was independent — until the product team shipped running balances. Now two debits on the same wallet, processed concurrently, could read the same starting balance and both succeed, overdrawing the account. They also hit duplicate charges: a gateway timeout made the upstream service resend, and both copies were processed.

The constraint was hard: strict per-wallet ordering and no duplicate debit, without serializing the entire queue (millions of wallets, thousands of transactions per second) and with a 6-week audit retention requirement on anything that failed.

They fixed it with three changes and no new infrastructure:

  1. Sessions keyed on WalletId. Per-wallet ordering became absolute — a wallet’s transactions process one at a time, in order — while different wallets still ran fully parallel. Effective concurrency stayed high because session cardinality (number of active wallets) was enormous.
  2. Duplicate detection with a deterministic MessageId set to the upstream transaction ID, on a PT1H window sized to the gateway’s retry envelope, backed by an idempotent UPSERT keyed on the same ID so a redelivery past the window still could not double-debit.
  3. A DLQ re-drive processor, together with a move to Premium for predictable latency. They alert on DeadletteredMessages > 0, and the processor archives non-transient failures to a Storage account for the 6-week audit trail before completing them.

The session consumer that closed the overdraw race:

var processor = client.CreateSessionProcessor("wallet-tx", new ServiceBusSessionProcessorOptions
{
    MaxConcurrentSessions = 32,            // 32 wallets in flight
    MaxConcurrentCallsPerSession = 1,      // strict order per wallet
    PrefetchCount = 0,                     // long DB transaction -> no buffered lock loss
    MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(5),
    AutoCompleteMessages = false,
});

processor.ProcessMessageAsync += async args =>
{
    var tx = args.Message.Body.ToObjectFromJson<WalletTx>();
    // Idempotent debit: succeeds once per TxId even on redelivery.
    await ApplyDebitIfNewAsync(tx, idempotencyKey: args.Message.MessageId);
    await args.CompleteMessageAsync(args.Message);
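};

// The processor refuses to start without an error handler.
processor.ProcessErrorAsync += args =>
{
    Console.Error.WriteLine($"{args.ErrorSource} on {args.EntityPath}: {args.Exception.Message}");
    return Task.CompletedTask;
};

await processor.StartProcessingAsync();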

The before/after, with the specific change that moved each number:

| Symptom (before) | Root cause | Change made | Result (after) |
|---|---|---|---|
| Occasional overdrafts | Concurrent debits on one wallet | Sessions keyed on WalletId | Zero overdrafts in the quarter |
| Duplicate charges | Gateway re-send, both processed | Dedup + deterministic MessageId + idempotent upsert | Zero duplicate debits |
| Failed messages lost / untraceable | No DLQ strategy | Premium + DLQ alert + archive-before-complete | 6-week audit trail intact |
| Latency spikes under load | Shared Standard capacity | Move to Premium messaging units | Predictable p95 |

Result: zero overdrafts and zero duplicate debits in the following quarter, with no message-level locking in their own code and no external coordination service — the ordering came from sessions, the dedup from MessageId plus an idempotent write, and the safety net from the DLQ.

Advantages and disadvantages

The broker-enforced model both gives you ordering/dedup/poison-isolation for free and introduces sharp edges if you misuse the primitives. Weigh it honestly:

| Advantages | Disadvantages |
|---|---|
| Per-session ordering with no coordination service of your own | Ordering only within a session; global FIFO is not on offer |
| Server-side dedup makes the enqueue idempotent inside a window | Dedup does not cover the consumer side; handler must still be idempotent |
| DLQ isolates poison messages automatically | DLQ fills silently; nothing alerts by default |
| PeekLock gives at-least-once delivery with no message loss on crash | A handler slower than its lock double-processes, a subtle, load-only bug |
| Topics + filters route content with zero consumer plumbing | A stray $Default rule silently breaks routing |
| Premium gives dedicated, predictable capacity and 100 MB messages | Premium costs more, and MU sizing needs load testing |
| Immutable flags force a deliberate design | Getting requiresSession/dedup wrong means a full migration |
| Scheduled/auto-forward/defer cover delay & chaining natively | Deferral leaks messages if you lose the SequenceNumber |

The model is right when you have a real per-key ordering or exactly-once-enqueue need, multiple independent readers, or poison-message risk — i.e. most transactional async workloads. It is overkill for fire-and-forget telemetry (use a cheaper path) and the wrong tool for high-throughput streaming with replay (that is Azure Event Hubs at Scale: Partitioning, Capture, Kafka Endpoint, and Stream Analytics Processing territory). When the requirement is really a stateful, long-running workflow, Durable Functions in Production: Orchestrations, Fan-out/Fan-in, and Entity State models it more directly than hand-rolled session-state sagas.

Hands-on lab

Stand up a sessioned, duplicate-detected queue, prove ordering and dedup, force a message into the DLQ, and tear it all down. Premium has no free tier, so this lab uses Standard, where sessions and dedup work too; switch --sku to Premium in Step 2 only if you want to see dedicated capacity, and delete it promptly afterwards. Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

RG=rg-sb-lab
LOC=centralindia
NS=sb-lab-$RANDOM          # globally unique
az group create -n $RG -l $LOC -o table

Step 2 — Create the namespace (Standard keeps the lab nearly free).

az servicebus namespace create -g $RG -n $NS -l $LOC --sku Standard -o table

Expected: a namespace row with sku.name = Standard, status = Active.

Step 3 — Create the sessioned, dedup’d queue with the immutable flags.

az servicebus queue create -g $RG --namespace-name $NS -n orders \
  --enable-session true \
  --enable-duplicate-detection true \
  --duplicate-detection-history-time-window PT10M \
  --max-delivery-count 5 \
  --lock-duration PT1M -o table

Step 4 — Confirm the entity is configured the way you think.

az servicebus queue show -g $RG --namespace-name $NS -n orders \
  --query "{session:requiresSession, dup:requiresDuplicateDetection, maxDelivery:maxDeliveryCount, lock:lockDuration}" -o json

Expected: "session": true, "dup": true, "maxDelivery": 5.

Step 5 — Grant your identity the data-plane role (RBAC, not keys).

ME=$(az ad signed-in-user show --query id -o tsv)
SCOPE=$(az servicebus namespace show -g $RG -n $NS --query id -o tsv)
az role assignment create --assignee $ME --role "Azure Service Bus Data Owner" --scope $SCOPE -o table
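Role assignments can take a few minutes to propagate, so it is worth confirming the grant before blaming your code for a 401. A small sketch, where has_sb_data_role is a hypothetical helper name and ME and SCOPE are the variables set above:

```shell
# Hypothetical helper: exit 0 only if the identity holds one of the
# "Azure Service Bus Data ..." roles assigned at the given scope.
has_sb_data_role() {
  local assignee="$1" scope="$2"
  az role assignment list --assignee "$assignee" --scope "$scope" \
      --query "[].roleDefinitionName" -o tsv |
    grep -q '^Azure Service Bus Data '
}
```

Run has_sb_data_role "$ME" "$SCOPE" && echo "data-plane role present". It only sees assignments made at that exact scope, which is what this step created; inherited assignments would need a broader listing.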

Step 6 — Prove dedup and ordering with a tiny script. Send the same MessageId twice (dedup should drop one) and three ordered messages in one session, then read them back. The az CLI has no send or receive command, so drive this from a small console app built on the SDK snippets in this article (or any data-plane client you trust). Assert: exactly one copy of the duplicated MessageId arrives, and the three same-SessionId messages arrive in send order.
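If you would rather stay in Cloud Shell for the send half of Step 6, one option is the Service Bus REST send endpoint with an Entra ID token. The sketch below makes several assumptions: send_msg is a hypothetical helper, NS is the namespace name from Step 1, the Step 5 role assignment has propagated, and the IDs contain no characters that need JSON escaping. It covers sending only; checking order still needs a session receiver from the SDK.

```shell
# Hypothetical helper: send one message over the REST API with an explicit
# MessageId and SessionId, printing the HTTP status code.
# Arguments: queue, MessageId, SessionId, JSON body. Reads $NS.
send_msg() {
  local queue="$1" msg_id="$2" session_id="$3" body="$4" token
  token=$(az account get-access-token --resource https://servicebus.azure.net \
            --query accessToken -o tsv)
  curl -sS -o /dev/null -w '%{http_code}\n' -X POST \
    "https://${NS}.servicebus.windows.net/${queue}/messages" \
    -H "Authorization: Bearer ${token}" \
    -H "Content-Type: application/json" \
    -H "BrokerProperties: {\"MessageId\":\"${msg_id}\",\"SessionId\":\"${session_id}\"}" \
    -d "$body"
}

# Usage (run by hand):
#   send_msg orders dup-1 s1 '{"n":0}'    # expect 201
#   send_msg orders dup-1 s1 '{"n":0}'    # same MessageId: accepted, then dropped
#   send_msg orders o-1 s1 '{"n":1}'
#   send_msg orders o-2 s1 '{"n":2}'
#   send_msg orders o-3 s1 '{"n":3}'
#   az servicebus queue show -g "$RG" --namespace-name "$NS" -n orders \
#     --query countDetails.activeMessageCount -o tsv    # expect 4, not 5
```

Five sends leaving four active messages is the dedup proof. The count can lag by a few seconds, so re-run the query before concluding that dedup failed.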

Step 7 — Force a dead-letter and read the reason. Send one message and abandon it on every delivery until it stops coming back (with max delivery count 5, that is five deliveries), then check the DLQ depth:

# After the redelivery loop, inspect active and dead-letter depth
az servicebus queue show -g $RG --namespace-name $NS -n orders \
  --query "{active:countDetails.activeMessageCount, dead:countDetails.deadLetterMessageCount}" -o json

Expected: dead ≥ 1, and reading the dead message shows DeadLetterReason = MaxDeliveryCountExceeded.

Validation checklist. You created a sessioned, dedup’d entity with deliberate immutable flags, confirmed them via az ... show, used RBAC instead of connection-string keys, proved exactly-once enqueue and per-session ordering, and drove one message to the DLQ with the expected reason. What each step proved:

| Step | What you did | What it proves |
|---|---|---|
| 3 | Created with --enable-session/--enable-duplicate-detection | The flags are set at creation and are immutable |
| 4 | az ... show the flags | The entity matches your intent (no silent default) |
| 5 | Assigned a Data role | Data-plane auth is RBAC, not shared keys |
| 6 | Sent dup MessageId + ordered session | Dedup drops the repeat; session preserves order |
| 7 | Abandoned past max delivery | Poison messages dead-letter with a readable reason |

Cleanup (avoid lingering charges).

az group delete -n $RG --yes --no-wait

Cost note. A Standard namespace is billed primarily per operation and is effectively a few rupees for this lab; deleting the resource group stops everything. If you used Premium, delete promptly — a messaging unit bills hourly whether or not traffic flows.

Common mistakes & troubleshooting

The playbook — the part you bookmark. First the scannable table, then expanded reasoning for the entries that bite hardest.

| # | Symptom | Root cause | Confirm (exact cmd / property) | Fix |
|---|---|---|---|---|
| 1 | “Ordered” queue processes out of order | SessionId is a constant, unique-per-message, or sessions never enabled | az servicebus queue show --query requiresSession; inspect SessionId values | Recreate with sessions on; key SessionId on the ordering boundary; MaxConcurrentCallsPerSession=1 |
| 2 | Duplicate side effects despite “dedup on” | Fresh GUID MessageId per send, or window shorter than retry envelope | Compare MessageId across retries; --query requiresDuplicateDetection | Deterministic business MessageId; widen duplicateDetectionHistoryTimeWindow |
| 3 | Same message processed twice under load | Handler outran LockDuration; lock lost & redelivered | deliveryCount > 1; MessageLockLostException in logs | Short base lock + MaxAutoLockRenewalDuration; PrefetchCount=0 for slow handlers |
| 4 | Messages vanish on consumer crash | ReceiveAndDelete used for critical work | Receiver ReceiveMode is ReceiveAndDelete | Switch to PeekLock; settle explicitly |
| 5 | DLQ growing unnoticed; source backs up | No alert on DeadletteredMessages; no re-drive processor | --query countDetails.deadLetterMessageCount climbing | Alert on the metric; build a re-drive processor |
| 6 | One bad message stalls a partition | Poison payload abandoned/redelivered in a loop | Same MessageId redelivering; deliveryCount climbing | Dead-letter unprocessable payloads explicitly; re-drive after fix |
| 7 | Re-driven messages occasionally lost | DLQ message completed before the resend was accepted | Code completes before sending the fresh copy | Send fresh copy first; complete the dead message only after |
| 8 | Analytics subscription gets messages it shouldn’t | Default $Default rule left alongside a custom rule | az servicebus topic subscription rule list shows $Default + yours | Delete $Default when adding a custom rule |
| 9 | Throughput plateaus; can’t add parallelism | Low session cardinality caps active sessions | Few distinct SessionId values | Choose a higher-cardinality key; or don’t session this entity |
| 10 | Intermittent ServerBusyException under peak | Exceeded messaging-unit capacity | ThrottledRequests metric > 0 sustained | Let SDK retry; scale MUs (--capacity) |
| 11 | Deferred message never comes back | SequenceNumber not persisted / lost | No stored seq for the deferred message | Persist SequenceNumber (session state) before deferring |
| 12 | Producer fails with 401 / Unauthorized | Missing data-plane RBAC on the identity | az role assignment list --assignee <id> empty | Grant Azure Service Bus Data Sender/Receiver/Owner |
| 13 | Send fails: message too large | Body+properties exceed 256 KB (Standard) | MessageSizeExceededException | Claim-check (store blob, send pointer); move to Premium (100 MB) |
| 14 | TTL’d messages disappear silently | DeadLetteringOnMessageExpiration off | Active count drops with no DLQ growth | Enable deadLetteringOnMessageExpiration to capture them |
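Row 14 is a one-flag fix on an existing queue. A minimal sketch, with capture_expired as a hypothetical wrapper name; it turns the setting on and echoes the resulting value so you can see that it took:

```shell
# Hypothetical helper: route expired messages to the DLQ instead of discarding them.
# Arguments: resource group, namespace, queue.
capture_expired() {
  local rg="$1" ns="$2" queue="$3"
  az servicebus queue update -g "$rg" --namespace-name "$ns" -n "$queue" \
    --enable-dead-lettering-on-message-expiration true \
    --query deadLetteringOnMessageExpiration -o tsv
}
```

For the lab queue that is capture_expired "$RG" "$NS" orders, which should print true. Unlike sessions and duplicate detection, this flag can be changed after creation, and it only affects messages that expire from then on.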

The expanded form for the entries that cause the most 2 a.m. confusion:

1. The “ordered” queue interleaves. Root cause: sessions are off, or the SessionId is wrong — a constant serializes everything (and hides the bug until you ask why throughput is terrible), unique-per-message means each message is its own session (no ordering at all). Confirm: az servicebus queue show --query requiresSession and inspect the SessionId values your producer sets. Fix: sessions are immutable — recreate the entity with --enable-session true, key SessionId on the true ordering boundary, and set MaxConcurrentCallsPerSession = 1.

2. Duplicates despite dedup. Root cause: the producer sets a fresh Guid.NewGuid() per send, so each retry has a new MessageId and dedup never matches; or the window is shorter than how long the upstream keeps retrying. Confirm: log the MessageId across a retried send — if it changes, that’s the bug. Fix: derive MessageId deterministically from the business event; size duplicateDetectionHistoryTimeWindow to the retry envelope; and make the handler idempotent so a redelivery past the window still can’t double-apply.

3. Double-processing under load. Root cause: a handler that runs longer than LockDuration (max 5 min) loses its lock; the message is redelivered and a second consumer processes it concurrently. Large PrefetchCount makes it worse — buffered messages hold locks while they wait. Confirm: deliveryCount > 1 on processed messages, MessageLockLostException in logs. Fix: keep a short base LockDuration and set MaxAutoLockRenewalDuration to your worst-case handler time; set PrefetchCount = 0 for long handlers.

5. The silent DLQ. Root cause: nothing alerts on dead-letter depth by default, and the DLQ never empties itself, so failures pile up until the source entity backs up and throughput drops. Confirm: countDetails.deadLetterMessageCount climbing while nobody noticed. Fix: wire a metric alert on DeadletteredMessages > 0 and run a re-drive processor; treat a non-empty DLQ as an incident, not a curiosity.

7. The lossy re-drive. Root cause: the re-drive code completes the dead-lettered message before (or without confirming) the fresh copy was accepted by the source; a crash in that gap loses the message. Confirm: read the ordering of operations in the re-drive loop. Fix: always SendMessageAsync the new copy first and only CompleteMessageAsync the dead one after the send returns successfully.

Best practices

The metric alerts worth wiring before the next incident — leading indicators, not “consumer is down”:

| Alert on | Metric | Threshold (starting point) | Why it’s leading |
|---|---|---|---|
| Dead-letter growth | DeadletteredMessages | > 0 sustained 5 min | Catches poison/expiry before the source backs up |
| Backlog building | ActiveMessages | Above your normal band | Consumers falling behind producers |
| Throttling | ThrottledRequests | > 0 sustained | MU capacity exceeded; scale before failures cascade |
| Server errors | ServerErrors | > 0 | Broker-side trouble worth paging on |
| Incoming vs outgoing | IncomingMessages / OutgoingMessages | Divergence | Producers outpacing consumers |
| Entity size | Size (% of max) | > 80% | Approaching QuotaExceededException |

The KQL for the dead-letter rate, wired to an alert:

AzureMetrics
| where ResourceProvider == "MICROSOFT.SERVICEBUS"
| where MetricName == "DeadletteredMessages"
| summarize Dead = sum(Total) by Resource, bin(TimeGenerated, 5m)
| where Dead > 0
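If you prefer a platform metric alert to a log query, the dead-letter row of the table can be wired from the CLI. This is a sketch under assumptions: create_dlq_alert and the alert name sb-dlq-growth are hypothetical, the avg aggregation, 5-minute window, and severity 2 are starting values rather than recommendations, and the optional third argument is the resource ID of an action group you already have.

```shell
# Hypothetical helper: create a metric alert on dead-letter depth for a namespace.
# Arguments: resource group, namespace resource ID, optional action group ID.
create_dlq_alert() {
  local rg="$1" scope="$2" action_group_id="${3:-}"
  local args=(--resource-group "$rg" --name sb-dlq-growth --scopes "$scope"
              --condition "avg DeadletteredMessages > 0"
              --window-size 5m --evaluation-frequency 1m --severity 2
              --description "Dead-letter depth above zero")
  # Attach an action group only when one was supplied.
  if [ -n "$action_group_id" ]; then
    args+=(--action "$action_group_id")
  fi
  az monitor metrics alert create "${args[@]}"
}
```

With the lab variables this is create_dlq_alert "$RG" "$SCOPE". Without an action group the alert still fires and shows in the portal, but nobody is notified, which defeats the point in production.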

For the full observability stack behind these alerts — workbooks, action groups, and KQL at scale — see Azure Monitor and Application Insights: Full-Stack Observability.

Security notes

The security controls mapped to what they defend against:

| Control | Mechanism | Defends against | Tier |
|---|---|---|---|
| Managed identity + RBAC | DefaultAzureCredential + Data roles | Leaked/rotated SAS keys | All |
| Least-privilege roles | Sender vs Receiver vs Owner | Lateral abuse of one credential | All |
| Private Endpoint | Private link + no public access | Internet-exposed broker | Premium |
| TLS in transit | AMQP over 5671 | Eavesdropping / MITM | All |
| CMK at rest | Key Vault-managed keys | Regulatory / key-control needs | Premium |
| Claim-check for secrets | Reference, not payload | Secret sprawl in messages | All |
| Scoped SAS + expiry | Narrow authorization rule | Broad, long-lived keys | All |

Cost & sizing

What drives the Service Bus bill, and how to keep it sane:

A rough monthly picture (INR, indicative — confirm against the pricing calculator for your region):

| Scenario | Tier / size | Rough INR / month | What you get | Watch-out |
|---|---|---|---|---|
| Dev / low volume | Basic, ~1M ops | ~₹50–300 | Queues only, no sessions | No topics/dedup/sessions |
| Moderate prod | Standard, ~20–50M ops | ~₹2,000–6,000 | Sessions, topics, dedup | Per-op cost grows with chattiness |
| High-throughput / isolated | Premium, 1 MU | ~₹55,000+ | Dedicated capacity, 100 MB, PE | Bills hourly even when idle |
| High-throughput scaled | Premium, 4 MU | ~₹220,000+ | ~4× capacity | Scale back after peaks |
| Add-on | DLQ/backlog storage | small, usage-based | Buffer headroom | Neglected DLQ = creeping cost |

Premium pricing is substantial — only move to it for a real need (predictable latency, 100 MB messages, VNet isolation, or geo-DR). Most teams run Standard happily and reserve Premium for the transactional backbone. WalletForge moved its wallet-transaction entity to Premium for predictable latency and audit isolation, but kept lower-criticality topics on Standard — tier per workload, not per company.

Interview & exam questions

1. Why doesn’t a plain Service Bus queue guarantee global FIFO, and how do you get ordering? Competing consumers and redelivery (a lock expires, the message goes to another consumer) break global order. Ordering is guaranteed only within a session: all messages sharing a SessionId are delivered in order to one consumer holding the session lock. You enable sessions at creation and key SessionId on the ordering boundary.

2. What does duplicate detection actually guarantee, and what does it not? It guarantees the enqueue is idempotent within the configured window — a repeat MessageId is dropped server-side. It does not make delivery exactly-once (PeekLock can still redeliver) and does not make your handler idempotent. End-to-end exactly-once needs dedup plus an idempotent write keyed on MessageId.

3. Difference between PeekLock and ReceiveAndDelete? ReceiveAndDelete removes the message on delivery — one hop, fastest, but a crash loses it. PeekLock leases the message with a time-bound lock and requires explicit settlement (Complete/Abandon/DeadLetter/Defer); if the lock expires the message is redelivered and the delivery count increments. Use PeekLock for anything you can’t lose.

4. A handler sometimes runs longer than the lock and the message double-processes. Fix? Keep LockDuration short (~1 min) so a crashed consumer recovers fast, and set MaxAutoLockRenewalDuration to your worst-case handler time so a healthy slow consumer renews the lock instead of losing it. Set PrefetchCount = 0 for long handlers so buffered messages don’t hold (and lose) locks.

5. How does a message end up in the dead-letter queue, and how do you get it out? Via exceeding MaxDeliveryCount, TTL expiry (if DLQ-on-expiration is on), header-size/filter errors, or an explicit DeadLetter call. There is no in-place “resubmit” — you receive from $DeadLetterQueue, send a fresh copy to the source (preserving MessageId/SessionId), and complete the dead-lettered message only after the resend is accepted.

6. When do you choose a topic over a queue? When more than one independent reader needs the same message. A queue delivers each message to exactly one competing consumer; a topic gives every subscription its own copy, cursor, DLQ, and filters. Count independent reader groups: one → queue, many → topic.

7. CorrelationFilter vs SQLFilter — which and when? CorrelationFilter matches system and named app properties by exact equality, is indexed, and is the cheapest — prefer it for known-value routing. SQLFilter is a SQL-92-like boolean (ranges, LIKE, IN, compound logic) that is more expressive but more expensive. Neither can read the message body — only properties.

8. What’s immutable on a Service Bus entity, and why does it matter? requiresSession, requiresDuplicateDetection, and partitioning are fixed at creation. Getting them wrong means creating a new entity and migrating traffic — so decide them deliberately in IaC up front rather than discovering the need in production.

9. You see intermittent ServerBusyException on Premium under peak. What is it and what do you do? You’ve exceeded the namespace’s messaging-unit capacity; Service Bus throttles (a 429-equivalent) rather than failing, and the SDK retries with backoff. If ThrottledRequests is sustained (not spiky), scale messaging units with az servicebus namespace update --capacity; if spiky, the default retry already absorbs it.

10. How do you pick a SessionId, and what are the two failure modes? Key it on the true ordering boundary (e.g. CustomerId). A constant serializes all traffic to one consumer (a throughput cliff); a unique-per-message value gives every message its own session (no ordering at all). The right key gives per-key order while keeping high parallelism via high session cardinality.

11. How do you secure a Service Bus namespace in a regulated environment? Use managed identity with least-privilege data-plane RBAC (Sender/Receiver/Owner split), put a Premium namespace behind a Private Endpoint with public access disabled, enforce TLS (AMQP 5671), optionally bring customer-managed keys for at-rest encryption, and keep secrets out of payloads (claim-check to Key Vault/Blob).

12. A deferred message never comes back — why? Deferral sets a message aside, retrievable only by its SequenceNumber. If you didn’t persist that number durably, the message is invisible to normal receive and effectively leaked until its TTL expires. Always persist SequenceNumber (session state is a natural home) before calling DeferMessageAsync.

These map to AZ-204 (Developer Associate): develop message-based solutions (Service Bus queues/topics, sessions, dead-letter), and to AZ-305 (Solutions Architect): design message and event-driven solutions (choosing queue vs topic, Service Bus vs Event Grid vs Event Hubs). The security/network angle touches AZ-500. A compact cert map for revision:

| Question theme | Primary cert | Objective area |
|---|---|---|
| Sessions, dedup, PeekLock, DLQ | AZ-204 | Develop message-based solutions |
| Queue vs topic vs Event Grid/Hubs | AZ-305 | Design messaging & eventing |
| Filters, auto-forward, scheduled/defer | AZ-204 | Service Bus advanced features |
| Managed identity + RBAC, Private Endpoint | AZ-500 | Secure messaging; network isolation |
| MU sizing, throttling, scaling | AZ-305 | Design for scale & cost |

Quick check

  1. Your “ordered” queue is processing a customer’s events out of order. Name the two most likely SessionId mistakes and the one setting that keeps order within a session.
  2. Duplicate detection is enabled but you still see duplicate side effects. What is the most common producer bug, and what else must be idempotent for end-to-end safety?
  3. A handler occasionally runs longer than its lock and the message is processed twice. Which two settings fix this, and what value should PrefetchCount be for long handlers?
  4. How do you correctly re-drive a message out of the dead-letter queue without ever losing it on a crash?
  5. You add a SQL filter to a subscription but it still receives messages it shouldn’t. What did you forget to delete?

Answers

  1. The two mistakes: a constant SessionId (serializes everything to one consumer) and a unique-per-message SessionId (no ordering — each message is its own session). Key it on the true ordering boundary (e.g. CustomerId) and set MaxConcurrentCallsPerSession = 1 to keep order within a session.
  2. The common bug is a fresh Guid.NewGuid() MessageId per send, so each retry looks new and dedup never fires — use a deterministic business MessageId. End-to-end, the consumer’s write must also be idempotent (an upsert keyed on MessageId), because PeekLock can still redeliver past the dedup window.
  3. Keep LockDuration short (~1 min) and set MaxAutoLockRenewalDuration to the worst-case handler time so a healthy slow consumer renews rather than loses the lock. For long handlers set PrefetchCount = 0 so buffered messages don’t hold and lose locks.
  4. Receive from $DeadLetterQueue, send a fresh copy to the source first (preserving MessageId and SessionId), and complete the dead-lettered message only after the resend is accepted — so a crash in between never loses the message (worst case it’s re-sent, and dedup/idempotency absorb the repeat).
  5. The default $Default (match-all) rule — a subscription created without an explicit rule gets it, and it stays alongside your custom filter, so the subscription matches everything and your filter. Delete $Default when you add a custom rule.

Glossary

Next steps

You can now build ordered, deduplicated, dead-letter-safe messaging on Service Bus and operate it without losing messages. Build outward:
