Event-Driven Architectures with Azure Event Grid: MQTT, Routing, and Reliable Delivery

Most teams meet Event Grid as “the thing that fires a function when a blob lands” — the original product, a global push-only router built on custom topics and system topics. The newer surface, Event Grid namespaces, is a different animal: an MQTT v5 broker, a queue-like pull delivery API, namespace topics with 7-day retention, and dead-lettering to Blob Storage. For fleet telemetry ingestion or back-pressure-tolerant fan-out to slow consumers, namespaces are the tier you want, and the design decisions are not obvious.

This guide builds an end-to-end namespace system: MQTT clients publishing telemetry, messages routed into a namespace topic, and two consumer styles — push to Event Hubs and pull for a throughput-controlled worker — with retries and dead-lettering wired correctly. Every command targets the namespace tier, which behaves nothing like the basic tier you may know. And because you will return to this page mid-incident, every resource, setting, schema field, error condition and quota is laid out as a scannable table next to the prose that explains it. Read the narrative once; keep the tables open when the dead-letter count climbs at 02:00.

By the end you will stop guessing about which Event Grid surface to use, how MQTT authorization actually composes, when pull beats push, and why a subscription with no dead-letter destination is a silent data-loss bug waiting for its first regional incident. You will know the exact az command to confirm each of those, the Bicep to make it permanent, and the metric that pages you before a customer notices.

What problem this solves

You have a fleet — vehicles, meters, factory sensors, building controllers — emitting telemetry over MQTT, and a set of downstream systems that must consume it at very different rates. A Stream Analytics job wants a firehose; a compliance archive wants every byte but can tolerate minutes of lag; an alerting function wants only the 2% of readings that breach a threshold. Stitch that together with the wrong primitive and you spend the life of the system fighting it: a push-only router that cannot do back-pressure, a broker that authorizes per-device-per-topic and collapses at 50,000 clients, or a pipeline that drops events under load because nobody configured a place for poison messages to land.

Without a deliberate design, three failures recur in production. First, silent data loss: a push subscription with no dead-letter destination burns through its retry budget during a downstream outage and the events are simply gone — and because the metric that reveals this is DroppedEventCount, not DeliveryAttemptFailCount, nobody is watching it. Second, a broker you cannot manage: authorization modeled as one rule per client per topic is unworkable at fleet scale, so teams over-grant (every device can publish anywhere) and turn an IoT estate into a lateral-movement playground. Third, the wrong delivery mode: a locked-down, on-prem, or batch consumer cannot expose an HTTPS endpoint for push, so it gets bolted on with a polling shim that loses ordering and at-least-once guarantees.

Who hits this: anyone building IoT or device-telemetry ingestion on Azure, anyone fanning one event stream out to consumers with mismatched throughput, and anyone who needs durable retention or a private-network consumer. The namespace tier solves all three — MQTT broker with a scalable authorization model, pull delivery for endpoint-less consumers, and first-class dead-lettering — but only if you pick it deliberately and wire every knob. This article is that wiring, enumerated end to end.

To frame the field before the deep dive, here is the problem space — each failure class, the question it forces, and the first place to look:

Problem class	What you observe	First question to ask	First place to look	Most common single cause
Wrong resource chosen	Fighting the platform for a feature it lacks	Do I need MQTT, pull, or 7-day retention?	The capability matrix below	Picked basic custom topic; needed a namespace
Client can’t connect/publish	MQTT CONNECT or PUBLISH silently denied	Is there a permission binding for this client group?	`permission-binding list`	Default-deny with no binding
Events never leave the broker	MQTT works, nothing downstream fires	Is routing configured and reachable?	`routeTopicResourceId`	Routing unset or public access off
Consumer overwhelmed	Push consumer 429s / falls behind	Reactive endpoint or back-pressure?	Delivery mode of the subscription	Push to a slow consumer; needed pull
Silent data loss	`DroppedEventCount` > 0	Does every subscription have a DLQ?	Subscription dead-letter config	No dead-letter destination
Replay impossible	Dead-letter blobs unusable	Do blobs carry the failure reason?	`deadletterProperties` in the blob	Not reading `deadletterreason`

Learning objectives

By the end of this article you can:

Choose deliberately between system topics, custom (basic) topics, and namespace topics by mapping each requirement (MQTT, pull, retention, throughput, destination set) to the surface that supports it.
Stand up an MQTT v5 broker on an Event Grid namespace and model fleet-scale authorization with clients, client groups, topic spaces, and permission bindings under a default-deny posture.
Configure routing so MQTT messages are wrapped in CloudEvents and published into a namespace (or custom) topic, including the identity model for same-namespace versus cross-resource targets.
Decide between push and pull delivery per consumer, and explain exactly when pull wins (endpoint-less, back-pressure, private link, scheduled drain).
Filter server-side with subject filters and advanced filters so a consumer only receives — and only pays for — the events it matches.
Wire retries, batching, and dead-lettering to Blob Storage correctly, and reason about maxDeliveryCount, eventTimeToLive, receiveLockDurationInSeconds, and deliveryRetryPeriodInDays.
Operate the system: route delivery metrics to Log Analytics, alert on DroppedEventCount (not just failures), and run a replay job that rehydrates dead-lettered events by deadletterreason.
Diagnose the common failure modes — denied clients, dead routing, push 403s, lock-expiry redelivery loops, silent drops — from symptom to confirmed root cause to fix.

Prerequisites & where this fits

You should be comfortable with the Azure CLI (az), reading and editing JSON, and ARM/Bicep basics. Conceptually you need the publish/subscribe model (producers emit events; subscribers receive copies independently), a working idea of MQTT (a broker, topics as hierarchical strings, QoS levels), and managed identity (an Azure resource authenticating to another via RBAC instead of secrets). Knowing what CloudEvents 1.0 is — a vendor-neutral envelope with id, source, subject, type, time, data — will make the filtering section land immediately.

This sits in the Messaging & Event-Driven track. It is downstream of the broad eventing-versus-messaging decision and upstream of the consumer-side pipelines. Event Grid namespaces are an ingestion and fan-out layer; what you do after the fan-out is a different tool. If your push target is Event Hubs feeding analytics, continue with Azure Event Hubs: Kafka, Capture, Stream Analytics & Throughput Scaling. If you need ordered, transactional, command-style messaging instead of event fan-out, that is Azure Service Bus: Sessions, De-duplication & Dead-Letter Patterns. The pull worker and replay jobs are typically Functions — see Azure Functions: Serverless Patterns and, for orchestrated replay, Azure Durable Functions: Orchestration & Fan-Out Patterns. For the device side of an MQTT estate, Azure IoT Hub, DPS, Edge & Digital Twins Fundamentals is the sibling story. Dead-lettering lands in Blob, so Azure Blob Storage: Lifecycle, Immutability & Soft Delete governs how long those forensics live.

A quick map of who owns what during an incident, so you escalate to the right person fast:

Layer	What lives here	Who usually owns it	Failure classes it can cause
Device / fleet	X.509 certs, MQTT client, QoS	Device / firmware team	CONNECT failures, bad payloads, clock skew
MQTT broker	Topic spaces, permission bindings	Platform / messaging	Default-deny denials, topic-template mismatch
Routing	`routeTopicResourceId`, routing identity	Platform	Events stuck in broker; CloudEvents wrap issues
Namespace topic	Retention, throughput, subscriptions	Platform	Throttling at ingress cap; retention expiry
Delivery (push)	Destination, delivery identity	Platform + consumer	403 to target, dead-letter flood
Delivery (pull)	Lock duration, delivery count	Consumer	Lock-expiry redelivery loops
Dead-letter	Blob container, namespace MI	Platform + storage	DroppedEventCount, replay forensics

Core concepts

Five mental models make every later decision obvious.

There are two Event Grids, and they share a name, not a runtime. The classic surface (system topics, custom/basic topics, domains, partner topics) is a global, push-only HTTP router optimized for reacting to Azure resource events and app events. The namespace surface is a regional resource that adds an MQTT broker, a pull delivery queue API, and namespace topics with 7-day retention. They are provisioned, priced, and reasoned about differently. The single most consequential early decision in this whole article is which surface, because picking wrong means re-platforming later.

MQTT authorization composes; it is not per-device. You never write “device 7 may publish to topic X.” Instead you register a client, bucket it with other clients via a client group (a query over client attributes), define a topic space (a set of MQTT topic templates), and grant the group Publisher or Subscriber rights on the space via a permission binding. The leverage is the ${client.authenticationName} variable in a topic template: one template scopes every client to its own subtree without one rule per device. The posture is default-deny — no binding means no access — which is the only correct posture for an IoT fleet.

MQTT and the rest of Azure are bridged by routing, not magic. Messages published to the broker live inside the broker. To reach Functions, Event Hubs, Storage, or any subscriber, you configure routing: every MQTT message is wrapped in a CloudEvents 1.0 envelope (the original MQTT topic becomes subject, the payload becomes data) and published to exactly one topic you nominate. From there, event subscriptions take over. No routing, no downstream — the broker is an island until you build the bridge.

Delivery has two opposite shapes. Push registers a destination and Event Grid sends events to it as they arrive — reactive, zero-polling, but the consumer must be reachable and must absorb the offered rate. Pull inverts control: the consumer connects and receives events with queue semantics (receive, then acknowledge / release / reject), so a struggling consumer simply slows its cadence. Push is for reachable, reactive consumers; pull is for endpoint-less, back-pressure-sensitive, private-network, or scheduled consumers. This fork defines your consumer architecture.

Reliable delivery is three coordinated knobs plus a graveyard. How hard Event Grid retries (maxDeliveryCount, with exponential backoff), how long an event stays eligible for delivery (eventTimeToLive), and how long a pull consumer holds its lock (receiveLockDurationInSeconds) are the knobs; the dead-letter container in Blob Storage, where poison events go to die, is the graveyard. Together they are one system. Get the retries right but skip the dead-letter destination and you do not get errors; you get silent loss, surfaced only by DroppedEventCount. Dead-lettering is not optional hardening; it is the difference between a five-minute replay and an afternoon of forensics, and on a contractual-retention workload it is the difference between compliant and not.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters
Namespace	Regional Event Grid resource hosting broker + topics	Subscription / resource group	The tier that unlocks MQTT, pull, retention
MQTT broker	v3.1.1 / v5 pub-sub broker	`topicSpacesConfiguration` on the namespace	Device ingestion front door
Client	One registry entry per device/app	Under the namespace	The authenticated identity that connects
Client group	A query bucketing clients by attribute	Under the namespace	Scales authorization without per-device rules
Topic space	A set of MQTT topic templates	Under the namespace	The scope a binding grants rights over
Permission binding	Grants a group Pub/Sub on a space	Under the namespace	The only thing that lifts default-deny
Namespace topic	Durable topic with 7-day retention	Under the namespace	Where routed events land; fans out
Routing	Wraps MQTT → CloudEvents → a topic	`topicSpacesConfiguration`	Bridges broker to the rest of Azure
Event subscription	A consumer’s filtered view of a topic	Under the topic	One independent copy per subscription
Push delivery	Event Grid sends to a destination	Subscription `deliveryMode: Push`	Reactive, reachable consumers
Pull delivery	Consumer receives with queue semantics	Subscription `deliveryMode: Queue`	Back-pressure, endpoint-less consumers
Dead-letter	Undeliverable events written to Blob	Subscription DLQ config + MI	Prevents silent loss; enables replay
CloudEvents	Vendor-neutral event envelope	Every namespace-topic event	The schema you filter on

Namespaces vs. custom topics vs. system topics

Pick the wrong resource and you will fight the platform for the life of the system. The three are not interchangeable, and the differences are not cosmetic — they are about whole capabilities (MQTT, pull, retention) that one surface has and another simply does not.

Capability	System topic	Custom topic (basic)	Namespace topic
Event source	Azure services (Blob, Resource Groups, etc.)	Your app	Your app
MQTT broker	No	No	Yes
Pull delivery	No	No	Yes
Push to Event Hubs	Yes	Yes	Yes
Push to Functions, Service Bus, Storage queues, webhooks	Yes	Yes	Not yet (Event Hubs only today)
Schema	EventGridSchema / CloudEvents	EventGridSchema / CloudEvents	CloudEvents 1.0 JSON only
Max throughput (ingress / egress)	~5 MB/s	~5 MB/s	40 MB/s / 80 MB/s
Retention	Best-effort, 24h retry	24h retry	7 days
Scope	Global	Global	Regional
Subscribe to Azure service events	Yes	No	No

The key trade-off: namespace topics give you MQTT, pull delivery, high throughput, and durable retention, but the push destination set is still narrower than basic (Event Hubs only at time of writing — more are rolling out). A common production shape is therefore MQTT into a namespace topic, push to Event Hubs, then Event Hubs fans out to Stream Analytics, Functions, or Fabric. Namespace topics also accept only CloudEvents 1.0 JSON — no proprietary EventGridSchema.

Map your requirement to the surface with a decision table — find your row and stop:

If you need…	Then use…	Because…
To react to Blob/Resource-Group/Azure events	System topic	Only it subscribes to Azure service events
Simple app-to-handler push, broad destinations	Custom (basic) topic	Widest push destination set, global, simple
An MQTT broker for a device fleet	Namespace topic	Only it speaks MQTT
Pull (back-pressure / endpoint-less consumer)	Namespace topic	Only it offers queue-style pull
7-day durable retention	Namespace topic	Basic/system retry for ~24h only
40/80 MB/s throughput	Namespace topic	Basic/system cap near ~5 MB/s
Push to Functions or Service Bus today	Custom / system topic	Namespace push is Event Hubs-only for now
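
The decision table can also be read as a small function. The following is only an illustrative Python sketch: the `Requirements` fields and the `choose_surfaces` helper are hypothetical names invented for this example, not part of any Azure SDK, and the thresholds simply restate the rows of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Hypothetical checklist; field names are illustrative, not an Azure API."""
    azure_service_events: bool = False      # react to Blob / resource-group events
    mqtt_broker: bool = False               # device fleet speaking MQTT
    pull_delivery: bool = False             # back-pressure or endpoint-less consumer
    retention_days: int = 1                 # how long unconsumed events must survive
    ingress_mb_per_s: float = 1.0           # sustained publish rate
    push_to_functions_or_service_bus: bool = False


def choose_surfaces(req: Requirements) -> set:
    """Return every Event Grid surface the decision table points to."""
    surfaces = set()
    if req.azure_service_events:
        surfaces.add("system topic")
    needs_namespace = (
        req.mqtt_broker
        or req.pull_delivery
        or req.retention_days > 1       # basic/system topics only retry for about 24h
        or req.ingress_mb_per_s > 5     # basic/system topics cap near ~5 MB/s
    )
    if needs_namespace:
        surfaces.add("namespace topic")
    # Namespace push is Event Hubs-only for now, so a Functions or Service Bus
    # target next to a namespace requirement implies a custom topic as well.
    if req.push_to_functions_or_service_bus and (needs_namespace or not surfaces):
        surfaces.add("custom topic")
    if not surfaces:
        surfaces.add("custom topic")    # simple app-to-handler push
    return surfaces
```

A fleet that needs MQTT and a Service Bus consumer comes back with both a namespace topic and a custom topic, which matches the routing-to-custom-topic shape described later in the article.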

A few hard boundaries that catch people, stated as rules:

Boundary	The rule	Consequence if ignored
Namespace topics host only your events	No system/domain/partner topics inside	Can’t get Blob-created events from a namespace
Namespace schema	CloudEvents 1.0 JSON only	EventGridSchema producers are rejected
Namespace topics can’t subscribe to Azure events	They carry your events only	Use a system topic for resource events
MQTT requires opt-in	`topicSpacesConfiguration.state = Enabled`	Without it you get pull-only, no broker
Region	Namespace is regional, not global	Plan for region affinity / DR explicitly

Namespace topics cannot host system topics, domain topics, or partner topics, and they cannot subscribe to Azure service events. They carry your events only. If you need Blob-created events, that is still a system topic.

Create the namespace with both MQTT and a system-assigned identity (you will need the identity for routing and dead-letter):

RG=rg-eventing
LOC=eastus
NS=egns-telemetry

az eventgrid namespace create \
  --resource-group $RG \
  --name $NS \
  --location $LOC \
  --topic-spaces-configuration "{state:Enabled}" \
  --identity "{type:SystemAssigned}"

The same provisioning as Bicep, so it is reviewed and repeatable:

resource ns 'Microsoft.EventGrid/namespaces@2024-06-01-preview' = {
  name: 'egns-telemetry'
  location: 'eastus'
  identity: { type: 'SystemAssigned' }
  sku: { name: 'Standard', capacity: 1 }   // throughput units scale ingress/egress
  properties: {
    topicSpacesConfiguration: {
      state: 'Enabled'                       // turns ON the MQTT broker
      maximumClientSessionsPerAuthenticationName: 1
    }
    publicNetworkAccess: 'Enabled'           // broker reachable; lock down consumers, not the broker
  }
}

Setting topicSpacesConfiguration.state to Enabled is what turns on the MQTT broker; without it you get a pull-delivery-only namespace. The throughput-unit capacity on the SKU scales the ingress/egress ceilings, so size it to the fleet (covered in Cost & sizing below).

MQTT broker: clients, topic spaces, and permission bindings

The broker speaks MQTT v3.1.1 and v5 (and both over WebSocket). QoS 0 and 1 are supported; QoS 2 is not. Authorization is not per-client-per-topic, because that model is unmanageable at fleet scale. Instead you compose four resources:

Clients — one registry entry per device/app, keyed by an authentication name (an X.509 cert subject / thumbprint, or a Microsoft Entra identity).
Client groups — a query over client attributes that buckets clients (e.g. all building == "b12" sensors).
Topic spaces — a set of MQTT topic templates (e.g. devices/${client.authenticationName}/telemetry).
Permission bindings — grant a client group Publisher or Subscriber rights on a topic space.

Here is what each resource is and how the four chain together — the table is the model, the prose below is the why:

Resource	What it represents	Keyed / defined by	Grants nothing by itself?
Client	One device or app identity	Authentication name (cert / Entra)	Correct — needs a binding
Client group	A bucket of clients	A query over client attributes	Correct — needs a binding
Topic space	A set of topic templates	One or more MQTT topic patterns	Correct — needs a binding
Permission binding	The actual grant	(client group, topic space, Pub/Sub)	This is the only grant

The MQTT protocol surface — what the broker supports and what it refuses — so you size client expectations correctly:

MQTT feature	Supported?	Notes / limit
MQTT v3.1.1	Yes	Classic broker protocol
MQTT v5	Yes	Properties, reason codes, topic aliases
WebSocket transport	Yes	v3.1.1 and v5 over WSS
QoS 0 (at most once)	Yes	Fire-and-forget
QoS 1 (at least once)	Yes	PUBACK-confirmed
QoS 2 (exactly once)	No	Use QoS 1 + idempotent consumers
Retained messages	Yes (bounded)	Per broker limits
Last Will & Testament (LWT)	Yes	v3.1.1 and v5
Shared subscriptions	Yes (v5)	Group subscriber load balancing
User properties (v5)	Yes	Carried into the routed CloudEvent
Request/response (v5)	Yes	Response-topic + correlation-data
Session expiry / clean start	Yes	Controls reconnect state retention
TLS port	8883 (MQTT), 443 (WSS)	Plaintext 1883 not offered

Client authentication options, with the trade-off of each:

Auth method	How it works	Best for	Trade-off
X.509 CA-signed	Device cert chains to a registered CA	Large fleets, cert lifecycle via PKI	Need a CA + issuance pipeline
X.509 thumbprint	Pin exact allowed thumbprints on the client	Small/known device sets	Rotation means editing the client
Microsoft Entra (JWT)	OAuth token validated by the broker	Apps / services, not constrained devices	Token acquisition on the device side

Register one sensor with thumbprint authentication, plus the attributes that client-group queries will match on:

az eventgrid namespace client create \
  --resource-group $RG \
  --namespace-name $NS \
  --client-name sensor-0007 \
  --authentication-name sensor-0007 \
  --state Enabled \
  --client-certificate-authentication "{validationScheme:ThumbprintMatch,allowedThumbprints:[A1B2C3D4E5F6...]}" \
  --attributes "{building:'b12',role:'sensor'}"

The client resource as Bicep, so the fleet registry is declarative:

resource client 'Microsoft.EventGrid/namespaces/clients@2024-06-01-preview' = {
  parent: ns
  name: 'sensor-0007'
  properties: {
    authenticationName: 'sensor-0007'
    state: 'Enabled'
    clientCertificateAuthentication: {
      validationScheme: 'ThumbprintMatch'
      allowedThumbprints: [ 'A1B2C3D4E5F6...' ]
    }
    attributes: { building: 'b12', role: 'sensor' }   // these power client-group queries
  }
}

The client-resource settings you actually set, with defaults and gotchas:

Setting	What it does	Default	Valid values	Gotcha
`authenticationName`	The name the cert/Entra identity presents	client name	string	Must match the cert subject/SAN or token claim
`state`	Enabled / Disabled	Enabled	`Enabled` / `Disabled`	Disabling instantly drops the session
`validationScheme`	How the cert is validated	`SubjectMatchesAuthenticationName`	`ThumbprintMatch` / `DnsMatchesAuthenticationName` / `Rfc822...` / `UriMatches...`	Thumbprint pinning breaks on cert rotation
`allowedThumbprints`	Pinned cert thumbprints	—	up to 2	Rotation requires editing here
`attributes`	Key/value tags on the client	—	string map	The only thing client-group queries can filter on

Define a topic space whose template scopes each device to its own subtree, then create a client group that selects the sensors:

az eventgrid namespace topic-space create \
  --resource-group $RG \
  --namespace-name $NS \
  --name ts-telemetry \
  --topic-templates "devices/\${client.authenticationName}/telemetry/#"

az eventgrid namespace client-group create \
  --resource-group $RG \
  --namespace-name $NS \
  --name cg-sensors \
  --group-query "attributes.role = 'sensor'"

MQTT topic templates support specific wildcards and one powerful variable — get these right or every device shares one scope:

Template token	Meaning	Example	Effect
`${client.authenticationName}`	Substituted per connecting client	`devices/${client.authenticationName}/#`	Each device scoped to its own subtree
`+` (single-level)	Matches one topic level	`devices/+/telemetry`	Any single device id at that level
`#` (multi-level)	Matches the rest of the tree	`devices/sensor-0007/#`	All subtopics under the device
Literal segment	Exact match	`commands/firmware`	Only that exact topic
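
To make the tokens in the table concrete, here is a minimal Python sketch of how a template could be evaluated for one client. It assumes standard MQTT filter semantics for `+` and `#`; how the broker evaluates templates internally is not documented here, and `expand_template`, `topic_matches`, and `in_topic_space` are hypothetical helper names.

```python
CLIENT_VARIABLE = "${client.authenticationName}"


def expand_template(template: str, authentication_name: str) -> str:
    """Substitute the per-client variable, yielding a plain MQTT topic filter."""
    return template.replace(CLIENT_VARIABLE, authentication_name)


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Standard MQTT filter matching: '+' is one level, '#' is the remainder."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for index, level in enumerate(filter_levels):
        if level == "#":
            # '#' is only valid as the final level and matches everything below
            return index == len(filter_levels) - 1
        if index >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[index]:
            return False
    return len(filter_levels) == len(topic_levels)


def in_topic_space(template: str, authentication_name: str, topic: str) -> bool:
    """True when the topic falls inside the template as seen by this client."""
    return topic_matches(expand_template(template, authentication_name), topic)


template = "devices/${client.authenticationName}/telemetry/#"
own = in_topic_space(template, "sensor-0007", "devices/sensor-0007/telemetry/temp")
other = in_topic_space(template, "sensor-0007", "devices/sensor-0008/telemetry/temp")
```

With the variable in place, `own` is true and `other` is false: the same template admits each client only under its own device id.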

The ${client.authenticationName} variable is the whole point: a single topic space template gives each client publish rights to only its own topic, without one binding per device. Bind publish permission:

az eventgrid namespace permission-binding create \
  --resource-group $RG \
  --namespace-name $NS \
  --name pb-sensors-pub \
  --client-group-name cg-sensors \
  --topic-space-name ts-telemetry \
  --permission Publisher

The permission-binding resource as Bicep, alongside its options:

resource pbPub 'Microsoft.EventGrid/namespaces/permissionBindings@2024-06-01-preview' = {
  parent: ns
  name: 'pb-sensors-pub'
  properties: {
    clientGroupName: 'cg-sensors'
    topicSpaceName: 'ts-telemetry'
    permission: 'Publisher'    // or 'Subscriber'
  }
}

The two permissions and what each actually allows:

Permission	MQTT verbs allowed	Use it for	Pair it with
Publisher	CONNECT + PUBLISH to the space	Sensors emitting telemetry	A Subscriber binding for consumers
Subscriber	CONNECT + SUBSCRIBE to the space	Apps/devices receiving commands	A Publisher binding for the producer side

A client may not connect, publish, or subscribe to anything until a permission binding explicitly allows it. Default-deny is the security posture, and it is correct for IoT. The most common modeling mistakes here, and what each causes:

Modeling mistake	Symptom	Fix
No permission binding for the group	CONNECT/PUBLISH silently denied	Add a Publisher/Subscriber binding
Topic space too broad (no `${...}` var)	Every device can publish to every topic	Scope the template per `authenticationName`
Client group query references missing attribute	Client never lands in the group	Set the attribute on the client; queries see only attributes
Publisher binding but device subscribes	SUBSCRIBE denied	Add a Subscriber binding for the consume side
Thumbprint auth + rotated cert	CONNECT fails after rotation	Move to CA-signed validation, or update thumbprints
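
The way the four resources chain under default-deny can be sketched as a local model. This is a simplification written for illustration, not the broker's implementation: the class names and `is_allowed` are hypothetical, the client-group query is reduced to a single `attributes.<key> = '<value>'` comparison, and topic matching assumes standard MQTT wildcard rules.

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    authentication_name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class ClientGroup:
    # Models only queries of the form: attributes.<key> = '<value>'
    name: str
    key: str
    value: str

    def contains(self, client: Client) -> bool:
        return client.attributes.get(self.key) == self.value


@dataclass
class PermissionBinding:
    client_group: str
    topic_space: str
    permission: str          # "Publisher" or "Subscriber"


def _matches(topic_filter: str, topic: str) -> bool:
    """Compact MQTT filter match ('+' one level, '#' the remainder)."""
    f, t = topic_filter.split("/"), topic.split("/")
    for i, level in enumerate(f):
        if level == "#":
            return i == len(f) - 1
        if i >= len(t) or (level != "+" and level != t[i]):
            return False
    return len(f) == len(t)


def is_allowed(client, permission, topic, groups, topic_spaces, bindings) -> bool:
    """Default-deny: access exists only if some binding grants it."""
    for binding in bindings:
        if binding.permission != permission:
            continue
        if not groups[binding.client_group].contains(client):
            continue
        for template in topic_spaces[binding.topic_space]:
            scoped = template.replace("${client.authenticationName}",
                                      client.authentication_name)
            if _matches(scoped, topic):
                return True
    return False   # no binding matched, so the request is denied


groups = {"cg-sensors": ClientGroup("cg-sensors", "role", "sensor")}
topic_spaces = {"ts-telemetry": ["devices/${client.authenticationName}/telemetry/#"]}
bindings = [PermissionBinding("cg-sensors", "ts-telemetry", "Publisher")]
sensor = Client("sensor-0007", {"building": "b12", "role": "sensor"})
```

Each row of the mistakes table maps to a `False` from `is_allowed`: an empty `bindings` list, a client whose attributes do not satisfy the group, or a `Subscriber` request against a `Publisher`-only binding.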

Routing MQTT messages into a topic

MQTT messages live inside the broker. To get them into the rest of Azure, configure routing: every message is wrapped in a CloudEvents envelope and published to one namespace topic (or custom topic) you nominate. From there, event subscriptions take over.

First create the destination namespace topic:

az eventgrid namespace topic create \
  --resource-group $RG \
  --namespace-name $NS \
  --name mqtt-ingest

The namespace-topic settings that govern durability and fan-out:

Setting	What it controls	Default	Range / values	When to change
`eventRetentionInDays`	How long unconsumed events persist	1	1–7	Raise for slow/batch consumers needing replay window
`inputSchema`	Accepted schema	`CloudEventSchemaV1_0`	CloudEvents 1.0 only	Fixed — namespace topics are CloudEvents-only
`publisherType`	Who publishes to it	`Custom`	`Custom`	Your events (incl. routed MQTT)
(subscriptions)	Independent consumer views	—	up to the per-topic cap	Each gets its own copy of every event

Routing is set on the namespace’s topicSpacesConfiguration and is most reliably applied as a properties object via az resource. The two fields that matter are routeTopicResourceId (where messages land) and routingIdentityInfo (which identity authenticates the publish — for a namespace topic in the same namespace, None works because no cross-resource role assignment is needed):

{
  "properties": {
    "topicSpacesConfiguration": {
      "state": "Enabled",
      "routeTopicResourceId": "/subscriptions/<SUB>/resourceGroups/rg-eventing/providers/Microsoft.EventGrid/namespaces/egns-telemetry/topics/mqtt-ingest",
      "routingIdentityInfo": { "type": "None" }
    }
  }
}

az resource patch \
  --ids "/subscriptions/<SUB>/resourceGroups/$RG/providers/Microsoft.EventGrid/namespaces/$NS" \
  --is-full-object \
  --properties @routing.json

The routing configuration fields, end to end:

Field	What it does	Same-namespace value	Cross-resource value
`routeTopicResourceId`	The single topic MQTT messages route to	The namespace topic’s resource id	A custom topic’s resource id
`routingIdentityInfo.type`	Which identity authenticates the publish	`None` (no role needed)	`SystemAssigned` / `UserAssigned`
`routingEnrichments`	Static/dynamic attributes added to events	optional	optional
Public network access	Broker reachability for routing to fire	must remain reachable	must remain reachable

The two routing targets and their requirements side by side:

Target	When to use	Schema requirement	Region	Identity / role needed
Namespace topic (same NS)	Default; keeps everything in one namespace	CloudEvents 1.0	same namespace	`None`
Custom topic	Reach a push destination namespace topics lack (e.g. Service Bus)	CloudEvents v1.0	same region as broker	`SystemAssigned` + EventGrid Data Sender
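
If you generate the routing body rather than hand-editing it, the identity choice in the table reduces to one comparison of resource ids. The Python below is a sketch under that assumption; `routing_identity_type` and `routing_payload` are hypothetical helpers, and the role assignment a custom-topic target needs (EventGrid Data Sender) is not performed here.

```python
import json


def routing_identity_type(namespace_id: str, route_topic_id: str) -> str:
    """'None' for a topic inside the same namespace, else 'SystemAssigned'."""
    prefix = namespace_id.lower() + "/topics/"
    return "None" if route_topic_id.lower().startswith(prefix) else "SystemAssigned"


def routing_payload(namespace_id: str, route_topic_id: str) -> dict:
    """Build the full-object body with the two routing fields set."""
    return {
        "properties": {
            "topicSpacesConfiguration": {
                "state": "Enabled",
                "routeTopicResourceId": route_topic_id,
                "routingIdentityInfo": {
                    "type": routing_identity_type(namespace_id, route_topic_id)
                },
            }
        }
    }


# <SUB> is left as a placeholder on purpose; substitute your subscription id.
NAMESPACE_ID = ("/subscriptions/<SUB>/resourceGroups/rg-eventing"
                "/providers/Microsoft.EventGrid/namespaces/egns-telemetry")
payload = routing_payload(NAMESPACE_ID, NAMESPACE_ID + "/topics/mqtt-ingest")
routing_json = json.dumps(payload, indent=2)   # text you could save as routing.json
```

The resource-id prefix check is a convenience for this sketch only; confirm the identity type against the target you actually route to.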

If you route to a custom topic instead (to reach a push destination namespace topics do not yet support, like Service Bus), the topic must use CloudEvents v1.0, sit in the same region, and have the namespace identity granted the EventGrid Data Sender role — set routingIdentityInfo.type to SystemAssigned. Disabling public network access on the namespace breaks routing, so plan private networking on the consumer side, not the broker.

When the broker wraps a message, the CloudEvent’s subject carries the original MQTT topic and data carries the payload — exactly what you filter on next. Here is the field-by-field mapping from an MQTT PUBLISH to the emitted CloudEvent:

CloudEvent field	Populated from	Example value
`specversion`	Fixed	`1.0`
`id`	Generated per event	`B688-1234-1235`
`source`	The namespace	`egns-telemetry`
`subject`	The original MQTT topic	`devices/sensor-0007/telemetry/temp`
`type`	Broker event type	`MQTT.EventPublished`
`time`	Broker receive time	`2026-06-08T17:31:00Z`
`data`	The MQTT payload	`{ "celsius": 91.4, "battery": 0.62 }`
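
The broker performs this wrapping server-side, so you never write it yourself. For building test fixtures, though, a local imitation of the mapping is handy. The Python sketch below follows the table field by field; `wrap_mqtt_publish` is a hypothetical helper, and it assumes the MQTT payload is UTF-8 JSON, which the service does not require.

```python
import json
import uuid
from datetime import datetime, timezone


def wrap_mqtt_publish(namespace_name, mqtt_topic, payload,
                      received_at=None, event_id=None):
    """Build a CloudEvents 1.0 dict that mirrors the mapping table."""
    received_at = received_at or datetime.now(timezone.utc)
    return {
        "specversion": "1.0",
        "id": event_id or str(uuid.uuid4()),
        "source": namespace_name,
        "subject": mqtt_topic,                 # the original MQTT topic
        "type": "MQTT.EventPublished",
        "time": received_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "data": json.loads(payload),           # assumes a UTF-8 JSON payload
    }


fixture = wrap_mqtt_publish(
    "egns-telemetry",
    "devices/sensor-0007/telemetry/temp",
    b'{"celsius": 91.4, "battery": 0.62}',
    received_at=datetime(2026, 6, 8, 17, 31, 0, tzinfo=timezone.utc),
    event_id="B688-1234-1235",
)
```

Fixtures built this way can be fed to filter logic and consumer code in unit tests without a live namespace.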

Push vs. pull delivery, and when pull wins

This is the design fork that defines your consumer architecture.

Push delivery registers a destination in the subscription, and Event Grid POSTs (or AMQP-sends) events to it as they arrive. It is reactive and zero-polling, but the consumer must expose a reachable endpoint and absorb whatever rate Event Grid pushes (within batching limits).

Pull delivery inverts control: the consumer connects to Event Grid and receives events with queue-like semantics — receive, then acknowledge, release, or reject. Reach for pull when:

The consumer cannot expose an endpoint (locked-down network, batch job, on-prem worker).
You need back-pressure — a struggling consumer slows its receive cadence instead of being overwhelmed.
You need a private link to consume over private IP space (push cannot do this).
You want to process at a chosen time (overnight batch) rather than as events occur.

The two modes compared on every axis that drives the decision:

Axis	Push (`deliveryMode: Push`)	Pull (`deliveryMode: Queue`)
Control direction	Event Grid → consumer	Consumer → Event Grid
Consumer must expose endpoint?	Yes	No
Back-pressure	No (consumer absorbs offered rate)	Yes (consumer paces `receive`)
Private link to consume	No	Yes
Destinations today	Event Hubs (namespace topics)	Any pull client (SDK / REST)
Acknowledgement model	HTTP/AMQP delivery result	`acknowledge` / `release` / `reject`
Best for	Reactive, reachable services	Batch, on-prem, throttled, private
Redelivery control	Backoff + non-retryable 4xx	`receiveLockDuration` + `maxDeliveryCount`

The pull lifecycle verbs and what each does to the lock:

Verb	Effect	Use when
`receive`	Locks N events for the lock duration	Pulling a batch to process
`acknowledge`	Permanently removes the event	Processing succeeded
`release`	Returns the event immediately for redelivery	Transient failure; retry now
`reject`	Drops/dead-letters per policy	Poison event; don’t retry
`renewLock`	Extends the lock	Processing legitimately takes longer
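
The effect of each verb on the lock is easiest to see in a toy model. The Python class below is an in-memory stand-in written for this article, not the Event Grid SDK or REST API: `PullQueueModel` and its method names are hypothetical, time is passed in explicitly so the behavior is deterministic, and `reject` is assumed to dead-letter (that is, a dead-letter destination is configured).

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class _Entry:
    event: dict
    delivery_count: int = 0
    locked_until: float = 0.0
    lock_token: Optional[str] = None


class PullQueueModel:
    """In-memory stand-in for a pull subscription; not the Event Grid SDK."""

    def __init__(self, lock_seconds: float = 60, max_delivery_count: int = 5):
        self.lock_seconds = lock_seconds
        self.max_delivery_count = max_delivery_count
        self.dead_lettered = []
        self._entries = []
        self._tokens = itertools.count(1)

    def publish(self, event: dict) -> None:
        self._entries.append(_Entry(event))

    def receive(self, max_events: int, now: float) -> list:
        """Lock up to max_events available events; returns (token, count, event)."""
        batch = []
        for entry in list(self._entries):
            if len(batch) >= max_events:
                break
            if entry.locked_until > now:
                continue                      # still locked by an earlier receive
            if entry.delivery_count >= self.max_delivery_count:
                self._entries.remove(entry)   # delivery budget spent
                self.dead_lettered.append(entry.event)
                continue
            entry.delivery_count += 1
            entry.locked_until = now + self.lock_seconds
            entry.lock_token = f"lock-{next(self._tokens)}"
            batch.append((entry.lock_token, entry.delivery_count, entry.event))
        return batch

    def _locked(self, token: str, now: float) -> _Entry:
        for entry in self._entries:
            if entry.lock_token == token and entry.locked_until > now:
                return entry
        raise KeyError("lock token unknown or expired")

    def acknowledge(self, token: str, now: float) -> None:
        self._entries.remove(self._locked(token, now))      # gone for good

    def release(self, token: str, now: float) -> None:
        entry = self._locked(token, now)
        entry.locked_until, entry.lock_token = now, None    # available again

    def reject(self, token: str, now: float) -> None:
        entry = self._locked(token, now)
        self._entries.remove(entry)
        self.dead_lettered.append(entry.event)              # poison event

    def renew_lock(self, token: str, now: float) -> None:
        self._locked(token, now).locked_until = now + self.lock_seconds
```

Note that an expired lock and an explicit `release` both lead to redelivery with a higher delivery count, which is why a slow worker and a failing worker can look identical from the queue's side.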

A push subscription to Event Hubs (the supported namespace push destination today):

az eventgrid namespace topic event-subscription create \
  --resource-group $RG \
  --namespace-name $NS \
  --topic-name mqtt-ingest \
  --name sub-eventhubs \
  --delivery-configuration '{
    "deliveryMode": "Push",
    "push": {
      "deliveryWithResourceIdentity": {
        "identity": { "type": "SystemAssigned" },
        "destination": {
          "endpointType": "EventHub",
          "properties": {
            "resourceId": "/subscriptions/<SUB>/resourceGroups/rg-eventing/providers/Microsoft.EventHub/namespaces/ehns-telemetry/eventhubs/telemetry"
          }
        }
      }
    }
  }'

A pull subscription is just deliveryMode: Queue:

az eventgrid namespace topic event-subscription create \
  --resource-group $RG \
  --namespace-name $NS \
  --topic-name mqtt-ingest \
  --name sub-worker \
  --delivery-configuration '{
    "deliveryMode": "Queue",
    "queue": {
      "receiveLockDurationInSeconds": 60,
      "maxDeliveryCount": 5,
      "eventTimeToLive": "P1D"
    }
  }'

receiveLockDurationInSeconds is the window in which a received event must be acknowledged before it becomes available again; maxDeliveryCount caps redeliveries before the event is dead-lettered or dropped. The full pull-queue setting matrix:

Setting	What it controls	Default	Range	When to change	Trade-off
`receiveLockDurationInSeconds`	Lock window before redelivery	60	60–300	Slow processing per event	Too long delays retry of genuinely failed events
`maxDeliveryCount`	Attempts before dead-letter	10	1–10	Fewer retries for fast-fail	Too low dead-letters transient blips
`eventTimeToLive`	Wall-clock ceiling (ISO-8601)	topic retention	up to 7 days	Cap staleness	Too short drops valid-but-late events
Max receive batch	Events per `receive`	per SDK	bounded	Tune throughput vs lock pressure	Bigger batch + slow worker → lock expiry

CloudEvents, advanced filters, and subject-based routing

Namespace topics are CloudEvents-native, so filtering keys off CloudEvents attributes and into the data payload. A receive response nests each CloudEvent under event alongside brokerProperties (the lock token and delivery count):

{
  "value": [
    {
      "brokerProperties": { "lockToken": "CiYK...", "deliveryCount": 1 },
      "event": {
        "specversion": "1.0",
        "id": "B688-1234-1235",
        "source": "egns-telemetry",
        "subject": "devices/sensor-0007/telemetry/temp",
        "type": "MQTT.EventPublished",
        "time": "2026-06-08T17:31:00Z",
        "data": { "celsius": 91.4, "battery": 0.62 }
      }
    }
  ]
}

brokerProperties is the operational half of the envelope — the two fields you watch:

`brokerProperties` field	Meaning	Why you watch it
`lockToken`	Opaque handle for ack/release/reject	Required to acknowledge a specific event
`deliveryCount`	How many times this event was delivered	Climbing count = lock-expiry or release loop
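A worker can read both fields straight off the receive response. This sketch parses a response body shaped like the example above and flags events that are on their final attempt. `near_delivery_cap` is a hypothetical helper, and the cap is passed in by the caller because the response itself does not carry `maxDeliveryCount`.

```python
import json
from typing import List, Tuple


def near_delivery_cap(response_body: str, max_delivery_count: int) -> List[Tuple[str, int]]:
    """Return (event id, deliveryCount) for events on their last allowed delivery."""
    payload = json.loads(response_body)
    flagged = []
    for item in payload.get("value", []):
        count = item["brokerProperties"]["deliveryCount"]
        if count >= max_delivery_count:
            flagged.append((item["event"]["id"], count))
    return flagged
```

Logging the flagged ids before processing gives you a trail for events that are about to dead-letter if this attempt also fails.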

Filter so a subscription only sees the events it cares about. Two complementary tools:

Subject filters — cheap prefix/suffix matching on subject, which for routed MQTT is the device topic.
Advanced filters — typed comparisons (NumberGreaterThan, StringIn, BoolEquals, StringContains) against any attribute or data field via JSON path.

The advanced-filter operators you can use, with an example of each:

Operator	Type	Example key	Example values
`NumberGreaterThan` / `…OrEquals`	number	`data.celsius`	`[85]`
`NumberLessThan` / `…OrEquals`	number	`data.battery`	`[0.2]`
`NumberInRange` / `NumberNotInRange`	number	`data.rpm`	`[[900,1100]]`
`NumberIn` / `NumberNotIn`	number	`data.zone`	`[1,2,3]`
`StringBeginsWith` / `StringEndsWith`	string	`subject`	`["devices/"]`
`StringContains` / `StringNotContains`	string	`subject`	`["/telemetry/"]`
`StringIn` / `StringNotIn`	string	`type`	`["MQTT.EventPublished"]`
`BoolEquals`	bool	`data.alarm`	`[true]`
`IsNullOrUndefined` / `IsNotNull`	any	`data.gps`	—

Subject vs advanced filters — when to use which:

Filter kind	Matches on	Cost	Use for
Subject prefix/suffix	`subject` string only	Cheapest	Device-topic / tenant scoping
Advanced filter	Any attribute or `data.*` (JSON path)	Slightly more	Thresholds, enums, booleans in payload
`includedEventTypes`	`type` allow-list	Cheap	Restrict to specific event types

A subscription that only wakes the worker for over-temperature readings (above 85 °C) published under devices/:

{
  "filtersConfiguration": {
    "includedEventTypes": ["MQTT.EventPublished"],
    "filters": [
      { "operatorType": "StringBeginsWith", "key": "subject", "values": ["devices/"] },
      { "operatorType": "NumberGreaterThan", "key": "data.celsius", "values": [85] }
    ]
  }
}

Doing this server-side is not a nicety — it is throughput and cost. Every event a subscription does not match is one your consumer never receives, never locks, and never pays to process. Filter aggressively at the subscription; reserve client-side logic for genuinely dynamic cases.
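Before deploying a filter, it can help to dry-run it against sample events locally. The sketch below is a simplified local simulation, not Event Grid's implementation: it covers only four operators, treats multiple values in one filter as OR and multiple filters as AND, and compares strings case-insensitively on the assumption that this mirrors the service. All function names are hypothetical.

```python
from typing import Any


def _lookup(event: dict, key: str) -> Any:
    """Resolve a dotted key such as 'data.celsius'; None if any part is missing."""
    node: Any = event
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def _is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _check(operator: str, value: Any, values: list) -> bool:
    """Evaluate one filter; any listed value may satisfy it (OR)."""
    if operator == "StringBeginsWith":
        return isinstance(value, str) and any(value.lower().startswith(v.lower()) for v in values)
    if operator == "StringContains":
        return isinstance(value, str) and any(v.lower() in value.lower() for v in values)
    if operator == "NumberGreaterThan":
        return _is_number(value) and any(value > v for v in values)
    if operator == "NumberLessThan":
        return _is_number(value) and any(value < v for v in values)
    raise ValueError(f"operator not covered by this sketch: {operator}")


def matches(event: dict, filters_configuration: dict) -> bool:
    """True when the event passes the type allow-list and every filter (AND)."""
    types = filters_configuration.get("includedEventTypes")
    if types and event.get("type") not in types:
        return False
    return all(
        _check(f["operatorType"], _lookup(event, f["key"]), f["values"])
        for f in filters_configuration.get("filters", [])
    )
```

Pass the inner `filtersConfiguration` object and a CloudEvent dict; an event with `data.celsius` of 91.4 under `devices/` matches the filter above, while a 70-degree reading does not.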

Retries, batching, and dead-letter to Blob Storage

Reliable delivery is three coordinated settings: how hard Event Grid retries, how it batches on push, and where poison events go to die.

Retry budget. On a pull subscription, eventTimeToLive (the P1D ISO-8601 duration above) is the wall-clock ceiling; maxDeliveryCount is the attempt ceiling. Whichever is hit first ends delivery. On push, Event Grid retries transient failures with exponential backoff; a hard 4xx other than 408 (timeout) or 429 (throttling) is treated as non-retryable and goes straight to dead-letter.

How push classifies a delivery result — this decides retry vs immediate dead-letter:

Delivery result	Class	Event Grid behaviour
`200`/`202` success	Success	Acknowledged, removed
`204` no content	Success	Acknowledged, removed
`408` / `429` (timeout/throttle)	Transient	Retry with exponential backoff
`503` / `504`	Transient	Retry with exponential backoff
Other `5xx`	Transient	Retry with exponential backoff
`400`, `401`, `403`, `404`, `413`	Non-retryable	Straight to dead-letter
Endpoint unreachable	Transient	Retry within the budget
Budget exhausted (`maxDeliveryCount`/TTL)	—	Dead-letter (or drop if no DLQ)
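The classification table can be encoded directly, which is handy when a push target is under your control and you want its error responses to land in the intended class. This is a sketch that mirrors only the rows above. `classify_delivery` is a hypothetical name, and any status code the table does not list is reported as "unlisted" rather than guessed.

```python
from typing import Optional

SUCCESS = {200, 202, 204}
NON_RETRYABLE = {400, 401, 403, 404, 413}
TRANSIENT_4XX = {408, 429}


def classify_delivery(status: Optional[int]) -> str:
    """Map a push delivery result to the class used in the table above.

    status=None stands for an unreachable endpoint (no HTTP response at all).
    """
    if status is None:
        return "transient"
    if status in SUCCESS:
        return "success"
    if status in NON_RETRYABLE:
        return "non-retryable"
    if status in TRANSIENT_4XX or 500 <= status <= 599:
        return "transient"
    return "unlisted"
```

The practical consequence: if your endpoint answers a temporary overload with 400 instead of 429 or 503, the event skips retry entirely, so choose the status codes your handler returns with this mapping in mind.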

The four reliability knobs, side by side, so you reason about them as one budget:

Knob	Applies to	What it bounds	Default	Max
`maxDeliveryCount`	pull (and push attempts)	Number of attempts	10	10
`eventTimeToLive`	pull	Wall-clock event lifespan	topic retention	7 days
`receiveLockDurationInSeconds`	pull	Lock per `receive`	60	300
`deliveryRetryPeriodInDays`	dead-letter	DLQ retry window	—	2

Dead-letter. Configure a Blob Storage destination so undeliverable events are preserved instead of dropped. Prerequisites: enable a managed identity on the namespace and grant it Storage Blob Data Contributor on the storage account. The subscription property is deadLetterDestinationWithResourceIdentity, and deliveryRetryPeriodInDays sets the dead-letter retry window (at most 2 days):

{
  "deadLetterDestinationWithResourceIdentity": {
    "deliveryRetryPeriodInDays": 2,
    "endpointType": "StorageBlob",
    "StorageBlob": {
      "blobContainerName": "deadletter",
      "resourceId": "/subscriptions/<SUB>/resourceGroups/rg-eventing/providers/Microsoft.Storage/storageAccounts/stegdeadletter"
    },
    "identity": { "type": "SystemAssigned" }
  }
}

Dead-lettered events are written as CloudEvents JSON with an added deadletterProperties block — deadletterreason, deliveryattempts, deliveryresult, and timestamps — so a replay job knows why each event failed. Blobs land under a time-partitioned path:

<container>/<namespace>/<topic>/<subscription>/<yyyy>/<MM>/<dd>/<HH>/<guid>.json
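Because the path is time-partitioned, a replay job can list one hour of dead-letters with a blob-name prefix instead of scanning the container. This sketch builds that prefix from the layout above. `dead_letter_prefix` is a hypothetical helper, and it assumes the date segments are written in UTC, which you should confirm against a real blob name before relying on it.

```python
from datetime import datetime, timezone


def dead_letter_prefix(namespace: str, topic: str, subscription: str, hour: datetime) -> str:
    """Blob-name prefix (inside the container) for one hour of dead-letters."""
    if hour.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    hour_utc = hour.astimezone(timezone.utc)  # assumption: path segments are UTC
    return f"{namespace}/{topic}/{subscription}/{hour_utc:%Y/%m/%d/%H}/"
```

Feed the result to your storage client's list-by-prefix call to enumerate only the blobs from the incident window.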

The deadletterProperties fields and what each tells a replay job:

`deadletterProperties` field	Meaning	Replay decision it drives
`deadletterreason`	Why delivery failed	The primary routing key for replay
`deliveryattempts`	How many tries before giving up	Distinguish flaky from hard-fail
`deliveryresult`	Last delivery outcome	e.g. `Unauthorized`, `TimedOut`
`lastDeliveryAttemptTime`	When it last tried	Ordering / staleness
`publishTime`	When the event was first published	Latency / SLA forensics

deadletterreason values you will actually see, and the right replay policy for each:

`deadletterreason`	Likely root cause	Replay policy
`Unauthorized`	Identity lost the target RBAC role	Fix RBAC, then rehydrate
`TimeToLiveExceeded`	Consumer too slow / down past TTL	Rehydrate if still relevant
`MaxDeliveryCountExceeded`	Repeated transient failure	Investigate target, then rehydrate
`EndpointNotFound`	Target deleted/moved	Fix endpoint, then rehydrate
Schema/parse rejection	Producer shipped a bad payload	Do not blindly replay — fix producer

That deadletterreason is the difference between a five-minute replay and an afternoon of forensics. An Unauthorized reason means fix the consumer’s auth and rehydrate; a parse failure means the producer shipped a bad schema and those events should probably not be replayed at all.
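The replay policy table translates into a short branch. The sketch below assumes `deadletterProperties` sits at the top level of the dead-lettered JSON, next to the CloudEvents attributes; the function names and the three verdict strings are hypothetical. Any reason not listed in the table is quarantined rather than replayed, which is the conservative default.

```python
REHYDRATE_AFTER_FIX = {"Unauthorized", "MaxDeliveryCountExceeded", "EndpointNotFound"}
REHYDRATE_IF_RELEVANT = {"TimeToLiveExceeded"}


def replay_decision(dead_lettered: dict) -> str:
    """Choose a replay policy from the deadletterreason of one blob."""
    reason = dead_lettered.get("deadletterProperties", {}).get("deadletterreason")
    if reason in REHYDRATE_AFTER_FIX:
        return "rehydrate-after-fix"
    if reason in REHYDRATE_IF_RELEVANT:
        return "rehydrate-if-relevant"
    return "quarantine"  # schema/parse rejections and anything unrecognised


def strip_for_replay(dead_lettered: dict) -> dict:
    """Return the original CloudEvent without the dead-letter metadata."""
    return {k: v for k, v in dead_lettered.items() if k != "deadletterProperties"}
```

Keeping the decision separate from the republish step means you can run the job in report-only mode first and review the quarantine pile before anything is re-sent.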

Architecture at a glance

Read the diagram left to right; it is the data path of a real fleet-telemetry system, with the control and failure points numbered. On the far left, the device fleet (sensors and edge gateways authenticated by X.509 or Entra) opens MQTT v5 sessions on port 8883 and PUBLISHes telemetry. Those sessions terminate at the MQTT broker on the Event Grid namespace, where a permission binding (badge 1) is the only thing standing between default-deny and a connected client: no binding, no CONNECT, no PUBLISH, and the failure is silent. Messages that clear authorization are wrapped into CloudEvents and handed to routing (badge 2), which publishes each event into the single namespace topic, durable for up to 7 days, with ingress/egress capped at 40/80 MB/s. If routeTopicResourceId is unset, or the broker is made unreachable, messages never reach the topic and nothing downstream ever fires; they are not held in the broker for later delivery.

From the topic, the system fans out by subscription, and each subscription gets its own independent copy of every event. The push subscription (badge 3) sends to Event Hubs using deliveryWithResourceIdentity — and if the namespace’s managed identity lacks Event Hubs Data Sender, every delivery 403s and the dead-letter count floods. In parallel, a pull worker (badge 4) receives with a 60-second lock and a delivery cap of 5; if the worker is slower than the lock, events redeliver and deliveryCount climbs toward the cap. When any subscription exhausts its budget, poison events dead-letter to a Blob container (badge 5) under a time-partitioned path, written by the namespace managed identity holding Storage Blob Data Contributor. The whole method is in the numbers: localize the failure to a hop, read the legend for the symptom, run the named az/metric check, apply the fix. The single most important footer is badge 5 — if there is no dead-letter destination, DroppedEventCount rises and those events are gone, not preserved.

Real-world scenario

Velobyte Mobility runs a connected-vehicle platform ingesting telemetry from roughly 200,000 vehicles over MQTT into an Event Grid namespace in East US, routed to a namespace topic, and pushed to Event Hubs for a Stream Analytics pipeline that powers a live fleet dashboard and a regulatory trip-archive. The platform team is six engineers; the namespace runs at 4 throughput units; the monthly Event Grid + Event Hubs spend is about ₹95,000. Their contract with two fleet operators requires that every trip event be retained for seven years — a hard compliance line, not a best-effort target.

The incident began on a Tuesday at 14:20 when a regional Stream Analytics outage stalled the analytics job. Event Hubs, fed by the push subscription, back-pressured; push deliveries started failing with 429/5xx and Event Grid began retrying with backoff. The job stayed down for an hour and forty-five minutes, until 16:05. The on-call engineer’s dashboard showed DeliveryAttemptFailCount climbing: alarming, but not yet a data-loss event, because failed-and-retried is not lost. The trap was elsewhere: the subscription had been provisioned a year earlier without a dead-letter destination. As events aged past their delivery budget, they were not dead-lettered; they were dropped. DroppedEventCount was climbing from 14:31, but nobody had an alert on it, because the team had instinctively alerted on “failures,” and a dropped event is, perversely, not counted as a failure. They were silently losing trip data they were contractually required to retain: about two hours of one operator’s fleet, unrecoverable.

The breakthrough came when an engineer pulled the metric explorer and noticed DroppedEventCount was non-zero while DeadLetteredCount was flat zero — the exact inverse of what a healthy pipeline shows. That single comparison named the bug: no graveyard, so the overflow had nowhere to land. The realization reframed the whole pipeline from “retry harder” to “preserve everything, always.”

The fix was two-part and structural. First, every namespace-topic subscription got a mandatory dead-letter destination, enforced in the Bicep subscription module so a subscription literally could not be provisioned without one:

{
  "deadLetterDestinationWithResourceIdentity": {
    "deliveryRetryPeriodInDays": 2,
    "endpointType": "StorageBlob",
    "StorageBlob": {
      "blobContainerName": "vehicle-deadletter",
      "resourceId": "/subscriptions/<SUB>/resourceGroups/rg-fleet/providers/Microsoft.Storage/storageAccounts/stfleetdlq"
    },
    "identity": { "type": "SystemAssigned" }
  }
}

Second, they added a parallel pull subscription on the same topic feeding a back-pressure-tolerant archival worker. Because each subscription gets its own independent copy of every event, the archival path drained at its own pace and could not be starved by the analytics path stalling. They also moved their alerting from DeliveryAttemptFailCount to a zero-tolerance alert on DroppedEventCount, and added a secondary alert on a non-zero DeadLetteredCount so the DLQ filling was visible rather than discovered during an audit.

During the next regional incident six weeks later, the new shape held: DroppedEventCount stayed flat at zero, the dead-letter container captured the analytics overflow, the archival pull worker never even noticed, and a replay job rehydrated the dead-lettered events once Stream Analytics recovered — zero data loss, compliance line held. The cost of the change was a second subscription and a storage account: about ₹4,000/month. The lesson written into their platform standards, in three clauses: on a namespace topic, fan out by subscription, dead-letter every subscription, and alert on dropped — not failed — events.

The incident as a timeline, because the order of moves is the lesson:

Time	Signal	What it meant	Action	What it should have been
14:20	Stream Analytics regional outage	Downstream stalled	(alert fires on job)	Expected; back-pressure begins
14:25	`DeliveryAttemptFailCount` rising	Push retrying with backoff	Watch	Not loss yet — retries in flight
14:31	`DroppedEventCount` > 0 (unwatched)	Events being lost	(no alert existed)	Should have paged at zero tolerance
15:10	Engineer compares metrics	Dropped > 0, dead-lettered = 0	Diagnose	The breakthrough comparison
15:20	Root cause: no DLQ destination	Overflow had nowhere to land	—	DLQ should have been mandatory
16:05	Stream Analytics recovers	Backlog drains	—	—
+1 day	DLQ enforced in module + dropped-alert	Loss made impossible to repeat	Structural fix	The actual fix is the platform standard
+6 wks	Next incident	DLQ caught overflow, replayed	Zero loss	The standard proven

Advantages and disadvantages

The namespace model — broker plus durable topic plus independent subscriptions — both enables fleet-scale eventing and introduces failure modes you must design against. Weigh it honestly:

Advantages (why namespaces help you)	Disadvantages (why they bite)
Real MQTT v5 broker with fleet-scale, default-deny authorization (client groups + topic spaces)	QoS 2 unsupported; you must build idempotent consumers for exactly-once semantics
Pull delivery gives back-pressure and lets endpoint-less / private / batch consumers subscribe	Pull requires you to write a receive/ack loop and handle lock expiry yourself
7-day retention + dead-letter to Blob preserve events through downstream outages	Dead-letter is opt-in; forget it and you get silent `DroppedEventCount` loss
Each subscription is an independent copy — fan out without consumers starving each other	More subscriptions = more cost and more DLQs to operate and alert on
40/80 MB/s throughput dwarfs basic (~5 MB/s) for high-volume telemetry	Narrower push destination set today (Event Hubs only); other targets need a custom-topic hop
Managed-identity delivery — no keys/SAS to rotate to Event Hubs or Blob	Easy to under-grant: a missing Data Sender role 403s every push and floods the DLQ
CloudEvents-native + server-side advanced filters cut consumer cost and load	CloudEvents 1.0 JSON only — EventGridSchema producers are rejected outright
Regional resource with predictable throughput-unit scaling	Regional, not global — DR/region affinity is your design problem, not the platform’s

The model is right when you have a device fleet, mismatched-throughput consumers, or a durability/retention requirement. It is the wrong tool for reacting to Azure resource events (use a system topic), for simple app-to-handler push across many destination types (basic custom topic is simpler today), or for ordered transactional command messaging (that is Service Bus). The disadvantages are all manageable — default-deny, opt-in dead-letter, narrow push set — but only if you know they exist and wire around them, which is the entire point of enumerating them here.

Hands-on lab

Stand up a namespace with MQTT enabled, create a topic and a pull subscription with dead-lettering, verify the wiring, and optionally publish a test message. The lab is free-tier-friendly with a single throughput unit, and teardown is the last step. Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

RG=rg-egns-lab
LOC=eastus
NS=egns-lab-$RANDOM       # globally-unique
ST=stegdlq$RANDOM         # storage for dead-letter (3-24 lowercase)
az group create -n $RG -l $LOC -o table

Step 2 — Create the namespace with MQTT and a system-assigned identity.

az eventgrid namespace create -g $RG -n $NS -l $LOC \
  --topic-spaces-configuration "{state:Enabled}" \
  --identity "{type:SystemAssigned}" -o table

Expected: a namespace row; topicSpacesConfiguration.state = Enabled.

Step 3 — Create the destination topic and a storage account + container for dead-letter.

az eventgrid namespace topic create -g $RG --namespace-name $NS -n mqtt-ingest -o table
az storage account create -g $RG -n $ST -l $LOC --sku Standard_LRS -o table
az storage container create --account-name $ST -n deadletter --auth-mode login -o table

Step 4 — Grant the namespace identity Storage Blob Data Contributor on the account.

NS_MI=$(az eventgrid namespace show -g $RG -n $NS --query identity.principalId -o tsv)
ST_ID=$(az storage account show -g $RG -n $ST --query id -o tsv)
az role assignment create --assignee-object-id $NS_MI --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" --scope $ST_ID -o table

Expected: a role-assignment row. (Allow ~1–2 minutes for the assignment to propagate before dead-lettering will succeed.)

Step 5 — Create a pull subscription WITH a dead-letter destination.

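# The dead-letter block uses the shape shown earlier; $ST_ID comes from Step 4.
SUB_JSON=$(cat <<JSON
{
  "deliveryMode": "Queue",
  "queue": {
    "receiveLockDurationInSeconds": 60, "maxDeliveryCount": 2, "eventTimeToLive": "PT5M",
    "deadLetterDestinationWithResourceIdentity": {
      "deliveryRetryPeriodInDays": 2,
      "endpointType": "StorageBlob",
      "StorageBlob": { "blobContainerName": "deadletter", "resourceId": "$ST_ID" },
      "identity": { "type": "SystemAssigned" }
    }
  }
}
JSON
)
az eventgrid namespace topic event-subscription create -g $RG --namespace-name $NS \
  --topic-name mqtt-ingest --name sub-worker \
  --delivery-configuration "$SUB_JSON" -o table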

Step 6 — Verify the wiring.

az eventgrid namespace show -g $RG -n $NS \
  --query "{state:topicSpacesConfiguration.state, identity:identity.type}" -o table
az eventgrid namespace topic event-subscription show -g $RG --namespace-name $NS \
  --topic-name mqtt-ingest --name sub-worker \
  --query "{mode:deliveryConfiguration.deliveryMode, lock:deliveryConfiguration.queue.receiveLockDurationInSeconds}" -o table

Expected: state = Enabled, identity = SystemAssigned, mode = Queue, lock = 60.

Step 7 — (Optional) Publish a test MQTT message and confirm delivery. With an X.509-registered client (see the MQTT section), publish to devices/<authName>/telemetry/temp using mosquitto_pub over TLS to $NS.<region>-1.ts.eventgrid.azure.net:8883, then receive on the pull subscription via the SDK/REST and confirm the CloudEvent arrives with your payload in data.

Step 8 — Teardown.

az group delete -n $RG --yes --no-wait

A checklist of what “done right” looks like before you tear down:

Lab check	Pass condition
Namespace MQTT enabled	`topicSpacesConfiguration.state = Enabled`
Namespace identity present	`identity.type = SystemAssigned`
Identity can write DLQ	Role assignment `Storage Blob Data Contributor` exists
Subscription is pull	`deliveryMode = Queue`
Dead-letter wired	DLQ container resolves; MI has the role

Common mistakes & troubleshooting

This is the differentiator. Each row is a real failure mode: the symptom you observe, the root cause, the exact command or metric to confirm it, and the fix. Scan for your symptom, then read the detail below the playbook.

#	Symptom	Root cause	Confirm (exact command / metric)	Fix
1	MQTT CONNECT/PUBLISH silently denied	No permission binding for the client group	`az eventgrid namespace permission-binding list -g $RG --namespace-name $NS -o table`	Bind the group Publisher/Subscriber on the topic space
2	Client never lands in its group	Client-group query references an attribute not set on the client	`az eventgrid namespace client show … --query attributes`	Set the attribute on the client (queries see only attributes)
3	CONNECT fails after cert rotation	Thumbprint-pinned auth, cert changed	Client `validationScheme = ThumbprintMatch` + new thumbprint	Move to CA-signed validation, or update `allowedThumbprints`
4	MQTT works but nothing downstream fires	Routing not configured	`az eventgrid namespace show … --query topicSpacesConfiguration.routeTopicResourceId` is null	Set `routeTopicResourceId` + `routingIdentityInfo`
5	Routing stops after a network change	Public access disabled on the namespace	Namespace `publicNetworkAccess = Disabled`	Keep the broker reachable; isolate the consumer side instead
6	Every push delivery 403s; DLQ floods	Namespace MI lacks Event Hubs Data Sender	`DeadLetteredCount` climbing; role-assignment list on the hub empty	Grant the MI Azure Event Hubs Data Sender on the hub
7	Push 4xx straight to dead-letter	Non-retryable result (400/401/403/404/413)	`deadletterreason` / `deliveryresult` in the blob	Fix the endpoint/payload; non-retryable codes never retry
8	Pull events keep redelivering	Worker slower than `receiveLockDurationInSeconds`	`deliveryCount` rising on `receive`	Raise the lock, `renewLock`, or shrink the receive batch
9	Events vanish, no blobs appear	No dead-letter destination configured	`DroppedEventCount` > 0 while `DeadLetteredCount` = 0	Attach a Blob DLQ + grant MI Storage Blob Data Contributor
10	DLQ configured but still dropping	DLQ retry window (`deliveryRetryPeriodInDays`) expired	`DroppedEventCount` > 0 with DLQ present	Widen the window (max 2 days); fix the target faster
11	Producer rejected at ingest	Sent EventGridSchema, not CloudEvents	Publish returns schema error	Emit CloudEvents 1.0 JSON only
12	Consumer gets events it shouldn’t	Filter too broad / missing	Subscription `filtersConfiguration` empty	Add subject + advanced filters server-side
13	Throughput throttled at peak	Ingress above the namespace cap	`PublishFailureCount` / throttle responses	Add throughput units (capacity) on the SKU
14	Replay can’t decide what to re-send	Not reading `deadletterreason`	Blob `deadletterProperties` ignored	Branch replay by `deadletterreason` (transient vs schema)

No permission binding (rows 1–3). The broker is default-deny: a client with a valid certificate still cannot CONNECT until a permission binding grants its group rights on a space. The denial is silent at the protocol level (an MQTT CONNACK refusal), which is exactly why people stare at certs for an hour. Confirm with permission-binding list; if the binding exists, confirm the client is actually in the group by checking the attribute the group query filters on — a client whose attributes don’t match the query is simply not a member, and group queries can only see attributes, nothing else.

Routing not firing (rows 4–5). If routeTopicResourceId is null, MQTT messages are accepted by the broker and then go nowhere — there is no error, because publishing succeeded; it is delivery that never starts. Confirm by querying the field. The subtle one is row 5: routing requires the broker to be reachable, so disabling publicNetworkAccess to “lock things down” silently breaks routing. Isolate the consumer with private endpoints; keep the broker reachable.

Push 403 and dead-letter flood (rows 6–7). Push to Event Hubs uses deliveryWithResourceIdentity, so the namespace’s managed identity must hold Azure Event Hubs Data Sender on the target hub. Miss it and every delivery 403s — a non-retryable code — so events go straight to dead-letter and DeadLetteredCount climbs in lockstep with traffic. Confirm by listing role assignments on the hub for the namespace principal; the fix is one role assignment.

Pull lock-expiry loop (row 8). If your worker takes longer than receiveLockDurationInSeconds to acknowledge, the lock expires, the event is redelivered, and deliveryCount climbs until it hits maxDeliveryCount and dead-letters — turning slow processing into spurious dead-lettering. Confirm by watching deliveryCount on received events. Fix by raising the lock (up to 300 s), calling renewLock for legitimately long work, or shrinking the receive batch so each event is processed within the lock.
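The "shrink the receive batch" fix is simple arithmetic: the batch must finish inside the lock. A rough sketch, assuming events in a batch are processed one after another and every lock starts at receive time; `max_safe_batch` and the 0.8 safety margin are hypothetical choices, not service defaults.

```python
import math


def max_safe_batch(lock_seconds: float, worst_case_seconds_per_event: float,
                   safety_margin: float = 0.8) -> int:
    """Largest sequentially processed batch that fits within one lock window.

    Returns 1 even when a single event exceeds the lock; in that case the
    worker has to renew the lock instead of relying on batch size.
    """
    if lock_seconds <= 0 or worst_case_seconds_per_event <= 0:
        raise ValueError("durations must be positive")
    usable = lock_seconds * safety_margin  # leave room for the ack round trip
    return max(1, math.floor(usable / worst_case_seconds_per_event))
```

With the 60-second lock used in this guide and a worst case of 5 seconds per event, the sketch yields a batch of 9; raising the lock to the 300-second maximum lifts that to 48.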

Silent drops (rows 9–10). The headline failure of this whole topic. A subscription with no dead-letter destination drops events once their budget is exhausted, and the metric is DroppedEventCount, not any *FailCount. Confirm by comparing DroppedEventCount (should be zero) against DeadLetteredCount; a healthy pipeline dead-letters and never drops. Even with a DLQ, the deliveryRetryPeriodInDays window (max 2 days) can expire and drop — so fix the target inside that window.

The KQL you keep open during an incident, a single query that trends the six delivery and publish metrics in 5-minute bins:

AzureMetrics
| where ResourceProvider == "MICROSOFT.EVENTGRID"
| where MetricName in ("DeliverySuccessCount", "DeliveryAttemptFailCount", "DeadLetteredCount", "DroppedEventCount", "PublishSuccessCount", "PublishFailureCount")
| summarize Total = sum(Total) by MetricName, bin(TimeGenerated, 5m)
| order by TimeGenerated desc

A decision table that turns the metric pattern into a verdict:

If you see…	It’s probably…	Do this
`DroppedEventCount` > 0, `DeadLetteredCount` = 0	No DLQ destination	Attach a Blob DLQ + grant the MI the role
`DeadLetteredCount` rising with traffic	Push 403 (missing Data Sender)	Grant Event Hubs Data Sender on the hub
`deliveryCount` climbing on pull	Lock expiry (slow worker)	Raise lock / `renewLock` / smaller batch
`PublishFailureCount` > 0 at peak	Ingress throttling	Add throughput units (capacity)
Publish rejected immediately	Wrong schema (EventGridSchema)	Emit CloudEvents 1.0 JSON
Downstream silent, publish OK	Routing not configured	Set `routeTopicResourceId`
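The metric rows of that table can be automated as a first-pass triage over the totals the query returns. This is a sketch only: `diagnose` is a hypothetical helper, it covers just the three metric-driven patterns, and its output is a hint to investigate, not a root cause.

```python
from typing import Dict, List


def diagnose(totals: Dict[str, float]) -> List[str]:
    """Turn summed metric totals for a time window into likely verdicts."""
    dropped = totals.get("DroppedEventCount", 0)
    dead_lettered = totals.get("DeadLetteredCount", 0)
    publish_failures = totals.get("PublishFailureCount", 0)

    findings = []
    if dropped > 0 and dead_lettered == 0:
        findings.append("No dead-letter destination: attach a Blob DLQ and grant the identity its role")
    elif dropped > 0:
        findings.append("Dropping despite a DLQ: the dead-letter retry window may have expired")
    if dead_lettered > 0:
        findings.append("Dead-lettering: read deadletterreason; on push, check Event Hubs Data Sender")
    if publish_failures > 0:
        findings.append("Publish failures: likely ingress throttling; add throughput units")
    return findings
```

An empty list means none of the three patterns fired in the window; it does not prove the pipeline is healthy, because lock-expiry loops and missing routing do not show up in these counters.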

Best practices

Choose the surface deliberately. Use namespaces only when you need MQTT, pull, 7-day retention, or 40/80 MB/s. If you just need app-to-handler push to many destinations, a basic custom topic is simpler today.
Default-deny, always. Model MQTT authorization with client groups + topic spaces + permission bindings, and scope topic templates per ${client.authenticationName} so no device can touch another’s subtree.
Prefer CA-signed certs over thumbprint pinning for any fleet that rotates certificates — pinning turns every rotation into a CONNECT outage.
Dead-letter every subscription, no exceptions. Enforce it in your IaC module so a subscription cannot be provisioned without a DLQ destination. This is the single highest-value rule in the article.
Alert on DroppedEventCount at zero tolerance, and separately on a non-zero DeadLetteredCount. Alerting only on “failures” hides the loss.
Filter server-side. Push subject + advanced filters down to the subscription so consumers never receive — or pay for — events they don’t match.
Fan out by subscription, not by consumer fan-out logic. Each subscription gets an independent copy; a slow consumer on one cannot starve another.
Use managed-identity delivery (deliveryWithResourceIdentity) to Event Hubs and Blob — no keys or SAS to rotate — and grant the minimum role (Data Sender, Blob Data Contributor) at the narrowest scope.
Match delivery mode to the consumer, not to habit: push for reactive/reachable, pull for back-pressure/private/batch/endpoint-less.
Right-size the lock and delivery count together. A lock shorter than worst-case processing plus a low maxDeliveryCount manufactures spurious dead-letters.
Keep the broker reachable; isolate the consumer. Private-network the consumer side; disabling public access on the namespace breaks routing.
Build a replay job before you need it. It should read the DLQ blob, strip deadletterProperties, and branch on deadletterreason — rehydrate transient failures, quarantine schema rejections.

Security notes

Three security surfaces, three mechanisms:

MQTT clients authenticate with X.509 certificates (CA-signed or thumbprint-pinned) or Microsoft Entra ID / JWT. Authorization is the default-deny permission-binding model from the broker section — a client can do nothing until a binding grants it.
Push to Azure services (Event Hubs, and the destinations rolling out) uses the namespace’s managed identity plus an RBAC role on the target — deliveryWithResourceIdentity. No keys, no SAS tokens, no secrets to rotate.
Push to webhooks must complete the CloudEvents abuse-protection handshake: Event Grid issues an OPTIONS request with a WebHook-Request-Origin header, and your endpoint must echo it in WebHook-Allowed-Origin. This proves endpoint ownership and stops Event Grid being used to flood a third party. Better still, front the webhook with Microsoft Entra and validate the presented token.
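The webhook handshake in the last item is small enough to sketch as a pure function that any web framework can call from its OPTIONS handler. Everything named here is an assumption for illustration: the function name, the trusted-origin value `eventgrid.azure.net`, the rate of 120, and the choice to answer an untrusted origin with 403. The `WebHook-Allowed-Rate` header comes from the CloudEvents webhook specification rather than from this guide, so check the specification before relying on it.

```python
from typing import Dict, FrozenSet, Mapping, Tuple


def abuse_protection_response(
    request_headers: Mapping[str, str],
    trusted_origins: FrozenSet[str] = frozenset({"eventgrid.azure.net"}),  # assumed value
    allowed_rate: str = "120",  # assumed requests-per-minute allowance
) -> Tuple[int, Dict[str, str]]:
    """Build the (status, headers) reply to a validation OPTIONS request."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    headers = {name.lower(): value for name, value in request_headers.items()}
    origin = headers.get("webhook-request-origin")
    if origin is None or origin.lower() not in trusted_origins:
        return 403, {}  # sketch choice: refuse origins we do not recognise
    return 200, {
        "WebHook-Allowed-Origin": origin,   # echo the requesting origin
        "WebHook-Allowed-Rate": allowed_rate,
    }
```

Keeping the allow-list explicit is the point of the handshake: the endpoint only grants delivery permission to senders it recognises.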

The identity-and-RBAC matrix — exactly which principal needs which role where:

Action	Principal	Role	Scope	Why
Route to a custom topic	Namespace MI	EventGrid Data Sender	The custom topic	Cross-resource publish auth
Push to Event Hubs	Namespace MI	Azure Event Hubs Data Sender	The event hub	Deliver without keys/SAS
Dead-letter to Blob	Namespace MI	Storage Blob Data Contributor	The storage account / container	Write poison events
Manage the namespace	Operator	EventGrid Contributor	Namespace / RG	Provision topics, subscriptions
Pull-receive events	Consumer identity	EventGrid Data Receiver	The topic / subscription	Authorized `receive`/`ack`

The network and data-protection controls that matter for this topic:

Control	What it protects	How to set it	Caveat
TLS on MQTT	In-transit telemetry	Port 8883 (MQTT) / 443 (WSS); plaintext not offered	No 1883 plaintext path exists
Private endpoints (consumer)	Consumer-side isolation	Private endpoint on the consumer resource	Don’t disable broker public access — breaks routing
Managed identity delivery	Eliminates stored secrets	`deliveryWithResourceIdentity`	Under-granting 403s every push
Customer-managed keys (storage)	DLQ data at rest	CMK on the dead-letter storage account	Key rotation is your responsibility
Blob immutability on DLQ	Tamper-proof forensics	Immutability policy on the DLQ container	Plan retention vs replay cleanup
Webhook abuse-protection	Third-party flooding	Echo `WebHook-Request-Origin`	Skipping it blocks the subscription

Grant the namespace identity rights on the Event Hub used by the push subscription:

NS_PRINCIPAL=$(az eventgrid namespace show -g $RG -n $NS --query identity.principalId -o tsv)
EH_ID=$(az eventhubs eventhub show -g $RG --namespace-name ehns-telemetry -n telemetry --query id -o tsv)

az role assignment create \
  --assignee-object-id $NS_PRINCIPAL \
  --assignee-principal-type ServicePrincipal \
  --role "Azure Event Hubs Data Sender" \
  --scope $EH_ID

Cost & sizing

Event Grid namespaces bill on throughput units (the capacity that sets your ingress/egress ceilings), operations (publish/deliver/receive), and the resources they touch — Event Hubs for push, Storage for dead-letter and replay. The MQTT broker’s cost scales with connected clients and message volume. None of these is large per unit, but at fleet scale they add up, and the dominant lever is almost always throughput units sized to peak, not average.

What drives the bill, and how to pull each lever:

Cost driver	What scales it	How to reduce	Watch-out
Throughput units (capacity)	Peak ingress/egress MB/s	Size to peak, not over-provision	Under-size → `PublishFailureCount` throttling
Operations (publish/deliver)	Event volume	Server-side filtering cuts deliver ops	Client-side filtering still pays to deliver
MQTT connections	Concurrent client sessions	Consolidate chatty devices; batch publishes	Per-session limits + cost at fleet scale
Event Hubs (push target)	Throughput units on the hub	Right-size hub TUs; Capture for cheap archive	Separate Event Hubs bill
Storage (dead-letter)	Volume of dead-lettered events + retention	Lifecycle-tier old DLQ blobs; replay + delete	A flooding DLQ is a symptom — fix the source
Replay compute	Functions executions on replay	Replay only what’s relevant by `deadletterreason`	Don’t rehydrate schema rejections
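The last row's advice, replaying by `deadletterreason` instead of rehydrating everything, might look like the sketch below. It assumes each dead-letter record is a JSON object with a top-level `deadletterProperties` field; that layout, the `triage` name, and the sample reason `SomeSchemaRejection` are hypothetical. Only the two reasons this guide names as transient are replayed, and anything else is quarantined by default.

```python
import json

# Reasons this guide treats as transient and therefore worth replaying.
REHYDRATE_REASONS = {"Unauthorized", "TimeToLiveExceeded"}


def triage(blob_record: str) -> str:
    """Return 'rehydrate' or 'quarantine' for one dead-letter record.

    Assumption: the record is a JSON object carrying a top-level
    'deadletterProperties' object with a 'deadletterreason' string.
    """
    record = json.loads(blob_record)
    props = record.get("deadletterProperties") or {}
    reason = props.get("deadletterreason")
    # Unknown or missing reasons are quarantined so a replay job never
    # pays compute to resend events that will only fail again.
    return "rehydrate" if reason in REHYDRATE_REASONS else "quarantine"


# Hypothetical sample records, for illustration only.
transient = json.dumps(
    {"deadletterProperties": {"deadletterreason": "Unauthorized", "deliveryattempts": 1}}
)
rejected = json.dumps(
    {"deadletterProperties": {"deadletterreason": "SomeSchemaRejection"}}
)

print(triage(transient))  # rehydrate
print(triage(rejected))   # quarantine
```

Defaulting to quarantine keeps replay cost proportional to the failures that can actually succeed on a second attempt.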

Rough figures for sizing intuition (regional list prices vary; treat as order-of-magnitude):

Scenario	Shape	Indicative monthly	Notes
Lab / PoC	1 TU, low volume, 1 subscription	~₹1,500–3,000	Plus negligible storage
Small fleet	1–2 TU, ~5k devices, push + pull	~₹15,000–30,000	Add Event Hubs separately
Large fleet	4 TU, ~200k devices, push + pull + DLQ	~₹90,000–120,000	Throughput units dominate
DLQ + replay storage	Standard LRS, modest volume	~₹2,000–6,000	Lifecycle-tier to cut it

Sizing heuristics worth committing to memory:

Question	Rule of thumb
How many throughput units?	Size to peak ingress MB/s, with headroom for retries during incidents
Push or pull for cost?	Pull lets a slow consumer self-pace — avoids over-provisioning push targets
How long to retain on the topic?	As long as your slowest consumer’s worst-case lag + replay window
DLQ storage tier?	Hot for the replay window, then lifecycle to cool/archive
When does fan-out cost pay off?	Always, if it prevents one consumer starving another (loss is costlier)
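The first rule of thumb can be turned into a quick calculation. The sketch below is a back-of-envelope estimate, not a pricing or quota tool: the per-unit figures (1 MB/s ingress and 2 MB/s egress per throughput unit) and the cap of 40 units are assumptions to check against the current service limits, and the 25% headroom default is an arbitrary starting point.

```python
import math


def throughput_units(
    peak_ingress_mb_s: float,
    peak_egress_mb_s: float,
    headroom: float = 0.25,        # assumed buffer for retries during incidents
    ingress_per_tu: float = 1.0,   # assumption: MB/s of ingress per unit
    egress_per_tu: float = 2.0,    # assumption: MB/s of egress per unit
    max_tu: int = 40,              # assumption: per-namespace ceiling
) -> int:
    """Estimate throughput units from peak (not average) traffic."""
    if peak_ingress_mb_s < 0 or peak_egress_mb_s < 0 or headroom < 0:
        raise ValueError("rates and headroom must be non-negative")
    need_in = peak_ingress_mb_s * (1 + headroom) / ingress_per_tu
    need_out = peak_egress_mb_s * (1 + headroom) / egress_per_tu
    # Whichever direction is the bottleneck sets the unit count; minimum is 1.
    units = max(1, math.ceil(max(need_in, need_out)))
    if units > max_tu:
        raise ValueError(f"needs {units} units, above the assumed cap of {max_tu}")
    return units


# 3 MB/s peak ingress, 4 MB/s peak egress, 25% headroom: ingress is the bottleneck.
print(throughput_units(3.0, 4.0))  # 4
```

Feed it the peak from your busiest observed interval, not the daily average; the gap between the two is usually where throttling shows up.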

The cheapest event is the one you filtered out at the subscription. The most expensive is the one you dropped because there was no dead-letter destination, and then had to reconstruct from upstream, if you even can. Spend on a throughput unit of headroom and a dead-letter storage account before you spend on incident-response time.

Interview & exam questions

These map to AZ-204 (Develop event-based solutions), AZ-305 (messaging architecture), and AZ-220 (IoT) topics.

When do you choose an Event Grid namespace over a basic custom topic? When you need an MQTT broker, pull delivery, 7-day retention, or higher throughput (up to 40 MB/s ingress and 80 MB/s egress). Basic custom/system topics remain simpler for app-to-handler push across a wider destination set and for reacting to Azure resource events.
How does MQTT authorization scale on a namespace? Not per-device. You register clients, bucket them into client groups via attribute queries, define topic spaces of topic templates, and grant a group Publisher/Subscriber on a space via a permission binding. ${client.authenticationName} in a template scopes each device to its own subtree with one rule.
What is the default authorization posture and why does it matter? Default-deny: a client with a valid cert can do nothing until a permission binding allows it. For an IoT fleet this is the correct, least-privilege posture — there is no implicit access to over-trust.
What does routing do, and what identity does it need? It wraps each MQTT message in a CloudEvents envelope and publishes it to one nominated topic. For a same-namespace topic, routingIdentityInfo of None suffices; for a cross-resource custom topic, the namespace managed identity needs EventGrid Data Sender.
Push vs pull — when does pull win? When the consumer can’t expose an endpoint, needs back-pressure, requires a private link, or processes on a schedule. Pull inverts control so a struggling consumer paces its own receive instead of being overwhelmed.
What is the difference between DroppedEventCount and DeadLetteredCount? Dead-lettered events were preserved to Blob after exhausting their budget; dropped events were lost because no dead-letter destination existed (or its retry window expired). A healthy pipeline dead-letters and never drops — alert on dropped at zero tolerance.
Which knobs bound retry on a pull subscription? maxDeliveryCount (attempt ceiling) and eventTimeToLive (wall-clock ceiling); whichever is hit first ends delivery. receiveLockDurationInSeconds governs how long each received event is locked before redelivery.
Why might a push subscription dead-letter every single event? The namespace managed identity lacks Azure Event Hubs Data Sender on the target, so every delivery returns a non-retryable 403 and goes straight to dead-letter. The fix is one role assignment.
What causes a pull redelivery loop and how do you fix it? The worker takes longer than receiveLockDurationInSeconds, so the lock expires and the event redelivers, climbing deliveryCount until it dead-letters. Raise the lock (max 300 s), call renewLock for long work, or shrink the receive batch.
What schema do namespace topics accept, and what’s a consequence? CloudEvents 1.0 JSON only — no proprietary EventGridSchema. A producer emitting EventGridSchema is rejected at ingest, so the migration cost of moving from a basic topic includes re-shaping producers.
How do you make dead-letter forensics actionable? Each dead-lettered blob carries deadletterProperties (deadletterreason, deliveryattempts, deliveryresult, timestamps). A replay job branches on deadletterreason — rehydrate transient failures (Unauthorized, TimeToLiveExceeded), quarantine schema rejections.
Why does disabling public network access on the namespace break things? Routing requires the broker to be reachable to publish into the nominated topic; turning off public access silently stops routing. Isolate the consumer side with private endpoints instead, and keep the broker reachable.
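The interplay of the pull-subscription knobs in the answers above (attempt ceiling, wall-clock ceiling, and lock duration) can be illustrated with a small model. This is a simplified simulation, not the service's algorithm: it assumes the worker never renews the lock, always needs the same processing time, and re-receives the event the moment the lock expires. The `PullSettings` and `simulate` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PullSettings:
    max_delivery_count: int   # maxDeliveryCount: attempt ceiling
    event_ttl_s: float        # eventTimeToLive, expressed in seconds
    receive_lock_s: float     # receiveLockDurationInSeconds


def simulate(settings: PullSettings, processing_s: float) -> tuple[str, int]:
    """Return (outcome, attempts) for one event.

    outcome is 'acknowledged', 'max_delivery_count' or 'event_ttl';
    the last two mean delivery ended and the event is dead-lettered
    (or dropped, if no dead-letter destination is configured).
    """
    elapsed = 0.0
    attempts = 0
    while attempts < settings.max_delivery_count and elapsed < settings.event_ttl_s:
        attempts += 1
        if processing_s <= settings.receive_lock_s:
            return ("acknowledged", attempts)
        # Lock expired before the worker finished: the event is redelivered.
        elapsed += settings.receive_lock_s
    if attempts >= settings.max_delivery_count:
        return ("max_delivery_count", attempts)
    return ("event_ttl", attempts)


slow_worker = simulate(PullSettings(10, 3600, 60), processing_s=90)
print(slow_worker)  # ('max_delivery_count', 10)

fixed_lock = simulate(PullSettings(10, 3600, 120), processing_s=90)
print(fixed_lock)   # ('acknowledged', 1)
```

The first call shows the redelivery loop: a 90-second job under a 60-second lock burns every attempt. Raising the lock above the processing time (within the 300-second maximum noted above) ends the loop on the first attempt.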

Quick check

What single property turns on the MQTT broker when creating a namespace?
You registered a client with a valid certificate but it can’t PUBLISH. What is the most likely cause?
Which metric reveals silent data loss, and what does a healthy value look like?
A push subscription dead-letters every event with a 403. What role is missing, and where?
Give two situations where pull delivery is the right choice over push.

Answers

topicSpacesConfiguration.state = Enabled on the namespace — without it you get a pull-only namespace with no broker.
No permission binding grants the client’s group rights on the topic space — the broker is default-deny, so a valid cert alone grants nothing. Confirm with az eventgrid namespace permission-binding list.
DroppedEventCount — dropped events were lost (no dead-letter destination, or the DLQ retry window expired). A healthy pipeline keeps it at zero and dead-letters instead; alert on it at zero tolerance.
Azure Event Hubs Data Sender, granted to the namespace’s managed identity, scoped to the target event hub. The 403 is non-retryable, so every delivery goes straight to dead-letter until the role is assigned.
Any two of: the consumer cannot expose a reachable endpoint (on-prem/batch/locked-down); you need back-pressure so a slow consumer self-paces; you need a private link to consume over private IP; you process on a schedule rather than reactively.

Glossary

Event Grid namespace — A regional Event Grid resource that hosts the MQTT broker, namespace topics, and pull delivery; distinct from the classic global push router.
MQTT broker — The v3.1.1 / v5 publish-subscribe broker exposed by a namespace when topicSpacesConfiguration.state = Enabled.
Client — A registry entry for one device or app, keyed by an authentication name backed by an X.509 cert or a Microsoft Entra identity.
Client group — A query over client attributes that buckets clients for authorization (e.g. attributes.role = 'sensor').
Topic space — A named set of MQTT topic templates that a permission binding grants rights over.
Permission binding — The grant tuple (client group, topic space, Publisher/Subscriber); the only thing that lifts default-deny.
Topic template — An MQTT topic pattern, optionally using ${client.authenticationName} and +/# wildcards, that scopes access.
Namespace topic — A durable topic (up to 7-day retention) that routed MQTT messages and app events land in and fan out from.
Routing — Namespace configuration that wraps each MQTT message in CloudEvents and publishes it to one nominated topic.
Event subscription — A consumer’s filtered, independent view of a topic; each gets its own copy of every matching event.
Push delivery — deliveryMode: Push; Event Grid sends events to a registered, reachable destination as they arrive.
Pull delivery — deliveryMode: Queue; the consumer connects and receives events with queue semantics (receive/acknowledge/release/reject).
CloudEvents 1.0 — The vendor-neutral event envelope (specversion, id, source, type, plus optional subject, time, and data) that namespace topics require.
Advanced filter — A typed comparison (e.g. NumberGreaterThan, StringIn) against any attribute or data field, evaluated server-side.
Dead-letter — Writing undeliverable events to Blob Storage after the retry budget is exhausted, preserving them for replay.
DroppedEventCount — The metric for events lost (not dead-lettered); the zero-tolerance alarm of this whole guide.
deadletterreason — A field in a dead-lettered blob’s deadletterProperties that tells a replay job why the event failed.
Throughput unit — The namespace capacity that sets ingress/egress ceilings; the dominant cost and scaling lever.
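To make the topic template entry concrete, the sketch below shows how a single template containing `${client.authenticationName}` can scope each device to its own subtree. It follows standard MQTT wildcard rules (`+` for one level, `#` for the remainder) and is only an illustration: how the broker evaluates templates internally is an assumption, and the function names, template, and device names are hypothetical.

```python
AUTH_NAME_VAR = "${client.authenticationName}"


def expand(template: str, authentication_name: str) -> str:
    """Substitute the client's authentication name into a topic template."""
    return template.replace(AUTH_NAME_VAR, authentication_name)


def matches(topic_filter: str, topic: str) -> bool:
    """Standard MQTT filter matching: '+' is one level, '#' is the rest."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # '#' is only valid as the last level; it also matches the parent.
            return True
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


def allowed(template: str, authentication_name: str, topic: str) -> bool:
    """True if the topic falls inside this client's expanded template."""
    return matches(expand(template, authentication_name), topic)


# Hypothetical template and device names.
TEMPLATE = "devices/${client.authenticationName}/telemetry/#"

print(allowed(TEMPLATE, "meter-001", "devices/meter-001/telemetry/temp"))  # True
print(allowed(TEMPLATE, "meter-001", "devices/meter-002/telemetry/temp"))  # False
```

One template plus one permission binding covers the whole fleet, because the substitution happens per client rather than per rule.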

Next steps

Send the push fan-out somewhere useful: Azure Event Hubs: Kafka, Capture, Stream Analytics & Throughput Scaling.
When you need ordered, transactional, command-style messaging instead of event fan-out: Azure Service Bus: Sessions, De-duplication & Dead-Letter Patterns.
Build the pull worker and replay job as serverless: Azure Functions: Serverless Patterns and, for orchestrated multi-step replay, Azure Durable Functions: Orchestration & Fan-Out Patterns.
Govern the device side of the MQTT estate: Azure IoT Hub, DPS, Edge & Digital Twins Fundamentals.
Wire the observability that catches DroppedEventCount before a customer does: Azure Monitor & Application Insights for Observability.
Govern how long dead-letter forensics live in Blob: Azure Blob Storage: Lifecycle, Immutability & Soft Delete.