
Event Grid System Topics Explained: Reacting to Storage, Resource and Subscription Events

A file lands in a storage container at 3 a.m. and a thumbnail needs generating, a virus scan needs kicking off, and a downstream system needs telling. The old way to find out a file arrived was to poll — a timer that wakes every minute, lists the container, diffs against last time, and hopes it didn’t miss anything between runs. Polling is wasteful when nothing changes, laggy when something does, and fragile at scale. Azure Event Grid flips this around: instead of you asking “did anything happen?”, Azure tells you the instant it does. A blob is created, Event Grid pushes a small JSON event to your handler within a second or two — no timer, no list call, no missed window.

This article is about one specific and frequently misunderstood slice of Event Grid: system topics. A system topic is the built-in, Azure-managed stream of events that an Azure resource emits about itself — a Storage Account announcing “BlobCreated”, a Resource Group announcing “a resource was deployed”, a Subscription announcing “a policy compliance state changed”. You don’t write code to produce these events; Azure already does. Your job is only to subscribe — to say “when this storage account raises BlobCreated for a .jpg under /uploads/, call my Function.” That asymmetry — Azure is the publisher, you are only the subscriber — is the entire mental model, and it is what separates system topics from the custom topics you publish to yourself.

By the end you will know the four moving parts (publisher, topic, subscription, handler), read an Event Grid event payload without squinting, filter so your handler only wakes for events it cares about, and reason about what happens when delivery fails — retries, the dead-letter container, and the difference between “Event Grid couldn’t reach my handler” and “my handler threw a 500”. You will be able to wire a real blob-upload-to-Function reaction end to end, and explain when Event Grid is the right tool versus Event Hubs or Service Bus. This is foundational, AZ-204 and AZ-305 territory, and it underpins almost every serverless pattern you will build on Azure.

What problem this solves

Without an eventing layer, services that need to react to changes in other services have three bad options. They poll (the timer-and-diff loop above), which trades cost against latency and never wins both. They get tightly coupled — the storage-writing service is modified to also call the thumbnail service directly, so now an unrelated team’s deploy can break uploads, and adding a fourth consumer means a code change to the producer. Or they rely on change feeds and queues stitched together by hand, which works but is a lot of plumbing to maintain for “tell me when a blob appears.”

Event Grid removes the polling and decouples the producer from the consumers. The storage account does not know — and does not need to know — who is listening. It raises an event into its system topic; zero, one, or twenty subscribers each get their own copy delivered to their own handler. Add a consumer by adding a subscription; remove one by deleting a subscription; the publisher never changes. This is publish-subscribe at the platform level, and because Azure already emits the events, the publisher side is free.

Who hits the absence of this: any team building “do X when Y happens” across Azure resources. Image pipelines that process uploads. Compliance tooling that must react when a non-compliant resource is created. Cache-invalidation that must fire when source data changes. Audit and security automation that needs to know the moment a role assignment or resource changes. The instinct is to write a polling job; the right answer is almost always a system topic and a subscription. The few times it is not the right answer — high-throughput telemetry streams, ordered processing, long-retention replay — are exactly where Event Hubs or Service Bus belong, and knowing that boundary is half of using Event Grid well.

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should be comfortable with the basics of an Azure Storage Account (containers and blobs), able to run az commands in Cloud Shell, and able to read a block of JSON. Knowing what an Azure Function is helps, because Functions are the most common Event Grid handler; if triggers and bindings are new to you, Azure Functions Triggers and Bindings for Beginners is the companion piece — Event Grid is one of the events a Function can be triggered by. You do not need to know anything about Event Grid yet; that is what this article is for.

Where this sits: Event Grid is the discrete-event, reactive member of Azure’s messaging family. It is not a queue you pull from and it is not a stream you replay; it is a push-based notifier that fans a single event out to many subscribers. It pairs naturally with Azure Functions and serverless patterns (the compute that reacts) and sits alongside Service Bus queues and topics (the ordered, transactional sibling) and Event Hubs (the high-throughput streaming sibling). Many real architectures use Event Grid to trigger work and Service Bus to carry it reliably. Understanding the differences — covered in its own section below — is the point where this knowledge becomes load-bearing.

To anchor the family before the deep dive, here is the one-screen comparison of Azure’s three messaging services. Internalise this table and you will rarely reach for the wrong one:

| Service | Best for | Model | Delivery | Retention | Typical handler |
|---|---|---|---|---|---|
| Event Grid | Reacting to discrete events (“a blob was created”) | Publish-subscribe, push | At-least-once, retried | 24 h retry window (then dead-letter) | Function, webhook, Service Bus |
| Service Bus | Ordered/transactional messaging (commands, work items) | Queues + topics, pull | At-least-once, FIFO with sessions | Until consumed (TTL) | Worker, Function (SB trigger) |
| Event Hubs | High-throughput streaming (telemetry, logs, clickstream) | Partitioned log, pull | At-least-once, replayable | 1–90 days (configurable) | Stream Analytics, consumer group |

Core concepts

Five ideas make every later detail obvious. Read these once; the rest of the article is consequences of them.

The four parts: publisher, topic, subscription, handler. A publisher is whatever produces events — for system topics it is an Azure resource (a storage account, a resource group). A topic is the endpoint events are sent to; it is the routing point. A subscription (an event subscription) is a rule you create that says “route events matching this filter from this topic to that handler.” A handler (also called an event handler or endpoint) is the destination that receives the event — a Function, a webhook URL, a Service Bus queue. One topic can have many subscriptions; each subscription gets its own independent copy of every matching event. This is the whole architecture.
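To make the four parts concrete, here is a toy in-memory model in Python. It is only a sketch of the shape of the architecture, not how Event Grid is implemented; the `Topic` and `Subscription` classes and the shortened subjects are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

Event = dict  # an Event Grid-style event, reduced to a plain dict for this sketch


@dataclass
class Subscription:
    name: str
    matches: Callable[[Event], bool]   # the filter
    handler: Callable[[Event], None]   # the destination


@dataclass
class Topic:
    subscriptions: list[Subscription] = field(default_factory=list)

    def publish(self, event: Event) -> list[str]:
        """Give every subscription whose filter matches its own copy of the event."""
        delivered = []
        for sub in self.subscriptions:
            if sub.matches(event):
                sub.handler(dict(event))  # shallow copy: subscribers do not share one object
                delivered.append(sub.name)
        return delivered


thumbs, scans = [], []
topic = Topic([
    Subscription("thumbnails", lambda e: e["subject"].endswith(".jpg"), thumbs.append),
    Subscription("virus-scan", lambda e: True, scans.append),
])

# The publisher raises two events; it never needs to know who is listening.
print(topic.publish({"eventType": "Microsoft.Storage.BlobCreated", "subject": "/uploads/cat.jpg"}))
print(topic.publish({"eventType": "Microsoft.Storage.BlobCreated", "subject": "/uploads/notes.txt"}))
```

Adding a consumer is one more `Subscription` in the list; the code that publishes does not change, which is the property the real service gives you at platform level.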

A system topic is the topic Azure manages for a resource. You do not create the events; the resource emits them automatically. You create the system topic (a lightweight Azure resource that represents that stream) and then create subscriptions on it. For some services you can even create the subscription directly against the source and Azure provisions the system topic implicitly. The defining trait: the topic type is fixed by Azure, the event types are defined by Azure, and you are purely a consumer. Contrast this with a custom topic, which you create and you POST your own application events to — there, you are the publisher. Partner topics are a third kind, where a non-Azure SaaS (e.g. an external system) publishes into Azure through Event Grid. This article is about system topics; the model is identical, only the publisher differs.

| Topic type | Who publishes the events | Who creates the topic | Example | This article |
|---|---|---|---|---|
| System topic | Azure itself, automatically | You (lightweight, or implicit) | Storage BlobCreated, RG resource changes | The focus |
| Custom topic | Your own application code (POST) | You | Your app raises OrderPlaced | Mentioned for contrast |
| Partner topic | A third-party SaaS via Event Grid | The partner + you | External SaaS pushes events into Azure | Briefly noted |

An event is small, and it is a notification — not the data. An Event Grid event is a compact JSON object: what happened (eventType), to what (subject), when (eventTime), a unique id, and a small data payload with specifics. Crucially, the event tells you a blob was created and gives you its URL; it does not ship the blob’s bytes. The event is a doorbell, not a delivery van. Your handler reads the event, then goes and fetches whatever it needs (the blob, the resource) using the identifiers in the event. Events are designed to be small and numerous, with a maximum size around 1 MB (and billing/optimised around 64 KB chunks), so you keep payloads lean.

Delivery is push, at-least-once, and retried. Event Grid pushes to your handler — you don’t poll Event Grid. It guarantees at-least-once delivery: it will keep trying until your handler acknowledges success (an HTTP 2xx), with an exponential-ish retry schedule over a window (default up to 24 hours). “At-least-once” has a sharp edge: under retries or races your handler can receive the same event more than once, and ordering is not guaranteed. So handlers must be idempotent — processing the same event twice must be safe. If every retry fails for the whole window, the event is dead-lettered to a storage container you nominate (or dropped, if you configured none).

Filtering happens at the subscription, before your handler wakes. You rarely want every event. A subscription can filter by event type (only BlobCreated, not BlobDeleted), by subject (only blobs under /uploads/ ending in .jpg), and by advanced filters on fields inside the event. Filtering is evaluated by Event Grid before delivery, so an unmatched event never reaches — and never bills you for invoking — your handler. Good filters are how you keep a busy storage account from waking a Function 50,000 times a day for blobs it doesn’t care about. (Every term in bold above is collected in the Glossary at the end for quick lookup.)

Which Azure services emit system-topic events

System topics exist because Azure resources publish events about themselves. You can’t subscribe to a stream that doesn’t exist, so the first practical question is always “does this source emit the event I want?” The set of event sources and their event types is fixed by Azure. Here are the ones you will actually use, with the headline event types each produces:

| Event source (system topic type) | Common event types | Typical reason to subscribe |
|---|---|---|
| Storage Account (Blob) | BlobCreated, BlobDeleted, BlobRenamed, DirectoryCreated | Process/scan uploads; invalidate caches |
| Resource Group | ResourceWriteSuccess, ResourceDeleteSuccess, ResourceActionSuccess (and …Failure) | Audit/automation when resources change |
| Azure Subscription | Same Resource* event family, scoped to the whole subscription | Subscription-wide governance automation |
| Key Vault | SecretNewVersionCreated, SecretNearExpiry, CertificateNearExpiry | Rotate secrets; alert before expiry |
| App Configuration | KeyValueModified, KeyValueDeleted | Refresh config-driven services |
| Event Hubs | CaptureFileCreated | React when a capture file lands in storage |
| Container Registry | ImagePushed, ImageDeleted | Trigger deploys/scans on new images |
| Maps, Media, IoT Hub, SignalR, Machine Learning, Policy | Service-specific event families | Service-specific automation |

Two things to internalise from this table. First, the Resource* events from a Resource Group or a Subscription are the workhorse for governance and audit automation: “tell me whenever any resource is created/changed/deleted in this scope.” Second, the event types are namespaced strings (for example Microsoft.Storage.BlobCreated), and you filter on them exactly, so getting the string right matters. The next sections drill into the two most common sources, Storage and Resource/Subscription, because they cover the vast majority of real Event Grid work.

Storage events: the most common system topic

A storage account raises events as blobs change. The two you reach for constantly are Microsoft.Storage.BlobCreated (a blob was written) and Microsoft.Storage.BlobDeleted. The event’s subject encodes the path — /blobServices/default/containers/<container>/blobs/<path> — which is exactly what subject prefix/suffix filters key off. The data block carries the blob URL, content type, size and the API that caused it (PutBlob, PutBlockList, CopyBlob, etc.).

| Storage event type | Fires when | Key data fields | Common filter |
|---|---|---|---|
| Microsoft.Storage.BlobCreated | A blob is committed | url, contentLength, api, contentType | subject ends .jpg; api = PutBlockList |
| Microsoft.Storage.BlobDeleted | A blob is deleted | url, api | subject prefix on a container |
| Microsoft.Storage.BlobRenamed | A blob is renamed (HNS accounts) | sourceUrl, destinationUrl | by container prefix |
| Microsoft.Storage.DirectoryCreated | A directory is created (HNS / Data Lake) | url | Data Lake folder automation |
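Because the subject encodes the container and blob path, a handler often wants to split it back apart. The following Python sketch does that for the subject format described above; the function name `parse_blob_subject` is hypothetical, and the sample path is made up.

```python
def parse_blob_subject(subject: str) -> tuple[str, str]:
    """Split a Storage event subject into (container, blob_path).

    Expects the form /blobServices/default/containers/<container>/blobs/<path>.
    """
    prefix = "/blobServices/default/containers/"
    if not subject.startswith(prefix):
        raise ValueError(f"not a blob subject: {subject!r}")
    # Container names cannot contain "/", so the first "/blobs/" marks the boundary.
    container, sep, blob_path = subject[len(prefix):].partition("/blobs/")
    if not sep or not container or not blob_path:
        raise ValueError(f"subject has no /blobs/ segment: {subject!r}")
    return container, blob_path


container, path = parse_blob_subject(
    "/blobServices/default/containers/uploads/blobs/2026/cat.jpg"
)
print(container, path)
```

The same prefix string is what a subject begins-with filter matches against, so parsing and filtering stay consistent if both are derived from one constant.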

One sharp gotcha lives here. A naive subscription to BlobCreated can fire before a blob is fully written, most visibly on Data Lake (HNS) accounts, where CreateFile raises the event for a file that is still empty; the result is noise and duplicate-looking events. The fix is to filter on the api field (an advanced filter) for the calls that mark a completed write: PutBlockList for block uploads (or FlushWithClose on Data Lake). Add PutBlob and CopyBlob to the list if small single-shot uploads and copies should also reach your handler, because a filter on PutBlockList alone silently skips them. This single filter is the difference between a clean pipeline and one that processes half-uploaded files.

Resource and subscription events: governance and audit

A Resource Group system topic emits an event for every successful (and failed) resource operation in that group; a Subscription system topic does the same across the whole subscription. These are the events you wire to compliance and audit automation: “whenever a resource is created in rg-prod, check it’s tagged and compliant,” or “whenever any storage account is created subscription-wide, enforce a private-endpoint policy.”

| Event type | Meaning | Use it to… |
|---|---|---|
| Microsoft.Resources.ResourceWriteSuccess | A resource was created or updated | Enforce tags, trigger config drift checks |
| Microsoft.Resources.ResourceWriteFailure | A create/update failed | Alert on failed deployments |
| Microsoft.Resources.ResourceDeleteSuccess | A resource was deleted | Audit deletions; trigger cleanup |
| Microsoft.Resources.ResourceActionSuccess | A control-plane action ran (e.g. restart) | Audit operational actions |

The data block here carries the operation name, the resource URI, the caller (claims), and correlation IDs — enough for an audit handler to record who did what to which resource when. Because these can be high-volume in a busy subscription, filtering by operation name or resource type in an advanced filter is essential; subscribing to everything unfiltered will flood your handler.
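As a rough illustration of the "who did what to which resource when" record, the Python sketch below flattens a resource event into an audit row. The `data` field names used here (`operationName`, `resourceUri`, `correlationId`, `claims`) are my understanding of the Resource event payload and should be verified against a real event; the function name and the sample values are hypothetical.

```python
def to_audit_record(event: dict) -> dict:
    """Flatten a Microsoft.Resources.* event into a simple audit row."""
    data = event.get("data", {})
    return {
        "event_id": event["id"],
        "when": event["eventTime"],
        "what": event["eventType"],
        "operation": data.get("operationName"),     # assumed field name
        "resource": data.get("resourceUri"),        # assumed field name
        "correlation_id": data.get("correlationId"),
        "claims": data.get("claims", {}),           # caller identity; keys vary by token
    }


# Hypothetical sample, trimmed to the fields the sketch reads.
sample = {
    "id": "evt-1",
    "eventType": "Microsoft.Resources.ResourceWriteSuccess",
    "eventTime": "2026-06-24T09:15:02Z",
    "data": {
        "operationName": "Microsoft.Storage/storageAccounts/write",
        "resourceUri": "/subscriptions/<sub>/resourceGroups/rg-prod/providers/Microsoft.Storage/storageAccounts/example",
        "correlationId": "corr-1",
        "claims": {},
    },
}
print(to_audit_record(sample)["operation"])
```

Using `.get` for everything inside `data` keeps the handler from throwing on an event family whose payload differs, which matters because a thrown exception becomes a retry.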

Reading an Event Grid event: schemas and fields

Every handler starts by parsing the event, so you must be fluent in the payload. Event Grid supports two schemas: its native Event Grid schema (the legacy default) and CloudEvents 1.0 (an open CNCF standard, increasingly the recommended default for interoperability). They carry the same information under slightly different field names. Here is the same conceptual event in both, starting with the Event Grid schema fields:

| Event Grid schema field | What it is | Example |
|---|---|---|
| id | Unique event id (use for idempotency) | "a1b2c3…" |
| eventType | What happened | "Microsoft.Storage.BlobCreated" |
| subject | What it happened to (path) | "/…/containers/uploads/blobs/cat.jpg" |
| eventTime | When it happened (UTC) | "2026-06-24T09:15:02Z" |
| data | Event-specific payload object | { "url": "…", "contentLength": 1048576, "api": "PutBlockList" } |
| dataVersion | Version of the data schema | "1.0" |
| topic | Full resource ID of the topic | /subscriptions/…/storageAccounts/… |

CloudEvents 1.0 maps these to standardised names: eventType becomes type, topic becomes source, eventTime becomes time, and metadataVersion (an envelope field not listed in the table above) gives way to specversion: "1.0"; subject, id and data keep their names. So a handler that parsed the Event Grid schema needs only those four field renames to read CloudEvents, and nothing else changes conceptually.
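The renames can be written down as a small Python function. This is a field-name illustration only, not a spec-complete converter: the function name is hypothetical, and dropping `dataVersion` is a simplification made for this sketch rather than something the mapping above prescribes.

```python
def to_cloudevent_names(eg_event: dict) -> dict:
    """Rename Event Grid schema fields to their CloudEvents 1.0 names."""
    renames = {"eventType": "type", "topic": "source", "eventTime": "time"}
    out = {"specversion": "1.0"}  # replaces metadataVersion
    for key, value in eg_event.items():
        if key in ("metadataVersion", "dataVersion"):
            continue  # dataVersion is dropped here as a simplification
        out[renames.get(key, key)] = value  # id, subject and data keep their names
    return out


eg = {
    "id": "1807e102",
    "topic": "/subscriptions/<sub>/resourceGroups/rg-evt/providers/Microsoft.Storage/storageAccounts/stevtdemo",
    "subject": "/blobServices/default/containers/uploads/blobs/cat.jpg",
    "eventType": "Microsoft.Storage.BlobCreated",
    "eventTime": "2026-06-24T09:15:02Z",
    "dataVersion": "1.0",
    "metadataVersion": "1",
    "data": {"api": "PutBlockList"},
}
print(sorted(to_cloudevent_names(eg)))
```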

Two practical rules. First, pick one schema per subscription and have your handler parse that — you set the delivery schema when you create the subscription (--event-delivery-schema EventGridSchema or CloudEventSchemaV1_0); for new work, CloudEvents is the safer default because tooling across clouds understands it. Second, treat the id field as your idempotency key (it’s the same id across retries of the same event). Here is a minimal Storage BlobCreated event in the Event Grid schema so you can see the shape end to end:

{
  "id": "1807e102-…",
  "topic": "/subscriptions/…/resourceGroups/rg-evt/providers/Microsoft.Storage/storageAccounts/stevtdemo",
  "subject": "/blobServices/default/containers/uploads/blobs/cat.jpg",
  "eventType": "Microsoft.Storage.BlobCreated",
  "eventTime": "2026-06-24T09:15:02.1234567Z",
  "dataVersion": "1.0",
  "data": {
    "api": "PutBlockList",
    "contentType": "image/jpeg",
    "contentLength": 1048576,
    "url": "https://stevtdemo.blob.core.windows.net/uploads/cat.jpg"
  }
}
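Treating `id` as the idempotency key can be sketched in a few lines of Python. The in-memory set below stands in for whatever durable store a real handler would use (a table, a cache, a database row), and the class name `IdempotentHandler` is hypothetical.

```python
class IdempotentHandler:
    """Run `process` at most once per event id, even if the event is redelivered."""

    def __init__(self, process):
        self._process = process
        self._seen = set()  # stand-in for a durable store shared across instances

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["id"]
        if event_id in self._seen:
            return False
        self._process(event)
        # Record the id only after success, so a failed run is retried rather than skipped.
        self._seen.add(event_id)
        return True


processed = []
handler = IdempotentHandler(processed.append)
event = {"id": "1807e102", "eventType": "Microsoft.Storage.BlobCreated"}

print(handler.handle(event))  # first delivery
print(handler.handle(event))  # redelivery of the same event
print(len(processed))
```

Recording the id after processing means a crash between the two steps can still cause one repeat, so the work itself should also be safe to redo; the key check narrows the window rather than closing it.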

Filtering: only wake the handler that cares

A subscription with no filter receives every event the topic emits. On a busy storage account that is thousands of invocations a day, most of them irrelevant — and each one may bill you for a Function execution. Filtering is therefore not optional polish; it is the core of a sane subscription. Event Grid offers three layers, from cheapest/simplest to most expressive:

| Filter type | Filters on | Example | Notes |
|---|---|---|---|
| Event type | eventType exact match | only Microsoft.Storage.BlobCreated | Cheapest; always set it |
| Subject begins-with | subject prefix | /…/containers/uploads/blobs/ | One container or path |
| Subject ends-with | subject suffix | .jpg | File extension matching |
| Advanced filters | Any field, incl. inside data | data.api = PutBlockList; data.contentLength > 1000 | Up to ~25 filters; operators below |

Subject filters are case-insensitive by default; if you opt into case-sensitive matching (--subject-case-sensitive on the CLI), /Uploads/ and /uploads/ become different paths, a classic “why didn’t my handler fire” bug. Advanced filters are the powerful layer: they compare any JSON field, including nested data fields, using a set of operators. The ones you’ll use:

| Operator | Meaning | Example field & value |
|---|---|---|
| StringBeginsWith / StringEndsWith | Prefix/suffix on a string | subject ends .jpg |
| StringContains / StringIn / StringNotIn | Substring / set membership | data.api In ["PutBlockList","FlushWithClose"] |
| NumberGreaterThan / NumberLessThan / NumberIn | Numeric comparison | data.contentLength > 0 |
| NumberGreaterThanOrEquals / …LessThanOrEquals | Inclusive numeric | data.contentLength >= 1024 |
| BoolEquals | Boolean match | a custom boolean field |

A real example ties it together: to react only to fully-committed JPEG uploads in the uploads container, you combine an event-type filter (BlobCreated), a subject begins-with (/blobServices/default/containers/uploads/blobs/), a subject ends-with (.jpg), and an advanced filter on data.api StringIn ["PutBlockList"]. That subscription wakes your handler only for the events that matter, and ignores the thousands that don’t.

# Filtered subscription: only committed .jpg uploads, to a Function
az eventgrid system-topic event-subscription create \
  --name sub-thumbnails \
  --resource-group rg-evt \
  --system-topic-name st-stevtdemo \
  --endpoint-type azurefunction \
  --endpoint "/subscriptions/<sub>/resourceGroups/rg-evt/providers/Microsoft.Web/sites/fn-thumbs/functions/MakeThumbnail" \
  --included-event-types Microsoft.Storage.BlobCreated \
  --subject-begins-with "/blobServices/default/containers/uploads/blobs/" \
  --subject-ends-with ".jpg" \
  --advanced-filter data.api StringIn PutBlockList
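When a handler does not fire, it can help to reason about the filter offline. The Python sketch below approximates the four conditions of the subscription above against sample events. It is a local approximation written for this article, not Event Grid's matching engine; the function name and parameters are hypothetical, and `case_sensitive` is passed explicitly so you can mirror whichever mode your subscription uses.

```python
def matches(event: dict, *, event_types: set[str], begins_with: str, ends_with: str,
            api_in: set[str], case_sensitive: bool) -> bool:
    """Approximate: event type AND subject prefix AND subject suffix AND data.api StringIn."""
    if event["eventType"] not in event_types:
        return False
    subject = event["subject"]
    if not case_sensitive:
        subject, begins_with, ends_with = subject.lower(), begins_with.lower(), ends_with.lower()
    if not (subject.startswith(begins_with) and subject.endswith(ends_with)):
        return False
    return event.get("data", {}).get("api") in api_in


thumbnail_filter = dict(
    event_types={"Microsoft.Storage.BlobCreated"},
    begins_with="/blobServices/default/containers/uploads/blobs/",
    ends_with=".jpg",
    api_in={"PutBlockList"},
    case_sensitive=True,
)


def blob_event(path: str, api: str) -> dict:
    return {
        "eventType": "Microsoft.Storage.BlobCreated",
        "subject": "/blobServices/default/containers/uploads/blobs/" + path,
        "data": {"api": api},
    }


print(matches(blob_event("cat.jpg", "PutBlockList"), **thumbnail_filter))
print(matches(blob_event("cat.png", "PutBlockList"), **thumbnail_filter))
print(matches(blob_event("cat.jpg", "PutBlob"), **thumbnail_filter))
```

All conditions are combined with AND, which is why one overly narrow value (a wrong extension, a missing api name) is enough to silence the handler completely.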

Delivery, retries and dead-lettering

This is the section that separates people who use Event Grid from people who get paged by it. You must understand what happens after an event is raised but the handler isn’t happy. Event Grid’s contract is at-least-once delivery: it tries to deliver until your handler returns a success status, retrying on failure. Success is an HTTP 2xx within the timeout; anything else (5xx, timeout, connection refused) is a failure that triggers a retry.

| Handler response | Event Grid treats it as | What happens next |
|---|---|---|
| HTTP 200/202 (2xx) | Success | Event acknowledged, done |
| HTTP 400 / 413 (bad request / too large) | Permanent failure | No retry; straight to dead-letter |
| HTTP 401 / 403 / 404 | Permanent failure (config error) | No retry; dead-letter, then fix the endpoint |
| HTTP 408 / 429 / 5xx | Transient failure | Retried on the schedule |
| Timeout / connection refused | Transient failure | Retried on the schedule |

The distinction matters: some failures are not retried. A 400 Bad Request or 413 Payload Too Large is treated as the handler permanently rejecting the event, so Event Grid dead-letters it immediately rather than retrying for 24 hours. A 404 (wrong URL) or 401/403 (auth misconfigured) is similarly non-retriable — these are your config being wrong, and retrying wouldn’t help. Transient codes (429, 503, timeouts) are retried with back-off.
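The table above translates directly into a small classifier, which is handy in handler tests to check that your code returns a status Event Grid will treat the way you intend. This Python sketch encodes only the rows listed in this article; treating every unlisted status as transient is a choice made for the sketch, and the function name is hypothetical.

```python
from typing import Optional

PERMANENT = {400, 401, 403, 404, 413}  # rejected or misconfigured: not retried


def classify_response(status: Optional[int]) -> str:
    """Return 'success', 'permanent' or 'transient'. None means timeout or refused connection."""
    if status is None:
        return "transient"
    if 200 <= status < 300:
        return "success"
    if status in PERMANENT:
        return "permanent"
    return "transient"  # 408, 429, 5xx, and (in this sketch) anything unlisted


for status in (202, 400, 404, 429, 503, None):
    print(status, classify_response(status))
```

The practical consequence for handler authors: return a 4xx from the permanent set only when retrying truly cannot help, because that response skips the retry window entirely.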

The retry schedule is best-effort exponential back-off: Event Grid retries quickly at first, then spaces attempts out, over a configurable window. The two knobs you control per subscription:

| Retry control | What it sets | Default | Range / note |
|---|---|---|---|
| --max-delivery-attempts | Max number of delivery tries | 30 | 1–30 |
| --event-ttl (time-to-live) | How long to keep retrying | 1440 min (24 h) | 1–1440 minutes |
| Dead-letter destination | Where un-deliverable events go | none (dropped!) | A blob container you nominate |
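How the attempt limit and the time-to-live interact is easier to see with a small simulation. The Python sketch below counts attempts until one of the two limits stops the retries. The delay list is purely illustrative and is not Event Grid's actual schedule; the function name is hypothetical.

```python
def attempts_before_giving_up(delays_s: list[float], max_attempts: int, ttl_minutes: int):
    """Count delivery attempts until max_attempts or the TTL ends the retries.

    delays_s[i] is the wait before retry i+1; the last delay repeats if the list runs out.
    Returns (attempts_made, limit_that_stopped_it).
    """
    elapsed = 0.0
    attempts = 1  # the first delivery is attempted immediately
    ttl_s = ttl_minutes * 60
    while attempts < max_attempts:
        delay = delays_s[min(attempts - 1, len(delays_s) - 1)]
        if elapsed + delay > ttl_s:
            return attempts, "ttl"  # the next retry would fall outside the window
        elapsed += delay
        attempts += 1
    return attempts, "max_attempts"


ILLUSTRATIVE_DELAYS = [10, 30, 60, 300, 600]  # seconds; made-up back-off for this sketch

print(attempts_before_giving_up(ILLUSTRATIVE_DELAYS, max_attempts=30, ttl_minutes=20))
print(attempts_before_giving_up(ILLUSTRATIVE_DELAYS, max_attempts=3, ttl_minutes=1440))
```

With a short TTL the window closes long before 30 attempts are used, and with a low attempt cap the retries stop within the first minute, which is why the two settings should be chosen together rather than independently.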

Whichever limit (attempts or TTL) is hit first ends the retries. And here is the rule that bites teams: if you do not configure a dead-letter destination, exhausted events are silently dropped. Always set one in production. The dead-letter target is a blob container; failed events land there as JSON, annotated with the reason (deliveryAttempts, lastDeliveryOutcome, lastHttpStatusCode), so you can inspect and replay them. Wire it up:

# Add a dead-letter container to an existing subscription.
# Retry policy (--max-delivery-attempts, --event-ttl) is normally set on
# `event-subscription create`; `update` may not accept those flags in your CLI
# version. The defaults (30 attempts, 1440 minutes) already apply here.
az eventgrid system-topic event-subscription update \
  --name sub-thumbnails \
  --resource-group rg-evt \
  --system-topic-name st-stevtdemo \
  --deadletter-endpoint "/subscriptions/<sub>/resourceGroups/rg-evt/providers/Microsoft.Storage/storageAccounts/stevtdemo/blobServices/default/containers/deadletter"

The decision table that ends the 3 a.m. confusion — “is this Event Grid’s fault or my handler’s?”:

| If you see… | It’s probably… | Do this |
|---|---|---|
| Events in the dead-letter container with lastHttpStatusCode: 404 | Wrong/deleted handler endpoint | Fix the endpoint URL; re-create the subscription |
| Dead-letter with 5xx and high deliveryAttempts | Handler crashing/throwing | Fix the handler bug; events were retried for 24 h |
| Dead-letter with 400/413 and deliveryAttempts: 1 | Handler rejected (bad request / too big) | Handler returned 4xx; fix what it rejects |
| No events arriving at all, none dead-lettered | Filter excludes them, or webhook not validated | Check filters/subject case; check handshake (below) |
| Duplicate processing | At-least-once + non-idempotent handler | Add idempotency on event id |
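The dead-letter rows of that table can be applied mechanically to whatever lands in the container. The Python sketch below assumes each dead-letter blob holds a JSON array of event records carrying `lastHttpStatusCode`; that layout is an assumption to check against a real blob, and the function names and sample records are hypothetical.

```python
import json


def diagnose(record: dict) -> str:
    """Map one dead-lettered record to the likely cause from the decision table."""
    status = record.get("lastHttpStatusCode")
    if status == 404:
        return "wrong or deleted handler endpoint"
    if status in (400, 413):
        return "handler rejected the event"
    if status is not None and status >= 500:
        return "handler crashing or throwing"
    return "unclassified: inspect lastDeliveryOutcome"


def summarize(dead_letter_json: str) -> dict:
    """Count dead-lettered records by diagnosis."""
    counts: dict = {}
    for record in json.loads(dead_letter_json):
        label = diagnose(record)
        counts[label] = counts.get(label, 0) + 1
    return counts


# Hypothetical blob content, trimmed to the fields the sketch reads.
blob_text = json.dumps([
    {"id": "a", "deliveryAttempts": 30, "lastHttpStatusCode": 500},
    {"id": "b", "deliveryAttempts": 30, "lastHttpStatusCode": 503},
    {"id": "c", "deliveryAttempts": 1, "lastHttpStatusCode": 400},
])
print(summarize(blob_text))
```

A summary like this is usually the first thing to look at after an incident, before deciding whether the dead-lettered events are worth replaying.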

Choosing a handler (endpoint) type

A subscription routes to exactly one handler. Which one depends on what you need: synchronous compute, durable buffering, fan-in to a worker, or just a webhook. Event Grid supports several native endpoint types, and the choice changes reliability and scale characteristics:

| Handler type | Use when | Reliability characteristic | Note |
|---|---|---|---|
| Azure Function (Event Grid trigger) | You want to run code per event | Function handles retry/scale; idempotency on you | The default for “do X on event” |
| Webhook (HTTP) | An external/custom HTTP endpoint | Must return 2xx fast; must pass validation handshake | Most general; needs the handshake |
| Service Bus queue/topic | You need durable, ordered, transactional processing downstream | Event buffered in SB; consumer pulls at its pace | Event Grid → SB → worker is a top pattern |
| Storage Queue | Simple, cheap durable buffering | Event sits in a queue until consumed | Lightweight alternative to Service Bus |
| Event Hubs | You want to aggregate many events into a stream | Buffers into a partitioned log | For high-volume re-aggregation |
| Relay Hybrid Connection | On-prem handler behind a firewall | Tunnels to on-prem | Niche but useful |

The recurring senior-engineer pattern is Event Grid → Service Bus → worker: Event Grid gives you the cheap, decoupled “something happened” notification with fan-out and filtering; Service Bus gives the downstream the durability, ordering and transaction semantics that Event Grid deliberately doesn’t. If your handler must never drop work and must process in order, don’t point Event Grid straight at fragile compute — land it in a queue first. For the common “resize this image” case, a direct Function handler is perfect and simplest.

The webhook validation handshake

If your handler is a raw webhook (not a native Azure handler like Functions or Service Bus, which Azure validates automatically), Event Grid will not deliver real events until your endpoint proves it wants them. On subscription creation, Event Grid sends a SubscriptionValidationEvent, and your endpoint must echo back the validationCode it contains (or respond to a validation URL). This stops attackers from pointing Event Grid subscriptions at arbitrary URLs to flood them. If you see “subscription failed to validate,” this handshake is why — your webhook didn’t return the code. Native handlers (Functions, Logic Apps, Service Bus, Storage Queue, Event Hubs) skip this because Azure trusts and validates them internally.
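The echo itself is small. The Python sketch below shows the framework-independent core of a webhook that passes the handshake, assuming the subscription delivers in the Event Grid schema (a JSON array of events per POST) and that the expected reply body is an object with a `validationResponse` field. The function name and the `on_event` callback are hypothetical; wiring this into a web framework is left out.

```python
VALIDATION_EVENT = "Microsoft.EventGrid.SubscriptionValidationEvent"


def handle_webhook_post(events: list, on_event) -> tuple:
    """Return (http_status, json_body) for one POST from Event Grid."""
    for event in events:
        if event.get("eventType") == VALIDATION_EVENT:
            # Echo the code back so Event Grid activates the subscription.
            return 200, {"validationResponse": event["data"]["validationCode"]}
    for event in events:
        on_event(event)  # real events: hand off quickly and acknowledge
    return 200, {}


received = []

handshake = [{"eventType": VALIDATION_EVENT, "data": {"validationCode": "abc-123"}}]
print(handle_webhook_post(handshake, received.append))

real = [{"id": "1", "eventType": "Microsoft.Storage.BlobCreated", "data": {}}]
print(handle_webhook_post(real, received.append))
```

Checking for the validation event before anything else keeps the handshake working even if the rest of the handler is still unfinished, which makes "subscription failed to validate" much easier to rule out.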

Architecture at a glance

Follow the path left to right. A user (or any writer) uploads a file into a blob container on a storage account. That storage account is the publisher: the instant the upload commits, it raises a Microsoft.Storage.BlobCreated event into its system topic — a lightweight Azure resource that represents this account’s event stream. You did not write any code to produce that event; Azure emits it for free. The system topic is the routing point, and hanging off it are one or more event subscriptions, each its own filter-plus-handler rule.

The first subscription filters tightly — event type BlobCreated, subject ending .jpg, data.api is PutBlockList — and pushes matching events to an Azure Function that generates a thumbnail. A second subscription, with a different filter, fans the same upload event out to a Service Bus queue that a durable worker drains for virus scanning, because that path must never drop work. When Event Grid pushes to a handler and gets back anything but a 2xx, it retries on a back-off schedule for up to 24 hours; if every attempt fails, the event is dead-lettered into a nominated blob container where you can inspect the failure reason and replay it. The numbered badges mark the four places this design either bites or saves you: the un-committed-blob noise problem, the filter that prevents it, the dead-letter safety net, and the webhook validation handshake that an external endpoint must pass.

[Figure: Left-to-right Azure Event Grid system-topic architecture. A user uploads a blob to a Storage Account, which acts as publisher and raises a BlobCreated event into its Event Grid system topic. Two event subscriptions filter and fan the event out: one to an Azure Function for thumbnailing, one to a Service Bus queue drained by a durable worker for virus scanning. A dead-letter storage container catches events that fail every retry over the 24-hour window. Numbered badges mark the PutBlock noise trap, the api=PutBlockList filter, the dead-letter safety net and the webhook validation handshake.]

Real-world scenario

ContosoSnap, a fictional photo-sharing startup, lets users upload images that must be resized into three thumbnail sizes, scanned for malware, and indexed for search — within a couple of seconds of upload, at unpredictable volume (a viral post can spike from 5 to 5,000 uploads a minute). Their first design was a timer-triggered Function that ran every minute, listed the uploads container, and processed anything new. It worked in the demo and fell over in production: at low traffic it burned compute listing an empty container 1,440 times a day; at high traffic the per-minute batch lagged users by up to 60 seconds and occasionally double-processed blobs whose state it misjudged between runs.

They moved to Event Grid system topics. They created a system topic on the storage account and two subscriptions. The first routed Microsoft.Storage.BlobCreated events — filtered to subject ending in image extensions and data.api StringIn ["PutBlockList"] — directly to the thumbnailing Function, which now fires within a second or two of each upload and scales out automatically with load. The PutBlockList filter was the fix for a bug they’d hit immediately: without it, the Function fired on intermediate block writes and tried to resize half-uploaded files, producing corrupt thumbnails. The second subscription fanned the same events into a Service Bus queue drained by the virus-scanning worker, because scanning must never drop a file and must survive the scanner being down for maintenance — Event Grid’s 24-hour retry plus the queue’s durability gave them that guarantee without coupling the two paths.

Two incidents taught them the rest. One night the thumbnailing Function had a bad deploy and returned 500 for twenty minutes. Because they had configured a dead-letter container, the events that exhausted their retries landed there as JSON with lastHttpStatusCode: 500 and deliveryAttempts: 30; once the deploy was rolled back, they wrote a tiny script to re-publish the dead-lettered events and recovered every missed upload — nothing was lost. The second incident was subtler: an analytics teammate added a raw webhook subscription to a third-party service and it “didn’t work.” The cause was the validation handshake — the webhook never echoed the validationCode, so Event Grid never activated the subscription. Switching the third-party integration to land in a Storage Queue (a native handler that needs no handshake) and having their own code drain it fixed it in an hour.

The outcome: end-to-end upload-to-thumbnail latency dropped from up to 60 seconds to under 3, compute cost fell because nothing polls an empty container, and adding the search-indexing consumer later was a one-line subscription, not a change to the upload path. The producer (storage) never knew or cared how many consumers existed — which is the entire point of the pattern.

Advantages and disadvantages

Event Grid is the right tool for a specific shape of problem and the wrong tool for others. The explicit trade-off:

| Advantages | Disadvantages |
|---|---|
| Publisher emits events for free (system topics); no producer code | Not for high-throughput streaming (use Event Hubs) |
| Decouples producer from consumers; add/remove subscribers freely | No ordering guarantee; no FIFO (use Service Bus sessions) |
| Push-based: sub-second reaction, no polling cost | At-least-once → duplicates possible; handler must be idempotent |
| Fan-out: one event to many subscribers, each filtered | No long retention/replay: 24 h retry window, then dead-letter |
| Filtering before delivery saves handler invocations/cost | Webhooks need a validation handshake; a footgun for newcomers |
| At-least-once with 24 h retry + dead-letter = durable enough | Events are notifications, not data; handler still fetches the payload |
| Serverless, pay-per-operation, scales automatically | High-volume unfiltered subscriptions can flood handlers/costs |

When each side matters: choose Event Grid when the workload is reactive and event-shaped — “when X happens, do Y” — and you value decoupling and fan-out over ordering and replay. That covers most blob-processing, governance-automation and cache-invalidation work. Avoid it (or pair it with something else) when you need strict ordering (a payment pipeline → Service Bus with sessions), high-throughput streaming with replay (millions of telemetry events/sec → Event Hubs), or guaranteed durable work queues (Event Grid → Service Bus, never Event Grid → fragile compute directly). The most common production architecture uses Event Grid for the notification and Service Bus for the durable carriage — they are complements, not competitors.

Hands-on lab

This builds a real blob-upload-to-Function reaction using Event Grid system topics, entirely with az CLI, on resources that fit comfortably in free credits. Run it in Cloud Shell. Total time ~15 minutes; teardown at the end removes everything.

Step 1 — variables and resource group.

RG=rg-evt-lab
LOC=eastus
SA=stevt$RANDOM            # storage account names must be globally unique + lowercase
az group create --name $RG --location $LOC

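Step 2 — create the storage account (the publisher) and a container. The --auth-mode login calls in this lab run as your signed-in identity, which needs data-plane roles on the account (Storage Blob Data Contributor and Storage Queue Data Contributor); Owner or Contributor alone does not grant them. If a command fails with an authorization error, assign those roles and retry after a minute or two.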

az storage account create --name $SA --resource-group $RG --location $LOC \
  --sku Standard_LRS --kind StorageV2
az storage container create --name uploads --account-name $SA --auth-mode login

Step 3 — register the Event Grid resource provider (once per subscription).

az provider register --namespace Microsoft.EventGrid
# Wait until it reports "Registered":
az provider show --namespace Microsoft.EventGrid --query registrationState -o tsv

Step 4 — create the system topic for the storage account. A system topic is a lightweight resource pointing at the source. Expected result: a topic of type Microsoft.Storage.StorageAccounts.

SA_ID=$(az storage account show -n $SA -g $RG --query id -o tsv)
az eventgrid system-topic create \
  --name st-$SA \
  --resource-group $RG \
  --location $LOC \
  --topic-type Microsoft.Storage.StorageAccounts \
  --source $SA_ID

Step 5 — wire a quick handler. For a zero-code test, route to a Storage Queue so you can see events land without deploying a Function. Create a queue, then subscribe with a tight filter.

az storage queue create --name eventqueue --account-name $SA --auth-mode login

az eventgrid system-topic event-subscription create \
  --name sub-uploads \
  --resource-group $RG \
  --system-topic-name st-$SA \
  --endpoint-type storagequeue \
  --endpoint "$SA_ID/queueservices/default/queues/eventqueue" \
  --included-event-types Microsoft.Storage.BlobCreated \
  --subject-begins-with "/blobServices/default/containers/uploads/blobs/" \
  --advanced-filter data.api StringIn PutBlob PutBlockList CopyBlob

Step 6 — add a dead-letter container (the production-grade habit).

az storage container create --name deadletter --account-name $SA --auth-mode login
az eventgrid system-topic event-subscription update \
  --name sub-uploads --resource-group $RG --system-topic-name st-$SA \
  --max-delivery-attempts 30 --event-ttl 1440 \
  --deadletter-endpoint "$SA_ID/blobServices/default/containers/deadletter"

Step 7 — trigger an event. Upload a blob; within a second or two an event should appear in the queue.

echo "hello event grid" > sample.jpg
az storage blob upload --account-name $SA --container-name uploads \
  --name sample.jpg --file sample.jpg --auth-mode login

Step 8 — verify the event landed. Peek the queue; you should see one message whose body is a BlobCreated event JSON with subject ending sample.jpg.

az storage message peek --queue-name eventqueue --account-name $SA --auth-mode login -o jsonc

Expected: a base64 message body that decodes to the event, with "eventType": "Microsoft.Storage.BlobCreated" and your blob’s URL in data.url. If nothing appears, the troubleshooting section below maps the usual causes — most often a subject-case mismatch or the data.api filter excluding your upload’s API.
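If you would rather check the payload in code than by eye, the decode step can be sketched as below. This is a minimal illustration, not part of the lab: decode_queue_message is a hypothetical helper, and the sample event is a hand-built stand-in for the content field that the peek command prints, with made-up id and URL values.

```python
import base64
import json


def decode_queue_message(body_b64: str) -> dict:
    """Decode a base64 queue message body into an event dict."""
    raw = base64.b64decode(body_b64)
    event = json.loads(raw.decode("utf-8"))
    # Defensive: accept a single event object or a one-element batch.
    if isinstance(event, list):
        event = event[0]
    return event


# Hypothetical sample standing in for the "content" field of a peeked
# message; every value here is illustrative only.
sample_event = {
    "id": "00000000-0000-0000-0000-000000000001",
    "eventType": "Microsoft.Storage.BlobCreated",
    "subject": "/blobServices/default/containers/uploads/blobs/sample.jpg",
    "data": {
        "api": "PutBlob",
        "url": "https://example.blob.core.windows.net/uploads/sample.jpg",
    },
}
sample_body = base64.b64encode(json.dumps(sample_event).encode("utf-8")).decode("ascii")

decoded = decode_queue_message(sample_body)
print(decoded["eventType"], decoded["data"]["api"], decoded["data"]["url"])
```

Printing data.api alongside the event type is deliberate: it tells you which storage call raised the event, which is the first thing to compare against your advanced filter when an upload seems to go missing.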

Step 9 — teardown. One command removes the lot.

az group delete --name $RG --yes --no-wait

The equivalent Bicep for the system topic and a Function subscription, for when you move this from lab to repo:

param location string = resourceGroup().location
param storageAccountName string
param functionResourceId string

resource sa 'Microsoft.Storage/storageAccounts@2023-05-01' existing = {
  name: storageAccountName
}

resource systemTopic 'Microsoft.EventGrid/systemTopics@2024-06-01-preview' = {
  name: 'st-${storageAccountName}'
  location: location
  properties: {
    source: sa.id
    topicType: 'Microsoft.Storage.StorageAccounts'
  }
}

resource sub 'Microsoft.EventGrid/systemTopics/eventSubscriptions@2024-06-01-preview' = {
  parent: systemTopic
  name: 'sub-thumbnails'
  properties: {
    destination: {
      endpointType: 'AzureFunction'
      properties: { resourceId: functionResourceId }
    }
    filter: {
      includedEventTypes: [ 'Microsoft.Storage.BlobCreated' ]
      subjectBeginsWith: '/blobServices/default/containers/uploads/blobs/'
      subjectEndsWith: '.jpg'
      advancedFilters: [
        { operatorType: 'StringIn', key: 'data.api', values: [ 'PutBlob', 'PutBlockList' ] }
      ]
    }
    eventDeliverySchema: 'CloudEventSchemaV1_0'
    retryPolicy: { maxDeliveryAttempts: 30, eventTimeToLiveInMinutes: 1440 }
  }
}
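To reason about what a filter block like this lets through, it can help to replay a few events against it locally. The sketch below is a simplified stand-in, not Event Grid's implementation: SubscriptionFilter and its field names are hypothetical, the sample events are invented, and case-sensitive subject matching is switched on explicitly to reproduce the /Uploads/ versus /uploads/ mismatch.

```python
from dataclasses import dataclass, field


@dataclass
class SubscriptionFilter:
    """Local stand-in for an event subscription's filter block."""
    included_event_types: list[str]
    subject_case_sensitive: bool  # plays the role of isSubjectCaseSensitive
    subject_begins_with: str = ""
    subject_ends_with: str = ""
    api_in: list[str] = field(default_factory=list)  # data.api StringIn values

    def matches(self, event: dict) -> bool:
        if self.included_event_types and event["eventType"] not in self.included_event_types:
            return False
        subject = event["subject"]
        begins, ends = self.subject_begins_with, self.subject_ends_with
        if not self.subject_case_sensitive:
            subject, begins, ends = subject.lower(), begins.lower(), ends.lower()
        if not (subject.startswith(begins) and subject.endswith(ends)):
            return False
        if self.api_in:
            api = event.get("data", {}).get("api", "")
            # Simplification: API names are compared case-insensitively here.
            if api.lower() not in {value.lower() for value in self.api_in}:
                return False
        return True


PREFIX = "/blobServices/default/containers/uploads/blobs/"
thumbnail_filter = SubscriptionFilter(
    included_event_types=["Microsoft.Storage.BlobCreated"],
    subject_case_sensitive=True,
    subject_begins_with=PREFIX,
    subject_ends_with=".jpg",
    api_in=["PutBlob", "PutBlockList"],
)

events = [
    {"eventType": "Microsoft.Storage.BlobCreated", "subject": PREFIX + "cat.jpg",
     "data": {"api": "PutBlob"}},
    {"eventType": "Microsoft.Storage.BlobDeleted", "subject": PREFIX + "cat.jpg",
     "data": {"api": "DeleteBlob"}},
    {"eventType": "Microsoft.Storage.BlobCreated", "subject": PREFIX + "notes.txt",
     "data": {"api": "PutBlob"}},
    {"eventType": "Microsoft.Storage.BlobCreated",
     "subject": "/blobServices/default/containers/Uploads/blobs/dog.jpg",
     "data": {"api": "PutBlockList"}},
]
delivered = [e for e in events if thumbnail_filter.matches(e)]
print(f"{len(delivered)} of {len(events)} events would reach the handler")
```

Only cat.jpg gets through: the delete is the wrong event type, notes.txt fails the suffix, and dog.jpg fails the case-sensitive prefix. Every rejected event is a handler invocation that never happens.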

Common mistakes & troubleshooting

The failures below are the ones that actually generate support tickets. Each is symptom → root cause → how to confirm → fix.

| # | Symptom | Root cause | Confirm with | Fix |
|---|---|---|---|---|
| 1 | Handler never fires, nothing dead-lettered | Subject filter case mismatch (/Uploads/ vs /uploads/) | Re-read the actual subject in a captured event | Match case exactly, or make subject matching case-insensitive (isSubjectCaseSensitive: false) |
| 2 | Handler fires on half-uploaded blobs | No data.api filter; reacting to calls that do not mark a committed blob (e.g. CreateFile on hierarchical-namespace accounts) | Inspect events’ data.api field | Add advanced filter data.api StringIn ["PutBlob", "PutBlockList", "CopyBlob", "FlushWithClose"] |
| 3 | Webhook “subscription failed to validate” | Endpoint didn’t echo the validationCode | Subscription provisioning state = failed | Implement the handshake, or use a native handler (Function/queue) |
| 4 | Same event processed twice | At-least-once delivery + non-idempotent handler | Duplicate side-effects with same event id | De-dupe on id; make the operation idempotent |
| 5 | Events vanish on handler outage | No dead-letter destination configured | Subscription has no deadLetterDestination | Configure a dead-letter blob container |
| 6 | No events at all after creating topic | Resource provider Microsoft.EventGrid not registered | az provider show … registrationState | az provider register --namespace Microsoft.EventGrid |
| 7 | Handler flooded, costs spike | Subscription has no/loose filter on a busy account | Metrics show huge Delivery Attempts | Tighten event-type + subject + advanced filters |
| 8 | 404/401 in dead-letter immediately | Handler endpoint URL wrong or auth missing | Dead-letter lastHttpStatusCode 404/401 | Fix endpoint resource ID / managed-identity access |
| 9 | “Why is my Function getting BlobDeleted too?” | Subscribed to all event types, not just BlobCreated | Check whether includedEventTypes is empty | Set --included-event-types Microsoft.Storage.BlobCreated |
| 10 | Events delayed minutes, not seconds | Handler returning 5xx → being retried with back-off | Dead-letter/metrics show high deliveryAttempts | Fix the handler so it returns 2xx promptly |

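Two of these deserve a sentence of emphasis. #5 (no dead-letter) is the single most expensive omission: without it, a handler outage means permanent, silent data loss after 24 hours, so always nominate a dead-letter container in production. #2 (incomplete blobs) is the most common storage-specific surprise: every BlobCreated event carries a data.api field naming the call that raised it, and only the committing calls signal that the blob is complete (PutBlob for single-shot uploads, PutBlockList for staged block uploads, CopyBlob, and FlushWithClose on hierarchical-namespace accounts). Filter on those or you risk processing incomplete files, and remember that a PutBlockList-only filter silently skips small files uploaded in a single PutBlob call.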
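The de-dupe advice for #4 is easier to see in code. What follows is a minimal sketch under loud assumptions: IdempotentHandler is a hypothetical wrapper, and the in-memory set is for illustration only, since a real handler needs a durable, shared store so that de-duplication survives restarts and scale-out.

```python
from typing import Callable


class IdempotentHandler:
    """Runs `action` at most once per event id, even if delivery repeats."""

    def __init__(self, action: Callable[[dict], None]) -> None:
        self._action = action
        # Assumption: in-memory set, illustration only (not durable).
        self._seen_ids: set[str] = set()

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["id"]
        if event_id in self._seen_ids:
            return False
        self._action(event)
        # Record the id only after the action succeeds, so a failed attempt
        # can still be retried on the next delivery.
        self._seen_ids.add(event_id)
        return True


thumbnails: list[str] = []
handler = IdempotentHandler(lambda e: thumbnails.append(e["subject"]))
event = {"id": "evt-1", "subject": "/blobServices/default/containers/uploads/blobs/a.jpg"}

first = handler.handle(event)
second = handler.handle(event)  # simulated redelivery of the same event
print(first, second, len(thumbnails))
```

Recording the id after the side effect, not before, is the design choice that matters: it trades a small chance of doing the work twice for never losing an event whose first attempt failed halfway.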

To confirm what’s actually flowing, the Event Grid metrics on the topic and subscription are your truth source:

| Metric | Tells you | Watch for |
|---|---|---|
| Published Events | Events the source emitted | Zero → source isn’t raising events (wrong topic type) |
| Matched Events | Events that passed a subscription’s filter | Zero while Published > 0 → filter too tight / case bug |
| Delivery Attempts | Total push attempts to handlers | Spiking → handler failing & retrying |
| Delivery Succeeded | Handler returned 2xx | Flat while Attempts climb → handler down |
| Dead-Lettered Events | Events that exhausted retries | Any non-zero → investigate the handler |
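The "watch for" column reads naturally as a small decision procedure. The sketch below encodes it as a hypothetical triage function that takes counts you have already read off the metrics blade. It does not call any Azure API, and the thresholds are the simplest reading of the table, not an official rule.

```python
def triage(published: int, matched: int, attempts: int,
           succeeded: int, dead_lettered: int) -> list[str]:
    """Map metric counts for one time window to the likely causes in the table."""
    findings = []
    if published == 0:
        findings.append("source is not raising events (check topic type)")
    elif matched == 0:
        findings.append("filter too tight or subject case bug")
    if attempts > succeeded:
        findings.append("handler failing and being retried")
    if dead_lettered > 0:
        findings.append("events exhausted retries; investigate the handler")
    return findings or ["healthy"]


# Illustrative numbers only.
print(triage(published=10, matched=0, attempts=0, succeeded=0, dead_lettered=0))
print(triage(published=10, matched=10, attempts=40, succeeded=2, dead_lettered=1))
print(triage(published=5, matched=5, attempts=5, succeeded=5, dead_lettered=0))
```

Reading the metrics in this order (published, then matched, then delivery) mirrors the path an event takes, so the first anomaly you hit is usually the stage that is broken.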

Best practices

Security notes

Event Grid’s security model has three faces: who can publish, who can receive, and how the event in transit is protected. For system topics you don’t publish (Azure does), so the action moves to receiving and access control.

Cost & sizing

Event Grid is priced per operation — essentially per event delivered (and a few other operation types), with a generous free monthly allowance (the first 100,000 operations per month are free), then a low per-million-operations rate. For most reactive workloads the bill is negligible; the cost mistakes are about volume you didn’t intend, not unit price.

| Cost driver | What it is | How to control it |
|---|---|---|
| Operations (events) | Each delivery is billed (after the free 100k/month) | Filter at the subscription so unmatched events aren’t delivered |
| Retry attempts | A failing handler multiplies attempts | Fix handlers fast; a 5xx loop inflates operations |
| Downstream handler cost | The Function/queue your events trigger | Usually dwarfs Event Grid’s own cost; filter to reduce invocations |
| Dead-letter storage | Blobs written for failed events | Tiny; lifecycle-expire old dead-letters |

Sizing intuition with rough figures: at the free tier, 100,000 events/month cost ₹0 / $0 in Event Grid charges. Even a busy app doing, say, 5 million events/month lands in single-digit US dollars for Event Grid itself — often well under ₹500/month. The number to watch is not Event Grid’s bill but the handler’s bill: 5 million unfiltered events that each invoke a Function cost far more in Functions execution than in Event Grid operations. This is why filtering is a cost control, not just a correctness one — every event you stop at the subscription is a handler invocation you didn’t pay for. Free-tier-friendly: the lab above stays inside the free operation allowance and uses Standard_LRS storage that costs pennies; the teardown removes even that.
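The sizing arithmetic can be sketched in a few lines. The free allowance comes from the text above; the per-million rate is an assumption chosen purely for illustration (0.60 USD per million operations), so check the current Event Grid pricing page before relying on the output. The function name is hypothetical.

```python
FREE_OPERATIONS_PER_MONTH = 100_000  # free allowance stated in the text
# Assumption: illustrative rate only; verify against current pricing.
ASSUMED_USD_PER_MILLION_OPS = 0.60


def monthly_event_grid_cost(operations: int,
                            usd_per_million: float = ASSUMED_USD_PER_MILLION_OPS) -> float:
    """Estimate the monthly Event Grid charge in USD for a number of operations."""
    billable = max(0, operations - FREE_OPERATIONS_PER_MONTH)
    return round(billable / 1_000_000 * usd_per_million, 2)


print(monthly_event_grid_cost(100_000))    # inside the free allowance
print(monthly_event_grid_cost(5_000_000))  # 4.9 million billable operations
```

Under the assumed rate, 5 million operations come to 2.94 USD, which is consistent with the "single-digit dollars" figure above. Swap in the real rate and your own volumes; the shape of the calculation is the point, not the constant.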

Interview & exam questions

Q1. What is an Event Grid system topic, and how does it differ from a custom topic? A system topic is an Azure-managed event stream for events a resource emits about itself (e.g. a storage account’s BlobCreated); Azure is the publisher and you only subscribe. A custom topic is one you create and publish your own application events to. Same model, different publisher. (AZ-204)

Q2. Name the four parts of the Event Grid model. Publisher (source of events), topic (routing endpoint), event subscription (filter + handler rule), and handler/endpoint (the destination that receives events). One topic can have many subscriptions, each delivering its own filtered copy. (AZ-204)

Q3. Event Grid guarantees what delivery semantics, and what must your handler therefore do? At-least-once delivery with retries until a 2xx, over (by default) a 24-hour window. Because the same event can be delivered more than once and order isn’t guaranteed, handlers must be idempotent — typically de-duplicating on the event id. (AZ-204/AZ-305)

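Q4. A storage BlobCreated subscription is firing on incomplete blobs. Why, and how do you fix it? Not every BlobCreated event marks a finished blob: the data.api field names the call that raised it, and on hierarchical-namespace (Data Lake Storage Gen2) accounts, for example, CreateFile raises the event before any data is written. Add an advanced filter on data.api for the committing calls, StringIn ["PutBlob", "PutBlockList", "CopyBlob", "FlushWithClose"], so the handler only fires when the blob is fully committed. (AZ-204)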

Q5. What happens to an event when every delivery attempt fails? If a dead-letter destination is configured, the event is written to that blob container with failure metadata (lastHttpStatusCode, deliveryAttempts); if none is configured, the event is silently dropped after the retry window. Always configure dead-lettering in production. (AZ-204)

Q6. When would you choose Event Grid over Service Bus or Event Hubs? Event Grid for reacting to discrete events with fan-out and filtering (“when X happens, do Y”); Service Bus for ordered/transactional messaging (work queues, FIFO with sessions); Event Hubs for high-throughput streaming with replay (telemetry, logs). They are complements; Event Grid → Service Bus is common. (AZ-305)

Q7. Why does a raw webhook handler require a validation handshake? To prevent abuse: without it, anyone could create a subscription that floods an arbitrary URL. Event Grid sends a SubscriptionValidationEvent and the endpoint must echo the validationCode to prove it consents. Native handlers (Functions, Service Bus, Storage Queue) are validated internally and skip it. (AZ-204)

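Q8. Which HTTP responses from a handler are retried, and which are not? Transient codes (408, 429, 5xx), timeouts and connection errors are retried on the back-off schedule. 400 and 413 are treated as permanent and dead-lettered without retry. Configuration-style errors such as 401, 403 and 404 depend on the endpoint type: some are dead-lettered without retry and others are retried only after a longer delay, so confirm against the delivery-and-retry documentation for your handler type. 2xx is success. (AZ-204)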

Q9. How do filters reduce cost, not just noise? Filtering is evaluated by Event Grid before delivery, so an unmatched event is never delivered — meaning your handler (often a Function you pay per execution) is never invoked. Tight filters cut both event-delivery operations and, more significantly, downstream handler invocation cost. (AZ-305)

Q10. What’s the difference between the Event Grid schema and CloudEvents 1.0? Both carry the same information; CloudEvents 1.0 is an open CNCF standard using type/source/time/specversion where the Event Grid schema uses eventType/topic/eventTime/metadataVersion. Pick one per subscription; CloudEvents is recommended for cross-cloud interoperability. (AZ-204)

Q11. How would you securely deliver events to a Service Bus queue without secrets? Enable a system-assigned managed identity on the Event Grid topic and grant it the Azure Service Bus Data Sender role on the target Service Bus, then set the subscription to deliver using that identity, so no SAS keys sit in configuration. (AZ-305)

Q12. A governance team wants to act whenever any resource is created in a subscription. What do you build? A Subscription-scoped system topic emitting Microsoft.Resources.ResourceWriteSuccess, with a subscription filtered (by resource type or operation name) routing to a Function or Logic App. Filter tightly — a subscription-wide topic is high-volume. (AZ-305)
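The handshake from Q7 is small enough to sketch in full. This covers the Event Grid schema flow only (subscriptions using the CloudEvents schema validate differently), and it is framework-agnostic: validation_response is a hypothetical helper that takes the already-parsed request body, and the HTTP wiring around it is left out.

```python
from typing import Optional

VALIDATION_EVENT_TYPE = "Microsoft.EventGrid.SubscriptionValidationEvent"


def validation_response(events: list[dict]) -> Optional[dict]:
    """Return the body to echo back if the batch contains a validation event."""
    for event in events:
        if event.get("eventType") == VALIDATION_EVENT_TYPE:
            return {"validationResponse": event["data"]["validationCode"]}
    return None  # ordinary events: process them and return 2xx as usual


# Illustrative request bodies; the code value is made up.
handshake = [{"eventType": VALIDATION_EVENT_TYPE,
              "data": {"validationCode": "abc-123"}}]
normal = [{"eventType": "Microsoft.Storage.BlobCreated", "data": {"api": "PutBlob"}}]

print(validation_response(handshake))
print(validation_response(normal))
```

A webhook would call this first on every request and, when it returns a dict, send that dict back as the JSON response with a 200. Native handlers never see this event, which is one more reason to prefer them.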

Quick check

  1. In the four-part model, which part do you not write code for when using a system topic?
  2. You need to react only to fully-committed blob uploads. Which advanced filter do you add?
  3. Event Grid delivery is “at-least-once.” What property must your handler therefore have?
  4. Where do events go when every delivery attempt fails — and what happens if you didn’t configure that destination?
  5. For a high-throughput telemetry stream you need to replay later, is Event Grid the right tool? If not, what is?

Answers

  1. The publisher — Azure emits the events automatically; you only create the topic and subscriptions.
  2. data.api StringIn ["PutBlob", "PutBlockList", "CopyBlob"] (the committing calls; add FlushWithClose on hierarchical-namespace accounts), so you don’t fire on events raised before the blob is complete.
  3. Idempotency — the same event can be delivered more than once, so reprocessing must be safe (de-dupe on the event id).
  4. They are dead-lettered to the blob container you nominate, with failure metadata; if you configured no dead-letter destination, they are silently dropped after the 24-hour retry window.
  5. No — Event Grid has no long retention/replay. Use Event Hubs for high-throughput streaming with replay; pair with Event Grid only if you also need discrete-event reactions.

Glossary

Next steps


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
