Most “event-driven” systems I inherit are point-to-point queues wearing a costume. Service A drops a message on an SQS queue that service B owns, B knows A’s payload shape by heart, and the moment a third consumer needs the same event someone forwards it, double-publishes it, or — worst case — has B re-emit it. The coupling didn’t go away; it moved into tribal knowledge.
EventBridge fixes the topology, not just the transport. A producer publishes a fact (“an order was placed”) to a bus and walks away. It does not know, and must not know, who consumes it. Routing lives in rules on the bus, evolvable independently of either side. Add a fraud-scoring consumer six months later by writing one rule — the producer never ships. This is the property that makes a system actually decoupled, and it is the lens for everything below: bus topology, event design, content filtering, targets and failure handling, cross-account routing, the schema registry, and archive/replay.
1. Bus topology: default vs custom buses and bounded-context boundaries
Every account gets a default event bus, and it is the wrong place for your application events. The default bus receives every AWS service event in the account — EC2 state changes, S3 notifications (when enabled), CloudTrail-derived API events, Health events. Mixing your domain events into that stream means your rules compete with AWS noise, your access policies cannot distinguish “my events” from “AWS events,” and you cannot cleanly archive or replay just your traffic.
Create custom buses, and align them to bounded contexts, not to teams or to environments. One bus per environment is too coarse: a single replay or a single misconfigured rule has a blast radius that spans unrelated domains. One bus per microservice is too fine: you drown in cross-bus plumbing. The right grain is the bounded context: orders, payments, inventory, fulfillment. Each owns its bus, its event contracts, and its archive policy.
# Look up the current account ID for the policy condition below.
data "aws_caller_identity" "current" {}

resource "aws_cloudwatch_event_bus" "orders" {
  name = "orders"
  tags = {
    BoundedContext = "orders"
    Team           = "checkout"
  }
}

# Deny PutEvents from any principal outside this account;
# narrowed further per producer below.
resource "aws_cloudwatch_event_bus_policy" "orders_baseline" {
  event_bus_name = aws_cloudwatch_event_bus.orders.name
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "DenyCrossAccountByDefault"
      Effect    = "Deny"
      Principal = "*"
      Action    = "events:PutEvents"
      Resource  = aws_cloudwatch_event_bus.orders.arn
      Condition = {
        StringNotEquals = { "aws:PrincipalAccount" = data.aws_caller_identity.current.account_id }
      }
    }]
  })
}
Rule of thumb: if two streams of events would ever be archived, replayed, or access-controlled separately, they belong on separate buses. Replay scope is per-bus, and that single fact should drive most of your topology decisions.
2. Event design: the envelope, detail-type conventions, and versioning
An EventBridge event has a fixed envelope and a free-form detail body. The envelope fields — source, detail-type, time, id, region, account, resources — are what rules match against most efficiently and what you cannot change after the fact. Treat them as a public API.
{
  "source": "com.acme.orders",
  "detail-type": "OrderPlaced",
  "detail": {
    "metadata": {
      "version": "1.0",
      "correlationId": "9b1f...",
      "idempotencyKey": "order-7781-placed"
    },
    "data": {
      "orderId": "7781",
      "customerId": "c-4410",
      "totalCents": 18900,
      "currency": "USD"
    }
  }
}
A few conventions that pay off at scale:
- source is a reverse-DNS namespace you own (com.acme.orders). AWS reserves the aws. prefix; never spoof it. Keeping one source per bounded context makes IAM and rule patterns trivial.
- detail-type names a fact in past tense: OrderPlaced, PaymentCaptured, ShipmentDispatched. Events are things that happened, not commands. If you find yourself naming one PlaceOrder, you are modeling a command and should rethink whether EventBridge is the right channel.
- Version inside detail.metadata, not in detail-type. Putting OrderPlaced.v2 in detail-type forces every consumer to update its rule the day you bump a version. Keep detail-type stable; carry a semantic version in the body. Bump the major only on a breaking change, and during migration publish both versions until consumers drain off the old one.
- Split metadata from data. Cross-cutting fields (correlation IDs, idempotency keys, schema version, producer build) live in metadata; the domain payload lives in data. Consumers learn one envelope-within-the-envelope and your input transformers (below) stay stable.
EventBridge does not enforce any of this — it will happily route {"x": 1}. The discipline is yours, and the schema registry in section 6 is how you make it stick.
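Since the bus will not enforce these conventions, one option is to funnel every producer through a small helper that does. The following is a minimal sketch, not a prescribed implementation: `build_entry` is a hypothetical name, and the default of generating random correlation and idempotency keys is an assumption (a real producer would usually derive the idempotency key from the business fact, as in `order-7781-placed`). The returned dict has the shape that boto3's `put_events(Entries=[...])` accepts.

```python
import json
import uuid
from datetime import datetime, timezone


def build_entry(bus_name, source, detail_type, data, *, version="1.0",
                correlation_id=None, idempotency_key=None):
    """Build one PutEvents entry that follows the metadata/data split."""
    if source.startswith("aws."):
        # The aws. prefix is reserved for AWS services.
        raise ValueError("source must not use the reserved aws. prefix")
    detail = {
        "metadata": {
            "version": version,
            # Assumption: fall back to random keys when the caller gives none.
            "correlationId": correlation_id or str(uuid.uuid4()),
            "idempotencyKey": idempotency_key or str(uuid.uuid4()),
        },
        "data": data,
    }
    return {
        "EventBusName": bus_name,
        "Source": source,
        "DetailType": detail_type,
        "Time": datetime.now(timezone.utc),
        "Detail": json.dumps(detail),  # PutEvents expects Detail as a JSON string
    }


entry = build_entry(
    "orders", "com.acme.orders", "OrderPlaced",
    {"orderId": "7781", "customerId": "c-4410", "totalCents": 18900, "currency": "USD"},
    idempotency_key="order-7781-placed",
)
# With boto3 available: boto3.client("events").put_events(Entries=[entry])
```

Keeping the envelope logic in one place means a convention change (a new metadata field, say) ships as a library bump rather than a hunt through every producer.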
3. Rules and content filtering: matching patterns and input transformers
A rule is a match expression plus up to five targets. The match is an event pattern — a JSON document mirroring the event’s structure, where each field holds an array of allowed values or a matching operator. A field present in the pattern must match; a field absent from the pattern is ignored.
{
  "source": ["com.acme.orders"],
  "detail-type": ["OrderPlaced"],
  "detail": {
    "data": {
      "totalCents": [{ "numeric": [">=", 50000] }],
      "currency": ["USD", "CAD"]
    }
  }
}
This is content-based routing: only high-value USD/CAD orders match. The producer emits every order once; the bus fans out by content. EventBridge supports a rich operator set inside patterns — prefix, suffix, anything-but, exists, numeric, cidr, equals-ignore-case, and wildcard — and you can combine them with $or at the top level. Two that I reach for constantly:
{
  "detail": {
    "data": {
      "sku": [{ "wildcard": "ELEC-*-REFURB" }],
      "promoCode": [{ "exists": false }]
    }
  }
}
exists: false is how you route on the absence of a field — orders with no promo code — which is impossible to express in most queue-based systems without a consumer-side branch.
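To get a feel for how these patterns evaluate, it can help to run them locally in a unit test. The sketch below is a deliberately simplified matcher that covers only literal values, `numeric`, `exists`, and `prefix`, and it ignores array-valued event fields; it is not EventBridge's matching engine, and the names `matches` and `_matches_condition` are mine. For an authoritative answer, the service's own TestEventPattern API (`aws events test-event-pattern`) is the thing to call.

```python
import operator

_NUMERIC_OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
                ">": operator.gt, ">=": operator.ge}
_MISSING = object()  # sentinel for "field not present in the event"


def _matches_condition(cond, value):
    """Check one entry of a pattern array against one event value."""
    if isinstance(cond, dict):
        if "exists" in cond:
            return (value is not _MISSING) == cond["exists"]
        if value is _MISSING:
            return False
        if "numeric" in cond:
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False
            spec = cond["numeric"]  # e.g. [">=", 50000] or [">", 0, "<=", 5]
            pairs = zip(spec[0::2], spec[1::2])
            return all(_NUMERIC_OPS[op](value, bound) for op, bound in pairs)
        if "prefix" in cond:
            return isinstance(value, str) and value.startswith(cond["prefix"])
        raise NotImplementedError(f"operator not covered by this sketch: {cond}")
    return value is not _MISSING and value == cond


def matches(pattern, event):
    """True if every field named in the pattern matches; absent pattern fields are ignored."""
    for key, expected in pattern.items():
        actual = event.get(key, _MISSING) if isinstance(event, dict) else _MISSING
        if isinstance(expected, dict):
            # Nested object in the pattern: recurse into the event.
            if not matches(expected, {} if actual is _MISSING else actual):
                return False
        elif not any(_matches_condition(c, actual) for c in expected):
            return False
    return True


pattern = {
    "source": ["com.acme.orders"],
    "detail-type": ["OrderPlaced"],
    "detail": {"data": {
        "totalCents": [{"numeric": [">=", 50000]}],
        "currency": ["USD", "CAD"],
        "promoCode": [{"exists": False}],
    }},
}
big_order = {
    "source": "com.acme.orders",
    "detail-type": "OrderPlaced",
    "detail": {"data": {"orderId": "t-1", "totalCents": 99000, "currency": "USD"}},
}
print(matches(pattern, big_order))
```

Each array in the pattern is an OR over its entries, and the fields are ANDed together, which is why a single non-matching field rejects the event.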
When a target needs a different shape than the raw event, use an input transformer rather than reshaping in the consumer. It declares a map of variables drawn from the event via JSON paths, then a template that produces the target’s input. This keeps the producer’s envelope canonical while letting each target receive exactly what it wants.
{
  "InputPathsMap": {
    "orderId": "$.detail.data.orderId",
    "total": "$.detail.data.totalCents"
  },
  "InputTemplate": "{ \"message\": \"Order <orderId> totals <total> cents\", \"channel\": \"#big-orders\" }"
}
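To make the two-step mechanics concrete (extract variables by path, then fill the template), here is a rough local approximation. It handles only simple dotted paths and does plain text substitution; EventBridge's real transformer has more rules around quoting and reserved variables, so treat `resolve_path` and `render` as hypothetical helpers for reasoning about a template, not a faithful emulator.

```python
import json
import re


def resolve_path(event, path):
    """Resolve a simple dotted JSONPath such as $.detail.data.orderId."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    node = event
    for part in path[2:].split("."):
        node = node[part]
    return node


def render(transformer, event):
    """Extract the variables, then substitute each <name> placeholder in the template."""
    values = {name: resolve_path(event, path)
              for name, path in transformer["InputPathsMap"].items()}
    return re.sub(r"<(\w+)>", lambda m: str(values[m.group(1)]),
                  transformer["InputTemplate"])


transformer = {
    "InputPathsMap": {
        "orderId": "$.detail.data.orderId",
        "total": "$.detail.data.totalCents",
    },
    "InputTemplate": '{ "message": "Order <orderId> totals <total> cents", "channel": "#big-orders" }',
}
event = {
    "source": "com.acme.orders",
    "detail-type": "OrderPlaced",
    "detail": {"data": {"orderId": "7781", "totalCents": 18900}},
}
rendered = render(transformer, event)
print(json.loads(rendered)["message"])  # Order 7781 totals 18900 cents
```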
A subtle but important behavior: a single event evaluated against many rules invokes every matching rule independently. There is no “first match wins.” Overlapping patterns are a feature — that is how multiple bounded contexts subscribe to the same fact — but it means a sloppy broad rule can silently double-deliver. Keep patterns specific.
4. Targets, dead-letter queues, and retry/backoff configuration
A target is where a matched event goes: Lambda, SQS, SNS, Step Functions, Kinesis, another event bus, an API destination (any HTTPS endpoint), and dozens more. The part teams skip — and then page on at 2 a.m. — is failure handling. EventBridge delivers asynchronously with retries, but if every retry fails and you configured no dead-letter queue, the event is dropped silently. There is no backstop. Configure a DLQ on every target that matters.
resource "aws_cloudwatch_event_rule" "high_value_orders" {
  name           = "high-value-orders"
  event_bus_name = aws_cloudwatch_event_bus.orders.name
  event_pattern = jsonencode({
    source        = ["com.acme.orders"]
    "detail-type" = ["OrderPlaced"]
    detail        = { data = { totalCents = [{ numeric = [">=", 50000] }] } }
  })
}

resource "aws_cloudwatch_event_target" "to_fraud_lambda" {
  rule           = aws_cloudwatch_event_rule.high_value_orders.name
  event_bus_name = aws_cloudwatch_event_bus.orders.name
  arn            = aws_lambda_function.fraud_score.arn

  retry_policy {
    maximum_event_age_in_seconds = 3600 # stop retrying after 1 hour
    maximum_retry_attempts       = 10
  }

  dead_letter_config {
    arn = aws_sqs_queue.fraud_dlq.arn # capture exhausted events
  }
}
Two knobs govern retries. maximum_retry_attempts caps the count (up to 185); maximum_event_age_in_seconds caps the total wall-clock window (60 to 86,400 seconds). EventBridge retries with exponential backoff and jitter, and an event is discarded when either limit is hit — so an event can be dropped well before 10 attempts if it sat for an hour. Set the age limit to the longest your downstream may legitimately be unavailable; set the attempt limit to bound cost against a hot-looping failure.
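The interaction of the two limits is easier to see with numbers. The sketch below simulates a target that never recovers and reports which limit ends the retries. The backoff schedule it uses (delay doubling from 1 second, capped at 300 seconds, no jitter) is purely an assumption for illustration; it is not EventBridge's actual schedule, and `simulate_delivery` is a hypothetical helper.

```python
def simulate_delivery(max_attempts, max_age_seconds, base_delay=1.0, cap=300.0):
    """Return (retries_made, limit_hit) for a target that fails every time.

    Assumed schedule: the delay before retry n is min(cap, base_delay * 2**n).
    """
    elapsed = 0.0
    retries = 0
    while True:
        if retries >= max_attempts:
            return retries, "max_retry_attempts"
        delay = min(cap, base_delay * 2 ** retries)
        if elapsed + delay > max_age_seconds:
            # The next retry would fall outside the age window.
            return retries, "max_event_age"
        elapsed += delay
        retries += 1


print(simulate_delivery(10, 3600))   # (10, 'max_retry_attempts')
print(simulate_delivery(185, 3600))  # (19, 'max_event_age')
print(simulate_delivery(10, 60))     # (5, 'max_event_age')
```

Under this assumed schedule, a generous attempt count with a one-hour window is bounded by age, while a short window discards the event after only a handful of retries. Whichever limit trips first wins.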
The DLQ is an SQS queue that receives events EventBridge could not deliver. Critically, it captures delivery failures (target throttled, target nonexistent, permissions broken), not application-logic failures inside a Lambda that returned 200. Business-logic retries are the consumer’s job (a Lambda destination or its own SQS source). Alarm on InvocationsSentToDlq in CloudWatch and treat any non-zero value as a real incident; a filling DLQ means events are being lost from the live path. Do not confuse it with DeadLetterInvocations, which despite its name counts targets that were not invoked at all, for example to break a rule loop.
5. Cross-account and cross-region event routing patterns
The canonical enterprise pattern is bus-to-bus: a producer account emits to its local bus, a rule forwards matching events to a bus in another account, and the consuming account writes its own rules on the receiving bus. Neither side shares IAM principals or knows the other’s internals. Two halves wire this up.
First, the receiving bus must grant the producer account permission to put events:
resource "aws_cloudwatch_event_bus_policy" "central_ingest" {
  event_bus_name = aws_cloudwatch_event_bus.central.name
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "AllowOrdersProducerAccount"
      Effect    = "Allow"
      Principal = { AWS = "arn:aws:iam::111122223333:root" }
      Action    = "events:PutEvents"
      Resource  = aws_cloudwatch_event_bus.central.arn
    }]
  })
}
Second, in the producer account, a rule targets the remote bus by ARN, using a role EventBridge assumes to perform the cross-account PutEvents:
resource "aws_cloudwatch_event_target" "forward_to_central" {
  rule           = aws_cloudwatch_event_rule.high_value_orders.name
  event_bus_name = aws_cloudwatch_event_bus.orders.name
  arn            = "arn:aws:events:us-east-1:444455556666:event-bus/central"
  role_arn       = aws_iam_role.eb_cross_account.arn # required for bus-to-bus
}
A few constraints worth internalizing:
- Forwarding is one hop. An event delivered from account A’s bus to account B’s bus will not be forwarded again from B to a third bus. EventBridge blocks chained bus-to-bus forwarding to prevent loops. Design hub-and-spoke, not a chain.
- The envelope is preserved, but account and region reflect the origin. When you match on the receiving bus, account is still the producer’s. Filter on source and detail-type, which travel intact.
- Cross-region uses the same mechanism: target a bus ARN in another region with an assumed role. This is how you aggregate events into a single observability or audit account/region, or build active-active fan-in.
For ingesting events from outside your estate (a partner SaaS pushing to you), use a partner event source; for the reverse direction, pushing events out to an external HTTPS endpoint, use an API destination. For hub-and-spoke fan-in across many accounts in an Organization, this bus-to-bus pattern with a central bus is the standard backbone.
6. Schema registry and discovery: contracts, code bindings, and governance
The free-form detail body is a liability without a contract. EventBridge’s schema registry stores OpenAPI 3 or JSON Schema Draft 4 definitions of your events and generates strongly typed code bindings (Java, Python, TypeScript, Go) so producers and consumers compile against the same shape instead of hand-parsing maps.
Turn on schema discovery for a bus and EventBridge samples live events and infers schemas into the discovered-schemas registry automatically — invaluable for reverse-engineering an existing estate, less so as a governance source of truth.
# Infer schemas from live traffic on a bus
aws schemas create-discoverer \
--source-arn arn:aws:events:us-east-1:444455556666:event-bus/orders
# Generate typed bindings for a known schema version
aws schemas put-code-binding \
--registry-name discovered-schemas \
--schema-name com.acme.orders@OrderPlaced \
--language TypeScript3
aws schemas get-code-binding-source \
--registry-name discovered-schemas \
--schema-name com.acme.orders@OrderPlaced \
--language TypeScript3 \
/tmp/OrderPlaced.zip
For governance, do not rely on discovery. Maintain a custom registry with versioned, reviewed schemas checked into source control and published through CI. The producer’s contract test asserts its emitted event validates against the registered schema before deploy; a breaking change fails the pipeline.
aws schemas create-registry --registry-name acme-domain-events
aws schemas create-schema \
--registry-name acme-domain-events \
--schema-name com.acme.orders@OrderPlaced \
--type OpenApi3 \
--content file://schemas/order-placed-v1.json
The governance posture I push: discovery for archaeology, custom registry for contracts. Discovery tells you what is actually flowing (including the rogue events nobody documented); the curated registry is the agreement teams build against and the artifact your schema-evolution review gates on.
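The producer-side contract test can start very small. The sketch below checks an event's detail against a hand-written field/type map and returns every violation it finds. It is a stand-in for illustration only: `ORDER_PLACED_V1` and `contract_violations` are hypothetical, and a real pipeline would more likely validate against the schema document published to the registry using a proper JSON Schema or OpenAPI validator.

```python
# Hypothetical contract for OrderPlaced v1: field name -> expected Python type,
# with nested dicts describing nested objects.
ORDER_PLACED_V1 = {
    "metadata": {"version": str, "correlationId": str, "idempotencyKey": str},
    "data": {"orderId": str, "customerId": str, "totalCents": int, "currency": str},
}


def contract_violations(detail, contract, path="detail"):
    """Return a list of readable violations; an empty list means the event conforms."""
    problems = []
    for field, expected in contract.items():
        where = f"{path}.{field}"
        if field not in detail:
            problems.append(f"{where}: missing")
        elif isinstance(expected, dict):
            if isinstance(detail[field], dict):
                problems.extend(contract_violations(detail[field], expected, where))
            else:
                problems.append(f"{where}: expected object")
        elif type(detail[field]) is not expected:
            # Exact type check, so True is not accepted where an int is required.
            problems.append(f"{where}: expected {expected.__name__}")
    return problems


good = {
    "metadata": {"version": "1.0", "correlationId": "9b1f", "idempotencyKey": "order-7781-placed"},
    "data": {"orderId": "7781", "customerId": "c-4410", "totalCents": 18900, "currency": "USD"},
}
bad = {
    "metadata": good["metadata"],
    "data": {"orderId": "7781", "customerId": "c-4410", "totalCents": "18900"},
}
print(contract_violations(good, ORDER_PLACED_V1))  # []
print(contract_violations(bad, ORDER_PLACED_V1))
```

Returning all violations rather than failing on the first one makes the CI output useful to whoever broke the contract.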
7. Archive and replay for disaster recovery and reprocessing
This is EventBridge’s most underused capability and the reason I treat it as a system of record for events, not just a router. An archive durably retains every event matching a filter that flows through a bus. A replay re-emits archived events back onto the bus over a time window — re-evaluating current rules against past events.
resource "aws_cloudwatch_event_archive" "orders" {
  name             = "orders-archive"
  event_source_arn = aws_cloudwatch_event_bus.orders.arn
  retention_days   = 90 # 0 = indefinite
  event_pattern    = jsonencode({ source = ["com.acme.orders"] })
}
# Reprocess a window of past events onto the bus
aws events start-replay \
--replay-name reprocess-orders-2026-06-07 \
--event-source-arn arn:aws:events:us-east-1:444455556666:archive/orders-archive \
--event-start-time 2026-06-07T00:00:00Z \
--event-end-time 2026-06-07T06:00:00Z \
--destination '{"Arn":"arn:aws:events:us-east-1:444455556666:event-bus/orders","FilterArns":["arn:aws:events:us-east-1:444455556666:rule/orders/rebuild-projection"]}'
The mechanics that matter in practice:
- Replay targets specific rules via FilterArns. You almost never want to replay onto every rule, because that re-notifies customers, re-charges cards, and re-sends emails. Scope the replay to the one idempotent consumer that needs to reprocess (a read-model rebuilder, a new analytics sink), and leave the side-effecting rules out.
- Replayed events carry replay-name in the envelope, so consumers that must behave differently on replay can branch on it.
- Ordering is best-effort, not guaranteed, and replays do not preserve the original inter-event timing; events are re-emitted as fast as the service allows. Consumers must be idempotent; that is the price of admission for replay, and it is a price every well-designed event consumer should already be paying.
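The last two points, branching on replay-name and being idempotent, can be sketched together. This toy consumer keys every event on `detail.metadata.idempotencyKey` and skips its live-only side effect when the event carries `replay-name`. The class name, the in-memory `seen` set, and the notification list are all assumptions for illustration; a real consumer would keep the seen-keys record in a durable store with an atomic conditional write.

```python
class ProjectionRebuilder:
    """Toy consumer that applies each event at most once, keyed on idempotencyKey."""

    def __init__(self):
        self.seen = set()        # stand-in for a durable dedupe store
        self.totals = {}         # read model: orderId -> totalCents
        self.notifications = []  # stand-in for a customer-facing side effect

    def handle(self, event):
        key = event["detail"]["metadata"]["idempotencyKey"]
        if key in self.seen:
            return "duplicate"
        data = event["detail"]["data"]
        self.totals[data["orderId"]] = data["totalCents"]
        if "replay-name" not in event:
            # Only live traffic triggers the side effect; replays just rebuild state.
            self.notifications.append(data["orderId"])
        self.seen.add(key)
        return "applied"


def order_event(order_id, total, replay_name=None):
    event = {
        "source": "com.acme.orders",
        "detail-type": "OrderPlaced",
        "detail": {
            "metadata": {"idempotencyKey": f"order-{order_id}-placed"},
            "data": {"orderId": order_id, "totalCents": total},
        },
    }
    if replay_name:
        event["replay-name"] = replay_name
    return event


consumer = ProjectionRebuilder()
print(consumer.handle(order_event("7781", 18900)))                      # applied
print(consumer.handle(order_event("7781", 18900, "reprocess-orders")))  # duplicate
print(consumer.handle(order_event("7782", 4200, "reprocess-orders")))   # applied
```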
The two killer use cases: disaster recovery (a downstream was broken for three hours; replay the window once it is healthy) and building new consumers against history (stand up a new projection, replay six weeks of events through it, and it is caught up to live without a custom backfill job). Archive is also a clean audit trail — every business fact, retained and queryable by time, independent of any consumer’s database.
8. EventBridge vs SNS vs SQS: choosing the right backbone
These are not competitors so much as different layers, and senior reviews go sideways when someone treats them as interchangeable.
| Dimension | EventBridge | SNS | SQS |
|---|---|---|---|
| Model | Bus + content routing | Pub/sub topic fan-out | Point-to-point queue |
| Routing | Content-based (event patterns) | Topic + message filter policies | None (consumer pulls) |
| Fan-out | Many rules, many targets | Many subscriptions | One consumer group |
| Filtering | Rich (numeric, wildcard, exists, $or) | Attribute/body filter policies | None |
| Throughput / latency | Higher latency, very high scale | Very high throughput, low latency | Very high throughput, buffering |
| Replay / archive | Native archive + replay | No | No (redrive from DLQ only) |
| Schema registry | Yes | No | No |
| Ordering / exactly-once | No | FIFO topics only | FIFO queues only |
The decision rule I use:
- EventBridge when routing logic must evolve independently of producers and consumers, when you cross account/team boundaries, when you need archive/replay or schema governance — i.e., the integration backbone of a domain.
- SNS when you need cheap, low-latency, high-throughput fan-out of one message to many subscribers and the routing is simple (or FIFO ordering to a few queues). SNS-to-many-SQS is still the right primitive for raw fan-out at volume.
- SQS when you need a durable buffer to decouple production rate from consumption rate, with one logical consumer draining at its own pace and built-in backpressure.
They compose. A common, correct topology: EventBridge routes a domain event to an SQS queue (the target), Lambda drains the queue with controlled concurrency and a redrive policy. EventBridge gives you content routing and archive; SQS gives you the buffer and backpressure; you get both. Reaching for EventBridge to do high-volume, low-latency, simple fan-out — or for SQS to do content-based multi-consumer routing — is the mistake. Match the tool to the layer.
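On the consuming end of that EventBridge-to-SQS-to-Lambda topology, a handler might look roughly like the sketch below. It assumes the rule delivers the full event to the queue (no input transformer), so each SQS record body is the EventBridge event as JSON, and it assumes the event source mapping has ReportBatchItemFailures enabled so that only the failed messages return to the queue. `process_order` is a hypothetical placeholder for the business logic.

```python
import json


def process_order(detail):
    """Hypothetical business logic; raises on bad input instead of swallowing it."""
    if "orderId" not in detail.get("data", {}):
        raise ValueError("orderId missing")


def handler(event, context):
    """SQS-triggered Lambda handler that reports per-message failures."""
    failures = []
    for record in event["Records"]:
        try:
            eb_event = json.loads(record["body"])  # EventBridge event as delivered to SQS
            process_order(eb_event["detail"])
        except Exception:
            # Report only this message as failed; SQS redelivers it and the
            # queue's redrive policy eventually moves it to the queue's DLQ.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


batch = {"Records": [
    {"messageId": "m-1", "body": json.dumps({"detail": {"data": {"orderId": "7781"}}})},
    {"messageId": "m-2", "body": json.dumps({"detail": {"data": {}}})},
]}
print(handler(batch, None))  # {'batchItemFailures': [{'itemIdentifier': 'm-2'}]}
```

The design point is that failures surface to the queue rather than being logged and dropped, so the redrive policy and its alarm do the work.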
Enterprise scenario
A retail platform team ran order processing as a single SQS queue feeding a monolithic Lambda. When they split fulfillment into its own bounded context, they put a fulfillment bus alongside the existing orders bus and forwarded OrderPlaced events across. Three weeks in, a deploy to the fulfillment consumer threw on a malformed address for a batch of international orders. The Lambda returned 200 (it caught and logged), so EventBridge considered delivery successful — the events were not in any DLQ. The orders were silently never fulfilled. They found out from customer support tickets.
The constraint: they could not ask the orders producer to re-emit — those events were long gone from the source system, and replaying from the producer’s side would have re-charged cards on the orders bus’s payment rule.
The fix had two parts. First, they had (fortunately) configured an archive on the fulfillment bus, so the events still existed. They replayed precisely the affected window, scoped to only the process-shipment rule, after the address-parsing bug was patched:
aws events start-replay \
--replay-name fulfill-intl-backfill-20260607 \
--event-source-arn arn:aws:events:us-east-1:444455556666:archive/fulfillment-archive \
--event-start-time 2026-06-07T02:00:00Z \
--event-end-time 2026-06-07T05:30:00Z \
--destination '{"Arn":"arn:aws:events:us-east-1:444455556666:event-bus/fulfillment","FilterArns":["arn:aws:events:us-east-1:444455556666:rule/fulfillment/process-shipment"]}'
Because the shipment consumer keyed every action on detail.metadata.idempotencyKey, the replay reprocessed the failed batch without duplicating the orders that had succeeded. Second, and this is the real lesson, they stopped swallowing exceptions in the Lambda. A malformed event now throws. Because EventBridge invokes Lambda asynchronously, the function error is retried by Lambda’s own asynchronous retry policy and, once those retries are exhausted, the event lands in the function’s on-failure queue, which alarms on any message. The DLQ on the EventBridge target still covers delivery failures and alarms on InvocationsSentToDlq > 0. The archive saved them once; the DLQ-plus-alarm meant they would never again need it for this class of failure. Two controls, both native, both cheap, and the system went from “silently loses orders” to “fails loudly and recovers deterministically.”
Verify
Confirm the architecture behaves before you trust it with production traffic.
# 1. Put a test event and confirm it lands (matched rules invoke)
aws events put-events --entries '[{
"Source": "com.acme.orders",
"DetailType": "OrderPlaced",
"EventBusName": "orders",
"Detail": "{\"metadata\":{\"version\":\"1.0\"},\"data\":{\"orderId\":\"t-1\",\"totalCents\":99000,\"currency\":\"USD\"}}"
}]'
# 2. Inspect rule match + invocation metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/Events --metric-name MatchedEvents \
--dimensions Name=RuleName,Value=high-value-orders \
--start-time 2026-06-08T00:00:00Z --end-time 2026-06-08T23:59:59Z \
--period 300 --statistics Sum
# 3. Prove failure handling: confirm the DLQ is empty (no silent drops)
aws sqs get-queue-attributes \
--queue-url "$DLQ_URL" \
--attribute-names ApproximateNumberOfMessages
# 4. Infer a schema from sample events and compare it with the registered one before shipping a producer
aws schemas get-discovered-schema \
--type OpenApi3 \
--events file://sample-events.json
Watch these CloudWatch metrics in AWS/Events: MatchedEvents (rule is matching), InvocationsSentToDlq (events that exhausted retries and were moved to the DLQ; must be zero), InvocationsFailedToBeSentToDlq (the DLQ itself could not accept the event; must be zero), FailedInvocations (invocations that failed permanently; must be zero), and ThrottledRules (rule execution is being throttled). DeadLetterInvocations, despite the name, counts targets that were not invoked at all, so do not use it as your DLQ signal. A healthy bus shows MatchedEvents tracking traffic and the failure metrics flat at zero.