Resilient Messaging with SQS and SNS: Fan-Out, FIFO Ordering, DLQs, and Poison-Message Handling

SNS and SQS are the two primitives I reach for before anything fancier, and the failure mode I see most often is teams wiring a producer straight to a single queue, then bolting on a second consumer by having the first one re-publish. That is point-to-point coupling wearing a queue costume. The decoupled version is boring on purpose: a producer publishes a fact to an SNS topic and walks away, and each consumer owns a private SQS queue subscribed to that topic. Add a consumer six months later by creating one queue and one subscription. Nobody redeploys the producer.

That topology is the easy 80%. The hard 20% is what happens when a consumer is slow, a message is malformed, the same event arrives twice, or strict ordering actually matters. This article is the operational playbook for that 20%: SNS-to-SQS fan-out with server-side filtering, the real tradeoffs between standard and FIFO, visibility-timeout and polling tuning, dead-letter queues with redrive, idempotent consumers, and the Lambda/ECS plumbing that makes partial failures survivable.

Because this is a reference you will return to mid-incident, the settings, limits, error conditions and tuning knobs are laid out as scannable tables. Read the prose once for the why, then keep the tables open when a queue is backing up at 02:00 and you need the exact maxReceiveCount, the precise metric to alarm on, or the one CLI flag that stops a double-charge. By the end you will stop guessing about delivery semantics and know exactly which layer — topic, queue, consumer, or state store — owns each guarantee.

What problem this solves

The pain this fixes is coupling that you only discover under load. A producer wired directly to one consumer’s queue cannot grow a second consumer without code changes, and the moment the first consumer slows down, the producer’s writes back up behind it. Worse, the failure modes are invisible in a demo and catastrophic in production: a too-short visibility timeout double-charges a customer, a missing dead-letter queue lets one malformed message loop for fourteen days burning compute, a single bad record in a Lambda batch drives nine good messages to the DLQ, and a cross-account subscription silently drops every message because nobody granted the KMS key.

What breaks without this knowledge is concrete and expensive. Teams ship a single standard queue, set the visibility timeout to a guess, never attach a DLQ, and write consumers that assume exactly-once delivery. Then a downstream slows during a sale, messages reappear mid-flight, a second worker grabs each one, and the same side effect fires twice — a duplicate charge, a duplicate email, a duplicate inventory decrement. The on-call engineer kills the consumers, the queue floods, and the post-mortem concludes “messaging is hard” instead of “the visibility timeout was 30 seconds and processing took 45.”

Who hits this: every team building event-driven or asynchronous systems on AWS — order pipelines, payment processors, notification fan-outs, ETL triggers, microservice choreography. It bites hardest on payments and anything with side effects that cost money, on multi-consumer fan-outs where one slow consumer must not back-pressure the others, on cross-account topologies where the IAM and KMS plumbing is silent when wrong, and on FIFO adopters who assume the five-minute dedup window covers all their retries. The fix is almost never “add a bigger queue” — SQS scales itself. The fix is correct delivery semantics, a tuned redrive policy, and an idempotent consumer.

To frame the whole field before the deep dive, here is every failure class this article covers, the layer that owns it, and the one thing to check first:

Failure class	What you observe	Layer that owns the fix	First thing to check
Double-processing	Same side effect fires twice	Visibility timeout + idempotent consumer	`ApproximateReceiveCount` > 1 on processed messages
Poison message loop	One message reprocessed for days	Redrive policy → DLQ	`maxReceiveCount` set? DLQ attached?
Silent message loss	Subscriber gets nothing, no error	Queue policy / KMS key policy	`sqs:SendMessage` allows `sns.amazonaws.com`?
Out-of-order processing	Events applied in wrong sequence	FIFO + `MessageGroupId`	Is the queue FIFO? Group ID chosen right?
Good messages DLQ’d by a sibling	Whole Lambda batch retried	`ReportBatchItemFailures`	Is the function response type set?
Consumers can’t keep up	Rising backlog, SLA breach	Consumer scaling on backlog/task	`ApproximateAgeOfOldestMessage` rising?

Learning objectives

By the end of this article you can:

Design an SNS-to-SQS fan-out with one private queue per consumer, server-side filter policies, and RawMessageDelivery — so a slow consumer never back-pressures the others and unwanted messages are dropped at the topic, never billed.
Choose standard versus FIFO deliberately, articulate the exactly-once-processing guarantee and its five-minute boundary, and pick a MessageGroupId that gives per-entity ordering with cross-entity parallelism.
Tune the visibility timeout to roughly 6× p99 processing time and extend it dynamically with ChangeMessageVisibility heartbeats, eliminating the single most common cause of double-processing.
Attach dead-letter queues with a tuned maxReceiveCount, scope redrive with RedriveAllowPolicy, and replay messages with StartMessageMoveTask after a fix — never a hand-rolled re-publish.
Build idempotent consumers keyed on a business idempotency key (not MessageId) using DynamoDB conditional writes or a dedicated idempotency table, achieving exactly-once effect regardless of delivery duplicates.
Wire Lambda event source mappings with ReportBatchItemFailures and a concurrency cap, and scale ECS/EC2 consumers on backlog-per-task rather than CPU.
Secure the path end to end: SSE-KMS, queue resource policies scoped with aws:SourceArn, and the cross-account topic-and-KMS grants that silently drop messages when missed.
Monitor the right signals — ApproximateAgeOfOldestMessage as the SLA leading indicator and DLQ-not-empty as a page — and read the limit/quota tables to size correctly.

Prerequisites & where this fits

You should already understand the AWS messaging basics: an SNS topic is a pub/sub fan-out point that pushes a copy of each message to every subscriber, while an SQS queue is a durable pull-based buffer that one or more consumers poll. You should be comfortable with aws CLI, reading JSON, IAM resource policies (the Principal/Action/Resource/Condition shape), and basic Terraform. Familiarity with at-least-once delivery semantics and the idea that distributed systems retry helps a great deal.

This sits in the event-driven / decoupling track. The conceptual ground floor is SNS, SQS and EventBridge Messaging Fundamentals — start there if “fan-out” or “DLQ” is new. This article is the production-grade deep dive above that. It pairs tightly with Lambda Deep Dive: Runtimes, Triggers, Layers and Concurrency for the consumer side, DynamoDB Deep Dive for the idempotency store, and KMS Encryption Deep Dive for the SSE-KMS and cross-account key policy. When choreography outgrows simple fan-out, Step Functions: Distributed Orchestration & Error Handling and EventBridge: Buses, Schema & Pipes are the next layer up.

A quick map of who owns what during an incident, so you reach for the right tool fast:

Layer	What lives here	Failure classes it can cause	Where you confirm
Producer	Publish call, message attributes, group/dedup IDs	Missing attribute → filtered out; wrong group ID → no ordering	Producer logs; CloudWatch publish metrics
SNS topic	Subscriptions, filter policies, topic DLQ	Over-broad delivery; silent drop on delivery failure	SNS delivery-status logs; `NumberOfNotificationsFailed`
Queue policy / KMS	Resource policy, `aws:SourceArn`, key grants	Silent message loss (esp. cross-account)	CloudTrail `Decrypt` denied; SNS failed deliveries
SQS queue	Visibility timeout, polling, retention, redrive	Double-process; poison loop; backlog	`ApproximateReceiveCount`; `ApproximateAgeOfOldestMessage`
Consumer (Lambda/ECS)	Batch handling, partial failures, scaling	Whole-batch retry; backpressure failure	ESM metrics; `IteratorAge`; task backlog metric
State store (DynamoDB)	Idempotency key, conditional write	Duplicate effect when missing	Conditional-check-failed count; dup side effects

Core concepts

Five mental models make every later decision obvious.

Publish a fact to a topic; let each consumer own a queue. The producer’s job ends when it publishes to the SNS topic. SNS pushes a copy to every subscribed SQS queue, and each consumer polls its own queue. This is the difference between a topic (one-to-many fan-out, push) and a queue (a buffer one consumer group drains, pull). Because each consumer has a private queue, a slow or broken consumer fills its queue only — it cannot back-pressure the producer or the other consumers. Adding a consumer is one queue plus one subscription, never a producer change.

Delivery is at-least-once; ordering is best-effort — unless you opt into FIFO. A standard queue may deliver a message more than once and may deliver out of order. It does this rarely, but “rarely” at scale is “several times a day.” FIFO queues add strict ordering within a message group and exactly-once processing within a five-minute deduplication window. Neither standard nor FIFO gives you exactly-once delivery across all of time — that is a myth in distributed messaging. What you engineer instead is exactly-once effect, and it lives in the consumer.

The visibility timeout is a lease, not a delete. When a consumer receives a message, SQS does not remove it — it hides it for the visibility timeout and expects the consumer to DeleteMessage before the lease expires. Delete in time and the message is gone. Let the lease lapse (crash, slow processing, forgot to delete) and the message reappears for another consumer. A timeout shorter than your processing time is the single most common cause of “my consumer ran twice.”

A dead-letter queue is the escape valve for poison. A poison message is one a consumer can never process — malformed JSON, a reference to a deleted row, a schema your code does not understand. The redrive policy on the source queue sets maxReceiveCount; after that many receives (not failures), SQS moves the message to a DLQ instead of looping it forever. The DLQ is where you triage, and redrive (StartMessageMoveTask) is the first-class feature that moves fixed messages back — you never hand-roll a re-publish Lambda.

Idempotency is the only durable answer to duplicates. Because delivery is at-least-once and FIFO dedup is bounded to five minutes, the consumer must make processing the same message twice produce the same effect as once. Derive a stable idempotency key from the business event, make the side effect conditional on it (a DynamoDB attribute_not_exists write, a dedicated idempotency table), and duplicate deliveries become a non-event. Build it here and every other layer’s “rare” duplicate stops mattering.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters
SNS topic	Pub/sub fan-out point; pushes to every subscriber	Account / region	The single publish target; decouples producer
SQS queue	Durable pull-based message buffer	Account / region	One per consumer; absorbs bursts
Filter policy	Server-side rule dropping non-matching messages at SNS	On the subscription	Cuts cost and noise; routing without code
RawMessageDelivery	Strips the SNS JSON envelope	On the subscription	Consumer reads the publisher’s body directly
Visibility timeout	Lease hiding a received message	Queue attribute	Too short → double-process; too long → latency
MessageGroupId	FIFO ordering scope	Per message (FIFO)	Order within group; parallelism across groups
Deduplication ID	Collapses duplicate sends within 5 min	Per message (FIFO)	Drops retries of the same logical event
maxReceiveCount	Receives before redrive to DLQ	Redrive policy	Bounds poison-message retries
Dead-letter queue (DLQ)	Sidelines messages that exceed retries	Separate queue	Triage surface; page when non-empty
Redrive	Moves DLQ messages back to source	`StartMessageMoveTask`	First-class replay after a fix
Idempotency key	Stable business key for dedup	DynamoDB / your store	Exactly-once effect in the consumer
Event source mapping (ESM)	Lambda’s managed poller	Lambda config	Batches, partial-failure reporting, concurrency

Fan-out topology: SNS to multiple SQS queues with filter policies

The pattern is one topic, N queues. Each subscriber gets its own queue so a slow or broken consumer cannot back-pressure the others, and each queue gets its own DLQ and its own visibility timeout. Critically, you do not want every consumer to receive every message — most consumers care about a slice. Push that filtering into SNS with a subscription filter policy so unwanted messages are dropped at the topic, never billed as SQS requests, and never even delivered.

resource "aws_sns_topic" "orders" {
  name = "orders-events"
}

resource "aws_sqs_queue" "fraud" {
  name                       = "orders-fraud"
  visibility_timeout_seconds = 120
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.fraud_dlq.arn
    maxReceiveCount     = 5
  })
}

resource "aws_sqs_queue" "fraud_dlq" {
  name                      = "orders-fraud-dlq"
  message_retention_seconds = 1209600 # 14 days, the max
}

resource "aws_sns_topic_subscription" "fraud" {
  topic_arn = aws_sns_topic.orders.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.fraud.arn

  # Match on a message attribute, not the body, by default.
  filter_policy = jsonencode({
    eventType = ["OrderPlaced", "OrderUpdated"]
    amount    = [{ numeric = [">=", 500] }]
  })

  # Critical for fan-out: deliver the raw publisher payload, not the
  # SNS envelope, so the SQS consumer sees exactly what was published.
  raw_message_delivery = true
}

Two flags decide whether this works in practice:

raw_message_delivery = true strips the SNS JSON envelope (the {"Type":"Notification","Message":"..."} wrapper) so your SQS consumer reads the publisher’s body directly. Without it, every consumer parses two layers of JSON and your message attributes are nested under MessageAttributes inside the envelope. Turn it on for SQS fan-out unless you specifically need SNS metadata.
Filter policy scope. By default a filter policy matches on message attributes. If you need to filter on fields inside the body, set the subscription’s FilterPolicyScope to MessageBody. That is convenient but couples your routing to your payload schema, so I default to attributes and only reach for body filtering when the producer cannot be changed.

Always attach a redrive policy on the subscription itself so SNS deliveries that fail (for example, the SQS queue policy is misconfigured) land in an SNS-level DLQ instead of vanishing. That is separate from the SQS DLQ covered later.

resource "aws_sns_topic_subscription" "fraud_with_dlq" {
  topic_arn            = aws_sns_topic.orders.arn
  protocol             = "sqs"
  endpoint             = aws_sqs_queue.fraud.arn
  raw_message_delivery = true

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.sns_delivery_dlq.arn
  })
}

The fan-out subscription knobs, end to end

Every subscription-level setting that changes delivery behavior, what it defaults to, and the gotcha:

Setting	Values	Default	When to change	Trade-off / gotcha
`RawMessageDelivery`	true / false	false	Almost always true for SQS fan-out	false nests body+attrs in an envelope; double-parse
`FilterPolicy`	JSON match rules	none (all delivered)	When a consumer wants a slice	Mis-typed attribute name silently delivers nothing
`FilterPolicyScope`	`MessageAttributes` / `MessageBody`	`MessageAttributes`	Filtering on body fields you can’t move to attrs	Body scope couples routing to payload schema
`RedrivePolicy` (subscription)	DLQ ARN	none	Always — catch failed SNS→SQS deliveries	Separate from the SQS queue’s own DLQ
`DeliveryPolicy`	retry/backoff JSON	platform default	Rarely; SNS retries SQS aggressively already	Mostly relevant for HTTP/S endpoints
Subscription `Protocol`	sqs / lambda / http(s) / email / sms / firehose	n/a	Per endpoint type	FIFO topics only fan into SQS FIFO

How filter-policy match operators behave — the ones people get wrong:

Match type	Example	Matches	Does NOT match
Exact string (array = OR)	`{"eventType":["OrderPlaced","OrderUpdated"]}`	`OrderPlaced` or `OrderUpdated`	`OrderCancelled`
Anything-but	`{"eventType":[{"anything-but":["Test"]}]}`	any value except `Test`	`Test`
Numeric compare	`{"amount":[{"numeric":[">=",500]}]}`	500, 1200	499 (and missing attribute)
Numeric range	`{"amount":[{"numeric":[">=",100,"<",500]}]}`	100–499	500, 99
Prefix	`{"region":[{"prefix":"eu-"}]}`	`eu-west-1`	`us-east-1`
Exists	`{"tier":[{"exists":true}]}`	message has `tier` attribute	message without `tier`
Missing attribute (any rule)	any key the message lacks	—	non-match → message dropped

The cost arithmetic that justifies pushing filters into SNS rather than filtering in the consumer:

Approach	Where messages are dropped	What you pay for dropped messages	Coupling
SNS filter policy	At the topic, before delivery	Nothing (not delivered, not enqueued)	Routing config, decoupled from code
Deliver-all + filter in consumer	In your code, after receive	SQS request + Lambda/ECS compute per message	Routing logic in every consumer
Separate topic per event type	At publish time	Nothing, but producer must know routes	Producer coupled to consumer taxonomy

Standard vs FIFO: ordering, message groups, and deduplication IDs

Pick standard unless you can articulate why you need FIFO, because FIFO buys ordering and dedup at the cost of throughput and operational complexity.

Property	Standard	FIFO
Ordering	Best-effort	Strict, per message group
Delivery	At-least-once	Exactly-once processing (within the dedup window)
Throughput	Nearly unlimited	300 msg/s, or 3,000 with batching; high-throughput mode raises this substantially
Dedup	None	5-minute dedup interval, by ID or content hash
Name suffix	any	must end in `.fifo`
Price per million	Lower (standard tier)	Higher (FIFO tier)
Lambda concurrency	Scales freely	One batch in flight per group

The mental model for FIFO ordering is the message group ID. Order is guaranteed within a group, never across groups. Choose the group ID as the entity whose events must be serialized — orderId, accountId, tenantId — and you get per-entity ordering with parallelism between entities. Use a single constant group ID and you have serialized the entire queue down to one in-flight message at a time, which is almost never what you want.

Deduplication has two modes:

Content-based dedup (ContentBasedDeduplication = true): SQS hashes the message body and rejects a duplicate body within a 5-minute window. Good when the body is naturally unique.
Explicit MessageDeduplicationId: you supply the key, typically a business idempotency key like the event ID. This is the robust choice because identical retries of the same logical event collapse even if the body differs slightly (a re-serialized timestamp, say).

resource "aws_sqs_queue" "payments" {
  name                        = "payments.fifo"
  fifo_queue                  = true
  content_based_deduplication = false # supply our own IDs
  deduplication_scope         = "messageGroup"
  fifo_throughput_limit       = "perMessageGroupId" # high-throughput mode
  visibility_timeout_seconds  = 60
}

aws sqs send-message \
  --queue-url "$PAYMENTS_FIFO_URL" \
  --message-body "$(cat payment.json)" \
  --message-group-id "acct-4471" \
  --message-deduplication-id "evt-7f3a9c-2026-06-08"

The 5-minute dedup window is a fixed property of SQS FIFO and you cannot extend it. If your retries can recur after five minutes, FIFO dedup will not save you — you still need idempotent consumers. FIFO reduces duplicate volume; it does not eliminate the need to handle them.

For SNS fan-out into FIFO queues, the SNS topic must also be FIFO, the publisher supplies MessageGroupId (and a dedup ID or content-based dedup on the topic), and every subscribed queue must be FIFO. You cannot fan a standard topic into a FIFO queue.

The FIFO configuration matrix

Every FIFO-specific attribute and how to set it:

Attribute	Values	Default	Effect	Gotcha
`FifoQueue`	true	n/a (standard if absent)	Makes the queue FIFO; name must end `.fifo`	Immutable after creation
`ContentBasedDeduplication`	true / false	false	Hash body as dedup ID if none supplied	Slightly different body ≠ dedup
`MessageDeduplicationId`	string ≤128 chars	none (required unless content-based)	Explicit dedup key	Reuse within 5 min = silently dropped
`MessageGroupId`	string ≤128 chars	none (required on send)	Ordering + parallelism unit	One constant value = fully serialized
`DeduplicationScope`	`queue` / `messageGroup`	`queue`	Scope of the dedup check	`messageGroup` required for high-throughput
`FifoThroughputLimit`	`perQueue` / `perMessageGroupId`	`perQueue`	Enables high-throughput mode	Needs `DeduplicationScope=messageGroup`

Choosing a MessageGroupId — the decision that determines your ordering and your parallelism:

Group ID choice	Ordering you get	Parallelism	Use when
Single constant (e.g. `"all"`)	Total order across everything	None — 1 message in flight	You truly need a global serial log (rare)
Per-entity (`accountId`, `orderId`)	Order within each entity	Across entities (high)	Per-customer / per-order serialization
Per-tenant (`tenantId`)	Order within a tenant	Across tenants	Multi-tenant SaaS isolation
Random / per-message	None (defeats FIFO)	Maximum	You don’t actually need ordering — use standard

Standard vs FIFO — the decision table:

If you need…	It’s probably…	Do this
Maximum throughput, order doesn’t matter	Standard	Plain standard queue + idempotent consumer
Per-entity ordering at scale	FIFO + high-throughput	`MessageGroupId=entityId`, `FifoThroughputLimit=perMessageGroupId`
Strict total order, low volume	FIFO, single group	One `MessageGroupId`, accept ~300 msg/s
Dedup of fast retries only	FIFO content-based or explicit ID	Set a dedup ID; still add idempotency for >5 min
Exactly-once effect, any duplicates	Idempotent consumer	DynamoDB conditional write on a business key

The FIFO throughput numbers you must size against:

Mode	Throughput (without batching)	With batching (10/call)	Requirement
FIFO, default	300 messages/s	up to 3,000 messages/s	`FifoThroughputLimit=perQueue`
FIFO, high-throughput	substantially higher per group	scales with group count	`perMessageGroupId` + `DeduplicationScope=messageGroup`
Standard	nearly unlimited	nearly unlimited	none

Visibility timeout, long polling, and throughput tuning

When a consumer receives a message, SQS hides it for the visibility timeout rather than deleting it. The consumer must delete it before the timeout expires; otherwise the message reappears and another consumer picks it up. Two failure modes follow directly:

Timeout too short → the message reappears mid-processing, a second worker grabs it, and you double-process. This is the single most common cause of “my consumer ran twice.”
Timeout too long → a crashed consumer’s message stays invisible far longer than necessary, inflating latency and the age-of-oldest-message metric.

Set the visibility timeout to roughly 6× your p99 processing time as a starting point, then for long or variable work extend it dynamically with ChangeMessageVisibility (a heartbeat) instead of provisioning a huge static value.

# Heartbeat: extend the lease while a long job is still running.
aws sqs change-message-visibility \
  --queue-url "$QUEUE_URL" \
  --receipt-handle "$RECEIPT_HANDLE" \
  --visibility-timeout 120

Always use long polling. Set ReceiveMessageWaitTimeSeconds = 20 on the queue (or pass --wait-time-seconds 20 per receive). Long polling waits for messages to arrive instead of returning immediately empty, which collapses empty-receive API calls — that is both a cost and a noise reduction. Short polling (the default of 0) is almost always a misconfiguration.

resource "aws_sqs_queue" "fraud" {
  name                          = "orders-fraud"
  receive_wait_time_seconds     = 20    # long polling
  visibility_timeout_seconds    = 120
  message_retention_seconds     = 345600 # 4 days; max is 14 days
}

For throughput, batch aggressively: SendMessageBatch and DeleteMessageBatch move up to 10 messages per call, cutting request count and cost by up to 10×. A standard queue scales horizontally on its own — you scale consumers, not the queue. Tune the maximum messages per ReceiveMessage (up to 10) and run more consumer threads/tasks to raise throughput.

The queue-attribute reference

Every queue attribute you actually set, with defaults, ranges, and the trade-off:

Attribute	Default	Range	When to change	Trade-off / gotcha
`VisibilityTimeout`	30 s	0 s – 12 h	~6× p99 processing time	Too short → double-process; too long → stuck latency
`ReceiveMessageWaitTimeSeconds`	0 (short poll)	0 – 20 s	Always set to 20	Short polling = empty receives + cost
`MessageRetentionPeriod`	4 days	60 s – 14 days	14 days for DLQs	Past retention, messages are silently deleted
`MaximumMessageSize`	256 KB	1 KB – 256 KB	Rarely; offload large bodies to S3	>256 KB needs the extended client (S3 pointer)
`DelaySeconds`	0	0 – 900 s	Defer first delivery (debounce)	Per-queue; per-message override via `DelaySeconds`
`ReceiveMessageWaitTimeSeconds` polling	—	—	—	Per-receive override beats per-queue default
`RedrivePolicy`	none	n/a	Always attach a DLQ + `maxReceiveCount`	Without it, poison loops 14 days
`KmsMasterKeyId`	SSE-SQS (managed)	key ID/alias	Audit/rotation/cross-account needs	SSE-KMS adds KMS calls; reuse the data key
`KmsDataKeyReusePeriodSeconds`	300 s	60 s – 86,400 s	Raise to cut KMS cost under load	Longer reuse = larger blast radius if key leaks
`ContentBasedDeduplication`	false	true/false (FIFO only)	When body is naturally unique	Slightly different body bypasses dedup

How to derive the visibility timeout from real processing latency:

Your p99 processing time	Starting visibility timeout	Heartbeat needed?	Notes
< 5 s	30 s	No	Default is usually fine
5–30 s	120–180 s	No	~6× p99, static
30 s – 2 min	300–600 s	Consider	Static may waste latency on crashes
Highly variable / long	Modest base + heartbeat	Yes	`ChangeMessageVisibility` every ~⅓ of timeout
> 12 h work	n/a — wrong tool	—	Use Step Functions, not a single SQS lease

Short polling vs long polling, the practical difference:

Aspect	Short polling (0 s)	Long polling (1–20 s)
Empty receives	Many; returns immediately empty	Few; waits for a message
Cost	Higher (billed empty receives)	Lower
Latency to first message	Lowest	Up to wait-time (negligible in practice)
Coverage of distributed queue	Samples a subset of hosts	Queries all hosts
Recommendation	Avoid	Default everywhere

Dead-letter queues: maxReceiveCount, redrive policy, and replay

A poison message is one a consumer can never process — malformed JSON, a reference to a deleted row, a schema your code does not understand. Without a DLQ it loops forever: received, fails, visibility timeout expires, received again, until it ages out 14 days later, having burned compute the whole time and blocking head-of-line progress on a FIFO queue.

The DLQ is governed by the redrive policy on the source queue: maxReceiveCount is the number of times a message may be received before SQS moves it to the DLQ. “Received” means delivered to a consumer, not “failed” — so a consumer that receives, crashes, and lets the visibility timeout lapse still increments the count. Set maxReceiveCount to give transient errors room to recover (typically 3–5), but not so high that a true poison message wastes dozens of retries.

resource "aws_sqs_queue" "orders" {
  name = "orders-work"
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.orders_dlq.arn
    maxReceiveCount     = 5
  })
}

resource "aws_sqs_queue" "orders_dlq" {
  name                      = "orders-work-dlq"
  message_retention_seconds = 1209600 # 14 days — keep DLQ retention long

  # Allow-list which source queues may redrive back out of this DLQ.
  redrive_allow_policy = jsonencode({
    redrivePermission = "byQueue"
    sourceQueueArns   = [aws_sqs_queue.orders.arn]
  })
}

Two rules that save you later: give the DLQ the maximum 14-day retention (a DLQ message you discover on Monday may be 4 days old), and give the DLQ a queue type that matches the source — a FIFO source needs a FIFO DLQ.

Once you have triaged and fixed the root cause, redrive moves messages back to the source queue (or to a custom destination) for reprocessing. This is a first-class feature; do not hand-roll a re-publish Lambda.

# Kick off a redrive from the DLQ back to its source queue.
aws sqs start-message-move-task \
  --source-arn "arn:aws:sqs:us-east-1:111122223333:orders-work-dlq" \
  --max-number-of-messages-per-second 50

# Watch progress.
aws sqs list-message-move-tasks \
  --source-arn "arn:aws:sqs:us-east-1:111122223333:orders-work-dlq"

Rate-limit the move so you do not re-flood a downstream that is still recovering. And never redrive blindly — if the bug is not fixed, the same messages bounce straight back to the DLQ.

The DLQ and redrive settings matrix

Every redrive-related setting, what it controls, and how to size it:

Setting	Where	Values	Default	When to change
`maxReceiveCount`	Source redrive policy	1 – 1000	none (no DLQ)	3–5 typical; balances retry vs waste
`deadLetterTargetArn`	Source redrive policy	a queue ARN	none	Always point at a matching-type DLQ
`redrivePermission`	DLQ `RedriveAllowPolicy`	`allowAll` / `byQueue` / `denyAll`	`allowAll`	Lock down which sources may redrive
`sourceQueueArns`	DLQ `RedriveAllowPolicy`	up to 10 ARNs	n/a	With `byQueue`, the allowed sources
DLQ `MessageRetentionPeriod`	DLQ queue	up to 14 days	4 days	Always max — you triage late
DLQ type	DLQ queue	standard / FIFO	match source	FIFO source needs FIFO DLQ
`--max-number-of-messages-per-second`	`StartMessageMoveTask`	1+	unthrottled	Throttle so you don’t re-flood downstream

Sizing maxReceiveCount against the kind of failures you expect:

Failure profile	maxReceiveCount	Reasoning
Mostly transient (throttles, brief downstream blips)	5	Room to recover without burning retries
Mix of transient + occasional poison	3	Quick to DLQ; triage the poison fast
Expensive side effect per attempt	2	Minimize wasted work on a true poison
Cheap, highly transient	5–10	Tolerate flapping downstreams
Strict ordering (FIFO, head-of-line blocking)	3	Poison blocks the group — evict quickly

What “received” means for the redrive counter — the subtle part people miss:

Event	Increments receive count?	Notes
Consumer receives, processes, deletes	Yes (once), then gone	Normal path
Consumer receives, throws, lets lease lapse	Yes	Failure still counts
Consumer receives, crashes before delete	Yes	Crash still counts
Lambda batch fails, whole batch returns	Yes, for every message in the batch	Why `ReportBatchItemFailures` matters
`ChangeMessageVisibility` heartbeat	No	Extending the lease is not a receive
Message sits unreceived	No	Age grows, count does not

The DLQ triage runbook — what to do with a message once it lands:

#	Step	Command / action	Why
1	Alarm fires (DLQ not empty)	CloudWatch page on `ApproximateNumberOfMessagesVisible > 0`	DLQ message = system couldn’t handle it
2	Inspect without consuming	`receive-message` with short visibility, read body + attributes	Understand the poison before acting
3	Classify	Malformed? deleted ref? schema gap? downstream-was-down?	Decides fix vs discard vs redrive
4	Fix root cause	Deploy code/schema fix or restore the dependency	Redrive without a fix = bounce-back
5	Redrive, rate-limited	`start-message-move-task --max-number-of-messages-per-second 50`	Replay without re-flooding
6	Watch the source backlog	`list-message-move-tasks`; `ApproximateAgeOfOldestMessage`	Confirm they clear, not re-DLQ
7	Purge genuine junk	`delete-message` on confirmed un-fixable bodies	Don’t let the DLQ grow unbounded

Idempotency and exactly-once-effect processing

At-least-once delivery is a guarantee, not an edge case. Standard queues redeliver on their own, visibility-timeout expiry causes redelivery, and even FIFO only dedups within five minutes. The only durable answer is an idempotent consumer: processing the same message twice produces the same effect as processing it once.

Derive a stable idempotency key from the message — a business event ID, or the SQS MessageDeduplicationId, never the MessageId (a redelivery keeps the same MessageId, but a re-publish by the producer gets a new one, so MessageId alone is unreliable across producer retries). Then make the side effect conditional on that key. Two patterns I use:

A. Conditional write on the work itself. If the side effect is a DynamoDB write, use a condition expression so the second attempt is a no-op rather than a separate ledger entry.

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("payments")

def process(event_id: str, amount: int):
    try:
        table.put_item(
            Item={"pk": event_id, "amount": amount, "status": "POSTED"},
            ConditionExpression="attribute_not_exists(pk)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # already processed; safe to ack and delete
        raise

B. A dedicated idempotency table when the side effect is an external call (charge a card, send an email) that you cannot make conditional. Record the key with a TTL before the call, short-circuit if it already exists, and design the external call to be retried safely if you crash mid-flight. The AWS Lambda Powertools idempotency utility implements exactly this with a DynamoDB backing table and is worth adopting rather than reinventing.

Exactly-once delivery is largely a myth in distributed messaging. What you can actually engineer is exactly-once effect, and it lives in the consumer, not the queue. Build it there and duplicate deliveries become a non-event.

Choosing the idempotency key

The key you pick determines whether dedup actually works across all the ways a message can repeat:

Key candidate	Survives SQS redelivery?	Survives producer re-publish?	Verdict
`MessageId`	Yes (same on redelivery)	No (new ID on re-publish)	Unreliable — never use alone
`MessageDeduplicationId` (FIFO)	Yes	Yes if producer reuses it	Good when producer is disciplined
Business event ID (e.g. `orderId+version`)	Yes	Yes	Best — stable across all retries
Content hash of the body	Yes	Yes if body is byte-identical	Fragile to re-serialization
Receipt handle	No (changes every receive)	No	Never — it’s a per-receive token

Idempotency-pattern selection by side-effect type:

Side effect	Pattern	Mechanism	Notes
DynamoDB write (the work itself)	Conditional write	`ConditionExpression="attribute_not_exists(pk)"`	Cheapest; the write is the dedup
External charge / email / API call	Dedicated idempotency table	Record key + TTL before the call	Make the external call retry-safe
Counter / aggregate increment	Conditional update	`ADD` guarded by a seen-set, or transaction	Avoid double-increment under retry
Multi-item transaction	`TransactWriteItems` with a condition	Idempotency item + work items in one tx	All-or-nothing; one key guards the set
Publish a downstream event	Dedup ID on the downstream publish	FIFO dedup or idempotency record	Propagate the key, don’t mint a new one

How the idempotency table item is shaped and why each field exists:

Field	Purpose	Example
`pk` (idempotency key)	The dedup identity	`order#5521#v3`
`status`	In-progress vs completed	`IN_PROGRESS` / `COMPLETED`
`result`	Cached response for a duplicate	serialized prior outcome
`expiry` (TTL)	Auto-cleanup after the dup window	`now + 24h` epoch seconds
`lockedAt`	Detect crashed in-flight attempts	timestamp

Lambda and ECS consumers: batching, partial-batch failures, and backpressure

Lambda consumes via an event source mapping (ESM) that long-polls the queue for you and invokes in batches. The trap is the default failure behavior: if your handler throws, the entire batch returns to the queue and every message’s receive count increments — including the ones that already succeeded. They get reprocessed, and good messages can be driven to the DLQ by a single poison sibling.

Fix this with ReportBatchItemFailures. Declare the response type on the ESM and return only the IDs that failed; Lambda deletes the rest.

resource "aws_lambda_event_source_mapping" "orders" {
  event_source_arn                   = aws_sqs_queue.orders.arn
  function_name                      = aws_lambda_function.consumer.arn
  batch_size                         = 10
  maximum_batching_window_in_seconds = 5
  function_response_types            = ["ReportBatchItemFailures"]

  scaling_config {
    maximum_concurrency = 20 # cap concurrent invocations from this queue
  }
}

def handler(event, _ctx):
    failures = []
    for record in event["Records"]:
        try:
            process_record(record)
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

Set maximum_concurrency on the ESM (minimum 2, up to 1000) to bound how hard the queue hammers a fragile downstream — backpressure for free, without throttling the whole function’s reserved concurrency. For FIFO sources, Lambda processes one batch per message group at a time, preserving order, and on failure it pauses that group until the batch clears so ordering holds.

ECS / EC2 consumers poll the queue themselves in a loop. The clean way to scale them is to drive autoscaling off backlog per task, not raw CPU. Compute a target backlog-per-task from your acceptable latency and publish it as a custom metric, or use the standard approach of a target-tracking policy on ApproximateNumberOfMessagesVisible divided by running task count.

# Typical worker loop: long poll, process, delete in a batch.
while True:
    resp = sqs.receive_message(
        QueueUrl=url, MaxNumberOfMessages=10, WaitTimeSeconds=20,
        AttributeNames=["ApproximateReceiveCount"],
    )
    msgs = resp.get("Messages", [])
    entries = []
    for m in msgs:
        if process(m):  # idempotent; returns True on success
            entries.append({"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]})
    if entries:
        sqs.delete_message_batch(QueueUrl=url, Entries=entries)

Delete only what you successfully processed; let the rest time out and redeliver (and eventually DLQ). Never delete-then-process.

The Lambda event source mapping reference

Every ESM setting that changes consumer behavior:

Setting	Default	Range	Effect	Gotcha
`batch_size`	10	1 – 10,000 (≤10 for standard short window)	Messages per invocation	Larger batch = bigger blast radius on failure
`maximum_batching_window_in_seconds`	0	0 – 300 s	Wait to fill a batch	Adds latency; lets batches grow
`function_response_types`	none	`["ReportBatchItemFailures"]`	Partial-batch failure reporting	Without it, one bad record retries all
`scaling_config.maximum_concurrency`	unset	2 – 1000	Cap concurrent invocations from this queue	Backpressure without touching reserved concurrency
`maximum_concurrency` vs reserved	—	—	ESM cap is per-queue; reserved is per-function	Use ESM cap to protect a downstream
FIFO group handling	n/a	n/a	One batch per group in flight	Failure pauses that group only
Filter criteria	none	event filtering JSON	Drop records before invoke	Filtered records are deleted, not retried

The ReportBatchItemFailures contract — get the return shape exactly right or it silently no-ops:

Return value	Lambda behavior	Result
`{"batchItemFailures": []}`	Deletes the whole batch	All succeeded
`{"batchItemFailures": [{"itemIdentifier": "<msgId>"}]}`	Deletes the rest, retries listed IDs	Correct partial failure
Throws / returns nothing useful	Returns the entire batch	Every message retried (the trap)
Malformed key (e.g. wrong casing)	Treated as total failure	Whole batch retried
Reports an ID not in the batch	Ignored for that ID	No effect

Lambda vs ECS/EC2 consumers — when to pick which:

Dimension	Lambda (ESM)	ECS/EC2 poller
Polling	Managed by AWS	You write the loop
Scaling trigger	Automatic on backlog	Backlog-per-task custom metric
Concurrency control	`maximum_concurrency` cap	Task count / ASG
Max processing time	15 min (function timeout)	Unbounded (long jobs)
Cost model	Per-invocation + duration	Per running task/instance
Best for	Spiky, short, bursty work	Steady, long, stateful work
Partial failure	`ReportBatchItemFailures`	Delete only succeeded entries

How to drive ECS autoscaling off backlog rather than CPU:

Metric	What it tells you	Scaling rule	Why not CPU
`ApproximateNumberOfMessagesVisible`	Backlog depth	Target backlog-per-task	CPU lags; a fast consumer with deep backlog reads low CPU
Running task count	Current capacity	Denominator for backlog/task	—
Backlog ÷ tasks (custom)	Per-task pressure	Target-tracking on this value	The honest “are we keeping up” signal
`ApproximateAgeOfOldestMessage`	Latency / falling behind	Scale-out alarm	The SLA signal, not utilization

Encryption, access policies, and cross-account sharing

Enable server-side encryption on every queue and topic. SQS-managed keys (SSE-SQS) are free and require zero key management; use a customer-managed KMS key (SSE-KMS) when you need an audit trail, key rotation control, or cross-account key policies.

resource "aws_sqs_queue" "orders" {
  name              = "orders-work"
  kms_master_key_id = aws_kms_key.messaging.id
  # Cache the data key to cut KMS calls (and cost) under load.
  kms_data_key_reuse_period_seconds = 300
}

For SNS-to-SQS to work at all, the queue’s resource policy must allow sns.amazonaws.com to sqs:SendMessage, scoped to the topic ARN with aws:SourceArn. Terraform’s aws_sns_topic_subscription does not write this for you; add it explicitly.

data "aws_iam_policy_document" "allow_sns" {
  statement {
    sid     = "AllowSNSDelivery"
    effect  = "Allow"
    actions = ["sqs:SendMessage"]
    principals {
      type        = "Service"
      identifiers = ["sns.amazonaws.com"]
    }
    resources = [aws_sqs_queue.orders.arn]
    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = [aws_sns_topic.orders.arn]
    }
  }
}

resource "aws_sqs_queue_policy" "orders" {
  queue_url = aws_sqs_queue.orders.id
  policy    = data.aws_iam_policy_document.allow_sns.json
}

Cross-account fan-out (topic in account A, queue in account B) adds two requirements beyond the above: the topic policy in A must allow B’s queue to subscribe (sns:Subscribe / sns:Receive), and if the topic uses SSE-KMS, the KMS key policy must let sns.amazonaws.com and the subscribing principals use the key — a step that silently drops every cross-account message when missed. Always scope these with aws:SourceArn or aws:SourceAccount; never leave Principal: "*" unconditioned on a topic or queue.

Encryption options compared

The two SSE modes and what each buys you:

Aspect	SSE-SQS (default-managed)	SSE-KMS (customer-managed)
Key management	None — AWS owns it	You own the CMK
Cost	Free	KMS request + key cost
Audit trail (CloudTrail on key use)	No	Yes
Rotation control	AWS-managed	You configure
Cross-account key policy	Not applicable	Required and explicit
Data-key reuse tuning	n/a	`KmsDataKeyReusePeriodSeconds` 60 s–24 h
When to choose	Default for most queues	Compliance, audit, cross-account

The IAM/policy grants required for SNS→SQS, by topology:

Topology	Grant needed	Where it lives	Symptom when missing
Same-account SNS→SQS	`sqs:SendMessage` for `sns.amazonaws.com`, `aws:SourceArn`=topic	Queue resource policy	Subscription confirmed but queue stays empty
Cross-account subscribe	`sns:Subscribe`/`sns:Receive` for account B	Topic policy in A	Cannot create the subscription
Cross-account delivery (SSE-KMS)	`kms:GenerateDataKey`/`Decrypt` for `sns.amazonaws.com` + subscriber	KMS key policy	Every cross-account message silently dropped
Consumer reads encrypted queue	`kms:Decrypt` on the CMK	Consumer IAM role	`ReceiveMessage` returns access-denied on decrypt
FIFO topic → FIFO queue	Same as above + matching FIFO types	Both	Standard topic can’t fan into FIFO queue

Least-privilege actions for each role in the path:

Principal	Minimum actions	Scope	Never grant
Producer	`sns:Publish`	the topic ARN	`sns:` on ``
SNS service (delivery)	`sqs:SendMessage`	the queue, `aws:SourceArn`=topic	unconditioned `Principal:"*"`
Consumer	`sqs:ReceiveMessage`, `DeleteMessage`, `ChangeMessageVisibility`, `GetQueueAttributes` (+ `kms:Decrypt`)	the queue + CMK	`sqs:` on ``
Redrive operator	`sqs:StartMessageMoveTask`, `ListMessageMoveTasks`	the DLQ	broad SQS admin in app roles

Monitoring: queue depth, age of oldest message, and DLQ alarms

The metric that actually tells you whether consumers are keeping up is ApproximateAgeOfOldestMessage, not queue depth. Depth can be high transiently and still healthy; a rising oldest-message age means consumers are falling behind and is your true SLA signal. Alarm on it.

The non-negotiable alarm is any message in a DLQ. A message in the DLQ is, by definition, something your system could not handle — page on it.

resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  alarm_name          = "orders-dlq-not-empty"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  dimensions          = { QueueName = aws_sqs_queue.orders_dlq.name }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.oncall.arn]
}

resource "aws_cloudwatch_metric_alarm" "backlog_age" {
  alarm_name          = "orders-work-backlog-age"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  dimensions          = { QueueName = aws_sqs_queue.orders.name }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 300 # seconds; tune to your latency SLO
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.oncall.arn]
}

Round out the dashboard with NumberOfMessagesSent / NumberOfMessagesReceived (throughput), NumberOfEmptyReceives (long-polling health), and on SNS NumberOfNotificationsFailedToRedriveToDlq (deliveries you are actually losing). For the broader observability picture — dashboards, composite alarms, log insights — see CloudWatch & CloudTrail Observability Deep Dive, and trace a message end-to-end with X-Ray Service Map & Segments.

The metric reference

The SQS and SNS metrics worth a dashboard tile or an alarm:

Metric	Namespace	What it means	Alarm?	Threshold (starting point)
`ApproximateAgeOfOldestMessage`	AWS/SQS	Oldest un-deleted message age	Yes (SLA)	> your latency SLO (e.g. 300 s)
`ApproximateNumberOfMessagesVisible`	AWS/SQS	Backlog ready to receive	Yes (on DLQ: >0)	DLQ > 0; source depends on load
`ApproximateNumberOfMessagesNotVisible`	AWS/SQS	In-flight (leased) messages	Watch	Near the in-flight cap = trouble
`NumberOfEmptyReceives`	AWS/SQS	Empty poll responses	Watch	High = short polling misconfig
`NumberOfMessagesSent` / `Received`	AWS/SQS	Throughput in/out	Dashboard	Compare in vs out for drift
`NumberOfMessagesDeleted`	AWS/SQS	Successful completions	Dashboard	Sent − Deleted ≈ backlog growth
`IteratorAge` (Lambda ESM)	AWS/Lambda	How stale the batch being processed is	Yes	Rising = consumer behind
`NumberOfNotificationsFailed`	AWS/SNS	SNS deliveries that failed	Yes	> 0 sustained
`NumberOfNotificationsFailedToRedriveToDlq`	AWS/SNS	Lost deliveries (couldn’t even DLQ)	Yes	> 0 — real loss

Reading the metrics together — what a combination tells you:

Pattern	Likely meaning	Action
Depth high but age flat/low	Healthy burst; consumers draining	None — don’t alarm on depth alone
Age rising, depth rising	Consumers can’t keep up	Scale consumers; check downstream
In-flight near cap, age rising	Messages stuck in long processing	Visibility timeout too long, or stuck work
Empty receives high	Short polling	Set `ReceiveMessageWaitTimeSeconds=20`
Sent ≫ Deleted	Processing failing or backing up	Check consumer errors + DLQ
DLQ count > 0	Poison / unhandled messages	Page; triage; fix; redrive

The in-flight and account limits you can actually hit:

Limit	Standard	FIFO	Notes
In-flight messages (received, not deleted)	120,000	20,000	Exceeding returns `OverLimit` on receive
Message size	256 KB	256 KB	Use S3 extended client beyond this
Message retention	up to 14 days	up to 14 days	Default 4 days
Batch entries per call	10	10	Send/Delete/ChangeVisibility batch
Delay seconds	up to 900 s	up to 900 s	Per-queue or per-message
Throughput	nearly unlimited	300/s (3,000 batched); higher in HT mode	Hard ceiling on FIFO without HT mode

Architecture at a glance

The diagram traces a single order event from the moment a producer publishes it to the moment its effect is durably recorded — and overlays the five places this path breaks. Read it left to right. On the far left, a producer Lambda publishes an OrderPlaced fact to the SNS topic orders-events and returns immediately; it never talks to a queue and never waits for a consumer. The topic fans the message out to one private SQS queue per consumer — a FIFO fraud.fifo queue that filters for amount ≥ 500, a standard analytics queue that takes everything with raw delivery, and so on — each subscription carrying its own filter policy so unwanted messages are dropped at the topic and never billed. Failed deliveries (a bad queue policy, a missing KMS grant) fall into the topic’s own SNS-level DLQ rather than vanishing.

From the queues, consumers pull on a 20-second long poll: a Lambda event source mapping pulling batches of ten with partial-batch-failure reporting, and an ECS poller that autoscales on backlog-per-task. Each consumer is idempotent — before it commits a side effect it does a conditional write to a DynamoDB idempotency table keyed on the business event ID, so a duplicate delivery is a no-op. Messages that exceed maxReceiveCount redrive to the work-dlq, where the DLQ-not-empty alarm pages on-call; once the root cause is fixed, StartMessageMoveTask replays them back to the source queue (the green return arrow). Every hop is encrypted with an SSE-KMS key whose policy must grant both SNS and the subscribers. The five numbered badges mark exactly where this fails in production: the queue/KMS policy that silently drops messages, the too-short visibility timeout that double-processes, the poison message that loops to the DLQ, the Lambda batch that retries good messages because of one bad sibling, and the duplicate that slips past FIFO’s five-minute dedup window — each one mapped in the legend to its symptom, the metric that confirms it, and the fix.

Real-world scenario

A payments platform team — call them Tendwell — ran a single standard SQS queue behind their card-authorization service. Under a Black Friday spike, the downstream processor slowed, processing time crept past their 30-second visibility timeout, and messages began reappearing mid-flight. A second consumer picked up each redelivered message and double-charged a subset of customers — a few hundred duplicate authorizations before they killed the consumers. The incident channel filled with “messaging is unreliable,” which was exactly the wrong lesson: the messaging was behaving precisely as documented, and the configuration was wrong.

The constraint that made the obvious fixes insufficient: they could not simply make the queue FIFO, because per-customer ordering mattered but a single global FIFO group would have collapsed throughput below their peak rate, and the 5-minute dedup window did not cover retries that recurred minutes later under sustained load. They also could not just lengthen the visibility timeout to a huge static value, because a genuinely crashed authorization would then stay invisible for many minutes, breaching their latency SLO and starving capacity during the exact window they needed it most.

The fix was layered, and the lesson was that no single setting would have saved them:

FIFO with per-account message groups (MessageGroupId = accountId) gave strict ordering per customer while keeping parallelism across customers, plus high-throughput mode (fifo_throughput_limit = "perMessageGroupId", deduplication_scope = "messageGroup") to clear the peak.
An idempotency table keyed on the gateway request ID made the charge itself a no-op on replay — the real fix, since FIFO dedup alone was insufficient past five minutes.
Visibility-timeout heartbeats (ChangeMessageVisibility every 20 seconds while a charge was in flight) stopped redelivery during legitimately slow downstream calls.
A DLQ with maxReceiveCount = 3 plus a DLQ-not-empty alarm that had never existed before, so genuinely malformed messages were sidelined and paged instead of looping silently.

# The charge became conditional on a recorded idempotency key.
def authorize(req_id: str, account_id: str, amount: int):
    try:
        ddb.put_item(
            TableName="auth-idempotency",
            Item={"req_id": {"S": req_id}, "ttl": {"N": str(ttl_24h())}},
            ConditionExpression="attribute_not_exists(req_id)",
        )
    except ddb.exceptions.ConditionalCheckFailedException:
        return load_prior_result(req_id)  # already authorized; return same result
    return gateway.charge(account_id, amount)

The numbers tell the story. Here is Tendwell’s before-and-after across the metrics that mattered:

Metric	Before (incident)	After (next peak)
Duplicate authorizations	~280 in 25 min	0
Visibility timeout	30 s static	60 s base + 20 s heartbeats
Ordering guarantee	None (standard)	Per-account (FIFO group)
Peak throughput	~limited by single consumer	cleared peak (HT mode)
Poison handling	Looped 14 days	DLQ at 3 receives, paged
Idempotency	None	DynamoDB conditional write, 24 h TTL
MTTR for the duplicate-charge class	hours (manual)	minutes (alarm + idempotent replay)

Post-change, duplicate authorizations went to zero across the next peak, and the DLQ alarm — which had not existed before — caught two genuinely malformed messages that previously would have looped silently for days. The retro’s one-line conclusion: the queue was never the problem; the consumer’s assumption of exactly-once delivery was.

Advantages and disadvantages

The decoupled SNS-to-SQS fan-out pattern is the right default for asynchronous work, but it is not free — the trade-offs are real and worth naming before you commit.

Advantages	Disadvantages
Producer and consumers fully decoupled — add a consumer with one queue + subscription	Eventual consistency; no synchronous response to the caller
One slow/broken consumer can’t back-pressure others	More moving parts (topic, N queues, N DLQs, policies) to manage
Built-in durability + retries + DLQ without custom code	At-least-once delivery forces idempotency work on you
Scales to high throughput with no capacity planning (standard)	FIFO’s ordering/dedup come at a throughput + complexity cost
Server-side filtering cuts cost and consumer noise	Filter/IAM/KMS misconfig fails silently — hard to spot
Redrive + replay is first-class, not hand-rolled	Debugging async flows needs tracing you must set up
Per-consumer tuning (visibility, polling, retention)	Easy to misconfigure visibility timeout → double-process

When each side matters: the advantages dominate any workload where the caller does not need an immediate answer — order processing, notifications, ETL triggers, audit pipelines, microservice choreography. The decoupling pays off most the second time you add a consumer without touching the producer. The disadvantages bite hardest when a synchronous contract is actually required (use a direct API call or Step Functions Express for request/response), when strict global ordering is mandatory (FIFO’s throughput ceiling may not fit), or when the team lacks the discipline to build idempotent consumers (in which case duplicates will cause incidents). The silent-failure modes — a typo’d filter attribute, a missing KMS grant — are the single biggest operational tax, which is why the monitoring section is non-negotiable.

Hands-on lab

This walk-through stands up a topic, two fan-out queues with filtering, a DLQ, and proves the failure handling — all within Free Tier (SQS gives one million requests/month free; SNS one million publishes). Use a throwaway region and tear it down at the end. Replace 111122223333 with your account ID.

1. Create the topic and two queues (plus a DLQ).

export AWS_DEFAULT_REGION=us-east-1
TOPIC_ARN=$(aws sns create-topic --name lab-orders --query TopicArn --output text)
FRAUD_URL=$(aws sqs create-queue --queue-name lab-fraud \
  --attributes '{"ReceiveMessageWaitTimeSeconds":"20","VisibilityTimeout":"60"}' \
  --query QueueUrl --output text)
ANALYTICS_URL=$(aws sqs create-queue --queue-name lab-analytics \
  --attributes '{"ReceiveMessageWaitTimeSeconds":"20"}' --query QueueUrl --output text)
DLQ_URL=$(aws sqs create-queue --queue-name lab-fraud-dlq \
  --attributes '{"MessageRetentionPeriod":"1209600"}' --query QueueUrl --output text)

2. Attach the DLQ via a redrive policy on the fraud queue.

DLQ_ARN=$(aws sqs get-queue-attributes --queue-url "$DLQ_URL" \
  --attribute-names QueueArn --query Attributes.QueueArn --output text)
aws sqs set-queue-attributes --queue-url "$FRAUD_URL" --attributes \
  "{\"RedrivePolicy\":\"{\\\"deadLetterTargetArn\\\":\\\"$DLQ_ARN\\\",\\\"maxReceiveCount\\\":\\\"3\\\"}\"}"

3. Grant SNS permission to send to each queue, then subscribe with a filter + raw delivery.

for Q in "$FRAUD_URL" "$ANALYTICS_URL"; do
  ARN=$(aws sqs get-queue-attributes --queue-url "$Q" --attribute-names QueueArn \
        --query Attributes.QueueArn --output text)
  POLICY="{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"sns.amazonaws.com\"},\"Action\":\"sqs:SendMessage\",\"Resource\":\"$ARN\",\"Condition\":{\"ArnEquals\":{\"aws:SourceArn\":\"$TOPIC_ARN\"}}}]}"
  aws sqs set-queue-attributes --queue-url "$Q" --attributes "{\"Policy\":$(printf '%s' "$POLICY" | python3 -c 'import json,sys;print(json.dumps(sys.stdin.read()))')}"
done

FRAUD_ARN=$(aws sqs get-queue-attributes --queue-url "$FRAUD_URL" --attribute-names QueueArn --query Attributes.QueueArn --output text)
aws sns subscribe --topic-arn "$TOPIC_ARN" --protocol sqs --notification-endpoint "$FRAUD_ARN" \
  --attributes '{"RawMessageDelivery":"true","FilterPolicy":"{\"amount\":[{\"numeric\":[\">=\",500]}]}"}'

4. Publish a non-matching event — the fraud queue should stay empty.

aws sns publish --topic-arn "$TOPIC_ARN" --message '{"id":"o1"}' \
  --message-attributes '{"amount":{"DataType":"Number","StringValue":"100"}}'
sleep 3
aws sqs get-queue-attributes --queue-url "$FRAUD_URL" \
  --attribute-names ApproximateNumberOfMessages   # expect 0

5. Publish a matching event and confirm raw delivery (no SNS envelope).

aws sns publish --topic-arn "$TOPIC_ARN" --message '{"id":"o2","amount":900}' \
  --message-attributes '{"amount":{"DataType":"Number","StringValue":"900"}}'
aws sqs receive-message --queue-url "$FRAUD_URL" --wait-time-seconds 5 \
  --query 'Messages[0].Body'   # expect {"id":"o2","amount":900}, NOT {"Type":"Notification",...}

6. Force a poison message to the DLQ. Send a message, then receive it three times without deleting (let the short visibility lapse). After the third receive it lands in the DLQ.

aws sqs send-message --queue-url "$FRAUD_URL" --message-body '{"id":"poison"}'
for i in 1 2 3 4; do
  aws sqs receive-message --queue-url "$FRAUD_URL" --visibility-timeout 1 \
    --wait-time-seconds 2 --query 'Messages[0].MessageId' --output text
  sleep 2
done
sleep 3
aws sqs get-queue-attributes --queue-url "$DLQ_URL" \
  --attribute-names ApproximateNumberOfMessages   # expect >= 1

7. Redrive the DLQ back to source after the “fix.”

aws sqs start-message-move-task --source-arn "$DLQ_ARN" \
  --max-number-of-messages-per-second 50
aws sqs list-message-move-tasks --source-arn "$DLQ_ARN"

8. Teardown.

for Q in "$FRAUD_URL" "$ANALYTICS_URL" "$DLQ_URL"; do aws sqs delete-queue --queue-url "$Q"; done
aws sns delete-topic --topic-arn "$TOPIC_ARN"

Expected outcomes at a glance:

Step	Expected result	If it fails
4 (non-match)	Fraud queue depth 0	Filter attribute name/typo; check `FilterPolicy`
5 (match, raw)	Body is the raw payload	`RawMessageDelivery` not true
6 (poison)	DLQ depth ≥ 1 after 3 receives	`maxReceiveCount` not set / DLQ not attached
7 (redrive)	Move task listed, source repopulates	`RedriveAllowPolicy` blocks the source

Common mistakes & troubleshooting

These are the failure modes I see repeatedly, framed as a playbook: symptom → root cause → how to confirm (exact command) → fix. Scan the table for your symptom, then read the detail.

#	Symptom	Root cause	Confirm (exact command / metric)	Fix
1	Subscription “succeeded” but the queue stays empty	Queue policy missing `sqs:SendMessage` for `sns.amazonaws.com`	SNS `NumberOfNotificationsFailed` > 0; `aws sqs get-queue-attributes --attribute-names Policy`	Add queue policy allowing `sns.amazonaws.com`, `aws:SourceArn`=topic
2	Cross-account fan-out silently drops every message	SSE-KMS key policy doesn’t grant SNS / subscriber	CloudTrail `Decrypt`/`GenerateDataKey` AccessDenied	Grant `sns.amazonaws.com` + subscriber on the CMK key policy
3	Consumer processes the same message twice	Visibility timeout < processing time	`ApproximateReceiveCount` > 1 on processed msgs	Raise to ~6× p99; add `ChangeMessageVisibility` heartbeat
4	One bad record drives nine good ones to the DLQ	Lambda batch fails whole-batch (no partial reporting)	ESM `function_response_types` unset	Add `ReportBatchItemFailures`; return failed `itemIdentifier`s
5	A malformed message loops for days	No DLQ / no `maxReceiveCount`	Redrive policy absent on source queue	Attach DLQ with `maxReceiveCount` 3–5
6	Consumer reads garbage, double-parses JSON	`RawMessageDelivery` not enabled	Message body starts `{"Type":"Notification"`	Set `RawMessageDelivery=true` on the subscription
7	FIFO queue serialized to one message at a time	Single constant `MessageGroupId`	All messages share one group ID	Use per-entity group ID (`accountId`/`orderId`)
8	Duplicates slip past FIFO dedup	Retries recur after the 5-min window	Same business key processed twice, >5 min apart	Idempotent consumer keyed on business ID
9	High empty-receive cost / noisy polling	Short polling (wait time 0)	`NumberOfEmptyReceives` high	Set `ReceiveMessageWaitTimeSeconds=20`
10	Messages “lost” after a few days	Retention expiry (default 4 days)	`MessageRetentionPeriod` = 345600	Raise retention; on DLQ set 14 days
11	Backlog grows, consumers look idle	Scaling on CPU, not backlog	CPU low while `ApproximateAgeOfOldestMessage` rises	Scale on backlog-per-task / ESM concurrency
12	`OverLimit` errors on receive	In-flight cap hit (120k std / 20k FIFO)	`ApproximateNumberOfMessagesNotVisible` near cap	Delete faster; shorten visibility; more consumers
13	FIFO publish rejected / dropped	Reused `MessageDeduplicationId` within 5 min	Dedup ID identical to a recent send	Use a unique-per-logical-event dedup ID
14	Redriven messages bounce straight back to DLQ	Root cause not fixed before redrive	DLQ refills right after `StartMessageMoveTask`	Fix code/schema first, then redrive

The entries that bite hardest, expanded:

1. Subscription confirmed but the queue never receives anything. Root cause: The queue resource policy doesn’t allow sns.amazonaws.com to sqs:SendMessage. SNS thinks it delivered; SQS refuses the write. Terraform’s aws_sns_topic_subscription does not write this policy for you. Confirm: aws sqs get-queue-attributes --queue-url "$Q" --attribute-names Policy shows no SNS statement; SNS NumberOfNotificationsFailed is non-zero. Fix: Attach a queue policy allowing sns.amazonaws.com with aws:SourceArn scoped to the topic (see the encryption section’s Terraform).

2. Cross-account fan-out silently drops every message. Root cause: The topic uses SSE-KMS and the key policy doesn’t grant sns.amazonaws.com (and the subscribing account) GenerateDataKey/Decrypt. The message can’t be encrypted/decrypted on the path, so it’s dropped with no error to you. Confirm: CloudTrail shows AccessDenied on kms:Decrypt or kms:GenerateDataKey for the SNS service principal. Fix: Add both sns.amazonaws.com and the subscriber principal to the CMK key policy; scope with aws:SourceArn/aws:SourceAccount.

3. The consumer ran twice. Root cause: The visibility timeout is shorter than processing time, so the message reappears mid-flight and a second worker grabs it. Confirm: ApproximateReceiveCount on successfully processed messages is > 1; duplicate side effects in your data. Fix: Set the timeout to ~6× p99 processing time; for long/variable work add a ChangeMessageVisibility heartbeat. And make the consumer idempotent regardless.

4. One poison record drove the whole batch to the DLQ. Root cause: A Lambda ESM without ReportBatchItemFailures returns the entire batch on any exception, so already-succeeded messages re-run and a single bad sibling pushes the batch to DLQ. Confirm: The ESM’s function_response_types is unset; good messages show rising receive counts. Fix: Set function_response_types = ["ReportBatchItemFailures"] and return only the failed itemIdentifiers.

8. Duplicates that FIFO didn’t catch. Root cause: FIFO dedup is bounded to five minutes; a retry that recurs after that window is a fresh message to SQS. Confirm: The same business key is processed twice, more than five minutes apart. Fix: Build an idempotent consumer on a business idempotency key (DynamoDB conditional write); FIFO reduces duplicate volume but never removes the need.

Best practices

One SQS queue per consumer. Producers publish only to the SNS topic, never to consumer queues directly. Adding a consumer is one queue + one subscription, never a producer change.
RawMessageDelivery = true on SQS subscriptions. Strip the SNS envelope so consumers read the publisher’s body. Push routing into SNS with filter policies so unwanted messages are dropped at the topic, never billed.
Every source queue has a redrive policy with a tuned maxReceiveCount (3–5) pointing at a DLQ. Without it, a poison message loops for fourteen days.
DLQs use 14-day retention and a RedriveAllowPolicy scoping which sources may redrive back. A FIFO source needs a FIFO DLQ.
Standard by default; FIFO only with a deliberate MessageGroupId strategy and explicit dedup IDs. A single constant group ID serializes your whole queue — almost never what you want.
Visibility timeout ~6× p99 processing time, with ChangeMessageVisibility heartbeats for long jobs rather than a huge static value.
Long polling everywhere (ReceiveMessageWaitTimeSeconds = 20). Short polling is a cost and noise misconfiguration.
Consumers are idempotent on a business key, independent of FIFO’s five-minute dedup window. Exactly-once effect lives in the consumer, not the queue.
Lambda ESMs use ReportBatchItemFailures and a maximum_concurrency backpressure cap so one queue can’t hammer a fragile downstream.
Scale consumers on backlog-per-task, not CPU. ApproximateNumberOfMessagesVisible ÷ tasks is the honest “keeping up” signal.
SSE on all queues/topics; SNS-to-SQS queue policy scoped with aws:SourceArn. Cross-account paths grant both topic-policy and KMS-key-policy access.
Alarm on DLQ-not-empty (page) and rising ApproximateAgeOfOldestMessage (SLA signal) — not on raw queue depth, which is noisy.

Security notes

Encrypt every queue and topic at rest. SSE-SQS (managed) is free and the right default; use SSE-KMS with a customer-managed key when you need a CloudTrail audit trail of key use, controlled rotation, or cross-account key policies. See KMS Encryption Deep Dive for envelope encryption and key-policy patterns.
Scope SNS→SQS delivery, never leave it open. The queue policy must allow sns.amazonaws.com to sqs:SendMessage with a Condition on aws:SourceArn equal to the topic. An unconditioned Principal: "*" lets any topic (or anyone) write to your queue.
Least privilege per role. Producers get sns:Publish on the topic only; consumers get ReceiveMessage/DeleteMessage/ChangeMessageVisibility/GetQueueAttributes (+ kms:Decrypt) on the queue only; redrive operators get StartMessageMoveTask on the DLQ. Never sqs:*/sns:* on * in application roles.
Cross-account is two grants, both explicit. The topic policy must allow the other account to subscribe, and the KMS key policy must grant SNS and the subscriber — the second is the one that silently drops every message when missed.
Don’t put secrets or PII in message bodies you can’t encrypt and audit. If a payload carries sensitive data, SSE-KMS gives you the audit trail; consider field-level handling for the most sensitive attributes.
Tighten the data-key reuse window thoughtfully. A longer KmsDataKeyReusePeriodSeconds cuts KMS cost but widens the blast radius if a data key is ever exposed; 300 s is a sane default.
Use VPC endpoints for SQS/SNS where consumers run in private subnets, so queue traffic never traverses the public internet.

The security controls and what each prevents:

Control	Mechanism	Secures against	Also prevents
SSE-KMS on queue/topic	CMK + key policy	Plaintext data at rest	Untracked secret access
Queue policy scoped by `aws:SourceArn`	Resource policy `Condition`	Any topic/principal writing to your queue	Cross-account confused-deputy writes
Least-privilege consumer role	IAM policy on the queue ARN	Over-broad `sqs:*` blast radius	Accidental purge/delete of queues
KMS key policy grant to SNS + subscriber	Key policy statement	Cross-account decrypt failures (silent drop)	Undebuggable message loss
VPC endpoint for SQS/SNS	Interface endpoint	Traffic over public internet	Egress-filtering false positives
`RedriveAllowPolicy` (`byQueue`)	DLQ attribute	Rogue queues redriving out of a shared DLQ	Accidental cross-source replay

Cost & sizing

What drives the SQS/SNS bill and how to keep it small:

Requests dominate. SQS bills per request (each SendMessage, ReceiveMessage, DeleteMessage, ChangeMessageVisibility is a request); a long poll that returns empty still counts. Batching up to 10 messages per call and long polling are the two biggest levers — batching alone can cut request count ~10×.
Standard vs FIFO pricing. FIFO requests cost more per million than standard. If you don’t need ordering or dedup, standard is both cheaper and higher-throughput.
SNS publishes + deliveries. SNS bills per publish and per delivery to each endpoint; filter policies that drop messages at the topic mean you pay for neither the SQS request nor the consumer compute on filtered messages — server-side filtering is a cost optimization, not just a routing convenience.
KMS is the sleeper cost. SSE-KMS adds a KMS request roughly per data-key generation; raising KmsDataKeyReusePeriodSeconds (default 300 s) amortizes that across more messages. SSE-SQS is free — use it unless you specifically need the CMK.
Free Tier: one million SQS requests/month and one million SNS publishes/month are free — most dev/test and many small prod workloads never leave it.

A rough monthly picture and what each lever changes:

Cost driver	What you pay for	Rough scale	Lever to reduce	Watch-out
SQS standard requests	Per API request (incl. empty receives)	Free to ~₹35–40 per million beyond Free Tier	Batch 10×; long poll	Short polling balloons empty receives
SQS FIFO requests	Per request, FIFO tier	Higher per million than standard	Use standard if no ordering needed	FIFO chosen by default unnecessarily
SNS publishes + deliveries	Per publish + per delivery	Free to a few ₹ per million	Filter policies drop at topic	Unfiltered fan-out multiplies deliveries
KMS (SSE-KMS)	Per data-key request	Small, but per-message if reuse low	Raise data-key reuse period	Forgetting reuse → KMS call per message
Lambda consumer	Invocations + duration	Per consumed batch	Larger batches; right-size memory	Tiny batches = many invokes
ECS consumer	Running task-hours	Per task continuously	Scale on backlog; scale to zero off-peak	Over-provisioned idle tasks

Sizing guidance: start standard, single queue per consumer, long polling on, batch sends/deletes at 10. Move to FIFO only when ordering is a real requirement, and to high-throughput FIFO only when a single group’s 300 msg/s ceiling is the bottleneck. Right-size consumers by backlog, not peak — autoscale ECS tasks (or cap Lambda ESM concurrency) to the measured ApproximateAgeOfOldestMessage target rather than provisioning for the worst hour all month.

Interview & exam questions

1. What is the SNS-to-SQS fan-out pattern and why is it preferable to a producer writing to multiple queues? A producer publishes once to an SNS topic; the topic delivers a copy to each subscribed SQS queue, and every consumer drains its own private queue. It’s preferable because the producer stays decoupled — adding a consumer is one queue + one subscription with no producer change — and a slow consumer fills only its own queue, never back-pressuring the producer or other consumers.

2. What does RawMessageDelivery do and why enable it for SQS subscriptions? It strips the SNS JSON envelope ({"Type":"Notification","Message":"..."}) so the SQS consumer reads the publisher’s body directly instead of double-parsing. Without it, message attributes are nested inside the envelope and every consumer carries envelope-parsing code. Enable it for SQS fan-out unless you specifically need SNS metadata.

3. Explain the difference between standard and FIFO queues. Standard queues offer nearly unlimited throughput with at-least-once delivery and best-effort ordering. FIFO queues guarantee strict ordering within a message group and exactly-once processing within a five-minute dedup window, but cap throughput (300 msg/s, 3,000 batched, higher in high-throughput mode) and require .fifo names. Pick standard unless you can articulate why you need ordering or dedup.

4. How does the visibility timeout cause double-processing, and how do you fix it? When a consumer receives a message, SQS hides it for the visibility timeout and expects a delete before it expires. If processing takes longer than the timeout, the message reappears and a second consumer processes it. Fix by setting the timeout to ~6× p99 processing time and extending it dynamically with ChangeMessageVisibility heartbeats for long jobs — and make the consumer idempotent regardless.

5. What is maxReceiveCount and what exactly does “received” mean? It’s the number of times a message may be received before SQS moves it to the DLQ via the redrive policy. “Received” means delivered to a consumer — not “failed” — so a consumer that receives, crashes, and lets the visibility timeout lapse still increments the count. Typical values are 3–5.

6. Why is an idempotent consumer necessary even with FIFO? FIFO’s exactly-once processing only holds within a five-minute deduplication window; a retry that recurs after that window is a fresh message. Standard queues redeliver freely. The only durable guarantee is exactly-once effect in the consumer — derive a stable business idempotency key and make the side effect conditional (e.g. a DynamoDB attribute_not_exists write).

7. Why should you never use MessageId as an idempotency key? A redelivery of the same message keeps its MessageId, but a re-publish by the producer (a producer-side retry) gets a brand-new MessageId, so the same logical event can arrive with different IDs. Use a stable business key (order ID + version, or a propagated MessageDeduplicationId) that survives both redelivery and re-publish.

8. What goes wrong with a Lambda SQS consumer that doesn’t use ReportBatchItemFailures? If the handler throws, the entire batch returns to the queue and every message’s receive count increments — including the ones that already succeeded. Those get reprocessed, and a single poison message can drive the whole batch (good messages included) to the DLQ. Declaring ReportBatchItemFailures and returning only the failed itemIdentifiers deletes the successes and retries only the failures.

9. A cross-account SNS-to-SQS subscription delivers nothing and logs no error. What did you forget? Almost certainly the KMS key policy: if the topic uses SSE-KMS, the key policy must grant sns.amazonaws.com and the subscribing principal GenerateDataKey/Decrypt, or every message is silently dropped. Cross-account also needs the topic policy to allow the other account to subscribe. Confirm via CloudTrail AccessDenied on KMS.

10. Which metric is the true SLA signal for a queue, and why not queue depth? ApproximateAgeOfOldestMessage. Depth can spike transiently and still be healthy (a burst the consumers will drain); a rising oldest-message age means consumers are genuinely falling behind. Alarm on the age, and page on any message in a DLQ.

11. How do you scale ECS consumers correctly? Drive autoscaling off backlog-per-task — ApproximateNumberOfMessagesVisible ÷ running task count against a target — not CPU. CPU lags and a fast consumer with a deep backlog can read low CPU, so CPU-based scaling under-provisions exactly when you need capacity.

12. What is redrive and how do you do it safely? Redrive (StartMessageMoveTask) moves messages from a DLQ back to the source queue for reprocessing — a first-class feature, not a hand-rolled re-publish. Do it safely by fixing the root cause first (or the messages bounce straight back), rate-limiting the move (--max-number-of-messages-per-second) so you don’t re-flood a recovering downstream, and watching the source backlog as they clear.

These map to AWS Certified Developer – Associate (DVA-C02) — develop with SQS/SNS, messaging patterns, idempotency — and Solutions Architect – Associate (SAA-C03) — decoupling, fan-out, DLQs, resilient architectures. The operational depth (redrive, alarms, backpressure) touches DevOps Engineer – Professional (DOP-C02). A compact cert-mapping for revision:

Question theme	Primary cert	Objective area
Fan-out, filter policies, raw delivery	SAA-C03 / DVA-C02	Decoupled, event-driven design
Standard vs FIFO, ordering, dedup	DVA-C02	Messaging semantics
Visibility timeout, polling, double-process	DVA-C02	Develop reliable consumers
DLQ, `maxReceiveCount`, redrive	DVA-C02 / DOP-C02	Resilience & operations
Idempotency, exactly-once effect	DVA-C02	Application reliability
Cross-account IAM/KMS, SSE	SAA-C03 / SCS-C02	Secure messaging
Metrics, alarms, backlog scaling	DOP-C02	Monitoring & operations

Quick check

A producer publishes once and three consumers each need a different slice of the events. What pattern do you use, and where does the per-consumer filtering belong?
Your consumer processed the same payment twice during a slow downstream. Name the most likely misconfiguration and the metric that confirms it.
True or false: enabling FIFO removes the need for idempotent consumers.
A brand-new SNS-to-SQS subscription shows “succeeded” but the queue is always empty. What’s the first thing you check?
One malformed message in a Lambda batch of ten drove all ten to the DLQ. What single setting fixes this, and what must your handler return?

Answers

SNS-to-SQS fan-out: one topic, one private SQS queue per consumer. Put the per-consumer filtering in SNS subscription filter policies so non-matching messages are dropped at the topic — never delivered, never billed as an SQS request or consumer invocation.
The visibility timeout is shorter than the processing time, so the message reappeared mid-flight and a second worker grabbed it. Confirm with ApproximateReceiveCount > 1 on successfully processed messages. Fix: raise the timeout to ~6× p99 and add ChangeMessageVisibility heartbeats — and make the consumer idempotent.
False. FIFO’s exactly-once processing only holds within a five-minute dedup window; retries that recur later are fresh messages, and standard queues redeliver freely. You still need an idempotent consumer keyed on a business idempotency key.
The queue resource policy — it must allow sns.amazonaws.com to sqs:SendMessage with aws:SourceArn scoped to the topic. aws_sns_topic_subscription does not write it for you; SNS reports delivery while SQS refuses the write. (NumberOfNotificationsFailed will be non-zero.)
Set function_response_types = ["ReportBatchItemFailures"] on the event source mapping, and have the handler return {"batchItemFailures": [{"itemIdentifier": "<messageId>"}]} containing only the failed messages — Lambda deletes the rest.

Glossary

SNS topic — a publish/subscribe fan-out point; one publish is delivered as a copy to every subscriber (SQS, Lambda, HTTP, etc.).
SQS queue — a durable, pull-based message buffer that one consumer group drains by polling; absorbs bursts and decouples producer from consumer.
Fan-out — one SNS topic delivering to N independent SQS queues, one per consumer, so each consumer scales and fails independently.
Filter policy — a server-side rule on an SNS subscription that drops non-matching messages at the topic (default on message attributes; optionally on the body).
RawMessageDelivery — a subscription flag that strips the SNS JSON envelope so the SQS consumer reads the publisher’s body directly.
Standard queue — at-least-once delivery, best-effort ordering, nearly unlimited throughput; the default.
FIFO queue — strict ordering within a message group and exactly-once processing within a five-minute dedup window; name ends .fifo; throughput-capped.
MessageGroupId — the FIFO ordering unit; order is guaranteed within a group and parallel across groups; pick the entity to serialize (e.g. accountId).
MessageDeduplicationId — an explicit dedup key collapsing duplicate sends within five minutes; or ContentBasedDeduplication hashes the body.
Visibility timeout — the lease that hides a received message; the consumer must delete it before the lease expires, or it reappears for another consumer.
ChangeMessageVisibility — extends a message’s lease mid-processing (a heartbeat) for long or variable jobs.
Long polling — ReceiveMessageWaitTimeSeconds (up to 20 s) makes a receive wait for a message instead of returning empty, cutting empty-receive cost and noise.
Poison message — a message a consumer can never process (malformed, deleted reference, unknown schema) that would loop forever without a DLQ.
maxReceiveCount — the number of receives before SQS moves a message to the DLQ, set in the source queue’s redrive policy.
Dead-letter queue (DLQ) — a separate queue that sidelines messages exceeding maxReceiveCount; give it 14-day retention and a matching type.
Redrive / StartMessageMoveTask — the first-class feature that moves DLQ messages back to the source queue after a fix; rate-limit it.
RedriveAllowPolicy — a DLQ attribute (allowAll/byQueue/denyAll) controlling which source queues may redrive out of it.
Idempotency key — a stable business identifier used to make processing the same event twice a no-op (never MessageId).
Event source mapping (ESM) — Lambda’s managed poller that long-polls a queue and invokes in batches; carries ReportBatchItemFailures and a concurrency cap.
ReportBatchItemFailures — the ESM response type that lets a handler return only the failed message IDs, so successes in a batch aren’t reprocessed.
SSE-SQS / SSE-KMS — server-side encryption with an AWS-managed key (free) or a customer-managed KMS key (audit, rotation, cross-account control).

Next steps

You can now design a resilient fan-out, choose delivery semantics deliberately, and make consumers survive duplicates and poison. Build outward:

Foundations: SNS, SQS and EventBridge Messaging Fundamentals — the ground-floor concepts beneath this deep dive.
Related: EventBridge: Event-Driven Architecture with Buses, Schema & Pipes — when routing outgrows topic filter policies and you need a bus, schema registry, and Pipes.
Related: Step Functions: Distributed Orchestration & Error-Handling Patterns — when a workflow needs state, retries with backoff, and saga compensation beyond a single queue.
Related: Lambda Deep Dive: Runtimes, Triggers, Layers & Concurrency — the consumer-side mechanics: ESM, batching, and concurrency.
Related: DynamoDB Deep Dive: Tables, Keys, Capacity, GSIs & Streams — design the idempotency table and use Streams for change-data-capture fan-out.
Related: Event-Driven Order Processing with the Saga Pattern — a full worked architecture that combines fan-out, FIFO ordering, and compensation.
Related: CloudWatch & CloudTrail Observability Deep Dive — dashboards, composite alarms, and the audit trail behind the metrics in this article.