AWS Messaging

Resilient Messaging with SQS and SNS: Fan-Out, FIFO Ordering, DLQs, and Poison-Message Handling

SNS and SQS are the two primitives I reach for before anything fancier, and the failure mode I see most often is teams wiring a producer straight to a single queue, then bolting on a second consumer by having the first one re-publish. That is point-to-point coupling wearing a queue costume. The decoupled version is boring on purpose: a producer publishes a fact to an SNS topic and walks away, and each consumer owns a private SQS queue subscribed to that topic. Add a consumer six months later by creating one queue and one subscription. Nobody redeploys the producer.

That topology is the easy 80%. The hard 20% is what happens when a consumer is slow, a message is malformed, the same event arrives twice, or strict ordering actually matters. This article is the operational playbook for that 20%: SNS-to-SQS fan-out with server-side filtering, the real tradeoffs between standard and FIFO, visibility-timeout and polling tuning, dead-letter queues with redrive, idempotent consumers, and the Lambda/ECS plumbing that makes partial failures survivable.

1. Fan-out topology: SNS to multiple SQS queues with filter policies

The pattern is one topic, N queues. Each subscriber gets its own queue so a slow or broken consumer cannot back-pressure the others, and each queue gets its own DLQ and its own visibility timeout. Critically, you do not want every consumer to receive every message — most consumers care about a slice. Push that filtering into SNS with a subscription filter policy so unwanted messages are dropped at the topic, never billed as SQS requests, and never even delivered.

resource "aws_sns_topic" "orders" {
  name = "orders-events"
}

resource "aws_sqs_queue" "fraud" {
  name                       = "orders-fraud"
  visibility_timeout_seconds = 120
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.fraud_dlq.arn
    maxReceiveCount     = 5
  })
}

resource "aws_sqs_queue" "fraud_dlq" {
  name                      = "orders-fraud-dlq"
  message_retention_seconds = 1209600 # 14 days, the max
}

resource "aws_sns_topic_subscription" "fraud" {
  topic_arn = aws_sns_topic.orders.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.fraud.arn

  # Match on a message attribute, not the body, by default.
  filter_policy = jsonencode({
    eventType = ["OrderPlaced", "OrderUpdated"]
    amount    = [{ numeric = [">=", 500] }]
  })

  # Critical for fan-out: deliver the raw publisher payload, not the
  # SNS envelope, so the SQS consumer sees exactly what was published.
  raw_message_delivery = true
}

Two flags decide whether this works in practice:

Always attach a redrive policy on the subscription itself so SNS deliveries that fail (for example, the SQS queue policy is misconfigured) land in an SNS-level DLQ instead of vanishing. That is separate from the SQS DLQ in step 4.

resource "aws_sns_topic_subscription" "fraud_with_dlq" {
  topic_arn            = aws_sns_topic.orders.arn
  protocol             = "sqs"
  endpoint             = aws_sqs_queue.fraud.arn
  raw_message_delivery = true

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.sns_delivery_dlq.arn
  })
}

2. Standard vs FIFO: ordering, message groups, and deduplication IDs

Pick standard unless you can articulate why you need FIFO, because FIFO buys ordering and dedup at the cost of throughput and operational complexity.

Property Standard FIFO
Ordering Best-effort Strict, per message group
Delivery At-least-once Exactly-once processing (within the dedup window)
Throughput Nearly unlimited 300 msg/s, or 3,000 with batching; high-throughput mode raises this substantially
Dedup None 5-minute dedup interval, by ID or content hash
Name suffix any must end in .fifo

The mental model for FIFO ordering is the message group ID. Order is guaranteed within a group, never across groups. Choose the group ID as the entity whose events must be serialized — orderId, accountId, tenantId — and you get per-entity ordering with parallelism between entities. Use a single constant group ID and you have serialized the entire queue down to one in-flight message at a time, which is almost never what you want.

Deduplication has two modes:

  1. Content-based dedup (ContentBasedDeduplication = true): SQS hashes the message body and rejects a duplicate body within a 5-minute window. Good when the body is naturally unique.
  2. Explicit MessageDeduplicationId: you supply the key, typically a business idempotency key like the event ID. This is the robust choice because identical retries of the same logical event collapse even if the body differs slightly (a re-serialized timestamp, say).
resource "aws_sqs_queue" "payments" {
  name                        = "payments.fifo"
  fifo_queue                  = true
  content_based_deduplication = false # supply our own IDs
  deduplication_scope         = "messageGroup"
  fifo_throughput_limit       = "perMessageGroupId" # high-throughput mode
  visibility_timeout_seconds  = 60
}
aws sqs send-message \
  --queue-url "$PAYMENTS_FIFO_URL" \
  --message-body "$(cat payment.json)" \
  --message-group-id "acct-4471" \
  --message-deduplication-id "evt-7f3a9c-2026-06-08"

The 5-minute dedup window is a fixed property of SQS FIFO and you cannot extend it. If your retries can recur after five minutes, FIFO dedup will not save you — you still need idempotent consumers (step 5). FIFO reduces duplicate volume; it does not eliminate the need to handle them.

For SNS fan-out into FIFO queues, the SNS topic must also be FIFO, the publisher supplies MessageGroupId (and a dedup ID or content-based dedup on the topic), and every subscribed queue must be FIFO. You cannot fan a standard topic into a FIFO queue.

3. Visibility timeout, long polling, and throughput tuning

When a consumer receives a message, SQS hides it for the visibility timeout rather than deleting it. The consumer must delete it before the timeout expires; otherwise the message reappears and another consumer picks it up. Two failure modes follow directly:

Set the visibility timeout to roughly 6x your p99 processing time as a starting point, then for long or variable work extend it dynamically with ChangeMessageVisibility (a heartbeat) instead of provisioning a huge static value.

# Heartbeat: extend the lease while a long job is still running.
aws sqs change-message-visibility \
  --queue-url "$QUEUE_URL" \
  --receipt-handle "$RECEIPT_HANDLE" \
  --visibility-timeout 120

Always use long polling. Set ReceiveMessageWaitTimeSeconds = 20 on the queue (or pass --wait-time-seconds 20 per receive). Long polling waits for messages to arrive instead of returning immediately empty, which collapses empty-receive API calls — that is both a cost and a noise reduction. Short polling (the default of 0) is almost always a misconfiguration.

resource "aws_sqs_queue" "fraud" {
  name                          = "orders-fraud"
  receive_wait_time_seconds     = 20    # long polling
  visibility_timeout_seconds    = 120
  message_retention_seconds     = 345600 # 4 days; max is 14 days
}

For throughput, batch aggressively: SendMessageBatch and DeleteMessageBatch move up to 10 messages per call, cutting request count and cost by up to 10x. A standard queue scales horizontally on its own — you scale consumers, not the queue. Tune the maximum messages per ReceiveMessage (up to 10) and run more consumer threads/tasks to raise throughput.

4. Dead-letter queues: maxReceiveCount, redrive policy, and replay

A poison message is one a consumer can never process — malformed JSON, a reference to a deleted row, a schema your code does not understand. Without a DLQ it loops forever: received, fails, visibility timeout expires, received again, until it ages out 14 days later, having burned compute the whole time and blocking head-of-line progress on a FIFO queue.

The DLQ is governed by the redrive policy on the source queue: maxReceiveCount is the number of times a message may be received before SQS moves it to the DLQ. “Received” means delivered to a consumer, not “failed” — so a consumer that receives, crashes, and lets the visibility timeout lapse still increments the count. Set maxReceiveCount to give transient errors room to recover (typically 3-5), but not so high that a true poison message wastes dozens of retries.

resource "aws_sqs_queue" "orders" {
  name = "orders-work"
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.orders_dlq.arn
    maxReceiveCount     = 5
  })
}

resource "aws_sqs_queue" "orders_dlq" {
  name                      = "orders-work-dlq"
  message_retention_seconds = 1209600 # 14 days — keep DLQ retention long

  # Allow-list which source queues may redrive back out of this DLQ.
  redrive_allow_policy = jsonencode({
    redrivePermission = "byQueue"
    sourceQueueArns   = [aws_sqs_queue.orders.arn]
  })
}

Two rules that save you later: give the DLQ the maximum 14-day retention (a DLQ message you discover on Monday may be 4 days old), and give the DLQ a queue type that matches the source — a FIFO source needs a FIFO DLQ.

Once you have triaged and fixed the root cause, redrive moves messages back to the source queue (or to a custom destination) for reprocessing. This is a first-class feature; do not hand-roll a re-publish Lambda.

# Kick off a redrive from the DLQ back to its source queue.
aws sqs start-message-move-task \
  --source-arn "arn:aws:sqs:us-east-1:111122223333:orders-work-dlq" \
  --max-number-of-messages-per-second 50

# Watch progress.
aws sqs list-message-move-tasks \
  --source-arn "arn:aws:sqs:us-east-1:111122223333:orders-work-dlq"

Rate-limit the move so you do not re-flood a downstream that is still recovering. And never redrive blindly — if the bug is not fixed, the same messages bounce straight back to the DLQ.

5. Idempotency and exactly-once-effect processing

At-least-once delivery is a guarantee, not an edge case. Standard queues redeliver on their own, visibility-timeout expiry causes redelivery, and even FIFO only dedups within five minutes. The only durable answer is an idempotent consumer: processing the same message twice produces the same effect as processing it once.

Derive a stable idempotency key from the message — a business event ID, or the SQS MessageDeduplicationId, never the MessageId (a redelivery keeps the same MessageId, but a re-publish by the producer gets a new one, so MessageId alone is unreliable across producer retries). Then make the side effect conditional on that key. Two patterns I use:

A. Conditional write on the work itself. If the side effect is a DynamoDB write, use a condition expression so the second attempt is a no-op rather than a separate ledger entry.

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("payments")

def process(event_id: str, amount: int):
    try:
        table.put_item(
            Item={"pk": event_id, "amount": amount, "status": "POSTED"},
            ConditionExpression="attribute_not_exists(pk)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # already processed; safe to ack and delete
        raise

B. A dedicated idempotency table when the side effect is an external call (charge a card, send an email) that you cannot make conditional. Record the key with a TTL before the call, short-circuit if it already exists, and design the external call to be retried safely if you crash mid-flight. The AWS Lambda Powertools idempotency utility implements exactly this with a DynamoDB backing table and is worth adopting rather than reinventing.

Exactly-once delivery is largely a myth in distributed messaging. What you can actually engineer is exactly-once effect, and it lives in the consumer, not the queue. Build it there and duplicate deliveries become a non-event.

6. Lambda and ECS consumers: batching, partial-batch failures, and backpressure

Lambda consumes via an event source mapping (ESM) that long-polls the queue for you and invokes in batches. The trap is the default failure behavior: if your handler throws, the entire batch returns to the queue and every message’s receive count increments — including the ones that already succeeded. They get reprocessed, and good messages can be driven to the DLQ by a single poison sibling.

Fix this with ReportBatchItemFailures. Declare the response type on the ESM and return only the IDs that failed; Lambda deletes the rest.

resource "aws_lambda_event_source_mapping" "orders" {
  event_source_arn                   = aws_sqs_queue.orders.arn
  function_name                      = aws_lambda_function.consumer.arn
  batch_size                         = 10
  maximum_batching_window_in_seconds = 5
  function_response_types            = ["ReportBatchItemFailures"]

  scaling_config {
    maximum_concurrency = 20 # cap concurrent invocations from this queue
  }
}
def handler(event, _ctx):
    failures = []
    for record in event["Records"]:
        try:
            process_record(record)
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

Set maximum_concurrency on the ESM (minimum 2, up to 1000) to bound how hard the queue hammers a fragile downstream — backpressure for free, without throttling the whole function’s reserved concurrency. For FIFO sources, Lambda processes one batch per message group at a time, preserving order, and on failure it pauses that group until the batch clears so ordering holds.

ECS / EC2 consumers poll the queue themselves in a loop. The clean way to scale them is to drive autoscaling off backlog per task, not raw CPU. Compute a target backlog-per-task from your acceptable latency and publish it as a custom metric, or use the standard approach of a target-tracking policy on ApproximateNumberOfMessagesVisible divided by running task count.

# Typical worker loop: long poll, process, delete in a batch.
while True:
    resp = sqs.receive_message(
        QueueUrl=url, MaxNumberOfMessages=10, WaitTimeSeconds=20,
        AttributeNames=["ApproximateReceiveCount"],
    )
    msgs = resp.get("Messages", [])
    entries = []
    for m in msgs:
        if process(m):  # idempotent; returns True on success
            entries.append({"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]})
    if entries:
        sqs.delete_message_batch(QueueUrl=url, Entries=entries)

Delete only what you successfully processed; let the rest time out and redeliver (and eventually DLQ). Never delete-then-process.

7. Encryption, access policies, and cross-account sharing

Enable server-side encryption on every queue and topic. SQS-managed keys (SSE-SQS) are free and require zero key management; use a customer-managed KMS key (SSE-KMS) when you need an audit trail, key rotation control, or cross-account key policies.

resource "aws_sqs_queue" "orders" {
  name              = "orders-work"
  kms_master_key_id = aws_kms_key.messaging.id
  # Cache the data key to cut KMS calls (and cost) under load.
  kms_data_key_reuse_period_seconds = 300
}

For SNS-to-SQS to work at all, the queue’s resource policy must allow sns.amazonaws.com to sqs:SendMessage, scoped to the topic ARN with aws:SourceArn. Terraform’s aws_sns_topic_subscription does not write this for you; add it explicitly.

data "aws_iam_policy_document" "allow_sns" {
  statement {
    sid     = "AllowSNSDelivery"
    effect  = "Allow"
    actions = ["sqs:SendMessage"]
    principals {
      type        = "Service"
      identifiers = ["sns.amazonaws.com"]
    }
    resources = [aws_sqs_queue.orders.arn]
    condition {
      test     = "ArnEquals"
      variable = "aws:SourceArn"
      values   = [aws_sns_topic.orders.arn]
    }
  }
}

resource "aws_sqs_queue_policy" "orders" {
  queue_url = aws_sqs_queue.orders.id
  policy    = data.aws_iam_policy_document.allow_sns.json
}

Cross-account fan-out (topic in account A, queue in account B) adds two requirements beyond the above: the topic policy in A must allow B’s queue to subscribe (sns:Subscribe / sns:Receive), and if the topic uses SSE-KMS, the KMS key policy must let sns.amazonaws.com and the subscribing principals use the key — a step that silently drops every cross-account message when missed. Always scope these with aws:SourceArn or aws:SourceAccount; never leave Principal: "*" unconditioned on a topic or queue.

8. Monitoring: queue depth, age of oldest message, and DLQ alarms

The metric that actually tells you whether consumers are keeping up is ApproximateAgeOfOldestMessage, not queue depth. Depth can be high transiently and still healthy; a rising oldest-message age means consumers are falling behind and is your true SLA signal. Alarm on it.

The non-negotiable alarm is any message in a DLQ. A message in the DLQ is, by definition, something your system could not handle — page on it.

resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  alarm_name          = "orders-dlq-not-empty"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  dimensions          = { QueueName = aws_sqs_queue.orders_dlq.name }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.oncall.arn]
}

resource "aws_cloudwatch_metric_alarm" "backlog_age" {
  alarm_name          = "orders-work-backlog-age"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  dimensions          = { QueueName = aws_sqs_queue.orders.name }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 300 # seconds; tune to your latency SLO
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.oncall.arn]
}

Round out the dashboard with NumberOfMessagesSent / NumberOfMessagesReceived (throughput), NumberOfEmptyReceives (long-polling health), and on SNS NumberOfNotificationsFailedToRedriveToDlq (deliveries you are actually losing).

Enterprise scenario

A payments platform team ran a single standard SQS queue behind their card-authorization service. Under a Black Friday spike, the downstream processor slowed, processing time crept past their 30-second visibility timeout, and messages began reappearing mid-flight. A second consumer picked up each redelivered message and double-charged a subset of customers — a few hundred duplicate authorizations before they killed the consumers.

The constraint: they could not simply make the queue FIFO, because per-customer ordering mattered but a single global FIFO group would have collapsed throughput below their peak rate, and the 5-minute dedup window did not cover retries that recurred minutes later under sustained load.

The fix was layered, and the lesson was that no single setting would have saved them:

  1. FIFO with per-account message groups (MessageGroupId = accountId) gave strict ordering per customer while keeping parallelism across customers, plus high-throughput mode (fifo_throughput_limit = "perMessageGroupId") to clear the peak.
  2. An idempotency table keyed on the gateway request ID made the charge itself a no-op on replay — the real fix, since FIFO dedup alone was insufficient past five minutes.
  3. Visibility-timeout heartbeats (ChangeMessageVisibility every 20 seconds while a charge was in flight) stopped redelivery during legitimately slow downstream calls.
# The charge became conditional on a recorded idempotency key.
def authorize(req_id: str, account_id: str, amount: int):
    try:
        ddb.put_item(
            TableName="auth-idempotency",
            Item={"req_id": {"S": req_id}, "ttl": {"N": str(ttl_24h())}},
            ConditionExpression="attribute_not_exists(req_id)",
        )
    except ddb.exceptions.ConditionalCheckFailedException:
        return load_prior_result(req_id)  # already authorized; return same result
    return gateway.charge(account_id, amount)

Post-change, duplicate authorizations went to zero across the next peak, and the DLQ alarm — which had not existed before — caught two genuinely malformed messages that previously would have looped silently for days.

Verify

Run these to confirm the topology and failure handling actually behave:

# 1. Filter policy: publish a non-matching event; the queue should stay empty.
aws sns publish --topic-arn "$TOPIC_ARN" \
  --message '{"id":"t1"}' \
  --message-attributes '{"eventType":{"DataType":"String","StringValue":"OrderCancelled"}}'
aws sqs get-queue-attributes --queue-url "$FRAUD_URL" \
  --attribute-names ApproximateNumberOfMessages   # expect 0

# 2. Raw delivery: publish a matching event and confirm no SNS envelope.
aws sqs receive-message --queue-url "$FRAUD_URL" --wait-time-seconds 5 \
  --query 'Messages[0].Body'   # expect the raw payload, not {"Type":"Notification",...}

# 3. DLQ path: send a known-poison message and let it exceed maxReceiveCount.
#    After maxReceiveCount receives, it should appear in the DLQ.
aws sqs get-queue-attributes --queue-url "$DLQ_URL" \
  --attribute-names ApproximateNumberOfMessages   # expect >= 1

# 4. Redrive after a fix.
aws sqs start-message-move-task --source-arn "$DLQ_ARN"

# 5. FIFO ordering within a group: send three messages, same group,
#    distinct dedup IDs, then confirm receive order matches send order.
for i in 1 2 3; do
  aws sqs send-message --queue-url "$FIFO_URL" \
    --message-body "msg-$i" --message-group-id g1 \
    --message-deduplication-id "d-$i"
done

Checklist

awssqssnsmessagingfifo

Comments

Keep Reading