AWS Lambda Patterns: Event-Driven Functions That Scale to Zero

A startup processed user-uploaded documents on a long-running EC2 instance that sat idle 80% of the day. They moved the pipeline to AWS Lambda — functions triggered by S3 uploads, fanning work out through SQS — and the bill fell 80% while end-to-end latency dropped from minutes to seconds. The catch was not the migration; it was the redesign. A 40-minute batch job had to become a graph of small, idempotent functions because Lambda kills any single invocation at 15 minutes, retries async events on its own schedule, and can deliver the same event twice. The team that wins with Lambda is the team that internalises those three facts before writing a line of handler code.

This is the reference for getting event-driven Lambda right. Lambda runs your code in response to an event, gives it up to 15 minutes and up to 10 GB of memory with proportional CPU, and bills you per millisecond of execution, scaling from zero to thousands of concurrent environments without a server in sight. But “Lambda” is really four different services wearing one name, depending on how it was invoked: a synchronous request/response API, an asynchronous fire-and-forget queue, a poll-based event-source mapping for streams, and a poll-based event-source mapping for queues, each paired with destinations or DLQs for the outcomes. Each model has its own retry behaviour, its own error surface, its own ordering and concurrency rules, and its own way of losing your data when you get it wrong. Treat them as one thing and you will ship a pipeline that drops events under load and double-charges customers on retry.

By the end you will stop guessing which model you are in. When an event “disappears,” you will know whether it died in an async retry with no DLQ, was swallowed by a poller that advanced the stream iterator past a poison record, or never arrived because an EventBridge rule pattern was one field too strict. You will know the exact limits (15-minute timeout, 1,000 default concurrency, 256 KB async payload, 6 MB sync payload, up to 10,000 SQS messages per batch), the precise aws command to confirm each failure, and the Terraform to wire the fix. Because this is a reference you reach for mid-incident, the invocation models, the trigger contracts, the limits, the error codes and the failure playbook are all laid out as scannable tables: read the prose once, keep the tables open when the pipeline is on fire.

What problem this solves

Traditional servers are paid for whether they are busy or not, and they make you responsible for scaling, patching and capacity planning around traffic you cannot predict. An event-driven Lambda architecture removes the server: you write the function that reacts to “a file landed,” “a message arrived,” “a row changed,” “an order was placed,” and AWS runs exactly as many copies as the event rate demands, billing only for the milliseconds they execute. For spiky, intermittent, event-shaped workloads — image processing, ETL steps, webhooks, stream consumers, scheduled jobs, glue between services — this is unbeatable on both cost and operational burden.

What breaks when teams adopt the service without the patterns: they lift a monolith into one giant function and hit the 15-minute wall; they invoke Lambda synchronously from an API and watch p99 latency spike on cold starts; they trigger directly off a high-volume source with no queue and get throttled into a retry storm; they assume “exactly once” and get duplicate side effects because async and stream invocations are at-least-once. The failure mode is rarely a crash. It is silent: an event that retried into the void because no dead-letter queue was attached, a stream consumer stuck for hours because one poison record blocks the shard, a customer billed twice because the function was not idempotent.

Who hits this: anyone building on serverless past “hello world.” It bites hardest on teams new to the at-least-once delivery contract (idempotency is not optional), on high-throughput stream and queue consumers (batching, concurrency and DLQs are load-bearing), on latency-sensitive synchronous APIs (cold starts and provisioned concurrency), and on anyone fanning one event out to many consumers (SNS vs EventBridge vs SQS is an architecture decision, not a coin flip). The fix is almost never “more memory” — it is “pick the right invocation model, attach the right failure destination, and make the handler idempotent.”

To frame the whole field before the deep dive, here is every event pattern this article covers, the AWS primitive that powers it, and the one trap that defines it:

Pattern	Powered by	What it’s for	The defining trap
Synchronous invoke	API Gateway / SDK / ALB	Request/response APIs, low-latency reads	Cold start in the user’s p99; 6 MB payload cap
Asynchronous invoke	S3, SNS, EventBridge	Fire-and-forget reactions	At-least-once + retries to nowhere without a DLQ
Stream poller	Kinesis, DynamoDB Streams	Ordered change processing	One poison record blocks the whole shard
Queue poller	SQS (standard / FIFO)	Buffered, decoupled work	Visibility timeout < 6× function timeout = duplicates
Fan-out	SNS / EventBridge	One event → many consumers	Picking the wrong broker (filtering, replay, ordering)
Choreography	EventBridge bus + rules	Loosely-coupled service flows	A rule pattern too broad double-delivers
Orchestration	Step Functions	Stateful, long, branching flows	Using Lambda chaining where a state machine belongs

Learning objectives

By the end of this article you can:

Identify which of Lambda’s four invocation models (sync, async, stream poll, queue poll) any given trigger uses, and predict its retry, ordering, batching and error behaviour from that alone.
Wire each major event source — S3, SQS, SNS, Kinesis Data Streams, DynamoDB Streams, EventBridge, API Gateway — with the correct event-source-mapping or notification config, in both aws CLI and Terraform.
Design fan-out correctly: choose SNS vs EventBridge vs SQS by filtering, replay, ordering and consumer-count needs, and combine them (SNS→SQS fan-out, one queue per consumer) where it fits.
Guarantee correctness under at-least-once delivery: build idempotent handlers, attach the right DLQ / on-failure destination, and stop poison records from blocking a stream with bisect-on-error and maxRecordAge.
Tune concurrency deliberately — reserved vs provisioned, account limits, burst rates — and protect downstream databases from a Lambda scale-out stampede.
Control cold starts: what causes them, what provisioned concurrency and SnapStart fix, package and memory levers, and when the latency actually matters.
Read the Lambda error-code and limit reference and run a symptom→cause→confirm→fix playbook for the failure modes that actually page you: throttles, dropped async events, stuck shards, duplicate processing and DLQ fill.
Size and cost a serverless pipeline — the GB-second model, when Lambda is cheaper than a container, and where it stops being cheaper.

Prerequisites & where this fits

You should already understand the AWS basics: an IAM role (Lambda assumes an execution role for its permissions), CloudWatch Logs (where every invocation’s logs land), and the core event services at a “what they are” level — S3 buckets, SQS queues, SNS topics, EventBridge buses, and Kinesis/DynamoDB Streams. You should be able to run aws from a shell, read JSON output, and read a basic Terraform resource block. Familiarity with HTTP status codes and the idea of “retry” and “idempotency” helps.

This sits in the Serverless & Event-Driven track. It assumes the compute-model fundamentals — the Compute on AWS: EC2 vs Lambda vs ECS vs EKS decision is upstream of it, and the ECS, EKS & Fargate: Choosing Your Container Path comparison tells you when a long-running container beats a function. It pairs tightly with the front-door choices in ALB vs NLB vs API Gateway, Compared, because API Gateway is the most common synchronous trigger, and with DynamoDB, RDS & Aurora, Compared since DynamoDB is the natural state store for a stateless function (and its streams are a first-class trigger).

A quick map of who owns what when an event-driven pipeline misbehaves, so you look in the right place fast:

Layer	What lives here	Failure classes it causes	First place to look
Event source (S3/SQS/EventBridge…)	The producer + delivery contract	Event never arrived; wrong/too-broad routing	Source metrics (e.g. `NumberOfMessagesSent`)
Event-source mapping	The poller config (batch, concurrency)	Stuck shard, throttling, batch too big	`aws lambda get-event-source-mapping`
Function (your code + role)	Handler logic, permissions	Timeout, OOM, unhandled exception, AccessDenied	CloudWatch Logs + X-Ray
Concurrency / scaling	Account + reserved limits	`429 TooManyRequests`, throttles	`ConcurrentExecutions`, `Throttles` metrics
Failure destination	DLQ / on-failure target	Silent data loss on retry exhaustion	DLQ depth (`ApproximateNumberOfMessages`)
Downstream (DB/API)	Where the function writes	Connection exhaustion under scale-out	Downstream connection/throttle metrics

Core concepts

Six mental models make every later decision obvious.

“Lambda” is four services, chosen by how it was invoked. This is the single most important idea in this article. A synchronous invocation (API Gateway, ALB, an SDK Invoke with RequestResponse) blocks the caller and returns the result; the caller owns retries, Lambda does not retry. An asynchronous invocation (S3, SNS, EventBridge, an SDK Invoke with Event) drops the event onto an internal queue, returns 202 immediately, and Lambda retries on failure (twice by default) before sending it to a DLQ or on-failure destination, or dropping it. A poll-based event-source mapping (Kinesis, DynamoDB Streams, SQS, Kafka, MQ) means Lambda polls the source for you and invokes your function with a batch. It comes in two flavours, stream poll and queue poll, which differ in retry, ordering and checkpointing; that is why this article counts them as the third and fourth models. Knowing which of the four you are in tells you the retry count, the error surface, the ordering guarantee and where data goes when it fails, before you read another word.

Your function is stateless and ephemeral; the execution environment is reused but not guaranteed. Lambda creates an execution environment (a micro-VM), runs your init code once, then runs the handler per event. AWS may reuse a warm environment for the next event (fast — init already paid) or spin up a new one (a cold start — init runs again). You get no guarantee about reuse, so anything you need across invocations lives in an external store (DynamoDB, S3, ElastiCache) — but you can exploit reuse by initialising expensive things (SDK clients, DB pools, parsed config) in init code, outside the handler, so warm invocations skip them.
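
To make the init-versus-handler split concrete, here is a minimal Python sketch. `load_config` and `INIT_COUNT` are hypothetical stand-ins for whatever expensive setup your function does (an SDK client, a DB pool); they are not part of any AWS API.

```python
import time

INIT_COUNT = 0  # counts how many times the expensive setup actually ran


def load_config():
    """Hypothetical stand-in for expensive setup (SDK client, DB pool, parsed config)."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"table": "orders", "loaded_at": time.time()}


# Module scope: runs once per execution environment, during init (the cold start).
CONFIG = load_config()


def handler(event, context):
    # Runs on every invocation; a warm environment reuses CONFIG without reloading it.
    return {"table": CONFIG["table"], "order_id": event["orderId"]}
```

Calling `handler` twice in the same process leaves `INIT_COUNT` at 1, which is the behaviour a warm environment gives you. Anything that must survive a new environment still belongs in an external store.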

Delivery is at-least-once; design for duplicates or be wrong. Async invocations and stream/queue pollers can deliver the same event more than once (a retry after a partial success, a re-drive, a poller redelivery). Only synchronous invocation is “exactly as many times as the caller called.” If processing an event has a side effect — charging a card, incrementing a counter, sending an email — and you do not make it idempotent (safe to run twice with the same result), at-least-once delivery will eventually double it. Idempotency is not a nice-to-have; it is the price of admission to event-driven Lambda.
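
A minimal sketch of an idempotent handler, assuming the event carries a stable business identifier (`orderId` here). `IdempotencyStore` is a hypothetical in-memory stand-in for a durable store with a conditional write; in a real function the claim has to live outside the execution environment.

```python
class IdempotencyStore:
    """In-memory stand-in for a durable store that supports a conditional put."""

    def __init__(self):
        self._seen = set()

    def claim(self, key):
        """Return True only for the first caller to claim this key."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


STORE = IdempotencyStore()
CHARGES = []  # records side effects so duplicates would be visible


def charge_card(order_id, amount):
    """Hypothetical side effect that must not run twice for the same order."""
    CHARGES.append((order_id, amount))


def handler(event, context):
    # Key on a stable identifier carried in the event itself.
    key = f"charge:{event['orderId']}"
    if not STORE.claim(key):
        return {"status": "duplicate_ignored"}
    charge_card(event["orderId"], event["amount"])
    return {"status": "charged"}
```

Delivering the same event twice produces one entry in `CHARGES`. The sketch leaves out one thing a production version needs: if the side effect fails after the claim succeeds, the claim must be released or marked failed so a retry can proceed.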

Concurrency is finite, shared, and the real scaling unit. Lambda scales by running more concurrent execution environments — one per simultaneous event. Your account has a default 1,000 concurrent executions across all functions in a region (raisable via quota request). One runaway function can starve every other function in the account. Reserved concurrency caps and guarantees a function’s slice; provisioned concurrency pre-warms a number of environments to kill cold starts. A burst that exceeds your available concurrency gets throttled (429 TooManyRequests), and what happens next depends on the invocation model — the sync caller sees the 429, the async/stream path retries.

Cold start is latency, not an error. A new environment must initialise — download your code/layers, start the runtime, run your init code (SDK clients, DB connect, parsed config). That is the cold start, typically 100 ms–1 s+ depending on runtime, package size and VPC attachment. It is not a 5xx unless it blows a caller’s timeout; it is a slow first request on a fresh environment, fixed by keeping environments warm (provisioned concurrency) or making init cheap (smaller package, SnapStart, fewer/lighter dependencies).

The failure destination is where your data goes when the handler can’t. Every async and poll-based path needs an explicit answer to “what happens to an event the function repeatedly fails to process?” For async invokes that is a dead-letter queue (legacy) or an on-failure destination (preferred — SQS, SNS, EventBridge, or another Lambda, with richer metadata). For SQS event sources it is the queue’s own redrive policy → DLQ. For stream sources it is an on-failure destination plus bisect/age controls. Leave it unset and “failed” means “silently gone” — the single most common way to lose production events.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters
Function	Code + runtime + config, the unit of deploy	Lambda service	The thing that runs and scales
Handler	The entry point AWS calls per event	Your code	Runs once per event (warm or cold)
Init code	Code outside the handler, run once per env	Your code (module scope)	Where you cache clients/pools to beat cold starts
Execution environment	The micro-VM that runs your code	Lambda-managed	Reused (warm) or new (cold start)
Invocation model	Sync / async / poll-based	Determined by trigger	Sets retry, ordering, error surface
Event-source mapping	Lambda’s poller for a stream/queue	Lambda + source	Batching, concurrency, checkpointing
Concurrency	Simultaneous environments running	Account + per-function	The scaling unit; throttles when exceeded
Reserved concurrency	A function’s guaranteed/capped slice	Per function	Protect others / protect downstream
Provisioned concurrency	Pre-warmed environments	Per function/version	Kills cold starts for latency-critical paths
DLQ / on-failure dest	Where exhausted events go	Per function / per queue	The difference between “logged” and “lost”
Idempotency	Safe to process the same event twice	Your handler design	Required under at-least-once delivery
Cold start	First-request latency on a fresh env	Environment lifecycle	Slow first call; can trip caller timeouts

The four invocation models, end to end

This is the depth anchor: get this table and the four sub-sections right and 80% of event-driven Lambda bugs become obvious. The four models differ on who retries, how many times, in what order, and where data goes when it fails.

Property	Synchronous	Asynchronous	Stream poll (Kinesis/DDB)	Queue poll (SQS)
Triggers	API GW, ALB, SDK `RequestResponse`	S3, SNS, EventBridge, SDK `Event`	Kinesis, DynamoDB Streams	SQS standard / FIFO
Who retries	The caller	Lambda (async queue)	Lambda (poller, in place)	Lambda (poller, via visibility)
Default retries	0 (caller’s job)	2 (configurable 0–2)	until success or `maxRecordAge`/`retryAttempts`	until success or moved to DLQ
Ordering	N/A (one call)	None	Per shard / partition key	None (FIFO: per message group)
Batching	One event	One event	Batch per shard	Batch (≤10k / 6 MB)
Delivery	Exactly as called	At-least-once	At-least-once	At-least-once
Payload limit	6 MB req / 6 MB resp	256 KB	6 MB batch	6 MB batch
On exhaustion	Error returned to caller	DLQ / on-failure dest or dropped	On-failure dest or blocks shard	Queue redrive → DLQ
Throttle behaviour	Caller gets 429	Retried with backoff (up to ~6 h)	Poller backs off, shard waits	Poller backs off, messages stay

Synchronous invocation

The caller sends a request and blocks for the response. API Gateway, ALB, Cognito triggers, and any SDK Invoke with InvocationType=RequestResponse are synchronous. Lambda does not retry — if the function errors or times out, the error goes straight back to the caller, who decides whether to retry. This is the model for request/response APIs where latency matters and the user is waiting, which is exactly why cold starts and the 6 MB payload cap bite here.

# Synchronous invoke — the CLI blocks until the function returns
aws lambda invoke --function-name order-api \
  --invocation-type RequestResponse \
  --payload '{"orderId":"o-123"}' --cli-binary-format raw-in-base64-out \
  response.json

The synchronous-specific limits and behaviours, because they shape your API design:

Aspect	Value / behaviour	Why it matters
Request payload	6 MB	Large uploads must go via S3 + presigned URL, not the body
Response payload	6 MB (buffered) / 20 MB (streamed)	Use response streaming for large responses on supported runtimes
Retries by Lambda	None	The caller (API GW, your SDK) owns retry + backoff
Timeout visibility	Caller sees the timeout/error	Set function timeout < API GW’s 29 s integration timeout
Cold start in path	Yes — in the user’s latency	Provisioned concurrency for latency SLOs
Concurrency throttle	`429` to the caller	API GW returns 502/429; client must handle it
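
For the synchronous path behind API Gateway, a sketch of a handler that returns the shape a Lambda proxy integration expects. The `orderId` field and the validation rules are illustrative assumptions, not part of the article's pipeline.

```python
import json


def _response(status_code, body):
    # statusCode must be an integer and body must be a string.
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }


def handler(event, context):
    # With proxy integration the request body arrives as a string (or None).
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return _response(400, {"error": "body must be valid JSON"})

    order_id = payload.get("orderId") if isinstance(payload, dict) else None
    if not order_id:
        return _response(400, {"error": "orderId is required"})
    return _response(200, {"orderId": order_id, "status": "accepted"})
```

Returning a client error as a normal response, rather than raising, keeps validation failures out of the function's error metrics and gives the caller a usable status code.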

Asynchronous invocation

The caller (S3 event, SNS, EventBridge, or InvocationType=Event) hands the event to Lambda’s internal async queue and gets an immediate 202 Accepted — it does not wait for processing. Lambda then invokes your function and, on failure, retries twice by default with backoff, over a window up to 6 hours. If all attempts fail, the event goes to your configured on-failure destination (or legacy DLQ) — and if none is configured, it is dropped. This is the model where events silently disappear.

# Async invoke — returns 202 immediately, processing happens later
aws lambda invoke --function-name thumbnail-generator \
  --invocation-type Event \
  --payload '{"bucket":"uploads","key":"a.png"}' --cli-binary-format raw-in-base64-out \
  /dev/stdout

# Configure retries + an on-failure destination (the critical bit)
aws lambda put-function-event-invoke-config --function-name thumbnail-generator \
  --maximum-retry-attempts 2 --maximum-event-age-in-seconds 3600 \
  --destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:ap-south-1:111122223333:thumb-dlq"}}'

resource "aws_lambda_function_event_invoke_config" "thumb" {
  function_name                = aws_lambda_function.thumb.function_name
  maximum_retry_attempts       = 2     # 0–2
  maximum_event_age_in_seconds = 3600  # 60–21600 (6h)
  destination_config {
    on_failure  { destination = aws_sqs_queue.thumb_dlq.arn }
    on_success  { destination = aws_sns_topic.thumb_done.arn }  # optional success routing
  }
}

The async knobs and exactly what each controls:

Setting	Controls	Default	Range	When to change
`MaximumRetryAttempts`	Async retries after first failure	2	0–2	0 if the source already retries; keep 2 for transient errors
`MaximumEventAgeInSeconds`	How long Lambda keeps retrying	21,600 (6 h)	60–21,600	Lower to fail fast on time-sensitive events
`OnFailure` destination	Where exhausted events go	none → dropped	SQS/SNS/EventBridge/Lambda	Always set in production
`OnSuccess` destination	Route successful outcomes	none	SQS/SNS/EventBridge/Lambda	Event-driven success chains
Legacy `DeadLetterConfig`	Old DLQ (less metadata)	none	SQS/SNS	Prefer on-failure destination instead

The difference between a legacy DLQ and an on-failure destination matters enough to tabulate:

Aspect	Legacy DLQ (`DeadLetterConfig`)	On-failure destination (`DestinationConfig`)
Targets	SQS, SNS	SQS, SNS, EventBridge, Lambda
Payload	The original event only	Event + invocation context (error, attempts, request id)
Success routing	No	Yes (`OnSuccess`)
Recommended	Legacy; avoid for new work	Preferred for all new async functions
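
To show what the richer on-failure payload buys you, here is a sketch that summarises one invocation record read from the failure queue. The field names (`requestContext`, `requestPayload`, `responsePayload`) follow the invocation record format as I understand it from the Lambda documentation; treat them as assumptions and check them against a real record before relying on this.

```python
import json


def summarise_failure(sqs_body):
    """Pull the fields useful for triage and redrive out of one invocation record."""
    record = json.loads(sqs_body)
    ctx = record.get("requestContext") or {}
    response = record.get("responsePayload") or {}
    return {
        "request_id": ctx.get("requestId"),
        "condition": ctx.get("condition"),  # e.g. RetriesExhausted
        "attempts": ctx.get("approximateInvokeCount"),
        "error": response.get("errorMessage"),
        "original_event": record.get("requestPayload"),  # what you would redrive
    }
```

A legacy DLQ message carries only the original event, so there is nothing equivalent to `condition` or `attempts` to read from it.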

Stream poll (Kinesis & DynamoDB Streams)

Lambda runs a poller that reads records from each shard in order and invokes your function with a batch. Records within a shard (and therefore all records that share a partition key) are processed in order, by default one batch at a time. That is the whole point and also the whole danger: a poison record that always fails will, by default, be retried until it expires, blocking every record behind it on that shard. You control this with batch size, parallelization, retry attempts, record age, bisect-on-error, and an on-failure destination that receives metadata about the failed batch.

# Create a stream event-source mapping with poison-pill controls
aws lambda create-event-source-mapping --function-name order-projector \
  --event-source-arn arn:aws:kinesis:ap-south-1:111122223333:stream/orders \
  --starting-position LATEST --batch-size 100 \
  --maximum-batching-window-in-seconds 5 \
  --parallelization-factor 4 \
  --maximum-retry-attempts 3 \
  --maximum-record-age-in-seconds 3600 \
  --bisect-batch-on-function-error \
  --function-response-types ReportBatchItemFailures \
  --destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:ap-south-1:111122223333:proj-dlq"}}'

resource "aws_lambda_event_source_mapping" "proj" {
  event_source_arn                   = aws_kinesis_stream.orders.arn
  function_name                      = aws_lambda_function.projector.arn
  starting_position                  = "LATEST"
  batch_size                         = 100
  maximum_batching_window_in_seconds = 5
  parallelization_factor             = 4        # 1–10 concurrent batches per shard
  maximum_retry_attempts             = 3        # -1 = infinite (the default poison trap)
  maximum_record_age_in_seconds      = 3600     # -1 = infinite
  bisect_batch_on_function_error     = true     # split a failing batch to isolate the bad record
  function_response_types            = ["ReportBatchItemFailures"]  # partial-batch success
  # Nested blocks need the multi-line form; HCL rejects them inside a single-line block
  destination_config {
    on_failure {
      destination_arn = aws_sqs_queue.proj_dlq.arn
    }
  }
}

The stream event-source-mapping controls, the defaults that bite, and when to change them:

Setting	Controls	Default	The trap if left default	Change to
`BatchSize`	Records per invoke	100 (Kinesis/DDB)	Large batch + one bad record fails all	Tune to processing cost; pair with bisect
`MaximumBatchingWindow`	Wait to fill a batch (s)	0	Tiny batches = more invokes/cost	1–5 s to batch efficiently
`ParallelizationFactor`	Concurrent batches per shard	1	Throughput capped at 1/shard	1–10 (still per-key ordered)
`MaximumRetryAttempts`	Retries before giving up	-1 (infinite)	Poison record blocks shard forever	A finite number (e.g. 3–5)
`MaximumRecordAge`	Drop records older than	-1 (infinite)	Stale records retried endlessly	Bound it (e.g. 1–24 h)
`BisectBatchOnFunctionError`	Split failing batch	false	Whole batch keeps failing together	true — isolates the poison record
`ReportBatchItemFailures`	Partial-batch success	off	One bad record reprocesses good ones	Return failed IDs; checkpoint past good
`StartingPosition`	Where to begin	n/a	TRIM_HORIZON replays all history	LATEST for new, TRIM_HORIZON to backfill
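
With `ReportBatchItemFailures` enabled, the handler has to return the failures itself. A minimal sketch for a Kinesis batch follows; `process` and the `orderId` check are hypothetical business logic.

```python
import base64
import json


def process(order):
    """Hypothetical business logic; raises on a record it cannot handle."""
    if "orderId" not in order:
        raise ValueError("missing orderId")


def handler(event, context):
    for record in event["Records"]:
        try:
            # Kinesis record data arrives base64-encoded.
            order = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process(order)
        except Exception:
            # Stop at the first failure: records are ordered within a shard,
            # so the retry should resume from this sequence number.
            return {
                "batchItemFailures": [
                    {"itemIdentifier": record["kinesis"]["sequenceNumber"]}
                ]
            }
    return {"batchItemFailures": []}  # empty list means the whole batch succeeded
```

Records before the reported sequence number are checkpointed as done, so only the failing record and those after it are retried.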

The concurrency math for streams is fixed and worth memorising: concurrency = number of shards × parallelization factor. Ten shards at a parallelization factor of 4 gives up to 40 concurrent invocations of this function from this stream. That ceiling is set by the stream rather than by your account’s concurrency limit, but every one of those invocations still counts against the limit.
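
The formula as a small helper, useful when sizing a downstream connection pool against a stream consumer. The function name is illustrative; the 1 to 10 bound comes from the settings table above.

```python
def stream_concurrency(shards, parallelization_factor=1):
    """Upper bound on concurrent invocations one stream mapping can drive."""
    if shards < 1:
        raise ValueError("a stream has at least one shard")
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization factor must be between 1 and 10")
    return shards * parallelization_factor


print(stream_concurrency(10, 4))  # 40
```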

Queue poll (SQS)

Lambda polls the SQS queue and invokes your function with a batch of up to 10,000 messages for a standard queue (10 for FIFO), bounded by a 6 MB payload. The critical interaction is the visibility timeout: when Lambda reads a message it becomes invisible for that window; if the function succeeds, Lambda deletes it; if it fails (or times out), the message becomes visible again and is redelivered. The rule to follow is that the visibility timeout must be at least 6× the function timeout. It exists so a slow invocation does not get the same message redelivered to a second environment while the first is still working (instant duplicates). Messages that fail repeatedly go to the queue’s DLQ via its redrive policy.
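
The 6× rule is easy to check in a deployment script. The sketch below is one way to do it; adding the batching window to the minimum reflects AWS guidance as I recall it rather than anything stated above, and it defaults to 0 so the result matches the plain 6× rule.

```python
def min_visibility_timeout(function_timeout_s, batching_window_s=0):
    """Smallest visibility timeout (seconds) that satisfies the 6x rule."""
    return 6 * function_timeout_s + batching_window_s


def check_queue(visibility_timeout_s, function_timeout_s, batching_window_s=0):
    """Return a warning string if the queue risks duplicate delivery, else None."""
    required = min_visibility_timeout(function_timeout_s, batching_window_s)
    if visibility_timeout_s < required:
        return (
            f"visibility timeout {visibility_timeout_s}s is below "
            f"the recommended {required}s"
        )
    return None
```

For a 30-second function, `min_visibility_timeout(30)` gives 180, the value used for the queue in the Terraform that follows.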

# SQS event-source mapping with partial-batch reporting
aws lambda create-event-source-mapping --function-name invoice-worker \
  --event-source-arn arn:aws:sqs:ap-south-1:111122223333:invoices \
  --batch-size 10 --maximum-batching-window-in-seconds 0 \
  --scaling-config '{"MaximumConcurrency":50}' \
  --function-response-types ReportBatchItemFailures

# Queue with a DLQ wired via redrive policy (the SQS-side failure destination)
resource "aws_sqs_queue" "invoices_dlq" { name = "invoices-dlq" }

resource "aws_sqs_queue" "invoices" {
  name                       = "invoices"
  visibility_timeout_seconds = 180  # >= 6 x the 30s function timeout
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.invoices_dlq.arn
    maxReceiveCount     = 5          # attempts before a message goes to the DLQ
  })
}

resource "aws_lambda_event_source_mapping" "invoices" {
  event_source_arn        = aws_sqs_queue.invoices.arn
  function_name           = aws_lambda_function.invoice_worker.arn
  batch_size              = 10
  function_response_types = ["ReportBatchItemFailures"]
  scaling_config { maximum_concurrency = 50 }  # cap concurrent pollers (2–1000)
}

The SQS-poller settings and the duplicate/loss traps they govern:

Setting	Where	Controls	Trap if wrong
Visibility timeout	Queue	How long a read message is hidden	< 6× function timeout → duplicate processing
`maxReceiveCount`	Queue redrive	Attempts before DLQ	Too high = poison loops; too low = premature DLQ
`BatchSize`	Mapping	Messages per invoke (≤10,000)	Big batch + no partial-fail = reprocess all on one failure
`ReportBatchItemFailures`	Mapping	Return only failed message IDs	Without it, one failure redelivers the whole batch
`MaximumConcurrency` (scaling)	Mapping	Cap concurrent pollers (2–1,000)	Without it, SQS can stampede a fragile downstream
FIFO message group	Queue	Ordering + dedup scope	Wrong group ID serialises unrelated work
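
The handler side of `ReportBatchItemFailures` for a standard queue, as a minimal sketch. `process` and the `amount` rule are hypothetical.

```python
import json


def process(invoice):
    """Hypothetical business logic; raises on a message it cannot handle."""
    if invoice.get("amount", 0) <= 0:
        raise ValueError("invalid amount")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # A standard queue has no ordering to preserve, so keep going
            # and report every message that failed.
            failures.append({"itemIdentifier": record["messageId"]})
    # Only the listed messages become visible again; the rest are deleted.
    return {"batchItemFailures": failures}
```

For a FIFO source the shape is the same, but as far as I know the guidance is to stop at the first failure and report that message plus everything after it, so ordering within the group is not broken.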

Standard vs FIFO SQS as a Lambda source, because the choice changes ordering, throughput and dedup:

Aspect	Standard queue	FIFO queue
Ordering	Best-effort, none guaranteed	Strict, per message group ID
Delivery	At-least-once	Exactly-once processing (with dedup)
Throughput	Nearly unlimited	300 msg/s per API action (3,000 batched) per queue; higher with high-throughput mode
Dedup	None (you handle it)	5-minute dedup window (content or ID)
Lambda concurrency	Scales with backlog	Bounded by active message groups
Use when	Throughput, parallel work	Order matters (per entity), no duplicates

The trigger-by-trigger contract

Each event source has its own wiring, its own event shape, and its own gotchas. This is the reference matrix — which invocation model each uses, the key limits, and the one thing that catches everyone — followed by the wiring detail for the heavy hitters.

Source	Invocation model	Key limit / batch	Event shape gotcha	The classic mistake
API Gateway	Synchronous	29 s integration timeout; 10 MB payload (REST)	Proxy vs non-proxy integration	Function timeout > 29 s API timeout
Application Load Balancer	Synchronous	1 MB response; no 29 s cap	Must return specific JSON shape	Wrong response structure → 502
S3	Asynchronous	One event per object (mostly)	No delivery order; possible duplicates	Recursive loop (write back to same bucket)
SNS	Asynchronous	256 KB message	Fan-out; no replay, no ordering	Expecting ordering or filtering richness
SQS (standard)	Queue poll	≤10,000/batch, 6 MB	Partial-batch failures	Visibility < 6× timeout → duplicates
SQS (FIFO)	Queue poll	Per-group ordering	Group ID controls parallelism	One group ID serialises everything
Kinesis Data Streams	Stream poll	Per-shard order; 100/batch	Iterator advances past poison	Infinite retries block the shard
DynamoDB Streams	Stream poll	Per-key order; 100/batch	NEW/OLD image config	Forgetting `StreamViewType`
EventBridge (bus)	Asynchronous	256 KB; 300 rules/bus	Pattern matching is exact	Pattern too broad → double-delivery
EventBridge Scheduler	Asynchronous	One-time or cron/rate	Time zone + flexible windows	Confusing it with EventBridge rules

S3 → Lambda (asynchronous)

An S3 bucket notification invokes your function asynchronously when an object is created, removed, restored, or replicated. Delivery is typically once but not guaranteed exactly-once, and not ordered — design idempotently. The single most expensive mistake is the recursive loop: a function triggered on s3:ObjectCreated:* that writes a derived object back into the same bucket re-triggers itself, billing you in a runaway until you notice. Scope the prefix/suffix or write to a different bucket.

# Grant S3 permission to invoke, then add the bucket notification
aws lambda add-permission --function-name thumbnail-generator \
  --statement-id s3invoke --action lambda:InvokeFunction \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::uploads-bucket --source-account 111122223333

aws s3api put-bucket-notification-configuration --bucket uploads-bucket \
  --notification-configuration '{
    "LambdaFunctionConfigurations":[{
      "LambdaFunctionArn":"arn:aws:lambda:ap-south-1:111122223333:function:thumbnail-generator",
      "Events":["s3:ObjectCreated:*"],
      "Filter":{"Key":{"FilterRules":[{"Name":"prefix","Value":"raw/"},{"Name":"suffix","Value":".png"}]}}
    }]}'

The S3 notification options and the gotcha each hides:

Option	Values	Gotcha
Event types	`ObjectCreated:*`, `ObjectRemoved:*`, `ObjectRestore:*`, `Replication:*`	`Put` vs `CompleteMultipartUpload` differ; `ObjectCreated:*` catches both
Prefix/suffix filter	string match	Overlapping filters on one bucket can double-fire
Destination	Lambda, SQS, SNS, EventBridge	EventBridge gives richer routing + replay than direct notify
Delivery	At-least-once, unordered	Always idempotent; never assume order
Recursion guard	scope prefix / separate bucket	Writing back to the trigger bucket = billing loop
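
A recursion guard can also live in the handler as a second line of defence behind the notification filter. In this sketch `SOURCE_PREFIX` mirrors the `raw/` filter above and `OUTPUT_PREFIX` is a hypothetical output location; the helper names are illustrative.

```python
from urllib.parse import unquote_plus

SOURCE_PREFIX = "raw/"     # matches the notification filter above
OUTPUT_PREFIX = "thumbs/"  # hypothetical output location outside the trigger prefix


def keys_to_process(event):
    """Return (bucket, key) pairs that are safe to process."""
    keys = []
    for record in event.get("Records", []):
        # Object keys arrive URL-encoded in the notification (a space becomes "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(OUTPUT_PREFIX) or not key.startswith(SOURCE_PREFIX):
            continue  # never react to our own output
        keys.append((record["s3"]["bucket"]["name"], key))
    return keys


def output_key(source_key):
    """Map raw/<name> to thumbs/<name> so output never lands under the trigger prefix."""
    return OUTPUT_PREFIX + source_key[len(SOURCE_PREFIX):]
```

Decoding the key matters as much as the guard: passing the encoded key to a later S3 read is a common cause of "object not found" errors on names with spaces.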

EventBridge → Lambda (asynchronous, the choreography hub)

EventBridge is the event bus for service-to-service choreography. Producers PutEvents; rules match events by a JSON event pattern (exact-match on fields, with content filters); matching events fan out to up to 5 targets per rule — including Lambda. It supports a schema registry, archive and replay, and cross-account/cross-region routing. The defining trap is a pattern that is too broad: two rules whose patterns both match the same event each fire (there is no first-match-wins), so the same fact double-delivers unless your patterns are tight and your consumers idempotent.

# Rule that matches a specific event, targeting a Lambda
aws events put-rule --name order-placed --event-bus-name orders \
  --event-pattern '{"source":["com.acme.orders"],"detail-type":["OrderPlaced"]}'

aws events put-targets --rule order-placed --event-bus-name orders \
  --targets '[{"Id":"fn","Arn":"arn:aws:lambda:ap-south-1:111122223333:function:fulfil","RetryPolicy":{"MaximumRetryAttempts":4,"MaximumEventAgeInSeconds":3600},"DeadLetterConfig":{"Arn":"arn:aws:sqs:ap-south-1:111122223333:eb-dlq"}}]'

aws lambda add-permission --function-name fulfil --statement-id eb \
  --action lambda:InvokeFunction --principal events.amazonaws.com \
  --source-arn arn:aws:events:ap-south-1:111122223333:rule/orders/order-placed
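
To see why overlapping patterns double-deliver, here is a deliberately simplified matcher. It handles exact-value matching and nested fields only, not EventBridge's content filters, and the `all-order-events` rule is a hypothetical second rule added to show the overlap.

```python
def matches(pattern, event):
    """Simplified matcher: each pattern field lists allowed values; dicts recurse."""
    for field, expected in pattern.items():
        if field not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[field], dict) or not matches(expected, event[field]):
                return False
        elif event[field] not in expected:
            return False
    return True


RULES = {
    "order-placed": {"source": ["com.acme.orders"], "detail-type": ["OrderPlaced"]},
    "all-order-events": {"source": ["com.acme.orders"]},  # hypothetical broad rule
}


def matching_rules(event):
    """Every rule whose pattern matches fires; there is no first-match-wins."""
    return [name for name, pattern in RULES.items() if matches(pattern, event)]
```

An `OrderPlaced` event from `com.acme.orders` matches both rules, so a function targeted by both receives it twice. Tightening the broad rule's pattern, or making the consumer idempotent, removes the duplicate.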

EventBridge vs SNS vs SQS for fan-out — the actual decision table, since this is the most-asked serverless design question:

Need	SNS	EventBridge	SQS
One→many fan-out	Yes (subscriptions)	Yes (rules, 5 targets each)	No (point-to-point)
Content-based routing	Limited (message filtering)	Rich (event patterns)	No
Replay / archive	No	Yes	No (it is the buffer)
Schema registry	No	Yes	No
Ordering	FIFO topics (per group)	No	FIFO queues (per group)
Buffering / backpressure	No (push)	No (push)	Yes (pull)
Throughput	Very high	High (per-account limits)	Very high
Latency	Lowest	Low	Pull-interval bound
Best for	High-fanout, low-latency push	Service choreography, routing	Decoupling, buffering, retries

One combination, SNS→SQS fan-out, is common enough to spell out: SNS pushes one event to many SQS queues (fan-out), and each queue buffers for its own Lambda (backpressure + retries + DLQ per consumer). You get SNS’s fan-out and SQS’s durability, which neither gives alone.
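
One practical consequence of putting SNS in front of SQS: unless raw message delivery is enabled on the subscription, the SQS body is an SNS envelope and the original payload sits inside its `Message` field as a string. A sketch of unwrapping it, assuming the producer published JSON:

```python
import json


def unwrap(sqs_record):
    """Return the producer's payload whether or not SNS wrapped it."""
    body = json.loads(sqs_record["body"])
    # Default delivery: SNS wraps the payload in an envelope with Type and Message.
    if isinstance(body, dict) and body.get("Type") == "Notification" and "Message" in body:
        return json.loads(body["Message"])
    return body  # raw message delivery: the body is already the payload
```

Handling both shapes lets you switch raw message delivery on later without redeploying every consumer at the same moment.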

DynamoDB Streams & Kinesis (stream poll)

A DynamoDB Stream emits an ordered, per-key record for every item change; a Kinesis Data Stream is a general ordered log. Both use the stream-poll model from the section above. The DynamoDB-specific knob is StreamViewType — whether the record carries the new image, old image, both, or just keys — which you must set when enabling the stream and cannot change without re-enabling.

# Enable a DynamoDB stream with both images, then map it to a function
aws dynamodb update-table --table-name orders \
  --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES

STREAM_ARN=$(aws dynamodb describe-table --table-name orders \
  --query 'Table.LatestStreamArn' --output text)

aws lambda create-event-source-mapping --function-name order-indexer \
  --event-source-arn "$STREAM_ARN" --starting-position LATEST \
  --batch-size 100 --maximum-retry-attempts 3 --bisect-batch-on-function-error

StreamViewType choices and when each is right:

`StreamViewType`	Record contains	Use when
`KEYS_ONLY`	Just the key attributes	You’ll re-read the item; minimise stream size
`NEW_IMAGE`	Item after the change	Projections, search indexing, caches
`OLD_IMAGE`	Item before the change	Auditing what was deleted/overwritten
`NEW_AND_OLD_IMAGES`	Both	Change-data-capture, diff logic, full audit
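
With `NEW_AND_OLD_IMAGES`, diff logic reduces to comparing the two images in each record. The sketch below assumes the standard stream record layout (`dynamodb.OldImage` / `dynamodb.NewImage` in DynamoDB’s typed JSON); the sample record and its attribute names are made up for illustration.

```python
def changed_attributes(record: dict) -> dict:
    """Return {attribute: (old_value, new_value)} for attributes that differ.

    INSERT records carry no OldImage and REMOVE records no NewImage,
    so a missing image is treated as empty.
    """
    change = record["dynamodb"]
    old = change.get("OldImage", {})
    new = change.get("NewImage", {})
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


# Hypothetical MODIFY record: only the status attribute changed
RECORD = {
    "eventName": "MODIFY",
    "dynamodb": {
        "Keys": {"id": {"S": "o-123"}},
        "OldImage": {"id": {"S": "o-123"}, "status": {"S": "PLACED"}},
        "NewImage": {"id": {"S": "o-123"}, "status": {"S": "SHIPPED"}},
    },
}

diff = changed_attributes(RECORD)
```

With `KEYS_ONLY` the same function would return an empty diff for every record, which is the practical reason to pick the view type before you write the consumer.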

Kinesis vs DynamoDB Streams as a Lambda source:

Aspect	DynamoDB Streams	Kinesis Data Streams
Source of records	Table item changes	Anything you `PutRecord`
Retention	24 h	24 h–365 d (configurable)
Ordering	Per item key	Per partition key
Shards	Managed by the table	You provision (or on-demand)
Multiple consumers	Limited (about two readers per shard)	Enhanced fan-out (per-consumer throughput)
Use for	CDC off a table	General event log, multi-consumer streaming

Fan-out, fan-in & choreography patterns

The patterns that turn single functions into systems. Each has a shape, a primitive, and a failure mode.

Pattern	Shape	Built with	Why use it	Failure mode to guard
Simple fan-out	1 event → N consumers	SNS or EventBridge	Decouple producers from consumers	Lost delivery to one consumer (per-target DLQ)
Fan-out + buffer	1 event → N queues → N fns	SNS→SQS	Backpressure + retry per consumer	Queue without redrive = stuck poison
Fan-in / aggregation	N events → 1 store	Lambda → DynamoDB	Collate results	Race on concurrent writes (use conditional updates)
Pipe / transform	source → filter → target	EventBridge Pipes	Point-to-point with enrichment	Filter too loose forwards noise
Choreography	services react to events	EventBridge bus + rules	Loose coupling, autonomy	No central view; hard to trace
Orchestration	central state machine	Step Functions	Long, branching, stateful flows	Chaining Lambdas instead (no visibility)
Saga (compensation)	distributed txn + undo	Step Functions / EventBridge	Multi-service consistency	Missing compensating action on failure

Choreography vs orchestration — the architecture fork

The most consequential design decision in event-driven systems. Choreography (EventBridge): each service emits events and reacts to others’ events; no central controller; maximum autonomy and loose coupling — but no single place to see “where is order o-123 in the flow?” Orchestration (Step Functions): a central state machine invokes each step, holds the state, branches, retries, and gives you a visual execution history — at the cost of a coordinator that knows about every step. Lambda chaining (function A invokes function B invokes C) is the worst of both: orchestration logic smeared across functions with no visibility and no built-in retry/branching.

Dimension	Choreography (EventBridge)	Orchestration (Step Functions)	Lambda chaining (anti-pattern)
Coupling	Loosest	Coupled to the state machine	Tightly, invisibly coupled
Visibility of flow	Low (trace across events)	High (execution history)	None
Long/branching flows	Hard	Native (Map, Choice, Parallel)	Painful, error-prone
Built-in retry/catch	Per-target only	Per-state	Hand-rolled in each function
Best for	Reactive, autonomous services	Defined business processes	Almost never

On the orchestration side, the Step Functions state-machine model is the tool you reach for when a flow has more than two or three steps, branches, or needs a human-readable history.
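
To make the contrast with Lambda chaining concrete, here is a small Amazon States Language definition expressed as a Python dict, with per-state retry and a catch that routes to a compensating step. The state names and function ARNs are hypothetical; the `dangling_targets` helper is an illustrative sanity check, not part of any AWS SDK.

```python
import json

# Hypothetical function ARN prefix; substitute your own account and functions
ARN = "arn:aws:lambda:ap-south-1:111122223333:function:"

ORDER_FLOW = {
    "Comment": "Order flow as an orchestrated saga (illustrative)",
    "StartAt": "ReserveStock",
    "States": {
        "ReserveStock": {
            "Type": "Task",
            "Resource": ARN + "reserve-stock",
            "Next": "ChargePayment",
        },
        "ChargePayment": {
            "Type": "Task",
            "Resource": ARN + "charge-payment",
            # Retry lives in the state machine, not in the function's code
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 2,
                       "MaxAttempts": 3, "BackoffRate": 2.0}],
            # When retries are exhausted, run the compensating action
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReleaseStock"}],
            "Next": "Done",
        },
        "ReleaseStock": {
            "Type": "Task",
            "Resource": ARN + "release-stock",
            "Next": "Failed",
        },
        "Done": {"Type": "Succeed"},
        "Failed": {"Type": "Fail", "Error": "OrderFailed",
                   "Cause": "Payment failed; stock released"},
    },
}


def dangling_targets(definition: dict) -> list[str]:
    """Names referenced by StartAt, Next or Catch that are not defined states."""
    states = definition["States"]
    targets = [definition["StartAt"]]
    for state in states.values():
        if "Next" in state:
            targets.append(state["Next"])
        targets.extend(catcher["Next"] for catcher in state.get("Catch", []))
    return sorted({t for t in targets if t not in states})


definition_json = json.dumps(ORDER_FLOW, indent=2)  # what you would deploy
```

The retry, the branch on failure, and the compensation are all visible in one document, which is exactly what a chain of Lambdas invoking each other cannot give you.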

At-least-once, idempotency & not losing events

This is where most production incidents live. Three disciplines: make the handler idempotent, attach a failure destination on every async/poll path, and stop poison records from blocking streams.

Idempotency — the non-negotiable

Under at-least-once delivery, the same event will arrive twice eventually. An idempotent handler produces the same result whether it runs once or five times. The canonical implementation: derive an idempotency key from the event (message ID, event ID, or a hash of the meaningful fields), and use a conditional write to a store so the second attempt is a no-op.

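# DynamoDB conditional put as an idempotency guard
import boto3, hashlib, json, time
ddb = boto3.client("dynamodb")  # init code: reused across warm invocations

def _ttl(days=7):
    # Expiry as epoch seconds; 7 days is an assumption, keep it longer than the max retry age
    return int(time.time()) + days * 86400

def handler(event, _ctx):
    for record in event["Records"]:
        # Prefer the source's message ID; fall back to a hash of the body
        key = record.get("messageId") or hashlib.sha256(
            json.dumps(record["body"], sort_keys=True).encode()).hexdigest()
        try:
            ddb.put_item(
                TableName="processed-events",
                Item={"id": {"S": key}, "ttl": {"N": str(_ttl())}},
                ConditionExpression="attribute_not_exists(id)")  # fails on duplicate
        except ddb.exceptions.ConditionalCheckFailedException:
            continue  # already processed: skip the side effect
        # do_the_side_effect is your business logic (not defined here); runs at most once per key.
        # Caveat: if it raises, the key is already stored, so delete the key before retrying.
        do_the_side_effect(record)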

Idempotency strategies and their trade-offs:

Strategy	How it works	Pro	Con
Conditional write (DynamoDB)	First write wins; dupes fail the condition	Strong, simple, TTL-able	Extra write per event
Natural idempotency	Operation is inherently safe (PUT to a key)	Free	Only some operations qualify
Dedup table + TTL	Store seen IDs, expire them	Bounded storage	Window must exceed max retry age
SQS FIFO dedup	5-min content/ID dedup	Built-in	5-min window only; FIFO throughput limits
Powertools Idempotency	Library wraps the handler	Battle-tested, persistence-backed	Adds a dependency + a store
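
When the source gives you no stable message ID, the key has to come from the payload. A minimal sketch of that derivation follows: hash only the fields that define “the same event,” in a canonical serialisation, so delivery metadata and key order do not change the result. The field names are hypothetical.

```python
import hashlib
import json


def idempotency_key(event: dict, fields: tuple[str, ...]) -> str:
    """SHA-256 over the meaningful fields only, serialised canonically."""
    meaningful = {name: event[name] for name in fields}
    canonical = json.dumps(meaningful, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical payloads: same order, delivered twice with different metadata
first = {"orderId": "o-123", "amount": 499, "receivedAt": "2024-01-01T10:00:00Z"}
retry = {"receivedAt": "2024-01-01T10:00:07Z", "amount": 499, "orderId": "o-123"}
other = {"orderId": "o-123", "amount": 500, "receivedAt": "2024-01-01T10:00:00Z"}

FIELDS = ("orderId", "amount")
same = idempotency_key(first, FIELDS) == idempotency_key(retry, FIELDS)
different = idempotency_key(first, FIELDS) != idempotency_key(other, FIELDS)
```

Choosing the fields is the design decision: too few and distinct events collide, too many (a timestamp, a retry counter) and duplicates stop looking like duplicates.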

Where events go when they fail — the destination matrix

Every async and poll path needs an explicit failure destination, or “failed” means “gone.” Map yours:

Invocation path	Failure mechanism	Configure	If unset
Async (S3/SNS/EventBridge)	On-failure destination / DLQ	`put-function-event-invoke-config`	Event dropped after retries
EventBridge target	Per-target DLQ + retry policy	`DeadLetterConfig` on the target	Event dropped after target retries
SQS source	Queue redrive → DLQ	Queue `RedrivePolicy`	Message is redelivered until the retention period expires (default 4 days), then deleted
Kinesis/DDB source	On-failure destination	mapping `DestinationConfig`	Shard blocked (infinite default retries)
SNS subscription	Subscription redrive → DLQ	SNS `RedrivePolicy`	Delivery attempts exhausted, message lost
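
For the async row, the configuration is small enough to build and validate in code before applying it. The sketch below assembles the arguments for Lambda’s `PutFunctionEventInvokeConfig` call and rejects values outside the documented ranges; the queue name `fulfil-failures` is a hypothetical placeholder.

```python
def async_failure_config(function_name: str, destination_arn: str,
                         max_retries: int = 2, max_age_seconds: int = 3600) -> dict:
    """Arguments for PutFunctionEventInvokeConfig with an on-failure destination."""
    if not 0 <= max_retries <= 2:
        raise ValueError("async MaximumRetryAttempts must be 0, 1 or 2")
    if not 60 <= max_age_seconds <= 21600:
        raise ValueError("MaximumEventAgeInSeconds must be between 60 and 21600")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_age_seconds,
        "DestinationConfig": {"OnFailure": {"Destination": destination_arn}},
    }


config = async_failure_config(
    "fulfil", "arn:aws:sqs:ap-south-1:111122223333:fulfil-failures")

# With boto3 installed and credentials configured, applying it would be:
#   boto3.client("lambda").put_function_event_invoke_config(**config)
```

The execution role also needs permission to send to the destination (`sqs:SendMessage` for a queue), or the failure record itself is lost.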

Stopping poison records on streams

A poison record is one that always fails. On a stream, the default MaximumRetryAttempts=-1 (infinite) means it blocks its shard until the record expires from the stream, and every record behind it waits. Five controls fix this, used together:

Control	What it does	Set to
`MaximumRetryAttempts`	Cap retries before skipping/destination	A finite number (e.g. 3–5)
`MaximumRecordAgeInSeconds`	Skip records older than N seconds	Bound it (e.g. 3,600–86,400)
`BisectBatchOnFunctionError`	Halve a failing batch to isolate the bad record	`true`
`ReportBatchItemFailures`	Report only the failed record IDs	Return them; good records checkpoint
`DestinationConfig.OnFailure`	Send the failed batch’s metadata somewhere	An SQS/SNS DLQ for inspection
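
The idea behind `BisectBatchOnFunctionError` can be shown with a local simulation: a failing batch is split in half and each half is retried, in order, until the bad record stands alone. This is a sketch of the concept only, not Lambda’s actual poller; retry counts and record ages are ignored, and the integer “records” are stand-ins.

```python
def isolate_poison(batch, handler):
    """Simulate bisect-on-error. Returns (processed, poisoned, invocations)."""
    processed, poisoned = [], []
    invocations = 0
    pending = [list(batch)]
    while pending:
        chunk = pending.pop(0)
        invocations += 1
        try:
            handler(chunk)
            processed.extend(chunk)
        except Exception:
            if len(chunk) == 1:
                poisoned.extend(chunk)  # would go to the on-failure destination
            else:
                mid = len(chunk) // 2
                # Put both halves back at the front so shard order is preserved
                pending[:0] = [chunk[:mid], chunk[mid:]]
    return processed, poisoned, invocations


def handler(records):
    """Stand-in handler: record 5 is the poison record and always fails."""
    if 5 in records:
        raise ValueError("cannot parse record 5")


processed, poisoned, invocations = isolate_poison(range(8), handler)
print(processed, poisoned, invocations)
```

Note what the simulation also shows: good records that shared a batch with the poison record are handed to the handler more than once, which is one more reason the handler has to be idempotent.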

Concurrency, scaling & protecting downstream

Concurrency is the scaling unit and the thing that takes down your database. Three numbers govern it: the account concurrency limit (default 1,000/region, shared), reserved concurrency (a per-function guaranteed cap), and provisioned concurrency (pre-warmed environments). A fourth, the burst concurrency rate, governs how fast you can scale into that ceiling.

# Reserve 100 concurrent executions for a function (guarantees AND caps it)
aws lambda put-function-concurrency --function-name payment-worker \
  --reserved-concurrent-executions 100

# Provision 20 always-warm environments on a version/alias (kills cold starts)
aws lambda put-provisioned-concurrency-config --function-name order-api \
  --qualifier live --provisioned-concurrent-executions 20

# Check current usage against the account limit
aws lambda get-account-settings \
  --query 'AccountLimit.{Concurrent:ConcurrentExecutions,Unreserved:UnreservedConcurrentExecutions}'

resource "aws_lambda_function" "payment" {
  function_name                  = "payment-worker"
  reserved_concurrent_executions = 100   # cap + guarantee; 0 would DISABLE the function
  # ...
}

resource "aws_lambda_provisioned_concurrency_config" "api" {
  function_name                     = aws_lambda_function.order_api.function_name
  qualifier                         = aws_lambda_alias.live.name
  provisioned_concurrent_executions = 20
}

The concurrency levers, side by side:

Lever	What it does	Cost	Set when
Account limit (1,000)	Ceiling across all functions in a region	n/a	Raise via Service Quotas before you need it
Reserved concurrency	Caps a function AND carves it out of the pool	Free (just allocation)	Protect a downstream DB / protect other functions
Provisioned concurrency	Pre-warms N environments	Paid hourly even idle	Latency-critical sync paths with cold-start SLOs
Burst concurrency	Scale-up rate (up to 1,000 more environments every 10 s per function)	n/a	Understand it; you can’t raise it

Critical gotchas, because each has bitten teams in production:

Gotcha	What happens	Fix
`reserved_concurrent_executions = 0`	Disables the function entirely	Use `-1` or leave it unset for “no reservation,” not 0
One function reserves 900 of 1,000	Every other function shares 100	Reserve deliberately; monitor `UnreservedConcurrentExecutions`
Lambda scales faster than RDS allows	Connection storm → DB at max_connections	Reserved concurrency cap + RDS Proxy for pooling
Provisioned concurrency on `$LATEST`	Not allowed — needs a version/alias	Publish a version, point an alias, provision the alias
Stream concurrency surprise	shards × parallelization, not 1	Size downstream for shards×factor

The downstream-protection pattern deserves emphasis: Lambda will happily open 1,000 concurrent connections to an RDS instance that allows 100, and the database falls over. The fix is a reserved concurrency cap sized to the database’s connection budget, plus RDS Proxy to pool and reuse connections so 1,000 functions share a small pool. DynamoDB, being serverless, scales with you — which is one reason it pairs so naturally with Lambda.
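
Sizing the cap is arithmetic, sketched below under stated assumptions: one database connection per execution environment (no proxy), and hypothetical numbers for batch size, per-batch duration, and the connection headroom you keep back for admin and migrations. The concurrency estimate is Little’s law: in-flight invocations equal batch arrival rate times time per batch.

```python
def required_concurrency(events_per_second: int, batch_size: int, ms_per_batch: int) -> int:
    """Environments needed to keep up: (events/s / batch size) * seconds per batch, rounded up."""
    work = events_per_second * ms_per_batch
    per_env = batch_size * 1000
    return -(-work // per_env)  # integer ceiling division


def reserved_cap(db_max_connections: int, headroom: int, connections_per_env: int = 1) -> int:
    """Largest reserved concurrency the database can tolerate."""
    return max((db_max_connections - headroom) // connections_per_env, 0)


# Hypothetical workload: 2,500 events/s, batches of 10, 200 ms per batch
needed = required_concurrency(2500, 10, 200)
# Hypothetical database: max_connections=100, 20 kept back for admin tools
cap = reserved_cap(100, 20)

keeps_up = needed <= cap  # if False, the queue absorbs the difference as backlog
print(needed, cap, keeps_up)
```

When `needed` exceeds `cap`, the cap still wins: the buffer in front of the function grows during the spike and drains afterwards, which is the behaviour you want from a protected downstream.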

Cold starts: causes, costs & cures

A cold start is the latency of initialising a new execution environment. It is not an error — but on a synchronous, user-facing path it shows up in p99 and can trip an upstream timeout. First, what actually consumes the cold-start budget:

Cost component	Typical magnitude	Reduce by	Trade-off
Code/layer download	10s–100s ms (size-dependent)	Smaller package; fewer/lighter deps	Build discipline
Runtime bootstrap	50–400 ms (varies by runtime)	Choose a faster runtime; SnapStart	Language/ecosystem constraints
VPC ENI attach	Now ~ms (Hyperplane)	(Mostly solved) historically the big one	n/a
Your init code	10 ms–1 s+	Lazy-init non-critical clients; cache config	First real call may pay deferred cost
First DB connect	10s–100s ms	Pool in init; use serverless/proxy DBs	Connection still primes once

The cure menu, ranked by cost and effort:

Technique	What it does	Cost	Effort	Best for
Smaller package / fewer deps	Less to download + init	Free	Medium	Every function
Init clients in module scope	Warm invocations skip init	Free	Trivial	Every function
Right-size memory up	More CPU → faster init + run	Pay per GB-s (may lower total cost)	Trivial	CPU-bound init
Provisioned concurrency	Pre-warmed envs, no cold start	Paid hourly	Low	Latency-SLO sync APIs
SnapStart (Java, .NET, Python)	Snapshot a warmed env, restore fast	Free on Java; cache and restore charges on Python/.NET	Low	JVM/.NET cold-start pain
Avoid heavy frameworks in handler	Less per-invoke overhead	Free	Medium	High-RPS functions

Runtime cold-start characteristics, roughly, because runtime choice is a cold-start decision:

Runtime	Relative cold start	SnapStart support	Notes
Node.js / Python	Fast	Python: yes	The serverless default for latency
Go / Rust (provided.al2)	Fast	n/a (already fast)	Compiled, tiny, quick init
Java	Slow (JVM + JIT)	Yes — big win	SnapStart cuts it dramatically
.NET	Slow-ish	Yes	SnapStart / ReadyToRun help

The memory-CPU coupling is the under-used lever: Lambda allocates CPU proportional to memory, so a function that is CPU-bound during init often runs faster and cheaper at 1,024 MB than at 256 MB, because it finishes in a fraction of the time. Profile with AWS Lambda Power Tuning rather than defaulting to 128 MB.
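
The cheaper-at-more-memory claim is just a GB-second comparison. A worked sketch follows; the two durations are hypothetical measurements of the kind Power Tuning would give you, and the price constant is an assumed x86 list price that you should check against current pricing for your region. The per-request charge is ignored because it is the same at both sizes.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed USD list price; verify for your region/architecture


def gb_seconds(memory_mb: int, billed_ms: int) -> float:
    """Billed compute for one invocation, in GB-seconds."""
    return memory_mb * billed_ms / (1024 * 1000)


def invocation_cost(memory_mb: int, billed_ms: int,
                    price: float = PRICE_PER_GB_SECOND) -> float:
    """Duration charge for one invocation (request charge excluded)."""
    return gb_seconds(memory_mb, billed_ms) * price


# Hypothetical CPU-bound function measured at two memory sizes
small = gb_seconds(256, 4000)   # 4.0 s at 256 MB
large = gb_seconds(1024, 900)   # 0.9 s at 1,024 MB
cheaper_at_large = invocation_cost(1024, 900) < invocation_cost(256, 4000)
```

Four times the memory only pays off here because the run became more than four times faster; for I/O-bound functions that wait on a network call, the same change would simply cost more.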

Limits & quotas reference

The numbers you will hit. Keep this open when sizing or debugging a “why did it stop” mystery:

Limit	Value	Hard/soft	What hitting it looks like
Function timeout	15 min (900 s)	Hard	Invocation killed mid-work; `Task timed out`
Memory	128 MB – 10,240 MB	Hard	OOM kill; `Runtime exited` / errno 137
Ephemeral `/tmp`	512 MB – 10,240 MB	Configurable	`No space left on device`
Sync request payload	6 MB	Hard	`RequestEntityTooLargeException` (HTTP 413)
Async payload	256 KB	Hard	Event rejected at invoke
Deployment package (zipped, direct)	50 MB	Hard	Upload rejected; use S3
Deployment package (unzipped)	250 MB	Hard	Use container image (up to 10 GB) instead
Container image	10 GB	Hard	Bigger won’t deploy
Layers per function	5	Hard	Consolidate layers
Account concurrency (region)	1,000 default	Soft (raisable)	`429 TooManyRequestsException`
Burst concurrency	+1,000 environments per 10 s per function	Hard	Throttles during a sharp spike
Environment variables size	4 KB total	Hard	Move config to SSM/Secrets Manager
ENI per function (VPC)	scales (Hyperplane)	Managed	(Historically a hard limit)
`/tmp` + invocations	per-env, reused	n/a	Stale state across warm invokes

Error & status-code reference

Every error you realistically see, what it means, how to confirm, and the fix:

Error / code	Meaning	Likely cause	Confirm with	Fix
`429 TooManyRequestsException`	Throttled	Concurrency limit hit	`Throttles` metric; account settings	Raise quota; reserved concurrency; backoff
`Task timed out after N seconds`	Function exceeded timeout	Slow work / hung downstream	CloudWatch Logs `END` vs timeout	Raise timeout (≤900 s); fix the slow call
`Runtime exited (errno 137)`	OOM killed	Memory too low / leak	Logs “Runtime exited”; `MaxMemoryUsed`	Increase memory; fix leak
`AccessDeniedException`	IAM denied	Execution role missing a permission	CloudTrail; the log’s denied action	Add the action to the role
`ResourceConflictException`	Concurrent update	Two deploys/updates at once	Activity; deploy logs	Serialise deploys
`EventSourceMapping ... Disabled`	Poller stopped	Repeated failures / manual disable	`get-event-source-mapping` State	Fix function; re-enable
`ProvisionedConcurrencyConfigNotFoundException`	PC not on this qualifier	Provisioned `$LATEST` or wrong alias	`get-provisioned-concurrency-config`	Provision a version/alias
`KMSAccessDeniedException`	Can’t decrypt env vars	Role lacks KMS key access	Logs at init	Grant `kms:Decrypt` on the key
`Lambda was unable to decompress...`	Bad package	Corrupt/oversized zip	Deploy output	Rebuild; use container image
`Calls to <fn> are being throttled` (async)	Async backlog throttled	Downstream of an async flood	`Throttles`; invocations queued	Reserved concurrency; smooth the source
Empty receive / no invokes (SQS)	Poller not pulling	Mapping disabled; permissions	`get-event-source-mapping`; role `sqs:*`	Enable mapping; grant SQS perms
Stale version behind alias	Wrong code served	Alias points to old version	`get-alias`	Update alias to the new version
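
“Backoff” in the 429 row means exponential backoff with jitter on the caller’s side. A minimal sketch follows; the `Throttled` exception is a stand-in for whatever your SDK raises on a 429, and the delay parameters are illustrative defaults, not recommended values.

```python
import random
import time


class Throttled(Exception):
    """Stand-in for a 429 TooManyRequestsException surfaced by the caller's SDK."""


def call_with_backoff(fn, max_attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Retry fn on Throttled, sleeping a random time up to an exponentially growing bound."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttle to the caller
            # Full jitter: anywhere between 0 and the capped exponential bound
            sleep(rng() * min(cap, base * 2 ** attempt))


# Demonstration with a fake call that is throttled twice, then succeeds
attempts = {"count": 0}
delays = []


def flaky():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise Throttled()
    return "ok"


result = call_with_backoff(flaky, sleep=delays.append, rng=lambda: 1.0)
```

The jitter matters as much as the exponent: without it, every throttled caller retries at the same instant and recreates the spike that caused the throttle.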

Architecture at a glance

The diagram traces a real event-driven order pipeline left to right and maps the four invocation models onto the exact hops where each one fails. Read it as the path an event actually takes. On the left, producers emit facts: an API Gateway call places an order (a synchronous invoke of the intake function), and an S3 upload of an attachment fires an asynchronous invoke of a processor. Those producers land on the ingestion & buffering zone — an SQS queue absorbs the order workload (the queue-poll model, with a DLQ on its redrive policy) and an SNS topic fans the “order placed” fact out to interested consumers. The EventBridge custom bus in the routing zone is the choreography hub: rules pattern-match the event and fan it to up to five targets, with a per-target DLQ catching exhausted deliveries.

From there the processing zone is where the worker functions run — a fulfilment Lambda (async, retries to an on-failure destination), a projection Lambda fed by DynamoDB Streams (the stream-poll model, where a poison record can block the shard), and a Kinesis consumer for the analytics tap (also stream-poll). Everything converges on the state & failure zone: a DynamoDB table as the idempotency store and projection target, and the DLQs that are the difference between a logged failure and a lost event. The numbered badges sit on the five places an event silently dies or duplicates — a throttle at the concurrency ceiling, an async retry with no destination, a poison record on the stream, a visibility-timeout duplicate on the queue, and a too-broad EventBridge rule that double-delivers. The legend narrates each as symptom, the metric that confirms it, and the fix.

Real-world scenario

Parcelo, a fictional last-mile delivery startup in Bengaluru, ran its parcel-event pipeline on a single 8-vCPU EC2 instance: a Python worker that polled a queue, processed scan events from courier apps, updated a Postgres database, and pushed notifications. Traffic averaged 200 events/second with a 7pm surge to ~2,500/second as the evening delivery wave finished. The instance sat near-idle overnight, cost about ₹14,000/month running 24×7, and — worse — during the evening surge it fell behind, processing events minutes late, so customers saw “out for delivery” long after the parcel arrived. The four-engineer platform team decided to go event-driven on Lambda.

The first cut was naive and instructive. They pointed a Lambda directly at the courier SNS topic (asynchronous) and had it write straight to Postgres. It worked in testing. In the first evening surge it fell apart: the function scaled to ~900 concurrent environments, each opened a Postgres connection, and the database hit max_connections and started refusing — so functions errored, Lambda retried the async events, and the retry storm made it worse. Meanwhile a malformed scan event from one buggy courier app build threw on every attempt; with no on-failure destination configured, those events simply vanished after two retries. The team had reproduced two textbook traps at once: a concurrency stampede on a non-serverless downstream, and silent async data loss.

The breakthrough was redesigning around the invocation models rather than fighting them. They inserted an SQS queue between SNS and the worker (the SNS→SQS pattern), switching the worker to the queue-poll model. That gave them three things at once: a buffer that absorbed the 2,500/s surge instead of stampeding, reserved concurrency capped at 80 (sized to the database’s connection budget) so Lambda could never open more connections than Postgres allowed, and a DLQ via redrive policy (maxReceiveCount=5) so the poison events landed somewhere inspectable instead of disappearing. They made the handler idempotent with a DynamoDB conditional-write on the scan event’s ID, because at-least-once delivery meant duplicates were now expected, not exceptional. Finally they put RDS Proxy in front of Postgres so the 80 concurrent workers shared a small pooled connection set.

The numbers told the story. The evening surge now drained through the queue with sub-second processing latency end to end; the database never exceeded 80 connections; the DLQ caught exactly the malformed events (which turned out to be one courier app version, fixed at the source) with full payloads for replay. Cost fell to about ₹3,800/month — Lambda billed only for the milliseconds of actual processing, near-zero overnight, scaling to the surge automatically. The lesson the team wrote on the wall: “Don’t point a function at a fragile thing. Buffer it, cap it, make it idempotent, and give failures somewhere to land.”

The incident and redesign as a timeline, because the order of the fixes is the lesson:

Stage	What they did	Result	What it should have been
v0 (EC2)	One 24×7 worker	₹14,000/mo, falls behind at surge	—
v1 (naive Lambda)	SNS → Lambda → Postgres direct	Stampede; DB refuses connections	Buffer with SQS first
v1 failure	No on-failure destination	Malformed events vanish	DLQ on every async/queue path
Fix 1	Insert SQS (SNS→SQS)	Surge buffered, no stampede	The core architectural fix
Fix 2	Reserved concurrency = 80	DB connections bounded	Size to the downstream budget
Fix 3	DLQ via redrive (maxReceiveCount 5)	Poison events captured	Never lose an event silently
Fix 4	Idempotent handler (DDB conditional)	Duplicates harmless	At-least-once demands it
Fix 5	RDS Proxy	80 workers share a pool	Pool connections to non-serverless DBs
Outcome	—	₹3,800/mo, sub-second latency	The fix was design, not bigger compute

Advantages and disadvantages

The event-driven serverless model both unlocks huge wins and introduces failure modes you must design against. Weigh it honestly:

Advantages (why this model wins)	Disadvantages (why it bites)
Pay per millisecond of execution; near-zero cost when idle	At sustained high RPS, a container can be cheaper than per-invoke billing
Scales from zero to thousands of environments with no capacity planning	A scale-out stampede can overwhelm any non-serverless downstream (RDS)
Each event source is a first-class, declarative trigger	Each source has its own retry/ordering/error contract you must learn
Built-in retries and DLQs for async/poll paths	“Failed” means “silently gone” unless you configure a destination
Stateless functions are trivially horizontally scalable	At-least-once delivery means you must build idempotency
Cold starts now small for most runtimes; provisioned concurrency for the rest	Cold starts still hurt latency-critical synchronous paths
Tight integration with the whole AWS event ecosystem	Observability is fragmented across many small functions
15-min timeout fits most event reactions	Long/heavy jobs hit the wall — wrong tool

The model is right for event-shaped, intermittent, spiky workloads where you want to ship reactions, not operate servers — and where the work decomposes into small, idempotent steps. It bites hardest on sustained high-throughput workloads (where per-invoke billing loses to a reserved container), on functions fronting fragile non-serverless downstreams (without concurrency caps and pooling), and on teams that haven’t internalised at-least-once delivery (duplicates and silent loss). The disadvantages are all manageable — but only if you design for them, which is the entire point of the patterns above. When the workload is long-running, stateful, or steady-state high-CPU, the container path is the honest answer.

Hands-on lab

Build a real, free-tier-friendly buffered consumer, the SQS-to-Lambda tail of the SNS→SQS pattern: messages land on an SQS queue, a Lambda polls it and writes an idempotent record to DynamoDB, and a DLQ catches failures. Run in a shell with the AWS CLI configured; everything here costs effectively nothing for a short test, and we tear it down at the end.

Step 1 — Variables.

export R=ap-south-1 ACC=$(aws sts get-caller-identity --query Account --output text)
export AWS_DEFAULT_REGION=$R  # so the commands below that omit --region still use the lab region
export PFX=lab-evt

Step 2 — Create the DynamoDB idempotency/projection table.

aws dynamodb create-table --table-name ${PFX}-events \
  --attribute-definitions AttributeName=id,AttributeType=S \
  --key-schema AttributeName=id,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST --region $R

Expected: a TableDescription with TableStatus: CREATING, soon ACTIVE.

Step 3 — Create the work queue and its DLQ with a redrive policy.

DLQ_URL=$(aws sqs create-queue --queue-name ${PFX}-dlq --query QueueUrl --output text)
DLQ_ARN=$(aws sqs get-queue-attributes --queue-url $DLQ_URL --attribute-names QueueArn --query Attributes.QueueArn --output text)
Q_URL=$(aws sqs create-queue --queue-name ${PFX}-work \
  --attributes "{\"VisibilityTimeout\":\"180\",\"RedrivePolicy\":\"{\\\"deadLetterTargetArn\\\":\\\"$DLQ_ARN\\\",\\\"maxReceiveCount\\\":\\\"5\\\"}\"}" \
  --query QueueUrl --output text)
Q_ARN=$(aws sqs get-queue-attributes --queue-url $Q_URL --attribute-names QueueArn --query Attributes.QueueArn --output text)

Note the visibility timeout 180s ≥ 6× the 30s function timeout — the duplicate-prevention rule from the SQS section, made concrete.

Step 4 — Create the execution role.

aws iam create-role --role-name ${PFX}-role \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"lambda.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name ${PFX}-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaSQSQueueExecutionRole
aws iam attach-role-policy --role-name ${PFX}-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess

Step 5 — Package and deploy the idempotent function.

cat > handler.py <<'PY'
import boto3, json, os
ddb = boto3.client("dynamodb")  # init code: reused on warm invokes
T = os.environ["TABLE"]
def handler(event, _):
    failures = []
    for r in event["Records"]:
        mid = r["messageId"]
        try:
            ddb.put_item(TableName=T, Item={"id": {"S": mid}},
                ConditionExpression="attribute_not_exists(id)")
            print("processed", mid)
        except ddb.exceptions.ConditionalCheckFailedException:
            print("duplicate, skipped", mid)  # idempotent: no double side effect
        except Exception as e:
            print("error", mid, str(e))
            failures.append({"itemIdentifier": mid})  # partial-batch failure
    return {"batchItemFailures": failures}
PY
zip fn.zip handler.py
sleep 10  # let the role propagate
aws lambda create-function --function-name ${PFX}-worker \
  --runtime python3.12 --handler handler.handler --timeout 30 --memory-size 256 \
  --role arn:aws:iam::${ACC}:role/${PFX}-role \
  --environment "Variables={TABLE=${PFX}-events}" \
  --zip-file fileb://fn.zip --region $R

Step 6 — Map the queue to the function with partial-batch reporting.

aws lambda create-event-source-mapping --function-name ${PFX}-worker \
  --event-source-arn $Q_ARN --batch-size 10 \
  --function-response-types ReportBatchItemFailures --region $R

Step 7 — Send a message twice; prove idempotency.

aws sqs send-message --queue-url $Q_URL --message-body '{"scan":"SC-1"}'
aws sqs send-message --queue-url $Q_URL --message-body '{"scan":"SC-1"}'
sleep 8
aws logs tail /aws/lambda/${PFX}-worker --since 2m --region $R
# Expect: two "processed" lines with distinct messageIds, because each send-message call
# creates a new message; the scan below should then report a Count of 2.
aws dynamodb scan --table-name ${PFX}-events --select COUNT --region $R

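Each SQS send gets its own messageId, so the two sends above are two distinct messages, not a duplicate. To exercise the duplicate path you need the same messageId delivered twice, for example by invoking the function directly with a hand-built event that repeats one record. The conditional write then makes the second delivery a safe no-op, which is exactly what protects you under at-least-once delivery.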
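
The duplicate path can also be exercised locally, without AWS. The sketch below mirrors the lab handler’s control flow against an in-memory stand-in for the table; `FakeTable` and `ConditionalCheckFailed` are hypothetical test doubles, not boto3 classes.

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""


class FakeTable:
    """In-memory table whose put behaves like attribute_not_exists(id)."""

    def __init__(self):
        self.items = {}

    def put_if_absent(self, key):
        if key in self.items:
            raise ConditionalCheckFailed(key)
        self.items[key] = True


def handle(event, table, log):
    """Same shape as the lab handler: conditional write, partial-batch failures."""
    failures = []
    for record in event["Records"]:
        mid = record["messageId"]
        try:
            table.put_if_absent(mid)
            log.append(("processed", mid))
        except ConditionalCheckFailed:
            log.append(("duplicate, skipped", mid))  # no double side effect
        except Exception:
            failures.append({"itemIdentifier": mid})
    return {"batchItemFailures": failures}


# The same message delivered twice, as at-least-once delivery allows
redelivered = {"Records": [{"messageId": "m-1", "body": '{"scan":"SC-1"}'}] * 2}
table, log = FakeTable(), []
response = handle(redelivered, table, log)
print(log, len(table.items))
```

One item in the table and one “duplicate, skipped” line is the outcome the lab is built to guarantee.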

Validation checklist. You created a buffered, idempotent, DLQ-backed consumer: an SQS source (queue-poll model), a visibility timeout sized to the function timeout, partial-batch failure reporting so one bad record doesn’t reprocess the batch, a DLQ via redrive policy so poison messages land somewhere, and a DynamoDB conditional write for idempotency. The lab steps mapped to what each proves:

Step	What you did	What it proves
3	Queue with redrive + 180s visibility	Visibility ≥ 6× timeout; failures have a DLQ
5	`boto3.client` in init scope	Warm invokes skip client init (cold-start lever)
5	Conditional put	Idempotency under at-least-once delivery
5/6	`ReportBatchItemFailures`	One bad record doesn’t reprocess the whole batch
7	Send + scan COUNT	One item per unique messageId; redelivery of the same ID would be a no-op

Cleanup.

MID=$(aws lambda list-event-source-mappings --function-name ${PFX}-worker --query 'EventSourceMappings[0].UUID' --output text --region $R)
aws lambda delete-event-source-mapping --uuid $MID --region $R
aws lambda delete-function --function-name ${PFX}-worker --region $R
aws sqs delete-queue --queue-url $Q_URL ; aws sqs delete-queue --queue-url $DLQ_URL
aws dynamodb delete-table --table-name ${PFX}-events --region $R
aws iam detach-role-policy --role-name ${PFX}-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaSQSQueueExecutionRole
aws iam detach-role-policy --role-name ${PFX}-role --policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess
aws iam delete-role --role-name ${PFX}-role
aws logs delete-log-group --log-group-name /aws/lambda/${PFX}-worker --region $R
rm -f fn.zip handler.py

Cost note. Lambda’s free tier (1M requests + 400,000 GB-seconds/month) and SQS’s first million requests a month cover this lab, and DynamoDB on-demand at this volume costs a negligible amount; the lab costs effectively ₹0, and deleting the resources stops all charges.

Common mistakes & troubleshooting

The playbook — the part you bookmark. First as a scannable table you read mid-incident, then the detail for the entries that bite hardest.

#	Symptom	Root cause	Confirm (exact command / metric)	Fix
1	Async events silently disappear	No on-failure destination; retries exhausted	`aws lambda get-function-event-invoke-config`; check for `DestinationConfig`	Set `OnFailure` destination; alarm on DLQ depth
2	Stream consumer stuck for hours, no progress	Poison record + infinite default retries blocks shard	`IteratorAge` climbing; `aws lambda get-event-source-mapping` retries=-1	Finite `MaximumRetryAttempts`, `BisectBatchOnFunctionError`, on-failure dest
3	Same event processed twice (double charge)	At-least-once delivery + non-idempotent handler	Duplicate side effects in logs/DB; no dedup store	DynamoDB conditional write keyed on event ID
4	SQS messages reprocessed repeatedly	Visibility timeout < 6× function timeout	Queue `VisibilityTimeout` vs function `Timeout`	Set visibility ≥ 6× timeout
5	`429 TooManyRequestsException` under load	Concurrency limit reached	`Throttles` > 0; `ConcurrentExecutions` at ceiling	Raise account quota; reserved concurrency; backoff
6	Database refuses connections during spikes	Lambda scale-out > DB connection budget	RDS `DatabaseConnections` at max; function fan-out	Reserved concurrency cap + RDS Proxy
7	Whole SQS batch reprocessed on one failure	No partial-batch reporting	Mapping `FunctionResponseTypes` empty	`ReportBatchItemFailures` + return failed IDs
8	EventBridge event handled twice	Two rules’ patterns both match (no first-match-wins)	`MatchedEvents` on two rules for one event	Tighten patterns; keep consumers idempotent
9	S3-triggered function loops, billing spikes	Function writes back to the trigger bucket	Runaway `Invocations`; CloudWatch billing alarm	Scope prefix/suffix or write to a different bucket
10	`Task timed out after 900.00 seconds`	Job exceeds 15-min hard limit	Logs show timeout at 900 s	Re-architect into steps; use Step Functions/Fargate
11	Cold starts spike p99 on the API	Sync path, no warm envs (esp. JVM/.NET/VPC)	`InitDuration` in logs; p99 latency	Provisioned concurrency; SnapStart; smaller package
12	`AccessDeniedException` calling an AWS service	Execution role missing a permission	CloudTrail shows the denied action	Add the action to the execution role
13	Function returns old code after deploy	Alias/trigger points to a stale version	`aws lambda get-alias`; trigger qualifier	Update alias to new version; trigger the alias
14	Reserved concurrency “broke” the function	Set to 0 (which disables it)	`ReservedConcurrentExecutions: 0`	Use unset for “no reservation,” not 0
15	`Runtime exited` / exit code 137 (SIGKILL)	OOM — memory too low	`MaxMemoryUsed` near limit	Increase `MemorySize`; fix the leak

The expanded form for the ones that cause the most damage:

1. Async events silently disappear. Root cause: An async invocation (from S3, SNS, EventBridge, or an SDK call with InvocationType=Event) failed, was retried twice, and had no on-failure destination or DLQ, so the event was dropped. Confirm: aws lambda get-function-event-invoke-config --function-name <fn> returns no DestinationConfig (or no invoke config at all); the Errors metric is non-zero while nothing lands in any queue. Fix: Configure an OnFailure destination (SQS preferred) on every async function; alarm on the destination queue’s ApproximateNumberOfMessagesVisible > 0.

2. Stream consumer stuck for hours. Root cause: A poison record that always fails, combined with the default MaximumRetryAttempts=-1 (infinite), blocks its shard — every record behind it waits. Confirm: The IteratorAge metric climbs steadily (records aging without being processed); aws lambda get-event-source-mapping --uuid <id> shows MaximumRetryAttempts: -1. Fix: Set a finite MaximumRetryAttempts and a MaximumRecordAge, enable BisectBatchOnFunctionError to isolate the bad record, and add a DestinationConfig.OnFailure so the failed batch’s metadata is captured.

3. Same event processed twice. Root cause: At-least-once delivery (async, stream, or queue) delivered a duplicate, and the handler is not idempotent, so a side effect (charge, email, counter) ran twice. Confirm: Duplicate side effects with the same source event ID; no dedup/idempotency store in the code path. Fix: Derive an idempotency key from the event and conditional-write it to DynamoDB (attribute_not_exists) before the side effect; or use Powertools Idempotency.

4. SQS messages reprocessed repeatedly. Root cause: The queue’s visibility timeout is shorter than ~6× the function timeout, so a still-running invocation’s message becomes visible and is redelivered to a second environment. Confirm: Compare the queue’s VisibilityTimeout to the function’s Timeout; duplicates correlate with slow invocations. Fix: Set the visibility timeout to at least 6× the function timeout (e.g. 180s for a 30s function).

5. 429 TooManyRequestsException under load. Root cause: Demand exceeded available concurrency: the account’s 1,000 default, a too-small reserved allocation, or another function hogging the pool. Confirm: Throttles metric > 0; ConcurrentExecutions pinned at the limit; aws lambda get-account-settings shows an UnreservedConcurrentExecutions value that is small relative to the account limit, meaning reservations on other functions have claimed most of the pool. Fix: Raise the account concurrency quota via Service Quotas; set reserved concurrency to guarantee this function a slice; ensure synchronous callers back off and retry.

6. Database refuses connections during spikes. Root cause: Lambda scaled to hundreds of environments, each opening a connection, exceeding the database’s max_connections — a stampede no RDS instance survives. Confirm: RDS DatabaseConnections at the ceiling exactly when the function fans out; function errors are connection failures. Fix: Cap the function with reserved concurrency sized to the DB’s connection budget, and front the database with RDS Proxy so functions share a pool.

9. S3-triggered function loops and billing spikes. Root cause: A function triggered on s3:ObjectCreated:* writes a derived object back into the same bucket, re-triggering itself in a runaway loop. Confirm: Invocations climbing without external cause; a billing alarm fires; the same bucket appears as both source and write target. Fix: Scope the notification to a narrow prefix/suffix that excludes derived objects, or write outputs to a different bucket entirely.
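The conditional-write fix for failure 3 can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `table` is assumed to behave like a boto3 DynamoDB Table resource, the partition key name `pk` and the `expires_at` TTL attribute are hypothetical, and `claim_event` and `handle_once` are names invented for this example.

```python
import time


def claim_event(table, event_id, ttl_seconds=86400):
    """Return True if this call claimed the event, False if it was already claimed.

    `table` is assumed to expose put_item like a boto3 DynamoDB Table
    whose partition key is named `pk` (a hypothetical schema).
    """
    try:
        table.put_item(
            Item={"pk": event_id, "expires_at": int(time.time()) + ttl_seconds},
            # The write succeeds only when no item with this key exists yet.
            ConditionExpression="attribute_not_exists(pk)",
        )
        return True
    except Exception as exc:
        # botocore's ClientError carries the error code in exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False  # duplicate delivery: skip the side effect
        raise


def handle_once(table, event_id, side_effect):
    """Run side_effect only for the first delivery of event_id."""
    if not claim_event(table, event_id):
        return "duplicate"
    side_effect()
    return "processed"
```

One design caveat: this sketch claims the key before running the side effect, so a side effect that fails after the claim will be skipped on retry unless the claim is released or tracked with a status field. Powertools Idempotency handles that in-progress state for you.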

Best practices

Always attach a failure destination. Every async function gets an OnFailure destination; every SQS source gets a redrive policy → DLQ; every stream source gets a finite retry count plus an on-failure destination. “Failed” must never mean “silently gone.”
Make every handler idempotent. At-least-once delivery is the contract for async, stream and queue paths. Use a conditional write on an event-derived key before any side effect. This single discipline prevents the worst class of serverless bug.
Set SQS visibility timeout to ≥ 6× the function timeout. The cheapest way to prevent duplicate processing from in-flight redelivery.
Cap concurrency in front of fragile downstreams. Reserved concurrency sized to a database’s connection budget, plus RDS Proxy, stops a scale-out stampede. DynamoDB has no connection limit to protect; relational databases do.
Use ReportBatchItemFailures on every batch source. Without it, one bad record reprocesses the whole batch (and can loop). With it, good records checkpoint and only the failures retry.
Bound stream retries and bisect on error. Never leave MaximumRetryAttempts=-1; a poison record will otherwise block the shard forever. Bisect to isolate it.
Initialise expensive things in module scope, not the handler. SDK clients, DB pools, parsed config — paid once per environment, skipped on warm invocations. The simplest cold-start and cost win.
Choose the broker deliberately. SNS for low-latency fan-out, EventBridge for content-routed choreography with replay, SQS for buffering and backpressure. Combine SNS→SQS for fan-out with durability.
Reach for Step Functions past three steps. Don’t chain Lambdas to model a workflow; a state machine gives retries, branching, and a visual execution history for free.
Right-size memory by profiling, not defaulting. CPU scales with memory; a CPU-bound function is often faster and cheaper at 1,024 MB than 128 MB. Use Power Tuning.
Provisioned concurrency only where latency SLOs demand it. It costs money even idle; reserve it for user-facing synchronous paths with cold-start sensitivity, not background workers.
Alarm on the leading indicators: Throttles, IteratorAge, DLQ depth, Errors, and downstream connection counts — not just “function errored.”
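The partial-batch practice above looks roughly like this in an SQS handler. It is a sketch under assumptions: `process` and its `order_id` check are hypothetical business logic, and the event-source mapping must have ReportBatchItemFailures enabled for Lambda to honour the return value.

```python
import json


def process(body):
    """Hypothetical business logic; raises on a message it cannot handle."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event, context):
    failures = []
    for record in event.get("Records", []):
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Report only this message; the others in the batch are
            # treated as successful and removed from the queue.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty `batchItemFailures` list marks the whole batch as successful, so the handler must never swallow an error without adding the message ID to the list.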

The alarms worth wiring before the next incident — leading indicators, not lagging:

Alarm on	Metric	Threshold (starting point)	Why it’s leading
Throttling	`Throttles`	> 0 sustained 5 min	Concurrency ceiling before users feel 429s
Stream lag	`IteratorAge`	> 60,000 ms	Shard falling behind / poison record blocking
Dead-letter fill	DLQ `ApproximateNumberOfMessagesVisible`	> 0	Events are leaving the happy path
Error rate	`Errors`	> 1% of invocations	Handler failing — confirm with logs
Async age	`AsyncEventAge`	climbing toward 6 h	Async backlog not draining
Downstream saturation	RDS `DatabaseConnections`	> 80% of max	Stampede before the DB refuses
Cold-start latency	`InitDuration` p99	> your SLO	Sync path latency creeping up
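The first three rows of the table can be expressed as alarm definitions. The sketch below only builds dictionaries whose keys follow CloudWatch's PutMetricAlarm input; the function name `doc-processor`, the queue name `doc-processor-dlq`, the alarm names, and the 60-second period are assumptions for illustration.

```python
def alarm_kwargs(name, namespace, metric, dimension, statistic, threshold,
                 periods=1):
    """Build keyword arguments shaped like CloudWatch's PutMetricAlarm input."""
    dim_name, dim_value = dimension
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": dim_name, "Value": dim_value}],
        "Statistic": statistic,
        "Period": 60,                  # seconds per datapoint (assumed)
        "EvaluationPeriods": periods,  # consecutive breaching datapoints
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


FN = ("FunctionName", "doc-processor")  # hypothetical function name
ALARMS = [
    # Throttles > 0 sustained for 5 minutes
    alarm_kwargs("throttles", "AWS/Lambda", "Throttles", FN, "Sum", 0, periods=5),
    # IteratorAge above 60,000 ms
    alarm_kwargs("stream-lag", "AWS/Lambda", "IteratorAge", FN, "Maximum", 60_000),
    # Anything at all sitting in the dead-letter queue
    alarm_kwargs("dlq-depth", "AWS/SQS", "ApproximateNumberOfMessagesVisible",
                 ("QueueName", "doc-processor-dlq"), "Maximum", 0),
]
```

Each dictionary can be passed to a boto3 CloudWatch client as `put_metric_alarm(**kwargs)`, with an `AlarmActions` entry added for whatever notification topic you use.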

Security notes

Least-privilege execution roles. Each function gets its own role scoped to exactly the actions and resources it needs — a specific table ARN, a specific queue, a specific KMS key. Never attach AdministratorAccess or broad * policies; the IAM least-privilege discipline matters more, not less, when you have dozens of functions.
Secrets out of environment variables. Environment variables are visible in the console and the API. Put secrets in Secrets Manager or SSM Parameter Store (SecureString) and fetch them in init code; grant the role only the specific secret’s ARN.
Encrypt at rest and in transit. Lambda encrypts env vars with KMS — use a customer-managed key for sensitive config and grant the role kms:Decrypt on just that key. Ensure downstream calls (to RDS, S3) use TLS.
Validate and bound event input. Treat every event as untrusted: validate schema, cap sizes, and never eval/deserialize untrusted payloads. A malformed event should fail cleanly to a DLQ, not crash or get exploited.
Scope resource policies tightly. When S3/SNS/EventBridge invokes your function, the add-permission statement should pin --source-arn (and --source-account) so only that bucket/topic/rule can invoke it — not the whole service.
VPC only when needed. Putting Lambda in a VPC is for reaching private resources (RDS, internal services); it adds an ENI and (historically) cold-start cost. Don’t VPC-attach a function that only calls public AWS APIs.
Sign and scan deployment artifacts. For container-image functions, pin image digests, scan with ECR scanning, and pull from a private registry. Code Signing for Lambda can enforce that only signed code deploys.
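As an illustration of "validate and bound event input", here is a minimal sketch for one record of an S3 ObjectCreated notification. The size cap, the accepted suffixes, and the `InvalidEvent` and `parse_s3_record` names are assumptions made for this example; a real pipeline would pick limits to match its own documents.

```python
from urllib.parse import unquote_plus

MAX_BYTES = 50 * 1024 * 1024          # assumed size cap
ALLOWED_SUFFIXES = (".pdf", ".docx")  # assumed accepted types


class InvalidEvent(ValueError):
    """Raised for records that should go to the failure path untouched."""


def parse_s3_record(record):
    """Extract (bucket, key, size) from one S3 notification record, or raise."""
    try:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        size = int(record["s3"]["object"]["size"])
    except (KeyError, TypeError, ValueError) as exc:
        raise InvalidEvent(f"malformed record: {exc!r}") from exc
    if not key.lower().endswith(ALLOWED_SUFFIXES):
        raise InvalidEvent(f"unsupported object type: {key}")
    if not 0 < size <= MAX_BYTES:
        raise InvalidEvent(f"object size out of bounds: {size}")
    return bucket, key, size
```

Raising a dedicated exception lets the handler tell "bad input, do not retry" apart from transient failures, so malformed events land in the on-failure destination instead of burning retries.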

The security controls that also improve resilience — they pull the same direction here:

Control	Mechanism	Secures against	Also prevents
Per-function least-privilege role	Scoped IAM policy	Blast radius of a compromised function	`AccessDenied` surprises when a shared role is edited
Secrets in Secrets Manager / Parameter Store	Secret or SecureString + role grant	Secrets leaking via env vars	Rotation breaking a hard-coded value
`--source-arn` on invoke permission	Resource policy condition	Any topic/bucket invoking your fn	Accidental cross-source triggering
Input validation + size caps	Handler-level checks	Injection / oversized payloads	Poison records crashing the consumer
Customer-managed KMS key	`kms:Decrypt` scoped	Unauthorised decrypt of config	Silent init failure (grant it correctly)
ECR scanning + digest pinning	Image supply chain	Tampered/unknown images	Surprise breakage from a moved tag

Cost & sizing

The bill drivers and how they interact with the design:

Requests + GB-seconds dominate. You pay per request (a few cents per million) and per GB-second (memory × duration). A function at 256 MB running 200 ms costs a fraction of one at 1,024 MB running 800 ms. But memory and duration trade off against each other: if 4× the memory makes a CPU-bound function finish in a quarter of the time, the GB-seconds are identical and the response is faster, and any speedup beyond that makes the larger setting cheaper. Profile.
Idle is free. Unlike a 24×7 EC2 instance, a Lambda pipeline costs effectively nothing overnight. Parcelo’s drop from ₹14,000 to ₹3,800 was mostly the elimination of idle.
Provisioned concurrency is the exception — you pay hourly for pre-warmed environments whether or not they run. Use it only where a latency SLO justifies it; it can quietly dominate the bill of a low-traffic function.
The break-even with containers. Per-invoke billing wins for spiky/intermittent traffic and loses for sustained high RPS. As a rough rule, once a function runs near-continuously at high concurrency, a reserved Fargate task or EC2 may be cheaper — model both.
Free tier is generous: 1M requests and 400,000 GB-seconds per month, always free. Most dev and many small-prod workloads stay within it.

A rough monthly picture for a moderate event pipeline (say 50M events/month, 256 MB, ~150 ms each):

Cost driver	What you pay for	Rough INR / month	Watch-out
Lambda requests	~50M invocations	~₹700–900	Batch where possible to cut request count
Lambda GB-seconds	256 MB × 150 ms × 50M	~₹2,500–4,000	Right-size memory; shorten duration
Provisioned concurrency	N warm envs × hours	~₹1,000+ per 10 envs	Idle cost — only for latency SLOs
SQS requests	Polls + sends	~₹300–600	Batching window reduces poll count
DynamoDB (on-demand)	Idempotency + projection writes	~₹500–1,500	TTL the idempotency table
CloudWatch Logs	Ingestion + storage	~₹500–2,000	Set retention; sample noisy logs
RDS Proxy (if used)	Per-vCPU-hour of the DB	~₹1,500–3,000	Only if fronting a relational DB
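The two Lambda rows in the table can be reproduced with simple arithmetic. The sketch below assumes x86 list prices of $0.20 per million requests and $0.0000166667 per GB-second, an exchange rate of ₹83 per US dollar, and ignores the free tier; all three are assumptions to replace with current figures for your region.

```python
REQUEST_USD_PER_MILLION = 0.20   # assumed list price
GB_SECOND_USD = 0.0000166667     # assumed list price, x86
INR_PER_USD = 83.0               # assumed exchange rate


def gb_seconds(memory_mb, duration_ms, invocations):
    """Billable compute: memory in GB times duration in seconds times count."""
    return (memory_mb / 1024) * (duration_ms / 1000) * invocations


def monthly_cost_inr(memory_mb, duration_ms, invocations):
    """Return (request cost, compute cost) in INR, free tier ignored."""
    requests_usd = invocations / 1_000_000 * REQUEST_USD_PER_MILLION
    compute_usd = gb_seconds(memory_mb, duration_ms, invocations) * GB_SECOND_USD
    return requests_usd * INR_PER_USD, compute_usd * INR_PER_USD


# The scenario from the table: 50M events, 256 MB, 150 ms each.
requests_inr, compute_inr = monthly_cost_inr(256, 150, 50_000_000)
print(f"requests ~ INR {requests_inr:,.0f}, compute ~ INR {compute_inr:,.0f}")

# Memory versus duration: 4x the memory at a quarter of the time bills the same.
small_slow = gb_seconds(256, 800, 1)
large_fast = gb_seconds(1024, 200, 1)
```

Under these assumptions the scenario comes to about ₹830 for requests and about ₹2,594 for compute, both inside the ranges shown. The last two lines show why right-sizing needs profiling: the larger setting only becomes cheaper once the speedup exceeds the memory multiple.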

Sizing rules of thumb: start at 256 MB and profile up; set timeout to ~2× the observed p99 duration (not the 15-min max); set batch size to balance throughput against the cost of reprocessing a failed batch; and cap reserved concurrency to whatever your most fragile downstream can survive. The cheapest correct pipeline is almost always “small functions, buffered sources, idempotent handlers, right-sized memory” — not a bigger anything.
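These rules of thumb are mechanical enough to write down. The helper below is a sketch, not a recommendation engine: `suggest_settings` is a hypothetical name, the 20% connection headroom is an assumption, and adding the batching window to the visibility timeout reflects AWS's published guidance for SQS event sources as understood here.

```python
import math

LAMBDA_MAX_TIMEOUT_S = 900  # the 15-minute hard limit


def suggest_settings(p99_duration_s, db_max_connections=None, batch_window_s=0):
    """Derive starting values from an observed p99 duration in seconds."""
    # Timeout: about twice the observed p99, never above the hard limit.
    timeout = min(LAMBDA_MAX_TIMEOUT_S, max(1, math.ceil(2 * p99_duration_s)))
    settings = {
        "Timeout": timeout,
        # SQS visibility timeout: six times the function timeout,
        # plus any batching window configured on the mapping.
        "VisibilityTimeout": 6 * timeout + batch_window_s,
    }
    if db_max_connections is not None:
        # Keep 20% of connections free for admin sessions (assumed headroom).
        settings["ReservedConcurrentExecutions"] = max(
            1, db_max_connections * 8 // 10
        )
    return settings
```

For a function with a 15-second p99 this yields a 30-second timeout and a 180-second visibility timeout, matching the 30 s / 180 s pairing used earlier in the playbook.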

Interview & exam questions

1. Explain Lambda’s four invocation models and why the distinction matters. Synchronous (caller waits, caller retries), asynchronous (Lambda queues it, retries twice, sends to a DLQ/destination or drops it), stream poll (Lambda polls Kinesis/DynamoDB Streams, per-shard ordering, checkpointing), and queue poll (Lambda polls SQS, visibility-timeout-driven redelivery). The model determines who retries, how many times, the ordering guarantee, and where data goes on failure — so it dictates how you design for correctness.

2. An async-triggered function’s events are disappearing. What’s happening and how do you fix it? Async invocations retry twice and then send the event to a configured on-failure destination or DLQ — and if none is configured, the event is dropped. Confirm there’s no DestinationConfig via get-function-event-invoke-config. Fix by attaching an OnFailure destination (SQS) and alarming on its depth.

3. Why must an SQS visibility timeout be at least 6× the function timeout? When Lambda reads a message it becomes invisible for the visibility-timeout window. If that window is shorter than the time the function needs, the message becomes visible again and is redelivered to a second environment while the first is still processing, which produces duplicates immediately. The 6× multiple leaves room for Lambda to retry a batch whose invocation was throttled before the message reappears in the queue.

4. What is a poison record on a stream, and how do you prevent it blocking the shard? A record that always fails. Because stream records are processed in order and the default MaximumRetryAttempts is -1 (infinite), the bad record is retried forever, blocking every record behind it on that shard. Prevent it with a finite retry count, a MaximumRecordAge, BisectBatchOnFunctionError to isolate it, and an on-failure destination.

5. Why is idempotency mandatory in event-driven Lambda, and how do you implement it? Async, stream and queue deliveries are at-least-once — the same event will eventually arrive twice. Without idempotency, side effects (charges, emails, counters) double. Implement it by deriving a key from the event and doing a conditional write (attribute_not_exists) to DynamoDB before the side effect, so the second attempt is a no-op.

6. SNS vs EventBridge vs SQS for fan-out — how do you choose? SNS: low-latency one-to-many push fan-out, limited filtering. EventBridge: rich content-based routing, schema registry, archive/replay — the choreography hub. SQS: not fan-out at all but a buffer giving backpressure, retries and a DLQ. Combine SNS→SQS to get fan-out with per-consumer durability.

7. A Lambda is exhausting a relational database’s connections during traffic spikes. Fix? Lambda scales to hundreds of concurrent environments, each opening a connection, blowing past max_connections. Cap the function with reserved concurrency sized to the DB’s connection budget, and put RDS Proxy in front so the functions share a pooled set. DynamoDB wouldn’t have this problem because it is called over stateless HTTPS requests rather than through a fixed pool of persistent connections.

8. What causes cold starts and what reduces them? Initialising a new environment: code/layer download, runtime bootstrap, and your init code (clients, connections). Reduce with smaller packages, initialising clients in module scope, right-sizing memory (more CPU → faster init), provisioned concurrency (pre-warmed envs), and SnapStart for Java/.NET/Python. It only matters on latency-critical synchronous paths.

9. Difference between reserved and provisioned concurrency? Reserved concurrency caps and guarantees a function’s slice of the account pool (free — it’s just allocation), used to protect downstreams or other functions. Provisioned concurrency pre-warms a number of environments to eliminate cold starts (paid hourly even when idle), used for latency-sensitive paths. Setting reserved to 0 disables the function.

10. When is Lambda the wrong choice? For long-running (>15 min), stateful, or sustained high-CPU/steady-state workloads, where per-invoke billing loses to a reserved container and the timeout/statelessness constraints fight you. Use ECS/EKS/Fargate or EC2 there; use Lambda for event-shaped, intermittent, spiky work.

11. How do you process a batch from SQS so one bad message doesn’t reprocess the whole batch? Enable ReportBatchItemFailures on the event-source mapping and return a batchItemFailures list of only the failed message IDs. Lambda then deletes the successful messages and redelivers only the failures, instead of redelivering the entire batch on any single failure.

12. An EventBridge event is being handled twice. Why? Two rules whose event patterns both match the same event each fire — there is no first-match-wins in EventBridge. Confirm with the MatchedEvents metric on both rules. Fix by tightening the patterns (more specific source + detail-type + content fields) and keeping consumers idempotent.

These map to AWS Certified Developer – Associate (DVA-C02) — develop event-driven and serverless solutions, Lambda configuration, SQS/SNS/EventBridge integration, error handling — and AWS Certified Solutions Architect – Associate (SAA-C03) — design decoupled and event-driven architectures, choosing between fan-out brokers, and resilience patterns. A compact cert mapping:

Question theme	Primary cert	Objective area
Invocation models, retries, DLQs	DVA-C02	Develop event-driven solutions
Idempotency, at-least-once	DVA-C02	Resilient application design
Fan-out broker choice	SAA-C03	Design decoupled architectures
Concurrency, throttling, scaling	DVA-C02 / SAA-C03	Performance & resilience
Stream poison records, bisect	DVA-C02	Troubleshoot serverless
Cold starts, provisioned concurrency	DVA-C02	Optimize serverless performance

Quick check

An S3-triggered function’s events sometimes vanish with nothing in any queue. Which invocation model is this, and what one thing is almost certainly missing?
Your Kinesis consumer’s IteratorAge has been climbing for two hours with no progress. What’s the single most likely cause, and the setting that’s enabling it?
True or false: setting a function’s reserved concurrency to 0 is a good way to give it a tiny guaranteed slice.
Why might the same SQS message be processed by two execution environments at once, and what’s the rule that prevents it?
You need one “order placed” event to reach four independent consumers, each with its own retry and DLQ. What’s the cleanest pattern?

Answers

It’s the asynchronous model (S3 invokes Lambda async). Almost certainly missing: an on-failure destination / DLQ — async retries twice and then drops the event if none is configured. Confirm with aws lambda get-function-event-invoke-config and attach an OnFailure destination.
A poison record that always fails, blocking the shard because the default MaximumRetryAttempts is -1 (infinite) — every record behind it waits. Fix with a finite retry count, MaximumRecordAge, BisectBatchOnFunctionError, and an on-failure destination.
False. Reserved concurrency of 0 disables the function entirely. For “no reservation,” leave it unset/null; to give a small slice, set a small positive number.
Because the visibility timeout is shorter than the time the function takes, so the message reappears and is redelivered to a second environment while the first is still working. The rule: set the visibility timeout to at least 6× the function timeout.
SNS→SQS fan-out: an SNS topic fans the event out to four SQS queues, one per consumer; each queue buffers for its own Lambda with its own redrive policy → DLQ. You get SNS’s fan-out and SQS’s per-consumer durability and backpressure. (EventBridge with four rule targets is the alternative when you want content-based routing and replay.)
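The poison-record fix in the second answer comes down to four settings on the event-source mapping. The sketch below builds arguments whose keys follow Lambda's UpdateEventSourceMapping input; the function name `bounded_stream_settings` and the default values of 3 retries and a one-hour record age are illustrative assumptions.

```python
def bounded_stream_settings(mapping_uuid, on_failure_arn,
                            max_retries=3, max_record_age_s=3600):
    """Build arguments shaped like Lambda's UpdateEventSourceMapping input."""
    if max_retries < 0:
        raise ValueError("max_retries must be finite (-1 means retry forever)")
    return {
        "UUID": mapping_uuid,
        "MaximumRetryAttempts": max_retries,            # replaces the -1 default
        "MaximumRecordAgeInSeconds": max_record_age_s,  # give up on stale records
        "BisectBatchOnFunctionError": True,             # split to isolate the bad record
        "DestinationConfig": {"OnFailure": {"Destination": on_failure_arn}},
    }
```

The result can be passed to a boto3 Lambda client as `update_event_source_mapping(**kwargs)`. Rejecting negative retry counts in the helper keeps the infinite default from slipping back in through configuration.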

Glossary

Function — the unit of deployment: your code, its runtime, and configuration (memory, timeout, role, environment).
Handler — the entry point Lambda calls for each invocation, which carries a single event or, for event-source mappings, a batch of records; runs on a warm or cold environment.
Init code — code outside the handler (module scope) run once per execution environment; where you cache SDK clients, DB pools and parsed config to beat cold starts.
Execution environment — the Lambda-managed micro-VM that runs your code; reused (warm) when possible, created anew (cold start) otherwise.
Invocation model — synchronous, asynchronous, or poll-based (stream/queue); determined by the trigger and dictating retry, ordering and error behaviour.
Synchronous invocation — the caller blocks for the response; Lambda does not retry; the caller owns retries (API Gateway, ALB, SDK RequestResponse).
Asynchronous invocation — the event is queued internally, 202 returned immediately; Lambda retries (default twice) then routes to an on-failure destination/DLQ or drops it (S3, SNS, EventBridge).
Event-source mapping — Lambda’s managed poller for a stream or queue (Kinesis, DynamoDB Streams, SQS, Kafka, MQ), controlling batch size, concurrency and checkpointing.
At-least-once delivery — the contract for async and poll-based paths: the same event may be delivered more than once, so handlers must be idempotent.
Idempotency — the property that processing the same event twice produces the same result with no doubled side effects; implemented via a conditional write keyed on the event.
Concurrency — the number of execution environments running simultaneously; the scaling unit, bounded by the account limit (default 1,000/region).
Reserved concurrency — a per-function guaranteed-and-capped slice of the account pool (free); setting it to 0 disables the function.
Provisioned concurrency — pre-warmed execution environments that eliminate cold starts for a version/alias; billed hourly even when idle.
Cold start — the latency of initialising a new environment (code download, runtime bootstrap, init code); a slow first request, not an error.
DLQ / on-failure destination — where an event goes after retries are exhausted; on-failure destinations (SQS/SNS/EventBridge/Lambda) carry richer context than legacy DLQs and are preferred.
Visibility timeout — the window an SQS message is hidden after being read; must be ≥ 6× the function timeout to prevent duplicate processing.
Poison record — a stream/queue record that always fails; with infinite retries it blocks the shard, fixed with finite retries, record-age bounds, and bisect-on-error.
Partial-batch response (ReportBatchItemFailures) — returning only the failed record IDs from a batch so successful records checkpoint and only failures retry.
Fan-out / fan-in — one event delivered to many consumers (SNS/EventBridge), or many events aggregated to one store; SNS→SQS combines fan-out with per-consumer durability.

Next steps

You can now choose the right invocation model, wire each event source correctly, and design for at-least-once delivery without losing events. Build outward:

Next: Compute on AWS: EC2 vs Lambda vs ECS vs EKS — when a function is the wrong tool and a container or VM wins.
Related: ECS, EKS & Fargate: Choosing Your Container Path — the long-running, stateful counterpart to event-driven functions.
Related: ALB vs NLB vs API Gateway, Compared — the synchronous front door for your Lambda APIs.
Related: DynamoDB, RDS & Aurora, Compared — the state store for stateless functions, and the source of DynamoDB Streams triggers.
Related: AWS Organizations & IAM Foundations — least-privilege execution roles done right across many functions.