A startup processed user-uploaded documents on a long-running EC2 instance that sat idle 80% of the day. They moved the pipeline to AWS Lambda — functions triggered by S3 uploads, fanning work out through SQS — and the bill fell 80% while end-to-end latency dropped from minutes to seconds. The catch was not the migration; it was the redesign. A 40-minute batch job had to become a graph of small, idempotent functions because Lambda kills any single invocation at 15 minutes, retries async events on its own schedule, and can deliver the same event twice. The team that wins with Lambda is the team that internalises those three facts before writing a line of handler code.
This is the reference for getting event-driven Lambda right. Lambda runs your code in response to an event, gives it up to 15 minutes and up to 10 GB of memory with proportional CPU, and bills you per millisecond of execution, scaling from zero to thousands of concurrent environments without a server in sight. But “Lambda” is really four different services wearing one name, depending on how it was invoked: a synchronous request/response API, an asynchronous fire-and-forget queue, a poller for ordered streams, and a poller for queues, with destinations and DLQs catching the outcomes. Each model has its own retry behaviour, its own error surface, its own ordering and concurrency rules, and its own way of losing your data when you get it wrong. Treat them as one thing and you will ship a pipeline that drops events under load and double-charges customers on retry.
By the end you will stop guessing which model you are in. When an event “disappears,” you will know whether it died in an async retry with no DLQ, was swallowed by a poller that advanced the stream iterator past a poison record, or never arrived because an EventBridge rule pattern was one field too narrow. You will know the exact limits (15-minute timeout, 1,000 default concurrency, 256 KB async payload, 6 MB sync payload, up to 10,000 SQS messages per batch on a standard queue), the precise aws command to confirm each failure, and the Terraform to wire the fix. Because this is a reference you reach for mid-incident, the invocation models, the trigger contracts, the limits, the error codes and the failure playbook are all laid out as scannable tables: read the prose once, keep the tables open when the pipeline is on fire.
What problem this solves
Traditional servers are paid for whether they are busy or not, and they make you responsible for scaling, patching and capacity planning around traffic you cannot predict. An event-driven Lambda architecture removes the server: you write the function that reacts to “a file landed,” “a message arrived,” “a row changed,” “an order was placed,” and AWS runs exactly as many copies as the event rate demands, billing only for the milliseconds they execute. For spiky, intermittent, event-shaped workloads — image processing, ETL steps, webhooks, stream consumers, scheduled jobs, glue between services — this is unbeatable on both cost and operational burden.
What breaks without the patterns, not just the service: teams lift a monolith into one giant function and hit the 15-minute wall; they invoke Lambda synchronously from an API and watch p99 latency spike on cold starts; they trigger directly off a high-volume source with no queue and get throttled into a retry storm; they assume “exactly once” and get duplicate side effects because async and stream invocations are at-least-once. The failure mode is rarely a crash — it is silent: an event that retried into the void because no dead-letter queue was attached, a stream consumer stuck for hours because one poison record blocks the shard, a customer billed twice because the function was not idempotent.
Who hits this: anyone building on serverless past “hello world.” It bites hardest on teams new to the at-least-once delivery contract (idempotency is not optional), on high-throughput stream and queue consumers (batching, concurrency and DLQs are load-bearing), on latency-sensitive synchronous APIs (cold starts and provisioned concurrency), and on anyone fanning one event out to many consumers (SNS vs EventBridge vs SQS is an architecture decision, not a coin flip). The fix is almost never “more memory” — it is “pick the right invocation model, attach the right failure destination, and make the handler idempotent.”
To frame the whole field before the deep dive, here is every event pattern this article covers, the AWS primitive that powers it, and the one trap that defines it:
| Pattern | Powered by | What it’s for | The defining trap |
|---|---|---|---|
| Synchronous invoke | API Gateway / SDK / ALB | Request/response APIs, low-latency reads | Cold start in the user’s p99; 6 MB payload cap |
| Asynchronous invoke | S3, SNS, EventBridge | Fire-and-forget reactions | At-least-once + retries to nowhere without a DLQ |
| Stream poller | Kinesis, DynamoDB Streams | Ordered change processing | One poison record blocks the whole shard |
| Queue poller | SQS (standard / FIFO) | Buffered, decoupled work | Visibility timeout < 6× function timeout = duplicates |
| Fan-out | SNS / EventBridge | One event → many consumers | Picking the wrong broker (filtering, replay, ordering) |
| Choreography | EventBridge bus + rules | Loosely-coupled service flows | A rule pattern too broad double-delivers |
| Orchestration | Step Functions | Stateful, long, branching flows | Using Lambda chaining where a state machine belongs |
Learning objectives
By the end of this article you can:
- Identify which of Lambda’s four invocation models (sync, async, stream poll, queue poll) any given trigger uses, and predict its retry, ordering, batching and error behaviour from that alone.
- Wire each major event source (S3, SQS, SNS, Kinesis Data Streams, DynamoDB Streams, EventBridge, API Gateway) with the correct event-source-mapping or notification config, in both aws CLI and Terraform.
- Design fan-out correctly: choose SNS vs EventBridge vs SQS by filtering, replay, ordering and consumer-count needs, and combine them (SNS→SQS fan-in) where it fits.
- Guarantee correctness under at-least-once delivery: build idempotent handlers, attach the right DLQ / on-failure destination, and stop poison records from blocking a stream with bisect-on-error and maxRecordAge.
- Tune concurrency deliberately (reserved vs provisioned, account limits, burst rates) and protect downstream databases from a Lambda scale-out stampede.
- Control cold starts: what causes them, what provisioned concurrency and SnapStart fix, package and memory levers, and when the latency actually matters.
- Read the Lambda error-code and limit reference and run a symptom→cause→confirm→fix playbook for the failure modes that actually page you: throttles, dropped async events, stuck shards, duplicate processing and DLQ fill.
- Size and cost a serverless pipeline — the GB-second model, when Lambda is cheaper than a container, and where it stops being cheaper.
Prerequisites & where this fits
You should already understand the AWS basics: an IAM role (Lambda assumes an execution role for its permissions), CloudWatch Logs (where every invocation’s logs land), and the core event services at a “what they are” level — S3 buckets, SQS queues, SNS topics, EventBridge buses, and Kinesis/DynamoDB Streams. You should be able to run aws from a shell, read JSON output, and read a basic Terraform resource block. Familiarity with HTTP status codes and the idea of “retry” and “idempotency” helps.
This sits in the Serverless & Event-Driven track. It assumes the compute-model fundamentals — the Compute on AWS: EC2 vs Lambda vs ECS vs EKS decision is upstream of it, and the ECS, EKS & Fargate: Choosing Your Container Path comparison tells you when a long-running container beats a function. It pairs tightly with the front-door choices in ALB vs NLB vs API Gateway, Compared, because API Gateway is the most common synchronous trigger, and with DynamoDB, RDS & Aurora, Compared since DynamoDB is the natural state store for a stateless function (and its streams are a first-class trigger).
A quick map of who owns what when an event-driven pipeline misbehaves, so you look in the right place fast:
| Layer | What lives here | Failure classes it causes | First place to look |
|---|---|---|---|
| Event source (S3/SQS/EventBridge…) | The producer + delivery contract | Event never arrived; wrong/too-broad routing | Source metrics (e.g. NumberOfMessagesSent) |
| Event-source mapping | The poller config (batch, concurrency) | Stuck shard, throttling, batch too big | aws lambda get-event-source-mapping |
| Function (your code + role) | Handler logic, permissions | Timeout, OOM, unhandled exception, AccessDenied | CloudWatch Logs + X-Ray |
| Concurrency / scaling | Account + reserved limits | 429 TooManyRequests, throttles | ConcurrentExecutions, Throttles metrics |
| Failure destination | DLQ / on-failure target | Silent data loss on retry exhaustion | DLQ depth (ApproximateNumberOfMessages) |
| Downstream (DB/API) | Where the function writes | Connection exhaustion under scale-out | Downstream connection/throttle metrics |
Core concepts
Six mental models make every later decision obvious.
“Lambda” is four services, chosen by how it was invoked. The single most important idea in this article. A synchronous invocation (API Gateway, ALB, an SDK Invoke with RequestResponse) blocks the caller and returns the result; the caller owns retries and Lambda does not retry. An asynchronous invocation (S3, SNS, EventBridge, an SDK Invoke with Event) drops the event onto an internal queue, returns 202 immediately, and Lambda retries on failure (twice by default) before sending it to a DLQ or on-failure destination, or dropping it. A poll-based event-source mapping means Lambda polls the source for you and invokes your function with a batch, and it comes in two flavours: a stream poll (Kinesis, DynamoDB Streams; Kafka behaves similarly) that reads each shard in order and checkpoints its position, and a queue poll (SQS; MQ behaves similarly) that relies on message visibility and redrive. Knowing which of the four you are in tells you the retry count, the error surface, the ordering guarantee and where data goes when it fails, before you read another word.
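The mapping from trigger to invocation model can be kept as a small lookup when reviewing a design. The sketch below is only an illustration of the classification described above; the dictionary keys and the retry_owner helper are hypothetical names, not part of any AWS SDK.

```python
# Illustrative lookup: trigger -> invocation model, as described in the text.
# Keys and helper names are hypothetical, not an AWS API.
INVOCATION_MODEL = {
    "apigateway": "sync",
    "alb": "sync",
    "s3": "async",
    "sns": "async",
    "eventbridge": "async",
    "kinesis": "stream-poll",
    "dynamodb-streams": "stream-poll",
    "sqs": "queue-poll",
}

# Who retries after a failed invocation, per model.
RETRY_OWNER = {
    "sync": "caller",
    "async": "lambda-async-queue",
    "stream-poll": "lambda-poller",
    "queue-poll": "lambda-poller",
}


def retry_owner(trigger: str) -> str:
    """Return who owns retries for a trigger; raises KeyError if unknown."""
    return RETRY_OWNER[INVOCATION_MODEL[trigger]]


print(retry_owner("apigateway"))  # caller
print(retry_owner("s3"))          # lambda-async-queue
```

Answering "who retries?" first usually narrows an incident faster than reading handler logs.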
Your function is stateless and ephemeral; the execution environment is reused but not guaranteed. Lambda creates an execution environment (a micro-VM), runs your init code once, then runs the handler per event. AWS may reuse a warm environment for the next event (fast — init already paid) or spin up a new one (a cold start — init runs again). You get no guarantee about reuse, so anything you need across invocations lives in an external store (DynamoDB, S3, ElastiCache) — but you can exploit reuse by initialising expensive things (SDK clients, DB pools, parsed config) in init code, outside the handler, so warm invocations skip them.
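A minimal sketch of the init-versus-handler split, assuming a Python runtime. The _build_client function is a hypothetical stand-in for expensive setup (an SDK client, a DB pool); the loop at the end only simulates warm invocations in one environment.

```python
import time

INIT_COUNT = 0  # counts how often the expensive setup actually runs


def _build_client():
    """Stand-in for costly setup such as creating an SDK client or DB pool."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"created_at": time.time()}


# Module scope: runs once per execution environment (that is, on a cold start).
CLIENT = _build_client()


def handler(event, context):
    # Warm invocations reuse CLIENT; nothing in here rebuilds it.
    return {"client_created_at": CLIENT["created_at"], "echo": event}


# Simulate three warm invocations in the same environment.
for i in range(3):
    handler({"n": i}, None)
print(INIT_COUNT)  # 1
```

Anything placed in module scope must be safe to reuse across unrelated events, so keep per-request data out of it.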
Delivery is at-least-once; design for duplicates or be wrong. Async invocations and stream/queue pollers can deliver the same event more than once (a retry after a partial success, a re-drive, a poller redelivery). Only synchronous invocation is “exactly as many times as the caller called.” If processing an event has a side effect — charging a card, incrementing a counter, sending an email — and you do not make it idempotent (safe to run twice with the same result), at-least-once delivery will eventually double it. Idempotency is not a nice-to-have; it is the price of admission to event-driven Lambda.
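One common way to get idempotency is to key each side effect on an identifier carried by the event and record completion. The sketch below uses an in-memory dict as a stand-in for a durable store; in a real function that store would need an atomic "write if absent" operation, since the check-then-write shown here is not safe across concurrent environments. The orderId field and charge_card helper are assumptions for illustration.

```python
# In-memory stand-in for a durable store. Hypothetical; a real one needs an
# atomic conditional write so two environments cannot both pass the check.
_processed: dict[str, dict] = {}
charges: list[str] = []  # records the side effect so duplicates are visible


def charge_card(order_id: str) -> dict:
    """Hypothetical side effect that must not happen twice."""
    charges.append(order_id)
    return {"orderId": order_id, "status": "charged"}


def handler(event, context):
    key = event["orderId"]      # idempotency key taken from the event
    if key in _processed:       # duplicate delivery: return the saved result
        return _processed[key]
    result = charge_card(key)
    _processed[key] = result    # record completion for later duplicates
    return result


first = handler({"orderId": "o-123"}, None)
second = handler({"orderId": "o-123"}, None)  # simulated redelivery
print(len(charges))  # 1
```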
Concurrency is finite, shared, and the real scaling unit. Lambda scales by running more concurrent execution environments — one per simultaneous event. Your account has a default 1,000 concurrent executions across all functions in a region (raisable via quota request). One runaway function can starve every other function in the account. Reserved concurrency caps and guarantees a function’s slice; provisioned concurrency pre-warms a number of environments to kill cold starts. A burst that exceeds your available concurrency gets throttled (429 TooManyRequests), and what happens next depends on the invocation model — the sync caller sees the 429, the async/stream path retries.
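A rough way to reason about whether a workload fits is the steady-state estimate: concurrency is approximately the request rate multiplied by the average duration. The helper names below are hypothetical, and the estimate ignores burst behaviour, so treat it as a first check rather than a capacity plan.

```python
import math

ACCOUNT_LIMIT = 1_000  # default regional concurrency quota cited above


def required_concurrency(requests_per_second: float, avg_duration_s: float) -> int:
    """Steady-state concurrency: arrival rate multiplied by time in flight."""
    return math.ceil(requests_per_second * avg_duration_s)


def would_throttle(requests_per_second: float, avg_duration_s: float,
                   limit: int = ACCOUNT_LIMIT) -> bool:
    """True when the steady-state need exceeds the available concurrency."""
    return required_concurrency(requests_per_second, avg_duration_s) > limit


print(required_concurrency(200, 0.5))  # 100
print(would_throttle(5_000, 0.5))      # True (2,500 needed, 1,000 available)
```

Halving the handler duration halves the concurrency needed for the same traffic, which is why duration is a scaling lever and not only a cost lever.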
Cold start is latency, not an error. A new environment must initialise — download your code/layers, start the runtime, run your init code (SDK clients, DB connect, parsed config). That is the cold start, typically 100 ms–1 s+ depending on runtime, package size and VPC attachment. It is not a 5xx unless it blows a caller’s timeout; it is a slow first request on a fresh environment, fixed by keeping environments warm (provisioned concurrency) or making init cheap (smaller package, SnapStart, fewer/lighter dependencies).
The failure destination is where your data goes when the handler can’t. Every async and poll-based path needs an explicit answer to “what happens to an event the function repeatedly fails to process?” For async invokes that is a dead-letter queue (legacy) or an on-failure destination (preferred — SQS, SNS, EventBridge, or another Lambda, with richer metadata). For SQS event sources it is the queue’s own redrive policy → DLQ. For stream sources it is an on-failure destination plus bisect/age controls. Leave it unset and “failed” means “silently gone” — the single most common way to lose production events.
The vocabulary in one table
Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this is the mental model side by side:
| Concept | One-line definition | Where it lives | Why it matters |
|---|---|---|---|
| Function | Code + runtime + config, the unit of deploy | Lambda service | The thing that runs and scales |
| Handler | The entry point AWS calls per event | Your code | Runs once per event (warm or cold) |
| Init code | Code outside the handler, run once per env | Your code (module scope) | Where you cache clients/pools to beat cold starts |
| Execution environment | The micro-VM that runs your code | Lambda-managed | Reused (warm) or new (cold start) |
| Invocation model | Sync / async / poll-based | Determined by trigger | Sets retry, ordering, error surface |
| Event-source mapping | Lambda’s poller for a stream/queue | Lambda + source | Batching, concurrency, checkpointing |
| Concurrency | Simultaneous environments running | Account + per-function | The scaling unit; throttles when exceeded |
| Reserved concurrency | A function’s guaranteed/capped slice | Per function | Protect others / protect downstream |
| Provisioned concurrency | Pre-warmed environments | Per function/version | Kills cold starts for latency-critical paths |
| DLQ / on-failure dest | Where exhausted events go | Per function / per queue | The difference between “logged” and “lost” |
| Idempotency | Safe to process the same event twice | Your handler design | Required under at-least-once delivery |
| Cold start | First-request latency on a fresh env | Environment lifecycle | Slow first call; can trip caller timeouts |
The four invocation models, end to end
This is the depth anchor: get this table and the four sub-sections right and 80% of event-driven Lambda bugs become obvious. The four models differ on who retries, how many times, in what order, and where data goes when it fails.
| Property | Synchronous | Asynchronous | Stream poll (Kinesis/DDB) | Queue poll (SQS) |
|---|---|---|---|---|
| Triggers | API GW, ALB, SDK RequestResponse | S3, SNS, EventBridge, SDK Event | Kinesis, DynamoDB Streams | SQS standard / FIFO |
| Who retries | The caller | Lambda (async queue) | Lambda (poller, in place) | Lambda (poller, via visibility) |
| Default retries | 0 (caller’s job) | 2 (configurable 0–2) | until success or maxRecordAge/retryAttempts | until success or moved to DLQ |
| Ordering | N/A (one call) | None | Per shard / partition key | None (FIFO: per message group) |
| Batching | One event | One event | Batch per shard | Batch (≤10k / 6 MB) |
| Delivery | Exactly as called | At-least-once | At-least-once | At-least-once |
| Payload limit | 6 MB req / 6 MB resp | 256 KB | 6 MB batch | 6 MB batch |
| On exhaustion | Error returned to caller | DLQ / on-failure dest or dropped | On-failure dest or blocks shard | Queue redrive → DLQ |
| Throttle behaviour | Caller gets 429 | Retried with backoff (up to ~6 h) | Poller backs off, shard waits | Poller backs off, messages stay |
Synchronous invocation
The caller sends a request and blocks for the response. API Gateway, ALB, Cognito triggers, and any SDK Invoke with InvocationType=RequestResponse are synchronous. Lambda does not retry — if the function errors or times out, the error goes straight back to the caller, who decides whether to retry. This is the model for request/response APIs where latency matters and the user is waiting, which is exactly why cold starts and the 6 MB payload cap bite here.
# Synchronous invoke — the CLI blocks until the function returns
aws lambda invoke --function-name order-api \
--invocation-type RequestResponse \
--payload '{"orderId":"o-123"}' --cli-binary-format raw-in-base64-out \
response.json
The synchronous-specific limits and behaviours, because they shape your API design:
| Aspect | Value / behaviour | Why it matters |
|---|---|---|
| Request payload | 6 MB | Large uploads must go via S3 + presigned URL, not the body |
| Response payload | 6 MB (buffered) / 20 MB (streamed) | Use response streaming for large responses on supported runtimes |
| Retries by Lambda | None | The caller (API GW, your SDK) owns retry + backoff |
| Timeout visibility | Caller sees the timeout/error | Set function timeout < API GW’s 29 s integration timeout |
| Cold start in path | Yes — in the user’s latency | Provisioned concurrency for latency SLOs |
| Concurrency throttle | 429 to the caller | API GW returns 502/429; client must handle it |
Asynchronous invocation
The caller (S3 event, SNS, EventBridge, or InvocationType=Event) hands the event to Lambda’s internal async queue and gets an immediate 202 Accepted — it does not wait for processing. Lambda then invokes your function and, on failure, retries twice by default with backoff, over a window up to 6 hours. If all attempts fail, the event goes to your configured on-failure destination (or legacy DLQ) — and if none is configured, it is dropped. This is the model where events silently disappear.
# Async invoke — returns 202 immediately, processing happens later
aws lambda invoke --function-name thumbnail-generator \
--invocation-type Event \
--payload '{"bucket":"uploads","key":"a.png"}' --cli-binary-format raw-in-base64-out \
/dev/stdout
# Configure retries + an on-failure destination (the critical bit)
aws lambda put-function-event-invoke-config --function-name thumbnail-generator \
--maximum-retry-attempts 2 --maximum-event-age-in-seconds 3600 \
--destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:ap-south-1:111122223333:thumb-dlq"}}'
resource "aws_lambda_function_event_invoke_config" "thumb" {
  function_name                = aws_lambda_function.thumb.function_name
  maximum_retry_attempts       = 2    # 0–2
  maximum_event_age_in_seconds = 3600 # 60–21600 (6h)
  destination_config {
    on_failure { destination = aws_sqs_queue.thumb_dlq.arn }
    on_success { destination = aws_sns_topic.thumb_done.arn } # optional success routing
  }
}
The async knobs and exactly what each controls:
| Setting | Controls | Default | Range | When to change |
|---|---|---|---|---|
| MaximumRetryAttempts | Async retries after first failure | 2 | 0–2 | 0 if the source already retries; keep 2 for transient errors |
| MaximumEventAgeInSeconds | How long Lambda keeps retrying | 21,600 (6 h) | 60–21,600 | Lower to fail fast on time-sensitive events |
| OnFailure destination | Where exhausted events go | none → dropped | SQS/SNS/EventBridge/Lambda | Always set in production |
| OnSuccess destination | Route successful outcomes | none | SQS/SNS/EventBridge/Lambda | Event-driven success chains |
| Legacy DeadLetterConfig | Old DLQ (less metadata) | none | SQS/SNS | Prefer on-failure destination instead |
The difference between a legacy DLQ and an on-failure destination matters enough to tabulate:
| Aspect | Legacy DLQ (DeadLetterConfig) | On-failure destination (DestinationConfig) |
|---|---|---|
| Targets | SQS, SNS | SQS, SNS, EventBridge, Lambda |
| Payload | The original event only | Event + invocation context (error, attempts, request id) |
| Success routing | No | Yes (OnSuccess) |
| Recommended | Legacy; avoid for new work | Preferred for all new async functions |
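To make the "event + invocation context" row concrete, here is a sketch of a consumer that summarises a record from an on-failure destination. The field names follow the destination record format as I understand it from the Lambda documentation (requestContext, requestPayload, responseContext, responsePayload); check them against the current docs before relying on them. The sample values and the summarise_failure helper are invented for illustration.

```python
import json

# Sample record shaped like what Lambda sends to an on-failure destination.
# All values are made up for illustration.
SAMPLE = json.dumps({
    "version": "1.0",
    "timestamp": "2024-01-01T00:00:00.000Z",
    "requestContext": {
        "requestId": "00000000-0000-0000-0000-000000000000",
        "functionArn": "arn:aws:lambda:ap-south-1:111122223333:function:thumbnail-generator:$LATEST",
        "condition": "RetriesExhausted",
        "approximateInvokeCount": 3,
    },
    "requestPayload": {"bucket": "uploads", "key": "a.png"},
    "responseContext": {"statusCode": 200, "functionError": "Unhandled"},
    "responsePayload": {"errorType": "ValueError", "errorMessage": "bad image"},
})


def summarise_failure(body: str) -> dict:
    """Extract why the event failed, how often it ran, and the original event."""
    record = json.loads(body)
    ctx = record["requestContext"]
    return {
        "condition": ctx["condition"],
        "attempts": ctx["approximateInvokeCount"],
        "error": record.get("responsePayload", {}).get("errorType"),
        "original_event": record["requestPayload"],
    }


print(summarise_failure(SAMPLE)["attempts"])  # 3
```

Because the original event travels inside the record, a re-drive tool can resubmit requestPayload without consulting the producer.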
Stream poll (Kinesis & DynamoDB Streams)
Lambda runs a poller that reads records from each shard in order and invokes your function with a batch. Records within a shard (i.e., a given partition key) are processed in order, one batch at a time — which is the whole point and also the whole danger: a poison record that always fails will, by default, be retried until it expires, blocking every record behind it on that shard. You control this with batch size, parallelization, retry attempts, record age, bisect-on-error, and an on-failure destination that receives metadata about the failed batch.
# Create a stream event-source mapping with poison-pill controls
aws lambda create-event-source-mapping --function-name order-projector \
--event-source-arn arn:aws:kinesis:ap-south-1:111122223333:stream/orders \
--starting-position LATEST --batch-size 100 \
--maximum-batching-window-in-seconds 5 \
--parallelization-factor 4 \
--maximum-retry-attempts 3 \
--maximum-record-age-in-seconds 3600 \
--bisect-batch-on-function-error \
--function-response-types ReportBatchItemFailures \
--destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:ap-south-1:111122223333:proj-dlq"}}'
resource "aws_lambda_event_source_mapping" "proj" {
  event_source_arn                   = aws_kinesis_stream.orders.arn
  function_name                      = aws_lambda_function.projector.arn
  starting_position                  = "LATEST"
  batch_size                         = 100
  maximum_batching_window_in_seconds = 5
  parallelization_factor             = 4    # 1–10 concurrent batches per shard
  maximum_retry_attempts             = 3    # -1 = infinite (the default poison trap)
  maximum_record_age_in_seconds      = 3600 # -1 = infinite
  bisect_batch_on_function_error     = true # split a failing batch to isolate the bad record
  function_response_types            = ["ReportBatchItemFailures"] # partial-batch success
  # HCL does not allow nested blocks on a single line, so this is spelled out.
  destination_config {
    on_failure {
      destination_arn = aws_sqs_queue.proj_dlq.arn
    }
  }
}
The stream event-source-mapping controls, the defaults that bite, and when to change them:
| Setting | Controls | Default | The trap if left default | Change to |
|---|---|---|---|---|
| BatchSize | Records per invoke | 100 (Kinesis/DDB) | Large batch + one bad record fails all | Tune to processing cost; pair with bisect |
| MaximumBatchingWindow | Wait to fill a batch (s) | 0 | Tiny batches = more invokes/cost | 1–5 s to batch efficiently |
| ParallelizationFactor | Concurrent batches per shard | 1 | Throughput capped at 1/shard | 1–10 (still per-key ordered) |
| MaximumRetryAttempts | Retries before giving up | -1 (infinite) | Poison record blocks shard forever | A finite number (e.g. 3–5) |
| MaximumRecordAge | Drop records older than | -1 (infinite) | Stale records retried endlessly | Bound it (e.g. 1–24 h) |
| BisectBatchOnFunctionError | Split failing batch | false | Whole batch keeps failing together | true (isolates the poison record) |
| ReportBatchItemFailures | Partial-batch success | off | One bad record reprocesses good ones | Return failed IDs; checkpoint past good |
| StartingPosition | Where to begin | n/a | TRIM_HORIZON replays all history | LATEST for new, TRIM_HORIZON to backfill |
The concurrency math for streams is fixed and worth memorising: concurrency = number of shards × parallelization factor. Ten shards at a parallelization factor of 4 gives at most 40 concurrent invocations of this function from this stream. That ceiling is set by the stream’s shape rather than by backlog, but every one of those invocations still counts against your account’s concurrency limit.
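The arithmetic is small enough to encode as a guard in a deployment check. This is only a worked instance of the formula above; the function name is hypothetical.

```python
def stream_concurrency(shards: int, parallelization_factor: int = 1) -> int:
    """Maximum concurrent invocations one stream mapping can drive."""
    if shards < 1:
        raise ValueError("a stream has at least one shard")
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization factor must be between 1 and 10")
    return shards * parallelization_factor


print(stream_concurrency(10, 4))  # 40
```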
Queue poll (SQS)
Lambda polls the SQS queue and invokes your function with a batch of up to 10,000 messages on a standard queue (10 on a FIFO queue), bounded by a 6 MB payload. The critical interaction is the visibility timeout: when Lambda reads a message it becomes invisible for that window; if the function succeeds, Lambda deletes it; if it fails (or times out), the message becomes visible again and is redelivered. The rule to follow is that the visibility timeout should be at least 6× the function timeout, so that a slow or throttled invocation does not get the same message redelivered to a second environment while the first is still working (instant duplicates). Messages that fail repeatedly go to the queue’s DLQ via its redrive policy.
# SQS event-source mapping with partial-batch reporting
aws lambda create-event-source-mapping --function-name invoice-worker \
--event-source-arn arn:aws:sqs:ap-south-1:111122223333:invoices \
--batch-size 10 --maximum-batching-window-in-seconds 0 \
--scaling-config '{"MaximumConcurrency":50}' \
--function-response-types ReportBatchItemFailures
# Queue with a DLQ wired via redrive policy (the SQS-side failure destination)
resource "aws_sqs_queue" "invoices_dlq" { name = "invoices-dlq" }
resource "aws_sqs_queue" "invoices" {
  name                       = "invoices"
  visibility_timeout_seconds = 180 # >= 6 x the 30s function timeout
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.invoices_dlq.arn
    maxReceiveCount     = 5 # attempts before a message goes to the DLQ
  })
}
resource "aws_lambda_event_source_mapping" "invoices" {
  event_source_arn        = aws_sqs_queue.invoices.arn
  function_name           = aws_lambda_function.invoice_worker.arn
  batch_size              = 10
  function_response_types = ["ReportBatchItemFailures"]
  scaling_config { maximum_concurrency = 50 } # cap concurrent invocations (2–1000)
}
The SQS-poller settings and the duplicate/loss traps they govern:
| Setting | Where | Controls | Trap if wrong |
|---|---|---|---|
| Visibility timeout | Queue | How long a read message is hidden | < 6× function timeout → duplicate processing |
| maxReceiveCount | Queue redrive | Attempts before DLQ | Too high = poison loops; too low = premature DLQ |
| BatchSize | Mapping | Messages per invoke (≤10,000) | Big batch + no partial-fail = reprocess all on one failure |
| ReportBatchItemFailures | Mapping | Return only failed message IDs | Without it, one failure redelivers the whole batch |
| MaximumConcurrency (scaling) | Mapping | Cap concurrent invocations from this queue (2–1,000) | Without it, SQS can stampede a fragile downstream |
| FIFO message group | Queue | Ordering + dedup scope | Wrong group ID serialises unrelated work |
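On the handler side, partial-batch reporting for SQS means returning the messageId of each failed message. The sketch below assumes JSON message bodies; process and the amount field are hypothetical, and visibility_ok simply encodes the 6× rule from the table.

```python
import json


def process(body: dict) -> None:
    """Hypothetical business logic; raises for an invoice it cannot handle."""
    if body.get("amount", 0) < 0:
        raise ValueError("negative amount")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message becomes visible again; the others are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def visibility_ok(visibility_timeout_s: int, function_timeout_s: int) -> bool:
    """The 6x rule: visibility timeout should cover six function timeouts."""
    return visibility_timeout_s >= 6 * function_timeout_s


batch = {"Records": [
    {"messageId": "m-1", "body": json.dumps({"amount": 10})},
    {"messageId": "m-2", "body": json.dumps({"amount": -5})},
]}
print(handler(batch, None))    # {'batchItemFailures': [{'itemIdentifier': 'm-2'}]}
print(visibility_ok(180, 30))  # True
```

Unlike the stream case, every failed message is reported, because a standard queue has no order to preserve.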
Standard vs FIFO SQS as a Lambda source, because the choice changes ordering, throughput and dedup:
| Aspect | Standard queue | FIFO queue |
|---|---|---|
| Ordering | Best-effort, none guaranteed | Strict, per message group ID |
| Delivery | At-least-once | Exactly-once processing (with dedup) |
| Throughput | Nearly unlimited | 300 API calls/s per action (3,000 msg/s with batching) per queue; higher with high-throughput mode |
| Dedup | None (you handle it) | 5-minute dedup window (content or ID) |
| Lambda concurrency | Scales with backlog | Bounded by active message groups |
| Use when | Throughput, parallel work | Order matters (per entity), no duplicates |
The trigger-by-trigger contract
Each event source has its own wiring, its own event shape, and its own gotchas. This is the reference matrix — which invocation model each uses, the key limits, and the one thing that catches everyone — followed by the wiring detail for the heavy hitters.
| Source | Invocation model | Key limit / batch | Event shape gotcha | The classic mistake |
|---|---|---|---|---|
| API Gateway | Synchronous | 29 s integration timeout; 10 MB payload (REST) | Proxy vs non-proxy integration | Function timeout > 29 s API timeout |
| Application Load Balancer | Synchronous | 1 MB response; no 29 s cap | Must return specific JSON shape | Wrong response structure → 502 |
| S3 | Asynchronous | One event per object (mostly) | No delivery order; possible duplicates | Recursive loop (write back to same bucket) |
| SNS | Asynchronous | 256 KB message | Fan-out; no replay, no ordering | Expecting ordering or filtering richness |
| SQS (standard) | Queue poll | ≤10,000/batch, 6 MB | Partial-batch failures | Visibility < 6× timeout → duplicates |
| SQS (FIFO) | Queue poll | Per-group ordering | Group ID controls parallelism | One group ID serialises everything |
| Kinesis Data Streams | Stream poll | Per-shard order; 100/batch | Iterator advances past poison | Infinite retries block the shard |
| DynamoDB Streams | Stream poll | Per-key order; 100/batch | NEW/OLD image config | Forgetting StreamViewType |
| EventBridge (bus) | Asynchronous | 256 KB; 300 rules/bus | Pattern matching is exact | Pattern too broad → double-delivery |
| EventBridge Scheduler | Asynchronous | One-time or cron/rate | Time zone + flexible windows | Confusing it with EventBridge rules |
S3 → Lambda (asynchronous)
An S3 bucket notification invokes your function asynchronously when an object is created, removed, restored, or replicated. Delivery is typically once but not guaranteed exactly-once, and not ordered — design idempotently. The single most expensive mistake is the recursive loop: a function triggered on s3:ObjectCreated:* that writes a derived object back into the same bucket re-triggers itself, billing you in a runaway until you notice. Scope the prefix/suffix or write to a different bucket.
# Grant S3 permission to invoke, then add the bucket notification
aws lambda add-permission --function-name thumbnail-generator \
--statement-id s3invoke --action lambda:InvokeFunction \
--principal s3.amazonaws.com \
--source-arn arn:aws:s3:::uploads-bucket --source-account 111122223333
aws s3api put-bucket-notification-configuration --bucket uploads-bucket \
--notification-configuration '{
"LambdaFunctionConfigurations":[{
"LambdaFunctionArn":"arn:aws:lambda:ap-south-1:111122223333:function:thumbnail-generator",
"Events":["s3:ObjectCreated:*"],
"Filter":{"Key":{"FilterRules":[{"Name":"prefix","Value":"raw/"},{"Name":"suffix","Value":".png"}]}}
}]}'
The S3 notification options and the gotcha each hides:
| Option | Values | Gotcha |
|---|---|---|
| Event types | ObjectCreated:*, ObjectRemoved:*, ObjectRestore:*, Replication:* | Put vs CompleteMultipartUpload differ; * catches both |
| Prefix/suffix filter | string match | Overlapping filters on one bucket can double-fire |
| Destination | Lambda, SQS, SNS, EventBridge | EventBridge gives richer routing + replay than direct notify |
| Delivery | At-least-once, unordered | Always idempotent; never assume order |
| Recursion guard | scope prefix / separate bucket | Writing back to the trigger bucket = billing loop |
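A second line of defence against the recursive loop is a guard inside the handler itself, in case the notification filter is ever loosened. The sketch below assumes derived objects are written under a thumbs/ prefix, which is a hypothetical choice; it also decodes the object key, since keys in S3 event records arrive URL-encoded.

```python
from urllib.parse import unquote_plus

INPUT_PREFIX = "raw/"      # matches the notification filter shown above
OUTPUT_PREFIX = "thumbs/"  # hypothetical prefix for derived objects


def keys_to_process(event: dict) -> list[str]:
    """Return decoded object keys that are safe to process."""
    keys = []
    for record in event.get("Records", []):
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        if key.startswith(OUTPUT_PREFIX) or not key.startswith(INPUT_PREFIX):
            continue  # never react to our own output
        keys.append(key)
    return keys


event = {"Records": [
    {"s3": {"bucket": {"name": "uploads-bucket"}, "object": {"key": "raw/my+cat.png"}}},
    {"s3": {"bucket": {"name": "uploads-bucket"}, "object": {"key": "thumbs/my+cat.png"}}},
]}
print(keys_to_process(event))  # ['raw/my cat.png']
```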
EventBridge → Lambda (asynchronous, the choreography hub)
EventBridge is the event bus for service-to-service choreography. Producers PutEvents; rules match events by a JSON event pattern (exact-match on fields, with content filters); matching events fan out to up to 5 targets per rule — including Lambda. It supports a schema registry, archive and replay, and cross-account/cross-region routing. The defining trap is a pattern that is too broad: two rules whose patterns both match the same event each fire (there is no first-match-wins), so the same fact double-delivers unless your patterns are tight and your consumers idempotent.
# Rule that matches a specific event, targeting a Lambda
aws events put-rule --name order-placed --event-bus-name orders \
--event-pattern '{"source":["com.acme.orders"],"detail-type":["OrderPlaced"]}'
aws events put-targets --rule order-placed --event-bus-name orders \
--targets '[{"Id":"fn","Arn":"arn:aws:lambda:ap-south-1:111122223333:function:fulfil","RetryPolicy":{"MaximumRetryAttempts":4,"MaximumEventAgeInSeconds":3600},"DeadLetterConfig":{"Arn":"arn:aws:sqs:ap-south-1:111122223333:eb-dlq"}}]'
aws lambda add-permission --function-name fulfil --statement-id eb \
--action lambda:InvokeFunction --principal events.amazonaws.com \
--source-arn arn:aws:events:ap-south-1:111122223333:rule/orders/order-placed
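The double-delivery trap is easy to see with a toy matcher. The sketch below is a deliberately simplified model of event patterns: it handles only exact "any of these values" lists and nested objects, and ignores prefix, numeric and other content filters. The all-orders rule is a hypothetical second rule added to show the overlap.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern field must match; lists mean 'any of'."""
    for field, expected in pattern.items():
        if field not in event:
            return False
        actual = event[field]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


RULES = {
    "order-placed": {"source": ["com.acme.orders"], "detail-type": ["OrderPlaced"]},
    "all-orders": {"source": ["com.acme.orders"]},  # hypothetical broad rule
}

event = {"source": "com.acme.orders", "detail-type": "OrderPlaced",
         "detail": {"orderId": "o-123"}}
fired = [name for name, pattern in RULES.items() if matches(pattern, event)]
print(fired)  # ['order-placed', 'all-orders']
```

Both rules fire for the same event, so if both target the same function it runs twice. Tightening the broad rule or making the consumer idempotent are the two ways out.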
EventBridge vs SNS vs SQS for fan-out — the actual decision table, since this is the most-asked serverless design question:
| Need | SNS | EventBridge | SQS |
|---|---|---|---|
| One→many fan-out | Yes (subscriptions) | Yes (rules, 5 targets each) | No (point-to-point) |
| Content-based routing | Limited (message filtering) | Rich (event patterns) | No |
| Replay / archive | No | Yes | No (it is the buffer) |
| Schema registry | No | Yes | No |
| Ordering | FIFO topics (per group) | No | FIFO queues (per group) |
| Buffering / backpressure | No (push) | No (push) | Yes (pull) |
| Throughput | Very high | High (per-account limits) | Very high |
| Latency | Lowest | Low | Pull-interval bound |
| Best for | High-fanout, low-latency push | Service choreography, routing | Decoupling, buffering, retries |
One pattern that combines them, SNS→SQS fan-out, is common enough to deserve its own reasoning: SNS pushes one event to many SQS queues (fan-out), and each queue buffers for its own Lambda (backpressure + retries + DLQ per consumer). You get SNS’s fan-out and SQS’s durability, which neither gives alone.
DynamoDB Streams & Kinesis (stream poll)
A DynamoDB Stream emits an ordered, per-key record for every item change; a Kinesis Data Stream is a general ordered log. Both use the stream-poll model from the section above. The DynamoDB-specific knob is StreamViewType — whether the record carries the new image, old image, both, or just keys — which you must set when enabling the stream and cannot change without re-enabling.
# Enable a DynamoDB stream with both images, then map it to a function
aws dynamodb update-table --table-name orders \
--stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES
STREAM_ARN=$(aws dynamodb describe-table --table-name orders \
--query 'Table.LatestStreamArn' --output text)
aws lambda create-event-source-mapping --function-name order-indexer \
--event-source-arn "$STREAM_ARN" --starting-position LATEST \
--batch-size 100 --maximum-retry-attempts 3 --bisect-batch-on-function-error
StreamViewType choices and when each is right:
| StreamViewType | Record contains | Use when |
|---|---|---|
| KEYS_ONLY | Just the key attributes | You’ll re-read the item; minimise stream size |
| NEW_IMAGE | Item after the change | Projections, search indexing, caches |
| OLD_IMAGE | Item before the change | Auditing what was deleted/overwritten |
| NEW_AND_OLD_IMAGES | Both | Change-data-capture, diff logic, full audit |
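As an illustration of what NEW_AND_OLD_IMAGES makes possible, the sketch below computes a per-attribute diff from one stream record. The record shape (dynamodb.OldImage / dynamodb.NewImage with DynamoDB-typed values) follows the stream event format; the helper names and the sample record are hypothetical, and only scalar attributes are handled.

```python
def plain(image):
    """Flatten a DynamoDB-typed image, e.g. {"status": {"S": "PLACED"}} -> {"status": "PLACED"}.

    Sketch only: handles scalar attributes (S, N, BOOL), not nested maps or lists.
    """
    return {name: next(iter(typed.values())) for name, typed in (image or {}).items()}


def changed_attributes(record):
    """Return {attribute: (old_value, new_value)} for attributes that differ."""
    ddb = record["dynamodb"]
    old = plain(ddb.get("OldImage"))
    new = plain(ddb.get("NewImage"))
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Hypothetical MODIFY record as delivered with StreamViewType=NEW_AND_OLD_IMAGES.
SAMPLE_RECORD = {
    "eventName": "MODIFY",
    "dynamodb": {
        "Keys": {"id": {"S": "o-123"}},
        "OldImage": {"id": {"S": "o-123"}, "status": {"S": "PLACED"}},
        "NewImage": {"id": {"S": "o-123"}, "status": {"S": "SHIPPED"},
                     "carrier": {"S": "acme"}},
    },
}

diff = changed_attributes(SAMPLE_RECORD)
print(diff)
```

With KEYS_ONLY the same function would return an empty diff, because neither image is present; the consumer would have to re-read the item instead.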
Kinesis vs DynamoDB Streams as a Lambda source:
| Aspect | DynamoDB Streams | Kinesis Data Streams |
|---|---|---|
| Source of records | Table item changes | Anything you PutRecord |
| Retention | 24 h | 24 h–365 d (configurable) |
| Ordering | Per item key | Per partition key |
| Shards | Managed by the table | You provision (or on-demand) |
| Multiple consumers | Limited | Enhanced fan-out (per-consumer throughput) |
| Use for | CDC off a table | General event log, multi-consumer streaming |
Fan-out, fan-in & choreography patterns
The patterns that turn single functions into systems. Each has a shape, a primitive, and a failure mode.
| Pattern | Shape | Built with | Why use it | Failure mode to guard |
|---|---|---|---|---|
| Simple fan-out | 1 event → N consumers | SNS or EventBridge | Decouple producers from consumers | Lost delivery to one consumer (per-target DLQ) |
| Fan-out + buffer | 1 event → N queues → N fns | SNS→SQS | Backpressure + retry per consumer | Queue without redrive = stuck poison |
| Fan-in / aggregation | N events → 1 store | Lambda → DynamoDB | Collate results | Race on concurrent writes (use conditional updates) |
| Pipe / transform | source → filter → target | EventBridge Pipes | Point-to-point with enrichment | Filter too loose forwards noise |
| Choreography | services react to events | EventBridge bus + rules | Loose coupling, autonomy | No central view; hard to trace |
| Orchestration | central state machine | Step Functions | Long, branching, stateful flows | Chaining Lambdas instead (no visibility) |
| Saga (compensation) | distributed txn + undo | Step Functions / EventBridge | Multi-service consistency | Missing compensating action on failure |
Choreography vs orchestration — the architecture fork
The most consequential design decision in event-driven systems. Choreography (EventBridge): each service emits events and reacts to others’ events; no central controller; maximum autonomy and loose coupling — but no single place to see “where is order o-123 in the flow?” Orchestration (Step Functions): a central state machine invokes each step, holds the state, branches, retries, and gives you a visual execution history — at the cost of a coordinator that knows about every step. Lambda chaining (function A invokes function B invokes C) is the worst of both: orchestration logic smeared across functions with no visibility and no built-in retry/branching.
| Dimension | Choreography (EventBridge) | Orchestration (Step Functions) | Lambda chaining (anti-pattern) |
|---|---|---|---|
| Coupling | Loosest | Coupled to the state machine | Tightly, invisibly coupled |
| Visibility of flow | Low (trace across events) | High (execution history) | None |
| Long/branching flows | Hard | Native (Map, Choice, Parallel) | Painful, error-prone |
| Built-in retry/catch | Per-target only | Per-state | Hand-rolled in each function |
| Best for | Reactive, autonomous services | Defined business processes | Almost never |
For a deep treatment of the orchestration side, the Step Functions state-machine model is the tool you reach for when a flow has more than two or three steps, branches, or needs a human-readable history.
At-least-once, idempotency & not losing events
This is where most production incidents live. Three disciplines: make the handler idempotent, attach a failure destination on every async/poll path, and stop poison records from blocking streams.
Idempotency — the non-negotiable
Under at-least-once delivery, the same event will arrive twice eventually. An idempotent handler produces the same result whether it runs once or five times. The canonical implementation: derive an idempotency key from the event (message ID, event ID, or a hash of the meaningful fields), and use a conditional write to a store so the second attempt is a no-op.
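# DynamoDB conditional put as an idempotency guard
import hashlib
import json
import time

import boto3

ddb = boto3.client("dynamodb")  # init code: reused across warm invocations
DEDUP_TTL_SECONDS = 24 * 3600   # assumption: must exceed the source's maximum retry window

def _ttl():
    # Epoch seconds after which DynamoDB TTL may expire the dedup record
    return int(time.time()) + DEDUP_TTL_SECONDS

def handler(event, _ctx):
    for record in event["Records"]:
        key = record.get("messageId") or hashlib.sha256(
            json.dumps(record["body"], sort_keys=True).encode()).hexdigest()
        try:
            ddb.put_item(
                TableName="processed-events",
                Item={"id": {"S": key}, "ttl": {"N": str(_ttl())}},
                ConditionExpression="attribute_not_exists(id)")  # fails on duplicate
        except ddb.exceptions.ConditionalCheckFailedException:
            continue  # already processed: skip the side effect
        # do_the_side_effect is your business logic; it runs at most once per key.
        # If it can fail, delete the dedup record on failure so a retry is not skipped.
        do_the_side_effect(record)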
Idempotency strategies and their trade-offs:
| Strategy | How it works | Pro | Con |
|---|---|---|---|
| Conditional write (DynamoDB) | First write wins; dupes fail the condition | Strong, simple, TTL-able | Extra write per event |
| Natural idempotency | Operation is inherently safe (PUT to a key) | Free | Only some operations qualify |
| Dedup table + TTL | Store seen IDs, expire them | Bounded storage | Window must exceed max retry age |
| SQS FIFO dedup | 5-min content/ID dedup | Built-in | 5-min window only; FIFO throughput limits |
| Powertools Idempotency | Library wraps the handler | Battle-tested, persistence-backed | Adds a dependency + a store |
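One subtlety of the conditional-write strategy is ordering: if the key is claimed before the side effect runs and the side effect then fails, a retry sees the key and skips the work. The sketch below shows a claim, act, release-on-failure flow using an in-memory stand-in for the table. InMemoryDedupStore and process_once are hypothetical names for illustration, not a library API.

```python
class InMemoryDedupStore:
    """Stand-in for a table with a conditional put (attribute_not_exists)."""

    def __init__(self):
        self._seen = set()

    def claim(self, key):
        """Return True only for the first caller to claim this key."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

    def release(self, key):
        """Undo a claim so that a later retry can process the event."""
        self._seen.discard(key)


def process_once(store, key, side_effect):
    """Run side_effect at most once per key; release the claim if it fails."""
    if not store.claim(key):
        return "duplicate"
    try:
        side_effect()
    except Exception:
        store.release(key)  # otherwise the failed event is marked done and lost
        raise
    return "processed"


store = InMemoryDedupStore()
charges = []
first = process_once(store, "evt-1", lambda: charges.append("evt-1"))
second = process_once(store, "evt-1", lambda: charges.append("evt-1"))
print(first, second, charges)
```

A real store would also need a TTL and some handling for a function that crashes between claim and release; a library such as Powertools Idempotency covers those cases with an in-progress state.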
Where events go when they fail — the destination matrix
Every async and poll path needs an explicit failure destination, or “failed” means “gone.” Map yours:
| Invocation path | Failure mechanism | Configure | If unset |
|---|---|---|---|
| Async (S3/SNS/EventBridge) | On-failure destination / DLQ | put-function-event-invoke-config | Event dropped after retries |
| EventBridge target | Per-target DLQ + retry policy | DeadLetterConfig on the target | Event dropped after target retries |
| SQS source | Queue redrive → DLQ | Queue RedrivePolicy | Message is redelivered until the retention period expires, then deleted |
| Kinesis/DDB source | On-failure destination | mapping DestinationConfig | Shard blocked (infinite default retries) |
| SNS subscription | Subscription redrive → DLQ | SNS RedrivePolicy | Delivery attempts exhausted, message lost |
Stopping poison records on streams
A poison record is one that always fails. On a stream, the default MaximumRetryAttempts=-1 (infinite) means it blocks its shard until the record expires from the stream, and every record behind it waits. Five controls fix this, used together:
| Control | What it does | Set to |
|---|---|---|
| MaximumRetryAttempts | Cap retries before skipping/destination | A finite number (e.g. 3–5) |
| MaximumRecordAgeInSeconds | Skip records older than N seconds | Bound it (e.g. 3,600–86,400) |
| BisectBatchOnFunctionError | Halve a failing batch to isolate the bad record | true |
| ReportBatchItemFailures | Report only the failed record IDs | Return them; good records checkpoint |
| DestinationConfig.OnFailure | Send the failed batch’s metadata somewhere | An SQS/SNS DLQ for inspection |
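To see why bisecting plus a finite retry cap unblocks a shard, here is a small simulation of the idea in plain Python. It imitates the behaviour described above and is not Lambda's actual poller; deliver and the sample handler are hypothetical, and for simplicity retries are only counted once a batch has been narrowed to a single record.

```python
def deliver(batch, handler, retries_left):
    """Imitate bisect-on-error; return the records that reach the failure destination."""
    try:
        handler(batch)
        return []  # whole batch succeeded and can be checkpointed
    except Exception:
        if len(batch) > 1:
            mid = len(batch) // 2
            # Split the failing batch and retry each half, preserving order.
            return (deliver(batch[:mid], handler, retries_left)
                    + deliver(batch[mid:], handler, retries_left))
        if retries_left > 0:
            return deliver(batch, handler, retries_left - 1)
        return list(batch)  # retries exhausted: skip it instead of blocking the shard


def handler(batch):
    """Hypothetical consumer that cannot parse one particular record."""
    for record in batch:
        if record == "poison":
            raise ValueError("cannot parse record")


records = ["a", "b", "poison", "c", "d"]
dead = deliver(records, handler, retries_left=3)
print(dead)
```

Note that the good records in a failing batch are handed to the handler more than once while the batch is being split, which is one more reason the handler has to be idempotent.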
Concurrency, scaling & protecting downstream
Concurrency is the scaling unit and the thing that takes down your database. Three numbers govern it: the account concurrency limit (default 1,000/region, shared), reserved concurrency (a per-function guaranteed cap), and provisioned concurrency (pre-warmed environments). A fourth, the burst concurrency rate, governs how fast you can scale into that ceiling.
# Reserve 100 concurrent executions for a function (guarantees AND caps it)
aws lambda put-function-concurrency --function-name payment-worker \
--reserved-concurrent-executions 100
# Provision 20 always-warm environments on a version/alias (kills cold starts)
aws lambda put-provisioned-concurrency-config --function-name order-api \
--qualifier live --provisioned-concurrent-executions 20
# Check current usage against the account limit
aws lambda get-account-settings \
--query 'AccountLimit.{Concurrent:ConcurrentExecutions,Unreserved:UnreservedConcurrentExecutions}'
resource "aws_lambda_function" "payment" {
  function_name                  = "payment-worker"
  reserved_concurrent_executions = 100 # cap + guarantee; 0 would DISABLE the function
  # ...
}

resource "aws_lambda_provisioned_concurrency_config" "api" {
  function_name                     = aws_lambda_function.order_api.function_name
  qualifier                         = aws_lambda_alias.live.name
  provisioned_concurrent_executions = 20
}
The concurrency levers, side by side:
| Lever | What it does | Cost | Set when |
|---|---|---|---|
| Account limit (1,000) | Ceiling across all functions in a region | n/a | Raise via Service Quotas before you need it |
| Reserved concurrency | Caps a function AND carves it out of the pool | Free (just allocation) | Protect a downstream DB / protect other functions |
| Provisioned concurrency | Pre-warms N environments | Paid hourly even idle | Latency-critical sync paths with cold-start SLOs |
| Burst concurrency | Scale-up rate (currently documented as up to 1,000 new environments per 10 s per function) | n/a | Understand it; you can’t raise it |
Critical gotchas, because each has bitten teams in production:
| Gotcha | What happens | Fix |
|---|---|---|
| reserved_concurrent_executions = 0 | Disables the function entirely | Use null/unset for “no reservation,” not 0 |
| One function reserves 900 of 1,000 | Every other function shares 100 | Reserve deliberately; monitor UnreservedConcurrentExecutions |
| Lambda scales faster than RDS allows | Connection storm → DB at max_connections | Reserved concurrency cap + RDS Proxy for pooling |
| Provisioned concurrency on $LATEST | Not allowed; needs a version/alias | Publish a version, point an alias, provision the alias |
| Stream concurrency surprise | shards × parallelization, not 1 | Size downstream for shards×factor |
The downstream-protection pattern deserves emphasis: Lambda will happily open 1,000 concurrent connections to an RDS instance that allows 100, and the database falls over. The fix is a reserved concurrency cap sized to the database’s connection budget, plus RDS Proxy to pool and reuse connections so 1,000 functions share a small pool. DynamoDB, being serverless, scales with you — which is one reason it pairs so naturally with Lambda.
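The sizing arithmetic behind that cap can be sketched in a few lines. This is a back-of-envelope estimate under stated assumptions, not an AWS formula: required concurrency follows Little's law (arrival rate × time per invocation), and the cap is the database connection budget minus some headroom. The function names, the 200 ms duration and the 20% headroom are illustrative values, not measurements.

```python
import math


def required_concurrency(events_per_second, avg_duration_ms, batch_size=1):
    """Little's law: concurrent environments ~ invocations/s x seconds per invocation."""
    return math.ceil(events_per_second * avg_duration_ms / (batch_size * 1000))


def reserved_concurrency_cap(db_max_connections, connections_per_env=1, headroom_percent=20):
    """Largest cap that keeps Lambda's connections inside the database budget."""
    usable = db_max_connections * (100 - headroom_percent) // 100
    return usable // connections_per_env


# Illustrative numbers: a 2,500 events/s surge at an assumed 200 ms per event.
wanted = required_concurrency(2500, 200)           # what Lambda would scale to, uncapped
cap = reserved_concurrency_cap(100)                # what a 100-connection database tolerates
batched = required_concurrency(2500, 200, batch_size=10)
print(wanted, cap, batched)
```

When the wanted figure exceeds the cap, the difference has to be absorbed somewhere: by a queue in front of the function, by batching, or by a faster handler.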
Cold starts: causes, costs & cures
A cold start is the latency of initialising a new execution environment. It is not an error — but on a synchronous, user-facing path it shows up in p99 and can trip an upstream timeout. First, what actually consumes the cold-start budget:
| Cost component | Typical magnitude | Reduce by | Trade-off |
|---|---|---|---|
| Code/layer download | 10s–100s ms (size-dependent) | Smaller package; fewer/lighter deps | Build discipline |
| Runtime bootstrap | 50–400 ms (varies by runtime) | Choose a faster runtime; SnapStart | Language/ecosystem constraints |
| VPC ENI attach | Now ~ms (Hyperplane) | (Mostly solved) historically the big one | n/a |
| Your init code | 10 ms–1 s+ | Lazy-init non-critical clients; cache config | First real call may pay deferred cost |
| First DB connect | 10s–100s ms | Pool in init; use serverless/proxy DBs | Connection still primes once |
The cure menu, ranked by cost and effort:
| Technique | What it does | Cost | Effort | Best for |
|---|---|---|---|---|
| Smaller package / fewer deps | Less to download + init | Free | Medium | Every function |
| Init clients in module scope | Warm invocations skip init | Free | Trivial | Every function |
| Right-size memory up | More CPU → faster init + run | Pay per GB-s (may lower total cost) | Trivial | CPU-bound init |
| Provisioned concurrency | Pre-warmed envs, no cold start | Paid hourly | Low | Latency-SLO sync APIs |
| SnapStart (Java, .NET, Python) | Snapshot a warmed env, restore fast | Free (some restore cost) | Low | JVM/.NET cold-start pain |
| Avoid heavy frameworks in handler | Less per-invoke overhead | Free | Medium | High-RPS functions |
Runtime cold-start characteristics, roughly, because runtime choice is a cold-start decision:
| Runtime | Relative cold start | SnapStart support | Notes |
|---|---|---|---|
| Node.js / Python | Fast | Python: yes | The serverless default for latency |
| Go / Rust (provided.al2) | Fast | n/a (already fast) | Compiled, tiny, quick init |
| Java | Slow (JVM + JIT) | Yes — big win | SnapStart cuts it dramatically |
| .NET | Slow-ish | Yes | SnapStart / ReadyToRun help |
The memory-CPU coupling is the under-used lever: Lambda allocates CPU proportional to memory, so a function that is CPU-bound during init often runs faster and cheaper at 1,024 MB than at 256 MB, because it finishes in a fraction of the time. Profile with AWS Lambda Power Tuning rather than defaulting to 128 MB.
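A quick worked comparison shows how more memory can cost less. The per-GB-second price below is an assumption (a commonly published x86 list price; check the pricing page for your region and architecture), the two durations are hypothetical measurements, and the per-request charge is left out because it is the same for both configurations.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumption: x86 list price; varies by region/arch


def compute_cost(memory_mb, billed_ms, invocations=1_000_000):
    """Duration charge in USD for `invocations` runs at the given memory size."""
    gb_seconds = (memory_mb / 1024) * (billed_ms / 1000) * invocations
    return gb_seconds * PRICE_PER_GB_SECOND


# Hypothetical measurements for a CPU-bound function:
small = compute_cost(memory_mb=256, billed_ms=800)    # 200,000 GB-s
large = compute_cost(memory_mb=1024, billed_ms=150)   # 150,000 GB-s
print(f"256 MB: ${small:.2f}  1024 MB: ${large:.2f}")
```

The larger configuration wins here only because the duration dropped by more than the 4× memory increase; if the function is I/O-bound and the duration barely moves, the same arithmetic shows the bill going up.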
Limits & quotas reference
The numbers you will hit. Keep this open when sizing or debugging a “why did it stop” mystery:
| Limit | Value | Hard/soft | What hitting it looks like |
|---|---|---|---|
| Function timeout | 15 min (900 s) | Hard | Invocation killed mid-work; Task timed out |
| Memory | 128 MB – 10,240 MB | Hard | OOM kill; Runtime exited with error: signal: killed |
| Ephemeral /tmp | 512 MB – 10,240 MB | Configurable | No space left on device |
| Sync request payload | 6 MB | Hard | RequestEntityTooLarge |
| Async payload | 256 KB | Hard | Event rejected at invoke |
| Deployment package (zipped, direct) | 50 MB | Hard | Upload rejected; use S3 |
| Deployment package (unzipped) | 250 MB | Hard | Use container image (up to 10 GB) instead |
| Container image | 10 GB | Hard | Bigger won’t deploy |
| Layers per function | 5 | Hard | Consolidate layers |
| Account concurrency (region) | 1,000 default | Soft (raisable) | 429 TooManyRequestsException |
| Burst concurrency | Scale-up rate, currently documented as up to 1,000 new environments per 10 s per function | Hard | Throttles during a sharp spike |
| Environment variables size | 4 KB total | Hard | Move config to SSM/Secrets Manager |
| ENI per function (VPC) | scales (Hyperplane) | Managed | (Historically a hard limit) |
| /tmp + invocations | per-env, reused | n/a | Stale state across warm invokes |
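The payload limits are the easiest to check before you invoke. The sketch below is a pre-flight check using the two values from the table; payload_fits is a hypothetical helper, and measuring the JSON-serialised size is an approximation of how the service counts bytes. "Event" and "RequestResponse" are the InvocationType values used by the Invoke API.

```python
import json

SYNC_LIMIT_BYTES = 6 * 1024 * 1024   # 6 MB synchronous request payload
ASYNC_LIMIT_BYTES = 256 * 1024       # 256 KB asynchronous event payload


def payload_fits(payload, invocation_type):
    """Return (fits, size_in_bytes) for a JSON-serialisable payload."""
    size = len(json.dumps(payload).encode("utf-8"))
    limit = ASYNC_LIMIT_BYTES if invocation_type == "Event" else SYNC_LIMIT_BYTES
    return size <= limit, size


big = {"document": "x" * 300_000}  # roughly 300 KB once serialised
print(payload_fits(big, "Event"))
print(payload_fits(big, "RequestResponse"))
```

When a payload does not fit the async limit, the usual way out is to store the body in S3 and pass only the bucket and key in the event.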
Error & status-code reference
Every error you realistically see, what it means, how to confirm, and the fix:
| Error / code | Meaning | Likely cause | Confirm with | Fix |
|---|---|---|---|---|
| 429 TooManyRequestsException | Throttled | Concurrency limit hit | Throttles metric; account settings | Raise quota; reserved concurrency; backoff |
| Task timed out after N seconds | Function exceeded timeout | Slow work / hung downstream | CloudWatch Logs END vs timeout | Raise timeout (≤900 s); fix the slow call |
| Runtime exited with error: signal: killed | OOM killed | Memory too low / leak | Logs “Runtime exited”; MaxMemoryUsed | Increase memory; fix leak |
| AccessDeniedException | IAM denied | Execution role missing a permission | CloudTrail; the log’s denied action | Add the action to the role |
| ResourceConflictException | Concurrent update | Two deploys/updates at once | Activity; deploy logs | Serialise deploys |
| EventSourceMapping ... Disabled | Poller stopped | Repeated failures / manual disable | get-event-source-mapping State | Fix function; re-enable |
| ProvisionedConcurrencyConfigNotFound | PC not on this qualifier | Provisioned $LATEST or wrong alias | get-provisioned-concurrency-config | Provision a version/alias |
| KMSAccessDeniedException | Can’t decrypt env vars | Role lacks KMS key access | Logs at init | Grant kms:Decrypt on the key |
| Lambda was unable to decompress... | Bad package | Corrupt/oversized zip | Deploy output | Rebuild; use container image |
| Calls to <fn> are being throttled (async) | Async backlog throttled | Downstream of an async flood | Throttles; invocations queued | Reserved concurrency; smooth the source |
| Empty receive / no invokes (SQS) | Poller not pulling | Mapping disabled; permissions | get-event-source-mapping; role sqs:* | Enable mapping; grant SQS perms |
| Stale $LATEST behind alias | Wrong code served | Alias points to old version | get-alias | Update alias to the new version |
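Several fixes in the table say "backoff". For callers that invoke a function directly and see throttles, a minimal sketch of exponential backoff with full jitter might look like this. The Throttled class is a hypothetical stand-in for a 429 TooManyRequestsException, and the base and cap values are illustrative; the AWS SDKs already retry throttles themselves, so this applies mainly to your own retry loops.

```python
import random
import time


class Throttled(Exception):
    """Hypothetical stand-in for a 429 TooManyRequestsException."""


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: the n-th delay is uniform in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


def call_with_backoff(call, delays, sleep=time.sleep):
    """Retry `call` on Throttled, sleeping one delay between attempts."""
    for delay in delays:
        try:
            return call()
        except Throttled:
            sleep(delay)
    return call()  # last attempt: let Throttled propagate to the caller


# Demo: a call that is throttled twice, then succeeds.
attempts = {"n": 0}


def flaky_invoke():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise Throttled("Rate exceeded")
    return "ok"


slept = []  # record the delays instead of really sleeping
result = call_with_backoff(flaky_invoke, backoff_delays(4), sleep=slept.append)
print(result, len(slept))
```

Jitter matters because many throttled callers retrying on the same fixed schedule arrive together again and get throttled together again.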
Architecture at a glance
The diagram traces a real event-driven order pipeline left to right and maps the four invocation models onto the exact hops where each one fails. Read it as the path an event actually takes. On the left, producers emit facts: an API Gateway call places an order (a synchronous invoke of the intake function), and an S3 upload of an attachment fires an asynchronous invoke of a processor. Those producers land on the ingestion & buffering zone — an SQS queue absorbs the order workload (the queue-poll model, with a DLQ on its redrive policy) and an SNS topic fans the “order placed” fact out to interested consumers. The EventBridge custom bus in the routing zone is the choreography hub: rules pattern-match the event and fan it to up to five targets, with a per-target DLQ catching exhausted deliveries.
From there the processing zone is where the worker functions run — a fulfilment Lambda (async, retries to an on-failure destination), a projection Lambda fed by DynamoDB Streams (the stream-poll model, where a poison record can block the shard), and a Kinesis consumer for the analytics tap (also stream-poll). Everything converges on the state & failure zone: a DynamoDB table as the idempotency store and projection target, and the DLQs that are the difference between a logged failure and a lost event. The numbered badges sit on the five places an event silently dies or duplicates — a throttle at the concurrency ceiling, an async retry with no destination, a poison record on the stream, a visibility-timeout duplicate on the queue, and a too-broad EventBridge rule that double-delivers. The legend narrates each as symptom, the metric that confirms it, and the fix.
Real-world scenario
Parcelo, a fictional last-mile delivery startup in Bengaluru, ran its parcel-event pipeline on a single 8-vCPU EC2 instance: a Python worker that polled a queue, processed scan events from courier apps, updated a Postgres database, and pushed notifications. Traffic averaged 200 events/second with a 7pm surge to ~2,500/second as the evening delivery wave finished. The instance sat near-idle overnight, cost about ₹14,000/month running 24×7, and — worse — during the evening surge it fell behind, processing events minutes late, so customers saw “out for delivery” long after the parcel arrived. The four-engineer platform team decided to go event-driven on Lambda.
The first cut was naive and instructive. They pointed a Lambda directly at the courier SNS topic (asynchronous) and had it write straight to Postgres. It worked in testing. In the first evening surge it fell apart: the function scaled to ~900 concurrent environments, each opened a Postgres connection, and the database hit max_connections and started refusing — so functions errored, Lambda retried the async events, and the retry storm made it worse. Meanwhile a malformed scan event from one buggy courier app build threw on every attempt; with no on-failure destination configured, those events simply vanished after two retries. The team had reproduced two textbook traps at once: a concurrency stampede on a non-serverless downstream, and silent async data loss.
The breakthrough was redesigning around the invocation models rather than fighting them. They inserted an SQS queue between SNS and the worker (the SNS→SQS fan-out pattern), switching the worker to the queue-poll model. That gave them three things at once: a buffer that absorbed the 2,500/s surge instead of stampeding, reserved concurrency capped at 80 (sized to the database’s connection budget) so Lambda could never open more connections than Postgres allowed, and a DLQ via redrive policy (maxReceiveCount=5) so the poison events landed somewhere inspectable instead of disappearing. They made the handler idempotent with a DynamoDB conditional-write on the scan event’s ID, because at-least-once delivery meant duplicates were now expected, not exceptional. Finally they put RDS Proxy in front of Postgres so the 80 concurrent workers shared a small pooled connection set.
The numbers told the story. The evening surge now drained through the queue with sub-second processing latency end to end; the database never exceeded 80 connections; the DLQ caught exactly the malformed events (which turned out to be one courier app version, fixed at the source) with full payloads for replay. Cost fell to about ₹3,800/month — Lambda billed only for the milliseconds of actual processing, near-zero overnight, scaling to the surge automatically. The lesson the team wrote on the wall: “Don’t point a function at a fragile thing. Buffer it, cap it, make it idempotent, and give failures somewhere to land.”
The incident and redesign as a timeline, because the order of the fixes is the lesson:
| Stage | What they did | Result | What it should have been |
|---|---|---|---|
| v0 (EC2) | One 24×7 worker | ₹14,000/mo, falls behind at surge | — |
| v1 (naive Lambda) | SNS → Lambda → Postgres direct | Stampede; DB refuses connections | Buffer with SQS first |
| v1 failure | No on-failure destination | Malformed events vanish | DLQ on every async/queue path |
| Fix 1 | Insert SQS (SNS→SQS) | Surge buffered, no stampede | The core architectural fix |
| Fix 2 | Reserved concurrency = 80 | DB connections bounded | Size to the downstream budget |
| Fix 3 | DLQ via redrive (maxReceiveCount 5) | Poison events captured | Never lose an event silently |
| Fix 4 | Idempotent handler (DDB conditional) | Duplicates harmless | At-least-once demands it |
| Fix 5 | RDS Proxy | 80 workers share a pool | Pool connections to non-serverless DBs |
| Outcome | — | ₹3,800/mo, sub-second latency | The fix was design, not bigger compute |
Advantages and disadvantages
The event-driven serverless model both unlocks huge wins and introduces failure modes you must design against. Weigh it honestly:
| Advantages (why this model wins) | Disadvantages (why it bites) |
|---|---|
| Pay per millisecond of execution; near-zero cost when idle | At sustained high RPS, a container can be cheaper than per-invoke billing |
| Scales from zero to thousands of environments with no capacity planning | A scale-out stampede can overwhelm any non-serverless downstream (RDS) |
| Each event source is a first-class, declarative trigger | Each source has its own retry/ordering/error contract you must learn |
| Built-in retries and DLQs for async/poll paths | “Failed” means “silently gone” unless you configure a destination |
| Stateless functions are trivially horizontally scalable | At-least-once delivery means you must build idempotency |
| Cold starts now small for most runtimes; provisioned concurrency for the rest | Cold starts still hurt latency-critical synchronous paths |
| Tight integration with the whole AWS event ecosystem | Observability is fragmented across many small functions |
| 15-min timeout fits most event reactions | Long/heavy jobs hit the wall — wrong tool |
The model is right for event-shaped, intermittent, spiky workloads where you want to ship reactions, not operate servers — and where the work decomposes into small, idempotent steps. It bites hardest on sustained high-throughput workloads (where per-invoke billing loses to a reserved container), on functions fronting fragile non-serverless downstreams (without concurrency caps and pooling), and on teams that haven’t internalised at-least-once delivery (duplicates and silent loss). The disadvantages are all manageable — but only if you design for them, which is the entire point of the patterns above. When the workload is long-running, stateful, or steady-state high-CPU, the container path is the honest answer.
Hands-on lab
Build the consumer half of a fan-out pipeline, free-tier-friendly: an SQS work queue feeds a Lambda that writes an idempotent record to DynamoDB, with a DLQ catching failures. In production an S3 notification or SNS topic would publish into the queue; here you send the messages by hand. Run in a shell with the AWS CLI configured; everything here is within Free Tier for a short test, and we tear it down at the end.
Step 1 — Variables.
export R=ap-south-1 ACC=$(aws sts get-caller-identity --query Account --output text)
export AWS_DEFAULT_REGION=$R # the sqs commands below omit --region; keep them in the same region
export PFX=lab-evt
Step 2 — Create the DynamoDB idempotency/projection table.
aws dynamodb create-table --table-name ${PFX}-events \
--attribute-definitions AttributeName=id,AttributeType=S \
--key-schema AttributeName=id,KeyType=HASH \
--billing-mode PAY_PER_REQUEST --region $R
Expected: a TableDescription with TableStatus: CREATING, soon ACTIVE.
Step 3 — Create the work queue and its DLQ with a redrive policy.
DLQ_URL=$(aws sqs create-queue --queue-name ${PFX}-dlq --query QueueUrl --output text)
DLQ_ARN=$(aws sqs get-queue-attributes --queue-url $DLQ_URL --attribute-names QueueArn --query Attributes.QueueArn --output text)
Q_URL=$(aws sqs create-queue --queue-name ${PFX}-work \
--attributes "{\"VisibilityTimeout\":\"180\",\"RedrivePolicy\":\"{\\\"deadLetterTargetArn\\\":\\\"$DLQ_ARN\\\",\\\"maxReceiveCount\\\":\\\"5\\\"}\"}" \
--query QueueUrl --output text)
Q_ARN=$(aws sqs get-queue-attributes --queue-url $Q_URL --attribute-names QueueArn --query Attributes.QueueArn --output text)
Note the visibility timeout 180s ≥ 6× the 30s function timeout — the duplicate-prevention rule from the SQS section, made concrete.
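The same rule as arithmetic, in case you change the function timeout later. This is a small sketch with hypothetical helper names; the batching-window term is included on the assumption that a configured MaximumBatchingWindowInSeconds should be added on top of the 6× figure, and it is zero in this lab.

```python
def min_visibility_timeout(function_timeout_s, batching_window_s=0, factor=6):
    """Smallest queue visibility timeout (seconds) that satisfies the 6x rule."""
    return factor * function_timeout_s + batching_window_s


def visibility_ok(visibility_timeout_s, function_timeout_s, batching_window_s=0):
    """True if the queue setting is at least the minimum for this function."""
    return visibility_timeout_s >= min_visibility_timeout(
        function_timeout_s, batching_window_s)


print(min_visibility_timeout(30))   # the lab's 30 s function timeout
print(visibility_ok(180, 30))       # the lab's queue setting
print(visibility_ok(60, 30))        # too short: messages reappear mid-retry
```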
Step 4 — Create the execution role.
aws iam create-role --role-name ${PFX}-role \
--assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"lambda.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name ${PFX}-role \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaSQSQueueExecutionRole
aws iam attach-role-policy --role-name ${PFX}-role \
--policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess
Step 5 — Package and deploy the idempotent function.
cat > handler.py <<'PY'
import boto3, json, os

ddb = boto3.client("dynamodb")  # init code: reused on warm invokes
T = os.environ["TABLE"]

def handler(event, _):
    failures = []
    for r in event["Records"]:
        mid = r["messageId"]
        try:
            ddb.put_item(TableName=T, Item={"id": {"S": mid}},
                         ConditionExpression="attribute_not_exists(id)")
            print("processed", mid)
        except ddb.exceptions.ConditionalCheckFailedException:
            print("duplicate, skipped", mid)  # idempotent: no double side effect
        except Exception as e:
            print("error", mid, str(e))
            failures.append({"itemIdentifier": mid})  # partial-batch failure
    return {"batchItemFailures": failures}
PY
zip fn.zip handler.py
sleep 10 # let the role propagate
aws lambda create-function --function-name ${PFX}-worker \
--runtime python3.12 --handler handler.handler --timeout 30 --memory-size 256 \
--role arn:aws:iam::${ACC}:role/${PFX}-role \
--environment "Variables={TABLE=${PFX}-events}" \
--zip-file fileb://fn.zip --region $R
Step 6 — Map the queue to the function with partial-batch reporting.
aws lambda create-event-source-mapping --function-name ${PFX}-worker \
--event-source-arn $Q_ARN --batch-size 10 \
--function-response-types ReportBatchItemFailures --region $R
Step 7 — Send a message twice; prove idempotency.
aws sqs send-message --queue-url $Q_URL --message-body '{"scan":"SC-1"}'
aws sqs send-message --queue-url $Q_URL --message-body '{"scan":"SC-1"}'
sleep 8
aws logs tail /aws/lambda/${PFX}-worker --since 2m --region $R
# Expect two "processed" lines: each send-message call gets its own messageId,
# so the DynamoDB table holds one item per message (COUNT = 2).
aws dynamodb scan --table-name ${PFX}-events --select COUNT --region $R
Each SQS message gets its own messageId, so to truly see the duplicate path, re-drive the same message — but the lab’s point is proven: the conditional write makes reprocessing the same ID a safe no-op, which is exactly what protects you under at-least-once delivery.
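If you want to watch the duplicate path without waiting for a redelivery, you can exercise the same handler logic locally against a fake client. The sketch below is an assumption-laden test harness, not part of the lab: FakeDynamoDB mimics only the one put_item behaviour the handler relies on, and make_handler re-creates the lab logic with the client injected and outcomes appended to a list instead of printed.

```python
import types


class ConditionalCheckFailedException(Exception):
    """Local stand-in for the exception class the real client exposes."""


class FakeDynamoDB:
    """In-memory fake of a conditional put_item; everything else is omitted."""

    def __init__(self):
        self.items = {}
        self.exceptions = types.SimpleNamespace(
            ConditionalCheckFailedException=ConditionalCheckFailedException)

    def put_item(self, TableName, Item, ConditionExpression=None):
        key = Item["id"]["S"]
        if ConditionExpression == "attribute_not_exists(id)" and key in self.items:
            raise ConditionalCheckFailedException(key)
        self.items[key] = Item


def make_handler(ddb, table, log):
    """Same flow as the lab's handler.py, with the client and log injected."""
    def handler(event, _ctx):
        failures = []
        for r in event["Records"]:
            mid = r["messageId"]
            try:
                ddb.put_item(TableName=table, Item={"id": {"S": mid}},
                             ConditionExpression="attribute_not_exists(id)")
                log.append(("processed", mid))
            except ddb.exceptions.ConditionalCheckFailedException:
                log.append(("duplicate", mid))
            except Exception:
                failures.append({"itemIdentifier": mid})
        return {"batchItemFailures": failures}
    return handler


fake, log = FakeDynamoDB(), []
handler = make_handler(fake, "lab-evt-events", log)
event = {"Records": [{"messageId": "m-1", "body": '{"scan":"SC-1"}'}]}
first = handler(event, None)
second = handler(event, None)  # the same messageId delivered a second time
print(log, len(fake.items))
```

The second call takes the duplicate branch and the fake table still holds one item, which is the behaviour a real redelivery would produce against DynamoDB.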
Validation checklist. You created a buffered, idempotent, DLQ-backed consumer: an SQS source (queue-poll model), a visibility timeout sized to the function timeout, partial-batch failure reporting so one bad record doesn’t reprocess the batch, a DLQ via redrive policy so poison messages land somewhere, and a DynamoDB conditional write for idempotency. The lab steps mapped to what each proves:
| Step | What you did | What it proves |
|---|---|---|
| 3 | Queue with redrive + 180s visibility | Visibility ≥ 6× timeout; failures have a DLQ |
| 5 | boto3.client in init scope | Warm invokes skip client init (cold-start lever) |
| 5 | Conditional put | Idempotency under at-least-once delivery |
| 5/6 | ReportBatchItemFailures | One bad record doesn’t reprocess the whole batch |
| 7 | Send + scan COUNT | Reprocessing the same ID is a safe no-op |
Cleanup.
MID=$(aws lambda list-event-source-mappings --function-name ${PFX}-worker --query 'EventSourceMappings[0].UUID' --output text --region $R)
aws lambda delete-event-source-mapping --uuid $MID --region $R
aws lambda delete-function --function-name ${PFX}-worker --region $R
aws sqs delete-queue --queue-url $Q_URL ; aws sqs delete-queue --queue-url $DLQ_URL
aws dynamodb delete-table --table-name ${PFX}-events --region $R
aws iam detach-role-policy --role-name ${PFX}-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaSQSQueueExecutionRole
aws iam detach-role-policy --role-name ${PFX}-role --policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess
aws iam delete-role --role-name ${PFX}-role
Cost note. Lambda’s free tier (1M requests + 400,000 GB-seconds/month), DynamoDB on-demand at trivial volume, and SQS’s first million requests are all free; this lab costs effectively ₹0 and deleting the resources stops everything.
Common mistakes & troubleshooting
The playbook — the part you bookmark. First as a scannable table you read mid-incident, then the detail for the entries that bite hardest.
| # | Symptom | Root cause | Confirm (exact command / metric) | Fix |
|---|---|---|---|---|
| 1 | Async events silently disappear | No on-failure destination; retries exhausted | aws lambda get-function-event-invoke-config; check for DestinationConfig | Set OnFailure destination; alarm on DLQ depth |
| 2 | Stream consumer stuck for hours, no progress | Poison record + infinite default retries blocks shard | IteratorAge climbing; aws lambda get-event-source-mapping retries=-1 | Finite MaximumRetryAttempts, BisectBatchOnFunctionError, on-failure dest |
| 3 | Same event processed twice (double charge) | At-least-once delivery + non-idempotent handler | Duplicate side effects in logs/DB; no dedup store | DynamoDB conditional write keyed on event ID |
| 4 | SQS messages reprocessed repeatedly | Visibility timeout < 6× function timeout | Queue VisibilityTimeout vs function Timeout | Set visibility ≥ 6× timeout |
| 5 | 429 TooManyRequestsException under load | Concurrency limit reached | Throttles > 0; ConcurrentExecutions at ceiling | Raise account quota; reserved concurrency; backoff |
| 6 | Database refuses connections during spikes | Lambda scale-out > DB connection budget | RDS DatabaseConnections at max; function fan-out | Reserved concurrency cap + RDS Proxy |
| 7 | Whole SQS batch reprocessed on one failure | No partial-batch reporting | Mapping FunctionResponseTypes empty | ReportBatchItemFailures + return failed IDs |
| 8 | EventBridge event handled twice | Two rules’ patterns both match (no first-match-wins) | MatchedEvents on two rules for one event | Tighten patterns; keep consumers idempotent |
| 9 | S3-triggered function loops, billing spikes | Function writes back to the trigger bucket | Runaway Invocations; CloudWatch billing alarm | Scope prefix/suffix or write to a different bucket |
| 10 | Task timed out after 900.00 seconds | Job exceeds 15-min hard limit | Logs show timeout at 900 s | Re-architect into steps; use Step Functions/Fargate |
| 11 | Cold starts spike p99 on the API | Sync path, no warm envs (esp. JVM/.NET/VPC) | InitDuration in logs; p99 latency | Provisioned concurrency; SnapStart; smaller package |
| 12 | AccessDeniedException calling an AWS service | Execution role missing a permission | CloudTrail shows the denied action | Add the action to the execution role |
| 13 | Function returns old code after deploy | Alias/trigger points to a stale version | aws lambda get-alias; trigger qualifier | Update alias to new version; trigger the alias |
| 14 | Reserved concurrency “broke” the function | Set to 0 (which disables it) | ReservedConcurrentExecutions: 0 | Use unset for “no reservation,” not 0 |
| 15 | Runtime exited / errno 137 | OOM — memory too low | MaxMemoryUsed near limit | Increase MemorySize; fix the leak |
The expanded form for the ones that cause the most damage:
1. Async events silently disappear.
Root cause: An async invocation (S3/SNS/EventBridge/Event) failed, retried twice, and had no on-failure destination or DLQ — so the event was dropped.
Confirm: aws lambda get-function-event-invoke-config --function-name <fn> returns no DestinationConfig; the Errors metric is non-zero while nothing lands in any queue.
Fix: Configure an OnFailure destination (SQS preferred) on every async function; alarm on the DLQ’s ApproximateNumberOfMessagesVisible > 0.
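As a rough illustration of the fix for entry 1, the sketch below builds the arguments for Lambda's PutFunctionEventInvokeConfig call and checks whether an existing configuration routes failures anywhere. The function names, the example ARN, and the one-hour event age are assumptions made for illustration; only the parameter names come from the Lambda API.

```python
def build_async_failure_config(function_name, failure_queue_arn,
                               max_retries=2, max_event_age_s=3600):
    """Return keyword arguments for Lambda's PutFunctionEventInvokeConfig API."""
    if not 0 <= max_retries <= 2:
        raise ValueError("async MaximumRetryAttempts must be 0, 1 or 2")
    if not 60 <= max_event_age_s <= 21600:
        raise ValueError("MaximumEventAgeInSeconds must be between 60 and 21600")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_event_age_s,
        "DestinationConfig": {"OnFailure": {"Destination": failure_queue_arn}},
    }


def has_failure_destination(invoke_config):
    """True if an event-invoke config (as a dict) routes failed events somewhere."""
    on_failure = invoke_config.get("DestinationConfig", {}).get("OnFailure", {})
    return bool(on_failure.get("Destination"))


# Hypothetical names; with boto3 installed the dict would be applied with
# boto3.client("lambda").put_function_event_invoke_config(**cfg)
cfg = build_async_failure_config(
    "orders-worker", "arn:aws:sqs:ap-south-1:123456789012:orders-failed"
)
```

The check covers on-failure destinations only. A legacy DLQ lives in the function's DeadLetterConfig, so an audit script would need to inspect that as well.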
2. Stream consumer stuck for hours.
Root cause: A poison record that always fails, combined with the default MaximumRetryAttempts=-1 (infinite), blocks its shard — every record behind it waits.
Confirm: The IteratorAge metric climbs steadily (records aging without being processed); aws lambda get-event-source-mapping --uuid <id> shows MaximumRetryAttempts: -1.
Fix: Set a finite MaximumRetryAttempts and a MaximumRecordAge, enable BisectBatchOnFunctionError to isolate the bad record, and add a DestinationConfig.OnFailure so the failed batch’s metadata is captured.
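The stream fix for entry 2 can be sketched the same way: a helper that assembles the UpdateEventSourceMapping arguments, plus a check for the unbounded default. The helper names and the default values (3 retries, 1 hour record age) are illustrative assumptions, not recommendations from the source.

```python
def build_stream_retry_config(mapping_uuid, failure_arn,
                              max_retries=3, max_record_age_s=3600):
    """Return keyword arguments for Lambda's UpdateEventSourceMapping API."""
    if max_retries < 0:
        raise ValueError("use a finite, non-negative retry count (-1 means infinite)")
    return {
        "UUID": mapping_uuid,
        "MaximumRetryAttempts": max_retries,
        "MaximumRecordAgeInSeconds": max_record_age_s,
        "BisectBatchOnFunctionError": True,  # split a failing batch to isolate the bad record
        "DestinationConfig": {"OnFailure": {"Destination": failure_arn}},
    }


def is_unbounded(mapping):
    """True if a stream mapping (as a dict) still retries forever."""
    return mapping.get("MaximumRetryAttempts", -1) == -1
```

Running is_unbounded over the output of list-event-source-mappings is one way to find mappings that a poison record could block indefinitely.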
3. Same event processed twice.
Root cause: At-least-once delivery (async, stream, or queue) delivered a duplicate, and the handler is not idempotent, so a side effect (charge, email, counter) ran twice.
Confirm: Duplicate side effects with the same source event ID; no dedup/idempotency store in the code path.
Fix: Derive an idempotency key from the event and conditional-write it to DynamoDB (attribute_not_exists) before the side effect; or use Powertools Idempotency.
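A minimal sketch of the conditional-write pattern in entry 3, assuming a DynamoDB table whose partition key is named pk (a hypothetical name) and a table object that behaves like a boto3 Table resource. The error check reads the response attribute that botocore's ClientError carries, so the block does not need to import boto3.

```python
def claim_event(table, event_id):
    """Record event_id; return True for the first claim, False for a duplicate."""
    try:
        table.put_item(
            Item={"pk": event_id},
            ConditionExpression="attribute_not_exists(pk)",
        )
        return True
    except Exception as exc:
        # botocore's ClientError exposes the service error code on .response
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False
        raise


def process_once(table, event_id, side_effect):
    """Run side_effect only if this event ID has not been claimed before."""
    if not claim_event(table, event_id):
        return "duplicate"
    side_effect()
    return "processed"
```

Claiming before the side effect means a crash between the two steps leaves the event marked as done without the effect having run. Tracking in-progress and completed states, as Powertools Idempotency does, closes that gap at the cost of more moving parts.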
4. SQS messages reprocessed repeatedly.
Root cause: The queue’s visibility timeout is shorter than ~6× the function timeout, so a still-running invocation’s message becomes visible and is redelivered to a second environment.
Confirm: Compare the queue’s VisibilityTimeout to the function’s Timeout; duplicates correlate with slow invocations.
Fix: Set the visibility timeout to at least 6× the function timeout (e.g. 180s for a 30s function).
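The 6× rule in entry 4 is easy to encode as a pre-deploy check. The sketch below also adds the mapping's batching window to the minimum, which is an assumption layered on top of the rule stated above; the function names are hypothetical.

```python
def min_visibility_timeout(function_timeout_s, batching_window_s=0):
    """Smallest safe SQS visibility timeout for a Lambda consumer, in seconds."""
    return 6 * function_timeout_s + batching_window_s


def visibility_is_safe(visibility_timeout_s, function_timeout_s, batching_window_s=0):
    """True if the queue's visibility timeout satisfies the 6x rule."""
    return visibility_timeout_s >= min_visibility_timeout(
        function_timeout_s, batching_window_s
    )
```

For the 30 s function in the example above, min_visibility_timeout(30) gives the same 180 s used in the lab's queue.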
5. 429 TooManyRequestsException under load.
Root cause: Demand exceeded available concurrency — the account’s 1,000 default, or a too-small reserved allocation, or another function hogging the pool.
Confirm: Throttles metric > 0; ConcurrentExecutions pinned at the limit; UnreservedConcurrentExecutions near zero in account settings.
Fix: Raise the account concurrency quota via Service Quotas; set reserved concurrency to guarantee this function a slice; ensure synchronous callers back off and retry.
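For the "back off and retry" part of entry 5, here is a minimal sketch of exponential backoff with full jitter for a synchronous caller. The helper names, base delay, and cap are illustrative assumptions. AWS SDKs already retry throttles with their own policy, so a wrapper like this is mainly useful where the caller needs explicit control.

```python
import random
import time


def is_throttled(exc):
    """True if exc looks like Lambda's 429 TooManyRequestsException."""
    code = getattr(exc, "response", {}).get("Error", {}).get("Code")
    return code == "TooManyRequestsException"


def backoff_delays(attempts, base_s=0.2, cap_s=10.0, rng=random.random):
    """Full jitter: delay n is drawn uniformly from [0, min(cap, base * 2**n))."""
    return [rng() * min(cap_s, base_s * 2 ** n) for n in range(attempts)]


def call_with_backoff(invoke, should_retry=is_throttled, attempts=5, sleep=time.sleep):
    """Call invoke(); on a retryable error wait and retry, re-raising on the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for i, delay in enumerate(backoff_delays(attempts)):
        try:
            return invoke()
        except Exception as exc:
            if not should_retry(exc) or i == attempts - 1:
                raise
            sleep(delay)
```

Jitter matters here because many throttled callers retrying on the same schedule would hit the concurrency ceiling again in lockstep.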
6. Database refuses connections during spikes.
Root cause: Lambda scaled to hundreds of environments, each opening a connection, exceeding the database’s max_connections — a stampede no RDS instance survives.
Confirm: RDS DatabaseConnections at the ceiling exactly when the function fans out; function errors are connection failures.
Fix: Cap the function with reserved concurrency sized to the DB’s connection budget, and front the database with RDS Proxy so functions share a pool.
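Sizing the cap in entry 6 is simple arithmetic. The sketch below assumes each execution environment holds a fixed number of connections and that some share of max_connections is kept back for other clients; the 80% headroom figure and the function name are assumptions.

```python
def reserved_concurrency_for(max_connections, conns_per_env=1, headroom_pct=80):
    """Reserved concurrency that keeps Lambda within a database connection budget."""
    budget = max_connections * headroom_pct // 100  # connections Lambda may use
    # Never return 0: a reserved concurrency of 0 disables the function.
    return max(1, budget // conns_per_env)
```

For a database with max_connections of 200 and one connection per environment, this yields a cap of 160, leaving 40 connections for migrations, dashboards, and other clients.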
9. S3-triggered function loops and billing spikes.
Root cause: A function triggered on s3:ObjectCreated:* writes a derived object back into the same bucket, re-triggering itself in a runaway loop.
Confirm: Invocations climbing without external cause; a billing alarm fires; the same bucket appears as both source and write target.
Fix: Scope the notification to a narrow prefix/suffix that excludes derived objects, or write outputs to a different bucket entirely.
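As a second line of defence alongside the notification scoping in entry 9, the handler itself can refuse to process its own output. This sketch assumes derived objects are written under a derived/ prefix, which is a hypothetical convention, and decodes keys because S3 event notifications URL-encode them.

```python
from urllib.parse import unquote_plus

OUTPUT_PREFIX = "derived/"  # hypothetical prefix for objects this function writes


def keys_to_process(event, output_prefix=OUTPUT_PREFIX):
    """Return decoded object keys from an S3 event, skipping our own outputs."""
    keys = []
    for record in event.get("Records", []):
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(output_prefix):
            continue  # written by this function: processing it would loop
        keys.append(key)
    return keys
```

The guard does not stop the extra invocations on its own, since a misconfigured notification still triggers the function, but it does stop each one from producing another object.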
Best practices
- Always attach a failure destination. Every async function gets an OnFailure destination; every SQS source gets a redrive policy → DLQ; every stream source gets a finite retry count plus an on-failure destination. “Failed” must never mean “silently gone.”
- Make every handler idempotent. At-least-once delivery is the contract for async, stream and queue paths. Use a conditional write on an event-derived key before any side effect. This single discipline prevents the worst class of serverless bug.
- Set SQS visibility timeout to ≥ 6× the function timeout. The cheapest way to prevent duplicate processing from in-flight redelivery.
- Cap concurrency in front of fragile downstreams. Reserved concurrency sized to a database’s connection budget, plus RDS Proxy, stops a scale-out stampede. DynamoDB needs no cap; relational databases do.
- Use ReportBatchItemFailures on every batch source. Without it, one bad record reprocesses the whole batch (and can loop). With it, good records checkpoint and only the failures retry.
- Bound stream retries and bisect on error. Never leave MaximumRetryAttempts=-1; a poison record will otherwise block the shard forever. Bisect to isolate it.
- Initialise expensive things in module scope, not the handler. SDK clients, DB pools, parsed config — paid once per environment, skipped on warm invocations. The simplest cold-start and cost win.
- Choose the broker deliberately. SNS for low-latency fan-out, EventBridge for content-routed choreography with replay, SQS for buffering and backpressure. Combine SNS→SQS for fan-out with durability.
- Reach for Step Functions past three steps. Don’t chain Lambdas to model a workflow; a state machine gives retries, branching, and a visual execution history for free.
- Right-size memory by profiling, not defaulting. CPU scales with memory; a CPU-bound function is often faster and cheaper at 1,024 MB than 128 MB. Use Power Tuning.
- Provisioned concurrency only where latency SLOs demand it. It costs money even idle; reserve it for user-facing synchronous paths with cold-start sensitivity, not background workers.
- Alarm on the leading indicators: Throttles, IteratorAge, DLQ depth, Errors, and downstream connection counts — not just “function errored.”
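The partial-batch practice above is easiest to see in a handler. The following is a minimal sketch for an SQS source with ReportBatchItemFailures enabled on the mapping; process and its order_id check are hypothetical stand-ins for real business logic.

```python
import json


def process(body):
    """Hypothetical business logic: reject records that lack an order_id."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event, context):
    """Process an SQS batch and report only the failed message IDs."""
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message returns to the queue; the rest are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty batchItemFailures list marks the whole batch as successful, so the handler should raise instead if it cannot tell which records failed.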
The alarms worth wiring before the next incident — leading indicators, not lagging:
| Alarm on | Metric | Threshold (starting point) | Why it’s leading |
|---|---|---|---|
| Throttling | Throttles | > 0 sustained 5 min | Concurrency ceiling before users feel 429s |
| Stream lag | IteratorAge | > 60,000 ms | Shard falling behind / poison record blocking |
| Dead-letter fill | DLQ ApproximateNumberOfMessagesVisible | > 0 | Events are leaving the happy path |
| Error rate | Errors | > 1% of invocations | Handler failing — confirm with logs |
| Async age | AsyncEventAge | climbing toward 6 h | Async backlog not draining |
| Downstream saturation | RDS DatabaseConnections | > 80% of max | Stampede before the DB refuses |
| Cold-start latency | InitDuration p99 | > your SLO | Sync path latency creeping up |
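Two of the rows above, expressed as CloudWatch alarm definitions. This is a sketch that only builds the argument dicts for PutMetricAlarm; the function name orders-worker, the alarm naming scheme, and the choice to treat missing data as not breaching are assumptions.

```python
def lambda_alarm(function_name, metric, threshold, statistic="Sum",
                 period_s=60, evaluation_periods=5):
    """Return keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"{function_name}-{metric}",
        "Namespace": "AWS/Lambda",
        "MetricName": metric,
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": statistic,
        "Period": period_s,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no datapoints means no invocations
    }


ALARMS = [
    # Throttles > 0 for five consecutive one-minute periods
    lambda_alarm("orders-worker", "Throttles", 0),
    # IteratorAge above 60,000 ms in any single period
    lambda_alarm("orders-worker", "IteratorAge", 60_000,
                 statistic="Maximum", evaluation_periods=1),
]
```

With boto3 available, each dict would be passed to boto3.client("cloudwatch").put_metric_alarm(**alarm), typically with an AlarmActions entry pointing at an SNS topic.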
Security notes
- Least-privilege execution roles. Each function gets its own role scoped to exactly the actions and resources it needs — a specific table ARN, a specific queue, a specific KMS key. Never attach AdministratorAccess or broad * policies; the IAM least-privilege discipline matters more, not less, when you have dozens of functions.
- Secrets out of environment variables. Environment variables are visible in the console and the API. Put secrets in Secrets Manager or SSM Parameter Store (SecureString) and fetch them in init code; grant the role only the specific secret’s ARN.
- Encrypt at rest and in transit. Lambda encrypts env vars with KMS — use a customer-managed key for sensitive config and grant the role kms:Decrypt on just that key. Ensure downstream calls (to RDS, S3) use TLS.
- Validate and bound event input. Treat every event as untrusted: validate schema, cap sizes, and never eval/deserialize untrusted payloads. A malformed event should fail cleanly to a DLQ, not crash or get exploited.
- Scope resource policies tightly. When S3/SNS/EventBridge invokes your function, the add-permission statement should pin --source-arn (and --source-account) so only that bucket/topic/rule can invoke it — not the whole service.
- VPC only when needed. Putting Lambda in a VPC is for reaching private resources (RDS, internal services); it adds an ENI and (historically) cold-start cost. Don’t VPC-attach a function that only calls public AWS APIs.
- Sign and scan deployment artifacts. For container-image functions, pin image digests, scan with ECR scanning, and pull from a private registry. Code Signing for Lambda can enforce that only signed code deploys.
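The resource-policy scoping described above can be sketched as the arguments for Lambda's AddPermission call, pinned to one bucket and one account. The function name, bucket name, and statement-ID scheme are hypothetical; dots in the bucket name are replaced because statement IDs do not allow them.

```python
def s3_invoke_permission(function_name, bucket_name, account_id):
    """Return keyword arguments for Lambda's AddPermission API, scoped to one bucket."""
    return {
        "FunctionName": function_name,
        "StatementId": "allow-s3-" + bucket_name.replace(".", "-"),
        "Action": "lambda:InvokeFunction",
        "Principal": "s3.amazonaws.com",
        "SourceArn": f"arn:aws:s3:::{bucket_name}",  # only this bucket may invoke
        "SourceAccount": account_id,                 # and only while this account owns it
    }
```

SourceAccount matters for S3 in particular: bucket names are global and ARNs carry no account ID, so a deleted bucket could be re-created by another account under the same name.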
The security controls that also improve resilience — they pull the same direction here:
| Control | Mechanism | Secures against | Also prevents |
|---|---|---|---|
| Per-function least-privilege role | Scoped IAM policy | Blast radius of a compromised function | AccessDenied surprises when a shared, over-broad role is edited for another function |
| Secrets in Secrets Manager / Parameter Store | Encrypted secret or SecureString + role grant | Secrets leaking via env vars | Rotation breaking a hard-coded value |
| --source-arn on invoke permission | Resource policy condition | Any topic/bucket invoking your fn | Accidental cross-source triggering |
| Input validation + size caps | Handler-level checks | Injection / oversized payloads | Poison records crashing the consumer |
| Customer-managed KMS key | kms:Decrypt scoped | Unauthorised decrypt of config | Silent init failure (grant it correctly) |
| ECR scanning + digest pinning | Image supply chain | Tampered/unknown images | Surprise breakage from a moved tag |
Cost & sizing
The bill drivers and how they interact with the design:
- Requests + GB-seconds dominate. You pay per request (roughly $0.20 per million at list price) and per GB-second (memory × duration). A function at 256 MB running 200 ms costs one-sixteenth of one at 1,024 MB running 800 ms. But if quadrupling memory makes a CPU-bound function finish in a quarter of the time or less, the higher setting costs the same or less and returns sooner. Profile.
- Idle is free. Unlike a 24×7 EC2 instance, a Lambda pipeline costs effectively nothing overnight. Parcelo’s drop from ₹14,000 to ₹3,800 was mostly the elimination of idle.
- Provisioned concurrency is the exception — you pay hourly for pre-warmed environments whether or not they run. Use it only where a latency SLO justifies it; it can quietly dominate the bill of a low-traffic function.
- The break-even with containers. Per-invoke billing wins for spiky/intermittent traffic and loses for sustained high RPS. As a rough rule, once a function runs near-continuously at high concurrency, a reserved Fargate task or EC2 may be cheaper — model both.
- Free tier is generous: 1M requests and 400,000 GB-seconds per month, always free. Most dev and many small-prod workloads stay within it.
A rough monthly picture for a moderate event pipeline (say 50M events/month, 256 MB, ~150 ms each):
| Cost driver | What you pay for | Rough INR / month | Watch-out |
|---|---|---|---|
| Lambda requests | ~50M invocations | ~₹700–900 | Batch where possible to cut request count |
| Lambda GB-seconds | 256 MB × 150 ms × 50M | ~₹2,500–4,000 | Right-size memory; shorten duration |
| Provisioned concurrency | N warm envs × hours | ~₹1,000+ per 10 envs | Idle cost — only for latency SLOs |
| SQS requests | Polls + sends | ~₹300–600 | Batching window reduces poll count |
| DynamoDB (on-demand) | Idempotency + projection writes | ~₹500–1,500 | TTL the idempotency table |
| CloudWatch Logs | Ingestion + storage | ~₹500–2,000 | Set retention; sample noisy logs |
| RDS Proxy (if used) | Per-vCPU-hour of the DB | ~₹1,500–3,000 | Only if fronting a relational DB |
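The first two rows of the table can be reproduced with a few lines of arithmetic. The rates below are assumptions based on commonly published x86 on-demand list prices ($0.20 per million requests, $0.0000166667 per GB-second); they vary by region and architecture, and the sketch ignores the free tier.

```python
def monthly_lambda_cost_usd(invocations, memory_mb, duration_ms,
                            usd_per_million_requests=0.20,
                            usd_per_gb_second=0.0000166667):
    """Estimate monthly Lambda request and compute cost, ignoring the free tier."""
    gb_seconds = invocations * (memory_mb / 1024) * (duration_ms / 1000)
    return {
        "gb_seconds": gb_seconds,
        "requests_usd": invocations / 1_000_000 * usd_per_million_requests,
        "compute_usd": gb_seconds * usd_per_gb_second,
    }


# The scenario from the table: 50M events/month, 256 MB, ~150 ms each
estimate = monthly_lambda_cost_usd(50_000_000, 256, 150)
```

Under these assumed rates the scenario comes to about 1.875 million GB-seconds, roughly $10 in requests and $31 in compute. At an assumed exchange rate of about ₹84 per USD that is approximately ₹840 and ₹2,600, which sits inside the ranges shown in the table.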
Sizing rules of thumb: start at 256 MB and profile up; set timeout to ~2× the observed p99 duration (not the 15-min max); set batch size to balance throughput against the cost of reprocessing a failed batch; and cap reserved concurrency to whatever your most fragile downstream can survive. The cheapest correct pipeline is almost always “small functions, buffered sources, idempotent handlers, right-sized memory” — not a bigger anything.
Interview & exam questions
1. Explain Lambda’s four invocation models and why the distinction matters. Synchronous (caller waits, caller retries), asynchronous (Lambda queues it, retries twice, sends to a DLQ/destination or drops it), stream poll (Lambda polls Kinesis/DynamoDB Streams, per-shard ordering, checkpointing), and queue poll (Lambda polls SQS, visibility-timeout-driven redelivery). The model determines who retries, how many times, the ordering guarantee, and where data goes on failure — so it dictates how you design for correctness.
2. An async-triggered function’s events are disappearing. What’s happening and how do you fix it? Async invocations retry twice and then send the event to a configured on-failure destination or DLQ — and if none is configured, the event is dropped. Confirm there’s no DestinationConfig via get-function-event-invoke-config. Fix by attaching an OnFailure destination (SQS) and alarming on its depth.
3. Why must an SQS visibility timeout be at least 6× the function timeout? When Lambda reads a message it becomes invisible for the visibility-timeout window. If that window is shorter than the time the function needs, the message becomes visible again and is redelivered to a second environment while the first is still processing, producing duplicates. The 6× factor leaves headroom for Lambda to retry a batch whose invocation was throttled before the messages reappear in the queue.
4. What is a poison record on a stream, and how do you prevent it blocking the shard? A record that always fails. Because stream records are processed in order and the default MaximumRetryAttempts is -1 (infinite), the bad record is retried forever, blocking every record behind it on that shard. Prevent it with a finite retry count, a MaximumRecordAge, BisectBatchOnFunctionError to isolate it, and an on-failure destination.
5. Why is idempotency mandatory in event-driven Lambda, and how do you implement it? Async, stream and queue deliveries are at-least-once — the same event will eventually arrive twice. Without idempotency, side effects (charges, emails, counters) double. Implement it by deriving a key from the event and doing a conditional write (attribute_not_exists) to DynamoDB before the side effect, so the second attempt is a no-op.
6. SNS vs EventBridge vs SQS for fan-out — how do you choose? SNS: low-latency one-to-many push fan-out, limited filtering. EventBridge: rich content-based routing, schema registry, archive/replay — the choreography hub. SQS: not fan-out at all but a buffer giving backpressure, retries and a DLQ. Combine SNS→SQS to get fan-out with per-consumer durability.
7. A Lambda is exhausting a relational database’s connections during traffic spikes. Fix? Lambda scales to hundreds of concurrent environments, each opening a connection, blowing past max_connections. Cap the function with reserved concurrency sized to the DB’s connection budget, and put RDS Proxy in front so the functions share a pooled set. DynamoDB avoids this problem because it is called over a stateless HTTPS API rather than through a fixed pool of persistent connections.
8. What causes cold starts and what reduces them? Initialising a new environment: code/layer download, runtime bootstrap, and your init code (clients, connections). Reduce with smaller packages, initialising clients in module scope, right-sizing memory (more CPU → faster init), provisioned concurrency (pre-warmed envs), and SnapStart for Java/.NET/Python. It only matters on latency-critical synchronous paths.
9. Difference between reserved and provisioned concurrency? Reserved concurrency caps and guarantees a function’s slice of the account pool (free — it’s just allocation), used to protect downstreams or other functions. Provisioned concurrency pre-warms a number of environments to eliminate cold starts (paid hourly even when idle), used for latency-sensitive paths. Setting reserved to 0 disables the function.
10. When is Lambda the wrong choice? For long-running (>15 min), stateful, or sustained high-CPU/steady-state workloads, where per-invoke billing loses to a reserved container and the timeout/statelessness constraints fight you. Use ECS/EKS/Fargate or EC2 there; use Lambda for event-shaped, intermittent, spiky work.
11. How do you process a batch from SQS so one bad message doesn’t reprocess the whole batch? Enable ReportBatchItemFailures on the event-source mapping and return a batchItemFailures list of only the failed message IDs. Lambda then deletes the successful messages and redelivers only the failures, instead of redelivering the entire batch on any single failure.
12. An EventBridge event is being handled twice. Why? Two rules whose event patterns both match the same event each fire — there is no first-match-wins in EventBridge. Confirm with the MatchedEvents metric on both rules. Fix by tightening the patterns (more specific source + detail-type + content fields) and keeping consumers idempotent.
These map to AWS Certified Developer – Associate (DVA-C02) — develop event-driven and serverless solutions, Lambda configuration, SQS/SNS/EventBridge integration, error handling — and AWS Certified Solutions Architect – Associate (SAA-C03) — design decoupled and event-driven architectures, choosing between fan-out brokers, and resilience patterns. A compact cert mapping:
| Question theme | Primary cert | Objective area |
|---|---|---|
| Invocation models, retries, DLQs | DVA-C02 | Develop event-driven solutions |
| Idempotency, at-least-once | DVA-C02 | Resilient application design |
| Fan-out broker choice | SAA-C03 | Design decoupled architectures |
| Concurrency, throttling, scaling | DVA-C02 / SAA-C03 | Performance & resilience |
| Stream poison records, bisect | DVA-C02 | Troubleshoot serverless |
| Cold starts, provisioned concurrency | DVA-C02 | Optimize serverless performance |
Quick check
- An S3-triggered function’s events sometimes vanish with nothing in any queue. Which invocation model is this, and what one thing is almost certainly missing?
- Your Kinesis consumer’s IteratorAge has been climbing for two hours with no progress. What’s the single most likely cause, and the setting that’s enabling it?
- True or false: setting a function’s reserved concurrency to 0 is a good way to give it a tiny guaranteed slice.
- Why might the same SQS message be processed by two execution environments at once, and what’s the rule that prevents it?
- You need one “order placed” event to reach four independent consumers, each with its own retry and DLQ. What’s the cleanest pattern?
Answers
- It’s the asynchronous model (S3 invokes Lambda async). Almost certainly missing: an on-failure destination / DLQ — async retries twice and then drops the event if none is configured. Confirm with aws lambda get-function-event-invoke-config and attach an OnFailure destination.
- A poison record that always fails, blocking the shard because the default MaximumRetryAttempts is -1 (infinite) — every record behind it waits. Fix with a finite retry count, MaximumRecordAge, BisectBatchOnFunctionError, and an on-failure destination.
- False. Reserved concurrency of 0 disables the function entirely. For “no reservation,” leave it unset/null; to give a small slice, set a small positive number.
- Because the visibility timeout is shorter than the time the function takes, so the message reappears and is redelivered to a second environment while the first is still working. The rule: set the visibility timeout to at least 6× the function timeout.
- SNS→SQS fan-out: an SNS topic fans the event out to four SQS queues, one per consumer; each queue buffers for its own Lambda with its own redrive policy → DLQ. You get SNS’s fan-out and SQS’s per-consumer durability and backpressure. (EventBridge with four rule targets is the alternative when you want content-based routing and replay.)
Glossary
- Function — the unit of deployment: your code, its runtime, and configuration (memory, timeout, role, environment).
- Handler — the entry point AWS calls for each event; runs once per event on a warm or cold environment.
- Init code — code outside the handler (module scope) run once per execution environment; where you cache SDK clients, DB pools and parsed config to beat cold starts.
- Execution environment — the Lambda-managed micro-VM that runs your code; reused (warm) when possible, created anew (cold start) otherwise.
- Invocation model — synchronous, asynchronous, or poll-based (stream/queue); determined by the trigger and dictating retry, ordering and error behaviour.
- Synchronous invocation — the caller blocks for the response; Lambda does not retry; the caller owns retries (API Gateway, ALB, SDK RequestResponse).
- Asynchronous invocation — the event is queued internally, 202 returned immediately; Lambda retries (default twice) then routes to an on-failure destination/DLQ or drops it (S3, SNS, EventBridge).
- Event-source mapping — Lambda’s managed poller for a stream or queue (Kinesis, DynamoDB Streams, SQS, Kafka, MQ), controlling batch size, concurrency and checkpointing.
- At-least-once delivery — the contract for async and poll-based paths: the same event may be delivered more than once, so handlers must be idempotent.
- Idempotency — the property that processing the same event twice produces the same result with no doubled side effects; implemented via a conditional write keyed on the event.
- Concurrency — the number of execution environments running simultaneously; the scaling unit, bounded by the account limit (default 1,000/region).
- Reserved concurrency — a per-function guaranteed-and-capped slice of the account pool (free); setting it to 0 disables the function.
- Provisioned concurrency — pre-warmed execution environments that eliminate cold starts for a version/alias; billed hourly even when idle.
- Cold start — the latency of initialising a new environment (code download, runtime bootstrap, init code); a slow first request, not an error.
- DLQ / on-failure destination — where an event goes after retries are exhausted; on-failure destinations (SQS/SNS/EventBridge/Lambda) carry richer context than legacy DLQs and are preferred.
- Visibility timeout — the window an SQS message is hidden after being read; must be ≥ 6× the function timeout to prevent duplicate processing.
- Poison record — a stream/queue record that always fails; with infinite retries it blocks the shard, fixed with finite retries, record-age bounds, and bisect-on-error.
- Partial-batch response (ReportBatchItemFailures) — returning only the failed record IDs from a batch so successful records checkpoint and only failures retry.
- Fan-out / fan-in — one event delivered to many consumers (SNS/EventBridge), or many events aggregated to one store; SNS→SQS combines fan-out with per-consumer durability.
Next steps
You can now choose the right invocation model, wire each event source correctly, and design for at-least-once delivery without losing events. Build outward:
- Next: Compute on AWS: EC2 vs Lambda vs ECS vs EKS — when a function is the wrong tool and a container or VM wins.
- Related: ECS, EKS & Fargate: Choosing Your Container Path — the long-running, stateful counterpart to event-driven functions.
- Related: ALB vs NLB vs API Gateway, Compared — the synchronous front door for your Lambda APIs.
- Related: DynamoDB, RDS & Aurora, Compared — the state store for stateless functions, and the source of DynamoDB Streams triggers.
- Related: AWS Organizations & IAM Foundations — least-privilege execution roles done right across many functions.