Production Amazon ECS on Fargate: Task Networking, Auto Scaling, and Safe Rolling Deployments

A docker run on a laptop and an ECS service on AWS Fargate share almost no operational concerns. Fargate is the serverless launch type for Amazon Elastic Container Service: you hand AWS a task definition and a desired count, and it runs your containers on capacity it owns and patches, with no EC2 instances for you to size, drain, or reboot. That removes the host layer, but it does not remove the choices that determine whether a deploy is safe at 3am: how each task gets an IP and a security group, when a deployment rolls back on its own, and what happens to in-flight requests when a task is told to stop. Those are the knobs that separate a service which drains cleanly on every release from one that drops connections, leaks IPs, and pages you on a Friday.

This guide walks the pieces I actually wire up for a production Fargate service, the same way I’d brief a new engineer joining the on-call rotation. Every section goes option-by-option: the valid CPU/memory matrix, the awsvpc ENI/IP planning math, the choice between target-tracking and step scaling and exactly which metric to scale on, the deployment circuit breaker that auto-rolls-back a bad image, the SIGTERM → stopTimeout → deregistration-delay triad that makes shutdown graceful, the execution-role vs task-role split that is the most common IAM mistake on ECS, and the Fargate Spot + Graviton + right-sizing levers that move the bill. Because this is a reference you’ll return to mid-incident, the options, limits, error codes and the deploy playbook itself are laid out as scannable tables — read the prose once, then keep the tables open when the rollout is stuck.

Assume the AWS provider/region is set, an Application Load Balancer (ALB) exists, and you’re on a recent CLI (aws --version >= 2.x). I use the Linux platform version LATEST throughout, which today resolves to Fargate platform version 1.4.0, and I default to ARM64 (Graviton) because it is the cheapest change you can make. By the end you’ll be able to register a correctly-sized task, place it in private subnets with per-task security groups, scale it on the right signal, deploy it with an automatic safety net, and shut it down without severing a single request.

What problem this solves

ECS on Fargate hides the fleet so you can ship a container without owning servers. That abstraction is a gift until a deploy goes sideways, and then the failure modes are not in your application code — they’re in the wiring between the ALB, the task ENI, the scaling policy, and the lifecycle hooks. The defaults are tuned for “it runs”, not for “it survives a release under load”, and almost every production ECS incident I’ve seen traces back to one of a small set of mis-set knobs.

What breaks without this knowledge, concretely: a service scaled on CPU that is actually I/O-bound scales out after p99 has already tripled, because CPU never crossed the target while request queues grew. A container launched via sh -c "java -jar app.jar" leaves the shell as PID 1; the shell does not forward SIGTERM to the JVM, so the JVM is SIGKILLed on every task stop and every in-flight request dies. An ALB target group left at the default 300-second deregistration delay holds existing connections open to tasks ECS has already begun stopping, and stretches every deploy and scale-in by minutes. A bad image with no circuit breaker leaves ECS replacing failing tasks forever, draining the subnet’s IP pool until new tasks can’t even be placed. And the single most common IAM mistake, conflating the execution role (used by the agent to pull the image and read secrets before the container starts) with the task role (used by your code at runtime), silently grants your application code permissions it should never have.

Who hits this: every team running containers on Fargate behind a load balancer. It bites hardest on services with chatty downstreams (the scaling-metric trap), services that were lifted from EC2 without revisiting PID 1 and signal handling (the dropped-connection trap), large services in small subnets (the IP-exhaustion trap), and anyone who has never actually seen their circuit breaker fire — because a safety net you’ve never tested is a configuration you don’t really have.

To frame the whole field before the deep dive, here is every failure class this article covers, the question it forces, and the one place to look first:

Failure class	What it looks like	First question to ask	First place to look	Most common single cause
Deploy-time 502s	502s on every release and every scale-in	Did the ALB cut a connection the task was still serving?	ALB target group settings	`target_type` not `ip`, or dereg delay still 300s
Dropped in-flight requests	Errors spike exactly when a task stops	Does PID 1 receive and handle SIGTERM?	Task definition `command`/`entryPoint`	Shell wrapper is PID 1, swallows the signal
Scales late / flaps	p99 climbs before scale-out; thrash on scale-in	Does load map to the metric you scaled on?	Auto Scaling policy + CloudWatch	CPU target tracking on an I/O-bound service
Tasks stuck PROVISIONING	`RESOURCE:ENI` / IP-not-available stopped reason	Are there free IPs for the deploy surge?	Subnet free-IP count vs `maximumPercent`	`/26` subnet, 40-task service at `maximumPercent: 200`
Bad deploy never recovers	Failing tasks replaced forever, IP pool drains	Is the circuit breaker on with `rollback`?	`describe-services` → `rolloutState`	`deploymentCircuitBreaker` not enabled
Code has perms it shouldn’t	App can read secrets it never references	Which role is your code actually using?	`executionRoleArn` vs `taskRoleArn`	Secrets perms on the task role, not exec role

Learning objectives

By the end of this article you can:

Pick a valid Fargate cpu/memory combination for the whole task (app + sidecars), choose ARM64 vs X86_64, and pin images to an immutable digest so two tasks in one deployment never run different code.
Plan awsvpc networking: one ENI and security group per task, the subnet IP-consumption math during a rolling deploy, and reaching ECR/Secrets Manager/CloudWatch via VPC interface endpoints instead of a NAT gateway.
Register a service as a scalable target and choose between target-tracking (and which predefined metric) and step scaling, layer multiple policies safely, and get the ResourceLabel right so the policy isn’t silently a no-op.
Configure a rolling deployment with minimumHealthyPercent/maximumPercent, a deployment circuit breaker with rollback, and a health-check grace period — and know when to reach for native blue/green instead.
Make shutdown graceful by coordinating SIGTERM, PID 1, stopTimeout, and the ALB deregistration delay so in-flight requests always drain.
Separate the execution role from the task role, inject secrets via the secrets block (never environment), and scope each role to specific ARNs.
Wire Container Insights, structured JSON logs (awslogs vs FireLens), and ADOT/X-Ray tracing, then pull the cost levers: Fargate Spot via a capacity-provider strategy, Graviton, and right-sizing.
Read a symptom → root cause → confirm → fix deploy playbook and localize any Fargate deploy/scale failure to one hop.

Prerequisites & where this fits

You should already understand the container basics: an image is built and pushed to a registry (Amazon ECR), a task definition is the immutable, versioned spec of what to run, a task is one running instance of that spec, and a service keeps a desired number of tasks running and registered behind a load balancer. You should know how to run aws in a shell, read JSON output, and that a VPC has subnets spread across Availability Zones, security groups (stateful, allow-only), and route tables. Familiarity with HTTP status codes, basic Linux process/signal concepts, and IAM policy JSON helps.

This sits in the Containers track and assumes the fundamentals from Amazon ECS & ECR Fundamentals: Task Definitions, Services & Fargate and the first deploy in Your First Container Deployment on ECS Fargate. It pairs tightly with Elastic Load Balancing Deep Dive: ALB, NLB & GWLB (the ALB and target group are half the deploy story) and VPC Deep Dive: Subnets, Routing, IGW, NAT & Endpoints (where the per-task ENIs and VPC endpoints live). For the choice of whether Fargate is even the right runtime, Choose Your Container Path: ECS vs EKS vs Fargate is upstream of this.

A quick map of who owns what during a Fargate incident, so you page the right person fast:

Layer	What lives here	Who usually owns it	Failure classes it can cause
Client / DNS / TLS	Name resolution, cert, retries	Frontend / SRE	502/503 only if misrouted; mostly red herrings
ALB + target group	Routing, health checks, deregistration	Platform / network	Deploy-time 502s (dereg delay, target type)
VPC subnets / ENIs	Per-task IPs, route to endpoints/NAT	Network team	Tasks stuck PROVISIONING (`RESOURCE:ENI`)
Security groups	Inbound from ALB, egress to deps	Platform + security	Connection refused / timeouts to the task
Task definition	CPU/mem, image, ports, lifecycle	App / dev team	Crash loops, dropped requests, OOM
Auto Scaling policy	Scalable target, metric, cooldowns	Platform + app	Scales late, flaps, hits max capacity
IAM roles (exec + task)	Image pull, secrets, runtime APIs	Security + app	Image-pull denied, over-broad app perms
ECS control plane	Rollout state, circuit breaker	Managed (AWS)	Bad deploy that never rolls back

Core concepts

Six mental models make every later decision obvious.

A task is a first-class network citizen, not a process sharing a host. On Fargate the network mode is always awsvpc: each task gets its own elastic network interface (ENI) with a private IP from the subnet you place it in, and its own security group(s). You get per-task security groups, per-task VPC Flow Logs, and clean blast-radius isolation — at the cost of consuming one subnet IP (and one ENI) per running task. That IP consumption is the planning trap, because during a rolling deploy you briefly run more tasks than steady state.

The task definition is immutable and versioned; the service points at one revision. Every register-task-definition produces a new revision (family:N). The service runs whatever revision you set, and a deploy is “make the service converge from revision N to revision N+1”. This is why pinning the image to a digest matters: if the task definition says :latest, ECS resolves the tag at each task launch, so two tasks in the same deployment can pull different code. Immutability of the task definition doesn’t help if the image tag moves underneath it.

A deploy is a controlled overlap of two task sets. ECS brings up tasks from the new revision before draining the old ones, bounded by two percentages of desired count: minimumHealthyPercent (the floor it keeps healthy) and maximumPercent (the ceiling it may temporarily exceed). The overlap is what gives zero-downtime — and what consumes extra IPs and extra Fargate vCPU-seconds for the duration. The deployment circuit breaker watches for a run of failed task launches and, if rollback is on, reverts to the last known-good revision instead of replacing failing tasks forever.

Scaling is a separate control loop on top of the service. ECS services scale through Application Auto Scaling, registered against a scalable target with a min and max. A policy adjusts desiredCount based on a CloudWatch metric. The hard part is not the mechanics — it’s choosing a metric that leads load. For a request-driven service, request count per target leads; CPU often lags because the work is I/O-bound.

Shutdown is a negotiation between three timers. When ECS stops a task it sends SIGTERM to each container’s PID 1, waits up to stopTimeout (default 30s, max 120s), then SIGKILL. Simultaneously it deregisters the task from the ALB target group, and the ALB waits the deregistration delay for in-flight connections to finish. Graceful shutdown means PID 1 receives SIGTERM, the app drains (stop accepting, finish in-flight, exit) inside stopTimeout, and the deregistration delay is long enough to cover the drain but no longer.
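To make the drain sequence concrete, here is a minimal Python sketch of a process that handles SIGTERM itself. It assumes the process really is PID 1 in the container (for example, started with an exec-form entrypoint rather than through a shell wrapper). The names `shutting_down`, `install_handlers`, and `drain` are illustrative and not part of any ECS API.

```python
import signal
import threading
import time

# Set once SIGTERM arrives; request loops check it to stop taking new work.
shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Flag the process as draining. Keep the handler tiny; drain in the main loop."""
    shutting_down.set()


def install_handlers():
    # Call once from the main thread of the container's PID 1 process.
    signal.signal(signal.SIGTERM, handle_sigterm)


def drain(in_flight, finish_one, budget_seconds):
    """Finish queued work until none remains or the budget is spent.

    budget_seconds should sit comfortably below the container's stopTimeout,
    because SIGKILL arrives when that timer expires.
    """
    deadline = time.monotonic() + budget_seconds
    while in_flight and time.monotonic() < deadline:
        finish_one(in_flight.pop(0))
    return not in_flight  # True means everything drained in time
```

A real server would stop accepting new connections as soon as `shutting_down` is set, call something like `drain` with a budget a few seconds under `stopTimeout`, and then exit with status 0.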

A task has two identities, and conflating them is the classic mistake. The execution role is assumed by the ECS agent before your container starts — to pull the image from ECR, write to the log group, and resolve secrets references. The task role is assumed by your application code at runtime to call AWS APIs (S3, DynamoDB, SQS). They are different principals doing different things at different times; secrets-reading belongs to the execution role, not the task role.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters in production
Task definition	Immutable, versioned spec of containers + envelope	ECS, `family:revision`	Pin the image digest or two tasks differ
Task	One running instance of a task definition	On Fargate capacity	The unit that gets an ENI and is stopped
Service	Keeps N tasks running + ALB-registered	ECS	Owns deploys, scaling, placement
`awsvpc`	One ENI + IP + SG per task	Fargate (always)	IP planning; per-task isolation
ENI	The task’s network interface	In a subnet	Finite per subnet; deploy surge consumes more
Execution role	Agent identity (pull, logs, secrets)	Task def `executionRoleArn`	Pre-start; reads secrets
Task role	App identity at runtime	Task def `taskRoleArn`	What your code’s SDK uses
`stopTimeout`	SIGTERM→SIGKILL grace per container	Container def	Must exceed app drain time
Deregistration delay	ALB wait for in-flight on stop	Target group	Must cover drain; default 300s is too long
Scalable target	The thing Auto Scaling adjusts	App Auto Scaling	Min/max bound on `desiredCount`
Target-tracking	Keep a metric near a value	Scaling policy	The default; pick the right metric
Circuit breaker	Auto-rollback on failed launches	Deploy config	Stops a bad image looping forever
Capacity provider	FARGATE vs FARGATE_SPOT mix	Cluster + service	The biggest cost lever
Platform version	Fargate runtime version (1.4.0)	Service	Feature/behavior baseline

1. Task definition: sizing, platform, and the CPU/memory matrix

A Fargate task definition declares the container(s), the CPU/memory envelope, the network mode (awsvpc), and two distinct IAM roles. The CPU/memory pair is not free-form: Fargate only accepts specific combinations, and the valid memory range is constrained by the CPU value you pick. The whole task shares this budget — a sidecar’s usage comes out of the same pool — so size the task for the sum, then optionally cap individual containers with container-level cpu/memory.

`cpu` (vCPU)	Valid `memory` values	Step	Typical use
256 (.25)	512, 1024, 2048 MiB	fixed list	Tiny sidecar-free APIs, cron tasks
512 (.5)	1024 – 4096 MiB	1 GiB	Small web service + log router
1024 (1)	2048 – 8192 MiB	1 GiB	Standard API with sidecars
2048 (2)	4096 – 16384 MiB	1 GiB	Memory-heavier services, JVM apps
4096 (4)	8192 – 30720 MiB	1 GiB	Large workers, in-memory caches
8192 (8)	16384 – 61440 MiB	4 GiB	Big batch / data tasks (requires platform version 1.4.0)
16384 (16)	32768 – 122880 MiB	8 GiB	Largest single-task workloads (requires platform version 1.4.0)
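The matrix above is easy to encode as a pre-flight check in CI so that an invalid pair never reaches register-task-definition. The sketch below transcribes the table directly; the function names are hypothetical helpers, not part of any AWS SDK.

```python
# Valid Fargate task sizes, transcribed from the matrix above (memory in MiB).
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(cpu, memory_mib):
    """Return True if (cpu, memory) is a combination listed in the matrix."""
    return memory_mib in FARGATE_SIZES.get(cpu, [])


def smallest_size_for(cpu_units_needed, memory_mib_needed):
    """Pick the lowest CPU tier, then the lowest memory in it, that covers
    the summed needs of every container in the task (app + sidecars)."""
    for cpu in sorted(FARGATE_SIZES):
        if cpu < cpu_units_needed:
            continue
        for memory in FARGATE_SIZES[cpu]:
            if memory >= memory_mib_needed:
                return cpu, memory
    return None  # nothing fits in a single Fargate task
```

For example, an app that needs about 0.25 vCPU but 3,000 MiB cannot use the 256 tier (its list stops at 2048 MiB), so the helper moves up to `(512, 3072)`.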

The container-level fields that shape sizing and lifecycle, each with its default and the trade-off:

Field	What it does	Default	When to set	Trade-off / gotcha
`cpu` (container)	Caps/reserves vCPU for one container	unset (shares task)	Pin a sidecar’s slice	Sum can’t exceed task `cpu`
`memory` (hard)	Hard cap; container killed if exceeded	unset	Bound a leaky sidecar	OOM-kills the container at the cap
`memoryReservation` (soft)	Soft floor; can burst above	unset	Most app containers	Needs headroom in task `memory`
`essential`	If true, its exit stops the task	true	Keep on the app; sidecars vary	A non-essential sidecar dying is silent
`stopTimeout`	SIGTERM→SIGKILL grace (s)	30	Raise to cover drain	Max 120 on Fargate
`user`	UID/GID the process runs as	image’s USER (root if unset)	Always set non-root	Image must support the UID
`readonlyRootFilesystem`	Mounts `/` read-only	false	Harden	App must write only to mounts/tmpfs
`portMappings.containerPort`	Port the app listens on	—	Always (web)	Must match ALB target group + health check
`healthCheck`	Container-level health command	none	Catch hangs ALB can’t see	Counts toward task health

The platform/runtime choices, where the biggest cost decision (ARM64) hides:

Setting	Values	Default	When to change	Trade-off
`runtimePlatform.cpuArchitecture`	`X86_64`, `ARM64`	`X86_64`	`ARM64` for ~20% cheaper vCPU-hr	Image must be `arm64`/multi-arch
`runtimePlatform.operatingSystemFamily`	`LINUX`, `WINDOWS_*`	`LINUX`	Windows containers only	Windows on Fargate has fewer SKUs
Platform version	`1.4.0`, `LATEST`	`LATEST`→1.4.0	Pin for reproducibility	Pinning misses new behavior/fixes
`image`	tag or `@sha256:` digest	—	Always pin a digest	Tag moves; two tasks diverge
`networkMode`	`awsvpc` (only on Fargate)	`awsvpc`	n/a on Fargate	Always per-task ENI
`ephemeralStorage.sizeInGiB`	21–200 GiB	20 GiB	Large scratch/space needs	Billed above the 20 GiB free

A correct task definition, ARM64, digest-pinned, with a sane health check and bounded non-blocking logs:

{
  "family": "checkout-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "runtimePlatform": { "cpuArchitecture": "ARM64", "operatingSystemFamily": "LINUX" },
  "executionRoleArn": "arn:aws:iam::111122223333:role/checkout-execution",
  "taskRoleArn": "arn:aws:iam::111122223333:role/checkout-task",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/checkout@sha256:9b2c…e41",
      "essential": true,
      "user": "10001:10001",
      "readonlyRootFilesystem": true,
      "portMappings": [{ "containerPort": 8080, "protocol": "tcp" }],
      "stopTimeout": 60,
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/healthz || exit 1"],
        "interval": 15, "timeout": 5, "retries": 3, "startPeriod": 30
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/checkout-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "app",
          "mode": "non-blocking",
          "max-buffer-size": "25m"
        }
      }
    }
  ]
}

aws ecs register-task-definition --cli-input-json file://checkout-api.task.json

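The core of the same definition in Terraform (health check, read-only root filesystem, and log configuration omitted for brevity), with the digest passed in from CI so it’s never :latest: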

resource "aws_ecs_task_definition" "checkout" {
  family                   = "checkout-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = aws_iam_role.exec.arn
  task_role_arn            = aws_iam_role.task.arn

  runtime_platform {
    cpu_architecture        = "ARM64"
    operating_system_family = "LINUX"
  }

  container_definitions = jsonencode([{
    name        = "app"
    image       = var.image_digest        # "...checkout@sha256:..."
    essential   = true
    user        = "10001:10001"
    portMappings = [{ containerPort = 8080, protocol = "tcp" }]
    stopTimeout = 60
  }])
}

The two choices worth repeating: ARM64 (Graviton) is roughly 20% cheaper per vCPU-hour and usually performs as well as or better than x86 on common web workloads (covered in depth in Graviton/ARM64 Migration: Multi-Arch Builds & Benchmarking). And pin a digest: a moving tag is a correctness bug, not a convenience.
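One way to enforce the digest rule is a small check in CI that fails the build when any container image in the task definition JSON is referenced by tag. This is a sketch under the assumption that the task definition is available as a parsed dict; `is_digest_pinned` and `unpinned_images` are hypothetical helper names.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref):
    """Return True if the image reference ends in a full sha256 digest."""
    return bool(_DIGEST_RE.search(image_ref))


def unpinned_images(task_definition):
    """Return the names of containers whose image is not pinned to a digest."""
    return [
        c["name"]
        for c in task_definition.get("containerDefinitions", [])
        if not is_digest_pinned(c.get("image", ""))
    ]
```

A pipeline step would load the task definition file with `json.load` and fail if `unpinned_images(...)` returns anything.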

The other task-definition fields you’ll touch on a real service, with their defaults and the trade-off:

Field	What it does	Default	When to set	Trade-off / gotcha
`requiresCompatibilities`	Declares FARGATE vs EC2	—	Always `["FARGATE"]` here	Mismatch rejects invalid combos
`volumes` + `mountPoints`	Shared/EFS volumes	none	Persistent or shared data	Fargate supports EFS, not host bind
`dependsOn`	Order containers by condition	none	Sidecar must be up first (FireLens)	`START`/`COMPLETE`/`SUCCESS`/`HEALTHY` conditions
`pidMode`	Share PID namespace	per-container	Rarely; `task` for shared tooling	Security blast radius
`runtimePlatform` (OS)	LINUX vs WINDOWS family	LINUX	Windows containers	Fewer Windows SKUs/AZs
`proxyConfiguration`	App Mesh / Envoy proxy	none	Service-mesh sidecar	Adds an Envoy container
`tags` / `propagateTags`	Cost-allocation tags	none	Always tag for FinOps	Propagate from service or task def

2. awsvpc networking: one ENI and IP per task, and the deploy-surge math

On Fargate the network mode is always awsvpc, so each task is a first-class network citizen with its own ENI, private IP, and security group(s). This is the single most important networking fact about Fargate, and it has two consequences you must plan for: IP consumption and egress routing.

The IP-consumption trap is the rolling deploy. During a deploy you briefly run more tasks than steady state, and each consumes a subnet IP. Plan subnets for the peak, not the average:

Peak task IPs during a deploy ≈ desired_count × (maximumPercent / 100). For a 40-task service at maximumPercent: 200, plan for up to 80 task IPs across your subnets during the deploy, on top of everything else (other services, ENIs, reserved addresses) in those subnets.

A subnet also reserves 5 addresses (network, router, DNS, future, broadcast), so the usable count is smaller than the raw CIDR size. The math by subnet size:

Subnet CIDR	Total IPs	Usable (AWS reserves 5)	Steady tasks @ 50% headroom	Max service @ `maximumPercent: 200`
/28	16	11	~7	~5 tasks (too small for prod)
/27	32	27	~18	~13 tasks
/26	64	59	~39	~29 tasks
/25	128	123	~82	~61 tasks
/24	256	251	~167	~125 tasks
/23	512	507	~338	~253 tasks
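The arithmetic behind the table can be reproduced with the standard library, which is handy for checking a planned service against real subnet CIDRs. This is a simplified sketch: it pools free IPs across all subnets and takes other consumers as a single number (`other_ips_in_use`, an assumed input), whereas real placement is per subnet and per AZ.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # network, router, DNS, future use, broadcast


def usable_ips(cidr):
    """Addresses left in a subnet after the five AWS reserves."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def peak_task_ips(desired_count, maximum_percent):
    """Task IPs in use at the top of a rolling deploy (ceiling rounded down)."""
    return desired_count * maximum_percent // 100


def deploy_fits(cidrs, desired_count, maximum_percent, other_ips_in_use=0):
    """True if the deploy surge fits in the pooled free IPs of the subnets."""
    free = sum(usable_ips(c) for c in cidrs) - other_ips_in_use
    return peak_task_ips(desired_count, maximum_percent) <= free
```

With these helpers, the 40-task service at `maximumPercent: 200` peaks at 80 IPs, which does not fit a single /26 (59 usable) but fits easily across two /24 subnets.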

Spread tasks across at least two private subnets in different AZs, and give each a /24 or larger for any sizeable service. The service network config disables public IPs and references security groups by ID:

{
  "awsvpcConfiguration": {
    "subnets": ["subnet-0aaa1111", "subnet-0bbb2222"],
    "securityGroups": ["sg-0task55555"],
    "assignPublicIp": "DISABLED"
  }
}

assignPublicIp must be DISABLED for tasks in private subnets — they reach AWS services through a NAT gateway or, better, VPC interface endpoints. The egress choices, side by side:

Egress path	What it covers	Cost shape	When to use	Gotcha
NAT gateway	All outbound to internet + AWS	Hourly + per-GB processed	Quick start, mixed egress	Per-GB on every ECR pull adds up
Interface endpoint (`ecr.api`, `ecr.dkr`, `secretsmanager`, `logs`, `sts`)	Those AWS APIs privately	Hourly per endpoint + per-GB	Keep image pulls on AWS net	One per service used; needs SG
Gateway endpoint (S3, DynamoDB)	S3 (ECR layers!), DynamoDB	Free	Always add S3 (ECR uses it)	Route-table entry, not SG
PrivateLink to a partner service	A specific SaaS/partner endpoint	Hourly per endpoint + per-GB	Reach a partner privately	One endpoint per service; see PrivateLink
Public IP + IGW	Direct internet (no NAT)	IGW free, IP churn	Rarely for prod tasks	Exposes tasks; usually wrong

The minimum endpoint set to pull an image and read secrets without a NAT gateway is: com.amazonaws.<region>.ecr.api, ecr.dkr, secretsmanager, logs, sts (interface), plus an S3 gateway endpoint (ECR stores layers in S3). The security-group rules — reference SGs by ID, never CIDR:

Rule	Direction	Source/Dest	Port	Why
Task SG: allow ALB	Inbound	ALB’s SG (by ID)	8080 (containerPort)	Only the ALB reaches the task
Task SG: egress to DB	Outbound	DB SG (by ID)	5432	Least-privilege egress
Task SG: egress to endpoints	Outbound	Endpoint SG (by ID)	443	ECR/Secrets/Logs over HTTPS
ALB SG: allow clients	Inbound	0.0.0.0/0 or CDN range	443	Public ingress
ALB SG: egress to tasks	Outbound	Task SG (by ID)	8080	ALB → task
Endpoint SG: allow tasks	Inbound	Task SG (by ID)	443	Tasks → endpoint

Terraform for the two essential ECR-related endpoints (S3 gateway is free and mandatory):

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids   # ECR layer pulls go via S3
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}

The Fargate limits and quotas that actually shape a production design (many are soft/adjustable via Service Quotas — confirm against your account):

Limit / quota	Value	Adjustable?	Why it matters
Network mode on Fargate	`awsvpc` only	No	One ENI/IP per task — drives subnet sizing
Max vCPU per task	16 vCPU	No	Largest single task; split bigger work
Max memory per task	120 GiB	No	Ceiling for in-memory workloads
Ephemeral storage per task	20 GiB free, up to 200 GiB	Configurable	Scratch space; billed above 20
`stopTimeout` max	120 s	No	Caps the drain window
Containers per task definition	10	No	App + sidecars must fit
Tasks per service	5,000 (default, soft)	Yes (Service Quotas)	Very large services
Services per cluster	5,000 (soft)	Yes	Cluster packing
Subnet reserved IPs	5 per subnet	No	Reduces usable task IPs
Spot interruption warning	~120 s (SIGTERM)	No	Drain budget on reclaim
Platform version	1.4.0 (current)	n/a	Feature/behavior baseline

Networking details — subnet design, route tables, endpoint policies — are covered end to end in VPC Deep Dive: Subnets, Routing, IGW, NAT & Endpoints, and the inbound/outbound rule model in Security Groups & NACLs Deep Dive.

3. Service Auto Scaling: target tracking vs step scaling

ECS services scale through Application Auto Scaling, registered as a scalable target against the ecs:service:DesiredCount dimension with a min and max. The mechanics are easy; the metric choice is the whole game. For a request-driven service behind an ALB, ALBRequestCountPerTarget is the cleanest signal — it scales on actual load per task, independent of how CPU-bound the work is, and reacts before CPU saturates.

# Register the service as a scalable target (min/max bound on desiredCount)
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/prod-cluster/checkout-api \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 4 --max-capacity 40

The predefined target-tracking metrics, and exactly when each is the right one:

Predefined metric	Scales on	Use when	Don’t use when	Needs `ResourceLabel`
`ALBRequestCountPerTarget`	Requests/min per task	Web/API behind an ALB	No ALB; or work ≠ per-request	Yes (ALB + TG names)
`ECSServiceAverageCPUUtilization`	Avg task CPU %	CPU-bound compute	I/O-bound work (lags)	No
`ECSServiceAverageMemoryUtilization`	Avg task memory %	Memory-bound caches	Leaky apps (scales on the leak)	No

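A request-count target-tracking policy. The ResourceLabel is <ALB full name>/<target group full name>: the part of the load balancer ARN after loadbalancer/ (app/<name>/<id>), then a slash, then the final part of the target group ARN including its targetgroup/ prefix (targetgroup/<name>/<id>). Get it wrong and the policy silently does nothing: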

{
  "TargetValue": 1000.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ALBRequestCountPerTarget",
    "ResourceLabel": "app/checkout-alb/50dc6c495c0c9188/targetgroup/checkout-tg/6d0ecf831eec9f09"
  },
  "ScaleInCooldown": 300,
  "ScaleOutCooldown": 60
}

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs --resource-id service/prod-cluster/checkout-api \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name reqcount-tt --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://reqcount-tt.json
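Because the ResourceLabel is assembled from two ARNs, it is safer to derive it in code than to paste it by hand. The helper below is a hypothetical sketch that works purely on the ARN strings; the example ARNs in the usage note reuse the account ID and names from this article's samples.

```python
def resource_label(alb_arn, target_group_arn):
    """Build the ALBRequestCountPerTarget ResourceLabel from the two ARNs."""
    # ALB ARN ends in ":loadbalancer/app/<name>/<id>"; keep "app/<name>/<id>".
    _, sep, alb_part = alb_arn.partition(":loadbalancer/")
    if not sep or not alb_part.startswith("app/"):
        raise ValueError("not an Application Load Balancer ARN")
    # Target group ARN ends in ":targetgroup/<name>/<id>"; the label keeps
    # the "targetgroup/" prefix in front of "<name>/<id>".
    _, sep, tg_part = target_group_arn.partition(":targetgroup/")
    if not sep:
        raise ValueError("not a target group ARN")
    return f"{alb_part}/targetgroup/{tg_part}"
```

Passing `arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/checkout-alb/50dc6c495c0c9188` and `arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/checkout-tg/6d0ecf831eec9f09` yields the label used in the JSON policy above.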

The target-tracking knobs and how to reason about each:

Knob	What it does	Default	When to change	Trade-off
`TargetValue`	The metric value to hold	—	Set to ~70% of a task’s safe max	Too low = over-provision; too high = late
`ScaleOutCooldown`	Wait after scaling out (s)	300	Lower (60) to react faster	Too low risks over-shoot
`ScaleInCooldown`	Wait after scaling in (s)	300	Raise to avoid flapping	Too low = thrash on noisy load
`DisableScaleIn`	Only scale out, never in	false	True for cost-blind reliability	Pay for peak forever

Reach for step scaling when you need asymmetric or aggressive reactions — for example, add capacity hard when a queue-depth alarm crosses a threshold (a worker draining SQS, not a web service):

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs --resource-id service/prod-cluster/worker \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name queue-step --policy-type StepScaling \
  --step-scaling-policy-configuration '{
    "AdjustmentType": "ChangeInCapacity",
    "MetricAggregationType": "Maximum",
    "StepAdjustments": [
      { "MetricIntervalLowerBound": 0,   "MetricIntervalUpperBound": 1000, "ScalingAdjustment": 2 },
      { "MetricIntervalLowerBound": 1000,                                   "ScalingAdjustment": 5 }
    ]
  }'
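The step bounds in that policy are offsets from the threshold of the CloudWatch alarm that triggers it, not absolute metric values, which is easy to misread. The sketch below evaluates the same two steps against an alarm threshold of 500 messages; that threshold is an assumption for illustration, since the alarm itself is not shown here. The lower bound is treated as inclusive and the upper bound as exclusive.

```python
import math

# The StepAdjustments from the policy above.
STEPS = [
    {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 1000, "ScalingAdjustment": 2},
    {"MetricIntervalLowerBound": 1000, "ScalingAdjustment": 5},
]


def pick_step(metric_value, alarm_threshold, step_adjustments):
    """Return the ScalingAdjustment whose interval contains the breach size.

    A missing bound means the interval is open-ended on that side.
    Returns 0 when the metric has not breached the threshold.
    """
    breach = metric_value - alarm_threshold
    for step in step_adjustments:
        lower = step.get("MetricIntervalLowerBound", -math.inf)
        upper = step.get("MetricIntervalUpperBound", math.inf)
        if lower <= breach < upper:
            return step["ScalingAdjustment"]
    return 0
```

With a threshold of 500, a queue depth of 800 is a breach of 300 and adds 2 tasks, while a depth of 1,500 is a breach of 1,000 and adds 5.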

Target tracking vs step scaling, decided:

Dimension	Target tracking	Step scaling
Mental model	“Hold this metric near X”	“When the alarm is this far over, add Y”
Alarms	AWS manages a pair for you	You define the metric + thresholds
Best for	Web/API steady-state load	Queues, asymmetric bursts
Scale-in	Automatic, but more gradual than scale-out	You define separate step(s)
Risk	Wrong metric lags	Mis-tuned steps over/under-shoot
Combine?	Yes — multiple policies allowed	Yes — layer with target tracking

The CloudWatch metrics worth alerting on for a Fargate service (leading indicators, not just “service down”), with a starting threshold:

Alert on	Metric (namespace)	Threshold (starting point)	Why it’s leading
Per-task request load	`RequestCountPerTarget` (ALB)	near your `TargetValue`	Predicts scale-out before latency spikes
Latency creep	`TargetResponseTime` (ALB) p95	> your SLO	Cold start / saturation before users feel it
Unhealthy targets	`UnHealthyHostCount` (ALB)	≥ 1 for 5 min	Catches eviction before capacity drops
CPU saturation	`CPUUtilization` (ECS)	> 80% for 10 min	Backstop signal for CPU-bound paths
Memory pressure	`MemoryUtilization` (ECS)	> 85% for 10 min	Predicts OOM kills (exit 137)
Failed task launches	service events / `failedTasks`	> 0 during deploy	The circuit breaker’s trigger
5xx from targets	`HTTPCode_Target_5XX_Count` (ALB)	> 1% of requests	The symptom — alert as confirmation
Running vs desired	`RunningTaskCount` vs `DesiredTaskCount` (Container Insights)	gap > 0 sustained	Deploy stuck or capacity starved

You can attach multiple policies to one service. A common pattern: request-count target tracking for steady state, plus a CPU target-tracking policy as a safety net so a CPU-heavy code path can’t starve before request count reacts. When policies disagree, Application Auto Scaling takes the largest desired count — so layering scale-out policies is safe; combining aggressive scale-in policies is what gets you into flapping. A decision table for picking the primary signal:

If your service is…	Scale primarily on…	Backstop with…
A web API behind an ALB	`ALBRequestCountPerTarget`	CPU target tracking
A gRPC streaming service	CPU utilization	Memory target tracking
An SQS/queue worker	Step scaling on queue depth (`ApproximateNumberOfMessagesVisible`)	CPU as a floor
A CPU-bound batch transformer	CPU utilization	—
A memory-bound cache/aggregator	Memory utilization	CPU as a floor
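The layering rule (the largest desired count wins, bounded by the scalable target's min and max) can be illustrated with a short sketch. The proportional formula in `target_tracking_proposal` is a common approximation used here for intuition only; it is an assumption, not the service's documented algorithm, and both function names are hypothetical.

```python
import math


def target_tracking_proposal(current_tasks, metric_value, target_value):
    """Approximate the task count that would bring the metric back to target."""
    return math.ceil(current_tasks * metric_value / target_value)


def combined_desired_count(proposals, min_capacity, max_capacity):
    """Largest proposal wins, then clamp to the scalable target's bounds."""
    return max(min_capacity, min(max_capacity, max(proposals)))
```

For instance, with 10 tasks at 1,500 requests per target against a target of 1,000, the request-count policy proposes 15, while a CPU policy at 60% against a 70% target proposes 9. The combined result is 15, so the CPU backstop never pulls capacity below what the primary signal asks for.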

4. Deployments: rolling updates and the circuit breaker

ECS rolling deployments are governed by two knobs on the service. minimumHealthyPercent is the floor of healthy tasks ECS keeps during a deploy; maximumPercent is the ceiling it may temporarily exceed desired count to bring up replacements. For a zero-downtime rolling deploy on an even-sized service, 100/200 is the safe default: never drop below desired count, allow a full extra set while rolling.
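The two percentages translate into whole-task bounds, and the rounding matters on small services. The following sketch computes the floor and ceiling for a given desired count, rounding the minimum up and the maximum down as ECS does; `deploy_bounds` is a hypothetical helper name.

```python
def deploy_bounds(desired_count, minimum_healthy_percent=100, maximum_percent=200):
    """Return (floor, ceiling) of tasks during a rolling deploy.

    floor   = healthy tasks ECS keeps running (percentage rounded up)
    ceiling = total tasks ECS may run at once (percentage rounded down)
    """
    floor = -(-desired_count * minimum_healthy_percent // 100)  # ceiling division
    ceiling = desired_count * maximum_percent // 100
    return floor, ceiling
```

A 4-task service at 100/200 rolls between 4 and 8 tasks. At 100/100 the floor equals the ceiling, so ECS can neither start a new task nor stop an old one and the deployment cannot make progress.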

The deploy-config matrix — every knob, its default, and the trade-off:

Setting	What it controls	Default	Safe prod value	Trade-off / gotcha
`minimumHealthyPercent`	Floor of healthy tasks during deploy	100	100 (50 if cost-sensitive + tolerant)	<100 risks a capacity dip mid-deploy
`maximumPercent`	Ceiling above desired during deploy	200	200	Higher = faster but more IPs/cost
`deploymentCircuitBreaker.enable`	Auto-detect a failing deploy	false	true	Off = bad image loops forever
`deploymentCircuitBreaker.rollback`	Revert to last good on failure	false	true	Without it, breaker only stops
`healthCheckGracePeriodSeconds`	Ignore ALB health fails after start	0	~60 (≥ cold-start)	Too low kills slow-booting tasks
`deploymentController.type`	`ECS`, `CODE_DEPLOY`, `EXTERNAL`	`ECS`	`ECS` (rolling)	Blue/green needs CodeDeploy/native
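`minimumHealthyPercent` (unhealthy-task replacement)	Same floor applies when ECS replaces unhealthy tasks; it does not govern auto-scaling scale-in	100	100	With headroom under `maximumPercent`, replacements can start before the unhealthy task stops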

The piece people skip is the deployment circuit breaker. Without it, a bad image that never passes health checks leaves the service replacing failing tasks indefinitely — draining your IP pool and paging you. With it, ECS watches for a run of failed task launches and, if rollback is on, automatically reverts to the last known-good task definition.

aws ecs update-service \
  --cluster prod-cluster --service checkout-api \
  --task-definition checkout-api:87 \
  --deployment-configuration '{
    "minimumHealthyPercent": 100,
    "maximumPercent": 200,
    "deploymentCircuitBreaker": { "enable": true, "rollback": true }
  }' \
  --health-check-grace-period-seconds 60

resource "aws_ecs_service" "checkout" {
  name            = "checkout-api"
  cluster         = aws_ecs_cluster.prod.id
  task_definition = aws_ecs_task_definition.checkout.arn
  desired_count   = 4
  launch_type     = "FARGATE"
  health_check_grace_period_seconds = 60

  # HCL blocks take one argument per line; a comma-separated one-liner does not parse.
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200

  load_balancer {
    target_group_arn = aws_lb_target_group.checkout.arn
    container_name   = "app"
    container_port   = 8080
  }
  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.task.id]
    assign_public_ip = false
  }
}

--health-check-grace-period-seconds tells ECS to ignore ALB health-check failures for the first N seconds after a task starts, so a slow-booting app isn’t killed before it’s ready. Set it slightly above your real cold-start time. The circuit breaker counts failures relative to desired count (it scales the threshold with service size, with a floor), so it behaves sensibly for both a 3-task and a 300-task service.
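
A rough model of how that threshold scales, as a sketch only: the half-of-desired rule, the floor of 3, and the ceiling of 200 reflect my reading of the ECS documentation, so treat all three numbers as assumptions and confirm them against the current docs.

```javascript
// Sketch of the circuit-breaker failure threshold: half the desired count,
// rounded up, bounded to [3, 200]. The constants are assumptions (see lead-in).
function circuitBreakerThreshold(desiredCount) {
  const MIN_THRESHOLD = 3;
  const MAX_THRESHOLD = 200;
  const half = Math.ceil(desiredCount * 0.5);
  return Math.min(MAX_THRESHOLD, Math.max(MIN_THRESHOLD, half));
}

for (const n of [3, 25, 300, 1000]) {
  console.log(`desired=${n} -> threshold=${circuitBreakerThreshold(n)}`);
}
```

Under these assumptions a 3-task service trips after 3 failed launches and a 300-task service after 150, which is the "sensible for both" behaviour described above.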

The deployment-controller and strategy options, decided:

Strategy	How it works	Rollback	Extra cost	Use when
Rolling (ECS) + circuit breaker	Overlap old/new, auto-revert on failure	Automatic (last good)	One extra task set briefly	Default for most services
Native blue/green (ECS)	Full parallel green env, shift, cut over	Instant cutover/revert	Full second env during shift	High-stakes, instant rollback
CodeDeploy blue/green	CodeDeploy shifts ALB listener (linear/canary)	Instant + traffic-shift hooks	Full second env + CodeDeploy	Canary/linear traffic control
External	Your own orchestrator manages task sets	Yours to build	Varies	Custom CD systems

The rollout states you watch during a deploy, and what each means:

`rolloutState`	Meaning	What to do
`IN_PROGRESS`	Converging to the new revision	Wait; watch `runningCount` vs `desiredCount`
`COMPLETED`	New revision fully healthy	Done — verify targets healthy
`FAILED`	Circuit breaker tripped	Read `rolloutStateReason`; check task stopped reasons
(rolling back)	Reverting to last good revision	Confirm the prior revision is what’s running

The most common task stopped reasons you’ll read in describe-tasks during a failed rollout, and what each points at:

Stopped reason (substring)	What it means	Likely fix
`CannotPullContainerError`	Image pull failed (bad digest or no route)	Fix digest; add ECR endpoints / NAT
`ResourceInitializationError: unable to pull secrets`	Exec role can’t read a secret	Grant exec role secret ARN + KMS
`RESOURCE:ENI`	No free ENI/IP in the subnet	Larger subnets; lower `maximumPercent`
`Task failed ELB health checks`	ALB marked the task unhealthy	Fix port/path/matcher; raise grace
`OutOfMemoryError` (exit 137)	Container exceeded its memory	Raise task `memory`; fix leak
`Essential container in task exited`	An `essential` container exited non-zero	Read its logs; fix crash/entrypoint
`Scaling activity initiated by ...`	Normal scale-in stop	None — expected
`Task stopped by deployment` (rollback)	Circuit breaker removed a bad task	Confirm prior revision is healthy
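
If you triage stopped tasks often, the table above is easy to turn into a lookup. A minimal sketch: the substrings come straight from the table, while the function name and the sample reason string are hypothetical.

```javascript
// Substring -> hint pairs, taken from the stopped-reason table above.
const STOPPED_REASON_HINTS = [
  ['CannotPullContainerError', 'Fix the image digest; add ECR endpoints or a NAT route'],
  ['unable to pull secrets', 'Grant the execution role the secret ARN and KMS key'],
  ['RESOURCE:ENI', 'Use larger subnets or lower maximumPercent'],
  ['failed ELB health checks', 'Fix health-check port/path/matcher; raise the grace period'],
  ['OutOfMemoryError', 'Raise task memory; look for a leak'],
  ['Essential container in task exited', 'Read the container logs; fix the crash or entrypoint'],
  ['Scaling activity initiated by', 'Expected scale-in stop; nothing to fix'],
];

// Returns the first matching hint, or a generic fallback.
function hintForStoppedReason(stoppedReason) {
  const match = STOPPED_REASON_HINTS.find(([needle]) => stoppedReason.includes(needle));
  return match ? match[1] : 'Unrecognized reason; read the task and service events';
}

// Hypothetical stoppedReason value, shaped like what describe-tasks returns.
console.log(hintForStoppedReason('Task failed ELB health checks in (target-group example)'));
```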

5. Graceful shutdown: SIGTERM, `stopTimeout`, and deregistration

When ECS stops a task — a deploy, a scale-in, or a Spot interruption — it sends SIGTERM to each container’s entrypoint process (PID 1), waits up to stopTimeout (default 30s on Fargate, max 120s), then sends SIGKILL. Two failure modes hide here, and they are the number-one cause of deploy-time errors.

First: PID 1 must actually receive and handle SIGTERM. If your container starts the app via a shell (sh -c "node server.js"), the shell is PID 1 and may not forward the signal — your app gets SIGKILLed with in-flight requests. Either run the app as PID 1 directly (exec form CMD, or ENTRYPOINT ["node", "server.js"]) or set "initProcessEnabled": true in linuxParameters to get a tini-style init that reaps zombies and forwards signals. The combinations:

How PID 1 is set up	Receives SIGTERM?	Reaps zombies?	Verdict
`CMD ["node","server.js"]` (exec form)	Yes	No (but app rarely forks)	Fine for most apps
`CMD node server.js` (shell form)	Often no (shell swallows)	No	Broken — drops requests
`ENTRYPOINT ["app"]`, exec	Yes	No	Fine
`initProcessEnabled: true` + any CMD	Yes (init forwards)	Yes	Best for multi-process / forking apps
Custom init (tini/dumb-init) in image	Yes	Yes	Fine; init handles it
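
A quick lint for the shell-as-PID-1 case can run against a container definition before you register it. This is a sketch under assumptions: the function name is mine, it only sees `entryPoint` and `command` overrides in the task definition (not a CMD baked into the image), and it treats a leading `exec` inside the shell string as safe because the shell then replaces itself with the app.

```javascript
// Returns true when the effective argv is a shell invoked with -c,
// e.g. ["sh", "-c", "java -jar app.jar"], which leaves the shell as PID 1.
function isShellWrapped(containerDef) {
  const argv = [...(containerDef.entryPoint ?? []), ...(containerDef.command ?? [])];
  if (argv.length < 2) return false;

  const shells = new Set(['sh', 'bash', 'ash', 'dash']);
  const binary = argv[0].split('/').pop(); // "/bin/sh" -> "sh"
  const isShell = shells.has(binary) && argv[1] === '-c';

  // "sh -c 'exec node server.js'" hands PID 1 to the app, so do not flag it.
  const usesExec = /^\s*exec\s/.test(argv[2] ?? '');
  return isShell && !usesExec;
}

console.log(isShellWrapped({ command: ['sh', '-c', 'java -jar app.jar'] })); // true
console.log(isShellWrapped({ command: ['node', 'server.js'] }));             // false
```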

Second: drain before you exit. On SIGTERM the app should stop accepting new work, finish in-flight requests, then exit — inside stopTimeout:

const server = app.listen(8080);

process.on('SIGTERM', () => {
  console.log('SIGTERM received, draining');
  server.close(() => {                 // stop accepting, finish in-flight
    console.log('drained, exiting');
    process.exit(0);
  });
  // safety net well under stopTimeout (60s here)
  setTimeout(() => process.exit(1), 50_000).unref();
});

Coordinate three timers so they nest correctly. Two inequalities must hold: app drain grace ≤ deregistration delay, and app drain grace < stopTimeout. ECS deregisters the task from the target group on stop; the ALB stops sending new connections and waits the deregistration delay for existing ones to finish. If stopTimeout is shorter than the drain, SIGKILL cuts the app mid-drain; if the deregistration delay is shorter than the drain, the ALB cuts connections the app is still serving.

Timer	Where set	Default	Recommended (fast service)	If too low	If too high
ALB deregistration delay	Target group `deregistration_delay.timeout_seconds`	300	30	ALB cuts in-flight requests	Slow deploys/scale-in
App drain grace	Your SIGTERM handler	n/a	~25–45	App exits before draining	Risks > `stopTimeout`
`stopTimeout`	Container def	30	60	SIGKILL mid-drain	Max 120; slow stops
Health-check grace	Service	0	60	New task killed before ready	Slow to detect real failures
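
The nesting rules are easy to check mechanically before a deploy. A minimal sketch, assuming all three values are in seconds; the function and field names are hypothetical, and the rules are the ones stated in this section, not an AWS API.

```javascript
// Validates the shutdown timers against the rules in this section.
// Returns a list of human-readable problems; an empty list means they nest.
function checkShutdownTimers({ deregistrationDelay, drainGrace, stopTimeout }) {
  const problems = [];
  if (stopTimeout > 120) {
    problems.push('stopTimeout exceeds the Fargate maximum of 120s');
  }
  if (drainGrace >= stopTimeout) {
    problems.push('drainGrace must be below stopTimeout, or SIGKILL can land mid-drain');
  }
  if (deregistrationDelay < drainGrace) {
    problems.push('deregistrationDelay is below drainGrace, so the ALB may cut live connections');
  }
  return problems;
}

console.log(checkShutdownTimers({ deregistrationDelay: 30, drainGrace: 25, stopTimeout: 60 })); // []
console.log(checkShutdownTimers({ deregistrationDelay: 300, drainGrace: 45, stopTimeout: 30 }).length); // 1
```

Running a check like this in CI against the task definition and target-group settings catches a drifted timer before it shows up as deploy-time errors.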

The target group itself must register tasks by IP, not instance, because each Fargate task is its own ENI:

resource "aws_lb_target_group" "checkout" {
  name                 = "checkout-tg"
  port                 = 8080
  protocol             = "HTTP"
  target_type          = "ip"          # REQUIRED for awsvpc/Fargate tasks
  vpc_id               = var.vpc_id
  deregistration_delay = 30            # drain fast, well inside stopTimeout

  health_check {
    path                = "/healthz"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
    timeout             = 5
    matcher             = "200"
  }
}

6. Secrets, config, and least-privilege roles

Fargate tasks have two roles, and conflating them is the most common IAM mistake on ECS. The split, in one table:

Role	Assumed by	When	Used for	The wrong instinct
Execution role (`executionRoleArn`)	The ECS agent	Before the container starts	ECR pull, log group writes, resolving `secrets`	Forgetting it → image-pull/secret failures
Task role (`taskRoleArn`)	Your application code	At runtime	S3, DynamoDB, SQS, etc. via the SDK	Putting secrets-read perms here

Keep them separate and minimal. Inject secrets via the secrets block so plaintext never lands in the task definition or in describe-tasks output, and keep non-sensitive config in environment:

"secrets": [
  { "name": "DB_PASSWORD", "valueFrom": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/checkout/db-AbCdEf" }
],
"environment": [
  { "name": "LOG_LEVEL", "value": "info" }
]

The execution role needs secretsmanager:GetSecretValue (and kms:Decrypt if the secret uses a customer-managed key) on exactly those secret ARNs — not * — plus the ECR and Logs actions:

Action	On the execution role for…	Scope to
`ecr:GetAuthorizationToken`	Authenticating to ECR	`*` (token is account-wide)
`ecr:BatchGetImage`, `ecr:GetDownloadUrlForLayer`	Pulling the image	The specific repo ARN
`logs:CreateLogStream`, `logs:PutLogEvents`	Writing app logs	The log-group ARN
`secretsmanager:GetSecretValue`	Resolving `secrets`	The exact secret ARN(s)
`kms:Decrypt`	CMK-encrypted secrets/SSM	The specific key ARN
`ssm:GetParameters`	SSM Parameter Store `secrets`	The parameter ARN(s)

{
  "Effect": "Allow",
  "Action": "secretsmanager:GetSecretValue",
  "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/checkout/db-AbCdEf"
}

The task role carries only the runtime permissions your code uses. If your app writes to one bucket, scope it to that bucket’s ARN and the s3:PutObject action — nothing more. Static environment entries are visible in plaintext via the API, so never put credentials there; that’s what secrets is for. The config-injection options compared:

Mechanism	Plaintext in API?	Rotates without redeploy?	Cost	Use for
`environment`	Yes	No	Free	Non-secret config (log level, region)
`secrets` → Secrets Manager	No	Yes (new value picked up on task launch)	Per-secret/month + API calls	Passwords, API keys
`secrets` → SSM Parameter Store (SecureString)	No	Yes (on launch)	Free std / paid advanced	Cheaper secrets, config hierarchy
App reads at runtime via task role	No	Yes (live)	API calls	Hot-reload of secrets without restart

Secrets-rotation patterns and Parameter Store vs Secrets Manager are covered in Secrets Manager & Parameter Store Deep Dive; scoping roles tightly is the subject of IAM Least Privilege & Permission Boundaries.

7. Observability: Container Insights, structured logs, tracing

Turn on Container Insights at the cluster level for per-task/service CPU, memory, and network metrics plus curated dashboards. Enable the enhanced observability tier for container-level granularity:

aws ecs update-cluster-settings \
  --cluster prod-cluster \
  --settings name=containerInsights,value=enhanced

The three observability pillars and how to wire each on Fargate:

Pillar	Tool on Fargate	Wire it via	Cost driver	Gotcha
Metrics	Container Insights	Cluster setting `containerInsights=enhanced`	Per-metric ingestion	`enhanced` = container-level, costs more
Logs	`awslogs` driver	`logConfiguration` per container	Per-GB ingest + storage	Use `non-blocking` + buffer cap
Logs (routed)	FireLens (Fluent Bit)	`firelensConfiguration` sidecar	Sidecar CPU/mem + destinations	Sidecar must be `essential` for ordering
Traces	ADOT collector	Sidecar + task-role X-Ray perms	Per-trace	Instrument app with OTel SDK

For logs, the awslogs driver is the simplest path; set mode=non-blocking with a bounded max-buffer-size so a slow log backend can’t block your application threads. The log-driver options:

Option	What it does	Default	Set to	Why
`mode`	`blocking` or `non-blocking`	`blocking`	`non-blocking`	A slow backend won’t stall the app
`max-buffer-size`	Buffer when non-blocking	1m	`25m`	Headroom for bursts; bounds memory
`awslogs-stream-prefix`	Stream name prefix	—	`app`	Required for readable stream names
`awslogs-datetime-format`	Multiline grouping	—	your pattern	Stack traces stay one event
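
Put together, the options above produce a `logConfiguration` like the one this sketch prints. The helper name and the `/ecs/checkout-api` log group are hypothetical; the option keys are the `awslogs` driver options from the table.

```javascript
// Builds a container-level logConfiguration using the non-blocking settings above.
function awslogsConfig({ group, region, streamPrefix, maxBufferSize = '25m' }) {
  return {
    logDriver: 'awslogs',
    options: {
      'awslogs-group': group,
      'awslogs-region': region,
      'awslogs-stream-prefix': streamPrefix,
      mode: 'non-blocking',              // a slow backend cannot stall the app
      'max-buffer-size': maxBufferSize,  // bounds memory used for buffered lines
    },
  };
}

const logConfiguration = awslogsConfig({
  group: '/ecs/checkout-api', // hypothetical log group
  region: 'us-east-1',
  streamPrefix: 'app',
});
console.log(JSON.stringify(logConfiguration, null, 2));
```

The printed JSON can be pasted into the container definition as-is. Remember that in non-blocking mode, log lines are dropped once the buffer fills, which is the trade you are making for an app that never stalls on logging.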

When you need routing — duplicate to S3 and a SIEM, parse, or sample — use FireLens with a Fluent Bit sidecar:

{
  "name": "log_router",
  "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
  "essential": true,
  "firelensConfiguration": { "type": "fluentbit" },
  "memoryReservation": 50
}

Then the app container’s logConfiguration uses "logDriver": "awsfirelens" with output options. Emit logs as JSON from the app so they’re queryable in CloudWatch Logs Insights:

fields @timestamp, level, msg, latency_ms
| filter level = "error"
| sort @timestamp desc
| limit 50

For distributed tracing, add the AWS Distro for OpenTelemetry (ADOT) collector as a sidecar and grant the task role AWSXRayDaemonWriteAccess; instrument the app with OTel and export to X-Ray for end-to-end spans. The full tracing setup is in AWS X-Ray: Service Map, Segments & ADOT Tracing; the metrics/logs foundation in CloudWatch & CloudTrail Observability Deep Dive.

8. Cost levers: Fargate Spot, capacity providers, right-sizing

Three levers move the bill, in order of impact.

Capacity providers + Fargate Spot. Fargate Spot runs the same tasks at a steep discount but can reclaim them with a ~2-minute SIGTERM warning. Run a mixed strategy via a capacity-provider strategy: a base of on-demand FARGATE for a guaranteed floor, then FARGATE_SPOT for the elastic, interruption-tolerant remainder:

aws ecs put-cluster-capacity-providers \
  --cluster prod-cluster \
  --capacity-providers FARGATE FARGATE_SPOT \
  --default-capacity-provider-strategy \
    capacityProvider=FARGATE,base=2,weight=1 \
    capacityProvider=FARGATE_SPOT,weight=4

This keeps 2 tasks always on-demand, then splits additional tasks 1:4 on-demand:Spot. The capacity-provider parameters:

Parameter	What it does	Example	Effect
`base`	Minimum tasks on this provider first	`FARGATE base=2`	2 tasks always on-demand
`weight`	Relative share of the rest	`FARGATE=1, SPOT=4`	Remainder split 20% / 80%
(Spot interruption)	~2-min SIGTERM then reclaim	—	Needs graceful shutdown (Section 5)
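
To sanity-check a strategy before applying it, the base-then-weight split can be approximated as below. This is a simplified model, not the ECS scheduler: the function name is mine, and the handling of uneven remainders (leftover tasks handed out in strategy order) is an assumption, so expect ECS to differ by a task or so when the remainder does not divide cleanly.

```javascript
// Approximate task placement: satisfy each provider's base first,
// then share the remainder in proportion to weight.
function splitTasks(desiredCount, strategy) {
  const counts = Object.fromEntries(strategy.map((s) => [s.capacityProvider, 0]));
  let remaining = desiredCount;

  for (const s of strategy) {
    const base = Math.min(s.base ?? 0, remaining);
    counts[s.capacityProvider] += base;
    remaining -= base;
  }

  const totalWeight = strategy.reduce((sum, s) => sum + (s.weight ?? 0), 0);
  if (totalWeight === 0 || remaining === 0) return counts;

  let assigned = 0;
  for (const s of strategy) {
    const share = Math.floor((remaining * (s.weight ?? 0)) / totalWeight);
    counts[s.capacityProvider] += share;
    assigned += share;
  }

  // Assumption: leftover tasks from rounding go to weighted providers in strategy order.
  let leftover = remaining - assigned;
  for (const s of strategy) {
    if (leftover === 0) break;
    if ((s.weight ?? 0) > 0) {
      counts[s.capacityProvider] += 1;
      leftover -= 1;
    }
  }
  return counts;
}

const strategy = [
  { capacityProvider: 'FARGATE', base: 2, weight: 1 },
  { capacityProvider: 'FARGATE_SPOT', weight: 4 },
];
console.log(splitTasks(12, strategy)); // { FARGATE: 4, FARGATE_SPOT: 8 }
```

At 12 tasks, 2 go on-demand as the base and the remaining 10 split 2 on-demand to 8 Spot, so a full Spot reclaim would still leave 4 tasks serving.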

Only do this for stateless services that handle SIGTERM cleanly — Spot reclamation uses the same graceful-stop path, so a service that drains correctly tolerates it. The three cost levers ranked:

Lever	Typical saving	Effort	Risk / precondition	Covered in
Fargate Spot (mixed strategy)	Up to ~70% on the Spot portion	Low	Must tolerate ~2-min reclaim	Section 5 (graceful stop)
Graviton (ARM64)	~20% per vCPU-hr	Low–medium	Image must be arm64/multi-arch	Graviton migration
Right-sizing	Varies (often 30–60%)	Medium	Measure first; redeploy task def	Container Insights / Compute Optimizer

Graviton (ARM64). Already covered in Section 1: the cheapest change you can make for compatible images.

Right-sizing. Use Container Insights and Compute Optimizer’s ECS recommendations to find tasks provisioned at 4 vCPU that peak at 0.8. Fargate bills per vCPU-second and GB-second from pull to stop, so an oversized task definition costs you on every running replica, every hour. Resize the task definition, redeploy, re-measure.

Spot interruption handling at scale is the subject of EC2 Spot & Mixed Instances with ASG Interruption Handling; the same draining discipline applies.

Architecture at a glance

Follow a single request left to right and the whole system falls into place. A client hits the ALB on 443; the ALB terminates TLS and forwards to a healthy target on the container port (8080). Because the tasks run awsvpc, the target group is target_type = ip and the ALB routes straight to a task ENI’s private IP inside two private subnets across two AZs — each task its own ENI, its own security group that only accepts the ALB’s SG, no public IP. The task itself runs the app container plus any sidecars on a valid CPU/memory envelope (here 512/1024, ARM64), with two IAM roles: the execution role pulled the digest-pinned image from ECR (via a VPC endpoint, layers over the free S3 gateway endpoint) and read secrets before start, while the task role is what the app’s SDK uses at runtime. Two control loops sit beside the data path: Application Auto Scaling watches ALBRequestCountPerTarget and moves desiredCount between 4 and 40, and the rollout runs at 100/200 with the deployment circuit breaker armed to roll back a bad revision. Downstream, the task reaches Secrets Manager + KMS and ships logs and traces to CloudWatch / X-Ray.

The five numbered badges mark exactly where a deploy or scale event breaks if a knob is wrong: a deploy-time 502 when the target type isn’t ip or the deregistration delay is still 300s (badge 1); IP/ENI exhaustion when the subnets can’t absorb the deploy surge (badge 2); swallowed SIGTERM dropping in-flight requests when a shell is PID 1 (badge 3); late or flapping scaling from the wrong metric (badge 4); and a bad deploy that never rolls back when the circuit breaker is off (badge 5). Read the diagram once with the legend, and the troubleshooting playbook below maps one-to-one onto these hops.

Real-world scenario

Lumio Pay, a fintech platform team, ran a payment-authorization service on Fargate behind an ALB, scaled on CPU target tracking, 6 tasks steady. It worked until a Friday evening release: under a traffic spike, p99 latency tripled and the team saw a steady trickle of 502s on every deploy and every scale-in event — even though CPU never crossed the 70% target. The on-call engineer’s first instinct was to scale up the task size, which did nothing, then to roll back, which also threw 502s on the way down.

Three root causes, none of them the application logic. First, the service used CPU target tracking, but the workload was I/O-bound on a downstream HSM — CPU stayed low while request queues grew, so scaling reacted late, after latency had already spiked. Second, and worse, the app was launched via sh -c "java -jar app.jar": the shell was PID 1, swallowed SIGTERM, and the JVM was SIGKILLed on every task stop, severing in-flight authorizations the instant a task drained. Third, the ALB target group still had the default 300-second deregistration delay, so during deploys the ALB kept routing new connections to tasks ECS had already begun stopping — a second source of cut connections layered on top of the first.

They confirmed each in minutes. The scaling lag showed up as CloudWatch CPU flat at ~40% while the ALB target-response-time and request-count climbed. The PID 1 problem was visible in the task definition ("command": ["sh","-c","java -jar app.jar"]) and in the stopped-task pattern: every deploy logged tasks SIGKILLed, not gracefully exited. The deregistration delay took a single describe-target-group-attributes call to read.

The fix was three coordinated changes, no new infrastructure. They switched the primary scaling signal to ALBRequestCountPerTarget (keeping a CPU policy as a backstop), changed the container entrypoint to exec the JVM as PID 1 with a real SIGTERM handler that drained the in-flight queue, and aligned the timers: deregistration delay to 30s, stopTimeout to 60s, drain grace to ~45s.

"deploymentConfiguration": {
  "minimumHealthyPercent": 100,
  "maximumPercent": 200,
  "deploymentCircuitBreaker": { "enable": true, "rollback": true }
}

resource "aws_lb_target_group" "auth" {
  name                 = "auth-tg"
  port                 = 8080
  protocol             = "HTTP"
  target_type          = "ip"          # required for awsvpc/Fargate tasks
  vpc_id               = var.vpc_id
  deregistration_delay = 30

  health_check {
    path                = "/healthz"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
    timeout             = 5
    matcher             = "200"
  }
}

Note target_type = "ip" — Fargate tasks register by IP, not instance, because each task is its own ENI. After the change, deploy-time 502s went to zero, and the service scaled out ahead of the latency curve instead of behind it. While they were in there, they also enabled the circuit breaker with rollback (they’d never had one) and tested it by deploying a deliberately-broken revision in staging — it flipped to FAILED and restored the prior revision in under two minutes. The lesson the team took away: on Fargate, “graceful shutdown” is not one setting — it’s PID 1, stopTimeout, and the target-group deregistration delay all agreeing with each other, and “scaling” only works if the metric you chose actually leads your load.

Advantages and disadvantages

The serverless-container model both removes real toil and introduces failure modes that live in the wiring rather than your code. Weigh it honestly:

Advantages (why Fargate helps you)	Disadvantages (why it bites)
No EC2 to size, patch, drain, or reboot — AWS owns the host fleet	You can’t `ssh` to “the box”; debugging is via ECS Exec, logs, and Insights
Per-task ENI gives clean isolation, per-task SGs and Flow Logs	Each task consumes a subnet IP + ENI; deploy surge can exhaust small subnets
Pay per vCPU-second/GB-second, scale to zero idle cost	Per-second billing means an oversized task def bleeds on every replica, every hour
Circuit breaker auto-rolls-back a bad deploy with no tooling	Off by default — a bad image loops forever until you enable it
Fargate Spot cuts the elastic portion ~70%	Only safe if the app drains on SIGTERM; reclaim is a ~2-min warning
Application Auto Scaling is a managed control loop	Wrong metric scales late; aggressive scale-in policies flap
Graviton/ARM64 is a ~20% saving for one flag	Image must be arm64/multi-arch first
Two-role model enforces least privilege by design	Conflating exec vs task role is the most common ECS IAM bug

Fargate is the right default when you want to ship containers, not operate servers, and your services are stateless and ALB-fronted. It bites hardest on chatty/I/O-bound services scaled on the wrong metric, services lifted from EC2 without revisiting PID 1 and signal handling, large services packed into small subnets, and anyone who deploys with the defaults (no circuit breaker, 300s deregistration delay) and never tunes them. The disadvantages are all manageable — but only if you know they exist, which is the entire point of this article. When the constraints argue for self-managed nodes (GPU, daemons, very high task density, specialized kernels), Choose Your Container Path: ECS vs EKS vs Fargate is the decision to revisit.

Hands-on lab

Stand up a minimal Fargate service, confirm that each task gets its own ENI, deploy a deliberately-broken revision, and watch the circuit breaker roll it back. Free-tier-friendly-ish (Fargate has no free tier, but two 256/512 tasks for under an hour cost a few rupees; tear down at the end). Run in any shell with the AWS CLI configured.

Step 1 — Variables and a cluster.

REGION=us-east-1
CLUSTER=lab-cluster
aws ecs create-cluster --cluster-name $CLUSTER --region $REGION \
  --settings name=containerInsights,value=enhanced

Step 2 — A log group and a minimal task definition (good image). Use a public sample that listens on 80:

aws logs create-log-group --log-group-name /ecs/lab-web --region $REGION
cat > lab-web.task.json <<'JSON'
{
  "family": "lab-web",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256", "memory": "512",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskExecutionRole",
  "containerDefinitions": [{
    "name": "web",
    "image": "public.ecr.aws/nginx/nginx:stable",
    "essential": true,
    "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
    "stopTimeout": 30,
    "logConfiguration": { "logDriver": "awslogs", "options": {
      "awslogs-group": "/ecs/lab-web", "awslogs-region": "us-east-1", "awslogs-stream-prefix": "web" } }
  }]
}
JSON
aws ecs register-task-definition --cli-input-json file://lab-web.task.json --region $REGION

Expected: a taskDefinition JSON with "revision": 1 and "status": "ACTIVE".

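Step 3 — Create the service with the circuit breaker armed. For the lab, use two public subnets (default-VPC subnets work) and a security group that allows outbound traffic; assignPublicIp=ENABLED lets the tasks pull the public sample image without a NAT gateway. If you use private subnets instead, set assignPublicIp=DISABLED and make sure they have a NAT route, because the image comes from public ECR: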

aws ecs create-service --cluster $CLUSTER --service-name lab-web \
  --task-definition lab-web:1 --desired-count 2 --launch-type FARGATE \
  --deployment-configuration '{"deploymentCircuitBreaker":{"enable":true,"rollback":true},"minimumHealthyPercent":100,"maximumPercent":200}' \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-AAA,subnet-BBB],securityGroups=[sg-XXX],assignPublicIp=ENABLED}' \
  --region $REGION

Expected: a service with "rolloutState": "IN_PROGRESS" that reaches COMPLETED once 2 tasks are running.

Step 4 — Prove each task has its own ENI + private IP (awsvpc).

# xargs appends the task ARNs at the end of the command, so --tasks must be the last flag
aws ecs list-tasks --cluster $CLUSTER --service-name lab-web --query 'taskArns' --output text --region $REGION \
  | xargs aws ecs describe-tasks --cluster $CLUSTER --region $REGION \
      --query 'tasks[].attachments[].details[?name==`privateIPv4Address`].value' --output text \
      --tasks

Expected: two distinct private IPs — one per task.

Step 5 — Register a deliberately-broken revision and deploy it. A task def pointing at an image that will never become healthy (a non-existent tag):

sed 's#nginx/nginx:stable#nginx/nginx:THIS-TAG-DOES-NOT-EXIST#' lab-web.task.json > lab-web.broken.json
aws ecs register-task-definition --cli-input-json file://lab-web.broken.json --region $REGION
aws ecs update-service --cluster $CLUSTER --service lab-web --task-definition lab-web:2 --region $REGION

Step 6 — Watch the circuit breaker fire and roll back.

aws ecs describe-services --cluster $CLUSTER --services lab-web --region $REGION \
  --query 'services[0].deployments[].{status:status,rollout:rolloutState,reason:rolloutStateReason,desired:desiredCount,running:runningCount,failed:failedTasks}'

Expected: the new deployment moves to rolloutState: FAILED with a rolloutStateReason mentioning the circuit breaker, and the service converges back onto revision 1 (the last known-good) — running stays at 2 throughout.
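
If you poll that command from a script, the projected fields can be reduced to a single verdict. A sketch under assumptions: the function name is mine, the input uses the field names from the `--query` projection above, and the sample array (including the reason text) is hypothetical data rather than captured output.

```javascript
// Reduces the projected deployments array to one of three verdicts.
function rolloutVerdict(deployments) {
  if (deployments.some((d) => d.rollout === 'FAILED')) return 'FAILED';
  const primary = deployments.find((d) => d.status === 'PRIMARY');
  if (primary && primary.rollout === 'COMPLETED' && primary.running === primary.desired) {
    return 'COMPLETED';
  }
  return 'IN_PROGRESS';
}

// Hypothetical snapshot taken just after the breaker trips.
const snapshot = [
  { status: 'PRIMARY', rollout: 'IN_PROGRESS', reason: 'rolling back', desired: 2, running: 2, failed: 0 },
  { status: 'ACTIVE', rollout: 'FAILED', reason: 'circuit breaker: tasks failed to start', desired: 0, running: 0, failed: 3 },
];
console.log(rolloutVerdict(snapshot)); // FAILED
```

Checking for `FAILED` first matters: during a rollback the primary deployment is healthy again, and a script that only looks at the primary would report success for a release that actually failed.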

Validation checklist. You created a service with the breaker armed, proved per-task ENIs, deployed a broken revision, and watched ECS automatically restore the good one without you touching it. The steps mapped to what each proves:

Step	What you did	What it proves	Real-world analogue
3	Service with `deploymentCircuitBreaker`	The safety net is on, not assumed	Every prod service should have this
4	Two distinct private IPs	`awsvpc` = one ENI/IP per task	IP-planning the deploy surge
5	Deploy an unhealthy revision	A bad image would loop forever without the breaker	A failed release at 3am
6	`rolloutState: FAILED` → rollback	The breaker fires and reverts to last good	The incident that doesn’t page you

Cleanup (avoid lingering Fargate charges).

aws ecs update-service --cluster $CLUSTER --service lab-web --desired-count 0 --region $REGION
aws ecs delete-service --cluster $CLUSTER --service lab-web --force --region $REGION
aws ecs delete-cluster --cluster $CLUSTER --region $REGION
aws logs delete-log-group --log-group-name /ecs/lab-web --region $REGION

Cost note. Two 256/512 tasks for under an hour is well under ₹40; deleting the service stops the per-second billing immediately. Container Insights enhanced adds a small ingestion cost — fine for a lab, watch it at scale.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table you read mid-incident, then the entries that bite hardest with full confirm-command detail.

#	Symptom	Root cause	Confirm (exact cmd / console path)	Fix
1	502s on every deploy and scale-in; fine at steady state	Target group not draining: `target_type` wrong or dereg delay 300s	`aws elbv2 describe-target-groups` (check `TargetType`); `aws elbv2 describe-target-group-attributes` (check `deregistration_delay.timeout_seconds`)	`target_type=ip`; `deregistration_delay=30`; align under `stopTimeout`
2	In-flight requests error exactly when a task stops	PID 1 is a shell, swallows SIGTERM → app SIGKILLed	Inspect task def `command`/`entryPoint`; stopped tasks show no graceful exit	`exec` form CMD or `initProcessEnabled:true`; drain in handler
3	p99 climbs before scale-out; thrash on scale-in	Wrong scaling metric (CPU on I/O work) or bad `ResourceLabel`	CloudWatch CPU flat while ALB request count/latency climb	Switch to `ALBRequestCountPerTarget`; raise `ScaleInCooldown`
4	Scaling policy does nothing at all	`ResourceLabel` malformed (wrong ALB/TG name portion)	`aws application-autoscaling describe-scaling-policies` → inspect label	Use `<ALB full name>/<TG full name>` exactly
5	Tasks stuck `PROVISIONING`; stopped reason `RESOURCE:ENI`	Subnet out of free IPs for the deploy surge	Subnet free-IP count vs `desired × maximumPercent`	`/24+` subnets across 2 AZ; lower `maximumPercent` temporarily
6	Bad image: failing tasks replaced forever, IP pool drains	Circuit breaker off (or `rollback` off)	`describe-services` → `rolloutState` stuck `IN_PROGRESS`, rising `failedTasks`	Enable `deploymentCircuitBreaker` with `rollback:true`
7	Task fails to start: `CannotPullContainerError`	Bad digest/tag, or no route to ECR (no endpoint/NAT)	`describe-tasks` → `stoppedReason`; check subnet route + endpoints	Fix digest; add ECR `api`/`dkr` + S3 endpoints or NAT
8	`ResourceInitializationError: unable to pull secrets`	Execution role missing `GetSecretValue`/`kms:Decrypt`, or no route	`stoppedReason`; exec-role policy; Secrets Manager endpoint	Grant exec role the secret ARN + KMS key; add endpoint
9	App can read secrets/buckets it never references	Secrets/runtime perms on the task role (or both `*`)	Diff `taskRoleArn` policy vs what code uses	Move secret-read to exec role; scope task role to used ARNs
10	New task killed seconds after start, never goes healthy	No `healthCheckGracePeriodSeconds`; ALB fails it during cold start	`describe-services` events show health-check failures right after start	Set grace ≥ cold-start; speed up boot
11	Task OOM-killed; container exits 137	`memory` (hard cap) too low or a leak	`stoppedReason` “OutOfMemory”; Container Insights memory ~100%	Raise task `memory` to a valid combo; fix leak
12	Fargate Spot tasks vanish under load	Spot reclamation (~2-min SIGTERM), app didn’t drain	Service events: tasks stopped, capacity-provider `FARGATE_SPOT`	Handle SIGTERM (Section 5); raise on-demand `base`
13	Two tasks in one deploy run different code	Image is a moving tag (`:latest`), resolved per launch	Task def `image` is a tag, not `@sha256:`	Pin an immutable digest in CI
14	Deploy hangs at `IN_PROGRESS`, never completes	Tasks never pass ALB health check (wrong port/path/matcher)	`describe-target-health` → `unhealthy`; reason	Align health-check port/path/matcher to the container
15	Task in a private subnet can’t reach the internet/ECR	Private subnet, `assignPublicIp=DISABLED`, no NAT/endpoint	Subnet route table has no NAT/IGW; no endpoints	Add NAT gateway or the VPC endpoints; keep the public IP disabled

The expanded form for the entries that bite hardest:

1. 502s on every deploy and every scale-in; fine at steady state. Root cause: The ALB target group isn’t draining gracefully: either target_type isn’t ip (so registration is wrong for awsvpc) or the deregistration delay is the default 300s, longer than ECS’s stop sequence, so the ALB is still draining connections to a task ECS has already stopped. Confirm: aws elbv2 describe-target-groups --target-group-arns <arn> --query 'TargetGroups[].TargetType' for the target type, and aws elbv2 describe-target-group-attributes --target-group-arn <arn> for deregistration_delay.timeout_seconds (describe-target-groups does not return attributes). Inspect stopped tasks for SIGKILL vs graceful exit. Fix: target_type=ip, deregistration_delay=30, and make sure that delay sits under stopTimeout (e.g. 60) so both the ALB and ECS finish draining together.

2. In-flight requests error exactly when a task stops. Root cause: PID 1 is a shell (sh -c "...") that swallows SIGTERM, so the app never gets the signal and is SIGKILLed after stopTimeout with requests still in flight. Confirm: Inspect the task definition’s command/entryPoint for a shell wrapper; stopped tasks show no graceful-exit log line, just an abrupt stop. Fix: Run the app as PID 1 via the exec form (CMD ["node","server.js"]) or set "initProcessEnabled": true in linuxParameters; implement a SIGTERM handler that stops accepting and finishes in-flight inside stopTimeout.

3. p99 climbs before scale-out; service thrashes on scale-in. Root cause: Wrong scaling metric — CPU target tracking on an I/O-bound service, so CPU stays low while queues grow and scaling reacts late; and/or a too-short ScaleInCooldown causing flapping. Confirm: CloudWatch shows CPU flat (e.g. 40%) while the ALB’s RequestCountPerTarget and TargetResponseTime climb. Fix: Make ALBRequestCountPerTarget the primary signal (keep CPU as a backstop), and raise ScaleInCooldown to stop thrash.

5. Tasks stuck PROVISIONING; stopped reason RESOURCE:ENI or IP-not-available. Root cause: The subnet(s) ran out of free IPs/ENIs during the deploy surge: desired_count × (maximumPercent/100) exceeded usable addresses (often a /26 or smaller hosting a 30+ task service at maximumPercent: 200). Confirm: Compare each subnet’s free-IP count to desired_count × (maximumPercent/100); describe-tasks shows stoppedReason with RESOURCE:ENI. Fix: Move tasks to /24-or-larger subnets across ≥2 AZs; as an immediate unblock, lower maximumPercent (e.g. to 150) so the surge is smaller.

6. A bad image leaves failing tasks replaced forever, IP pool draining. Root cause: The deployment circuit breaker is off (or on without rollback), so ECS keeps launching tasks from a revision that never becomes healthy. Confirm: aws ecs describe-services --query 'services[0].deployments[].{rollout:rolloutState,failed:failedTasks}' shows IN_PROGRESS with failedTasks climbing. Fix: update-service --deployment-configuration '{"deploymentCircuitBreaker":{"enable":true,"rollback":true}}'; redeploy and test it once in non-prod so you’ve actually seen it fire.

8. ResourceInitializationError: unable to pull secrets or registry auth. Root cause: The execution role lacks secretsmanager:GetSecretValue (or kms:Decrypt for a CMK), or the task has no network route to the Secrets Manager / ECR endpoints. Confirm: describe-tasks → stoppedReason; check the exec-role policy and whether a secretsmanager interface endpoint (or NAT) exists. Fix: Grant the exec role the exact secret ARN and KMS key; add the secretsmanager (and ECR) VPC endpoints or a NAT route.

9. The app can read secrets or buckets it never references. Root cause: Secret-read or broad runtime permissions were attached to the task role (the one your code assumes), or both roles use *. Your application now holds privileges it should never have. Confirm: Diff the taskRoleArn policy against what the code actually calls; look for secretsmanager:* or s3:* on the task role. Fix: Move secrets-reading to the execution role; scope the task role to only the specific actions and ARNs the code uses.
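
The SIGTERM fix in entry 2 can be sketched in a few lines. This is a minimal illustration, not the service's actual code: the names `shutting_down`, `handle_request`, and `drain` are hypothetical, and the `time.sleep` call stands in for real request handling. It assumes the process runs as PID 1 (or behind a signal-forwarding init) so the signal actually arrives.

```python
import signal
import threading
import time

# Hypothetical module-level state for this sketch.
shutting_down = threading.Event()
_in_flight = 0
_lock = threading.Lock()


def on_sigterm(signum, frame):
    """Record that ECS asked the task to stop; do not exit yet."""
    shutting_down.set()


def install_handler():
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, on_sigterm)


def handle_request(work_seconds=0.0):
    """Return False once draining has started; otherwise do the work."""
    global _in_flight
    if shutting_down.is_set():
        return False  # a real server would stop accepting or answer 503
    with _lock:
        _in_flight += 1
    try:
        time.sleep(work_seconds)  # stand-in for real request handling
        return True
    finally:
        with _lock:
            _in_flight -= 1


def drain(grace_seconds, poll_seconds=0.05):
    """Wait up to grace_seconds for in-flight work; True if it all finished."""
    deadline = time.monotonic() + grace_seconds
    while True:
        with _lock:
            if _in_flight == 0:
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

In use, the app would call `install_handler()` at startup and, once `shutting_down` is set, call `drain()` with a grace period shorter than stopTimeout before exiting, so the process leaves on its own rather than being SIGKILLed.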

Best practices

Pin an immutable image digest, never :latest. ECS resolves a tag at each task launch, so a moving tag means two tasks in one deployment run different code. Pass the digest from CI.
Size the task for the whole task (app + sidecars) and pick a valid CPU/memory combination; cap individual containers only when a sidecar needs bounding.
Default to ARM64 (Graviton) for compatible images — ~20% cheaper for the same size, usually equal or better performance.
Spread tasks across ≥2 private subnets in different AZs with IP headroom for maximumPercent. Plan subnets for the deploy surge, not steady state.
Disable public IPs; reach AWS via VPC endpoints (ECR api/dkr, secretsmanager, logs, sts interface + S3 gateway) — cheaper than NAT per-GB and keeps pulls on the AWS network.
Scale on a metric that leads load — ALBRequestCountPerTarget for web, queue depth for workers, CPU/memory only when work maps to them. Layer a CPU backstop; avoid aggressive scale-in policies that flap.
Always enable the deployment circuit breaker with rollback: true and a healthCheckGracePeriodSeconds above your cold-start time. Test it once so you’ve seen it fire.
Make shutdown graceful as a triad: PID 1 receives SIGTERM (exec form or initProcessEnabled), the app drains inside stopTimeout, and the ALB deregistration delay (≈30s) covers the drain but no longer.
Register targets by IP (target_type = ip) — Fargate tasks are ENIs, not instances.
Keep the execution role and task role separate and minimal, each scoped to specific ARNs; inject secrets via secrets, never environment.
Turn on Container Insights, emit structured JSON logs (non-blocking), and wire ADOT/X-Ray tracing from day one — diagnosis is a lookup, not an archaeology dig.
Use Fargate Spot via a capacity-provider strategy (with an on-demand base) only for stateless services that drain on SIGTERM; right-size with Compute Optimizer and re-measure.
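
The "plan for the deploy surge" rule is simple arithmetic, sketched below. It assumes one IP per task, the 5 addresses AWS reserves in every subnet, and that ECS rounds the maximumPercent ceiling down. The function names and the `other_ips_in_use` parameter are hypothetical, and pooling capacity across subnets assumes tasks spread roughly evenly.

```python
import ipaddress
import math

AWS_RESERVED_PER_SUBNET = 5  # addresses AWS reserves in every subnet


def usable_ips(cidr):
    """Addresses a subnet can hand out after the AWS-reserved ones."""
    net = ipaddress.ip_network(cidr)
    return max(net.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def surge_tasks(desired_count, maximum_percent):
    """Peak task count during a rolling deploy (rounded down)."""
    return math.floor(desired_count * maximum_percent / 100)


def surge_fits(desired_count, maximum_percent, subnet_cidrs, other_ips_in_use=0):
    """True if the pooled subnets can hold the deploy surge."""
    capacity = sum(usable_ips(c) for c in subnet_cidrs) - other_ips_in_use
    return surge_tasks(desired_count, maximum_percent) <= capacity


# A 40-task service at maximumPercent 200 needs 80 IPs at the peak.
print(surge_tasks(40, 200))
print(surge_fits(40, 200, ["10.0.1.0/26"]))
print(surge_fits(40, 200, ["10.0.1.0/24", "10.0.2.0/24"]))
```

A single /26 offers 59 usable addresses, which is why a 30+ task service at maximumPercent 200 runs out mid-deploy.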

Security notes

Two-role least privilege. The execution role pulls images, writes logs, and reads secrets before start; the task role is what your code uses at runtime. Scope each to specific ARNs — never * — and never put secrets-reading on the task role.
Secrets out of plaintext. Inject via the secrets block (Secrets Manager or SSM SecureString), encrypted with a CMK where it matters; environment is visible in describe-tasks and the API, so it’s for non-secret config only.
Network isolation by default. Tasks run in private subnets with assignPublicIp: DISABLED, security groups that accept only the ALB’s SG on the container port, and least-privilege egress (DB SG, endpoint SG) — reference SGs by ID, never CIDR.
Private AWS access. VPC interface/gateway endpoints keep ECR pulls, secret reads, and log writes on the AWS backbone instead of traversing a NAT to the public internet.
Harden the container. Run as a non-root user, set readonlyRootFilesystem: true (write only to mounts/tmpfs), and scan images in ECR; pin digests so a tampered or moved tag can’t slip in.
Lock down ECS Exec. If you enable ECS Exec for debugging, gate it with IAM and log sessions; it’s a shell into a running task and should not be broadly granted.
Front with a WAF where it’s internet-facing. Put the ALB behind AWS WAF and restrict the ALB SG to your CDN/edge ranges so tasks are never directly reachable.
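
Several of these notes can be checked mechanically against a container definition before it is registered. The sketch below is one hedged way to do that. `lint_container` and `SECRET_HINTS` are hypothetical names, the secret-name heuristic is an assumption, and the dictionary keys (`image`, `user`, `readonlyRootFilesystem`, `environment`) are the container-definition fields named above.

```python
from typing import Dict, List

# Hypothetical heuristic: substrings that suggest a value is a credential.
SECRET_HINTS = ("PASSWORD", "SECRET", "TOKEN", "API_KEY")


def lint_container(container: Dict) -> List[str]:
    """Return human-readable findings for one container definition."""
    findings = []
    if "@sha256:" not in container.get("image", ""):
        findings.append("image is a tag, not an immutable @sha256: digest")
    # An unset user may still be non-root via the image's USER directive;
    # this sketch only reads what the task definition states.
    user = str(container.get("user", "")).split(":")[0]
    if user in ("", "0", "root"):
        findings.append("no explicit non-root user")
    if container.get("readonlyRootFilesystem") is not True:
        findings.append("readonlyRootFilesystem is not true")
    for env in container.get("environment", []):
        name = env.get("name", "")
        if any(hint in name.upper() for hint in SECRET_HINTS):
            findings.append(f"{name} looks like a secret; move it to the secrets block")
    return findings


risky = {
    "image": "api:latest",
    "environment": [{"name": "DB_PASSWORD", "value": "example"}],
}
for finding in lint_container(risky):
    print(finding)
```

A check like this fits naturally in CI next to the step that substitutes the image digest, so a bad task definition fails the pipeline rather than the deploy.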

The security controls that also prevent these incidents (security and resilience pull in the same direction):

Control	Mechanism	Secures against	Also prevents
Two-role split	`executionRoleArn` vs `taskRoleArn`	App holding excess privilege	Secret-pull failures (right role scoped)
`secrets` block + CMK	Secrets Manager / SSM + KMS	Plaintext creds in task def	Rotation breaking the app (picked up on launch)
Private subnets + SG-by-ID	`awsvpc` + SG references	Direct internet exposure	Connection-refused from CIDR drift
VPC endpoints	Interface/gateway endpoints	Egress over public internet	NAT per-GB cost on every pull
Digest pinning + ECR scan	`@sha256:` + image scanning	Tampered/unknown images	Two-tasks-differ at deploy
Non-root + read-only root FS	`user`, `readonlyRootFilesystem`	Container escape blast radius	Accidental writes corrupting state

Cost & sizing

The bill drivers and how they interact with the fixes:

vCPU-seconds and GB-seconds dominate. Fargate bills per vCPU-second and GB-second from image pull to task stop, per running task. An oversized task definition (4 vCPU peaking at 0.8) costs you on every replica, every hour — right-sizing is often the biggest single saving.
ARM64 is ~20% off the same size for one flag, on compatible images — the cheapest change you can make.
Fargate Spot cuts the elastic portion up to ~70% but reclaims with a ~2-minute warning; only for stateless, SIGTERM-clean services, with an on-demand base for the floor.
Networking adds up quietly. A NAT gateway charges per-GB on every ECR pull; VPC endpoints (interface hourly + per-GB, S3 gateway free) are usually cheaper at scale and keep traffic on AWS.
Observability is per-GB / per-metric. Container Insights enhanced and CloudWatch ingest are worth it, but sample high-volume logs/traces so a traffic spike doesn’t spike the telemetry bill.

A rough monthly picture for a small production API (steady ~6 tasks, bursting to ~12), us-east-1, indicative — confirm against the live pricing page:

Cost driver	What you pay for	Rough INR / month	What it buys	Watch-out
6× 0.5 vCPU / 1 GB on-demand	Steady Fargate compute	~₹9,000–12,000	Always-on floor	Per-second; right-size first
Burst 6× more on Spot	Elastic peak portion	~₹1,500–3,000	~70% off the burst	Must drain on SIGTERM
ARM64 vs X86_64	Same size, cheaper arch	−~20% of compute	The free saving	Image must be arm64
NAT gateway	Hourly + per-GB egress	~₹3,000–5,000	Internet/AWS egress	Per-GB on every pull
VPC endpoints (3 interface + S3)	Hourly per endpoint + per-GB	~₹2,000–3,500	Private pulls/secrets/logs	Cheaper than NAT at volume
Container Insights + logs	Per-metric + per-GB ingest	~₹1,500–4,000	Diagnosis itself	Sample high-traffic
ALB	Hourly + LCU	~₹2,000–3,000	Ingress + health checks	LCUs scale with traffic

What exactly Fargate meters, so you know which knob each line item responds to:

Billed dimension	Metered as	From → to	Lever that reduces it
vCPU	per vCPU-second	image pull start → task stop	Right-size `cpu`; ARM64; Spot; scale-in faster
Memory	per GB-second	image pull start → task stop	Right-size `memory`; fewer over-provisioned tasks
Ephemeral storage	per GB-hour above 20 GiB	provisioned duration	Keep within the 20 GiB free tier
Architecture	~20% lower rate on ARM64	n/a	Build arm64/multi-arch images
Capacity provider	Spot rate on `FARGATE_SPOT`	n/a	Mix on-demand `base` + Spot `weight`
Data egress	per-GB (NAT/internet)	per byte	VPC endpoints; same-region pulls
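
The two dominant dimensions combine as a simple product, sketched below. The rates in the example call are placeholders chosen for easy arithmetic, not real prices, and the function and parameter names are hypothetical. Take actual per-second rates from the live pricing page for your region and architecture.

```python
def compute_cost(tasks, vcpu, memory_gb, running_seconds,
                 vcpu_second_rate, gb_second_rate):
    """Compute charge for identical tasks that each run for running_seconds."""
    per_task_second = vcpu * vcpu_second_rate + memory_gb * gb_second_rate
    return tasks * running_seconds * per_task_second


def with_discount(cost, fraction):
    """Apply a fractional discount, e.g. 0.20 for a roughly 20% lower rate."""
    return cost * (1.0 - fraction)


# Placeholder rates (NOT real prices): 0.00001 per vCPU-second,
# 0.000001 per GB-second, six 0.5 vCPU / 1 GB tasks for one hour.
hourly = compute_cost(6, 0.5, 1, 3600, 0.00001, 0.000001)
print(round(hourly, 4))
print(round(with_discount(hourly, 0.20), 4))
```

Because vCPU and memory are priced separately, halving an over-provisioned `cpu` value cuts only the vCPU term; the memory term moves only when `memory` is resized as well.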

Right-sizing workflow: read Container Insights / Compute Optimizer’s ECS recommendations, find tasks over-provisioned versus their peak, resize the task definition to the next valid combo down, redeploy, and re-measure after a full traffic cycle. Lumio’s post-incident bill dropped once they right-sized back down after fixing the scaling metric — the fix is usually configuration, not a bigger task.
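
The "next valid combo down" step can be sketched as a lookup over the Fargate CPU/memory matrix. The table below reflects the combinations AWS documents at the time of writing (CPU units, memory in MiB); verify it against the current task-definition reference. The function name and the 20% headroom default are assumptions for illustration.

```python
# CPU units -> allowed memory values in MiB (per AWS docs at time of writing).
VALID_COMBOS = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def smallest_valid_size(peak_cpu_units, peak_memory_mib, headroom=1.2):
    """First valid (cpu, memory) pair, in ascending order, that covers peak plus headroom."""
    need_cpu = peak_cpu_units * headroom
    need_mem = peak_memory_mib * headroom
    for cpu, memories in VALID_COMBOS.items():
        if cpu < need_cpu:
            continue
        for mem in memories:
            if mem >= need_mem:
                return cpu, mem
    raise ValueError("peak usage exceeds the largest Fargate task size")


# A 4 vCPU task peaking near 0.8 vCPU (about 819 CPU units) and 1500 MiB.
print(smallest_valid_size(819, 1500))
```

The search walks CPU sizes upward and skips any size whose memory range cannot cover the requirement, so a memory-heavy task may land on a larger CPU value than its CPU peak alone would suggest.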

Interview & exam questions

1. Why must a Fargate ALB target group use target_type = ip? Because every Fargate task runs awsvpc networking and has its own ENI and private IP; there’s no shared EC2 instance to register. The instance target type registers EC2 instance IDs, which don’t exist on Fargate; ip registers each task’s private IP directly. This maps to the SAA/DVA container objectives.

2. A service throws 502s on every deploy and scale-in but is fine at steady state. What’s the cause? The ALB target group isn’t draining gracefully — typically the default 300-second deregistration delay keeps routing new connections to tasks ECS is stopping, and/or PID 1 swallows SIGTERM so the app is SIGKILLed mid-request. Fix: deregistration_delay≈30 aligned under stopTimeout, and a real SIGTERM handler with the app as PID 1.

3. What does the deployment circuit breaker do, and what happens without it? It watches for a run of failed task launches during a deploy and, with rollback: true, automatically reverts to the last known-good task definition. Its failure threshold scales with service size. Without it, a bad image that never passes health checks leaves ECS replacing failing tasks indefinitely, draining the subnet IP pool and paging on-call.

4. Difference between the execution role and the task role? The execution role is assumed by the ECS agent before the container starts — to pull the image from ECR, write to the log group, and resolve secrets. The task role is assumed by your application code at runtime to call AWS APIs (S3, DynamoDB). Secrets-reading belongs to the execution role; the task role carries only runtime permissions. Conflating them is the classic ECS IAM mistake.

5. Which scaling metric should a web API behind an ALB use, and why not CPU? ALBRequestCountPerTarget — it scales on actual per-task load and reacts before CPU saturates. CPU target tracking lags for I/O-bound services because CPU stays low while request queues grow, so scaling reacts after latency has already spiked. Keep a CPU policy as a backstop, not the primary.

6. How do you plan subnet sizing for a Fargate deploy? Each task consumes one subnet IP via its ENI, and during a rolling deploy you run up to desired × (maximumPercent/100) tasks. For a 40-task service at maximumPercent: 200, plan for ~80 IPs during the deploy, plus AWS’s 5 reserved addresses per subnet and anything else in those subnets — so a /24 or larger across ≥2 AZs.

7. A task is stuck in PROVISIONING with stopped reason RESOURCE:ENI. What happened? The subnet ran out of free IPs/ENIs during the deploy surge, so the task can’t get an ENI. Confirm by comparing free IPs to desired × (maximumPercent/100). Fix by using larger subnets (/24+) across more AZs, or temporarily lowering maximumPercent to shrink the surge.

8. Why pin an image digest instead of a tag? ECS resolves the image reference at each task launch. A moving tag like :latest means two tasks in the same deployment can pull different code, producing nondeterministic behavior that’s brutal to debug. A @sha256: digest is immutable, so every task in a deployment runs identical bits.

9. How does graceful shutdown work on Fargate, and what are the three timers? On stop, ECS sends SIGTERM to PID 1, waits stopTimeout (default 30s, max 120s), then sends SIGKILL, while the ALB deregisters the task and waits its deregistration delay for in-flight connections. The three timers must nest: app drain time ≤ deregistration delay ≤ stopTimeout. PID 1 must actually receive SIGTERM (exec form or initProcessEnabled).

10. When would you choose blue/green over rolling deployments? Rolling with a circuit breaker is the right default for most services. Reach for blue/green (native ECS or CodeDeploy) when you need a full parallel environment with instant cutover and rollback, or canary/linear traffic shifting — high-stakes changes where you want to validate the green environment before sending it real traffic. The cost is running a full second environment during the shift.

11. How does Fargate Spot save money, and what’s the precondition? It runs the same tasks at up to ~70% off but can reclaim them with a ~2-minute SIGTERM warning. The precondition is that the service is stateless and drains cleanly on SIGTERM — Spot reclamation uses the same graceful-stop path. Use a capacity-provider strategy with an on-demand base for the guaranteed floor and Spot for the elastic remainder.

12. Your scaling policy seems to do nothing. What’s the most likely silent cause? A malformed ResourceLabel on an ALBRequestCountPerTarget policy. It must be <ALB full name>/<target group full name>: the part of the load balancer ARN after loadbalancer/ (app/<name>/<id>), a slash, then the final part of the target group ARN including its targetgroup/ prefix (targetgroup/<name>/<id>). Get it wrong and Application Auto Scaling can’t read the metric, so the policy silently never acts.
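
Since the ResourceLabel in question 12 is easy to get wrong by hand, it helps to derive it from the two ARNs. The sketch below assumes the documented label shape `app/<name>/<id>/targetgroup/<name>/<id>`. The function name is hypothetical, and the ARNs in the example use a placeholder account ID and made-up resource IDs.

```python
def resource_label(alb_arn, target_group_arn):
    """Build the ALBRequestCountPerTarget ResourceLabel from two ELBv2 ARNs."""
    # An ARN has six colon-separated fields; the last one is the resource.
    alb_resource = alb_arn.split(":", 5)[5]
    tg_resource = target_group_arn.split(":", 5)[5]
    if not alb_resource.startswith("loadbalancer/app/"):
        raise ValueError("expected an Application Load Balancer ARN")
    if not tg_resource.startswith("targetgroup/"):
        raise ValueError("expected a target group ARN")
    # Drop 'loadbalancer/' from the ALB part; keep 'targetgroup/' on the other.
    alb_part = alb_resource[len("loadbalancer/"):]
    return f"{alb_part}/{tg_resource}"


label = resource_label(
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188",
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/73e2d6bc24d8a067",
)
print(label)
```

Raising on an unexpected ARN shape turns the silent failure into a loud one at policy-creation time.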

These map to AWS Certified Solutions Architect – Associate (SAA-C03) and Developer – Associate (DVA-C02) for ECS/Fargate, task definitions, IAM roles, and deployments; the networking depth (awsvpc, endpoints, SGs) touches Advanced Networking – Specialty (ANS-C01). A compact cert-mapping for revision:

Question theme	Primary cert	Objective area
Task def, roles, deploys, circuit breaker	DVA-C02 / SAA-C03	Deploy & operate containerized apps
awsvpc, ENIs, endpoints, SGs	SAA-C03 / ANS-C01	Design resilient/secure networking
Auto Scaling metric choice	SAA-C03	Design scalable architectures
Two-role IAM least privilege	SAA-C03 / SCS-C02	Secure access; least privilege
Spot/Graviton/right-sizing cost	SAA-C03	Cost-optimized architectures

Quick check

A Fargate service throws 502s on every deploy and scale-in but is healthy at steady state. Name the two most likely causes and the one target-group setting you check first.
Your container starts via sh -c "node server.js". Why might in-flight requests be dropped on every task stop, and what are two fixes?
True or false: scaling out to more tasks fixes a service whose tasks are getting OOM-killed.
A web API behind an ALB is scaling late under load even though CPU stays at 40%. What metric should it scale on instead, and why?
A bad image is deployed and ECS keeps replacing failing tasks until the subnet runs out of IPs. What one feature would have prevented this, and how do you turn it on?

Answers

Cause A: the ALB target group’s deregistration delay is the default 300s, longer than ECS’s stop sequence, so the ALB keeps sending new connections to a stopping task. Cause B: PID 1 swallows SIGTERM so the app is SIGKILLed mid-request. First setting to check: deregistration_delay (set it to ≈30s and align it under stopTimeout). Also confirm target_type = ip.
The shell is PID 1 and may not forward SIGTERM, so the app never gets the signal and is SIGKILLed after stopTimeout with requests in flight. Fixes: run the app as PID 1 via the exec form (CMD ["node","server.js"]), or set "initProcessEnabled": true in linuxParameters to get a signal-forwarding init — plus implement a SIGTERM handler that drains.
False. OOM is against the per-task memory cap; every scaled-out task hits the same ceiling and OOMs. Fix by raising the task memory to a valid CPU/memory combination (scale up) and/or fixing the leak — scaling out doesn’t change the per-task limit.
ALBRequestCountPerTarget. It scales on actual per-task request load and reacts before CPU saturates; CPU target tracking lags for I/O-bound work because CPU stays low while request queues grow. Keep CPU as a backstop policy.
The deployment circuit breaker with rollback: true. Enable it via update-service --deployment-configuration '{"deploymentCircuitBreaker":{"enable":true,"rollback":true}}' (and set a healthCheckGracePeriodSeconds); it auto-detects the failing deploy and reverts to the last known-good revision.
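
The timer relationship in the first answer can be expressed as a small pre-deploy check. This sketch encodes the guide's own rule of thumb (app drain inside the deregistration delay, which in turn sits under stopTimeout, capped at 120s on Fargate). The function name and the message wording are hypothetical.

```python
FARGATE_MAX_STOP_TIMEOUT = 120  # seconds


def check_shutdown_timers(app_drain_s, deregistration_delay_s, stop_timeout_s):
    """Return a list of problems; an empty list means the timers nest."""
    problems = []
    if stop_timeout_s > FARGATE_MAX_STOP_TIMEOUT:
        problems.append("stopTimeout exceeds the Fargate maximum of 120s")
    if deregistration_delay_s > stop_timeout_s:
        problems.append("deregistration delay outlasts stopTimeout")
    if app_drain_s > stop_timeout_s:
        problems.append("app drain outlasts stopTimeout and will be SIGKILLed")
    if app_drain_s > deregistration_delay_s:
        problems.append("app drain outlasts the deregistration delay")
    return problems


print(check_shutdown_timers(25, 30, 60))   # timers nest
print(check_shutdown_timers(25, 300, 30))  # the default-delay failure mode
```

The second call reproduces the 502 scenario: a 300-second deregistration delay against a 30-second stopTimeout.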

Glossary

AWS Fargate — the serverless launch type for ECS (and EKS); you run containers without provisioning or managing EC2 hosts, billed per vCPU-second and GB-second.
Task definition — the immutable, versioned (family:revision) blueprint of containers, CPU/memory, network mode, and the two IAM roles.
Task — one running instance of a task definition; on Fargate it gets its own ENI and private IP.
Service — the controller that keeps a desired number of tasks running, registered behind a load balancer, and owns deployments and scaling.
awsvpc network mode — the Fargate-mandatory mode giving each task its own ENI, IP, and security group(s).
ENI (elastic network interface) — the virtual NIC attached to each task; finite per subnet, which constrains the deploy surge.
Execution role — the IAM role the ECS agent assumes before the container starts (ECR pull, log writes, resolving secrets).
Task role — the IAM role your application code assumes at runtime to call AWS APIs.
minimumHealthyPercent / maximumPercent — the floor of healthy tasks ECS keeps and the ceiling it may temporarily exceed during a deploy.
Deployment circuit breaker — the deploy safeguard that detects a run of failed task launches and (with rollback) reverts to the last known-good revision.
stopTimeout — the grace period (default 30s, max 120s on Fargate) between SIGTERM and SIGKILL when a container is stopped.
Deregistration delay — the time the ALB waits for in-flight connections to finish before fully removing a target (default 300s; set ≈30s for fast services).
Application Auto Scaling — the service that adjusts a scalable target’s desiredCount based on a CloudWatch metric via target-tracking or step policies.
ALBRequestCountPerTarget — the predefined target-tracking metric that scales on requests-per-task; the leading signal for ALB-fronted web services.
ResourceLabel — the <ALB full name>/<target group full name> string a request-count policy needs; if malformed, the policy silently does nothing.
Capacity provider — FARGATE (on-demand) or FARGATE_SPOT; a strategy with base/weight mixes them for cost.
Fargate Spot — discounted capacity (up to ~70% off) that can be reclaimed with a ~2-minute SIGTERM warning; for stateless, drain-clean services only.
VPC endpoint — an interface (ECR, Secrets Manager, Logs, STS) or gateway (S3, DynamoDB) endpoint that keeps AWS API traffic on the backbone instead of via NAT.
PID 1 / init — the entrypoint process that must receive SIGTERM to shut down gracefully; initProcessEnabled adds a signal-forwarding, zombie-reaping init.
Container Insights — the CloudWatch feature giving per-task/service (and, in enhanced mode, per-container) metrics and dashboards.

Next steps

You can now wire a production Fargate service: correctly sized, isolated per-task, scaled on the right signal, deployed with a tested safety net, and shut down without dropping a request. Build outward:

Next: Elastic Load Balancing Deep Dive: ALB, NLB & GWLB — the target groups, health checks, and listeners that are half of every Fargate deploy.
Related: VPC Deep Dive: Subnets, Routing, IGW, NAT & Endpoints — design the subnets and endpoints your per-task ENIs live in.
Related: ECS Service Connect vs Load Balancers: Discovery & Resilience — service-to-service discovery once you have more than one Fargate service.
Related: Graviton/ARM64 Migration: Multi-Arch Builds & Benchmarking — make the ~20% ARM64 saving real with multi-arch images.
Related: Secrets Manager & Parameter Store Deep Dive — inject and rotate the secrets your execution role resolves at launch.
Related: AWS X-Ray: Service Map, Segments & ADOT Tracing — add the distributed tracing sidecar referenced in the observability section.

Written by Vinod
