Production Amazon ECS on Fargate: Task Networking, Auto Scaling, and Safe Rolling Deployments

A docker run on a laptop and an ECS service on AWS Fargate share almost no operational concerns. Fargate is the serverless launch type for Amazon Elastic Container Service: you hand AWS a task definition and a desired count, and it runs your containers on capacity it owns and patches, with no EC2 instances for you to size, drain, or reboot. That removes the host layer, but it does not remove the decisions that determine whether a deploy is safe at 3am: how each task gets an IP and a security group, when a deployment rolls back on its own, and what happens to in-flight requests when a task is told to stop. Those are the knobs that separate a service that drains cleanly on every release from one that drops connections, leaks IPs, and pages you on a Friday.

This guide walks the pieces I actually wire up for a production Fargate service, the same way I’d brief a new engineer joining the on-call rotation. Every section goes option-by-option: the valid CPU/memory matrix, the awsvpc ENI/IP planning math, the choice between target-tracking and step scaling and exactly which metric to scale on, the deployment circuit breaker that auto-rolls-back a bad image, the SIGTERM → stopTimeout → deregistration-delay triad that makes shutdown graceful, the execution-role vs task-role split that is the most common IAM mistake on ECS, and the Fargate Spot + Graviton + right-sizing levers that move the bill. Because this is a reference you’ll return to mid-incident, the options, limits, error codes and the deploy playbook itself are laid out as scannable tables — read the prose once, then keep the tables open when the rollout is stuck.

Assume the AWS provider/region is set, an Application Load Balancer (ALB) exists, and you’re on a recent CLI (aws --version >= 2.x). I use the Linux platform version LATEST throughout, which today resolves to Fargate platform version 1.4.0, and I default to ARM64 (Graviton) because it is the cheapest change you can make. By the end you’ll be able to register a correctly-sized task, place it in private subnets with per-task security groups, scale it on the right signal, deploy it with an automatic safety net, and shut it down without severing a single request.

What problem this solves

ECS on Fargate hides the fleet so you can ship a container without owning servers. That abstraction is a gift until a deploy goes sideways, and then the failure modes are not in your application code — they’re in the wiring between the ALB, the task ENI, the scaling policy, and the lifecycle hooks. The defaults are tuned for “it runs”, not for “it survives a release under load”, and almost every production ECS incident I’ve seen traces back to one of a small set of mis-set knobs.

What breaks without this knowledge, concretely: a service scaled on CPU that is actually I/O-bound scales out after p99 has already tripled, because CPU never crossed the target while request queues grew. A container launched via sh -c "java -jar app.jar" where the shell is PID 1 swallows SIGTERM, so the JVM is SIGKILLed on every task stop and every in-flight request dies. An ALB target group left at the default 300-second deregistration delay keeps routing to tasks ECS has already begun stopping. A bad image with no circuit breaker leaves ECS replacing failing tasks forever, draining the subnet’s IP pool until new tasks can’t even be placed. And the single most common IAM mistake — conflating the execution role (used by the agent to pull the image and read secrets before the container starts) with the task role (used by your code at runtime) — silently grants your application code permissions it should never have.

Who hits this: every team running containers on Fargate behind a load balancer. It bites hardest on services with chatty downstreams (the scaling-metric trap), services that were lifted from EC2 without revisiting PID 1 and signal handling (the dropped-connection trap), large services in small subnets (the IP-exhaustion trap), and anyone who has never actually seen their circuit breaker fire — because a safety net you’ve never tested is a configuration you don’t really have.

To frame the whole field before the deep dive, here is every failure class this article covers, the question it forces, and the one place to look first:

| Failure class | What it looks like | First question to ask | First place to look | Most common single cause |
|---|---|---|---|---|
| Deploy-time 502s | 502s on every release and every scale-in | Did the ALB cut a connection the task was still serving? | ALB target group settings | target_type not ip, or dereg delay still 300s |
| Dropped in-flight requests | Errors spike exactly when a task stops | Does PID 1 receive and handle SIGTERM? | Task definition command/entryPoint | Shell wrapper is PID 1, swallows the signal |
| Scales late / flaps | p99 climbs before scale-out; thrash on scale-in | Does load map to the metric you scaled on? | Auto Scaling policy + CloudWatch | CPU target tracking on an I/O-bound service |
| Tasks stuck PROVISIONING | RESOURCE:ENI / IP-not-available stopped reason | Are there free IPs for the deploy surge? | Subnet free-IP count vs maximumPercent | /26 subnet, 40-task service at maximumPercent: 200 |
| Bad deploy never recovers | Failing tasks replaced forever, IP pool drains | Is the circuit breaker on with rollback? | describe-services rolloutState | deploymentCircuitBreaker not enabled |
| Code has perms it shouldn’t | App can read secrets it never references | Which role is your code actually using? | executionRoleArn vs taskRoleArn | Secrets perms on the task role, not exec role |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand the container basics: an image is built and pushed to a registry (Amazon ECR), a task definition is the immutable, versioned spec of what to run, a task is one running instance of that spec, and a service keeps a desired number of tasks running and registered behind a load balancer. You should know how to run aws in a shell, read JSON output, and that a VPC has subnets spread across Availability Zones, security groups (stateful, allow-only), and route tables. Familiarity with HTTP status codes, basic Linux process/signal concepts, and IAM policy JSON helps.

This sits in the Containers track and assumes the fundamentals from Amazon ECS & ECR Fundamentals: Task Definitions, Services & Fargate and the first deploy in Your First Container Deployment on ECS Fargate. It pairs tightly with Elastic Load Balancing Deep Dive: ALB, NLB & GWLB (the ALB and target group are half the deploy story) and VPC Deep Dive: Subnets, Routing, IGW, NAT & Endpoints (where the per-task ENIs and VPC endpoints live). For the choice of whether Fargate is even the right runtime, Choose Your Container Path: ECS vs EKS vs Fargate is upstream of this.

A quick map of who owns what during a Fargate incident, so you page the right person fast:

| Layer | What lives here | Who usually owns it | Failure classes it can cause |
|---|---|---|---|
| Client / DNS / TLS | Name resolution, cert, retries | Frontend / SRE | 502/503 only if misrouted; mostly red herrings |
| ALB + target group | Routing, health checks, deregistration | Platform / network | Deploy-time 502s (dereg delay, target type) |
| VPC subnets / ENIs | Per-task IPs, route to endpoints/NAT | Network team | Tasks stuck PROVISIONING (RESOURCE:ENI) |
| Security groups | Inbound from ALB, egress to deps | Platform + security | Connection refused / timeouts to the task |
| Task definition | CPU/mem, image, ports, lifecycle | App / dev team | Crash loops, dropped requests, OOM |
| Auto Scaling policy | Scalable target, metric, cooldowns | Platform + app | Scales late, flaps, hits max capacity |
| IAM roles (exec + task) | Image pull, secrets, runtime APIs | Security + app | Image-pull denied, over-broad app perms |
| ECS control plane | Rollout state, circuit breaker | Managed (AWS) | Bad deploy that never rolls back |

Core concepts

Six mental models make every later decision obvious.

A task is a first-class network citizen, not a process sharing a host. On Fargate the network mode is always awsvpc: each task gets its own elastic network interface (ENI) with a private IP from the subnet you place it in, and its own security group(s). You get per-task security groups, per-task VPC Flow Logs, and clean blast-radius isolation — at the cost of consuming one subnet IP (and one ENI) per running task. That IP consumption is the planning trap, because during a rolling deploy you briefly run more tasks than steady state.

The task definition is immutable and versioned; the service points at one revision. Every register-task-definition produces a new revision (family:N). The service runs whatever revision you set, and a deploy is “make the service converge from revision N to revision N+1”. This is why pinning the image to a digest matters: if the task definition says :latest, ECS resolves the tag at each task launch, so two tasks in the same deployment can pull different code. Immutability of the task definition doesn’t help if the image tag moves underneath it.

A deploy is a controlled overlap of two task sets. ECS brings up tasks from the new revision before draining the old ones, bounded by two percentages of desired count: minimumHealthyPercent (the floor it keeps healthy) and maximumPercent (the ceiling it may temporarily exceed). The overlap is what gives zero-downtime — and what consumes extra IPs and extra Fargate vCPU-seconds for the duration. The deployment circuit breaker watches for a run of failed task launches and, if rollback is on, reverts to the last known-good revision instead of replacing failing tasks forever.

Scaling is a separate control loop on top of the service. ECS services scale through Application Auto Scaling, registered against a scalable target with a min and max. A policy adjusts desiredCount based on a CloudWatch metric. The hard part is not the mechanics — it’s choosing a metric that leads load. For a request-driven service, request count per target leads; CPU often lags because the work is I/O-bound.

Shutdown is a negotiation between three timers. When ECS stops a task it sends SIGTERM to each container’s PID 1, waits up to stopTimeout (default 30s, max 120s), then SIGKILL. Simultaneously it deregisters the task from the ALB target group, and the ALB waits the deregistration delay for in-flight connections to finish. Graceful shutdown means PID 1 receives SIGTERM, the app drains (stop accepting, finish in-flight, exit) inside stopTimeout, and the deregistration delay is long enough to cover the drain but no longer.
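The three-timer relationship can be checked mechanically before a deploy. The sketch below is illustrative only: check_shutdown_timers is a hypothetical helper, not an AWS API, and it assumes the model described in this article (drain must finish inside stopTimeout, the deregistration delay should cover the drain, and a delay longer than stopTimeout buys nothing).

```python
FARGATE_MAX_STOP_TIMEOUT_S = 120  # Fargate ceiling for stopTimeout, per the text


def check_shutdown_timers(drain_s: int, stop_timeout_s: int, dereg_delay_s: int) -> list:
    """Return a list of human-readable problems; an empty list means the timers agree.

    drain_s        -- measured time the app needs to finish in-flight work (your number)
    stop_timeout_s -- container stopTimeout in the task definition
    dereg_delay_s  -- ALB target group deregistration delay
    """
    problems = []
    if stop_timeout_s > FARGATE_MAX_STOP_TIMEOUT_S:
        problems.append("stopTimeout exceeds the 120s Fargate maximum")
    if drain_s >= stop_timeout_s:
        problems.append("app drain does not finish before SIGKILL")
    if dereg_delay_s < drain_s:
        problems.append("deregistration delay is shorter than the drain")
    if dereg_delay_s > stop_timeout_s:
        problems.append("deregistration delay is longer than stopTimeout")
    return problems


# A 20s drain with stopTimeout 60 and a 30s deregistration delay is consistent.
print(check_shutdown_timers(drain_s=20, stop_timeout_s=60, dereg_delay_s=30))
# The defaults (30s stopTimeout, 300s delay) with a 45s drain flag two problems.
print(check_shutdown_timers(drain_s=45, stop_timeout_s=30, dereg_delay_s=300))
```

The drain time is the only input you have to measure yourself; the other two are configuration values you can read from the task definition and the target group.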

A task has two identities, and conflating them is the classic mistake. The execution role is assumed by the ECS agent before your container starts — to pull the image from ECR, write to the log group, and resolve secrets references. The task role is assumed by your application code at runtime to call AWS APIs (S3, DynamoDB, SQS). They are different principals doing different things at different times; secrets-reading belongs to the execution role, not the task role.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters in production |
|---|---|---|---|
| Task definition | Immutable, versioned spec of containers + envelope | ECS, family:revision | Pin the image digest or two tasks differ |
| Task | One running instance of a task definition | On Fargate capacity | The unit that gets an ENI and is stopped |
| Service | Keeps N tasks running + ALB-registered | ECS | Owns deploys, scaling, placement |
| awsvpc | One ENI + IP + SG per task | Fargate (always) | IP planning; per-task isolation |
| ENI | The task’s network interface | In a subnet | Finite per subnet; deploy surge consumes more |
| Execution role | Agent identity (pull, logs, secrets) | Task def executionRoleArn | Pre-start; reads secrets |
| Task role | App identity at runtime | Task def taskRoleArn | What your code’s SDK uses |
| stopTimeout | SIGTERM→SIGKILL grace per container | Container def | Must exceed app drain time |
| Deregistration delay | ALB wait for in-flight on stop | Target group | Must cover drain; default 300s is too long |
| Scalable target | The thing Auto Scaling adjusts | App Auto Scaling | Min/max bound on desiredCount |
| Target-tracking | Keep a metric near a value | Scaling policy | The default; pick the right metric |
| Circuit breaker | Auto-rollback on failed launches | Deploy config | Stops a bad image looping forever |
| Capacity provider | FARGATE vs FARGATE_SPOT mix | Cluster + service | The biggest cost lever |
| Platform version | Fargate runtime version (1.4.0) | Service | Feature/behavior baseline |

1. Task definition: sizing, platform, and the CPU/memory matrix

A Fargate task definition declares the container(s), the CPU/memory envelope, the network mode (awsvpc), and two distinct IAM roles. The CPU/memory pair is not free-form: Fargate only accepts specific combinations, and the valid memory range is constrained by the CPU value you pick. The whole task shares this budget — a sidecar’s usage comes out of the same pool — so size the task for the sum, then optionally cap individual containers with container-level cpu/memory.

| cpu (vCPU) | Valid memory values | Step | Typical use |
|---|---|---|---|
| 256 (.25) | 512, 1024, 2048 MiB | fixed list | Tiny sidecar-free APIs, cron tasks |
| 512 (.5) | 1024 – 4096 MiB | 1 GiB | Small web service + log router |
| 1024 (1) | 2048 – 8192 MiB | 1 GiB | Standard API with sidecars |
| 2048 (2) | 4096 – 16384 MiB | 1 GiB | Memory-heavier services, JVM apps |
| 4096 (4) | 8192 – 30720 MiB | 1 GiB | Large workers, in-memory caches |
| 8192 (8) | 16384 – 61440 MiB | 4 GiB | Big batch / data tasks (PV 1.4.0) |
| 16384 (16) | 32768 – 122880 MiB | 8 GiB | Largest single-task workloads (PV 1.4.0) |
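Because an invalid pair is rejected at registration time, it can be worth validating sizes in CI first. The following is a minimal sketch that transcribes the matrix into a lookup; the names FARGATE_MEMORY_MIB, is_valid_fargate_size and smallest_fit are hypothetical helpers for illustration, not part of any AWS SDK.

```python
# Valid task-level memory values (MiB) for each task-level cpu value,
# transcribed from the matrix in this section.
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(cpu: int, memory_mib: int) -> bool:
    """True if (cpu, memory) is one of the combinations in the matrix."""
    return memory_mib in FARGATE_MEMORY_MIB.get(cpu, [])


def smallest_fit(cpu_needed: int, memory_needed_mib: int):
    """Smallest (cpu, memory) pair that covers both needs, or None if nothing fits."""
    for cpu in sorted(FARGATE_MEMORY_MIB):
        if cpu < cpu_needed:
            continue
        for memory in FARGATE_MEMORY_MIB[cpu]:
            if memory >= memory_needed_mib:
                return cpu, memory
    return None


print(is_valid_fargate_size(512, 1024))   # the pair used in this article's task definition
print(is_valid_fargate_size(256, 4096))   # too much memory for a quarter vCPU
print(smallest_fit(256, 3000))            # memory need forces the next cpu tier
```

Note that smallest_fit sizes for the whole task, so the inputs should be the sum of the app and every sidecar.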

The container-level fields that shape sizing and lifecycle, each with its default and the trade-off:

| Field | What it does | Default | When to set | Trade-off / gotcha |
|---|---|---|---|---|
| cpu (container) | Caps/reserves vCPU for one container | unset (shares task) | Pin a sidecar’s slice | Sum can’t exceed task cpu |
| memory (hard) | Hard cap; container killed if exceeded | unset | Bound a leaky sidecar | OOM-kills the container at the cap |
| memoryReservation (soft) | Soft floor; can burst above | unset | Most app containers | Needs headroom in task memory |
| essential | If true, its exit stops the task | true | Keep on the app; sidecars vary | A non-essential sidecar dying is silent |
| stopTimeout | SIGTERM→SIGKILL grace (s) | 30 | Raise to cover drain | Max 120 on Fargate |
| user | UID/GID the process runs as | root | Always set non-root | Image must support the UID |
| readonlyRootFilesystem | Mounts / read-only | false | Harden | App must write only to mounts/tmpfs |
| portMappings.containerPort | Port the app listens on | | Always (web) | Must match ALB target group + health check |
| healthCheck | Container-level health command | none | Catch hangs ALB can’t see | Counts toward task health |

The platform/runtime choices, where the biggest cost decision (ARM64) hides:

| Setting | Values | Default | When to change | Trade-off |
|---|---|---|---|---|
| runtimePlatform.cpuArchitecture | X86_64, ARM64 | X86_64 | ARM64 for ~20% cheaper vCPU-hr | Image must be arm64/multi-arch |
| runtimePlatform.operatingSystemFamily | LINUX, WINDOWS_* | LINUX | Windows containers only | Windows on Fargate has fewer SKUs |
| Platform version | 1.4.0, LATEST | LATEST→1.4.0 | Pin for reproducibility | Pinning misses new behavior/fixes |
| image | tag or @sha256: digest | | Always pin a digest | Tag moves; two tasks diverge |
| networkMode | awsvpc (only on Fargate) | awsvpc | n/a on Fargate | Always per-task ENI |
| ephemeralStorage.sizeInGiB | 21–200 GiB | 20 GiB | Large scratch/space needs | Billed above the 20 GiB free |

A correct task definition, ARM64, digest-pinned, with a sane health check and bounded non-blocking logs:

{
  "family": "checkout-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "runtimePlatform": { "cpuArchitecture": "ARM64", "operatingSystemFamily": "LINUX" },
  "executionRoleArn": "arn:aws:iam::111122223333:role/checkout-execution",
  "taskRoleArn": "arn:aws:iam::111122223333:role/checkout-task",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/checkout@sha256:9b2c…e41",
      "essential": true,
      "user": "10001:10001",
      "readonlyRootFilesystem": true,
      "portMappings": [{ "containerPort": 8080, "protocol": "tcp" }],
      "stopTimeout": 60,
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/healthz || exit 1"],
        "interval": 15, "timeout": 5, "retries": 3, "startPeriod": 30
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/checkout-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "app",
          "mode": "non-blocking",
          "max-buffer-size": "25m"
        }
      }
    }
  ]
}

Register it:

aws ecs register-task-definition --cli-input-json file://checkout-api.task.json

The same in Terraform, with the digest passed in from CI so it’s never :latest:

resource "aws_ecs_task_definition" "checkout" {
  family                   = "checkout-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = aws_iam_role.exec.arn
  task_role_arn            = aws_iam_role.task.arn

  runtime_platform {
    cpu_architecture        = "ARM64"
    operating_system_family = "LINUX"
  }

  container_definitions = jsonencode([{
    name        = "app"
    image       = var.image_digest        # "...checkout@sha256:..."
    essential   = true
    user        = "10001:10001"
    portMappings = [{ containerPort = 8080, protocol = "tcp" }]
    stopTimeout = 60
  }])
}

The two choices worth repeating: ARM64 (Graviton) is typically ~20% cheaper per vCPU-hour and performs as well or better on most web workloads (covered in depth in Graviton/ARM64 Migration: Multi-Arch Builds & Benchmarking). And pin a digest: a moving tag is a correctness bug, not a convenience.
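Digest pinning is easy to enforce with a pre-registration check. This is a minimal sketch, assuming the task definition has been loaded into a dict shaped like the JSON in this section; unpinned_images and is_digest_pinned are hypothetical helper names, and the example image URIs are made up.

```python
import re

# An image reference counts as pinned only if it ends in a full sha256 digest.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image: str) -> bool:
    return _DIGEST_RE.search(image) is not None


def unpinned_images(task_definition: dict) -> list:
    """Names of containers whose image is referenced by tag (or nothing) instead of digest."""
    return [
        container["name"]
        for container in task_definition.get("containerDefinitions", [])
        if not is_digest_pinned(container.get("image", ""))
    ]


# Hypothetical task definition fragment for illustration.
task_def = {
    "containerDefinitions": [
        {"name": "app", "image": "example.test/checkout@sha256:" + "ab" * 32},
        {"name": "log-router", "image": "example.test/fluent-bit:latest"},
    ]
}
print(unpinned_images(task_def))
```

Failing the pipeline when the list is non-empty keeps a moving tag from ever reaching register-task-definition.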

The other task-definition fields you’ll touch on a real service, with their defaults and the trade-off:

| Field | What it does | Default | When to set | Trade-off / gotcha |
|---|---|---|---|---|
| requiresCompatibilities | Declares FARGATE vs EC2 | | Always ["FARGATE"] here | Mismatch rejects invalid combos |
| volumes + mountPoints | Shared/EFS volumes | none | Persistent or shared data | Fargate supports EFS, not host bind |
| dependsOn | Order containers by condition | none | Sidecar must be up first (FireLens) | START/HEALTHY/COMPLETE conditions |
| pidMode | Share PID namespace | per-container | Rarely; task for shared tooling | Security blast radius |
| runtimePlatform (OS) | LINUX vs WINDOWS family | LINUX | Windows containers | Fewer Windows SKUs/AZs |
| proxyConfiguration | App Mesh / Envoy proxy | none | Service-mesh sidecar | Adds an Envoy container |
| tags / propagateTags | Cost-allocation tags | none | Always tag for FinOps | Propagate from service or task def |

2. awsvpc networking: one ENI and IP per task, and the deploy-surge math

On Fargate the network mode is always awsvpc, so each task is a first-class network citizen with its own ENI, private IP, and security group(s). This is the single most important networking fact about Fargate, and it has two consequences you must plan for: IP consumption and egress routing.

The IP-consumption trap is the rolling deploy. During a deploy you briefly run more tasks than steady state, and each consumes a subnet IP. Plan subnets for the peak, not the average:

Peak task IPs during a deploy ≈ desired_count × (maximumPercent / 100). For a 40-task service at maximumPercent: 200, plan for up to 80 task IPs across your subnets during the deploy, on top of everything else (other services, ENIs, reserved addresses) in those subnets.

A subnet also reserves 5 addresses (network, router, DNS, future, broadcast), so the usable count is smaller than the raw CIDR size. The math by subnet size:

| Subnet CIDR | Total IPs | Usable (AWS reserves 5) | Steady tasks @ 50% headroom | Max service @ maximumPercent: 200 |
|---|---|---|---|---|
| /28 | 16 | 11 | ~7 | ~5 tasks (too small for prod) |
| /27 | 32 | 27 | ~18 | ~13 tasks |
| /26 | 64 | 59 | ~39 | ~29 tasks |
| /25 | 128 | 123 | ~82 | ~61 tasks |
| /24 | 256 | 251 | ~167 | ~125 tasks |
| /23 | 512 | 507 | ~338 | ~253 tasks |
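The surge arithmetic is simple enough to script as a pre-deploy check. The sketch below uses only the standard library; usable_ips, peak_task_ips and surge_fits are hypothetical helpers, the CIDRs are placeholders, and the check deliberately ignores per-AZ balancing, so treat a pass as necessary rather than sufficient.

```python
import ipaddress
import math

AWS_RESERVED_PER_SUBNET = 5  # network, router, DNS, future use, broadcast


def usable_ips(cidr: str) -> int:
    """Addresses left in a subnet after the five AWS reserves."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def peak_task_ips(desired_count: int, maximum_percent: int) -> int:
    """Task IPs in use at the top of a rolling deploy."""
    return math.floor(desired_count * maximum_percent / 100)


def surge_fits(cidrs, desired_count, maximum_percent, other_ips_in_use=0) -> bool:
    """True if the deploy peak fits in the subnets' combined usable addresses."""
    capacity = sum(usable_ips(c) for c in cidrs) - other_ips_in_use
    return peak_task_ips(desired_count, maximum_percent) <= capacity


print(usable_ips("10.0.0.0/26"))                                # 64 total minus 5 reserved
print(peak_task_ips(40, 200))                                   # the 40-task example in the text
print(surge_fits(["10.0.0.0/26"], 40, 200))                     # a single /26 cannot absorb it
print(surge_fits(["10.0.0.0/24", "10.0.1.0/24"], 40, 200))      # two /24s can
```

Pass other_ips_in_use for everything else living in the same subnets (other services, endpoint ENIs, load balancer nodes).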

Spread tasks across at least two private subnets in different AZs, and give each a /24 or larger for any sizeable service. The service network config disables public IPs and references security groups by ID:

{
  "awsvpcConfiguration": {
    "subnets": ["subnet-0aaa1111", "subnet-0bbb2222"],
    "securityGroups": ["sg-0task55555"],
    "assignPublicIp": "DISABLED"
  }
}

assignPublicIp must be DISABLED for tasks in private subnets — they reach AWS services through a NAT gateway or, better, VPC interface endpoints. The egress choices, side by side:

| Egress path | What it covers | Cost shape | When to use | Gotcha |
|---|---|---|---|---|
| NAT gateway | All outbound to internet + AWS | Hourly + per-GB processed | Quick start, mixed egress | Per-GB on every ECR pull adds up |
| Interface endpoint (ecr.api, ecr.dkr, secretsmanager, logs, sts) | Those AWS APIs privately | Hourly per endpoint + per-GB | Keep image pulls on AWS net | One per service used; needs SG |
| Gateway endpoint (S3, DynamoDB) | S3 (ECR layers!), DynamoDB | Free | Always add S3 (ECR uses it) | Route-table entry, not SG |
| PrivateLink to a partner service | A specific SaaS/partner endpoint | Hourly per endpoint + per-GB | Reach a partner privately | One endpoint per service; see PrivateLink |
| Public IP + IGW | Direct internet (no NAT) | IGW free, IP churn | Rarely for prod tasks | Exposes tasks; usually wrong |

The minimum endpoint set to pull an image and read secrets without a NAT gateway is: com.amazonaws.<region>.ecr.api, ecr.dkr, secretsmanager, logs, sts (interface), plus an S3 gateway endpoint (ECR stores layers in S3). The security-group rules — reference SGs by ID, never CIDR:

| Rule | Direction | Source/Dest | Port | Why |
|---|---|---|---|---|
| Task SG: allow ALB | Inbound | ALB’s SG (by ID) | 8080 (containerPort) | Only the ALB reaches the task |
| Task SG: egress to DB | Outbound | DB SG (by ID) | 5432 | Least-privilege egress |
| Task SG: egress to endpoints | Outbound | Endpoint SG (by ID) | 443 | ECR/Secrets/Logs over HTTPS |
| ALB SG: allow clients | Inbound | 0.0.0.0/0 or CDN range | 443 | Public ingress |
| ALB SG: egress to tasks | Outbound | Task SG (by ID) | 8080 | ALB → task |
| Endpoint SG: allow tasks | Inbound | Task SG (by ID) | 443 | Tasks → endpoint |

Terraform for the two essential ECR-related endpoints (S3 gateway is free and mandatory):

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids   # ECR layer pulls go via S3
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}

The Fargate limits and quotas that actually shape a production design (many are soft/adjustable via Service Quotas — confirm against your account):

| Limit / quota | Value | Adjustable? | Why it matters |
|---|---|---|---|
| Network mode on Fargate | awsvpc only | No | One ENI/IP per task; drives subnet sizing |
| Max vCPU per task | 16 vCPU | No | Largest single task; split bigger work |
| Max memory per task | 120 GiB | No | Ceiling for in-memory workloads |
| Ephemeral storage per task | 20 GiB free, up to 200 GiB | Configurable | Scratch space; billed above 20 |
| stopTimeout max | 120 s | No | Caps the drain window |
| Containers per task definition | 10 | No | App + sidecars must fit |
| Tasks per service | 5,000 (default, soft) | Yes (Service Quotas) | Very large services |
| Services per cluster | 5,000 (soft) | Yes | Cluster packing |
| Subnet reserved IPs | 5 per subnet | No | Reduces usable task IPs |
| Spot interruption warning | ~120 s (SIGTERM) | No | Drain budget on reclaim |
| Platform version | 1.4.0 (current) | n/a | Feature/behavior baseline |

Networking details — subnet design, route tables, endpoint policies — are covered end to end in VPC Deep Dive: Subnets, Routing, IGW, NAT & Endpoints, and the inbound/outbound rule model in Security Groups & NACLs Deep Dive.

3. Service Auto Scaling: target tracking vs step scaling

ECS services scale through Application Auto Scaling, registered as a scalable target against the ecs:service:DesiredCount dimension with a min and max. The mechanics are easy; the metric choice is the whole game. For a request-driven service behind an ALB, ALBRequestCountPerTarget is the cleanest signal — it scales on actual load per task, independent of how CPU-bound the work is, and reacts before CPU saturates.

# Register the service as a scalable target (min/max bound on desiredCount)
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/prod-cluster/checkout-api \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 4 --max-capacity 40

The predefined target-tracking metrics, and exactly when each is the right one:

| Predefined metric | Scales on | Use when | Don’t use when | Needs ResourceLabel |
|---|---|---|---|---|
| ALBRequestCountPerTarget | Requests/min per task | Web/API behind an ALB | No ALB; or work ≠ per-request | Yes (ALB + TG names) |
| ECSServiceAverageCPUUtilization | Avg task CPU % | CPU-bound compute | I/O-bound work (lags) | No |
| ECSServiceAverageMemoryUtilization | Avg task memory % | Memory-bound caches | Leaky apps (scales on the leak) | No |

A request-count target-tracking policy. The ResourceLabel joins two ARN suffixes with a slash: the part of the load balancer ARN after loadbalancer/ (app/<name>/<id>) and the final part of the target group ARN including its targetgroup/ prefix (targetgroup/<name>/<id>). Get it wrong and the policy silently does nothing:

{
  "TargetValue": 1000.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ALBRequestCountPerTarget",
    "ResourceLabel": "app/checkout-alb/50dc6c495c0c9188/targetgroup/checkout-tg/6d0ecf831eec9f09"
  },
  "ScaleInCooldown": 300,
  "ScaleOutCooldown": 60
}

Save it as reqcount-tt.json, then apply it:

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs --resource-id service/prod-cluster/checkout-api \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name reqcount-tt --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://reqcount-tt.json
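Since a hand-assembled ResourceLabel is easy to get wrong, it can be derived from the two ARNs instead. This is a minimal sketch: resource_label is a hypothetical helper, and the account ID and region in the example ARNs are placeholders chosen to reproduce the label used in this section.

```python
def _arn_resource(arn: str) -> str:
    """Return the resource part of an ARN (everything after the fifth colon)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[5]


def resource_label(load_balancer_arn: str, target_group_arn: str) -> str:
    """Build app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id> from the two ARNs."""
    lb = _arn_resource(load_balancer_arn)
    tg = _arn_resource(target_group_arn)
    if not lb.startswith("loadbalancer/app/"):
        raise ValueError("expected an Application Load Balancer ARN")
    if not tg.startswith("targetgroup/"):
        raise ValueError("expected a target group ARN")
    # Drop the leading 'loadbalancer/' but keep the 'targetgroup/' prefix.
    return lb[len("loadbalancer/"):] + "/" + tg


label = resource_label(
    "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/checkout-alb/50dc6c495c0c9188",
    "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/checkout-tg/6d0ecf831eec9f09",
)
print(label)
```

Generating the label in the same script that writes reqcount-tt.json keeps the policy in step with the load balancer if either resource is ever recreated.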

The target-tracking knobs and how to reason about each:

| Knob | What it does | Default | When to change | Trade-off |
|---|---|---|---|---|
| TargetValue | The metric value to hold | | Set to ~70% of a task’s safe max | Too low = over-provision; too high = late |
| ScaleOutCooldown | Wait after scaling out (s) | 300 | Lower (60) to react faster | Too low risks over-shoot |
| ScaleInCooldown | Wait after scaling in (s) | 300 | Raise to avoid flapping | Too low = thrash on noisy load |
| DisableScaleIn | Only scale out, never in | false | True for cost-blind reliability | Pay for peak forever |
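The 70% guidance for TargetValue translates into a short calculation. The sketch below is a rough model, not the service's actual algorithm: it assumes load spreads evenly across tasks, target_value and tasks_needed are hypothetical helpers, and the 1,500 requests/min "safe max" is an invented load-test result.

```python
import math


def target_value(safe_max_rpm_per_task: float, utilization: float = 0.7) -> float:
    """TargetValue as a fraction of what one task can safely serve per minute."""
    return safe_max_rpm_per_task * utilization


def tasks_needed(total_rpm: float, target: float, min_capacity: int, max_capacity: int) -> int:
    """Tasks required to hold per-task load at the target, clamped to the scalable target."""
    wanted = math.ceil(total_rpm / target)
    return max(min_capacity, min(max_capacity, wanted))


# If a load test showed one task is safe up to 1,500 requests/min (assumed figure):
print(target_value(1500))
# With the policy's TargetValue of 1000 and the 4..40 bounds registered earlier:
print(tasks_needed(12_500, 1000.0, 4, 40))
print(tasks_needed(100, 1000.0, 4, 40))       # quiet period: held at the minimum
print(tasks_needed(1_000_000, 1000.0, 4, 40))  # overload: capped at the maximum
```

If the last case is plausible for your traffic, the max capacity, not the policy, is what needs revisiting.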

Reach for step scaling when you need asymmetric or aggressive reactions — for example, add capacity hard when a queue-depth alarm crosses a threshold (a worker draining SQS, not a web service):

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs --resource-id service/prod-cluster/worker \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name queue-step --policy-type StepScaling \
  --step-scaling-policy-configuration '{
    "AdjustmentType": "ChangeInCapacity",
    "MetricAggregationType": "Maximum",
    "StepAdjustments": [
      { "MetricIntervalLowerBound": 0,   "MetricIntervalUpperBound": 1000, "ScalingAdjustment": 2 },
      { "MetricIntervalLowerBound": 1000,                                   "ScalingAdjustment": 5 }
    ]
  }'
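Step intervals are measured from the alarm threshold, not from zero, which is a frequent source of confusion. The sketch below models how the two steps in this policy would be selected; step_adjustment is a hypothetical helper, the alarm threshold of 500 messages is an assumption (the policy itself does not define it), and the lower-bound-inclusive, upper-bound-exclusive rule is how I understand positive-breach intervals to be evaluated.

```python
# (lower_bound, upper_bound, scaling_adjustment); None means unbounded.
QUEUE_STEPS = [(0, 1000, 2), (1000, None, 5)]


def step_adjustment(metric_value: float, alarm_threshold: float, steps) -> int:
    """Pick the ScalingAdjustment whose interval contains the breach size."""
    breach = metric_value - alarm_threshold
    for lower, upper, adjustment in steps:
        above_lower = lower is None or breach >= lower
        below_upper = upper is None or breach < upper
        if above_lower and below_upper:
            return adjustment
    return 0  # metric is not in any step interval, so no change


ALARM_THRESHOLD = 500  # assumed queue depth at which the CloudWatch alarm fires

print(step_adjustment(700, ALARM_THRESHOLD, QUEUE_STEPS))    # breach of 200
print(step_adjustment(1500, ALARM_THRESHOLD, QUEUE_STEPS))   # breach of exactly 1000
print(step_adjustment(400, ALARM_THRESHOLD, QUEUE_STEPS))    # below the threshold
```

With these values, a queue depth of 1,499 still adds only two tasks; move the second step's lower bound down if that reaction is too gentle for your backlog.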

Target tracking vs step scaling, decided:

| Dimension | Target tracking | Step scaling |
|---|---|---|
| Mental model | “Hold this metric near X” | “When the alarm is this far over, add Y” |
| Alarms | AWS manages a pair for you | You define the metric + thresholds |
| Best for | Web/API steady-state load | Queues, asymmetric bursts |
| Scale-in | Automatic, symmetric | You define separate step(s) |
| Risk | Wrong metric lags | Mis-tuned steps over/under-shoot |
| Combine? | Yes, multiple policies allowed | Yes, layer with target tracking |

The CloudWatch metrics worth alerting on for a Fargate service (leading indicators, not just “service down”), with a starting threshold:

| Alert on | Metric (namespace) | Threshold (starting point) | Why it’s leading |
|---|---|---|---|
| Per-task request load | RequestCountPerTarget (ALB) | near your TargetValue | Predicts scale-out before latency spikes |
| Latency creep | TargetResponseTime (ALB) | p95 > your SLO | Cold start / saturation before users feel it |
| Unhealthy targets | UnHealthyHostCount (ALB) | ≥ 1 for 5 min | Catches eviction before capacity drops |
| CPU saturation | CPUUtilization (ECS) | > 80% for 10 min | Backstop signal for CPU-bound paths |
| Memory pressure | MemoryUtilization (ECS) | > 85% for 10 min | Predicts OOM kills (exit 137) |
| Failed task launches | service events / failedTasks | > 0 during deploy | The circuit breaker’s trigger |
| 5xx from targets | HTTPCode_Target_5XX_Count (ALB) | > 1% of requests | The symptom; alert as confirmation |
| Running vs desired | RunningTaskCount vs DesiredCount | gap > 0 sustained | Deploy stuck or capacity starved |

You can attach multiple policies to one service. A common pattern: request-count target tracking for steady state, plus a CPU target-tracking policy as a safety net so a CPU-heavy code path can’t starve before request count reacts. When policies disagree, Application Auto Scaling takes the largest desired count — so layering scale-out policies is safe; combining aggressive scale-in policies is what gets you into flapping. A decision table for picking the primary signal:

| If your service is… | Scale primarily on… | Backstop with… |
|---|---|---|
| A web API behind an ALB | ALBRequestCountPerTarget | CPU target tracking |
| A gRPC streaming service | CPU utilization | Memory target tracking |
| An SQS/queue worker | Step scaling on queue depth (ApproximateNumberOfMessagesVisible) | CPU as a floor |
| A CPU-bound batch transformer | CPU utilization | |
| A memory-bound cache/aggregator | Memory utilization | CPU as a floor |
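The "largest desired count wins" rule for layered policies can be written down in a few lines. This is a simplified sketch of the behavior described in this section, not the service's implementation; resolve_desired_count is a hypothetical helper and the policy names and recommendations are made-up inputs.

```python
def resolve_desired_count(recommendations: dict, min_capacity: int, max_capacity: int) -> int:
    """Take the largest count any policy asks for, then clamp to the scalable target."""
    if not recommendations:
        raise ValueError("at least one policy recommendation is required")
    largest = max(recommendations.values())
    return max(min_capacity, min(max_capacity, largest))


# Request count says 9 tasks are enough, but a CPU-heavy path wants 14.
print(resolve_desired_count({"reqcount-tt": 9, "cpu-tt": 14}, 4, 40))
# Both policies agree the service could shrink below the floor; the minimum holds.
print(resolve_desired_count({"reqcount-tt": 2, "cpu-tt": 3}, 4, 40))
```

The same rule explains why a backstop policy never hurts scale-out: it can only raise the answer, never lower it.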

4. Deployments: rolling updates and the circuit breaker

ECS rolling deployments are governed by two knobs on the service. minimumHealthyPercent is the floor of healthy tasks ECS keeps during a deploy; maximumPercent is the ceiling it may temporarily exceed desired count to bring up replacements. For a zero-downtime rolling deploy on an even-sized service, 100/200 is the safe default: never drop below desired count, allow a full extra set while rolling.
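To see what a given pair of percentages means in whole tasks, the bounds can be computed directly. The sketch below assumes the floor is rounded up and the ceiling rounded down, which is my understanding of how ECS converts the percentages; rolling_bounds is a hypothetical helper.

```python
import math


def rolling_bounds(desired_count: int, minimum_healthy_percent: int = 100,
                   maximum_percent: int = 200):
    """Return (floor_tasks, ceiling_tasks, batch) for a rolling deploy.

    batch is how many tasks ECS has room to replace at once; zero means the
    deploy cannot make progress with these settings.
    """
    floor_tasks = math.ceil(desired_count * minimum_healthy_percent / 100)
    ceiling_tasks = math.floor(desired_count * maximum_percent / 100)
    return floor_tasks, ceiling_tasks, max(0, ceiling_tasks - floor_tasks)


print(rolling_bounds(4))             # 100/200 on the 4-task service in this article
print(rolling_bounds(5, 50, 150))    # odd counts show the rounding
print(rolling_bounds(4, 100, 100))   # no headroom in either direction
```

The last case is the configuration to avoid: with no room above desired count and no permission to dip below it, there is nowhere to start a replacement task.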

The deploy-config matrix — every knob, its default, and the trade-off:

| Setting | What it controls | Default | Safe prod value | Trade-off / gotcha |
|---|---|---|---|---|
| minimumHealthyPercent | Floor of healthy tasks during deploy | 100 | 100 (50 if cost-sensitive + tolerant) | <100 risks a capacity dip mid-deploy |
| maximumPercent | Ceiling above desired during deploy | 200 | 200 | Higher = faster but more IPs/cost |
| deploymentCircuitBreaker.enable | Auto-detect a failing deploy | false | true | Off = bad image loops forever |
| deploymentCircuitBreaker.rollback | Revert to last good on failure | false | true | Without it, breaker only stops |
| healthCheckGracePeriodSeconds | Ignore ALB health fails after start | 0 | ~60 (≥ cold-start) | Too low kills slow-booting tasks |
| deploymentController.type | ECS, CODE_DEPLOY, EXTERNAL | ECS | ECS (rolling) | Blue/green needs CodeDeploy/native |
| minimumHealthyPercent (during scale) | Same floor applies to scale-in | 100 | 100 | Affects how fast scale-in drains |

The piece people skip is the deployment circuit breaker. Without it, a bad image that never passes health checks leaves the service replacing failing tasks indefinitely — draining your IP pool and paging you. With it, ECS watches for a run of failed task launches and, if rollback is on, automatically reverts to the last known-good task definition.

aws ecs update-service \
  --cluster prod-cluster --service checkout-api \
  --task-definition checkout-api:87 \
  --deployment-configuration '{
    "minimumHealthyPercent": 100,
    "maximumPercent": 200,
    "deploymentCircuitBreaker": { "enable": true, "rollback": true }
  }' \
  --health-check-grace-period-seconds 60

The same service in Terraform:

resource "aws_ecs_service" "checkout" {
  name            = "checkout-api"
  cluster         = aws_ecs_cluster.prod.id
  task_definition = aws_ecs_task_definition.checkout.arn
  desired_count   = 4
  launch_type     = "FARGATE"
  health_check_grace_period_seconds = 60

  # HCL single-line blocks take one argument only, so this must be multi-line
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200

  load_balancer {
    target_group_arn = aws_lb_target_group.checkout.arn
    container_name   = "app"
    container_port   = 8080
  }
  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.task.id]
    assign_public_ip = false
  }
}

--health-check-grace-period-seconds tells ECS to ignore ALB health-check failures for the first N seconds after a task starts, so a slow-booting app isn’t killed before it’s ready. Set it slightly above your real cold-start time. The circuit breaker counts failures relative to desired count (it scales the threshold with service size, with a floor), so it behaves sensibly for both a 3-task and a 300-task service.
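
The scaling-with-a-floor behaviour can be sketched as a one-line clamp. The specific numbers here (half the desired count, rounded up, clamped between 3 and 200) are my understanding of the ECS circuit-breaker documentation, not figures stated in this article, so verify them before relying on them; the function name is hypothetical.

```python
def circuit_breaker_threshold(desired_count, minimum=3, maximum=200):
    """Failed-task count at which the breaker is assumed to trip.

    Assumption: threshold = ceil(0.5 * desired_count), clamped to [3, 200].
    """
    half_rounded_up = -(-desired_count // 2)  # integer ceil of desired / 2
    return max(minimum, min(maximum, half_rounded_up))
```

Under these assumptions a 3-task service trips after 3 failed launches and a 300-task service after 150, so small services still get a few retries and large ones do not burn hundreds of launches.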

The deployment-controller and strategy options, decided:

| Strategy | How it works | Rollback | Extra cost | Use when |
|---|---|---|---|---|
| Rolling (ECS) + circuit breaker | Overlap old/new, auto-revert on failure | Automatic (last good) | One extra task set briefly | Default for most services |
| Native blue/green (ECS) | Full parallel green env, shift, cut over | Instant cutover/revert | Full second env during shift | High-stakes, instant rollback |
| CodeDeploy blue/green | CodeDeploy shifts ALB listener (linear/canary) | Instant + traffic-shift hooks | Full second env + CodeDeploy | Canary/linear traffic control |
| External | Your own orchestrator manages task sets | Yours to build | Varies | Custom CD systems |

The rollout states you watch during a deploy, and what each means:

| rolloutState | Meaning | What to do |
|---|---|---|
| IN_PROGRESS | Converging to the new revision | Wait; watch runningCount vs desiredCount |
| COMPLETED | New revision fully healthy | Done; verify targets healthy |
| FAILED | Circuit breaker tripped | Read rolloutStateReason; check task stopped reasons |
| (rolling back) | Reverting to last good revision | Confirm the prior revision is what’s running |

The most common task stopped reasons you’ll read in describe-tasks during a failed rollout, and what each points at:

| Stopped reason (substring) | What it means | Likely fix |
|---|---|---|
| CannotPullContainerError | Image pull failed (bad digest or no route) | Fix digest; add ECR endpoints / NAT |
| ResourceInitializationError: unable to pull secrets | Exec role can’t read a secret | Grant exec role secret ARN + KMS |
| RESOURCE:ENI | No free ENI/IP in the subnet | Larger subnets; lower maximumPercent |
| Task failed ELB health checks | ALB marked the task unhealthy | Fix port/path/matcher; raise grace |
| OutOfMemoryError (exit 137) | Container exceeded its memory | Raise task memory; fix leak |
| Essential container in task exited | An essential container exited non-zero | Read its logs; fix crash/entrypoint |
| Scaling activity initiated by ... | Normal scale-in stop | None; expected |
| Task stopped by deployment (rollback) | Circuit breaker removed a bad task | Confirm prior revision is healthy |

5. Graceful shutdown: SIGTERM, stopTimeout, and deregistration

When ECS stops a task — a deploy, a scale-in, or a Spot interruption — it sends SIGTERM to each container’s entrypoint process (PID 1), waits up to stopTimeout (default 30s on Fargate, max 120s), then sends SIGKILL. Two failure modes hide here, and they are the number-one cause of deploy-time errors.

First: PID 1 must actually receive and handle SIGTERM. If your container starts the app via a shell (sh -c "node server.js"), the shell is PID 1 and may not forward the signal — your app gets SIGKILLed with in-flight requests. Either run the app as PID 1 directly (exec form CMD, or ENTRYPOINT ["node", "server.js"]) or set "initProcessEnabled": true in linuxParameters to get a tini-style init that reaps zombies and forwards signals. The combinations:

| How PID 1 is set up | Receives SIGTERM? | Reaps zombies? | Verdict |
|---|---|---|---|
| CMD ["node","server.js"] (exec form) | Yes | No (but app rarely forks) | Fine for most apps |
| CMD node server.js (shell form) | Often no (shell swallows) | No | Broken: drops requests |
| ENTRYPOINT ["app"], exec | Yes | No | Fine |
| initProcessEnabled: true + any CMD | Yes (init forwards) | Yes | Best for multi-process / forking apps |
| Custom init (tini/dumb-init) in image | Yes | Yes | Fine; init handles it |

Second: drain before you exit. On SIGTERM the app should stop accepting new work, finish in-flight requests, then exit — inside stopTimeout:

const server = app.listen(8080);

process.on('SIGTERM', () => {
  console.log('SIGTERM received, draining');
  server.close(() => {                 // stop accepting, finish in-flight
    console.log('drained, exiting');
    process.exit(0);
  });
  // safety net well under stopTimeout (60s here)
  setTimeout(() => process.exit(1), 50_000).unref();
});

Coordinate three timers so they nest correctly. The rule that must hold: app drain grace ≤ deregistration delay, and app drain grace ≤ stopTimeout. ECS deregisters the task from the target group on stop; the ALB stops sending new connections and waits the deregistration delay for existing ones to finish. If stopTimeout is shorter than the drain, SIGKILL cuts the app mid-drain; if the deregistration delay is shorter than the drain, the ALB cuts connections the app is still serving.

| Timer | Where set | Default (s) | Recommended for a fast service (s) | If too low | If too high |
|---|---|---|---|---|---|
| ALB deregistration delay | Target group deregistration_delay.timeout_seconds | 300 | 30 | ALB cuts in-flight requests | Slow deploys/scale-in |
| App drain grace | Your SIGTERM handler | n/a | ~25–45 | App exits before draining | Risks > stopTimeout |
| stopTimeout | Container def | 30 | 60 | SIGKILL mid-drain | Max 120; slow stops |
| Health-check grace | Service | 0 | 60 | New task killed before ready | Slow to detect real failures |
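
The nesting rule is easy to get wrong when the three values live in three different files, so a small pre-deploy check can help. This is a sketch only: the function name and messages are hypothetical, and it encodes the rule exactly as stated above (drain grace no longer than either the deregistration delay or stopTimeout, and stopTimeout capped at 120 seconds on Fargate).

```python
FARGATE_MAX_STOP_TIMEOUT_S = 120  # stopTimeout ceiling on Fargate, per the text

def check_drain_timers(deregistration_delay_s, drain_grace_s, stop_timeout_s):
    """Return a list of human-readable problems; an empty list means the timers nest."""
    problems = []
    if stop_timeout_s > FARGATE_MAX_STOP_TIMEOUT_S:
        problems.append("stopTimeout exceeds the 120s Fargate maximum")
    if drain_grace_s > stop_timeout_s:
        problems.append("drain grace exceeds stopTimeout: SIGKILL lands mid-drain")
    if drain_grace_s > deregistration_delay_s:
        problems.append("drain grace exceeds deregistration delay: ALB cuts live connections")
    return problems
```

Calling check_drain_timers(30, 25, 60) returns an empty list, while a 90-second drain against the same 30/60 pair reports both nesting violations.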

The target group itself must register tasks by IP, not instance, because each Fargate task has its own ENI:

resource "aws_lb_target_group" "checkout" {
  name                 = "checkout-tg"
  port                 = 8080
  protocol             = "HTTP"
  target_type          = "ip"          # REQUIRED for awsvpc/Fargate tasks
  vpc_id               = var.vpc_id
  deregistration_delay = 30            # drain fast, well inside stopTimeout

  health_check {
    path                = "/healthz"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
    timeout             = 5
    matcher             = "200"
  }
}

6. Secrets, config, and least-privilege roles

Fargate tasks have two roles, and conflating them is the most common IAM mistake on ECS. The split, in one table:

| Role | Assumed by | When | Used for | The wrong instinct |
|---|---|---|---|---|
| Execution role (executionRoleArn) | The ECS agent | Before the container starts | ECR pull, log group writes, resolving secrets | Forgetting it → image-pull/secret failures |
| Task role (taskRoleArn) | Your application code | At runtime | S3, DynamoDB, SQS, etc. via the SDK | Putting secrets-read perms here |

Keep them separate and minimal. Inject secrets via the secrets block so plaintext never lands in the task definition or in describe-tasks output, and keep non-sensitive config in environment:

"secrets": [
  { "name": "DB_PASSWORD", "valueFrom": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/checkout/db-AbCdEf" }
],
"environment": [
  { "name": "LOG_LEVEL", "value": "info" }
]

The execution role needs secretsmanager:GetSecretValue (and kms:Decrypt if the secret uses a customer-managed key) on exactly those secret ARNs — not * — plus the ECR and Logs actions:

| Action | On the execution role for… | Scope to |
|---|---|---|
| ecr:GetAuthorizationToken | Authenticating to ECR | * (token is account-wide) |
| ecr:BatchGetImage, ecr:GetDownloadUrlForLayer | Pulling the image | The specific repo ARN |
| logs:CreateLogStream, logs:PutLogEvents | Writing app logs | The log-group ARN |
| secretsmanager:GetSecretValue | Resolving secrets | The exact secret ARN(s) |
| kms:Decrypt | CMK-encrypted secrets/SSM | The specific key ARN |
| ssm:GetParameters | SSM Parameter Store secrets | The parameter ARN(s) |

{
  "Effect": "Allow",
  "Action": "secretsmanager:GetSecretValue",
  "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/checkout/db-AbCdEf"
}
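
A quick way to catch the "not *" rule in review is to scan a policy document for Allow statements on Resource "*". The sketch below is illustrative only: the function and constant names are hypothetical, and it treats ecr:GetAuthorizationToken as the one action that legitimately needs a wildcard, as the table above says.

```python
ALLOWED_WILDCARD_ACTIONS = {"ecr:GetAuthorizationToken"}

def _as_list(value):
    """IAM allows a string or a list for Action/Resource; normalise to a list."""
    return [value] if isinstance(value, str) else list(value)

def wildcard_grants(policy):
    """Return actions that an Allow statement grants on Resource "*"."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        if "*" not in _as_list(statement.get("Resource", [])):
            continue
        for action in _as_list(statement.get("Action", [])):
            if action not in ALLOWED_WILDCARD_ACTIONS:
                flagged.append(action)
    return flagged
```

It only catches a literal "*" resource; prefix wildcards inside an ARN would need a stricter check.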

The task role carries only the runtime permissions your code uses. If your app writes to one bucket, scope it to that bucket’s ARN and the s3:PutObject action — nothing more. Static environment entries are visible in plaintext via the API, so never put credentials there; that’s what secrets is for. The config-injection options compared:

| Mechanism | Plaintext in API? | Rotates without redeploy? | Cost | Use for |
|---|---|---|---|---|
| environment | Yes | No | Free | Non-secret config (log level, region) |
| secrets → Secrets Manager | No | Yes (new value picked up on task launch) | Per-secret/month + API calls | Passwords, API keys |
| secrets → SSM Parameter Store (SecureString) | No | Yes (on launch) | Free std / paid advanced | Cheaper secrets, config hierarchy |
| App reads at runtime via task role | No | Yes (live) | API calls | Hot-reload of secrets without restart |

Secrets-rotation patterns and Parameter Store vs Secrets Manager are covered in Secrets Manager & Parameter Store Deep Dive; scoping roles tightly is the subject of IAM Least Privilege & Permission Boundaries.

7. Observability: Container Insights, structured logs, tracing

Turn on Container Insights at the cluster level for per-task/service CPU, memory, and network metrics plus curated dashboards. Enable the enhanced observability tier for container-level granularity:

aws ecs update-cluster-settings \
  --cluster prod-cluster \
  --settings name=containerInsights,value=enhanced

The three observability pillars and how to wire each on Fargate:

| Pillar | Tool on Fargate | Wire it via | Cost driver | Gotcha |
|---|---|---|---|---|
| Metrics | Container Insights | Cluster setting containerInsights=enhanced | Per-metric ingestion | enhanced = container-level, costs more |
| Logs | awslogs driver | logConfiguration per container | Per-GB ingest + storage | Use non-blocking + buffer cap |
| Logs (routed) | FireLens (Fluent Bit) | firelensConfiguration sidecar | Sidecar CPU/mem + destinations | Sidecar must be essential for ordering |
| Traces | ADOT collector | Sidecar + task-role X-Ray perms | Per-trace | Instrument app with OTel SDK |

For logs, the awslogs driver is the simplest path; set mode=non-blocking with a bounded max-buffer-size so a slow log backend can’t block your application threads. The log-driver options:

| Option | What it does | Default | Set to | Why |
|---|---|---|---|---|
| mode | blocking or non-blocking | blocking | non-blocking | A slow backend won’t stall the app |
| max-buffer-size | Buffer when non-blocking | 1m | 25m | Headroom for bursts; bounds memory |
| awslogs-stream-prefix | Stream name prefix | (none) | app | Required for readable stream names |
| awslogs-datetime-format | Multiline grouping | (none) | your pattern | Stack traces stay one event |

When you need routing — duplicate to S3 and a SIEM, parse, or sample — use FireLens with a Fluent Bit sidecar:

{
  "name": "log_router",
  "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
  "essential": true,
  "firelensConfiguration": { "type": "fluentbit" },
  "memoryReservation": 50
}

Then the app container’s logConfiguration uses "logDriver": "awsfirelens" with output options. Emit logs as JSON from the app so they’re queryable in CloudWatch Logs Insights:

fields @timestamp, level, msg, latency_ms
| filter level = "error"
| sort @timestamp desc
| limit 50
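
For the query above to work, the app has to emit one JSON object per line with those field names. A minimal sketch in Python, assuming a Python service and the standard logging module; JsonFormatter and build_logger are hypothetical names, and the fields (level, msg, latency_ms) simply mirror the query. CloudWatch supplies @timestamp itself.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as a single-line JSON object."""

    def format(self, record):
        payload = {"level": record.levelname.lower(), "msg": record.getMessage()}
        # latency_ms is optional; callers pass it via extra={"latency_ms": ...}
        latency = getattr(record, "latency_ms", None)
        if latency is not None:
            payload["latency_ms"] = latency
        return json.dumps(payload)

def build_logger(name="app", stream=sys.stdout):
    """Return a logger that writes JSON lines to stdout for the log driver."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage would look like build_logger().error("authorization failed", extra={"latency_ms": 412}), which the filter level = "error" clause then matches.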

For distributed tracing, add the AWS Distro for OpenTelemetry (ADOT) collector as a sidecar and grant the task role AWSXRayDaemonWriteAccess; instrument the app with OTel and export to X-Ray for end-to-end spans. The full tracing setup is in AWS X-Ray: Service Map, Segments & ADOT Tracing; the metrics/logs foundation in CloudWatch & CloudTrail Observability Deep Dive.

8. Cost levers: Fargate Spot, capacity providers, right-sizing

Three levers move the bill, in order of impact.

Capacity providers + Fargate Spot. Fargate Spot runs the same tasks at a steep discount but can reclaim them with a ~2-minute SIGTERM warning. Run a mixed strategy via a capacity-provider strategy: a base of on-demand FARGATE for a guaranteed floor, then FARGATE_SPOT for the elastic, interruption-tolerant remainder:

aws ecs put-cluster-capacity-providers \
  --cluster prod-cluster \
  --capacity-providers FARGATE FARGATE_SPOT \
  --default-capacity-provider-strategy \
    capacityProvider=FARGATE,base=2,weight=1 \
    capacityProvider=FARGATE_SPOT,weight=4

This keeps 2 tasks always on-demand, then splits additional tasks 1:4 on-demand:Spot. The capacity-provider parameters:

| Parameter | What it does | Example | Effect |
|---|---|---|---|
| base | Minimum tasks on this provider first | FARGATE base=2 | 2 tasks always on-demand |
| weight | Relative share of the rest | FARGATE=1, SPOT=4 | Remainder split 20% / 80% |
| (Spot interruption) | ~2-min SIGTERM then reclaim | | Needs graceful shutdown (Section 5) |
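
The base-then-weight arithmetic can be sketched as follows. This is an idealised calculation, not ECS's placement algorithm: the function name is hypothetical, and how ECS rounds when the remainder does not divide evenly is an assumption here (this sketch gives the odd task to Spot).

```python
def split_tasks(desired_count, base_on_demand=2, weight_on_demand=1, weight_spot=4):
    """Approximate how many tasks land on each capacity provider."""
    # The base is satisfied first, entirely on on-demand FARGATE.
    on_demand = min(desired_count, base_on_demand)
    remainder = desired_count - on_demand
    # The rest is shared by weight; rounding direction is an assumption.
    total_weight = weight_on_demand + weight_spot
    extra_on_demand = remainder * weight_on_demand // total_weight
    spot = remainder - extra_on_demand
    return {"FARGATE": on_demand + extra_on_demand, "FARGATE_SPOT": spot}
```

For a 12-task service this gives 4 on-demand and 8 Spot: 2 from the base, then the remaining 10 split 2:8.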

Only do this for stateless services that handle SIGTERM cleanly — Spot reclamation uses the same graceful-stop path, so a service that drains correctly tolerates it. The three cost levers ranked:

| Lever | Typical saving | Effort | Risk / precondition | Covered in |
|---|---|---|---|---|
| Fargate Spot (mixed strategy) | Up to ~70% on the Spot portion | Low | Must tolerate ~2-min reclaim | Section 5 (graceful stop) |
| Graviton (ARM64) | ~20% per vCPU-hr | Low–medium | Image must be arm64/multi-arch | Graviton migration |
| Right-sizing | Varies (often 30–60%) | Medium | Measure first; redeploy task def | Container Insights / Compute Optimizer |

Graviton (ARM64). Already covered in Section 1; it is the cheapest change you can make for compatible images.

Right-sizing. Use Container Insights and Compute Optimizer’s ECS recommendations to find tasks provisioned at 4 vCPU that peak at 0.8. Fargate bills per vCPU-second and GB-second from pull to stop, so an oversized task definition costs you on every running replica, every hour. Resize the task definition, redeploy, re-measure.

Spot interruption handling at scale is the subject of EC2 Spot & Mixed Instances with ASG Interruption Handling; the same draining discipline applies.
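
To see why right-sizing compounds across replicas, here is a back-of-envelope sketch. The function name is hypothetical and the prices are passed in as parameters on purpose: the figures in the usage note are placeholders, not AWS list prices, so substitute the current rates for your region and architecture.

```python
HOURS_PER_MONTH = 730  # common approximation for a month of continuous running

def monthly_cost(vcpu, memory_gb, replicas, price_per_vcpu_hour, price_per_gb_hour):
    """Steady-state monthly cost for a service that runs `replicas` tasks all month."""
    hourly_per_task = vcpu * price_per_vcpu_hour + memory_gb * price_per_gb_hour
    return hourly_per_task * replicas * HOURS_PER_MONTH
```

With placeholder prices of 0.04 per vCPU-hour and 0.004 per GB-hour, six replicas at 4 vCPU / 8 GB cost four times what the same six cost at 1 vCPU / 2 GB, and that ratio holds whatever the real prices are as long as both dimensions shrink together.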

Architecture at a glance

Follow a single request left to right and the whole system falls into place. A client hits the ALB on 443; the ALB terminates TLS and forwards to a healthy target on the container port (8080). Because the tasks run awsvpc, the target group is target_type = ip and the ALB routes straight to a task ENI’s private IP inside two private subnets across two AZs — each task its own ENI, its own security group that only accepts the ALB’s SG, no public IP. The task itself runs the app container plus any sidecars on a valid CPU/memory envelope (here 512/1024, ARM64), with two IAM roles: the execution role pulled the digest-pinned image from ECR (via a VPC endpoint, layers over the free S3 gateway endpoint) and read secrets before start, while the task role is what the app’s SDK uses at runtime. Two control loops sit beside the data path: Application Auto Scaling watches ALBRequestCountPerTarget and moves desiredCount between 4 and 40, and the rollout runs at 100/200 with the deployment circuit breaker armed to roll back a bad revision. Downstream, the task reaches Secrets Manager + KMS and ships logs and traces to CloudWatch / X-Ray.

The five numbered badges mark exactly where a deploy or scale event breaks if a knob is wrong: a deploy-time 502 when the target type isn’t ip or the deregistration delay is still 300s (badge 1); IP/ENI exhaustion when the subnets can’t absorb the deploy surge (badge 2); swallowed SIGTERM dropping in-flight requests when a shell is PID 1 (badge 3); late or flapping scaling from the wrong metric (badge 4); and a bad deploy that never rolls back when the circuit breaker is off (badge 5). Read the diagram once with the legend, and the troubleshooting playbook below maps one-to-one onto these hops.

Figure: Production ECS on Fargate architecture. Client to ALB on 443, ALB with target-type ip and 30s deregistration delay routing to per-task ENIs in two private-subnet AZs with security groups allowing only the ALB, Fargate tasks running an ARM64 app plus sidecar with execution and task roles pulling a digest-pinned image from ECR via VPC endpoint, Application Auto Scaling on ALBRequestCountPerTarget scaling 4 to 40 tasks, a rolling deployment at 100/200 with a deployment circuit breaker set to rollback, and downstream Secrets Manager plus KMS and CloudWatch logs plus X-Ray traces, with five numbered failure badges for deploy-time 502s, ENI/IP exhaustion, swallowed SIGTERM, late scaling, and a bad deploy that never rolls back.

Real-world scenario

Lumio Pay, a fintech platform team, ran a payment-authorization service on Fargate behind an ALB, scaled on CPU target tracking, 6 tasks steady. It worked until a Friday evening release: under a traffic spike, p99 latency tripled and the team saw a steady trickle of 502s on every deploy and every scale-in event — even though CPU never crossed the 70% target. The on-call engineer’s first instinct was to scale up the task size, which did nothing, then to roll back, which also threw 502s on the way down.

Three root causes, none of them the application logic. First, the service used CPU target tracking, but the workload was I/O-bound on a downstream HSM — CPU stayed low while request queues grew, so scaling reacted late, after latency had already spiked. Second, and worse, the app was launched via sh -c "java -jar app.jar": the shell was PID 1, swallowed SIGTERM, and the JVM was SIGKILLed on every task stop, severing in-flight authorizations the instant a task drained. Third, the ALB target group still had the default 300-second deregistration delay, so during deploys the ALB kept routing new connections to tasks ECS had already begun stopping — a second source of cut connections layered on top of the first.

They confirmed each in minutes. The scaling lag showed up as CloudWatch CPU flat at ~40% while the ALB target-response-time and request-count climbed. The PID 1 problem was visible in the task definition ("command": ["sh","-c","java -jar app.jar"]) and in the stopped-task pattern: every deploy logged tasks SIGKILLed, not gracefully exited. The deregistration delay was a one-line aws elbv2 describe-target-group-attributes call.

The fix was three coordinated changes, no new infrastructure. They switched the primary scaling signal to ALBRequestCountPerTarget (keeping a CPU policy as a backstop), changed the container entrypoint to exec the JVM as PID 1 with a real SIGTERM handler that drained the in-flight queue, and aligned the timers: deregistration delay to 30s, stopTimeout to 60s, drain grace to ~45s.

"deploymentConfiguration": {
  "minimumHealthyPercent": 100,
  "maximumPercent": 200,
  "deploymentCircuitBreaker": { "enable": true, "rollback": true }
}

resource "aws_lb_target_group" "auth" {
  name                 = "auth-tg"
  port                 = 8080
  protocol             = "HTTP"
  target_type          = "ip"          # required for awsvpc/Fargate tasks
  vpc_id               = var.vpc_id
  deregistration_delay = 30

  health_check {
    path                = "/healthz"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
    timeout             = 5
    matcher             = "200"
  }
}

Note target_type = "ip" — Fargate tasks register by IP, not instance, because each task is its own ENI. After the change, deploy-time 502s went to zero, and the service scaled out ahead of the latency curve instead of behind it. While they were in there, they also enabled the circuit breaker with rollback (they’d never had one) and tested it by deploying a deliberately-broken revision in staging — it flipped to FAILED and restored the prior revision in under two minutes. The lesson the team took away: on Fargate, “graceful shutdown” is not one setting — it’s PID 1, stopTimeout, and the target-group deregistration delay all agreeing with each other, and “scaling” only works if the metric you chose actually leads your load.

Advantages and disadvantages

The serverless-container model both removes real toil and introduces failure modes that live in the wiring rather than your code. Weigh it honestly:

| Advantages (why Fargate helps you) | Disadvantages (why it bites) |
|---|---|
| No EC2 to size, patch, drain, or reboot; AWS owns the host fleet | You can’t ssh to “the box”; debugging is via ECS Exec, logs, and Insights |
| Per-task ENI gives clean isolation, per-task SGs and Flow Logs | Each task consumes a subnet IP + ENI; deploy surge can exhaust small subnets |
| Pay per vCPU-second/GB-second, scale to zero idle cost | Per-second billing means an oversized task def bleeds on every replica, every hour |
| Circuit breaker auto-rolls-back a bad deploy with no tooling | Off by default; a bad image loops forever until you enable it |
| Fargate Spot cuts the elastic portion ~70% | Only safe if the app drains on SIGTERM; reclaim is a ~2-min warning |
| Application Auto Scaling is a managed control loop | Wrong metric scales late; aggressive scale-in policies flap |
| Graviton/ARM64 is a ~20% saving for one flag | Image must be arm64/multi-arch first |
| Two-role model enforces least privilege by design | Conflating exec vs task role is the most common ECS IAM bug |

Fargate is the right default when you want to ship containers, not operate servers, and your services are stateless and ALB-fronted. It bites hardest on chatty/I/O-bound services scaled on the wrong metric, services lifted from EC2 without revisiting PID 1 and signal handling, large services packed into small subnets, and anyone who deploys with the defaults (no circuit breaker, 300s deregistration delay) and never tunes them. The disadvantages are all manageable — but only if you know they exist, which is the entire point of this article. When the constraints argue for self-managed nodes (GPU, daemons, very high task density, specialized kernels), Choose Your Container Path: ECS vs EKS vs Fargate is the decision to revisit.

Hands-on lab

Stand up a minimal Fargate service, confirm each task gets its own ENI, then deploy a deliberately-broken revision and watch the circuit breaker roll it back. Free-tier-friendly-ish (Fargate has no free tier, but two 256/512 tasks for under an hour cost a few rupees; tear down at the end). Run in any shell with the AWS CLI configured.

Step 1 — Variables and a cluster.

REGION=us-east-1
CLUSTER=lab-cluster
aws ecs create-cluster --cluster-name $CLUSTER --region $REGION \
  --settings name=containerInsights,value=enhanced

Step 2 — A log group and a minimal task definition (good image). Use a public sample that listens on 80:

aws logs create-log-group --log-group-name /ecs/lab-web --region $REGION
cat > lab-web.task.json <<'JSON'
{
  "family": "lab-web",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256", "memory": "512",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskExecutionRole",
  "containerDefinitions": [{
    "name": "web",
    "image": "public.ecr.aws/nginx/nginx:stable",
    "essential": true,
    "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
    "stopTimeout": 30,
    "logConfiguration": { "logDriver": "awslogs", "options": {
      "awslogs-group": "/ecs/lab-web", "awslogs-region": "us-east-1", "awslogs-stream-prefix": "web" } }
  }]
}
JSON
aws ecs register-task-definition --cli-input-json file://lab-web.task.json --region $REGION

Expected: a taskDefinition JSON with "revision": 1 and "status": "ACTIVE".

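Step 3 — Create the service with the circuit breaker armed. Use two subnets and a task SG you already have. The command below sets assignPublicIp=ENABLED, which suits default-VPC public subnets for the lab; in private subnets set it to DISABLED and make sure a NAT gateway or VPC endpoints give the task a route for the image pull: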

aws ecs create-service --cluster $CLUSTER --service-name lab-web \
  --task-definition lab-web:1 --desired-count 2 --launch-type FARGATE \
  --deployment-configuration '{"deploymentCircuitBreaker":{"enable":true,"rollback":true},"minimumHealthyPercent":100,"maximumPercent":200}' \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-AAA,subnet-BBB],securityGroups=[sg-XXX],assignPublicIp=ENABLED}' \
  --region $REGION

Expected: a service with "rolloutState": "IN_PROGRESS" that reaches COMPLETED once 2 tasks are running.

Step 4 — Prove each task has its own ENI + private IP (awsvpc).

# xargs appends the task ARNs at the end, so --tasks must be the last flag
aws ecs list-tasks --cluster $CLUSTER --service-name lab-web --query 'taskArns' --output text --region $REGION \
  | xargs aws ecs describe-tasks --cluster $CLUSTER --region $REGION \
  --query 'tasks[].attachments[].details[?name==`privateIPv4Address`].value' --output text \
  --tasks

Expected: two distinct private IPs — one per task.

Step 5 — Register a deliberately-broken revision and deploy it. A task def pointing at an image that will never become healthy (a non-existent tag):

sed 's#nginx/nginx:stable#nginx/nginx:THIS-TAG-DOES-NOT-EXIST#' lab-web.task.json > lab-web.broken.json
aws ecs register-task-definition --cli-input-json file://lab-web.broken.json --region $REGION
aws ecs update-service --cluster $CLUSTER --service lab-web --task-definition lab-web:2 --region $REGION

Step 6 — Watch the circuit breaker fire and roll back.

aws ecs describe-services --cluster $CLUSTER --services lab-web --region $REGION \
  --query 'services[0].deployments[].{status:status,rollout:rolloutState,reason:rolloutStateReason,desired:desiredCount,running:runningCount,failed:failedTasks}'

Expected: the new deployment moves to rolloutState: FAILED with a rolloutStateReason mentioning the circuit breaker, and the service converges back onto revision 1 (the last known-good) — running stays at 2 throughout.
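
If you save the deployments array from that describe-services call as JSON, a few lines can turn it into a verdict. This is a sketch under assumptions: the function name is hypothetical, and it assumes the field names used in the query above (status, rolloutState) plus a taskDefinition ARN on each deployment, with PRIMARY marking the deployment currently in effect.

```python
def rollout_verdict(deployments, expected_revision):
    """Classify a service's deployments list as seen in describe-services output.

    expected_revision is the family:revision you tried to deploy, e.g. "lab-web:2".
    """
    if any(d.get("rolloutState") == "FAILED" for d in deployments):
        return "failed"
    primary = next((d for d in deployments if d.get("status") == "PRIMARY"), None)
    if primary is None or primary.get("rolloutState") != "COMPLETED":
        return "in_progress"
    # A completed PRIMARY on a different revision suggests the breaker rolled back.
    if primary.get("taskDefinition", "").endswith("/" + expected_revision):
        return "completed"
    return "rolled_back"
```

In the lab you would expect "failed" shortly after Step 5 and "rolled_back" once the service has converged onto revision 1 again.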

Validation checklist. You created a service with the breaker armed, proved per-task ENIs, deployed a broken revision, and watched ECS automatically restore the good one without you touching it. The steps mapped to what each proves:

| Step | What you did | What it proves | Real-world analogue |
|---|---|---|---|
| 3 | Service with deploymentCircuitBreaker | The safety net is on, not assumed | Every prod service should have this |
| 4 | Two distinct private IPs | awsvpc = one ENI/IP per task | IP-planning the deploy surge |
| 5 | Deploy an unhealthy revision | A bad image would loop forever without the breaker | A failed release at 3am |
| 6 | rolloutState: FAILED → rollback | The breaker fires and reverts to last good | The incident that doesn’t page you |

Cleanup (avoid lingering Fargate charges).

aws ecs update-service --cluster $CLUSTER --service lab-web --desired-count 0 --region $REGION
aws ecs delete-service --cluster $CLUSTER --service lab-web --force --region $REGION
aws ecs delete-cluster --cluster $CLUSTER --region $REGION
aws logs delete-log-group --log-group-name /ecs/lab-web --region $REGION

Cost note. Two 256/512 tasks for under an hour is well under ₹40; deleting the service stops the per-second billing immediately. Container Insights enhanced adds a small ingestion cost — fine for a lab, watch it at scale.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table you read mid-incident, then the entries that bite hardest with full confirm-command detail.

| # | Symptom | Root cause | Confirm (exact cmd / console path) | Fix |
|---|---|---|---|---|
| 1 | 502s on every deploy and scale-in; fine at steady state | Target group not draining: target_type wrong or dereg delay 300s | aws elbv2 describe-target-groups (TargetType) and aws elbv2 describe-target-group-attributes (deregistration_delay.timeout_seconds) | target_type=ip; deregistration_delay=30; align under stopTimeout |
| 2 | In-flight requests error exactly when a task stops | PID 1 is a shell, swallows SIGTERM → app SIGKILLed | Inspect task def command/entryPoint; stopped tasks show no graceful exit | exec form CMD or initProcessEnabled:true; drain in handler |
| 3 | p99 climbs before scale-out; thrash on scale-in | Wrong scaling metric (CPU on I/O work) and/or too-short ScaleInCooldown | CloudWatch CPU flat while ALB request count/latency climb | Switch to ALBRequestCountPerTarget; raise ScaleInCooldown |
| 4 | Scaling policy does nothing at all | ResourceLabel malformed (wrong ALB/TG name portion) | aws application-autoscaling describe-scaling-policies → inspect label | Use <ALB full name>/<TG full name> exactly |
| 5 | Tasks stuck PROVISIONING; stopped reason RESOURCE:ENI | Subnet out of free IPs for the deploy surge | Subnet free-IP count vs desired × maximumPercent | /24+ subnets across 2 AZ; lower maximumPercent temporarily |
| 6 | Bad image: failing tasks replaced forever, IP pool drains | Circuit breaker off (or rollback off) | describe-services → rolloutState stuck IN_PROGRESS, rising failedTasks | Enable deploymentCircuitBreaker with rollback:true |
| 7 | Task fails to start: CannotPullContainerError | Bad digest/tag, or no route to ECR (no endpoint/NAT) | describe-tasks → stoppedReason; check subnet route + endpoints | Fix digest; add ECR api/dkr + S3 endpoints or NAT |
| 8 | ResourceInitializationError: unable to pull secrets | Execution role missing GetSecretValue/kms:Decrypt, or no route | stoppedReason; exec-role policy; Secrets Manager endpoint | Grant exec role the secret ARN + KMS key; add endpoint |
| 9 | App can read secrets/buckets it never references | Secrets/runtime perms on the task role (or both *) | Diff taskRoleArn policy vs what code uses | Move secret-read to exec role; scope task role to used ARNs |
| 10 | New task killed seconds after start, never goes healthy | No healthCheckGracePeriodSeconds; ALB fails it during cold start | describe-services events show health-check failures right after start | Set grace ≥ cold-start; speed up boot |
| 11 | Task OOM-killed; container exits 137 | memory (hard cap) too low or a leak | stoppedReason “OutOfMemory”; Container Insights memory ~100% | Raise task memory to a valid combo; fix leak |
| 12 | Fargate Spot tasks vanish under load | Spot reclamation (~2-min SIGTERM), app didn’t drain | Service events: tasks stopped, capacity-provider FARGATE_SPOT | Handle SIGTERM (Section 5); raise on-demand base |
| 13 | Two tasks in one deploy run different code | Image is a moving tag (:latest), resolved per launch | Task def image is a tag, not @sha256: | Pin an immutable digest in CI |
| 14 | Deploy hangs at IN_PROGRESS, never completes | Tasks never pass ALB health check (wrong port/path/matcher) | describe-target-health → unhealthy; reason | Align health-check port/path/matcher to the container |
| 15 | Task without a public IP can’t reach internet/ECR | Private subnet, assignPublicIp=DISABLED, no NAT/endpoint | Subnet route table has no NAT/IGW; no endpoints | Add NAT gateway or the VPC endpoints; keep IP disabled |

The expanded form for the entries that bite hardest:

1. 502s on every deploy and every scale-in; fine at steady state. Root cause: The ALB target group isn’t draining gracefully. Either target_type isn’t ip (so registration is wrong for awsvpc) or the deregistration delay is the default 300s, longer than ECS’s stop sequence, so the ALB keeps sending new connections to a task ECS is stopping. Confirm: aws elbv2 describe-target-groups --target-group-arns <arn> --query 'TargetGroups[].TargetType' for the target type, and aws elbv2 describe-target-group-attributes --target-group-arn <arn> for deregistration_delay.timeout_seconds (the delay is a target-group attribute and does not appear in describe-target-groups output). Inspect stopped tasks for SIGKILL vs graceful exit. Fix: target_type=ip, deregistration_delay=30, and make sure that delay sits under stopTimeout (e.g. 60) so both the ALB and ECS finish draining together.

2. In-flight requests error exactly when a task stops. Root cause: PID 1 is a shell (sh -c "...") that swallows SIGTERM, so the app never gets the signal and is SIGKILLed after stopTimeout with requests still in flight. Confirm: Inspect the task definition’s command/entryPoint for a shell wrapper; stopped tasks show no graceful-exit log line, just an abrupt stop. Fix: Run the app as PID 1 via the exec form (CMD ["node","server.js"]) or set "initProcessEnabled": true in linuxParameters; implement a SIGTERM handler that stops accepting and finishes in-flight inside stopTimeout.

3. p99 climbs before scale-out; service thrashes on scale-in. Root cause: Wrong scaling metric — CPU target tracking on an I/O-bound service, so CPU stays low while queues grow and scaling reacts late; and/or a too-short ScaleInCooldown causing flapping. Confirm: CloudWatch shows CPU flat (e.g. 40%) while the ALB’s RequestCountPerTarget and TargetResponseTime climb. Fix: Make ALBRequestCountPerTarget the primary signal (keep CPU as a backstop), and raise ScaleInCooldown to stop thrash.

5. Tasks stuck PROVISIONING; stopped reason RESOURCE:ENI or IP-not-available. Root cause: The subnet(s) ran out of free IPs/ENIs during the deploy surge: desired × maximumPercent exceeded usable addresses (often a /26 or smaller hosting a 30+ task service at maximumPercent: 200). Confirm: Compare each subnet’s free-IP count to desired_count × (maximumPercent/100); describe-tasks shows stoppedReason with RESOURCE:ENI. Fix: Move tasks to /24-or-larger subnets across ≥2 AZs; as an immediate unblock, lower maximumPercent (e.g. to 150) so the surge is smaller.

6. A bad image leaves failing tasks replaced forever, IP pool draining. Root cause: The deployment circuit breaker is off (or on without rollback), so ECS keeps launching tasks from a revision that never becomes healthy. Confirm: aws ecs describe-services --query 'services[0].deployments[].{rollout:rolloutState,failed:failedTasks}' shows IN_PROGRESS with failedTasks climbing. Fix: update-service --deployment-configuration '{"deploymentCircuitBreaker":{"enable":true,"rollback":true}}'; redeploy and test it once in non-prod so you’ve actually seen it fire.

8. ResourceInitializationError: unable to pull secrets or registry auth. Root cause: The execution role lacks secretsmanager:GetSecretValue (or kms:Decrypt for a CMK), or the task has no network route to the Secrets Manager / ECR endpoints. Confirm: describe-tasks → stoppedReason; check the exec-role policy and whether a secretsmanager interface endpoint (or NAT) exists. Fix: Grant the exec role the exact secret ARN and KMS key; add the secretsmanager (and ECR) VPC endpoints or a NAT route.

9. The app can read secrets or buckets it never references. Root cause: Secret-read or broad runtime permissions were attached to the task role (the one your code assumes), or both roles use *. Your application now holds privileges it should never have. Confirm: Diff the taskRoleArn policy against what the code actually calls; look for secretsmanager:* or s3:* on the task role. Fix: Move secrets-reading to the execution role; scope the task role to only the specific actions and ARNs the code uses.
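Several of the fixes above reduce to checks you can run before a deploy: item 1 (deregistration delay under stopTimeout), item 5 (surge tasks versus usable IPs) and item 6 (circuit breaker with rollback). The sketch below is one way to encode them, not an AWS tool. ServicePlan, preflight and the field names are hypothetical, and summing usable IPs across subnets assumes tasks spread evenly, which is an optimistic simplification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServicePlan:
    """Hypothetical container for the settings items 1, 5 and 6 depend on."""
    desired_count: int
    maximum_percent: int                # deploymentConfiguration.maximumPercent
    deregistration_delay_seconds: int   # ALB target group attribute
    stop_timeout_seconds: int           # container-level stopTimeout
    circuit_breaker_enabled: bool
    circuit_breaker_rollback: bool
    usable_ips_per_subnet: List[int]    # addresses this service may consume, per subnet


def preflight(plan: ServicePlan) -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []

    # Item 1: the ALB should finish draining before ECS escalates to SIGKILL.
    if plan.deregistration_delay_seconds > plan.stop_timeout_seconds:
        problems.append(
            f"deregistration delay {plan.deregistration_delay_seconds}s exceeds "
            f"stopTimeout {plan.stop_timeout_seconds}s"
        )

    # Item 5: a rolling deploy can run up to desired x maximumPercent/100 tasks
    # (rounded down here), and every task needs its own IP.
    peak_tasks = plan.desired_count * plan.maximum_percent // 100
    usable_ips = sum(plan.usable_ips_per_subnet)
    if peak_tasks > usable_ips:
        problems.append(
            f"deploy surge needs {peak_tasks} IPs but only {usable_ips} are usable"
        )

    # Item 6: without enable + rollback, a bad revision keeps being relaunched.
    if not (plan.circuit_breaker_enabled and plan.circuit_breaker_rollback):
        problems.append("deployment circuit breaker is not enabled with rollback")

    return problems


# Example: 30 tasks at maximumPercent 200 in a single /26 (64 - 5 reserved = 59 usable),
# default 300s deregistration delay, default 30s stopTimeout, breaker off.
risky = ServicePlan(30, 200, 300, 30, False, False, [59])
for problem in preflight(risky):
    print(problem)
```

The example plan trips all three checks. Swapping in two /24 subnets (251 usable each), a 30s delay under a 60s stopTimeout, and the breaker enabled with rollback returns an empty list.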

Best practices

Security notes

The security controls that also prevent these incidents — secure and resilient pull the same direction:

Control | Mechanism | Secures against | Also prevents
Two-role split | executionRoleArn vs taskRoleArn | App holding excess privilege | Secret-pull failures (right role scoped)
secrets block + CMK | Secrets Manager / SSM + KMS | Plaintext creds in task def | Rotation breaking the app (picked up on launch)
Private subnets + SG-by-ID | awsvpc + SG references | Direct internet exposure | Connection-refused from CIDR drift
VPC endpoints | Interface/gateway endpoints | Egress over public internet | NAT per-GB cost on every pull
Digest pinning + ECR scan | @sha256: + image scanning | Tampered/unknown images | Two-tasks-differ at deploy
Non-root + read-only root FS | user, readonlyRootFilesystem | Container escape blast radius | Accidental writes corrupting state
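The two-role split is the control that is easiest to check mechanically: read the task role’s policy and flag anything broader than the code needs. The sketch below is a hypothetical lint, not an AWS feature. find_broad_statements and the SUSPECT_ACTIONS list are names invented for illustration, the bucket name is a placeholder, and the only assumption about input is the standard IAM policy JSON shape (Statement, Effect, Action, Resource).

```python
# Hypothetical lint: flag over-broad Allow statements in a task-role policy.

# Action patterns worth questioning on a task role (illustrative, not exhaustive).
SUSPECT_ACTIONS = {"*", "secretsmanager:*", "s3:*"}


def _as_list(value):
    """IAM accepts a single string or a list for Action/Resource; normalise to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def find_broad_statements(policy):
    """Return (statement_index, reason) pairs for Allow statements that look too broad."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for action in _as_list(statement.get("Action")):
            if action.lower() in SUSPECT_ACTIONS:  # IAM action names are case-insensitive
                findings.append((index, f"wildcard action {action}"))
        if "*" in _as_list(statement.get("Resource")):
            findings.append((index, "wildcard resource"))
    return findings


# Example task-role policy: one scoped statement, one over-broad statement.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/uploads/*",
        },
        {"Effect": "Allow", "Action": "secretsmanager:*", "Resource": "*"},
    ],
}

for index, reason in find_broad_statements(example_policy):
    print(f"statement {index}: {reason}")
```

Only the second statement is reported, once for the wildcard action and once for the wildcard resource. A lint like this catches the obvious cases; it does not replace diffing the policy against the calls the code actually makes.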

Cost & sizing

The bill drivers and how they interact with the fixes:

A rough monthly picture for a small production API (steady ~6 tasks, bursting to ~12), us-east-1, indicative — confirm against the live pricing page:

Cost driver | What you pay for | Rough INR / month | What it buys | Watch-out
6× 0.5 vCPU / 1 GB on-demand | Steady Fargate compute | ~₹9,000–12,000 | Always-on floor | Per-second; right-size first
Burst 6× more on Spot | Elastic peak portion | ~₹1,500–3,000 | ~70% off the burst | Must drain on SIGTERM
ARM64 vs X86_64 | Same size, cheaper arch | −~20% of compute | The free saving | Image must be arm64
NAT gateway | Hourly + per-GB egress | ~₹3,000–5,000 | Internet/AWS egress | Per-GB on every pull
VPC endpoints (3 interface + S3) | Hourly per endpoint + per-GB | ~₹2,000–3,500 | Private pulls/secrets/logs | Cheaper than NAT at volume
Container Insights + logs | Per-metric + per-GB ingest | ~₹1,500–4,000 | Diagnosis itself | Sample high-traffic
ALB | Hourly + LCU | ~₹2,000–3,000 | Ingress + health checks | LCUs scale with traffic

What exactly Fargate meters, so you know which knob each line item responds to:

Billed dimension | Metered as | From → to | Lever that reduces it
vCPU | per vCPU-second | image pull start → task stop | Right-size cpu; ARM64; Spot; scale-in faster
Memory | per GB-second | image pull start → task stop | Right-size memory; fewer over-provisioned tasks
Ephemeral storage | per GB-month above 20 GiB | provisioned duration | Keep within the 20 GiB free tier
Architecture | ~20% lower rate on ARM64 | n/a | Build arm64/multi-arch images
Capacity provider | Spot rate on FARGATE_SPOT | n/a | Mix on-demand base + Spot weight
Data egress | per-GB (NAT/internet) | per byte | VPC endpoints; same-region pulls

Right-sizing workflow: read Container Insights / Compute Optimizer’s ECS recommendations, find tasks over-provisioned versus their peak, resize the task definition to the next valid combo down, redeploy, and re-measure after a full traffic cycle. Lumio’s post-incident bill dropped once they right-sized back down after fixing the scaling metric — the fix is usually configuration, not a bigger task.
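The "next valid combo down" step can be sketched as a lookup over the Fargate CPU/memory matrix. The table in the code reflects my understanding of the current Linux matrix, so check it against the sizing table earlier in this guide or the live AWS documentation before relying on it. smallest_valid_size and the 20% headroom default are assumptions for illustration, not AWS recommendations.

```python
# Valid Fargate CPU (units) -> memory (MiB) options, as I understand the current matrix.
VALID_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def smallest_valid_size(peak_cpu_units, peak_memory_mib, headroom=1.2):
    """Return the smallest (cpu, memory) pair covering observed peaks plus headroom.

    headroom=1.2 is an assumed 20% safety margin; returns None if nothing fits.
    """
    need_cpu = peak_cpu_units * headroom
    need_memory = peak_memory_mib * headroom
    for cpu in sorted(VALID_SIZES):
        if cpu < need_cpu:
            continue  # this CPU tier is too small
        for memory in VALID_SIZES[cpu]:
            if memory >= need_memory:
                return cpu, memory
        # No memory option at this tier is large enough; try the next CPU tier.
    return None


# A task peaking at 180 CPU units and 700 MiB fits the smallest CPU tier.
print(smallest_valid_size(180, 700))    # (256, 1024)
# Low CPU but 3000 MiB peak: memory forces the next CPU tier up.
print(smallest_valid_size(200, 3000))   # (512, 4096)
```

The second case shows why memory-heavy tasks often cannot drop a CPU tier: the smallest tier tops out at 2048 MiB, so the memory requirement decides the size.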

Interview & exam questions

1. Why must a Fargate ALB target group use target_type = ip? Because every Fargate task runs awsvpc networking and has its own ENI and private IP; there’s no shared EC2 instance to register. The instance target type registers EC2 instance IDs, which don’t exist on Fargate; ip registers each task’s private IP directly. This maps to the SAA/DVA container objectives.

2. A service throws 502s on every deploy and scale-in but is fine at steady state. What’s the cause? The ALB target group isn’t draining gracefully — typically the default 300-second deregistration delay keeps routing new connections to tasks ECS is stopping, and/or PID 1 swallows SIGTERM so the app is SIGKILLed mid-request. Fix: deregistration_delay≈30 aligned under stopTimeout, and a real SIGTERM handler with the app as PID 1.

3. What does the deployment circuit breaker do, and what happens without it? It watches for a run of failed task launches during a deploy and, with rollback: true, automatically reverts to the last known-good task definition. Its failure threshold scales with service size. Without it, a bad image that never passes health checks leaves ECS replacing failing tasks indefinitely, draining the subnet IP pool and paging on-call.

4. Difference between the execution role and the task role? The execution role is assumed by the ECS agent before the container starts — to pull the image from ECR, write to the log group, and resolve secrets. The task role is assumed by your application code at runtime to call AWS APIs (S3, DynamoDB). Secrets-reading belongs to the execution role; the task role carries only runtime permissions. Conflating them is the classic ECS IAM mistake.

5. Which scaling metric should a web API behind an ALB use, and why not CPU? ALBRequestCountPerTarget — it scales on actual per-task load and reacts before CPU saturates. CPU target tracking lags for I/O-bound services because CPU stays low while request queues grow, so scaling reacts after latency has already spiked. Keep a CPU policy as a backstop, not the primary.

6. How do you plan subnet sizing for a Fargate deploy? Each task consumes one subnet IP via its ENI, and during a rolling deploy you run up to desired × (maximumPercent/100) tasks. For a 40-task service at maximumPercent: 200, plan for ~80 IPs during the deploy, plus AWS’s 5 reserved addresses per subnet and anything else in those subnets — so a /24 or larger across ≥2 AZs.

7. A task is stuck in PROVISIONING with stopped reason RESOURCE:ENI. What happened? The subnet ran out of free IPs/ENIs during the deploy surge — the task can’t get an ENI. Confirm by comparing free IPs to desired × maximumPercent. Fix by using larger subnets (/24+) across more AZs, or temporarily lowering maximumPercent to shrink the surge.

8. Why pin an image digest instead of a tag? ECS resolves the image reference at each task launch. A moving tag like :latest means two tasks in the same deployment can pull different code, producing nondeterministic behavior that’s brutal to debug. A @sha256: digest is immutable, so every task in a deployment runs identical bits.

9. How does graceful shutdown work on Fargate, and what are the three timers? On stop, ECS sends SIGTERM to PID 1, waits stopTimeout (default 30s, max 120s), then sends SIGKILL. Meanwhile the ALB deregisters the task and waits its deregistration delay for in-flight connections. The three timers must nest: deregistration delay ≤ app drain grace ≤ stopTimeout, so the ALB finishes draining before the app exits and the app exits before ECS kills it. PID 1 must actually receive SIGTERM (exec form or initProcessEnabled).

10. When would you choose blue/green over rolling deployments? Rolling with a circuit breaker is the right default for most services. Reach for blue/green (native ECS or CodeDeploy) when you need a full parallel environment with instant cutover and rollback, or canary/linear traffic shifting — high-stakes changes where you want to validate the green environment before sending it real traffic. The cost is running a full second environment during the shift.

11. How does Fargate Spot save money, and what’s the precondition? It runs the same tasks at up to ~70% off but can reclaim them with a ~2-minute SIGTERM warning. The precondition is that the service is stateless and drains cleanly on SIGTERM — Spot reclamation uses the same graceful-stop path. Use a capacity-provider strategy with an on-demand base for the guaranteed floor and Spot for the elastic remainder.

12. Your scaling policy seems to do nothing. What’s the most likely silent cause? A malformed ResourceLabel on an ALBRequestCountPerTarget policy. It must be <ALB full name>/<target group full name>, that is app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id>: the part of the ALB ARN after loadbalancer/, then the target group ARN from targetgroup/ onward (the targetgroup/ prefix stays in the label). Get it wrong and Application Auto Scaling can’t read the metric, so the policy silently never acts.
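Because the ResourceLabel in question 12 is easy to mistype, it helps to derive it from the two ARNs rather than write it by hand. The label format documented for Application Auto Scaling is app/<load-balancer-name>/<load-balancer-id>/targetgroup/<target-group-name>/<target-group-id>. The sketch below builds that string and the target-tracking configuration around it. build_resource_label and request_count_policy are hypothetical helpers, the ARNs are made-up examples, and the target value and cooldowns are placeholders to tune.

```python
def build_resource_label(alb_arn: str, target_group_arn: str) -> str:
    """Derive the ALBRequestCountPerTarget ResourceLabel from the two ARNs."""
    alb_marker = ":loadbalancer/"
    tg_marker = ":targetgroup/"
    if alb_marker not in alb_arn:
        raise ValueError("not a load balancer ARN")
    if tg_marker not in target_group_arn:
        raise ValueError("not a target group ARN")

    alb_part = alb_arn.split(alb_marker, 1)[1]  # app/<name>/<id>
    if not alb_part.startswith("app/"):
        raise ValueError("ALBRequestCountPerTarget needs an Application Load Balancer")
    # The label keeps the 'targetgroup/' prefix: targetgroup/<name>/<id>
    tg_part = "targetgroup/" + target_group_arn.split(tg_marker, 1)[1]
    return f"{alb_part}/{tg_part}"


def request_count_policy(resource_label, target_value,
                         scale_in_cooldown=300, scale_out_cooldown=60):
    """Build a target-tracking configuration dict (cooldown defaults are assumptions)."""
    return {
        "TargetValue": float(target_value),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            "ResourceLabel": resource_label,
        },
        "ScaleInCooldown": scale_in_cooldown,
        "ScaleOutCooldown": scale_out_cooldown,
    }


# Made-up ARNs for illustration only.
alb = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188"
tg = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/73e2d6bc24d8a067"

label = build_resource_label(alb, tg)
print(label)  # app/my-alb/50dc6c495c0c9188/targetgroup/my-tg/73e2d6bc24d8a067
print(request_count_policy(label, target_value=500))
```

The returned dict has the shape that aws application-autoscaling put-scaling-policy expects for --target-tracking-scaling-policy-configuration, so it can be dumped to JSON and passed through.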

These map to AWS Certified Solutions Architect – Associate (SAA-C03) and Developer – Associate (DVA-C02) for ECS/Fargate, task definitions, IAM roles, and deployments; the networking depth (awsvpc, endpoints, SGs) touches Advanced Networking – Specialty (ANS-C01). A compact cert-mapping for revision:

Question theme | Primary cert | Objective area
Task def, roles, deploys, circuit breaker | DVA-C02 / SAA-C03 | Deploy & operate containerized apps
awsvpc, ENIs, endpoints, SGs | SAA-C03 / ANS-C01 | Design resilient/secure networking
Auto Scaling metric choice | SAA-C03 | Design scalable architectures
Two-role IAM least privilege | SAA-C03 / SCS-C02 | Secure access; least privilege
Spot/Graviton/right-sizing cost | SAA-C03 | Cost-optimized architectures

Quick check

  1. A Fargate service throws 502s on every deploy and scale-in but is healthy at steady state. Name the two most likely causes and the one target-group setting you check first.
  2. Your container starts via sh -c "node server.js". Why might in-flight requests be dropped on every task stop, and what are two fixes?
  3. True or false: scaling out to more tasks fixes a service whose tasks are getting OOM-killed.
  4. A web API behind an ALB is scaling late under load even though CPU stays at 40%. What metric should it scale on instead, and why?
  5. A bad image is deployed and ECS keeps replacing failing tasks until the subnet runs out of IPs. What one feature would have prevented this, and how do you turn it on?

Answers

  1. Cause A: the ALB target group’s deregistration delay is the default 300s, longer than ECS’s stop sequence, so the ALB keeps sending new connections to a stopping task. Cause B: PID 1 swallows SIGTERM so the app is SIGKILLed mid-request. First setting to check: deregistration_delay (set it to ≈30s and align it under stopTimeout). Also confirm target_type = ip.
  2. The shell is PID 1 and may not forward SIGTERM, so the app never gets the signal and is SIGKILLed after stopTimeout with requests in flight. Fixes: run the app as PID 1 via the exec form (CMD ["node","server.js"]), or set "initProcessEnabled": true in linuxParameters to get a signal-forwarding init — plus implement a SIGTERM handler that drains.
  3. False. OOM is against the per-task memory cap; every scaled-out task hits the same ceiling and OOMs. Fix by raising the task memory to a valid CPU/memory combination (scale up) and/or fixing the leak — scaling out doesn’t change the per-task limit.
  4. ALBRequestCountPerTarget. It scales on actual per-task request load and reacts before CPU saturates; CPU target tracking lags for I/O-bound work because CPU stays low while request queues grow. Keep CPU as a backstop policy.
  5. The deployment circuit breaker with rollback: true. Enable it via update-service --deployment-configuration '{"deploymentCircuitBreaker":{"enable":true,"rollback":true}}' (and set a healthCheckGracePeriodSeconds); it auto-detects the failing deploy and reverts to the last known-good revision.
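The "SIGTERM handler that drains" from answer 2 follows the same pattern in any runtime: stop admitting work, then wait for in-flight requests with a deadline shorter than stopTimeout. The guide’s example app is Node; this sketch shows the pattern in Python. DrainingServer and its method names are hypothetical, and the handler is invoked directly so the example runs anywhere. In a real process you would register it with signal.signal(signal.SIGTERM, server.handle_sigterm).

```python
import signal
import threading
import time


class DrainingServer:
    """Toy in-flight tracker; a real server would call these from its request loop."""

    def __init__(self, drain_timeout_seconds: float):
        self.drain_timeout_seconds = drain_timeout_seconds  # keep below stopTimeout
        self.accepting = True
        self._in_flight = 0
        self._idle = threading.Condition()

    def handle_sigterm(self, signum, frame):
        # Keep the handler tiny: flip a flag and let normal code do the draining.
        self.accepting = False

    def begin_request(self) -> bool:
        """Admit a request unless shutdown has started."""
        with self._idle:
            if not self.accepting:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._idle:
            self._in_flight -= 1
            self._idle.notify_all()

    def drain(self) -> bool:
        """Wait for in-flight requests; False means the timeout expired first."""
        deadline = time.monotonic() + self.drain_timeout_seconds
        with self._idle:
            while self._in_flight > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._idle.wait(remaining)
            return True


# Simulated shutdown: one request is in flight when SIGTERM arrives.
server = DrainingServer(drain_timeout_seconds=2.0)
server.begin_request()
server.handle_sigterm(signal.SIGTERM, None)        # what the signal would trigger
threading.Timer(0.1, server.end_request).start()   # the request finishes shortly after
print(server.begin_request())  # False: no new work once shutdown starts
print(server.drain())          # True: in-flight work finished inside the timeout
```

If drain() returns False, the process should still exit rather than wait for SIGKILL. The timeout exists so the app, not ECS, decides when to give up.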

Glossary

Next steps

You can now wire a production Fargate service: correctly sized, isolated per-task, scaled on the right signal, deployed with a tested safety net, and shut down without dropping a request. Build outward:


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
