
AWS ECS vs EKS vs Fargate: Choose Your Container Path

Quick take: ECS is the easy AWS-native path. EKS is Kubernetes when you genuinely need it. Fargate removes nodes from both. The hard decision is not ECS vs EKS — it is whether you actually need Kubernetes, and separately, whether you want to own the servers.

A SaaS company adopted Amazon EKS because it was “the industry standard.” Six months later, three platform engineers spent their weeks managing node groups, the VPC CNI, an ingress controller, the cluster autoscaler and a sprawl of Helm charts — all to run a handful of stateless HTTP services that did nothing Kubernetes-specific. They migrated the web tier to ECS on Fargate and cut platform toil in half. The data platform, which leaned on the Spark Operator and custom controllers, stayed on EKS because Kubernetes was genuinely earning its keep there. That is the whole article in one anecdote: AWS gives you two orchestrators (ECS, EKS) crossed with two launch types (Fargate, EC2), and the cost of choosing wrong is measured in engineer-years, not dollars.

This is the decision guide I wish that team had read first. We treat the choice as two orthogonal axes, not one menu. Axis one — orchestrator — is ECS (AWS’s own scheduler, no control-plane fee, deep IAM/CloudWatch integration) versus EKS (conformant Kubernetes, portable, ecosystem-rich, but you operate add-ons and upgrades and pay $0.10/hr per cluster). Axis two — launch type — is Fargate (serverless: no nodes to patch, scale or right-size, billed per vCPU-second) versus EC2 (you own the instances: cheaper at steady state, Spot/Graviton/GPU available, daemonsets and privileged mode possible). Four corners: ECS+Fargate, ECS+EC2, EKS+Fargate, EKS+EC2 (and EKS+Karpenter, the modern node-provisioning answer). Each corner has a different operating model, a different bill, and a different set of 2 a.m. failure modes.

By the end you will stop choosing by brand recognition. You will know that an awsvpc task needs an ALB target group of target-type ip or it will never pass health checks; that a task stuck in PROVISIONING in a private subnet almost always means missing ECR/S3/logs VPC endpoints; that CannotPullContainerError is an execution-role problem, not a task-role one; that EKS Fargate quietly forbids DaemonSets and hostNetwork; and that the cheapest steady-state path is usually EC2 Spot on Graviton with Karpenter, while the cheapest operationally is Fargate. Because this is a reference you will return to mid-decision and mid-incident, the trade-offs, the limits, the task-definition fields and the failure modes are all laid out as scannable tables — read the prose once, then keep the tables open when the architecture review (or the pager) starts.

What problem this solves

Containers need an orchestrator: something to place them on hosts, restart them when they die, roll out new versions, wire them to load balancers, and scale them with demand. AWS does not give you one answer — it gives you a 2×2, and the marketing pages make all four corners sound equally good. They are not. The wrong corner is expensive in the way that hurts most: not a surprise invoice (though that too), but a permanent tax on every deploy, every patch cycle, every on-call rotation.

What breaks without a deliberate choice: a five-person startup stands up EKS “to be cloud-native,” then discovers that keeping the cluster alive (Kubernetes minor-version upgrades every ~14 months before support ends, VPC CNI / CoreDNS / kube-proxy add-on bumps, IP exhaustion from the VPC CNI’s one-VPC-IP-per-pod model, ingress-controller CVEs, Helm-chart drift) is now a full-time job that produces zero customer value. Conversely, a platform team standardizes on ECS for simplicity, then spends a year reinventing Helm-style templating, operators and CRDs in CloudFormation because they actually did need Kubernetes’ extensibility. Both teams chose on the wrong axis. The orchestrator axis is about extensibility and portability; the launch-type axis is about who owns the servers. Conflating them is the root mistake.

Who hits this: essentially every team that has outgrown a single EC2 box or a Lambda and wants to run long-lived containers. It bites hardest on teams that (a) adopt Kubernetes for resume-driven reasons, (b) run Fargate at high steady-state utilization and overpay versus EC2, (c) deploy into private subnets without the VPC endpoints awsvpc networking requires, or (d) confuse the execution role with the task role and then can’t pull an image or read a secret. The fix is almost never “switch orchestrators in a panic”; it’s “decide the two axes on their actual merits, then implement the networking and IAM correctly.”

To frame the whole field before the deep dive, here is the 2×2 with the one question each corner forces and the single fact that most often makes the decision:

Corner | One-line identity | Question it forces | Deciding fact | Best when
ECS + Fargate | AWS-native, no nodes | “Do I really need k8s? No.” | Lowest total ops; per-vCPU premium | Stateless services, batch, side-projects, small teams
ECS + EC2 | AWS-native, own nodes | “Need GPU/Spot/custom AMI on ECS?” | Cheaper steady-state; you patch AMIs | Cost-sensitive steady load, GPU, daemons on ECS
EKS + Fargate | k8s API, no nodes | “Want k8s but hate node ops?” | k8s API minus DaemonSets/GPU/hostNet | Portable manifests, low-ops k8s, per-pod isolation
EKS + EC2 (Karpenter) | Full k8s, own nodes | “Operators/CRDs + Spot/GPU?” | Max power & cost control; max toil | Spark/ML, service mesh, multi-cloud, big platforms

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already be comfortable with the AWS container fundamentals: a container image lives in a registry (Amazon ECR or another OCI registry); a task (ECS) or Pod (Kubernetes) is one or more containers scheduled together; a service keeps N copies running and registers them with a load balancer. You should know how to run the AWS CLI and read JSON output, what a VPC, subnet, security group and route table are, and that IAM roles grant AWS permissions. Basic Kubernetes literacy (Deployment, Service, namespace) helps for the EKS sections but is not required to follow the decision logic.

This sits in the Compute → Containers track and is the decision upstream of all the hands-on container work. It assumes the compute landscape from AWS Compute: EC2, Lambda, ECS and EKS — Which One to Choose? (that article picks the category; this one picks within containers). It depends on the networking from AWS VPC, Subnets and Security Groups Explained (awsvpc task networking, VPC endpoints and SGs are where most container outages actually live) and on the load-balancer choice from AWS ALB vs NLB vs API Gateway Compared, because the ALB target-type detail below is the single most common ECS wiring bug. Identity grounding comes from AWS Organizations & IAM Foundations.

A quick map of who owns what during a container incident, so you page the right person fast:

Layer | What lives here | Who usually owns it | Failure classes it can cause
Client / DNS / TLS | Name resolution, certs, retries | Frontend / SRE | 5xx only if misrouted; mostly red herrings
ALB / target group | Listener, health check, target-type | Network / platform | 503 (no healthy targets), 504 (slow app)
Orchestrator (ECS/EKS) | Scheduling, desired count, rollout | Platform team | Tasks not placed, stuck rollout, throttling
Launch type (Fargate/EC2) | Capacity, ENI attach, node health | Platform / AWS | PROVISIONING hang, node pressure, IP exhaustion
Image / ECR | Pull auth, image tag, size | App + platform | CannotPullContainerError, slow cold start
Task / Pod (your code) | Process, port bind, memory | App / dev team | Crash loop, OOM (137), wrong port
IAM (exec + task role) | Pull/log/secrets vs app APIs | App + security | AccessDenied, secret resolve fail

Core concepts

Six mental models make every later decision and diagnosis obvious.

The choice is two axes, not one. Orchestrator (ECS vs EKS) decides the API and ecosystem you program against and operate. Launch type (Fargate vs EC2) decides who owns the servers. They are independent: you can run ECS on Fargate or EC2, and EKS on Fargate or EC2 (or both at once). Decide them separately. The orchestrator question is “do I need Kubernetes’ extensibility and portability?” The launch-type question is “do I want to patch, scale and right-size servers, in exchange for lower cost and more control?”

ECS is the AWS-native, no-cluster-fee orchestrator. Amazon ECS (Elastic Container Service) schedules tasks (defined by a task definition — a versioned JSON describing containers, CPU/memory, networking, roles, logging). A service maintains a desired count and integrates natively with ALB/NLB, CloudWatch, IAM, App Mesh and Service Connect. There is no charge for the ECS control plane — you pay only for the compute (Fargate or EC2). ECS concepts map cleanly onto AWS primitives, so there is little to learn beyond AWS itself. The trade: it is AWS-only and less extensible than Kubernetes.

EKS is conformant Kubernetes, with a control-plane fee and add-on operations. Amazon EKS (Elastic Kubernetes Service) runs an upstream-conformant Kubernetes control plane that AWS manages (highly available across AZs) for $0.10 per cluster-hour (~$73/month). You get the entire Kubernetes API: Deployments, CRDs, operators, Helm, the Horizontal/Vertical Pod Autoscaler, network policies, and portability across clouds. The cost is operational: you own the add-on lifecycle (VPC CNI, CoreDNS, kube-proxy), cluster version upgrades (a new minor version roughly every 4 months, each with ~14 months of standard support), the ingress/load-balancer controller, IP planning for the CNI, and the broader Kubernetes blast radius. You get power and portability in exchange for toil.
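The "~$73/month" figure is plain arithmetic on the hourly fee, and it multiplies with every extra cluster. A minimal sketch, assuming a 730-hour average month and the $0.10 per cluster-hour rate quoted above (the function name is illustrative, not an AWS API):

```python
HOURS_PER_MONTH = 730            # 8760 hours per year / 12
EKS_FEE_PER_CLUSTER_HOUR = 0.10  # USD per cluster-hour, as quoted in the text


def eks_control_plane_monthly(clusters: int) -> float:
    """Monthly EKS control-plane fee in USD, before any compute."""
    return clusters * EKS_FEE_PER_CLUSTER_HOUR * HOURS_PER_MONTH


if __name__ == "__main__":
    for clusters in (1, 3, 10):
        fee = eks_control_plane_monthly(clusters)
        print(f"{clusters:>2} cluster(s): ~${fee:,.0f}/month")
```

A team that runs one cluster per environment (dev, staging, prod) pays the fee three times, which is why consolidating clusters is usually the first cost lever on EKS.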

Fargate is serverless containers — no nodes, billed per vCPU-second. AWS Fargate runs your task/pod on AWS-managed capacity. You specify CPU and memory; AWS finds the host, attaches an ENI (awsvpc), pulls the image and runs the container. No EC2 to patch, scale, secure or right-size. You pay per vCPU-second and GB-second while the task runs (per-second, 1-minute minimum). The trade-offs: a per-vCPU premium over EC2 at steady state (~20–50% depending on Region/commitment), no DaemonSets/privileged/GPU, fixed CPU↔memory ratios, slower cold starts than a warm EC2 node, and ephemeral storage capped (20 GiB default, up to 200 GiB configurable).

EC2 launch type means you own the nodes — cheaper and more flexible, but yours to operate. With the EC2 launch type, tasks/pods run on EC2 instances you provision (an ECS capacity provider / Auto Scaling Group, or on EKS a managed node group or Karpenter). You choose instance families (Graviton/arm64 for ~20–40% better price-performance, GPU for ML, memory-optimized for caches), use Spot for up to ~90% savings on interruptible work, bring custom AMIs, run DaemonSets/privileged containers, and bin-pack many tasks per instance. The cost: you patch AMIs, manage scaling and capacity, and carry the security of the host OS.

awsvpc networking gives each task its own ENI — and its own failure modes. On Fargate (always) and increasingly on EC2, ECS/EKS use the awsvpc network mode: each task/pod gets its own Elastic Network Interface with a VPC IP, its own security group, and first-class VPC routing. This is clean (per-task SGs, no port conflicts) but introduces three classics: IP exhaustion (each task burns a subnet IP; the EKS CNI burns several), the ALB target-type ip requirement (the LB targets the task’s ENI IP, not a host), and VPC-endpoint dependence in private subnets (pulling from ECR and writing to CloudWatch need a route to AWS — a NAT Gateway or interface/gateway endpoints).

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the mental model side by side:

Concept | One-line definition | Where it lives | Why it matters to the choice
Orchestrator | ECS or EKS: the scheduler/API | Account / Region | Axis 1: extensibility & portability
Launch type | Fargate or EC2: who owns hosts | Per service/profile | Axis 2: cost & control vs ops
Task definition | Versioned JSON: containers, CPU/mem, roles | ECS | The unit you deploy on ECS
Service | Keeps N tasks/pods running + LB-wired | ECS / EKS | Steady-state app; rollout target
Pod / Deployment | k8s scheduling unit / replica controller | EKS | The unit you deploy on EKS
Execution role | Pull image, write logs, read secrets | ECS task def | Wrong → CannotPullContainerError
Task role / IRSA | The app’s own AWS permissions | Task / Pod | Wrong → app AccessDenied
Capacity provider | Maps a service to Fargate/EC2 capacity | ECS | How EC2/Spot/Fargate mix is set
Managed node group | AWS-managed EC2 ASG for EKS | EKS | Node lifecycle without raw ASGs
Karpenter | Just-in-time node provisioner for EKS | EKS | Modern EC2 scaling; bin-packs Spot
VPC CNI | EKS plugin giving pods VPC IPs | EKS | IP exhaustion; prefix delegation
awsvpc ENI | Per-task/pod network interface + SG | Subnet | IP burn; ALB target-type ip
VPC endpoint | Private route to ECR/S3/logs/STS | VPC | Missing → PROVISIONING/pull fails

Axis 1 — ECS or EKS? Deciding whether you need Kubernetes

This is the consequential decision, and it is not about which is “better” — it is about whether your workload needs Kubernetes’ extensibility and portability enough to pay for operating it. Default to ECS. Reach for EKS only when you can name a concrete Kubernetes capability you depend on.

What ECS gives you (and what it doesn’t)

ECS is the path of least resistance on AWS. Everything is an AWS primitive you already understand; there is no second API to learn, no add-on fleet to keep current, and no control-plane bill.

Capability | ECS | Notes
Control-plane cost | $0 (free) | You pay only Fargate/EC2 compute
Learning curve | Low (AWS concepts only) | Task def ≈ “JSON of containers”
Native ALB/NLB integration | Yes (target group + service) | First-class, no extra controller
IAM per task | Yes (task role) | Clean least-privilege per workload
Service discovery | Cloud Map / Service Connect | DNS + L7 mesh-lite, no sidecar to run
Autoscaling | Service Auto Scaling (target tracking) | On CPU/mem/ALB request count
Secrets | Secrets Manager / SSM injection | Declared in task def
Observability | CloudWatch Logs/Container Insights | Native; OTel via ADOT sidecar
Custom controllers / operators | No | The big gap vs k8s
CRDs / extensible API | No | Can’t extend the API
Portability off AWS | No | AWS-only
Ecosystem (Helm/charts) | No | Use CloudFormation/CDK/Terraform

What EKS gives you (and what it costs)

EKS is Kubernetes — the full API, the ecosystem, the portability. The price is a control-plane fee plus a permanent operational surface.

Capability | EKS | Notes
Control-plane cost | $0.10/hr (~$73/mo) per cluster | Plus compute; consolidate clusters
Learning curve | High (Kubernetes + AWS) | YAML, controllers, RBAC, CNI
API extensibility (CRDs) | Yes | Operators, custom resources
Operators ecosystem | Yes | Spark, Flink, Strimzi, cert-manager…
Helm / chart ecosystem | Yes | Huge reuse for off-the-shelf software
Portability / multi-cloud | Yes (conformant) | Same manifests on GKE/AKS/on-prem
Advanced scheduling | Yes | Affinity, taints/tolerations, topology
Network policies | Yes (CNI/Calico) | Pod-level micro-segmentation
HPA + VPA + KEDA | Yes | Event-driven & vertical autoscaling
Add-on lifecycle (you own) | CNI, CoreDNS, kube-proxy | Version-bump on every cluster upgrade
Cluster upgrades (you own) | ~every 14 months before EOL | In-place; test add-on compat
LB controller (you install) | AWS Load Balancer Controller | Provisions ALB/NLB from Ingress/Service
IP planning (you own) | VPC CNI gives every pod a VPC IP | Prefix delegation / custom networking

The decision table — does this workload need Kubernetes?

Check your workload against each signal in the list. One genuine EKS-leaning yes can justify EKS; if none applies, choose ECS, full stop.

Signal | If YES → lean | Why
You already run Helm charts / operators / CRDs | EKS | Reusing the k8s ecosystem is the point
You need multi-cloud / on-prem portability | EKS | Conformant API runs the same elsewhere
You run Spark/Flink/ML on Kubernetes operators | EKS | Operators are the value (e.g. Spark Operator)
You need advanced scheduling (affinity, topology, gang) | EKS | ECS scheduling is comparatively basic
Your org has deep Kubernetes skills already | EKS | The toil is cheaper when you know k8s
You need a service mesh (Istio/Linkerd) | EKS | Mesh ecosystems are k8s-native
You just need to run stateless containers + ALB | ECS | k8s buys you nothing here
Team is small / no k8s expertise | ECS | Don’t pay the cluster tax for nothing
You want lowest operational surface | ECS | No add-ons, no upgrades, no CNI
Cost of the control plane matters at small scale | ECS | $0 vs $73/mo per cluster
You want resume-driven Kubernetes | ECS | Not a technical reason; resist
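The rule "one genuine yes can justify EKS, otherwise ECS" is simple enough to write down as code, which can help keep an architecture review honest. The sketch below is only an illustration of that rule: the signal names are hypothetical labels for the EKS-leaning rows of the table, not anything AWS defines.

```python
# Hypothetical labels for the EKS-leaning rows of the decision table.
EKS_SIGNALS = {
    "uses_helm_operators_or_crds",
    "needs_multicloud_portability",
    "runs_data_or_ml_on_k8s_operators",
    "needs_advanced_scheduling",
    "has_deep_k8s_skills",
    "needs_service_mesh",
}


def lean_orchestrator(signals: set[str]) -> tuple[str, list[str]]:
    """Return the suggested orchestrator and the signals that justify it.

    Anything that is not a recognised EKS signal (small team, "industry
    standard", resume value) is ignored, so the default stays ECS.
    """
    reasons = sorted(signals & EKS_SIGNALS)
    return ("EKS" if reasons else "ECS", reasons)


if __name__ == "__main__":
    print(lean_orchestrator({"small_team", "industry_standard"}))
    print(lean_orchestrator({"runs_data_or_ml_on_k8s_operators"}))
```

Returning the matched reasons alongside the answer forces whoever proposes EKS to name the concrete capability they depend on.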

Operating-toil comparison (the part the bill doesn’t show)

The control-plane fee is the visible cost. The invisible one is recurring engineering time. This is where most “we should have used ECS” regret originates.

Recurring task | ECS | EKS | Notes
Patch the orchestrator | AWS (none for you) | AWS does control plane; you do add-ons | Add-on bumps every upgrade
Minor-version upgrades | None | Yes, ~yearly before EOL | Test CNI/CoreDNS/app compat
Networking plugin (CNI) | None (native) | You tune (prefix deleg., custom net) | IP exhaustion hits EKS hardest
Load-balancer wiring | Native service↔TG | Install/operate LB Controller | A Deployment you keep current
Ingress | ALB via service | Ingress + controller | More moving parts
RBAC / access | IAM only | IAM + Kubernetes RBAC + aws-auth/Access Entries | Two systems to keep in sync
Secrets | Native injection | CSI driver / External Secrets | Extra components
Disaster of a bad upgrade | Rare | Real risk (add-on/app breakage) | Blue/green clusters mitigate

Axis 2 — Fargate or EC2? Deciding who owns the servers

Independent of the orchestrator, decide whether you want to operate nodes. Fargate trades money for the elimination of node operations; EC2 trades operations for lower cost and more capability. Both work under ECS and EKS.

Fargate — the no-nodes model

Property | Fargate behaviour | Implication
Host management | None (AWS-managed) | No AMI patching, no node scaling
Billing | Per vCPU-second + GB-second | Pay only while the task runs
Sizing | Fixed CPU↔memory combinations | Can’t pick arbitrary ratios
Networking | Always awsvpc (own ENI) | Per-task SG; burns a subnet IP
GPU | Not supported | ML/GPU must use EC2
Privileged / hostNetwork / DaemonSet | Not supported | No node-level agents on Fargate
Ephemeral storage | 20 GiB default (up to 200 GiB) | No persistent local disk
Spot equivalent | Fargate Spot (~70% off, interruptible) | Great for batch/dev
Cold start | Seconds (image pull + ENI attach) | Slower than a warm EC2 node
Per-vCPU cost vs EC2 | ~20–50% premium at steady state | The core trade-off

EC2 launch type — own the nodes

Property | EC2 behaviour | Implication
Host management | Yours (patch, scale, secure) | Operational cost
Billing | Per instance-hour (or Spot/RI/SP) | Cheaper at steady, high utilization
Sizing | Any instance family/size | Graviton, GPU, memory/compute-optimized
Bin-packing | Many tasks per instance | Higher density = lower unit cost
Spot | Up to ~90% off (interruptible) | Big savings on tolerant workloads
Graviton (arm64) | ~20–40% better price/perf | Rebuild image multi-arch
GPU | Supported (g/p families) | Required for ML inference/training
DaemonSets / privileged | Supported | Node agents, log shippers, security tools
Custom AMI / kernel | Supported | Compliance, special drivers
Scaling mechanism | ASG / capacity provider / Karpenter | Karpenter = fast, bin-packing JIT nodes
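To make the "per-vCPU premium" concrete, it helps to price one task shape both ways. The sketch below does that under loudly stated assumptions: the three prices are illustrative us-east-1 Linux/x86 on-demand figures and may be out of date, and the comparison treats one Fargate task as equivalent to one fully used instance of the same size, which ignores bin-packing, Spot, Savings Plans and the engineering time spent on nodes.

```python
# ASSUMED illustrative on-demand prices (us-east-1, Linux/x86).
# Check the current AWS pricing pages before relying on them.
FARGATE_VCPU_HOUR = 0.04048  # USD per vCPU-hour
FARGATE_GB_HOUR = 0.004445   # USD per GB-hour
EC2_M5_LARGE_HOUR = 0.096    # USD per hour for 2 vCPU / 8 GiB


def fargate_hourly(vcpu: float, memory_gb: float) -> float:
    """Hourly cost of one running Fargate task of the given size."""
    return vcpu * FARGATE_VCPU_HOUR + memory_gb * FARGATE_GB_HOUR


def premium_over_ec2(fargate_cost: float, ec2_cost: float) -> float:
    """Fractional premium of Fargate over an equally sized, always-busy instance."""
    return fargate_cost / ec2_cost - 1.0


def break_even_utilization(fargate_cost: float, ec2_cost: float) -> float:
    """Share of the time the task must run before an always-on instance is cheaper."""
    return ec2_cost / fargate_cost


if __name__ == "__main__":
    task = fargate_hourly(vcpu=2, memory_gb=8)
    print(f"Fargate 2 vCPU / 8 GB: ${task:.5f}/h")
    print(f"premium vs m5.large:   {premium_over_ec2(task, EC2_M5_LARGE_HOUR):.0%}")
    print(f"break-even busy share: {break_even_utilization(task, EC2_M5_LARGE_HOUR):.0%}")
```

With these assumed prices the premium comes out at roughly 21%, and the instance only wins if the workload keeps it busy about 82% of the time. Spiky services sit well below that line; steady 24×7 services sit above it.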

Fargate-vs-EC2 decision table

If your workload… | Choose | Why
Is spiky / low-and-variable utilization | Fargate | Pay per second; no idle nodes to fund
Has a small team / wants min ops | Fargate | No node patching or scaling
Runs steady & high utilization 24×7 | EC2 | Bin-pack + RI/SP beats per-task pricing
Needs GPU (ML inference/training) | EC2 | Fargate has no GPU
Needs DaemonSets / node agents / privileged | EC2 | Fargate forbids them
Can tolerate interruptions (batch, CI, dev) | Fargate Spot / EC2 Spot | Up to 70–90% savings
Wants Graviton price-performance | EC2 (arm64) or Fargate arm64 | Both support arm64; EC2 cheaper
Has bursty batch with no infra team | Fargate | Scales to zero between runs
Needs custom AMI / kernel modules | EC2 | Fargate is a sealed runtime
Wants the cheapest possible steady compute | EC2 Spot + Graviton + Karpenter | Lowest unit cost, highest toil
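The launch-type table reduces to two kinds of rows: hard requirements that Fargate cannot meet, and cost preferences. A minimal sketch of that ordering follows; the `Workload` fields are hypothetical names for the table rows, and the precedence (hard requirements first, then steady utilization, then Spot) is one reasonable reading of the table rather than an official rule.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_gpu: bool = False
    needs_node_agents_or_privileged: bool = False
    needs_custom_ami: bool = False
    steady_high_utilization: bool = False
    interruptible: bool = False


def lean_launch_type(w: Workload) -> str:
    """Suggest Fargate or EC2, with a Spot suffix for interruptible work."""
    # Hard requirements: Fargate has no GPU, no DaemonSets/privileged, no custom AMI.
    needs_nodes = w.needs_gpu or w.needs_node_agents_or_privileged or w.needs_custom_ami
    # Cost preference: steady, high utilization favours bin-packed EC2.
    base = "EC2" if (needs_nodes or w.steady_high_utilization) else "Fargate"
    return f"{base} Spot" if w.interruptible else base


if __name__ == "__main__":
    print(lean_launch_type(Workload()))                    # default: spiky stateless service
    print(lean_launch_type(Workload(needs_gpu=True)))      # ML inference
    print(lean_launch_type(Workload(interruptible=True)))  # CI or batch
```

Checking the hard requirements before the cost question matters: no amount of per-second billing helps a workload that needs a GPU.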

The four corners, side by side

Orchestrator | Fargate | EC2
ECS | Lowest ops; AWS-native; no nodes; per-vCPU premium. Default for most services. | AWS-native + Spot/Graviton/GPU/daemons; you patch AMIs. Cost-optimized ECS.
EKS | k8s API, no nodes; no DaemonSet/GPU/hostNet; per-pod isolation; per-pod billing. Low-ops k8s. | Full k8s power: operators, GPU, Spot via Karpenter, daemonsets. Max toil. Spark/ML/mesh.

ECS deep dive — the task definition, every field that bites

On ECS you deploy task definitions. A task definition is immutable and versioned (family:revision); you register a new revision and update the service to it. The fields below are where real incidents originate.

Task-level settings

Field | What it sets | Choices / values | Default | Gotcha
requiresCompatibilities | Launch type compatibility | FARGATE / EC2 | | Fargate forces awsvpc + valid CPU/mem pair
networkMode | Task networking | awsvpc / bridge / host / none | bridge (EC2) | Fargate = awsvpc only; ALB needs target-type ip
cpu (task) | vCPU units (1024 = 1 vCPU) | 256–16384 (Fargate set) | | Fargate: only specific CPU↔mem pairs
memory (task) | MiB | Tied to CPU on Fargate | | EC2: optional but recommended as a cap
executionRoleArn | Pull image, logs, secrets | An IAM role | | Missing → CannotPullContainerError
taskRoleArn | App’s AWS permissions | An IAM role | | The app’s calls (S3/DDB) use THIS, not exec role
ephemeralStorage.sizeInGiB | Scratch disk (Fargate) | 21–200 | 20 | Not persistent; gone on stop
runtimePlatform | OS/arch | LINUX/X86_64, LINUX/ARM64, Windows | x86_64 | arm64 = Graviton savings; rebuild image
pidMode / ipcMode | Shared namespaces | task/host | none | host not allowed on Fargate

Fargate CPU↔memory valid combinations

Fargate does not let you pick arbitrary CPU/memory. Pick a row.

vCPU (cpu) | Memory options (memory)
0.25 (256) | 0.5, 1, 2 GB
0.5 (512) | 1, 2, 3, 4 GB
1 (1024) | 2–8 GB (1 GB steps)
2 (2048) | 4–16 GB (1 GB steps)
4 (4096) | 8–30 GB (1 GB steps)
8 (8192) | 16–60 GB (4 GB steps)
16 (16384) | 32–120 GB (8 GB steps)
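Because an invalid pair is rejected only when the task definition is registered, it can be worth checking sizes earlier, for example in CI. The sketch below encodes the table above as data, in the units the task definition uses (CPU units and MiB); the helper names are illustrative.

```python
def _gb_range(low_gb: int, high_gb: int, step_gb: int = 1) -> set[int]:
    """Memory choices from low to high (inclusive), expressed in MiB."""
    return {gb * 1024 for gb in range(low_gb, high_gb + 1, step_gb)}


# Task-level cpu value -> allowed task-level memory values (MiB), per the table.
FARGATE_MEMORY_MIB = {
    256: {512, 1024, 2048},
    512: _gb_range(1, 4),
    1024: _gb_range(2, 8),
    2048: _gb_range(4, 16),
    4096: _gb_range(8, 30),
    8192: _gb_range(16, 60, 4),
    16384: _gb_range(32, 120, 8),
}


def is_valid_fargate_size(cpu: int, memory_mib: int) -> bool:
    """True if (cpu, memory) is one of the combinations in the table."""
    return memory_mib in FARGATE_MEMORY_MIB.get(cpu, set())


def smallest_fit(cpu_needed: int, memory_needed_mib: int):
    """Smallest valid (cpu, memory) pair covering the request, or None."""
    for cpu in sorted(FARGATE_MEMORY_MIB):
        if cpu < cpu_needed:
            continue
        fits = sorted(m for m in FARGATE_MEMORY_MIB[cpu] if m >= memory_needed_mib)
        if fits:
            return cpu, fits[0]
    return None


if __name__ == "__main__":
    print(is_valid_fargate_size(256, 4096))  # 0.25 vCPU cannot carry 4 GB
    print(smallest_fit(256, 3000))           # rounds up to the next valid row
```

`smallest_fit` shows the practical consequence of fixed ratios: a memory-hungry but CPU-light container is pushed into a larger CPU row, and you pay for the vCPU whether or not you use it.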

Container-level settings (inside containerDefinitions)

Field | What it sets | Notes / gotcha
image | ECR/OCI image URI | Pin a digest/tag, never :latest in prod
portMappings.containerPort | Port the app listens on | Must match the ALB health check + target group port
essential | If false, container dying doesn’t kill task | Sidecars often essential:false
logConfiguration | Log driver | awslogs (CloudWatch) or awsfirelens (FireLens→anywhere)
healthCheck | Container-level health (Docker) | Separate from the ALB health check
secrets | Inject from Secrets Manager/SSM | Needs execution role permission
environment | Plain env vars | Never put secrets here
ulimits / linuxParameters | nofile, capabilities add/drop | Linux capabilities here
dependsOn | Container start ordering | E.g. app waits for a proxy to be HEALTHY
cpu / memoryReservation | Per-container limits | Sum must fit the task-level sizing
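Putting the task-level and container-level fields together, a minimal Fargate task definition looks roughly like the dictionary below. This is a sketch, not the author's file: the account ID, role names, log group and Region are placeholders, and the container name and port are chosen to match the `web` / `8080` values used in the CLI example. Writing the output to `taskdef.json` gives a file of the shape that `aws ecs register-task-definition --cli-input-json` accepts.

```python
import json


def minimal_fargate_task_definition(image: str) -> dict:
    """Build a minimal Fargate task definition for a single web container."""
    return {
        "family": "web",
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",  # the only mode Fargate allows
        "cpu": "256",             # must pair with a valid memory value
        "memory": "512",
        # Placeholder ARNs: execution role pulls the image and writes logs,
        # task role is what the application itself assumes.
        "executionRoleArn": "arn:aws:iam::111122223333:role/ecsTaskExecutionRole",
        "taskRoleArn": "arn:aws:iam::111122223333:role/web-task-role",
        "runtimePlatform": {
            "operatingSystemFamily": "LINUX",
            "cpuArchitecture": "X86_64",
        },
        "containerDefinitions": [
            {
                "name": "web",
                "image": image,
                "essential": True,
                "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
                "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                        "awslogs-group": "/ecs/web",       # placeholder log group
                        "awslogs-region": "ap-south-1",    # placeholder Region
                        "awslogs-stream-prefix": "web",
                    },
                },
            }
        ],
    }


if __name__ == "__main__":
    image = "111122223333.dkr.ecr.ap-south-1.amazonaws.com/web:1.4.2"
    print(json.dumps(minimal_fargate_task_definition(image), indent=2))
```

Note that the two role ARNs sit side by side at the task level, which is exactly why they get confused; keeping them as separately named roles makes the split visible in review.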

Create an ECS Fargate service (CLI)

# 1) Register the task definition (JSON in file)
aws ecs register-task-definition --cli-input-json file://taskdef.json

# 2) Create a target group of TYPE IP (awsvpc requires this!)
aws elbv2 create-target-group --name web-tg \
  --protocol HTTP --port 8080 --vpc-id vpc-0abc \
  --target-type ip --health-check-path /healthz

# 3) Create the service on Fargate, wired to the ALB
aws ecs create-service --cluster prod --service-name web \
  --task-definition web:7 --desired-count 3 --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-1,subnet-2],securityGroups=[sg-web],assignPublicIp=DISABLED}' \
  --load-balancers 'targetGroupArn=arn:...:targetgroup/web-tg/...,containerName=web,containerPort=8080'

The same in Terraform

resource "aws_ecs_service" "web" {
  name            = "web"
  cluster         = aws_ecs_cluster.prod.id
  task_definition = aws_ecs_task_definition.web.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnets
    security_groups  = [aws_security_group.web.id]
    assign_public_ip = false
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.web.arn # target_type = "ip"
    container_name   = "web"
    container_port   = 8080
  }
}

ECS capacity providers — how Fargate/EC2/Spot mix is set

Capacity provider | Backs | Use it for
FARGATE | On-demand Fargate | Baseline reliable capacity
FARGATE_SPOT | Interruptible Fargate (~70% off) | Batch, dev, fault-tolerant tiers
ASG capacity provider | Your EC2 Auto Scaling Group | EC2 launch type; managed scaling
Capacity-provider strategy | Weighted mix (e.g. 1 on-demand : 3 Spot) | Cost/reliability blend with a base count
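A strategy's `base` and `weight` are easier to reason about with numbers. The sketch below models the documented idea (fill each provider's base first, then split the remainder in proportion to the weights) so you can preview a mix before deploying it. The handling of fractional shares here, largest remainder first, is an assumption made for the illustration; ECS's exact placement of the odd task may differ.

```python
def split_tasks(desired: int, strategy: list[dict]) -> dict[str, int]:
    """Estimate how many tasks land on each capacity provider."""
    counts = {item["capacityProvider"]: 0 for item in strategy}
    remaining = desired

    # 1) Satisfy each provider's base count first.
    for item in strategy:
        take = min(item.get("base", 0), remaining)
        counts[item["capacityProvider"]] += take
        remaining -= take
    if remaining == 0:
        return counts

    total_weight = sum(item.get("weight", 0) for item in strategy)
    if total_weight == 0:
        raise ValueError("tasks remain but every weight is 0")

    # 2) Split the rest by weight, rounding down.
    assigned = 0
    remainders = []
    for item in strategy:
        share, rest = divmod(remaining * item.get("weight", 0), total_weight)
        counts[item["capacityProvider"]] += share
        assigned += share
        remainders.append((rest, item["capacityProvider"]))

    # 3) ASSUMPTION: leftover tasks go to the largest fractional remainders.
    leftover = remaining - assigned
    for _, name in sorted(remainders, key=lambda r: r[0], reverse=True)[:leftover]:
        counts[name] += 1
    return counts


if __name__ == "__main__":
    strategy = [
        {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
        {"capacityProvider": "FARGATE_SPOT", "weight": 3},
    ]
    print(split_tasks(10, strategy))
```

With a base of 2 on `FARGATE` and a 1:3 weighting, a desired count of 10 keeps 4 tasks on on-demand capacity and puts 6 on Spot, so a full Spot reclaim still leaves the service running.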

EKS deep dive — clusters, node options, and the add-ons you own

On EKS you deploy standard Kubernetes objects (Deployment, Service, Ingress). The differences from a generic cluster are where the nodes come from, how pods get IPs, and which add-ons you keep current.

EKS compute options

Compute option | What it is | Pros | Cons
Managed node groups | AWS-managed EC2 ASG of workers | Simple lifecycle, AWS-patched AMIs, drain on update | Less flexible than Karpenter; coarse scaling
Self-managed nodes | Your own ASG/AMI | Full control (custom AMI/kernel) | You own everything, including upgrades
Karpenter | JIT node provisioner (controller) | Fast, bin-packs, picks cheapest fit, Spot-native | A controller you operate; newer mental model
EKS on Fargate | Serverless pods via Fargate profiles | No nodes; per-pod isolation | No DaemonSet/GPU/hostNetwork; profile selectors
EKS Auto Mode | AWS-managed compute+addons | Lowest ops EKS; AWS runs nodes/CNI/LB | Newer; less control; premium

The add-ons you must keep current (the toil, enumerated)

Add-on | Job | If you neglect it
VPC CNI (aws-node) | Gives pods VPC IPs | IP exhaustion; pods stuck ContainerCreating
CoreDNS | In-cluster DNS | Service discovery breaks
kube-proxy | Service VIP routing | Service traffic fails
AWS Load Balancer Controller | ALB/NLB from Ingress/Service | No external load balancing
Cluster Autoscaler / Karpenter | Node scaling | Pods Pending, no capacity
EBS/EFS CSI driver | Persistent volumes | PVCs won’t bind
Metrics Server | HPA input | HPA can’t scale
cert-manager / ExternalDNS (optional) | TLS / DNS automation | Manual cert/DNS toil

Fargate profiles (EKS) — and their hard limits

A Fargate profile declares which pods (by namespace + labels) run on Fargate instead of nodes. The limits below decide whether your workload even fits.

Limitation on EKS Fargate | Detail | Consequence
No DaemonSets | Can’t schedule one pod per node | Node-level agents (logging, security) won’t run; use sidecars
No GPU | No accelerator support | ML/GPU pods must use EC2 nodes
No hostNetwork / hostPort | Pod can’t share host net | Some CNIs/agents incompatible
No privileged containers | Sealed runtime | Security/observability tooling that needs it fails
One pod per “node” | Each pod = its own micro-VM | Higher isolation; different cost profile
Sidecar logging | No node agent → use FireLens/sidecar | Wire logs per pod
Profile selectors required | Pods must match a profile to land | Mismatched pods stay Pending
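The "mismatched pods stay Pending" row is the one that surprises people, so it is worth seeing the matching rule spelled out. The sketch below assumes the basic selector semantics: a pod matches a selector when the namespace is equal and the pod carries every label the selector lists. Wildcard namespaces and the case where several profiles match are left out.

```python
def matches_fargate_profile(namespace: str, labels: dict[str, str],
                            selectors: list[dict]) -> bool:
    """True if a pod would be selected by any selector of a Fargate profile."""
    for selector in selectors:
        if selector["namespace"] != namespace:
            continue
        wanted = selector.get("labels", {})
        # Every selector label must be present on the pod with the same value.
        if all(labels.get(key) == value for key, value in wanted.items()):
            return True
    return False


if __name__ == "__main__":
    profile = [{"namespace": "apps", "labels": {"compute": "fargate"}}]
    print(matches_fargate_profile("apps", {"app": "web", "compute": "fargate"}, profile))
    print(matches_fargate_profile("apps", {"app": "web"}, profile))
    print(matches_fargate_profile("default", {"compute": "fargate"}, profile))
```

A selector with no labels matches every pod in its namespace, which is the simplest way to send a whole namespace to Fargate; adding labels turns it into an opt-in per workload.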

IRSA vs Pod Identity — granting AWS permissions to pods

Mechanism | How it works | When to use
IRSA (IAM Roles for Service Accounts) | OIDC trust → annotate a ServiceAccount with a role ARN | Mature, widely supported, fine-grained per-SA
EKS Pod Identity | Pod Identity Agent + association; no per-cluster OIDC trust setup | Newer, simpler at scale; fewer trust-policy edits
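The "OIDC trust" in the IRSA row is a trust policy on the IAM role that names one specific ServiceAccount. The sketch below builds that policy document as a Python dictionary; the account ID and OIDC provider ID are placeholders, and the `apps` / `web-sa` names follow the Deployment example in this article.

```python
import json


def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """Trust policy letting one Kubernetes ServiceAccount assume an IAM role.

    oidc_provider is the cluster's OIDC issuer without the https:// prefix.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Pin the role to exactly one namespace + ServiceAccount.
                        f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = irsa_trust_policy(
        account_id="111122223333",  # placeholder
        oidc_provider="oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLE0123456789",  # placeholder
        namespace="apps",
        service_account="web-sa",
    )
    print(json.dumps(policy, indent=2))
```

The `sub` condition is the part that does the least-privilege work: without it, any ServiceAccount in the cluster could assume the role. Every new cluster has a new OIDC provider ID, which is the per-cluster trust-policy editing that Pod Identity removes.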

Minimal EKS on Fargate, then a Deployment

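# Cluster with Fargate (eksctl). --fargate only creates a default profile
# covering the "default" and "kube-system" namespaces.
eksctl create cluster --name prod --region ap-south-1 --fargate

# Add a Fargate profile so pods in the "apps" namespace also land on Fargate
kubectl create namespace apps
eksctl create fargateprofile --cluster prod --region ap-south-1 \
  --name apps --namespace apps

# Install the AWS Load Balancer Controller (Helm) so Ingress provisions an ALB.
# The controller also needs an IAM role for its service account; when it runs
# on Fargate, pass --set region=... and --set vpcId=... as well.
helm repo add eks https://aws.github.io/eks-charts
helm install aws-lb-controller eks/aws-load-balancer-controller \
  -n kube-system --set clusterName=prod

# deployment.yaml (apply with: kubectl apply -f deployment.yaml)
apiVersion: apps/v1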
kind: Deployment
metadata: { name: web, namespace: apps }
spec:
  replicas: 3
  selector: { matchLabels: { app: web } }
  template:
    metadata: { labels: { app: web } }
    spec:
      serviceAccountName: web-sa   # IRSA-annotated for the app's AWS perms
      containers:
        - name: web
          image: 1234.dkr.ecr.ap-south-1.amazonaws.com/web:1.4.2
          ports: [{ containerPort: 8080 }]
          resources:
            requests: { cpu: "250m", memory: "512Mi" }
            limits:   { cpu: "500m", memory: "1Gi" }

Networking — awsvpc, ALB target types, and the VPC endpoints you forget

This section is where the most outages live. awsvpc networking is clean but unforgiving, and the ALB/endpoint requirements are non-negotiable.

Network modes (ECS)

Mode | Each task gets | ALB target type | Use when
awsvpc | Own ENI + IP + SG | ip | Fargate (forced); EC2 when you want per-task SGs
bridge | Shared host net (Docker bridge) | instance | Legacy EC2; dynamic host ports
host | Host’s network namespace | instance | Max perf, no isolation; EC2 only
none | No external networking | n/a | Batch with no inbound

ALB target-type — the #1 ECS wiring bug

Target type | Registers | Required for | Symptom if wrong
ip | Task/pod ENI IP | awsvpc / Fargate | Targets never register / ALB 503 with bridge-style TG
instance | EC2 instance + host port | bridge/host EC2 | Health checks fail for awsvpc tasks

If your service is Fargate or awsvpc and you created an instance target group, registration fails or the ALB has no healthy targets → clients get 503. Recreate the target group with --target-type ip and point its health check at the container port.
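Since the network mode fully determines the target type, the mismatch can be caught before anything is deployed. The sketch below is a hypothetical pre-flight check built from the two tables above; it inspects plain strings you would read out of your task definition and target group, and it does not call AWS.

```python
# Network mode -> ALB target type it requires (from the tables above).
EXPECTED_TARGET_TYPE = {"awsvpc": "ip", "bridge": "instance", "host": "instance"}


def wiring_problems(launch_type: str, network_mode: str, target_type: str) -> list[str]:
    """Return human-readable wiring errors; an empty list means the pair is consistent."""
    problems = []
    if launch_type == "FARGATE" and network_mode != "awsvpc":
        problems.append("Fargate tasks must use networkMode awsvpc")
    expected = EXPECTED_TARGET_TYPE.get(network_mode)
    if expected is None:
        problems.append(f"networkMode {network_mode!r} cannot sit behind an ALB target group")
    elif target_type != expected:
        problems.append(
            f"networkMode {network_mode} needs target-type {expected}, got {target_type}"
        )
    return problems


if __name__ == "__main__":
    for problem in wiring_problems("FARGATE", "awsvpc", "instance"):
        print("FAIL:", problem)
```

Target type cannot be changed on an existing target group, so catching this in review is cheaper than recreating the group and re-pointing the listener during an incident.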

VPC endpoints private tasks need (or a NAT Gateway)

A task in a private subnet with assignPublicIp=DISABLED must reach AWS APIs to pull the image and ship logs. Either route via a NAT Gateway or add these endpoints (cheaper at scale, and required if you have no NAT):

Endpoint | Type | Why the task needs it
com.amazonaws.<region>.ecr.api | Interface | ECR auth / metadata
com.amazonaws.<region>.ecr.dkr | Interface | Pull image layers
com.amazonaws.<region>.s3 | Gateway | ECR layers live in S3 (must add!)
com.amazonaws.<region>.logs | Interface | CloudWatch Logs (awslogs driver)
com.amazonaws.<region>.secretsmanager | Interface | If injecting secrets
com.amazonaws.<region>.ssm / ssmmessages | Interface | SSM params / ECS Exec
com.amazonaws.<region>.sts | Interface | IRSA / role assumption (EKS)
com.amazonaws.<region>.ecs / ecs-agent / ecs-telemetry | Interface | ECS agent comms (EC2 launch)

Forgetting the S3 gateway endpoint is the classic: ecr.api/ecr.dkr resolve, auth succeeds, but the layer download (which goes to S3) hangs → task stuck in PROVISIONING or CannotPullContainerError.
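The endpoint table can be turned into a checklist that compares what a task needs with what the VPC actually has. The sketch below is one way to do that offline; the feature flags are hypothetical names for the conditional rows of the table, and the `present` set is whatever list of endpoint service names you export from your VPC.

```python
def required_endpoints(region: str, *, secrets_manager: bool = False,
                       ssm_parameters: bool = False, ecs_exec: bool = False,
                       ec2_launch_type: bool = False, eks_irsa: bool = False) -> dict[str, str]:
    """Endpoint service name -> endpoint type for a NAT-less private subnet."""
    prefix = f"com.amazonaws.{region}."
    needed = {
        prefix + "ecr.api": "Interface",
        prefix + "ecr.dkr": "Interface",
        prefix + "s3": "Gateway",        # image layers are served from S3
        prefix + "logs": "Interface",
    }
    if secrets_manager:
        needed[prefix + "secretsmanager"] = "Interface"
    if ssm_parameters:
        needed[prefix + "ssm"] = "Interface"
    if ecs_exec:
        needed[prefix + "ssmmessages"] = "Interface"
    if eks_irsa:
        needed[prefix + "sts"] = "Interface"
    if ec2_launch_type:
        for service in ("ecs", "ecs-agent", "ecs-telemetry"):
            needed[prefix + service] = "Interface"
    return needed


def missing_endpoints(present: set[str], needed: dict[str, str]) -> list[str]:
    """Endpoint service names that are required but absent from the VPC."""
    return sorted(name for name in needed if name not in present)


if __name__ == "__main__":
    present = {
        "com.amazonaws.ap-south-1.ecr.api",
        "com.amazonaws.ap-south-1.ecr.dkr",
        "com.amazonaws.ap-south-1.logs",
    }
    print(missing_endpoints(present, required_endpoints("ap-south-1")))
```

Run against the classic misconfiguration (both ECR endpoints and logs present, S3 forgotten), the only name reported is the S3 gateway endpoint, which matches the symptom described above.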

EKS VPC CNI — IP exhaustion math

The EKS VPC CNI gives each pod a real VPC IP, pre-allocating a warm pool per node. Without prefix delegation, a node’s pod density is capped by its ENI/IP limits, and large clusters exhaust /24s fast.

Lever | Effect | Trade-off
Prefix delegation | Assign /28 prefixes → ~16× more pods/node | Slight IP fragmentation; enable early
Custom networking | Pods in a secondary CIDR | More config; preserves primary subnet IPs
Bigger subnets (/19+) | More headroom | Plan CIDRs up front; hard to change later
Fewer, larger nodes | Fewer warm-pool IPs wasted | Larger blast radius per node
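The math behind these levers fits in a few lines. The sketch below uses the commonly cited VPC CNI formula, max pods = ENIs × (IPs per ENI − 1) + 2, and treats each secondary IP slot as a /28 (16 addresses) when prefix delegation is on. The ENI and IP limits are per instance type and must be looked up; the 110-pod cap for prefix delegation is an assumption based on the usual recommendation for smaller instances.

```python
def usable_subnet_ips(prefix_length: int) -> int:
    """IPv4 addresses usable in a subnet; AWS reserves 5 per subnet."""
    return 2 ** (32 - prefix_length) - 5


def max_pods(enis: int, ips_per_eni: int, prefix_delegation: bool = False,
             cap: int = 110) -> int:
    """Approximate pod ceiling per node under the VPC CNI.

    One address per ENI is the ENI's own primary IP; the +2 covers
    host-network pods such as aws-node and kube-proxy.
    """
    slots = enis * (ips_per_eni - 1)
    if prefix_delegation:
        # Each slot holds a /28 prefix (16 IPs); ASSUMED recommended cap applies.
        return min(slots * 16 + 2, cap)
    return slots + 2


if __name__ == "__main__":
    # m5.large: 3 ENIs with 10 IPv4 addresses each.
    print("pods per m5.large:", max_pods(3, 10))
    print("with prefix delegation:", max_pods(3, 10, prefix_delegation=True))
    print("usable IPs in a /24:", usable_subnet_ips(24))
```

An m5.large tops out at 29 pods without prefix delegation, and a /24 offers only 251 usable addresses shared by nodes, pods, load balancers and warm pools, which is why small subnets run dry long before the nodes run out of CPU.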

Deployments & rollouts — keeping the service alive during change

ECS deployment controllers

Controller | Behaviour | Use when
ECS rolling (default) | Replaces tasks per min/max healthy % | Default; simple rolling update
CodeDeploy blue/green | Shifts ALB traffic to a new task set | Safe canary/linear/all-at-once with rollback
EXTERNAL | You drive task sets via API | Custom deployment tooling

Deployment-tuning knobs (ECS)

Setting | Controls | Default | Gotcha
minimumHealthyPercent | How many tasks stay up during deploy | 100 | Too high + no spare capacity = stuck deploy
maximumPercent | Burst capacity during deploy | 200 | Fine on Fargate (no nodes to “fill”); EC2 needs headroom
deploymentCircuitBreaker.rollback | Auto-roll-back on failed deploy | off | Turn ON; it saves a bad rollout
Health-check grace period | Ignore ALB health for N s after start | 0 | Set it for slow-booting apps or you’ll thrash
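
The two percentages are easier to reason about as task counts. The sketch below converts them for a given desired count, assuming the documented rounding (the lower bound rounds up, the upper bound rounds down). The function name deploy_bounds is hypothetical; it is not part of the AWS CLI.

```shell
# Sketch only: turn ECS deployment percentages into task counts.
# Usage: deploy_bounds DESIRED MIN_HEALTHY_PCT MAX_PCT  -> prints "min max"
deploy_bounds() {
  desired=$1
  min_pct=$2
  max_pct=$3
  # Lower bound: percentage of desired count, rounded up (integer ceiling).
  min_tasks=$(( (desired * min_pct + 99) / 100 ))
  # Upper bound: percentage of desired count, rounded down.
  max_tasks=$(( desired * max_pct / 100 ))
  echo "$min_tasks $max_tasks"
}

deploy_bounds 4 100 200   # defaults: never below 4, may burst to 8 -> 4 8
deploy_bounds 4 50 150    # -> 2 6
deploy_bounds 3 50 150    # rounding shows up on odd counts -> 2 4
```

With the defaults (100/200) a deploy needs room for extra tasks before any old task stops, which is why an EC2 cluster with no spare capacity gets stuck while Fargate does not.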

Kubernetes rollout knobs (EKS)

Setting | Controls | Notes
strategy.rollingUpdate.maxUnavailable | Pods down during rollout | Lower = safer, slower
strategy.rollingUpdate.maxSurge | Extra pods during rollout | Needs node headroom (or Karpenter scales)
readinessProbe | When a pod joins the LB | Wrong path → pod never Ready → no endpoints
livenessProbe | When kubelet restarts a pod | Too aggressive = crash-loop you caused
PodDisruptionBudget | Min available during drains | Protects availability during node upgrades
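
The same arithmetic applies to a Kubernetes Deployment, with the rounding reversed: a percentage maxUnavailable rounds down and a percentage maxSurge rounds up. A minimal sketch, with rollout_bounds as a hypothetical helper name:

```shell
# Sketch only: turn rolling-update percentages into pod counts.
# Usage: rollout_bounds REPLICAS MAX_UNAVAILABLE_PCT MAX_SURGE_PCT
# Prints "min_available max_total".
rollout_bounds() {
  replicas=$1
  unavail_pct=$2
  surge_pct=$3
  # maxUnavailable as a percentage rounds down.
  unavailable=$(( replicas * unavail_pct / 100 ))
  # maxSurge as a percentage rounds up (integer ceiling).
  surge=$(( (replicas * surge_pct + 99) / 100 ))
  echo "$(( replicas - unavailable )) $(( replicas + surge ))"
}

rollout_bounds 10 25 25   # Kubernetes defaults on 10 replicas -> 8 13
rollout_bounds 4 25 25    # -> 3 5
```

The second number is the headroom the nodes (or Karpenter) must be able to supply mid-rollout.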

Architecture at a glance

Trace one HTTPS request and you can see the whole 2×2 in a single path. A client hits an Application Load Balancer on :443 (TLS terminates here). Because the containers run with awsvpc networking, the ALB’s target group must be target-type ip — it sends traffic to the task’s own ENI IP, not to a host. From there the request enters the control plane you chose: ECS (AWS-native, no cluster fee) or EKS (the Kubernetes API at $0.10/hr). That orchestrator schedules the work onto the data plane you chose: Fargate (serverless, 0.25–16 vCPU, nothing to patch) or EC2 nodes (you own the AMI; Graviton, Spot and GPU live here). Whichever combination, the actual workload is the same container — an image pulled from ECR over :443, running as a task or pod under a task role (ECS) or IRSA (EKS). Finally every task leans on shared platform dependencies: CloudWatch for logs and metrics, and IAM split into an execution role (pull the image, write logs, read secrets) and a task role (the app’s own AWS calls).

The numbered badges mark the five places this architecture most often goes wrong or forces a decision. (1) is the ALB target-type trap — awsvpc demands ip, and an instance target group gives you a 503 with no healthy targets. (2) is the orchestrator fork itself: pay the EKS control-plane fee and add-on toil only for genuine Kubernetes needs. (3) is the Fargate-vs-EC2 trade — serverless simplicity versus steady-state cost, GPU and daemons. (4) is the dreaded PROVISIONING hang: an ENI that can’t attach because the subnet is out of IPs or the private subnet lacks ECR/S3/logs endpoints. (5) is the IAM split — mixing the execution and task roles is why images won’t pull or secrets won’t resolve. Read the diagram left to right and the badges become your pre-flight checklist.

ECS vs EKS over Fargate vs EC2: one HTTPS request through an ALB (target-type ip) into the chosen orchestrator and launch type, the ECR image and task/pod, and the CloudWatch + IAM platform dependencies, with five numbered failure/decision points

Real-world scenario

Northwind Stream (fictional) is a 40-engineer media-analytics SaaS on AWS. Two years ago a newly hired platform lead stood up a single large EKS cluster “to be cloud-native,” and everything went on it: the customer-facing web/API tier (a dozen stateless Go and Node services), a set of scheduled batch jobs, and the data platform (Spark on the Spark Operator, plus a couple of bespoke controllers). It worked — until operating it became the team’s main job.

The symptoms were classic misallocation. The platform group had grown to three full-time engineers whose week was Kubernetes upkeep: a minor-version upgrade every cycle (with the obligatory VPC CNI / CoreDNS compatibility testing), recurring IP-exhaustion alerts as the CNI burned through a /23 (prefix delegation hadn’t been enabled), AWS Load Balancer Controller CVEs to patch, Helm-chart drift, and a painful incident where a bad CoreDNS bump broke service discovery for twenty minutes. None of this produced customer value. Meanwhile the web tier — pure stateless HTTP behind an ALB — used nothing Kubernetes-specific. It was paying the full cluster tax for zero benefit.

The architecture review split the workloads along the two axes honestly:

Workload | Needs k8s? | Utilization | Decision | Why
Web / API tier (12 services) | No | Spiky daytime | ECS + Fargate | Stateless + ALB; no k8s features used; min ops
Scheduled batch (reports) | No | Bursty, scale-to-zero | ECS + Fargate (Spot) | EventBridge-triggered; cheap; no idle nodes
Data platform (Spark, controllers) | Yes | Steady, heavy, GPU-ish | EKS + EC2 (Karpenter, Spot, Graviton) | Operators/CRDs are the value; cost-tuned nodes
ML inference (GPU) | Maybe | Steady | EKS + EC2 GPU | Operators + GPU; Fargate can’t do GPU

They migrated the web tier to ECS Fargate behind the existing ALBs (recreating the target groups as target-type ip), moved the batch jobs to ECS Fargate Spot triggered by EventBridge Scheduler, and kept the data platform on EKS — but moved its nodes to Karpenter on Graviton Spot, and finally enabled prefix delegation to kill the IP-exhaustion alerts. The numbers afterward: the platform team shrank from three engineers to one; the EKS cluster’s blast radius dropped (only the data platform now depends on it); the batch tier’s compute bill fell sharply because it scaled to zero between runs; and the Spark/ML workloads got cheaper on Karpenter+Graviton+Spot while gaining the JIT bin-packing they’d lacked. The lesson the lead wrote in the post-mortem: “Kubernetes was the right tool for the 20% of our workload that needed it, and an expensive mistake for the 80% that didn’t. We chose on the wrong axis — we picked an orchestrator before asking whether each workload needed one.”

Advantages and disadvantages

Option | Advantage | Disadvantage
ECS | $0 control plane; lowest learning curve; native AWS integration; per-task IAM | AWS-only; no CRDs/operators; less extensible
EKS | Full Kubernetes API; portable; huge ecosystem; advanced scheduling | $0.10/hr/cluster; add-on + upgrade toil; bigger blast radius
Fargate | No node patching/scaling/right-sizing; per-second billing; per-task isolation | ~20–50% per-vCPU premium; no GPU/DaemonSet/privileged; fixed CPU/mem pairs
EC2 | Cheaper at steady state; Spot/Graviton/GPU; daemons; custom AMI; bin-packing | You patch/scale/secure nodes; capacity planning; host security surface

In prose: ECS wins when the workload is “just containers + a load balancer” and you value shipping over operating — which is most workloads, most of the time. EKS wins precisely when you can name the Kubernetes capability you depend on (an operator, CRDs, a mesh, portability) — and when that value clears the bar of the control-plane fee plus the permanent add-on/upgrade surface. On the other axis, Fargate wins on spiky utilization, small teams, and anything you’d rather not babysit; its premium is real but often dwarfed by the salary cost of node operations at small scale. EC2 wins on steady, high-utilization fleets where bin-packing plus Savings Plans/Spot/Graviton make it dramatically cheaper, and on the hard requirements Fargate simply can’t meet (GPU, DaemonSets, custom kernels). The corners are not ranked; they are matched to a workload’s shape.

Hands-on lab

This lab deploys a tiny HTTP container on ECS Fargate behind an ALB, hits it, then tears everything down. It is free-tier-adjacent (Fargate and ALB are not free, but a few minutes cost cents). Run in a Region like ap-south-1 (Mumbai). Replace the placeholder IDs (subnet-1, subnet-2, sg-alb, sg-web, vpc-0abc, <acct>) and the <...-arn> values with your own.

Step 0 — prerequisites

aws sts get-caller-identity            # confirm you're authenticated
aws ec2 describe-vpcs --query 'Vpcs[0].VpcId' --output text   # note a VPC id

Step 1 — create a cluster

aws ecs create-cluster --cluster-name lab --capacity-providers FARGATE FARGATE_SPOT

Expected: JSON with "status": "ACTIVE".

Step 2 — an execution role (pull image + write logs)

aws iam create-role --role-name lab-exec \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ecs-tasks.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name lab-exec \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

Step 3 — register a task definition (taskdef.json)

{
  "family": "lab-web",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256", "memory": "512",
  "executionRoleArn": "arn:aws:iam::<acct>:role/lab-exec",
  "containerDefinitions": [{
    "name": "web",
    "image": "public.ecr.aws/nginx/nginx:latest",
    "portMappings": [{ "containerPort": 80 }],
    "essential": true,
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/lab-web",
        "awslogs-region": "ap-south-1",
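        "awslogs-stream-prefix": "web"
      }
    }
  }]
}

# AmazonECSTaskExecutionRolePolicy does not grant logs:CreateLogGroup,
# so create the log group up front instead of using awslogs-create-group
aws logs create-log-group --log-group-name /ecs/lab-web
aws ecs register-task-definition --cli-input-json file://taskdef.json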

Step 4 — an ALB + a TARGET-TYPE IP target group (the lab’s whole point)

aws elbv2 create-load-balancer --name lab-alb --type application \
  --subnets subnet-1 subnet-2 --security-groups sg-alb
aws elbv2 create-target-group --name lab-tg --protocol HTTP --port 80 \
  --vpc-id vpc-0abc --target-type ip --health-check-path /
# create a listener on :80 forwarding to lab-tg (ARNs from the two commands above)
aws elbv2 create-listener --load-balancer-arn <alb-arn> --protocol HTTP --port 80 \
  --default-actions Type=forward,TargetGroupArn=<tg-arn>

Step 5 — run the service

aws ecs create-service --cluster lab --service-name web \
  --task-definition lab-web --desired-count 2 --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-1,subnet-2],securityGroups=[sg-web],assignPublicIp=ENABLED}' \
  --load-balancers 'targetGroupArn=<tg-arn>,containerName=web,containerPort=80'

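Note: assignPublicIp=ENABLED lets the lab pull the public ECR image without VPC endpoints, provided subnet-1 and subnet-2 are public subnets (routed to an internet gateway). sg-web must also allow inbound port 80 from sg-alb, or the targets never turn healthy. In production you’d use private subnets + the endpoints in the table above.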

Step 6 — verify

aws ecs describe-services --cluster lab --services web \
  --query 'services[0].deployments[0].runningCount'      # → 2 when ready
aws elbv2 describe-target-health --target-group-arn <tg-arn> \
  --query 'TargetHealthDescriptions[].TargetHealth.State' # → ["healthy","healthy"]
curl http://<alb-dns-name>/                               # → nginx welcome HTML

Step 7 — teardown (avoid charges)

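aws ecs update-service --cluster lab --service web --desired-count 0
aws ecs delete-service --cluster lab --service web --force
aws ecs wait services-inactive --cluster lab --services web   # block until the service is gone
aws elbv2 delete-listener --listener-arn <listener-arn>
aws elbv2 delete-load-balancer --load-balancer-arn <alb-arn>
aws elbv2 delete-target-group --target-group-arn <tg-arn>
aws ecs delete-cluster --cluster lab
aws logs delete-log-group --log-group-name /ecs/lab-web
# the Step 2 execution role is free, but remove it to leave the account clean
aws iam detach-role-policy --role-name lab-exec \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
aws iam delete-role --role-name lab-exec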

Common mistakes & troubleshooting

Containers fail in a small set of recurring ways, each with a specific root cause and an exact confirm step. Use this section as a playbook: match the symptom, confirm the cause, apply the fix.

# | Symptom | Root cause | Confirm (exact command / path) | Fix
1 | Task stuck in PROVISIONING | ENI can’t attach: no free subnet IPs, or private subnet missing ECR/S3/logs endpoints | aws ecs describe-tasks ... --query 'tasks[0].stoppedReason'; check subnet free IPs | Free IPs / bigger subnet; add ECR (api, dkr) + S3 gateway + logs endpoints
2 | CannotPullContainerError | Execution role lacks ECR perms, or no route to ECR/S3 | Task stoppedReason; CloudTrail ecr:GetAuthorizationToken deny | Attach AmazonECSTaskExecutionRolePolicy; add ECR + S3 endpoints or NAT
3 | ResourceInitializationError: unable to pull secrets | Execution role can’t read Secrets Manager/SSM, or no endpoint | stoppedReason; secret ARN in task def | Grant exec role secretsmanager:GetSecretValue; add SM endpoint
4 | App throws AccessDenied calling S3/DDB | Permission put on execution role, not task role | App logs; the call uses the task role | Move the app’s policy to the task role (taskRoleArn)
5 | ALB returns 503, no healthy targets | Target group is target-type instance for an awsvpc/Fargate service | aws elbv2 describe-target-health shows no/unhealthy targets | Recreate TG with --target-type ip; health-check the container port
6 | Targets unhealthy, app is fine | Health-check path/port wrong; SG blocks ALB→task | describe-target-health reason Target.ResponseCodeMismatch/Timeout | Fix --health-check-path/port; allow ALB SG → task SG on the port
7 | Container exits with code 137 | OOM: exceeded task/container memory | stoppedReason: OutOfMemoryError; Container Insights memory | Raise memory; fix leak; set memoryReservation sensibly
8 | Crash loop (task restarts forever) | App throws at startup (bad env/secret/migration) | aws logs tail /ecs/<svc> shows a repeating trace; ECS events | Fix config; enable circuit-breaker rollback; add health grace period
9 | EKS pods Pending | No capacity (no nodes) or no matching Fargate profile | kubectl describe pod → FailedScheduling / Insufficient cpu | Scale nodes/Karpenter; add a Fargate profile matching the labels
10 | EKS pods ContainerCreating forever | IP exhaustion (VPC CNI) or CNI not ready | kubectl describe pod → failed to assign an IP; aws-node logs | Enable prefix delegation; bigger subnets; restart CNI
11 | Spot interruption kills tasks/nodes | Fargate Spot/EC2 Spot reclaimed | ECS events / node Terminating; Spot interruption notice | Run an on-demand base via capacity-provider strategy; PDBs (EKS)
12 | Deploy stuck, never completes | minimumHealthyPercent 100 + no spare EC2 capacity | aws ecs describe-services shows deployment IN_PROGRESS forever | Lower min-healthy or add capacity; Fargate avoids this
13 | 504 Gateway Timeout from ALB | App slower than ALB idle/target timeout | App Insights/logs latency; ALB idle timeout | Speed up app; raise target/idle timeout; fix downstream
14 | exec format error on start | arm64 image on x86 task (or vice-versa) | Container logs first line | Build multi-arch image; match runtimePlatform
15 | ECS Exec / kubectl exec fails | Missing SSM endpoints or enableExecuteCommand off | aws ecs execute-command error; SSM agent | Enable exec; add ssm/ssmmessages endpoints; task-role SSM perms
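
Row 7’s exit code is less arbitrary than it looks. By the usual shell and container-runtime convention, a process killed by a signal reports 128 plus the signal number, so 137 is 128 + 9 (SIGKILL, which the kernel OOM killer sends). A minimal sketch of that decoding, with exit_signal as a hypothetical helper name:

```shell
# Sketch only: decode a container exit code into the signal that caused it.
# Usage: exit_signal CODE  -> prints the signal number, or "none"
exit_signal() {
  if [ "$1" -gt 128 ]; then
    # Codes above 128 conventionally mean "killed by signal (code - 128)".
    echo $(( $1 - 128 ))
  else
    # 0-128: the process exited on its own with that status.
    echo "none"
  fi
}

exit_signal 137   # -> 9  (SIGKILL, typical of an OOM kill)
exit_signal 143   # -> 15 (SIGTERM, a normal stop request)
exit_signal 1     # -> none (the app itself exited with an error)
```

A 137 alone does not prove an OOM kill, since anything can send SIGKILL; the stoppedReason in the Confirm column is what settles it.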

A few reading notes that save the most time:

Best practices

Security notes

Cost & sizing

What drives the bill differs sharply by corner. Roughly (us-east-1-class on-demand list; INR at ~₹84/USD; verify current pricing):

Cost driver | Applies to | Rough figure | Notes
Fargate vCPU | ECS/EKS Fargate | ~$0.04048 / vCPU-hr | Per-second, 1-min minimum
Fargate memory | ECS/EKS Fargate | ~$0.004445 / GB-hr | Billed alongside vCPU
Fargate Spot | ECS Fargate | ~70% off | Interruptible; batch/dev
EKS control plane | Every EKS cluster | $0.10/hr (~$73/mo / ~₹6,100) | Per cluster, so consolidate
EC2 on-demand | EC2 launch type | Instance-hour | Cheaper than Fargate at high utilization
EC2 Spot | EC2 launch type | up to ~90% off | Interruptible; Karpenter handles it
Graviton (arm64) | Fargate & EC2 | ~20–40% better price/perf | Multi-arch image required
NAT Gateway | Private tasks w/o endpoints | ~$0.045/hr + $0.045/GB | Endpoints often cheaper at scale
Interface VPC endpoint | Private tasks | ~$0.01/hr each + data | Fixed per-AZ; adds up with many
ALB | Fronting services | ~$0.0225/hr + LCU | Shared across many target groups
CloudWatch Logs | All | ~$0.50/GB ingest + storage | Sample/route via FireLens to cut cost
Container Insights | Optional | Per metric/log | Useful but priced; scope it
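
As a worked example of the two Fargate rows, the sketch below multiplies the table’s rough per-hour rates out to a monthly figure. The function name fargate_monthly and the 730-hour month are assumptions; swap in current regional rates before quoting the result.

```shell
# Sketch only: rough monthly Fargate compute cost from the table's list rates.
# Usage: fargate_monthly VCPU MEM_GB [HOURS=730] [TASKS=1]  -> prints USD
fargate_monthly() {
  awk -v vcpu="$1" -v mem="$2" -v hours="${3:-730}" -v tasks="${4:-1}" 'BEGIN {
    vcpu_rate = 0.04048    # USD per vCPU-hour (rough figure from the table)
    mem_rate  = 0.004445   # USD per GB-hour (rough figure from the table)
    printf "%.2f\n", (vcpu * vcpu_rate + mem * mem_rate) * hours * tasks
  }'
}

fargate_monthly 0.25 0.5        # one 0.25 vCPU / 0.5 GB task, always on -> 9.01
fargate_monthly 1 2 730 2       # two 1 vCPU / 2 GB tasks, always on -> 72.08
```

The second case is a useful yardstick: two always-on 1 vCPU tasks cost about the same per month as one idle EKS control plane.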

Right-sizing guidance

Decision | Heuristic
Fargate task size | Start at the smallest valid CPU/mem pair that fits; scale out, not up, first
Fargate vs EC2 crossover | Above ~60–70% steady utilization 24×7, EC2 (with RI/SP) usually wins
Spot mix | On-demand base for availability + Spot burst for cost (e.g. 1:3)
EKS cluster count | Consolidate: each cluster is $73/mo; use namespaces/RBAC, not extra clusters
arm64 adoption | Default new services to Graviton if the stack supports it
Logs spend | Route with FireLens, sample debug logs, set retention
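
To see what a 1:3 Spot mix means in task counts, the sketch below models an ECS capacity-provider strategy with an on-demand base plus weighted remainder. The helper name spot_split is hypothetical, and the rounding (Spot rounded down, leftover to on-demand) is an assumption; ECS’s own placement may differ on counts that do not divide evenly.

```shell
# Sketch only: estimate the on-demand/Spot split for a base + weight strategy.
# Usage: spot_split DESIRED BASE OD_WEIGHT SPOT_WEIGHT  -> prints "on_demand spot"
spot_split() {
  desired=$1
  base=$2
  od_w=$3
  spot_w=$4
  # The base is filled from on-demand first and cannot exceed the desired count.
  if [ "$base" -gt "$desired" ]; then
    base=$desired
  fi
  rest=$(( desired - base ))
  # Assumption: the remainder is split by weight, rounding Spot down.
  spot=$(( rest * spot_w / (od_w + spot_w) ))
  echo "$(( desired - spot )) $spot"
}

spot_split 10 2 1 3   # base 2, then 1:3 over the other 8 -> 4 6
spot_split 2 2 1 3    # everything fits in the base -> 2 0
```

On a real service this corresponds to a strategy along the lines of capacityProvider=FARGATE,base=2,weight=1 with capacityProvider=FARGATE_SPOT,weight=3, so a full Spot reclaim still leaves the on-demand share running.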

Free-tier-ish notes

Interview & exam questions

Q1. ECS vs EKS in one sentence — when each? ECS is AWS’s native orchestrator with no control-plane fee and the lowest operational surface — use it for straightforward containerized services. EKS is conformant Kubernetes ($0.10/hr/cluster) — use it when you need the Kubernetes ecosystem (operators, CRDs, Helm), portability, or advanced scheduling. (SAP-C02)

Q2. Fargate vs EC2 launch type — the trade-off? Fargate is serverless (no nodes to patch/scale/right-size, per-second billing) at a per-vCPU premium and without GPU/DaemonSet/privileged support. EC2 is cheaper at steady high utilization and supports Spot, Graviton, GPU, daemons and custom AMIs, but you own host operations. (DVA-C02 / SAP-C02)

Q3. Your Fargate service behind an ALB returns 503 with no healthy targets. Why? Almost certainly the target group is target-type instance. Fargate uses awsvpc, where each task has its own ENI IP, so tasks can only be registered as ip targets. Recreate the target group with --target-type ip and point the health check at the container port. (DVA-C02)

Q4. A Fargate task is stuck in PROVISIONING in a private subnet. First two suspects? No free IPs in the subnet for the task ENI, or missing VPC endpoints (ECR api+dkr, the S3 gateway endpoint for layers, and logs). Confirm via stoppedReason and subnet free-IP count. (SAP-C02)

Q5. Difference between the ECS execution role and task role? The execution role lets the ECS agent pull the image, write logs and read secrets (platform-side). The task role is the application’s own AWS identity for its API calls (S3, DynamoDB, etc.). CannotPullContainerError → execution role; app AccessDenied → task role. (DVA-C02)

Q6. When is EKS genuinely the right call over ECS? When you depend on Kubernetes-specific capabilities: existing Helm charts/operators/CRDs, a service mesh, advanced scheduling, multi-cloud portability, or workloads like Spark/Flink/ML that run on Kubernetes operators. Brand recognition is not a reason. (SAP-C02)

Q7. What can’t you run on EKS Fargate? DaemonSets, GPU workloads, privileged containers and hostNetwork/hostPort pods. Node-level agents (logging/security) must be sidecars; GPU/daemon needs require EC2 node groups. (CKA mindset / SAP-C02)

Q8. How do you give an EKS pod AWS permissions securely? Use IRSA (IAM Roles for Service Accounts via OIDC) or EKS Pod Identity to bind a least-privilege role to a ServiceAccount — not the node instance profile, which would over-grant every pod on the node. (DOP-C02)

Q9. Container exits with code 137 — what happened and how do you confirm? It was OOM-killed for exceeding its memory limit. Confirm via stoppedReason: OutOfMemoryError (ECS) or the pod’s OOMKilled reason (EKS) and Container Insights memory metrics; fix by raising memory or fixing the leak. (DVA-C02)

Q10. How do you cut container compute cost without sacrificing availability? Run an on-demand base plus Spot burst (capacity-provider strategy on ECS; Karpenter Spot with on-demand fallback on EKS), adopt Graviton/arm64, right-size from metrics, and commit Savings Plans once usage is steady. (SAP-C02)

Q11. Why does the EKS control plane cost matter for cluster strategy? Each cluster is $0.10/hr (~$73/mo). Spinning up a cluster per team/app multiplies that fee and the add-on toil. Consolidate with namespaces and RBAC instead, reserving separate clusters for genuine isolation needs. (SAP-C02)

Q12. What is Karpenter and why prefer it over the Cluster Autoscaler? Karpenter is a just-in-time node provisioner for EKS that watches pending pods and launches right-sized, cheapest-fit nodes (Spot-native, bin-packing) in seconds, then consolidates. It’s faster and more cost-efficient than ASG-based Cluster Autoscaler’s fixed node groups. (DOP-C02)

Quick check

  1. You have a dozen stateless HTTP services, a five-person team, and no Kubernetes experience. Which corner of the 2×2 do you pick, and why?
  2. Your Fargate service behind an ALB shows “no healthy targets” and clients get 503. What is the single most likely misconfiguration?
  3. Name two VPC endpoints (besides logs) a private-subnet Fargate task needs to pull an image, and the easy one to forget.
  4. The app logs AccessDenied calling DynamoDB. Execution role or task role — which do you fix?
  5. Give one workload that genuinely justifies EKS over ECS and one that genuinely justifies EC2 over Fargate.

Answers

  1. ECS + Fargate. Stateless containers + ALB use nothing Kubernetes-specific, and Fargate removes node ops — the lowest total operational surface for a small team. EKS would add a control-plane fee and an add-on/upgrade burden for zero benefit.
  2. The target group is target-type instance instead of ip. awsvpc/Fargate tasks must be targeted by ENI IP; recreate the target group with --target-type ip and health-check the container port.
  3. ecr.api and ecr.dkr (interface endpoints) — plus the easy-to-forget S3 gateway endpoint, because ECR image layers are stored in S3 and the download stalls without it.
  4. The task role. The app’s own AWS calls use the task role; the execution role only covers image pull, log writes and secret reads.
  5. EKS-justifying: Spark/ML on the Spark Operator (or anything needing CRDs/operators, a mesh, or multi-cloud portability). EC2-justifying: a GPU inference service (Fargate has no GPU) or a steady 24×7 fleet where Spot+Graviton+bin-packing is far cheaper.

Glossary

Next steps


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
