
ECS Service Connect Deep Dive: Service Discovery, Traffic Resilience, and Migrating Off ALBs

Most ECS estates accumulate internal ALBs the way attics accumulate boxes. Service A needs to call service B, so someone stands up an internal Application Load Balancer, a target group, a listener, a Route 53 alias, and a security-group rule — for a call path that never leaves the VPC and never sees a browser. Multiply by fifty services and you are paying for fifty load balancers, fifty health-check configurations, and an extra network hop on every east-west request, all to do something an ALB was never designed for: client-side service discovery with retries.

ECS Service Connect collapses that. It runs a managed Envoy sidecar in every task, registers a logical name in an AWS Cloud Map namespace, and lets http://payments resolve and load-balance directly to healthy payments tasks — with connection pooling, retries, timeouts, and outlier detection handled by the proxy. No internal ALB, no per-call DNS lookup, no extra hop. This is how it actually works, where it beats and loses to the alternatives, every setting and limit that bites in production, and how to migrate without a flag day.

By the end you will stop reflexively standing up an internal ALB for every service-to-service edge. You will know when Service Connect is the right tool (intra-namespace east-west with resilience), when an ALB still earns its keep (L7 path/host routing, public ingress, WAF), and when Cloud Map DNS is the only option (a non-ECS consumer that cannot carry a sidecar). And you will be able to read the proxy’s CloudWatch metrics well enough to debug a migration step at 02:00 instead of rolling it back blind.

What problem this solves

The internal-ALB-per-service pattern has three costs that compound silently. Money: each internal ALB has an hourly charge plus LCU (Load Balancer Capacity Unit) consumption; at fifty internal ALBs that is a material line on the bill for traffic that never touches the internet. Latency: every east-west call takes an extra hop through the ALB — the client connects to the ALB, the ALB connects to a target — adding a round-trip and a second TLS handshake to a call that could have gone task-to-task. Resilience gaps: an ALB ejects a target only when its health check fails on a fixed interval. A replica that passes a shallow /healthz while returning 503s to real traffic keeps taking requests — the ALB never notices, and callers eat a steady error rate that pages on-call weekly.

The DNS-based alternative — Cloud Map service discovery with serviceRegistries — removes the ALB hop but introduces its own pain: the client resolves a name against a Route 53 private hosted zone, caches the A records for the TTL, and load-balances with whatever its HTTP library happens to do. A task that died ten seconds ago can still be in the resolver cache, so you get the “connection refused to a dead IP” tail. There are no retries, no outlier detection, and no per-call observability — the client picks an IP blind and logs nothing about the upstream’s health.

Who hits this: any team running more than a handful of ECS services that call each other. It bites hardest on microservice estates on Fargate (where you cannot install a node-level mesh agent), teams that adopted internal ALBs early and now drown in them, and anyone debugging an intermittent east-west 5xx that a health check sails past. Service Connect is the AWS-native answer that gives you mesh-grade resilience without running a service mesh — but only inside one namespace, in one account, and only for workloads that can carry the sidecar.

To frame the whole field before the deep dive, here is every east-west discovery option, the cost it removes, the cost it adds, and the one situation it is right for:

| Option | Removes | Adds | Right when |
|---|---|---|---|
| Internal ALB | nothing (the baseline) | hourly + LCU cost, extra hop, no per-request ejection | you genuinely need L7 path/host routing or WAF on the edge |
| Cloud Map DNS | ALB cost + hop | DNS TTL staleness, no retries, no per-call telemetry | a non-ECS consumer (Lambda/EC2) must resolve ECS tasks |
| Service Connect | ALB cost + hop, TTL staleness, blind client LB | one sidecar per task, namespace/account scoping | intra-namespace east-west needing discovery + resilience |
| VPC Lattice | cross-account/VPC plumbing | a different abstraction, IAM auth policies, cost model | service-to-service across accounts/VPCs with IAM auth |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already be comfortable with ECS fundamentals: a task definition declares containers, ports, and IAM roles; a service keeps a desired count of tasks running and (optionally) registers them with a load balancer; Fargate vs EC2 launch types; and the awsvpc network mode where every task gets its own ENI and private IP. You should know how to read JSON output from the AWS CLI, run aws ecs execute-command (ECS Exec) into a task, and the basics of an internal ALB — listener, target group, target-type ip. If those are shaky, start with AWS ECS & ECR Fundamentals: Task Definitions, Services, Fargate and Production ECS on Fargate: Task Networking, Autoscaling, Deployments.

This sits in the container networking and resilience track. It assumes the load-balancer mechanics from Elastic Load Balancing: ALB, NLB, GWLB Deep Dive (because Service Connect replaces internal ALBs, not your ingress one), the VPC/subnet/SG model from VPC Deep Dive: Subnets, Routing, IGW, NAT, Endpoints, and Route 53 fundamentals from Route 53: DNS Records, Routing Policies, Health Checks. For cross-account east-west it pairs with PrivateLink: Service Provider & Consumer, Cross-Account and VPC Lattice: Service Networks, IAM Auth, Cross-Account.

A quick map of who owns what when an east-west call fails, so you page the right person fast:

| Layer | What lives here | Who usually owns it | Failure classes it can cause |
|---|---|---|---|
| Caller app | The dependency URL/config | App / dev team | Wrong URL (ALB vs SC alias), no retry on its own client |
| Service Connect agent (client) | Endpoint discovery, LB, retries | Platform (managed by ECS) | 503 (no endpoints: role/namespace misconfig), retry storms |
| Namespace (Cloud Map) | Logical name boundary | Platform | Cross-namespace call fails (scoping) |
| Callee task + agent (server) | Serves requests, advertises endpoint | App + platform | 5xx from the app, outlier ejection |
| portMappings / appProtocol | Port name + L7 protocol | App / dev team | L4 pass-through (no retries), port-name mismatch |
| Account / VPC boundary | PrivateLink / Lattice seam | Network team | Cross-account call has no namespace path |

Core concepts

Four mental models make every later decision obvious.

Service Connect is a client-side proxy mesh, not a load balancer. There is no central appliance traffic flows through. Instead, every participating task carries a managed Envoy sidecar, and the caller’s proxy holds a live view of every healthy endpoint for the names it consumes. When your app calls http://payments:8080, the call goes to the local proxy on localhost, and the proxy picks a healthy payments task and connects to it directly. The “load balancing” happens at the client, per request, with no hop through a shared device.

The role split decides whether outbound calls resolve. A service is client, server, or client-and-server. A server advertises endpoints into the namespace (other services can find it) but its proxy is not in client mode, so its own outbound calls do not resolve through Service Connect. A client consumes endpoints (it can call others) but advertises nothing. A payments service that both serves peers and calls ledger must be client-and-server. The single most common “why does my outbound call 503?” is a service set to server only.

Discovery is DNS-free and push-based. Unlike Cloud Map DNS discovery, a Service Connect namespace does not require a private hosted zone or a runtime DNS query. The clientAliases.dnsName (e.g. payments) is a logical name the proxy intercepts locally; it is not a record resolved over the wire. The ECS control plane pushes endpoint changes to every client proxy in the namespace within seconds, so there is no DNS-TTL staleness window and no “connection refused to a dead IP” tail.

Resilience is in the proxy, not your code or a health check. The Envoy sidecar does connection pooling, per-request load balancing, configurable timeouts, retries on idempotent failures, and outlier detection — ejecting a task that returns real 5xx to real traffic, not one that merely fails a probe. This is the headline reason to adopt Service Connect even if you were happy with discovery: you get mesh-grade resilience on every call path without writing or operating a mesh.
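The first and third mental models (client-side proxy, push-based discovery) can be sketched as a toy in a few lines of Python. This is an illustration only, not how the managed agent is implemented: the LocalProxy class, its method names, the IP addresses, and the round-robin choice are all assumptions made for the example.

```python
class LocalProxy:
    """Toy model of a client-side proxy holding a pushed endpoint view."""

    def __init__(self):
        self._endpoints = {}  # logical name -> list of "ip:port" strings
        self._cursors = {}    # logical name -> round-robin counter

    def on_push(self, name, endpoints):
        """The control plane pushes the full healthy set for one name."""
        self._endpoints[name] = list(endpoints)

    def pick(self, name):
        """Choose an endpoint for ONE request; None models a 503 (no endpoints)."""
        endpoints = self._endpoints.get(name, [])
        if not endpoints:
            return None
        i = self._cursors.get(name, 0)
        self._cursors[name] = i + 1
        return endpoints[i % len(endpoints)]  # round-robin is an assumption


proxy = LocalProxy()
proxy.on_push("payments", ["10.0.1.10:8080", "10.0.2.11:8080"])
first_three = [proxy.pick("payments") for _ in range(3)]

# A payments task stops: the control plane pushes the smaller set.
proxy.on_push("payments", ["10.0.2.11:8080"])
after_withdrawal = proxy.pick("payments")

print(first_three)
print(after_withdrawal)
print(proxy.pick("ledger"))  # no endpoints were ever pushed for this name
```

The point of the sketch is that the choice happens per request against whatever set was last pushed, so a withdrawn task stops receiving traffic on the very next call, and a name with no pushed endpoints has nothing to route to.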

What Service Connect replaces, keeps, and does not touch — the mental model for “what changes when I turn this on”:

| Thing | Before Service Connect | After Service Connect | Verdict |
|---|---|---|---|
| Internal east-west ALB | One per service edge | Deleted (after migration) | Replaced |
| Public ingress ALB | Internet-facing | Unchanged | Kept |
| L7 path/host routing | At the ALB | Still at the ALB | Kept |
| Client-side load balancing | Library / DNS, blind | Proxy, per-request | Replaced |
| Service discovery | DNS or static ALB DNS | Push-based, namespace | Replaced |
| Retries / outlier detection | App code or none | In the proxy | Added |
| Task IAM roles / secrets | Your design | Unchanged | Untouched |
| Security groups | Your design | Unchanged (task-to-task) | Untouched |

The vocabulary in one table

Pin down every moving part before the deep sections. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters |
|---|---|---|---|
| Namespace | The logical boundary services discover each other in | Cloud Map (HTTP type) | Discovery is scoped to it; one per environment |
| Service Connect agent | Managed Envoy sidecar ECS injects per task | Inside every participating task | Does discovery, LB, retries, outlier detection |
| portName | Name linking SC config to a port mapping | Task def portMappings[].name | Must match the SC portName exactly |
| discoveryName | The short name a callee advertises | SC config (server side) | What the proxy registers in the namespace |
| clientAlias | The DNS name + port callers use | SC config | http://<dnsName>:<port> the app calls |
| appProtocol | L7 protocol declaration | portMappings[].appProtocol | http/http2/grpc unlock L7 retries |
| Client role | Consumes endpoints, advertises nothing | SC config services empty/no advertise | Needed for outbound calls to resolve |
| Server role | Advertises endpoints into the namespace | SC config with services | Lets peers find this service |
| Outlier detection | Ejecting a task on real 5xx | Envoy in the agent | Catches failures a health check misses |
| Per-request timeout | Cap on a single request | SC timeout.perRequestTimeoutSeconds | Stops a slow upstream pinning a connection |
| Idle timeout | How long idle connections live | SC timeout.idleTimeoutSeconds | Too low → reconnect churn under chatty traffic |
| LCU | The metered cost unit of an ALB | Per internal ALB | Deleting ALBs removes this charge |
| Ingress ALB | The public internet-facing LB | In front of the edge service | Service Connect does not replace it |
| PrivateLink / Lattice | Cross-account/VPC connectivity seam | At the account boundary | Where Service Connect’s single-account scope ends |

The Service Connect architecture: agent, namespace, client and server modes

Service Connect has three moving parts, and understanding the split is the whole game.

The namespace is an AWS Cloud Map HTTP namespace. It is the logical boundary inside which services discover each other by short name. You create it once per environment (for example, one for prod and one for staging) and point every service in that environment at it. Unlike Cloud Map’s older DNS-based service discovery, a Service Connect namespace does not require a private hosted zone or DNS queries at runtime; discovery happens in the proxy’s control plane.

The Service Connect agent is a managed Envoy proxy that ECS injects as a sidecar container into every participating task. You do not write an Envoy config and you do not manage the container image — ECS owns its lifecycle, pushes endpoint updates to it, and ships its metrics. Your application talks to localhost-style endpoints the proxy exposes; the proxy handles the actual connection to a healthy backend task.

Client vs server roles are set per service in its serviceConnectConfiguration:

A frontend that calls APIs but exposes nothing internally is a pure client. A payments service that both serves peers and calls ledger is client-and-server. The distinction matters: a service must be client or client-and-server for its outbound calls to resolve through Service Connect. I have watched teams set a service to server only, then wonder why its outbound call to a dependency 503s — the proxy is not in client mode, so it has no endpoints to route to.

The three roles, what each does, and the failure if you pick wrong:

| Role | Advertises endpoints? | Resolves outbound? | Use for | Failure if mis-set |
|---|---|---|---|---|
| client | No | Yes | A frontend/edge service that only calls others | None for outbound; peers can’t find it (intended) |
| server | Yes | No | (Rare) a pure sink that never calls peers | Its own outbound calls 503 (no client mode) |
| client-and-server | Yes | Yes | Any service that both serves and calls peers | None; the safe default for most services |

Here is a minimal client-and-server block in a task/service definition (JSON for the CreateService call):

{
  "serviceConnectConfiguration": {
    "enabled": true,
    "namespace": "prod",
    "services": [
      {
        "portName": "http",
        "discoveryName": "payments",
        "clientAliases": [
          { "dnsName": "payments", "port": 8080 }
        ]
      }
    ]
  }
}

The portName (http) must match a name on a portMappings entry in the task definition. That linkage is mandatory and is the single most common misconfiguration.

{
  "name": "app",
  "portMappings": [
    { "name": "http", "containerPort": 8080, "protocol": "tcp", "appProtocol": "http" }
  ]
}

Set appProtocol deliberately. http (or http2/grpc) is what unlocks L7 features — retries on status codes, per-request stats. Leave it as raw TCP and the proxy degrades to L4 pass-through: you keep discovery and connection pooling but lose HTTP-aware retries and outlier detection.

Every field in the serviceConnectConfiguration, what it does, its default, and the gotcha:

| Field | What it does | Default | Valid values | Gotcha |
|---|---|---|---|---|
| enabled | Turns Service Connect on for the service | false | true / false | Must be true even when inheriting a cluster default namespace |
| namespace | The Cloud Map namespace to join | cluster default if set | namespace name or ARN | HTTP type recommended; cross-namespace is invisible |
| services | Endpoints this service advertises | none | array of advertise blocks | Omit/empty → pure client (advertises nothing) |
| services[].portName | Links to a named portMappings entry | | must match a portMappings[].name | Mismatch → service won’t register; #1 misconfig |
| services[].discoveryName | Name registered in the namespace | the portName | short string | Defaults to portName if omitted; be explicit |
| services[].clientAliases | DNS name + port callers use | | array of {dnsName, port} | The port here is what callers connect to, not the container port |
| services[].timeout | Per-service idle/per-request caps | proxy defaults | seconds | Set per discoveryName; 0 disables a cap |
| services[].tls | TLS for the advertised endpoint | off | ACM PCA config | Optional; pairs with a PCA-issued cert |
| logConfiguration | Where the agent ships its logs | none | awslogs / other drivers | Without it you fly blind on proxy access logs |

The port linkage tripping people up, stated as a mapping table:

| In the task definition | In the Service Connect config | Must match? |
|---|---|---|
| portMappings[].name = "http" | services[].portName = "http" | Yes, exactly |
| portMappings[].containerPort = 8080 | (not referenced directly) | No |
| portMappings[].appProtocol = "http" | (enables L7 features) | Drives retry/outlier capability |
| (none) | clientAliases[].dnsName = "payments" | Logical name, not a port |
| (none) | clientAliases[].port = 8080 | The port callers dial, can differ from containerPort |
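Because the portName linkage is the most common misconfiguration, it is worth checking before a deploy rather than after. The following is a minimal sketch of such a pre-deploy lint, assuming the container definition and the Service Connect config are already loaded as Python dicts shaped like the JSON above. The function name check_port_linkage and the wording of the messages are hypothetical, not part of any AWS tooling.

```python
def check_port_linkage(container_def, sc_config):
    """Return a list of problems; an empty list means the linkage looks sane."""
    problems = []
    # Index the task definition's NAMED port mappings by their name.
    mappings = {
        pm["name"]: pm
        for pm in container_def.get("portMappings", [])
        if "name" in pm
    }
    for svc in sc_config.get("services", []):
        port_name = svc.get("portName")
        if port_name not in mappings:
            problems.append(
                f"portName {port_name!r} has no matching portMappings[].name"
            )
            continue
        if "appProtocol" not in mappings[port_name]:
            problems.append(
                f"port {port_name!r} sets no appProtocol, so no L7 features"
            )
    return problems


container = {
    "name": "app",
    "portMappings": [
        {"name": "http", "containerPort": 8080, "protocol": "tcp", "appProtocol": "http"}
    ],
}
good = {"enabled": True, "services": [{"portName": "http", "discoveryName": "payments"}]}
bad = {"enabled": True, "services": [{"portName": "web", "discoveryName": "payments"}]}

print(check_port_linkage(container, good))
print(check_port_linkage(container, bad))
```

A check like this fits naturally in CI next to the Terraform plan, so a renamed port mapping fails a review instead of a deployment.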

Namespaces and Cloud Map: logical names, DNS-free discovery

The namespace is a Cloud Map HTTP namespace. Create it before any service references it:

aws servicediscovery create-http-namespace \
  --name prod \
  --description "Service Connect namespace for prod ECS services"

You can also let ECS create one implicitly when you set a default namespace on the cluster:

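# Capacity-provider setup for the cluster (independent of Service Connect).
aws ecs put-cluster-capacity-providers \
  --cluster prod \
  --capacity-providers FARGATE FARGATE_SPOT \
  --default-capacity-provider-strategy capacityProvider=FARGATE,weight=1

# Set the cluster's default Service Connect namespace.
aws ecs update-cluster \
  --cluster prod \
  --service-connect-defaults namespace=prod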

With a cluster default set, new services inherit the namespace and you only specify enabled: true plus the per-service services block. The same in Terraform, so the namespace and default are reviewed as code:

resource "aws_service_discovery_http_namespace" "prod" {
  name        = "prod"
  description = "Service Connect namespace for prod ECS services"
}

resource "aws_ecs_cluster" "prod" {
  name = "prod"
  service_connect_defaults {
    namespace = aws_service_discovery_http_namespace.prod.arn
  }
}

The DNS-free part is the important nuance. With Cloud Map DNS-based discovery (the older serviceRegistries model), a client resolves payments.prod.local against a Route 53 private hosted zone, gets back a set of A records, and picks one. The client does its own load balancing with whatever its HTTP library happens to do, DNS TTLs cache stale records, and a task that died ten seconds ago can still be in the resolver cache.

Service Connect inverts this. The agent maintains a live view of healthy endpoints pushed from the ECS control plane — no periodic DNS query, no TTL staleness window. When a payments task is stopped, ECS withdraws its endpoint from every client proxy in the namespace within seconds. clientAliases.dnsName is a logical name the proxy intercepts locally; it is not a record you have to resolve over the wire to a hosted zone. That is why Service Connect reacts to topology change far faster than DNS-based discovery, and why you stop seeing the “connection refused to a dead IP” tail that plagues DNS-TTL discovery.

The three namespace types Cloud Map offers and what each supports (pick HTTP for Service Connect):

| Namespace type | Created by | Supports Service Connect? | Supports DNS discovery? | When to use |
|---|---|---|---|---|
| HTTP | create-http-namespace | Yes | No (no hosted zone) | Service Connect; API-based discovery |
| DNS private | create-private-dns-namespace | Yes (also usable) | Yes (private hosted zone) | When you also need DNS resolution for non-SC consumers |
| DNS public | create-public-dns-namespace | No | Yes (public) | Public service discovery, not east-west |

The discovery-mechanism contrast in one place, because it is the crux of why people migrate:

| Property | Cloud Map DNS discovery | Service Connect |
|---|---|---|
| Resolution path | Client → Route 53 PHZ → A records | App → local proxy (no wire DNS) |
| Who load-balances | The client’s HTTP library | The client proxy, per request |
| Staleness on task death | DNS TTL window (seconds–minutes) | Push-based withdrawal, ~seconds |
| Dead-IP “connection refused” tail | Common | Eliminated |
| Requires a private hosted zone | Yes | No |
| Health awareness | Cloud Map health checks (coarse) | Real per-request outlier detection |
| Telemetry on the upstream chosen | None | Proxy access logs + metrics |
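The staleness row is easy to put into numbers. Here is a small worked example, with every value chosen purely for illustration: the 60-second TTL, the 3-second push delay, and both function names are assumptions, not measured or documented figures.

```python
def dns_stale_seconds(last_resolve_at, ttl, task_died_at):
    """Seconds a cached A record can keep pointing at a dead task."""
    if task_died_at < last_resolve_at:
        # The task was already gone when the client resolved; assume the
        # registry had dropped it, so the cache never held the dead IP.
        return 0
    cache_expires_at = last_resolve_at + ttl
    return max(0, cache_expires_at - task_died_at)


def push_stale_seconds(propagation_delay):
    """Seconds until a pushed withdrawal reaches the client proxy."""
    return propagation_delay


# Client resolved at t=0 with a 60 s TTL; the task dies at t=10.
print(dns_stale_seconds(last_resolve_at=0, ttl=60, task_died_at=10))
# Push model with an assumed 3 s propagation delay.
print(push_stale_seconds(propagation_delay=3))
```

In the DNS model the exposure depends on where in the TTL window the task happens to die; in the push model it depends only on how fast the withdrawal propagates.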

Built-in resilience: pooling, retries, timeouts, outlier detection

This is the reason to adopt Service Connect even if you were happy with discovery. The Envoy sidecar gives every call path mesh-grade resilience without a service mesh.

Connection pooling is automatic. The proxy keeps warm upstream connections to backend tasks and multiplexes requests, so you are not paying TCP and TLS handshake cost per request. For HTTP/2 and gRPC (appProtocol: http2 / grpc) it multiplexes streams over a single connection.

Per-request load balancing. Because the client proxy holds the full healthy-endpoint set, it load-balances per request across tasks, not per DNS resolution. A new task added by a scale-out starts taking traffic immediately; a task removed by a scale-in is drained.

Timeouts are configurable per service via timeout in the Service Connect config. idleTimeoutSeconds bounds idle connections; perRequestTimeoutSeconds caps a single request — critical for HTTP/1.1 where a slow upstream otherwise pins a connection:

{
  "portName": "http",
  "discoveryName": "ledger",
  "clientAliases": [{ "dnsName": "ledger", "port": 8080 }],
  "timeout": {
    "idleTimeoutSeconds": 60,
    "perRequestTimeoutSeconds": 15
  }
}

For long-poll or streaming endpoints, set perRequestTimeoutSeconds: 0 to disable the per-request cap on that service — otherwise the proxy will sever your stream at the timeout. Do this surgically, per discoveryName, never globally.
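If you generate these blocks rather than hand-writing them, the "surgical, per discoveryName" rule is easy to encode. Below is a minimal sketch of a helper that builds one services entry with the same field names as the JSON above. The helper name service_entry, the streaming flag, the "events" service, and the default values (taken from the example block, not from any documented proxy default) are assumptions for illustration.

```python
import json


def service_entry(discovery_name, port, port_name="http",
                  idle_s=60, per_request_s=15, streaming=False):
    """Build one Service Connect `services` entry as a plain dict.

    streaming=True writes perRequestTimeoutSeconds: 0 for this entry only,
    so the cap is lifted per discoveryName and never globally.
    """
    return {
        "portName": port_name,
        "discoveryName": discovery_name,
        "clientAliases": [{"dnsName": discovery_name, "port": port}],
        "timeout": {
            "idleTimeoutSeconds": idle_s,
            "perRequestTimeoutSeconds": 0 if streaming else per_request_s,
        },
    }


ledger = service_entry("ledger", 8080)
events = service_entry("events", 8080, streaming=True)  # hypothetical long-poll service
print(json.dumps([ledger, events], indent=2))
```

Keeping the streaming decision as an explicit argument makes every uncapped endpoint visible in code review.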

Retries and outlier detection are the headline. The proxy retries idempotent failures and ejects consistently-failing tasks from the load-balancing pool (Envoy outlier detection) so a single bad replica stops poisoning the call path. These are tuned through the Service Connect agent’s behavior rather than hand-written Envoy YAML; you express intent at the service level and ECS renders the proxy config. The practical effect: a task that starts returning 5xx — bad deploy, wedged thread pool, exhausted connections — is detected and pulled out of rotation for a cool-down window, then probed back in. With an internal ALB you would get this only if your health check happened to catch the failure mode, and never at per-request granularity.

The behavioral difference from an ALB is worth stating plainly: an ALB ejects a target when its health check fails on a fixed interval. Service Connect’s outlier detection ejects a target based on the actual request stream — real 5xx responses to real traffic — which catches partial and intermittent failure that a /healthz probe sails right past.

The resilience knobs, what each does, the default, and when to change it:

| Knob | What it controls | Default behaviour | When to change | Trade-off / gotcha |
|---|---|---|---|---|
| Connection pooling | Reuse of warm upstream connections | On, automatic | (not tuned directly) | HTTP/2+gRPC multiplex on one connection |
| Per-request LB | Endpoint chosen per request | On | (not tuned directly) | New tasks take traffic immediately |
| idleTimeoutSeconds | How long an idle conn is kept | proxy default | Tune for chatty vs bursty callers | Too low → reconnect churn |
| perRequestTimeoutSeconds | Max time for one request | proxy default | Cap slow upstreams; 0 for streams | 0 globally = no protection; do it per service |
| Retries | Re-attempt idempotent failures | On for idempotent | (managed by the agent) | Non-idempotent calls are not retried |
| Outlier detection | Eject tasks failing real requests | On | (managed by the agent) | Catches partial failure a probe misses |
| TLS (tls block) | Encrypt the advertised endpoint | Off | Compliance / zero-trust east-west | Needs an ACM PCA-issued cert |

How outlier detection beats a health check, mapped failure-mode by failure-mode:

| Backend failure mode | ALB health check (/healthz) | Service Connect outlier detection |
|---|---|---|
| Process down / port closed | Caught (probe fails) | Caught (connection fails) |
| Returns 200 on /healthz but 503 on real traffic | Missed (keeps routing) | Caught (ejects on real 5xx) |
| Wedged thread pool, slow but alive | Often missed | Caught via timeouts + ejection |
| One bad replica out of ten | Caught only if probe hits it | Caught per request, ejected fast |
| Intermittent 5xx (1 in 20) | Usually missed | Caught when the failure rate crosses the threshold |
| Connection-pool exhaustion under load | Missed | Caught (timeouts → ejection) |
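The second row of that table is the one that pages people, so here is a toy simulation of it. This is a sketch of the idea only: the Replica class, the consecutive-5xx rule, and the threshold of 5 are assumptions for illustration, not the agent's actual algorithm or settings.

```python
class Replica:
    """A backend whose shallow probe passes even when real traffic fails."""

    def __init__(self, name, serves_ok):
        self.name = name
        self.serves_ok = serves_ok

    def healthz(self):
        return 200  # shallow probe: always reports healthy

    def handle(self):
        return 200 if self.serves_ok else 503


def health_check_pool(replicas):
    """Probe-based view: keep every replica whose /healthz passes."""
    return [r.name for r in replicas if r.healthz() == 200]


def outlier_pool(replicas, requests_per_replica=10, consecutive_5xx=5):
    """Traffic-based view: eject after N consecutive 5xx on real requests."""
    healthy = []
    for r in replicas:
        streak = 0
        ejected = False
        for _ in range(requests_per_replica):
            streak = streak + 1 if r.handle() >= 500 else 0
            if streak >= consecutive_5xx:
                ejected = True
                break
        if not ejected:
            healthy.append(r.name)
    return healthy


pool = [Replica("task-a", True), Replica("task-b", False), Replica("task-c", True)]
print(health_check_pool(pool))  # ['task-a', 'task-b', 'task-c']
print(outlier_pool(pool))       # ['task-a', 'task-c']
```

The probe-based pool keeps task-b because the only signal it sees is /healthz; the traffic-based pool drops it because the signal is the response stream itself.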

Retry behaviour by HTTP method, because “the proxy retries” is not unconditional — only safe (idempotent) operations are re-attempted:

| Method | Idempotent? | Retried by the proxy? | Why |
|---|---|---|---|
| GET | Yes | Yes | Safe to repeat; no side effect |
| HEAD | Yes | Yes | Safe metadata fetch |
| PUT | Yes | Yes | Same result if repeated |
| DELETE | Yes | Yes | Deleting twice is the same end state |
| OPTIONS | Yes | Yes | No side effect |
| POST | No | No | May create/charge twice; never auto-retried |
| PATCH | No (generally) | No | Partial update may not be idempotent |
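The table reduces to a small predicate. The sketch below mirrors it in Python so the rule is unambiguous; it models the policy described here, not the proxy's real implementation. The function names, the limit of three attempts, and the "retry on any 5xx" condition are assumptions.

```python
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def should_retry(method, status, attempt, max_attempts=3):
    """Retry only idempotent methods, only on 5xx, only while attempts remain."""
    if attempt >= max_attempts:
        return False
    return method.upper() in IDEMPOTENT_METHODS and status >= 500


def call_with_retries(method, send, max_attempts=3):
    """Call send() until it succeeds or the policy says stop.

    Returns (final_status, attempts_made).
    """
    attempt = 0
    while True:
        attempt += 1
        status = send()
        if not should_retry(method, status, attempt, max_attempts):
            return status, attempt


def flaky():
    """A fake upstream: first call returns 503, second returns 200."""
    responses = iter([503, 200])
    return lambda: next(responses)


print(call_with_retries("GET", flaky()))   # (200, 2)
print(call_with_retries("POST", flaky()))  # (503, 1)
```

The POST result is the important one: the caller sees the 503 and has to decide for itself, typically with an idempotency key, whether repeating the call is safe.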

The proxy-surfaced failure conditions you will see in access logs, what each means, and where to look next:

| Condition (in proxy logs) | Meaning | Likely cause | Next look |
|---|---|---|---|
| no_healthy_upstream | Proxy has no healthy endpoint | Role/namespace misconfig, all replicas down | describe-services role; callee health |
| upstream_reset_before_response | Backend reset the connection | App crash/restart mid-request | Callee task logs; recent deploy |
| upstream_response_timeout | Per-request timeout hit | Slow backend or timeout too tight (or a stream) | perRequestTimeoutSeconds; backend p99 |
| upstream_connection_failure | Could not connect to the task | SG blocks the port; task not on the ENI port | Task SG ingress; portMappings |
| 5xx (passed through) | Backend returned a real 5xx | Application error | Callee app exceptions |
| outlier_eject event | Task removed from rotation | Replica failing real requests | ECS/ServiceConnect ejection metric |

A symptom→cause→confirm→fix view of the resilience layer, because these are the things that page you mid-migration:

| Symptom | Likely cause | Confirm | Fix |
|---|---|---|---|
| Outbound call 503s immediately | Caller is server-only, no client mode | describe-services → role has no client | Set client-and-server and redeploy |
| Stream cut at ~15s | Per-request timeout applied to a stream | Check timeout.perRequestTimeoutSeconds | Set 0 on that discoveryName only |
| No retries despite 5xx | appProtocol is raw TCP (L4) | portMappings[].appProtocol missing | Set http/http2/grpc, redeploy |
| Tasks constantly ejected | A replica failing real requests | ECS/ServiceConnect ejection metric | Fix/replace the bad replica; check its logs |
| p99 climbs after cutover | Idle timeout too low → reconnect churn | Compare latency pre/post on the alias | Raise idleTimeoutSeconds for chatty paths |

Service Connect vs internal ALB vs Cloud Map discovery

Pick by the problem, not the habit.

| Capability | Internal ALB | Cloud Map DNS discovery | Service Connect |
|---|---|---|---|
| Discovery mechanism | Static DNS to the ALB | Route 53 A records, client-resolved | Proxy control plane, no runtime DNS |
| Load balancing | At the ALB (extra hop) | Client-side, library-dependent | Client-side proxy, per request |
| Staleness on task death | ALB dereg delay | DNS TTL window | Seconds, push-based |
| Retries | No (client must) | No | Yes, in proxy |
| Outlier detection | Health-check based | None | Per-request, real traffic |
| L7 routing (paths/hosts) | Yes | No | No (name-to-service only) |
| Extra network hop | Yes | No | No |
| Per-hour cost | Per ALB + LCU | Namespace + queries | No LB charge; pay task/proxy |
| TLS termination | Yes, at ALB | N/A | Optional pass/terminate (ACM PCA) |
| Cross-account | Via PrivateLink | Via shared PHZ tricks | No (single account) |
| Per-call telemetry | ALB access logs (coarse) | None | Proxy access logs + per-call metrics |

The decision rule I give teams:

A point that bites people: Service Connect does not replace your ingress ALB. Public traffic still lands on an internet-facing ALB in front of the edge service. Service Connect replaces the internal ALBs between services. Keep the front door; demolish the interior hallways.

The decision as a lookup table — match the requirement to the tool:

| If you need… | Then use… | Because… |
|---|---|---|
| Path/host L7 routing (/v2/* → green) | Internal ALB | SC is name-to-service, not a router |
| WAF on east-west traffic | Internal ALB (+ WAF) | SC has no WAF integration |
| Public internet ingress | Internet-facing ALB | SC is intra-namespace only |
| East-west discovery + retries + ejection | Service Connect | Mesh-grade resilience, no ALB hop |
| To delete a pile of internal ALBs | Service Connect | Removes the cost and the hop |
| A Lambda/EC2 to find ECS tasks | Cloud Map DNS | The consumer can’t carry a sidecar |
| Cross-account service calls with IAM auth | VPC Lattice / PrivateLink | SC does not cross the account line |
| Weighted canary at the LB layer | Internal ALB (weighted TGs) | SC LB is even per-request, not weighted |

The cost dimensions side by side, so the migration’s financial case is explicit:

| Cost dimension | Internal ALB (per service) | Service Connect |
|---|---|---|
| Hourly LB charge | Yes, per ALB | None |
| LCU consumption | Yes, scales with traffic/conns | None |
| Compute tax | None (managed) | One sidecar’s CPU/memory per task |
| Cross-AZ data | Standard inter-AZ rates | Standard inter-AZ rates (same) |
| Health-check overhead | Per-target probes | None (control-plane push) |
| Operational overhead | TG + listener + Route 53 alias per edge | One namespace + per-service config |
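To turn that table into a number for your own estate, the arithmetic is simple enough to script. The sketch below is a back-of-envelope model only. Every rate and size in the example call is a placeholder, not an AWS price; substitute your region's current rates and your measured LCU usage. It also assumes the sidecar's cost can be approximated as extra vCPU and memory billed per task-hour. The function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # common billing approximation


def internal_alb_monthly(alb_count, hourly_rate, lcu_rate, avg_lcus):
    """Monthly cost of N internal ALBs: fixed hourly charge plus LCU usage."""
    return alb_count * HOURS_PER_MONTH * (hourly_rate + lcu_rate * avg_lcus)


def sidecar_monthly(task_count, sidecar_vcpu, sidecar_gb,
                    vcpu_hour_rate, gb_hour_rate):
    """Monthly cost of the extra CPU/memory one sidecar adds to every task."""
    per_task_hour = sidecar_vcpu * vcpu_hour_rate + sidecar_gb * gb_hour_rate
    return task_count * HOURS_PER_MONTH * per_task_hour


# PLACEHOLDER inputs: replace every value with your own figures.
albs = internal_alb_monthly(alb_count=50, hourly_rate=0.02, lcu_rate=0.01, avg_lcus=2.0)
sidecars = sidecar_monthly(task_count=200, sidecar_vcpu=0.25, sidecar_gb=0.0625,
                           vcpu_hour_rate=0.04, gb_hour_rate=0.004)
print(f"internal ALBs: {albs:.2f}/month, sidecars: {sidecars:.2f}/month")
```

The model makes the trade explicit: ALB cost scales with the number of service edges, while sidecar cost scales with the number of running tasks, so the result depends on your ratio of tasks to services.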

Incremental migration: dual-running endpoints, then cut over

You do not flip a 60-service estate at once. The migration is safe because Service Connect and your existing internal ALB can coexist on the same service.

Step 1 — turn on Service Connect as server, keep the ALB. Add serviceConnectConfiguration to the callee (payments) and redeploy. It now advertises payments into the namespace and stays behind its internal ALB. Nothing calls the new endpoint yet. Cost is one extra sidecar per task and zero risk to existing callers.

Step 2 — make callers clients. Add Service Connect (client or client-and-server) to one caller and redeploy. Its proxy now learns the payments endpoint. The application still points at the old ALB URL.

Step 3 — flip the URL for one caller. Change that caller’s dependency URL from the ALB hostname to the Service Connect alias, e.g. http://payments:8080. Roll it out, ideally behind a config flag so rollback is a flag, not a deploy. Watch the proxy metrics (next section). If error rate or p99 moves the wrong way, flip the flag back to the ALB URL — both paths are live.

# Caller task definition env — flip per service, behind a flag
environment = [
  { name = "PAYMENTS_URL", value = var.use_service_connect ? "http://payments:8080" : "http://payments.internal.example.com" }
]

Step 4 — drain and delete the ALB. Once every caller of payments resolves through Service Connect and has soaked, remove the ALB target group registration, delete the listener, the ALB, and the Route 53 alias. That is the moment the cost and the hop actually disappear — not when you enabled Service Connect, but when the last caller stops using the ALB.

The property that makes this safe: enabling Service Connect on a service does not change how its existing ALB traffic flows. The two discovery paths are independent. You migrate caller by caller, and each step is independently reversible.

The migration as a phased table — what changes, the risk, and how to roll back at each step:

| Step | What you change | Who is affected | Cost delta | Rollback |
|---|---|---|---|---|
| 1. Callee as server | Add SC server to payments, redeploy | Nobody (no caller uses it yet) | +1 sidecar/task on payments | Remove SC config, redeploy |
| 2. Caller as client | Add SC client+ to one caller, redeploy | That caller’s proxy learns endpoints | +1 sidecar/task on caller | Remove SC config, redeploy |
| 3. Flip the URL | Caller dependency URL → SC alias (behind flag) | That one call path | None | Toggle the flag back to ALB URL |
| 4. Delete the ALB | Remove TG reg, listener, ALB, R53 alias | Removes the internal ALB | −1 ALB hourly + LCU | Recreate ALB (slow); only after full soak |

A readiness checklist before each payments ALB deletion — every box must be ticked:

| Check | How to confirm | Must be true |
|---|---|---|
| Every caller flipped | Audit each caller’s dependency URL/flag | All on http://payments:8080 |
| Soak time elapsed | ≥ 48h on the flipped path | No error/p99 regression |
| SC metrics show all traffic | ECS/ServiceConnect request count per caller | Matches expected call volume |
| ALB target group traffic dropped | ALB RequestCount on that TG | Near zero |
| No non-ECS consumer of the ALB | grep configs / DNS for the ALB host | None remaining |
| Rollback path documented | Flag flips caller back to ALB URL | ALB still live until deletion |
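A checklist like this is easy to skip under pressure, so it can help to encode the mechanical parts as a gate. Here is a minimal sketch covering four of the six checks; the SC-metrics match and the rollback documentation are left to human review. The function name, its parameters, and the max_alb_requests threshold are hypothetical; the inputs would come from your own config audit and metrics.

```python
SC_ALIAS = "http://payments:8080"


def alb_deletion_blockers(caller_urls, soak_hours, regression_seen,
                          alb_request_count, non_ecs_consumers,
                          min_soak_hours=48, max_alb_requests=0):
    """Return the reasons the payments ALB must NOT be deleted yet."""
    blockers = []
    unflipped = sorted(n for n, url in caller_urls.items() if url != SC_ALIAS)
    if unflipped:
        blockers.append("callers still on the ALB: " + ", ".join(unflipped))
    if soak_hours < min_soak_hours:
        blockers.append(f"soak is {soak_hours}h, need {min_soak_hours}h")
    if regression_seen:
        blockers.append("error rate or p99 regressed during soak")
    if alb_request_count > max_alb_requests:
        blockers.append(f"ALB still serving {alb_request_count} requests")
    if non_ecs_consumers:
        blockers.append("non-ECS consumers remain: " + ", ".join(non_ecs_consumers))
    return blockers


callers = {
    "checkout": "http://payments:8080",
    "refunds": "http://payments.internal.example.com",
}
print(alb_deletion_blockers(callers, soak_hours=72, regression_seen=False,
                            alb_request_count=140, non_ecs_consumers=[]))
```

An empty list is the only result that should allow step 4 to proceed; anything else names exactly what is still holding the ALB in place.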

Cross-namespace and cross-account considerations

Service Connect discovery is scoped to a single namespace. A service in namespace prod cannot resolve a discoveryName advertised in namespace payments-prod. This is a deliberate isolation boundary, and it has consequences.

The architecture I land on for multi-account: each account runs its own namespace for internal traffic; anything that must cross an account boundary goes through a deliberate, observable PrivateLink or Lattice seam. Do not try to stretch a namespace across accounts — it is not a supported topology and you will fight it.

The boundaries Service Connect can and cannot cross, and what takes over at each seam:

| Boundary | Service Connect crosses it? | What you use instead | Notes |
| --- | --- | --- | --- |
| Service → service, same namespace | Yes | n/a | The intended use |
| Across namespaces, same account | No | Internal ALB / VPC endpoint at the seam | Split namespaces only for hard domain boundaries |
| Across accounts, same region | No | PrivateLink, VPC Lattice, shared ingress | Lattice adds IAM auth policies |
| Across regions | No | Cross-region ALB / Global Accelerator / Lattice | Latency + data-transfer cost apply |
| Shared VPC (RAM) subnets | No (namespace is single-owner) | PrivateLink / Lattice | Shared subnets ≠ shared namespace |
| To a non-ECS consumer (Lambda/EC2) | No (no sidecar) | Cloud Map DNS discovery | The sidecar is the gate |
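
Read as a decision procedure, the table checks the widest boundary first. A small sketch of that ordering, assuming each endpoint is described by a plain dict whose keys (is_ecs_task, region, account, namespace) are invented here for illustration; the shared-VPC row is not modelled:

```python
def discovery_path(caller, callee):
    """Pick the mechanism for one call edge, widest boundary first."""
    if not caller["is_ecs_task"]:
        return "Cloud Map DNS discovery"          # no sidecar, no Service Connect
    if caller["region"] != callee["region"]:
        return "Cross-region ALB / Global Accelerator / Lattice"
    if caller["account"] != callee["account"]:
        return "PrivateLink / VPC Lattice / shared ingress"
    if caller["namespace"] != callee["namespace"]:
        return "Internal ALB / VPC endpoint at the seam"
    return "Service Connect"


checkout = {"is_ecs_task": True, "region": "us-east-1", "account": "111", "namespace": "prod"}
payments = {"is_ecs_task": True, "region": "us-east-1", "account": "111", "namespace": "prod"}
print(discovery_path(checkout, payments))
```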

When to split a namespace versus keep one, as a decision table:

| Situation | One namespace | Split namespaces |
| --- | --- | --- |
| All services trust each other in prod | ✓ | |
| Hard domain/security boundary between teams | | ✓ (intentional) |
| You want zero cross-team east-west discovery | | ✓ |
| Frequent cross-team calls | ✓ (keep discoverable) | |
| Separate prod/staging/dev | | ✓ one per environment |
| Accidental split “for tidiness” | ✓ (avoid the split) | |

Telemetry: proxy metrics, per-call stats, debugging failures

The Service Connect agent emits metrics you do not get from DNS discovery, and they are how you debug a bad migration step.

Metrics. Set a logConfiguration on the Service Connect config so the agent ships its logs. Separately, the proxy emits CloudWatch metrics under the ECS/ServiceConnect namespace, including request counts, HTTP response codes, and request latency, keyed by DiscoveryName and TargetDiscoveryName. Watch these per dimension:

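# One hour of per-minute 5xx sums for checkout's calls to payments.
# Note: -v-1H is BSD/macOS date syntax; with GNU date use
#   date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ'
aws cloudwatch get-metric-statistics \
  --namespace ECS/ServiceConnect \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=DiscoveryName,Value=payments Name=ServiceName,Value=checkout \
  --start-time "$(date -u -v-1H '+%Y-%m-%dT%H:%M:%SZ')" \
  --end-time   "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
  --period 60 --statistics Sum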
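
The -v-1H flag in that command is BSD/macOS date syntax and fails under GNU date. One way to avoid the shell difference is to build the time window in Python and hand the arguments to the CLI. A sketch, with the namespace, metric name, and dimensions copied from the command above rather than independently verified:

```python
import shlex
from datetime import datetime, timedelta, timezone


def metric_query_args(discovery_name, service_name, now=None, window_minutes=60):
    """Build the argv for `aws cloudwatch get-metric-statistics`."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(minutes=window_minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return [
        "aws", "cloudwatch", "get-metric-statistics",
        "--namespace", "ECS/ServiceConnect",
        "--metric-name", "HTTPCode_Target_5XX_Count",
        "--dimensions",
        f"Name=DiscoveryName,Value={discovery_name}",
        f"Name=ServiceName,Value={service_name}",
        "--start-time", start.strftime(fmt),
        "--end-time", now.strftime(fmt),
        "--period", "60", "--statistics", "Sum",
    ]


# Print a copy-pasteable command; pass the list to subprocess.run to execute it.
print(shlex.join(metric_query_args("payments", "checkout")))
```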

Proxy logs. Route the agent’s logs to CloudWatch by adding a log config to the Service Connect block:

{
  "serviceConnectConfiguration": {
    "enabled": true,
    "namespace": "prod",
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/serviceconnect/checkout",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "sc"
      }
    },
    "services": [ /* ... */ ]
  }
}

When a call fails after you flip the URL, the proxy access logs are the source of truth. They show the upstream the proxy chose, the response code it got, and whether the request was retried — distinguishing “the proxy could not find a healthy payments” (namespace/role misconfig) from “payments returned 503” (backend problem). That distinction is exactly what DNS discovery cannot tell you, because with DNS the client picks blind and logs nothing about the upstream’s health.
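
To make that distinction mechanical, a triage helper can key on the response code plus Envoy's response flags. This is a sketch under a stated assumption: that the agent's access log exposes Envoy's standard response-flag field (UH for no healthy upstream, NR for no route, UF for upstream connection failure, UT for upstream request timeout). The exact log format of the managed agent is not shown in this article, so the function takes already-parsed values rather than a raw log line.

```python
# Envoy response flags that mean the proxy itself could not deliver the request.
PROXY_SIDE_FLAGS = {
    "UH": "no healthy upstream",
    "NR": "no route configured",
    "UF": "upstream connection failure",
}


def classify_failure(status_code, response_flags="-"):
    """Separate 'proxy had nowhere to send it' from 'backend answered badly'."""
    flags = [] if response_flags in ("", "-") else response_flags.split(",")
    for flag in flags:
        if flag in PROXY_SIDE_FLAGS:
            return "proxy: " + PROXY_SIDE_FLAGS[flag]
    if "UT" in flags:
        return "timeout: upstream request timed out"
    if 500 <= status_code <= 599:
        return "backend: upstream returned %d" % status_code
    return "ok or client-side"


print(classify_failure(503, "UH"))
print(classify_failure(503, "-"))
```

A "proxy" result points at namespace or role configuration; a "backend" result points at the payments service itself.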

The key metrics in the ECS/ServiceConnect namespace, the dimensions that matter, and what each tells you:

| Metric | Dimensions | What it tells you | Alarm on |
| --- | --- | --- | --- |
| Request count (per target) | DiscoveryName, ServiceName | Is the new path actually carrying traffic? | Unexpected zero after a flip |
| HTTP 5xx count | DiscoveryName, TargetDiscoveryName | Backend error rate on the call path | Sustained non-zero post-cutover |
| HTTP 4xx count | DiscoveryName | Client-side errors (auth, bad request) | Spike correlated with a deploy |
| Request latency (p50/p99) | DiscoveryName | Did removing the ALB hop change tail latency? | p99 regression vs ALB baseline |
| Outlier ejections | DiscoveryName, target | A replica failing real requests | Any sustained ejections |
| Connection count | DiscoveryName | Pool size / churn | Churn spikes (idle timeout too low) |

The logConfiguration options for the agent, what each does, and the gotcha:

| Option | What it sets | Example | Gotcha |
| --- | --- | --- | --- |
| logDriver | Where logs go | awslogs | Must be a driver the platform supports for the agent |
| awslogs-group | The CloudWatch Logs group | /ecs/serviceconnect/checkout | Create it (or let ECS) and set retention |
| awslogs-region | Region for the log group | us-east-1 | Must match the task’s region |
| awslogs-stream-prefix | Stream name prefix | sc | Helps separate agent logs from app logs |
| (retention) | How long logs are kept | 14–30 days | Set it or logs accumulate cost forever |

The alarms worth wiring before you cut over a path, with a starting threshold:

| Alarm | Metric + dimension | Starting threshold | Why |
| --- | --- | --- | --- |
| Backend 5xx on the new path | 5xx count by DiscoveryName | > 0 sustained 5 min | Catches a bad backend right after the flip |
| p99 regression | Latency p99 by DiscoveryName | > ALB baseline + 20% | Confirms the removed hop didn’t add tail latency |
| Outlier ejections | Ejection count by target | > 0 sustained | A replica is failing real traffic |
| Traffic dropped to zero | Request count by DiscoveryName | == 0 when expected | A flip silently broke the path |
| Connection churn | Connection count spikes | step-change | Idle timeout too low → reconnect storm |
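
Three of these thresholds are simple enough to state as predicates, which is handy for a soak-check script that runs alongside the real CloudWatch alarms. A sketch, assuming you feed it one-minute datapoints you have already fetched; the function names are illustrative:

```python
def sustained_5xx(counts_per_minute, minutes=5):
    """True if each of the last `minutes` one-minute 5xx sums is non-zero."""
    recent = counts_per_minute[-minutes:]
    return len(recent) == minutes and all(c > 0 for c in recent)


def p99_regressed(baseline_ms, current_ms, tolerance=0.20):
    """True if the current p99 exceeds the ALB baseline by more than `tolerance`."""
    return current_ms > baseline_ms * (1 + tolerance)


def traffic_vanished(request_counts):
    """True if datapoints exist for the window but they sum to zero."""
    return len(request_counts) > 0 and sum(request_counts) == 0


print(sustained_5xx([0, 1, 2, 1, 3, 1]))   # last five minutes all non-zero
print(p99_regressed(baseline_ms=100, current_ms=125))
print(traffic_vanished([0, 0, 0]))
```

A single 5xx blip does not trip sustained_5xx, which matches the "sustained 5 min" wording; an empty datapoint list does not trip traffic_vanished, because missing data and zero traffic are different failures.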

What the proxy access log distinguishes that DNS discovery cannot:

| You observe | DNS discovery says | Service Connect access log says |
| --- | --- | --- |
| A failed call | (nothing; client picked blind) | The exact upstream IP it chose |
| 503 to the caller | Could be anything | Whether it was “no healthy endpoint” or “backend 503” |
| Slow request | (no upstream record) | Upstream latency + whether it was retried |
| Intermittent errors | Invisible | Per-request response codes per target |
| A task being ejected | (no concept of ejection) | Ejection events + the failing target |

Architecture at a glance

The diagram traces an east-west call the way it actually flows under Service Connect, then marks the hops where a misconfiguration bites. Read it left to right. A public request lands on the ingress ALB (which Service Connect does not replace) and is routed to the checkout task. Inside that task, the application does not call a load balancer — it calls http://payments:8080, which its local Service Connect agent (client mode) intercepts. The agent holds a live, push-updated view of every healthy payments endpoint in the prod namespace (a Cloud Map HTTP namespace, no private hosted zone, no runtime DNS), picks one per request, and connects directly to a payments task’s agent (server mode) over the task ENI on port 8080 — no extra ALB hop. The same agent does the retries, the per-request timeout, and the outlier detection that ejects a payments replica returning real 5xx. Telemetry — proxy access logs and the ECS/ServiceConnect metrics — flows out to CloudWatch, which is how you prove the call is carried by the proxy and not the old internal ALB.

The numbered badges mark the five places this breaks during a migration. Badge 1 sits on the appProtocol/portName linkage — get it wrong and you either fail to register or silently drop to L4 pass-through with no retries. Badge 2 is the client role: a server-only caller has no endpoints to route to and 503s on its own outbound calls. Badge 3 is the namespace boundary — a name in another namespace or account is simply invisible. Badge 4 is the per-request timeout severing a stream. Badge 5 is outlier detection ejecting a bad replica, which is the system working but worth watching. Follow the path, find the badge on the hop you are debugging, and read the legend for the symptom, the confirm command, and the fix.

Figure: ECS Service Connect east-west architecture. A public request enters through an ingress ALB to a checkout Fargate task; the task's local Service Connect Envoy agent in client mode intercepts the call to http://payments:8080, resolves it against the prod Cloud Map HTTP namespace with no runtime DNS, and connects per-request directly to a healthy payments task's agent in server mode over the task ENI on port 8080 with retries, per-request timeouts and outlier detection, while proxy access logs and ECS/ServiceConnect metrics flow to CloudWatch. Numbered badges mark the appProtocol/portName linkage, the client-role requirement, the single-namespace boundary, the per-request stream timeout, and outlier ejection of a failing replica.

Real-world scenario

A fintech platform team — call it Larkspur Pay — ran ~70 ECS-on-Fargate services in a single prod account, each fronted by its own internal ALB for east-west calls. Two problems compounded. The internal-ALB bill was material: 70 ALBs plus LCU consumption, for traffic that never left the VPC. And they had a recurring incident class where one wedged replica of a downstream service kept passing its shallow /healthz check while returning 503s to real traffic — the ALB never ejected it, and callers saw a steady 0.5% error rate that paged on-call weekly. Monthly east-west ALB spend was roughly ₹140,000, and the wedged-replica page had fired nine times in the prior quarter.

The constraint: they could not take a maintenance window across 70 services, and a hard org rule required every change to be reversible by config flag, not redeploy. The platform team was six engineers, and any plan that needed a synchronized cutover was dead on arrival.

They adopted Service Connect incrementally. One prod namespace, every service enabled as client-and-server over two sprints. Each caller’s downstream URL moved behind a flag (USE_SERVICE_CONNECT), defaulting to the ALB. They flipped one high-traffic path first (checkout → payments), soaked 48 hours, and the wedged-replica incident class disappeared on that path: outlier detection ejected the bad task on real 5xx responses within the cool-down window instead of waiting for a health check that never failed. The proxy access logs were decisive during the soak: when one call failed, the log showed whether the proxy could not find a healthy payments (which would have meant a role/namespace misconfig) or payments itself returned 503 (a backend problem). With the old setup they had been guessing.

The one genuinely tricky service was a market-data feed with a long-lived gRPC stream. The first cutover attempt severed the stream at the default per-request timeout. The fix was to disable the per-request cap on exactly that discoveryName, never globally:

{
  "portName": "grpc-stream",
  "discoveryName": "marketdata",
  "clientAliases": [{ "dnsName": "marketdata", "port": 9000 }],
  "timeout": { "perRequestTimeoutSeconds": 0 }
}
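
The same mistake is easy to catch before cutover by scanning the Service Connect services list for streaming endpoints that still carry a per-request cap. A sketch, assuming you maintain the set of streaming discoveryNames yourself and that an unset timeout means the roughly 15-second default this scenario ran into:

```python
def streams_at_risk(sc_services, streaming_names, default_cap_seconds=15):
    """List (discoveryName, cap) for streaming endpoints with an active cap."""
    at_risk = []
    for svc in sc_services:
        name = svc.get("discoveryName", svc.get("portName"))
        if name not in streaming_names:
            continue
        cap = svc.get("timeout", {}).get("perRequestTimeoutSeconds", default_cap_seconds)
        if cap != 0:  # 0 disables the per-request cap
            at_risk.append((name, cap))
    return at_risk


services = [
    {"portName": "grpc-stream", "discoveryName": "marketdata"},
    {"portName": "http", "discoveryName": "payments"},
]
print(streams_at_risk(services, {"marketdata"}))
```

Non-streaming services are deliberately ignored, so the check never nudges anyone toward disabling the cap globally.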

After every caller of a given service was flipped and soaked, they deleted that service’s internal ALB — target group registration, listener, ALB, Route 53 alias. That was the moment the cost actually dropped, not when Service Connect was enabled.

Outcome after full cutover: 60-plus internal ALBs deleted, one extra sidecar per task in their place, the weekly wedged-replica page gone, and east-west p99 down slightly because they removed the ALB hop. East-west ALB spend fell from ~₹140,000/month to ~₹18,000 (the remaining ALBs were the public ingress and two services doing genuine L7 path routing), against a small rise in Fargate cost for the sidecars — a net monthly saving in the six figures of rupees. The lesson on the wall: “An internal ALB for an east-west call is a load balancer doing a service-mesh’s job badly. Delete it after the last caller moves — not before.”

The migration as a timeline, because the order of moves is the lesson:

| Phase | Action | Effect | What it should have been (if different) |
| --- | --- | --- | --- |
| Sprint 1 | Enable SC client-and-server on all services | Endpoints advertised; nothing routes yet | Correct: server-first is zero-risk |
| Sprint 1 | Flip checkout → payments behind a flag | Wedged-replica incident class gone on that path | The decisive proof point |
| Sprint 1 | First marketdata cutover | gRPC stream severed at ~15s | Set perRequestTimeoutSeconds: 0 first |
| Sprint 2 | Flip remaining callers, path by path | Each independently reversible by flag | Correct: never a synchronized cutover |
| Sprint 2 | Soak 48h per path, watch proxy logs/metrics | Confirmed traffic on the new path | The soak is non-negotiable |
| Sprint 2 | Delete each ALB after its last caller moved | Cost actually drops here | Not at enable-time (a common mistake) |

Advantages and disadvantages

Service Connect trades a pile of managed-but-costly internal ALBs for a managed sidecar in every task. Weigh it honestly:

| Advantages (why this helps you) | Disadvantages (why it bites) |
| --- | --- |
| Deletes internal ALBs: removes hourly + LCU cost and an extra hop per east-west call | Adds one Envoy sidecar per task: a CPU/memory tax that scales with task count |
| Mesh-grade resilience (retries, per-request timeouts, outlier detection) with no mesh to operate | Outlier detection and retries are managed, not deeply tunable: you express intent, not raw Envoy config |
| DNS-free, push-based discovery: no TTL staleness, no “connection refused to a dead IP” tail | Discovery is scoped to one namespace in one account: no cross-account, no cross-namespace |
| Per-call telemetry (proxy access logs + ECS/ServiceConnect metrics) DNS discovery can’t give you | Requires logConfiguration to be set or you fly blind on the access logs |
| Name-to-service simplicity: http://payments:8080 just works | Not an L7 router: no path/host rules, no WAF, no weighted canaries at the LB |
| Migration is incremental and reversible: coexists with the ALB per service | The sidecar is the gate: any non-ECS consumer can’t use it; you keep a fallback |
| Reacts to topology change in seconds, per request | TLS east-west needs ACM PCA setup; it’s not on by default |

Service Connect is right for east-west service-to-service traffic inside a single namespace and account where you want discovery plus resilience and want to delete internal ALBs. It is the wrong tool when you need real L7 routing (keep the ALB), when callers cross an account or namespace boundary (use PrivateLink or Lattice), or when a non-ECS consumer must resolve the service (use Cloud Map DNS). The sidecar tax is real but usually small next to the deleted ALB bill — quantify it for your task count before assuming.

Hands-on lab

Stand up two Fargate services in one namespace, prove checkout reaches payments through the Service Connect proxy (not an ALB), then tear it all down. Fargate has no free tier, so the tasks bill for as long as they run; the lab stays cheap only if you delete everything at the end. Run in a shell with the AWS CLI configured and an existing VPC with two private subnets that have outbound access (for example through a NAT gateway) so the tasks can pull the public image, plus a security group allowing intra-SG traffic on 8080.

Step 1 — Variables and the HTTP namespace.

export AWS_REGION=us-east-1
CLUSTER=sc-lab
NS=sc-lab-ns
SUBNETS=subnet-aaa,subnet-bbb        # two private subnets
SG=sg-ccc                            # allows TCP 8080 within the SG

aws servicediscovery create-http-namespace --name $NS

Step 2 — Create the cluster with the namespace as default.

aws ecs create-cluster --cluster-name $CLUSTER \
  --capacity-providers FARGATE \
  --service-connect-defaults namespace=$NS

Expected: a cluster JSON with status: ACTIVE and the serviceConnectDefaults.namespace set.

Step 3 — Register a payments task definition that binds 8080 with a named port. The key fields are the portMappings[].name and appProtocol.

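{
  "family": "payments",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256", "memory": "512",
  "executionRoleArn": "arn:aws:iam::<acct>:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::<acct>:role/<task-role-for-ecs-exec>",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "public.ecr.aws/docker/library/httpd:2.4",
      "command": ["sh", "-c", "sed -i 's/^Listen 80$/Listen 8080/' /usr/local/apache2/conf/httpd.conf && exec httpd-foreground"],
      "portMappings": [
        { "name": "http", "containerPort": 8080, "protocol": "tcp", "appProtocol": "http" }
      ]
    }
  ]
}

Register it: aws ecs register-task-definition --cli-input-json file://payments-td.json. Two details in this file are easy to miss. The stock httpd image listens on 80, so the command override rewrites its Listen directive to 8080 to match the port mapping. And taskRoleArn is a placeholder for a task role that grants the ssmmessages permissions ECS Exec needs in Step 7; substitute your own.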

Step 4 — Create the payments service as a Service Connect server.

aws ecs create-service --cluster $CLUSTER --service-name payments \
  --task-definition payments --desired-count 2 --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[$SUBNETS],securityGroups=[$SG],assignPublicIp=DISABLED}" \
  --service-connect-configuration '{
    "enabled": true,
    "namespace": "'"$NS"'",
    "services": [
      { "portName": "http", "discoveryName": "payments",
        "clientAliases": [ { "dnsName": "payments", "port": 8080 } ] }
    ]
  }'

Step 5 — Create a checkout service as a client (same task def family for the lab; it just needs the agent and ECS Exec enabled).

aws ecs create-service --cluster $CLUSTER --service-name checkout \
  --task-definition payments --desired-count 1 --launch-type FARGATE \
  --enable-execute-command \
  --network-configuration "awsvpcConfiguration={subnets=[$SUBNETS],securityGroups=[$SG],assignPublicIp=DISABLED}" \
  --service-connect-configuration '{ "enabled": true, "namespace": "'"$NS"'" }'

Step 6 — Verify the namespace, the registered endpoint, and the agent sidecar.

# Namespace is HTTP type
aws servicediscovery list-namespaces \
  --query "Namespaces[?Name=='$NS'].[Name,Type,Id]" --output table

# payments registered a Service Connect endpoint
aws ecs describe-services --cluster $CLUSTER --services payments \
  --query "services[0].serviceConnectConfiguration" --output json

# Tasks run the managed SC agent sidecar
TASK=$(aws ecs list-tasks --cluster $CLUSTER --service-name payments --query 'taskArns[0]' --output text)
aws ecs describe-tasks --cluster $CLUSTER --tasks $TASK \
  --query "tasks[0].containers[].name" --output json

Step 7 — Prove the call is carried by the proxy, not an ALB. Exec into the checkout task and curl the logical name; a 200 from a private task IP (not an ALB IP) is the proof.

CHECKOUT=$(aws ecs list-tasks --cluster $CLUSTER --service-name checkout --query 'taskArns[0]' --output text)
aws ecs execute-command --cluster $CLUSTER --task $CHECKOUT --container app --interactive \
  --command "curl -s -o /dev/null -w '%{http_code} %{remote_ip}\n' http://payments:8080/"

Step 8 — Teardown. Delete services, cluster, and namespace.

aws ecs update-service --cluster $CLUSTER --service checkout --desired-count 0
aws ecs update-service --cluster $CLUSTER --service payments --desired-count 0
aws ecs delete-service --cluster $CLUSTER --service checkout --force
aws ecs delete-service --cluster $CLUSTER --service payments --force
# delete-cluster fails while services are still draining, so wait for them first
aws ecs wait services-inactive --cluster $CLUSTER --services checkout payments
aws ecs delete-cluster --cluster $CLUSTER
NSID=$(aws servicediscovery list-namespaces --query "Namespaces[?Name=='$NS'].Id" --output text)
aws servicediscovery delete-namespace --id $NSID

The lab’s expected results at a glance:

| Step | Command | Expected result |
| --- | --- | --- |
| 2 | create-cluster | status: ACTIVE, default namespace set |
| 4 | create-service payments | Service ACTIVE, 2 tasks RUNNING |
| 6 | describe-services SC query | serviceConnectConfiguration.enabled = true, discoveryName: payments |
| 6 | describe-tasks containers | Two containers: your app plus the managed SC agent |
| 7 | exec curl | 200 <private-task-IP> (a 10.x/100.64.x IP, never an ALB IP) |
| 8 | teardown | Services drained, cluster + namespace deleted |

Common mistakes & troubleshooting

These are the real failure modes, with the symptom, the root cause, the exact confirm command, and the fix.

| # | Symptom | Root cause | Confirm | Fix |
| --- | --- | --- | --- | --- |
| 1 | Service won’t register an endpoint | portName doesn’t match a portMappings[].name | describe-task-definition → compare names | Make services[].portName equal the port mapping name |
| 2 | Outbound call to a dependency 503s | Caller is server-only (no client mode) | describe-services → role advertises but no client | Set the caller client-and-server, redeploy |
| 3 | No retries despite 5xx; raw bytes only | appProtocol unset → L4 pass-through | describe-task-definition → appProtocol missing | Set http/http2/grpc on the port mapping |
| 4 | Long-lived stream cut at ~15s | Per-request timeout applied to a stream | Inspect timeout.perRequestTimeoutSeconds | Set 0 on that discoveryName only |
| 5 | http://payments doesn’t resolve from a Lambda | No sidecar; SC needs the agent | The caller is not an ECS task | Use Cloud Map DNS discovery for non-ECS consumers |
| 6 | Cross-team call fails to resolve | Callee is in a different namespace | list-namespaces; compare both services | Same namespace, or use ALB/PrivateLink at the seam |
| 7 | Cross-account call has no path | SC is single-account | Accounts differ | PrivateLink or VPC Lattice at the boundary |
| 8 | No proxy access logs to debug with | logConfiguration not set | Check SC config for logConfiguration | Add an awslogs log config, redeploy |
| 9 | Tasks constantly ejected, errors persist | A replica failing real requests | ECS/ServiceConnect ejection metric + app logs | Fix/replace the bad replica; check its 5xx source |
| 10 | Deleted the ALB, callers broke | Deleted before the last caller flipped | Audit caller URLs/flags | Recreate ALB; only delete after full soak |
| 11 | Connection refused to a dead IP (still) | Caller still on Cloud Map DNS, not SC | Caller URL is *.local, not the SC alias | Flip the URL to the SC alias; SC is push-based |
| 12 | App can’t reach localhost proxy endpoint | App binds/calls wrong interface or port | Exec + curl http://<dnsName>:<port> | Call the clientAlias dnsName:port, not the container’s own IP |
| 13 | Security group blocks the task-to-task call | SG doesn’t allow intra-SG 8080 | describe-security-groups; check ingress | Allow the port within the SG (or from the caller SG) |
| 14 | Enabled SC but the bill didn’t drop | ALBs still present (not deleted) | List ALBs / target groups | Delete the now-unused internal ALBs |
| 15 | Endpoint registered under the wrong name | discoveryName defaulted to portName | describe-services SC config | Set discoveryName explicitly to the intended name |
| 16 | Caller dials the container’s own IP | App uses task metadata IP, not the alias | Exec + inspect the URL the app builds | Call http://<dnsName>:<port> (the clientAlias) |
| 17 | New tasks not taking traffic | Stale view (very brief) or task unhealthy | ECS/ServiceConnect per-target request count | Usually resolves in seconds; check task health |
| 18 | gRPC works but no retries | appProtocol is http not grpc | portMappings[].appProtocol | Set grpc for proper HTTP/2 framing + retries |
| 19 | TLS east-west not encrypting | tls block not configured | SC config has no tls | Add ACM PCA cert config to the advertised service |
| 20 | Two services collide on a name | Same discoveryName in one namespace | List advertised names in the namespace | Give each service a unique discoveryName |

The three highest-frequency mistakes, in detail

Port-name mismatch (row 1). The serviceConnectConfiguration.services[].portName must exactly equal a portMappings[].name in the task definition. If they differ — even a casing slip — the service silently fails to register a Service Connect endpoint and peers can’t find it.

# Confirm the names line up
aws ecs describe-task-definition --task-definition payments \
  --query "taskDefinition.containerDefinitions[].portMappings[].name" --output json

Server-only caller (row 2). A service set to advertise endpoints but not consume has no client proxy, so its own outbound calls have no endpoint set to route to and 503 immediately. The fix is client-and-server.

aws ecs describe-services --cluster prod --services checkout \
  --query "services[0].serviceConnectConfiguration" --output json

L4 pass-through (row 3). Omit appProtocol and the proxy treats the traffic as raw TCP — you keep discovery and pooling but lose HTTP-aware retries and outlier detection. Declare the protocol on the port mapping and the L7 features switch on.
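
Rows 1 and 3 are both linkage errors between two JSON documents, so they can be linted before deploy. A minimal sketch, assuming you have the task definition and the Service Connect services list loaded as Python dicts (for example from the files you pass to the CLI); the function name and the problem strings are illustrative:

```python
L7_PROTOCOLS = {"http", "http2", "grpc"}


def check_port_linkage(task_definition, sc_services):
    """Return (problem, portName) pairs for the two linkage mistakes."""
    mappings = {}
    for container in task_definition.get("containerDefinitions", []):
        for pm in container.get("portMappings", []):
            if "name" in pm:
                mappings[pm["name"]] = pm
    problems = []
    for svc in sc_services:
        port_name = svc["portName"]
        pm = mappings.get(port_name)  # exact, case-sensitive match
        if pm is None:
            problems.append(("no matching portMappings[].name", port_name))
        elif pm.get("appProtocol") not in L7_PROTOCOLS:
            problems.append(("appProtocol unset: L4 pass-through", port_name))
    return problems


task_def = {"containerDefinitions": [{"name": "app", "portMappings": [
    {"name": "http", "containerPort": 8080, "protocol": "tcp"}]}]}
print(check_port_linkage(task_def, [{"portName": "HTTP"}, {"portName": "http"}]))
```

The example prints one finding per mistake: the casing slip on "HTTP" fails the name match, and the correctly named "http" mapping is flagged because it declares no appProtocol.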

Best practices

Security notes

The security posture of Service Connect is mostly inherited from the task and network model, with a few topic-specific points.

The security-relevant settings and how to reason about each:

| Control | Default | Hardened setting | Why |
| --- | --- | --- | --- |
| Task SG ingress | Your design | Allow port only from caller SG (not 0.0.0.0/0) | Least-network; SC needs only task-to-task |
| East-west TLS (tls) | Off | ACM PCA-issued cert per advertised endpoint | Encrypt + authenticate east-west |
| assignPublicIp | varies | DISABLED on private subnets | No public IP on tasks |
| Task vs execution role | Separate | Scope each to least privilege | SC doesn’t change this; keep it tight |
| Proxy access logs | Off (no logConfiguration) | awslogs group with retention | Auditability + debugging |
| Cross-account exposure | None (SC is single-account) | PrivateLink/Lattice with IAM auth | Controlled, authenticated boundary |
| ACM PCA root trust | n/a | Private CA scoped to the estate | Issues the east-west TLS certs |
| Namespace ownership | Single account | Keep it in the workload account | No cross-account discovery to abuse |
| Internal ALB surface | Many listeners/TGs | Deleted | Fewer misconfigurable edges |
| ECS Exec (debugging) | Off | On only when needed, audited | execute-command is powerful; scope it |

Cost & sizing

The economics of Service Connect are a trade: you delete internal ALBs and add one sidecar per task. Whether you come out ahead depends on how many ALBs you delete versus how many tasks carry the sidecar.

What drives the Service Connect cost: the managed Envoy agent consumes a slice of each task’s CPU and memory. There is no separate per-hour Service Connect charge — you pay for the extra compute the sidecar uses, multiplied by your task count. Inter-AZ data transfer is the same as before (it is task-to-task either way).

What you delete: each internal ALB you retire removes its hourly charge plus LCU consumption (new connections, active connections, processed bytes, rule evaluations). At fifty internal ALBs this is the dominant term and almost always swamps the sidecar tax.

# Roughly size the sidecar tax: tasks × per-task overhead.
# Count running tasks across the cluster you're migrating:
aws ecs list-tasks --cluster prod --desired-status RUNNING \
  --query "length(taskArns)" --output text

The cost trade as a table — the levers and their direction:

| Lever | Direction | Magnitude | Notes |
| --- | --- | --- | --- |
| Internal ALBs deleted | Saves | Largest term | Hourly + LCU per ALB removed |
| Sidecar CPU/memory per task | Costs | Small per task | Scales with task count, not traffic |
| Inter-AZ data transfer | Neutral | Same | Task-to-task either way |
| Extra network hop removed | Saves (latency) | p99 improvement | Not a billed line, but real |
| Health-check overhead removed | Saves (minor) | Negligible | No per-target probes |
| ACM PCA (if you enable TLS) | Costs | Per CA + per cert | Only if you turn on east-west TLS |
| CloudWatch Logs (proxy access logs) | Costs | Per GB ingested + stored | Set retention; cheap vs ALB savings |
| CloudWatch metrics | Costs | Per custom metric / dimension | ECS/ServiceConnect dimensions add up at scale |
| Route 53 alias records removed | Saves (tiny) | Negligible | One fewer record per deleted ALB |

Rough sizing intuition — when Service Connect wins on cost:

| Estate shape | Internal ALBs | Tasks | Service Connect verdict |
| --- | --- | --- | --- |
| Many services, modest task counts | 50+ | a few hundred | Strong win: ALB savings dominate |
| Few services, huge task counts | 3–5 | thousands | Marginal: sidecar tax grows; measure |
| Mostly L7-routed public traffic | (ingress only) | any | Little to delete; keep the ALBs |
| Classic microservices, 1 ALB per edge | one per edge | moderate | The canonical win |

A worked example: retiring 60 internal ALBs removes their hourly + LCU charges (in the Larkspur Pay case, ~₹140,000/month down to ~₹18,000 for the surviving ingress and L7 ALBs). The added Fargate cost for ~600 task sidecars was a fraction of that, netting a six-figure-rupee monthly saving — and a small p99 improvement from the deleted hop. Always compute your own task count against your own ALB count before assuming the direction.
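
The arithmetic behind that advice fits in two functions. In this sketch the ALB figures are the Larkspur Pay numbers from the worked example, while the per-task sidecar cost of 20 is a made-up placeholder, not a measured or published figure; substitute the extra Fargate spend you actually observe per task.

```python
def net_monthly_saving(alb_spend_before, alb_spend_after, tasks, sidecar_cost_per_task):
    """ALB spend removed minus the sidecar tax, in the same currency unit."""
    gross = alb_spend_before - alb_spend_after
    return gross - tasks * sidecar_cost_per_task


def break_even_sidecar_cost(alb_spend_before, alb_spend_after, tasks):
    """Per-task sidecar cost at which the migration saves nothing."""
    return (alb_spend_before - alb_spend_after) / tasks


# 140,000 -> 18,000 ALB spend, ~600 tasks, hypothetical 20 per task per month.
print(net_monthly_saving(140_000, 18_000, 600, 20))
print(round(break_even_sidecar_cost(140_000, 18_000, 600), 2))
```

The break-even number is the more useful of the two: if your measured per-task overhead is far below it, the direction of the trade is not in doubt, and if it is close, the "few services, huge task counts" row above applies.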

Interview & exam questions

Q1. What are the three components of ECS Service Connect, and what does each do? A Cloud Map HTTP namespace (the logical discovery boundary), a managed Envoy agent sidecar injected per task (it does discovery, load balancing, retries, and outlier detection), and client/server roles per service that decide whether a service advertises endpoints, consumes them, or both. (Maps to the AWS DevOps Pro and SA Pro container topics.)

Q2. Why might a service’s outbound call 503 even though the target is healthy? Because the calling service is configured as server only — it advertises endpoints but has no client proxy, so its outbound calls have no endpoint set to route to. The fix is client-and-server.

Q3. How does Service Connect differ from Cloud Map DNS discovery on staleness? DNS discovery resolves A records against a private hosted zone and caches them for the TTL, so a dead task can linger in the resolver cache. Service Connect is push-based — the control plane withdraws an endpoint from every client proxy within seconds — so there is no TTL staleness and no dead-IP tail.

Q4. What does appProtocol control, and what breaks if you omit it? It declares the L7 protocol (http/http2/grpc) on a port mapping, which unlocks HTTP-aware retries, per-request stats, and outlier detection. Omit it and the proxy degrades to L4 pass-through: you keep discovery and pooling but lose the L7 resilience features.

Q5. How does outlier detection differ from an ALB health check? An ALB ejects a target when a periodic health check fails. Outlier detection ejects a task based on the actual request stream — real 5xx to real traffic — so it catches a replica that passes /healthz but returns 503s, which an ALB never notices.

Q6. Can Service Connect span accounts or namespaces? No. Discovery is scoped to one namespace in one account. Cross-namespace or cross-account calls need an internal ALB, PrivateLink, or VPC Lattice at the seam; Service Connect handles intra-namespace, intra-account east-west.

Q7. What is the correct, reversible way to migrate one call path off an internal ALB? Enable Service Connect as server on the callee, make the caller a client, then flip the caller’s dependency URL to the Service Connect alias behind a config flag. Soak, and only delete the ALB after every caller of that service has flipped — each step is independently reversible.

Q8. When should you set perRequestTimeoutSeconds: 0? Only on streaming or long-poll endpoints, set per discoveryName, never globally — otherwise the proxy severs long-lived connections at the per-request cap.

Q9. How do you prove a call is carried by the proxy and not an ALB? Exec into the caller and curl the logical name; a 200 from a private task IP (not the ALB’s IP) confirms it, cross-checked against ECS/ServiceConnect request-count metrics on the target’s DiscoveryName.

Q10. What does enabling Service Connect cost, and what does it save? It adds one Envoy sidecar’s CPU/memory per task (cost scales with task count). It saves the hourly + LCU charges of every internal ALB you delete, plus an extra network hop per call. The net is a win when you delete many ALBs relative to your task count.

Q11. Does Service Connect replace your public ingress ALB? No. It replaces internal east-west ALBs between services. Public traffic still lands on an internet-facing ALB, and any service doing real L7 path/host routing keeps its ALB.

Q12. What’s the single most common Service Connect misconfiguration? A mismatch between services[].portName in the Service Connect config and a portMappings[].name in the task definition — the service silently fails to register an endpoint.

Quick check

  1. A checkout service is set to server only and its calls to payments return 503. What’s the fix, in one change?
  2. You flip a caller to http://payments:8080 and its long-lived gRPC stream gets cut after a few seconds. What setting, on which scope, fixes it?
  3. True or false: enabling Service Connect on a service immediately stops its existing internal-ALB traffic.
  4. A Lambda needs to resolve ECS payments tasks. Can it use Service Connect? If not, what does it use?
  5. After full migration the internal-ALB bill hasn’t dropped. What did you forget to do?

Answers

  1. Change checkout to client-and-server and redeploy — a server-only service has no client proxy, so its outbound calls have no endpoints to route to.
  2. Set perRequestTimeoutSeconds: 0 on the marketdata (gRPC) discoveryName only — never globally — so the proxy stops severing the stream at the per-request cap.
  3. False. Enabling Service Connect is additive; the existing ALB path is independent and keeps flowing until you flip callers and delete the ALB.
  4. No — Service Connect requires the Envoy sidecar, which a Lambda can’t carry. The Lambda uses Cloud Map DNS discovery against a private hosted zone instead.
  5. Delete the now-unused internal ALBs (target group registration, listener, ALB, Route 53 alias). The cost drops when the ALB is deleted, not when Service Connect is enabled.

Glossary

Next steps


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
