ECS Service Connect Deep Dive: Service Discovery, Traffic Resilience, and Migrating Off ALBs

Most ECS estates accumulate internal ALBs the way attics accumulate boxes. Service A needs to call service B, so someone stands up an internal ALB, a target group, a listener, a Route 53 alias, and a security group rule — for a call path that never leaves the VPC and never sees a browser. Multiply by fifty services and you are paying for fifty load balancers, fifty health-check configurations, and an extra network hop on every east-west request, all to do something an ALB was never designed for: client-side service discovery with retries.

ECS Service Connect collapses that. It runs a managed Envoy sidecar in every task, registers a logical name in a Cloud Map namespace, and lets http://payments:8080 resolve and load-balance directly to healthy payments tasks, with connection pooling, retries, timeouts, and outlier detection handled by the proxy. No internal ALB, no per-call DNS lookup, no extra load-balancer hop. This is how it actually works, where it beats and loses to the alternatives, and how to migrate without a flag day.

1. The architecture: agent proxy, namespaces, client and server modes

Service Connect has three moving parts, and understanding the split is the whole game.

The namespace is an AWS Cloud Map HTTP namespace. It is the logical boundary inside which services discover each other by short name. You create it once per environment (one for prod, one for staging, and so on) and point every service in that environment at it. Unlike Cloud Map’s older DNS-based service discovery, a Service Connect namespace does not require a private hosted zone or DNS queries at runtime; discovery happens in the proxy’s control plane.

The Service Connect agent is a managed Envoy proxy that ECS injects as a sidecar container into every participating task. You do not write an Envoy config and you do not manage the container image — ECS owns its lifecycle, pushes endpoint updates to it, and ships its metrics. Your application talks to localhost-style endpoints the proxy exposes; the proxy handles the actual connection to a healthy backend task.

Client vs server roles are set per service in its serviceConnectConfiguration:

A server (or client-and-server) advertises one or more endpoints into the namespace. It declares a portName from its task definition and a discoveryName (the short name peers will call) plus a clientAliases entry giving the DNS name and port other services use.
A client only consumes. It joins the namespace so its proxy learns every advertised endpoint, but it publishes nothing.

A frontend that calls APIs but exposes nothing internally is a pure client. A payments service that both serves peers and calls ledger is client-and-server. The distinction matters: a service must be client or client-and-server for its outbound calls to resolve through Service Connect. I have watched teams set a service to server only, then wonder why its outbound call to a dependency 503s — the proxy is not in client mode, so it has no endpoints to route to.

Here is a minimal client-and-server block from a service definition (JSON for the CreateService call):

{
  "serviceConnectConfiguration": {
    "enabled": true,
    "namespace": "prod",
    "services": [
      {
        "portName": "http",
        "discoveryName": "payments",
        "clientAliases": [
          { "dnsName": "payments", "port": 8080 }
        ]
      }
    ]
  }
}

The portName (http) must match a name on a portMappings entry in the task definition. That linkage is mandatory and is the single most common misconfiguration.

{
  "name": "app",
  "portMappings": [
    { "name": "http", "containerPort": 8080, "protocol": "tcp", "appProtocol": "http" }
  ]
}

Set appProtocol deliberately. http (or http2/grpc) is what unlocks L7 features — retries on status codes, per-request stats. Leave it as raw TCP and the proxy degrades to L4 pass-through: you keep discovery and connection pooling but lose HTTP-aware retries and outlier detection.
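
The portName linkage and the appProtocol choice described above are easy to check mechanically before a deploy. The following is a minimal sketch, assuming you have the task definition and the serviceConnectConfiguration as plain dictionaries; the function name and the "admin" and "metrics" ports are hypothetical, not part of any AWS SDK.

```python
def service_connect_port_findings(task_definition: dict, service_connect: dict) -> dict:
    """Cross-check Service Connect portNames against the task definition.

    Returns two lists:
      missing  - portNames with no matching named port mapping
      l4_only  - portNames whose mapping sets no appProtocol (plain TCP)
    """
    # Index every named port mapping across all containers in the task.
    mappings = {
        mapping["name"]: mapping
        for container in task_definition.get("containerDefinitions", [])
        for mapping in container.get("portMappings", [])
        if "name" in mapping
    }
    findings = {"missing": [], "l4_only": []}
    for service in service_connect.get("services", []):
        port_name = service["portName"]
        if port_name not in mappings:
            findings["missing"].append(port_name)
        elif "appProtocol" not in mappings[port_name]:
            findings["l4_only"].append(port_name)
    return findings


# Hypothetical inputs: "admin" has no appProtocol, "metrics" is not declared.
task_definition = {
    "containerDefinitions": [{
        "name": "app",
        "portMappings": [
            {"name": "http", "containerPort": 8080, "protocol": "tcp", "appProtocol": "http"},
            {"name": "admin", "containerPort": 9090, "protocol": "tcp"},
        ],
    }]
}
service_connect = {
    "enabled": True,
    "namespace": "prod",
    "services": [
        {"portName": "http", "discoveryName": "payments"},
        {"portName": "admin", "discoveryName": "payments-admin"},
        {"portName": "metrics", "discoveryName": "payments-metrics"},
    ],
}
print(service_connect_port_findings(task_definition, service_connect))
# {'missing': ['metrics'], 'l4_only': ['admin']}
```

A check like this fits naturally in CI next to whatever renders the service definition, so a mismatched portName fails the build rather than the deployment.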

2. Namespaces and Cloud Map: logical names, DNS-free discovery

The namespace is a Cloud Map HTTP namespace. Create it before any service references it:

aws servicediscovery create-http-namespace \
  --name prod \
  --description "Service Connect namespace for prod ECS services"

You can also let ECS create one implicitly: set a default Service Connect namespace on the cluster, and ECS creates the namespace if that name does not exist yet:

aws ecs update-cluster \
  --cluster prod \
  --service-connect-defaults namespace=prod

With a cluster default set, new services inherit the namespace and you only specify enabled: true plus the per-service services block.

The DNS-free part is the important nuance. With Cloud Map DNS-based discovery (the older serviceRegistries model), a client resolves payments.prod.local against a Route 53 private hosted zone, gets back a set of A records, and picks one. The client does its own load balancing with whatever its HTTP library happens to do, DNS TTLs cache stale records, and a task that died ten seconds ago can still be in the resolver cache.

Service Connect inverts this. The agent maintains a live view of healthy endpoints pushed from the ECS control plane — no periodic DNS query, no TTL staleness window. When a payments task is stopped, ECS withdraws its endpoint from every client proxy in the namespace within seconds. clientAliases.dnsName is a logical name the proxy intercepts locally; it is not a record you have to resolve over the wire to a hosted zone. That is why Service Connect reacts to topology change far faster than DNS-based discovery, and why you stop seeing the “connection refused to a dead IP” tail that plagues DNS-TTL discovery.
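
To make the staleness argument concrete, here is a small back-of-the-envelope model. It is a sketch under stated assumptions: the TTL, timestamps, and propagation delay are illustrative numbers, not measured or documented values.

```python
def dns_stale_seconds(last_resolved_at: float, ttl: float, task_died_at: float) -> float:
    """Seconds a cached A record can keep pointing at a task after it died."""
    cache_expires_at = last_resolved_at + ttl
    return max(0.0, cache_expires_at - task_died_at)


def push_stale_seconds(propagation_delay: float) -> float:
    """With pushed endpoint updates, the window is only the propagation delay."""
    return max(0.0, propagation_delay)


# Illustrative numbers only: a 60 s TTL, resolved at t=100, task dies at t=110.
print(dns_stale_seconds(last_resolved_at=100.0, ttl=60.0, task_died_at=110.0))  # 50.0
# Assumed 3 s for the control plane to withdraw the endpoint.
print(push_stale_seconds(propagation_delay=3.0))  # 3.0
```

The point of the model is the shape, not the numbers: the DNS window scales with the TTL and with how recently the client resolved, while the push window does not depend on the client at all.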

3. Built-in resilience: pooling, retries, timeouts, outlier detection

This is the reason to adopt Service Connect even if you were happy with discovery. The Envoy sidecar gives every call path mesh-grade resilience without a service mesh.

Connection pooling is automatic. The proxy keeps warm upstream connections to backend tasks and multiplexes requests, so you are not paying TCP and TLS handshake cost per request. For HTTP/2 and gRPC (appProtocol: http2 / grpc) it multiplexes streams over a single connection.

Per-request load balancing. Because the client proxy holds the full healthy-endpoint set, it load-balances per request across tasks, not per DNS resolution. A task added by a scale-out starts taking traffic as soon as its endpoint reaches the client proxies; a task removed by a scale-in is drained.

Timeouts are configurable per service via timeout in the Service Connect config. idleTimeoutSeconds bounds idle connections; perRequestTimeoutSeconds caps a single request — critical for HTTP/1.1 where a slow upstream otherwise pins a connection:

{
  "portName": "http",
  "discoveryName": "ledger",
  "clientAliases": [{ "dnsName": "ledger", "port": 8080 }],
  "timeout": {
    "idleTimeoutSeconds": 60,
    "perRequestTimeoutSeconds": 15
  }
}

For long-poll or streaming endpoints, set perRequestTimeoutSeconds: 0 to disable the per-request cap on that service — otherwise the proxy will sever your stream at the timeout. Do this surgically, per discoveryName, never globally.
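
Because a disabled per-request cap is easy to copy onto the wrong service, it can help to lint timeout blocks before deploying. The sketch below is a hypothetical helper, not an AWS API. It assumes, as I understand the API constraint, that perRequestTimeoutSeconds is only accepted when the port's appProtocol is HTTP-aware; treat that rule as an assumption to verify against the API reference.

```python
from typing import Optional


def timeout_findings(service: dict, app_protocol: Optional[str]) -> list:
    """Return human-readable findings for one Service Connect service entry."""
    findings = []
    timeout = service.get("timeout", {})
    for key in ("idleTimeoutSeconds", "perRequestTimeoutSeconds"):
        value = timeout.get(key)
        if value is not None and value < 0:
            findings.append(f"{key} must not be negative")
    per_request = timeout.get("perRequestTimeoutSeconds")
    # Assumption: a per-request timeout only makes sense on an L7 protocol.
    if per_request is not None and app_protocol in (None, "tcp"):
        findings.append("perRequestTimeoutSeconds needs an HTTP-aware appProtocol")
    if per_request == 0:
        findings.append("per-request cap disabled: confirm this is a streaming endpoint")
    return findings


ledger = {
    "discoveryName": "ledger",
    "timeout": {"idleTimeoutSeconds": 60, "perRequestTimeoutSeconds": 15},
}
stream = {"discoveryName": "marketdata", "timeout": {"perRequestTimeoutSeconds": 0}}

print(timeout_findings(ledger, "http"))  # []
print(timeout_findings(stream, "grpc"))
print(timeout_findings(ledger, None))
```

The "cap disabled" finding is deliberately a warning rather than an error: it is the right setting for a stream and the wrong one almost everywhere else.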

Retries and outlier detection are the headline. The proxy retries failed requests and ejects consistently failing tasks from the load-balancing pool (Envoy outlier detection), so a single bad replica stops poisoning the call path. These are managed behaviors: ECS renders the proxy config with its own defaults, and you neither hand-write Envoy YAML nor tune Envoy-level thresholds. The practical effect: a task that starts returning 5xx (bad deploy, wedged thread pool, exhausted connections) is detected and pulled out of rotation for a cool-down window, then allowed back in. With an internal ALB you would get this only if your health check happened to catch the failure mode, and never at per-request granularity.

The behavioral difference from an ALB is worth stating plainly: an ALB ejects a target when its health check fails on a fixed interval. Service Connect’s outlier detection ejects a target based on the actual request stream — real 5xx responses to real traffic — which catches partial and intermittent failure that a /healthz probe sails right past.
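
The difference between probe-based and traffic-based ejection is easier to see in a toy model. This is a deliberately simplified sketch of consecutive-5xx ejection, not Envoy’s actual algorithm and not the thresholds ECS configures; the threshold, cool-down, and IP addresses are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OutlierDetector:
    consecutive_5xx_threshold: int = 5   # assumed value, for illustration
    ejection_seconds: float = 30.0       # assumed cool-down, for illustration
    _streak: dict = field(default_factory=dict)
    _ejected_until: dict = field(default_factory=dict)

    def record(self, endpoint: str, status: int, now: float) -> None:
        """Feed one real response into the detector."""
        if status < 500:
            self._streak[endpoint] = 0  # any success resets the streak
            return
        self._streak[endpoint] = self._streak.get(endpoint, 0) + 1
        if self._streak[endpoint] >= self.consecutive_5xx_threshold:
            self._ejected_until[endpoint] = now + self.ejection_seconds
            self._streak[endpoint] = 0

    def in_rotation(self, endpoints: list, now: float) -> list:
        """Endpoints that are not currently serving an ejection."""
        return [e for e in endpoints if self._ejected_until.get(e, 0.0) <= now]


endpoints = ["10.0.1.10", "10.0.1.11"]
detector = OutlierDetector()
for second in range(5):
    detector.record("10.0.1.10", 200, now=float(second))
    detector.record("10.0.1.11", 503, now=float(second))

print(detector.in_rotation(endpoints, now=5.0))   # ['10.0.1.10']
print(detector.in_rotation(endpoints, now=40.0))  # ['10.0.1.10', '10.0.1.11']
```

Note that the wedged replica in this model never has to fail a health check: five real 503s are enough to remove it, which is exactly the failure class a shallow /healthz probe misses.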

4. Service Connect vs internal ALB vs Cloud Map discovery

Pick by the problem, not the habit.

Capability	Internal ALB	Cloud Map DNS discovery	Service Connect
Discovery mechanism	Static DNS to the ALB	Route 53 A records, client-resolved	Proxy control plane, no runtime DNS
Load balancing	At the ALB (extra hop)	Client-side, library-dependent	Client-side proxy, per request
Staleness on task death	ALB dereg delay	DNS TTL window	Seconds, push-based
Retries	No (client must)	No	Yes, in proxy
Outlier detection	Health-check based	None	Per-request, real traffic
L7 routing (paths/hosts)	Yes	No	No (name-to-service only)
Extra network hop	Yes	No	No
Per-hour cost	Per ALB	Namespace + queries	No LB charge; pay task/proxy
TLS termination	Yes, at ALB	N/A	Pass-through by default; optional Service Connect TLS (AWS Private CA)

The decision rule I give teams:

Keep an ALB when you genuinely need L7 routing — path/host rules, weighted target groups for canaries at the LB, WAF, or you are terminating public traffic. Service Connect is name-to-service, not a router. It does not do /v2/* → green, /* → blue.
Use Service Connect for east-west service-to-service traffic inside a namespace where you want discovery plus resilience and want to delete the internal ALB hop and bill.
Use Cloud Map DNS discovery only when a non-ECS consumer (a Lambda, an EC2 process, something that cannot get a Service Connect sidecar) needs to resolve ECS tasks. The sidecar is the gate: no sidecar, no Service Connect.

A point that bites people: Service Connect does not replace your ingress ALB. Public traffic still lands on an internet-facing ALB in front of the edge service. Service Connect replaces the internal ALBs between services. Keep the front door; demolish the interior hallways.
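
The decision rule above is small enough to write down as a function, which is handy when auditing a long list of call paths. This is a sketch of the rule as stated in this section; the function name, parameters, and return labels are hypothetical.

```python
def east_west_choice(needs_l7_routing: bool, public_ingress: bool,
                     caller_can_run_sidecar: bool) -> str:
    """Apply the section's decision rule to a single call path."""
    # Public traffic and path/host routing, WAF, or LB-level canaries stay on an ALB.
    if public_ingress or needs_l7_routing:
        return "alb"
    # No sidecar (Lambda, plain EC2 process) means no Service Connect.
    if not caller_can_run_sidecar:
        return "cloud-map-dns"
    return "service-connect"


print(east_west_choice(needs_l7_routing=False, public_ingress=False,
                       caller_can_run_sidecar=True))   # service-connect
print(east_west_choice(needs_l7_routing=True, public_ingress=False,
                       caller_can_run_sidecar=True))   # alb
print(east_west_choice(needs_l7_routing=False, public_ingress=False,
                       caller_can_run_sidecar=False))  # cloud-map-dns
```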

5. Incremental migration: dual-running endpoints, then cut over

You do not flip a 60-service estate at once. The migration is safe because Service Connect and your existing internal ALB can coexist on the same service.

Step 1 — turn on Service Connect as server, keep the ALB. Add serviceConnectConfiguration to the callee (payments) and redeploy. It now advertises payments into the namespace and stays behind its internal ALB. Nothing calls the new endpoint yet. Cost is one extra sidecar per task and zero risk to existing callers.

Step 2 — make callers clients. Add Service Connect (client or client-and-server) to one caller and redeploy. Its proxy now learns the payments endpoint. The application still points at the old ALB URL.

Step 3 — flip the URL for one caller. Change that caller’s dependency URL from the ALB hostname to the Service Connect alias, e.g. http://payments:8080. Roll it out, ideally behind a config flag so rollback is a flag, not a deploy. Watch the proxy metrics (next section). If error rate or p99 moves the wrong way, flip the flag back to the ALB URL — both paths are live.

# Caller task definition env — flip per service, behind a flag
environment = [
  { name = "PAYMENTS_URL", value = var.use_service_connect ? "http://payments:8080" : "http://payments.internal.example.com" }
]

Step 4 — drain and delete the ALB. Once every caller of payments resolves through Service Connect and has soaked, remove the ALB target group registration, delete the listener, the ALB, and the Route 53 alias. That is the moment the cost and the hop actually disappear — not when you enabled Service Connect, but when the last caller stops using the ALB.

The property that makes this safe: enabling Service Connect on a service does not change how its existing ALB traffic flows. The two discovery paths are independent. You migrate caller by caller, and each step is independently reversible.
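
Step 4 depends on knowing that every caller of a service has been flipped, which gets hard to track by hand across dozens of services. The following is a minimal sketch of that bookkeeping, assuming you maintain a caller map and a set of flipped (caller, callee) pairs yourself; both data structures and the "refunds" service are hypothetical.

```python
def albs_safe_to_delete(callers_by_callee: dict, flipped: set) -> list:
    """Callees whose every known caller already uses the Service Connect alias.

    A callee with no recorded callers is left out on purpose: an empty
    caller list usually means the dependency map is incomplete.
    """
    return sorted(
        callee
        for callee, callers in callers_by_callee.items()
        if callers and all((caller, callee) in flipped for caller in callers)
    )


# Hypothetical dependency map: who calls whom.
callers_by_callee = {
    "payments": ["checkout", "refunds"],
    "ledger": ["payments"],
}
# (caller, callee) pairs whose flag already points at the Service Connect alias.
flipped = {("checkout", "payments"), ("payments", "ledger")}

print(albs_safe_to_delete(callers_by_callee, flipped))  # ['ledger']
```

Here the payments ALB has to stay because refunds still calls it; only the ledger ALB has no remaining users. Pair a check like this with the ALB’s own request metrics before deleting anything, since a dependency map can miss callers.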

6. Cross-namespace and cross-account considerations

Service Connect discovery is scoped to a single namespace. A service in namespace prod cannot resolve a discoveryName advertised in namespace payments-prod. This is a deliberate isolation boundary, and it has consequences.

One namespace per environment is usually right. Putting every prod service in one namespace lets them all discover each other. Splitting prod into team-a and team-b namespaces means cross-team calls cannot use Service Connect directly — they fall back to an internal ALB or a VPC endpoint at the boundary. Use that split intentionally when you want a hard boundary between domains, not by accident.
Cross-account is not a Service Connect feature. The namespace and its services live in one account. To call a service in another account, you publish it the account-boundary way — an internal ALB exposed via PrivateLink (endpoint service + interface endpoint), or a shared ingress — and the consumer treats it as an external dependency, not a namespace member. Service Connect handles intra-account east-west; PrivateLink handles the trust boundary.
Shared VPC via RAM does not change this. Even if two accounts share subnets, the Cloud Map namespace is owned by one account and Service Connect endpoints are not discoverable across the account line. Plan the seam: Service Connect inside the account, PrivateLink or an internal ALB at the edge.

The architecture I land on for multi-account: each account runs its own namespace for internal traffic; anything that must cross an account boundary goes through a deliberate, observable PrivateLink seam. Do not try to stretch a namespace across accounts — it is not a supported topology and you will fight it.
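
The scoping rules in this section reduce to a two-question check per call edge. The sketch below encodes them as described here; the field names, account IDs, and return labels are hypothetical placeholders.

```python
def call_path(caller: dict, callee: dict) -> str:
    """Pick the discovery path for one caller -> callee edge."""
    if caller["account"] != callee["account"]:
        return "privatelink"                   # trust boundary: deliberate seam
    if caller["namespace"] != callee["namespace"]:
        return "internal-alb-or-vpc-endpoint"  # same account, different namespace
    return "service-connect"                   # same account, same namespace


# Placeholder account IDs and namespaces for illustration.
checkout = {"account": "111111111111", "namespace": "prod"}
payments = {"account": "111111111111", "namespace": "prod"}
reports = {"account": "111111111111", "namespace": "team-b"}
partner = {"account": "222222222222", "namespace": "prod"}

print(call_path(checkout, payments))  # service-connect
print(call_path(checkout, reports))   # internal-alb-or-vpc-endpoint
print(call_path(checkout, partner))   # privatelink
```

Note the last case: a matching namespace name in another account does not help, because the account check comes first.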

7. Telemetry: proxy metrics, per-call stats, debugging failures

The Service Connect agent emits metrics you do not get from DNS discovery, and they are how you debug a bad migration step.

Metrics. The proxy publishes CloudWatch metrics without extra configuration. They appear in the AWS/ECS namespace alongside the service’s CPU and memory metrics and include request counts, HTTP response-code counts, and target response time, keyed by DiscoveryName or TargetDiscoveryName (optionally combined with ServiceName and ClusterName). Proxy logs are separate and need a logConfiguration on the Service Connect config. Watch these per dimension:

RequestCountPerTarget / response-code splits — your 5xx rate on the new path.
Latency percentiles — confirm the proxy hop did not add tail latency (it should not; you removed the ALB hop).
Outlier ejections — if tasks are being ejected, a backend replica is failing real requests.

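# 5xx responses that checkout's proxy received from payments over the last hour.
# The dimension set must match a published combination exactly; confirm with
# `aws cloudwatch list-metrics --namespace AWS/ECS --metric-name HTTPCode_Target_5XX_Count`.
# GNU date shown; on macOS/BSD use: date -u -v-1H '+%Y-%m-%dT%H:%M:%SZ'
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=TargetDiscoveryName,Value=payments Name=ServiceName,Value=checkout Name=ClusterName,Value=prod \
  --start-time "$(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ')" \
  --end-time   "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
  --period 60 --statistics Sum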
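
A raw 5xx count is hard to judge without the request volume next to it. The sketch below joins two get-metric-statistics JSON responses (one for the 5xx metric, one for a request-count metric over the same dimensions and period) into a per-minute error rate. The sample datapoints are made up for illustration; only the Datapoints/Timestamp/Sum response shape is taken from the CLI.

```python
def error_rate_by_minute(errors: dict, requests: dict) -> dict:
    """Join two get-metric-statistics responses on Timestamp.

    CloudWatch omits periods with no data, so a minute that has requests
    but no error datapoint is treated as zero errors.
    """
    error_sums = {p["Timestamp"]: p["Sum"] for p in errors.get("Datapoints", [])}
    rates = {}
    for point in requests.get("Datapoints", []):
        total = point["Sum"]
        if total > 0:
            rates[point["Timestamp"]] = error_sums.get(point["Timestamp"], 0.0) / total
    return dict(sorted(rates.items()))


# Made-up sample responses, trimmed to the fields the function reads.
errors = {"Datapoints": [
    {"Timestamp": "2024-01-01T12:01:00+00:00", "Sum": 5.0},
]}
requests = {"Datapoints": [
    {"Timestamp": "2024-01-01T12:01:00+00:00", "Sum": 250.0},
    {"Timestamp": "2024-01-01T12:00:00+00:00", "Sum": 200.0},
]}

for minute, rate in error_rate_by_minute(errors, requests).items():
    print(minute, f"{rate:.2%}")
# 2024-01-01T12:00:00+00:00 0.00%
# 2024-01-01T12:01:00+00:00 2.00%
```

Feed it the parsed output of two CLI calls saved with --output json. During a migration step, comparing this rate before and after the URL flip is a quick way to decide whether to flip the flag back.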

Proxy logs. Route the agent’s logs to CloudWatch by adding a log config to the Service Connect block:

{
  "serviceConnectConfiguration": {
    "enabled": true,
    "namespace": "prod",
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/serviceconnect/checkout",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "sc"
      }
    },
    "services": [ /* ... */ ]
  }
}

When a call fails after you flip the URL, the proxy access logs are the source of truth. They show the upstream the proxy chose, the response code it got, and whether the request was retried — distinguishing “the proxy could not find a healthy payments” (namespace/role misconfig) from “payments returned 503” (backend problem). That distinction is exactly what DNS discovery cannot tell you, because with DNS the client picks blind and logs nothing about the upstream’s health.

Verify

Confirm the namespace, the registered endpoints, and live traffic before you delete anything.

# 1. Namespace exists and is HTTP type
aws servicediscovery list-namespaces \
  --query "Namespaces[?Name=='prod'].[Name,Type,Id]" --output table

# 2. The callee registered a Service Connect endpoint
aws ecs describe-services --cluster prod --services payments \
  --query "services[0].serviceConnectConfiguration" --output json

# 3. Tasks are running the SC agent sidecar (look for the managed proxy container)
aws ecs describe-tasks --cluster prod \
  --tasks "$(aws ecs list-tasks --cluster prod --service-name payments --query 'taskArns[0]' --output text)" \
  --query "tasks[0].containers[].name" --output json

Then exec into a caller task and prove the logical name resolves through the proxy, not through an ALB:

aws ecs execute-command --cluster prod \
  --task <checkout-task-id> --container app --interactive \
  --command "curl -s -o /dev/null -w '%{http_code} %{remote_ip}\n' http://payments:8080/healthz"

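A 200 confirms the alias answers. Expect %{remote_ip} to be a local virtual address that Service Connect maps the alias to inside the task (check /etc/hosts), not a backend task IP and not your ALB’s IP: the application connects to the proxy, and the proxy picks the backend. Cross-check that the Service Connect metrics in CloudWatch show request count on the payments discovery name for the calling service.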

Enterprise scenario

A fintech platform team ran ~70 ECS-on-Fargate services in a single prod account, each fronted by its own internal ALB for east-west calls. Two problems compounded: the internal ALB bill was material (70 ALBs plus LCUs), and they had a recurring incident class where one wedged replica of a downstream service kept passing its shallow /healthz check while returning 503s to real traffic — the ALB never ejected it, and callers saw a steady 0.5% error rate that paged on-call weekly.

The constraint: they could not take a maintenance window across 70 services, and a hard org rule required every change to be reversible by config flag, not redeploy.

They adopted Service Connect incrementally. One prod namespace, every service enabled as client-and-server over two sprints. Each caller’s downstream URL moved behind a flag (USE_SERVICE_CONNECT), defaulting to the ALB. They flipped one high-traffic path first (checkout → payments), soaked 48 hours, and the wedged-replica incident class disappeared: outlier detection ejected the bad task after a run of real 5xx responses and kept it out for the cool-down window, instead of waiting for a health check that never failed. After every caller of a given service was flipped and soaked, they deleted that service’s internal ALB.

The decisive piece was the per-service timeout plus a disabled per-request cap on their one streaming endpoint, so the migration did not sever long-lived connections:

{
  "portName": "grpc-stream",
  "discoveryName": "marketdata",
  "clientAliases": [{ "dnsName": "marketdata", "port": 9000 }],
  "timeout": { "perRequestTimeoutSeconds": 0 }
}

Outcome after full cutover: 60-plus internal ALBs deleted, one extra sidecar per task in their place, the weekly wedged-replica page gone, and east-west p99 down slightly because they removed the ALB hop. The remaining ALBs were exactly the ones that earned their keep — the public ingress and the two services doing genuine L7 path routing.

Written by Vinod
