Building Cross-Account Services with AWS PrivateLink: Endpoint Services, NLBs, and DNS

You have an internal API that another team, or another company, needs to reach. The reflex is to peer the VPCs or hang both off a Transit Gateway. Both grant network-layer reachability, and both fall apart the moment the consumer’s 10.0.0.0/16 collides with yours, which across a large estate eventually happens. AWS PrivateLink solves a narrower problem and solves it cleanly: it exposes one service across a trust boundary as an elastic network interface (ENI) in each consumer subnet you choose. No routes, no transitive reachability, no CIDR negotiation, no public internet. This is how the painful “expose this to 200 accounts” problem on AWS becomes a self-service onboarding problem instead of a networking project.

The mechanism has two halves and a wire between them. On the provider side you front your workload with an internal Network Load Balancer (NLB) or Gateway Load Balancer, then publish a VPC endpoint service on top of it; that service hands out a coordination string — com.amazonaws.vpce.<region>.vpce-svc-xxxx — and gates who connects through an allowed-principals allow-list plus an optional per-connection acceptance step. On the consumer side you create an interface endpoint naming that service, AWS provisions one ENI per subnet, and your workloads talk only to those ENIs. Private DNS then lets a friendly name like payments.internal.example.com resolve — inside the consumer VPC only — to the local ENIs, so client code never learns the ugly regional hostname. The catch that trips every team once is that AWS makes the provider prove they own the domain before any consumer can switch private DNS on.

By the end of this article you will build both sides correctly, wire up private DNS with domain-ownership verification, reason about availability zones across accounts (where us-east-1a in your account is not the same physical zone as in theirs), size for fan-in against the real quotas, keep the whole thing observable with VPC Flow Logs and NLB metrics, and — when it inevitably misbehaves — localise the failure to exactly one hop with a runbook of symptom, root cause, the exact command to confirm, and the fix. Every configuration here carries both a Terraform block and an aws CLI command, and because this is a reference you will return to mid-incident, the option matrices, error catalogue, quotas, and the troubleshooting playbook are all laid out as scannable tables.

What problem this solves

The pain is reachability that is too broad and address space that collides. When you peer two VPCs or attach them to a Transit Gateway, you give the consumer a route into your network. A misconfigured security group, an over-broad route table, or a curious operator on the other side can then reach anything routable in your VPC — not just the one API you meant to share. Security teams rightly flag this for external partners: “call one endpoint” should not grant “a path into the platform network.” PrivateLink gives the consumer a route to one load balancer, full stop, and only ever in the consumer→provider direction.

The second pain is CIDR overlap. Peering and TGW both demand non-overlapping address space, because they route on IP. In a large organisation — especially after acquisitions — two production VPCs sitting on the same 10.20.0.0/16 is not a hypothetical; it is Tuesday. Renumbering a live, compliance-scoped VPC is a multi-quarter project nobody signs up for. PrivateLink never exposes the provider’s address space at all: the consumer talks to ENIs carved from the consumer’s own subnets, so the two VPCs can use byte-identical CIDRs and never know.

Who hits this: any platform team publishing an internal service to many accounts (a fraud-scoring API, a config service, a logging sink), any SaaS vendor offering private connectivity to customer VPCs, and anyone who looked at a VPC-peering mesh of 80 accounts and realised it does not scale. The cost of PrivateLink is that it is one-directional and one-service-per-endpoint, and it requires a load balancer in front of the workload. If you genuinely need bidirectional, everything-talks-to-everything connectivity, this is the wrong tool — reach for TGW. To frame the whole decision before the deep dive, here is when each connectivity primitive is the right call:

Pattern	What it connects	CIDR overlap	Direction	Bills per GB	Pick it when
VPC peering	Two VPCs, full IP reachability	Must not overlap	Bidirectional	No (intra-region)	Two VPCs, mutual trust, small N
Transit Gateway	Many VPCs/accounts, policy-routed	Must not overlap	Bidirectional	Yes	Hub-and-spoke routing, segmentation
PrivateLink	One service behind an NLB/GWLB	Irrelevant	Unidirectional (consumer→provider)	Yes	Publish a service to many accounts
VPC Lattice	App-layer services across accounts	Irrelevant	Policy-governed	Yes	HTTP service mesh, per-request auth
Public endpoint + IAM	A regional AWS API/SaaS over public IP	n/a	Caller→service	Egress only	You accept public-path + IAM-only

The deciding question is always reachability versus publishing. Peering and TGW are routing: they give the consumer a route to your network. PrivateLink is publishing: it gives the consumer a route to one load balancer. If you would put the thing behind a load balancer and a DNS name anyway, publish it — do not route to it.

Mental model: peering and TGW are routing. PrivateLink is publishing. The consumer never touches your network — only an ENI in their own subnet that happens to forward to your NLB.

Learning objectives

By the end of this article you can:

Choose PrivateLink over peering, Transit Gateway, or VPC Lattice by reasoning about reachability, CIDR overlap, direction, and fan-in — and explain why overlapping CIDRs are irrelevant to it.
Build the provider side: an internal NLB across the right AZs with cross-zone load balancing, a VPC endpoint service on top, and the service_name coordination string handed to consumers.
Gate access with two stacked controls — allowed principals (who may create an endpoint) and per-connection acceptance (acceptance_required) — and automate approval with connection notifications to SNS and a Lambda.
Build the consumer side: an interface endpoint with one ENI per AZ, the security group on the endpoint ENI, and AZ alignment reasoned about by AZ ID, not name.
Publish a friendly private DNS name with domain-ownership verification (the TXT-record dance) and explain the split-horizon resolution that makes it work only inside the consumer VPC.
Size for high fan-in against the real quotas — allowed principals per service, endpoints per VPC, the NLB 350-second idle timeout, source-port and target capacity — and model the per-AZ-hourly plus per-GB cost.
Observe and troubleshoot the path with VPC Flow Logs on the ENIs and NLB CloudWatch metrics, and walk a symptom→cause→confirm→fix runbook to localise any failure to one hop.

Prerequisites & where this fits

You should be comfortable with core VPC concepts (subnets, route tables, security groups, ENIs, and Availability Zones) and know that an NLB is a Layer-4 (TCP/UDP/TLS) load balancer that runs without a security group unless you attach one when you create it; the NLB built in this article has none. You should be able to run the aws CLI with a profile per account (this is inherently a two-account exercise), read JSON output, and apply Terraform. Familiarity with Route 53 hosted zones (public and private) and basic DNS resolution helps a great deal for the private-DNS section.

This sits in the Networking track and assumes the fundamentals from the AWS VPC Deep Dive: Subnets, Routing, IGW, NAT & Endpoints — interface endpoints are the same primitive you use for AWS service endpoints, pointed at a private service instead. It is the cross-account complement to the AWS Transit Gateway Multi-Account VPC Architecture: TGW for routing meshes, PrivateLink for service publishing. The NLB underneath is covered in AWS Elastic Load Balancing: ALB, NLB & GWLB Deep Dive, the access controls lean on IAM Cross-Account Roles, External ID & the Confused Deputy, and for an app-layer alternative compare VPC Lattice Service Networks with IAM Auth.

A quick map of who owns and confirms what during an incident, so you call the right person fast:

Layer	What lives here	Which account owns it	Failure classes it can cause
Consumer client + its SG	App code, egress rules	Consumer	Client SG blocks egress to the ENI
Endpoint ENI + its SG	The interface endpoint, per-AZ ENIs	Consumer	Inbound SG blocks the client → hang/refuse
Private DNS (managed PHZ)	Split-horizon name → ENI	Consumer (created by AWS)	Long regional name returned; resolution fails
AWS backbone	The PrivateLink data path	AWS (managed)	Practically never; AZ mismatch shows here
Endpoint service + gates	Allowed principals, acceptance, DNS verify	Provider	pendingAcceptance; private DNS not effective
Internal NLB + targets	L4 LB, target groups, health checks	Provider	Unhealthy targets, AZ gaps, idle resets

Core concepts

Five mental models make every later step and every failure obvious.

An endpoint service publishes one service, not a network. On the provider side, a VPC endpoint service is a thin publishing layer that sits on top of an internal NLB (or GWLB) and advertises a single coordination string, the service name (com.amazonaws.vpce.<region>.vpce-svc-xxxx). It exposes exactly what the NLB fronts — one TCP service — and nothing else of your VPC. There is no route, no IP range, no transitive path. The consumer cannot “see” your network; they can only reach the load balancer you chose to publish.

The interface endpoint is an ENI in the consumer’s own subnet. On the consumer side, an interface endpoint of type Interface names the provider’s service, and AWS provisions one ENI per subnet you specify, each with a private IP from that consumer subnet’s range. Those ENIs are the only thing consumer workloads ever talk to. This is why CIDR overlap is irrelevant: the client connects to 10.0.1.50 in its own VPC, and AWS quietly carries the flow across the backbone to the provider’s NLB. The provider’s address space never enters the picture.

Two stacked controls decide who connects. Access is governed by two independent gates that both must pass. Allowed principals is an allow-list of IAM principal ARNs (a role, an account root, or *) that decides who may even create an endpoint to your service — without an entry, the consumer cannot see or target it at all. Acceptance (acceptance_required) is a per-connection gate: when true, every new endpoint lands in pendingAcceptance and waits for the provider to approve it. They stack: with acceptance_required = false an unlisted principal still cannot connect; with an open allow-list, acceptance still holds each connection until approved.

Private DNS is split-horizon and requires proof of ownership. By default the endpoint hands the consumer an ugly regional name. A private DNS name lets the provider associate a friendly hostname so consumers keep calling it. AWS implements this as split-horizon: it creates a managed private hosted zone inside the consumer VPC that resolves the name to the local ENIs, while the public name resolves to nothing useful outside. To stop anyone from publishing a service that hijacks *.example.com, AWS forces the provider to prove domain ownership via a TXT record in the public zone before any consumer may enable private DNS.

The NLB underneath sets the rules of physics. Everything downstream — which AZs the service is reachable in, whether traffic stays zone-local, how long-lived connections behave — is decided by the NLB. The subnets you attach define the published AZs; cross-zone load balancing decides whether an ENI in us-east-1a can reach targets in us-east-1b; the 350-second TCP idle timeout silently resets long-lived flows without keepalives. Get the NLB right and PrivateLink is boring; get it wrong and you get “works in one AZ, dead in another.”

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Whose account	Why it matters
Endpoint service	Publishing layer over an NLB/GWLB	Provider	The thing consumers connect to
Service name	`com.amazonaws.vpce.<region>.vpce-svc-…`	Provider (issued by AWS)	Coordination key; share out of band
Interface endpoint	`Interface`-type endpoint → ENIs	Consumer	What client workloads actually talk to
Endpoint ENI	One private IP per AZ in the consumer subnet	Consumer	The only L4 hop; its SG is the filter
Allowed principals	Allow-list of who may create an endpoint	Provider	Gate #1; scope to role ARNs
Acceptance	Per-connection approval (`acceptance_required`)	Provider	Gate #2; `pendingAcceptance` until approved
Private DNS name	Friendly name resolving to the ENIs	Provider sets, consumer enables	Avoids the ugly regional hostname
Domain verification	TXT record proving domain ownership	Provider	Must be `verified` before private DNS works
Cross-zone LB	NLB sends across AZs, not just zone-local	Provider	Off → “works in one AZ only”
Connection notification	SNS event on Connect/Accept/Reject/Delete	Provider	Auto-approval / human alerting
AZ ID	Stable physical-zone identifier (`use1-az1`)	Both	AZ names differ per account
Idle timeout	NLB’s TCP inactivity window (350 s by default)	Provider (NLB)	Long-lived flows reset without keepalive
PROXY protocol v2	Prepends real client IP to the TCP stream	Provider (target group)	Targets otherwise see the ENI, not the client
GWLB endpoint	Endpoint type for inline-appliance services	Both	Inspection/firewall services use GWLB, not NLB
Regional endpoint name	The long `vpce-…amazonaws.com` hostname	Consumer	The fallback when private DNS isn’t effective

When an endpoint service is the right call

PrivateLink, peering, Transit Gateway, and VPC Lattice are not interchangeable. Pick by what you are actually sharing and in which direction. The single deciding factor is whether you want to give the other side a route to your network or a path to one service; the rest follows. Reach for an endpoint service when the answer to “would I put this behind a load balancer and a DNS name anyway?” is yes.

The full comparison, on the dimensions that decide a real design:

Dimension	PrivateLink	VPC peering	Transit Gateway	VPC Lattice
Granularity	One service per endpoint	Whole VPC	Whole VPCs (route-domains)	Per app-layer service
CIDR overlap allowed	Yes (irrelevant)	No	No	Yes
Direction	Consumer→provider only	Bidirectional	Bidirectional	Policy-governed
Layer	L4 (TCP/UDP/TLS via NLB)	L3 (IP)	L3 (IP)	L7 (HTTP) + L4
Transitive reach	None (no route)	Non-transitive	Transitive (by design)	Service-scoped
Scales to N consumers	Excellent (fan-in)	Poor (mesh)	Good (hub-spoke)	Excellent
Public internet	Never	Never	Never	Never
Auth model	Allowed principals + acceptance	SG/NACL only	SG/NACL + routing	IAM auth policies
Bills per GB	Yes (data processing)	No (intra-region)	Yes	Yes
DNS integration	Private DNS (verified)	Manual	Manual / R53 Resolver	Built-in service DNS
Onboarding model	Self-service (gated)	Per-peer request	Per-attachment	Service association
Typical use	SaaS / shared internal API	Two trusted VPCs	Routing mesh	HTTP service mesh

Five reading notes that save the most design time:

If your requirement is…	Don’t use…	Use…	Because…
“Expose one API to 200 accounts”	Peering mesh / TGW	PrivateLink	Fan-in, no routes, overlap-proof
“Many VPCs must route to each other”	PrivateLink	Transit Gateway	PrivateLink is one-service, one-way
“Partner gets a path into our net”	Peering / TGW	PrivateLink	Grants one service, not network reach
“HTTP routing + per-request authZ”	Raw PrivateLink	VPC Lattice	Lattice does L7 + IAM auth policies
“Two of my own VPCs, full trust”	PrivateLink	VPC peering	Cheaper, bidirectional, simpler

Provider step 1 — Front the service with an internal NLB

An endpoint service sits on top of an NLB (or GWLB). We will use an NLB. It must be internal — PrivateLink targets the load balancer’s private addresses, not an internet-facing scheme — and you register your service’s targets (instances, IPs, or even an ALB as a target if you need L7 routing behind it) in a target group.

resource "aws_lb" "svc" {
  name                             = "payments-svc-nlb"
  internal                         = true
  load_balancer_type               = "network"
  subnets                          = var.provider_subnet_ids   # one per AZ you publish
  enable_cross_zone_load_balancing = true
}

resource "aws_lb_target_group" "svc" {
  name        = "payments-svc-tg"
  port        = 8443
  protocol    = "TCP"
  vpc_id      = var.provider_vpc_id
  target_type = "ip"

  health_check {
    protocol            = "TCP"
    port                = "8443"
    healthy_threshold   = 3
    unhealthy_threshold = 3
    interval            = 10
  }
}

resource "aws_lb_listener" "svc" {
  load_balancer_arn = aws_lb.svc.arn
  port              = 8443
  protocol          = "TCP"
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.svc.arn
  }
}

# Equivalent, CLI: an internal NLB with cross-zone enabled
aws elbv2 create-load-balancer --name payments-svc-nlb --type network \
  --scheme internal --subnets subnet-aaa subnet-bbb subnet-ccc
aws elbv2 modify-load-balancer-attributes --load-balancer-arn <nlb-arn> \
  --attributes Key=load_balancing.cross_zone.enabled,Value=true

Two decisions here matter for everything downstream. First, the subnets you attach to the NLB define which AZs the service is available in. A consumer can only create an endpoint in an AZ where you have a presence. Publish in at least two, and prefer to publish in every AZ your largest consumers use. Second, turn on cross-zone load balancing. An NLB is zonal by default: an endpoint ENI in us-east-1a will only send to targets in us-east-1a unless cross-zone is enabled, which with uneven target distribution produces hot zones and surprising health-check behaviour. Cross-zone traffic on an NLB incurs inter-AZ data charges, which the load-balancer owner pays — budget for it.
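Both decisions can be confirmed from the CLI before you publish anything. The following is a minimal sketch, not a definitive check: the function names are hypothetical helpers, and they only wrap read-only `elbv2` describe calls against the NLB name and ARN you pass in.

```shell
# Hypothetical helpers (names are illustrative, not part of any AWS tooling).

# Print the AZ names the NLB is attached to, one per line.
nlb_zones() {
  aws elbv2 describe-load-balancers --names "$1" \
    --query 'LoadBalancers[0].AvailabilityZones[].ZoneName' --output text |
    tr '\t' '\n'
}

# Print the value of the cross-zone attribute ("true" or "false").
nlb_cross_zone() {
  aws elbv2 describe-load-balancer-attributes --load-balancer-arn "$1" \
    --query "Attributes[?Key=='load_balancing.cross_zone.enabled'].Value" \
    --output text
}

# Succeed only if the NLB spans at least two AZs and cross-zone is on.
# Usage: nlb_publish_ready <nlb-name> <nlb-arn>
nlb_publish_ready() {
  local name="$1" arn="$2" count
  count=$(nlb_zones "$name" | grep -c .)
  [ "$count" -ge 2 ] && [ "$(nlb_cross_zone "$arn")" = "true" ]
}
```

Run `nlb_publish_ready payments-svc-nlb <nlb-arn> && echo ready` with the provider profile; a non-zero exit means one of the two decisions above still needs attention.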

The NLB decisions that ripple into PrivateLink behaviour, with the default, the trade-off, and the gotcha:

Setting	Values	Default	When to change	Trade-off / gotcha
`scheme`	`internal` / `internet-facing`	n/a	Must be internal for PrivateLink	Internet-facing NLB cannot back an endpoint service
Subnets (AZs)	One subnet per AZ	none	Publish every AZ big consumers use	Consumer can’t endpoint into an AZ you skip
`cross_zone.enabled`	`true` / `false`	false	Almost always `true`	`false` → ENI only reaches same-AZ targets; inter-AZ data charges when on
`target_type`	`instance` / `ip` / `alb`	`instance`	`ip` for ECS/containers; `alb` for L7 behind	`alb` target enables HTTP routing but adds a hop
Listener protocol	`TCP` / `UDP` / `TLS` / `TCP_UDP`	none	`TLS` to terminate at the NLB	`TLS` needs an ACM cert; otherwise pass TCP through
Health-check protocol	`TCP` / `HTTP` / `HTTPS`	`TCP`	`HTTP(S)` for app-aware checks	`TCP` only proves the port is open, not the app
`deletion_protection`	`true` / `false`	`false`	`true` in prod	Prevents an accidental `terraform destroy` outage
`preserve_client_ip`	`true` / `false`	varies by target type	Needed if the app reads source IP	Through PrivateLink the source is the ENI, not the real client

A subtle but important point on client IP: through PrivateLink the provider’s targets see the endpoint ENI’s network identity, not the consumer’s real client IP (unless you carry it in PROXY protocol v2, which the NLB can prepend). Do not build authorization on observed source IP across a PrivateLink boundary — use the allowed-principals gate and application-layer auth instead.
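If the application does need the original client address, PROXY protocol v2 is switched on as a target-group attribute. A minimal sketch follows, assuming the targets can already parse the PROXY v2 header (if they cannot, connections and health checks are likely to fail once it is enabled). The function names are hypothetical wrappers.

```shell
# Hypothetical helpers (names are illustrative).

# Enable PROXY protocol v2 on a target group.
# Assumption: every registered target understands the PROXY v2 header.
enable_proxy_protocol_v2() {
  aws elbv2 modify-target-group-attributes --target-group-arn "$1" \
    --attributes Key=proxy_protocol_v2.enabled,Value=true
}

# Print the current setting ("true" or "false").
proxy_protocol_v2_enabled() {
  aws elbv2 describe-target-group-attributes --target-group-arn "$1" \
    --query "Attributes[?Key=='proxy_protocol_v2.enabled'].Value" \
    --output text
}
```

Treat this as a coordinated change: roll out header parsing on the targets first, then flip the attribute.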

Provider step 2 — Create the VPC endpoint service

With the NLB live, create the endpoint service that points at it. The key knob is acceptance_required.

resource "aws_vpc_endpoint_service" "payments" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.svc.arn]

  tags = { Name = "payments-endpoint-service" }
}

output "service_name" {
  # e.g. com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0
  value = aws_vpc_endpoint_service.payments.service_name
}

# CLI equivalent — note the NLB ARN(s), not the VPC
aws ec2 create-vpc-endpoint-service-configuration \
  --network-load-balancer-arns <nlb-arn> \
  --acceptance-required

AWS assigns a service name of the form com.amazonaws.vpce.<region>.vpce-svc-xxxxxxxxxxxxxxxxx. This string is what consumers use to find you; it is not secret, but it is the coordination key — hand it to consumers out of band (a wiki, a Terraform output, a platform catalogue). acceptance_required = true means every new connection lands in pendingAcceptance and waits for you to approve it. For a controlled internal platform with a known consumer list, that manual gate is worth keeping; for self-service at scale you flip it to false and rely on the allow-list instead. Do not run with false and an open allow-list unless you genuinely intend anyone in any account to connect.

The endpoint-service configuration options, end to end:

Setting	Values	Default	When to change	Trade-off / gotcha
`acceptance_required`	`true` / `false`	`true` (console)	`false` for self-service at scale	`false` + open allow-list = anyone can connect
`network_load_balancer_arns`	One or more NLB ARNs	none	Multiple for blue/green or sharding	All must be in the same region/VPC
`gateway_load_balancer_arns`	One or more GWLB ARNs	none	For appliance/inspection services	NLB and GWLB are mutually exclusive per service
`supported_ip_address_types`	`ipv4` / `ipv6`	`ipv4`	Add `ipv6` for dual-stack consumers	Consumer and NLB must both support the family
`private_dns_name`	FQDN string	unset	Set to publish a friendly name	Triggers the TXT verification requirement
`allowed_principals`	List of ARNs	empty (nobody)	Always add at least one	Empty = no consumer can even see the service
`supported_regions`	Region list (cross-region)	current	For cross-region access (where supported)	Adds cross-region data charges
Tags	Key/value	none	Always tag owner + cost-centre	Untagged platform services are unauditable

The service has a lifecycle state you will check constantly; know what each value means:

`ServiceState`	Meaning	What to do
`Pending`	Being created	Wait; verify the NLB exists
`Available`	Ready for connections	Add allowed principals; share the name
`Deleting`	Tear-down in progress	Consumers’ endpoints will go to `rejected`
`Failed`	Creation failed	Check the NLB ARN and account limits
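Since you check this state so often, it can help to script the wait. The following is a sketch under stated assumptions: `wait_service_available` is a hypothetical helper, and the attempt count and sleep interval are arbitrary defaults rather than recommended values.

```shell
# Hypothetical helper: poll ServiceState until it reads Available.
# Usage: wait_service_available <service-id> [attempts] [sleep-seconds]
wait_service_available() {
  local id="$1" attempts="${2:-30}" pause="${3:-10}" state="" i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$(aws ec2 describe-vpc-endpoint-service-configurations \
      --service-ids "$id" \
      --query 'ServiceConfigurations[0].ServiceState' --output text)
    case "$state" in
      Available) return 0 ;;                                   # ready for connections
      Failed) echo "service $id is Failed" >&2; return 1 ;;    # stop early
    esac
    sleep "$pause"
    i=$((i + 1))
  done
  echo "timed out waiting for $id; last state: $state" >&2
  return 1
}
```

A typical call would be `wait_service_available vpce-svc-0123456789abcdef0 && echo "share the service name"`.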

Provider step 3 — Allowed principals, acceptance, and notifications

Two independent controls govern who reaches your service, and they stack. Allowed principals decide who is even permitted to create an endpoint to your service; without an entry, the consumer cannot see or target it at all. Scope these as tightly as the consumer’s identity allows — a specific role ARN is better than a whole account root, which is better than *.

resource "aws_vpc_endpoint_service_allowed_principal" "consumer" {
  vpc_endpoint_service_id = aws_vpc_endpoint_service.payments.id
  principal_arn           = "arn:aws:iam::222222222222:role/payments-client-prod" # tighten to a role
}

aws ec2 modify-vpc-endpoint-service-permissions \
  --service-id vpce-svc-0123456789abcdef0 \
  --add-allowed-principals arn:aws:iam::222222222222:role/payments-client-prod

Acceptance is the per-connection gate that applies when acceptance_required = true. List pending connections and approve them:

aws ec2 describe-vpc-endpoint-connections \
  --filters Name=service-id,Values=vpce-svc-0123456789abcdef0 \
  --query "VpcEndpointConnections[?VpcEndpointState=='pendingAcceptance'].[VpcEndpointId,VpcEndpointOwner]" \
  --output table

aws ec2 accept-vpc-endpoint-connections \
  --service-id vpce-svc-0123456789abcdef0 \
  --vpc-endpoint-ids vpce-0a1b2c3d4e5f6a7b8

To avoid polling for pending connections, wire a connection notification to SNS. You get an event on Connect, Accept, Reject, and Delete, which you can route to a Lambda for auto-approval against an authoritative consumer registry, or just to a channel so a human acts within minutes instead of hours.

resource "aws_vpc_endpoint_connection_notification" "payments" {
  vpc_endpoint_service_id     = aws_vpc_endpoint_service.payments.id
  connection_notification_arn = aws_sns_topic.privatelink_events.arn
  connection_events           = ["Connect", "Accept", "Reject", "Delete"]
}
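The notification resource above can also be created from the CLI. This sketch wraps the call in a hypothetical helper, `subscribe_connection_events`; it assumes the SNS topic already exists and that its access policy lets the PrivateLink service publish to it.

```shell
# Hypothetical helper: attach an SNS notification to an endpoint service.
# Usage: subscribe_connection_events <service-id> <sns-topic-arn>
subscribe_connection_events() {
  aws ec2 create-vpc-endpoint-connection-notification \
    --service-id "$1" \
    --connection-notification-arn "$2" \
    --connection-events Connect Accept Reject Delete
}
```

For example: `subscribe_connection_events vpce-svc-0123456789abcdef0 <sns-topic-arn>`, run with the provider profile.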

How the two gates interact is the single most common source of “why can’t they connect” — read this matrix carefully:

Allowed principal present?	`acceptance_required`	Outcome	Provider action needed
No	`true`	Consumer can’t even create the endpoint	Add the principal first
No	`false`	Consumer can’t create the endpoint	Add the principal first
Yes	`true`	Endpoint sits in `pendingAcceptance`	Accept the connection (or auto-approve)
Yes	`false`	Endpoint goes `available` immediately	None — self-service
`*` (wildcard)	`false`	Anyone in any account connects	Intentional only; usually a mistake
`*` (wildcard)	`true`	Anyone may request, you gate each	Acceptable for vetted public-ish services
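The matrix can be restated as a small decision function, which is handy in a triage script. This is only a sketch of the table above: `gate_outcome` and its input labels (`none`, `listed`, `wildcard`) are hypothetical names, not AWS values.

```shell
# Hypothetical helper mirroring the gate matrix.
# $1: allow-list entry for the consumer: none | listed | wildcard
# $2: acceptance_required on the service: true | false
gate_outcome() {
  case "$1:$2" in
    none:true|none:false)        echo "cannot-create" ;;      # add the principal first
    listed:true|wildcard:true)   echo "pendingAcceptance" ;;  # provider must accept
    listed:false|wildcard:false) echo "available" ;;          # self-service
    *) echo "unknown input: $1 $2" >&2; return 2 ;;
  esac
}
```

For example, `gate_outcome listed true` prints `pendingAcceptance`, which tells you the next action sits with the provider, not the consumer.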

Scope the principal as tightly as the consumer’s identity allows — the security blast radius shrinks as you narrow it:

Principal form	Example	Who can connect	Use when
Role ARN	`arn:aws:iam::222…:role/app-prod`	Only that workload role	Default — tightest practical scope
Account root	`arn:aws:iam::222…:root`	Any principal in that account	You trust the whole account
Org-wide via condition	`aws:PrincipalOrgID` (policy)	Any account in your org	Internal platform, many accounts
Wildcard	`*`	Anyone, any account	Almost never; vetted + acceptance only

The connection-notification events, and what you typically do with each:

Event	Fires when	Typical handler action
`Connect`	A consumer creates an endpoint	Look up the principal in the registry
`Accept`	A connection is accepted	Record onboarding; emit a metric
`Reject`	You (or auto-logic) reject it	Alert the consumer with the reason
`Delete`	The consumer deletes the endpoint	Clean up registry/DNS state

Consumer step 4 — Create the interface endpoint

Now switch to the consumer account. The consumer creates an interface endpoint of type Interface, naming the provider’s service. AWS provisions one ENI per subnet you specify, each with a private IP from that subnet’s range. Those ENIs are the only thing the consumer’s workloads ever talk to.

resource "aws_vpc_endpoint" "payments" {
  vpc_id              = var.consumer_vpc_id
  service_name        = "com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.consumer_subnet_ids       # one per AZ — must overlap provider AZs
  security_group_ids  = [aws_security_group.endpoint.id]
  private_dns_enabled = false                          # see step 5 before enabling
}

aws ec2 create-vpc-endpoint --vpc-id vpc-cons123 --vpc-endpoint-type Interface \
  --service-name com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0 \
  --subnet-ids subnet-c1 subnet-c2 --security-group-ids sg-endpoint \
  --no-private-dns-enabled

Three things define correctness on this side. First, AZ alignment: put the endpoint in the same AZs the provider published. An endpoint ENI in an AZ the provider does not serve is dead weight, because there is no target on the other side. Use AZ IDs (use1-az1), not names, when reasoning across accounts, because us-east-1a in your account may map to a different physical zone than in the provider’s. Second, the security group lives on the endpoint ENI, and this is the most common stumble. The SG attached to the endpoint controls traffic from consumer workloads into the ENI; it must allow inbound on the service port from the client CIDRs/SGs. The provider’s NLB, as built in step 1, has no security group, so the endpoint SG is the only L4 filter on the path. Third, the security groups on the clients themselves still need egress to the endpoint.

resource "aws_security_group" "endpoint" {
  name   = "payments-endpoint-sg"
  vpc_id = var.consumer_vpc_id

  ingress {
    description = "Clients to PrivateLink endpoint"
    from_port   = 8443
    to_port     = 8443
    protocol    = "tcp"
    cidr_blocks = [var.app_subnet_cidr]   # or security_groups = [aws_security_group.client.id]
  }
}
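A CLI equivalent of the security group above might look like the following. It is a sketch only: `create_endpoint_sg` is a hypothetical helper, the group name and description simply mirror the Terraform block, and the port defaults to the 8443 used throughout this article.

```shell
# Hypothetical helper: create the endpoint SG and open the service port.
# Usage: create_endpoint_sg <consumer-vpc-id> <client-cidr> [port]
# Prints the new security group ID on success.
create_endpoint_sg() {
  local vpc_id="$1" client_cidr="$2" port="${3:-8443}" sg_id
  sg_id=$(aws ec2 create-security-group \
    --group-name payments-endpoint-sg \
    --description "Clients to PrivateLink endpoint" \
    --vpc-id "$vpc_id" \
    --query 'GroupId' --output text) || return 1
  # Inbound from the client subnet to the endpoint ENIs on the service port.
  aws ec2 authorize-security-group-ingress --group-id "$sg_id" \
    --protocol tcp --port "$port" --cidr "$client_cidr" >/dev/null || return 1
  echo "$sg_id"
}
```

The printed ID is what you would pass to `--security-group-ids` when creating the interface endpoint.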

Finally, one endpoint, many AZs, one connection: each consumer VPC needs exactly one interface endpoint to the service, and the ENIs across AZs share a single connection from the provider’s point of view. The interface-endpoint options that define behaviour:

Setting	Values	Default	When to change	Trade-off / gotcha
`vpc_endpoint_type`	`Interface` / `Gateway` / `GatewayLoadBalancer`	n/a	Interface for PrivateLink services	Gateway type is only for S3/DynamoDB
`subnet_ids`	One subnet per AZ	none	Match the provider’s published AZs	An ENI in an unpublished AZ is dead weight
`security_group_ids`	SG list	default VPC SG	Always set explicitly	Default SG often denies the service port
`private_dns_enabled`	`true` / `false`	`true` (for AWS svcs) / `false` (custom)	`true` once provider DNS is verified	Premature `true` errors if not verified
`ip_address_type`	`ipv4` / `ipv6` / `dualstack`	`ipv4`	Match the service’s families	Mismatch → endpoint can’t be created
`policy`	Endpoint policy JSON	full access	Restrict actions (for AWS-service EPs)	Custom services rely on app auth, not this
`dns_options.dns_record_ip_type`	`ipv4` / `ipv6` / `service-defined`	`ipv4`	Dual-stack resolution	Must align with how the service publishes

The endpoint state values you will watch while it provisions:

`State`	Meaning	What to do
`pendingAcceptance`	Waiting for the provider to accept	Ask the provider to accept (or check allow-list)
`pending`	Provisioning the ENIs	Wait a minute or two
`available`	ENIs up; ready to use	Test connectivity; check the SG
`rejected`	Provider rejected the connection	You’re not on the allow-list / were denied
`failed`	Provisioning failed	Check AZ overlap and subnet capacity
`deleting` / `deleted`	Tear-down	Expected on `terraform destroy`
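The `failed` row usually comes back to AZ alignment, and that is easiest to settle by comparing AZ IDs rather than names. A minimal sketch follows; both function names are hypothetical. The idea is that each side runs `az_name_to_id` under its own profile, the provider shares the AZ IDs of its NLB subnets, and the consumer picks subnets from the intersection.

```shell
# Hypothetical helpers (names are illustrative).

# Print "ZoneName<TAB>ZoneId" for the current account and region.
az_name_to_id() {
  aws ec2 describe-availability-zones \
    --query 'AvailabilityZones[].[ZoneName,ZoneId]' --output text
}

# Print the AZ IDs that appear in both lists.
# Each argument is a space-separated list of AZ IDs.
common_az_ids() {
  local id
  for id in $1; do
    case " $2 " in
      *" $id "*) echo "$id" ;;
    esac
  done
}
```

For example, `common_az_ids "use1-az1 use1-az2" "use1-az2 use1-az4"` prints `use1-az2`, the only zone where an endpoint ENI would have something to talk to.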

Consumer step 5 — Private DNS and domain-ownership verification

By default the endpoint hands the consumer an ugly regional DNS name like vpce-0a1b2c3d4e5f6a7b8-abcd1234.vpce-svc-0123456789abcdef0.us-east-1.vpce.amazonaws.com, plus zonal variants. Functional, but nobody wants that baked into client config. Private DNS names let the provider associate a friendly name — say payments.internal.example.com — so consumers keep calling that hostname and resolution silently points at the endpoint.

The catch, and the reason this trips teams up, is that AWS makes the provider prove they own the domain before any consumer is allowed to enable private DNS. This prevents someone from publishing a service that hijacks *.example.com. Enable a private DNS name on the endpoint service, then publish the TXT record AWS gives you:

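# Same resource as in provider step 2, updated in place (not a second resource):
# only private_dns_name is new.
resource "aws_vpc_endpoint_service" "payments" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.svc.arn]
  private_dns_name           = "payments.internal.example.com" # triggers TXT verification

  tags = { Name = "payments-endpoint-service" }
}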

AWS returns a verification token. Fetch it and create the TXT record in the public hosted zone for the domain (verification is done against public DNS, even though the service itself is private):

aws ec2 describe-vpc-endpoint-service-configurations \
  --service-ids vpce-svc-0123456789abcdef0 \
  --query 'ServiceConfigurations[0].PrivateDnsNameConfiguration'
# -> { "State": "pendingVerification", "Type": "TXT",
#      "Value": "vpce:abc123...", "Name": "_a1b2c3d4" }

resource "aws_route53_record" "privatelink_verify" {
  zone_id = var.public_zone_id
  name    = "_a1b2c3d4.payments.internal.example.com"
  type    = "TXT"
  ttl     = 1800
  records = ["vpce:abc123..."]
}

aws ec2 start-vpc-endpoint-service-private-dns-verification \
  --service-id vpce-svc-0123456789abcdef0
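Verification is checked against public DNS, so it is worth confirming the TXT record is visible from outside before (and after) starting it. The sketch below assumes `dig` is installed; the function names are hypothetical, and the 1.1.1.1 resolver is only an illustrative default.

```shell
# Hypothetical helpers (names are illustrative).

# Succeed if the expected TXT value is visible via a public resolver.
# Usage: txt_visible <record-name> <expected-value> [resolver-ip]
txt_visible() {
  local name="$1" expected="$2" resolver="${3:-1.1.1.1}"
  # dig prints TXT values wrapped in double quotes; strip them before comparing.
  dig +short TXT "$name" "@$resolver" | tr -d '"' | grep -Fxq "$expected"
}

# Print the current private-DNS verification state of the service.
private_dns_state() {
  aws ec2 describe-vpc-endpoint-service-configurations --service-ids "$1" \
    --query 'ServiceConfigurations[0].PrivateDnsNameConfiguration.State' \
    --output text
}
```

If `txt_visible` fails, fix the record name or value first; re-running verification against a record the public resolver cannot see will not help.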

Once the state flips to verified, consumers can set private_dns_enabled = true on their interface endpoints. Behind the scenes AWS creates a managed private hosted zone in the consumer VPC that resolves payments.internal.example.com to the endpoint ENIs — classic split-horizon: the public name resolves to nothing useful publicly, but inside the consumer VPC it points at the local ENIs. For private_dns_enabled to actually take effect, the consumer VPC must have enableDnsHostnames and enableDnsSupport both turned on, or resolution silently falls back to the long regional name.

The private-DNS verification states and what each means for the consumer:

`PrivateDnsNameConfiguration.State`	Meaning	Consumer can enable private DNS?	Provider action
`pendingVerification`	Token issued, TXT not yet seen	No	Publish the TXT record, then start verification
`verified`	Ownership proven	Yes	Tell consumers to set `private_dns_enabled=true`
`failed`	TXT wrong/missing after retries	No	Fix the TXT name/value; re-run verification

What must be true on each side for the friendly name to actually resolve — every one of these, or you get the long regional name:

Requirement	Side	How to confirm	If wrong
`private_dns_name` set on the service	Provider	`describe-vpc-endpoint-service-configurations`	Friendly name never offered
TXT verification `verified`	Provider	`PrivateDnsNameConfiguration.State`	Consumer can’t enable private DNS
`private_dns_enabled = true`	Consumer	`describe-vpc-endpoints`	Long regional name returned
`enableDnsSupport = true`	Consumer VPC	`describe-vpc-attribute`	Resolution silently falls back
`enableDnsHostnames = true`	Consumer VPC	`describe-vpc-attribute`	Resolution silently falls back
No conflicting Route 53 record	Consumer	Check private hosted zones	A manual record can shadow the managed one

The two DNS names you’ll see, and when each is correct:

Name form	Example	Resolves to	When you should see it
Friendly private DNS	`payments.internal.example.com`	Endpoint ENI IPs (in-VPC only)	After verification + `private_dns_enabled=true`
Regional endpoint name	`vpce-…vpce-svc-…region.vpce.amazonaws.com`	The endpoint (all AZs)	When private DNS is off — and as a fallback
Zonal endpoint name	`vpce-…az1.…vpce.amazonaws.com`	One AZ’s ENI	When you deliberately pin to an AZ

Verify the path end to end

Prove the path end to end before declaring victory. Work provider→consumer→client, because each layer depends on the one before it.

# 1. Provider: service exists, NLB attached, DNS verified
aws ec2 describe-vpc-endpoint-service-configurations \
  --service-ids vpce-svc-0123456789abcdef0 \
  --query 'ServiceConfigurations[0].[ServiceState,PrivateDnsNameConfiguration.State,AvailabilityZones]'

# 2. Provider: the consumer's connection is accepted (not pendingAcceptance/rejected)
aws ec2 describe-vpc-endpoint-connections \
  --filters Name=service-id,Values=vpce-svc-0123456789abcdef0 \
  --query 'VpcEndpointConnections[].[VpcEndpointOwner,VpcEndpointState]' --output table

# 3. Consumer: endpoint is "available" with ENIs in each AZ
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-0a1b2c3d4e5f6a7b8 \
  --query 'VpcEndpoints[0].[State,NetworkInterfaceIds,DnsEntries[0].DnsName]'

# 4. From a consumer instance: DNS resolves to a private (ENI) address...
dig +short payments.internal.example.com
# 10.x.x.x  (a consumer-subnet IP, not a public one)

# 5. ...and the service answers
curl -sS -o /dev/null -w '%{http_code}\n' https://payments.internal.example.com:8443/healthz

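If dig returns nothing, a public address, or anything other than consumer-subnet IPs, private DNS is not effective and clients can only reach the service through the long vpce-...amazonaws.com name. Re-check private_dns_enabled, the DNS verification state, and both VPC DNS flags (enableDnsSupport and enableDnsHostnames). The verification checklist as a table, with what “good” looks like at each step: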

#	Check	Command	Expected good result
1	Service available + DNS verified	`describe-vpc-endpoint-service-configurations`	`Available`, `verified`, AZs listed
2	Connection accepted	`describe-vpc-endpoint-connections`	`available` for the consumer owner
3	Endpoint up with ENIs	`describe-vpc-endpoints`	`available`, N ENIs, a DNS entry
4	DNS resolves private	`dig +short <name>`	A `10.x`/consumer-subnet IP
5	Service responds	`curl … :8443/healthz`	`200` (or your healthy code)
6	No REJECTs at the ENI	Flow Logs query (below)	Zero `REJECT` rows from the client
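Step 4 is easy to automate in a smoke test. The sketch below classifies a single line of dig +short output. classify_answer is a hypothetical helper, and it assumes the consumer VPC uses RFC 1918 address space, so a VPC on other ranges (such as 100.64.0.0/10) would be reported as public:

```shell
# Classify one line of `dig +short` output.
# Prints one of: private | public | hostname | empty | unknown
classify_answer() {
  case $1 in
    "")        echo empty;    return 0 ;;
    *[!0-9.]*) echo hostname; return 0 ;;   # anything that is not digits and dots
  esac
  old_ifs=$IFS; IFS=.
  set -- $1                                  # split the address into octets
  IFS=$old_ifs
  if [ "$#" -ne 4 ] || [ -z "$1" ] || [ -z "$2" ]; then
    echo unknown; return 0
  fi
  if [ "$1" -eq 10 ] ||
     { [ "$1" -eq 172 ] && [ "$2" -ge 16 ] && [ "$2" -le 31 ]; } ||
     { [ "$1" -eq 192 ] && [ "$2" -eq 168 ]; }; then
    echo private
  else
    echo public
  fi
}

# Usage from a consumer instance (not run here):
#   dig +short payments.internal.example.com | while read -r line; do classify_answer "$line"; done
```

A healthy result is one private line per enabled AZ. Anything else sends you back to the private DNS requirements table.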

Scaling, resiliency, and the quotas that bite

A few limits and behaviours decide whether this holds up under load.

Cross-zone load balancing (covered above, repeated because it bites in production): without it, an endpoint ENI only reaches targets in its own AZ.
Endpoint connection capacity: a single interface endpoint scales horizontally across AZs, giving roughly tens of thousands of concurrent connections per AZ per endpoint; for very high fan-in the bottleneck is usually NLB target capacity and source-port exhaustion on long-lived connections, not the endpoint itself, so watch ActiveFlowCount and NewFlowCount on the NLB.
The idle timeout is the silent killer: NLB TCP flows have a 350-second idle timeout, so long-lived gRPC or database-style connections through PrivateLink need TCP keepalives below that or they reset silently.
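To make the keepalive rule concrete, here is a small sketch that compares a client's keepalive idle time with the 350-second timeout. keepalive_verdict is a hypothetical helper. The Linux sysctl shown in the comment is the system-wide default (7200 seconds on a stock kernel) and only affects sockets that enable SO_KEEPALIVE, so many clients set the value in application or gRPC configuration instead:

```shell
# Compare a keepalive idle time (seconds) with an idle timeout (default 350 s).
# Prints "ok" when the first keepalive probe fires before the timeout, else "too-slow".
keepalive_verdict() {
  idle=$1; limit=${2:-350}
  if [ "$idle" -lt "$limit" ]; then
    echo "ok"
  else
    echo "too-slow"
  fi
}

# On a Linux client (not run here):
#   keepalive_verdict "$(sysctl -n net.ipv4.tcp_keepalive_time)"
```

An untuned Linux host reports too-slow, which is why the resets tend to appear only on long-lived, mostly idle connections.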

The quotas you will actually hit, with the default and how to handle each:

Quota	Default	Adjustable?	What hitting it looks like	Mitigation
Allowed principals per endpoint service	50	Yes (Service Quotas)	51st consumer can’t be added	Raise before the 50th account; prefer org-condition
Interface endpoints per VPC	50	Yes	New endpoint creation fails	Consolidate, or request an increase
Endpoint services per account/region	20–50 (varies)	Yes	New service creation fails	Raise; or share one service across consumers
Connections per endpoint (per AZ)	~tens of thousands	Effectively scale-bound	New flows fail at extreme fan-in	Add AZs; scale NLB targets
NLB targets per target group	500 (instance/IP)	Yes	Can’t register more targets	Increase, or shard behind multiple NLBs
NLB TCP idle timeout	350 s	No	Long-lived flows reset at ~6 min	Client TCP keepalive < 350 s
Source ports per flow tuple	~64 K per dst	No	Port exhaustion on one hot 5-tuple	Spread targets/ports; reuse connections
NLBs per endpoint service	1 active set (per region/VPC)	n/a	Can’t span VPCs/regions with one service	Publish multiple services or use cross-region
Subnets (AZs) per NLB	One per AZ, up to region AZ count	n/a	Can’t publish in an AZ with no subnet	Add a subnet per AZ you want to serve
Connection notifications per service	Small fixed cap	No	Extra SNS wiring rejected	Fan out from one SNS topic instead
Target group health-check interval	10–30 s	Yes	Slow detection of dead targets	Tune interval/thresholds for your SLA

The behaviours and limits that decide resiliency, side by side:

Behaviour	Default	Why it matters	What to do
Zonal NLB routing	On (cross-zone off)	ENI reaches only same-AZ targets	Enable cross-zone unless you want zonal isolation
Single connection per consumer VPC	Always	One endpoint = one provider-side connection	Don’t create duplicate endpoints per AZ
AZ-ID vs AZ-name mismatch	Inherent across accounts	`us-east-1a` differs per account	Reason with AZ IDs (`use1-az1`)
350 s idle timeout	Fixed	Long-poll/gRPC/DB connections reset	Keepalive below the threshold
Cross-zone data charges	Billed to LB owner	Cost can surprise on chatty services	Model inter-AZ GB; keep targets balanced
ENI per AZ, not per client	Always	Thousands of clients share one ENI/AZ	Scale targets, not endpoints, for fan-in
No source-IP visibility	Default	Targets see the NLB’s private IP, not the consumer’s client	Use PROXY protocol v2 if the app needs it
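Reasoning by AZ ID comes down to a set intersection: the zones the provider publishes in versus the zones the consumer has subnets in. A minimal sketch, assuming both sides have already collected their AZ IDs as space-separated lists; common_az_ids is a hypothetical helper, and the commands in the comments are one way to gather the inputs:

```shell
# Print the AZ IDs present in both space-separated lists, one per line, sorted.
common_az_ids() {
  for az in $1; do
    for other in $2; do
      if [ "$az" = "$other" ]; then
        echo "$az"
      fi
    done
  done | sort -u
}

# Gathering inputs (not run here):
#   name-to-ID map per account:
#     aws ec2 describe-availability-zones \
#       --query 'AvailabilityZones[].[ZoneName,ZoneId]' --output text
#   AZ IDs of the consumer's chosen subnets:
#     aws ec2 describe-subnets --subnet-ids subnet-ca subnet-cb \
#       --query 'Subnets[].AvailabilityZoneId' --output text

common_az_ids "use1-az1 use1-az2 use1-az4" "use1-az2 use1-az4 use1-az6"
```

An empty result means the consumer's subnets sit in zones the service does not cover, which is the situation behind the zone-specific failures described above.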

Observability and cost

You are billed on two axes for an interface endpoint: an hourly charge per endpoint per AZ, and a per-GB data-processing charge on traffic through it. The per-AZ hourly line is why you do not blindly enable every AZ — each ENI is its own meter. The data-processing charge is on top of any inter-AZ transfer the NLB incurs, and at high fan-in across hundreds of consumer endpoints, the per-endpoint hourly cost dominates and is easy to overlook. The good news: PrivateLink bypasses SNAT entirely for the consumer — traffic to the ENI rides the backbone, so you do not burn the consumer’s NAT-gateway SNAT ports the way a public-internet call would.

For traffic visibility, VPC Flow Logs on the endpoint ENIs show source IPs, ports, and accept/reject actions — invaluable when a consumer swears they cannot connect. The endpoint ENIs have stable interface IDs; filter on them.

# CloudWatch Logs Insights over VPC Flow Logs: rejected traffic to the endpoint ENI
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter interfaceId = "eni-0abc123def456789"
| filter action = "REJECT"
| stats count() as rejects by srcAddr, dstPort
| sort rejects desc

On the provider side, the meaningful signals are NLB CloudWatch metrics: HealthyHostCount/UnHealthyHostCount per target group, ActiveFlowCount, and TCP_Target_Reset_Count. A rise in target resets with healthy hosts usually points at idle-timeout or application-side connection churn rather than the network. Observability sources and what each is the source of truth for:

Signal source	Lives where	Source of truth for
VPC Flow Logs (endpoint ENI)	Consumer account	Client→ENI accept/REJECT, source IP/port
`ActiveFlowCount` / `NewFlowCount`	Provider NLB metrics	Concurrent flows; fan-in pressure
`HealthyHostCount` / `UnHealthyHostCount`	Provider target group	Whether targets are in rotation
`TCP_Target_Reset_Count`	Provider NLB metrics	Idle-timeout / app churn resets
`ProcessedBytes`	Provider NLB metrics	Data volume (cost driver)
Connection-notification events	Provider SNS topic	Who connected/was rejected and when
`describe-vpc-endpoint-connections`	Provider control plane	Per-consumer connection state

The cost model — what drives the bill and how to control it (figures approximate, us-east-1, USD; ₹ at ~₹86/USD):

Cost driver	Rough rate	Who pays	How to control
Interface endpoint, per-AZ hourly	~$0.01/AZ/hr (~₹0.86)	Consumer	Only enable AZs you actually use
Endpoint data processing	~$0.01/GB (~₹0.86)	Consumer	Tier on volume; consolidate chatty calls
NLB hourly + LCU	~$0.0225/hr + LCU	Provider	Right-size; one NLB per service, not per consumer
NLB cross-zone data	inter-AZ $0.01/GB each way	Provider (LB owner)	Balance targets; weigh zonal isolation
Flow Logs storage/ingest	CloudWatch/S3 rates	Whoever logs	Sample, or log REJECT-only for cost
SNS + Lambda (auto-approve)	Negligible	Provider	Effectively free at onboarding volumes
Saved consumer NAT/SNAT	(a credit, not a charge)	Consumer	PrivateLink bypasses the public path entirely
Cross-region data (if used)	inter-region transfer rate	Both	Only when `supported_regions` spans regions

A worked sizing note: at 120 consumer accounts each enabling two AZs, the per-endpoint hourly line alone is roughly 120 × 2 × $0.01 × 730 hr ≈ $1,750/month (~₹150,000) before a single byte flows — which is exactly why the per-AZ count is a deliberate decision, not a default.
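The same arithmetic as a reusable sketch, so you can rerun it for your own consumer count. endpoint_hours_usd is a hypothetical helper, and the default rate and hours are the article's rough us-east-1 figures, not a price quote:

```shell
# Monthly endpoint-hours cost in USD, before any data processing.
# $1 = consumer endpoints, $2 = AZs enabled per endpoint,
# $3 = USD per AZ-hour (default 0.01), $4 = hours per month (default 730)
endpoint_hours_usd() {
  awk -v n="$1" -v az="$2" -v rate="${3:-0.01}" -v hrs="${4:-730}" \
    'BEGIN { printf "%.2f\n", n * az * rate * hrs }'
}

endpoint_hours_usd 120 2   # prints 1752.00
```

Running it with three AZs instead of two (endpoint_hours_usd 120 3) gives 2628.00, which shows how much one extra AZ per consumer adds at this fan-in.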

Architecture at a glance

Read the diagram left to right; it traces a single request through the system and pins the five failure classes onto the exact hop where each bites. A client in the consumer VPC resolves the friendly name payments.internal.example.com, which — thanks to the managed split-horizon private hosted zone — returns a local interface-endpoint ENI address rather than anything public. The client opens a TCP connection to that ENI on :8443; its egress SG must allow it out, and the endpoint ENI’s own security group must allow it in (the NLB has no SG, so this is the only L4 filter on the path — badge 1). From the ENI the flow crosses the AWS backbone in one direction only, never touching the internet, and arrives at the provider’s endpoint service. There, two gates decide whether the connection ever existed: the allowed-principals allow-list and the per-connection acceptance step (badge 2), while domain-ownership verification via a public TXT record is what let private DNS resolve in the first place (badge 3).

Past the gates, the service forwards to an internal NLB that load-balances across registered targets (IPs, instances, or an ALB) with TCP health checks. The NLB’s published AZs and its cross-zone setting decide whether an ENI in one AZ can reach targets in another — get this wrong and the service “works in one AZ, dies in another” (badge 4) — and its fixed 350-second idle timeout silently resets long-lived flows without keepalives (badge 5). The right-hand observe zone closes the loop: VPC Flow Logs on the ENI catch REJECTs, NLB metrics catch flow pressure and target resets, and the SNS connection-notification feed can auto-approve new consumers against a registry. The whole method during an incident is to find the badge that matches your symptom, read its legend line, run the named confirm command, and apply the fix.

Real-world scenario

Vantage Pay, a fictional payments platform team of five engineers, ran a fraud-scoring API that ~120 internal accounts needed to call, plus three external partner accounts under contract. Their first instinct was a Transit Gateway attachment per consumer. It died on contact with reality: two recently-acquired business units had VPCs on 10.20.0.0/16 — the same block the fraud service’s VPC used — and renumbering a live, PCI-scoped production VPC was a non-starter. Worse, security flagged that a TGW attachment would grant those partner accounts a route into the platform network, far more reach than “call one API” warranted. The architecture review bounced the TGW plan in a single meeting.

They rebuilt it as a PrivateLink endpoint service. The fraud API went behind an internal NLB across three AZs with cross-zone enabled; the endpoint service ran with acceptance_required = true and an allow-list keyed to specific consumer role ARNs, not account roots, so a partner could only connect from their designated workload role. A connection-notification SNS topic fed a small Lambda that auto-approved any principal present in the platform’s account-registry DynamoDB table and left everything else pending for a human — onboarding dropped from a ticket-and-a-meeting to minutes. The two overlapping 10.20.0.0/16 business units connected with zero renumbering, because PrivateLink never exposes the provider’s address space, and the team published fraud.internal.payments.example.com as a verified private DNS name so every consumer used one stable hostname regardless of account.

# The control that made the partner case acceptable to security:
# allow only the partner's specific workload role, never the account root.
resource "aws_vpc_endpoint_service_allowed_principal" "partner_a" {
  vpc_endpoint_service_id = aws_vpc_endpoint_service.fraud.id
  principal_arn           = "arn:aws:iam::333333333333:role/fraud-client-prod"
}
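The auto-approval step reduces to a lookup against the registry. A minimal sketch of that decision, assuming the registry is exported as a newline-separated file of approved account IDs (a stand-in for the DynamoDB table); approval_decision is a hypothetical helper, and the accept command in the comment is what would follow an accept result:

```shell
# Decide what to do with a pending endpoint connection.
# $1 = owner account ID of the requesting endpoint
# $2 = path to a newline-separated list of approved account IDs (assumed export of the registry)
approval_decision() {
  if grep -qxF "$1" "$2"; then
    echo accept     # known account: safe to accept automatically
  else
    echo hold       # unknown account: leave pending for a human
  fi
}

# On "accept" (not run here):
#   aws ec2 accept-vpc-endpoint-connections --service-id "$SVC" --vpc-endpoint-ids "$EP"
```

grep -x with -F matches whole lines literally, so a partial account ID never passes by accident.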

Two incidents in the first month taught the team the failure map. The first: a newly-onboarded BU reported “endpoint is available but every call hangs.” Flow Logs on the endpoint ENI showed REJECT from the client subnet on :8443 — the BU had created the endpoint with the default security group, which did not allow the service port inbound. Five-minute fix. The second: a partner’s gRPC client saw connections silently dropping “every few minutes.” That was the 350-second NLB idle timeout against a long-lived streaming connection with no keepalive; the partner set a 200-second TCP keepalive and it vanished. Both were textbook, both were one hop, and both went straight into the runbook.

The one real cost they had to plan for was the per-endpoint, per-AZ hourly charge multiplied across 120+ consumers — a line item that genuinely showed up on the bill at roughly ₹150,000/month before data. They accepted it as the price of dropping TGW attachment management and CIDR coordination entirely, and trimmed it by getting each consumer to enable only the two AZs they actually ran in rather than all three. Net: the consumer count became a self-service onboarding problem instead of a networking project, which is exactly the trade PrivateLink is built to make.

Advantages and disadvantages

The publish-one-service model both enables clean cross-account sharing and imposes real constraints. Weigh it honestly:

Advantages (why this model helps you)	Disadvantages (why it constrains you)
Overlapping CIDRs are irrelevant — the consumer talks to ENIs in its own subnets	One service per endpoint; not a general routing solution
Grants reach to one load balancer, not your network — minimal blast radius	One-directional (consumer→provider) only
Scales to hundreds of consumers as self-service onboarding, not a mesh	Requires an internal NLB/GWLB in front of the workload
Never traverses the public internet; bypasses consumer SNAT entirely	Per-endpoint, per-AZ hourly cost compounds at high fan-in
Two stacked gates (principals + acceptance) give fine-grained access control	Provider must prove domain ownership before friendly DNS works
Private DNS gives a stable hostname regardless of consumer account	AZ names differ across accounts — must reason by AZ ID
Connection notifications enable automated, audited onboarding	NLB 350 s idle timeout resets long-lived flows without keepalives
Works for SaaS vendors offering private connectivity to customers	Source client IP is hidden unless you carry PROXY protocol v2

The model is right whenever you are publishing a service to many accounts — a shared internal API, a logging/telemetry sink, or a SaaS product offered privately to customer VPCs — and it shines exactly where peering and TGW buckle: large fan-in and overlapping address space. It is the wrong tool when you need bidirectional, everything-talks-to-everything routing (use TGW), when both VPCs are yours and mutually trusted at small N (peering is cheaper and simpler), or when you need L7 routing and per-request authorization across services (VPC Lattice). The disadvantages are all manageable — the NLB requirement, the cost, the DNS dance, the idle timeout — but only if you know they exist going in, which is the point of this article.

Hands-on lab

Stand up a complete PrivateLink path across two accounts (or two VPCs you control), prove it end to end, then tear it down. The NLB and the endpoint bill by the hour, so a short run costs a few cents as long as you delete everything at the end. Run in CloudShell or with two named CLI profiles (--profile provider, --profile consumer).

Step 1 — Provider: an internal NLB with a simple target. Front a tiny target (an EC2 instance running a health endpoint on :8443, or an IP target) with an internal NLB.

PROFILE_P="--profile provider"
aws elbv2 create-load-balancer $PROFILE_P --name pl-lab-nlb --type network \
  --scheme internal --subnets subnet-pa subnet-pb
# capture the ARN
NLB=$(aws elbv2 describe-load-balancers $PROFILE_P --names pl-lab-nlb \
  --query 'LoadBalancers[0].LoadBalancerArn' --output text)
aws elbv2 modify-load-balancer-attributes $PROFILE_P --load-balancer-arn $NLB \
  --attributes Key=load_balancing.cross_zone.enabled,Value=true

Expected: an ARN returned; cross-zone attribute set to true.

Step 2 — Provider: target group, target, listener.

TG=$(aws elbv2 create-target-group $PROFILE_P --name pl-lab-tg --protocol TCP \
  --port 8443 --vpc-id vpc-prov --target-type ip \
  --query 'TargetGroups[0].TargetGroupArn' --output text)
aws elbv2 register-targets $PROFILE_P --target-group-arn $TG --targets Id=10.10.1.20
aws elbv2 create-listener $PROFILE_P --load-balancer-arn $NLB --protocol TCP --port 8443 \
  --default-actions Type=forward,TargetGroupArn=$TG

Expected: target registers; after a minute describe-target-health shows healthy.

Step 3 — Provider: publish the endpoint service.

SVC=$(aws ec2 create-vpc-endpoint-service-configuration $PROFILE_P \
  --network-load-balancer-arns $NLB --acceptance-required \
  --query 'ServiceConfiguration.ServiceId' --output text)
SVC_NAME=$(aws ec2 describe-vpc-endpoint-service-configurations $PROFILE_P \
  --service-ids $SVC --query 'ServiceConfigurations[0].ServiceName' --output text)
echo "Share this: $SVC_NAME"

Expected: a vpce-svc-… id and a com.amazonaws.vpce.<region>.vpce-svc-… name.

Step 4 — Provider: allow the consumer principal.

aws ec2 modify-vpc-endpoint-service-permissions $PROFILE_P --service-id $SVC \
  --add-allowed-principals arn:aws:iam::222222222222:role/lab-consumer-role

Expected: command succeeds; the consumer can now see the service.

Step 5 — Consumer: create the interface endpoint.

PROFILE_C="--profile consumer"
EP=$(aws ec2 create-vpc-endpoint $PROFILE_C --vpc-id vpc-cons \
  --vpc-endpoint-type Interface --service-name "$SVC_NAME" \
  --subnet-ids subnet-ca subnet-cb --security-group-ids sg-lab-endpoint \
  --no-private-dns-enabled --query 'VpcEndpoint.VpcEndpointId' --output text)

Expected: an endpoint id; state begins as pendingAcceptance.

Step 6 — Provider: accept the connection.

aws ec2 accept-vpc-endpoint-connections $PROFILE_P --service-id $SVC \
  --vpc-endpoint-ids $EP

Expected: within a minute the consumer’s endpoint flips to available.

Step 7 — Consumer: confirm and test.

aws ec2 describe-vpc-endpoints $PROFILE_C --vpc-endpoint-ids $EP \
  --query 'VpcEndpoints[0].[State,DnsEntries[0].DnsName]'
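# From a consumer instance whose SG can reach sg-lab-endpoint on :8443.
# If the lab target serves TLS, its certificate will not match the vpce-… hostname,
# so add -k for this lab test only (or use http:// if the target speaks plain HTTP):
curl -sS -o /dev/null -w '%{http_code}\n' \
  https://<regional-endpoint-dns>:8443/healthz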

Expected: available, a regional DNS name, and a 200 from the health endpoint. The lab steps and their expected output, at a glance:

Step	Command family	Expected output
1	`create-load-balancer` + cross-zone	NLB ARN; cross-zone `true`
2	`create-target-group` / `register-targets`	Target `healthy` after ~1 min
3	`create-vpc-endpoint-service-configuration`	`vpce-svc-…` id + service name
4	`modify-vpc-endpoint-service-permissions`	Consumer principal allowed
5	`create-vpc-endpoint` (Interface)	Endpoint id; `pendingAcceptance`
6	`accept-vpc-endpoint-connections`	Endpoint → `available`
7	`describe-vpc-endpoints` + `curl`	`available`; `200` from `/healthz`

Step 8 — Teardown (run both, in order).

aws ec2 delete-vpc-endpoints $PROFILE_C --vpc-endpoint-ids $EP
aws ec2 delete-vpc-endpoint-service-configurations $PROFILE_P --service-ids $SVC
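# Step 2 did not capture the listener ARN, so look it up from the NLB first
LISTENER=$(aws elbv2 describe-listeners $PROFILE_P --load-balancer-arn $NLB \
  --query 'Listeners[0].ListenerArn' --output text)
aws elbv2 delete-listener $PROFILE_P --listener-arn $LISTENER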
aws elbv2 delete-target-group $PROFILE_P --target-group-arn $TG
aws elbv2 delete-load-balancer $PROFILE_P --load-balancer-arn $NLB

Delete the consumer endpoint before the provider service, or the service delete is blocked by an active connection.

Common mistakes & troubleshooting

When it does not work, walk these in order — most failures are one of the first three, and almost all localise to a single hop. The playbook: match the symptom, read the root cause, run the confirm command, apply the fix.

#	Symptom	Root cause	Confirm (exact command / path)	Fix
1	Endpoint stuck in `pendingAcceptance`	Provider hasn’t accepted; or principal not allow-listed	`describe-vpc-endpoint-connections … VpcEndpointState`	Add the role ARN to allowed principals and `accept-vpc-endpoint-connections`
2	`available` but connections hang/refuse	Endpoint ENI’s SG blocks the service port	VPC Flow Logs on ENI show `REJECT`; check `describe-security-groups`	Allow inbound `:8443` from client CIDR/SG on the endpoint SG
3	DNS returns the long `vpce-…amazonaws.com` name	Private DNS not effective (verify state / flag / VPC DNS)	`describe-vpc-endpoints … PrivateDnsEnabled`; `…PrivateDnsNameConfiguration.State`	Verify domain, set `private_dns_enabled=true`, enable VPC DNS flags
4	Works in one AZ, fails in another	Provider didn’t publish that AZ, or cross-zone off	Compare provider `AvailabilityZones` vs consumer subnet AZ IDs	Publish all AZs + enable cross-zone load balancing
5	Healthy endpoint, but no targets answer	Target group unhealthy (wrong port/host/probe)	`describe-target-health` shows `unhealthy`	Fix health-check port/protocol; ensure app listens on the target port
6	Intermittent resets on long-lived flows	NLB 350 s TCP idle timeout	`TCP_Target_Reset_Count` rising with healthy hosts	Add TCP keepalives below 350 s on the client
7	Consumer can’t even create the endpoint	Principal missing from allowed list	Provider `describe-vpc-endpoint-service-permissions`	Add the consumer principal (role ARN preferred)
8	Endpoint state `rejected`	Provider rejected the connection (or auto-logic did)	`describe-vpc-endpoint-connections … rejected`	Re-request after the provider allow-lists/accepts
9	Connection works, but app sees wrong source IP	Source is the NLB’s private IP, not the real client	Compare app logs vs client IP	Enable PROXY protocol v2 on the target group if you need real client IP
10	DNS verification stuck at `pendingVerification`	TXT record wrong name/value or not propagated	`dig TXT _<token>.<name>`; check the public zone	Fix the TXT name/value; `start-…-private-dns-verification` again
11	“Asymmetric routing”-style weirdness	Almost always an ALB-as-target with mismatched health checks	Inspect the NLB→ALB target chain	Simplify the target chain; align health checks; re-test
12	Hit a hard ceiling adding the 51st consumer	`allowed_principals` per service quota (default 50)	Service Quotas console	Request an increase; or use an `aws:PrincipalOrgID` condition
13	High fan-in: new connections fail under load	NLB target capacity / source-port exhaustion	`ActiveFlowCount`, `NewFlowCount` near ceiling	Add AZs/targets; reuse connections; shard behind multiple NLBs
14	`terraform destroy` fails on the service	Active consumer connection still attached	`describe-vpc-endpoint-connections` non-empty	Delete consumer endpoints first, then the service
15	Endpoint created but no DNS entry at all	VPC DNS support disabled on the consumer VPC	`describe-vpc-attribute --attribute enableDnsSupport`	Enable `enableDnsSupport` (and `enableDnsHostnames`)
16	Connection drops when provider redeploys NLB	NLB replaced, not updated in place	Provider change record; new NLB ARN	Update the service’s NLB ARN; avoid replacing the NLB
17	Partner connects from an unexpected role	Allow-list scoped to account root, not a role	`describe-vpc-endpoint-service-permissions`	Tighten to the specific workload role ARN
18	`curl` works by IP but not by name	Client using stale/cached regional name	`dig +short <name>` vs the ENI IPs	Flush resolver cache; confirm private DNS effective

The fastest first-cut decision table — symptom to most-likely cause to first move:

If you see…	It’s probably…	Do this first
`pendingAcceptance` forever	Acceptance/allow-list gate	Check allow-list, then accept
Hang/refuse on an `available` endpoint	Endpoint ENI SG	Flow Logs → fix inbound SG rule
Long regional DNS name	Private DNS not effective	Check verify state + VPC DNS flags
Zone-specific failures	AZ mismatch / cross-zone off	Compare AZ IDs; enable cross-zone
Resets every few minutes	350 s idle timeout	Add TCP keepalives < 350 s
Can’t add another consumer	Allowed-principals quota	Raise quota or use org condition

The error/limit reference — the strings and states you’ll meet, what they mean, and the fix:

Error / state	Where it appears	Meaning	Fix
`pendingAcceptance`	Endpoint / connection state	Awaiting provider acceptance	Accept the connection / check allow-list
`rejected`	Endpoint state	Provider rejected the connection	Get allow-listed; re-request
`failed`	Endpoint state	Provisioning failed (AZ/subnet)	Fix AZ overlap and subnet capacity
`pendingVerification`	DNS config state	TXT not yet verified	Publish TXT; start verification
`InvalidServiceName`	CLI error	Service name typo / wrong region	Copy the exact `com.amazonaws.vpce.…` string
`You are not authorized…`	CLI error (consumer)	Principal not on allow-list	Provider adds the principal
`REJECT` (Flow Logs)	ENI flow log	SG/NACL blocked the packet	Allow the service port on the endpoint SG
`VpcEndpointLimitExceeded`	CLI error	Endpoints-per-VPC quota hit	Raise quota / consolidate endpoints
`cross_zone` data charge	Cost Explorer line	Inter-AZ NLB traffic	Balance targets; weigh zonal isolation
`TCP_Target_Reset_Count` ↑	NLB metric	Idle-timeout/app resets	TCP keepalives; fix app connection churn
`DnsNamesInUse` / verify failed	DNS config state	TXT mismatch or stale token	Re-fetch token; fix TXT; re-verify
`ExceededAllowedPrincipalsLimit`	CLI error	50-principal cap reached	Raise quota; use `aws:PrincipalOrgID` condition
Endpoint `available`, target `unhealthy`	Mixed signals	App not on the target port/health path	Fix target port + health check; re-register

Best practices

Front the service with an internal NLB across every AZ your big consumers use, with cross-zone load balancing on. This single setting prevents the most common “works in one AZ” class of failure.
Scope allowed principals to role ARNs, not account roots, and never * unless intentional. For org-internal platforms, prefer an aws:PrincipalOrgID condition over enumerating 120 accounts.
Set acceptance_required deliberately. Keep it true with a connection-notification Lambda for audited self-service; only false when the allow-list is your sole, sufficient gate.
Wire connection notifications to SNS from day one — auto-approve against an authoritative registry, alert humans for everything else. Polling for pendingAcceptance does not scale.
Put the security group on the endpoint ENI and allow the service port from client CIDRs/SGs. The NLB has no SG; the endpoint SG is the only L4 filter on the path.
Reason about AZs by ID (use1-az1), not name. us-east-1a is a different physical zone in different accounts; aligning on names silently strands an ENI.
Verify the private DNS domain before telling consumers to flip private_dns_enabled, and confirm the consumer VPC has enableDnsHostnames and enableDnsSupport on.
Set TCP keepalives below the 350-second NLB idle timeout for any gRPC, database, or long-poll connection through PrivateLink.
Enable VPC Flow Logs on the endpoint ENIs and dashboard NLB metrics (ActiveFlowCount, HealthyHostCount, TCP_Target_Reset_Count) before you onboard real traffic.
Review quotas ahead of growth — allowed principals per service (50), endpoints per VPC, endpoint services per account — and request increases before you hit them.
Model the per-endpoint, per-AZ hourly cost against expected fan-in and get consumers to enable only the AZs they run in. The hourly meter, not data, usually dominates at scale.
Treat the service_name as a coordination artifact, not a secret — publish it in a platform catalogue with the supported AZs, port, and onboarding contact.

Security notes

PrivateLink’s security story is strong precisely because it grants the minimum reach: one load balancer, one direction, no route into your network. Lean into that.

Control	What it protects	How to apply
Allowed principals (role ARN)	Who may create an endpoint	Scope to the consumer’s workload role; avoid root/`*`
`aws:PrincipalOrgID` condition	Org-internal access at scale	Allow any account in your org without enumerating
Per-connection acceptance	Onboarding gate	Keep `true`; automate via SNS→Lambda registry check
Endpoint ENI security group	L4 ingress to the ENI	Least-privilege: service port, client SG/CIDR only
No route into the provider VPC	Lateral movement	Inherent — PrivateLink grants no transitive reach
Domain-ownership verification	DNS hijack prevention	Provider proves the domain before private DNS
NLB TLS termination + ACM	Data-in-transit	Use a `TLS` listener with an ACM cert if terminating
App-layer authN/Z	Real identity	Don’t trust source IP across PrivateLink — auth in the app
VPC Flow Logs (REJECT)	Detection/audit	Alert on REJECT spikes; record who connected
PROXY protocol v2	Real client IP (when needed)	Enable on the target group; parse in the app
SCP / RCP guardrails	Org-wide policy on endpoints	Restrict who can create endpoint services/principals
CloudTrail on EC2 endpoint APIs	Change audit	Alert on `ModifyVpcEndpointServicePermissions` changes
Connection-notification audit	Onboarding trail	Persist Connect/Accept/Reject events for review

Three security-specific gotchas worth internalising. First, the source IP your targets see is a private IP of the NLB, not the consumer’s client and not the endpoint ENI (which lives in the consumer VPC), so never build authorization on observed source IP across a PrivateLink boundary; use the principal gate and application auth. Second, traffic is private but not automatically encrypted: PrivateLink keeps the flow off the internet, but if you need confidentiality on the wire, terminate TLS at the NLB (ACM cert) or run TLS end-to-end through it. Third, the allow-list is your perimeter: a stray * with acceptance_required = false is the one mistake that turns a private service into an open one; review it in code and alert on changes.

Cost & sizing

What drives the bill: the per-endpoint, per-AZ hourly charge (consumer side) and per-GB data processing (consumer side), plus the NLB hourly + LCU and any inter-AZ cross-zone data (provider side). At high fan-in, the per-AZ hourly line dominates — it accrues whether or not data flows — which is why the number of enabled AZs is the most important cost lever you have. PrivateLink also saves money on the consumer side by bypassing NAT-gateway SNAT and data-processing for the call (the traffic never goes to the internet). Rough figures, us-east-1, USD, with ₹ at ~₹86/USD:

Item	Rough rate	Side	Sizing lever
Interface endpoint, per AZ	~$0.01/hr (~₹0.86) → ~$7.30/AZ/mo	Consumer	Enable only AZs you use
Endpoint data processing	~$0.01/GB (~₹0.86)	Consumer	Consolidate chatty calls; tier on volume
NLB	~$0.0225/hr + LCU	Provider	One NLB per service, not per consumer
NLB cross-zone inter-AZ	~$0.01/GB each direction	Provider	Balance targets; consider zonal isolation
Flow Logs	CloudWatch/S3 rates	Either	REJECT-only or sampled to cut cost
SNS + Lambda	Effectively free at onboarding scale	Provider	n/a

The sizing rules of thumb, as a decision table:

If you have…	Then size for…	Because…
Many consumers, low per-consumer traffic	Minimising AZ count per endpoint	Hourly meter dominates over data
Few consumers, high traffic	Minimising data processing + NLB LCUs	Data + LCU dominate over hourly
Long-lived connections	Keepalives + NLB headroom	Avoid idle resets and flow churn
Bursty fan-in	NLB targets + AZ spread	Source-port and target capacity are the ceiling
Cross-region consumers	Cross-region data + supported regions	Adds inter-region transfer cost
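
For the long-lived connections row, a minimal sketch of client-side TCP keepalives in Python follows. It assumes a plain TCP socket and the 350-second idle window described in this article; the helper name enable_keepalive and the default values are illustrative. The per-option constants are platform specific, so each is guarded. Protocol libraries such as gRPC usually expose their own keepalive settings, which you would use instead of raw socket options.

```python
import socket

NLB_IDLE_TIMEOUT_S = 350  # idle window quoted in this article


def enable_keepalive(sock, idle_s=120, interval_s=30, probes=3):
    """Turn on TCP keepalives so the first probe fires before the idle timeout."""
    if idle_s >= NLB_IDLE_TIMEOUT_S:
        raise ValueError("keepalive idle time must be below the 350-second idle timeout")
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idleness before the first probe (Linux name, then macOS name).
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    elif hasattr(socket, "TCP_KEEPALIVE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPALIVE, idle_s)
    # Seconds between probes and number of unanswered probes before giving up.
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


# Usage: configure the socket before connecting to the endpoint's DNS name.
client = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
client.close()
```

Any idle value comfortably under 350 seconds works; 120 is only a conservative starting point that leaves room for a delayed probe.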

There is no “free tier” for PrivateLink endpoints, but the lab above runs in cents because the endpoint and NLB exist for minutes. For a 120-consumer internal platform at two AZs each, budget on the order of ₹150,000/month (120 × 2 × ~$7.30 ≈ $1,750) for endpoint hours before data. That is the line item that surprises teams who modelled only data transfer.
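
The endpoint-hours figure can be reproduced with a few lines of Python. This is a back-of-envelope sketch using the rough rates from the table above (~$0.01 per AZ-hour, ~₹86/USD) and an assumed 730-hour month; the function name is illustrative, and real invoices will differ with region and current pricing.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours


def endpoint_hours_monthly(consumers, azs_per_endpoint,
                           usd_per_az_hour=0.01, inr_per_usd=86.0):
    """Monthly hourly-meter cost for interface endpoints, before data processing."""
    usd = consumers * azs_per_endpoint * HOURS_PER_MONTH * usd_per_az_hour
    return usd, usd * inr_per_usd


usd, inr = endpoint_hours_monthly(consumers=120, azs_per_endpoint=2)
print(f"${usd:,.0f}/month ≈ ₹{inr:,.0f}/month")
# prints: $1,752/month ≈ ₹150,672/month
```

Dropping each endpoint to a single AZ halves this line, which is why the AZ count is the first lever to examine at high fan-in.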

Interview & exam questions

Q1. When would you choose PrivateLink over VPC peering or Transit Gateway? When you are publishing a single service to one or many consumers, especially across accounts with overlapping CIDRs, and you want to grant reach to one load balancer rather than a route into your network. Peering/TGW are bidirectional L3 routing that require non-overlapping CIDRs; PrivateLink is unidirectional, one-service, and CIDR-agnostic. (SAP-C02, ANS-C01.)

Q2. Why are overlapping CIDRs irrelevant to PrivateLink? Because the consumer’s workloads talk only to an interface-endpoint ENI carved from the consumer’s own subnet; AWS carries the flow across the backbone to the provider’s NLB without ever exposing the provider’s address space. There is no IP-level route between the VPCs to collide. (ANS-C01.)

Q3. What two controls gate who can connect to an endpoint service, and how do they interact? Allowed principals (who may create an endpoint) and per-connection acceptance (acceptance_required). They stack: without an allow-list entry the consumer can’t connect even with acceptance_required = false; with the entry and acceptance_required = true, the connection waits in pendingAcceptance until accepted. (SAP-C02.)

Q4. A consumer’s endpoint is available but every call hangs. Where do you look first? The endpoint ENI’s security group: it is the first L4 filter on the path, and the only one when the provider’s NLB has no security group attached. Confirm with VPC Flow Logs on the ENI (look for REJECT) and fix the inbound rule to allow the service port from the client. (ANS-C01.)

Q5. Why does private DNS require domain-ownership verification, and where is the TXT record published? To stop a malicious provider from publishing a service that hijacks a domain like *.example.com. The provider publishes the verification TXT record in the public hosted zone for the domain, even though the service itself is private. (SAP-C02.)

Q6. What is split-horizon DNS in this context? AWS creates a managed private hosted zone inside the consumer VPC that resolves the friendly name to the local endpoint ENIs, while the public name resolves to nothing useful externally — so the same hostname behaves differently inside the consumer VPC than on the public internet. (ANS-C01.)

Q7. Why must the NLB be internal, and what do its subnets determine? PrivateLink targets the NLB’s private addresses, so an internet-facing scheme can’t back an endpoint service. The subnets you attach define which AZs the service is available in — a consumer can only create an endpoint in an AZ where the provider has a presence. (ANS-C01.)

Q8. A long-lived gRPC connection through PrivateLink resets every few minutes. Cause and fix? The NLB’s 350-second TCP idle timeout. Fix it by enabling TCP keepalives below that threshold on the client; nothing about PrivateLink itself changes the timeout. (SOA-C02.)

Q9. Why reason about AZs by ID rather than name across accounts? Because us-east-1a may map to a different physical Availability Zone in the consumer account than in the provider account. AZ IDs (use1-az1) are stable across accounts, so aligning endpoint and NLB AZs by ID is correct; by name can silently strand an ENI in an unserved zone. (ANS-C01.)

Q10. What are the two billing axes for an interface endpoint, and which dominates at high fan-in? A per-endpoint, per-AZ hourly charge and a per-GB data-processing charge (both consumer-side). At high fan-in across many consumers, the per-AZ hourly charge dominates because it accrues regardless of traffic. (SAP-C02.)

Q11. How do you automate consumer onboarding at scale? Set acceptance_required = true, wire a connection notification to SNS, and have a Lambda auto-approve principals present in an authoritative registry (e.g. a DynamoDB table) while leaving unknowns pending for a human. This turns onboarding from a ticket into minutes while keeping an audited gate. (SAP-C02.)

Q12. Your targets behind PrivateLink need the real client IP for logging. How? Enable PROXY protocol v2 on the NLB target group so the client’s address is prepended to the TCP stream; otherwise the targets see the NLB’s private address as the source, not the consumer’s client. (ANS-C01.)
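
As a worked illustration of the AZ-ID point in Q9, the sketch below finds which of a consumer's AZ names actually land in a zone the provider serves. The two name-to-ID maps are hypothetical values made up for this example; in practice each account would build its own map from the ZoneName and ZoneId fields returned by aws ec2 describe-availability-zones.

```python
def consumer_azs_served(provider_nlb_az_names, provider_az_map, consumer_az_map):
    """Return the consumer's AZ names whose AZ IDs match a provider NLB zone."""
    served_ids = {provider_az_map[name] for name in provider_nlb_az_names}
    return sorted(name for name, zone_id in consumer_az_map.items()
                  if zone_id in served_ids)


# Hypothetical mappings: the same AZ name points at different AZ IDs per account.
provider_az_map = {"us-east-1a": "use1-az1", "us-east-1b": "use1-az2", "us-east-1c": "use1-az4"}
consumer_az_map = {"us-east-1a": "use1-az2", "us-east-1b": "use1-az4", "us-east-1c": "use1-az6"}

# The provider's NLB sits in its own us-east-1a and us-east-1b (use1-az1, use1-az2).
print(consumer_azs_served(["us-east-1a", "us-east-1b"], provider_az_map, consumer_az_map))
# → ['us-east-1a']
```

Matching by name would have suggested two usable zones here; matching by ID shows that only the consumer's us-east-1a overlaps with the provider.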

Quick check

You need to expose one internal API to 150 accounts, several with overlapping 10.0.0.0/16 CIDRs. Which connectivity primitive, and why?
A consumer reports their endpoint is stuck in pendingAcceptance. What two things could be wrong, and what command confirms each?
dig payments.internal.example.com from a consumer instance returns a long vpce-…amazonaws.com name. List three things to check.
Why does the endpoint ENI’s security group matter when the NLB has none?
A streaming connection through PrivateLink drops every few minutes with healthy targets. What’s the cause and the one-line fix?

Answers

PrivateLink endpoint service. It publishes one service (not a network), is CIDR-agnostic because the consumer talks to ENIs in its own subnets, and scales to many accounts as self-service. Peering/TGW would demand non-overlapping CIDRs and grant a route into your network.
Either the consumer’s principal is not on the allowed-principals list, or the provider hasn’t accepted the connection. Confirm the connection state with aws ec2 describe-vpc-endpoint-connections and the permissions with aws ec2 describe-vpc-endpoint-service-permissions; fix by adding the principal with aws ec2 modify-vpc-endpoint-service-permissions and/or accepting the connection with aws ec2 accept-vpc-endpoint-connections.
(a) private_dns_enabled = true on the consumer endpoint; (b) the service’s PrivateDnsNameConfiguration.State is verified; (c) the consumer VPC has enableDnsHostnames and enableDnsSupport both on. Any one missing falls back to the long name.
Because the NLB has no security group, the endpoint ENI’s SG is the only L4 filter on the path. It must allow inbound on the service port from the client CIDRs/SGs, or connections to an otherwise-available endpoint hang or are refused.
The NLB 350-second TCP idle timeout. Fix: set TCP keepalives below 350 seconds on the client.

Glossary

PrivateLink — AWS service that exposes one service across accounts/VPCs as an interface endpoint, with no routing or CIDR overlap concerns.
Endpoint service — The provider-side publishing layer that sits on top of an internal NLB/GWLB and advertises a service name.
Service name — The coordination string com.amazonaws.vpce.<region>.vpce-svc-xxxx consumers use to find and target the service.
Interface endpoint — A consumer-side Interface-type VPC endpoint that provisions one ENI per subnet pointing at the provider’s service.
Endpoint ENI — The elastic network interface (one per AZ) in the consumer’s subnet; the only hop consumer workloads talk to, and where the SG filter lives.
Allowed principals — The allow-list of IAM principal ARNs permitted to create an endpoint to the service (access gate #1).
Acceptance / acceptance_required — The per-connection approval gate; when true, new endpoints wait in pendingAcceptance (access gate #2).
Connection notification — An SNS feed of Connect/Accept/Reject/Delete events for automating or alerting on onboarding.
Private DNS name — A friendly hostname the provider associates so consumers avoid the regional endpoint name; requires domain verification.
Domain-ownership verification — The TXT-record proof (in the public zone) that the provider owns the domain before private DNS is allowed.
Split-horizon DNS — Resolution that returns the endpoint ENIs inside the consumer VPC while the public name resolves to nothing useful externally.
Cross-zone load balancing — NLB setting that lets an ENI reach targets in any AZ, not just its own; off by default.
AZ ID — A stable physical-zone identifier (use1-az1) that is consistent across accounts, unlike the AZ name.
Idle timeout — The NLB’s fixed 350-second TCP idle window after which inactive flows reset.
PROXY protocol v2 — A target-group option that prepends the real client address to the TCP stream so targets can see it.

Next steps

Master the VPC fundamentals these endpoints build on in AWS VPC Deep Dive: Subnets, Routing, IGW, NAT & Endpoints.
Compare the routing alternative for many-to-many connectivity in AWS Transit Gateway Multi-Account VPC Architecture.
Go deep on the load balancer underneath in AWS Elastic Load Balancing: ALB, NLB & GWLB Deep Dive.
Tighten the cross-account access model with IAM Cross-Account Roles, External ID & the Confused Deputy.
Evaluate the app-layer alternative for HTTP service meshes in VPC Lattice Service Networks with IAM Auth.