You have an internal API that another team — or another company — needs to reach. The reflex is to peer the VPCs or hang both off a Transit Gateway. Both grant network-layer reachability, and both fall apart the moment the consumer’s 10.0.0.0/16 collides with yours, which across a large estate it always eventually does. AWS PrivateLink solves a narrower problem and solves it cleanly: it exposes one service across a trust boundary as a single elastic network interface (an ENI) in the consumer’s subnet. No routes, no transitive reachability, no CIDR negotiation, no public internet. This is how the most painful “expose this to 200 accounts” problem on AWS becomes a self-service onboarding problem instead of a networking project.
The mechanism has two halves and a wire between them. On the provider side you front your workload with an internal Network Load Balancer (NLB) or Gateway Load Balancer, then publish a VPC endpoint service on top of it; that service hands out a coordination string — com.amazonaws.vpce.<region>.vpce-svc-xxxx — and gates who connects through an allowed-principals allow-list plus an optional per-connection acceptance step. On the consumer side you create an interface endpoint naming that service, AWS provisions one ENI per subnet, and your workloads talk only to those ENIs. Private DNS then lets a friendly name like payments.internal.example.com resolve — inside the consumer VPC only — to the local ENIs, so client code never learns the ugly regional hostname. The catch that trips every team once is that AWS makes the provider prove they own the domain before any consumer can switch private DNS on.
By the end of this article you will build both sides correctly, wire up private DNS with domain-ownership verification, reason about availability zones across accounts (where us-east-1a in your account is not the same physical zone as in theirs), size for fan-in against the real quotas, keep the whole thing observable with VPC Flow Logs and NLB metrics, and — when it inevitably misbehaves — localise the failure to exactly one hop with a runbook of symptom, root cause, the exact command to confirm, and the fix. Every configuration here carries both a Terraform block and an aws CLI command, and because this is a reference you will return to mid-incident, the option matrices, error catalogue, quotas, and the troubleshooting playbook are all laid out as scannable tables.
What problem this solves
The pain is reachability that is too broad and address space that collides. When you peer two VPCs or attach them to a Transit Gateway, you give the consumer a route into your network. A misconfigured security group, an over-broad route table, or a curious operator on the other side can then reach anything routable in your VPC — not just the one API you meant to share. Security teams rightly flag this for external partners: “call one endpoint” should not grant “a path into the platform network.” PrivateLink gives the consumer a route to one load balancer, full stop, and only ever in the consumer→provider direction.
The second pain is CIDR overlap. Peering and TGW both demand non-overlapping address space, because they route on IP. In a large organisation — especially after acquisitions — two production VPCs sitting on the same 10.20.0.0/16 is not a hypothetical; it is Tuesday. Renumbering a live, compliance-scoped VPC is a multi-quarter project nobody signs up for. PrivateLink never exposes the provider’s address space at all: the consumer talks to ENIs carved from the consumer’s own subnets, so the two VPCs can use byte-identical CIDRs and never know.
Who hits this: any platform team publishing an internal service to many accounts (a fraud-scoring API, a config service, a logging sink), any SaaS vendor offering private connectivity to customer VPCs, and anyone who looked at a VPC-peering mesh of 80 accounts and realised it does not scale. The cost of PrivateLink is that it is one-directional and one-service-per-endpoint, and it requires a load balancer in front of the workload. If you genuinely need bidirectional, everything-talks-to-everything connectivity, this is the wrong tool — reach for TGW. To frame the whole decision before the deep dive, here is when each connectivity primitive is the right call:
| Pattern | What it connects | CIDR overlap | Direction | Bills per GB | Pick it when |
|---|---|---|---|---|---|
| VPC peering | Two VPCs, full IP reachability | Must not overlap | Bidirectional | No per-GB fee (cross-AZ transfer still billed) | Two VPCs, mutual trust, small N |
| Transit Gateway | Many VPCs/accounts, policy-routed | Must not overlap | Bidirectional | Yes | Hub-and-spoke routing, segmentation |
| PrivateLink | One service behind an NLB/GWLB | Irrelevant | Unidirectional (consumer→provider) | Yes | Publish a service to many accounts |
| VPC Lattice | App-layer services across accounts | Irrelevant | Policy-governed | Yes | HTTP service mesh, per-request auth |
| Public endpoint + IAM | A regional AWS API/SaaS over public IP | n/a | Caller→service | Egress only | You accept public-path + IAM-only |
The deciding question is always reachability versus publishing. Peering and TGW are routing: they give the consumer a route to your network. PrivateLink is publishing: it gives the consumer a route to one load balancer. If you would put the thing behind a load balancer and a DNS name anyway, publish it — do not route to it.
Mental model: peering and TGW are routing. PrivateLink is publishing. The consumer never touches your network — only an ENI in their own subnet that happens to forward to your NLB.
Learning objectives
By the end of this article you can:
- Choose PrivateLink over peering, Transit Gateway, or VPC Lattice by reasoning about reachability, CIDR overlap, direction, and fan-in — and explain why overlapping CIDRs are irrelevant to it.
- Build the provider side: an internal NLB across the right AZs with cross-zone load balancing, a VPC endpoint service on top, and the service_name coordination string handed to consumers.
- Gate access with two stacked controls, allowed principals (who may create an endpoint) and per-connection acceptance (acceptance_required), and automate approval with connection notifications to SNS and a Lambda.
- Build the consumer side: an interface endpoint with one ENI per AZ, the security group on the endpoint ENI, and AZ alignment reasoned about by AZ ID, not name.
- Publish a friendly private DNS name with domain-ownership verification (the TXT-record dance) and explain the split-horizon resolution that makes it work only inside the consumer VPC.
- Size for high fan-in against the real quotas — allowed principals per service, endpoints per VPC, the NLB 350-second idle timeout, source-port and target capacity — and model the per-AZ-hourly plus per-GB cost.
- Observe and troubleshoot the path with VPC Flow Logs on the ENIs and NLB CloudWatch metrics, and walk a symptom→cause→confirm→fix runbook to localise any failure to one hop.
Prerequisites & where this fits
You should be comfortable with core VPC concepts (subnets, route tables, security groups, ENIs, and Availability Zones) and know that an NLB is a Layer-4 (TCP/UDP/TLS) load balancer on which security groups are optional, so many NLBs run with none. You should be able to run the aws CLI with a profile per account (this is inherently a two-account exercise), read JSON output, and apply Terraform. Familiarity with Route 53 hosted zones (public and private) and basic DNS resolution helps a great deal for the private-DNS section.
This sits in the Networking track and assumes the fundamentals from the AWS VPC Deep Dive: Subnets, Routing, IGW, NAT & Endpoints — interface endpoints are the same primitive you use for AWS service endpoints, pointed at a private service instead. It is the cross-account complement to the AWS Transit Gateway Multi-Account VPC Architecture: TGW for routing meshes, PrivateLink for service publishing. The NLB underneath is covered in AWS Elastic Load Balancing: ALB, NLB & GWLB Deep Dive, the access controls lean on IAM Cross-Account Roles, External ID & the Confused Deputy, and for an app-layer alternative compare VPC Lattice Service Networks with IAM Auth.
A quick map of who owns and confirms what during an incident, so you call the right person fast:
| Layer | What lives here | Which account owns it | Failure classes it can cause |
|---|---|---|---|
| Consumer client + its SG | App code, egress rules | Consumer | Client SG blocks egress to the ENI |
| Endpoint ENI + its SG | The interface endpoint, per-AZ ENIs | Consumer | Inbound SG blocks the client → hang/refuse |
| Private DNS (managed PHZ) | Split-horizon name → ENI | Consumer (created by AWS) | Long regional name returned; resolution fails |
| AWS backbone | The PrivateLink data path | AWS (managed) | Practically never; AZ mismatch shows here |
| Endpoint service + gates | Allowed principals, acceptance, DNS verify | Provider | pendingAcceptance; private DNS not effective |
| Internal NLB + targets | L4 LB, target groups, health checks | Provider | Unhealthy targets, AZ gaps, idle resets |
Core concepts
Five mental models make every later step and every failure obvious.
An endpoint service publishes one service, not a network. On the provider side, a VPC endpoint service is a thin publishing layer that sits on top of an internal NLB (or GWLB) and advertises a single coordination string, the service name (com.amazonaws.vpce.<region>.vpce-svc-xxxx). It exposes exactly what the NLB fronts — one TCP service — and nothing else of your VPC. There is no route, no IP range, no transitive path. The consumer cannot “see” your network; they can only reach the load balancer you chose to publish.
The interface endpoint is an ENI in the consumer’s own subnet. On the consumer side, a VPC endpoint of type Interface names the provider’s service, and AWS provisions one ENI per subnet you specify, each with a private IP from that consumer subnet’s range. Those ENIs are the only thing consumer workloads ever talk to. This is why CIDR overlap is irrelevant: the client connects to 10.0.1.50 in its own VPC, and AWS quietly carries the flow across the backbone to the provider’s NLB. The provider’s address space never enters the picture.
Two stacked controls decide who connects. Access is governed by two independent gates that both must pass. Allowed principals is an allow-list of IAM principal ARNs (a role, an account root, or *) that decides who may even create an endpoint to your service — without an entry, the consumer cannot see or target it at all. Acceptance (acceptance_required) is a per-connection gate: when true, every new endpoint lands in pendingAcceptance and waits for the provider to approve it. They stack: with acceptance_required = false an unlisted principal still cannot connect; with an open allow-list, acceptance still holds each connection until approved.
Private DNS is split-horizon and requires proof of ownership. By default the endpoint hands the consumer an ugly regional name. A private DNS name lets the provider associate a friendly hostname so consumers keep calling it. AWS implements this as split-horizon: it creates a managed private hosted zone inside the consumer VPC that resolves the name to the local ENIs, while the public name resolves to nothing useful outside. To stop anyone from publishing a service that hijacks *.example.com, AWS forces the provider to prove domain ownership via a TXT record in the public zone before any consumer may enable private DNS.
The NLB underneath sets the rules of physics. Everything downstream — which AZs the service is reachable in, whether traffic stays zone-local, how long-lived connections behave — is decided by the NLB. The subnets you attach define the published AZs; cross-zone load balancing decides whether an ENI in us-east-1a can reach targets in us-east-1b; the 350-second TCP idle timeout silently resets long-lived flows without keepalives. Get the NLB right and PrivateLink is boring; get it wrong and you get “works in one AZ, dead in another.”
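To make the idle-timeout point concrete, one quick check on a Linux client host is whether its kernel keepalive timer fires before the 350-second window. The sketch below is illustrative only: the function name keepalive_verdict and the NLB_IDLE_TIMEOUT variable are made up for this example, and the kernel setting only matters for sockets on which the application has enabled SO_KEEPALIVE.

```shell
#!/usr/bin/env bash
# Sketch: compare a TCP keepalive idle time against the NLB idle timeout.
# NLB_IDLE_TIMEOUT and keepalive_verdict are illustrative names, not AWS tooling.
NLB_IDLE_TIMEOUT=350

keepalive_verdict() {
  local keepalive_time="$1" # seconds of idle before the first keepalive probe
  if [ "$keepalive_time" -lt "$NLB_IDLE_TIMEOUT" ]; then
    echo "ok"   # a probe refreshes the flow before the idle window closes
  else
    echo "risk" # the flow can be dropped before the first probe is sent
  fi
}

# Read the live Linux setting when it is exposed; 7200 is the usual kernel default.
current="$(cat /proc/sys/net/ipv4/tcp_keepalive_time 2>/dev/null || echo 7200)"
echo "tcp_keepalive_time=${current}s -> $(keepalive_verdict "$current")"
```

With the stock Linux default of 7200 seconds the verdict is risk, so long-lived clients typically lower the timer (for example through the net.ipv4.tcp_keepalive_time sysctl) or set per-socket keepalive options in the application.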
The vocabulary in one table
Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:
| Concept | One-line definition | Whose account | Why it matters |
|---|---|---|---|
| Endpoint service | Publishing layer over an NLB/GWLB | Provider | The thing consumers connect to |
| Service name | com.amazonaws.vpce.<region>.vpce-svc-… | Provider (issued by AWS) | Coordination key; share out of band |
| Interface endpoint | Interface-type endpoint → ENIs | Consumer | What client workloads actually talk to |
| Endpoint ENI | One private IP per AZ in the consumer subnet | Consumer | The only L4 hop; its SG is the filter |
| Allowed principals | Allow-list of who may create an endpoint | Provider | Gate #1; scope to role ARNs |
| Acceptance | Per-connection approval (acceptance_required) | Provider | Gate #2; pendingAcceptance until approved |
| Private DNS name | Friendly name resolving to the ENIs | Provider sets, consumer enables | Avoids the ugly regional hostname |
| Domain verification | TXT record proving domain ownership | Provider | Must be verified before private DNS works |
| Cross-zone LB | NLB sends across AZs, not just zone-local | Provider | Off → “works in one AZ only” |
| Connection notification | SNS event on Connect/Accept/Reject/Delete | Provider | Auto-approval / human alerting |
| AZ ID | Stable physical-zone identifier (use1-az1) | Both | AZ names differ per account |
| Idle timeout | NLB’s fixed 350 s TCP inactivity window | Provider (NLB) | Long-lived flows reset without keepalive |
| PROXY protocol v2 | Prepends real client IP to the TCP stream | Provider (target group) | Targets otherwise see the ENI, not the client |
| GWLB endpoint | Endpoint type for inline-appliance services | Both | Inspection/firewall services use GWLB, not NLB |
| Regional endpoint name | The long vpce-…amazonaws.com hostname | Consumer | The fallback when private DNS isn’t effective |
When an endpoint service is the right call
PrivateLink, peering, Transit Gateway, and VPC Lattice are not interchangeable. Pick by what you are actually sharing and in which direction. The single deciding factor is whether you want to give the other side a route to your network or a path to one service; the rest follows. Reach for an endpoint service when the answer to “would I put this behind a load balancer and a DNS name anyway?” is yes.
The full comparison, on the dimensions that decide a real design:
| Dimension | PrivateLink | VPC peering | Transit Gateway | VPC Lattice |
|---|---|---|---|---|
| Granularity | One service per endpoint | Whole VPC | Whole VPCs (route-domains) | Per app-layer service |
| CIDR overlap allowed | Yes (irrelevant) | No | No | Yes |
| Direction | Consumer→provider only | Bidirectional | Bidirectional | Policy-governed |
| Layer | L4 (TCP/UDP/TLS via NLB) | L3 (IP) | L3 (IP) | L7 (HTTP) + L4 |
| Transitive reach | None (no route) | Non-transitive | Transitive (by design) | Service-scoped |
| Scales to N consumers | Excellent (fan-in) | Poor (mesh) | Good (hub-spoke) | Excellent |
| Public internet | Never | Never | Never | Never |
| Auth model | Allowed principals + acceptance | SG/NACL only | SG/NACL + routing | IAM auth policies |
| Bills per GB | Yes (data processing) | No per-GB fee (cross-AZ transfer still billed) | Yes | Yes |
| DNS integration | Private DNS (verified) | Manual | Manual / R53 Resolver | Built-in service DNS |
| Onboarding model | Self-service (gated) | Per-peer request | Per-attachment | Service association |
| Typical use | SaaS / shared internal API | Two trusted VPCs | Routing mesh | HTTP service mesh |
Five reading notes that save the most design time:
| If your requirement is… | Don’t use… | Use… | Because… |
|---|---|---|---|
| “Expose one API to 200 accounts” | Peering mesh / TGW | PrivateLink | Fan-in, no routes, overlap-proof |
| “Many VPCs must route to each other” | PrivateLink | Transit Gateway | PrivateLink is one-service, one-way |
| “Partner gets a path into our net” | Peering / TGW | PrivateLink | Grants one service, not network reach |
| “HTTP routing + per-request authZ” | Raw PrivateLink | VPC Lattice | Lattice does L7 + IAM auth policies |
| “Two of my own VPCs, full trust” | PrivateLink | VPC peering | Cheaper, bidirectional, simpler |
Provider step 1 — Front the service with an internal NLB
An endpoint service sits on top of an NLB (or GWLB). We will use an NLB. It must be internal — PrivateLink targets the load balancer’s private addresses, not an internet-facing scheme — and you register your service’s targets (instances, IPs, or even an ALB as a target if you need L7 routing behind it) in a target group.
```hcl
resource "aws_lb" "svc" {
  name                             = "payments-svc-nlb"
  internal                         = true
  load_balancer_type               = "network"
  subnets                          = var.provider_subnet_ids # one per AZ you publish
  enable_cross_zone_load_balancing = true
}

resource "aws_lb_target_group" "svc" {
  name        = "payments-svc-tg"
  port        = 8443
  protocol    = "TCP"
  vpc_id      = var.provider_vpc_id
  target_type = "ip"

  health_check {
    protocol            = "TCP"
    port                = "8443"
    healthy_threshold   = 3
    unhealthy_threshold = 3
    interval            = 10
  }
}

resource "aws_lb_listener" "svc" {
  load_balancer_arn = aws_lb.svc.arn
  port              = 8443
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.svc.arn
  }
}
```

```shell
# Equivalent, CLI: an internal NLB with cross-zone enabled
aws elbv2 create-load-balancer --name payments-svc-nlb --type network \
  --scheme internal --subnets subnet-aaa subnet-bbb subnet-ccc
aws elbv2 modify-load-balancer-attributes --load-balancer-arn <nlb-arn> \
  --attributes Key=load_balancing.cross_zone.enabled,Value=true
```
Two decisions here matter for everything downstream. First, the subnets you attach to the NLB define which AZs the service is available in. A consumer can only create an endpoint in an AZ where you have a presence. Publish in at least two, and prefer to publish in every AZ your largest consumers use. Second, turn on cross-zone load balancing. An NLB is zonal by default: an endpoint ENI in us-east-1a will only send to targets in us-east-1a unless cross-zone is enabled, which with uneven target distribution produces hot zones and surprising health-check behaviour. Cross-zone traffic on an NLB incurs inter-AZ data charges, which the load-balancer owner pays — budget for it.
The NLB decisions that ripple into PrivateLink behaviour, with the default, the trade-off, and the gotcha:
| Setting | Values | Default | When to change | Trade-off / gotcha |
|---|---|---|---|---|
| scheme | internal / internet-facing | n/a | Must be internal for PrivateLink | Internet-facing NLB cannot back an endpoint service |
| Subnets (AZs) | One subnet per AZ | none | Publish every AZ big consumers use | Consumer can’t endpoint into an AZ you skip |
| cross_zone.enabled | true / false | false | Almost always true | false → ENI only reaches same-AZ targets; inter-AZ data charges when on |
| target_type | instance / ip / alb | instance | ip for ECS/containers; alb for L7 behind | alb target enables HTTP routing but adds a hop |
| Listener protocol | TCP / UDP / TLS / TCP_UDP | none | TLS to terminate at the NLB | TLS needs an ACM cert; otherwise pass TCP through |
| Health-check protocol | TCP / HTTP / HTTPS | TCP | HTTP(S) for app-aware checks | TCP only proves the port is open, not the app |
| deletion_protection | true / false | false | true in prod | Prevents an accidental terraform destroy outage |
| preserve_client_ip | true / false | varies by target type | Needed if the app reads source IP | Through PrivateLink the source is the ENI, not the real client |
A subtle but important point on client IP: through PrivateLink the provider’s targets see the endpoint ENI’s network identity, not the consumer’s real client IP (unless you carry it in PROXY protocol v2, which the NLB can prepend). Do not build authorization on observed source IP across a PrivateLink boundary — use the allowed-principals gate and application-layer auth instead.
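If you do turn on PROXY protocol v2 (the target-group attribute proxy_protocol_v2.enabled), every target must expect the extra header. One way to confirm what a target actually receives is to look for the fixed 12-byte v2 signature at the start of a captured stream. The helper below is a hypothetical debugging aid, not part of any AWS tool; it checks the signature only and does not parse the addresses that follow it.

```shell
#!/usr/bin/env bash
# Enabling the header on the provider's target group (placeholder ARN):
#   aws elbv2 modify-target-group-attributes --target-group-arn <tg-arn> \
#     --attributes Key=proxy_protocol_v2.enabled,Value=true

# PROXY protocol v2 streams begin with this fixed 12-byte signature (hex).
PROXY_V2_SIGNATURE="0d0a0d0a000d0a515549540a"

# has_proxy_v2_header: read a byte stream on stdin and print "yes" if its first
# 12 bytes are the PROXY v2 signature, otherwise "no". (Hypothetical helper.)
has_proxy_v2_header() {
  local first12
  first12="$(head -c 12 | od -An -tx1 | tr -d ' \n')"
  if [ "$first12" = "$PROXY_V2_SIGNATURE" ]; then
    echo "yes"
  else
    echo "no"
  fi
}
```

A usage sketch, assuming a raw capture of the first bytes a target received was saved to a file named capture.bin: `has_proxy_v2_header < capture.bin`.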
Provider step 2 — Create the VPC endpoint service
With the NLB live, create the endpoint service that points at it. The key knob is acceptance_required.
```hcl
resource "aws_vpc_endpoint_service" "payments" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.svc.arn]
  tags                       = { Name = "payments-endpoint-service" }
}

output "service_name" {
  # e.g. com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0
  value = aws_vpc_endpoint_service.payments.service_name
}
```

```shell
# CLI equivalent: note the NLB ARN(s), not the VPC
aws ec2 create-vpc-endpoint-service-configuration \
  --network-load-balancer-arns <nlb-arn> \
  --acceptance-required
```
AWS assigns a service name of the form com.amazonaws.vpce.<region>.vpce-svc-xxxxxxxxxxxxxxxxx. This string is what consumers use to find you; it is not secret, but it is the coordination key — hand it to consumers out of band (a wiki, a Terraform output, a platform catalogue). acceptance_required = true means every new connection lands in pendingAcceptance and waits for you to approve it. For a controlled internal platform with a known consumer list, that manual gate is worth keeping; for self-service at scale you flip it to false and rely on the allow-list instead. Do not run with false and an open allow-list unless you genuinely intend anyone in any account to connect.
The endpoint-service configuration options, end to end:
| Setting | Values | Default | When to change | Trade-off / gotcha |
|---|---|---|---|---|
| acceptance_required | true / false | true (console) | false for self-service at scale | false + open allow-list = anyone can connect |
| network_load_balancer_arns | One or more NLB ARNs | none | Multiple for blue/green or sharding | All must be in the same region/VPC |
| gateway_load_balancer_arns | One or more GWLB ARNs | none | For appliance/inspection services | NLB and GWLB are mutually exclusive per service |
| supported_ip_address_types | ipv4 / ipv6 | ipv4 | Add ipv6 for dual-stack consumers | Consumer and NLB must both support the family |
| private_dns_name | FQDN string | unset | Set to publish a friendly name | Triggers the TXT verification requirement |
| allowed_principals | List of ARNs | empty (nobody) | Always add at least one | Empty = no consumer can even see the service |
| supported_regions | Region list (cross-region) | current | For cross-region access (where supported) | Adds cross-region data charges |
| Tags | Key/value | none | Always tag owner + cost-centre | Untagged platform services are unauditable |
The service has a lifecycle state you will check constantly; know what each value means:
| ServiceState | Meaning | What to do |
|---|---|---|
| Pending | Being created | Wait; verify the NLB exists |
| Available | Ready for connections | Add allowed principals; share the name |
| Deleting | Tear-down in progress | Consumers’ endpoints will go to rejected |
| Failed | Creation failed | Check the NLB ARN and account limits |
Provider step 3 — Allowed principals, acceptance, and notifications
Two independent controls govern who reaches your service, and they stack. Allowed principals decide who is even permitted to create an endpoint to your service; without an entry, the consumer cannot see or target it at all. Scope these as tightly as the consumer’s identity allows — a specific role ARN is better than a whole account root, which is better than *.
```hcl
resource "aws_vpc_endpoint_service_allowed_principal" "consumer" {
  vpc_endpoint_service_id = aws_vpc_endpoint_service.payments.id
  principal_arn           = "arn:aws:iam::222222222222:role/payments-client-prod" # tighten to a role
}
```

```shell
aws ec2 modify-vpc-endpoint-service-permissions \
  --service-id vpce-svc-0123456789abcdef0 \
  --add-allowed-principals arn:aws:iam::222222222222:role/payments-client-prod
```
Acceptance is the per-connection gate that applies when acceptance_required = true. List pending connections and approve them:
```shell
aws ec2 describe-vpc-endpoint-connections \
  --filters Name=service-id,Values=vpce-svc-0123456789abcdef0 \
  --query 'VpcEndpointConnections[?VpcEndpointState==`pendingAcceptance`].[VpcEndpointId,VpcEndpointOwner]' \
  --output table

aws ec2 accept-vpc-endpoint-connections \
  --service-id vpce-svc-0123456789abcdef0 \
  --vpc-endpoint-ids vpce-0a1b2c3d4e5f6a7b8
```
To avoid polling for pending connections, wire a connection notification to SNS. You get an event on Connect, Accept, Reject, and Delete, which you can route to a Lambda for auto-approval against an authoritative consumer registry, or just to a channel so a human acts within minutes instead of hours.
```hcl
resource "aws_vpc_endpoint_connection_notification" "payments" {
  vpc_endpoint_service_id     = aws_vpc_endpoint_service.payments.id
  connection_notification_arn = aws_sns_topic.privatelink_events.arn
  connection_events           = ["Connect", "Accept", "Reject", "Delete"]
}
```
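The approval decision itself is small: compare the endpoint owner's account ID with a registry you trust. The sketch below expresses that logic as a CLI sweep rather than a Lambda, assuming a plain-text registry file with one 12-digit account ID per line. The function names and the registry format are hypothetical, and unknown owners are left pending for a human rather than rejected.

```shell
#!/usr/bin/env bash
# decide_connection: print "accept" when the endpoint owner's account ID is in
# the registry file (one account ID per line), otherwise "hold".
decide_connection() {
  local owner="$1" registry="$2"
  if grep -qxF "$owner" "$registry"; then
    echo "accept"
  else
    echo "hold"
  fi
}

# review_pending: apply the decision to every pending connection on a service.
# Needs AWS credentials for the provider account; defined here, not invoked.
review_pending() {
  local service_id="$1" registry="$2" endpoint_id owner
  aws ec2 describe-vpc-endpoint-connections \
    --filters "Name=service-id,Values=${service_id}" \
    --query 'VpcEndpointConnections[?VpcEndpointState==`pendingAcceptance`].[VpcEndpointId,VpcEndpointOwner]' \
    --output text |
    while read -r endpoint_id owner; do
      if [ "$(decide_connection "$owner" "$registry")" = "accept" ]; then
        aws ec2 accept-vpc-endpoint-connections \
          --service-id "$service_id" --vpc-endpoint-ids "$endpoint_id"
      else
        echo "hold: ${endpoint_id} owned by ${owner} needs human review"
      fi
    done
}
```

Keeping the decision in its own function means the same check can sit behind an SNS-triggered handler later without changing the rule.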
How the two gates interact is the single most common source of “why can’t they connect” — read this matrix carefully:
| Allowed principal present? | acceptance_required | Outcome | Provider action needed |
|---|---|---|---|
| No | true | Consumer can’t even create the endpoint | Add the principal first |
| No | false | Consumer can’t create the endpoint | Add the principal first |
| Yes | true | Endpoint sits in pendingAcceptance | Accept the connection (or auto-approve) |
| Yes | false | Endpoint goes available immediately | None (self-service) |
| * (wildcard) | false | Anyone in any account connects | Intentional only; usually a mistake |
| * (wildcard) | true | Anyone may request, you gate each | Acceptable for vetted public-ish services |
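For onboarding tooling it can help to encode the matrix as a lookup, so a support script can tell a consumer which gate is holding them. This is a minimal sketch: gate_outcome and its none/listed/wildcard labels are invented for the example. In practice the inputs would come from describe-vpc-endpoint-service-permissions and the AcceptanceRequired field of describe-vpc-endpoint-service-configurations.

```shell
#!/usr/bin/env bash
# gate_outcome: map the two provider-side gates to where a new endpoint lands.
#   $1 principal match: none | listed | wildcard   (hypothetical labels)
#   $2 acceptance_required: true | false
gate_outcome() {
  local principal="$1" acceptance="$2"
  case "${principal}:${acceptance}" in
    none:true | none:false)
      echo "cannot-create" ;;     # gate 1 fails before acceptance is consulted
    listed:true | wildcard:true)
      echo "pendingAcceptance" ;; # gate 2 holds the connection for approval
    listed:false | wildcard:false)
      echo "available" ;;         # self-service: no provider action needed
    *)
      echo "invalid-input"
      return 1 ;;
  esac
}
```

For example, `gate_outcome listed true` prints pendingAcceptance, which tells the consumer to ask the provider for approval rather than debug their own network.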
Scope the principal as tightly as the consumer’s identity allows — the security blast radius shrinks as you narrow it:
| Principal form | Example | Who can connect | Use when |
|---|---|---|---|
| Role ARN | arn:aws:iam::222…:role/app-prod | Only that workload role | Default: tightest practical scope |
| Account root | arn:aws:iam::222…:root | Any principal in that account | You trust the whole account |
| Org-wide via condition | aws:PrincipalOrgID (policy) | Any account in your org | Internal platform, many accounts |
| Wildcard | * | Anyone, any account | Almost never; vetted + acceptance only |
The connection-notification events, and what you typically do with each:
| Event | Fires when | Typical handler action |
|---|---|---|
| Connect | A consumer creates an endpoint | Look up the principal in the registry |
| Accept | A connection is accepted | Record onboarding; emit a metric |
| Reject | You (or auto-logic) reject it | Alert the consumer with the reason |
| Delete | The consumer deletes the endpoint | Clean up registry/DNS state |
Consumer step 4 — Create the interface endpoint
Now switch to the consumer account. The consumer creates a VPC endpoint of type Interface, naming the provider’s service. AWS provisions one ENI per subnet you specify, each with a private IP from that subnet’s range. Those ENIs are the only thing the consumer’s workloads ever talk to.
```hcl
resource "aws_vpc_endpoint" "payments" {
  vpc_id              = var.consumer_vpc_id
  service_name        = "com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.consumer_subnet_ids # one per AZ; must overlap provider AZs
  security_group_ids  = [aws_security_group.endpoint.id]
  private_dns_enabled = false # see step 5 before enabling
}
```

```shell
aws ec2 create-vpc-endpoint --vpc-id vpc-cons123 --vpc-endpoint-type Interface \
  --service-name com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0 \
  --subnet-ids subnet-c1 subnet-c2 --security-group-ids sg-endpoint \
  --no-private-dns-enabled
```
Three things define correctness on this side. AZ alignment comes first: put the endpoint in the same AZs the provider published. An endpoint ENI in an AZ the provider does not serve is dead weight, because there is no target on the other side. Use AZ IDs (use1-az1), not names, when reasoning across accounts, because us-east-1a in your account may map to a different physical zone than in the provider’s. Second, the security group is on the endpoint ENI, and this is the most common stumble. The SG attached to the endpoint controls traffic from consumer workloads into the ENI; it must allow inbound on the service port from the client CIDRs/SGs. Security groups on the provider’s NLB are optional, and an NLB created without one applies none, in which case the endpoint SG is the only L4 filter on the path. Third, consumer-side SGs on the clients still need egress to the endpoint.
```hcl
resource "aws_security_group" "endpoint" {
  name   = "payments-endpoint-sg"
  vpc_id = var.consumer_vpc_id

  ingress {
    description = "Clients to PrivateLink endpoint"
    from_port   = 8443
    to_port     = 8443
    protocol    = "tcp"
    cidr_blocks = [var.app_subnet_cidr] # or security_groups = [aws_security_group.client.id]
  }
}
```
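Returning to AZ alignment, one way to check it is to compare AZ IDs rather than names. The sketch below assumes the provider has shared the AZ IDs of its NLB subnets out of band and that the consumer reads its own from describe-subnets. The helper common_az_ids is hypothetical, not an AWS command.

```shell
#!/usr/bin/env bash
# Each side can list the AZ IDs of its own subnets (placeholder subnet IDs):
#   aws ec2 describe-subnets --subnet-ids subnet-c1 subnet-c2 \
#     --query 'Subnets[].AvailabilityZoneId' --output text

# common_az_ids: print the AZ IDs present in both whitespace-separated lists.
#   $1 consumer AZ IDs, $2 provider AZ IDs   (hypothetical helper)
common_az_ids() {
  local consumer_ids="$1" provider_ids="$2" id haystack
  # Unquoted on purpose: collapses tabs and newlines into single spaces.
  haystack=" $(echo $provider_ids) "
  for id in $consumer_ids; do
    case "$haystack" in
      *" $id "*) echo "$id" ;;
    esac
  done
}
```

An empty result means none of the consumer's endpoint subnets line up with an AZ the provider serves, which is worth catching before the endpoint is created.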
Finally, one endpoint, many AZs, one connection: each consumer VPC needs exactly one interface endpoint to the service, and the ENIs across AZs share a single connection from the provider’s point of view. The interface-endpoint options that define behaviour:
| Setting | Values | Default | When to change | Trade-off / gotcha |
|---|---|---|---|---|
| vpc_endpoint_type | Interface / Gateway / GatewayLoadBalancer | n/a | Interface for PrivateLink services | Gateway type is only for S3/DynamoDB |
| subnet_ids | One subnet per AZ | none | Match the provider’s published AZs | An ENI in an unpublished AZ is dead weight |
| security_group_ids | SG list | default VPC SG | Always set explicitly | Default SG often denies the service port |
| private_dns_enabled | true / false | true (for AWS svcs) / false (custom) | true once provider DNS is verified | Premature true errors if not verified |
| ip_address_type | ipv4 / ipv6 / dualstack | ipv4 | Match the service’s families | Mismatch → endpoint can’t be created |
| policy | Endpoint policy JSON | full access | Restrict actions (for AWS-service EPs) | Custom services rely on app auth, not this |
| dns_options.dns_record_ip_type | ipv4 / ipv6 / service-defined | ipv4 | Dual-stack resolution | Must align with how the service publishes |
The endpoint state values you will watch while it provisions:
| State | Meaning | What to do |
|---|---|---|
| pendingAcceptance | Waiting for the provider to accept | Ask the provider to accept (or check allow-list) |
| pending | Provisioning the ENIs | Wait a minute or two |
| available | ENIs up; ready to use | Test connectivity; check the SG |
| rejected | Provider rejected the connection | You’re not on the allow-list / were denied |
| failed | Provisioning failed | Check AZ overlap and subnet capacity |
| deleting / deleted | Tear-down | Expected on terraform destroy |
Consumer step 5 — Private DNS and domain-ownership verification
By default the endpoint hands the consumer an ugly regional DNS name like vpce-0a1b2c3d4e5f6a7b8-abcd1234.vpce-svc-0123456789abcdef0.us-east-1.vpce.amazonaws.com, plus zonal variants. Functional, but nobody wants that baked into client config. Private DNS names let the provider associate a friendly name — say payments.internal.example.com — so consumers keep calling that hostname and resolution silently points at the endpoint.
The catch, and the reason this trips teams up, is that AWS makes the provider prove they own the domain before any consumer is allowed to enable private DNS. This prevents someone from publishing a service that hijacks *.example.com. Enable a private DNS name on the endpoint service, then publish the TXT record AWS gives you:
```hcl
resource "aws_vpc_endpoint_service" "payments" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.svc.arn]
  private_dns_name           = "payments.internal.example.com"
}
```
AWS returns a verification token. Fetch it and create the TXT record in the public hosted zone for the domain (verification is done against public DNS, even though the service itself is private):
```shell
aws ec2 describe-vpc-endpoint-service-configurations \
  --service-ids vpce-svc-0123456789abcdef0 \
  --query 'ServiceConfigurations[0].PrivateDnsNameConfiguration'
# -> { "State": "pendingVerification", "Type": "TXT",
#      "Value": "vpce:abc123...", "Name": "_a1b2c3d4" }
```
resource "aws_route53_record" "privatelink_verify" {
zone_id = var.public_zone_id
name = "_a1b2c3d4.payments.internal.example.com"
type = "TXT"
ttl = 1800
records = ["vpce:abc123..."]
}
aws ec2 start-vpc-endpoint-service-private-dns-verification \
--service-id vpce-svc-0123456789abcdef0
Once the state flips to verified, consumers can set private_dns_enabled = true on their interface endpoints. Behind the scenes AWS creates a managed private hosted zone in the consumer VPC that resolves payments.internal.example.com to the endpoint ENIs — classic split-horizon: the public name resolves to nothing useful publicly, but inside the consumer VPC it points at the local ENIs. For private_dns_enabled to actually take effect, the consumer VPC must have enableDnsHostnames and enableDnsSupport both turned on, or resolution silently falls back to the long regional name.
The private-DNS verification states and what each means for the consumer:
| PrivateDnsNameConfiguration.State | Meaning | Consumer can enable private DNS? | Provider action |
|---|---|---|---|
| pendingVerification | Token issued, TXT not yet seen | No | Publish the TXT record, then start verification |
| verified | Ownership proven | Yes | Tell consumers to set private_dns_enabled=true |
| failed | TXT wrong/missing after retries | No | Fix the TXT name/value; re-run verification |
What must be true on each side for the friendly name to actually resolve — every one of these, or you get the long regional name:
| Requirement | Side | How to confirm | If wrong |
|---|---|---|---|
| private_dns_name set on the service | Provider | describe-vpc-endpoint-service-configurations | Friendly name never offered |
| TXT verification verified | Provider | PrivateDnsNameConfiguration.State | Consumer can’t enable private DNS |
| private_dns_enabled = true | Consumer | describe-vpc-endpoints | Long regional name returned |
| enableDnsSupport = true | Consumer VPC | describe-vpc-attribute | Resolution silently falls back |
| enableDnsHostnames = true | Consumer VPC | describe-vpc-attribute | Resolution silently falls back |
| No conflicting Route 53 record | Consumer | Check private hosted zones | A manual record can shadow the managed one |
The two DNS names you’ll see, and when each is correct:
| Name form | Example | Resolves to | When you should see it |
|---|---|---|---|
| Friendly private DNS | payments.internal.example.com | Endpoint ENI IPs (in-VPC only) | After verification + private_dns_enabled=true |
| Regional endpoint name | vpce-…vpce-svc-…region.vpce.amazonaws.com | The endpoint (all AZs) | When private DNS is off — and as a fallback |
| Zonal endpoint name | vpce-…az1.…vpce.amazonaws.com | One AZ’s ENI | When you deliberately pin to an AZ |
Verify the path end to end
Prove the path end to end before declaring victory. Work provider→consumer→client, because each layer depends on the one before it.
# 1. Provider: service exists, NLB attached, DNS verified
aws ec2 describe-vpc-endpoint-service-configurations \
--service-ids vpce-svc-0123456789abcdef0 \
--query 'ServiceConfigurations[0].[ServiceState,PrivateDnsNameConfiguration.State,AvailabilityZones]'
# 2. Provider: the consumer's connection is accepted (not pendingAcceptance/rejected)
aws ec2 describe-vpc-endpoint-connections \
--filters Name=service-id,Values=vpce-svc-0123456789abcdef0 \
--query 'VpcEndpointConnections[].[VpcEndpointOwner,VpcEndpointState]' --output table
# 3. Consumer: endpoint is "available" with ENIs in each AZ
aws ec2 describe-vpc-endpoints \
--vpc-endpoint-ids vpce-0a1b2c3d4e5f6a7b8 \
--query 'VpcEndpoints[0].[State,NetworkInterfaceIds,DnsEntries[0].DnsName]'
# 4. From a consumer instance: DNS resolves to a private (ENI) address...
dig +short payments.internal.example.com
# 10.x.x.x (a consumer-subnet IP, not a public one)
# 5. ...and the service answers
curl -sS -o /dev/null -w '%{http_code}\n' https://payments.internal.example.com:8443/healthz
If dig returns the long vpce-...amazonaws.com name, private DNS is not effective — re-check private_dns_enabled, the DNS verification state, and the VPC’s DNS-hostnames flag. The verification checklist as a table — what “good” looks like at each step:
| # | Check | Command | Expected good result |
|---|---|---|---|
| 1 | Service available + DNS verified | describe-vpc-endpoint-service-configurations | Available, verified, AZs listed |
| 2 | Connection accepted | describe-vpc-endpoint-connections | available for the consumer owner |
| 3 | Endpoint up with ENIs | describe-vpc-endpoints | available, N ENIs, a DNS entry |
| 4 | DNS resolves private | dig +short <name> | A 10.x/consumer-subnet IP |
| 5 | Service responds | curl … :8443/healthz | 200 (or your healthy code) |
| 6 | No REJECTs at the ENI | Flow Logs query (below) | Zero REJECT rows from the client |
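Check 4 is easy to eyeball and just as easy to script. The sketch below assumes the consumer VPC uses RFC 1918 address space (a VPC on other ranges would need extra patterns); the function names `is_private_ipv4` and `all_answers_private` are hypothetical helpers, not part of any AWS tooling.

```shell
#!/bin/sh
# Return 0 if the argument looks like an RFC 1918 private IPv4 address.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Read `dig +short` answers on stdin, one per line.
# Succeeds only if there is at least one answer and every answer is private;
# a hostname (for example a vpce-... CNAME) or a public IP makes it fail.
all_answers_private() {
  seen=0
  while IFS= read -r answer; do
    [ -n "$answer" ] || continue
    seen=1
    is_private_ipv4 "$answer" || return 1
  done
  [ "$seen" -eq 1 ]
}

# Usage from a consumer instance (left commented out; needs dig and the VPC resolver):
#   dig +short payments.internal.example.com | all_answers_private \
#     && echo "private DNS effective" || echo "private DNS NOT effective"
```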
Scaling, resiliency, and the quotas that bite
A few limits and behaviours decide whether this holds up under load. Cross-zone load balancing (covered above, repeated because it bites in production): without it, an endpoint ENI only reaches targets in its own AZ. Endpoint connection capacity: a single interface endpoint scales horizontally across AZs, giving roughly tens of thousands of concurrent connections per AZ per endpoint; for very high fan-in the bottleneck is usually NLB target capacity and source-port exhaustion on long-lived connections, not the endpoint itself, so watch ActiveFlowCount and NewFlowCount on the NLB. The idle timeout is the silent killer: NLB TCP flows have a 350-second idle timeout, so long-lived gRPC or database-style connections through PrivateLink need TCP keepalives below that or they reset silently.
The quotas you will actually hit, with the default and how to handle each:
| Quota | Default | Adjustable? | What hitting it looks like | Mitigation |
|---|---|---|---|---|
| Allowed principals per endpoint service | 50 | Yes (Service Quotas) | 51st consumer can’t be added | Raise before the 50th account; prefer org-condition |
| Interface endpoints per VPC | 50 | Yes | New endpoint creation fails | Consolidate, or request an increase |
| Endpoint services per account/region | 20–50 (varies) | Yes | New service creation fails | Raise; or share one service across consumers |
| Connections per endpoint (per AZ) | ~tens of thousands | Effectively scale-bound | New flows fail at extreme fan-in | Add AZs; scale NLB targets |
| NLB targets per target group | 500 (instance/IP) | Yes | Can’t register more targets | Increase, or shard behind multiple NLBs |
| NLB TCP idle timeout | 350 s | No | Long-lived flows reset at ~6 min | Client TCP keepalive < 350 s |
| Source ports per flow tuple | ~64 K per dst | No | Port exhaustion on one hot 5-tuple | Spread targets/ports; reuse connections |
| NLBs per endpoint service | 1 active set (per region/VPC) | n/a | Can’t span VPCs/regions with one service | Publish multiple services or use cross-region |
| Subnets (AZs) per NLB | One per AZ, up to region AZ count | n/a | Can’t publish in an AZ with no subnet | Add a subnet per AZ you want to serve |
| Connection notifications per service | Small fixed cap | No | Extra SNS wiring rejected | Fan out from one SNS topic instead |
| Target group health-check interval | 10–30 s | Yes | Slow detection of dead targets | Tune interval/thresholds for your SLA |
The behaviours and limits that decide resiliency, side by side:
| Behaviour | Default | Why it matters | What to do |
|---|---|---|---|
| Zonal NLB routing | On (cross-zone off) | ENI reaches only same-AZ targets | Enable cross-zone unless you want zonal isolation |
| Single connection per consumer VPC | Always | One endpoint = one provider-side connection | Don’t create duplicate endpoints per AZ |
| AZ-ID vs AZ-name mismatch | Inherent across accounts | us-east-1a differs per account | Reason with AZ IDs (use1-az1) |
| 350 s idle timeout | Fixed | Long-poll/gRPC/DB connections reset | Keepalive below the threshold |
| Cross-zone data charges | Billed to LB owner | Cost can surprise on chatty services | Model inter-AZ GB; keep targets balanced |
| ENI per AZ, not per client | Always | Thousands of clients share one ENI/AZ | Scale targets, not endpoints, for fan-in |
| No source-IP visibility | Default | Targets see the ENI, not the consumer | Use PROXY protocol v2 if the app needs it |
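Reasoning by AZ ID rather than AZ name comes down to a set intersection: the zones an endpoint can use are the AZ IDs present on both sides. A minimal sketch, assuming each side has already collected its AZ IDs as a space-separated list; `common_az_ids` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the AZ IDs that appear in both lists, one per line.
# $1: provider AZ IDs (space-separated), $2: consumer AZ IDs (space-separated)
common_az_ids() {
  for az in $1; do
    for candidate in $2; do
      if [ "$az" = "$candidate" ]; then
        echo "$az"
      fi
    done
  done
  return 0
}

# Each account can list the AZ IDs of its own subnets with (commented out; needs credentials):
#   aws ec2 describe-subnets --subnet-ids subnet-ca subnet-cb \
#     --query 'Subnets[].AvailabilityZoneId' --output text
#
# Example:
#   common_az_ids "use1-az1 use1-az2 use1-az4" "use1-az2 use1-az4 use1-az6"
```

An empty result means the consumer's subnets sit in zones the provider has not published, which is the situation behind the "works in one AZ, fails in another" and failed-provisioning symptoms.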
Observability and cost
You are billed on two axes for an interface endpoint: an hourly charge per endpoint per AZ, and a per-GB data-processing charge on traffic through it. The per-AZ hourly line is why you do not blindly enable every AZ — each ENI is its own meter. The data-processing charge is on top of any inter-AZ transfer the NLB incurs, and at high fan-in across hundreds of consumer endpoints, the per-endpoint hourly cost dominates and is easy to overlook. The good news: PrivateLink bypasses SNAT entirely for the consumer — traffic to the ENI rides the backbone, so you do not burn the consumer’s NAT-gateway SNAT ports the way a public-internet call would.
For traffic visibility, VPC Flow Logs on the endpoint ENIs show source IPs, ports, and accept/reject actions — invaluable when a consumer swears they cannot connect. The endpoint ENIs have stable interface IDs; filter on them.
# CloudWatch Logs Insights over VPC Flow Logs: rejected traffic to the endpoint ENI
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter interfaceId = "eni-0abc123def456789"
| filter action = "REJECT"
| stats count() as rejects by srcAddr, dstPort
| sort rejects desc
On the provider side, the meaningful signals are NLB CloudWatch metrics: HealthyHostCount/UnHealthyHostCount per target group, ActiveFlowCount, and TCP_Target_Reset_Count. A rise in target resets with healthy hosts usually points at idle-timeout or application-side connection churn rather than the network. Observability sources and what each is the source of truth for:
| Signal source | Lives where | Source of truth for |
|---|---|---|
| VPC Flow Logs (endpoint ENI) | Consumer account | Client→ENI accept/REJECT, source IP/port |
| ActiveFlowCount / NewFlowCount | Provider NLB metrics | Concurrent flows; fan-in pressure |
| HealthyHostCount / UnHealthyHostCount | Provider target group | Whether targets are in rotation |
| TCP_Target_Reset_Count | Provider NLB metrics | Idle-timeout / app churn resets |
| ProcessedBytes | Provider NLB metrics | Data volume (cost driver) |
| Connection-notification events | Provider SNS topic | Who connected/was rejected and when |
| describe-vpc-endpoint-connections | Provider control plane | Per-consumer connection state |
The cost model — what drives the bill and how to control it (figures approximate, us-east-1, USD; ₹ at ~₹86/USD):
| Cost driver | Rough rate | Who pays | How to control |
|---|---|---|---|
| Interface endpoint, per-AZ hourly | ~$0.01/AZ/hr (~₹0.86) | Consumer | Only enable AZs you actually use |
| Endpoint data processing | ~$0.01/GB (~₹0.86) | Consumer | Tier on volume; consolidate chatty calls |
| NLB hourly + LCU | ~$0.0225/hr + LCU | Provider | Right-size; one NLB per service, not per consumer |
| NLB cross-zone data | inter-AZ $0.01/GB each way | Provider (LB owner) | Balance targets; weigh zonal isolation |
| Flow Logs storage/ingest | CloudWatch/S3 rates | Whoever logs | Sample, or log REJECT-only for cost |
| SNS + Lambda (auto-approve) | Negligible | Provider | Effectively free at onboarding volumes |
| Saved consumer NAT/SNAT | (a credit, not a charge) | Consumer | PrivateLink bypasses the public path entirely |
| Cross-region data (if used) | inter-region transfer rate | Both | Only when supported_regions spans regions |
A worked sizing note: at 120 consumer accounts each enabling two AZs, the per-endpoint hourly line alone is roughly 120 × 2 × $0.01 × 730 hr ≈ $1,750/month (~₹150,000) before a single byte flows — which is exactly why the per-AZ count is a deliberate decision, not a default.
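The sizing arithmetic is worth keeping as a one-liner so it can be re-run as the consumer count changes. This is a sketch using the approximate figures quoted in this section ($0.01 per AZ-hour, 730 hours per month); the function name `monthly_endpoint_hourly_cost` is hypothetical and the defaults are assumptions to replace with current pricing.

```shell
#!/bin/sh
# Monthly hourly-charge total for interface endpoints, before any data processing.
# $1: number of consumer endpoints
# $2: AZs enabled per endpoint
# $3: USD per AZ-hour (assumed default 0.01)
# $4: hours per month (assumed default 730)
monthly_endpoint_hourly_cost() {
  awk -v n="$1" -v az="$2" -v rate="${3:-0.01}" -v hrs="${4:-730}" \
    'BEGIN { printf "%.2f\n", n * az * rate * hrs }'
}

monthly_endpoint_hourly_cost 120 2   # two AZs per consumer
monthly_endpoint_hourly_cost 120 3   # three AZs per consumer
```

With these inputs, two AZs comes to 1752.00 and three AZs to 2628.00, so the third AZ adds roughly $876 per month across the estate.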
Architecture at a glance
Read the diagram left to right; it traces a single request through the system and pins the five failure classes onto the exact hop where each bites. A client in the consumer VPC resolves the friendly name payments.internal.example.com, which — thanks to the managed split-horizon private hosted zone — returns a local interface-endpoint ENI address rather than anything public. The client opens a TCP connection to that ENI on :8443; its egress SG must allow it out, and the endpoint ENI’s own security group must allow it in (the NLB has no SG, so this is the only L4 filter on the path — badge 1). From the ENI the flow crosses the AWS backbone in one direction only, never touching the internet, and arrives at the provider’s endpoint service. There, two gates decide whether the connection ever existed: the allowed-principals allow-list and the per-connection acceptance step (badge 2), while domain-ownership verification via a public TXT record is what let private DNS resolve in the first place (badge 3).
Past the gates, the service forwards to an internal NLB that load-balances across registered targets (IPs, instances, or an ALB) with TCP health checks. The NLB’s published AZs and its cross-zone setting decide whether an ENI in one AZ can reach targets in another — get this wrong and the service “works in one AZ, dies in another” (badge 4) — and its fixed 350-second idle timeout silently resets long-lived flows without keepalives (badge 5). The right-hand observe zone closes the loop: VPC Flow Logs on the ENI catch REJECTs, NLB metrics catch flow pressure and target resets, and the SNS connection-notification feed can auto-approve new consumers against a registry. The whole method during an incident is to find the badge that matches your symptom, read its legend line, run the named confirm command, and apply the fix.
Real-world scenario
Vantage Pay, a fictional payments platform team of five engineers, ran a fraud-scoring API that ~120 internal accounts needed to call, plus three external partner accounts under contract. Their first instinct was a Transit Gateway attachment per consumer. It died on contact with reality: two recently-acquired business units had VPCs on 10.20.0.0/16 — the same block the fraud service’s VPC used — and renumbering a live, PCI-scoped production VPC was a non-starter. Worse, security flagged that a TGW attachment would grant those partner accounts a route into the platform network, far more reach than “call one API” warranted. The architecture review bounced the TGW plan in a single meeting.
They rebuilt it as a PrivateLink endpoint service. The fraud API went behind an internal NLB across three AZs with cross-zone enabled; the endpoint service ran with acceptance_required = true and an allow-list keyed to specific consumer role ARNs, not account roots, so a partner could only connect from their designated workload role. A connection-notification SNS topic fed a small Lambda that auto-approved any principal present in the platform’s account-registry DynamoDB table and left everything else pending for a human — onboarding dropped from a ticket-and-a-meeting to minutes. The two overlapping 10.20.0.0/16 business units connected with zero renumbering, because PrivateLink never exposes the provider’s address space, and the team published fraud.internal.payments.example.com as a verified private DNS name so every consumer used one stable hostname regardless of account.
# The control that made the partner case acceptable to security:
# allow only the partner's specific workload role, never the account root.
resource "aws_vpc_endpoint_service_allowed_principal" "partner_a" {
vpc_endpoint_service_id = aws_vpc_endpoint_service.fraud.id
principal_arn = "arn:aws:iam::333333333333:role/fraud-client-prod"
}
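The auto-approval decision described in this scenario reduces to one exact-match lookup. The sketch below is not the team's Lambda: it stands in a plain text file of account IDs for the DynamoDB registry, and `approval_decision` is a hypothetical name. It only illustrates the rule "accept if registered, otherwise leave pending for a human".

```shell
#!/bin/sh
# Decide what to do with a new endpoint connection request.
# $1: the requesting endpoint owner's 12-digit account ID
# $2: path to a registry file with one approved account ID per line
#     (an assumed stand-in for the account-registry table)
approval_decision() {
  owner="$1"
  registry="$2"
  # -F fixed string, -x whole-line match, -q quiet: no partial-ID matches.
  if grep -Fxq -- "$owner" "$registry"; then
    echo accept
  else
    echo pending
  fi
}

# On "accept" the provider would then run (commented out; needs credentials):
#   aws ec2 accept-vpc-endpoint-connections --service-id "$SVC" --vpc-endpoint-ids "$EP"
```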
Two incidents in the first month taught the team the failure map. The first: a newly-onboarded BU reported “endpoint is available but every call hangs.” Flow Logs on the endpoint ENI showed REJECT from the client subnet on :8443 — the BU had created the endpoint with the default security group, which did not allow the service port inbound. Five-minute fix. The second: a partner’s gRPC client saw connections silently dropping “every few minutes.” That was the 350-second NLB idle timeout against a long-lived streaming connection with no keepalive; the partner set a 200-second TCP keepalive and it vanished. Both were textbook, both were one hop, and both went straight into the runbook.
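The second incident comes down to one inequality: the first keepalive probe must fire before the 350-second idle timeout. A minimal sketch, assuming a Linux client whose application enables SO_KEEPALIVE on its sockets (gRPC clients often use their own HTTP/2 keepalive settings instead); `keepalive_safe` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if a keepalive interval (seconds) fires before the idle timeout.
# $1: keepalive time in seconds, $2: idle timeout in seconds (default 350)
keepalive_safe() {
  keepalive_time="$1"
  idle_timeout="${2:-350}"
  [ "$keepalive_time" -gt 0 ] && [ "$keepalive_time" -lt "$idle_timeout" ]
}

# The Linux default tcp_keepalive_time is 7200 s, far above the timeout,
# which is why untuned long-lived connections get reset.
# Kernel-wide settings matching the 200-second fix (commented out; needs root):
#   sysctl -w net.ipv4.tcp_keepalive_time=200
#   sysctl -w net.ipv4.tcp_keepalive_intvl=30
#   sysctl -w net.ipv4.tcp_keepalive_probes=5
```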
The one real cost they had to plan for was the per-endpoint, per-AZ hourly charge multiplied across 120+ consumers — a line item that genuinely showed up on the bill at roughly ₹150,000/month before data. They accepted it as the price of dropping TGW attachment management and CIDR coordination entirely, and trimmed it by getting each consumer to enable only the two AZs they actually ran in rather than all three. Net: the consumer count became a self-service onboarding problem instead of a networking project, which is exactly the trade PrivateLink is built to make.
Advantages and disadvantages
The publish-one-service model both enables clean cross-account sharing and imposes real constraints. Weigh it honestly:
| Advantages (why this model helps you) | Disadvantages (why it constrains you) |
|---|---|
| Overlapping CIDRs are irrelevant — the consumer talks to ENIs in its own subnets | One service per endpoint; not a general routing solution |
| Grants reach to one load balancer, not your network — minimal blast radius | One-directional (consumer→provider) only |
| Scales to hundreds of consumers as self-service onboarding, not a mesh | Requires an internal NLB/GWLB in front of the workload |
| Never traverses the public internet; bypasses consumer SNAT entirely | Per-endpoint, per-AZ hourly cost compounds at high fan-in |
| Two stacked gates (principals + acceptance) give fine-grained access control | Provider must prove domain ownership before friendly DNS works |
| Private DNS gives a stable hostname regardless of consumer account | AZ names differ across accounts — must reason by AZ ID |
| Connection notifications enable automated, audited onboarding | NLB 350 s idle timeout resets long-lived flows without keepalives |
| Works for SaaS vendors offering private connectivity to customers | Source client IP is hidden unless you carry PROXY protocol v2 |
The model is right whenever you are publishing a service to many accounts — a shared internal API, a logging/telemetry sink, or a SaaS product offered privately to customer VPCs — and it shines exactly where peering and TGW buckle: large fan-in and overlapping address space. It is the wrong tool when you need bidirectional, everything-talks-to-everything routing (use TGW), when both VPCs are yours and mutually trusted at small N (peering is cheaper and simpler), or when you need L7 routing and per-request authorization across services (VPC Lattice). The disadvantages are all manageable — the NLB requirement, the cost, the DNS dance, the idle timeout — but only if you know they exist going in, which is the point of this article.
Hands-on lab
Stand up a complete PrivateLink path across two accounts (or two VPCs you control), prove it end to end, then tear it down. Free-tier-friendly where possible; the NLB and endpoint hours cost a few cents — delete at the end. Run in CloudShell or with two named CLI profiles (--profile provider, --profile consumer).
Step 1 — Provider: an internal NLB with a simple target. Front a tiny target (an EC2 instance running a health endpoint on :8443, or an IP target) with an internal NLB.
PROFILE_P="--profile provider"
aws elbv2 create-load-balancer $PROFILE_P --name pl-lab-nlb --type network \
--scheme internal --subnets subnet-pa subnet-pb
# capture the ARN
NLB=$(aws elbv2 describe-load-balancers $PROFILE_P --names pl-lab-nlb \
  --query 'LoadBalancers[0].LoadBalancerArn' --output text)
aws elbv2 modify-load-balancer-attributes $PROFILE_P --load-balancer-arn $NLB \
--attributes Key=load_balancing.cross_zone.enabled,Value=true
Expected: an ARN returned; cross-zone attribute set to true.
Step 2 — Provider: target group, target, listener.
TG=$(aws elbv2 create-target-group $PROFILE_P --name pl-lab-tg --protocol TCP \
  --port 8443 --vpc-id vpc-prov --target-type ip \
  --query 'TargetGroups[0].TargetGroupArn' --output text)
aws elbv2 register-targets $PROFILE_P --target-group-arn $TG --targets Id=10.10.1.20
aws elbv2 create-listener $PROFILE_P --load-balancer-arn $NLB --protocol TCP --port 8443 \
--default-actions Type=forward,TargetGroupArn=$TG
Expected: target registers; after a minute describe-target-health shows healthy.
Step 3 — Provider: publish the endpoint service.
SVC=$(aws ec2 create-vpc-endpoint-service-configuration $PROFILE_P \
  --network-load-balancer-arns $NLB --acceptance-required \
  --query 'ServiceConfiguration.ServiceId' --output text)
SVC_NAME=$(aws ec2 describe-vpc-endpoint-service-configurations $PROFILE_P \
  --service-ids $SVC --query 'ServiceConfigurations[0].ServiceName' --output text)
echo "Share this: $SVC_NAME"
Expected: a vpce-svc-… id and a com.amazonaws.vpce.<region>.vpce-svc-… name.
Step 4 — Provider: allow the consumer principal.
aws ec2 modify-vpc-endpoint-service-permissions $PROFILE_P --service-id $SVC \
--add-allowed-principals arn:aws:iam::222222222222:role/lab-consumer-role
Expected: command succeeds; the consumer can now see the service.
Step 5 — Consumer: create the interface endpoint.
PROFILE_C="--profile consumer"
EP=$(aws ec2 create-vpc-endpoint $PROFILE_C --vpc-id vpc-cons \
  --vpc-endpoint-type Interface --service-name "$SVC_NAME" \
  --subnet-ids subnet-ca subnet-cb --security-group-ids sg-lab-endpoint \
  --no-private-dns-enabled --query 'VpcEndpoint.VpcEndpointId' --output text)
Expected: an endpoint id; state begins as pendingAcceptance.
Step 6 — Provider: accept the connection.
aws ec2 accept-vpc-endpoint-connections $PROFILE_P --service-id $SVC \
--vpc-endpoint-ids $EP
Expected: within a minute the consumer’s endpoint flips to available.
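Rather than re-running describe commands by hand while the endpoint provisions, a small polling loop can wait for the target state and stop early on a terminal failure. This is a sketch only: `wait_for_state` is a hypothetical helper, and the attempt count and delay are arbitrary assumptions.

```shell
#!/bin/sh
# Poll a command until it prints the target state.
# $1: target state, $2: max attempts, $3: delay in seconds,
# remaining args: the command whose stdout is the current state.
# Returns 0 on success, 1 on timeout or a terminal state, 2 if the command fails.
wait_for_state() {
  target="$1"
  attempts="$2"
  delay="$3"
  shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$("$@") || return 2
    if [ "$state" = "$target" ]; then
      return 0
    fi
    # rejected and failed never recover on their own, so stop waiting.
    case "$state" in
      rejected|failed) return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage in this lab (commented out; needs credentials and the variables above):
#   wait_for_state available 30 10 \
#     aws ec2 describe-vpc-endpoints $PROFILE_C --vpc-endpoint-ids "$EP" \
#       --query 'VpcEndpoints[0].State' --output text
```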
Step 7 — Consumer: confirm and test.
aws ec2 describe-vpc-endpoints $PROFILE_C --vpc-endpoint-ids $EP \
--query 'VpcEndpoints[0].[State,DnsEntries[0].DnsName]'
# From a consumer instance whose SG can reach sg-lab-endpoint on :8443:
curl -sS -o /dev/null -w '%{http_code}\n' \
https://<regional-endpoint-dns>:8443/healthz
Expected: available, a regional DNS name, and a 200 from the health endpoint. The lab steps and their expected output, at a glance:
| Step | Command family | Expected output |
|---|---|---|
| 1 | create-load-balancer + cross-zone | NLB ARN; cross-zone true |
| 2 | create-target-group / register-targets | Target healthy after ~1 min |
| 3 | create-vpc-endpoint-service-configuration | vpce-svc-… id + service name |
| 4 | modify-vpc-endpoint-service-permissions | Consumer principal allowed |
| 5 | create-vpc-endpoint (Interface) | Endpoint id; pendingAcceptance |
| 6 | accept-vpc-endpoint-connections | Endpoint → available |
| 7 | describe-vpc-endpoints + curl | available; 200 from /healthz |
Step 8 — Teardown (run both, in order).
aws ec2 delete-vpc-endpoints $PROFILE_C --vpc-endpoint-ids $EP
aws ec2 delete-vpc-endpoint-service-configurations $PROFILE_P --service-ids $SVC
aws elbv2 delete-listener $PROFILE_P --listener-arn <listener-arn>
aws elbv2 delete-target-group $PROFILE_P --target-group-arn $TG
aws elbv2 delete-load-balancer $PROFILE_P --load-balancer-arn $NLB
Delete the consumer endpoint before the provider service, or the service delete is blocked by an active connection.
Common mistakes & troubleshooting
When it does not work, walk these in order — most failures are one of the first three, and almost all localise to a single hop. The playbook: match the symptom, read the root cause, run the confirm command, apply the fix.
| # | Symptom | Root cause | Confirm (exact command / path) | Fix |
|---|---|---|---|---|
| 1 | Endpoint stuck in pendingAcceptance | Provider hasn’t accepted; or principal not allow-listed | describe-vpc-endpoint-connections … VpcEndpointState | Add the role ARN to allowed principals and accept-vpc-endpoint-connections |
| 2 | available but connections hang/refuse | Endpoint ENI’s SG blocks the service port | VPC Flow Logs on ENI show REJECT; check describe-security-groups | Allow inbound :8443 from client CIDR/SG on the endpoint SG |
| 3 | DNS returns the long vpce-…amazonaws.com name | Private DNS not effective (verify state / flag / VPC DNS) | describe-vpc-endpoints …private_dns_enabled; …PrivateDnsNameConfiguration.State | Verify domain, set private_dns_enabled=true, enable VPC DNS flags |
| 4 | Works in one AZ, fails in another | Provider didn’t publish that AZ, or cross-zone off | Compare provider AvailabilityZones vs consumer subnet AZ IDs | Publish all AZs + enable cross-zone load balancing |
| 5 | Healthy endpoint, but no targets answer | Target group unhealthy (wrong port/host/probe) | describe-target-health shows unhealthy | Fix health-check port/protocol; ensure app listens on the target port |
| 6 | Intermittent resets on long-lived flows | NLB 350 s TCP idle timeout | TCP_Target_Reset_Count rising with healthy hosts | Add TCP keepalives below 350 s on the client |
| 7 | Consumer can’t even create the endpoint | Principal missing from allowed list | Provider describe-vpc-endpoint-service-permissions | Add the consumer principal (role ARN preferred) |
| 8 | Endpoint state rejected | Provider rejected the connection (or auto-logic did) | describe-vpc-endpoint-connections … rejected | Re-request after the provider allow-lists/accepts |
| 9 | Connection works, but app sees wrong source IP | Source is the ENI, not the real client | Compare app logs vs client IP | Enable PROXY protocol v2 on the target group if you need real client IP |
| 10 | DNS verification stuck at pendingVerification | TXT record wrong name/value or not propagated | dig TXT _<token>.<name>; check the public zone | Fix the TXT name/value; start-…-private-dns-verification again |
| 11 | “Asymmetric routing”-style weirdness | Almost always an ALB-as-target with mismatched health checks | Inspect the NLB→ALB target chain | Simplify the target chain; align health checks; re-test |
| 12 | Hit a hard ceiling adding the 51st consumer | allowed_principals per service quota (default 50) | Service Quotas console | Request an increase; or use an aws:PrincipalOrgID condition |
| 13 | High fan-in: new connections fail under load | NLB target capacity / source-port exhaustion | ActiveFlowCount, NewFlowCount near ceiling | Add AZs/targets; reuse connections; shard behind multiple NLBs |
| 14 | terraform destroy fails on the service | Active consumer connection still attached | describe-vpc-endpoint-connections non-empty | Delete consumer endpoints first, then the service |
| 15 | Endpoint created but no DNS entry at all | VPC DNS support disabled on the consumer VPC | describe-vpc-attribute --attribute enableDnsSupport | Enable enableDnsSupport (and enableDnsHostnames) |
| 16 | Connection drops when provider redeploys NLB | NLB replaced, not updated in place | Provider change record; new NLB ARN | Update the service’s NLB ARN; avoid replacing the NLB |
| 17 | Partner connects from an unexpected role | Allow-list scoped to account root, not a role | describe-vpc-endpoint-service-permissions | Tighten to the specific workload role ARN |
| 18 | curl works by IP but not by name | Client using stale/cached regional name | dig +short <name> vs the ENI IPs | Flush resolver cache; confirm private DNS effective |
The fastest first-cut decision table — symptom to most-likely cause to first move:
| If you see… | It’s probably… | Do this first |
|---|---|---|
| pendingAcceptance forever | Acceptance/allow-list gate | Check allow-list, then accept |
| Hang/refuse on an available endpoint | Endpoint ENI SG | Flow Logs → fix inbound SG rule |
| Long regional DNS name | Private DNS not effective | Check verify state + VPC DNS flags |
| Zone-specific failures | AZ mismatch / cross-zone off | Compare AZ IDs; enable cross-zone |
| Resets every few minutes | 350 s idle timeout | Add TCP keepalives < 350 s |
| Can’t add another consumer | Allowed-principals quota | Raise quota or use org condition |
The error/limit reference — the strings and states you’ll meet, what they mean, and the fix:
| Error / state | Where it appears | Meaning | Fix |
|---|---|---|---|
| pendingAcceptance | Endpoint / connection state | Awaiting provider acceptance | Accept the connection / check allow-list |
| rejected | Endpoint state | Provider rejected the connection | Get allow-listed; re-request |
| failed | Endpoint state | Provisioning failed (AZ/subnet) | Fix AZ overlap and subnet capacity |
| pendingVerification | DNS config state | TXT not yet verified | Publish TXT; start verification |
| InvalidServiceName | CLI error | Service name typo / wrong region | Copy the exact com.amazonaws.vpce.… string |
| You are not authorized… | CLI error (consumer) | Principal not on allow-list | Provider adds the principal |
| REJECT (Flow Logs) | ENI flow log | SG/NACL blocked the packet | Allow the service port on the endpoint SG |
| VpcEndpointLimitExceeded | CLI error | Endpoints-per-VPC quota hit | Raise quota / consolidate endpoints |
| cross_zone data charge | Cost Explorer line | Inter-AZ NLB traffic | Balance targets; weigh zonal isolation |
| TCP_Target_Reset_Count ↑ | NLB metric | Idle-timeout/app resets | TCP keepalives; fix app connection churn |
| DnsNamesInUse / verify failed | DNS config state | TXT mismatch or stale token | Re-fetch token; fix TXT; re-verify |
| ExceededAllowedPrincipalsLimit | CLI error | 50-principal cap reached | Raise quota; use aws:PrincipalOrgID condition |
| Endpoint available, target unhealthy | Mixed signals | App not on the target port/health path | Fix target port + health check; re-register |
Best practices
- Front the service with an internal NLB across every AZ your big consumers use, with cross-zone load balancing on. This single setting prevents the most common “works in one AZ” class of failure.
- Scope allowed principals to role ARNs, not account roots, and never * unless intentional. For org-internal platforms, prefer an aws:PrincipalOrgID condition over enumerating 120 accounts.
- Set acceptance_required deliberately. Keep it true with a connection-notification Lambda for audited self-service; only false when the allow-list is your sole, sufficient gate.
- Wire connection notifications to SNS from day one — auto-approve against an authoritative registry, alert humans for everything else. Polling for pendingAcceptance does not scale.
- Put the security group on the endpoint ENI and allow the service port from client CIDRs/SGs. The NLB has no SG; the endpoint SG is the only L4 filter on the path.
- Reason about AZs by ID (use1-az1), not name. us-east-1a is a different physical zone in different accounts; aligning on names silently strands an ENI.
- Verify the private DNS domain before telling consumers to flip private_dns_enabled, and confirm the consumer VPC has enableDnsHostnames and enableDnsSupport on.
- Set TCP keepalives below the 350-second NLB idle timeout for any gRPC, database, or long-poll connection through PrivateLink.
- Enable VPC Flow Logs on the endpoint ENIs and dashboard NLB metrics (ActiveFlowCount, HealthyHostCount, TCP_Target_Reset_Count) before you onboard real traffic.
- Review quotas ahead of growth — allowed principals per service (50), endpoints per VPC, endpoint services per account — and request increases before you hit them.
- Model the per-endpoint, per-AZ hourly cost against expected fan-in and get consumers to enable only the AZs they run in. The hourly meter, not data, usually dominates at scale.
- Treat the service_name as a coordination artifact, not a secret — publish it in a platform catalogue with the supported AZs, port, and onboarding contact.
Security notes
PrivateLink’s security story is strong precisely because it grants the minimum reach: one load balancer, one direction, no route into your network. Lean into that.
| Control | What it protects | How to apply |
|---|---|---|
| Allowed principals (role ARN) | Who may create an endpoint | Scope to the consumer’s workload role; avoid root/* |
| `aws:PrincipalOrgID` condition | Org-internal access at scale | Allow any account in your org without enumerating |
| Per-connection acceptance | Onboarding gate | Keep true; automate via SNS→Lambda registry check |
| Endpoint ENI security group | L4 ingress to the ENI | Least-privilege: service port, client SG/CIDR only |
| No route into the provider VPC | Lateral movement | Inherent — PrivateLink grants no transitive reach |
| Domain-ownership verification | DNS hijack prevention | Provider proves the domain before private DNS |
| NLB TLS termination + ACM | Data-in-transit | Use a TLS listener with an ACM cert if terminating |
| App-layer authN/Z | Real identity | Don’t trust source IP across PrivateLink — auth in the app |
| VPC Flow Logs (REJECT) | Detection/audit | Alert on REJECT spikes; record who connected |
| PROXY protocol v2 | Real client IP (when needed) | Enable on the target group; parse in the app |
| SCP / RCP guardrails | Org-wide policy on endpoints | Restrict who can create endpoint services/principals |
| CloudTrail on EC2 endpoint APIs | Change audit | Alert on ModifyVpcEndpointServicePermissions changes |
| Connection-notification audit | Onboarding trail | Persist Connect/Accept/Reject events for review |
Three security-specific gotchas worth internalising. First, the source IP your targets see is the endpoint ENI, not the consumer’s client — never build authorization on observed source IP across a PrivateLink boundary; use the principal gate and application auth. Second, traffic is private but not automatically encrypted — PrivateLink keeps the flow off the internet, but if you need confidentiality on the wire, terminate TLS at the NLB (ACM cert) or run TLS end-to-end through it. Third, the allow-list is your perimeter — a stray * with acceptance_required = false is the one mistake that turns a private service into an open one; review it in code and alert on changes.
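The acceptance gate described above can be sketched as a pure decision function, kept separate from any AWS call so it is easy to test. This is a sketch under stated assumptions, not the article's Lambda. It assumes the input dictionaries carry the `VpcEndpointId`, `VpcEndpointOwner`, and `VpcEndpointState` fields found in a `DescribeVpcEndpointConnections` response, and that the authoritative registry is a set of approved account IDs. The function name and the sample IDs are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def plan_acceptance(
    connections: Iterable[Dict[str, str]],
    registry_accounts: Set[str],
) -> Tuple[List[str], List[str]]:
    """Split pending endpoint connections into (auto-accept, hold-for-human) ID lists."""
    accept: List[str] = []
    hold: List[str] = []
    for conn in connections:
        # Only connections still waiting on the provider need a decision.
        if conn.get("VpcEndpointState") != "pendingAcceptance":
            continue
        owner = conn.get("VpcEndpointOwner")
        target = accept if owner in registry_accounts else hold
        target.append(conn["VpcEndpointId"])
    return accept, hold


# Hypothetical sample data for illustration only.
pending = [
    {"VpcEndpointId": "vpce-aaa", "VpcEndpointOwner": "111111111111", "VpcEndpointState": "pendingAcceptance"},
    {"VpcEndpointId": "vpce-bbb", "VpcEndpointOwner": "999999999999", "VpcEndpointState": "pendingAcceptance"},
    {"VpcEndpointId": "vpce-ccc", "VpcEndpointOwner": "111111111111", "VpcEndpointState": "available"},
]
accept_ids, hold_ids = plan_acceptance(pending, registry_accounts={"111111111111"})
print(accept_ids, hold_ids)
# -> ['vpce-aaa'] ['vpce-bbb']
```

In a real handler the accept list would be passed to the EC2 `AcceptVpcEndpointConnections` API and the hold list would raise an alert. Unknown owners are deliberately left pending rather than rejected, so a human makes that call.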
Cost & sizing
What drives the bill: the per-endpoint, per-AZ hourly charge (consumer side) and per-GB data processing (consumer side), plus the NLB hourly + LCU and any inter-AZ cross-zone data (provider side). At high fan-in, the per-AZ hourly line dominates — it accrues whether or not data flows — which is why the number of enabled AZs is the most important cost lever you have. PrivateLink also saves money on the consumer side by bypassing NAT-gateway SNAT and data-processing for the call (the traffic never goes to the internet). Rough figures, us-east-1, USD, with ₹ at ~₹86/USD:
| Item | Rough rate | Side | Sizing lever |
|---|---|---|---|
| Interface endpoint, per AZ | ~$0.01/hr (~₹0.86) → ~$7.30/AZ/mo | Consumer | Enable only AZs you use |
| Endpoint data processing | ~$0.01/GB (~₹0.86) | Consumer | Consolidate chatty calls; tier on volume |
| NLB | ~$0.0225/hr + LCU | Provider | One NLB per service, not per consumer |
| NLB cross-zone inter-AZ | ~$0.01/GB each direction | Provider | Balance targets; consider zonal isolation |
| Flow Logs | CloudWatch/S3 rates | Either | REJECT-only or sampled to cut cost |
| SNS + Lambda | Effectively free at onboarding scale | Provider | n/a |
The sizing rules of thumb, as a decision table:
| If you have… | Then size for… | Because… |
|---|---|---|
| Many consumers, low per-consumer traffic | Minimising AZ count per endpoint | Hourly meter dominates over data |
| Few consumers, high traffic | Minimising data processing + NLB LCUs | Data + LCU dominate over hourly |
| Long-lived connections | Keepalives + NLB headroom | Avoid idle resets and flow churn |
| Bursty fan-in | NLB targets + AZ spread | Source-port and target capacity are the ceiling |
| Cross-region consumers | Cross-region data + supported regions | Adds inter-region transfer cost |
There is no “free tier” for PrivateLink endpoints, but the lab above runs in cents because the endpoint and NLB exist for minutes. For a 120-consumer internal platform at two AZs each, budget on the order of ₹150,000/month for endpoint hours before data — the line item that surprises teams who modelled only data transfer.
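The ₹150,000 figure follows directly from the rough rates in the table. The sketch below reproduces the arithmetic so you can substitute your own fan-in. The rates are the article's approximations, not a price quote, and the 730-hour month is an assumption implied by the ~$7.30/AZ/mo figure. The function and constant names are hypothetical.

```python
HOURS_PER_MONTH = 730            # assumption: implied by ~$7.30/AZ/mo at ~$0.01/hr
ENDPOINT_USD_PER_AZ_HOUR = 0.01  # rough interface-endpoint rate from the table
DATA_USD_PER_GB = 0.01           # rough data-processing rate from the table
INR_PER_USD = 86                 # rough conversion used in this section


def monthly_endpoint_cost_usd(consumers, azs_per_endpoint, gb_per_consumer=0.0):
    """Consumer-side endpoint cost: hourly meter plus data processing."""
    hourly = consumers * azs_per_endpoint * ENDPOINT_USD_PER_AZ_HOUR * HOURS_PER_MONTH
    data = consumers * gb_per_consumer * DATA_USD_PER_GB
    return hourly + data


usd = monthly_endpoint_cost_usd(consumers=120, azs_per_endpoint=2)
print(round(usd, 2), round(usd * INR_PER_USD))
# -> 1752.0 150672
```

That is 120 × 2 × $7.30 = $1,752 per month, or roughly ₹150,000, before a single byte moves. Calling the same function with `azs_per_endpoint=3` shows how one extra AZ per consumer adds half again to the hourly line.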
Interview & exam questions
Q1. When would you choose PrivateLink over VPC peering or Transit Gateway? When you are publishing a single service to one or many consumers, especially across accounts with overlapping CIDRs, and you want to grant reach to one load balancer rather than a route into your network. Peering/TGW are bidirectional L3 routing that require non-overlapping CIDRs; PrivateLink is unidirectional, one-service, and CIDR-agnostic. (SAP-C02, ANS-C01.)
Q2. Why are overlapping CIDRs irrelevant to PrivateLink? Because the consumer’s workloads talk only to an interface-endpoint ENI carved from the consumer’s own subnet; AWS carries the flow across the backbone to the provider’s NLB without ever exposing the provider’s address space. There is no IP-level route between the VPCs to collide. (ANS-C01.)
Q3. What two controls gate who can connect to an endpoint service, and how do they interact? Allowed principals (who may create an endpoint) and per-connection acceptance (acceptance_required). They stack: without an allow-list entry the consumer can’t connect even with acceptance_required = false; with the entry and true, the connection waits in pendingAcceptance until accepted. (SAP-C02.)
Q4. A consumer’s endpoint is available but every call hangs. Where do you look first? The endpoint ENI’s security group — it’s the only L4 filter on the path because the NLB has no SG. Confirm with VPC Flow Logs on the ENI (look for REJECT) and fix the inbound rule to allow the service port from the client. (ANS-C01.)
Q5. Why does private DNS require domain-ownership verification, and where is the TXT record published? To stop a malicious provider from publishing a service that hijacks a domain like *.example.com. The provider publishes the verification TXT record in the public hosted zone for the domain, even though the service itself is private. (SAP-C02.)
Q6. What is split-horizon DNS in this context? AWS creates a managed private hosted zone inside the consumer VPC that resolves the friendly name to the local endpoint ENIs, while the public name resolves to nothing useful externally — so the same hostname behaves differently inside the consumer VPC than on the public internet. (ANS-C01.)
Q7. Why must the NLB be internal, and what do its subnets determine? PrivateLink targets the NLB’s private addresses, so an internet-facing scheme can’t back an endpoint service. The subnets you attach define which AZs the service is available in — a consumer can only create an endpoint in an AZ where the provider has a presence. (ANS-C01.)
Q8. A long-lived gRPC connection through PrivateLink resets every few minutes. Cause and fix? The NLB’s 350-second TCP idle timeout. Fix it by enabling TCP keepalives below that threshold on the client; nothing about PrivateLink itself changes the timeout. (SOA-C02.)
Q9. Why reason about AZs by ID rather than name across accounts? Because us-east-1a may map to a different physical Availability Zone in the consumer account than in the provider account. AZ IDs (use1-az1) are stable across accounts, so aligning endpoint and NLB AZs by ID is correct; by name can silently strand an ENI in an unserved zone. (ANS-C01.)
Q10. What are the two billing axes for an interface endpoint, and which dominates at high fan-in? A per-endpoint, per-AZ hourly charge and a per-GB data-processing charge (both consumer-side). At high fan-in across many consumers, the per-AZ hourly charge dominates because it accrues regardless of traffic. (SAP-C02.)
Q11. How do you automate consumer onboarding at scale? Set acceptance_required = true, wire a connection notification to SNS, and have a Lambda auto-approve principals present in an authoritative registry (e.g. a DynamoDB table) while leaving unknowns pending for a human. This turns onboarding from a ticket into minutes while keeping an audited gate. (SAP-C02.)
Q12. Your targets behind PrivateLink need the real client IP for logging. How? Enable PROXY protocol v2 on the NLB target group so the client’s address is prepended to the TCP stream; otherwise the targets see the endpoint ENI’s network identity, not the consumer’s client. (ANS-C01.)
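For Q12, the sketch below shows what "parse in the app" can look like for the fixed part of a PROXY protocol v2 header. It is a minimal illustration that handles only the TCP-over-IPv4 `PROXY` command and ignores any TLV extensions that follow the address block. The function name is hypothetical, and a production service would more likely rely on its server's or proxy's built-in support.

```python
import socket
import struct
from typing import Optional, Tuple

PP2_SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"  # fixed 12-byte PROXY v2 signature


def parse_proxy_v2_ipv4(data: bytes) -> Optional[Tuple[str, int, int]]:
    """Return (client_ip, client_port, header_length) for a TCP/IPv4 header, else None."""
    if len(data) < 16 or data[:12] != PP2_SIGNATURE:
        return None
    # Byte 13: version (high nibble) and command (low nibble); byte 14: family/protocol;
    # bytes 15-16: length of everything after the 16-byte fixed part.
    ver_cmd, fam_proto, addr_len = struct.unpack("!BBH", data[12:16])
    header_len = 16 + addr_len
    if ver_cmd >> 4 != 2 or len(data) < header_len:
        return None
    # Command 0x1 is PROXY; 0x11 is TCP over IPv4 (needs 12 address bytes).
    if ver_cmd & 0x0F != 0x1 or fam_proto != 0x11 or addr_len < 12:
        return None
    src, _dst, src_port, _dst_port = struct.unpack("!4s4sHH", data[16:28])
    return socket.inet_ntoa(src), src_port, header_len


# Build a sample header for a hypothetical client 10.1.2.3:54321 talking to 10.9.9.9:443.
sample = (
    PP2_SIGNATURE
    + bytes([0x21, 0x11])
    + struct.pack("!H", 12)
    + socket.inet_aton("10.1.2.3")
    + socket.inet_aton("10.9.9.9")
    + struct.pack("!HH", 54321, 443)
    + b"application bytes follow"
)
print(parse_proxy_v2_ipv4(sample))
# -> ('10.1.2.3', 54321, 28)
```

The returned header length tells the application where its own payload starts. Treat the recovered address as logging context only: as the security notes say, authorization should rest on the principal gate and application auth, not on an IP.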
Quick check
- You need to expose one internal API to 150 accounts, several with overlapping `10.0.0.0/16` CIDRs. Which connectivity primitive, and why?
- A consumer reports their endpoint is stuck in `pendingAcceptance`. What two things could be wrong, and what command confirms each?
- `dig payments.internal.example.com` from a consumer instance returns a long `vpce-…amazonaws.com` name. List three things to check.
- Why does the endpoint ENI’s security group matter when the NLB has none?
- A streaming connection through PrivateLink drops every few minutes with healthy targets. What’s the cause and the one-line fix?
Answers
- PrivateLink endpoint service. It publishes one service (not a network), is CIDR-agnostic because the consumer talks to ENIs in its own subnets, and scales to many accounts as self-service. Peering/TGW would demand non-overlapping CIDRs and grant a route into your network.
- Either the consumer’s principal is not on the allowed-principals list, or the provider hasn’t accepted the connection. Confirm the connection state with `aws ec2 describe-vpc-endpoint-connections` and the permissions with `aws ec2 describe-vpc-endpoint-service-permissions`; fix by adding the principal and/or running `aws ec2 accept-vpc-endpoint-connections`.
- (a) `private_dns_enabled = true` on the consumer endpoint; (b) the service’s `PrivateDnsNameConfiguration.State` is `verified`; (c) the consumer VPC has `enableDnsHostnames` and `enableDnsSupport` both on. Any one missing falls back to the long name.
- Because the NLB has no security group, the endpoint ENI’s SG is the only L4 filter on the path. It must allow inbound on the service port from the client CIDRs/SGs, or connections to an otherwise-`available` endpoint hang or are refused.
- The NLB 350-second TCP idle timeout. Fix: set TCP keepalives below 350 seconds on the client.
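The last answer can be sketched at the socket level. This is a minimal Python example, assuming a client that owns its TCP socket. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` option names are the Linux ones, so each is guarded in case the platform does not expose it. The helper name and the default values are hypothetical.

```python
import socket

NLB_IDLE_TIMEOUT_S = 350  # the idle window discussed in this article


def enable_keepalive(sock, idle_s=120, interval_s=30, probes=3):
    """Turn on TCP keepalives so the first probe fires well inside the idle window."""
    if idle_s >= NLB_IDLE_TIMEOUT_S:
        raise ValueError("keepalive idle time must be below the NLB idle timeout")
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific tuning knobs; skipped where the constant is missing.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print(keepalive_on)
```

Without the per-socket idle setting, Linux falls back to the system-wide keepalive time, which commonly defaults to two hours and is far too long to keep a flow alive through the load balancer.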
Glossary
- PrivateLink: AWS service that exposes one service across accounts/VPCs as an interface endpoint, with no routing or CIDR overlap concerns.
- Endpoint service: The provider-side publishing layer that sits on top of an internal NLB/GWLB and advertises a service name.
- Service name: The coordination string `com.amazonaws.vpce.<region>.vpce-svc-xxxx` consumers use to find and target the service.
- Interface endpoint: A consumer-side `Interface`-type VPC endpoint that provisions one ENI per subnet pointing at the provider’s service.
- Endpoint ENI: The elastic network interface (one per AZ) in the consumer’s subnet; the only hop consumer workloads talk to, and where the SG filter lives.
- Allowed principals: The allow-list of IAM principal ARNs permitted to create an endpoint to the service (access gate #1).
- Acceptance / `acceptance_required`: The per-connection approval gate; when true, new endpoints wait in `pendingAcceptance` (access gate #2).
- Connection notification: An SNS feed of `Connect/Accept/Reject/Delete` events for automating or alerting on onboarding.
- Private DNS name: A friendly hostname the provider associates so consumers avoid the regional endpoint name; requires domain verification.
- Domain-ownership verification: The TXT-record proof (in the public zone) that the provider owns the domain before private DNS is allowed.
- Split-horizon DNS: Resolution that returns the endpoint ENIs inside the consumer VPC while the public name resolves to nothing useful externally.
- Cross-zone load balancing: NLB setting that lets an ENI reach targets in any AZ, not just its own; off by default.
- AZ ID: A stable physical-zone identifier (`use1-az1`) that is consistent across accounts, unlike the AZ name.
- Idle timeout: The NLB’s fixed 350-second TCP idle window after which inactive flows reset.
- PROXY protocol v2: A target-group option that prepends the real client address to the TCP stream so targets can see it.
Next steps
- Master the VPC fundamentals these endpoints build on in AWS VPC Deep Dive: Subnets, Routing, IGW, NAT & Endpoints.
- Compare the routing alternative for many-to-many connectivity in AWS Transit Gateway Multi-Account VPC Architecture.
- Go deep on the load balancer underneath in AWS Elastic Load Balancing: ALB, NLB & GWLB Deep Dive.
- Tighten the cross-account access model with IAM Cross-Account Roles, External ID & the Confused Deputy.
- Evaluate the app-layer alternative for HTTP service meshes in VPC Lattice Service Networks with IAM Auth.