Building Cross-Account Services with AWS PrivateLink: Endpoint Services, NLBs, and DNS

You have an internal API that another team, or another company, needs to reach. The reflex is to peer the VPCs or hang both off a Transit Gateway. Both grant network-layer reachability, and both fall apart the moment the consumer’s 10.0.0.0/16 collides with yours, which across a large estate it eventually will. AWS PrivateLink solves a narrower problem and solves it cleanly: it exposes one service across a trust boundary as elastic network interfaces (ENIs), one per subnet you choose, inside the consumer’s own VPC. No routes, no transitive reachability, no CIDR negotiation, no public internet. This is how the painful “expose this to 200 accounts” problem on AWS becomes a self-service onboarding problem instead of a networking project.

The mechanism has two halves and a wire between them. On the provider side you front your workload with an internal Network Load Balancer (NLB) or Gateway Load Balancer, then publish a VPC endpoint service on top of it; that service hands out a coordination string — com.amazonaws.vpce.<region>.vpce-svc-xxxx — and gates who connects through an allowed-principals allow-list plus an optional per-connection acceptance step. On the consumer side you create an interface endpoint naming that service, AWS provisions one ENI per subnet, and your workloads talk only to those ENIs. Private DNS then lets a friendly name like payments.internal.example.com resolve — inside the consumer VPC only — to the local ENIs, so client code never learns the ugly regional hostname. The catch that trips every team once is that AWS makes the provider prove they own the domain before any consumer can switch private DNS on.

By the end of this article you will build both sides correctly, wire up private DNS with domain-ownership verification, reason about availability zones across accounts (where us-east-1a in your account is not the same physical zone as in theirs), size for fan-in against the real quotas, keep the whole thing observable with VPC Flow Logs and NLB metrics, and — when it inevitably misbehaves — localise the failure to exactly one hop with a runbook of symptom, root cause, the exact command to confirm, and the fix. Every configuration here carries both a Terraform block and an aws CLI command, and because this is a reference you will return to mid-incident, the option matrices, error catalogue, quotas, and the troubleshooting playbook are all laid out as scannable tables.

What problem this solves

The pain is reachability that is too broad and address space that collides. When you peer two VPCs or attach them to a Transit Gateway, you give the consumer a route into your network. A misconfigured security group, an over-broad route table, or a curious operator on the other side can then reach anything routable in your VPC — not just the one API you meant to share. Security teams rightly flag this for external partners: “call one endpoint” should not grant “a path into the platform network.” PrivateLink gives the consumer a route to one load balancer, full stop, and only ever in the consumer→provider direction.

The second pain is CIDR overlap. Peering and TGW both demand non-overlapping address space, because they route on IP. In a large organisation — especially after acquisitions — two production VPCs sitting on the same 10.20.0.0/16 is not a hypothetical; it is Tuesday. Renumbering a live, compliance-scoped VPC is a multi-quarter project nobody signs up for. PrivateLink never exposes the provider’s address space at all: the consumer talks to ENIs carved from the consumer’s own subnets, so the two VPCs can use byte-identical CIDRs and never know.

Who hits this: any platform team publishing an internal service to many accounts (a fraud-scoring API, a config service, a logging sink), any SaaS vendor offering private connectivity to customer VPCs, and anyone who looked at a VPC-peering mesh of 80 accounts and realised it does not scale. The cost of PrivateLink is that it is one-directional and one-service-per-endpoint, and it requires a load balancer in front of the workload. If you genuinely need bidirectional, everything-talks-to-everything connectivity, this is the wrong tool — reach for TGW. To frame the whole decision before the deep dive, here is when each connectivity primitive is the right call:

| Pattern | What it connects | CIDR overlap | Direction | Bills per GB | Pick it when |
| --- | --- | --- | --- | --- | --- |
| VPC peering | Two VPCs, full IP reachability | Must not overlap | Bidirectional | No processing fee (cross-AZ transfer still billed) | Two VPCs, mutual trust, small N |
| Transit Gateway | Many VPCs/accounts, policy-routed | Must not overlap | Bidirectional | Yes | Hub-and-spoke routing, segmentation |
| PrivateLink | One service behind an NLB/GWLB | Irrelevant | Unidirectional (consumer→provider) | Yes | Publish a service to many accounts |
| VPC Lattice | App-layer services across accounts | Irrelevant | Policy-governed | Yes | HTTP service mesh, per-request auth |
| Public endpoint + IAM | A regional AWS API/SaaS over public IP | n/a | Caller→service | Egress only | You accept public-path + IAM-only |

The deciding question is always reachability versus publishing. Peering and TGW are routing: they give the consumer a route to your network. PrivateLink is publishing: it gives the consumer a route to one load balancer. If you would put the thing behind a load balancer and a DNS name anyway, publish it — do not route to it.

Mental model: peering and TGW are routing. PrivateLink is publishing. The consumer never touches your network — only an ENI in their own subnet that happens to forward to your NLB.

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should be comfortable with core VPC concepts (subnets, route tables, security groups, ENIs, and Availability Zones) and know that an NLB is a Layer-4 (TCP/UDP/TLS) load balancer on which security groups are optional: support arrived in 2023, and an NLB created without one, like the one in this article, has no security group of its own. You should be able to run the aws CLI with a profile per account (this is inherently a two-account exercise), read JSON output, and apply Terraform. Familiarity with Route 53 hosted zones (public and private) and basic DNS resolution helps a great deal for the private-DNS section.

This sits in the Networking track and assumes the fundamentals from the AWS VPC Deep Dive: Subnets, Routing, IGW, NAT & Endpoints — interface endpoints are the same primitive you use for AWS service endpoints, pointed at a private service instead. It is the cross-account complement to the AWS Transit Gateway Multi-Account VPC Architecture: TGW for routing meshes, PrivateLink for service publishing. The NLB underneath is covered in AWS Elastic Load Balancing: ALB, NLB & GWLB Deep Dive, the access controls lean on IAM Cross-Account Roles, External ID & the Confused Deputy, and for an app-layer alternative compare VPC Lattice Service Networks with IAM Auth.

A quick map of who owns and confirms what during an incident, so you call the right person fast:

| Layer | What lives here | Which account owns it | Failure classes it can cause |
| --- | --- | --- | --- |
| Consumer client + its SG | App code, egress rules | Consumer | Client SG blocks egress to the ENI |
| Endpoint ENI + its SG | The interface endpoint, per-AZ ENIs | Consumer | Inbound SG blocks the client → hang/refuse |
| Private DNS (managed PHZ) | Split-horizon name → ENI | Consumer (created by AWS) | Long regional name returned; resolution fails |
| AWS backbone | The PrivateLink data path | AWS (managed) | Practically never; AZ mismatch shows here |
| Endpoint service + gates | Allowed principals, acceptance, DNS verify | Provider | pendingAcceptance; private DNS not effective |
| Internal NLB + targets | L4 LB, target groups, health checks | Provider | Unhealthy targets, AZ gaps, idle resets |

Core concepts

Five mental models make every later step and every failure obvious.

An endpoint service publishes one service, not a network. On the provider side, a VPC endpoint service is a thin publishing layer that sits on top of an internal NLB (or GWLB) and advertises a single coordination string, the service name (com.amazonaws.vpce.<region>.vpce-svc-xxxx). It exposes exactly what the NLB fronts — one TCP service — and nothing else of your VPC. There is no route, no IP range, no transitive path. The consumer cannot “see” your network; they can only reach the load balancer you chose to publish.

The interface endpoint is an ENI in the consumer’s own subnet. On the consumer side, an interface endpoint of type Interface names the provider’s service, and AWS provisions one ENI per subnet you specify, each with a private IP from that consumer subnet’s range. Those ENIs are the only thing consumer workloads ever talk to. This is why CIDR overlap is irrelevant: the client connects to 10.0.1.50 in its own VPC, and AWS quietly carries the flow across the backbone to the provider’s NLB. The provider’s address space never enters the picture.

Two stacked controls decide who connects. Access is governed by two independent gates that both must pass. Allowed principals is an allow-list of IAM principal ARNs (a role, an account root, or *) that decides who may even create an endpoint to your service — without an entry, the consumer cannot see or target it at all. Acceptance (acceptance_required) is a per-connection gate: when true, every new endpoint lands in pendingAcceptance and waits for the provider to approve it. They stack: with acceptance_required = false an unlisted principal still cannot connect; with an open allow-list, acceptance still holds each connection until approved.

Private DNS is split-horizon and requires proof of ownership. By default the endpoint hands the consumer an ugly regional name. A private DNS name lets the provider associate a friendly hostname so consumers keep calling it. AWS implements this as split-horizon: it creates a managed private hosted zone inside the consumer VPC that resolves the name to the local ENIs, while the public name resolves to nothing useful outside. To stop anyone from publishing a service that hijacks *.example.com, AWS forces the provider to prove domain ownership via a TXT record in the public zone before any consumer may enable private DNS.

The NLB underneath sets the rules of physics. Everything downstream (which AZs the service is reachable in, whether traffic stays zone-local, how long-lived connections behave) is decided by the NLB. The subnets you attach define the published AZs; cross-zone load balancing decides whether an ENI in us-east-1a can reach targets in us-east-1b; the 350-second TCP idle timeout (the NLB default) silently resets long-lived flows that send no keepalives. Get the NLB right and PrivateLink is boring; get it wrong and you get “works in one AZ, dead in another.”
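To make the idle-timeout point concrete, here is a minimal sketch that checks whether a client's TCP keepalive idle time is short enough to beat the 350-second window. It assumes a Linux client and an application that sets SO_KEEPALIVE on its sockets (the kernel setting does nothing otherwise). The function name and the constant are illustrative helpers, not part of any AWS tooling.

```shell
# Assumed idle window, in seconds, taken from the text above.
NLB_IDLE_TIMEOUT=350

# Succeeds when keepalive probes would start before the NLB idle window ends.
keepalive_beats_nlb_idle() {
  # $1: keepalive idle time in seconds; defaults to the Linux kernel setting
  idle="${1:-$(sysctl -n net.ipv4.tcp_keepalive_time)}"
  [ "$idle" -lt "$NLB_IDLE_TIMEOUT" ]
}
```

The stock Linux default for net.ipv4.tcp_keepalive_time is 7200 seconds, so the check fails on an untuned host until the value is lowered, for example with sudo sysctl -w net.ipv4.tcp_keepalive_time=120.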

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Whose account | Why it matters |
| --- | --- | --- | --- |
| Endpoint service | Publishing layer over an NLB/GWLB | Provider | The thing consumers connect to |
| Service name | com.amazonaws.vpce.<region>.vpce-svc-… | Provider (issued by AWS) | Coordination key; share out of band |
| Interface endpoint | Interface-type endpoint → ENIs | Consumer | What client workloads actually talk to |
| Endpoint ENI | One private IP per AZ in the consumer subnet | Consumer | The only L4 hop; its SG is the filter |
| Allowed principals | Allow-list of who may create an endpoint | Provider | Gate #1; scope to role ARNs |
| Acceptance | Per-connection approval (acceptance_required) | Provider | Gate #2; pendingAcceptance until approved |
| Private DNS name | Friendly name resolving to the ENIs | Provider sets, consumer enables | Avoids the ugly regional hostname |
| Domain verification | TXT record proving domain ownership | Provider | Must be verified before private DNS works |
| Cross-zone LB | NLB sends across AZs, not just zone-local | Provider | Off → “works in one AZ only” |
| Connection notification | SNS event on Connect/Accept/Reject/Delete | Provider | Auto-approval / human alerting |
| AZ ID | Stable physical-zone identifier (use1-az1) | Both | AZ names differ per account |
| Idle timeout | NLB’s TCP inactivity window (350 s by default) | Provider (NLB) | Long-lived flows reset without keepalive |
| PROXY protocol v2 | Prepends real client IP to the TCP stream | Provider (target group) | Targets otherwise see the ENI, not the client |
| GWLB endpoint | Endpoint type for inline-appliance services | Both | Inspection/firewall services use GWLB, not NLB |
| Regional endpoint name | The long vpce-…amazonaws.com hostname | Consumer | The fallback when private DNS isn’t effective |

When an endpoint service is the right call

PrivateLink, peering, Transit Gateway, and VPC Lattice are not interchangeable. Pick by what you are actually sharing and in which direction. The single deciding factor is whether you want to give the other side a route to your network or a path to one service; the rest follows. Reach for an endpoint service when the answer to “would I put this behind a load balancer and a DNS name anyway?” is yes.

The full comparison, on the dimensions that decide a real design:

| Dimension | PrivateLink | VPC peering | Transit Gateway | VPC Lattice |
| --- | --- | --- | --- | --- |
| Granularity | One service per endpoint | Whole VPC | Whole VPCs (route-domains) | Per app-layer service |
| CIDR overlap allowed | Yes (irrelevant) | No | No | Yes |
| Direction | Consumer→provider only | Bidirectional | Bidirectional | Policy-governed |
| Layer | L4 (TCP/UDP/TLS via NLB) | L3 (IP) | L3 (IP) | L7 (HTTP) + L4 |
| Transitive reach | None (no route) | Non-transitive | Transitive (by design) | Service-scoped |
| Scales to N consumers | Excellent (fan-in) | Poor (mesh) | Good (hub-spoke) | Excellent |
| Public internet | Never | Never | Never | Never |
| Auth model | Allowed principals + acceptance | SG/NACL only | SG/NACL + routing | IAM auth policies |
| Bills per GB | Yes (data processing) | No processing fee (cross-AZ transfer still billed) | Yes | Yes |
| DNS integration | Private DNS (verified) | Manual | Manual / R53 Resolver | Built-in service DNS |
| Onboarding model | Self-service (gated) | Per-peer request | Per-attachment | Service association |
| Typical use | SaaS / shared internal API | Two trusted VPCs | Routing mesh | HTTP service mesh |

Five reading notes that save the most design time:

| If your requirement is… | Don’t use… | Use… | Because… |
| --- | --- | --- | --- |
| “Expose one API to 200 accounts” | Peering mesh / TGW | PrivateLink | Fan-in, no routes, overlap-proof |
| “Many VPCs must route to each other” | PrivateLink | Transit Gateway | PrivateLink is one-service, one-way |
| “Partner gets a path into our net” | Peering / TGW | PrivateLink | Grants one service, not network reach |
| “HTTP routing + per-request authZ” | Raw PrivateLink | VPC Lattice | Lattice does L7 + IAM auth policies |
| “Two of my own VPCs, full trust” | PrivateLink | VPC peering | Cheaper, bidirectional, simpler |

Provider step 1 — Front the service with an internal NLB

An endpoint service sits on top of an NLB (or GWLB). We will use an NLB. It must be internal — PrivateLink targets the load balancer’s private addresses, not an internet-facing scheme — and you register your service’s targets (instances, IPs, or even an ALB as a target if you need L7 routing behind it) in a target group.

resource "aws_lb" "svc" {
  name                             = "payments-svc-nlb"
  internal                         = true
  load_balancer_type               = "network"
  subnets                          = var.provider_subnet_ids   # one per AZ you publish
  enable_cross_zone_load_balancing = true
}

resource "aws_lb_target_group" "svc" {
  name        = "payments-svc-tg"
  port        = 8443
  protocol    = "TCP"
  vpc_id      = var.provider_vpc_id
  target_type = "ip"

  health_check {
    protocol            = "TCP"
    port                = "8443"
    healthy_threshold   = 3
    unhealthy_threshold = 3
    interval            = 10
  }
}

resource "aws_lb_listener" "svc" {
  load_balancer_arn = aws_lb.svc.arn
  port              = 8443
  protocol          = "TCP"
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.svc.arn
  }
}

# Equivalent, CLI: an internal NLB with cross-zone enabled
aws elbv2 create-load-balancer --name payments-svc-nlb --type network \
  --scheme internal --subnets subnet-aaa subnet-bbb subnet-ccc
aws elbv2 modify-load-balancer-attributes --load-balancer-arn <nlb-arn> \
  --attributes Key=load_balancing.cross_zone.enabled,Value=true

Two decisions here matter for everything downstream. First, the subnets you attach to the NLB define which AZs the service is available in. A consumer can only create an endpoint in an AZ where you have a presence. Publish in at least two, and prefer to publish in every AZ your largest consumers use. Second, turn on cross-zone load balancing. An NLB is zonal by default: an endpoint ENI in us-east-1a will only send to targets in us-east-1a unless cross-zone is enabled, which with uneven target distribution produces hot zones and surprising health-check behaviour. Cross-zone traffic on an NLB incurs inter-AZ data charges, which the load-balancer owner pays — budget for it.
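Both decisions can be read back from the load balancer after it is created. The sketch below wraps two existing elbv2 describe calls; the function names are hypothetical helpers, and the ARN you pass is your own.

```shell
# Succeeds only when cross-zone load balancing is switched on for the NLB.
nlb_cross_zone_enabled() {
  # $1: NLB ARN
  value=$(aws elbv2 describe-load-balancer-attributes --load-balancer-arn "$1" \
    --query "Attributes[?Key=='load_balancing.cross_zone.enabled'].Value" \
    --output text)
  [ "$value" = "true" ]
}

# Prints the AZ names the NLB is attached to, one per line.
# These are names as seen by the provider account, not portable AZ IDs.
nlb_zones() {
  # $1: NLB ARN
  aws elbv2 describe-load-balancers --load-balancer-arns "$1" \
    --query 'LoadBalancers[0].AvailabilityZones[].ZoneName' --output text |
    tr '\t' '\n'
}
```

A call such as nlb_cross_zone_enabled "<nlb-arn>" || echo "cross-zone is off" fits naturally into a pre-publish check.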

The NLB decisions that ripple into PrivateLink behaviour, with the default, the trade-off, and the gotcha:

| Setting | Values | Default | When to change | Trade-off / gotcha |
| --- | --- | --- | --- | --- |
| scheme | internal / internet-facing | n/a | Must be internal for PrivateLink | Internet-facing NLB cannot back an endpoint service |
| Subnets (AZs) | One subnet per AZ | none | Publish every AZ big consumers use | Consumer can’t endpoint into an AZ you skip |
| cross_zone.enabled | true / false | false | Almost always true | false → ENI only reaches same-AZ targets; inter-AZ data charges when on |
| target_type | instance / ip / alb | instance | ip for ECS/containers; alb for L7 behind | alb target enables HTTP routing but adds a hop |
| Listener protocol | TCP / UDP / TLS / TCP_UDP | none | TLS to terminate at the NLB | TLS needs an ACM cert; otherwise pass TCP through |
| Health-check protocol | TCP / HTTP / HTTPS | TCP | HTTP(S) for app-aware checks | TCP only proves the port is open, not the app |
| deletion_protection | true / false | false | true in prod | Prevents an accidental terraform destroy outage |
| preserve_client_ip | true / false | varies by target type | Needed if the app reads source IP | Through PrivateLink the source is the ENI, not the real client |

A subtle but important point on client IP: through PrivateLink the provider’s targets see the endpoint ENI’s network identity, not the consumer’s real client IP (unless you carry it in PROXY protocol v2, which the NLB can prepend). Do not build authorization on observed source IP across a PrivateLink boundary — use the allowed-principals gate and application-layer auth instead.
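If the application does need the original client address, PROXY protocol v2 is a target-group attribute. A minimal sketch of switching it on with the CLI follows; the function name is a hypothetical wrapper. Treat it as a coordinated change: every target has to parse the PROXY v2 header before the application payload, otherwise it will read the header bytes as request data.

```shell
# Turns on PROXY protocol v2 for one NLB target group.
enable_proxy_protocol_v2() {
  # $1: target group ARN
  aws elbv2 modify-target-group-attributes --target-group-arn "$1" \
    --attributes Key=proxy_protocol_v2.enabled,Value=true
}
```

In Terraform the same switch is the proxy_protocol_v2 argument on aws_lb_target_group.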

Provider step 2 — Create the VPC endpoint service

With the NLB live, create the endpoint service that points at it. The key knob is acceptance_required.

resource "aws_vpc_endpoint_service" "payments" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.svc.arn]

  tags = { Name = "payments-endpoint-service" }
}

output "service_name" {
  # e.g. com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0
  value = aws_vpc_endpoint_service.payments.service_name
}

# CLI equivalent: pass the NLB ARN(s), not the VPC
aws ec2 create-vpc-endpoint-service-configuration \
  --network-load-balancer-arns <nlb-arn> \
  --acceptance-required

AWS assigns a service name of the form com.amazonaws.vpce.<region>.vpce-svc-xxxxxxxxxxxxxxxxx. This string is what consumers use to find you; it is not secret, but it is the coordination key — hand it to consumers out of band (a wiki, a Terraform output, a platform catalogue). acceptance_required = true means every new connection lands in pendingAcceptance and waits for you to approve it. For a controlled internal platform with a known consumer list, that manual gate is worth keeping; for self-service at scale you flip it to false and rely on the allow-list instead. Do not run with false and an open allow-list unless you genuinely intend anyone in any account to connect.

The endpoint-service configuration options, end to end:

| Setting | Values | Default | When to change | Trade-off / gotcha |
| --- | --- | --- | --- | --- |
| acceptance_required | true / false | true (console) | false for self-service at scale | false + open allow-list = anyone can connect |
| network_load_balancer_arns | One or more NLB ARNs | none | Multiple for blue/green or sharding | All must be in the same region/VPC |
| gateway_load_balancer_arns | One or more GWLB ARNs | none | For appliance/inspection services | NLB and GWLB are mutually exclusive per service |
| supported_ip_address_types | ipv4 / ipv6 | ipv4 | Add ipv6 for dual-stack consumers | Consumer and NLB must both support the family |
| private_dns_name | FQDN string | unset | Set to publish a friendly name | Triggers the TXT verification requirement |
| allowed_principals | List of ARNs | empty (nobody) | Always add at least one | Empty = no consumer can even see the service |
| supported_regions | Region list (cross-region) | current | For cross-region access (where supported) | Adds cross-region data charges |
| Tags | Key/value | none | Always tag owner + cost-centre | Untagged platform services are unauditable |

The service has a lifecycle state you will check constantly; know what each value means:

| ServiceState | Meaning | What to do |
| --- | --- | --- |
| Pending | Being created | Wait; verify the NLB exists |
| Available | Ready for connections | Add allowed principals; share the name |
| Deleting | Tear-down in progress | Consumers’ endpoints will go to rejected |
| Failed | Creation failed | Check the NLB ARN and account limits |
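Since you will check this state constantly, a small polling helper can save retyping. This is a sketch under simple assumptions: the function name, the default of 30 polls, and the 10-second pause are all illustrative choices, and it gives up as soon as the service reports a terminal state.

```shell
# Polls an endpoint service until it is Available, fails, or the polls run out.
wait_for_service_available() {
  # $1: service ID; $2: max polls (default 30); $3: seconds between polls (default 10)
  svc_id="$1"; max="${2:-30}"; pause="${3:-10}"
  i=0
  while [ "$i" -lt "$max" ]; do
    state=$(aws ec2 describe-vpc-endpoint-service-configurations \
      --service-ids "$svc_id" \
      --query 'ServiceConfigurations[0].ServiceState' --output text)
    case "$state" in
      Available) return 0 ;;
      Failed|Deleting|Deleted) echo "service is $state" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$pause"
  done
  echo "timed out waiting for $svc_id" >&2
  return 1
}
```

For example, wait_for_service_available vpce-svc-0123456789abcdef0 && echo "ready to add principals".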

Provider step 3 — Allowed principals, acceptance, and notifications

Two independent controls govern who reaches your service, and they stack. Allowed principals decide who is even permitted to create an endpoint to your service; without an entry, the consumer cannot see or target it at all. Scope these as tightly as the consumer’s identity allows — a specific role ARN is better than a whole account root, which is better than *.

resource "aws_vpc_endpoint_service_allowed_principal" "consumer" {
  vpc_endpoint_service_id = aws_vpc_endpoint_service.payments.id
  principal_arn           = "arn:aws:iam::222222222222:role/payments-client-prod" # tighten to a role
}

# CLI equivalent
aws ec2 modify-vpc-endpoint-service-permissions \
  --service-id vpce-svc-0123456789abcdef0 \
  --add-allowed-principals arn:aws:iam::222222222222:role/payments-client-prod

Acceptance is the per-connection gate that applies when acceptance_required = true. List pending connections and approve them:

aws ec2 describe-vpc-endpoint-connections \
  --filters Name=service-id,Values=vpce-svc-0123456789abcdef0 \
  --query 'VpcEndpointConnections[?VpcEndpointState==`pendingAcceptance`].[VpcEndpointId,VpcEndpointOwner]' \
  --output table

aws ec2 accept-vpc-endpoint-connections \
  --service-id vpce-svc-0123456789abcdef0 \
  --vpc-endpoint-ids vpce-0a1b2c3d4e5f6a7b8

To avoid polling for pending connections, wire a connection notification to SNS. You get an event on Connect, Accept, Reject, and Delete, which you can route to a Lambda for auto-approval against an authoritative consumer registry, or just to a channel so a human acts within minutes instead of hours.

resource "aws_vpc_endpoint_connection_notification" "payments" {
  vpc_endpoint_service_id     = aws_vpc_endpoint_service.payments.id
  connection_notification_arn = aws_sns_topic.privatelink_events.arn
  connection_events           = ["Connect", "Accept", "Reject", "Delete"]
}
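The same notification can be created from the CLI. The wrapper below is a sketch with a hypothetical function name; the underlying command and its flags are the standard EC2 ones. It assumes the SNS topic already exists and that its access policy lets the endpoint service publish to it.

```shell
# Subscribes an SNS topic to all four connection events of an endpoint service.
notify_on_connection_events() {
  # $1: endpoint service ID; $2: SNS topic ARN
  aws ec2 create-vpc-endpoint-connection-notification \
    --service-id "$1" \
    --connection-notification-arn "$2" \
    --connection-events Connect Accept Reject Delete
}
```

Usage: notify_on_connection_events vpce-svc-0123456789abcdef0 "<sns-topic-arn>".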

How the two gates interact is the single most common source of “why can’t they connect” — read this matrix carefully:

| Allowed principal present? | acceptance_required | Outcome | Provider action needed |
| --- | --- | --- | --- |
| No | true | Consumer can’t even create the endpoint | Add the principal first |
| No | false | Consumer can’t create the endpoint | Add the principal first |
| Yes | true | Endpoint sits in pendingAcceptance | Accept the connection (or auto-approve) |
| Yes | false | Endpoint goes available immediately | None (self-service) |
| * (wildcard) | false | Anyone in any account connects | Intentional only; usually a mistake |
| * (wildcard) | true | Anyone may request, you gate each | Acceptable for vetted public-ish services |

Scope the principal as tightly as the consumer’s identity allows — the security blast radius shrinks as you narrow it:

| Principal form | Example | Who can connect | Use when |
| --- | --- | --- | --- |
| Role ARN | arn:aws:iam::222…:role/app-prod | Only that workload role | Default; tightest practical scope |
| Account root | arn:aws:iam::222…:root | Any principal in that account | You trust the whole account |
| Org-wide via condition | aws:PrincipalOrgID (policy) | Any account in your org | Internal platform, many accounts |
| Wildcard | * | Anyone, any account | Almost never; vetted + acceptance only |

The connection-notification events, and what you typically do with each:

| Event | Fires when | Typical handler action |
| --- | --- | --- |
| Connect | A consumer creates an endpoint | Look up the principal in the registry |
| Accept | A connection is accepted | Record onboarding; emit a metric |
| Reject | You (or auto-logic) reject it | Alert the consumer with the reason |
| Delete | The consumer deletes the endpoint | Clean up registry/DNS state |
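The Connect handler's registry lookup can be prototyped in a few lines of shell before you commit to a Lambda. This is only a sketch: the registry is assumed to be a plain text file with one approved 12-digit account ID per line, and the function name is hypothetical. It accepts pending connections whose owner account is listed and leaves the rest untouched for a human to review.

```shell
# Accepts pending endpoint connections whose owner account is in the registry file.
accept_registered_consumers() {
  # $1: endpoint service ID; $2: registry file (one approved account ID per line)
  svc_id="$1"; registry="$2"
  aws ec2 describe-vpc-endpoint-connections \
    --filters "Name=service-id,Values=$svc_id" \
    --query 'VpcEndpointConnections[?VpcEndpointState==`pendingAcceptance`].[VpcEndpointId,VpcEndpointOwner]' \
    --output text |
  while read -r endpoint_id owner; do
    [ -n "$endpoint_id" ] || continue
    if grep -qxF "$owner" "$registry"; then
      aws ec2 accept-vpc-endpoint-connections \
        --service-id "$svc_id" --vpc-endpoint-ids "$endpoint_id" \
        >/dev/null </dev/null
      echo "accepted $endpoint_id ($owner)"
    else
      echo "skipped $endpoint_id ($owner): not in registry"
    fi
  done
}
```

Matching on the owner account is coarser than the role-level allow-list, so treat this as a second check on top of allowed principals, not a replacement for them.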

Consumer step 4 — Create the interface endpoint

Now switch to the consumer account. The consumer creates an interface endpoint (vpc_endpoint_type = "Interface") that names the provider’s service. AWS provisions one ENI per subnet you specify, each with a private IP from that subnet’s range. Those ENIs are the only thing the consumer’s workloads ever talk to.

resource "aws_vpc_endpoint" "payments" {
  vpc_id              = var.consumer_vpc_id
  service_name        = "com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.consumer_subnet_ids       # one per AZ — must overlap provider AZs
  security_group_ids  = [aws_security_group.endpoint.id]
  private_dns_enabled = false                          # see step 5 before enabling
}

# CLI equivalent
aws ec2 create-vpc-endpoint --vpc-id vpc-cons123 --vpc-endpoint-type Interface \
  --service-name com.amazonaws.vpce.us-east-1.vpce-svc-0123456789abcdef0 \
  --subnet-ids subnet-c1 subnet-c2 --security-group-ids sg-endpoint \
  --no-private-dns-enabled

Three things define correctness on this side.

AZ alignment comes first: put the endpoint in the same AZs the provider published. An endpoint ENI in an AZ the provider does not serve is dead weight, because there is no target on the other side. Use AZ IDs (use1-az1), not names, when reasoning across accounts, because us-east-1a in your account may map to a different physical zone than in the provider’s.

The security group is on the endpoint ENI, and this is the most common stumble. The SG attached to the endpoint controls traffic from consumer workloads into the ENI; it must allow inbound on the service port from the client CIDRs/SGs. A provider NLB created without security groups, as in this article, filters nothing, so the endpoint SG is then the only L4 filter on the path. (If the provider does attach security groups to the NLB, their inbound rules can also apply to PrivateLink traffic.) Consumer-side SGs on the clients still need egress to the endpoint.

resource "aws_security_group" "endpoint" {
  name   = "payments-endpoint-sg"
  vpc_id = var.consumer_vpc_id

  ingress {
    description = "Clients to PrivateLink endpoint"
    from_port   = 8443
    to_port     = 8443
    protocol    = "tcp"
    cidr_blocks = [var.app_subnet_cidr]   # or security_groups = [aws_security_group.client.id]
  }
}
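The AZ-alignment check described above is easy to script, because every subnet reports its AZ ID. A sketch, assuming one CLI profile per account (the profile names and both function names here are hypothetical): the first helper lists the AZ IDs behind a set of subnets, the second prints consumer AZ IDs that the provider has not published.

```shell
# Prints the sorted, unique AZ IDs for the given subnets.
subnet_az_ids() {
  # $1: AWS CLI profile; remaining args: subnet IDs in that account
  profile="$1"; shift
  aws ec2 describe-subnets --profile "$profile" --subnet-ids "$@" \
    --query 'Subnets[].AvailabilityZoneId' --output text |
    tr '\t' '\n' | sort -u
}

# Prints consumer AZ IDs missing from the provider's list (ideally nothing).
unpublished_az_ids() {
  # $1: provider AZ IDs; $2: consumer AZ IDs (newline-separated, sorted)
  p=$(mktemp); c=$(mktemp)
  printf '%s\n' "$1" > "$p"
  printf '%s\n' "$2" > "$c"
  comm -13 "$p" "$c"   # keep only lines unique to the consumer list
  rm -f "$p" "$c"
}
```

Typical use: capture subnet_az_ids provider subnet-aaa subnet-bbb and subnet_az_ids consumer subnet-c1 subnet-c2 into two variables, then pass both to unpublished_az_ids. Any line it prints is an endpoint subnet with nothing on the other side.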

Finally, one endpoint, many AZs, one connection: each consumer VPC needs exactly one interface endpoint to the service, and the ENIs across AZs share a single connection from the provider’s point of view. The interface-endpoint options that define behaviour:

| Setting | Values | Default | When to change | Trade-off / gotcha |
| --- | --- | --- | --- | --- |
| vpc_endpoint_type | Interface / Gateway / GatewayLoadBalancer | n/a | Interface for PrivateLink services | Gateway type is only for S3/DynamoDB |
| subnet_ids | One subnet per AZ | none | Match the provider’s published AZs | An ENI in an unpublished AZ is dead weight |
| security_group_ids | SG list | default VPC SG | Always set explicitly | Default SG often denies the service port |
| private_dns_enabled | true / false | true (for AWS svcs) / false (custom) | true once provider DNS is verified | Premature true errors if not verified |
| ip_address_type | ipv4 / ipv6 / dualstack | ipv4 | Match the service’s families | Mismatch → endpoint can’t be created |
| policy | Endpoint policy JSON | full access | Restrict actions (for AWS-service EPs) | Custom services rely on app auth, not this |
| dns_options.dns_record_ip_type | ipv4 / ipv6 / dualstack / service-defined | ipv4 | Dual-stack resolution | Must align with how the service publishes |

The endpoint state values you will watch while it provisions:

| State | Meaning | What to do |
| --- | --- | --- |
| pendingAcceptance | Waiting for the provider to accept | Ask the provider to accept (or check allow-list) |
| pending | Provisioning the ENIs | Wait a minute or two |
| available | ENIs up; ready to use | Test connectivity; check the SG |
| rejected | Provider rejected the connection | You’re not on the allow-list / were denied |
| failed | Provisioning failed | Check AZ overlap and subnet capacity |
| deleting / deleted | Tear-down | Expected on terraform destroy |

Consumer step 5 — Private DNS and domain-ownership verification

By default the endpoint hands the consumer an ugly regional DNS name like vpce-0a1b2c3d4e5f6a7b8-abcd1234.vpce-svc-0123456789abcdef0.us-east-1.vpce.amazonaws.com, plus zonal variants. Functional, but nobody wants that baked into client config. Private DNS names let the provider associate a friendly name — say payments.internal.example.com — so consumers keep calling that hostname and resolution silently points at the endpoint.

The catch, and the reason this trips teams up, is that AWS makes the provider prove they own the domain before any consumer is allowed to enable private DNS. This prevents someone from publishing a service that hijacks *.example.com. Enable a private DNS name on the endpoint service, then publish the TXT record AWS gives you:

# Same resource as in provider step 2, now with private_dns_name set
resource "aws_vpc_endpoint_service" "payments" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.svc.arn]
  private_dns_name           = "payments.internal.example.com"
}

AWS returns a verification token. Fetch it and create the TXT record in the public hosted zone for the domain (verification is done against public DNS, even though the service itself is private):

aws ec2 describe-vpc-endpoint-service-configurations \
  --service-ids vpce-svc-0123456789abcdef0 \
  --query 'ServiceConfigurations[0].PrivateDnsNameConfiguration'
# -> { "State": "pendingVerification", "Type": "TXT",
#      "Value": "vpce:abc123...", "Name": "_a1b2c3d4" }

resource "aws_route53_record" "privatelink_verify" {
  zone_id = var.public_zone_id
  name    = "_a1b2c3d4.payments.internal.example.com"
  type    = "TXT"
  ttl     = 1800
  records = ["vpce:abc123..."]
}

# Once the TXT record is live, ask AWS to verify it
aws ec2 start-vpc-endpoint-service-private-dns-verification \
  --service-id vpce-svc-0123456789abcdef0
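Because verification runs against public DNS, it can help to confirm the TXT record is actually visible there before starting (or re-running) verification. A minimal sketch using dig; the function name is a hypothetical helper, and the record name and token are whatever the describe call returned for your service.

```shell
# Succeeds when the public TXT record exists and carries the expected token.
txt_record_published() {
  # $1: full TXT record name; $2: expected token value
  dig +short TXT "$1" | tr -d '"' | grep -qxF "$2"
}
```

For example, txt_record_published _a1b2c3d4.payments.internal.example.com "<token-value>" || echo "TXT record not visible yet". Run it from outside the VPC, or against a public resolver, so a private hosted zone cannot mask the answer.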

Once the state flips to verified, consumers can set private_dns_enabled = true on their interface endpoints. Behind the scenes AWS creates a managed private hosted zone in the consumer VPC that resolves payments.internal.example.com to the endpoint ENIs — classic split-horizon: the public name resolves to nothing useful publicly, but inside the consumer VPC it points at the local ENIs. For private_dns_enabled to actually take effect, the consumer VPC must have enableDnsHostnames and enableDnsSupport both turned on, or resolution silently falls back to the long regional name.
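The two VPC attributes can be checked from the consumer account before anyone flips private_dns_enabled. The sketch below uses the standard describe-vpc-attribute call; the function name is a hypothetical helper, and the comparison is case-insensitive because the CLI's text output renders booleans as True or False.

```shell
# Succeeds only when both enableDnsSupport and enableDnsHostnames are on.
vpc_dns_ready() {
  # $1: consumer VPC ID
  vpc_id="$1"
  support=$(aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" \
    --attribute enableDnsSupport --query 'EnableDnsSupport.Value' --output text)
  hostnames=$(aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" \
    --attribute enableDnsHostnames --query 'EnableDnsHostnames.Value' --output text)
  case "$support/$hostnames" in
    [Tt]rue/[Tt]rue) return 0 ;;
    *) return 1 ;;
  esac
}
```

If the check fails, each attribute is switched on with its own modify-vpc-attribute call (the command accepts one attribute at a time), for example aws ec2 modify-vpc-attribute --vpc-id vpc-cons123 --enable-dns-hostnames '{"Value":true}'.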

The private-DNS verification states and what each means for the consumer:

| PrivateDnsNameConfiguration.State | Meaning | Consumer can enable private DNS? | Provider action |
| --- | --- | --- | --- |
| pendingVerification | Token issued, TXT not yet seen | No | Publish the TXT record, then start verification |
| verified | Ownership proven | Yes | Tell consumers to set private_dns_enabled=true |
| failed | TXT wrong/missing after retries | No | Fix the TXT name/value; re-run verification |

What must be true on each side for the friendly name to actually resolve — every one of these, or you get the long regional name:

| Requirement | Side | How to confirm | If wrong |
|---|---|---|---|
| private_dns_name set on the service | Provider | describe-vpc-endpoint-service-configurations | Friendly name never offered |
| TXT verification verified | Provider | PrivateDnsNameConfiguration.State | Consumer can’t enable private DNS |
| private_dns_enabled = true | Consumer | describe-vpc-endpoints | Long regional name returned |
| enableDnsSupport = true | Consumer VPC | describe-vpc-attribute | Resolution silently falls back |
| enableDnsHostnames = true | Consumer VPC | describe-vpc-attribute | Resolution silently falls back |
| No conflicting Route 53 record | Consumer | Check private hosted zones | A manual record can shadow the managed one |
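
The consumer-side rows of this checklist can be checked in one pass. The sketch below is a minimal helper, assuming a configured aws CLI; the function name check_private_dns_prereqs and the IDs in the usage line are hypothetical, while the describe-vpc-attribute and describe-vpc-endpoints calls are the same ones named in the table.

```shell
# Hypothetical helper: confirm the consumer-side preconditions for private DNS.
# Usage: check_private_dns_prereqs <vpc-id> <vpc-endpoint-id>
check_private_dns_prereqs() {
  pdns_vpc="$1"
  pdns_endpoint="$2"

  # Each call returns "True" or "False" in text output mode.
  pdns_support=$(aws ec2 describe-vpc-attribute --vpc-id "$pdns_vpc" \
    --attribute enableDnsSupport --query 'EnableDnsSupport.Value' --output text)
  pdns_hostnames=$(aws ec2 describe-vpc-attribute --vpc-id "$pdns_vpc" \
    --attribute enableDnsHostnames --query 'EnableDnsHostnames.Value' --output text)
  pdns_enabled=$(aws ec2 describe-vpc-endpoints --vpc-endpoint-ids "$pdns_endpoint" \
    --query 'VpcEndpoints[0].PrivateDnsEnabled' --output text)

  pdns_status=0
  for pair in "enableDnsSupport=$pdns_support" \
              "enableDnsHostnames=$pdns_hostnames" \
              "PrivateDnsEnabled=$pdns_enabled"; do
    case "$pair" in
      *=True|*=true) echo "ok    $pair" ;;
      *)             echo "FAIL  $pair"; pdns_status=1 ;;
    esac
  done
  return "$pdns_status"
}

# Example (IDs are placeholders):
#   check_private_dns_prereqs vpc-cons vpce-0a1b2c3d4e5f6a7b8
```

A non-zero exit status means at least one precondition is missing; the provider-side rows (private DNS name set, TXT verified) still have to be confirmed from the provider account.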

The three DNS name forms you’ll see, and when each is correct:

| Name form | Example | Resolves to | When you should see it |
|---|---|---|---|
| Friendly private DNS | payments.internal.example.com | Endpoint ENI IPs (in-VPC only) | After verification + private_dns_enabled=true |
| Regional endpoint name | vpce-…vpce-svc-…region.vpce.amazonaws.com | The endpoint (all AZs) | When private DNS is off, and as a fallback |
| Zonal endpoint name | vpce-…az1.…vpce.amazonaws.com | One AZ’s ENI | When you deliberately pin to an AZ |

Verify the path end to end

Prove the path end to end before declaring victory. Work provider→consumer→client, because each layer depends on the one before it.

# 1. Provider: service exists, NLB attached, DNS verified
aws ec2 describe-vpc-endpoint-service-configurations \
  --service-ids vpce-svc-0123456789abcdef0 \
  --query 'ServiceConfigurations[0].[ServiceState,PrivateDnsNameConfiguration.State,AvailabilityZones]'

# 2. Provider: the consumer's connection is accepted (not pendingAcceptance/rejected)
aws ec2 describe-vpc-endpoint-connections \
  --filters Name=service-id,Values=vpce-svc-0123456789abcdef0 \
  --query 'VpcEndpointConnections[].[VpcEndpointOwner,VpcEndpointState]' --output table

# 3. Consumer: endpoint is "available" with ENIs in each AZ
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-0a1b2c3d4e5f6a7b8 \
  --query 'VpcEndpoints[0].[State,NetworkInterfaceIds,DnsEntries[0].DnsName]'

# 4. From a consumer instance: DNS resolves to a private (ENI) address...
dig +short payments.internal.example.com
# 10.x.x.x  (a consumer-subnet IP, not a public one)

# 5. ...and the service answers
curl -sS -o /dev/null -w '%{http_code}\n' https://payments.internal.example.com:8443/healthz

If the friendly name returns nothing (or a public address) and only the long vpce-...amazonaws.com name resolves to the ENIs, private DNS is not effective: re-check private_dns_enabled, the DNS verification state, and the VPC’s enableDnsSupport and enableDnsHostnames flags.

The verification checklist as a table, showing what “good” looks like at each step:

| # | Check | Command | Expected good result |
|---|---|---|---|
| 1 | Service available + DNS verified | describe-vpc-endpoint-service-configurations | Available, verified, AZs listed |
| 2 | Connection accepted | describe-vpc-endpoint-connections | available for the consumer owner |
| 3 | Endpoint up with ENIs | describe-vpc-endpoints | available, N ENIs, a DNS entry |
| 4 | DNS resolves private | dig +short <name> | A 10.x/consumer-subnet IP |
| 5 | Service responds | curl … :8443/healthz | 200 (or your healthy code) |
| 6 | No REJECTs at the ENI | Flow Logs query (below) | Zero REJECT rows from the client |

Scaling, resiliency, and the quotas that bite

A few limits and behaviours decide whether this holds up under load. Cross-zone load balancing (covered above, repeated because it bites in production): without it, an endpoint ENI only reaches targets in its own AZ. Endpoint connection capacity: a single interface endpoint scales horizontally across AZs, giving roughly tens of thousands of concurrent connections per AZ per endpoint; for very high fan-in the bottleneck is usually NLB target capacity and source-port exhaustion on long-lived connections, not the endpoint itself, so watch ActiveFlowCount and NewFlowCount on the NLB. The idle timeout is the silent killer: NLB TCP flows have a 350-second idle timeout, so long-lived gRPC or database-style connections through PrivateLink need TCP keepalives below that or they reset silently.
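
To make the keepalive point concrete: on Linux the kernel-wide default idle time before the first keepalive probe is commonly 7200 seconds, far above the 350-second timeout, so switching keepalives on without tuning them does not help. The sketch below is a minimal check under that assumption; keepalive_ok is a hypothetical helper name, and 200 seconds is just an example value below the threshold.

```shell
NLB_IDLE_TIMEOUT=350   # seconds, the NLB TCP idle timeout discussed above

# Hypothetical helper: succeeds when the first keepalive probe would be
# sent before the NLB idle timeout expires.
keepalive_ok() {
  [ "$1" -gt 0 ] && [ "$1" -lt "$NLB_IDLE_TIMEOUT" ]
}

# On Linux, the kernel-wide default is exposed under /proc.
if [ -r /proc/sys/net/ipv4/tcp_keepalive_time ]; then
  current=$(cat /proc/sys/net/ipv4/tcp_keepalive_time)
  if keepalive_ok "$current"; then
    echo "keepalive idle ${current}s is below ${NLB_IDLE_TIMEOUT}s"
  else
    echo "keepalive idle ${current}s is too high; for example: sysctl -w net.ipv4.tcp_keepalive_time=200"
  fi
fi
```

The kernel setting only affects sockets that enable SO_KEEPALIVE, and an application can override the idle time per socket (TCP_KEEPIDLE on Linux) or use a protocol-level keepalive such as gRPC's, so treat this as a host-level sanity check rather than proof that a given client is safe.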

The quotas you will actually hit, with the default and how to handle each:

| Quota | Default | Adjustable? | What hitting it looks like | Mitigation |
|---|---|---|---|---|
| Allowed principals per endpoint service | 50 | Yes (Service Quotas) | 51st consumer can’t be added | Raise before the 50th account; prefer org-condition |
| Interface endpoints per VPC | 50 | Yes | New endpoint creation fails | Consolidate, or request an increase |
| Endpoint services per account/region | 20–50 (varies) | Yes | New service creation fails | Raise; or share one service across consumers |
| Connections per endpoint (per AZ) | ~tens of thousands | Effectively scale-bound | New flows fail at extreme fan-in | Add AZs; scale NLB targets |
| NLB targets per target group | 500 (instance/IP) | Yes | Can’t register more targets | Increase, or shard behind multiple NLBs |
| NLB TCP idle timeout | 350 s | No | Long-lived flows reset at ~6 min | Client TCP keepalive < 350 s |
| Source ports per flow tuple | ~64 K per dst | No | Port exhaustion on one hot 5-tuple | Spread targets/ports; reuse connections |
| NLBs per endpoint service | 1 active set (per region/VPC) | n/a | Can’t span VPCs/regions with one service | Publish multiple services or use cross-region |
| Subnets (AZs) per NLB | One per AZ, up to region AZ count | n/a | Can’t publish in an AZ with no subnet | Add a subnet per AZ you want to serve |
| Connection notifications per service | Small fixed cap | No | Extra SNS wiring rejected | Fan out from one SNS topic instead |
| Target group health-check interval | 10–30 s | Yes | Slow detection of dead targets | Tune interval/thresholds for your SLA |

The behaviours and limits that decide resiliency, side by side:

| Behaviour | Default | Why it matters | What to do |
|---|---|---|---|
| Zonal NLB routing | On (cross-zone off) | ENI reaches only same-AZ targets | Enable cross-zone unless you want zonal isolation |
| Single connection per consumer VPC | Always | One endpoint = one provider-side connection | Don’t create duplicate endpoints per AZ |
| AZ-ID vs AZ-name mismatch | Inherent across accounts | us-east-1a differs per account | Reason with AZ IDs (use1-az1) |
| 350 s idle timeout | Fixed | Long-poll/gRPC/DB connections reset | Keepalive below the threshold |
| Cross-zone data charges | Billed to LB owner | Cost can surprise on chatty services | Model inter-AZ GB; keep targets balanced |
| ENI per AZ, not per client | Always | Thousands of clients share one ENI/AZ | Scale targets, not endpoints, for fan-in |
| No source-IP visibility | Default | Targets see the ENI, not the consumer | Use PROXY protocol v2 if the app needs it |
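
Reasoning by AZ ID is easier with the name-to-ID mapping from each account side by side. The sketch below assumes a configured aws CLI with the two profiles used in the lab; az_name_to_id and shared_az_ids are hypothetical helper names, and the file names in the usage lines are placeholders.

```shell
# Hypothetical helper: print "zone-name<TAB>zone-id" for the calling account.
# Extra arguments (for example --profile or --region) are passed through.
az_name_to_id() {
  aws ec2 describe-availability-zones "$@" \
    --query 'AvailabilityZones[].[ZoneName,ZoneId]' --output text
}

# Hypothetical helper: given two mapping files, print the AZ IDs present in both.
shared_az_ids() {
  awk 'NR == FNR { seen[$2] = 1; next } ($2 in seen) { print $2 }' "$1" "$2" | sort
}

# Example:
#   az_name_to_id --profile provider > provider-azs.tsv
#   az_name_to_id --profile consumer > consumer-azs.tsv
#   shared_az_ids provider-azs.tsv consumer-azs.tsv
```

Only the AZ IDs printed by shared_az_ids are zones where a consumer subnet and a provider NLB subnet physically coincide, whatever letter each account happens to assign them.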

Observability and cost

You are billed on two axes for an interface endpoint: an hourly charge per endpoint per AZ, and a per-GB data-processing charge on traffic through it. The per-AZ hourly line is why you do not blindly enable every AZ — each ENI is its own meter. The data-processing charge is on top of any inter-AZ transfer the NLB incurs, and at high fan-in across hundreds of consumer endpoints, the per-endpoint hourly cost dominates and is easy to overlook. The good news: PrivateLink bypasses SNAT entirely for the consumer — traffic to the ENI rides the backbone, so you do not burn the consumer’s NAT-gateway SNAT ports the way a public-internet call would.

For traffic visibility, VPC Flow Logs on the endpoint ENIs show source IPs, ports, and accept/reject actions — invaluable when a consumer swears they cannot connect. The endpoint ENIs have stable interface IDs; filter on them.

# CloudWatch Logs Insights over VPC Flow Logs: rejected traffic to the endpoint ENI
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter interfaceId = "eni-0abc123def456789"
| filter action = "REJECT"
| stats count() as rejects by srcAddr, dstPort
| sort rejects desc

On the provider side, the meaningful signals are NLB CloudWatch metrics: HealthyHostCount/UnHealthyHostCount per target group, ActiveFlowCount, and TCP_Target_Reset_Count. A rise in target resets with healthy hosts usually points at idle-timeout or application-side connection churn rather than the network.

Observability sources and what each is the source of truth for:

| Signal source | Lives where | Source of truth for |
|---|---|---|
| VPC Flow Logs (endpoint ENI) | Consumer account | Client→ENI accept/REJECT, source IP/port |
| ActiveFlowCount / NewFlowCount | Provider NLB metrics | Concurrent flows; fan-in pressure |
| HealthyHostCount / UnHealthyHostCount | Provider target group | Whether targets are in rotation |
| TCP_Target_Reset_Count | Provider NLB metrics | Idle-timeout / app churn resets |
| ProcessedBytes | Provider NLB metrics | Data volume (cost driver) |
| Connection-notification events | Provider SNS topic | Who connected/was rejected and when |
| describe-vpc-endpoint-connections | Provider control plane | Per-consumer connection state |

The cost model — what drives the bill and how to control it (figures approximate, us-east-1, USD; ₹ at ~₹86/USD):

| Cost driver | Rough rate | Who pays | How to control |
|---|---|---|---|
| Interface endpoint, per-AZ hourly | ~$0.01/AZ/hr (~₹0.86) | Consumer | Only enable AZs you actually use |
| Endpoint data processing | ~$0.01/GB (~₹0.86) | Consumer | Tier on volume; consolidate chatty calls |
| NLB hourly + LCU | ~$0.0225/hr + LCU | Provider | Right-size; one NLB per service, not per consumer |
| NLB cross-zone data | inter-AZ $0.01/GB each way | Provider (LB owner) | Balance targets; weigh zonal isolation |
| Flow Logs storage/ingest | CloudWatch/S3 rates | Whoever logs | Sample, or log REJECT-only for cost |
| SNS + Lambda (auto-approve) | Negligible | Provider | Effectively free at onboarding volumes |
| Saved consumer NAT/SNAT | (a credit, not a charge) | Consumer | PrivateLink bypasses the public path entirely |
| Cross-region data (if used) | inter-region transfer rate | Both | Only when supported_regions spans regions |

A worked sizing note: at 120 consumer accounts each enabling two AZs, the per-endpoint hourly line alone is roughly 120 × 2 × $0.01 × 730 hr ≈ $1,750/month (~₹150,000) before a single byte flows — which is exactly why the per-AZ count is a deliberate decision, not a default.
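
The same arithmetic as a reusable sketch, so the AZ count can be varied before committing to it. The helper name endpoint_hours_cost is hypothetical; the rate, the 730-hour month, and the ₹86/USD conversion are the approximate figures used in this section, not quoted prices.

```shell
# Hypothetical helper: consumers x AZs x hourly rate x hours per month.
endpoint_hours_cost() {
  awk -v c="$1" -v az="$2" -v rate="$3" -v hrs="$4" \
    'BEGIN { printf "%.2f\n", c * az * rate * hrs }'
}

usd=$(endpoint_hours_cost 120 2 0.01 730)
inr=$(awk -v usd="$usd" -v fx=86 'BEGIN { printf "%.0f\n", usd * fx }')
echo "USD/month: $usd   INR/month: $inr"
# prints: USD/month: 1752.00   INR/month: 150672
```

Re-running with three AZs per consumer (endpoint_hours_cost 120 3 0.01 730) gives 2628.00, which is the gap the two-AZ decision closes.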

Architecture at a glance

Read the diagram left to right; it traces a single request through the system and pins the five failure classes onto the exact hop where each bites. A client in the consumer VPC resolves the friendly name payments.internal.example.com, which — thanks to the managed split-horizon private hosted zone — returns a local interface-endpoint ENI address rather than anything public. The client opens a TCP connection to that ENI on :8443; its egress SG must allow it out, and the endpoint ENI’s own security group must allow it in (the NLB has no SG, so this is the only L4 filter on the path — badge 1). From the ENI the flow crosses the AWS backbone in one direction only, never touching the internet, and arrives at the provider’s endpoint service. There, two gates decide whether the connection ever existed: the allowed-principals allow-list and the per-connection acceptance step (badge 2), while domain-ownership verification via a public TXT record is what let private DNS resolve in the first place (badge 3).

Past the gates, the service forwards to an internal NLB that load-balances across registered targets (IPs, instances, or an ALB) with TCP health checks. The NLB’s published AZs and its cross-zone setting decide whether an ENI in one AZ can reach targets in another — get this wrong and the service “works in one AZ, dies in another” (badge 4) — and its fixed 350-second idle timeout silently resets long-lived flows without keepalives (badge 5). The right-hand observe zone closes the loop: VPC Flow Logs on the ENI catch REJECTs, NLB metrics catch flow pressure and target resets, and the SNS connection-notification feed can auto-approve new consumers against a registry. The whole method during an incident is to find the badge that matches your symptom, read its legend line, run the named confirm command, and apply the fix.

Figure: AWS PrivateLink cross-account architecture, traced left to right from the consumer client through the interface-endpoint ENI and the provider's endpoint service to the internal NLB and its targets, with an observe zone and five numbered failure badges (endpoint-ENI security group, pendingAcceptance gating, private-DNS verification, single-AZ cross-zone gaps, 350-second idle-timeout resets).

Real-world scenario

Vantage Pay, a fictional payments platform team of five engineers, ran a fraud-scoring API that ~120 internal accounts needed to call, plus three external partner accounts under contract. Their first instinct was a Transit Gateway attachment per consumer. It died on contact with reality: two recently-acquired business units had VPCs on 10.20.0.0/16 — the same block the fraud service’s VPC used — and renumbering a live, PCI-scoped production VPC was a non-starter. Worse, security flagged that a TGW attachment would grant those partner accounts a route into the platform network, far more reach than “call one API” warranted. The architecture review bounced the TGW plan in a single meeting.

They rebuilt it as a PrivateLink endpoint service. The fraud API went behind an internal NLB across three AZs with cross-zone enabled; the endpoint service ran with acceptance_required = true and an allow-list keyed to specific consumer role ARNs, not account roots, so a partner could only connect from their designated workload role. A connection-notification SNS topic fed a small Lambda that auto-approved any principal present in the platform’s account-registry DynamoDB table and left everything else pending for a human — onboarding dropped from a ticket-and-a-meeting to minutes. The two overlapping 10.20.0.0/16 business units connected with zero renumbering, because PrivateLink never exposes the provider’s address space, and the team published fraud.internal.payments.example.com as a verified private DNS name so every consumer used one stable hostname regardless of account.

# The control that made the partner case acceptable to security:
# allow only the partner's specific workload role, never the account root.
resource "aws_vpc_endpoint_service_allowed_principal" "partner_a" {
  vpc_endpoint_service_id = aws_vpc_endpoint_service.fraud.id
  principal_arn           = "arn:aws:iam::333333333333:role/fraud-client-prod"
}
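
The registry-driven auto-approval described above can be sketched in shell. This is a stand-in for the team's Lambda, not its actual code: the registry here is a plain file of account IDs (standing in for the DynamoDB table), matching is on the endpoint owner's account ID, and the helper names approved_endpoints and auto_approve are hypothetical. The aws subcommands and filters are standard EC2 CLI calls.

```shell
# Hypothetical helper: read "endpoint-id owner-account" lines on stdin and print
# the endpoint IDs whose owner appears in the registry file (one account ID per
# line). An empty registry approves nothing.
approved_endpoints() {
  awk 'NR == FNR { ok[$1] = 1; next } ($2 in ok) { print $1 }' "$1" -
}

# Hypothetical helper: accept every pending connection from a registered account
# and leave the rest in pendingAcceptance for a human.
auto_approve() {
  aa_service="$1"
  aa_registry="$2"
  aws ec2 describe-vpc-endpoint-connections \
    --filters "Name=service-id,Values=$aa_service" \
              "Name=vpc-endpoint-state,Values=pendingAcceptance" \
    --query 'VpcEndpointConnections[].[VpcEndpointId,VpcEndpointOwner]' \
    --output text |
  approved_endpoints "$aa_registry" |
  while read -r aa_endpoint; do
    aws ec2 accept-vpc-endpoint-connections \
      --service-id "$aa_service" --vpc-endpoint-ids "$aa_endpoint"
  done
}

# Example (placeholders):
#   auto_approve vpce-svc-0123456789abcdef0 registry-accounts.txt
```

The design choice worth keeping from the scenario is that unknown principals are never rejected automatically; they simply stay pending until someone looks at them.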

Two incidents in the first month taught the team the failure map. The first: a newly-onboarded BU reported “endpoint is available but every call hangs.” Flow Logs on the endpoint ENI showed REJECT from the client subnet on :8443 — the BU had created the endpoint with the default security group, which did not allow the service port inbound. Five-minute fix. The second: a partner’s gRPC client saw connections silently dropping “every few minutes.” That was the 350-second NLB idle timeout against a long-lived streaming connection with no keepalive; the partner set a 200-second TCP keepalive and it vanished. Both were textbook, both were one hop, and both went straight into the runbook.

The one real cost they had to plan for was the per-endpoint, per-AZ hourly charge multiplied across 120+ consumers — a line item that genuinely showed up on the bill at roughly ₹150,000/month before data. They accepted it as the price of dropping TGW attachment management and CIDR coordination entirely, and trimmed it by getting each consumer to enable only the two AZs they actually ran in rather than all three. Net: the consumer count became a self-service onboarding problem instead of a networking project, which is exactly the trade PrivateLink is built to make.

Advantages and disadvantages

The publish-one-service model both enables clean cross-account sharing and imposes real constraints. Weigh it honestly:

| Advantages (why this model helps you) | Disadvantages (why it constrains you) |
|---|---|
| Overlapping CIDRs are irrelevant: the consumer talks to ENIs in its own subnets | One service per endpoint; not a general routing solution |
| Grants reach to one load balancer, not your network, so blast radius is minimal | One-directional (consumer→provider) only |
| Scales to hundreds of consumers as self-service onboarding, not a mesh | Requires an internal NLB/GWLB in front of the workload |
| Never traverses the public internet; bypasses consumer SNAT entirely | Per-endpoint, per-AZ hourly cost compounds at high fan-in |
| Two stacked gates (principals + acceptance) give fine-grained access control | Provider must prove domain ownership before friendly DNS works |
| Private DNS gives a stable hostname regardless of consumer account | AZ names differ across accounts, so you must reason by AZ ID |
| Connection notifications enable automated, audited onboarding | NLB 350 s idle timeout resets long-lived flows without keepalives |
| Works for SaaS vendors offering private connectivity to customers | Source client IP is hidden unless you carry PROXY protocol v2 |

The model is right whenever you are publishing a service to many accounts — a shared internal API, a logging/telemetry sink, or a SaaS product offered privately to customer VPCs — and it shines exactly where peering and TGW buckle: large fan-in and overlapping address space. It is the wrong tool when you need bidirectional, everything-talks-to-everything routing (use TGW), when both VPCs are yours and mutually trusted at small N (peering is cheaper and simpler), or when you need L7 routing and per-request authorization across services (VPC Lattice). The disadvantages are all manageable — the NLB requirement, the cost, the DNS dance, the idle timeout — but only if you know they exist going in, which is the point of this article.

Hands-on lab

Stand up a complete PrivateLink path across two accounts (or two VPCs you control), prove it end to end, then tear it down. Free-tier-friendly where possible; the NLB and endpoint hours cost a few cents — delete at the end. Run in CloudShell or with two named CLI profiles (--profile provider, --profile consumer).

Step 1 — Provider: an internal NLB with a simple target. Front a tiny target (an EC2 instance running a health endpoint on :8443, or an IP target) with an internal NLB.

PROFILE_P="--profile provider"
aws elbv2 create-load-balancer $PROFILE_P --name pl-lab-nlb --type network \
  --scheme internal --subnets subnet-pa subnet-pb
# capture the ARN
NLB=$(aws elbv2 describe-load-balancers $PROFILE_P --names pl-lab-nlb \
  --query 'LoadBalancers[0].LoadBalancerArn' --output text)
aws elbv2 modify-load-balancer-attributes $PROFILE_P --load-balancer-arn $NLB \
  --attributes Key=load_balancing.cross_zone.enabled,Value=true

Expected: an ARN returned; cross-zone attribute set to true.

Step 2 — Provider: target group, target, listener.

TG=$(aws elbv2 create-target-group $PROFILE_P --name pl-lab-tg --protocol TCP \
  --port 8443 --vpc-id vpc-prov --target-type ip \
  --query 'TargetGroups[0].TargetGroupArn' --output text)
aws elbv2 register-targets $PROFILE_P --target-group-arn $TG --targets Id=10.10.1.20
aws elbv2 create-listener $PROFILE_P --load-balancer-arn $NLB --protocol TCP --port 8443 \
  --default-actions Type=forward,TargetGroupArn=$TG

Expected: target registers; after a minute describe-target-health shows healthy.

Step 3 — Provider: publish the endpoint service.

SVC=$(aws ec2 create-vpc-endpoint-service-configuration $PROFILE_P \
  --network-load-balancer-arns $NLB --acceptance-required \
  --query 'ServiceConfiguration.ServiceId' --output text)
SVC_NAME=$(aws ec2 describe-vpc-endpoint-service-configurations $PROFILE_P \
  --service-ids $SVC --query 'ServiceConfigurations[0].ServiceName' --output text)
echo "Share this: $SVC_NAME"

Expected: a vpce-svc-… id and a com.amazonaws.vpce.<region>.vpce-svc-… name.

Step 4 — Provider: allow the consumer principal.

aws ec2 modify-vpc-endpoint-service-permissions $PROFILE_P --service-id $SVC \
  --add-allowed-principals arn:aws:iam::222222222222:role/lab-consumer-role

Expected: command succeeds; the consumer can now see the service.

Step 5 — Consumer: create the interface endpoint.

PROFILE_C="--profile consumer"
EP=$(aws ec2 create-vpc-endpoint $PROFILE_C --vpc-id vpc-cons \
  --vpc-endpoint-type Interface --service-name "$SVC_NAME" \
  --subnet-ids subnet-ca subnet-cb --security-group-ids sg-lab-endpoint \
  --no-private-dns-enabled --query 'VpcEndpoint.VpcEndpointId' --output text)

Expected: an endpoint id; state begins as pendingAcceptance.

Step 6 — Provider: accept the connection.

aws ec2 accept-vpc-endpoint-connections $PROFILE_P --service-id $SVC \
  --vpc-endpoint-ids $EP

Expected: within a minute the consumer’s endpoint flips to available.

Step 7 — Consumer: confirm and test.

aws ec2 describe-vpc-endpoints $PROFILE_C --vpc-endpoint-ids $EP \
  --query 'VpcEndpoints[0].[State,DnsEntries[0].DnsName]'
# From a consumer instance whose SG can reach sg-lab-endpoint on :8443
# (add -k if the lab target's certificate does not cover this hostname):
curl -sS -o /dev/null -w '%{http_code}\n' \
  https://<regional-endpoint-dns>:8443/healthz

Expected: available, a regional DNS name, and a 200 from the health endpoint.

The lab steps and their expected output, at a glance:

| Step | Command family | Expected output |
|---|---|---|
| 1 | create-load-balancer + cross-zone | NLB ARN; cross-zone true |
| 2 | create-target-group / register-targets | Target healthy after ~1 min |
| 3 | create-vpc-endpoint-service-configuration | vpce-svc-… id + service name |
| 4 | modify-vpc-endpoint-service-permissions | Consumer principal allowed |
| 5 | create-vpc-endpoint (Interface) | Endpoint id; pendingAcceptance |
| 6 | accept-vpc-endpoint-connections | Endpoint → available |
| 7 | describe-vpc-endpoints + curl | available; 200 from /healthz |

Step 8 — Teardown (run both, in order).

aws ec2 delete-vpc-endpoints $PROFILE_C --vpc-endpoint-ids $EP
aws ec2 delete-vpc-endpoint-service-configurations $PROFILE_P --service-ids $SVC
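LISTENER=$(aws elbv2 describe-listeners $PROFILE_P --load-balancer-arn $NLB \
  --query 'Listeners[0].ListenerArn' --output text)
aws elbv2 delete-listener $PROFILE_P --listener-arn $LISTENER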
aws elbv2 delete-target-group $PROFILE_P --target-group-arn $TG
aws elbv2 delete-load-balancer $PROFILE_P --load-balancer-arn $NLB

Delete the consumer endpoint before the provider service, or the service delete is blocked by an active connection.

Common mistakes & troubleshooting

When it does not work, walk these in order — most failures are one of the first three, and almost all localise to a single hop. The playbook: match the symptom, read the root cause, run the confirm command, apply the fix.

| # | Symptom | Root cause | Confirm (exact command / path) | Fix |
|---|---|---|---|---|
| 1 | Endpoint stuck in pendingAcceptance | Provider hasn’t accepted; or principal not allow-listed | describe-vpc-endpoint-connections … VpcEndpointState | Add the role ARN to allowed principals and accept-vpc-endpoint-connections |
| 2 | available but connections hang/refuse | Endpoint ENI’s SG blocks the service port | VPC Flow Logs on ENI show REJECT; check describe-security-groups | Allow inbound :8443 from client CIDR/SG on the endpoint SG |
| 3 | DNS returns the long vpce-…amazonaws.com name | Private DNS not effective (verify state / flag / VPC DNS) | describe-vpc-endpoints …private_dns_enabled; …PrivateDnsNameConfiguration.State | Verify domain, set private_dns_enabled=true, enable VPC DNS flags |
| 4 | Works in one AZ, fails in another | Provider didn’t publish that AZ, or cross-zone off | Compare provider AvailabilityZones vs consumer subnet AZ IDs | Publish all AZs + enable cross-zone load balancing |
| 5 | Healthy endpoint, but no targets answer | Target group unhealthy (wrong port/host/probe) | describe-target-health shows unhealthy | Fix health-check port/protocol; ensure app listens on the target port |
| 6 | Intermittent resets on long-lived flows | NLB 350 s TCP idle timeout | TCP_Target_Reset_Count rising with healthy hosts | Add TCP keepalives below 350 s on the client |
| 7 | Consumer can’t even create the endpoint | Principal missing from allowed list | Provider describe-vpc-endpoint-service-permissions | Add the consumer principal (role ARN preferred) |
| 8 | Endpoint state rejected | Provider rejected the connection (or auto-logic did) | describe-vpc-endpoint-connections … rejected | Re-request after the provider allow-lists/accepts |
| 9 | Connection works, but app sees wrong source IP | Source is the ENI, not the real client | Compare app logs vs client IP | Enable PROXY protocol v2 on the target group if you need real client IP |
| 10 | DNS verification stuck at pendingVerification | TXT record wrong name/value or not propagated | dig TXT _<token>.<name>; check the public zone | Fix the TXT name/value; start-…-private-dns-verification again |
| 11 | “Asymmetric routing”-style weirdness | Almost always an ALB-as-target with mismatched health checks | Inspect the NLB→ALB target chain | Simplify the target chain; align health checks; re-test |
| 12 | Hit a hard ceiling adding the 51st consumer | allowed_principals per service quota (default 50) | Service Quotas console | Request an increase; or use an aws:PrincipalOrgID condition |
| 13 | High fan-in: new connections fail under load | NLB target capacity / source-port exhaustion | ActiveFlowCount, NewFlowCount near ceiling | Add AZs/targets; reuse connections; shard behind multiple NLBs |
| 14 | terraform destroy fails on the service | Active consumer connection still attached | describe-vpc-endpoint-connections non-empty | Delete consumer endpoints first, then the service |
| 15 | Endpoint created but no DNS entry at all | VPC DNS support disabled on the consumer VPC | describe-vpc-attribute --attribute enableDnsSupport | Enable enableDnsSupport (and enableDnsHostnames) |
| 16 | Connection drops when provider redeploys NLB | NLB replaced, not updated in place | Provider change record; new NLB ARN | Update the service’s NLB ARN; avoid replacing the NLB |
| 17 | Partner connects from an unexpected role | Allow-list scoped to account root, not a role | describe-vpc-endpoint-service-permissions | Tighten to the specific workload role ARN |
| 18 | curl works by IP but not by name | Client using stale/cached regional name | dig +short <name> vs the ENI IPs | Flush resolver cache; confirm private DNS effective |

The fastest first-cut decision table — symptom to most-likely cause to first move:

| If you see… | It’s probably… | Do this first |
|---|---|---|
| pendingAcceptance forever | Acceptance/allow-list gate | Check allow-list, then accept |
| Hang/refuse on an available endpoint | Endpoint ENI SG | Flow Logs → fix inbound SG rule |
| Long regional DNS name | Private DNS not effective | Check verify state + VPC DNS flags |
| Zone-specific failures | AZ mismatch / cross-zone off | Compare AZ IDs; enable cross-zone |
| Resets every few minutes | 350 s idle timeout | Add TCP keepalives < 350 s |
| Can’t add another consumer | Allowed-principals quota | Raise quota or use org condition |

The error/limit reference — the strings and states you’ll meet, what they mean, and the fix:

| Error / state | Where it appears | Meaning | Fix |
|---|---|---|---|
| pendingAcceptance | Endpoint / connection state | Awaiting provider acceptance | Accept the connection / check allow-list |
| rejected | Endpoint state | Provider rejected the connection | Get allow-listed; re-request |
| failed | Endpoint state | Provisioning failed (AZ/subnet) | Fix AZ overlap and subnet capacity |
| pendingVerification | DNS config state | TXT not yet verified | Publish TXT; start verification |
| InvalidServiceName | CLI error | Service name typo / wrong region | Copy the exact com.amazonaws.vpce.… string |
| You are not authorized… | CLI error (consumer) | Principal not on allow-list | Provider adds the principal |
| REJECT (Flow Logs) | ENI flow log | SG/NACL blocked the packet | Allow the service port on the endpoint SG |
| VpcEndpointLimitExceeded | CLI error | Endpoints-per-VPC quota hit | Raise quota / consolidate endpoints |
| cross_zone data charge | Cost Explorer line | Inter-AZ NLB traffic | Balance targets; weigh zonal isolation |
| TCP_Target_Reset_Count | NLB metric | Idle-timeout/app resets | TCP keepalives; fix app connection churn |
| DnsNamesInUse / verify failed | DNS config state | TXT mismatch or stale token | Re-fetch token; fix TXT; re-verify |
| ExceededAllowedPrincipalsLimit | CLI error | 50-principal cap reached | Raise quota; use aws:PrincipalOrgID condition |
| Endpoint available, target unhealthy | Mixed signals | App not on the target port/health path | Fix target port + health check; re-register |
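
The endpoint-state rows of these tables lend themselves to a tiny triage helper for the first minute of an incident. This is a sketch only: first_move is a hypothetical function name, and its messages simply restate the fixes from the tables above.

```shell
# Hypothetical helper: map an interface-endpoint state to the first move
# suggested by the tables above.
first_move() {
  case "$1" in
    pendingAcceptance) echo "Check the allow-list, then accept the connection" ;;
    rejected)          echo "Get allow-listed, then re-request the connection" ;;
    failed)            echo "Fix AZ overlap and subnet capacity" ;;
    available)         echo "Check the endpoint ENI security group and Flow Logs" ;;
    *)                 echo "Unrecognised state: $1" ;;
  esac
}

# Example, feeding it the live state (endpoint ID is a placeholder):
#   first_move "$(aws ec2 describe-vpc-endpoints --vpc-endpoint-ids vpce-0a1b2c3d4e5f6a7b8 \
#     --query 'VpcEndpoints[0].State' --output text)"
first_move pendingAcceptance
```

An available state does not mean healthy; it only moves the search from the acceptance gate to the security group, DNS, and target health.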

Best practices

Security notes

PrivateLink’s security story is strong precisely because it grants the minimum reach: one load balancer, one direction, no route into your network. Lean into that.

| Control | What it protects | How to apply |
|---|---|---|
| Allowed principals (role ARN) | Who may create an endpoint | Scope to the consumer’s workload role; avoid root/* |
| aws:PrincipalOrgID condition | Org-internal access at scale | Allow any account in your org without enumerating |
| Per-connection acceptance | Onboarding gate | Keep true; automate via SNS→Lambda registry check |
| Endpoint ENI security group | L4 ingress to the ENI | Least-privilege: service port, client SG/CIDR only |
| No route into the provider VPC | Lateral movement | Inherent: PrivateLink grants no transitive reach |
| Domain-ownership verification | DNS hijack prevention | Provider proves the domain before private DNS |
| NLB TLS termination + ACM | Data-in-transit | Use a TLS listener with an ACM cert if terminating |
| App-layer authN/Z | Real identity | Don’t trust source IP across PrivateLink; authenticate in the app |
| VPC Flow Logs (REJECT) | Detection/audit | Alert on REJECT spikes; record who connected |
| PROXY protocol v2 | Real client IP (when needed) | Enable on the target group; parse in the app |
| SCP / RCP guardrails | Org-wide policy on endpoints | Restrict who can create endpoint services/principals |
| CloudTrail on EC2 endpoint APIs | Change audit | Alert on ModifyVpcEndpointServicePermissions changes |
| Connection-notification audit | Onboarding trail | Persist Connect/Accept/Reject events for review |

Three security-specific gotchas worth internalising. First, the source IP your targets see is the endpoint ENI, not the consumer’s client — never build authorization on observed source IP across a PrivateLink boundary; use the principal gate and application auth. Second, traffic is private but not automatically encrypted — PrivateLink keeps the flow off the internet, but if you need confidentiality on the wire, terminate TLS at the NLB (ACM cert) or run TLS end-to-end through it. Third, the allow-list is your perimeter — a stray * with acceptance_required = false is the one mistake that turns a private service into an open one; review it in code and alert on changes.
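
The third gotcha is easy to audit mechanically. The sketch below assumes a configured aws CLI; flag_broad_principals and audit_service_principals are hypothetical helper names, and treating account-root ARNs as "broad" alongside the wildcard is a policy assumption taken from the role-not-root advice in this article.

```shell
# Hypothetical helper: read one principal per line on stdin, print any that are
# a wildcard or an account root, and exit non-zero if at least one was found.
flag_broad_principals() {
  awk '$0 == "*" || $0 ~ /:root$/ { print "BROAD: " $0; found = 1 }
       END { exit (found ? 1 : 0) }'
}

# Hypothetical helper: list the allow-list of one endpoint service and audit it.
audit_service_principals() {
  aws ec2 describe-vpc-endpoint-service-permissions --service-id "$1" \
    --query 'AllowedPrincipals[].Principal' --output text |
  tr '\t' '\n' | flag_broad_principals
}

# Example (service ID is a placeholder):
#   audit_service_principals vpce-svc-0123456789abcdef0 || echo "review the allow-list"
```

Run on a schedule or in CI against the Terraform-managed services, a non-zero exit is a cheap tripwire for the one mistake that opens a private service.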

Cost & sizing

What drives the bill: the per-endpoint, per-AZ hourly charge (consumer side) and per-GB data processing (consumer side), plus the NLB hourly + LCU and any inter-AZ cross-zone data (provider side). At high fan-in, the per-AZ hourly line dominates — it accrues whether or not data flows — which is why the number of enabled AZs is the most important cost lever you have. PrivateLink also saves money on the consumer side by bypassing NAT-gateway SNAT and data-processing for the call (the traffic never goes to the internet). Rough figures, us-east-1, USD, with ₹ at ~₹86/USD:

| Item | Rough rate | Side | Sizing lever |
|---|---|---|---|
| Interface endpoint, per AZ | ~$0.01/hr (~₹0.86) → ~$7.30/AZ/mo | Consumer | Enable only AZs you use |
| Endpoint data processing | ~$0.01/GB (~₹0.86) | Consumer | Consolidate chatty calls; tier on volume |
| NLB | ~$0.0225/hr + LCU | Provider | One NLB per service, not per consumer |
| NLB cross-zone inter-AZ | ~$0.01/GB each direction | Provider | Balance targets; consider zonal isolation |
| Flow Logs | CloudWatch/S3 rates | Either | REJECT-only or sampled to cut cost |
| SNS + Lambda | Effectively free at onboarding scale | Provider | n/a |

The sizing rules of thumb, as a decision table:

| If you have… | Then size for… | Because… |
|---|---|---|
| Many consumers, low per-consumer traffic | Minimising AZ count per endpoint | Hourly meter dominates over data |
| Few consumers, high traffic | Minimising data processing + NLB LCUs | Data + LCU dominate over hourly |
| Long-lived connections | Keepalives + NLB headroom | Avoid idle resets and flow churn |
| Bursty fan-in | NLB targets + AZ spread | Source-port and target capacity are the ceiling |
| Cross-region consumers | Cross-region data + supported regions | Adds inter-region transfer cost |

There is no “free tier” for PrivateLink endpoints, but the lab above runs in cents because the endpoint and NLB exist for minutes. For a 120-consumer internal platform at two AZs each, budget on the order of ₹150,000/month for endpoint hours before data — the line item that surprises teams who modelled only data transfer.

Interview & exam questions

Q1. When would you choose PrivateLink over VPC peering or Transit Gateway? When you are publishing a single service to one or many consumers, especially across accounts with overlapping CIDRs, and you want to grant reach to one load balancer rather than a route into your network. Peering/TGW are bidirectional L3 routing that require non-overlapping CIDRs; PrivateLink is unidirectional, one-service, and CIDR-agnostic. (SAP-C02, ANS-C01.)

Q2. Why are overlapping CIDRs irrelevant to PrivateLink? Because the consumer’s workloads talk only to an interface-endpoint ENI carved from the consumer’s own subnet; AWS carries the flow across the backbone to the provider’s NLB without ever exposing the provider’s address space. There is no IP-level route between the VPCs to collide. (ANS-C01.)

Q3. What two controls gate who can connect to an endpoint service, and how do they interact? Allowed principals (who may create an endpoint) and per-connection acceptance (acceptance_required). They stack: without an allow-list entry the consumer can’t connect even with acceptance_required = false; with the entry and true, the connection waits in pendingAcceptance until accepted. (SAP-C02.)

Q4. A consumer’s endpoint is available but every call hangs. Where do you look first? The endpoint ENI’s security group — it’s the only L4 filter on the path because the NLB has no SG. Confirm with VPC Flow Logs on the ENI (look for REJECT) and fix the inbound rule to allow the service port from the client. (ANS-C01.)

Q5. Why does private DNS require domain-ownership verification, and where is the TXT record published? To stop a malicious provider from publishing a service that hijacks a domain like *.example.com. The provider publishes the verification TXT record in the public hosted zone for the domain, even though the service itself is private. (SAP-C02.)

Q6. What is split-horizon DNS in this context? AWS creates a managed private hosted zone inside the consumer VPC that resolves the friendly name to the local endpoint ENIs, while the public name resolves to nothing useful externally — so the same hostname behaves differently inside the consumer VPC than on the public internet. (ANS-C01.)

Q7. Why must the NLB be internal, and what do its subnets determine? PrivateLink targets the NLB’s private addresses, so an internet-facing scheme can’t back an endpoint service. The subnets you attach define which AZs the service is available in — a consumer can only create an endpoint in an AZ where the provider has a presence. (ANS-C01.)

Q8. A long-lived gRPC connection through PrivateLink resets every few minutes. Cause and fix? The NLB’s 350-second TCP idle timeout. Fix it by enabling TCP keepalives below that threshold on the client; nothing about PrivateLink itself changes the timeout. (SOA-C02.)

Q9. Why reason about AZs by ID rather than name across accounts? Because us-east-1a may map to a different physical Availability Zone in the consumer account than in the provider account. AZ IDs (use1-az1) are stable across accounts, so aligning endpoint and NLB AZs by ID is correct; by name can silently strand an ENI in an unserved zone. (ANS-C01.)
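One way to see the name-to-ID mapping is to list zone names next to zone IDs in each account and compare on the ID column. A minimal sketch, assuming the AWS CLI is installed and configured; the function name and the profile names in the usage comment are hypothetical:

```shell
#!/usr/bin/env bash
# Print "ZoneName <tab> ZoneId" for the caller's account in one region.
az_map() {
  local region="${1:-us-east-1}"
  aws ec2 describe-availability-zones \
    --region "$region" \
    --query 'AvailabilityZones[].[ZoneName,ZoneId]' \
    --output text
}

# Usage (profile names are placeholders):
#   AWS_PROFILE=provider az_map us-east-1
#   AWS_PROFILE=consumer az_map us-east-1
```

If the two outputs pair the same zone name with different IDs, pick subnets by ID on both sides rather than by name.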

Q10. What are the two billing axes for an interface endpoint, and which dominates at high fan-in? A per-endpoint, per-AZ hourly charge and a per-GB data-processing charge (both consumer-side). At high fan-in across many consumers, the per-AZ hourly charge dominates because it accrues regardless of traffic. (SAP-C02.)

Q11. How do you automate consumer onboarding at scale? Set acceptance_required = true, wire a connection notification to SNS, and have a Lambda auto-approve principals present in an authoritative registry (e.g. a DynamoDB table) while leaving unknowns pending for a human. This turns onboarding from a ticket into minutes while keeping an audited gate. (SAP-C02.)

Q12. Your targets behind PrivateLink need the real client IP for logging. How? Enable PROXY protocol v2 on the NLB target group so the client’s address is prepended to the TCP stream, and make sure the targets parse that header. Otherwise the targets see the NLB’s private addresses as the source, not the consumer’s client. (ANS-C01.)
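Turning PROXY protocol v2 on is a single target-group attribute change. A minimal sketch, assuming the AWS CLI and a target group ARN that you supply; the function name is illustrative, and targets that do not parse the PROXY header will likely fail to read requests once it is enabled:

```shell
#!/usr/bin/env bash
# Enable PROXY protocol v2 on one NLB target group.
# $1 is the target group ARN (placeholder; supply your own).
enable_proxy_v2() {
  local target_group_arn="$1"
  aws elbv2 modify-target-group-attributes \
    --target-group-arn "$target_group_arn" \
    --attributes Key=proxy_protocol_v2.enabled,Value=true
}

# Usage:
#   enable_proxy_v2 "<target-group-arn>"
```

Roll this out together with the listener change on the targets (for example an NGINX or HAProxy setting that accepts the PROXY header), not before it.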

Quick check

  1. You need to expose one internal API to 150 accounts, several with overlapping 10.0.0.0/16 CIDRs. Which connectivity primitive, and why?
  2. A consumer reports their endpoint is stuck in pendingAcceptance. What two things could be wrong, and what command confirms each?
  3. dig payments.internal.example.com from a consumer instance returns a long vpce-…amazonaws.com name. List three things to check.
  4. Why does the endpoint ENI’s security group matter when the NLB has none?
  5. A streaming connection through PrivateLink drops every few minutes with healthy targets. What’s the cause and the one-line fix?

Answers

  1. PrivateLink endpoint service. It publishes one service (not a network), is CIDR-agnostic because the consumer talks to ENIs in its own subnets, and scales to many accounts as self-service. Peering/TGW would demand non-overlapping CIDRs and grant a route into your network.
  2. Either the consumer’s principal is not on the allowed-principals list, or the provider hasn’t accepted the connection. Confirm the connection state with aws ec2 describe-vpc-endpoint-connections and the permissions with aws ec2 describe-vpc-endpoint-service-permissions; fix by adding the principal, running aws ec2 accept-vpc-endpoint-connections, or both.
  3. (a) private_dns_enabled = true on the consumer endpoint; (b) the service’s PrivateDnsNameConfiguration.State is verified; (c) the consumer VPC has enableDnsHostnames and enableDnsSupport both on. Any one missing falls back to the long name.
  4. Because the NLB has no security group, the endpoint ENI’s SG is the only L4 filter on the path. It must allow inbound on the service port from the client CIDRs/SGs, or connections to an otherwise-available endpoint hang or are refused.
  5. The NLB 350-second TCP idle timeout. Fix: set TCP keepalives below 350 seconds on the client.
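The checks in answers 2 and 3 can be scripted. A minimal sketch, assuming the AWS CLI; the function names and the IDs you pass in are placeholders, the first and last functions need provider credentials, and the middle one needs consumer credentials:

```shell
#!/usr/bin/env bash
# Provider side: allow-list entries, then the state of each connection.
check_pending() {
  local service_id="$1"
  aws ec2 describe-vpc-endpoint-service-permissions \
    --service-id "$service_id" \
    --query 'AllowedPrincipals[].Principal' --output text
  aws ec2 describe-vpc-endpoint-connections \
    --filters "Name=service-id,Values=$service_id" \
    --query 'VpcEndpointConnections[].[VpcEndpointId,VpcEndpointOwner,VpcEndpointState]' \
    --output text
}

# Consumer side: private DNS flag on the endpoint and both VPC DNS attributes.
check_consumer_dns() {
  local endpoint_id="$1" vpc_id="$2"
  aws ec2 describe-vpc-endpoints --vpc-endpoint-ids "$endpoint_id" \
    --query 'VpcEndpoints[0].PrivateDnsEnabled' --output text
  aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" \
    --attribute enableDnsHostnames \
    --query 'EnableDnsHostnames.Value' --output text
  aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" \
    --attribute enableDnsSupport \
    --query 'EnableDnsSupport.Value' --output text
}

# Provider side: domain-verification state of the private DNS name.
check_dns_verification() {
  local service_id="$1"
  aws ec2 describe-vpc-endpoint-service-configurations \
    --service-ids "$service_id" \
    --query 'ServiceConfigurations[0].PrivateDnsNameConfiguration.State' \
    --output text
}
```

A healthy private DNS setup should print three true values from check_consumer_dns and verified from check_dns_verification; anything else points at the item to fix.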

Glossary

Next steps
