
Deterministic Outbound with Azure NAT Gateway: Fixing SNAT Port Exhaustion

The first sign is always the same: a batch job, a webhook fan-out, or a connection-pool-happy microservice throws intermittent connection timeouts to an external API, and nobody can reproduce it locally. The metric that explains it is SNATConnectionCount flatlining at a ceiling — you have run out of SNAT ports. Azure NAT Gateway is the correct fix, and it gives you a stable, allow-listable set of egress IPs as a bonus. This is how to deploy it properly and migrate off the default outbound path without an outage.

1. Three outbound paths, and why only one scales

Every VM or pod that talks to the public internet needs source NAT (SNAT): its private IP and ephemeral port get rewritten to a public IP and port. Azure gives you three ways to provide that public IP, and they are not equivalent.

| Outbound method | SNAT ports | Egress IP | Recommended? |
| --- | --- | --- | --- |
| Default outbound access | Implicit, small, shared per host | Unpredictable, Microsoft-owned, can change | No (being retired Sept 2025 for new VNets) |
| Load Balancer outbound rules | Manual, 64K per frontend IP, divided across backends | The LB’s public IP(s) | Only if you already need the LB |
| NAT Gateway | Up to ~64,512 per public IP, allocated on demand | Your public IP / prefix, stable | Yes (the default choice) |

The mechanics that matter:

For any subnet making meaningful outbound calls, use NAT Gateway. The rest of this article assumes that decision.

2. Understanding SNAT port allocation

A SNAT “port” is one entry in the translation table, and the table is keyed on the full 5-tuple: source IP, source port, destination IP, destination port, protocol. This is the detail that trips people up.

You are not limited to 64K total connections. You are limited to roughly 64K connections per public IP to the same destination IP and port. A connection to 20.1.2.3:443 and another to 20.1.2.4:443 have different 5-tuples, so they can share the same SNAT port; two flows to the same destination each need a port of their own.

Exhaustion almost always means many connections to one destination — a single payment gateway, one storage public endpoint, one upstream API behind a single VIP. Spreading the same load across many destinations rarely exhausts ports.
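
To make the 5-tuple point concrete, here is a toy model in Python. It is not how Azure implements NAT Gateway: the class name, the shrunken port range, and the lowest-free-port strategy are all illustrative assumptions. It only shows that one public IP's ports are consumed per destination rather than globally.

```python
from collections import defaultdict


class ToySnatTable:
    """Toy SNAT allocator for ONE public IP (illustrative, not Azure's implementation)."""

    def __init__(self, ports_per_ip: int = 64_512):
        self.ports_per_ip = ports_per_ip
        # (dest_ip, dest_port, protocol) -> set of SNAT port indexes in use toward it
        self.in_use = defaultdict(set)

    def allocate(self, dest_ip: str, dest_port: int, protocol: str = "tcp") -> int:
        """Return a SNAT port index for a new flow, or raise if that destination is exhausted."""
        used = self.in_use[(dest_ip, dest_port, protocol)]
        if len(used) >= self.ports_per_ip:
            raise RuntimeError(f"SNAT exhausted toward {dest_ip}:{dest_port}")
        # Lowest free index for this destination. The same index used toward a
        # different destination is still a distinct 5-tuple, so it can be reused.
        port = next(p for p in range(self.ports_per_ip) if p not in used)
        used.add(port)
        return port


table = ToySnatTable(ports_per_ip=3)  # shrunk so exhaustion is easy to see
for _ in range(3):
    table.allocate("20.1.2.3", 443)

# A different destination still gets a port, reusing an index already taken above.
reused = table.allocate("20.1.2.4", 443)

try:
    table.allocate("20.1.2.3", 443)  # fourth flow to the same dest:port
    exhausted = False
except RuntimeError:
    exhausted = True
```

In this sketch the fourth flow to 20.1.2.3:443 fails while 20.1.2.4:443 is still served, which is the same shape as the production symptom: one busy upstream times out while everything else looks healthy.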

Each public IP attached to a NAT Gateway contributes 64,512 SNAT ports (some docs round to 64,000). The maths is linear:

SNAT ports available = 64,512  x  (number of attached public IPs)

1 public IP   ->  ~64,512 simultaneous flows to a single dest IP:port
2 public IPs  ->  ~129,024
16 public IPs ->  ~1,032,192   (16 = the per-NAT-GW IP/prefix cap)

A port is released when its flow closes or, if the flow just goes quiet, only after it has been idle past the TCP idle timeout, which is why timeout tuning (Step 7) is part of capacity planning, not an afterthought.

3. Provision the NAT Gateway, public IP, and prefix

You need three resources: the NAT Gateway, at least one Standard SKU public IP (or a public IP prefix), and a subnet association. Prefer a public IP prefix to individual IPs: the contiguous CIDR is what you publish to partners for allow-listing, and you can scale within it without changing what they allow-list.

Azure CLI:

LOC=eastus
RG=rg-egress-prod

# A /28 prefix = 16 contiguous IPs. Pick the size from Step 5.
az network public-ip prefix create \
  --resource-group $RG \
  --name pip-prefix-natgw \
  --length 28 \
  --location $LOC \
  --version IPv4

# NAT Gateway. Idle timeout in minutes (default 4, max 120) — see Step 7.
az network nat gateway create \
  --resource-group $RG \
  --name natgw-prod \
  --location $LOC \
  --public-ip-prefixes pip-prefix-natgw \
  --idle-timeout 10

NAT Gateway is a zonal resource. Created with no zone it is non-zonal (regional); to pin it, pass --zone 1. For zone-redundant egress you deploy one NAT Gateway per zone, each on its own subnet, because a single NAT Gateway cannot span zones. Keep that in mind for the AKS section.

The same thing in Bicep, which is what you actually want in the repo:

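// Deployment region; defaults to the resource group's location.
param location string = resourceGroup().location

resource pipPrefix 'Microsoft.Network/publicIPPrefixes@2023-11-01' = {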
  name: 'pip-prefix-natgw'
  location: location
  sku: { name: 'Standard' }
  properties: {
    prefixLength: 28
    publicIPAddressVersion: 'IPv4'
  }
}

resource natgw 'Microsoft.Network/natGateways@2023-11-01' = {
  name: 'natgw-prod'
  location: location
  sku: { name: 'Standard' }
  properties: {
    idleTimeoutInMinutes: 10
    publicIpPrefixes: [
      { id: pipPrefix.id }
    ]
  }
}

NAT Gateway has exactly one SKU (Standard), so there is no SKU decision to make — only IP count and idle timeout.

4. Associate to subnets and the precedence rules

A NAT Gateway is attached to a subnet, and once attached it becomes the outbound path for every resource in that subnet. This is where the ordering rules matter, because they determine what wins when multiple egress configs exist.

az network vnet subnet update \
  --resource-group $RG \
  --vnet-name vnet-prod \
  --name snet-workload \
  --nat-gateway natgw-prod

The precedence Azure applies for outbound, highest to lowest:

  1. NAT Gateway — if present on the subnet, it wins. It overrides Load Balancer outbound rules and the instance-level public IP for outbound traffic.
  2. Instance-level public IP (a public IP directly on the NIC) — used for outbound only if there is no NAT Gateway.
  3. Load Balancer outbound rules — used only if neither of the above applies.
  4. Default outbound access — the last resort, used when nothing else is configured.
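
The ordering above can be written as a small decision function. This is a sketch of the four-step list only; the function and parameter names are hypothetical, and it deliberately ignores anything the list does not mention (for example, custom routing).

```python
def effective_outbound_path(
    subnet_has_nat_gateway: bool,
    nic_has_public_ip: bool,
    lb_has_outbound_rule: bool,
) -> str:
    """Return the outbound method that wins, following the precedence list above."""
    if subnet_has_nat_gateway:
        return "NAT Gateway"
    if nic_has_public_ip:
        return "Instance-level public IP"
    if lb_has_outbound_rule:
        return "Load Balancer outbound rules"
    return "Default outbound access"


# A VM with a public IP on its NIC, behind an LB with outbound rules,
# in a subnet that also has a NAT Gateway attached:
winner = effective_outbound_path(True, True, True)
```

The practical reading is that attaching a NAT Gateway to a subnet changes the egress IP of every resource in it, including ones that previously egressed through their own public IP.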

Two rules that save you a debugging session:

5. Size the prefix to your connection count

This is the capacity-planning step people skip and then page about. Work backwards from the worst-case concurrent flows to a single destination.

required public IPs = ceil( peak_concurrent_flows_to_one_dest / 64,512 )
prefix length       = longest /N (smallest block) whose IP count >= required IPs

| Prefix | IPs | Approx SNAT ports (to one dest:port) |
| --- | --- | --- |
| /31 | 2 | ~129,024 |
| /30 | 4 | ~258,048 |
| /29 | 8 | ~516,096 |
| /28 | 16 | ~1,032,192 |

The hard ceiling: a single NAT Gateway supports a maximum of 16 public IP addresses (individual IPs and prefixes combined). A /28 is the largest single prefix that fits, giving ~1.03M ports. Need more than 16 IPs to one destination? Split the workload across multiple subnets, each with its own NAT Gateway and prefix — the documented scaling pattern, not a workaround.

Worked example: a notification service holds ~140,000 simultaneous TLS connections to one provider VIP at peak. ceil(140000 / 64512) = 3 IPs. A /30 (4 IPs) covers it with headroom; provision it and hand the partner that 4-address CIDR. Do not provision a /28 “to be safe” — you pay per IP, and you can attach a second prefix later without downtime.
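
The sizing rule and the worked example can be checked with a few lines of Python. This is a minimal sketch using the per-IP port figure and the 16-IP cap quoted in this article; the function name and return shape are illustrative assumptions.

```python
import math

PORTS_PER_IP = 64_512    # SNAT ports per public IP, as quoted in this article
MAX_IPS_PER_NATGW = 16   # per-NAT-Gateway IP cap, as quoted in this article


def size_prefix(peak_flows_to_one_dest: int) -> tuple[int, int, int]:
    """Return (required_ips, prefix_length, ips_in_prefix) for one NAT Gateway."""
    if peak_flows_to_one_dest < 1:
        raise ValueError("peak flow count must be positive")
    required = math.ceil(peak_flows_to_one_dest / PORTS_PER_IP)
    if required > MAX_IPS_PER_NATGW:
        raise ValueError("exceeds one NAT Gateway; split the workload across subnets")
    # Walk from the smallest block (/31 = 2 IPs) up to the largest (/28 = 16 IPs).
    for prefix_length in (31, 30, 29, 28):
        ips_in_prefix = 2 ** (32 - prefix_length)
        if ips_in_prefix >= required:
            return required, prefix_length, ips_in_prefix
    raise AssertionError("unreachable: required <= 16 always fits a /28")


# The notification-service example: ~140,000 flows to one provider VIP.
required_ips, prefix_length, ips_in_prefix = size_prefix(140_000)
```

For 140,000 flows this yields 3 required IPs and a /30, matching the worked example. When only one IP is required, the smallest prefix is still a /31, so a single standalone public IP is the tighter fit if you do not need a CIDR to publish.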

6. Integrating with AKS for stable, allow-listable egress

AKS defaults its outboundType to loadBalancer, which puts cluster egress behind the Standard LB and inherits its SNAT-allocation headaches at high pod density. Setting outboundType to managedNATGateway (Azure manages it) or userAssignedNATGateway (you bring your own on the node subnet) fixes both the exhaustion and the IP-stability problems in one move.

This must be chosen at cluster creation: outboundType is largely immutable, with only specific migration paths supported, so design it in from day one for any cluster that calls allow-listed external endpoints.

Managed NAT Gateway (Azure provisions and owns it):

az aks create \
  --resource-group rg-aks-prod \
  --name aks-prod \
  --network-plugin azure \
  --outbound-type managedNATGateway \
  --nat-gateway-managed-outbound-ip-count 2 \
  --nat-gateway-idle-timeout 10 \
  --generate-ssh-keys

User-assigned, when you must control the exact egress prefix (the common enterprise case — partners allow-list your CIDR):

# NAT Gateway + prefix already created and attached to the AKS node subnet,
# then point the cluster at that subnet with a userAssignedNATGateway type.
az aks create \
  --resource-group rg-aks-prod \
  --name aks-prod \
  --network-plugin azure \
  --vnet-subnet-id "$NODE_SUBNET_ID" \
  --outbound-type userAssignedNATGateway \
  --generate-ssh-keys

For a zone-redundant cluster the node pools span zones, but a NAT Gateway is single-zone. The correct topology is one node-pool subnet per availability zone, each with its own zonal NAT Gateway and prefix. Pods in zone 1 egress through the zone-1 NAT Gateway, and so on; partners allow-list the union of the zonal prefixes. Do not share one NAT Gateway across a multi-zone node subnet — it pins your egress to a single zone’s fate.

7. Tune TCP idle timeout and watch the metrics

The TCP idle timeout governs how long an idle flow holds its SNAT port before reclaim. Lowering it (minimum 4 minutes) frees ports faster and raises effective capacity. Raising it (up to 120 minutes) keeps quiet but long-lived connections alive — useful for databases or brokers that idle between bursts but should not be torn down.
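
A rough way to see why the timeout is a capacity lever: by Little's law, the steady-state number of ports held toward one destination is approximately the rate of new flows multiplied by how long each flow holds its port. The function below is a back-of-the-envelope model under that assumption. The parameter names and sample numbers are hypothetical, it applies to flows that are left open and idle rather than closed cleanly, and it ignores any platform-specific port reuse timers.

```python
def ports_held(new_flows_per_second: float,
               active_seconds: float,
               idle_hold_seconds: float) -> float:
    """Estimate steady-state SNAT ports held toward one destination (Little's law)."""
    return new_flows_per_second * (active_seconds + idle_hold_seconds)


# Hypothetical workload: 200 new flows/s, each busy for 2 s, then abandoned idle.
with_4_min_timeout = ports_held(200, 2, 4 * 60)    # idle flows reclaimed after 4 minutes
with_10_min_timeout = ports_held(200, 2, 10 * 60)  # idle flows reclaimed after 10 minutes
```

Under these made-up numbers the same workload holds 48,400 ports at a 4-minute timeout and 120,400 at 10 minutes, which is the difference between fitting on one public IP and needing two.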

az network nat gateway update \
  --resource-group $RG \
  --name natgw-prod \
  --idle-timeout 4

Do not solve idle-timeout pain by cranking it to 120. The durable fix for connections dying mid-idle is application-level TCP keepalives (or HTTP keep-alive / connection pooling) that send traffic before the timeout. Keepalives reset the idle timer and are far more robust than betting that no flow ever idles longer than your window.
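
As an illustration of application-level keepalives, here is a minimal Python sketch that enables TCP keepalive on a socket. The helper name and the timing values are assumptions chosen to sit well under a 4-minute idle timeout; the tuning constants are platform-specific, so each is guarded in case the platform does not expose it.

```python
import socket


def enable_keepalive(sock: socket.socket,
                     idle_s: int = 120,
                     interval_s: int = 30,
                     probes: int = 4) -> None:
    """Turn on TCP keepalives so probes are sent before the NAT idle timeout."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The knobs below are not available on every platform; skip any that are missing.
    if hasattr(socket, "TCP_KEEPIDLE"):    # seconds idle before the first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):   # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):     # failed probes before the connection is dropped
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


# Apply to a socket before connecting; no network traffic is needed to set the options.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Most HTTP clients and database drivers expose an equivalent setting, so in practice this is usually a configuration change rather than raw socket code.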

NAT Gateway emits metrics you should alert on, in Microsoft.Network/natGateways:

A KQL query against the platform metrics (or an Azure Monitor metric alert) to catch exhaustion before users do:

AzureMetrics
| where ResourceProvider == "MICROSOFT.NETWORK"
| where Resource == "NATGW-PROD"
| where MetricName in ("PacketDropCount", "SNATConnectionCount")  // PacketDropCount is the API name of the Dropped Packets metric
| summarize Total = sum(Total) by bin(TimeGenerated, 5m), MetricName
| order by TimeGenerated desc

Set a metric alert on DroppedPackets > 0 over 5 minutes. Any sustained drops mean you are at the ceiling: attach another public IP or prefix to the NAT Gateway (no downtime) or lower the idle timeout, then re-check.

Verify

Confirm the egress IP is what you provisioned, not a default-outbound address, and that ports are healthy.

From a VM or a debug pod inside the subnet:

# Should return an IP from your NAT Gateway's prefix, every time.
curl -s https://ifconfig.me ; echo
curl -s https://api.ipify.org ; echo
# Confirm the association and the attached IPs from the control plane.
az network nat gateway show \
  --resource-group $RG --name natgw-prod \
  --query "{idle:idleTimeoutInMinutes, prefixes:publicIpPrefixes[].id, subnets:subnets[].id}" -o jsonc

For AKS, exec into a pod and hit the same echo endpoints — the returned IP must fall inside the cluster’s NAT Gateway prefix:

kubectl run egress-check --image=curlimages/curl --rm -it --restart=Never -- \
  sh -c "curl -s https://api.ipify.org; echo"
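
To turn the eyeball check into an automated one, a short Python sketch can test whether the observed address falls inside any of the published prefixes. The function name is hypothetical and the CIDRs are documentation-range placeholders, not real NAT Gateway prefixes; substitute the prefixes you actually provisioned.

```python
import ipaddress


def egress_ip_in_prefixes(observed_ip: str, prefixes: list[str]) -> bool:
    """Return True if observed_ip belongs to at least one of the given CIDR prefixes."""
    addr = ipaddress.ip_address(observed_ip)
    return any(addr in ipaddress.ip_network(prefix) for prefix in prefixes)


# Placeholder prefixes (documentation ranges), e.g. one /30 per zonal NAT Gateway.
published = ["203.0.113.0/30", "198.51.100.8/30"]

inside = egress_ip_in_prefixes("203.0.113.2", published)
outside = egress_ip_in_prefixes("203.0.113.4", published)
```

Passing the whole list matters for multi-zone clusters, where a pod may legitimately egress through any of the zonal prefixes.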

Then run a real load test (e.g. a burst of concurrent connections to your actual upstream) while watching SNATConnectionCount climb and DroppedPackets stay at zero in the metrics blade. Zero drops under peak is the pass condition.

Enterprise scenario

A fintech platform team ran a payment-reconciliation service on AKS that, at end-of-day batch, opened tens of thousands of short-lived HTTPS connections to a single acquiring-bank API behind one VIP. The cluster used the default outboundType: loadBalancer. Around 18:00 daily, reconciliation jobs failed with connection timeouts and recovered on their own by 18:30. Nobody could reproduce it off-peak.

The constraint was twofold. The bank required the platform to allow-list a fixed set of source IPs, so any fix had to produce a stable, declared egress CIDR. And the LB outbound rules pre-allocated SNAT ports across the node pool — because every connection targeted the same destination IP and port, the 5-tuple table for that one destination hit its ceiling exactly when batch concurrency peaked. The dropped-flow window lined up perfectly with the failures.

They rebuilt the node pools on subnets fronted by a user-assigned NAT Gateway with a /30 prefix (4 IPs, ~258K ports to that destination, roughly double the observed peak), handed the bank that 4-address CIDR, and dropped the idle timeout to 4 minutes to recycle the short-lived flows aggressively. As the cluster was zone-redundant, they provisioned one zonal NAT Gateway and prefix per zone and gave the bank the union of the three CIDRs.

# Per zone: dedicated subnet, zonal NAT GW, /30 prefix on the node subnet.
az network public-ip prefix create -g rg-pay -n pip-prefix-z1 --length 30 --location eastus --zone 1
az network nat gateway create -g rg-pay -n natgw-z1 --location eastus \
  --public-ip-prefixes pip-prefix-z1 --idle-timeout 4 --zone 1
az network vnet subnet update -g rg-pay --vnet-name vnet-pay \
  --name snet-aks-z1 --nat-gateway natgw-z1

The 18:00 failures stopped on the first batch after cutover. DroppedPackets has been flat at zero since, and the bank’s allow-list never has to change because all future scaling happens within the published prefixes.

Checklist
