The first sign is always the same: a batch job, a webhook fan-out, or a connection-pool-happy microservice throws intermittent connection timeouts to an external API, and nobody can reproduce it locally. The metric that explains it is SNATConnectionCount flatlining at a ceiling — you have run out of SNAT ports. Azure NAT Gateway is the correct fix, and it gives you a stable, allow-listable set of egress IPs as a bonus. This is how to deploy it properly and migrate off the default outbound path without an outage.
1. Three outbound paths, and why only one scales
Every VM or pod that talks to the public internet needs source NAT (SNAT): its private IP and ephemeral port get rewritten to a public IP and port. Azure gives you three ways to provide that public IP, and they are not equivalent.
| Outbound method | SNAT ports | Egress IP | Recommended? |
|---|---|---|---|
| Default outbound access | Implicit, small, shared per host | Unpredictable, Microsoft-owned, can change | No — being retired Sept 2025 for new VNets |
| Load Balancer outbound rules | Manual, 64K per frontend IP, divided across backends | The LB’s public IP(s) | Only if you already need the LB |
| NAT Gateway | Up to ~64,512 per public IP, allocated on demand | Your public IP / prefix, stable | Yes — the default choice |
The mechanics that matter:
- Default outbound access is the implicit egress every VM gets with no explicit outbound configured. It uses a small, shared SNAT pool and an IP you do not own and cannot predict. Microsoft is retiring it for newly created VNets (effective 30 September 2025). Treat any reliance on it as tech debt.
- Load Balancer outbound rules let you allocate ports explicitly, but a Standard LB frontend gives you a fixed budget of 64,000 ports that you must pre-divide across the backend pool. Allocate 1,024 ports per VM and you cap the pool at ~62 instances per frontend IP. Get the division wrong and you starve VMs or cap scale.
- NAT Gateway allocates SNAT ports on demand from a shared pool across all instances in the subnet — no per-VM pre-carving. An idle VM consumes nothing; a busy one bursts. That dynamic allocation is the single biggest reason it survives load that exhausts the other two.
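The pre-division arithmetic for Load Balancer outbound rules can be sketched in a few lines of shell. The helper name `lb_max_backends` is illustrative and not part of any Azure tooling; the 64,000-port budget is the figure quoted above.

```bash
# Hypothetical helper: how many backend instances fit behind one LB frontend IP
# when every instance is pre-allocated a fixed number of SNAT ports.
lb_max_backends() {
  frontend_ports=$1      # port budget per frontend IP (64000 in the text above)
  ports_per_instance=$2  # ports pre-allocated to each backend instance
  echo $(( frontend_ports / ports_per_instance ))   # integer division rounds down
}

lb_max_backends 64000 1024   # prints 62
lb_max_backends 64000 256    # prints 250
```

The trade-off is visible immediately: a bigger per-instance allocation lowers the instance cap, and a smaller one raises the cap but starves busy instances.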
For any subnet making meaningful outbound calls, use NAT Gateway. The rest of this article assumes that decision.
2. Understanding SNAT port allocation
A SNAT “port” is one entry in the translation table, and the table is keyed on the full 5-tuple: source IP, source port, destination IP, destination port, protocol. This is the detail that trips people up.
You are not limited to 64K total connections. You are limited to roughly 64K concurrent connections per public IP to the same destination IP and port. A connection to 20.1.2.3:443 and another to 20.1.2.4:443 go to different destinations, so they can share the same SNAT port; two flows to the same destination each need a port of their own.
Exhaustion almost always means many connections to one destination — a single payment gateway, one storage public endpoint, one upstream API behind a single VIP. Spreading the same load across many destinations rarely exhausts ports.
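One way to find the destination that is eating ports is to count established connections per remote endpoint on a node. The sketch below assumes a Linux host with iproute2's `ss`; the helper name `top_destinations` is made up for illustration. It only sees one node's connections, while the NAT Gateway pool is shared across the subnet, so treat the result as a hint rather than a measurement.

```bash
# Hypothetical helper: count connections per remote IP:port.
# Reads lines whose last field is the peer address, such as the output of
#   ss -Htn state established
# and prints "count peer" pairs, busiest destination first.
top_destinations() {
  awk '{ print $NF }' | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Usage on a Linux node:
#   ss -Htn state established | top_destinations | head -5
```

If one IP:port dominates the list by a wide margin, that is the 5-tuple bucket most likely to run out first.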
Each public IP attached to a NAT Gateway contributes 64,512 SNAT ports (some docs round to 64,000). The maths is linear:
SNAT ports available = 64,512 x (number of attached public IPs)
1 public IP -> ~64,512 simultaneous flows to a single dest IP:port
2 public IPs -> ~129,024
16 public IPs -> ~1,032,192 (16 = the per-NAT-GW IP/prefix cap)
A port is freed when its connection closes or when the flow sits idle past the TCP idle timeout, which is why timeout tuning (Step 7) is part of capacity planning, not an afterthought.
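A rough steady-state estimate shows why the timeout feeds into capacity: ports in use are approximately new flows per second multiplied by the seconds each port stays occupied (Little's law). This is a back-of-envelope sketch, not how Azure accounts for ports. The helper name and the example rate are assumptions, and the estimate ignores port reuse across different destinations.

```bash
# Hypothetical estimate: SNAT ports held at steady state toward one destination.
# flows_per_sec : new outbound flows opened per second (assumed workload figure)
# hold_seconds  : how long each flow keeps its port; for flows abandoned without
#                 a clean close this is roughly the idle timeout in seconds
ports_held() {
  flows_per_sec=$1
  hold_seconds=$2
  echo $(( flows_per_sec * hold_seconds ))
}

ports_held 200 600   # 10-minute idle timeout: prints 120000
ports_held 200 240   # 4-minute idle timeout:  prints 48000
```

Under these assumed numbers the 10-minute case exceeds the 64,512 ports of a single public IP, while the 4-minute case fits.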
3. Provision the NAT Gateway, public IP, and prefix
You need three resources: the NAT Gateway, at least one Standard SKU public IP (or a public IP prefix), and a subnet association. Prefer a public IP prefix to individual IPs: the contiguous CIDR is what you publish to partners for allow-listing, and that one entry covers every egress address the gateway uses from the prefix.
Azure CLI:
LOC=eastus
RG=rg-egress-prod
# A /28 prefix = 16 contiguous IPs. Pick the size from Step 5.
az network public-ip prefix create \
--resource-group $RG \
--name pip-prefix-natgw \
--length 28 \
--location $LOC \
--version IPv4
# NAT Gateway. Idle timeout in minutes (default 4, max 120) — see Step 7.
az network nat gateway create \
--resource-group $RG \
--name natgw-prod \
--location $LOC \
--public-ip-prefixes pip-prefix-natgw \
--idle-timeout 10
NAT Gateway is a zonal resource. Created with no zone it is non-zonal (regional); to pin it, pass --zone 1. For zone-redundant egress you deploy one NAT Gateway per zone, each on its own subnet, because a single NAT Gateway cannot span zones. Keep that in mind for the AKS section.
The same thing in Bicep, which is what you actually want in the repo:
// Deployment region; defaults to the resource group's location.
param location string = resourceGroup().location

resource pipPrefix 'Microsoft.Network/publicIPPrefixes@2023-11-01' = {
  name: 'pip-prefix-natgw'
  location: location
  sku: { name: 'Standard' }
  properties: {
    prefixLength: 28
    publicIPAddressVersion: 'IPv4'
  }
}

resource natgw 'Microsoft.Network/natGateways@2023-11-01' = {
  name: 'natgw-prod'
  location: location
  sku: { name: 'Standard' }
  properties: {
    idleTimeoutInMinutes: 10
    publicIpPrefixes: [
      { id: pipPrefix.id }
    ]
  }
}
This article uses the Standard SKU throughout, so the only decisions left are IP count and idle timeout.
4. Associate to subnets and the precedence rules
A NAT Gateway is attached to a subnet, and once attached it becomes the outbound path for every resource in that subnet. This is where the ordering rules matter, because they determine what wins when multiple egress configs exist.
az network vnet subnet update \
--resource-group $RG \
--vnet-name vnet-prod \
--name snet-workload \
--nat-gateway natgw-prod
The precedence Azure applies for outbound, highest to lowest:
- NAT Gateway — if present on the subnet, it wins. It overrides Load Balancer outbound rules and the instance-level public IP for outbound traffic.
- Instance-level public IP (a public IP directly on the NIC) — used for outbound only if there is no NAT Gateway.
- Load Balancer outbound rules — used only if neither of the above applies.
- Default outbound access — the last resort, used when nothing else is configured.
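The precedence list above can be read as a simple decision chain. The sketch below is illustrative only: the function name is hypothetical, it models just the four paths listed here, and it ignores anything else that can steer traffic, such as user-defined routes.

```bash
# Hypothetical helper: which outbound path wins for a VM, following the
# precedence list above. Each argument is "yes" or "no".
effective_outbound() {
  has_nat_gateway=$1
  has_instance_public_ip=$2
  has_lb_outbound_rule=$3
  if [ "$has_nat_gateway" = yes ]; then
    echo "nat-gateway"
  elif [ "$has_instance_public_ip" = yes ]; then
    echo "instance-public-ip"
  elif [ "$has_lb_outbound_rule" = yes ]; then
    echo "lb-outbound-rule"
  else
    echo "default-outbound-access"
  fi
}

effective_outbound yes yes yes   # prints nat-gateway
effective_outbound no  yes yes   # prints instance-public-ip
effective_outbound no  no  no    # prints default-outbound-access
```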
Two rules that save you a debugging session:
- A resource with its own instance-level public IP still receives inbound on that IP, but once a NAT Gateway is on the subnet its outbound goes through the NAT Gateway. Inbound and outbound are decoupled — you want consistent egress regardless of per-NIC IPs.
- One NAT Gateway can serve multiple subnets in the same VNet, but a subnet can have only one NAT Gateway. It cannot attach to subnets in different VNets, nor to a subnet containing a Basic SKU resource (Basic LB or public IP). Everything must be Standard SKU.
5. Size the prefix to your connection count
This is the capacity-planning step people skip and then page about. Work backwards from the worst-case concurrent flows to a single destination.
required public IPs = ceil( peak_concurrent_flows_to_one_dest / 64,512 )
prefix length = longest /N whose address count, 2^(32-N), is >= required IPs
| Prefix | IPs | Approx SNAT ports (to one dest:port) |
|---|---|---|
| /31 | 2 | ~129,024 |
| /30 | 4 | ~258,048 |
| /29 | 8 | ~516,096 |
| /28 | 16 | ~1,032,192 |
The hard ceiling: a single NAT Gateway supports a maximum of 16 public IP addresses (individual IPs and prefixes combined). A /28 is the largest single prefix that fits, giving ~1.03M ports. Need more than 16 IPs to one destination? Split the workload across multiple subnets, each with its own NAT Gateway and prefix — the documented scaling pattern, not a workaround.
Worked example: a notification service holds ~140,000 simultaneous TLS connections to one provider VIP at peak. ceil(140000 / 64512) = 3 IPs. A /30 (4 IPs) covers it with headroom; provision it and hand the partner that 4-address CIDR. Do not provision a /28 “to be safe” — you pay per IP, and you can attach a second prefix later without downtime.
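The sizing rule and the worked example can be checked with a short shell sketch. The function names are hypothetical; the 64,512 ports-per-IP figure and the 16-address cap are the values used in this article.

```bash
PORTS_PER_IP=64512   # SNAT ports contributed by each public IP (figure used in this article)

# Public IPs needed for a given peak of concurrent flows to one destination.
required_ips() {
  peak_flows=$1
  echo $(( (peak_flows + PORTS_PER_IP - 1) / PORTS_PER_IP ))   # ceiling division
}

# Longest IPv4 prefix (fewest addresses) that holds at least that many IPs,
# capped at /28 because one NAT Gateway takes at most 16 addresses.
prefix_length_for() {
  ips=$1
  for len in 31 30 29 28; do
    if [ $(( 1 << (32 - len) )) -ge "$ips" ]; then
      echo "/$len"
      return 0
    fi
  done
  echo "more than 16 IPs: split across NAT Gateways" >&2
  return 1
}

required_ips 140000                          # prints 3
prefix_length_for "$(required_ips 140000)"   # prints /30
```

For a requirement of exactly one IP the function still returns /31, the smallest prefix in the table above; a single standalone public IP would also do in that case.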
6. Integrating with AKS for stable, allow-listable egress
AKS defaults its outboundType to loadBalancer, which puts cluster egress behind the Standard LB and inherits its SNAT-allocation headaches at high pod density. Setting outboundType to managedNATGateway (Azure manages it) or userAssignedNATGateway (you bring your own on the node subnet) fixes both the exhaustion and the IP-stability problems in one move.
Choose this at cluster creation wherever you can. AKS supports changing outboundType on an existing cluster only along specific migration paths, and the switch changes the cluster's egress IPs, so design it in from day one for any cluster that calls allow-listed external endpoints.
Managed NAT Gateway (Azure provisions and owns it):
az aks create \
--resource-group rg-aks-prod \
--name aks-prod \
--network-plugin azure \
--outbound-type managedNATGateway \
--nat-gateway-managed-outbound-ip-count 2 \
--nat-gateway-idle-timeout 10 \
--generate-ssh-keys
User-assigned, when you must control the exact egress prefix (the common enterprise case — partners allow-list your CIDR):
# NAT Gateway + prefix already created and attached to the AKS node subnet,
# then point the cluster at that subnet with a userAssignedNATGateway type.
az aks create \
--resource-group rg-aks-prod \
--name aks-prod \
--network-plugin azure \
--vnet-subnet-id "$NODE_SUBNET_ID" \
--outbound-type userAssignedNATGateway \
--generate-ssh-keys
For a zone-redundant cluster the node pools span zones, but a NAT Gateway is single-zone. The correct topology is one node-pool subnet per availability zone, each with its own zonal NAT Gateway and prefix. Pods in zone 1 egress through the zone-1 NAT Gateway, and so on; partners allow-list the union of the zonal prefixes. Do not share one NAT Gateway across a multi-zone node subnet — it pins your egress to a single zone’s fate.
7. Tune TCP idle timeout and watch the metrics
The TCP idle timeout governs how long an idle flow holds its SNAT port before reclaim. Lowering it (minimum 4 minutes) frees ports faster and raises effective capacity. Raising it (up to 120 minutes) keeps quiet but long-lived connections alive — useful for databases or brokers that idle between bursts but should not be torn down.
az network nat gateway update \
--resource-group $RG \
--name natgw-prod \
--idle-timeout 4
Do not solve idle-timeout pain by cranking it to 120. The durable fix for connections dying mid-idle is TCP keepalives enabled by the application, which send a probe before the timeout expires, paired with connection pooling so that flows are reused rather than left idle. Keepalives reset the idle timer and are far more robust than betting that no flow ever idles longer than your window.
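A quick sanity check is to compare the keepalive interval with the idle timeout. The sketch below uses a hypothetical helper name. It relies on one known fact, that Linux waits 7200 seconds by default before the first keepalive probe, and it assumes the application has actually enabled keepalives on its sockets.

```bash
# Hypothetical check: will the first TCP keepalive probe go out before the
# NAT Gateway idle timeout expires?
# keepalive_idle_s : seconds of idleness before the first probe
#                    (on Linux, the value in /proc/sys/net/ipv4/tcp_keepalive_time)
# idle_timeout_min : NAT Gateway idle timeout in minutes
keepalive_beats_timeout() {
  keepalive_idle_s=$1
  idle_timeout_min=$2
  [ "$keepalive_idle_s" -lt $(( idle_timeout_min * 60 )) ]
}

# Linux defaults to 7200 s (2 hours), far longer than a 4-minute timeout.
keepalive_beats_timeout 7200 4 && echo "ok" || echo "too slow: flow will be reclaimed"
keepalive_beats_timeout 120 4  && echo "ok" || echo "too slow: flow will be reclaimed"
```

On a real node the first argument could come from `cat /proc/sys/net/ipv4/tcp_keepalive_time`, though many runtimes set their own per-socket keepalive interval that overrides the system default.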
NAT Gateway emits metrics you should alert on, in Microsoft.Network/natGateways:
- SNATConnectionCount: established outbound connections. The headline number.
- TotalConnectionCount: active flows through the gateway.
- PacketDropCount (shown as Dropped Packets in the portal): the smoking gun. A non-zero, rising value strongly indicates exhaustion or capacity pressure.
- PacketCount / ByteCount: throughput.
A KQL query against the platform metrics (or an Azure Monitor metric alert) to catch exhaustion before users do:
AzureMetrics
| where ResourceProvider == "MICROSOFT.NETWORK"
| where Resource == "NATGW-PROD"
| where MetricName in ("PacketDropCount", "SNATConnectionCount")
| summarize Total = sum(Total) by bin(TimeGenerated, 5m), MetricName
| order by TimeGenerated desc
Set a metric alert on PacketDropCount (Dropped Packets) > 0 over 5 minutes. Any sustained drops mean you are at the ceiling: attach another public IP or prefix to the NAT Gateway (no downtime) or lower the idle timeout, then re-check.
Verify
Confirm the egress IP is what you provisioned, not a default-outbound address, and that ports are healthy.
From a VM or a debug pod inside the subnet:
# Should return an IP from your NAT Gateway's prefix, every time.
curl -s https://ifconfig.me ; echo
curl -s https://api.ipify.org ; echo
# Confirm the association and the attached IPs from the control plane.
az network nat gateway show \
--resource-group $RG --name natgw-prod \
--query "{idle:idleTimeoutInMinutes, prefixes:publicIpPrefixes[].id, subnets:subnets[].id}" -o jsonc
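Checking by eye that a returned address sits inside the prefix is error-prone, so the comparison can be scripted. The sketch below is a plain-shell IPv4 check with hypothetical helper names; 203.0.113.0/30 is a documentation range standing in for your real prefix.

```bash
# Hypothetical helper: convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address on dots into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeeds (exit 0) when the address falls inside the CIDR block.
ip_in_cidr() {
  ip=$1
  cidr=$2
  network=${cidr%/*}
  len=${cidr#*/}
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$network") & mask )) ]
}

ip_in_cidr 203.0.113.2 203.0.113.0/30 && echo "inside prefix"
ip_in_cidr 203.0.113.9 203.0.113.0/30 || echo "outside prefix"
```

In practice the first argument would be the address returned by the echo endpoint and the second the CIDR of your public IP prefix.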
For AKS, run a throwaway pod and hit the same echo endpoint; the returned IP must fall inside the cluster’s NAT Gateway prefix:
kubectl run egress-check --image=curlimages/curl --rm -it --restart=Never -- \
sh -c "curl -s https://api.ipify.org; echo"
Then run a real load test (e.g. a burst of concurrent connections to your actual upstream) while watching SNATConnectionCount climb and Dropped Packets (PacketDropCount) stay at zero in the metrics blade. Zero drops under peak is the pass condition.
Enterprise scenario
A fintech platform team ran a payment-reconciliation service on AKS that, at end-of-day batch, opened tens of thousands of short-lived HTTPS connections to a single acquiring-bank API behind one VIP. The cluster used the default outboundType: loadBalancer. Around 18:00 daily, reconciliation jobs failed with connection timeouts and recovered on their own by 18:30. Nobody could reproduce it off-peak.
The constraint was twofold. The bank required the platform to allow-list a fixed set of source IPs, so any fix had to produce a stable, declared egress CIDR. And the LB outbound rules pre-allocated SNAT ports across the node pool — because every connection targeted the same destination IP and port, the 5-tuple table for that one destination hit its ceiling exactly when batch concurrency peaked. The dropped-flow window lined up perfectly with the failures.
They rebuilt the node pools on subnets fronted by a user-assigned NAT Gateway with a /30 prefix (4 IPs, ~258K ports to that destination, roughly double the observed peak), handed the bank that 4-address CIDR, and dropped the idle timeout to 4 minutes to recycle the short-lived flows aggressively. As the cluster was zone-redundant, they provisioned one zonal NAT Gateway and prefix per zone and gave the bank the union of the three CIDRs.
# Per zone: dedicated subnet, zonal NAT GW, /30 prefix on the node subnet.
az network public-ip prefix create -g rg-pay -n pip-prefix-z1 --length 30 --location eastus --zone 1
az network nat gateway create -g rg-pay -n natgw-z1 --location eastus \
--public-ip-prefixes pip-prefix-z1 --idle-timeout 4 --zone 1
az network vnet subnet update -g rg-pay --vnet-name vnet-pay \
--name snet-aks-z1 --nat-gateway natgw-z1
The 18:00 failures stopped on the first batch after cutover. The Dropped Packets metric has been flat at zero since, and the bank’s allow-list has not had to change, because the published prefixes were sized with headroom over the observed peak.