The first sign is always the same: a batch job, a webhook fan-out, or a connection-pool-happy microservice throws intermittent connection timeouts to one external API, and nobody can reproduce it locally. The metric that explains it is SNATConnectionCount flatlining at a ceiling and DroppedPackets ticking up — you have run out of SNAT ports. Every VM and pod that talks to the public internet needs source NAT: its private IP and ephemeral port get rewritten to a public IP and port so the return traffic can find its way home. Azure gives you three ways to supply that public IP, and they behave so differently under load that the choice is the difference between a service that scales and one that pages you at 18:00 every day.
Azure NAT Gateway is the correct fix. It is a fully managed, highly available outbound-only resource that you attach to a subnet; from that moment, every flow leaving the subnet is translated through your public IP (or a contiguous public IP prefix you can hand to partners for allow-listing), and SNAT ports are allocated on demand from a large shared pool rather than pre-carved per instance. That single dynamic-allocation property is why it survives the load that exhausts default outbound and Load Balancer outbound rules. This article is the deep, exam-grade, production-real treatment: every outbound path compared, the 5-tuple SNAT model that explains why exhaustion happens, prefix sizing maths, AKS outboundType integration, idle-timeout tuning, and a full symptom→cause→confirm→fix playbook — with az, Bicep, KQL, and a dense reference of tables you can scan mid-incident.
By the end you will stop guessing about egress. When the pager goes off with “intermittent timeouts to the payment provider,” you will know within ninety seconds whether you are looking at SNAT exhaustion against a single destination VIP, an idle-timeout reclaiming long-lived flows, a Basic-SKU resource silently blocking the association, or a zone-redundant cluster pinned to one zone’s NAT Gateway — and you will know the exact az command and metric that confirms it. Deterministic, allow-listable, exhaustion-proof egress is a solved problem once you understand the mechanism; this is how to solve it properly and migrate off the default outbound path without an outage.
What problem this solves
Two distinct production pains converge on this one resource, and most teams meet them in this order.
The first is SNAT port exhaustion. Your workload opens many concurrent outbound connections — a notification service fanning out to a webhook host, a reconciliation job hammering one bank API, a microservice with a misconfigured connection pool — and the platform runs out of translation-table entries for that destination. New connections fail. It surfaces as intermittent 5xx, dependency timeouts, and connection resets, under load and not at rest, which is exactly why it passes every test and dies in production during the busy window. On default outbound the per-host SNAT pool is small and shared; on Load Balancer outbound rules it is a fixed budget you must pre-divide across the backend pool and inevitably get wrong.
The second pain is unpredictable, un-allow-listable egress IPs. A partner — a bank, a SaaS API, a regulator’s endpoint — says “give us the source IPs your traffic comes from and we will allow-list them.” With default outbound access you cannot: the egress IP is Microsoft-owned, shared, and can change. With Load Balancer outbound it is the LB’s frontend IP, which works but couples your egress identity to an inbound resource you may not want. NAT Gateway gives you a stable public IP or a contiguous CIDR prefix that is yours, that you publish once, and that never changes as you scale within the prefix.
Who hits this: any subnet making meaningful outbound calls. It bites hardest on high-fan-out batch and event-driven workloads (webhooks, reconciliation, scrapers), on AKS clusters at high pod density (the default outboundType: loadBalancer inherits LB SNAT limits), and on any enterprise integration where a partner demands a fixed source-IP allow-list. There is a third, quieter motivation: Microsoft is retiring default outbound access for newly created VNets (effective 30 September 2025), so new subnets must have an explicit outbound method anyway — and NAT Gateway is the recommended default. The three problems below frame the whole field:
| Problem | What you observe in production | Root mechanism | NAT Gateway’s answer |
|---|---|---|---|
| SNAT exhaustion | Intermittent timeouts/5xx to one upstream under load, fine at rest | Translation-table entries for one 5-tuple destination run out | On-demand ports from a large shared pool (64,512 per public IP) |
| Unpredictable egress IP | Partner cannot allow-list you; egress IP changes | Default outbound uses a shared, Microsoft-owned IP | Stable, owned public IP / contiguous prefix you publish |
| Default-outbound retirement | New VNets cannot rely on implicit egress (Sept 2025) | Implicit egress being removed for new VNets | Explicit, recommended outbound method per subnet |
Learning objectives
By the end of this article you can:
- Compare the three outbound paths (default outbound, Load Balancer outbound rules, NAT Gateway) on SNAT allocation, egress-IP stability, and scale — and justify NAT Gateway as the default.
- Explain SNAT port allocation through the 5-tuple translation model, and why exhaustion almost always means many flows to a single destination IP and port.
- Size a public IP prefix from your peak concurrent-flow count using ceil(flows / 64,512), and respect the 16-IP-per-NAT-Gateway ceiling.
- Provision a NAT Gateway, public IP prefix, and subnet association with both az CLI and Bicep, and reason about the outbound precedence rules (what wins when multiple egress configs coexist).
- Integrate NAT Gateway with AKS via the managedNATGateway and userAssignedNATGateway values of outboundType, including the zone-redundant one-NAT-Gateway-per-zone topology.
- Tune the TCP idle timeout (4–120 minutes) as a capacity-planning lever and pair it with application keepalives instead of cranking it blindly.
- Drive the diagnostic signals (SNATConnectionCount, DroppedPackets, TotalConnectionCount) and confirm or rule out exhaustion with exact az and KQL.
Prerequisites & where this fits
You should already understand the basics of an Azure virtual network: a VNet is an address space, subnets carve it up, and resources land in subnets. Knowing what a public IP and a Standard Load Balancer are will help, as will comfort running az in Cloud Shell, reading JSON output, and recognising the difference between inbound and outbound traffic. If you have ever seen “SNAT” in a metric blade and not known what it meant, you are the target reader.
This sits in the Networking track and is the egress-side companion to the inbound and routing material. The VNet and subnet fundamentals come from Azure VNet Deep Dive: Every Setting. The alternative outbound mechanism — and when you would still reach for it — is covered in Standard Load Balancer Outbound Rules, Cross-Region & HA Ports. For PaaS targets you often want to bypass SNAT entirely, which is Private Endpoint vs Service Endpoint. When egress must be inspected and filtered rather than just translated, that is a different resource — see Azure Firewall: Forced Tunneling & Hub-Spoke Routing. And because SNAT exhaustion shows up on the compute side too, the App Service angle is in Troubleshooting Azure App Service: 502/503, Cold Starts & Restart Loops.
A quick map of where each outbound concern is owned and what it can break, so you call the right person fast:
| Layer | What lives here | Who usually owns it | What it can cause |
|---|---|---|---|
| Workload (VM / pod) | Connection pooling, keepalives, retries | App / dev team | Exhaustion via per-request connections |
| Subnet config | NAT Gateway association, UDRs, NSGs | Platform / network | Wrong egress path, blocked association |
| NAT Gateway | SNAT pool, idle timeout, attached IPs | Network team | Port ceiling, idle reclaim, zone pinning |
| Public IP / prefix | The egress identity you publish | Network team | Allow-list churn if not a prefix |
| Destination (partner) | The single VIP everyone hits | External | The 5-tuple bottleneck that exhausts ports |
| Monitoring | SNATConnectionCount, DroppedPackets | Platform / SRE | Blind to exhaustion until users feel it |
Core concepts
Five mental models make every later decision obvious.
Source NAT is mandatory for outbound, and somebody must supply the public IP. A private IP cannot appear on the public internet; the return packet would have nowhere to go. So every outbound flow has its source private-IP-and-port rewritten to a public IP and a SNAT port. The only question is which resource provides that public IP and how it allocates the ports — and that is the entire subject of this article.
A SNAT “port” is a translation-table entry keyed on the full 5-tuple. The table is keyed on (source IP, source port, destination IP, destination port, protocol). This is the detail that trips everyone up. You are not limited to 64K total connections; you are limited to roughly 64K connections per attached public IP to the same destination IP and port. A flow to 20.1.2.3:443 and a flow to 20.1.2.4:443 go to different destinations, so they can share a SNAT port, whereas two flows to the same 20.1.2.3:443 each need a port of their own. Exhaustion therefore almost always means many connections to one destination: a single payment gateway, one storage public endpoint, one upstream behind a single VIP.
NAT Gateway allocates ports on demand from a shared pool; the alternatives pre-carve them. With NAT Gateway, an idle VM in the subnet consumes nothing and a busy one bursts — ports are handed out across all instances in the subnet as needed. With Load Balancer outbound rules you must pre-divide a fixed 64K budget across the backend pool before any traffic flows, and a wrong division either starves VMs or caps scale. Dynamic allocation is the single biggest reason NAT Gateway survives load that exhausts the other two.
Each attached public IP contributes 64,512 SNAT ports, and the maths is linear. One IP gives ~64,512 simultaneous flows to a single destination IP:port; two give ~129,024; the cap of 16 IPs per NAT Gateway gives ~1,032,192. Capacity planning is therefore arithmetic, not guesswork: count your peak concurrent flows to the busiest single destination and divide.
A port is freed only when its flow goes idle past the TCP idle timeout. Until then the entry is held. That is why idle-timeout tuning is part of capacity planning, not an afterthought: a 4-minute timeout recycles short-lived flows aggressively (more effective capacity), while 120 minutes keeps quiet long-lived connections alive (fewer surprise resets) but holds ports longer. The durable fix for connections dying mid-idle is application keepalives, not a longer timeout.
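To make the 5-tuple point concrete, here is a toy Python model of a translation table. It is a deliberate simplification and not how the platform is implemented: the class `ToySnatTable`, its methods, and the example addresses (documentation ranges) are all hypothetical. It only counts ports held per destination, which is enough to show that the budget applies per destination rather than globally.

```python
from collections import Counter

PORTS_PER_PUBLIC_IP = 64_512  # figure quoted in this article


class ToySnatTable:
    """Toy model: each destination (IP, port, protocol) has its own port budget."""

    def __init__(self, public_ip_count, ports_per_ip=PORTS_PER_PUBLIC_IP):
        self.capacity_per_destination = public_ip_count * ports_per_ip
        self.held = Counter()  # destination -> ports currently held

    def open_flow(self, dst_ip, dst_port, protocol="tcp"):
        """Return True if a SNAT port could be allocated, False on exhaustion."""
        destination = (dst_ip, dst_port, protocol)
        if self.held[destination] >= self.capacity_per_destination:
            return False
        self.held[destination] += 1
        return True

    def close_flow(self, dst_ip, dst_port, protocol="tcp"):
        """Release one port held for this destination, if any."""
        destination = (dst_ip, dst_port, protocol)
        if self.held[destination] > 0:
            self.held[destination] -= 1


# Tiny budget (3 ports) so exhaustion is visible without 64,512 iterations.
nat = ToySnatTable(public_ip_count=1, ports_per_ip=3)
same_destination = [nat.open_flow("203.0.113.10", 443) for _ in range(4)]
other_destination = nat.open_flow("203.0.113.11", 443)
print(same_destination)   # the fourth flow to the same destination is refused
print(other_destination)  # a different destination still gets a port
```

The model ignores timeouts and port selection entirely; its only job is to show why spreading flows across destinations relieves pressure while piling them onto one VIP does not.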
The vocabulary in one table
Pin down every moving part before the deep sections. The glossary repeats these for lookup; this is the mental model side by side:
| Concept | One-line definition | Where it lives | Why it matters |
|---|---|---|---|
| Source NAT (SNAT) | Rewriting private src IP:port → public IP:port for egress | Performed by the outbound resource | Mandatory for any internet egress |
| SNAT port | One translation-table entry, keyed on the 5-tuple | The NAT resource’s table | The finite thing you exhaust |
| 5-tuple | (src IP, src port, dst IP, dst port, protocol) | Per flow | Same destination = shared budget |
| NAT Gateway | Managed outbound-only resource attached to a subnet | Subnet association | The recommended egress path |
| Public IP prefix | A contiguous CIDR of public IPs | Attached to the NAT Gateway | The allow-listable egress identity |
| Default outbound access | Implicit egress every VM gets if nothing is configured | Platform behaviour | Being retired Sept 2025; unpredictable IP |
| Idle timeout | How long an idle flow holds its SNAT port | NAT Gateway property (4–120 min) | Capacity + connection-reset lever |
| outboundType | AKS cluster egress mode | AKS networking profile | Chooses LB vs NAT Gateway for the cluster |
| DroppedPackets | NAT Gateway metric for dropped flows | Platform metrics | The smoking gun for exhaustion |
| SNATConnectionCount | Established outbound connections through the gateway | Platform metrics | The headline capacity number |
Three outbound paths, and why only one scales
Azure provides three ways to give an outbound flow a public IP, and they are not equivalent. Default outbound is the implicit egress with no configuration; Load Balancer outbound rules let you allocate ports explicitly off an LB frontend; NAT Gateway is the purpose-built, subnet-attached resource. The headline comparison:
| Outbound method | SNAT ports | Egress IP | Allocation model | Recommended? |
|---|---|---|---|---|
| Default outbound access | Implicit, small, shared per host | Unpredictable, Microsoft-owned, can change | Platform-managed, opaque | No — retired Sept 2025 for new VNets |
| Load Balancer outbound rules | 64,000 per frontend IP, pre-divided across backends | The LB’s public IP(s) | Manual, static per-instance | Only if you already run the LB |
| NAT Gateway | ~64,512 per attached public IP, on demand | Your public IP / prefix, stable | Dynamic, shared across the subnet | Yes — the default choice |
The mechanics that actually decide it:
| Dimension | Default outbound | LB outbound rules | NAT Gateway |
|---|---|---|---|
| Port allocation | Implicit, small | Static, pre-carved | Dynamic, on demand |
| Idle VM cost | Holds a small share | Holds its pre-allocated ports | Consumes nothing |
| Burst behaviour | Exhausts fast | Capped by pre-division | Bursts from shared pool |
| Egress-IP stability | None (can change) | Stable (LB frontend) | Stable (your prefix) |
| Allow-listing | Impossible | Possible (LB IP) | Clean (contiguous CIDR) |
| Max public IPs | n/a | Multiple frontends | 16 per NAT GW |
| Ports per public IP | Tiny shared slice | 64,000 to pre-divide | ~64,512 on demand |
| Couples to inbound? | No | Yes (needs an LB) | No (outbound-only) |
| Provisioning effort | None | Rule + budget maths | Resource + association |
| SKU requirement | Any | Standard LB | All Standard in subnet |
| Zone model | Platform | LB zone behaviour | Zonal (1 GW/zone for HA) |
| Future-proof | No (retiring) | Yes | Yes (recommended) |
Three reading notes that save a design review:
| If you are… | The trap | Do this |
|---|---|---|
| Relying on default outbound today | It works until it doesn’t, and it is being retired | Treat it as tech debt; add an explicit NAT Gateway |
| Already running a Standard LB | Tempting to reuse its outbound rules | Fine for small scale; move to NAT Gateway once the pre-divided port budget becomes the constraint |
| Building a new subnet | New VNets lose implicit egress Sept 2025 | Provision NAT Gateway from day one |
For any subnet making meaningful outbound calls, the answer is NAT Gateway. The rest of this article assumes that decision.
Understanding SNAT port allocation
A SNAT “port” is one entry in the translation table, and the table is keyed on the full 5-tuple: source IP, source port, destination IP, destination port, protocol. You are not limited to 64K total connections; you are limited to 64K connections to the same destination IP and port. Spreading the same load across many destinations rarely exhausts ports — exhaustion is a single-destination phenomenon.
Each public IP attached to a NAT Gateway contributes 64,512 SNAT ports (some docs round to 64,000). The arithmetic is linear:
SNAT ports available = 64,512 × (number of attached public IPs)
1 public IP -> ~64,512 simultaneous flows to a single dest IP:port
2 public IPs -> ~129,024
16 public IPs -> ~1,032,192 (16 = the per-NAT-GW IP/prefix cap)
How the same workload consumes the budget depending on how it is written:
| Connection pattern | Flows to one dest | SNAT pressure | Typical culprit |
|---|---|---|---|
| New HttpClient / socket per request | One per request | Severe; scales with RPS | The classic exhaustion bug |
| Pooled client, no keepalive | One per pool slot, churns on idle | Moderate | Idle reclaim mid-burst |
| Pooled client + keepalive | Reused, long-lived | Low — flat under load | The intended pattern |
| Many destinations (sharded) | Spread across 5-tuples | Low | Naturally exhaustion-resistant |
| Single VIP, high concurrency | All on one 5-tuple | Severe | Payment/bank/storage endpoint |
| Retry storm on failure | Multiplies new flows | Severe — self-worsening | Aggressive retry-on-timeout |
| Short-lived TLS handshakes | New flow per call | High under burst | Webhook/notification fan-out |
| DNS-resolved to rotating IPs | Spread naturally | Low | CDN-fronted upstreams |
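The gap between the first and third rows of this table can be measured locally. The sketch below is a self-contained illustration using only the Python standard library: a loopback server stands in for the partner API and counts accepted TCP connections, each of which would need its own SNAT port if it crossed a NAT. The names `OkHandler`, `CountingServer`, and `count_flows` are hypothetical, and nothing here touches Azure.

```python
import http.client
import socketserver
import threading
from http.server import BaseHTTPRequestHandler


class OkHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the server keep connections open

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


class CountingServer(socketserver.ThreadingTCPServer):
    daemon_threads = True
    allow_reuse_address = True

    def __init__(self, address):
        self.accepted = 0  # one accepted TCP connection == one flow
        super().__init__(address, OkHandler)

    def get_request(self):
        request = super().get_request()
        self.accepted += 1
        return request


def count_flows(reuse, requests=20):
    """Send `requests` GETs to a loopback server; return TCP connections used."""
    server = CountingServer(("127.0.0.1", 0))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        shared = http.client.HTTPConnection(host, port) if reuse else None
        for _ in range(requests):
            conn = shared if reuse else http.client.HTTPConnection(host, port)
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body so the socket is reusable
            if not reuse:
                conn.close()
        if shared is not None:
            shared.close()
    finally:
        server.shutdown()
        server.server_close()
    return server.accepted


print(count_flows(reuse=False))  # one connection per request
print(count_flows(reuse=True))   # a single connection carries every request
```

The per-request client opens as many flows as it sends requests, while the reused connection opens one. That ratio is what turns a modest request rate into SNAT pressure.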
The port lifecycle is what ties capacity to the idle timeout:
| Flow state | Holds a SNAT port? | Freed when… | Lever |
|---|---|---|---|
| Active (sending/receiving) | Yes | Connection closes | App connection reuse |
| Idle but open | Yes | Idle timeout elapses | Idle-timeout setting |
| TCP TIME_WAIT | Briefly | OS reclaim window | Keepalive / fewer new flows |
| Closed | No | Immediately | — |
A worked sizing example, end to end: at 1,800 requests/second with a new connection per request and a ~4-minute TIME_WAIT, you can have hundreds of thousands of sockets in flight against one destination. That is why it fails instantly under flash-sale load and never in a unit test — and why the first fix is connection reuse, with NAT Gateway sizing as the safety margin behind it.
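The arithmetic behind "hundreds of thousands" is a steady-state estimate: entries held equal the arrival rate multiplied by how long each entry is held. The Python sketch below works it through under the assumption stated above, that every request opens a new flow and holds its entry for about four minutes. The function name `entries_held` is hypothetical.

```python
PORTS_PER_PUBLIC_IP = 64_512  # figure quoted in this article


def entries_held(new_flows_per_second, hold_seconds):
    """Steady-state entries held = arrival rate x time each entry is held."""
    return new_flows_per_second * hold_seconds


held = entries_held(new_flows_per_second=1_800, hold_seconds=4 * 60)
print(held)                                  # 432000
print(round(held / PORTS_PER_PUBLIC_IP, 1))  # 6.7, in multiples of one IP's pool
```

Under those assumptions a single public IP is oversubscribed almost sevenfold. A pooled client with a few hundred reused connections to the same destination would instead hold a few hundred entries.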
Provision the NAT Gateway, public IP, and prefix
You need three resources: the NAT Gateway, at least one Standard-SKU public IP (or a public IP prefix), and a subnet association. Prefer a prefix to individual IPs — the contiguous CIDR is exactly what you publish to partners for allow-listing, and you can scale within it without changing what they whitelist.
Azure CLI:
LOC=eastus
RG=rg-egress-prod
# A /28 prefix = 16 contiguous IPs. Pick the size from the sizing section.
az network public-ip prefix create \
--resource-group $RG \
--name pip-prefix-natgw \
--length 28 \
--location $LOC \
--version IPv4
# NAT Gateway. Idle timeout in minutes (default 4, max 120).
az network nat gateway create \
--resource-group $RG \
--name natgw-prod \
--location $LOC \
--public-ip-prefixes pip-prefix-natgw \
--idle-timeout 10
The same thing in Bicep, which is what you actually want in the repo:
param location string = resourceGroup().location

resource pipPrefix 'Microsoft.Network/publicIPPrefixes@2023-11-01' = {
  name: 'pip-prefix-natgw'
  location: location
  sku: { name: 'Standard' }
  properties: {
    prefixLength: 28
    publicIPAddressVersion: 'IPv4'
  }
}

resource natgw 'Microsoft.Network/natGateways@2023-11-01' = {
  name: 'natgw-prod'
  location: location
  sku: { name: 'Standard' }
  properties: {
    idleTimeoutInMinutes: 10
    publicIpPrefixes: [
      { id: pipPrefix.id }
    ]
  }
}
NAT Gateway has exactly one SKU (Standard), so there is no SKU decision to make — only IP count and idle timeout. The complete property surface:
| Property | Values | Default | When to change | Trade-off / limit |
|---|---|---|---|---|
| sku.name | Standard only | Standard | Never (no choice) | Cannot attach to Basic-SKU resources |
| idleTimeoutInMinutes | 4–120 | 4 | Long-lived idle flows being reset | Higher holds ports longer |
| publicIpAddresses | 0–16 individual IPs | none | Need fixed single IPs | Counts toward the 16-IP cap |
| publicIpPrefixes | prefix(es) totalling ≤16 IPs | none | Allow-listable CIDR (preferred) | Largest single prefix is /28 |
| zones | none / 1 / 2 / 3 | none (regional) | Pin egress to a zone for AZ design | One GW cannot span zones |
| subnets | association(s) | none | Attach to workload subnet(s) | A subnet has only one NAT GW |
Public IP vs public IP prefix — pick the prefix for anything partners allow-list:
| Aspect | Individual public IP(s) | Public IP prefix |
|---|---|---|
| Allow-listing | Each IP listed separately | One contiguous CIDR |
| Scaling egress | Add/remove single IPs | Grow within the prefix |
| Partner churn | Allow-list changes per IP | Allow-list never changes |
| Granularity | Exact count | Powers of two (/31…/28) |
| Best for | One or two fixed IPs | Production egress identity |
Zone behaviour you must design around — NAT Gateway is a zonal resource:
| Deployment | --zone | Resilience | Use when |
|---|---|---|---|
| Non-zonal (regional) | omitted | Single regional resource | Dev / non-HA egress |
| Zonal | 1, 2, or 3 | Pinned to one zone’s fate | Part of a per-zone HA design |
| Zone-redundant egress | one GW per zone | Survives a zone loss | Production multi-zone workloads |
A single NAT Gateway cannot span zones. For zone-redundant egress you deploy one NAT Gateway per zone, each on its own subnet — covered in the AKS section, where it matters most.
Associate to subnets and the precedence rules
A NAT Gateway is attached to a subnet, and once attached it becomes the outbound path for every resource in that subnet. This is where the ordering rules matter, because they decide what wins when multiple egress configs coexist.
az network vnet subnet update \
--resource-group $RG \
--vnet-name vnet-prod \
--name snet-workload \
--nat-gateway natgw-prod
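The same association in Bicep:
resource vnet 'Microsoft.Network/virtualNetworks@2023-11-01' existing = {
  name: 'vnet-prod'
}

resource subnet 'Microsoft.Network/virtualNetworks/subnets@2023-11-01' = {
  parent: vnet
  name: 'snet-workload'
  properties: {
    addressPrefix: '10.20.1.0/24'
    natGateway: { id: natgw.id }
  }
}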
The precedence Azure applies for outbound, highest to lowest:
| Priority | Outbound method | Wins over | Notes |
|---|---|---|---|
| 1 | NAT Gateway on the subnet | Everything below | Overrides LB outbound rules and instance-level IP for egress |
| 2 | Instance-level public IP (on the NIC) | LB rules, default | Used for outbound only if no NAT Gateway |
| 3 | Load Balancer outbound rules | Default | Used only if neither above applies |
| 4 | Default outbound access | — | Last resort; retiring for new VNets |
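The precedence table reads naturally as a first-match rule. The Python sketch below encodes only the four rows above, so it can be used to sanity-check what a given subnet configuration should egress through. The function name and its boolean parameters are hypothetical, and it does not model routing features outside this table.

```python
def effective_outbound(has_nat_gateway, has_instance_public_ip, has_lb_outbound_rule):
    """Return the outbound method that wins, per the precedence table."""
    if has_nat_gateway:
        return "NAT Gateway"
    if has_instance_public_ip:
        return "Instance-level public IP"
    if has_lb_outbound_rule:
        return "Load Balancer outbound rules"
    return "Default outbound access"


# A VM with its own public IP in a subnet that also has a NAT Gateway:
print(effective_outbound(True, True, False))   # NAT Gateway
# The same VM before the NAT Gateway was attached:
print(effective_outbound(False, True, False))  # Instance-level public IP
```

The two calls reproduce the most common surprise: the egress IP appears to change the moment a NAT Gateway is attached, because the NIC's public IP stops being used for outbound traffic.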
Two rules that save you a debugging session:
| Rule | What it means | Consequence if ignored |
|---|---|---|
| NAT Gateway wins outbound even if a NIC has its own public IP | Inbound stays on the NIC IP; outbound goes via NAT GW | Surprised that egress IP “changed” after attaching NAT GW |
| Everything in the subnet must be Standard SKU | No Basic LB or Basic public IP allowed | Association silently blocked / unsupported |
Association scope and limits, enumerated:
| Capability | Allowed? | Detail |
|---|---|---|
| One NAT GW → many subnets (same VNet) | Yes | Shares the SNAT pool across them |
| One subnet → many NAT GWs | No | A subnet has exactly one NAT GW |
| NAT GW → subnets in different VNets | No | Same-VNet only |
| Subnet contains a Basic-SKU resource | No | Must be all Standard SKU |
| Attach to a gateway subnet (VPN/ER) | No | Not supported on gateway subnets |
| Coexist with NSG / UDR on the subnet | Yes | NSG/UDR still apply to the flow |
| Coexist with a Standard LB (inbound) | Yes | NAT GW handles outbound, LB inbound |
| Inbound on a NIC’s own public IP | Yes | Inbound stays on the NIC IP |
| Span availability zones with one GW | No | Zonal; one GW per zone for HA |
| Reuse one prefix across many GWs | No | A prefix attaches to one GW |
Size the prefix to your connection count
This is the capacity-planning step people skip and then page about. Work backwards from the worst-case concurrent flows to a single destination:
required public IPs = ceil( peak_concurrent_flows_to_one_dest / 64,512 )
prefix length = smallest /N whose host count >= required IPs
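The two formulas above can be sketched as a small Python helper. The constants come from this article (64,512 ports per IP, 16 IPs per gateway, prefixes from /31 to /28); the function names are hypothetical and the code is a planning aid, not an Azure API.

```python
import math

PORTS_PER_PUBLIC_IP = 64_512
MAX_IPS_PER_GATEWAY = 16


def required_public_ips(peak_flows_to_one_dest):
    """ceil(peak flows / 64,512), with a floor of one IP."""
    return max(1, math.ceil(peak_flows_to_one_dest / PORTS_PER_PUBLIC_IP))


def prefix_length(required_ips):
    """Smallest prefix (/31 to /28) whose IP count covers required_ips."""
    if required_ips > MAX_IPS_PER_GATEWAY:
        raise ValueError("more than 16 IPs: split across subnets and NAT Gateways")
    host_bits = max(1, (required_ips - 1).bit_length())  # /31 is the smallest block
    return 32 - host_bits


for flows in (50_000, 140_000, 400_000, 900_000):
    ips = required_public_ips(flows)
    print(flows, ips, f"/{prefix_length(ips)}")
```

For 140,000 peak flows this yields 3 IPs and a /30. Anything that needs more than 16 IPs raises an error, which mirrors the guidance to split the workload across subnets.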
The prefix-to-capacity table:
| Prefix | IPs | Approx SNAT ports (to one dest:port) | Covers peak flows up to |
|---|---|---|---|
| /31 | 2 | ~129,024 | ~129K |
| /30 | 4 | ~258,048 | ~258K |
| /29 | 8 | ~516,096 | ~516K |
| /28 | 16 | ~1,032,192 | ~1.03M |
The hard ceiling: a single NAT Gateway supports a maximum of 16 public IP addresses (individual IPs and prefixes combined). A /28 is the largest single prefix that fits, giving ~1.03M ports. Need more than 16 IPs to one destination? Split the workload across multiple subnets, each with its own NAT Gateway and prefix — the documented scaling pattern, not a workaround.
The limits and fixed numbers worth keeping on one screen:
| Limit / quantity | Value | Why it matters |
|---|---|---|
| SNAT ports per public IP | ~64,512 | The divisor in all sizing maths |
| Max public IPs per NAT Gateway | 16 | Hard ceiling; ~1.03M ports max |
| Largest single prefix on one GW | /28 (16 IPs) | The biggest contiguous block per GW |
| Idle timeout range | 4–120 minutes | Capacity vs reset trade-off |
| Default idle timeout | 4 minutes | Aggressive reclaim out of the box |
| SKU choices | Standard only | No SKU decision to make |
| NAT GWs per subnet | 1 | A subnet has exactly one |
| Subnets per NAT GW | Many (same VNet) | Shares the pool across them |
| Zones a single GW spans | 1 | Zonal; needs one per zone for HA |
| Default-outbound retirement (new VNets) | 30 Sep 2025 | Why new subnets need explicit egress |
Worked sizing across realistic workloads:
| Workload | Peak flows to one dest | IPs = ceil(flows / 64,512) | Prefix to provision | Headroom |
|---|---|---|---|---|
| Webhook fan-out | ~50,000 | 1 IP | /31 (2 IPs) | ~2.5× |
| Notification service | ~140,000 | 3 IPs | /30 (4 IPs) | ~1.8× |
| Bulk reconciliation | ~400,000 | 7 IPs | /29 (8 IPs) | ~1.3× |
| Multi-tenant scraper | ~900,000 | 14 IPs | /28 (16 IPs) | ~1.1× |
| Multi-zone notification | ~120K per zone | 2 IPs/zone | /30 per zone | per-zone GW |
| Beyond /28 | > 1.03M | > 16 IPs | Split across subnets | per-subnet GW |
Worked example in prose: a notification service holds ~140,000 simultaneous TLS connections to one provider VIP at peak. ceil(140000 / 64512) = 3 IPs. A /30 (4 IPs) covers it with headroom; provision it and hand the partner that 4-address CIDR. Do not provision a /28 “to be safe” — you pay per IP, and you can attach a second prefix later without downtime. Sizing anti-patterns to avoid:
| Anti-pattern | Why it hurts | Better |
|---|---|---|
| Provision /28 “to be safe” | Pay for 16 IPs you do not use | Size to peak + modest headroom |
| Size by total connections, not per-destination | Over- or under-provisions wildly | Count flows to the busiest single dest |
| One giant prefix beyond /28 | Exceeds the 16-IP cap | Split across subnets/NAT GWs |
| Loose individual IPs for a partner | Allow-list churns on scale | Use a contiguous prefix |
| Ignore idle timeout in the maths | Held ports inflate live count | Tune idle timeout + keepalives |
Integrating with AKS for stable, allow-listable egress
AKS defaults its outboundType to loadBalancer, which puts cluster egress behind the Standard LB and inherits its SNAT-allocation headaches at high pod density. Switching to managedNATGateway (Azure provisions and owns the gateway) or userAssignedNATGateway (you bring your own on the node subnet) fixes both the exhaustion and the IP-stability problems in one move.
This must be chosen at cluster creation — outboundType is largely immutable, with only specific migration paths supported — so design it in from day one for any cluster that calls allow-listed external endpoints. The four modes:
| outboundType | Who owns the egress resource | Egress IP control | SNAT scaling | Choose when |
|---|---|---|---|---|
| loadBalancer (default) | AKS-managed LB | LB outbound IPs | LB pre-division limits | Low egress concurrency only |
| managedNATGateway | Azure provisions it | Azure-allocated IP(s) | On demand, large pool | You want NAT GW without owning IPs |
| userAssignedNATGateway | You (on node subnet) | Your exact prefix | On demand, large pool | Partner allow-lists your CIDR |
| userDefinedRouting | You (via UDR/firewall) | Firewall/NVA IP | N/A (egress through NVA) | Egress must be inspected/filtered |
Managed NAT Gateway (Azure provisions and owns it):
az aks create \
--resource-group rg-aks-prod \
--name aks-prod \
--network-plugin azure \
--outbound-type managedNATGateway \
--nat-gateway-managed-outbound-ip-count 2 \
--nat-gateway-idle-timeout 10 \
--generate-ssh-keys
User-assigned, when you must control the exact egress prefix (the common enterprise case — partners allow-list your CIDR):
# NAT Gateway + prefix already created and attached to the AKS node subnet,
# then point the cluster at that subnet with a userAssignedNATGateway type.
az aks create \
--resource-group rg-aks-prod \
--name aks-prod \
--network-plugin azure \
--vnet-subnet-id "$NODE_SUBNET_ID" \
--outbound-type userAssignedNATGateway \
--generate-ssh-keys
The AKS-specific knobs and their constraints:
| Flag / setting | What it controls | Default | Constraint |
|---|---|---|---|
| --outbound-type | Cluster egress mode | loadBalancer | Set at create; limited migration paths |
| --nat-gateway-managed-outbound-ip-count | Managed-mode IP count | 1 | Up to 16; each is 64,512 ports |
| --nat-gateway-idle-timeout | Managed-mode idle timeout | 4 | 4–120 minutes |
| --vnet-subnet-id (user-assigned) | Node subnet with your NAT GW | n/a | NAT GW must be pre-attached |
| --network-plugin | CNI choice | azure/kubenet | Egress design independent of plugin |
For a zone-redundant cluster the node pools span zones, but a NAT Gateway is single-zone. The correct topology is one node-pool subnet per availability zone, each with its own zonal NAT Gateway and prefix. Pods in zone 1 egress through the zone-1 NAT Gateway, and so on; partners allow-list the union of the zonal prefixes. Do not share one NAT Gateway across a multi-zone node subnet — it pins your egress to a single zone’s fate. The zone-redundant layout:
| Zone | Node-pool subnet | Zonal NAT GW | Prefix | Egress IPs published |
|---|---|---|---|---|
| 1 | snet-aks-z1 | natgw-z1 (--zone 1) | /30 | CIDR-1 |
| 2 | snet-aks-z2 | natgw-z2 (--zone 2) | /30 | CIDR-2 |
| 3 | snet-aks-z3 | natgw-z3 (--zone 3) | /30 | CIDR-3 |
| Partner allow-list | — | — | — | Union of CIDR-1…3 |
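The "union of the zonal prefixes" is worth generating rather than typing by hand. The Python sketch below uses the standard `ipaddress` module to build the list a partner would allow-list and to map an observed source IP back to its zone. The three CIDRs are placeholders from documentation ranges, and the function names are hypothetical; substitute the prefixes Azure actually allocates to you.

```python
import ipaddress

# Hypothetical zonal prefixes (documentation ranges), one /30 per zone.
ZONAL_PREFIXES = {
    "zone-1": "203.0.113.0/30",
    "zone-2": "203.0.113.64/30",
    "zone-3": "198.51.100.8/30",
}


def partner_allow_list(zonal_prefixes):
    """Return the sorted, de-duplicated CIDR list to publish to a partner."""
    networks = [ipaddress.ip_network(cidr) for cidr in zonal_prefixes.values()]
    return [str(net) for net in ipaddress.collapse_addresses(networks)]


def egress_zone(source_ip, zonal_prefixes):
    """Map a source IP seen by the partner back to the zone it egressed from."""
    address = ipaddress.ip_address(source_ip)
    for zone, cidr in zonal_prefixes.items():
        if address in ipaddress.ip_network(cidr):
            return zone
    return None


print(partner_allow_list(ZONAL_PREFIXES))
print(egress_zone("203.0.113.65", ZONAL_PREFIXES))  # zone-2
```

The reverse lookup is useful mid-incident: if a partner reports blocked requests from one address, it tells you which zonal gateway, and therefore which subnet, to inspect.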
More on the day-two operational side of clusters like this is in AKS Day-Two: Upgrades & Fleet Operations.
Tune TCP idle timeout and watch the metrics
The TCP idle timeout governs how long an idle flow holds its SNAT port before reclaim. Lowering it (minimum 4 minutes) frees ports faster and raises effective capacity. Raising it (up to 120 minutes) keeps quiet but long-lived connections alive — useful for databases or brokers that idle between bursts but should not be torn down.
az network nat gateway update \
--resource-group $RG \
--name natgw-prod \
--idle-timeout 4
The idle-timeout decision, both directions:
| Setting | Effect on capacity | Effect on long-lived flows | Pick when |
|---|---|---|---|
| 4 min (minimum) | Frees ports fastest | May reset quiet connections | High-churn, short-lived flows |
| 10 min (common default-bump) | Balanced | Tolerates brief idleness | General production |
| 30–60 min | Holds ports longer | Keeps brokers/DBs alive | Bursty long-lived sessions |
| 120 min (maximum) | Holds ports longest | Rarely resets anything | Last resort; prefer keepalives |
Do not solve idle-timeout pain by cranking it to 120. The durable fix for connections dying mid-idle is application-level TCP keepalives (or HTTP keep-alive / connection pooling) that send traffic before the timeout. Keepalives reset the idle timer and are far more robust than betting that no flow ever idles longer than your window. Idle timeout vs keepalive, head to head:
| Approach | Where it lives | Robustness | Side effect |
|---|---|---|---|
| Raise NAT idle timeout | NAT Gateway property | Bet on no flow idling longer | Holds ports; lowers capacity |
| App TCP keepalive | Socket options / client | Resets the timer reliably | Tiny keepalive traffic |
| HTTP keep-alive / pooling | HTTP client config | Reuses connections, fewer flows | Needs correct pool sizing |
| Both (recommended) | NAT + app | Most robust | Minimal |
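On the application side, TCP keepalive is a handful of socket options. The Python sketch below is one way to set them, assuming a plain TCP client socket; the helper name `enable_keepalive` and the chosen intervals are illustrative. The idle value of 120 seconds is picked only because it sits below the 4-minute minimum idle timeout. The finer-grained options are platform-specific, so they are applied only where the `socket` module exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=120, interval_s=30, probes=4):
    """Turn on TCP keepalive so an idle connection still sends periodic probes."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These option names are not available on every platform, hence the guards.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

Most HTTP clients and database drivers expose an equivalent setting, which is usually preferable to reaching for raw sockets. The principle is the same either way: send something before the idle timer expires.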
NAT Gateway emits metrics you should alert on, in Microsoft.Network/natGateways:
| Metric | What it measures | Alert when | Why it matters |
|---|---|---|---|
| SNATConnectionCount | Established outbound connections | Approaching ceiling | The headline capacity number |
| TotalConnectionCount | Active flows through the gateway | Sustained near limit | Corroborates pressure |
| DroppedPackets | Packets/flows dropped | > 0 sustained | The smoking gun for exhaustion |
| PacketCount | Packets processed | Throughput baseline | Capacity/throughput trend |
| ByteCount | Bytes processed | Egress-cost tracking | Bill driver (per-GB) |
| SNATConnectionCount (by direction) | Inbound vs outbound flows | Skew vs expectation | Confirms it is egress, not return |
| Datapath availability | Gateway health | Below 100% | Rules out a platform-side fault |
A KQL query against the platform metrics (or an Azure Monitor metric alert) to catch exhaustion before users do:
AzureMetrics
| where ResourceProvider == "MICROSOFT.NETWORK"
| where Resource == "NATGW-PROD"
| where MetricName in ("DroppedPackets", "PacketDropCount", "SNATConnectionCount") // dropped packets may be exported under the REST name PacketDropCount
| summarize Total = sum(Total) by bin(TimeGenerated, 5m), MetricName
| order by TimeGenerated desc
Which changes are online (no egress interruption) and which force a heavier operation is worth knowing before an incident, because the in-the-moment capacity bumps are all online:
| Change | Disruptive? | How | Use during an incident? |
|---|---|---|---|
| Add a public IP to the prefix | No | az network public-ip prefix larger / attach IP | Yes — fastest capacity bump |
| Lower the idle timeout | No | az network nat gateway update --idle-timeout | Yes — frees ports faster |
| Raise the idle timeout | No | same flag | Yes — but prefer keepalives |
| Attach NAT GW to another subnet | No | subnet update --nat-gateway | Yes — extend coverage |
| Detach NAT GW from a subnet | Egress reverts | subnet update --remove natGateway | Cautiously — egress path changes |
| Shrink/replace the prefix | Yes | recreate prefix | No — plan it |
| Change AKS outboundType | Yes (recreate) | cluster recreate / migration path | No — design up front |
Set a metric alert on DroppedPackets > 0 over 5 minutes. Sustained drops strongly suggest you are at the ceiling: attach another public IP or prefix to the gateway (no downtime) or lower the idle timeout, then re-check. Recommended starting thresholds:
| Alert | Metric | Threshold (starting point) | Action on fire |
|---|---|---|---|
| Exhaustion (hard) | DroppedPackets | > 0 for 5 min | Add IP to prefix; lower idle timeout |
| Capacity creep | SNATConnectionCount | > 80% of 64,512 × IPs | Plan a prefix bump |
| Flow surge | TotalConnectionCount | > your modelled peak | Investigate connection reuse |
| Egress cost | ByteCount | > budget for the month | Review chatty callers / data path |
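The capacity-creep threshold in the table is plain arithmetic. A small sketch, assuming the 64,512-ports-per-IP figure used throughout this article; the function names prefix_ip_count and snat_alert_threshold are hypothetical and exist only for this example.

```shell
#!/usr/bin/env bash
# Number of addresses in an IPv4 prefix of the given length (e.g. 30 -> 4).
prefix_ip_count() {
  local length="$1"
  echo $(( 1 << (32 - length) ))
}

# Alert threshold: a percentage (default 80) of 64,512 ports x attached IPs.
# Uses integer arithmetic, so the result is rounded down.
snat_alert_threshold() {
  local ip_count="$1" percent="${2:-80}"
  echo $(( 64512 * ip_count * percent / 100 ))
}

# Example: threshold for a /30 prefix (4 IPs) at the default 80%.
snat_alert_threshold "$(prefix_ip_count 30)"
```

For a /30 this prints 206438, which is the value you would enter as the SNATConnectionCount alert threshold for that gateway.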
The deeper observability story — workbooks, alert routing, action groups — is in Azure Monitor Deep Dive: Every Option.
Architecture at a glance
Follow a single outbound request left to right and the whole design clicks into place. A workload — a VM or, more often, a fleet of AKS pods — sits in a workload subnet and opens a TCP connection to an external API. Because the subnet has a NAT Gateway associated, the platform intercepts that flow at egress and performs source NAT: the pod’s private 10.20.1.x address and ephemeral port are rewritten to one of the public IPs in the attached prefix (say 203.0.113.0/30) and an allocated SNAT port from the 64,512-per-IP pool. The translated packet leaves through the NAT Gateway’s stable public identity and arrives at the destination — which, critically, is usually a single VIP (one payment gateway, one bank API). The partner sees a source IP inside the prefix it already allow-listed, and the return traffic flows back through the same translation.
The diagram makes the failure map explicit. Badge 1 marks the workload tier, where a per-request connection pattern manufactures the exhaustion in the first place — the fix lives in code (pooling and keepalives), not in the network. Badge 2 sits on the NAT Gateway, the resource whose SNATConnectionCount and DroppedPackets you watch and whose idle timeout and attached-IP count you tune. Badge 3 is on the public IP prefix — the allow-listable CIDR that must stay stable as you scale within it. Badge 4 is on the single destination VIP, the 5-tuple bottleneck that turns “lots of connections” into “exhaustion,” because every flow shares one (dst IP, dst port). Badge 5 maps the zone-redundant concern: a NAT Gateway is single-zone, so a multi-zone cluster needs one gateway per zone or its egress pins to a single zone’s fate. Trace those five numbered points and you can localise any egress incident to exactly one of them.
Real-world scenario
A fintech platform team — call them PaySettle — ran a payment-reconciliation service on AKS that, at end-of-day batch, opened tens of thousands of short-lived HTTPS connections to a single acquiring-bank API behind one VIP. The cluster used the default outboundType: loadBalancer. Around 18:00 daily, reconciliation jobs failed with connection timeouts and recovered on their own by 18:30. Nobody could reproduce it off-peak, and three sprints of “add retries” and “increase the timeout” had changed nothing.
The constraint was twofold. The bank required PaySettle to allow-list a fixed set of source IPs, so any fix had to produce a stable, declared egress CIDR — ruling out default outbound entirely. And the LB outbound rules pre-allocated SNAT ports across the node pool; because every connection targeted the same destination IP and port, the 5-tuple table for that one destination hit its ceiling exactly when batch concurrency peaked. The dropped-flow window lined up perfectly with the failures. The incident timeline made the mechanism unmistakable:
| Time | Observation | Signal | Interpretation |
|---|---|---|---|
| 17:55 | Batch ramps up | TotalConnectionCount climbing | Concurrency rising toward peak |
| 18:02 | First timeouts | DroppedPackets > 0 | SNAT ceiling hit for the bank VIP |
| 18:10 | Job retries storm | More new connections | Retries worsen exhaustion |
| 18:28 | Batch tapers | DroppedPackets → 0 | Ports freed; “self-heals” |
| Next day | Same window | Identical shape | Deterministic, load-driven |
They rebuilt the node pools on subnets fronted by a user-assigned NAT Gateway with a /30 prefix (4 IPs, ~258K ports to that destination, roughly double the observed peak), handed the bank that 4-address CIDR, and dropped the idle timeout to 4 minutes to recycle the short-lived flows aggressively. As the cluster was zone-redundant, they provisioned one zonal NAT Gateway and prefix per zone and gave the bank the union of the three CIDRs. They also fixed the application to reuse a pooled HTTPS client with keepalives, so the live flow count fell even before the extra ports mattered.
# Per zone: dedicated subnet, zonal NAT GW, /30 prefix on the node subnet.
az network public-ip prefix create -g rg-pay -n pip-prefix-z1 --length 30 --location eastus --zone 1
az network nat gateway create -g rg-pay -n natgw-z1 --location eastus \
--public-ip-prefixes pip-prefix-z1 --idle-timeout 4 --zone 1
az network vnet subnet update -g rg-pay --vnet-name vnet-pay \
--name snet-aks-z1 --nat-gateway natgw-z1
The 18:00 failures stopped on the first batch after cutover. DroppedPackets has been flat at zero since, and the bank’s allow-list never has to change because all future scaling happens within the published prefixes. Before-and-after, the change was stark:
| Metric | Before (LB outbound) | After (NAT GW /30 per zone) |
|---|---|---|
| Egress IP | LB frontend, shared | 3 stable /30 prefixes |
| Ports to bank VIP | Pre-divided, capped | ~258K per zone, on demand |
| 18:00 failures | Daily, ~30 min | Zero |
| DroppedPackets | Non-zero at peak | Flat at zero |
| Allow-list churn | Risk on every scale | None (scale within prefix) |
| Idle timeout | LB default | 4 min + app keepalives |
Advantages and disadvantages
NAT Gateway is the right default for egress, but it is not free of trade-offs. The explicit two-column view:
| Advantages | Disadvantages |
|---|---|
| On-demand SNAT from a large shared pool (no pre-carving) | Outbound-only — does not handle inbound at all |
| Stable, allow-listable public IP / prefix | Zonal resource — multi-zone HA needs one GW per zone |
| Survives load that exhausts default / LB outbound | Adds an hourly + per-GB cost vs free default outbound |
| Fully managed, highly available, single SKU | 16-IP-per-GW cap; beyond that you split subnets |
| Decouples egress identity from inbound resources | All subnet resources must be Standard SKU |
| Simple association model (attach to subnet) | No L7 features (no filtering/inspection — that is Azure Firewall) |
| AKS-native via outboundType | outboundType largely immutable post-create |
When each side matters:
| Decision factor | Favours NAT Gateway | Favours an alternative |
|---|---|---|
| Need stable egress IP to allow-list | Strongly | — |
| High concurrency to one destination | Strongly | — |
| Need to filter/inspect egress | — | Azure Firewall (forced tunneling) |
| Egress is to Azure PaaS only | Maybe | Private Endpoint (bypasses SNAT) |
| Already run a Standard LB at small scale | Optional | Reuse LB outbound rules |
| New VNet (post Sep 2025) | Yes | — (default outbound unavailable) |
| AKS at high pod density | Strongly | — |
| Need zone-redundant egress | Yes (one GW/zone) | — |
| Lowest possible cost, non-prod | — | Default outbound (while it lasts) |
For egress that must be inspected rather than merely translated, NAT Gateway is the wrong tool — that is Azure Firewall: Forced Tunneling & Hub-Spoke Routing. For traffic to Azure PaaS that should never touch the public internet at all, prefer Private Endpoint vs Service Endpoint.
Hands-on lab
Provision a NAT Gateway with a prefix, attach it to a subnet, and prove your egress IP is the one you provisioned — free-tier-friendly and fully torn down at the end. Run in Cloud Shell (Bash).
Step 1 — Variables and resource group.
RG=rg-natgw-lab
LOC=centralindia
VNET=vnet-lab
SUBNET=snet-workload
NATGW=natgw-lab
PREFIX=pip-prefix-lab
az group create -n $RG -l $LOC -o table
Step 2 — Create a VNet and a workload subnet.
az network vnet create -g $RG -n $VNET --address-prefix 10.20.0.0/16 \
--subnet-name $SUBNET --subnet-prefix 10.20.1.0/24 -o table
Expected: a VNet with one subnet on 10.20.1.0/24.
Step 3 — Create a /31 public IP prefix (2 IPs — plenty for a lab).
az network public-ip prefix create -g $RG -n $PREFIX --length 31 --location $LOC --version IPv4 -o table
Expected: a prefix resource showing an ipPrefix like x.x.x.x/31.
Step 4 — Create the NAT Gateway with the prefix and a 4-minute idle timeout.
az network nat gateway create -g $RG -n $NATGW --location $LOC \
--public-ip-prefixes $PREFIX --idle-timeout 4 -o table
Expected: a NAT Gateway, sku.name = Standard, idleTimeoutInMinutes = 4.
Step 5 — Associate the NAT Gateway to the subnet.
az network vnet subnet update -g $RG --vnet-name $VNET --name $SUBNET --nat-gateway $NATGW -o table
Step 6 — Confirm the association and attached IPs from the control plane.
az network nat gateway show -g $RG -n $NATGW \
--query "{idle:idleTimeoutInMinutes, prefixes:publicIpPrefixes[].id, subnets:subnets[].id}" -o jsonc
Expected: idle: 4, your prefix listed, and the workload subnet listed under subnets.
Step 7 — Prove egress (optional, needs a VM in the subnet). Deploy a tiny VM into snet-workload, then from inside it the echo services must return an IP from your prefix:
# From a VM/debug pod INSIDE the subnet — should return an IP from your prefix, every time.
curl -s https://ifconfig.me ; echo
curl -s https://api.ipify.org ; echo
Validation checklist. You provisioned a NAT Gateway with an allow-listable prefix, attached it to a subnet, confirmed the association from the control plane, and (optionally) verified the egress IP falls inside your prefix. The steps mapped to what each proves:
| Step | What you did | What it proves |
|---|---|---|
| 3 | Create a public IP prefix | The contiguous CIDR you would publish |
| 4 | Create NAT GW + idle timeout | Single SKU; idle timeout is the lever |
| 5 | Associate to subnet | Egress for the whole subnet now flows through it |
| 6 | Show association | The control-plane confirmation path |
| 7 | Egress echo from inside | The egress IP is your prefix, not default outbound |
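Step 7's "the returned IP must fall inside your prefix" check can be automated. The sketch below is one way to do it in plain Bash; ip_to_int and ip_in_prefix are hypothetical helper names, and the code assumes dotted-quad IPv4 input without leading zeros.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed (exit 0) if the IP lies inside the CIDR prefix, fail otherwise.
ip_in_prefix() {
  local ip="$1" cidr="$2"
  local network="${cidr%/*}" length="${cidr#*/}"
  # Build the netmask, keeping only the low 32 bits of the shifted value.
  local mask=$(( (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF ))
  (( ( $(ip_to_int "$ip") & mask ) == ( $(ip_to_int "$network") & mask ) ))
}

# Example usage from a VM inside the subnet (not executed here):
# prefix=$(az network public-ip prefix show -g "$RG" -n "$PREFIX" --query ipPrefix -o tsv)
# egress_ip=$(curl -s https://api.ipify.org)
# ip_in_prefix "$egress_ip" "$prefix" && echo "egress OK" || echo "egress NOT in prefix"
```

Run in a loop or a scheduled job, the same check doubles as a cheap regression test that egress has not silently fallen back to another outbound path.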
Cleanup (avoid lingering charges).
az group delete -n $RG --yes --no-wait
Cost note. A NAT Gateway plus a tiny prefix is a few rupees per hour; an hour of this lab is well under ₹50, and deleting the resource group stops everything. Skip Step 7’s VM and the lab is nearly free.
Common mistakes & troubleshooting
This is the playbook — the part you bookmark. First the error strings exhaustion actually produces (it rarely says “SNAT” — it shows up as generic connection failures), then the scannable symptom table, then the expanded reasoning for the entries that bite hardest.
SNAT exhaustion is an outbound failure, so the error surfaces in your application’s client, not in an HTTP status from the platform. The strings to recognise:
| Observed error (client side) | What it usually means | How to confirm it is SNAT | First fix |
|---|---|---|---|
| connection timed out to one host under load | New flow could not get a port | DroppedPackets > 0 at that time | Connection reuse; add IP to prefix |
| connection reset by peer mid-session | Idle flow’s port reclaimed | Resets cluster at the idle-timeout interval | Keepalives; raise idle timeout modestly |
| EADDRNOTAVAIL / “address not available” | OS/port allocation pressure | Many flows to one dst; SNATConnectionCount high | Pooled client; fewer concurrent flows |
| Sporadic TLS handshake failures at peak | New handshakes starved of ports | Failure window == load peak | Size prefix to peak; shard destinations |
| 5xx from your service calling an upstream | Upstream call failed, not the upstream itself | App Insights dependency failures to one target | Fix the dependency client, not the upstream |
| Intermittent DNS-then-connect failures | Connect phase starved, DNS fine | Connect errors spike, DNS resolves | Reuse connections; add capacity |
The symptom-to-fix table:
| # | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix |
|---|---|---|---|---|
| 1 | Intermittent timeouts/5xx to one upstream under load, fine at rest | SNAT exhaustion on one 5-tuple destination | Metrics: DroppedPackets > 0; SNATConnectionCount near 64,512×IPs | Connection reuse + add IP to prefix |
| 2 | Egress IP is not the prefix you provisioned | Subnet not associated, or NAT GW precedence not in effect | az network nat gateway show --query subnets; curl ifconfig.me from inside | Associate the subnet; remove conflicting NIC IP assumptions |
| 3 | Association fails / unsupported | A Basic-SKU resource in the subnet | az network public-ip list / LB SKU = Basic | Upgrade everything to Standard SKU |
| 4 | Long-lived connections reset mid-idle | Idle timeout shorter than the idle gap | NAT GW idleTimeoutInMinutes; app logs show resets at the interval | App keepalives; modestly raise idle timeout |
| 5 | Exhaustion despite “plenty of ports” | All flows hit one dest IP:port (single 5-tuple) | App Insights dependencies by target; one host dominates | Shard destinations or reuse connections |
| 6 | AKS egress IP still LB, not NAT GW | outboundType left at default loadBalancer | az aks show --query networkProfile.outboundType | Recreate cluster with managed/user-assigned NAT GW |
| 7 | Cannot change AKS outboundType after create | Property is largely immutable | az aks update rejects the change | Plan it at creation; use a supported migration path |
| 8 | Multi-zone cluster egress pinned to one zone | One NAT GW shared across a multi-zone subnet | NAT GW zones; pods in other zones lose egress on zone loss | One zonal NAT GW + subnet per zone |
| 9 | Need more than ~1.03M ports to one dest | Hit the 16-IP-per-NAT-GW cap | Prefix already /28; SNATConnectionCount ceiling | Split workload across subnets, each its own NAT GW |
| 10 | Egress works but partner blocks you | They allow-listed loose IPs; you scaled and IP changed | Compare current attached IPs vs partner’s list | Use a contiguous prefix; publish the CIDR once |
| 11 | UDR sends traffic to a firewall, bypassing NAT GW | Route table overrides the egress path | az network route-table route list; effective routes | Decide: NAT GW or forced-tunnel via firewall |
| 12 | “Self-healing” failures every peak window | Exhaustion that clears when load drops | DroppedPackets rises and falls with load | Size the prefix to peak; fix connection reuse |
The expanded form, for the entries that bite hardest:
1. Intermittent timeouts/5xx to one upstream under load, fine at rest.
Root cause: SNAT port exhaustion — too many concurrent flows to one destination 5-tuple, usually from per-request connections.
Confirm: NAT Gateway metrics show DroppedPackets > 0 during the window and SNATConnectionCount flattening near 64,512 × (attached IPs); correlate with the load window.
Fix: Reuse connections (pooled HTTP client + keepalives) first; then add a public IP to the prefix (no downtime) and/or lower the idle timeout. Scaling out instances does not help — NAT Gateway pools ports across the subnet already.
2. Egress IP is not the prefix you provisioned.
Root cause: The subnet was never associated, or you are reading a NIC’s instance-level public IP and assuming it is the egress identity.
Confirm: az network nat gateway show --query "subnets[].id" should list the workload subnet; curl -s https://api.ipify.org from inside the subnet must return a prefix IP.
Fix: Associate the subnet. Remember NAT Gateway wins outbound even when a NIC has its own public IP (which still serves inbound).
3. Association fails or is unsupported.
Root cause: Something in the subnet is Basic SKU (a Basic public IP or Basic Load Balancer), which NAT Gateway cannot coexist with.
Confirm: az network public-ip list -g $RG --query "[].{name:name, sku:sku.name}"; check any LB’s SKU.
Fix: Upgrade every public IP and LB in the subnet to Standard SKU; re-attempt the association.
4. Long-lived connections reset mid-idle.
Root cause: The idle timeout is shorter than the connection’s idle gap, so the port is reclaimed and the next packet finds a dead translation.
Confirm: Check idleTimeoutInMinutes; application logs show resets clustering at exactly that interval.
Fix: Add application keepalives (the robust fix) and, if appropriate, raise the idle timeout modestly — never jump straight to 120.
5. Exhaustion despite “plenty of ports.”
Root cause: Every flow targets one destination IP:port, so they all share a single 5-tuple budget — total ports are irrelevant.
Confirm: App Insights dependencies | summarize count() by target shows one host dominating; the port ceiling is per-destination.
Fix: Reuse connections to that host (fewer flows), or shard across multiple destination endpoints if the upstream offers them.
6. AKS egress IP is still the Load Balancer, not the NAT Gateway.
Root cause: outboundType was left at the default loadBalancer.
Confirm: az aks show -g rg-aks-prod -n aks-prod --query "networkProfile.outboundType" -o tsv.
Fix: outboundType is set at creation — recreate the cluster (or follow a supported migration path) with managedNATGateway or userAssignedNATGateway.
8. Multi-zone cluster egress pinned to one zone.
Root cause: One zonal NAT Gateway is shared across a node subnet that spans zones, so losing that zone takes egress with it.
Confirm: The NAT Gateway shows a single zones value while node pools span multiple zones.
Fix: Deploy one zonal NAT Gateway and subnet per availability zone; publish the union of the prefixes. (Routing-side egress mysteries — UDRs overriding the path — are diagnosed in Troubleshooting VNet Connectivity: NSG, UDR, Effective Routes & Network Watcher.)
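For entry 5, a node-level cross-check can complement the App Insights view: count live flows per destination and see whether one IP:port dominates. This is only a sketch. flows_per_destination is a hypothetical helper, it reads one "ip:port" per line from standard input, and the commented ss pipeline assumes the peer address is the fourth column, which you should verify on your distribution.

```shell
#!/usr/bin/env bash
# Count flows per destination from stdin (one "ip:port" per line) and
# print "count destination" pairs, busiest destination first.
flows_per_destination() {
  awk '{ count[$1]++ } END { for (d in count) print count[d], d }' | sort -rn
}

# Example usage on a Linux VM (not executed here):
# ss -Htn state established | awk '{ print $4 }' | flows_per_destination | head
```

Because NAT Gateway pools ports across the whole subnet, the figure that matters is the sum of these counts over every node for the busiest destination, compared against 64,512 × attached IPs.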
Best practices
- Make NAT Gateway the default egress for any subnet with meaningful outbound — default outbound is being retired for new VNets and gives you no stable IP.
- Use a public IP prefix, not loose IPs, so partners allow-list one contiguous CIDR and you scale within it without churn.
- Size the prefix from ceil(peak_flows_to_one_dest / 64,512), with modest headroom — never provision a /28 “to be safe.”
- Fix connection reuse in the application first. Pooled clients with keepalives prevent most exhaustion before any port sizing matters.
- Set the idle timeout deliberately — 4 minutes to recycle short-lived flows, higher only when paired with application keepalives.
- For AKS, choose outboundType at cluster creation — it is largely immutable, so design it in for any cluster that calls allow-listed endpoints.
- For zone redundancy, deploy one zonal NAT Gateway + prefix per availability zone and publish the union of CIDRs; never share one across a multi-zone subnet.
- Keep everything in the subnet Standard SKU — a single Basic public IP or LB blocks the association.
- Alert on DroppedPackets > 0 and dashboard SNATConnectionCount so you catch the ceiling before users feel it.
- Validate egress from inside the subnet/pod with an IP-echo service; the returned IP must always fall inside your prefix.
- Decide NAT Gateway or forced-tunnel via firewall, not both by accident — a UDR to a firewall overrides the NAT Gateway egress path.
- Document the final egress CIDR(s) as an immutable allow-list contract with each partner, versioned in your IaC repo.
Security notes
- Egress identity is a security boundary. A stable, allow-listable prefix lets partners restrict who can reach them to your CIDR — make that contract explicit and version it, because a silent IP change becomes an outage and a security review.
- NAT Gateway translates but does not inspect. It is not a firewall — it provides no FQDN filtering, no L7 rules, no threat intel. If egress must be controlled (allow only certain destinations), pair or replace it with Azure Firewall: Forced Tunneling & Hub-Spoke Routing.
- Prefer Private Endpoints for Azure PaaS targets. Traffic to Storage, SQL, Key Vault and the like should stay on the Microsoft backbone and never consume SNAT or traverse the public internet — see Private Endpoint vs Service Endpoint.
- Keep NSGs on the subnet. NAT Gateway does not replace network security groups; egress flows still pass through your NSG rules, so least-privilege outbound rules still apply.
- Reduce blast radius with separate egress prefixes per environment. Production, staging and dev should egress through distinct prefixes so a compromised non-prod workload cannot impersonate prod’s allow-listed identity.
- Audit the attached IPs as configuration. Changes to the prefix or attached IPs are security-relevant; manage them in IaC and review them in PRs, not by hand in the portal.
The egress-security knobs side by side:
| Control | Mechanism | Secures against | Note |
|---|---|---|---|
| Stable egress identity | Public IP prefix | Partner over-permissive allow-lists | Publish once; never churn |
| Egress filtering | Azure Firewall (not NAT GW) | Exfiltration to arbitrary hosts | NAT GW does not filter |
| PaaS off the public path | Private Endpoint | Public-internet exposure of PaaS | Bypasses SNAT entirely |
| Least-privilege egress | NSG outbound rules | Unwanted destinations/ports | Still applies under NAT GW |
| Per-env isolation | Separate prefixes | Cross-env identity reuse | Distinct allow-lists |
| Config auditing | IaC + PR review | Silent IP/prefix drift | Treat egress IPs as code |
| Egress logging | Flow logs / firewall logs | Undetected anomalous egress | NAT GW itself does not log flows |
Cost & sizing
The bill drivers are simple and bounded:
- NAT Gateway has an hourly resource charge plus a per-GB data-processing charge for traffic flowing through it. The hourly cost is per gateway, so a zone-redundant design (three gateways) is roughly 3× the hourly base — budget for it deliberately.
- Public IPs are charged per IP-hour. A /30 prefix is four IPs of cost whether or not you use all the ports; this is exactly why you size to peak rather than over-provisioning a /28.
- Per-GB data processing dominates for high-egress workloads. A chatty service moving terabytes pays far more in processing than in the hourly base; reducing needless egress (caching, fewer round-trips) is the real cost lever.
- The free alternative — default outbound — is a false economy at any production scale: it gives no stable IP and is being retired, and the cost of a SNAT-exhaustion outage during a sale dwarfs the NAT Gateway bill.
A rough monthly picture (figures are indicative; confirm current regional pricing):
| Configuration | What you pay for | Rough INR / month | When it fits |
|---|---|---|---|
| 1× NAT GW + /31 prefix, low traffic | 1 GW hourly + 2 IPs + light per-GB | ~₹2,500–4,000 | Single-zone, modest egress |
| 1× NAT GW + /30 prefix, medium traffic | 1 GW + 4 IPs + moderate per-GB | ~₹4,000–7,000 | Production single-zone |
| 1× NAT GW + /28 prefix | 1 GW + 16 IPs + per-GB | ~₹7,000–11,000 | Single-zone, very high concurrency |
| 3× NAT GW (per-zone) + 3× /30 | 3 GW hourly + 12 IPs + per-GB | ~₹12,000–20,000 | Zone-redundant production |
| High-egress (TB/month) any layout | Per-GB processing dominates | per-GB drives it | Data-heavy egress |
| Add one IP to an existing prefix | +1 IP-hour, no GW change | small delta | Quick capacity bump, no downtime |
| Default outbound (for contrast) | Nothing (until retired) | ~₹0 | Non-prod only; no stable IP |
Sizing rule of thumb, distilled:
| You have… | Provision | Idle timeout | Why |
|---|---|---|---|
| < 65K peak flows to one dest | 1 IP (/32 or single) | 4–10 min | One IP covers it |
| ~65K–130K | /31 (2 IPs) | 4–10 min | Two IPs, headroom |
| ~130K–260K | /30 (4 IPs) | 4 min | Aggressive recycle |
| ~260K–520K | /29 (8 IPs) | 4 min | Larger pool |
| ~520K–1.03M | /28 (16 IPs) | 4 min | Max on one GW |
| > 1.03M to one dest | Split across subnets/GWs | 4 min | Past the 16-IP cap |
| Long-lived bursty (brokers/DBs) | size to peak | 30–60 min + keepalive | Avoid mid-idle resets |
| Multi-zone HA | per-zone /30 each | 4 min | One GW per zone |
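The sizing table follows directly from the formula required_IPs = ceil(peak_flows / 64,512). A minimal sketch of that arithmetic, assuming the 64,512-ports-per-IP figure and the 16-IP-per-gateway cap stated in this article; required_ips and prefix_length_for are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Public IPs needed for a given peak of concurrent flows to one destination:
# ceil(peak / 64512), computed with integer arithmetic.
required_ips() {
  local peak="$1"
  echo $(( (peak + 64512 - 1) / 64512 ))
}

# Smallest IPv4 prefix length whose address count covers the needed IPs.
# Fails when more than 16 IPs are needed (the per-gateway cap in this article).
prefix_length_for() {
  local ips="$1" size=1 length=32
  while (( size < ips )); do
    size=$(( size * 2 ))
    length=$(( length - 1 ))
  done
  if (( size > 16 )); then
    echo "more than 16 IPs needed: split across subnets and NAT Gateways" >&2
    return 1
  fi
  echo "$length"
}

# Example: ~140,000 peak flows to one bank VIP.
ips=$(required_ips 140000)
echo "IPs needed: $ips, prefix: /$(prefix_length_for "$ips")"
```

For 140,000 flows this reports 3 IPs and a /30, matching the worked example elsewhere in the article. Add headroom by raising the peak figure you feed in rather than by jumping a prefix size.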
Interview & exam questions
1. What are the three outbound paths in Azure, and why does only NAT Gateway scale? Default outbound access (implicit, small shared pool, unpredictable IP, retiring Sept 2025), Load Balancer outbound rules (a fixed 64K budget you must pre-divide across the backend pool), and NAT Gateway (on-demand allocation from a large shared pool, ~64,512 ports per attached public IP, stable egress identity). NAT Gateway scales because ports are handed out dynamically across the subnet rather than pre-carved per instance.
2. What exactly is a SNAT port, and why is exhaustion a single-destination problem? A SNAT port is one entry in the translation table, keyed on the full 5-tuple — source IP, source port, destination IP, destination port, protocol. You are limited to ~64K connections to the same destination IP and port, not 64K total. Exhaustion therefore almost always means many concurrent flows to one VIP (a payment gateway, one storage endpoint); spreading load across destinations rarely exhausts ports.
3. How do you size a public IP prefix? Count the peak concurrent flows to the busiest single destination and compute required_IPs = ceil(peak_flows / 64,512), then pick the smallest prefix whose host count covers it. For ~140,000 flows that is 3 IPs → a /30 (4 IPs) with headroom. Never over-provision a /28 “to be safe” — you pay per IP and can add a prefix later with no downtime.
4. What is the outbound precedence when multiple egress configs exist? Highest to lowest: NAT Gateway (wins if present on the subnet), then an instance-level public IP on the NIC, then Load Balancer outbound rules, then default outbound access. Notably, NAT Gateway overrides a NIC’s own public IP for outbound — that IP still serves inbound, but egress goes through the NAT Gateway.
5. What is the per-NAT-Gateway IP cap and what do you do beyond it? A single NAT Gateway supports a maximum of 16 public IPs (individual + prefixes combined), giving ~1.03M ports to one destination via a /28. Beyond that, split the workload across multiple subnets, each with its own NAT Gateway and prefix — the documented scaling pattern.
6. How do you give AKS deterministic, allow-listable egress? Set outboundType at cluster creation to userAssignedNATGateway (you attach your own NAT Gateway + prefix to the node subnet, so partners allow-list your exact CIDR) or managedNATGateway (Azure provisions it). The default loadBalancer inherits LB SNAT limits; outboundType is largely immutable, so it must be chosen up front.
7. How do you make egress zone-redundant given that NAT Gateway is zonal? A NAT Gateway is single-zone, so you deploy one zonal NAT Gateway and node-pool subnet per availability zone. Pods in each zone egress through that zone’s gateway, and partners allow-list the union of the zonal prefixes. Sharing one gateway across a multi-zone subnet pins all egress to a single zone’s fate.
8. What does the TCP idle timeout do, and how should you set it? It controls how long an idle flow holds its SNAT port before reclaim (4–120 minutes, default 4). Lowering it frees ports faster (more effective capacity); raising it keeps quiet long-lived connections alive but holds ports longer. The durable fix for mid-idle resets is application keepalives, not cranking the timeout to 120.
9. Which NAT Gateway metric is the smoking gun for exhaustion, and what else do you watch? DroppedPackets — any sustained non-zero value strongly indicates exhaustion or capacity pressure. Watch it alongside SNATConnectionCount (the headline established-connection count, compared against 64,512 × attached IPs) and TotalConnectionCount. Alert on DroppedPackets > 0 over 5 minutes.
10. Your app exhausts SNAT despite “plenty of ports” — why? Because every flow targets the same destination IP and port, so they all consume one 5-tuple budget; total ports across other destinations are irrelevant. Confirm with App Insights dependencies grouped by target (one host dominates). Fix with connection reuse to that host, or shard across multiple endpoints if available.
11. Can NAT Gateway filter or inspect egress? No — it performs source NAT only; it has no FQDN filtering, L7 rules, or threat intelligence. For controlled egress (allow only specific destinations) you use Azure Firewall (often with forced tunneling), and for Azure PaaS you prefer Private Endpoints so traffic never traverses the public internet or consumes SNAT at all.
12. Why does NAT Gateway require Standard SKU everywhere in the subnet? NAT Gateway is a Standard-SKU resource and cannot coexist with Basic-SKU public IPs or Basic Load Balancers in the same subnet; the association is unsupported. Upgrade every public IP and LB in the subnet to Standard before attaching it.
These map to AZ-700 (Network Engineer Associate) — design and implement network connectivity and routing, hybrid and outbound connectivity — and touch AZ-104 (Administrator) for virtual networking and AZ-305 for designing resilient, allow-listable egress. A compact cert mapping:
| Question theme | Primary cert | Objective area |
|---|---|---|
| Outbound paths, SNAT model, prefix sizing | AZ-700 | Design & implement outbound connectivity |
| AKS outboundType, zone-redundant egress | AZ-700 / CKA-adjacent | Cluster networking design |
| Precedence, SKU constraints, association | AZ-104 | Configure virtual networking |
| Allow-listable, resilient egress design | AZ-305 | Design network architecture |
| Egress filtering vs translation | AZ-700 / AZ-500 | Secure network connectivity |
Quick check
- Why does SNAT exhaustion almost always involve a single destination, and what part of the 5-tuple makes that true?
- You have ~140,000 peak concurrent flows to one bank VIP. What prefix do you provision, and what is the per-IP port figure you divided by?
- True or false: scaling out to more VM instances behind a NAT Gateway adds SNAT ports.
- Your zone-redundant AKS cluster’s egress survives a single instance failure but not a zone outage. What did you get wrong, and what is the fix?
- Long-lived broker connections keep resetting after exactly four minutes of quiet. Name the property involved and the robust fix.
Answers
- The translation table is keyed on the full 5-tuple including destination IP and destination port; you can have ~64,512 flows to one `dst IP:port` per public IP. Many flows to one VIP share that single budget and exhaust it, while the same number spread across destinations would not.
- `ceil(140000 / 64512) = 3` IPs, so provision a `/30` (4 IPs, ~258K ports, with headroom). The divisor is 64,512 SNAT ports per attached public IP.
- False. NAT Gateway already pools ports across the whole subnet on demand; adding instances does not add ports. To add capacity you attach another public IP or a larger prefix to the NAT Gateway (or lower the idle timeout / fix connection reuse).
- You shared one zonal NAT Gateway across a multi-zone node subnet, pinning egress to that zone. The fix is one zonal NAT Gateway and node-pool subnet per availability zone, publishing the union of the zonal prefixes to the partner.
- The TCP idle timeout (default 4 minutes) is reclaiming the idle flow’s SNAT port. The robust fix is application-level keepalives (or HTTP keep-alive / pooling) that reset the idle timer — preferable to merely raising the timeout toward 120.
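The sizing arithmetic from the second answer can be sketched in a few lines of Python. This is a minimal sketch, not an official calculator: it assumes the article's figure of 64,512 SNAT ports per public IP per destination, a limit of 16 public IPs on one NAT Gateway, and prefix lengths from `/31` to `/28`. The function names and the sample destination labels are hypothetical.

```python
SNAT_PORTS_PER_IP = 64_512  # per attached public IP, per destination IP:port
MAX_PUBLIC_IPS = 16         # assumption: upper bound of public IPs on one NAT Gateway


def ips_needed(peak_flows_to_one_destination: int) -> int:
    """Public IPs required so the busiest destination fits its port budget."""
    # Integer ceiling division, with a floor of one IP.
    return max(1, -(-peak_flows_to_one_destination // SNAT_PORTS_PER_IP))


def smallest_prefix(ip_count: int) -> int:
    """Shortest-block prefix length (30 means /30) that holds ip_count IPs."""
    if not 1 <= ip_count <= MAX_PUBLIC_IPS:
        raise ValueError("ip_count must be between 1 and 16")
    # Prefixes come in powers of two; round the IP count up to the next one.
    host_bits = (ip_count - 1).bit_length()
    # Assumption: /31 (2 IPs) is the smallest prefix on offer.
    return 32 - max(host_bits, 1)


def sizing_for(flows_by_destination: dict[str, int]) -> tuple[str, int, int, int]:
    """Return (busiest destination, IPs needed, prefix length, ports per destination)."""
    # Only the busiest destination matters: ports are reusable across destinations.
    busiest = max(flows_by_destination, key=flows_by_destination.get)
    ips = ips_needed(flows_by_destination[busiest])
    length = smallest_prefix(ips)
    ports = (2 ** (32 - length)) * SNAT_PORTS_PER_IP
    return busiest, ips, length, ports


if __name__ == "__main__":
    # Hypothetical peak concurrent flows per destination.
    peaks = {"bank-vip:443": 140_000, "telemetry:443": 9_000}
    print(sizing_for(peaks))  # ('bank-vip:443', 3, 30, 258048)
```

The sizing is driven by the single busiest destination rather than the total flow count, which mirrors the first answer: flows to different destinations can reuse the same SNAT port, so only the per-destination peak consumes the 64,512-per-IP budget.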
Glossary
- Source NAT (SNAT) — rewriting a private source IP and port to a public IP and port so outbound traffic can traverse the internet and return.
- SNAT port — one entry in the NAT translation table, keyed on the full 5-tuple; the finite resource that exhausts.
- 5-tuple — the tuple (source IP, source port, destination IP, destination port, protocol) that uniquely identifies a flow; the reason exhaustion is per-destination.
- NAT Gateway — a managed, highly available, outbound-only Azure resource attached to a subnet that performs SNAT through your public IP/prefix with on-demand port allocation.
- Public IP prefix — a contiguous block of public IPs (e.g. `/30`, `/28`) you attach to a NAT Gateway and publish for partner allow-listing.
- Default outbound access — the implicit egress a VM gets with no explicit outbound configured; small shared pool, unpredictable Microsoft-owned IP, retiring for new VNets on 30 September 2025.
- Load Balancer outbound rules — explicit outbound port allocation off a Standard LB frontend; a fixed 64K budget you must pre-divide across the backend pool.
- Idle timeout — how long an idle flow holds its SNAT port before reclaim (4–120 minutes, default 4); a capacity and connection-reset lever.
- `outboundType` — the AKS cluster networking setting choosing egress mode: `loadBalancer`, `managedNATGateway`, `userAssignedNATGateway`, or `userDefinedRouting`; largely immutable after creation.
- `SNATConnectionCount` — NAT Gateway metric for established outbound connections; the headline capacity number.
- `DroppedPackets` — NAT Gateway metric whose sustained non-zero value is the smoking gun for SNAT exhaustion.
- `TotalConnectionCount` — NAT Gateway metric for active flows through the gateway.
- TCP keepalive — periodic packets that reset the idle timer so long-lived but quiet connections are not reclaimed; the robust alternative to raising the idle timeout.
- Zonal resource — a resource pinned to a single availability zone; a NAT Gateway is zonal, so multi-zone egress needs one per zone.
Next steps
You can now give any subnet deterministic, allow-listable, exhaustion-proof egress and prove it from inside the network. Build outward:
- Next: Standard Load Balancer Outbound Rules, Cross-Region & HA Ports — the alternative outbound mechanism and exactly when you would still reach for it.
- Related: Private Endpoint vs Service Endpoint — take Azure PaaS traffic off the public path entirely so it never consumes SNAT.
- Related: Azure Firewall: Forced Tunneling & Hub-Spoke Routing — when egress must be inspected and filtered, not merely translated.
- Related: Azure VNet Deep Dive: Every Setting — the subnet, NSG and address-space fundamentals underneath every egress decision.
- Related: Troubleshooting VNet Connectivity: NSG, UDR, Effective Routes & Network Watcher — when a route table, not SNAT, is hijacking your egress path.