Deterministic Outbound with Azure NAT Gateway: Fixing SNAT Port Exhaustion

The first sign is always the same: a batch job, a webhook fan-out, or a connection-pool-happy microservice throws intermittent connection timeouts to one external API, and nobody can reproduce it locally. The metric that explains it is SNATConnectionCount flatlining at a ceiling and DroppedPackets ticking up — you have run out of SNAT ports. Every VM and pod that talks to the public internet needs source NAT: its private IP and ephemeral port get rewritten to a public IP and port so the return traffic can find its way home. Azure gives you three ways to supply that public IP, and they behave so differently under load that the choice is the difference between a service that scales and one that pages you at 18:00 every day.

Azure NAT Gateway is the correct fix. It is a fully managed, highly available outbound-only resource that you attach to a subnet; from that moment, every flow leaving the subnet is translated through your public IP (or a contiguous public IP prefix you can hand to partners for allow-listing), and SNAT ports are allocated on demand from a large shared pool rather than pre-carved per instance. That single dynamic-allocation property is why it survives the load that exhausts default outbound and Load Balancer outbound rules. This article is the deep, exam-grade, production-real treatment: every outbound path compared, the 5-tuple SNAT model that explains why exhaustion happens, prefix sizing maths, AKS outboundType integration, idle-timeout tuning, and a full symptom→cause→confirm→fix playbook — with az, Bicep, KQL, and a dense reference of tables you can scan mid-incident.

By the end you will stop guessing about egress. When the pager goes off with “intermittent timeouts to the payment provider,” you will know within ninety seconds whether you are looking at SNAT exhaustion against a single destination VIP, an idle-timeout reclaiming long-lived flows, a Basic-SKU resource silently blocking the association, or a zone-redundant cluster pinned to one zone’s NAT Gateway — and you will know the exact az command and metric that confirms it. Deterministic, allow-listable, exhaustion-proof egress is a solved problem once you understand the mechanism; this is how to solve it properly and migrate off the default outbound path without an outage.

What problem this solves

Two distinct production pains converge on this one resource, and most teams meet them in this order.

The first is SNAT port exhaustion. Your workload opens many concurrent outbound connections — a notification service fanning out to a webhook host, a reconciliation job hammering one bank API, a microservice with a misconfigured connection pool — and the platform runs out of translation-table entries for that destination. New connections fail. It surfaces as intermittent 5xx, dependency timeouts, and connection resets, under load and not at rest, which is exactly why it passes every test and dies in production during the busy window. On default outbound the per-host SNAT pool is small and shared; on Load Balancer outbound rules it is a fixed budget you must pre-divide across the backend pool and inevitably get wrong.

The second pain is unpredictable, un-allow-listable egress IPs. A partner — a bank, a SaaS API, a regulator’s endpoint — says “give us the source IPs your traffic comes from and we will allow-list them.” With default outbound access you cannot: the egress IP is Microsoft-owned, shared, and can change. With Load Balancer outbound it is the LB’s frontend IP, which works but couples your egress identity to an inbound resource you may not want. NAT Gateway gives you a stable public IP or a contiguous CIDR prefix that is yours, that you publish once, and that never changes as you scale within the prefix.

Who hits this: any subnet making meaningful outbound calls. It bites hardest on high-fan-out batch and event-driven workloads (webhooks, reconciliation, scrapers), on AKS clusters at high pod density (the default outboundType: loadBalancer inherits LB SNAT limits), and on any enterprise integration where a partner demands a fixed source-IP allow-list. There is a third, quieter motivation: Microsoft is retiring default outbound access for newly created VNets (effective 30 September 2025), so new subnets must have an explicit outbound method anyway — and NAT Gateway is the recommended default. The three problems below frame the whole field:

| Problem | What you observe in production | Root mechanism | NAT Gateway’s answer |
|---|---|---|---|
| SNAT exhaustion | Intermittent timeouts/5xx to one upstream under load, fine at rest | Translation-table entries for one 5-tuple destination run out | On-demand ports from a large shared pool (64,512 per public IP) |
| Unpredictable egress IP | Partner cannot allow-list you; egress IP changes | Default outbound uses a shared, Microsoft-owned IP | Stable, owned public IP / contiguous prefix you publish |
| Default-outbound retirement | New VNets cannot rely on implicit egress (Sept 2025) | Implicit egress being removed for new VNets | Explicit, recommended outbound method per subnet |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand the basics of an Azure virtual network: a VNet is an address space, subnets carve it up, and resources land in subnets. Knowing what a public IP and a Standard Load Balancer are will help, as will comfort running az in Cloud Shell, reading JSON output, and recognising the difference between inbound and outbound traffic. If you have ever seen “SNAT” in a metric blade and not known what it meant, you are the target reader.

This sits in the Networking track and is the egress-side companion to the inbound and routing material. The VNet and subnet fundamentals come from Azure VNet Deep Dive: Every Setting. The alternative outbound mechanism — and when you would still reach for it — is covered in Standard Load Balancer Outbound Rules, Cross-Region & HA Ports. For PaaS targets you often want to bypass SNAT entirely, which is Private Endpoint vs Service Endpoint. When egress must be inspected and filtered rather than just translated, that is a different resource — see Azure Firewall: Forced Tunneling & Hub-Spoke Routing. And because SNAT exhaustion shows up on the compute side too, the App Service angle is in Troubleshooting Azure App Service: 502/503, Cold Starts & Restart Loops.

A quick map of where each outbound concern is owned and what it can break, so you call the right person fast:

| Layer | What lives here | Who usually owns it | What it can cause |
|---|---|---|---|
| Workload (VM / pod) | Connection pooling, keepalives, retries | App / dev team | Exhaustion via per-request connections |
| Subnet config | NAT Gateway association, UDRs, NSGs | Platform / network | Wrong egress path, blocked association |
| NAT Gateway | SNAT pool, idle timeout, attached IPs | Network team | Port ceiling, idle reclaim, zone pinning |
| Public IP / prefix | The egress identity you publish | Network team | Allow-list churn if not a prefix |
| Destination (partner) | The single VIP everyone hits | External | The 5-tuple bottleneck that exhausts ports |
| Monitoring | SNATConnectionCount, DroppedPackets | Platform / SRE | Blind to exhaustion until users feel it |

Core concepts

Five mental models make every later decision obvious.

Source NAT is mandatory for outbound, and somebody must supply the public IP. A private IP cannot appear on the public internet; the return packet would have nowhere to go. So every outbound flow has its source private-IP-and-port rewritten to a public IP and a SNAT port. The only question is which resource provides that public IP and how it allocates the ports — and that is the entire subject of this article.

A SNAT “port” is a translation-table entry keyed on the full 5-tuple. The table is keyed on (source IP, source port, destination IP, destination port, protocol). This is the detail that trips everyone up. You are not limited to 64K total connections per public IP; you are limited to 64K connections to the same destination IP and port. A flow to 20.1.2.3:443 and a flow to 20.1.2.4:443 go to different destinations, so they can share the same SNAT port, whereas two flows to the same 20.1.2.3:443 each need a port of their own. Exhaustion therefore almost always means many connections to one destination: a single payment gateway, one storage public endpoint, one upstream behind a single VIP.

NAT Gateway allocates ports on demand from a shared pool; the alternatives pre-carve them. With NAT Gateway, an idle VM in the subnet consumes nothing and a busy one bursts — ports are handed out across all instances in the subnet as needed. With Load Balancer outbound rules you must pre-divide a fixed 64K budget across the backend pool before any traffic flows, and a wrong division either starves VMs or caps scale. Dynamic allocation is the single biggest reason NAT Gateway survives load that exhausts the other two.

Each attached public IP contributes 64,512 SNAT ports, and the maths is linear. One IP gives ~64,512 simultaneous flows to a single destination IP:port; two give ~129,024; the cap of 16 IPs per NAT Gateway gives ~1,032,192. Capacity planning is therefore arithmetic, not guesswork: count your peak concurrent flows to the busiest single destination and divide.

A port is freed only when its flow goes idle past the TCP idle timeout. Until then the entry is held. That is why idle-timeout tuning is part of capacity planning, not an afterthought: a 4-minute timeout recycles short-lived flows aggressively (more effective capacity), while 120 minutes keeps quiet long-lived connections alive (fewer surprise resets) but holds ports longer. The durable fix for connections dying mid-idle is application keepalives, not a longer timeout.

The vocabulary in one table

Pin down every moving part before the deep sections. The glossary repeats these for lookup; this is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters |
|---|---|---|---|
| Source NAT (SNAT) | Rewriting private src IP:port → public IP:port for egress | Performed by the outbound resource | Mandatory for any internet egress |
| SNAT port | One translation-table entry, keyed on the 5-tuple | The NAT resource’s table | The finite thing you exhaust |
| 5-tuple | (src IP, src port, dst IP, dst port, protocol) | Per flow | Same destination = shared budget |
| NAT Gateway | Managed outbound-only resource attached to a subnet | Subnet association | The recommended egress path |
| Public IP prefix | A contiguous CIDR of public IPs | Attached to the NAT Gateway | The allow-listable egress identity |
| Default outbound access | Implicit egress every VM gets if nothing is configured | Platform behaviour | Being retired Sept 2025; unpredictable IP |
| Idle timeout | How long an idle flow holds its SNAT port | NAT Gateway property (4–120 min) | Capacity + connection-reset lever |
| outboundType | AKS cluster egress mode | AKS networking profile | Chooses LB vs NAT Gateway for the cluster |
| DroppedPackets | NAT Gateway metric for dropped flows | Platform metrics | The smoking gun for exhaustion |
| SNATConnectionCount | Established outbound connections through the gateway | Platform metrics | The headline capacity number |

Three outbound paths, and why only one scales

Azure provides three ways to give an outbound flow a public IP, and they are not equivalent. Default outbound is the implicit egress with no configuration; Load Balancer outbound rules let you allocate ports explicitly off an LB frontend; NAT Gateway is the purpose-built, subnet-attached resource. The headline comparison:

| Outbound method | SNAT ports | Egress IP | Allocation model | Recommended? |
|---|---|---|---|---|
| Default outbound access | Implicit, small, shared per host | Unpredictable, Microsoft-owned, can change | Platform-managed, opaque | No; retired Sept 2025 for new VNets |
| Load Balancer outbound rules | 64,000 per frontend IP, pre-divided across backends | The LB’s public IP(s) | Manual, static per-instance | Only if you already run the LB |
| NAT Gateway | ~64,512 per attached public IP, on demand | Your public IP / prefix, stable | Dynamic, shared across the subnet | Yes; the default choice |

The mechanics that actually decide it:

| Dimension | Default outbound | LB outbound rules | NAT Gateway |
|---|---|---|---|
| Port allocation | Implicit, small | Static, pre-carved | Dynamic, on demand |
| Idle VM cost | Holds a small share | Holds its pre-allocated ports | Consumes nothing |
| Burst behaviour | Exhausts fast | Capped by pre-division | Bursts from shared pool |
| Egress-IP stability | None (can change) | Stable (LB frontend) | Stable (your prefix) |
| Allow-listing | Impossible | Possible (LB IP) | Clean (contiguous CIDR) |
| Max public IPs | n/a | Multiple frontends | 16 per NAT GW |
| Ports per public IP | Tiny shared slice | 64,000 to pre-divide | ~64,512 on demand |
| Couples to inbound? | No | Yes (needs an LB) | No (outbound-only) |
| Provisioning effort | None | Rule + budget maths | Resource + association |
| SKU requirement | Any | Standard LB | All Standard in subnet |
| Zone model | Platform | LB zone behaviour | Zonal (1 GW/zone for HA) |
| Future-proof | No (retiring) | Yes | Yes (recommended) |

Three reading notes that save a design review:

| If you are… | The trap | Do this |
|---|---|---|
| Relying on default outbound today | It works until it doesn’t, and it is being retired | Treat it as tech debt; add an explicit NAT Gateway |
| Already running a Standard LB | Tempting to reuse its outbound rules | Fine for small scale; add NAT Gateway if you pre-divide ports |
| Building a new subnet | New VNets lose implicit egress Sept 2025 | Provision NAT Gateway from day one |

For any subnet making meaningful outbound calls, the answer is NAT Gateway. The rest of this article assumes that decision.

Understanding SNAT port allocation

A SNAT “port” is one entry in the translation table, and the table is keyed on the full 5-tuple: source IP, source port, destination IP, destination port, protocol. You are not limited to 64K total connections; you are limited to 64K connections to the same destination IP and port. Spreading the same load across many destinations rarely exhausts ports — exhaustion is a single-destination phenomenon.
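To make the 5-tuple point concrete, here is a minimal toy model of a translation table for a single public IP. It is a sketch only: the class name `ToySnatTable`, its methods, and the lowest-free-port allocation order are illustrative assumptions, not how the Azure platform is implemented. What it shows is that the port budget is counted per destination, so a full budget toward one VIP does not block a flow to a different one.

```python
class ToySnatTable:
    """Toy model (hypothetical): one public IP, ports unique per destination."""

    def __init__(self, ports_per_ip=64_512):
        self.ports_per_ip = ports_per_ip
        # (dst_ip, dst_port, protocol) -> set of SNAT ports in use toward it
        self.in_use = {}

    def open_flow(self, dst_ip, dst_port, proto="tcp"):
        """Allocate a SNAT port for a new flow, or raise if that destination is full."""
        used = self.in_use.setdefault((dst_ip, dst_port, proto), set())
        if len(used) >= self.ports_per_ip:
            raise RuntimeError(f"SNAT exhausted toward {dst_ip}:{dst_port}")
        # Lowest free port first; ports 1024..65535 make up the 64,512 budget.
        port = next(p for p in range(1024, 1024 + self.ports_per_ip) if p not in used)
        used.add(port)
        return port


if __name__ == "__main__":
    table = ToySnatTable(ports_per_ip=3)      # tiny budget so the demo is readable
    for _ in range(3):
        table.open_flow("20.1.2.3", 443)      # fills the budget for this VIP
    print(table.open_flow("20.1.2.4", 443))   # other destination still succeeds
    try:
        table.open_flow("20.1.2.3", 443)      # fourth flow to the same VIP
    except RuntimeError as exc:
        print(exc)
```

The same port number is handed out again for the second destination, which is why sharded or CDN-fronted upstreams rarely exhaust anything.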

Each public IP attached to a NAT Gateway contributes 64,512 SNAT ports (some docs round to 64,000). The arithmetic is linear:

SNAT ports available = 64,512  ×  (number of attached public IPs)

1 public IP   ->  ~64,512 simultaneous flows to a single dest IP:port
2 public IPs  ->  ~129,024
16 public IPs ->  ~1,032,192   (16 = the per-NAT-GW IP/prefix cap)

How the same workload consumes the budget depending on how it is written:

| Connection pattern | Flows to one dest | SNAT pressure | Typical culprit |
|---|---|---|---|
| New HttpClient / socket per request | One per request | Severe; scales with RPS | The classic exhaustion bug |
| Pooled client, no keepalive | One per pool slot, churns on idle | Moderate | Idle reclaim mid-burst |
| Pooled client + keepalive | Reused, long-lived | Low; flat under load | The intended pattern |
| Many destinations (sharded) | Spread across 5-tuples | Low | Naturally exhaustion-resistant |
| Single VIP, high concurrency | All on one 5-tuple | Severe | Payment/bank/storage endpoint |
| Retry storm on failure | Multiplies new flows | Severe; self-worsening | Aggressive retry-on-timeout |
| Short-lived TLS handshakes | New flow per call | High under burst | Webhook/notification fan-out |
| DNS-resolved to rotating IPs | Spread naturally | Low | CDN-fronted upstreams |

The port lifecycle is what ties capacity to the idle timeout:

| Flow state | Holds a SNAT port? | Freed when… | Lever |
|---|---|---|---|
| Active (sending/receiving) | Yes | Connection closes | App connection reuse |
| Idle but open | Yes | Idle timeout elapses | Idle-timeout setting |
| TCP TIME_WAIT | Briefly | OS reclaim window | Keepalive / fewer new flows |
| Closed | No | Immediately | |

A worked sizing example, end to end: at 1,800 requests/second with a new connection per request and a ~4-minute TIME_WAIT, roughly 1,800 × 240 s ≈ 432,000 sockets can be in flight against one destination, far more than the 64,512 ports a single public IP supplies. That is why it fails instantly under flash-sale load and never in a unit test, and why the first fix is connection reuse, with NAT Gateway sizing as the safety margin behind it.
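The estimate above is rate multiplied by hold time (Little's law). A minimal sketch of that arithmetic follows; the function name `entries_held` and the pool size of 200 are illustrative assumptions, and the 240-second hold is the article's ~4-minute figure rather than a measured value.

```python
PORTS_PER_PUBLIC_IP = 64_512  # per-IP figure quoted in this article


def entries_held(new_flows_per_second, hold_seconds):
    """Steady-state entries held = arrival rate x time each entry is held."""
    return round(new_flows_per_second * hold_seconds)


if __name__ == "__main__":
    per_request = entries_held(1_800, 240)   # new connection per request
    pooled = 200                              # hypothetical pool of reused connections
    for label, held in (("per-request", per_request), ("pooled", pooled)):
        verdict = "exceeds" if held > PORTS_PER_PUBLIC_IP else "fits within"
        print(f"{label}: {held:,} entries, {verdict} one public IP")
```

The pooled case holds a flat number of entries regardless of request rate, which is the reason connection reuse comes before any sizing change.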

Provision the NAT Gateway, public IP, and prefix

You need three resources: the NAT Gateway, at least one Standard-SKU public IP (or a public IP prefix), and a subnet association. Prefer a prefix to individual IPs — the contiguous CIDR is exactly what you publish to partners for allow-listing, and you can scale within it without changing what they whitelist.

Azure CLI:

LOC=eastus
RG=rg-egress-prod

# A /28 prefix = 16 contiguous IPs. Pick the size from the sizing section.
az network public-ip prefix create \
  --resource-group $RG \
  --name pip-prefix-natgw \
  --length 28 \
  --location $LOC \
  --version IPv4

# NAT Gateway. Idle timeout in minutes (default 4, max 120).
az network nat gateway create \
  --resource-group $RG \
  --name natgw-prod \
  --location $LOC \
  --public-ip-prefixes pip-prefix-natgw \
  --idle-timeout 10

The same thing in Bicep, which is what you actually want in the repo:

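// Deploy into the resource group's region unless overridden.
param location string = resourceGroup().location

resource pipPrefix 'Microsoft.Network/publicIPPrefixes@2023-11-01' = {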
  name: 'pip-prefix-natgw'
  location: location
  sku: { name: 'Standard' }
  properties: {
    prefixLength: 28
    publicIPAddressVersion: 'IPv4'
  }
}

resource natgw 'Microsoft.Network/natGateways@2023-11-01' = {
  name: 'natgw-prod'
  location: location
  sku: { name: 'Standard' }
  properties: {
    idleTimeoutInMinutes: 10
    publicIpPrefixes: [
      { id: pipPrefix.id }
    ]
  }
}

NAT Gateway has exactly one SKU (Standard), so there is no SKU decision to make — only IP count and idle timeout. The complete property surface:

| Property | Values | Default | When to change | Trade-off / limit |
|---|---|---|---|---|
| sku.name | Standard only | Standard | Never (no choice) | Cannot attach to Basic-SKU resources |
| idleTimeoutInMinutes | 4–120 | 4 | Long-lived idle flows being reset | Higher holds ports longer |
| publicIpAddresses | 0–16 individual IPs | none | Need fixed single IPs | Counts toward the 16-IP cap |
| publicIpPrefixes | prefix(es) totalling ≤16 IPs | none | Allow-listable CIDR (preferred) | Largest single prefix is /28 |
| zones | none / 1 / 2 / 3 | none (regional) | Pin egress to a zone for AZ design | One GW cannot span zones |
| subnets | association(s) | none | Attach to workload subnet(s) | A subnet has only one NAT GW |

Public IP vs public IP prefix — pick the prefix for anything partners allow-list:

| Aspect | Individual public IP(s) | Public IP prefix |
|---|---|---|
| Allow-listing | Each IP listed separately | One contiguous CIDR |
| Scaling egress | Add/remove single IPs | Grow within the prefix |
| Partner churn | Allow-list changes per IP | Allow-list never changes |
| Granularity | Exact count | Powers of two (/31…/28) |
| Best for | One or two fixed IPs | Production egress identity |

Zone behaviour you must design around — NAT Gateway is a zonal resource:

| Deployment | --zone | Resilience | Use when |
|---|---|---|---|
| Non-zonal (regional) | omitted | Single regional resource | Dev / non-HA egress |
| Zonal | 1, 2, or 3 | Pinned to one zone’s fate | Part of a per-zone HA design |
| Zone-redundant egress | one GW per zone | Survives a zone loss | Production multi-zone workloads |

A single NAT Gateway cannot span zones. For zone-redundant egress you deploy one NAT Gateway per zone, each on its own subnet — covered in the AKS section, where it matters most.

Associate to subnets and the precedence rules

A NAT Gateway is attached to a subnet, and once attached it becomes the outbound path for every resource in that subnet. This is where the ordering rules matter, because they decide what wins when multiple egress configs coexist.

az network vnet subnet update \
  --resource-group $RG \
  --vnet-name vnet-prod \
  --name snet-workload \
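  --nat-gateway natgw-prod

The same association in Bicep:

// Assumes the VNet already exists in this resource group.
resource vnet 'Microsoft.Network/virtualNetworks@2023-11-01' existing = {
  name: 'vnet-prod'
}

resource subnet 'Microsoft.Network/virtualNetworks/subnets@2023-11-01' = {
  parent: vnet
  name: 'snet-workload'
  properties: {
    addressPrefix: '10.20.1.0/24'
    natGateway: { id: natgw.id }
  }
}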

The precedence Azure applies for outbound, highest to lowest:

| Priority | Outbound method | Wins over | Notes |
|---|---|---|---|
| 1 | NAT Gateway on the subnet | Everything below | Overrides LB outbound rules and instance-level IP for egress |
| 2 | Instance-level public IP (on the NIC) | LB rules, default | Used for outbound only if no NAT Gateway |
| 3 | Load Balancer outbound rules | Default | Used only if neither above applies |
| 4 | Default outbound access | | Last resort; retiring for new VNets |
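The precedence order reads naturally as a chain of checks. The sketch below encodes it as described in this article; the function name `effective_outbound` and its boolean parameters are hypothetical and exist only to make the ordering testable, not to mirror any Azure API.

```python
def effective_outbound(nat_gateway_on_subnet, instance_public_ip, lb_outbound_rule):
    """Return which method carries egress, following the precedence listed above."""
    if nat_gateway_on_subnet:
        return "NAT Gateway"                    # priority 1: always wins for outbound
    if instance_public_ip:
        return "instance-level public IP"       # priority 2
    if lb_outbound_rule:
        return "Load Balancer outbound rule"    # priority 3
    return "default outbound access"            # priority 4: last resort


if __name__ == "__main__":
    # A NIC with its own public IP in a subnet that also has a NAT Gateway:
    print(effective_outbound(True, True, False))
```

This is the case that surprises people: the NIC keeps its public IP for inbound, but outbound traffic leaves through the NAT Gateway.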

Two rules that save you a debugging session:

| Rule | What it means | Consequence if ignored |
|---|---|---|
| NAT Gateway wins outbound even if a NIC has its own public IP | Inbound stays on the NIC IP; outbound goes via NAT GW | Surprised that egress IP “changed” after attaching NAT GW |
| Everything in the subnet must be Standard SKU | No Basic LB or Basic public IP allowed | Association silently blocked / unsupported |

Association scope and limits, enumerated:

| Capability | Allowed? | Detail |
|---|---|---|
| One NAT GW → many subnets (same VNet) | Yes | Shares the SNAT pool across them |
| One subnet → many NAT GWs | No | A subnet has exactly one NAT GW |
| NAT GW → subnets in different VNets | No | Same-VNet only |
| Subnet contains a Basic-SKU resource | No | Must be all Standard SKU |
| Attach to a gateway subnet (VPN/ER) | No | Not supported on gateway subnets |
| Coexist with NSG / UDR on the subnet | Yes | NSG/UDR still apply to the flow |
| Coexist with a Standard LB (inbound) | Yes | NAT GW handles outbound, LB inbound |
| Inbound on a NIC’s own public IP | Yes | Inbound stays on the NIC IP |
| Span availability zones with one GW | No | Zonal; one GW per zone for HA |
| Reuse one prefix across many GWs | No | A prefix attaches to one GW |

Size the prefix to your connection count

This is the capacity-planning step people skip and then page about. Work backwards from the worst-case concurrent flows to a single destination:

required public IPs = ceil( peak_concurrent_flows_to_one_dest / 64,512 )
prefix length       = longest /N (smallest block) whose address count 2^(32-N) >= required IPs
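The two formulas can be sketched as a small helper. This is a minimal illustration under the article's numbers (64,512 ports per IP, prefixes from /31 to /28); the name `size_prefix` is an assumption, and the function returns the minimum that fits, so any headroom you want has to be added to the input.

```python
import math

PORTS_PER_PUBLIC_IP = 64_512
# Prefix sizes listed in this article, smallest block first: (length, addresses)
PREFIX_SIZES = [(31, 2), (30, 4), (29, 8), (28, 16)]


def size_prefix(peak_flows_to_one_dest):
    """Return (required_ips, prefix_length) for the busiest single destination."""
    required_ips = max(1, math.ceil(peak_flows_to_one_dest / PORTS_PER_PUBLIC_IP))
    for length, addresses in PREFIX_SIZES:
        if addresses >= required_ips:
            return required_ips, length
    # More than 16 IPs cannot sit on one NAT Gateway.
    raise ValueError("exceeds one NAT Gateway; split across subnets and gateways")


if __name__ == "__main__":
    for flows in (50_000, 140_000, 400_000, 900_000):
        ips, length = size_prefix(flows)
        print(f"{flows:>9,} flows -> {ips:>2} IPs -> /{length}")
```

Raising an error above 16 IPs is a deliberate choice here: past that point the answer is a topology change, not a larger number.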

The prefix-to-capacity table:

| Prefix | IPs | Approx SNAT ports (to one dest:port) | Covers peak flows up to |
|---|---|---|---|
| /31 | 2 | ~129,024 | ~129K |
| /30 | 4 | ~258,048 | ~258K |
| /29 | 8 | ~516,096 | ~516K |
| /28 | 16 | ~1,032,192 | ~1.03M |

The hard ceiling: a single NAT Gateway supports a maximum of 16 public IP addresses (individual IPs and prefixes combined). A /28 is the largest single prefix that fits, giving ~1.03M ports. Need more than 16 IPs to one destination? Split the workload across multiple subnets, each with its own NAT Gateway and prefix — the documented scaling pattern, not a workaround.

The limits and fixed numbers worth keeping on one screen:

| Limit / quantity | Value | Why it matters |
|---|---|---|
| SNAT ports per public IP | ~64,512 | The divisor in all sizing maths |
| Max public IPs per NAT Gateway | 16 | Hard ceiling; ~1.03M ports max |
| Largest single prefix on one GW | /28 (16 IPs) | The biggest contiguous block per GW |
| Idle timeout range | 4–120 minutes | Capacity vs reset trade-off |
| Default idle timeout | 4 minutes | Aggressive reclaim out of the box |
| SKU choices | Standard only | No SKU decision to make |
| NAT GWs per subnet | 1 | A subnet has exactly one |
| Subnets per NAT GW | Many (same VNet) | Shares the pool across them |
| Zones a single GW spans | 1 | Zonal; needs one per zone for HA |
| Default-outbound retirement (new VNets) | 30 Sep 2025 | Why new subnets need explicit egress |

Worked sizing for four realistic workloads, plus the multi-zone and beyond-/28 cases:

| Workload | Peak flows to one dest | IPs needed: ceil(flows / 64,512) | Prefix to provision | Headroom |
|---|---|---|---|---|
| Webhook fan-out | ~50,000 | 1 IP | /31 (2 IPs) | ~2.5× |
| Notification service | ~140,000 | 3 IPs | /30 (4 IPs) | ~1.8× |
| Bulk reconciliation | ~400,000 | 7 IPs | /29 (8 IPs) | ~1.3× |
| Multi-tenant scraper | ~900,000 | 14 IPs | /28 (16 IPs) | ~1.1× |
| Multi-zone notification | ~120K per zone | 2 IPs/zone | /30 per zone | per-zone GW |
| Beyond /28 | > 1.03M | > 16 IPs | Split across subnets | per-subnet GW |

Worked example in prose: a notification service holds ~140,000 simultaneous TLS connections to one provider VIP at peak. ceil(140000 / 64512) = 3 IPs. A /30 (4 IPs) covers it with headroom; provision it and hand the partner that 4-address CIDR. Do not provision a /28 “to be safe”: you pay per IP, and you can attach a second prefix later without downtime.

Sizing anti-patterns to avoid:

| Anti-pattern | Why it hurts | Better |
|---|---|---|
| Provision /28 “to be safe” | Pay for 16 IPs you do not use | Size to peak + modest headroom |
| Size by total connections, not per-destination | Over- or under-provisions wildly | Count flows to the busiest single dest |
| One giant prefix beyond /28 | Exceeds the 16-IP cap | Split across subnets/NAT GWs |
| Loose individual IPs for a partner | Allow-list churns on scale | Use a contiguous prefix |
| Ignore idle timeout in the maths | Held ports inflate live count | Tune idle timeout + keepalives |

Integrating with AKS for stable, allow-listable egress

AKS defaults its outboundType to loadBalancer, which puts cluster egress behind the Standard LB and inherits its SNAT-allocation headaches at high pod density. Switching to managedNATGateway (Azure provisions and owns the gateway) or userAssignedNATGateway (you bring your own on the node subnet) fixes both the exhaustion and the IP-stability problems in one move.

This must be chosen at cluster creation: outboundType is largely immutable, with only specific migration paths supported, so design it in from day one for any cluster that calls allow-listed external endpoints. The four modes:

| outboundType | Who owns the NAT GW | Egress IP control | SNAT scaling | Choose when |
|---|---|---|---|---|
| loadBalancer (default) | AKS-managed LB | LB outbound IPs | LB pre-division limits | Low egress concurrency only |
| managedNATGateway | Azure provisions it | Azure-allocated IP(s) | On demand, large pool | You want NAT GW without owning IPs |
| userAssignedNATGateway | You (on node subnet) | Your exact prefix | On demand, large pool | Partner allow-lists your CIDR |
| userDefinedRouting | You (via UDR/firewall) | Firewall/NVA IP | N/A (egress through NVA) | Egress must be inspected/filtered |

Managed NAT Gateway (Azure provisions and owns it):

az aks create \
  --resource-group rg-aks-prod \
  --name aks-prod \
  --network-plugin azure \
  --outbound-type managedNATGateway \
  --nat-gateway-managed-outbound-ip-count 2 \
  --nat-gateway-idle-timeout 10 \
  --generate-ssh-keys

User-assigned, when you must control the exact egress prefix (the common enterprise case — partners allow-list your CIDR):

# NAT Gateway + prefix already created and attached to the AKS node subnet,
# then point the cluster at that subnet with a userAssignedNATGateway type.
az aks create \
  --resource-group rg-aks-prod \
  --name aks-prod \
  --network-plugin azure \
  --vnet-subnet-id "$NODE_SUBNET_ID" \
  --outbound-type userAssignedNATGateway \
  --generate-ssh-keys

The AKS-specific knobs and their constraints:

| Flag / setting | What it controls | Default | Constraint |
|---|---|---|---|
| --outbound-type | Cluster egress mode | loadBalancer | Set at create; limited migration paths |
| --nat-gateway-managed-outbound-ip-count | Managed-mode IP count | 1 | Up to 16; each is 64,512 ports |
| --nat-gateway-idle-timeout | Managed-mode idle timeout | 4 | 4–120 minutes |
| --vnet-subnet-id (user-assigned) | Node subnet with your NAT GW | | NAT GW must be pre-attached |
| --network-plugin | CNI choice | azure/kubenet | Egress design independent of plugin |

For a zone-redundant cluster the node pools span zones, but a NAT Gateway is single-zone. The correct topology is one node-pool subnet per availability zone, each with its own zonal NAT Gateway and prefix. Pods in zone 1 egress through the zone-1 NAT Gateway, and so on; partners allow-list the union of the zonal prefixes. Do not share one NAT Gateway across a multi-zone node subnet — it pins your egress to a single zone’s fate. The zone-redundant layout:

| Zone | Node-pool subnet | Zonal NAT GW | Prefix | Egress IPs published |
|---|---|---|---|---|
| 1 | snet-aks-z1 | natgw-z1 (--zone 1) | /30 | CIDR-1 |
| 2 | snet-aks-z2 | natgw-z2 (--zone 2) | /30 | CIDR-2 |
| 3 | snet-aks-z3 | natgw-z3 (--zone 3) | /30 | CIDR-3 |
| Partner allow-list | | | | Union of CIDR-1…3 |
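The allow-list a partner needs is simply the union of the zonal prefixes. The sketch below builds it with Python's standard `ipaddress` module; the three /30 ranges are placeholders from the 203.0.113.0/24 documentation block, not real allocations, and the helper names are hypothetical.

```python
import ipaddress

# Hypothetical zonal prefixes (documentation address range, not real egress IPs).
ZONAL_PREFIXES = {
    1: "203.0.113.0/30",
    2: "203.0.113.4/30",
    3: "203.0.113.8/30",
}


def partner_allow_list(prefixes):
    """Parse and sort the CIDRs a partner must allow."""
    return sorted(ipaddress.ip_network(cidr) for cidr in prefixes)


def is_allowed(source_ip, allow_list):
    """True if an observed source IP falls inside any published prefix."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in allow_list)


if __name__ == "__main__":
    allow_list = partner_allow_list(ZONAL_PREFIXES.values())
    print([str(network) for network in allow_list])
    print(is_allowed("203.0.113.5", allow_list))    # inside the zone-2 prefix
    print(is_allowed("203.0.113.12", allow_list))   # outside all three
```

A check like `is_allowed` is also handy on your own side, for confirming that the source IP a partner reports in a rejection actually belongs to one of your gateways.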

More on the day-two operational side of clusters like this is in AKS Day-Two: Upgrades & Fleet Operations.

Tune TCP idle timeout and watch the metrics

The TCP idle timeout governs how long an idle flow holds its SNAT port before reclaim. Lowering it (minimum 4 minutes) frees ports faster and raises effective capacity. Raising it (up to 120 minutes) keeps quiet but long-lived connections alive — useful for databases or brokers that idle between bursts but should not be torn down.

az network nat gateway update \
  --resource-group $RG \
  --name natgw-prod \
  --idle-timeout 4

The idle-timeout decision, both directions:

| Setting | Effect on capacity | Effect on long-lived flows | Pick when |
|---|---|---|---|
| 4 min (minimum) | Frees ports fastest | May reset quiet connections | High-churn, short-lived flows |
| 10 min (common default-bump) | Balanced | Tolerates brief idleness | General production |
| 30–60 min | Holds ports longer | Keeps brokers/DBs alive | Bursty long-lived sessions |
| 120 min (maximum) | Holds ports longest | Rarely resets anything | Last resort; prefer keepalives |

Do not solve idle-timeout pain by cranking it to 120. The durable fix for connections dying mid-idle is application-level TCP keepalives (or HTTP keep-alive / connection pooling) that send traffic before the timeout. Keepalives reset the idle timer and are far more robust than betting that no flow ever idles longer than your window. Idle timeout vs keepalive, head to head:

| Approach | Where it lives | Robustness | Side effect |
|---|---|---|---|
| Raise NAT idle timeout | NAT Gateway property | Bet on no flow idling longer | Holds ports; lowers capacity |
| App TCP keepalive | Socket options / client | Resets the timer reliably | Tiny keepalive traffic |
| HTTP keep-alive / pooling | HTTP client config | Reuses connections, fewer flows | Needs correct pool sizing |
| Both (recommended) | NAT + app | Most robust | Minimal |
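As an illustration of the application-side keepalive row, the sketch below enables TCP keepalive on a Python socket. The helper name `enable_keepalive` and the 120/30/4 values are assumptions chosen so the first probe fires well inside a 4-minute idle timeout; the per-socket tuning constants differ by operating system, so the code sets only those the platform exposes.

```python
import socket


def enable_keepalive(sock, idle_s=120, interval_s=30, probes=4):
    """Turn on TCP keepalive so an idle connection still sends periodic traffic."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Tuning options are platform-specific; apply only the ones that exist here.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the OS gives up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

Most HTTP clients and database drivers expose an equivalent setting, so in practice this is usually a configuration change rather than raw socket code.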

NAT Gateway emits metrics you should alert on, in Microsoft.Network/natGateways:

| Metric | What it measures | Alert when | Why it matters |
|---|---|---|---|
| SNATConnectionCount | Established outbound connections | Approaching ceiling | The headline capacity number |
| TotalConnectionCount | Active flows through the gateway | Sustained near limit | Corroborates pressure |
| DroppedPackets | Packets/flows dropped | > 0 sustained | The smoking gun for exhaustion |
| PacketCount | Packets processed | Throughput baseline | Capacity/throughput trend |
| ByteCount | Bytes processed | Egress-cost tracking | Bill driver (per-GB) |
| SNATConnectionCount (by direction) | Inbound vs outbound flows | Skew vs expectation | Confirms it is egress, not return |
| Datapath availability | Gateway health | Below 100% | Rules out a platform-side fault |

A KQL query against the platform metrics (or an Azure Monitor metric alert) to catch exhaustion before users do:

AzureMetrics
| where ResourceProvider == "MICROSOFT.NETWORK"
| where Resource == "NATGW-PROD"
// The "Dropped Packets" metric is exported under the ID PacketDropCount.
| where MetricName in ("PacketDropCount", "DroppedPackets", "SNATConnectionCount")
| summarize Total = sum(Total) by bin(TimeGenerated, 5m), MetricName
| order by TimeGenerated desc

Know before an incident which changes are online (no egress interruption) and which force a heavier operation; the in-the-moment capacity bumps are all online:

| Change | Disruptive? | How | Use during an incident? |
|---|---|---|---|
| Attach another public IP or prefix to the gateway | No | az network nat gateway update --public-ip-addresses / --public-ip-prefixes | Yes; fastest capacity bump |
| Lower the idle timeout | No | az network nat gateway update --idle-timeout | Yes; frees ports faster |
| Raise the idle timeout | No | same flag | Yes, but prefer keepalives |
| Attach NAT GW to another subnet | No | subnet update --nat-gateway | Yes; extend coverage |
| Detach NAT GW from a subnet | Egress reverts | subnet update --remove natGateway | Cautiously; egress path changes |
| Shrink/replace the prefix | Yes | recreate prefix | No; plan it |
| Change AKS outboundType | Yes (recreate) | cluster recreate / migration path | No; design up front |

Set a metric alert on DroppedPackets > 0 over 5 minutes. Any sustained drops mean you are at the ceiling: attach another public IP or prefix to the gateway (no downtime; a prefix itself cannot be resized after creation) or lower the idle timeout, then re-check. Recommended starting thresholds:

| Alert | Metric | Threshold (starting point) | Action on fire |
|---|---|---|---|
| Exhaustion (hard) | DroppedPackets | > 0 for 5 min | Attach another IP or prefix; lower idle timeout |
| Capacity creep | SNATConnectionCount | > 80% of 64,512 × IPs | Plan a prefix bump |
| Flow surge | TotalConnectionCount | > your modelled peak | Investigate connection reuse |
| Egress cost | ByteCount | > budget for the month | Review chatty callers / data path |
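The first two thresholds reduce to simple comparisons, sketched below. The function name `evaluate_alerts` and the alert labels are hypothetical, and the inputs are assumed to be values already aggregated over the 5-minute window; in practice this logic lives in an Azure Monitor alert rule rather than in application code.

```python
PORTS_PER_PUBLIC_IP = 64_512  # per-IP figure quoted in this article


def evaluate_alerts(dropped_packets, snat_connections, public_ips, creep_ratio=0.80):
    """Return the names of the starting-point alerts that would fire."""
    alerts = []
    if dropped_packets > 0:
        alerts.append("exhaustion")        # hard signal: already at the ceiling
    ceiling = PORTS_PER_PUBLIC_IP * public_ips
    if snat_connections > creep_ratio * ceiling:
        alerts.append("capacity-creep")    # soft signal: plan more capacity
    return alerts


if __name__ == "__main__":
    # A /30 (4 IPs) gives a ceiling of 258,048; 80% of that is about 206,438.
    print(evaluate_alerts(dropped_packets=0, snat_connections=210_000, public_ips=4))
```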

The deeper observability story — workbooks, alert routing, action groups — is in Azure Monitor Deep Dive: Every Option.

Architecture at a glance

Follow a single outbound request left to right and the whole design clicks into place. A workload — a VM or, more often, a fleet of AKS pods — sits in a workload subnet and opens a TCP connection to an external API. Because the subnet has a NAT Gateway associated, the platform intercepts that flow at egress and performs source NAT: the pod’s private 10.20.1.x address and ephemeral port are rewritten to one of the public IPs in the attached prefix (say 203.0.113.0/30) and an allocated SNAT port from the 64,512-per-IP pool. The translated packet leaves through the NAT Gateway’s stable public identity and arrives at the destination — which, critically, is usually a single VIP (one payment gateway, one bank API). The partner sees a source IP inside the prefix it already allow-listed, and the return traffic flows back through the same translation.

The diagram makes the failure map explicit. Badge 1 marks the workload tier, where a per-request connection pattern manufactures the exhaustion in the first place — the fix lives in code (pooling and keepalives), not in the network. Badge 2 sits on the NAT Gateway, the resource whose SNATConnectionCount and DroppedPackets you watch and whose idle timeout and attached-IP count you tune. Badge 3 is on the public IP prefix — the allow-listable CIDR that must stay stable as you scale within it. Badge 4 is on the single destination VIP, the 5-tuple bottleneck that turns “lots of connections” into “exhaustion,” because every flow shares one (dst IP, dst port). Badge 5 maps the zone-redundant concern: a NAT Gateway is single-zone, so a multi-zone cluster needs one gateway per zone or its egress pins to a single zone’s fate. Trace those five numbered points and you can localise any egress incident to exactly one of them.

Figure: Architecture of deterministic Azure NAT Gateway egress. An AKS/VM workload subnet on 10.20.1.0/24 egresses through a Standard NAT Gateway that performs source NAT onto an attached public IP prefix (203.0.113.0/30, ~258K SNAT ports), out to a single external destination VIP on port 443, with Azure Monitor watching SNATConnectionCount and DroppedPackets. Numbered badges mark (1) the per-request connection bug at the workload, (2) the NAT Gateway idle-timeout and IP-count tuning point, (3) the allow-listable prefix that must stay stable, (4) the single-destination 5-tuple bottleneck that causes exhaustion, and (5) the zone-redundancy pin where one gateway cannot span zones.

Real-world scenario

A fintech platform team — call them PaySettle — ran a payment-reconciliation service on AKS that, at end-of-day batch, opened tens of thousands of short-lived HTTPS connections to a single acquiring-bank API behind one VIP. The cluster used the default outboundType: loadBalancer. Around 18:00 daily, reconciliation jobs failed with connection timeouts and recovered on their own by 18:30. Nobody could reproduce it off-peak, and three sprints of “add retries” and “increase the timeout” had changed nothing.

The constraint was twofold. The bank required PaySettle to allow-list a fixed set of source IPs, so any fix had to produce a stable, declared egress CIDR — ruling out default outbound entirely. And the LB outbound rules pre-allocated SNAT ports across the node pool; because every connection targeted the same destination IP and port, the 5-tuple table for that one destination hit its ceiling exactly when batch concurrency peaked. The dropped-flow window lined up perfectly with the failures. The incident timeline made the mechanism unmistakable:

| Time | Observation | Signal | Interpretation |
| --- | --- | --- | --- |
| 17:55 | Batch ramps up | TotalConnectionCount climbing | Concurrency rising toward peak |
| 18:02 | First timeouts | DroppedPackets > 0 | SNAT ceiling hit for the bank VIP |
| 18:10 | Job retry storm | More new connections | Retries worsen exhaustion |
| 18:28 | Batch tapers | DroppedPackets → 0 | Ports freed; “self-heals” |
| Next day | Same window | Identical shape | Deterministic, load-driven |

They rebuilt the node pools on subnets fronted by a user-assigned NAT Gateway with a /30 prefix (4 IPs, ~258K ports to that destination, roughly double the observed peak), handed the bank that 4-address CIDR, and dropped the idle timeout to 4 minutes to recycle the short-lived flows aggressively. As the cluster was zone-redundant, they provisioned one zonal NAT Gateway and prefix per zone and gave the bank the union of the three CIDRs. They also fixed the application to reuse a pooled HTTPS client with keepalives, so the live flow count fell even before the extra ports mattered.

# Per zone: dedicated subnet, zonal NAT GW, /30 prefix on the node subnet.
az network public-ip prefix create -g rg-pay -n pip-prefix-z1 --length 30 --location eastus --zone 1
az network nat gateway create -g rg-pay -n natgw-z1 --location eastus \
  --public-ip-prefixes pip-prefix-z1 --idle-timeout 4 --zone 1
az network vnet subnet update -g rg-pay --vnet-name vnet-pay \
  --name snet-aks-z1 --nat-gateway natgw-z1
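
The snippet above covers zone 1 only. A minimal sketch of repeating it for every zone follows, assuming the same naming pattern (pip-prefix-zN, natgw-zN, snet-aks-zN) holds for each zone and that the per-zone subnets already exist. The function name and the dry-run wrapper are hypothetical additions; by default the script prints the commands rather than running them.

```bash
#!/usr/bin/env bash
# Sketch only: repeats the zone-1 commands for each zone passed in.

# Print commands by default; set DRY_RUN=0 to execute them for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# provision_zonal_egress RG LOCATION VNET ZONE [ZONE...]
# One /30 prefix, one zonal NAT Gateway, and one subnet association per zone.
provision_zonal_egress() {
  local rg=$1 loc=$2 vnet=$3 zone
  shift 3
  for zone in "$@"; do
    run az network public-ip prefix create -g "$rg" -n "pip-prefix-z$zone" \
      --length 30 --location "$loc" --zone "$zone"
    run az network nat gateway create -g "$rg" -n "natgw-z$zone" --location "$loc" \
      --public-ip-prefixes "pip-prefix-z$zone" --idle-timeout 4 --zone "$zone"
    run az network vnet subnet update -g "$rg" --vnet-name "$vnet" \
      --name "snet-aks-z$zone" --nat-gateway "natgw-z$zone"
  done
}

# Preview the nine commands for a three-zone cluster.
provision_zonal_egress rg-pay eastus vnet-pay 1 2 3
```

Review the printed commands first, then rerun with DRY_RUN=0 once the names match your environment.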

The 18:00 failures stopped on the first batch after cutover. DroppedPackets has been flat at zero since, and the bank’s allow-list never has to change because all future scaling happens within the published prefixes. Before-and-after, the change was stark:

| Metric | Before (LB outbound) | After (NAT GW /30 per zone) |
| --- | --- | --- |
| Egress IP | LB frontend, shared | 3 stable /30 prefixes |
| Ports to bank VIP | Pre-divided, capped | ~258K per zone, on demand |
| 18:00 failures | Daily, ~30 min | Zero |
| DroppedPackets | Non-zero at peak | Flat at zero |
| Allow-list churn | Risk on every scale | None (scale within prefix) |
| Idle timeout | LB default | 4 min + app keepalives |

Advantages and disadvantages

NAT Gateway is the right default for egress, but it is not free of trade-offs. The explicit two-column view:

| Advantages | Disadvantages |
| --- | --- |
| On-demand SNAT from a large shared pool (no pre-carving) | Outbound-only: does not handle inbound at all |
| Stable, allow-listable public IP / prefix | Zonal resource: multi-zone HA needs one GW per zone |
| Survives load that exhausts default / LB outbound | Adds an hourly + per-GB cost vs free default outbound |
| Fully managed, highly available, single SKU | 16-IP-per-GW cap; beyond that you split subnets |
| Decouples egress identity from inbound resources | All subnet resources must be Standard SKU |
| Simple association model (attach to subnet) | No L7 features (no filtering/inspection; that is Azure Firewall) |
| AKS-native via outboundType | outboundType largely immutable post-create |

When each side matters:

| Decision factor | Favours NAT Gateway | Favours an alternative |
| --- | --- | --- |
| Need stable egress IP to allow-list | Strongly | |
| High concurrency to one destination | Strongly | |
| Need to filter/inspect egress | | Azure Firewall (forced tunneling) |
| Egress is to Azure PaaS only | Maybe | Private Endpoint (bypasses SNAT) |
| Already run a Standard LB at small scale | Optional | Reuse LB outbound rules |
| New VNet (post Sep 2025) | Yes | None (default outbound unavailable) |
| AKS at high pod density | Strongly | |
| Need zone-redundant egress | Yes (one GW/zone) | |
| Lowest possible cost, non-prod | | Default outbound (while it lasts) |

For egress that must be inspected rather than merely translated, NAT Gateway is the wrong tool — that is Azure Firewall: Forced Tunneling & Hub-Spoke Routing. For traffic to Azure PaaS that should never touch the public internet at all, prefer Private Endpoint vs Service Endpoint.

Hands-on lab

Provision a NAT Gateway with a prefix, attach it to a subnet, and prove your egress IP is the one you provisioned — free-tier-friendly and fully torn down at the end. Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

RG=rg-natgw-lab
LOC=centralindia
VNET=vnet-lab
SUBNET=snet-workload
NATGW=natgw-lab
PREFIX=pip-prefix-lab
az group create -n $RG -l $LOC -o table

Step 2 — Create a VNet and a workload subnet.

az network vnet create -g $RG -n $VNET --address-prefix 10.20.0.0/16 \
  --subnet-name $SUBNET --subnet-prefix 10.20.1.0/24 -o table

Expected: a VNet with one subnet on 10.20.1.0/24.

Step 3 — Create a /31 public IP prefix (2 IPs — plenty for a lab).

az network public-ip prefix create -g $RG -n $PREFIX --length 31 --location $LOC --version IPv4 -o table

Expected: a prefix resource showing an ipPrefix like x.x.x.x/31.

Step 4 — Create the NAT Gateway with the prefix and a 4-minute idle timeout.

az network nat gateway create -g $RG -n $NATGW --location $LOC \
  --public-ip-prefixes $PREFIX --idle-timeout 4 -o table

Expected: a NAT Gateway, sku.name = Standard, idleTimeoutInMinutes = 4.

Step 5 — Associate the NAT Gateway to the subnet.

az network vnet subnet update -g $RG --vnet-name $VNET --name $SUBNET --nat-gateway $NATGW -o table

Step 6 — Confirm the association and attached IPs from the control plane.

az network nat gateway show -g $RG -n $NATGW \
  --query "{idle:idleTimeoutInMinutes, prefixes:publicIpPrefixes[].id, subnets:subnets[].id}" -o jsonc

Expected: idle: 4, your prefix listed, and the workload subnet listed under subnets.

Step 7 — Prove egress (optional, needs a VM in the subnet). Deploy a tiny VM into snet-workload, then from inside it the echo services must return an IP from your prefix:

# From a VM/debug pod INSIDE the subnet — should return an IP from your prefix, every time.
curl -s https://ifconfig.me ; echo
curl -s https://api.ipify.org ; echo
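
Eyeballing whether the returned address "falls inside your prefix" is easy for a /31 and error-prone for larger prefixes. A minimal sketch of doing the check in the shell follows; the helper names ip_to_int and ip_in_cidr are hypothetical, and the code handles IPv4 only.

```bash
#!/usr/bin/env bash
# Sketch: check that an egress IP falls inside a public IP prefix (IPv4 only).

# ip_to_int A.B.C.D -> the address as a 32-bit integer
ip_to_int() {
  local ip=$1 a b c d
  a=${ip%%.*}; ip=${ip#*.}
  b=${ip%%.*}; ip=${ip#*.}
  c=${ip%%.*}; d=${ip#*.}
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# ip_in_cidr IP CIDR -> exit status 0 if IP is inside CIDR, 1 otherwise
ip_in_cidr() {
  local ip=$1 net=${2%/*} len=${2#*/} mask
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Usage (commented out so the sketch has no side effects).
# Fetch the CIDR wherever az is signed in; run curl from inside the subnet.
# PREFIX_CIDR=$(az network public-ip prefix show -g "$RG" -n "$PREFIX" --query ipPrefix -o tsv)
# EGRESS_IP=$(curl -s https://api.ipify.org)
# ip_in_cidr "$EGRESS_IP" "$PREFIX_CIDR" && echo "egress OK" || echo "egress NOT in prefix"
```

Because the function only returns an exit status, it drops straight into a smoke test or a post-deployment check.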

Validation checklist. You provisioned a NAT Gateway with an allow-listable prefix, attached it to a subnet, confirmed the association from the control plane, and (optionally) verified the egress IP falls inside your prefix. The steps mapped to what each proves:

| Step | What you did | What it proves |
| --- | --- | --- |
| 3 | Create a public IP prefix | The contiguous CIDR you would publish |
| 4 | Create NAT GW + idle timeout | Single SKU; idle timeout is the lever |
| 5 | Associate to subnet | Egress for the whole subnet now flows through it |
| 6 | Show association | The control-plane confirmation path |
| 7 | Egress echo from inside | The egress IP is your prefix, not default outbound |

Cleanup (avoid lingering charges).

az group delete -n $RG --yes --no-wait

Cost note. A NAT Gateway plus a tiny prefix is a few rupees per hour; an hour of this lab is well under ₹50, and deleting the resource group stops everything. Skip Step 7’s VM and the lab is nearly free.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First the error strings exhaustion actually produces (it rarely says “SNAT” — it shows up as generic connection failures), then the scannable symptom table, then the expanded reasoning for the entries that bite hardest.

SNAT exhaustion is an outbound failure, so the error surfaces in your application’s client, not in an HTTP status from the platform. The strings to recognise:

| Observed error (client side) | What it usually means | How to confirm it is SNAT | First fix |
| --- | --- | --- | --- |
| connection timed out to one host under load | New flow could not get a port | DroppedPackets > 0 at that time | Connection reuse; attach another IP to the NAT GW |
| connection reset by peer mid-session | Idle flow’s port reclaimed | Resets cluster at the idle-timeout interval | Keepalives; raise idle timeout modestly |
| EADDRNOTAVAIL / “address not available” | OS/port allocation pressure | Many flows to one dst; SNATConnectionCount high | Pooled client; fewer concurrent flows |
| Sporadic TLS handshake failures at peak | New handshakes starved of ports | Failure window == load peak | Size prefix to peak; shard destinations |
| 5xx from your service calling an upstream | Upstream call failed, not the upstream itself | App Insights dependency failures to one target | Fix the dependency client, not the upstream |
| Intermittent DNS-then-connect failures | Connect phase starved, DNS fine | Connect errors spike, DNS resolves | Reuse connections; add capacity |

The symptom-to-fix table:

| # | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix |
| --- | --- | --- | --- | --- |
| 1 | Intermittent timeouts/5xx to one upstream under load, fine at rest | SNAT exhaustion on one 5-tuple destination | Metrics: DroppedPackets > 0; SNATConnectionCount near 64,512 × IPs | Connection reuse + attach another IP to the NAT GW |
| 2 | Egress IP is not the prefix you provisioned | Subnet not associated, or NAT GW precedence not in effect | az network nat gateway show --query subnets; curl ifconfig.me from inside | Associate the subnet; remove conflicting NIC IP assumptions |
| 3 | Association fails / unsupported | A Basic-SKU resource in the subnet | az network public-ip list / LB SKU = Basic | Upgrade everything to Standard SKU |
| 4 | Long-lived connections reset mid-idle | Idle timeout shorter than the idle gap | NAT GW idleTimeoutInMinutes; app logs show resets at the interval | App keepalives; modestly raise idle timeout |
| 5 | Exhaustion despite “plenty of ports” | All flows hit one dest IP:port (single 5-tuple) | App Insights dependencies by target; one host dominates | Shard destinations or reuse connections |
| 6 | AKS egress IP still LB, not NAT GW | outboundType left at default loadBalancer | az aks show --query networkProfile.outboundType | Recreate cluster with managed/user-assigned NAT GW |
| 7 | Cannot change AKS outboundType after create | Property is largely immutable | az aks update rejects the change | Plan it at creation; use a supported migration path |
| 8 | Multi-zone cluster egress pinned to one zone | One NAT GW shared across a multi-zone subnet | NAT GW zones; pods in other zones lose egress on zone loss | One zonal NAT GW + subnet per zone |
| 9 | Need more than ~1.03M ports to one dest | Hit the 16-IP-per-NAT-GW cap | Prefix already /28; SNATConnectionCount ceiling | Split workload across subnets, each its own NAT GW |
| 10 | Egress works but partner blocks you | They allow-listed loose IPs; you scaled and IP changed | Compare current attached IPs vs partner’s list | Use a contiguous prefix; publish the CIDR once |
| 11 | UDR sends traffic to a firewall, bypassing NAT GW | Route table overrides the egress path | az network route-table route list; effective routes | Decide: NAT GW or forced-tunnel via firewall |
| 12 | “Self-healing” failures every peak window | Exhaustion that clears when load drops | DroppedPackets rises and falls with load | Size the prefix to peak; fix connection reuse |

The expanded form, for the entries that bite hardest:

1. Intermittent timeouts/5xx to one upstream under load, fine at rest. Root cause: SNAT port exhaustion, meaning too many concurrent flows to one destination 5-tuple, usually from per-request connections. Confirm: NAT Gateway metrics show DroppedPackets > 0 during the window and SNATConnectionCount flattening near 64,512 × (attached IPs); correlate with the load window. Fix: Reuse connections (pooled HTTP client + keepalives) first; then attach another public IP or prefix to the NAT Gateway (no downtime) and/or lower the idle timeout. Scaling out instances does not help, because NAT Gateway already pools ports across the subnet.

2. Egress IP is not the prefix you provisioned. Root cause: The subnet was never associated, or you are reading a NIC’s instance-level public IP and assuming it is the egress identity. Confirm: az network nat gateway show --query "subnets[].id" should list the workload subnet; curl -s https://api.ipify.org from inside the subnet must return a prefix IP. Fix: Associate the subnet. Remember NAT Gateway wins outbound even when a NIC has its own public IP (which still serves inbound).

3. Association fails or is unsupported. Root cause: Something in the subnet is Basic SKU (a Basic public IP or Basic Load Balancer), which NAT Gateway cannot coexist with. Confirm: az network public-ip list -g $RG --query "[].{name:name, sku:sku.name}"; check any LB’s SKU. Fix: Upgrade every public IP and LB in the subnet to Standard SKU; re-attempt the association.

4. Long-lived connections reset mid-idle. Root cause: The idle timeout is shorter than the connection’s idle gap, so the port is reclaimed and the next packet finds a dead translation. Confirm: Check idleTimeoutInMinutes; application logs show resets clustering at exactly that interval. Fix: Add application keepalives (the robust fix) and, if appropriate, raise the idle timeout modestly — never jump straight to 120.

5. Exhaustion despite “plenty of ports.” Root cause: Every flow targets one destination IP:port, so they all share a single 5-tuple budget — total ports are irrelevant. Confirm: App Insights dependencies | summarize by target shows one host dominating; the port ceiling is per-destination. Fix: Reuse connections to that host (fewer flows), or shard across multiple destination endpoints if the upstream offers them.

6. AKS egress IP is still the Load Balancer, not the NAT Gateway. Root cause: outboundType was left at the default loadBalancer. Confirm: az aks show -g rg-aks-prod -n aks-prod --query "networkProfile.outboundType" -o tsv. Fix: outboundType is set at creation — recreate the cluster (or follow a supported migration path) with managedNATGateway or userAssignedNATGateway.

8. Multi-zone cluster egress pinned to one zone. Root cause: One zonal NAT Gateway is shared across a node subnet that spans zones, so losing that zone takes egress with it. Confirm: The NAT Gateway shows a single zones value while node pools span multiple zones. Fix: Deploy one zonal NAT Gateway and subnet per availability zone; publish the union of the prefixes. (Routing-side egress mysteries — UDRs overriding the path — are diagnosed in Troubleshooting VNet Connectivity: NSG, UDR, Effective Routes & Network Watcher.)
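
Most of the Confirm steps above start with the same two questions: what do the NAT Gateway metrics look like right now, and what are they actually called on this resource? A sketch of pulling both from the CLI follows. The helper names are hypothetical, and the default metric names are the ones this article uses; the identifiers Azure Monitor exposes may differ (the dropped-packets metric in particular), which is why the first helper lists the definitions. By default the script prints the commands instead of running them.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; metric names are assumptions to verify.

# Print commands by default; set DRY_RUN=0 to execute them for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# natgw_metric_names RESOURCE_ID
# Lists the metric identifiers the resource really exposes.
natgw_metric_names() {
  run az monitor metrics list-definitions --resource "$1" --output table
}

# natgw_metrics RESOURCE_ID [METRIC...]
# Pulls each metric at 1-minute grain; defaults to the names used in this article.
natgw_metrics() {
  local id=$1 metric
  shift
  [ $# -gt 0 ] || set -- DroppedPackets SNATConnectionCount TotalConnectionCount
  for metric in "$@"; do
    run az monitor metrics list --resource "$id" --metric "$metric" \
      --aggregation Total --interval PT1M --output table
  done
}

# Usage (hypothetical resource names):
# NATGW_ID=$(az network nat gateway show -g rg-pay -n natgw-z1 --query id -o tsv)
# DRY_RUN=0 natgw_metric_names "$NATGW_ID"
# DRY_RUN=0 natgw_metrics "$NATGW_ID"
```

Run the definitions listing once per environment and pass the confirmed names explicitly if they differ from the defaults.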

Best practices

Security notes

The egress-security knobs side by side:

| Control | Mechanism | Secures against | Note |
| --- | --- | --- | --- |
| Stable egress identity | Public IP prefix | Partner over-permissive allow-lists | Publish once; never churn |
| Egress filtering | Azure Firewall (not NAT GW) | Exfiltration to arbitrary hosts | NAT GW does not filter |
| PaaS off the public path | Private Endpoint | Public-internet exposure of PaaS | Bypasses SNAT entirely |
| Least-privilege egress | NSG outbound rules | Unwanted destinations/ports | Still applies under NAT GW |
| Per-env isolation | Separate prefixes | Cross-env identity reuse | Distinct allow-lists |
| Config auditing | IaC + PR review | Silent IP/prefix drift | Treat egress IPs as code |
| Egress logging | Flow logs / firewall logs | Undetected anomalous egress | NAT GW itself does not log flows |

Cost & sizing

The bill drivers are simple and bounded:

A rough monthly picture (figures are indicative; confirm current regional pricing):

| Configuration | What you pay for | Rough INR / month | When it fits |
| --- | --- | --- | --- |
| 1× NAT GW + /31 prefix, low traffic | 1 GW hourly + 2 IPs + light per-GB | ~₹2,500–4,000 | Single-zone, modest egress |
| 1× NAT GW + /30 prefix, medium traffic | 1 GW + 4 IPs + moderate per-GB | ~₹4,000–7,000 | Production single-zone |
| 1× NAT GW + /28 prefix | 1 GW + 16 IPs + per-GB | ~₹7,000–11,000 | Single-zone, very high concurrency |
| 3× NAT GW (per-zone) + 3× /30 | 3 GW hourly + 12 IPs + per-GB | ~₹12,000–20,000 | Zone-redundant production |
| High-egress (TB/month), any layout | Per-GB processing dominates | per-GB drives it | Data-heavy egress |
| Attach one more IP to an existing NAT GW | +1 IP-hour, no GW change | small delta | Quick capacity bump, no downtime |
| Default outbound (for contrast) | Nothing (until retired) | ~₹0 | Non-prod only; no stable IP |

Sizing rule of thumb, distilled:

| You have… | Provision | Idle timeout | Why |
| --- | --- | --- | --- |
| < 65K peak flows to one dest | 1 IP (/32 or single) | 4–10 min | One IP covers it |
| ~65K–130K | /31 (2 IPs) | 4–10 min | Two IPs, headroom |
| ~130K–260K | /30 (4 IPs) | 4 min | Aggressive recycle |
| ~260K–520K | /29 (8 IPs) | 4 min | Larger pool |
| ~520K–1.03M | /28 (16 IPs) | 4 min | Max on one GW |
| > 1.03M to one dest | Split across subnets/GWs | 4 min | Past the 16-IP cap |
| Long-lived bursty (brokers/DBs) | size to peak | 30–60 min + keepalive | Avoid mid-idle resets |
| Multi-zone HA | per-zone /30 each | 4 min | One GW per zone |
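
The table is the rule required_IPs = ceil(peak_flows / 64,512) followed by rounding up to the next prefix size. A small sketch of that arithmetic, with hypothetical function names and the 16-IP cap taken from this article:

```bash
#!/usr/bin/env bash
# Sketch: turn peak concurrent flows to ONE destination into a prefix size.

PORTS_PER_IP=64512    # SNAT ports per attached public IP
MAX_IPS_PER_GW=16     # per-NAT-Gateway IP cap

# required_ips PEAK_FLOWS -> ceil(PEAK_FLOWS / PORTS_PER_IP)
required_ips() {
  echo $(( ($1 + PORTS_PER_IP - 1) / PORTS_PER_IP ))
}

# prefix_for_ips IPS -> smallest provisioning option that covers IPS addresses
prefix_for_ips() {
  local n=$1
  if   [ "$n" -le 1 ]; then echo "single public IP"
  elif [ "$n" -le 2 ]; then echo "/31"
  elif [ "$n" -le 4 ]; then echo "/30"
  elif [ "$n" -le 8 ]; then echo "/29"
  elif [ "$n" -le "$MAX_IPS_PER_GW" ]; then echo "/28"
  else echo "split across subnets/NAT Gateways"
  fi
}

# size_prefix PEAK_FLOWS -> one-line summary
size_prefix() {
  local ips
  ips=$(required_ips "$1")
  echo "$1 peak flows -> $ips IP(s) -> $(prefix_for_ips "$ips")"
}

size_prefix 140000   # prints: 140000 peak flows -> 3 IP(s) -> /30
```

Feed it the peak to the busiest single destination, not the total connection count, since the per-destination budget is what runs out.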

Interview & exam questions

1. What are the three outbound paths in Azure, and why does only NAT Gateway scale? Default outbound access (implicit, small shared pool, unpredictable IP, retiring Sept 2025), Load Balancer outbound rules (a fixed 64K budget you must pre-divide across the backend pool), and NAT Gateway (on-demand allocation from a large shared pool, ~64,512 ports per attached public IP, stable egress identity). NAT Gateway scales because ports are handed out dynamically across the subnet rather than pre-carved per instance.

2. What exactly is a SNAT port, and why is exhaustion a single-destination problem? A SNAT port is one entry in the translation table, keyed on the full 5-tuple — source IP, source port, destination IP, destination port, protocol. You are limited to ~64K connections to the same destination IP and port, not 64K total. Exhaustion therefore almost always means many concurrent flows to one VIP (a payment gateway, one storage endpoint); spreading load across destinations rarely exhausts ports.

3. How do you size a public IP prefix? Count the peak concurrent flows to the busiest single destination and compute required_IPs = ceil(peak_flows / 64,512), then pick the smallest prefix whose host count covers it. For ~140,000 flows that is 3 IPs → a /30 (4 IPs) with headroom. Never over-provision a /28 “to be safe” — you pay per IP and can add a prefix later with no downtime.

4. What is the outbound precedence when multiple egress configs exist? Highest to lowest: NAT Gateway (wins if present on the subnet), then an instance-level public IP on the NIC, then Load Balancer outbound rules, then default outbound access. Notably, NAT Gateway overrides a NIC’s own public IP for outbound — that IP still serves inbound, but egress goes through the NAT Gateway.

5. What is the per-NAT-Gateway IP cap and what do you do beyond it? A single NAT Gateway supports a maximum of 16 public IPs (individual + prefixes combined), giving ~1.03M ports to one destination via a /28. Beyond that, split the workload across multiple subnets, each with its own NAT Gateway and prefix — the documented scaling pattern.

6. How do you give AKS deterministic, allow-listable egress? Set outboundType at cluster creation to userAssignedNATGateway (you attach your own NAT Gateway + prefix to the node subnet, so partners allow-list your exact CIDR) or managedNATGateway (Azure provisions it). The default loadBalancer inherits LB SNAT limits; outboundType is largely immutable, so it must be chosen up front.

7. How do you make egress zone-redundant given that NAT Gateway is zonal? A NAT Gateway is single-zone, so you deploy one zonal NAT Gateway and node-pool subnet per availability zone. Pods in each zone egress through that zone’s gateway, and partners allow-list the union of the zonal prefixes. Sharing one gateway across a multi-zone subnet pins all egress to a single zone’s fate.

8. What does the TCP idle timeout do, and how should you set it? It controls how long an idle flow holds its SNAT port before reclaim (4–120 minutes, default 4). Lowering it frees ports faster (more effective capacity); raising it keeps quiet long-lived connections alive but holds ports longer. The durable fix for mid-idle resets is application keepalives, not cranking the timeout to 120.

9. Which NAT Gateway metric is the smoking gun for exhaustion, and what else do you watch? DroppedPackets — any sustained non-zero value strongly indicates exhaustion or capacity pressure. Watch it alongside SNATConnectionCount (the headline established-connection count, compared against 64,512 × attached IPs) and TotalConnectionCount. Alert on DroppedPackets > 0 over 5 minutes.

10. Your app exhausts SNAT despite “plenty of ports” — why? Because every flow targets the same destination IP and port, so they all consume one 5-tuple budget; total ports across other destinations are irrelevant. Confirm with App Insights dependencies grouped by target (one host dominates). Fix with connection reuse to that host, or shard across multiple endpoints if available.

11. Can NAT Gateway filter or inspect egress? No — it performs source NAT only; it has no FQDN filtering, L7 rules, or threat intelligence. For controlled egress (allow only specific destinations) you use Azure Firewall (often with forced tunneling), and for Azure PaaS you prefer Private Endpoints so traffic never traverses the public internet or consumes SNAT at all.

12. Why does NAT Gateway require Standard SKU everywhere in the subnet? NAT Gateway is a Standard-SKU resource and cannot coexist with Basic-SKU public IPs or Basic Load Balancers in the same subnet; the association is unsupported. Upgrade every public IP and LB in the subnet to Standard before attaching it.
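
Question 6 is easier to remember with the commands in front of you. The sketch below shows one way to create a cluster with userAssignedNATGateway and then confirm the setting, assuming the node subnet already has your NAT Gateway associated. The helper names are hypothetical, rg-aks-prod and aks-prod are the example names used earlier in this article, and the subnet ID is a placeholder you must supply. By default the script prints the commands instead of running them.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; supply your own subnet resource ID.

# Print commands by default; set DRY_RUN=0 to execute them for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# create_aks_with_natgw RG CLUSTER SUBNET_ID
# The subnet must already have the NAT Gateway attached.
create_aks_with_natgw() {
  local rg=$1 name=$2 subnet_id=$3
  run az aks create \
    --resource-group "$rg" --name "$name" \
    --network-plugin azure \
    --vnet-subnet-id "$subnet_id" \
    --outbound-type userAssignedNATGateway
}

# check_outbound_type RG CLUSTER -> the cluster's current outboundType
check_outbound_type() {
  run az aks show -g "$1" -n "$2" --query networkProfile.outboundType -o tsv
}

# Usage (the subnet ID below is a placeholder, not a real resource):
# create_aks_with_natgw rg-aks-prod aks-prod "<node-subnet-resource-id>"
# check_outbound_type rg-aks-prod aks-prod
```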

These map to AZ-700 (Network Engineer Associate), covering design and implementation of network connectivity and routing plus hybrid and outbound connectivity, and touch AZ-104 (Administrator) for virtual networking and AZ-305 for designing resilient, allow-listable egress. A compact cert mapping:

| Question theme | Primary cert | Objective area |
| --- | --- | --- |
| Outbound paths, SNAT model, prefix sizing | AZ-700 | Design & implement outbound connectivity |
| AKS outboundType, zone-redundant egress | AZ-700 / CKA-adjacent | Cluster networking design |
| Precedence, SKU constraints, association | AZ-104 | Configure virtual networking |
| Allow-listable, resilient egress design | AZ-305 | Design network architecture |
| Egress filtering vs translation | AZ-700 / AZ-500 | Secure network connectivity |

Quick check

  1. Why does SNAT exhaustion almost always involve a single destination, and what part of the 5-tuple makes that true?
  2. You have ~140,000 peak concurrent flows to one bank VIP. What prefix do you provision, and what is the per-IP port figure you divided by?
  3. True or false: scaling out to more VM instances behind a NAT Gateway adds SNAT ports.
  4. Your zone-redundant AKS cluster’s egress survives a single instance failure but not a zone outage. What did you get wrong, and what is the fix?
  5. Long-lived broker connections keep resetting after exactly four minutes of quiet. Name the property involved and the robust fix.

Answers

  1. The translation table is keyed on the full 5-tuple including destination IP and destination port; you can have ~64,512 flows to one dst IP:port per public IP. Many flows to one VIP share that single budget and exhaust it, while the same number spread across destinations would not.
  2. ceil(140000 / 64512) = 3 IPs, so provision a /30 (4 IPs, ~258K ports, with headroom). The divisor is 64,512 SNAT ports per attached public IP.
  3. False. NAT Gateway already pools ports across the whole subnet on demand; adding instances does not add ports. To add capacity you attach another public IP or prefix to the NAT Gateway (or lower the idle timeout / fix connection reuse).
  4. You shared one zonal NAT Gateway across a multi-zone node subnet, pinning egress to that zone. The fix is one zonal NAT Gateway and node-pool subnet per availability zone, publishing the union of the zonal prefixes to the partner.
  5. The TCP idle timeout (default 4 minutes) is reclaiming the idle flow’s SNAT port. The robust fix is application-level keepalives (or HTTP keep-alive / pooling) that reset the idle timer — preferable to merely raising the timeout toward 120.
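
Answer 5 hides a common trap: on Linux the kernel's default TCP keepalive time is 7200 seconds, far longer than a 4-minute idle timeout, so enabling keepalives without tuning the interval changes nothing. A minimal sketch of checking the relationship follows; the function names are hypothetical, and the kernel setting only affects sockets on which the application has enabled SO_KEEPALIVE.

```bash
#!/usr/bin/env bash
# Sketch: is the first keepalive probe sent before the NAT idle timeout expires?

# keepalive_fits KEEPALIVE_SECONDS IDLE_TIMEOUT_MINUTES
# Exit status 0 when the keepalive interval is shorter than the idle timeout.
keepalive_fits() {
  [ "$1" -lt $(( $2 * 60 )) ]
}

# check_host_keepalive [IDLE_TIMEOUT_MINUTES]  (Linux only, defaults to 4)
# Returns 0 if the host setting fits, 1 if it does not, 2 if it cannot be read.
check_host_keepalive() {
  local idle_min=${1:-4} ka
  ka=$(cat /proc/sys/net/ipv4/tcp_keepalive_time 2>/dev/null) || return 2
  if keepalive_fits "$ka" "$idle_min"; then
    echo "ok: keepalive ${ka}s is below the ${idle_min} min idle timeout"
  else
    echo "too slow: keepalive ${ka}s exceeds the ${idle_min} min idle timeout"
    return 1
  fi
}

# To shorten the interval host-wide (needs root; 120 is an illustrative value):
# sysctl -w net.ipv4.tcp_keepalive_time=120
```

Many client libraries also expose a per-connection keepalive interval, which is usually a better place to set this than a host-wide sysctl.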

Glossary

Next steps

You can now give any subnet deterministic, allow-listable, exhaustion-proof egress and prove it from inside the network. Build outward:
