Azure Standard Load Balancer Deep Dive: Outbound Rules, HA Ports, and Cross-Region Load Balancing

Standard Load Balancer is the Layer-4 plumbing almost every Azure network design sits on, and it is the layer people understand least. They reach for it as “a TCP load balancer,” wire up one rule, and never touch the parts that decide whether the system survives load: outbound rules that give you deterministic SNAT instead of a 2 a.m. port-exhaustion incident, HA Ports that make a firewall sandwich genuinely highly available, health-probe thresholds that decide whether a deploy drains gracefully or black-holes connections, and a global cross-region front end that fails a whole region over without a DNS change. The Azure Standard Load Balancer is a software-defined, zero-latency-added, pass-through L4 device: it does not terminate connections, it rewrites the destination (and optionally the source) of a 5-tuple and forwards the packet. That “pass-through” nature is exactly why its failure modes are subtle — there is no access log, no TLS to inspect, no request to trace, just flows that either complete or quietly die.

This is the engineering-grade walkthrough of every moving part, ending with a global anycast front end whose backend pool is other load balancers. Everything here is the Standard SKU. Basic Load Balancer retires 30 September 2025 — no SLA, no availability zones, no outbound rules, no HA Ports, no cross-region — so if you are still on Basic, migration is the first task, not an optimization. Because this is a reference you will return to mid-incident, the rule types, the SNAT maths, the probe knobs, the metrics, and the failure playbook are all laid out as scannable tables: read the prose once, then keep the tables open when SnatConnectionCount (Failed) starts climbing.

By the end you will stop guessing. When egress fails under a flash sale you will know whether you starved SNAT ports on one destination, whether implicit SNAT is shadowing your explicit rule, whether a “healthy” backend is lying because its probe is TCP, or whether your stateful firewall is resetting long flows because HA Ports does not guarantee path symmetry. Knowing which in ninety seconds is what separates a five-minute incident from a two-hour one.

What problem this solves

Standard LB exists to spread Layer-4 traffic across a pool of backends inside a region, to provide controlled, deterministic outbound internet access for those backends, and — with the cross-region SKU — to give a TCP/UDP service a single global IP with automatic regional failover. The pain it removes is the pain of doing any of that by hand: round-robin DNS that ignores health, a single NAT box that becomes a bottleneck and a single point of failure, or a firewall pair with no safe way to balance across both nodes.

What breaks without engineering it properly is specific and recurring. An app that opens a new outbound connection per request exhausts the shared SNAT pool and throws intermittent dependency timeouts that pass in test and fail in production — the single most common Standard LB incident. A “TCP load balancer” with a TCP probe keeps routing traffic to a worker that returns 500 to every user, because the socket is still open. A firewall sandwich that passed its lab test resets every long-lived database and gRPC connection in production because return packets traverse a different stateful appliance than the forward packets. A team migrates inbound but forgets that Standard is secure by default — no implicit outbound — and every backend silently loses internet access the moment they remove the public IP from the NIC.

Who hits this: anyone running VMs or VM Scale Sets behind L4 in Azure, anyone inserting a network virtual appliance (NGFW, IDS/IPS, proxy) into the data path, anyone with chatty outbound calls to a small number of upstreams (payment APIs, partner endpoints, a shared database), and anyone designing multi-region active-active for a non-HTTP protocol that Azure Front Door (HTTP-only) cannot serve. The fix is almost never “make the LB bigger” — Standard LB has no instance size. It is “allocate ports explicitly, probe at Layer 7, engineer flow symmetry, and watch the SNAT metrics.”

To frame the whole field before the deep dive, here is every failure class this article covers, the question it forces, and where to look first:

Failure class	What you actually see	First question to ask	First place to look	Most common single cause
SNAT port exhaustion	Intermittent outbound timeouts under load, fine at rest	Does it fail to one destination or all?	`SnatConnectionCount` (Failed) metric	New connection per request to one upstream
Implicit SNAT shadowing	Unpredictable port use; exhaustion despite a rule	Is `disableOutboundSnat` set on inbound rules?	LB rule config / ARM `disableOutboundSnat`	Inbound rule silently providing default SNAT
Backend “healthy” but failing	Users get 500/502 from a node the LB calls healthy	Is the probe TCP or HTTP?	`DipAvailability` vs real 5xx rate	TCP probe on an app that 500s
Asymmetric NVA reset	Long flows reset, short flows fine	Forward and return path same appliance?	Firewall session log (“no state”)	HA Ports without Floating IP / state sync
No regional failover	Region down, global IP keeps sending traffic	Is the regional probe honest?	Cross-region LB backend health	Regional LB still reports healthy
No outbound at all	New backends cannot reach the internet	Is there an explicit egress path?	Effective routes; outbound rule presence	Standard is secure-by-default (egress opt-in)
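
For the first row, the "first place to look" can be pulled from the CLI rather than the portal. The sketch below is a minimal wrapper around `az monitor metrics list`. The helper name `snat_failed_query` is made up for illustration, and the `ConnectionState eq 'Failed'` dimension filter is an assumption worth confirming against the metric's dimension list for your load balancer.

```shell
# Hypothetical helper: failed SNAT connections per minute for one load balancer.
# Args: resource group, load balancer name.
snat_failed_query() {
  local rg="$1" lb="$2" lb_id
  # Resolve the LB's resource ID, then query the metric against it.
  lb_id="$(az network lb show --resource-group "$rg" --name "$lb" --query id --output tsv)"
  # Assumption: the metric exposes a ConnectionState dimension with a 'Failed' value.
  az monitor metrics list \
    --resource "$lb_id" \
    --metric SnatConnectionCount \
    --filter "ConnectionState eq 'Failed'" \
    --aggregation Total \
    --interval PT1M \
    --output table
}
```

Call it as `snat_failed_query <resource-group> <lb-name>`. A flat line of zeros at rest that climbs under load is the pattern this article treats as SNAT exhaustion.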

Learning objectives

By the end of this article you can:

Choose correctly between Standard (regional), Gateway, and cross-region (Global) load balancers, and explain why they are not interchangeable.
Design a backend pool the right way (NIC-based vs IP-based) and align it to availability zones so the frontend’s zone-redundancy is actually backed by surviving backends.
Allocate SNAT ports explicitly with an outbound rule, compute allocated_outbound_ports against your maximum pool size, and stop implicit SNAT from shadowing it with disableOutboundSnat.
Build an HA Ports rule for an active-active NVA pool and engineer the flow symmetry (Floating IP / DSR + vendor state sync) that HA Ports alone does not give you.
Tune TCP / HTTP / HTTPS probes and probe-threshold for a deliberate detection-vs-flapping trade-off, and sequence a graceful drain so deploys never black-hole connections.
Front regional LBs with a cross-region LB for one static anycast IP and automatic, DNS-free regional failover on any TCP/UDP protocol.
Read the Standard LB metrics (SnatConnectionCount, UsedSnatPorts, AllocatedSnatPorts, DipAvailability, VipAvailability) and wire alerts that fire before users feel it.
Map any Standard LB symptom to a root cause, a confirming command, and a fix — and price the design in INR.

Prerequisites & where this fits

You should already understand the basics: a frontend IP configuration (the VIP — public or internal/private), a backend pool (the targets), a load-balancing rule (which frontend port maps to which backend port), a health probe (what “healthy” means), and an outbound rule (how the pool reaches the internet). You should know how SNAT (Source Network Address Translation) works in principle — a private IP:port rewritten to a public IP:port so return traffic finds its way home — and be comfortable running az in Cloud Shell, reading JSON output, and applying a Bicep or Terraform file. Familiarity with availability zones, NSGs, and UDRs helps; the Azure Virtual Network, Subnets and NSGs fundamentals are assumed.

This sits in the Networking track and is the L4 layer beneath almost everything else. It is the floor under Azure Multi-Region Active-Active Architecture, the SNAT-aware sibling of Diagnosing and Killing SNAT Port Exhaustion on Cloud NAT Gateways (NAT Gateway is the other egress path and often the better one), and the HA mechanism behind Deploying HA Third-Party NVAs in Azure: The Load Balancer Sandwich Pattern. Where you need Layer-7 features — path routing, WAF, edge TLS — you want Application Gateway instead; this article is strictly L4.

A quick map of who owns which layer during an incident, so you call the right person fast:

Layer	What lives here	Who usually owns it	Failure classes it can cause
Client / DNS	Name resolution, retries	Frontend / SRE	Rarely the LB; usually a red herring
Cross-region (Global) LB	Anycast VIP, regional health	Network / platform	No failover (regional probe lies), wrong region steer
Regional Standard LB	Inbound rules, probes, outbound rules	Network team	SNAT exhaustion, probe false-healthy, no egress
Backend pool (VM/VMSS)	The app, the NIC, the zone	App / compute team	False-healthy app, zone imbalance
NVA subnet (firewall sandwich)	UDRs, NSGs, stateful FW	Security / network	Asymmetric reset, total-blast-radius NSG mistake
Outbound (SNAT / NAT GW)	Egress to APIs/DB	Platform + network	Port exhaustion under load

Core concepts

Six mental models make every later diagnosis obvious.

Standard LB is pass-through, not a proxy. It rewrites the 5-tuple (destination — and for outbound, source) and forwards the packet; it never terminates the TCP connection. That is why it adds no measurable latency, sees no application data, and produces no access log. Visibility comes from metrics and VNet flow logs, not from the LB itself. Every troubleshooting instinct you have from an L7 proxy (read the access log, inspect the request) does not apply.

A SNAT port is keyed on the full 5-tuple, destination IP and port included. You are not limited to ~64,000 total outbound connections; you are limited to ~64,000 simultaneous flows to the same destination IP and port per frontend public IP. Exhaustion almost always means many flows to one upstream behind a single VIP. This is the single most misunderstood fact about the device, and it is why “we only have 5,000 connections” still exhausts when they all go to one payment API: each backend can spend only its own slice of that budget on the one destination, not the whole 64,000.

Standard is secure by default — outbound is opt-in. Being in a backend pool does not grant a backend internet access. You must provide egress explicitly: an outbound rule on the LB, a NAT Gateway on the subnet, or an instance-level public IP. Remove a public IP from a NIC without adding one of these and the backend goes dark to the internet — a classic migration surprise.

Zone-redundant frontend, zone-spread backends — both or neither. A Standard LB frontend is zone-redundant by default (its VIP is served from all zones). But HA is only real if the backends span zones too. A zone-redundant frontend in front of a single-zone VMSS dies with that zone. Spread instances across zones 1/2/3 and the frontend keeps serving from the survivors.

The probe defines truth, and the wrong probe lies. The LB sends traffic only to instances the probe says are healthy. A TCP probe proves a socket is open; an HTTP/HTTPS probe proves the app answered 200. A wedged-but-listening process passes a TCP probe and fails every real request. Detection time is roughly interval x probe_threshold, and graceful drain (stop new flows, let established flows finish) is a property you sequence into deploys, not a setting.

HA Ports balances everything; it does not guarantee symmetry. An HA Ports rule (protocol All, ports 0/0, internal LB only) load-balances all ports and protocols at once — the only sane way to front a firewall whose port set you cannot enumerate. But a stateful NVA needs the return packet on the same appliance as the forward packet, and HA Ports hashes flows independently per direction. Symmetry is your job (Floating IP / DSR, vendor state sync, symmetric UDRs) — and getting it wrong is the most common HA-Ports incident.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this is the model side by side:

Concept	One-line definition	Where it lives	Why it matters
Frontend IP config	The VIP (public or internal) traffic enters on	On the LB	Each public IP = +64k SNAT ports
Backend pool	The set of targets (NIC-based or IP-based)	On the LB	IP-based pools cannot do outbound rules
Load-balancing rule	Frontend port → backend port mapping	On the LB	Can silently provide implicit SNAT
Inbound NAT rule	One frontend port → one backend instance	On the LB	Per-instance reach (SSH/RDP)
Health probe	What “healthy” means (TCP/HTTP/HTTPS)	On the LB	Wrong type → false-healthy backend
Outbound rule	Explicit SNAT egress + port allocation	On the LB	The deterministic-egress control
HA Ports rule	Protocol All, ports 0/0 (internal only)	On internal LB	Balances every port for NVAs
SNAT port	One outbound 5-tuple translation entry	Per frontend public IP (~64k)	Exhaustion → outbound failures
Floating IP (DSR)	Backend sees original VIP; return symmetric	On the rule	Required for stateful NVA symmetry
Cross-region LB	Global anycast VIP over regional LBs	Global tier	One static IP, DNS-free failover
DipAvailability	% of probes succeeding per backend	Metric	Your health/drain signal
VipAvailability	Whether the frontend datapath is up	Metric	The “is the VIP alive” signal

Standard vs Gateway vs cross-region, and when each fits

Azure ships three load balancer “shapes.” They are not interchangeable, and picking the wrong one shows up as a missing feature or a redesign weeks later.

SKU / type	Scope	Primary job	Outbound SNAT	HA Ports	Frontend
Standard (regional)	One region, zone-aware	General L4 load balancing for VMs/VMSS	Yes, via outbound rules	Yes (internal)	Public or internal
Gateway	One region	Transparent insertion of NVAs via service chaining	No (bump-in-the-wire)	N/A	Internal (chained)
Cross-region (Global)	Multi-region	Anycast global front end over regional LBs	No	No	Public (global)

The mental model:

Standard regional LB is the default — balances inside one region, zone-redundant or zonal, and home to outbound rules and HA Ports. Everything in Steps below uses this unless stated.
Gateway LB is for service insertion only. You chain it to a Standard LB frontend or a VM NIC so traffic transparently flows through an NVA pool and back, source IP preserved. It is not a general-purpose front end and has no outbound rules.
Cross-region LB is a thin global anycast layer whose backend pool is other Standard load balancers. One static global IP, steers to the closest healthy region, does no SNAT, and sits in front of regional LBs, not in place of them.

The decision distilled — match the requirement to the shape:

If you need…	Use	Why not the others
Balance VMs/VMSS in one region	Standard regional	Gateway has no general rules; cross-region needs regional LBs underneath
Insert a firewall/IDS transparently in the path	Gateway LB	Standard needs HA Ports + UDRs; cross-region is global only
One static IP for a TCP/UDP service across regions	Cross-region LB	Front Door is HTTP-only; Traffic Manager is DNS/TTL-bound
L7 path routing / WAF / edge TLS	Application Gateway / Front Door	Standard LB is L4 only — no HTTP awareness
Deterministic, allow-listable egress at scale	NAT Gateway (with or without LB)	LB outbound pre-carves ports; NAT GW allocates on demand

The rest of this article uses the regional Standard LB for the core sections, layers HA Ports for the NVA case, then puts the cross-region LB on top.

Basic is retiring — what changes on the way to Standard

If you are still on Basic LB (retires 30 September 2025), migration is the first task. The features do not map 1:1, and several Standard behaviors are secure-by-default where Basic was permissive — so a lift-and-shift that ignores these breaks egress or HA:

Aspect	Basic LB	Standard LB	Migration action
SLA	None	99.99% (with ≥2 healthy backends)	Gain SLA; ensure 2+ instances
Availability zones	Not supported	Zone-redundant / zonal	Re-pin frontend + spread backends
Outbound rules	Not supported	Supported (explicit SNAT)	Add an explicit outbound rule
Default outbound access	Implicit, on	Secure by default (opt-in)	Add egress or backends go dark
HA Ports	Not supported	Internal LB only	Build the HA Ports rule on an internal LB
Public IP SKU	Basic	Standard (required)	Upgrade the PIP to Standard
Backend pool size	~300	~1,000	Re-architect large fleets if needed
Cross-region	Not supported	Supported (Global tier)	Layer a cross-region LB if multi-region
NSG requirement	Optional	Recommended/expected	Add NSGs (Standard assumes them)

The one that silently bites: default outbound access. On Basic, a backend reached the internet implicitly; on Standard it does not until you add an outbound rule or NAT Gateway. Migrate the egress path in the same change as the LB, or every backend loses internet the moment the public IP leaves the NIC.
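
A migration runbook can guard against this with a pre-flight check. The sketch below is one way to do it, assuming egress is meant to come from an LB outbound rule. The helper name `has_outbound_rule` is hypothetical, and it does not inspect NAT Gateway or instance-level public IPs, which are equally valid egress paths.

```shell
# Hypothetical pre-flight guard: succeeds only if the LB has at least one outbound rule.
# Args: resource group, load balancer name.
# Checks LB outbound rules only; NAT Gateway and instance-level public IPs are not inspected.
has_outbound_rule() {
  local rg="$1" lb="$2" count
  count="$(az network lb outbound-rule list \
    --resource-group "$rg" --lb-name "$lb" \
    --query "length(@)" --output tsv)"
  # Treat an empty answer as zero rules.
  [ "${count:-0}" -gt 0 ]
}
```

A typical use is `has_outbound_rule <resource-group> <lb-name> || echo "no outbound rule: add egress before removing NIC public IPs"`, run before the step that detaches the public IPs.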

Frontend IPs and the rule types

A Standard LB is a collection of frontend IP configurations and the rules that bind them to a backend pool. Get the rule taxonomy straight — each does one job, and mixing them up is where implicit SNAT bites.

Rule type	What it does	Frontend → backend	SNAT behaviour	Typical use
Load-balancing rule	Distributes a port across the whole pool	e.g. 443 → 8443 (all instances)	Provides implicit outbound SNAT unless disabled	Web/API traffic to a VMSS
HA Ports rule	Balances all ports/protocols at once	0 → 0, protocol All (internal LB)	Implicit SNAT off by design (internal)	Active-active NVA sandwich
Inbound NAT rule	Maps one port to one specific instance	e.g. 50001 → VM3:22	None	Per-instance SSH/RDP/jump
Inbound NAT pool (VMSS)	Range of ports → instances	50000-50100 → VMSS:22	None	SSH into VMSS instances
Outbound rule	Explicit egress + manual port allocation	pool → frontend public IP	Explicit SNAT (what you want)	Deterministic internet egress

Each rule type has hard requirements — what it needs and where it’s allowed. Mismatch any of these and the create is rejected or the rule silently does nothing:

Rule type	Needs a probe?	Pool type	Public or internal LB	Floating IP option	Key constraint
Load-balancing rule	Yes	NIC or IP-based	Either	Yes (DSR)	Implicit SNAT on unless disabled
HA Ports rule	Yes	NIC or IP-based	Internal only	Yes (needed for NVA)	One per LB; ports 0/0, protocol All
Inbound NAT rule	Optional	NIC-based	Either	Yes	One frontend port → one instance
Inbound NAT pool (VMSS)	Optional	VMSS NICs	Either	n/a	Port range mapped to instances
Outbound rule	No	NIC-based only	Public (egress IP)	n/a	IP-based pools unsupported

The two constraints that trip people most: HA Ports is internal-LB-only, and outbound rules require a NIC-based pool. If you find yourself trying to put HA Ports on a public LB or an outbound rule on an IP-based pool, the design is wrong, not the syntax.

The frontend itself is public or internal, and zonal or zone-redundant. The defaults and the decision:

Frontend property	Options	Default	When to change	Gotcha
Address type	Public / Internal (private)	— (you choose)	Internal for east-west, NVA, internal services	Internal LBs can do HA Ports; public cannot
Zone behaviour	Zone-redundant / Zonal / No-zone	Zone-redundant (Standard)	Zonal only for latency/co-location pins	A zonal frontend dies with its zone
IP allocation	Static / Dynamic	Static (Standard PIP)	Always static for a stable VIP	Dynamic VIPs change on dealloc
Public IP SKU	Standard / Basic	Standard	Must be Standard with a Standard LB	Basic PIP + Standard LB is rejected
Inbound/outbound IP sharing	Same IP / separate IP	Often shared	Separate outbound IP keeps SNAT budget clean	Sharing mixes inbound + SNAT on one budget

LOC=eastus
RG=rg-lb-prod

# Zone-redundant public frontend IP (Standard SKU, served from all zones).
az network public-ip create \
  --resource-group $RG --name pip-lb-fe \
  --location $LOC \
  --sku Standard --tier Regional \
  --allocation-method Static --zone 1 2 3

az network lb create \
  --resource-group $RG --name lb-app-prod \
  --location $LOC \
  --sku Standard \
  --public-ip-address pip-lb-fe \
  --frontend-ip-name fe-public \
  --backend-pool-name bep-app

Zonal vs zone-redundant is a real decision. A zone-redundant frontend survives a single zone loss transparently. A zonal frontend (pinned with a single --zone) is occasionally required for latency-sensitive or co-location designs, but it dies with its zone. Default to zone-redundant unless you have a specific, written reason not to.

Backend pool design: NIC-based vs IP-based, and zone alignment

A Standard LB backend pool can be defined two ways, and the choice constrains the entire design — especially outbound.

NIC-based pool — membership is the NIC (ipConfiguration) of a VM or VMSS. The right model for VM/VMSS workloads: lifecycle is tied to the compute resource, and outbound rules work cleanly.
IP-based pool — membership is raw private IPs in the VNet, for backends whose lifecycle you do not own or want to pre-declare. The hard constraint: IP-based pools do not support outbound rules. Need LB-provided SNAT? Use a NIC-based pool, or front egress with NAT Gateway.

The trade-off in full:

Aspect	NIC-based pool	IP-based pool
Membership unit	VM/VMSS NIC `ipConfiguration`	Raw private IP in the VNet
Outbound rules (SNAT)	Supported	Not supported
Lifecycle coupling	Tied to the compute resource	Decoupled (you manage IPs)
Best for	Standard VM/VMSS workloads	Pre-provisioned IPs, mixed/unmanaged backends
Auto-membership (VMSS)	Yes, via the scale set	Manual IP management
Cross-resource-group targets	Constrained	More flexible

Zone alignment is the part that gets skipped. A Standard LB frontend is zone-redundant by default, but HA is only real if the backends span zones too. Spread VMSS instances across zones 1/2/3 and the frontend keeps serving from surviving zones when one fails. The zone model side by side:

Backend zoning	Survives single-zone loss?	When to use	Watch-out
Zone-spread (1/2/3)	Yes — survivors keep serving	Default for HA	Cross-zone bandwidth has a (tiny) cost
Zonal (pinned to one zone)	No — dies with the zone	Latency/co-location pin only	Pair with a zonal frontend deliberately
No-zone (regional)	Best-effort (no zone guarantee)	Legacy / regions without zones	No explicit zone resilience
Mixed zonal + zone-redundant FE	Partial	Migration states	Easy to think you’re HA when you’re not

See Azure Regions and Availability Zones for the zone model in depth; the rule here is simply both ends or neither.
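
A quick way to test the "both ends" half of that rule is to count backends per zone. The sketch below reads one zone label per line and reports the spread. The function name `zone_spread` and the scale set name in the usage note are illustrative, not part of any Azure tooling.

```shell
# Reads one zone label per line (for example "1", "2", "3") on stdin.
# Prints instances per zone, then whether the pool spans at least two zones.
zone_spread() {
  awk '
    NF { count[$1]++ }   # skip blank lines (instances with no zone recorded)
    END {
      zones = 0
      for (z in count) { zones++; printf "zone %s: %d instance(s)\n", z, count[z] }
      if (zones >= 2) print "OK: pool spans " zones " zones"
      else            print "WARN: pool is confined to " zones " zone(s)"
    }'
}
```

Assuming a scale set named `vmss-app`, it could be fed with `az vmss list-instances --resource-group $RG --name vmss-app --query "[].zones[0]" --output tsv | zone_spread`. A WARN line behind a zone-redundant frontend is exactly the "think you're HA when you're not" state from the table.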

Outbound rules and explicit SNAT port allocation

This is the part that prevents incidents. By default a Standard LB does not give backends outbound internet access just for being in a pool — Standard is secure by default, and egress is opt-in. The clean ways to provide it:

Egress method	How it allocates ports	Best for	Cost	Limit / gotcha
Outbound rule (LB)	Pre-carved, manual per instance	LB already present; egress must be the LB VIP	PIP only	Pre-divides 64k; caps pool size if over-allocated
NAT Gateway	On-demand from a shared pool	Pure egress at scale; many destinations	Hourly + per-GB	Zonal (one per zone); separate article
Instance-level public IP	Per-instance dedicated	A handful of VMs needing own IP	PIP per VM	Doesn’t scale; management overhead
Default outbound (legacy)	Implicit, Microsoft-managed	Nothing — being retired	None	Non-deterministic; do not rely on it

A SNAT port is one entry in a translation table keyed on the full 5-tuple, including the destination IP and port. You are not limited to 64K total connections — you are limited to ~64K simultaneous flows to the same destination IP:port per frontend public IP. Exhaustion almost always means many flows to one upstream behind a single VIP.

With an outbound rule you allocate ports explicitly, pre-dividing the 64,000-port budget per frontend IP across the pool. The maths is unforgiving:

ports_per_instance = floor( (64,000 x frontend_IP_count) / backend_instance_count )

64,000 ports, 1 frontend IP, 50 instances  -> 1,280 ports each
64,000 ports, 1 frontend IP, 100 instances ->   640 ports each
64,000 ports, 2 frontend IPs, 100 instances -> 1,280 ports each

The allocation table you actually plan against — note how adding frontend IPs (or a public IP prefix) is the lever that grows the budget:

Frontend public IPs	Total SNAT ports	50 instances	100 instances	200 instances
1 IP	64,000	1,280 / inst	640 / inst	320 / inst
2 IPs	128,000	2,560 / inst	1,280 / inst	640 / inst
4 IPs	256,000	5,120 / inst	2,560 / inst	1,280 / inst
/28 prefix (16 IPs)	1,024,000	20,480 / inst	10,240 / inst	5,120 / inst
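
The formula and table above can be reproduced with shell arithmetic, which is handy when the pool size or IP count is not a round number. This is a minimal sketch: the function name is illustrative, and rounding down to a multiple of 8 is included on the assumption that the platform expects allocations in multiples of 8.

```shell
# ports_per_instance FRONTEND_IPS MAX_INSTANCES
# floor(64000 * IPs / instances), rounded down to a multiple of 8
# (assumption: allocations are expected in multiples of 8).
ports_per_instance() {
  local ips="$1" instances="$2" raw
  raw=$(( 64000 * ips / instances ))   # shell integer division already floors
  echo $(( raw / 8 * 8 ))
}

ports_per_instance 1 50    # 1280
ports_per_instance 1 100   # 640
ports_per_instance 2 100   # 1280
```

Pass the maximum pool size you intend to reach, not today's instance count, so the result is safe to use as the `--outbound-ports` value.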

Set it too high and you cap pool size (you can run out of ports to hand new instances); too low (the default auto-allocation is famously stingy) and busy instances exhaust ports while the pool looks half-idle. Always allocate manually, against your maximum intended pool size.

# Dedicated outbound frontend IP — do NOT share the inbound VIP for outbound
# if you can avoid it; a separate IP keeps the SNAT budget clean.
az network public-ip create \
  --resource-group $RG --name pip-lb-outbound \
  --sku Standard --allocation-method Static --zone 1 2 3

az network lb frontend-ip create \
  --resource-group $RG --lb-name lb-app-prod \
  --name fe-outbound --public-ip-address pip-lb-outbound

# Explicit outbound rule: manual port allocation, generous idle timeout,
# and TCP reset on idle so clients learn the flow is gone.
az network lb outbound-rule create \
  --resource-group $RG --lb-name lb-app-prod \
  --name obr-app \
  --frontend-ip-configs fe-outbound \
  --address-pool bep-app \
  --protocol All \
  --idle-timeout 15 \
  --enable-tcp-reset true \
  --outbound-ports 1280

The flags that matter, each with its default and the failure if you get it wrong:

Flag (CLI)	ARM / Bicep	Default	Set it to	Failure if wrong
`--outbound-ports`	`allocatedOutboundPorts`	Auto (stingy)	floor(64k×IPs / max-instances)	Too low → exhaustion; too high → can’t add instances
`--enable-tcp-reset`	`enableTcpReset`	false	`true`	Idle flows dropped silently; clients hang
`--idle-timeout`	`idleTimeoutInMinutes`	4	15-30 (or app keepalives)	Mid-idle drops on long-lived flows
`--protocol`	`protocol`	—	`All` (TCP+UDP)	UDP egress missing if set to Tcp only
`--frontend-ip-configs`	`frontendIPConfigurations`	—	A dedicated outbound IP	Sharing inbound VIP muddies the budget

The durable fix for mid-idle drops is application keepalives, not a giant idle timeout. Each extra frontend IP (or a public IP prefix) adds another 64,000 ports. If you are fighting this maths at scale, that is the signal to move egress to NAT Gateway, which allocates ports on demand instead of pre-carving them.

Worked sizing against real workloads — pick the row closest to yours and read the verdict. The key variable is concurrent flows to the busiest single destination, not total throughput:

Workload	Instances	Frontend IPs	Ports/instance	Peak flows to busiest dest	Verdict
Internal API, few egress calls	10	1	6,400	~500	Huge headroom; fine
Web tier → one DB VIP, pooled	20	1	3,200	~1,500	Comfortable with reuse
Payment fan-out, per-request conns	50	1	1,280	~30,000	Exhausts — reuse or add IPs
Payment fan-out, pooled clients	50	1	1,280	~1,200	Fine once connections are reused
Batch webhook fan-out	100	1	640	~50,000	Exhausts — needs NAT Gateway
Batch webhook fan-out	100	4 (/30 prefix)	2,560	~50,000	Borderline; NAT Gateway better
Large fleet, many destinations	200	2	640	~3,000 (spread)	Fine — load spread over dests
Large fleet, one hot destination	200	2	640	~80,000	Exhausts — shard or NAT GW

The pattern is unmissable: the rows that exhaust all have many concurrent flows to one destination with no connection reuse. Fix reuse first (it collapses the flow count), then add IPs or move to NAT Gateway for genuine high-fan-out to a single upstream.
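
To find the "busiest destination" number for your own workload, one option is to count established connections per peer on a Linux backend. The sketch below assumes input in the shape printed by `ss -Htn state established` (peer address in the fourth column). The function name is illustrative, and the count covers one instance only, including VNet-internal peers that never consume SNAT ports.

```shell
# Reads lines shaped like `ss -Htn state established` output on stdin
# (columns: Recv-Q Send-Q Local-Address:Port Peer-Address:Port)
# and prints "<count> <peer ip:port>", busiest destination first.
flows_per_destination() {
  awk '{ print $4 }' | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}
```

On a backend, `ss -Htn state established | flows_per_destination | head` gives the top destinations. Compare the first line against the ports-per-instance figure for that pool.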

The implicit-SNAT trap

A load-balancing rule silently provides implicit, unmanaged SNAT alongside any explicit outbound rule unless you turn it off. Two overlapping SNAT behaviors give you unpredictable port use and exhaustion you cannot reason about. The fix is one flag on the load-balancing rule: --disable-outbound-snat true in the CLI (disableOutboundSnat: true in ARM/Bicep), so that egress is governed only by your explicit outbound rule.

Configuration	Outbound SNAT source	Determinism	Verdict
LB rule only, `disableOutboundSnat=false`	Implicit (auto, ~stingy)	Low	Default; exhausts early
LB rule + outbound rule, `disableOutboundSnat=false`	Both (overlapping)	Very low	The trap — unpredictable
LB rule (`disableOutboundSnat=true`) + outbound rule	Explicit rule only	High	Correct
NAT Gateway on subnet	NAT GW (on demand)	Highest	Best for pure egress
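
Moving an existing rule from the "trap" row to the "correct" row is a single update. The sketch below wraps it in a helper. The helper name and the rule name in the usage note are placeholders, since this article does not name the inbound rule on `lb-app-prod`.

```shell
# Hypothetical helper: turn off implicit SNAT on one load-balancing rule so that
# only the explicit outbound rule governs egress.
# Args: resource group, load balancer name, load-balancing rule name.
disable_implicit_snat() {
  az network lb rule update \
    --resource-group "$1" --lb-name "$2" --name "$3" \
    --disable-outbound-snat true
}
```

For example, `disable_implicit_snat rg-lb-prod lb-app-prod rule-https`, where `rule-https` stands in for your own rule name. Repeat it for every load-balancing rule that shares the backend pool with the outbound rule.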

HA Ports for active-active NVAs and firewall sandwiches

HA Ports makes an internal Standard LB load-balance all ports and all protocols with one rule. It exists for the network virtual appliance case: you cannot enumerate every port a firewall must pass, so you balance the whole flow space at once. An HA Ports rule is just a load-balancing rule with protocol All and both frontendPort and backendPort set to 0. It is available on internal Standard LBs only (not public).

# Internal LB in front of the active-active NVA pool.
az network lb create \
  --resource-group $RG --name lb-nva-internal \
  --sku Standard \
  --vnet-name vnet-hub --subnet snet-nva-frontend \
  --frontend-ip-name fe-nva --private-ip-address 10.0.10.4 \
  --backend-pool-name bep-nva

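# Liveness probe for the NVA pool; the HA Ports rule below references it by name.
# Assumption: the appliance answers HTTP 200 on port 8080 at /healthz.
# Substitute the probe port and path from your vendor's reference architecture.
az network lb probe create \
  --resource-group $RG --lb-name lb-nva-internal \
  --name probe-nva \
  --protocol Http --port 8080 --path /healthz \
  --interval 5 --probe-threshold 2

# HA Ports: protocol All, ports 0/0 (every port, every protocol).
az network lb rule create \
  --resource-group $RG --lb-name lb-nva-internal \
  --name rule-haports \
  --protocol All --frontend-port 0 --backend-port 0 \
  --frontend-ip-name fe-nva \
  --backend-pool-name bep-nva \
  --probe-name probe-nva \
  --enable-tcp-reset true \
  --idle-timeout 15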

The classic topology is the firewall sandwich: an external/internal LB pair around an active-active NVA pool, HA Ports on the internal side. The design rules that decide whether it actually works:

Design rule	Why it matters	If you skip it
Symmetric routing	Stateful NVA needs return packet on the same appliance	Mid-stream resets (“no matching state”) on long flows
Floating IP (DSR)	Backend sees original VIP; keeps routing symmetric	Return path diverges; asymmetric drops
Vendor session-state sync	Any appliance can handle any packet of a flow	Rebalance/probe event drops in-flight sessions
Per-NVA liveness probe	Pulls a wedged appliance out of rotation	Black-holes traffic to a hung-but-listening FW
Treat the NVA subnet as prod-critical	HA Ports = no per-port blast radius	One bad NSG/UDR breaks all protocols at once

The two non-negotiables, spelled out:

Symmetric routing. A stateful NVA requires the return packet to traverse the same appliance as the forward packet. With plain HA Ports, asymmetric paths break connections. Fix it with Floating IP (Direct Server Return) and/or UDRs that keep flows symmetric, or the vendor’s state-synchronizing cluster. This is the single most common HA-Ports failure mode — validate against the vendor reference architecture. See the Load Balancer Sandwich pattern for the full topology.
Health probe per NVA. The probe must hit a real liveness endpoint so a hung NVA is pulled from rotation. A probe against a port that answers while the data plane is wedged gives false “healthy” and black-holes traffic.

HA Ports balances everything, so a misconfigured NSG or UDR on the NVA subnet now affects all protocols at once. There is no per-port blast radius anymore — treat that subnet as production-critical and test failover explicitly.

Health probe protocols, thresholds, and graceful drain

Probes decide what “healthy” means, and the defaults are rarely what you want for a zero-downtime deploy. Standard LB supports TCP, HTTP, and HTTPS probes.

Probe type	Healthy when	Proves	Use it for	Limit
TCP	3-way handshake completes on the port	The port is open	Non-HTTP backends; cheapest	A wedged app that still listens passes
HTTP	GET on the path returns HTTP 200	The app answered	Web backends	Slightly more overhead than TCP
HTTPS	GET over TLS returns HTTP 200	The app answered over TLS	Encrypted-probe requirements	Cert/TLS handling on the backend

Prefer an HTTP/HTTPS probe against a real /healthz over TCP wherever the backend speaks HTTP. A TCP probe stays “healthy” while the app returns 500s to every user, because the socket is still open. Only an L7 probe catches a wedged-but-listening process.

az network lb probe create \
  --resource-group $RG --lb-name lb-app-prod \
  --name probe-app \
  --protocol Http --port 8080 --path /healthz \
  --interval 5 --probe-threshold 2
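
The difference between the probe types can be modelled in a few lines of Python. This is a simplified sketch of the table above, not the LB's implementation; `Backend` and `probe_healthy` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Backend:
    port_open: bool              # does the TCP 3-way handshake complete?
    http_status: Optional[int]   # status for GET on the probe path, None if no answer

def probe_healthy(protocol: str, backend: Backend) -> bool:
    """Return the verdict each probe type would reach for this backend."""
    if protocol == "Tcp":
        # A TCP probe only proves the socket accepts connections.
        return backend.port_open
    if protocol in ("Http", "Https"):
        # An HTTP/HTTPS probe additionally requires a 200 on the path.
        return backend.port_open and backend.http_status == 200
    raise ValueError(f"unsupported probe protocol: {protocol}")

# A wedged-but-listening app: the socket is open, every request returns 500.
wedged = Backend(port_open=True, http_status=500)
print("Tcp probe healthy: ", probe_healthy("Tcp", wedged))
print("Http probe healthy:", probe_healthy("Http", wedged))
```

The same backend passes the TCP probe and fails the HTTP probe, which is the whole argument for probing a real `/healthz`.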

The probe knobs, their ranges, and the trade-off each controls:

Setting (CLI)	ARM / Bicep	Default	Range	Trade-off
`--protocol`	`protocol`	—	Tcp / Http / Https	TCP = cheap but blind; HTTP = true health
`--port`	`port`	—	1-65535	Must match the listening/health port
`--path`	`requestPath`	— (HTTP/S)	any path returning 200	Keep it shallow and honest
`--interval`	`intervalInSeconds`	15 (min 5)	5-2147483646	Tighter = faster detect, more flap risk
`--probe-threshold`	`probeThreshold` (legacy: `numberOfProbes`)	1	≥1	Higher rides blips; lower evicts fast

Detection time is roughly interval × probe threshold (about 10s at 5s × 2). A tighter setting flaps on a merely slow backend; a looser one keeps sending traffic to a dead node. A sizing guide:

interval × threshold	Detect time	Good for	Risk
5s × 2	~10s	Fast eviction of dead nodes	Flaps a momentarily-slow node
5s × 3	~15s	Balanced default	Slightly slower eviction
15s × 2	~30s	Stable, flap-averse	Dead node serves up to ~30s
30s × 3	~90s	Very stable backends	Slow to pull a failed node
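
The sizing table is plain multiplication, so it is easy to recompute for any pair of values. A minimal sketch, assuming the 5-second minimum interval and the threshold floor of 1 from the knobs table; `detect_seconds` is a hypothetical helper name.

```python
def detect_seconds(interval_s: int, threshold: int) -> int:
    """Approximate time to mark a backend unhealthy: interval x threshold."""
    if interval_s < 5:
        raise ValueError("probe interval cannot go below 5 seconds")
    if threshold < 1:
        raise ValueError("probe threshold must be at least 1")
    return interval_s * threshold

# Recompute the rows of the sizing guide.
for interval_s, threshold in [(5, 2), (5, 3), (15, 2), (30, 3)]:
    print(f"{interval_s}s x {threshold} -> ~{detect_seconds(interval_s, threshold)}s")
```

Treat the result as an approximation: probe timing relative to the moment of failure adds up to roughly one interval of jitter.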

Graceful drain is the other half. When a probe starts failing (or you pull an instance from the pool), Standard LB stops new flows to it but does not kill established TCP connections — existing flows continue until they close or hit the idle timeout. So the clean deploy sequence is:

Step	Action	What the LB does	Why
1	Flip the instance’s `/healthz` to non-200 (or stop the app gracefully)	Probe begins failing	Signal intent to drain
2	Wait `interval × threshold`	Marks instance unhealthy; stops new flows	No new traffic lands on it
3	Wait out the drain window	Established flows finish naturally	In-flight requests complete
4	Recycle, bring `/healthz` back	Probe succeeds; rejoins rotation	Instance returns warm

This is the drain-then-recycle pattern that VMSS rolling upgrades and App Service slot swaps rely on. Deploys that black-hole requests have almost always skipped one of the two waits: the detection wait in step 2 or the drain window in step 3.
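
Step 1 of the drain sequence needs an endpoint the deploy tooling can flip. The following Python sketch shows one way to do that with the standard library; the `draining` flag, `HealthHandler`, `serve`, port 8080, and the 30-second drain window in the comment are all assumptions for illustration, not something the platform prescribes.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set this event (for example from a signal handler) to begin draining.
draining = threading.Event()

def healthz_status() -> int:
    """200 while serving normally, 503 once a drain has been requested."""
    return 503 if draining.is_set() else 200

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        status = healthz_status()
        body = b"draining\n" if status == 503 else b"ok\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def drain_wait_seconds(interval_s: int, threshold: int, drain_window_s: int) -> int:
    """Total wait before recycling: detection time (step 2) plus drain window (step 3)."""
    return interval_s * threshold + drain_window_s

def serve(port: int = 8080) -> None:
    """Run the health endpoint; blocks until the process is stopped."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()

# Example: 5s x 2 detection plus a 30s drain window means waiting 40s before recycling.
print(drain_wait_seconds(5, 2, 30))
```

In a real service the flag would be set by the deploy hook or a SIGTERM handler, and the drain window would be sized to the longest request you expect to be in flight.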

A reference deployment in Bicep and Terraform

Here is the regional public LB, NIC-based pool, explicit outbound rule, HTTP probe, and load-balancing rule as one coherent reference: the shape you want in the repo, not a pile of CLI commands. First Bicep:

param location string = resourceGroup().location

resource pip 'Microsoft.Network/publicIPAddresses@2023-11-01' = {
  name: 'pip-lb-fe'
  location: location
  sku: { name: 'Standard' }
  zones: [ '1', '2', '3' ]
  properties: { publicIPAllocationMethod: 'Static' }
}

resource lb 'Microsoft.Network/loadBalancers@2023-11-01' = {
  name: 'lb-app-prod'
  location: location
  sku: { name: 'Standard' }
  properties: {
    frontendIPConfigurations: [ {
      name: 'fe-public'
      properties: { publicIPAddress: { id: pip.id } }
    } ]
    backendAddressPools: [ { name: 'bep-app' } ]
    probes: [ {
      name: 'probe-app'
      properties: { protocol: 'Http', port: 8080, requestPath: '/healthz', intervalInSeconds: 5, probeThreshold: 2 }
    } ]
    loadBalancingRules: [ {
      name: 'rule-https'
      properties: {
        protocol: 'Tcp'
        frontendPort: 443
        backendPort: 8443
        idleTimeoutInMinutes: 15
        enableTcpReset: true
        disableOutboundSnat: true   // outbound handled by the explicit rule below
        frontendIPConfiguration: { id: resourceId('Microsoft.Network/loadBalancers/frontendIPConfigurations', 'lb-app-prod', 'fe-public') }
        backendAddressPool: { id: resourceId('Microsoft.Network/loadBalancers/backendAddressPools', 'lb-app-prod', 'bep-app') }
        probe: { id: resourceId('Microsoft.Network/loadBalancers/probes', 'lb-app-prod', 'probe-app') }
      }
    } ]
    outboundRules: [ {
      name: 'obr-app'
      properties: {
        protocol: 'All'
        allocatedOutboundPorts: 1280
        idleTimeoutInMinutes: 15
        enableTcpReset: true
        frontendIPConfigurations: [ { id: resourceId('Microsoft.Network/loadBalancers/frontendIPConfigurations', 'lb-app-prod', 'fe-public') } ]
        backendAddressPool: { id: resourceId('Microsoft.Network/loadBalancers/backendAddressPools', 'lb-app-prod', 'bep-app') }
      }
    } ]
  }
}

The same shape in Terraform, which is what many teams keep in the repo:

resource "azurerm_public_ip" "lb_fe" {
  name                = "pip-lb-fe"
  resource_group_name = var.rg
  location            = var.location
  allocation_method   = "Static"
  sku                 = "Standard"
  zones               = ["1", "2", "3"]
}

resource "azurerm_lb" "app" {
  name                = "lb-app-prod"
  resource_group_name = var.rg
  location            = var.location
  sku                 = "Standard"
  frontend_ip_configuration {
    name                 = "fe-public"
    public_ip_address_id = azurerm_public_ip.lb_fe.id
  }
}

resource "azurerm_lb_backend_address_pool" "app" {
  name            = "bep-app"
  loadbalancer_id = azurerm_lb.app.id
}

resource "azurerm_lb_probe" "app" {
  name                = "probe-app"
  loadbalancer_id     = azurerm_lb.app.id
  protocol            = "Http"
  port                = 8080
  request_path        = "/healthz"
  interval_in_seconds = 5
  probe_threshold     = 2
}

resource "azurerm_lb_rule" "app" {
  name                           = "rule-https"
  loadbalancer_id                = azurerm_lb.app.id
  protocol                       = "Tcp"
  frontend_port                  = 443
  backend_port                   = 8443
  frontend_ip_configuration_name = "fe-public"
  backend_address_pool_ids       = [azurerm_lb_backend_address_pool.app.id]
  probe_id                       = azurerm_lb_probe.app.id
  idle_timeout_in_minutes        = 15
  enable_tcp_reset               = true
  disable_outbound_snat          = true # outbound handled by the explicit rule below
}

resource "azurerm_lb_outbound_rule" "app" {
  name                     = "obr-app"
  loadbalancer_id          = azurerm_lb.app.id
  protocol                 = "All"
  backend_address_pool_id  = azurerm_lb_backend_address_pool.app.id
  allocated_outbound_ports = 1280
  idle_timeout_in_minutes  = 15
  enable_tcp_reset         = true
  frontend_ip_configuration {
    name = "fe-public"
  }
}

The detail that bites people: set disable_outbound_snat = true on the load-balancing rule (disableOutboundSnat in ARM/Bicep) so the inbound rule does not silently provide implicit, unmanaged SNAT alongside your explicit outbound rule. Without it you get two overlapping SNAT behaviors and unpredictable port use. The Bicep-vs-Terraform property names you will reach for:

Concept	CLI flag	Bicep / ARM property	Terraform argument
Disable implicit SNAT	`--disable-outbound-snat`	`disableOutboundSnat`	`disable_outbound_snat`
Allocated SNAT ports	`--outbound-ports`	`allocatedOutboundPorts`	`allocated_outbound_ports`
Idle timeout	`--idle-timeout`	`idleTimeoutInMinutes`	`idle_timeout_in_minutes`
TCP reset on idle	`--enable-tcp-reset`	`enableTcpReset`	`enable_tcp_reset`
Floating IP (DSR)	`--floating-ip`	`enableFloatingIP`	`enable_floating_ip`
Probe threshold	`--probe-threshold`	`probeThreshold`	`probe_threshold`

Cross-region load balancer: global front end, regional pools, failover

The cross-region (Global) LB gives you a single static anycast IP from Microsoft’s edge, with a backend pool of regional Standard load balancers. Traffic enters at the closest edge and steers to the closest healthy region; if a region’s LB goes unhealthy, flows shift to the next automatically — no DNS TTL to wait out, because the IP never changes.

# Global LB lives in a supported "home region" but serves globally.
az network public-ip create \
  --resource-group rg-global --name pip-global \
  --sku Standard --tier Global --allocation-method Static

az network cross-region-lb create \
  --resource-group rg-global --name lb-global \
  --frontend-ip-name fe-global \
  --public-ip-address pip-global \
  --backend-pool-name bep-regions

# Backend members are the *frontend IP configs of regional Standard LBs*.
az network cross-region-lb address-pool address add \
  --resource-group rg-global --lb-name lb-global \
  --pool-name bep-regions --name eastus-lb \
  --frontend-ip-address "$EASTUS_LB_FE_ID"

az network cross-region-lb address-pool address add \
  --resource-group rg-global --lb-name lb-global \
  --pool-name bep-regions --name westeurope-lb \
  --frontend-ip-address "$WESTEUROPE_LB_FE_ID"

What to internalize about the global LB:

It health-checks the regional LBs, not your VMs. Each regional LB’s own probes decide regional health; the global LB consumes that signal, so your regional probe design drives global failover quality.
Default distribution is geo-proximity by network latency, with automatic failover to the next-closest region when one drops.
Client source IP is preserved to the regional LB, which still sees real client addresses for its own routing and logging.
It is L4 only. Global L7 (path routing, WAF, edge TLS) is Front Door.

The global-routing options compared, so you pick the right global layer:

Global option	Layer	Routing basis	Failover speed	Static IP	Protocols
Cross-region LB	L4	Geo-proximity (latency)	Seconds, no DNS	Yes (anycast)	Any TCP/UDP
Front Door	L7	Latency / priority / weighted	Seconds (edge)	No (anycast hostname)	HTTP/S only
Traffic Manager	DNS	Performance / priority / geo / weighted	DNS TTL-bound (minutes)	No (DNS)	Any (DNS-level)
Anycast accelerator	L4	Edge anycast	Seconds	Yes	TCP/UDP

This is the cleanest way to give a TCP/UDP service (not just HTTP) one global IP with regional failover, which neither Traffic Manager (DNS/TTL-bound) nor Front Door (HTTP-only) can do. For the edge-anycast variant and latency engineering, see Anycast at the Edge; for the broader pattern, see Azure Multi-Region Active-Active Architecture.

How the global LB behaves in each failure and routing case — what actually happens to a flow, and what holds constant:

Event	What the cross-region LB does	Client impact	What stays constant
Normal steady state	Routes to closest healthy region by latency	Lowest-latency region	Global static IP
One region’s LB goes unhealthy	Stops sending to it; shifts to next-closest	Brief reconnect, no DNS wait	Global static IP
Failed region recovers	Re-includes it once probes pass	Gradual return of nearby traffic	Global static IP
New region added to pool	Starts steering nearby clients to it	More local routing	Global static IP
All regions unhealthy	No healthy backend; connections fail	Outage (by definition)	IP still answers, no target
Client moves geographically	Re-steered to new closest region	Lower latency from new location	Global static IP
Regional probe lies “healthy”	Keeps sending to a degraded region	Errors with no failover	(the bug — fix the probe)

The single design dependency to burn in: the global LB only fails over as well as your regional probes report. A dishonest regional probe (TCP, or / that always 200s) is the difference between automatic failover and a global outage that points at a region that “looks” up.
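
The failover table reduces to a simple rule: pick the lowest-latency region among those reporting healthy. The Python sketch below is a toy model of that rule, not the service's actual steering logic; `pick_region`, the region names, and the latency figures are hypothetical.

```python
from typing import Dict, Optional

def pick_region(latency_ms: Dict[str, float], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the closest region that reports healthy, or None if none do."""
    candidates = [region for region in latency_ms if healthy.get(region, False)]
    if not candidates:
        return None  # all regions unhealthy: connections fail
    return min(candidates, key=lambda region: latency_ms[region])

latency = {"eastus": 20.0, "westeurope": 90.0}   # as seen by one client

# Steady state: closest healthy region wins.
print(pick_region(latency, {"eastus": True, "westeurope": True}))
# eastus reports unhealthy: traffic shifts to the next-closest region.
print(pick_region(latency, {"eastus": False, "westeurope": True}))
# Every region unhealthy: no target.
print(pick_region(latency, {"eastus": False, "westeurope": False}))
```

Note what the model cannot see: if `eastus` is degraded but its probe still reports `True`, the function keeps returning `eastus`. The selection is only as good as the health input.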

Diagnostics: metrics, SNAT counts, and the queries that matter

Before the metrics, the hard numbers — the limits and quotas you design against. Most “why did it fall over” moments are one of these ceilings, and knowing the real figure (not a guess) is half the diagnosis:

Limit / quota	Standard LB value	What hits it	Symptom at the ceiling	Lever to raise it
SNAT ports per frontend public IP	~64,000	Flows to one destination IP:port	`SnatConnectionCount` Failed > 0	Add public IPs / a prefix; NAT Gateway
Backend pool size (NIC-based)	up to ~5,000 instances	Very large VMSS fleets	Can’t add members	Split pools / multiple LBs
Frontend IP configurations	up to ~600 per LB	Many VIPs on one LB	Create fails at the cap	Use additional LBs
Load-balancing + outbound + NAT rules	up to ~1,500 per LB	Rule-heavy designs	Rule create fails	Consolidate; multiple LBs
Probe interval (minimum)	5 seconds	Fast detection needs	Can’t go tighter	Tune threshold instead
Idle timeout range	4-100 minutes	Long-lived idle flows	Mid-idle drop below your value	App keepalives + raise timeout
Public IP prefix size	/28 to /31 (16 down to 2 IPs)	Allow-listable egress block	Prefix too small for budget	Allocate a larger prefix
Cross-region LB backend members	regional LB frontends	Multi-region fan-out	—	Add regional LBs to the pool
HA Ports rules per internal LB	1 (it’s “all ports”)	NVA sandwich	N/A (one rule covers all)	—
TCP reset on idle	off by default	Silent idle drops	Clients hang, don’t retry	`enableTcpReset=true`

These are the figures that matter in practice; Azure publishes the authoritative current limits per subscription/region, and a few are soft (raisable via support). The mechanism — per-destination SNAT, per-IP 64k — never changes even as the published caps shift, so design to the mechanism.
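
Designing to the mechanism means doing the port arithmetic against the maximum pool size before the incident. A minimal Python sketch, assuming 64,000 SNAT ports per frontend IP as in the table above; the rounding down to a multiple of 8 reflects my understanding of what the outbound-rule setting accepts and should be checked against current documentation. `ports_per_instance` and `fits` are hypothetical helper names.

```python
PORTS_PER_FRONTEND_IP = 64_000

def ports_per_instance(frontend_ips: int, max_pool_size: int) -> int:
    """Largest per-instance allocation that still fits the whole pool."""
    if frontend_ips < 1 or max_pool_size < 1:
        raise ValueError("need at least one frontend IP and one instance")
    raw = PORTS_PER_FRONTEND_IP * frontend_ips // max_pool_size
    return raw - raw % 8  # assumption: allocations must be a multiple of 8

def fits(frontend_ips: int, max_pool_size: int, ports_each: int) -> bool:
    """True if ports_each for every instance stays within the IP budget."""
    return ports_each * max_pool_size <= PORTS_PER_FRONTEND_IP * frontend_ips

# One frontend IP shared by a pool that can grow to 50 instances.
print(ports_per_instance(1, 50))
# Adding a second frontend IP doubles the per-instance budget.
print(ports_per_instance(2, 50))
# The same 1280-port allocation no longer fits once the pool grows to 51.
print(fits(1, 51, 1280))
```

Run it with the maximum size the pool can scale to, not today's instance count; the third call shows how a single extra instance breaks an allocation that was sized exactly.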

Standard LB emits multi-dimensional metrics under Microsoft.Network/loadBalancers. Because the LB has no access log, these metrics plus VNet flow logs are your only visibility. The ones worth alerting on:

Metric	What it measures	Split by	Watch for	What it confirms
SnatConnectionCount	Established SNAT flows	`ConnectionState` (Pending/Failed)	Rising Failed	Port exhaustion (the canary)
AllocatedSnatPorts	Ports budgeted per backend	backend	Baseline	Your configured ceiling
UsedSnatPorts	Ports actually consumed	backend	Used → Allocated	How close to the ceiling you are
DipAvailability	% probes succeeding per backend (Health Probe Status)	backend	Drops below 100%	Backend health / drain signal
VipAvailability	Datapath availability of the frontend	frontend	Drops below 100%	Whether the VIP itself is up
ByteCount / PacketCount / SYNCount	Throughput and new-connection rate	direction	Sudden spikes	Load / SYN-flood patterns

A KQL query to catch SNAT pressure before users do:

AzureMetrics
| where ResourceProvider == "MICROSOFT.NETWORK"
| where ResourceId has "/LOADBALANCERS/LB-APP-PROD"
// Only the two port gauges are needed for utilization.
| where MetricName in ("UsedSnatPorts", "AllocatedSnatPorts")
// Both metrics are gauges, so average them per 5-minute bin.
| summarize Used = avgif(Average, MetricName == "UsedSnatPorts"),
            Allocated = avgif(Average, MetricName == "AllocatedSnatPorts")
            by bin(TimeGenerated, 5m)
// Guard against bins where no allocation was reported.
| extend UtilizationPct = iff(Allocated > 0, round(100.0 * Used / Allocated, 1), real(null))
| order by TimeGenerated desc

Alert on SnatConnectionCount with ConnectionState == Failed greater than 0 over 5 minutes — sustained failed SNAT means you are at the ceiling, and the fix is more frontend IPs, higher per-instance ports, or NAT Gateway. The alerts worth wiring before the next incident — leading indicators, not the lagging “VIP down”:

Alert on	Metric / dimension	Threshold (starting point)	Why it’s leading
SNAT failures	`SnatConnectionCount` (Failed)	> 0 sustained 5 min	First sign of exhaustion before timeouts spike
SNAT utilization	`UsedSnatPorts` / `AllocatedSnatPorts`	> 80% for 10 min	Predicts exhaustion with headroom to act
Backend health	`DipAvailability`	< 100% for 5 min	Catches probe failures / drain issues
Datapath	`VipAvailability`	< 100% for 5 min	The VIP itself is degraded
Connection rate	`SYNCount`	unusual spike	Load surge or SYN-flood pattern

An L4 LB has no access logs like an L7 proxy; flow-level visibility comes from VNet flow logs on the backend subnet, fed into Traffic Analytics for top-talker and drop analysis — see Network Flow Logs to Insight. Wire the LB metrics into a workspace and dashboards via Azure Monitor.

Architecture at a glance

The diagram traces an L4 flow as it actually moves and maps each failure class onto the exact hop where it bites. Read it left to right. Clients hit a single static anycast IP on the cross-region (Global) LB, which steers them to the closest healthy region — badge 1 marks the failover decision, which works only because the global LB consumes each regional LB’s honest health signal (not your VMs directly). Inside the region, the regional Standard LB applies an inbound rule (443 to 8443), runs a health probe (badge 2 — a TCP probe here would lie “healthy” while the app 500s), and governs egress through an explicit outbound rule (badge 3 — where implicit SNAT shadowing and stingy auto-allocation cause exhaustion). The rule hashes each flow by 5-tuple onto the NIC-based backend pool, a VMSS spread across zones 1/2/3; outbound flows leave via the egress VIP (badge 4 — the ~64,000-ports-per-IP ceiling counts simultaneous flows to one destination, not total connections).

Branching off the backend path is the NVA sandwich: spoke traffic is forced by UDR through an internal LB with an HA Ports rule (protocol All, ports 0/0) in front of an active-active firewall pool. Badge 5 sits on the stateful appliance — HA Ports balances every port but not flow symmetry, so without Floating IP (DSR) and vendor session-state sync, return packets land on a different firewall and long flows reset mid-stream. Finally, every hop reports into Azure Monitor and VNet flow logs — the only visibility an access-log-less L4 device gives you. The five numbered legend entries narrate each badge as symptom · confirm · fix; that is the whole diagnostic method: localise the symptom to a hop, read the cause, run the named metric/command, apply the fix.

Real-world scenario

Meridian Pay, a fictional but representative payments platform, ran an active-active NGFW firewall sandwich in their hub VNet: an internal Standard LB in front of three firewall VMs, an HA Ports rule, and all spoke traffic forced through it via UDRs. The fleet was a 50-instance VMSS of payment workers behind a separate public Standard LB, fronted by a single outbound public IP, in Central India. It passed every lab and functional test. Monthly LB-and-egress spend was about ₹14,000. Two separate incidents, two weeks apart, taught the team the two hardest lessons of this device.

Incident one — the firewall sandwich resets. In production, long-lived database and gRPC connections reset randomly after a few minutes while short HTTP calls were fine, and the firewall logs showed sessions with “no matching state.” The constraint was classic stateful-inspection asymmetry. HA Ports hashes flows across the three firewalls by 5-tuple, but the return-path UDRs sent reply packets back through a different firewall than the forward path. The second appliance saw a mid-stream packet for a session it never created and dropped it. Short flows finished inside one hash window; long flows lived long enough to hit a state mismatch on a reconvergence or probe-driven rebalance. The fix had two parts: enable the vendor’s session-state synchronization across the cluster so any appliance can handle any packet of a flow, and enable Floating IP (Direct Server Return) on the HA Ports rule so appliances see the original VIP and routing stays symmetric per the vendor design. They also pointed the probe at a real data-plane liveness URL, not just a listening port.

# HA Ports rule with Floating IP enabled for the stateful NVA sandwich.
az network lb rule create \
  --resource-group rg-hub --lb-name lb-nva-internal \
  --name rule-haports \
  --protocol All --frontend-port 0 --backend-port 0 \
  --frontend-ip-name fe-nva --backend-pool-name bep-nva \
  --probe-name probe-nva-dataplane \
  --floating-ip true \
  --enable-tcp-reset true --idle-timeout 30

The mid-stream resets stopped on the first cutover.

Incident two: SNAT exhaustion during a sale. Two weeks later, a flash sale drove the 50-instance fleet to peak, and the payment-provider callout (a single upstream VIP) started timing out intermittently, with about 9% of charges failing. The on-call reflex was to scale the VMSS out, which helped marginally and cost money. The real read came from the metric: SnatConnectionCount with a non-zero Failed dimension, and UsedSnatPorts pinned at AllocatedSnatPorts on the busiest instances. With one frontend IP across 50 instances the outbound rule had auto-allocated a stingy port count, and every flow targeted the same payment VIP, so the ~64,000-ports-per-IP-per-destination ceiling was the wall. Two coupled bugs again: a per-request connection pattern in the worker, and a single outbound IP with no headroom. The night-of fix: set the outbound rule to an explicit --outbound-ports 1280 (64,000 ÷ 50 instances), set disableOutboundSnat=true on the inbound rule to stop implicit shadowing, and add a second outbound public IP to double the budget. The following week they fixed the worker to reuse connections and moved egress to a NAT Gateway for on-demand ports independent of instance count.

The next sale ran at full load with zero failed SNAT, charge success returned to 100%, and they moved the VMSS back down to its baseline size at ₹13,500 — lower than before. The two lessons on the wall: “HA Ports gives you all-port load balancing, not flow symmetry — that is your routing’s job,” and “SNAT is per-destination; one busy upstream exhausts you no matter how few total connections you think you have.” The incidents as a timeline, because the order of moves is the lesson:

Time	Symptom	Action taken	Effect	What it should have been
Wk1	Long flows reset, short ones fine	Restart firewalls	Brief relief, recurs	Ask: is the path symmetric?
Wk1	“no matching state” in FW log	Read FW session log	Asymmetry identified	The breakthrough
Wk1	Root cause found	Floating IP + vendor state sync + dataplane probe	Resets stop	Correct fix
Wk3	9% charge timeouts at peak	Scale VMSS out	Marginal, costs money	Don’t scale to mask
Wk3	Still failing	Read `SnatConnectionCount` (Failed)	Exhaustion confirmed	This was the read
Wk3	Mitigated	Explicit ports + `disableOutboundSnat` + 2nd IP	Failures clear	Correct night-of fix
+1wk	Fixed	Connection reuse + NAT Gateway; scale back down	0 SNAT fails, ₹13,500	The actual fix is code + egress design

Advantages and disadvantages

The pass-through L4 model both enables these designs and creates their failure modes. Weigh it honestly:

Advantages (why this model helps you)	Disadvantages (why it bites)
Zero added latency — pure 5-tuple rewrite, no termination	No access log; you diagnose from metrics + flow logs only
Protocol-agnostic — balances any TCP/UDP, not just HTTP	No L7 features — no path routing, WAF, or TLS termination
Outbound rules give deterministic, allow-listable SNAT	Pre-carved ports cap pool size; the maths is unforgiving
HA Ports balances every port for NVAs with one rule	HA Ports gives no flow symmetry — stateful NVAs need extra engineering
Zone-redundant frontend survives single-zone loss	Only real if backends are zone-spread too — easy to fake HA
Cross-region LB = one static IP, DNS-free regional failover	L4 only; HTTP global routing still needs Front Door
Secure by default — no implicit internet exposure	Egress is opt-in; forget it and backends go dark
First-class SNAT/health metrics you can alert on	Finite SNAT (~64k/IP/destination) is invisible until you hit it under load

The model is right when you need a fast, protocol-agnostic L4 front end, controlled egress, or NVA HA. It bites hardest on chatty outbound workloads to few destinations (SNAT), stateful NVA sandwiches (symmetry), and anyone who assumes a zone-redundant frontend alone means HA. The disadvantages are all manageable — but only if you know they exist, which is the point of this article. Where you need L7, reach for Application Gateway instead.

Hands-on lab

Stand up a regional Standard LB with a zone-redundant frontend, a NIC-based pool, an explicit outbound rule, and an HTTP probe, then prove the egress IP is deterministic and the drain works. Free-tier-friendly except the two B1s VMs and the one Standard public IP (a few rupees an hour; delete at the end). Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

RG=rg-lb-lab
LOC=centralindia
az group create -n $RG -l $LOC -o table

Step 2 — VNet, subnet, and two zone-spread backend VMs.

az network vnet create -g $RG -n vnet-lab --address-prefix 10.0.0.0/16 \
  --subnet-name snet-app --subnet-prefix 10.0.1.0/24 -o table

for i in 1 2; do
  az vm create -g $RG -n vm-app-$i --image Ubuntu2204 --size Standard_B1s \
    --vnet-name vnet-lab --subnet snet-app --zone $i \
    --public-ip-address "" --admin-username azureuser --generate-ssh-keys -o table
done

Expected: two VMs, vm-app-1 in zone 1 and vm-app-2 in zone 2, neither with a public IP (egress will come from the LB).

Step 3 — Zone-redundant public frontend and the Standard LB.

az network public-ip create -g $RG -n pip-lb-fe \
  --sku Standard --allocation-method Static --zone 1 2 3 -o table

az network lb create -g $RG -n lb-lab --sku Standard \
  --public-ip-address pip-lb-fe --frontend-ip-name fe-public \
  --backend-pool-name bep-app -o table

Expected: a Standard LB with frontend fe-public and an empty pool bep-app.

Step 4 — HTTP probe, load-balancing rule (implicit SNAT disabled), and explicit outbound rule.

az network lb probe create -g $RG --lb-name lb-lab -n probe-app \
  --protocol Http --port 80 --path / --interval 5 --probe-threshold 2

az network lb rule create -g $RG --lb-name lb-lab -n rule-http \
  --protocol Tcp --frontend-port 80 --backend-port 80 \
  --frontend-ip-name fe-public --backend-pool-name bep-app \
  --probe-name probe-app --idle-timeout 15 --enable-tcp-reset true \
  --disable-outbound-snat true

az network lb outbound-rule create -g $RG --lb-name lb-lab -n obr-app \
  --frontend-ip-configs fe-public --address-pool bep-app \
  --protocol All --idle-timeout 15 --enable-tcp-reset true --outbound-ports 1280

Expected: disableOutboundSnat: true on the LB rule and an outbound rule allocating 1280 ports.

Step 5 — Add the NICs to the pool and install a tiny web server on each VM.

for i in 1 2; do
  NIC=$(az vm show -g $RG -n vm-app-$i --query "networkProfile.networkInterfaces[0].id" -o tsv)
  IPCFG=$(az network nic show --ids $NIC --query "ipConfigurations[0].name" -o tsv)
  az network nic ip-config address-pool add --nic-name $(basename $NIC) -g $RG \
    --ip-config-name $IPCFG --lb-name lb-lab --address-pool bep-app
  # az vm create attaches an NSG that allows only SSH. The LB preserves the client
  # source IP, so inbound port 80 must be allowed explicitly or Step 6 times out.
  az vm open-port -g $RG -n vm-app-$i --port 80 --priority 900 -o none
  az vm run-command invoke -g $RG -n vm-app-$i --command-id RunShellScript \
    --scripts "sudo apt-get update -y && sudo apt-get install -y nginx && echo vm-app-$i | sudo tee /var/www/html/index.html"
done

Step 6 — Verify inbound balancing and the deterministic egress IP.

LBIP=$(az network public-ip show -g $RG -n pip-lb-fe --query ipAddress -o tsv)
for i in $(seq 1 10); do curl -s http://$LBIP/; done   # mix of vm-app-1 / vm-app-2 (5-tuple hash, not strict round-robin)

# Egress determinism: from inside a backend, the source IP must be pip-lb-fe.
az vm run-command invoke -g $RG -n vm-app-1 --command-id RunShellScript \
  --scripts "curl -s https://api.ipify.org"
echo "Compare the returned IP to:"; echo $LBIP

Expected: the curl loop returns a mix of vm-app-1 and vm-app-2 (the LB hashes each flow by 5-tuple, so the order is not strictly alternating); the egress check returns the LB’s frontend IP, which proves the outbound rule is the egress path.

Step 7 — Prove graceful drain. Stop nginx on one VM, watch it leave rotation after interval × threshold (~10s), confirm the other keeps serving:

az vm run-command invoke -g $RG -n vm-app-1 --command-id RunShellScript \
  --scripts "sudo systemctl stop nginx"
sleep 15
for i in $(seq 1 10); do curl -s http://$LBIP/; done   # now only vm-app-2
az network lb show -g $RG -n lb-lab --query "probes[0].{proto:protocol,interval:intervalInSeconds,threshold:probeThreshold}" -o jsonc

Expected: after ~10-15s, every response is vm-app-2 — the probe pulled the stopped instance without killing the survivor.

Validation checklist — what each step proved:

Step	What you did	What it proves	Real-world analogue
2	Zone-spread VMs, no public IP	Backends span zones; egress is LB-provided	The HA + secure-by-default model
4	`disable-outbound-snat true` + outbound rule	Explicit SNAT, no implicit shadowing	The incident-proof egress config
6	curl loop + ipify from inside	Inbound balances; egress is deterministic	“Which IP do partners allow-list?”
7	Stop nginx, watch drain	Probe pulls dead nodes, keeps survivors	Zero-downtime deploy drain

Cleanup (avoid lingering charges):

az group delete -n $RG --yes --no-wait

Cost note. Two B1s VMs plus one Standard public IP run a few rupees per hour; an hour of this lab is well under ₹60, and deleting the resource group stops everything. The Standard public IP and the LB carry a small hourly charge even idle, so do not leave the lab running.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. An L4 LB emits no HTTP status codes, so the “error reference” is the set of connection-level outcomes and metric/health states you read instead. Learn to map each to what the LB is actually doing:

Observed outcome	What it means at L4	Likely cause	How to confirm	First move
Connection refused (RST on connect)	No healthy backend on that port	All instances unhealthy / wrong rule port	`DipAvailability` 0%; rule port vs listener	Fix probe/listener; check rule mapping
Connection times out (no SYN-ACK)	VIP/datapath issue or NSG block	`VipAvailability` drop, or NSG denies	`VipAvailability` metric; NSG effective rules	Allow AzureLoadBalancer tag; check region health
Outbound connect fails under load	SNAT port exhaustion	Per-destination 5-tuple ceiling	`SnatConnectionCount` Failed > 0	Add IP / NAT Gateway; reuse connections
Mid-stream RST after minutes (idle)	Idle timeout reclaimed the flow	Idle timeout < flow idle gap	Flow dies at the timeout boundary	App keepalives; raise idle timeout
Mid-stream RST after minutes (NVA)	Stateful asymmetry dropped it	Return path on a different firewall	FW log “no matching state”	Floating IP + state sync
Backend “Up” but app errors	Probe proves socket, not app	TCP probe on a 500-ing app	`DipAvailability` 100% vs 5xx	HTTP/HTTPS `/healthz` probe
New flows stop, old ones continue	Graceful drain in progress	Probe failed / instance removed	`DipAvailability` dropped on that instance	Expected; finish the drain sequence
Global VIP serves a dead region	Regional probe reports healthy	Dishonest regional probe	Cross-region backend health “Up”	Make the regional probe real

Now the symptom → cause → confirm → fix table you read mid-incident, then the entries that bite hardest in detail.

#	Symptom	Root cause	Confirm (exact cmd / portal path)	Fix
1	Intermittent outbound timeouts under load, fine at rest	SNAT port exhaustion to one destination	`SnatConnectionCount` (Failed) > 0; `UsedSnatPorts` ≈ `AllocatedSnatPorts`	Explicit outbound rule; more frontend IPs; NAT Gateway; reuse connections
2	Exhaustion despite an outbound rule; ports unpredictable	Implicit SNAT from the inbound rule shadowing it	LB rule shows `disableOutboundSnat: false`	Set `disableOutboundSnat=true` on every LB rule
3	Backends “healthy” but users get 500/502	TCP probe on an app that 500s (wedged-but-listening)	Probe `protocol: Tcp`; `DipAvailability` 100% while 5xx high	Switch to HTTP/HTTPS probe on `/healthz`
4	Long-lived flows reset after minutes; short ones fine	Asymmetric routing through stateful NVAs	Firewall log “no matching state”; pattern is long-only	Floating IP (DSR) + vendor state sync; symmetric UDRs
5	New backends can’t reach the internet	Standard is secure-by-default; no egress configured	No outbound rule / NAT GW; effective routes lack default	Add an outbound rule or NAT Gateway
6	Region goes down but the global IP keeps sending traffic	Regional LB still reports healthy (probe lies)	Cross-region LB backend health “Up”; regional `DipAvailability` not 0	Make the regional probe honest (HTTP `/healthz`)
7	Mid-idle drops on long-lived connections	Idle timeout too short, no keepalives	Flows die at the idle-timeout boundary	App keepalives; raise `idleTimeoutInMinutes`; `enableTcpReset`
8	Scaling out the pool silently starves ports	`outbound-ports` computed for today, not max	New instances get fewer ports than needed	Compute ports from maximum pool size, not current
9	A NSG/UDR change breaks all protocols at once	HA Ports = no per-port blast radius	One subnet change; everything fails together	Treat NVA subnet as prod-critical; test failover
10	IP-based pool: outbound rule won’t apply	IP-based pools don’t support outbound rules	Pool is IP-based; rule rejected/ineffective	Use a NIC-based pool, or NAT Gateway for egress
11	HA Ports rule rejected on a public LB	HA Ports is internal-LB only	LB frontend is public	Use an internal LB for the HA Ports rule
12	Basic→Standard migration breaks egress/zones	Basic features don’t map 1:1	Still on Basic; retires 30 Sep 2025	Plan a Standard migration (PIP SKU, outbound, zones)

The expanded form for the entries that cost the most time:

1. Intermittent outbound timeouts under load, fine at rest. Root cause: SNAT port exhaustion, almost always many flows to one destination IP:port (the per-destination 5-tuple ceiling, not total connections). Confirm: SnatConnectionCount with a non-zero Failed dimension under load; UsedSnatPorts pinned at AllocatedSnatPorts on the busy instances.

# Failed SNAT connections per minute; assumes $RG holds the load balancer's resource group name
az monitor metrics list \
  --resource "$(az network lb show -g "$RG" -n lb-app-prod --query id -o tsv)" \
  --metric SnatConnectionCount --filter "ConnectionState eq 'Failed'" \
  --interval PT1M --aggregation Total -o table

Fix: Reuse outbound connections (shared client, keepalives); allocate ports explicitly against max pool size; add frontend IPs or a public IP prefix (+64k each); or move egress to a NAT Gateway. Scaling out is a band-aid.

2. Exhaustion despite an outbound rule; port use is unpredictable. Root cause: The load-balancing rule is providing implicit SNAT alongside your explicit outbound rule — two overlapping behaviors. Confirm: az network lb rule show ... --query disableOutboundSnat returns false. Fix: Set disableOutboundSnat=true on every load-balancing rule so egress is governed only by the outbound rule.

3. Backends report healthy but users get 500/502. Root cause: A TCP probe keeps a wedged-but-listening process in rotation; the socket is open, the app is broken. Confirm: Probe protocol is Tcp; DipAvailability shows 100% while your app’s 5xx rate is high. Fix: Switch to an HTTP/HTTPS probe against a real /healthz that exercises the app, not just the socket.

4. Long-lived flows reset after a few minutes; short flows are fine. Root cause: Asymmetric routing through a stateful NVA sandwich — the return packet traverses a different firewall than the forward packet. Confirm: Firewall session log shows “no matching state”; the failure is exclusively long flows (DB, gRPC), never short HTTP. Fix: Enable Floating IP (DSR) on the HA Ports rule, enable the vendor’s session-state sync, and keep UDRs symmetric. Point the probe at a data-plane liveness URL.

5. New backends can’t reach the internet. Root cause: Standard is secure by default — being in a pool grants no egress; nobody added an explicit path. Confirm: No outbound rule and no NAT Gateway on the subnet; effective routes lack an internet default via a managed egress. Fix: Add an outbound rule (NIC-based pool) or a NAT Gateway on the subnet.

6. A region is down but the global IP keeps sending traffic there. Root cause: The regional probe is dishonest — a TCP probe (or / that always 200s) keeps the regional LB “healthy,” so the cross-region LB never fails it over. Confirm: Cross-region LB backend health shows the region “Up” while it’s clearly degraded; regional DipAvailability isn’t dropping. Fix: Make the regional probe a true health check; global failover quality is exactly your regional probe quality.
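The port arithmetic behind the first fix can be sketched as follows. This is a minimal illustration rather than Azure tooling: the function name `ports_per_instance` is hypothetical, the 64,000 figure is the per-frontend-IP budget quoted in this article, and the rounding reflects the requirement that a manual outbound port allocation be a multiple of 8. The pool sizes in the example are made up.

```python
SNAT_PORTS_PER_FRONTEND_IP = 64_000  # per-IP budget quoted in this article


def ports_per_instance(frontend_ips: int, max_pool_size: int) -> int:
    """Largest per-instance SNAT allocation that still fits the maximum pool size."""
    if frontend_ips < 1 or max_pool_size < 1:
        raise ValueError("need at least one frontend IP and one backend instance")
    budget = frontend_ips * SNAT_PORTS_PER_FRONTEND_IP
    fair_share = budget // max_pool_size
    return (fair_share // 8) * 8  # round down to a multiple of 8


# Sizing for today's 20 instances versus an autoscale ceiling of 100 (illustrative numbers).
print(ports_per_instance(frontend_ips=1, max_pool_size=20))   # 3200
print(ports_per_instance(frontend_ips=1, max_pool_size=100))  # 640
print(ports_per_instance(frontend_ips=3, max_pool_size=100))  # 1920
```

The gap between 3200 and 640 is the "scaling out silently starves ports" trap: an allocation sized for the current pool cannot be honoured once the pool reaches its ceiling, so size against the maximum and add frontend IPs when the result is too small.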

And the fast triage table — match the signal you have to the likely cause and the immediate move, before you even open the playbook:

If you see…	It’s probably…	Do this
`SnatConnectionCount` Failed climbing under load	Per-destination SNAT exhaustion	Add a frontend IP now; plan NAT Gateway + connection reuse
`UsedSnatPorts` ≈ `AllocatedSnatPorts`, Failed still 0	About to exhaust	Raise `outbound-ports` / add an IP before it fails
Exhaustion with an outbound rule present	Implicit SNAT shadowing	Set `disableOutboundSnat=true` on the LB rule
`DipAvailability` 100% but users get 5xx	TCP probe lying healthy	Switch probe to HTTP/HTTPS `/healthz`
`DipAvailability` flapping on a slow node	Probe too tight	Raise `interval × threshold`
Only long flows reset, short ones fine	Asymmetric stateful NVA	Floating IP (DSR) + vendor state sync
New VMs have no internet	Secure-by-default, no egress	Add outbound rule or NAT Gateway
Global IP won’t fail a dead region over	Regional probe dishonest	Make regional `/healthz` real
`VipAvailability` < 100%	Datapath/frontend degraded	Check region health; open a support case
Outbound rule “won’t apply”	IP-based pool	Convert to NIC-based pool
HA Ports rule rejected	Public LB (internal-only feature)	Move HA Ports to an internal LB
Mid-idle drops at a fixed interval	Idle timeout / no keepalives	App keepalives; raise timeout; `enableTcpReset`
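The first two rows of the triage table reduce to a small decision over three metric readings. The sketch below assumes you have already pulled per-instance values for UsedSnatPorts, AllocatedSnatPorts, and the Failed count of SnatConnectionCount; the function name `snat_state` and the 0.8 default (taken from the ~80% utilization alert suggested later in this article) are illustrative choices, not platform behavior.

```python
def snat_state(used_ports: int, allocated_ports: int, failed_connections: int,
               warn_ratio: float = 0.8) -> str:
    """Classify one backend instance's SNAT health from three metric readings."""
    if allocated_ports <= 0:
        raise ValueError("allocated_ports must be positive")
    if failed_connections > 0:
        # Failed SNAT connections mean flows are already being dropped.
        return "exhausted"
    if used_ports / allocated_ports >= warn_ratio:
        # No failures yet, but little headroom is left.
        return "about-to-exhaust"
    return "ok"


print(snat_state(used_ports=1000, allocated_ports=1024, failed_connections=12))  # exhausted
print(snat_state(used_ports=900, allocated_ports=1024, failed_connections=0))    # about-to-exhaust
print(snat_state(used_ports=100, allocated_ports=1024, failed_connections=0))    # ok
```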

Best practices

Confirm everything is Standard SKU end to end. Basic LB retires 30 September 2025 — no zones, no outbound rules, no HA Ports, no cross-region. Migration is the first task, not an optimization.
Spread backends across availability zones and keep the frontend zone-redundant unless a zonal pin is specifically justified. A zone-redundant frontend over a single-zone pool is not HA.
Use a NIC-based backend pool if you need LB outbound rules; IP-based pools cannot do outbound SNAT.
Provide outbound explicitly — a dedicated outbound rule with manual outbound-ports, or a NAT Gateway. Never rely on implicit/default SNAT.
Compute allocated_outbound_ports from your maximum pool size, not today’s count, and add frontend IPs (or a public IP prefix) to grow the 64k-per-IP budget.
Set disableOutboundSnat = true on inbound rules so they don’t shadow the explicit outbound rule.
For NVA HA, use an internal LB with an HA Ports rule (protocol All, ports 0/0) and engineer symmetric routing (Floating IP / DSR and/or vendor state sync).
Prefer HTTP/HTTPS /healthz probes over TCP; size interval × probe_threshold for your detection-vs-flapping trade-off.
Build a drain step into deploys: fail the probe, wait interval × threshold, let in-flight flows close, then recycle.
For a global static IP with regional failover on L4, front regional LBs with a cross-region LB; use Front Door for global L7.
Alert on SnatConnectionCount (Failed) > 0 and dashboard UsedSnatPorts/AllocatedSnatPorts, DipAvailability, and VipAvailability.
Add VNet flow logs + Traffic Analytics on backend subnets for the flow visibility an L4 LB does not log itself.
Load-test to peak with zero failed SNAT, and rehearse a regional failover with the global IP held constant.
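The drain step in the list above is simple arithmetic, sketched here under stated assumptions. The detection time uses the article's own approximation of interval × probe_threshold; the function names and the in-flight grace period are hypothetical values you would tune to your longest expected request.

```python
def detection_seconds(interval_s: int, probe_threshold: int) -> int:
    """Approximate time for the LB to mark an instance down once its probe starts failing."""
    if interval_s < 1 or probe_threshold < 1:
        raise ValueError("interval and threshold must be at least 1")
    return interval_s * probe_threshold


def drain_wait_seconds(interval_s: int, probe_threshold: int, in_flight_grace_s: int) -> int:
    """Total pause between failing the probe and recycling the instance."""
    return detection_seconds(interval_s, probe_threshold) + in_flight_grace_s


# A 5-second probe with a threshold of 2, plus 30 seconds for in-flight flows to close.
print(detection_seconds(5, 2))       # 10
print(drain_wait_seconds(5, 2, 30))  # 40
```

A deploy script would sleep for at least `drain_wait_seconds(...)` after failing the health endpoint and before stopping the process.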

Security notes

Secure by default is a feature — keep it. Standard LB grants no inbound or outbound exposure implicitly. Provide outbound through a controlled path (outbound rule or NAT Gateway) so egress IPs are known and allow-listable, not random.
NSGs still gate the data plane. The LB forwards; the NSG on the backend NIC/subnet decides what’s allowed. Restrict inbound to the LB’s expected ports and the AzureLoadBalancer service tag for probes; never open the whole subnet. See Azure Virtual Network, Subnets and NSGs.
Allow-list egress at the destination. A deterministic outbound IP (or public IP prefix) is what a partner allow-lists. Use a prefix so you can scale within a stable CIDR rather than adding loose IPs the partner must re-approve.
HA Ports has no per-port blast radius. Because one rule governs every port, an NSG or UDR mistake on the NVA subnet exposes or breaks all protocols at once. Treat that subnet as the most sensitive in the hub; review changes like production code.
Internal LBs for east-west. Keep service-to-service and NVA traffic on internal (private) frontends so it never touches a public IP; reserve public frontends for genuine internet ingress.
Probe endpoints should reveal nothing. A /healthz should return a status, not internal topology, versions, or dependency hostnames. It is reachable from the platform, so it must not leak a system map.
Pair with a WAF where the protocol is HTTP. Standard LB does no inspection; if you need request filtering, front it with Application Gateway WAF — L4 balancing and L7 inspection are different jobs.

Cost & sizing

Standard LB has no instance size to choose — the cost model is rule-count and processed data, plus the public IPs and any NAT Gateway you attach for egress. The drivers and how they interact with the design:

Rules and data. Standard LB bills a small hourly charge for the first set of rules and a per-rule charge beyond it, plus a per-GB data processed charge. A handful of rules costs tens of rupees per day; the data charge scales with throughput.
Public IPs. Each Standard public IP carries a small hourly charge. Adding IPs to grow the SNAT budget is cheap insurance against exhaustion — far cheaper than failed transactions during a sale — but they’re not free; size to need.
NAT Gateway (the better egress path at scale) adds an hourly + per-GB charge. It usually replaces multiple outbound IPs and the port-carving headache, and is the right call once you’re fighting the 64k maths.
Cross-region LB adds the global tier’s data-processing charge on top of the regional LBs it fronts; you still pay for each regional LB underneath.
Zone-redundancy is free; cross-zone data has a tiny per-GB cost that is irrelevant next to the resilience it buys.

A rough monthly picture for a mid-size regional deployment in INR:

Cost driver	What you pay for	Rough INR / month	What it buys	Watch-out
Standard LB (rules + base)	Hourly + first rules	~₹1,500-2,500	The LB itself	Per-rule charge beyond the base set
Data processed	Per-GB through the LB	~₹0.4-0.5 / GB	Throughput	Scales with traffic; can dominate at high GB
Standard public IP (each)	Hourly per IP	~₹300-400 / IP	+64k SNAT ports each	Don’t over-provision idle IPs
NAT Gateway	Hourly + per-GB	~₹1,500-3,000	On-demand SNAT, deterministic egress	Zonal; one per zone for AZ coverage
Cross-region LB	Global data processing	~₹1,000-2,500	One static IP + DNS-free failover	On top of the regional LBs
VNet flow logs + Traffic Analytics	Storage + ingestion	~₹1,000-3,000	The visibility L4 lacks	Sample/retain sensibly
Public IP prefix (/28)	Hourly per IP in the block	~₹4,500-6,000 (16 IPs)	Allow-listable, stable egress CIDR	Pay for the whole block even if idle
Additional LB rules (beyond base)	Per-rule hourly	~₹100-200 / rule	Extra VIPs/ports	Adds up on rule-heavy LBs
Cross-zone data transfer	Per-GB inter-zone	~₹0.1 / GB	Zone resilience	Negligible vs the resilience

The sizing rule in one line: pick the minimum outbound IPs (or a NAT Gateway) that keeps UsedSnatPorts comfortably below AllocatedSnatPorts at peak, run zone-spread backends behind a zone-redundant frontend, and only add the cross-region tier when you genuinely need a single global IP. Meridian Pay landed at ₹13,500/month after fixing connection reuse and moving to NAT Gateway — lower than the ₹14,000 they paid while broken, proof the fix is usually design, not a bigger bill.
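To see how the drivers combine, the rough ranges from the table above can be summed into a low/high monthly band. This is a back-of-envelope sketch only: the figures are copied from that table and are estimates rather than a price list, the function name `monthly_range` is hypothetical, and it covers just the LB base, data processed, public IPs, and NAT Gateway rows.

```python
from typing import Tuple

# Rough INR/month ranges copied from the cost table in this article (estimates, not a price list).
ROUGH_INR = {
    "lb_base": (1_500, 2_500),      # Standard LB (rules + base)
    "data_per_gb": (0.4, 0.5),      # data processed, per GB
    "public_ip": (300, 400),        # each Standard public IP
    "nat_gateway": (1_500, 3_000),  # each NAT Gateway
}


def monthly_range(public_ips: int, data_gb: float, nat_gateways: int = 0) -> Tuple[int, int]:
    """Return a (low, high) rough monthly INR band for one regional deployment."""
    totals = []
    for bound in (0, 1):  # 0 = low end of each range, 1 = high end
        total = (
            ROUGH_INR["lb_base"][bound]
            + ROUGH_INR["data_per_gb"][bound] * data_gb
            + ROUGH_INR["public_ip"][bound] * public_ips
            + ROUGH_INR["nat_gateway"][bound] * nat_gateways
        )
        totals.append(round(total))
    return totals[0], totals[1]


# Two outbound IPs, 10,000 GB processed, one NAT Gateway (illustrative workload).
print(monthly_range(public_ips=2, data_gb=10_000, nat_gateways=1))  # (7600, 11300)
```

At this volume the per-GB line is the largest term, which matches the table's warning that data processed can dominate.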

Interview & exam questions

1. What does an outbound rule do that implicit SNAT does not, and why does it matter? An outbound rule lets you explicitly allocate SNAT ports per instance, pick the outbound frontend IP(s), set the idle timeout, and enable TCP reset — deterministic, plannable egress. Implicit SNAT (from a load-balancing rule) auto-allocates a stingy port count and is non-deterministic. You also set disableOutboundSnat=true on the LB rule so the two don’t overlap. It matters because deterministic egress is the difference between a planned 64k budget and a 2 a.m. exhaustion incident.

2. Why can an app with only 5,000 outbound connections still exhaust SNAT? Because SNAT ports are keyed on the full destination 5-tuple: the limit is ~64,000 simultaneous flows to the same destination IP:port per frontend public IP, not 64,000 total. If all 5,000 connections target one upstream and the app opens a fresh connection per request without reuse, every request consumes a new port, and because a closed connection's port is not released instantly, the ports held against that one destination grow far past what the count of live connections suggests.

3. What is an HA Ports rule, and what does it deliberately not solve? An HA Ports rule (protocol All, frontend/backend port 0, internal LB only) load-balances every port and protocol at once — built for NVAs whose port set you can’t enumerate. It does not guarantee flow symmetry: a stateful firewall needs the return packet on the same appliance, and HA Ports hashes directions independently. You add Floating IP (DSR) and vendor session-state sync to get symmetry.

4. Your backends show 100% healthy but users get 502s. Most likely cause? A TCP health probe on an app that is wedged-but-listening — the socket is open so the probe passes, but the app returns 500/502 to real requests. Confirm via DipAvailability at 100% while the 5xx rate is high; fix by switching to an HTTP/HTTPS probe against a real /healthz.

5. Difference between NIC-based and IP-based backend pools, and the constraint that decides it? NIC-based pools attach VM/VMSS NIC ipConfigurations; IP-based pools list raw private IPs. The decisive constraint: IP-based pools cannot use outbound rules. If you need LB-provided SNAT, you must use a NIC-based pool (or provide egress via NAT Gateway).

6. How does the cross-region (Global) LB decide health and where does its failover quality come from? It health-checks the regional load balancers, not your VMs directly — consuming each regional LB’s own probe signal. So global failover quality is exactly your regional probe quality: an honest regional /healthz probe means clean failover; a TCP-or-/ probe that always passes means the region never fails over even when it’s down.

7. Why is a zone-redundant frontend not enough for HA on its own? Because a zone-redundant frontend only guarantees the VIP survives a zone loss. If the backends are pinned to a single zone, losing that zone takes the app down regardless. HA requires both a zone-redundant (or appropriately zonal) frontend and zone-spread backends.

8. What is the implicit-SNAT trap and how do you avoid it? A load-balancing rule silently provides implicit outbound SNAT alongside any explicit outbound rule unless disabled, producing two overlapping behaviors and unpredictable port use. Avoid it by setting disableOutboundSnat=true on every load-balancing rule, so egress is governed solely by the explicit outbound rule.

9. A firewall sandwich resets long-lived connections but not short ones. Diagnose and fix. Classic asymmetric routing through stateful NVAs — return packets traverse a different appliance than the forward path, which has no session state, so it drops mid-stream packets. Short flows finish inside one hash window; long flows hit a rebalance. Fix with Floating IP (DSR) + vendor session-state sync and symmetric UDRs; point the probe at a data-plane liveness URL.

10. Which metric is the canary for SNAT exhaustion, and what do you alert on? SnatConnectionCount split by ConnectionState — alert on the Failed dimension > 0 sustained over ~5 minutes. Dashboard UsedSnatPorts against AllocatedSnatPorts per backend (alert at ~80% utilization) so you act with headroom before failures begin.

11. When do you pick cross-region LB over Front Door or Traffic Manager? Cross-region LB when you need a single static anycast IP for any TCP/UDP protocol with DNS-free, seconds-fast regional failover. Front Door is HTTP-only (no static IP, but L7 + WAF + edge TLS); Traffic Manager is DNS/TTL-bound (minutes to fail over, any protocol at the DNS level but no single IP).

12. How do you grow the SNAT budget without code changes, and what’s the better long-term fix? Add frontend public IPs (or a public IP prefix) — each adds ~64,000 ports — and re-compute outbound-ports against max pool size. The better long-term fix is connection reuse in the app (cuts outbound connections drastically) and, at scale, a NAT Gateway that allocates ports on demand independent of instance count.
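The "Failed > 0 sustained over ~5 minutes" condition from question 10 can be expressed as a check over per-minute totals, such as the values a PT1M metrics query returns. This is a sketch of the alert logic only, with a hypothetical function name; in practice an Azure Monitor alert rule would evaluate it for you.

```python
from typing import Sequence


def sustained_failures(failed_per_minute: Sequence[int], window: int = 5) -> bool:
    """True when the most recent `window` one-minute totals are all above zero."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(failed_per_minute) < window:
        return False  # not enough samples to call it sustained
    return all(count > 0 for count in failed_per_minute[-window:])


print(sustained_failures([0, 0, 3, 5, 2, 7, 1]))  # True
print(sustained_failures([4, 0, 3, 5, 2, 7]))     # False
```

Requiring every sample in the window to be non-zero filters out a single transient blip while still firing within a few minutes of real exhaustion.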

These questions map most directly to AZ-700 (Network Engineer: design and implement load balancing and network connectivity), where the egress/SNAT and NVA topics are squarely in scope. AZ-104 (Administrator) covers configuring load balancing, probes, and rules, and the resilience and active-active design angle touches AZ-305 (Solutions Architect). A compact cert mapping for revision:

Question theme	Primary cert	Exam objective area
Outbound rules, SNAT maths, NAT Gateway	AZ-700	Design & implement network connectivity / load balancing
HA Ports, firewall sandwich, symmetry	AZ-700	Implement load balancing; secure connectivity
Probes, rules, NIC vs IP pools	AZ-104	Configure load balancing
Zone-redundant vs zonal frontend/backends	AZ-104 / AZ-305	Resilience & availability
Cross-region LB vs Front Door vs Traffic Manager	AZ-700 / AZ-305	Global routing & multi-region design
SNAT/DipAvailability metrics & alerting	AZ-104 / AZ-700	Monitor & troubleshoot networking

Quick check

1. An app opens ~3,000 outbound connections, all to a single payment API, with a new connection per request, and you start seeing timeouts under load. Which limit are you hitting and what’s the metric that proves it?
2. You configured an explicit outbound rule but still see unpredictable port use and exhaustion. What single property did you forget, and on which rule?
3. True or false: a zone-redundant frontend in front of a single-zone VMSS gives you high availability.
4. Your firewall sandwich resets long-lived gRPC and DB connections but short HTTP calls are fine. What’s the root cause and the two fixes?
5. You need one static IP for a UDP service with automatic failover across two regions. Which Azure load balancer, and why not Front Door?

Answers

1. SNAT port exhaustion against the per-destination 5-tuple ceiling (~64,000 simultaneous flows to the same destination IP:port per frontend public IP). The proof is SnatConnectionCount with a non-zero Failed dimension (and UsedSnatPorts ≈ AllocatedSnatPorts). The total connection count being modest is irrelevant, because they all target one destination.
2. disableOutboundSnat=true on the load-balancing rule. Without it, the inbound rule silently provides implicit SNAT alongside your explicit outbound rule, giving two overlapping behaviors and unpredictable port use.
3. False. Zone-redundancy on the frontend only protects the VIP. If the backends are in one zone, losing that zone takes the app down. HA needs zone-spread backends too.
4. Asymmetric routing through the stateful NVAs: return packets land on a different appliance with no session state and get dropped mid-stream. Fixes: Floating IP (Direct Server Return) on the HA Ports rule and the vendor’s session-state synchronization (plus symmetric UDRs).
5. The cross-region (Global) Load Balancer: it gives a single static anycast IP for any TCP/UDP protocol with DNS-free regional failover. Front Door is HTTP-only and provides no static IP, so it cannot serve a UDP service.
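The first answer's reasoning can be made concrete with Little's law: ports in use at steady state are roughly the request rate multiplied by how long each port stays occupied. The sketch below is an estimate under assumptions, not a model of the platform: the function name is hypothetical, and `seconds_port_held` (connection lifetime plus whatever time passes before the port is reusable) is a number you would measure rather than one taken from this article.

```python
import math
from typing import Optional


def steady_state_ports(requests_per_second: float, seconds_port_held: float,
                       pooled_connections: Optional[int] = None) -> int:
    """Estimate SNAT ports tied up against one destination at steady state."""
    if pooled_connections is not None:
        # Reused connections: port usage is capped by the pool, not by the request rate.
        return pooled_connections
    # One connection per request: Little's law, rate x time each port stays occupied.
    return math.ceil(requests_per_second * seconds_port_held)


# 250 requests/s with each port occupied for 12 s (hypothetical), then the same load over 50 reused connections.
print(steady_state_ports(250, 12))                         # 3000
print(steady_state_ports(250, 12, pooled_connections=50))  # 50
```

The same traffic needs sixty times fewer ports once connections are reused, which is why connection reuse is listed ahead of adding frontend IPs as the long-term fix.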

Glossary

Standard Load Balancer — a pass-through, zero-added-latency Layer-4 device that rewrites the 5-tuple and forwards packets; zone-aware, with outbound rules and HA Ports. The default Azure LB SKU (Basic retires 30 Sep 2025).
Frontend IP configuration — the VIP traffic enters on; public or internal, zonal or zone-redundant. Each public IP adds ~64,000 SNAT ports.
Backend pool — the set of targets; NIC-based (VM/VMSS NIC, supports outbound rules) or IP-based (raw private IPs, no outbound rules).
Load-balancing rule — maps a frontend port to a backend port across the whole pool; provides implicit SNAT unless disableOutboundSnat is set.
Inbound NAT rule — maps a single frontend port to a single backend instance (e.g. SSH/RDP to one VM).
Health probe — defines “healthy” via TCP (port open), HTTP, or HTTPS (returns 200); detection time ≈ interval × probe_threshold.
Outbound rule — explicit egress with manual SNAT port allocation, a chosen outbound IP, idle timeout, and TCP reset — the deterministic-egress control.
SNAT (Source NAT) port — one outbound 5-tuple translation entry; the limit is ~64,000 simultaneous flows to one destination IP:port per frontend public IP.
disableOutboundSnat — a load-balancing-rule flag that turns off implicit SNAT so egress is governed only by the explicit outbound rule.
HA Ports rule — a load-balancing rule with protocol All and ports 0/0 (internal LB only) that balances every port and protocol at once, for NVA pools.
Floating IP (Direct Server Return / DSR) — makes the backend see the original VIP and keeps routing symmetric; required for stateful NVA correctness.
Firewall sandwich — an external/internal LB pair around an active-active NVA pool, HA Ports on the internal side; needs flow symmetry to work.
Zone-redundant frontend — a VIP served from all availability zones, surviving a single-zone loss (only “HA” if backends are zone-spread too).
Cross-region (Global) Load Balancer — a global anycast VIP whose backend pool is regional Standard LBs; one static IP, DNS-free regional failover, L4 only.
DipAvailability — the Health Probe Status metric: percent of probes succeeding per backend; the health/drain signal.
VipAvailability — the Data Path Availability metric: whether the frontend datapath itself is up.
SnatConnectionCount — established SNAT flows split by ConnectionState; a rising Failed count is the canary for port exhaustion.
Graceful drain — the LB stops new flows to an unhealthy/removed instance but lets established flows finish; sequenced into deploys, not a setting.

Next steps

You can now engineer Standard LB end to end — deterministic SNAT, HA Ports symmetry, honest probes, and a global front end — and diagnose any of its failure modes. Build outward:

Next: Diagnosing and Killing SNAT Port Exhaustion on Cloud NAT Gateways — the other egress path, and usually the better one at scale.
Related: Deploying HA Third-Party NVAs in Azure: The Load Balancer Sandwich Pattern — the full firewall-sandwich topology behind the HA Ports section.
Related: Azure Load Balancer vs Application Gateway: Picking the Right Traffic Manager — when you need L7 instead of L4.
Related: Anycast at the Edge: Global Accelerator-Style TCP/UDP Routing for Latency and Failover — the edge-anycast variant of the cross-region front end.
Related: Azure Multi-Region Active-Active Architecture: Designing for Zero-Downtime — where the cross-region LB fits in a full active-active design.
Related: Diagnosing Azure VNet Connectivity: NSGs, UDRs, Effective Routes & Network Watcher — the routing tools that confirm symmetry and egress paths.