Deploying HA Third-Party NVAs in Azure: The Load Balancer Sandwich Pattern

Running a third-party firewall, IDS, or proxy as a Network Virtual Appliance (NVA) in Azure is straightforward until you need it to survive a single-instance failure without a prolonged outage. This article walks through the load balancer sandwich pattern end to end: why naive NVA HA breaks, how to wrap an NVA pair between an external and an internal Standard Load Balancer, and how to keep stateful inspection happy with symmetric flows.

Why NVA HA Is Hard

A stateless router can be made HA trivially. A firewall cannot, because it tracks connection state. Three independent problems collide:

  1. Stateful flow pinning. The forward and return packets of a TCP session must traverse the same NVA instance. If the SYN goes through NVA-A and the SYN/ACK comes back through NVA-B, NVA-B sees a packet with no matching session table entry and silently drops it. This is the asymmetric routing problem, and it is the single most common reason “working” NVA HA setups mysteriously lose connections.

  2. Route convergence on failure. The classic pattern points a User Defined Route (UDR) 0.0.0.0/0 next hop at NVA-A’s NIC. When NVA-A dies, something has to rewrite that UDR to NVA-B. Doing that with a custom script (an Azure Function watching health and calling the ARM API) is slow, racy, and brittle — convergence takes tens of seconds and the script itself becomes a failure domain.

  3. Single-NIC SNAT confusion. If both NVAs SNAT to their own NIC IPs, return traffic naturally comes back to the originating instance. But the moment you want active-active throughput, or you have non-SNAT (routed/transparent) flows, symmetry is no longer free.

The load balancer sandwich solves all three at the data plane. Failover is handled by health probes within seconds (probe interval times failure threshold), not by an ARM control-plane script. Symmetry comes from the load balancer's flow hashing, which sends both directions of a flow to the same backend.

The Load Balancer Sandwich

The pattern places the NVA pair between two Standard Load Balancers:

                 Internet / on-prem
                        |
            [ External Standard LB ]   <- public or peered ingress
                  |          |
              [ NVA-A ]   [ NVA-B ]     <- multi-NIC, IP forwarding ON
                  |          |
            [ Internal Standard LB ]    <- HA Ports frontend
                        |
              UDR 0.0.0.0/0 -> ILB frontend IP
                        |
                   Spoke subnets

Why two LBs? Because traffic flows in both directions through the firewall. East-west and egress traffic hits the ILB. Ingress to published services and the return path of egress flows hit the external LB. Each direction needs a load-balanced, health-probed entry point so that the failure of one NVA removes it from both rotations.

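The critical insight: when both directions of a flow pass through the same HA Ports rule, the load balancer hashes them to the same NVA, and that is what gives you flow symmetry without per-instance SNAT. Flows that enter through one load balancer and return through the other cross two independent hash decisions, so keep the load distribution mode identical on both LBs and treat symmetry on those paths as something to verify (or to enforce with SNAT on the NVA) rather than assume.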
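As a mental model only, direction-independent flow hashing can be sketched in a few lines of shell. This is a toy, not Azure's actual hash (which is not public); flow_backend is a hypothetical helper, and the addresses are made-up examples. The point is simply that if the key is built from the sorted endpoint pair, the forward and return packets of a flow map to the same backend index.

```shell
# Toy model only: NOT Azure's hash, just a direction-independent key.
# flow_backend SRC_IP SRC_PORT DST_IP DST_PORT PROTO BACKEND_COUNT
flow_backend() {
  # Sort the two endpoints so A->B and B->A build the same key.
  key=$(printf '%s\n%s\n' "$1:$2" "$3:$4" | LC_ALL=C sort | tr '\n' '|')
  # cksum prints "<crc> <byte count>"; keep the CRC and reduce it to an index.
  crc=$(printf '%s%s' "$key" "$5" | cksum | cut -d' ' -f1)
  echo $(( crc % $6 ))
}

fwd=$(flow_backend 10.1.0.4 50123 10.2.0.8 443 tcp 2)   # client -> server
rev=$(flow_backend 10.2.0.8 443 10.1.0.4 50123 tcp 2)   # server -> client
echo "forward=$fwd return=$rev"
```

Both values are always equal in this model. A key built from the unsorted tuple would not have that property, which is the failure mode the asymmetric routing problem describes.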

Step 1 — Deploy the NVA Instances with Multiple NICs and IP Forwarding

Each NVA needs at least two NICs: an untrusted/external NIC and a trusted/internal NIC. (Many vendor images add a third NIC for management.) Two things are non-negotiable:

Create the NICs with forwarding on:

RG=rg-hub-nva
LOC=eastus
VNET=vnet-hub

# Untrusted (external) NICs
az network nic create -g $RG -n nic-nva-a-untrust \
  --vnet-name $VNET --subnet snet-untrust \
  --ip-forwarding true --accelerated-networking true

az network nic create -g $RG -n nic-nva-b-untrust \
  --vnet-name $VNET --subnet snet-untrust \
  --ip-forwarding true --accelerated-networking true

# Trusted (internal) NICs
az network nic create -g $RG -n nic-nva-a-trust \
  --vnet-name $VNET --subnet snet-trust \
  --ip-forwarding true --accelerated-networking true

az network nic create -g $RG -n nic-nva-b-trust \
  --vnet-name $VNET --subnet snet-trust \
  --ip-forwarding true --accelerated-networking true

Deploy the two appliance VMs. Use a vendor BYOL/PAYG image from the Marketplace (you must accept its terms once with az vm image terms accept). Place both instances in an availability set or, preferably, across availability zones so a single zone or rack failure cannot take out both:

# Example: zonal placement, one VM per zone.
# --image is a placeholder; use your vendor's URN from the Marketplace.
az vm create -g $RG -n nva-a \
  --image <publisher>:<offer>:<sku>:<version> \
  --zone 1 \
  --nics nic-nva-a-untrust nic-nva-a-trust \
  --size Standard_D4s_v5 --admin-username azureuser \
  --generate-ssh-keys

az vm create -g $RG -n nva-b \
  --image <publisher>:<offer>:<sku>:<version> \
  --zone 2 \
  --nics nic-nva-b-untrust nic-nva-b-trust \
  --size Standard_D4s_v5 --admin-username azureuser \
  --generate-ssh-keys
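The Marketplace terms mentioned above have to be accepted before the az vm create calls will succeed. A minimal sketch, assuming the same placeholder URN used for --image (the URN variable is introduced here only for illustration):

```shell
# Placeholder: substitute your vendor's real URN from the Marketplace.
URN="<publisher>:<offer>:<sku>:<version>"

# Check whether the terms are already accepted for this subscription.
az vm image terms show --urn "$URN" --query accepted

# Accept them once per subscription and plan.
az vm image terms accept --urn "$URN"
```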

Note on accelerated networking: most modern firewall images support it and you want it for throughput, but confirm against your vendor’s compatibility matrix. Some appliance versions require it off.

Step 2 — Configure the Internal Standard Load Balancer with HA Ports

The ILB is where the magic happens. Create a Standard internal LB, a backend pool containing both trusted NICs, a health probe, and a single HA Ports rule.

ILB=ilb-nva-trust

# 1. Standard internal LB with a frontend in the trusted subnet
az network lb create -g $RG -n $ILB --sku Standard \
  --vnet-name $VNET --subnet snet-trust \
  --frontend-ip-name fe-trust \
  --backend-pool-name bep-nva-trust \
  --private-ip-address 10.0.1.10

# 2. Health probe: probe the firewall's data-plane health endpoint.
#    Use TCP to a port the firewall actually serves, or HTTP to a
#    vendor health URL. Do NOT probe a port that is up while the
#    dataplane is down. Port 22 below is only a placeholder; use the
#    port your vendor documents for load balancer health checks.
az network lb probe create -g $RG --lb-name $ILB \
  -n probe-nva --protocol Tcp --port 22 \
  --interval 5 --threshold 2

# 3. HA Ports load-balancing rule (protocol All, port 0).
#    Floating IP is optional with HA Ports; it is enabled here, so
#    confirm the setting against your vendor's deployment guide.
az network lb rule create -g $RG --lb-name $ILB \
  -n haports \
  --protocol All --frontend-port 0 --backend-port 0 \
  --frontend-ip-name fe-trust \
  --backend-pool-name bep-nva-trust \
  --probe-name probe-nva \
  --floating-ip true \
  --idle-timeout 30

Then add both trusted NIC IP configurations to the backend pool:

az network nic ip-config address-pool add -g $RG \
  --nic-name nic-nva-a-trust --ip-config-name ipconfig1 \
  --lb-name $ILB --address-pool bep-nva-trust

az network nic ip-config address-pool add -g $RG \
  --nic-name nic-nva-b-trust --ip-config-name ipconfig1 \
  --lb-name $ILB --address-pool bep-nva-trust
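The external half of the sandwich follows the same shape. What follows is a sketch under stated assumptions rather than a definitive build: the names pip-elb-nva, elb-nva-untrust, fe-untrust, bep-nva-untrust, probe-nva-untrust, and https-in are hypothetical, a single published TCP 443 service stands in for whatever you actually expose, and port 22 is again only a placeholder probe port.

```shell
ELB=elb-nva-untrust   # hypothetical name for the external LB

# Standard public IP for the external frontend.
az network public-ip create -g $RG -n pip-elb-nva \
  --sku Standard --allocation-method Static

# Standard public LB with one frontend and one backend pool.
az network lb create -g $RG -n $ELB --sku Standard \
  --public-ip-address pip-elb-nva \
  --frontend-ip-name fe-untrust \
  --backend-pool-name bep-nva-untrust

# Same probe settings as the ILB so both LBs eject a failed NVA together.
az network lb probe create -g $RG --lb-name $ELB \
  -n probe-nva-untrust --protocol Tcp --port 22 \
  --interval 5 --threshold 2

# Example ingress rule; public LBs do not support HA Ports, so each
# published service needs its own rule.
az network lb rule create -g $RG --lb-name $ELB -n https-in \
  --protocol Tcp --frontend-port 443 --backend-port 443 \
  --frontend-ip-name fe-untrust \
  --backend-pool-name bep-nva-untrust \
  --probe-name probe-nva-untrust

# Add both untrusted NICs to the external backend pool.
for nic in nic-nva-a-untrust nic-nva-b-untrust; do
  az network nic ip-config address-pool add -g $RG \
    --nic-name $nic --ip-config-name ipconfig1 \
    --lb-name $ELB --address-pool bep-nva-untrust
done
```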

A few details that bite people:

Step 3 — Point UDRs at the ILB Frontend, Not a Single NVA

This is the change that eliminates the failover script. In every spoke (and in the hub subnets that need to egress through the firewall), the default route’s next hop is the ILB frontend IP, with next-hop type VirtualAppliance.

RT=rt-spoke1

az network route-table create -g $RG -n $RT

# Default route -> ILB frontend IP (10.0.1.10), NOT an NVA NIC.
az network route-table route create -g $RG --route-table-name $RT \
  -n default-to-firewall \
  --address-prefix 0.0.0.0/0 \
  --next-hop-type VirtualAppliance \
  --next-hop-ip-address 10.0.1.10

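# Associate with the spoke subnet. The route table lives in $RG, not in
# rg-spoke1, so pass its full resource ID rather than its name.
RT_ID=$(az network route-table show -g $RG -n $RT --query id -o tsv)
az network vnet subnet update -g rg-spoke1 \
  --vnet-name vnet-spoke1 -n snet-app \
  --route-table "$RT_ID"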

Because the next hop is a load balancer frontend, Azure forwards each new flow to a healthy backend NVA. If an NVA fails its probe, the ILB stops sending it flows — the UDR never changes. Convergence is governed by your probe interval and threshold (e.g. 5s interval, 2 failures => ~10s), not by a control-plane rewrite.
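The probe arithmetic is simple enough to keep next to your LB settings. The helper below is hypothetical (probe_detect_seconds is not an Azure tool); it only multiplies interval by threshold, which approximates the time to mark a backend down and ignores where in the probe cycle the failure happens.

```shell
# Hypothetical helper: approximate seconds until the LB marks a backend down.
# probe_detect_seconds INTERVAL_SECONDS FAILURE_THRESHOLD
probe_detect_seconds() {
  echo $(( $1 * $2 ))
}

probe_detect_seconds 5 2    # the settings used in Step 2: prints 10
probe_detect_seconds 15 2   # a slower probe for comparison: prints 30
```

Shortening the interval speeds up detection but also makes the probe more sensitive to brief dataplane hiccups, so pick the values against your tolerance for false ejections.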

Equivalent Terraform for the route, if you manage routing as code:

resource "azurerm_route" "default_to_fw" {
  name                   = "default-to-firewall"
  resource_group_name    = azurerm_resource_group.hub.name
  route_table_name       = azurerm_route_table.spoke1.name
  address_prefix         = "0.0.0.0/0"
  next_hop_type          = "VirtualAppliance"
  next_hop_in_ip_address = azurerm_lb.ilb_trust.frontend_ip_configuration[0].private_ip_address
}

Do not forget the hub’s own subnets. The GatewaySubnet, for instance, may need UDRs steering on-prem-bound or spoke-bound return traffic back through the firewall so that the return path is also load-balanced and symmetric. Mismatched forward/return steering is the classic source of one-way connectivity.

Step 4 — Ensure Flow Symmetry So Stateful Inspection Survives

Symmetry is the whole game. With the sandwich, you get it as follows:

az network lb rule update -g $RG --lb-name $ILB -n haports \
  --load-distribution SourceIP   # or SourceIPProtocol

There are two architectural ways to guarantee return symmetry, and you should pick one deliberately:

Approach | How return traffic stays symmetric | Trade-off
-------- | ---------------------------------- | ---------
SNAT on the NVA | NVA rewrites source IP to its trusted-NIC IP; return naturally routes back to that NIC | Simplest; loses the original client IP for downstream inspection/logging unless an HTTP proxy passes it in X-Forwarded-For
No-SNAT (routed/transparent) + matching LB hash | Both LBs hash the 5-tuple identically, return flow re-selects the same NVA | Preserves client IP; depends entirely on consistent LB config and correct return UDRs

If you run no-SNAT (to preserve client IPs through the firewall), the return-path UDRs and the matching LB distribution mode are load-bearing. If you run SNAT, symmetry is largely self-healing, but the destination workload no longer sees the original source IP; only the firewall, which performs the translation, can still log it.
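Because the no-SNAT approach leans on both load balancers using the same distribution mode, it is worth checking that explicitly. A small sketch, assuming the ILB rule from Step 2 and a hypothetical external LB named elb-nva-untrust with a rule named https-in (neither external name comes from the walkthrough):

```shell
# Read the distribution mode from each rule (Default, SourceIP, or SourceIPProtocol).
ilb_mode=$(az network lb rule show -g $RG --lb-name ilb-nva-trust \
  -n haports --query loadDistribution -o tsv)
elb_mode=$(az network lb rule show -g $RG --lb-name elb-nva-untrust \
  -n https-in --query loadDistribution -o tsv)

if [ "$ilb_mode" = "$elb_mode" ]; then
  echo "distribution modes match: $ilb_mode"
else
  echo "MISMATCH: ILB=$ilb_mode external=$elb_mode"
fi
```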

Active-Active vs Active-Passive Trade-offs

Dimension | Active-Active (this pattern) | Active-Passive
--------- | ---------------------------- | --------------
Throughput | Sum of both instances | One instance; standby idle
Failover behavior | Probe drops dead node; surviving node already carries ~half the load | Standby promotes; may need vendor cluster sync
Session survival on failover | Flows on the dead node are reset unless vendor does session sync | Vendor HA sync can preserve sessions
Complexity | LB config + symmetry discipline | Vendor clustering + (often) UDR/IP move
Cost efficiency | Both nodes earning | Half your firewall capacity sits idle
Active-active maximizes utilization but, by default, sessions pinned to a failed NVA are dropped and must reconnect — the load balancer reroutes new flows to the survivor, it cannot migrate an in-flight session table. Some vendors offer session-state synchronization between cluster members; if session survival (not just connectivity survival) matters, enable it per the vendor’s clustering guide and size for the sync overhead.

SNAT and Port Exhaustion

When the NVA SNATs many east-west or egress flows behind a small set of IPs, you can hit SNAT port exhaustion under high connection rates. Mitigations:

Enterprise scenario

A payments platform team ran a Palo Alto VM-Series pair in the hub as the no-SNAT egress and inspection chokepoint for ~40 spokes. They needed original client IPs preserved for PCI logging, so SNAT was off. The sandwich was textbook: Standard ILB, HA Ports, floating IP, UDRs at the ILB frontend. It passed every functional test and held for months.

Then they enabled forced tunneling so spoke-to-on-prem traffic would also be inspected. Connectivity to on-prem subnets started failing intermittently, for roughly half of all new flows. Classic asymmetry. The forward path (spoke -> ILB -> NVA-A -> ExpressRoute) was load-balanced; the return path was not. Return traffic from on-prem arrived in the GatewaySubnet, which had no route table, so it took the system route straight back to the spoke, bypassing the firewall entirely. With no SNAT to pull replies back to the originating NVA, the firewall saw only one direction of each affected flow, and stateful inspection dropped the sessions.

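The fix was a GatewaySubnet UDR steering return traffic for the spoke address space (10.0.0.0/8 in this example) back through the same ILB frontend, restoring symmetry:

# Azure selects routes by longest prefix match, so a more specific
# system route (for example a /16 peering route) still wins over this
# /8. In practice, use prefixes that match each spoke's address space.
az network route-table route create -g rg-hub-nva \
  --route-table-name rt-gateway -n return-via-fw \
  --address-prefix 10.0.0.0/8 \
  --next-hop-type VirtualAppliance \
  --next-hop-ip-address 10.0.1.10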
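A route in rt-gateway does nothing until the table is associated with the GatewaySubnet. A minimal sketch, assuming rt-gateway already exists in rg-hub-nva and the gateway lives in vnet-hub (the VNet name is carried over from Step 1, not stated in the scenario):

```shell
# Associate the route table with the GatewaySubnet.
az network vnet subnet update -g rg-hub-nva \
  --vnet-name vnet-hub -n GatewaySubnet \
  --route-table rt-gateway

# Confirm the association: the GatewaySubnet ID should be listed.
az network route-table show -g rg-hub-nva -n rt-gateway \
  --query "subnets[].id" -o tsv
```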

The lesson: the load balancer sandwich only guarantees symmetry for traffic it actually sees in both directions. Any subnet originating return traffic — GatewaySubnet above all — needs an explicit UDR, or no-SNAT flows die the moment the topology grows a new path.

Verify

Confirm the data plane behaves before you trust it.

1. NICs have IP forwarding on:

az network nic show -g $RG -n nic-nva-a-trust \
  --query "enableIpForwarding"   # expect: true

2. ILB rule is HA Ports with floating IP:

az network lb rule show -g $RG --lb-name ilb-nva-trust -n haports \
  --query "{proto:protocol, fePort:frontendPort, bePort:backendPort, floatingIp:enableFloatingIp}"
# expect: proto=All, fePort=0, bePort=0, floatingIp=true

3. Both NVAs are members of the backend pool and report healthy:

az network lb show -g $RG -n ilb-nva-trust \
  --query "backendAddressPools[].loadBalancerBackendAddresses[].name"

Then check probe health in Azure Monitor metrics (Health Probe Status / DipAvailability) for the ILB — both backends should report healthy.
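The same metric can be pulled from the CLI instead of the portal. A short sketch; ILB_ID is a variable introduced here for illustration, and the one-minute grain is an arbitrary choice:

```shell
# Resolve the ILB's resource ID, then list recent probe availability.
ILB_ID=$(az network lb show -g $RG -n ilb-nva-trust --query id -o tsv)

az monitor metrics list --resource "$ILB_ID" \
  --metric DipAvailability \
  --aggregation Average --interval PT1M -o table
```

An average of 100 means every probed backend answered in that minute; anything lower deserves a look at the individual NVAs.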

4. UDR effective routes from a spoke NIC point at the ILB frontend:

az network nic show-effective-route-table \
  -g rg-spoke1 -n <spoke-vm-nic> -o table
# 0.0.0.0/0 should show next hop = VirtualAppliance, IP = 10.0.1.10

5. Symmetry / failover smoke test:
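One possible way to run this check, offered as a sketch rather than the author's procedure: generate a steady stream of new connections from a spoke VM, stop one appliance, and watch how long requests fail. probe_loop is a hypothetical helper, and the URL you pass it is whatever endpoint your spoke reaches through the default route.

```shell
# Hypothetical helper for the spoke VM: COUNT timed requests, one per second.
# probe_loop URL COUNT
probe_loop() {
  i=0
  while [ "$i" -lt "$2" ]; do
    # -m 3 caps each attempt at 3 seconds so a black-holed flow shows up as FAIL.
    if curl -s -o /dev/null -m 3 "$1"; then status=ok; else status=FAIL; fi
    echo "$(date +%T) $status"
    i=$((i + 1))
    sleep 1
  done
}
```

Start something like probe_loop https://example.com 120 on the spoke VM, then from your workstation run az vm deallocate -g $RG -n nva-a. Expect FAIL lines for roughly the probe interval times the threshold, after which new connections should succeed through the surviving NVA. Bring the node back with az vm start -g $RG -n nva-a and repeat for nva-b.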

Pre-Production Checklist

Pitfalls

Next Steps

Once the sandwich is stable, codify all of it (NICs, LBs, probes, route tables, UDR associations) in Terraform or Bicep so the symmetry-critical settings cannot drift by hand. Add alerting on the ILB DipAvailability metric and on per-NVA throughput so you see a degraded node before customers do, and run the failover smoke test on a schedule rather than once at go-live.
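As a starting point for that alerting, a metric alert on the ILB can be created from the CLI. This is a sketch with assumed values: the alert name, the 100 percent threshold, the five-minute window, and the severity are all illustrative choices, and without an --action pointing at an action group the alert fires without notifying anyone.

```shell
ILB_ID=$(az network lb show -g $RG -n ilb-nva-trust --query id -o tsv)

# Fire when average probe availability drops below 100% over 5 minutes.
az monitor metrics alert create -g $RG -n alert-ilb-probe-health \
  --scopes "$ILB_ID" \
  --condition "avg DipAvailability < 100" \
  --window-size 5m --evaluation-frequency 1m \
  --severity 2 \
  --description "ILB backend probe availability dropped below 100%"
```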
