Most enterprises do not retire their SD-WAN when they go all-in on a cloud backbone. They want the branch overlay and the cloud transit fabric to be one routing domain: a store in Ohio reaching a workload in East US 2 should traverse the SD-WAN overlay to the nearest cloud hub, then ride the provider backbone the rest of the way, with routes learned dynamically and failover handled in the data plane. This article walks the full integration on Azure Virtual WAN with a partner SD-WAN appliance in the hub, covers the equivalent AWS Transit Gateway / Cloud WAN Connect pattern, and is opinionated about the parts that bite: BGP loop avoidance, breakout vs backhaul, and sizing the underlay you actually need.
I will use Arista VeloCloud as the concrete partner. As of mid-2026 this matters: new deployments of VMware SD-WAN in Azure Virtual WAN are blocked at the end of June 2026 — existing deployments keep running, but for anything green-field you deploy the arista-velocloud-sdwan managed application instead. The architecture below is identical across the supported connectivity partners (Barracuda, Cisco Catalyst SD-WAN, Versa, HPE Aruba EdgeConnect).
1. SD-WAN vs traditional WAN: what actually changes
Traditional WAN routes packets per-prefix over a fixed circuit. SD-WAN builds an encrypted overlay of tunnels between edge devices and a controller-orchestrated fabric, then makes forwarding decisions per-application on top of that overlay. Three properties drive the cloud integration design:
- Transport independence. An edge device bonds whatever underlays it has — MPLS, broadband, LTE/5G — into one logical overlay. The cloud only ever sees the overlay tunnel(s) terminating on the hub appliance; it does not know or care which physical transport a packet rode.
- App-aware steering. The overlay classifies traffic (DSCP, DPI, FQDN) and steers per business intent: Microsoft 365 breaks out locally, ERP backhauls to a regional hub, everything else takes the lowest-loss path. This is the lever you wire into cloud egress policy later.
- Centralized orchestration. A controller (VeloCloud Orchestrator, Cisco vManage/Catalyst Manager, Versa Director) pushes config and certificates. This is what makes zero-touch provisioning of a branch possible, and what you integrate with — not the individual boxes.
The cloud backbone (Azure Virtual WAN, AWS Cloud WAN) is the transit core the overlay plugs into. The win is that branch-to-cloud and branch-to-branch-via-cloud become managed, BGP-driven, and any-to-any, instead of a mesh of point-to-point IPsec tunnels you hand-maintain.
2. Insertion models: NVA in the hub vs native integration
There are two ways to land an SD-WAN overlay into the cloud backbone, and choosing wrong is expensive to undo.
| Model | What it is | When to use |
|---|---|---|
| Partner NVA in the hub | The SD-WAN vendor’s gateway runs inside the managed virtual hub as a first-class appliance, peering with the hub router. | Branches terminate SD-WAN tunnels directly on a cloud-resident gateway; you want vendor parity with on-prem edges and unified orchestration. |
| Native / indirect (NVA in a spoke VNet) | The appliance runs in a normal spoke VNet; it peers with the hub router over BGP or connects via VPN/Connect. | You want full control of the VM, a vendor not in the managed program, or a transitional design. |
On Azure, the managed-NVA-in-hub model is the clean one: the appliance is Availability Zone aware and automatically deployed highly available, it peers with the Virtual WAN hub router and participates in routing decisions like a Microsoft gateway, and you do not create Site, Site-to-Site connection, or P2S resources for branches — the appliance and its orchestrator own that. You still create hub-to-VNet connections for your workloads.
Two hard constraints to internalize before you build:
- A managed NVA requires a Standard virtual hub (not Basic).
- Licensing is BYOL only: you buy the SD-WAN license from the vendor, and Microsoft bills the NVA Infrastructure Units and underlying resources separately.
3. Deploy the partner appliance HA pair and wire it to the hub router
3a. Size the hub and the NVA scale unit
The unit of capacity is the NVA Infrastructure Unit: 1 unit = 500 Mbps of aggregate throughput, and a deployment can range from 2 to 80 units (the vendor publishes which scale points they actually support). That 500 Mbps is raw infrastructure throughput — encryption, encapsulation, and any DPI eat into it, so the effective SD-WAN throughput per unit is vendor-specific and lower. Size to peak, not to best-case test numbers, because the n+1 instance model only protects you when you are under the rated ceiling.
Instance count scales with the chosen unit, which matters for IP planning:
| Scale unit | Instances deployed |
|---|---|
| 2 - 20 | 2 |
| 30 - 40 | 3 |
| 60 | 4 |
| 80 | 5 |
The hub address space drives how many NVA interface IPs you get, and the NVA subnets cannot be resized after creation. Use /23 minimum (required for NVAs with more than two NICs); a /23 gives 11 IPs each for the internal and external NVA subnets, /22 gives 27, /21 gives 59. Pick generously — adding instances or IP configs later consumes from the same fixed pool.
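The IP arithmetic above can be sanity-checked in a few lines. This is a minimal sketch under stated assumptions. Instance counts come from the scale-unit table. Each NVA subnet is assumed to be 5 bits longer than the hub prefix (/23 to /28); that is an inference that reproduces the 11 / 27 / 59 figures, not a documented rule. Azure reserves 5 addresses in every subnet.

```python
"""Sketch: NVA instance count and per-subnet IP budget for a Virtual WAN hub.

Assumptions: instance counts follow the scale-unit table above; each NVA subnet
is 5 bits longer than the hub prefix (inferred, /23 -> /28); Azure reserves
5 addresses in every subnet.
"""

AZURE_RESERVED_PER_SUBNET = 5


def instances_for_scale_unit(units: int) -> int:
    """Map an NVA Infrastructure Unit count to deployed instances, per the table."""
    if 2 <= units <= 20:
        return 2
    if 30 <= units <= 40:
        return 3
    if units == 60:
        return 4
    if units == 80:
        return 5
    raise ValueError(f"scale unit {units} is not in the published table")


def usable_ips_per_nva_subnet(hub_prefix_len: int) -> int:
    """Usable IPs in each of the internal/external NVA subnets for a hub prefix."""
    if hub_prefix_len > 23:
        raise ValueError("use a /23 or larger hub address space for NVAs")
    subnet_prefix_len = hub_prefix_len + 5  # assumed mapping: /23 -> /28
    return 2 ** (32 - subnet_prefix_len) - AZURE_RESERVED_PER_SUBNET


for prefix in (23, 22, 21):
    print(f"/{prefix}: {usable_ips_per_nva_subnet(prefix)} usable IPs per NVA subnet")
print("80 units ->", instances_for_scale_unit(80), "instances")
```

Comparing `instances_for_scale_unit()` against the per-subnet budget shows how much room remains for extra IP configurations before the fixed pool runs out.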
3b. Stand up the hub, then deploy the managed NVA
The hub and Virtual WAN are plain infrastructure-as-code. The NVA itself is a Marketplace managed application, so you deploy it from the Marketplace offer (it creates a customer resource group plus a locked, publisher-controlled managed resource group holding the NetworkVirtualAppliances resource) and then configure it from the vendor’s orchestrator.
```hcl
# Resource group referenced below (name chosen to match the net-rg group
# used in the CLI examples later in this article).
resource "azurerm_resource_group" "net" {
  name     = "net-rg"
  location = "eastus2"
}

resource "azurerm_virtual_wan" "core" {
  name                = "vwan-core"
  resource_group_name = azurerm_resource_group.net.name
  location            = "eastus2"
}

resource "azurerm_virtual_hub" "eus2" {
  name                = "hub-eus2"
  resource_group_name = azurerm_resource_group.net.name
  location            = "eastus2"
  virtual_wan_id      = azurerm_virtual_wan.core.id
  address_prefix      = "10.100.0.0/23" # /23 minimum for NVA + room to grow
  sku                 = "Standard"      # NVA in hub requires Standard
}
```
The NVA resource is provisioned by the managed app; the supported, vendor-agnostic way to create it directly is the virtual-appliance API. After deployment you confirm placement and scale:
```bash
# Confirm the NVA landed in the right hub and note its vendor and scale-unit profile
az network virtual-appliance show \
  --name velocloud-eus2 --resource-group <managed-rg> \
  --query "{hub:virtualHub.id, units:nvaSku.bundledScaleUnit, vendor:nvaSku.vendor}" -o jsonc
```
Direct SSH/console access to the appliances is intentionally not available; all SD-WAN config (interfaces, business policy, branch profiles) happens in the VeloCloud Orchestrator, which the managed app bootstraps.
4. Branch onboarding: zero-touch provisioning and tunnel templates
This is where SD-WAN earns its keep. You do not hand-configure tunnels per branch; you template once and let ZTP do the rest.
The flow:
- Pre-stage in the Orchestrator. Create the branch as a managed Edge, assign it a Profile (the tunnel template: which overlay links, QoS classes, business policy, and which cloud hub gateways to register with). Generate an Activation Key per Edge.
- Bootstrap the credential. The activation key (a one-time token) is the trust anchor. Ship the appliance to the branch; an on-site tech powers it on and either enters the key or scans it. The Edge calls home to the Orchestrator over the internet, authenticates with the key, and is issued its long-lived certificate.
- Auto-build tunnels. Once activated, the Edge pulls its profile and automatically establishes overlay tunnels to the in-hub gateway(s) for every cloud hub in scope. No per-branch tunnel config touches the cloud side — the in-hub NVA already advertises itself to the overlay.
- Route exchange begins. The branch LAN prefixes flow up the overlay to the in-hub NVA, which redistributes them into BGP toward the hub router (Section 5).
Treat ZTP enrollment as software config, not clicks. A minimal branch profile expressed as data, fed to the Orchestrator API or your config pipeline:
```yaml
# branch profile: store-ohio-014 (consumed by the SD-WAN orchestrator API)
edge:
  name: store-ohio-014
  profile: retail-branch-standard
  activation_key_ttl_days: 30   # key expires if not used; rotate, never reuse
  links:
    - interface: GE3            # broadband primary
      type: public
      bandwidth_mbps_up: 200
    - interface: LTE1           # cellular backup
      type: public
      backup: true
  cloud_gateways:
    - hub: hub-eus2             # register overlay to the in-hub NVA
    - hub: hub-scus             # second hub for geo-redundancy
  lan_subnets:
    - 10.214.14.0/24            # advertised into the overlay -> hub router via BGP
```
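Because the profile is data, a pipeline can lint it before it ever reaches the orchestrator. The sketch below is hypothetical. It operates on the dict a YAML loader would return for the profile above. The field names mirror that illustrative YAML, not a VeloCloud schema, and the rules (30-day TTL ceiling, primary plus backup link, two hubs, valid LAN prefixes) are assumed policies.

```python
"""Sketch: pre-flight lint for a branch profile before it is pushed to the orchestrator.

The dict mirrors the illustrative YAML above; field names and rules are
assumptions for a hypothetical pipeline, not a vendor schema.
"""
import ipaddress

MAX_KEY_TTL_DAYS = 30  # assumed policy ceiling, matching the example profile


def lint_branch_profile(profile: dict) -> list:
    """Return human-readable problems; an empty list means the profile passes."""
    edge = profile["edge"]
    problems = []
    ttl = edge.get("activation_key_ttl_days")
    if ttl is None or ttl > MAX_KEY_TTL_DAYS:
        problems.append("activation key TTL missing or above policy")
    links = edge.get("links", [])
    if not any(not link.get("backup", False) for link in links):
        problems.append("no primary underlay")
    if not any(link.get("backup", False) for link in links):
        problems.append("no backup underlay")
    if len({gw["hub"] for gw in edge.get("cloud_gateways", [])}) < 2:
        problems.append("fewer than two cloud hubs (no geo-redundancy)")
    for subnet in edge.get("lan_subnets", []):
        try:
            ipaddress.IPv4Network(subnet)  # strict: rejects prefixes with host bits set
        except ValueError as exc:
            problems.append(f"bad LAN subnet {subnet!r}: {exc}")
    return problems


store = {"edge": {
    "name": "store-ohio-014",
    "activation_key_ttl_days": 30,
    "links": [{"interface": "GE3", "type": "public", "bandwidth_mbps_up": 200},
              {"interface": "LTE1", "type": "public", "backup": True}],
    "cloud_gateways": [{"hub": "hub-eus2"}, {"hub": "hub-scus"}],
    "lan_subnets": ["10.214.14.0/24"],
}}
print(lint_branch_profile(store))  # [] -> safe to submit
```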
Operational rule: activation keys are bearer credentials. Scope a short TTL, deliver them out-of-band from the hardware, and revoke on any failed/aborted onboarding. A leaked key plus a spare appliance is an unauthorized tunnel into your backbone.
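The key-handling rules above (short TTL, single use, revoke on abort) can be modeled in a few lines. This is a hypothetical in-memory sketch that shows the semantics. It is not how the orchestrator stores or validates keys. Design choice: a redemption attempt for the wrong Edge also burns the key, on the assumption that a mismatch means the key leaked.

```python
"""Sketch: activation keys as short-lived, single-use bearer credentials.

Hypothetical in-memory model of the operational rule above; not the
orchestrator's actual implementation.
"""
import secrets
from datetime import datetime, timedelta, timezone


class ActivationKeyStore:
    def __init__(self, ttl: timedelta):
        self.ttl = ttl
        self._keys = {}  # key -> (edge name, expiry)

    def issue(self, edge: str, now: datetime) -> str:
        key = secrets.token_urlsafe(24)  # unguessable one-time token
        self._keys[key] = (edge, now + self.ttl)
        return key

    def revoke(self, key: str) -> None:
        self._keys.pop(key, None)  # failed or aborted onboarding

    def redeem(self, key: str, edge: str, now: datetime) -> bool:
        """Accept a key once, for the Edge it was issued to, before it expires."""
        entry = self._keys.pop(key, None)  # pop makes every key single-use
        if entry is None:
            return False
        issued_edge, expiry = entry
        return issued_edge == edge and now < expiry


t0 = datetime(2026, 7, 1, tzinfo=timezone.utc)
keystore = ActivationKeyStore(ttl=timedelta(days=30))
k = keystore.issue("store-ohio-014", t0)
print(keystore.redeem(k, "store-ohio-014", t0 + timedelta(days=2)))  # True: first use
print(keystore.redeem(k, "store-ohio-014", t0 + timedelta(days=3)))  # False: replay
```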
5. BGP between the SD-WAN overlay and the cloud backbone
The in-hub NVA learns branch prefixes from the overlay and must hand them to the cloud backbone — and learn cloud/VNet prefixes back — over BGP. With the managed-in-hub model this peering is built in: the NVA peers with the hub router automatically. If you instead run the appliance in a spoke VNet, you configure the peering explicitly, and the rules below are where designs go wrong.
Hub-router BGP facts you must respect (Azure):
- The hub router supports only 16-bit (2-byte) ASNs, and the NVA ASN must differ from the hub router’s ASN.
- You cannot use ASNs reserved by Azure or IANA. Reserved by Azure: public `8074`, `8075`, `12076`; private `65515`, `65517`, `65518`, `65519`, `65520`. Reserved by IANA: `23456`, `64496-64511`, `65535-65551`. Pick something clean like `65010` for the overlay.
- When you peer with the hub router you are given two peer IP addresses. You must peer with both and advertise the same routes to both, because asymmetric advertisement causes routing failures. This is the hub's redundancy, analogous to dual BGP sessions on AWS.
- BGP peering is supported only to an IP on an NVA interface — loopback peering is not supported.
- Peering is only to NVAs in directly connected VNets. You cannot BGP-peer an on-prem device or an Azure Route Server with the hub router.
- Limits: the hub accepts at most 10,000 routes total across all connected resources, an NVA can advertise at most 4,000 routes, and a hub supports a maximum of 8 BGP peers.
- For the NVA’s learned branch routes to reach VPN/ExpressRoute sites (and vice versa), branch-to-branch routing must be enabled on the Virtual WAN.
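The ASN rules above are easy to enforce in CI before anyone opens a peering. A minimal sketch: the reserved values are copied from the list above. The hub ASN is passed in and is an input you would read from your own hub, for example the `virtual_router_asn` attribute of the Terraform hub resource.

```python
"""Sketch: validate an overlay ASN against the hub-router rules listed above."""

AZURE_RESERVED = {8074, 8075, 12076, 65515, 65517, 65518, 65519, 65520}
IANA_RESERVED_RANGES = [(23456, 23456), (64496, 64511), (65535, 65551)]


def asn_problems(nva_asn: int, hub_asn: int) -> list:
    """Return every rule the proposed NVA ASN breaks; empty means it is usable."""
    problems = []
    if not 1 <= nva_asn <= 65535:
        problems.append("hub router supports only 16-bit ASNs")
    if nva_asn in AZURE_RESERVED:
        problems.append("reserved by Azure")
    if any(lo <= nva_asn <= hi for lo, hi in IANA_RESERVED_RANGES):
        problems.append("reserved by IANA")
    if nva_asn == hub_asn:
        problems.append("must differ from the hub router ASN")
    return problems


print(asn_problems(65010, hub_asn=65515))  # []
print(asn_problems(65520, hub_asn=65515))  # ['reserved by Azure']
```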
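A spoke-NVA peering in IaC, with the ASN rule applied. The dual-IP rule is enforced on the NVA side: the appliance must open BGP sessions to both hub router IPs, which the hub resource exports as `virtual_router_ips` (and its ASN as `virtual_router_asn`).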
```hcl
# NVA running in a spoke VNet, peering with the hub router over BGP.
resource "azurerm_virtual_hub_bgp_connection" "velo" {
  name                          = "velo-overlay-peer"
  virtual_hub_id                = azurerm_virtual_hub.eus2.id
  peer_asn                      = 65010        # != hub ASN, not in any reserved range
  peer_ip                       = "10.110.0.5" # NVA interface IP (no loopbacks)
  virtual_network_connection_id = azurerm_virtual_hub_connection.velo_spoke.id
}
```
Avoiding route loops
Loop avoidance here is mostly about understanding what the hub will and will not re-advertise:
- System routes win over BGP for directly attached VNet prefixes. You cannot force traffic destined to a spoke VNet’s own address space through the NVA via hub BGP — the hub auto-learns those system routes and prefers them over anything BGP gives it. Design around it; do not fight it.
- More-specifics do not leak to on-prem. Routes the NVA advertises that are more specific than the VNet address space are not propagated onward to on-prem by the hub. Good for blast-radius, but know it so you do not expect a /32 to appear at a branch.
- eBGP AS-PATH is your loop guard. Run the overlay in its own ASN. Because eBGP rejects any advertisement whose AS-PATH already contains the receiver's ASN, a prefix that originated on-prem, entered the overlay, and tries to loop back from the cloud is dropped on AS-PATH. Do not blanket-strip AS-PATH or enable `allowas-in` on the overlay edge; that is how you build a black hole.
- Prefer aggregates. Advertise a branch supernet (e.g. `10.214.0.0/16`) from the NVA rather than thousands of /24s. It keeps you under the 4,000-route NVA limit and the 10,000-route hub ceiling, and it makes failover converge faster.
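Both guards can be seen in isolation in a short sketch. `accept_ebgp()` models the standard eBGP AS-PATH loop check. `aggregate()` uses the standard library's `ipaddress.collapse_addresses` as a stand-in for the NVA's summarization policy; a real appliance summarizes according to its own configuration. The prefixes and ASNs are illustrative.

```python
"""Sketch: eBGP AS-PATH loop rejection and prefix aggregation, in isolation."""
import ipaddress

NVA_ROUTE_LIMIT = 4000


def accept_ebgp(as_path: list, local_asn: int) -> bool:
    """Reject any route whose AS-PATH already contains our own ASN (a loop)."""
    return local_asn not in as_path


def aggregate(prefixes: list) -> list:
    """Collapse contiguous prefixes into the fewest covering networks."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


# A route that left the overlay (AS 65010), crossed the hub (AS 65515), and comes back:
print(accept_ebgp([65515, 65010], local_asn=65010))  # False: dropped

# 256 contiguous store /24s collapse to one /16 supernet.
stores = [f"10.214.{i}.0/24" for i in range(256)]
summary = aggregate(stores)
print(len(stores), "->", summary)  # 256 -> ['10.214.0.0/16']
print("under NVA cap:", len(summary) <= NVA_ROUTE_LIMIT)
```

Note that `collapse_addresses` only merges prefixes that are exactly contiguous. Advertising a /16 over partially used space, as the regional-supernet approach does, is a policy decision rather than an arithmetic one.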
6. Traffic steering: breakout vs backhaul
App-aware steering is the reason SD-WAN exists; the integration job is to make the cloud egress policy agree with the overlay’s business policy.
- Local internet breakout for trusted SaaS (Microsoft 365, well-known CDNs) — the branch Edge sends it straight out its local underlay, never touching the cloud hub. Lowest latency, lowest backbone cost. Drive it with the vendor’s FQDN/app categories.
- Backhaul to inspected egress for everything that must be logged or filtered. The Edge steers it over the overlay to the in-hub gateway; the hub forces it through a security stack before the internet.
On a secured Virtual hub, you express that backhaul intent with Routing Intent, not hand-built UDRs. Note the dependency: BGP peering between a spoke NVA and a secured hub is supported only when Routing Intent is configured — on a secured hub without it, the peering is not supported.
```bash
# Send all internet-bound and private traffic to the hub's security stack.
# Branch SaaS that breaks out locally never enters the hub, so it is unaffected.
NEXT_HOP="<azfw-or-nva-id>"   # resource ID of Azure Firewall or the security NVA
az network vhub routing-intent create \
  --name eus2-intent \
  --resource-group net-rg \
  --vhub hub-eus2 \
  --routing-policies "[{name:Internet,destinations:[Internet],next-hop:${NEXT_HOP}},{name:PrivateTraffic,destinations:[PrivateTraffic],next-hop:${NEXT_HOP}}]"
```
The principle: keep the two policy planes consistent. If the overlay breaks a SaaS app out locally but the cloud hub’s egress firewall expects to inspect it, you have a coverage gap. Decide per app category in one place and reflect it in both.
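One way to "decide in one place" is to keep a single intent table and derive both planes from it, then diff the derived intent against what the firewall actually expects. The sketch below is hypothetical throughout. The category names, the `Egress` enum, and the idea of exporting the firewall's inspected categories as a set are illustrative assumptions, not a vendor or Azure API.

```python
"""Sketch: one app-category intent table, checked against the hub firewall's scope."""
from enum import Enum


class Egress(Enum):
    LOCAL_BREAKOUT = "local"  # branch Edge sends it straight out its underlay
    BACKHAUL = "backhaul"     # overlay -> in-hub gateway -> inspected egress


INTENT = {
    "microsoft-365": Egress.LOCAL_BREAKOUT,
    "erp": Egress.BACKHAUL,
    "pos": Egress.BACKHAUL,
    "default": Egress.BACKHAUL,
}


def overlay_breakout(intent: dict) -> set:
    """Categories the overlay business policy should break out locally."""
    return {app for app, egress in intent.items() if egress is Egress.LOCAL_BREAKOUT}


def hub_inspection_scope(intent: dict) -> set:
    """Categories the hub's security stack should expect to inspect."""
    return {app for app, egress in intent.items() if egress is Egress.BACKHAUL}


def coverage_gaps(intent: dict, firewall_expects: set) -> set:
    """Apps the firewall expects to inspect but the overlay sends around it."""
    return overlay_breakout(intent) & firewall_expects


firewall_expects = {"erp", "pos", "microsoft-365"}  # e.g. exported from firewall policy
print(coverage_gaps(INTENT, firewall_expects))  # {'microsoft-365'}
```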
7. Capacity and the underlay bandwidth you actually need
Two numbers, do not conflate them:
- Cloud-side aggregate throughput = sum of all branch overlay traffic landing on the hub at peak. Pick the NVA scale unit for this, sized to peak with headroom (remember the 500 Mbps/unit is pre-crypto/pre-DPI). A 100-branch estate averaging 50 Mbps each (5 Gbps in aggregate) but peaking together at 8 Gbps needs at least 16 units at the raw rating (8,000 / 500), and more once crypto and DPI overhead is counted. Sizing to the average would give you 10.
- Per-branch underlay = per-site overlay demand plus IPsec overhead. IPsec/ESP overhead runs roughly 5-10%; size the branch circuit to peak app demand plus that, and ensure the backup underlay (LTE/5G) can carry the must-keep traffic classes alone, because in a failover it will.
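The two numbers can be computed separately. In this sketch, `RAW_MBPS_PER_UNIT` is the 500 Mbps rating from above. The efficiency factor (the share of that rating left after crypto, encapsulation and DPI), the headroom and the `SUPPORTED_UNITS` list are assumptions you would replace with your vendor's published figures.

```python
"""Sketch: hub scale-unit sizing and per-branch underlay sizing."""
import math

RAW_MBPS_PER_UNIT = 500
SUPPORTED_UNITS = [2, 4, 10, 20, 40, 60, 80]  # hypothetical vendor scale points


def hub_scale_unit(peak_mbps: float, efficiency: float = 1.0, headroom: float = 0.0) -> int:
    """Smallest supported scale unit carrying peak * (1 + headroom) after overhead."""
    needed = math.ceil(peak_mbps * (1 + headroom) / (RAW_MBPS_PER_UNIT * efficiency))
    for units in SUPPORTED_UNITS:
        if units >= needed:
            return units
    raise ValueError(f"{needed} units needed; split the load across hubs")


def branch_circuit_mbps(peak_app_mbps: float, ipsec_overhead: float = 0.10) -> float:
    """Underlay rate for peak app demand plus IPsec/ESP overhead (5-10%)."""
    return peak_app_mbps * (1 + ipsec_overhead)


print(math.ceil(8000 / RAW_MBPS_PER_UNIT))                   # raw floor for 8 Gbps: 16
print(hub_scale_unit(8000, efficiency=0.6, headroom=0.2))    # 40 under these assumptions
print(f"{branch_circuit_mbps(180):.0f} Mbps branch circuit")
```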
MTU is the silent killer. The overlay encapsulates (IPsec, often over the provider’s own encap), so a 1500-byte inner packet plus headers exceeds path MTU and fragments or black-holes large flows. On AWS Connect the math is explicit — set the GRE tunnel MTU to external MTU minus 24 bytes (4-byte GRE + 20-byte outer IP): 1500 - 24 = 1476. On any overlay, leave room and enable TCP MSS clamping at the edge so handshakes negotiate a safe segment size instead of relying on PMTUD through tunnels that often drop ICMP.
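The MTU and MSS numbers follow from simple subtraction. A minimal sketch: the 24-byte GRE overhead matches the AWS Connect rule above. The MSS rule assumes IPv4 and TCP headers with no options (20 + 20 bytes). Any extra IPsec/ESP overhead depends on cipher and mode, so it is a parameter here rather than a fixed value.

```python
"""Sketch: GRE tunnel MTU and TCP MSS clamp value for an overlay link."""

GRE_OVERHEAD = 4 + 20        # 4-byte GRE header + 20-byte outer IPv4 header
IPV4_TCP_HEADERS = 20 + 20   # inner IPv4 + TCP headers, no options assumed


def tunnel_mtu(external_mtu: int, extra_overhead: int = 0) -> int:
    """Inner MTU left after GRE plus any additional encapsulation (e.g. ESP)."""
    return external_mtu - GRE_OVERHEAD - extra_overhead


def mss_clamp(inner_mtu: int) -> int:
    """Largest TCP segment that fits through the tunnel without fragmenting."""
    return inner_mtu - IPV4_TCP_HEADERS


mtu = tunnel_mtu(1500)
print(mtu, mss_clamp(mtu))  # 1476 1436
```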
The AWS equivalent: Transit Gateway / Cloud WAN Connect
If your backbone is AWS, the analogous construct is a Connect attachment on a transit/transport attachment, with GRE tunnels (Connect peers) to the SD-WAN appliance and BGP for routing. The specifics worth knowing:
- Each Connect peer establishes two BGP sessions over the tunnel for routing-plane redundancy — the same “peer with both” discipline as Azure’s dual hub IPs.
- BGP inside addresses come from a /29 in `169.254.0.0/16` (link-local); several /29s in that range are reserved. Configure the first usable address in the /29 on the appliance as its BGP IP (e.g. `169.254.6.1` for `169.254.6.0/29`).
- If you do not set a peer ASN, AWS uses the transit gateway ASN (default `64512`) and you end up in iBGP; for eBGP to a different ASN you must set `ebgp-multihop` with TTL 2. Note that BFD is not supported here, so your reconvergence floor is BGP timers (default keepalive 10s / hold 30s) unless you tune them.
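Choosing an inside CIDR is a small validation problem. In this sketch, the reserved /29 list reflects AWS documentation at the time of writing and should be re-verified before you rely on it. The appliance gets the first usable address, as described above.

```python
"""Sketch: validate a Transit Gateway Connect peer inside CIDR and derive the
appliance's BGP address. The reserved list is an assumption to re-verify."""
import ipaddress

LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")
RESERVED = [ipaddress.ip_network(n) for n in (
    "169.254.0.0/29", "169.254.1.0/29", "169.254.2.0/29", "169.254.3.0/29",
    "169.254.4.0/29", "169.254.5.0/29", "169.254.169.248/29")]


def appliance_bgp_ip(inside_cidr: str) -> str:
    """Return the appliance BGP IP for a valid inside CIDR, else raise ValueError."""
    net = ipaddress.ip_network(inside_cidr)  # strict: rejects host bits set
    if net.version != 4 or net.prefixlen != 29 or not net.subnet_of(LINK_LOCAL):
        raise ValueError("inside CIDR must be a /29 in 169.254.0.0/16")
    if any(net.overlaps(r) for r in RESERVED):
        raise ValueError(f"{net} is reserved")
    return str(next(net.hosts()))  # first usable address goes on the appliance


print(appliance_bgp_ip("169.254.6.0/29"))  # 169.254.6.1
```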
```bash
# AWS: GRE Connect peer to an SD-WAN appliance; the two BGP sessions are implied.
aws ec2 create-transit-gateway-connect-peer \
  --transit-gateway-attachment-id tgw-attach-0abc123 \
  --peer-address 10.20.0.10 \
  --inside-cidr-blocks 169.254.6.0/29 \
  --bgp-options PeerAsn=65010
# Appliance side: GRE tunnel MTU = 1500 - 24 = 1476, MSS clamp on, BGP from 169.254.6.1
```
Enterprise scenario
A retail platform team migrated 240 stores from MPLS to VeloCloud SD-WAN with two Azure Virtual WAN hubs (East US 2 primary, South Central US secondary) and the appliance in the hub. POS, inventory, and back-office traffic backhauled through the hubs to an inspected egress; Microsoft 365 broke out locally at each store. They advertised every store LAN as an individual /24 from the in-hub NVA into the hub router — 240 prefixes, plus their VNet and ExpressRoute routes. It worked in the pilot of 30 stores and passed UAT.
The constraint hit during national rollout. As store count climbed and a few sites advertised secondary subnets, the prefix count from the overlay marched toward the 4,000-route NVA limit while the hub’s total crept toward the 10,000-route ceiling as VNets and ExpressRoute prefixes piled on. More damaging operationally: every store add or subnet change re-advertised a fresh /24, and hub-side reconvergence across hundreds of discrete prefixes during the nightly maintenance window produced 30-60 second windows where some stores saw POS timeouts. The team had also left routing_weight defaults such that during a hub failover, return traffic for a subset of prefixes came back over the secondary hub while the overlay still preferred primary — classic asymmetry, and the inspecting firewall dropped those flows.
The fix was structural. First, summarize: VeloCloud was configured to advertise a single regional supernet per geography (10.214.0.0/16, 10.220.0.0/16, …) from the in-hub NVA instead of per-store /24s, collapsing 240+ prefixes to a handful and dropping hub reconvergence to sub-second on store changes. Branch-specific reachability stayed intact because the overlay itself still carried the specific routes between Edges; only the cloud-facing advertisement was aggregated. Second, they made hub preference deterministic and symmetric so failover egress and the overlay’s preferred hub always agreed. The BGP peering itself does not change; the aggregation lives in the NVA's advertisement policy. Shown in its spoke-VNet form for reference:
```hcl
resource "azurerm_virtual_hub_bgp_connection" "velo_eus2" {
  name                          = "velo-overlay-peer"
  virtual_hub_id                = azurerm_virtual_hub.eus2.id
  peer_asn                      = 65010        # overlay ASN, distinct from hub, not reserved
  peer_ip                       = "10.110.0.5"
  virtual_network_connection_id = azurerm_virtual_hub_connection.velo_spoke.id
}
# NVA business policy advertises regional supernets only:
# 10.214.0.0/16, 10.220.0.0/16 ... -> well under the 4,000-route NVA cap,
# stable through store adds, fast hub reconvergence on failure.
```
They added a synthetic check that counts prefixes received by the hub from the overlay and alarms at 70% of the 4,000 limit, plus a per-region return-path probe that fails if traffic egresses the non-preferred hub. The route-scale class of incident became a dashboard threshold instead of a 2 a.m. page.
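The alarm itself is a threshold on a deduplicated count. A minimal sketch: how you collect the received-prefix list (CLI, API, or a telemetry export) is left open, and the sample prefixes are synthetic. The threshold uses integer arithmetic, so 70% of 4,000 is exactly 2,800.

```python
"""Sketch: alarm when prefixes received from the overlay reach 70% of the NVA limit."""

NVA_ROUTE_LIMIT = 4000
ALARM_THRESHOLD = NVA_ROUTE_LIMIT * 70 // 100  # 2,800 prefixes


def route_scale_alarm(received_prefixes: list) -> tuple:
    """Return (unique prefix count, whether the alarm threshold is reached)."""
    count = len(set(received_prefixes))  # the two hub peers may report duplicates
    return count, count >= ALARM_THRESHOLD


sample = [f"10.{i // 256}.{i % 256}.0/24" for i in range(2800)]
print(route_scale_alarm(sample[:2799]))  # (2799, False)
print(route_scale_alarm(sample))         # (2800, True)
```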
Verify
Confirm the integration end to end before you onboard real branches.
- Overlay tunnels up. In the SD-WAN orchestrator, the in-hub gateway shows the expected branches as connected/stable with both primary and backup underlays registered.
- BGP adjacency and dual peering (Azure). Confirm the NVA is peered with the hub router on both advertised IPs and learned routes are present:
```bash
az network vhub bgpconnection list \
  --vhub-name hub-eus2 --resource-group net-rg -o table

# Effective routes the hub programmed: branch supernets should show next hop = the BGP peer
az network vhub get-effective-routes \
  --name hub-eus2 --resource-group net-rg \
  --resource-type RouteTable \
  --resource-id <defaultRouteTable-id> -o table
```
- Route hygiene. The hub shows your aggregated branch supernets (not a flood of /24s), prefix count is well under 4,000 from the NVA, and no on-prem prefix has looped back into the overlay (check AS-PATH on the edge).
- Steering correctness. From a branch, a SaaS flow (e.g. an M365 endpoint) egresses locally (traceroute does not enter the hub), while a backhauled app does traverse the hub’s inspected egress.
- AWS side. `aws ec2 describe-transit-gateway-connect-peers` shows the peer `available` with both BGP sessions `up`, and the TGW route table carries the appliance-advertised prefixes.
- MTU sanity. Large transfers over the overlay complete without stalls. Confirm MSS clamping by checking the negotiated segment size, and check that the GRE/IPsec MTU leaves room for headers.
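The AWS check can gate a pipeline. The sketch assumes the EC2 API's `TransitGatewayConnectPeer` shape: `State` and `ConnectPeerConfiguration.BgpConfigurations[].BgpStatus`, with one entry per BGP session. Verify those field names against your CLI version. The sample document is synthetic.

```python
"""Sketch: pass/fail over `aws ec2 describe-transit-gateway-connect-peers` JSON output."""


def unhealthy_peers(doc: dict) -> list:
    """IDs of Connect peers not 'available' or with fewer than two BGP sessions up."""
    bad = []
    for peer in doc.get("TransitGatewayConnectPeers", []):
        sessions = peer.get("ConnectPeerConfiguration", {}).get("BgpConfigurations", [])
        up = [s for s in sessions if s.get("BgpStatus") == "up"]
        if peer.get("State") != "available" or len(up) < 2:
            bad.append(peer.get("TransitGatewayConnectPeerId", "<unknown>"))
    return bad


sample = {"TransitGatewayConnectPeers": [{
    "TransitGatewayConnectPeerId": "tgw-connect-peer-0example",
    "State": "available",
    "ConnectPeerConfiguration": {"BgpConfigurations": [
        {"BgpStatus": "up"}, {"BgpStatus": "up"}]},
}]}
print(unhealthy_peers(sample))  # []
```

In CI you would feed it the CLI's JSON, for example `unhealthy_peers(json.load(sys.stdin))` behind `aws ec2 describe-transit-gateway-connect-peers --output json`, and fail the job on a non-empty result.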