Hardening Azure App Service: VNet Integration, Private Endpoints, and Zero-Downtime Slots

A default Azure App Service is reachable from the entire internet, talks to your database over a shared public path, and stores connection strings as plaintext in app settings. Then, when you finally ship a fix, an in-place restart drops live requests on the floor and a bad release means a frantic redeploy under pressure. This guide closes all four gaps (inbound, outbound, secrets, and release safety), turning an az webapp create default into a network-isolated production service you ship to on demand, with a warmed swap and a one-command rollback.

The four problems are independent and have independent fixes. Regional VNet integration governs outbound. Private endpoints govern inbound. Key Vault references plus managed identity kill the secrets. Deployment slots with warm-up and swap-with-preview give you zero-downtime release and instant rollback. You can adopt them in any order, but the order matters in production: get egress right before you force it, get private DNS right before you disable public access, and make your readiness probe honest before you gate a swap on it. Every one of those "before"s is a 2 a.m. incident waiting to happen, and this guide is structured so you never hit them.

By the end you will treat a release the way a payments team does: deploy to staging from inside the VNet, warm the slot on a probe that actually proves the dependency chain, preview production config without moving traffic, swap, and — if a regression surfaces — swap back, because the previous bits are sitting right there in the staging slot. The tables in this guide are the reference you keep open during a change window; the prose explains the mechanism, the code is copy-pasteable az/Bicep, and the tables enumerate every setting, value, default, and gotcha end to end.

What problem this solves

The pain is concrete and it shows up in three flavours. First, exposure: the public *.azurewebsites.net hostname is internet-reachable, so your only access control is application-layer, and your outbound calls leave from a rotating pool of shared Azure IPs that no partner firewall can sensibly allowlist. Second, secret sprawl: connection strings and keys sit as plaintext app settings, visible to anyone with Reader on the resource and trivially leaked into screenshots, exports, and ARM templates. Third, release fragility: an in-place deploy restarts the worker, so live traffic eats a cold start or an outright 503, and a bad release has no fast undo — you redeploy the previous artifact and pray.

What breaks without this discipline: a team scales up the plan to “fix” intermittent 5xx (masking SNAT exhaustion for a day before it returns worse); an engineer disables a firewall rule to “make the deploy work” after locking down public access (because the CI runner now can’t reach Kudu); a midnight release goes out, 500s in production seconds after the swap completes, and nobody can explain why staging was healthy and production was not. Every one of these is avoidable with the exact configuration walked through below.

Who hits this: every team running App Service in production behind a compliance boundary — payments, healthcare, anything with a partner allowlist or a private-only data tier. It bites hardest on apps that resolve names at startup (Key Vault references, private SQL), apps deployed from public-hosted pipelines after a lockdown, and cost-sensitive deployments that never set up slots and therefore release in-place. The fix is never “scale up” — it’s “isolate the three planes and gate the release on real health.”

To frame the whole field before the deep dive, here is each plane this article hardens, the default you inherit, the fix, and the failure you avoid:

Plane	Default behaviour	The fix	Plan tier needed	Failure avoided
Inbound	Public `*.azurewebsites.net`, open to internet	Private endpoint + `publicNetworkAccess=Disabled`	Basic+ (PE), Standard+ (prod)	Anonymous reachability; app-layer-only access control
Outbound	Egress from shared rotating Azure IPs	Regional VNet integration + NAT GW / firewall	Basic+ (integration)	Unfirewallable egress; no stable source IP
Secrets	Connection strings as plaintext app settings	Key Vault references + managed identity (or passwordless)	Any tier	Secret leak via export/Reader/screenshot
Release	In-place deploy restarts the live worker	Staging slot + warm-up + swap-with-preview	Standard+ (5 slots on Standard, 20 on Premium); Basic has no slots	Cold start / 503 on every deploy; no fast rollback
Observability	Logs not retained; alerts page on staging noise	App Insights + diag settings on both slots, prod-scoped alerts	Any tier	Blind incidents; on-call paged by a staging deploy

Learning objectives

By the end of this article you can:

Stand up regional VNet integration on a correctly delegated Microsoft.Web/serverFarms subnet and force all egress through it with WEBSITE_VNET_ROUTE_ALL without breaking internet-bound calls.
Pin a stable outbound IP with a NAT gateway, or route egress through a hub Azure Firewall for FQDN filtering and logging — and explain when to pick which.
Eliminate plaintext secrets with Key Vault references resolved by a managed identity, and go one better with passwordless (Entra) auth to SQL, Storage, and Service Bus.
Lock down inbound with a private endpoint, disable public access, and make the hostname resolve privately with the right privatelink.azurewebsites.net zone — including the SCM/Kudu deploy-path consequence.
Build a staging slot with sticky slot settings, gate the swap on a real readiness probe, use swap-with-preview for two-phase validation, and roll back with a reverse swap.
Wire Health Check, autoscale, Application Insights, and slot-aware diagnostic settings so the platform self-heals and alerts page only on production.
Read the option, limit, and decision tables for every setting above and run the release playbook that maps each post-swap symptom to its root cause, the exact confirm command, and the fix.

Prerequisites & where this fits

You should already understand App Service basics: an App Service plan is the set of VM workers (an SKU like B1, S1, P1v3) you rent, and one or more web apps run on that plan sharing its CPU, memory, and instance count. You should be comfortable running az in Cloud Shell, reading JSON output, and you should know that App Service has deployment slots (staging/production swap targets). Familiarity with VNets, subnets, NSGs, and private DNS is assumed — if any of that is shaky, the Azure VNet deep dive: every setting and Azure Private Endpoints & Private DNS at scale are the upstream reading.

This sits in the Secure-by-default Compute track. It assumes the platform mechanics from the Azure App Service Deep Dive: Plans, Scaling, Slots, TLS and the compute-choice context from App Service vs Container Apps vs AKS. It pairs tightly with Azure Key Vault: secrets, keys, certificates for the secrets plane and Azure Monitor & Application Insights for observability for the alerting plane. When a release does go wrong despite this, the companion Troubleshooting App Service: 502/503, cold starts & restart loops is the incident-time reference.

The examples assume an existing hub-and-spoke topology with the app deployed into a spoke VNet vnet-spoke-app. A quick map of who owns each plane during a change so you escalate to the right person fast:

Plane	What lives here	Who usually owns it	What they break if it’s wrong
Subnet / delegation	Integration subnet, PE subnet, route tables	Network team	Egress breaks; PE has no IP
Private DNS	`privatelink.azurewebsites.net`, vault zone	Network / platform	Hostname resolves public → connections fail
App config	App settings, slot settings, KV references	App / dev team	Sticky leak; KV ref fails at boot
Identity / RBAC	Managed identity, role assignments	Platform + security	MI can’t read vault → crash loop
Release pipeline	Slot deploy, warm-up, swap, rollback	App / DevOps	Cold start, broken swap, no rollback
Plan & scale	SKU, instance count, autoscale, Always On	Platform	Restart = 503; slot starves prod capacity

Plan tier matters. VNet integration and private endpoints require Basic or higher; the production scenarios below assume Standard (S1) or Premium v3 (P1v3). Deployment slots and autoscale need Standard or higher (autoscale is configured on the plan, so both slots share it); Always On is available from Basic. Free/Shared tiers support none of this. The table in Cost & sizing enumerates exactly what each tier unlocks.

Core concepts

Five mental models make every later step obvious, and pinning them down now prevents the most common mistakes.

Inbound and outbound are different subnets with different jobs. Regional VNet integration injects your app’s outbound calls into a delegated subnet (delegated to Microsoft.Web/serverFarms). A private endpoint projects the app inbound as a private NIC in a separate, non-delegated subnet. They are never the same subnet. Conflating them is the single most common networking error here, so the guide keeps them in distinct steps and distinct CIDRs.

The status of a secret reference is an identity question. A Key Vault reference (@Microsoft.KeyVault(...)) resolves only if the app’s managed identity has Key Vault Secrets User and the vault’s network rules let the app reach it. A red “Key Vault Reference” badge in the portal is almost always one of those two — a missing role assignment or a vault firewall/private-endpoint DNS gap — not a syntax problem.

A swap moves traffic between warmed instances; the warm-up is the whole point. A staging slot is a full, addressable copy of the app on the same plan. You deploy to staging, warm it on a readiness path, then swap — App Service redirects production traffic to the already-warm staging instances. If the slot isn’t warm, or the warm-up probe is dishonest, the swap ships a cold or broken worker and you’ve engineered downtime into a feature designed to prevent it.

Slot settings decide what travels. By default, app settings and connection strings travel with the code: when you swap, they move to the other slot along with the deployed bits. That is wrong for anything environment-specific. Marking a setting as a slot setting (sticky / “deployment slot setting”) pins it to the slot so it never crosses during a swap. The classic disaster is a staging DB connection string promoting itself to production because nobody made it sticky.

Private DNS changes the blast radius for everything that resolves a name. The moment you add a private endpoint — for the app or for Key Vault or SQL — you change how a name resolves inside the VNet. Any service that resolves a hostname at startup (Key Vault references, a SQL connection, a downstream API) is now subject to your private DNS being correct. A private endpoint you add for one resource can break a second resource that merely shares the VNet’s resolver.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the model side by side:

Concept	One-line definition	Where it lives	Why it matters here
App Service plan	Rented VM workers (SKU + count)	Subscription / RG	Both slots share it; restart = 503
Web app (site)	One app running on a plan	On the plan	The thing you isolate and release
Deployment slot	A swappable copy of the app	On the plan	Staging target; rollback lives here
Slot setting (sticky)	App setting pinned to a slot	App config	Stops env values crossing a swap
Swap-with-preview	Two-phase swap (preview → complete)	Slot operation	Validate prod config before traffic moves
VNet integration	App’s outbound into a delegated subnet	Site config + subnet	Governs egress; needs delegation
`WEBSITE_VNET_ROUTE_ALL`	Force all egress (incl. internet) into VNet	App setting	Needs egress path first, or breaks
Private endpoint	App projected as a private NIC	PE subnet + NIC	Governs inbound; needs private DNS
`publicNetworkAccess`	Public reachability toggle	Site property	`Disabled` closes the public door
Managed identity	Azure-issued identity for the app	The app	Resolves KV refs; passwordless auth
Key Vault reference	`@Microsoft.KeyVault(...)` app setting	App setting + identity	Plaintext-free secrets
Warm-up ping	Pre-swap probe on a readiness path	App setting	Makes the swap zero-downtime
Health Check	Per-instance liveness probe	Site config	Evicts bad instances from rotation

Step 1 — Regional VNet integration for outbound

Regional VNet integration injects the app’s outbound calls into a dedicated, delegated subnet in your VNet. The subnet must be delegated to Microsoft.Web/serverFarms and should be sized generously — the platform consumes addresses as the plan scales out (a /27 is a sane minimum; /26 for larger plans).

# Delegated subnet for outbound integration
az network vnet subnet create \
  --resource-group rg-app-prod \
  --vnet-name vnet-spoke-app \
  --name snet-appsvc-integration \
  --address-prefixes 10.20.1.0/27 \
  --delegations Microsoft.Web/serverFarms

# Wire the app to it
az webapp vnet-integration add \
  --resource-group rg-app-prod \
  --name app-orders-prod \
  --vnet vnet-spoke-app \
  --subnet snet-appsvc-integration

resource integSubnet 'Microsoft.Network/virtualNetworks/subnets@2023-11-01' = {
  parent: vnet
  name: 'snet-appsvc-integration'
  properties: {
    addressPrefix: '10.20.1.0/27'
    delegations: [ {
      name: 'serverFarms'
      properties: { serviceName: 'Microsoft.Web/serverFarms' }
    } ]
  }
}

resource vnetConn 'Microsoft.Web/sites/networkConfig@2023-12-01' = {
  parent: site
  name: 'virtualNetwork'
  properties: { subnetResourceId: integSubnet.id, swiftSupported: true }
}

At this point outbound calls to RFC 1918 destinations route through the VNet, but internet-bound traffic still leaves via Azure’s shared egress. To force all egress through the VNet — so it can be inspected by a firewall and leave from a predictable IP — set WEBSITE_VNET_ROUTE_ALL:

az webapp config appsettings set \
  --resource-group rg-app-prod \
  --name app-orders-prod \
  --settings WEBSITE_VNET_ROUTE_ALL=1

Once WEBSITE_VNET_ROUTE_ALL=1 is set, the subnet’s effective routes apply to internet traffic too. If the subnet has a route table sending 0.0.0.0/0 to an Azure Firewall, or a NAT gateway attached, every outbound packet now follows that path. Without a deliberate egress path, internet calls can break, so configure the next step before flipping this in production. This is badge ④ in the diagram.

The integration subnet has hard requirements; get any of these wrong and integration silently misbehaves or fails to attach. Enumerate them before you build:

Requirement	Value / rule	Why	What breaks if wrong
Delegation	`Microsoft.Web/serverFarms`	Reserves the subnet for App Service	Integration won’t attach
Minimum size	`/27` (32 addresses, 27 usable after Azure reserves 5) recommended; `/28` absolute floor	Platform consumes IPs on scale-out	Scale-out fails when IPs run out
Dedication	One plan per integration subnet	Avoids IP contention	Address conflicts across plans
Other resources	None — subnet must be empty of other services	Delegation is exclusive	Create fails
NSG	Allowed; must not block needed egress	You can still filter	Over-tight NSG breaks app egress
Route table (UDR)	Allowed; honoured when `ROUTE_ALL=1`	This is how you force egress	Missing default route breaks internet
Region	Same region as the app/plan	Regional integration is in-region only	Cross-region not supported
Service endpoints	Optional; coexist with integration	Lock PaaS to the subnet	—
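
To make the sizing rule concrete, here is a minimal Python sketch that works out how many addresses a candidate integration subnet really offers. It relies on the fact that Azure reserves five addresses in every subnet. The `fits_scale_out` helper and its headroom factor of 2 are illustrative assumptions (scale and upgrade operations can briefly hold extra addresses), not a platform limit.

```python
import ipaddress

# Azure reserves five addresses in every subnet (network address, default
# gateway, two for Azure DNS, broadcast), so usable = total - 5.
AZURE_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Return how many addresses in the subnet are available to the plan."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - AZURE_RESERVED_PER_SUBNET, 0)


def fits_scale_out(cidr: str, max_instances: int, headroom_factor: int = 2) -> bool:
    """Hypothetical rule of thumb: keep headroom_factor times the peak
    instance count available. The default of 2 is an assumption."""
    return usable_addresses(cidr) >= max_instances * headroom_factor


if __name__ == "__main__":
    for cidr in ("10.20.1.0/28", "10.20.1.0/27", "10.20.1.0/26"):
        print(cidr, usable_addresses(cidr), fits_scale_out(cidr, max_instances=10))
```

Under these assumptions a plan that peaks at 10 instances is too tight on a /28 and comfortable on a /27, which matches the guidance in the table.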

The behaviour of WEBSITE_VNET_ROUTE_ALL is the bit people misjudge. Here is exactly what each value does to each traffic class:

`WEBSITE_VNET_ROUTE_ALL`	RFC 1918 (private) traffic	Internet-bound traffic	Use when
`0` (default)	Routed through the VNet	Leaves via shared Azure egress	You only need to reach private resources
`1`	Routed through the VNet	Also forced into the VNet (follows UDR/NAT/firewall)	You need egress inspection or a stable source IP
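
The table can be read as a two-input decision. The sketch below models it in Python so the rule is unambiguous; the function name and the returned labels are hypothetical, and it deliberately ignores UDRs, service endpoints, and other routing refinements.

```python
import ipaddress

# The three RFC 1918 private ranges.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def egress_path(destination_ip: str, route_all: bool) -> str:
    """Return which path a packet takes, per the table above (simplified)."""
    address = ipaddress.ip_address(destination_ip)
    if any(address in network for network in RFC1918):
        return "vnet"  # private traffic uses the integration subnet either way
    # Internet-bound traffic only enters the VNet when route-all is on.
    return "vnet" if route_all else "shared-azure-egress"
```

For example, `egress_path("52.1.2.3", route_all=False)` yields the shared path, while the same destination with `route_all=True` follows whatever the integration subnet's routes dictate.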

A few related outbound app settings round out the control surface:

Setting	Effect	Default	When to set
`WEBSITE_VNET_ROUTE_ALL`	Force all egress (incl. internet) into the VNet	`0`	After an egress path (NAT/firewall) exists
`WEBSITE_DNS_SERVER`	Override DNS the app uses	Azure-provided 168.63.129.16	Custom resolver / on-prem DNS for private zones
`WEBSITE_DNS_ALT_SERVER`	Secondary DNS	none	Resolver redundancy
`WEBSITE_CONTENTOVERVNET`	Route content share over the VNet	`0`	Fully private storage for the app’s content

When egress misbehaves, the question is always “which route is actually in effect for this destination?” This decision table maps what you observe to the cause and the next move:

If you see…	It’s probably…	Do this
Private (RFC 1918) reachable, internet broken	`ROUTE_ALL=1` with no `0.0.0.0/0` egress route	Add NAT gateway / UDR to the integration subnet
Internet fine, private targets unreachable	Integration not attached, or wrong subnet	Re-check `az webapp vnet-integration list`; delegate the subnet
Partner allowlist rejects your calls	Egress leaving on the shared rotating pool	Attach NAT gateway; allowlist its static IP/prefix
Egress works but isn’t logged/filtered	No firewall in the path	UDR `0.0.0.0/0` → hub Azure Firewall
Intermittent outbound failures under load	SNAT exhaustion on shared egress	NAT gateway (bigger pool) + connection reuse
Private DNS names resolve to public IPs	Custom `WEBSITE_DNS_SERVER` not forwarding the zone	Point DNS at a resolver that knows the private zones

Pinning egress with a NAT gateway or firewall

For a stable outbound IP (what most SaaS allowlists and partner firewalls demand), attach a NAT gateway to the integration subnet:

az network public-ip create -g rg-app-prod -n pip-nat-app --sku Standard --allocation-method Static
az network nat gateway create -g rg-app-prod -n nat-app --public-ip-addresses pip-nat-app --idle-timeout 10
az network vnet subnet update \
  -g rg-app-prod --vnet-name vnet-spoke-app \
  --name snet-appsvc-integration --nat-gateway nat-app

For inspection and FQDN filtering, point the subnet’s default route at your hub Azure Firewall instead (UDR with 0.0.0.0/0 → VirtualAppliance → firewall private IP). Use the firewall when you need egress logging and rules; use the NAT gateway when you only need a fixed source IP. They can be combined, but the firewall must then own the route. The trade-off, side by side:

Egress option	Stable source IP	FQDN filtering / logging	Cost shape	SNAT headroom	Pick when
Shared Azure egress (default)	No (rotating pool)	No	Free	~128 ports/instance	Dev / no allowlist needed
NAT Gateway	Yes (1 static IP, or a /28 prefix)	No	Hourly + per-GB	Up to ~64,512 ports/IP	Partner allowlist; no inspection
Azure Firewall (hub)	Yes (firewall public IP)	Yes (rules + logs)	Higher hourly + per-GB	Firewall SNAT pool	Compliance, egress rules, logging
NAT Gateway + Firewall	Yes	Yes	Highest	Largest	Both fixed IP and inspection

SNAT port exhaustion is the failure that hides behind “intermittent 5xx under load.” The NAT gateway’s far larger port pool is the architectural fix; reusing connections in code is the real fix. The mechanics are covered end to end in Azure NAT Gateway: deterministic egress & SNAT exhaustion.
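
A rough back-of-the-envelope model shows why connection reuse matters more than pool size. The sketch below applies Little's law (ports in use is roughly arrival rate times how long each port stays occupied). The port figures come from the egress table above; the request rate and hold time in the example are made-up inputs, not measurements.

```python
import math

SHARED_EGRESS_PORTS_PER_INSTANCE = 128   # figure quoted in the egress table
NAT_GATEWAY_PORTS_PER_IP = 64_512        # figure quoted in the egress table


def ports_needed_without_reuse(requests_per_second: float, seconds_port_held: float) -> int:
    """Little's law: concurrent ports = arrival rate x time each port is held.

    seconds_port_held covers the call itself plus any cool-down before the
    port can be reused; the caller supplies it as an assumption."""
    return math.ceil(requests_per_second * seconds_port_held)


def exhausts(ports_needed: int, ports_available: int) -> bool:
    """True when demand exceeds the available SNAT ports."""
    return ports_needed > ports_available


if __name__ == "__main__":
    needed = ports_needed_without_reuse(requests_per_second=50, seconds_port_held=4)
    print(needed, exhausts(needed, SHARED_EGRESS_PORTS_PER_INSTANCE))  # new socket per call
    print(20, exhausts(20, SHARED_EGRESS_PORTS_PER_INSTANCE))          # pooled client, 20 sockets
```

With a fresh connection per call, the hypothetical workload needs more ports than one instance gets on shared egress; with a pooled client capped at 20 connections it never comes close, with or without a NAT gateway.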

Step 2 — Kill plaintext secrets with Key Vault references

Never store a connection string or key as a literal app setting. Instead, store the secret in Key Vault and reference it. App Service resolves the reference at startup (and on a refresh interval) using the app’s managed identity — the secret value never appears in configuration.

First, enable a system-assigned identity and grant it read access to the vault. Use RBAC vaults (the modern default) with the Key Vault Secrets User role:

az webapp identity assign -g rg-app-prod -n app-orders-prod

PRINCIPAL_ID=$(az webapp identity show -g rg-app-prod -n app-orders-prod --query principalId -o tsv)
VAULT_ID=$(az keyvault show -g rg-app-prod -n kv-orders-prod --query id -o tsv)

az role assignment create \
  --assignee-object-id "$PRINCIPAL_ID" \
  --assignee-principal-type ServicePrincipal \
  --role "Key Vault Secrets User" \
  --scope "$VAULT_ID"

Then reference the secret. The reference uses a versionless URI so rotation flows through without a redeploy:

az webapp config appsettings set -g rg-app-prod -n app-orders-prod --settings \
  "ServiceBusKey=@Microsoft.KeyVault(SecretUri=https://kv-orders-prod.vault.azure.net/secrets/sb-key/)"

Confirm resolution in the portal: Configuration → Application settings shows a green “Key Vault Reference” badge. A red badge means the identity can’t read the secret (usually a missing role assignment or a vault firewall blocking the app). The valid reference forms and what each pins:

Reference form	Example	Rotation behaviour	When to use
Versionless `SecretUri`	`.../secrets/sb-key/`	Picks up the latest version on refresh (~24h or restart)	Default — rotation flows through
Versioned `SecretUri`	`.../secrets/sb-key/<version>`	Pinned to that exact version	You must freeze a value for compliance
`VaultName` + `SecretName`	`VaultName=kv-...;SecretName=sb-key`	Same as versionless	Alternate syntax; identical effect
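
If you generate app settings from a script, it can help to build and classify references programmatically. This is a minimal Python sketch; the helper names are hypothetical, and it assumes the `@Microsoft.KeyVault(...)` forms shown in the table plus an optional `SecretVersion` field on the `VaultName`/`SecretName` form.

```python
import re
from urllib.parse import urlparse

_REFERENCE = re.compile(r"^@Microsoft\.KeyVault\((?P<body>.+)\)$")


def secret_uri_reference(vault_name: str, secret_name: str, version: str = "") -> str:
    """Build a SecretUri reference; an empty version gives the versionless form."""
    uri = f"https://{vault_name}.vault.azure.net/secrets/{secret_name}/{version}"
    return f"@Microsoft.KeyVault(SecretUri={uri})"


def is_versioned(reference: str) -> bool:
    """True if the reference pins one exact secret version."""
    match = _REFERENCE.match(reference)
    if match is None:
        raise ValueError("not a Key Vault reference")
    fields = dict(part.split("=", 1) for part in match.group("body").split(";") if part)
    if "SecretUri" in fields:
        # Path is /secrets/<name>/<version>; a missing third segment means versionless.
        segments = [s for s in urlparse(fields["SecretUri"]).path.split("/") if s]
        return len(segments) == 3
    return bool(fields.get("SecretVersion"))
```

A check like `is_versioned` is a cheap guard in a pipeline: flag any pinned reference that was not pinned on purpose, since it will silently ignore rotation.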

When a reference shows red, this table localises it fast:

Symptom	Likely cause	Confirm	Fix
Red badge, “access denied”	MI lacks `Key Vault Secrets User`	`az role assignment list --assignee $PRINCIPAL_ID --scope $VAULT_ID`	Grant the role at vault scope
Red badge, resolves intermittently	Vault firewall blocks the app’s egress	Vault → Networking; app egress IP	Allow the app’s subnet / NAT IP, or use a vault PE
Red badge after adding vault PE	Private DNS for `vaultcore` not linked	`nslookup kv-...vault.azure.net` from Kudu	Link `privatelink.vaultcore.azure.net` to the VNet
Value is stale after rotation	Versioned URI pins the old version	Inspect the `SecretUri`	Switch to versionless; restart to force refresh
`MSI_ENDPOINT`/identity errors	Identity not assigned	`az webapp identity show`	`az webapp identity assign`

Prefer passwordless over even references

Key Vault references remove plaintext, but the best secret is no secret. For Azure backends that support Entra auth — Azure SQL, Storage, Service Bus, Event Hubs — use the managed identity directly and drop the credential entirely.

Azure SQL: assign the app’s identity as a contained DB user and connect with Authentication=Active Directory Default (via Azure.Identity / Microsoft.Data.SqlClient); no password in the connection string.
Storage / Service Bus: grant a data-plane RBAC role (e.g. Storage Blob Data Contributor, Azure Service Bus Data Sender) to the identity and authenticate with DefaultAzureCredential.

-- Run against the target database as an Entra admin
CREATE USER [app-orders-prod] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [app-orders-prod];
ALTER ROLE db_datawriter ADD MEMBER [app-orders-prod];

The three secret strategies ranked, so you can pick deliberately per dependency:

Strategy	Plaintext in config?	Credential to rotate?	Setup effort	Use for
Plaintext app setting	Yes (visible to Reader)	Yes	None	Never in production
Key Vault reference	No	Yes (in the vault)	Low (MI + role + ref)	Third-party / non-Entra secrets
Passwordless (Entra MI)	No	No credential exists	Medium (RBAC + code)	Azure SQL / Storage / Service Bus / Event Hubs
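
One way to keep the first row out of production is to audit exported app settings before a release. The Python sketch below classifies each value using simple pattern matching. The marker list is a hypothetical heuristic, so treat a "no-credential-detected" result as "nothing obvious found", not as proof.

```python
import re

_KV_REFERENCE = re.compile(r"^@Microsoft\.KeyVault\(.+\)$")
# Hypothetical heuristic: key=value fragments that usually carry a credential.
_CREDENTIAL_MARKERS = re.compile(
    r"(password|pwd|accountkey|sharedaccesskey)\s*=", re.IGNORECASE
)


def classify_setting(value: str) -> str:
    """Classify one app-setting value by secret strategy."""
    if _KV_REFERENCE.match(value):
        return "key-vault-reference"
    if _CREDENTIAL_MARKERS.search(value):
        return "plaintext-secret"
    return "no-credential-detected"


def audit(settings: dict) -> dict:
    """Map each setting name to its classification."""
    return {name: classify_setting(value) for name, value in settings.items()}
```

Feeding it the name/value pairs from an app-settings export gives a quick list of settings that still embed a password or account key.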

System-assigned vs user-assigned identity is a real fork once you have more than one app:

Identity type	Lifecycle	Shareable across apps	Best for
System-assigned	Tied to the app; deleted with it	No (1:1)	A single app’s own access
User-assigned	Independent resource	Yes (many apps reuse it)	A fleet sharing the same role grants

The common data-plane roles you’ll actually assign for passwordless auth:

Backend	Role to grant the MI	Scope
Azure SQL	Contained DB user + `db_datareader`/`db_datawriter`	The database
Blob Storage	`Storage Blob Data Contributor`	Storage account / container
Queue Storage	`Storage Queue Data Contributor`	Storage account
Service Bus	`Azure Service Bus Data Sender` / `...Receiver`	Namespace / queue
Event Hubs	`Azure Event Hubs Data Sender` / `...Receiver`	Namespace / hub
Key Vault (secrets)	`Key Vault Secrets User`	Vault

For automatic rotation patterns and event-driven re-reads, see Azure Key Vault: secret rotation with managed identity, and for federated, secretless CI/CD the Key Vault workload identity & secrets guide.

Step 3 — Lock down inbound with a private endpoint

A private endpoint projects the app into your VNet as a private IP; with publicNetworkAccess set to Disabled, it becomes the only way in. Inbound now requires a route into the VNet: from on-prem over ExpressRoute/VPN, from a peered spoke, or via Application Gateway/Front Door as the public front door.

The private endpoint lives in a separate subnet from the VNet-integration subnet in Step 1. One handles inbound (private endpoint), the other handles outbound (delegated integration). Do not reuse one subnet for both. This is badge ③ territory.

# Dedicated subnet for the private endpoint (no delegation)
az network vnet subnet create \
  -g rg-app-prod --vnet-name vnet-spoke-app \
  --name snet-privateendpoints --address-prefixes 10.20.2.0/27

WEBAPP_ID=$(az webapp show -g rg-app-prod -n app-orders-prod --query id -o tsv)

az network private-endpoint create \
  -g rg-app-prod -n pe-app-orders \
  --vnet-name vnet-spoke-app --subnet snet-privateendpoints \
  --private-connection-resource-id "$WEBAPP_ID" \
  --group-id sites \
  --connection-name pe-app-orders-conn

# Turn off public network access so the app is reachable only via the PE
az webapp update -g rg-app-prod -n app-orders-prod --set publicNetworkAccess=Disabled

resource pe 'Microsoft.Network/privateEndpoints@2023-11-01' = {
  name: 'pe-app-orders'
  location: location
  properties: {
    subnet: { id: peSubnet.id }
    privateLinkServiceConnections: [ {
      name: 'pe-app-orders-conn'
      properties: {
        privateLinkServiceId: site.id
        groupIds: [ 'sites' ]   // 'sites' = the app; 'sites-<slot>' for a slot
      }
    } ]
  }
}

Private endpoints are useless without Private DNS. The app’s hostname must resolve to the private IP from inside the VNet. Create the privatelink.azurewebsites.net zone, link it to the VNet, and register the record (a private-endpoint DNS zone group automates the A record):

az network private-dns zone create -g rg-app-prod -n privatelink.azurewebsites.net

az network private-dns link vnet create \
  -g rg-app-prod -n link-spoke-app \
  --zone-name privatelink.azurewebsites.net \
  --virtual-network vnet-spoke-app --registration-enabled false

az network private-endpoint dns-zone-group create \
  -g rg-app-prod --endpoint-name pe-app-orders \
  -n default --private-dns-zone privatelink.azurewebsites.net --zone-name privatelink_azurewebsites_net

The --group-id (sub-resource) you target depends on what you’re making private. The ones relevant to App Service and its dependencies:

Service	`--group-id` (sub-resource)	Private DNS zone
App Service (main site)	`sites`	`privatelink.azurewebsites.net`
App Service slot	`sites-<slotname>`	`privatelink.azurewebsites.net`
Key Vault	`vault`	`privatelink.vaultcore.azure.net`
Azure SQL	`sqlServer`	`privatelink.database.windows.net`
Blob Storage	`blob`	`privatelink.blob.core.windows.net`
Service Bus	`namespace`	`privatelink.servicebus.windows.net`

The privatelink.azurewebsites.net zone covers all of the app’s hostnames — and that’s the trap with SCM/Kudu:

Hostname	Resolves via	Consequence when public access is off
`app.azurewebsites.net`	`privatelink.azurewebsites.net` A record	Browser/API reachable only inside the VNet
`app.scm.azurewebsites.net` (Kudu/SCM)	Same private zone	CI/CD must reach Kudu over the private network
`app-staging.azurewebsites.net` (slot)	Same private zone	Slot also private; deploy the slot from the VNet too

The SCM/Kudu site sits behind the same private endpoint and resolves through the same private zone. Once public access is disabled, your CI/CD agent must reach the app over the private network (a self-hosted runner in the VNet, or a build that pushes an artifact a VNet-attached deploy step consumes). A public-hosted pipeline doing zip deploy will start failing, so plan the deploy path before you flip publicNetworkAccess.
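
A small pre-flight check can catch the DNS half of this before a deploy fails. The Python sketch below derives the privatelink name each public hostname should map to and tests whether an already-resolved IP falls inside the private-endpoint subnet. The helper names are hypothetical, and the resolution itself (for example from a runner inside the VNet) is left to the caller.

```python
import ipaddress

PUBLIC_SUFFIX = ".azurewebsites.net"
PRIVATE_SUFFIX = ".privatelink.azurewebsites.net"


def privatelink_name(hostname: str) -> str:
    """Map app[.scm].azurewebsites.net to its privatelink zone name."""
    if hostname.endswith(PRIVATE_SUFFIX) or not hostname.endswith(PUBLIC_SUFFIX):
        raise ValueError(f"not a public App Service hostname: {hostname}")
    return hostname[: -len(PUBLIC_SUFFIX)] + PRIVATE_SUFFIX


def resolves_privately(resolved_ip: str, pe_subnet_cidr: str) -> bool:
    """True if the resolved address sits inside the private-endpoint subnet."""
    return ipaddress.ip_address(resolved_ip) in ipaddress.ip_network(pe_subnet_cidr)
```

Run the check for both the site and its `.scm` hostname: if either resolves outside the private-endpoint subnet from where your deploy agent runs, the zone link or record is missing.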

The inbound access-control surface is broader than just the PE; here is the full set and how they stack:

Control	What it does	Default	Set via	Note
Private endpoint	Projects a private NIC; can disable public	none	`az network private-endpoint create`	The strongest inbound control
`publicNetworkAccess`	Master public on/off	`Enabled`	`az webapp update --set publicNetworkAccess=Disabled`	`Disabled` requires a PE or VNet route to reach the app
Access restrictions (IP ACL)	Allow/deny by IP/CIDR or service tag	Allow all	`az webapp config access-restriction add`	Use when you keep public but limit callers
SCM site restrictions	Separate ACL for Kudu	Inherits main or open	`--scm-site` flag	Lock Kudu independently
`httpsOnly`	Redirect HTTP→HTTPS	`false` (set it `true`)	`az webapp update --set httpsOnly=true`	Always on in prod
Client certificates (mTLS)	Require a client cert	off	`clientCertEnabled`	For mutual TLS scenarios

Putting the app behind a public front door while keeping the origin private is the common production shape — see Application Gateway with WAF & end-to-end TLS and Azure Front Door & Traffic Manager global failover. The deeper private-DNS-at-scale patterns are in Private Endpoints & Private DNS at scale; the PE-vs-service-endpoint choice is in Private Endpoint vs Service Endpoint.

Step 4 — Deployment slots done right

A staging slot is a full, addressable copy of the app on the same plan. You deploy to staging, warm it, validate it, then swap — App Service redirects production traffic to the warmed instances with no cold start.

az webapp deployment slot create -g rg-app-prod -n app-orders-prod --slot staging

The subtlety that bites everyone is which settings travel during a swap. By default, app settings and connection strings travel with the code: they move to the other slot in the swap. That is wrong for anything environment-specific (a staging DB connection string must NOT become production’s). Mark those as slot settings (“deployment slot setting” / sticky) so they stay pinned to the slot:

az webapp config appsettings set -g rg-app-prod -n app-orders-prod --slot staging \
  --slot-settings ASPNETCORE_ENVIRONMENT=Staging "SqlConnection=@Microsoft.KeyVault(SecretUri=https://kv-orders-prod.vault.azure.net/secrets/sql-staging/)"

What travels and what stays is the whole game. This is the definitive matrix — what swaps, what’s sticky, and what you can’t change:

Setting / element	Swaps with code?	Make it sticky?	Notes
Regular app setting	Yes	—	Feature flags, shared tuning that should promote
Slot setting (sticky)	No	Mark it sticky	Env name, env-specific connection strings, slot-scoped keys
Connection strings (regular)	Yes	Often should be sticky	Same rule as app settings
Key Vault references	Yes (the reference)	Sticky if env-specific URI	The URI travels; resolution happens per-slot identity
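General settings (stack/framework version, 32/64-bit, WebSockets)	Yes	n/a	Match staging to prod or behaviour differs post-swap
Always On	No (stays with slot)	n/a	Enable it on both slots explicitly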
Publishing endpoints / hostnames	No (stay with slot)	n/a	Each slot keeps its own hostname
Managed identity	No (per-slot)	n/a	Each slot has its own identity; grant both
Private endpoint / inbound config	No (per-resource)	n/a	The slot’s networking is its own
TLS/SSL bindings	No	n/a	Bindings stay with the slot
Scale settings / autoscale	No (plan-level)	n/a	Plan is shared by both slots
Diagnostic settings	No (per-resource)	n/a	Apply to both slots explicitly

The non-obvious one is managed identity: a slot has its own identity. If you grant only production’s identity Key Vault Secrets User, staging’s references go red and the app crash-loops the moment you swap. Grant both slot identities, or use a shared user-assigned identity. This is the failure behind badge ⑤.
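
The sticky rule for app settings is easy to get wrong in your head, so here is a minimal Python model of it. It covers only app settings (not identities, bindings, or networking), and the `swap` function is an illustration of the rule described above, not the platform's implementation.

```python
def swap(production: dict, staging: dict, sticky: set) -> tuple:
    """Return (new_production, new_staging) app settings after a swap.

    Non-sticky settings travel with the code to the other slot;
    settings named in `sticky` stay where they are."""

    def settings_after(own: dict, other: dict) -> dict:
        travelling = {k: v for k, v in other.items() if k not in sticky}
        pinned = {k: v for k, v in own.items() if k in sticky}
        return {**travelling, **pinned}

    return settings_after(production, staging), settings_after(staging, production)


if __name__ == "__main__":
    prod = {"FeatureX": "off", "SqlConnection": "prod-db"}
    stage = {"FeatureX": "on", "SqlConnection": "staging-db"}
    print(swap(prod, stage, sticky={"SqlConnection"}))
    print(swap(prod, stage, sticky=set()))  # the classic disaster
```

With `SqlConnection` sticky, production picks up the new feature flag but keeps its own database; with nothing sticky, production ends up pointing at the staging database.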

Warm-up so the swap is actually zero-downtime

A swap is only seamless if the staging instances are already warm. Tell App Service to ping a path on every instance and wait for healthy responses before completing the swap:

az webapp config appsettings set -g rg-app-prod -n app-orders-prod --slot staging --slot-settings \
  WEBSITE_SWAP_WARMUP_PING_PATH=/health/ready \
  WEBSITE_SWAP_WARMUP_PING_STATUSES=200,202 \
  WEBSITE_WARMUP_PATH=/health/ready

WEBSITE_SWAP_WARMUP_PING_PATH and WEBSITE_SWAP_WARMUP_PING_STATUSES gate the swap on your readiness endpoint returning an acceptable status on each instance. /health/ready should check real dependencies (DB reachable, Key Vault references resolved, cache primed) — not just return 200 unconditionally. Pair this with Always On so the slot never idles out before a swap:

az webapp config set -g rg-app-prod -n app-orders-prod --slot staging --always-on true
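
The gate described above reduces to a membership test: is the observed status in the accepted list? The sketch below is a simplified model of that rule, not the platform's implementation. The function names `status_is_warm` and `probe_ready` are hypothetical, and the probe assumes `curl` is available wherever you run it.

```shell
#!/usr/bin/env bash
# Simplified model of the warm-up gate: is a status code in the accepted list?
# $1 = comma-separated accepted statuses (same format as
#      WEBSITE_SWAP_WARMUP_PING_STATUSES), $2 = observed HTTP status
status_is_warm() {
  local accepted=",$1," observed="$2"
  case "$accepted" in
    *",$observed,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical pre-swap probe: hit the readiness path and apply the same rule.
# $1 = full URL of the readiness endpoint, $2 = accepted statuses
probe_ready() {
  local code
  # On a connection failure curl reports 000, which is never in the list
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1") || code=000
  status_is_warm "$2" "$code"
}
```

Running `probe_ready` against the staging slot's `/health/ready` URL before you start a swap gives you the same answer the gate will reach, without waiting for the swap to stall.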

Every setting that governs the warm-up handshake, with defaults and the gotcha for each:

Setting	What it does	Default	Valid values	Gotcha
`WEBSITE_SWAP_WARMUP_PING_PATH`	Path pinged before swap completes	`/`	any path	`/` may be slow/unauth; use a real readiness path
`WEBSITE_SWAP_WARMUP_PING_STATUSES`	Statuses counted as “warm”	any status code	comma list, e.g. `200,202`	Unset, any response passes; listing `2xx`/`3xx` codes too loosely passes a broken app
`WEBSITE_WARMUP_PATH`	Path hit on instance start (non-swap warm)	none	any path	Warms scale-out/restart, not just swaps
`WEBSITE_SWAP_WARMUP_MAXATTEMPTS`*	Max warm-up attempts	platform-managed	integer	Tune only if startup is legitimately long
`alwaysOn`	Keep a warm worker resident	`false`	`true`/`false`	Off = slot idles out before a swap

*Behaviour and exposure of the maxattempts knob vary by stack; treat warm-up tuning conservatively and fix slow startup instead of widening the gate.

A readiness path that lies defeats the entire mechanism. Design rule — what /health/ready must and must not check:

`/health/ready` returns 200 when…	Include in the check	Never include
Config is loaded and the process can serve	In-process self-checks	An unconditional `return 200`
Required deps are reachable	Fast DB ping; resolve a sentinel KV secret	A slow aggregate / report query
The instance can serve a real request	Cheap synthetic path	A call to an external payment API
—	—	Optional downstreams (cache, search) that may blip

Swap with auto-rollback semantics

The robust pattern is a swap with preview (two-phase swap). Phase 1 applies production’s slot settings to staging and restarts it under production config — without moving traffic. You validate against the previewed slot, then complete:

# Phase 1: apply target (production) config to staging, no traffic moved yet
az webapp deployment slot swap -g rg-app-prod -n app-orders-prod \
  --slot staging --target-slot production --action preview

# ... run smoke tests against the staging slot now running prod config ...

# Phase 2: complete the swap (traffic moves)
az webapp deployment slot swap -g rg-app-prod -n app-orders-prod \
  --slot staging --target-slot production --action swap

If smoke tests fail during preview, abort with --action reset and nothing reaches users. If a regression surfaces after completion, swap back — the previous production bits are sitting in the staging slot, so rollback is another swap, not a redeploy:

az webapp deployment slot swap -g rg-app-prod -n app-orders-prod --slot staging --target-slot production

The --action values and exactly what each does:

`--action`	What happens	Traffic moves?	Use for
`swap`	Single-phase swap	Yes, immediately	Simple promote when you trust the slot
`preview`	Phase 1: apply target config, restart staging	No	Validate prod config before committing
`swap` (after preview)	Phase 2: complete the previewed swap	Yes	Promote a validated preview
`reset`	Cancel a preview, revert config	No	Abort when smoke tests fail
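
The preview, validate, then complete-or-reset flow can be wrapped so that a failed smoke test always cancels the preview. This is a sketch under stated assumptions: the function name `release_with_preview` is hypothetical and the smoke-test command is whatever you supply; only the `az webapp deployment slot swap` calls come from the steps above.

```shell
#!/usr/bin/env bash
# Sketch of a two-phase release.
# $1 = resource group, $2 = app name, $3... = smoke-test command and its args
release_with_preview() {
  local rg="$1" app="$2"; shift 2
  # Phase 1: apply production config to staging, no traffic moves
  az webapp deployment slot swap -g "$rg" -n "$app" \
    --slot staging --target-slot production --action preview || return 1
  if "$@"; then
    # Phase 2: smoke tests passed, complete the swap
    az webapp deployment slot swap -g "$rg" -n "$app" \
      --slot staging --target-slot production --action swap
  else
    # Smoke tests failed: cancel the preview so nothing reaches users
    az webapp deployment slot swap -g "$rg" -n "$app" \
      --slot staging --target-slot production --action reset
    return 1
  fi
}
```

A call might look like `release_with_preview rg-app-prod app-orders-prod ./smoke.sh`, where `./smoke.sh` is a placeholder for your own test script and its exit code decides between `swap` and `reset`.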

The release strategies you can build on slots, compared:

Strategy	Mechanism	Rollback	Granularity	Best for
In-place deploy	Deploy to production directly	Redeploy old artifact	All-or-nothing, with downtime	Never in prod
Slot swap	Warm staging → swap	Reverse swap (instant)	All-or-nothing, zero-downtime	Standard releases
Swap-with-preview	Two-phase swap	`reset` or reverse swap	All-or-nothing, validated	Risk-sensitive releases
Canary via traffic %	Route a % to the slot	Set % back to 0	Percentage of traffic	Gradual exposure

The next maturity step is a canary: route a slice of live production traffic to the staging slot before a full swap. The traffic-split and blue-green patterns on slots (and with Traffic Manager) are covered in Blue-green deployments with App Service slots & Traffic Manager; the slot-level mechanism needs no swap at all:

# Send 10% of production traffic to the staging slot (canary)
az webapp traffic-routing set -g rg-app-prod -n app-orders-prod \
  --distribution staging=10
# Revert (all traffic back to production)
az webapp traffic-routing clear -g rg-app-prod -n app-orders-prod
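
A canary is usually ramped in steps rather than set once. The sketch below assumes a hypothetical helper name, `canary_ramp`, and a caller-supplied health command; the step sizes are placeholders to tune against your own traffic and SLOs. It clears routing the moment the health command fails.

```shell
#!/usr/bin/env bash
# Sketch of a stepped canary.
# $1 = resource group, $2 = app name, $3 = space-separated percentages,
# $4... = command that returns 0 while the canary looks healthy
canary_ramp() {
  local rg="$1" app="$2" steps="$3" pct; shift 3
  for pct in $steps; do
    az webapp traffic-routing set -g "$rg" -n "$app" \
      --distribution "staging=$pct" || return 1
    if ! "$@"; then
      # Unhealthy: send all traffic back to production and stop
      az webapp traffic-routing clear -g "$rg" -n "$app"
      return 1
    fi
  done
}
```

The health command is where any soak time belongs, for example a script that waits a few minutes and then checks the production 5xx rate before returning.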

Before you ever click swap, run this pre-flight checklist — every row is a real post-swap incident waiting if it’s wrong:

Pre-swap check	Why it matters	Confirm with
Staging slot identity has vault access	Else KV refs go red → crash loop post-swap	`az webapp identity show --slot staging` + role list
Env-specific settings are sticky	Else staging values promote to production	`... --slot staging --query "[?slotSetting].name"`
Warm-up path is a real readiness check	Else the gate passes on a broken app	Read the handler; hit it with a dep down
`alwaysOn` on the slot	Else the slot idles out before swap	`az webapp config show --slot staging --query alwaysOn`
Private DNS for vault/SQL reachable from slot	Else boot-time resolution fails	`nslookup` from Kudu in the slot
Plan has headroom (`min-count >= 2`)	Else a recycle during deploy zeroes prod	Plan instance count / autoscale min
Stack/runtime matches production	Else behaviour diverges after swap	Compare general settings both slots
Diagnostic settings on the slot	Else you’re blind during the change	`az monitor diagnostic-settings list` on the slot

Step 5 — Health checks, scaling, and resilience

Enable Health Check so the platform pulls unhealthy instances out of rotation and recycles them. App Service polls the path across instances and stops routing to any that fail consistently:

az webapp config set -g rg-app-prod -n app-orders-prod --generic-configurations '{"healthCheckPath": "/health/live"}'

Use a liveness path (/health/live — is the process up?) for Health Check and a readiness path (/health/ready — are dependencies good?) for warm-up. Conflating them recycles healthy instances during a transient dependency blip. The two probes, kept straight:

Probe	Question it answers	Used by	Fail it when	Never fail it on
Liveness (`/health/live`)	Is this process up?	Health Check (eviction)	The process is wedged	A downstream blip (evicts the fleet)
Readiness (`/health/ready`)	Can it serve (deps OK)?	Swap warm-up	A required dep is unreachable	Optional/best-effort deps

The Health Check knob set, enumerated:

Setting / control	What it does	Default	Valid range	When to change
`healthCheckPath`	Path probed per instance	unset (disabled)	any path returning 200–299 when healthy	Always set in prod; keep it shallow
`WEBSITE_HEALTHCHECK_MAXPINGFAILURES`	Consecutive failed pings before an instance is marked unhealthy and pulled from rotation	10	2–10	Lower for fast eviction; higher to ride blips
`WEBSITE_HEALTHCHECK_MAXUNHEALTHYWORKERPERCENT`	Cap % of instances removed at once	50	1–100	Prevent evicting the whole fleet on a shared dep
Probe interval	How often the platform pings	~1 min	platform-managed	Not directly tunable

Add autoscale on the plan. Scale on a signal that reflects load (CPU here; queue depth or HTTP queue length are often better):

az monitor autoscale create -g rg-app-prod \
  --resource $(az appservice plan show -g rg-app-prod -n plan-orders-prod --query id -o tsv) \
  --name autoscale-orders --min-count 2 --max-count 10 --count 2

az monitor autoscale rule create -g rg-app-prod --autoscale-name autoscale-orders \
  --condition "CpuPercentage > 70 avg 10m" --scale out 2

az monitor autoscale rule create -g rg-app-prod --autoscale-name autoscale-orders \
  --condition "CpuPercentage < 30 avg 10m" --scale in 1

Autoscale operates on the plan, which both slots share. Staging instances consume the same plan capacity, so size max-count with headroom for a slot running warm during a deploy. Keep min-count at 2+ so production survives an instance recycle.

The scaling signals you can pick and when each is right:

Autoscale metric	Reflects	Good for	Watch out for
`CpuPercentage`	Compute load	CPU-bound apps	I/O-bound apps stay low while slow
`MemoryPercentage`	Memory pressure	Memory-heavy workloads	Leaks look like load
`HttpQueueLength`	Requests waiting for a thread	Latency-sensitive web apps	Often the best web signal
Service Bus / queue depth	Backlog of work	Worker / queue-driven apps	Needs the queue metric wired in
Schedule (time-based)	Known traffic shape	Predictable daily peaks	Doesn’t react to surprises

The cold-start triggers slots/scaling introduce, and what fixes each:

Cold-start trigger	When	Fix	Tier
Idle unload (~20 min)	Low-traffic apps	`alwaysOn=true`	B1+
Just deployed	After a deploy/restart	Slot-swap with warm-up	S1+
Scaled out (new instance)	Autoscale adds capacity	Pre-warmed instances	P1v3+
Swapped in without warm-up	After a swap	`WEBSITE_SWAP_WARMUP_PING_PATH`	S1+

Step 6 — Observability and slot-aware alerting

Wire Application Insights for distributed tracing and live metrics, and ship platform logs to Log Analytics via diagnostic settings:

APPI_CONN=$(az monitor app-insights component show -g rg-app-prod --app appi-orders --query connectionString -o tsv)
az webapp config appsettings set -g rg-app-prod -n app-orders-prod \
  --settings APPLICATIONINSIGHTS_CONNECTION_STRING="$APPI_CONN"

az monitor diagnostic-settings create \
  --name diag-to-law \
  --resource $(az webapp show -g rg-app-prod -n app-orders-prod --query id -o tsv) \
  --workspace $(az monitor log-analytics workspace show -g rg-app-prod -n law-orders --query id -o tsv) \
  --logs    '[{"category":"AppServiceHTTPLogs","enabled":true},{"category":"AppServiceConsoleLogs","enabled":true},{"category":"AppServiceAppLogs","enabled":true}]' \
  --metrics '[{"category":"AllMetrics","enabled":true}]'

Apply diagnostic settings to the staging slot too — a slot is a distinct resource and won’t inherit them. Scope production alerts (5xx rate, response-time P95, health-check failures) to the production slot resource ID so a noisy staging deploy doesn’t page on-call. The log categories worth shipping and what each is for:

Diagnostic category	What it captures	Use it for
`AppServiceHTTPLogs`	Per-request access logs (status, latency, bytes)	5xx rate, slow paths, traffic shape
`AppServiceConsoleLogs`	stdout/stderr from the app/container	Boot errors, port-probe failures
`AppServiceAppLogs`	Application logging (your logger output)	App-level diagnostics
`AppServicePlatformLogs`	Platform/container lifecycle events	Restart/recycle causes
`AppServiceAuditLogs`	SCM/FTP access	Who deployed / accessed Kudu
`AllMetrics`	Platform metrics to the workspace	Long-retention dashboards
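
Because a slot is its own resource, its diagnostic settings and alert scopes need the slot's resource ID, which is the app's ID with a `/slots/<name>` suffix. The helper below is a small sketch of that shape; the name `site_resource_id` is hypothetical, and in practice `az webapp show ... --slot staging --query id -o tsv` returns the same value.

```shell
#!/usr/bin/env bash
# Build the ARM resource ID for an app or one of its slots.
# $1 = subscription ID, $2 = resource group, $3 = app name,
# $4 = slot name (optional; omit or pass "production" for the main app)
site_resource_id() {
  local id="/subscriptions/$1/resourceGroups/$2/providers/Microsoft.Web/sites/$3"
  if [ -n "${4:-}" ] && [ "$4" != "production" ]; then
    id="$id/slots/$4"
  fi
  printf '%s\n' "$id"
}
```

Pass the slot form to `--resource` when repeating `az monitor diagnostic-settings create` for staging, and the production form when scoping alerts.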

The release-relevant alerts to define, and where to scope each:

Alert	Signal	Threshold (starting point)	Scope to
Http5xx spike	`Http5xx` count	> 1% of requests over 5m	Production slot
Response-time P95	`HttpResponseTime` P95	> your SLO (e.g. 1.5s)	Production slot
Health-check failures	`HealthCheckStatus`	Any instance unhealthy 5m+	Production slot
Restart rate	Platform restart events	> N in 10m	App (both slots)
Post-swap error burst	`Http5xx` after a swap event	any non-zero immediately post-swap	Production slot

During and after a change window, a handful of Log Analytics / App Insights queries answer the questions you’ll actually ask. Keep these ready:

Question during a release	Table / source	Query gist
Did 5xx spike right after the swap?	`AppServiceHTTPLogs`	filter `ScStatus >= 500`, bin by minute around the swap time
Which exceptions are new post-swap?	`exceptions` (App Insights)	`summarize count() by problemId, operation_Name` since the swap
Did a request run and fail, or never arrive?	`requests` (App Insights)	a matching `requests` row that returned 5xx = your code; none = platform/network
Is the health path failing per-instance?	`requests`	filter `url endswith "/health/live"`, group by `cloud_RoleInstance`
Which dependency started failing under load?	`dependencies`	`where success == false`
Did the app restart, and why?	`AppServicePlatformLogs`	look for recycle/restart events with the cause field
Is egress hitting SNAT limits?	metric `SnatConnectionCount`	any non-zero `Failed` dimension is the smoking gun
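
As one worked instance of the first row, the sketch below prints a KQL query that bins 5xx responses by minute around a swap. The function name `kql_5xx_around_swap` is hypothetical; the column names (`TimeGenerated`, `ScStatus`) are those of the `AppServiceHTTPLogs` table, and the timestamp is whatever your swap time was.

```shell
#!/usr/bin/env bash
# Print a KQL query that bins 5xx responses by minute around a swap.
# $1 = swap time (ISO 8601, UTC), $2 = window in minutes either side
kql_5xx_around_swap() {
  cat <<EOF
AppServiceHTTPLogs
| where TimeGenerated between (datetime($1) - $2m .. datetime($1) + $2m)
| where ScStatus >= 500
| summarize errors = count() by bin(TimeGenerated, 1m)
| order by TimeGenerated asc
EOF
}
```

Paste the output into the Log Analytics query editor, or pass it to `az monitor log-analytics query --analytics-query` with your workspace ID if you have that CLI extension installed.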

Application Insights is the single most useful tool when a swap goes wrong — it tells you whether a request ran and failed (your code) or never reached a worker (platform/networking). The deep observability and KQL patterns are in Azure Monitor & Application Insights for observability.

Architecture at a glance

Read the diagram left to right as a release in motion. On the far left, CI/Deploy is a self-hosted runner inside the spoke VNet pushing a run-from-package zip — it has to live in the VNet because, once you disable public access, the SCM/Kudu endpoint is private (badge ③). The artifact lands on the staging slot of the shared App Service Plan (P1v3, Always On). Staging warms on a real /health/ready probe (badge ①) before any traffic moves; the production slot is the swap target carrying live traffic, and both slots draw from the same plan with min 2 instances, so a recycle never zeroes capacity. The two failure points baked into this zone are a dishonest warm-up gate (①) and a non-sticky environment setting leaking across the swap (②).

The remaining three zones are the isolation and data planes the running app depends on at every request and, critically, at boot. Inbound isolation is the private endpoint (10.20.2.x, sites group) plus the privatelink.azurewebsites.net private DNS zone — without that zone link the hostname resolves to the public IP and even the CI runner can’t reach Kudu (③). The outbound plane is the delegated /27 integration subnet with WEBSITE_VNET_ROUTE_ALL=1, fronted by a NAT gateway for a stable egress IP — flip ROUTE_ALL before that egress path exists and every internet call breaks (④). Finally, secrets & data: the managed identity pulls a token to Key Vault (@Microsoft.KeyVault references, get-secret) and to Azure SQL via Entra with no password — and because a freshly-swapped production instance re-resolves every reference at boot, a missing vault private-DNS link or role grant turns a clean swap into a crash loop (⑤). The numbered legend on the diagram narrates each of the five as symptom · confirm · fix — the same five you’ll meet in the troubleshooting playbook below.

Real-world scenario

A payments team — call them NorthPay — runs app-orders-prod on a P1v3 plan behind a private endpoint, with publicNetworkAccess=Disabled, VNet integration with WEBSITE_VNET_ROUTE_ALL=1 egressing through a NAT gateway, and Key Vault references for every secret. They deploy from a self-hosted runner in the spoke and release with swap-with-preview. Their compliance auditor signed off on the isolation; their SRE lead signed off on the zero-downtime story. Then the first private-network release went sideways.

The deploy to staging succeeded. Preview looked clean — smoke tests passed against staging running production config. They completed the swap at 21:40, and within seconds production threw HTTP 500 on every request. Staging under preview had been healthy; production was not. The on-call engineer’s instinct was to swap back, which they did — and production recovered instantly, because the previous (working) bits were sitting in the staging slot. That reverse swap, completed in under a minute, is exactly the rollback this architecture is supposed to give you, and it bought them the time to find the real cause without an outage.

The cause was Key Vault references and the private endpoint colliding at swap time. The vault had its own private endpoint, but the freshly-restarted production instances re-resolved every @Microsoft.KeyVault(...) reference on startup, and WEBSITE_VNET_ROUTE_ALL=1 forced that DNS lookup through the VNet — where the privatelink.vaultcore.azure.net zone was linked to the spoke but the conditional-forwarder rule on the hub DNS server hadn’t been updated for the new vault. Staging had cached resolved secrets from before the DNS change; production started cold and couldn’t reach the vault. The warm-up ping on /health/ready should have caught it, but the readiness probe only checked SQL, not secret resolution — so the gate passed on an app that couldn’t actually serve. This is badges ① and ⑤ firing together.

Two fixes. First, make readiness actually prove the dependency chain — resolve a sentinel secret, not just open a DB connection:

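using Azure.Identity;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
// Register checks before builder.Build(); both carry the "ready" tag.
// vaultUri and sqlConn are read from configuration elsewhere.
builder.Services.AddHealthChecks()
    .AddAzureKeyVault(new Uri(vaultUri), new DefaultAzureCredential(),
        o => o.AddSecret("health-canary"), tags: new[] { "ready" })
    .AddSqlServer(sqlConn, tags: new[] { "ready" });
var app = builder.Build();
// The readiness endpoint runs only the checks tagged "ready".
app.MapHealthChecks("/health/ready", new HealthCheckOptions {
    Predicate = c => c.Tags.Contains("ready")
});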

Second, gate the swap on it explicitly and confirm the vault is reachable from the integration subnet before releasing:

az webapp config appsettings set -g rg-app-prod -n app-orders-prod --slot staging \
  --slot-settings WEBSITE_SWAP_WARMUP_PING_PATH=/health/ready WEBSITE_SWAP_WARMUP_PING_STATUSES=200
nslookup kv-orders-prod.vault.azure.net   # from Kudu: must return 10.20.2.x, not a public IP
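
The "must return 10.20.2.x, not a public IP" check can be made scriptable. This is a minimal sketch with a hypothetical helper name, `is_private_ipv4`; it tests only the RFC 1918 ranges and leaves extracting the address from your resolver's output to you.

```shell
#!/usr/bin/env bash
# Return 0 if the IPv4 address is in an RFC 1918 private range.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;                              # 10/8, 192.168/16
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;       # 172.16/12
    *) return 1 ;;
  esac
}
```

A release gate can then fail fast when the vault hostname resolves to anything that is not private, which is the condition that produced the crash loop in this scenario.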

The lesson NorthPay took away: a private endpoint you add for one resource changes the DNS blast radius for every service that resolves a name at startup. Readiness checks have to exercise the secrets path, or warm-up gating is theatre — and the only reason this was a 90-second incident instead of a two-hour one is that the rollback was a reverse swap, not a redeploy.

Advantages and disadvantages

The hardened architecture is not free — it adds moving parts, subnets, and DNS to reason about. The honest trade-off:

Advantages	Disadvantages
No anonymous internet reachability (private endpoint)	More subnets, DNS zones, and route tables to operate
Stable, allowlistable egress IP (NAT gateway)	NAT gateway / firewall add hourly + per-GB cost
Zero plaintext secrets; passwordless where possible	Managed identity per slot is easy to forget → swap crash loop
Zero-downtime releases with instant rollback (reverse swap)	Slots consume plan capacity; size `max-count` with headroom
Validated promotion (swap-with-preview / canary)	Standard+ tier required for slots; Free/Shared/Basic can’t play
Self-healing via Health Check + autoscale	A dishonest health path silently defeats warm-up gating
Private DNS keeps names resolving inside the VNet	DNS misconfig breaks boot-time resolution app-wide
CI/CD path is auditable and VNet-scoped	Public-hosted pipelines must move into the VNet

When each axis matters: inbound isolation is non-negotiable under a compliance boundary or when the app fronts a private-only data tier; for a purely public marketing site it may be overkill. Stable egress matters the instant a partner firewall enters the picture. Slots with warm-up pay for themselves on the first release you’d otherwise have done in-place during business hours. Passwordless matters most where credential rotation is a recurring operational cost, because it deletes the credential entirely. The one axis you should never skip is killing plaintext secrets: it has no meaningful downside, so do it on day one regardless of tier.

Hands-on lab

A low-cost walk-through. Slots need Standard+ (Always On alone needs only Basic+), so this lab uses S1 for the slot steps and tears down at the end; the secrets and identity steps work on any paid tier. Everything is copy-pasteable; replace names as needed.

# 0. Variables
RG=rg-zdt-lab; LOC=eastus; PLAN=plan-zdt-lab; APP=app-zdt-$RANDOM; KV=kv-zdt-$RANDOM
az group create -n $RG -l $LOC

# 1. Standard plan + app (S1 unlocks slots + Always On)
az appservice plan create -g $RG -n $PLAN --sku S1 --is-linux
az webapp create -g $RG -p $PLAN -n $APP --runtime "DOTNETCORE:8.0"
az webapp config set -g $RG -n $APP --always-on true

# 2. Managed identity + Key Vault + a secret + a reference
az webapp identity assign -g $RG -n $APP
PID=$(az webapp identity show -g $RG -n $APP --query principalId -o tsv)
az keyvault create -g $RG -n $KV --enable-rbac-authorization true
ME=$(az ad signed-in-user show --query id -o tsv)
az role assignment create --assignee $ME --role "Key Vault Secrets Officer" \
  --scope $(az keyvault show -g $RG -n $KV --query id -o tsv)
az keyvault secret set --vault-name $KV --name health-canary --value ok
az role assignment create --assignee-object-id $PID --assignee-principal-type ServicePrincipal \
  --role "Key Vault Secrets User" --scope $(az keyvault show -g $RG -n $KV --query id -o tsv)
az webapp config appsettings set -g $RG -n $APP --settings \
  "Canary=@Microsoft.KeyVault(SecretUri=https://$KV.vault.azure.net/secrets/health-canary/)"

# 3. Staging slot with sticky settings + warm-up
az webapp deployment slot create -g $RG -n $APP --slot staging
az webapp config appsettings set -g $RG -n $APP --slot staging \
  --slot-settings ASPNETCORE_ENVIRONMENT=Staging \
  WEBSITE_SWAP_WARMUP_PING_PATH=/ WEBSITE_SWAP_WARMUP_PING_STATUSES=200
# Grant the STAGING slot's identity too (the gotcha):
az webapp identity assign -g $RG -n $APP --slot staging
SPID=$(az webapp identity show -g $RG -n $APP --slot staging --query principalId -o tsv)
az role assignment create --assignee-object-id $SPID --assignee-principal-type ServicePrincipal \
  --role "Key Vault Secrets User" --scope $(az keyvault show -g $RG -n $KV --query id -o tsv)

# 4. Swap-with-preview, then complete
az webapp deployment slot swap -g $RG -n $APP --slot staging --target-slot production --action preview
az webapp deployment slot swap -g $RG -n $APP --slot staging --target-slot production --action swap

Expected results and how to verify each:

Step	What you should see	Verify with
1	App reachable on `https://$APP.azurewebsites.net`; `alwaysOn=true`	`az webapp config show -g $RG -n $APP --query alwaysOn`
2	Green “Key Vault Reference” badge on `Canary`	`az webapp config appsettings list -g $RG -n $APP --query "[?name=='Canary']"`
3	`ASPNETCORE_ENVIRONMENT` shows `slotSetting:true` on staging	`... --slot staging --query "[?slotSetting].name"`
4	Swap completes with no error; app stays up	`az webapp deployment slot list -g $RG -n $APP -o table`

# 5. Teardown — delete everything
az group delete -n $RG --yes --no-wait

This lab keeps inbound/outbound public for simplicity (no VNet). Add Steps 1 and 3 from the main guide — integration subnet, private endpoint, private DNS — only in a VNet-enabled subscription; doing so on a throwaway lab adds cost and DNS plumbing without teaching the slot mechanics any better.

Common mistakes & troubleshooting

This is the section you keep open during a change window. It is a release playbook: each row is a real failure mode with its symptom, root cause, the exact command or portal path to confirm it, and the fix. Read the prose once; scan the table at 21:40 when a swap just went wrong.

#	Symptom	Root cause	Confirm (exact command / path)	Fix
1	500s in production seconds after swap; staging was healthy	KV references re-resolve at boot; vault unreachable (DNS/role)	Portal: Configuration → green/red KV badge; `nslookup kv-...vault.azure.net` from Kudu	Link `privatelink.vaultcore.azure.net`; grant slot MI `Key Vault Secrets User`; swap back to recover
2	Staging DB connection string now live in production	Env setting wasn’t sticky; swapped across	`az webapp config appsettings list --slot staging --query "[?slotSetting].name"`	Mark env keys as `--slot-settings`; swap back; re-release
3	All internet calls fail after enabling integration	`WEBSITE_VNET_ROUTE_ALL=1` with no egress route	From Kudu: `curl -s ifconfig.me` (hangs / wrong IP)	Attach NAT gateway / UDR to integration subnet, then keep `ROUTE_ALL=1`
4	App unreachable after `publicNetworkAccess=Disabled`	Private DNS zone not linked; resolves public IP	From a VNet host: `nslookup app.azurewebsites.net` (returns public IP)	Link `privatelink.azurewebsites.net`; register A via PE zone group
5	CI/CD `zip deploy` fails after lockdown	Kudu/SCM now private; public runner can’t reach it	Pipeline log: SCM connect timeout / 403	Move deploy to a VNet-attached self-hosted runner
6	KV reference badge red	MI lacks role or vault firewall blocks app	`az role assignment list --assignee $PID --scope $VAULT_ID`	Grant `Key Vault Secrets User`; allow app subnet / use vault PE
7	Swap completes but users hit a cold start	Warm-up path wrong or slot idled	`az webapp config appsettings list --slot staging --query "[?name=='WEBSITE_SWAP_WARMUP_PING_PATH']"`	Set a real readiness path; enable `alwaysOn` on the slot
8	Warm-up “passes” but app is broken post-swap	`/health/ready` returns 200 unconditionally	Read the health handler; hit it while a dep is down	Make readiness check DB + a sentinel KV secret
9	Whole app 503s when a downstream blips	Health (liveness) path fails on optional dep → fleet evicted	Health Check blade per-instance status; KQL on the path	Make liveness shallow; raise `WEBSITE_HEALTHCHECK_MAXPINGFAILURES`
10	503 on every deploy/restart	Single instance; in-place restart zeroes capacity	`az webapp show --query "siteConfig.numberOfWorkers"`; plan instance count	`min-count >= 2`; deploy via slot-swap, not in-place
11	Staging slot crash-loops though production is fine	Slot’s own identity never granted vault access	`az webapp identity show --slot staging`; role list for that PID	Grant the staging identity, or use a shared user-assigned identity
12	Intermittent 5xx / dependency timeouts under load	SNAT port exhaustion on shared egress	Diagnose and solve → SNAT Port Exhaustion; `SnatConnectionCount` Failed > 0	Reuse connections; attach NAT gateway; PE for PaaS targets
13	502 only when behind App Gateway/Front Door	Upstream backend timeout shorter than app response	App Insights request duration vs gateway timeout	Speed up the path; raise upstream timeout to match
14	Secret value stale after rotation	Versioned KV reference URI pinned an old version	Inspect the `SecretUri` in app settings	Switch to versionless URI; restart to refresh

The boot-time DNS distinction (rows 1, 4, 6) is the one that eats the most hours, so call it out explicitly:

Distinction	The trap	How to tell them apart
App PE vs vault PE DNS	Fixing the app’s zone but not the vault’s	App reachable, yet KV badge red → the vault’s `vaultcore` zone is the gap
Staging healthy vs production healthy at swap	Cached resolution in staging hides the DNS gap	Production cold-resolves at boot; staging had cached secrets from before a DNS change
Cold start vs broken swap	Both look like “slow/erroring after deploy”	A cold start recovers on retry; a broken swap keeps failing → swap back

Best practices

Isolate inbound and outbound in distinct subnets: a delegated /27 for outbound integration, a separate non-delegated /27 for the private endpoint. Never reuse one subnet for both.
Configure egress before forcing it: attach a NAT gateway or firewall route to the integration subnet before setting WEBSITE_VNET_ROUTE_ALL=1.
Link private DNS before disabling public access: privatelink.azurewebsites.net (app) and privatelink.vaultcore.azure.net (vault) must resolve privately first, or boot-time resolution breaks.
Kill plaintext secrets on day one: Key Vault references with a managed identity granted Key Vault Secrets User; go passwordless for SQL/Storage/Service Bus where Entra auth is supported.
Grant both slot identities: a slot has its own managed identity — grant it vault access too, or use a shared user-assigned identity, or the swap crash-loops.
Make environment settings sticky: audit slotSetting on every env-specific key before every release; a non-sticky staging connection string promotes itself to production.
Gate swaps on an honest readiness probe: /health/ready must check the real dependency chain (DB + a sentinel KV secret), not return 200 unconditionally.
Release via swap-with-preview, roll back via reverse swap: validate prod config without moving traffic; if a regression surfaces, swap back — the old bits are in staging.
Run min-count >= 2: so an instance recycle never zeroes production capacity, and size max-count with headroom for a warm staging slot.
Separate liveness from readiness: liveness (Health Check) stays shallow so a downstream blip doesn’t evict the fleet; readiness (warm-up) proves dependencies.
Apply diagnostic settings to both slots and scope alerts to production: a slot is a distinct resource; alerts on the production slot ID keep a staging deploy from paging on-call.
Move CI/CD into the VNet before lockdown: a self-hosted runner in the spoke (or a VNet-attached deploy step) so Kudu/SCM stays reachable.

Security notes

Least privilege on the identity: grant Key Vault Secrets User (read-only secrets), not Key Vault Administrator. For data planes, grant the narrowest data role (Storage Blob Data Reader over Contributor where reads suffice).
No secret should be plaintext anywhere: not in app settings, not in ARM/Bicep parameters, not in pipeline variables. References resolve at runtime; the value never lands in config or source.
Private by default, public by exception: publicNetworkAccess=Disabled with a private endpoint is the baseline; expose only through a WAF-fronted front door when public reach is genuinely required.
httpsOnly=true and a modern minimum TLS version: redirect HTTP→HTTPS and set the minimum TLS to 1.2+ on both slots.
Lock the SCM/Kudu surface independently: restrict the SCM site to the deploy runner’s source range; it is a separate access-control plane from the main site.
Control and inspect egress: force egress through the VNet so traffic can be inspected/logged at a hub firewall, and prefer private endpoints for PaaS targets so data-plane traffic stays on the Azure backbone (no SNAT, no public hop).
Rotate without redeploy: versionless Key Vault references pick up rotated secrets on refresh — pair with Key Vault secret rotation with managed identity so rotation is event-driven, not a deploy.
Audit who deploys: enable AppServiceAuditLogs; treat Kudu access as privileged and review it.
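
Whether a reference is versionless, and so able to pick up rotated secrets, is visible in the shape of its `SecretUri`. The sketch below uses a hypothetical helper name, `secret_uri_is_versionless`, and expects the bare URI rather than the whole `@Microsoft.KeyVault(...)` expression.

```shell
#!/usr/bin/env bash
# Return 0 if a Key Vault secret URI has no version segment.
# Versionless: https://<vault>.vault.azure.net/secrets/<name>   (optional trailing /)
# Versioned:   https://<vault>.vault.azure.net/secrets/<name>/<version>
secret_uri_is_versionless() {
  local path="${1#*/secrets/}"   # keep everything after /secrets/
  path="${path%/}"               # drop one trailing slash, if any
  case "$path" in
    */*) return 1 ;;             # a further segment means a pinned version
    *)   return 0 ;;
  esac
}
```

Running this over every Key Vault reference in a slot's app settings flags the pinned ones before a rotation leaves them serving a stale value.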

Cost & sizing

What drives the bill here is the plan SKU and instance count (both slots share the plan), plus the NAT gateway / firewall for egress and the private endpoint hourly charge. Slots themselves are free — they consume plan capacity, not a separate line item. Rough figures (USD list, plus an INR feel for budget planning; verify current pricing in the Azure calculator):

Component	Unit	~USD/month	~INR/month	Notes
App Service Plan S1 (1 inst)	per instance	~$70	~₹5,800	Unlocks slots + Always On
App Service Plan P1v3 (1 inst)	per instance	~$120	~₹10,000	Pre-warmed instances; better isolation
Each extra instance	per instance	linear	linear	`min-count 2` doubles the base
Private endpoint	per PE	~$7 + per-GB	~₹600	One per resource made private
NAT gateway	hourly + per-GB	~$32 + data	~₹2,700	Stable egress IP
Azure Firewall (if used)	hourly + per-GB	~$900+	~₹75,000+	Shared at the hub, not per-app
Key Vault	per 10k ops	cents	cents	References add minimal ops
Log Analytics	per GB ingested	~$2.76/GB	~₹230/GB	Tune categories to control cost

What each tier actually unlocks for this topic — pick the lowest tier that has what you need:

Tier	Slots	Always On	VNet integration	Private endpoint	Pre-warmed	Use for
F1 / D1	No	No	No	No	No	Throwaway experiments only
B1–B3	No	Yes	Yes	Yes	No	Dev/test isolation; light prod
S1–S3	5	Yes	Yes	Yes	No	Standard production with slots
P1v3–P3v3	20	Yes	Yes	Yes	Yes	Scale-out prod; cold-start-sensitive

Right-sizing rules: keep min-count at 2 for HA (a single instance means every restart is a 503), set max-count from your peak load plus one warm staging slot’s worth of headroom, and prefer P1v3 over scaling many S1 instances when scale-out cold starts hurt — pre-warmed instances exist only on Premium v3. To control Log Analytics spend, ship only the categories you query (HTTP + Console + App logs cover most incidents; drop the rest until you need them).
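
To make the sizing rules concrete, the rough figures from the cost table can be combined into a baseline estimate. This is illustrative arithmetic only: the helper name `estimate_monthly_usd` is hypothetical, the ~$7 per private endpoint and ~$32 NAT gateway figures are the approximate list prices quoted above, and per-GB data charges are left out.

```shell
#!/usr/bin/env bash
# Rough monthly estimate in USD using the approximate list figures above.
# $1 = instance count, $2 = per-instance plan price, $3 = private endpoints
# Fixed assumptions: ~$7 per private endpoint, ~$32 for one NAT gateway;
# per-GB data charges are excluded.
estimate_monthly_usd() {
  echo $(( $1 * $2 + $3 * 7 + 32 ))
}
```

For example, two P1v3 instances at ~$120 with private endpoints on the app and the vault come to `estimate_monthly_usd 2 120 2`, that is 240 + 14 + 32 = 286, before data and logging charges. Verify current pricing in the Azure calculator before budgeting.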

Interview & exam questions

1. Why are the VNet-integration subnet and the private-endpoint subnet always separate? Integration governs outbound and the subnet must be delegated to Microsoft.Web/serverFarms; a private endpoint governs inbound and lives in a non-delegated subnet as a private NIC. The delegation is exclusive, the jobs are opposite, and the platform won’t let one subnet serve both. (AZ-700, AZ-204)

2. What does WEBSITE_VNET_ROUTE_ALL=1 change, and what must exist first? It forces all egress — including internet-bound — into the VNet, where it follows the subnet’s route table. A deliberate egress path (NAT gateway or a UDR to a firewall) must exist first, or internet calls break. (AZ-700)

3. A Key Vault reference shows a red badge. Name the two most likely causes. The app’s managed identity lacks Key Vault Secrets User on the vault, or the vault’s network rules (firewall / private endpoint with missing private DNS) block the app from reaching it. Syntax is rarely the issue. (AZ-204, AZ-500)

4. Why does a slot have its own managed identity, and why does it matter at swap time? Each slot is a distinct resource with its own system-assigned identity. If only production’s identity has vault access, staging’s references go red and the app crash-loops the instant you swap. Grant both, or use a shared user-assigned identity. (AZ-204)

5. What is a slot setting and what disaster does it prevent? A slot (sticky) setting is pinned to its slot and does not travel during a swap. It prevents environment-specific values — like a staging DB connection string — from promoting themselves into production when you swap. (AZ-204)

6. Explain swap-with-preview and how it enables safe rollback. Phase 1 (preview) applies the target slot’s config to the source and restarts it without moving traffic, so you validate prod config first. Phase 2 (swap) completes it. If a regression appears after completion, a reverse swap rolls back instantly because the previous bits sit in the staging slot. (AZ-204)

7. Why must the warm-up readiness path be “honest”? The swap completes only when the warm-up path returns an accepted status. If /health/ready returns 200 unconditionally, the gate passes on a broken/cold app and you ship downtime. It must exercise the real dependency chain — DB plus a sentinel secret. (AZ-204)

8. After disabling public access, deployments start failing. Why? The SCM/Kudu endpoint (<app>.scm.azurewebsites.net) sits behind the same private endpoint and access rules as the app, so it’s now private too. A public-hosted pipeline can no longer reach Kudu. The deploy must run from a VNet-attached self-hosted runner. (AZ-400, AZ-204)

9. How does a private endpoint break a different resource at boot? It changes how a name resolves inside the VNet. Anything resolving a hostname at startup (a Key Vault reference, a SQL connection) now depends on the relevant private DNS zone being linked. Add a vault PE without linking privatelink.vaultcore.azure.net and references fail at boot. (AZ-700, AZ-500)

10. When do you choose a NAT gateway versus an Azure Firewall for egress? NAT gateway when you only need a stable, allowlistable source IP and a large SNAT pool. Azure Firewall when you also need FQDN filtering, egress rules, and logging. They combine, but the firewall then owns the route. (AZ-700)

11. Why keep min-count >= 2, and how does that interact with slots? Two instances mean a recycle or platform patch never zeroes production capacity, so an in-place restart stops being a 503. Because both slots share the plan, also size max-count with headroom for a warm staging slot during a deploy. (AZ-204)

12. Liveness vs readiness — which feeds Health Check, and what must each never do? Liveness (“is the process up?”) feeds Health Check and eviction; readiness (“can it serve?”) feeds swap warm-up. Liveness must never hard-fail on an optional downstream (it would evict the whole fleet); readiness must never return 200 unconditionally. (AZ-204)
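Questions 4 to 6 describe a sequence that is easier to remember as commands. The sketch below is one possible shape of it, assuming a slot named staging and placeholder resource names; `SLOT_PRINCIPAL_ID`, `VAULT_ID`, the `APP_ENVIRONMENT` setting, and the `run` dry-run helper are hypothetical and exist only for illustration. It prints the commands by default; set `APPLY=1` to execute them after replacing the placeholders.

```shell
#!/usr/bin/env bash
# Dry-run by default: prints each az command; set APPLY=1 to execute.
# Every name below is a placeholder for illustration.
RG="${RG:-rg-app-prod}"
APP="${APP:-app-prod}"
SLOT="${SLOT:-staging}"
VAULT_ID="${VAULT_ID:-<key-vault-resource-id>}"
SLOT_PRINCIPAL_ID="${SLOT_PRINCIPAL_ID:-<staging-identity-object-id>}"

run() {
  # Execute the command when APPLY=1, otherwise print it shell-quoted.
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '%q ' "$@"; printf '\n'; fi
}

prepare_slot() {
  # Q4: the staging slot gets its own identity and its own vault grant.
  run az webapp identity assign -g "$RG" -n "$APP" --slot "$SLOT"
  run az role assignment create \
    --assignee-object-id "$SLOT_PRINCIPAL_ID" \
    --assignee-principal-type ServicePrincipal \
    --role "Key Vault Secrets User" \
    --scope "$VAULT_ID"
  # Q5: pin an environment-specific value to the slot so it does not swap.
  run az webapp config appsettings set -g "$RG" -n "$APP" --slot "$SLOT" \
    --slot-settings "APP_ENVIRONMENT=staging"
}

swap_phase() {
  # Q6: $1 is preview (apply prod config, no traffic moves),
  # swap (complete the swap), or reset (abort a preview).
  run az webapp deployment slot swap -g "$RG" -n "$APP" \
    --slot "$SLOT" --target-slot production --action "$1"
}

prepare_slot
swap_phase preview   # validate the slot under production config at this point
swap_phase swap      # complete; running a swap again later is the rollback
```

If validation fails during the preview phase, `swap_phase reset` returns the staging slot to its own configuration without production ever having moved.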

Quick check

1. Which app setting forces internet-bound traffic through the VNet, and what must already exist before you set it?
2. You disabled public access and now your pipeline’s zip deploy fails. What changed and what’s the fix?
3. A swap completed and production immediately 500s while staging was healthy. Name the most likely cause and the fastest recovery.
4. What is the one thing a /health/ready warm-up probe must do to make a swap genuinely safe?
5. Why is granting only the production slot’s managed identity Key Vault Secrets User a problem?

Answers

1. WEBSITE_VNET_ROUTE_ALL=1. A deliberate egress path (a NAT gateway on the integration subnet, or a UDR 0.0.0.0/0 to an Azure Firewall) must exist first, or all internet calls break.
2. Disabling public access also makes the SCM/Kudu endpoint private, because <app>.scm.azurewebsites.net sits behind the same private endpoint and access rules as the app. A public-hosted runner can no longer reach it; move the deploy to a VNet-attached self-hosted runner.
3. Key Vault references re-resolve at boot, and the freshly-restarted production instances couldn’t reach the vault (missing privatelink.vaultcore.azure.net link or an ungranted slot identity). Fastest recovery: swap back, because the working bits are in the staging slot.
4. Exercise the real dependency chain (at minimum a DB ping and resolving a sentinel Key Vault secret) instead of returning 200 unconditionally, so the gate can’t pass on a broken app.
5. Each slot has its own identity. Staging’s Key Vault references go red and the app crash-loops the moment you swap, because staging’s identity was never granted vault access. Grant both, or use a shared user-assigned identity.

Glossary

Regional VNet integration — injects the app’s outbound traffic into a delegated subnet (Microsoft.Web/serverFarms) in your VNet.
WEBSITE_VNET_ROUTE_ALL — app setting that forces all egress (including internet-bound) through the VNet to follow its route table.
Private endpoint — a private NIC in your VNet that projects the app inbound; paired with publicNetworkAccess=Disabled it removes public reachability.
publicNetworkAccess — site property toggling public reachability; Disabled requires a private route to reach the app.
Private DNS zone — e.g. privatelink.azurewebsites.net; makes a hostname resolve to the private endpoint’s IP inside the VNet.
Managed identity — an Azure-issued identity for the app (system- or user-assigned) used to resolve Key Vault references and authenticate passwordless.
Key Vault reference — @Microsoft.KeyVault(SecretUri=...) app setting resolved at runtime by the managed identity, so no plaintext secret lives in config.
Passwordless auth — using the managed identity directly against Entra-aware backends (SQL, Storage, Service Bus) so no credential exists to store or rotate.
Deployment slot — a full, addressable copy of the app on the same plan, used as a staging/swap target.
Slot setting (sticky) — an app setting/connection string pinned to its slot that does not travel during a swap.
Swap — redirects production traffic to the warmed staging instances with no cold start; the reverse swap is the rollback.
Swap-with-preview — a two-phase swap: preview applies target config without moving traffic; swap completes it; reset aborts.
Warm-up ping — WEBSITE_SWAP_WARMUP_PING_PATH/...STATUSES; gates a swap on a readiness path returning an accepted status on every instance.
Health Check — a per-instance liveness probe (healthCheckPath) that evicts and recycles instances that fail it.
NAT gateway — provides a stable, allowlistable outbound IP and a large SNAT port pool for the integration subnet.
SCM/Kudu — the deployment/management site at <app>.scm.azurewebsites.net, served through the same private endpoint as the app; goes private when public access is disabled.

Next steps

Put the private origin behind a WAF front door with Application Gateway with WAF & end-to-end TLS.
Mature the release into traffic-split canaries and blue-green with Blue-green deployments with App Service slots & Traffic Manager.
Make secret rotation event-driven, not a redeploy, via Azure Key Vault: secret rotation with managed identity.
Diagnose any release that still goes wrong with the Troubleshooting App Service: 502/503, cold starts & restart loops playbook.
Solve the SNAT-exhaustion class of egress failures with Azure NAT Gateway: deterministic egress & SNAT exhaustion.