The Linux Consumption plan gave you scale-to-zero and execution billing, but you paid for it with no VNet integration, opaque scaling, and cold starts you could only pray about. Flex Consumption is Microsoft’s answer: the same serverless billing model, but now with true virtual network integration, selectable instance memory, deterministic per-function concurrency, and always-ready instances to kill cold starts on the functions that matter. It is the plan you reach for when a function has to live on a private network, take a burst without a cold-start cliff, and never leak a storage key — all while still scaling to zero when idle.
This is how to provision it correctly, tune it, and prove the scale controller behaves under load. We treat Flex not as a checkbox but as a system with five tunable surfaces — plan choice, instance memory, concurrency, always-ready capacity, and the private network path — each of which has a default that is wrong for production and a failure mode that bites under load. You will learn every setting end to end: what it is, the values it accepts, the default, when to change it, the trade-off, and the limit or gotcha that turns it into a 2am incident. Because this is a reference you will return to mid-incident, every deep section anchors to a table you can scan, and the operational failure modes are laid out as a symptom→cause→confirm→fix playbook.
By the end you will stop guessing about serverless scale. When a burst lands you will know whether you were capped (concurrency × max-instances ceiling) or cold (burst outran the warm pool), whether your private dependency is actually private (or silently resolving a public IP because a DNS zone link is missing), and whether your bill is execution-only or quietly paying an always-ready baseline you forgot you reserved. Knowing which within ninety seconds is what separates a tuned serverless platform from one that pages you every flash sale.
What problem this solves
Serverless on Azure used to force a brutal trade. Linux Consumption billed only for active execution and scaled to zero — perfect economics — but it had no VNet integration, so a function could not reach a private database, an on-prem service over ExpressRoute, or a Key Vault behind a private endpoint. Its scaling was a black box you could not tune, and its cold starts were unbounded and unmitigated. The moment your function needed a private network or a latency SLA, you were pushed onto the Premium (Elastic Premium / EP) plan, which fixed networking and cold starts by keeping instances always on — and billing you for every reserved instance whether it ran code or not, with no scale-to-zero.
Flex Consumption dissolves that trade. It keeps scale-to-zero and execution billing like Consumption, but adds VNet integration, always-ready instances (warm capacity you reserve only where you need it), selectable instance memory, and explicit per-instance concurrency — the Premium capabilities, available à la carte on a consumption-billed plan. You pay the Premium-style baseline only on the slice of capacity you explicitly reserve, and nothing for the rest when it is idle.
What breaks without it: teams either over-pay for idle Premium compute to get a private network they barely use, or they ship on Consumption and discover too late that a synchronous API cold-starts past an upstream timeout, the upstream retries, and the retries stampede a backend with a fixed connection pool. Who hits this: anyone running serverless that must (a) reach private resources, (b) hold a tail-latency SLA on a hot path, (c) cap fan-out against a fragile downstream, or (d) eliminate storage connection strings for compliance. Flex is the plan that lets you do all four without abandoning serverless economics. To frame the whole surface before the deep dive, here is every tunable, the production-wrong default, and the failure it prevents:
| Tunable surface | Default (often wrong for prod) | What you set it to | Failure it prevents |
|---|---|---|---|
| Plan choice | Consumption (no VNet, opaque scale) | Flex Consumption | Cannot reach private deps; un-tunable cold starts |
| Instance memory | 2048 MB | 512 / 2048 / 4096 by workload | Over-paying cores, or OOM on heavy payloads |
| HTTP concurrency | Memory-derived (implicit) | Explicit `perInstanceConcurrency` | Silent scale-math drift; runaway fan-out |
| Max instance count | High (scales toward 1,000) | Capped to your downstream’s limit | DDoS-ing your own database under burst |
| Always-ready | 0 (everything cold) | Sized to the burst leading edge | Cold-start latency on the hot path |
| VNet + Private DNS | Public outbound, no zone links | Delegated subnet + linked zones | Traffic bypassing private endpoints |
| Storage auth | Connection string in `AzureWebJobsStorage` | Identity-based connection (UAMI) | A storage key sitting in app settings |
Learning objectives
By the end of this article you can:
- Choose Flex over Consumption or Premium with a concrete decision table — by VNet need, cold-start SLA, scale ceiling, and billing shape — and explain exactly what each plan bills.
- Provision a Flex app with a subnet delegated to `Microsoft.App/environments`, the right `Microsoft.App` RP registration, and VNet integration, in both `az` CLI and Bicep.
- Size instance memory (512 / 2048 / 4096 MB) and cap maximum instance count against the regional 250-core quota, computing cores as `instances × cores-per-instance`.
- Tune per-instance concurrency for HTTP (the `perInstanceConcurrency` flag) and non-HTTP triggers (target-based scaling in `host.json`), and use it as a backpressure mechanism against a fragile downstream.
- Eliminate cold starts on latency-critical groups with always-ready instances, and reason about their billing baseline and the zone-redundant minimum of 2.
- Deploy with one-deploy and replace the `AzureWebJobsStorage` connection string with an identity-based connection, plus lock down dependencies behind private endpoints and linked Private DNS zones.
- Diagnose HTTP 429s as either an instance/quota cap or a cold-start cascade, using `InstanceCount`, execution-unit metrics, and a Kusto query that correlates throttle rate against live instance count.
Prerequisites & where this fits
You should already understand Azure Functions basics: a function app is a deployment and scale unit, triggers (HTTP, Service Bus, Event Hubs, Timer) start executions, and bindings wire inputs/outputs. You should be comfortable with az in Cloud Shell, reading JSON output, and the idea of a managed identity (system- or user-assigned) granting an Azure resource access without secrets. Familiarity with VNet, subnets, private endpoints, and Private DNS zones helps, because half of Flex’s value is the private network path.
This sits in the Serverless track and assumes the trigger/binding fundamentals from Azure Functions: Serverless Patterns, Triggers & Bindings. It is the scaling-and-networking layer beneath orchestration — pair it with Durable Functions: Orchestration Patterns & Fan-Out/Fan-In when your workload needs stateful coordination, since the Durable trigger is one of Flex’s scale groups. The plan-choice question is upstream: Containers vs Serverless vs VMs: Choosing a Compute Model frames when serverless wins at all. For the private path, the dependency-side mechanics live in Azure Private Endpoints & Private DNS at Scale, and the egress-exhaustion story it shares with App Service is in Azure NAT Gateway: Deterministic Egress & SNAT Exhaustion.
A quick map of who owns what during a Flex incident, so you call the right person fast:
| Layer | What lives here | Who usually owns it | Failure classes it can cause |
|---|---|---|---|
| Trigger source | HTTP edge, Service Bus, Event Hubs | App / integration team | Burst shape that triggers cold starts; poison messages |
| Flex plan / scale controller | Concurrency, max-instances, always-ready | App + platform | 429 throttling (capped) or cold-start cascades |
| Regional core quota | 250-core (512,000 MB) budget | Subscription owner | Scale stalls below configured max |
| VNet integration subnet | Delegated `/26`, outbound route | Network team | No outbound; subnet too small to scale |
| Private DNS zones | `privatelink.*` resolution | Network / platform | App resolves public IP, bypasses PE |
| Backing storage / Key Vault | Package, host metadata, secrets | Platform + security | Boot failure on missing identity role |
| Managed identity (UAMI) | Data-plane roles, deploy auth | Security / platform | Host can’t read package → app won’t start |
Core concepts
Five mental models make every later tuning decision obvious.
Flex bills two pools, not one. Unlike Consumption (active execution only) or Premium (every reserved instance always), Flex splits capacity into on-demand instances that bill only while actively executing (a 1,000 ms minimum per execution, then rounded up to the nearest 100 ms) and always-ready instances that bill a baseline for provisioned memory continuously plus execution memory while running. You pay the Premium-style baseline only on the slice you explicitly reserve. The whole cost model collapses to: reserve the minimum warm capacity your latency SLA needs, let everything else scale to zero.
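The on-demand rounding rule can be sketched in a few lines of shell. This is a minimal illustration, assuming the 1,000 ms floor and the 100 ms rounding combine as "round up to the next 100 ms, then apply the floor"; the function name `billed_ms` is hypothetical and no prices are modeled.

```shell
# Sketch of on-demand execution billing time, in milliseconds.
# Assumption: round up to the next 100 ms boundary, then enforce the 1,000 ms minimum.
billed_ms() (
  actual_ms=$1
  # Integer round-up to a multiple of 100 ms.
  rounded=$(( (actual_ms + 99) / 100 * 100 ))
  # Every execution bills at least 1,000 ms.
  if [ "$rounded" -lt 1000 ]; then rounded=1000; fi
  echo "$rounded"
)

billed_ms 40     # prints 1000: a short call still pays the floor
billed_ms 1234   # prints 1300: rounded up to the next 100 ms
```

The practical reading is that very short executions are billed as if they ran for a full second, so batching tiny units of work into one execution can matter more than shaving milliseconds.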
The scale controller is deterministic — you give it the math. On Consumption the scale heuristics were opaque. On Flex, instances are added based on the concurrency you configure: for HTTP, the scale controller adds an instance when existing instances are saturated at their perInstanceConcurrency; for non-HTTP, target-based scaling computes a desired instance count from queue depth and the batch settings. Scaling is no longer a mystery — it is traffic ÷ concurrency, bounded by your max-instance-count and the regional quota.
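As a rough model of that math for HTTP, the sketch below divides in-flight requests by the configured concurrency and clamps the result to the maximum instance count. It is a simplification for reasoning, not the scale controller's actual algorithm, and it ignores the regional quota; `http_instances` is a hypothetical helper name.

```shell
# Simplified HTTP scale model: instances = ceil(in-flight / concurrency), capped at max.
# Assumption: this is an approximation for planning, not the platform's real logic.
http_instances() (
  in_flight=$1
  per_instance=$2
  max_instances=$3
  # Ceiling division with integers.
  needed=$(( (in_flight + per_instance - 1) / per_instance ))
  if [ "$needed" -gt "$max_instances" ]; then needed=$max_instances; fi
  echo "$needed"
)

http_instances 250 10 120    # prints 25
http_instances 5000 10 120   # prints 120: capped, the excess is throttled
http_instances 0 10 120      # prints 0: scaled to zero
```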
Concurrency is a backpressure valve, not just a perf knob. perInstanceConcurrency × maximum-instance-count is a hard ceiling on total in-flight executions. That product is the most important number on the plan: set it equal to (or below) your weakest downstream’s capacity — a database connection pool, a third-party rate cap — and overload becomes structurally impossible. The app throttles at the edge with 429s long before the downstream falls over. Size concurrency against your fragile dependency, not against incoming traffic.
The private path is two halves: route and resolve. VNet integration handles outbound routing — it puts the worker’s egress on a delegated subnet. But reaching a private endpoint also requires DNS resolution to the private IP, which only happens if the integration VNet is linked to the relevant Private DNS zones (privatelink.blob.core.windows.net, etc.). Get the route without the resolve and the app silently resolves the public IP, traffic skips the private endpoint, and your “private” architecture is a fiction. Both halves must be present.
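One quick way to check the "resolve" half is to look at the address a dependency's hostname resolves to and see whether it falls in a private range. The helper below is a small sketch that classifies an IPv4 address against the RFC 1918 ranges; it assumes your private endpoints live in RFC 1918 space, and `is_private_ipv4` is a hypothetical name.

```shell
# Returns success (0) when the IPv4 address is in an RFC 1918 private range.
# Assumption: private endpoints in this environment use RFC 1918 address space.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
    *) return 1 ;;
  esac
}

is_private_ipv4 10.40.1.5 && echo "private: resolves inside the VNet"
is_private_ipv4 20.60.1.1 || echo "public: check the Private DNS zone link"
```

To use it for real, feed it the address returned by a lookup (for example with `nslookup` or `getent hosts`) run from a host that uses the same DNS as the integration VNet; a lookup from your laptop will usually return the public address and prove nothing.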
Cold start is latency on the leading edge, not a constant. With always-ready capacity, steady-state traffic never cold-starts — the warm pool absorbs it. Cold start only appears when a burst outruns the warm pool, spilling onto on-demand instances that pay runtime boot, JIT, DI build, and connection-pool prime on their first request. The fix is never “warm everything” (that is just Premium); it is to size always-ready to the burst’s leading edge so the cold instances come up behind already-served traffic.
The vocabulary in one table
Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:
| Concept | One-line definition | Where it lives | Why it matters |
|---|---|---|---|
| Flex Consumption | Serverless plan with VNet, warm pool, selectable memory | The plan SKU | The whole subject; scale-to-zero + private + tunable |
| On-demand instance | Bills only while executing (1s min, 100ms round) | The plan | The scale-to-zero economics |
| Always-ready instance | Warm capacity, baseline-billed | Per scale group | Kills cold start on the hot path |
| Scale group | Functions that scale together (`http`/`blob`/`durable`) | Runtime | Concurrency/always-ready apply per group |
| `perInstanceConcurrency` | HTTP executions per instance before scale-out | Scale config | The HTTP scale denominator + backpressure |
| Target-based scaling | Non-HTTP desired-instance from queue depth | `host.json` | How Service Bus/queues/hubs scale |
| Instance memory | 512 / 2048 / 4096 MB per worker | Scale config | Drives vCPU, bandwidth, and core cost |
| Maximum instance count | Horizontal ceiling (40–1,000) | Scale config | The other half of the backpressure cap |
| Regional core quota | 250 cores (512,000 MB) per sub+region default | Subscription | Real scale ceiling under the configured max |
| Delegated subnet | `/26`+ delegated to `Microsoft.App/environments` | VNet | Required for VNet integration |
| Private DNS link | VNet linked to `privatelink.*` zones | Private DNS | Makes the private endpoint actually private |
| Identity-based connection | `AzureWebJobsStorage__accountName` + MI | App settings | Removes the storage key entirely |
Flex vs Consumption vs Premium: the scaling and billing model
Pick the wrong plan and you either overpay for idle compute (Premium) or hit a wall you cannot tune around (Consumption). Here is the decision matrix that actually matters:
| Concern | Consumption | Premium (EP) | Flex Consumption |
|---|---|---|---|
| Scale to zero | Yes | No (min 1) | Yes |
| Max scale-out instances | 200 | 100 | 1,000 |
| VNet integration | No | Yes | Yes (subnet delegation) |
| Cold-start mitigation | None | Pre-warmed instances | Always-ready instances |
| Instance memory | Fixed | Fixed per SKU | Selectable: 512 / 2048 / 4096 MB |
| Concurrency control | Implicit | Implicit | Explicit per-instance |
| Billing | Execution only | Per-instance (always on) | Execution + always-ready baseline |
| OS | Linux/Windows | Linux/Windows | Linux only |
| In-place migration | — | — | No (create new app, redeploy) |
The billing distinction is the crux. Consumption bills only GB-seconds of active execution. Premium bills the full lifetime of every reserved instance whether it runs code or not. Flex Consumption splits the difference: on-demand instances bill only while actively executing (1,000 ms minimum, then rounded up to the nearest 100 ms), while any always-ready instances you configure bill a baseline for provisioned memory whether or not they execute. You only pay the Premium-style baseline on the slice of capacity you explicitly reserve. Read the three billing shapes side by side:
| Billing dimension | Consumption | Premium (EP) | Flex on-demand | Flex always-ready |
|---|---|---|---|---|
| Charged when idle | No | Yes (full instance) | No | Yes (memory baseline) |
| Execution rounding | GB-s of active time | n/a (always on) | 1,000 ms min, then 100 ms | Baseline + execution memory |
| Free grant | Monthly GB-s + executions | None | None on Flex | None on Flex |
| Scales to zero | Yes | No | Yes | No (it’s the warm floor) |
| Drives the bill | Active GB-seconds | Reserved instances | Active GB-seconds | Reserved memory × time |
And the decision rule as a table — match your hard constraint to the plan:
| If your hard constraint is… | Then choose… | Because |
|---|---|---|
| “Must reach a private DB / on-prem / private endpoint” | Flex (or Premium) | Consumption has no VNet integration |
| “Tail-latency SLA on a synchronous hot path” | Flex with always-ready | Warm pool defeats cold start, scoped to the hot group |
| “Spiky, scale-to-zero, no private network, cost-first” | Consumption | Cheapest when idle dominates; no warm baseline |
| “Always-on, predictable load, want max throughput” | Premium | Reserved instances + pre-warmed for steady high load |
| “Cap fan-out against a fragile downstream” | Flex | Explicit concurrency × max-instances is a hard valve |
| “Need a hard, low instance ceiling (e.g. max 5)” | Consumption / Premium | Flex floor for max-instance-count is 40 |
| “Windows runtime required” | Consumption / Premium | Flex is Linux only |
| “C# in-process model, can’t re-target” | Consumption / Premium | Flex requires the isolated worker |
| “Remove every storage key from config” | Flex | Identity-based AzureWebJobsStorage connection |
The C# in-process model is not supported on Flex Consumption: you must be on the isolated worker model (.NET 8 / 9 / 10). There is also no in-place migration in or out: moving to Flex means creating a new app and redeploying. The supported isolated stacks are .NET isolated, Node.js, Python, Java, and PowerShell; check `az functionapp list-flexconsumption-locations` and the runtime support matrix before you commit.
The supported runtime stacks on Flex, with the --runtime token and the model constraint:
| Runtime | `--runtime` token | Model | Supported on Flex | Note |
|---|---|---|---|---|
| .NET (isolated) | `dotnet-isolated` | Isolated worker | Yes (8/9/10) | In-process `dotnet` is not supported |
| Node.js | `node` | n/a | Yes | LTS versions offered per region |
| Python | `python` | n/a | Yes | Use `--build-remote true` for native wheels |
| Java | `java` | n/a | Yes | Check version availability per region |
| PowerShell | `powershell` | n/a | Yes | For automation/ops workloads |
| .NET (in-process) | `dotnet` | In-process | No | Re-target the isolated worker |
| Windows-only stacks | — | — | No | Flex is Linux only |
Provision a Flex app with subnet delegation
VNet integration on Flex requires a subnet delegated to Microsoft.App/environments, at least /27 in size (use /26 to leave scaling headroom), and the Microsoft.App resource provider registered on the subscription. The portal and CLI enforce the RP registration at create time. Here is the full set of provisioning prerequisites — miss any one and create fails or the app can’t scale:
| Prerequisite | Exact value / command | Why it’s required | Gotcha if wrong |
|---|---|---|---|
| Resource provider | `az provider register --namespace Microsoft.App` | Backs subnet delegation | “RP not registered” at create |
| Region supports Flex | `az functionapp list-flexconsumption-locations` | Flex isn’t in every region | Silent fallback / create error |
| Delegated subnet | `--delegations Microsoft.App/environments` | Flex joins via App environment | `Microsoft.Web/...` delegation fails |
| Subnet size | `/27` minimum, `/26` recommended | Each instance consumes an IP | Subnet exhaustion caps scale-out |
| Subnet is empty/dedicated | One subnet per Flex app | Delegation is exclusive | Sharing it breaks integration |
| Backing storage | `Standard_LRS`/`Standard_ZRS`, `TLS1_2`, no public blob | Host metadata + package | Public blob access fails policy |
| Runtime is isolated | `dotnet-isolated`, `node`, `python`, `java`, `powershell` | No in-process C# | In-process never starts |
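Because each instance consumes an address from the integration subnet, the prefix length puts its own ceiling on scale-out. The sketch below computes usable addresses for a prefix, using the fact that Azure reserves five addresses in every subnet; treating "one IP per instance" as the whole story follows the table above and is a simplifying assumption, and `usable_ips` is a hypothetical helper.

```shell
# Usable addresses in an Azure subnet: 2^(32 - prefix) minus the 5 Azure reserves.
# Assumption: one address per instance, as stated in the prerequisites table.
usable_ips() (
  prefix=$1
  total=$(( 1 << (32 - prefix) ))
  echo $(( total - 5 ))
)

usable_ips 27   # prints 27
usable_ips 26   # prints 59
```

A `/27` leaves fewer usable addresses than the lowest allowed maximum instance count of 40, which is one concrete reason to start at `/26` or larger.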
RG=rg-fnflex-prod
LOC=eastus
VNET=vnet-app
SUBNET=snet-func-flex
STORAGE=stfnflexprod$RANDOM
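# 0. Resource group for everything below (skip if it already exists)
az group create -n $RG -l $LOC
# 1. Register the provider that backs subnet delegation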
az provider register --namespace Microsoft.App --wait
# 2. Network + a dedicated, delegated subnet (/26 leaves headroom)
az network vnet create -g $RG -n $VNET --address-prefixes 10.40.0.0/16 \
--subnet-name $SUBNET --subnet-prefixes 10.40.1.0/26
az network vnet subnet update -g $RG --vnet-name $VNET -n $SUBNET \
--delegations Microsoft.App/environments
# 3. Backing storage account (host metadata + deployment container)
az storage account create -g $RG -n $STORAGE -l $LOC --sku Standard_LRS \
--allow-blob-public-access false --min-tls-version TLS1_2
The --delegations value is exact — Microsoft.App/environments, not Microsoft.Web/.... This trips up everyone coming from App Service VNet integration. With the subnet ready, create the app and join it to the VNet in one shot:
SUBNET_ID=$(az network vnet subnet show -g $RG --vnet-name $VNET -n $SUBNET --query id -o tsv)
az functionapp create \
--resource-group $RG \
--name fn-orders-prod \
--storage-account $STORAGE \
--flexconsumption-location $LOC \
--runtime dotnet-isolated --runtime-version 8.0 \
--vnet "$VNET" --subnet "$SUBNET"
--flexconsumption-location (not --consumption-plan-location) is what selects the Flex plan. Confirm the region supports it first with az functionapp list-flexconsumption-locations -o table — Flex is not in every region. To attach a VNet to an existing Flex app instead, use az functionapp vnet-integration add -g $RG -n fn-orders-prod --vnet "$VNET" --subnet "$SUBNET". The equivalent in Bicep, which is how you should actually ship this:
// Fragment: assumes `location`, `subnet`, and `storage` are declared elsewhere in the template.
resource plan 'Microsoft.Web/serverfarms@2023-12-01' = {
  name: 'plan-fn-orders'
  location: location
  sku: { tier: 'FlexConsumption', name: 'FC1' }
  kind: 'functionapp,linux'
  properties: { reserved: true } // Linux plan
}

resource site 'Microsoft.Web/sites@2023-12-01' = {
  name: 'fn-orders-prod'
  location: location
  kind: 'functionapp,linux'
  // Needed for the SystemAssignedIdentity deployment auth below;
  // the identity also needs a blob data role on the storage account.
  identity: { type: 'SystemAssigned' }
  properties: {
    serverFarmId: plan.id
    virtualNetworkSubnetId: subnet.id // the delegated /26
    functionAppConfig: {
      runtime: { name: 'dotnet-isolated', version: '8.0' }
      scaleAndConcurrency: {
        instanceMemoryMB: 2048
        maximumInstanceCount: 120
      }
      deployment: {
        storage: {
          type: 'blobContainer'
          value: '${storage.properties.primaryEndpoints.blob}app-package'
          authentication: { type: 'SystemAssignedIdentity' }
        }
      }
    }
  }
}
The key reference table for create-time arguments — the flags people most often get wrong:
| Flag (`az functionapp create`) | Accepts | Selects / sets | Common mistake |
|---|---|---|---|
| `--flexconsumption-location` | a Flex region | The Flex plan | Using `--consumption-plan-location` (picks Consumption) |
| `--runtime` | `dotnet-isolated`/`node`/`python`/`java`/`powershell` | Worker stack | `dotnet` (in-process), which is unsupported |
| `--runtime-version` | e.g. `8.0`, `20`, `3.11` | Stack version | Version not offered in the region |
| `--instance-memory` | `512` / `2048` / `4096` | Per-instance memory (MB) | Arbitrary value rejected |
| `--maximum-instance-count` | `40`–`1000` | Horizontal ceiling | `5` (below the 40 floor) |
| `--vnet` / `--subnet` | name or ID | VNet integration target | Subnet not delegated to `Microsoft.App` |
| `--deployment-storage-auth-type` | `…ConnectionString`/`UserAssignedIdentity`/`SystemAssignedIdentity` | Package auth | MI lacks Blob Data role |
Configure instance memory and maximum instance count
Two knobs govern how big each worker is and how far the app can spread. Memory comes in three sizes; CPU and network bandwidth scale proportionally with it:
| Instance memory (MB) | vCPU cores | Network bandwidth | Use for | Cost note |
|---|---|---|---|---|
| 512 | 0.25 | Lowest | High fan-out, light per-request work | Cheapest cores; fits more in the quota |
| 2048 | 1 | Medium | Default for most workloads | The balanced default |
| 4096 | 2 | Highest | CPU/memory-heavy work, large payloads, ML inference | 2 cores each → halves quota headroom |
Every instance also gets an extra ~272 MB platform buffer that you are not billed for. Set memory at create time with --instance-memory, or change it later:
# Larger instances for a CPU-bound transform app
az functionapp scale config set -g $RG -n fn-orders-prod --instance-memory 4096
# Cap horizontal scale (40 is the lowest allowed max; 1000 the ceiling)
az functionapp scale config set -g $RG -n fn-orders-prod --maximum-instance-count 120
--maximum-instance-count accepts 40 to 1,000. The floor of 40 surprises people — you cannot pin a Flex app to “max 5 instances.” If you need a hard, low ceiling, Flex is the wrong plan. The two scale-config knobs and their boundaries:
| Setting | Values | Default | When to change | Trade-off | Limit / gotcha |
|---|---|---|---|---|---|
| `instanceMemoryMB` | 512 / 2048 / 4096 | 2048 | CPU-bound → up; high fan-out → down | More memory = more cores = more quota burn | Only three discrete values |
| `maximumInstanceCount` | 40–1000 | high (~100s) | Cap against downstream limits | Lower cap = earlier 429s under burst | Floor is 40; no low ceiling |
| `alwaysReady[group]` | 0–N per group | 0 | Latency-critical groups | Warm baseline billing | Min 2 if zone-redundant |
| `perInstanceConcurrency` | 1–N (HTTP) | memory-derived | Pin explicitly in prod | Higher = fewer instances but more thrash risk | HTTP-only flag |
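A small pre-flight check can catch the two most common rejected values before a deployment pipeline ever calls the CLI. The sketch below encodes only the boundaries stated in this section (three memory sizes, maximum instance count between 40 and 1,000); `validate_scale_config` is a hypothetical helper, not part of `az`.

```shell
# Pre-flight validation of the two scale-config knobs, using the limits from this section.
# Hypothetical helper: it does not call Azure, it only checks the numbers.
validate_scale_config() (
  memory_mb=$1
  max_instances=$2
  case "$memory_mb" in
    512|2048|4096) ;;
    *) echo "instance memory must be 512, 2048, or 4096 MB" >&2; exit 1 ;;
  esac
  if [ "$max_instances" -lt 40 ] || [ "$max_instances" -gt 1000 ]; then
    echo "maximum instance count must be between 40 and 1000" >&2
    exit 1
  fi
)

validate_scale_config 2048 120 && echo "ok"
validate_scale_config 2048 5 || echo "rejected: below the floor of 40"
```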
Mind the regional subscription quota: every Flex app in a subscription+region shares a default budget of 250 cores (512,000 MB). Cores are instances × cores-per-instance, so a single 4096-MB app maxes out the default quota at 125 instances (125 × 2). Always-ready instances count against it; scaled-to-zero apps do not. Request an increase via support before you plan for thousands of large instances. The quota math worked through, so you can see the ceiling your --maximum-instance-count actually hits:
| Instance memory | Cores / instance | Max instances at 250-core quota | Effective ceiling vs your --maximum-instance-count |
|---|---|---|---|
| 512 MB | 0.25 | 1,000 | Quota never binds before the 1,000 hard cap |
| 2048 MB | 1 | 250 | Quota binds if you set max-count > 250 |
| 4096 MB | 2 | 125 | Quota binds if you set max-count > 125 |
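The same quota arithmetic can be done in integers by working in megabytes instead of fractional cores. The sketch below assumes the default 512,000 MB budget and a single Flex app in the region (other apps and always-ready instances would reduce what is left); `effective_ceiling` is a hypothetical name.

```shell
# Effective instance ceiling = min(configured max, quota-derived max).
# Assumptions: default 512,000 MB regional quota, and no other Flex apps sharing it.
QUOTA_MB=512000

effective_ceiling() (
  memory_mb=$1
  configured_max=$2
  quota_max=$(( QUOTA_MB / memory_mb ))
  if [ "$configured_max" -lt "$quota_max" ]; then
    echo "$configured_max"
  else
    echo "$quota_max"
  fi
)

effective_ceiling 4096 300    # prints 125: the quota binds first
effective_ceiling 2048 120    # prints 120: the configured max binds first
effective_ceiling 512 1000    # prints 1000
```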
The limits and quotas you will actually hit on Flex, with the real numbers:
| Limit / quota | Value | Scope | What hitting it looks like | How to raise |
|---|---|---|---|---|
| Regional core quota | 250 cores / 512,000 MB (default) | Subscription + region | Scale stalls below `--maximum-instance-count` | Support request |
| Max instance count | 1,000 | Per app | Hard horizontal ceiling | Cannot exceed |
| Lowest allowed max instance count | 40 | Per app | Can’t pin a low ceiling | Use a different plan |
| Instance memory choices | 512 / 2048 / 4096 MB | Per app | Other values rejected | Fixed set |
| Subnet size | `/27` min (`/26` recommended) | Integration subnet | IP exhaustion caps scale-out | Larger subnet at create |
| Always-ready min (zone-redundant) | 2 per group | Per scale group | Single warm instance rejected | n/a — by design |
| Always-ready min (non-zonal) | 1 per group | Per scale group | — | Raise to 2 when enabling AZ |
| Platform memory buffer | ~272 MB / instance | Per instance | Extra unbilled headroom | Not part of your memory size |
| Execution billing minimum | 1,000 ms, then 100 ms rounding | Per execution | Short calls cost a 1s floor | n/a |
| Deployment package source | one blob container | Per app | — | One-deploy pulls from it on start |
Per-instance concurrency: HTTP and non-HTTP triggers
This is the single most impactful tuning lever on Flex. Concurrency is how many parallel executions each instance handles. Set it too high and instances thrash under memory pressure; set it too low and you scale out (and bill) more instances than you need.
Flex groups functions into scale groups that scale together: all HTTP/SignalR triggers (http), Event Grid blob triggers (blob), and Durable orchestration/activity/entity triggers (durable). Everything else scales individually as function:<NAME>. Know which group a trigger lands in, because concurrency and always-ready apply per group:
| Trigger | Scale group | Concurrency mechanism | Notes |
|---|---|---|---|
| HTTP / SignalR | `http` | `perInstanceConcurrency` flag | The only type valid for that flag |
| Event Grid blob | `blob` | Target-based (`host.json`) | Event-Grid-sourced blob events |
| Durable orchestrator / activity / entity | `durable` | Target-based + Durable settings | One group for all Durable functions |
| Service Bus queue / topic | `function:<NAME>` | Target-based (`serviceBus` in `host.json`) | Scales individually |
| Event Hubs | `function:<NAME>` | Target-based (partition-bound) | Bounded by partition count |
| Storage Queue | `function:<NAME>` | Target-based (`queues` in `host.json`) | `batchSize` + `newBatchThreshold` |
| Timer | `function:<NAME>` | n/a (single execution) | One instance per fire |
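The grouping rule in the table can be written as a lookup, which is handy when scripting always-ready settings for many functions. The trigger labels used as input here (`http`, `signalr`, `eventgrid-blob`, `durable-*`) are hypothetical shorthand chosen for this sketch, not official identifiers; only the output group names follow the table.

```shell
# Map a trigger kind and function name to the scale group it lands in.
# Input labels are illustrative shorthand; output names follow the table above.
scale_group() (
  trigger=$1
  function_name=$2
  case "$trigger" in
    http|signalr) echo "http" ;;
    eventgrid-blob) echo "blob" ;;
    durable-orchestrator|durable-activity|durable-entity) echo "durable" ;;
    *) echo "function:$function_name" ;;
  esac
)

scale_group http GetOrder                 # prints http
scale_group durable-activity ChargeCard   # prints durable
scale_group servicebus ProcessPayment     # prints function:ProcessPayment
```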
HTTP concurrency is set explicitly and, once set, is honored regardless of instance memory size:
# Each instance handles up to 10 concurrent HTTP executions before
# the scale controller adds another instance.
az functionapp scale config set -g $RG -n fn-orders-prod \
--trigger-type http --trigger-settings perInstanceConcurrency=10
http is the only trigger type valid for perInstanceConcurrency. The default HTTP concurrency is derived from instance memory when you do not set it — bigger instances default higher. Pin it explicitly in production so a later memory change doesn’t silently shift your scale math. How the choice plays out:
| `perInstanceConcurrency` | Effect on scale-out | Effect per instance | Pick when |
|---|---|---|---|
| Low (e.g. 1–4) | Scales out aggressively (more instances) | Light load per worker, low thrash | Heavy per-request CPU/memory; isolation matters |
| Medium (e.g. 8–16) | Balanced | Good utilization | Typical I/O-bound APIs |
| High (e.g. 24–64) | Scales out reluctantly (fewer instances) | Dense, risk of memory pressure | Light, async, high-fan-out handlers with memory headroom |
| Unset (memory-derived) | Drifts when you change memory | Implicit | Never, in production — pin it |
For non-HTTP triggers (Service Bus, Event Hubs, Storage Queue), concurrency is governed by target-based scaling through host.json, not the CLI flag above. You tune the batch/concurrency knobs of the binding and the runtime computes a target instance count from queue depth:
{
  "version": "2.0",
  "extensions": {
    "serviceBus": {
      "maxConcurrentCalls": 16,
      "maxConcurrentSessions": 8,
      "prefetchCount": 32
    },
    "queues": {
      "batchSize": 16,
      "newBatchThreshold": 8
    }
  }
}
For a queue trigger, target-based scaling computes desired instances as roughly messages ÷ (batchSize + newBatchThreshold). Lowering batchSize makes the app scale out more aggressively per message backlog; raising it packs more work onto each instance. Tune this against downstream throughput limits (database connection pools, third-party API rate caps) — uncontrolled fan-out is how you DDoS your own backend. The host.json concurrency knobs that matter for target-based scaling:
| `host.json` setting | Binding | Default | Raise it to… | Lower it to… | Gotcha |
|---|---|---|---|---|---|
| `serviceBus.maxConcurrentCalls` | Service Bus (no sessions) | 16 | Pack more per instance | Throttle downstream | Per-instance, not global |
| `serviceBus.maxConcurrentSessions` | Service Bus (sessions) | 8 | Handle more sessions/instance | Preserve ordering pressure | Session-bound |
| `serviceBus.prefetchCount` | Service Bus | 0 | Cut receive latency | Reduce lock churn | Prefetch holds locks |
| `queues.batchSize` | Storage Queue | 16 | Fewer, denser instances | Scale out per backlog | Max 32; with threshold drives target |
| `queues.newBatchThreshold` | Storage Queue | batchSize/2 | Fetch next batch sooner | n/a | Adds to scale denominator |
| `eventHubs.maxEventBatchSize` | Event Hubs | varies | Bigger batches | Lower memory/latency | Scale also bounded by partitions |
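To see how the Storage Queue knobs move the target, the approximate formula from this section (backlog divided by `batchSize + newBatchThreshold`) can be evaluated directly. This is a rough planning sketch under that stated approximation, clamped to the maximum instance count; `queue_target_instances` is a hypothetical helper and the backlog figures are made up.

```shell
# Approximate target instances for a Storage Queue trigger.
# Assumption: target ~= ceil(backlog / (batchSize + newBatchThreshold)), capped at max.
queue_target_instances() (
  backlog=$1
  batch_size=$2
  new_batch_threshold=$3
  max_instances=$4
  per_instance=$(( batch_size + new_batch_threshold ))
  target=$(( (backlog + per_instance - 1) / per_instance ))
  if [ "$target" -gt "$max_instances" ]; then target=$max_instances; fi
  echo "$target"
)

queue_target_instances 1200 16 8 120   # prints 50
queue_target_instances 1200 8 4 120    # prints 100: halving the batch doubles scale-out
```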
The backpressure ceiling is the product of the two halves — size it against the weakest downstream:
| Downstream constraint | Concurrency | Max instances | In-flight ceiling | Safe vs the constraint? |
|---|---|---|---|---|
| DB pool = 200 connections | 4 | 48 | 4 × 48 = 192 | Yes: 192 < 200 |
| DB pool = 200 connections | 10 | 40 | 10 × 40 = 400 | No: pool exhausts |
| Partner API = 100 req/s cap | 2 | 50 | 2 × 50 = 100 | At the edge: add margin |
| No hard downstream limit | 16 | 120 | 16 × 120 = 1,920 | Bounded only by quota/latency |
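Working the ceiling backwards is often more useful: given the downstream limit and the maximum instance count you intend to set, what is the highest concurrency that stays safe? The sketch below does both directions with integer math; the helper names are hypothetical, and the inputs are illustrative values that respect the maximum-instance-count floor of 40.

```shell
# In-flight ceiling = perInstanceConcurrency x maximum instance count.
in_flight_ceiling() (
  concurrency=$1
  max_instances=$2
  echo $(( concurrency * max_instances ))
)

# Highest concurrency whose ceiling stays at or below the downstream limit.
max_safe_concurrency() (
  downstream_limit=$1
  max_instances=$2
  echo $(( downstream_limit / max_instances ))
)

in_flight_ceiling 4 48        # prints 192
max_safe_concurrency 200 40   # prints 5: 5 x 40 = 200 fills the pool exactly
```

Leaving headroom below the computed value is usually wise, since other clients of the same downstream also hold connections.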
Always-ready instances to kill cold starts
On-demand instances cold-start. For latency-critical paths — a synchronous checkout API, a webhook with a tight SLA — reserve always-ready instances that stay warm and take traffic first. The platform only spins up on-demand instances after the always-ready pool is saturated.
# Keep 3 warm instances for the HTTP group
az functionapp scale config always-ready set -g $RG -n fn-orders-prod \
--settings http=3
# Mix: warm Durable group + warm a single hot function
az functionapp scale config always-ready set -g $RG -n fn-orders-prod \
--settings durable=2 function:ProcessPayment=2
At create time the equivalent is --always-ready-instances http=3. Remove reservations with az functionapp scale config always-ready delete -g $RG -n fn-orders-prod --setting-names http function:ProcessPayment. What you can reserve, and the syntax for each:
| Always-ready target | Syntax | Covers | When to use |
|---|---|---|---|
| HTTP group | http=N | All HTTP/SignalR triggers | Synchronous APIs with a latency SLA |
| Durable group | durable=N | All Durable orchestrators/activities/entities | Orchestrations that must start instantly |
| Blob group | blob=N | Event Grid blob triggers | Latency-sensitive blob processing |
| Single function | function:<NAME>=N | One named function | One hot function inside a larger app |
| Remove a reservation | --setting-names <group> (delete verb) | Frees the warm pool | Scaling reservation down to zero |
Two things to internalize. First, billing: always-ready instances bill a baseline for provisioned memory continuously, plus execution memory while running, with no free grant — this is the Premium-style cost, scoped to only the instances you reserve. Reserve the minimum that holds your steady-state concurrency. Second, zone redundancy: if you enable availability zones, the minimum always-ready count per group is 2, not 1, so the warm pool survives a zone outage. How to size the warm pool against the burst you actually get:
| Scenario | Steady-state concurrent reqs | Concurrency | Always-ready to reserve | Cold start exposure |
|---|---|---|---|---|
| Flat low traffic, tight SLA | ~20 | 10 | http=2 | None at steady state |
| Predictable diurnal peak | ~120 at peak | 24 | http=5 (peak ÷ 24, rounded) | Only above peak |
| Spiky flash-sale burst | base 50, spike 1,800 | 24 | http=6 (covers the leading edge) | On-demand absorbs the tail |
| Zone-redundant, any load | any | any | min 2 per group | Survives one zone down |
| No latency SLA (async) | any | any | 0 (let it scale from zero) | Accepted; cheapest |
A worked sizing rule: always-ready instances ≈ ceil(steady-state concurrent requests ÷ perInstanceConcurrency). Reserve that, let on-demand take everything above it, and the warm pool pays the cold-start cost once at deploy — never on a user request.
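The sizing rule is an integer ceiling division, which plain shell arithmetic handles. This sketch assumes a hypothetical helper named `always_ready_count`; the optional third argument applies the zone-redundant minimum of 2 only when you are reserving at least one instance.

```shell
# Hypothetical helper: ceil(steady-state concurrent requests / perInstanceConcurrency).
# always_ready_count STEADY_STATE_CONCURRENT PER_INSTANCE_CONCURRENCY [ZONE_REDUNDANT]
always_ready_count() {
  steady=$1
  conc=$2
  zone_redundant=${3:-false}
  # Integer ceiling: (a + b - 1) / b
  n=$(( (steady + conc - 1) / conc ))
  # Zone-redundant apps need at least 2 per group once anything is reserved.
  if [ "$zone_redundant" = "true" ] && [ "$n" -eq 1 ]; then
    n=2
  fi
  echo "$n"
}

always_ready_count 20 10        # prints: 2
always_ready_count 120 24       # prints: 5
always_ready_count 50 24        # prints: 3
always_ready_count 10 24 true   # prints: 2
```

The result is the steady-state floor. The flash-sale row in the table reserves 6 rather than 3 because it deliberately adds headroom for the burst's leading edge; treat the formula as the minimum and add margin from observed spikes.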
Not every trigger needs a warm pool. Match cold-start sensitivity to trigger shape before you spend on always-ready:
| Trigger type | Cold-start sensitivity | Reserve always-ready? | Why |
|---|---|---|---|
| Synchronous HTTP with latency SLA | High | Yes (http=N) | A user/acquirer is blocked on the response |
| Durable orchestration (must start fast) | High | Yes (durable=N) | Start latency is visible to the caller |
| Webhook with a tight timeout | High | Yes (http=N) | Caller retries on slow start → amplification |
| Service Bus / queue (async backlog) | Low | Usually no | A few seconds of warm-up is invisible to a backlog |
| Event Hubs stream processing | Low–Medium | Rarely | Throughput matters more than first-call latency |
| Timer / scheduled batch | None | No | Nobody is waiting on the first execution |
Deploy with one-deploy and managed-identity storage
Flex has exactly one deployment path: build, zip, push the package to a blob container. The app pulls and runs from that package on startup. No WEBSITE_RUN_FROM_PACKAGE gymnastics — that behavior is built in.
# Build + zip your project, then one-deploy it
func azure functionapp publish fn-orders-prod
# or push a prebuilt package and run the build remotely on the platform:
az functionapp deployment source config-zip \
-g $RG -n fn-orders-prod --src ./app.zip --build-remote true
--build-remote true runs Oryx build (restore/compile) on the platform — use it for Python/Node where native wheels must match the Linux host. For precompiled .NET isolated output, ship the built artifact and skip remote build. The deployment options compared:
| Path | Command | Build runs | Use for | Watch-out |
|---|---|---|---|---|
| Core Tools publish | func azure functionapp publish | Local | Quick local→cloud loop | Local toolchain must match runtime |
| Zip + remote build | config-zip --build-remote true | Platform (Oryx) | Python/Node native deps | Slower first deploy |
| Zip + prebuilt | config-zip (no remote build) | Pre-done | Precompiled .NET isolated | Artifact must be Linux-correct |
| CI/CD (Bicep + zip) | pipeline pushes to container | Pipeline | Reproducible prod deploys | Identity needs Blob Data role |
The security upgrade is removing storage secrets entirely. By default the host talks to storage via a connection string in AzureWebJobsStorage. Replace it with an identity-based connection so no key ever lands in app settings:
# Assign a user-assigned identity and grant it data-plane access to storage
UAMI_ID=$(az identity show -g $RG -n id-fn-orders --query id -o tsv)
UAMI_CLIENT=$(az identity show -g $RG -n id-fn-orders --query clientId -o tsv)
STORAGE_ID=$(az storage account show -g $RG -n $STORAGE --query id -o tsv)
az functionapp identity assign -g $RG -n fn-orders-prod --identities "$UAMI_ID"
# Host needs Blob + Queue data roles plus Storage Account Contributor on the backing account
for ROLE in "Storage Blob Data Owner" "Storage Queue Data Contributor" "Storage Account Contributor"; do
az role assignment create --assignee "$UAMI_CLIENT" --role "$ROLE" --scope "$STORAGE_ID"
done
# Swap the connection string for an identity-based connection
az functionapp config appsettings set -g $RG -n fn-orders-prod --settings \
"AzureWebJobsStorage__accountName=$STORAGE" \
"AzureWebJobsStorage__credential=managedidentity" \
"AzureWebJobsStorage__clientId=$UAMI_CLIENT" && \
az functionapp config appsettings delete -g $RG -n fn-orders-prod \
--setting-names AzureWebJobsStorage
The __accountName syntax is specific to AzureWebJobsStorage. Omit __clientId and Flex falls back to the system-assigned identity (use az functionapp identity assign -g $RG -n fn-orders-prod with no --identities). The exact roles the host needs on the backing storage account, and what each one is for:
| Role | Scope | What the host uses it for | Omit it and… |
|---|---|---|---|
| Storage Blob Data Owner | Backing account | Host metadata, lease blobs, package read | Host can’t start / scale |
| Storage Queue Data Contributor | Backing account | Internal control queues | Queue-driven scale breaks |
| Storage Account Contributor | Backing account | Management-plane ops the host performs | Some host operations fail |
| Storage Blob Data Contributor | Deployment account/container | Read/write the deployment package | Package pull fails → app won’t run |
For the deployment container specifically, you can authenticate the same way at create time:
az functionapp create -g $RG -n fn-orders-prod --storage-account $STORAGE \
--runtime dotnet-isolated --runtime-version 8.0 --flexconsumption-location $LOC \
--deployment-storage-name $STORAGE \
--deployment-storage-container-name app-package \
--deployment-storage-auth-type UserAssignedIdentity \
--deployment-storage-auth-value "$UAMI_ID"
--deployment-storage-auth-type accepts StorageAccountConnectionString, UserAssignedIdentity, or SystemAssignedIdentity. The identity needs Storage Blob Data Contributor on the deployment account. The three identity-connection app-setting keys, decoded:
| App setting | Value | Meaning | Default if omitted |
|---|---|---|---|
| AzureWebJobsStorage__accountName | the storage account name | Target account (no key) | Falls back to connection string |
| AzureWebJobsStorage__credential | managedidentity | Use a managed identity | Connection-string mode |
| AzureWebJobsStorage__clientId | the UAMI client ID | Which user-assigned identity | System-assigned identity |
Private endpoints, Key Vault references, and outbound lockdown
VNet integration handles outbound traffic. To lock down inbound access to your dependencies, pair it with private endpoints and disable public network access on each backing resource.
# Private endpoint for the storage blob service
az network private-endpoint create -g $RG -n pe-st-blob \
--vnet-name $VNET --subnet snet-pe \
--private-connection-resource-id "$STORAGE_ID" \
--group-id blob --connection-name conn-st-blob
# Force all storage traffic through the private path
az storage account update -g $RG -n $STORAGE --public-network-access Disabled
For the function app to resolve <account>.blob.core.windows.net (which CNAMEs to <account>.privatelink.blob.core.windows.net) to the private IP, the integration subnet’s VNet must be linked to the relevant Private DNS zones. Without that link the app resolves the public IP and the private endpoint is bypassed. The zones you must link, per dependency:
| Dependency | Private DNS zone to link | Group ID (--group-id) | Symptom if zone is unlinked |
|---|---|---|---|
| Blob storage | privatelink.blob.core.windows.net | blob | App reads/writes over public IP |
| Queue storage | privatelink.queue.core.windows.net | queue | Control queues bypass PE |
| Table storage | privatelink.table.core.windows.net | table | Table ops bypass PE |
| File storage | privatelink.file.core.windows.net | file | File share bypasses PE |
| Key Vault | privatelink.vaultcore.azure.net | vault | Secret pull over public IP |
| Service Bus | privatelink.servicebus.windows.net | namespace | Messaging bypasses PE |
| SQL Database | privatelink.database.windows.net | sqlServer | DB traffic over public IP |
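Whether a zone link is actually working can be checked from a VM inside (or peered to) the VNet by resolving the dependency and classifying the answer. The sketch below is illustrative only: `is_private_ipv4` and `check_private_resolution` are hypothetical helper names, the check assumes your VNet uses RFC 1918 address space, and `getent ahostsv4` assumes a glibc-based Linux host.

```shell
# Hypothetical helper: succeed (exit 0) if the IPv4 address is in RFC 1918 space.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: resolve a hostname and report which path it takes.
# Assumes glibc getent; run it from inside the VNet, not from your laptop.
check_private_resolution() {
  ip=$(getent ahostsv4 "$1" | awk 'NR==1 {print $1}')
  if [ -z "$ip" ]; then
    echo "$1: did not resolve"
    return 2
  elif is_private_ipv4 "$ip"; then
    echo "$1 -> $ip (private path)"
  else
    echo "$1 -> $ip (PUBLIC: check the Private DNS zone link)"
    return 1
  fi
}

# Example, with a placeholder account name:
#   check_private_resolution mystorageaccount.blob.core.windows.net
```

Looping this over every dependency hostname after each deploy gives an early signal when a zone link goes missing.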
Pull secrets from Key Vault behind its own private endpoint via Key Vault references — the secret value is never stored in app settings:
az functionapp config appsettings set -g $RG -n fn-orders-prod --settings \
"DbConnection=@Microsoft.KeyVault(SecretUri=https://kv-orders.vault.azure.net/secrets/db-conn/)"
Grant the app’s managed identity Key Vault Secrets User on the vault. To force all outbound through the VNet (so it can traverse a firewall or NAT gateway and the resolver sees private records), set vnetRouteAllEnabled:
az resource update -g $RG --namespace Microsoft.Web --resource-type sites \
--name fn-orders-prod --set properties.vnetRouteAllEnabled=true
The outbound-networking settings and what each one controls:
| Setting / control | What it does | Default | Set it when |
|---|---|---|---|
| virtualNetworkSubnetId | Binds outbound to the delegated subnet | unset | Always, for VNet integration |
| vnetRouteAllEnabled | Routes all outbound through the VNet | false | Must traverse firewall/NAT or see private DNS |
| Private DNS zone link | Resolves privatelink.* to private IP | unlinked | Any private endpoint dependency |
| --public-network-access Disabled (on dep) | Blocks public inbound to the dependency | Enabled | Lock the dependency to the private path |
| Key Vault reference | @Microsoft.KeyVault(SecretUri=...) | none | Keep secrets out of app settings |
| NAT gateway on the subnet | Deterministic, large SNAT pool | none | Chatty egress to a single destination |
Architecture at a glance
The diagram traces a request through Flex the way it actually flows, then maps each scaling and private-path failure onto the exact hop where it bites. Read it left to right. A trigger — an HTTP client on 443 or a Service Bus / Event Hubs message — arrives and signals the scale controller. The controller routes to the always-ready pool first (warm, with a minimum of 2 per group when zone-redundant); only when that pool saturates does it spin up on-demand instances (512 / 2048 / 4096 MB), which cold-start on their first request. The perInstanceConcurrency knob is the denominator that decides how many instances the controller adds, and together with maximum-instance-count forms the backpressure ceiling. Outbound from the workers goes through the VNet integration zone — a delegated /26 subnet plus the Private DNS zones that resolve privatelink.* to private IPs — into the private dependencies: the AzureWebJobsStorage account and Key Vault behind private endpoints, authenticated by a user-assigned identity so no key is ever in app settings.
Five numbered badges mark where this breaks. (1) A burst outruns the warm pool and on-demand cold-starts on the hot path. (2) The concurrency × max-instances ceiling throttles with 429s before scaling further. (3) A missing Private DNS link makes the app resolve the public IP and silently bypass the private endpoint. (4) The identity loses its Storage data role and the host can’t even read the package to start. (5) The regional 250-core quota is exhausted and scale stalls below your configured max. The observability zone on the right — Application Insights for the 429/p95 KQL and the core-quota tool — is how you tell badge (1) apart from badge (2): throttle rate climbing while p95 stays flat means you were capped; throttle spiking alongside a p95 spike at the burst edge means you were cold. That single distinction is the whole diagnostic method for serverless scale.
Real-world scenario
Solvix Payments runs a synchronous card-authorization API. It had lived on the Linux Consumption plan until a Black Friday incident exposed two structural problems at once. First, cold starts pushed p99 past their acquirer’s 800 ms timeout; the acquirer retried, and the retries stampeded a backend whose PostgreSQL pool capped at 200 connections. Second, and the reason they could not just move to Premium and forget it, the auth function had to reach an on-prem fraud-scoring service over a private ExpressRoute path, which Consumption could not do at all because it has no VNet integration. The team is six engineers; the workload averages 50 concurrent authorizations with a flash-sale spike to ~1,800, and the hard rule from the DBA was simple: total in-flight executions must never exceed the 200-connection backend pool, regardless of incoming spike.
The first instinct on the bridge was to scale the Consumption plan “bigger” — but Consumption gives you no such knob, and even on Premium the cold-start-on-scale-out problem and the lack of a fan-out cap would have remained. They moved to Flex Consumption and solved all three constraints with three coordinated settings, deployed as Bicep and reviewed in a PR.
VNet integration over a delegated Microsoft.App/environments subnet gave them the private route to on-prem — the thing Consumption could never do. They reserved always-ready instances to absorb the burst’s leading edge so the acquirer never saw a cold start on the hot path. And crucially, they capped fan-out by pinning per-instance concurrency and max instances so total in-flight executions could never exceed the database pool:
# 6 warm instances x 24 concurrency = 144 steady-state in-flight,
# hard-capped at 8 instances so peak <= 192 < the 200-conn pool.
az functionapp scale config always-ready set -g rg-payments -n fn-auth \
--settings http=6
az functionapp scale config set -g rg-payments -n fn-auth \
--trigger-type http --trigger-settings perInstanceConcurrency=24
az functionapp scale config set -g rg-payments -n fn-auth \
--maximum-instance-count 8 --instance-memory 2048
The result: p99 dropped under 300 ms because the warm pool never cold-started on the hot path, and the explicit concurrency × max-instances ceiling made backend overload structurally impossible — the function throttled with 429s at the edge (which the acquirer handled gracefully) long before the database pool exhausted. The next flash sale ran at 1,900 rps with zero backend-pool incidents. The always-ready baseline added a predictable, small monthly cost (6 warm 2048-MB instances), which the team accepted as the price of the SLA — far cheaper than a fully always-on Premium plan sized for peak.
The migration as a before/after, because the shape of the fix is the lesson:
| Dimension | Before (Linux Consumption) | After (Flex Consumption) |
|---|---|---|
| Private path to on-prem | Impossible (no VNet) | VNet integration over ExpressRoute |
| Cold start on hot path | Unbounded, tripped 800 ms timeout | http=6 warm → p99 < 300 ms |
| Fan-out cap | None → stampeded 200-conn pool | 24 × 8 = 192 hard ceiling |
| Behavior at overload | Backend pool exhaustion | 429 at the edge, acquirer retries gracefully |
| Storage secret | Connection string in settings | Identity-based connection, no key |
| Billing shape | Execution only, but un-tunable | Execution + small warm baseline |
The lesson that generalizes: on Flex, concurrency and max-instance-count are not just performance knobs, they are a backpressure mechanism. Size them against your weakest downstream dependency, not against incoming traffic — and reserve always-ready only on the hot path, not everywhere.
Advantages and disadvantages
The “serverless-but-tunable” model both unlocks production serverless and introduces knobs that are wrong by default. Weigh it honestly:
| Advantages (why Flex helps you) | Disadvantages (why it bites) |
|---|---|
| Scale-to-zero economics plus VNet integration — no Premium tax for a private network | Linux only; no in-place migration in or out (create new app + redeploy) |
| Always-ready instances kill cold starts on exactly the groups that need it | Always-ready bills a continuous baseline with no free grant — forget one and it costs |
| Deterministic scale controller: instances = traffic ÷ concurrency, not a black box | The maximum-instance-count floor is 40 — you can’t pin a low ceiling |
| Concurrency × max-instances is a hard backpressure valve against fragile downstreams | Mis-set high, that same product becomes a self-inflicted DDoS on your backend |
| Identity-based storage connection removes the last storage key from config | The host needs several data-plane roles; miss one and the app won’t even start |
| Selectable memory (512/2048/4096) right-sizes cores to the workload | Only three discrete sizes; the 250-core regional quota binds large fleets |
| Private endpoints + Key Vault references make a genuinely private serverless app | DNS zone links are easy to forget → traffic silently goes public |
Flex is the right model for serverless that must reach a private network, hold a latency SLA, or cap fan-out — and still scale to zero when idle. It is the wrong model when you need a Windows runtime, a hard low instance ceiling, or you have no private/latency requirement at all (plain Consumption is cheaper and simpler). The disadvantages are all manageable — but only if you know they exist, which is the point of this article.
Hands-on lab
Provision a Flex app with VNet integration, pin concurrency and a low-ish max-instance cap, reserve a warm instance, and prove the scale and private-path settings are actually in effect. Run in Cloud Shell (Bash). This uses real (billed) resources — delete the resource group at the end; an hour is a few rupees of always-ready baseline plus storage.
Step 1 — Variables, RP registration, and resource group.
RG=rg-fnflex-lab
LOC=eastus
VNET=vnet-lab
SUBNET=snet-flex
STORAGE=stflexlab$RANDOM
APP=fn-lab-$RANDOM
az group create -n $RG -l $LOC -o table
az provider register --namespace Microsoft.App --wait
Step 2 — Confirm the region offers Flex, then build the delegated subnet.
az functionapp list-flexconsumption-locations -o table | grep -i $LOC # must appear
az network vnet create -g $RG -n $VNET --address-prefixes 10.50.0.0/16 \
--subnet-name $SUBNET --subnet-prefixes 10.50.1.0/26
az network vnet subnet update -g $RG --vnet-name $VNET -n $SUBNET \
--delegations Microsoft.App/environments
Expected: the subnet update returns JSON with delegations[0].serviceName = Microsoft.App/environments.
Step 3 — Backing storage and the Flex app, joined to the VNet.
az storage account create -g $RG -n $STORAGE -l $LOC --sku Standard_LRS \
--allow-blob-public-access false --min-tls-version TLS1_2
az functionapp create -g $RG -n $APP --storage-account $STORAGE \
--flexconsumption-location $LOC --runtime dotnet-isolated --runtime-version 8.0 \
--vnet "$VNET" --subnet "$SUBNET" -o table
Expected: a function app row; kind contains functionapp,linux.
Step 4 — Pin the scale math: memory, max instances, concurrency, one warm instance.
az functionapp scale config set -g $RG -n $APP --instance-memory 2048 --maximum-instance-count 40
az functionapp scale config set -g $RG -n $APP \
--trigger-type http --trigger-settings perInstanceConcurrency=10
az functionapp scale config always-ready set -g $RG -n $APP --settings http=1
(Note: http=1 is fine for a non-zone-redundant lab; production with AZ enabled requires a minimum of 2.)
Step 5 — Verify every layer is actually in effect, not just configured.
az functionapp scale config show -g $RG -n $APP -o jsonc # memory, max-count, concurrency
az functionapp scale config always-ready list -g $RG -n $APP -o table # http=1 present
az functionapp vnet-integration list -g $RG -n $APP -o table # bound to snet-flex
Expected: scale config shows instanceMemoryMB: 2048, maximumInstanceCount: 40, HTTP concurrency 10; always-ready lists http=1; VNet integration lists the delegated subnet. The lab steps mapped to what each proves:
| Step | What you did | What it proves | Real-world analogue |
|---|---|---|---|
| 2 | Delegate the subnet to Microsoft.App/environments | The delegation is exact and required | First Flex VNet setup of any team |
| 3 | --flexconsumption-location + --vnet | Flex is selected and VNet-joined in one shot | Production provisioning |
| 4 | Pin memory / max-count / concurrency / warm | The scale math is explicit, not implicit | Hardening before a launch |
| 5 | Read it back with scale config show | “Configured” ≠ “in effect”: verify both | The 90-second pre-incident check |
Step 6 — Cleanup (stop the always-ready baseline and storage charges).
az group delete -n $RG --yes --no-wait
Cost note. The only non-trivial charge in this lab is the single always-ready instance’s memory baseline (pennies per hour at 2048 MB) plus storage. Deleting the resource group stops everything.
Common mistakes & troubleshooting
This is the playbook — the part you bookmark. First as a scannable table you can read mid-incident, then the detail on the entries that bite hardest. The unifying diagnostic is the 429 fork: capped (instance/quota ceiling) versus cold (burst outran the warm pool).
| # | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix |
|---|---|---|---|---|
| 1 | 429s under burst; throttle rate climbs while p95 stays flat | App-level cap: concurrency × max-instances ceiling, or core quota | App Insights KQL (throttle vs instance count); Diagnose & solve → Flex Quota | Raise --maximum-instance-count / concurrency, or request quota |
| 2 | 429s + p95 spike at the start of a burst, settling after | Cold-start cascade: burst outran the warm pool | InstanceCount climbing from zero at burst edge; p95 spike | Add always-ready sized to the burst leading edge |
| 3 | App reads/writes storage over the public IP despite a PE | Private DNS zone not linked to the VNet | nslookup <acct>.blob.core.windows.net from a peered VM returns public A record | Link privatelink.blob/queue/... zones to the VNet |
| 4 | App won’t start; host errors on storage | UAMI missing a Storage data-plane role | az role assignment list --assignee <clientId> --scope <storageId> empty | Grant Blob Data Owner + Queue Data Contributor |
| 5 | Scale stalls below your --maximum-instance-count | Regional 250-core quota exhausted | Diagnose & solve → Flex Consumption Quota tool | Request a core-quota increase via support |
| 6 | create fails with a delegation/RP error | Subnet not delegated to Microsoft.App, or RP unregistered | az network vnet subnet show --query delegations; az provider show -n Microsoft.App | Delegate to Microsoft.App/environments; register the RP |
| 7 | Scale-out plateaus early under load | Integration subnet too small (IP exhaustion) | Subnet /27 with many instances; available IPs near zero | Recreate with a larger subnet (/26 or bigger) |
| 8 | C# app never starts on Flex | In-process model deployed (unsupported) | Worker logs; project targets in-process | Re-target the isolated worker (.NET 8/9/10) |
| 9 | Key Vault reference resolves empty → app misbehaves | MI lacks Key Vault Secrets User, or vault PE/DNS missing | Portal → Environment variables (red error); az role assignment list | Grant Secrets User; link privatelink.vaultcore; open vault firewall |
| 10 | Deploy “succeeds” but app runs old/empty | Deployment-container auth/role wrong; package not pulled | Diagnose & solve → Flex Deployment tool; package container empty/denied | Set --deployment-storage-auth-type; grant Blob Data Contributor |
| 11 | Backend (DB/API) overwhelmed under spike | Fan-out uncapped: concurrency × max-instances > downstream limit | Compute the product; compare to pool/rate cap | Lower the product below the weakest downstream |
| 12 | Outbound doesn’t traverse the firewall / sees public DNS | vnetRouteAllEnabled is false | az functionapp show --query properties.vnetRouteAllEnabled | Set vnetRouteAllEnabled=true |
The expanded form, with the full reasoning for the entries that bite hardest:
1 & 2 — Capped vs cold: the central 429 fork. Both look like “we got 429s under load,” and treating one as the other wastes the incident. Capped means instances are saturated at their concurrency limit and the app cannot add more — either --maximum-instance-count is too low or the regional core quota is exhausted; the tell is a throttle rate that climbs while p95 stays flat (the served requests are fine, you just can’t serve more). Cold means a burst arrived faster than on-demand instances warm up, an upstream timed out and retried, and the retries amplified load; the tell is a throttle spike alongside a p95 spike at the burst’s leading edge. Use this Kusto query to separate them — it correlates 429 rate against the cause signals:
// 429 rate and p95 latency per 5-minute bin over the last hour
let window = 5m;
requests
| where timestamp > ago(1h)
| summarize
    total = count(),
    throttled = countif(resultCode == "429"),  // resultCode is a string column
    p95_ms = percentile(duration, 95)
  by bin(timestamp, window)
| extend throttle_rate = round(100.0 * throttled / total, 2)
| order by timestamp asc
A throttle rate that climbs while p95 stays flat points to a hard instance cap (capped → raise max-count/concurrency or request quota). A throttle rate that spikes alongside a p95 latency spike at the start of a burst points to cold starts (cold → add always-ready sized to the burst’s leading edge). Read against the live InstanceCount metric, the two are unmistakable:
| Signal pattern | Diagnosis | Why | First fix |
|---|---|---|---|
| Throttle ↑, p95 flat, InstanceCount pinned at max | Capped (instances) | Saturated and can’t add more | Raise --maximum-instance-count |
| Throttle ↑, p95 flat, InstanceCount below max | Capped (quota) | Quota stalls scale below your cap | Request core-quota increase |
| Throttle spike + p95 spike at burst edge, then settles | Cold | On-demand cold-starting behind the burst | Add always-ready for the leading edge |
| Throttle 0, p95 spike on first request after idle | Cold (no SLA breach yet) | Warm pool empty at idle | Reserve a small warm floor |
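The decision table can be encoded as a small triage function so the on-call engineer gets the same answer every time. This is a sketch under simplifying assumptions: `classify_429` is a hypothetical name, and the yes/no inputs are judgments you make by reading the KQL output and the InstanceCount metric, not values any Azure API returns.

```shell
# Hypothetical triage helper encoding the signal-pattern table.
# classify_429 THROTTLED(yes|no) P95_SPIKE(yes|no) INSTANCE_COUNT MAX_INSTANCES
classify_429() {
  throttled=$1
  p95_spike=$2
  instances=$3
  max=$4
  if [ "$p95_spike" = "yes" ]; then
    if [ "$throttled" = "yes" ]; then
      echo "cold"               # add always-ready for the leading edge
    else
      echo "cold-idle"          # reserve a small warm floor
    fi
  elif [ "$throttled" = "yes" ]; then
    if [ "$instances" -ge "$max" ]; then
      echo "capped-instances"   # raise --maximum-instance-count
    else
      echo "capped-quota"       # request a core-quota increase
    fi
  else
    echo "no-signal"
  fi
}

classify_429 yes no 40 40    # prints: capped-instances
classify_429 yes no 25 40    # prints: capped-quota
classify_429 yes yes 3 40    # prints: cold
classify_429 no yes 0 40     # prints: cold-idle
```

The function only mirrors the four rows above; anything that matches none of them prints no-signal and deserves a closer look at the raw metrics.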
The metrics that explain scaling decisions — read these first:
APP_ID=$(az functionapp show -g $RG -n fn-orders-prod --query id -o tsv)
az monitor metrics list --resource "$APP_ID" --metric "InstanceCount" --interval PT1M -o table
az monitor metrics list --resource "$APP_ID" --metric "OnDemandFunctionExecutionUnits" --interval PT1H -o table
az monitor metrics list --resource "$APP_ID" --metric "AlwaysReadyFunctionExecutionUnits" --interval PT1H -o table
| Metric | What it tells you | Use it to… |
|---|---|---|
| InstanceCount | Live instances over time | See capped (pinned at max) vs cold (climbing from zero) |
| OnDemandFunctionExecutionUnits | GB-s of on-demand execution | Attribute the variable part of the bill |
| AlwaysReadyFunctionExecutionUnits | GB-s on the warm pool | Confirm the warm pool is sized/used right |
| FunctionExecutionCount | Total executions | Correlate throttle rate to volume |
| MemoryWorkingSet | Per-instance memory in use | Spot pressure that argues for a larger memory size |
| AverageMemoryWorkingSet | Fleet-average memory | Right-size 512 vs 2048 vs 4096 |
| Http5xx / Http429 | Edge error rates | The symptom; confirm against the cause above |
3 — The “private but actually public” trap. VNet integration routed the egress, but the integration VNet was never linked to the privatelink.blob.core.windows.net zone, so the app resolved the public IP and the private endpoint was bypassed entirely. Confirm: nslookup <account>.blob.core.windows.net from a peered VM (not your laptop) returns a public A record instead of a 10.x private IP. Fix: link every relevant privatelink.* zone to the VNet, then re-test resolution from inside the VNet.
4 — App won’t start because the host can’t read storage. The UAMI was assigned but never granted the Storage data-plane roles, so the host can’t read its own metadata/package and never starts — which looks like a generic “app down,” not an auth problem. Confirm: az role assignment list --assignee <UAMI clientId> --scope <storage id> is empty. Fix: grant Storage Blob Data Owner + Storage Queue Data Contributor (+ Account Contributor) on the backing account, and Blob Data Contributor on the deployment container.
5 & 11 — Quota vs self-DDoS. Two opposite failure modes around the same product. (5) Scale stalls below your --maximum-instance-count because the regional 250-core quota is exhausted (remember 4096-MB instances burn 2 cores each, so 125 of them is the whole default budget). Confirm: the Flex Consumption Quota tool in Diagnose & solve. Fix: request an increase. (11) The opposite — scale runs too far and concurrency × max-instances exceeds your downstream’s capacity, so the backend pool exhausts under spike. Confirm: compute the product and compare to the pool/rate cap. Fix: lower the product below the weakest downstream; this is the backpressure discipline.
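The quota side of that product is worth computing before a launch. A minimal sketch, assuming a hypothetical helper `quota_check`; the cores-per-instance figure is an input you supply (the text above gives 2 cores for a 4096-MB instance), and the quota defaults to the 250 cores cited here.

```shell
# Hypothetical helper: compare instances x cores/instance with the regional quota.
# quota_check INSTANCES CORES_PER_INSTANCE [QUOTA_CORES]
# awk is used so fractional cores per instance also work.
quota_check() {
  awk -v n="$1" -v c="$2" -v q="${3:-250}" 'BEGIN {
    need = n * c
    printf "%s cores of %s: %s\n", need, q, (need <= q ? "within quota" : "over quota")
  }'
}

quota_check 125 2   # prints: 250 cores of 250: within quota
quota_check 126 2   # prints: 252 cores of 250: over quota
```

Pair this with the downstream check: the configured ceiling should sit under both the core quota and the weakest dependency, otherwise one of the two decides your real maximum for you.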
Best practices
- Pin perInstanceConcurrency explicitly in production. The memory-derived default silently shifts your scale math the day someone changes instance memory. Set it; review it as code.
- Size concurrency × max-instances to your weakest downstream, not to traffic. That product is a hard backpressure ceiling: make backend overload structurally impossible, then let the edge 429.
- Reserve always-ready only on latency-critical groups. It bills a continuous baseline; warm the hot HTTP/Durable group, leave everything else to scale from zero. Use min 2 per group when zone-redundant.
- Size the warm pool to the burst’s leading edge, roughly ceil(steady-state concurrency ÷ perInstanceConcurrency). You’re not warming everything (that’s Premium); you’re covering the front of the spike.
- Use an identity-based AzureWebJobsStorage connection. Remove the connection string entirely; grant the host the exact data-plane roles it needs (Blob/Queue) and nothing broader.
- Delegate a dedicated /26 integration subnet to Microsoft.App/environments. Never share it, and leave IP headroom so scale-out isn’t capped by subnet exhaustion.
- Link every privatelink.* zone the app talks to. VNet integration without DNS linking is “private” in name only; verify resolution from inside the VNet, not your laptop.
- Set vnetRouteAllEnabled=true when outbound must traverse a firewall/NAT or see private DNS; otherwise some egress leaks to the public path.
- Cap --maximum-instance-count against the 250-core quota, computing instances × cores/instance; request an increase before a launch, not during the incident.
- Wire Application Insights from day one and keep the capped-vs-cold KQL handy. It turns a serverless scaling mystery into a 90-second read of throttle rate vs InstanceCount.
- Ship the whole plan as Bicep, reviewed in PRs. Memory, concurrency, max-count, always-ready, VNet, and identity roles are all a wrong-value-away from an incident; treat them as code.
- Stay on a supported isolated runtime. No in-process C#; confirm the stack/version is offered in your Flex region before you commit.
The alerts worth wiring before the next burst — leading indicators, not “app down”:
| Alert on | Signal | Threshold (starting point) | Why it’s leading |
|---|---|---|---|
| Throttling | Http429 rate |
> 1% sustained 5 min | First sign of capped-or-cold before users feel it |
| Scale ceiling | InstanceCount at max |
= maximum-instance-count for 10 min |
You’re capped — raise the cap or quota |
| Cold-start latency | request p95 | > your SLO at burst edges | Warm pool too small for the spike |
| Core quota | Flex Quota tool / scale stall | scale flat below max-count | Quota, not your config, is the cap |
| Always-ready cost | AlwaysReadyFunctionExecutionUnits | trending up unexpectedly | A forgotten reservation billing a baseline |
| Private-path drift | dependency failures over public IP | any, post-deploy | A DNS zone link went missing |
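The throttling, scale-ceiling, and cold-start rows combine into the capped-vs-cold fork. Below is a rough Python sketch of that decision, assuming you have already pulled the numbers for one burst window from your telemetry. The `BurstWindow` type, its field names, and the 1% default threshold are illustrative assumptions, not a platform API.

```python
from dataclasses import dataclass


@dataclass
class BurstWindow:
    """Metrics for one burst window (hypothetical shape, filled from telemetry)."""
    throttle_rate: float         # fraction of requests answered with 429 (0.0 to 1.0)
    p95_ms: float                # observed request p95 in the window
    p95_slo_ms: float            # your latency SLO
    instance_count: int          # observed InstanceCount
    maximum_instance_count: int  # configured ceiling


def classify(window: BurstWindow, throttle_threshold: float = 0.01) -> str:
    """Return 'ok', 'cold', 'capped:max-instances', or 'capped:quota'."""
    if window.throttle_rate <= throttle_threshold:
        return "ok"
    if window.p95_ms > window.p95_slo_ms:
        # Throttle and latency spike together: the burst outran the warm pool.
        return "cold"
    if window.instance_count >= window.maximum_instance_count:
        # Throttling with flat latency while pinned at the configured ceiling.
        return "capped:max-instances"
    # Throttling with flat latency below the ceiling points at the core quota.
    return "capped:quota"
```

The ordering matters: latency is checked before the instance count because a p95 spike is the signal that separates cold from capped.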
Security notes
- Managed identity over secrets, everywhere. Use a user-assigned (or system-assigned) managed identity for the storage connection (AzureWebJobsStorage__credential=managedidentity) and for Key Vault references, so no connection string or key sits in app settings. Grant least privilege: the specific Storage data roles and Key Vault Secrets User, not broad management roles.
- Lock dependencies to the private path. Put backing storage, Key Vault, and data stores behind private endpoints, set --public-network-access Disabled on each, and confirm the app resolves them privately via linked DNS zones. A private endpoint with public access still enabled is a false sense of security.
- Force outbound through the VNet with vnetRouteAllEnabled=true so egress can traverse a firewall / NAT gateway and the resolver sees private records, and so you can apply egress controls (deterministic SNAT, allow-listed destinations) the way Azure NAT Gateway: Deterministic Egress & SNAT Exhaustion describes.
- Scope storage roles to the exact account. The host needs Blob/Queue data roles on the backing account and Storage Blob Data Contributor on the deployment container; assign them at that scope, not at subscription or resource-group level.
- Keep secrets in Key Vault behind its own private endpoint, referenced via @Microsoft.KeyVault(...); the secret value never lands in app settings and rotates centrally. See Azure Key Vault: Secrets, Keys & Certificates.
- Disable public blob access and enforce TLS 1.2 on the backing account (--allow-blob-public-access false --min-tls-version TLS1_2); treat both as effectively required for a compliant Flex deployment.
- Treat the deployment container as sensitive. It holds your runnable package; restrict its access to the deploy identity and the host identity, and prefer identity auth over a connection string for it too.
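A quick way to keep the first note honest is to lint the app settings before deploy. The sketch below is illustrative only: `storage_connection_issues` is a hypothetical helper, and it assumes you have exported the app settings into a plain dictionary. The setting names are the ones used in this article.

```python
def storage_connection_issues(app_settings: dict) -> list:
    """List the ways the host storage connection is not identity-based."""
    issues = []
    if "AzureWebJobsStorage" in app_settings:
        # The plain setting carries a connection string, and therefore a key.
        issues.append("AzureWebJobsStorage connection string is still present")
    if not app_settings.get("AzureWebJobsStorage__accountName"):
        issues.append("AzureWebJobsStorage__accountName is missing")
    if app_settings.get("AzureWebJobsStorage__credential") != "managedidentity":
        issues.append("AzureWebJobsStorage__credential is not managedidentity")
    return issues
```

An empty list means the settings match the identity-based shape; it does not prove the role assignments exist, which still has to be checked on the storage account.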
The security controls that also prevent these incidents — secure and resilient pull the same way here:
| Control | Setting / mechanism | Secures against | Also prevents |
|---|---|---|---|
| Identity-based storage connection | AzureWebJobsStorage__credential=managedidentity | A storage key in app settings | Key rotation breaking the host |
| Least-privilege storage roles | Blob/Queue data roles at account scope | Over-broad access | Surprise blast radius if the identity leaks |
| Private endpoints + public access off | PE + --public-network-access Disabled | Data exfiltration over public IP | Accidental public exposure of deps |
| Linked Private DNS zones | privatelink.* zone → VNet link | Traffic resolving the public path | “Private but actually public” drift |
| Key Vault references | @Microsoft.KeyVault(SecretUri=...) | Secrets in plaintext config | Hand-rolled secret rotation breaking the app |
| Force VNet routing | vnetRouteAllEnabled=true | Egress bypassing the firewall | Private DNS not being consulted |
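The “private but actually public” drift in the table can be checked mechanically by resolving the dependency and testing the answer. This is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it only tells you something useful when run from inside the integration VNet (for example, from the function itself), not from a laptop.

```python
import ipaddress
import socket


def is_private_ip(address: str) -> bool:
    """True for private-range addresses (note: also true for loopback/link-local)."""
    return ipaddress.ip_address(address).is_private


def resolves_privately(hostname: str, port: int = 443) -> bool:
    """Resolve hostname and report whether every returned address is private."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return all(is_private_ip(info[4][0]) for info in infos)
```

Calling `resolves_privately` with a storage account’s blob hostname should return True once the privatelink zone is linked; a False result means the app is still being handed the public IP.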
Cost & sizing
The bill drivers on Flex and how they interact with the tuning:
- On-demand execution dominates a scale-to-zero workload — you pay GB-seconds of active execution (1,000 ms minimum per call, then 100 ms rounding) and nothing while idle. Short, frequent calls pay the 1-second floor, so very chatty tiny functions can cost more than their wall-clock suggests.
- Always-ready baseline is the Premium-style cost, scoped to what you reserve. Every warm instance bills its provisioned memory continuously (instances × memory × time) plus execution while running. Reserve the minimum that covers your steady-state concurrency; a forgotten http=10 on a 4096-MB app is a real monthly line item.
- Instance memory sets both performance and cost: a 4096-MB instance bills (and quota-counts) double a 2048-MB one. Right-size down for high-fan-out light work; only go to 4096 for genuine CPU/memory pressure.
- The 250-core quota is free but bounds the fleet — large always-ready pools of 4096-MB instances eat it fast (125 instances = the whole default budget).
- VNet / NAT / private endpoints add small hourly + per-GB charges, but they are the price of a private serverless app and far cheaper than the alternative (a fully always-on Premium plan sized for peak).
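The rounding rules in the first two bullets are easy to misread, so here is a small worked sketch in Python. It computes quantities only, not prices, and makes two assumptions you should verify against your own bill: that memory converts to GB as MB ÷ 1024, and that the 100 ms rounding is always upward. The function names are hypothetical.

```python
import math


def billed_ms(duration_ms: float) -> int:
    """Billable duration: rounded up to the next 100 ms, with a 1,000 ms minimum."""
    rounded = math.ceil(duration_ms / 100) * 100
    return max(1000, rounded)


def on_demand_gb_seconds(duration_ms: float, memory_mb: int) -> float:
    """GB-seconds charged for one on-demand execution (assumes GB = MB / 1024)."""
    return billed_ms(duration_ms) / 1000 * memory_mb / 1024


def always_ready_gb_seconds(instances: int, memory_mb: int, seconds: int) -> float:
    """Continuous baseline for a warm pool: instances x memory x time."""
    return instances * memory_mb / 1024 * seconds
```

A 120 ms call on a 2048-MB instance is billed as a full second (2.0 GB-seconds), which is why very chatty tiny functions cost more than their wall-clock time suggests. Two 2048-MB always-ready instances accrue 14,400 GB-seconds of baseline per hour whether or not they serve traffic.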
A rough monthly picture, and what each lever buys you:
| Cost driver | What you pay for | Rough INR / month | What it fixes | Watch-out |
|---|---|---|---|---|
| On-demand execution | Active GB-seconds (1s min) | Scales with traffic; ₹0 idle | The scale-to-zero economics | Chatty tiny calls pay the 1s floor |
| 2× always-ready (2048 MB) | Warm memory baseline | ~₹6,000–10,000 | Cold start on the hot path | Forgetting it bills 24×7 |
| 6× always-ready (2048 MB) | Larger warm floor | ~₹18,000–30,000 | Bigger burst leading edge | Size to peak ÷ concurrency, not “lots” |
| 4096-MB sizing | 2× cores per instance | ~2× the above | CPU/memory-heavy work | Doubles quota burn |
| Private endpoints | Per endpoint hourly | ~₹1,000–2,000 each | Genuinely private deps | One per dependency |
| NAT gateway | Hourly + per-GB egress | ~₹1,500–3,000 | Deterministic egress at scale | Needs VNet integration |
| App Insights ingestion | Per-GB telemetry | ~₹1,000–3,000 | The capped-vs-cold diagnosis | Sample high-traffic apps |
The sizing rule in one line: let on-demand carry the variable load to zero, reserve always-ready only for the hot path’s burst edge (ceil(steady concurrency ÷ perInstanceConcurrency)), and cap concurrency × max-instances below your weakest downstream. That combination is cheaper than Premium-for-peak and safer than uncapped Consumption.
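The last clause of that rule is plain integer arithmetic. A minimal sketch follows, with hypothetical function names, taking the 40-instance floor for maximum-instance-count from this article as given:

```python
FLEX_MIN_MAXIMUM_INSTANCE_COUNT = 40  # floor for --maximum-instance-count, per this article


def per_instance_concurrency_cap(downstream_capacity: int,
                                 maximum_instance_count: int) -> int:
    """Largest perInstanceConcurrency that keeps the in-flight ceiling
    at or below the downstream capacity. Returns 0 when even a
    concurrency of 1 would overshoot."""
    if maximum_instance_count < FLEX_MIN_MAXIMUM_INSTANCE_COUNT:
        raise ValueError("maximum_instance_count is below the Flex floor of 40")
    return downstream_capacity // maximum_instance_count


def in_flight_ceiling(per_instance_concurrency: int,
                      maximum_instance_count: int) -> int:
    """Hard ceiling on total concurrent HTTP executions."""
    return per_instance_concurrency * maximum_instance_count
```

For a 150-connection pool at the minimum cap of 40 instances, the result is a concurrency of 3 and a ceiling of 120. A return value of 0 means the downstream is smaller than the instance floor, so concurrency alone cannot protect it.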
Interview & exam questions
1. What does Flex Consumption add over Linux Consumption, and what does it keep? It keeps scale-to-zero and execution billing, and adds VNet integration, always-ready instances, selectable instance memory (512/2048/4096), and explicit per-instance concurrency — the Premium capabilities available à la carte on a consumption-billed plan, so you pay a warm baseline only on the slice you reserve.
2. How does Flex billing differ from Consumption and Premium? Consumption bills only active GB-seconds; Premium bills every reserved instance always-on. Flex bills on-demand instances only while executing (1,000 ms minimum, then 100 ms rounding) and always-ready instances a continuous memory baseline plus execution. You pay the Premium-style cost only on reserved warm capacity.
3. A burst causes 429s. How do you tell whether you were capped or cold? Correlate the 429 rate against p95 and InstanceCount. Capped: throttle climbs while p95 stays flat and InstanceCount is pinned at max (or stalled by quota) — raise --maximum-instance-count/concurrency or request quota. Cold: throttle and p95 spike at the burst’s leading edge while InstanceCount climbs from zero — add always-ready sized to the leading edge.
4. What subnet requirement does Flex VNet integration impose, and what’s the common mistake? A subnet delegated to Microsoft.App/environments, at least /27 (use /26), with the Microsoft.App RP registered. The common mistake is delegating to Microsoft.Web/... (App Service’s delegation) — Flex joins via the App environment, so that delegation fails.
5. Why might an app reach a dependency over the public IP even though it has VNet integration and a private endpoint? Because VNet integration only routes outbound; reaching the private endpoint also needs DNS resolution to the private IP, which requires the integration VNet to be linked to the privatelink.* zone. Without that link the app resolves the public IP and bypasses the endpoint.
6. How do you turn concurrency into a backpressure mechanism? perInstanceConcurrency × maximum-instance-count is a hard ceiling on total in-flight executions. Set that product at or below your weakest downstream’s capacity (DB pool, API rate cap) and overload becomes structurally impossible — the app 429s at the edge before the downstream falls over.
7. What is the regional core quota and how do you compute against it? A default 250 cores (512,000 MB) per subscription+region, shared by all Flex apps. Cores are instances × cores-per-instance (512 MB = 0.25, 2048 = 1, 4096 = 2), so 125 instances of 4096 MB exhaust the default. Always-ready counts; scaled-to-zero doesn’t. Request increases via support.
8. How do you remove the storage key from a Flex app? Replace the AzureWebJobsStorage connection string with an identity-based connection: set AzureWebJobsStorage__accountName, __credential=managedidentity, and __clientId (UAMI), then delete AzureWebJobsStorage. Grant the identity the Storage Blob/Queue data roles on the backing account.
9. What’s the minimum always-ready count when availability zones are enabled, and why? 2 per group, not 1 — so the warm pool survives a single zone outage. A single warm instance would be a single point of failure that defeats the purpose of zone redundancy.
10. Which runtimes/models are unsupported on Flex, and what’s the migration path? The C# in-process model is unsupported — you must use the isolated worker (.NET 8/9/10). Flex is Linux only. There is no in-place migration in or out: you create a new app and redeploy.
11. How is non-HTTP trigger concurrency tuned, since perInstanceConcurrency is HTTP-only? Via target-based scaling in host.json — the binding’s batch/concurrency knobs (serviceBus.maxConcurrentCalls, queues.batchSize/newBatchThreshold, Event Hubs batch size) — from which the runtime computes a desired instance count from queue depth.
12. The function app deploys “successfully” but runs old or empty code. What do you check? The deployment container auth/role: the deploy identity needs Storage Blob Data Contributor on the deployment account, and --deployment-storage-auth-type must match how it’s authenticated. Use the Flex Consumption Deployment tool in Diagnose & solve to see package status.
These map to AZ-204 (Developer Associate) — develop, configure, monitor and troubleshoot Azure Functions, scaling and networking — and AZ-305 (Solutions Architect) for the plan-choice and private-network design. The networking-cost angle (VNet integration, NAT, SNAT) touches AZ-700, and the identity/least-privilege angle touches AZ-500. A compact cert-mapping for revision:
| Question theme | Primary cert | Exam objective area |
|---|---|---|
| Flex vs Consumption vs Premium, billing | AZ-204 / AZ-305 | Choose & configure compute; cost |
| Concurrency, max-instances, always-ready | AZ-204 | Configure & scale Functions |
| VNet integration, delegated subnet, Private DNS | AZ-700 | Design & implement network connectivity |
| Identity-based storage connection, KV references | AZ-500 / AZ-204 | Secure app config; manage identities |
| 429 capped-vs-cold, metrics, KQL | AZ-204 | Monitor & troubleshoot solutions |
| Core quota, scale ceilings | AZ-305 | Design for scale & limits |
Quick check
- You see 429s under load; the throttle rate climbs but p95 latency stays flat. Were you capped or cold, and what’s the first fix?
- Your Flex app has VNet integration and the storage account has a private endpoint, yet nslookup from a peered VM returns a public IP. What’s missing?
- You want a hard ceiling of “no more than 5 instances ever.” Can Flex do it? Why or why not?
- A 4096-MB Flex app needs to scale to 200 instances. Does the default regional quota allow it? Show the math.
- How do you make total in-flight executions never exceed a 150-connection database pool?
Answers
- Capped. Throttle climbing while p95 stays flat means served requests are fine but the app can’t add capacity: you’re at the concurrency × max-instances ceiling or the core quota. First fix: raise --maximum-instance-count (or perInstanceConcurrency if instances have memory headroom), or request a core-quota increase. (Cold would show a p95 spike at the burst’s leading edge.)
- The integration VNet isn’t linked to the privatelink.blob.core.windows.net Private DNS zone. VNet integration routes outbound but doesn’t resolve names; without the zone link the app gets the public IP and bypasses the endpoint. Link the relevant privatelink.* zones to the VNet.
- No. The floor for --maximum-instance-count is 40, so you cannot pin a Flex app to a low ceiling like 5. If you need a hard low ceiling, use Consumption or Premium instead.
- No. 4096 MB = 2 cores per instance, so 200 instances = 400 cores, which exceeds the default 250-core quota (which caps 4096-MB apps at 125 instances). Request a quota increase before planning for 200.
- Set perInstanceConcurrency × maximum-instance-count ≤ 150. Because maximum-instance-count cannot go below 40, that means a concurrency of at most 3: 3 × 40 = 120, whereas 4 × 40 = 160 would overshoot the pool. The product is a hard backpressure ceiling; the app 429s at the edge before the pool exhausts.
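The quota math in the fourth answer generalizes to any memory size. A short Python sketch, using the cores-per-instance figures and the 250-core default quoted in this article (the function names are hypothetical):

```python
# Cores consumed per instance, keyed by instance memory in MB (from this article).
CORES_PER_INSTANCE = {512: 0.25, 2048: 1.0, 4096: 2.0}
DEFAULT_REGIONAL_CORE_QUOTA = 250


def cores_needed(instances: int, memory_mb: int) -> float:
    """Cores a fleet consumes: instances x cores-per-instance."""
    return instances * CORES_PER_INSTANCE[memory_mb]


def fits_quota(instances: int, memory_mb: int,
               quota: float = DEFAULT_REGIONAL_CORE_QUOTA) -> bool:
    """Whether the fleet fits inside the regional core quota on its own."""
    return cores_needed(instances, memory_mb) <= quota


def max_instances_within_quota(memory_mb: int,
                               quota: float = DEFAULT_REGIONAL_CORE_QUOTA) -> int:
    """Largest fleet of this memory size the quota allows."""
    return int(quota / CORES_PER_INSTANCE[memory_mb])
```

With these figures, 200 instances at 4096 MB need 400 cores and do not fit, and the default quota tops out at 125 such instances. Remember the quota is shared by every Flex app in the subscription and region, so the real headroom is usually smaller than this single-app calculation.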
Glossary
- Flex Consumption: a serverless Functions plan combining scale-to-zero and execution billing with VNet integration, always-ready instances, selectable memory, and explicit concurrency.
- On-demand instance: a Flex instance that bills only while actively executing (1,000 ms minimum per execution, then 100 ms rounding) and scales to zero when idle.
- Always-ready instance: reserved warm capacity that takes traffic first and bills a continuous memory baseline plus execution; minimum 2 per group when zone-redundant.
- Scale group: the unit Flex scales together, one of http (HTTP/SignalR), blob (Event Grid blob), durable (all Durable functions), or function:<NAME> for everything else.
- perInstanceConcurrency: the explicit number of concurrent HTTP executions an instance handles before the scale controller adds another; the only trigger type valid for the flag is http.
- Target-based scaling: the mechanism (configured in host.json) by which non-HTTP triggers compute a desired instance count from queue depth and batch settings.
- Instance memory: the per-worker memory size (512 / 2048 / 4096 MB); vCPU and bandwidth scale with it, and it sets the core-quota cost (0.25 / 1 / 2 cores).
- Maximum instance count: the horizontal ceiling (40–1,000); together with concurrency it forms the backpressure cap on in-flight executions.
- Regional core quota: the default budget of 250 cores (512,000 MB) per subscription+region; cores = instances × cores-per-instance; always-ready counts, scaled-to-zero doesn’t.
- Delegated subnet: the integration subnet, at least /27 (use /26), delegated to Microsoft.App/environments; required for Flex VNet integration.
- Private DNS link: linking the integration VNet to a privatelink.* zone so the app resolves a dependency’s private endpoint IP instead of its public IP.
- Identity-based connection: AzureWebJobsStorage__accountName + __credential=managedidentity (+ __clientId), replacing the storage connection string so no key is stored.
- One-deploy: Flex’s single deployment path, in which you build, zip, and push the package to a blob container; the app pulls and runs from it on startup (no WEBSITE_RUN_FROM_PACKAGE needed).
- vnetRouteAllEnabled: the setting that forces all outbound through the VNet so it can traverse a firewall/NAT and the resolver sees private DNS records.
- Capped vs cold: the 429 diagnostic fork. Capped = instance/quota ceiling (throttle up, p95 flat); cold = burst outran the warm pool (throttle + p95 spike at the burst edge).
- Execution unit: the GB-seconds metric (OnDemandFunctionExecutionUnits / AlwaysReadyFunctionExecutionUnits) that attributes the bill to on-demand vs warm capacity.
Next steps
You can now provision, tune, and diagnose a Flex Consumption app on a private network. Build outward:
- Next: Azure Functions: Serverless Patterns, Triggers & Bindings. The trigger/binding fundamentals beneath every Flex scale decision.
- Related: Durable Functions: Orchestration Patterns & Fan-Out/Fan-In. Stateful coordination on the durable scale group you warm with always-ready.
- Related: Azure Private Endpoints & Private DNS at Scale. The dependency-side mechanics that make the Flex private path actually private.
- Related: Azure NAT Gateway: Deterministic Egress & SNAT Exhaustion. Controlling and scaling the outbound path your VNet-integrated functions egress through.
- Related: Troubleshooting Azure App Service: 502/503, Cold Starts & Restart Loops. The PaaS cousin’s failure playbook; many patterns (cold start, SNAT, identity) carry over.
- Related: Azure Cost: Reservations, Savings Plans & Hybrid Benefit Strategy. Putting the always-ready baseline and execution bill in a broader cost-engineering frame.