
Azure Functions Flex Consumption: VNet Integration, Concurrency, and Cold-Start Tuning

The Linux Consumption plan gave you scale-to-zero and execution billing, but you paid for it with no VNet integration, opaque scaling, and cold starts you could only pray about. Flex Consumption is Microsoft’s answer: the same serverless billing model, but now with true virtual network integration, selectable instance memory, deterministic per-function concurrency, and always-ready instances to kill cold starts on the functions that matter. It is the plan you reach for when a function has to live on a private network, take a burst without a cold-start cliff, and never leak a storage key — all while still scaling to zero when idle.

This is how to provision it correctly, tune it, and prove the scale controller behaves under load. We treat Flex not as a checkbox but as a system with five tunable surfaces — plan choice, instance memory, concurrency, always-ready capacity, and the private network path — each of which has a default that is wrong for production and a failure mode that bites under load. You will learn every setting end to end: what it is, the values it accepts, the default, when to change it, the trade-off, and the limit or gotcha that turns it into a 2am incident. Because this is a reference you will return to mid-incident, every deep section anchors to a table you can scan, and the operational failure modes are laid out as a symptom→cause→confirm→fix playbook.

By the end you will stop guessing about serverless scale. When a burst lands you will know whether you were capped (concurrency × max-instances ceiling) or cold (burst outran the warm pool), whether your private dependency is actually private (or silently resolving a public IP because a DNS zone link is missing), and whether your bill is execution-only or quietly paying an always-ready baseline you forgot you reserved. Knowing which within ninety seconds is what separates a tuned serverless platform from one that pages you every flash sale.

What problem this solves

Serverless on Azure used to force a brutal trade. Linux Consumption billed only for active execution and scaled to zero — perfect economics — but it had no VNet integration, so a function could not reach a private database, an on-prem service over ExpressRoute, or a Key Vault behind a private endpoint. Its scaling was a black box you could not tune, and its cold starts were unbounded and unmitigated. The moment your function needed a private network or a latency SLA, you were pushed onto the Premium (Elastic Premium / EP) plan, which fixed networking and cold starts by keeping instances always on — and billing you for every reserved instance whether it ran code or not, with no scale-to-zero.

Flex Consumption dissolves that trade. It keeps scale-to-zero and execution billing like Consumption, but adds VNet integration, always-ready instances (warm capacity you reserve only where you need it), selectable instance memory, and explicit per-instance concurrency — the Premium capabilities, available à la carte on a consumption-billed plan. You pay the Premium-style baseline only on the slice of capacity you explicitly reserve, and nothing for the rest when it is idle.

What breaks without it: teams either over-pay for idle Premium compute to get a private network they barely use, or they ship on Consumption and discover too late that a synchronous API cold-starts past an upstream timeout, the upstream retries, and the retries stampede a backend with a fixed connection pool. Who hits this: anyone running serverless that must (a) reach private resources, (b) hold a tail-latency SLA on a hot path, (c) cap fan-out against a fragile downstream, or (d) eliminate storage connection strings for compliance. Flex is the plan that lets you do all four without abandoning serverless economics. To frame the whole surface before the deep dive, here is every tunable, the production-wrong default, and the failure it prevents:

Tunable surface | Default (often wrong for prod) | What you set it to | Failure it prevents
Plan choice | Consumption (no VNet, opaque scale) | Flex Consumption | Cannot reach private deps; un-tunable cold starts
Instance memory | 2048 MB | 512 / 2048 / 4096 by workload | Over-paying cores, or OOM on heavy payloads
HTTP concurrency | Memory-derived (implicit) | Explicit perInstanceConcurrency | Silent scale-math drift; runaway fan-out
Max instance count | High (scales toward 1,000) | Capped to your downstream’s limit | DDoS-ing your own database under burst
Always-ready | 0 (everything cold) | Sized to the burst leading edge | Cold-start latency on the hot path
VNet + Private DNS | Public outbound, no zone links | Delegated subnet + linked zones | Traffic bypassing private endpoints
Storage auth | Connection string in AzureWebJobsStorage | Identity-based connection (UAMI) | A storage key sitting in app settings

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand Azure Functions basics: a function app is a deployment and scale unit, triggers (HTTP, Service Bus, Event Hubs, Timer) start executions, and bindings wire inputs/outputs. You should be comfortable with az in Cloud Shell, reading JSON output, and the idea of a managed identity (system- or user-assigned) granting an Azure resource access without secrets. Familiarity with VNet, subnets, private endpoints, and Private DNS zones helps, because half of Flex’s value is the private network path.

This sits in the Serverless track and assumes the trigger/binding fundamentals from Azure Functions: Serverless Patterns, Triggers & Bindings. It is the scaling-and-networking layer beneath orchestration — pair it with Durable Functions: Orchestration Patterns & Fan-Out/Fan-In when your workload needs stateful coordination, since the Durable trigger is one of Flex’s scale groups. The plan-choice question is upstream: Containers vs Serverless vs VMs: Choosing a Compute Model frames when serverless wins at all. For the private path, the dependency-side mechanics live in Azure Private Endpoints & Private DNS at Scale, and the egress-exhaustion story it shares with App Service is in Azure NAT Gateway: Deterministic Egress & SNAT Exhaustion.

A quick map of who owns what during a Flex incident, so you call the right person fast:

Layer | What lives here | Who usually owns it | Failure classes it can cause
Trigger source | HTTP edge, Service Bus, Event Hubs | App / integration team | Burst shape that triggers cold starts; poison messages
Flex plan / scale controller | Concurrency, max-instances, always-ready | App + platform | 429 throttling (capped) or cold-start cascades
Regional core quota | 250-core (512,000 MB) budget | Subscription owner | Scale stalls below configured max
VNet integration subnet | Delegated /26, outbound route | Network team | No outbound; subnet too small to scale
Private DNS zones | privatelink.* resolution | Network / platform | App resolves public IP, bypasses PE
Backing storage / Key Vault | Package, host metadata, secrets | Platform + security | Boot failure on missing identity role
Managed identity (UAMI) | Data-plane roles, deploy auth | Security / platform | Host can’t read package → app won’t start

Core concepts

Five mental models make every later tuning decision obvious.

Flex bills two pools, not one. Unlike Consumption (active execution only) or Premium (every reserved instance always), Flex splits capacity into on-demand instances that bill only while actively executing (a 1,000 ms minimum per execution, then rounded up to the nearest 100 ms) and always-ready instances that bill a baseline for provisioned memory continuously plus execution memory while running. You pay the Premium-style baseline only on the slice you explicitly reserve. The whole cost model collapses to: reserve the minimum warm capacity your latency SLA needs, let everything else scale to zero.
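
As a rough illustration of the on-demand rounding rule, the sketch below computes the billed duration for a single execution. The function name `billed_ms` is hypothetical, and it assumes the rule exactly as stated here (1,000 ms floor, then round up to the next 100 ms); check current pricing before relying on it.

```shell
#!/bin/sh
# Sketch only: billed duration for one on-demand execution, assuming a
# 1,000 ms minimum and rounding up to the next 100 ms boundary.
billed_ms() {
  ms=$1
  if [ "$ms" -lt 1000 ]; then
    echo 1000                            # short calls pay the 1 s floor
  else
    echo $(( (ms + 99) / 100 * 100 ))    # integer round-up to 100 ms
  fi
}

billed_ms 250     # below the floor
billed_ms 1234    # rounds up to the next 100 ms
billed_ms 2000    # already on a boundary
```

The practical consequence is that very short executions are billed as if they ran for a full second, so batching tiny units of work can matter more on this plan than shaving milliseconds.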

The scale controller is deterministic — you give it the math. On Consumption the scale heuristics were opaque. On Flex, instances are added based on the concurrency you configure: for HTTP, the scale controller adds an instance when existing instances are saturated at their perInstanceConcurrency; for non-HTTP, target-based scaling computes a desired instance count from queue depth and the batch settings. Scaling is no longer a mystery — it is traffic ÷ concurrency, bounded by your max-instance-count and the regional quota.
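
The "traffic ÷ concurrency" framing can be sketched as a small calculation. This is an approximation of the steady-state target, not the scale controller's actual algorithm; `http_instances` and its parameters are hypothetical names, and the regional quota is ignored here.

```shell
#!/bin/sh
# Sketch only: approximate HTTP instance count as
# ceil(concurrent requests / perInstanceConcurrency), clamped to the
# configured maximum instance count. Regional quota is not modelled.
http_instances() {
  concurrent=$1
  per_instance=$2
  max_instances=$3
  needed=$(( (concurrent + per_instance - 1) / per_instance ))
  if [ "$needed" -gt "$max_instances" ]; then
    needed=$max_instances                # capped: excess traffic is throttled
  fi
  echo "$needed"
}

http_instances 95 10 120      # steady load, well under the cap
http_instances 5000 10 120    # burst that hits the instance ceiling
http_instances 0 10 120       # idle: scales to zero
```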

Concurrency is a backpressure valve, not just a perf knob. perInstanceConcurrency × maximum-instance-count is a hard ceiling on total in-flight executions. That product is the most important number on the plan: set it equal to (or below) your weakest downstream’s capacity — a database connection pool, a third-party rate cap — and overload becomes structurally impossible. The app throttles at the edge with 429s long before the downstream falls over. Size concurrency against your fragile dependency, not against incoming traffic.
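
One way to apply this is to work backwards from the downstream limit. The sketch below derives the largest per-instance concurrency that keeps the product under a given limit; the function names are hypothetical and the numbers are illustrative.

```shell
#!/bin/sh
# Sketch only: size concurrency against the weakest downstream.
# in_flight_ceiling = perInstanceConcurrency x maximum instance count.
in_flight_ceiling() {
  echo $(( $1 * $2 ))
}

# Largest per-instance concurrency whose ceiling stays at or below the limit.
# A result of 0 means the limit is smaller than the instance count itself.
safe_concurrency() {
  downstream_limit=$1
  max_instances=$2
  echo $(( downstream_limit / max_instances ))
}

safe_concurrency 200 40       # DB pool of 200, 40 instances
in_flight_ceiling 5 40        # resulting ceiling
```

With a 200-connection pool and the lowest allowed maximum of 40 instances, the sketch lands on a concurrency of 5, which puts the ceiling exactly at the pool size; in practice you would likely leave some margin.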

The private path is two halves: route and resolve. VNet integration handles outbound routing — it puts the worker’s egress on a delegated subnet. But reaching a private endpoint also requires DNS resolution to the private IP, which only happens if the integration VNet is linked to the relevant Private DNS zones (privatelink.blob.core.windows.net, etc.). Get the route without the resolve and the app silently resolves the public IP, traffic skips the private endpoint, and your “private” architecture is a fiction. Both halves must be present.
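
A quick way to confirm the "resolve" half is to look at the address a dependency's hostname resolves to from inside the integrated network and check whether it falls in a private range. The helper below is a sketch that only classifies an IPv4 address against the RFC 1918 ranges; `is_private_ipv4` is a hypothetical name, and the sample addresses are made up.

```shell
#!/bin/sh
# Sketch only: classify an IPv4 address as private (RFC 1918) or not.
# The address itself would come from a DNS lookup run inside the app's
# network (for example with nslookup); that step is not shown here.
is_private_ipv4() {
  case "$1" in
    10.*) return 0 ;;
    192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
    *) return 1 ;;
  esac
}

for ip in 10.40.2.5 20.60.1.1; do       # hypothetical lookup results
  if is_private_ipv4 "$ip"; then
    echo "$ip private: traffic uses the private endpoint"
  else
    echo "$ip PUBLIC: check the Private DNS zone link"
  fi
done
```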

Cold start is latency on the leading edge, not a constant. With always-ready capacity, steady-state traffic never cold-starts — the warm pool absorbs it. Cold start only appears when a burst outruns the warm pool, spilling onto on-demand instances that pay runtime boot, JIT, DI build, and connection-pool prime on their first request. The fix is never “warm everything” (that is just Premium); it is to size always-ready to the burst’s leading edge so the cold instances come up behind already-served traffic.
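
To make "burst outruns the warm pool" concrete, the sketch below estimates how many concurrent requests spill past the always-ready capacity. It is a simplification (it ignores how quickly on-demand instances come up), and `cold_exposed` is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: requests that exceed warm capacity and land on cold,
# on-demand instances. Warm capacity = always-ready x perInstanceConcurrency.
cold_exposed() {
  burst=$1
  always_ready=$2
  per_instance=$3
  warm_capacity=$(( always_ready * per_instance ))
  spill=$(( burst - warm_capacity ))
  if [ "$spill" -lt 0 ]; then
    spill=0                              # warm pool absorbs everything
  fi
  echo "$spill"
}

cold_exposed 100 6 24     # steady traffic inside the warm pool
cold_exposed 1800 6 24    # flash-sale spike spilling to on-demand
```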

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept | One-line definition | Where it lives | Why it matters
Flex Consumption | Serverless plan with VNet, warm pool, selectable memory | The plan SKU | The whole subject; scale-to-zero + private + tunable
On-demand instance | Bills only while executing (1s min, 100ms round) | The plan | The scale-to-zero economics
Always-ready instance | Warm capacity, baseline-billed | Per scale group | Kills cold start on the hot path
Scale group | Functions that scale together (http/blob/durable) | Runtime | Concurrency/always-ready apply per group
perInstanceConcurrency | HTTP executions per instance before scale-out | Scale config | The HTTP scale denominator + backpressure
Target-based scaling | Non-HTTP desired-instance from queue depth | host.json | How Service Bus/queues/hubs scale
Instance memory | 512 / 2048 / 4096 MB per worker | Scale config | Drives vCPU, bandwidth, and core cost
Maximum instance count | Horizontal ceiling (40–1,000) | Scale config | The other half of the backpressure cap
Regional core quota | 250 cores (512,000 MB) per sub+region default | Subscription | Real scale ceiling under the configured max
Delegated subnet | /26+ delegated to Microsoft.App/environments | VNet | Required for VNet integration
Private DNS link | VNet linked to privatelink.* zones | Private DNS | Makes the private endpoint actually private
Identity-based connection | AzureWebJobsStorage__accountName + MI | App settings | Removes the storage key entirely

Flex vs Consumption vs Premium: the scaling and billing model

Pick the wrong plan and you either overpay for idle compute (Premium) or hit a wall you cannot tune around (Consumption). Here is the decision matrix that actually matters:

Concern | Consumption | Premium (EP) | Flex Consumption
Scale to zero | Yes | No (min 1) | Yes
Max scale-out instances | 200 | 100 | 1,000
VNet integration | No | Yes | Yes (subnet delegation)
Cold-start mitigation | None | Pre-warmed instances | Always-ready instances
Instance memory | Fixed | Fixed per SKU | Selectable: 512 / 2048 / 4096 MB
Concurrency control | Implicit | Implicit | Explicit per-instance
Billing | Execution only | Per-instance (always on) | Execution + always-ready baseline
OS | Linux/Windows | Linux/Windows | Linux only
In-place migration | | | No (create new app, redeploy)

The billing distinction is the crux. Consumption bills only GB-seconds of active execution. Premium bills the full lifetime of every reserved instance whether it runs code or not. Flex Consumption splits the difference: on-demand instances bill only while actively executing (1,000 ms minimum, then rounded up to the nearest 100 ms), while any always-ready instances you configure bill a baseline for provisioned memory whether or not they execute. You only pay the Premium-style baseline on the slice of capacity you explicitly reserve. Read the three billing shapes side by side:

Billing dimension | Consumption | Premium (EP) | Flex on-demand | Flex always-ready
Charged when idle | No | Yes (full instance) | No | Yes (memory baseline)
Execution rounding | GB-s of active time | n/a (always on) | 1,000 ms min, then 100 ms increments | Baseline + execution memory
Free grant | Monthly GB-s + executions | None | Monthly GB-s + executions (on-demand only; check current pricing) | None
Scales to zero | Yes | No | Yes | No (it’s the warm floor)
Drives the bill | Active GB-seconds | Reserved instances | Active GB-seconds | Reserved memory × time

And the decision rule as a table — match your hard constraint to the plan:

If your hard constraint is… | Then choose… | Because
“Must reach a private DB / on-prem / private endpoint” | Flex (or Premium) | Consumption has no VNet integration
“Tail-latency SLA on a synchronous hot path” | Flex with always-ready | Warm pool defeats cold start, scoped to the hot group
“Spiky, scale-to-zero, no private network, cost-first” | Consumption | Cheapest when idle dominates; no warm baseline
“Always-on, predictable load, want max throughput” | Premium | Reserved instances + pre-warmed for steady high load
“Cap fan-out against a fragile downstream” | Flex | Explicit concurrency × max-instances is a hard valve
“Need a hard, low instance ceiling (e.g. max 5)” | Consumption / Premium | Flex floor for max-instance-count is 40
“Windows runtime required” | Consumption / Premium | Flex is Linux only
“C# in-process model, can’t re-target” | Consumption / Premium | Flex requires the isolated worker
“Remove every storage key from config” | Flex | Identity-based AzureWebJobsStorage connection

The C# in-process model is not supported on Flex Consumption — you must be on the isolated worker model (.NET 8 / 9 / 10). There is also no in-place migration in or out: moving to Flex means creating a new app and redeploying. The supported isolated stacks are .NET isolated, Node.js, Python, Java, and PowerShell; check az functionapp list-flexconsumption-locations and the runtime support matrix before you commit.

The supported runtime stacks on Flex, with the --runtime token and the model constraint:

Runtime | --runtime token | Model | Supported on Flex | Note
.NET (isolated) | dotnet-isolated | Isolated worker | Yes (8/9/10) | In-process dotnet is not supported
Node.js | node | n/a | Yes | LTS versions offered per region
Python | python | n/a | Yes | Use --build-remote true for native wheels
Java | java | n/a | Yes | Check version availability per region
PowerShell | powershell | n/a | Yes | For automation/ops workloads
.NET (in-process) | dotnet | In-process | No | Re-target the isolated worker
Windows-only stacks | | | No | Flex is Linux only

Provision a Flex app with subnet delegation

VNet integration on Flex requires a subnet delegated to Microsoft.App/environments, at least /27 in size (use /26 to leave scaling headroom), and the Microsoft.App resource provider registered on the subscription. The portal and CLI enforce the RP registration at create time. Here is the full set of provisioning prerequisites — miss any one and create fails or the app can’t scale:

Prerequisite | Exact value / command | Why it’s required | Gotcha if wrong
Resource provider | az provider register --namespace Microsoft.App | Backs subnet delegation | “RP not registered” at create
Region supports Flex | az functionapp list-flexconsumption-locations | Flex isn’t in every region | Silent fallback / create error
Delegated subnet | --delegations Microsoft.App/environments | Flex joins via App environment | Microsoft.Web/... delegation fails
Subnet size | /27 minimum, /26 recommended | Each instance consumes an IP | Subnet exhaustion caps scale-out
Subnet is empty/dedicated | One subnet per Flex app | Delegation is exclusive | Sharing it breaks integration
Backing storage | Standard_LRS/ZRS, TLS1_2, no public blob | Host metadata + package | Public blob access fails policy
Runtime is isolated | dotnet-isolated, node, python, java, powershell | No in-process C# | In-process never starts
RG=rg-fnflex-prod
LOC=eastus
VNET=vnet-app
SUBNET=snet-func-flex
STORAGE=stfnflexprod$RANDOM

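# 0. Resource group that holds everything below (skip if it already exists)
az group create -n $RG -l $LOC

# 1. Register the provider that backs subnet delegation
az provider register --namespace Microsoft.App --wait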

# 2. Network + a dedicated, delegated subnet (/26 leaves headroom)
az network vnet create -g $RG -n $VNET --address-prefixes 10.40.0.0/16 \
  --subnet-name $SUBNET --subnet-prefixes 10.40.1.0/26
az network vnet subnet update -g $RG --vnet-name $VNET -n $SUBNET \
  --delegations Microsoft.App/environments

# 3. Backing storage account (host metadata + deployment container)
az storage account create -g $RG -n $STORAGE -l $LOC --sku Standard_LRS \
  --allow-blob-public-access false --min-tls-version TLS1_2
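
To see why /26 leaves more headroom than /27, the sketch below counts usable addresses per prefix length. It assumes Azure's standard reservation of five addresses per subnet and, as stated above, one IP per instance; `usable_ips` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: usable IPv4 addresses in an Azure subnet of a given prefix
# length. Azure reserves 5 addresses in every subnet.
usable_ips() {
  prefix=$1
  echo $(( (1 << (32 - prefix)) - 5 ))
}

usable_ips 27    # minimum size
usable_ips 26    # recommended size
```

If each instance consumes one address, the usable count is an upper bound on how far the app can scale out inside that subnet, regardless of the configured maximum instance count.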

The --delegations value is exact — Microsoft.App/environments, not Microsoft.Web/.... This trips up everyone coming from App Service VNet integration. With the subnet ready, create the app and join it to the VNet in one shot:

SUBNET_ID=$(az network vnet subnet show -g $RG --vnet-name $VNET -n $SUBNET --query id -o tsv)

az functionapp create \
  --resource-group $RG \
  --name fn-orders-prod \
  --storage-account $STORAGE \
  --flexconsumption-location $LOC \
  --runtime dotnet-isolated --runtime-version 8.0 \
  --vnet "$VNET" --subnet "$SUBNET"

--flexconsumption-location (not --consumption-plan-location) is what selects the Flex plan. Confirm the region supports it first with az functionapp list-flexconsumption-locations -o table — Flex is not in every region. To attach a VNet to an existing Flex app instead, use az functionapp vnet-integration add -g $RG -n fn-orders-prod --vnet "$VNET" --subnet "$SUBNET". The equivalent in Bicep, which is how you should actually ship this:

resource plan 'Microsoft.Web/serverfarms@2023-12-01' = {
  name: 'plan-fn-orders'
  location: location
  sku: { tier: 'FlexConsumption', name: 'FC1' }
  kind: 'functionapp,linux'
  properties: { reserved: true }
}

resource site 'Microsoft.Web/sites@2023-12-01' = {
  name: 'fn-orders-prod'
  location: location
  kind: 'functionapp,linux'
  identity: { type: 'SystemAssigned' }   // needed by the SystemAssignedIdentity deployment auth below
  properties: {
    serverFarmId: plan.id
    virtualNetworkSubnetId: subnet.id   // the delegated /26
    functionAppConfig: {
      runtime: { name: 'dotnet-isolated', version: '8.0' }
      scaleAndConcurrency: {
        instanceMemoryMB: 2048
        maximumInstanceCount: 120
      }
      deployment: {
        storage: {
          type: 'blobContainer'
          value: '${storage.properties.primaryEndpoints.blob}app-package'
          authentication: { type: 'SystemAssignedIdentity' }
        }
      }
    }
  }
}

The key reference table for create-time arguments — the flags people most often get wrong:

Flag (az functionapp create) | Accepts | Selects / sets | Common mistake
--flexconsumption-location | a Flex region | The Flex plan | Using --consumption-plan-location (picks Consumption)
--runtime | dotnet-isolated/node/python/java/powershell | Worker stack | dotnet (in-process), which is unsupported
--runtime-version | e.g. 8.0, 20, 3.11 | Stack version | Version not offered in the region
--instance-memory | 512 / 2048 / 4096 | Per-instance memory (MB) | Arbitrary value rejected
--maximum-instance-count | 40–1000 | Horizontal ceiling | 5 (below the 40 floor)
--vnet / --subnet | name or ID | VNet integration target | Subnet not delegated to Microsoft.App
--deployment-storage-auth-type | …ConnectionString/UserAssignedIdentity/SystemAssignedIdentity | Package auth | MI lacks Blob Data role

Configure instance memory and maximum instance count

Two knobs govern how big each worker is and how far the app can spread. Memory comes in three sizes; CPU and network bandwidth scale proportionally with it:

Instance memory (MB) | vCPU cores | Network bandwidth | Use for | Cost note
512 | 0.25 | Lowest | High fan-out, light per-request work | Cheapest cores; fits more in the quota
2048 | 1 | Medium | Default for most workloads | The balanced default
4096 | 2 | Highest | CPU/memory-heavy work, large payloads, ML inference | 2 cores each → halves quota headroom

Every instance also gets an extra ~272 MB platform buffer that you are not billed for. Set memory at create time with --instance-memory, or change it later:

# Larger instances for a CPU-bound transform app
az functionapp scale config set -g $RG -n fn-orders-prod --instance-memory 4096

# Cap horizontal scale (40 is the lowest allowed max; 1000 the ceiling)
az functionapp scale config set -g $RG -n fn-orders-prod --maximum-instance-count 120

--maximum-instance-count accepts 40 to 1,000. The floor of 40 surprises people: you cannot pin a Flex app to “max 5 instances.” If you need a hard, low ceiling, Flex is the wrong plan. The scale-config settings and their boundaries:

Setting | Values | Default | When to change | Trade-off | Limit / gotcha
instanceMemoryMB | 512 / 2048 / 4096 | 2048 | CPU-bound → up; high fan-out → down | More memory = more cores = more quota burn | Only three discrete values
maximumInstanceCount | 40–1000 | high (~100s) | Cap against downstream limits | Lower cap = earlier 429s under burst | Floor is 40, so no low ceiling
alwaysReady[group] | 0–N per group | 0 | Latency-critical groups | Warm baseline billing | Min 2 if zone-redundant
perInstanceConcurrency | 1–N (HTTP) | memory-derived | Pin explicitly in prod | Higher = fewer instances but more thrash risk | HTTP-only flag
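
Because both knobs reject out-of-range values, a pipeline can check them before calling the CLI. The sketch below is one possible pre-flight check; `check_scale_config` is a hypothetical helper, and the bounds are the ones quoted in this section.

```shell
#!/bin/sh
# Sketch only: validate instance memory and maximum instance count against
# the bounds described above before passing them to the CLI.
check_scale_config() {
  memory_mb=$1
  max_instances=$2
  case "$memory_mb" in
    512|2048|4096) ;;
    *) echo "instance memory must be 512, 2048, or 4096"; return 1 ;;
  esac
  if [ "$max_instances" -lt 40 ] || [ "$max_instances" -gt 1000 ]; then
    echo "maximum instance count must be between 40 and 1000"
    return 1
  fi
  echo "ok"
}

check_scale_config 2048 120 || true    # valid combination
check_scale_config 4096 5 || true      # below the floor of 40
check_scale_config 1024 120 || true    # not one of the three memory sizes
```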

Mind the regional subscription quota: every Flex app in a subscription+region shares a default budget of 250 cores (512,000 MB). Cores are instances × cores-per-instance, so a single 4096-MB app maxes out the default quota at 125 instances (125 × 2). Always-ready instances count against it; scaled-to-zero apps do not. Request an increase via support before you plan for thousands of large instances. The quota math worked through, so you can see the ceiling your --maximum-instance-count actually hits:

Instance memory | Cores / instance | Max instances at 250-core quota | Effective ceiling vs your --maximum-instance-count
512 MB | 0.25 | 1,000 | Quota never binds before the 1,000 hard cap
2048 MB | 1 | 250 | Quota binds if you set max-count > 250
4096 MB | 2 | 125 | Quota binds if you set max-count > 125
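
The same quota arithmetic can be expressed in memory terms, which avoids fractional cores. The sketch below assumes the default 512,000 MB regional budget and a single app consuming all of it; `effective_ceiling` is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: the real scale ceiling is the smaller of the configured
# maximum instance count and what the regional quota allows.
# Assumes the default quota (512,000 MB) and one app using all of it.
QUOTA_MB=512000

effective_ceiling() {
  memory_mb=$1
  configured_max=$2
  quota_max=$(( QUOTA_MB / memory_mb ))
  if [ "$configured_max" -lt "$quota_max" ]; then
    echo "$configured_max"
  else
    echo "$quota_max"                    # quota binds first
  fi
}

effective_ceiling 4096 300    # quota binds
effective_ceiling 2048 120    # configured max binds
effective_ceiling 512 1000    # both agree at the hard cap
```

Other Flex apps in the same subscription and region draw on the same budget, so the real headroom is usually lower than this single-app figure.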

The limits and quotas you will actually hit on Flex, with the real numbers:

Limit / quota | Value | Scope | What hitting it looks like | How to raise
Regional core quota | 250 cores / 512,000 MB (default) | Subscription + region | Scale stalls below --maximum-instance-count | Support request
Max instance count | 1,000 | Per app | Hard horizontal ceiling | Cannot exceed
Floor for max instance count | 40 | Per app | Can’t pin a low ceiling | Use a different plan
Instance memory choices | 512 / 2048 / 4096 MB | Per app | Other values rejected | Fixed set
Subnet size | /27 min (/26 recommended) | Integration subnet | IP exhaustion caps scale-out | Larger subnet at create
Always-ready min (zone-redundant) | 2 per group | Per scale group | Single warm instance rejected | n/a (by design)
Always-ready min (non-zonal) | 1 per group | Per scale group | | Raise to 2 when enabling AZ
Platform memory buffer | ~272 MB / instance | Per instance | Extra unbilled headroom | Not part of your memory size
Execution billing minimum | 1,000 ms, then 100 ms rounding | Per execution | Short calls cost a 1s floor | n/a
Deployment package source | one blob container | Per app | One-deploy pulls from it on start |

Per-instance concurrency: HTTP and non-HTTP triggers

This is the single most impactful tuning lever on Flex. Concurrency is how many parallel executions each instance handles. Set it too high and instances thrash under memory pressure; set it too low and you scale out (and bill) more instances than you need.

Flex groups functions into scale groups that scale together: all HTTP/SignalR triggers (http), Event Grid blob triggers (blob), and Durable orchestration/activity/entity triggers (durable). Everything else scales individually as function:<NAME>. Know which group a trigger lands in, because concurrency and always-ready apply per group:

Trigger | Scale group | Concurrency mechanism | Notes
HTTP / SignalR | http | perInstanceConcurrency flag | The only type valid for that flag
Event Grid blob | blob | Target-based (host.json) | Event-Grid-sourced blob events
Durable orchestrator / activity / entity | durable | Target-based + Durable settings | One group for all Durable functions
Service Bus queue / topic | function:<NAME> | Target-based (serviceBus in host.json) | Scales individually
Event Hubs | function:<NAME> | Target-based (partition-bound) | Bounded by partition count
Storage Queue | function:<NAME> | Target-based (queues in host.json) | batchSize + newBatchThreshold
Timer | function:<NAME> | n/a (single execution) | One instance per fire
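
The grouping rule can be summarised as a lookup. In the sketch below, the trigger labels (`http`, `signalr`, `blob-eventgrid`, `durable`) and the function names are hypothetical tokens chosen for illustration; only the resulting group names follow the table above.

```shell
#!/bin/sh
# Sketch only: map a trigger kind to the scale group it lands in.
# Trigger labels are illustrative, not official identifiers.
scale_group() {
  trigger=$1
  function_name=$2
  case "$trigger" in
    http|signalr)   echo "http" ;;
    blob-eventgrid) echo "blob" ;;
    durable)        echo "durable" ;;
    *)              echo "function:$function_name" ;;   # scales individually
  esac
}

scale_group http Checkout
scale_group durable OrderSaga
scale_group servicebus ProcessPayment
```

The output is the same key you would pass when reserving always-ready capacity for that group or function.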

HTTP concurrency is set explicitly and, once set, is honored regardless of instance memory size:

# Each instance handles up to 10 concurrent HTTP executions before
# the scale controller adds another instance.
az functionapp scale config set -g $RG -n fn-orders-prod \
  --trigger-type http --trigger-settings perInstanceConcurrency=10

http is the only trigger type valid for perInstanceConcurrency. The default HTTP concurrency is derived from instance memory when you do not set it — bigger instances default higher. Pin it explicitly in production so a later memory change doesn’t silently shift your scale math. How the choice plays out:

perInstanceConcurrency | Effect on scale-out | Effect per instance | Pick when
Low (e.g. 1–4) | Scales out aggressively (more instances) | Light load per worker, low thrash | Heavy per-request CPU/memory; isolation matters
Medium (e.g. 8–16) | Balanced | Good utilization | Typical I/O-bound APIs
High (e.g. 24–64) | Scales out reluctantly (fewer instances) | Dense, risk of memory pressure | Light, async, high-fan-out handlers with memory headroom
Unset (memory-derived) | Drifts when you change memory | Implicit | Never in production; pin it

For non-HTTP triggers (Service Bus, Event Hubs, Storage Queue), concurrency is governed by target-based scaling through host.json, not the CLI flag above. You tune the batch/concurrency knobs of the binding and the runtime computes a target instance count from queue depth:

{
  "version": "2.0",
  "extensions": {
    "serviceBus": {
      "maxConcurrentCalls": 16,
      "maxConcurrentSessions": 8,
      "prefetchCount": 32
    },
    "queues": {
      "batchSize": 16,
      "newBatchThreshold": 8
    }
  }
}

For a queue trigger, target-based scaling computes desired instances as roughly messages ÷ (batchSize + newBatchThreshold). Lowering batchSize makes the app scale out more aggressively per message backlog; raising it packs more work onto each instance. Tune this against downstream throughput limits (database connection pools, third-party API rate caps) — uncontrolled fan-out is how you DDoS your own backend. The host.json concurrency knobs that matter for target-based scaling:

host.json setting | Binding | Default | Raise it to… | Lower it to… | Gotcha
serviceBus.maxConcurrentCalls | Service Bus (no sessions) | 16 | Pack more per instance | Throttle downstream | Per-instance, not global
serviceBus.maxConcurrentSessions | Service Bus (sessions) | 8 | Handle more sessions/instance | Preserve ordering pressure | Session-bound
serviceBus.prefetchCount | Service Bus | 0 | Cut receive latency | Reduce lock churn | Prefetch holds locks
queues.batchSize | Storage Queue | 16 | Fewer, denser instances | Scale out per backlog | Max 32; with threshold drives target
queues.newBatchThreshold | Storage Queue | batchSize/2 | Fetch next batch sooner | | Adds to scale denominator
eventHubs.maxEventBatchSize | Event Hubs | varies | Bigger batches | Lower memory/latency | Scale also bounded by partitions
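
The queue formula above, messages ÷ (batchSize + newBatchThreshold), can be tried with a few numbers. This is a sketch of that approximation only, not the runtime's exact logic; `queue_target_instances` is a hypothetical name and the result is not clamped to the maximum instance count.

```shell
#!/bin/sh
# Sketch only: approximate target instance count for a Storage Queue
# trigger as ceil(messages / (batchSize + newBatchThreshold)).
queue_target_instances() {
  messages=$1
  batch_size=$2
  new_batch_threshold=$3
  per_instance=$(( batch_size + new_batch_threshold ))
  echo $(( (messages + per_instance - 1) / per_instance ))
}

queue_target_instances 2400 16 8    # defaults: 24 messages per instance
queue_target_instances 2400 8 4     # halved batch: twice the instances
```

Halving the batch settings doubles the fan-out for the same backlog, which is why these values should be chosen with the downstream's capacity in mind.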

The backpressure ceiling is the product of the two halves — size it against the weakest downstream:

Downstream constraint | Concurrency | Max instances | In-flight ceiling | Safe vs the constraint?
DB pool = 200 connections | 4 | 40 | 4 × 40 = 160 | Yes: 160 < 200
DB pool = 200 connections | 10 | 40 | 10 × 40 = 400 | No: pool exhausts
Partner API = 100 req/s cap | 2 | 50 | 2 × 50 = 100 | At the edge: add margin
No hard downstream limit | 16 | 120 | 16 × 120 = 1,920 | Bounded only by quota/latency

Always-ready instances to kill cold starts

On-demand instances cold-start. For latency-critical paths — a synchronous checkout API, a webhook with a tight SLA — reserve always-ready instances that stay warm and take traffic first. The platform only spins up on-demand instances after the always-ready pool is saturated.

# Keep 3 warm instances for the HTTP group
az functionapp scale config always-ready set -g $RG -n fn-orders-prod \
  --settings http=3

# Mix: warm Durable group + warm a single hot function
az functionapp scale config always-ready set -g $RG -n fn-orders-prod \
  --settings durable=2 function:ProcessPayment=2

At create time the equivalent is --always-ready-instances http=3. Remove reservations with az functionapp scale config always-ready delete -g $RG -n fn-orders-prod --setting-names http function:ProcessPayment. What you can reserve, and the syntax for each:

Always-ready target | Syntax | Covers | When to use
HTTP group | http=N | All HTTP/SignalR triggers | Synchronous APIs with a latency SLA
Durable group | durable=N | All Durable orchestrators/activities/entities | Orchestrations that must start instantly
Blob group | blob=N | Event Grid blob triggers | Latency-sensitive blob processing
Single function | function:<NAME>=N | One named function | One hot function inside a larger app
Remove a reservation | --setting-names <group> (delete verb) | Frees the warm pool | Scaling reservation down to zero

Two things to internalize. First, billing: always-ready instances bill a baseline for provisioned memory continuously, plus execution memory while running, with no free grant — this is the Premium-style cost, scoped to only the instances you reserve. Reserve the minimum that holds your steady-state concurrency. Second, zone redundancy: if you enable availability zones, the minimum always-ready count per group is 2, not 1, so the warm pool survives a zone outage. How to size the warm pool against the burst you actually get:

Scenario | Steady-state concurrent reqs | Concurrency | Always-ready to reserve | Cold start exposure
Flat low traffic, tight SLA | ~20 | 10 | http=2 | None at steady state
Predictable diurnal peak | ~120 at peak | 24 | http=5 (peak ÷ 24, rounded) | Only above peak
Spiky flash-sale burst | base 50, spike 1,800 | 24 | http=6 (covers the leading edge) | On-demand absorbs the tail
Zone-redundant, any load | any | any | min 2 per group | Survives one zone down
No latency SLA (async) | any | any | 0 (let it scale from zero) | Accepted; cheapest

A worked sizing rule: always-ready instances ≈ ceil(steady-state concurrent requests ÷ perInstanceConcurrency). Reserve that, let on-demand take everything above it, and the warm pool pays the cold-start cost once at deploy — never on a user request.
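
The sizing rule, together with the zone-redundancy floor mentioned earlier, can be sketched as follows. `always_ready_size` is a hypothetical helper; the third argument is an illustrative flag for whether availability zones are enabled.

```shell
#!/bin/sh
# Sketch only: always-ready instances =
# ceil(steady-state concurrent requests / perInstanceConcurrency),
# raised to 2 when the app is zone-redundant.
always_ready_size() {
  steady=$1
  per_instance=$2
  zone_redundant=$3
  n=$(( (steady + per_instance - 1) / per_instance ))
  if [ "$zone_redundant" = "true" ] && [ "$n" -lt 2 ]; then
    n=2                                  # per-group minimum with zones on
  fi
  echo "$n"
}

always_ready_size 20 10 false     # flat low traffic
always_ready_size 120 24 false    # diurnal peak
always_ready_size 10 24 true      # small load, zone-redundant
```

The result is the value you would pass as `http=N` (or another group key) when setting always-ready capacity; a burst-heavy workload may justify reserving more than this steady-state figure.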

Not every trigger needs a warm pool. Match cold-start sensitivity to trigger shape before you spend on always-ready:

Trigger type | Cold-start sensitivity | Reserve always-ready? | Why
Synchronous HTTP with latency SLA | High | Yes (http=N) | A user/acquirer is blocked on the response
Durable orchestration (must start fast) | High | Yes (durable=N) | Start latency is visible to the caller
Webhook with a tight timeout | High | Yes (http=N) | Caller retries on slow start → amplification
Service Bus / queue (async backlog) | Low | Usually no | A few seconds of warm-up is invisible to a backlog
Event Hubs stream processing | Low–Medium | Rarely | Throughput matters more than first-call latency
Timer / scheduled batch | None | No | Nobody is waiting on the first execution

Deploy with one-deploy and managed-identity storage

Flex has exactly one deployment path: build, zip, push the package to a blob container. The app pulls and runs from that package on startup. No WEBSITE_RUN_FROM_PACKAGE gymnastics — that behavior is built in.

# Build + zip your project, then one-deploy it
func azure functionapp publish fn-orders-prod
# or push a prebuilt package and run the build remotely on the platform:
az functionapp deployment source config-zip \
  -g $RG -n fn-orders-prod --src ./app.zip --build-remote true

--build-remote true runs Oryx build (restore/compile) on the platform — use it for Python/Node where native wheels must match the Linux host. For precompiled .NET isolated output, ship the built artifact and skip remote build. The deployment options compared:

| Path | Command | Build runs | Use for | Watch-out |
| --- | --- | --- | --- | --- |
| Core Tools publish | func azure functionapp publish | Local | Quick local→cloud loop | Local toolchain must match runtime |
| Zip + remote build | config-zip --build-remote true | Platform (Oryx) | Python/Node native deps | Slower first deploy |
| Zip + prebuilt | config-zip (no remote build) | Pre-done | Precompiled .NET isolated | Artifact must be Linux-correct |
| CI/CD (Bicep + zip) | pipeline pushes to container | Pipeline | Reproducible prod deploys | Identity needs Blob Data role |

The security upgrade is removing storage secrets entirely. By default the host talks to storage via a connection string in AzureWebJobsStorage. Replace it with an identity-based connection so no key ever lands in app settings:

# Assign a user-assigned identity and grant it data-plane access to storage
UAMI_ID=$(az identity show -g $RG -n id-fn-orders --query id -o tsv)
UAMI_CLIENT=$(az identity show -g $RG -n id-fn-orders --query clientId -o tsv)
STORAGE_ID=$(az storage account show -g $RG -n $STORAGE --query id -o tsv)

az functionapp identity assign -g $RG -n fn-orders-prod --identities "$UAMI_ID"

# Host needs Blob + Queue data roles, plus Storage Account Contributor, on the backing account
for ROLE in "Storage Blob Data Owner" "Storage Queue Data Contributor" "Storage Account Contributor"; do
  az role assignment create --assignee "$UAMI_CLIENT" --role "$ROLE" --scope "$STORAGE_ID"
done

# Swap the connection string for an identity-based connection
az functionapp config appsettings set -g $RG -n fn-orders-prod --settings \
  "AzureWebJobsStorage__accountName=$STORAGE" \
  "AzureWebJobsStorage__credential=managedidentity" \
  "AzureWebJobsStorage__clientId=$UAMI_CLIENT" && \
az functionapp config appsettings delete -g $RG -n fn-orders-prod \
  --setting-names AzureWebJobsStorage

The __accountName syntax is specific to AzureWebJobsStorage. Omit __clientId and Flex falls back to the system-assigned identity (use az functionapp identity assign -g $RG -n fn-orders-prod with no --identities). The exact roles the host needs on the backing storage account, and what each one is for:

| Role | Scope | What the host uses it for | Omit it and… |
| --- | --- | --- | --- |
| Storage Blob Data Owner | Backing account | Host metadata, lease blobs, package read | Host can’t start / scale |
| Storage Queue Data Contributor | Backing account | Internal control queues | Queue-driven scale breaks |
| Storage Account Contributor | Backing account | Management-plane ops the host performs | Some host operations fail |
| Storage Blob Data Contributor | Deployment account/container | Read/write the deployment package | Package pull fails → app won’t run |
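Because a missing role looks like a generic "app down", it can help to diff the roles you expect against the roles actually assigned. The following is a minimal sketch, not an official check: `missing_roles` is a hypothetical helper that reads assigned role names (one per line) on stdin and prints every required role that is absent. The sample input is hard-coded here; in practice it might come from `az role assignment list ... --query "[].roleDefinitionName" -o tsv`.

```shell
#!/usr/bin/env bash
# missing_roles is a hypothetical helper: prints each required role (passed as
# arguments) that does not appear in the newline-separated list on stdin.
missing_roles() {
  local assigned role
  assigned=$(cat)
  for role in "$@"; do
    # -x matches the whole line, -F treats the role name as a literal string.
    if ! printf '%s\n' "$assigned" | grep -qxF -- "$role"; then
      echo "$role"
    fi
  done
}

# Sample input standing in for the output of:
#   az role assignment list --assignee "$UAMI_CLIENT" --scope "$STORAGE_ID" \
#     --query "[].roleDefinitionName" -o tsv
printf '%s\n' "Storage Blob Data Owner" "Storage Queue Data Contributor" |
  missing_roles "Storage Blob Data Owner" \
                "Storage Queue Data Contributor" \
                "Storage Account Contributor"
# prints: Storage Account Contributor
```

Whole-line matching matters here: "Storage Blob Data Owner" and "Storage Blob Data Contributor" share a prefix, and a substring match would hide a missing role.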

For the deployment container specifically, you can authenticate the same way at create time:

az functionapp create -g $RG -n fn-orders-prod --storage-account $STORAGE \
  --runtime dotnet-isolated --runtime-version 8.0 --flexconsumption-location $LOC \
  --deployment-storage-name $STORAGE \
  --deployment-storage-container-name app-package \
  --deployment-storage-auth-type UserAssignedIdentity \
  --deployment-storage-auth-value "$UAMI_ID"

--deployment-storage-auth-type accepts StorageAccountConnectionString, UserAssignedIdentity, or SystemAssignedIdentity. The identity needs Storage Blob Data Contributor on the deployment account. The three identity-connection app-setting keys, decoded:

| App setting | Value | Meaning | Default if omitted |
| --- | --- | --- | --- |
| AzureWebJobsStorage__accountName | the storage account name | Target account (no key) | Falls back to connection string |
| AzureWebJobsStorage__credential | managedidentity | Use a managed identity | Connection-string mode |
| AzureWebJobsStorage__clientId | the UAMI client ID | Which user-assigned identity | System-assigned identity |

Private endpoints, Key Vault references, and outbound lockdown

VNet integration handles outbound traffic. To lock down inbound access to your dependencies, pair it with private endpoints and disable public network access on each backing resource.

# Private endpoint for the storage blob service
az network private-endpoint create -g $RG -n pe-st-blob \
  --vnet-name $VNET --subnet snet-pe \
  --private-connection-resource-id "$STORAGE_ID" \
  --group-id blob --connection-name conn-st-blob

# Force all storage traffic through the private path
az storage account update -g $RG -n $STORAGE --public-network-access Disabled

For the function app to resolve *.privatelink.blob.core.windows.net to the private IP through its VNet, ensure the integration subnet’s VNet is linked to the relevant Private DNS zones. Without that DNS link the app resolves the public IP and the endpoint is bypassed. The zones you must link, per dependency:

| Dependency | Private DNS zone to link | Group ID (--group-id) | Symptom if zone is unlinked |
| --- | --- | --- | --- |
| Blob storage | privatelink.blob.core.windows.net | blob | App reads/writes over public IP |
| Queue storage | privatelink.queue.core.windows.net | queue | Control queues bypass PE |
| Table storage | privatelink.table.core.windows.net | table | Table ops bypass PE |
| File storage | privatelink.file.core.windows.net | file | File share bypasses PE |
| Key Vault | privatelink.vaultcore.azure.net | vault | Secret pull over public IP |
| Service Bus | privatelink.servicebus.windows.net | namespace | Messaging bypasses PE |
| SQL Database | privatelink.database.windows.net | sqlServer | DB traffic over public IP |
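When you test resolution from inside the VNet, the question is simply whether the address you got back is private. A small sketch of that check follows; `is_private_ip` is a hypothetical helper that only recognises the RFC 1918 IPv4 ranges, and both sample addresses are made up for illustration.

```shell
#!/usr/bin/env bash
# is_private_ip is a hypothetical helper: returns 0 when the IPv4 address is in
# an RFC 1918 range (10/8, 172.16/12, 192.168/16), 1 otherwise.
is_private_ip() {
  local ip=$1 a b rest
  IFS=. read -r a b rest <<< "$ip"
  case "$a" in
    10)  return 0 ;;
    172) if [ "$b" -ge 16 ] && [ "$b" -le 31 ]; then return 0; fi ;;
    192) if [ "$b" -eq 168 ]; then return 0; fi ;;
  esac
  return 1
}

# Illustrative addresses only; substitute what nslookup returns inside the VNet.
for ip in 10.50.2.4 20.60.14.132; do
  if is_private_ip "$ip"; then
    echo "$ip private"
  else
    echo "$ip PUBLIC (private endpoint likely bypassed)"
  fi
done
```

Run the lookup from a machine inside or peered to the integration VNet; a laptop on the public internet will always see the public record and tells you nothing.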

Pull secrets from Key Vault behind its own private endpoint via Key Vault references — the secret value is never stored in app settings:

az functionapp config appsettings set -g $RG -n fn-orders-prod --settings \
  "DbConnection=@Microsoft.KeyVault(SecretUri=https://kv-orders.vault.azure.net/secrets/db-conn/)"

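Grant the app’s managed identity Key Vault Secrets User on the vault. If that identity is user-assigned, also set the app’s keyVaultReferenceIdentity property to its resource ID, because Key Vault references use the system-assigned identity by default. To force all outbound through the VNet (so it can traverse a firewall or NAT gateway and the resolver sees private records), set vnetRouteAllEnabled: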

az resource update -g $RG --namespace Microsoft.Web --resource-type sites \
  --name fn-orders-prod --set properties.vnetRouteAllEnabled=true

The outbound-networking settings and what each one controls:

| Setting / control | What it does | Default | Set it when |
| --- | --- | --- | --- |
| virtualNetworkSubnetId | Binds outbound to the delegated subnet | unset | Always, for VNet integration |
| vnetRouteAllEnabled | Routes all outbound through the VNet | false | Must traverse firewall/NAT or see private DNS |
| Private DNS zone link | Resolves privatelink.* to private IP | unlinked | Any private endpoint dependency |
| --public-network-access Disabled (on dep) | Blocks public inbound to the dependency | Enabled | Lock the dependency to the private path |
| Key Vault reference | @Microsoft.KeyVault(SecretUri=...) | none | Keep secrets out of app settings |
| NAT gateway on the subnet | Deterministic, large SNAT pool | none | Chatty egress to a single destination |

Architecture at a glance

The diagram traces a request through Flex the way it actually flows, then maps each scaling and private-path failure onto the exact hop where it bites. Read it left to right. A trigger — an HTTP client on 443 or a Service Bus / Event Hubs message — arrives and signals the scale controller. The controller routes to the always-ready pool first (warm, with a minimum of 2 per group when zone-redundant); only when that pool saturates does it spin up on-demand instances (512 / 2048 / 4096 MB), which cold-start on their first request. The perInstanceConcurrency knob is the denominator that decides how many instances the controller adds, and together with maximum-instance-count forms the backpressure ceiling. Outbound from the workers goes through the VNet integration zone — a delegated /26 subnet plus the Private DNS zones that resolve privatelink.* to private IPs — into the private dependencies: the AzureWebJobsStorage account and Key Vault behind private endpoints, authenticated by a user-assigned identity so no key is ever in app settings.

Five numbered badges mark where this breaks. (1) A burst outruns the warm pool and on-demand cold-starts on the hot path. (2) The concurrency × max-instances ceiling throttles with 429s before scaling further. (3) A missing Private DNS link makes the app resolve the public IP and silently bypass the private endpoint. (4) The identity loses its Storage data role and the host can’t even read the package to start. (5) The regional 250-core quota is exhausted and scale stalls below your configured max. The observability zone on the right — Application Insights for the 429/p95 KQL and the core-quota tool — is how you tell badge (1) apart from badge (2): throttle rate climbing while p95 stays flat means you were capped; throttle spiking alongside a p95 spike at the burst edge means you were cold. That single distinction is the whole diagnostic method for serverless scale.

Figure: Azure Functions Flex Consumption private-path architecture. An HTTP client on 443 and a Service Bus/Event Hubs trigger signal the Flex scale controller, which routes first to a warm always-ready pool (min 2 per group when zone-redundant) and spills onto cold-starting on-demand instances of 512/2048/4096 MB, governed by per-instance concurrency and a maximum-instance-count of up to 1000. Outbound flows through a VNet integration zone (a delegated /26 subnet and privatelink Private DNS zones) into the private dependencies: the AzureWebJobsStorage account and Key Vault behind private endpoints, authenticated by a user-assigned managed identity. Application Insights and the regional 250-core quota tool sit on the right. Five numbered failure badges mark cold start on the hot path, concurrency-cap throttling, an unlinked Private DNS zone bypassing the endpoint, the identity losing its storage data role, and the core quota stalling scale.

Real-world scenario

Solvix Payments runs a synchronous card-authorization API. It had lived on the Linux Consumption plan until a Black Friday incident exposed two structural problems at once. First, cold starts pushed p99 past their acquirer’s 800 ms timeout; the acquirer retried, and the retries stampeded a backend whose PostgreSQL pool capped at 200 connections. Second, and the reason they could not simply stay on Consumption, the auth function had to reach an on-prem fraud-scoring service over a private ExpressRoute path, which Consumption cannot do because it has no VNet integration. The team is six engineers; the workload averages 50 concurrent authorizations with a flash-sale spike to ~1,800, and the hard rule from the DBA was simple: total in-flight executions must never exceed the 200-connection pool, regardless of the incoming spike.

The first instinct on the bridge was to scale the Consumption plan “bigger” — but Consumption gives you no such knob, and even on Premium the cold-start-on-scale-out problem and the lack of a fan-out cap would have remained. They moved to Flex Consumption and solved all three constraints with three coordinated settings, deployed as Bicep and reviewed in a PR.

VNet integration over a delegated Microsoft.App/environments subnet gave them the private route to on-prem — the thing Consumption could never do. They reserved always-ready instances to absorb the burst’s leading edge so the acquirer never saw a cold start on the hot path. And crucially, they capped fan-out by pinning per-instance concurrency and max instances so total in-flight executions could never exceed the database pool:

# 6 warm instances x 24 concurrency = 144 steady-state in-flight,
# hard-capped at 8 instances so peak <= 192 < the 200-conn pool.
az functionapp scale config always-ready set -g rg-payments -n fn-auth \
  --settings http=6
az functionapp scale config set -g rg-payments -n fn-auth \
  --trigger-type http --trigger-settings perInstanceConcurrency=24
az functionapp scale config set -g rg-payments -n fn-auth \
  --maximum-instance-count 8 --instance-memory 2048

The result: p99 dropped under 300 ms because the warm pool never cold-started on the hot path, and the explicit concurrency × max-instances ceiling made backend overload structurally impossible — the function throttled with 429s at the edge (which the acquirer handled gracefully) long before the database pool exhausted. The next flash sale ran at 1,900 rps with zero backend-pool incidents. The always-ready baseline added a predictable, small monthly cost (6 warm 2048-MB instances), which the team accepted as the price of the SLA — far cheaper than a fully always-on Premium plan sized for peak.

The migration as a before/after, because the shape of the fix is the lesson:

| Dimension | Before (Linux Consumption) | After (Flex Consumption) |
| --- | --- | --- |
| Private path to on-prem | Impossible (no VNet) | VNet integration over ExpressRoute |
| Cold start on hot path | Unbounded, tripped 800 ms timeout | http=6 warm → p99 < 300 ms |
| Fan-out cap | None → stampeded 200-conn pool | 24 × 8 = 192 hard ceiling |
| Behavior at overload | Backend pool exhaustion | 429 at the edge, acquirer retries gracefully |
| Storage secret | Connection string in settings | Identity-based connection, no key |
| Billing shape | Execution only, but un-tunable | Execution + small warm baseline |

The lesson that generalizes: on Flex, concurrency and max-instance-count are not just performance knobs, they are a backpressure mechanism. Size them against your weakest downstream dependency, not against incoming traffic — and reserve always-ready only on the hot path, not everywhere.
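The backpressure rule reduces to one multiplication and one comparison, which makes it a natural pre-deploy gate. The sketch below assumes nothing about Azure itself: `fanout_ceiling` and `check_backpressure` are hypothetical helper names, and the downstream limit is whatever your weakest dependency can take (here the scenario's 200-connection pool).

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the backpressure check; not Azure tooling.
fanout_ceiling() {
  # perInstanceConcurrency * maximum-instance-count
  echo $(( $1 * $2 ))
}

check_backpressure() {
  local concurrency=$1 max_instances=$2 downstream_limit=$3
  local ceiling
  ceiling=$(fanout_ceiling "$concurrency" "$max_instances")
  if [ "$ceiling" -le "$downstream_limit" ]; then
    echo "OK: ceiling $ceiling <= downstream limit $downstream_limit"
  else
    echo "UNSAFE: ceiling $ceiling > downstream limit $downstream_limit"
    return 1
  fi
}

check_backpressure 24 8 200            # the scenario's settings: 192 <= 200
check_backpressure 24 40 200 || true   # same concurrency, 40 instances: 960 > 200
```

Wiring a check like this into the PR pipeline that reviews the Bicep means a later change to either knob cannot silently push the product past the pool.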

Advantages and disadvantages

The “serverless-but-tunable” model both unlocks production serverless and introduces knobs that are wrong by default. Weigh it honestly:

| Advantages (why Flex helps you) | Disadvantages (why it bites) |
| --- | --- |
| Scale-to-zero economics plus VNet integration: no Premium tax for a private network | Linux only; no in-place migration in or out (create new app + redeploy) |
| Always-ready instances kill cold starts on exactly the groups that need it | Always-ready bills a continuous baseline with no free grant; forget one and it costs |
| Deterministic scale controller: instances = traffic ÷ concurrency, not a black box | The maximum-instance-count floor is 40, so you can’t pin a low ceiling |
| Concurrency × max-instances is a hard backpressure valve against fragile downstreams | Mis-set high, that same product becomes a self-inflicted DDoS on your backend |
| Identity-based storage connection removes the last storage key from config | The host needs several data-plane roles; miss one and the app won’t even start |
| Selectable memory (512/2048/4096) right-sizes cores to the workload | Only three discrete sizes; the 250-core regional quota binds large fleets |
| Private endpoints + Key Vault references make a genuinely private serverless app | DNS zone links are easy to forget → traffic silently goes public |

Flex is the right model for serverless that must reach a private network, hold a latency SLA, or cap fan-out — and still scale to zero when idle. It is the wrong model when you need a Windows runtime, a hard low instance ceiling, or you have no private/latency requirement at all (plain Consumption is cheaper and simpler). The disadvantages are all manageable — but only if you know they exist, which is the point of this article.

Hands-on lab

Provision a Flex app with VNet integration, pin concurrency and a low-ish max-instance cap, reserve a warm instance, and prove the scale and VNet-integration settings are actually in effect. Run in Cloud Shell (Bash). This uses real (billed) resources, so delete the resource group at the end; an hour costs a few rupees of always-ready baseline plus storage.

Step 1 — Variables, RP registration, and resource group.

RG=rg-fnflex-lab
LOC=eastus
VNET=vnet-lab
SUBNET=snet-flex
STORAGE=stflexlab$RANDOM
APP=fn-lab-$RANDOM
az group create -n $RG -l $LOC -o table
az provider register --namespace Microsoft.App --wait

Step 2 — Confirm the region offers Flex, then build the delegated subnet.

az functionapp list-flexconsumption-locations -o table | grep -i $LOC   # must appear
az network vnet create -g $RG -n $VNET --address-prefixes 10.50.0.0/16 \
  --subnet-name $SUBNET --subnet-prefixes 10.50.1.0/26
az network vnet subnet update -g $RG --vnet-name $VNET -n $SUBNET \
  --delegations Microsoft.App/environments

Expected: the subnet update returns JSON with delegations[0].serviceName = Microsoft.App/environments.
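To see why the lab uses a /26 rather than the /27 minimum, it helps to count the addresses each leaves you. This is a rough sketch: `usable_ips` is a hypothetical helper, and it only subtracts the five addresses Azure reserves in every subnet; it does not model how many addresses a given Flex fleet consumes.

```shell
#!/usr/bin/env bash
# usable_ips is a hypothetical helper. Azure reserves 5 addresses per subnet,
# so usable = 2^(32 - prefix) - 5.
usable_ips() {
  local prefix=$1
  echo $(( (1 << (32 - prefix)) - 5 ))
}

usable_ips 26   # 64 - 5 = 59
usable_ips 27   # 32 - 5 = 27
```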

Step 3 — Backing storage and the Flex app, joined to the VNet.

az storage account create -g $RG -n $STORAGE -l $LOC --sku Standard_LRS \
  --allow-blob-public-access false --min-tls-version TLS1_2
az functionapp create -g $RG -n $APP --storage-account $STORAGE \
  --flexconsumption-location $LOC --runtime dotnet-isolated --runtime-version 8.0 \
  --vnet "$VNET" --subnet "$SUBNET" -o table

Expected: a function app row; kind contains functionapp,linux.

Step 4 — Pin the scale math: memory, max instances, concurrency, one warm instance.

az functionapp scale config set -g $RG -n $APP --instance-memory 2048 --maximum-instance-count 40
az functionapp scale config set -g $RG -n $APP \
  --trigger-type http --trigger-settings perInstanceConcurrency=10
az functionapp scale config always-ready set -g $RG -n $APP --settings http=1

(Note: http=1 is fine for a non-zone-redundant lab; production with AZ enabled requires a minimum of 2.)

Step 5 — Verify every layer is actually in effect, not just configured.

az functionapp scale config show -g $RG -n $APP -o jsonc          # memory, max-count, concurrency
az functionapp scale config show -g $RG -n $APP --query alwaysReady -o table   # http=1 present
az functionapp vnet-integration list -g $RG -n $APP -o table      # bound to snet-flex

Expected: scale config shows instanceMemoryMB: 2048, maximumInstanceCount: 40, HTTP concurrency 10; always-ready lists http=1; VNet integration lists the delegated subnet. The lab steps mapped to what each proves:

| Step | What you did | What it proves | Real-world analogue |
| --- | --- | --- | --- |
| 2 | Delegate the subnet to Microsoft.App | The delegation is exact and required | First Flex VNet setup of any team |
| 3 | --flexconsumption-location + --vnet | Flex is selected and VNet-joined in one shot | Production provisioning |
| 4 | Pin memory / max-count / concurrency / warm | The scale math is explicit, not implicit | Hardening before a launch |
| 5 | Read it back with scale config show | “Configured” ≠ “in effect”; verify both | The 90-second pre-incident check |

Step 6 — Cleanup (stop the always-ready baseline and storage charges).

az group delete -n $RG --yes --no-wait

Cost note. The only non-trivial charge in this lab is the single always-ready instance’s memory baseline (pennies per hour at 2048 MB) plus storage. Deleting the resource group stops everything.

Common mistakes & troubleshooting

This is the playbook, the part you bookmark. It comes first as a scannable table you can read mid-incident, then as detail on the entries that bite hardest. The unifying diagnostic is the 429 fork: capped (instance/quota ceiling) versus cold (burst outran the warm pool).

| # | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix |
| --- | --- | --- | --- | --- |
| 1 | 429s under burst; throttle rate climbs while p95 stays flat | App-level cap: concurrency × max-instances ceiling, or core quota | App Insights KQL (throttle vs instance count); Diagnose & solve → Flex Quota | Raise --maximum-instance-count / concurrency, or request quota |
| 2 | 429s + p95 spike at the start of a burst, settling after | Cold-start cascade: burst outran the warm pool | InstanceCount climbing from zero at burst edge; p95 spike | Add always-ready sized to the burst leading edge |
| 3 | App reads/writes storage over the public IP despite a PE | Private DNS zone not linked to the VNet | nslookup <acct>.blob.core.windows.net from a peered VM returns public A record | Link privatelink.blob/queue/... zones to the VNet |
| 4 | App won’t start; host errors on storage | UAMI missing a Storage data-plane role | az role assignment list --assignee <clientId> --scope <storageId> empty | Grant Blob Data Owner + Queue Data Contributor |
| 5 | Scale stalls below your --maximum-instance-count | Regional 250-core quota exhausted | Diagnose & solve → Flex Consumption Quota tool | Request a core-quota increase via support |
| 6 | create fails with a delegation/RP error | Subnet not delegated to Microsoft.App, or RP unregistered | az network vnet subnet show --query delegations; az provider show -n Microsoft.App | Delegate to Microsoft.App/environments; register the RP |
| 7 | Scale-out plateaus early under load | Integration subnet too small (IP exhaustion) | Subnet /27 with many instances; available IPs near zero | Recreate with a larger subnet (/26 or bigger) |
| 8 | C# app never starts on Flex | In-process model deployed (unsupported) | Worker logs; project targets in-process | Re-target the isolated worker (.NET 8/9/10) |
| 9 | Key Vault reference resolves empty → app misbehaves | MI lacks Key Vault Secrets User, or vault PE/DNS missing | Portal → Environment variables (red error); az role assignment list | Grant Secrets User; link privatelink.vaultcore; open vault firewall |
| 10 | Deploy “succeeds” but app runs old/empty | Deployment-container auth/role wrong; package not pulled | Diagnose & solve → Flex Deployment tool; package container empty/denied | Set --deployment-storage-auth-type; grant Blob Data Contributor |
| 11 | Backend (DB/API) overwhelmed under spike | Fan-out uncapped: concurrency × max-instances > downstream limit | Compute the product; compare to pool/rate cap | Lower the product below the weakest downstream |
| 12 | Outbound doesn’t traverse the firewall / sees public DNS | vnetRouteAllEnabled is false | az functionapp show --query properties.vnetRouteAllEnabled | Set vnetRouteAllEnabled=true |

The expanded form, with the full reasoning for the entries that bite hardest:

1 & 2 — Capped vs cold: the central 429 fork. Both look like “we got 429s under load,” and treating one as the other wastes the incident. Capped means instances are saturated at their concurrency limit and the app cannot add more — either --maximum-instance-count is too low or the regional core quota is exhausted; the tell is a throttle rate that climbs while p95 stays flat (the served requests are fine, you just can’t serve more). Cold means a burst arrived faster than on-demand instances warm up, an upstream timed out and retried, and the retries amplified load; the tell is a throttle spike alongside a p95 spike at the burst’s leading edge. Use this Kusto query to separate them — it correlates 429 rate against the cause signals:

let window = 5m;
requests
| where timestamp > ago(1h)
| summarize
    total = count(),
    throttled = countif(resultCode == "429"),
    p95_ms = percentile(duration, 95)
  by bin(timestamp, window)
| extend throttle_rate = round(100.0 * throttled / total, 2)
| order by timestamp asc

A throttle rate that climbs while p95 stays flat points to a hard instance cap (capped → raise max-count/concurrency or request quota). A throttle rate that spikes alongside a p95 latency spike at the start of a burst points to cold starts (cold → add always-ready sized to the burst’s leading edge). Read against the live InstanceCount metric, the two are unmistakable:

| Signal pattern | Diagnosis | Why | First fix |
| --- | --- | --- | --- |
| Throttle ↑, p95 flat, InstanceCount pinned at max | Capped (instances) | Saturated and can’t add more | Raise --maximum-instance-count |
| Throttle ↑, p95 flat, InstanceCount below max | Capped (quota) | Quota stalls scale below your cap | Request core-quota increase |
| Throttle spike + p95 spike at burst edge, then settles | Cold | On-demand cold-starting behind the burst | Add always-ready for the leading edge |
| Throttle 0, p95 spike on first request after idle | Cold (no SLA breach yet) | Warm pool empty at idle | Reserve a small warm floor |
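The table above is effectively a decision tree, and writing it down as one makes the fork harder to misread under pressure. The sketch below is one possible encoding, not a product feature: `classify_429` is a hypothetical helper, and the yes/no inputs are judgments you make from the KQL output and the InstanceCount metric.

```shell
#!/usr/bin/env bash
# classify_429 is a hypothetical helper encoding the signal table as a decision.
# Args: throttle_rising (yes/no), p95_spiking (yes/no), instance_count, max_instance_count
classify_429() {
  local throttle=$1 p95=$2 instances=$3 max=$4
  if [ "$throttle" = "yes" ] && [ "$p95" = "no" ]; then
    # Served requests are healthy; the app simply cannot add capacity.
    if [ "$instances" -ge "$max" ]; then
      echo "capped-instances"   # raise --maximum-instance-count
    else
      echo "capped-quota"       # request a core-quota increase
    fi
  elif [ "$p95" = "yes" ]; then
    echo "cold"                 # add always-ready for the leading edge
  else
    echo "healthy"
  fi
}

classify_429 yes no 40 40    # pinned at max            -> capped-instances
classify_429 yes no 25 40    # stalled below max        -> capped-quota
classify_429 yes yes 3 40    # spike at the burst edge  -> cold
```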

The metrics that explain scaling decisions — read these first:

APP_ID=$(az functionapp show -g $RG -n fn-orders-prod --query id -o tsv)
az monitor metrics list --resource "$APP_ID" --metric "InstanceCount" --interval PT1M -o table
az monitor metrics list --resource "$APP_ID" --metric "OnDemandFunctionExecutionUnits" --interval PT1H -o table
az monitor metrics list --resource "$APP_ID" --metric "AlwaysReadyFunctionExecutionUnits" --interval PT1H -o table

| Metric | What it tells you | Use it to… |
| --- | --- | --- |
| InstanceCount | Live instances over time | See capped (pinned at max) vs cold (climbing from zero) |
| OnDemandFunctionExecutionUnits | GB-s of on-demand execution | Attribute the variable part of the bill |
| AlwaysReadyFunctionExecutionUnits | GB-s on the warm pool | Confirm the warm pool is sized/used right |
| FunctionExecutionCount | Total executions | Correlate throttle rate to volume |
| MemoryWorkingSet | Per-instance memory in use | Spot pressure that argues for a larger memory size |
| AverageMemoryWorkingSet | Fleet-average memory | Right-size 512 vs 2048 vs 4096 |
| Http5xx / Http429 | Edge error rates | The symptom; confirm against the cause above |

3 — The “private but actually public” trap. VNet integration routed the egress, but the integration VNet was never linked to the privatelink.blob.core.windows.net zone, so the app resolved the public IP and the private endpoint was bypassed entirely. Confirm: nslookup <account>.blob.core.windows.net from a peered VM (not your laptop) returns a public A record instead of a 10.x private IP. Fix: link every relevant privatelink.* zone to the VNet, then re-test resolution from inside the VNet.

4 — App won’t start because the host can’t read storage. The UAMI was assigned but never granted the Storage data-plane roles, so the host can’t read its own metadata/package and never starts — which looks like a generic “app down,” not an auth problem. Confirm: az role assignment list --assignee <UAMI clientId> --scope <storage id> is empty. Fix: grant Storage Blob Data Owner + Storage Queue Data Contributor (+ Account Contributor) on the backing account, and Blob Data Contributor on the deployment container.

5 & 11 — Quota vs self-DDoS. Two opposite failure modes around the same product. (5) Scale stalls below your --maximum-instance-count because the regional 250-core quota is exhausted (remember 4096-MB instances burn 2 cores each, so 125 of them is the whole default budget). Confirm: the Flex Consumption Quota tool in Diagnose & solve. Fix: request an increase. (11) The opposite — scale runs too far and concurrency × max-instances exceeds your downstream’s capacity, so the backend pool exhausts under spike. Confirm: compute the product and compare to the pool/rate cap. Fix: lower the product below the weakest downstream; this is the backpressure discipline.
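For the quota side of that pair, the arithmetic is worth doing before a launch rather than during one. The sketch below uses the core weights this lesson gives for each memory size (512 MB = 0.25 core, 2048 MB = 1, 4096 MB = 2) and a default quota of 250 cores; `quarter_cores_per_instance` and `quota_percent` are hypothetical helper names, and the calculation works in quarter-cores so Bash integer math stays exact.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; core weights and the 250-core default come from this lesson.
quarter_cores_per_instance() {
  case "$1" in
    512)  echo 1 ;;   # 0.25 core
    2048) echo 4 ;;   # 1 core
    4096) echo 8 ;;   # 2 cores
    *)    echo "unsupported instance memory: $1" >&2; return 1 ;;
  esac
}

# Prints the share of the regional quota consumed, as a whole percent (rounded down).
quota_percent() {
  local instances=$1 memory_mb=$2 quota_cores=${3:-250}
  local q
  q=$(quarter_cores_per_instance "$memory_mb") || return 1
  echo $(( instances * q * 100 / (quota_cores * 4) ))
}

quota_percent 125 4096   # 125 x 2 cores = 250 cores -> 100
quota_percent 40 2048    # 40 x 1 core   = 40 cores  -> 16
```

Remember the quota is shared by every Flex app in the subscription and region, so sum the result across apps before concluding you have headroom.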

Best practices

The alerts worth wiring before the next burst — leading indicators, not “app down”:

| Alert on | Signal | Threshold (starting point) | Why it’s leading |
| --- | --- | --- | --- |
| Throttling | Http429 rate | > 1% sustained 5 min | First sign of capped-or-cold before users feel it |
| Scale ceiling | InstanceCount at max | = maximum-instance-count for 10 min | You’re capped; raise the cap or quota |
| Cold-start latency | request p95 | > your SLO at burst edges | Warm pool too small for the spike |
| Core quota | Flex Quota tool / scale stall | scale flat below max-count | Quota, not your config, is the cap |
| Always-ready cost | AlwaysReadyFunctionExecutionUnits | trending up unexpectedly | A forgotten reservation billing a baseline |
| Private-path drift | dependency failures over public IP | any, post-deploy | A DNS zone link went missing |

Security notes

The security controls that also prevent these incidents — secure and resilient pull the same way here:

| Control | Setting / mechanism | Secures against | Also prevents |
| --- | --- | --- | --- |
| Identity-based storage connection | AzureWebJobsStorage__credential=managedidentity | A storage key in app settings | Key rotation breaking the host |
| Least-privilege storage roles | Blob/Queue data roles at account scope | Over-broad access | Surprise blast radius if the identity leaks |
| Private endpoints + public access off | PE + --public-network-access Disabled | Data exfiltration over public IP | Accidental public exposure of deps |
| Linked Private DNS zones | privatelink.* zone → VNet link | Traffic resolving the public path | “Private but actually public” drift |
| Key Vault references | @Microsoft.KeyVault(SecretUri=...) | Secrets in plaintext config | Hand-rolled secret rotation breaking the app |
| Force VNet routing | vnetRouteAllEnabled=true | Egress bypassing the firewall | Private DNS not being consulted |

Cost & sizing

The bill drivers on Flex, how they interact with the tuning, and a rough monthly picture of what each lever buys you:

| Cost driver | What you pay for | Rough INR / month | What it fixes | Watch-out |
| --- | --- | --- | --- | --- |
| On-demand execution | Active GB-seconds (1s min) | Scales with traffic; ₹0 idle | The scale-to-zero economics | Chatty tiny calls pay the 1s floor |
| 2× always-ready (2048 MB) | Warm memory baseline | ~₹6,000–10,000 | Cold start on the hot path | Forgetting it bills 24×7 |
| 6× always-ready (2048 MB) | Larger warm floor | ~₹18,000–30,000 | Bigger burst leading edge | Size to peak ÷ concurrency, not “lots” |
| 4096-MB sizing | 2× cores per instance | ~2× the above | CPU/memory-heavy work | Doubles quota burn |
| Private endpoints | Per endpoint hourly | ~₹1,000–2,000 each | Genuinely private deps | One per dependency |
| NAT gateway | Hourly + per-GB egress | ~₹1,500–3,000 | Deterministic egress at scale | Needs VNet integration |
| App Insights ingestion | Per-GB telemetry | ~₹1,000–3,000 | The capped-vs-cold diagnosis | Sample high-traffic apps |

The sizing rule in one line: let on-demand carry the variable load to zero, reserve always-ready only for the hot path’s burst edge (ceil(steady concurrency ÷ perInstanceConcurrency)), and cap concurrency × max-instances below your weakest downstream. That combination is cheaper than Premium-for-peak and safer than uncapped Consumption.
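The "chatty tiny calls pay the 1s floor" watch-out is easier to feel with numbers. The sketch below assumes the billing granularity this lesson quotes for on-demand execution (a 1,000 ms minimum, then rounding up to the next 100 ms); `billable_ms` is a hypothetical helper, and it models duration only, not memory size or price.

```shell
#!/usr/bin/env bash
# billable_ms is a hypothetical helper: 1,000 ms minimum, then round up to the
# next 100 ms, as quoted in this lesson for on-demand execution.
billable_ms() {
  local ms=$1
  if [ "$ms" -lt 1000 ]; then
    echo 1000
  else
    echo $(( (ms + 99) / 100 * 100 ))
  fi
}

billable_ms 40     # a 40 ms call still bills 1000
billable_ms 1234   # rounds up to 1300
```

A function that averages 40 ms is billed as if it ran 25 times longer, which is the argument for batching small calls where the trigger allows it.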

Interview & exam questions

1. What does Flex Consumption add over Linux Consumption, and what does it keep? It keeps scale-to-zero and execution billing, and adds VNet integration, always-ready instances, selectable instance memory (512/2048/4096), and explicit per-instance concurrency — the Premium capabilities available à la carte on a consumption-billed plan, so you pay a warm baseline only on the slice you reserve.

2. How does Flex billing differ from Consumption and Premium? Consumption bills only active GB-seconds; Premium bills every reserved instance always-on. Flex bills on-demand instances only while executing (1,000 ms minimum, then 100 ms rounding) and always-ready instances a continuous memory baseline plus execution. You pay the Premium-style cost only on reserved warm capacity.

3. A burst causes 429s. How do you tell whether you were capped or cold? Correlate the 429 rate against p95 and InstanceCount. Capped: throttle climbs while p95 stays flat and InstanceCount is pinned at max (or stalled by quota) — raise --maximum-instance-count/concurrency or request quota. Cold: throttle and p95 spike at the burst’s leading edge while InstanceCount climbs from zero — add always-ready sized to the leading edge.

4. What subnet requirement does Flex VNet integration impose, and what’s the common mistake? A subnet delegated to Microsoft.App/environments, at least /27 (use /26), with the Microsoft.App RP registered. The common mistake is delegating to Microsoft.Web/... (App Service’s delegation) — Flex joins via the App environment, so that delegation fails.

5. Why might an app reach a dependency over the public IP even though it has VNet integration and a private endpoint? Because VNet integration only routes outbound; reaching the private endpoint also needs DNS resolution to the private IP, which requires the integration VNet to be linked to the privatelink.* zone. Without that link the app resolves the public IP and bypasses the endpoint.

6. How do you turn concurrency into a backpressure mechanism? perInstanceConcurrency × maximum-instance-count is a hard ceiling on total in-flight executions. Set that product at or below your weakest downstream’s capacity (DB pool, API rate cap) and overload becomes structurally impossible — the app 429s at the edge before the downstream falls over.

7. What is the regional core quota and how do you compute against it? A default 250 cores (512,000 MB) per subscription+region, shared by all Flex apps. Cores are instances × cores-per-instance (512 MB = 0.25, 2048 = 1, 4096 = 2), so 125 instances of 4096 MB exhaust the default. Always-ready counts; scaled-to-zero doesn’t. Request increases via support.
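The quota arithmetic in that answer can be checked with a few lines of Python. This sketch takes the cores-per-instance mapping and the 250-core default straight from the answer above; the function names and the `already_used` parameter are illustrative assumptions.

```python
# Cores per instance for each instance memory size (MB), as listed above.
CORES_PER_INSTANCE = {512: 0.25, 2048: 1.0, 4096: 2.0}
DEFAULT_REGIONAL_CORES = 250  # default per subscription + region


def cores_used(instance_memory_mb: int, instances: int) -> float:
    """Cores consumed by a given number of instances of one size."""
    return CORES_PER_INSTANCE[instance_memory_mb] * instances


def max_instances_within_quota(
    instance_memory_mb: int,
    quota_cores: float = DEFAULT_REGIONAL_CORES,
    already_used: float = 0.0,
) -> int:
    """Largest instance count that fits in the remaining quota.

    already_used covers other Flex apps in the same subscription and
    region, including their always-ready instances.
    """
    remaining = quota_cores - already_used
    return int(remaining // CORES_PER_INSTANCE[instance_memory_mb])


if __name__ == "__main__":
    print(cores_used(4096, 200))             # 400.0 cores, over the default 250
    print(max_instances_within_quota(4096))  # 125
```

Passing the cores already consumed by sibling apps as `already_used` gives the figure that matters in practice, since the quota is shared across the region.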

8. How do you remove the storage key from a Flex app? Replace the AzureWebJobsStorage connection string with an identity-based connection: set AzureWebJobsStorage__accountName, __credential=managedidentity, and __clientId (UAMI), then delete AzureWebJobsStorage. Grant the identity the Storage Blob/Queue data roles on the backing account.

9. What’s the minimum always-ready count when availability zones are enabled, and why? 2 per group, not 1 — so the warm pool survives a single zone outage. A single warm instance would be a single point of failure that defeats the purpose of zone redundancy.

10. Which runtimes/models are unsupported on Flex, and what’s the migration path? The C# in-process model is unsupported — you must use the isolated worker (.NET 8/9/10). Flex is Linux only. There is no in-place migration in or out: you create a new app and redeploy.

11. How is non-HTTP trigger concurrency tuned, since perInstanceConcurrency is HTTP-only? Via target-based scaling, driven by the binding's batch/concurrency knobs in host.json (serviceBus.maxConcurrentCalls, queues.batchSize/newBatchThreshold, Event Hubs batch size). The runtime divides the queue depth or event backlog by that per-instance target to compute a desired instance count.

12. The function app deploys “successfully” but runs old or empty code. What do you check? The deployment container auth/role: the deploy identity needs Storage Blob Data Contributor on the deployment account, and --deployment-storage-auth-type must match how it’s authenticated. Use the Flex Consumption Deployment tool in Diagnose & solve to see package status.

These map to AZ-204 (Developer Associate: develop, configure, monitor and troubleshoot Azure Functions, scaling and networking) and AZ-305 (Solutions Architect) for the plan-choice and private-network design. The networking-cost angle (VNet integration, NAT, SNAT) touches AZ-700, and the identity/least-privilege angle touches AZ-500. A compact cert-mapping for revision:

| Question theme | Primary cert | Exam objective area |
|---|---|---|
| Flex vs Consumption vs Premium, billing | AZ-204 / AZ-305 | Choose & configure compute; cost |
| Concurrency, max-instances, always-ready | AZ-204 | Configure & scale Functions |
| VNet integration, delegated subnet, Private DNS | AZ-700 | Design & implement network connectivity |
| Identity-based storage connection, KV references | AZ-500 / AZ-204 | Secure app config; manage identities |
| 429 capped-vs-cold, metrics, KQL | AZ-204 | Monitor & troubleshoot solutions |
| Core quota, scale ceilings | AZ-305 | Design for scale & limits |

Quick check

  1. You see 429s under load; the throttle rate climbs but p95 latency stays flat. Were you capped or cold, and what’s the first fix?
  2. Your Flex app has VNet integration and the storage account has a private endpoint, yet nslookup from a peered VM returns a public IP. What’s missing?
  3. You want a hard ceiling of “no more than 5 instances ever.” Can Flex do it? Why or why not?
  4. A 4096-MB Flex app needs to scale to 200 instances. Does the default regional quota allow it? Show the math.
  5. How do you make total in-flight executions never exceed a 150-connection database pool?

Answers

  1. Capped. Throttle climbing while p95 stays flat means served requests are fine but the app can’t add capacity — you’re at the concurrency × max-instances ceiling or the core quota. First fix: raise --maximum-instance-count (or perInstanceConcurrency if instances have memory headroom), or request a core-quota increase. (Cold would show a p95 spike at the burst’s leading edge.)
  2. The integration VNet isn’t linked to the privatelink.blob.core.windows.net Private DNS zone. VNet integration routes outbound but doesn’t resolve names; without the zone link the app gets the public IP and bypasses the endpoint. Link the relevant privatelink.* zones to the VNet.
  3. No. The floor for --maximum-instance-count is 40 — you cannot pin a Flex app to a low ceiling like 5. If you need a hard low ceiling, use Consumption or Premium instead.
  4. No. 4096 MB = 2 cores per instance, so 200 instances = 400 cores, which exceeds the default 250-core quota (which caps 4096-MB apps at 125 instances). Request a quota increase before planning for 200.
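  5. Set perInstanceConcurrency × maximum-instance-count ≤ 150. Because the floor for --maximum-instance-count is 40 (answer 3), that means concurrency 3 × max 40 = 120; concurrency 4 would allow 160 and overrun the pool. The product is a hard backpressure ceiling: the app 429s at the edge before the pool exhausts.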
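The backpressure sizing in answer 5 can be sketched in Python as follows. It assumes HTTP triggers (where perInstanceConcurrency applies) and takes the 40-instance floor from answer 3; the function names are illustrative, not part of any Azure tooling.

```python
MIN_MAXIMUM_INSTANCE_COUNT = 40  # floor for --maximum-instance-count (answer 3)


def in_flight_ceiling(per_instance_concurrency: int, max_instances: int) -> int:
    """Hard ceiling on total concurrent executions across the app."""
    return per_instance_concurrency * max_instances


def max_concurrency_for(downstream_capacity: int, max_instances: int) -> int:
    """Largest per-instance concurrency that keeps the ceiling within
    the weakest downstream's capacity (e.g. a database pool size)."""
    if max_instances < MIN_MAXIMUM_INSTANCE_COUNT:
        raise ValueError("max_instances is below the Flex floor of 40")
    concurrency = downstream_capacity // max_instances
    if concurrency < 1:
        raise ValueError("downstream capacity is too small for this instance ceiling")
    return concurrency


if __name__ == "__main__":
    c = max_concurrency_for(150, 40)
    print(c, in_flight_ceiling(c, 40))  # 3 120
```

When the downstream capacity is smaller than the instance floor, no concurrency setting can protect it, and the function raises so that case is surfaced rather than silently rounded to zero.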

Glossary

Next steps

You can now provision, tune, and diagnose a Flex Consumption app on a private network. Build outward:
