Azure Container Apps Deep Dive: Dapr, KEDA Scaling, Revisions, and Split Traffic

Azure Container Apps (ACA) sits in the gap between “I just want to run a container” and “I’m operating a Kubernetes cluster.” Under the hood it is Kubernetes plus KEDA (event-driven autoscaling), Dapr (a portable microservices runtime), and Envoy (the ingress proxy), but you never touch a node, a kubelet, or an ingress controller. You get scale-to-zero, event-driven autoscaling, a service mesh’s worth of Dapr building blocks, and built-in blue-green via immutable revisions, all declared in Bicep or set with one az command. The catch is that every one of those gifts has a contract you can violate, and when you do, the failure is opaque: a worker that scales but never wakes, a Dapr component silently loaded by every app in the environment, a canary that took 100% of traffic because the weights didn’t sum the way you thought.

This guide builds a small two-service system — an orders-api (HTTP, externally reachable) and an orders-worker (queue-driven, internal) — and wires up Dapr pub/sub and state, KEDA scaling, immutable revisions, and weighted traffic splitting. Everything here is az containerapp and Bicep; no kubectl. Because this is a reference you will return to mid-incident, every moving part — every ingress mode, every scaler, every revision trigger, every error string — is laid out as a scannable table next to the prose and code that explain it. Read the prose once; keep the tables open when a deploy goes sideways at 18:03 on a Friday.

By the end you will stop guessing. You will know whether an app failed because its container never bound 0.0.0.0, because a Dapr component was scoped wrong, because min-replicas 0 met a trigger that cannot wake from zero, or because a revision suffix collided with an existing (possibly inactive) one. Knowing which within ninety seconds is what separates a five-minute rollback from a two-hour incident bridge.

Versions. Commands target the containerapp Azure CLI extension and the Microsoft.App resource provider (API 2024-03-01 / 2025-01-01). Install once with az extension add --name containerapp --upgrade, then register both resource providers with az provider register --namespace Microsoft.App and az provider register --namespace Microsoft.OperationalInsights.

What problem this solves

You have a handful of microservices. Plain App Service can run them but gives you no event-driven scaling, no scale-to-zero, no sidecar mesh, and no weighted canary. Full AKS gives you all of that and a cluster to patch, upgrade, secure, and staff. ACA is the middle: the Kubernetes capabilities you actually wanted for stateless and event-driven workloads, with the cluster operations deleted. It is the right tool when you want progressive delivery and pub/sub without an Argo/Flagger/Istio stack and without a platform team.

What breaks without it: teams reach for App Service and then bolt on Service Bus triggers, a homegrown blue-green via two slots, and a custom retry library — reinventing KEDA, revisions, and Dapr badly. Or they stand up AKS for three services and spend more engineer-hours on node pools and CNI than on the product. ACA collapses that. But the collapse hides machinery, and hidden machinery has sharp edges: the environment subnet can’t be resized after creation, scale-to-zero needs a wake-capable trigger, Dapr components default to environment-wide scope, and a single-revision-mode deploy tears down the old revision the instant the new one activates — cutting in-flight requests.

Who hits this: teams running stateless HTTP APIs and queue/event workers who want autoscaling and canary without operating Kubernetes; cost-sensitive shops that want scale-to-zero in non-prod; and anyone migrating off AKS for workloads that never needed a full cluster. To frame the field before the deep dive, here is what ACA owns versus what still bites:

Capability	What ACA gives you	The contract you must honour	What bites if you ignore it
Ingress	Envoy L7, free FQDN + TLS	One container port; bind `0.0.0.0`	502/connection-refused; app “healthy” but unreachable
Autoscaling	KEDA, scale-to-zero	A trigger that can wake from 0	Worker stuck at 0; messages pile up
Microservices runtime	Dapr sidecar per app	Scope components to `dapr-app-id`s	Every app loads every component; cross-talk
Progressive delivery	Immutable revisions + weights	Multiple-revision mode; unique suffixes	Deploy goes straight to 100%; no rollback
Network boundary	Environment = VNet + LA workspace	Subnet `/23`+, fixed at create	Can’t grow the subnet; rebuild the environment
Secrets/identity	Managed identity + Key Vault refs	UAMI with the right RBAC	Inline passwords in IaC; pull/secret failures

Learning objectives

By the end of this article you can:

Choose the right Container Apps environment topology (one per bounded context vs per team), size its subnet correctly, and decide between Consumption-only and workload profiles.
Configure ingress as external, internal, or disabled, front a private environment with Application Gateway or Front Door, and explain the container-port + 0.0.0.0 bind contract that decides reachability.
Enable Dapr per app, register pub/sub, state, and service-invocation components scoped to the right dapr-app-ids, and choose identity-based auth over connection strings where the broker supports it.
Author KEDA scale rules for HTTP concurrency, Service Bus / Storage Queue depth, CPU/memory, and custom scalers — and tune messageCount and concurrentRequests from real per-message processing time.
Operate revisions: distinguish template changes (new revision) from configuration changes (no revision), pick single vs multiple mode, and pin readable suffixes.
Run weighted traffic splitting for canary and blue-green, attach labels for sticky testing, and roll back with a one-line weight flip to a still-warm revision.
Diagnose the dozen real ACA failure modes — wrong port, scope leaks, can’t-wake-from-zero, suffix collisions, dropped messages on scale-in — using the exact az, KQL, and portal paths.

Prerequisites & where this fits

You should be comfortable with containers (an image, a registry, a port, an entrypoint), with az in Cloud Shell reading JSON output, and with the idea of a microservice that talks to a queue and a database. Familiarity with Kubernetes concepts (pods, probes, autoscaling) helps but is not required — that is the point of ACA. You should know what a managed identity is and that Azure Service Bus and Cosmos DB exist as managed brokers/stores.

This sits in the Compute → Containers track, one rung below full Kubernetes. The decision of whether to use ACA at all is upstream: see Azure App Service vs Container Apps vs AKS and Containers vs Serverless vs VMs. The Dapr building blocks here are the managed mirror of Configure Dapr on Kubernetes: service invocation, state, pub/sub; the KEDA scalers mirror KEDA event-driven autoscaling with Kafka and Service Bus. It pairs tightly with Azure Service Bus: sessions, dedup, dead-letter patterns for the broker, Azure Container Registry secure supply chain for the image source, and Azure Monitor & Application Insights for observability for the trace graph.

A quick map of which layer owns which failure, so you call the right person fast:

Layer	What lives here	Who usually owns it	Failure classes it causes
Client / DNS	TLS, name resolution, the FQDN	Frontend / SRE	404/timeout if FQDN wrong; mostly red herrings
Front Door / App Gateway	WAF, backend probe, timeout	Network team	502 (origin timeout), 403 (WAF)
Environment ingress (Envoy)	L7 routing, revision weights	Platform	Wrong split, 502 if no healthy revision
App / revision	Your container, port bind, probes	App / dev team	502 (wrong port), restart loop, crash
Dapr sidecar	mTLS, retries, component load	App + platform	Component not found, scope leak, 500 from sidecar
KEDA scaler	min/max replicas, trigger	Platform + app	Stuck at 0, over/under-scaled, drops on scale-in
Identity / secrets	UAMI, Key Vault refs, ACR pull	App + platform	ImagePull fail, secret unresolved, crash loop

Core concepts

Six mental models make every later decision obvious.

The environment is the boundary that matters. A Container Apps environment is the security and network boundary. Apps in the same environment share a virtual network and a Log Analytics workspace, and can call each other by name and over Dapr. Apps in different environments cannot. This is your first architecture decision: one environment per bounded context, or one per team, but never one per app (you would pay the per-environment floor and lose app-to-app networking for nothing).

Ingress is per-app and the port contract is explicit. Each app declares at most one ingress, and a single --target-port that the container must bind on 0.0.0.0. Envoy fronts it. External ingress gets a public FQDN; internal ingress is reachable only inside the environment; disabled means outbound-only. Bind 127.0.0.1 and the probe from outside the container fails — the app is “running” and unreachable, the ACA twin of the App Service WEBSITES_PORT trap.

Scaling is KEDA, and scale-to-zero needs a waker. Every app has min/max replicas and a scale rule. The default rule is HTTP concurrency. Setting --min-replicas 0 makes an idle app free, but scale-from-zero requires an event source that can wake it — HTTP traffic, or a KEDA scaler polling a queue/topic. A plain TCP app with no trigger cannot wake from zero and sits dead.

Revisions are immutable and triggered by template changes. Every change to an app’s template (image, env vars, scale, resources, probes) mints a new immutable revision. Changes to configuration (ingress, secrets, registries, Dapr on/off) do not. That single distinction is the whole revision model. In single mode the new revision replaces the old; in multiple mode they coexist and you split traffic by weight.

Dapr is a sidecar you opt into per app, with components scoped at the environment. Enable the sidecar on an app and it gets an identity (--dapr-app-id) and a localhost API on port 3500. Components (pub/sub, state, bindings) are registered against the environment and, unless scoped, are loaded by every Dapr-enabled app. Scope is the safety boundary; forget it and every app mounts every broker.

Identity replaces passwords. Registry pull, Key Vault-backed secrets, and identity-based broker auth all run through a user-assigned managed identity (UAMI) with the right RBAC. Inlining a registry password or connection string in the template is the most common ACA mistake and the one a secret-scanner will catch in your IaC.

The vocabulary in one table

Pin down every moving part before the deep sections. The glossary at the end repeats these for lookup; this is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters
Environment	Security + network + logging boundary	Resource group	Apps inside can talk; subnet is fixed at create
Workload profile	Consumption (serverless) or Dedicated compute	On the environment	Decides CPU/mem ratios, GPU, isolation
Container app	One app (1+ containers) on an environment	On the environment	The unit you scale and revision
Ingress	Envoy L7 entry: external/internal/disabled	Per app	Reachability + the port contract
Target port	The single port your container binds	App ingress config	Wrong/loopback → 502, unreachable
Revision	Immutable snapshot of the app template	Under the app	Blue-green/canary unit; template change mints one
Revision suffix	Human-readable revision name tail	Set on `create` or `update`	Needed for traffic/label commands; must be unique
Traffic weight	% of ingress to a revision (multi mode)	Ingress config	The canary/rollback lever; weights sum to 100
Label	Stable alias → a revision, own FQDN	On a revision	Sticky smoke-testing without user traffic
Scale rule	KEDA trigger deciding replica count	Per app	min/max + trigger; scale-to-zero needs a waker
Dapr sidecar	Per-app runtime on localhost:3500	Injected per app	mTLS, retries, pub/sub, state, invocation
Dapr component	A broker/store/binding definition	On the environment	Scope it or every app loads it
UAMI	User-assigned managed identity	Standalone resource	ACR pull, Key Vault refs, broker auth

The environment: the boundary that matters

Creating an environment takes a Log Analytics workspace for logs and a single az containerapp env create. This first version runs on the platform-managed network with no custom subnet; the VNet-integrated variant comes next.

RG=rg-aca-orders
LOC=eastus
ENV=cae-orders

az group create -n $RG -l $LOC

# Log Analytics workspace for the environment
az monitor log-analytics workspace create \
  -g $RG -n law-aca-orders

LAW_ID=$(az monitor log-analytics workspace show \
  -g $RG -n law-aca-orders --query customerId -o tsv)
LAW_KEY=$(az monitor log-analytics workspace get-shared-keys \
  -g $RG -n law-aca-orders --query primarySharedKey -o tsv)

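# Workload profiles let later apps pin --workload-profile-name Consumption
az containerapp env create \
  -g $RG -n $ENV -l $LOC \
  --enable-workload-profiles \
  --logs-workspace-id "$LAW_ID" \
  --logs-workspace-key "$LAW_KEY"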

Environment-level settings, end to end

The environment carries a surprising number of one-way doors. Every setting, its default, when to change it, and the gotcha:

Setting	Default	When to change	Trade-off / gotcha
Workload profiles	Off (Consumption-only)	You need Dedicated/GPU or VNet at scale	VNet integration needs a delegated subnet (`/27` minimum for workload profiles, `/23` for Consumption-only); profile mix is editable, subnet is not
`--infrastructure-subnet-resource-id`	None (managed network)	Hub-and-spoke / private workloads	Immutable after create — size once
`--internal-only`	false	No public surface allowed	Even external apps get a private VIP; front with App GW/Front Door
Logs destination	Log Analytics	Azure Monitor / none	Switching later is disruptive; pick at create
Zone redundancy	Off	Prod HA across AZs	Must be set at create; needs a subnet; small cost
`--dapr-instrumentation-key`	Unset	You want Dapr traces in App Insights	Set the App Insights instrumentation key here (environment-wide), not per app
Custom domain + cert	Unset	Branded ingress on the environment	Managed cert or bring-your-own; DNS validation
Mutual TLS (env)	Off	Enforce mTLS between apps	Adds handshake cost; coordinate with Dapr mTLS
Platform-reserved CIDRs	Auto	Avoid overlap with on-prem/hub	The platform reserves `100.100.0.0/17`-class ranges; keep them out of your VNet and peered address plan

Internal vs external ingress, and VNet integration

Ingress is per-app and has three states:

Setting	Reachable from	Gets a public FQDN?	Use for
`--ingress external`	Public internet (and the environment)	Yes (unless `--internal-only`)	Public APIs, frontends
`--ingress internal`	Only apps in the same environment	No (internal FQDN only)	Backend services, workers exposing HTTP
Ingress disabled (omit `--ingress`, or `az containerapp ingress disable`)	Nothing; outbound only	No	Pure workers (queue consumers, cron)

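For real workloads, give the environment its own subnet so the whole thing sits inside your hub-and-spoke. Because the subnet is fixed at creation, this replaces the managed-network environment above rather than upgrading it. Use a workload profiles environment (which supports both the serverless “Consumption” profile and dedicated profiles), delegate the subnet to Microsoft.App/environments, and size it /23 or larger even though the platform minimum for workload profiles is smaller: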

# Platform minimum for workload-profile environments is /27; size /23+ so
# scale-out and overlapping revisions have addresses to draw from
SUBNET_ID=$(az network vnet subnet show \
  -g rg-network --vnet-name vnet-spoke-app -n snet-aca \
  --query id -o tsv)

az containerapp env create \
  -g $RG -n $ENV -l $LOC \
  --enable-workload-profiles \
  --infrastructure-subnet-resource-id "$SUBNET_ID" \
  --internal-only true \
  --logs-workspace-id "$LAW_ID" --logs-workspace-key "$LAW_KEY"

--internal-only true means even external apps get a private VIP — the environment’s ingress is reachable only from the VNet, so you front it with Application Gateway or Front Door and keep nothing on the public internet. The subnet cannot be changed after creation, so size it once and correctly.

The subnet sizing rule is the one most teams get wrong, because revisions and scale eat IPs:

Environment type	Min subnet	Why that size	If you under-size
Consumption-only (managed network)	n/a (no delegated subnet)	Platform-managed	—
Consumption-only with custom VNet	`/23`	Platform reserves a large block	Create fails or scale caps early
Workload profiles	`/27` minimum; `/23`+ recommended	Replicas, nodes, and overlapping revisions all draw from the range	Revisions fail to roll out; “no IP” errors
Many apps × many revisions	`/21`–`/20`	Multi-mode keeps old + new live	Silent scale ceiling; canaries can’t allocate
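Because `env create` is a one-way door, a small pre-flight check can catch an undersized subnet before you commit. The sketch below is plain POSIX shell with no Azure calls. It assumes Azure's standard 5 reserved addresses per subnet and treats the minimum prefix as a parameter (default /23, the size this article recommends), not as an authoritative platform limit. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: sanity-check an ACA infrastructure subnet before `az containerapp env create`.
# Assumptions: Azure reserves 5 addresses in every subnet; the minimum prefix is a
# parameter (default 23, this article's recommendation), not a platform constant.

# Total addresses in an IPv4 prefix length, e.g. 23 -> 512.
cidr_size() {
  echo $(( 1 << (32 - $1) ))
}

# Addresses left after Azure's 5 reserved ones.
usable_ips() {
  echo $(( $(cidr_size "$1") - 5 ))
}

# Return 0 if a CIDR such as 10.10.0.0/23 is at least /min_prefix in size
# (a smaller prefix number means a bigger subnet); 1 if too small; 2 if unparsable.
subnet_ok_for_aca() {
  local cidr="$1" min_prefix="${2:-23}" prefix
  prefix="${cidr##*/}"
  case "$prefix" in
    ''|*[!0-9]*) echo "not a CIDR: $cidr" >&2; return 2 ;;
  esac
  if [ "$prefix" -gt "$min_prefix" ]; then
    echo "$cidr is smaller than /$min_prefix" >&2
    return 1
  fi
  return 0
}
```

Feed it the subnet's prefix, for example `subnet_ok_for_aca "$(az network vnet subnet show -g rg-network --vnet-name vnet-spoke-app -n snet-aca --query addressPrefix -o tsv)"`, and stop the pipeline on a non-zero exit.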

The networking knobs and their failure modes — the table to keep open when “it deployed but nothing can reach it”:

Networking control	What it does	Default	Symptom when wrong
`--target-port`	Port Envoy probes/forwards to	none (must set)	502; container up but unreachable
Bind address	App must listen on `0.0.0.0`	app’s choice	Probe fails from outside container → 502
`--transport`	`auto`/`http`/`http2`/`tcp`	auto	gRPC needs `http2`; wrong → broken streams
`--exposed-port` (tcp)	External port for TCP ingress	n/a	TCP apps need it; HTTP apps ignore it
IP restrictions	Allow/deny CIDRs on ingress	allow all	Lock down without it; or accidental block
`--internal-only` (env)	Private VIP only	false	Public exposure you didn’t intend
Client certificate mode	ignore/accept/require	ignore	mTLS clients rejected, or unauth accepted
Sticky sessions (affinity)	Pin client to a replica	none	Uneven warmth; breaks even scaling

Deploy the first app

Pull-from-registry and identity come later; start with a public image to prove the path.

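# The quickstart image listens on port 80. When you later switch to the real
# orders-api image (port 8080), change the port without a new revision:
#   az containerapp ingress update -g $RG -n orders-api --target-port 8080
az containerapp create \
  -g $RG -n orders-api \
  --environment $ENV \
  --image mcr.microsoft.com/k8se/quickstart:latest \
  --target-port 80 \
  --ingress external \
  --workload-profile-name Consumption \
  --min-replicas 1 --max-replicas 5 \
  --cpu 0.5 --memory 1.0Gi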

az containerapp show -g $RG -n orders-api \
  --query properties.configuration.ingress.fqdn -o tsv

--cpu/--memory must follow allowed ratios on the Consumption profile (1 vCPU : 2 GiB), e.g. 0.25/0.5Gi, 0.5/1.0Gi, 1.0/2.0Gi. Dedicated workload profiles relax this. The valid Consumption combinations — copy a row, don’t guess:

vCPU	Memory	Typical use	Notes
0.25	0.5 Gi	Tiny sidecars, cron	Smallest billable size
0.5	1.0 Gi	Light HTTP API	Common default for `orders-api`
0.75	1.5 Gi	Medium API	—
1.0	2.0 Gi	Standard service	Same 1:2 ratio; larger sizes continue in 0.25 vCPU steps
1.25–2.0	2.5–4.0 Gi	Heavier workers	Still 1:2; total per app ≤ 4 vCPU / 8 Gi on Consumption
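When a pipeline generates the template, it can be cheaper to reject a bad `--cpu`/`--memory` pair locally than to wait for the deployment error. This minimal sketch encodes only the Consumption rule from the table above: 0.25 vCPU steps from 0.25 to 4 vCPU, with memory exactly twice the vCPU in Gi. It does not model Dedicated profiles, and `valid_consumption_size` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: validate a Consumption --cpu/--memory pair (memory given as e.g. 1.0Gi).
# Encodes only the 1 vCPU : 2 GiB rule in 0.25 vCPU steps, up to 4 vCPU.
valid_consumption_size() {
  local cpu="$1" mem="${2%Gi}"
  awk -v c="$cpu" -v m="$mem" 'BEGIN {
    c += 0; m += 0                          # force numeric comparison
    if (c < 0.25 || c > 4) exit 1           # outside the Consumption range
    if (c / 0.25 != int(c / 0.25)) exit 1   # not a 0.25 vCPU step
    if (m != 2 * c) exit 1                  # memory must be exactly 2x vCPU
    exit 0
  }'
}
```

For example, `valid_consumption_size 0.5 1.0Gi` succeeds, while `valid_consumption_size 0.5 2.0Gi` fails.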

Workload profiles change the math entirely — pick the profile to the workload, not the other way round:

Profile	vCPU range	Memory	Scale-to-zero	When to use	Cost model
Consumption	0.25–4	0.5–8 Gi	Yes	Bursty, event-driven, dev	Per vCPU-s + GiB-s; free idle
Dedicated D-series	4–32	16–128 Gi	No (min ≥ 1 per profile)	Steady, memory-heavy, isolation	Per-node-hour, you size the pool
Dedicated E-series	4–32	32–256 Gi	No	Memory-bound (caches, JVM)	Per-node-hour
Consumption GPU	per SKU	per SKU	Yes (where available)	Inference bursts	Per GPU-s; region-limited
Dedicated GPU	per SKU	per SKU	No	Steady inference/training	Per-node-hour

Enable Dapr and wire pub/sub, state, and service invocation

Dapr is enabled per app, but its components are registered on the environment and shared by every Dapr-enabled app unless you scope them. The critical detail teams miss: an app’s Dapr identity is its --dapr-app-id, and that ID is what other apps use for service invocation and what the sidecar uses for component scoping.

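Enable the sidecar on both apps. The second command assumes orders-worker already exists in the environment as an app with ingress disabled (created like orders-api, minus --ingress and --target-port), because update cannot create it: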

az containerapp update -g $RG -n orders-api \
  --enable-dapr true \
  --dapr-app-id orders-api \
  --dapr-app-port 8080 \
  --dapr-app-protocol http

az containerapp update -g $RG -n orders-worker \
  --enable-dapr true \
  --dapr-app-id orders-worker \
  --dapr-app-port 8080

The full Dapr app-level configuration surface — what each flag does and the cost of getting it wrong:

Flag / setting	What it does	Default	When to change	Gotcha
`--enable-dapr`	Inject the sidecar	false	Any service needing pub/sub, state, invocation	Adds the sidecar’s CPU and memory to every replica
`--dapr-app-id`	This app’s Dapr identity	none	Always when Dapr on	Must be unique; used for invocation + scoping
`--dapr-app-port`	Port the sidecar calls your app on	target-port	App listens elsewhere	Wrong → sidecar can’t deliver subscriptions
`--dapr-app-protocol`	http or grpc to your app	http	gRPC apps	Mismatch → 500s from sidecar
`--dapr-http-max-request-size`	Max body MB to sidecar	4 MB	Large messages	Too low → 413 on big publishes
`--dapr-http-read-buffer-size`	Header/buffer KB	4 KB	Big headers	Streaming/large headers fail
`--dapr-log-level`	Sidecar log verbosity	info	Debugging	`debug` is noisy + costs LA ingestion
`--dapr-enable-api-logging`	Log every Dapr API call	false	Triage only	Verbose; turn off after

Dapr building blocks you actually use

Dapr exposes more building blocks than most teams touch. The ones that matter on ACA, with the Azure backing service:

Building block	What it does	Localhost API path	Azure backing on ACA
Service invocation	Call another app by `dapr-app-id`, mTLS + retries	`/v1.0/invoke/<app>/method/<m>`	Built-in (no component)
Pub/sub	Publish/subscribe to a topic	`/v1.0/publish/<comp>/<topic>`	Service Bus topics, Storage Queues, others
State	Key/value store, optional ETag/transactions	`/v1.0/state/<store>`	Cosmos DB, Redis, Table Storage
Bindings	Trigger on / send to external systems	`/v1.0/bindings/<name>`	Event Grid, Blob, Cron, SQL
Secrets	Read secrets via a store	`/v1.0/secrets/<store>/<key>`	Key Vault (or ACA secrets)
Configuration	Read/subscribe to config	`/v1.0/configuration/<store>`	Redis, Postgres
Actors	Virtual actors with turn-based concurrency	`/v1.0/actors/...`	Backed by a state store

A pub/sub component (Azure Service Bus)

Components are declared in YAML and registered against the environment. Scope them to only the apps that need them — an unscoped component is loaded by every Dapr-enabled app in the environment.

# pubsub-servicebus.yaml
componentType: pubsub.azure.servicebus.topics
version: v1
metadata:
  - name: namespaceName
    value: "sb-orders.servicebus.windows.net"
  - name: consumerID
    value: "orders-worker"
# Identity-based auth: each app's managed identity needs Azure Service Bus
# Data Sender (publisher) or Data Receiver (subscriber), plus Data Owner if
# Dapr is left to create topics and subscriptions itself.
scopes:
  - orders-api
  - orders-worker

az containerapp env dapr-component set \
  -g $RG -n $ENV \
  --dapr-component-name orderpubsub \
  --yaml pubsub-servicebus.yaml

Note there is no apiVersion/kind/metadata.name block here — the ACA YAML schema for dapr-component set is the component spec body only; the component name comes from --dapr-component-name. This trips up everyone copying a raw Dapr component manifest. The difference, spelled out because it costs an hour:

Field	Raw Dapr (Kubernetes) manifest	ACA `dapr-component set` YAML
`apiVersion`	`dapr.io/v1alpha1`	Omitted
`kind`	`Component`	Omitted
`metadata.name`	the component name	Omitted — use `--dapr-component-name`
`spec.type`	`pubsub.azure.servicebus.topics`	`componentType:` at root
`spec.version`	`v1`	`version:` at root
`spec.metadata`	list of name/value	`metadata:` at root
`scopes`	under root	`scopes:` at root (same)
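Because the two schemas differ mostly in wrapping, a pre-deploy lint catches the most common copy-paste mistake. The sketch below greps for the Kubernetes-only root keys from the table and for a missing root `componentType:`. It is a heuristic under that assumption, not a YAML schema validator, and `aca_component_lint` is a hypothetical helper.

```shell
#!/bin/sh
# Heuristic lint for ACA `dapr-component set` YAML.
# Flags Kubernetes-manifest root keys (apiVersion, kind, spec) and a missing
# root componentType. Not a full YAML/schema validator.
aca_component_lint() {
  local file="$1" rc=0 key
  for key in apiVersion kind spec; do
    if grep -qE "^${key}:" "$file"; then
      echo "$file: root '$key:' belongs to a raw Dapr manifest, not ACA YAML" >&2
      rc=1
    fi
  done
  if ! grep -qE '^componentType:' "$file"; then
    echo "$file: missing root 'componentType:'" >&2
    rc=1
  fi
  return "$rc"
}
```

Run it in front of the registration, for example `aca_component_lint pubsub-servicebus.yaml && az containerapp env dapr-component set -g $RG -n $ENV --dapr-component-name orderpubsub --yaml pubsub-servicebus.yaml`.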

The publisher calls its own sidecar; Dapr handles the broker:

# From inside orders-api, the sidecar listens on $DAPR_HTTP_PORT (3500)
curl -X POST "http://localhost:3500/v1.0/publish/orderpubsub/orders.created" \
  -H "Content-Type: application/json" \
  -d '{"orderId":"A-1001","total":42.50}'

The subscriber declares its subscription (programmatically via /dapr/subscribe or a declarative subscription resource), and Dapr POSTs each message to the app’s route. State and service invocation follow the same pattern. For state, register a state.azure.cosmosdb component, then POST a JSON array of key/value pairs to http://localhost:3500/v1.0/state/<store> to save and GET http://localhost:3500/v1.0/state/<store>/<key> to read. For service invocation, call http://localhost:3500/v1.0/invoke/orders-worker/method/health. There is no DNS and no client-side load balancing, and you get mTLS between sidecars for free.
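The sidecar paths are regular enough to build in one place rather than hand-typing them in every script. This sketch assumes the sidecar's HTTP port is in `DAPR_HTTP_PORT` and falls back to 3500 as above. It prints URLs for the publish, invoke, and state APIs, plus the JSON shape Dapr expects from a programmatic `GET /dapr/subscribe` handler. All helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: build Dapr sidecar URLs in one place.
# Assumes DAPR_HTTP_PORT holds the sidecar's HTTP port; falls back to 3500.

dapr_base() {
  echo "http://localhost:${DAPR_HTTP_PORT:-3500}/v1.0"
}

# POST a JSON body here to publish: dapr_publish_url <pubsub-component> <topic>
dapr_publish_url() {
  echo "$(dapr_base)/publish/$1/$2"
}

# Call another app through its sidecar: dapr_invoke_url <dapr-app-id> <method>
dapr_invoke_url() {
  echo "$(dapr_base)/invoke/$1/method/$2"
}

# Save (POST, no key) or read (GET, with key): dapr_state_url <store> [key]
dapr_state_url() {
  echo "$(dapr_base)/state/$1${2:+/$2}"
}

# JSON an app could return from GET /dapr/subscribe (programmatic subscription):
# dapr_subscriptions_json <pubsub-component> <topic> <route-your-handler-serves>
dapr_subscriptions_json() {
  printf '[{"pubsubname":"%s","topic":"%s","route":"%s"}]\n' "$1" "$2" "$3"
}
```

For example, `curl -s "$(dapr_state_url statestore A-1001)"` reads one key, assuming a state component named `statestore` (hypothetical) is registered and scoped to the calling app. `dapr_subscriptions_json orderpubsub orders.created /orders/created` shows the subscription shape for a handler at the hypothetical route `/orders/created`.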

The component metadata keys you set per backing service — the ones that actually matter:

Component type	Key metadata	Auth options	Common mistake
`pubsub.azure.servicebus.topics`	`namespaceName`, `consumerID`	MI or connection string	Sharing one `consumerID` across apps → competing consumers
`pubsub.azure.servicebus.queues`	`namespaceName`	MI or connstring	Queue vs topic mismatch with publisher
`state.azure.cosmosdb`	`url`, `database`, `collection`	MI or key	Partition key mismatch → 400 on save
`state.azure.blobstorage`	`accountName`, `containerName`	MI or key	No ETag support unless configured
`bindings.azure.storagequeues`	`accountName`, `queue`	MI or key	Direction (input/output) not set
`bindings.azure.eventgrid`	topic endpoint, scopes	MI or key	Webhook validation handshake missed
`secretstores.azure.keyvault`	`vaultName`	MI	UAMI lacks Secrets User role

Why this over plain HTTP between apps? Dapr service invocation gives you mTLS, retries, and consistent telemetry without an SDK. But it adds a sidecar (latency + memory) to every replica. If two services only ever do simple internal HTTP, internal ingress alone may be enough.

The honest trade-off, so you opt in deliberately rather than by reflex:

Concern	Dapr service invocation	Plain internal HTTP
mTLS between services	Automatic	You wire it (or skip it)
Retries / resiliency policies	Built-in, declarative	Your client library
Telemetry / distributed trace	Sidecar emits spans	You instrument
Per-replica cost	Sidecar CPU + memory	None
Latency	Extra localhost hop	Direct
Portability off-Azure	High (Dapr API)	Tied to your code
Learning curve	Dapr concepts/components	None

KEDA scale rules: HTTP, queue depth, and custom

ACA scaling is KEDA. Every app has a scale rule; the default is HTTP concurrency. The numbers that matter are --min-replicas and --max-replicas, plus the rule that decides where between them you sit.

Scale to zero

Setting --min-replicas 0 lets an idle app cost nothing. The catch: scale-to-zero requires an event source that can wake the app. HTTP and the queue/event scalers can; a plain TCP app with no trigger cannot wake from zero. The worker is the perfect candidate: no messages, no replicas.

Which triggers can wake an app from zero, and which cannot — the single table that prevents the most common “stuck at 0” incident:

Scale rule type	Wakes from 0?	What it watches	Notes
`http`	Yes	Concurrent requests	The default; HTTP request itself wakes it
`azure-servicebus`	Yes	Queue/topic message count	KEDA polls the broker even at 0
`azure-queue`	Yes	Storage Queue length	Polls at 0
`kafka`	Yes	Consumer lag	Polls at 0
`redis` / `redis-streams`	Yes	List/stream length	Polls at 0
`cron`	Yes	Time window	Wakes on schedule
`cpu`	No	CPU %	Metric only meaningful with ≥1 replica
`memory`	No	Memory %	Same — cannot wake from 0
`tcp` (custom, no trigger)	No	n/a	Nothing polls; app stays at 0
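The same table can gate a deploy script. If a pipeline is about to set `--min-replicas 0`, it can first check that at least one configured rule type can wake the app. This sketch hard-codes exactly the rows above and returns 2 for any type the table does not list (for example `azure-eventhub`), so treat unknowns as "check the KEDA scaler docs". The function names are hypothetical.

```shell
#!/bin/sh
# Mirrors the wake-from-zero table above. Returns 0 (can wake), 1 (cannot),
# or 2 (type not in the table: check the KEDA scaler docs).
can_wake_from_zero() {
  case "$1" in
    http|azure-servicebus|azure-queue|kafka|redis|redis-streams|cron) return 0 ;;
    cpu|memory|tcp) return 1 ;;
    *) return 2 ;;
  esac
}

# check_scale_to_zero <min-replicas> <rule-type>...
# Fails when min-replicas is 0 and none of the configured rule types can wake the app.
check_scale_to_zero() {
  local min="$1" t
  shift
  [ "$min" -gt 0 ] && return 0
  for t in "$@"; do
    can_wake_from_zero "$t" && return 0
  done
  echo "min-replicas 0 with rule type(s) '$*': nothing can wake this app" >&2
  return 1
}
```

For the worker in this article, `check_scale_to_zero 0 azure-servicebus` passes, while `check_scale_to_zero 0 cpu memory` fails with a message.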

HTTP scaling

az containerapp update -g $RG -n orders-api \
  --min-replicas 1 --max-replicas 20 \
  --scale-rule-name http-rule \
  --scale-rule-type http \
  --scale-rule-http-concurrency 50

Each replica handles ~50 concurrent requests before KEDA adds another. Keep min-replicas at 1+ for latency-sensitive public APIs to dodge cold starts. The replica-count knobs and their effects:

Knob	What it controls	Default	Raise it when	Lower it when
`--min-replicas`	Floor (warm capacity)	0	Latency-sensitive; avoid cold start	Pure cost in non-prod
`--max-replicas`	Ceiling (cost cap + protection)	10	Known burst peaks	Protect a fragile downstream
`--scale-rule-http-concurrency`	Requests per replica before adding	10	Cheap, fast handlers	Heavy per-request work
Cooldown (managed)	Wait before scaling in	platform	—	Not directly tunable on ACA
Polling interval (managed)	How often KEDA checks	platform	—	Not directly tunable on ACA
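As a rough mental model, KEDA (through the Kubernetes HPA) aims for about the total metric divided by the per-replica target, rounded up and clamped to your min/max. The sketch below assumes that simplified formula. The real controller also smooths over its polling window and cooldown, so treat the result as an estimate. The formula applies to the HTTP rule (concurrent requests over the concurrency target) and to the queue rule that follows (backlog over `messageCount`). `desired_replicas` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: estimate replicas as ceil(total / per-replica target), clamped to [min, max].
# Integer inputs only; ignores KEDA/HPA smoothing and cooldown.
desired_replicas() {
  local total="$1" target="$2" min="$3" max="$4" want
  want=$(( (total + target - 1) / target ))   # integer ceiling division
  [ "$want" -lt "$min" ] && want="$min"
  [ "$want" -gt "$max" ] && want="$max"
  echo "$want"
}
```

With the settings above, `desired_replicas 230 50 1 20` gives 5 API replicas. A 1,000-message backlog against `messageCount=20` with max 30 gives 30, not 50.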

Queue-depth scaling (the worker)

Scale orders-worker on Service Bus queue length, from zero. Authentication metadata for custom scalers references a secret on the app:

# Store the connection string as an app secret (configuration, no new revision)
az containerapp secret set -g $RG -n orders-worker \
  --secrets "sb-conn=<service-bus-connection-string>"

az containerapp update -g $RG -n orders-worker \
  --min-replicas 0 --max-replicas 30 \
  --scale-rule-name sb-queue \
  --scale-rule-type azure-servicebus \
  --scale-rule-metadata "queueName=orders" "messageCount=20" \
  --scale-rule-auth "connection=sb-conn"

messageCount=20 is the target backlog per replica: 200 pending messages drives ~10 replicas. This is throughput tuning, not just a threshold — set it from how long one message takes to process.

The same shape covers azure-queue (Storage Queues), kafka, redis, and dozens of other KEDA scalers; --scale-rule-type plus --scale-rule-metadata is the universal lever. Note ACA fixes the KEDA polling/cooldown internally — you tune target metrics, not the controller. The scalers you will actually use on Azure, with their key metadata:

`--scale-rule-type`	Metadata that matters	Auth	Target metric meaning
`http`	`concurrentRequests`	none	Requests per replica
`azure-servicebus`	`queueName`/`topicName`+`subscriptionName`, `messageCount`	connection or MI	Messages per replica
`azure-queue`	`queueName`, `queueLength`	connection or MI	Queue items per replica
`azure-eventhub`	`consumerGroup`, `unprocessedEventThreshold`	connection or MI	Lag per replica
`kafka`	`topic`, `consumerGroup`, `lagThreshold`	SASL/MI	Consumer lag per replica
`redis` / `redis-streams`	`listName`/`stream`, `listLength`	password	List/stream length per replica
`cron`	`start`, `end`, `desiredReplicas`, `timezone`	none	Replicas during the window
`cpu`	`type=Utilization`, `value`	none	CPU % (needs ≥1 replica)
`memory`	`type=Utilization`, `value`	none	Memory % (needs ≥1 replica)

Tuning messageCount from real numbers, not vibes — a worked table:

Per-message processing time	Backlog	`messageCount` choice	Resulting replicas	Drain time per replica
50 ms	1,000	100	~10	~5 s (100 × 50 ms)
500 ms	1,000	20	50 wanted; capped at max (30 above)	~17 s at 30 replicas (~34 messages each)
2 s	200	10	~20	~20 s (10 × 2 s) if max allows
30 s (heavy)	60	5	~12	~150 s; cap max to protect downstream
Variable / spiky	any	start at p50 throughput	autoscale settles	watch and adjust
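The `messageCount` values in the table follow one rule of thumb. Decide how long you are willing to let one replica's share of the backlog take, then divide by the per-message time. This minimal sketch assumes integer milliseconds and uses a hypothetical `suggest_message_count` name. It reproduces the first four rows when the target drain is 5 s, 10 s, 20 s, and 150 s respectively.

```shell
#!/bin/sh
# Sketch: messageCount ~= messages one replica can finish within the drain time you accept.
# suggest_message_count <per_message_ms> <target_drain_ms>   (integers, milliseconds)
suggest_message_count() {
  local per_message_ms="$1" target_drain_ms="$2" n
  n=$(( target_drain_ms / per_message_ms ))
  [ "$n" -lt 1 ] && n=1   # never suggest 0
  echo "$n"
}
```

Re-measure the per-message time after any change to the handler or the downstream, because a slower dependency silently raises the right `messageCount`.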

Revisions: single vs multiple mode

Every meaningful change to an app’s template (image, env vars, scale, resources) creates a new immutable revision. Changes to configuration (ingress, secrets, registries) do not — that distinction is the whole revision model.

The exhaustive trigger table — memorise the left column or you will be surprised by a revision you didn’t expect (or its absence):

Change	Lives in	Mints a new revision?	Why
Container image / tag	template	Yes	New code = new immutable snapshot
Environment variables	template	Yes	Config baked into the revision
CPU / memory	template	Yes	Resource shape is template
Scale min/max + rules	template	Yes	Scale is part of the template
Probes (startup/live/ready)	template	Yes	Health config is template
Command / args	template	Yes	Entry behaviour is template
Ingress (external/internal/port)	configuration	No	Shared across revisions
Traffic weights	configuration	No	Routing, not a snapshot
Secrets (add/update value)	configuration	No*	*but env vars referencing them are template
Registry credentials	configuration	No	Pull config is shared
Dapr enable/disable + IDs	configuration	No	Dapr config is app-level
Labels	configuration	No	Alias to an existing revision

Two modes:

Single revision mode (default): activating a new revision deactivates the old one. Clean, but no overlap.
Multiple revision mode: old and new revisions run side by side, and you control how traffic splits. This is what unlocks blue-green and canary.

Aspect	Single revision mode	Multiple revision mode
Old revision on new deploy	Deactivated immediately	Stays active
Traffic control	100% to latest, automatic	You set weights
Blue-green / canary	Not possible	The whole point
In-flight requests on deploy	Cut unless you handle SIGTERM well	Drained gracefully; old stays warm
Rollback	Redeploy old image	One-line weight flip (instant)
Cost	One revision’s replicas	Two revisions’ replicas during overlap
Default?	Yes	Opt in

Switch the API to multiple mode and pin a readable revision suffix:

az containerapp revision set-mode -g $RG -n orders-api --mode multiple

az containerapp update -g $RG -n orders-api \
  --image acrorders.azurecr.io/orders-api:1.4.0 \
  --revision-suffix v1-4-0

The suffix makes the revision name orders-api--v1-4-0 instead of a random hash — non-negotiable for traffic-splitting commands and runbooks. Suffixes must be unique per app; you cannot reuse v1-4-0 even after deactivating that revision, so encode the build/semver.
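A small helper can keep suffixes valid and unique by deriving them from the semver plus an optional CI build number. This is a sketch that rests on an assumption you should check against the current ACA naming rules: a suffix may contain only lowercase letters, digits, and single hyphens, and it must start with a letter. `revision_suffix` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Derive a revision suffix from a semver and an optional CI build number.
# Assumed rule (check current ACA docs): lowercase letters, digits and single
# hyphens only, starting with a letter.
revision_suffix() {
  local semver=$1 build=${2:-}
  local suffix="v${semver//./-}"          # 1.4.0 -> v1-4-0
  if [[ -n $build ]]; then
    suffix+="-b${build}"                  # a rebuild of 1.4.0 stays unique
  fi
  suffix=${suffix,,}                      # 1.5.0-RC1 -> v1-5-0-rc1
  local re='^[a-z][a-z0-9]*(-[a-z0-9]+)*$'
  if [[ ! $suffix =~ $re ]]; then
    echo "invalid revision suffix: $suffix" >&2
    return 1
  fi
  echo "$suffix"
}

revision_suffix 1.4.0        # v1-4-0
revision_suffix 1.4.0 512    # v1-4-0-b512
```

The `v` prefix keeps names in line with orders-api--v1-4-0. Appending the build number means a rebuilt 1.4.0 never collides with the revision that already carries v1-4-0.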

Revision lifecycle states and what each means operationally:

State	Meaning	Takes traffic?	How you get here
Provisioning	Replicas starting	No	Just created
Running / Active	Healthy, in service	If weight > 0	Normal
Activating / Deactivating	Transitioning	Briefly	Mode change, manual toggle
Inactive	Kept but scaled to 0	No	Deactivated; can reactivate (multi mode)
Failed	Could not become healthy	No	Bad image/port/probe
Scaled-to-zero	Active but `min-replicas 0`, idle	On next trigger	Event-driven worker at rest

Weighted traffic splitting: canary and blue-green

In multiple revision mode, ingress traffic is distributed by weight across revisions. Ship 1.5.0 alongside 1.4.0 but send it nothing yet:

az containerapp update -g $RG -n orders-api \
  --image acrorders.azurecr.io/orders-api:1.5.0 \
  --revision-suffix v1-5-0

# Both revisions exist; keep 100% on the stable one
az containerapp ingress traffic set -g $RG -n orders-api \
  --revision-weight orders-api--v1-4-0=100 orders-api--v1-5-0=0

Canary in steps — weights must sum to 100:

# 10% canary
az containerapp ingress traffic set -g $RG -n orders-api \
  --revision-weight orders-api--v1-4-0=90 orders-api--v1-5-0=10

# Watch metrics, then 50/50, then cut over
az containerapp ingress traffic set -g $RG -n orders-api \
  --revision-weight orders-api--v1-4-0=0 orders-api--v1-5-0=100
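A mistyped weight can route far more traffic than you intended, so it is worth validating the pairs before they reach `az containerapp ingress traffic set`. This is a minimal sketch. `check_weights` is a hypothetical helper that checks only two things: each value is an integer from 0 to 100, and the values sum to exactly 100.

```shell
#!/usr/bin/env bash
# Validate revision=weight pairs before calling `az containerapp ingress traffic set`.
check_weights() {
  local pair raw weight sum=0
  for pair in "$@"; do
    raw=${pair##*=}                          # text after the last '='
    if [[ $pair != *=* || ! $raw =~ ^[0-9]+$ ]]; then
      echo "not a revision=weight pair: '$pair'" >&2
      return 1
    fi
    weight=$(( 10#$raw ))                    # force base 10 so "08" is 8
    if (( weight > 100 )); then
      echo "weight over 100 in '$pair'" >&2
      return 1
    fi
    sum=$(( sum + weight ))
  done
  if (( sum != 100 )); then
    echo "weights sum to $sum, not 100" >&2
    return 1
  fi
}

# Usage: only touch traffic when the pairs are sane.
# check_weights orders-api--v1-4-0=90 orders-api--v1-5-0=10 &&
#   az containerapp ingress traffic set -g $RG -n orders-api \
#     --revision-weight orders-api--v1-4-0=90 orders-api--v1-5-0=10
```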

Rollback is the same command with the weights reversed — instant, because the old revision is still running. To test the new revision without exposing any users to it, give it a label and hit its stable per-label FQDN directly:

az containerapp revision label add -g $RG -n orders-api \
  --revision orders-api--v1-5-0 --label canary
# -> https://orders-api---canary.<env-hash>.<region>.azurecontainerapps.io

You can also set --revision-weight latest=N so that each new revision automatically inherits an N% canary slice; this is useful in CI/CD where the suffix is generated per build.

A canary ramp as a runbook table — the gate at each step is the discipline:

Step	Stable weight	Canary weight	Gate before proceeding	Rollback move
0. Dark deploy	100	0	Smoke test on `--label canary` FQDN	Deactivate revision
1. Toe in	95	5	Error rate flat 5 min in App Insights	Set canary=0
2. Canary	90	10	p95 latency within budget	Set canary=0
3. Half	50	50	No new exception signatures	Flip to stable=100
4. Majority	10	90	Dependency failures flat	Flip to stable=100
5. Cut over	0	100	Hold; keep old warm 24 h	Flip to old=100 (instant)
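The runbook can be driven by a dry-run loop that prints each traffic command and stops at the first failed gate. This is a sketch only. `ramp` and `gate_ok` are hypothetical names, and the placeholder `gate_ok` passes every step until you wire it to the App Insights checks in the gate column.

```shell
#!/usr/bin/env bash
# Placeholder gate (args: canary revision, current canary %): always passes.
# Replace it with real failure-rate / p95 checks before trusting the ramp.
gate_ok() { return 0; }

# Dry-run of the canary ramp: print each traffic command, stop at a failed gate.
ramp() {
  local stable=$1 canary=$2 step
  local fmt='az containerapp ingress traffic set -g $RG -n orders-api --revision-weight %s=%d %s=%d\n'
  for step in 0 5 10 50 90 100; do
    printf "$fmt" "$stable" $(( 100 - step )) "$canary" "$step"
    if (( step == 100 )); then
      break                                # cut over: hold and keep old warm
    fi
    if ! gate_ok "$canary" "$step"; then   # the gate before proceeding
      echo "gate failed at ${step}%: flip back to $stable=100"
      return 1
    fi
  done
}

ramp orders-api--v1-4-0 orders-api--v1-5-0
```

Printing the commands rather than running them keeps the sketch safe to try. Swap the `printf` for the real command only once `gate_ok` queries real telemetry.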

The traffic/label routing methods compared — pick the one that fits the test:

Method	Who hits the new revision	Use for	Limit
`--revision-weight <rev>=N`	N% of all ingress users	Progressive rollout	Random users; no targeting
`--revision-weight latest=N`	N% to whatever is newest	CI/CD auto-canary	“latest” moves as you deploy
`--label <name>` + per-label FQDN	Only callers of that FQDN	Smoke tests, internal QA	You must route testers to it
Single mode (no split)	Everyone, instantly	Simple non-prod	No overlap, cuts in-flight

Secrets, managed identity, and a private registry

Hardcoding a registry password or connection string in the template is one of the most common ACA mistakes. Use a user-assigned managed identity for both registry pull and Key Vault-backed secrets.

# Identity + AcrPull on the registry
UAMI_ID=$(az identity create -g $RG -n id-orders --query id -o tsv)
UAMI_CID=$(az identity show -g $RG -n id-orders --query clientId -o tsv)
ACR_ID=$(az acr show -n acrorders --query id -o tsv)

az role assignment create \
  --assignee "$UAMI_CID" --role AcrPull --scope "$ACR_ID"

# Attach identity and configure registry to use it (no password)
az containerapp identity assign -g $RG -n orders-api --user-assigned "$UAMI_ID"

az containerapp registry set -g $RG -n orders-api \
  --server acrorders.azurecr.io \
  --identity "$UAMI_ID"

Reference a Key Vault secret instead of inlining it. The identity needs Key Vault Secrets User on the vault:

az containerapp secret set -g $RG -n orders-api \
  --secrets "sb-conn=keyvaultref:https://kv-orders.vault.azure.net/secrets/sb-conn,identityref:$UAMI_ID"

# Surface the secret to the app as an env var
az containerapp update -g $RG -n orders-api \
  --set-env-vars "SB_CONNECTION=secretref:sb-conn"

keyvaultref:...,identityref:... makes ACA resolve the secret at runtime through the managed identity — the value never lives in your IaC or pipeline. secretref: then projects it to an env var without exposing it in the template.
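The keyvaultref/identityref string is easy to mistype, so a helper that assembles it can be handy in scripts. This sketch assumes the vault uses the public-cloud `vault.azure.net` suffix, and `kv_secret_arg` is a hypothetical helper. Leaving the version off the secret URI, as here and in the command above, is what lets a rotation in Key Vault flow through without editing the app.

```shell
#!/usr/bin/env bash
# Assemble the --secrets value for a Key Vault reference resolved via a UAMI.
# Assumes the public-cloud vault suffix vault.azure.net.
kv_secret_arg() {
  local name=$1 vault=$2 kv_secret=$3 identity_id=$4
  # No version segment in the URI, so the reference follows the latest version.
  printf '%s=keyvaultref:https://%s.vault.azure.net/secrets/%s,identityref:%s\n' \
    "$name" "$vault" "$kv_secret" "$identity_id"
}

# Usage, with the UAMI_ID captured earlier:
# az containerapp secret set -g $RG -n orders-api \
#   --secrets "$(kv_secret_arg sb-conn kv-orders sb-conn "$UAMI_ID")"
```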

The RBAC roles each integration needs — grant the minimum, not Contributor:

Integration	Identity	Role	Scope	If missing
ACR image pull	UAMI / system	AcrPull	The registry	`ImagePullBackOff` / revision Failed
Key Vault secret ref	UAMI / system	Key Vault Secrets User	The vault	Secret resolves empty → crash loop
Service Bus (Dapr, MI auth)	UAMI / system	Azure Service Bus Data Receiver/Sender	Namespace/entity	Sidecar can’t connect; pub/sub dead
Cosmos DB state (MI auth)	UAMI / system	Cosmos DB Built-in Data Contributor (data-plane RBAC)	Account	State ops 403
Storage Queue scaler (MI)	UAMI / system	Storage Queue Data Reader	Storage account	Scaler can’t read length; no scale
Pull logs / manage	operator	Container Apps Contributor	RG/app	Can’t deploy/operate

Secret sources and how they surface — the three ways a value reaches your container:

Secret source	Declared as	Reaches the app via	Rotates by
Inline ACA secret	`--secrets "k=v"`	`secretref:k` env var or scaler auth	`secret set` (mints nothing; restart the revision to pick up the value)
Key Vault reference	`--secrets "k=keyvaultref:<uri>,identityref:<id>"`	`secretref:k`; resolved at runtime	Rotate in KV; ACA re-reads
Dapr secret store	`secretstores.azure.keyvault` component	`/v1.0/secrets/<store>/<key>`	Rotate in KV

Health probes, startup ordering, and graceful shutdown

ACA supports the three Kubernetes probe types, declared in the container template. Bicep is the clean way to express them:

// fragment of the container template
probes: [
  {
    type: 'Startup'
    httpGet: { path: '/healthz/startup', port: 8080 }
    periodSeconds: 15
    failureThreshold: 10   // 15 s × 10 = up to 150 s to finish booting; ACA caps failureThreshold at 10
  }
  {
    type: 'Liveness'
    httpGet: { path: '/healthz/live', port: 8080 }
    periodSeconds: 10
    failureThreshold: 3
  }
  {
    type: 'Readiness'
    httpGet: { path: '/healthz/ready', port: 8080 }
    periodSeconds: 5
    failureThreshold: 3
  }
]

The three probes, what each governs, and the failure each prevents (or causes when misconfigured):

Probe	Question it answers	On failure	Common misconfig	Result of the misconfig
Startup	Has the app finished booting?	Keep waiting (up to threshold)	`failureThreshold` too low	Slow boots killed → restart loop
Liveness	Is the process wedged?	Restart the container	Checks a dependency	Dependency blip → needless restarts
Readiness	Can it serve traffic now?	Pull from rotation (no restart)	Always returns 200	Cold/half-ready replica takes traffic → 502s

Probe tuning fields and sane starting values:

Field	Meaning	Startup	Liveness	Readiness
`initialDelaySeconds`	Wait before first probe	0	0–5	0
`periodSeconds`	Interval between probes	15	10	5
`timeoutSeconds`	Per-probe timeout	1–2	1–2	1–2
`failureThreshold`	Fails before action	10 (≈150 s budget at a 15 s period)	3	3
`successThreshold`	Successes to recover	1	1	1
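The startup budget in these tables is simple arithmetic: a replica gets roughly initialDelaySeconds + periodSeconds × failureThreshold seconds before the startup probe gives up. The sketch below computes that budget and compares it with a measured boot time. `startup_budget` is a hypothetical helper, and the 90 s p99 boot time is an illustrative value, not a measurement.

```shell
#!/usr/bin/env bash
# Approximate seconds a replica gets to pass its startup probe before restart:
# initialDelaySeconds + periodSeconds * failureThreshold (timeouts ignored).
startup_budget() {
  local initial_delay=$1 period=$2 failure_threshold=$3
  echo $(( initial_delay + period * failure_threshold ))
}

boot_p99=90                          # illustrative p99 cold-start time (seconds)
budget=$(startup_budget 0 15 10)     # 0 + 15 * 10 = 150
if (( budget * 2 < boot_p99 * 3 )); then   # want budget >= 1.5 x p99 boot
  echo "startup budget ${budget}s is under 1.5x p99 boot; raise it"
else
  echo "startup budget ${budget}s covers p99 boot ${boot_p99}s with headroom"
fi
```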

Startup ordering across services: ACA has no dependsOn between apps at runtime. Don’t assume orders-worker is up when orders-api starts — make readiness probes reflect real dependencies (e.g. /healthz/ready returns 503 until the Service Bus connection is live) and let retries do the rest. Dapr helps here: the sidecar retries service-invocation calls that fail on transient network errors, so brief unavailability of a callee doesn’t hard-fail the caller.

Graceful shutdown: on scale-in or a new revision, ACA sends SIGTERM, stops routing new requests, and waits out the termination grace period before SIGKILL. Your app must catch SIGTERM, drain in-flight work, and exit. For the queue worker this means: stop pulling new messages, finish the current one, then exit — otherwise scale-in events drop messages mid-process.

The shutdown sequence as a timeline, so you know exactly what you have to handle:

Phase	What ACA does	Your app must	If you ignore it
1. Decide to stop	Scale-in or new revision	—	—
2. De-register	Stop routing new requests/messages to this replica	—	—
3. `SIGTERM`	Sends the signal	Catch it; begin drain	Process keeps pulling work
4. Grace period	Waits `terminationGracePeriodSeconds` (default 30 s)	Finish in-flight; stop consumers	In-flight cut at SIGKILL
5. `SIGKILL`	Force-kills if still alive	(should have exited)	Dropped HTTP responses / lost messages
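Here is the drain pattern from the timeline as a runnable shell sketch. Bash is used only for illustration. `worker_loop` and its echo/sleep lines are hypothetical stand-ins for your broker SDK's receive, process, and settle calls. What the sketch shows is the order of operations: on SIGTERM, finish and settle the in-flight message, take nothing new, then exit before the grace period ends. In a container, the process that handles SIGTERM must be PID 1 (for example, start it with `exec`), because PID 1 is the process the runtime signals.

```shell
#!/usr/bin/env bash
# SIGTERM drain sketch. The echo/sleep lines are hypothetical stand-ins for a
# broker SDK's receive, process and settle calls.
worker_loop() {
  draining=0
  # Bash runs this trap only after the current foreground command finishes,
  # so a SIGTERM mid-message lets that message complete and settle.
  trap 'draining=1' TERM
  local n=0
  while (( draining == 0 )); do
    n=$(( n + 1 ))
    echo "received msg-$n"      # take one message
    sleep 1                     # do the work
    echo "settled msg-$n"       # complete/settle it on the broker
  done
  echo "drained; exiting"       # exit 0 well inside the grace period
}

# Try it: worker_loop & sleep 0.3; kill -TERM $!; wait $!
```

In a real SDK, the trap becomes a flag or a cancellation token that stops the receiver, and the handler awaits the in-flight message before returning.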

Architecture at a glance

The diagram traces a real request and a real message through the system, left to right, and pins the failure classes onto the exact hop where each bites. A client (or Application Gateway / Front Door when the environment is --internal-only) hits the environment ingress — an Envoy front end that owns the FQDN, terminates TLS, and splits traffic by revision weight. From there the request lands on the orders-api app, which runs as one or more immutable revisions (stable + canary), each replica paired with a Dapr sidecar on localhost:3500. When orders-api publishes orders.created, the Dapr pub/sub component routes it to Azure Service Bus; KEDA watches that queue depth and wakes orders-worker from zero, scaling replicas to the backlog. State and secrets resolve through Cosmos DB, Key Vault, and a user-assigned managed identity — no connection strings in the template.

Read the numbered badges as the failure map. Badge 1 sits on the ingress/port hop: a container bound to 127.0.0.1 or the wrong --target-port makes ingress return 502 even while the app is “running”. Badge 2 sits on revision routing: weights that don’t sum to 100 or a latest pin that moved send the canary 100% of traffic. Badge 3 sits on the Dapr sidecar: an unscoped or misnamed component means the sidecar 500s or every app loads every broker. Badge 4 sits on the KEDA edge: min-replicas 0 with a CPU/memory trigger cannot wake, so the worker stays dead and the queue grows. Badge 5 sits on identity: a UAMI missing AcrPull or Key Vault Secrets User fails the image pull or resolves a secret to empty, crash-looping the revision. The legend narrates each as symptom, the one command that confirms it, and the fix.

Real-world scenario

Lumio Payments runs an orders-api and three downstream workers on ACA, all scale-to-zero to control cost in non-prod. The environment uses workload profiles, is --internal-only, sits behind Application Gateway, and runs in Central India. Traffic averages 300 requests/second with a Friday-evening spike to ~1,400 rps at payout time. The platform team is three engineers; the monthly ACA + Service Bus spend is about ₹22,000. The mandate from the platform org: production rollouts must be progressive and instantly reversible without a redeploy — and there is no Kubernetes team and no service-mesh budget.

Two related incidents forced the redesign. First, every Friday-evening deploy caused a brief spike of 502s. The apps ran in single revision mode, so activating a new revision tore down the old one the instant the new one became active, and in-flight payment requests on draining replicas were cut. Second, a bad build once shipped straight to 100% of traffic with no safety net, because single mode has no concept of a weighted canary — the new revision simply took everything.

The breakthrough was realising ACA already shipped the entire progressive-delivery toolkit; they were just not using it. They put orders-api in multiple revision mode with semver revision suffixes, and changed the pipeline to deploy at 0% weight, attach a canary label, and run smoke tests against the per-label FQDN before any user saw the build. Promotion became a weighted ramp (10 → 50 → 100) gated on Application Insights failure-rate, with rollback as a one-line weight flip to the previous revision — which was still warm because multiple mode keeps it active. They also fixed graceful shutdown so SIGTERM drained in-flight orders, killing the deploy-time 502s at the source rather than masking them.

# CI step: pin stable by name, ship dark, smoke-test the canary label, then ramp
SUFFIX=${SEMVER//./-}
NEW_REV="orders-api--$SUFFIX"
STABLE_REV=$(az containerapp ingress show -g $RG -n orders-api \
  --query 'traffic[?weight>`0`].revisionName | [0]' -o tsv)
[ -n "$STABLE_REV" ] || { echo "no named revision is taking traffic" >&2; exit 1; }
# Pin 100% to the stable revision by name first, so the new revision starts at 0%
az containerapp ingress traffic set -g $RG -n orders-api \
  --revision-weight "$STABLE_REV=100"
az containerapp update -g $RG -n orders-api \
  --image acrorders.azurecr.io/orders-api:$SEMVER --revision-suffix "$SUFFIX"
az containerapp revision label add -g $RG -n orders-api \
  --revision "$NEW_REV" --label canary
# ... run smoke tests against https://orders-api---canary.<env-hash>... ...
az containerapp ingress traffic set -g $RG -n orders-api \
  --revision-weight "$NEW_REV=10" "$STABLE_REV=90"

A second, subtler problem surfaced once canary was live: the workers occasionally dropped messages on scale-in. Under bursty load KEDA would scale orders-worker out to 18 replicas, then scale back in as the queue drained — and a replica receiving SIGTERM mid-message exited before completing it, leaving the payment half-processed (the message had been received but not settled, so Service Bus re-delivered it, occasionally double-charging). The fix was a proper shutdown handler: on SIGTERM, stop the Service Bus receiver, finish the in-flight message, settle it, then exit. Combined with idempotency keyed on orderId, double-delivery became harmless.
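Idempotency keyed on orderId fits in a few lines. This is an illustration only: `handle_order` and the in-memory `processed` map are hypothetical. A real worker needs a durable store shared by all replicas, because KEDA adds and removes replicas and each one has its own memory.

```shell
#!/usr/bin/env bash
# Idempotent order handler: a redelivered orderId becomes a no-op.
# Illustration only: the in-memory map is lost when a replica exits; use a
# durable store shared by all replicas in a real worker. Needs bash 4+.
declare -A processed=()

handle_order() {
  local order_id=$1
  if [[ -n ${processed[$order_id]:-} ]]; then
    echo "skip duplicate $order_id"   # already charged: just settle the message
    return 0
  fi
  echo "charge $order_id"             # hypothetical side effect (the payment)
  processed[$order_id]=1              # remember it for any redelivery
}

# Usage: handle_order o-1; handle_order o-1   # charges once, skips once
```

This sketch records the key after the charge, which leaves a small window: a crash between those two lines still causes a repeat. To close it, pass orderId to the payment provider as its idempotency key, where the provider supports one.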

The outcome: the next Friday payout ran at 1,500 rps with zero deploy-time 502s and zero dropped messages; a bad build during the following week was caught at the canary-label smoke test and never took a single percent of user traffic; and rollback during a separate scare was a one-line weight flip that took effect in under two seconds because the prior revision was still warm. Spend held at ₹22,000 because the only added cost was the brief overlap of two revisions during each ramp. The lesson on the wall: “ACA’s revision + label + weight + SIGTERM primitives are a complete progressive-delivery system — no Argo Rollouts, no Flagger, no mesh, used deliberately.”

The incident as a before/after table, because the order of moves is the lesson:

Symptom	Root cause	Old behaviour	Fix applied	Result
Friday 502 spike on deploy	Single mode tore down old revision	In-flight requests cut	Multiple mode + SIGTERM drain	Zero deploy 502s
Bad build to 100% users	No weighted canary	New revision took everything	Deploy at 0% + canary label	Caught at smoke test
Double-charged payments	Replica killed mid-message	SIGTERM dropped in-flight	Drain + settle + idempotency	Zero dropped messages
Slow rollback	Old revision gone	Redeploy to revert	Weight flip to warm revision	< 2 s rollback

Advantages and disadvantages

The managed-Kubernetes-without-the-cluster model gives you the progressive-delivery and event-driven toolkit, but it also hides the machinery that makes that toolkit fail in non-obvious ways. Weigh it honestly:

Advantages (why ACA helps you)	Disadvantages (why it bites)
Scale-to-zero and event-driven autoscaling are built in (KEDA) — no controller to run	Scale-to-zero needs a wake-capable trigger; CPU/memory rules silently never wake from 0
Immutable revisions + weighted traffic = canary/blue-green with no Argo/Flagger	Single mode (the default) tears down the old revision and cuts in-flight requests
Dapr gives mTLS, retries, pub/sub, state without an SDK or a mesh	Every Dapr-enabled app loads every unscoped component — easy cross-talk and over-grant
No nodes, kubelets, CNI, or upgrades to operate	You lose `kubectl`-level control; debugging is through `az`/logs, not the cluster
Free FQDN + managed TLS via Envoy ingress	One port, must bind `0.0.0.0`; loopback or wrong port = 502 while “running”
Managed identity for pull/secrets/broker auth keeps secrets out of IaC	A missing RBAC role fails the pull or resolves a secret to empty → crash loop, no clear error
Rollback is an instant weight flip to a still-warm revision	The environment subnet is fixed at create; under-size it and you rebuild the environment
Per-second Consumption billing, free idle	Dedicated profiles bill per-node-hour with a floor; mixing models needs care

ACA is right for stateless HTTP APIs and event-driven workers that want autoscaling and progressive delivery without a cluster. It is wrong when you need DaemonSets, custom controllers/operators, GPU scheduling beyond what profiles offer, sub-millisecond pod-to-pod control, or the full Kubernetes API — there, AKS is the tool. The disadvantages are all manageable — but only if you know they exist, which is the point of the playbook below.

Hands-on lab

Build the two-service system end to end, watch KEDA scale the worker from zero, run a canary, and tear it all down. Free-tier-friendly (Consumption profile, scale-to-zero). Run in Cloud Shell (Bash).

Step 1 — Variables, providers, extension.

RG=rg-aca-lab
LOC=eastus
ENV=cae-lab
az group create -n $RG -l $LOC -o table
az extension add --name containerapp --upgrade
az provider register -n Microsoft.App --wait
az provider register -n Microsoft.OperationalInsights --wait

Step 2 — Create the environment (managed network is fine for the lab).

az containerapp env create -g $RG -n $ENV -l $LOC -o table

Expected: a cae-lab environment, provisioningState: Succeeded.

Step 3 — Deploy orders-api (public quickstart image, external ingress).

# The quickstart image listens on port 80, so that is the target port
az containerapp create -g $RG -n orders-api --environment $ENV \
  --image mcr.microsoft.com/k8se/quickstart:latest \
  --target-port 80 --ingress external \
  --min-replicas 1 --max-replicas 5 --cpu 0.5 --memory 1.0Gi -o table

FQDN=$(az containerapp show -g $RG -n orders-api \
  --query properties.configuration.ingress.fqdn -o tsv)
curl -s "https://$FQDN" -o /dev/null -w "HTTP %{http_code}\n"   # expect HTTP 200

Step 4 — Deploy orders-worker scaled-to-zero on a Storage Queue. Create a storage account + queue, then scale the worker on its length.

SA=stacalab$RANDOM
az storage account create -g $RG -n $SA -l $LOC --sku Standard_LRS -o none
CONN=$(az storage account show-connection-string -g $RG -n $SA -o tsv)
az storage queue create -n orders --connection-string "$CONN" -o none

az containerapp create -g $RG -n orders-worker --environment $ENV \
  --image mcr.microsoft.com/k8se/quickstart:latest \
  --min-replicas 0 --max-replicas 10 --cpu 0.25 --memory 0.5Gi \
  --secrets "queue-conn=$CONN" \
  --scale-rule-name q --scale-rule-type azure-queue \
  --scale-rule-metadata "queueName=orders" "queueLength=5" \
  --scale-rule-auth "connection=queue-conn" -o table

az containerapp replica list -g $RG -n orders-worker -o table   # expect EMPTY (0 replicas)

Step 5 — Wake it from zero. Push 50 messages; watch replicas appear.

for i in $(seq 1 50); do \
  az storage message put -q orders --content "msg-$i" --connection-string "$CONN" -o none; done
sleep 30
az containerapp replica list -g $RG -n orders-worker -o table   # expect 1+ replicas now
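A fixed `sleep 30` races the scaler's polling interval and replica start-up, so it can report a false failure. Polling with a timeout is more reliable. In this sketch, `wait_until` and `worker_has_replicas` are hypothetical helpers, and the predicate assumes `az containerapp replica list` returns a JSON array that `length(@)` can count.

```shell
#!/usr/bin/env bash
# Retry a command every INTERVAL seconds until it succeeds or TIMEOUT passes.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Predicate for step 5: does orders-worker have at least one replica?
worker_has_replicas() {
  local n
  n=$(az containerapp replica list -g "$RG" -n orders-worker \
        --query "length(@)" -o tsv) || return 1
  (( n > 0 ))
}

# wait_until 180 10 worker_has_replicas && echo "woke from zero"
```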

Step 6 — Multiple revision mode + a canary.

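az containerapp revision set-mode -g $RG -n orders-api --mode multiple
OLD_REV=$(az containerapp show -g $RG -n orders-api \
  --query properties.latestRevisionName -o tsv)
# Pin the current revision by name first, so v2 starts at 0% instead of taking "latest"
az containerapp ingress traffic set -g $RG -n orders-api \
  --revision-weight "$OLD_REV=100" -o none
az containerapp update -g $RG -n orders-api --revision-suffix v2 \
  --set-env-vars "VERSION=2" -o none
az containerapp ingress traffic set -g $RG -n orders-api \
  --revision-weight "$OLD_REV=90" orders-api--v2=10 -o table   # 10% to the new revision
az containerapp revision list -g $RG -n orders-api \
  --query "[].{name:name, active:properties.active, weight:properties.trafficWeight}" -o table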

Step 7 — Teardown. One command removes everything.

az group delete -n $RG --yes --no-wait

Expected-output checkpoints in one table, so you know each step worked:

Step	Command	Expected signal
3	`curl https://$FQDN`	`HTTP 200`
4	`replica list` (worker)	Empty — 0 replicas at rest
5	`replica list` after messages	1+ replicas (woke from zero)
6	`revision list`	Two active revisions, weights 90/10
7	`group delete`	Returns immediately (`--no-wait`)

Common mistakes & troubleshooting

This is the differentiator. ACA failures are opaque because the platform hides the machinery — the symptom (502, stuck worker, dropped message) rarely names its cause. Scan the playbook, find your symptom, run the exact confirm command, apply the fix. Most of these have nothing to do with your application code.

#	Symptom	Root cause	Confirm (exact command / path)	Fix
1	502 / connection refused, app “running”	Container binds `127.0.0.1` or wrong `--target-port`	`az containerapp logs show -g $RG -n <app> --type system`; look for probe fail	Bind `0.0.0.0:<port>`; set `--target-port` to it
2	Worker never wakes; queue grows	`min-replicas 0` with `cpu`/`memory` rule (can’t wake)	`az containerapp show ... --query properties.template.scale`	Use a queue/HTTP scaler that polls at 0
3	Every app sees every broker	Dapr component left unscoped	`az containerapp env dapr-component show ... --query properties.scopes`	Add `scopes:` with the right `dapr-app-id`s
4	`dapr-component set` rejected / no component	Pasted raw Dapr manifest (apiVersion/kind/metadata.name)	Diff your YAML vs the body-only schema	Strip to `componentType/version/metadata/scopes`
5	Canary took 100% of traffic	`latest` pin moved, or weights didn’t sum to 100	`az containerapp ingress show ... --query traffic`	Pin explicit revision names; ensure Σ=100
6	Rollback “didn’t work”	Old revision deactivated (single mode)	`az containerapp revision list --query "[].properties.active"`	Set `--mode multiple`; flip weight to old
7	`--revision-suffix` rejected	Suffix already used (including by an inactive revision)	`az containerapp revision list --query "[].name"`	Encode build/semver; suffixes are unique-forever
8	`ImagePullBackOff` / revision Failed	UAMI lacks AcrPull, or registry not set to identity	`az role assignment list --assignee <uami-cid>`	Grant AcrPull; `registry set --identity`
9	App crash-loops, secret looks empty	Key Vault ref unresolved (no Secrets User / wrong URI)	`az containerapp secret show`; check UAMI RBAC	Grant Key Vault Secrets User; fix URI
10	Messages double-processed	Replica `SIGKILL`ed mid-message on scale-in	Console logs show no settle before exit	Handle `SIGTERM`: drain + settle; add idempotency
11	Dapr pub/sub silent (no delivery)	`--dapr-app-port` wrong, or no `/dapr/subscribe` route	`--dapr-enable-api-logging true`, read sidecar logs	Set app port; expose subscription route
12	Competing-consumer message loss	Multiple apps share one `consumerID`	Compare `consumerID` across components	Unique `consumerID` per subscriber
13	Can’t grow the environment subnet	Subnet fixed at create, too small	`az network vnet subnet show ... --query addressPrefix`	Rebuild env on a `/23`+ subnet; migrate apps
14	gRPC streams break	`--transport auto` chose http/1	`az containerapp ingress show --query transport`	Set `--transport http2`
15	High Log Analytics bill	`--dapr-log-level debug` / api logging left on	`ContainerAppConsoleLogs_CL` volume by app	Reset to `info`; disable api logging

The deeper detail on the top five

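1 — Wrong port / loopback bind (ACA’s equivalent of the App Service WEBSITES_PORT trap). Envoy forwards ingress traffic to --target-port, and the default health probes check that same port, so your container must listen on 0.0.0.0:<that port>. A container bound to 127.0.0.1 refuses every connection from outside the container, probes included, even when the port number is right.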

# System logs carry the platform's probe/health story
az containerapp logs show -g $RG -n orders-api --type system --tail 50
# Fix: redeploy the image to bind 0.0.0.0, or correct the port
az containerapp ingress update -g $RG -n orders-api --target-port 8080

2 — Can’t wake from zero. cpu and memory are KEDA resource scalers, meaningful only while at least one replica is running to measure. With min-replicas 0 and only a CPU rule, nothing ever wakes the app. Use an HTTP rule (the incoming request itself wakes the app), a Service Bus, Storage Queue, Kafka, or cron rule (KEDA evaluates these even at zero replicas), or set min-replicas 1.
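This rule is easy to lint before you deploy. In this minimal sketch, `lint_scale` is a hypothetical helper that takes min-replicas and the scale rule types you configured, using the KEDA/ACA type names such as cpu, memory, azure-queue, and http. It assumes that every type other than cpu and memory can wake an app from zero.

```shell
#!/usr/bin/env bash
# Fail a deploy when min-replicas 0 has no rule that can wake the app.
# Assumption: every rule type other than cpu/memory can wake from zero.
lint_scale() {
  local min_replicas=$1 rule
  shift
  if (( $# == 0 )); then
    echo "usage: lint_scale MIN_REPLICAS RULE_TYPE..." >&2
    return 2
  fi
  if (( min_replicas > 0 )); then
    return 0                      # a replica is always up to measure load
  fi
  for rule in "$@"; do
    case "$rule" in
      cpu|memory) ;;              # resource scalers: need a running replica
      *) return 0 ;;              # event/HTTP rule: can wake the app from 0
    esac
  done
  echo "min-replicas 0 with only [$*]: the app can never wake" >&2
  return 1
}

# Usage: lint_scale 0 cpu           fails
#        lint_scale 0 azure-queue   passes
```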

3 — Unscoped Dapr component. A component with no scopes: block is mounted by every Dapr-enabled app in the environment — every app connects to that broker, multiplying connections and blast radius.

az containerapp env dapr-component show -g $RG -n $ENV \
  --dapr-component-name orderpubsub --query properties.scopes -o json   # null/empty = unscoped

5 — Canary took everything. Two traps: weights that don’t sum to 100 (ACA normalises, often not how you expect), and --revision-weight latest=N where “latest” moved to the new revision on the next deploy. Pin explicit revision names in production runbooks.

10 — Dropped/duplicated messages on scale-in. KEDA scales workers in as the backlog drains. A replica that gets SIGTERM mid-message and exits without settling leaves the message for re-delivery (Service Bus) — at-least-once becomes visibly duplicate. Always handle SIGTERM: stop the receiver, finish + settle the current message, then exit; make handlers idempotent.

Error / status reference

The codes and strings you actually see, what they mean on ACA, and the first fix:

Code / string	Where it shows	Likely cause	First fix
502 Bad Gateway	Client / App GW	Wrong port, `127.0.0.1` bind, no healthy revision	Fix port/bind; check revision health
404 Not Found	Client	Wrong FQDN, label endpoint, or ingress disabled	Use the right FQDN; enable ingress
403 Forbidden	Client	IP restriction or client-cert `require`	Allow the CIDR; present the cert
`ImagePullBackOff`	Revision status	UAMI lacks AcrPull / registry not on identity	Grant AcrPull; `registry set --identity`
`CreateContainerError`	Revision status	Bad command/args/env or invalid CPU:mem ratio	Fix template; use a valid size row
Revision `Failed`	`revision list`	Probe never passes, or crash on boot	Read system + console logs; fix probe/boot
`ERR_PUBSUB_NOT_FOUND`	Dapr sidecar logs	Component name/scoping wrong	Match `--dapr-component-name`; scope it
`ERR_STATE_STORE_NOT_FOUND`	Dapr sidecar logs	State component missing/misnamed/unscoped	Register + scope the state component
413 Request Entity Too Large	Dapr sidecar	Body > `dapr-http-max-request-size`	Raise the max request size
`OOMKilled`	Console / system logs	Replica exceeded its memory	Raise `--memory` (valid ratio) or fix leak

Decision table — start here

If you see…	It’s probably…	Do this first
502 but logs say app started	Port/bind contract	Confirm `--target-port` + `0.0.0.0` bind
Worker at 0, queue rising	Non-waking scaler	Switch to queue/HTTP rule; or `min-replicas 1`
Deploy caused a 502 blip	Single revision mode	Switch to multiple mode; handle SIGTERM
Canary slice went to 100%	`latest` pin or bad weights	Pin explicit revisions; Σ weights = 100
Secret-backed value empty	Key Vault RBAC/URI	Grant Secrets User; verify the SecretUri
Image won’t pull	Identity/AcrPull	Grant AcrPull; set registry to the UAMI
Other apps hit your broker	Unscoped component	Add `scopes:` to the component
Duplicate side-effects	At-least-once + no idempotency	Add idempotency; settle before exit

Verify

Confirm each layer independently rather than trusting that “it deployed.”

# Revisions and their traffic weights
az containerapp revision list -g $RG -n orders-api \
  --query "[].{name:name, active:properties.active, weight:properties.trafficWeight, replicas:properties.replicas}" -o table

# Live replica count (watch it scale)
az containerapp replica list -g $RG -n orders-worker -o table

# Dapr components visible to the environment
az containerapp env dapr-component list -g $RG -n $ENV -o table

# Hit the canary label endpoint directly
curl -s https://orders-api---canary.<env-hash>.<region>.azurecontainerapps.io/healthz/ready -o /dev/null -w "%{http_code}\n"

Then prove KEDA actually scaled from zero. Push messages onto the queue and confirm the worker wakes, processes, and scales back to zero. In Log Analytics, the ContainerAppSystemLogs_CL table records scaling decisions and the ContainerAppConsoleLogs_CL table holds stdout/stderr:

ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "orders-worker"
| where TimeGenerated > ago(15m)
| project TimeGenerated, RevisionName_s, ReplicaName_s, Log_s
| order by TimeGenerated desc

The verification matrix — one check per layer, and what proves it healthy:

Layer	Command / query	Healthy signal
Ingress / port	`curl https://$FQDN`	200, not 502
Revisions	`revision list`	Expected revisions active, weights as intended
Scaling (worker)	`replica list` before/after load	0 at rest, N under load, back to 0
Dapr components	`env dapr-component list`	Only intended components, scoped
Pub/sub path	App Insights app map	End-to-end transaction api→SB→worker
Secrets/identity	`secret show` + `role assignment list`	Resolved; UAMI has the roles

Observability: Dapr dashboard, logs, and App Insights

For Dapr-level visibility — components loaded, sidecar config, service invocation — inspect what’s registered and wire traces to Application Insights:

az containerapp env dapr-component list -g $RG -n $ENV -o yaml   # what's registered

Wire distributed tracing by giving the environment’s Dapr configuration an Application Insights instrumentation key (the value the --dapr-instrumentation-key flag expects), so sidecar-to-sidecar calls produce a real trace graph:

AI_KEY=$(az monitor app-insights component show \
  -g $RG -a appi-orders --query instrumentationKey -o tsv)

az containerapp env update -g $RG -n $ENV \
  --dapr-instrumentation-key "$AI_KEY"

Now a publish from orders-api through Service Bus to orders-worker shows as a connected end-to-end transaction in the Application Insights application map — the single most useful artifact when a message “disappears” between services. Where each kind of signal lives:

Signal	Source	Where to read it	Best for
App stdout/stderr	Container console	`ContainerAppConsoleLogs_CL`	App errors, your logs
Platform/scale events	System	`ContainerAppSystemLogs_CL`	Probe fails, scaling decisions, restarts
Metrics (replicas, CPU, reqs)	Azure Monitor	Metrics Explorer / `az monitor metrics`	Trends, alerts
Distributed traces	Dapr → App Insights	Application map / transactions	Message flow across services
Live tail	`az containerapp logs show --follow`	Terminal	Active incident

Best practices

One environment per bounded context, never per app; size the subnet >= /23 (larger for big fleets) and accept it is fixed at create.
Front public-facing apps with App Gateway/Front Door and set the environment --internal-only where policy requires; keep nothing on the public internet by default.
Scope every Dapr component to specific dapr-app-ids — an unscoped component is loaded by every Dapr-enabled app.
Use identity-based auth for Dapr pub/sub and state where the broker supports it, not connection strings.
min-replicas 0 only on apps with a wake-capable trigger (HTTP or a polling KEDA scaler); never with CPU/memory alone.
Tune messageCount/concurrentRequests from real per-message processing time, not a guess; cap max-replicas to protect fragile downstreams.
Run production apps in multiple revision mode with semver --revision-suffix; suffixes are unique-forever, so encode the build.
Deploy at 0% weight with a canary label; promote via a gated weight ramp; pin explicit revision names (not latest) in runbooks.
Verify rollback as a single weight flip to a still-running prior revision before you need it in anger.
Registry pull and Key Vault secrets via UAMI with least-privilege roles (AcrPull, Key Vault Secrets User); no inline passwords.
Define startup, liveness, and readiness probes; readiness reflects real dependencies; never fail liveness on an optional downstream.
Handle SIGTERM: drain in-flight work and settle messages before exit; make handlers idempotent so re-delivery is harmless.
Wire Application Insights to the environment’s Dapr config for end-to-end traces — the first artifact you’ll want when a message vanishes.

Security notes

ACA’s security posture is mostly about identity, network isolation, and secret handling — the same disciplines as any Azure workload, with a few container-specific edges.

Control	What to do	Why
Managed identity over secrets	UAMI for ACR pull, Key Vault refs, broker auth	No long-lived credentials in IaC or pipeline
Least-privilege RBAC	AcrPull, Key Vault Secrets User, Service Bus Data Receiver/Sender — scoped tight	Limit blast radius of a compromised app
Network isolation	`--internal-only` env + private VIP; front with App GW/WAF	No public ingress; inspect at L7
Private endpoints	For ACR, Key Vault, Service Bus, Cosmos	Keep broker/store traffic off the internet
Dapr mTLS	On by default between Dapr sidecars; also enable the environment’s peer-to-peer encryption for non-Dapr traffic	Encrypt + authenticate service-to-service
Component scoping	Scope every component to its apps	Prevent one app from reading another’s broker
Secret resolution	Key Vault refs (runtime) over inline secrets	Value never persists in the template
IP restrictions	Allow-list CIDRs on ingress	Reduce exposed surface even when external
Image provenance	Pull from private ACR; scan + sign images	Supply-chain integrity
Client certificates	`require` mode for mTLS clients	Authenticate callers at ingress

Two non-obvious ones. First, a failed Key Vault reference resolves to an empty value, not an error the app can catch, so a missing RBAC role looks like a malformed connection string and the app crash-loops; confirm with az containerapp secret show and the UAMI’s role assignments. Second, Dapr component scoping is a security boundary, not just hygiene: an unscoped Service Bus component means every Dapr-enabled app in the environment can publish and consume on that namespace.
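Because that failure surfaces as an empty value rather than an error, a cheap defence is a start-up guard in the container entrypoint that refuses to start when a required setting is empty and says why in the console log. A minimal bash sketch; the variable names in the usage comment are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: fail fast, and loudly, when a secret-backed setting resolves to empty.
# require_env NAME... returns non-zero if any named variable is unset or empty.
require_env() {
  local name missing=0
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then    # ${!name} expands to the value of the variable named $name
      printf 'FATAL: %s is empty; check the Key Vault reference and the UAMI role assignments\n' "$name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Entrypoint usage (variable names are hypothetical):
#   require_env SERVICEBUS_CONNECTION COSMOS_KEY || exit 1
#   exec "$@"
```

The crash-loop still happens, but the console log now names the setting, which points straight at the role assignment instead of the connection-string parser.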

Cost & sizing

ACA Consumption bills per vCPU-second and GiB-second of active replicas plus a small per-request charge, with a monthly free grant. An app scaled to zero runs no replicas and costs nothing; replicas kept warm by min-replicas but not processing bill at a lower idle rate. Dedicated workload profiles bill per node-hour for the pool you size, regardless of utilisation. The bill drivers and how to cut each:

Cost driver	What it is	How to reduce	Watch out
Active vCPU/GiB-seconds	Running replica time × size	Scale-to-zero; right-size CPU:mem; cap max-replicas	Over-large replicas waste the ratio
Request count	Per-million requests (Consumption)	Usually negligible	Chatty internal calls add up
Dedicated node-hours	The profile pool you provision	Only for steady/memory-heavy; size tight	Pays even when idle (no scale-to-zero)
Log Analytics ingestion	Console + system logs volume	`info` not `debug`; sample; cap retention	`dapr-enable-api-logging` is a silent bill
Service Bus / Cosmos	The brokers/stores behind Dapr	Right-tier the broker; batch	Not ACA, but part of the system bill
App Gateway / Front Door	The fronting edge	Share across apps; right-SKU	Fixed hourly + per-GB
Egress / private endpoints	Outbound + PE hourly	Keep traffic on the backbone	PE per-hour per endpoint

Rough figures (East US / Central India list, mid-2026, INR≈USD×84): a single 0.5 vCPU / 1 GiB app running 24×7 on Consumption is on the order of ₹1,800–2,400/month; the same app scaled-to-zero and active ~4 hours/day is ₹300–500/month plus requests. A worker that wakes only for bursts can be near-zero at rest. The free grant covers the first slice of vCPU-/GiB-seconds and requests each month, so small dev environments often land inside the free tier. The sizing decision in one table:
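Those figures rest on two billable quantities, vCPU-seconds and GiB-seconds. A small bash/awk sketch that computes them for a replica shape and active time, assuming a 30-day month and a steady average replica count; list prices and the free grant are left as inputs rather than hard-coded, because both vary by region and change over time.

```bash
#!/usr/bin/env bash
# Sketch: turn replica shape and active time into the two quantities
# Consumption bills on. Prices and the free grant are deliberately inputs.

# usage_seconds VCPU GIB REPLICAS HOURS_PER_DAY DAYS -> "vcpu_seconds gib_seconds"
usage_seconds() {
  awk -v c="$1" -v m="$2" -v r="$3" -v h="$4" -v d="$5" 'BEGIN {
    s = r * h * 3600 * d          # replica-seconds of active time
    printf "%d %d\n", c * s, m * s
  }'
}

# billable_after_grant USED GRANT -> max(0, USED - GRANT)
billable_after_grant() {
  awk -v u="$1" -v g="$2" 'BEGIN { b = u - g; if (b < 0) b = 0; printf "%d\n", b }'
}
```

For the 0.5 vCPU / 1 GiB app, `usage_seconds 0.5 1 1 24 30` prints `1296000 2592000` for 24×7 and `usage_seconds 0.5 1 1 4 30` prints `216000 432000` for about 4 hours/day, one sixth of the active seconds. Subtract the current free grant from each with `billable_after_grant`, then multiply by your region's per-second rates.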

Workload shape	Profile	min/max replicas	Why
Bursty event worker	Consumption	0 / N	Free at rest; wakes on queue
Latency-sensitive public API	Consumption	1 / N	Avoid cold start; still bursts
Steady high-throughput service	Dedicated D-series	≥1 / N	Predictable cost, no per-second premium
Memory-heavy (cache/JVM)	Dedicated E-series	≥1 / N	RAM ratio Consumption can’t give
GPU inference bursts	Consumption GPU	0 / N	Pay per GPU-second; region-limited
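Every 0 / N row in that table depends on a trigger that can wake the app from zero. A pre-deploy lint for that rule, as a bash sketch: it takes min-replicas and the app's scale rule types as plain strings (extracting them from Bicep or az containerapp show output is assumed and not shown), and its list of wake-capable types mirrors this guide rather than every KEDA scaler.

```bash
#!/usr/bin/env bash
# can_wake_from_zero RULE_TYPE -> 0 if the rule can bring an app up from zero replicas.
# The list mirrors this guide; it is not an exhaustive catalogue of KEDA scalers.
can_wake_from_zero() {
  case "$1" in
    http|azure-servicebus|azure-queue|kafka|redis|cron) return 0 ;;
    cpu|memory) return 1 ;;      # resource metrics need >= 1 running replica
    *) return 2 ;;               # unknown to this sketch: check the scaler docs
  esac
}

# check_min_replicas MIN RULE_TYPE... -> non-zero if MIN is 0 and no rule can wake the app.
check_min_replicas() {
  local min="$1"; shift
  if [ "$min" -gt 0 ]; then return 0; fi
  local rule
  for rule in "$@"; do
    if can_wake_from_zero "$rule"; then return 0; fi
  done
  printf 'min-replicas 0 with no wake-capable rule (%s): this app will never start\n' "$*" >&2
  return 1
}
```

Run it in the pipeline before the deploy step, so the "scales but never wakes" failure is caught as a red build rather than a silent backlog.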

Interview & exam questions

Mapped to AZ-204 (Developing Solutions for Azure) and AZ-305 (Designing), plus general microservices design rounds.

What is a Container Apps environment and why does it matter? It is the security and network boundary: apps in the same environment share a VNet and Log Analytics workspace and can call each other by name and over Dapr; apps in different environments cannot reach each other’s internal ingress or share Dapr components. You choose one per bounded context, and its subnet is fixed at creation.
What distinguishes a revision from a configuration change? A change to the app template (image, env vars, scale, resources, probes) mints a new immutable revision; a change to configuration (ingress, secrets, registries, Dapr on/off, traffic weights, labels) does not. That distinction is the whole revision model.
How do you do a canary on ACA with no extra tooling? Put the app in multiple revision mode, deploy the new revision at 0% weight with a canary label, smoke-test the per-label FQDN, then ramp weights (10→50→100) gated on metrics; rollback is a one-line weight flip to the still-warm prior revision.
Why might an app with min-replicas 0 never wake? Because its scale rule is cpu or memory, which are only meaningful with ≥1 replica running and cannot wake the app from zero. Use a trigger that can: HTTP (woken by incoming requests) or an event scaler that keeps evaluating at zero, such as Service Bus, Storage Queue, Kafka, Redis, or cron.
What does messageCount mean on a Service Bus scaler? It is the target backlog per replica: KEDA divides the queue depth by it, rounds up, and caps the result at max-replicas, so 200 messages with messageCount=20 drives 10 replicas if max-replicas allows. Set it from real per-message processing time, not a guess.
How are Dapr components scoped, and why does it matter? Components are registered against the environment; a scopes: list of dapr-app-ids restricts which apps load them. Without it, every Dapr-enabled app loads the component — a connection-fanout and security problem.
How do you avoid inline secrets and registry passwords? Use a user-assigned managed identity with AcrPull on the registry (registry set --identity) and Key Vault Secrets User on the vault, referencing secrets with keyvaultref:<uri>,identityref:<id> so the value resolves at runtime and never lives in the template.
Why do messages get duplicated on scale-in, and how do you fix it? KEDA scales workers in as the backlog drains; a replica SIGTERMed mid-message that exits without settling leaves the message for at-least-once re-delivery. Handle SIGTERM (drain + settle) and make handlers idempotent.
External vs internal vs disabled ingress? External gets a public FQDN (unless the environment is --internal-only); internal is reachable only inside the environment; disabled is outbound-only, for pure workers. Enabled HTTP ingress routes to a single --target-port, which the container must listen on at 0.0.0.0, not localhost.
When ACA over AKS, and when AKS over ACA? ACA when you want autoscaling, scale-to-zero, Dapr, and canary for stateless/event-driven workloads without operating a cluster. AKS when you need DaemonSets, custom operators/controllers, fine-grained scheduling, GPU control beyond profiles, or the full Kubernetes API.
What does single revision mode do on deploy, and why can it cause 502s? It deactivates the old revision the instant the new one activates; in-flight requests on draining replicas are cut if the app doesn’t handle SIGTERM. Multiple mode keeps the old revision warm and lets you drain.
Why is the environment subnet a one-way door? It is immutable after creation; revisions and replicas consume IPs from it, so an under-sized subnet caps scale and you must rebuild the environment on a larger one (/23+) and migrate apps.
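The messageCount arithmetic is worth checking before you commit a value. A simplified bash model of it: ceil(queue depth / messageCount), clamped to min-replicas and max-replicas. It is an approximation that ignores the polling interval, cooldown, activation thresholds, and the autoscaler's stabilisation behaviour, so treat its output as the steady-state target, not the replica count at any given second.

```bash
#!/usr/bin/env bash
# keda_replicas QUEUE_DEPTH MESSAGE_COUNT MIN MAX -> approximate desired replica count
keda_replicas() {
  local depth="$1" target="$2" min="$3" max="$4" want
  if [ "$target" -le 0 ]; then return 1; fi            # messageCount must be positive
  if [ "$depth" -eq 0 ]; then
    want=0                                              # empty queue: fall back to min
  else
    want=$(( (depth + target - 1) / target ))           # integer ceil(depth / messageCount)
  fi
  if [ "$want" -lt "$min" ]; then want="$min"; fi
  if [ "$want" -gt "$max" ]; then want="$max"; fi
  printf '%d\n' "$want"
}
```

For the example in the answer, `keda_replicas 200 20 0 30` gives 10; one more message tips it to 11, and a 1,000-message backlog with max-replicas 30 stops at 30, which is the cap protecting fragile downstreams.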

Quick check

Your orders-worker has min-replicas 0 and a cpu scale rule. Messages are piling up and no replica appears. Why?
A deploy to a single-revision-mode app caused a brief 502 spike. What changes prevent it?
You set --revision-weight latest=10 in CI; after the next deploy the new build is taking 100% of traffic. What happened?
An app’s Key Vault-backed connection string is empty and the app crash-loops, but there’s no “access denied” error. What’s the most likely cause and the confirm command?
Two subscriber apps share the same consumerID on a Service Bus pub/sub component. What goes wrong?

Answers

cpu and memory are resource scalers that only mean anything with ≥1 replica — they cannot wake from zero. Switch to the azure-queue/azure-servicebus scaler (which polls at 0) or set min-replicas 1.
Put the app in multiple revision mode (so the old revision stays warm and drains) and handle SIGTERM so in-flight requests finish before the replica exits.
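latest is a moving pointer, not a revision: on the next deploy it re-pointed to the new build, so every weight expressed through latest (the 10% canary share, plus any stable share an earlier promotion also set via latest) followed it onto the untested build, which is how it ended up with everything. Pin explicit revision names in production traffic commands.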
The UAMI is missing Key Vault Secrets User (or the SecretUri is wrong); the reference resolves to empty rather than erroring. Confirm with az containerapp secret show and az role assignment list --assignee <uami-clientId> against the vault scope.
They become competing consumers on the same logical subscription — messages are split across them instead of each app getting its own copy. Give each subscriber a unique consumerID.
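The latest trap is cheap to catch in CI. A bash sketch of a gate that checks the REV=WEIGHT pairs you are about to pass to az containerapp ingress traffic set --revision-weight: it rejects latest, non-integer weights, and totals other than 100. Revision names in the usage comment are hypothetical.

```bash
#!/usr/bin/env bash
# check_traffic REV=WEIGHT... -> non-zero on `latest`, a non-integer weight, or a total != 100.
check_traffic() {
  local pair rev weight total=0
  for pair in "$@"; do
    rev="${pair%%=*}"; weight="${pair#*=}"
    if [ "$rev" = "latest" ]; then
      printf 'refusing "latest": pin an explicit revision name\n' >&2; return 1
    fi
    case "$weight" in
      ''|*[!0-9]*) printf 'bad weight for %s: %s\n' "$rev" "$weight" >&2; return 1 ;;
    esac
    total=$(( total + 10#$weight ))   # 10# stops "010" being read as octal
  done
  if [ "$total" -ne 100 ]; then
    printf 'weights sum to %d, not 100\n' "$total" >&2; return 1
  fi
}

# CI usage (hypothetical names):
#   pairs="orders-api--v1-4-2=90 orders-api--v1-5-0=10"
#   check_traffic $pairs && az containerapp ingress traffic set \
#     --name orders-api --resource-group "$RG" --revision-weight $pairs
```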

Glossary

Container Apps environment — the security/network/logging boundary; apps inside share a VNet and Log Analytics workspace and can talk; the subnet is fixed at create.
Workload profile — Consumption (serverless, scale-to-zero) or Dedicated (per-node-hour, isolation/GPU) compute that an app runs on.
Ingress — Envoy-fronted L7 entry to an app: external (public FQDN), internal (env-only), or disabled (outbound-only).
Target port — the single port a container must bind on 0.0.0.0 for ingress to reach it.
Revision — an immutable snapshot of the app template; minted by any template change.
Revision suffix — the human-readable tail of a revision name; unique-forever per app.
Traffic weight — the percentage of ingress routed to a revision in multiple mode; weights sum to 100.
Label — a stable alias to a revision with its own FQDN, for sticky smoke-testing.
KEDA — the event-driven autoscaler ACA uses; a scale rule’s trigger decides replica count.
Scale-to-zero — running zero replicas when idle; requires a wake-capable trigger.
Dapr — a portable microservices runtime injected as a per-app sidecar on localhost:3500.
Dapr component — a pub/sub, state, binding, or secret-store definition registered on the environment; scope it to specific apps.
dapr-app-id — an app’s Dapr identity, used for service invocation and component scoping.
UAMI — user-assigned managed identity; the credential-less way to pull images and resolve secrets.
Graceful shutdown — catching SIGTERM to drain in-flight work and settle messages before exit.

Next steps

Configure Dapr on Kubernetes: service invocation, state, pub/sub — the same building blocks on raw Kubernetes, for when you outgrow ACA.
KEDA event-driven autoscaling with Kafka and Service Bus — go deeper on the scaler that powers ACA scaling.
Azure Service Bus: sessions, dedup, dead-letter patterns — make the broker behind your pub/sub reliable and exactly-once-ish.
Azure Container Registry secure supply chain — secure and sign the images ACA pulls, and make the registry zone-redundant.
Azure App Service vs Container Apps vs AKS — re-confirm ACA is still the right tier as the system grows.