Durable Functions in Production: Orchestrations, Fan-out/Fan-in, and Entity State

Durable Functions is the part of Azure Functions that lets you write stateful, long-running workflows as plain code instead of stitching together queues, tables, and state machines by hand. The catch is that the programming model is not what it looks like. An orchestrator function reads top to bottom like normal C# or TypeScript, but underneath it is a replay engine that re-executes your code from the start every time it makes progress. If you do not internalize that, you will ship orchestrations that work in the demo and corrupt their own state under load. This guide builds the core patterns the right way — chaining, fan-out/fan-in, human interaction, eternal orchestrations, and durable entities — and ends with how to debug them when they get stuck at 2 a.m.

The whole field reduces to one sentence: the orchestrator is the brain and must be pure; activities are the hands and may touch the outside world; the Durable Task backend (Azure Storage, Netherite, or MSSQL) is the memory that survives crashes. Everything that bites you in production — NonDeterministicOrchestrationException, a settlement run that wedges at 95,000 merchants, double-applied payments, a history table that grows to tens of GB — is a violation of one of those three roles. Because this is a reference you will keep open mid-incident, every pattern, setting, error and limit here is laid out as a scannable table alongside the prose and the code: read the prose once, then keep the tables open.

All examples use the .NET isolated worker model, which is the supported path going forward; the concepts map directly to the JavaScript, Python, and PowerShell SDKs. By the end you will stop guessing — when an orchestration hangs you will know within ninety seconds whether you face a non-deterministic body, an unbounded fan-out starving the control queue, a non-idempotent activity double-applying a side effect, a WaitForExternalEvent with no timeout, or simply history bloat from a missing ContinueAsNew.

What problem this solves

Long-running, stateful workflows are the swamp of cloud engineering. You need to call five services in order, fan out ten thousand parallel jobs and wait for all of them, pause for a human approval that might take three days, or run a per-device aggregator forever — and you need all of it to survive a worker crash, a deployment, a scale-in event, and a transient API failure halfway through. The naive answer is to hand-roll it: a queue per step, a table to hold state, a poller to advance the state machine, a dead-letter queue for failures, and a pile of correlation IDs to tie it together. That code is mostly plumbing, it is where the bugs live, and every team rewrites it.

Durable Functions collapses that plumbing into code you can read. The state is the event-sourced history; you do not manage it. But the abstraction has a sharp edge: because the orchestrator body replays, anything non-deterministic in it silently diverges history and corrupts the workflow — or, if the SDK catches it, throws NonDeterministicOrchestrationException and wedges the instance. What breaks without this knowledge is specific and expensive: a settlement job that scales fine to 40,000 items and falls over at 95,000; a reconcile activity that double-posts to a partner ledger when its retry fires; a “stuck Running” instance nobody can explain; a history table that grows until queries time out.

Who hits this: any team using Durable Functions for orchestration (order processing, ETL, batch media work, approvals, sagas), anyone who fanned out without bounding the width, anyone whose activities have side effects but aren’t idempotent, and anyone running an eternal orchestration without ContinueAsNew. To frame the whole field before the deep dive, here is every failure class this guide covers, what it looks like, and the one place to look first.

Failure class	What you observe	First question	First place to look	Most common single cause
Non-determinism	`NonDeterministicOrchestrationException` on replay	Did the orchestrator schedule different work than history?	The exception + `showHistory=true`	`DateTime.UtcNow`/`Guid.NewGuid`/I/O in the orchestrator
Stuck “Running” forever	Instance never reaches a terminal state	Is it waiting on an event, or retrying a poison item?	Status API; KQL for non-terminal instances	`WaitForExternalEvent` with no timeout
Double-applied side effect	Duplicate charges/adjustments	Did an activity retry after the original succeeded?	`dependencies` failures + duplicate rows	Non-idempotent activity + retry policy
Slow / wedged fan-out	Used to finish in 40 min, now 6 h	Did fan-out width outgrow the backend?	Control-queue latency; instance duration	Unbounded `Task.WhenAll` over 10k+ activities
History bloat	Queries time out; storage in tens of GB	Large payloads or missing `ContinueAsNew`?	History table size; payload sizes	Returning big blobs by value; eternal loop without reset
Wrong app ran it	“My orchestration ran on the other app”	Do two apps share a storage account + hub name?	`host.json` `hubName`	Two apps sharing a task hub

Learning objectives

By the end of this article you can:

Explain the replay execution model and list exactly what is forbidden in an orchestrator body and the deterministic replacement for each.
Build the five canonical patterns correctly: function chaining, fan-out/fan-in, human interaction (external event + durable timer), eternal orchestrations (ContinueAsNew), and durable entities.
Bound a fan-out with sub-orchestrations so a hundred-thousand-item batch doesn’t starve the control queue, and choose a fan-in failure policy (WhenAll vs collect-and-partition) deliberately.
Make side-effecting activities idempotent so retries and redeliveries can’t double-apply, using a deterministic idempotency key.
Choose a storage backend (Azure Storage / Netherite / MSSQL) for the workload and reason about throughput, cost, and the task-hub namespacing rule.
Diagnose a stuck, poison, or bloated orchestration: query an instance with the status API, terminate a wedged one, run KQL to find non-terminal instances, and groom history with the purge API.
Read the option/limit/error reference tables and pick the right retry policy, timer pattern, and entity-vs-orchestration decision for each case.

Prerequisites & where this fits

You should already be comfortable with Azure Functions fundamentals — triggers and bindings, the consumption/premium/dedicated hosting model, app settings, and deploying with func/az. You should be able to run az in Cloud Shell, read JSON output, and write enough C# to follow async/await and Task.WhenAll/Task.WhenAny. Familiarity with event sourcing helps but isn’t required — this article teaches the model from first principles. If you’re new to plain (non-durable) serverless patterns, read Azure Functions: Serverless Patterns & Best Practices and Build a Simple Serverless API on Azure first.

This sits in the Serverless / application-architecture track, one layer above plain function triggers. It assumes the hosting and scaling mechanics covered in Azure Functions Flex Consumption: VNet, Scaling & Cold Start, and it pairs tightly with the messaging primitives — when you outgrow Durable’s built-in queues you reach for Azure Service Bus: Sessions, Dedup & Dead-Letter Patterns and Azure Event Grid: MQTT, Event-Driven Routing & Dead-Letter. The diagnostic half leans on Azure Monitor & Application Insights for Observability and KQL for Azure Monitor & Log Analytics, because Application Insights is the single most useful tool for triaging a stuck instance.

A quick map of who owns what during an incident, so you escalate to the right place fast:

Layer	What lives here	Who usually owns it	Failure classes it can cause
Trigger / client	HTTP/event start, raiseEvent, status	App / dev team	Wrong instance ID; lost external event
Orchestrator body	Determinism, control flow, `WhenAll`	App / dev team	Non-determinism; unbounded fan-out; no timeout
Activities / entities	All I/O, side effects, shared state	App / dev team	Double-apply; poison item; large payloads
Durable backend	History, queues, partitions	App + platform	Throughput ceiling; control-queue latency
Storage account	Tables/blobs/queues, or Event Hubs/SQL	Platform team	Hub-name collisions; storage throttling (429)
Observability	Traces, status API, purge	App / SRE	“Stuck Running” invisible without queries

Core concepts

Six mental models make every later diagnosis obvious.

The orchestrator replays; it does not run once. An orchestrator runs, awaits an activity, and unloads from memory. When that activity completes, the Durable Task Framework replays the orchestrator from line one, feeding already-completed results from a history table instead of calling the activities again. Replay stops at the first await whose result is not yet in history, and real execution resumes there. This is how an orchestration survives a worker crash, a deployment, or a scale-in: its state is the event-sourced history, not the process memory.
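
To make the replay loop concrete, here is a toy model in plain C#. It is not the Durable Task Framework: `ToyContext`, `ToyRuntime` and `SuspendException` are hypothetical names, a `List<string>` stands in for the history table, and an exception stands in for unloading the orchestrator. It only illustrates why the body runs many times while each activity runs once.

```csharp
using System;
using System.Collections.Generic;

// Toy stand-in for "unload the orchestrator until the activity completes".
public sealed class SuspendException : Exception { }

public sealed class ToyContext
{
    private readonly List<string> _history;   // stand-in for the history table
    private int _cursor;

    public ToyContext(List<string> history) => _history = history;

    public string Call(Func<string> activity)
    {
        if (_cursor < _history.Count)
            return _history[_cursor++];       // replay: the answer comes from history

        _history.Add(activity());             // first time: really run the activity
        throw new SuspendException();         // then unload; the next pass replays
    }
}

public static class ToyRuntime
{
    // Re-runs the body from the top until it finishes without suspending.
    public static (string Result, int BodyRuns) Run(Func<ToyContext, string> body)
    {
        var history = new List<string>();
        for (int runs = 1; ; runs++)
        {
            try { return (body(new ToyContext(history)), runs); }
            catch (SuspendException) { /* progress recorded; start again from line one */ }
        }
    }
}
```

Running a body that makes three `Call`s through `ToyRuntime.Run` executes the body four times (three suspensions plus the final pass) while each activity delegate executes exactly once.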

Determinism is non-negotiable. Because the body replays repeatedly, it must make the same decisions and schedule the same activities in the same order given the same history. That forbids ambient clocks, randomness, direct I/O, and non-deterministic collection ordering inside the orchestrator. The replacements live on the context (context.CurrentUtcDateTime, context.NewGuid()). When replay schedules different work than history records, the SDK throws NonDeterministicOrchestrationException rather than silently corrupting state; treat that as a code defect, never a transient error to retry. Divergence that does not change what gets scheduled (a wall-clock value copied into an output, say) is not caught, which is why the rules apply even when nothing throws.

Activities are the hands. All I/O — HTTP, database, blob, reading config — happens in activity functions, which run once per logical call (with retries) and whose inputs/outputs are serialized to JSON and recorded in history. Anything non-deterministic belongs here or comes from the context.

Entities hold state; orchestrations coordinate. A durable entity is an addressable, persistent object (a tiny actor) identified by entityName@key, with single-threaded access per entity so updates serialize without locks. Use an orchestration for a workflow with a start and end; use an entity for long-lived mutable state many callers update.

The task hub is the namespace. hubName in host.json namespaces all the queues and tables. Two function apps sharing one storage account must use different hub names or they fight over each other’s work items — the classic “my orchestration ran on the wrong app” incident.
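
As a minimal sketch, the hub name is set under the durableTask extension in host.json. The value `OrdersHubProd` is a made-up example; the point is only that each app sharing a storage account gets its own name.

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "OrdersHubProd"
    }
  }
}
```

A second app on the same storage account would use a different value here, for example a hypothetical `BillingHubProd`.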

The backend is finite and shared. Whatever provider you choose, its queues and partitions have throughput limits. On Azure Storage the control queues (one per partition: 4 by default, configurable up to 16 with partitionCount) and the single work-item queue can become the binding constraint under heavy fan-out; saturating them spikes latency and slows every replay.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the model side by side.

Concept	One-line definition	Where it lives	Why it matters
Orchestrator function	The deterministic “brain” that schedules work	Your code (`[OrchestrationTrigger]`)	Replays; must be pure
Activity function	A unit of real work / I/O	Your code (`[ActivityTrigger]`)	Runs once per call; can do I/O
Durable entity	Addressable single-threaded state object	Your code (`[EntityTrigger]`)	Race-free shared state, no locks
Client (binding)	Starts/queries/signals orchestrations	`[DurableClient]`	The only way in from outside
History table	Event-sourced record of an instance	Backend (Azure Table / SQL table / Netherite blob storage)	Source of replay and of bloat
Task hub	Namespace for all queues/tables	`host.json` `hubName`	Collisions = cross-app interference
Instance ID	Unique key for one orchestration run	Generated or supplied	Address for status/event/terminate
Replay	Re-executing the body from the start	Framework behaviour	Why determinism is required
`ContinueAsNew`	Restart with fresh state + clean history	Orchestrator API	Bounds eternal-orchestration history
External event	A named signal delivered to an instance	`raiseEvent` API	Human/async-in pattern
Durable timer	A persisted, replay-safe deadline	`context.CreateTimer`	Survives host restart; never `Task.Delay`
Storage provider	Backend that persists all state	Azure Storage / Netherite / MSSQL	Throughput + cost + ops profile

The five built-in application patterns, side by side — this is the map of the deep sections that follow.

Pattern	Shape	Use it for	Key API	Main pitfall
Function chaining	A → B → C, output feeds next	Ordered pipelines (ingest → parse → store)	`CallActivityAsync` in sequence	Passing large payloads by value
Fan-out / fan-in	Parallel N, then aggregate	Batch jobs, per-item processing	`Task.WhenAll` over many activities	Unbounded width starves the queue
Async HTTP / human-in	Pause, wait for a signal/timeout	Approvals, callbacks, 2FA	`WaitForExternalEvent` + `CreateTimer`	No timeout → stuck forever
Eternal orchestration	Loop forever, bounded	Monitors, recurring cleanup, aggregators	`ContinueAsNew`	`while(true)` → history grows unbounded
Durable entities	Addressable stateful actor	Counters, carts, per-tenant budgets	`SignalEntityAsync` / `CallEntityAsync`	Treating an entity like an orchestration

The replay execution model and why determinism is non-negotiable

An orchestration survives a worker crash, a deployment, or a scale-in because its state is the event-sourced history, not the process memory. That same mechanism is the source of every Durable Functions bug. Because the orchestrator body is replayed repeatedly, it must be deterministic — given the same history, it must make the same decisions and schedule the same activities in the same order.

The replacements for non-deterministic constructs live on the orchestration context:

[Function(nameof(ProcessOrder))]
public async Task<OrderResult> ProcessOrder(
    [OrchestrationTrigger] TaskOrchestrationContext context,
    OrderInput input)
{
    // Deterministic, replay-safe equivalents:
    DateTime now = context.CurrentUtcDateTime;        // NOT DateTime.UtcNow
    Guid id = context.NewGuid();                       // NOT Guid.NewGuid()
    ILogger logger = context.CreateReplaySafeLogger<OrderProcessor>();

    // The replay-safe logger already suppresses output while replaying,
    // so this line is written once and needs no IsReplaying guard:
    logger.LogInformation("Starting order {OrderId}", input.OrderId);

    // All real work happens in activities, which CAN do I/O:
    var validated = await context.CallActivityAsync<bool>(nameof(ValidateOrder), input);
    return new OrderResult(input.OrderId, validated);
}

The mental model that sticks: the orchestrator is the brain and must be pure; activities are the hands and may touch the outside world. Anything non-deterministic belongs in an activity or comes from the context.

The Durable Task SDK detects non-deterministic orchestration when the replayed code schedules different work than the history records, and throws rather than silently corrupting state. Treat any NonDeterministicOrchestrationException as a code defect.

What is forbidden in an orchestrator — and the fix

Every forbidden construct, why it breaks replay, and the deterministic substitute. Memorize this table; it is the single highest-leverage thing in the article.

Forbidden in orchestrator	Why it breaks replay	Replay-safe replacement	Where the real work goes
`DateTime.UtcNow` / `DateTime.Now`	Different value each replay → divergent decisions	`context.CurrentUtcDateTime`	—
`Guid.NewGuid()`	New ID each replay → divergent history	`context.NewGuid()`	—
`Random` / crypto RNG	Non-reproducible	Seed from `context.NewGuid()` or compute in an activity	Activity
`HttpClient` / DB / file I/O	Side effects re-fire on every replay	—	Activity
Reading env vars / config	Value may change between replays	Pass as input, or read in an activity	Activity
`Task.Delay` / `Thread.Sleep`	Wall-clock; lost on restart	`context.CreateTimer(deadline, ct)`	—
`Task.Run` / arbitrary threads	Non-deterministic scheduling	Schedule durable tasks only	Activity
`lock` / `Monitor` / mutex	Threading assumptions don’t hold	Use a durable entity for serialization	Entity
`await` on non-Durable tasks	Completes outside the replay model	Only `await` Durable APIs	—
Iterating an unordered `Dictionary`	Ordering differs per replay	Sort to a stable order first	—
`Environment.MachineName`, static mutable state	Host-specific / shared mutable	Pass via input or entity	Entity / input
`static` counters incremented in body	Replays increment repeatedly	Move to an entity	Entity
`Console.WriteLine` / unguarded logging	Logs duplicate on every replay	Replay-safe logger, or an `IsReplaying` guard	—
`ConfigureAwait(false)` / custom `SynchronizationContext`	Continuations can leave the orchestrator thread, which breaks the framework’s scheduler	Just `await` durable tasks plainly	—
Throwing to “retry” the orchestrator	Faults the orchestration, not a retry	Put retry policy on the activity	Activity

The context APIs you reach for instead, and exactly what each returns:

Context member	Replaces	Returns / does	Note
`context.CurrentUtcDateTime`	`DateTime.UtcNow`	Deterministic “now” frozen per replay	Advances only as history advances
`context.NewGuid()`	`Guid.NewGuid()`	Deterministic GUID seeded from instance + counter	Use as idempotency-key seed
`context.IsReplaying`	—	`true` while re-executing history	Guard logging / one-shot effects
`context.CreateReplaySafeLogger<T>()`	`ILogger`	Logger that suppresses replayed lines	Avoids double logs
`context.GetInput<T>()`	constructor args	The serialized input payload	Must be a serializable POCO
`context.InstanceId`	—	This orchestration’s ID	For correlation / child IDs
`context.CallActivityAsync<T>(...)`	direct method call	Schedules an activity, awaits result	Recorded in history
`context.CreateTimer(deadline, ct)`	`Task.Delay`	A persisted durable timer	Survives restart
`context.WaitForExternalEvent<T>(name)`	a callback	Awaits a named external event	Pair with a timeout
`context.ContinueAsNew(state)`	a `while(true)` loop	Restarts with clean history	Last statement on the branch
`context.CallSubOrchestratorAsync<T>(...)`	a giant `WhenAll`	Schedules a child orchestration	Bounds fan-out width
`context.Entities.CallEntityAsync<T>(...)`	a `lock` / shared field	Read-modify-write an entity	Single-threaded per key
`context.WaitForExternalEvent<T>(name, timeout)`	a callback + manual timer	Awaits an event with a built-in timeout	Throws `TaskCanceledException` on expiry in the isolated SDK

Function chaining and passing state safely

The simplest pattern is a sequence: A then B then C, where each step’s output feeds the next. Because state flows through return values held in history, you do not need external storage to pass data between steps.

[Function(nameof(IngestPipeline))]
public async Task<string> IngestPipeline(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var input = context.GetInput<IngestRequest>()!;

    string downloaded = await context.CallActivityAsync<string>(nameof(Download), input.Url);
    string parsed     = await context.CallActivityAsync<string>(nameof(Parse), downloaded);
    string stored     = await context.CallActivityAsync<string>(nameof(Persist), parsed);
    return stored;
}

Two rules keep this safe. First, everything crossing an activity boundary is serialized to JSON — inputs and outputs must be serializable POCOs, not live handles, streams, or HttpClient instances. Keep payloads small: if a step produces a 200 MB blob, return the blob URI, not the bytes, because large payloads bloat the history table and slow every replay. Second, add retries where failure is expected, not a blanket retry on everything.

var retry = TaskOptions.FromRetryPolicy(new RetryPolicy(
    maxNumberOfAttempts: 5,
    firstRetryInterval: TimeSpan.FromSeconds(5),
    backoffCoefficient: 2.0,
    maxRetryInterval: TimeSpan.FromMinutes(2)));

string downloaded = await context.CallActivityAsync<string>(
    nameof(Download), input.Url, retry);

The retry timing is itself recorded as durable timers, so a 5-attempt exponential backoff survives a worker restart mid-backoff.
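
To see what a policy like the one above means in wall-clock terms, the delays can be sketched in plain C#. This assumes the usual exponential rule (first interval times the coefficient raised to the retry index, capped at the maximum); `RetrySchedule` is a hypothetical helper, not an SDK type, and the real framework's arithmetic may differ in detail.

```csharp
using System;
using System.Collections.Generic;

public static class RetrySchedule
{
    // Delays between attempts: first * coefficient^n, capped at maxRetryInterval.
    // N attempts means N - 1 waits.
    public static List<TimeSpan> Delays(
        int maxNumberOfAttempts,
        TimeSpan firstRetryInterval,
        double backoffCoefficient,
        TimeSpan maxRetryInterval)
    {
        var delays = new List<TimeSpan>();
        for (int retry = 0; retry < maxNumberOfAttempts - 1; retry++)
        {
            double ms = firstRetryInterval.TotalMilliseconds * Math.Pow(backoffCoefficient, retry);
            delays.Add(TimeSpan.FromMilliseconds(Math.Min(ms, maxRetryInterval.TotalMilliseconds)));
        }
        return delays;
    }
}
```

Under that assumption, 5 attempts starting at 5 s with a coefficient of 2.0 and a 2-minute cap wait 5 s, 10 s, 20 s and 40 s, so the activity gives up roughly 75 s after the first failure plus its own execution time.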

Retry policy options, end to end

Every field of RetryPolicy, its default behaviour, and how to reason about it. Tuning these badly is a top cause of “stuck retrying forever.”

Setting	Type / values	Typical value	When to change	Trade-off / gotcha
`maxNumberOfAttempts`	int ≥ 1	3–5	Raise for flaky upstreams; keep low for fast-fail	Too high + non-idempotent activity = repeated side effects
`firstRetryInterval`	`TimeSpan`	5 s	Lower for chatty internal calls	Too low hammers a struggling dependency
`backoffCoefficient`	double ≥ 1	2.0	1.0 for fixed delay; >1 for exponential	Exponential can stretch total time to hours
`maxRetryInterval`	`TimeSpan`	1–5 min	Cap the exponential growth	Without a cap, late attempts are days apart
`retryTimeout`	`TimeSpan`	(unset)	Bound total retry wall-clock	Unset = retries until attempts exhausted
`handle` predicate	`Func<exc,bool>`	retry all	Retry only transient exceptions	Retrying a `ValidationException` is pointless

Where to put a retry — not every failure deserves one:

Failure kind	Retry?	Why
Transient network / 5xx / throttling (429)	Yes, with backoff	Likely to succeed on retry
Timeout to a healthy-but-busy dependency	Yes, bounded	Backoff lets it recover
`ValidationException` / 400 / bad input	No	Deterministic failure; retry wastes time
`NonDeterministicOrchestrationException`	No	Code defect — fix it, never retry
Poison message (always throws)	No (cap attempts)	Dead-letter / partition the result instead
Idempotent write that may have partially succeeded	Yes, if idempotent	Safe only when the activity is idempotent
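
The last row depends on the activity being idempotent, and one common way to get there is a deterministic key that the downstream system can deduplicate on. The sketch below is an assumption about how such a key might be built, not an SDK feature: `IdempotencyKey`, the `step` label and the `itemId` are hypothetical, and the orchestrator would pass in `context.InstanceId` so that the same key comes out on every replay and every retry.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class IdempotencyKey
{
    // Same instance + step + item always yields the same key, so a retried
    // activity presents the key the first attempt already used.
    public static string For(string instanceId, string step, string itemId)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{instanceId}|{step}|{itemId}"));
        return Convert.ToHexString(hash, 0, 16);   // 16 bytes -> 32 hex characters
    }
}
```

The activity would send this key with its write (as a header or a unique column, depending on the target), and the target rejects or ignores a second write carrying the same key.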

Fan-out/fan-in for parallel processing

Chaining is sequential. When steps are independent, fan them out, run them in parallel across the entire scaled-out function app, then fan in to aggregate. This is the pattern that makes Durable Functions worth using over a logic-light queue trigger.

[Function(nameof(BatchResize))]
public async Task<int> BatchResize(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var batch = context.GetInput<ImageBatch>()!;

    // List the work in an activity (I/O), not in the orchestrator:
    string[] files = await context.CallActivityAsync<string[]>(
        nameof(ListSourceFiles), batch.Prefix);

    // FAN OUT: schedule all activities without awaiting individually.
    var tasks = new List<Task<long>>(files.Length);
    foreach (string file in files)
        tasks.Add(context.CallActivityAsync<long>(nameof(ResizeImage), file));

    // FAN IN: await them all; this is replay-safe and durable.
    long[] sizes = await Task.WhenAll(tasks);

    long totalBytes = sizes.Sum();   // keep long: summed sizes can exceed int.MaxValue
    await context.CallActivityAsync(nameof(WriteManifest),
        new Manifest(batch.Prefix, files.Length, totalBytes));
    return files.Length;
}

Task.WhenAll over Durable tasks is the canonical fan-in. The orchestrator suspends until every activity reports back, and the framework records each completion in history independently, so a crash after 900 of 1,000 completions resumes with only the outstanding 100 left to run.

Two production guardrails matter. Bound the fan-out width: fanning out 100,000 activities at once floods the work-item queue and can starve other orchestrations, so chunk the list and process N at a time, or use sub-orchestrations. And decide your failure policy explicitly: awaiting Task.WhenAll waits for every task to finish and then rethrows only the first failure if any task faulted after exhausting its retries (the other failures stay on the individual tasks), so if you want “best effort, collect successes and failures,” await each task in a try/catch and partition the results yourself rather than letting one poison item fail the whole batch.
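
The best-effort policy can be sketched as a small helper. It is written against plain `Task<T>` so it runs on its own; `FanIn.PartitionAsync` is a hypothetical name, and in an orchestrator the tasks passed in would be the ones returned by `CallActivityAsync`.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class FanIn
{
    // Awaits every task in order and sorts outcomes into successes and failures,
    // so one faulted item does not hide the results of the others.
    public static async Task<(List<T> Succeeded, List<Exception> Failed)> PartitionAsync<T>(
        IReadOnlyList<Task<T>> tasks)
    {
        var succeeded = new List<T>();
        var failed = new List<Exception>();
        foreach (Task<T> task in tasks)
        {
            try { succeeded.Add(await task); }
            catch (Exception ex) { failed.Add(ex); }
        }
        return (succeeded, failed);
    }
}
```

All tasks are already running before the loop starts, so awaiting them one by one does not serialize the work; it only changes how failures are observed.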

Bounding the fan-out with sub-orchestrations

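A sub-orchestration per chunk caps how many activities any single instance schedules and replays, and isolates failures to that chunk. The chunks themselves still run in parallel, so total in-flight work is bounded only if you also limit how many chunks you start at once. This is the single most important scaling fix in the article.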

[Function(nameof(BatchParent))]
public async Task<int[]> BatchParent(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    string[] all = context.GetInput<string[]>()!;
    const int chunkSize = 500;

    var chunkTasks = new List<Task<int>>();
    for (int i = 0; i < all.Length; i += chunkSize)
    {
        string[] chunk = all.Skip(i).Take(chunkSize).ToArray();
        // Each child owns at most chunkSize items, so no single instance fans out wider than one chunk:
        chunkTasks.Add(context.CallSubOrchestratorAsync<int>(nameof(ProcessChunk), chunk));
    }
    return await Task.WhenAll(chunkTasks);   // still parallel, but width-controlled
}
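
The chunk arithmetic is worth checking before you pick a size. A minimal sketch using `Enumerable.Chunk` (available from .NET 6); `ChunkPlan` is a hypothetical helper, and because it preserves input order the same input always produces the same chunks, which is what replay needs.

```csharp
using System;
using System.Linq;

public static class ChunkPlan
{
    // Splits ids into consecutive chunks of at most chunkSize, preserving order.
    public static string[][] Split(string[] ids, int chunkSize) =>
        ids.Chunk(chunkSize).ToArray();
}
```

With a chunk size of 500, a 95,000-item batch becomes 190 child orchestrations of 500 items each, and a 1,250-item batch becomes three children with 500, 500 and 250 items.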

The fan-in failure policies, side by side — pick before you ship, not during the incident:

Policy	How you write it	On a single failure	Use when
All-or-nothing	`await Task.WhenAll(tasks)`	Rethrows the first failure; orchestration faults unless caught	Every item must succeed (financial postings)
Best-effort partition	`try/await` each, collect ok/err lists	One bad item doesn’t sink the batch	Independent items; you report failures
First-success	`await Task.WhenAny(...)`, then ignore the rest	Returns on the first task to complete; the others still run	Racing redundant sources
Bounded width	sub-orchestration per N items	Failure isolated to a chunk	Very large batches (10k+)
Throttled	sliding window: keep N tasks pending, `Task.WhenAny`, then schedule the next	Caps concurrent in-flight work	Protecting a rate-limited downstream
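
The throttled row can be sketched as a sliding window over pending tasks. This version is plain C# so it runs standalone; `Throttle.RunWindowedAsync` is a hypothetical helper, and in an orchestrator the `start` delegate would be something like a call to `context.CallActivityAsync`. Results come back in completion order, not input order.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Throttle
{
    // Keeps at most maxInFlight tasks pending: when the window is full,
    // wait for any one task to finish before starting the next item.
    public static async Task<List<TResult>> RunWindowedAsync<TItem, TResult>(
        IEnumerable<TItem> items, int maxInFlight, Func<TItem, Task<TResult>> start)
    {
        var inFlight = new List<Task<TResult>>();
        var results = new List<TResult>();
        foreach (TItem item in items)
        {
            if (inFlight.Count >= maxInFlight)
            {
                Task<TResult> done = await Task.WhenAny(inFlight);
                inFlight.Remove(done);
                results.Add(await done);      // rethrows here if that task faulted
            }
            inFlight.Add(start(item));
        }
        results.AddRange(await Task.WhenAll(inFlight));   // drain the final window
        return results;
    }
}
```

The design trade-off against chunked sub-orchestrations is that the window lives in one instance's history, so it suits protecting a downstream rate limit more than it suits very large batches.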

Fan-out sizing — what each width does to the Azure Storage backend:

Fan-out width	Behaviour on Azure Storage backend	Recommendation
1–100	Comfortable; negligible queue pressure	Just `Task.WhenAll`
100–1,000	Fine; watch control-queue latency under bursts	`Task.WhenAll`; monitor
1,000–10,000	Work-item queue pressure begins	Chunk into sub-orchestrations
10,000–100,000	Control-queue latency spikes; replays slow	Mandatory chunking (~500/chunk)
> 100,000	Starves other orchestrations; risk of wedge	Chunk and consider Netherite

Human interaction with external events and durable timers

Some workflows must pause and wait for a human — an approval, a signature, a second factor — possibly for hours or days. You do this with an external event and a durable timer racing each other so you get a timeout instead of a workflow that hangs forever.

[Function(nameof(ApprovalWorkflow))]
public async Task<string> ApprovalWorkflow(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var request = context.GetInput<PurchaseRequest>()!;
    await context.CallActivityAsync(nameof(RequestApproval), request);

    // Durable timer: a replay-safe deadline. Always pair with a CTS so the
    // timer is cleaned up when the event arrives first.
    using var cts = new CancellationTokenSource();
    DateTime deadline = context.CurrentUtcDateTime.AddHours(72);
    Task timeout = context.CreateTimer(deadline, cts.Token);

    // External event: resumes when someone POSTs to the raise-event API.
    Task<bool> approved = context.WaitForExternalEvent<bool>("ApprovalResponse");

    Task winner = await Task.WhenAny(approved, timeout);
    if (winner == approved)
    {
        cts.Cancel();   // tear down the pending timer
        return approved.Result ? "Approved" : "Rejected";
    }
    return "TimedOut";   // escalate
}

Two things people get wrong. Use context.CreateTimer, never Task.Delay — a durable timer is persisted, so if the host restarts during the 72-hour wait the timer is restored and still fires, whereas Task.Delay is wall-clock and evaporates on restart. (Durable timers were historically capped at ~6 days on the Azure Storage backend; for longer waits, loop shorter timers.) And always cancel the loser — if you don’t cancel the timer when the event wins, the orchestration is held open until the timer fires, inflating instance counts and history.
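
Where a timer cap does apply, "loop shorter timers" amounts to splitting one long deadline into hops. A minimal sketch in plain C#; `LongTimer.Hops` is a hypothetical helper and the six-day hop size is only an example. In an orchestrator, `now` would come from `context.CurrentUtcDateTime` and each hop would be awaited with `context.CreateTimer(hop, token)`.

```csharp
using System;
using System.Collections.Generic;

public static class LongTimer
{
    // Successive fire times that reach the deadline in steps no longer than maxHop.
    public static IEnumerable<DateTime> Hops(DateTime now, DateTime deadline, TimeSpan maxHop)
    {
        if (maxHop <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(maxHop));

        DateTime next = now;
        while (next < deadline)
        {
            // Take a full hop unless the remaining time is shorter than one.
            next = deadline - next > maxHop ? next + maxHop : deadline;
            yield return next;
        }
    }
}
```

A 15-day wait with 6-day hops yields three timers, at day 6, day 12 and day 15, and the last one lands exactly on the deadline.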

The external event is delivered from outside by instance ID:

# Raise the "ApprovalResponse" event with payload `true` to a running instance
curl -X POST \
  "https://myapp.azurewebsites.net/runtime/webhooks/durabletask/instances/${INSTANCE_ID}/raiseEvent/ApprovalResponse?taskHub=MyTaskHub&code=${SYSTEM_KEY}" \
  -H "Content-Type: application/json" \
  -d 'true'

External-event vs durable-timer mechanics

The two primitives that make human-in-the-loop safe, contrasted:

Aspect	External event (`WaitForExternalEvent`)	Durable timer (`CreateTimer`)
What it waits for	A named signal from outside	A wall-clock deadline
Delivered by	`raiseEvent` REST API / client	The framework
Survives host restart	Yes (buffered if it arrives early)	Yes (persisted)
If it never happens	Hangs forever — needs a timer	Always fires
Cancellation	n/a	Cancel via `CancellationToken` when event wins
Max duration	Unbounded	Unlimited on current .NET extension versions; ~6 days on older ones (loop for longer)
Common bug	No timeout → stuck “Running”	Not cancelling the loser
Replay-safety	Yes (recorded as an event)	Yes (recorded as a timer-fired event)

Task.WhenAny race outcomes — read this to reason about the branches:

Winner	What it means	What you must do
`approved` (event)	Human responded in time	Cancel the timer (`cts.Cancel()`), return result
`timeout` (timer)	Deadline passed, no response	Escalate / mark `TimedOut` (event may still arrive — handle or ignore)
Both effectively simultaneous	Rare boundary	First-completed wins deterministically on replay
Neither (still pending)	Orchestration suspends	Nothing — it resumes when one completes

Eternal orchestrations and ContinueAsNew

Some processes never really end: a per-device aggregator, a recurring cleanup, a monitor that polls forever. You cannot just wrap the body in while (true) — the history table would grow without bound and eventually every replay would crawl. The answer is ContinueAsNew, which restarts the orchestration with fresh state and a clean history, carrying forward only the input you choose.

[Function(nameof(PeriodicMonitor))]
public async Task PeriodicMonitor(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var state = context.GetInput<MonitorState>()!;

    bool stillOpen = await context.CallActivityAsync<bool>(nameof(CheckHealth), state.Target);
    if (!stillOpen)
        return;   // condition met -> orchestration completes for good

    // Wait one polling interval with a durable timer:
    DateTime next = context.CurrentUtcDateTime.AddMinutes(5);
    await context.CreateTimer(next, CancellationToken.None);

    // Reset history and loop with updated state. Do NOT recurse or while(true).
    context.ContinueAsNew(state with { Iterations = state.Iterations + 1 });
}

Key constraints: decide what happens to pending external events at the boundary. In the isolated SDK, ContinueAsNew takes a preserveUnprocessedEvents flag that defaults to true, so events that arrived but were not yet awaited carry into the next generation; pass false and they are dropped, so in that case await everything you care about first. ContinueAsNew does not “return”: it schedules a restart, so structure the method so the call is the last statement on that branch. And remember this is what bounds history growth: an eternal orchestration without ContinueAsNew is a slow-motion outage.
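
The history-growth argument is simple arithmetic, sketched below. `HistoryGrowth` is a hypothetical helper, and the events-per-iteration figure you feed it is an assumption to measure on your own instance, not a number from the SDK.

```csharp
using System;

public static class HistoryGrowth
{
    // How many loop iterations fit in a period at a given polling interval.
    public static long Iterations(TimeSpan period, TimeSpan interval) =>
        period.Ticks / interval.Ticks;

    // A loop that never resets keeps every event it has ever recorded.
    public static long EventsWithoutReset(long iterations, int eventsPerIteration) =>
        iterations * eventsPerIteration;

    // With ContinueAsNew each generation starts empty, so history stays at one iteration's worth.
    public static long EventsWithReset(long iterations, int eventsPerIteration) =>
        iterations > 0 ? eventsPerIteration : 0;
}
```

A 5-minute poll runs 105,120 times in a 365-day year. At an illustrative 8 history events per iteration, a loop without a reset would be replaying 840,960 events by year end, while the ContinueAsNew version replays 8.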

Eternal-orchestration rules

The boundary semantics that trip people up:

Rule	Why	What happens if you ignore it
Call `ContinueAsNew` as the last statement on the branch	It schedules a restart, doesn’t return	Code after it still runs in the current generation before the restart
Drain (await) pending external events first, or keep `preserveUnprocessedEvents: true`	Unawaited events are dropped at the boundary when preservation is off	Lost signals; missed approvals
Never use `while(true)` to loop	History grows unbounded	Replays crawl; queries time out
Don’t recurse via `CallSubOrchestrator` to loop	Builds a deep instance chain	Resource and history sprawl
Carry forward only the state you need	Large carried state bloats the new instance	Slow restarts
Terminate the loop on a real exit condition	Otherwise it truly is eternal	Orphan instances accumulate

Looping mechanism comparison:

Mechanism	History growth	Correct for	Notes
`ContinueAsNew`	Reset each iteration (flat)	Monitors, recurring jobs, aggregators	The right tool
`while(true)` in body	Unbounded growth	Nothing	Slow-motion outage
Timer-triggered function restarting an orchestration	Flat (new instance each time)	Cron-like schedules	Singleton-ID to avoid overlap
Recursion via sub-orchestration	Grows a chain	Bounded depth only	Not for “forever”
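
To make the "History growth" column concrete, here is a back-of-the-envelope model in plain C# (no Durable SDK involved). The figure of six history events per polling iteration is an assumption chosen for illustration only; real counts depend on what each iteration schedules.

```csharp
using System;

// Illustrative model only: eventsPerIteration is an assumed figure, not a measured one.
static long HistoryEventsReplayed(int iterations, int eventsPerIteration, bool continueAsNew)
{
    // With ContinueAsNew each iteration starts from an empty history, so a replay
    // only walks the current iteration's events. With while(true) the history is
    // cumulative, so the latest replay walks every event recorded so far.
    return continueAsNew
        ? eventsPerIteration
        : (long)iterations * eventsPerIteration;
}

// A monitor polling every 5 minutes for 30 days: 12 per hour * 24 hours * 30 days.
int monitorIterations = 12 * 24 * 30;
Console.WriteLine($"iterations:       {monitorIterations}");
Console.WriteLine($"while(true) loop: {HistoryEventsReplayed(monitorIterations, 6, continueAsNew: false)} events in history");
Console.WriteLine($"ContinueAsNew:    {HistoryEventsReplayed(monitorIterations, 6, continueAsNew: true)} events in history");
```

The absolute numbers matter less than the shape: one line grows with the age of the instance, the other does not.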

Durable entities for stateful, single-threaded actor logic

Orchestrations coordinate; entities hold state. A durable entity is an addressable, persistent object (think a tiny actor) identified by entityName@key. The framework guarantees single-threaded access per entity, so you get serialized, race-free updates without locks — ideal for counters, shopping carts, per-tenant aggregates, or rate-limit budgets.

public class Counter : TaskEntity<int>
{
    public void Add(int amount) => State += amount;
    public void Reset() => State = 0;
    public int Get() => State;

    [Function(nameof(Counter))]
    public static Task Run([EntityTrigger] TaskEntityDispatcher dispatcher)
        => dispatcher.DispatchAsync<Counter>();
}

Call entities two ways. From a client you fire signals (one-way, fire-and-forget):

[Function("AddToCounter")]
public async Task<HttpResponseData> AddToCounter(
    [HttpTrigger(AuthorizationLevel.Function, "post", Route = "counter/{key}/add")]
        HttpRequestData req,
    [DurableClient] DurableTaskClient client,
    string key)
{
    var entityId = new EntityInstanceId(nameof(Counter), key);
    await client.Entities.SignalEntityAsync(entityId, "Add", 1);
    return req.CreateResponse(HttpStatusCode.Accepted);
}

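From an orchestrator you can signal, or call and await a return value. Each operation runs single-threaded on the entity, but a separate Get followed by Add is not atomic on its own, because another caller can slip in between the two calls. For a safe read-modify-write from an orchestration, hold the entity lock across both calls:

var entityId = new EntityInstanceId(nameof(Counter), key);
// Hold the entity lock so no other caller can interleave between Get and Add.
await using (await context.Entities.LockEntitiesAsync(entityId))
{
    int current = await context.Entities.CallEntityAsync<int>(entityId, "Get");
    if (current < limit)
        await context.Entities.CallEntityAsync(entityId, "Add", 1);
}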

When to reach for entities over an orchestration: use an orchestration for a workflow with a defined start and end; use an entity for long-lived, mutable state that many callers update concurrently. They compose — an orchestration that needs a global counter or lock should delegate to an entity rather than trying to serialize access itself.

Signal vs call, and entity vs orchestration

The two ways to invoke an entity differ in a way that matters for correctness:

Aspect	`SignalEntityAsync` (signal)	`CallEntityAsync` (call)
Direction	One-way, fire-and-forget	Two-way, awaits a return
Return value	None	Typed result
Callable from client	Yes	No (orchestrator/entity only)
Callable from orchestrator	Yes	Yes
Ordering guarantee	Delivered, eventually	Completes before next line
Use for	Increment, append, notify	Read state; read-modify-write inside an entity lock
Blocking the caller	No	Yes (until entity responds)

Choosing the right primitive for a job:

Need	Orchestration	Entity	Plain activity
Multi-step workflow with start/end	✅
Long-lived mutable state, many writers		✅
Race-free counter / budget / cart		✅
One-off I/O with no shared state			✅
Distributed lock		✅ (`LockEntitiesAsync` in the isolated SDK; `LockAsync` in-process)
Fan-out of independent work	✅ (orchestrator)		✅ (the work)
Per-tenant aggregate updated by events		✅

Choosing a storage backend

Durable Functions persists all state through a storage provider. The default is fine until it isn’t, and the choice has real throughput and cost consequences.

Provider	Backing store	Best for	Watch out for
Azure Storage (default)	Blobs, queues, tables	Default; low ops; most apps	Throughput ceiling under heavy fan-out; per-transaction cost adds up; history in Table Storage
Netherite	Azure Event Hubs + Page Blobs	High-throughput, high fan-out workloads needing low latency	Operationally heavier; partitions fixed at provisioning; Event Hubs cost
MSSQL	Azure SQL / SQL Server	Portability, on-prem/hybrid, single store you already operate and back up	You own SQL throughput and DTU/vCore sizing

The provider is selected in host.json:

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "MyTaskHub",
      "storageProvider": {
        "type": "Netherite",
        "partitionCount": 12
      }
    }
  }
}

Practical guidance: stay on Azure Storage until you have measured a throughput problem — most orchestrations never hit its limits, and it is the cheapest to operate. Move to Netherite when you are processing tens of thousands of work items per second and feeling queue latency. Choose MSSQL when portability, a single backed-up store, or running outside Azure dominates the decision. Switching providers is a state migration, so decide before you have millions of live instances, not after.

A note on task hubs: the hubName namespaces all the queues and tables. Two function apps sharing a storage account must use different hub names, or they will fight over each other’s work items — a classic “my orchestration ran on the wrong app” incident.

Backend comparison in depth

The three providers across the dimensions that actually drive the decision:

Dimension	Azure Storage	Netherite	MSSQL
Throughput ceiling	Moderate (queue/table bound)	Very high (Event Hubs partitions)	Bound by SQL tier (DTU/vCore)
Latency under fan-out	Rises with width	Low and stable	Depends on SQL sizing
Operational effort	Lowest	Higher (Event Hubs, partitions)	Medium (you run SQL)
Partition model	Control-queue partitions (`partitionCount`, default 4, max 16)	Fixed at provisioning (e.g. 12)	SQL-managed
Cost model	Per-transaction (cheap at low scale)	Event Hubs TU + Page Blobs	SQL compute + storage
Portability / hybrid	Azure-only	Azure-only	On-prem/hybrid friendly
Backup / single store	3 stores (blob/queue/table)	Event Hubs + blobs	One database to back up
Best fit	Most apps; default	10k+ work-items/sec, low latency	Portability, existing SQL estate
Migration cost from here	—	State migration required	State migration required

Task-hub configuration rules — collisions here cause “wrong app ran my orchestration”:

Rule	Value / setting	Why
Unique `hubName` per app on shared storage	`host.json` → `durableTask.hubName`	Apps share queues/tables otherwise
Default hub name	derived from app name	Fine if each app has its own storage
Allowed characters	alphanumeric, start with a letter	Invalid names fail silently/confusingly
Change hub name = new task hub	new queues/tables created	In-flight instances on the old hub are orphaned
Don’t share a hub across environments	dev/test/prod separate hubs	Cross-environment interference

Approximate Azure Storage backend limits worth knowing (read them for the mechanism, and validate exact numbers against current docs):

Resource	Approximate limit	Effect when hit
Control-queue partitions	1 to 16 (`partitionCount`, default 4)	Caps how many workers can process orchestrations in parallel per hub
Durable timer max duration	~6 days	Longer waits must loop shorter timers
Activity payload (input/output)	Large payloads spill to blob	Bloats history; slows replay
External-event buffering	Held until awaited	Early events are not lost
Storage throttling	HTTP 429 from the account	Backend latency spikes; retries
Instance ID length / characters	Reasonable string; avoid `/`, `\`, `#`, `?`	Bad IDs break status/raiseEvent URLs
Concurrent activities per instance (host)	Tunable via `host.json` concurrency	Caps per-instance parallelism
Status/management URL lifetime	Valid only while the instance exists	404 on status once purged; 410 Gone from raiseEvent/terminate once the instance has finished

Architecture at a glance

The diagram below is the request-and-state path of a fan-out/fan-in orchestration, left to right. A client/trigger (an HTTP call or an event) starts an orchestration through the [DurableClient] binding and can later raiseEvent to it. The orchestrator, the deterministic brain, schedules activities with Task.WhenAll and uses ContinueAsNew to keep eternal loops’ history flat. The work lands in the activities/entities zone: an activity fanned out (chunked to ~500 per sub-orchestration), a single-threaded entity@key holding shared state, and a partner API that must be hit with an idempotent key. All of that state (history, control queues, work-item queue) lives in the Durable backend: Azure Storage by default, with up to 16 control-queue partitions (default 4), or Netherite for high throughput. Finally the observe/groom zone is where you live during an incident: App Insights for KQL traces, and the status/purge APIs to inspect history and reclaim space.

Follow the numbered badges to read the failure map onto the path. The brain is where non-determinism (1) bites; the activity zone is where unbounded fan-out (2) saturates the queue and a non-idempotent side effect (3) double-applies; the backend is where history bloat or a poison item (4) stalls a partition; and the whole instance can sit “Running” forever (5) when a WaitForExternalEvent has no timeout. The legend narrates each as symptom → confirm → fix.

Real-world scenario

A payments platform team at a fictional fintech, LedgerLink, ran a nightly settlement orchestration that fanned out one activity per merchant — roughly 40,000 of them — to reconcile transactions against a partner ledger. It worked for months. Then onboarding pushed merchant count past ~95,000 and settlement, which used to finish in 40 minutes, started running for six-plus hours and occasionally wedged in “Running” until someone terminated it manually. Worse, a few runs produced double-applied adjustments, and the partner started raising disputes.

Two root causes surfaced under investigation. First, the fan-out was unbounded: scheduling 95,000 activities in one Task.WhenAll saturated the Azure Storage work-item queue, and control-queue latency spiked so badly that replays slowed to a crawl. Second, the reconcile activity called the partner’s ledger API non-idempotently — when an activity timed out and the retry policy fired, the original call had sometimes already posted, so the adjustment landed twice. The history table had also grown to tens of GB because each activity returned the full reconciliation record instead of a reference, so every replay dragged that payload through Table Storage.

The fix had three parts. They chunked the fan-out into sub-batches of 500 with a durable sub-orchestration per chunk, capping concurrent work items. They made the activity idempotent by generating one idempotency key per merchant in the orchestrator (context.NewGuid() returns the same value on every replay, so the key survives retries), passing it to the activity, persisting it before the call, and having the partner API treat a repeated key as a no-op. And because throughput was now the binding constraint, they migrated the task hub to the Netherite backend.

// Sub-orchestration per chunk bounds the fan-out width and isolates failures.
[Function(nameof(SettleChunk))]
public async Task<ChunkResult> SettleChunk(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var merchants = context.GetInput<string[]>()!;   // <= 500 per chunk
    var retry = TaskOptions.FromRetryPolicy(new RetryPolicy(
        maxNumberOfAttempts: 4,
        firstRetryInterval: TimeSpan.FromSeconds(10),
        backoffCoefficient: 2.0));

    var tasks = merchants
        .Select(m => context.CallActivityAsync<bool>(nameof(Reconcile), m, retry))
        .ToList();

    bool[] results = await Task.WhenAll(tasks);
    return new ChunkResult(merchants.Length, results.Count(ok => ok));
}
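
The chunking step that feeds SettleChunk can be sketched in plain C# as below. This is a minimal illustration, not LedgerLink's code: the ToChunks helper and the generated merchant IDs are hypothetical, and in a real parent orchestrator each chunk would be handed to a sub-orchestration call rather than printed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Split merchant IDs into fixed-size chunks; each chunk becomes one SettleChunk input.
static List<string[]> ToChunks(IEnumerable<string> merchantIds, int chunkSize)
{
    if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
    return merchantIds.Chunk(chunkSize).ToList();   // Enumerable.Chunk needs .NET 6 or later
}

// Hypothetical merchant IDs, generated only to exercise the helper.
var merchants = Enumerable.Range(1, 95_000).Select(i => $"merchant-{i:D6}");
var chunks = ToChunks(merchants, 500);

Console.WriteLine($"{chunks.Count} chunks, largest has {chunks.Max(c => c.Length)} merchants");
```

At 95,000 merchants and 500 per chunk this yields 190 sub-orchestrations. Because the orchestrator replays, the input order must be stable so that every replay produces the same chunks.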

Settlement dropped back to ~35 minutes and stopped wedging; duplicate adjustments went to zero. The incident timeline and what each step actually changed:

Time	Status	Action	Result	Verdict
Month 0	Healthy	40k merchants, single `WhenAll`	~40 min nightly	Fine at this scale
Month 6, T+0	Degraded	Merchants hit 95k; same code	6 h+, occasional wedge	Unbounded fan-out
T+1 h	Investigating	Checked control-queue latency	Latency spiked, replays crawling	Queue saturation confirmed
T+2 h	Investigating	KQL for non-terminal instances	Found stuck “Running” runs	Wedge confirmed
T+1 day	Mitigated	Chunk to 500 via sub-orchestrations	Duration ~70 min	Width fixed; dupes remain
T+3 days	Mitigated	Idempotency key persisted pre-call	Dupes → 0	Side effect fixed
T+1 week	Fixed	Migrate task hub to Netherite	~35 min, stable	Throughput headroom

The lesson the team wrote into their runbook: fan-out width and activity idempotency are not optional at scale. Durable Functions will happily let you schedule a hundred thousand activities and retry a non-idempotent side effect — and both will bite you in production, not in the demo.

Advantages and disadvantages

The event-sourced, replay-based model both enables code-as-workflow and imposes the determinism constraint. Weigh it honestly:

Advantages (why this model helps you)	Disadvantages (why it bites)
Workflows are plain code — no hand-rolled queues, tables, or state machines	The orchestrator body replays, so non-deterministic code silently corrupts or throws
State is durable for free — survives crashes, deploys, scale-in via history	History is a real store you must groom (bloat, purge) and size
Fan-out/fan-in across the whole scaled-out app with one `Task.WhenAll`	Unbounded fan-out starves the control queue — you must chunk
Built-in durable timers + external events make human-in-the-loop trivial	No timeout on `WaitForExternalEvent` → stuck “Running” forever
Retries, backoff, and sub-orchestration isolation are first-class	Retries re-fire non-idempotent side effects → double-apply
Entities give race-free shared state without locks	Misusing an entity like an orchestration (or vice versa) hurts
Pluggable backends (Storage / Netherite / MSSQL) for different scale points	Switching backends is a state migration, not a config flip
Strong observability via App Insights traces + status/purge APIs	“Stuck” instances are invisible unless you actively query for them

The model is right when you have genuine multi-step or long-running workflows that must survive failure and you want to ship code, not operate infrastructure. It bites hardest on very wide fan-outs (unbounded width), side-effecting activities that aren’t idempotent, eternal loops without ContinueAsNew, and teams that don’t internalize replay. Every disadvantage is manageable — but only if you know it exists, which is the point of this article.

Hands-on lab

Deploy a tiny fan-out/fan-in orchestration, watch it run, exercise the status API, then groom it with purge — free-tier-friendly on the Consumption plan; delete at the end. Run in Cloud Shell (Bash). (This lab uses the .NET isolated worker; substitute the JS/Python templates if you prefer.)

Step 1 — Variables and resource group.

RG=rg-durable-lab
LOC=centralindia
STG=stdurable$RANDOM        # 3–24 lowercase alphanumerics, globally unique
APP=func-durable-$RANDOM    # globally-unique function app name
az group create -n $RG -l $LOC -o table

Step 2 — Storage account (the default Durable backend) and the function app.

az storage account create -n $STG -g $RG -l $LOC --sku Standard_LRS -o table
az functionapp create -n $APP -g $RG --storage-account $STG \
  --consumption-plan-location $LOC --runtime dotnet-isolated \
  --functions-version 4 -o table

Expected: a function app on the Consumption plan, runtime dotnet-isolated.

Step 3 — Scaffold a Durable project locally and add a fan-out orchestration.

func init DurableLab --worker-runtime dotnet-isolated
cd DurableLab
func new --name FanOut --template "Durable Functions Orchestrator"
# Edit FanOut.cs to fan out a CallActivityAsync over a small array and Task.WhenAll the results.

Step 4 — Publish and capture the system key for the Durable HTTP APIs.

func azure functionapp publish $APP
SYS_KEY=$(az functionapp keys list -n $APP -g $RG \
  --query "systemKeys.durabletask_extension" -o tsv)

Step 5 — Start an orchestration and capture the instance ID. The HTTP-start trigger returns a status-query payload:

BASE="https://$APP.azurewebsites.net"
RESP=$(curl -s -X POST "$BASE/api/FanOut_HttpStart?code=$SYS_KEY")
echo "$RESP"
INSTANCE_ID=$(echo "$RESP" | python3 -c "import sys,json;print(json.load(sys.stdin)['id'])")

Step 6 — Query status and history.

curl -s "$BASE/runtime/webhooks/durabletask/instances/${INSTANCE_ID}?showHistory=true&code=$SYS_KEY" | head -40
# Expected: runtimeStatus transitions Pending → Running → Completed, with activity events in history.

Step 7 — Groom: purge the completed instance.

curl -s -X DELETE \
  "$BASE/runtime/webhooks/durabletask/instances/${INSTANCE_ID}?code=$SYS_KEY"
# Expected: an instancesDeleted count of 1; the history for that instance is gone.

Validation checklist. You created the Storage-backed task hub, ran a fan-out/fan-in orchestration, watched it reach Completed, inspected its event-sourced history, and purged it. The lab steps mapped to what each proves:

Step	What you did	What it proves	Real-world analogue
2	Storage + function app	The default backend is just a storage account	Every first Durable deploy
3	Fan-out orchestrator	`Task.WhenAll` is the canonical fan-in	Batch/parallel processing
5	HTTP-start → instance ID	The instance ID is the address for everything	Starting work from an API
6	`showHistory=true`	History is real, inspectable, event-sourced	02:14 triage of a stuck run
7	Purge API	History must be groomed or it bloats	Scheduled cleanup

Cleanup (avoid lingering storage charges).

az group delete -n $RG --yes --no-wait

Cost note. Consumption plan + a small LRS storage account for an hour of this lab is well under ₹20; deleting the resource group stops everything. Durable’s cost on Consumption is dominated by storage transactions (every history write is a transaction), which is why grooming and small payloads matter.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table you read mid-incident, then the entries that bite hardest expanded with the exact confirm commands.

#	Symptom	Root cause	Confirm (exact cmd / path)	Fix
1	`NonDeterministicOrchestrationException` on replay	`DateTime.UtcNow`/`Guid.NewGuid`/I/O in orchestrator	Exception message; diff `?showHistory=true` across replays	Use `context.CurrentUtcDateTime`/`NewGuid`; move I/O to an activity
2	Instance stuck “Running” forever	`WaitForExternalEvent` with no timeout	Status API `runtimeStatus=Running` for hours; KQL non-terminal query	Add the timer-race timeout; terminate the wedged instance
3	Duplicate charges/adjustments	Non-idempotent activity + retry fired	`dependencies` failures + duplicate rows downstream	Deterministic idempotency key persisted before the call
4	Settlement went from 40 min to 6 h, occasionally wedges	Unbounded fan-out saturating control queue	Control-queue latency; instance duration trend	Chunk to ~500 via sub-orchestrations; consider Netherite
5	Queries time out; history in tens of GB	Large payloads returned by value; missing purge	History table size; payload sizes in history	Return blob URIs; scheduled purge of terminal instances
6	Eternal monitor’s history grows every cycle	`while(true)` loop instead of `ContinueAsNew`	History length grows per iteration	Replace loop with `ContinueAsNew`
7	A poison work item stalls a partition	Activity throws deterministically; redelivered forever	Repeating failure in logs; control/work-item queue backlog	Fix the activity; cap attempts; partition the result
8	“My orchestration ran on the wrong app”	Two apps share a storage account + hub name	Compare `host.json` `hubName` across apps	Give each app a unique `hubName`
9	External event “lost”; instance never resumed	Event raised to wrong instance ID / hub, or discarded at `ContinueAsNew` because `preserveUnprocessedEvents` was false	`raiseEvent` 202 but no state change; check ID/hub	Use exact instance ID + `taskHub`; await events before `ContinueAsNew` or keep `preserveUnprocessedEvents` true
10	Terminating an instance didn’t stop the work	`terminate` doesn’t cancel in-flight activities	Activity still logging after terminate	Make activities cancellation-aware; design for at-least-once
11	Backend latency spikes; HTTP 429 in logs	Storage account throttling under load	Storage metrics 429; backend trace latency	Scale the account / move to Netherite; reduce transactions
12	Fan-in throws aggregate, whole batch fails on one bad item	`Task.WhenAll` with no per-item handling	Aggregate exception naming one activity	Switch to collect-and-partition `try/catch` per task

The expanded form for the entries that bite hardest:

1. NonDeterministicOrchestrationException on replay. Root cause: the orchestrator body did something non-deterministic — read DateTime.UtcNow, called Guid.NewGuid(), did direct I/O, or iterated an unordered collection — so the replay scheduled different work than history records. Confirm: the exception message names the divergence; pull the instance with ?showHistory=true and compare the scheduled events against the body. Grep the orchestrator for the forbidden constructs in the table above. Fix: replace with context.CurrentUtcDateTime / context.NewGuid(), move all I/O into activities, and sort collections to a stable order. Never retry this — it’s a code defect.
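
The divergence described above can be reproduced with a toy replay check. This is a plain C# model written for illustration, not the Durable Task implementation: Replay and Orchestrate are hypothetical names, and the "history" is just a list of activity names.

```csharp
using System;
using System.Collections.Generic;

// Toy replay check: compare what the orchestrator schedules now against
// what history recorded on the first run.
static void Replay(IReadOnlyList<string> history, IEnumerable<string> scheduledNow)
{
    int i = 0;
    foreach (string activity in scheduledNow)
    {
        if (i < history.Count && history[i] != activity)
            throw new InvalidOperationException(
                $"Non-deterministic: history[{i}] is '{history[i]}' but replay scheduled '{activity}'");
        i++;
    }
}

// The orchestrator picks an activity based on the time it is given.
static IEnumerable<string> Orchestrate(DateTime now)
{
    yield return now.Hour < 12 ? "MorningBatch" : "EveningBatch";
    yield return "SendReport";
}

var firstRun = new DateTime(2024, 1, 1, 11, 59, 0, DateTimeKind.Utc);
var recorded = new List<string>(Orchestrate(firstRun));   // MorningBatch, SendReport

// Replaying with the recorded time (what context.CurrentUtcDateTime models) matches history.
Replay(recorded, Orchestrate(firstRun));

// Replaying with a later wall-clock reading (what DateTime.UtcNow would give) diverges.
try { Replay(recorded, Orchestrate(firstRun.AddMinutes(2))); }
catch (InvalidOperationException ex) { Console.WriteLine(ex.Message); }
```

The point of the model is that the clock read alone is harmless; the failure appears when the value changes which work gets scheduled.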

2. Instance stuck “Running” forever. Root cause: almost always an unresolved WaitForExternalEvent with no timeout, or a fan-in where one activity throws on every retry and the host keeps redelivering it. Confirm: the status API shows runtimeStatus: Running for far longer than expected; the fleet-wide KQL below surfaces every non-terminal instance. Fix: add the timer-race from the human-interaction section; put bounded retry policies on activities; terminate the genuinely wedged instance.

# Inspect a single instance: status, input, output, and execution history
curl "https://myapp.azurewebsites.net/runtime/webhooks/durabletask/instances/${INSTANCE_ID}?showHistory=true&code=${SYSTEM_KEY}"

# Terminate a wedged instance (does NOT cancel in-flight activities)
curl -X POST \
  "https://myapp.azurewebsites.net/runtime/webhooks/durabletask/instances/${INSTANCE_ID}/terminate?reason=stuck&code=${SYSTEM_KEY}"

// Orchestrations that started but never reached a terminal state in 24h
traces
| where timestamp > ago(24h)
| where customDimensions.prop__functionType == "Orchestrator"
| extend instanceId = tostring(customDimensions.prop__instanceId),
         state      = tostring(customDimensions.prop__state)
| summarize states = make_set(state), last = max(timestamp) by instanceId
| where not (states has "Completed" or states has "Failed" or states has "Terminated")
| order by last asc
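
The timer-race fix for a wait with no timeout has a simple shape, sketched here with plain Tasks. This is a model under stated assumptions: a TaskCompletionSource stands in for the external event and Task.Delay stands in for the durable timer, and WaitForApproval is a hypothetical helper. Inside a real orchestrator you would use the context's event and timer APIs instead, never Task.Delay.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Race an "event" task against a timeout and cancel the loser.
static async Task<string> WaitForApproval(Task<bool> approvalEvent, TimeSpan timeout)
{
    using var timerCts = new CancellationTokenSource();
    Task timer = Task.Delay(timeout, timerCts.Token);

    Task winner = await Task.WhenAny(approvalEvent, timer);
    if (winner == approvalEvent)
    {
        timerCts.Cancel();                       // event won: cancel the pending timer
        return await approvalEvent ? "Approved" : "Rejected";
    }
    return "TimedOut";                           // timer won: take the escalation path
}

// Nobody ever raises this event, so without a timeout the wait would never end.
var neverRaised = new TaskCompletionSource<bool>();
Console.WriteLine(await WaitForApproval(neverRaised.Task, TimeSpan.FromMilliseconds(100)));

// Here the event has already arrived before the timeout.
var raised = new TaskCompletionSource<bool>();
raised.SetResult(true);
Console.WriteLine(await WaitForApproval(raised.Task, TimeSpan.FromSeconds(5)));
```

Every branch ends in a decision, so the instance always reaches a terminal state instead of sitting in “Running”.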

3. Duplicate charges/adjustments. Root cause: a side-effecting activity isn’t idempotent, so when an attempt times out and the retry policy fires, the original call may have already succeeded and the effect lands twice. Confirm: App Insights dependencies shows the call failing/timing out under load, and you see duplicate rows downstream. Correlate the retry timestamps with the duplicates. Fix: give each logical unit one idempotency key that is identical on every attempt (for example context.NewGuid() generated in the orchestrator and passed to the activity, since it is stable across replays), persist it before the call, and have the downstream treat a repeated key as a no-op. See Transactional Outbox/Inbox & Exactly-Once Event Publishing for the broader pattern.
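
One way to get a key that is identical on every retry is to derive it from identifiers that never change, sketched below in plain C#. This is an illustration of the idea rather than the article's exact method: IdempotencyKey, its parameter names, the sample IDs, and the in-memory HashSet standing in for the downstream ledger are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Derive a stable key from identifiers that are the same on every attempt.
static Guid IdempotencyKey(string settlementRunId, string merchantId)
{
    byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{settlementRunId}|{merchantId}"));
    return new Guid(hash.AsSpan(0, 16));   // first 128 bits of the hash as a GUID-shaped key
}

// Stand-in for the downstream ledger: remembers keys and ignores repeats.
var appliedKeys = new HashSet<Guid>();
bool ApplyAdjustment(Guid k) => appliedKeys.Add(k);   // false means "already applied, no-op"

Guid adjustmentKey = IdempotencyKey("settlement-2024-06-01", "merchant-000042");
Console.WriteLine(ApplyAdjustment(adjustmentKey));   // first attempt applies the adjustment
Console.WriteLine(ApplyAdjustment(adjustmentKey));   // retry with the same key is ignored
```

The deduplication has to live on the side that performs the effect; a key the downstream ignores protects nothing.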

5. History bloat — queries time out, history in tens of GB. Root cause: large activity payloads returned by value, and/or no purge of terminal instances. Confirm: the history Table Storage is huge; individual history rows carry large payloads. Fix: return references (blob URIs, row keys) instead of big blobs, and schedule a purge so history is groomed continuously instead of growing until queries time out.

# Purge completed/failed/terminated instances older than a cutoff
curl -X DELETE \
  "https://myapp.azurewebsites.net/runtime/webhooks/durabletask/instances?createdTimeTo=2026-03-01T00:00:00Z&runtimeStatus=Completed,Failed,Terminated&code=${SYSTEM_KEY}"

Schedule that purge (a timer-triggered function calling the client’s purge API, which is PurgeAllInstancesAsync with a PurgeInstancesFilter in current versions of the isolated SDK) so history is groomed continuously.

The error/exception reference you scan first — every error you realistically see, what it means, and the fix:

Error / status	Meaning	Likely cause	How to confirm	First fix
`NonDeterministicOrchestrationException`	Replay scheduled different work than history	Clock/GUID/I/O in orchestrator	Exception text; `showHistory` diff	Use context APIs; move I/O to activities
`OrchestrationFailureException`	Orchestrator threw and faulted	Unhandled exception in body or activity aggregate	Instance `output` / failure details	Fix the throwing path; handle aggregates
`TaskFailedException`	An activity exhausted its retries	Persistent activity failure	Activity logs; `dependencies`	Fix the activity; tune retry/idempotency
`runtimeStatus: Running` (stuck)	Never reached terminal state	Unbounded wait / poison retry	Status API; KQL non-terminal	Timer-race; terminate; fix poison item
`runtimeStatus: Failed`	Terminal failure	Faulted orchestrator/activity	Instance output	Read output; fix root cause
`runtimeStatus: Terminated`	Manually stopped	`terminate` was called	Status API reason	Was the in-flight work cancelled?
HTTP 404 on raiseEvent/status	Instance not found	Wrong instance ID / wrong hub	Verify ID + `taskHub` query	Use exact ID and hub name
HTTP 429 (backend)	Storage throttling	Heavy transaction volume	Storage account metrics	Scale account / Netherite; cut transactions
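HTTP 410 Gone (raiseEvent/terminate)	Instance has already finished	Instance completed, failed, or was terminated before the call	Status API `runtimeStatus`	Stop sending to it; start a new instance if the work is still needed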

Decision table for the on-call engineer — if you see…:

If you see…	It’s probably…	Do this
`NonDeterministicOrchestrationException`	Clock/GUID/I/O in the orchestrator	Fix the body; never retry
One instance “Running” for hours	A wait with no timeout, or poison retry	KQL to confirm; add timeout; terminate
Many instances slow at once	Backend saturation / unbounded fan-out	Check control-queue latency; chunk fan-out
Duplicate downstream effects	Non-idempotent activity + retry	Add idempotency key
Queries timing out, huge history	Bloat	Smaller payloads; scheduled purge
Work running on the “wrong app”	Shared task hub	Unique `hubName` per app
Event raised but nothing resumed	Wrong ID/hub, or discarded at `ContinueAsNew`	Verify ID/hub; await before `ContinueAsNew` or preserve unprocessed events
Terminate didn’t stop the work	`terminate` ignores in-flight activities	Make activities cancellation-aware
Backend 429s under load	Storage account throttling	Scale account / Netherite; cut transactions
Same exception every replay, no retry helps	Code defect in the body	Fix it — never retry a non-determinism error

Best practices

Keep orchestrators pure. No DateTime.UtcNow, Guid.NewGuid(), randomness, or direct I/O — use context.CurrentUtcDateTime, context.NewGuid(), and IsReplaying-guarded logging. This is the rule that prevents the most production incidents.
All I/O and side effects live in activities or entities, never in orchestrator bodies.
Keep payloads small and serializable. Pass large data by reference (a blob URI), not by value — it bloats the history table and slows every replay.
Carry explicit, bounded retry policies on activities, and choose the fan-in failure policy (WhenAll vs collect-and-partition) deliberately rather than by default.
Bound fan-out width with chunking or sub-orchestrations (~500/chunk) instead of scheduling unlimited activities at once.
Make side-effecting activities idempotent so retries and redeliveries can’t double-apply — derive and persist a deterministic idempotency key.
Pair every WaitForExternalEvent with a durable-timer timeout and cancel the loser, so no instance can hang forever.
Use ContinueAsNew for every long-lived orchestration to bound history; never while(true).
Use durable entities for shared mutable state (single-threaded) instead of ad-hoc locking.
Choose the backend for the workload — Azure Storage default, Netherite for high throughput, MSSQL for portability — and give every app a unique hubName on shared storage.
Schedule a purge of terminal instances so the history table doesn’t bloat, and keep App Insights/KQL queries on hand to find non-terminal instances.
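Test determinism deliberately: in a test orchestrator, temporarily branch on DateTime.UtcNow so that a replay schedules different work, and confirm the runtime reports NonDeterministicOrchestrationException rather than continuing silently. Reading the clock alone won’t throw; the guardrail fires only when the scheduled work diverges from history.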
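
The collect-and-partition alternative to a bare Task.WhenAll, mentioned in the retry-policy practice above, can be sketched with plain Tasks. This is a model, not orchestrator code: Guarded and Process are hypothetical helpers, and in a real orchestration the wrapped call would be an activity call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Wrap one unit of work so a failure becomes data instead of an exception.
static async Task<(string Item, bool Ok, string Error)> Guarded(string item, Func<string, Task> work)
{
    try { await work(item); return (item, true, ""); }
    catch (Exception ex) { return (item, false, ex.Message); }
}

// Hypothetical activity: fails for one specific item.
static Task Process(string item) =>
    item == "bad"
        ? Task.FromException(new InvalidOperationException("cannot process"))
        : Task.CompletedTask;

var items = new[] { "a", "bad", "c" };

// WhenAll over the guarded tasks does not throw, so one bad item cannot fail the batch.
var outcomes = await Task.WhenAll(items.Select(i => Guarded(i, Process)));

var succeeded = outcomes.Where(o => o.Ok).Select(o => o.Item).ToList();
var failed = outcomes.Where(o => !o.Ok).ToList();
Console.WriteLine($"ok: {string.Join(",", succeeded)}; failed: {string.Join(",", failed.Select(f => f.Item))}");
```

Which policy is right depends on the job: fail-fast suits all-or-nothing batches, while partitioning suits work where the good items should proceed and the failures go to a retry or dead-letter path.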

The signals worth alerting on before the next incident — leading indicators, not “the orchestration failed”:

Alert on	Signal / source	Threshold (starting point)	Why it’s leading
Non-terminal instance age	KQL non-terminal query	Any instance Running > expected SLA	Catches stuck “Running” before users notice
Control-queue latency	Backend traces / metrics	Rising trend under load	Predicts fan-out saturation
Storage throttling	Storage account 429 count	> 0 sustained	Backend is the bottleneck
History table size	Storage metrics	Growth without purge	Predicts query timeouts
Activity failure rate	`dependencies` success=false	> 1% sustained	Poison items / retries firing
Orchestration duration	App Insights custom metric	p95 > baseline	Width or backend regression

Security notes

Managed identity over secrets. The function app’s connection to its Durable storage account, and any secrets your activities need, should use the app’s managed identity with Key Vault references rather than plaintext connection strings. Grant least privilege — Storage Blob/Queue/Table Data Contributor scoped to the task-hub account, and Key Vault Secrets User for secrets. See Azure Key Vault: Secret Rotation with Managed Identity.
Protect the Durable HTTP management APIs. The raiseEvent, terminate, purge, and status endpoints are gated by the durabletask_extension system key — treat it like a credential, never log it, and prefer calling the management APIs from trusted backends or through APIM rather than exposing them.
Validate external-event payloads. An external event is an untrusted input from outside the orchestration; validate and authorize the caller of raiseEvent (who can approve a purchase?) at the HTTP layer before the signal reaches the instance.
Isolate the network. For sensitive workloads, VNet-integrate the function app and reach storage/SQL/partner APIs over Private Endpoints so task-hub traffic and activity I/O stay off the public internet.
Don’t leak state in errors. Instance output and history can contain business data; keep detailed failure output out of anonymous-facing responses and lock down who can query the status API.
Secure the backend store. Restrict the Durable storage account / Event Hubs / SQL with firewall rules and private access; it holds the full event-sourced history of every workflow.

The security controls that also prevent these incidents — secure and resilient pull the same way:

Control	Mechanism	Secures against	Also prevents
Managed identity to storage	`identity` + RBAC on the account	Connection strings in config	Secret-rotation breaking the backend connection
System-key protection on mgmt APIs	`durabletask_extension` key + APIM	Anonymous terminate/purge/raiseEvent	Malicious instance manipulation
Authorize `raiseEvent` callers	HTTP auth before the signal	Unauthorized approvals	Spoofed external events corrupting flow
Private Endpoints for storage/SQL/API	VNet + private DNS	Data exfiltration over public net	SNAT/egress surprises in activities
Vault firewall + trusted services	Key Vault networking	Secret exfiltration	KV-reference boot failures (when allow-listed)
Least-privilege RBAC on the account	Scoped data-plane roles	Over-broad access to history	Accidental cross-hub interference

Cost & sizing

The bill drivers and how they interact with the patterns:

Storage transactions dominate on Azure Storage. Every history write, every queue operation, every status poll is a billable transaction. A chatty orchestration with large payloads and frequent polling can run a surprising storage bill — which is exactly why small payloads, fewer activities, and scheduled purge are cost levers, not just performance ones.
Compute follows your hosting plan. On Consumption you pay per-execution + GB-seconds; Premium (EP) and Dedicated trade a floor cost for no cold start and VNet. Fan-out multiplies executions — 95,000 activities is 95,000 executions plus their history writes.
Netherite adds Event Hubs cost (throughput units) and Page Blob storage — justified only when you’re processing tens of thousands of work items per second and Azure Storage latency is the binding constraint.
MSSQL adds SQL compute (DTU/vCore) you size and pay for — chosen for portability/single-store rather than cost.
Polling is a hidden cost. Clients that poll the status URL in a tight loop generate transactions; back off the poll interval.
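
To see how much backing off the poll interval saves, here is a small schedule calculator in plain C#. It is an illustration only: BackoffSchedule is a hypothetical helper, and the 1-second start, 30-second cap, and 10-minute window are assumed values, not recommendations from the Durable Functions documentation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Poll intervals that double from `first` up to `cap`, until `total` time is covered.
static List<TimeSpan> BackoffSchedule(TimeSpan first, TimeSpan cap, TimeSpan total)
{
    var schedule = new List<TimeSpan>();
    TimeSpan elapsed = TimeSpan.Zero, next = first;
    while (elapsed < total)
    {
        schedule.Add(next);
        elapsed += next;
        next = TimeSpan.FromTicks(Math.Min(next.Ticks * 2, cap.Ticks));
    }
    return schedule;
}

var window = TimeSpan.FromMinutes(10);
int fixedPolls = (int)window.TotalSeconds;   // one status request per second
int backoffPolls = BackoffSchedule(TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30), window).Count;
Console.WriteLine($"1 s fixed: {fixedPolls} status requests; backoff capped at 30 s: {backoffPolls}");
```

Multiply the difference by the number of concurrent clients to estimate what tight polling adds to the storage bill.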

A rough monthly picture for a moderate workload (a few hundred thousand activity executions/day, small payloads, groomed history) on Consumption: storage transactions plus execution charges typically land in the low thousands of INR; the same workload on Premium EP1 adds a floor of roughly ₹12,000–18,000/month for the always-warm instance. The cost drivers and what each buys you:

Cost driver	What you pay for	Rough INR / month	What it fixes	Watch-out
Storage transactions (history/queues)	Per-transaction on the account	~₹500–3,000 (workload-dependent)	(it’s the backend itself)	Large payloads + polling inflate it
Consumption executions	Per-execution + GB-seconds	Fractions of a rupee per 10k executions (GB-seconds extra)	Cheapest entry; scales to zero	Cold start; fan-out multiplies count
Premium plan (EP1+)	Always-warm instance floor	~₹12,000–18,000+	Cold start, VNet, predictable latency	Pay even when idle
Netherite (Event Hubs TU + blobs)	Throughput units + Page Blobs	~₹8,000+	Throughput ceiling under heavy fan-out	Over-provisioned at low scale
MSSQL backend	SQL DTU/vCore + storage	depends on SQL tier	Portability, single backed-up store	You operate the SQL
App Insights ingestion	Per-GB telemetry	~₹1,000–3,000	Triage (KQL, traces)	Sample high-volume apps

Free-tier note: the Consumption plan includes a monthly free grant (1 million executions and 400,000 GB-seconds at the time of writing), so small Durable workloads cost mostly the (cheap) storage transactions. Keep payloads small and purge terminal instances, and the bill stays tiny.

Interview & exam questions

1. Why must an orchestrator function be deterministic, and name three things you can’t do in one? Because the orchestrator replays from history every time it makes progress, it must schedule the same work in the same order given the same history — non-determinism diverges history and corrupts state (the SDK throws NonDeterministicOrchestrationException). You can’t use DateTime.UtcNow, Guid.NewGuid(), or direct I/O (HttpClient, DB) in the body — use context.CurrentUtcDateTime, context.NewGuid(), and activities instead.

2. What is the fan-out/fan-in pattern and what’s the canonical fan-in? Fan-out schedules many independent activities in parallel (build a list of CallActivityAsync tasks without awaiting each); fan-in waits for them all. The canonical fan-in is await Task.WhenAll(tasks) over the Durable tasks — replay-safe and durable, so a crash after 900 of 1,000 completions resumes with only the outstanding 100.

3. How do you bound a very large fan-out and why must you? Scheduling, say, 100,000 activities in one Task.WhenAll saturates the work-item/control queues and starves other orchestrations, spiking latency. Bound it by chunking — a sub-orchestration per ~500 items via CallSubOrchestratorAsync — which caps in-flight work and isolates failures to a chunk.

4. How do you implement a human-approval step that won’t hang forever? Race a WaitForExternalEvent against a durable timer with Task.WhenAny: if the event wins, cancel the timer and return; if the timer wins, escalate/time out. Use context.CreateTimer (persisted, survives restart), never Task.Delay, and always cancel the loser so the instance doesn’t stay open.

5. What does ContinueAsNew do and when do you need it? It restarts the orchestration with a clean history and fresh input, which is how you run an eternal orchestration (monitor, recurring job) without the history table growing unbounded. Drain pending events first, and make the ContinueAsNew call the last statement on the branch: the call only schedules the restart and does not itself stop execution at that line.

6. When do you use a durable entity instead of an orchestration? Use an orchestration for a workflow with a defined start and end; use an entity for long-lived, mutable state that many callers update concurrently (counters, carts, per-tenant budgets, rate limits). Entities guarantee single-threaded access per entityName@key, giving race-free updates without locks.

7. An activity that posts to a partner API double-applied during retries. Why, and how do you fix it? The activity isn’t idempotent: an attempt timed out and the retry fired after the original call had already posted. Fix it with a deterministic idempotency key: generate it with context.NewGuid() in the orchestrator, once per unit of work, persist it before the call, and have the downstream treat a repeated key as a no-op, so retries and redeliveries can’t double-apply.

8. Compare the three storage backends. Azure Storage (default) is lowest-ops and cheapest at low scale but has a throughput ceiling under heavy fan-out; Netherite (Event Hubs + Page Blobs) gives very high throughput and low latency at the cost of operational complexity; MSSQL gives portability and a single backed-up store for hybrid/on-prem at the cost of running SQL. Switching is a state migration, so choose before millions of instances exist.

9. Two function apps’ orchestrations are interfering. Most likely cause? They share a storage account and the same hubName, so they’re reading each other’s queues and tables (the “ran on the wrong app” incident). Give each app a unique hubName in host.json (or separate storage accounts).

10. How do you find and recover a stuck “Running” instance? Query the status API (?showHistory=true) for the instance, or run a fleet-wide KQL over traces for instances with no terminal state. The usual cause is a WaitForExternalEvent with no timeout or a poison-item retry loop — fix the code (timer-race, bounded retries) and terminate the wedged instance (knowing terminate doesn’t cancel in-flight activities).

11. What causes history-table bloat and how do you control it? Large activity payloads returned by value, and missing purge of terminal instances. Return references (blob URIs/row keys) instead of big blobs, and schedule a purge (PurgeInstancesAsync) of completed/failed/terminated instances so history is groomed continuously instead of growing until queries time out.

12. Does terminating an instance stop its in-flight activities? No — terminate marks the orchestration terminated but does not cancel activities already running. Design activities to be cancellation-aware and assume at-least-once execution so a terminated-but-still-running activity can’t corrupt downstream state.
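
To make question 3 concrete, here is a minimal sketch of chunked fan-out in the .NET isolated model. It assumes the `Microsoft.Azure.Functions.Worker.Extensions.DurableTask` package and .NET 6 or later (for `Enumerable.Chunk`). The function names (`SettleAll`, `SettleChunk`, `SettleMerchant`), the string merchant IDs, and the chunk size of 500 are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class ChunkedSettlement
{
    private const int ChunkSize = 500; // assumed width, not a platform limit

    // Parent: splits the input and runs one sub-orchestration per chunk, one chunk at a time.
    [Function(nameof(SettleAll))]
    public static async Task<int> SettleAll(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        List<string> merchantIds = context.GetInput<List<string>>() ?? new List<string>();
        int settled = 0;

        // Chunk is deterministic for a given input, so it is safe under replay.
        foreach (string[] chunk in merchantIds.Chunk(ChunkSize))
        {
            settled += await context.CallSubOrchestratorAsync<int>(nameof(SettleChunk), chunk);
        }
        return settled;
    }

    // Child: fans out over at most ChunkSize items, then fans in with Task.WhenAll.
    [Function(nameof(SettleChunk))]
    public static async Task<int> SettleChunk(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        string[] chunk = context.GetInput<string[]>() ?? Array.Empty<string>();

        // ToList forces every activity to be scheduled before the first await.
        List<Task<bool>> tasks = chunk
            .Select(id => context.CallActivityAsync<bool>(nameof(SettleMerchant), id))
            .ToList();

        bool[] results = await Task.WhenAll(tasks);
        return results.Count(ok => ok);
    }

    // Activity stub: the real side effect goes here and must be idempotent.
    [Function(nameof(SettleMerchant))]
    public static Task<bool> SettleMerchant([ActivityTrigger] string merchantId)
    {
        return Task.FromResult(true);
    }
}
```

Awaiting each chunk before starting the next keeps the in-flight width at one chunk; running a small, fixed number of chunks in parallel is a reasonable variation when throughput matters more. For a very large ID list, passing a reference (a blob URI or a key range) rather than the list itself keeps the orchestration input, and therefore the history, small.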

These map primarily to AZ-204 (Developer Associate) — implement Azure Functions; develop event-based and message-based solutions — and the durable-orchestration patterns appear in solution-architecture scenarios on AZ-305. A compact cert-mapping for revision:

Question theme	Primary cert	Objective area
Replay model & determinism	AZ-204	Implement Azure Functions
Fan-out/fan-in, sub-orchestration	AZ-204	Develop message/event solutions
Human-in-the-loop (events/timers)	AZ-204	Durable Functions patterns
Entities vs orchestrations	AZ-204 / AZ-305	Stateful serverless design
Backend choice & scaling	AZ-305	Design for throughput/cost
Idempotency & exactly-once	AZ-204 / AZ-305	Reliable messaging design

Quick check

1. You add DateTime.UtcNow to an orchestrator and it throws on replay. What exception, and what’s the deterministic replacement?
2. An approval orchestration sits in “Running” for three days and never finishes. What’s the most likely cause and the fix?
3. A nightly job that fanned out 95,000 activities in one Task.WhenAll went from 40 minutes to 6 hours. Name the root cause and the fix.
4. Your activity posts to a payment API and you see duplicate charges after a deploy. Why, and what makes it safe?
5. Two function apps share a storage account and one app’s orchestration “runs on the other app.” What single setting fixes it?

Answers

1. NonDeterministicOrchestrationException. The orchestrator replays, so an ambient clock produces a different value each replay and diverges history. Replace it with context.CurrentUtcDateTime (and use context.NewGuid() for IDs); move any I/O into an activity.
2. A WaitForExternalEvent with no timeout: nothing ever raised the event, so the instance waits forever. Fix by racing the wait against a durable timer with Task.WhenAny, cancelling the loser; terminate the already-wedged instance.
3. Unbounded fan-out saturated the work-item/control queues and starved replays. Fix by chunking into sub-orchestrations (~500/chunk) to bound in-flight width, and consider migrating the task hub to Netherite for throughput headroom.
4. The activity isn’t idempotent: a timed-out attempt’s retry posted again after the original had already succeeded. Make it safe with a deterministic idempotency key (generated with context.NewGuid() in the orchestrator and persisted before the call) that the downstream treats as a no-op on repeat.
5. Give each app a unique hubName in host.json; they were sharing one task hub (the same queues and tables) on the shared storage account.
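
The timer race from the second answer is the pattern most often written incorrectly, so here is a minimal sketch of it in the .NET isolated model. It assumes the `Microsoft.Azure.Functions.Worker.Extensions.DurableTask` package. The event name `ApprovalEvent`, the 72-hour deadline, and the `ProcessApproval` and `Escalate` activities are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class ApprovalWorkflow
{
    [Function(nameof(RunApproval))]
    public static async Task<string> RunApproval(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        using var timerCts = new CancellationTokenSource();

        // Deadline comes from the replay-safe clock, never DateTime.UtcNow.
        DateTime deadline = context.CurrentUtcDateTime.AddHours(72);
        Task timeout = context.CreateTimer(deadline, timerCts.Token);
        Task<bool> approval = context.WaitForExternalEvent<bool>("ApprovalEvent");

        Task winner = await Task.WhenAny(approval, timeout);
        if (winner == approval)
        {
            // Cancel the pending timer so it does not hold the instance open until the deadline.
            timerCts.Cancel();
            await context.CallActivityAsync(nameof(ProcessApproval), approval.Result);
            return approval.Result ? "Approved" : "Rejected";
        }

        // The timer won: nobody responded in time.
        await context.CallActivityAsync(nameof(Escalate), context.InstanceId);
        return "TimedOut";
    }

    // Activity stubs; real implementations would record the decision or notify a manager.
    [Function(nameof(ProcessApproval))]
    public static void ProcessApproval([ActivityTrigger] bool approved) { }

    [Function(nameof(Escalate))]
    public static void Escalate([ActivityTrigger] string instanceId) { }
}
```

The caller that raises `ApprovalEvent` still needs to be authorized before the signal is sent, as the security table notes; the orchestrator itself cannot tell a legitimate approval from a spoofed one.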

Glossary

Orchestrator function — the deterministic “brain” ([OrchestrationTrigger]) that schedules activities and sub-orchestrations; it replays from history and must be pure.
Activity function — a unit of real work / I/O ([ActivityTrigger]); runs once per logical call (with retries), inputs/outputs serialized to JSON in history.
Durable entity — an addressable, persistent, single-threaded state object (entityName@key) for race-free shared state without locks.
Client binding — [DurableClient]; the way external code starts, queries, signals (raiseEvent), terminates, and purges orchestrations.
Replay — re-executing the orchestrator body from the start, feeding completed results from history; the reason determinism is required.
NonDeterministicOrchestrationException — thrown when a replay schedules different work than history records; a code defect, never a transient error.
History table — the event-sourced record of an instance’s execution; the source of replay and of bloat if payloads are large or purge is missing.
Task hub — the namespace (hubName in host.json) for all of an app’s Durable queues and tables; must be unique per app on shared storage.
Instance ID — the unique key addressing one orchestration run; used for status, raiseEvent, terminate, and purge.
Fan-out/fan-in — scheduling many activities in parallel (fan-out) and aggregating their results with Task.WhenAll (fan-in).
Sub-orchestration — an orchestration called from another (CallSubOrchestratorAsync); used to bound fan-out width and isolate failures.
External event — a named signal delivered to a running instance via the raiseEvent API; paired with a durable timer for human-in-the-loop.
Durable timer — a persisted, replay-safe deadline (context.CreateTimer) that survives host restarts; never Task.Delay.
ContinueAsNew — restarts an orchestration with fresh state and clean history; bounds the history of eternal orchestrations.
Idempotency key — a deterministic key (seed context.NewGuid(), persisted before the side effect) that makes an activity safe to retry without double-applying.
Storage provider — the backend that persists state: Azure Storage (default), Netherite (Event Hubs + Page Blobs), or MSSQL.
Purge — deleting terminal-instance history (PurgeInstancesAsync / DELETE API) to reclaim space and keep queries fast.

Next steps

You can now build the five Durable patterns correctly and triage a stuck orchestration. Build outward:

Next: Azure Functions: Serverless Patterns & Best Practices — the plain-trigger foundations under Durable, and when not to reach for orchestration.
Related: Azure Service Bus: Sessions, Dedup & Dead-Letter Patterns — when you outgrow Durable’s built-in queues and need first-class messaging.
Related: Transactional Outbox/Inbox & Exactly-Once Event Publishing — the idempotency and dedup patterns that keep side-effecting activities safe.
Related: Azure Monitor & Application Insights for Observability and KQL for Azure Monitor & Log Analytics — the telemetry that turns a stuck-instance mystery into a two-minute query.
Related: AWS Step Functions: Distributed Orchestration & Error-Handling Patterns — the same orchestration problems solved the AWS way, for contrast.