Resiliency Patterns That Actually Work: Retry, Circuit Breaker, and Bulkhead

Most resiliency outages I have debugged were not caused by the original fault. They were caused by the resiliency code – a retry loop hammering a service that was already on its knees, turning a brief blip into a self-sustaining outage. This article walks through the four patterns that actually work in production – timeout, retry with backoff and jitter, circuit breaker, and bulkhead – and, more importantly, how to compose them so they reinforce each other instead of amplifying the failure you are trying to survive.

Why naive retries cause cascading failure

Picture service A calling service B, with B briefly slow because a dependency is GC-pausing. A’s developer adds “just retry 3 times” with no delay. Now every request to B that would have failed once fails up to four times. For failing requests, A’s offered load to B has quadrupled at the exact moment B is least able to handle it. B’s queues fill, its latency climbs past A’s timeout, and more requests start failing – so more retries fire. This is a retry storm, and it is a positive feedback loop: the louder B screams, the harder A hits it.

The failure then propagates. A’s threads are blocked waiting on slow B calls, so A exhausts its own thread pool and starts failing requests from service Z upstream. Z retries A. The blast radius grows one hop at a time until the whole mesh is saturated. This is cascading failure, and the uncomfortable truth is that the retries, timeouts, and “resilience” code are what spread it.

The mental model that fixes this: every resiliency control is a load-shaping decision, not just an error-handling one. A retry adds load. A timeout frees a resource. A circuit breaker sheds load. A bulkhead caps load. If you cannot say what each control does to offered load on a struggling dependency, you have not designed for resilience – you have decorated your code with hope.
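To put a number on "a retry adds load", here is a minimal Python sketch of the amplification. It assumes immediate retries and independent failures at a fixed rate, which is a simplification (real failures are correlated, which makes things worse); the function name is mine, not from any library.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Average calls sent per logical request when retries fire immediately.

    Attempt k+1 is only made if the first k attempts all failed, so the
    expected number of attempts is 1 + p + p^2 + ... + p^max_retries.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


# Load multiplier on the dependency at different failure rates, 3 retries.
for p in (0.01, 0.5, 1.0):
    print(f"failure rate {p:.0%}: {expected_attempts(p, 3):.2f}x offered load")
```

The multiplier is negligible while the dependency is healthy and approaches four exactly when it is failing. That asymmetry is why retries look harmless in testing and dangerous in an incident.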

The rest of this article builds the controls in the order they must be reasoned about: timeouts first (they bound everything), then backoff, then breakers, then isolation, then composition, then degradation, then proof.

Step 1 – Timeouts and deadlines as the foundation

A call with no timeout is not a call; it is an open-ended promise to wait forever, and forever is exactly how long a hung TCP connection or a deadlocked dependency will take. Before any retry or breaker, every outbound call needs a bound.

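There are two distinct timeouts, and conflating them is a classic bug:

  1. Per-attempt timeout. The longest a single try may take before it is abandoned (and possibly retried).
  2. Overall timeout (the deadline). The longest the whole logical operation may take, including every retry and every backoff wait.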

In .NET with HttpClient, the trap is that HttpClient.Timeout is an overall timeout for the whole request including any handler-based retries, which silently truncates your retry policy. Set a generous client timeout and control real timeouts with a resilience policy:

// Polly v8 (Microsoft.Extensions.Http.Resilience / Polly.Core)
services.AddHttpClient("inventory", c =>
    {
        c.BaseAddress = new Uri("https://inventory.internal");
        // Outer ceiling; the pipeline below owns the real timeouts.
        c.Timeout = TimeSpan.FromSeconds(30);
    })
    .AddResilienceHandler("inventory-pipeline", builder =>
    {
        // Per-attempt timeout: one try may take at most 2s.
        builder.AddTimeout(TimeSpan.FromSeconds(2));
    });

Just as important: propagate the deadline downstream. If A has 800 ms left, it should tell B “you have 800 ms” so B does not start expensive work it can never deliver. In gRPC this is built in – a client deadline is sent on the wire and visible to the server via the context.

// gRPC: an absolute deadline travels with the call. The server can observe it
// via context.Deadline / CancellationToken and stop work that will be discarded.
var reply = await client.GetStockAsync(
    request,
    deadline: DateTime.UtcNow.AddMilliseconds(800));

Rule of thumb: a server’s own processing timeout should be shorter than its callers’ timeouts, and timeouts should shrink as you go deeper into the call tree. If a leaf service times out at 5s but the edge times out at 2s, the leaf is doing 3 seconds of work that nobody is waiting for – pure wasted capacity during an incident.
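One way to make "timeouts shrink as you go deeper" mechanical is to carry a deadline object down the call tree and derive each downstream timeout from what is left. This is a rough Python sketch under my own assumptions: the Deadline class, the reserve_s margin, and the injectable clock are illustrative, not part of gRPC or Polly.

```python
import time


class Deadline:
    """Absolute deadline on the monotonic clock, passed down the call tree."""

    def __init__(self, budget_s: float, now=time.monotonic):
        self._now = now
        self._expires_at = now() + budget_s

    def remaining(self) -> float:
        """Seconds left before the deadline, never negative."""
        return max(0.0, self._expires_at - self._now())

    def timeout_for_call(self, local_cap_s: float, reserve_s: float = 0.05) -> float:
        """Timeout for one downstream call: never more than the caller has left.

        reserve_s holds a little time back to build a response after the call.
        Raises TimeoutError when the budget is already spent, so no work starts.
        """
        available = self.remaining() - reserve_s
        if available <= 0:
            raise TimeoutError("deadline exhausted before the call was made")
        return min(local_cap_s, available)


# Usage: the edge grants 800 ms; a callee configured for 2 s gets about 750 ms.
deadline = Deadline(budget_s=0.8)
print(f"downstream timeout: {deadline.timeout_for_call(local_cap_s=2.0):.2f}s")
```

The useful property is the early TimeoutError: a service that knows its caller has already given up does no work at all, which is precisely the capacity you want back during an incident.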

Step 2 – Exponential backoff with jitter, done correctly

Once attempts are bounded, retries become safe to add – but only with two properties: exponential backoff (each retry waits longer, easing load on a recovering dependency) and jitter (randomization, so a thousand clients that failed at the same instant do not all retry at the same instant).

Without jitter you get a thundering herd: synchronized retry waves that re-saturate the dependency on a fixed cadence. The fix that AWS popularized is full jitter – pick a random delay anywhere in [0, computed_backoff]:

backoff = min(cap, base * 2 ** attempt)
sleep   = random_between(0, backoff)      # full jitter
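For readers who want to run it, the same formula as a small Python function. The parameter names and defaults (200 ms base, 10 s cap) are my own illustrative choices, and the optional rng argument exists only so the behaviour can be reproduced in a test.

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base_s: float = 0.2, cap_s: float = 10.0,
                      rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based), full jitter."""
    source = rng if rng is not None else random
    backoff = min(cap_s, base_s * 2 ** attempt)   # exponential growth, capped
    return source.uniform(0.0, backoff)           # anywhere in [0, backoff]


# Each client draws its own delay, so simultaneous failures spread out in time.
for attempt in range(4):
    print(f"retry {attempt}: wait {full_jitter_delay(attempt):.3f}s")
```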

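Polly v8 gives you jittered exponential backoff with DelayBackoffType.Exponential plus UseJitter = true. Note that for exponential backoff its jitter is a decorrelated-jitter variant rather than the exact full-jitter formula above; the goal, spreading retries out in time, is the same: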

builder.AddRetry(new HttpRetryStrategyOptions
{
    MaxRetryAttempts = 3,
    BackoffType = DelayBackoffType.Exponential,
    UseJitter = true,                         // decorrelated jitter
    Delay = TimeSpan.FromMilliseconds(200),   // base
    // Only retry things that are actually transient + idempotent-safe.
    ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
        .Handle<HttpRequestException>()
        .Handle<TimeoutRejectedException>()   // per-attempt timeout (Polly.Timeout)
        .HandleResult(r => r.StatusCode == HttpStatusCode.ServiceUnavailable
                        || r.StatusCode == HttpStatusCode.GatewayTimeout
                        || (int)r.StatusCode == 429),
});

Three things teams get wrong here:

  1. Retrying non-idempotent operations. A POST that charges a card must not be blindly retried; a timeout does not tell you whether the write committed. Use idempotency keys so a retry is a no-op server-side, and only then retry writes.
  2. Retrying non-transient errors. A 400, 401, or 404 will fail identically on retry – you are just adding load. Retry only 429, 503, 504, connection failures, and attempt timeouts.
  3. Ignoring Retry-After. When a server sends Retry-After (commonly with 429/503), it is telling you exactly when to come back. Respect it instead of your own backoff curve.
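The three rules above fit in a few lines of decision logic. This Python sketch is illustrative only: the function names are hypothetical, it handles just the delta-seconds form of Retry-After (the header may also carry an HTTP date), and connection failures or attempt timeouts, which have no status code, would need their own branch.

```python
from typing import Optional

RETRYABLE_STATUS = {429, 503, 504}
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def should_retry(method: str, status: int, has_idempotency_key: bool = False) -> bool:
    """Retry only transient statuses, and only when repeating the call is safe."""
    safe_to_repeat = method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
    return safe_to_repeat and status in RETRYABLE_STATUS


def next_delay(backoff_s: float, retry_after_header: Optional[str]) -> float:
    """Prefer the server's Retry-After (delta-seconds) over our own backoff."""
    if retry_after_header is not None and retry_after_header.strip().isdigit():
        return float(retry_after_header.strip())
    return backoff_s


print(should_retry("POST", 503))                             # unsafe write
print(should_retry("POST", 503, has_idempotency_key=True))   # safe to repeat
print(next_delay(0.4, "5"))                                  # server's hint wins
```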

Budget your retries. Pick a small MaxRetryAttempts and ensure the worst-case total time (sum of attempt timeouts plus backoffs) fits inside the overall deadline from Step 1. Three attempts with a 2s attempt timeout and exponential backoff can blow a 5s budget instantly – do the arithmetic, do not eyeball it.
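Doing that arithmetic takes four lines. The sketch below assumes the worst case is "every attempt runs to its timeout and every backoff takes its full un-jittered value"; that is an exact ceiling for full jitter, while other jitter schemes can occasionally exceed it, so treat the result as an estimate. Remember that MaxRetryAttempts counts retries, so three retries means four attempts.

```python
def worst_case_seconds(max_retries: int, attempt_timeout_s: float,
                       base_delay_s: float, cap_s: float = float("inf")) -> float:
    """Upper bound: every attempt times out and every backoff is at its maximum."""
    attempts = max_retries + 1                      # the first try is not a retry
    backoffs = sum(min(cap_s, base_delay_s * 2 ** i) for i in range(max_retries))
    return attempts * attempt_timeout_s + backoffs


# The retry options from this step: 3 retries, 2 s per attempt, 200 ms base.
total = worst_case_seconds(max_retries=3, attempt_timeout_s=2.0, base_delay_s=0.2)
print(f"worst case: {total:.1f}s")
```

With those numbers the worst case is four 2 s attempts plus 0.2 + 0.4 + 0.8 s of backoff, roughly 9.4 s, so an overall deadline much below 10 s would cut the last attempt short.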

Step 3 – Circuit breakers: states, thresholds, and half-open probing

Backoff slows an individual client down. A circuit breaker does something a single retry policy cannot: it lets a client stop calling a dependency entirely once that dependency is clearly broken, so the dependency gets breathing room to recover and the client fails fast instead of tying up resources.

A breaker is a three-state machine:

State | Behavior | Transition out
Closed | Calls flow normally; failures are counted | Failure rate exceeds threshold over the sampling window -> Open
Open | Calls fail immediately (fail fast); no traffic reaches the dependency | After BreakDuration elapses -> Half-Open
Half-Open | A limited number of trial calls are allowed through | Trials succeed -> Closed; any trial fails -> back to Open

The half-open state is the subtle, essential part. After the break duration, the breaker does not fling the floodgates open – it lets a small number of probe requests through. If they succeed, the dependency has recovered and the circuit closes. If they fail, it re-opens for another break duration. This prevents a half-recovered service from being instantly re-saturated the moment the timer expires.
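The state machine is small enough to read in one sitting. This is a teaching sketch in Python, not Polly's implementation: it is single-threaded, admits exactly one probe in half-open, and the class, its parameters, and the injectable clock are all my own assumptions.

```python
import time


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, failure_ratio=0.5, minimum_throughput=20,
                 sampling_s=30.0, break_s=15.0, now=time.monotonic):
        self.failure_ratio = failure_ratio
        self.minimum_throughput = minimum_throughput
        self.sampling_s = sampling_s
        self.break_s = break_s
        self._now = now
        self.state = self.CLOSED
        self._results = []           # (timestamp, succeeded) inside the window
        self._opened_at = 0.0

    def allow(self) -> bool:
        """True if a call may proceed; False means fail fast."""
        if self.state == self.OPEN:
            if self._now() - self._opened_at < self.break_s:
                return False
            self.state = self.HALF_OPEN       # break elapsed: admit one probe
            return True
        if self.state == self.HALF_OPEN:
            return False                      # a probe is already in flight
        return True

    def record(self, succeeded: bool) -> None:
        """Report the outcome of a call that allow() admitted."""
        now = self._now()
        if self.state == self.OPEN:
            return                            # late result from before the trip
        if self.state == self.HALF_OPEN:
            if succeeded:
                self.state, self._results = self.CLOSED, []
            else:
                self.state, self._opened_at = self.OPEN, now
            return
        # Closed: keep only results inside the rolling window, then evaluate.
        self._results = [(t, ok) for t, ok in self._results
                         if now - t <= self.sampling_s]
        self._results.append((now, succeeded))
        failures = sum(1 for _, ok in self._results if not ok)
        if (len(self._results) >= self.minimum_throughput
                and failures / len(self._results) >= self.failure_ratio):
            self.state, self._opened_at = self.OPEN, now
```

Callers wrap each request in allow() and record(); the only way from Open back to Closed is through a successful probe, never through the timer alone.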

Polly v8 uses a rate-based breaker over a rolling window, which is far better than counting raw failures (a count-based breaker trips on volume, not health):

builder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
{
    FailureRatio = 0.5,                          // open at 50% failures...
    MinimumThroughput = 20,                      // ...but only after >=20 calls
    SamplingDuration = TimeSpan.FromSeconds(30), // ...in this rolling window
    BreakDuration = TimeSpan.FromSeconds(15),    // stay open this long, then probe
    ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
        .Handle<HttpRequestException>()
        .Handle<TimeoutRejectedException>()      // count attempt timeouts as failures
        .HandleResult(r => (int)r.StatusCode >= 500),
});

MinimumThroughput is what stops a breaker from tripping on a single failure during a quiet period. Without it, one bad request out of two opens the circuit at a 50% ratio – statistical noise becomes an outage.

When the circuit is open, Polly throws BrokenCircuitException immediately. Your code must catch that and do something useful – serve a cached value, a default, or a clear degraded response – which is exactly the graceful degradation we wire up in Step 6.

Step 4 – Bulkhead isolation to contain blast radius

The name comes from ships: a hull is divided into watertight compartments so a breach in one does not sink the vessel. In software, a bulkhead caps the resources any one dependency can consume, so a slow dependency cannot starve the resources every other dependency needs.

The canonical failure this prevents: service A calls B (healthy) and C (hung). With a shared thread/connection pool, calls to C pile up and consume every thread, and now calls to B fail too – even though B is perfectly fine. Bulkheading gives C its own bounded compartment; when it fills, C calls are rejected fast while B keeps flowing.

In Polly v8 the classic Bulkhead policy was replaced by a concurrency limiter built on System.Threading.RateLimiting. A ConcurrencyLimiter caps simultaneous executions and bounds the waiting queue:

builder.AddConcurrencyLimiter(
    permitLimit: 50,    // at most 50 concurrent calls to this dependency
    queueLimit: 10);    // plus 10 may wait; beyond that, reject immediately

When the limiter is full, executions are rejected rather than queued without bound – that immediate rejection is the containment. The blast radius of a hung dependency is now permitLimit + queueLimit stuck calls, not your entire process.
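Stripped of the library, a bulkhead is a semaphore you refuse to wait on. A minimal Python sketch, assuming threads and no waiting queue (the equivalent of queueLimit: 0); the Bulkhead and BulkheadFull names are mine.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when the compartment has no free permit."""


class Bulkhead:
    def __init__(self, permit_limit: int):
        self._permits = threading.BoundedSemaphore(permit_limit)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: a full compartment rejects instead of queueing.
        if not self._permits.acquire(blocking=False):
            raise BulkheadFull("dependency compartment is full")
        try:
            yield
        finally:
            self._permits.release()


# One compartment per dependency: a hung C cannot take permits away from B.
bulkheads = {"B": Bulkhead(50), "C": Bulkhead(50)}
with bulkheads["B"].slot():
    pass  # the call to B would go here
```

The important design choice is the non-blocking acquire. A bulkhead that waits for a permit just moves the pile-up from the connection pool to the semaphore.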

At the infrastructure layer you get the same isolation declaratively. In a service mesh, an Istio DestinationRule bounds the connection pool per destination and ejects unhealthy hosts – a bulkhead and a breaker enforced by the sidecar, no application code involved:

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: inventory
spec:
  host: inventory.shop.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100           # bulkhead: cap TCP connections
      http:
        http2MaxRequests: 200         # bulkhead: cap concurrent requests
        maxRequestsPerConnection: 10
    outlierDetection:                 # passive circuit breaking per host
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Application-level (Polly) and infrastructure-level (mesh) controls are complementary, not redundant. The mesh protects you from misbehaving peers and gives uniform policy across languages; in-process policies protect a single service’s resources and can reason about request semantics (idempotency, fallbacks) that a sidecar cannot see.

Step 5 – Combining patterns into a resilience pipeline

Order matters enormously when you stack these. The pipeline executes outer to inner, and each layer wraps the next. The correct ordering for a single logical operation is:

Overall timeout            <- caps the whole operation (the deadline)
  └─ Retry                 <- repeats the inner attempt
       └─ Circuit breaker  <- sees each attempt; trips on aggregate failure
            └─ Bulkhead    <- bounds concurrency of the actual call
                 └─ Attempt timeout
                      └─ the HTTP/gRPC call

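The reasoning:

  1. The overall timeout is outermost so it caps everything beneath it, including every retry and every backoff wait.
  2. Retry wraps the breaker so each attempt passes through the breaker and is counted individually. Once the circuit opens, BrokenCircuitException is not in the retry predicate, so the operation fails fast instead of retrying against an open circuit.
  3. The breaker wraps the bulkhead so that calls rejected by an open circuit never take up a concurrency permit.
  4. The bulkhead wraps the attempt timeout so a permit is held only for the length of one bounded attempt; a hung call gives its slot back when the attempt times out.
  5. The attempt timeout is innermost so it bounds exactly one call on the wire.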

Polly v8 composes this fluently; strategies added first are outermost:

.AddResilienceHandler("inventory-pipeline", builder =>
{
    builder
        .AddTimeout(TimeSpan.FromSeconds(10))     // 1. overall deadline (outer)
        .AddRetry(retryOptions)                   // 2. retry the attempt below
        .AddCircuitBreaker(breakerOptions)        // 3. trip on aggregate failure
        .AddConcurrencyLimiter(50, 10)            // 4. bulkhead the call
        .AddTimeout(TimeSpan.FromSeconds(2));     // 5. per-attempt (inner)
});

Define this once per dependency profile (a “chatty cache” needs different numbers than a “slow report generator”) and reuse it. Resist the urge to put one global policy on everything – a breaker tuned for a high-volume service will never trip for a low-volume one, and a bulkhead sized for one will starve the other.

Step 6 – Load shedding and graceful degradation under saturation

Retries, breakers, and bulkheads handle a dependency failing. Load shedding handles you failing – being asked to do more than you can. When offered load exceeds capacity, accepting every request makes all of them slow and timed out; rejecting the excess keeps the rest fast. Shedding 20% cleanly beats failing 100% mushily.

Shed at the edge, cheaply, before you spend resources on work you will drop. The mesh can do this with a fault filter or rate limit, but the higher-leverage move is an admission/concurrency limit at the server: cap in-flight requests and return 503 with a Retry-After past that point, ideally prioritizing by request class so health checks and paying-customer traffic win over batch jobs. A 503 with Retry-After cooperates with the client backoff from Step 2 – the system self-regulates.
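A server-side admission limit with priority classes can be very small. The Python sketch below is a single-threaded illustration under my own assumptions (two request classes, a fixed Retry-After value, hypothetical names); a real server would guard the counter with a lock or use its framework's concurrency limiter.

```python
class AdmissionController:
    """Caps in-flight requests and keeps some headroom for critical traffic."""

    def __init__(self, max_in_flight, reserved_for_critical=0, retry_after_s=1):
        self.max_in_flight = max_in_flight
        self.reserved_for_critical = reserved_for_critical
        self.retry_after_s = retry_after_s
        self.in_flight = 0

    def try_admit(self, critical=False):
        """Admit the request (and count it) or return False to shed it."""
        limit = self.max_in_flight
        if not critical:
            limit -= self.reserved_for_critical   # batch traffic hits the wall first
        if self.in_flight >= limit:
            return False
        self.in_flight += 1
        return True

    def release(self):
        """Call when an admitted request finishes, success or failure."""
        self.in_flight -= 1

    def rejection(self):
        """Status and headers for a shed request."""
        return 503, {"Retry-After": str(self.retry_after_s)}


controller = AdmissionController(max_in_flight=100, reserved_for_critical=20)
if not controller.try_admit(critical=False):
    status, headers = controller.rejection()      # shed before doing any work
```

The check runs before any real work starts, so a shed request costs almost nothing, and the reserved headroom is what lets health checks keep passing while batch traffic is being turned away.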

Graceful degradation is the partner pattern: when a dependency is open-circuited or shed, return reduced-but-useful service instead of an error. The fallback lives in the BrokenCircuitException/rejection handler:

try
{
    // ExecuteAsync expects a ValueTask-returning callback; async/await adapts a Task.
    return await pipeline.ExecuteAsync(
        async ct => await _client.GetRecommendationsAsync(userId, ct), cancellationToken);
}
catch (Exception ex) when (
    ex is BrokenCircuitException or TimeoutRejectedException
       or RateLimiterRejectedException)
{
    // Degrade, do not fail: stale cache, generic results, or an empty-but-valid
    // response. The page renders; the recommendations are simply less personal.
    _logger.LogWarning(ex, "recommendations degraded for {UserId}", userId);
    return await _cache.GetLastKnownGoodAsync(userId)
           ?? Recommendations.Generic;
}

Decide per call site whether a dependency is critical or optional. The cart service failing should fail checkout; the recommendations service failing should never fail checkout. Encoding that judgment in fallbacks is what separates a resilient system from one that is merely instrumented.

Step 7 – Testing with fault injection and steady-state hypotheses

Untested resilience code is a liability – it is extra code in the critical path that has never run, and the first time it does will be during a real incident. Chaos engineering flips that: you inject the failure on purpose, in a controlled way, and verify the system holds.

The discipline is to state a steady-state hypothesis first – a measurable definition of “healthy” – then inject a fault and assert the hypothesis still holds. For example: “p99 latency stays under 300 ms and the success rate stays above 99% even when the inventory service injects 2s of latency into 50% of responses.”
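A hypothesis is only useful if it is checkable by a machine. Here is a rough Python sketch of the example above as an assertion you could run against samples collected during the experiment; the nearest-rank percentile and the function names are my choices, and the thresholds are the ones from the example.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of samples."""
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)      # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def steady_state_holds(latencies_ms, successes, total,
                       max_p99_ms=300.0, min_success_rate=0.99):
    """True if p99 stays under the limit and success rate stays above the floor."""
    return (p99(latencies_ms) < max_p99_ms
            and successes / total > min_success_rate)


# One slow outlier in 100 samples does not move the nearest-rank p99.
samples = [100] * 99 + [2000]
print(steady_state_holds(samples, successes=995, total=1000))
```

Run the same check before, during, and after the fault window; a hypothesis that fails before injection means your baseline is already unhealthy and the experiment should not start.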

On Kubernetes, Chaos Mesh injects exactly these faults declaratively. Start small (a short duration, a fraction of pods) and watch your SLO dashboards:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: inventory-latency
  namespace: shop
spec:
  action: delay
  mode: fixed-percent
  value: "50"                 # affect 50% of matching pods
  selector:
    namespaces: [shop]
    labelSelectors:
      app: inventory
  delay:
    latency: "2s"
    jitter: "200ms"
  duration: "120s"

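Inject faults that map to each pattern you built:

  1. Added latency (the manifest above): attempt timeouts should fire and the overall deadline should still hold.
  2. Intermittent 503s: retries with backoff should absorb them without multiplying load.
  3. A sustained outage: the breaker should open, fail fast, probe in half-open, and close once the dependency recovers.
  4. One hung dependency: its bulkhead should fill and reject while calls to other dependencies keep flowing.
  5. Overload of your own service: load shedding should return 503 with Retry-After instead of slow responses.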

Run the experiment in production-like staging first, with a tight duration and a kill switch. The point is not to break things; it is to discover whether you are already broken before a customer does. A hypothesis that fails in a controlled experiment is a gift – it is the bug you did not ship.

Enterprise scenario

A payments platform I worked with ran a Black Friday postmortem after checkout latency spiked to 9s for twenty minutes. The trigger was mundane: their fraud-scoring vendor’s API degraded to ~3s responses. The amplifier was their own resilience code. Each checkout pod used a shared HttpClient with a Polly retry of 3 and a fixed 500ms delay, and the same pool also fed the inventory and tax services. Slow fraud calls consumed every connection; healthy inventory calls then starved and timed out, so the breaker for inventory tripped on a dependency that was perfectly fine. Classic shared-pool bulkhead failure plus a retry storm against an already-slow vendor.

The constraint they could not change: the fraud vendor had no idempotency contract and a hard 100-RPS contractual cap, so blind retries risked both double-charges and a 429 ban during peak revenue. The fix was a per-dependency pipeline with its own concurrency compartment, exponential backoff with jitter, and a fail-fast breaker – and critically, fraud was reclassified as optional with a documented fallback to a synchronous rules engine.

services.AddHttpClient("fraud", c => c.BaseAddress = new Uri("https://fraud.vendor"))
    .AddResilienceHandler("fraud", b => b
        .AddTimeout(TimeSpan.FromSeconds(4))                 // overall deadline
        .AddRetry(new HttpRetryStrategyOptions {
            MaxRetryAttempts = 2, UseJitter = true,
            Delay = TimeSpan.FromMilliseconds(200),          // explicit base; keep backoff small vs. the 4s deadline
            BackoffType = DelayBackoffType.Exponential })
        .AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions {
            FailureRatio = 0.5, MinimumThroughput = 20,
            BreakDuration = TimeSpan.FromSeconds(20) })
        .AddConcurrencyLimiter(permitLimit: 80, queueLimit: 0) // isolate; caps in-flight calls per pod, not RPS
        .AddTimeout(TimeSpan.FromMilliseconds(1500)));         // per-attempt

When the breaker opened, checkout fell back to local rules and approved low-risk carts. p99 checkout latency held under 600ms through the next peak.

Verify

After wiring the pipeline, confirm each control actually engages – do not assume the config is honored.

# Polly emits telemetry per strategy. Confirm the meter is registered and
# that retry/breaker/timeout events appear when you inject faults.
# (Polly's telemetry comes from Polly.Extensions, which publishes a "Polly" Meter;
# pipelines registered with AddResilienceHandler have it enabled.)
dotnet-counters monitor --process-id <pid> --counters Polly

# Drive load and fault injection, then watch the breaker open and recover.
# A correctly tuned breaker: Closed -> Open (on fault) -> Half-Open -> Closed.

# In a mesh, confirm Envoy is enforcing the DestinationRule and ejecting hosts.
# The DestinationRule is applied by the calling workload's sidecar, so query the
# caller's proxy (placeholder below), not the inventory pod's.
kubectl exec deploy/<caller-deployment> -c istio-proxy -n shop -- \
  pilot-agent request GET stats | grep -E "circuit_breakers|outlier|rq_pending"

# Confirm outlier detection actually ejected an unhealthy endpoint.
kubectl exec deploy/<caller-deployment> -c istio-proxy -n shop -- \
  pilot-agent request GET stats | grep outlier_detection.ejections_active

A healthy result: under injected latency, attempt timeouts fire and the breaker transitions to Open (Envoy’s ejections_active is non-zero on the caller’s sidecar, or Polly’s telemetry records the circuit opening); load shedding returns 503 with Retry-After rather than slow 200s; and your steady-state SLO dashboard stays green throughout the experiment. If the breaker never opens under a 100%-fault injection, your thresholds are unreachable – re-check MinimumThroughput against your test traffic volume.

Checklist

Pitfalls and next steps

The failure I see most is retrying at every layer. If the HTTP client retries, the service-mesh sidecar retries, and the caller retries, three attempts become twenty-seven, and your “resilience” is now a load multiplier. Pick exactly one layer to own retries for a given call and make the others fail fast. The second most common: a breaker that can never trip because MinimumThroughput is higher than the service ever sees, or a bulkhead sized at infinity because nobody measured real concurrency. Both render the control decorative.
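The twenty-seven is just multiplication: each layer repeats everything beneath it. A tiny Python sketch, with a hypothetical function name, that you can point at your own stack:

```python
import math


def worst_case_calls(attempts_per_layer):
    """Calls that can reach the dependency when every layer exhausts its attempts."""
    return math.prod(attempts_per_layer)


print(worst_case_calls([3, 3, 3]))   # caller, HTTP client, and sidecar all retry
print(worst_case_calls([3, 1, 1]))   # one layer owns retries, the rest fail fast
```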

Beyond that, the highest-leverage next step is to make resilience observable and rehearsed. Emit the circuit-breaker state, retry counts, bulkhead rejections, and shed-request counts as first-class metrics; alert when a breaker stays open or rejections spike. Then run fault-injection drills on a schedule, not once – systems drift, dependencies change, and a steady-state hypothesis that held last quarter is just an assumption today. Resilience is not a library you install; it is a property you continuously verify.
