The Reliability Pillar in Practice: From SLOs to Self-Healing

Most “reliability” work I review is redundancy bought on instinct: dual regions for a workload that tolerates an hour of downtime, a single-zone database under a system promising four nines. The Well-Architected reliability pillar is only useful when you reverse that flow and let numbers drive the design. This is the end-to-end process I use to take a business SLA down to an error budget, decide redundancy per tier, and build self-healing that closes the loop without paging a human.

From SLA to SLO to error budget: the numbers that drive design

Three terms get used interchangeably and they are not the same thing.

The error budget is the inverse of the SLO and it is the single most useful number you will produce. If your availability SLO is 99.9% over 30 days, the allowed downtime is the budget:

SLO      Downtime / 30 days   Downtime / year
99%      ~7h 12m              ~3d 15h
99.9%    ~43m 12s             ~8h 45m
99.95%   ~21m 36s             ~4h 22m
99.99%   ~4m 19s              ~52m 35s
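
The budget arithmetic is simple enough to script rather than look up. A minimal sketch, assuming the SLO is expressed as a fraction and the window is a fixed number of days; the function name `error_budget` is mine, not part of any Azure tooling:

```python
from datetime import timedelta


def error_budget(slo: float, window_days: float = 30.0) -> timedelta:
    """Allowed downtime for an availability SLO (a fraction such as 0.999)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    # Budget = (1 - SLO) * window length
    return timedelta(days=window_days) * (1.0 - slo)


for slo in (0.99, 0.999, 0.9995, 0.9999):
    print(f"{slo:.2%}  30d={error_budget(slo)}  365d={error_budget(slo, 365)}")
```

Changing `window_days` is the whole point: the same SLO yields a different budget for a 28-day rolling window than for a calendar quarter, so state the window whenever you quote the number.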

The budget is a permission slip, not just a threshold. If you have spent only 10% of it this month, you ship faster and take more risk. If you have burned 90% by day ten, you freeze risky changes and the on-call gets to push back on the next feature flag. That cultural contract is the actual point of the reliability pillar; everything below is mechanism.

Pick the SLO per critical journey, not per service. A 99.99% target on a background report exporter is wasted money. Be honest about what the business will actually pay to protect.

Step 1 — Map the critical user journey and its dependencies

You cannot set an SLO for “the platform.” You set it for a journey: user logs in, adds item to cart, checks out. Write the journey as an ordered list of hops and the hard dependency at each hop.

[Client] -> Front Door (global) -> App Gateway (regional WAF)
         -> AKS ingress -> checkout-api (pod)
         -> Redis (session)        [hard]
         -> SQL (orders)           [hard]
         -> payment-provider (3rd) [hard, external]
         -> Service Bus (events)   [soft, async]

Classify each dependency as hard (journey fails without it) or soft (journey degrades gracefully). This drives both your redundancy spend and your composite availability math. Series dependencies multiply: if checkout depends on five hard components each at 99.9%, the ceiling is 0.999^5 = ~99.5%, already below a 99.9% target. That single calculation usually kills the “we’ll just promise four nines” conversation. Your options: reduce hard dependencies in the path, make some redundant, or convert hard to soft with caching and queues.
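
The composite math is worth keeping in a scratch script so you can test "what if we make this tier redundant" before spending money. A rough sketch, assuming component failures are independent (they rarely are in full, so treat the result as an upper bound); `series` and `redundant` are illustrative helper names:

```python
from math import prod


def series(*availabilities: float) -> float:
    """Hard dependencies in series: every one must be up."""
    return prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """N independent copies: the tier is down only if all copies are down."""
    return 1.0 - (1.0 - availability) ** copies


# Five hard dependencies, each at 99.9%
ceiling = series(*[0.999] * 5)
print(f"five hard deps at 99.9%: {ceiling:.4%}")

# Same path, but two of the five tiers run as two independent copies
improved = series(0.999, 0.999, 0.999, redundant(0.999, 2), redundant(0.999, 2))
print(f"two tiers doubled:       {improved:.4%}")
```

Converting a hard dependency to soft has the same effect as removing its term from `series`, which is why caching and queues often beat buying more replicas.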

Step 2 — FMEA for each component

Failure Mode and Effects Analysis is a structured table. For every component in the journey, enumerate how it fails, the effect, how you detect it, and a Risk Priority Number (RPN = Severity x Likelihood x Detectability, each scored 1-10, higher is worse). Sort by RPN and you have a prioritized work list instead of a vibe.

Component         Failure mode       Effect                  Detection                  S  L  D  RPN
SQL primary       Zone outage        Writes fail             Failover group health      9  3  2   54
Redis             Node eviction      Session loss, re-login  Cache miss spike           5  4  4   80
checkout-api      Memory leak / OOM  Pod restarts, 5xx       Liveness probe + RPS drop  7  5  3  105
payment-provider  API timeout        Checkout stalls         Synthetic + error rate     9  4  5  180

The high-RPN rows here are the external payment provider (high detectability score because a slow timeout is hard to catch fast) and the API OOM. That tells you where to invest: a circuit breaker and synthetic probe on payments, tighter memory limits and a fast liveness probe on the API. High detectability scores are the cheapest RPN to lower: a synthetic check that fires in 30 seconds instead of waiting for customer reports drops D from 5 to 2 with almost no infra cost.
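
If the FMEA lives in a spreadsheet, the ranking is a sort; if it lives in code, you can re-score a row and see the effect immediately. A small sketch using the four rows from the table above; the `FailureMode` class is a hypothetical structure for illustration:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FailureMode:
    component: str
    mode: str
    severity: int       # 1-10, higher is worse
    likelihood: int     # 1-10, higher is worse
    detectability: int  # 1-10, higher means harder to detect

    @property
    def rpn(self) -> int:
        return self.severity * self.likelihood * self.detectability


rows = [
    FailureMode("SQL primary", "Zone outage", 9, 3, 2),
    FailureMode("Redis", "Node eviction", 5, 4, 4),
    FailureMode("checkout-api", "Memory leak / OOM", 7, 5, 3),
    FailureMode("payment-provider", "API timeout", 9, 4, 5),
]

# Highest RPN first: this is the prioritized work list
ranked = sorted(rows, key=lambda r: r.rpn, reverse=True)
for r in ranked:
    print(f"{r.rpn:>4}  {r.component}: {r.mode}")

# Re-score the payment row after adding a fast synthetic probe (D: 5 -> 2)
payment_after = replace(rows[3], detectability=2)
print("payment-provider RPN after probe:", payment_after.rpn)
```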

Step 3 — Choose redundancy per tier

Now spend money where the math and the FMEA agree, and only there. Azure gives you three escalating levels, each with a real cost and complexity step.

Match the tier to its data gravity. Stateless front ends go zone-redundant trivially. Stateful tiers are where the engineering lives.

# Zone-redundant AKS system node pool across 3 zones
resource "azurerm_kubernetes_cluster" "this" {
  name                = "aks-checkout-prod"
  location            = "eastus2"
  resource_group_name = azurerm_resource_group.this.name
  dns_prefix          = "checkout"

  default_node_pool {
    name                 = "system"
    vm_size              = "Standard_D4s_v5"
    zones                = [1, 2, 3]
    auto_scaling_enabled = true
    min_count            = 3
    max_count            = 9
  }

  identity {
    type = "SystemAssigned"
  }
}

For the database tier, zone redundancy and cross-region failover are explicit choices. Azure SQL zone-redundant general purpose with a geo failover group:

# Zone-redundant Azure SQL database on the General Purpose tier (zone redundancy is also offered on Business Critical and Hyperscale)
az sql db create \
  --resource-group rg-checkout-prod \
  --server sql-checkout-prod \
  --name orders \
  --edition GeneralPurpose \
  --compute-model Provisioned \
  --family Gen5 --capacity 4 \
  --zone-redundant true

# Auto-failover group for cross-region DR (async geo-replication)
az sql failover-group create \
  --name fg-checkout \
  --resource-group rg-checkout-prod \
  --server sql-checkout-prod \
  --partner-server sql-checkout-dr \
  --failover-policy Automatic \
  --grace-period 1 \
  --add-db orders

Failover groups replicate asynchronously, so an automatic failover can lose the last few seconds of committed transactions. That is a non-zero RPO. If your journey cannot tolerate it, you need synchronous replication (Business Critical with zone redundancy stays in-region and synchronous) and you accept that you cannot have synchronous cross-region. Pick one; you cannot have zero RPO and survive a region loss simultaneously.

Step 4 — Instrument health models and the four golden signals

Redundancy is useless if you cannot tell when a replica is sick. Instrument the four golden signals for every tier:

  1. Latency — split successful vs failed request latency; a fast 500 looks great on a naive average.
  2. Traffic — requests/sec, the denominator for everything else.
  3. Errors — rate of 5xx and explicit failures.
  4. Saturation — how full the constrained resource is (CPU, memory, connection pool, queue depth).

Then build a health model: a layered rollup that maps raw signals to “is this journey healthy.” Resource health rolls into component health, which rolls into journey health. The win is that an alert fires on journey degradation, not on a single CPU spike that self-corrected.
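
One way to express the rollup is a worst-of reduction in which soft dependencies can degrade a parent but never mark it unhealthy. This is a sketch of that idea under my own assumptions (three states, a hard/soft flag per child), not the schema of any Azure health model feature:

```python
from enum import IntEnum
from typing import Iterable, Tuple


class Health(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def rollup(children: Iterable[Tuple[Health, bool]]) -> Health:
    """Roll child states up to a parent. Each child is (state, is_hard)."""
    worst = Health.HEALTHY
    for state, is_hard in children:
        # A soft dependency is capped at DEGRADED from the parent's view
        effective = state if is_hard else min(state, Health.DEGRADED)
        worst = max(worst, effective)
    return worst


# Resource -> component
sql = rollup([(Health.HEALTHY, True), (Health.DEGRADED, True)])
events = rollup([(Health.UNHEALTHY, True)])  # Service Bus is down

# Component -> journey: SQL is hard, events are soft (async)
checkout = rollup([(sql, True), (events, False)])
print(checkout.name)
```

Alert on the journey-level state and keep the lower layers for diagnosis; that is what stops a self-correcting resource blip from paging anyone.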

The most important alert you will write is a burn-rate alert on the error budget, not a static threshold. A multi-window burn-rate alert pages when you are consuming budget fast enough to exhaust it well before the window closes, and stays quiet on a brief blip. Here is the SLO and a fast-burn alert as a Prometheus recording/alerting rule:

groups:
- name: checkout-slo
  rules:
  # SLI, expressed as an error ratio: fraction of failed (5xx) requests over 5m and 1h windows
  - record: job:http_req_error_ratio:rate5m
    expr: |
      sum(rate(http_requests_total{job="checkout-api",code=~"5.."}[5m]))
        / sum(rate(http_requests_total{job="checkout-api"}[5m]))
  - record: job:http_req_error_ratio:rate1h
    expr: |
      sum(rate(http_requests_total{job="checkout-api",code=~"5.."}[1h]))
        / sum(rate(http_requests_total{job="checkout-api"}[1h]))
  # Fast burn: >14.4x budget consumption confirmed on two windows.
  # For a 99.9% SLO the budget is 0.001; 14.4 * 0.001 = 0.0144.
  - alert: CheckoutErrorBudgetFastBurn
    expr: |
      job:http_req_error_ratio:rate5m > (14.4 * 0.001)
      and
      job:http_req_error_ratio:rate1h > (14.4 * 0.001)
    for: 2m
    labels: { severity: page }
    annotations:
      summary: "Checkout burning error budget at >14.4x; page on-call."

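The two-window AND is what kills false pages: the 1h window confirms the burn is sustained rather than a momentary spike, and the 5m window confirms it is still happening, so the alert clears soon after recovery. The 14.4x factor is the standard fast-burn multiplier; held for one hour, it consumes 2% of a 30-day budget (14.4 x 1h / 720h).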
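
The multiplier is easier to trust once you have derived it yourself. A short sketch of the burn-rate arithmetic, assuming a 30-day SLO period; the function names are illustrative:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo)


def budget_consumed(rate: float, window_hours: float, period_days: float = 30.0) -> float:
    """Fraction of the period's budget spent by holding `rate` for `window_hours`."""
    return rate * window_hours / (period_days * 24.0)


def hours_to_exhaustion(rate: float, period_days: float = 30.0) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    return period_days * 24.0 / rate


print(f"burn rate at 1.44% errors, 99.9% SLO: {burn_rate(0.0144, 0.999):.1f}x")
print(f"budget spent in 1h at 14.4x: {budget_consumed(14.4, 1.0):.1%}")
print(f"hours to exhaust at 14.4x:   {hours_to_exhaustion(14.4):.0f}")
```

The same functions let you pick a slow-burn pair: decide what fraction of budget you are willing to lose before a ticket is raised, then solve for the rate over a longer window.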

Step 5 — Build self-healing with probes, autoscale, and auto-restart

Self-healing means the platform recovers from common, well-understood failures (the low-RPN-after-detection rows from your FMEA) without a human. Three mechanisms cover most of it.

Health probes that actually gate traffic. On Kubernetes, separate readiness (should I get traffic?) from liveness (should I be killed and restarted?) and add a startup probe for slow boots so liveness does not kill a pod mid-startup. A dangerous mistake is pointing liveness at a deep check that touches the database; a DB blip then restart-storms every pod. Keep liveness shallow and local.

spec:
  containers:
  - name: checkout-api
    startupProbe:        # don't let liveness fire until the app has booted
      httpGet: { path: /healthz, port: 8080 }
      failureThreshold: 30
      periodSeconds: 5
    livenessProbe:       # shallow: is the process wedged? restart if so
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:      # deep-ish: can I serve? pull from LB if not
      httpGet: { path: /readyz, port: 8080 }
      periodSeconds: 5
      failureThreshold: 2
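
On the application side, the split looks roughly like this. A minimal sketch using only the Python standard library; the `/healthz` and `/readyz` paths match the manifest above, while `dependency_checks` and the handler class are hypothetical names:

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable, Mapping


def liveness_status() -> int:
    """Shallow and local: if this code runs, the process is not wedged."""
    return 200


def readiness_status(checks: Mapping[str, Callable[[], bool]]) -> int:
    """Deeper: report 503 if any dependency check fails or raises."""
    for check in checks.values():
        try:
            if not check():
                return 503
        except Exception:
            return 503
    return 200


class ProbeHandler(BaseHTTPRequestHandler):
    # Assumption: populated at startup with cheap, time-bounded checks
    dependency_checks: Mapping[str, Callable[[], bool]] = {}

    def do_GET(self) -> None:
        if self.path == "/healthz":
            status = liveness_status()
        elif self.path == "/readyz":
            status = readiness_status(self.dependency_checks)
        else:
            status = 404
        self.send_response(status)
        self.end_headers()
```

To try it locally, serve the handler with `http.server.HTTPServer(("", 8080), ProbeHandler).serve_forever()`. Note that a failing database check only flips readiness, so the pod leaves the load balancer but is not restarted.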

Autoscale on the saturation signal that actually constrains you. CPU is the lazy default and is often wrong. If the bottleneck is queue depth or in-flight requests, scale on that via KEDA. A Service Bus consumer scaled on queue length:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: checkout-worker
spec:
  scaleTargetRef:
    name: checkout-worker
  minReplicaCount: 2          # never scale to zero on a hot path
  maxReplicaCount: 30
  triggers:
  - type: azure-servicebus
    metadata:
      queueName: orders
      messageCount: "20"      # target ~20 active (queued) messages per replica
    authenticationRef:
      name: keda-sb-auth

Auto-restart and platform-level recovery. For PaaS, lean on the platform’s own health gate so it stops routing to unhealthy instances and recycles them. Azure App Service health check:

az webapp config set \
  --resource-group rg-checkout-prod \
  --name app-checkout-prod \
  --generic-configurations '{"healthCheckPath": "/healthz"}'

Always pair auto-restart with a backoff and a circuit breaker on outbound calls. Self-healing without a breaker just turns a dependency outage into a retry storm that takes down the dependency harder. The healing logic and the retry/backoff policy are two halves of the same control loop.
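
Both halves of that loop fit in a few dozen lines. The following is a deliberately small sketch, not a production library: a consecutive-failure circuit breaker with a half-open trial call, plus full-jitter exponential backoff. All names and defaults are assumptions chosen for illustration:

```python
import random
import time
from typing import Callable, List, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result


def backoff_delays(retries: int, base: float = 0.2, cap: float = 10.0,
                   rng: Callable[[], float] = random.random) -> List[float]:
    """Full-jitter backoff: delay n is uniform in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(retries)]
```

Wrap the outbound payment call in `breaker.call(...)` and sleep for the next value from `backoff_delays` between retries; when the breaker is open, skip the retries entirely and serve the degraded path.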

Step 6 — Validate with fault injection and a reliability scorecard

A redundancy design you have never failed over is a hypothesis, not a control. Inject the failures from your FMEA and watch the health model and self-healing respond. Azure Chaos Studio lets you run faults as a managed experiment; you can also kill pods directly to test probe behavior.

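# Kill one checkout pod and confirm readiness pulls it from the LB
# and the Deployment reschedules without a journey-level SLO breach.
# (A bare label selector would delete every matching pod at once.)
POD=$(kubectl get pods -n checkout -l app=checkout-api \
  -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD" -n checkout --grace-period=0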

# Start a defined Chaos Studio experiment (e.g. NSG block to SQL, AKS node shutdown)
az rest --method post \
  --url "https://management.azure.com/subscriptions/<sub>/resourceGroups/rg-checkout-prod/providers/Microsoft.Chaos/experiments/exp-zone-failover/start?api-version=2024-01-01"

Run each experiment in non-prod first, then a controlled prod game day with the error budget watched live. The pass criterion is not “it recovered” but “it recovered within the RTO and inside the error budget.”

Capture the outcome in a reliability scorecard per journey so this is auditable and trends over time:

Dimension                  Target    Last result  Status
Availability SLO (30d)     99.95%    99.97%       Pass
Error budget remaining     > 25%     61%          Pass
Zone-failover RTO          < 5 min   3m 40s       Pass
Cross-region RPO           < 60 s    22 s         Pass
Liveness restart recovery  < 60 s    18 s         Pass
Untested FMEA rows         0         2            Fail
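
Keeping the scorecard as data means the Status column is computed, not typed. A sketch using the values from the table above, converted to plain numbers (percent, seconds, counts); the comparison chosen for each row (for example, "at least" for the availability target) is my assumption:

```python
import operator
from typing import Callable, NamedTuple


class Row(NamedTuple):
    dimension: str
    passes: Callable[[float, float], bool]  # called as passes(result, target)
    target: float
    result: float

    @property
    def status(self) -> str:
        return "Pass" if self.passes(self.result, self.target) else "Fail"


scorecard = [
    Row("Availability SLO (30d), %", operator.ge, 99.95, 99.97),
    Row("Error budget remaining, %", operator.gt, 25, 61),
    Row("Zone-failover RTO, s", operator.lt, 300, 220),
    Row("Cross-region RPO, s", operator.lt, 60, 22),
    Row("Liveness restart recovery, s", operator.lt, 60, 18),
    Row("Untested FMEA rows", operator.le, 0, 2),
]

for row in scorecard:
    print(f"{row.dimension:<30} {row.status}")

# The journey passes only if every dimension passes
journey_passes = all(row.status == "Pass" for row in scorecard)
print("journey:", "Pass" if journey_passes else "Fail")
```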

Enterprise scenario

A fintech I worked with ran checkout on zone-redundant Azure SQL Business Critical with an auto-failover group to a paired region, and a multi-window burn-rate alert wired exactly as above. On paper, four nines. Then a regional networking event degraded the primary’s storage latency without taking it fully offline. The failover group’s automatic policy never tripped, because Microsoft’s health probe only fails over on a hard outage, not a brownout. Writes were timing out at the app, the fast-burn alert was screaming, and the database “looked healthy” to Azure.

The gotcha: automatic failover groups give you DR for a region loss, not for a sick-but-alive primary. The grace period (--grace-period 1, measured in hours) only governs how long Azure tolerates an outage before flipping, not slowness. We were one console click from a manual failover but hesitated because a forced cross-region failover over asynchronous replication could lose committed transactions.

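The fix was a manual failover that respected RPO. Because Business Critical replicates synchronously in-region, we failed over to the local secondary replica first (zero data loss), not cross-region:

# Manual failover to the in-region synchronous secondary (no data loss).
# Uses the Microsoft.Sql "Databases - Failover" REST operation; the api-version
# shown is an assumption, so check the current REST reference before use.
# Note: az sql failover-group set-primary would instead promote the partner
# server (sql-checkout-dr), which is a cross-region failover.
az rest --method post \
  --url "https://management.azure.com/subscriptions/<sub>/resourceGroups/rg-checkout-prod/providers/Microsoft.Sql/servers/sql-checkout-prod/databases/orders/failover?api-version=2021-11-01"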

Recovery was 90 seconds. The lasting change: we added a synthetic write-latency probe feeding the same burn-rate logic, and a runbook that makes the manual-failover decision explicit when the budget burns but Azure reports healthy. Automation handles the binary failures; the brownout is where a human still earns their pager.
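
The probe logic is what makes a brownout visible: a write that succeeds but takes too long counts as a bad event. A sketch of how such samples could feed the same fast-burn test, assuming a 0.5 s latency threshold and a 99.9% SLO (both illustrative values, as are the function names):

```python
from typing import List, Tuple

Sample = Tuple[bool, float]  # (write succeeded, latency in seconds)


def write_probe_error_ratio(samples: List[Sample], latency_threshold_s: float) -> float:
    """Fraction of probe writes that failed outright or were too slow."""
    if not samples:
        return 0.0
    bad = sum(
        1 for succeeded, latency_s in samples
        if not succeeded or latency_s > latency_threshold_s
    )
    return bad / len(samples)


def is_fast_burn(error_ratio: float, slo: float = 0.999, factor: float = 14.4) -> bool:
    """Same threshold as the request-based alert: ratio > factor * budget."""
    return error_ratio > factor * (1.0 - slo)


# Brownout: every write "succeeds", but most are far too slow
brownout = [(True, 2.4), (True, 3.1), (True, 0.2), (True, 2.9)]
ratio = write_probe_error_ratio(brownout, latency_threshold_s=0.5)
print(f"bad ratio={ratio:.2f} fast_burn={is_fast_burn(ratio)}")
```

A status-code-only SLI would have scored that sample set as 100% good, which is exactly how the database "looked healthy" during the incident.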

Verify

Confirm the pieces are real and wired, not just declared.

# 1. Probes are configured and pods are actually Ready
kubectl get pods -n checkout -o wide
kubectl get deploy checkout-api -n checkout \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'; echo

# 2. KEDA is scaling on the queue, not idling at min
kubectl get scaledobject,hpa -n checkout

# 3. SQL really is zone-redundant and the failover group exists
az sql db show -g rg-checkout-prod -s sql-checkout-prod -n orders \
  --query "{name:name, zoneRedundant:zoneRedundant}" -o table
az sql failover-group show -g rg-checkout-prod -s sql-checkout-prod \
  -n fg-checkout --query "{state:replicationState, role:replicationRole}" -o table

# 4. The burn-rate alert rule loaded in Prometheus
curl -s http://prometheus:9090/api/v1/rules \
  | jq '.data.groups[].rules[] | select(.name=="CheckoutErrorBudgetFastBurn")'

Reliability readiness checklist

Pitfalls

Start with one journey, get its SLO, error budget, and burn-rate alert real, then walk Steps 2 through 6 for it before you template the pattern across the estate. Reliability is a per-journey discipline; the platform-wide version is just the same loop, run many times.
