SLOs and Error Budgets in Practice: Defining SLIs and Building Multi-Window Burn-Rate Alerts

Most alerting catalogs are a museum of every incident the team has ever had: one rule per scar, each with a threshold someone picked at 3am and nobody has touched since. The result is pages that fire when nothing is wrong and stay silent when something is. SLOs replace that with a single question - are we spending reliability faster than we agreed users would tolerate? - and burn-rate alerts answer it with arithmetic instead of vibes. This guide takes you from picking an SLI through shipping multi-window multi-burn-rate alerts that page on real risk.

1. From symptom to SLI: measure what users feel

A Service Level Indicator is a ratio of good events to valid events. The discipline is in the nouns. An SLI is not “CPU is below 80%” or “the pod is Ready” - those are causes, and users do not experience causes. They experience requests that succeed or fail, and requests that are fast or slow.

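Two indicators cover the vast majority of request-driven services: availability (the proportion of valid requests that succeed) and latency (the proportion of valid requests served faster than a threshold).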

Three decisions make or break an SLI:

  1. Where you measure. Measure as close to the user as you can while still owning the signal. The load balancer or ingress sees what the client experiences; the application’s own histogram sees only requests that arrived and were routed. Prefer the edge for the user-facing SLO, and keep deeper signals for debugging.
  2. What “valid” excludes. A 400 is the client’s fault, not yours. Either drop 4xx responses from the valid set entirely (out of both numerator and denominator) or count them as good, but never count them as bad, or you will burn budget for malformed client requests. Health-check and synthetic traffic should be excluded entirely; they inflate the denominator and hide real user pain.
  3. Threshold, not average. Never build a latency SLI on a mean. Averages hide the tail where users actually suffer. Define latency as “the proportion of requests faster than T”, which is a count-based ratio you can compute exactly, not a quantile you estimate.

A good SLI moves when users are unhappy and stays flat when they are fine. If you can imagine a scenario where the SLI looks healthy while users are suffering - or the reverse - you have the wrong indicator. Fix the indicator before you argue about the target.
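The good / valid ratio and the exclusion rules above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the Request record, its field names, and the sample data are hypothetical, and it assumes 4xx and synthetic traffic are dropped from the valid set and that a request is "good" for latency purely on speed.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int               # HTTP status code
    duration_s: float         # time to serve, in seconds
    synthetic: bool = False   # health check or probe traffic


def is_valid(r: Request) -> bool:
    # Assumption: synthetic traffic and client errors (4xx) are not valid events.
    return not r.synthetic and not (400 <= r.status < 500)


def availability_sli(requests):
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None  # no valid events: the SLI is undefined, not 100%
    good = sum(1 for r in valid if r.status < 500)
    return good / len(valid)


def latency_sli(requests, threshold_s):
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None
    # Count-based: proportion of valid requests at or under the threshold.
    good = sum(1 for r in valid if r.duration_s <= threshold_s)
    return good / len(valid)


sample = [
    Request(200, 0.12),
    Request(200, 0.45),
    Request(503, 0.05),
    Request(404, 0.02),                  # client error: excluded
    Request(200, 0.01, synthetic=True),  # health check: excluded
    Request(200, 0.20),
]
print(availability_sli(sample))
print(latency_sli(sample, 0.3))
```

Four of the six sample requests are valid; the 404 and the health check never reach either ratio, so they can neither burn budget nor dilute real failures.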

2. SLO targets and the budget that follows

The SLO is the target you commit to over a rolling window (28 or 30 days is standard). The error budget is its complement: budget = 1 - SLO. That tiny subtraction is the entire point, because it converts a percentage into a quantity of allowed failure that you can spend, track, and run out of.

The arithmetic of nines, over a 30-day (43,200-minute) window:

SLO      Error budget   Allowed downtime / 30d   Allowed bad fraction
99%      1%             ~7h 12m                  1 in 100
99.5%    0.5%           ~3h 36m                  1 in 200
99.9%    0.1%           ~43m                     1 in 1000
99.95%   0.05%          ~21m 36s                 1 in 2000
99.99%   0.01%          ~4m 19s                  1 in 10000
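The downtime column is just the budget multiplied by the window length. A short sketch reproduces it, assuming a 30-day window and treating the whole budget as full downtime (function names are illustrative):

```python
from datetime import timedelta

WINDOW = timedelta(days=30)  # 43,200 minutes


def error_budget(slo: float) -> float:
    # budget = 1 - SLO
    return 1.0 - slo


def allowed_downtime(slo: float, window: timedelta = WINDOW) -> timedelta:
    # Rounded to whole seconds to absorb floating-point noise.
    return timedelta(seconds=round(window.total_seconds() * error_budget(slo)))


for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(f"{slo:.2%}  {allowed_downtime(slo)}")
```

Changing WINDOW to 28 days shows how much the allowance shrinks when the window is shorter.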

Two things experienced engineers get wrong here:

3. Encoding SLIs in PromQL with recording rules

Compute SLIs as good / valid from raw counters, then pre-aggregate with recording rules so alert queries stay cheap and fast. Start from a standard request-count counter (here http_requests_total with a code label and a job label per service).

Define the rate of total and bad events at a short evaluation interval. Using rate() over a 5-minute window gives a per-second rate; the ratio of two rates over the same window is dimensionless, so the per-second normalization cancels cleanly.

# sli-recording-rules.yaml
groups:
  - name: sli_http_5m
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))

      # The availability error ratio: bad / valid over 5m
      - record: job:slo_errors:ratio_rate5m
        expr: |
          job:http_errors:rate5m
            /
          job:http_requests:rate5m

You need the same ratio over multiple windows for burn-rate alerting, so repeat the recording rules for each window the alerts will reference. A clean pattern is one group per window:

  - name: sli_http_1h
    interval: 30s
    rules:
      - record: job:slo_errors:ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[1h]))
            /
          sum by (job) (rate(http_requests_total[1h]))

  - name: sli_http_6h
    interval: 30s
    rules:
      - record: job:slo_errors:ratio_rate6h
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[6h]))
            /
          sum by (job) (rate(http_requests_total[6h]))

For a latency SLI from a native or classic histogram, the good-events ratio is the count at or below your threshold bucket divided by the total count. With a classic histogram whose le boundaries include 0.3 seconds:

# Fraction of requests SLOWER than 300ms over 1h = latency error ratio
1 - (
  sum by (job) (rate(http_request_duration_seconds_bucket{le="0.3"}[1h]))
    /
  sum by (job) (rate(http_request_duration_seconds_count[1h]))
)

The threshold T must line up with an existing le bucket boundary. histogram_quantile() interpolates within a bucket and is the wrong tool for an SLI - you want an exact count of good vs valid events, which means picking a bucket edge at SLI design time and instrumenting for it. Native histograms relax this, but the bucket-edge discipline is still the safe default.
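To make the bucket-edge requirement concrete, here is a small sketch of the same arithmetic outside Prometheus. The bucket counts are made-up example data, and the function is a hypothetical helper: it takes the cumulative per-bucket increase over a window and refuses a threshold that is not an instrumented boundary instead of interpolating.

```python
def latency_error_ratio(bucket_increase, total_increase, threshold_le):
    """Fraction of requests slower than the threshold.

    bucket_increase maps each `le` label to the cumulative count increase
    over the window, as a classic histogram exposes it.
    """
    if threshold_le not in bucket_increase:
        # No interpolation: the threshold must be an existing bucket edge.
        raise KeyError(f"no bucket boundary at le={threshold_le!r}")
    if total_increase == 0:
        return None  # no traffic in the window: ratio is undefined
    good = bucket_increase[threshold_le]
    return 1.0 - good / total_increase


# Example data: 1000 requests, 950 of them at or under 300ms.
buckets = {"0.1": 700, "0.3": 950, "1": 990, "+Inf": 1000}
print(latency_error_ratio(buckets, 1000, "0.3"))
```

Asking for le="0.25" raises rather than guessing, which is the behavior you want from an SLI: the count of good events is exact or it is not reported.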

4. Why single-threshold alerts fail

The naive SLO alert is “page when the error ratio over the last 5 minutes exceeds 1 - SLO”. For a 99.9% target that means paging whenever the 5m error ratio crosses 0.001. This is broken in both directions:

The fix is to alert on burn rate - how fast you are spending the budget - and to require agreement across a long and a short window so a transient spike cannot page on its own.

5. Burn rate, explained

Burn rate is the multiple of budget consumption relative to the steady “spend it exactly over the window” pace.

The formula is simply the observed error ratio divided by the budget:

burn_rate = observed_error_ratio / (1 - SLO)

This reframes alerting around a single question: given the speed I am burning, when do I run out? That maps directly to severity. A burn rate of 14.4 means budget exhaustion in ~2 days - page now. A burn rate of 1 means you are exactly on pace - that is a ticket to investigate, not a 3am wake-up. The canonical thresholds from the Google SRE workbook tie each burn rate to the fraction of budget it consumes over a chosen window:

Severity      Long window   Short window   Burn rate   Budget consumed in long window   Time to exhaust 30d budget
Page (fast)   1h            5m             14.4        2%                               ~50 hours
Page (slow)   6h            30m            6           5%                               ~5 days
Ticket        24h           2h             3           10%                              ~10 days
Ticket        72h           6h             1           10%                              ~30 days
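Every number in the table falls out of two one-line formulas. The sketch below recomputes them, assuming a 30-day (720-hour) SLO window; the function names are illustrative.

```python
import math

SLO_WINDOW_HOURS = 30 * 24  # 720


def burn_rate(error_ratio: float, slo: float) -> float:
    # burn_rate = observed_error_ratio / (1 - SLO)
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float) -> float:
    # At a constant burn rate, the full budget lasts window / rate.
    return SLO_WINDOW_HOURS / rate


def budget_consumed(rate: float, alert_window_hours: float) -> float:
    # Fraction of the whole budget spent during one alert window.
    return rate * alert_window_hours / SLO_WINDOW_HOURS


for rate, long_window_h in ((14.4, 1), (6, 6), (3, 24), (1, 72)):
    print(
        f"burn {rate:>4}x over {long_window_h:>2}h: "
        f"{budget_consumed(rate, long_window_h):.0%} of budget, "
        f"exhausted in {hours_to_exhaustion(rate):.0f}h"
    )
```

Working backwards is equally useful: pick the budget fraction you are willing to lose before a human is told, pick the window, and solve for the burn-rate threshold.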

6. Multi-window multi-burn-rate alerts

The “multi-window” half is what kills false pages. Each alert requires the burn rate to exceed the threshold over both a long window (which establishes the burn is real and sustained) and a short window (which ensures the burn is still happening right now, so you do not page on an issue that already resolved). The short window is conventionally one-twelfth of the long window.
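The two-window condition is easy to reason about as a plain boolean. The sketch below is a simplified model of the logic only (no `for` duration, hypothetical function name and sample ratios), showing the three cases that matter: an ongoing burn, a burn that has already recovered, and a brief spike.

```python
def should_fire(long_ratio: float, short_ratio: float,
                burn_threshold: float, slo: float) -> bool:
    # Both windows must exceed burn_threshold * (1 - SLO).
    limit = burn_threshold * (1.0 - slo)
    return long_ratio > limit and short_ratio > limit


SLO = 0.999

# Ongoing outage: sustained over 1h and still happening in the last 5m.
print(should_fire(long_ratio=0.02, short_ratio=0.05, burn_threshold=14.4, slo=SLO))

# Already recovered: the 1h window is still elevated, the 5m window is clean.
print(should_fire(long_ratio=0.02, short_ratio=0.0002, burn_threshold=14.4, slo=SLO))

# Brief spike: the 5m window is hot, but the 1h window barely moved.
print(should_fire(long_ratio=0.002, short_ratio=0.05, burn_threshold=14.4, slo=SLO))
```

Only the first case pages. The long window filters out the spike and the short window filters out the recovered incident.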

Here is a complete rule set for a 99.9% availability SLO, built on the recording-rule pattern from Section 3. The literal 0.001 is 1 - SLO; the multipliers 14.4, 6, and 3 are the burn-rate thresholds.

# slo-burn-alerts.yaml
groups:
  - name: slo_http_burn
    rules:
      # FAST burn -> PAGE. 14.4x over 1h AND 5m. Budget gone in ~2 days.
      - alert: HighErrorBudgetBurnFast
        expr: |
          (
            job:slo_errors:ratio_rate1h > (14.4 * 0.001)
            and
            job:slo_errors:ratio_rate5m > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: page
          slo: http_availability
          window: fast
        annotations:
          summary: "Fast error-budget burn on {{ $labels.job }}"
          description: "Burn rate >14.4x over 1h and 5m. ~2 days to budget exhaustion."

      # SLOW burn -> PAGE. 6x over 6h AND 30m. Budget gone in ~5 days.
      - alert: HighErrorBudgetBurnSlow
        expr: |
          (
            job:slo_errors:ratio_rate6h > (6 * 0.001)
            and
            job:slo_errors:ratio_rate30m > (6 * 0.001)
          )
        for: 5m
        labels:
          severity: page
          slo: http_availability
          window: slow
        annotations:
          summary: "Sustained error-budget burn on {{ $labels.job }}"
          description: "Burn rate >6x over 6h and 30m. ~5 days to budget exhaustion."

      # TICKET. 3x over 24h AND 2h. Slow leak, no page.
      - alert: ErrorBudgetBurnTicket
        expr: |
          (
            job:slo_errors:ratio_rate1d > (3 * 0.001)
            and
            job:slo_errors:ratio_rate2h > (3 * 0.001)
          )
        for: 15m
        labels:
          severity: ticket
          slo: http_availability
          window: ticket
        annotations:
          summary: "Slow error-budget burn on {{ $labels.job }}"
          description: "Burn rate >3x over 24h and 2h. Investigate during business hours."

This requires recording rules for the 30m, 2h, and 1d windows as well; add them with the same pattern as Section 3. Pre-computing every window reduces each alert expression to a lookup and comparison of already-recorded series, which keeps evaluation cheap even with many services. That matters because Prometheus re-evaluates every rule on each evaluation interval. Alertmanager does not evaluate rules at all; it only receives the alerts Prometheus fires.

7. Routing and inhibition: one incident, one page

Three burn-rate alerts can fire for the same outage at once. Without coordination, the on-call gets three pages for one problem. Alertmanager solves this with grouping (collapse related alerts into one notification) and inhibition (let a higher-severity alert silence a lower one).

# alertmanager.yaml
route:
  receiver: default
  group_by: ['job', 'slo']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = page
      receiver: pagerduty
      group_by: ['job', 'slo']
      continue: false
    - matchers:
        - severity = ticket
      receiver: jira

inhibit_rules:
  # A firing PAGE for an SLO suppresses the TICKET for the same SLO+job.
  - source_matchers:
      - severity = page
    target_matchers:
      - severity = ticket
    equal: ['job', 'slo']

  # The fast page suppresses the slow page for the same SLO+job.
  - source_matchers:
      - window = fast
    target_matchers:
      - window = slow
    equal: ['job', 'slo']

The equal list is the critical, easy-to-miss part: inhibition only applies when the listed labels match between source and target, so ['job', 'slo'] ensures a page for checkout never silences a ticket for search. Grouping by ['job', 'slo'] means a single SLO breach produces one notification thread no matter how many windows trip.
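The role of the equal list is easier to see in a toy model. The sketch below is a deliberately simplified imitation of inhibition, not Alertmanager's implementation: it supports only exact-match matchers, the alert and rule dictionaries are hypothetical, and it treats a missing label as an empty value when comparing the equal labels.

```python
def matches(alert: dict, matchers: dict) -> bool:
    # Simplified: equality matchers only.
    return all(alert.get(k) == v for k, v in matchers.items())


def is_inhibited(target: dict, firing: list, rule: dict) -> bool:
    if not matches(target, rule["target"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source"]):
            continue
        # Inhibit only when every `equal` label has the same value on both.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


rule = {
    "source": {"severity": "page"},
    "target": {"severity": "ticket"},
    "equal": ["job", "slo"],
}

page_checkout = {"severity": "page", "job": "checkout", "slo": "http_availability"}
ticket_checkout = {"severity": "ticket", "job": "checkout", "slo": "http_availability"}
ticket_search = {"severity": "ticket", "job": "search", "slo": "http_availability"}
firing = [page_checkout, ticket_checkout, ticket_search]

print(is_inhibited(ticket_checkout, firing, rule))  # same job and slo as the page
print(is_inhibited(ticket_search, firing, rule))    # different job
```

With an empty equal list the search ticket would be silenced by the checkout page as well, which is exactly the failure the list exists to prevent.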

Enterprise scenario

A payments platform team rolled the canonical 14.4x / 6x burn-rate alerts across ~40 services off a shared http_requests_total counter. Weeks in, the fast page fired during a partial outage but Alertmanager paged the wrong on-call rotation, and a noisy auth service drowned the checkout page. Root cause: the SLI summed errors with sum by (job), but the gateway emitted one job for all routes - the per-team team label existed on the raw metric and was being aggregated away. The recording rules collapsed every service into a single time series, so equal: ['job', 'slo'] inhibition and group_by had nothing to discriminate on.

The fix was to preserve the routing dimension end to end - carry team through every recording rule, alert label, and Alertmanager equal list:

# sli-recording-rules.yaml: keep team in the aggregation
- record: job_team:slo_errors:ratio_rate1h
  expr: |
    sum by (job, team) (rate(http_requests_total{code=~"5.."}[1h]))
      /
    sum by (job, team) (rate(http_requests_total[1h]))

# alertmanager.yaml: require team to match as well
inhibit_rules:
  - source_matchers: [severity = page]
    target_matchers: [severity = ticket]
    equal: ['job', 'team', 'slo']

The gotcha worth internalizing: a burn-rate alert is only as well-targeted as the labels surviving aggregation. Any label you want to route, group, or inhibit on must appear in the by (...) clause of the recording rule, not just the raw series - Prometheus drops everything else before the alert ever sees it. The team also added a promtool test asserting that team propagates onto the fired alert, turning a silent label-drop into a CI failure.
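The label-drop is easy to reproduce in miniature. This sketch imitates what a sum by (...) aggregation does to label sets; the series values and the sum_by helper are illustrative only.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Aggregate (labels, value) pairs, keeping only the labels in `keep`."""
    out = defaultdict(float)
    for labels, value in series:
        # Labels not listed in `keep` are discarded before grouping.
        key = tuple((k, labels.get(k, "")) for k in keep)
        out[key] += value
    return dict(out)


# One gateway job serving two teams' routes (made-up error rates).
errors = [
    ({"job": "gateway", "team": "checkout", "code": "500"}, 2.0),
    ({"job": "gateway", "team": "auth", "code": "503"}, 40.0),
]

by_job = sum_by(errors, ("job",))
by_job_team = sum_by(errors, ("job", "team"))

print(by_job)       # a single series: team is gone, auth noise hides checkout
print(by_job_team)  # one series per team: routable and inhibitable
```

Once the aggregation collapses to a single series, no downstream configuration can recover the team dimension; it has to be preserved at the recording rule.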

Verify

Validate the rules and prove the math before you trust the pages.

# 1. Lint rule syntax (CI gate, no running Prometheus needed)
promtool check rules sli-recording-rules.yaml slo-burn-alerts.yaml

# 2. Validate Alertmanager config
amtool check-config alertmanager.yaml

# 3. Confirm recording rules are producing series at runtime
#    Query the Prometheus HTTP API for the 1h burn ratio:
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=job:slo_errors:ratio_rate1h' | jq '.data.result'

Then unit-test the alerts with promtool, which runs them against synthetic series so you can assert that a known input fires (or does not fire) the right alert:

# slo-alert-tests.yaml
rule_files:
  - sli-recording-rules.yaml
  - slo-burn-alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 20% error rate -> burn rate 200x for a 99.9% SLO -> fast page
      - series: 'http_requests_total{job="checkout", code="200"}'
        values: '0+80x180'
      - series: 'http_requests_total{job="checkout", code="500"}'
        values: '0+20x180'
    alert_rule_test:
      - eval_time: 65m
        alertname: HighErrorBudgetBurnFast
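        exp_alerts:
          - exp_labels:
              severity: page
              slo: http_availability
              window: fast
              job: checkout
            # Annotations are compared as well as labels, so list them in full.
            exp_annotations:
              summary: "Fast error-budget burn on checkout"
              description: "Burn rate >14.4x over 1h and 5m. ~2 days to budget exhaustion."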

Run it with promtool test rules slo-alert-tests.yaml. A passing test is your proof that the threshold arithmetic, the window math, and the label propagation are all correct - far more trustworthy than reading the YAML and hoping.

Finally, dry-run the routing to confirm a page goes where you expect:

amtool config routes test \
  --config.file=alertmanager.yaml \
  severity=page job=checkout slo=http_availability

Rollout checklist

Operating the budget

The alerts are the easy part. The hard part is the error-budget policy - the pre-agreed, written answer to “what do we do when the budget runs out?” Without it, a depleted budget is just a sad number. With it, the budget becomes a brake the whole team respects:

Report budget burn over the rolling SLO window, not the alert windows. A simple panel: remaining budget = 1 - (error_ratio_30d / (1 - SLO)), where error_ratio_30d is bad events divided by valid events over the full 30 days, computed from the raw counters rather than by summing or averaging short-window ratios (which would weight quiet hours the same as busy ones). The panel shows remaining budget as a percentage trending toward zero. Review it in weekly ops and in sprint planning. The point of the whole exercise is not the page at 3am - it is the conversation in planning where “we are at 12% budget remaining with two weeks left in the window” turns an abstract reliability debate into a concrete, shared engineering decision.
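The remaining-budget figure can be sketched from raw event counts over the SLO window. The counts below are invented to land on the 12% figure used in the example; the function name is illustrative.

```python
def remaining_budget(bad_events: int, valid_events: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if valid_events == 0:
        return 1.0  # no traffic, nothing spent
    error_ratio = bad_events / valid_events
    return 1.0 - error_ratio / (1.0 - slo)


# 880 bad requests out of 1,000,000 valid ones against a 99.9% SLO.
left = remaining_budget(bad_events=880, valid_events=1_000_000, slo=0.999)
print(f"{left:.0%} of the error budget remains")
```

Letting the value go negative rather than clamping it at zero keeps the size of an overspend visible in the planning conversation.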

Pitfalls
