Building an On-Call Practice: PagerDuty Escalation, Alert Routing, and Actionable Runbooks

Most on-call rotations are a tax the team pays for not designing one. Every monitoring system points at the same pager, every alert is “critical,” and the runbook is a Slack search for “has anyone seen this before.” The pager becomes noise, people learn to ignore it, and the one page that mattered gets acknowledged at 3am by someone who then has no idea what to do next.

A real on-call practice treats the pipeline from signal to human action as a product. The alert is the input; a rested engineer who knows exactly what to do is the output. PagerDuty sits in the middle as the routing and escalation engine, but the engine is only as good as what you feed it and how you shape it. This guide builds that pipeline end to end: normalizing alerts from Alertmanager, Azure Monitor, and CloudWatch into PagerDuty; modeling services, escalation policies, and follow-the-sun schedules; using Event Orchestration for dedup, suppression, and dynamic routing; mapping severity to SLO impact instead of metric noise; and embedding runbooks and diagnostics directly into the incident so the responder is never starting from zero.

1. From alert to page: normalize every source into the Events API

PagerDuty’s ingestion contract is the Events API v2: a single JSON shape with an event_action (trigger, acknowledge, resolve), a dedup_key, and a payload carrying summary, severity, source, and arbitrary custom_details. Every monitoring system you own should land here in the same shape. That uniformity is what lets one set of orchestration rules govern alerts regardless of origin.

The raw event looks like this:

{
  "routing_key": "R0ABCDE1234567890ABCDEF0123456789",
  "event_action": "trigger",
  "dedup_key": "prod/payments-api/HighErrorRate/eu-west-1",
  "payload": {
    "summary": "payments-api 5xx ratio 8.2% over 5m (SLO burn 14x)",
    "severity": "critical",
    "source": "payments-api.eu-west-1",
    "component": "payments-api",
    "group": "payments",
    "class": "slo_burn",
    "custom_details": {
      "runbook": "https://runbooks.kloudvin.io/payments/high-error-rate",
      "dashboard": "https://grafana.kloudvin.io/d/pay-red",
      "slo": "payments-availability",
      "burn_rate": "14x"
    }
  }
}

The severity field must be one of critical, error, warning, or info - those are the only values the Events API accepts, and an event carrying anything else is rejected as invalid rather than mapped to the nearest level. The dedup_key is the most important field most teams ignore: events with identical keys collapse into one incident, and a resolve event with the same key auto-closes it. Make the key deterministic from the alert identity (environment, service, alert name, region), never from a timestamp.
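As a minimal sketch of the two rules above, the helper below builds a trigger event whose dedup_key is derived only from alert identity and refuses any severity outside the four accepted values. The function name build_event and its parameters are illustrative assumptions, not part of any PagerDuty SDK.

```python
import json

ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_event(routing_key, env, service, alert, region, summary, severity, details=None):
    """Build an Events API v2 trigger event with a deterministic dedup_key."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Identity only, no timestamps: repeats of the same alert share one key.
        "dedup_key": "/".join([env, service, alert, region]),
        "payload": {
            "summary": summary,
            "severity": severity,
            "source": f"{service}.{region}",
            "custom_details": dict(details or {}),
        },
    }


event = build_event(
    "<ROUTING_KEY>", "prod", "payments-api", "HighErrorRate", "eu-west-1",
    "payments-api 5xx ratio 8.2% over 5m", "critical",
)
print(json.dumps(event, indent=2))
```

Validating severity before sending means a typo such as "crit" fails in your own code or CI instead of at the API boundary.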

Alertmanager routes here natively through its pagerduty_configs receiver. Point a route at it and map labels into the payload:

receivers:
  - name: pagerduty-payments
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pd-payments-key
        severity: '{{ if eq .CommonLabels.severity "page" }}critical{{ else }}warning{{ end }}'
        # Alertmanager derives the dedup_key from the alert group key; there is
        # no field to override it, so keep group_by stable so resolves match:
        details:
          runbook: '{{ .CommonAnnotations.runbook_url }}'
          slo: '{{ .CommonLabels.slo }}'
          firing: '{{ .Alerts.Firing | len }}'
        links:
          - href: '{{ .CommonAnnotations.dashboard_url }}'
            text: Dashboard

Alertmanager sends resolve automatically when the alert clears, so incidents self-heal without a human touching them - critical for keeping MTTR honest.

Azure Monitor has no native PagerDuty webhook with the v2 schema, so route Action Groups to PagerDuty’s built-in Azure integration, which translates the common alert schema. Create the Action Group with a webhook receiver using the common alert schema (the flag matters - it stabilizes the payload):

az monitor action-group create \
  --name ag-payments-pd \
  --resource-group rg-observability \
  --short-name payPD \
  --action webhook pagerduty \
    "https://events.pagerduty.com/integration/<INTEGRATION_KEY>/enqueue" \
    usecommonalertschema

CloudWatch alarms publish to SNS; PagerDuty subscribes an HTTPS endpoint to that SNS topic and parses the alarm shape. The cleanest path is one SNS topic per severity tier so routing intent is encoded at the source:

aws sns create-topic --name cw-alarms-critical
aws sns subscribe \
  --topic-arn arn:aws:sns:eu-west-1:123456789012:cw-alarms-critical \
  --protocol https \
  --notification-endpoint \
    "https://events.pagerduty.com/integration/<INTEGRATION_KEY>/enqueue"

Rule of thumb: do transformation as close to the source as you can cheaply, but do routing centrally in PagerDuty. Source-side transforms keep payloads clean; central routing keeps policy in one place you can audit.
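If you ever need a source-side transform of your own (for example a small function between SNS and PagerDuty), it might look like the sketch below. It assumes the standard CloudWatch alarm notification fields AlarmName, NewStateValue, and NewStateReason; the function name, the env/region/severity parameters, and the dedup_key layout are hypothetical choices, and the caller is expected to add the routing_key.

```python
def normalize_cloudwatch_alarm(alarm, env, region, severity):
    """Map a CloudWatch alarm state change to an Events API v2 event, or None."""
    actions = {"ALARM": "trigger", "OK": "resolve"}
    action = actions.get(alarm["NewStateValue"])
    if action is None:  # e.g. INSUFFICIENT_DATA: nothing to open or close
        return None
    event = {
        "event_action": action,
        # Same key for trigger and resolve so the resolve closes the incident.
        "dedup_key": f"{env}/cloudwatch/{alarm['AlarmName']}/{region}",
    }
    if action == "trigger":
        event["payload"] = {
            "summary": f"{alarm['AlarmName']}: {alarm['NewStateReason']}",
            "severity": severity,
            "source": f"cloudwatch.{region}",
        }
    return event


alarm = {
    "AlarmName": "payments-5xx",
    "NewStateValue": "ALARM",
    "NewStateReason": "Threshold Crossed",
}
print(normalize_cloudwatch_alarm(alarm, "prod", "eu-west-1", "critical"))
```

The transform only reshapes the event; which service and escalation policy it reaches is still decided centrally.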

2. Model services, escalation policies, and follow-the-sun schedules

PagerDuty’s object model is small but people get it backwards. A Service represents a thing that can break (a deployable, a domain), not a team. An Escalation Policy says who to wake and in what order. Schedules define who is on call right now. You attach a policy to a service; the policy references schedules.

Model this as code from day one. The Terraform PagerDuty provider makes the whole topology reviewable:

resource "pagerduty_schedule" "payments_primary" {
  name      = "Payments - Primary"
  time_zone = "Europe/London"

  layer {
    name                         = "EMEA daytime"
    start                        = "2026-06-01T08:00:00+01:00"
    rotation_virtual_start       = "2026-06-01T08:00:00+01:00"
    rotation_turn_length_seconds = 604800 # weekly
    users                        = [data.pagerduty_user.alice.id,
                                     data.pagerduty_user.bob.id]
  }
}

resource "pagerduty_escalation_policy" "payments" {
  name      = "Payments Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.payments_primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = data.pagerduty_user.team_lead.id
    }
  }
}

resource "pagerduty_service" "payments_api" {
  name                    = "payments-api"
  escalation_policy       = pagerduty_escalation_policy.payments.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = "null" # the string "null" disables the timer; resolves come from the source

  incident_urgency_rule {
    type    = "constant"
    urgency = "high"
  }
}

Three decisions in there earn their keep. escalation_delay_in_minutes = 10 is your acknowledgement SLA: if the primary does not ack in ten minutes, it escalates. Set this to match how fast a human can realistically respond, not an aspirational number that just trains people to ignore the first page. num_loops = 2 means the whole chain repeats twice after the first pass before giving up - a backstop against everyone being asleep. And auto_resolve_timeout = "null" is deliberate (the provider expects the string; leaving the argument unset falls back to a default timer): auto-resolving incidents on a timer hides unresolved problems and corrupts MTTR, so prefer source-driven resolves from step 1.
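To sanity-check the delays before committing them, it can help to write out the notification timeline. The sketch below is a toy model, not PagerDuty's implementation; it assumes nobody acknowledges and that num_loops counts repeats after the first pass. The target names are illustrative.

```python
def escalation_timeline(rules, num_loops):
    """rules: list of (target, delay_minutes). Returns ([(minute, target)], end_minute)."""
    timeline = []
    minute = 0
    for _ in range(1 + num_loops):  # first pass plus num_loops repeats
        for target, delay in rules:
            timeline.append((minute, target))
            minute += delay  # unacknowledged: escalate after this rule's delay
    return timeline, minute


rules = [("primary on-call", 10), ("team lead", 15)]
timeline, end = escalation_timeline(rules, num_loops=2)
for minute, target in timeline:
    print(f"t+{minute:>2}m notify {target}")
print(f"t+{end}m policy exhausted")
```

Under these assumptions the policy above keeps trying for 75 minutes, which is worth knowing before you decide whether a third rule is needed.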

For follow-the-sun, layer multiple time zones in one schedule so coverage hands off automatically. Add a second layer for APAC and a third for the Americas, each with restrictions bounding its active hours:

  layer {
    name                         = "APAC daytime"
    start                        = "2026-06-01T09:00:00+09:00"
    rotation_virtual_start       = "2026-06-01T09:00:00+09:00"
    rotation_turn_length_seconds = 604800 # weekly
    users                        = [data.pagerduty_user.kenji.id,
                                     data.pagerduty_user.mei.id]

    # Restriction times are read in the schedule's time_zone (Europe/London
    # above), so confirm this window really lands on APAC working hours.
    restriction {
      type              = "daily_restriction"
      start_time_of_day = "09:00:00"
      duration_seconds  = 32400 # 9 hours
    }
  }

Follow-the-sun is the single biggest humane-on-call lever you have: nobody should be paged at 3am for a non-emergency if a colleague is awake and working. The cost is coordination overhead and handoff discipline, which is exactly what the next sections automate.
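A quick way to catch handoff gaps is to project every layer onto UTC and look for uncovered hours. The sketch below assumes fixed UTC offsets (no daylight-saving shifts) and nine-hour windows for every region; the Americas entry and the EMEA window are hypothetical values chosen for illustration.

```python
def uncovered_utc_hours(layers):
    """layers: {name: (local_start_hour, utc_offset_hours, duration_hours)}."""
    covered = set()
    for local_start, utc_offset, duration in layers.values():
        start_utc = (local_start - utc_offset) % 24
        for step in range(duration):
            covered.add((start_utc + step) % 24)
    return sorted(set(range(24)) - covered)


layers = {
    "EMEA daytime": (8, 1, 9),       # 08:00 at UTC+1
    "APAC daytime": (9, 9, 9),       # 09:00 at UTC+9
    "Americas daytime": (9, -7, 9),  # hypothetical: 09:00 at UTC-7
}
print("uncovered UTC hours:", uncovered_utc_hours(layers))
```

Running the same check with one region removed shows exactly which hours would fall back to someone's night.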

3. Event Orchestration: dedup, suppression, and dynamic routing

This is where the practice gets built. Event Orchestration is PagerDuty’s rules engine that runs on every event before it becomes an incident. You point all your integrations at a single Global Orchestration routing key, then use rules to dispatch to the right service, suppress noise, enrich payloads, and set severity. It replaces a sprawl of per-service integration keys with one auditable rule set.

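A rule has conditions (PCL expressions over event fields) and actions. Here is a condensed router sketch that dispatches by the group field, drops known noise, and downgrades low-burn SLO alerts so they do not page. Rules in a set are evaluated in order and the first match wins, so suppression sits ahead of routing. The service and priority identifiers (PXYZ123, PCATCHALL, P3) are placeholders for the IDs in your own account: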

{
  "sets": [
    {
      "id": "start",
      "rules": [
        {
          "label": "Suppress flapping health-check probes",
          "conditions": [
            { "expression": "event.summary matches part 'health-check probe'" }
          ],
          "actions": { "suppress": true }
        },
        {
          "label": "Route payments to payments-api service",
          "conditions": [
            { "expression": "event.group matches 'payments'" },
            { "expression": "event.custom_details.group matches 'payments'" }
          ],
          "actions": {
            "route_to": "PXYZ123",
            "severity": "critical",
            "annotate": "Auto-routed by group=payments"
          }
        },
        {
          "label": "Low burn-rate SLO alerts are not pages",
          "conditions": [
            { "expression": "event.custom_details.burn_rate matches regex '^[01](\\.[0-9]+)?x$'" }
          ],
          "actions": { "severity": "warning", "priority": "P3" }
        }
      ]
    }
  ],
  "catch_all": {
    "actions": { "route_to": "PCATCHALL" }
  }
}

The catch_all is non-negotiable: anything no rule matches goes to a triage service, never to /dev/null. An unrouted critical alert that silently vanishes is the worst failure mode an on-call system has.
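The evaluation order is easy to get wrong, so here is a toy model of it: rules are tried top to bottom, the first match supplies the actions, and the catch-all answers for everything else. This is a sketch of the semantics only, using plain Python predicates in place of PCL; it does not call PagerDuty.

```python
def evaluate(event, rules, catch_all):
    """Return the actions of the first matching rule, else the catch-all actions."""
    for rule in rules:
        if rule["when"](event):
            return rule["actions"]
    return catch_all


rules = [
    {"label": "Suppress flapping health-check probes",
     "when": lambda e: "health-check probe" in e["summary"],
     "actions": {"suppress": True}},
    {"label": "Route payments to payments-api service",
     "when": lambda e: e.get("group") == "payments",
     "actions": {"route_to": "PXYZ123"}},
]
catch_all = {"route_to": "PCATCHALL"}

print(evaluate({"summary": "checkout latency high", "group": "checkout"}, rules, catch_all))
```

An event from an unknown group still comes back with a route, which is the property the catch-all exists to guarantee.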

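Three orchestration capabilities do the heavy lifting against fatigue:

Deduplication. Events that share a dedup_key collapse into one open incident, so a flapping alert updates the existing incident instead of paging again.

Suppression. Known noise, such as the health-check probe rule above, is recorded but never notifies anyone.

Dynamic routing. Payload fields such as group decide which service, and therefore which escalation policy, receives the event.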

Suppression is not silence. A suppressed alert still lands in PagerDuty’s timeline and analytics. That distinction is what lets you turn noise off and prove later that you were right to.

4. Severity and priority: tie pages to SLO impact, not metric spikes

The deepest fix for alert fatigue is conceptual: stop paging on metrics and start paging on SLO burn. A CPU spike is not an incident; a payments service spending its error budget fourteen times faster than allowed is. PagerDuty has two orthogonal fields here, and conflating them is a common mistake.

Urgency (high / low) controls notification behavior - whether to push, call, and escalate now, or hold quietly. Priority (P1-P5, configurable) is business classification used for reporting and routing, and does not by itself wake anyone. The mapping you want:

| SLO impact | Burn rate | Severity | Urgency | Priority | Behavior |
|---|---|---|---|---|---|
| User-facing outage | >10x | critical | high | P1 | Page immediately, escalate |
| Fast budget burn | 2-10x | error | high | P2 | Page primary only |
| Slow burn / degradation | ~1x | warning | low | P3 | Notify, no escalation, business hours |
| Capacity / hygiene | n/a | info | low | P4 | Ticket, never page |

Encode this in Orchestration so severity is derived, not asserted by whoever wrote the alert. The burn-rate value flows from your SLO alerts (see the multi-window burn-rate approach - the same burn_rate label that fired the alert maps directly to PagerDuty priority here):

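{
  "label": "P1 for fast SLO burn",
  "conditions": [
    { "expression": "event.class matches 'slo_burn' and event.custom_details.burn_rate matches regex '^[1-9][0-9]+(\\.[0-9]+)?x$'" },
    { "expression": "event.custom_details.class matches 'slo_burn' and event.custom_details.burn_rate matches regex '^[1-9][0-9]+(\\.[0-9]+)?x$'" }
  ],
  "actions": { "severity": "critical", "priority": "P1", "annotate": "SLO fast-burn -> immediate page" }
}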

The win compounds: when every page is an SLO violation, “did this need to wake someone” stops being a judgment call and becomes arithmetic. That is also what makes the post-incident review (step 7) productive - you can ask whether the SLO was right, instead of whether the threshold was.
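To make the arithmetic concrete, the sketch below computes a burn rate as the observed error ratio divided by the error budget (1 minus the SLO target) and maps it onto the tiers from the table. The thresholds mirror the table; the 99.9% target and 1.4% error ratio are made-up inputs, and treating exactly 10x as P2 is an assumption about where the boundary falls.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def classify(rate):
    """Map a burn rate (or None for non-SLO hygiene alerts) to (severity, urgency, priority)."""
    if rate is None:
        return ("info", "low", "P4")
    if rate > 10:
        return ("critical", "high", "P1")
    if rate >= 2:
        return ("error", "high", "P2")
    return ("warning", "low", "P3")


rate = burn_rate(error_ratio=0.014, slo_target=0.999)  # roughly 14x
print(round(rate, 1), classify(rate))
```

Because the classification is a pure function of the burn rate, it can be unit-tested alongside the alert definitions.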

5. Embed runbooks and automated diagnostics in the incident

A page that says “high error rate” and stops is an insult to the responder. The incident should arrive carrying the runbook, the dashboard, and ideally the first round of diagnostics already run. Two mechanisms get you there.

First, links in the payload. The custom_details.runbook and links you set in step 1 surface directly on the incident. Standardize the keys so every incident has a runbook and dashboard field - then make their absence a CI failure on the alert definition, so no alert ships without a runbook.
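The CI check can be very small. The sketch below assumes alert rules have already been parsed into dictionaries (for example from a Prometheus rules file) and that the required annotation names are runbook_url and dashboard_url, matching the Alertmanager templates in step 1; the function name and sample rules are illustrative.

```python
REQUIRED_ANNOTATIONS = ("runbook_url", "dashboard_url")


def missing_links(rules):
    """Return [(alert_name, [missing annotation names])] for non-compliant rules."""
    problems = []
    for rule in rules:
        annotations = rule.get("annotations", {})
        missing = [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]
        if missing:
            problems.append((rule["alert"], missing))
    return problems


rules = [
    {"alert": "HighErrorRate",
     "annotations": {"runbook_url": "https://runbooks.kloudvin.io/payments/high-error-rate",
                     "dashboard_url": "https://grafana.kloudvin.io/d/pay-red"}},
    {"alert": "QueueBacklog", "annotations": {"dashboard_url": ""}},
]
for name, missing in missing_links(rules):
    print(f"{name}: missing {', '.join(missing)}")
```

In a pipeline, a non-empty result would fail the build so the alert cannot merge without its links.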

Second, automated diagnostics on trigger. PagerDuty’s automation actions (or a webhook subscription firing a function) can run a read-only diagnostic the moment an incident opens and post the output back as a note. The pattern: subscribe to incident.triggered, run a scoped, read-only command, attach the result.

#!/usr/bin/env bash
# Invoked by a webhook subscription on incident.triggered.
# Runs read-only diagnostics and posts a note back to the incident.
set -euo pipefail
INCIDENT_ID="$1"
SERVICE="$2"
: "${PD_API_TOKEN:?PD_API_TOKEN must be set}"

# "|| true" keeps a kubectl failure from aborting the script under
# "set -e"/pipefail; the captured stderr then shows up in the note instead.
DIAG=$(kubectl --context=prod -n "$SERVICE" get pods \
  --field-selector=status.phase!=Running -o wide 2>&1 | head -20 || true)
RECENT=$(kubectl --context=prod -n "$SERVICE" \
  get events --sort-by=.lastTimestamp 2>&1 | tail -10 || true)

curl -sS -X POST "https://api.pagerduty.com/incidents/${INCIDENT_ID}/notes" \
  -H "Authorization: Token token=${PD_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -H "From: automation@kloudvin.io" \
  -d "$(jq -n --arg c "Non-running pods:
${DIAG}

Recent events:
${RECENT}" '{note: {content: $c}}')"

Keep these diagnostics strictly read-only and least-privilege - auto-remediation that mutates production from an unattended hook is how a noisy night becomes an outage. The goal is to shave the first five minutes of “what is even happening,” not to act on the responder’s behalf.

6. Reduce alert fatigue: thresholds, grouping, and quiet hours

Fatigue is a systems problem, not a willpower problem. Attack it on three fronts.

Thresholds and hysteresis at the source. Page on sustained conditions (for: 10m in Prometheus), require a minimum sample count, and use burn-rate windows that demand both a short and a long window to fire. This kills the single-scrape blip that has no business waking anyone.
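The two-window condition reduces to a single boolean: page only when both the short and the long window exceed the threshold. The sketch below illustrates that logic with a hypothetical threshold of 14x; the window lengths themselves live in your alerting rules, not here.

```python
def should_page(short_window_burn, long_window_burn, threshold):
    """Fire only when both windows agree the budget is burning too fast."""
    return short_window_burn >= threshold and long_window_burn >= threshold


# A single bad scrape: the short window spikes, the long window barely moves.
print(should_page(short_window_burn=20.0, long_window_burn=0.5, threshold=14.0))
# A sustained problem: both windows are over the threshold.
print(should_page(short_window_burn=16.0, long_window_burn=14.5, threshold=14.0))
```

The long window filters out blips, and the short window makes the alert clear quickly once the problem stops.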

Grouping in PagerDuty. Enable Intelligent Alert Grouping or content-based grouping on each service so a fan-out failure (one dependency takes down twelve services) becomes one incident with twelve alerts, not twelve pages. Content-based grouping keys on payload fields you control:

resource "pagerduty_service" "payments_api" {
  name              = "payments-api"
  escalation_policy = pagerduty_escalation_policy.payments.id
  alert_creation    = "create_alerts_and_incidents"

  alert_grouping_parameters {
    type = "content_based"
    config {
      aggregate = "all"
      fields    = ["custom_details.group", "class"]
    }
  }
}
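As a toy model of what content-based grouping does with those fields, the sketch below buckets alerts by the values at the configured paths, so alerts that agree on every field land in the same incident. It illustrates the keying idea only and is not PagerDuty's grouping algorithm; the sample alerts are invented.

```python
from collections import defaultdict


def get_path(alert, dotted):
    """Read a dotted path such as 'custom_details.group' from a nested dict."""
    value = alert
    for part in dotted.split("."):
        value = value.get(part) if isinstance(value, dict) else None
    return value


def group_alerts(alerts, fields):
    """Bucket alerts that share the same value for every listed field."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[tuple(get_path(alert, f) for f in fields)].append(alert)
    return dict(incidents)


alerts = [
    {"summary": f"svc-{n} upstream timeout", "class": "dependency_timeout",
     "custom_details": {"group": "payments"}}
    for n in range(12)
]
incidents = group_alerts(alerts, ["custom_details.group", "class"])
print(len(alerts), "alerts ->", len(incidents), "incident(s)")
```

The choice of fields is the whole design: too few and unrelated failures merge, too many and the fan-out comes back.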

Quiet hours via urgency, not suppression. Low-urgency incidents do not page outside the notification rules a responder sets, so the warning-tier SLO alerts from step 4 naturally go quiet overnight and surface in the morning. Combine that with support-hours rules on the service to flip non-critical urgency to low after hours:

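  incident_urgency_rule {
    type = "use_support_hours"

    during_support_hours {
      type    = "constant"
      urgency = "high"
    }
    # severity_based keeps critical/error events high-urgency after hours
    # and drops everything else to low.
    outside_support_hours {
      type    = "constant"
      urgency = "severity_based"
    }
  }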

  support_hours {
    type         = "fixed_time_per_day"
    time_zone    = "Europe/London"
    start_time   = "08:00:00"
    end_time     = "20:00:00"
    days_of_week = [1, 2, 3, 4, 5]
  }

The principle: critical/SLO-breaching pages cut through always; everything else respects human sleep. If a class of alert can wait until morning, it should prove it cannot before it is allowed to wake anyone.
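Stated as code, the principle is a two-line decision. The sketch below assumes support hours of 08:00 to 20:00 on weekdays in the service's local time, treats critical and error as the severities that always page, and takes a naive local datetime for simplicity; it models the intent, not PagerDuty's internal evaluation.

```python
from datetime import datetime, time


def urgency(severity, local_time):
    """High urgency for critical/error always; otherwise only in support hours."""
    if severity in ("critical", "error"):
        return "high"
    in_support_hours = (
        local_time.weekday() < 5  # Monday to Friday
        and time(8, 0) <= local_time.time() < time(20, 0)
    )
    return "high" if in_support_hours else "low"


print(urgency("warning", datetime(2026, 6, 1, 3, 0)))   # Monday 03:00
print(urgency("critical", datetime(2026, 6, 1, 3, 0)))  # Monday 03:00
```

Keeping the rule this simple makes it easy to explain to the people who carry the pager.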

7. Post-incident reviews, MTTA/MTTR, and feedback loops

An on-call practice that does not learn just relives the same incident forever. Two outputs close the loop.

The blameless review. For every P1/P2, produce a short writeup: timeline, contributing factors (plural - never a single root cause), what the runbook got right and wrong, and concrete action items with owners. The most valuable line item is almost always “the alert that should have fired but did not” or “the alert that fired but was useless.” Feed both straight back into steps 4 and 6.

The metrics that matter. Pull them from PagerDuty Analytics, but interpret them correctly:

Track these per service and per quarter, and treat a sustained regression as a reliability bug with the same weight as an outage.
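If you export incident timestamps and want to reproduce the headline numbers yourself, the calculation is a pair of averages. The sketch below assumes each incident record carries triggered, acknowledged, and resolved datetimes (the field names are hypothetical) and measures both metrics from the trigger time; the sample incidents are invented.

```python
from datetime import datetime
from statistics import mean


def mtta_mttr(incidents):
    """Return (mean minutes to acknowledge, mean minutes to resolve)."""
    ack = [(i["acknowledged"] - i["triggered"]).total_seconds() / 60
           for i in incidents if i.get("acknowledged")]
    res = [(i["resolved"] - i["triggered"]).total_seconds() / 60
           for i in incidents if i.get("resolved")]
    return (mean(ack) if ack else None, mean(res) if res else None)


incidents = [
    {"triggered": datetime(2026, 6, 1, 3, 0),
     "acknowledged": datetime(2026, 6, 1, 3, 4),
     "resolved": datetime(2026, 6, 1, 3, 30)},
    {"triggered": datetime(2026, 6, 2, 14, 0),
     "acknowledged": datetime(2026, 6, 2, 14, 6),
     "resolved": datetime(2026, 6, 2, 14, 50)},
]
mtta, mttr = mtta_mttr(incidents)
print(f"MTTA {mtta:.1f} min, MTTR {mttr:.1f} min")
```

Incidents that were never acknowledged are skipped in the MTTA average here, so count them separately rather than letting them disappear.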

Verify

Prove the pipeline works before you trust it at 3am.

# 1. Fire a synthetic trigger through Orchestration and confirm routing+severity.
curl -sS -X POST "https://events.pagerduty.com/v2/enqueue" \
  -H "Content-Type: application/json" \
  -d '{
    "routing_key": "'"$PD_GLOBAL_ORCH_KEY"'",
    "event_action": "trigger",
    "dedup_key": "synthetic/verify/payments/eu-west-1",
    "payload": {
      "summary": "SYNTHETIC verify payments slo_burn 14x",
      "severity": "critical",
      "source": "verify.eu-west-1",
      "custom_details": { "group": "payments", "class": "slo_burn", "burn_rate": "14x" }
    }
  }'
# Expect HTTP 202 and "status":"success" with a dedup_key echoed back.

# 2. Confirm dedup: send the SAME dedup_key again -> still one incident, not two.

# 3. Resolve it so you do not actually page anyone.
curl -sS -X POST "https://events.pagerduty.com/v2/enqueue" \
  -H "Content-Type: application/json" \
  -d '{ "routing_key": "'"$PD_GLOBAL_ORCH_KEY"'", "event_action": "resolve",
        "dedup_key": "synthetic/verify/payments/eu-west-1" }'

Then verify the topology and intent:

Enterprise scenario

A fintech platform team ran 140 microservices behind a single PagerDuty service with one shared integration key. Every Alertmanager, CloudWatch, and Azure Monitor alert hit that key, so every incident escalated to the same primary regardless of which service broke. During a regional networking blip in eu-west-1, a dependency timeout fanned out across 40 services and generated 312 pages in nine minutes. The primary on-call acknowledged the first one, then silenced their phone to survive the night - and missed the genuinely separate database failover that paged 20 minutes later. The post-incident review surfaced an MTTA of 47 minutes on the page that actually mattered, buried under noise.

The constraint was organizational, not technical: 140 services owned by 18 teams, and no appetite for each team to reconfigure dozens of alert sources. They could not ask every service to repoint its integration key.

The fix was to keep the single ingestion key but move all policy into one Global Event Orchestration. Sources stayed unchanged; Orchestration became the brain. They added content-based alert grouping keyed on group and class, a rule set that routed by group to the correct per-team service, a rule that downgrades and annotates dependency-timeout fan-out (the symptom) so it groups under the upstream incident instead of paging on its own, and severity derived from SLO burn rate so only true budget violations carried high urgency.

{
  "label": "Collapse dependency-timeout fan-out into the upstream incident",
  "conditions": [
    { "expression": "event.summary matches part 'upstream timeout' or event.summary matches part 'context deadline exceeded'" }
  ],
  "actions": {
    "variables": [
      { "name": "upstream", "path": "event.custom_details.upstream_service",
        "type": "regex", "value": "(.*)" }
    ],
    "severity": "warning",
    "suppress": false,
    "annotate": "Fan-out symptom; grouped under upstream {{variables.upstream}}"
  }
}

A replay of the same incident shape against the new orchestration turned 312 pages into 6 incidents - one per genuinely independent failure domain - and the database failover paged cleanly on its own service with no competition. MTTA on critical pages dropped from 47 minutes to under 4. Crucially, none of the 140 services changed a line of config; the entire fix lived in one Terraform-managed orchestration the platform team owned and reviewed.

Checklist
