Prometheus & Grafana, In Depth: Scraping, PromQL, Alertmanager & Dashboards (Hands-On)

The previous lesson taught observability as a discipline — the three pillars, the golden signals, SLIs and error budgets — deliberately without committing to any one tool. This lesson is the opposite: it is about the tools. Prometheus and Grafana are the de facto open-source monitoring stack of the cloud-native world. Prometheus is a Cloud Native Computing Foundation graduated project (the second after Kubernetes itself) and Grafana is the visualisation layer almost everyone pairs with it. If you operate anything on Kubernetes, or run an SRE function, you will meet this pair, and the Prometheus Certified Associate exam exists precisely because employers want to know you understand it deeply.

We will treat the stack the way you would have to in production: not as a black box you helm install and forget, but option by option. We walk Prometheus’s architecture and pull model, then prometheus.yml — every block of scrape_configs, service discovery, and the relabeling machinery that confuses everyone the first time. We cover the exporters that turn the world into metrics (node, blackbox, cAdvisor), the local TSDB with its retention and compaction, and remote-write for long-term storage. We then instrument a real application: the four metric types, the exposition format on the wire, and a client library. We do PromQL properly — instant vs range vectors, every selector, rate()/increase()/histogram_quantile(), the aggregation operators, and recording rules. We do Alertmanager end to end — the routing tree, grouping, inhibition, silences, and receivers for Slack, PagerDuty and email. We do Grafana — data sources, dashboards and panels, template variables, provisioning-as-code, and Grafana-managed alerting. And we tie it all together in a Docker Compose lab you can run on your laptop in five minutes.

Where the observability-fundamentals lesson owns the theory (why histograms aggregate and summaries do not, how burn-rate alerting works), this lesson owns the mechanics (the exact YAML, the exact PromQL, the exact curl). The two are designed to be read together.

Learning objectives

By the end of this lesson you will be able to:

Explain the Prometheus architecture — server, retrieval, TSDB, HTTP API — and why the pull model is a design choice, not an accident.
Write a prometheus.yml from scratch: global, scrape_configs with every per-job option, service discovery, and relabel_configs / metric_relabel_configs to shape targets and series.
Deploy and read the standard exporters (node, blackbox, cAdvisor) and the Pushgateway, and know when each is appropriate.
Reason about the local TSDB — blocks, the WAL, head, compaction, retention by time and size — and configure remote-write for long-term storage and global query.
Instrument an application correctly: choose counter / gauge / histogram / summary, emit the exposition format, and expose /metrics with a client library.
Write PromQL confidently — selectors and matchers, instant vs range vectors, rate()/irate()/increase(), histogram_quantile(), the aggregation operators, binary operators and _over_time functions — and factor expensive queries into recording rules.
Configure Alertmanager: alerting rules in Prometheus, the routing tree, grouping, inhibition, silences, and receivers (Slack/PagerDuty/email) with templated notifications.
Build and provision Grafana as code — data sources, dashboards, folders, template variables, and Grafana-managed alerts — and stand the whole stack up with Docker Compose.

Prerequisites & where this fits

You should be comfortable with Docker and docker compose, reading and writing YAML, basic HTTP (status codes, curl), and the shape of a containerised service. Crucially, you should already understand the concepts this lesson makes concrete: the difference between a counter and a gauge, what cardinality is, why you graph rate() of a counter and never the counter itself, and what an SLO and an error budget are. All of that is covered in Observability Fundamentals for DevOps, which is the prerequisite for this lesson — we will recap the bare minimum and otherwise build straight on it. This lesson sits in the Observability module of the DevOps Zero-to-Hero course, after the fundamentals and the SRE practice lesson. If your target is Kubernetes-native monitoring specifically — the Prometheus Operator, ServiceMonitor/PodMonitor, kube-state-metrics — that has its own dedicated lesson, Kubernetes Monitoring, In Depth; here we deliberately use Docker Compose so you learn the raw configuration the Operator generates for you, which is exactly what an interviewer or the PCA exam expects you to be able to read.

Core concepts: the Prometheus architecture

Prometheus is, at heart, three things bolted together: a scraper that pulls metrics over HTTP, a time-series database (TSDB) that stores them locally on disk, and a query engine (PromQL) that reads them back. Everything else — alerting, federation, remote storage — hangs off those three. It is a single statically-linked Go binary with no external dependencies (no database to run alongside it), which is a large part of why it is so widely adopted: you can run a useful Prometheus from one binary and one config file.

The components, and how data flows:

Component	Role	Notes
Retrieval (scraper)	Periodically pulls `/metrics` from each configured target	Driven by `scrape_configs`; discovers targets via service discovery
TSDB (storage)	Stores samples on local disk as time-ordered blocks + a write-ahead log	Local by default; `--storage.tsdb.path`, retention flags
PromQL engine	Evaluates queries against the TSDB	Powers the expression browser, Grafana, recording & alerting rules
Rule manager	Evaluates recording rules (precompute) and alerting rules on a schedule	`rule_files`; alerting rules emit alerts
HTTP server / API	Serves the web UI and the `/api/v1` query API	Grafana and tooling read through this
Service discovery	Finds what to scrape dynamically (Kubernetes, EC2, file, Consul…)	Targets are not static in real systems
Alertmanager (separate process)	Receives fired alerts; dedupes, groups, routes, silences, notifies	Not part of the Prometheus binary — runs alongside

Two boundaries trip people up. First, Alertmanager is a separate process — Prometheus evaluates alerting rules and, when one fires, pushes the alert to Alertmanager over HTTP; Alertmanager owns everything after that (grouping, routing, notifications). Prometheus does not send Slack messages. Second, long-term storage is opt-in — Prometheus stores locally and is not designed to be a durable, clustered datastore; for that you bolt on remote-write to Thanos, Mimir, Cortex, or a vendor backend (covered below).

Why pull, not push?

Prometheus scrapes (pulls): it reaches out to each target’s HTTP endpoint on a schedule and reads the current metric values. Most older systems (StatsD, Graphite) and OpenTelemetry use push: the app sends metrics out. Pull is a deliberate design choice with concrete benefits:

Liveness for free. If a scrape fails, the target is down or unreachable from Prometheus: a synthetic up metric (1 = scrape succeeded, 0 = failed) is recorded for every target, so “is it up?” needs no extra instrumentation.
No credentials sprawl. Targets do not need to know where Prometheus is or hold push credentials; Prometheus holds the target list. You can run a second Prometheus (e.g. in staging) against the same targets with zero app change.
Targets are simple. An app just exposes a static /metrics page; it does no buffering, batching, or retry. You can curl it by hand to debug.
Centralised control of rate and discovery. Scrape interval, timeout, and what gets scraped are configured in one place.

The cost of pull is the awkward case of short-lived jobs that finish before any scrape — a cron job, a CI step. For those, Prometheus offers the Pushgateway: the job pushes its final metrics to the Pushgateway, which holds them so Prometheus can scrape it. The Pushgateway is for service-level batch results only — it is explicitly not a way to convert Prometheus into a push system, and over-using it reintroduces all the problems pull avoids.
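To make the "liveness for free" point concrete, here is a minimal sketch of what one scrape cycle does. It is a simplified model, not Prometheus's code: the fetch callables and the jobs_done_total metric are hypothetical stand-ins for an HTTP GET against a target's /metrics page.

```python
from typing import Callable, Dict, Tuple

def scrape(fetch: Callable[[], str]) -> Tuple[int, Dict[str, float]]:
    """Pull one target; return (up, samples). Any fetch error means up=0."""
    try:
        body = fetch()
    except Exception:
        return 0, {}                      # scrape failed: up=0 and no samples
    samples = {}
    for line in body.splitlines():
        if not line or line.startswith("#"):
            continue                      # skip blank lines and HELP/TYPE comments
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return 1, samples

def healthy_target() -> str:              # hypothetical target that answers
    return "# TYPE jobs_done_total counter\njobs_done_total 42\n"

def dead_target() -> str:                 # hypothetical target that is unreachable
    raise ConnectionRefusedError("connection refused")

up_ok, samples_ok = scrape(healthy_target)
up_bad, samples_bad = scrape(dead_target)
print(up_ok, samples_ok)
print(up_bad, samples_bad)
```

The target itself never reports that it is down; the scraper infers it from the failed pull, which is exactly what a push-based system cannot do without a separate heartbeat.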

The configuration file: `prometheus.yml`

Everything Prometheus scrapes and how is in one YAML file, reloadable without a restart (send SIGHUP, or call POST /-/reload if started with --web.enable-lifecycle). The top-level structure:

global:                 # defaults that apply to every scrape job
  scrape_interval: 15s          # how often to scrape (default 1m)
  scrape_timeout: 10s           # how long to wait for a scrape (must be ≤ interval)
  evaluation_interval: 15s      # how often to evaluate recording/alerting rules
  external_labels:              # labels added to all series when talking to remote systems
    cluster: lab
    region: in-south-1

rule_files:             # recording & alerting rule files (globs allowed)
  - "rules/*.yml"

alerting:               # where to send fired alerts
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:         # the heart of the file — one block per job (see below)
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

remote_write:           # optional: ship samples to long-term storage
  - url: "https://mimir.example.com/api/v1/push"

`scrape_configs` — every per-job option

A job is a set of targets sharing a scrape configuration; each target becomes an instance. Prometheus automatically attaches a job label (the job_name) and an instance label (host:port) to every series from that job — these two are the backbone of all your queries.

Key	What it does	Default	Notes / gotcha
`job_name`	Logical name; becomes the `job` label	(required, unique)	Choose meaningfully — you will `sum by (job)` constantly
`scrape_interval`	Per-job override of global	inherits `global`	Don’t set below ~10s without reason; multiplies storage
`scrape_timeout`	Per-job override	inherits	Must be ≤ `scrape_interval`
`metrics_path`	Path to scrape	`/metrics`	Blackbox/Pushgateway differ
`scheme`	`http` or `https`	`http`	Set `https` for TLS targets
`static_configs`	Hard-coded `targets` + optional `labels`	—	Fine for fixed infra; use SD for dynamic
`<sd>_sd_configs`	Service discovery (kubernetes, ec2, consul, dns, file…)	—	The real-world way to find targets
`relabel_configs`	Rewrite/filter targets before scrape	—	Drop targets, build the address, set labels
`metric_relabel_configs`	Rewrite/filter samples after scrape	—	Drop noisy/high-cardinality metrics at ingest
`basic_auth` / `authorization`	Auth to the target	—	`authorization: { credentials_file: … }` for bearer tokens
`tls_config`	CA, client cert, `insecure_skip_verify`	—	For HTTPS scrape targets
`params`	URL query params sent on scrape	—	Used by blackbox (`module`) and federation (`match[]`)
`honor_labels`	Keep the target’s own `job`/`instance` if it sets them	`false`	Set `true` for Pushgateway/federation so labels aren’t overwritten
`sample_limit`	Drop the scrape if it exceeds N samples	`0` (off)	Cardinality safety valve
`body_size_limit`, `label_limit`	Further ingest guards	off	Defence against a misbehaving target

A realistic job using file-based service discovery plus a metric drop:

scrape_configs:
  - job_name: "node"
    file_sd_configs:
      - files: ["targets/node-*.json"]      # reloaded automatically when the file changes
        refresh_interval: 30s
    relabel_configs:
      # derive a "dc" label from the SD file name, which file_sd exposes as __meta_filepath
      - source_labels: [__meta_filepath]
        regex: ".*node-(.+)\\.json"
        target_label: dc
        replacement: "$1"
    metric_relabel_configs:
      # drop a metric we never query, before it reaches the TSDB
      - source_labels: [__name__]
        regex: "node_scrape_collector_duration_seconds"
        action: drop

Service discovery

Static target lists do not survive contact with autoscaling. Service discovery (SD) lets Prometheus learn its targets at runtime. Each SD mechanism populates meta-labels (prefixed __meta_) describing each discovered target, which you then turn into real labels (or use to filter) via relabel_configs.

SD mechanism	Discovers targets from	Typical use
`kubernetes_sd_configs`	The Kubernetes API (nodes, pods, services, endpoints, ingress)	The standard for K8s
`ec2_sd_configs` / `azure_sd_configs` / `gce_sd_configs`	Cloud provider APIs (instances + tags)	VM fleets
`consul_sd_configs`	Consul service catalog	Service-mesh / VM estates
`dns_sd_configs`	DNS A/AAAA/SRV records	Simple dynamic targets
`file_sd_configs`	JSON/YAML files on disk (written by anything)	Glue for any system; great for labs
`http_sd_configs`	An HTTP endpoint returning the target list	Custom inventory APIs
`static_configs`	A hard-coded list	Fixed infrastructure
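Because file_sd_configs accepts files "written by anything", a few lines of glue code are often enough to connect an inventory system. The sketch below assumes a JSON file containing a list of target groups, each with targets and optional labels; the file name, addresses and env label are made up for illustration, and a temporary directory stands in for the targets/ directory.

```python
import json
import os
import tempfile

def write_file_sd(path, targets, labels):
    """Write one file_sd target group: a JSON list of {targets, labels} objects."""
    groups = [{"targets": sorted(targets), "labels": labels}]
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(groups, fh, indent=2)
    os.replace(tmp, path)   # rename into place so a half-written file is never read

sd_dir = tempfile.mkdtemp()                       # stand-in for the targets/ directory
sd_file = os.path.join(sd_dir, "node-dc1.json")   # hypothetical file name
write_file_sd(sd_file, ["10.0.0.12:9100", "10.0.0.11:9100"], {"env": "lab"})

with open(sd_file) as fh:
    loaded = json.load(fh)
print(loaded)
```

Writing to a temporary name and renaming is a design choice of this sketch: the watcher only ever sees a complete file, and the .tmp name does not match a glob such as node-*.json.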

Relabeling — the part everyone finds confusing

Relabeling is a small pipeline of rules that rewrite the label set, run in two places: relabel_configs runs over the target’s labels before scraping (to decide whether to scrape it and what to call it), and metric_relabel_configs runs over each sample’s labels after scraping (to drop or rewrite metrics). They share the same grammar.

Each rule reads from source_labels (joined by separator, default ;), matches them against regex, and applies an action:

`action`	Effect
`replace` (default)	If `regex` matches `source_labels`, set `target_label` to `replacement` (with `$1`, `$2` capture groups)
`keep`	Keep the target/sample only if `regex` matches; drop everything else
`drop`	Drop the target/sample if `regex` matches
`labelmap`	Copy labels whose name matches `regex` to new names
`labeldrop` / `labelkeep`	Drop/keep labels by name regex
`hashmod`	Set `target_label` to the hash of `source_labels` modulo `modulus`; used for scrape sharding
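The hashmod action is easier to picture in code. The sketch below is a simplified model, not Prometheus's implementation: it assumes an MD5-based hash reduced to 64 bits (treat the exact hash function as an implementation detail), and the addresses and shard count are invented. In a real setup each Prometheus server would pair a hashmod rule with a keep rule that matches its own shard number.

```python
import hashlib

def shard_of(value: str, modulus: int) -> int:
    """Hash the joined source_labels value and reduce it modulo `modulus`."""
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest[8:], "big") % modulus   # low 64 bits, then modulo

targets = [f"10.0.0.{i}:9100" for i in range(1, 9)]      # hypothetical addresses
modulus = 2                                              # two Prometheus shards
shards = {t: shard_of(t, modulus) for t in targets}

my_shard = 0                                             # this server keeps shard 0
kept = [t for t in targets if shards[t] == my_shard]
dropped = [t for t in targets if shards[t] != my_shard]
print(kept)
```

Every target lands in exactly one shard and the assignment is stable across restarts, which is what makes the hash a workable way to split scrape load.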

The canonical Kubernetes pattern: only scrape pods that opt in with an annotation, and use annotations to build the scrape address.

relabel_configs:
  # 1. Only keep pods annotated prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  # 2. Use the pod's port annotation to override the scrape port in __address__
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: "([^:]+)(?::\\d+)?;(\\d+)"
    replacement: "$1:$2"
    target_label: __address__
  # 3. Promote the namespace meta-label to a real label
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace

The three special labels in relabeling are __address__ (the host:port Prometheus will scrape; relabel it to change where it scrapes), __metrics_path__ (the path), and __scheme__. Labels beginning with __ are dropped after target relabeling, so they work as scratch space; the __tmp prefix is reserved for exactly that and is never used by Prometheus itself. The single most important use of relabeling for cost control is the keep/drop on __name__ in metric_relabel_configs: it lets you discard high-cardinality metrics you will never query before they ever hit the TSDB.
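If the rule grammar still feels abstract, a toy relabeling engine helps. The sketch below models only the replace, keep and drop actions, under the semantics described above (source labels joined by the separator, a fully anchored regex, $1-style capture groups). It is an illustration, not the real implementation, and the pod address, port and namespace values are hypothetical.

```python
import re
from typing import Dict, List, Optional

def relabel(labels: Dict[str, str], rules: List[dict]) -> Optional[Dict[str, str]]:
    """Apply rules in order; return None as soon as the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        sep = rule.get("separator", ";")
        value = sep.join(labels.get(n, "") for n in rule.get("source_labels", []))
        match = re.fullmatch(rule.get("regex", "(.*)"), value)   # fully anchored
        action = rule.get("action", "replace")
        if action == "keep" and not match:
            return None
        if action == "drop" and match:
            return None
        if action == "replace" and match:
            # expand $1, $2, ... in the replacement from the capture groups
            labels[rule["target_label"]] = re.sub(
                r"\$(\d+)", lambda m: match.group(int(m.group(1))) or "",
                rule.get("replacement", "$1"))
    return labels

RULES = [
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
     "action": "keep", "regex": "true"},
    {"source_labels": ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
     "regex": r"([^:]+)(?::\d+)?;(\d+)", "replacement": "$1:$2",
     "target_label": "__address__"},
    {"source_labels": ["__meta_kubernetes_namespace"], "target_label": "namespace"},
]

annotated = {   # hypothetical discovered pod
    "__address__": "10.0.0.5:8080",
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "9102",
    "__meta_kubernetes_namespace": "shop",
}
plain = {"__address__": "10.0.0.6:8080", "__meta_kubernetes_namespace": "shop"}

result = relabel(annotated, RULES)
print(result["__address__"], result["namespace"])
print(relabel(plain, RULES))
```

The annotated pod ends up scraped on the port from its annotation, while the pod without the scrape annotation is dropped by the first rule and never reaches the other two.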

Exporters: turning the world into metrics

Most things you want to monitor — a Linux host, a database, a router, a website — do not natively expose Prometheus metrics. An exporter is a small bridge: it sits next to (or in front of) the thing, reads its native stats, and exposes them on a /metrics endpoint in Prometheus format. There are hundreds; three are near-universal.

Exporter	Exposes	Runs as	Notable metrics
node_exporter	Linux/Unix host metrics — CPU, memory, disk, filesystem, network	A daemon on every host (port 9100)	`node_cpu_seconds_total`, `node_memory_MemAvailable_bytes`, `node_filesystem_avail_bytes`, `node_load1`
cAdvisor	Per-container resource usage (CPU/mem/net/fs)	One per host; reads the container runtime (port 8080)	`container_cpu_usage_seconds_total`, `container_memory_working_set_bytes`
blackbox_exporter	Black-box probes — HTTP, HTTPS, TCP, ICMP, DNS — from the outside	One central instance; probes many targets	`probe_success`, `probe_duration_seconds`, `probe_http_status_code`, `probe_ssl_earliest_cert_expiry`

Two patterns distinguish them. node_exporter and cAdvisor are “white-box”: they run on the thing and report its internals, and Prometheus scrapes them directly. blackbox_exporter is “black-box”: it tests a target the way a user would (does this URL return 200 within 2s? does the TLS cert expire soon?), and the scrape is indirect — Prometheus scrapes the blackbox exporter, passing the real target as a parameter:

- job_name: "blackbox-http"
  metrics_path: /probe
  params:
    module: [http_2xx]            # which probe definition in blackbox.yml to run
  static_configs:
    - targets:                    # these are the URLs to PROBE, not to scrape
        - https://kloudvin.com
        - https://api.kloudvin.com/health
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target     # pass the URL as ?target=
    - source_labels: [__param_target]
      target_label: instance           # label the series by the probed URL
    - target_label: __address__
      replacement: blackbox:9115       # actually scrape the blackbox exporter here

That relabeling dance — move the URL into __param_target, then point __address__ at the exporter — is the idiom for every black-box probe and a frequent interview question.
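It helps to see the URL that the relabeling produces. The sketch below rebuilds it by hand, assuming the exporter address, module and target used in the job above; the probe_url helper is hypothetical, and the order of the query parameters is not significant.

```python
from urllib.parse import urlencode

def probe_url(exporter: str, module: str, target: str, path: str = "/probe") -> str:
    """Build the URL that is actually requested for one probed target."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}{path}?{query}"

url = probe_url("blackbox:9115", "http_2xx", "https://kloudvin.com")
print(url)
```

Pasting a URL of this shape into curl against a running blackbox exporter is a quick way to debug a probe without involving Prometheus at all.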

Beyond these, there are exporters for almost everything: mysqld_exporter, postgres_exporter, redis_exporter, kafka_exporter, snmp_exporter (network gear), windows_exporter, and the statsd_exporter that bridges legacy StatsD push to Prometheus pull. For application metrics you do not want an exporter at all — you instrument the app directly (next section).

The local TSDB: storage, retention and remote-write

Prometheus’s storage engine is a purpose-built time-series database optimised for the append-heavy, scrape-driven workload. Understanding its shape explains the retention flags and the memory behaviour interviewers ask about.

Data lands first in the head block (the most recent, in-memory data) and is simultaneously written to a write-ahead log (WAL) on disk, so after an unexpected restart Prometheus can replay it instead of losing it. Periodically (every two hours by default) the head is flushed to an immutable, on-disk block covering a time window. Each block is a self-contained directory holding the compressed samples (chunks/), an index, and metadata; over time, compaction merges adjacent small blocks into larger ones (each spanning up to 10% of the retention time or 31 days, whichever is smaller) to keep query and storage efficient. A sample is roughly 1–2 bytes on disk after compression, which is why Prometheus can hold millions of series cheaply, provided cardinality stays bounded.

Retention is controlled by the time and size flags below (whichever limit is hit first wins); the other two flags set the data directory and WAL compression:

Flag	Meaning	Default
`--storage.tsdb.path`	Where blocks live	`data/`
`--storage.tsdb.retention.time`	Delete blocks older than this	`15d`
`--storage.tsdb.retention.size`	Cap total block size (e.g. `50GB`)	`0` (unlimited)
`--storage.tsdb.wal-compression`	Compress the WAL	on (recent versions)
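These flags are easier to choose with a rough capacity estimate: disk needed is approximately retention seconds × ingested samples per second × bytes per sample. The sketch below applies that rule of thumb; the 100,000 series, the 15s interval and the 1.5 bytes per sample are assumptions for illustration, so measure your own server before sizing a disk.

```python
def disk_bytes(active_series: int, scrape_interval_s: float,
               retention_days: float, bytes_per_sample: float = 1.5) -> float:
    """Rule-of-thumb TSDB size: retention * ingestion rate * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample

# Assumed workload: 100k active series, 15s scrape interval, default 15d retention
estimate = disk_bytes(active_series=100_000, scrape_interval_s=15, retention_days=15)
estimate_gb = estimate / 1e9
print(f"{estimate_gb:.2f} GB")
```

Halving the scrape interval or doubling the series count doubles the estimate, which is why cardinality and scrape frequency matter more than the retention flag itself.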

The defining limitation: the local TSDB is single-node and not clustered. It is durable enough for short-to-medium retention on one server, but it is not a highly-available, infinitely-scalable datastore, and you should not try to make it one by cranking retention to a year. For long retention, global query across many Prometheis, and HA, you use remote-write.

remote_write streams every sample, as it is ingested, to an external endpoint over HTTP as Snappy-compressed protobuf batches. The receiver is a horizontally-scalable backend built for exactly this:

Backend	What it is
Thanos	Adds global query, downsampling and object-storage long-term retention on top of Prometheus (Thanos Receive accepts remote-write; the sidecar model uploads TSDB blocks instead)
Grafana Mimir	Horizontally scalable, multi-tenant long-term Prometheus storage (Cortex lineage)
VictoriaMetrics	High-performance TSDB, drop-in remote-write target, lower resource use
Cloud services	AWS Managed Prometheus (AMP), Azure Monitor managed Prometheus, Google Cloud Managed Service for Prometheus

remote_write:
  - url: "https://mimir.example.com/api/v1/push"
    queue_config:                # tune the shipping queue under load
      max_shards: 50
      capacity: 10000
    write_relabel_configs:       # optionally drop series before they leave the building
      - source_labels: [__name__]
        regex: "go_.*"           # don't ship Go runtime internals to long-term storage
        action: drop

The pattern in large estates: each Prometheus keeps a short local retention (for fast, recent queries and as a buffer) and remote-writes everything to a central, durable, queryable backend — Prometheus does the scraping it is good at, and the remote backend does the long-term storage and global view it is good at. There is also remote_read (query a remote backend transparently), but remote-write is far more common.

Instrumenting an application

Exporters cover infrastructure; for your application’s golden signals — request rate, error rate, latency, business counters — you instrument the code with a client library (official ones for Go, Python, Java, Ruby, Rust, .NET, Node.js, and more). The library maintains the metric registry in memory and exposes it on /metrics.

The four metric types (the practitioner’s view)

The fundamentals lesson covered the theory; here is the implementation reality.

Type	Use it for	What appears on `/metrics`	Query it with
Counter	Things that only go up (and reset to 0 on restart): requests, errors, bytes	One `*_total` series	`rate(x_total[5m])` — never the raw value
Gauge	Snapshots that go up and down: in-flight requests, queue depth, temperature	One series	Graph directly; `avg`/`max`/`min`
Histogram	Distributions you need percentiles of: latency, sizes	`_bucket{le="…"}` (cumulative), `_sum`, `_count`	`histogram_quantile()` over `_bucket`
Summary	A single instance’s exact quantiles	`{quantile="…"}`, `_sum`, `_count`	Read the quantile series directly — cannot aggregate

Recap of the one rule that matters most: prefer histograms for latency, because their _bucket series can be summed across all instances and then turned into a fleet-wide percentile with histogram_quantile(); a summary’s quantiles are computed inside one process and cannot be averaged into a correct cluster percentile. (The newer native/exponential histograms give high resolution with far fewer series, but classic bucketed histograms are still the safe default and what most tooling expects.)

The exposition format

The wire format is plain text, one sample per line, with optional # HELP and # TYPE comments. This is literally what a curl localhost:8000/metrics returns:

# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/api/orders",status="200"} 80421
http_requests_total{method="GET",route="/api/orders",status="500"} 17

# HELP http_request_duration_seconds Request latency in seconds.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{route="/api/orders",le="0.1"} 79000
http_request_duration_seconds_bucket{route="/api/orders",le="0.5"} 80300
http_request_duration_seconds_bucket{route="/api/orders",le="1.0"} 80420
http_request_duration_seconds_bucket{route="/api/orders",le="+Inf"} 80438
http_request_duration_seconds_sum{route="/api/orders"} 6021.7
http_request_duration_seconds_count{route="/api/orders"} 80438

Note the histogram’s anatomy: each _bucket{le="X"} is the cumulative count of observations ≤ X, the final bucket is always le="+Inf" (= total count), and _sum/_count let you compute an average (_sum / _count). The le="+Inf" bucket equalling _count is what makes the buckets internally consistent.
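The histogram invariants described above can be checked mechanically. The sketch below parses the histogram lines from the sample output with two small regular expressions. It is a deliberately naive parser for this one example, not a full implementation of the exposition format.

```python
import re

EXPOSITION = """\
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{route="/api/orders",le="0.1"} 79000
http_request_duration_seconds_bucket{route="/api/orders",le="0.5"} 80300
http_request_duration_seconds_bucket{route="/api/orders",le="1.0"} 80420
http_request_duration_seconds_bucket{route="/api/orders",le="+Inf"} 80438
http_request_duration_seconds_sum{route="/api/orders"} 6021.7
http_request_duration_seconds_count{route="/api/orders"} 80438
"""

SAMPLE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')    # name{labels} value
LABEL = re.compile(r'(\w+)="([^"]*)"')             # key="value" pairs

def parse(text):
    """Return (name, labels, value) for each sample line; comments are skipped."""
    out = []
    for line in text.splitlines():
        m = SAMPLE.match(line)
        if m:
            out.append((m.group(1), dict(LABEL.findall(m.group(2))), float(m.group(3))))
    return out

samples = parse(EXPOSITION)
buckets = [(float(l["le"]), v) for n, l, v in samples if n.endswith("_bucket")]
count = next(v for n, _, v in samples if n.endswith("_count"))
total = next(v for n, _, v in samples if n.endswith("_sum"))

is_cumulative = all(a[1] <= b[1] for a, b in zip(buckets, buckets[1:]))
mean_latency = total / count                       # _sum / _count
under_500ms = dict(buckets)[0.5] / count           # fraction of requests <= 0.5s
print(is_cumulative, round(mean_latency, 4), round(under_500ms, 4))
```

The last line is the useful one in practice: because a bucket boundary sits at 0.5s, the fraction of requests under that threshold is a simple division, which is why bucket boundaries should straddle your SLO.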

A minimal instrumented app (Python)

The official prometheus_client library, used in the lab:

from prometheus_client import Counter, Histogram, start_http_server
import random, time

REQS = Counter("http_requests_total", "Total HTTP requests",
               ["method", "route", "status"])
LAT  = Histogram("http_request_duration_seconds", "Request latency",
                 ["route"],
                 buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5])  # choose buckets to span your SLO

def handle_request():
    route = random.choice(["/api/orders", "/api/users"])
    with LAT.labels(route=route).time():            # times the block, observes into the histogram
        time.sleep(random.expovariate(8))           # simulate work
        status = "500" if random.random() < 0.05 else "200"   # ~5% errors
        REQS.labels(method="GET", route=route, status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)                          # exposes /metrics on :8000
    while True:
        handle_request()

Three best practices are baked in: bounded label values (route, status — never a raw URL or user ID), histogram buckets chosen to straddle your SLO threshold (so you can query the fraction under it), and using the library’s .time() helper so you cannot forget to observe. Then point a scrape job at it:

- job_name: "demo-app"
  static_configs:
    - targets: ["app:8000"]

PromQL in depth

PromQL is where Prometheus earns its keep. It looks simple and has sharp edges; mastering a dozen patterns covers almost everything.

Selectors and the two vector types

The atomic unit is a selector: a metric name plus optional label matchers in braces.

http_requests_total{job="demo-app", status="500"}

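Matchers: = (equals), != (not equals), =~ (regex matches), !~ (regex does not match). Regexes are fully anchored, as if wrapped in ^ and $: status=~"5.." matches every three-character value starting with 5 (500–599 for HTTP status codes), while status=~"5" matches only the literal value 5, not 500.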

The single most important distinction in PromQL:

An instant vector is the current value of each matching series — one value per series, at the evaluation time. http_requests_total{status="500"} is an instant vector.
A range vector is a window of values over time for each series, selected with a duration in brackets: http_requests_total[5m] returns the last 5 minutes of raw samples per series. You almost never display a range vector directly — you feed it to a function like rate().

This is why rate(http_requests_total[5m]) works and rate(http_requests_total) does not: rate needs a range to compute change over.

Rate functions — the heart of counter queries

You never graph a raw counter (it only climbs). You graph its per-second rate:

rate(http_requests_total[5m])        # avg per-second increase over the window, per series
irate(http_requests_total[5m])       # "instant" rate from the LAST two samples — spiky, for fast-moving graphs
increase(http_requests_total[5m])    # total increase over the window (= rate × window seconds)

All three are counter-aware: they automatically detect and correct for a counter reset (when a process restarts and the counter drops to 0), which is exactly why counters are safe across restarts. Use rate() for almost everything (smooth, for alerting and dashboards), irate() only for high-resolution graphs of volatile counters, and increase() when you want a human-readable total (“12,000 requests in the last hour”). Rule of thumb: the range in rate(...[5m]) should be at least 4× your scrape interval so each window contains enough samples.
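Counter-reset handling is worth seeing once in code. The sketch below is a simplified model: it sums the positive deltas between neighbouring samples and treats any drop as a restart from zero. The real rate() and increase() also extrapolate to the edges of the window, which this sketch leaves out, and the sample values are invented.

```python
from typing import List, Tuple

Sample = Tuple[float, float]    # (timestamp in seconds, counter value)

def increase(samples: List[Sample]) -> float:
    """Total growth across the window, tolerating counter resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # a drop means the process restarted and the counter began again at 0
        total += (cur - prev) if cur >= prev else cur
    return total

def rate(samples: List[Sample]) -> float:
    """Average per-second growth between the first and last sample."""
    return increase(samples) / (samples[-1][0] - samples[0][0])

def irate(samples: List[Sample]) -> float:
    """Per-second growth using only the last two samples."""
    return rate(samples[-2:])

# 15s scrapes; the process restarts between t=30 and t=45
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(increase(window), round(rate(window), 4), irate(window))
```

A naive last-minus-first calculation would report a negative increase here (40 − 100); the reset-aware version reports 100, which is what actually happened.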

Aggregation operators

Aggregation collapses many series into fewer, and the by/without clause controls which labels survive:

sum(rate(http_requests_total[5m]))                       # one number: total req/s across everything
sum by (status) (rate(http_requests_total[5m]))          # one series per status code
sum without (instance) (rate(http_requests_total[5m]))   # sum across instances, keep all other labels

The full set: sum, avg, min, max, count, count_values, stddev, stdvar, group, topk/bottomk (the N largest/smallest series), and quantile. by keeps only the listed labels; without keeps all except the listed labels — without (instance) is the idiom for “aggregate across replicas”. The classic error rate as a ratio:

sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
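The by/without distinction becomes obvious when you model an instant vector as a list of (labels, value) pairs. The sketch below is an illustration of the grouping logic only; the instances and per-second rates are invented.

```python
from collections import defaultdict

def sum_by(vector, keep):
    """sum by (keep...): group on the listed labels only."""
    out = defaultdict(float)
    for labels, value in vector:
        key = tuple((k, labels[k]) for k in sorted(keep) if k in labels)
        out[key] += value
    return dict(out)

def sum_without(vector, drop):
    """sum without (drop...): group on every label except the listed ones."""
    out = defaultdict(float)
    for labels, value in vector:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[key] += value
    return dict(out)

# Hypothetical result of rate(http_requests_total[5m]) on two replicas
vector = [
    ({"instance": "a:8000", "status": "200"}, 95.0),
    ({"instance": "a:8000", "status": "500"}, 5.0),
    ({"instance": "b:8000", "status": "200"}, 98.0),
    ({"instance": "b:8000", "status": "500"}, 2.0),
]

by_status = sum_by(vector, ["status"])
across_replicas = sum_without(vector, {"instance"})
errors = by_status[(("status", "500"),)]
everything = sum_by(vector, [])[()]           # plain sum(): no labels survive
error_ratio = errors / everything
print(by_status, error_ratio)
```

With only two labels the two forms give the same groups. They diverge as soon as a new label appears on the metric: without (instance) keeps it automatically, by (status) silently drops it.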

Percentiles from histograms

histogram_quantile(
  0.99,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
)

The order is load-bearing: take rate() of the _bucket counters, sum ... by (le, ...) across instances first (keeping the le bucket label and any dimensions you want), and only then apply histogram_quantile. Summing buckets before computing the quantile is precisely why histograms aggregate across instances and summaries do not — the single most-asked PromQL/Prometheus interview question.
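To see why the order matters, here is a simplified model of the classic histogram_quantile calculation: find the bucket containing the target rank, then interpolate linearly inside it. It is a sketch under stated assumptions (non-empty histograms, 0 < q < 1, lowest bucket starting at 0), and the two instances' bucket counts are invented.

```python
import math
from collections import defaultdict

def histogram_quantile(q, buckets):
    """buckets maps upper bound (le) -> cumulative count; must include math.inf."""
    bounds = sorted(buckets)
    rank = q * buckets[bounds[-1]]                 # target rank among all observations
    i = next(i for i, b in enumerate(bounds) if buckets[b] >= rank)
    if bounds[i] == math.inf:
        return bounds[-2]                          # clamp to the highest finite bound
    lower = bounds[i - 1] if i > 0 else 0.0
    below = buckets[bounds[i - 1]] if i > 0 else 0.0
    # linear interpolation inside the bucket that contains the rank
    return lower + (bounds[i] - lower) * (rank - below) / (buckets[bounds[i]] - below)

def sum_by_le(*instances):
    """Add bucket counts across instances, keeping the le dimension."""
    out = defaultdict(float)
    for inst in instances:
        for le, c in inst.items():
            out[le] += c
    return dict(out)

fast = {0.1: 50, 0.5: 90, 1.0: 100, math.inf: 100}   # hypothetical instance A
slow = {0.1: 10, 0.5: 40, 1.0: 90, math.inf: 100}    # hypothetical instance B

fleet_p90 = histogram_quantile(0.9, sum_by_le(fast, slow))
naive_p90 = (histogram_quantile(0.9, fast) + histogram_quantile(0.9, slow)) / 2
print(round(fleet_p90, 4), round(naive_p90, 4))
```

Summing the buckets first gives a fleet p90 of about 0.917s; averaging the two per-instance p90 values gives 0.75s, which understates the tail. That gap is the reason summary quantiles cannot be combined after the fact.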

`_over_time`, binary operators and `offset`

The *_over_time(range) family aggregates a single series across a time window (as opposed to across series): max_over_time(node_load1[1h]), avg_over_time, min_over_time, quantile_over_time, count_over_time, last_over_time. They are meant for gauges; a counter should go through rate() or increase() first. Binary operators come in three kinds: arithmetic (+ - * / % ^), comparison (== != > < >= <=) and logical/set (and, or, unless). All of them combine vectors by matching labels and are used constantly to build ratios and thresholds. offset 1w shifts a query back in time for week-over-week comparisons, and @ pins evaluation to a fixed timestamp.

Pattern	What it answers
`rate(x_total[5m])`	Per-second rate of a counter
`sum by (l) (rate(x[5m]))`	Rate grouped by label `l`
`histogram_quantile(0.95, sum by (le)(rate(h_bucket[5m])))`	Fleet p95 latency
`a / b`	A ratio (error rate, utilisation)
`100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))`	CPU utilisation %
`predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0`	Disk full within 4h?
`topk(5, sum by (pod)(rate(container_cpu_usage_seconds_total[5m])))`	Top 5 CPU-hungry pods
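The predict_linear pattern in the table is ordinary least-squares regression over the samples in the range. The sketch below shows the idea with invented disk data. As a simplification it projects forward from the last sample, whereas the real function projects from the query evaluation time.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (t, v) samples and extend it forward."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)

GIB = 2 ** 30
# Hypothetical volume: 10 GiB free, shrinking by 1.5 GiB per hour, sampled hourly for 6h
history = [(h * 3600, (10 - 1.5 * h) * GIB) for h in range(7)]

forecast = predict_linear(history, 4 * 3600)
disk_full_within_4h = forecast < 0
print(round(forecast / GIB, 2), disk_full_within_4h)
```

The volume still has 1 GiB free at the last sample, so a static threshold would stay quiet; the trend says it runs out in well under four hours, which is the case this pattern exists to catch.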

Recording rules — precompute the expensive ones

Heavy queries (multi-instance histogram quantiles, ratios over many series) are slow to run repeatedly on dashboards and in alerts. A recording rule evaluates an expression on the evaluation_interval schedule and stores the result as a new time series, so dashboards and alerts read the cheap precomputed series instead. The naming convention is level:metric:operations: the aggregation level (the labels that remain), the source metric name, and the operations applied, newest first.

groups:
  - name: http-aggregations
    interval: 30s                          # optional per-group override
    rules:
      - record: job:http_requests:rate5m            # new series name
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))

Use recording rules for any expression that is (a) expensive and (b) queried often — dashboard panels, SLO error-rate ratios, and the inputs to burn-rate alerts. They trade a little extra storage for big query-time savings and consistency (every dashboard uses the same definition).

Alerting: rules → Alertmanager → notifications

Alerting in this stack is a two-stage pipeline. Prometheus evaluates alerting rules and decides what is firing. Alertmanager decides what to do about it — dedupe, group, route, silence, and notify. Keeping them separate means many Prometheis can feed one Alertmanager, and you change notification policy without touching the rules.

Alerting rules (in Prometheus)

An alerting rule is an expr that, whenever it returns a non-empty result, produces one firing alert per result series:

groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05        # uses our recording rule
        for: 10m                                    # must stay true for 10m → kills flapping
        labels:
          severity: page                            # labels are used by Alertmanager ROUTING
        annotations:                                # annotations are for HUMANS (the notification body)
          summary: "High error rate on {{ $labels.job }}"
          description: "{{ $labels.job }} is at {{ $value | humanizePercentage }} 5xx for 10m."
          runbook: "https://runbooks.kloudvin.com/high-error-rate"

The four mechanics to internalise:

for is the pending period: the expression must be continuously true for this long before the alert moves from Pending to Firing. This is your primary anti-flapping control — a 30-second blip never pages.
labels are for machines: Alertmanager routes and groups on them (severity, team, service). Add a severity to every alert.
annotations are for humans: they fill the notification (summary, description, a runbook link). They use Go templating with $labels, $value, and helpers like humanizePercentage.
Prometheus exposes the alert states (the ALERTS metric and the /alerts page) and sends firing alerts to Alertmanager, re-sending them periodically while they stay active and sending a resolved notice when they clear; it never notifies anyone directly.
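
The pending period can be sketched as a tiny state machine. This is an illustrative model only: alert_states is a hypothetical function, and it assumes evaluations happen at exact, regular intervals.

```python
def alert_states(results, for_seconds, eval_interval):
    """Return the alert state after each evaluation.

    results: whether the rule expression returned data at each evaluation.
    """
    states, active_since = [], None
    for i, is_true in enumerate(results):
        now = i * eval_interval
        if not is_true:
            active_since = None          # any gap resets the pending timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now           # first true evaluation starts the clock
        held = now - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states

# A blip at the start never fires; only the sustained run does.
print(alert_states([True, False, True, True, True], for_seconds=60, eval_interval=30))
```

The reset on a single false evaluation is what makes for an anti-flapping control: the condition has to hold without interruption for the whole period.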

Alertmanager: routing, grouping, inhibition, silences

Alertmanager’s job is to take a stream of fired alerts and turn it into a humane set of notifications. Its config has four moving parts.

1. The routing tree. A single top-level route with nested routes; each incoming alert walks the tree and is handled by the first matching branch (depth-first). continue: true lets an alert match multiple branches. This is how “database alerts go to the DBAs, everything severity: page pages on-call, the rest goes to a Slack channel” is expressed:

route:
  receiver: "slack-default"          # fallback receiver
  group_by: ["alertname", "service"] # which alerts get batched into one notification
  group_wait: 30s                    # wait this long to collect the first batch of a new group
  group_interval: 5m                 # wait between notifications for an existing group with new alerts
  repeat_interval: 4h                # re-notify an unresolved group this often
  routes:
    - matchers: ['team="database"']
      receiver: "slack-dba"
    - matchers: ['severity="page"']
      receiver: "pagerduty"
      group_wait: 10s                # page faster than the default
      continue: true                 # keep checking later siblings (the root receiver is used only if no child matches)
    - matchers: ['severity="ticket"']
      receiver: "jira"
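
The matching rule is easier to see in code. The sketch below is a simplified model of how a routing tree picks receivers; it handles only equality matchers, and the function name and dictionary shape are illustrative rather than Alertmanager's own API.

```python
def match_routes(route, labels):
    """Return the receivers that would handle an alert with these labels."""
    if any(labels.get(k) != v for k, v in route.get("match", {}).items()):
        return []
    receivers = []
    for child in route.get("routes", []):
        matched = match_routes(child, labels)
        receivers += matched
        if matched and not child.get("continue", False):
            break  # first match wins unless continue is set
    # A node handles the alert itself only when none of its children matched.
    return receivers or [route["receiver"]]

tree = {
    "receiver": "slack-default",
    "routes": [
        {"receiver": "slack-dba", "match": {"team": "database"}},
        {"receiver": "pagerduty", "match": {"severity": "page"}, "continue": True},
        {"receiver": "jira", "match": {"severity": "ticket"}},
    ],
}
# Same tree with a hypothetical matcher-less sibling appended as a catch-all.
tree_with_catch_all = {**tree, "routes": tree["routes"] + [{"receiver": "slack-default"}]}

print(match_routes(tree, {"severity": "page"}))
print(match_routes(tree_with_catch_all, {"severity": "page"}))
print(match_routes(tree, {"severity": "info"}))
```

In this model a severity=page alert reaches only pagerduty, because the root receiver is used only when no child matches. continue: true hands the alert on to later siblings, so a matcher-less sibling placed after it is what would also deliver the page to Slack. Note too that route order matters: a database alert with severity=page stops at the first branch.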

2. Grouping. Without grouping, a rack losing power that fails 200 services sends 200 messages. group_by batches alerts that share the listed label values into a single notification (“12 instances of TargetDown for service=payments”). group_wait buffers the first alert of a new group briefly to collect its siblings; group_interval paces follow-ups as the group changes; repeat_interval controls re-nagging while unresolved. Grouping is the single biggest lever against alert-storm fatigue. (group_by: ['...'] with a literal '...' means “do not group” — one notification per alert.)
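
Grouping itself is just bucketing by label values. A minimal sketch, assuming alerts are plain label dictionaries (group_alerts is a hypothetical helper and the alert data is invented):

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts missing a label are grouped under the empty value.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)

alerts = (
    [{"alertname": "TargetDown", "service": "payments", "instance": f"pay-{i}"} for i in range(12)]
    + [{"alertname": "TargetDown", "service": "checkout", "instance": f"co-{i}"} for i in range(5)]
    + [{"alertname": "HighLatency", "service": "payments", "instance": f"pay-{i}"} for i in range(3)]
)

for key, members in group_alerts(alerts, ["alertname", "service"]).items():
    print(key, len(members))
```

Twenty alerts collapse into three notifications here; adding instance to group_by would put every alert back in its own group.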

3. Inhibition. An inhibition rule suppresses some alerts while a more important one is firing — the classic case being “if the whole cluster is down (a critical alert), don’t also page me about every individual service being unreachable (the warning alerts)”:

inhibit_rules:
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ["cluster", "service"]     # only inhibit warnings that share these labels with the critical

equal is essential: it scopes the suppression so a critical alert in one service does not silence warnings in unrelated services.
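
A small sketch shows why equal matters. It models one inhibition rule with equality matchers only; the function names and alert data are hypothetical, and real Alertmanager has more edge cases than this.

```python
def matches(labels, matchers):
    return all(labels.get(k) == v for k, v in matchers.items())

def is_inhibited(target, firing_alerts, rule):
    """True if some firing source alert suppresses the target under this rule."""
    if not matches(target, rule["target"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source"]):
            continue
        # Every label listed in equal must have the same value on both alerts.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False

rule = {
    "source": {"severity": "critical"},
    "target": {"severity": "warning"},
    "equal": ["cluster", "service"],
}
critical = {"severity": "critical", "cluster": "eu1", "service": "payments"}
same_service = {"severity": "warning", "cluster": "eu1", "service": "payments"}
other_service = {"severity": "warning", "cluster": "eu1", "service": "search"}
firing = [critical, same_service, other_service]

print(is_inhibited(same_service, firing, rule))
print(is_inhibited(other_service, firing, rule))
```

With an empty equal list the second warning would be suppressed as well, which is exactly the over-broad muting the scoping is there to prevent.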

4. Silences and receivers. A silence is a temporary, manual mute created in the Alertmanager UI (or API) by matching labels — used during planned maintenance (“silence everything for service=payments for 2 hours while we migrate the DB”). Silences are time-boxed and audited (who, why, until when), unlike a permanent config change. A receiver is a named notification destination; Alertmanager ships integrations for the lot:

Receiver type	Notes
`slack_configs`	Webhook URL (a secret), channel, templated title/text
`pagerduty_configs`	`routing_key` (a secret); maps severity to PagerDuty urgency
`opsgenie_configs`	Opsgenie API key
`email_configs`	SMTP server, from/to, TLS
`webhook_configs`	POST the alert JSON to any HTTP endpoint (custom integrations, Microsoft Teams via a relay)
`telegram_configs`, `sns_configs`, `victorops_configs`, `discord_configs`	Various

receivers:
  - name: "slack-default"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_URL}"        # placeholder: Alertmanager does not expand env vars; render at deploy time or use api_url_file
        channel: "#alerts"
        title: '{{ .CommonAnnotations.summary }}'
        text: >-
          {{ range .Alerts }}*{{ .Labels.severity }}* {{ .Annotations.description }}
          <{{ .Annotations.runbook }}|runbook>
          {{ end }}
        send_resolved: true                    # also notify when the alert clears
  - name: "pagerduty"
    pagerduty_configs:
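      - routing_key: "${PAGERDUTY_ROUTING_KEY}"   # placeholder: render at deploy time or use routing_key_file
        severity: '{{ .CommonLabels.severity }}'  # must render to critical, error, warning or info; a "page" label needs mapping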

send_resolved: true is a small but important touch — it tells the channel when the problem is over, so nobody chases a resolved incident.
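
Alertmanager reads its config file literally, so ${...} placeholders have to be filled in before it starts. One way to do that is a small render step in the deploy pipeline. The sketch below uses Python's string.Template; render_config, the template text, and the example.invalid URL are all illustrative.

```python
from string import Template

def render_config(template_text, secrets):
    """Fill ${NAME} placeholders from a mapping such as os.environ.

    substitute() raises KeyError for a missing name, so a forgotten secret
    fails the deploy instead of shipping a literal "${...}" to Alertmanager.
    """
    return Template(template_text).substitute(secrets)

template_text = 'api_url: "${SLACK_WEBHOOK_URL}"\nchannel: "#alerts"\n'
rendered = render_config(
    template_text,
    {"SLACK_WEBHOOK_URL": "https://hooks.example.invalid/T000/B000"},
)
print(rendered)
```

One caveat with this approach: string.Template treats every $ as special, so a config whose notification templates contain a literal $ would need it written as $$ in the template file.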

Grafana: dashboards, variables, provisioning and alerting

Prometheus has a usable expression browser, but you visualise and share through Grafana — a multi-datasource dashboarding tool that queries Prometheus (and Loki, Tempo, SQL databases, CloudWatch, and dozens more) and renders panels, with its own alerting engine on top.

Data sources

A data source is a connection to a backend. For Prometheus you give it the server URL (http://prometheus:9090), choose the scrape interval hint, and optionally enable exemplars (so a click on a latency spike jumps to the linked trace in Tempo). You can have many data sources and mix them on one dashboard. The key operational rule: data sources should be provisioned from config, not clicked in by hand, so they are reproducible (below).

Dashboards, panels and queries

A dashboard is a grid of panels; each panel runs one or more queries (PromQL, here) and renders them in a visualisation — Time series, Stat, Gauge, Bar gauge, Table, Heatmap (ideal for histograms), State timeline, Logs. Panels have thresholds (colour by value), units (seconds, bytes, percent — set these or your axes lie), legends, and transformations (join, rename, calculate fields). A good service overview is the RED panel set: a Stat of request rate, a Time series of error ratio, and a Time series of p50/p95/p99 latency from a histogram — identical for every service so they are instantly comparable.

Template variables — one dashboard, every service

A dashboard hard-coded to one service is wasteful. Template variables turn a dashboard into a reusable template by parameterising queries with a dropdown at the top. The most useful kinds:

Variable type	Populated from	Example
Query	A `label_values()` query (a Grafana helper for the Prometheus data source, not PromQL)	`label_values(http_requests_total, job)` → a `$job` dropdown of all jobs
Custom	A hand-typed list	`prod, staging, dev`
Interval	A list of durations	`1m,5m,1h` → an `$interval` dropdown for range windows
Constant / Textbox	Fixed or free text	A threshold value
Data source	All sources of a type	Switch the whole dashboard between Prometheus instances

You then use $job in every query (rate(http_requests_total{job="$job"}[5m])), often with the =~"$job" matcher and the multi-value/All option so one dashboard serves every service. Grafana also provides built-ins: $__rate_interval (auto-sizes the rate() window to the panel’s resolution and scrape interval — use it instead of a hard-coded [5m]), $__interval, and $__range.
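
Grafana documents $__rate_interval as the larger of "$__interval plus the scrape interval" and "four times the scrape interval". A short sketch of that arithmetic, with rate_interval as a hypothetical helper name and all values in seconds:

```python
def rate_interval(interval_s, scrape_interval_s):
    """Model of Grafana's $__rate_interval: max(interval + scrape, 4 * scrape)."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)

# Zoomed in: the panel step is small, so the 4x floor applies.
print(rate_interval(interval_s=15, scrape_interval_s=15))
# Zoomed out: the panel step dominates, so no samples between points are skipped.
print(rate_interval(interval_s=300, scrape_interval_s=15))
```

The floor is what guarantees rate() always has several samples to work with, which a hard-coded window cannot promise once the scrape interval changes.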

Provisioning as code

Clicking dashboards together by hand is the Grafana equivalent of kubectl edit in prod — unreproducible and lost when the container is recreated. Provisioning declares data sources and dashboards in YAML/JSON files that Grafana loads at startup, so the whole stack is in Git and reproducible. Two provisioning files:

provisioning/datasources/prometheus.yml:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend queries Prometheus (not the browser)
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations: []   # wire up if you run Tempo

provisioning/dashboards/dashboards.yml (tells Grafana to load every dashboard JSON in a folder):

apiVersion: 1
providers:
  - name: "file-provisioned"
    folder: "DevOps"
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: false   # true mirrors sub-directories as folders, but then folder must be left empty

You then drop exported dashboard JSON files into that path. In CI you can lint and version them; the dashboard becomes a reviewed artefact, not tribal knowledge. (Community dashboards from grafana.com — e.g. Node Exporter Full, ID 1860 — can be imported by ID and then committed.)
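
As one example of what a CI lint could check, the sketch below validates a dashboard JSON document before it is committed. The rules themselves (require a title, a uid, and a title on every panel) are an assumed house policy, not something Grafana enforces, and lint_dashboard is a hypothetical helper.

```python
import json

def lint_dashboard(text):
    """Return a list of problems found in one exported dashboard JSON document."""
    try:
        dash = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(dash, dict):
        return ["top level is not a JSON object"]
    problems = []
    if not dash.get("title"):
        problems.append("missing title")
    if not dash.get("uid"):
        problems.append("missing uid")  # a stable uid keeps dashboard links stable
    for index, panel in enumerate(dash.get("panels", [])):
        if not panel.get("title"):
            problems.append(f"panel {index} has no title")
    return problems

good = '{"title": "RED overview", "uid": "red-overview", "panels": [{"title": "Request rate"}]}'
bad = '{"panels": [{}]}'
print(lint_dashboard(good))
print(lint_dashboard(bad))
```

In a pipeline you would run this over every .json file in the provisioning folder and fail the build if any list is non-empty.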

Grafana-managed alerting vs Alertmanager

Grafana has its own unified alerting engine, which can evaluate alert rules against any data source (Prometheus, Loki, a SQL DB, CloudWatch) — not just Prometheus — and route them through contact points and notification policies that mirror Alertmanager’s routing tree. So you have two valid choices:

	Prometheus + Alertmanager	Grafana-managed alerting
Where rules live	`rules.yml` in Prometheus (in Git)	Grafana (UI or provisioned YAML)
Data sources	Prometheus only	Any Grafana data source
Routing/notify	Alertmanager (routing tree, inhibition, silences)	Grafana contact points + notification policies
Best for	Prometheus-centric, GitOps, multi-Prometheus	Mixed data sources, teams living in Grafana

A common production split: keep the critical, Prometheus-based paging alerts in Prometheus + Alertmanager (battle-tested, GitOps-friendly), and use Grafana-managed alerting for cross-data-source or dashboard-driven alerts. Grafana can also act purely as a front-end to an external Alertmanager, giving you a nicer UI for silences over your existing setup. There is no wrong answer; just pick one source of truth per alert so you are not debugging two systems.

The packaged stack: kube-prometheus-stack

In the real world on Kubernetes you rarely assemble these by hand. The kube-prometheus-stack Helm chart (from the prometheus-community repo) installs the whole lot — the Prometheus Operator, Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, and a battery of pre-built dashboards and alerting rules — in one command, and lets you declare scrape targets with ServiceMonitor/PodMonitor custom resources instead of editing prometheus.yml. That Operator-driven workflow is its own lesson (Kubernetes Monitoring, In Depth); the point here is that everything the Operator generates is the raw configuration this lesson taught you to read — which is exactly why you learn it by hand first.

Figure: The Prometheus and Grafana stack. Scrape targets and exporters feed the Prometheus server (retrieval, TSDB, rule engine and PromQL), which serves Grafana for dashboards and pushes fired alerts to Alertmanager for routing, grouping and notification, with optional remote-write to long-term storage.

The diagram shows the full data flow: instrumented apps and exporters (node, cAdvisor, blackbox) exposing /metrics; Prometheus pulling them on a schedule (with service discovery and relabeling), storing samples in its local TSDB and optionally remote-writing to Mimir/Thanos; the rule engine producing recording-rule series and firing alerting rules into Alertmanager (routing → grouping → inhibition/silences → Slack/PagerDuty/email); and Grafana querying Prometheus to render provisioned dashboards.

Hands-on lab

We will stand up the complete stack with Docker Compose: an instrumented sample app, node-exporter, Prometheus scraping both, Alertmanager wired in, and Grafana with a provisioned data source. Everything is free and local. Allow about 15 minutes.

1. Project layout. Create a folder prom-lab/ with these files.

docker-compose.yml:

services:
  prometheus:
    image: prom/prometheus:v3.5.0          # current LTS line
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=2d"
      - "--web.enable-lifecycle"            # enables POST /-/reload
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./rules.yml:/etc/prometheus/rules.yml:ro

  alertmanager:
    image: prom/alertmanager:v0.28.1
    command: ["--config.file=/etc/alertmanager/alertmanager.yml"]
    ports: ["9093:9093"]
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro

  node-exporter:
    image: prom/node-exporter:v1.9.1
    ports: ["9100:9100"]

  app:                                       # our instrumented Python app (built below)
    build: ./app
    ports: ["8000:8000"]

  grafana:
    image: grafana/grafana:12.0.0
    ports: ["3000:3000"]
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Admin"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro

app/Dockerfile and app/app.py (the instrumented app from earlier):

# app/Dockerfile
FROM python:3.13-slim
RUN pip install --no-cache-dir prometheus_client
COPY app.py /app.py
CMD ["python", "/app.py"]

# app/app.py
from prometheus_client import Counter, Histogram, start_http_server
import random, time

REQS = Counter("http_requests_total", "Total HTTP requests", ["method","route","status"])
LAT  = Histogram("http_request_duration_seconds", "Request latency", ["route"],
                 buckets=[0.05,0.1,0.25,0.5,1,2.5,5])

def handle():
    route = random.choice(["/api/orders","/api/users"])
    with LAT.labels(route=route).time():
        time.sleep(random.expovariate(8))
        status = "500" if random.random() < 0.05 else "200"
        REQS.labels(method="GET", route=route, status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)
    while True:
        handle()

prometheus.yml:

global:
  scrape_interval: 5s
  evaluation_interval: 5s
rule_files:
  - "rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
scrape_configs:
  - job_name: "prometheus"
    static_configs: [{ targets: ["localhost:9090"] }]
  - job_name: "node"
    static_configs: [{ targets: ["node-exporter:9100"] }]
  - job_name: "demo-app"
    static_configs: [{ targets: ["app:8000"] }]

rules.yml (a recording rule + a real alerting rule):

groups:
  - name: demo
    rules:
      - record: job:http_errors:ratio5m        # name kept from the earlier rule; the lab uses a 1m window so it reacts quickly
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[1m]))
            /
          sum by (job) (rate(http_requests_total[1m]))
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m{job="demo-app"} > 0.02
        for: 1m
        labels: { severity: page }
        annotations:
          summary: "demo-app error ratio {{ $value | humanizePercentage }}"
          runbook: "https://runbooks.kloudvin.com/high-error-rate"
      - alert: TargetDown
        expr: up == 0
        for: 30s
        labels: { severity: page }
        annotations:
          summary: "Target {{ $labels.instance }} ({{ $labels.job }}) is down"

alertmanager.yml (uses a webhook receiver so you need no external account):

route:
  receiver: "log-webhook"
  group_by: ["alertname", "job"]
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 1h
inhibit_rules:
  - source_matchers: ['alertname="TargetDown"']
    target_matchers: ['alertname="HighErrorRate"']
    equal: ["job"]            # if the app is DOWN, don't also alert on its error ratio
receivers:
  - name: "log-webhook"
    webhook_configs:
      - url: "http://app:8000/"   # any reachable URL; we just want to see routing work
        send_resolved: true

grafana/provisioning/datasources/prometheus.yml:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

2. Start the stack.

docker compose up -d --build
docker compose ps          # five services should be "running"

3. Confirm scraping. Open http://localhost:9090/targets — prometheus, node, and demo-app should all be UP. Inspect the app’s raw metrics by hand to see the exposition format:

curl -s localhost:8000/metrics | grep -E "http_requests_total|http_request_duration_seconds_bucket" | head

You should see the _total counter series and the _bucket{le="..."} histogram lines exactly as described earlier.
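
If it helps to see the structure of those lines, here is a minimal parser sketch for the text exposition format. It is deliberately incomplete (it ignores timestamps and leaves escape sequences in label values untouched), the sample text is made up, and for real work you would use a client library's parser instead.

```python
import re

# metric_name{labels} value   (labels optional)
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

def parse_exposition(text):
    """Return (name, labels, value) for every sample line, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # HELP and TYPE lines are comments
            continue
        match = LINE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))  # float() accepts +Inf and NaN
    return samples

sample_text = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/api/orders",status="200"} 1027.0
http_request_duration_seconds_bucket{le="0.05",route="/api/orders"} 340.0
http_request_duration_seconds_bucket{le="+Inf",route="/api/orders"} 1027.0
"""
for sample in parse_exposition(sample_text):
    print(sample)
```

Each line is one series at one moment: a name, a label set, and a float. Everything Prometheus stores starts out in this shape.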

4. Run PromQL. In the Prometheus UI (http://localhost:9090/graph), run each and switch to the Graph tab:

up                                                          # 1 per healthy target
sum by (status) (rate(http_requests_total[1m]))             # req/s split by 200 vs 500
job:http_errors:ratio5m                                     # your recording rule
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[1m])))   # p95 latency
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])))    # host CPU utilisation %

Expected: up returns three 1s; the rate query shows two lines (200 and 500); the ratio hovers near 0.05 (our 5% error rate); p95 latency is a fraction of a second.
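
To see where the p95 number comes from, the sketch below reimplements the core of histogram_quantile for classic buckets: find the bucket the target rank falls in, then interpolate linearly inside it. It is a simplified model with made-up bucket counts, and it skips several edge cases the real function handles.

```python
import math

def histogram_quantile(q, buckets):
    """buckets: (upper_bound, cumulative_count) pairs, including a +Inf bucket."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]) or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]            # how many observations lie below the quantile
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        return buckets[-2][0]            # beyond the last finite bound: report that bound
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    if count == below:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)

# Made-up cumulative counts, already summed by le across instances.
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
print(histogram_quantile(0.99, buckets))
```

The even-spread assumption is why bucket boundaries matter: the result can never be more precise than the width of the bucket the quantile lands in.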

5. Watch an alert fire and route. The app’s ~5% error rate is above the 2% threshold, so within ~1 minute http://localhost:9090/alerts shows HighErrorRate go Pending → Firing. Open http://localhost:9093 (Alertmanager) and confirm the alert appears, grouped by alertname+job. Now test inhibition and grouping by killing the app:

docker compose stop app

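TargetDown fires for demo-app after about 30 seconds. While both alerts are active, the inhibition rule suppresses HighErrorRate for the same job; within about a minute HighErrorRate also resolves on its own in Prometheus, because a target that returns no samples has no error ratio. Verify in the Alertmanager UI that only TargetDown is shown as active (tick the Inhibited checkbox to see suppressed alerts). Restart the app and watch TargetDown resolve; HighErrorRate returns once the ratio has data again: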

docker compose start app

6. Create a silence. In the Alertmanager UI (http://localhost:9093 → Silences → New Silence), add a matcher job="node", a duration of 1h, a creator and comment, and save — any alert from the node job is now muted (audited, time-boxed). This is the maintenance-window workflow.
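
The same silence can be created through the API instead of the UI. The sketch below only builds the JSON body, following the shape of Alertmanager's v2 silences endpoint (POST /api/v2/silences) as I understand it; silence_payload is a hypothetical helper, and sending the request is left to whichever HTTP client you prefer.

```python
import json
from datetime import datetime, timedelta, timezone

def silence_payload(matchers, hours, created_by, comment, now=None):
    """Build the JSON body for a time-boxed silence on the given label matchers."""
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }

payload = silence_payload(
    {"job": "node"}, hours=1,
    created_by="you@example.invalid", comment="lab maintenance window",
    now=datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(json.dumps(payload, indent=2))
```

Scripting silences this way is handy for deploy pipelines that open a maintenance window automatically and let it expire on its own.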

7. Grafana. Open http://localhost:3000 (anonymous admin). The Prometheus data source is already provisioned (Connections → Data sources → Prometheus, Test it). Build a panel: Dashboards → New → Add visualization → Prometheus, and enter the p95 query from step 4; set the panel unit to seconds and the visualisation to Time series. Add a second panel for job:http_errors:ratio5m with unit percent (0.0–1.0). You now have a RED-style overview reading the same recording rule your alert uses.

Validation checklist: three targets UP; curl shows the exposition format; the five PromQL queries return data; HighErrorRate fires and appears in Alertmanager; stopping the app fires TargetDown and inhibits HighErrorRate; a silence mutes the node job; Grafana renders both panels from the provisioned data source.

Cleanup.

docker compose down -v       # stop and remove containers + volumes

Then delete the prom-lab/ folder if it was throwaway.

Cost note. Entirely free — every image is open-source and runs locally; nothing leaves your machine and no cloud account is needed. The only production “cost” levers are TSDB cardinality (bound your label values), scrape interval × number of series (storage and CPU), retention (retention.time/size), and, if you adopt it, remote-write egress and the long-term-storage backend — all of which you now know how to control.

Common mistakes & troubleshooting

Symptom	Likely cause	Fix
Target shows DOWN on `/targets`	Wrong `__address__`/port, network/DNS, scheme, or auth; check the `Error` column there	`curl` the target’s `/metrics` from inside the network; fix host:port; in Compose use the service name, not `localhost`
`rate()` returns empty or `NaN`	Range shorter than ~2 scrape intervals, or you wrapped a gauge in `rate()`	Use `rate(...[≥4×interval])`; only `rate()` counters; use `$__rate_interval` in Grafana
Graph of a counter only climbs	Plotting the raw counter	Always wrap counters in `rate()`/`increase()`
Fleet p99 looks wrong or won’t aggregate	Used a summary, or applied `histogram_quantile` before summing buckets	Use a histogram; `sum by (le) (rate(..._bucket[…]))` then `histogram_quantile`
Prometheus OOMs / disk fills / slow queries	Cardinality explosion from an unbounded label (user id, raw URL, request id)	Find it with `topk(10, count by (__name__)({__name__=~".+"}))`; drop in `metric_relabel_configs`; normalise routes
Alerts fire in Prometheus but no notification	Alertmanager unreachable, no matching `route`, or the receiver secret is wrong	Check `/alerts` shows Firing; confirm `alerting.alertmanagers` target; check Alertmanager logs and routing tree
One incident sends dozens of messages	No/insufficient grouping	Set `group_by` on the shared labels; tune `group_wait`/`group_interval`
Paged for every service when the whole cluster is down	No inhibition	Add an `inhibit_rules` suppressing `warning` while `critical` fires, scoped with `equal`
Grafana panel empty but query works in Prometheus	Wrong data source URL (use the service name), or a `$variable` is unset/empty	Test the data source; check the variable dropdown and the `=~"$var"` matcher
Dashboards/data sources vanish after a redeploy	Configured by hand instead of provisioned	Provision data sources and dashboards from files in Git

Best practices

Configure everything as code and in Git — prometheus.yml, rules, alertmanager.yml, Grafana provisioning. The whole stack should be reproducible from a repo (this is what kube-prometheus-stack does for you on K8s).
Use service discovery + relabeling, not hand-maintained target lists, for anything that scales; opt targets in by annotation/label and shape labels with relabel_configs.
Bound cardinality ruthlessly. Never put unbounded values in labels; use metric_relabel_configs to drop noisy metrics at ingest and sample_limit as a safety valve.
Prefer histograms for latency (aggregatable), choose buckets that straddle your SLO, and consider native histograms for resolution at lower series cost.
Precompute with recording rules for any expensive expression you query repeatedly (dashboard panels, SLO ratios, burn-rate inputs); name them level:metric:operation.
Set for: on every alert to kill flapping; put severity (and team/service) labels on every alert for routing; link a runbook in the annotations.
Group and inhibit in Alertmanager so a single incident is one humane notification, not a storm; use silences for planned maintenance.
Standardise dashboards on RED per service and template them with variables (and $__rate_interval) so one dashboard serves many services; always set panel units.
Plan storage: keep modest local retention and remote-write to a scalable backend (Mimir/Thanos/managed) for long retention, HA, and a global view.
Run Alertmanager in HA (a cluster of ≥2 with gossip) so a single node failing does not stop your pages.

Security notes

The monitoring stack is a high-value target and a frequent source of leaks. Never expose Prometheus, Alertmanager, the Pushgateway or Grafana to the public internet: an open Prometheus /metrics or unauthenticated Grafana hands an attacker your entire internal topology, hostnames, versions and traffic patterns, and an open Pushgateway or Alertmanager lets them inject fake metrics, delete your alerts, or create silences to mask an attack. Prometheus itself has only basic built-in auth and TLS (configured via --web.config.file); in practice you put it behind a reverse proxy / network policy / mesh mTLS and restrict it to your VPC. Treat receiver secrets as secrets: Slack webhook URLs and PagerDuty routing keys must never be committed in alertmanager.yml. Alertmanager does not expand environment variables in its config, so the ${...} placeholders above have to be rendered at deploy time from a secret store (with envsubst, Helm, or similar), or replaced by the file-based fields (api_url_file, routing_key_file) pointing at a mounted secret. Scrape over TLS with auth for sensitive targets (scheme: https, tls_config, authorization). In Grafana, replace the lab’s anonymous-admin with real authentication (OAuth/SAML/LDAP), use least-privilege org roles and folder permissions, and prefer access: proxy data sources so backend credentials never reach the browser. Finally, remember monitoring is security detection: error spikes, traffic anomalies and saturation are often the first visible sign of an attack. Route security-relevant alerts to the right team, and protect the monitoring plane so attackers cannot blind it.

Interview & exam questions

Why does Prometheus pull rather than push, and what is the one case where you push? Pull gives you target liveness for free (the up metric), avoids every app holding push credentials, keeps targets simple (just expose /metrics), and centralises discovery and rate control. The exception is short-lived batch jobs that finish before any scrape — those push their final result to the Pushgateway, which Prometheus then scrapes. The Pushgateway is only for batch results, not a general push channel.
Walk me through what happens from a scrape to a stored sample. Prometheus’s retrieval component resolves targets (via static config or service discovery), runs relabel_configs to decide whether/where to scrape, HTTP-GETs /metrics each scrape_interval, parses the exposition format, applies metric_relabel_configs, and appends the samples to the head block while writing them to the WAL for crash safety. Every ~2h the head flushes to an immutable on-disk block; compaction later merges blocks; retention deletes old ones.
Counter vs gauge — and why never graph a raw counter? A counter only increases (and resets to 0 on restart); a gauge goes up and down. You never graph a counter directly because a monotonically rising line is meaningless — you graph its rate (rate(x_total[5m]) = per-second change). rate/increase are counter-aware and correct for resets automatically.
Histogram vs summary — and why can you compute a fleet-wide p99 from one but not the other? A histogram exposes raw cumulative _bucket{le} counts; you sum by (le) those buckets across all instances and then apply histogram_quantile(), giving a correct aggregate percentile. A summary computes quantiles inside each process and ships the results — and you cannot average percentiles, so summaries cannot produce a correct cluster-wide p99. Prefer histograms for latency.
What is relabel_configs versus metric_relabel_configs? Both rewrite label sets with the same grammar, but relabel_configs runs over a target’s labels before scraping — to keep/drop targets and build __address__/__metrics_path__ from service-discovery meta-labels — while metric_relabel_configs runs over each sample after scraping, mainly to drop high-cardinality/noisy metrics before they hit the TSDB.
You discover Kubernetes pods via SD but only want the annotated ones, on a custom port. How? Use relabel_configs: a keep action on __meta_kubernetes_pod_annotation_prometheus_io_scrape matching "true", then a replace that rewrites __address__ to host:port using the prometheus.io/port annotation, and replace rules to promote __meta_kubernetes_namespace/pod to real labels.
What is cardinality, how does it blow up, and how do you fix it? Cardinality is the number of unique time series, bounded above by the product of the number of values each label can take. It explodes when you put an unbounded value (user id, request id, raw URL, timestamp) in a label, exhausting memory and slowing queries. Find offenders with topk(10, count by (__name__)({__name__=~".+"})), drop them with metric_relabel_configs, set sample_limit as a safety valve (it fails any scrape that exceeds the limit), normalise routes (/orders/:id), and push per-request detail to logs/traces.
Explain for:, labels and annotations on an alerting rule. for: is the pending duration the expression must stay true before the alert fires — the anti-flapping control. labels are for machines (Alertmanager routes and groups on severity/team). annotations are for humans — the templated summary/description/runbook that fill the notification (using $labels, $value).
Alertmanager: what do grouping, inhibition and silences each do? Grouping (group_by) batches related alerts into one notification so a multi-failure incident is not a message storm; group_wait/group_interval/repeat_interval pace it. Inhibition suppresses lower-severity alerts while a related higher-severity one fires (scoped with equal) — e.g. don’t page on per-service warnings when the whole cluster is critical. A silence is a temporary, audited, label-matched mute created in the UI for planned maintenance.
What is a recording rule and when do you use one? A recording rule precomputes an expression on a schedule and stores the result as a new series (named level:metric:operation). Use it for expressions that are expensive and queried often — multi-instance histogram quantiles, SLO error-ratios, burn-rate inputs — so dashboards and alerts read a cheap precomputed series and every consumer uses the same definition.
How does Prometheus do long-term and highly-available storage? The local TSDB is single-node and meant for short/medium retention. For long retention, a global view across many Prometheis, and HA, you use remote_write to a scalable backend — Thanos, Grafana Mimir, VictoriaMetrics, or a managed service (AMP/Azure/GCP). Typical pattern: short local retention as a buffer, remote-write everything to the durable backend.
Prometheus + Alertmanager vs Grafana-managed alerting — when each? Prometheus + Alertmanager keeps rules in Git, evaluates against Prometheus only, and routes via Alertmanager (routing tree, inhibition, silences) — best for Prometheus-centric, GitOps, multi-Prometheus setups. Grafana-managed alerting evaluates against any Grafana data source and routes via contact points/notification policies — best for cross-data-source or dashboard-driven alerts. Keep one source of truth per alert.
What does the up metric tell you and where does it come from? up is a synthetic gauge Prometheus writes for every target on every scrape: 1 if the scrape succeeded, 0 if it failed. It is the free liveness signal the pull model gives you, and up == 0 is the canonical TargetDown alert.
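
The cardinality answer above is easy to check with arithmetic. A minimal sketch, using the lab app's label shape and an invented user_id label to show the blow-up (series_upper_bound is a hypothetical helper):

```python
from math import prod

def series_upper_bound(label_value_counts, instances=1):
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_value_counts.values()) * instances

# The lab counter: 1 method x 2 routes x 2 statuses on one instance.
bounded = {"method": 1, "route": 2, "status": 2}
print(series_upper_bound(bounded))

# Add one unbounded label and the same metric becomes hundreds of thousands of series.
unbounded = {**bounded, "user_id": 100_000}
print(series_upper_bound(unbounded))
```

This is an upper bound because not every combination occurs in practice, but a single unbounded label is enough to dominate the product.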

Quick check

What two things must be true for rate(http_requests_total[5m]) to be meaningful (vs rate of a gauge, or too short a range)?
You need a cluster-wide p95 latency from 8 pods. What metric type, and what is the exact order of operations in PromQL?
What is the difference between relabel_configs and metric_relabel_configs?
In Alertmanager, which feature stops a whole-cluster-down critical alert from also paging you about every individual warning?
Name two ways the local Prometheus TSDB’s limitations are addressed in production.

Answers

The metric must be a counter (not a gauge), and the range must span at least ~2–4 scrape intervals so the window contains enough samples; rate is counter-aware and corrects for resets.
A histogram: take rate() of the _bucket series, sum by (le) across the pods first, then apply histogram_quantile(0.95, ...). Summing buckets before the quantile is why histograms aggregate across instances.
relabel_configs rewrites/filters targets before scraping (keep/drop targets, build __address__ from SD meta-labels); metric_relabel_configs rewrites/filters samples after scraping (mainly dropping high-cardinality metrics before storage).
Inhibition (an inhibit_rules entry suppressing severity="warning" while severity="critical" fires, scoped with equal).
Remote-write to a scalable/durable backend (Thanos, Mimir, VictoriaMetrics, or a managed service) for long retention and a global view, and running Prometheus itself redundantly so one node failing does not lose data (Alertmanager HA protects notifications, not storage); bounding retention and cardinality also keeps the single node healthy.

Exercise

Extend the lab into a small but realistic monitoring setup:

Add the blackbox exporter to the Compose stack and a blackbox-http scrape job that probes https://kloudvin.com and your local app’s /metrics, using the __param_target relabeling idiom. Graph probe_success and probe_duration_seconds.
Add a burn-rate alert for the demo app: a fast-burn rule (e.g. a 14.4× burn rate measured over both a 1h and a 5m window, against a 99% SLO) with severity: page, reusing a recording rule for the error ratio. (Lean on the burn-rate maths from the observability-fundamentals lesson.)
Route by severity in Alertmanager: send severity: page to one receiver and severity: ticket to another, with continue so pages are also recorded; add a second alerting rule at ticket severity to prove the routing.
Provision a dashboard as code: build a RED overview (rate, error ratio, p50/p95/p99) in Grafana, add a $job template variable (label_values(http_requests_total, job)), export the JSON, and drop it into grafana/provisioning/dashboards/ so it loads automatically on the next docker compose up.
Trigger and observe: raise the app’s error rate, confirm the burn-rate alert fires, the page routes to the right receiver, and the dashboard’s error panel reacts; then create a silence for a maintenance window and confirm it mutes.
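
For the first task, the __param_target idiom usually looks like the sketch below. The Compose service names blackbox and app, the app port 8000, and the http_2xx module are assumptions; substitute whatever your stack actually uses.

```yaml
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe                 # the exporter's probe endpoint
    params:
      module: [http_2xx]                 # assumed module name in blackbox.yml
    static_configs:
      - targets:
          - https://kloudvin.com
          - http://app:8000/metrics      # assumed service name and port
    relabel_configs:
      # 1. The listed URL becomes the ?target= query parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # 2. Keep the probed URL as the instance label for readable graphs.
      - source_labels: [__param_target]
        target_label: instance
      # 3. Point the actual scrape at the exporter itself.
      - target_label: __address__
        replacement: blackbox:9115       # assumed exporter service and port
```

The order matters: the original address has to be copied into __param_target before the last rule overwrites __address__ with the exporter's own address.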

Capture in your notes: the blackbox relabeling block, the burn-rate expr, the Alertmanager routing tree, and a screenshot of the provisioned dashboard with the $job dropdown.
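
As a starting point for the burn-rate expr, a rule file along these lines is one plausible shape. It is a sketch under stated assumptions: the group, rule and alert names are hypothetical, and it assumes http_requests_total carries a status label whose 5xx values mark errors, which your demo app may not do.

```yaml
groups:
  - name: demo-app-slo                   # hypothetical group name
    rules:
      # Error ratio over the short window (assumes a `status` label).
      - record: job:http_requests:error_ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      # Same ratio over the long window.
      - record: job:http_requests:error_ratio_rate1h
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
            /
          sum by (job) (rate(http_requests_total[1h]))
      # 99% SLO gives a 1% error budget; 14.4 x 0.01 = 0.144.
      # Both windows must exceed the threshold before the alert fires.
      - alert: DemoAppErrorBudgetFastBurn
        expr: |
          job:http_requests:error_ratio_rate1h > (14.4 * 0.01)
          and
          job:http_requests:error_ratio_rate5m > (14.4 * 0.01)
        labels:
          severity: page
        annotations:
          summary: "Demo app is spending its error budget more than 14.4x too fast"
```

Requiring the 5m window as well as the 1h window means the alert stops firing soon after the errors stop, instead of lingering until the hour-long average decays.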

Certification mapping

Exam / certification	Relevant objectives
Prometheus Certified Associate (PCA)	The whole exam: architecture & pull model, exposition format & metric types, instrumentation, prometheus.yml/scrape_configs/service discovery & relabeling, exporters, PromQL (selectors, rate, histogram_quantile, aggregation), recording & alerting rules, Alertmanager (routing/grouping/inhibition/silences), TSDB & remote-write, Grafana basics
AWS Certified DevOps Engineer – Professional (DOP-C02)	Monitoring & observability design; metrics/alerting; integrating Amazon Managed Service for Prometheus and Grafana; automated response to alerts
Microsoft Azure DevOps Engineer Expert (AZ-400)	Implement monitoring/observability; metrics, dashboards and alerting; Azure Monitor managed Prometheus + Azure Managed Grafana; defining and tracking KPIs/SLIs
Certified Kubernetes Administrator / Application Developer (CKA/CKAD)	Cluster monitoring fundamentals; understanding metrics pipelines (deeper Operator workflow is the dedicated K8s lesson)
Google Cloud Professional DevOps Engineer	SLI/SLO/alerting strategy; Cloud Monitoring (Managed Prometheus); building dashboards and reducing alert fatigue

Glossary

Pull / scrape model — Prometheus periodically HTTP-GETs each target’s /metrics, rather than targets pushing.
Exporter — a bridge that exposes a system’s stats in Prometheus format (node_exporter, blackbox_exporter, cAdvisor).
Pushgateway — a holding area for short-lived batch jobs to push final metrics so Prometheus can scrape them.
scrape_configs / job / instance — the scrape job definitions; each job has a job label, each target an instance label.
Service discovery (SD) — dynamically finding targets (Kubernetes, EC2, file, DNS…), exposing __meta_* labels.
Relabeling — rule pipeline that rewrites/filters labels: relabel_configs (targets, pre-scrape) vs metric_relabel_configs (samples, post-scrape).
TSDB / WAL / block / compaction — the local time-series store: write-ahead log for durability, immutable on-disk blocks, merged over time by compaction.
Remote-write — streaming samples to an external long-term/HA backend (Thanos, Mimir, VictoriaMetrics, managed services).
Counter / gauge / histogram / summary — the metric types: monotonic total / up-down snapshot / cumulative _buckets for percentiles / client-side quantiles.
Exposition format — the plain-text /metrics wire format (# HELP, # TYPE, one sample per line).
PromQL — Prometheus’s query language; instant vector (current value per series) vs range vector ([5m] window for rate()).
rate / irate / increase — counter-aware per-second rate (avg) / instant rate / total increase over a window.
histogram_quantile — computes a percentile from summed _bucket rates; the reason histograms aggregate across instances.
Recording rule — a precomputed, stored expression (level:metric:operation) for expensive, frequently-queried PromQL.
Alerting rule — a PromQL expression with for, labels, annotations that fires alerts to Alertmanager.
Alertmanager — the separate process that dedupes, groups, routes (routing tree), inhibits, silences and notifies (receivers).
Routing tree / grouping / inhibition / silence — match alerts to receivers / batch related alerts / suppress lower-severity during higher / temporary manual mute.
Receiver / contact point — a named notification destination (Slack, PagerDuty, email, webhook).
Grafana data source / panel / dashboard / variable — a backend connection / a single visualisation+query / a grid of panels / a templating dropdown.
Provisioning — declaring Grafana data sources and dashboards from config files so the stack is reproducible.
kube-prometheus-stack — the Helm chart that installs Prometheus Operator + Prometheus + Alertmanager + Grafana + exporters with ServiceMonitor/PodMonitor.

Next steps

You can now stand up, configure and operate the Prometheus and Grafana stack end to end — scraping with relabeling, instrumenting an app, querying with PromQL, alerting through Alertmanager, and visualising with provisioned Grafana dashboards. This closes the loop opened by Observability Fundamentals for DevOps (the theory these tools implement) and feeds directly into SRE & Incident Management (where these alerts become pages and these dashboards drive incident response). For the Kubernetes-native version of this stack — the Prometheus Operator, ServiceMonitor/PodMonitor and kube-state-metrics — see Kubernetes Monitoring, In Depth. Then continue the track with Deployment Strategies: Rolling, Blue/Green, Canary, Progressive Delivery & Rollback, where the SLO metrics you now collect become the automated gate that promotes or rolls back a release.