Most PromQL bugs are not syntax errors. They are semantic errors that return a plausible number, ship to a dashboard, and quietly lie for six months. This is a working engineer’s tour of the parts that bite: the data model, the rate() family, histograms, aggregation, vector matching, and how to validate a query before it pages someone at 3am.
1. The data model you must internalize
A Prometheus time series is uniquely identified by a metric name plus a set of labels. The metric name is itself just sugar for the __name__ label, so these are equivalent:
http_requests_total{job="api", code="200"}
{__name__="http_requests_total", job="api", code="200"}
Every series is an append-only stream of (timestamp, float64) samples. There are four metric types (counter, gauge, histogram, summary), but here is the part people miss: the type is metadata, not enforcement. PromQL does not stop you from calling rate() on a gauge or avg on a counter. The semantics live in your head and in the metric’s # TYPE line, not in the query engine. Get the type wrong and you get a number, just not a true one.
The mental model that prevents most mistakes:
| Type | What it stores | Correct operations |
|---|---|---|
| Counter | Monotonic cumulative total, resets to 0 on restart | rate, irate, increase; never read raw |
| Gauge | A value that goes up and down | read raw, avg/max/min, delta, deriv |
| Histogram | Cumulative _bucket counters + _sum + _count | histogram_quantile, rate on buckets |
| Summary | Pre-computed _sum + _count + client-side quantiles | rate on sum/count; quantiles are not aggregatable |
Counter metrics conventionally end in _total. That suffix is a hint to humans; it does not change behavior. The actual behavior you rely on is monotonicity plus reset detection.
2. Counters and the rate() family
Counters only ever go up (until a process restart sets them back to 0). A raw counter value is meaningless on a dashboard. What you want is a per-second rate, and rate() is the workhorse:
rate(http_requests_total{job="api"}[5m])
This computes the per-second average increase over a 5-minute window. Critically, rate() and increase() both detect counter resets: if the value drops between two samples, the engine assumes a restart and treats the drop as a reset rather than a negative rate. That is why you must never compute rates by hand with subtraction.
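The reset handling described above can be sketched in a few lines of Python. This is a simplified model rather than the engine's code: it ignores timestamps and extrapolation, and the helper name `increase_with_resets` is made up for illustration.

```python
def increase_with_resets(values):
    """Total increase of a counter, treating any drop as a restart from 0."""
    total = 0.0
    for prev, curr in zip(values, values[1:]):
        if curr < prev:
            # Counter went backwards: assume a reset, so everything
            # counted since the restart (curr) is new increase.
            total += curr
        else:
            total += curr - prev
    return total


# A process restarts between the 2nd and 3rd scrape.
samples = [100, 110, 5, 15]

naive = samples[-1] - samples[0]           # plain subtraction
corrected = increase_with_resets(samples)

print(naive)      # -85
print(corrected)  # 25.0
```

Plain subtraction reports a negative increase, while the reset-aware version recovers the 25 requests that actually happened.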
The three functions and when to use each:
- rate(v[w]): per-second average over the window. Smooth, resilient, the default for alerting and graphing. Use a window of at least 4x your scrape interval so a single missed scrape does not blow it up.
- irate(v[w]): instantaneous rate from only the last two samples in the window. Reacts fast to spikes but is jumpy; good for high-resolution graphs you are watching live, bad for alerts because it is noisy.
- increase(v[w]): total increase over the window (not per-second). It is essentially rate * window_seconds, with the same reset handling. Use it for “how many errors in the last hour” style questions.
# Per-second error rate, smoothed
rate(http_requests_total{code=~"5.."}[5m])
# Live, twitchy view of the same thing
irate(http_requests_total{code=~"5.."}[1m])
# Absolute count of 5xx in the last hour
increase(http_requests_total{code=~"5.."}[1h])
Both rate and increase extrapolate to the window edges, so increase over [1h] can return a non-integer like 1342.8 even though you counted whole requests. That is expected, not a bug. Treat it as an estimate.
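To see how whole requests turn into a fractional increase, here is a deliberately naive Python model of edge extrapolation. It assumes the first and last samples sit close enough to the window edges that the observed increase is simply stretched across the full window. The real engine applies extra guards that this sketch leaves out, and `extrapolated_increase` is a hypothetical helper.

```python
def extrapolated_increase(first_t, last_t, first_v, last_v, window_seconds):
    """Stretch the increase seen between first and last sample to the full window.

    Assumes no counter reset inside the window (last_v >= first_v).
    """
    observed = last_v - first_v      # whole requests actually seen
    covered = last_t - first_t       # seconds between first and last sample
    return observed * (window_seconds / covered)


# 60s window, but the first and last samples are only 50s apart.
estimate = extrapolated_increase(first_t=5, last_t=55,
                                 first_v=100, last_v=107,
                                 window_seconds=60)
print(round(estimate, 2))  # 8.4
```

Seven real requests become an estimate of 8.4 because the samples cover only 50 of the 60 seconds.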
A subtle trap: rate() needs at least two samples inside the window to return anything. A window narrower than your scrape interval, or a series that just started, yields empty results. If a panel is mysteriously blank, widen the range first.
3. Range vs instant vectors
This is the distinction that unlocks PromQL. An instant vector is one sample per series at the evaluation timestamp. A range vector is a set of samples per series over a lookback window, written with [5m].
http_requests_total # instant vector: one value per series
http_requests_total[5m] # range vector: many values per series
Functions like rate, increase, avg_over_time, and max_over_time consume a range vector and return an instant vector. You cannot graph a range vector directly, and you cannot pass an instant vector to rate(). Type mismatch is the single most common beginner error, and the error message (“expected type range vector … got instant vector”) tells you exactly which way you got it backwards.
The window also interacts with how Prometheus evaluates a graph. When you render a range over time, Prometheus evaluates your instant-vector expression at each step across the dashboard’s time range; the [5m] is the lookback applied at each of those steps. Two rules keep this honest:
- Make the range window at least as large as the step; otherwise samples that fall between evaluation points are never read, and short spikes can vanish from the graph.
- Make the window a small multiple of the scrape interval (commonly 4x). With a 30s scrape, [2m] gives you ~4 samples per window, enough to survive one missed scrape.
The off-by-one scrape trap: with a window exactly equal to the scrape interval, you frequently catch only one sample (sometimes zero, sometimes two) depending on alignment, so rate() flickers between a value and nothing. Always give the window slack.
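The window-versus-scrape arithmetic is easy to check with a small Python model. It assumes perfectly regular scrapes and a window that excludes its left edge; real scrapes jitter, which is where the zero-or-two cases come from. `samples_in_window` is an illustrative helper, not a Prometheus API.

```python
def samples_in_window(eval_time, window, scrape_interval, first_scrape=0, missed=()):
    """Count scrapes with eval_time - window < t <= eval_time.

    `missed` holds timestamps of scrapes that failed and left no sample.
    """
    count = 0
    t = first_scrape
    while t <= eval_time:
        if t > eval_time - window and t not in missed:
            count += 1
        t += scrape_interval
    return count


SCRAPE = 30  # seconds

# Window equal to the scrape interval: never two samples, so rate() has nothing to work with.
print([samples_in_window(t, 30, SCRAPE) for t in (300, 310, 320)])  # [1, 1, 1]

# Window of 4x the scrape interval: four samples, three if one scrape is missed.
print(samples_in_window(300, 120, SCRAPE))                 # 4
print(samples_in_window(300, 120, SCRAPE, missed={270}))   # 3
```

Even with a missed scrape, the 4x window still holds the two or more samples that rate() needs.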
4. Histograms done right
A Prometheus histogram is not one series. It is a family: a set of cumulative _bucket counters (each labeled with an upper bound le), plus _sum and _count. “Cumulative” means the le="0.5" bucket counts every observation <= 0.5, the le="1" bucket counts everything <= 1 (including the <= 0.5 ones), and so on up to le="+Inf".
To get a percentile you rate() the buckets first, then apply histogram_quantile:
histogram_quantile(
0.95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
Read it inside-out: rate(..._bucket[5m]) turns each cumulative bucket into a per-second rate, sum by (le) aggregates across all the other labels while preserving le (this is non-negotiable — drop le and the function has nothing to work with), and histogram_quantile(0.95, ...) interpolates the 95th percentile.
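The interpolation step can be modelled in Python. This is a minimal sketch of linear interpolation over cumulative buckets, written from the description above rather than copied from the engine; the function name and the sample bucket counts are assumptions for illustration.

```python
import math


def classic_histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the matched bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Per-le rates after `sum by (le) (rate(...))`, as (le, cumulative value) pairs.
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
print(round(classic_histogram_quantile(0.95, buckets), 3))  # 0.75
```

The p95 lands halfway through the (0.5, 1.0] bucket, so the estimate is 0.75 regardless of where in that bucket the real observations sat.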
Two truths about classic histograms:
- Accuracy is bounded by your buckets. histogram_quantile does linear interpolation within the matched bucket. If your p99 lands in a bucket spanning [1s, 10s], the answer is a guess somewhere in that range. Lay out buckets around your SLO thresholds, not on a generic exponential scale, if you care about a specific percentile.
- You cannot average percentiles. Averaging per-instance p95 values across pods is mathematically meaningless and typically understates tail latency, because a few slow instances are diluted by the healthy majority. Always aggregate the buckets and compute the quantile once, as above. This is “the lie of averaged percentiles,” and it is everywhere in dashboards built by people who did not read this section.
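The averaged-percentile problem is easy to reproduce with made-up numbers. The Python sketch below uses raw latencies and a nearest-rank percentile instead of buckets, purely to keep it short. The fleet shape (38 healthy pods, 2 slow ones, 100 requests each) is hypothetical.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer such as 99."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # ceil(pct/100 * n) in integer math
    return ordered[rank - 1]


# Hypothetical fleet: latencies in seconds, 100 requests per pod.
healthy = [0.2] * 98 + [1.0] * 2
cold = [0.2] * 50 + [8.0] * 50
pods = [healthy] * 38 + [cold] * 2

avg_of_p99 = sum(percentile(p, 99) for p in pods) / len(pods)
pooled_p99 = percentile([v for p in pods for v in p], 99)

print(round(avg_of_p99, 2))  # 1.35
print(pooled_p99)            # 8.0
```

The average of per-pod p99 values reads 1.35s, while the p99 over all requests is 8s. Pooling the raw observations plays the same role here that aggregating buckets before the quantile plays in PromQL.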
For an SLO-style “what fraction of requests were under 300ms” question, skip quantiles entirely and ratio the buckets:
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
Newer Prometheus supports native (exponential) histograms, which store buckets far more efficiently and sidestep manual bucket layout. The mental model is the same (histogram_quantile works on them too), but the wire format and storage differ. If you are on a recent version, evaluate them; the le-bucket pain largely goes away.
5. Aggregation: by vs without
Aggregation operators (sum, avg, max, min, count, topk, quantile, …) collapse many series into fewer. The clause you attach decides which labels survive, and that decision is your dashboard’s grouping.
- by (labels) keeps only the listed labels; everything else is collapsed.
- without (labels) keeps everything except the listed labels.
# Total request rate per service (keep only `job`)
sum by (job) (rate(http_requests_total[5m]))
# Same, but keep every label except the noisy `instance`
sum without (instance) (rate(http_requests_total[5m]))
without is often the more robust choice: when someone adds a new label upstream, by silently drops it (your panels do not change) while without automatically carries it through (your panels gain detail). For dashboards meant to surface new dimensions, prefer without. For tightly controlled top-line numbers, by is clearer about intent.
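The difference in how the two clauses react to a new upstream label can be modelled in Python. This is a sketch under simplifying assumptions: series are plain (labels, value) pairs, the metric name is ignored, and `sum_agg` is a hypothetical stand-in for the sum operator.

```python
from collections import defaultdict


def sum_agg(series, by=None, without=None):
    """Sum values, grouping by the labels that survive the clause."""
    groups = defaultdict(float)
    for labels, value in series:
        if by is not None:
            kept = {k: v for k, v in labels.items() if k in by}
        else:
            kept = {k: v for k, v in labels.items() if k not in (without or ())}
        groups[tuple(sorted(kept.items()))] += value
    return dict(groups)


# Someone upstream has just added a `region` label.
series = [
    ({"job": "api", "instance": "a", "region": "eu"}, 3.0),
    ({"job": "api", "instance": "b", "region": "eu"}, 2.0),
    ({"job": "api", "instance": "c", "region": "us"}, 4.0),
]

print(sum_agg(series, by={"job"}))
# {(('job', 'api'),): 9.0}
print(sum_agg(series, without={"instance"}))
# {(('job', 'api'), ('region', 'eu')): 5.0, (('job', 'api'), ('region', 'us')): 4.0}
```

With by, the panel still shows one line per job. With without, it splits by region as soon as the label appears.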
Order matters for reset correctness: always rate() before you sum. sum(rate(x[5m])) is correct. rate(sum(x)[5m]) does not even parse, and its subquery form rate(sum(x)[5m:]) is wrong, because summing counters across instances destroys per-instance reset detection: one pod restarting looks like the aggregate counter going backwards.
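A toy Python example makes the ordering problem concrete. It reuses the simplified reset rule (a drop means restart from 0) and ignores timestamps; the two pods and their values are invented.

```python
def increase(values):
    """Counter increase with reset detection (a drop means restart from 0)."""
    total = 0
    for prev, curr in zip(values, values[1:]):
        total += curr if curr < prev else curr - prev
    return total


pod_a = [100, 110, 120]
pod_b = [200, 210, 5]      # pod_b restarts before the last scrape

# Rate first, then sum: resets are detected per series, then results are added.
rate_then_sum = increase(pod_a) + increase(pod_b)

# Sum first, then rate: the summed series drops from 320 to 125, which looks
# like one giant reset, so the whole 125 is counted as new increase.
summed = [a + b for a, b in zip(pod_a, pod_b)]
sum_then_rate = increase(summed)

print(rate_then_sum)   # 35
print(sum_then_rate)   # 145
```

Only 35 requests occurred, but aggregating before taking the rate reports 145.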
topk and bottomk are your friends for “noisiest N” panels, but note they are evaluated per timestamp, so the membership of the top 5 can change across a graph:
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))
6. Joins and arithmetic across metrics
Binary operators between two instant vectors match series by their full label set by default. When the label sets differ, you must tell PromQL how to match with on (match only these labels) or ignoring (match on everything except these).
A canonical ratio — error rate as a fraction of total — where both sides already share labels:
sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
/
sum by (job) (rate(http_requests_total[5m]))
That works because both sides carry identical job labels. The interesting case is many-to-one matching, where one side has extra labels. Say you want per-code error fractions against a per-job total:
sum by (job, code) (rate(http_requests_total[5m]))
/ on (job) group_left
sum by (job) (rate(http_requests_total[5m]))
on (job) matches by job; group_left says the left side is the “many” and may have extra labels (here, code). The result keeps the left side’s full labeling. Use group_right for the mirror image.
group_left also does enrichment — pulling labels from a metadata metric like kube_pod_info onto your real metric:
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))
* on (pod) group_left(node)
kube_pod_info
Here group_left(node) copies the node label off kube_pod_info onto each CPU series. The arithmetic (* 1, effectively, since kube_pod_info is 1) is just a vehicle for the label join. This pattern — multiply by an info metric to attach its labels — is how you slice app metrics by infrastructure dimensions you did not export directly.
If a binary operation returns “many-to-many matching not allowed,” you have duplicate series on the side you think is unique. Add the missing label to your on/ignoring set, or aggregate that side down to one series per match group first.
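Many-to-one matching can be sketched in Python as a lookup keyed on the on labels. This is an illustrative model only: `divide_group_left` is a made-up helper, series are (labels, value) pairs, and the sample numbers are invented.

```python
def divide_group_left(left, right, on):
    """Many-to-one division: each left series finds its single right match.

    Assumes non-zero values on the right side.
    """
    def key(labels):
        return tuple((name, labels.get(name)) for name in on)

    index = {}
    for labels, value in right:
        if key(labels) in index:
            # Two right-hand series share the same match key.
            raise ValueError("many-to-many matching not allowed")
        index[key(labels)] = value

    result = []
    for labels, value in left:
        match = index.get(key(labels))
        if match is not None:            # unmatched series are dropped
            result.append((labels, value / match))
    return result


per_code = [
    ({"job": "api", "code": "200"}, 90.0),
    ({"job": "api", "code": "500"}, 10.0),
]
per_job = [({"job": "api"}, 100.0)]

for labels, fraction in divide_group_left(per_code, per_job, on=["job"]):
    print(labels, fraction)
# {'job': 'api', 'code': '200'} 0.9
# {'job': 'api', 'code': '500'} 0.1
```

The result keeps the left side's labels, and a duplicate key on the "one" side is exactly what triggers the many-to-many error.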
7. Subqueries, offset, and trend
offset shifts the lookback in time — perfect for week-over-week comparisons:
# This week's rate vs the same window 7 days ago
sum(rate(http_requests_total[5m]))
/
sum(rate(http_requests_total[5m] offset 7d))
Subqueries let you run a range query over the result of an instant query, with syntax [outer_range:resolution]. This is how you ask “what was the max 5-minute rate over the last hour”:
max_over_time(
sum(rate(http_requests_total[5m]))[1h:1m]
)
The [1h:1m] evaluates the inner expression every 1 minute across a 1-hour window, then max_over_time reduces it. Subqueries are powerful but expensive — they materialize a lot of intermediate points — so prefer a recording rule for anything you run repeatedly.
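The cost of a subquery comes from the number of inner evaluations, which a short Python model can count. The sketch assumes steps aligned to multiples of the resolution and includes both window edges, so the exact count may differ by one from a real server. `request_rate` is a stand-in for the inner expression, with invented values.

```python
def subquery_points(eval_time, outer_range, resolution):
    """Timestamps at which the inner expression is evaluated."""
    start = eval_time - outer_range
    first = -(-start // resolution) * resolution   # round start up to a step boundary
    return list(range(first, eval_time + 1, resolution))


def max_over_subquery(inner, eval_time, outer_range, resolution):
    """Evaluate `inner(t)` at every subquery step and keep the maximum."""
    return max(inner(t) for t in subquery_points(eval_time, outer_range, resolution))


# Stand-in for sum(rate(http_requests_total[5m])) evaluated at time t:
# a flat 100 req/s with a one-minute burst to 250 req/s.
def request_rate(t):
    return 250.0 if 1800 <= t < 1860 else 100.0


points = subquery_points(eval_time=3600, outer_range=3600, resolution=60)
print(len(points))                                       # 61
print(max_over_subquery(request_rate, 3600, 3600, 60))   # 250.0
```

Each of those points is a full evaluation of the inner expression, which is why a recording rule is cheaper for anything run repeatedly.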
predict_linear fits a least-squares line over a range vector and extrapolates, which is the standard disk-fill / capacity query:
# Will free bytes hit zero within 4 hours? (4h = 14400s)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0
It learns the trend from the last 6 hours and projects 4 hours forward. Pair it with a for: clause in an alert so a brief blip does not page. Note predict_linear assumes a gauge and a roughly linear trend — it is a heuristic for capacity warnings, not a forecast you would bet SLA credits on.
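The least-squares projection can be written out in Python. This is a simplified sketch: it projects from the last sample's timestamp rather than from the query evaluation time, and the disk history is fabricated to be perfectly linear.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) pairs, projected
    `seconds_ahead` past the last sample's timestamp."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


GIB = 1024 ** 3
# Hypothetical disk losing 1.5 GiB of free space per hour, sampled hourly for 6 hours.
history = [(h * 3600, (10 - 1.5 * h) * GIB) for h in range(7)]

projected = predict_linear(history, 4 * 3600)
print(round(projected / GIB, 2))   # -5.0
print(projected < 0)               # True
```

With 1 GiB free now and a steady 1.5 GiB/hour loss, the projection four hours out is negative, so the `< 0` comparison would fire.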
Enterprise scenario
A payments platform I worked with paged on checkout latency using an averaged-percentile panel: avg(histogram_quantile(0.99, rate(checkout_duration_seconds_bucket[5m]))) across ~40 pods. During a regional failover, two pods got cold caches and their real p99 jumped to 8s, but the dashboard barely moved past 1.2s and no alert fired. Customers were timing out while the SLO board stayed green. The constraint: the team had inherited per-pod quantile panels and assumed avg was a harmless way to “roll them up.” Averaging percentiles is the lie from section 4: a slow minority is diluted by the healthy majority, and the dilution gets worse the more pods you spread the load across.
The fix was to aggregate buckets first and compute the quantile exactly once, preserving le:
histogram_quantile(
0.99,
sum by (le) (rate(checkout_duration_seconds_bucket[5m]))
)
We promoted that to a recording rule (job:checkout_duration:p99_5m) so the alert and the dashboard read the same materialized series, then unit-tested it with promtool test rules using a synthetic two-pod skew so a future “cleanup” couldn’t reintroduce the average. The cold-cache failover that had been invisible now showed a clean p99 spike to 8s and paged within the for: 5m window. Same data, same Prometheus, correct math — the difference was refusing to average a percentile.
Verify
Sanity-check every non-trivial query before it lands. The expression browser at /graph is the fastest loop, but promtool and rule unit tests are what keep correctness from regressing.
Confirm your config and rules even parse:
promtool check config prometheus.yml
promtool check rules rules/*.yml
Evaluate a query against a live server straight from the CLI:
promtool query instant http://localhost:9090 \
'sum by (job) (rate(http_requests_total[5m]))'
Inspect a metric’s actual type and labels before trusting it — do not assume from the name:
curl -s http://localhost:9090/api/v1/metadata | \
jq '.data.http_request_duration_seconds'
Unit-test recording and alerting rules so a refactor cannot silently break a pager. Given a rule file, write a tests.yml with synthetic series and expected outputs, then:
# tests.yml
rule_files:
  - rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api", code="500"}'
        values: '0+10x10' # 0,10,20,...,100: ten new errors every minute
    promql_expr_test:
      - expr: rate(http_requests_total{code="500"}[5m])
        eval_time: 10m
        exp_samples:
          - labels: '{job="api", code="500"}'
            value: 0.16666666666666666 # 10 per 60s
promtool test rules tests.yml
If that value surprises you, good: that is the per-second math and extrapolation from section 2 made concrete (10 errors per minute is 10/60 ≈ 0.167 per second). Lock it in a test.
Pitfalls and next steps
The recurring theme: PromQL will hand you a number for almost any expression, valid or not. Discipline comes from knowing the metric type, respecting range-vector semantics, and validating against real data. The highest-leverage next move is to convert your most expensive dashboard queries into recording rules — they evaluate once on the server, make panels load instantly, and become the stable, testable foundation your alerts should reference. From there, evaluate native histograms to retire manual bucket layout, and put every alerting rule under promtool test rules in CI so a query that lies can never reach production again.