
Taming Metric Cardinality: Relabeling, Limits, and Cost Governance in Prometheus

Cardinality is the single number that decides whether your Prometheus stack is a quiet utility or a recurring incident. It is the count of unique time series — every distinct combination of a metric name and its label values — and it governs three things at once: the RAM the head block needs to hold its index, how many samples a query must touch to compute an answer, and, the moment you remote-write to a vendor, the line item on your bill. One badly chosen label — a user ID, a full request URL, a Kubernetes pod name churning under an autoscaler — can multiply your series count by orders of magnitude in an afternoon. This is the working playbook: how to find the offenders, how to drop and rewrite labels with relabeling, how to set hard guardrails so a bad exporter can’t take the cluster down, and how to govern cardinality per team so it stays controlled.

1. Why cardinality is the load, not sample rate

Prometheus keeps an in-memory index of every active series and the most recent block of samples in a head block before flushing to disk. The cost that dominates is not how fast samples arrive — it is how many distinct series exist. A metric scraped once a minute with two million label combinations is far more expensive than a metric scraped every second with fifty.

The reason is multiplication. Total series for one metric is the product of the cardinalities of its labels:

http_requests_total{method, status, handler, instance}
   = |method| x |status| x |handler| x |instance|
   =    5     x    6     x    40     x    200       = 240,000 series

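Add one unbounded label, such as user_id with 100,000 values, and that metric alone explodes past the point where the node survives. The damage shows up in three places: the head-block memory needed for the index, the number of samples every query must touch, and the remote-write bill.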

The mental model to keep: a label is a dimension you slice by, not a field you store data in. If a value is unbounded or high-cardinality, it does not belong in a label.
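The multiplication is easy to sanity-check before a new label ships. Below is a minimal sketch in Python, assuming the label cardinalities from the example above. The helper name series_count is illustrative, and the result is a worst-case upper bound, since not every label combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series for one metric: the product of its label cardinalities."""
    return prod(label_cardinalities.values())


# Cardinalities taken from the http_requests_total example.
base = {"method": 5, "status": 6, "handler": 40, "instance": 200}
print(series_count(base))  # 240000

# Adding one unbounded label multiplies the whole product.
with_user = {**base, "user_id": 100_000}
print(series_count(with_user))  # 24000000000
```

Running the same arithmetic against a proposed label set during code review is often enough to catch an unbounded dimension before it reaches a scrape.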

2. Diagnose the offenders before you touch a config

Never relabel blind. Find what is actually expensive first. The fastest source of truth is the built-in TSDB status page at http://<prometheus>:9090/tsdb-status, which surfaces the head’s cardinality stats directly. The same data is exposed via the API:

# Top label names by number of distinct series they participate in,
# plus the most expensive label-value pairs and metric names.
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data'

That returns seriesCountByMetricName, seriesCountByLabelValuePair, labelValueCountByLabelName, and memoryInBytesByLabelName — the four lists that tell you exactly where the series live.

For ad-hoc hunting, PromQL answers the same questions interactively. These are the queries I run first on any unfamiliar cluster:

# Top 10 metric names by series count - the headline offenders
topk(10, count by (__name__)({__name__=~".+"}))
# Total active series in the head - your headline number
prometheus_tsdb_head_series
# Which job is generating the most series? Find the bad exporter.
topk(10, count by (job)({__name__=~".+"}))

To find which label on a specific metric is doing the damage, count distinct values per label:

# How many distinct values does each label of this metric carry?
count(count by (handler)(http_request_duration_seconds_bucket))
count(count by (user_id)(http_request_duration_seconds_bucket))

If that user_id line returns 80,000 and handler returns 40, you have found your fuse. Offline, promtool tsdb analyze reads a block on disk and prints the same breakdown without touching the running server — useful in CI against a snapshot:

promtool tsdb analyze /prometheus/data --limit=20

It prints the highest-cardinality labels, the label pairs with the most series, and the label names with the most unique values. That last list is the one that catches unbounded labels.
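The JSON from the status API can also be ranked in a script, for example as a CI check. The following is a sketch only: it assumes the response has already been saved or fetched, the helper name top_offenders is hypothetical, and the sample payload is invented for illustration and trimmed to two of the lists.

```python
import json


def top_offenders(payload: dict, key: str, n: int = 5) -> list[tuple[str, int]]:
    """Return the n largest (name, value) entries from one TSDB status list."""
    entries = payload["data"].get(key, [])
    ranked = sorted(entries, key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"]) for e in ranked[:n]]


# Hypothetical response body; real output contains more lists and entries.
sample = json.loads("""
{"status": "success", "data": {
  "seriesCountByMetricName": [
    {"name": "up", "value": 200},
    {"name": "http_request_duration_seconds_bucket", "value": 960000}
  ],
  "labelValueCountByLabelName": [
    {"name": "handler", "value": 40},
    {"name": "user_id", "value": 80000}
  ]
}}
""")

for key in ("seriesCountByMetricName", "labelValueCountByLabelName"):
    print(key, top_offenders(sample, key, n=1))
```

In practice the payload would come from the curl command shown earlier, saved to a file and loaded with json.load.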

3. Drop and rewrite labels with metric_relabel_configs

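The most important distinction in Prometheus relabeling is where it runs: relabel_configs is applied to targets before the scrape, using service-discovery metadata, while metric_relabel_configs is applied to the scraped samples before they are stored. Cardinality control happens in the second.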

Dropping a whole noisy metric you never query:

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]
    metric_relabel_configs:
      # Drop the Go GC pause summary (and its _sum/_count) nobody dashboards on
      - source_labels: [__name__]
        regex: "go_gc_duration_seconds.*"
        action: drop

Stripping a single high-cardinality label while keeping the metric — the labeldrop action removes a label by name, which collapses series that differ only in that dimension:

    metric_relabel_configs:
      # Remove the unbounded user_id label from everything in this job.
      # Series collapse: 80,000 -> ~40 once user_id is gone.
      - regex: "user_id"
        action: labeldrop

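A warning on labeldrop: it removes the label but does not aggregate values. If two series from the same scrape become identical once the label is gone, Prometheus cannot store both, and it reports a duplicate-sample error for that scrape. Make sure the label you drop is genuinely extra detail, not part of the series identity you need. If you need the values summed across the dropped dimension, aggregate with a recording rule (Section 6) instead.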
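Whether a labeldrop would leave identical label sets behind can be checked offline before the rule ships. The sketch below assumes you have a list of label sets from a single scrape of one target. The helper name and the sample data are hypothetical.

```python
from collections import Counter


def collisions_after_labeldrop(series: list[dict[str, str]], dropped: str) -> dict:
    """Map each label set that is no longer unique once `dropped` is removed
    to the number of original series that collapse onto it."""
    remaining = Counter(
        tuple(sorted((k, v) for k, v in labels.items() if k != dropped))
        for labels in series
    )
    return {labelset: n for labelset, n in remaining.items() if n > 1}


# Hypothetical samples from a single scrape.
scrape = [
    {"__name__": "http_requests_total", "handler": "/orders", "user_id": "1"},
    {"__name__": "http_requests_total", "handler": "/orders", "user_id": "2"},
    {"__name__": "http_requests_total", "handler": "/cart", "user_id": "1"},
]

for labelset, n in collisions_after_labeldrop(scrape, "user_id").items():
    print(n, dict(labelset))
```

A non-empty result means the remaining labels do not identify the series on their own, so dropping the label at scrape time would produce duplicates.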

Keeping only an allow-list of metrics from a chatty exporter — invert the logic with keep so you ingest a known set and discard the rest:

    metric_relabel_configs:
      # Keep only the four metrics we actually use from kube-state-metrics
      - source_labels: [__name__]
        regex: "kube_pod_status_phase|kube_deployment_status_replicas|kube_node_status_condition|kube_pod_container_resource_requests"
        action: keep

Truncating a high-cardinality label value instead of dropping it — rewrite path so /api/v1/orders/8a3f... becomes /api/v1/orders/:id:

    metric_relabel_configs:
      # Collapse UUID path segments into a placeholder
      - source_labels: [path]
        regex: "(/api/v1/orders/)[0-9a-f-]+"
        target_label: path
        replacement: "${1}:id"

metric_relabel_configs is your cheapest, most surgical control because it acts on data already at the server but not yet stored. Whatever you drop here costs zero memory, zero query time, and zero remote-write bill.
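A replace rule can be rehearsed against sample label values before it is deployed. This is a rough approximation in Python, not the Prometheus implementation: Prometheus uses RE2 syntax and anchors the regex at both ends, which is mimicked here with re.fullmatch. The helper name and the sample paths are illustrative.

```python
import re


def relabel_replace(value: str, regex: str, replacement: str) -> str:
    """Approximate a `replace` relabel rule applied to one label value.

    The regex must match the whole value (anchored, as in Prometheus).
    A non-matching value is returned unchanged.
    """
    match = re.fullmatch(regex, value)
    if match is None:
        return value
    # Translate Prometheus-style ${1} references into Python's \g<1> form.
    py_replacement = re.sub(r"\$\{(\d+)\}", r"\\g<\1>", replacement)
    return match.expand(py_replacement)


RULE_REGEX = r"(/api/v1/orders/)[0-9a-f-]+"
RULE_REPLACEMENT = "${1}:id"

# Hypothetical path values as they might appear in the path label.
for path in (
    "/api/v1/orders/8a3f2c1e-0b7d-4e1a-9c55-1f2e3d4c5b6a",
    "/api/v1/orders/8a3f2c1e/items",
    "/healthz",
):
    print(path, "->", relabel_replace(path, RULE_REGEX, RULE_REPLACEMENT))
```

The second path is left untouched because the anchored regex does not cover the trailing /items segment, which is the kind of gap worth finding before the rule goes live.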

4. Enforce hard limits so one bad target can’t win

Relabeling is a scalpel; limits are the circuit breaker. A new exporter pushed by a team that did not read this article should fail its own scrape, not take down your Prometheus. Two per-scrape limits do this.

sample_limit caps how many samples a single scrape may yield. Exceed it and the entire scrape is dropped and marked failed — a loud, visible signal rather than silent series creep:

scrape_configs:
  - job_name: "app"
    sample_limit: 5000          # whole scrape fails past 5k series per target
    label_limit: 30             # whole scrape fails if any sample has >30 labels
    label_name_length_limit: 200
    label_value_length_limit: 1000
    static_configs:
      - targets: ["app:8080"]

label_limit, label_name_length_limit, and label_value_length_limit work the same way: if any sample, after metric relabeling, carries too many labels or a label name or value that is too long, the entire scrape is treated as failed. Those symptoms are the signature of a runaway label-generation bug. Set a sane default in global so every job inherits a ceiling:

global:
  scrape_interval: 30s
  scrape_timeout: 10s
  # Applied to every job unless overridden
  sample_limit: 10000
  label_limit: 30

When a scrape is rejected for exceeding a limit, the whole scrape counts as failed: the target’s up metric drops to 0, none of its samples are ingested, and the failure is counted on the server. Alert on it so a team learns immediately rather than after the bill arrives:

# Any scrape rejected by sample_limit in the last 10 minutes.
# The counter is server-wide, so use up == 0 to find the affected target.
increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 0

Target churn protection

Autoscaling and CI runners create the other cardinality leak: churn. Pods come and go, each with a unique pod or instance label, so series that are no longer scraped still occupy the head until they age out, and the cumulative unique count over a day dwarfs the instantaneous count. Two defenses:

scrape_configs:
  - job_name: "kubernetes-pods"
    target_limit: 2000     # refuse to scrape if SD returns >2000 targets
    metric_relabel_configs:
      - regex: "pod_template_hash|controller_revision_hash"
        action: labeldrop

5. Kill the usual high-cardinality suspects

Most cardinality fires come from the same short list. Burn them out at the source:

Offending label | Why it explodes | Fix
--- | --- | ---
user_id, customer_id, tenant_id | Unbounded, grows with your business | labeldrop, or aggregate it away (Section 6)
path, url, endpoint (raw) | Unique per ID/query string | Normalize to route template :id via relabel replacement
pod, instance under autoscaling | Churns on every deploy/scale event | labeldrop if not sliced by; use target_limit
trace_id, request_id, span_id | One value per request: pure cardinality bomb | Never a label. Belongs in traces/logs
email, session_id, IP addresses | Effectively unbounded; also a PII risk | Drop entirely
status_message, free-text errors | Arbitrary strings | Use a bounded status_code instead

The governing rule, written so a developer can self-check before adding a label:

If you cannot name a finite, reasonably small set of values a label will ever take, it is not a label. Put that value in a trace or a log line, and slice metrics by something bounded.

A particularly common trap is histogram buckets. A histogram with a high-cardinality label is not one extra series — it is (buckets + 2) extra series per label-value combination. Check whether you actually need per-handler latency at full bucket resolution before you ship it.
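The histogram multiplier is worth computing explicitly. This sketch assumes a classic histogram, where "buckets" counts every le series including +Inf. The bucket count and label cardinalities are hypothetical.

```python
def histogram_series(buckets: int, label_combinations: int) -> int:
    """Series for a classic histogram: per label combination, one series
    per bucket plus the _sum and _count series."""
    return (buckets + 2) * label_combinations


# Hypothetical: 12 buckets, 40 handlers x 5 methods x 6 statuses.
print(histogram_series(12, 40 * 5 * 6))           # 16800

# The same histogram with an 80,000-value user_id label added.
print(histogram_series(12, 40 * 5 * 6 * 80_000))  # 1344000000
```

Halving the bucket count or dropping one bounded label shrinks the total by a constant factor, whereas removing the unbounded label shrinks it by the size of that label's value set.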

6. Aggregate at write time with recording rules

Sometimes you genuinely need a high-cardinality metric for short-term debugging but only ever query an aggregate. Recording rules pre-compute the aggregate on a schedule and store the small result. Combined with relabeling, this lets you keep raw data on a short retention while a downsampled series feeds dashboards and long-term storage.

groups:
  - name: cardinality-reduction
    interval: 30s
    rules:
      # Collapse per-pod, per-handler request rate into per-service rate.
      # The stored result drops the pod dimension entirely.
      - record: "service:http_requests:rate5m"
        expr: |
          sum by (service, method, status) (
            rate(http_requests_total[5m])
          )

      # Pre-aggregate a latency histogram down to per-service buckets,
      # so the quantile query later touches a fraction of the series.
      - record: "service:http_request_duration_seconds_bucket:rate5m"
        expr: |
          sum by (service, le) (
            rate(http_request_duration_seconds_bucket[5m])
          )

Dashboards then query service:http_requests:rate5m instead of the raw metric — fewer series scanned, faster refresh. The high-leverage pattern is to pair this with remote-write filtering: keep raw data locally for 24 hours, but only forward the aggregated recording-rule output to the expensive long-term backend.

remote_write:
  - url: "https://prometheus-prod.example.com/api/v1/write"
    write_relabel_configs:
      # Forward only pre-aggregated recording-rule series (service:...) and
      # a small allow-list; drop raw per-pod series from long-term storage.
      - source_labels: [__name__]
        regex: "service:.*|up|node_.*"
        action: keep

write_relabel_configs is the same relabeling engine applied at the remote-write boundary. It is where you turn “store everything locally” into “pay to keep only what matters,” which is frequently a 5-10x reduction in remote-write series.
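Because a keep rule silently discards everything that does not match, it helps to dry-run the allow-list against a list of metric names first. The following is a sketch that approximates the anchored match with Python's re.fullmatch. The helper name and the metric names other than those in the rule above are made up for illustration.

```python
import re


def kept_by_allow_list(metric_names: list[str], regex: str) -> list[str]:
    """Names that a `keep` rule on __name__ would let through (anchored match)."""
    pattern = re.compile(regex)
    return [name for name in metric_names if pattern.fullmatch(name)]


names = [
    "service:http_requests:rate5m",
    "up",
    "node_cpu_seconds_total",
    "http_requests_total",
    "upstream_latency_seconds",
]
print(kept_by_allow_list(names, "service:.*|up|node_.*"))
```

Here upstream_latency_seconds is not forwarded even though it starts with "up", because the whole name has to match one alternative of the regex.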

7. Per-team cardinality budgets, dashboards, and alerts

Controlling cardinality once is a project; keeping it controlled is governance. The model that holds in a multi-team platform is a budget per team, attributed by a team or owner label that you attach via relabeling at scrape time, then measured and alerted on.

First, attribute every series to an owner. In Kubernetes, map a namespace label to a team via relabeling so attribution is automatic:

    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: team
        regex: "(payments|checkout|search)-.*"
        replacement: "${1}"

Then build the per-team series count as a recording rule so the budget dashboard is cheap to render:

groups:
  - name: cardinality-governance
    interval: 1m
    rules:
      - record: "team:series:count"
        expr: "count by (team) ({__name__=~'.+'})"

Alert when a team crosses its allocation, and — more importantly — alert on growth rate, because a slow leak is invisible on an instantaneous gauge until it is a crisis:

groups:
  - name: cardinality-budgets
    rules:
      # Hard budget breach
      - alert: TeamCardinalityBudgetExceeded
        expr: "team:series:count > 200000"
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Team {{ $labels.team }} over its 200k series budget"
          description: "{{ $labels.team }} is at {{ $value }} active series."

      # Growth detector: series up >25% week-over-week
      - alert: CardinalityGrowthSpike
        expr: |
          team:series:count
            / (team:series:count offset 1w) > 1.25
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Team {{ $labels.team }} cardinality up >25% WoW"

The growth alert is the one that earns its keep. A budget alert tells you something already broke; the week-over-week ratio catches the new exporter the day it ships, while the fix is still a one-line relabel rule.

Enterprise scenario

A payments platform team I worked with ran a single-tenant Prometheus per environment, remote-writing to Grafana Cloud. Over six weeks their active series climbed from 1.4M to 6.8M and the monthly bill roughly quintupled, with no corresponding traffic growth. The on-call narrative was “Prometheus is slow,” but the real damage was on the bill, not in query latency.

promtool tsdb analyze on a snapshot block made it obvious in under a minute: the top label name by unique values was card_bin (the first six digits of a card number) on a single payment_authorization_duration_seconds histogram. A well-meaning engineer had added it to slice latency by issuing bank. With ~12 classic histogram buckets, 9,000 distinct BINs, and the existing method and status labels, that one histogram had ballooned to several million series, and it was streaming straight to the paid backend.

The constraint: they could not simply delete the metric, because the fraud team did use per-BIN latency during incident reviews. The fix was to split storage by audience. They kept the raw, per-BIN histogram locally on a 24-hour retention for fraud debugging, but stripped card_bin at the remote-write boundary and forwarded only a pre-aggregated recording rule to long-term storage:

# Local recording rule: aggregate away the BIN dimension
groups:
  - name: payments
    interval: 30s
    rules:
      - record: "service:payment_auth_duration:bucket:rate5m"
        expr: |
          sum by (service, method, status, le) (
            rate(payment_authorization_duration_seconds_bucket[5m])
          )

# Remote-write: drop the raw per-BIN series, keep the aggregate
remote_write:
  - url: "https://prometheus-prod.grafana.net/api/prom/push"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "payment_authorization_duration_seconds_(bucket|sum|count)"
        action: drop

Remote-write series for that metric dropped from millions to a few thousand. The fraud team kept its high-resolution local view; dashboards moved to the aggregated series and rendered faster. They then added the week-over-week growth alert from Section 7, scoped per team, so the next person who reached for a high-cardinality label got paged before it hit the invoice. Monthly spend returned to its prior baseline within one billing cycle.

Verify

Confirm the controls actually took effect rather than assuming the config reloaded cleanly.

# 1) Config is valid before you reload
promtool check config /etc/prometheus/prometheus.yml

# 2) Reload without restart, then confirm the new generation is live
curl -s -X POST http://localhost:9090/-/reload
curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A2 metric_relabel_configs

# 3) Headline series count should have dropped after relabeling
prometheus_tsdb_head_series

# 4) The label you dropped should be gone - this must return nothing.
#    Matching on user_id!="" avoids counting the empty-label group.
count({user_id!=""})

# 5) No target is silently failing a sample_limit
prometheus_target_scrapes_exceeded_sample_limit_total

# 6) Per-team budget rule is producing data
team:series:count

Cross-check the /tsdb-status page after the reload: the top metric names and top label-value pairs should reflect your changes, and memoryInBytesByLabelName for the offending label should be absent.

Checklist
