Observability Multi-Cloud

Scaling Prometheus: Recording Rules, Remote-Write, and Long-Term Storage with Thanos and Mimir

A single Prometheus is one of the best engineering deals in infrastructure: scrape, store, alert, and query from one binary with no external dependencies. That deal expires the moment your active series count, retention requirement, or dashboard concurrency outgrows one machine’s RAM and disk. This guide is the path from that overloaded single node to a horizontally scalable stack: where the load actually comes from, how to shed it with recording rules and relabeling, how to tune remote-write so it does not melt your WAL, and how to choose between Thanos and Mimir for the long-term backend.

1. When a single Prometheus stops scaling

Prometheus is vertically scaled by design. It holds an in-memory index of every active series and keeps the most recent ~2 hours of samples in a head block before flushing to disk. Three symptoms tell you the node is past its limit.

Cardinality. Memory consumption tracks the number of active series (unique label-set combinations), not the sample rate. A single new high-cardinality label - a user ID, a full request path, a pod name churning under autoscaling - multiplies series. When the process starts getting OOM-killed during head compaction, cardinality is almost always the cause.

Retention. Local TSDB retention is governed by --storage.tsdb.retention.time (and/or --storage.tsdb.retention.size). Keeping a year of data locally means a year of blocks on one volume, with no replication and no horizontal read path. Local disk is for the recent window; anything you want to keep for capacity planning or compliance belongs in object storage.

Query time. A dashboard that computes histogram_quantile() over a rate() of a high-cardinality histogram across a 30-day range will touch enormous numbers of samples on every refresh. Long, complex range queries are the load you can engineer away before you ever shard.

Inspect the real cost before changing anything:

# Top 10 metric names by series count (run against your Prometheus)
topk(10, count by (__name__)({__name__=~".+"}))

# Series count contributed per scrape job
sort_desc(count by (job)({__name__=~".+"}))

# Head series and chunks over time - watch the trend, not the instant
prometheus_tsdb_head_series
prometheus_tsdb_head_chunks

Rule of thumb: budget roughly a few KB of RAM per active series including index overhead. A node holding several million active series wants tens of GB of RAM with headroom for query bursts and compaction. Measure your own ratio with process_resident_memory_bytes / prometheus_tsdb_head_series rather than trusting a generic number.

The tsdb CLI gives you a block-level breakdown of where cardinality lives:

# Analyze the most recent block for the worst label/metric offenders
promtool tsdb analyze /prometheus

2. Recording rules as a load valve

A recording rule precomputes an expression on the scrape/evaluation interval and writes the result back as a new series. Instead of every dashboard panel re-deriving a per-service error rate over millions of raw samples, the rule computes it once and panels read a cheap, low-cardinality series. This is the single highest-leverage change for query-time pain.

Naming matters - the community convention is level:metric:operations, where level is the aggregation labels kept, metric is the source metric, and operations describes the transform applied (most recent first):

# recording-rules.yaml
groups:
  - name: http_aggregations
    interval: 30s
    rules:
      # Per-service, per-method 5m request rate. Drops instance/pod labels.
      - record: job:http_requests:rate5m
        expr: sum by (job, method) (rate(http_requests_total[5m]))

      # Per-service error ratio, precomputed for SLO dashboards.
      - record: job:http_request_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))

      # Pre-aggregated histogram buckets so quantiles read cheap series.
      - record: job_le:http_request_duration_seconds:rate5m
        expr: sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))

With the bucket aggregation precomputed, a latency panel becomes histogram_quantile(0.99, job_le:http_request_duration_seconds:rate5m) - reading a handful of series per job instead of the full per-instance bucket set.

Two operational guardrails:

Validate before shipping:

promtool check rules recording-rules.yaml

3. Federation vs remote-write for global views

Once you run more than one Prometheus - per cluster, per region, per team - you need a way to query across them. There are two patterns, and they are not equivalent.

Federation has a central Prometheus scrape the /federate endpoint of leaf Prometheis, pulling a filtered subset of series on an interval. It works for small, pre-aggregated rollups, but it scales poorly: it is pull-based and bursty, the central node inherits the cardinality of everything it federates, and you only get whatever the federation match selectors captured at scrape time. Use it sparingly, for a thin layer of cross-cluster aggregates only.

Remote-write streams samples from each Prometheus to a central, horizontally scalable backend (Thanos Receive, Mimir, or any compatible endpoint) as they are ingested. It is push-based, continuous, carries full resolution, and the backend - not your scraping Prometheus - owns the global query path and long-term storage. For a true global view, remote-write is the right default.

# prometheus.yaml - minimal remote_write to a central backend
remote_write:
  - url: https://mimir.internal.example.com/api/v1/push
    headers:
      X-Scope-OrgID: team-platform   # Mimir/Thanos-Receive tenant header
    # Send only what the backend needs for global views; keep the rest local.
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "(go_gc_duration_seconds_count|process_cpu_seconds_total)"
        action: keep

You can run remote-write and keep local TSDB for fast, recent, single-cluster queries. A common topology: short local retention for on-call drill-down, remote-write to a global backend for cross-cluster and long-term queries. Many teams add external_labels (e.g. cluster, region) under global: so every remote-written series is attributable to its source.

4. Tuning the remote-write queue

Remote-write reads from the WAL and ships samples through an in-memory queue per remote endpoint. If the queue cannot keep up with ingestion, the WAL grows, disk fills, and you get backpressure that eventually stalls the whole process. The relevant knobs live under queue_config.

remote_write:
  - url: https://mimir.internal.example.com/api/v1/push
    queue_config:
      # Concurrency. Prometheus auto-scales shards between min and max
      # based on backlog; raise the ceiling for high-throughput endpoints.
      min_shards: 1
      max_shards: 50
      # Samples buffered per shard before a flush is forced.
      capacity: 10000
      # Max samples per outbound request to the backend.
      max_samples_per_send: 2000
      # Force a flush at this interval even if a batch is not full.
      batch_send_deadline: 5s
      # Retry backoff bounds for 5xx/429 from the backend.
      min_backoff: 30ms
      max_backoff: 5s
      # Honor Retry-After on 429 instead of hammering a throttled backend.
      retry_on_http_429: true

How to reason about it:

    write_relabel_configs:
      # Drop noisy debug metrics from the remote stream.
      - source_labels: [__name__]
        regex: "go_memstats_.*|.*_bucket_le_.*"
        action: drop

5. Thanos architecture

Thanos turns a fleet of Prometheus servers into a global, long-term system using object storage as the durable layer. The components you actually run:

# Store Gateway with an object-storage config (S3 example)
# objstore.yaml
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
# Store Gateway pointed at the bucket
thanos store \
  --objstore.config-file=objstore.yaml \
  --data-dir=/var/thanos/store \
  --index-cache-size=2GB

# Compactor: downsampling + retention per resolution. Run ONE per bucket.
thanos compact \
  --objstore.config-file=objstore.yaml \
  --data-dir=/var/thanos/compact \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=180d \
  --retention.resolution-1h=2y \
  --wait

# Querier deduplicating across replicated Prometheus pairs
thanos query \
  --store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc \
  --store=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.svc \
  --query.replica-label=replica

The compactor’s downsampling is what makes Thanos viable for multi-year retention: a one-hour-resolution series over two years is orders of magnitude smaller to scan than raw. Set --query.replica-label to whichever external_label distinguishes your HA Prometheus replicas, or the querier cannot dedupe them.

6. Grafana Mimir architecture

Mimir is a horizontally scalable, multi-tenant TSDB built primarily around the remote-write ingestion model. Where Thanos composes loosely coupled components over your Prometheus servers, Mimir is a clustered system with its own write and read paths. It is one binary that runs in different modes; in production it is commonly deployed in three groups.

The defining feature is native multi-tenancy: every request carries an X-Scope-OrgID header, and Mimir isolates series, queries, and limits per tenant. Per-tenant limits are how you stop one team from exhausting the cluster:

# Mimir runtime overrides - per-tenant guardrails
overrides:
  team-platform:
    max_global_series_per_user: 3000000
    ingestion_rate: 250000          # samples/sec
    ingestion_burst_size: 500000
    max_fetched_series_per_query: 500000
    compactor_blocks_retention_period: 1y
  team-payments:
    max_global_series_per_user: 1000000
    ingestion_rate: 100000
    compactor_blocks_retention_period: 90d

Mimir deliberately reuses Thanos and Cortex lineage internally (block format, store-gateway concepts), so the storage primitives will look familiar - the difference is the operational packaging and the first-class tenancy and limits layer.

7. Choosing between Thanos and Mimir

Both solve global view + long-term storage on object storage. The decision is operational, not feature-checkbox.

Dimension Thanos Mimir
Mental model Loosely coupled components over your existing Prometheus Clustered, purpose-built TSDB you push into
Primary ingestion Sidecar block upload (or Receive for push) Remote-write first
Multi-tenancy Possible, less native First-class, per-tenant limits built in
Long-range query latency Good with downsampling + caches Strong; query-frontend splitting/caching and sharding
Operational surface Fewer moving parts to start; compactor is a singleton More components/config; horizontally scalable end to end
Best fit Already have many Prometheis; want global view with minimal re-architecture Building a central, multi-team metrics platform at scale

Decision guide:

Either way, remote-write tuning (section 4) and cardinality control (section 8) apply identically - the backend choice does not save you from a series explosion.

8. Controlling cardinality at the source

The most durable fix is to never create the cardinality. Prometheus offers two relabeling stages, and the distinction is exam-critical:

scrape_configs:
  - job_name: app
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      # Drop a high-cardinality metric entirely.
      - source_labels: [__name__]
        regex: "apiserver_request_duration_seconds_bucket"
        action: drop

      # Strip a label that explodes cardinality (e.g. a full URL path),
      # but keep the metric.
      - regex: "url_path"
        action: labeldrop

      # Allow-list: keep only metrics you have explicitly blessed.
      - source_labels: [__name__]
        regex: "(http_requests_total|http_request_duration_seconds_bucket|up|process_cpu_seconds_total)"
        action: keep

labeldrop/labelkeep act on label names; drop/keep act on samples matched by source_labels values. An allow-list (action: keep on __name__) is the strongest control: nothing enters the TSDB unless it is on the list. Apply allow-lists at the source for local TSDB and at write_relabel_configs for the remote stream.

Enterprise scenario

A payments platform team ran HA Prometheus pairs per Kubernetes cluster and remote-wrote everything to a central Mimir cluster on S3. After a Black Friday traffic ramp, the distributors started returning 429 and on-call dashboards went blank for the team-payments tenant. The root cause was not Mimir capacity - it was the per-tenant ingestion_rate limit colliding with a cardinality spike from a new deploy that added a customer_id label to http_requests_total. Each Prometheus dutifully sharded up its remote-write queue, hammered the throttled distributor, honored Retry-After, and fell hours behind while the WAL grew toward the disk limit.

The fix had two halves. First, stop the bleeding at the edge - drop the offending label before it ever leaves the Prometheus, so the tenant fit back under its rate budget:

write_relabel_configs:
  - regex: "customer_id"
    action: labeldrop
  - source_labels: [__name__]
    regex: "http_requests_total"
    action: keep

Second, make the limit a guardrail, not a silent ceiling: alert on cortex_discarded_samples_total by reason="rate_limited" per tenant, and raise the headroom deliberately rather than reactively.

overrides:
  team-payments:
    ingestion_rate: 200000        # was 100000
    ingestion_burst_size: 600000  # absorb deploy-time spikes

The durable lesson: a multi-tenant backend turns one team’s cardinality mistake into that team’s outage, not everyone’s - but only if the per-tenant limit is observable. Wire cortex_discarded_samples_total and prometheus_remote_storage_samples_pending into paging before the next ramp, not after.

Verify

Confirm each layer is actually doing its job before declaring victory.

# Recording rules loaded and evaluating without lag
promtool check rules recording-rules.yaml
# 1. Recording rules produce series and stay within their interval
job:http_requests:rate5m
max by (rule_group) (prometheus_rule_group_last_duration_seconds)
  / max by (rule_group) (prometheus_rule_group_interval_seconds)   # keep < 1

# 2. Remote-write is keeping up: failures flat, backlog small
rate(prometheus_remote_storage_samples_failed_total[5m])           # expect ~0
prometheus_remote_storage_samples_pending                          # bounded, not climbing
prometheus_remote_storage_shards_desired                           # <= shards_max

# 3. WAL is not growing unbounded (backpressure check)
prometheus_tsdb_wal_truncations_total
prometheus_remote_storage_highest_timestamp_in_seconds
  - prometheus_remote_storage_queue_highest_sent_timestamp_seconds # small gap

# 4. Cardinality moved in the right direction after relabeling
count by (__name__)({__name__=~".+"})
# 5. Thanos: blocks present in object storage and visible to the querier
thanos tools bucket inspect --objstore.config-file=objstore.yaml

# 6. Mimir: confirm a tenant's series count against its limit
#    (cortex_-prefixed metrics are exposed by Mimir components)
#    Query Mimir's own /metrics or your monitoring stack for:
#      cortex_ingester_memory_series

Checklist

Pitfalls and next steps

Next, push rule evaluation server-side (Thanos Ruler or Mimir’s ruler) so global rules run against the global dataset rather than per-Prometheus, add per-tenant or per-job cardinality alerts so explosions page before they OOM a node, and codify the downsampling/retention tiers to your actual query patterns - most teams over-retain raw resolution and under-use the 5m and 1h tiers.

PrometheusThanosMimirRemoteWriteScalability

Comments

Keep Reading