Thanos in Production: Global Query View, Deduplication, and Object-Storage Downsampling

Prometheus is excellent at being a single node and bad at being a fleet. The moment you run two replicas for high availability, keep more than a few weeks of data, or want one query surface across regions, you are past what a single binary gives you. Thanos closes that gap without asking you to rewrite anything: your Prometheus instances keep scraping and alerting exactly as they do today, and a thin set of components turns their local TSDB blocks into a globally queryable, deduplicated, downsampled, near-infinite-retention store backed by object storage. This guide assembles that stack the way it actually runs in production - the component topology, the bucket layout you must verify, deduplication across HA pairs, Compactor downsampling and retention tiers, Store Gateway caching, and the operational runbook for when a block goes bad.

1. Component topology

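Thanos is a kit of single-purpose components, not a monolith. The components that serve data (Sidecar, Store Gateway, Ruler, Receive) each speak the StoreAPI (a gRPC interface), and the Querier fans out to all of them; the Compactor talks only to the bucket. Know what each one does before you deploy any of them.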

Component	Role	Stateful?
Sidecar	Runs next to each Prometheus; serves its recent TSDB over StoreAPI and ships completed blocks to the bucket	No (Prometheus holds the data)
Store Gateway	Serves historical blocks from the bucket over StoreAPI	Caches only - safe to lose
Querier	Stateless fan-out + PromQL engine; deduplicates and merges results	No
Compactor	Singleton per bucket; compacts blocks, downsamples, applies retention	Local scratch disk
Ruler	Evaluates recording/alerting rules against the global view; writes its own blocks	Yes (small TSDB)
Receive	Optional push target (remote_write) for agents that cannot run a Sidecar	Yes (holds the WAL)

The mental model: Sidecar and Store Gateway are read backends, the Querier is the read frontend, the Compactor is the bucket janitor, and Ruler/Receive are optional. A minimal HA deployment is two Prometheus replicas each with a Sidecar, one Store Gateway, one Querier, and exactly one Compactor.

The single most important rule in all of Thanos: run exactly one Compactor per bucket. Two Compactors against the same bucket will race, produce overlapping blocks, and corrupt your retention. More on this in step 4.

2. Shipping TSDB blocks to object storage

Every Sidecar, Store Gateway, Compactor, and Ruler reads the same objstore.config. Define it once. For AWS S3:

# objstore-s3.yaml
type: S3
config:
  bucket: "acme-thanos-metrics"
  endpoint: "s3.us-east-1.amazonaws.com"
  region: "us-east-1"
  # Prefer IRSA / instance profile over static keys.
  # access_key / secret_key only if you truly cannot use a role.
  sse_config:
    type: "SSE-S3"

Azure Blob and GCS use the same --objstore.config-file flag with a different type:

# objstore-azure.yaml
type: AZURE
config:
  storage_account: "acmethanos"
  container: "metrics"
  # Use a managed identity; leave storage_account_key empty.
  endpoint: "blob.core.windows.net"

# objstore-gcs.yaml
type: GCS
config:
  bucket: "acme-thanos-metrics"
  # service_account omitted -> uses Workload Identity / ADC

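The Sidecar can only ship blocks correctly when Prometheus is started with the right flags. The two block-duration flags below are mandatory; --web.enable-lifecycle is needed only if the Sidecar should trigger configuration reloads: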

prometheus \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=2h \
  --web.enable-lifecycle

Setting min and max block duration both to 2h disables Prometheus’s own local compaction. This is non-negotiable: the Thanos Compactor must own compaction in the bucket, and it cannot reconcile blocks that Prometheus has already merged locally. Then run the Sidecar:

thanos sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://localhost:9090 \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902 \
  --objstore.config-file=/etc/thanos/objstore-s3.yaml

The Sidecar uploads a block roughly every two hours, after Prometheus seals its head block to disk. Recent (un-shipped) data is still queryable because the Sidecar also serves the local TSDB over StoreAPI - the Querier sees the recent window from the Sidecar and everything older from the Store Gateway, with no gap.

External labels are the deduplication key

Each Prometheus in an HA pair must carry identical external labels except for one replica label:

# prometheus-a.yml
global:
  external_labels:
    cluster: "prod-use1"
    region: "us-east-1"
    replica: "a"     # the ONLY difference vs the partner

The partner is identical with replica: "b". These labels are stamped into every block’s metadata and are how the Querier later knows two series are replicas of each other rather than genuinely distinct.
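Label drift between the two replicas is easy to introduce and hard to spot by eye. The following is a minimal sketch of a pre-deploy check, assuming the external labels have already been loaded into plain dictionaries; `differing_labels` and `is_valid_ha_pair` are hypothetical helper names, not part of Thanos or Prometheus.

```python
def differing_labels(labels_a: dict, labels_b: dict) -> set:
    """Return the label names whose values differ (or exist on one side only)."""
    names = set(labels_a) | set(labels_b)
    return {name for name in names if labels_a.get(name) != labels_b.get(name)}


def is_valid_ha_pair(labels_a: dict, labels_b: dict, replica_label: str = "replica") -> bool:
    """True when the replica label is the one and only difference."""
    return differing_labels(labels_a, labels_b) == {replica_label}


# External labels of the pair described above, plus a drifted variant for contrast.
replica_a = {"cluster": "prod-use1", "region": "us-east-1", "replica": "a"}
replica_b = {"cluster": "prod-use1", "region": "us-east-1", "replica": "b"}
drifted_b = {"cluster": "prod-use1", "region": "us-east-2", "replica": "b"}

print(is_valid_ha_pair(replica_a, replica_b))   # healthy pair
print(is_valid_ha_pair(replica_a, drifted_b))   # region drifted, so the series would not be merged
```

A check like this fits naturally into CI for the Prometheus configuration, where a failed comparison blocks the rollout instead of surfacing later as doubled series.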

Verify the bucket layout

After a Sidecar has run for a few hours, inspect what landed. Do not trust that it worked - check:

thanos tools bucket inspect \
  --objstore.config-file=/etc/thanos/objstore-s3.yaml \
  --output=table

Each row is a block ULID with its time range, sample count, and replica label. The on-disk layout under the bucket is one directory per block:

acme-thanos-metrics/
  01J8X.../              # block ULID
    meta.json            # time range, labels, downsample level, compaction level
    index                # the block's inverted index
    chunks/000001        # compressed sample chunks

The meta.json is the source of truth. thanos.downsample.resolution: 0 means raw; compaction.level: 1 means a fresh 2h block straight from a Sidecar. Watch those two fields evolve as the Compactor works.
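To make those two fields concrete, here is a small sketch that summarises a meta.json document. It assumes the standard layout (`compaction.level`, `thanos.labels`, `thanos.downsample.resolution` in milliseconds); the ULID, timestamps, and the `describe_block` helper are made up for illustration.

```python
import json

# Downsample resolutions as they appear in meta.json (milliseconds).
RESOLUTION_NAMES = {0: "raw", 300_000: "5m", 3_600_000: "1h"}


def describe_block(meta_json: str) -> dict:
    """Summarise the meta.json fields that change as the Compactor works."""
    meta = json.loads(meta_json)
    thanos = meta.get("thanos", {})
    resolution_ms = thanos.get("downsample", {}).get("resolution", 0)
    return {
        "ulid": meta["ulid"],
        "range_hours": (meta["maxTime"] - meta["minTime"]) / 3_600_000,
        "resolution": RESOLUTION_NAMES.get(resolution_ms, f"{resolution_ms}ms"),
        "compaction_level": meta["compaction"]["level"],
        "labels": thanos.get("labels", {}),
    }


# Hypothetical meta.json for a fresh Sidecar upload (placeholder ULID and timestamps).
SAMPLE_META = json.dumps({
    "ulid": "01EXAMPLEEXAMPLEEXAMPLE0000",
    "minTime": 1_700_000_000_000,
    "maxTime": 1_700_007_200_000,
    "compaction": {"level": 1},
    "thanos": {
        "labels": {"cluster": "prod-use1", "region": "us-east-1", "replica": "a"},
        "downsample": {"resolution": 0},
    },
})

print(describe_block(SAMPLE_META))
```

For the sample document this reports a 2-hour raw block at compaction level 1, which is what a healthy first upload should look like.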

3. Querier deduplication across HA pairs

The Querier connects to every StoreAPI endpoint and presents one PromQL surface. Point it at your Sidecars and Store Gateway:

thanos query \
  --http-address=0.0.0.0:9090 \
  --grpc-address=0.0.0.0:10903 \
  --query.replica-label=replica \
  --endpoint=dns+thanos-sidecar.monitoring.svc:10901 \
  --endpoint=dns+thanos-store.monitoring.svc:10901

Using dns+ with a headless Kubernetes Service makes the Querier discover all Sidecar pods behind that name - you do not list each replica by hand. The critical flag is --query.replica-label=replica. It tells the Querier that series differing only in the replica label are the same logical series.

With deduplication enabled (the default in the UI, or dedup=true on the API), the Querier merges the two replica time series, preferring whichever has continuous data and stitching across gaps when one replica was down. Disable it with dedup=false and you see both replicas side by side - useful when debugging why one Prometheus missed scrapes:

# Deduplicated (one clean series)
curl -s 'http://thanos-query:9090/api/v1/query?query=up&dedup=true'

# Raw, both replicas visible
curl -s 'http://thanos-query:9090/api/v1/query?query=up&dedup=false'

A common production mistake: forgetting --query.replica-label. Without it, the replicas remain separate series, so every sum() or count() aggregation doubles, every graph shows two slightly offset lines per series, and alerts built on aggregates fire on inflated values. If your graphs suddenly show pairs of nearly-identical lines, this flag is missing.

You can supply multiple replica labels (for example replica plus an availability zone) by repeating the flag. For genuinely separate Prometheus instances - different clusters, different external labels - the Querier keeps them distinct and you query the union, which is exactly the “global view” you wanted.
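The grouping rule behind deduplication can be illustrated in a few lines. This is a deliberately simplified sketch: it only shows which series are treated as the same logical series once the replica labels are ignored, and does not model how the Querier merges samples or stitches gaps. The function names and instance addresses are hypothetical.

```python
def logical_key(labels: dict, replica_labels=("replica",)) -> frozenset:
    """Identity of a series once the replica labels are ignored."""
    return frozenset((k, v) for k, v in labels.items() if k not in replica_labels)


def count_series(series: list, dedup: bool, replica_labels=("replica",)) -> int:
    """Number of series a query would return with or without deduplication."""
    if not dedup:
        return len(series)
    return len({logical_key(labels, replica_labels) for labels in series})


# Two scrape targets, each seen by both replicas of an HA pair.
series = [
    {"__name__": "up", "job": "node", "instance": "10.0.0.1:9100", "replica": "a"},
    {"__name__": "up", "job": "node", "instance": "10.0.0.1:9100", "replica": "b"},
    {"__name__": "up", "job": "node", "instance": "10.0.0.2:9100", "replica": "a"},
    {"__name__": "up", "job": "node", "instance": "10.0.0.2:9100", "replica": "b"},
]

print(count_series(series, dedup=False))  # both replicas visible
print(count_series(series, dedup=True))   # replicas collapsed
```

Passing `replica_labels=("replica", "zone")` mirrors repeating the flag: any label listed there stops contributing to series identity.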

4. Compactor: compaction, downsampling, retention

The Compactor is the only component that rewrites and deletes blocks in the bucket, which makes it the most dangerous to misconfigure. It does three jobs: compaction (merging 2h blocks into progressively larger ones: 8h, then 2d, then 14d by default), downsampling (computing 5m and 1h aggregates so long-range queries stay cheap), and retention (deleting blocks past their tier limit).

thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/objstore-s3.yaml \
  --http-address=0.0.0.0:10912 \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=120d \
  --retention.resolution-1h=730d \
  --compact.concurrency=2 \
  --downsample.concurrency=2 \
  --wait

The retention model is the heart of cost control. Thanos keeps three resolutions of the same data:

Resolution	What it is	Typical retention	Use
raw	every scraped sample	30 days	live debugging, alerting context
5m	5-minute downsampled aggregates	120 days	weekly/monthly dashboards
1h	1-hour downsampled aggregates	2 years (730d)	capacity planning, YoY trends

Downsampling does not throw away accuracy for aggregations - each downsampled point stores the count, sum, min, max, and counter values, so rate(), histogram_quantile(), and avg_over_time() remain correct at coarse resolution. The Querier selects the resolution from the query's step when auto-downsampling is enabled (--query.auto-downsampling, or an explicit max_source_resolution on the API); without it, queries read raw blocks. With it, a 6-month dashboard never touches raw blocks; it reads cheap downsampled aggregates.
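How a query step maps to a resolution can be sketched as follows. This assumes the step / 5 heuristic that the Querier's --query.auto-downsampling option is documented to apply, and it assumes blocks at the chosen resolution already exist; `pick_resolution` is a hypothetical helper, not a Thanos API.

```python
# Downsampled resolutions in seconds, coarsest first.
RESOLUTIONS_SECONDS = (("1h", 3600), ("5m", 300))


def pick_resolution(step_seconds: float) -> str:
    """Pick the coarsest resolution no coarser than step / 5 (assumed heuristic)."""
    max_source_resolution = step_seconds / 5
    for name, resolution in RESOLUTIONS_SECONDS:
        if resolution <= max_source_resolution:
            return name
    return "raw"


for step in (15, 1500, 3600, 18000):
    print(f"step={step}s -> {pick_resolution(step)}")
```

Under this assumption a panel needs a step of at least 25 minutes before 5m data is eligible and at least 5 hours before 1h data is, which is worth keeping in mind when sizing dashboard intervals.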

Critical: downsampling only happens after blocks are compacted to the required level. A 5m downsample needs blocks covering at least 40h; 1h downsample needs ~10 days. Brand-new data is always served raw. If you set --retention.resolution-raw shorter than the downsampling window, you delete data before it can be downsampled and create permanent gaps.
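The retention rule above can be turned into a quick sanity check before rollout. The sketch uses the 40h and ~10d windows quoted in this section; the second check additionally assumes that 1h aggregates are built from 5m blocks, so 5m retention must outlast the 10-day window. `retention_warnings` is a hypothetical helper.

```python
from datetime import timedelta

# Block ages quoted above before each downsampling pass can run.
MIN_AGE_FOR_5M = timedelta(hours=40)
MIN_AGE_FOR_1H = timedelta(days=10)


def retention_warnings(raw: timedelta, five_min: timedelta) -> list:
    """Flag retention settings that would delete data before it is downsampled."""
    warnings = []
    if raw < MIN_AGE_FOR_5M:
        warnings.append("raw retention is shorter than the 40h needed before 5m downsampling")
    if five_min < MIN_AGE_FOR_1H:
        # Assumption: the 1h pass reads 5m blocks, so those must survive 10 days.
        warnings.append("5m retention is shorter than the ~10d needed before 1h downsampling")
    return warnings


# The tiers used in this guide: 30d raw, 120d at 5m.
print(retention_warnings(timedelta(days=30), timedelta(days=120)))
# A risky configuration: 1 day of raw data.
print(retention_warnings(timedelta(days=1), timedelta(days=120)))
```

The tiers used in this guide produce no warnings; a one-day raw retention is flagged because blocks would be deleted before the 5m pass could read them.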

Avoiding compactor halts

The Compactor halts and refuses to proceed when it finds overlapping blocks it cannot reconcile - almost always caused by running two Compactors, or by a Sidecar shipping blocks from a Prometheus that did its own local compaction. The fix is prevention: one Compactor, and min/max-block-duration=2h on every Prometheus.

If a halt happens, do not delete blocks blindly. Diagnose first:

# List overlapping blocks the compactor is choking on
thanos tools bucket verify \
  --objstore.config-file=/etc/thanos/objstore-s3.yaml \
  --issues=overlapped_blocks

Genuine accidental overlaps can be repaired offline with thanos tools bucket rewrite or, as a last resort, by marking the bad block for deletion (covered in the runbook). Run the Compactor with a generous --data-dir on fast disk - it downloads blocks to merge them, and a daily compaction of a busy bucket needs tens of GB of scratch space.

5. Store Gateway: index caching and query sharding

The Store Gateway answers all historical queries by reading blocks from object storage. Object storage is high-latency, so caching is what makes it fast. Start by sizing the in-memory index cache (postings and series lookups) and the chunk pool (a bound on memory used for chunk bytes in flight, not a cache); chunk data itself is cached through the caching bucket described below:

thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/objstore-s3.yaml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902 \
  --index-cache-size=2GB \
  --chunk-pool-size=4GB

For anything beyond a single instance, move the index cache to memcached so it survives restarts and is shared across Store Gateway replicas:

# index-cache-memcached.yaml
type: MEMCACHED
config:
  addresses: ["dns+memcached.monitoring.svc:11211"]
  max_item_size: "16MiB"
  max_async_concurrency: 20

Pass it with --index-cache.config-file=/etc/thanos/index-cache-memcached.yaml. A separate caching bucket config (--store.caching-bucket.config-file) caches chunk subranges and the existence of objects, cutting GET requests to S3 dramatically - which matters because S3 charges per request and per byte.

Sharding the Store Gateway

A single Store Gateway eventually cannot hold the index headers for a multi-terabyte bucket in memory. Shard the blocks with a selector relabel config, hashing either the block's external labels or, as here, the special __block_id label, so each replica owns a disjoint slice:

# store-shard-0.yaml  (one file per shard)
- action: hashmod
  source_labels: ["__block_id"]
  target_label: shard
  modulus: 3
- action: keep
  source_labels: ["shard"]
  regex: "0"

Run three Store Gateways, each with its own --selector.relabel-config-file keeping shard 0, 1, and 2. The Querier fans out to all three and merges. This is horizontal scale for the read path: no single Store Gateway has to know about every block.
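To see why the slices are disjoint, the hashmod step can be simulated offline. This sketch assumes hashmod behaves as in Prometheus relabeling (MD5 of the joined source label values, low 64 bits, modulo the modulus); the block IDs are placeholders rather than real ULIDs, and the helper names are hypothetical.

```python
import hashlib


def hashmod(value: str, modulus: int) -> int:
    """Approximate Prometheus-style hashmod: low 64 bits of MD5, modulo modulus."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def assign_shards(block_ids, modulus: int = 3) -> dict:
    """Map each shard number to the block IDs a Store Gateway with that shard keeps."""
    shards = {shard: [] for shard in range(modulus)}
    for block_id in block_ids:
        shards[hashmod(block_id, modulus)].append(block_id)
    return shards


# Placeholder block IDs purely for illustration; real ones are ULIDs from the bucket.
block_ids = [f"block-{n:04d}" for n in range(12)]
for shard, owned in assign_shards(block_ids).items():
    print(f"shard {shard}: {len(owned)} blocks")
```

Every block lands in exactly one shard, but the split is only statistically even, and changing the modulus reshuffles ownership, so expect each Store Gateway to re-sync its index headers after a reshard.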

6. Thanos Ruler vs Prometheus rules

You now have two places rules can live, and the distinction is about what data the rule needs to see.

Keep alerting and recording rules on Prometheus when they only need local, recent data: per-target up alerts, fast burn-rate SLO alerts, host-level recording rules. Local evaluation is faster, survives a Querier outage, and is the resilient default.

Use Thanos Ruler only for rules that genuinely need the global, deduplicated view - cross-cluster aggregations, fleet-wide SLOs spanning regions, or recording rules over data that no single Prometheus holds:

thanos rule \
  --data-dir=/var/thanos/rule \
  --objstore.config-file=/etc/thanos/objstore-s3.yaml \
  --query=dns+thanos-query.monitoring.svc:9090 \
  --rule-file=/etc/thanos/rules/*.yaml \
  --alertmanagers.url=dns+http://alertmanager.monitoring.svc:9093 \
  --label='ruler_cluster="global"' \
  --label='replica="r0"' \
  --grpc-address=0.0.0.0:10901

The trap with Ruler: it evaluates rules over the network against the Querier. If the Querier is slow or a Store Gateway is degraded, rule evaluation lags and alerts are delayed. Never move your critical, low-latency alerts to Ruler. Run Ruler itself as an HA pair (distinct replica labels) and let the Querier deduplicate its output blocks, exactly like Prometheus pairs.

7. Capacity planning and sample limits

Thanos’s main scaling failure mode is a single expensive query reading millions of samples through the Store Gateway and exhausting memory. Cap it on the Querier:

thanos query \
  --query.max-concurrent=20 \
  --query.timeout=2m \
  --store.response-timeout=30s \
  --query.max-concurrent-select=4

On the Store Gateway, bound how much any single Series request can pull:

thanos store \
  --store.grpc.series-sample-limit=50000000 \
  --store.grpc.series-max-concurrency=20

--store.grpc.series-sample-limit is the safety valve: a runaway query that would touch more than 50M samples is rejected with an error instead of OOM-killing the Store Gateway and taking down historical queries for everyone. Size it against your real dashboards - too low and legitimate long-range panels fail; too high and one bad query is a denial of service. Provision Store Gateway memory for index headers (proportional to total series across all blocks) plus your chunk pool plus burst query memory, and size memcached to hold the working set of postings.
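A back-of-the-envelope estimate helps when sizing the limit against real dashboards. The sketch below is a rough model, assuming every series has one sample per interval across the whole range; the 15s scrape interval and the 5,000-series panel are illustrative assumptions, and the helper names are hypothetical.

```python
SAMPLE_LIMIT = 50_000_000  # the value passed to --store.grpc.series-sample-limit above


def estimated_samples(series: int, range_seconds: int, interval_seconds: int) -> int:
    """Rough number of samples a Series request would touch."""
    return series * (range_seconds // interval_seconds)


def fits_limit(series: int, range_seconds: int, interval_seconds: int,
               limit: int = SAMPLE_LIMIT) -> bool:
    """True when the estimate stays within the Store Gateway sample limit."""
    return estimated_samples(series, range_seconds, interval_seconds) <= limit


THIRTY_DAYS = 30 * 24 * 3600

# A 5,000-series panel over 30 days: raw (assumed 15s scrape) vs. 1h aggregates.
print(estimated_samples(5_000, THIRTY_DAYS, 15), fits_limit(5_000, THIRTY_DAYS, 15))
print(estimated_samples(5_000, THIRTY_DAYS, 3600), fits_limit(5_000, THIRTY_DAYS, 3600))
```

In this model the raw query touches about 864 million samples and would be rejected, while the same panel served from 1h aggregates touches about 3.6 million and passes comfortably, which is the practical argument for keeping downsampling healthy.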

Verify

Confirm each layer independently. Do not declare victory because the UI loaded.

# 1. Querier sees every store backend, all healthy
# The response groups endpoints by component type, hence the double iteration.
curl -s http://thanos-query:9090/api/v1/stores | \
  jq '.data[][] | {name, lastError, minTime, maxTime}'

# 2. Blocks are landing AND being compacted/downsampled
thanos tools bucket inspect --objstore.config-file=/etc/thanos/objstore-s3.yaml --output=table | head

# 3. Deduplication is actually collapsing replicas (count should NOT double)
curl -s 'http://thanos-query:9090/api/v1/query?query=count(up)&dedup=true'  | jq '.data.result[].value[1]'
curl -s 'http://thanos-query:9090/api/v1/query?query=count(up)&dedup=false' | jq '.data.result[].value[1]'

# 4. Compactor is healthy and not halted
curl -s http://thanos-compact:10912/metrics | grep -E 'thanos_compact_halted|thanos_compact_group_compactions_total'

Key signals: every store reports lastError: null with a sane minTime/maxTime; thanos_compact_halted is 0; the dedup vs no-dedup count(up) differ by exactly your replica factor (2x without dedup, 1x with); and meta.json for blocks older than ~2 days shows thanos.downsample.resolution: 300000 (5m) and eventually 3600000 (1h).

Enterprise scenario

A payments platform ran Prometheus HA pairs in three regions (us-east-1, eu-west-1, ap-southeast-2) and wanted a single global SLO dashboard plus 13-month compliance retention. They stood up Thanos, pointed all six Sidecars and one Querier at it - and immediately got paged for an exploding S3 bill and Store Gateways OOM-killing during month-end reviews. Finance had pulled a 13-month, 1-second-step dashboard, and the Querier was happily serving it from raw blocks because the Compactor had silently halted three weeks earlier on overlapping blocks. With downsampling stalled, every wide query scanned billions of raw samples straight out of S3.

The root cause was two-fold: a leftover second Compactor from a migration was racing the primary, and --retention.resolution-raw was set to 90d - long enough that nobody noticed downsampling had stopped, because raw data was still there to (expensively) serve. The fix was disciplined:

# 1. Scale ALL compactors to zero, then run exactly one.
# 2. Repair the overlap offline, then resume.
thanos tools bucket verify --objstore.config-file=/etc/thanos/objstore-s3.yaml --issues=overlapped_blocks
# 3. Cap query blast radius so one dashboard can't DoS the bucket.

thanos store \
  --store.grpc.series-sample-limit=50000000 \
  --index-cache.config-file=/etc/thanos/index-cache-memcached.yaml \
  --store.caching-bucket.config-file=/etc/thanos/caching-bucket.yaml

With one Compactor, downsampling caught up over 48 hours, the 13-month dashboard dropped from raw to 1h resolution, S3 GET volume fell by over 90%, and the sample limit ensured a single broad query failed fast instead of cascading. The durable lesson: a halted Compactor is invisible until it bankrupts you - alert on thanos_compact_halted > 0, and on time() - thanos_objstore_bucket_last_successful_upload_time exceeding a few hours, before you ever ship to production.


Written by Vinod
