Running Grafana Mimir: Multi-Tenant, Horizontally Scalable Prometheus Storage

Grafana Mimir is what you reach for when a single Prometheus, or even a sharded Prometheus plus Thanos, stops being the right tool: you need one logical metrics backend that holds tens or hundreds of millions of active series, serves many teams under hard per-team quotas, and survives a zone failure without losing data. It is the open-source descendant of Cortex, rebuilt around a blocks-based TSDB on object storage. This guide deploys it in microservices mode, wires per-tenant limits, traces a query through the split-and-cache path, and right-sizes the stateful tiers. Every config block uses real Mimir keys; treat the version-pinned docs as the source of truth for defaults, which drift between releases.

1. Deployment modes: monolithic vs microservices

A single Mimir binary contains every component. The -target flag selects which components a given process runs.

Monolithic mode (-target=all): one process runs distributor, ingester, querier, query-frontend, store-gateway, compactor, and ruler. You scale by running more identical replicas behind a load balancer. This is correct for getting started, for development, and for clusters up to roughly a few million active series. The limitation is that you cannot scale the read path independently of the write path, and a query-heavy moment competes for CPU with ingestion.
Microservices mode: each component runs as its own deployment with its own replica count, requests, and limits. This is the production target at scale, because the failure domains and the scaling axes are genuinely different. Ingesters are memory-bound and stateful; queriers are CPU-bound and stateless; the compactor is a small, mostly-singleton-per-tenant batch worker. You do not want one knob for all three.

There is also read-write mode, which collapses the microservices into three roles (read, write, backend) as a middle ground. This guide uses full microservices mode because it makes the moving parts explicit, which is the point.

The same binary, the same config file, different -target:

# Microservices: each Deployment sets its own target
mimir -config.file=/etc/mimir/mimir.yaml -target=distributor
mimir -config.file=/etc/mimir/mimir.yaml -target=ingester
mimir -config.file=/etc/mimir/mimir.yaml -target=querier
mimir -config.file=/etc/mimir/mimir.yaml -target=query-frontend
mimir -config.file=/etc/mimir/mimir.yaml -target=store-gateway
mimir -config.file=/etc/mimir/mimir.yaml -target=compactor

Run one config file across the whole cluster. Components ignore the config sections that do not apply to their target, so a single source of truth is both possible and strongly recommended. In production, deploy with the mimir-distributed Helm chart or Jsonnet rather than hand-rolled manifests; the per-component flags above are what those abstractions ultimately emit.

2. The hash ring: distributor, ingester replication, zone-aware spreading

Mimir’s write path is a consistent hash ring shared via a key-value store. The distributor receives a remote-write request, splits it into individual series, and for each series hashes the tenant ID plus the series labels to a position on the ring. It then walks the ring to find the next replication_factor distinct ingesters and writes the samples to all of them in parallel. The default replication factor is 3, and a write succeeds once a quorum (2 of 3) acknowledges. That quorum is what lets you lose an ingester without losing the write.

Use memberlist as the ring KV store. It is gossip-based and removes the external Consul/etcd dependency that older Cortex deployments carried:

common:
  storage:
    backend: s3
    s3:
      endpoint: s3.us-east-1.amazonaws.com
      bucket_name: mimir-blocks-prod
      region: us-east-1

memberlist:
  join_members:
    - mimir-gossip-ring.mimir.svc.cluster.local:7946

ingester:
  ring:
    kvstore:
      store: memberlist
    replication_factor: 3
    zone_awareness_enabled: true
    instance_availability_zone: us-east-1a   # templated per replica/zone

The common.storage block is the single place to declare the object-store backend; distributor, ingester, store-gateway, compactor, and ruler all inherit it unless overridden. Point every Mimir cluster at object storage from day one. The ingester WAL on local disk is only the buffer until a block is flushed and shipped.

Zone-aware replication is the rule that makes replication factor 3 actually mean three failure domains, not three pods that happen to land in one AZ. Set zone_awareness_enabled: true and stamp each ingester with its real instance_availability_zone. With zone-awareness on, Mimir guarantees the 3 replicas of any series land in 3 distinct zones, so a full AZ outage costs you exactly one replica per series and the quorum holds. Deploy across a number of zones equal to or greater than the replication factor; for RF=3 that is three zones, and Mimir needs floor(RF/2)+1 (so 2 of 3) healthy zones to keep serving writes. The practical payoff is rollouts: with zone-awareness you can restart an entire zone of ingesters at once during an upgrade and never break quorum.

3. Per-tenant limits: ingestion rate, max series, out-of-order

Multi-tenancy is the reason most teams choose Mimir, and limits are how multi-tenancy stays safe. Every tenant is identified by the X-Scope-OrgID HTTP header on both writes and reads. Tenant IDs are at most 150 bytes and restricted to alphanumerics plus a small set of punctuation; __mimir_cluster is reserved. Without limits, one team shipping a cardinality bomb (a user_id label, an unbounded path) can exhaust ingester memory for everyone. Limits are the blast-radius control.

Set conservative defaults in the main config, then override per tenant in a runtime overrides file that Mimir reloads without a restart:

# mimir.yaml -- defaults applied to every tenant
limits:
  ingestion_rate: 50000              # samples/sec, sustained, per tenant
  ingestion_burst_size: 500000       # samples, token-bucket burst
  max_global_series_per_user: 1500000  # active series across all ingesters
  max_label_names_per_series: 30
  out_of_order_time_window: 0s       # off by default
  compactor_blocks_retention_period: 90d  # per-tenant retention in object storage

runtime_config:
  file: /etc/mimir/overrides.yaml

# overrides.yaml -- hot-reloaded, no restart
overrides:
  payments-team:
    max_global_series_per_user: 8000000
    ingestion_rate: 250000
    ingestion_burst_size: 2500000
  noisy-batch-jobs:
    out_of_order_time_window: 10m   # accept late samples from batch exporters

Three limits do the heavy lifting:

Limit	What it bounds	Failure when exceeded
`ingestion_rate` / `ingestion_burst_size`	Samples/sec per tenant (token bucket)	Distributor returns HTTP 429; remote-write retries with backoff
`max_global_series_per_user`	Active series summed across all ingesters	New series rejected with HTTP 400 until churn frees room
`out_of_order_time_window`	How far back a late sample may be	Older samples dropped as out-of-order unless the window covers them

max_global_series_per_user is global because each series is replicated across 3 ingesters; the limit is divided across the ingester set so you reason about the global active-series count, not per-pod numbers. Out-of-order ingestion is off by default and deserves a deliberate per-tenant decision: turning on the OOO window lets late samples (batch jobs, cross-region forwarders, edge agents buffering through a partition) be accepted instead of dropped, at the cost of some ingester memory. Enable it where you actually have late data, not globally.

4. Query path internals: query-frontend split, results cache, querier

The read path is where Mimir earns its keep on large queries, and the query-frontend is the component that makes it fast. A PromQL request lands on the query-frontend first. It does three things before any querier touches a TSDB block:

Splits by time. A 30-day range query is chopped into per-day sub-queries (split_queries_by_interval), which parallelize across queriers and bound the data any single sub-query scans.
Checks the results cache. Each sub-query is looked up in a memcached-backed results cache. Cache keys incorporate the query and time range, so yesterday’s already-computed slices are served from cache and only the live edge of the range is recomputed on a dashboard refresh.
Enqueues and load-balances. Remaining sub-queries go on an internal queue. Queriers pull work from that queue (directly, or via a separate query-scheduler at large scale), which keeps a slow query from starving fast ones and spreads load evenly.

frontend:
  # Split long-range queries into per-day chunks
  split_queries_by_interval: 24h
  # Align step boundaries so cached chunks are reusable across refreshes
  align_queries_with_step: true
  cache_results: true
  results_cache:
    backend: memcached
    memcached:
      addresses: dns+memcached-frontend.mimir.svc.cluster.local:11211

frontend_worker:
  # Queriers connect back to the frontend (or scheduler) to pull work
  frontend_address: mimir-query-frontend-headless.mimir.svc.cluster.local:9095

The querier is stateless and fans out: for any sub-query it asks ingesters for the recent in-memory window and store-gateways for older data in object storage, then stitches and deduplicates the replicated samples. Holding no state, queriers scale purely on query concurrency and CPU. This is why a dashboard refresh that re-runs a 30-day query is cheap: only the newest day is uncached, the rest is a memcached hit.

5. Compactor sharding and store-gateway block assignment

Ingesters ship 2-hour blocks to object storage continuously. Left alone, that is thousands of tiny blocks per tenant per day, which is murder on query performance and listing cost. The compactor is the background batch job that merges and deduplicates those blocks into larger, longer time ranges and writes a consolidated bucket index. It also enforces per-tenant retention by deleting blocks past compactor_blocks_retention_period.

At scale, a single compactor cannot keep up across all tenants, so the compactor runs as a sharded ring. Compaction work for a tenant is assigned to a subset of compactors via compactor_tenant_shard_size, which spreads the CPU and the object-store throughput:

compactor:
  compaction_interval: 30m
  compactor_tenant_shard_size: 0   # 0 = shuffle-shard across all compactors
  data_dir: /data/compactor
  sharding_ring:
    kvstore:
      store: memberlist

The store-gateway owns the read side of object storage. It discovers all blocks in the bucket, downloads and caches each block’s index-header, and answers queriers’ requests for historical data without re-listing the bucket per query. Like the compactor, it is a sharded ring: blocks are assigned across store-gateway replicas so no single instance has to hold the index for the entire bucket. Run store-gateways with replication so that losing one does not strand a slice of blocks, and back them with index and chunk caches:

store_gateway:
  sharding_ring:
    kvstore:
      store: memberlist
    replication_factor: 3
    zone_awareness_enabled: true
    instance_availability_zone: us-east-1a

blocks_storage:
  bucket_store:
    index_cache:
      backend: memcached
      memcached:
        addresses: dns+memcached-index.mimir.svc.cluster.local:11211
    chunks_cache:
      backend: memcached
      memcached:
        addresses: dns+memcached-chunks.mimir.svc.cluster.local:11211

The mental model: compactor writes the bucket into a query-friendly shape; store-gateway reads it back efficiently; both shard across a ring so the bucket can grow without any single instance becoming the bottleneck.

6. Wiring remote-write from Prometheus and Grafana Alloy

Producers send data to the distributor over the Prometheus remote-write protocol, with the tenant in the X-Scope-OrgID header. From a vanilla Prometheus:

# prometheus.yml
remote_write:
  - url: https://mimir-distributor.example.com/api/v1/push
    headers:
      X-Scope-OrgID: payments-team
    queue_config:
      max_shards: 50
      capacity: 10000
      max_samples_per_send: 2000
    metadata_config:
      send: true

Grafana Alloy (the successor to the Grafana Agent) uses the same endpoint and header:

prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir-distributor.example.com/api/v1/push"
    headers = {
      "X-Scope-OrgID" = "payments-team",
    }
  }
}

Two operational notes that bite teams in week one. First, when -auth.multitenancy-enabled is at its default (true), the header is mandatory; a remote-write with no X-Scope-OrgID is rejected. If you disable multi-tenancy, all data lands in the single tenant anonymous, which is a defensible choice only for a true single-team cluster. Second, Prometheus owns the path tag with the tenant ID, but you almost never want producers picking arbitrary tenant strings. Terminate remote-write at an authenticating proxy (the same gateway that fronts the distributor) and have it set or validate X-Scope-OrgID from the caller’s identity, so a misconfigured Prometheus cannot write into another team’s tenant.

7. Right-sizing ingesters, WAL replay, and safe rollouts

Ingesters will cost you an outage if you treat them like stateless pods. Each holds the recent (~2h head plus in-flight) samples for its share of every tenant in memory, backed by a write-ahead log on a persistent volume. Sizing is driven by active series, not query load.

Memory. Budget against active series per ingester. The blunt lever is the global series limit times replication factor, divided across the ingester count. At 30M global series and RF=3 that is 90M series-replicas; add ingesters to bring per-pod series down rather than vertically scaling a few huge ones.
Disk. The WAL and not-yet-shipped blocks live on a PVC. Size it for several hours of headroom so a stuck object-store upload does not fill the volume.
WAL replay. On restart, an ingester replays its WAL to rebuild the head before it is Ready. With large heads this takes minutes of NotReady, which is exactly why rollouts must be careful.

The rollout rule: never restart more than you can lose from quorum, and never start a new ingester before the previous one finished WAL replay and rejoined the ring. With zone-aware replication you can safely roll one entire zone at a time, because the other two zones still satisfy quorum. The mimir-distributed Helm chart and Mimir’s Jsonnet ship a rollout-operator that enforces zone-by-zone ordering and waits on readiness; do not replace that with a naive kubectl rollout restart across all zones. When scaling down, an ingester must flush its head to object storage before terminating or recent samples are lost. Mimir flushes on SIGTERM, so give it a multi-minute terminationGracePeriodSeconds to finish rather than getting killed mid-write.

8. Cost and capacity modeling for billions of active series

At very large scale the dominant costs are ingester memory and object-store operations, not the bytes stored. Model them explicitly before you size the cluster.

Ingesters (memory). This is the floor on your compute bill. Active series is the input; replication multiplies it. Plan from a measured per-series memory figure for your label cardinality and churn, then:

ingester replicas  ~=  (active_series * replication_factor) / series_per_ingester

A billion active series at RF=3 means three billion series-replicas across the fleet. The realistic move at that scale is shuffle sharding: per-tenant shard sizes (the tenant_shard_size family on ingesters, store-gateways, and the compactor) so each tenant uses only a subset of the fleet. It bounds one tenant’s blast radius and improves cache locality, and it is how large Mimir clusters stay multi-tenant without every tenant touching every ingester.

Object storage (operations, not size). Compacted blocks are cheap to store; the bill is GET/LIST/PUT operations. Merging small blocks into large ones is itself a cost control, because fewer objects mean fewer listings and fetches. Size index and chunk caches generously so store-gateways serve repeats from memcached.

Caches and queriers (the read tax). A large dashboard footprint drives querier CPU and memcached capacity far more than ingestion does. The results cache is the highest-leverage spend: it turns repeated long-range queries into cache hits, paying for itself in querier replicas you do not have to run.

Rule of thumb for capacity planning: size ingesters from active series and replication factor, size store-gateways and caches from query patterns and bucket size, and size the compactor from tenant count and block churn. These three tiers scale on three different inputs, which is the entire argument for microservices mode.

Verify

Confirm the cluster is healthy end to end before pointing real producers at it.

Check that all ring members are ACTIVE and zone-spread. Each component exposes its ring at an admin endpoint:

# Ingester ring: every entry should be ACTIVE with distinct zones
curl -s http://mimir-ingester.mimir.svc:8080/ingester/ring | grep -E 'ACTIVE|Zone'

# Store-gateway ring
curl -s http://mimir-store-gateway.mimir.svc:8080/store-gateway/ring | head

Push a sample for a test tenant and read it back, proving the tenant header flows through write and read:

# Write one sample for tenant 'demo' via mimirtool (or remote-write directly)
mimirtool remote-write \
  --address=http://mimir-distributor.mimir.svc:8080 \
  --id=demo \
  --metric='kv_probe' --value=1

# Read it back -- same X-Scope-OrgID is required on queries
curl -s -H "X-Scope-OrgID: demo" \
  'http://mimir-query-frontend.mimir.svc:8080/prometheus/api/v1/query?query=kv_probe'

Confirm the query-frontend cache is actually being hit and that splitting is happening. Re-run a wide-range query twice and watch the cache metrics climb:

# Results-cache hit ratio on the query-frontend (should rise on repeat queries)
sum(rate(cortex_cache_hits_total{name=~".*frontend.*"}[5m]))
  /
sum(rate(cortex_cache_fetched_keys_total{name=~".*frontend.*"}[5m]))

# Per-tenant rejected (limited) samples -- should be zero for healthy tenants
sum by (reason) (rate(cortex_discarded_samples_total{user="payments-team"}[5m]))

Confirm the compactor is making progress and not falling behind:

# Successful compactions should be advancing; failures should be ~0
sum(rate(cortex_compactor_runs_completed_total[1h]))
sum(rate(cortex_compactor_runs_failed_total[1h]))

Enterprise scenario

A platform team consolidated 40 team-owned Prometheus servers onto one Mimir cluster. Months in, a single analytics team shipped a build that added a query_fingerprint label to a hot metric. Their active series went from 4M to 60M in twenty minutes. The cluster had per-tenant max_global_series_per_user set, so Mimir correctly rejected the analytics tenant’s new series with HTTP 400 and the other 39 tenants were untouched. But the analytics team’s own dashboards and alerts went dark, and their on-call paged the platform team as a sev-1 because “Mimir is dropping our metrics.”

The constraint: the limit did its job (protect the cluster), but the signal was buried. Nobody saw the cardinality explosion until series were already being rejected, and the rejection looked like a Mimir outage to the affected team.

The fix had two parts. First, they alerted on the discard rate before the hard limit by tracking rejected samples per tenant, so the analytics team got paged on their own cardinality the moment it spiked, with a runbook that named the offending metric. Second, they moved the noisiest exploratory tenants onto shuffle sharding with a bounded per-tenant ingester shard, so even a future cardinality bomb could only saturate that tenant’s slice of the fleet, never the shared ingester memory:

# overrides.yaml
overrides:
  analytics-team:
    ingestion_tenant_shard_size: 12   # only 12 ingesters, not the whole fleet
    max_global_series_per_user: 20000000

# Alert: a tenant is being limited (fires before users notice dropped data)
sum by (user) (rate(cortex_discarded_samples_total[5m])) > 0

The next cardinality incident, four weeks later, paged the owning team in two minutes with the exact metric name, stayed contained to a 12-ingester shard, and never registered as a platform-wide event. The lesson: in a multi-tenant system the limit is only half the design; the other half is making its signal land on the team that caused it, before it looks like your outage.

Running Grafana Mimir: Multi-Tenant, Horizontally Scalable Prometheus Storage

1. Deployment modes: monolithic vs microservices

2. The hash ring: distributor, ingester replication, zone-aware spreading

3. Per-tenant limits: ingestion rate, max series, out-of-order

4. Query path internals: query-frontend split, results cache, querier

5. Compactor sharding and store-gateway block assignment

6. Wiring remote-write from Prometheus and Grafana Alloy

7. Right-sizing ingesters, WAL replay, and safe rollouts

8. Cost and capacity modeling for billions of active series

Verify

Enterprise scenario

Checklist

Written by Vinod

Comments

Keep Reading

Application Insights with OpenTelemetry: Distributed Tracing and Adaptive Sampling for .NET

Distributed Tracing on AWS with X-Ray: Service Maps, Segments, and ADOT on EKS

Azure Monitor Managed Prometheus and Managed Grafana for AKS, End to End