Tail-Based Sampling at Scale with the OpenTelemetry Collector and Load-Balancing Exporter

Probabilistic head sampling decides on the root span, before the trace exists. Keep 1%, and you keep 1% of the errors, 1% of the slow requests, and 1% of everything boring — blind, uniform, and cheap. Tail sampling inverts that: it buffers every span of a trace until the trace is complete, then decides with full knowledge — keep every error and every p99-latency outlier, drop the fast and healthy bulk. The catch is the word complete. Tail sampling only works if every span of a given trace lands on the same Collector instance, and that single constraint forces the architecture in this article. Get the routing wrong and you do not get bad sampling — you get inconsistent sampling, where half a trace is kept and half is dropped, which is worse than no sampling at all.

1. Head vs tail: why tail needs full-trace assembly

Head sampling decides at trace start. The decision propagates in the W3C traceparent sampled flag, so every service honours it consistently and the cost is near zero — you never generate the spans you will not keep. The limitation is that the root has no idea whether the request will fail or stall three hops downstream.

Tail sampling decides at trace end. The tailsamplingprocessor holds spans in memory, grouped by trace ID, waits a configured window for the trace to go quiet, then runs your policies against the assembled trace and emits a single keep/drop verdict for all spans at once.

Dimension	Head (probabilistic)	Tail (policy-based)
Decision point	Root span, at trace start	After trace assembly
Information available	Almost none	Full trace: latency, errors, attributes
Keeps all errors?	No (probabilistic)	Yes, with a `status_code` policy
CPU/memory cost	Negligible	High: buffers every span
Requires whole trace co-located	No	Yes
Stateful	No	Yes

The co-location row is the whole problem. A single trace fans across many services, and their spans arrive at your Collector fleet through different agents. If three replicas each see a third of one trace, each runs its policies on a fragment and you get split decisions. Tail sampling demands that all spans sharing a trace ID converge on exactly one decision-making instance. Scaling the tail tier behind a normal round-robin Service is precisely the failure mode. The fix is consistent, trace-ID-based routing, which is what the load-balancing exporter provides.

2. The two-tier topology

The architecture is two distinct Collector tiers with different jobs. The data path is: app pods --OTLP--> gateway tier (N stateless replicas, load-balancing exporter, routes by trace ID) --OTLP--> sampler tier (each replica owns a slice of trace IDs, runs tail sampling) --> backend.

Tier 1 — gateways. Receive OTLP from agents/SDKs, enrich, batch, and route. The gateway holds no sampling state; any replica can receive any span. Its one special job is the load-balancing exporter, which hashes each span’s trace ID to a fixed sampling-tier backend. Because the hash is deterministic, every span of a given trace — whichever gateway received it — lands on the same sampler.

Tier 2 — samplers. Each replica runs tailsamplingprocessor, owns a slice of the trace-ID space, and sees complete traces for that slice, so its decisions are sound. This is the only tier that exports to your tracing backend (Tempo, Jaeger, a SaaS).

Why not put tailsamplingprocessor on the gateways directly? Because you cannot scale them. Add a second gateway replica behind a normal Service and traces split across replicas, breaking tail sampling. The load-balancing exporter exists to decouple “how many instances receive telemetry” from “which instance decides a given trace.”
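
To make the splitting problem concrete, here is a minimal Python sketch that routes the same interleaved spans two ways: round-robin, as a normal Service would, and by a hash of the trace ID. The trace names, the replica count of 3, and the modulo hash are illustrative assumptions. The real exporter uses a consistent-hash ring, not a plain modulo.

```python
import hashlib
from collections import defaultdict
from itertools import cycle


def route_round_robin(spans, replicas):
    """Assign spans to replicas in arrival order, ignoring the trace ID."""
    owners = defaultdict(set)
    next_replica = cycle(range(replicas))
    for trace_id, _span_id in spans:
        owners[trace_id].add(next(next_replica))
    return owners


def route_by_trace_id(spans, replicas):
    """Assign each span to a replica derived from a hash of its trace ID."""
    owners = defaultdict(set)
    for trace_id, _span_id in spans:
        digest = hashlib.sha256(trace_id.encode()).digest()
        owners[trace_id].add(int.from_bytes(digest[:8], "big") % replicas)
    return owners


def split_traces(owners):
    """Count traces whose spans landed on more than one replica."""
    return sum(1 for seen in owners.values() if len(seen) > 1)


# 1,000 hypothetical traces of 20 spans each, interleaved as they might arrive.
SPANS = [(f"trace-{t}", s) for s in range(20) for t in range(1000)]

if __name__ == "__main__":
    print("round-robin split traces:", split_traces(route_round_robin(SPANS, 3)))
    print("trace-ID hash split traces:", split_traces(route_by_trace_id(SPANS, 3)))
```

Under round-robin, a trace stays whole only by luck. Under trace-ID hashing, the receiving gateway has no influence on which sampler owns the trace.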

3. Configuring the load-balancing exporter to route by trace ID

The load-balancing exporter is part of the contrib distribution (otelcol-contrib), not the core binary. Its job is to pick a backend deterministically from a routing key. For tail sampling the key must be traceID — this guarantees span co-location.

# tier 1: gateway collector
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  k8sattributes: {}
  batch:
    timeout: 1s
    send_batch_size: 8192

exporters:
  loadbalancing:
    routing_key: traceID            # CRITICAL: route by trace ID, not service
    protocol:
      otlp:
        timeout: 5s
        tls:
          insecure: true            # in-cluster; use real TLS across boundaries
    resolver:
      # k8s resolver watches the Endpoints API directly: ring updates on scale
      # events are near-instant. Use the `dns` resolver elsewhere (set hostname
      # to the headless Service FQDN), but it lags by the DNS TTL.
      k8s:
        service: otel-sampler-headless.observability
        ports:
          - 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [loadbalancing]

Two details decide whether this works in production:

routing_key: traceID. The default is traceID for trace pipelines, but set it explicitly. Another valid value, service, routes by service name and scatters one trace across many samplers, defeating the design. (Use service only for per-service span-metrics aggregation, never for tail sampling.)
The resolver and a headless Service. The exporter maintains a consistent-hash ring over the resolved backends, so the sampler tier should be a headless Service (clusterIP: None) whose pods expose 4317; that way each pod IP is a distinct ring member. With the dns resolver the headless part is non-negotiable: a normal ClusterIP Service resolves to one VIP, the ring has a single member, and load-balancing collapses to “send everything to one place.” The k8s resolver reads pod IPs from the Endpoints API instead of DNS, so the gateway’s ServiceAccount needs permission to get, list, and watch endpoints in the sampler namespace.
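
The ring behaviour described above can be sketched in a few lines of Python. This is a toy consistent-hash ring with virtual nodes, not the exporter's actual implementation; the class name, the 100 virtual nodes per backend, and the pod addresses are assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, backends, vnodes=100):
        # Each backend contributes `vnodes` points so ownership is spread out.
        self._points = sorted(
            (_hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def owner(self, trace_id: str) -> str:
        # First ring point clockwise from the trace ID's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, _hash(trace_id)) % len(self._keys)
        return self._points[idx][1]


def moved_fraction(before: HashRing, after: HashRing, trace_ids) -> float:
    """Fraction of trace IDs whose owning backend differs between two rings."""
    moved = sum(1 for t in trace_ids if before.owner(t) != after.owner(t))
    return moved / len(trace_ids)


if __name__ == "__main__":
    pods = [f"10.0.0.{i}:4317" for i in range(1, 5)]   # hypothetical pod IPs
    ids = [f"trace-{i}" for i in range(5000)]
    grown = HashRing(pods + ["10.0.0.5:4317"])
    print("moved after scale-up:", moved_fraction(HashRing(pods), grown, ids))
```

Two properties matter for tail sampling. Every gateway that builds the ring from the same backend list picks the same owner for a trace ID. Adding a backend moves only the trace IDs that the new backend takes over. A ring with a single member, which is what one VIP looks like, sends everything to that member.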

4. Tail-sampling processor policies

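On the sampler tier, tailsamplingprocessor evaluates a list of policies against each assembled trace. The logic is OR-with-precedence: if any policy says Sampled, the trace is kept; an invert_match policy can force a drop. Order matters most inside an and policy: sub-policies run in sequence and evaluation stops at the first one that does not match, so a rate_limiting sub-policy placed last spends its budget only on traces that already matched the filters before it.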

# tier 2: sampler collector
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # memory_limiter runs first so the sampler pushes back before the buffer OOMs
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  tail_sampling:
    decision_wait: 10s        # how long to wait for a trace to complete
    num_traces: 200000        # max traces held in memory at once
    expected_new_traces_per_sec: 5000
    policies:
      # 1. Always keep anything that errored.
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # 2. Always keep slow traces (>= 750ms end to end).
      - name: slow
        type: latency
        latency:
          threshold_ms: 750

      # 3. Keep traces touching the payments service, but cap the volume.
      - name: payments-sampled
        type: and
        and:
          and_sub_policy:
            - name: is-payments
              type: string_attribute
              string_attribute:
                key: service.name
                values: [payments-api]
            - name: cap
              type: rate_limiting
              rate_limiting:
                spans_per_second: 500

      # 4. Everything else: keep a representative 2%.
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 2

exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability.svc.cluster.local:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/tempo]

The policy types worth knowing:

Policy `type`	Keeps a trace when…	Typical use
`status_code`	any span has status ERROR	never lose a failure
`latency`	total duration >= threshold_ms	catch slow outliers
`rate_limiting`	under spans/sec budget	cap a noisy source
`probabilistic`	hash falls under percentage	representative baseline
`string_attribute` / `numeric_attribute`	attribute matches	tenant/route targeting
`and`	all sub-policies match	“payments AND rate-limited”
`composite`	weighted allocation across sub-policies	budget split by category

Use and for a boolean conjunction (“is-payments AND within rate limit”, above). Use composite to hand out a total span budget across several categories with per-category rate limits and ordering — the right tool when a single global cap must be divided fairly, not for a simple conjunction.
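
The combination logic can be sketched as plain Python. This is a simplified model, not the processor's code: the Trace fields, the SpanBudget class (a stand-in for rate_limiting over a single time window), and the helper names are all hypothetical, and invert_match is left out.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    duration_ms: int
    has_error: bool = False
    services: frozenset = frozenset()
    span_count: int = 1


def errors(trace):
    return trace.has_error


def slow(threshold_ms):
    return lambda trace: trace.duration_ms >= threshold_ms


def touches(service):
    return lambda trace: service in trace.services


def probabilistic(percentage):
    def policy(trace):
        # Hash the trace ID into 10,000 buckets; keep the lowest `percentage` %.
        digest = hashlib.sha256(trace.trace_id.encode()).digest()
        return int.from_bytes(digest[:4], "big") % 10_000 < percentage * 100
    return policy


class SpanBudget:
    """Matches traces until a fixed span budget is spent (one window only)."""

    def __init__(self, spans):
        self.remaining = spans

    def __call__(self, trace):
        if trace.span_count > self.remaining:
            return False
        self.remaining -= trace.span_count
        return True


def all_of(*sub_policies):
    # all() stops at the first sub-policy that returns False, so a budget
    # placed last is only charged for traces that passed the earlier filters.
    return lambda trace: all(policy(trace) for policy in sub_policies)


def keep(trace, policies):
    """Evaluate every top-level policy; keep the trace if any one matched."""
    verdicts = {name: policy(trace) for name, policy in policies.items()}
    return any(verdicts.values()), verdicts
```

A short usage example: with policies = {"errors": errors, "slow": slow(750), "payments-sampled": all_of(touches("payments-api"), SpanBudget(500)), "baseline": probabilistic(2)}, calling keep(trace, policies) returns the overall verdict plus the per-policy votes, which is the same breakdown the per-policy metrics in section 7 expose.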

5. Sizing decision_wait, num_traces, and memory

These three coupled knobs are where tail sampling lives or dies, and the dangerous failure is silent: undersize them and traces get evicted before their decision, which looks identical to “the policy dropped them.”

decision_wait is how long the processor holds a trace before deciding. It must exceed your real end-to-end trace duration at a high percentile, plus transit. If your p99 trace is 4s, a 2s window decides on incomplete traces and your latency policy never fires on the slow ones — the exact traces you care about. Start at 10s, then tune against the late-span metric in section 7. Do not set it to 60s “to be safe”: decision_wait directly multiplies memory, because a trace occupies the buffer for that entire window.

num_traces is the hard cap on traces resident in memory. It must comfortably exceed decision_wait x arrival_rate. With 5,000 new traces/sec and a 10s window you have ~50,000 in flight; num_traces: 200000 gives 4x burst headroom. Exceed it and the processor evicts the oldest traces early, deciding them on whatever spans arrived — usually incomplete. expected_new_traces_per_sec pre-allocates the internal map; set it to the steady-state arrival rate per replica (total rate / replica count, since the load balancer spreads traces evenly).

Memory. Budget per-trace memory as (avg spans/trace x avg span size). At ~20 spans of ~1.5KB, a trace is ~30KB. The ~50,000 traces in flight at steady state are ~1.5GB, and a full buffer of 200,000 is ~6GB of spans before Go overhead, so size the pod for the full buffer. Put memory_limiter first in the sampler pipeline, give the pod a generous limit, and set GOMEMLIMIT so the Go runtime garbage-collects under pressure instead of getting OOM-killed:

# sampler pod env
env:
  - name: GOMEMLIMIT
    value: "7GiB"     # ~90% of the container memory limit

Rule of thumb: scale the sampler tier on trace arrival rate, not CPU. Each replica’s memory budget is num_traces x avg_trace_bytes. To handle more traffic, add replicas — the load balancer re-hashes the ring and each replica owns a smaller slice — rather than enlarging num_traces on a few fat pods, which only raises your blast radius on OOM.
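
The sizing arithmetic in this section can be written down as a small Python helper, which makes it easy to re-run with your own numbers. The function names are hypothetical, KB is taken as 1,000 bytes to match the round figures above, and the 8192 MiB container limit is an assumed example value.

```python
import math


def in_flight_traces(traces_per_sec: int, decision_wait_s: int) -> int:
    """Traces resident in the buffer at steady state."""
    return traces_per_sec * decision_wait_s


def buffer_bytes(traces: int, spans_per_trace: int, span_bytes: int) -> int:
    """Span payload held for `traces` buffered traces, before Go overhead."""
    return traces * spans_per_trace * span_bytes


def replicas_needed(total_traces_per_sec: int, per_replica_traces_per_sec: int) -> int:
    """Sampler replicas required so each stays at its planned arrival rate."""
    return math.ceil(total_traces_per_sec / per_replica_traces_per_sec)


def gomemlimit(container_limit_mib: int, percent: int = 90) -> str:
    """GOMEMLIMIT value as a percentage of the container memory limit."""
    return f"{container_limit_mib * percent // 100}MiB"


if __name__ == "__main__":
    steady = in_flight_traces(5_000, 10)      # per-replica rate, 10s window
    num_traces = steady * 4                   # 4x burst headroom
    print("in flight:", steady)
    print("num_traces:", num_traces)
    print("steady-state bytes:", buffer_bytes(steady, 20, 1_500))
    print("full-buffer bytes:", buffer_bytes(num_traces, 20, 1_500))
    print("GOMEMLIMIT:", gomemlimit(8192))    # assumed 8 GiB container limit
```

Doubling decision_wait doubles every byte figure, which is why the window should follow measured trace duration, not a guess padded for safety.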

6. Handling late-arriving spans and incomplete traces

Distributed traces do not arrive atomically — a slow async worker may emit its span seconds after the root finished. Three behaviours protect you.

decision_wait absorbs normal lateness. Any span arriving within the window joins its trace before the decision, which is why the window must track real trace duration, not a guess.

Spans arriving after the decision take the late-span path. Once a trace is decided and flushed, later spans for the same trace ID are evaluated against the cached decision where possible, so a kept trace’s stragglers are still exported and a dropped trace’s are dropped — keeping the trace consistent. A rising late-span counter (section 7) is the signal to increase decision_wait.

Routing stability prevents the worst case: the ring changing mid-trace. If a sampler pod restarts while a trace is in flight, some spans hash to the old owner and some to the new, splitting the decision. Mitigate by (a) using the k8s resolver for fast endpoint updates, (b) a decision_wait comfortably longer than a normal pod-startup blip, and (c) a terminationGracePeriodSeconds long enough to drain in-flight traces on rollout:

# sampler workload
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30   # >= decision_wait, let in-flight traces decide

No tail-sampling setup eliminates split traces entirely during a scaling event — accept a small, measured inconsistency during rollouts and minimise its window rather than chasing perfection.
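
The interaction between the decision window and late spans is easier to see in a toy model. The Python sketch below is a simplified buffer with an explicit clock and an unbounded decision cache; the class and method names are hypothetical, and the real processor's cache and eviction behaviour are more involved.

```python
class TailBuffer:
    """Toy tail-sampling buffer: hold spans, decide once, cache the verdict."""

    def __init__(self, decision_wait, keep_fn):
        self.decision_wait = decision_wait
        self.keep_fn = keep_fn      # callable: list of spans -> bool
        self.pending = {}           # trace_id -> (first_seen, [spans])
        self.decided = {}           # trace_id -> cached keep/drop verdict
        self.late_spans = 0

    def add(self, now, trace_id, span):
        """Buffer a span; return spans to export immediately (late spans only)."""
        if trace_id in self.decided:
            # Late-span path: follow the cached verdict so the trace stays consistent.
            self.late_spans += 1
            return [span] if self.decided[trace_id] else []
        self.pending.setdefault(trace_id, (now, []))[1].append(span)
        return []

    def tick(self, now):
        """Decide every trace whose window has elapsed; return exported spans."""
        exported = []
        due = [t for t, (seen, _) in self.pending.items()
               if now - seen >= self.decision_wait]
        for trace_id in due:
            _, spans = self.pending.pop(trace_id)
            self.decided[trace_id] = self.keep_fn(spans)
            if self.decided[trace_id]:
                exported.extend(spans)
        return exported
```

With a 10-second window, an error span arriving at t=4 is part of the decision and the trace is kept. With a 2-second window, the same trace is decided at t=2 on its healthy root alone, the verdict is drop, and the error span that arrives later is dropped with it. That is the silent failure a rising late-span counter warns about.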

7. Validating sampling fairness and exporting metrics

Scrape the Collector’s own internal metrics and the tail processor stops being a black box. The decisive series is otelcol_processor_tail_sampling_count_traces_sampled, broken down by policy and a sampled (true/false) dimension.

# in either tier's config: expose Collector self-telemetry
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888

Scrape :8888/metrics with Prometheus. The series to watch:

Metric	Tells you
`otelcol_processor_tail_sampling_count_traces_sampled`	kept vs dropped, per policy
`otelcol_processor_tail_sampling_sampling_trace_dropped_too_early`	traces evicted before decision — undersized
`otelcol_processor_tail_sampling_sampling_late_span_age`	how late stragglers arrive
`otelcol_processor_tail_sampling_new_trace_id_received`	trace arrival rate per replica

Kept ratio per policy, and the late-span signal:

# fraction of evaluated traces each policy votes to keep
sum by (policy) (
  rate(otelcol_processor_tail_sampling_count_traces_sampled{sampled="true"}[5m])
)
/
sum by (policy) (
  rate(otelcol_processor_tail_sampling_count_traces_sampled[5m])
)

# alert: traces evicted before a decision was made (raise num_traces / decision_wait)
rate(otelcol_processor_tail_sampling_sampling_trace_dropped_too_early[5m]) > 0

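The first query proves your policies behave. Every policy votes on every trace, so each ratio is the share of all evaluated traces that the policy matched: errors should track your true error rate, slow the share of traces at or above the latency threshold, and baseline ~2%. A ratio stuck at zero or far from those expectations means the policy is not seeing what you think it sees. Fairness across the ring is the second check: new_trace_id_received should be roughly equal across sampler replicas. A lopsided distribution means the ring is unbalanced, often because the resolver only sees one backend (re-check that the Service is headless).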
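
If you prefer to run these checks in a script against scraped values instead of in PromQL, the same two calculations are a few lines of Python. The input shapes are assumptions: tuples of (policy, sampled label, counter increase) for the first function, and a mapping of replica name to new-trace rate for the second.

```python
from collections import defaultdict


def kept_ratio(samples):
    """Per-policy fraction of evaluated traces with sampled="true".

    `samples` is an iterable of (policy, sampled_label, counter_increase).
    """
    kept = defaultdict(float)
    total = defaultdict(float)
    for policy, sampled, value in samples:
        total[policy] += value
        if sampled == "true":
            kept[policy] += value
    return {policy: kept[policy] / total[policy]
            for policy in total if total[policy] > 0}


def max_deviation(rate_by_replica):
    """Largest relative distance of any replica's rate from the fleet mean."""
    mean = sum(rate_by_replica.values()) / len(rate_by_replica)
    return max(abs(rate - mean) / mean for rate in rate_by_replica.values())


if __name__ == "__main__":
    samples = [("baseline", "true", 20), ("baseline", "false", 980)]
    print(kept_ratio(samples))
    rates = {"sampler-0": 5100, "sampler-1": 4950, "sampler-2": 4950}
    print("balanced" if max_deviation(rates) <= 0.10 else "unbalanced ring")
```

The 10% tolerance is a judgment call. A single replica receiving all new traces shows up as a deviation far above any reasonable threshold.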

8. Cost impact versus probabilistic head sampling

The reason teams adopt this complexity is economics, and the comparison is counter-intuitive. Take 50,000 traces/sec, ~20 spans each = 1,000,000 spans/sec, a true error rate of 0.5%, and a backend billed per ingested span.

Head sampling at 1%: ingest 10,000 spans/sec, keep 1% of errors — you lose 99% of every failure. Cheap, but during an incident the trace you need is almost certainly gone.
Tail sampling (all errors + all >750ms + 2% baseline): errors ~5,000 spans/sec, slow outliers a few percent more, baseline ~20,000 spans/sec. Net ingestion lands near 3-5% — call it ~40,000 spans/sec.

Strategy	Spans ingested/sec	Errors kept	Backend cost (relative)
Head 1%	10,000	1%	1.0x
Head 5%	50,000	5%	5.0x
Tail (policy)	~40,000	~100%	~4.0x at the backend

Tail sampling ingests more than aggressive head sampling, so it is not free at the backend. The honest framing: it buys decision quality, not raw storage savings. You pay more in Collector compute (the sampler tier is memory-heavy and CPU-real) and possibly more at the backend than 1% head sampling. What you get is a ~100% error catch rate and full p99 visibility for roughly the cost of blind 4-5% head sampling — but with the spend concentrated on traces that carry signal. Quantify it first: the sampler tier’s memory footprint (section 5) is a standing cost, and there is no point in two tiers if 1% head sampling already meets your debugging SLOs.
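
The comparison above is simple enough to reproduce, and worth re-running with your own traffic figures. The Python sketch below makes three assumptions that are not in the text: slow traces are 1.5% of traffic, error and slow traces do not overlap, and kept traces have the same average span count as everything else.

```python
def head_spans_per_sec(traces_per_sec, spans_per_trace, rate):
    """Spans ingested per second under uniform head sampling."""
    return traces_per_sec * spans_per_trace * rate


def tail_spans_per_sec(traces_per_sec, spans_per_trace,
                       error_rate, slow_rate, baseline_rate):
    """Spans ingested per second when errors and slow traces are always kept
    and a probabilistic baseline is applied to the remainder."""
    always_kept = error_rate + slow_rate          # assumes no overlap
    kept_fraction = always_kept + (1 - always_kept) * baseline_rate
    return traces_per_sec * spans_per_trace * kept_fraction


if __name__ == "__main__":
    head_1 = head_spans_per_sec(50_000, 20, 0.01)
    head_5 = head_spans_per_sec(50_000, 20, 0.05)
    tail = tail_spans_per_sec(50_000, 20, error_rate=0.005,
                              slow_rate=0.015,    # assumed share of slow traces
                              baseline_rate=0.02)
    for name, spans in [("head 1%", head_1), ("head 5%", head_5), ("tail", tail)]:
        print(f"{name}: {spans:,.0f} spans/sec, {spans / head_1:.1f}x head 1%")
```

The slow-trace share is the sensitive input. A latency threshold set well below p99 keeps far more traces and can push tail sampling past the cost of 5% head sampling.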

Enterprise scenario

A payments platform team ran a single-tier pool of eight otelcol-contrib replicas behind a standard ClusterIP Service, each with tailsamplingprocessor set to keep 100% of errors. During a partial outage, on-call pulled up the failing checkout flow in Tempo and found traces that ended abruptly — root span and a couple of children present, the downstream payment-authorization spans missing. The status_code: ERROR policy was correct; the errored spans had simply landed on a different replica than the root, and that replica, seeing only a healthy fragment, dropped its share via the 2% baseline. Every error trace was being silently torn in half.

The fix was the two-tier split. They left the eight gateways stateless (enrichment plus the load-balancing exporter, routing_key: traceID) and moved tailsamplingprocessor onto a separate four-replica sampler StatefulSet behind a headless Service. The gateway-side change came down to a single exporter block:

# gateways: stop sampling here, just route every span of a trace to one sampler
exporters:
  loadbalancing:
    routing_key: traceID
    resolver:
      k8s:
        service: otel-sampler-headless.observability
        ports: [4317]

After the change, otelcol_processor_tail_sampling_count_traces_sampled{policy="errors",sampled="true"} rose to match the true error count, and trace_dropped_too_early — quietly nonzero because each undersized gateway had buffered the whole firehose — went to zero once num_traces was sized against per-replica arrival rate (total rate / 4) instead of the full stream. The runbook lesson: tail sampling behind a non-headless Service is not “slightly worse” sampling, it is silently broken sampling, and the only proof it works is the per-policy kept-ratio metric, not the config looking correct.

Verify

# 1. Confirm the sampler Service is headless (clusterIP must be None).
kubectl -n observability get svc otel-sampler-headless -o jsonpath='{.spec.clusterIP}'
# expected: None

# 2. Confirm the load balancer resolved every sampler pod (ring members).
#    Each sampler IP should appear in the gateway's loadbalancing debug logs / metrics.
kubectl -n observability get endpoints otel-sampler-headless -o jsonpath='{.subsets[*].addresses[*].ip}'

# 3. Scrape the tail processor's own metrics from a sampler pod.
kubectl -n observability port-forward statefulset/otel-sampler 8888:8888 &
curl -s localhost:8888/metrics | grep tail_sampling_count_traces_sampled

# 4. Generate load and confirm errors are kept ~100% while baseline stays ~2%.
#    (telemetrygen is the OTel load tool from the contrib repo.)
telemetrygen traces --otlp-endpoint otel-gateway.observability.svc.cluster.local:4317 \
  --otlp-insecure --traces 10000 --rate 500

In Prometheus, confirm otelcol_processor_tail_sampling_sampling_trace_dropped_too_early is flat at zero (correct sizing) and that new_trace_id_received is within ~10% across sampler replicas (balanced ring). Both green means full-trace assembly and fair routing are actually working.

Written by Vinod
