Metrics tell you a service is burning 80% of a core. Traces tell you which request is slow. Neither tells you which line of code is on CPU when that happens. That gap used to be closed by attaching a profiler to one process, for one minute, after the incident — by which point the pathological allocation is gone. Continuous profiling closes it permanently: an eBPF agent on every node samples the on-CPU stack of every process, hundreds of times per second, for roughly 1% overhead and zero application changes. The output is a flame graph you can rewind to 3 a.m. last Tuesday. This is how to deploy it, symbolize it correctly, and read it without fooling yourself.
1. Why profiling is the fourth pillar
The three classic pillars answer “what”, “when”, and “across what”. Profiling answers “where in the code”. A 50 ms p99 regression shows up in a trace as a fat process-cart span — but the span won’t tell you 40 ms of it is a JSON serializer reallocating a buffer in a loop. The profile pins it to the function and line.
The defining property is that profiling is always on, low overhead, and fleet-wide. Older profilers (perf, async-profiler attached by hand) are accurate but episodic: you only have data if you happened to be profiling when the problem occurred. Continuous profiling samples constantly and stores the result keyed by time and labels (namespace, pod, container, node, function), so the question becomes a query (“show me the flame graph for payments between 03:00 and 03:05 UTC”) rather than “let me try to reproduce it”.
What it costs you is honesty about sampling. A profiler does not watch every instruction; it interrupts the CPU at a fixed frequency and records the current stack. The relative width of a frame is an unbiased estimate of the fraction of CPU time spent there. Absolute attribution of a single rare event is not what this tool is for.
2. How eBPF stack sampling works
The agent loads a small eBPF program attached to a perf event: specifically a PERF_COUNT_SW_CPU_CLOCK (or hardware cpu-cycles) event configured to fire at, say, 19 times per second per CPU. On each fire, the kernel runs the eBPF program in the context of whatever was executing. The program calls bpf_get_stackid() once for the kernel stack and once for the user stack, gets a numeric stack ID for each, and increments a counter in a BPF hash map keyed by (pid, user_stack_id, kernel_stack_id). User space reads that map every few seconds and turns the raw addresses into a profile.
The win over perf record is that aggregation happens in the kernel. You never ship one event per sample to user space; you ship pre-counted stacks. That is what keeps overhead near 1% even at fleet scale.
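To make the aggregation idea concrete, here is a minimal user-space sketch in Go of what the BPF hash map ends up holding. It is not eBPF code: the `stackKey` type, the `aggregate` function, and the sample values are all hypothetical, chosen only to mirror the (pid, user_stack_id, kernel_stack_id) key described above.

```go
package main

import "fmt"

// stackKey mirrors the map key described above: one entry per distinct
// (pid, user stack, kernel stack) combination. Hypothetical type.
type stackKey struct {
	PID           uint32
	UserStackID   int32
	KernelStackID int32
}

// aggregate counts identical stacks. The eBPF program does the equivalent
// in the kernel, so user space reads counters instead of individual events.
func aggregate(samples []stackKey) map[stackKey]uint64 {
	counts := make(map[stackKey]uint64)
	for _, s := range samples {
		counts[s]++
	}
	return counts
}

func main() {
	// Made-up samples: pid 42 is hit three times on the same stack.
	samples := []stackKey{
		{42, 7, 1}, {42, 7, 1}, {42, 7, 1},
		{42, 9, 1},
		{99, 3, 2},
	}
	counts := aggregate(samples)
	fmt.Printf("%d samples collapsed into %d map entries\n", len(samples), len(counts))
	for k, n := range counts {
		fmt.Printf("pid=%d user=%d kernel=%d count=%d\n", k.PID, k.UserStackID, k.KernelStackID, n)
	}
}
```

Hot paths repeat the same stack many times, so the number of map entries grows far more slowly than the number of samples. That is the property the overhead figure depends on.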
The hard part is the user-space stack walk. Frame-pointer walking is cheap — follow the rbp chain — but only works if the binary was compiled with frame pointers (-fno-omit-frame-pointer), and most distro binaries and release builds omit them to free the register. So modern agents ship a second mechanism: they read the binary’s .eh_frame DWARF call-frame information, precompute a compact unwind table, load it into BPF maps, and unwind in-kernel without frame pointers. Parca Agent and the Grafana eBPF profiler both do this. No recompilation, no LD_PRELOAD, no sidecar.
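The frame-pointer walk mentioned above is simple enough to sketch. The Go program below simulates it over a fake stack: the `memory` map, the addresses, and `walkFramePointers` are illustrative assumptions, not agent code. It relies only on the x86-64 convention that `[rbp]` holds the caller's saved rbp and `[rbp+8]` holds the return address.

```go
package main

import "fmt"

// memory is a stand-in for the sampled process's stack: address -> 8-byte word.
type memory map[uint64]uint64

// walkFramePointers follows the saved-rbp chain on x86-64.
// At each frame, [rbp] holds the caller's rbp and [rbp+8] the return address.
func walkFramePointers(mem memory, ip, rbp uint64, maxDepth int) []uint64 {
	stack := []uint64{ip} // the leaf frame is the interrupted instruction pointer
	for depth := 0; depth < maxDepth && rbp != 0; depth++ {
		ret, ok := mem[rbp+8]
		if !ok {
			break // unreadable memory: stop rather than guess
		}
		stack = append(stack, ret)
		next, ok := mem[rbp]
		if !ok || next <= rbp {
			break // a valid chain moves toward higher addresses; 0 marks the root
		}
		rbp = next
	}
	return stack
}

func main() {
	// Three made-up frames; the outermost one has a saved rbp of 0.
	mem := memory{
		0x7000: 0x7020, 0x7008: 0x401200,
		0x7020: 0x7040, 0x7028: 0x401100,
		0x7040: 0, 0x7048: 0x401000,
	}
	for _, addr := range walkFramePointers(mem, 0x401300, 0x7000, 64) {
		fmt.Printf("%#x\n", addr)
	}
}
```

If the binary omits frame pointers, rbp holds arbitrary data and this loop produces garbage or stops after one frame. That is the case the DWARF-based unwind tables exist to cover.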
The kernel must support bpf_get_stackid/CO-RE, and you need CAP_BPF/CAP_PERFMON (or privileged). In practice: kernel 5.4+ works, 5.8+ is comfortable, and 5.10+ is what you want for DWARF-based unwinding stability.
3. Deploy the agent as a DaemonSet
One agent per node, profiling every container on it. Below is a Grafana Alloy DaemonSet running the eBPF profiler component and remote-writing to Pyroscope. Alloy is the supported way to ship the Grafana eBPF profiler.
// config.alloy: stored under that key in the alloy-profiler-config ConfigMap
discovery.kubernetes "pods" {
  role = "pod"
}
discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
}
pyroscope.ebpf "instance" {
  forward_to       = [pyroscope.write.endpoint.receiver]
  targets          = discovery.relabel.pods.output
  default_target   = "false"
  collect_interval = "15s"
  sample_rate      = 19
}
pyroscope.write "endpoint" {
  endpoint {
    url = "http://pyroscope.monitoring.svc.cluster.local:4040"
  }
  external_labels = {
    "cluster" = "prod-eu-west-1",
  }
}
The agent needs host access and privileges. The pod spec is the load-bearing part:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: alloy-profiler
  namespace: monitoring
spec:
  selector:
    matchLabels: { app: alloy-profiler }
  template:
    metadata:
      labels: { app: alloy-profiler }
    spec:
      hostPID: true # see other containers' PIDs
      serviceAccountName: alloy-profiler
      containers:
        - name: alloy
          image: grafana/alloy:v1.5.0
          args:
            - run
            - /etc/alloy/config.alloy
            - --server.http.listen-addr=0.0.0.0:12345
          securityContext:
            privileged: true # or CAP_BPF + CAP_PERFMON + CAP_SYS_PTRACE
          volumeMounts:
            - { name: config, mountPath: /etc/alloy }
      volumes:
        - name: config
          configMap: { name: alloy-profiler-config }
hostPID: true is what lets the agent resolve a sampled PID back to the right container and binary on the host. Without it you get stacks you cannot attribute.
If you prefer Parca, the model is identical — a DaemonSet of parca-agent pushing to a parca server:
# parca-agent container args
args:
  - /bin/parca-agent
  - --node=$(NODE_NAME)
  - --remote-store-address=parca.monitoring.svc:7070
  - --remote-store-insecure
  - --profiling-cpu-sampling-frequency=19
Both projects speak the same wire format under the hood: profiles are encoded as pprof (profile.proto), the format go tool pprof has read for a decade. That portability matters — it is why a Pyroscope profile and a Parca profile and a Go runtime profile are all the same shape.
4. Symbolization: turning addresses into function names
A raw sample is a list of virtual addresses. Symbolization maps each address to function + file:line. Where that mapping lives depends on the language.
Native, stripped (Go, Rust, C/C++): the binary on the node has a .symtab/.debug_* or it does not. Release builds are usually stripped. Three options, in order of preference:
- debuginfod. The agent extracts the binary’s build-ID (a hash in the ELF NT_GNU_BUILD_ID note) and fetches the matching debug info from a debuginfod server over HTTP. Point the agent at one:
  export DEBUGINFOD_URLS="https://debuginfod.elfutils.org/ https://debuginfod.internal.corp/"
  Parca takes a different route: the agent sends build-IDs and raw addresses, and the server resolves them, so symbolization happens once, centrally, not on every node.
- Ship debug info to the server. With Parca, parca-debuginfo upload pushes the .debug ELF for a build-ID into Parca’s store ahead of time. CI is the right place to do this, right after the build:
  parca-debuginfo upload --store-address=parca.monitoring.svc:7070 ./bin/payments-service
- Don’t strip. For your own services, keep frame pointers and a separated .debug file. The runtime overhead of -fno-omit-frame-pointer is ~1% and it makes unwinding bulletproof.
JIT/managed runtimes need per-language unwinders:
| Runtime | How the agent gets symbols |
|---|---|
| Go | Symbols in .gopclntab, in-binary even when “stripped” with default flags; works out of the box |
| Python | Agent walks the CPython interpreter frame objects (PyFrameObject) in-kernel to recover Python-level stacks |
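| Java/JVM | A /tmp/perf-<pid>.map file for JIT code, which the JVM writes only when asked (for example via perf-map-agent, or jcmd <pid> Compiler.perfmap on JDK 17+); or an async-profiler-style unwinder; native frames need DWARF |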
| .NET / Node | Runtime-specific maps; Node needs --perf-basic-prof for JIT frames |
The failure mode you will see first is a flame graph full of bare hex addresses or [unknown] frames stacked under a recognizable native function. That almost always means missing debug or unwind information for that build-ID, not a broken agent. Start with the build-ID, not the agent config.
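The address-to-name lookup at the heart of symbolization can be sketched in a few lines. The Go example below parses the perf map text format (one `<hex start> <hex size> <name>` entry per line) and resolves addresses with a binary search. The map contents and function names are made up, and a real symbolizer handles far more (ELF symbol tables, DWARF line info, inlined frames), so treat this as an illustration of the idea only.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// symbol is one address range and the name it maps to.
type symbol struct {
	Start, Size uint64
	Name        string
}

// parsePerfMap reads "<hex start> <hex size> <name>" lines and returns
// the symbols sorted by start address.
func parsePerfMap(text string) ([]symbol, error) {
	var syms []symbol
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		parts := strings.SplitN(line, " ", 3) // names may contain spaces
		if len(parts) != 3 {
			return nil, fmt.Errorf("malformed line: %q", line)
		}
		start, err := strconv.ParseUint(parts[0], 16, 64)
		if err != nil {
			return nil, err
		}
		size, err := strconv.ParseUint(parts[1], 16, 64)
		if err != nil {
			return nil, err
		}
		syms = append(syms, symbol{start, size, parts[2]})
	}
	sort.Slice(syms, func(i, j int) bool { return syms[i].Start < syms[j].Start })
	return syms, sc.Err()
}

// resolve returns the name of the symbol covering addr, or "[unknown]".
func resolve(syms []symbol, addr uint64) string {
	// First symbol starting after addr; the candidate is the one before it.
	i := sort.Search(len(syms), func(i int) bool { return syms[i].Start > addr })
	if i == 0 {
		return "[unknown]"
	}
	s := syms[i-1]
	if addr-s.Start >= s.Size {
		return "[unknown]"
	}
	return s.Name
}

func main() {
	// Hypothetical map contents for a JIT-compiled process.
	perfMap := "7f00 40 Lcom/example/Cart;::total\n7f40 20 Lcom/example/Cart;::tax\n"
	syms, err := parsePerfMap(perfMap)
	if err != nil {
		panic(err)
	}
	for _, addr := range []uint64{0x7f10, 0x7f45, 0x9000} {
		fmt.Printf("%#x -> %s\n", addr, resolve(syms, addr))
	}
}
```

The last address falls outside every range and comes back as `[unknown]`, the same symptom you see in the UI when the table for a build-ID is missing.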
5. Reading flame graphs without lying to yourself
A flame graph is an aggregation of every sampled stack over a time window, drawn as a stack of rectangles.
- The x-axis is NOT time. It is the percentage of samples. Width = share of CPU. A frame twice as wide was on-CPU twice as often. Left-to-right order is alphabetical within a parent, meaningless otherwise.
- The y-axis is stack depth. The bottom frame is the entry point (main, a thread root); the top frame is what was actually executing when the sample fired.
- The top edge is where the CPU was. Trace the topmost frames and you are looking at the leaf functions consuming cycles. Wide and high up = a hot leaf you can optimize directly.
- A wide frame low down is not itself expensive — it is a frame whose children are expensive. Don’t optimize the caller; follow the width upward to the leaf.
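The difference between a wide caller and a hot leaf is the difference between total and self time. As a small sketch, the Go program below computes both from folded stacks (`root;child;leaf` mapped to a sample count). The stack strings and counts are invented for illustration, and `attribute` is a hypothetical helper, not part of any profiler API.

```go
package main

import (
	"fmt"
	"strings"
)

// frameCost holds sample counts for one function.
type frameCost struct {
	Total int // samples where the function appears anywhere on the stack
	Self  int // samples where the function is the leaf, i.e. actually on CPU
}

// attribute takes folded stacks ("root;child;leaf" -> sample count) and
// returns total and self counts per function.
func attribute(folded map[string]int) map[string]frameCost {
	costs := make(map[string]frameCost)
	for stack, n := range folded {
		frames := strings.Split(stack, ";")
		seen := make(map[string]bool) // count a recursive function once per sample
		for i, f := range frames {
			c := costs[f]
			if !seen[f] {
				c.Total += n
				seen[f] = true
			}
			if i == len(frames)-1 {
				c.Self += n
			}
			costs[f] = c
		}
	}
	return costs
}

func main() {
	// Made-up profile: 100 samples in total.
	folded := map[string]int{
		"main;handle;json.Marshal;growslice": 70,
		"main;handle;db.Query":               20,
		"main;gc":                            10,
	}
	costs := attribute(folded)
	for _, fn := range []string{"main", "handle", "json.Marshal", "growslice"} {
		fmt.Printf("%-14s total=%3d self=%3d\n", fn, costs[fn].Total, costs[fn].Self)
	}
}
```

Here `main` and `handle` are the widest frames but have zero self samples. All of the optimizable cost sits in `growslice`, which is where following the width upward would lead you.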
The single most valuable view is the diff flame graph (icicle diff). Pick two windows — before a deploy and after — and the renderer colors each frame by the change in its sample share. Red/wider = this function got more expensive; blue/narrower = cheaper. This is how you find a regression in seconds:
# Pull two pprof profiles from Parca's API and diff them locally
curl -s "http://parca:7070/api/v1/query?query=...&time=BEFORE" -o before.pb.gz
curl -s "http://parca:7070/api/v1/query?query=...&time=AFTER" -o after.pb.gz
go tool pprof -http=:8080 -diff_base=before.pb.gz after.pb.gz
pprof -diff_base produces exactly this: a graph where positive (red) nodes are the new cost. If a single function lit up red across a deploy boundary, that is your regression, with the commit pinned by the deploy timestamp.
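For intuition about what a diff view computes, here is a hedged Go sketch that normalizes two sets of self-sample counts to shares and subtracts them. The function names and counts are invented. Note also that this compares normalized shares, whereas pprof subtracts raw sample values unless you pass its -normalize flag, so this shows the idea rather than replicating pprof.

```go
package main

import (
	"fmt"
	"sort"
)

// shares converts self-sample counts per function into fractions of the total.
func shares(samples map[string]int) map[string]float64 {
	total := 0
	for _, n := range samples {
		total += n
	}
	out := make(map[string]float64, len(samples))
	if total == 0 {
		return out
	}
	for fn, n := range samples {
		out[fn] = float64(n) / float64(total)
	}
	return out
}

// diff returns (after share - before share) for every function in either profile.
func diff(before, after map[string]int) map[string]float64 {
	b, a := shares(before), shares(after)
	out := make(map[string]float64)
	for fn, s := range a {
		out[fn] = s - b[fn] // missing in before counts as 0
	}
	for fn, s := range b {
		if _, ok := a[fn]; !ok {
			out[fn] = -s // function disappeared after the deploy
		}
	}
	return out
}

func main() {
	// Hypothetical self-sample counts before and after a deploy.
	before := map[string]int{"json.encode": 30, "regexp.exec": 10, "db.query": 60}
	after := map[string]int{"json.encode": 30, "regexp.exec": 40, "db.query": 30}

	d := diff(before, after)
	fns := make([]string, 0, len(d))
	for fn := range d {
		fns = append(fns, fn)
	}
	// Largest increase first: the top of this list is the "red" frame.
	sort.Slice(fns, func(i, j int) bool { return d[fns[i]] > d[fns[j]] })
	for _, fn := range fns {
		fmt.Printf("%-12s %+.1f points\n", fn, d[fn]*100)
	}
}
```

Comparing shares rather than raw counts keeps the result meaningful when the two windows carried different amounts of traffic, at the price of hiding a uniform slowdown that leaves every share unchanged.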
Two more semantics people get wrong:
- CPU profile vs wall-clock. An on-CPU profile shows where cycles burn. It is blind to time spent off-CPU: blocked on locks, I/O, or epoll. If a service is slow but its CPU flame graph is nearly empty, the cost is off-CPU; you need off-CPU profiling (eBPF tracing sched_switch) or a trace, not a CPU profile.
- Memory profiles are allocation counts, not live heap by default. Pyroscope/Parca memory profiles typically attribute alloc_space/alloc_objects: total bytes and objects allocated, the thing that drives GC pressure. inuse_space (retained heap) is a different question. Know which one the panel is showing before you conclude “leak”.
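The allocated-versus-retained distinction is easy to demonstrate with the Go runtime's own counters. The sketch below is an analogy, not how the profilers collect data: it uses `runtime.MemStats`, where `TotalAlloc` is cumulative (in the spirit of alloc_space) and `HeapAlloc` reflects what is currently on the heap (closer to inuse_space). The `churn` and `allocatedDuring` helpers and the buffer sizes are assumptions made for the example.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink is package-level so the compiler cannot optimize the allocations away.
var sink []byte

// churn allocates n buffers of size bytes, keeping only the last one reachable.
func churn(n, size int) {
	for i := 0; i < n; i++ {
		sink = make([]byte, size)
	}
}

// allocatedDuring reports cumulative heap bytes allocated while f runs.
func allocatedDuring(f func()) uint64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	f()
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc
}

func main() {
	const n, size = 64, 1 << 20 // 64 buffers of 1 MiB each
	allocated := allocatedDuring(func() { churn(n, size) })

	runtime.GC() // collect everything except the last buffer
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	fmt.Printf("allocated during churn: %d MiB\n", allocated>>20)
	fmt.Printf("live heap after GC:     %d MiB\n", m.HeapAlloc>>20)
}
```

The loop allocates at least 64 MiB in total while retaining a single buffer. An allocation profile would show `churn` as a heavy frame even though nothing is leaking, which is exactly the misreading the bullet above warns about.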
6. Correlate profiles with traces
The next leap is span-scoped profiling: instead of asking “what was hot in payments from 03:00 to 03:05”, ask “what was hot during this specific slow span”. The mechanism is a label. OpenTelemetry’s pprof extension / the Pyroscope SDK pushes the active span’s span_id and trace_id into the runtime’s pprof labels (via Go’s pprof.Labels/pprof.Do, or the equivalent), so each sample carries the span it was taken under.
import (
    "context"

    "github.com/grafana/pyroscope-go"
)
// ctx and span come from the surrounding request handler.
// Attach span context as a profiling label for the duration of the work.
pyroscope.TagWrapper(ctx, pyroscope.Labels("span_id", span.SpanContext().SpanID().String()),
    func(c context.Context) {
        processCart(c) // every CPU sample taken here is tagged with this span_id
    })
Now in the trace UI you click a 50 ms span and get the flame graph of just that span’s CPU time. The trace tells you which call was slow; the embedded profile tells you why, down to the function — one click, no reproduction. This is the payoff that justifies the whole stack.
7. Storage, retention, and the sampling tradeoff
Profiles are bulkier than metrics but compress hard because stacks repeat. Parca stores in a columnar engine (FrostDB); Pyroscope uses an object-store-backed engine modeled on the Grafana stack, so retention is an S3/GCS bill, not a node-local-disk problem.
# pyroscope values.yaml (microservices mode)
pyroscope:
  structuredConfig:
    storage:
      backend: s3
      s3:
        bucket_name: prod-pyroscope-profiles
        endpoint: s3.eu-west-1.amazonaws.com
    limits:
      max_query_lookback: 720h # 30 days queryable
      ingestion_rate_mb: 32
The dial that governs both overhead and accuracy is sample rate:
| Sample rate | Overhead | Accuracy | Use for |
|---|---|---|---|
| 19 Hz | ~0.5% | Good for hot-path/aggregate analysis | Fleet-wide default |
| 49 Hz | ~1% | Better resolution of mid-cost functions | Latency-sensitive tiers |
| 100 Hz | ~2-3% | Sharp, but diminishing returns | Targeted, short windows |
Higher frequency narrows the confidence interval on each frame’s width, but only with the square root of the sample count, while total sampling work and storage grow linearly with the rate. 19 Hz fleet-wide, with the option to crank a single namespace to 100 Hz during an investigation, is the sane posture. Resist running 100 Hz everywhere; you pay overhead on every node to sharpen estimates you mostly aren’t reading.
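A back-of-the-envelope calculation shows why the returns diminish. The Go sketch below treats each sample as an independent draw and uses the binomial standard error sqrt(p(1-p)/n). Independence, the 5% frame, and the 60-second single-core window are simplifying assumptions for illustration, not properties of any particular agent.

```go
package main

import (
	"fmt"
	"math"
)

// stdErr is the binomial standard error of an estimated CPU share p
// measured from n independent samples.
func stdErr(p float64, n int) float64 {
	if n <= 0 {
		return math.Inf(1)
	}
	return math.Sqrt(p * (1 - p) / float64(n))
}

// samplesIn returns how many samples one fully busy core yields in the window.
func samplesIn(rateHz, seconds int) int {
	return rateHz * seconds
}

func main() {
	const p = 0.05    // a frame that truly takes 5% of CPU
	const window = 60 // seconds of one busy core
	for _, hz := range []int{19, 49, 100} {
		n := samplesIn(hz, window)
		margin := 1.96 * stdErr(p, n) * 100 // 95% interval, in percentage points
		fmt.Printf("%3d Hz: n=%4d  share=%.1f%% +/- %.2f points\n", hz, n, p*100, margin)
	}
}
```

Under these assumptions the margin shrinks from roughly 1.3 points at 19 Hz to roughly 0.55 points at 100 Hz: about five times the samples for a little over twice the precision. Widening the query window or aggregating across pods buys the same precision without touching the sample rate.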
8. Drive cost optimization from CPU attribution
The most defensible ROI is turning CPU-share into money. Continuous profiling gives you fleet-wide attribution: which function, across all services, burns the most aggregate CPU-seconds. Query the aggregate flame graph across the whole cluster for a representative week, sort leaf functions by total CPU-seconds, and the top of that list is your optimization backlog ranked by cost — data you cannot get any other way. A regex compiled inside a request handler, a logging call serializing a struct it then drops, a crypto routine on a software path instead of AES-NI: each shows up as a fat frame spanning many services and translates directly to cores you can deprovision.
Top CPU consumers, cluster-wide, last 7d (by self CPU-seconds)
1. compress/flate.(*compressor).deflate 18.4% ~46 cores
2. encoding/json.(*encodeState).string 9.1% ~23 cores
3. regexp.(*Regexp).doExecute 6.7% ~17 cores <- compiled per-request
Item 3 — a regex recompiled on every request instead of once at init — is a five-line fix that returns 17 cores across the fleet. You only see it because the profiler aggregated the same hot leaf across every service that imported the offending library.
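For reference, the shape of that fix in Go looks roughly like the following. The pattern, the `validSlow`/`validFast` names, and the order-ID format are hypothetical; the point is only the move from compiling inside the handler to compiling once at package initialization.

```go
package main

import (
	"fmt"
	"regexp"
)

// orderIDPattern is a hypothetical pattern, compiled once at package init.
var orderIDPattern = regexp.MustCompile(`^ord-[0-9]{6}$`)

// validSlow recompiles the pattern on every call: the shape of the bug.
func validSlow(id string) bool {
	re := regexp.MustCompile(`^ord-[0-9]{6}$`)
	return re.MatchString(id)
}

// validFast reuses the precompiled pattern.
// A *regexp.Regexp is safe for concurrent use by multiple goroutines.
func validFast(id string) bool {
	return orderIDPattern.MatchString(id)
}

func main() {
	for _, id := range []string{"ord-123456", "ord-12"} {
		fmt.Println(id, validSlow(id), validFast(id))
	}
}
```

Both functions return the same answers, so the change is invisible to tests and callers. Only the profile shows the difference, which is why this class of waste tends to survive until someone looks at fleet-wide attribution.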
Enterprise scenario
A payments platform team ran ~600 Go and Python pods across three EKS clusters. After migrating to Graviton (arm64), p99 on the settlement service crept up ~12% with no code change and no obvious metric to blame — CPU was up a bit, but uniformly, so dashboards showed nothing actionable. The constraint: they could not reproduce it in staging (which still ran amd64), and they refused to add manual profiling hooks to a PCI-scoped service under change freeze.
They already ran the Grafana eBPF profiler as a DaemonSet, so the investigation was a query, not a deployment. They pulled the cluster-wide diff flame graph with pprof -diff_base, amd64 baseline vs arm64:
go tool pprof -http=:8080 -diff_base=settlement-amd64.pb.gz settlement-arm64.pb.gz
One frame lit up red across both Go and Python pods: a checksum routine in a shared vendored crypto library that fell back to a generic byte-at-a-time implementation on arm64 because the build hadn’t been compiled with the ARM CRC32 intrinsics enabled. The diff made it unmistakable — the same leaf, fatter on arm64, in services that otherwise shared no code.
The fix was a build flag, not a rewrite: rebuild the dependency with the architecture’s CRC extension enabled.
# arm64 build picks up the hardware CRC path
# GOARM64=v8.1 enables LSE atomics + CRC on supported cores
ENV GOARCH=arm64 GOARM64=v8.1
RUN go build -trimpath -o /out/settlement ./cmd/settlement
p99 returned to baseline. The point is not the specific flag — it is that an always-on profiler turned a vague, unreproducible, cross-language regression into a single red frame, with no code change to a frozen service and no staging repro. That is the capability you are buying.
Verify
After rollout, confirm the pipeline end to end:
# 1. Agent is running on every node, one pod per node.
kubectl -n monitoring get ds alloy-profiler
# DESIRED == CURRENT == READY == node count
# 2. The eBPF program actually loaded (no silent permission failure).
kubectl -n monitoring logs ds/alloy-profiler | grep -iE "ebpf|perf_event|loaded"
# 3. Profiles are landing: query the store for a known service.
curl -s "http://pyroscope.monitoring.svc:4040/pyroscope/render?query=process_cpu:cpu:nanoseconds:cpu:nanoseconds%7Bnamespace%3D%22payments%22%7D&from=now-15m" | head
# 4. Symbolization works: the flame graph shows function names, not hex.
# Open the UI, load the payments service, confirm zero [unknown] frames on the hot path.
# 5. Span correlation: click a slow span in the trace UI and confirm a
# span-scoped flame graph renders for it.
If step 2 shows a permission error, the container lacks CAP_BPF/CAP_PERFMON or the kernel is too old. If step 4 shows hex addresses, the build-ID has no debug info in debuginfod/Parca — fix symbolization, not the agent.