Most Grafana dashboards die the same way: someone drags forty panels onto a screen, every metric the exporter emits gets its own graph, and six months later the dashboard is the one nobody opens during an incident because it answers no question. A dashboard is a tool for reasoning under pressure, not a museum of telemetry. This article walks through building dashboards that earn their place — structured with RED and USE, parameterized with template variables, enriched with exemplars and data links, and versioned as code so they survive the people who built them.
Why most dashboards fail
Three failure modes account for the vast majority of dead dashboards:
- Vanity panels. Every series the exporter exposes becomes a graph. Heap, GC pauses, file descriptors, and goroutine counts share a wall with the request rate, and the one number that matters is buried.
- No question being answered. A good panel answers “is the thing healthy right now, and if not, where do I look next?” A bad panel just plots a number with no threshold, no baseline, and no next step.
- No context. A latency spike with no annotation for the deploy that caused it, no link to the trace, and no overlay for the firing alert leaves the on-call engineer doing archaeology at 3 a.m.
The fix is to start from a method. Two cover almost everything you run.
The RED method for services
RED, from Tom Wilkie, applies to anything that serves requests — HTTP services, gRPC endpoints, queue consumers:
- Rate — requests per second
- Errors — failed requests per second (or error ratio)
- Duration — latency distribution, viewed at percentiles
Assuming a counter named http_requests_total with a code label and a Prometheus histogram named http_request_duration_seconds, the three core panels are:
# Rate (per second, summed across instances)
sum(rate(http_requests_total{service="$service"}[$__rate_interval]))
# Error ratio (5xx as a fraction of all requests)
sum(rate(http_requests_total{service="$service", code=~"5.."}[$__rate_interval]))
/
sum(rate(http_requests_total{service="$service"}[$__rate_interval]))
# Duration: p99 from a classic histogram (the _bucket series with an le label)
histogram_quantile(
  0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="$service"}[$__rate_interval]))
)
Use $__rate_interval rather than a hardcoded [5m]. Grafana derives it from the data source's scrape interval and the panel's current step, which keeps rate() correct when you zoom out to a week or in to ten minutes. A hardcoded window is either narrower than the step when you zoom out, so samples between points are skipped, or too short to hold two samples, so the panel returns no data.
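To make the adaptation concrete: Grafana's documentation describes $__rate_interval as the larger of $__interval plus the scrape interval and four times the scrape interval. The Python sketch below mirrors that rule. The function name and the 15-second default are illustrative assumptions, not Grafana code.

```python
def rate_interval(interval_s: float, scrape_interval_s: float = 15.0) -> float:
    """Approximate $__rate_interval in seconds.

    interval_s: the panel's $__interval, which grows as the time range widens.
    scrape_interval_s: the scrape interval set on the data source (assumed 15s).
    """
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


if __name__ == "__main__":
    # Zoomed in: the floor of four scrapes keeps rate() fed with enough samples.
    print(rate_interval(15))    # 60.0
    # Zoomed out: the window widens along with the step between points.
    print(rate_interval(600))   # 615.0
```

The floor of four scrape intervals is what protects a tightly zoomed panel from windows that contain fewer than two samples.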
Plot p50, p90, and p99 as separate series on the Duration panel. The gap between p50 and p99 tells you whether you have a uniform slowdown or a long-tail problem hitting a subset of requests — two very different investigations.
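To see why the p50-to-p99 gap separates those two cases, here is a simplified Python sketch of the linear interpolation that histogram_quantile applies to classic buckets. It assumes cumulative counts, sorted bounds, a trailing +Inf bucket, and a quantile above zero; it skips several edge cases the real function handles, and the sample bucket values are made up.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def classic_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate quantile q (0 < q <= 1) from cumulative histogram buckets."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = buckets[index - 1] if index > 0 else (0.0, 0.0)
    if count == below:
        return upper
    # Linear interpolation inside the bucket that contains the rank.
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    inf = float("inf")
    long_tail = [(0.1, 95), (0.5, 96), (2.0, 99), (inf, 100)]
    uniform_slowdown = [(0.1, 0), (0.5, 10), (2.0, 100), (inf, 100)]
    for name, hist in (("long tail", long_tail), ("uniform", uniform_slowdown)):
        print(name, classic_quantile(0.5, hist), classic_quantile(0.99, hist))
```

In the long-tail sample the median stays low while p99 sits at the top bucket bound; in the uniform sample both move up together.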
The USE method for resources
USE, from Brendan Gregg, applies to resources that have finite capacity — CPU, memory, disk, network, connection pools:
- Utilization — the fraction of time the resource was busy
- Saturation — the degree of queued, unserviced work
- Errors — error events on the resource
For a Kubernetes node scraped by node_exporter:
# CPU utilization across the node
1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="$instance"}[$__rate_interval]))
# Memory utilization
1 - (
node_memory_MemAvailable_bytes{instance="$instance"}
/
node_memory_MemTotal_bytes{instance="$instance"}
)
# Saturation: run-queue pressure via load relative to CPU count
node_load1{instance="$instance"}
/
count(count by (cpu) (node_cpu_seconds_total{instance="$instance"}))
The discipline is what matters more than the exact queries: for every service, draw the three RED panels; for every resource backing it, draw the three USE panels. When that is your template, you stop debating which metrics to add. The method decides for you, and a stranger can read any of your dashboards because they all share a grammar.
Template variables and chained queries
Hardcoding service="checkout" into every panel means one dashboard per service and dozens of near-identical JSON files to maintain. Template variables collapse that into one reusable dashboard.
Define a query variable that pulls live label values from Prometheus. In Dashboard settings -> Variables, create a variable service of type Query with this definition:
label_values(http_requests_total, service)
Then chain a second variable so the choices narrow as you drill down. An instance variable that depends on the selected service:
label_values(http_requests_total{service="$service"}, instance)
Because instance references $service, Grafana re-runs its query whenever service changes — chained variables. Add a region variable the same way and you have one dashboard that covers every service, region, and instance combination.
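In the dashboard JSON, these two variables end up as entries in the templating list. The Python sketch below builds a minimal version of that fragment. The helper name is hypothetical, the data source uid is assumed to be prometheus, and the exact field layout differs between Grafana versions, so compare it against an export from your own instance.

```python
import json


def query_variable(name: str, query: str, multi: bool = False) -> dict:
    """Build one entry for the dashboard JSON's templating.list array."""
    return {
        "name": name,
        "type": "query",
        "datasource": {"type": "prometheus", "uid": "prometheus"},  # assumed uid
        "query": query,
        "refresh": 2,        # 2 = refresh on time range change
        "multi": multi,
        "includeAll": multi,
    }


templating = {
    "list": [
        query_variable("service", "label_values(http_requests_total, service)"),
        # References $service, so Grafana re-runs it when service changes.
        query_variable(
            "instance",
            'label_values(http_requests_total{service="$service"}, instance)',
            multi=True,
        ),
    ]
}

if __name__ == "__main__":
    print(json.dumps({"templating": templating}, indent=2))
```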
A few settings that make variables pleasant to live with:
| Setting | Recommended | Why |
|---|---|---|
| Include All option | On for instance | Lets you aggregate across the fleet or pick one box |
| Multi-value | On for instance, off for service | A panel is usually scoped to one service but many instances |
| Custom all value | .* | Use with =~ regex matchers so “All” means a real pattern |
| Refresh | On time range change | Picks up newly deployed instances without a reload |
When you enable multi-value, switch your matchers to regex. instance=~"$instance" works for both single and multi selection; instance="$instance" breaks the moment a user picks two.
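The reason is visible once you look at what a multi-value selection expands to. The Python sketch below is a rough imitation of that interpolation, not Grafana's implementation: the function names are hypothetical, and Grafana's own escaping rules are more involved.

```python
import re
from typing import List, Union


def prometheus_value(selection: Union[str, List[str]]) -> str:
    """Render a variable selection roughly the way a regex matcher needs it.

    A multi-value selection becomes an alternation group, with regex
    metacharacters in each value escaped.
    """
    if isinstance(selection, str):
        return selection
    return "(" + "|".join(re.escape(value) for value in selection) + ")"


def render(expr_template: str, selection: Union[str, List[str]]) -> str:
    """Substitute $instance in a query template (illustration only)."""
    return expr_template.replace("$instance", prometheus_value(selection))


if __name__ == "__main__":
    picked = ["node1:9100", "node2:9100"]
    # Regex matcher: the alternation matches either instance.
    print(render('up{instance=~"$instance"}', picked))
    # Exact matcher: compares against the literal group text, which no
    # instance label will ever equal, so the panel goes empty.
    print(render('up{instance="$instance"}', picked))
```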
Thresholds, transformations, and value mappings
Raw series are noise until you give them meaning. Three features do the heavy lifting.
Thresholds turn a number into a verdict. On the error-ratio panel, set the field unit to Percent (0.0-1.0), then add thresholds at the values your SLO implies, for example green below 1%, amber at 1%, red at 5%, so a glance tells you the state without reading the axis. With that unit the query still returns a ratio, so enter the thresholds as 0.01 and 0.05 rather than 1 and 5.
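In the panel JSON, those choices end up under fieldConfig. The Python sketch below shows a plausible shape for the example thresholds, plus a small function that mimics how a value picks its color. The constant and function names are hypothetical, and the color names are assumptions; "percentunit" is the unit ID behind Percent (0.0-1.0).

```python
from typing import Optional

# Field config for the error-ratio panel. Thresholds are ratios because the
# query returns a ratio; the unit only changes how the value is displayed.
ERROR_RATIO_FIELD_CONFIG = {
    "defaults": {
        "unit": "percentunit",
        "thresholds": {
            "mode": "absolute",
            "steps": [
                {"color": "green", "value": None},   # base step, no lower bound
                {"color": "orange", "value": 0.01},  # amber from 1%
                {"color": "red", "value": 0.05},     # red from 5%
            ],
        },
    }
}


def color_for(value: float, field_config: dict = ERROR_RATIO_FIELD_CONFIG) -> str:
    """Return the color of the highest step whose bound the value has reached."""
    color = "green"
    for step in field_config["defaults"]["thresholds"]["steps"]:
        bound: Optional[float] = step["value"]
        if bound is None or value >= bound:
            color = step["color"]
    return color


if __name__ == "__main__":
    print(color_for(0.004))  # green
    print(color_for(0.02))   # orange
    print(color_for(0.2))    # red
```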
Transformations reshape data after the query runs, in the panel itself. The ones I reach for most:
- Add field from calculation to derive a ratio or a difference without writing it into PromQL.
- Organize fields to rename and reorder columns in a table.
- Filter by name to drop the half-dozen series you do not want on a busy graph.
- Group by to roll up a table by a label.
Value mappings convert codes to words. A panel showing up returns 1 or 0; map 1 to “UP” with a green background and 0 to “DOWN” with red. Numeric pod-phase or status enums become legible the same way.
Data links and exemplars
This is where dashboards stop being passive. Exemplars are sampled trace IDs that your instrumentation attaches to individual histogram observations and that Prometheus stores alongside the series. With exemplar storage enabled in Prometheus (the --enable-feature=exemplar-storage flag) and OpenTelemetry or Prometheus client exemplars emitted by the app, Grafana renders them as diamonds on the latency graph. Click one and jump straight to the trace in Tempo or Jaeger.
Wire it up in the Prometheus data source configuration. In provisioned YAML:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo
The datasourceUid must match the uid of your tracing data source, and the name must match the label your exemplars carry (commonly trace_id). Now a p99 spike is one click from the exact request that caused it.
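When checking that the label name lines up, it helps to know what an exemplar looks like on the wire. In the OpenMetrics exposition format it trails the sample after a # marker. The Python sketch below formats one such line by hand purely for illustration. Client libraries generate this for you, and the function name and sample values are made up.

```python
from typing import Optional


def bucket_line(metric: str, le: str, count: int, trace_id: str,
                observed: float, timestamp: Optional[float] = None) -> str:
    """Format one histogram bucket sample with an exemplar attached."""
    line = (
        f'{metric}_bucket{{le="{le}"}} {count} '
        f'# {{trace_id="{trace_id}"}} {observed}'
    )
    if timestamp is not None:
        line += f" {timestamp}"
    return line


if __name__ == "__main__":
    print(bucket_line("http_request_duration_seconds", "0.5", 90,
                      "4bf92f3577b34da6", 0.42))
    # http_request_duration_seconds_bucket{le="0.5"} 90 # {trace_id="4bf92f3577b34da6"} 0.42
```

The trace_id key inside the braces is the label that the name field in exemplarTraceIdDestinations has to match.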
Data links generalize the idea to any panel. A data link is a templated URL that can interpolate the series labels and the clicked time range. On the Duration panel, a link to the logs for that exact service and window turns “latency is high” into “here are the logs from the slow window”:
/explore?left={"datasource":"loki","queries":[{"expr":"{service=\"${__field.labels.service}\"}"}],"range":{"from":"${__from}","to":"${__to}"}}
Grafana substitutes ${__field.labels.service} from the hovered series and ${__from} / ${__to} from the time range. The exact Explore URL schema shifts between major versions, so build one link in the UI, then copy what Grafana generates rather than hand-writing the JSON from memory.
Annotations and alert overlays
Correlation is half of incident response. Two overlays make it instant.
Annotations mark events on the time axis. The most valuable one is deploys. If your CI emits a Prometheus metric or writes to a table, an annotation query overlays a vertical line at each release, so “latency jumped at 14:32” immediately reads as “latency jumped right after the 14:31 deploy.” Add an annotation query in dashboard settings backed by your deploy metric, and every panel on the dashboard gets the markers.
Alert state can be overlaid too. Grafana-managed alert rules can surface their firing periods as annotation regions, shading the window an alert was active directly on the relevant graph. Seeing the firing window line up with the error-rate climb removes any doubt about which symptom the alert is actually catching.
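If the CI system has neither a metric nor a table to query, another route (an alternative not covered above) is to have the pipeline push a deploy marker to Grafana's annotations HTTP API and filter the dashboard's annotation query on its tag. The Python sketch below only builds the request. The function name, tag scheme, and text format are assumptions.

```python
import json
import time
import urllib.request
from typing import Optional


def deploy_annotation_request(base_url: str, token: str, service: str,
                              version: str,
                              when_ms: Optional[int] = None) -> urllib.request.Request:
    """Build a POST /api/annotations request marking a deploy."""
    body = {
        # Grafana expects epoch milliseconds.
        "time": when_ms if when_ms is not None else int(time.time() * 1000),
        "tags": ["deploy", service],
        "text": f"Deployed {service} {version}",
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/annotations",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = deploy_annotation_request(
        "https://grafana.example.com", "service-account-token", "checkout", "v1.4.2"
    )
    print(request.get_method(), request.full_url)
    # Send it from CI with: urllib.request.urlopen(request)
```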
Dashboards as code
A dashboard built by clicking is a dashboard one accidental “Save” away from gone. Treat the JSON model as the source of truth and put it in Git. There are two delivery mechanisms; pick based on who owns the dashboard.
File-based provisioning
For dashboards the platform team owns, drop the JSON on disk and point Grafana at it. A provisioning config tells Grafana where to look:
# /etc/grafana/provisioning/dashboards/platform.yaml
apiVersion: 1
providers:
  - name: platform
    orgId: 1
    # folder is left unset so that foldersFromFilesStructure can take effect
    type: file
    disableDeletion: true
    editable: false
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
Setting editable: false and allowUiUpdates: false makes the dashboard read-only in the UI, which forces every change through the repo. foldersFromFilesStructure: true mirrors your directory layout into Grafana folders, so the filesystem and the UI stay in sync. Grafana watches the path and reloads on change — no restart needed.
Strip the top-level "id" field from exported JSON before committing it. A stale numeric id from the source instance can collide on the target. Keep the "uid": it is the stable, portable identifier Grafana uses to match dashboards across instances. Setting a deterministic uid is also what makes provisioning idempotent.
The Grafana Terraform provider
For dashboards that live alongside infrastructure, or when you want folders, permissions, alerts, and data sources managed in one place, use the Grafana Terraform provider:
terraform {
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 3.0"
    }
  }
}

provider "grafana" {
  url  = "https://grafana.example.com"
  auth = var.grafana_service_account_token
}

resource "grafana_folder" "platform" {
  title = "Platform"
}

resource "grafana_dashboard" "service_red" {
  # Reference the folder by uid, as the provider's documented examples do
  folder      = grafana_folder.platform.uid
  config_json = file("${path.module}/dashboards/service-red.json")
}
Authenticate with a service account token, not a legacy API key — Grafana deprecated API keys in favor of service accounts. Keep the dashboard JSON in its own file and reference it with file() so designers can still edit visually in a scratch instance, export, and commit the result. The provider diffs the JSON on every terraform plan, so drift from someone editing in the UI shows up as a plan change you can revert.
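The export-and-commit step is a good place to automate cleanup so an instance-local id never reaches the repo. A minimal Python sketch, assuming the export is a plain dashboard JSON object; both function names are hypothetical:

```python
import json


def prepare_for_commit(exported: dict) -> dict:
    """Return a copy of an exported dashboard that is safe to commit."""
    if not exported.get("uid"):
        raise ValueError("dashboard has no uid; set a deterministic one first")
    cleaned = dict(exported)
    cleaned.pop("id", None)  # instance-local numeric id; must not travel
    return cleaned


def dump(dashboard: dict) -> str:
    """Serialize with stable ordering so Git diffs stay small between exports."""
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    exported = {"id": 42, "uid": "service-red", "title": "Service RED"}
    print(dump(prepare_for_commit(exported)), end="")
```

Sorting keys is a design choice rather than a Grafana requirement: it keeps a re-export of an unchanged dashboard from producing a noisy diff.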
Library panels and folders for governance
Across many teams, copy-paste is the enemy. Library panels let you define a panel once — the canonical RED error-ratio panel with the right thresholds and units — and reuse it across dashboards. Edit the library panel and every dashboard that embeds it updates. That is how you roll out a threshold change to fifty dashboards without touching fifty files.
Folders are the unit of permission. Grant a team edit rights on its folder and viewer rights elsewhere. Combined with provisioning, you get a clean model: the platform team owns provisioned, read-only golden dashboards in shared folders, while product teams build freely in their own folders. Folder permissions are themselves manageable through the Terraform provider, so access control lives in the same review process as everything else.
Enterprise scenario
A payments platform team I worked with ran one Grafana per cluster across 30+ EKS clusters, all driven by the Terraform provider. They shipped a fix to the golden RED dashboard — switching latency from a classic histogram to histogram_quantile over native histograms — and terraform apply succeeded everywhere. Within an hour, half the clusters showed empty Duration panels. The culprit was a mismatch the provider could not catch: native histograms require --enable-feature=native-histograms on the Prometheus side, and only the clusters on the newer Prometheus release had it. The plan diffed clean because the dashboard JSON was identical; the data backend was not.
The fix was to stop treating dashboard-as-code as sufficient and gate the rollout on a backend capability check before applying. They added a CI step that reads each Prometheus's runtime flags from its status API and fails unless native-histograms is enabled:
# Fail the apply for any cluster missing the native-histograms feature
for ctx in $(kubectl config get-contexts -o name); do
  features=$(curl -sf "https://prometheus.$ctx.internal/api/v1/status/flags" \
    | jq -r '.data["enable-feature"] // ""')
  case "$features" in
    *native-histograms*) ;;
    *) echo "FAIL: $ctx lacks native-histograms"; exit 1 ;;
  esac
done
They also pinned the dashboard variant per cluster via a Terraform for_each over a capability map, so older clusters kept the classic-histogram JSON until their Prometheus was upgraded. The lesson: a dashboard’s PromQL has an implicit contract with the data source’s version and feature flags, and version-skew across a fleet will bite you even when every plan is green. Provisioning makes dashboards reproducible, not correct.
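One way to produce such a capability map is to derive it from each Prometheus's runtime flags instead of maintaining it by hand. The Python sketch below assumes the flags come from the data object of the /api/v1/status/flags endpoint and that enabled features appear under the enable-feature key. The file names and function names are hypothetical.

```python
import json
from typing import Dict

Flags = Dict[str, str]


def supports_native_histograms(flags: Flags) -> bool:
    """flags: the `data` object returned by /api/v1/status/flags (assumed shape)."""
    return "native-histograms" in flags.get("enable-feature", "")


def dashboard_variant(flags: Flags) -> str:
    """Pick the dashboard JSON a cluster can actually render."""
    if supports_native_histograms(flags):
        return "service-red-native.json"
    return "service-red-classic.json"


def variants_by_cluster(cluster_flags: Dict[str, Flags]) -> Dict[str, str]:
    """Map each cluster name to its dashboard variant."""
    return {name: dashboard_variant(flags) for name, flags in cluster_flags.items()}


if __name__ == "__main__":
    observed = {
        "prod-eu-1": {"enable-feature": "exemplar-storage,native-histograms"},
        "prod-us-2": {"enable-feature": "exemplar-storage"},
    }
    # Written as JSON, the result can be consumed as a Terraform variable map.
    print(json.dumps(variants_by_cluster(observed), indent=2))
```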
Verify
Confirm the dashboard actually works before you call it done.
# Validate the dashboard JSON is well-formed before committing
jq empty dashboards/service-red.json && echo "valid JSON"
# Confirm provisioning loaded it (look for the provider name and any errors)
docker logs grafana 2>&1 | grep -i provisioning
# List dashboards via the HTTP API using a service account token
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
"https://grafana.example.com/api/search?type=dash-db" | jq '.[].title'
# Plan the Terraform change and confirm no unexpected drift
terraform plan
Then open the dashboard and check the human-facing behavior:
- Switch the service variable and confirm every panel repoints, including chained instance choices narrowing.
- Pick “All” on a multi-value variable and confirm regex matchers still return data.
- Hover a latency spike and confirm an exemplar diamond links to a trace.
- Confirm the deploy annotation line appears at a known release time.
Pitfalls
- Hardcoded rate intervals. rate(...[5m]) looks fine at the default zoom and returns gaps or garbage at others. Always use $__rate_interval.
- Single-value matchers on multi-value variables. instance="$instance" silently breaks the instant a user multi-selects. Use =~ everywhere a variable can hold more than one value.
- Committing the id field. A leftover numeric id collides across instances; the uid is the portable key. Strip id, keep uid.
- UI edits on provisioned dashboards. If you allow UI updates on provisioned dashboards, edits get silently overwritten on the next reload and people lose work. Lock them down and route changes through the repo.
- Exemplars without storage enabled. Diamonds will not appear unless Prometheus exemplar storage is on and the app actually emits exemplars — instrument first, then expect the link.
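Several of these pitfalls can be caught mechanically before a dashboard is merged. The Python sketch below is a hypothetical lint pass over a dashboard JSON model. It assumes Prometheus targets keep their query under expr and that collapsed rows nest their panels, and its regular expressions are deliberately simple heuristics.

```python
import re
from typing import Iterator, List

HARDCODED_RANGE = re.compile(r"\[\d+[smhdwy]\]")       # e.g. [5m]
EXACT_VAR_MATCH = re.compile(r'="\$\{?(\w+)\}?"')      # e.g. ="$instance"


def iter_exprs(panels: List[dict]) -> Iterator[str]:
    """Yield every query expression, descending into row panels."""
    for panel in panels:
        for target in panel.get("targets", []):
            if "expr" in target:
                yield target["expr"]
        yield from iter_exprs(panel.get("panels", []))


def lint(dashboard: dict) -> List[str]:
    """Return a list of human-readable problems; empty means clean."""
    problems = []
    if dashboard.get("id") is not None:
        problems.append("top-level id present; strip it before committing")
    if not dashboard.get("uid"):
        problems.append("missing uid")
    multi = {
        variable["name"]
        for variable in dashboard.get("templating", {}).get("list", [])
        if variable.get("multi")
    }
    for expr in iter_exprs(dashboard.get("panels", [])):
        if HARDCODED_RANGE.search(expr):
            problems.append(f"hardcoded range in: {expr}")
        for name in EXACT_VAR_MATCH.findall(expr):
            if name in multi:
                problems.append(f"exact matcher on multi-value ${name} in: {expr}")
    return problems


if __name__ == "__main__":
    sample = {
        "id": 7,
        "uid": "service-red",
        "templating": {"list": [{"name": "instance", "multi": True}]},
        "panels": [{"targets": [{"expr": 'rate(x{instance="$instance"}[5m])'}]}],
    }
    for problem in lint(sample):
        print(problem)
```

Run in CI next to the JSON validity check, a pass like this turns the pitfalls into failed builds instead of review comments.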
Start with one service. Build its RED and USE panels by hand, get the variables and exemplars right, then export the JSON and bring it under provisioning or Terraform. Once that template exists, the second service is a copy with a different default variable, and the dashboard sprawl that buried your last setup never gets a chance to start.