Engineering Grafana Dashboards That Get Used: RED, USE, Template Variables, and Provisioning-as-Code

Most Grafana dashboards die the same way: someone drags forty panels onto a screen, every metric the exporter emits gets its own graph, and six months later the dashboard is the one nobody opens during an incident because it answers no question. A dashboard is a tool for reasoning under pressure, not a museum of telemetry. This article walks through building dashboards that earn their place — structured with RED and USE, parameterized with template variables, enriched with exemplars and data links, and versioned as code so they survive the people who built them.

Why most dashboards fail

Three failure modes account for the vast majority of dead dashboards:

The fix is to start from a method. Two cover almost everything you run.

The RED method for services

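RED, from Tom Wilkie, applies to anything that serves requests (HTTP services, gRPC endpoints, queue consumers) and tracks three signals: Rate, the number of requests per second; Errors, the number of those requests that fail; and Duration, how long those requests take.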

Assuming a Prometheus counter named http_requests_total with service and code labels, and a histogram named http_request_duration_seconds, the three core panels are:

# Rate (per second, summed across instances)
sum(rate(http_requests_total{service="$service"}[$__rate_interval]))

# Error ratio (5xx as a fraction of all requests)
sum(rate(http_requests_total{service="$service", code=~"5.."}[$__rate_interval]))
/
sum(rate(http_requests_total{service="$service"}[$__rate_interval]))

# Duration: p99 from a classic histogram (native histograms have no _bucket series or le label)
histogram_quantile(
  0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="$service"}[$__rate_interval]))
)

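Use $__rate_interval rather than a hardcoded [5m]. Grafana derives it from the data source’s configured scrape interval and the panel’s query step, which follows the time range, and never lets it drop below four scrape intervals. That keeps rate() correct when you zoom out to a week or in to ten minutes. A hardcoded window skips samples once the step grows wider than the window, and returns no data when it is too short to span two scrapes.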
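To make the arithmetic concrete, here is a minimal sketch of the rule Grafana documents for $__rate_interval: the larger of the step plus one scrape interval, or four scrape intervals. The function name and the 15-second default are illustrative assumptions, not Grafana code.

```python
def rate_interval(step_seconds: float, scrape_interval_seconds: float = 15.0) -> float:
    """Approximate $__rate_interval as max(step + scrape, 4 * scrape), in seconds."""
    return max(step_seconds + scrape_interval_seconds, 4 * scrape_interval_seconds)


# Zoomed in: a 15s step still gets a 60s window, enough samples for rate().
print(rate_interval(15))
# Zoomed out to a week: a 600s step widens the window so no samples fall between steps.
print(rate_interval(600))
```

A fixed [5m] would be fine in the first case and would silently ignore half the samples in the second.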

Plot p50, p90, and p99 as separate series on the Duration panel. The gap between p50 and p99 tells you whether you have a uniform slowdown or a long-tail problem hitting a subset of requests — two very different investigations.
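It helps to know what histogram_quantile is doing with those buckets when you read the gap. The sketch below approximates its behavior for classic histograms: find the bucket containing the target rank, then interpolate linearly inside it. The bucket values are made up for illustration, and this is a simplification of the real function, not a reimplementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with a +Inf bucket,
    like the le series of a classic Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The estimate is capped at the highest finite bucket bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request counts per bucket (cumulative), in seconds.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.50, buckets), histogram_quantile(0.99, buckets))
```

With these numbers the median sits at the first bucket boundary while p99 is ten times higher, which is the long-tail shape rather than a uniform slowdown. The result is only as precise as the bucket boundaries allow.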

The USE method for resources

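USE, from Brendan Gregg, applies to resources that have finite capacity (CPU, memory, disk, network, connection pools) and asks three questions of each: Utilization, how busy the resource is; Saturation, how much work is queued waiting for it; and Errors, the count of error events.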

For a Kubernetes node scraped by node_exporter:

# CPU utilization across the node
1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="$instance"}[$__rate_interval]))

# Memory utilization
1 - (
  node_memory_MemAvailable_bytes{instance="$instance"}
  /
  node_memory_MemTotal_bytes{instance="$instance"}
)

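# Saturation: run-queue pressure via load relative to CPU count
# (on (instance) is required because the two sides carry different label sets)
node_load1{instance="$instance"}
/ on (instance)
count by (instance) (count by (instance, cpu) (node_cpu_seconds_total{instance="$instance"}))

# Errors: one example, network receive errors per second
sum(rate(node_network_receive_errs_total{instance="$instance"}[$__rate_interval]))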

The discipline matters more than the exact queries: for every service, draw the three RED panels; for every resource backing it, draw the three USE panels. When that is your template, you stop debating which metrics to add. The method decides for you, and a stranger can read any of your dashboards because they all share a grammar.

Template variables and chained queries

Hardcoding service="checkout" into every panel means one dashboard per service and dozens of near-identical JSON files to maintain. Template variables collapse that into one reusable dashboard.

Define a query variable that pulls live label values from Prometheus. In Dashboard settings -> Variables, create a variable service of type Query with this definition:

label_values(http_requests_total, service)

Then chain a second variable so the choices narrow as you drill down. An instance variable that depends on the selected service:

label_values(http_requests_total{service="$service"}, instance)

Because instance references $service, Grafana re-runs its query whenever service changes — chained variables. Add a region variable the same way and you have one dashboard that covers every service, region, and instance combination.
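The dependency rule is simple enough to model. This sketch scans each variable's query for references to other variables and derives the order in which they would need to be resolved. The dictionary and function names are hypothetical, and it illustrates the idea rather than Grafana's actual implementation.

```python
import re

# Hypothetical variable definitions mirroring the two queries above.
VARIABLES = {
    "service": "label_values(http_requests_total, service)",
    "instance": 'label_values(http_requests_total{service="$service"}, instance)',
}

# Matches $name and ${name...} references.
_REF = re.compile(r"\$\{?(\w+)")


def dependencies(variables):
    """Map each variable to the other variables its query references."""
    return {
        name: {ref for ref in _REF.findall(query) if ref in variables and ref != name}
        for name, query in variables.items()
    }


def refresh_order(variables):
    """Order variables so each is resolved after the ones it depends on."""
    deps = dependencies(variables)
    ordered = []
    while len(ordered) < len(variables):
        ready = [n for n in variables if n not in ordered and deps[n] <= set(ordered)]
        if not ready:
            raise ValueError("circular variable reference")
        ordered.extend(ready)
    return ordered


print(refresh_order(VARIABLES))
```

A region variable slots in the same way: anything whose query mentions $region is re-evaluated after it.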

A few settings that make variables pleasant to live with:

| Setting | Recommended | Why |
| --- | --- | --- |
| Include All option | On for instance | Lets you aggregate across the fleet or pick one box |
| Multi-value | On for instance, off for service | A panel is usually scoped to one service but many instances |
| Custom all value | .* | Use with =~ regex matchers so “All” means a real pattern |
| Refresh | On time range change | Picks up newly deployed instances without a reload |

When you enable multi-value, switch your matchers to regex. instance=~"$instance" works for both single and multi selection; instance="$instance" breaks the moment a user picks two.
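The reason is visible once you look at what gets substituted. For a Prometheus data source, a multi-value selection is rendered as a regex alternation such as (a|b). The sketch below approximates that rendering (the escaping rules are my simplification, not Grafana's exact ones) and compares the two matcher styles against it.

```python
import re


def render_multi_value(values):
    """Approximate the string substituted for a multi-value variable: (a|b|c)."""
    escaped = [re.sub(r"([.\\+*?\[\]^$(){}|])", r"\\\1", v) for v in values]
    return escaped[0] if len(escaped) == 1 else "(" + "|".join(escaped) + ")"


def matches_regex(label_value, rendered):
    """instance=~"$instance": Prometheus regex matchers are fully anchored."""
    return re.fullmatch(rendered, label_value) is not None


def matches_exact(label_value, rendered):
    """instance="$instance": a literal string comparison."""
    return label_value == rendered


# Hypothetical instances picked in the dropdown.
rendered = render_multi_value(["10.0.0.1:9100", "10.0.0.2:9100"])
print(rendered)
print(matches_regex("10.0.0.2:9100", rendered), matches_exact("10.0.0.2:9100", rendered))
```

With two instances selected, the exact matcher compares each label against the whole alternation string and matches nothing, which is why the panel goes blank.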

Thresholds, transformations, and value mappings

Raw series are noise until you give them meaning. Three features do the heavy lifting.

Thresholds turn a number into a verdict. On the error-ratio panel, set the field unit to Percent (0.0-1.0), then add thresholds at the values your SLO implies, for example green below 1%, amber from 1%, and red from 5%. Because the query returns a ratio, enter those thresholds as 0.01 and 0.05, not 1 and 5. A glance then tells you the state without reading the axis.
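As a sketch of how absolute thresholds resolve to a color, the list below mirrors the shape of the steps array in a panel's field config, with the first step acting as the base color. The color names and the helper function are illustrative assumptions.

```python
# Threshold steps in ascending order; the base step has no value.
STEPS = [
    {"color": "green", "value": None},
    {"color": "orange", "value": 0.01},  # 1% expressed as a ratio
    {"color": "red", "value": 0.05},     # 5% expressed as a ratio
]


def threshold_color(reading, steps=STEPS):
    """Return the color of the highest step whose value the reading has reached."""
    color = steps[0]["color"]
    for step in steps[1:]:
        if reading >= step["value"]:
            color = step["color"]
    return color


for ratio in (0.002, 0.01, 0.2):
    print(ratio, threshold_color(ratio))
```

Entering 1 and 5 instead of 0.01 and 0.05 would leave the panel green until the service was failing every request.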

Transformations reshape data after the query runs, in the panel itself. The ones I reach for most:

Value mappings convert codes to words. A panel showing the up metric returns 1 or 0; map 1 to “UP” with a green background and 0 to “DOWN” with red. Numeric pod-phase or status enums become legible the same way.

Data links and exemplars

This is where dashboards stop being passive. Exemplars are sampled trace IDs that your instrumentation attaches to histogram observations and that Prometheus stores alongside the series. With exemplar storage enabled in Prometheus (--enable-feature=exemplar-storage) and exemplars emitted by the app through the OpenTelemetry or Prometheus client libraries, Grafana renders them as diamonds on the latency graph. Click one and jump straight to the trace in Tempo or Jaeger.

Wire it up in the Prometheus data source configuration. In provisioned YAML:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo

The datasourceUid must match the uid of your tracing data source, and the name must match the label your exemplars carry (commonly trace_id). Now a p99 spike is one click from the exact request that caused it.

Data links generalize the idea to any panel. A data link is a templated URL that can interpolate the series labels and the clicked time range. On the Duration panel, a link to the logs for that exact service and window turns “latency is high” into “here are the logs from the slow window”:

/explore?left={"datasource":"loki","queries":[{"expr":"{service=\"${__field.labels.service}\"}"}],"range":{"from":"${__from}","to":"${__to}"}}

Grafana substitutes ${__field.labels.service} from the hovered series and ${__from} / ${__to} from the time range. The exact Explore URL schema shifts between major versions, so build one link in the UI, then copy what Grafana generates rather than hand-writing the JSON from memory.

Annotations and alert overlays

Correlation is half of incident response. Two overlays make it instant.

Annotations mark events on the time axis. The most valuable one is deploys. If your CI emits a Prometheus metric or writes to a table, an annotation query overlays a vertical line at each release, so “latency jumped at 14:32” immediately reads as “latency jumped right after the 14:31 deploy.” Add an annotation query in dashboard settings backed by your deploy metric, and every panel on the dashboard gets the markers.

Alert state can be overlaid too. Grafana-managed alert rules can surface their firing periods as annotation regions, shading the window an alert was active directly on the relevant graph. Seeing the firing window line up with the error-rate climb removes any doubt about which symptom the alert is actually catching.

Dashboards as code

A dashboard built by clicking is a dashboard one accidental “Save” away from gone. Treat the JSON model as the source of truth and put it in Git. There are two delivery mechanisms; pick based on who owns the dashboard.

File-based provisioning

For dashboards the platform team owns, drop the JSON on disk and point Grafana at it. A provisioning config tells Grafana where to look:

# /etc/grafana/provisioning/dashboards/platform.yaml
apiVersion: 1
providers:
  - name: platform
    orgId: 1
    folder: ""  # leave empty so foldersFromFilesStructure can derive folders from directories
    type: file
    disableDeletion: true
    editable: false
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

Setting editable: false and allowUiUpdates: false makes the dashboard read-only in the UI, which forces every change through the repo. foldersFromFilesStructure: true mirrors your directory layout into Grafana folders, so the filesystem and the UI stay in sync. Grafana rescans the path on a timer (updateIntervalSeconds) and reloads changed files, so no restart is needed.

Strip the top-level "id" field from exported JSON before committing it. A stale numeric id from the source instance can collide on the target. Keep the "uid" — it is the stable, portable identifier Grafana uses to match dashboards across instances. Setting a deterministic uid is also what makes provisioning idempotent.
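A small pre-commit step can enforce this. The sketch below is one way to do it, with a hypothetical function name: it removes only the top-level id (panel ids inside the dashboard are left alone), refuses dashboards without a uid, and writes keys in a stable order so Git diffs stay readable.

```python
import json


def prepare_for_commit(exported: str) -> str:
    """Drop the instance-specific id and require a uid before committing."""
    dashboard = json.loads(exported)
    dashboard.pop("id", None)  # numeric id is local to the source instance
    if not dashboard.get("uid"):
        raise ValueError("dashboard has no uid; set a deterministic one")
    # Stable key order and indentation keep diffs small between exports.
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


# Hypothetical export from a scratch instance.
exported = '{"id": 42, "uid": "service-red", "title": "Service RED", "panels": [{"id": 1}]}'
print(prepare_for_commit(exported))
```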

The Grafana Terraform provider

For dashboards that live alongside infrastructure, or when you want folders, permissions, alerts, and data sources managed in one place, use the Grafana Terraform provider:

terraform {
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 3.0"
    }
  }
}

provider "grafana" {
  url  = "https://grafana.example.com"
  auth = var.grafana_service_account_token
}

resource "grafana_folder" "platform" {
  title = "Platform"
}

resource "grafana_dashboard" "service_red" {
  folder      = grafana_folder.platform.uid
  config_json = file("${path.module}/dashboards/service-red.json")
}

Authenticate with a service account token, not a legacy API key — Grafana deprecated API keys in favor of service accounts. Keep the dashboard JSON in its own file and reference it with file() so designers can still edit visually in a scratch instance, export, and commit the result. The provider diffs the JSON on every terraform plan, so drift from someone editing in the UI shows up as a plan change you can revert.

Library panels and folders for governance

Across many teams, copy-paste is the enemy. Library panels let you define a panel once — the canonical RED error-ratio panel with the right thresholds and units — and reuse it across dashboards. Edit the library panel and every dashboard that embeds it updates. That is how you roll out a threshold change to fifty dashboards without touching fifty files.

Folders are the unit of permission. Grant a team edit rights on its folder and viewer rights elsewhere. Combined with provisioning, you get a clean model: the platform team owns provisioned, read-only golden dashboards in shared folders, while product teams build freely in their own folders. Folder permissions are themselves manageable through the Terraform provider, so access control lives in the same review process as everything else.

Enterprise scenario

A payments platform team I worked with ran one Grafana per cluster across 30+ EKS clusters, all driven by the Terraform provider. They shipped a fix to the golden RED dashboard — switching latency from a classic histogram to histogram_quantile over native histograms — and terraform apply succeeded everywhere. Within an hour, half the clusters showed empty Duration panels. The culprit was a mismatch the provider could not catch: native histograms require --enable-feature=native-histograms on the Prometheus side, and only the clusters on the newer Prometheus release had it. The plan diffed clean because the dashboard JSON was identical; the data backend was not.

The fix was to stop treating dashboard-as-code as sufficient and gate the rollout on a backend capability check before applying. They added a CI step that reads each Prometheus server’s runtime flags from its status API and fails when the feature is not enabled:

# Fail the apply for any cluster missing the native-histograms feature
for ctx in $(kubectl config get-contexts -o name); do
  # /api/v1/status/flags reports enable-feature as a comma-separated string
  features=$(curl -sf "https://prometheus.$ctx.internal/api/v1/status/flags" \
    | jq -r '.data["enable-feature"] // ""')
  case ",$features," in
    *,native-histograms,*) ;;
    *) echo "FAIL: $ctx lacks native-histograms"; exit 1 ;;
  esac
done

They also pinned the dashboard variant per cluster via a Terraform for_each over a capability map, so older clusters kept the classic-histogram JSON until their Prometheus was upgraded. The lesson: a dashboard’s PromQL has an implicit contract with the data source’s version and feature flags, and version-skew across a fleet will bite you even when every plan is green. Provisioning makes dashboards reproducible, not correct.
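A capability map like that can be generated rather than maintained by hand. This is a minimal sketch, assuming the input is the comma-separated enable-feature value that Prometheus reports on its /api/v1/status/flags endpoint; the cluster names and variant labels are hypothetical.

```python
def histogram_variant(enable_feature: str) -> str:
    """Choose a dashboard variant from a comma-separated feature-flag string."""
    features = {f.strip() for f in enable_feature.split(",") if f.strip()}
    return "native" if "native-histograms" in features else "classic"


def capability_map(flags_by_cluster: dict) -> dict:
    """Build {cluster: variant}, a shape a Terraform for_each could consume."""
    return {cluster: histogram_variant(flags) for cluster, flags in flags_by_cluster.items()}


# Hypothetical flag strings collected from two clusters.
print(capability_map({
    "prod-a": "exemplar-storage,native-histograms",
    "prod-b": "exemplar-storage",
}))
```

Written out as JSON, the result can feed Terraform as a variable, so a cluster moves to the native-histogram dashboard when its flags change rather than when someone remembers to edit a list.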

Verify

Confirm the dashboard actually works before you call it done.

# Validate the dashboard JSON is well-formed before committing
jq empty dashboards/service-red.json && echo "valid JSON"

# Confirm provisioning loaded it (look for the provider name and any errors)
docker logs grafana 2>&1 | grep -i provisioning

# List dashboards via the HTTP API using a service account token
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "https://grafana.example.com/api/search?type=dash-db" | jq '.[].title'

# Plan the Terraform change and confirm no unexpected drift
terraform plan

Then open the dashboard and check the human-facing behavior:

Checklist

Pitfalls

Start with one service. Build its RED and USE panels by hand, get the variables and exemplars right, then export the JSON and bring it under provisioning or Terraform. Once that template exists, the second service is a copy with a different default variable, and the dashboard sprawl that buried your last setup never gets a chance to start.
