Clicking a dashboard into existence is fine until the third person edits it, the alert someone hand-tuned at 2 a.m. never goes through review, and nobody can answer “is staging configured the same as prod?” Grafana has two complementary answers to this, and most teams apply each one to the wrong layer. File-based provisioning, loaded from YAML and JSON on disk, is the right tool for bootstrap config that ships with the instance. The Grafana Terraform provider is the right tool for everything that has a lifecycle - folders, permissions, library panels, alert rules, notification policies - because Terraform tracks state, computes drift, and plans changes before they land. This article builds the full pipeline: where each model belongs, how to parameterize dashboards across environments, how to express unified alerting as code, and how to wire plan, lint, and drift detection into CI.
The whole thing assumes Grafana 11 or later (unified alerting is the only alerting system now; legacy alerting was removed) and the grafana/grafana Terraform provider 3.x.
1. Pick the right provisioning model per layer
The two mechanisms are not competitors. They own different layers of the stack.
| Concern | File-based provisioning | Terraform provider |
|---|---|---|
| Data sources (bootstrap) | Yes - ships with the image | Possible, but state churns on secrets |
| Folders and folder permissions | No native file support | Yes - first-class resource |
| Dashboards | Yes (JSON on disk) | Yes (grafana_dashboard) |
| Library panels | No | Yes (grafana_library_panel) |
| Contact points / notification policies | Yes (alerting YAML) | Yes |
| Alert rules | Yes (alerting YAML) | Yes (grafana_rule_group) |
| Cross-environment templating | Limited (env vars) | Full - variables, workspaces, modules |
| Drift detection | None | terraform plan |
The rule I apply on every platform team: anything that must exist before Grafana can serve a single request goes in file-based provisioning; anything with a review-and-promote lifecycle goes in Terraform. Data sources are the classic boundary case. The Prometheus data source that the instance is useless without belongs in a provisioning file baked into the container. A team’s dashboards and their alert rules belong in Terraform so a pull request gates every change.
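File-based provisioning is declarative and idempotent: Grafana reconciles the on-disk files into its database on startup, and dashboard files are also rescanned on a configurable interval (updateIntervalSeconds on the dashboard provider). Critically, a resource provisioned from a file is read-only in the UI unless you opt out - a data source with editable: false cannot be modified there, and a provisioned dashboard cannot be saved back unless allowUiUpdates is set - which is exactly the guarantee you want for the bootstrap layer.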
Here is the minimal data source file. It lives at /etc/grafana/provisioning/datasources/:
# datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc:9090
    uid: prometheus-main # pin the uid - dashboards reference it
    isDefault: true
    jsonData:
      httpMethod: POST
      timeInterval: 30s
    version: 1
    editable: false
The single most important field is uid. Pin it. Dashboards and alert rules reference data sources by UID, not name, and if you let Grafana auto-generate one it will differ between dev and prod and every panel will say “datasource not found.” A pinned, environment-stable UID is what makes a dashboard portable.
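A pinned UID only helps if every dashboard actually uses it. The following is a minimal Python sketch of a pre-merge check, not part of Grafana or the provider: it walks a dashboard document and reports any data source UID that is not in an allowed set. The function names, the allowed set, and the list of built-in UIDs to ignore are my assumptions for illustration.

```python
from __future__ import annotations

from typing import Any, Iterator

# Assumed ignore list: Grafana's built-in sources and the expression engine.
BUILTIN_UIDS = {"-- Grafana --", "grafana", "__expr__"}


def iter_datasource_uids(node: Any) -> Iterator[str]:
    """Yield every datasource.uid found anywhere in a dashboard document."""
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and isinstance(ds.get("uid"), str):
            yield ds["uid"]
        for value in node.values():
            yield from iter_datasource_uids(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_datasource_uids(item)


def unpinned_uids(dashboard: dict, allowed: set[str]) -> set[str]:
    """Return UIDs that are neither allowed, built in, nor a template variable."""
    return {
        uid
        for uid in iter_datasource_uids(dashboard)
        if uid not in allowed
        and uid not in BUILTIN_UIDS
        and not uid.startswith("$")  # e.g. ${datasource} or $datasource
    }
```

In CI, load each file under dashboards/ with json.load and fail the job when unpinned_uids returns a non-empty set; an auto-generated UID copied from one instance shows up immediately.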
2. Bootstrap the provider and manage folders as immutable resources
Folders are the unit of organization and the unit of permission in Grafana, so they are the first thing Terraform should own. Configure the provider with a service account token (API keys are deprecated; use service accounts):
# providers.tf
terraform {
  required_version = ">= 1.6"
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 3.0"
    }
  }
}
provider "grafana" {
  url  = var.grafana_url # e.g. https://grafana.stage.internal
  auth = var.grafana_service_account_token
}
Never put the token in a .tf file or terraform.tfvars that gets committed. Source it from the environment (TF_VAR_grafana_service_account_token) backed by your secrets manager - Vault, AWS Secrets Manager, or the CI provider’s secret store.
Now the folder and its permissions:
# folders.tf
resource "grafana_folder" "platform" {
  title = "Platform"
  uid   = "platform" # stable across environments
}
resource "grafana_folder_permission" "platform" {
  folder_uid = grafana_folder.platform.uid
  permissions {
    team_id    = grafana_team.sre.id
    permission = "Edit"
  }
  permissions {
    role       = "Viewer"
    permission = "View"
  }
}
Treat folder UIDs the same way you treat data source UIDs: pin them, keep them identical across environments, and never change a folder’s UID in place - a new UID means a new folder, and every dashboard inside has to be re-homed. Changing only the title is safe as long as the UID stays fixed. Permissions are deliberately coarse here. Grafana’s folder model gives you View/Edit/Admin, and pushing all fine-grained access through team membership (rather than per-folder Terraform sprawl) keeps the blast radius of any permission change small and auditable.
3. Import existing dashboards and parameterize them
Almost nobody starts from a blank dashboard. You have dozens already built in the UI, and the migration path is: export the JSON, strip the instance-specific bits, and feed it to Terraform.
Export from the API (not the UI “Share” dialog, which adds export-only metadata):
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "$GRAFANA_URL/api/dashboards/uid/abc123" \
  | jq '.dashboard' > dashboards/service-overview.json
Two fields must be removed from the exported JSON before Terraform manages it:
jq 'del(.id) | del(.version)' \
  dashboards/service-overview.json > dashboards/service-overview.clean.json
The id is the database primary key - it is instance-local and meaningless in another Grafana. The version is server-managed; leaving it in causes a perpetual diff. Drop both. Keep the uid - that is the stable, portable identifier you want consistent across environments. Once you have checked the cleaned file, move it over the original so that later steps can keep referring to dashboards/service-overview.json.
The hardcoded data source UID inside the panels is the next problem. Replace it with a template token so the same JSON works everywhere:
# dashboards.tf
resource "grafana_dashboard" "service_overview" {
  folder = grafana_folder.platform.uid
  config_json = templatefile("${path.module}/dashboards/service-overview.json", {
    datasource_uid = var.prometheus_datasource_uid
  })
  overwrite = true
}
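Inside the JSON, panels reference "datasource": { "uid": "${datasource_uid}" }, and templatefile substitutes the per-environment value at plan time. One catch: templatefile interprets every ${...} sequence in the file, so any Grafana variable written in braces (for example ${service}) must be escaped as $${service}, or rendering fails because Terraform has no such template variable. Bare forms such as $service and $__rate_interval are left alone. overwrite = true lets Terraform reconcile a dashboard that already exists with the same UID rather than failing - essential when you import into an instance that already has the dashboard.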
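The export-and-clean steps can be combined into one script. This is a minimal Python sketch rather than an official tool. It drops id and version, escapes the ${ and %{ sequences that Terraform's templatefile would otherwise try to interpret, and swaps one hard-coded data source UID for the datasource_uid token. The function name and the sample UID are hypothetical.

```python
from __future__ import annotations

import json


def to_terraform_template(
    dashboard: dict,
    hardcoded_uid: str,
    placeholder: str = "datasource_uid",
) -> str:
    """Render an exported dashboard as text that is safe to pass to templatefile()."""
    # Drop the instance-local primary key and the server-managed version.
    cleaned = {k: v for k, v in dashboard.items() if k not in ("id", "version")}
    text = json.dumps(cleaned, indent=2, sort_keys=True, ensure_ascii=False)

    # Escape Grafana's own ${var} syntax and any %{ directive look-alikes,
    # so Terraform leaves them untouched.
    text = text.replace("${", "$${").replace("%{", "%%{")

    # Only now insert the one token Terraform is meant to fill in.
    return text.replace(
        f'"uid": "{hardcoded_uid}"',
        f'"uid": "${{{placeholder}}}"',
    )
```

The order matters: escaping happens before the placeholder is inserted, so the token is the only unescaped ${...} left in the file. Write the returned string to dashboards/service-overview.json and templatefile only has to resolve datasource_uid.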
For dashboards that change often, hand-editing 1,500 lines of JSON does not scale. This is where Grafonnet earns its keep. Grafonnet is a Jsonnet library that renders dashboard JSON from composable functions, so a panel becomes a call you parameterize and reuse:
// service.jsonnet
local g = import 'g.libsonnet';
local prometheus = g.query.prometheus;
g.dashboard.new('Service Overview')
+ g.dashboard.withUid('service-overview')
+ g.dashboard.withPanels([
  g.panel.timeSeries.new('Request rate')
  + g.panel.timeSeries.queryOptions.withTargets([
    prometheus.new(
      '$datasource',
      'sum(rate(http_requests_total{service="$service"}[$__rate_interval]))'
    ),
  ]),
])
Render it to JSON in CI, then hand the output to the same grafana_dashboard resource:
jsonnet -J vendor service.jsonnet > dashboards/service-overview.json
The payoff is that a fleet of services share one Jsonnet template and differ only by their $service variable, instead of forty copy-pasted JSON files that drift apart panel by panel.
4. Express unified alerting as code
Unified alerting has three moving parts, and each maps to a Terraform resource: contact points (where a notification goes), notification policies (how alerts route to contact points), and rule groups (what fires). Manage all three in Terraform and the entire alerting surface becomes reviewable.
Contact point first - the destination:
# alerting_contactpoints.tf
resource "grafana_contact_point" "platform_oncall" {
  name = "platform-oncall"
  slack {
    url   = var.slack_webhook_url
    title = "{{ .CommonLabels.alertname }}"
    text  = "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
  }
}
The notification policy tree decides routing. There is exactly one root policy per Grafana instance, so it is a singleton resource:
# alerting_policies.tf
resource "grafana_notification_policy" "root" {
  contact_point   = grafana_contact_point.platform_oncall.name
  group_by        = ["alertname", "service"]
  group_wait      = "30s"
  group_interval  = "5m"
  repeat_interval = "4h"
  policy {
    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
    contact_point   = grafana_contact_point.platform_oncall.name
    group_wait      = "10s"
    repeat_interval = "1h"
  }
}
group_wait is how long Grafana buffers the first alert in a new group before sending, so a burst of related failures arrives as one notification rather than forty. repeat_interval is the re-notify cadence for an alert that stays firing. Tightening both for severity=critical in the nested policy is the standard pattern: page faster and re-page sooner for the things that matter.
Now the rule group. A rule group is the atomic unit of alert evaluation - every rule in it shares one evaluation interval and evaluates sequentially:
# alerting_rules.tf
resource "grafana_rule_group" "service_health" {
  name             = "service-health"
  folder_uid       = grafana_folder.platform.uid
  interval_seconds = 60
  rule {
    name      = "HighErrorRatio"
    condition = "C"
    for       = "5m"
    data {
      ref_id         = "A"
      datasource_uid = var.prometheus_datasource_uid
      relative_time_range {
        from = 600
        to   = 0
      }
      model = jsonencode({
        # by (service) keeps the label that the summary annotation reads
        expr    = "sum by (service) (rate(http_requests_total{code=~\"5..\"}[5m])) / sum by (service) (rate(http_requests_total[5m]))"
        instant = true # the threshold needs one value per series, not a range
        refId   = "A"
      })
    }
    data {
      ref_id         = "C"
      datasource_uid = "__expr__"
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        type       = "threshold"
        expression = "A"
        refId      = "C"
        conditions = [{
          evaluator = { type = "gt", params = [0.05] }
        }]
      })
    }
    labels = {
      severity = "critical"
    }
    annotations = {
      summary = "Error ratio above 5% for {{ $labels.service }}"
    }
  }
}
Two things trip people up here. First, the special data source UID __expr__ is Grafana’s server-side expression engine: query A pulls the raw ratio from Prometheus, and the threshold expression C is what the condition points at. The query does not fire the alert; the expression does. Second, for = "5m" is the pending period - the condition must hold continuously for five minutes before the alert transitions from Pending to Firing, which is your defense against single-scrape blips. The interval_seconds on the group (how often it evaluates) and for on the rule (how long it must stay true) are independent knobs; get either one wrong and you either page on noise or react too slowly.
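The way the two knobs combine is easier to see as arithmetic. The sketch below is a simplified model I am assuming for illustration, not Grafana's scheduler: the rule enters Pending at the first evaluation that sees the condition true and fires at the first later evaluation where the pending time has reached for. It ignores query latency and scrape delay.

```python
from __future__ import annotations

import math


def firing_delay_bounds(interval_seconds: int, for_seconds: int) -> tuple[int, int]:
    """Return (best, worst) seconds from the condition becoming true to Firing."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if for_seconds < 0:
        raise ValueError("for_seconds must not be negative")

    # Evaluations needed after entering Pending before `for` has elapsed.
    pending_evals = math.ceil(for_seconds / interval_seconds)
    held = pending_evals * interval_seconds

    # Best case: the breach starts just before an evaluation.
    # Worst case: it starts just after one, costing a full extra interval.
    return held, held + interval_seconds


if __name__ == "__main__":
    print(firing_delay_bounds(60, 300))
```

For the rule group above (60-second interval, 5-minute pending period) this model gives a delay between 300 and 360 seconds. A for value that is not a multiple of the interval is effectively rounded up to the next evaluation.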
5. Build reusable library panels and dashboard modules
A library panel is a single panel definition stored once and referenced by many dashboards. Edit it in one place and every dashboard that embeds it updates. Terraform owns the definition:
# library_panels.tf
resource "grafana_library_panel" "error_ratio" {
  name       = "Error Ratio"
  folder_uid = grafana_folder.platform.uid
  model_json = jsonencode({
    title = "Error Ratio"
    type  = "timeseries"
    targets = [{
      expr = "sum(rate(http_requests_total{code=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total[$__rate_interval]))"
    }]
  })
}
The higher-leverage abstraction is a Terraform module that packages a folder, its standard dashboards, its library panels, and its alert rules into one callable unit. A platform team exposes a grafana-service module and every product team instantiates it with a few variables:
# teams/checkout/main.tf
module "checkout" {
  source                = "../../modules/grafana-service"
  service_name          = "checkout"
  folder_title          = "Checkout"
  prometheus_datasource = var.prometheus_datasource_uid
  oncall_contact_point  = "checkout-oncall"
  error_ratio_threshold = 0.02
}
This is the difference between forty teams each reinventing a dashboard and forty teams inheriting a vetted standard. The module enforces that every service gets the same RED panels, the same naming, the same alert structure - and an improvement to the module propagates everywhere on the next apply.
6. Promote across environments with workspace isolation
The same code must produce dev, stage, and prod without copy-paste. Two patterns work; pick one and do not mix them.
The first is one Terraform workspace per environment with a tfvars file each:
terraform workspace new prod
terraform workspace select prod
terraform plan -var-file=env/prod.tfvars
terraform apply -var-file=env/prod.tfvars
# env/prod.tfvars
grafana_url = "https://grafana.prod.internal"
prometheus_datasource_uid = "prometheus-prod"
error_ratio_threshold = 0.02 # tighter in prod
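A per-environment tfvars layout fails quietly when one file is missing a variable that the others set. The following Python sketch is a hypothetical parity check, not a Terraform feature: it compares the variable names across flat tfvars files. The regex only understands single-line name = value assignments, so files with multi-line maps or lists need a real HCL parser instead.

```python
from __future__ import annotations

import re

# A top-level `name = value` assignment. Comment lines and blank lines do not match.
_ASSIGNMENT = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_-]*)\s*=")


def tfvars_keys(text: str) -> set[str]:
    """Collect the variable names assigned in a flat .tfvars file."""
    keys: set[str] = set()
    for line in text.splitlines():
        match = _ASSIGNMENT.match(line)
        if match:
            keys.add(match.group(1))
    return keys


def missing_keys(envs: dict[str, str]) -> dict[str, set[str]]:
    """Map each environment to the variables other environments set but it does not."""
    per_env = {name: tfvars_keys(text) for name, text in envs.items()}
    every = set().union(*per_env.values())
    return {
        name: every - keys
        for name, keys in per_env.items()
        if every - keys
    }
```

Feed it the contents of env/dev.tfvars, env/stage.tfvars, and env/prod.tfvars keyed by environment name; an empty result means every environment sets the same variables.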
The second pattern - which I prefer at scale - drops Terraform CLI workspaces in favor of one directory per environment, each with its own backend state and a shared module. The directory-per-environment layout makes the state boundary explicit on disk and removes the foot-gun of running apply against the wrong selected workspace:
environments/
  dev/    -> backend "dev", calls module "platform"
  stage/  -> backend "stage", calls module "platform"
  prod/   -> backend "prod", calls module "platform"
modules/
  platform/
Either way, the non-negotiable rule is separate state per environment. Shared state means a terraform apply aimed at dev can corrupt prod’s resource tracking. Separate backends - separate S3 keys or separate Terraform Cloud workspaces - give each environment an isolated blast radius. Promotion is then a git merge: a dashboard change lands in the dev directory, gets validated, and the same commit is promoted to stage and prod through your normal PR flow, with each environment’s apply running against its own state.
7. Wire the CI pipeline: plan, lint, drift detection
The pipeline has three jobs, and all three must pass before a merge.
Format and validate catch the cheap mistakes before anything touches Grafana:
# .github/workflows/grafana.yml
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform fmt -check -recursive
      - run: terraform init -backend=false
      - run: terraform validate
Lint the dashboards. This is the step most teams skip and most regret. A dashboard JSON that is structurally valid can still hard-code a data source instead of using a template variable, ship panels without titles or units, or use fixed rate windows where $__rate_interval belongs. dashboard-linter (the Grafana Labs tool) checks dashboards against a set of best-practice rules - template variable usage, panel titles, target configuration:
# lint every dashboard JSON in the repo
go install github.com/grafana/dashboard-linter@latest
for f in dashboards/*.json; do
  dashboard-linter lint "$f"
done
Plan on PR, apply on merge, and run drift detection on a schedule. The plan output posted to the pull request is what reviewers actually read - it shows exactly which folders, dashboards, and alert rules change. Drift detection is a scheduled plan that should always be empty; a non-empty plan on the nightly run means someone edited Grafana through the UI behind Terraform’s back:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -detailed-exitcode
        # exit 0 = no drift, 2 = drift detected, 1 = error
The -detailed-exitcode flag is the trick: plan returns exit code 2 when there are changes to apply, so the scheduled job fails loudly the moment configuration drifts from code. Wire that failure to the same Slack channel your alerts go to.
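If the scheduled job needs to do more than fail, for example send a different message for drift than for a broken plan, the three exit codes have to be told apart. This is a minimal Python sketch of that wrapper. The function names and labels are my own, and it assumes terraform is on the PATH and the working directory has already been initialized.

```python
from __future__ import annotations

import subprocess


def classify_plan_exit(code: int) -> str:
    """Translate a `terraform plan -detailed-exitcode` status into a label."""
    if code == 0:
        return "clean"  # plan succeeded, nothing to change
    if code == 2:
        return "drift"  # plan succeeded, changes are pending
    return "error"      # the plan itself failed


def run_drift_check(workdir: str) -> str:
    """Run the plan in `workdir` and return 'clean', 'drift', or 'error'."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Keeping "error" separate from "drift" matters in practice: an expired service account token makes the plan fail with exit code 1, and that should page whoever owns the pipeline, not the team that owns the dashboards.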
Verify
Confirm the system end to end, not just that apply exited zero.
Check Terraform’s view of the world matches reality:
terraform plan -detailed-exitcode # must report "No changes" (exit 0)
Confirm the dashboard exists with the UID you pinned:
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "$GRAFANA_URL/api/dashboards/uid/service-overview" \
  | jq '.dashboard.title, .meta.provisioned'
Confirm the alert rule is loaded and evaluating - the rules API returns every rule’s current state:
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "$GRAFANA_URL/api/prometheus/grafana/api/v1/rules" \
  | jq '.data.groups[].rules[] | {name: .name, state: .state, health: .health}'
A healthy rule reports health: "ok" together with state: "inactive" (condition not met), "pending" (met, inside the for window), or "firing". A health of "error" means the query or expression is broken - usually a wrong data source UID.
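The same check can gate a pipeline instead of relying on someone reading jq output. The sketch below assumes the Prometheus-compatible response shape (data.groups[].rules[], with name, state, and health on each rule) and treats any health other than "ok" as a problem. The function name is hypothetical.

```python
from __future__ import annotations


def unhealthy_rules(response: dict) -> list[tuple[str, str, str]]:
    """Return (group, rule, health) for every rule whose health is not 'ok'."""
    problems: list[tuple[str, str, str]] = []
    for group in response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            # A missing health field is reported as 'unknown', never assumed healthy.
            health = rule.get("health", "unknown")
            if health != "ok":
                problems.append((group.get("name", ""), rule.get("name", ""), health))
    return problems
```

Parse the curl output with json.loads, pass it in, and fail the verification step if the list is non-empty. A firing rule with health "ok" is not a problem here; it means the rule works.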
Test the contact point without waiting for a real alert. The UI’s “Test” button on the contact point sends a synthetic notification; use it to confirm the Slack webhook actually delivers before you depend on it at 3 a.m.
Finally, prove drift detection works: edit a dashboard panel in the UI, then run terraform plan. It must show the panel reverting on the next apply. If it shows nothing, your state is not tracking what you think it is.
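To make that proof mechanical, read the plan in machine-readable form (terraform plan -out=tfplan followed by terraform show -json tfplan) and list what would change. This is a minimal Python sketch that relies only on the resource_changes entries of that JSON and their change.actions lists. The function name and the sample addresses in the usage note are hypothetical.

```python
from __future__ import annotations

# Action lists that do not modify anything in Grafana.
_HARMLESS = (["no-op"], ["read"])


def changed_addresses(plan: dict) -> dict[str, list[str]]:
    """Map each resource address in a JSON plan to its pending actions."""
    changes: dict[str, list[str]] = {}
    for entry in plan.get("resource_changes", []):
        actions = entry.get("change", {}).get("actions", [])
        if actions and actions not in _HARMLESS:
            changes[entry["address"]] = actions
    return changes
```

After editing a panel in the UI, the result should contain the dashboard's address, such as grafana_dashboard.service_overview, with the action update. An empty dictionary means Terraform is not tracking the resource you edited.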
Enterprise scenario
A payments platform ran a single Grafana Enterprise instance behind their SRE team and let product teams self-serve dashboards through the UI. It worked until an auditor asked a simple question during a SOC 2 review: “show me the change history and approver for every production alert rule in the last quarter.” There was none. Alert thresholds had been edited live, dozens of times, with no record of who or why - and two of those edits had silenced a critical database-saturation alert that later contributed to an outage.
The constraint they could not move was organizational: 30+ product teams, none of whom would tolerate filing a ticket with SRE for every dashboard tweak. A central team owning all of Grafana through one Terraform state would have become the bottleneck the teams were trying to escape, and a single state file touched by 30 teams is a merge-conflict and blast-radius nightmare.
The solution was a directory-per-team layout over a shared module, with separate backend state per team, and Grafana’s UI editing disabled for everything Terraform managed. Each team got a directory, owned its own state, and instantiated the same vetted grafana-service module. The CI pipeline ran plan on every PR (posted for the team to self-review), apply on merge, and a nightly drift check across all teams. The audit answer became “every change is a git commit with an approver, here is the log” - and because alert rules, contact points, and notification policies created through Terraform are marked as provisioned and locked in the UI by default, the live-editing that caused the outage was structurally impossible.
The load-bearing piece was making UI edits fail rather than relying on policy. A team’s apply step verifies provenance before it runs:
# CI gate: refuse to apply if anything was edited outside Terraform
terraform plan -detailed-exitcode -refresh-only
# exit code 2 here means state drifted from real Grafana ->
# someone edited the UI; fail the pipeline and require a git change
The -refresh-only plan compares real Grafana against state without proposing config changes, so exit code 2 specifically flags out-of-band edits, while exit code 1 still means the plan itself failed. That one gate turned “trust people not to click” into a machine-enforced invariant, and the next audit took an afternoon instead of a week.