
Policy-as-Code Guardrails with OPA Gatekeeper: Constraint Templates, Mutation, and CI Gating

Every cluster eventually accumulates a folklore of rules nobody enforces: “always set resource limits,” “only pull from our registry,” “tag everything with a cost-center.” These live in wikis and code review comments until the day a Deployment with no limits OOM-kills a node, or an unscanned image from Docker Hub lands in production. Guardrails that depend on human vigilance are not guardrails — they are suggestions.

OPA Gatekeeper turns those suggestions into admission-time controls. It plugs Open Policy Agent into the Kubernetes API server as a validating (and mutating) webhook, so a non-compliant object is rejected before it is persisted to etcd. This article builds a real guardrail program: ConstraintTemplates in Rego, parameterized Constraints, mutation defaults, safe staged rollout with dryrun, referential checks against synced data, and — critically — the same policies running in CI so violations surface on a pull request instead of at kubectl apply.

1. Architecture: webhook, constraint framework, and audit

Gatekeeper has three moving parts, and understanding the split prevents most production surprises.

The admission webhook. Gatekeeper registers a ValidatingWebhookConfiguration (and a MutatingWebhookConfiguration). On every CREATE/UPDATE for matched resources, the API server calls Gatekeeper synchronously. Gatekeeper evaluates the request against all active Constraints and returns allow/deny. Because this is in the critical path of every write, two settings matter enormously: failurePolicy and timeoutSeconds. We tune those in section 4.

The constraint framework. You do not write raw Rego against the webhook. You write a ConstraintTemplate — Rego plus a CRD schema — which generates a new custom resource kind. You then create Constraints (instances of that kind) that say “apply this logic to these resources with these parameters.” This two-tier design is the whole point: platform engineers author templates once; application teams (or you) declare cheap, declarative Constraints without touching Rego.

Audit. A background controller periodically re-evaluates existing cluster objects against all Constraints and writes results to each Constraint’s status.violations. This catches resources that predate a policy, or that were admitted while a Constraint was in dryrun. Audit is how you measure blast radius before flipping to enforce.

Install with the released manifest (pin the version — never track latest for an admission controller):

kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/v3.16.3/deploy/gatekeeper.yaml
kubectl -n gatekeeper-system rollout status deploy/gatekeeper-controller-manager
kubectl get crd | grep gatekeeper

2. Authoring a ConstraintTemplate in Rego

Start with the canonical guardrail: required labels. The template defines the Rego logic and the parameter schema that becomes the Constraint CRD.

# templates/k8srequiredlabels.yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels      # this becomes the Constraint kind
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("missing required labels: %v", [missing])
        }

The contract is fixed and worth memorizing: the rule must be named violation, it returns a set of objects with a msg (string) and optional details, and a non-empty set means “reject.” The admission payload is at input.review.object; parameters from the Constraint are at input.parameters. Set arithmetic (required - provided) is idiomatic Rego — far cleaner than iterating.
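To make the contract concrete, here is a rough Python model of what the rule computes. It is a sketch for intuition only, not how Gatekeeper evaluates Rego, and the sample document is hand-written: a real request carries the full AdmissionReview payload under input.review.

```python
def violations(doc: dict) -> list:
    """Python model of the k8srequiredlabels rule: required labels minus provided labels."""
    provided = set(doc["review"]["object"]["metadata"].get("labels", {}))
    required = set(doc["parameters"]["labels"])
    missing = sorted(required - provided)
    if not missing:
        return []  # empty result: the object is admitted
    return [{"msg": f"missing required labels: {missing}",
             "details": {"missing_labels": missing}}]


# Hand-written sample input; only the fields the rule reads are included.
sample_input = {
    "review": {"object": {"metadata": {"name": "web", "labels": {"owner": "team-a"}}}},
    "parameters": {"labels": ["owner", "cost-center"]},
}

for v in violations(sample_input):
    print(v["msg"])  # missing required labels: ['cost-center']
```

The two top-level keys mirror the two inputs the Rego sees: the object under review and the parameters supplied by the Constraint.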

A common mistake is naming the rule deny. That is the Conftest convention (section 7), not Gatekeeper’s. Gatekeeper only collects violation, so logic placed in a deny rule is never evaluated and enforces nothing.

3. Parameterized Constraints: required labels, registries, resource limits

With the template applied, a Constraint is pure declaration. No Rego, just scope and parameters:

# constraints/require-owner-label.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-owner-and-costcenter
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
    excludedNamespaces: ["kube-system", "gatekeeper-system"]
  parameters:
    labels: ["owner", "cost-center"]

The match block is your scoping surface: kinds, namespaces, excludedNamespaces, labelSelector, and namespaceSelector. Always exclude system namespaces: a Constraint that blocks kube-system from creating its own pods is a self-inflicted outage.

Two more guardrails platform teams ship on day one. A registry allowlist keeps images on your trusted path:

# templates/k8sallowedrepos.yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedrepos
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRepos
      validation:
        openAPIV3Schema:
          type: object
          properties:
            repos:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedrepos

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          satisfied := [good | repo := input.parameters.repos[_]; good := startswith(container.image, repo)]
          not any(satisfied)
          msg := sprintf("container image %v is not from an allowed registry: %v", [container.image, input.parameters.repos])
        }
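
One detail of the prefix check above is worth seeing in isolation: because the match is a plain string prefix, whether each entry in repos ends with a slash changes what it admits. The Python sketch below mirrors the startswith logic. The registry names and the lookalike hostname are hypothetical values chosen for illustration.

```python
def image_allowed(image: str, repos: list) -> bool:
    """Mirror of the template's check: allowed if the image starts with any listed prefix."""
    return any(image.startswith(repo) for repo in repos)


# Hypothetical parameter values, for illustration only.
loose = ["registry.internal.example.com"]    # no trailing slash
strict = ["registry.internal.example.com/"]  # trailing slash pins the hostname

lookalike = "registry.internal.example.com.attacker.io/nginx:1.27"
genuine = "registry.internal.example.com/team/nginx:1.27"

print(image_allowed(lookalike, loose))   # True: the lookalike host shares the prefix
print(image_allowed(lookalike, strict))  # False
print(image_allowed(genuine, strict))    # True
```

Ending every allowlist entry with a slash, as the Conftest example in section 7 does, avoids the lookalike case.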

For resource limits, lean on Gatekeeper’s maintained library rather than hand-rolling. The community ships a battle-tested K8sContainerLimits template (and many others) at github.com/open-policy-agent/gatekeeper-library. Vendoring proven templates beats reinventing CPU/memory parsing in Rego, which is a notorious source of unit-conversion bugs around binary (Mi, Gi) vs. decimal (M, G) suffixes.
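
To show why the suffix handling is easy to get wrong, here is a minimal Python sketch of quantity parsing. It is an illustration only: it is not the library's implementation, it covers just the common suffixes, and the function name is made up for this example.

```python
from decimal import Decimal

# Suffix multipliers for Kubernetes resource quantities (common suffixes only).
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}
DECIMAL = {"m": Decimal("0.001"), "k": 10**3, "M": 10**6, "G": 10**9,
           "T": 10**12, "P": 10**15, "E": 10**18}


def quantity_to_number(q: str) -> Decimal:
    """Convert a quantity string such as '512Mi' or '500m' to a plain number."""
    # Check two-character binary suffixes before one-character decimal ones.
    for table in (BINARY, DECIMAL):
        for suffix, factor in table.items():
            if q.endswith(suffix):
                return Decimal(q[: -len(suffix)]) * factor
    return Decimal(q)  # no suffix: already a plain number


print(quantity_to_number("512Mi"))  # 536870912
print(quantity_to_number("512M"))   # 512000000
print(quantity_to_number("512Mi") - quantity_to_number("512M"))  # 24870912
```

A policy that treats Mi and M as the same unit is off by roughly 5% at this size, and the gap widens with larger suffixes.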

4. Safe rollout: enforcementAction dryrun, warn, and audit

Never ship a new Constraint straight to deny in a busy cluster — you will discover the long tail of non-compliant workloads by paging the on-call. Gatekeeper gives three enforcementAction values:

| Action | Behavior on violation | Use it for |
|--------|-----------------------|------------|
| dryrun | Admits the object; records the violation in audit only | Measuring blast radius |
| warn | Admits, but returns a warning to the kubectl client | Nudging teams before enforcement |
| deny | Rejects the request | Steady-state enforcement |

The disciplined rollout is dryrun -> read audit -> warn -> deny. Start here:

spec:
  enforcementAction: dryrun

Apply, wait one audit cycle (default ~60s), then inspect what would have been blocked:

kubectl get k8srequiredlabels require-owner-and-costcenter \
  -o jsonpath='{.status.totalViolations}{"\n"}'

kubectl get k8srequiredlabels require-owner-and-costcenter \
  -o jsonpath='{range .status.violations[*]}{.namespace}{"/"}{.name}{": "}{.message}{"\n"}{end}'

Drive that count to zero (or to a known, accepted set) before promoting. Note that status.violations is truncated (20 entries per Constraint by default, controlled by the audit flag --constraint-violations-limit), so rely on totalViolations for the real count. Equally important is the webhook’s behavior when Gatekeeper itself is unavailable. The default failurePolicy: Ignore fails open, which is safer for cluster availability but means an outage silently disables your guardrails. For genuinely security-critical policies, set failurePolicy: Fail on the webhook (fail closed), but only after you trust Gatekeeper’s HA and have budgeted for a tight timeoutSeconds (3 seconds is a sane ceiling; a slow webhook stalls every write).
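
For the audit step, a per-namespace breakdown is usually more actionable than a single total. The Python sketch below groups a Constraint's recorded violations by namespace. The sample object is hand-written to match the status fields queried above, and the namespace names are invented.

```python
from collections import Counter


def violations_by_namespace(constraint: dict) -> Counter:
    """Count audit violations per namespace from one Constraint object (parsed JSON)."""
    violations = constraint.get("status", {}).get("violations", [])
    return Counter(v.get("namespace", "<cluster-scoped>") for v in violations)


# Hand-written sample shaped like the output of
# `kubectl get k8srequiredlabels <name> -o json`; the values are invented.
sample = {
    "status": {
        "totalViolations": 3,
        "violations": [
            {"kind": "Deployment", "namespace": "payments", "name": "api",
             "message": 'missing required labels: {"owner"}'},
            {"kind": "Deployment", "namespace": "payments", "name": "worker",
             "message": 'missing required labels: {"cost-center", "owner"}'},
            {"kind": "StatefulSet", "namespace": "search", "name": "index",
             "message": 'missing required labels: {"cost-center"}'},
        ],
    }
}

for namespace, count in violations_by_namespace(sample).most_common():
    print(f"{namespace}: {count}")
# payments: 2
# search: 1
```

Against a live cluster, the same function could be fed json.load(sys.stdin) with the kubectl output piped in.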

5. Mutation: defaults with Assign and ModifySet

Validation rejects; mutation fixes. Instead of denying a Pod that omits seccompProfile, you can inject a default. Mutators are separate CRDs and run before validation, so a mutation can bring an object into compliance with a Constraint that would otherwise reject it.

Assign sets a scalar or object field. Defaulting the seccomp profile cluster-wide:

# mutations/default-seccomp.yaml
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: default-seccomp-profile
spec:
  applyTo:
    - groups: [""]
      kinds: ["Pod"]
      versions: ["v1"]
  match:
    scope: Namespaced
    excludedNamespaces: ["kube-system"]
  location: "spec.securityContext.seccompProfile.type"
  parameters:
    assign:
      value: "RuntimeDefault"
    pathTests:
      - subPath: "spec.securityContext.seccompProfile.type"
        condition: MustNotExist

pathTests with MustNotExist is what makes this a default rather than an override: the mutation only fires when the user has not already set the field. Without it you would stomp on teams that legitimately chose a localhost profile.
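
The default-versus-override distinction can be modeled in a few lines of Python. This is a simplified sketch of the MustNotExist behavior on a plain dictionary, not Gatekeeper's mutation engine, and the helper name is hypothetical.

```python
import copy


def assign_if_absent(obj: dict, path: list, value) -> dict:
    """Return a copy of obj with value set at path, only if path does not already exist."""
    result = copy.deepcopy(obj)
    node = result
    for key in path[:-1]:
        node = node.setdefault(key, {})  # create missing parents on the way down
    node.setdefault(path[-1], value)     # leaves an existing value untouched
    return result


PATH = ["spec", "securityContext", "seccompProfile", "type"]

bare_pod = {"spec": {"containers": [{"name": "app"}]}}
tuned_pod = {"spec": {"securityContext": {"seccompProfile": {"type": "Localhost"}}}}

print(assign_if_absent(bare_pod, PATH, "RuntimeDefault")["spec"]["securityContext"]["seccompProfile"]["type"])
# RuntimeDefault
print(assign_if_absent(tuned_pod, PATH, "RuntimeDefault")["spec"]["securityContext"]["seccompProfile"]["type"])
# Localhost
```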

ModifySet manages list membership idempotently — adding to or pruning from arrays. To strip a debug flag teams keep copy-pasting:

# mutations/strip-debug-args.yaml
apiVersion: mutations.gatekeeper.sh/v1
kind: ModifySet
metadata:
  name: strip-debug-args
spec:
  applyTo:
    - groups: [""]
      kinds: ["Pod"]
      versions: ["v1"]
  match:
    scope: Namespaced
    excludedNamespaces: ["kube-system", "gatekeeper-system"]
  location: "spec.containers[name: *].args"
  parameters:
    operation: prune
    values:
      fromList:
        - "--debug"

Use operation: merge to add an element. The [name: *] wildcard applies the change to every container. There is also AssignMetadata (labels/annotations only) and AssignImage (image fields), but Assign and ModifySet cover the overwhelming majority of defaulting needs.
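
Idempotence is the property that makes ModifySet safe to run on every admission. The Python sketch below models merge and prune on a plain list to show that applying either one twice gives the same result as applying it once. It is an illustration of the semantics described above, not Gatekeeper code, and the flag names are made up.

```python
def modify_set(items: list, values: list, operation: str) -> list:
    """Model of ModifySet: merge adds missing values, prune removes listed values."""
    if operation == "merge":
        result = list(items)
        for v in values:
            if v not in result:  # never add a duplicate
                result.append(v)
        return result
    if operation == "prune":
        return [i for i in items if i not in values]
    raise ValueError(f"unknown operation: {operation}")


args = ["--port=8080", "--debug", "--log-level=info"]  # hypothetical container args

pruned = modify_set(args, ["--debug"], "prune")
print(pruned)                                          # ['--port=8080', '--log-level=info']
print(modify_set(pruned, ["--debug"], "prune") == pruned)  # True

merged = modify_set(pruned, ["--log-format=json"], "merge")
print(modify_set(merged, ["--log-format=json"], "merge") == merged)  # True
```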

6. Syncing data for referential constraints

Some rules cannot be decided from the incoming object alone — they need cluster context. “An Ingress host must be unique across all namespaces” requires knowing every other Ingress. Gatekeeper solves this by replicating selected objects into OPA’s in-memory cache via a Config (or SyncSet), then exposing them in Rego under data.inventory.

# config/sync.yaml
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  sync:
    syncOnly:
      - group: "networking.k8s.io"
        version: "v1"
        kind: "Ingress"

A referential template then reads the cache. Namespace-scoped objects live at data.inventory.namespace[<ns>][<groupVersion>][<kind>][<name>]; cluster-scoped at data.inventory.cluster[<groupVersion>][<kind>][<name>]:

package k8suniqueingresshost

identical(obj, review) {
  obj.metadata.namespace == review.object.metadata.namespace
  obj.metadata.name == review.object.metadata.name
}

violation[{"msg": msg}] {
  input.review.kind.kind == "Ingress"
  host := input.review.object.spec.rules[_].host
  other := data.inventory.namespace[_]["networking.k8s.io/v1"]["Ingress"][_]
  other.spec.rules[_].host == host
  not identical(other, input.review)
  msg := sprintf("ingress host %v is already claimed", [host])
}

The identical guard is essential — without it the object always collides with itself on UPDATE. Two operational caveats: only sync what you query (the cache costs memory and watch load), and remember the cache is eventually consistent. Under a burst of simultaneous Ingress creates, two could momentarily both pass. Treat referential uniqueness as defense-in-depth, not a hard transactional guarantee.
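
The shape of the synced cache and the role of the self-exclusion guard can be sketched in Python. The code below walks a dictionary laid out like data.inventory.namespace and reports hosts already claimed by a different Ingress. It is a model of the Rego above for intuition, with invented namespaces and hosts.

```python
def ingress(namespace: str, name: str, host: str) -> dict:
    """Build a minimal Ingress-shaped dict (only the fields the check reads)."""
    return {"metadata": {"namespace": namespace, "name": name},
            "spec": {"rules": [{"host": host}]}}


def host_conflicts(review_obj: dict, inventory: dict) -> list:
    """Return hosts in review_obj that another synced Ingress already uses."""
    hosts = {r["host"] for r in review_obj["spec"].get("rules", []) if r.get("host")}
    me = (review_obj["metadata"]["namespace"], review_obj["metadata"]["name"])
    conflicts = set()
    for by_group_version in inventory.get("namespace", {}).values():
        ingresses = by_group_version.get("networking.k8s.io/v1", {}).get("Ingress", {})
        for other in ingresses.values():
            if (other["metadata"]["namespace"], other["metadata"]["name"]) == me:
                continue  # the "identical" guard: skip the object under review
            for rule in other["spec"].get("rules", []):
                if rule.get("host") in hosts:
                    conflicts.add(rule["host"])
    return sorted(conflicts)


# Invented cache contents: one Ingress already synced.
inventory = {"namespace": {"shop": {"networking.k8s.io/v1": {"Ingress": {
    "storefront": ingress("shop", "storefront", "shop.example.com")}}}}}

print(host_conflicts(ingress("shop", "storefront", "shop.example.com"), inventory))    # []
print(host_conflicts(ingress("marketing", "promo", "shop.example.com"), inventory))    # ['shop.example.com']
```

The first call models an UPDATE of the existing Ingress, which passes only because of the guard.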

7. Shift-left: the same policies in CI with Conftest

Admission control is your last line of defense. It is a poor first one — by the time kubectl apply is rejected, the developer has already context-switched. The fix is running policy in CI against rendered manifests, so the feedback lands on the pull request.

Conftest runs Rego against structured config (YAML, JSON, HCL). Its convention differs from Gatekeeper: rules are named deny/violation/warn in package main, and the document under test is input directly (no review.object wrapper). You can share the core logic by factoring it into a library package both call, or maintain a thin Conftest mirror:

# policy/deny_registry.rego
package main

allowed_repos := ["registry.internal.example.com/", "ghcr.io/acme/"]

deny[msg] {
  input.kind == "Deployment"
  container := input.spec.template.spec.containers[_]
  not any([startswith(container.image, r) | r := allowed_repos[_]])
  msg := sprintf("%v: image %v not from an allowed registry", [input.metadata.name, container.image])
}

Wire it into the pipeline. Render Helm/Kustomize first so you test what actually deploys:

# render, then gate
helm template ./chart --values values-prod.yaml > /tmp/rendered.yaml
conftest test /tmp/rendered.yaml --policy policy/

# .github/workflows/policy.yml
name: policy-gate
on: [pull_request]
jobs:
  conftest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install conftest
        run: |
          VER=0.56.0
          curl -sSfL "https://github.com/open-policy-agent/conftest/releases/download/v${VER}/conftest_${VER}_Linux_x86_64.tar.gz" \
            | sudo tar -xz -C /usr/local/bin conftest
      - name: Render manifests
        run: kustomize build overlays/prod > rendered.yaml
      - name: Conftest gate
        run: conftest test rendered.yaml --policy policy/

Conftest exits non-zero on any deny, failing the job. Now a bad registry reference is a red check on the PR, not a production incident.

8. Testing Rego and gating the policies themselves

Policies are code, and untested policy code rots. Two complementary tools cover the two layers.

Unit-test the Rego with OPA’s built-in framework. Rules prefixed test_ are auto-discovered, and the with keyword (with input as ...) replaces the input document for a single expression:

# policy/deny_registry_test.rego
package main

test_denies_external_image {
  deny[_] with input as {
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {"template": {"spec": {"containers": [{"image": "docker.io/nginx"}]}}},
  }
}

test_allows_internal_image {
  count(deny) == 0 with input as {
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {"template": {"spec": {"containers": [{"image": "ghcr.io/acme/web:1.2.3"}]}}},
  }
}

opa test policy/ -v

Integration-test the Gatekeeper templates with gator, which evaluates real ConstraintTemplates + Constraints against fixtures without a cluster. A Suite declares cases and asserts the expected violation count:

# test/suite.yaml
kind: Suite
apiVersion: test.gatekeeper.sh/v1alpha1
tests:
  - name: required-labels
    template: ../templates/k8srequiredlabels.yaml
    constraint: ../constraints/require-owner-label.yaml
    cases:
      - name: missing-owner-is-rejected
        object: fixtures/deploy-no-owner.yaml
        assertions:
          - violations: yes
      - name: compliant-is-allowed
        object: fixtures/deploy-compliant.yaml
        assertions:
          - violations: no

gator verify test/suite.yaml

Both tools exit non-zero on failure, so they drop straight into the CI job from section 7. This closes the loop: a change to a ConstraintTemplate cannot merge unless its tests pass, exactly like application code.

Verify

Confirm the full guardrail program end to end against a live cluster:

# 1. Constraints are registered and active
kubectl get constrainttemplates
kubectl get constraints -A

# 2. A non-compliant object is actually rejected (expect an error)
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rogue
  namespace: default
spec:
  selector: { matchLabels: { app: rogue } }
  template:
    metadata: { labels: { app: rogue } }
    spec:
      containers:
        - name: app
          image: docker.io/library/nginx:latest
EOF
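# -> admission webhook "validation.gatekeeper.sh" denied the request:
#    [require-owner-and-costcenter] missing required labels: {"cost-center", "owner"}
# Note: the K8sAllowedRepos template reads spec.containers (the Pod shape), so an image
# violation is reported on Pod admission, and only if a K8sAllowedRepos Constraint matches Pods.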

# 3. Mutation applied a default
kubectl run probe --image=ghcr.io/acme/probe:1.0 --restart=Never
kubectl get pod probe -o jsonpath='{.spec.securityContext.seccompProfile.type}{"\n"}'
# -> RuntimeDefault

# 4. Audit surfaces pre-existing violations
kubectl get k8srequiredlabels -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.totalViolations}{"\n"}{end}'

# 5. CI gate runs locally
opa test policy/ -v && gator verify test/suite.yaml && conftest test rendered.yaml --policy policy/

If step 2 admits the Deployment instead of rejecting it, the most likely causes are: the Constraint is still in dryrun, the match block does not cover apps/Deployment, or the namespace is in excludedNamespaces.

Enterprise scenario

A platform team running multi-tenant clusters for ~40 product squads hit a recurring class of incident: teams shipped Deployments with no memory limits, and a single runaway pod would consume a node’s memory and trigger noisy-neighbor evictions across unrelated tenants. The wiki said “always set limits.” Nobody did.

Going straight to a hard deny was politically and operationally untenable: an audit showed roughly 60% of existing workloads lacked limits, so a same-day enforce would have blocked the next deploy for well over half the org. They ran a staged program instead.

First, they applied the gatekeeper-library container-limits template with the Constraint in dryrun, exported status.totalViolations to a dashboard, and pushed the per-squad list into each team’s channel. Second — the move that made it land — they added an Assign mutation that injected conservative default limits only when absent, so new workloads became compliant automatically while teams tuned real values:

apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: default-mem-limit
spec:
  applyTo:
    - groups: [""]
      kinds: ["Pod"]
      versions: ["v1"]
  match:
    scope: Namespaced
    excludedNamespaces: ["kube-system", "gatekeeper-system"]
  location: "spec.containers[name: *].resources.limits.memory"
  parameters:
    assign:
      value: "512Mi"
    pathTests:
      - subPath: "spec.containers[name: *].resources.limits.memory"
        condition: MustNotExist

Mutation drove the dry-run violation count down on its own as workloads rolled. Six weeks later, with the dashboard reading near-zero and the same checks already failing PRs in CI via Conftest, they promoted the Constraint to deny during a change window. There was no flag day and no spike of blocked deploys — the gate had effectively already closed.

