Argo CD gets the conference talks, but Flux quietly runs a lot of the largest GitOps platforms because its controllers compose. Each one does a narrow job and reconciles a single CRD, which means you can wire image scanning to commit automation to OCI distribution without a monolith in the middle. The cost is that you have to understand the pieces. This is a platform-engineer’s tour: image automation that writes tags back to Git, manifests shipped as OCI artifacts instead of cloned repos, and the RBAC plumbing that makes hard multi-tenancy actually hold.
I’m assuming Flux v2 (the GitOps Toolkit, apiVersion group *.toolkit.fluxcd.io), the flux CLI v2.x, and a cluster you have admin on.
1. The controllers and what each reconciles
Flux is six controllers, each owning a set of CRDs:

| Controller | Reconciles | Job |
|---|---|---|
| source-controller | GitRepository, OCIRepository, HelmRepository, Bucket | Fetch and expose artifacts |
| kustomize-controller | Kustomization | Build kustomize overlays and apply |
| helm-controller | HelmRelease | Render charts and manage releases |
| notification-controller | Provider, Alert, Receiver | Dispatch events and handle webhooks |
| image-reflector-controller | ImageRepository, ImagePolicy | Scan registries, select tags |
| image-automation-controller | ImageUpdateAutomation | Write selected tags back to Git |
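The mental model: source-controller produces artifacts, the kustomize and helm controllers consume them and apply to the cluster, and the two image controllers form a separate loop that scans registries and pushes commits. Coordination happens through the Kubernetes API (custom resources and their status) rather than controller-to-controller RPC; the one direct hop is that consumers download artifact tarballs from source-controller over HTTP. So if a controller goes down, the rest degrade gracefully instead of cascading: with source-controller down, for example, nothing new is fetched, but what is already applied keeps running.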
2. Bootstrap declaratively and structure for tenants
flux bootstrap is imperative-feeling but its job is to make Flux manage its own installation from Git. Bootstrap against GitHub:
```shell
export GITHUB_TOKEN=ghp_...
flux bootstrap github \
  --owner=acme-platform \
  --repository=fleet-infra \
  --branch=main \
  --path=clusters/prod \
  --components-extra=image-reflector-controller,image-automation-controller \
  --personal=false
```
--components-extra is the part people miss: the two image controllers, and their CRDs, are not installed by default. Without them the cluster has no ImageRepository, ImagePolicy, or ImageUpdateAutomation kinds at all, so manifests that use them fail to apply instead of updating anything.
For many tenants, separate the cluster’s own config from tenant config. A structure that scales:
```
fleet-infra/
  clusters/prod/
    flux-system/           # bootstrap-managed
    tenants.yaml           # one Kustomization per tenant, applied by Flux
  tenants/
    base/
      team-a/
        rbac.yaml          # ServiceAccount + RoleBinding
        sync.yaml          # GitRepository + Kustomization (impersonated)
      team-b/
    production/
      team-a/
        kustomization.yaml # patches base for prod
```
The platform team owns clusters/ and tenants/base/*/rbac.yaml. Tenants own their own application repos, which the per-tenant GitRepository points at. flux create tenant scaffolds the namespace, service account, and a RoleBinding to a role you provide:
```shell
flux create tenant team-a \
  --with-namespace=team-a \
  --cluster-role=tenant-app-admin \
  --export > tenants/base/team-a/rbac.yaml
```
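For orientation, the tenants.yaml entry that pulls one tenant's directory into the cluster could look roughly like this. It is a sketch under assumptions, not the author's file: the object name, the ./tenants/production/team-a path, and the reuse of the bootstrap-created flux-system GitRepository are all inferred from the layout above.

```yaml
# clusters/prod/tenants.yaml (sketch; name and path are assumptions)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: tenant-team-a          # hypothetical name
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: flux-system          # the repository object created by flux bootstrap
  path: ./tenants/production/team-a
  prune: true
```

One design note: if kustomize-controller later runs with a restricted default service account (section 5), platform-owned Kustomizations like this one need an explicit serviceAccountName with enough rights, for example kustomize-controller in flux-system.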
3. Image automation: scan, select, commit
Three objects drive automated image updates. First, scan the registry:
```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: podinfo
  namespace: team-a
spec:
  image: ghcr.io/acme-platform/podinfo
  interval: 5m
  secretRef:
    name: ghcr-auth
```
Then declare which tag wins. The policy is where correctness lives – get the ordering wrong and you ship the wrong image:
```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: podinfo
  namespace: team-a
spec:
  imageRepositoryRef:
    name: podinfo
  filterTags:
    pattern: '^main-[a-f0-9]+-(?P<ts>[0-9]+)$'
    extract: '$ts'
  policy:
    numerical:
      order: asc
```
This filters to main-<sha>-<timestamp> tags, extracts the timestamp, and picks the numerically highest. For real semver releases, use policy.semver with a range like >=1.0.0 instead – never sort semver lexically. Mark the deployment field Flux should rewrite with a setter marker:
```yaml
spec:
  containers:
    - name: podinfo
      image: ghcr.io/acme-platform/podinfo:main-abc123-1718000000 # {"$imagepolicy": "team-a:podinfo"}
```
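For the semver case mentioned above, the policy could look like the sketch below. The object name podinfo-releases is hypothetical, and the range is the one quoted in the text; the policy assumes the registry actually carries semver-formatted tags.

```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: podinfo-releases       # hypothetical name
  namespace: team-a
spec:
  imageRepositoryRef:
    name: podinfo
  policy:
    semver:
      range: ">=1.0.0"         # selects the highest version satisfying the range
```

A setter marker pointing at this policy would then read `{"$imagepolicy": "team-a:podinfo-releases"}`.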
Finally, the automation that commits the change back:
```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2   # .Changed in the template below needs v1beta2
kind: ImageUpdateAutomation
metadata:
  name: team-a-images
  namespace: team-a
spec:
  interval: 30m
  sourceRef:
    kind: GitRepository
    name: team-a
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: fluxcdbot
        email: fluxcdbot@acme.example
      messageTemplate: |
        Automated image update
        {{ range .Changed.Changes }}{{ .OldValue }} -> {{ .NewValue }}
        {{ end }}
    push:
      branch: flux-image-updates
  update:
    path: ./apps/team-a
    strategy: Setters
```
Pushing to a dedicated flux-image-updates branch instead of main is the pattern I push teams toward: it forces image bumps through a PR with branch protection and CODEOWNERS, so a registry push can’t silently mutate production. Flux only pushes the branch; opening the PR is left to your Git host’s automation (a CI workflow triggered by pushes to that branch, for example). The GitRepository your Kustomization reconciles still tracks main, so nothing deploys until the PR merges.
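The GitRepository that the automation references has to exist in the same namespace and carry credentials that can push. A minimal sketch, assuming an SSH deploy key; the repository URL and the Secret name are hypothetical placeholders, not values from this setup:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: team-a                 # matches sourceRef.name in the ImageUpdateAutomation
  namespace: team-a
spec:
  interval: 5m
  url: ssh://git@github.com/acme-platform/team-a-apps   # hypothetical repository
  ref:
    branch: main               # deployments track main, not flux-image-updates
  secretRef:
    name: team-a-git-deploy-key  # hypothetical Secret; the key needs write access
```

A read-only deploy key is enough for plain reconciliation, so write access is worth granting only to sources that image automation actually uses.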
4. Manifests as OCI artifacts
Cloning Git on every reconcile across hundreds of tenants is load you don’t need, and it couples deploys to your Git host’s availability. Flux can treat any OCI registry as a source. Push your built manifests as an artifact in CI:
```shell
flux push artifact \
  oci://ghcr.io/acme-platform/manifests/team-a:$(git rev-parse --short HEAD) \
  --path=./deploy \
  --source="$(git config --get remote.origin.url)" \
  --revision="$(git rev-parse HEAD)"
flux tag artifact \
  oci://ghcr.io/acme-platform/manifests/team-a:$(git rev-parse --short HEAD) \
  --tag=latest
```
Consume it with OCIRepository instead of GitRepository:
```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: OCIRepository
metadata:
  name: team-a
  namespace: team-a
spec:
  interval: 10m
  url: oci://ghcr.io/acme-platform/manifests/team-a
  ref:
    tag: latest   # CI pushes <short-sha> and latest; a semver range would match neither
  secretRef:
    name: ghcr-auth
  verify:
    provider: cosign
    secretRef:
      name: cosign-pub
```
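The verify block is the reason to bother with OCI even if you keep Git: Flux refuses to reconcile an artifact whose cosign signature doesn’t validate. With a key pair, secretRef points at the Secret holding the public key, as above; with keyless signing in CI, you drop secretRef and match on the signer’s OIDC identity via verify.matchOIDCIdentity instead. Either way you get a supply-chain gate where an unsigned or tampered artifact never reaches the cluster. A GitRepository can verify signed commits, but it cannot give you a signature over the exact built manifests that CI produced.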
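For keyless signing, the verify block might look like the fragment below. This is a sketch that assumes the artifacts are signed from GitHub Actions workflows in the acme-platform organization; the issuer and subject patterns are illustrative and need to match whatever identity your CI actually signs with.

```yaml
# Fragment of an OCIRepository spec (keyless variant; patterns are assumptions)
spec:
  verify:
    provider: cosign
    # No secretRef: verification uses the signer's OIDC identity instead of a stored key.
    matchOIDCIdentity:
      - issuer: '^https://token\.actions\.githubusercontent\.com$'
        subject: '^https://github\.com/acme-platform/.+$'
```

Both fields are regular expressions, so anchoring them keeps a look-alike issuer or organization from satisfying the match.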
5. Hard multi-tenancy with impersonation
Soft multi-tenancy (namespaces, NetworkPolicy) is not enough when tenants can author Kustomizations. By default kustomize-controller applies with its own powerful service account, so a tenant manifest could create a ClusterRoleBinding and escalate. Hard multi-tenancy closes this by making Flux impersonate a per-tenant service account that only has namespace-scoped rights:
```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-a
  namespace: team-a
spec:
  serviceAccountName: team-a # impersonate this SA
  sourceRef:
    kind: OCIRepository
    name: team-a
  path: ./
  prune: true
  interval: 10m
  targetNamespace: team-a
```
spec.serviceAccountName is the linchpin. kustomize-controller applies the manifests as team-a, so anything the tenant tries that exceeds that SA’s RBAC fails at apply time. Back this up by setting --default-service-account on the controller, so a Kustomization that omits serviceAccountName falls back to a powerless SA rather than the controller’s own identity. flux bootstrap has no flag for passing extra controller arguments; the supported route is a kustomize patch in the bootstrap-managed clusters/prod/flux-system/kustomization.yaml that appends the argument to the Deployment’s container args:
```
# appended to the kustomize-controller container args
--default-service-account=fluxcd-noop
```
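Controller flags like this are typically set by patching the controller Deployments in the bootstrap-managed kustomization.yaml and committing the change. A minimal sketch, assuming the default bootstrap layout under clusters/prod/flux-system/ and the fluxcd-noop name used above:

```yaml
# clusters/prod/flux-system/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # Fall back to a powerless SA when a Kustomization omits serviceAccountName.
  - target:
      kind: Deployment
      name: kustomize-controller
    patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --default-service-account=fluxcd-noop
  # Keep the platform's own root Kustomization working under that default.
  - target:
      kind: Kustomization
      name: flux-system
    patch: |
      - op: add
        path: /spec/serviceAccountName
        value: kustomize-controller
```

The second patch matters: without it, the flux-system Kustomization itself would fall back to the powerless account and stop reconciling the cluster's own configuration.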
The tenant’s RoleBinding must stay namespace-scoped:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-reconciler
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tenant-app-admin # a role WITHOUT rbac/clusterrole verbs
subjects:
  - kind: ServiceAccount
    name: team-a
    namespace: team-a
```
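What tenant-app-admin contains is up to the platform team. As an illustration only, a role along these lines grants ordinary workload management and deliberately omits the rbac.authorization.k8s.io API group; the resource list is an assumption and will differ per platform:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tenant-app-admin
rules:
  # Core workload objects (illustrative list, adjust to what tenants may deploy)
  - apiGroups: [""]
    resources: ["configmaps", "secrets", "services"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # No rule for rbac.authorization.k8s.io: tenants cannot create roles or bindings.
```

Because it is attached through a RoleBinding rather than a ClusterRoleBinding, the ClusterRole's rules apply only inside team-a even though the role object itself is cluster-scoped.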
6. Block cross-namespace references and lock sources
Impersonation stops privilege escalation but not data exfiltration. A tenant could point a Kustomization in their namespace at another tenant’s GitRepository via a cross-namespace sourceRef. Two controls close that gap: a controller flag and an admission policy. First, disable cross-namespace references entirely by adding this argument to the kustomize, helm, notification, image-reflector, and image-automation controllers. flux bootstrap has no flag for it, so it goes in as a patch in flux-system/kustomization.yaml:
```
--no-cross-namespace-refs=true
```
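One way to apply the flag to several controllers at once is a single patch whose target name is a regular expression. A sketch of the entry, meant to sit in the patches list of the bootstrap-managed flux-system/kustomization.yaml (the file location assumes the default bootstrap layout):

```yaml
# Fragment: one entry for the patches list in flux-system/kustomization.yaml
patches:
  - target:
      kind: Deployment
      # kustomize matches target names as a regex, so one patch covers all five
      name: "(kustomize-controller|helm-controller|notification-controller|image-reflector-controller|image-automation-controller)"
    patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --no-cross-namespace-refs=true
```

Trimming the regex is the way to exempt a controller, for instance if the image controllers are not installed on a given cluster.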
With this set, a sourceRef may only target objects in the same namespace – a tenant physically cannot reference another tenant’s source. Then lock down which URLs sources may use with Kyverno, so a tenant can’t repoint their own OCIRepository at an arbitrary registry:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-flux-source-urls
spec:
  validationFailureAction: Enforce
  rules:
    - name: oci-url-allowlist
      match:
        any:
          - resources:
              kinds: ["OCIRepository"]
      validate:
        message: "OCIRepository url must be under the platform registry"
        pattern:
          spec:
            url: "oci://ghcr.io/acme-platform/manifests/*"
```
Together these three controls – impersonation, no cross-namespace refs, and a source-URL allowlist – are what I mean by hard multi-tenancy. Any one alone leaves a gap.
7. Progressive delivery with Flagger
Flux applies the desired state; it does not do canaries. Flagger fills that gap and reads the same Deployment Flux reconciles, so the GitOps loop stays the source of truth while Flagger owns the rollout:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: team-a
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
```
When an image-update PR merges and Flux applies the new tag, the Deployment spec changes. Flagger detects the change, shifts traffic to the canary in 10% steps up to 50%, checks the success-rate metric every minute, and rolls back automatically once five checks have failed the 99% floor. If the analysis holds through the final step, it promotes the new version. Apart from the PR review, the whole chain – registry push to Git commit to apply to canary – runs without a human in the path, and every step is observable and reversible.
8. Drift detection, health, and alerts
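Kustomizations correct drift by default: kustomize-controller re-applies on every interval, reverting a manual kubectl edit. HelmRelease does not do this out of the box; helm-controller only reverts in-cluster drift when spec.driftDetection.mode is set to enabled. Gate “done” on real health, not just “applied,” with health checks:
```yaml
spec:
  timeout: 5m
  # wait: true would health-check every applied object and ignore healthChecks
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: podinfo
      namespace: team-a
```
With healthChecks (or wait: true, which extends the check to everything the Kustomization applied), the Kustomization does not become Ready until the objects report healthy or the timeout expires, so a bad rollout surfaces as a failed reconciliation instead of a green-but-broken deploy. Route those failures to Slack with Provider and Alert: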
```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: platform-alerts
  secretRef:
    name: slack-url
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: tenant-failures
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: error
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: HelmRelease
      name: '*'
```
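The Provider reads the webhook URL from the referenced Secret under the address key. A sketch of that Secret, with a placeholder URL; in a real repository it would be encrypted (SOPS, for instance) or created out of band rather than committed in plain text:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: slack-url              # matches Provider.spec.secretRef.name
  namespace: flux-system
stringData:
  # Placeholder, not a real webhook; substitute your Slack incoming-webhook URL
  address: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK
```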
Verify
Confirm the pipeline end to end:
```shell
# Controllers and CRDs healthy
flux check

# Sources are pulling artifacts
flux get sources oci --all-namespaces
flux get sources git --all-namespaces

# Image scan picked the expected tag
flux get image policy podinfo -n team-a
# LATEST IMAGE should show ghcr.io/.../podinfo:main-...

# Automation committed back to Git
flux get image update team-a-images -n team-a

# Impersonation is in effect (should be the tenant SA, not flux-system)
kubectl get kustomization team-a -n team-a -o jsonpath='{.spec.serviceAccountName}'

# Cross-namespace refs are blocked
kubectl get deploy kustomize-controller -n flux-system \
  -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep cross-namespace

# Force a reconcile and watch health gating
flux reconcile kustomization team-a -n team-a --with-source
```
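A correctly wired tenant shows Ready=True with a recent Applied revision, and the ImagePolicy reports a LATEST IMAGE. A cross-namespace sourceRef fails reconciliation in the controller with an access-denied error, and an OCIRepository URL outside the allowlist is rejected by Kyverno at admission.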
Enterprise scenario
A fintech platform team ran 80+ product squads on shared clusters under PCI scope. The audit finding that triggered the work: a squad’s Kustomization had created a ClusterRoleBinding granting cluster-admin, because kustomize-controller applied with its own identity and nothing stopped it. Worse, two squads were reconciling from each other’s Git repos via cross-namespace sourceRef, so one team’s broken manifest had taken down another’s service.
They couldn’t move squads to separate clusters – the per-cluster control-plane and node overhead was rejected on cost. So they hardened the shared model: impersonation on every Kustomization via --default-service-account=fluxcd-restricted, --no-cross-namespace-refs=true on the kustomize, helm, image-automation, and notification controllers, and a Kyverno policy pinning each OCIRepository URL to that squad’s own path. The single highest-leverage change was the default service account, because it closed the escalation path even for Kustomizations that omitted serviceAccountName:
```
# appended to the kustomize-controller container args by a patch in
# clusters/prod/flux-system/kustomization.yaml (flux bootstrap has no flag for this)
--default-service-account=fluxcd-restricted
--no-cross-namespace-refs=true
```
fluxcd-restricted was a ServiceAccount with no RoleBindings at all, so any Kustomization that forgot to impersonate a real tenant SA could create exactly nothing. When they re-ran the pen test, the escalation path was closed and the cross-tenant blast radius was gone – and the cluster bill didn’t move.