GitOps at Scale with Argo CD: App-of-Apps, ApplicationSets & Progressive Delivery

Argo CD turns a Git repository into the single source of truth for what runs in your clusters. That promise is easy to demo with one app and one cluster, and surprisingly hard to keep once you have dozens of teams, several environments, and a handful of regional clusters. This article walks through the patterns that survive that growth: a repository layout that scales, the app-of-apps bootstrap, ApplicationSets for fan-out, ordered syncs, secret handling, and progressive delivery with Argo Rollouts.

1. GitOps principles and a repository layout that scales

GitOps rests on a few non-negotiable rules. The desired state lives in Git. A controller continuously reconciles actual state toward that desired state. Changes happen through pull requests, not kubectl apply from a laptop. Drift is detected and either reported or corrected automatically.

The hardest design decision is repository structure. Two anti-patterns dominate: one giant repo where every team blocks on every other team’s reviews, and per-environment branches where promotion becomes a merge nightmare and main no longer reflects production. Use directories, not branches, for environments. Promotion is then a small, reviewable diff that copies a tested image tag from one path to another.
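As a rough sketch of what "promotion is a small diff" can look like, the script below copies the image tag from one overlay directory into another. It assumes each overlay holds a kustomization.yaml with a single newTag: line; the promote_tag function, the 1.7.3 starting tag, and the throwaway directory are illustrative assumptions, not part of Argo CD or Kustomize.

```shell
#!/usr/bin/env bash
# Sketch: promote a tested image tag from one overlay directory to another.
# Assumption: each overlay's kustomization.yaml has a single "newTag:" line.
set -eu

promote_tag() {
  local app_dir="$1" from_env="$2" to_env="$3"
  local src="$app_dir/overlays/$from_env/kustomization.yaml"
  local dst="$app_dir/overlays/$to_env/kustomization.yaml"
  local tag
  # Read the tag that was tested in the source environment.
  tag="$(sed -n 's/^[[:space:]]*newTag:[[:space:]]*//p' "$src" | head -n 1)"
  [ -n "$tag" ] || { echo "no newTag found in $src" >&2; return 1; }
  # Rewrite only the newTag line in the destination overlay.
  sed "s/^\([[:space:]]*newTag:\).*/\1 $tag/" "$dst" > "$dst.tmp"
  mv "$dst.tmp" "$dst"
  echo "$tag"
}

# Demo in a throwaway directory that mimics apps-gitops/checkout.
demo="$(mktemp -d)"
mkdir -p "$demo/checkout/overlays/staging" "$demo/checkout/overlays/prod-eastus"
printf 'images:\n  - name: ghcr.io/acme/checkout\n    newTag: 1.8.0\n' \
  > "$demo/checkout/overlays/staging/kustomization.yaml"
printf 'images:\n  - name: ghcr.io/acme/checkout\n    newTag: 1.7.3\n' \
  > "$demo/checkout/overlays/prod-eastus/kustomization.yaml"

promote_tag "$demo/checkout" staging prod-eastus
```

Run inside a real checkout, the only change a reviewer sees is the one newTag line in the production overlay, which is the property the directory-per-environment layout is meant to give you.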

A layout that has held up well across many teams:

platform-gitops/                 # cluster-scoped, owned by platform team
  bootstrap/
    root-app.yaml                # the one app you apply by hand
  addons/                        # ingress, cert-manager, monitoring, ESO
    cert-manager/
    ingress-nginx/
  appsets/                       # ApplicationSets that fan apps out
    tenants.yaml
  clusters/
    prod-eastus/values.yaml      # per-cluster config (region, sizing)
    prod-westeu/values.yaml
    staging/values.yaml

apps-gitops/                     # namespace-scoped, owned by app teams
  checkout/
    base/                        # Kustomize base or Helm chart
    overlays/
      staging/
      prod-eastus/
      prod-westeu/

Keep platform concerns and application concerns in separate repositories with separate CODEOWNERS. The platform team should not gate every app deploy, and app teams should not be able to edit cluster-wide RBAC.

2. Bootstrap with the app-of-apps pattern

The app-of-apps pattern means you apply exactly one Application by hand. That root Application points at a directory of child Applications, and Argo CD reconciles them recursively. After bootstrap, everything (including Argo CD’s own configuration) is managed by Git.

Install Argo CD first, then apply the root app:

kubectl create namespace argocd
kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Wait for the core controllers to be ready
kubectl rollout status deploy/argocd-server -n argocd
kubectl rollout status statefulset/argocd-application-controller -n argocd

The root Application is the only manifest you kubectl apply directly. It watches the bootstrap/ directory and creates everything else:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/acme/platform-gitops.git
    targetRevision: main
    path: bootstrap
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

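The resources-finalizer.argocd.argoproj.io finalizer matters: without it, deleting the root Application with kubectl removes only the Application object and leaves everything it created running. With it, deleting root cascades to the resources it manages, including the child Applications. Each child Application needs the same finalizer if its workloads should be cleaned up too; otherwise the child object is deleted and its resources stay behind.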

3. Fan apps out with ApplicationSets

App-of-apps gets clumsy when you need the same app on ten clusters or one app per team. The ApplicationSet controller (bundled with Argo CD) generates Applications from a template plus a generator. The most useful generators are git (directories or files in a repo), cluster (registered Argo CD clusters by label), and matrix (the cross-product of two generators).

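This ApplicationSet deploys each app’s production overlay onto the cluster of the same name. It combines a cluster generator with a git directory generator inside a matrix: the first selects production clusters, and the second looks for overlay directories named after each selected cluster. A plain cross-product of prod-* overlays and prod clusters would instead apply every regional overlay to every cluster.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: tenant-apps
  namespace: argocd
spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
    - matrix:
        generators:
          # First: every registered cluster labeled env=prod
          - clusters:
              selector:
                matchLabels:
                  env: prod
          # Second: overlays named after that cluster, e.g. checkout/overlays/prod-eastus
          - git:
              repoURL: https://github.com/acme/apps-gitops.git
              revision: main
              directories:
                - path: "*/overlays/{{.name}}"
  template:
    metadata:
      # First path segment is the app directory, giving e.g. checkout-prod-eastus
      name: "{{index .path.segments 0}}-{{.name}}"
    spec:
      project: tenants
      source:
        repoURL: https://github.com/acme/apps-gitops.git
        targetRevision: main
        path: "{{.path.path}}"
      destination:
        server: "{{.server}}"
        namespace: "{{index .path.segments 0}}"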
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
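Before merging a generator change, it can help to preview locally which directories an overlay pattern like */overlays/prod-* would match. The sketch below walks a checkout with an ordinary shell glob and prints, for each match, the values that correspond to the git generator's .path.path, the first element of .path.segments, and .path.basename. The preview_overlays function and the ledger app are hypothetical, and a shell glob only approximates the generator's own matching.

```shell
#!/usr/bin/env bash
# Sketch: preview what a git directory generator pattern would match locally.
# Assumption: the repo is checked out at the path passed in.
set -eu

preview_overlays() {
  local repo_root="$1" dir rel
  # Same shape as the generator pattern "*/overlays/prod-*".
  for dir in "$repo_root"/*/overlays/prod-*/; do
    [ -d "$dir" ] || continue          # the glob matched nothing
    rel="${dir#"$repo_root"/}"         # path relative to the repo root
    rel="${rel%/}"
    # first_segment is the app directory, basename is the overlay directory
    echo "path=$rel first_segment=${rel%%/*} basename=${rel##*/}"
  done
}

# Throwaway layout: two apps, with a staging overlay that must not match.
demo="$(mktemp -d)"
mkdir -p "$demo/checkout/overlays/prod-eastus" \
         "$demo/checkout/overlays/prod-westeu" \
         "$demo/checkout/overlays/staging" \
         "$demo/ledger/overlays/prod-eastus"
preview_overlays "$demo"
```

Note that the basename is the overlay name (prod-eastus), not the app name, so an Application name or namespace meant to carry the app should be built from the first path segment.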

Register clusters so the clusters generator can find them, and label them so your selector works:

# Add a remote cluster from a kubeconfig context and label it at registration,
# so ApplicationSet selectors can target exactly this cluster
argocd cluster add prod-eastus-context --name prod-eastus \
  --label env=prod --label region=eastus

# Avoid labeling by the secret-type selector alone: it matches every
# registered cluster secret, so staging would be labeled env=prod too

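Set the ApplicationSet’s deletion behavior deliberately. By default, when a generator stops emitting an element, the controller deletes the generated Application, and that deletion cascades to the Application’s resources. In production, set spec.syncPolicy.applicationsSync to create-update so the controller can create and update Applications but never delete them, and set preserveResourcesOnDeletion: true so workloads survive even if an Application is removed. Keep both until you trust the generators, so a bad selector edit cannot wipe live workloads. Roll out generator changes behind a PR review like any other change.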

4. Order dependent resources with sync waves and hooks

Argo CD applies resources in waves. A CRD must exist before the custom resource that uses it; a database migration must finish before the new app version starts. Annotate resources with argocd.argoproj.io/sync-wave (an integer, default 0); lower waves apply first, and Argo CD waits for each wave to become healthy before starting the next.

# CRDs and namespaces go early
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-5"
---
# The app that depends on them goes later
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"
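To review wave assignments before a sync, a small script can list manifests in ascending wave order. The sketch below assumes one resource per file and treats a missing annotation as wave 0; the wave_order function and the sample file names are hypothetical. It models only the wave number, whereas Argo CD also orders by sync phase, kind, and name.

```shell
#!/usr/bin/env bash
# Sketch: list manifest files in the order their sync waves would apply.
# Assumption: one resource per file; files without the annotation are wave 0.
set -eu

wave_order() {
  local dir="$1" f wave
  for f in "$dir"/*.yaml; do
    [ -f "$f" ] || continue
    # Extract the integer from a line like: argocd.argoproj.io/sync-wave: "-5"
    wave="$(sed -n 's|^[[:space:]]*argocd\.argoproj\.io/sync-wave:[[:space:]]*"\{0,1\}\(-\{0,1\}[0-9][0-9]*\).*|\1|p' "$f" | head -n 1)"
    echo "${wave:-0} $(basename "$f")"
  done | sort -k1,1n -k2,2
}

# Throwaway manifests: an early CRD, an unannotated Deployment, a late test.
demo="$(mktemp -d)"
printf 'metadata:\n  annotations:\n    argocd.argoproj.io/sync-wave: "-5"\n' > "$demo/crd.yaml"
printf 'metadata:\n  name: checkout\n' > "$demo/deployment.yaml"
printf 'metadata:\n  annotations:\n    argocd.argoproj.io/sync-wave: "5"\n' > "$demo/smoke-test.yaml"
wave_order "$demo"
```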

Resource hooks run logic at points in the sync. A PreSync Job is the right place for schema migrations:

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: ghcr.io/acme/checkout:1.8.0
          command: ["/app/migrate", "up"]

PreSync runs before the main sync, PostSync after all resources are healthy, and SyncFail only when a sync fails. The hook-delete-policy of HookSucceeded cleans up the Job once it passes so it does not accumulate. If a PreSync migration fails, the sync stops and the new version never rolls out, which is exactly what you want.

5. Manage secrets in GitOps

Plaintext secrets cannot live in Git. Two mature approaches solve this without breaking the “Git is the source of truth” model.

Sealed Secrets encrypts a Secret with a controller-held key. The encrypted SealedSecret is safe to commit; only the in-cluster controller can decrypt it.

# Encrypt locally, commit the output
kubectl create secret generic api-creds \
  --from-literal=token=s3cr3t --dry-run=client -o yaml \
  | kubeseal --controller-namespace kube-system --format yaml \
  > sealed-api-creds.yaml

External Secrets Operator (ESO) keeps secret values in a real secret manager (Azure Key Vault, AWS Secrets Manager, GCP Secret Manager, Vault) and syncs them into Kubernetes Secrets. Only a reference lives in Git:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-creds
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-kv
    kind: SecretStore
  target:
    name: api-creds
  data:
    - secretKey: token
      remoteRef:
        key: checkout-api-token

Concern           | Sealed Secrets     | External Secrets Operator
Where values live | Encrypted, in Git  | External secret manager
Rotation          | Re-seal and commit | Change in the manager; auto-syncs
Audit trail       | Git history        | Manager’s audit log
Extra dependency  | One controller     | Operator plus a cloud secret store

For multi-cluster platforms, ESO usually wins: rotating a credential is a change in one secret store, not a commit fanned across every cluster overlay. Reserve Sealed Secrets for bootstrap-time secrets that must exist before ESO itself is running.

6. Progressive delivery with Argo Rollouts

A plain Kubernetes Deployment offers only the RollingUpdate and Recreate strategies. Argo Rollouts replaces the Deployment with a Rollout resource that understands canary and blue-green strategies, and gates promotion on metric AnalysisRuns. It integrates with Argo CD: the Rollout shows up as just another resource Argo CD reconciles.

A canary that shifts traffic in steps and runs analysis between them:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  replicas: 6
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout:1.8.0
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100

The AnalysisTemplate queries Prometheus and fails the rollout if the success rate drops below threshold, which triggers an automatic rollback to the stable ReplicaSet:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      successCondition: result[0] >= 0.99
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="checkout",code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{app="checkout"}[2m]))
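To make the interplay of successCondition and failureLimit concrete, the sketch below replays a list of measured success rates against a threshold. It is a simplified model, not the Argo Rollouts implementation: it ignores inconclusive and errored measurements, and the analysis_verdict function and sample values are hypothetical. The rule it encodes is that the run fails once the number of failed measurements exceeds failureLimit.

```shell
#!/usr/bin/env bash
# Sketch: how successCondition and failureLimit combine into a verdict.
# Assumption: each argument after the first two is one measured success rate.
set -eu

analysis_verdict() {
  local threshold="$1" failure_limit="$2" failures=0 m
  shift 2
  for m in "$@"; do
    # A measurement fails when it does not satisfy "result >= threshold".
    if ! awk -v v="$m" -v t="$threshold" 'BEGIN { exit !(v >= t) }'; then
      failures=$((failures + 1))
    fi
    # The run fails once failures exceed failureLimit.
    if [ "$failures" -gt "$failure_limit" ]; then
      echo "Failed after $failures failed measurements"
      return 0
    fi
  done
  echo "Successful with $failures failed measurements"
}

# threshold 0.99 and failureLimit 1, as in the template above
analysis_verdict 0.99 1 0.995 0.981 0.996 0.999 0.997
analysis_verdict 0.99 1 0.995 0.981 0.975 0.999 0.997
```

With failureLimit: 1, a single dip below 99% is tolerated and the second one fails the analysis, so the first call reports success and the second reports failure at its third measurement.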

Blue-green is the other strategy: a blueGreen block with activeService and previewService brings the new version up in full alongside the old one, lets you test it on the preview service, then flips the active service on promotion. Drive promotions with the plugin rather than editing live objects:

kubectl argo rollouts get rollout checkout --watch
kubectl argo rollouts promote checkout          # advance to the next step
kubectl argo rollouts abort checkout            # roll back to stable

7. Drift detection, self-healing, and safe pruning

With selfHeal: true, Argo CD reverts any manual change that diverges from Git within its reconcile interval. That is the behavior you want in production: a kubectl edit hotfix gets undone, forcing the fix through a PR. With prune: true, resources removed from Git are deleted from the cluster.

Pruning is the dangerous half. Guard it with two mechanisms. Mark resources you never want auto-deleted (a PersistentVolumeClaim, a namespace) with the Prune=false annotation. And set PruneLast=true on the Application so deletions run only after every other resource has synced and is healthy, which means a sync that fails partway never reaches the prune step. Recent Argo CD releases also accept Prune=confirm on a resource, which holds the deletion until someone approves it; check that your version supports it before relying on it.

# Never let Argo CD prune this resource
metadata:
  annotations:
    argocd.argoproj.io/sync-options: Prune=false
---
# Prune only after all other resources have synced and are healthy (per-app)
spec:
  syncPolicy:
    syncOptions:
      - PruneLast=true

Treat selfHeal and prune as production discipline, not just features. The combination guarantees that the cluster matches Git and that the only way to change the cluster is to change Git. That is the entire point of GitOps; turning them off quietly reintroduces snowflake drift.

8. Disaster recovery: rebuild a cluster from Git

If GitOps is real, a destroyed cluster is recoverable by pointing Argo CD at the same repo. The recovery runbook is short because the heavy lifting is declarative.

# 1. Provision a fresh cluster (Terraform/Bicep), get a kubeconfig.
# 2. Install Argo CD.
kubectl create namespace argocd
kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# 3. Re-add cluster credentials and labels (see section 3 above).
# 4. Apply the single root Application; everything else follows.
kubectl apply -f bootstrap/root-app.yaml

The one thing Git cannot rebuild is decryption material. Back up the Sealed Secrets controller key and any ESO authentication out of band, because without them the committed SealedSecrets are inert and ESO cannot reach the secret store:

# Back up the Sealed Secrets private key (store in a vault, NOT Git)
kubectl get secret -n kube-system \
  -l sealedsecrets.bitnami.com/sealed-secrets-key \
  -o yaml > sealed-secrets-key-backup.yaml

Enterprise scenario

A payments platform ran ~40 services across four prod clusters via one ApplicationSet using a git directory generator on */overlays/prod-*. A team renamed an overlay directory in a routine PR. The git generator stopped emitting the old element, the ApplicationSet controller deleted the corresponding Application, and because generated Applications carry the resources finalizer by default, that deletion cascaded to the Application’s workloads across all four clusters within one reconcile loop. A live payment service went down before anyone connected the directory rename to the outage.

Two root causes: the ApplicationSet defaulted to deleting Applications when a generator element disappears, and nothing distinguished “intentional removal” from “rename.” The fix was to stop treating generator output as authoritative for deletion. They set the preserve policy so that a vanished element can no longer take running workloads with it; removing them now requires an explicit, reviewed delete:

spec:
  syncPolicy:
    preserveResourcesOnDeletion: true   # deleting a generated Application keeps its workloads running
  # plus a guard on the apps the set generates
  template:
    metadata:
      annotations:
        argocd.argoproj.io/sync-options: Delete=confirm

They also added a CI check that diffs the rendered Application list (argocd appset generate against the PR branch) and fails the build if the count drops, so any deletion shows up in review as an explicit number. The deeper lesson: in a fan-out model, a one-line path edit has cluster-wide blast radius. Generators are convenient, but their output must never be the only thing standing between a typo and a production deletion.
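A check along these lines might look like the sketch below. It assumes CI has already rendered the Application names for the base branch and the PR branch into two files, one name per line; how those files are produced is left to your tooling. The check_app_count function and the sample names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: fail a CI job when a PR shrinks the set of generated Applications.
# Assumption: each input file holds one Application name per line, rendered
# for the base branch and the PR branch by whatever tooling you already use.
set -eu

check_app_count() {
  local base_list="$1" pr_list="$2" base pr
  base="$(grep -c . "$base_list" || true)"
  pr="$(grep -c . "$pr_list" || true)"
  if [ "$pr" -lt "$base" ]; then
    echo "FAIL: generated Applications dropped from $base to $pr"
    # Names present on the base branch but missing from the PR branch.
    comm -23 <(sort "$base_list") <(sort "$pr_list") | sed 's/^/  removed: /'
    return 1
  fi
  echo "OK: $pr Applications (was $base)"
}

# Sample lists standing in for rendered generator output.
demo="$(mktemp -d)"
printf 'checkout-prod-eastus\ncheckout-prod-westeu\nledger-prod-eastus\n' > "$demo/base.txt"
printf 'checkout-prod-eastus\nledger-prod-eastus\n' > "$demo/pr.txt"
check_app_count "$demo/base.txt" "$demo/pr.txt" || echo "build would fail here"
```

A count comparison misses a rename that removes one Application and adds another in the same PR. A stricter variant would fail whenever the removed list is non-empty.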

Verify

After bootstrap or a recovery, confirm the platform actually converged.

# Every Application should report Synced + Healthy
argocd app list -o wide

# Inspect the root app's resource tree and drift
argocd app get root --refresh

# ApplicationSets generated the expected Applications
kubectl get applicationsets -n argocd
kubectl get applications -n argocd

# A canary is progressing as designed
kubectl argo rollouts status checkout

# Secrets actually materialized from references
kubectl get externalsecrets -A
kubectl get sealedsecrets -A

A converged platform shows every Application Synced/Healthy, no OutOfSync resources after a --refresh, Rollouts in a Healthy or Paused (mid-canary) phase, and target Secrets present where ExternalSecrets expect them.
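For an unattended check, the same criteria can be reduced to a filter over rows of name, sync status, and health. The sketch below is one way to do that, assuming the rows come from kubectl custom columns over the Application status fields, as shown in the comment. The unconverged function and the sample rows are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: flag Applications that are not Synced + Healthy.
# Input rows are "NAME SYNC HEALTH", for example from:
#   kubectl get applications -n argocd --no-headers \
#     -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
set -eu

unconverged() {
  # Print every row that is not Synced/Healthy; exit non-zero if any exist.
  awk '$2 != "Synced" || $3 != "Healthy" { print $1 ": " $2 "/" $3; bad = 1 }
       END { exit bad }'
}

# Sample rows standing in for real cluster output.
printf '%s\n' \
  'root Synced Healthy' \
  'checkout-prod-eastus OutOfSync Healthy' \
  'checkout-prod-westeu Synced Degraded' \
  | unconverged || echo "platform has not converged"
```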

Pitfalls and next steps

The failures that bite teams are rarely Argo CD bugs. They are process gaps: an aggressive ApplicationSet selector that deletes live apps, prune enabled on a namespace holding a database, or a DR plan that assumes Git holds the secrets it cannot decrypt. Rehearse the destroy-and-rebuild path on a disposable cluster before you need it, and keep ApplicationSet preserve policies in place until the generators are proven.

From here, harden the platform with Argo CD Projects to restrict which repos, clusters, and namespaces each team can target; add notifications on sync failures and degraded health; and wire image updates through a controller or CI commit so promotions become reviewable diffs rather than manual tag edits. Pair that with Rollouts analysis backed by real SLO queries, and you have a multi-cluster delivery system where every change is auditable, reversible, and reconstructable from Git.
