Helm for Complex Releases: Umbrella Charts, Library Charts, Lifecycle Hooks, and Safe Rollbacks

A single-service Helm chart is a solved problem. The pain begins when one helm upgrade has to roll out an API, a worker, a cache, a database migration, and a couple of Bitnami subcharts as one atomic unit — and roll all of it back cleanly when the migration fails at 2 a.m. This article is about that situation: composing many charts into one release, scoping values so subcharts get exactly what they need and nothing they shouldn’t, sequencing side effects with hooks, and making upgrades that either fully succeed or leave no trace.

This is the advanced companion to chart authoring. It assumes you already write _helpers.tpl, ship a values.schema.json, and lint in CI. Here we deal with the release as a whole.

1. Umbrella chart anatomy: dependencies, aliases, conditions, tags

An umbrella (parent/wrapper) chart’s job is to pull other charts together. It ships almost no templates of its own — its value is in Chart.yaml. The dependency block is where composition happens, and four fields carry the weight: alias, condition, tags, and import-values.

# platform/Chart.yaml
apiVersion: v2
name: platform
version: 2.4.0
dependencies:
  - name: api
    version: "1.8.0"
    repository: "oci://ghcr.io/acme/charts"
  - name: worker
    version: "1.8.0"
    repository: "oci://ghcr.io/acme/charts"
  - name: redis
    version: "20.1.x"
    repository: "oci://registry-1.docker.io/bitnamicharts"
    condition: redis.enabled
    tags:
      - cache
  - name: postgresql
    version: "16.2.x"
    repository: "oci://registry-1.docker.io/bitnamicharts"
    alias: primarydb            # mount this dependency under a custom key
    condition: primarydb.enabled
    tags:
      - database

alias is the one people miss. Without it, a subchart’s values live under its chart name (postgresql:). With alias: primarydb, the same chart reads its overrides from .Values.primarydb, and you can declare the same chart twice under different aliases to run two PostgreSQL instances in one release. condition toggles a subchart on a boolean value and silently does nothing if the path is absent — which is why you always default it in values.yaml. tags toggle groups of subcharts at once (tags: { cache: true, database: true } in the parent values).
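As a sketch of the two-instance case, the dependency list below declares the PostgreSQL chart twice. The analyticsdb alias and its condition path are hypothetical names chosen for illustration; everything else mirrors the Chart.yaml above.

```yaml
# platform/Chart.yaml (excerpt); analyticsdb is a hypothetical second instance
dependencies:
  - name: postgresql
    version: "16.2.x"
    repository: "oci://registry-1.docker.io/bitnamicharts"
    alias: primarydb                 # values live under .Values.primarydb
    condition: primarydb.enabled
  - name: postgresql
    version: "16.2.x"
    repository: "oci://registry-1.docker.io/bitnamicharts"
    alias: analyticsdb               # same chart, separate values key
    condition: analyticsdb.enabled
```

Each instance then takes its overrides from its own top-level key in the parent values, so the two databases can differ in size, credentials, and persistence without touching each other.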

The precedence rule is worth memorizing: a per-subchart condition overrides any tags setting. If redis.enabled is explicitly set, it wins regardless of the cache tag. Tags are for coarse “turn off all the stateful stuff in preview environments” switches; conditions are for fine control of a single component.
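The precedence rule has a practical consequence that is easy to miss: once a condition path is defaulted in values.yaml, the tag no longer decides anything for that subchart. A minimal sketch, assuming a hypothetical preview-values.yaml override file:

```yaml
# platform/values.yaml: defaults (keys follow the Chart.yaml above)
tags:
  cache: true
  database: true
redis:
  enabled: true        # redis.enabled exists, so it decides for redis
primarydb:
  enabled: true
---
# preview-values.yaml: hypothetical override file for preview environments
tags:
  database: false      # no effect on primarydb, because primarydb.enabled is set and wins
primarydb:
  enabled: false       # this is the switch that actually removes the subchart
redis:
  enabled: false
```

If you want a tag to act as the coarse switch, the subcharts it covers need to rely on the tag alone; mixing both means the condition is always the one in charge.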

Resolve and lock before you ever install:

helm dependency update ./platform   # resolves versions, writes Chart.lock, fills charts/
helm dependency build  ./platform   # rebuilds charts/ from an existing Chart.lock

Commit Chart.lock. CI and production must resolve byte-identical subcharts; without the lock file, range constraints in Chart.yaml (tilde ranges, or x wildcards such as 20.1.x) will drift between a Friday test and a Monday deploy.

2. Passing and scoping values into subcharts

Helm has exactly three ways for a parent to influence a subchart, and conflating them is the single largest source of “why did that value not take” tickets.

Override by subchart key. Anything nested under the subchart’s name (or alias) in the parent’s values is passed straight down, deep-merged over the subchart’s own values.yaml:

# platform/values.yaml
primarydb:                 # the alias from Chart.yaml
  auth:
    database: orders
  primary:
    persistence:
      size: 100Gi

The global namespace. Keys under .Values.global are visible to the parent and every subchart simultaneously. This is the only channel that crosses sibling boundaries, which makes it right for genuinely cross-cutting settings and wrong for almost everything else:

# platform/values.yaml
global:
  imageRegistry: registry.internal.acme.com
  imagePullSecrets:
    - name: acme-pull
  storageClass: gp3

A global is an implicit API shared by all subcharts. The day one subchart starts reading global.storageClass, removing it becomes a breaking change you cannot see from the umbrella. Treat the global block as a published contract: small, documented, and changed deliberately.
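To make the "implicit API" point concrete, here is a sketch of how a subchart might consume global.storageClass. The template path, the PVC itself, and the persistence.* keys are hypothetical; only the global key comes from the values above.

```yaml
# worker/templates/pvc.yaml: hypothetical template inside the worker subchart
{{- $global := .Values.global | default dict }}
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ .Release.Name }}-worker-scratch
spec:
  accessModes:
    - ReadWriteOnce
  # Prefer the umbrella-wide contract, fall back to the chart's own value.
  storageClassName: {{ $global.storageClass | default .Values.persistence.storageClass | quote }}
  resources:
    requests:
      storage: {{ .Values.persistence.size | quote }}
```

Nothing in the umbrella's Chart.yaml or values.yaml reveals that this template depends on the key, which is why removing a global is a breaking change you can only find by searching the subcharts.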

import-values for explicit propagation. When a subchart declares a value (a service host, a conventional name) that the parent needs, the subchart exports a block and the parent imports it without hard-coding the path. Only values are copied this way; nothing rendered by the subchart’s templates can be imported:

# platform/Chart.yaml dependency entry; child block lives at the subchart's exports.connection
  - name: redis
    version: "20.1.x"
    repository: "oci://registry-1.docker.io/bitnamicharts"
    import-values:
      - child: exports.connection   # long form for nested keys
        parent: cache.connection

After import the parent reads .Values.cache.connection.host, instead of duplicating the hostname across two values files and watching them drift. The short string form (import-values: ["data"]) works only when the child block is literally named exports.data; for anything nested, use the explicit child/parent mapping.
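The child side of that mapping is not shown above. A minimal sketch of what the subchart's values would need to contain, assuming hypothetical hostnames and keys (a published chart may not ship such an exports block at all):

```yaml
# charts/redis/values.yaml (child side): hypothetical exports block
exports:
  connection:                    # matched by  child: exports.connection
    host: platform-redis-master
    port: 6379
  data:                          # matched by the short form  import-values: ["data"]
    cacheTTLSeconds: 300

# Effective parent values if both import entries are declared (sketch):
# cache:
#   connection:
#     host: platform-redis-master
#     port: 6379
# cacheTTLSeconds: 300           # short form lands at the parent's top level, without the data key
```

The difference in where the keys land is the main reason to prefer the explicit child/parent form: the destination is written down rather than implied.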

A subchart can never reach up into its parent or sideways into a sibling — there is no such scope. If two subcharts must agree on a value, the parent sets it in both (or one exports and the other imports through the parent). Designing as if siblings can see each other is the most common Helm scoping mistake.
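One way to have the parent set the same value in both subcharts without typing it twice is a YAML anchor. This is a sketch only; the api.database.host and worker.database.host keys are hypothetical and depend on what each subchart actually reads.

```yaml
# platform/values.yaml: hypothetical keys, shown only to illustrate the pattern
api:
  database:
    host: &dbHost platform-primarydb   # defined once in the parent...
worker:
  database:
    host: *dbHost                      # ...and repeated for the sibling
```

Anchors are resolved when this one file is parsed, so an override from another -f file or from --set changes only the key it names; the alias does not follow it. For values that are overridden per environment, setting both keys explicitly in each environment file is the safer choice.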

3. Library charts: shared helpers without rendered resources

When five service charts all need the same labels, the same security context, or the same probe defaults, copy-paste rots fast. A library chart is the fix: a chart that ships only named templates and renders nothing on its own.

# common/Chart.yaml
apiVersion: v2
name: common
type: library          # the critical line: Helm will not render this chart's templates
version: 3.1.0

The type: library declaration changes behavior: Helm skips the chart during rendering, so it never emits a Deployment or Service by itself — it only exposes define blocks for other charts to include. Put reusable logic in templates/_*.tpl:

{{/* common/templates/_pod.tpl */}}
{{- define "common.securityContext" -}}
runAsNonRoot: true
runAsUser: 10001
seccompProfile:
  type: RuntimeDefault
{{- end -}}

{{- define "common.image" -}}
{{- $reg := .Values.global.imageRegistry | default .Values.image.registry -}}
{{- printf "%s/%s:%s" $reg .Values.image.repository (.Values.image.tag | default .Chart.AppVersion) -}}
{{- end -}}

Declare common as a dependency in the app chart, then call its templates. The second argument to include is the context (.), which is how the helper sees the consuming chart’s values, not the library’s:

# api/templates/deployment.yaml
spec:
  template:
    spec:
      securityContext:
        {{- include "common.securityContext" . | nindent 8 }}
      containers:
        - name: api
          image: {{ include "common.image" . }}

A common advanced pattern has the library define an entire resource and lets each app chart pass overrides through a tpl-evaluated values block — Bitnami’s common chart works this way. That adds indirection; start by centralizing just labels, selector labels, image references, and security context, where fleet-wide drift actually hurts.
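As a starting point for the "centralize labels first" advice, a label helper pair could be sketched like this. The file name and the exact label set are assumptions, using the well-known app.kubernetes.io label keys.

```yaml
{{/* common/templates/_labels.tpl: hypothetical file in the library chart */}}
{{- define "common.selectorLabels" -}}
app.kubernetes.io/name: {{ .Chart.Name }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end -}}

{{- define "common.labels" -}}
{{ include "common.selectorLabels" . }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end -}}
```

A consuming chart would call include "common.labels" . with nindent 4 under metadata.labels, and include "common.selectorLabels" . under the selector. Keeping the selector set separate and minimal matters because selectors are immutable on a Deployment, while the full label set can grow over time.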

4. Pre-install, post-upgrade, and delete hooks with weights and policies

Hooks let you run resources at lifecycle points instead of as part of the steady-state release. The full set you will actually use: pre-install, post-install, pre-upgrade, post-upgrade, pre-delete, post-delete, pre-rollback, post-rollback. Within a single phase, helm.sh/hook-weight orders them — lower runs first, and weights are strings sorted as integers.

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "common.fullname" . }}-warm-cache
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: warm
          image: {{ include "common.image" . }}
          command: ["/app/warm-cache"]
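To illustrate ordering inside one phase, the sketch below defines two pre-upgrade hooks. The preflight names, the /app/preflight command, and the MIN_SCHEMA_VERSION key are hypothetical; the annotations are the ones described above.

```yaml
# Hypothetical pair of pre-upgrade hooks in one chart; the lower weight is created first
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-preflight-config
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "-5"          # created before the Job below
    "helm.sh/hook-delete-policy": before-hook-creation
data:
  MIN_SCHEMA_VERSION: "42"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-preflight
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "0"           # runs after the ConfigMap exists
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: preflight
          image: {{ include "common.image" . }}
          command: ["/app/preflight"]
          envFrom:
            - configMapRef:
                name: {{ .Release.Name }}-preflight-config
```

The ConfigMap has to be a hook too: regular release resources are not applied until the pre-upgrade phase has finished, so a hook Job cannot depend on configuration that only arrives with the release itself.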

Three facts about hooks separate people who trust them from people who get paged:

Hook resources are not tracked as part of the release. Helm creates them out of band, they do not appear in the rendered release manifest, and helm uninstall will not necessarily clean them up. That is why you must set a hook-delete-policy.
The delete policies are before-hook-creation (delete a prior hook of the same name first), hook-succeeded (delete after success), and hook-failed (delete after failure). before-hook-creation,hook-succeeded is the sane default for Jobs: a clean slate each run, tidy-up on success, and a retained object on failure so you can read its logs.
A failed hook aborts the operation but does not roll back on its own — you need --atomic (Section 6) for that.

If a resource should keep existing and be reconciled (a ServiceAccount, a ConfigMap), it is not a hook — model it as a normal template. Reserve hooks for genuine one-shot, ordered side effects.

5. Running database migrations safely as a hook Job

The canonical hook is a schema migration that must run before new pods that expect the new schema. Get four things right and it is reliable; get any wrong and it is a recurring outage.

# platform/charts/api/templates/migrate-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "common.fullname" . }}-migrate-{{ .Release.Revision }}
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-10"          # run before everything else in the phase
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  backoffLimit: 3                          # retry transient failures
  activeDeadlineSeconds: 600               # but give up after 10 minutes
  ttlSecondsAfterFinished: 3600            # GC the Job object an hour later
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: {{ include "common.image" . }}
          command: ["/app/migrate", "up"]

The non-negotiables:

Idempotency. The hook can re-run (a retried helm upgrade, a backoffLimit retry), so the migration tool must track applied versions and no-op on already-applied changes. Every real framework (Flyway, golang-migrate, Alembic, Rails) does this; if yours does not, you are building an outage.
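The pre-install/pre-upgrade hook phase is what guarantees the migration finishes before the Deployment rolls: Helm waits for the hook Job to complete before it applies the release’s regular resources. hook-weight: -10 only orders the migration ahead of other hooks in the same phase. Negative weights are valid and idiomatic for “run first.”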
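Naming with .Release.Revision plus before-hook-creation avoids the leftover-Job trap. A Job’s spec.template is immutable, so a fixed name fails in one of two ways: as a regular template, the upgrade patch is rejected with field is immutable; as a hook without before-hook-creation, a Job left over from a previous run makes creation fail with already exists. Embedding the revision yields a fresh name each time.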
backoffLimit and activeDeadlineSeconds bound the blast radius: retry a flaky blip, but do not let a genuinely broken migration hold the release hostage forever.

Expand-then-contract is what makes migration hooks safe under rolling updates. Ship additive changes (a new nullable column) in release N, deploy code that writes both old and new in N+1, then drop the old column in N+2. A migration backward-compatible with the currently running pods can run as a pre-upgrade hook with zero coordination; one that is not needs a maintenance window no matter how you sequence it.

6. Atomic upgrades, --wait, and automatic rollback

By default helm upgrade returns as soon as the objects are submitted, not when they are healthy, and a partial failure leaves the release in a half-applied, failed state. For production releases, never run a bare helm upgrade.

helm upgrade platform oci://ghcr.io/acme/charts/platform \
  --version 2.4.0 \
  -f prod-values.yaml \
  --install \
  --atomic \
  --timeout 8m \
  --cleanup-on-fail

What each flag buys you:

--wait (implied by --atomic) blocks until Pods, PVCs, Deployments, and StatefulSets reach ready, or the --timeout expires. This turns “submitted” into “actually rolled out,” so an unready Pod counts as a failed upgrade just as a failed hook already does.
--atomic rolls the release back to the previous revision if the upgrade fails or times out. The release ends fully on the new version or fully on the old one — never wedged in between.
--timeout 8m bounds the wait. Size it above your slowest legitimate rollout (image pulls, migration hook, readiness ramp) so you do not trip rollback on a merely slow deploy.
--cleanup-on-fail deletes resources newly created during a failed upgrade, so a rolled-back release does not leak orphaned objects.

--atomic carries a cost: a failure takes the full timeout before giving up, and the rollback itself runs pre-rollback/post-rollback hooks, so budget for both in your pipeline’s own timeout. Hook Jobs are always awaited; add --wait-for-jobs when regular (non-hook) Jobs in the release must also complete before the upgrade counts as successful.
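One way to budget for the upgrade wait plus a possible rollback is to set an explicit step timeout in the pipeline. The workflow below is a sketch: the file name, the trigger, and the 20/25-minute figures are assumptions, and cluster and registry credentials are presumed to be configured elsewhere.

```yaml
# .github/workflows/deploy.yml: hypothetical deploy job
name: deploy
on: workflow_dispatch
jobs:
  deploy:
    runs-on: ubuntu-latest
    timeout-minutes: 25                  # outer bound for the whole job
    steps:
      - uses: actions/checkout@v4
      - uses: azure/setup-helm@v4
      - name: Upgrade platform
        timeout-minutes: 20              # 8m upgrade wait + up to 8m rollback wait + slack
        run: |
          helm upgrade platform oci://ghcr.io/acme/charts/platform \
            --version 2.4.0 \
            -f prod-values.yaml \
            --install \
            --atomic \
            --timeout 8m \
            --cleanup-on-fail
```

If the step timeout is shorter than the worst case, the runner can kill Helm in the middle of its own rollback, which is exactly the wedged state --atomic was meant to prevent.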

7. Release history, the storage backend, and pruning

Every helm upgrade writes a new revision. That history is what rollback reads, and left unbounded it becomes its own problem.

helm history platform                     # list every revision, status, and chart version
helm get values  platform --revision 6    # exactly what was applied at revision 6
helm get manifest platform --revision 6   # the rendered objects at that revision
helm rollback platform 6 --wait --timeout 5m

A rollback is itself a new revision (rolling back from 8 to 6 creates revision 9 with the contents of 6), so the audit trail stays append-only.

Two operational settings matter at scale. First, the storage backend. Since Helm 3 the default driver is secret — release state lives in a Secret in the release namespace, base64+gzip encoded, not the older configmap. Confirm it, and inspect the raw objects when debugging:

echo "${HELM_DRIVER:-secret}"        # unset means the v3 default, secret
kubectl get secret -n prod -l owner=helm,name=platform
# sh.helm.release.v1.platform.v8  helm.sh/release.v1  1

Second, prune history with --history-max on every upgrade. Each revision is stored as its own Secret, so deep history clutters the namespace, and a large umbrella can approach the roughly 1 MiB per-Secret size limit on its own. The default of 10 is reasonable, but explicit is better than implicit when an SRE is reasoning about what can be rolled back to:

helm upgrade platform ... --history-max 10   # keep only the last 10 revisions

8. Diffing releases with helm-diff and gating changes in CI

The most dangerous helm upgrade is the one where nobody saw the change — only the desired end state. The helm-diff plugin renders the delta between what is live and what you are about to apply, so a reviewer approves a diff, not a leap of faith.

helm plugin install https://github.com/databus23/helm-diff

helm diff upgrade platform oci://ghcr.io/acme/charts/platform \
  --version 2.4.0 \
  -f prod-values.yaml \
  --context 3

This surfaces exactly which objects mutate, which fields change, and — critically — whether you are about to touch an immutable field (a Deployment selector, a Service clusterIP, a StatefulSet volumeClaimTemplates) that Kubernetes will reject at apply time. Catching that in a diff is a one-line review comment; catching it mid-upgrade is an incident.

Gate it in CI so no production change merges without a rendered, reviewed diff:

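# .github/workflows/helm-diff.yml
name: helm-diff
on: pull_request
permissions:
  contents: read
  pull-requests: write          # required for gh pr comment
jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/setup-helm@v4
      - name: Install helm-diff
        run: helm plugin install https://github.com/databus23/helm-diff
      - name: Render diff against the live release
        run: |
          # Capture the exit code directly: piping into tee would mask it.
          set +e
          helm diff upgrade platform oci://ghcr.io/acme/charts/platform \
            --version "${CHART_VERSION}" \
            -f environments/prod/values.yaml \
            --detailed-exitcode > diff.txt
          rc=$?
          set -e
          cat diff.txt
          # 0 = no changes, 2 = changes to apply; anything else is a real error.
          if [ "$rc" -ne 0 ] && [ "$rc" -ne 2 ]; then exit "$rc"; fi
        env:
          # Assumption: CI publishes each PR build as 0.0.0-<sha>.
          # A bare commit SHA is not a valid SemVer chart version.
          CHART_VERSION: 0.0.0-${{ github.event.pull_request.head.sha }}
      - name: Comment diff on PR
        run: gh pr comment "${{ github.event.number }}" --body-file diff.txt
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}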

--detailed-exitcode returns 2 when there is a drift to apply, 0 when there is none — useful to fail or skip downstream steps deterministically. (This job needs cluster credentials to read live state; in a GitOps setup you would instead let Argo CD or Flux render the diff against the cluster it manages.)

Enterprise scenario

A platform team ran a 9-subchart umbrella across staging and three regional production clusters. Each release bundled an API, two workers, a pre-upgrade Flyway migration hook, and Bitnami PostgreSQL and Redis subcharts, deployed with --atomic --timeout 5m.

A release adding a non-trivial index migration failed in production only. The migration took ~6 minutes against the production data volume; staging’s tiny dataset finished in 20 seconds. At the 5-minute timeout, --atomic declared failure and rolled back. But the hook had already committed the index — Postgres does not unapply committed DDL because Helm rolled back the application. The rollback redeployed the previous app revision against a schema now ahead of it, and because the old Job name was fixed (no revision suffix), the retried upgrade also hit Job ... field is immutable. Three failure states stacked up.

The constraint was real: long migrations and a hard atomic timeout are in direct tension, and DDL is not transactional with Helm’s rollback. The fix had three parts.

First, they decoupled migration timing from the app timeout and made the Job name unique per revision:

metadata:
  name: api-migrate-{{ .Release.Revision }}        # unique name, no immutable-Job clash
spec:
  activeDeadlineSeconds: 1800                       # migrations may take up to 30m

Second, they adopted expand-then-contract so every migration was backward-compatible with the running pods — an additive column lands safely while old code runs, so rollback never hits an incompatible schema. Destructive changes were split into a separate, later release.

Third, for genuinely long online migrations they moved the operation out of the synchronous hook and ran it as a standalone, monitored Job before the upgrade, so a slow index build could never trip the app rollback timer:

kubectl apply -f migrate-job.yaml
kubectl wait --for=condition=complete job/api-migrate-2025q4 --timeout=45m
helm upgrade platform ... --atomic --timeout 8m     # app rollout only, schema already ahead
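The migrate-job.yaml applied above is not shown in the write-up. A minimal sketch of what such a standalone manifest could look like follows; the image reference and namespace are assumptions, and because this is a plain manifest rather than a chart template, nothing in it is templated.

```yaml
# migrate-job.yaml: hypothetical contents; only the Job name appears in the commands above
apiVersion: batch/v1
kind: Job
metadata:
  name: api-migrate-2025q4
  namespace: prod                        # assumption
spec:
  backoffLimit: 0                        # a long online migration should not silently restart
  activeDeadlineSeconds: 2700            # 45m, matching the kubectl wait timeout
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.internal.acme.com/acme/api:1.8.0   # assumption: the new app image
          command: ["/app/migrate", "up"]
```

Since Helm never sees this Job, cleaning it up and recording that it ran become the pipeline's responsibility.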

The lesson, written into their runbook: --atomic rolls back Kubernetes objects, not database state. Any irreversible hook side effect must be backward-compatible (so rollback is safe) or pulled out of the atomic window (so a timeout cannot leave it half-done).

Verify

Run these before you trust an umbrella release:

# 1. Dependencies resolve to the locked versions, no surprises
helm dependency build ./platform && helm dependency list ./platform

# 2. The whole umbrella renders with a real prod values file
helm template platform ./platform -f prod-values.yaml > /tmp/all.yaml
test -s /tmp/all.yaml && echo "rendered OK"

# 3. A disabled subchart actually disappears (expect 0 manifests sourced from primarydb)
helm template platform ./platform --set primarydb.enabled=false | grep -c "^# Source: platform/charts/primarydb/"

# 4. The change set is what you expect, against the live release
helm diff upgrade platform ./platform -f prod-values.yaml

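# 5. Render with cluster access and validate against the cluster's API schema
helm upgrade --install platform ./platform --dry-run=server -f prod-values.yaml

# 6. After deploy: history is bounded and the latest revision is deployed
helm history platform | tail -5

--dry-run=server is meaningfully stronger than the default client dry run: Helm connects to the cluster while rendering, so lookup calls and API-version checks resolve against real state and manifests are validated against the cluster’s OpenAPI schema, catching breaks a purely local render misses. In Helm 3 it does not submit the objects the way kubectl apply --dry-run=server does; to exercise admission webhooks as well, pipe helm template output into that kubectl command.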

Checklist

Pitfalls

Assuming subcharts can see each other. Siblings share nothing but global. If two need a value, the parent sets it in both — design accordingly.
--atomic as a safety blanket for migrations. It rolls back objects, not committed DDL. A long or destructive migration inside the atomic window is a trap.
Fixed Job names for migration hooks. A Job’s spec.template is immutable; reuse a name across upgrades and you get field is immutable. Suffix with the revision.
Forgetting hooks are untracked. No hook-delete-policy means orphaned Jobs pile up; helm uninstall will not reliably clean them for you.
Unbounded release history. Every revision is another Secret in the namespace, and a large umbrella can approach the per-object size limit on its own. Pin --history-max.

Next step: take the umbrella you already run, move its shared labels and security context into a versioned library chart, and add a helm diff gate to the pipeline. The diff alone will pay for itself the first time it flags an immutable-field change before it reaches a cluster.

Written by Vinod
