
Running Stateful PostgreSQL on Kubernetes: StatefulSets, Operators, Automated Failover, and Point-in-Time Recovery

Running a stateless web service on Kubernetes is a solved problem. Running a primary/replica PostgreSQL cluster — with synchronous replication, sub-30-second failover, and a recovery story that survives a fat-fingered DELETE — is where most teams get burned. The naive approach (a StatefulSet plus a sidecar that runs pg_ctl) handles the easy 80% and then quietly loses data during the failover you didn’t test.

This guide builds a production Postgres cluster on Kubernetes the way it actually holds up: StatefulSet-style stable identity and storage, a mature operator for the consensus and lifecycle logic, object-storage WAL archiving for durability, and a rehearsed PITR drill so recovery is a runbook, not a prayer. Examples use CloudNativePG because it is CNCF-hosted, Postgres-native, and avoids bolting on external Patroni/etcd machinery, but the architecture transfers to Zalando and Crunchy.

1. StatefulSet fundamentals: stable identities, ordered rollout, headless services

A Deployment gives you interchangeable, anonymous pods. Postgres needs the opposite: each replica has a durable identity, its own volume, and a known position in a topology. That is exactly what a StatefulSet provides.

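Three guarantees matter for databases:

  1. Stable network identity. Each pod gets a fixed ordinal name (pg-0, pg-1, ...) and, through a headless Service, its own DNS record that follows the pod across reschedules.
  2. Stable storage. volumeClaimTemplates give each pod its own PersistentVolumeClaim, and a rescheduled pod reattaches to the same volume.
  3. Ordered rollout. By default, pods are created in ordinal order and terminated in reverse order, one at a time.

The headless Service that provides the per-pod DNS records: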

apiVersion: v1
kind: Service
metadata:
  name: pg-headless
  namespace: db
spec:
  clusterIP: None          # headless: per-pod DNS, no virtual IP
  selector:
    app: pg
  ports:
    - name: postgres
      port: 5432

The DNS-per-pod property is the load-bearing primitive. Replication, leader election, and client routing all depend on a replica being reachable at a name that survives a node failure. Hold this mental model; everything below builds on it.
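As a rough illustration of that naming scheme, the sketch below builds the per-pod DNS names a raw StatefulSet would get behind the pg-headless Service above. It assumes a hypothetical StatefulSet named pg whose serviceName is pg-headless, plus the default cluster domain cluster.local; the function names are illustrative only, and operator-managed pods may be numbered differently.

```python
def statefulset_pod_names(statefulset: str, replicas: int) -> list[str]:
    """StatefulSet pods are named <statefulset>-<ordinal>, starting at 0."""
    return [f"{statefulset}-{i}" for i in range(replicas)]


def pod_dns_name(pod: str, service: str, namespace: str,
                 cluster_domain: str = "cluster.local") -> str:
    """Stable DNS name of one pod behind a headless Service.

    cluster_domain defaults to "cluster.local" (an assumption; clusters
    can be configured with a different domain).
    """
    return f"{pod}.{service}.{namespace}.svc.{cluster_domain}"


if __name__ == "__main__":
    # Hypothetical StatefulSet "pg" with 3 replicas in namespace "db".
    for pod in statefulset_pod_names("pg", 3):
        print(pod_dns_name(pod, "pg-headless", "db"))
```

The name depends only on the pod's ordinal, the Service, and the namespace, never on the node or the pod IP, which is why it survives rescheduling.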

2. Why a raw StatefulSet is not enough: choosing an operator

A StatefulSet gives you stable pods. It gives you nothing about which pod is the primary, how a replica is promoted when the primary dies, or how to fence a node that is partitioned but still writing. That logic — distributed consensus, leader election, fencing, replication topology reconciliation — is the hard part, and it is operator territory.

Operator | Consensus / HA mechanism | Notes
CloudNativePG | Operator-driven, uses the Kubernetes API as the source of truth (no external DCS) | CNCF Sandbox, Postgres-native, instance manager as PID 1
Zalando postgres-operator | Patroni + Kubernetes endpoints/configmaps as the DCS | Battle-tested at scale, Spilo image, Patroni semantics
Crunchy PGO | Patroni-based | Commercial backing, broad enterprise feature set

For a greenfield platform I default to CloudNativePG: it treats the Kubernetes control plane as the consensus store (no etcd cluster to babysit beside your database), and the primary is tracked by a label the operator flips atomically. The rest of this guide uses it.

Install the operator (pin the version; never track latest for a stateful controller):

kubectl apply --server-side -f \
  https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.25/releases/cnpg-1.25.0.yaml

# Wait for the controller to be ready before creating clusters
kubectl -n cnpg-system rollout status deployment/cnpg-controller-manager

3. Provisioning storage with the right StorageClass, volumeClaimTemplates, and topology

Storage choice decides your recovery time and whether failover even works. Two hard rules:

  1. Use a StorageClass with volumeBindingMode: WaitForFirstConsumer. Without it, a PVC can bind to a zone the scheduler later can’t place the pod into, deadlocking the rollout. WaitForFirstConsumer defers binding until the pod is scheduled, so volume and pod land in the same zone.
  2. Use block storage that supports your failover model. ReadWriteOnce (RWO) EBS/PD-style volumes are correct here: each replica owns its own copy, and replication (not a shared disk) provides redundancy. Do not try to share one RWX volume between primary and replica; that is data corruption waiting to happen.

A StorageClass that satisfies both rules:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pg-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"
  throughput: "250"
volumeBindingMode: WaitForFirstConsumer   # critical for zonal correctness
allowVolumeExpansion: true                 # lets you grow PVCs without recreating pods
reclaimPolicy: Retain                      # don't auto-delete database volumes

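Now declare the cluster. CloudNativePG does not create a StatefulSet; it manages the instance Pods and their PVCs directly, reproducing the same stable identity and per-instance storage, and it creates the read/write (-rw), read-only (-ro), and read (-r) Services for you. Spreading instances across zones is done with affinity: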

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg
  namespace: db
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16.4

  storage:
    storageClass: pg-ssd
    size: 100Gi

  # Dedicated volume for WAL keeps write spikes off the data disk
  walStorage:
    storageClass: pg-ssd
    size: 20Gi

  affinity:
    enablePodAntiAffinity: true
    topologyKey: topology.kubernetes.io/zone   # one instance per zone
    podAntiAffinityType: required

  resources:
    requests:
      memory: "4Gi"
      cpu: "2"
    limits:
      memory: "4Gi"

Putting WAL on its own PVC (walStorage) is not cosmetic: WAL is sequential and write-heavy, and isolating it keeps commit-path WAL writes from competing with checkpoint and query I/O on the data volume.

4. Configuring synchronous vs asynchronous replicas and quorum-based failover

This is the decision that defines your data-durability guarantee, and the one teams most often get wrong by accident.

The trap with naive synchronous setups (synchronous_standby_names = '1 (s1,s2)') is that if every listed standby is down, commits block indefinitely. CloudNativePG addresses this with a quorum-based config that accepts the required confirmations from any available standby and won't wedge when a single replica is unavailable:

spec:
  instances: 3
  postgresql:
    synchronous:
      method: any                 # quorum: ANY N synchronous standbys
      number: 1                   # require 1 of the available standbys to confirm
      dataDurability: required    # never silently fall back to async on commit
  # Seconds to wait after the primary is detected unhealthy before failover starts
  failoverDelay: 0

With method: any and number: 1 across three instances, a commit needs one of the two standbys to confirm — you tolerate losing a single standby with zero data loss and no commit blocking. Setting dataDurability: required means the operator will refuse to degrade to asynchronous behavior to keep writes flowing; if you would rather preserve availability over strict RPO=0, use preferred and understand you are accepting potential loss during a double failure.

Rule of thumb: synchronous-with-quorum needs at least 3 instances. With 2 instances, synchronous replication makes the primary’s availability depend on the single standby — the opposite of what you wanted.
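The rule of thumb falls out of simple arithmetic. A minimal sketch, assuming dataDurability: required (commits block rather than degrade to asynchronous) and a hypothetical helper name:

```python
def standby_failures_tolerated(instances: int, number: int) -> int:
    """How many standbys can be lost before synchronous commits block.

    instances: total Postgres instances (1 primary + standbys)
    number:    standbys that must confirm each commit (the quorum size)
    """
    if instances < 1 or number < 0:
        raise ValueError("instances must be >= 1 and number >= 0")
    standbys = instances - 1
    if number > standbys:
        raise ValueError("quorum is larger than the number of standbys")
    return standbys - number


if __name__ == "__main__":
    for instances in (2, 3, 5):
        tolerated = standby_failures_tolerated(instances, number=1)
        print(f"{instances} instances, number=1 -> tolerates {tolerated} standby failure(s)")
```

With three instances and number: 1 the result is 1; with two instances it is 0, meaning any standby outage stalls writes.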

5. Automated failover mechanics: leader election, fencing, and split-brain avoidance

When the primary dies, four things must happen in order, and the order is what prevents split-brain (two pods both believing they are primary and both accepting writes):

  primary unhealthy
        |
   (1) operator detects via liveness + replication state
        |
   (2) FENCE the old primary  -> it is demoted / stopped, cannot accept writes
        |
   (3) ELECT most-advanced standby (least WAL lag) as new primary
        |
   (4) promote it, repoint the -rw Service, reconfigure remaining replicas

The non-negotiable step is (2) fencing. A primary that is network-partitioned but still running will happily accept writes that the rest of the cluster never sees. If you promote a standby without first guaranteeing the old primary cannot write, you now have two divergent timelines — split-brain — and reconciling them means losing one side’s data. CloudNativePG fences by demoting the old primary and only flips the cnpg.io/instanceRole=primary label (which the -rw Service selects on) once a single primary is guaranteed.

Because CloudNativePG uses the Kubernetes API as its source of truth, “who is primary” is a single atomic label update on one object — there is no separate etcd/Consul that can disagree with the cluster’s view. That eliminates an entire class of DCS-vs-database disagreement bugs.
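Step (3) depends on comparing WAL positions numerically, not as strings. The sketch below illustrates that comparison only; it is not CloudNativePG's actual election code, and the function names and the input mapping (standby name to last received LSN) are assumptions made for the example.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN such as '0/3000148' into a 64-bit integer.

    An LSN is printed as two hexadecimal numbers: the high 32 bits,
    a slash, then the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def most_advanced_standby(received_lsn: dict[str, str]) -> str:
    """Return the standby whose received LSN is furthest ahead."""
    if not received_lsn:
        raise ValueError("no standby available for promotion")
    return max(received_lsn, key=lambda name: parse_lsn(received_lsn[name]))


if __name__ == "__main__":
    # Hypothetical positions. Compared as strings, "0/A000000" would sort
    # after "0/10000000", which is the wrong answer.
    standbys = {"pg-2": "0/A000000", "pg-3": "0/10000000"}
    print(most_advanced_standby(standbys))
```

Promoting the standby with the highest received LSN minimizes the WAL that could be lost or would need to be rewound on the other replicas.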

You can also fence manually for maintenance — e.g., to take an instance out of rotation safely:

# Fence a specific instance (Postgres is shut down on it; the pod keeps running)
kubectl cnpg fencing on pg pg-2 -n db

# ... perform node maintenance, then ...
kubectl cnpg fencing off pg pg-2 -n db

To validate failover, kill the primary and watch promotion — don’t wait for an incident to find out your RTO:

PRIMARY=$(kubectl get cluster pg -n db -o jsonpath='{.status.currentPrimary}')
kubectl delete pod "$PRIMARY" -n db --grace-period=0 --force
kubectl get cluster pg -n db -w   # watch currentPrimary flip to a standby

6. Continuous WAL archiving to object storage and point-in-time recovery

Replication is not a backup. Replication faithfully copies a DROP TABLE to every standby in milliseconds. Durability against logical mistakes and total-cluster loss requires a base backup plus a continuous stream of WAL in object storage, which together enable point-in-time recovery — restoring to any moment, including “one second before the bad migration ran.”

Configure continuous archiving to S3 (CloudNativePG uses Barman Cloud under the hood). Store credentials in a Secret, never inline:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg
  namespace: db
spec:
  instances: 3
  backup:
    barmanObjectStore:
      destinationPath: s3://acme-pg-backups/pg
      s3Credentials:
        accessKeyId:
          name: pg-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: pg-backup-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
        maxParallel: 8            # parallelize WAL upload to keep up with write load
      data:
        compression: gzip
    retentionPolicy: "30d"        # operator prunes backups + WAL older than 30 days

Take an on-demand base backup (and schedule recurring ones with a ScheduledBackup):

# One-off base backup
kubectl cnpg backup pg -n db

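# Confirm WAL is actually shipping (the part people forget to check).
# Query the current primary, which is not necessarily pg-1 after a failover.
PRIMARY=$(kubectl get cluster pg -n db -o jsonpath='{.status.currentPrimary}')
kubectl exec -n db "$PRIMARY" -c postgres -- \
  psql -tAc "SELECT archived_count, failed_count, last_failed_wal
             FROM pg_stat_archiver;"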

failed_count climbing or last_failed_wal set means archiving is broken and your PITR window is silently frozen — alert on it.
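One way to turn that check into an alert is to compare failed_count between polls. This is a minimal sketch, assuming the input is the single unaligned line that psql -tA prints for the query above (fields separated by "|", NULL printed as an empty string); the function name and the message format are hypothetical.

```python
from typing import Optional


def archiver_alert(psql_line: str, previous_failed_count: int) -> Optional[str]:
    """Return an alert message if WAL archiving failed since the last poll.

    psql_line: 'archived_count|failed_count|last_failed_wal' from psql -tA
    previous_failed_count: failed_count observed on the previous poll
    """
    _archived, failed, last_failed_wal = psql_line.strip().split("|")
    failed_count = int(failed)
    if failed_count > previous_failed_count:
        return (f"WAL archiving failing: failed_count {previous_failed_count} -> "
                f"{failed_count}, last_failed_wal={last_failed_wal or 'unknown'}")
    return None


if __name__ == "__main__":
    print(archiver_alert("128|0|", previous_failed_count=0))
    print(archiver_alert("130|3|000000010000000000000042", previous_failed_count=0))
```

Comparing against the previous value matters because failed_count is cumulative: a single old failure should not page forever, but a rising count should.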

To perform PITR, you create a new Cluster that bootstraps via recovery, pointing at the backup and a target time. CloudNativePG replays WAL from the base backup up to the target and stops:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-restore
  namespace: db
spec:
  instances: 3
  storage:
    storageClass: pg-ssd
    size: 100Gi
  bootstrap:
    recovery:
      source: pg                       # references the externalCluster below
      recoveryTarget:
        targetTime: "2026-06-08 11:32:00+00"   # restore to just before the bad change
  externalClusters:
    - name: pg
      barmanObjectStore:
        destinationPath: s3://acme-pg-backups/pg
        s3Credentials:
          accessKeyId:
            name: pg-backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: pg-backup-creds
            key: SECRET_ACCESS_KEY
        wal:
          maxParallel: 8

Drill this quarterly against real backups. A PITR procedure you have never executed is a hypothesis, not a recovery plan. Measure wall-clock RTO and confirm row counts post-restore.
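When planning a drill it helps to know which base backup a given target time can use: recovery has to start from a backup that finished before the target and then replay WAL forward. The sketch below shows only that selection rule under assumed inputs (a list of backup completion times); the function name and the timestamps are hypothetical, and in practice the tooling makes this choice for you.

```python
from datetime import datetime, timezone
from typing import Optional


def pick_base_backup(backup_end_times: list[datetime],
                     target: datetime) -> Optional[datetime]:
    """Return the most recent base backup that finished at or before target.

    Returns None when every backup finished after the target, which means
    the target lies outside the recoverable window.
    """
    candidates = [end for end in backup_end_times if end <= target]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    utc = timezone.utc
    backups = [  # hypothetical nightly backups
        datetime(2026, 6, 7, 2, 0, tzinfo=utc),
        datetime(2026, 6, 8, 2, 0, tzinfo=utc),
        datetime(2026, 6, 9, 2, 0, tzinfo=utc),
    ]
    target = datetime(2026, 6, 8, 11, 32, tzinfo=utc)
    print(pick_base_backup(backups, target))
```

The gap between the chosen backup and the target is the amount of WAL that must be replayed, which is usually the dominant term in restore time.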

7. Rolling minor upgrades and major version upgrades

Minor upgrades (16.4 -> 16.5) are routine: bump imageName. The operator performs a rolling update: replicas first, then a controlled switchover so the primary restarts last, limiting the write-path outage to a single fast switchover:

kubectl patch cluster pg -n db --type merge \
  -p '{"spec":{"imageName":"ghcr.io/cloudnative-pg/postgresql:16.5"}}'
kubectl get cluster pg -n db -w   # watch the rolling switchover

Replicas are always updated first. primaryUpdateStrategy controls what happens once only the primary is left: unsupervised lets the operator switch over automatically; supervised waits for you to trigger the switchover during a maintenance window.

Major upgrades (16 -> 17) change the on-disk format and cannot be a simple image swap. Two safe paths:

8. Monitoring, connection pooling with PgBouncer, and capacity planning

Connection pooling is mandatory, not optional. Each Postgres connection is a backend process with real memory cost; a few hundred app pods opening connections directly will exhaust max_connections and OOM the database. CloudNativePG ships a first-class Pooler (PgBouncer) — point your apps at the pooler service, not the database:

apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: pg-pooler-rw
  namespace: db
spec:
  cluster:
    name: pg
  instances: 3
  type: rw                       # pool writes to the current primary
  pgbouncer:
    poolMode: transaction        # transaction pooling -> highest connection multiplexing
    parameters:
      max_client_conn: "1000"
      default_pool_size: "25"

poolMode: transaction gives the best multiplexing but forbids session-level features (prepared statements across transactions, SET that outlives a transaction, advisory session locks) — confirm your app and ORM tolerate it before shipping.
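Pool sizing is worth a quick calculation. PgBouncer's default_pool_size applies per database/user pair, and each Pooler instance runs its own PgBouncer, so the server-side connection ceiling multiplies. A minimal sketch, where max_connections=100 and the reserved allowance of 10 are assumed values for illustration and the function names are hypothetical:

```python
def pooler_server_connections(pooler_instances: int, default_pool_size: int,
                              db_user_pairs: int = 1) -> int:
    """Upper bound on server connections the poolers can open to Postgres."""
    return pooler_instances * default_pool_size * db_user_pairs


def connection_headroom(max_connections: int, pooler_instances: int,
                        default_pool_size: int, db_user_pairs: int = 1,
                        reserved: int = 10) -> int:
    """Connections left after the poolers and a reserved allowance.

    reserved is an assumed allowance for superuser and operator sessions.
    A negative result means the poolers alone can exhaust max_connections.
    """
    used = pooler_server_connections(pooler_instances, default_pool_size,
                                     db_user_pairs)
    return max_connections - reserved - used


if __name__ == "__main__":
    # Values from the Pooler above: 3 instances, default_pool_size 25.
    print(connection_headroom(100, 3, 25))                   # one db/user pair
    print(connection_headroom(100, 3, 25, db_user_pairs=2))  # two pairs
```

With one database/user pair the poolers can open at most 75 server connections, which fits under 100; a second pair pushes the ceiling to 150 and the headroom negative, so either default_pool_size or max_connections has to change.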

For monitoring, CloudNativePG exposes Prometheus metrics natively. Enable a PodMonitor and alert on the signals that actually predict incidents:

spec:
  monitoring:
    enablePodMonitor: true

The four metrics worth paging on:

Signal | Why it matters
Replication lag (bytes/seconds) | A lagging standby is not a valid failover target; this is a silent RPO risk
pg_stat_archiver.failed_count | WAL archiving broken == PITR window frozen
PVC disk usage on data and WAL volumes | A full WAL disk stops the primary from accepting writes
Connection saturation vs max_connections | Predicts the OOM that pooling is meant to prevent

For capacity planning, the disk that bites you is WAL: a long-running replication slot or a stalled archive can pin WAL indefinitely and fill the volume, at which point Postgres can no longer write WAL and the primary stops accepting writes. Size the WAL PVC for your worst-case archive outage, cap max_slot_wal_keep_size, and alert on WAL volume usage well before 100%.
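Sizing for the worst-case archive outage is a back-of-the-envelope multiplication. The sketch below assumes you have measured a peak WAL generation rate and picked an outage duration to survive; the function name, the 30% headroom default, and the example figures are assumptions, not recommendations.

```python
import math


def wal_pvc_size_gib(wal_rate_mib_per_s: float, worst_outage_hours: float,
                     headroom: float = 0.3) -> int:
    """GiB of WAL volume needed to ride out an archive outage.

    wal_rate_mib_per_s: peak WAL generation rate in MiB/s
    worst_outage_hours: longest archive outage to survive, in hours
    headroom:           extra fraction on top (0.3 = 30%)
    """
    accumulated_mib = wal_rate_mib_per_s * worst_outage_hours * 3600
    return math.ceil(accumulated_mib * (1 + headroom) / 1024)


if __name__ == "__main__":
    # Hypothetical workload: 2 MiB/s of WAL, archive down for 2 hours.
    print(wal_pvc_size_gib(2, 2))
```

For that hypothetical workload the result is 19 GiB, so a 20Gi WAL volume leaves almost no margin beyond a two-hour outage.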

Verify

Run these after deploying — green across the board is your definition of “done”:

# 1. Cluster healthy, expected number of ready instances, a primary elected
kubectl get cluster pg -n db \
  -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,\
INSTANCES:.status.instances,READY:.status.readyInstances,PRIMARY:.status.currentPrimary

# 2. Replication is streaming and synchronous standbys are connected
#    (pg_stat_replication and pg_stat_archiver are only meaningful on the primary)
PRIMARY=$(kubectl get cluster pg -n db -o jsonpath='{.status.currentPrimary}')
kubectl exec -n db "$PRIMARY" -c postgres -- \
  psql -tAc "SELECT application_name, state, sync_state, replay_lag
             FROM pg_stat_replication;"

# 3. WAL archiving is succeeding (failed_count should be 0)
kubectl exec -n db "$PRIMARY" -c postgres -- \
  psql -tAc "SELECT archived_count, failed_count FROM pg_stat_archiver;"

# 4. Each instance has its own bound PVC (data + WAL)
kubectl get pvc -n db -l cnpg.io/cluster=pg

# 5. Failover works: delete the primary, confirm currentPrimary flips
kubectl delete pod "$(kubectl get cluster pg -n db -o jsonpath='{.status.currentPrimary}')" \
  -n db --grace-period=0 --force
kubectl get cluster pg -n db -w

If sync_state shows sync/quorum for the expected standbys, failed_count is 0, and the primary flips on pod deletion, the cluster is doing what it claims.
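The sync_state check can be automated by parsing the output of the pg_stat_replication query. A minimal sketch, assuming the unaligned psql -tA format (one row per line, fields separated by "|"); the function name and the sample rows are hypothetical.

```python
def synchronous_standbys(psql_output: str) -> list[str]:
    """Names of standbys that are streaming and counted as synchronous.

    Expects rows of 'application_name|state|sync_state|replay_lag'.
    """
    names = []
    for line in psql_output.strip().splitlines():
        fields = line.split("|")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        name, state, sync_state = fields[0], fields[1], fields[2]
        if state == "streaming" and sync_state in ("sync", "quorum"):
            names.append(name)
    return names


if __name__ == "__main__":
    sample = "pg-2|streaming|quorum|00:00:00.001\npg-3|streaming|quorum|"
    standbys = synchronous_standbys(sample)
    required = 1  # matches number: 1 in the cluster spec
    print(standbys, "OK" if len(standbys) >= required else "DEGRADED")
```

Comparing the count against the configured number gives a pass/fail signal that can run in CI or a post-deploy job.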

Enterprise scenario

A fintech platform team ran a 3-node CloudNativePG cluster in a three-AZ region with synchronous replication (method: any, number: 1), but two of the instances had been scheduled into the same AZ. During a routine node-pool upgrade, a cloud-provider zonal disruption took down the AZ holding the primary and one standby within the same minute. The surviving standby was healthy, but writes hung. Their on-call assumed a failover bug.

The actual constraint: after promotion, the new primary had no standby left to acknowledge commits, and dataDurability: required (correctly) refused to drop to asynchronous commits, so writes blocked rather than risk RPO > 0. The cluster was choosing consistency over availability, exactly as configured. The team had picked strict durability without modeling a correlated loss of two instances.

Their fix was twofold. First, they spread instances across three failure domains with explicit anti-affinity so a single-AZ event can never take more than one instance — making the quorum robust to any one-zone loss:

spec:
  instances: 3
  affinity:
    enablePodAntiAffinity: true
    topologyKey: topology.kubernetes.io/zone
    podAntiAffinityType: required      # hard guarantee: no two instances share a zone
  postgresql:
    synchronous:
      method: any
      number: 1
      dataDurability: required         # consciously chosen: consistency > availability

Second — and this was the cultural shift — they wrote down the trade-off explicitly: for this workload, a brief write stall during a rare correlated double-failure is acceptable; silent data loss is not. They added an alert that distinguishes “commits blocking on synchronous quorum” from a generic outage, so on-call recognizes the condition as designed behavior rather than reflexively forcing the cluster into asynchronous mode and discarding the guarantee. The incident review’s headline: the database did exactly what they told it to — they just had not decided, in writing, what they were telling it.

Checklist
