Running a stateless web service on Kubernetes is a solved problem. Running a primary/replica PostgreSQL cluster — with synchronous replication, sub-30-second failover, and a recovery story that survives a fat-fingered DELETE — is where most teams get burned. The naive approach (a StatefulSet plus a sidecar that runs pg_ctl) handles the easy 80% and then quietly loses data during the failover you didn’t test.
This guide builds a production Postgres cluster on Kubernetes the way it actually holds up: a StatefulSet for stable identity and storage, a mature operator for the consensus and lifecycle logic, object-storage WAL archiving for durability, and a rehearsed PITR drill so recovery is a runbook, not a prayer. Examples use CloudNativePG because it is CNCF-hosted, Postgres-native, and avoids bolting on external Patroni/etcd machinery — but the architecture transfers to Zalando and Crunchy.
1. StatefulSet fundamentals: stable identities, ordered rollout, headless services
A Deployment gives you interchangeable, anonymous pods. Postgres needs the opposite: each replica has a durable identity, its own volume, and a known position in a topology. That is exactly what a StatefulSet provides.
Three guarantees matter for databases:
- Stable network identity. Pods are named <sts>-0, <sts>-1, … and keep that name across rescheduling. Paired with a headless Service (clusterIP: None), each pod gets a stable DNS record: pg-1.pg-headless.db.svc.cluster.local. Replicas can find the primary by name, not by a load balancer that might point anywhere.
- Stable storage. volumeClaimTemplates mint one PVC per pod (data-pg-0, data-pg-1). When pg-0 is rescheduled, it re-attaches its own PVC, not a fresh, empty one. Deleting the StatefulSet does not delete these PVCs; that is a feature for databases, a footgun if you forget.
- Ordered, controlled lifecycle. Pods are created and scaled 0 -> 1 -> 2 and terminated in reverse. Rollouts are one-at-a-time, so you never restart a quorum simultaneously.
apiVersion: v1
kind: Service
metadata:
  name: pg-headless
  namespace: db
spec:
  clusterIP: None # headless: per-pod DNS, no virtual IP
  selector:
    app: pg
  ports:
    - name: postgres
      port: 5432
The DNS-per-pod property is the load-bearing primitive. Replication, leader election, and client routing all depend on a replica being reachable at a name that survives a node failure. Hold this mental model; everything below builds on it.
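For reference, a hand-written StatefulSet that pairs with the headless Service above could look like the sketch below. It only illustrates how the three guarantees map onto fields (serviceName, volumeClaimTemplates, podManagementPolicy). The image tag, the Secret name pg-superuser, and the sizes are illustrative assumptions, and it contains no replication or failover logic at all.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pg
  namespace: db
spec:
  serviceName: pg-headless            # gives each pod a DNS record under the headless Service
  replicas: 3
  podManagementPolicy: OrderedReady   # create 0 -> 1 -> 2, terminate in reverse
  updateStrategy:
    type: RollingUpdate               # one pod at a time
  selector:
    matchLabels:
      app: pg
  template:
    metadata:
      labels:
        app: pg
    spec:
      containers:
        - name: postgres
          image: postgres:16.4        # illustrative image, not the operator's
          ports:
            - name: postgres
              containerPort: 5432
          env:
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: pg-superuser  # hypothetical Secret
                  key: password
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data                    # yields PVCs data-pg-0, data-pg-1, data-pg-2
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```

Every pod here starts as an independent, unreplicated Postgres, which is exactly the gap the next section addresses.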
2. Why a raw StatefulSet is not enough: choosing an operator
A StatefulSet gives you stable pods. It gives you nothing about which pod is the primary, how a replica is promoted when the primary dies, or how to fence a node that is partitioned but still writing. That logic — distributed consensus, leader election, fencing, replication topology reconciliation — is the hard part, and it is operator territory.
| Operator | Consensus / HA mechanism | Notes |
|---|---|---|
| CloudNativePG | Operator-driven, uses the Kubernetes API as the source of truth (no external DCS) | CNCF Sandbox, Postgres-native, instance manager as PID 1 |
| Zalando postgres-operator | Patroni + Kubernetes endpoints/configmaps as the DCS | Battle-tested at scale, Spilo image, Patroni semantics |
| Crunchy PGO | Patroni-based | Commercial backing, broad enterprise feature set |
For a greenfield platform I default to CloudNativePG: it treats the Kubernetes control plane as the consensus store (no etcd cluster to babysit beside your database), and the primary is tracked by a label the operator flips atomically. The rest of this guide uses it.
Install the operator (pin the version; never track latest for a stateful controller):
kubectl apply --server-side -f \
https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.25/releases/cnpg-1.25.0.yaml
# Wait for the controller to be ready before creating clusters
kubectl -n cnpg-system rollout status deployment/cnpg-controller-manager
3. Provisioning storage with the right StorageClass, volumeClaimTemplates, and topology
Storage choice decides your recovery time and whether failover even works. Two hard rules:
- Use a StorageClass with volumeBindingMode: WaitForFirstConsumer. Without it, a PVC can bind to a zone the scheduler later can’t place the pod into, deadlocking the rollout. WaitForFirstConsumer defers binding until the pod is scheduled, so volume and pod land in the same zone.
- Use block storage that supports your failover model. ReadWriteOnce (RWO) EBS/PD-style volumes are correct here: each replica owns its own copy, and replication (not a shared disk) provides redundancy. Do not try to share one RWX volume between primary and replica; that is data corruption waiting to happen.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pg-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"
  throughput: "250"
volumeBindingMode: WaitForFirstConsumer # critical for zonal correctness
allowVolumeExpansion: true # lets you grow PVCs without recreating pods
reclaimPolicy: Retain # don't auto-delete database volumes
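Now declare the cluster. CloudNativePG creates the instance pods, the read/write (-rw), read-only (-ro), and read (-r) Services, and per-instance PVCs for you. Note that the operator manages Pods and PVCs directly rather than generating a StatefulSet; it provides the same stable identity and per-instance storage guarantees through its own controller. Spreading instances across zones is done with affinity: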
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg
  namespace: db
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16.4
  storage:
    storageClass: pg-ssd
    size: 100Gi
  # Dedicated volume for WAL keeps write spikes off the data disk
  walStorage:
    storageClass: pg-ssd
    size: 20Gi
  affinity:
    enablePodAntiAffinity: true
    topologyKey: topology.kubernetes.io/zone # one instance per zone
    podAntiAffinityType: required
  resources:
    requests:
      memory: "4Gi"
      cpu: "2"
    limits:
      memory: "4Gi"
Putting WAL on its own PVC (walStorage) is not cosmetic: WAL is a sequential, fsync-bound write stream, and isolating it keeps commit latency from competing with checkpoint and query I/O on the data volume.
4. Configuring synchronous vs asynchronous replicas and quorum-based failover
This is the decision that defines your data-durability guarantee, and the one teams most often get wrong by accident.
- Asynchronous replication (default): the primary commits and acknowledges the client before replicas confirm receipt. Fast, but a primary failure can lose the last few transactions (RPO > 0).
- Synchronous replication: the primary waits for at least one standby to confirm the WAL record is persisted before acknowledging the commit. RPO = 0, at the cost of commit latency and an availability coupling.
The trap with naive synchronous setups (synchronous_standby_names = '1 (s1,s2)', a fixed priority list) is that commits block for as long as no listed standby is connected, and nothing maintains that list as pods come and go. CloudNativePG solves this with a quorum-based config that the operator keeps in step with the cluster's current standbys, so a single unavailable replica does not wedge commits:
spec:
  instances: 3
  # Seconds to wait after the primary is detected unhealthy before failing over
  failoverDelay: 0
  postgresql:
    synchronous:
      method: any # quorum: ANY N synchronous standbys
      number: 1 # require 1 of the available standbys to confirm
      dataDurability: required # never silently fall back to async on commit
With method: any and number: 1 across three instances, a commit needs one of the two standbys to confirm — you tolerate losing a single standby with zero data loss and no commit blocking. Setting dataDurability: required means the operator will refuse to degrade to asynchronous behavior to keep writes flowing; if you would rather preserve availability over strict RPO=0, use preferred and understand you are accepting potential loss during a double failure.
Rule of thumb: synchronous-with-quorum needs at least 3 instances. With 2 instances, synchronous replication makes the primary’s availability depend on the single standby — the opposite of what you wanted.
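If the workload would rather keep accepting writes than hold a strict RPO = 0, the availability-first variant differs from the stanza above in a single field. The fragment below is a sketch of that variant, reusing the field names from the example above; whether it is appropriate depends on how much loss the workload can tolerate.

```yaml
spec:
  instances: 3
  postgresql:
    synchronous:
      method: any
      number: 1
      # preferred: keep commits flowing when no standby can confirm,
      # accepting possible loss of recent transactions in that window
      dataDurability: preferred
```

Treat the choice between required and preferred as a written decision rather than a default, since it determines what the cluster does when standbys are unavailable.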
5. Automated failover mechanics: leader election, fencing, and split-brain avoidance
When the primary dies, four things must happen in order, and the order is what prevents split-brain (two pods both believing they are primary and both accepting writes):
primary unhealthy
|
(1) operator detects via liveness + replication state
|
(2) FENCE the old primary -> it is demoted / stopped, cannot accept writes
|
(3) ELECT most-advanced standby (least WAL lag) as new primary
|
(4) promote it, repoint the -rw Service, reconfigure remaining replicas
The non-negotiable step is (2) fencing. A primary that is network-partitioned but still running will happily accept writes that the rest of the cluster never sees. If you promote a standby without first guaranteeing the old primary cannot write, you now have two divergent timelines — split-brain — and reconciling them means losing one side’s data. CloudNativePG fences by demoting the old primary and only flips the cnpg.io/instanceRole=primary label (which the -rw Service selects on) once a single primary is guaranteed.
Because CloudNativePG uses the Kubernetes API as its source of truth, “who is primary” is a single atomic label update on one object — there is no separate etcd/Consul that can disagree with the cluster’s view. That eliminates an entire class of DCS-vs-database disagreement bugs.
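To make the label-based routing concrete, the read/write Service maintained by the operator looks roughly like the sketch below. The operator generates and owns this object, so it is shown for illustration only and should not be applied by hand; the exact labels and fields are an assumption and may differ between operator versions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: pg-rw
  namespace: db
spec:
  type: ClusterIP
  selector:
    cnpg.io/cluster: pg
    cnpg.io/instanceRole: primary   # only the pod currently labeled primary receives writes
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
```

Running kubectl get svc pg-rw -n db -o yaml against a live cluster shows the real object, which is the authoritative version of this sketch.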
You can also fence manually for maintenance — e.g., to take an instance out of rotation safely:
# Fence a specific instance (Postgres is shut down on it; the pod keeps running)
kubectl cnpg fencing on pg pg-2 -n db
# ... perform node maintenance, then ...
kubectl cnpg fencing off pg pg-2 -n db
To validate failover, kill the primary and watch promotion — don’t wait for an incident to find out your RTO:
PRIMARY=$(kubectl get cluster pg -n db -o jsonpath='{.status.currentPrimary}')
kubectl delete pod "$PRIMARY" -n db --grace-period=0 --force
kubectl get cluster pg -n db -w # watch currentPrimary flip to a standby
6. Continuous WAL archiving to object storage and point-in-time recovery
Replication is not a backup. Replication faithfully copies a DROP TABLE to every standby in milliseconds. Durability against logical mistakes and total-cluster loss requires a base backup plus a continuous stream of WAL in object storage, which together enable point-in-time recovery — restoring to any moment, including “one second before the bad migration ran.”
Configure continuous archiving to S3 (CloudNativePG uses Barman Cloud under the hood). Store credentials in a Secret, never inline:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg
  namespace: db
spec:
  instances: 3
  backup:
    barmanObjectStore:
      destinationPath: s3://acme-pg-backups/pg
      s3Credentials:
        accessKeyId:
          name: pg-backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: pg-backup-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
        maxParallel: 8 # parallelize WAL upload to keep up with write load
      data:
        compression: gzip
    retentionPolicy: "30d" # operator prunes backups + WAL older than 30 days
Take an on-demand base backup (and schedule recurring ones with a ScheduledBackup):
# One-off base backup
kubectl cnpg backup pg -n db
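# Confirm WAL is actually shipping (the part people forget to check).
# Query the current primary, since that is where WAL archiving happens.
PRIMARY=$(kubectl get cluster pg -n db -o jsonpath='{.status.currentPrimary}')
kubectl exec -n db "$PRIMARY" -c postgres -- \
  psql -tAc "SELECT archived_count, failed_count, last_failed_wal
             FROM pg_stat_archiver;"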
failed_count climbing or last_failed_wal set means archiving is broken and your PITR window is silently frozen — alert on it.
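Recurring base backups can be declared with a ScheduledBackup object. The sketch below assumes a nightly schedule and the hypothetical name pg-nightly; note that the schedule field uses a six-field cron expression with seconds first, which is easy to get wrong when copying a standard five-field crontab line.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: pg-nightly            # hypothetical name
  namespace: db
spec:
  # Six fields, seconds first: every day at 02:00:00
  schedule: "0 0 2 * * *"
  backupOwnerReference: self  # Backup objects are owned by this ScheduledBackup
  immediate: true             # also take one backup as soon as this object is created
  cluster:
    name: pg
```

A shorter interval between base backups reduces how much WAL a restore has to replay, which shortens PITR time at the cost of more object-storage usage.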
To perform PITR, you create a new Cluster that bootstraps via recovery, pointing at the backup and a target time. CloudNativePG replays WAL from the base backup up to the target and stops:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-restore
  namespace: db
spec:
  instances: 3
  storage:
    storageClass: pg-ssd
    size: 100Gi
  bootstrap:
    recovery:
      source: pg # references the externalCluster below
      recoveryTarget:
        targetTime: "2026-06-08 11:32:00+00" # restore to just before the bad change
  externalClusters:
    - name: pg
      barmanObjectStore:
        destinationPath: s3://acme-pg-backups/pg
        s3Credentials:
          accessKeyId:
            name: pg-backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: pg-backup-creds
            key: SECRET_ACCESS_KEY
        wal:
          maxParallel: 8
Drill this quarterly against real backups. A PITR procedure you have never executed is a hypothesis, not a recovery plan. Measure wall-clock RTO and confirm row counts post-restore.
7. Rolling minor upgrades and major version upgrades
Minor upgrades (16.4 -> 16.5) are routine: bump imageName. The operator performs a rolling update: replicas first, then a controlled switchover so the primary restarts last, minimizing the write-path outage to a single fast failover:
kubectl patch cluster pg -n db --type merge \
-p '{"spec":{"imageName":"ghcr.io/cloudnative-pg/postgresql:16.5"}}'
kubectl get cluster pg -n db -w # watch the rolling switchover
Control how the primary is updated via primaryUpdateStrategy (unsupervised lets the operator switch over automatically; supervised waits for you to trigger it during a maintenance window).
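As a sketch, a cluster that pauses before touching the primary could carry the two fields below. primaryUpdateMethod is a companion field not discussed above; it selects between promoting an already-updated replica and restarting the primary in place, and the values shown are one plausible combination rather than a recommendation.

```yaml
spec:
  # supervised: update replicas, then wait for a human to move the primary
  primaryUpdateStrategy: supervised
  # switchover: promote an updated replica instead of restarting the primary in place
  primaryUpdateMethod: switchover
```

With supervised, the rollout stops once the replicas are on the new image. Promoting one of them, for example with kubectl cnpg promote pg pg-2 -n db, completes the upgrade at a moment you choose.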
Major upgrades (16 -> 17) change the on-disk format and cannot be a simple image swap. Two safe paths:
- Logical replication / blue-green: stand up a v17 cluster, replicate from v16 via a Subscription, cut over when caught up. Near-zero downtime, more moving parts.
- Import into a new cluster: CloudNativePG supports bootstrapping a new major-version cluster that imports from the old one. In the operator release pinned above, this import is a logical dump and restore (pg_dump/pg_restore) rather than an in-place pg_upgrade. Plan a short maintenance window and test on a clone first; extension compatibility (PostGIS, pgvector) is the usual source of surprises.
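The import path can be sketched as a new Cluster whose bootstrap pulls one database from the old cluster over a regular connection. The cluster name pg17, the image tag, the database name app, and the Secret pg-superuser are assumptions for illustration; the source cluster must allow a superuser connection for this to work, and writes to the old cluster should be stopped before the import starts.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg17                      # hypothetical name for the new cluster
  namespace: db
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:17.2   # illustrative tag
  storage:
    storageClass: pg-ssd
    size: 100Gi
  bootstrap:
    initdb:
      import:
        type: microservice        # import a single database into the new cluster
        databases:
          - app                   # assumed database name
        source:
          externalCluster: pg
  externalClusters:
    - name: pg
      connectionParameters:
        host: pg-rw.db.svc        # read/write Service of the old cluster
        user: postgres
        dbname: postgres
      password:
        name: pg-superuser        # assumed Secret holding the superuser password
        key: password
```

Because this is a logical copy, import time grows with database size, so rehearse it on a clone to learn the real maintenance window.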
8. Monitoring, connection pooling with PgBouncer, and capacity planning
Connection pooling is mandatory, not optional. Each Postgres connection is a backend process with real memory cost; a few hundred app pods opening connections directly will exhaust max_connections and OOM the database. CloudNativePG ships a first-class Pooler (PgBouncer) — point your apps at the pooler service, not the database:
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: pg-pooler-rw
  namespace: db
spec:
  cluster:
    name: pg
  instances: 3
  type: rw # pool writes to the current primary
  pgbouncer:
    poolMode: transaction # transaction pooling -> highest connection multiplexing
    parameters:
      max_client_conn: "1000"
      default_pool_size: "25"
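poolMode: transaction gives the best multiplexing but forbids session-level features (SQL-level PREPARE that spans transactions, SET that outlives a transaction, session-level advisory locks). Protocol-level prepared statements depend on the PgBouncer version and its max_prepared_statements setting, so confirm your app and ORM tolerate this mode before shipping.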
For monitoring, CloudNativePG exposes Prometheus metrics natively. Enable a PodMonitor and alert on the signals that actually predict incidents:
spec:
  monitoring:
    enablePodMonitor: true
The four metrics worth paging on:
| Signal | Why it matters |
|---|---|
| Replication lag (bytes/seconds) | A lagging standby is not a valid failover target: silent RPO risk |
| pg_stat_archiver.failed_count | WAL archiving broken == PITR window frozen |
| PVC disk usage on data and WAL volumes | A full WAL disk stops the primary from accepting writes |
| Connection saturation vs max_connections | Predicts the OOM that pooling is meant to prevent |
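Three of these signals can be turned into alerts with a PrometheusRule, assuming the Prometheus Operator is installed. In the sketch below the cnpg_* metric names are assumptions based on the operator's default metric set, and the thresholds are placeholders; check the names against the /metrics output of an instance before relying on them. The kubelet_volume_stats_* metrics are the standard kubelet volume metrics.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pg-alerts                 # hypothetical name
  namespace: db
spec:
  groups:
    - name: pg.rules
      rules:
        - alert: PgReplicationLagHigh
          # Assumed metric name; lag in seconds as reported by each standby
          expr: max by (pod) (cnpg_pg_replication_lag{namespace="db"}) > 30
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Standby {{ $labels.pod }} is lagging and is a poor failover target"
        - alert: PgWalArchivingFailing
          # Assumed metric name; any new archive failure in the last 10 minutes
          expr: increase(cnpg_pg_stat_archiver_failed_count{namespace="db"}[10m]) > 0
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "WAL archiving is failing; the PITR window is not advancing"
        - alert: PgVolumeFillingUp
          # Matches CNPG-style PVC names such as pg-1 and pg-1-wal
          expr: |
            kubelet_volume_stats_used_bytes{namespace="db", persistentvolumeclaim=~"pg-[0-9]+(-wal)?"}
              / kubelet_volume_stats_capacity_bytes{namespace="db", persistentvolumeclaim=~"pg-[0-9]+(-wal)?"}
              > 0.8
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is over 80% full"
```

Connection saturation is left out because the right expression depends on whether clients go through the pooler or connect directly.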
For capacity planning, the disk that bites you is WAL: a long-running replication slot or a stalled archive can pin WAL indefinitely and fill the volume, at which point the primary can no longer write WAL and stops serving writes. Size the WAL PVC for your worst-case archive outage, cap max_slot_wal_keep_size, and alert on WAL volume usage well before 100%.
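The slot cap is an ordinary PostgreSQL parameter and can be set through the Cluster spec. The fragment below is a sketch in which the 10GB value is an assumption, chosen as half of the 20Gi WAL volume used earlier; derive the real number from your own write rate and tolerated outage length.

```yaml
spec:
  walStorage:
    storageClass: pg-ssd
    size: 20Gi
  postgresql:
    parameters:
      # Upper bound on WAL retained for replication slots. A slot that falls
      # further behind is invalidated instead of filling the volume.
      max_slot_wal_keep_size: "10GB"
```

This cap bounds only slot-retained WAL. WAL held back by a failing archive is not limited by it, which is why the archiver alert and the volume-usage alert are still needed.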
Verify
Run these after deploying — green across the board is your definition of “done”:
# 1. Cluster healthy, expected number of ready instances, a primary elected
kubectl get cluster pg -n db \
-o custom-columns=NAME:.metadata.name,STATUS:.status.phase,\
INSTANCES:.status.instances,READY:.status.readyInstances,PRIMARY:.status.currentPrimary
# 2. Replication is streaming and synchronous standbys are connected
#    (pg_stat_replication is only populated on the primary)
PRIMARY=$(kubectl get cluster pg -n db -o jsonpath='{.status.currentPrimary}')
kubectl exec -n db "$PRIMARY" -c postgres -- \
  psql -tAc "SELECT application_name, state, sync_state, replay_lag
             FROM pg_stat_replication;"
# 3. WAL archiving is succeeding (failed_count should be 0)
kubectl exec -n db "$PRIMARY" -c postgres -- \
  psql -tAc "SELECT archived_count, failed_count FROM pg_stat_archiver;"
# 4. Each instance has its own bound PVC (data + WAL)
kubectl get pvc -n db -l cnpg.io/cluster=pg
# 5. Failover works: delete the primary, confirm currentPrimary flips
kubectl delete pod "$(kubectl get cluster pg -n db -o jsonpath='{.status.currentPrimary}')" \
-n db --grace-period=0 --force
kubectl get cluster pg -n db -w
If sync_state shows sync/quorum for the expected standbys, failed_count is 0, and the primary flips on pod deletion, the cluster is doing what it claims.
Enterprise scenario
A fintech platform team ran a 3-node CloudNativePG cluster on nodes spread across three AZs with synchronous replication (method: any, number: 1). During a routine node-pool upgrade, a cloud-provider zonal disruption took down the AZ holding the primary and one standby within the same minute. The surviving standby was healthy, but writes hung. Their on-call assumed a failover bug.
The actual constraint: losing one of the two standbys would have been fine, because a quorum requiring one confirmation can still be met by the other. This event removed two instances at once. After the surviving standby was promoted, no standby remained to acknowledge commits, and dataDurability: required (correctly) refused to drop to asynchronous commits, so the new primary blocked writes rather than risk RPO > 0. The cluster was choosing consistency over availability, exactly as configured. The team had picked strict durability without modeling a correlated loss of two instances.
Their fix was twofold. First, they spread instances across three failure domains with explicit anti-affinity so a single-AZ event can never take more than one instance — making the quorum robust to any one-zone loss:
spec:
  instances: 3
  affinity:
    enablePodAntiAffinity: true
    topologyKey: topology.kubernetes.io/zone
    podAntiAffinityType: required # hard guarantee: no two instances share a zone
  postgresql:
    synchronous:
      method: any
      number: 1
      dataDurability: required # consciously chosen: consistency > availability
Second — and this was the cultural shift — they wrote down the trade-off explicitly: for this workload, a brief write stall during a rare correlated double-failure is acceptable; silent data loss is not. They added an alert that distinguishes “commits blocking on synchronous quorum” from a generic outage, so on-call recognizes the condition as designed behavior rather than reflexively forcing the cluster into asynchronous mode and discarding the guarantee. The incident review’s headline: the database did exactly what they told it to — they just had not decided, in writing, what they were telling it.