The Cardinal Rule: Never Tar a Live Datadir
If you take one thing from this lesson, take this: a tar of /var/lib/postgresql/data while postgres is running is not a backup. It restores cleanly enough to fool you and corrupts on the first checkpoint. The same applies to MySQL’s /var/lib/mysql, MongoDB’s /var/lib/mongodb, and every other transactional store.
The reason is fundamental. Databases hold in-memory state (dirty pages, transaction log buffers, lock tables) that the on-disk files only become consistent with at flush points. Filesystem-level copy crosses files at arbitrary times relative to those flushes; the resulting copy is a frankenbackup that mixes pre-checkpoint and post-checkpoint pages from the same transaction. Postgres calls this “torn pages”; MySQL calls it “inconsistent InnoDB state.”
The fix is to use the database’s own backup tool, which coordinates with the running engine to produce a transactionally consistent snapshot. This lesson is about wrapping those tools in shell scripts that handle the operational reality: failure modes, retention, encryption, off-host transport, and PITR drills.
The Four Patterns
| Pattern | Tool | Use case | Restore time |
|---|---|---|---|
| Logical dump | pg_dump, mysqldump --single-transaction |
Small DBs, version migrations, partial restores | O(data) — slow for large |
| Physical base backup | pg_basebackup, xtrabackup |
Large DBs, fast restore, PITR baseline | O(data on disk) — fast |
| WAL/binlog archiving | wal-g, barman, MySQL binlog |
Continuous backup, PITR to any second | Apply WAL since base |
| Online schema change | gh-ost, pt-online-schema-change |
DDL on huge tables without locks | N/A — migration tool |
A real production environment uses base backup + continuous WAL for PITR (point-in-time recovery), with logical dumps as a defense-in-depth secondary that’s portable across versions.
Pattern 1: Logical Dumps With Postgres pg_dump
pg_dump runs as a regular Postgres client. It opens a single transaction with REPEATABLE READ isolation, scans every table, and emits SQL or a custom binary format. Because it’s transactional, the dump is consistent regardless of concurrent writes.
#!/usr/bin/env bash
# pg-logical-backup.sh
set -euo pipefail
readonly DB=myapp_prod
readonly OUT_DIR=/var/backups/postgres
readonly STAMP=$(date +%Y-%m-%d-%H%M)
readonly OUT="$OUT_DIR/${DB}-${STAMP}.dump"
mkdir -p "$OUT_DIR"
# Custom format (-Fc) is compressed and supports parallel restore.
# --no-owner / --no-acl strip env-specific identifiers (useful for cross-env restore).
PGPASSFILE=/etc/postgres/pgpass \
pg_dump \
--host="$PGHOST" \
--username="$PGUSER" \
--dbname="$DB" \
--format=custom \
--compress=9 \
--jobs=4 \
--no-owner --no-acl \
--file="$OUT.tmp"
# Atomic rename (sidecar pattern from L28)
mv "$OUT.tmp" "$OUT"
# Sidecar checksum
sha256sum "$OUT" > "$OUT.sha256"
# Verify the dump can be listed (cheap structural sanity check)
pg_restore --list "$OUT" > /dev/null \
|| { echo "FAIL: pg_restore --list failed on $OUT"; exit 1; }
echo "OK: $OUT ($(stat -c %s "$OUT") bytes)"
Two non-obvious flags:
--jobs=4parallelizes the dump across 4 worker processes. Cuts dump time roughly proportionally on multi-core boxes. Only works with--format=directoryor--format=custom.--no-owner --no-acldropsOWNER TOandGRANTstatements. Critical for cross-environment restore (your prod owner doesn’t exist in staging) but dangerous for in-place restore because it loses ACLs. Use only for DR-to-different-env.
Why --format=custom Over Plain SQL
Plain SQL (--format=plain) is human-readable but cannot be parallel-restored, can’t selectively restore one table, and is ~3× larger uncompressed. Custom format (--format=custom) is a binary archive that:
- Compresses internally (no need to pipe through
gzip). - Supports
pg_restore --listfor inspection without restoring. - Supports
pg_restore --jobs=Nfor parallel restore. - Supports selective restore (
pg_restore --table=foo).
Always use custom format for production backups. Reserve plain SQL for dev exports where you want to grep or hand-edit the dump.
Restore Verification
# Fast structural verification (does NOT prove data correctness)
pg_restore --list "$OUT"
# Full restore to a sandbox DB
createdb "${DB}_drill"
pg_restore --dbname="${DB}_drill" --jobs=4 "$OUT"
# Smoke test — the most important step
psql --dbname="${DB}_drill" --command="SELECT count(*) FROM critical_orders_table"
The smoke test must be domain-specific. For an e-commerce DB it might be “orders count is within 5% of yesterday’s.” For a session store it might be “table exists and is queryable.” Always pair manifest verification with a domain smoke test.
Pattern 2: Logical Dumps With MySQL mysqldump
The MySQL equivalent is mysqldump. The critical flag is --single-transaction for InnoDB tables:
#!/usr/bin/env bash
# mysql-logical-backup.sh
set -euo pipefail
readonly DB=myapp_prod
readonly OUT_DIR=/var/backups/mysql
readonly STAMP=$(date +%Y-%m-%d-%H%M)
readonly OUT="$OUT_DIR/${DB}-${STAMP}.sql.zst"
mkdir -p "$OUT_DIR"
# --single-transaction → InnoDB consistent snapshot (no table locks)
# --routines --triggers → include stored procedures and triggers
# --master-data=2 → include binlog position as comment (for PITR baseline)
# --hex-blob → safe binary encoding for blob columns
# Pipe through zstd for compression
mysqldump \
--defaults-file=/etc/mysql/backup.cnf \
--single-transaction \
--routines --triggers \
--master-data=2 \
--hex-blob \
--databases "$DB" \
| zstd -9 -o "$OUT.tmp"
mv "$OUT.tmp" "$OUT"
sha256sum "$OUT" > "$OUT.sha256"
# Quick sanity: extract first 100 lines, verify SQL header
zstd -d -c "$OUT" | head -100 | grep -q "MySQL dump" \
|| { echo "FAIL: $OUT does not look like a MySQL dump"; exit 1; }
echo "OK: $OUT ($(stat -c %s "$OUT") bytes)"
The MyISAM Trap
--single-transaction only provides consistency for InnoDB tables. If your schema has any MyISAM tables, they are not covered by the snapshot — mysqldump falls back to LOCK TABLES for those, which (a) blocks writers and (b) doesn’t give you a multi-table consistent view.
Fix: Convert all production tables to InnoDB. MyISAM has been deprecated for over a decade; it has no place in modern production. Or, if you must keep MyISAM, accept that backup time = downtime for those tables.
--master-data=2 for PITR Baseline
--master-data=2 writes the current binlog filename and position as a comment at the top of the dump. The comment is critical for PITR: when you restore the dump, you know the exact binlog position to start replaying from to reach any later state.
-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=456789;
Without --master-data, the dump is a fixed point in time with no continuation. With it, the dump is a baseline that PITR can build on. Always use --master-data=2 for production backups (the =2 makes it a comment so it doesn’t auto-execute on restore).
Pattern 3: Physical Base Backups + WAL Archiving (PITR)
For databases over ~50 GB, logical dumps become too slow. The answer is physical backups + continuous WAL archiving, which gives:
- Fast base restore: just untar the data directory (vs. replaying SQL).
- PITR: restore to any second by replaying WAL since the base backup.
- Lower CPU overhead on the primary (no SQL serialization).
Postgres: pg_basebackup + WAL-G
The shell-friendly stack is:
pg_basebackupto take the base backup.wal-gfor continuous WAL push to S3 with compression and encryption.restore_commandinrecovery.conf(orrestore_commandGUC) to fetch WAL during recovery.
#!/usr/bin/env bash
# pg-base-backup.sh — weekly, takes a fresh PITR baseline
set -euo pipefail
readonly DEST=/var/backups/postgres-base
readonly STAMP=$(date +%Y-%m-%d)
readonly OUT="$DEST/$STAMP"
mkdir -p "$OUT"
pg_basebackup \
--pgdata="$OUT" \
--format=tar \
--gzip \
--progress \
--verbose \
--wal-method=stream \
--checkpoint=fast \
--label="weekly-base-$STAMP"
# wal-g handles continuous archive (separate systemd timer / cron)
# This script just creates the periodic baseline.
# Atomic flag file marks completion
touch "$OUT/.complete"
# sha256 the base files for integrity
( cd "$OUT" && find . -type f -print0 | xargs -0 sha256sum | sort -k 2 ) > "$OUT.manifest"
sha256sum "$OUT.manifest" > "$OUT.manifest.sha256"
echo "OK: base backup at $OUT"
The .complete flag file is the atomic completion marker. Restore scripts must check for .complete before considering a base backup usable; otherwise an in-progress (interrupted) backup could be picked up.
WAL Archiving Configuration
In postgresql.conf:
archive_mode = on
archive_command = 'wal-g wal-push %p'
archive_timeout = 60 # force WAL switch every 60s for tighter RPO
The archive_command runs synchronously on every WAL switch. If it fails, Postgres retains the WAL until archive_command succeeds — which means a broken archive_command can fill pg_wal/ and stop writes. Monitor it:
# Prometheus textfile exporter — runs every minute
unarchived_count=$(psql -tAc "SELECT count(*) FROM pg_ls_waldir() WHERE name NOT IN (SELECT wal_segment_name FROM pg_stat_archiver)")
last_archive_time=$(psql -tAc "SELECT extract(epoch from now() - last_archived_time) FROM pg_stat_archiver")
cat > /var/lib/node_exporter/textfile_collector/pg_archive.prom.tmp <<EOF
pg_wal_unarchived_count $unarchived_count
pg_wal_last_archive_age_seconds $last_archive_time
EOF
mv /var/lib/node_exporter/textfile_collector/pg_archive.prom{.tmp,}
Alert on pg_wal_last_archive_age_seconds > 300 — if WAL archiving stalls for >5 min, your PITR window is degrading.
MySQL: Percona XtraBackup
The MySQL equivalent of pg_basebackup is Percona XtraBackup (the official Oracle tool, MySQL Enterprise Backup, is closed source).
#!/usr/bin/env bash
# mysql-base-backup.sh
set -euo pipefail
readonly DEST=/var/backups/mysql-base
readonly STAMP=$(date +%Y-%m-%d)
readonly OUT="$DEST/$STAMP"
mkdir -p "$OUT"
xtrabackup \
--backup \
--target-dir="$OUT" \
--datadir=/var/lib/mysql \
--user=backup \
--password-file=/etc/mysql/backup-pass \
--parallel=4 \
--compress \
--compress-threads=4
# --prepare must run before restore to apply uncommitted log
# But run it in a separate sandbox during drill, NOT in-place,
# because --prepare modifies the backup files and breaks PITR chain.
touch "$OUT/.complete"
echo "OK: xtrabackup at $OUT"
For continuous binlog archiving, MySQL has no first-class equivalent of wal-g. The pattern is to copy mysql-bin.* files to S3 via cron:
# binlog-archive.sh — runs every 60s
flush_logs() {
mysql --execute='FLUSH BINARY LOGS' 2>/dev/null
}
archive_logs() {
for binlog in /var/log/mysql/mysql-bin.[0-9]*; do
[[ -f "$binlog" ]] || continue
# Skip the active log (mysqld is still writing to it)
[[ "$binlog" == "$(mysql --execute='SHOW MASTER STATUS\G' \
| awk '/File:/ {print $2}')" ]] && continue
aws s3 cp "$binlog" "s3://myapp-binlogs/$(basename "$binlog")" \
&& rm -f "$binlog"
done
}
flush_logs
archive_logs
The SHOW MASTER STATUS check ensures we don’t archive the currently-active binlog (which would race with mysqld appending to it).
Pattern 4: Online Schema Migrations Without Lock Outages
ALTER TABLE big_table ADD COLUMN ... on a 100GB MySQL/Postgres table can lock the table for hours. Online schema change tools rebuild the table in the background by:
- Creating a shadow table with the new schema.
- Copying rows in chunks, throttled to avoid replication lag.
- Capturing changes via triggers (Postgres
pg_repack/gh-ostfor MySQL) or logical replication. - Atomic swap (
RENAME TABLE) at the end.
gh-ost for MySQL
GitHub’s gh-ost uses MySQL’s binlog instead of triggers, so it adds zero overhead to writes:
#!/usr/bin/env bash
# Wrapping gh-ost in shell with safety guardrails.
set -euo pipefail
readonly DB=myapp_prod
readonly TABLE=orders
readonly ALTER='ADD COLUMN delivery_zone VARCHAR(64) DEFAULT NULL'
# Pre-flight: replication lag must be low
LAG=$(mysql --batch --skip-column-names \
--execute='SHOW SLAVE STATUS\G' \
| awk '/Seconds_Behind_Master/ {print $2}')
if [[ "$LAG" == "NULL" ]] || (( LAG > 30 )); then
echo "FAIL: replication lag is $LAG (must be < 30s)"
exit 1
fi
# Pre-flight: free disk space must exceed 2× table size
TABLE_SIZE=$(mysql --batch --skip-column-names --execute="
SELECT data_length + index_length FROM information_schema.tables
WHERE table_schema='$DB' AND table_name='$TABLE'")
FREE=$(df --output=avail -B1 /var/lib/mysql | tail -1)
if (( FREE < TABLE_SIZE * 2 )); then
echo "FAIL: free disk $FREE < 2× table size $TABLE_SIZE"
exit 1
fi
# Run gh-ost
gh-ost \
--user=ghost --password-file=/etc/mysql/ghost-pass \
--host=replica.internal \
--database="$DB" --table="$TABLE" \
--alter="$ALTER" \
--max-load='Threads_running=25' \
--critical-load='Threads_running=100' \
--chunk-size=1000 \
--throttle-control-replicas='replica2.internal,replica3.internal' \
--switch-to-rbr \
--execute
Critical safety flags:
--max-load='Threads_running=25'— pause copying when concurrent threads exceed 25.--critical-load='Threads_running=100'— abort if it spikes beyond 100.--throttle-control-replicas— pause copying if any replica falls behind.--switch-to-rbr— switch the migration session to row-based replication for safer cutover.
Without these, gh-ost can saturate your DB during peak traffic. The pre-flight checks (replication lag, free disk) prevent the most common foot-shoot.
pg_repack for Postgres
Postgres uses logical replication slots and triggers in pg_repack:
#!/usr/bin/env bash
set -euo pipefail
readonly DB=myapp_prod
readonly TABLE=orders
# pg_repack rewrites the table in place, removes bloat
pg_repack \
--dbname="$DB" \
--table="$TABLE" \
--jobs=4 \
--no-order # skip CLUSTER-style sort, just reclaim bloat
For schema changes (vs. just bloat reclaim), the Postgres-native approach is:
- Add nullable column (
ALTER TABLE ... ADD COLUMN x text— fast metadata-only op in modern Postgres). - Backfill in batches (
UPDATE ... WHERE x IS NULL LIMIT 1000). - Add NOT NULL constraint with
NOT VALIDfirst, thenVALIDATE CONSTRAINTlater.
Wrapped in shell:
backfill_in_batches() {
local total=0 batch=1000
while :; do
n=$(psql -tAc "
WITH cte AS (
SELECT id FROM orders WHERE delivery_zone IS NULL LIMIT $batch FOR UPDATE SKIP LOCKED
)
UPDATE orders SET delivery_zone = compute_zone(address)
FROM cte WHERE orders.id = cte.id
RETURNING 1
" | wc -l)
(( n == 0 )) && break
total=$((total + n))
echo "Backfilled $total rows so far"
sleep 0.1 # throttle
done
}
FOR UPDATE SKIP LOCKED is the magic: rows currently locked by an active transaction are skipped (rather than blocking), making the backfill safe to run during peak traffic.
The Drop-In lib/db.sh
# lib/db.sh — sourced helpers for database backup scripts.
#
# Depends on lib/backup.sh for upload_s3 / verify_remote / GFS prune.
#
# Usage:
# source /usr/local/lib/backup.sh
# source /usr/local/lib/db.sh
# pg_logical_dump myapp_prod /var/backups/postgres
# pg_basebackup_take /var/backups/postgres-base
# mysql_logical_dump myapp_prod /var/backups/mysql
set -o errexit -o nounset -o pipefail
db_log() { backup_log "[db] $*"; }
# Postgres logical dump in custom format. Args: db, dest_dir
pg_logical_dump() {
local db="$1" dest="$2"
local stamp out
stamp=$(date +%Y-%m-%d-%H%M)
mkdir -p "$dest"
out="$dest/${db}-${stamp}.dump"
db_log "pg_dump start: $db"
PGPASSFILE=/etc/postgres/pgpass \
pg_dump --dbname="$db" --format=custom --compress=9 --jobs=4 \
--no-owner --no-acl --file="$out.tmp"
mv "$out.tmp" "$out"
sha256sum "$out" > "$out.sha256"
pg_restore --list "$out" > /dev/null \
|| { db_log "FAIL: pg_restore --list failed"; return 1; }
db_log "pg_dump OK: $out ($(stat -c %s "$out") bytes)"
printf '%s\n' "$out"
}
# Postgres physical base backup. Args: dest_dir
pg_basebackup_take() {
local dest="$1"
local stamp out
stamp=$(date +%Y-%m-%d)
mkdir -p "$dest"
out="$dest/$stamp"
mkdir -p "$out"
db_log "pg_basebackup start"
pg_basebackup --pgdata="$out" --format=tar --gzip \
--wal-method=stream --checkpoint=fast \
--label="base-$stamp"
( cd "$out" && find . -type f -print0 | xargs -0 sha256sum | sort -k 2 ) > "$out.manifest"
sha256sum "$out.manifest" > "$out.manifest.sha256"
touch "$out/.complete"
db_log "pg_basebackup OK: $out"
printf '%s\n' "$out"
}
# MySQL logical dump with --single-transaction. Args: db, dest_dir
mysql_logical_dump() {
local db="$1" dest="$2"
local stamp out
stamp=$(date +%Y-%m-%d-%H%M)
mkdir -p "$dest"
out="$dest/${db}-${stamp}.sql.zst"
db_log "mysqldump start: $db"
mysqldump --defaults-file=/etc/mysql/backup.cnf \
--single-transaction --routines --triggers \
--master-data=2 --hex-blob \
--databases "$db" \
| zstd -9 -o "$out.tmp"
mv "$out.tmp" "$out"
sha256sum "$out" > "$out.sha256"
zstd -d -c "$out" | head -100 | grep -q "MySQL dump" \
|| { db_log "FAIL: dump format check failed"; return 1; }
db_log "mysqldump OK: $out ($(stat -c %s "$out") bytes)"
printf '%s\n' "$out"
}
# PITR drill — restore latest base + replay WAL to a target time. Args: target_iso
pg_pitr_drill() {
local target="$1"
local sandbox=/srv/pg-pitr-drill
rm -rf "$sandbox"
mkdir -p "$sandbox"
# Fetch latest base backup from S3
db_log "pg_pitr_drill: target=$target"
local latest
latest=$(aws s3 ls s3://myapp-pg-base/ | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://myapp-pg-base/$latest" "$sandbox/base.tar.gz"
tar -xzf "$sandbox/base.tar.gz" -C "$sandbox"
cat > "$sandbox/recovery.signal" <<EOF
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '$target'
recovery_target_action = 'pause'
EOF
pg_ctl -D "$sandbox" -l "$sandbox/log" start
# Wait for recovery to reach target
for _ in {1..60}; do
if psql -h /tmp -p 5433 -tAc "SELECT pg_is_in_recovery() AND NOT pg_is_wal_replay_paused()" 2>/dev/null | grep -q '^t$'; then
sleep 5
else
break
fi
done
# Smoke test
if psql -h /tmp -p 5433 -tAc "SELECT count(*) FROM critical_table" >/dev/null; then
db_log "pg_pitr_drill OK: target=$target"
pg_ctl -D "$sandbox" stop
return 0
else
db_log "FAIL: smoke test failed at target=$target"
pg_ctl -D "$sandbox" stop
return 1
fi
}
Using the Library
#!/usr/bin/env bash
# nightly-db-backup.sh
source /usr/local/lib/backup.sh
source /usr/local/lib/db.sh
# Postgres logical
out=$(pg_logical_dump myapp_prod /var/backups/postgres)
backup_upload_s3 "$out" s3://myapp-backups-prod/pg-logical/
backup_gfs_prune /var/backups/postgres myapp_prod
# Postgres physical base (weekly only)
if [[ "$(date +%u)" == "7" ]]; then
out=$(pg_basebackup_take /var/backups/postgres-base)
# pg_basebackup output is a directory — needs different upload pattern
fi
# Weekly PITR drill (Sunday)
if [[ "$(date +%u)" == "7" ]]; then
pg_pitr_drill "$(date -d '1 hour ago' -Iseconds)"
fi
The PITR Window: How “Continuous” Is Continuous?
Your PITR window is bounded by:
| Bound | Determined by | Typical |
|---|---|---|
| Newest restorable point | Last WAL successfully archived | Within 60s with archive_timeout=60 |
| Oldest restorable point | Oldest base backup + retained WAL | 7-30 days |
| Granularity | WAL segment size (16MB default) | Can replay to second precision |
The shell discipline is:
- Archive WAL on
archive_timeout(60s) for tight RPO. - Keep base backups for at least 2× your maximum required PITR depth (so you always have one in case the latest base is corrupt).
- Keep WAL for the full PITR window (not just from the latest base).
- Drill PITR weekly — restore to “1 hour ago” and verify smoke test passes.
Alert on:
time() - last_wal_archive_time > 300→ archive lag, RPO degrading.pg_basebackup_age_days > 14→ base backup stale, restore time grows.pitr_drill_status == 0→ drill failing, PITR is unproven.
The 8 Footguns
1. Backing Up Without --single-transaction (MySQL) or Inside an Active Long Txn (Postgres)
mysqldump without --single-transaction takes table locks. pg_dump is fine in this regard but a long-running concurrent transaction can pin old row versions causing bloat. Fix: Always --single-transaction for MySQL; monitor pg_stat_activity for long-running txns before starting pg_dump.
2. Restoring a pg_dump to a Different Major Version Than It Was Taken From
Custom-format dumps are mostly portable across major versions but extension-related DDL (e.g., CREATE EXTENSION postgis) can fail. Fix: Test cross-version restore in CI for every major upgrade.
3. Forgetting --master-data=2 Means No PITR Baseline
A MySQL dump without --master-data is useless as a PITR baseline because you don’t know where to start replaying binlogs from. Fix: Always use --master-data=2 (the =2 makes it a comment so it doesn’t auto-execute on restore to a non-replica).
4. Tar of Datadir Instead of Using pg_basebackup / xtrabackup
Already covered in the cardinal rule. Worth restating: even with pg_start_backup() / pg_stop_backup() the tar approach has subtle hazards (concurrent file deletion, stat races). Use the dedicated tool.
5. Running gh-ost / pg_repack Without Replication Lag Pre-Flight
Online schema tools generate write load that can push replicas behind by minutes. Fix: Pre-flight check Seconds_Behind_Master (MySQL) / pg_last_xact_replay_timestamp (Postgres) and abort if behind.
6. Backup Credentials With DELETE Privileges
A compromised backup credential should not be able to delete S3 objects, drop database tables, or destroy ZFS snapshots. Fix: IAM policies with PutObject + GetObject only; database role with pg_read_all_data only; ZFS roles with snapshot,send,hold only.
7. PITR Drill That Restores to “Now” Instead of “1 Hour Ago”
A drill that restores to now() doesn’t actually test WAL replay because the latest WAL is already in the base backup. Fix: Drill to a specific past time (e.g., 1 hour ago) so the drill actually exercises WAL fetch + replay.
8. Forgetting To Drop the Sandbox DB Between Drills
If your drill restores into myapp_drill and you don’t drop it first, the second drill might silently restore over the previous one and a manifest mismatch is masked by stale data. Fix: DROP DATABASE myapp_drill (or wipe the sandbox PGDATA) at the start of every drill.
Quick-Reference Card
LOGICAL DUMPS (small DB, partial restore, cross-version)
Postgres: pg_dump -Fc --jobs=4 --no-owner --no-acl
MySQL: mysqldump --single-transaction --master-data=2 --hex-blob
Verify: pg_restore --list / zstd -d -c | head | grep "MySQL dump"
PHYSICAL + WAL (large DB, fast restore, PITR)
Postgres: pg_basebackup -Ft -z + wal-g for archive
MySQL: xtrabackup --backup + binlog rsync to S3
Recovery: restore_command (PG) / mysqlbinlog --start-position (MySQL)
ONLINE SCHEMA CHANGE
MySQL: gh-ost --max-load --throttle-control-replicas
Postgres: ALTER TABLE ADD nullable, backfill, then NOT NULL
Pre-flight: replication lag, free disk, table size
PITR DRILLS
Drill weekly to "1 hour ago" target, not now()
Drop sandbox between drills (DROP DATABASE / rm -rf PGDATA)
Smoke test must be domain-specific, not just "table exists"
RPO / RPO BUDGETS
archive_timeout=60 (PG) → 60s RPO worst case
binlog flush every 60s → 60s RPO worst case
PITR depth = retain base backup + all WAL since
What’s Next
You now have a database-admin layer that handles logical and physical backups, WAL archiving, online schema migrations, and drill-tested PITR. But databases produce mountains of logs — slow query logs, error logs, audit logs — and analyzing them at multi-terabyte scale needs a different toolkit: streaming awk pipelines, GNU parallel for distributed map-reduce on shell, and the discipline of avoiding the cat huge.log | grep | awk | sort | uniq trap that loads everything into memory.
In the next lesson — Log Analysis at Scale: Streaming awk, GNU parallel & Distributed grep / sort / uniq Pipelines — we’ll build lib/loganalyze.sh covering streaming aggregation that fits in O(distinct keys) memory, parallel processing with parallel, distributed map-reduce across hosts via SSH fan-out, and the standard slow-query log reduction patterns for Postgres and MySQL.