Shell Database Admin: pg_dump Pipelines, MySQL Backup Orchestration, WAL Archiving & Online Schema Migration Wrappers

The Cardinal Rule: Never Tar a Live Datadir

If you take one thing from this lesson, take this: a tar of /var/lib/postgresql/data while postgres is running is not a backup. It restores cleanly enough to fool you and corrupts on the first checkpoint. The same applies to MySQL’s /var/lib/mysql, MongoDB’s /var/lib/mongodb, and every other transactional store.

The reason is fundamental. Databases hold in-memory state (dirty pages, transaction log buffers, lock tables) that the on-disk files only become consistent with at flush points. Filesystem-level copy crosses files at arbitrary times relative to those flushes; the resulting copy is a frankenbackup that mixes pre-checkpoint and post-checkpoint pages from the same transaction. Postgres calls this “torn pages”; MySQL calls it “inconsistent InnoDB state.”

The fix is to use the database’s own backup tool, which coordinates with the running engine to produce a transactionally consistent snapshot. This lesson is about wrapping those tools in shell scripts that handle the operational reality: failure modes, retention, encryption, off-host transport, and PITR drills.

The Four Patterns

Pattern	Tool	Use case	Restore time
Logical dump	`pg_dump`, `mysqldump --single-transaction`	Small DBs, version migrations, partial restores	O(data) — slow for large
Physical base backup	`pg_basebackup`, `xtrabackup`	Large DBs, fast restore, PITR baseline	O(data on disk) — fast
WAL/binlog archiving	`wal-g`, `barman`, MySQL binlog	Continuous backup, PITR to any second	Apply WAL since base
Online schema change	`gh-ost`, `pt-online-schema-change`	DDL on huge tables without locks	N/A — migration tool

A real production environment uses base backup + continuous WAL for PITR (point-in-time recovery), with logical dumps as a defense-in-depth secondary that’s portable across versions.

Pattern 1: Logical Dumps With Postgres pg_dump

pg_dump runs as a regular Postgres client. It opens a single transaction with REPEATABLE READ isolation, scans every table, and emits SQL or a custom binary format. Because it’s transactional, the dump is consistent regardless of concurrent writes.

#!/usr/bin/env bash
# pg-logical-backup.sh
set -euo pipefail

readonly DB=myapp_prod
readonly OUT_DIR=/var/backups/postgres
readonly STAMP=$(date +%Y-%m-%d-%H%M)
readonly OUT="$OUT_DIR/${DB}-${STAMP}.dump"

mkdir -p "$OUT_DIR"

# Custom format (-Fc) is compressed and supports parallel restore.
# --no-owner / --no-acl strip env-specific identifiers (useful for cross-env restore).
PGPASSFILE=/etc/postgres/pgpass \
pg_dump \
  --host="$PGHOST" \
  --username="$PGUSER" \
  --dbname="$DB" \
  --format=custom \
  --compress=9 \
  --jobs=4 \
  --no-owner --no-acl \
  --file="$OUT.tmp"

# Atomic rename (sidecar pattern from L28)
mv "$OUT.tmp" "$OUT"

# Sidecar checksum
sha256sum "$OUT" > "$OUT.sha256"

# Verify the dump can be listed (cheap structural sanity check)
pg_restore --list "$OUT" > /dev/null \
  || { echo "FAIL: pg_restore --list failed on $OUT"; exit 1; }

echo "OK: $OUT ($(stat -c %s "$OUT") bytes)"

Two non-obvious flags:

--jobs=4 parallelizes the dump across 4 worker processes. Cuts dump time roughly proportionally on multi-core boxes. Only works with --format=directory or --format=custom.
--no-owner --no-acl drops OWNER TO and GRANT statements. Critical for cross-environment restore (your prod owner doesn’t exist in staging) but dangerous for in-place restore because it loses ACLs. Use only for DR-to-different-env.

Why `--format=custom` Over Plain SQL

Plain SQL (--format=plain) is human-readable but cannot be parallel-restored, can’t selectively restore one table, and is ~3× larger uncompressed. Custom format (--format=custom) is a binary archive that:

Compresses internally (no need to pipe through gzip).
Supports pg_restore --list for inspection without restoring.
Supports pg_restore --jobs=N for parallel restore.
Supports selective restore (pg_restore --table=foo).

Always use custom format for production backups. Reserve plain SQL for dev exports where you want to grep or hand-edit the dump.

Restore Verification

# Fast structural verification (does NOT prove data correctness)
pg_restore --list "$OUT"

# Full restore to a sandbox DB
createdb "${DB}_drill"
pg_restore --dbname="${DB}_drill" --jobs=4 "$OUT"

# Smoke test — the most important step
psql --dbname="${DB}_drill" --command="SELECT count(*) FROM critical_orders_table"

The smoke test must be domain-specific. For an e-commerce DB it might be “orders count is within 5% of yesterday’s.” For a session store it might be “table exists and is queryable.” Always pair manifest verification with a domain smoke test.

Pattern 2: Logical Dumps With MySQL mysqldump

The MySQL equivalent is mysqldump. The critical flag is --single-transaction for InnoDB tables:

#!/usr/bin/env bash
# mysql-logical-backup.sh
set -euo pipefail

readonly DB=myapp_prod
readonly OUT_DIR=/var/backups/mysql
readonly STAMP=$(date +%Y-%m-%d-%H%M)
readonly OUT="$OUT_DIR/${DB}-${STAMP}.sql.zst"

mkdir -p "$OUT_DIR"

# --single-transaction → InnoDB consistent snapshot (no table locks)
# --routines --triggers → include stored procedures and triggers
# --master-data=2 → include binlog position as comment (for PITR baseline)
# --hex-blob → safe binary encoding for blob columns
# Pipe through zstd for compression
mysqldump \
  --defaults-file=/etc/mysql/backup.cnf \
  --single-transaction \
  --routines --triggers \
  --master-data=2 \
  --hex-blob \
  --databases "$DB" \
  | zstd -9 -o "$OUT.tmp"

mv "$OUT.tmp" "$OUT"
sha256sum "$OUT" > "$OUT.sha256"

# Quick sanity: extract first 100 lines, verify SQL header
zstd -d -c "$OUT" | head -100 | grep -q "MySQL dump" \
  || { echo "FAIL: $OUT does not look like a MySQL dump"; exit 1; }

echo "OK: $OUT ($(stat -c %s "$OUT") bytes)"

The MyISAM Trap

--single-transaction only provides consistency for InnoDB tables. If your schema has any MyISAM tables, they are not covered by the snapshot — mysqldump falls back to LOCK TABLES for those, which (a) blocks writers and (b) doesn’t give you a multi-table consistent view.

Fix: Convert all production tables to InnoDB. MyISAM has been deprecated for over a decade; it has no place in modern production. Or, if you must keep MyISAM, accept that backup time = downtime for those tables.

`--master-data=2` for PITR Baseline

--master-data=2 writes the current binlog filename and position as a comment at the top of the dump. The comment is critical for PITR: when you restore the dump, you know the exact binlog position to start replaying from to reach any later state.

-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=456789;

Without --master-data, the dump is a fixed point in time with no continuation. With it, the dump is a baseline that PITR can build on. Always use --master-data=2 for production backups (the =2 makes it a comment so it doesn’t auto-execute on restore).

Pattern 3: Physical Base Backups + WAL Archiving (PITR)

For databases over ~50 GB, logical dumps become too slow. The answer is physical backups + continuous WAL archiving, which gives:

Fast base restore: just untar the data directory (vs. replaying SQL).
PITR: restore to any second by replaying WAL since the base backup.
Lower CPU overhead on the primary (no SQL serialization).

Postgres: pg_basebackup + WAL-G

The shell-friendly stack is:

pg_basebackup to take the base backup.
wal-g for continuous WAL push to S3 with compression and encryption.
restore_command in recovery.conf (or restore_command GUC) to fetch WAL during recovery.

#!/usr/bin/env bash
# pg-base-backup.sh — weekly, takes a fresh PITR baseline
set -euo pipefail

readonly DEST=/var/backups/postgres-base
readonly STAMP=$(date +%Y-%m-%d)
readonly OUT="$DEST/$STAMP"

mkdir -p "$OUT"

pg_basebackup \
  --pgdata="$OUT" \
  --format=tar \
  --gzip \
  --progress \
  --verbose \
  --wal-method=stream \
  --checkpoint=fast \
  --label="weekly-base-$STAMP"

# wal-g handles continuous archive (separate systemd timer / cron)
# This script just creates the periodic baseline.

# Atomic flag file marks completion
touch "$OUT/.complete"

# sha256 the base files for integrity
( cd "$OUT" && find . -type f -print0 | xargs -0 sha256sum | sort -k 2 ) > "$OUT.manifest"
sha256sum "$OUT.manifest" > "$OUT.manifest.sha256"

echo "OK: base backup at $OUT"

The .complete flag file is the atomic completion marker. Restore scripts must check for .complete before considering a base backup usable; otherwise an in-progress (interrupted) backup could be picked up.

WAL Archiving Configuration

In postgresql.conf:

archive_mode = on
archive_command = 'wal-g wal-push %p'
archive_timeout = 60   # force WAL switch every 60s for tighter RPO

The archive_command runs synchronously on every WAL switch. If it fails, Postgres retains the WAL until archive_command succeeds — which means a broken archive_command can fill pg_wal/ and stop writes. Monitor it:

# Prometheus textfile exporter — runs every minute
unarchived_count=$(psql -tAc "SELECT count(*) FROM pg_ls_waldir() WHERE name NOT IN (SELECT wal_segment_name FROM pg_stat_archiver)")
last_archive_time=$(psql -tAc "SELECT extract(epoch from now() - last_archived_time) FROM pg_stat_archiver")

cat > /var/lib/node_exporter/textfile_collector/pg_archive.prom.tmp <<EOF
pg_wal_unarchived_count $unarchived_count
pg_wal_last_archive_age_seconds $last_archive_time
EOF
mv /var/lib/node_exporter/textfile_collector/pg_archive.prom{.tmp,}

Alert on pg_wal_last_archive_age_seconds > 300 — if WAL archiving stalls for >5 min, your PITR window is degrading.

MySQL: Percona XtraBackup

The MySQL equivalent of pg_basebackup is Percona XtraBackup (the official Oracle tool, MySQL Enterprise Backup, is closed source).

#!/usr/bin/env bash
# mysql-base-backup.sh
set -euo pipefail

readonly DEST=/var/backups/mysql-base
readonly STAMP=$(date +%Y-%m-%d)
readonly OUT="$DEST/$STAMP"

mkdir -p "$OUT"

xtrabackup \
  --backup \
  --target-dir="$OUT" \
  --datadir=/var/lib/mysql \
  --user=backup \
  --password-file=/etc/mysql/backup-pass \
  --parallel=4 \
  --compress \
  --compress-threads=4

# --prepare must run before restore to apply uncommitted log
# But run it in a separate sandbox during drill, NOT in-place,
# because --prepare modifies the backup files and breaks PITR chain.

touch "$OUT/.complete"
echo "OK: xtrabackup at $OUT"

For continuous binlog archiving, MySQL has no first-class equivalent of wal-g. The pattern is to copy mysql-bin.* files to S3 via cron:

# binlog-archive.sh — runs every 60s
flush_logs() {
  mysql --execute='FLUSH BINARY LOGS' 2>/dev/null
}

archive_logs() {
  for binlog in /var/log/mysql/mysql-bin.[0-9]*; do
    [[ -f "$binlog" ]] || continue
    # Skip the active log (mysqld is still writing to it)
    [[ "$binlog" == "$(mysql --execute='SHOW MASTER STATUS\G' \
                       | awk '/File:/ {print $2}')" ]] && continue
    aws s3 cp "$binlog" "s3://myapp-binlogs/$(basename "$binlog")" \
      && rm -f "$binlog"
  done
}

flush_logs
archive_logs

The SHOW MASTER STATUS check ensures we don’t archive the currently-active binlog (which would race with mysqld appending to it).

Pattern 4: Online Schema Migrations Without Lock Outages

ALTER TABLE big_table ADD COLUMN ... on a 100GB MySQL/Postgres table can lock the table for hours. Online schema change tools rebuild the table in the background by:

Creating a shadow table with the new schema.
Copying rows in chunks, throttled to avoid replication lag.
Capturing changes via triggers (Postgres pg_repack / gh-ost for MySQL) or logical replication.
Atomic swap (RENAME TABLE) at the end.

gh-ost for MySQL

GitHub’s gh-ost uses MySQL’s binlog instead of triggers, so it adds zero overhead to writes:

#!/usr/bin/env bash
# Wrapping gh-ost in shell with safety guardrails.
set -euo pipefail

readonly DB=myapp_prod
readonly TABLE=orders
readonly ALTER='ADD COLUMN delivery_zone VARCHAR(64) DEFAULT NULL'

# Pre-flight: replication lag must be low
LAG=$(mysql --batch --skip-column-names \
  --execute='SHOW SLAVE STATUS\G' \
  | awk '/Seconds_Behind_Master/ {print $2}')
if [[ "$LAG" == "NULL" ]] || (( LAG > 30 )); then
  echo "FAIL: replication lag is $LAG (must be < 30s)"
  exit 1
fi

# Pre-flight: free disk space must exceed 2× table size
TABLE_SIZE=$(mysql --batch --skip-column-names --execute="
  SELECT data_length + index_length FROM information_schema.tables
  WHERE table_schema='$DB' AND table_name='$TABLE'")
FREE=$(df --output=avail -B1 /var/lib/mysql | tail -1)
if (( FREE < TABLE_SIZE * 2 )); then
  echo "FAIL: free disk $FREE < 2× table size $TABLE_SIZE"
  exit 1
fi

# Run gh-ost
gh-ost \
  --user=ghost --password-file=/etc/mysql/ghost-pass \
  --host=replica.internal \
  --database="$DB" --table="$TABLE" \
  --alter="$ALTER" \
  --max-load='Threads_running=25' \
  --critical-load='Threads_running=100' \
  --chunk-size=1000 \
  --throttle-control-replicas='replica2.internal,replica3.internal' \
  --switch-to-rbr \
  --execute

Critical safety flags:

--max-load='Threads_running=25' — pause copying when concurrent threads exceed 25.
--critical-load='Threads_running=100' — abort if it spikes beyond 100.
--throttle-control-replicas — pause copying if any replica falls behind.
--switch-to-rbr — switch the migration session to row-based replication for safer cutover.

Without these, gh-ost can saturate your DB during peak traffic. The pre-flight checks (replication lag, free disk) prevent the most common foot-shoot.

pg_repack for Postgres

Postgres uses logical replication slots and triggers in pg_repack:

#!/usr/bin/env bash
set -euo pipefail

readonly DB=myapp_prod
readonly TABLE=orders

# pg_repack rewrites the table in place, removes bloat
pg_repack \
  --dbname="$DB" \
  --table="$TABLE" \
  --jobs=4 \
  --no-order  # skip CLUSTER-style sort, just reclaim bloat

For schema changes (vs. just bloat reclaim), the Postgres-native approach is:

Add nullable column (ALTER TABLE ... ADD COLUMN x text — fast metadata-only op in modern Postgres).
Backfill in batches (UPDATE ... WHERE x IS NULL LIMIT 1000).
Add NOT NULL constraint with NOT VALID first, then VALIDATE CONSTRAINT later.

Wrapped in shell:

backfill_in_batches() {
  local total=0 batch=1000
  while :; do
    n=$(psql -tAc "
      WITH cte AS (
        SELECT id FROM orders WHERE delivery_zone IS NULL LIMIT $batch FOR UPDATE SKIP LOCKED
      )
      UPDATE orders SET delivery_zone = compute_zone(address)
      FROM cte WHERE orders.id = cte.id
      RETURNING 1
    " | wc -l)
    (( n == 0 )) && break
    total=$((total + n))
    echo "Backfilled $total rows so far"
    sleep 0.1   # throttle
  done
}

FOR UPDATE SKIP LOCKED is the magic: rows currently locked by an active transaction are skipped (rather than blocking), making the backfill safe to run during peak traffic.

The Drop-In `lib/db.sh`

# lib/db.sh — sourced helpers for database backup scripts.
#
# Depends on lib/backup.sh for upload_s3 / verify_remote / GFS prune.
#
# Usage:
#   source /usr/local/lib/backup.sh
#   source /usr/local/lib/db.sh
#   pg_logical_dump myapp_prod /var/backups/postgres
#   pg_basebackup_take /var/backups/postgres-base
#   mysql_logical_dump myapp_prod /var/backups/mysql

set -o errexit -o nounset -o pipefail

db_log() { backup_log "[db] $*"; }

# Postgres logical dump in custom format. Args: db, dest_dir
pg_logical_dump() {
  local db="$1" dest="$2"
  local stamp out
  stamp=$(date +%Y-%m-%d-%H%M)
  mkdir -p "$dest"
  out="$dest/${db}-${stamp}.dump"

  db_log "pg_dump start: $db"
  PGPASSFILE=/etc/postgres/pgpass \
  pg_dump --dbname="$db" --format=custom --compress=9 --jobs=4 \
          --no-owner --no-acl --file="$out.tmp"
  mv "$out.tmp" "$out"
  sha256sum "$out" > "$out.sha256"

  pg_restore --list "$out" > /dev/null \
    || { db_log "FAIL: pg_restore --list failed"; return 1; }

  db_log "pg_dump OK: $out ($(stat -c %s "$out") bytes)"
  printf '%s\n' "$out"
}

# Postgres physical base backup. Args: dest_dir
pg_basebackup_take() {
  local dest="$1"
  local stamp out
  stamp=$(date +%Y-%m-%d)
  mkdir -p "$dest"
  out="$dest/$stamp"
  mkdir -p "$out"

  db_log "pg_basebackup start"
  pg_basebackup --pgdata="$out" --format=tar --gzip \
                --wal-method=stream --checkpoint=fast \
                --label="base-$stamp"

  ( cd "$out" && find . -type f -print0 | xargs -0 sha256sum | sort -k 2 ) > "$out.manifest"
  sha256sum "$out.manifest" > "$out.manifest.sha256"
  touch "$out/.complete"

  db_log "pg_basebackup OK: $out"
  printf '%s\n' "$out"
}

# MySQL logical dump with --single-transaction. Args: db, dest_dir
mysql_logical_dump() {
  local db="$1" dest="$2"
  local stamp out
  stamp=$(date +%Y-%m-%d-%H%M)
  mkdir -p "$dest"
  out="$dest/${db}-${stamp}.sql.zst"

  db_log "mysqldump start: $db"
  mysqldump --defaults-file=/etc/mysql/backup.cnf \
            --single-transaction --routines --triggers \
            --master-data=2 --hex-blob \
            --databases "$db" \
    | zstd -9 -o "$out.tmp"
  mv "$out.tmp" "$out"
  sha256sum "$out" > "$out.sha256"

  zstd -d -c "$out" | head -100 | grep -q "MySQL dump" \
    || { db_log "FAIL: dump format check failed"; return 1; }

  db_log "mysqldump OK: $out ($(stat -c %s "$out") bytes)"
  printf '%s\n' "$out"
}

# PITR drill — restore latest base + replay WAL to a target time. Args: target_iso
pg_pitr_drill() {
  local target="$1"
  local sandbox=/srv/pg-pitr-drill
  rm -rf "$sandbox"
  mkdir -p "$sandbox"

  # Fetch latest base backup from S3
  db_log "pg_pitr_drill: target=$target"
  local latest
  latest=$(aws s3 ls s3://myapp-pg-base/ | sort | tail -1 | awk '{print $4}')
  aws s3 cp "s3://myapp-pg-base/$latest" "$sandbox/base.tar.gz"
  tar -xzf "$sandbox/base.tar.gz" -C "$sandbox"

  cat > "$sandbox/recovery.signal" <<EOF
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '$target'
recovery_target_action = 'pause'
EOF

  pg_ctl -D "$sandbox" -l "$sandbox/log" start

  # Wait for recovery to reach target
  for _ in {1..60}; do
    if psql -h /tmp -p 5433 -tAc "SELECT pg_is_in_recovery() AND NOT pg_is_wal_replay_paused()" 2>/dev/null | grep -q '^t$'; then
      sleep 5
    else
      break
    fi
  done

  # Smoke test
  if psql -h /tmp -p 5433 -tAc "SELECT count(*) FROM critical_table" >/dev/null; then
    db_log "pg_pitr_drill OK: target=$target"
    pg_ctl -D "$sandbox" stop
    return 0
  else
    db_log "FAIL: smoke test failed at target=$target"
    pg_ctl -D "$sandbox" stop
    return 1
  fi
}

Using the Library

#!/usr/bin/env bash
# nightly-db-backup.sh
source /usr/local/lib/backup.sh
source /usr/local/lib/db.sh

# Postgres logical
out=$(pg_logical_dump myapp_prod /var/backups/postgres)
backup_upload_s3 "$out" s3://myapp-backups-prod/pg-logical/
backup_gfs_prune /var/backups/postgres myapp_prod

# Postgres physical base (weekly only)
if [[ "$(date +%u)" == "7" ]]; then
  out=$(pg_basebackup_take /var/backups/postgres-base)
  # pg_basebackup output is a directory — needs different upload pattern
fi

# Weekly PITR drill (Sunday)
if [[ "$(date +%u)" == "7" ]]; then
  pg_pitr_drill "$(date -d '1 hour ago' -Iseconds)"
fi

The PITR Window: How “Continuous” Is Continuous?

Your PITR window is bounded by:

Bound	Determined by	Typical
Newest restorable point	Last WAL successfully archived	Within 60s with `archive_timeout=60`
Oldest restorable point	Oldest base backup + retained WAL	7-30 days
Granularity	WAL segment size (16MB default)	Can replay to second precision

The shell discipline is:

Archive WAL on archive_timeout (60s) for tight RPO.
Keep base backups for at least 2× your maximum required PITR depth (so you always have one in case the latest base is corrupt).
Keep WAL for the full PITR window (not just from the latest base).
Drill PITR weekly — restore to “1 hour ago” and verify smoke test passes.

Alert on:

time() - last_wal_archive_time > 300 → archive lag, RPO degrading.
pg_basebackup_age_days > 14 → base backup stale, restore time grows.
pitr_drill_status == 0 → drill failing, PITR is unproven.

The 8 Footguns

1. Backing Up Without `--single-transaction` (MySQL) or Inside an Active Long Txn (Postgres)

mysqldump without --single-transaction takes table locks. pg_dump is fine in this regard but a long-running concurrent transaction can pin old row versions causing bloat. Fix: Always --single-transaction for MySQL; monitor pg_stat_activity for long-running txns before starting pg_dump.

2. Restoring a `pg_dump` to a Different Major Version Than It Was Taken From

Custom-format dumps are mostly portable across major versions but extension-related DDL (e.g., CREATE EXTENSION postgis) can fail. Fix: Test cross-version restore in CI for every major upgrade.

3. Forgetting `--master-data=2` Means No PITR Baseline

A MySQL dump without --master-data is useless as a PITR baseline because you don’t know where to start replaying binlogs from. Fix: Always use --master-data=2 (the =2 makes it a comment so it doesn’t auto-execute on restore to a non-replica).

4. Tar of Datadir Instead of Using pg_basebackup / xtrabackup

Already covered in the cardinal rule. Worth restating: even with pg_start_backup() / pg_stop_backup() the tar approach has subtle hazards (concurrent file deletion, stat races). Use the dedicated tool.

5. Running gh-ost / pg_repack Without Replication Lag Pre-Flight

Online schema tools generate write load that can push replicas behind by minutes. Fix: Pre-flight check Seconds_Behind_Master (MySQL) / pg_last_xact_replay_timestamp (Postgres) and abort if behind.

6. Backup Credentials With DELETE Privileges

A compromised backup credential should not be able to delete S3 objects, drop database tables, or destroy ZFS snapshots. Fix: IAM policies with PutObject + GetObject only; database role with pg_read_all_data only; ZFS roles with snapshot,send,hold only.

7. PITR Drill That Restores to “Now” Instead of “1 Hour Ago”

A drill that restores to now() doesn’t actually test WAL replay because the latest WAL is already in the base backup. Fix: Drill to a specific past time (e.g., 1 hour ago) so the drill actually exercises WAL fetch + replay.

8. Forgetting To Drop the Sandbox DB Between Drills

If your drill restores into myapp_drill and you don’t drop it first, the second drill might silently restore over the previous one and a manifest mismatch is masked by stale data. Fix: DROP DATABASE myapp_drill (or wipe the sandbox PGDATA) at the start of every drill.

Quick-Reference Card

LOGICAL DUMPS (small DB, partial restore, cross-version)
  Postgres: pg_dump -Fc --jobs=4 --no-owner --no-acl
  MySQL:    mysqldump --single-transaction --master-data=2 --hex-blob
  Verify:   pg_restore --list  /  zstd -d -c | head | grep "MySQL dump"

PHYSICAL + WAL (large DB, fast restore, PITR)
  Postgres: pg_basebackup -Ft -z + wal-g for archive
  MySQL:    xtrabackup --backup + binlog rsync to S3
  Recovery: restore_command (PG) / mysqlbinlog --start-position (MySQL)

ONLINE SCHEMA CHANGE
  MySQL:    gh-ost --max-load --throttle-control-replicas
  Postgres: ALTER TABLE ADD nullable, backfill, then NOT NULL
  Pre-flight: replication lag, free disk, table size

PITR DRILLS
  Drill weekly to "1 hour ago" target, not now()
  Drop sandbox between drills (DROP DATABASE / rm -rf PGDATA)
  Smoke test must be domain-specific, not just "table exists"

RPO / RPO BUDGETS
  archive_timeout=60 (PG) → 60s RPO worst case
  binlog flush every 60s → 60s RPO worst case
  PITR depth = retain base backup + all WAL since

What’s Next

You now have a database-admin layer that handles logical and physical backups, WAL archiving, online schema migrations, and drill-tested PITR. But databases produce mountains of logs — slow query logs, error logs, audit logs — and analyzing them at multi-terabyte scale needs a different toolkit: streaming awk pipelines, GNU parallel for distributed map-reduce on shell, and the discipline of avoiding the cat huge.log | grep | awk | sort | uniq trap that loads everything into memory.

In the next lesson — Log Analysis at Scale: Streaming awk, GNU parallel & Distributed grep / sort / uniq Pipelines — we’ll build lib/loganalyze.sh covering streaming aggregation that fits in O(distinct keys) memory, parallel processing with parallel, distributed map-reduce across hosts via SSH fan-out, and the standard slow-query log reduction patterns for Postgres and MySQL.

Shell Database Admin: pg_dump Pipelines, MySQL Backup Orchestration, WAL Archiving & Online Schema Migration Wrappers

The Cardinal Rule: Never Tar a Live Datadir

The Four Patterns

Pattern 1: Logical Dumps With Postgres pg_dump

Why `--format=custom` Over Plain SQL

Restore Verification

Pattern 2: Logical Dumps With MySQL mysqldump

The MyISAM Trap

`--master-data=2` for PITR Baseline

Pattern 3: Physical Base Backups + WAL Archiving (PITR)

Postgres: pg_basebackup + WAL-G

WAL Archiving Configuration

MySQL: Percona XtraBackup

Pattern 4: Online Schema Migrations Without Lock Outages

gh-ost for MySQL

pg_repack for Postgres

The Drop-In `lib/db.sh`

Using the Library

The PITR Window: How “Continuous” Is Continuous?

The 8 Footguns

1. Backing Up Without `--single-transaction` (MySQL) or Inside an Active Long Txn (Postgres)

2. Restoring a `pg_dump` to a Different Major Version Than It Was Taken From

3. Forgetting `--master-data=2` Means No PITR Baseline

4. Tar of Datadir Instead of Using pg_basebackup / xtrabackup

5. Running gh-ost / pg_repack Without Replication Lag Pre-Flight

6. Backup Credentials With DELETE Privileges

7. PITR Drill That Restores to “Now” Instead of “1 Hour Ago”

8. Forgetting To Drop the Sandbox DB Between Drills

Quick-Reference Card

What’s Next

Written by Vinod

Comments

Shell Database Admin: pg_dump Pipelines, MySQL Backup Orchestration, WAL Archiving & Online Schema Migration Wrappers

The Cardinal Rule: Never Tar a Live Datadir

The Four Patterns

Pattern 1: Logical Dumps With Postgres pg_dump

Why --format=custom Over Plain SQL

Restore Verification

Pattern 2: Logical Dumps With MySQL mysqldump

The MyISAM Trap

--master-data=2 for PITR Baseline

Pattern 3: Physical Base Backups + WAL Archiving (PITR)

Postgres: pg_basebackup + WAL-G

WAL Archiving Configuration

MySQL: Percona XtraBackup

Pattern 4: Online Schema Migrations Without Lock Outages

gh-ost for MySQL

pg_repack for Postgres

The Drop-In lib/db.sh

Using the Library

The PITR Window: How “Continuous” Is Continuous?

The 8 Footguns

1. Backing Up Without --single-transaction (MySQL) or Inside an Active Long Txn (Postgres)

2. Restoring a pg_dump to a Different Major Version Than It Was Taken From

3. Forgetting --master-data=2 Means No PITR Baseline

4. Tar of Datadir Instead of Using pg_basebackup / xtrabackup

5. Running gh-ost / pg_repack Without Replication Lag Pre-Flight

6. Backup Credentials With DELETE Privileges

7. PITR Drill That Restores to “Now” Instead of “1 Hour Ago”

8. Forgetting To Drop the Sandbox DB Between Drills

Quick-Reference Card

What’s Next

Written by Vinod

Comments

Why `--format=custom` Over Plain SQL

`--master-data=2` for PITR Baseline

The Drop-In `lib/db.sh`

1. Backing Up Without `--single-transaction` (MySQL) or Inside an Active Long Txn (Postgres)

2. Restoring a `pg_dump` to a Different Major Version Than It Was Taken From

3. Forgetting `--master-data=2` Means No PITR Baseline