Shell Lesson 18 of 42

File Operations at Scale: rsync, find -print0, Atomic Writes & Parallel-Safe Patterns — When `cp -r` Stops Being Enough

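cp -r and for f in * work for tens of files. They stop being enough once a job grows to large trees, awkward filenames, or copies that can be interrupted half-way.

This lesson covers the tools that handle those cases: rsync, find -print0, atomic writes, and parallel-safe patterns.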

By the end you’ll handle 1TB volumes confidently and never lose data to “the script died half-way through.”


1. rsync — the Unix copy tool you should be using

rsync is “smart cp”: it works out what has changed since the last run and transfers only the differences. For local copies it’s competitive with cp. For repeated remote copies it can be faster by an order of magnitude or more, because unchanged files are skipped and only changed data crosses the network.

The canonical local copy

rsync -aP /source/ /dest/

The trailing / on source matters. rsync -a /src/ /dst/ copies the contents of /src into /dst. rsync -a /src /dst/ copies /src itself into /dst (creating /dst/src).

rsync -aP /src/ /dst/        # /src/foo → /dst/foo
rsync -aP /src  /dst/        # /src/foo → /dst/src/foo

This is the most common rsync mistake. Always think about whether you mean “copy contents” or “copy the dir itself.”

The canonical remote copy

rsync -azP /source/ user@host:/dest/

For very fast LAN transfers, omit -z (CPU is the bottleneck, not bandwidth).

--delete — make destination match source

rsync -aP --delete /source/ /dest/

Files in /dest/ that aren’t in /source/ are deleted. Use carefully — you can wipe the target if you fat-finger arguments. Always test with --dry-run first:

rsync -aPn --delete /source/ /dest/    # -n = --dry-run; just print what would happen

Always do this on first run with --delete.

--exclude and --include

rsync -aP --exclude='*.log' --exclude='node_modules/' /src/ /dst/

Patterns are checked against the path relative to the source root. --exclude='*.log' matches any .log file at any depth. --exclude='/cache/' matches only top-level cache/.

For complex rule sets, use --exclude-from=FILE:

# .rsync-excludes
node_modules/
.git/
*.log
*.tmp
__pycache__/

rsync -aP --exclude-from=.rsync-excludes /src/ /dst/

--link-dest — incremental snapshots

This is rsync’s killer feature. Hard-link files unchanged from a previous backup, only copying changed ones. Result: each “snapshot” appears full but actually shares disk with the previous.

# Yesterday's snapshot is at /backups/2026-06-21/
# Today, build /backups/2026-06-22/ that hard-links to yesterday's unchanged files
rsync -aP --link-dest=/backups/2026-06-21/ /source/ /backups/2026-06-22/

Now /backups/2026-06-22/ is a complete tree, but identical files share inodes with yesterday. Disk usage is roughly the size of new + modified files, not the full source.
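To see the inode sharing for yourself, here is a minimal sketch that builds two tiny snapshots in a throwaway directory and prints inode numbers. The function name demo_link_dest and the day1/day2 layout are made up for illustration; it assumes rsync and GNU stat (stat -c) are installed.

```shell
#!/usr/bin/env bash
# Sketch: show that --link-dest shares inodes for unchanged files.
# demo_link_dest and the day1/day2 names are hypothetical.
demo_link_dest() {
  local work=$1
  mkdir -p "$work/src" || return 1
  echo "unchanged" > "$work/src/a.txt"
  echo "v1"        > "$work/src/b.txt"

  # Day 1: a plain full copy.
  rsync -a "$work/src/" "$work/day1/" || return 1

  # Change one file (different size), then take day 2 against day 1.
  echo "v2 with more bytes" > "$work/src/b.txt"
  rsync -a --link-dest="$work/day1/" "$work/src/" "$work/day2/" || return 1

  # Same inode number => shared storage; different => freshly copied.
  stat -c '%i %n' "$work"/day1/*.txt "$work"/day2/*.txt
}

if command -v rsync >/dev/null 2>&1; then
  work=$(mktemp -d)
  demo_link_dest "$work"
  rm -rf -- "$work"
fi
```

In the output, a.txt should show the same inode number under day1 and day2, while b.txt shows two different ones.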

This hard-link approach is how rsnapshot works, and it is the same idea behind Time Machine, BackupPC, and most “rolling N-day backup” systems. It is also trivial to implement.

--partial and --inplace

If a transfer is interrupted, by default rsync deletes the partial file and starts over. With --partial (-P includes this), it keeps the partial. Re-running rsync resumes from the partial.

--inplace writes directly to the destination file rather than to a temp + rename. Faster, but readers may see partial data. Use only when readers won’t trip over half-files.

--bwlimit — rate limiting

rsync -aP --bwlimit=10000 /src/ user@host:/dst/    # 10,000 KB/s = 10 MB/s

Use during business hours so the rsync doesn’t saturate the link.

Verbosity and dry-run

rsync -aPv  /src/ /dst/         # verbose: list every file copied
rsync -aPvv /src/ /dst/         # extra verbose
rsync -aPn  /src/ /dst/         # dry run; show what WOULD be transferred

Always -n first when using --delete or aggressive exclude patterns.

--checksum vs default

By default, rsync skips files where size and mtime match. With --checksum it also reads and hashes both sides. Slow but bulletproof when you suspect mtimes are lying.

rsync over SSH with custom config

rsync -aP -e 'ssh -i ~/.ssh/backup_key -p 2222' /src/ user@host:/dst/

-e overrides the remote-shell command. Use to specify alternate keys, ports, or even different transports.

Combined real-world example: nightly backup

#!/usr/bin/env bash
set -Eeuo pipefail

source "$(dirname "${BASH_SOURCE[0]}")/lib/log.sh"

readonly SOURCE=/var/www
readonly DEST=/backups
readonly TODAY=$(date -u +%Y-%m-%d)
readonly YESTERDAY=$(date -u -d 'yesterday' +%Y-%m-%d 2>/dev/null || date -u -v-1d +%Y-%m-%d)

readonly TODAY_DIR="$DEST/$TODAY"
readonly YEST_DIR="$DEST/$YESTERDAY"

mkdir -p "$DEST"

OPTS=( -aP --delete --exclude-from=/etc/backup/excludes )
if [[ -d "$YEST_DIR" ]]; then
  OPTS+=( --link-dest="$YEST_DIR" )
fi

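info "starting backup" "date=$TODAY"
rsync "${OPTS[@]}" "$SOURCE/" "$TODAY_DIR/"

# Prune snapshots whose NAME is more than 30 days old. Compare names, not -mtime:
# rsync -a copies the source directory's mtime onto the snapshot directory,
# so the snapshot's mtime does not reflect when the backup was taken.
CUTOFF=$(date -u -d '30 days ago' +%Y-%m-%d 2>/dev/null || date -u -v-30d +%Y-%m-%d)
for dir in "$DEST"/????-??-??; do
  [[ -d "$dir" ]] || continue
  if [[ "${dir##*/}" < "$CUTOFF" ]]; then rm -rf -- "$dir"; fi
done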
info "backup complete"

This is a solid foundation for a “rolling daily backup” script. Adapt for your data.


2. Filename-safe iteration (recap + new patterns)

We covered this in L4 and L6. Recap:

# WRONG — breaks on spaces/newlines in filenames
for f in $(find . -name '*.log'); do …; done

# RIGHT — NUL-separated, mapfile collects safely
mapfile -d '' -t FILES < <(find . -name '*.log' -print0)
for f in "${FILES[@]}"; do …; done

# RIGHT — for direct piping
find . -name '*.log' -print0 | xargs -0 -n 1 process

# RIGHT — read loop with explicit IFS=
find . -name '*.log' -print0 | while IFS= read -r -d '' f; do
  process "$f"
done

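The IFS= (empty) before read stops read from trimming leading and trailing whitespace, -r stops it from treating backslashes as escapes, and -d '' makes the delimiter NUL. Because the loop sits on the right-hand side of a pipe, it runs in a subshell: variables set inside it are gone afterwards. Use the mapfile form, or feed the loop with < <(find …), if you need them.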
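A quick way to convince yourself is to count the same files both ways. The sketch below is illustrative only: the file names are invented to exercise a space and an embedded newline, and mapfile -d needs bash 4.4 or newer.

```shell
#!/usr/bin/env bash
# Sketch: compare naive word-splitting with NUL-delimited iteration.
# File names are made up; requires bash 4.4+ for `mapfile -d`.
demo=$(mktemp -d)
touch "$demo/plain.log" "$demo/with space.log" "$demo/with"$'\n'"newline.log"

# Naive: unquoted command substitution splits on spaces and newlines.
naive=0
for f in $(find "$demo" -name '*.log'); do
  naive=$((naive + 1))
done

# Safe: NUL-delimited names collected into an array.
mapfile -d '' -t files < <(find "$demo" -name '*.log' -print0)
safe=${#files[@]}

echo "naive loop saw $naive items, NUL-safe array holds $safe files"
rm -rf -- "$demo"
```

There are three files on disk; the naive loop reports more than three "items" because two of the names are split into pieces.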

find ... -exec ...

For simple ops, skip xargs entirely:

# Per-file: forks once per file (slow if many files)
find . -name '*.log' -exec gzip {} \;

# Batched: forks once per batch (fast)
find . -name '*.log' -exec gzip {} +

Always prefer + over \; when the command supports multiple args. We covered this in L11.

find -execdir

find . -name '*.tmp' -execdir rm -- {} \;

-execdir runs the command in the directory containing the file. Useful for git/svn operations that act on cwd.

When **/ (globstar) is enough

For in-shell iteration where you don’t need find’s full power:

shopt -s globstar nullglob
for f in **/*.log; do
  process "$f"
done

Quote "$f" even when the glob match is “safe” — it’s a habit you don’t want to break.

parallel over find

find . -name '*.log' -print0 | parallel -0 -j 8 gzip {}

parallel -0 reads NUL-separated input. Combine with -j 8 for 8-way parallelism. We covered this in L16.


3. Atomic writes — never leave a half-finished file

If a process is killed (Ctrl-C, OOM, kernel panic) while writing a file, readers may see truncated content. For configs, manifests, anything important, write atomically: write to a temp file, fsync, then rename.

The basic pattern

TMP=$(mktemp /var/lib/myapp/data.XXXXXX)
trap 'rm -f "$TMP"' EXIT
generate_data > "$TMP"
mv -- "$TMP" /var/lib/myapp/data.json
trap - EXIT

mv on the same filesystem is atomic at the kernel level — readers either see the old version or the new one, never partial.

Why same filesystem?

mv across filesystems is cp + unlink, which is not atomic. If the destination is on /data (a separate FS from /tmp), create the temp file in the destination’s own directory instead:

TMP=$(mktemp -p "$(dirname "$DEST")" data.XXXXXX)

-p DIR puts the temp file in DIR — same filesystem as DEST.
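If you want a script to check this rather than trust the layout, comparing device numbers is a reasonable first test. This is a sketch under assumptions: same_fs is a hypothetical helper, it relies on GNU stat's %d format, and a matching device number is a strong hint rather than proof (bind mounts of one filesystem can still refuse a rename between them).

```shell
#!/usr/bin/env bash
# Sketch: check whether two paths report the same device number.
# same_fs is a hypothetical helper; uses GNU stat (-c %d).
# Returns 0 if equal, 1 if different, 2 if either path cannot be stat'ed.
same_fs() {
  local a b
  a=$(stat -c %d -- "$1") || return 2
  b=$(stat -c %d -- "$2") || return 2
  [[ "$a" == "$b" ]]
}

# Usage: create the temp file beside the destination, then confirm.
dest_dir=$(mktemp -d)                       # stand-in for the real destination directory
tmp=$(mktemp -p "$dest_dir" .data.XXXXXX)
if same_fs "$tmp" "$dest_dir"; then
  echo "temp file and destination share a filesystem"
fi
rm -rf -- "$dest_dir"
```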

Adding fsync

mv is atomic from the kernel’s view, but the data may still be in page cache. To guarantee durability, sync first:

TMP=$(mktemp -p "$(dirname "$DEST")" .data.XXXXXX)
generate_data > "$TMP"
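sync                      # flush ALL pending writes (heavyweight)
# Or, more targeted (pass the path as an argument; never interpolate it into the code):
# python3 -c 'import os, sys; fd = os.open(sys.argv[1], os.O_RDONLY); os.fsync(fd); os.close(fd)' "$TMP"
# With recent GNU coreutils, `sync "$TMP"` also syncs just that one file.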
mv -- "$TMP" "$DEST"

For most use cases, the implicit kernel flushing is fine. Add explicit sync only when crashes are a real concern (e.g. database snapshots).

Reusable helper

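atomic_write() {
  local target=$1 tmp
  tmp=$(mktemp -p "$(dirname -- "$target")" ".$(basename -- "$target").XXXXXX") || return 1
  # Clean up explicitly rather than with an EXIT trap: a trap set here would
  # replace the caller's trap, and $tmp is local to this function.
  if cat > "$tmp" && mv -- "$tmp" "$target"; then   # read stdin → temp file, then rename
    return 0
  fi
  rm -f -- "$tmp"                 # failed part-way: drop the temp, keep the old target
  return 1
}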

# Use:
generate_config | atomic_write /etc/myapp/config.yaml

cat > "$tmp" reads stdin to the temp. The function is generic.
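One limitation of the pipe form is worth knowing: the helper cannot see whether the producer on the left of the pipe failed, so a truncated stream can still be renamed into place. A possible variant, sketched here under the hypothetical name atomic_write_cmd, runs the producer itself and renames only if it succeeded. It assumes GNU mktemp (-p).

```shell
#!/usr/bin/env bash
# Sketch: run the producer inside the helper so its exit status gates the rename.
# atomic_write_cmd is a hypothetical name; assumes GNU mktemp (-p).
atomic_write_cmd() {
  local target=$1; shift
  local tmp
  tmp=$(mktemp -p "$(dirname -- "$target")" ".$(basename -- "$target").XXXXXX") || return 1
  # "$@" is the producer command; its stdout becomes the new file content.
  if "$@" > "$tmp" && mv -- "$tmp" "$target"; then
    return 0
  fi
  rm -f -- "$tmp"        # producer or rename failed: leave the old target alone
  return 1
}

# Usage, with throwaway paths:
dir=$(mktemp -d)
echo "old" > "$dir/config"
atomic_write_cmd "$dir/config" printf 'new\n'        # succeeds, replaces the file
atomic_write_cmd "$dir/config" sh -c 'echo partial; exit 1' \
  || echo "producer failed, previous version kept"
final=$(cat "$dir/config")
echo "config now contains: $final"
rm -rf -- "$dir"
```

Note that mktemp creates the temp file with mode 0600, so the replaced file ends up with that mode unless you chmod the temp file before the rename.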

Atomic directory replace

For atomically replacing a directory (e.g. blue/green static content):

NEW=/var/www/site.new
LIVE=/var/www/site
OLD=/var/www/site.old

# Build the new tree
rsync -aP /source/ "$NEW/"

# Atomic flip via rename — actually two-step on most filesystems
mv -- "$LIVE" "$OLD"          # not atomic in the strict sense
mv -- "$NEW" "$LIVE"          # but very fast — sub-millisecond gap

# Cleanup
rm -rf "$OLD"

For truly atomic dir-swap, use a symlink:

# Build the new dir at /var/www/sites/v2/
rsync -aP /source/ /var/www/sites/v2/

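# Atomic symlink swap: create the new link under a temporary name, then rename it over the old one
ln -s /var/www/sites/v2 /var/www/site.tmp
mv -T /var/www/site.tmp /var/www/site      # -T (GNU mv): treat the destination as a plain name

Renaming the new symlink over the old one is a single rename syscall, so readers always see either the old target or the new one. A plain ln -sfn is not guaranteed to be atomic: depending on the ln implementation, it may remove the old link before creating the new one, leaving a brief window with no link at all. The rename form is how most blue/green static-content deployments work.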
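One way to make the swap a single rename regardless of how ln -f is implemented is to create the new link under a temporary name and rename it into place. The helper below is a sketch: swap_symlink is a hypothetical name, and mv -T is a GNU coreutils option.

```shell
#!/usr/bin/env bash
# Sketch: swap a "live" symlink with a single rename.
# swap_symlink is a hypothetical helper; `mv -T` is GNU-specific.
swap_symlink() {
  local target=$1 link=$2
  local tmp="$link.tmp.$$"
  ln -s -- "$target" "$tmp" || return 1
  # -T treats $link as a plain name, so the old symlink is replaced by rename
  # instead of the new link being moved INTO the directory it points at.
  mv -T -- "$tmp" "$link" || { rm -f -- "$tmp"; return 1; }
}

root=$(mktemp -d)
mkdir "$root/v1" "$root/v2"
swap_symlink "$root/v1" "$root/live"      # first deploy
swap_symlink "$root/v2" "$root/live"      # flip to the new version
live_target=$(readlink "$root/live")
echo "live now points at: $live_target"
rm -rf -- "$root"
```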


4. Parallel-safe directory operations

When fanning out file ops across cores, watch for races and contention.

Per-file work in parallel

# Compress every .log file with 8 cores
find /var/log -type f -name '*.log' -print0 \
  | xargs -0 -P 8 -n 1 gzip

This is safe — each gzip operates on its own file. No coordination needed.

Aggregating from parallel jobs

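# Counting words across many files in parallel: DON'T append to one shared file
RESULTS=$(mktemp -d)
export RESULTS
find . -name '*.txt' -print0 | xargs -0 -P 8 -I {} \
  bash -c 'wc -w < "$1" > "$(mktemp "$RESULTS/wc.XXXXXX")"' _ {}

# Aggregate after
find "$RESULTS" -type f -exec cat {} + | awk '{ s += $1 } END { print s + 0 }'
rm -rf -- "$RESULTS"

Each worker writes to its own uniquely named file. (Naming the output after basename "$1" would let a/notes.txt and b/notes.txt overwrite each other.) xargs returns only once every worker has finished, so it is safe to aggregate straight afterwards. We saw this pattern in L16; here it’s specialised for file ops.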
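Here is a self-contained run of that pattern on two tiny input files, as a sketch only. The directory names and sample contents are invented, and it assumes GNU xargs (-0, -P) and mktemp.

```shell
#!/usr/bin/env bash
# Sketch: each parallel worker writes its own result file; sum them afterwards.
# Directory names and sample text are illustrative. Assumes GNU xargs and mktemp.
work=$(mktemp -d)
mkdir -p "$work/in/a" "$work/in/b" "$work/out"
printf 'one two three\n' > "$work/in/a/notes.txt"
printf 'four five\n'     > "$work/in/b/notes.txt"    # same basename, different dir

export OUT="$work/out"
find "$work/in" -name '*.txt' -print0 \
  | xargs -0 -P 4 -n 1 bash -c 'wc -w < "$1" > "$(mktemp "$OUT/wc.XXXXXX")"' _

# xargs has returned, so every worker is done and the results are complete.
total=$(cat "$work"/out/wc.* | awk '{ s += $1 } END { print s + 0 }')
echo "total words: $total"
rm -rf -- "$work"
```

The two inputs share a basename on purpose: with mktemp-generated output names they cannot collide.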

Walking very large trees efficiently

For directories with millions of files, a single find walks the whole tree serially in one process. Split the work at the top level with -mindepth/-maxdepth and run one find per subtree:

# Process top-level dirs in parallel; each find within is shallow
find /data -mindepth 1 -maxdepth 1 -type d -print0 \
  | xargs -0 -P 4 -I {} bash -c 'find "$1" -name "*.log" | wc -l' _ {}

This breaks the tree into independent subtrees and processes each in parallel. Files sitting directly in /data are not covered by this split.

rsync from many sources to one dest

rsync doesn’t natively parallelise. Workaround: rsync each top-level subdir in parallel.

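find /source -mindepth 1 -maxdepth 1 -type d -printf '%f\0' \
  | parallel -0 -j 4 rsync -a /source/{}/ /dest/{}/

Be careful with --delete: each parallel rsync only sees, and only deletes within, its own subdirectory of the destination. That is safe as long as the dest layout mirrors the source, but note two gaps: a top-level directory that has disappeared from /source is never removed from /dest, and files sitting directly in /source are not copied at all.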


5. Common patterns

Disk-usage one-liners

# Top 20 largest files in a tree
find /var -type f -printf '%s %p\n' | sort -rn | head -n 20

# Top 20 largest entries directly under /var (each size is the recursive total)
du -sh /var/* 2>/dev/null | sort -hr | head -n 20

# Total size of files matching a pattern
find . -name '*.log' -printf '%s\n' | awk '{s+=$1} END {print s}'

# Files older than 7 days, total size
find . -mtime +7 -type f -printf '%s\n' | awk '{s+=$1} END {print s/1024/1024 " MB"}'

Safely deleting many files

# DON'T — argv overflow risk
rm /var/cache/*.tmp

# DO: find + xargs handles arbitrary file counts (-r: don't run rm if nothing matched)
find /var/cache -name '*.tmp' -print0 | xargs -0r rm --

# Or with -delete (no fork, but no pre-list)
find /var/cache -name '*.tmp' -delete

Mirroring with hard links (zero-copy “branching”)

# Make a snapshot that shares storage with the original
cp -al /source /snapshot         # -a archive; -l hard-link instead of copy

# OR with rsync
rsync -a --link-dest=/source /source/ /snapshot/

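Both create /snapshot/ where every file is a hard link to its counterpart in /source/, so both names point to the same blocks. The link is broken only when a file is replaced (a new file written, then renamed over the old name). An in-place edit, including a plain > redirect, writes to the shared inode and changes both copies.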

This is how container layer snapshots work conceptually.
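The difference between an in-place write and a replace-by-rename is easy to check in a scratch directory. This is a small sketch with throwaway paths; cp -al is GNU cp.

```shell
#!/usr/bin/env bash
# Sketch: how edits behave on a hard-linked snapshot. Paths are throwaway.
work=$(mktemp -d)
mkdir "$work/source"
echo "original" > "$work/source/f.txt"
cp -al "$work/source" "$work/snapshot"

# In-place write (redirection reuses the inode): the snapshot changes too.
echo "edited in place" > "$work/source/f.txt"
inplace_view=$(cat "$work/snapshot/f.txt")

# Replace via new file + rename: source gets a new inode, snapshot keeps its data.
echo "replaced" > "$work/source/f.txt.new"
mv -- "$work/source/f.txt.new" "$work/source/f.txt"
rename_view=$(cat "$work/snapshot/f.txt")
source_view=$(cat "$work/source/f.txt")

echo "snapshot after in-place write: $inplace_view"
echo "snapshot after rename:         $rename_view"
echo "source after rename:           $source_view"
rm -rf -- "$work"
```

The snapshot follows the in-place write but is untouched by the rename, which is why tools that write via temp file and rename are safe to run against a hard-linked tree.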

Find files NOT matching a pattern

find . -type f -not -name '*.tmp'
find . -type f \! -name '*.tmp'      # ! escaped for shell

# Multiple patterns
find . -type f -not \( -name '*.tmp' -o -name '*.bak' \)

Atomic config reload without restart

# Pattern: write atomically, then signal the daemon
atomic_write /etc/myapp/config.yaml < new-config.yaml
systemctl reload myapp.service     # or: kill -HUP "$(cat myapp.pid)"

Provided the daemon re-reads its config on reload/SIGHUP, the atomic write means it never reads a partial config.


6. Common pitfalls

cp losing perms / xattrs

cp file dest                         # may not preserve perms, ownership, ACLs
cp -a file dest                      # archive mode (preserves)
cp -p file dest                      # preserve mode/owner/timestamps only

Use -a (or cp --preserve=all) when fidelity matters.

rm -rf with empty variable

The classic disaster:

DIR=""
rm -rf "$DIR/cache"          # if DIR is empty: rm -rf "/cache" — DELETES /cache !

Always validate:

[[ -n "$DIR" ]] || die "DIR is empty"
rm -rf -- "$DIR/cache"

The -- ends option processing — protects against $DIR accidentally starting with a dash.
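The shell can also enforce the check itself through the ${var:?message} expansion, which aborts before rm ever runs if the variable is unset or empty. A minimal sketch follows; clear_cache is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: let parameter expansion refuse an empty or unset directory.
# clear_cache is a hypothetical helper name.
clear_cache() {
  local dir=${1-}
  # ${dir:?msg} aborts the current (sub)shell with msg if dir is unset or empty,
  # so rm is never executed with a path of "/cache".
  rm -rf -- "${dir:?clear_cache: directory argument is empty}/cache"
}

# Run in a subshell so the failed expansion does not end the calling script.
if ( clear_cache "" ) 2>/dev/null; then
  echo "unexpected: ran with an empty directory"
else
  echo "refused to run with an empty directory"
fi
```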

mv across filesystems

mv /var/data/big.tar /tmp/    # if /tmp is a separate FS, this is cp + unlink

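An interrupted cross-FS mv can leave a partial copy at the destination (and, for a directory tree, a source that is already partly removed). For huge files, prefer cp -av -- "$src" "$dst" && rm -- "$src" so you control the timing and only delete after the copy has succeeded.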

find -exec with ; vs +

find . -name '*.log' -exec gzip {} \;     # forks gzip per file — slow
find . -name '*.log' -exec gzip {} +      # batched — fast

Always use + unless your command really only takes one file.

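cp -r dir1/ dir2/ vs cp -r dir1 dir2/

# Assuming dir2 already exists:
cp -r dir1/ dir2/        # BSD/macOS cp: copies CONTENTS of dir1 into dir2
                         # GNU cp: trailing slash is ignored, creates dir2/dir1
cp -r dir1  dir2/        # copies dir1 ITSELF into dir2 (creates dir2/dir1)
cp -r dir1/. dir2/       # copies CONTENTS of dir1 into dir2 on both

The rsync trailing-slash rule only carries over to BSD/macOS cp. GNU cp ignores a trailing slash on the source, so write dir1/. when you mean “copy the contents”.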
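Because cp implementations differ in how they treat a trailing slash on the source, the dir1/. form is a handy way to copy contents without relying on it. A small sketch with throwaway paths:

```shell
#!/usr/bin/env bash
# Sketch: copy the CONTENTS of a directory with cp without relying on
# trailing-slash handling. Paths are throwaway.
work=$(mktemp -d)
mkdir -p "$work/dir1/sub" "$work/dir2"
echo "hello" > "$work/dir1/sub/file.txt"
echo "dot"   > "$work/dir1/.hidden"

# "dir1/." names the directory's own contents, dotfiles included.
cp -R "$work/dir1/." "$work/dir2/"

# List what landed in dir2, relative to dir2.
listing=$(cd "$work/dir2" && find . -mindepth 1 | LC_ALL=C sort)
printf '%s\n' "$listing"
rm -rf -- "$work"
```

The listing shows .hidden, sub and sub/file.txt directly under dir2, with no nested dir1 directory.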

Filesystem quirks

du vs ls -l size

ls -l file              # logical size (what you'd read)
du -h file              # disk usage (rounded to block; sparse files lower)

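For sparse files (VM disk images, database files), du can be much smaller than ls -l. Use du when the question is how many blocks a file really occupies on disk, and the logical size when the question is how much a reader, or a copy that does not preserve holes, will have to process.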
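The gap is easy to reproduce. The sketch below assumes GNU truncate, stat and du, and a filesystem that supports sparse files; the 100M size is arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: logical size vs allocated size for a sparse file.
# Assumes GNU truncate/stat/du and a filesystem with sparse-file support.
f=$(mktemp)
truncate -s 100M "$f"                 # 100 MiB of hole, no data blocks written

logical=$(stat -c %s "$f")            # bytes a reader would see
allocated=$(du -k "$f" | cut -f1)     # KiB actually allocated on disk

echo "logical:   $logical bytes"
echo "allocated: $allocated KiB"
rm -f -- "$f"
```

On a filesystem with sparse-file support the allocated figure should be far below the 102400 KiB that the logical size suggests.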


7. The lib/files.sh framework

# lib/files.sh — file-operation helpers

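atomic_write() {
  local target=$1 tmp
  tmp=$(mktemp -p "$(dirname -- "$target")" ".$(basename -- "$target").XXXXXX") || return 1
  # Clean up explicitly rather than with an EXIT trap: a trap set here would
  # replace the caller's trap, and $tmp is local to this function.
  if cat > "$tmp" && mv -- "$tmp" "$target"; then
    return 0
  fi
  rm -f -- "$tmp"
  return 1
}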

mirror_safely() {
  local src=$1 dst=$2
  [[ -d "$src" ]] || die "source not found: $src"
  rsync -aP --delete --exclude-from='.rsync-excludes' "$src/" "$dst/"
}

snapshot_with_link_dest() {
  local src=$1 dest_root=$2
  local today
  today=$(date -u +%Y-%m-%d)
  local target="$dest_root/$today"
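  local prev
  # Newest earlier snapshot; skip today's (it exists on a re-run). `|| true`: no match is not an error.
  prev=$(ls -1 "$dest_root" 2>/dev/null | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | grep -vxF -- "$today" | sort -r | head -n 1) || true
  local opts=( -aP --delete )
  [[ -n "$prev" && -d "$dest_root/$prev" ]] && opts+=( --link-dest="$dest_root/$prev" )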
  rsync "${opts[@]}" "$src/" "$target/"
}

count_files() {
  find "$1" -type f -print0 | tr -cd '\0' | wc -c
}

total_size_mb() {
  find "$1" -type f -printf '%s\n' 2>/dev/null | awk '{s+=$1} END {print s/1024/1024}'
}

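prune_older_than_days() {
  local dir=$1 days=$2 cutoff d
  # Compare the date in each snapshot's name: rsync -a copies the source
  # directory's mtime onto the snapshot, so -mtime does not track backup age.
  cutoff=$(date -u -d "$days days ago" +%Y-%m-%d 2>/dev/null || date -u -v-"$days"d +%Y-%m-%d)
  for d in "$dir"/????-??-??; do
    [[ -d "$d" ]] || continue
    if [[ "${d##*/}" < "$cutoff" ]]; then rm -rf -- "$d"; fi
  done
}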

Use:

source "$(dirname "${BASH_SOURCE[0]}")/lib/files.sh"

generate_config | atomic_write /etc/myapp/config.yaml
snapshot_with_link_dest /var/www /backups
prune_older_than_days /backups 30

8. Twelve idioms for daily use

# 1. rsync local copy (idiomatic)
rsync -aP /src/ /dst/

# 2. rsync remote
rsync -azP /src/ user@host:/dst/

# 3. rsync with delete + dry-run first
rsync -aPn --delete /src/ /dst/        # confirm
rsync -aP --delete /src/ /dst/         # apply

# 4. rsync incremental snapshots
rsync -aP --link-dest="$LAST_BACKUP" /src/ "/backups/$TODAY/"

# 5. NUL-safe iteration
mapfile -d '' -t FILES < <(find . -type f -print0)
for f in "${FILES[@]}"; do …; done

# 6. find + parallel
find . -name '*.log' -print0 | xargs -0 -P 8 -n 1 gzip

# 7. Atomic write
TMP=$(mktemp -p "$(dirname "$TARGET")" ".$(basename "$TARGET").XXXXXX")
generate > "$TMP" && mv -- "$TMP" "$TARGET"

# 8. Atomic dir-replace via symlink (new link under a temp name, then rename; mv -T is GNU)
rsync -aP /src/ /dest/v2/ && ln -s /dest/v2 /dest/live.tmp && mv -T /dest/live.tmp /dest/live

# 9. find -delete
find /tmp -mindepth 1 -mmin +60 -delete

# 10. Top 20 largest files
find /var -type f -printf '%s %p\n' | sort -rn | head -n 20

# 11. Hard-link snapshot (zero-copy clone)
cp -al /source /snapshot

# 12. Check before destructive op
[[ -n "$DIR" ]] || die "DIR empty"; rm -rf -- "$DIR/cache"

9. What you must internalise before lesson 19


What’s next

Lesson 19: Date & Time Arithmetic — ISO 8601, Time Zones, Locale Hazards & Reliable Cron-Time Math. Working with dates in shell is full of traps: date -d is GNU only; macOS BSD date uses different flags; the %N format isn’t portable; cron uses local time; UTC is mandatory in production. We cover the GNU/BSD difference comprehensively, the canonical ISO 8601 patterns, time-zone handling in scripts, computing yesterday/last-week/last-month dates, and the standard “cron-safe” date math. After L19 you’ll never have a “this script broke at DST” bug again.

See you there.
