Why Naive Pipelines Die at Terabyte Scale
The classic Unix pipeline:
cat huge.log | grep ERROR | awk '{print $7}' | sort | uniq -c | sort -rn | head
Breaks at scale for three reasons:
- cat reads the entire file into memory pages it doesn’t need.
- sort is O(N log N) in time and O(N) in memory (or external-memory disk thrash).
- uniq -c requires sorted input, meaning sort already did the heavy work.
For a 500 GB log this pipeline either OOMs, fills /tmp (which sort uses for spill files), or runs for 6+ hours. The discipline of scale-friendly log analysis is streaming aggregation: process each line exactly once, aggregate in memory only the distinct keys (which is bounded by cardinality, not data volume), and emit the result.
This lesson teaches four patterns:
| Pattern | When | Memory | CPU |
|---|---|---|---|
| Streaming awk | Single host, fits in RAM by cardinality | O(distinct keys) | Single-core, I/O bound |
| GNU parallel | Single host, CPU-bound parsing | O(workers × cardinality) | All cores |
| SSH fan-out map-reduce | Multiple hosts, files on each | O(host cardinality) per node | Distributed |
| External-memory sort | Cardinality itself doesn’t fit | O(disk) | Disk-bound |
Pattern 1: Streaming awk Aggregation
awk has built-in associative arrays. This single property makes it the right tool for ~80% of log-analysis tasks. The pattern:
awk '{ count[$7]++ } END { for (k in count) print count[k], k }' access.log \
| sort -rn | head -20
What this does:
- For each line, increment count[<7th field>] (typically the URL path in nginx combined log format).
- At end-of-input, iterate the associative array and emit <count> <key>.
- Sort numerically descending, take top 20.
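Memory used is bounded by the number of distinct URLs, not the log size. A site with 10,000 distinct URLs on a 500 GB log uses ~1 MB of awk memory. The same query with sort | uniq -c has to sort every extracted URL first, so its spill to /tmp grows with the number of log lines rather than with the number of distinct URLs.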
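To make the comparison concrete, here is a minimal sketch on a made-up five-line log. The sample data and the function names sample_log, count_streaming, and count_sorted are illustrative assumptions, not part of any real system. Both functions return the same counts; the streaming version keeps one counter per distinct URL, while the sort-based version has to order every line before counting.

```shell
# Hypothetical sample data: field 7 is the URL path, as in nginx combined format.
sample_log() {
    cat <<'EOF'
10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /a HTTP/1.1" 200 512
10.0.0.2 - - [10/Oct/2023:13:55:37 +0000] "GET /b HTTP/1.1" 200 128
10.0.0.1 - - [10/Oct/2023:13:55:38 +0000] "GET /a HTTP/1.1" 200 512
10.0.0.3 - - [10/Oct/2023:13:55:39 +0000] "GET /a HTTP/1.1" 404 0
10.0.0.2 - - [10/Oct/2023:13:55:40 +0000] "GET /b HTTP/1.1" 200 128
EOF
}

# Streaming aggregation: memory is one counter per distinct URL.
count_streaming() {
    awk '{ c[$7]++ } END { for (k in c) print c[k], k }' \
      | LC_ALL=C sort -k1,1nr -k2,2
}

# Sort-based aggregation: every extracted URL is sorted before counting.
count_sorted() {
    awk '{ print $7 }' | LC_ALL=C sort | uniq -c \
      | awk '{ print $1, $2 }' \
      | LC_ALL=C sort -k1,1nr -k2,2
}

sample_log | count_streaming
```

The final sort in each function is only there to make the output order deterministic for comparison.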
Why awk’s Associative Arrays Are So Fast
awk’s hash tables are written in C and tuned for line-oriented data. With mawk (usually the fastest awk implementation), throughput often exceeds 500 MB/s on a single core. That can be faster than grep -c for some patterns, because grep has to advance a regex state machine across the line while awk just hashes one field.
Quick benchmark on a 10 GB nginx log:
mawk aggregation: 18 sec
gawk aggregation: 42 sec
busybox aggregation: aborted at 5 min
Always install mawk on log-analysis hosts (apt install mawk). On Debian-family systems, awk is a symlink managed by update-alternatives; point it at mawk if performance matters:
update-alternatives --set awk /usr/bin/mawk
The Three awk Implementations
| Implementation | Speed | Features | Default on |
|---|---|---|---|
| gawk | Baseline | Most features (gensub, time funcs) | Most distros |
| mawk | 3-10× faster | Fewer features, no gensub | Some Debian variants |
| busybox awk | 10× slower than gawk | Minimal | Alpine, embedded |
If your script uses gensub, mktime, strftime, or --posix flags, you need gawk. For pure aggregation pipelines, mawk is the correct choice.
Real-World Streaming awk Patterns
Top 20 slowest endpoints (by mean response time):
awk '{
url = $7
rt = $NF # last field is request_time
sum[url] += rt
count[url]++
} END {
for (u in sum) printf "%.3f %d %s\n", sum[u]/count[u], count[u], u
}' access.log | sort -rn | head -20
The printf "%.3f %d %s" keeps the mean in fixed-point notation so sort -rn orders it correctly. Plain print formats non-integer values with OFMT (%.6g by default), which can switch to scientific notation, and sort -n does not parse exponents.
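The difference is easy to reproduce. The sketch below uses an arbitrary small value (0.0000123, chosen only for illustration) and three hypothetical helper functions to show what print emits, what printf emits, and how sort -rn then ranks an exponent-formatted value.

```shell
# print uses OFMT ("%.6g" by default), which switches to scientific notation
# for very small or very large non-integer values.
fmt_print()  { awk 'BEGIN { x = 0.0000123; print x }'; }

# printf with an explicit fixed-point format never emits an exponent.
fmt_printf() { awk 'BEGIN { x = 0.0000123; printf "%.7f\n", x }'; }

# sort -n stops parsing at the "e", so 1.23e-05 is compared as 1.23
# and wrongly ranks above 0.5 in a descending sort.
misordered() { printf '%s\n' 1.23e-05 0.5 | LC_ALL=C sort -rn | head -n 1; }

fmt_print
fmt_printf
misordered
```

GNU sort also has -g (general numeric), which does understand exponents, but it is slower and not available everywhere, so formatting the numbers in awk is the more portable choice.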
HTTP status code time-series (per minute):
awk '{
# $4 looks like "[10/Oct/2023:13:55:36"; keep DD/Mon/YYYY:HH:MM (portable, works in mawk)
bucket = substr($4, 2, 17)
code = $9
ts[bucket "|" code]++
} END {
for (k in ts) {
split(k, a, "|")
print a[1], a[2], ts[k]
}
}' access.log | sort
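This produces a flat per-minute series of timestamp, status code, and count, suitable for feeding to gnuplot or, once converted to the Prometheus exposition format, for node_exporter's textfile collector so that Grafana can chart it.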
5xx error spike detector (alerting from cron):
errors=$(awk '$9 ~ /^5/ {c++} END {print c+0}' /var/log/nginx/access.log)
if (( errors > 100 )); then
curl -X POST -H 'Content-Type: application/json' "$ALERTMANAGER_URL/api/v2/alerts" \
-d "[{\"labels\":{\"alertname\":\"5xxSpike\",\"value\":\"$errors\"}}]"
fi
The c+0 trick forces c to be numeric even if no errors were found (otherwise it’d print empty string).
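A quick way to see the effect is to feed awk a line with no 5xx status. This is a minimal sketch; the function names and the dummy input line are illustrative only.

```shell
# With c+0, an unset counter is coerced to the number 0.
count_5xx()       { awk '$9 ~ /^5/ { c++ } END { print c + 0 }'; }

# Without it, an unset counter prints as an empty string.
count_5xx_naive() { awk '$9 ~ /^5/ { c++ } END { print c }'; }

# Dummy line whose 9th field is 200, so nothing matches.
printf 'f1 f2 f3 f4 f5 f6 f7 f8 200\n' | count_5xx
printf 'f1 f2 f3 f4 f5 f6 f7 f8 200\n' | count_5xx_naive
```

An empty result is a problem for anything downstream that expects an integer, such as a [ "$errors" -gt 100 ] test or a JSON payload.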
Pattern 2: GNU parallel for CPU-Bound Map
When the work per line is heavy (regex compilation, JSON parsing, network lookup), single-core awk becomes the bottleneck. GNU parallel farms work across cores:
# Parse 1000 logs in parallel, 8 workers
ls /var/log/nginx/access-*.gz | parallel -j8 \
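'zcat {} | awk -v f={} "/ERROR/ {c++} END {print f, c+0}"'
The -j8 sets the number of workers; {} is the input filename; each invocation is independent. The filename is passed in with -v because awk reads from stdin here, so FILENAME would be empty. parallel buffers each job's output and prints it when the job finishes; add -k to keep the output in input order.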
Map-Reduce in One Pipeline
For aggregation across many files, the pattern is:
- Map: each file → partial aggregation (key → count).
- Reduce: merge partial aggregations → global aggregation.
ls /var/log/nginx/access-*.gz | parallel -j8 \
'zcat {} | awk "{ count[\$7]++ } END { for (k in count) print count[k], k }"' \
| awk '{ count[$2] += $1 } END { for (k in count) print count[k], k }' \
| sort -rn | head -20
Stage 1 (parallel) outputs partial counts per file. Stage 2 (single awk) sums across files; this stage is small because its input is already partially aggregated. The key insight: the reduce step’s input size is O(files × distinct keys), not O(total log size).
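The map and reduce stages can be checked in isolation without parallel installed. In this sketch, shard_a and shard_b are made-up stand-ins for two log files, and map_counts and reduce_counts are hypothetical names for the two awk stages shown above. Reducing the per-shard partial counts should give the same answer as counting everything in one pass.

```shell
# Map: one partial aggregation per input stream, emitted as "count key".
map_counts()    { awk '{ c[$7]++ } END { for (k in c) print c[k], k }'; }

# Reduce: sum the partial counts per key.
reduce_counts() { awk '{ c[$2] += $1 } END { for (k in c) print c[k], k }'; }

# Two hypothetical shards; field 7 is the URL.
shard_a() {
    printf '%s\n' \
      'h - - [t z] "GET /a HTTP/1.1" 200 1' \
      'h - - [t z] "GET /a HTTP/1.1" 200 1' \
      'h - - [t z] "GET /b HTTP/1.1" 200 1'
}
shard_b() {
    printf '%s\n' \
      'h - - [t z] "GET /a HTTP/1.1" 200 1' \
      'h - - [t z] "GET /c HTTP/1.1" 200 1'
}

# Map each shard separately, then reduce the concatenated partial counts.
{ shard_a | map_counts; shard_b | map_counts; } \
  | reduce_counts | LC_ALL=C sort -k1,1nr -k2,2
```

The reducer assumes keys contain no whitespace, which holds for URL paths as nginx logs them.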
parallel vs xargs
xargs -P N does parallel execution too, but lacks parallel’s:
- Per-job timeout (--timeout).
- Output buffering and ordering (-k).
- Job retry (--retries).
- ETA progress (--eta).
- Job control (--halt).
For one-shot parallel maps, xargs -P is fine. For long-running production pipelines, parallel is worth the install.
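For the one-shot case, a minimal xargs -P version of the same map-reduce might look like the sketch below. The function name map_files_xargs and the temporary sample files are assumptions made for illustration. -P is supported by GNU and BSD xargs but is not part of POSIX.

```shell
# map_files_xargs NJOBS FILE...: count field 7 per file with up to NJOBS
# concurrent awk processes, then merge the partial counts in one reduce pass.
map_files_xargs() {
    njobs=$1; shift
    printf '%s\n' "$@" \
      | xargs -P "$njobs" -I{} awk '{ c[$7]++ } END { for (k in c) print c[k], k }' {} \
      | awk '{ c[$2] += $1 } END { for (k in c) print c[k], k }' \
      | LC_ALL=C sort -k1,1nr -k2,2
}

# Demo on two small throwaway files.
demo_dir=$(mktemp -d)
printf '%s\n' 'h - - [t z] "GET /a HTTP/1.1" 200 1' \
              'h - - [t z] "GET /b HTTP/1.1" 200 1' > "$demo_dir/1.log"
printf '%s\n' 'h - - [t z] "GET /a HTTP/1.1" 200 1' > "$demo_dir/2.log"
map_files_xargs 2 "$demo_dir/1.log" "$demo_dir/2.log"
rm -rf "$demo_dir"
```

Unlike parallel, xargs does not group output per job, so this only stays safe while each map process writes a small result in one go. File names containing newlines are not handled.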
parallel Throttling for Production
When running parallel against a database or external API, you must throttle:
# At most 4 jobs at a time, with 100ms gap between job starts
parallel -j4 --delay 0.1 './query-api.sh {}' :::: hosts.txt
# Limit by load average — pause if loadavg > 8
parallel -j4 --load 8 './heavy-job.sh {}' :::: inputs.txt
# Auto-tune to leave 2 cores free
parallel -j-2 './job.sh {}' :::: inputs.txt
The -j-2 (negative) is “all cores except 2” — useful for keeping the box responsive.
Pattern 3: Distributed Map-Reduce via SSH Fan-Out
When logs live on dozens of hosts (fleet of web servers), the pattern is:
- ssh-fan-out to run the map on each host (work happens locally, network only carries aggregated output).
- Reduce the per-host outputs centrally.
#!/usr/bin/env bash
# fleet-top-urls.sh — get top URLs across the entire web fleet
set -euo pipefail
readonly HOSTS="$(cat /etc/web-hosts.txt)"
# Map: each host runs awk locally, returns partial aggregation
for host in $HOSTS; do
ssh -o ConnectTimeout=5 -o BatchMode=yes "$host" \
"awk '{ c[\$7]++ } END { for (k in c) print c[k], k }' /var/log/nginx/access.log" \
> "/tmp/fleet-map.$host" &
done
wait
# Reduce: merge per-host outputs
cat /tmp/fleet-map.* \
| awk '{ c[$2] += $1 } END { for (k in c) print c[k], k }' \
| sort -rn | head -20
rm /tmp/fleet-map.*
The ssh -o BatchMode=yes is critical — it prevents SSH from prompting for passwords if key auth fails, which would hang the script. The ConnectTimeout=5 bounds the wait for unreachable hosts.
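One gap in the loop above is that a bare wait returns 0 no matter what the background jobs did, so a host that timed out silently contributes nothing to the totals. A minimal sketch of waiting on each PID individually follows. run_on_hosts is a hypothetical helper, and the runner it calls stands in for the ssh command, so the pattern can be exercised without a fleet.

```shell
# run_on_hosts RUNNER HOST...: start "RUNNER host" in the background for each
# host, then wait on each PID so individual failures are reported.
# Returns non-zero if any host failed. Variables are global (plain POSIX sh).
run_on_hosts() {
    runner=$1; shift
    job_list=""
    for host in "$@"; do
        "$runner" "$host" &
        job_list="$job_list $!:$host"
    done
    failed=0
    for entry in $job_list; do
        pid=${entry%%:*}
        host=${entry#*:}
        if ! wait "$pid"; then
            echo "map failed on $host" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Usage sketch (map_host is a hypothetical wrapper around the ssh call above):
#   map_host() { ssh -o ConnectTimeout=5 -o BatchMode=yes "$1" "awk ..." > "/tmp/fleet-map.$1"; }
#   run_on_hosts map_host $HOSTS || echo "partial results only" >&2
```

Whether a partial result should abort the report or just be flagged is a policy decision; the sketch only makes the failure visible.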
Production-Grade Fan-Out With pdsh or parallel
For >20 hosts the bash loop above becomes slow and lacks failure handling. Use parallel with --nonall, which runs the same command once on every host listed in an --sshloginfile (--slf):
parallel --nonall --slf /etc/web-hosts.txt \
"awk '{ c[\$7]++ } END { for (k in c) print c[k], k }' /var/log/nginx/access.log" \
| awk '{ c[$2] += $1 } END { for (k in c) print c[k], k }' \
| sort -rn | head -20
--slf reads one sshlogin per line from the file, and --nonall runs the command on each of them without taking input arguments. parallel manages the ssh connections and groups each host's output; add --retries or -k if you need retries or stable ordering.
Or pdsh (Parallel Distributed Shell)
pdsh is the heavyweight option, originally from LLNL clusters:
pdsh -N -w "$(paste -sd, /etc/web-hosts.txt)" \
"awk '{ c[\$7]++ } END { for (k in c) print c[k], k }' /var/log/nginx/access.log" \
| awk '{ c[$2] += $1 } END { for (k in c) print c[k], k }' \
| sort -rn | head -20
By default pdsh prefixes each output line with "hostname: ", which would shift the fields the reducer expects. The -N flag turns the prefix off, so the same reducer works unchanged. parallel doesn’t prefix unless you ask for it.
Pattern 4: External-Memory Sort When Cardinality Itself Doesn’t Fit
Sometimes the distinct keys themselves don’t fit in RAM — e.g., a per-user-id aggregation across a billion users. The streaming-awk pattern fails.
The answer is sort with explicit external-memory tuning:
# Sort with 4GB memory budget, 4-way parallel merge, /var/tmp for spill
awk '{ print $4, $7 }' huge.log \
| sort --buffer-size=4G --parallel=4 -T /var/tmp \
| uniq -c \
| sort -rn \
| head -20
Critical flags:
- --buffer-size=4G: how much RAM sort may use before spilling to disk. GNU sort sizes its default buffer from the memory available on the host; setting it explicitly makes spill behavior predictable and can reduce the number of spill files.
- --parallel=4: number of parallel sort threads.
- -T /var/tmp: spill directory. Critical: /tmp is often tmpfs (RAM-backed!) and small. /var/tmp is real disk and persists across reboot, which is what you want for big sorts.
- For ASCII-only data, set LC_ALL=C; locale-aware sort is 5-10× slower:
LC_ALL=C sort --buffer-size=4G ...
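The shape of the sort-based count can be tried on a handful of keys. In this sketch, count_by_sort is a hypothetical helper and the user ids are made up. It leaves out --buffer-size and --parallel, which are GNU-specific, so that the example also runs with BSD sort; on a real workload you would add them back.

```shell
# count_by_sort [SPILL_DIR]: count identical input lines using sort + uniq -c,
# with byte-wise (C locale) comparison and an explicit spill directory.
# Output is "count key", highest count first.
count_by_sort() {
    spill_dir=${1:-/var/tmp}
    LC_ALL=C sort -T "$spill_dir" \
      | uniq -c \
      | LC_ALL=C sort -k1,1nr -k2 \
      | awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n, $0 }'
}

# Demo with made-up keys; TMPDIR is used as the spill directory here.
printf '%s\n' u1 u2 u1 u3 u1 u2 | count_by_sort "${TMPDIR:-/tmp}"
```

The trailing awk only strips the padding that uniq -c adds in front of each count.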
When to Use External Sort vs. awk
| Situation | Use awk | Use external sort |
|---|---|---|
| Distinct keys < 10M | ✓ | |
| Distinct keys 10M-1B | ? (depends on RAM) | ✓ |
| Need ordering, not just counts | | ✓ |
| Heavy parsing per line | awk + parallel | sort + parallel |
The rule of thumb: streaming awk for aggregation, external sort for ordering. They’re complementary, not competing.
Slow Query Log Reduction: A Real-World Case Study
Postgres slow query logs and MySQL slow query logs are the canonical “fits in awk” workload. The classic tool pgBadger is a large Perl script with many report types; for the common top-N slow-query question, about 50 lines of awk are enough.
#!/usr/bin/awk -f
# pg-slowlog-reduce.awk — reduce Postgres CSV log to top-N slow queries
BEGIN { FS = "," }
/duration:/ {
# Extract duration in ms with two-argument match(), which also works in mawk
if (!match($0, /duration: [0-9.]+ ms/)) next
d = substr($0, RSTART + 10, RLENGTH - 13) + 0
# Normalize the query: strip quoted strings first, then literal numbers
q = $0
gsub(/'[^']*'/, "'?'", q)
gsub(/[0-9]+/, "?", q)
# Hash by normalized query
total[q] += d
count[q]++
if (d > max[q]) max[q] = d
}
END {
for (q in total) {
printf "%.0f total_ms / %d calls / %.0f max_ms — %s\n", \
total[q], count[q], max[q], substr(q, 1, 80)
}
}
Run with:
awk -f pg-slowlog-reduce.awk /var/log/postgresql/postgres.log \
| sort -rn | head -20
Normalization is the magic: gsub(/[0-9]+/, "?") collapses WHERE id=123 and WHERE id=456 into the same shape WHERE id=?, so they aggregate. Without normalization, every query would be unique and the analysis is useless.
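The effect of the two gsub calls can be checked on individual statements. This is a minimal sketch with a hypothetical helper, normalize_sql, and made-up queries. It uses \047 for the single quote so the awk program can sit inside a single-quoted shell string.

```shell
# normalize_sql: read SQL text on stdin and collapse literals into placeholders.
# Quoted strings are replaced first so digits inside them do not survive as "?".
normalize_sql() {
    awk '{
        gsub(/\047[^\047]*\047/, "\047?\047")   # quoted strings become a quoted ?
        gsub(/[0-9]+/, "?")                      # numeric literals become ?
        print
    }'
}

echo "SELECT * FROM users WHERE id=123 AND name='bob42'" | normalize_sql
echo "SELECT * FROM users WHERE id=456 AND name='alice'" | normalize_sql
```

Both statements come out as the same shape, so they land in the same aggregation bucket. Note that the numeric rule also rewrites digits in identifiers (a table named t1 becomes t?), which is usually acceptable for grouping purposes.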
The same pattern works for nginx URL aggregation (collapse /users/123 → /users/?):
awk '{
url = $7
gsub(/\/[0-9]+/, "/?", url)
c[url]++
} END {
for (u in c) print c[u], u
}' access.log | sort -rn | head -20
The Drop-In lib/loganalyze.sh
# lib/loganalyze.sh — sourced helpers for log analysis pipelines.
#
# Depends on mawk (preferred) or gawk.
set -o errexit -o nounset -o pipefail
la_log() { printf '[%s] [loganalyze] %s\n' "$(date -Iseconds)" "$*"; }
# Detect best awk
la_awk_path() {
if command -v mawk >/dev/null; then echo mawk
elif command -v gawk >/dev/null; then echo gawk
else echo awk
fi
}
# Top-N URL aggregator (nginx combined format).
# Args: file, n
la_top_urls() {
local file="$1" n="${2:-20}"
local awk_bin
awk_bin=$(la_awk_path)
if [[ "$file" == *.gz ]]; then
zcat "$file"
elif [[ "$file" == *.zst ]]; then
zstd -d -c "$file"
else
cat "$file"
fi | "$awk_bin" '{
url = $7
gsub(/\/[0-9]+/, "/?", url)
c[url]++
} END {
for (u in c) print c[u], u
}' | sort -rn | head -"$n"
}
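# Mean and p99 response time per URL. Requires nginx with $request_time as last field.
# Needs gawk: asort() is a gawk extension that mawk does not provide.
# Memory note: every sample is kept, so usage grows with line count, not cardinality.
la_response_time_stats() {
local file="$1" n="${2:-20}"
command -v gawk >/dev/null || { la_log "la_response_time_stats requires gawk" >&2; return 1; }
gawk '{
url = $7; rt = $NF + 0
gsub(/\/[0-9]+/, "/?", url)
sum[url] += rt
count[url]++
n = count[url]
times[url, n] = rt
} END {
for (u in sum) {
mean = sum[u] / count[u]
# Nearest-rank p99: sort the recorded times for this URL
n = count[u]
delete arr
for (i = 1; i <= n; i++) arr[i] = times[u, i]
asort(arr)
# Clamp the index so URLs with fewer than 100 samples still get a value
idx = int(n * 0.99); if (idx < 1) idx = 1
p99 = arr[idx]
printf "%.3f %.3f %d %s\n", mean, p99, n, u
}
}' "$file" | sort -rn -k2 | head -"$n"
}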
# 5xx counter for alerting. Args: file. Returns count to stdout.
la_5xx_count() {
local file="$1"
awk '$9 ~ /^5/ {c++} END {print c+0}' "$file"
}
# Top error patterns from generic log file. Looks for ERROR/WARN/FATAL prefixes.
# Args: file, n
la_top_errors() {
local file="$1" n="${2:-20}"
awk '/(ERROR|WARN|FATAL|CRITICAL)/ {
# Strip timestamp + thread/PID. Keep last 200 chars.
msg = substr($0, length($0) > 200 ? length($0) - 199 : 1)
# Normalize numbers
gsub(/[0-9]+/, "?", msg)
c[msg]++
} END {
for (m in c) print c[m], m
}' "$file" | sort -rn | head -"$n"
}
# Distributed top-URLs across a fleet. Args: hosts_file, log_path, n
la_fleet_top_urls() {
local hosts="$1" log="$2" n="${3:-20}"
local tmpdir
tmpdir=$(mktemp -d)
trap "rm -rf '$tmpdir'" RETURN
while read -r host; do
[[ -z "$host" || "$host" =~ ^# ]] && continue
# -n keeps ssh from consuming the host list that the while loop reads on stdin
ssh -n -o ConnectTimeout=5 -o BatchMode=yes "$host" \
"awk '{ gsub(/\/[0-9]+/,\"/?\",\$7); c[\$7]++ } END { for(k in c) print c[k],k }' $log" \
> "$tmpdir/$host" &
done < "$hosts"
wait
cat "$tmpdir"/* \
| awk '{ c[$2] += $1 } END { for (k in c) print c[k], k }' \
| sort -rn | head -"$n"
}
# Slow query reducer for Postgres CSV log. Args: file, n
la_pg_slow_queries() {
local file="$1" n="${2:-20}"
awk '/duration:/ {
# Two-argument match() keeps this portable to mawk
if (!match($0, /duration: [0-9.]+ ms/)) next
d = substr($0, RSTART + 10, RLENGTH - 13) + 0
q = $0
gsub(/\047[^\047]*\047/, "?", q)
gsub(/[0-9]+/, "?", q)
total[q] += d; count[q]++
if (d > max[q]) max[q] = d
} END {
for (q in total) {
printf "%.0f %d %.0f %s\n", total[q], count[q], max[q], substr(q,1,120)
}
}' "$file" | sort -rn | head -"$n"
}
Note \047, the octal escape for a single quote. Embedding ' inside a single-quoted awk program in bash gets ugly fast; \047 sidesteps the quoting nightmare.
Using the Library
#!/usr/bin/env bash
source /usr/local/lib/loganalyze.sh
# Top 20 URLs from yesterday's compressed log
la_top_urls /var/log/nginx/access.log.1.gz 20
# Alert on 5xx spike
errors=$(la_5xx_count /var/log/nginx/access.log)
if (( errors > 1000 )); then
/usr/local/bin/alert "5xx spike: $errors"
fi
# Fleet-wide top URLs
la_fleet_top_urls /etc/web-hosts.txt /var/log/nginx/access.log 50
# Postgres slow queries from last hour
journalctl -u postgresql --since '1 hour ago' --no-pager > /tmp/pg-recent.log
la_pg_slow_queries /tmp/pg-recent.log 20
Streaming Real-Time vs. Batch
Everything above is batch (process a file). For real-time tail-and-aggregate, the pattern is:
# Keep a rolling 60-second 5xx window
tail -F /var/log/nginx/access.log \
| awk '
BEGIN { window = 60; head = 1 }
{
now = systime()   # not POSIX; gawk provides it
if ($9 ~ /^5/) {
events[++tail] = now
}
# Drop events older than the window so the array stays bounded
while (head <= tail && events[head] < now - window) delete events[head++]
if (NR % 100 == 0) print "5xx in last 60s:", tail - head + 1
}
'
Real-time has its own footguns: tail -F (capital F) re-opens the file by name on rotation; tail -f (lowercase) keeps following the old, rotated file and silently stops seeing new lines. Always use -F for production tailers.
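A rolling window is easier to verify when the clock comes from the input instead of the wall clock. The sketch below is an offline variant under that assumption: window_counts is a hypothetical helper, and its input format (epoch seconds followed by a status code, in time order) is invented for the example. Expired events are evicted from the front of the queue, so memory is bounded by the number of events inside the window.

```shell
# window_counts WINDOW: stdin lines are "<epoch_seconds> <status>", in time order.
# After each line, print "<epoch_seconds> <count of 5xx in the last WINDOW seconds>".
window_counts() {
    awk -v window="$1" '
    BEGIN { head = 1; tail = 0 }
    {
        now = $1
        if ($2 ~ /^5/) events[++tail] = now
        # Evict events that fell out of the window.
        while (head <= tail && events[head] < now - window) delete events[head++]
        print now, tail - head + 1
    }'
}

printf '%s\n' '100 500' '130 200' '150 503' '165 200' '300 200' | window_counts 60
```

In the sample run, the event at t=100 is still inside the window at t=150 and has expired by t=165; by t=300 the window is empty.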
For higher-throughput streaming, the right tool is usually Vector, Fluent Bit, or Logstash — but for ad-hoc investigation, tail -F | awk is unbeatable.
The 8 Footguns
1. cat huge.log | grep Instead of grep huge.log
cat is the redundant first stage. grep already takes a filename. Fix: grep PATTERN huge.log. (Useless Use of Cat — UUOC — is a real category at scale.)
2. sort | uniq -c for Aggregation
Already covered — uses external sort for what awk does in O(distinct keys). Fix: awk '{c[$1]++} END {for(k in c) print c[k],k}'.
3. Locale-Aware Sort Slowness
LC_ALL=en_US.UTF-8 makes sort 5-10× slower than LC_ALL=C because it does Unicode collation. Fix: Set LC_ALL=C for log analysis (your hostnames and URLs are ASCII).
4. /tmp Tmpfs Filling Up
sort spills to /tmp by default. On systems where /tmp is tmpfs (RAM-backed), a 100GB sort fills RAM and OOMs the machine. Fix: sort -T /var/tmp ... to use real disk.
5. tail -f Instead of tail -F
-f (lowercase) follows by file descriptor. After log rotation, the original FD is to a deleted file; new logs go to a new file the tailer never sees. Fix: tail -F (uppercase) follows by name — re-opens on rotation.
6. awk Memory Blow-Up From Unbounded Cardinality
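If the key you’re aggregating on is unbounded (every line is unique, e.g., bucketing by request-id), awk’s hash grows without limit and the process eventually OOMs. Fix: Aggregate on a lower-cardinality key (for example a normalized URL instead of a request-id), or switch to the external sort from Pattern 4. Truncating the key, e.g. c[substr($1,1,32)]++, only helps if the truncated prefix actually collapses many values into one.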
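Another defensive option is to cap the number of keys the script is willing to track. The following is a sketch of that idea rather than a standard technique from any tool: capped_count and the "__other__" bucket name are assumptions made for illustration.

```shell
# capped_count MAX_KEYS: count values of field 1, tracking at most MAX_KEYS
# distinct keys. Keys first seen after the cap is reached are lumped into
# a single "__other__" bucket, so the array never exceeds MAX_KEYS + 1 entries.
capped_count() {
    awk -v max="$1" '
    {
        k = $1
        if (!(k in c)) {
            if (n >= max) k = "__other__"
            else n++
        }
        c[k]++
    }
    END { for (k in c) print c[k], k }' | LC_ALL=C sort -k1,1nr -k2,2
}

printf '%s\n' a b a c d a | capped_count 2
```

The cap keeps the keys that appear first, not the heaviest ones, so a large "__other__" count is a signal to rethink the key rather than a result to report.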
7. parallel With Stateful Workers
If your awk script depends on state across input lines (running totals, monotonic counters), running it in parallel chunks gives wrong answers — each chunk’s awk has separate state. Fix: Either restructure to be stateless (pure aggregation, then merge) or process serially.
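Per-URL mean response time is a common case: averaging the per-chunk means gives the wrong answer when chunks differ in size. A stateless restructuring emits the sum and the count from each chunk and divides only once in the reducer. The sketch below uses made-up chunks and hypothetical function names.

```shell
# Map: per chunk, emit "<sum> <count> <url>" rather than a mean.
map_sum_count() {
    awk '{ s[$7] += $NF; n[$7]++ }
         END { for (u in s) printf "%.6f %d %s\n", s[u], n[u], u }'
}

# Reduce: add sums and counts across chunks, divide once at the end.
reduce_mean() {
    awk '{ s[$3] += $1; n[$3] += $2 }
         END { for (u in s) printf "%.3f %d %s\n", s[u] / n[u], n[u], u }'
}

# Two hypothetical chunks: three fast requests in one, one slow request in the other.
chunk_1() {
    printf '%s\n' \
      'h - - [t z] "GET /a HTTP/1.1" 200 1 1.0' \
      'h - - [t z] "GET /a HTTP/1.1" 200 1 1.0' \
      'h - - [t z] "GET /a HTTP/1.1" 200 1 1.0'
}
chunk_2() { printf '%s\n' 'h - - [t z] "GET /a HTTP/1.1" 200 1 5.0'; }

{ chunk_1 | map_sum_count; chunk_2 | map_sum_count; } | reduce_mean
```

The true mean over the four requests is 2.0, whereas the mean of the two chunk means (1.0 and 5.0) would be 3.0.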
8. SSH Fan-Out Without ConnectTimeout
Without a timeout, each dead host can stall its connection for ~2 minutes (TCP timeout). In a serial loop those waits add up: 100 unreachable hosts means a wait of roughly 200 minutes. Fix: Always ssh -o ConnectTimeout=5 -o BatchMode=yes. The BatchMode prevents password prompts from hanging in terminal-less environments.
Quick-Reference Card
STREAMING AGGREGATION (single host)
awk '{c[$KEY]++} END {for(k in c) print c[k],k}' | sort -rn | head
Memory: O(distinct keys), not O(file size)
awk PERFORMANCE
mawk: fastest, fewer features
gawk: most features (gensub, mktime), slower
busybox: 10× slower, embedded only
Set LC_ALL=C for ASCII data → 5-10× speedup
PARALLEL (multi-core, single host)
ls files | parallel -j8 'zcat {} | awk ... '
Then merge with another awk pass
-j-2 = "all cores minus 2", --load 8 = pause if loadavg > 8
SSH FAN-OUT (multi-host)
parallel --nonall --slf hosts.txt 'awk ... /var/log/...'
Always: ssh -o ConnectTimeout=5 -o BatchMode=yes
Map locally on each host, reduce centrally
EXTERNAL SORT (cardinality > RAM)
sort --buffer-size=4G --parallel=4 -T /var/tmp
LC_ALL=C for ASCII → 5-10× speedup
NORMALIZATION (the magic for log aggregation)
gsub(/\/[0-9]+/, "/?") → /users/123 → /users/?
gsub(/[0-9]+/, "?") → strip all numerics
gsub(/'[^']*'/, "?") → strip quoted strings (SQL params)
REAL-TIME
tail -F (capital F!) for rotation safety
Pipe to awk with running window aggregator
What’s Next
You can now extract signal from terabyte-scale logs. The next step is to act on that signal: build self-healing scripts that detect a problem (high 5xx rate, queue depth above threshold, stuck process), decide whether to remediate, and act with bounded blast-radius — without becoming the cause of the next incident through a runaway loop.
In the next lesson — Self-Healing Scripts: Detect-Decide-Act Loops, Blast-Radius Limits & Circuit Breakers — we’ll build lib/heal.sh covering detect/decide/act loops, circuit breakers that stop after N consecutive failures, blast-radius limits (“never restart more than 1 host per minute”), the dry-run discipline before any auto-remediation goes live, and the audit log every healer must write so post-incident review can answer “why did the healer do that?”