Shell Lesson 37 of 42

Shell Log Analysis at Scale: Streaming awk, GNU Parallel, Distributed grep/sort/uniq Pipelines for Terabyte-Sized Logs

Why Naive Pipelines Die at Terabyte Scale

The classic Unix pipeline:

cat huge.log | grep ERROR | awk '{print $7}' | sort | uniq -c | sort -rn | head

Breaks at scale for three reasons:

  1. cat adds an extra process and copies every byte through a pipe; grep can read the file directly.
  2. sort is O(N log N) in time and O(N) in memory (or external-memory disk thrash).
  3. uniq -c requires sorted input — meaning sort already did the heavy work.

For a 500 GB log this pipeline either OOMs, fills /tmp (which sort uses for spill files), or runs for 6+ hours. The discipline of scale-friendly log analysis is streaming aggregation: process each line exactly once, aggregate in memory only the distinct keys (which is bounded by cardinality, not data volume), and emit the result.

This lesson teaches four patterns:

| Pattern | When | Memory | CPU |
| --- | --- | --- | --- |
| Streaming awk | Single host, fits in RAM by cardinality | O(distinct keys) | Single-core, I/O bound |
| GNU parallel | Single host, CPU-bound parsing | O(workers × cardinality) | All cores |
| SSH fan-out map-reduce | Multiple hosts, files on each | O(host cardinality) per node | Distributed |
| External-memory sort | Cardinality itself doesn’t fit | O(disk) | Disk-bound |

Pattern 1: Streaming awk Aggregation

awk has built-in associative arrays. This single property makes it the right tool for ~80% of log-analysis tasks. The pattern:

awk '{ count[$7]++ } END { for (k in count) print count[k], k }' access.log \
  | sort -rn | head -20

What this does:

  1. For each line, increment count[<7th field>] (typically the URL path in nginx combined log format).
  2. At end-of-input, iterate the associative array and emit <count> <key>.
  3. Sort numerically descending, take top 20.

Memory used is bounded by the number of distinct URLs, not the log size. A site with 10,000 distinct URLs on a 500 GB log uses ~1 MB of awk memory. The same query with sort | uniq -c has to sort one key per log line, so its /tmp spill grows with the number of lines rather than the number of distinct URLs.
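As a small sanity check of the claim above, the sketch below runs both approaches on a tiny, made-up sample (the log lines and the temp file are assumptions for illustration) and confirms they produce the same counts. Only the amount of intermediate data differs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample in combined-log shape: field 7 is the URL path.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
for url in /a /b /a /c /a /b; do
  printf '1.1.1.1 - - [10/Oct/2024:13:55:36 +0000] "GET %s HTTP/1.1" 200 12\n' "$url"
done > "$sample"

# Streaming aggregation: one pass, one array slot per distinct URL.
streaming=$(awk '{ count[$7]++ } END { for (k in count) print count[k], k }' "$sample" \
  | LC_ALL=C sort -k1,1rn -k2,2)

# Classic approach: sorts one key per input line before counting.
classic=$(awk '{ print $7 }' "$sample" | LC_ALL=C sort | uniq -c \
  | awk '{ print $1, $2 }' | LC_ALL=C sort -k1,1rn -k2,2)

echo "$streaming"
```

The second sort key (-k2,2) is only there to make ties deterministic so the two results can be compared as strings.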

Why awk’s Associative Arrays Are So Fast

awk’s hash tables are written in C and tuned for line-oriented data. With mawk (usually the fastest awk implementation), throughput can reach hundreds of MB/s on a single core, because the per-line work is just splitting fields and hashing one of them.

Quick benchmark on a 10 GB nginx log:

mawk    aggregation:   18 sec
gawk    aggregation:   42 sec
busybox aggregation: aborted at 5 min

Always install mawk on log-analysis hosts (apt install mawk). On Debian-family distros, awk is a symlink managed by update-alternatives; point it to mawk if performance matters (run as root):

update-alternatives --set awk /usr/bin/mawk

The Three awk Implementations

| Implementation | Speed | Features | Default on |
| --- | --- | --- | --- |
| gawk | Baseline | Most features (gensub, time funcs) | Most distros |
| mawk | 3-10× faster | Fewer features, no gensub | Some Debian variants |
| busybox awk | 10× slower than gawk | Minimal | Alpine, embedded |

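If your script uses gensub, asort, the three-argument form of match(), or the --posix flag, you need gawk; time functions such as mktime and strftime are also missing from older mawk builds. For pure aggregation pipelines, mawk is the correct choice.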
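Before relying on a gawk-only feature, a script can probe whichever awk it is about to use. The sketch below is one way to do that; the AWK override variable and the two flavor labels are hypothetical names chosen for this example.

```shell
#!/usr/bin/env bash
# Probe for the three-argument match(), a gawk extension.
# Other awks reject it as a syntax error and exit non-zero.
awk_bin=${AWK:-awk}   # AWK is a hypothetical override, not a standard variable

if "$awk_bin" 'BEGIN {
      if (match("id=42", /([0-9]+)/, m) && m[1] == "42") exit 0
      exit 1
    }' 2>/dev/null; then
  awk_flavor="gawk-extensions"
else
  awk_flavor="posix-only"
fi

echo "$awk_bin: $awk_flavor"
```

A library could run this once at load time and fall back to substr()-based parsing when the result is posix-only.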

Real-World Streaming awk Patterns

Top 20 slowest endpoints (by mean response time):

awk '{
  url = $7
  rt = $NF             # last field is request_time
  sum[url] += rt
  count[url]++
} END {
  for (u in sum) printf "%.3f %d %s\n", sum[u]/count[u], count[u], u
}' access.log | sort -rn | head -20

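The printf "%.3f %d %s" keeps the numbers in fixed-point notation so sort -rn works correctly. Avoid print for floats: it formats non-integer values with OFMT (%.6g by default), which switches to scientific notation for large values, and sort -n does not parse exponents.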
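The difference is easy to reproduce. This sketch uses an arbitrary value (1234567.891) and two made-up result lines to show what print emits and how sort -rn then ranks the lines.

```shell
#!/usr/bin/env bash
export LC_ALL=C

# print uses OFMT (%.6g), printf uses the format you give it.
raw=$(awk 'BEGIN { x = 1234567.891; print x }')
fixed=$(awk 'BEGIN { x = 1234567.891; printf "%.3f\n", x }')
echo "print:  $raw"
echo "printf: $fixed"

# sort -n stops reading the number at the "e", so 1.23457e+06 sorts as 1.23457
# and the smaller value wrongly comes out on top.
top=$(printf '%s\n' '1.23457e+06 /big' '900.000 /small' | sort -rn | head -n 1)
echo "top line: $top"
```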

HTTP status code time-series (per minute):

awk '{
  # $4 looks like "[10/Oct/2024:13:55:36"; keep dd/Mon/yyyy:HH:MM
  bucket = substr($4, 2, 17)
  code = $9
  ts[bucket "|" code]++
} END {
  for (k in ts) {
    split(k, a, "|")
    print a[1], a[2], ts[k]
  }
}' access.log | sort

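This produces a flat per-minute series (the lexical sort is chronological only within a single day, because the date is dd/Mon/yyyy) suitable for gnuplot, or for conversion into metrics exposed through the Prometheus node_exporter textfile collector.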

5xx error spike detector (alerting from cron):

errors=$(awk '$9 ~ /^5/ {c++} END {print c+0}' /var/log/nginx/access.log)
if (( errors > 100 )); then
  # Alertmanager expects JSON on the v2 API
  curl -X POST "$ALERTMANAGER_URL/api/v2/alerts" \
    -H 'Content-Type: application/json' \
    -d "[{\"labels\":{\"alertname\":\"5xxSpike\",\"value\":\"$errors\"}}]"
fi

The c+0 trick forces c to be numeric even if no errors were found (otherwise it’d print empty string).
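A quick way to see the difference is to run both forms against empty input. The sketch below uses /dev/null as the "log" so no line can match.

```shell
#!/usr/bin/env bash
# No input lines, so c is never incremented.
empty=$(awk '$9 ~ /^5/ { c++ } END { print c }' /dev/null)
zero=$(awk '$9 ~ /^5/ { c++ } END { print c + 0 }' /dev/null)

printf 'without +0: "%s"\n' "$empty"
printf 'with +0:    "%s"\n' "$zero"
```

The empty string matters for anything downstream that expects a number, such as a [ "$errors" -gt 100 ] test or a JSON payload.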

Pattern 2: GNU parallel for CPU-Bound Map

When the work per line is heavy (regex compilation, JSON parsing, network lookup), single-core awk becomes the bottleneck. GNU parallel farms work across cores:

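# Count ERROR lines per file, 8 workers at a time
ls /var/log/nginx/access-*.gz | parallel -j8 \
  'zcat {} | awk -v f={} "/ERROR/ {c++} END {print f, c+0}"'

The -j8 sets the number of workers; {} is the input filename; each invocation is independent. awk reads from a pipe here, so FILENAME would be empty, which is why the name is passed in with -v. parallel buffers each job's output and prints it when the job finishes, so results arrive in completion order; add -k to keep them in input order.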

Map-Reduce in One Pipeline

For aggregation across many files, the pattern is:

  1. Map: each file → partial aggregation (key → count).
  2. Reduce: merge partial aggregations → global aggregation.
ls /var/log/nginx/access-*.gz | parallel -j8 \
  'zcat {} | awk "{ count[\$7]++ } END { for (k in count) print count[k], k }"' \
  | awk '{ count[$2] += $1 } END { for (k in count) print count[k], k }' \
  | sort -rn | head -20

Stage 1 (parallel) outputs partial counts per file. Stage 2 (single awk) sums across files; this stage is small because its input is already partially aggregated. The key insight: the reduce step’s input size is O(files × distinct keys), not O(total log size).
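The same map and reduce stages can be tried without GNU parallel. The sketch below swaps in xargs -P for the map stage (a substitution made for this example so it runs on a stock system) and uses two tiny, made-up gzip files; each map job writes its own .part file so partial outputs cannot interleave.

```shell
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Hypothetical sample data: combined-log shape, URL in field 7.
line() { printf '1.1.1.1 - - [10/Oct/2024:13:55:36 +0000] "GET %s HTTP/1.1" 200 12\n' "$1"; }
{ line /a; line /a; line /b; } | gzip > "$work/access-1.gz"
{ line /a; line /c; }          | gzip > "$work/access-2.gz"

# Map: one awk per file, two at a time, each writing "<count> <url>" lines.
printf '%s\n' "$work"/access-*.gz \
  | xargs -P 2 -I{} sh -c \
      'gzip -dc "$1" | awk "{ c[\$7]++ } END { for (k in c) print c[k], k }" > "$1.part"' _ {}

# Reduce: sum the partial counts; input is one line per (file, distinct URL).
merged=$(cat "$work"/*.part \
  | awk '{ c[$2] += $1 } END { for (k in c) print c[k], k }' \
  | LC_ALL=C sort -k1,1rn -k2,2)

echo "$merged"
```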

parallel vs xargs

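xargs -P N does parallel execution too, but lacks parallel’s output grouping (each job's output is kept together instead of interleaved), its -k option for input-order output, its --delay and --load throttles, and its remote execution over SSH.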

For one-shot parallel maps, xargs -P is fine. For long-running production pipelines, parallel is worth the install.

parallel Throttling for Production

When running parallel against a database or external API, you must throttle:

# At most 4 jobs at a time, with 100ms gap between job starts
parallel -j4 --delay 0.1 './query-api.sh {}' :::: hosts.txt

# Limit by load average — pause if loadavg > 8
parallel -j4 --load 8 './heavy-job.sh {}' :::: inputs.txt

# Auto-tune to leave 2 cores free
parallel -j-2 './job.sh {}' :::: inputs.txt

The -j-2 (negative) is “all cores except 2” — useful for keeping the box responsive.

Pattern 3: Distributed Map-Reduce via SSH Fan-Out

When logs live on dozens of hosts (fleet of web servers), the pattern is:

  1. ssh-fan-out to run the map on each host (work happens locally, network only carries aggregated output).
  2. Reduce the per-host outputs centrally.
#!/usr/bin/env bash
# fleet-top-urls.sh — get top URLs across the entire web fleet
set -euo pipefail

readonly HOSTS="$(cat /etc/web-hosts.txt)"

# Map: each host runs awk locally, returns partial aggregation
for host in $HOSTS; do
  ssh -o ConnectTimeout=5 -o BatchMode=yes "$host" \
    "awk '{ c[\$7]++ } END { for (k in c) print c[k], k }' /var/log/nginx/access.log" \
    > "/tmp/fleet-map.$host" &
done
wait

# Reduce: merge per-host outputs
cat /tmp/fleet-map.* \
  | awk '{ c[$2] += $1 } END { for (k in c) print c[k], k }' \
  | sort -rn | head -20

rm /tmp/fleet-map.*

The ssh -o BatchMode=yes is critical — it prevents SSH from prompting for passwords if key auth fails, which would hang the script. The ConnectTimeout=5 bounds the wait for unreachable hosts.

Production-Grade Fan-Out With pdsh or parallel

For >20 hosts the bash loop above becomes unwieldy and lacks failure handling. Use parallel with --nonall, which runs the same command once on every host listed in an SSH login file (--sshloginfile, short --slf):

parallel --nonall --slf /etc/web-hosts.txt \
  "awk '{ c[\$7]++ } END { for (k in c) print c[k], k }' /var/log/nginx/access.log" \
  | awk '{ c[$2] += $1 } END { for (k in c) print c[k], k }' \
  | sort -rn | head -20

--slf reads one SSH login per line from the file, and --nonall runs the command on each of them without arguments. parallel keeps each host's output grouped; add --retries N to retry failed hosts and -k to keep output in host order.

Or pdsh (Parallel Distributed Shell)

pdsh is the heavyweight option, originally from LLNL clusters:

pdsh -w "$(paste -sd, /etc/web-hosts.txt)" \
  "awk '{ c[\$7]++ } END { for (k in c) print c[k], k }' /var/log/nginx/access.log" \
  | awk '{ c[$3] += $2 } END { for (k in c) print c[k], k }' \
  | sort -rn | head -20

pdsh prefixes each output line with "hostname: ", so on the reducer side the count is field 2 and the URL is field 3. Splitting on -F: would put the hostname in $1 and break on URLs that contain colons. parallel doesn’t prefix unless you ask for it.
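The reducer can be checked without a fleet by feeding it text in pdsh's output shape. The host names and counts below are invented for this sketch.

```shell
#!/usr/bin/env bash
# Simulated pdsh output: "<host>: <count> <url>" per line (hypothetical hosts).
pdsh_output='web1: 3 /a
web2: 2 /a
web1: 1 /b'

# Default whitespace splitting: $1 = "web1:", $2 = count, $3 = url.
merged=$(printf '%s\n' "$pdsh_output" \
  | awk '{ c[$3] += $2 } END { for (k in c) print c[k], k }' \
  | LC_ALL=C sort -k1,1rn -k2,2)

echo "$merged"
```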

Pattern 4: External-Memory Sort When Cardinality Itself Doesn’t Fit

Sometimes the distinct keys themselves don’t fit in RAM — e.g., a per-user-id aggregation across a billion users. The streaming-awk pattern fails.

The answer is sort with explicit external-memory tuning:

# Sort with 4GB memory budget, 4 sort threads, /var/tmp for spill
awk '{ print $4, $7 }' huge.log \
  | sort --buffer-size=4G --parallel=4 -T /var/tmp \
  | uniq -c \
  | sort -rn \
  | head -20

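Critical flags:

  1. --buffer-size=4G: how much RAM sort may use before it spills a sorted run to disk.
  2. --parallel=4: number of sort threads.
  3. -T /var/tmp: spill directory; keep it off a RAM-backed /tmp.
  4. LC_ALL=C: byte-wise comparison instead of locale collation.

LC_ALL=C sort --buffer-size=4G ...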
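A scaled-down run shows the shape of the pipeline. The sketch below uses six made-up user ids, a 1 MB buffer (-S is the short form of --buffer-size) and a private spill directory; at this size nothing actually spills, so it only demonstrates the wiring.

```shell
#!/usr/bin/env bash
set -euo pipefail

spill=$(mktemp -d)            # stand-in for /var/tmp in this sketch
trap 'rm -rf "$spill"' EXIT

top=$(printf '%s\n' u3 u1 u2 u1 u3 u1 \
  | LC_ALL=C sort -S 1M -T "$spill" \
  | uniq -c \
  | awk '{ print $1, $2 }' \
  | LC_ALL=C sort -k1,1rn -k2,2 \
  | head -n 2)

echo "$top"
```

The awk stage only strips the padding that uniq -c puts in front of each count.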

When to Use External Sort vs. awk

| Situation | Use awk | Use external sort |
| --- | --- | --- |
| Distinct keys < 10M | yes | |
| Distinct keys 10M-1B | ? (depends on RAM) | yes |
| Need ordering, not just counts | | yes |
| Heavy parsing per line | awk + parallel | sort + parallel |

The rule of thumb: streaming awk for aggregation, external sort for ordering. They’re complementary, not competing.

Slow Query Log Reduction: A Real-World Case Study

Postgres slow query logs and MySQL slow query logs are the canonical “fits in awk” workload. pgBadger, the classic tool, is a large Perl program that builds full reports; for a quick top-N of slow queries, about 50 lines of awk are enough.

#!/usr/bin/gawk -f
# pg-slowlog-reduce.awk: reduce Postgres CSV log to top-N slow queries
# Needs gawk: the three-argument match() is a gawk extension.
BEGIN { FS = "," }
/duration:/ {
  # Extract duration in ms; skip lines where the pattern does not match
  if (!match($0, /duration: ([0-9.]+) ms/, m)) next
  d = m[1] + 0

  # Normalize the query: strip quoted strings, then literal numbers
  q = $0
  gsub(/'[^']*'/, "'?'", q)
  gsub(/[0-9]+/, "?", q)

  # Hash by normalized query
  total[q] += d
  count[q]++
  if (d > max[q]) max[q] = d
}
END {
  for (q in total) {
    printf "%.0f total_ms / %d calls / %.0f max_ms — %s\n", \
           total[q], count[q], max[q], substr(q, 1, 80)
  }
}

Run with:

gawk -f pg-slowlog-reduce.awk /var/log/postgresql/postgres.log \
  | sort -rn | head -20

Normalization is the magic: gsub(/[0-9]+/, "?") collapses WHERE id=123 and WHERE id=456 into the same shape WHERE id=?, so they aggregate. Without normalization, every query would be unique and the analysis is useless.

The same pattern works for nginx URL aggregation (collapse /users/123 → /users/?):

awk '{
  url = $7
  gsub(/\/[0-9]+/, "/?", url)
  c[url]++
} END {
  for (u in c) print c[u], u
}' access.log | sort -rn | head -20
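Both normalizations above can be tried on a few hand-written inputs. The queries and paths in this sketch are made up; the awk code sticks to POSIX features and writes the single quote as \047 so it can sit inside a single-quoted shell string.

```shell
#!/usr/bin/env bash
# SQL: two queries that differ only in their literals collapse to one shape.
shapes=$(printf '%s\n' \
  "SELECT * FROM users WHERE id=123 AND name='bob'" \
  "SELECT * FROM users WHERE id=456 AND name='alice'" \
  | awk '{
      q = $0
      gsub(/\047[^\047]*\047/, "?", q)   # quoted strings -> ?
      gsub(/[0-9]+/, "?", q)             # bare numbers   -> ?
      c[q]++
    } END { for (q in c) print c[q], q }')
echo "$shapes"

# URLs: numeric path segments collapse to /?.
paths=$(printf '%s\n' /users/123 /users/456 /users/123/orders/9 \
  | awk '{ u = $0; gsub(/\/[0-9]+/, "/?", u); c[u]++ }
         END { for (u in c) print c[u], u }' \
  | LC_ALL=C sort -k1,1rn -k2,2)
echo "$paths"
```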

The Drop-In lib/loganalyze.sh

# lib/loganalyze.sh — sourced helpers for log analysis pipelines.
#
# Depends on mawk (preferred) or gawk; la_response_time_stats and
# la_pg_slow_queries need gawk (asort, three-argument match).

set -o errexit -o nounset -o pipefail

la_log() { printf '[%s] [loganalyze] %s\n' "$(date -Iseconds)" "$*"; }

# Detect best awk
la_awk_path() {
  if command -v mawk >/dev/null; then echo mawk
  elif command -v gawk >/dev/null; then echo gawk
  else echo awk
  fi
}

# Top-N URL aggregator (nginx combined format).
# Args: file, n
la_top_urls() {
  local file="$1" n="${2:-20}"
  local awk_bin
  awk_bin=$(la_awk_path)

  if [[ "$file" == *.gz ]]; then
    zcat "$file"
  elif [[ "$file" == *.zst ]]; then
    zstd -d -c "$file"
  else
    cat "$file"
  fi | "$awk_bin" '{
    url = $7
    gsub(/\/[0-9]+/, "/?", url)
    c[url]++
  } END {
    for (u in c) print c[u], u
  }' | sort -rn | head -"$n"
}

# Mean and p99 response time per URL. Requires nginx with $request_time as last field.
# Needs gawk (asort is a gawk extension). Keeps every sample in memory,
# so memory is O(lines), not O(distinct URLs).
la_response_time_stats() {
  local file="$1" n="${2:-20}"

  gawk '{
    url = $7; rt = $NF + 0
    gsub(/\/[0-9]+/, "/?", url)
    sum[url] += rt
    count[url]++
    times[url, count[url]] = rt
  } END {
    for (u in sum) {
      cnt = count[u]
      mean = sum[u] / cnt
      # Approximate p99: sort the recorded times for this URL
      delete arr
      for (i = 1; i <= cnt; i++) arr[i] = times[u, i]
      asort(arr)
      idx = int(cnt * 0.99)
      if (idx < 1) idx = 1          # fewer than ~100 samples: avoid index 0
      printf "%.3f %.3f %d %s\n", mean, arr[idx], cnt, u
    }
  }' "$file" | sort -rn -k2,2 | head -"$n"
}

# 5xx counter for alerting. Args: file. Returns count to stdout.
la_5xx_count() {
  local file="$1"
  awk '$9 ~ /^5/ {c++} END {print c+0}' "$file"
}

# Top error patterns from generic log file. Looks for ERROR/WARN/FATAL prefixes.
# Args: file, n
la_top_errors() {
  local file="$1" n="${2:-20}"
  awk '/(ERROR|WARN|FATAL|CRITICAL)/ {
    # Strip timestamp + thread/PID. Keep last 200 chars.
    msg = substr($0, length($0) > 200 ? length($0) - 199 : 1)
    # Normalize numbers
    gsub(/[0-9]+/, "?", msg)
    c[msg]++
  } END {
    for (m in c) print c[m], m
  }' "$file" | sort -rn | head -"$n"
}

# Distributed top-URLs across a fleet. Args: hosts_file, log_path, n
la_fleet_top_urls() {
  local hosts="$1" log="$2" n="${3:-20}"
  local tmpdir
  tmpdir=$(mktemp -d)
  trap "rm -rf '$tmpdir'" RETURN

  while read -r host; do
    [[ -z "$host" || "$host" =~ ^# ]] && continue
    # -n keeps ssh from consuming the rest of the host list on stdin
    ssh -n -o ConnectTimeout=5 -o BatchMode=yes "$host" \
      "awk '{ gsub(/\/[0-9]+/,\"/?\",\$7); c[\$7]++ } END { for(k in c) print c[k],k }' $log" \
      > "$tmpdir/$host" &
  done < "$hosts"
  wait

  cat "$tmpdir"/* \
    | awk '{ c[$2] += $1 } END { for (k in c) print c[k], k }' \
    | sort -rn | head -"$n"
}

# Slow query reducer for Postgres CSV log. Args: file, n
la_pg_slow_queries() {
  local file="$1" n="${2:-20}"
  # gawk required: the three-argument match() is a gawk extension
  gawk '/duration:/ {
    if (!match($0, /duration: ([0-9.]+) ms/, m)) next
    d = m[1] + 0
    q = $0
    gsub(/\047[^\047]*\047/, "?", q)
    gsub(/[0-9]+/, "?", q)
    total[q] += d; count[q]++
    if (d > max[q]) max[q] = d
  } END {
    for (q in total) {
      printf "%.0f %d %.0f %s\n", total[q], count[q], max[q], substr(q,1,120)
    }
  }' "$file" | sort -rn | head -"$n"
}

Note \047, the octal escape for a single quote. Embedding ' inside a single-quoted awk program in bash gets ugly fast; \047 sidesteps the quoting nightmare.

Using the Library

#!/usr/bin/env bash
source /usr/local/lib/loganalyze.sh

# Top 20 URLs from yesterday's compressed log
la_top_urls /var/log/nginx/access.log.1.gz 20

# Alert on 5xx spike
errors=$(la_5xx_count /var/log/nginx/access.log)
if (( errors > 1000 )); then
  /usr/local/bin/alert "5xx spike: $errors"
fi

# Fleet-wide top URLs
la_fleet_top_urls /etc/web-hosts.txt /var/log/nginx/access.log 50

# Postgres slow queries from last hour
journalctl -u postgresql --since '1 hour ago' --no-pager > /tmp/pg-recent.log
la_pg_slow_queries /tmp/pg-recent.log 20

Streaming Real-Time vs. Batch

Everything above is batch (process a file). For real-time tail-and-aggregate, the pattern is:

# Keep a rolling 60-second 5xx window (systime() is not POSIX, so use gawk)
tail -F /var/log/nginx/access.log \
  | gawk '
    BEGIN { window = 60; head = 1; tail = 0 }
    {
      now = systime()
      if ($9 ~ /^5/) events[++tail] = now
      # Drop events older than the window so the array stays bounded
      while (head <= tail && events[head] < now - window) delete events[head++]
      if (NR % 100 == 0) {
        print "5xx in last 60s:", tail - head + 1
        fflush()
      }
    }
  '

Real-time has its own footguns: tail -F (capital F) re-opens on rotation; tail -f (lowercase) keeps following the old, rotated file and never sees new lines. Always use -F for production tailers.

For higher-throughput streaming, the right tool is usually Vector, Fluent Bit, or Logstash — but for ad-hoc investigation, tail -F | awk is unbeatable.
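The window logic is easier to verify when the clock comes from the data rather than from the wall clock. The sketch below assumes a simplified, hypothetical input format (epoch seconds in field 1, status code in field 2) so it can be replayed deterministically; it is not the nginx format used elsewhere in this lesson.

```shell
#!/usr/bin/env bash
# Each input line: "<epoch-seconds> <status>" (hypothetical format).
window_counts=$(printf '%s\n' '100 500' '110 200' '130 502' '165 503' '200 200' \
  | awk -v window=60 '
      BEGIN { head = 1; tail = 0 }
      {
        now = $1 + 0
        if ($2 ~ /^5/) events[++tail] = now
        # Expire events that fell out of the window; head..tail is the live range.
        while (head <= tail && events[head] < now - window) delete events[head++]
        print now, tail - head + 1
      }')

echo "$window_counts"
```

Because events arrive in time order, expiring from the front keeps both memory and per-line work proportional to the number of events inside the window.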

The 8 Footguns

1. cat huge.log | grep Instead of grep huge.log

cat is the redundant first stage. grep already takes a filename. Fix: grep PATTERN huge.log. (Useless Use of Cat — UUOC — is a real category at scale.)

2. sort | uniq -c for Aggregation

Already covered — uses external sort for what awk does in O(distinct keys). Fix: awk '{c[$1]++} END {for(k in c) print c[k],k}'.

3. Locale-Aware Sort Slowness

LC_ALL=en_US.UTF-8 makes sort 5-10× slower than LC_ALL=C because it does Unicode collation. Fix: Set LC_ALL=C for log analysis (your hostnames and URLs are ASCII).

4. /tmp Tmpfs Filling Up

sort spills to /tmp by default. On systems where /tmp is tmpfs (RAM-backed), a 100GB sort fills RAM and OOMs the machine. Fix: sort -T /var/tmp ... to use real disk.

5. tail -f Instead of tail -F

-f (lowercase) follows by file descriptor. After log rotation, the original FD is to a deleted file; new logs go to a new file the tailer never sees. Fix: tail -F (uppercase) follows by name — re-opens on rotation.

6. awk Memory Blow-Up From Unbounded Cardinality

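If the key you’re aggregating on is unbounded (every line is unique, e.g., bucketing by request-id), awk’s hash grows without limit and eventually OOMs. Fix: Aggregate on a key with bounded cardinality. Truncating helps only when the prefix itself repeats: c[substr($1,1,32)]++ keeps only the first 32 chars. If the keys are truly unique, use the external sort of Pattern 4 instead.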
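Another defensive option, sketched here as an assumption rather than something the lesson's library does, is to cap the number of distinct keys and count everything beyond the cap under one catch-all key. The max variable and the __other__ label are names invented for this example.

```shell
#!/usr/bin/env bash
capped=$(printf '%s\n' a b a c d e a b \
  | awk -v max=2 '
      {
        k = $1
        if (!(k in c)) {
          # New key: admit it while under the cap, otherwise fold into __other__.
          if (n >= max) k = "__other__"
          else n++
        }
        c[k]++
      }
      END { for (k in c) print c[k], k }' \
  | LC_ALL=C sort -k1,1rn -k2,2)

echo "$capped"
```

This keeps the keys that happen to appear first, so it bounds memory but is not a true top-N; a frequent key that first shows up late ends up in __other__.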

7. parallel With Stateful Workers

If your awk script depends on state across input lines (running totals, monotonic counters), running it in parallel chunks gives wrong answers — each chunk’s awk has separate state. Fix: Either restructure to be stateless (pure aggregation, then merge) or process serially.
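Means are a common case where the merge step goes wrong: averaging per-chunk averages weights every chunk equally. A minimal sketch, using two invented partial results in a "<sum> <count> <url>" layout, shows that carrying sum and count through the map stage gives the right answer.

```shell
#!/usr/bin/env bash
# Partial results as two map workers might emit them (made-up numbers).
part1='3.0 3 /a'   # three requests, mean 1.0
part2='9.0 1 /a'   # one request, mean 9.0

# Correct merge: add sums and counts, divide once at the end.
merged_mean=$(printf '%s\n' "$part1" "$part2" \
  | awk '{ s[$3] += $1; n[$3] += $2 }
         END { for (u in s) printf "%.3f %s\n", s[u] / n[u], u }')

# Naive merge: average of the two per-chunk means.
naive_mean=$(awk 'BEGIN { printf "%.3f\n", (1.0 + 9.0) / 2 }')

echo "merged: $merged_mean"
echo "naive:  $naive_mean"
```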

8. SSH Fan-Out Without ConnectTimeout

A single dead host stalls its SSH connection for ~2 minutes (TCP timeout). In a serial loop the stalls add up: 100 unreachable hosts mean a 200-minute wait. Fix: Always ssh -o ConnectTimeout=5 -o BatchMode=yes. The BatchMode prevents password prompts from hanging in terminal-less environments.

Quick-Reference Card

STREAMING AGGREGATION (single host)
  awk '{c[$KEY]++} END {for(k in c) print c[k],k}' | sort -rn | head
  Memory: O(distinct keys), not O(file size)

awk PERFORMANCE
  mawk:    fastest, fewer features
  gawk:    most features (gensub, mktime), slower
  busybox: 10× slower, embedded only
  Set LC_ALL=C for ASCII data → 5-10× speedup

PARALLEL (multi-core, single host)
  ls files | parallel -j8 'zcat {} | awk ... '
  Then merge with another awk pass
  -j-2 = "all cores minus 2", --load 8 = pause if loadavg > 8

SSH FAN-OUT (multi-host)
  parallel --nonall --slf hosts.txt 'awk ... /var/log/...'
  Always: ssh -o ConnectTimeout=5 -o BatchMode=yes
  Map locally on each host, reduce centrally

EXTERNAL SORT (cardinality > RAM)
  sort --buffer-size=4G --parallel=4 -T /var/tmp
  LC_ALL=C for ASCII → 5-10× speedup

NORMALIZATION (the magic for log aggregation)
  gsub(/\/[0-9]+/, "/?")   → /users/123 → /users/?
  gsub(/[0-9]+/, "?")      → strip all numerics
  gsub(/'[^']*'/, "?")     → strip quoted strings (SQL params)

REAL-TIME
  tail -F (capital F!) for rotation safety
  Pipe to awk with running window aggregator

What’s Next

You can now extract signal from terabyte-scale logs. The next step is to act on that signal: build self-healing scripts that detect a problem (high 5xx rate, queue depth above threshold, stuck process), decide whether to remediate, and act with bounded blast-radius — without becoming the cause of the next incident through a runaway loop.

In the next lesson — Self-Healing Scripts: Detect-Decide-Act Loops, Blast-Radius Limits & Circuit Breakers — we’ll build lib/heal.sh covering detect/decide/act loops, circuit breakers that stop after N consecutive failures, blast-radius limits (“never restart more than 1 host per minute”), the dry-run discipline before any auto-remediation goes live, and the audit log every healer must write so post-incident review can answer “why did the healer do that?”
