Shell Lesson 12 of 42

Text Processing: awk Deep Dive, jq, yq, csvkit & the Locale / UTF-8 Pitfalls — Where Shell Stops Being Primitive

We’ve climbed from echo to find … -print0 | xargs -0. So far the mental model has been “stream of bytes, lines, words.” That’s enough for 70% of shell tasks. The remaining 30% have structure: JSON from APIs, YAML from Kubernetes, CSV from finance teams, columnar output from ps/df/docker. For these, grep and sed are the wrong tool. You don’t want to regex JSON; you want a parser.

This final Tier 2 lesson covers four tools that fill that gap: awk, jq, yq, and csvkit, along with the locale and UTF-8 pitfalls that affect all of them.

This is the closer of Wave 1. After this, you have a complete shell-fundamentals foundation. Wave 2 builds advanced patterns — error frameworks, package managers, CLI design, secrets — on top of this.


1. awk — the data-processing language hidden inside awk

awk is named after its three creators — Aho, Weinberger, Kernighan — and is one of the original Unix-era programming languages. It looks like a one-liner tool, but it’s actually a small, complete programming language: variables, control flow, functions, associative arrays, regex.

The data model

Every awk program follows the same structure:

awk 'BEGIN { ... } /PATTERN/ { ACTION } END { ... }' FILE

awk reads input one record at a time (default = one line). For each record, it splits into fields (default separator = whitespace). Then it runs each pattern { action } block where the pattern matches.
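To make the record/field model concrete, here is a minimal sketch that feeds three made-up lines to awk and prints the built-in variables NR (record number) and NF (number of fields) next to the first field. The sample data and the variable names sample and fields_report are invented for illustration.

```shell
# Hypothetical sample input; any whitespace-separated text works.
sample=$(printf 'alice 30 admin\nbob 25\n  carol   41 dev ops\n')

# No pattern means the action runs for every record.
# NR = record number, NF = number of fields in the current record.
# Leading and repeated whitespace is ignored by the default separator.
fields_report=$(printf '%s\n' "$sample" | awk '{ print NR, NF, $1 }')

printf '%s\n' "$fields_report"
# 1 3 alice
# 2 2 bob
# 3 4 carol
```

Note that the third line has leading spaces and runs of spaces between words, yet awk still sees four clean fields.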

“Print column N” — the boilerplate

ps -ef | awk '{ print $2 }'                # print 2nd column (PID)
ls -l | awk '{ print $5, $9 }'             # size and name
df -h | awk 'NR > 1 { print $5, $6 }'      # skip header (NR==1)

But that’s awk’s boring 5%. Where it shines:

BEGIN and END blocks

# Sum a column
ls -l *.log | awk 'BEGIN { total = 0 } { total += $5 } END { print "Total:", total }'

# Average response time from a log (assumes the last field looks like response_ms=123)
awk '/response_ms=/ { split($NF, kv, "="); sum += kv[2]; n++ } END { if (n) print sum / n }' app.log

BEGIN runs before any input. END runs after all input. Use them for initialization and summarization.

Arithmetic and conditions

# Print files larger than 1MB (column 5 from ls -l is size in bytes)
ls -l | awk '$5 > 1024 * 1024 { print $9 }'

# Format payroll: fields are NAME HOURS RATE
awk '{ printf "%-10s $%.2f\n", $1, $2 * $3 }' payroll.txt

Note printf (lowercase, awk-builtin, separate from shell printf) — same C-style format string.

Field separators (-F and OFS)

# Print user, shell from /etc/passwd (colon-separated)
awk -F: '{ print $1, $7 }' /etc/passwd

# Convert CSV-like input to TSV output (DOES NOT handle quoted commas — see csvkit later)
awk -F, 'BEGIN{OFS="\t"} { $1=$1; print }' input.csv > output.tsv

The $1=$1 trick is a classic awk idiom — it forces awk to “rebuild” the record using the new OFS, even if no field actually changed.
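A quick way to convince yourself of this is to run the same program with and without the assignment. The sketch below uses a one-line inline input; the variable names unchanged and rebuilt are just for the demo.

```shell
# Without assigning to a field, awk prints $0 exactly as read, so OFS is ignored.
unchanged=$(printf 'a,b,c\n' | awk -F, 'BEGIN { OFS = "\t" } { print }')

# Assigning $1 to itself marks the record as modified; awk then rejoins
# the fields with OFS before printing.
rebuilt=$(printf 'a,b,c\n' | awk -F, 'BEGIN { OFS = "\t" } { $1 = $1; print }')

printf '%s\n' "$unchanged"   # a,b,c
printf '%s\n' "$rebuilt"     # a<TAB>b<TAB>c
```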

Associative arrays — counting and grouping

This is awk’s killer feature. Arrays in awk are associative (string keys), not indexed.

# Count how many accounts share each UID in /etc/passwd
awk -F: '{ count[$3]++ } END { for (uid in count) print uid, count[uid] }' /etc/passwd

# Count log lines per HTTP status code from nginx access log
awk '{ status[$9]++ } END { for (s in status) print s, status[s] }' access.log

# Sum bytes by IP address (nginx log columns: 1=IP, 10=bytes)
awk '{ bytes[$1] += $10 } END { for (ip in bytes) print bytes[ip], ip }' access.log | sort -rn | head

This is enormously powerful. You’re aggregating in a single pass without sorting first.
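One detail worth knowing: the order in which for (key in array) visits keys is unspecified, so the output order can differ between awk implementations. A small sketch with invented "team amount" records shows the usual remedy, which is to sort the aggregated output afterwards rather than the raw input.

```shell
# Hypothetical input: "team amount" per line.
totals=$(printf 'ops 10\ndev 5\nops 7\ndev 1\nqa 3\n' |
  awk '{ sum[$1] += $2 } END { for (team in sum) print team, sum[team] }' |
  LC_ALL=C sort)   # sort the small aggregated result for stable output

printf '%s\n' "$totals"
# dev 6
# ops 17
# qa 3
```

Sorting three summary lines is far cheaper than sorting the whole input first, which is the point of aggregating in a single pass.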

Multi-file processing

# Compare line counts of two files
awk 'NR==FNR { count++; next } END { print FILENAME, count, NR-count }' file1 file2

# Merge two files by key (a kind of join)
awk 'NR==FNR { map[$1] = $2; next } { if ($1 in map) print $1, $2, map[$1] }' lookup.tsv data.tsv

The NR==FNR trick: while we’re on the first file, NR (overall record number) equals FNR (current file record number). Once we move to the second file, they diverge. So NR==FNR { … ; next } means “process only the first file, and skip to the next record.”
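Here is the same join as a self-contained sketch you can paste into a terminal. The two files, their contents, and the tmpdir and joined variables are made up for the demo; only rows whose key exists in the lookup file are printed.

```shell
tmpdir=$(mktemp -d)
printf 'u1\talice\nu2\tbob\n' > "$tmpdir/lookup.tsv"               # id, name
printf 'u2\tlogin\nu3\tlogin\nu1\tlogout\n' > "$tmpdir/data.tsv"   # id, event

joined=$(awk -F'\t' '
  NR == FNR { name[$1] = $2; next }        # first file: build the lookup table
  $1 in name { print $1, $2, name[$1] }    # second file: keep rows with a match
' "$tmpdir/lookup.tsv" "$tmpdir/data.tsv")

printf '%s\n' "$joined"
# u2 login bob
# u1 logout alice
rm -rf "$tmpdir"
```

The u3 row disappears because it has no entry in the lookup table, which makes this an inner join.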

Regex and patterns

# Print only lines that match a regex
awk '/ERROR|WARN/ { print }' app.log

# Print lines where field 3 starts with "foo"
awk '$3 ~ /^foo/' data.tsv

# Negation
awk '!/DEBUG/' app.log              # everything except DEBUG lines

# Print lines from PATTERN to PATTERN (range, like sed)
awk '/^START/,/^END/' file.txt

printf for formatting

awk '{ printf "%-30s %10d\n", $1, $2 }' data.tsv

# Common formats:
# %s    string
# %d    integer
# %f    float
# %e    scientific
# %x    hex
# %o    octal
# %-10s left-align in width 10
# %5.2f float, width 5, 2 decimals
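To see several of these conversions side by side, the sketch below formats a handful of literal values in a BEGIN block, so no input file is needed. LC_ALL=C is set only to keep the decimal point predictable; the values are arbitrary.

```shell
# Each conversion is separated by | so the padding is visible.
formatted=$(LC_ALL=C awk 'BEGIN {
  printf "%-6s|%5.2f|%x|%o|%03d\n", "ab", 3.14159, 255, 8, 7
}')

printf '%s\n' "$formatted"
# ab    | 3.14|ff|10|007
```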

awk functions

# Built-ins: length, substr, index, split, gsub, sub, tolower, toupper, sprintf, ...

awk '{ print length($0), $0 }' file       # length of each line
awk '{ print toupper($1) }' file          # uppercase first field
awk '{ gsub(/foo/, "bar"); print }' file  # global substitute (like sed)

# Custom functions
awk 'function abs(x) { return x < 0 ? -x : x } { print abs($1) }' data

A complete real-world awk script

# Parse nginx access log, count requests and bytes per 5-minute bucket
awk '
  BEGIN { FS="[ \\[\\]]+" }
  {
    # field 4 looks like: 22/Jun/2026:14:35:12
    split($4, t, "[:/]")
    bucket = t[1] "/" t[2] "/" t[3] " " t[4] ":" sprintf("%02d", int(t[5]/5)*5)
    requests[bucket]++
    bytes[bucket] += $10    # bytes sent; in the combined log format $NF would be part of the user agent
  }
  END {
    for (b in requests)
      printf "%-22s %6d req  %12d bytes\n", b, requests[b], bytes[b]
  }
' access.log | sort

This kind of pipeline used to be a Python script. In awk, it’s 10 lines.


2. jq — JSON Swiss army knife

JSON is everywhere in modern shell work — Kubernetes API, AWS CLI, GitHub API, every web service. jq is to JSON what awk is to columnar text.

Basics: pretty-print and select

# Pretty-print
echo '{"name":"alice","age":30}' | jq .

# Get a field
echo '{"name":"alice","age":30}' | jq .name           # "alice"
echo '{"name":"alice","age":30}' | jq '.name'         # same; quote when shell would interpret

# Get a nested field
echo '{"user":{"name":"alice"}}' | jq .user.name

# Array access
echo '[1,2,3]' | jq '.[0]'                            # 1
echo '[1,2,3]' | jq '.[-1]'                           # 3 (negative = from end)
echo '[1,2,3]' | jq '.[]'                             # iterate: 1\n2\n3
echo '[1,2,3]' | jq '.[1:3]'                          # slice: [2,3]

Always wrap jq filters in single quotes: they contain characters such as $, [, |, and spaces that the shell would otherwise interpret.

Pipes (inside jq)

jq has its own internal pipe |, which feeds output of one filter into another:

# From a list of users, get names
echo '[{"name":"alice"},{"name":"bob"}]' | jq '.[] | .name'
# "alice"
# "bob"

# Even more compact:
echo '[{"name":"alice"},{"name":"bob"}]' | jq '.[].name'

Selectors and filters

# select: keep only items matching a predicate
jq '.[] | select(.age > 30)' users.json

# Multiple predicates
jq '.[] | select(.age > 30 and .role == "admin")' users.json

# Pattern match
jq '.[] | select(.name | test("^A"))' users.json    # name starts with A

Construct new objects

# Pick specific fields
jq '.[] | {name, age}' users.json

# Rename / compute
jq '.[] | {full_name: .name, is_adult: (.age >= 18)}' users.json

# As an array
jq '[.[] | .name]' users.json

map — transform an array

# Double every age
jq 'map(.age *= 2)' users.json

# Map to just names
jq 'map(.name)' users.json   # equivalent to [.[] | .name]

length, keys, to_entries, from_entries

jq 'length' users.json                    # number of array items
jq 'keys' my_obj.json                     # all keys of an object
jq '.users | keys' my_obj.json            # nested (without a file, jq waits on stdin)

# Convert object to array of {key, value} pairs
jq 'to_entries' obj.json
# [ {"key":"name","value":"alice"}, {"key":"age","value":30} ]

# And back
jq 'to_entries | from_entries' obj.json   # round-trip

Aggregations: add, min, max, unique, group_by

# Sum all ages
jq '[.[].age] | add' users.json

# Max age
jq '[.[].age] | max' users.json

# Unique roles
jq '[.[].role] | unique' users.json

# Group by role, count each
jq 'group_by(.role) | map({role: .[0].role, count: length})' users.json

Output modes

jq -r '.name' file              # raw — no JSON quotes; useful for shell strings
jq -c .                         # compact — one item per line; for streaming
jq -s '.'                       # slurp — read all input as single array

-r is essential when feeding jq output into shell variables:

NAME=$(curl -s api.example.com/user | jq -r '.name')   # without -r, NAME would have quotes
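The difference is easy to check without a network call. This sketch assumes jq is installed and uses an inline JSON string in place of the API response; the variable names are illustrative.

```shell
json='{"name":"alice","age":30}'   # stand-in for an API response

quoted=$(printf '%s' "$json" | jq '.name')      # keeps the JSON string quotes
raw=$(printf '%s' "$json" | jq -r '.name')      # raw text, ready for shell use

printf 'without -r: %s\n' "$quoted"   # without -r: "alice"
printf 'with -r:    %s\n' "$raw"      # with -r:    alice
```

Forgetting -r is a common cause of paths or URLs that mysteriously contain literal double quotes.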

Real-world examples

# Get all running pod names from kubectl
kubectl get pods -o json | jq -r '.items[] | select(.status.phase == "Running") | .metadata.name'

# Format AWS instances as tab-separated
aws ec2 describe-instances --output json \
  | jq -r '.Reservations[].Instances[] | [.InstanceId, .InstanceType, .State.Name] | @tsv'

# From a GitHub commits API response, get hash and message
curl -s https://api.github.com/repos/torvalds/linux/commits \
  | jq -r '.[] | "\(.sha[0:7]) \(.commit.message | split("\n")[0])"'

Editing JSON

# Update a field
echo '{"name":"alice","age":30}' | jq '.age = 31'

# Add a field
echo '{"name":"alice"}' | jq '. + {age: 30}'

# Delete a field
echo '{"name":"alice","age":30}' | jq 'del(.age)'

# In-place edit a JSON file (atomic with mv-temp)
TMP=$(mktemp)
jq '.version = "2.0"' package.json > "$TMP" && mv "$TMP" package.json

There’s no jq -i (yet); the mktemp + mv pattern is canonical.
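If you use this pattern often, it may be worth wrapping it. The sketch below is one possible shape, assuming jq is installed: json_set_version is a hypothetical helper (not part of jq), and the demo file is created in a throwaway directory. It creates the temp file next to the target, because mv is only an atomic rename when both paths are on the same filesystem, and it passes the value with jq’s --arg instead of pasting it into the filter.

```shell
# json_set_version FILE VERSION  (hypothetical helper, not a jq feature)
json_set_version() {
  local file=$1 version=$2 tmp
  # Temp file beside the target so the final mv is a same-filesystem rename.
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  if jq --arg v "$version" '.version = $v' "$file" > "$tmp"; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"    # jq failed: leave the original file untouched
    return 1
  fi
}

# Demo on a made-up package.json in a temporary directory.
workdir=$(mktemp -d)
printf '{"name":"demo","version":"1.0"}\n' > "$workdir/package.json"
json_set_version "$workdir/package.json" "2.0"
new_version=$(jq -r '.version' "$workdir/package.json")
printf '%s\n' "$new_version"
rm -rf "$workdir"
```

One caveat: mktemp creates files with mode 0600, so restore the original permissions afterwards if they matter.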


3. yq — jq for YAML

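There are two unrelated tools, confusingly both named yq: Mike Farah’s yq, a standalone Go binary with its own jq-like expression language, and the Python yq by Andrey Kislyuk, a wrapper that converts YAML to JSON and hands it to jq.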

We’ll cover Mike Farah’s. Install with brew install yq or download the binary.

Basic usage

yq '.metadata.name' deployment.yaml         # get a field — same as jq syntax
yq '.spec.replicas = 5' deployment.yaml     # mutate (prints to stdout)
yq -i '.spec.replicas = 5' deployment.yaml  # in-place (yq has -i, jq doesn't)

Multiple documents in one YAML file

Kubernetes manifests often have multiple documents separated by ---. yq handles them:

yq '.kind' multi.yaml          # prints all "kind" values, one per document
yq 'select(.kind == "Service")' multi.yaml   # extract only Service docs

Convert between formats

yq -o json '.' file.yaml > file.json     # YAML to JSON
yq -p json -o yaml '.' file.json         # JSON to YAML
yq -p xml '.' file.xml                   # parse XML

Real-world examples

# Get all images used in a Helm template'd deployment
helm template mychart | yq '..|.image? | select(.)'

# Bulk-update image tag in a Kustomize patch
yq -i '.spec.template.spec.containers[0].image = "myimage:v2"' deploy.yaml

# Get all containers across all pods in a namespace
kubectl get pods -o yaml | yq '.items[].spec.containers[].name'

Caveats


4. CSV: when awk -F, isn’t enough

The naive approach to CSV is awk -F,. It works until the first quoted field with an embedded comma, at which point it silently splits that field in two and shifts every later column.

# This file:
"Smith, John",30,Engineer
"Doe, Jane",25,Manager

# awk -F, treats the comma INSIDE the quotes as a separator. Wrong.
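You can watch the mis-split happen on the first row of that file. This is a small demonstration of the failure, not a fix; the variable names are invented.

```shell
csv_line='"Smith, John",30,Engineer'   # three logical CSV fields

# awk sees four fields, because it splits on every comma.
naive_count=$(printf '%s\n' "$csv_line" | awk -F, '{ print NF }')
# The first "field" is only half of the quoted name.
naive_first=$(printf '%s\n' "$csv_line" | awk -F, '{ print $1 }')

printf 'fields: %s, first: %s\n' "$naive_count" "$naive_first"
# fields: 4, first: "Smith
```

Nothing errors out, which is what makes this bug dangerous: the age column is now reading ` John"`.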

For real-world CSV (especially anything from spreadsheets or business systems), use a real CSV parser.

csvkit — Python-based CSV toolkit

pip install csvkit

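Provides a family of small commands, including csvlook, csvcut, csvgrep, csvstat, csvsort, csvjoin, csvsql, and in2csv.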

Examples

# Pretty-print the first 10 rows
head -n 10 data.csv | csvlook

# Get a specific column by name
csvcut -c first_name,last_name people.csv

# Filter rows where state is CA
csvgrep -c state -m CA people.csv

# Statistics on each column
csvstat sales.csv

# Sort by sale amount, descending
csvsort -c amount -r sales.csv | head

# Join orders with customers on customer_id
csvjoin -c customer_id orders.csv customers.csv > joined.csv

# Run SQL against a CSV
csvsql --query "SELECT state, COUNT(*) FROM people GROUP BY state ORDER BY 2 DESC" people.csv

# Convert Excel to CSV
in2csv sales.xlsx > sales.csv

csvsql is genuinely magical: it loads the CSV into an in-memory SQLite, runs the query, prints the result. For small to medium CSVs (up to a few hundred MB), it beats writing pandas or a real database.

xsv — fast CSV (Rust)

For very large CSVs, csvkit (Python) is slow. xsv is a Rust-based alternative:

brew install xsv

xsv stats data.csv | xsv table        # statistics, table-formatted
xsv select first_name,last_name data.csv
xsv search -s state CA data.csv       # filter by column
xsv join customer_id orders.csv customer_id customers.csv

Same operations as csvkit, much faster on big files.

miller (mlr)

Yet another option, designed for “TSV/CSV/JSON/etc as named-field records”:

brew install miller

mlr --csv stats1 -a mean,stddev -f age people.csv
mlr --c2t cat people.csv > people.tsv      # CSV to TSV
mlr --c2j cat people.csv > people.json     # CSV to JSON

mlr is genuinely clever and very capable, but has its own syntax to learn. Pick one (csvkit, xsv, or mlr) and stick with it.


5. Locale and UTF-8 — the silent saboteurs

This is the section that most “shell scripting” tutorials skip, and it’s where almost every senior engineer gets bitten at least once.

Locale categories

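A locale tells programs how to interpret text. Each category controls one aspect: LC_CTYPE (character encoding and classification), LC_COLLATE (sort order), LC_NUMERIC (decimal separator), LC_TIME (date formats), LC_MESSAGES (message language). LANG supplies the default for any category that is not set, and LC_ALL overrides all of them.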

On a typical Mac/Linux system:

locale          # show current settings

# Common values:
LANG=en_US.UTF-8
LC_ALL=
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
...

The LC_ALL=C trick

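Setting LC_ALL=C (or the equivalent LC_ALL=POSIX) tells programs to use the most basic, byte-comparison-only locale. It’s fast (no multibyte decoding or collation tables), deterministic (the same byte order on every machine), and blunt: anything outside ASCII is treated as opaque bytes rather than letters.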

So the rule:

# For pipelines that don't need linguistic correctness:
LC_ALL=C sort -u file.txt

# For pipelines that DO need it:
LC_ALL=en_US.UTF-8 sort -u file.txt

Always set LC_ALL explicitly inside scripts to avoid being at the mercy of the user’s environment:

#!/usr/bin/env bash
set -Eeuo pipefail
export LC_ALL=C        # if you want fast, byte-deterministic processing
# ... your pipeline ...
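To see what “byte-deterministic” means in practice, compare how sort orders mixed-case input. The sketch below only asserts the C-locale result, since the result under your own locale depends on what is installed on the machine.

```shell
# In the C locale, sort compares raw byte values: all uppercase ASCII
# letters (65-90) come before all lowercase ones (97-122).
c_sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort | paste -sd ' ' -)
printf 'C locale:    %s\n' "$c_sorted"      # C locale:    A B a b

# Under a linguistic locale such as en_US.UTF-8 the order is typically
# case-interleaved instead; shown here without asserting the result.
env_sorted=$(printf 'b\nA\na\nB\n' | sort | paste -sd ' ' -)
printf 'your locale: %s\n' "$env_sorted"
```

If a script’s output feeds comm, join, or a diff against a stored file, this difference is enough to break it between machines.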

UTF-8 in grep, sed, awk

Modern GNU grep/sed/awk handle UTF-8 correctly when the locale is set to a UTF-8 locale:

echo 'café' | grep -oE '\w+'           # in en_US.UTF-8: "café"
echo 'café' | LC_ALL=C grep -oE '\w+'  # in C: "caf" (é is not a word char in C locale)

If you want to count grapheme clusters (what users perceive as “characters”), neither shell nor awk is the right tool — that’s python3 with unicodedata or specialised libraries.

wc -c vs wc -m

echo 'café' | wc -c        # 6 bytes: c, a, f, a 2-byte é, and the trailing newline
echo 'café' | wc -m        # 5 characters (4 letters + newline) in a UTF-8 locale

wc -c counts bytes; wc -m counts characters as defined by the current locale, so under LC_ALL=C it reports the same number as wc -c.
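The locale dependence is easy to verify. This sketch builds the UTF-8 bytes of “café” explicitly with octal escapes (so it does not depend on your terminal’s encoding) and counts them under the C locale, where every byte is its own character.

```shell
word=$(printf 'caf\303\251')   # "café" as UTF-8: 3 ASCII bytes + 2 bytes for é

# tr -d ' ' removes the padding some wc implementations add.
bytes=$(printf '%s' "$word" | LC_ALL=C wc -c | tr -d ' ')
chars_c=$(printf '%s' "$word" | LC_ALL=C wc -m | tr -d ' ')

printf 'bytes=%s chars_in_C=%s\n' "$bytes" "$chars_c"   # bytes=5 chars_in_C=5
```

In a UTF-8 locale the same wc -m call would report 4. Which UTF-8 locale names exist varies by system; locale -a lists them.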

File names with weird characters

Always quote variables that might contain filenames:

for f in *.txt; do
  cp "$f" "/backup/$f"        # quotes essential — filename might have spaces/UTF-8
done

find . -type f -print0 | xargs -0 cp -t /backup/   # NUL-safe; cp -t is a GNU extension

If a filename contains an invalid UTF-8 sequence (rare but happens), some tools will refuse it. find and cp are byte-faithful — they don’t care about UTF-8 validity. bash is also byte-faithful. So shell handles them; just don’t try to convert filenames to a string in another encoding.

BOM (Byte Order Mark) gotcha

Files saved by Windows tools sometimes have a UTF-8 BOM (EF BB BF) at the start. This is invisible but breaks scripts:

file weird.csv         # "UTF-8 Unicode (with BOM) text"

# This BOM appears as a "character" in the first cell:
head -c 3 weird.csv | xxd       # 00000000: efbb bf

# Strip BOM (GNU sed syntax):
sed -i '1s/^\xef\xbb\xbf//' weird.csv
# or use dos2unix (which also handles CRLF)
dos2unix weird.csv

If your CSV “first column header” mysteriously doesn’t match what you expect, suspect a BOM.
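If GNU sed or dos2unix is not available, the BOM can also be detected and removed with head and tail alone. The following is a minimal sketch under that assumption; the file is a made-up CSV created in a temp location, and the variable names are illustrative.

```shell
tmpfile=$(mktemp)
printf '\357\273\277name,age\nalice,30\n' > "$tmpfile"   # hypothetical CSV with a BOM

bom=$(printf '\357\273\277')   # EF BB BF written as octal escapes
if [ "$(head -c 3 "$tmpfile")" = "$bom" ]; then
  # tail -c +4 starts output at byte 4, which skips the 3-byte BOM.
  tail -c +4 "$tmpfile" > "$tmpfile.clean" && mv "$tmpfile.clean" "$tmpfile"
fi

first_line=$(head -n 1 "$tmpfile")
printf '%s\n' "$first_line"    # name,age
rm -f "$tmpfile"
```

Checking before stripping keeps the operation safe to run twice on the same file.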

CRLF vs LF

Windows line endings (\r\n) trip up shell scripts. The carriage return is invisible but breaks read, awk, etc.:

file script.sh                     # "ASCII text, with CRLF line terminators"

dos2unix script.sh                 # convert in place
# or:
sed -i 's/\r$//' script.sh
# or:
tr -d '\r' < script.sh > tmp && mv tmp script.sh
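When you cannot convert the file up front (for example, when reading from a pipe), you can strip the carriage return per line instead. A small sketch of why the comparison fails and how the parameter expansion fixes it; the value is a stand-in for what read would hand you from a CRLF file.

```shell
line=$(printf 'yes\r')    # what a read loop gets from a CRLF-terminated line

# The invisible \r makes the string differ from "yes".
if [ "$line" = "yes" ]; then before=match; else before=mismatch; fi

# Remove one trailing carriage return, if present.
clean=${line%"$(printf '\r')"}
if [ "$clean" = "yes" ]; then after=match; else after=mismatch; fi

printf 'before: %s, after: %s\n' "$before" "$after"
# before: mismatch, after: match
```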

Numeric locale

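With LC_NUMERIC=de_DE.UTF-8, printf '%.2f' prints a comma as the decimal separator (3,14 instead of 3.14), and bash’s printf may reject 3.14 as input for the same reason. This breaks tools that re-read the output:

# Inside scripts, force C numeric locale for safety
export LC_NUMERIC=C        # has no effect if LC_ALL is set, because LC_ALL wins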

6. Combining the toolkit

The real power is composing these. A few full workflows:

Workflow 1: Top 10 noisiest containers (Kubernetes)

kubectl top pods --all-namespaces --no-headers \
  | awk '{ print $3, $1 "/" $2 }' \
  | sort -k1 -h -r \
  | head -n 10 \
  | awk '{ printf "%-10s %s\n", $1, $2 }'

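awk selects and reorders columns; sort -h orders the CPU column numerically (it also understands suffixes such as K, M, and G, which helps if you sort the memory column instead); the second awk formats. No regex, no jq.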

Workflow 2: From CSV to per-region summary

csvgrep -c country -m USA sales.csv \
  | csvcut -c region,amount \
  | csvsql --query "SELECT region, SUM(CAST(amount AS REAL)) AS total
                    FROM stdin GROUP BY region ORDER BY total DESC"

Workflow 3: Container image audit across a Kubernetes cluster

# All container images in the cluster, deduped, with usage count
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[].spec.containers[].image' \
  | sort | uniq -c | sort -rn

Workflow 4: Dynamic Helm values from a YAML file

# Pull pre-defined image map from values.yaml, inject into a kubectl set image
yq -r '.images | to_entries | .[] | "\(.key)=\(.value)"' values.yaml \
  | while IFS= read -r line; do
      kubectl set image "deployment/$DEPLOY" "$line"
    done

Workflow 5: Streaming JSON logs filter

# Tail a structured-log-as-JSON file, filter ERROR-level, format human-readable
tail -F /var/log/app.log \
  | jq --unbuffered -r 'select(.level == "ERROR") | "\(.ts) \(.msg)"'

--unbuffered is essential for tail -F | jq pipelines so jq flushes after each input line.

Workflow 6: Detect drift in a Kubernetes manifest

# Compare in-cluster vs source-of-truth YAML, ignoring runtime fields
diff \
  <(kubectl get deploy myapp -o yaml | yq 'del(.metadata.resourceVersion, .metadata.generation, .status)') \
  <(yq 'del(.metadata.resourceVersion, .metadata.generation, .status)' deployment.yaml)

This kind of one-liner replaces a Python script. It’s the daily life of a platform engineer.


7. Pitfalls and conventions

Don’t pipe sort | uniq when you need order-preservation

uniq only deduplicates adjacent duplicates; that’s why it’s almost always used after sort. But sort reorders. If you need first-occurrence-preserving uniq:

awk '!seen[$0]++'         # canonical "uniq, preserving order"
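The idiom is terse, so here is how it reads: seen[$0]++ returns the old count for the current line (0 the first time) and then increments it, and ! turns that 0 into “true”, which triggers awk’s default action of printing the line. A quick check on inline data:

```shell
# Duplicates of b and a appear later in the input; only first occurrences survive.
deduped=$(printf 'b\na\nb\nc\na\n' | awk '!seen[$0]++' | paste -sd ' ' -)

printf '%s\n' "$deduped"   # b a c
```

Compare that with sort -u on the same input, which would give a b c and lose the original order.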

Don’t sort -u when you need stable ordering

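sort -u is a fast alternative to sort | uniq, but when lines compare equal without being identical (for example with -k or -f), it keeps just one of them, and you should not rely on which one across implementations.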

Don’t write CSV by hand

If your output is for downstream consumption as CSV, escape correctly. The naive printf '%s,%s\n' "$a" "$b" breaks the moment $a contains a comma or newline. Use a real tool (csvkit, miller, Python).

Don’t jq -r on JSON arrays of complex objects

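If you do jq -r '.[]' on [{...},{...}], each object is pretty-printed across several lines, so a line-oriented read loop sees fragments such as { or "name": "alice", rather than whole items. Either extract one field at a time, or use jq -c '.[]' and parse each line again with jq.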

# Wrong — produces ambiguous/multi-line output
jq -r '.[]' file.json | while read -r item; do … done

# Right — compact JSON per line
jq -c '.[]' file.json | while IFS= read -r item; do
  NAME=$(jq -r '.name' <<< "$item")
  AGE=$(jq -r '.age' <<< "$item")
  …
done
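When you only need a few scalar fields, a possible alternative is to let a single jq call emit tab-separated rows and have read split them, which avoids starting jq twice per item. This sketch assumes jq is installed and that the fields are non-empty and contain no tabs (@tsv escapes embedded tabs and newlines as \t and \n, and read collapses empty tab-separated fields). The data is made up.

```shell
users='[{"name":"alice","age":30},{"name":"bob","age":25}]'   # hypothetical input

summary=$(printf '%s' "$users" |
  jq -r '.[] | [.name, .age] | @tsv' |
  while IFS=$(printf '\t') read -r name age; do
    printf '%s is %s\n' "$name" "$age"
  done)

printf '%s\n' "$summary"
# alice is 30
# bob is 25
```

For nested or free-text values, the jq -c pattern above remains the safer choice.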

Don’t trust awk -F, for real CSV

We covered this. Use csvkit/xsv/miller.

Locale in CI

CI runners often have LANG=C or LANG=POSIX by default. If your script does locale-sensitive sort or printf, it will behave differently than on your laptop. Either explicitly set the locale in the script, or test with LC_ALL=C once.

Streaming vs slurp

jq defaults to streaming (one input record at a time). jq -s slurps everything into one array. If your script does tail -F | jq …, never use -s — it would buffer forever waiting for EOF.
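The two modes are easiest to tell apart on a tiny newline-delimited input. In this sketch (jq assumed installed, data invented), the first call runs the filter once per record, while the second needs all records at once in order to add them up.

```shell
stream='{"n":1}
{"n":2}
{"n":3}'

# Default mode: the filter runs once per input record, output appears as it goes.
per_record=$(printf '%s\n' "$stream" | jq -c '.n * 10' | paste -sd ' ' -)

# Slurp mode: all records become one array first, so jq must see EOF.
total=$(printf '%s\n' "$stream" | jq -s 'map(.n) | add')

printf 'per record: %s\n' "$per_record"   # per record: 10 20 30
printf 'total: %s\n' "$total"             # total: 6
```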


8. Twelve idioms for daily use

# 1. Sum a numeric column
awk '{ s += $1 } END { print s }' data.tsv

# 2. Count distinct values in a column
awk '{ count[$1]++ } END { for (k in count) print count[k], k }' data | sort -rn

# 3. Print column 2 from a colon-separated file
awk -F: '{ print $2 }' file

# 4. Skip the header row
awk 'NR > 1' file.csv

# 5. Get all running pod names from kubectl
kubectl get pods -o json | jq -r '.items[] | select(.status.phase=="Running") | .metadata.name'

# 6. JSON to TSV
jq -r '.[] | [.id, .name, .email] | @tsv' users.json

# 7. Update a JSON field in place (atomic)
TMP=$(mktemp); jq '.version = "2.0"' package.json > "$TMP" && mv "$TMP" package.json

# 8. Update YAML in place
yq -i '.spec.replicas = 5' deploy.yaml

# 9. Convert YAML to JSON
yq -o json '.' file.yaml

# 10. Extract a column from real CSV (handling quotes)
csvcut -c name people.csv

# 11. Run SQL on a CSV
csvsql --query "SELECT state, COUNT(*) FROM people GROUP BY state" people.csv   # table name comes from the file name

# 12. Strip BOM and CRLF from a file
dos2unix file.csv && sed -i '1s/^\xef\xbb\xbf//' file.csv

9. What you must internalise before Wave 2

If anything felt fuzzy, re-read the section. These tools repay study many times over.


What’s next: Wave 1 complete!

You’ve now completed the foundation of the course:

Tier 1 (foundation) — anatomy, variables, conditionals, loops, functions, arrays. Tier 2 (intermediate) — I/O, pipes, processes, signals, glob/regex/find/grep/sed, structured-data toolkit.

You can now write a robust shell script: it sets set -Eeuo pipefail and IFS=$'\n\t', defines main "$@", has trap cleanup for safe interrupts, uses arrays for structured data, processes JSON with jq and YAML with yq, handles UTF-8 and locales correctly, and uses find -print0 | xargs -0 for filename-safe pipelines.

Wave 2 — Tier 3 Advanced — covers the next layer: error handling frameworks, debug/trace techniques, secrets management (1Password, vault, sops, age), package management cross-platform, idempotent installers, CLI design for your own scripts (option parsing with getopts and argparse-like patterns), bash testing (bats), and the canonical patterns for writing scripts that ship in production. We bring everything from Wave 1 and start building real systems with it.

See you in Tier 3.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.