Text Processing: awk Deep Dive, jq, yq, csvkit & the Locale / UTF-8 Pitfalls — Where Shell Stops Being Primitive

We’ve climbed from echo to find … -print0 | xargs -0. So far the mental model has been “stream of bytes, lines, words.” That’s enough for 70% of shell tasks. The remaining 30% have structure: JSON from APIs, YAML from Kubernetes, CSV from finance teams, columnar output from ps/df/docker. For these, grep and sed are the wrong tool. You don’t want to regex JSON; you want a parser.

This final Tier 2 lesson covers four tools that fill that gap. By the end you will be able to:

Use awk not just for “print column 2” but as a real programming language with associative arrays and multi-file processing.
Use jq to slice, transform, and rewrite JSON in pipelines and scripts.
Use yq (the Go version, by Mike Farah) to do the same for YAML — essential for Kubernetes manifests and Helm charts.
Use csvkit to handle CSV files the way they actually exist in the wild (quoted fields, embedded commas, type inference, SQL queries).
Avoid the locale and UTF-8 pitfalls that cause sort to “fail” on German names and grep to be 10x slower than it should be.

This is the closer of Wave 1. After this, you have a complete shell-fundamentals foundation. Wave 2 builds advanced patterns — error frameworks, package managers, CLI design, secrets — on top of this.

1. `awk` — the data-processing language hidden inside `awk`

awk is named after its three creators — Aho, Weinberger, Kernighan — and is one of the original Unix-era programming languages. It looks like a one-liner tool, but it’s actually a small, complete programming language: variables, control flow, functions, associative arrays, regex.

The data model

Every awk program follows the same structure:

awk 'BEGIN { ... } /PATTERN/ { ACTION } END { ... }' FILE

awk reads input one record at a time (default = one line). For each record, it splits into fields (default separator = whitespace). Then it runs each pattern { action } block where the pattern matches.

$0 — the entire current record
$1, $2, … — fields 1, 2, etc.
NF — number of fields in current record
NR — record number (1-indexed, across all input)
FNR — record number in current file (resets per file)
FS — input field separator (default: whitespace)
OFS — output field separator (default: space)
RS — input record separator (default: newline)
ORS — output record separator (default: newline)
FILENAME — name of current input file
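To make these variables concrete, here is a small sketch that prints a few of them while reading two throwaway files. The file names and contents are made up for illustration.

```shell
# Create two tiny sample files in a temp dir (hypothetical data).
dir=$(mktemp -d)
printf 'a b c\nd e\n' > "$dir/one.txt"
printf 'x y z w\n'    > "$dir/two.txt"

# For every record, show the file, the overall and per-file record numbers, and the field count.
vars_demo=$(cd "$dir" && awk '{ print FILENAME, "NR=" NR, "FNR=" FNR, "NF=" NF }' one.txt two.txt)
printf '%s\n' "$vars_demo"
# one.txt NR=1 FNR=1 NF=3
# one.txt NR=2 FNR=2 NF=2
# two.txt NR=3 FNR=1 NF=4

rm -rf "$dir"
```

Note how NR keeps counting across files while FNR restarts at 1 for two.txt.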

“Print column N” — the boilerplate

ps -ef | awk '{ print $2 }'                # print 2nd column (PID)
ls -l | awk '{ print $5, $9 }'             # size and name
df -h | awk 'NR > 1 { print $5, $6 }'      # skip header (NR==1)

But that’s awk’s boring 5%. Where it shines:

BEGIN and END blocks

# Sum a column
ls -l *.log | awk 'BEGIN { total = 0 } { total += $5 } END { print "Total:", total }'

# Average response time from a log (assumes matching lines end with the bare number as the last field)
awk 'BEGIN { sum = 0; n = 0 } /response_ms=/ { sum += $NF; n++ } END { if (n) print sum/n }' app.log

BEGIN runs before any input. END runs after all input. Use them for initialization and summarization.
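If the number is embedded in a key=value token rather than standing alone as a field, adding the field directly contributes 0, because awk converts a non-numeric string to zero. A minimal sketch of one way around that, assuming a made-up log format with a response_ms=<number> token; it also guards the END block against dividing by zero when nothing matched:

```shell
# Hypothetical log lines; only two carry a response_ms=<number> token.
avg=$(printf '%s\n' \
  'GET /a status=200 response_ms=120' \
  'worker started' \
  'GET /b status=200 response_ms=80' |
awk '
  # match() sets RSTART and RLENGTH for the token; skip its 12-char "response_ms=" prefix.
  match($0, /response_ms=[0-9.]+/) {
    sum += substr($0, RSTART + 12, RLENGTH - 12)
    n++
  }
  # Guard the division so an input with no samples does not abort with an error.
  END { if (n > 0) print sum / n; else print "no samples" }
')
echo "$avg"    # 100
```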

Arithmetic and conditions

# Print files larger than 1MB (column 5 from ls -l is size in bytes)
ls -l | awk '$5 > 1024 * 1024 { print $9 }'

# Format payroll: fields are NAME HOURS RATE
awk '{ printf "%-10s $%.2f\n", $1, $2 * $3 }' payroll.txt

Note printf (lowercase, awk-builtin, separate from shell printf) — same C-style format string.

Field separators (`-F` and `OFS`)

# Print user, shell from /etc/passwd (colon-separated)
awk -F: '{ print $1, $7 }' /etc/passwd

# Convert CSV-like input to TSV output (DOES NOT handle quoted commas — see csvkit later)
awk -F, 'BEGIN{OFS="\t"} { $1=$1; print }' input.csv > output.tsv

The $1=$1 trick is a classic awk idiom — it forces awk to “rebuild” the record using the new OFS, even if no field actually changed.
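The effect is easiest to see side by side. This sketch runs the same input through awk with and without the assignment:

```shell
line='a,b,c'

# Without touching a field, print emits $0 unchanged: OFS is not applied.
unchanged=$(printf '%s\n' "$line" | awk -F, 'BEGIN { OFS = "\t" } { print }')

# Assigning any field (even to itself) makes awk rebuild $0 joined by OFS.
rebuilt=$(printf '%s\n' "$line" | awk -F, 'BEGIN { OFS = "\t" } { $1 = $1; print }')

printf '%s\n' "$unchanged"   # a,b,c
printf '%s\n' "$rebuilt"     # a<TAB>b<TAB>c
```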

Associative arrays — counting and grouping

This is awk’s killer feature. Arrays in awk are associative (string keys), not indexed.

# Count accounts per UID in /etc/passwd (a count above 1 means a shared UID)
awk -F: '{ count[$3]++ } END { for (uid in count) print uid, count[uid] }' /etc/passwd

# Count log lines per HTTP status code from nginx access log
awk '{ status[$9]++ } END { for (s in status) print s, status[s] }' access.log

# Sum bytes by IP address (nginx log columns: 1=IP, 10=bytes)
awk '{ bytes[$1] += $10 } END { for (ip in bytes) print bytes[ip], ip }' access.log | sort -rn | head

This is enormously powerful. You’re aggregating in a single pass without sorting first.
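One thing to keep in mind is that the order of a for (key in array) loop is unspecified, so output usually goes through sort. A small sketch with made-up "ip bytes" pairs that tracks two aggregates per key in the same pass:

```shell
# Hypothetical "ip bytes" pairs, deliberately unsorted.
totals=$(printf '%s\n' \
  '10.0.0.2 300' \
  '10.0.0.1 100' \
  '10.0.0.2 200' \
  '10.0.0.1 50' |
awk '{ bytes[$1] += $2; hits[$1]++ }
     END { for (ip in bytes) print ip, hits[ip], bytes[ip] }' |
LC_ALL=C sort)
printf '%s\n' "$totals"
# 10.0.0.1 2 150
# 10.0.0.2 2 500
```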

Multi-file processing

# Compare line counts of two files (prints: lines-in-file1 lines-in-file2)
awk 'NR==FNR { count++; next } END { print count, NR-count }' file1 file2

# Merge two files by key (a kind of join)
awk 'NR==FNR { map[$1] = $2; next } { if ($1 in map) print $1, $2, map[$1] }' lookup.tsv data.tsv

The NR==FNR trick: while we’re on the first file, NR (overall record number) equals FNR (current file record number). Once we move to the second file, they diverge. So NR==FNR { … ; next } means “run this block only for records from the first file, then skip the remaining rules for that record.” One caveat: if the first file is empty, NR==FNR also holds for the second file.
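Here is the join pattern run end to end on two tiny files, as a sketch. The file names, keys, and values are invented for the example.

```shell
dir=$(mktemp -d)
# Hypothetical lookup (id -> team) and data (id -> score) files.
printf 'u1 red\nu2 blue\n'   > "$dir/lookup.txt"
printf 'u2 7\nu3 9\nu1 4\n'  > "$dir/data.txt"

joined=$(awk '
  NR == FNR { team[$1] = $2; next }      # first file only: fill the map, skip other rules
  $1 in team { print $1, $2, team[$1] }  # second file: emit rows whose key was seen
' "$dir/lookup.txt" "$dir/data.txt")
printf '%s\n' "$joined"
# u2 7 blue
# u1 4 red

rm -rf "$dir"
```

The row for u3 is dropped because it has no entry in the lookup file, which makes this behave like an inner join.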

Regex and patterns

# Print only lines that match a regex
awk '/ERROR|WARN/ { print }' app.log

# Print lines where field 3 starts with "foo"
awk '$3 ~ /^foo/' data.tsv

# Negation
awk '!/DEBUG/' app.log              # everything except DEBUG lines

# Print lines from PATTERN to PATTERN (range, like sed)
awk '/^START/,/^END/' file.txt

`printf` for formatting

awk '{ printf "%-30s %10d\n", $1, $2 }' data.tsv

# Common formats:
# %s    string
# %d    integer
# %f    float
# %e    scientific
# %x    hex
# %o    octal
# %-10s left-align in width 10
# %5.2f float, width 5, 2 decimals
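A quick way to check what a format does is to run it from a BEGIN block, which needs no input file. The brackets in this sketch are only there to make the padding visible:

```shell
formatted=$(awk 'BEGIN {
  printf "[%-10s]\n", "left"      # left-aligned, padded to width 10
  printf "[%10s]\n",  "right"     # right-aligned (the default)
  printf "[%5.2f]\n", 3.14159     # width 5, 2 decimals
  printf "[%x] [%o]\n", 255, 8    # hex and octal
}')
printf '%s\n' "$formatted"
# [left      ]
# [     right]
# [ 3.14]
# [ff] [10]
```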

awk functions

# Built-ins: length, substr, index, split, gsub, sub, tolower, toupper, sprintf, ...

awk '{ print length($0), $0 }' file       # length of each line
awk '{ print toupper($1) }' file          # uppercase first field
awk '{ gsub(/foo/, "bar"); print }' file  # global substitute (like sed)

# Custom functions
awk 'function abs(x) { return x < 0 ? -x : x } { print abs($1) }' data
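Several of these built-ins tend to show up together. As a sketch, the following splits a made-up key=value;key=value record and reformats each pair using split, index, substr, toupper, and length:

```shell
parsed=$(printf '%s\n' 'user=alice;role=admin;team=infra' | awk '{
  n = split($0, pairs, ";")              # n = 3; pairs[1..3] hold "key=value"
  for (i = 1; i <= n; i++) {
    eq = index(pairs[i], "=")            # position of the first "="
    key = substr(pairs[i], 1, eq - 1)    # text before it
    val = substr(pairs[i], eq + 1)       # text after it
    print toupper(key) ": " val " (" length(val) " chars)"
  }
}')
printf '%s\n' "$parsed"
# USER: alice (5 chars)
# ROLE: admin (5 chars)
# TEAM: infra (5 chars)
```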

A complete real-world awk script

# Parse nginx access log (combined format): count requests and total bytes per 5-min bucket
awk '
  BEGIN { FS="[ \\[\\]]+" }
  {
    # field 4 looks like: 22/Jun/2026:14:35:12
    split($4, t, "[:/]")
    bucket = t[1] "/" t[2] "/" t[3] " " t[4] ":" sprintf("%02d", int(t[5]/5)*5)
    requests[bucket]++
    bytes[bucket] += $10    # bytes sent is field 10 with this FS; $NF would be the tail of the user agent
  }
  END {
    for (b in requests)
      printf "%-22s %6d req  %12d bytes\n", b, requests[b], bytes[b]
  }
' access.log | sort

This kind of pipeline used to be a Python script. In awk, it’s 10 lines.

2. `jq` — JSON Swiss army knife

JSON is everywhere in modern shell work — Kubernetes API, AWS CLI, GitHub API, every web service. jq is to JSON what awk is to columnar text.

Basics: pretty-print and select

# Pretty-print
echo '{"name":"alice","age":30}' | jq .

# Get a field
echo '{"name":"alice","age":30}' | jq .name           # "alice"
echo '{"name":"alice","age":30}' | jq '.name'         # same; quote when shell would interpret

# Get a nested field
echo '{"user":{"name":"alice"}}' | jq .user.name

# Array access
echo '[1,2,3]' | jq '.[0]'                            # 1
echo '[1,2,3]' | jq '.[-1]'                           # 3 (negative = from end)
echo '[1,2,3]' | jq '.[]'                             # iterate: 1\n2\n3
echo '[1,2,3]' | jq '.[1:3]'                          # slice: [2,3]

Always wrap jq filters in single quotes: they routinely contain characters such as $, |, [, (, and " that the shell would otherwise interpret.
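Single-quoting the filter raises an obvious follow-up: how does a shell variable get into it? Rather than breaking out of the quotes, jq's --arg option binds a shell value to a jq variable. A small sketch, where user_age is a hypothetical helper and the input shape (an array of objects with name and age) is assumed:

```shell
# user_age NAME: read a JSON array of {name, age} objects on stdin, print the matching age.
# --arg passes the value as a jq string variable, so odd characters in NAME
# are never parsed as part of the filter.
user_age() {
  jq -r --arg name "$1" '.[] | select(.name == $name) | .age'
}

# Usage (assumes jq is installed):
#   echo '[{"name":"alice","age":30},{"name":"bob","age":25}]' | user_age bob
```

Inside the single quotes, $name is a jq variable, not a shell expansion, which is exactly why the quoting rule and --arg work well together.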

Pipes (inside jq)

jq has its own internal pipe |, which feeds output of one filter into another:

# From a list of users, get names
echo '[{"name":"alice"},{"name":"bob"}]' | jq '.[] | .name'
# "alice"
# "bob"

# Even more compact:
echo '[{"name":"alice"},{"name":"bob"}]' | jq '.[].name'

Selectors and filters

# select: keep only items matching a predicate
jq '.[] | select(.age > 30)' users.json

# Multiple predicates
jq '.[] | select(.age > 30 and .role == "admin")' users.json

# Pattern match
jq '.[] | select(.name | test("^A"))' users.json    # name starts with A

Construct new objects

# Pick specific fields
jq '.[] | {name, age}' users.json

# Rename / compute
jq '.[] | {full_name: .name, is_adult: (.age >= 18)}' users.json

# As an array
jq '[.[] | .name]' users.json

`map` — transform an array

# Double every age
jq 'map(.age *= 2)' users.json

# Map to just names
jq 'map(.name)' users.json   # equivalent to [.[] | .name]

`length`, `keys`, `to_entries`, `from_entries`

jq 'length' users.json                    # number of array items
jq 'keys' my_obj.json                     # all keys of an object
jq '.users | keys' my_obj.json            # nested

# Convert object to array of {key, value} pairs
jq 'to_entries' obj.json
# [ {"key":"name","value":"alice"}, {"key":"age","value":30} ]

# And back
jq 'to_entries | from_entries' obj.json   # round-trip

Aggregations: `add`, `min`, `max`, `unique`, `group_by`

# Sum all ages
jq '[.[].age] | add' users.json

# Max age
jq '[.[].age] | max' users.json

# Unique roles
jq '[.[].role] | unique' users.json

# Group by role, count each
jq 'group_by(.role) | map({role: .[0].role, count: length})' users.json

Output modes

jq -r '.name' file              # raw — no JSON quotes; useful for shell strings
jq -c .                         # compact — one item per line; for streaming
jq -s '.'                       # slurp — read all input as single array

-r is essential when feeding jq output into shell variables:

NAME=$(curl -s api.example.com/user | jq -r '.name')   # without -r, NAME would have quotes

Real-world examples

# Get all running pod names from kubectl
kubectl get pods -o json | jq -r '.items[] | select(.status.phase == "Running") | .metadata.name'

# Format AWS instances as tab-separated
aws ec2 describe-instances --output json \
  | jq -r '.Reservations[].Instances[] | [.InstanceId, .InstanceType, .State.Name] | @tsv'

# From a GitHub commits API response, get hash and message
curl -s https://api.github.com/repos/torvalds/linux/commits \
  | jq -r '.[] | "\(.sha[0:7]) \(.commit.message | split("\n")[0])"'

Editing JSON

# Update a field
echo '{"name":"alice","age":30}' | jq '.age = 31'

# Add a field
echo '{"name":"alice"}' | jq '. + {age: 30}'

# Delete a field
echo '{"name":"alice","age":30}' | jq 'del(.age)'

# In-place edit a JSON file (atomic with mv-temp)
TMP=$(mktemp)
jq '.version = "2.0"' package.json > "$TMP" && mv "$TMP" package.json

There’s no jq -i (yet); the mktemp + mv pattern is canonical.

3. `yq` — jq for YAML

There are two yq tools confusingly named the same:

Mike Farah’s yq (Go, written 2017+) — what most cloud engineers use. Syntax mirrors jq. Works on YAML, JSON, XML.
kislyuk’s yq (Python wrapper around jq) — the older one. Less common now.

We’ll cover Mike Farah’s. Install with brew install yq or download the binary.

Basic usage

yq '.metadata.name' deployment.yaml         # get a field — same as jq syntax
yq '.spec.replicas = 5' deployment.yaml     # mutate (prints to stdout)
yq -i '.spec.replicas = 5' deployment.yaml  # in-place (yq has -i, jq doesn't)

Multiple documents in one YAML file

Kubernetes manifests often have multiple documents separated by ---. yq handles them:

yq '.kind' multi.yaml          # prints all "kind" values, one per document
yq 'select(.kind == "Service")' multi.yaml   # extract only Service docs

Convert between formats

yq -o json '.' file.yaml > file.json     # YAML to JSON
yq -p json -o yaml '.' file.json         # JSON to YAML
yq -p xml '.' file.xml                   # parse XML

Real-world examples

# Get all images used in a Helm template'd deployment
helm template mychart | yq '..|.image? | select(.)'

# Bulk-update image tag in a Kustomize patch
yq -i '.spec.template.spec.containers[0].image = "myimage:v2"' deploy.yaml

# Get all containers across all pods in a namespace
kubectl get pods -o yaml | yq '.items[].spec.containers[].name'

Caveats

YAML is more complex than JSON — comments, anchors, multi-line strings — and yq preserves them when possible, but any yq -i round-trip can subtly reformat the file. For Helm/Kustomize source files where exact formatting matters, prefer using YAML-aware tools (Helm, Kustomize, OPA Rego) or be very deliberate with yq -i.
The two yqs are not interchangeable. If a colleague’s snippet doesn’t work, check which yq they have (yq --version).

4. CSV: when `awk -F,` isn’t enough

The naive approach to CSV is awk -F,. It works until the first quoted field with an embedded comma, then it explodes.

# This file:
"Smith, John",30,Engineer
"Doe, Jane",25,Manager

# awk -F, treats the comma INSIDE the quotes as a separator. Wrong.
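A quick check makes the failure visible. This sketch feeds the first sample row through a comma split and looks at what awk thinks it received:

```shell
# The first sample row: three logical columns, one of them quoted.
row='"Smith, John",30,Engineer'

# A comma split sees four fields, and the "first" one is a fragment with a stray quote.
nf=$(printf '%s\n' "$row" | awk -F, '{ print NF }')
first=$(printf '%s\n' "$row" | awk -F, '{ print $1 }')

echo "$nf"      # 4
echo "$first"   # "Smith
```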

For real-world CSV (especially anything from spreadsheets or business systems), use a real CSV parser.

`csvkit` — Python-based CSV toolkit

pip install csvkit

Provides:

csvlook — pretty-print as a table
csvcut — extract columns by name or index
csvgrep — filter rows by column value
csvstat — column statistics (min/max/mean/distinct values)
csvsort — sort by column
csvjoin — SQL-style join of two CSVs
csvjson — convert to JSON
csvsql — run SQL queries against a CSV (!)
in2csv — convert XLS/XLSX/JSON to CSV
csvformat — change delimiters, quoting

Examples

# Pretty-print the first 10 rows
head -n 10 data.csv | csvlook

# Get a specific column by name
csvcut -c first_name,last_name people.csv

# Filter rows where state is CA
csvgrep -c state -m CA people.csv

# Statistics on each column
csvstat sales.csv

# Sort by sale amount, descending
csvsort -c amount -r sales.csv | head

# Join orders with customers on customer_id
csvjoin -c customer_id orders.csv customers.csv > joined.csv

# Run SQL against a CSV
csvsql --query "SELECT state, COUNT(*) FROM people GROUP BY state ORDER BY 2 DESC" people.csv

# Convert Excel to CSV
in2csv sales.xlsx > sales.csv

csvsql is genuinely magical: it loads the CSV into an in-memory SQLite, runs the query, prints the result. For small to medium CSVs (up to a few hundred MB), it beats writing pandas or a real database.

`xsv` — fast CSV (Rust)

For very large CSVs, csvkit (Python) is slow. xsv is a Rust-based alternative:

brew install xsv

xsv stats data.csv | xsv table        # statistics, table-formatted
xsv select first_name,last_name data.csv
xsv search -s state CA data.csv       # filter by column
xsv join customer_id orders.csv customer_id customers.csv

Same operations as csvkit, much faster on big files.

`miller` (`mlr`)

Yet another option, designed for “TSV/CSV/JSON/etc as named-field records”:

brew install miller

mlr --csv stats1 -a mean,stddev -f age people.csv
mlr --c2t cat people.csv > people.tsv      # CSV to TSV
mlr --c2j cat people.csv > people.json     # CSV to JSON

mlr is genuinely clever and very capable, but has its own syntax to learn. Pick one (csvkit, xsv, or mlr) and stick with it.

5. Locale and UTF-8 — the silent saboteurs

This is the section that most “shell scripting” tutorials skip, and it’s where almost every senior engineer gets bitten at least once.

Locale categories

A locale tells programs how to interpret text:

LC_CTYPE — what counts as a letter, digit, lowercase
LC_COLLATE — sort order
LC_NUMERIC — decimal separator (, in Germany, . in the US)
LC_TIME — date and time format
LC_MESSAGES — language of program messages
LC_MONETARY — currency formatting
LC_ALL — overrides all of the above
LANG — fallback if a specific LC_* isn’t set

On a typical Mac/Linux system:

locale          # show current settings

# Common values:
LANG=en_US.UTF-8
LC_ALL=
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
...

The `LC_ALL=C` trick

Setting LC_ALL=C (or the equivalent LC_ALL=POSIX) tells programs to use the most basic, byte-comparison-only locale. It’s:

Faster: no Unicode collation tables, no locale lookups. sort | uniq of a 100MB file can be as much as 10x faster, depending on the data and the sort implementation.
More predictable: alphabetical sort means “byte order,” not “linguistic order.” B < b is true in the C locale (capital letters come first); in en_US.UTF-8 it depends.
Wrong for international data: German ä will sort wherever its UTF-8 byte sequence sorts, not “near a.” Polish ł will sort like a totally different letter.

So the rule:

# For pipelines that don't need linguistic correctness:
LC_ALL=C sort -u file.txt

# For pipelines that DO need it:
LC_ALL=en_US.UTF-8 sort -u file.txt
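Byte order is easy to verify on a few lines. In this sketch the non-ASCII letter is written with octal escapes so the bytes are unambiguous:

```shell
# Sorted by raw byte value: uppercase ASCII (0x41-0x5A) comes before lowercase (0x61-0x7A).
c_sorted=$(printf 'b\nB\na\nA\n' | LC_ALL=C sort | paste -sd ' ' -)
echo "$c_sorted"        # A B a b

# "ä" is the bytes 0xC3 0xA4 in UTF-8, so in byte order it lands after "z".
umlaut_sorted=$(printf 'z\n\303\244\na\n' | LC_ALL=C sort | paste -sd ' ' -)
echo "$umlaut_sorted"   # a z ä
```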

Always set LC_ALL explicitly inside scripts to avoid being at the mercy of the user’s environment:

#!/usr/bin/env bash
set -Eeuo pipefail
export LC_ALL=C        # if you want fast, byte-deterministic processing
# ... your pipeline ...

UTF-8 in `grep`, `sed`, `awk`

Modern GNU grep/sed/awk handle UTF-8 correctly when the locale is set to a UTF-8 locale:

echo 'café' | grep -oE '\w+'           # in en_US.UTF-8: "café"
echo 'café' | LC_ALL=C grep -oE '\w+'  # in C: "caf" (é is not a word char in C locale)

If you want to count grapheme clusters (what users perceive as “characters”), neither shell nor awk is the right tool — that’s python3 with unicodedata or specialised libraries.

`wc -c` vs `wc -m`

echo 'café' | wc -c        # 6 bytes: c, a, f (1 each) + é (2 in UTF-8) + the trailing newline
echo 'café' | wc -m        # 5 characters, newline included (in a UTF-8 locale)

wc -c counts bytes; wc -m counts characters as defined by the current locale, so under LC_ALL=C it typically reports the same number as wc -c.
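The same comparison without the trailing newline, as a sketch. The character count is left without an expected value on purpose, because it depends on which locales are installed; en_US.UTF-8 is an assumption here.

```shell
# "café" written with octal escapes so the bytes are unambiguous: c a f 0xC3 0xA9.
word=$(printf 'caf\303\251')

# Byte count is locale-independent: 3 ASCII bytes + 2 bytes for é = 5 (no newline here).
bytes=$(printf '%s' "$word" | wc -c | tr -d ' ')
echo "$bytes"    # 5

# Character count depends on the locale. Assumption: en_US.UTF-8 is installed;
# where it is, this reports 4, and where it is not, the result may fall back to bytes.
chars=$(printf '%s' "$word" | LC_ALL=en_US.UTF-8 wc -m | tr -d ' ')
echo "$chars"
```

The tr -d ' ' is there because some wc implementations pad the number with leading spaces.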

File names with weird characters

Always quote variables that might contain filenames:

for f in *.txt; do
  cp "$f" "/backup/$f"        # quotes essential — filename might have spaces/UTF-8
done

find . -type f -print0 | xargs -0 cp -t /backup/   # NUL-safe (cp -t is a GNU coreutils option)

If a filename contains an invalid UTF-8 sequence (rare but happens), some tools will refuse it. find and cp are byte-faithful — they don’t care about UTF-8 validity. bash is also byte-faithful. So shell handles them; just don’t try to convert filenames to a string in another encoding.

BOM (Byte Order Mark) gotcha

Files saved by Windows tools sometimes have a UTF-8 BOM (EF BB BF) at the start. This is invisible but breaks scripts:

file weird.csv         # "UTF-8 Unicode (with BOM) text"

# This BOM appears as a "character" in the first cell:
head -c 3 weird.csv | xxd       # 00000000: efbb bf

# Strip BOM (GNU sed syntax; BSD/macOS sed needs -i '' and may not support \x escapes):
sed -i '1s/^\xef\xbb\xbf//' weird.csv
# or use dos2unix (which also handles CRLF)
dos2unix weird.csv

If your CSV “first column header” mysteriously doesn’t match what you expect, suspect a BOM.
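Where GNU sed or dos2unix is not available, the BOM can also be detected and removed with od and tail. This is only a sketch: strip_bom is a hypothetical helper name, and it replaces the file via a temp file, so the original permissions are not preserved.

```shell
# strip_bom FILE: remove a leading UTF-8 BOM if present, leave the file alone otherwise.
strip_bom() {
  file=$1
  # Read the first three bytes as hex, e.g. "efbbbf" when a BOM is present.
  head3=$(od -An -tx1 -N3 "$file" | tr -d ' \n')
  if [ "$head3" = "efbbbf" ]; then
    tmp=$(mktemp) || return 1
    # tail -c +4 starts output at byte 4, i.e. just past the 3-byte BOM.
    tail -c +4 "$file" > "$tmp" && mv "$tmp" "$file"
  fi
}

# Demo on a throwaway file that starts with EF BB BF.
demo=$(mktemp)
printf '\357\273\277name,age\n' > "$demo"
strip_bom "$demo"
first_line=$(head -n 1 "$demo")
echo "$first_line"    # name,age
rm -f "$demo"
```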

CRLF vs LF

Windows line endings (\r\n) trip up shell scripts. The carriage return is invisible but breaks read, awk, etc.:

file script.sh                     # "ASCII text, with CRLF line terminators"

dos2unix script.sh                 # convert in place
# or:
sed -i 's/\r$//' script.sh
# or:
tr -d '\r' < script.sh > tmp && mv tmp script.sh
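Because the carriage return is invisible, it helps to count affected lines before and after converting. A minimal sketch on a made-up three-line input:

```shell
# Hypothetical three-line input where two lines carry a Windows line ending.
sample=$(mktemp)
printf 'one\r\ntwo\nthree\r\n' > "$sample"

# Count lines whose last character is a carriage return (n + 0 prints 0 when none match).
before=$(awk '/\r$/ { n++ } END { print n + 0 }' "$sample")

# Strip every CR, then count again.
after=$(tr -d '\r' < "$sample" | awk '/\r$/ { n++ } END { print n + 0 }')

echo "$before $after"   # 2 0
rm -f "$sample"
```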

Numeric locale

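With LC_NUMERIC=de_DE.UTF-8, locale-aware programs format numbers with a comma as the decimal separator, so printf '%.2f' can print 3,14 instead of 3.14. This breaks tools that re-read the output:

# Inside scripts, force C numeric locale for safety
# (LC_ALL, if set, overrides LC_NUMERIC, so export LC_ALL=C already covers this)
export LC_NUMERIC=C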

6. Combining the toolkit

The real power is composing these. A few full workflows:

Workflow 1: Top 10 noisiest containers (Kubernetes)

kubectl top pods --all-namespaces --no-headers \
  | awk '{ print $3, $1 "/" $2 }' \
  | sort -k1 -h -r \
  | head -n 10 \
  | awk '{ printf "%-10s %s\n", $1, $2 }'

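awk selects and reorders columns (with --no-headers, field 3 is the CPU column, e.g. 250m; print $4 instead to rank by memory); sort -h compares human-readable quantities with unit suffixes (M, G); the second awk formats. No regex, no jq.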

Workflow 2: From CSV to per-region summary

csvgrep -c country -m USA sales.csv \
  | csvcut -c region,amount \
  | csvsql --query "SELECT region, SUM(CAST(amount AS REAL)) AS total
                    FROM stdin GROUP BY region ORDER BY total DESC"

Workflow 3: Container image audit across a Kubernetes cluster

# All container images in the cluster, deduped, with usage count
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[].spec.containers[].image' \
  | sort | uniq -c | sort -rn

Workflow 4: Dynamic Helm values from a YAML file

# Pull pre-defined image map from values.yaml, inject into a kubectl set image
yq -r '.images | to_entries | .[] | "\(.key)=\(.value)"' values.yaml \
  | while IFS= read -r line; do
      kubectl set image "deployment/$DEPLOY" "$line"
    done

Workflow 5: Streaming JSON logs filter

# Tail a structured-log-as-JSON file, filter ERROR-level, format human-readable
tail -F /var/log/app.log \
  | jq --unbuffered -r 'select(.level == "ERROR") | "\(.ts) \(.msg)"'

--unbuffered is essential for tail -F | jq pipelines so jq flushes after each input line.

Workflow 6: Detect drift in a Kubernetes manifest

# Compare in-cluster vs source-of-truth YAML, ignoring runtime fields
diff \
  <(kubectl get deploy myapp -o yaml | yq 'del(.metadata.resourceVersion, .metadata.generation, .status)') \
  <(yq 'del(.metadata.resourceVersion, .metadata.generation, .status)' deployment.yaml)

This kind of one-liner replaces a Python script. It’s the daily life of a platform engineer.

7. Pitfalls and conventions

Don’t pipe `sort | uniq` when you need order-preservation

uniq only deduplicates adjacent duplicates; that’s why it’s almost always used after sort. But sort reorders. If you need first-occurrence-preserving uniq:

awk '!seen[$0]++'         # canonical "uniq, preserving order"
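The idiom is terse, so here is a sketch that unpacks it on a small made-up list and compares it with sort -u:

```shell
input=$(printf 'pear\napple\npear\nfig\napple\n')

# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++ is true
# exactly once per distinct line, and input order is kept.
ordered=$(printf '%s\n' "$input" | awk '!seen[$0]++' | paste -sd ' ' -)

# sort -u also deduplicates, but the original order is lost.
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -u | paste -sd ' ' -)

echo "$ordered"   # pear apple fig
echo "$sorted"    # apple fig pear
```

The trade-off is memory: awk keeps every distinct line in the seen array, which matters on very large inputs with many unique lines.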

Don’t `sort -u` when you need stable ordering

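sort -u is a fast alternative to sort | uniq. For whole-line comparisons the result is the same, since equal lines are identical. Combined with -k, though, lines that compare equal on the key can differ elsewhere, and which one survives depends on how the sort orders those ties. Don’t rely on a particular winner unless you also control tie-breaking (for example with -s for a stable sort).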

Don’t write CSV by hand

If your output is for downstream consumption as CSV, escape correctly. The naive printf '%s,%s\n' "$a" "$b" breaks the moment $a contains a comma or newline. Use a real tool (csvkit, miller, Python).

Don’t `jq -r` on JSON arrays of complex objects

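jq -r only strips the quotes from string outputs. If you do jq -r '.[]' on [{...},{...}], each object is still pretty-printed across several lines, so a line-by-line read loop sees fragments of JSON rather than whole items. Either iterate one field at a time, or use jq -c '.[]' and parse each line again with jq.

# Wrong: each object spans multiple lines, so read sees fragments
jq -r '.[]' file.json | while read -r item; do … done

# Right: compact JSON, one object per line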
jq -c '.[]' file.json | while IFS= read -r item; do
  NAME=$(jq -r '.name' <<< "$item")
  AGE=$(jq -r '.age' <<< "$item")
  …
done
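When the fields are simple scalars, a variation avoids spawning jq once per field: emit one @tsv line per object and let read split it. A sketch under stated assumptions: list_users is a hypothetical helper, the input is an array of objects with name and age, and no field is empty (tab is whitespace to read, so empty fields would collapse).

```shell
# list_users FILE: print "name is age" for each array element, using a single jq process.
list_users() {
  tab=$(printf '\t')
  # @tsv emits one tab-separated line per object; read splits it back into variables.
  jq -r '.[] | [.name, .age] | @tsv' "$1" |
  while IFS="$tab" read -r name age; do
    printf '%s is %s\n' "$name" "$age"
  done
}

# Usage (assumes jq is installed and users.json has the shape described above):
#   list_users users.json
```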

Don’t trust `awk -F,` for real CSV

We covered this. Use csvkit/xsv/miller.

Locale in CI

CI runners often have LANG=C or LANG=POSIX by default. If your script does locale-sensitive sort or printf, it will behave differently than on your laptop. Either explicitly set the locale in the script, or test with LC_ALL=C once.

Streaming vs slurp

jq defaults to streaming (one input record at a time). jq -s slurps everything into one array. If your script does tail -F | jq …, never use -s — it would buffer forever waiting for EOF.

8. Twelve idioms for daily use

# 1. Sum a numeric column
awk '{ s += $1 } END { print s }' data.tsv

# 2. Count distinct values in a column
awk '{ count[$1]++ } END { for (k in count) print count[k], k }' data | sort -rn

# 3. Print column 2 from a colon-separated file
awk -F: '{ print $2 }' file

# 4. Skip the header row
awk 'NR > 1' file.csv

# 5. Get all running pod names from kubectl
kubectl get pods -o json | jq -r '.items[] | select(.status.phase=="Running") | .metadata.name'

# 6. JSON to TSV
jq -r '.[] | [.id, .name, .email] | @tsv' users.json

# 7. Update a JSON field in place (atomic)
TMP=$(mktemp); jq '.version = "2.0"' package.json > "$TMP" && mv "$TMP" package.json

# 8. Update YAML in place
yq -i '.spec.replicas = 5' deploy.yaml

# 9. Convert YAML to JSON
yq -o json '.' file.yaml

# 10. Extract a column from real CSV (handling quotes)
csvcut -c name people.csv

# 11. Run SQL on a CSV
csvsql --query "SELECT state, COUNT(*) FROM people GROUP BY state" people.csv

# 12. Strip BOM and CRLF from a file
dos2unix file.csv && sed -i '1s/^\xef\xbb\xbf//' file.csv

9. What you must internalise before Wave 2

What does awk use for record/field separators by default? (Newline / whitespace.)
What does BEGIN { … } do in awk? (Runs before any input. END runs after.)
What’s the NR==FNR { … ; next } idiom? (Process only the first of multiple input files.)
What’s the difference between jq . and jq -r .? (-r outputs raw strings without JSON quotes — use when feeding into shell variables.)
What’s jq -c? (Compact output, one record per line — for streaming.)
Which yq are we using? (Mike Farah’s Go version. Not the Python wrapper.)
Why is awk -F, wrong for real CSV? (Doesn’t handle quoted fields with embedded commas.)
Which CSV tool is fastest for big files? (xsv. csvkit is convenient but Python-slow.)
What does LC_ALL=C do? (Forces byte-only comparisons — fast and predictable, but breaks non-ASCII linguistic correctness.)
What’s a UTF-8 BOM and how do you remove it? (EF BB BF at file start; sed -i '1s/^\xef\xbb\xbf//' file or dos2unix.)

If anything felt fuzzy, re-read the section. These tools repay study many times over.

What’s next: Wave 1 complete!

You’ve now completed the foundation of the course:

Tier 1 (foundation) — anatomy, variables, conditionals, loops, functions, arrays. Tier 2 (intermediate) — I/O, pipes, processes, signals, glob/regex/find/grep/sed, structured-data toolkit.

You can now write a robust shell script: it sets set -Eeuo pipefail and IFS=$'\n\t', defines main "$@", has trap cleanup for safe interrupts, uses arrays for structured data, processes JSON with jq and YAML with yq, handles UTF-8 and locales correctly, and uses find -print0 | xargs -0 for filename-safe pipelines.

Wave 2 — Tier 3 Advanced — covers the next layer: error handling frameworks, debug/trace techniques, secrets management (1Password, vault, sops, age), package management cross-platform, idempotent installers, CLI design for your own scripts (option parsing with getopts and argparse-like patterns), bash testing (bats), and the canonical patterns for writing scripts that ship in production. We bring everything from Wave 1 and start building real systems with it.

See you in Tier 3.

Written by Vinod
