Shell Lesson 11 of 42

Globbing, Regex & the find / grep / sed Toolkit That Actually Scales — From `*.txt` to Production-Grade File and Text Manipulation

Most real shell tasks boil down to the same three steps: find some files, look through their contents, rewrite parts of them. The tools for this (globs, find, grep, sed) have been part of Unix since the 1970s, and they have lost none of their relevance in 2026. Cloud-native engineers, DevOps platform owners, and SREs at hyperscalers all reach for these tools every day.

But most engineers use 5% of their power. This lesson covers them in real depth: globs beyond *.txt, regex carefully (because shell mixes three different regex flavours and the differences matter), find as the most powerful filesystem-traversal language ever shipped, grep with all the flags that turn it from “search for word” into “extract every error since 9am with 3 lines of context”, and sed for stream editing without losing files.

If you’ve been doing this work in Python because shell felt too primitive — read this lesson and re-evaluate. For a million tasks, find | grep | sed is one line, fast, and on every machine you’ll ever touch.


1. Globs revisited: nullglob, dotglob, globstar, extglob

We covered the basic globs in lesson 4. Bash has shell options that change globbing behaviour. Set them with shopt -s NAME (set) and shopt -u NAME (unset).

nullglob — empty match expands to nothing

By default:

ls /nonexistent/*.log
ls: cannot access '/nonexistent/*.log': No such file or directory

Bash leaves the literal /nonexistent/*.log in place when nothing matches. Inside a for loop, this means you iterate once with the literal pattern as the value. With nullglob:

shopt -s nullglob

ls /nonexistent/*.log    # careful: the glob expands to nothing, so this runs a bare "ls" and lists the current directory
for f in /nonexistent/*.log; do
  echo "$f"              # body never runs
done

This is almost always what you want for scripts. Set it at the top of any script that iterates over globs.
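As a rough illustration of the difference, the sketch below counts loop iterations over a glob that matches nothing, first with bash's default behaviour and then with nullglob. The scratch directory and the variable names (empty_dir, default_count, nullglob_count) are invented for the demo.

```shell
#!/usr/bin/env bash
# Sketch: count loop iterations over a glob that matches nothing.
# empty_dir is a throwaway directory created just for this demo.

empty_dir=$(mktemp -d)

shopt -u nullglob
default_count=0
for f in "$empty_dir"/*.log; do
  default_count=$((default_count + 1))    # runs once: f is the literal pattern
done

shopt -s nullglob
nullglob_count=0
for f in "$empty_dir"/*.log; do
  nullglob_count=$((nullglob_count + 1))  # never runs: the glob expands to nothing
done
shopt -u nullglob

echo "default: $default_count iteration(s), nullglob: $nullglob_count iteration(s)"
rmdir "$empty_dir"
```

The same disappearing act is why nullglob needs care outside loops: a command that receives no arguments at all may fall back to a default, such as the current directory or standard input.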

dotglob — include hidden files

By default, globs do not match files starting with . (the “hidden” convention):

ls *
# only non-hidden names

shopt -s dotglob
ls *
# now also includes .config .ssh .git etc.

Use dotglob when you actually need to process all files. Otherwise leave it off.

globstar — recursive **

shopt -s globstar

ls **/*.log
# matches *.log in current dir AND recursively in all subdirectories

Without globstar, ** is just two *s (no special meaning). With it, ** matches zero or more path components. This is bash 4+ only.

# Find all .py files in the project
shopt -s globstar nullglob
for f in src/**/*.py; do
  process "$f"
done
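To see what globstar changes, here is a small sketch that expands the same pattern with the option off and on. The project layout under root is hypothetical and built in a scratch directory.

```shell
#!/usr/bin/env bash
# Sketch: compare what src/**/*.py matches with globstar off and on.
# The tree under $root is a made-up project layout.

root=$(mktemp -d)
mkdir -p "$root/src/pkg/deep"
touch "$root/src/a.py" "$root/src/pkg/b.py" "$root/src/pkg/deep/c.py"

shopt -s nullglob

shopt -u globstar
without=( "$root"/src/**/*.py )   # ** behaves like *: exactly one directory level
shopt -s globstar
with=( "$root"/src/**/*.py )      # ** spans zero or more directory levels

echo "globstar off: ${#without[@]} file(s)"
echo "globstar on:  ${#with[@]} file(s)"

shopt -u globstar nullglob
rm -rf "$root"
```

With the option off only src/pkg/b.py matches; with it on, all three files do, including src/a.py at depth zero.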

extglob — extended pattern matching

shopt -s extglob

# now you can use:
?(pat)     # 0 or 1 occurrence of pat
*(pat)     # 0 or more occurrences
+(pat)     # 1 or more
@(pat)     # exactly one
!(pat)     # NOT pat

# Examples:
ls !(*.log)          # everything except .log files
ls *.@(jpg|png|gif)  # any of three extensions
ls ?(README|LICENSE) # match either, or empty

Extglob is bash-specific but extremely useful. Especially !(...) for “everything except”:

# Remove everything in /tmp/cache except the lockfile
shopt -s extglob
rm -rf /tmp/cache/!(lockfile)

Glob options together

The standard “I want my globs to behave sensibly” preamble:

shopt -s nullglob globstar extglob

For most modern bash scripts, this is the right baseline. Add dotglob only when you specifically need it.


2. The three regex flavours in shell

Shell tools use different regex dialects. This trips up everyone. The three flavours:

BRE — Basic Regular Expression (POSIX, the oldest)

Default for grep and sed without flags. Special characters: . * ^ $ [ ]. Other “metacharacters” must be backslash-escaped to be special: \{n,m\}, \(, \), plus \?, \+, and \| as GNU extensions (they are not part of POSIX BRE).

echo "hello123" | grep '[0-9]\+'         # BRE — backslash-escape +
echo "hello123" | sed 's/[0-9]\+/X/'     # BRE — same

This is the most surprising flavour for people coming from other regex languages. It’s also the default. Learn it (or always use -E).

ERE — Extended Regular Expression

grep -E (or the deprecated egrep), sed -E (or sed -r on older GNU sed), awk. Special characters: . * ? + { } | ( ) ^ $ [ ]. No backslash-escaping for ?, +, |, {, (.

echo "hello123" | grep -E '[0-9]+'       # ERE — natural +
echo "hello123" | sed -E 's/[0-9]+/X/'   # ERE — same

ERE is what most people think of as “regex.” If you have -E available, use it.
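A quick way to feel the difference is to run one pattern under both flavours. The sketch below does that with invented sample strings; the variable names are only for the demo.

```shell
#!/usr/bin/env bash
# Sketch: the same pattern text under BRE and ERE. Sample strings are invented.

# In ERE, + is a quantifier: "one or more digits".
ere_digits=$(printf '%s\n' 'hello123' | grep -oE '[0-9]+')

# In BRE, a bare + is an ordinary character, so 'a+b' matches the literal text a+b ...
bre_plus=$(printf '%s\n' 'a+b' | grep -c 'a+b' || true)

# ... while ERE reads it as "one or more a, then b", which the text a+b does not contain.
ere_plus=$(printf '%s\n' 'a+b' | grep -cE 'a+b' || true)

# sed follows the same split: -E switches it from BRE to ERE.
sed_ere=$(printf '%s\n' 'hello123' | sed -E 's/[0-9]+/X/')

echo "ERE digits: $ere_digits"
echo "BRE 'a+b' matching lines: $bre_plus"
echo "ERE 'a+b' matching lines: $ere_plus"
echo "sed -E result: $sed_ere"
```

The "|| true" is only there because grep exits non-zero when nothing matches, which would abort a script running under set -e.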

PCRE — Perl-Compatible Regular Expression

grep -P (GNU grep only), pcregrep, ripgrep with -P (its default engine has no lookaround), and the regex engines of most modern languages. The richest dialect: lookahead, lookbehind, named groups, non-greedy *?, \d, \w, \s, etc.

echo "hello123" | grep -P '\d+'                # PCRE — \d for digits
echo "hello123world" | grep -P '(?<=hello)\d+' # lookbehind: digits AFTER "hello"

PCRE is not in plain sed or awk. For PCRE in stream editing you use perl -pe:

echo "hello123" | perl -pe 's/\d+/X/'    # Perl regex applied to the stream; prints helloX

Picking a flavour

Default to ERE for clarity (grep -E, sed -E). Drop to PCRE only when you need lookahead/lookbehind/named groups. Avoid plain BRE for new code.

A handy rule of thumb: always pass an explicit flavour flag (grep -E, grep -P, or grep -F for literal strings). Never plain grep. The mental tax of remembering BRE backslash-escaping is too high.

Common regex character classes (work in ERE/PCRE)

[A-Za-z]      # letters
[0-9]         # digits
[A-Za-z0-9]   # alphanumeric
[[:alpha:]]   # POSIX letter class — locale-aware
[[:digit:]]   # POSIX digit class
[[:space:]]   # whitespace (space, tab, newline)
[[:punct:]]   # punctuation
[[:xdigit:]]  # hex digit
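\d            # digit (PCRE only)
\w            # word char [A-Za-z0-9_] (PCRE; also a GNU extension in grep and sed)
\s            # whitespace (PCRE; also a GNU extension in grep and sed)
\b            # word boundary (PCRE; also a GNU extension in grep and sed)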

The [[:class:]] POSIX classes work in BRE, ERE, and PCRE.


3. find — the filesystem-traversal language

find is its own little Turing-incomplete language for “give me files matching these criteria, do these things to them.” Most people use find . -name '*.log' and stop. The full power is staggering.

Basic structure

find [PATHS] [TESTS] [ACTIONS]

PATHS are starting points. TESTS are filters that decide whether each file matches. ACTIONS are what to do with matched files. Default action is -print if you don’t specify.

The most useful tests

-name 'PATTERN'           # filename (with shell glob); case-sensitive
-iname 'PATTERN'          # case-insensitive filename
-type TYPE                # f=file, d=dir, l=symlink, b=block, c=char, p=fifo, s=socket
-size N[bckMG]            # size: +1M is "more than 1 megabyte"; -100k is "less than 100k"
-mtime N                  # modified N days ago: -7 = within 7 days, +30 = over 30 days
-atime N                  # accessed N days ago
-ctime N                  # ctime (inode change time)
-newer FILE               # newer than FILE (handy: -newer .last-run)
-mmin N                   # modified N minutes ago
-perm MODE                # permission bits: -perm -u+x means "user-executable"
-user NAME                # owned by user
-group NAME               # owned by group
-empty                    # empty file or dir
-readable / -writable / -executable  # by current user
-regex 'PATTERN'          # match full path with a regex (GNU default: Emacs dialect); pair with -regextype posix-extended for ERE
-path 'PATTERN'           # match full path with shell glob (different from -name)
-not / !                  # negate
-and / -or                # combine (default is -and)

The most useful actions

-print                    # print path (default if no action given)
-print0                   # NUL-terminated; use for piping (lesson 4)
-printf 'FORMAT\n'        # printf-style; %p path, %f filename, %s size, %T@ mtime, etc.
-delete                   # delete the file
-exec CMD {} \;           # run CMD once per match, replacing {} with path
-exec CMD {} +            # run CMD once with ALL matches batched as args (faster!)
-execdir CMD {} \;        # same but cd to file's directory first
-prune                    # don't recurse into this directory (the SKIP action)
-ls                       # ls-style output
-quit                     # stop after this match (find at most 1)

Combining tests

# Files larger than 100MB, ending in .log
find /var/log -type f -size +100M -name '*.log'

# Empty directories
find . -type d -empty

# Modified in last 7 days, not in .git
find . -type f -mtime -7 -not -path '*/.git/*'

# Files whose owner UID has no account in the user database (often security cleanup)
find / -nouser -print

The default combination is “and.” Use -or for “or”; parentheses (escaped or quoted) for grouping:

find . \( -name '*.log' -o -name '*.tmp' \) -delete

Note the escaped parens \( \) — required because parens are shell syntax otherwise.
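The grouping is not cosmetic. As a sketch of what goes wrong without it, the example below runs the same two name tests with and without parentheses, using -print in place of -delete so nothing is removed. The three files are placeholders in a scratch directory.

```shell
#!/usr/bin/env bash
# Sketch: why the parentheses matter when -o is combined with an action.
# The files are placeholders created in a scratch directory.

dir=$(mktemp -d)
touch "$dir/a.log" "$dir/b.tmp" "$dir/c.txt"

# Grouped: (name is *.log OR name is *.tmp) AND print
grouped=$(find "$dir" -type f \( -name '*.log' -o -name '*.tmp' \) -print | wc -l)

# Ungrouped: the implicit "and" binds tighter than -o, so this parses as
#   (-type f AND -name '*.log')  OR  (-name '*.tmp' AND -print)
# and only the *.tmp file is printed.
ungrouped=$(find "$dir" -type f -name '*.log' -o -name '*.tmp' -print | wc -l)

echo "grouped:   $grouped match(es)"
echo "ungrouped: $ungrouped match(es)"

rm -rf "$dir"
```

With -delete in place of -print, the ungrouped form would silently leave every *.log file behind.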

-exec vs -exec +

This is the optimization most people miss:

# SLOW: forks gzip once per file (painful for many files)
find /var/log -name '*.log' -exec gzip {} \;

# FAST: batches up filenames and runs gzip once with as many as fit
find /var/log -name '*.log' -exec gzip {} +

The trailing + (instead of \;) tells find to batch matches. find accumulates filenames until argv-length limits are reached, then exec’s gzip with as many as fit, repeats until done. Dramatically faster for “many files, simple op.”
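To make the batching visible, the sketch below counts how many times find starts the command in each mode. A tiny sh -c script stands in for gzip so nothing is actually compressed, and the five files are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: count process launches with \; versus +.
# "sh -c 'echo started'" is a stand-in for gzip; files are placeholders.

dir=$(mktemp -d)
touch "$dir"/file{1..5}.log

# One process per file: the inner script runs five times.
per_file=$(find "$dir" -type f -name '*.log' -exec sh -c 'echo started' sh {} \; | wc -l)

# Batched: all five paths arrive as arguments to a single process.
batched=$(find "$dir" -type f -name '*.log' -exec sh -c 'echo started' sh {} + | wc -l)

echo "with \\; : $per_file process(es)"
echo "with +  : $batched process(es)"

rm -rf "$dir"
```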

The prune trick — skip directories

# Find all .py files, skipping .git and node_modules
find . \( -name .git -o -name node_modules \) -prune -o -type f -name '*.py' -print

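Read this as: “for each entry, if it’s named .git or node_modules, prune (don’t recurse); otherwise, if it’s a file ending in .py, print it.” The -o is “or”: -prune always returns true, so for a pruned directory the left-hand side succeeds and find never evaluates the right-hand side. Every other entry fails the name test and falls through to the second branch, which handles real matches. The explicit -print matters: without it, find applies the default -print to the whole expression and lists the pruned directories too.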
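Here is the idiom run against a throwaway tree, as a minimal sketch. All directory and file names are invented for the demo.

```shell
#!/usr/bin/env bash
# Sketch: the prune idiom on a scratch tree (all paths are invented).

root=$(mktemp -d)
mkdir -p "$root/pkg" "$root/.git/hooks" "$root/node_modules/lib"
touch "$root/app.py" "$root/pkg/mod.py" \
      "$root/.git/hooks/hook.py" "$root/node_modules/lib/vendored.py"

# Without pruning, all four .py files are found.
unpruned=$(find "$root" -type f -name '*.py' -print | wc -l)

# With pruning, find never descends into .git or node_modules.
pruned=$(find "$root" \( -name .git -o -name node_modules \) -prune \
           -o -type f -name '*.py' -print | wc -l)

echo "without -prune: $unpruned file(s)"
echo "with -prune:    $pruned file(s)"

rm -rf "$root"
```

Unlike a -not -path filter, which still walks the excluded directories and discards the results, -prune skips the traversal itself.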

-print0 and the pipeline pattern

For piping find output safely, always use -print0 and pair it with NUL-aware tools:

find /var/log -name '*.log' -print0 | xargs -0 gzip
find /tmp -type f -mtime +30 -print0 | xargs -0 rm --
mapfile -d '' -t FILES < <(find . -type f -print0)

We covered this in lessons 4 and 6. It’s the only completely-robust file-collection pattern.
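For a concrete feel of what goes wrong with newline-delimited output, the sketch below creates three files with awkward names and counts them both ways. It assumes bash 4.4 or later for mapfile -d; the file names are invented.

```shell
#!/usr/bin/env bash
# Sketch: newline-delimited output miscounts awkward filenames;
# NUL-delimited output does not. Needs bash 4.4+ for "mapfile -d".

dir=$(mktemp -d)
touch "$dir/plain.log" "$dir/with space.log" "$dir/"$'two\nlines.log'

# Newline-delimited: the embedded newline makes one file look like two.
by_line=$(find "$dir" -type f -name '*.log' | wc -l)

# NUL-delimited: every array element is exactly one path.
mapfile -d '' -t files < <(find "$dir" -type f -name '*.log' -print0)

echo "lines from find:   $by_line"
echo "paths via -print0: ${#files[@]}"

rm -rf "$dir"
```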


4. grep — text search with all the flags

Plain grep PATTERN FILE is rarely enough. The flags are essential.

Pattern flavour flags

grep PATTERN FILE         # BRE — escape special chars
grep -E PATTERN FILE      # ERE — natural regex
grep -F PATTERN FILE      # FIXED string — no regex, fastest
grep -P PATTERN FILE      # PCRE — full Perl regex

-F (fixed string) is much faster than regex when you just need a literal substring. Use it when applicable:

grep -F 'ERROR: connection refused' app.log

Output mode flags

grep -l PATTERN *.log     # print only filenames that match (no matching lines)
grep -L PATTERN *.log     # print only filenames that DON'T match
grep -c PATTERN file      # print only count of matching lines
grep -q PATTERN file      # quiet — no output, just exit code (for if conditions)
grep -o PATTERN file      # print only the matched part of each line
grep -n PATTERN file      # prefix each line with line number

Examples:

# Files in /etc that contain "deprecated"
grep -lF deprecated /etc/*.conf

# Count of error lines
grep -c '^ERROR' app.log

# Test whether a file contains a marker (in a script)
if grep -q 'STARTED' /var/log/app.log; then
  echo "App started"
fi

# Extract just the matching emails
grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' file.txt
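The sketch below runs three of these output modes against a small made-up log so the differences are easy to compare. The log contents and variable names are assumptions for the demo.

```shell
#!/usr/bin/env bash
# Sketch: -c, -q and -o against a small invented log.

log=$(mktemp)
cat > "$log" <<'EOF'
INFO STARTED worker=1
ERROR timeout contact=ops@example.com
INFO heartbeat
ERROR disk full contact=admin@example.org
EOF

# -c: number of matching lines
error_count=$(grep -c '^ERROR' "$log")

# -q: no output, only the exit status
if grep -q 'STARTED' "$log"; then started=yes; else started=no; fi

# -o: only the matched text, one match per output line
emails=$(grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' "$log")

echo "error lines: $error_count"
echo "started: $started"
echo "emails:"
echo "$emails"

rm -f "$log"
```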

Context flags

grep -A 3 PATTERN file    # print 3 lines AFTER each match
grep -B 3 PATTERN file    # 3 lines BEFORE
grep -C 3 PATTERN file    # 3 lines BEFORE and AFTER

Wonderful for log analysis:

# Show 5 lines of context around any ERROR in the last 1000 lines
tail -n 1000 app.log | grep -C 5 ERROR

Recursion flags

grep -r PATTERN dir       # recurse into dir
grep -R PATTERN dir       # also follow symlinks (rare; usually -r is what you want)
grep --include='*.py' -r 'TODO' .   # only search .py files
grep --exclude-dir=.git -r 'TODO' .  # skip .git
grep --exclude='*.lock' -r 'TODO' .

Word and line matching

grep -w foo file          # match "foo" only as a whole word (not "foobar")
grep -x EXACT file        # entire line must equal EXACT
grep -v PATTERN file      # invert: print lines that do NOT match
grep -i PATTERN file      # case-insensitive

-w is invaluable for symbol search:

grep -wn 'username' src/**/*.py    # every "username" as a whole word (recursive ** needs shopt -s globstar)

Multi-pattern and pattern files

grep -e foo -e bar file              # match either "foo" OR "bar"
grep -E '(foo|bar)' file             # same with ERE
grep -f patterns.txt file            # patterns in a file, one per line

ripgrep (rg) — the modern alternative

ripgrep (rg) is a grep-style search tool written from scratch in Rust. It recurses by default, skips hidden and binary files, and honours .gitignore:

rg PATTERN              # recurses current dir, ignoring .git/.gitignore'd files
rg -tpy 'def main'      # only Python files
rg -A 3 -B 3 PATTERN    # context
rg -c PATTERN           # counts per file

If you write a lot of search-heavy shell, install ripgrep (brew install ripgrep / apt install ripgrep) and prefer it. For maximum portability of scripts, stick with grep.


5. sed — the stream editor

sed reads input line by line, applies a script, writes output. For one-shot edits to a stream or file, it’s irreplaceable.

Substitution — the 80% use case

sed 's/OLD/NEW/' file               # replace FIRST occurrence per line
sed 's/OLD/NEW/g' file              # replace ALL occurrences (g = global)
sed 's/OLD/NEW/2' file              # replace the SECOND occurrence per line
sed 's/OLD/NEW/gI' file             # global, case-insensitive (gnu sed)

Delimiters

The default delimiter is /, but you can use any character. Especially helpful for paths:

sed 's|/usr/local/bin|/opt/local/bin|g' file
sed 's#OLD#NEW#g' file

Anchors and groups (ERE form with -E)

sed -E 's/^foo/bar/' file           # only at start of line
sed -E 's/foo$/bar/' file           # only at end of line
sed -E 's/(\w+) (\w+)/\2 \1/' file  # swap two words; capture groups

In default BRE, you’d need to backslash-escape the groups and the quantifier: s/\(\w\+\) \(\w\+\)/\2 \1/ (both \w and \+ are GNU extensions).
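If the script may run on a non-GNU sed, one option is to spell the word class with POSIX bracket expressions. The sketch below shows the swap in both flavours under that assumption; the input text is invented.

```shell
#!/usr/bin/env bash
# Sketch: swapping two words with capture groups, written with POSIX classes
# instead of \w so it does not rely on GNU extensions.

# ERE: bare parentheses capture, + needs no backslash.
swapped_ere=$(printf '%s\n' 'hello world' |
  sed -E 's/([[:alnum:]_]+) ([[:alnum:]_]+)/\2 \1/')

# BRE: parentheses are escaped; "one or more X" is spelled X followed by X*.
swapped_bre=$(printf '%s\n' 'hello world' |
  sed 's/\([[:alnum:]_][[:alnum:]_]*\) \([[:alnum:]_][[:alnum:]_]*\)/\2 \1/')

echo "ERE: $swapped_ere"
echo "BRE: $swapped_bre"
```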

Address ranges — apply only to certain lines

sed '5d' file                       # delete line 5
sed '5,10d' file                    # delete lines 5-10
sed '/^#/d' file                    # delete lines starting with #
sed '/PATTERN/,/END_PATTERN/d' file # delete from PATTERN through END_PATTERN
sed -n '5,10p' file                 # print only lines 5-10 (-n suppresses default print)
sed -n '/foo/,/bar/p' file          # print from "foo" through "bar"

The address can be a line number, $ (last line), or /PATTERN/.
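The three address forms can also be mixed within one range. A small sketch on a six-line sample (the sample text and variable names are made up):

```shell
#!/usr/bin/env bash
# Sketch: line-number, $ and /PATTERN/ addresses on a six-line sample.

sample=$(printf '%s\n' '# header' 'alpha' 'beta' '# note' 'gamma' 'delta')

lines_2_to_3=$(printf '%s\n' "$sample" | sed -n '2,3p')      # alpha, beta
no_comments=$(printf '%s\n' "$sample" | sed '/^#/d')         # the four data lines
last_line=$(printf '%s\n' "$sample" | sed -n '$p')           # delta
from_beta=$(printf '%s\n' "$sample" | sed -n '/beta/,$p')    # beta through the end

echo "$lines_2_to_3"
echo "---"
echo "$no_comments"
echo "---"
echo "$last_line"
echo "---"
echo "$from_beta"
```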

Multiple commands

sed -e 's/a/A/' -e 's/b/B/' file    # two substitutions
sed 's/a/A/; s/b/B/' file           # same, ; separator

In-place editing

GNU sed (Linux):

sed -i 's/OLD/NEW/g' file           # modify file in place
sed -i.bak 's/OLD/NEW/g' file       # also save a backup as file.bak

BSD sed (macOS) requires an empty backup extension explicitly:

sed -i '' 's/OLD/NEW/g' file        # macOS — empty extension means "no backup"

For portable scripts that work on both:

sed -i.bak 's/OLD/NEW/g' file && rm -f file.bak

Or:

TMP=$(mktemp file.XXXXXX)           # temp file next to the target, so mv never crosses filesystems
sed 's/OLD/NEW/g' file > "$TMP" && mv "$TMP" file

The mv-temp pattern is also atomic (mv on same fs is atomic) — readers either see the old or new version, never partial. Often the right choice for production scripts.
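One way to package the temp-file pattern is a small function, sketched below. replace_in_file is a hypothetical helper name, not a standard tool, and the sketch assumes it is acceptable for the rewritten file to take on mktemp's default permissions; a production script would likely copy mode and ownership from the original first.

```shell
#!/usr/bin/env bash
# Sketch of the temp-file pattern as a reusable function.
# replace_in_file is a hypothetical helper, not a standard command.
# Assumption: losing the original file's mode/ownership is acceptable here.

replace_in_file() {
  local file=$1 script=$2 tmp
  # Create the temp file beside the target so mv stays on one filesystem.
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  if sed "$script" "$file" > "$tmp"; then
    mv -- "$tmp" "$file"       # rename over the original in one step
  else
    rm -f -- "$tmp"            # sed failed: leave the original untouched
    return 1
  fi
}

# Demo on a scratch file.
demo_dir=$(mktemp -d)
printf '%s\n' 'colour=OLD' 'size=OLD' > "$demo_dir/settings.conf"
replace_in_file "$demo_dir/settings.conf" 's/OLD/NEW/g'
result=$(cat "$demo_dir/settings.conf")
echo "$result"
rm -rf "$demo_dir"
```

Because the temp file lives in the same directory as the target, the final mv is a rename within one filesystem, which is what makes the swap atomic.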

Sed example: rewrite config

# Update hostname in nginx.conf
sudo sed -i.bak -E 's/^(\s*server_name)\s+\S+;/\1 example.com;/' /etc/nginx/nginx.conf

The \S+ matches any non-whitespace (the old hostname); we replace with our new one. The capture group (\s*server_name) preserves indentation.

Sed example: extract a section

# Print the [database] section of an INI file
sed -n '/^\[database\]/,/^\[/p' config.ini

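/PATTERN/,/PATTERN/ is a range; -n + p prints just those lines. The range ends at the next line that starts with [, so the header of the following section is printed as well.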
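The sketch below runs the range on a made-up INI file and then shows one possible variant that prints only the key lines inside the section, without either header. The file contents are invented for the demo.

```shell
#!/usr/bin/env bash
# Sketch: range extraction on an invented INI file, with and without headers.

ini=$(mktemp)
cat > "$ini" <<'EOF'
[server]
port = 8080

[database]
host = db.internal
user = app

[cache]
ttl = 60
EOF

# The range runs from [database] to the next line starting with "[",
# so the [cache] header comes along too.
with_headers=$(sed -n '/^\[database\]/,/^\[/p' "$ini")

# Same range, but print only the lines inside it that are not headers.
body_only=$(sed -n '/^\[database\]/,/^\[/{
  /^\[/!p
}' "$ini")

echo "$with_headers"
echo "---"
echo "$body_only"

rm -f "$ini"
```

For anything beyond a quick extraction, a real INI parser is the safer choice; this only handles the simple one-header-per-line layout.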


6. Combined examples

Find all TODOs in code, with file and line

# The braces sit outside the quotes so bash expands them into one --include per extension
grep -rEn --include='*.'{js,ts,py,go} '\b(TODO|FIXME|XXX)\b' .

Count lines of source per language

for ext in py js ts go; do
  COUNT=$(find . -name "*.${ext}" -not -path '*/node_modules/*' \
    -not -path '*/.git/*' -print0 | xargs -0 cat 2>/dev/null | wc -l)
  printf '%-5s %d\n' "$ext" "$COUNT"
done

Rename batch of files

shopt -s nullglob
for f in *.JPG; do
  mv -- "$f" "${f%.JPG}.jpg"
done

Update copyright year in all source files

find . -type f -name '*.py' -print0 \
  | xargs -0 sed -i.bak -E 's/Copyright \(c\) [0-9]{4}/Copyright (c) 2026/'
find . -type f -name '*.py.bak' -delete

Strip trailing whitespace from all source files

find . -type f \( -name '*.py' -o -name '*.js' \) -print0 \
  | xargs -0 sed -i -E 's/[ \t]+$//'

Find files larger than 100MB and delete after confirmation

find /var/log -type f -size +100M -print0 | while IFS= read -r -d '' f; do
  read -p "Delete $f? [y/N] " -n 1 -r REPLY </dev/tty   # stdin is the find pipe, so take the answer from the terminal
  echo
  if [[ "$REPLY" =~ ^[Yy]$ ]]; then
    rm -- "$f"
  fi
done

Show files modified in last 24 hours, sorted by size

find . -type f -mtime -1 -printf '%s\t%p\n' | sort -n

7. Common pitfalls

Forgetting nullglob

for f in *.log; do process "$f"; done runs once with f="*.log" if no files match. Always either set shopt -s nullglob or guard with [ -e "$f" ] || continue.

Mixing regex flavours

Writing grep '\d+' file and getting nothing — \d is PCRE. Use grep -P '\d+' or grep -E '[0-9]+'.

find -name is a glob, not a regex

find . -name '*.py'        # glob — works
find . -name '.*\.py'      # regex syntax read as a glob: matches only dotfiles such as .x.py

For regex on names, use -regex:

find . -regextype posix-extended -regex '.*\.(py|js)'

Sed in-place differences

GNU vs BSD differ on the -i flag’s argument requirement. Always test on both, or use the temp-file pattern.

grep without --

If your file starts with -, grep thinks it’s a flag:

grep PATTERN -strange-file        # ERROR — "-strange-file" looks like a flag
grep PATTERN -- -strange-file     # CORRECT

The -- separator says “no more flags.” Use it for any user-supplied filenames.
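A minimal sketch of the failure and the fix, using an invented file name in a scratch directory. The exact error text and exit code of the failing call depend on the grep implementation, so the sketch only records that the status is non-zero.

```shell
#!/usr/bin/env bash
# Sketch: a file whose name starts with "-" (the name is invented).

dir=$(mktemp -d)
printf '%s\n' 'needle in here' > "$dir/-strange-file"

# Without --, grep tries to parse the name as options.
# stdin comes from /dev/null so the command can never wait for input.
if ( cd "$dir" && grep -c needle -strange-file ) </dev/null >/dev/null 2>&1; then
  without_status=0
else
  without_status=$?
fi

# With --, everything after it is treated as a file name.
with_count=$( cd "$dir" && grep -c needle -- -strange-file )

echo "without --: exit status $without_status"
echo "with --:    $with_count matching line(s)"

rm -rf "$dir"
```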

Putting regex syntax in a grep -F pattern

grep -F doesn’t interpret regex, so grep -F '$10' finds literal $10. But people sometimes write grep -F for performance and then put regex in the pattern, getting empty results. Use -F only for fixed strings.
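To illustrate, the sketch below feeds the same pattern text to grep with and without -F. The input lines are made up for the demo.

```shell
#!/usr/bin/env bash
# Sketch: the same pattern text with and without -F, on invented input.

data=$(printf '%s\n' 'a.b' 'axb' 'cost: $10' 'id 12345')

regex_dot=$(printf '%s\n' "$data" | grep -c 'a.b' || true)        # . matches any character
fixed_dot=$(printf '%s\n' "$data" | grep -cF 'a.b' || true)       # literal a.b only
fixed_dollar=$(printf '%s\n' "$data" | grep -cF '$10' || true)    # literal $10
fixed_regex=$(printf '%s\n' "$data" | grep -cF '[0-9]+' || true)  # looks for the text [0-9]+

echo "regex 'a.b':    $regex_dot line(s)"
echo "fixed 'a.b':    $fixed_dot line(s)"
echo "fixed '\$10':    $fixed_dollar line(s)"
echo "fixed '[0-9]+': $fixed_regex line(s)"
```

The last count is the pitfall in miniature: the line "id 12345" is full of digits, but -F searches for the seven literal characters [0-9]+ and finds none.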


8. Twelve idioms for daily use

# 1. Recursive grep with file type filter and ignore patterns
grep -rEn --include='*.'{py,js,ts} --exclude-dir={.git,node_modules} 'PATTERN' .

# 2. Recursive find with NUL-safe iteration
find /path -type f -name '*.log' -print0 | xargs -0 cmd

# 3. Count files of a type
find . -type f -name '*.py' | wc -l

# 4. Total size of files matching a pattern
find /var/log -name '*.log' -printf '%s\n' | awk '{s += $1} END {print s}'

# 5. Find empty files / dirs
find . -empty -print

# 6. Files modified in last N minutes
find . -type f -mmin -30

# 7. Files larger than 100MB
find . -type f -size +100M

# 8. Replace text in all matching files (atomic-friendly)
find . -type f -name '*.txt' -print0 | xargs -0 sed -i.bak 's/OLD/NEW/g'
find . -type f -name '*.txt.bak' -delete

# 9. Strip trailing whitespace
find . -type f -name '*.py' -print0 | xargs -0 sed -i 's/[ \t]*$//'

# 10. Top 10 largest files
find . -type f -printf '%s\t%p\n' | sort -rn | head -n 10

# 11. Find duplicate candidates by size (first pass; %11s pads the size so uniq -w 11 compares only that column)
find . -type f -printf '%11s %p\n' | sort -n | uniq -D -w 11

# 12. Count matching lines per log file
grep -c PATTERN /var/log/*.log

9. What you must internalise before lesson 12

If any section above felt fuzzy, re-read it. Lesson 12 (awk, jq, yq, csvkit) covers the structured-data toolkit, for when grep/sed runs out of expressive power.


What’s next

Lesson 12 is the closer for Tier 2: text processing at the structured-data level. We cover awk deeply (its data model, BEGIN/END, FS/OFS, arrays, multi-file processing), jq for JSON (filters, transformations, the standard idioms), yq for YAML, csvkit for proper CSV handling, and the locale and UTF-8 pitfalls that trip up everyone working with international data. Bring everything from lessons 1-11. After L12, Tier 1 + Tier 2 (Wave 1) of this course is complete and we move into the advanced material in Wave 2.
