Chapter Eight

Text Processing

Learning Objectives
  1. Choose the right tool for viewing, searching, and transforming text
  2. Write regular expressions for common matching tasks
  3. Use sed and awk to transform text streams
  4. Combine text utilities in pipelines to extract information from logs and data
  5. Compare find and locate and know when to use each

Unix is a text-processing civilisation. Its founding insight was that almost everything worth communicating can be expressed as lines of text, and that if you build enough good tools for manipulating text, you have effectively built good tools for manipulating everything. Configuration files are text. Logs are text. Program source code is text. Web pages are text. Even the things that are not obviously text — kernel state, process lists, hardware information — are exposed as if they were text through files like /proc/cpuinfo. This chapter is about the small, sharp tools that make text wrangling on Linux a pleasure.

Viewing Files

The most basic operation is simply looking at a file's contents.

cat file.txt          # print the whole file
head file.txt         # first 10 lines
head -20 file.txt     # first 20 lines
tail file.txt         # last 10 lines
tail -n 20 file.txt   # last 20 lines
tail -f /var/log/syslog   # follow: print new lines as they are appended

The -f flag on tail is particularly useful for watching log files live. When a server is misbehaving, tail -f /var/log/syslog in one window is a standard sysadmin posture.
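head and tail also compose in pipelines. A handy trick, sketched here on a throwaway file: to print a specific range of lines, take the first N with head and then the last few of those with tail.

```shell
# Build a 100-line sample file: the numbers 1 to 100, one per line.
seq 100 > sample.txt

# Print lines 18 through 22: first 22 lines, then the last 5 of those.
head -n 22 sample.txt | tail -n 5
# 18
# 19
# 20
# 21
# 22
```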

For long files, cat is a nuisance — the terminal scrolls too fast to read. Use less:

less /var/log/syslog

Inside less, the keys come from vi: space or f pages forward, b pages back, /pattern searches, n jumps to the next match, and q quits. There is also more, an older pager, but less (whose name is a joke — "less is more") is more capable in every way.

grep: Searching for Patterns

grep is possibly the most important single command in the Unix toolbox. Its name comes from the old ed editor command g/re/p — globally, regular expression, print. It reads files (or stdin) and prints every line that matches a pattern.

grep "error" /var/log/syslog
grep -i "error" syslog            # case-insensitive
grep -n "error" syslog            # show line numbers
grep -c "error" syslog            # count matches
grep -r "TODO" src/               # recurse into directories
grep -v "DEBUG" syslog            # invert: show lines that do NOT match
grep -A 3 "error" syslog          # 3 lines after each match
grep -B 3 "error" syslog          # 3 lines before
grep -C 3 "error" syslog          # 3 lines of context

grep understands basic regular expressions by default. For the more convenient extended regular expressions (with +, ?, |, and parentheses), use -E (or egrep, a traditional synonym that modern GNU grep deprecates in favour of grep -E):

grep -E "error|warning|critical" syslog
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" syslog   # lines starting with a date

Regular Expressions: A Short Field Guide

A regular expression (or regex) is a compact pattern language for matching strings. You will use it not just in grep but in sed, awk, vim, editors, programming languages, and countless tools. The full language is intricate, but a handful of metacharacters covers most needs.

One complication: there are several regex dialects, and plain grep defaults to the oldest and most awkward. Basic Regular Expressions (BRE) treat (, ), {, }, |, +, and ? as literal characters; to give them their metacharacter meaning you must escape them with backslashes. Extended Regular Expressions (ERE, enabled with grep -E or egrep) flip this: those characters are metacharacters by default, and you backslash them to match them literally. ERE is almost always what you want for anything beyond the simplest patterns, hence the convention of reaching straight for grep -E. Perl-Compatible Regular Expressions (PCRE, enabled with grep -P where supported) are a superset of ERE with extras such as \d for a digit, \w for a word character, lookarounds, and non-greedy matching: features many people already know from editors and programming languages.

Example: match an IP address.

grep -E "([0-9]{1,3}\.){3}[0-9]{1,3}" access.log

This matches one to three digits, then a dot, repeated three times, then one to three digits. It is imperfect (it would match 999.999.999.999) but good enough for most log scraping.
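Combined with -o, which prints only the matched text rather than the whole line, the same pattern pulls the addresses out directly. A sketch on a couple of made-up log lines:

```shell
# Two made-up lines in the shape of an access log.
printf '10.0.0.5 - - "GET /index.html"\n192.168.1.20 - - "GET /about"\n' > sample.log

# -o prints each match on its own line, nothing else.
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' sample.log
# 10.0.0.5
# 192.168.1.20
```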

sed: Stream Editor

sed — the stream editor — reads input line by line, applies editing commands, and writes the result to stdout. It is non-interactive, which makes it perfect for pipelines.

The commonest use is search-and-replace:

sed 's/foo/bar/' file.txt          # replace first foo per line with bar
sed 's/foo/bar/g' file.txt         # replace all occurrences
sed 's/foo/bar/gI' file.txt        # case-insensitive, global (the I flag is a GNU extension)

The s stands for substitute; the trailing g stands for global. The delimiter need not be a slash — any character works, which is handy when the pattern contains slashes:

sed 's|/usr/local|/opt|g' config.txt
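With -E, sed's substitute command also supports capture groups: parenthesised parts of the pattern can be reused in the replacement as \1, \2, and so on, and & stands for the whole match. A quick sketch:

```shell
# Capture groups: swap two words and insert a comma.
echo "Ada Lovelace" | sed -E 's/([A-Za-z]+) ([A-Za-z]+)/\2, \1/'
# Lovelace, Ada

# & is replaced by the entire matched text.
echo "port 8080" | sed -E 's/[0-9]+/[&]/'
# port [8080]
```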

sed can also delete lines:

sed '/^#/d' config.txt             # delete lines starting with #
sed '5,10d' file.txt               # delete lines 5 to 10
sed '$d' file.txt                  # delete the last line

By default sed writes to stdout, leaving the original file untouched. To edit in place, use -i:

sed -i 's/foo/bar/g' file.txt       # modifies file.txt directly
sed -i.bak 's/foo/bar/g' file.txt   # same, but saves a backup as file.txt.bak

Be careful with -i — there is no undo.

awk: A Miniature Language

awk, named after its three creators Aho, Weinberger, and Kernighan, is a full programming language disguised as a command. It was designed for processing columnar data — files where each line is a record and fields are separated by whitespace or some other delimiter.

The simplest use is printing a column:

ls -l | awk '{print $5, $9}'
# prints the size (column 5) and name (column 9) of each file

Inside the single quotes, {print $5, $9} is an awk action, and $1, $2, ... are the fields of the current line. $0 is the whole line; NF is the number of fields; NR is the current record number.
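The field separator defaults to runs of whitespace, but -F changes it, so awk handles delimited files directly. A sketch on a small colon-separated sample (made-up data in the shape of /etc/passwd):

```shell
# Made-up colon-separated records: user:passwd:uid:gid:gecos:home:shell
printf 'root:x:0:0:root:/root:/bin/bash\nalice:x:1000:1000::/home/alice:/bin/zsh\n' > passwd-sample.txt

# -F: splits on colons; print username (field 1) and shell (field 7).
awk -F: '{print $1, $7}' passwd-sample.txt
# root /bin/bash
# alice /bin/zsh
```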

Awk actions can be gated by patterns:

awk '/error/ {print $0}' log.txt
# prints every line matching the pattern "error"

awk '$3 > 100 {print $1, $3}' data.txt
# prints columns 1 and 3 of any line where column 3 is > 100

It supports variables, arithmetic, loops, and functions — for quick data summaries it is often faster to reach for awk than to switch to Python:

awk '{sum += $1} END {print sum}' numbers.txt
# sums the first column

The END {...} block runs once after all input has been processed. There is a matching BEGIN block that runs before any input.
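Combining these pieces: since NR holds the number of records seen, dividing the accumulated sum by it in the END block yields the mean of a column. A sketch:

```shell
# Four sample values, one per line.
printf '10\n20\n30\n40\n' > numbers.txt

# sum accumulates per line; END runs once, after the last line.
awk '{sum += $1} END {print sum / NR}' numbers.txt
# 25
```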

Cutting, Translating, and Counting

cut

cut extracts columns by position or delimiter:

cut -d: -f1 /etc/passwd         # first colon-separated field: usernames
cut -c1-10 file.txt             # characters 1 through 10 of each line

tr

tr translates or deletes characters:

echo "hello" | tr a-z A-Z       # HELLO
tr -d '\r' < dos.txt > unix.txt # delete carriage returns
tr -s ' '                       # squeeze repeated spaces into one

tr does not understand regular expressions — it only knows character sets — but for simple jobs it is blazingly fast.
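A common pairing: tr -s squeezes unpredictable runs of spaces down to single spaces, so that cut -d' ' can split the fields reliably. A sketch:

```shell
# The columns are separated by irregular runs of spaces.
echo "alice    42   admin" | tr -s ' ' | cut -d' ' -f2
# 42
```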

sort and uniq

sort sorts lines alphabetically by default; -n sorts numerically; -r reverses; -k N sorts by the Nth field; -u removes duplicates; -h understands human-readable sizes (like 4K, 2M).

sort file.txt
sort -n numbers.txt
sort -k2 -t: data.txt           # sort by field 2, delimiter :

uniq removes consecutive duplicate lines, which is why it is almost always paired with sort:

sort file.txt | uniq
sort file.txt | uniq -c         # prepend counts
sort file.txt | uniq -d         # only lines that are duplicated

wc

wc counts:

wc file.txt          # lines, words, characters
wc -l file.txt       # lines only
wc -w file.txt       # words only
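wc -l is the standard way to count things at the end of a pipeline. For a single file this is equivalent to grep -c, but unlike -c it works after any sequence of filters. A sketch:

```shell
# A three-line sample with two matching lines.
printf 'error: one\nok\nerror: two\n' > sample.txt

grep "error" sample.txt | wc -l
# 2
```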

diff and comm

To compare files:

diff file1 file2
# < line from file1
# > line from file2

diff -u file1 file2              # unified format, as used by patches

comm compares two sorted files line by line:

comm file1 file2
# column 1: lines only in file1
# column 2: lines only in file2
# column 3: lines in both
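Because comm insists on sorted input, unsorted files need a sorting pass first (in bash, process substitution such as comm <(sort a) <(sort b) does this inline). The -1, -2, and -3 flags suppress the corresponding output columns, which turns comm into a set-difference and set-intersection tool. A sketch:

```shell
# Two small unsorted samples.
printf 'banana\napple\ncherry\n' > a.txt
printf 'cherry\ndate\napple\n'   > b.txt
sort a.txt > a.sorted
sort b.txt > b.sorted

# Suppress columns 2 and 3: lines only in the first file.
comm -23 a.sorted b.sorted
# banana

# Suppress columns 1 and 2: lines common to both.
comm -12 a.sorted b.sorted
# apple
# cherry
```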

paste and join

paste pastes files side by side:

paste names.txt ages.txt
# Alice   30
# Bob     25

join is a SQL-style join on sorted files with a common field:

join -t: -1 1 -2 1 users.txt shells.txt
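A worked sketch with two tiny made-up files, both sorted on the join field: join prints one merged line per matching key.

```shell
# users.txt maps username to full name; shells.txt maps username to shell.
printf 'alice:Alice Adams\nbob:Bob Brown\n' > users.txt
printf 'alice:/bin/bash\nbob:/bin/zsh\n'    > shells.txt

# Join on field 1 of each file, with : as the delimiter.
join -t: -1 1 -2 1 users.txt shells.txt
# alice:Alice Adams:/bin/bash
# bob:Bob Brown:/bin/zsh
```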

find and locate

find recursively walks a directory tree and matches files by name, type, size, age, permissions, and much more:

find /var/log -name "*.log"                 # by name
find . -type f -mtime -1                    # files modified in the last day
find /home -size +100M                      # files bigger than 100 MB
find /tmp -type f -atime +30 -delete         # delete files not accessed in 30+ days
find . -name "*.py" -exec grep -l "TODO" {} \;   # run grep on each match

find is the Swiss army knife of file hunting. It is also one of the most awkwardly designed Unix commands (its arguments form a small expression language of tests and actions with quirky syntax), but it is worth learning well because no other tool has the same reach.
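One refinement to the -exec example above: ending it with + instead of \; passes many filenames to a single grep invocation rather than forking grep once per file, which is markedly faster on large trees. Piping into xargs achieves the same batching; the -print0 and -0 pair uses NUL separators so filenames containing spaces survive. A sketch on a tiny made-up tree:

```shell
# A tiny sample tree (made-up files).
mkdir -p src
printf '# TODO: fix this\n' > src/a.py
printf 'all done\n'         > src/b.py

# + batches many filenames into one grep invocation.
find src -name "*.py" -exec grep -l "TODO" {} +
# src/a.py

# The xargs equivalent; NUL separators make spaces in names safe.
find src -name "*.py" -print0 | xargs -0 grep -l "TODO"
# src/a.py
```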

locate, by contrast, is an index-based lookup. It searches a prebuilt database of every filename on the system, refreshed periodically (typically once a day) by the updatedb command, so queries return instantly:

locate resolv.conf
# /etc/resolv.conf
# /run/systemd/resolve/resolv.conf

Use find when you need precision or freshness. Use locate when you need speed and do not mind that the index may be hours out of date.

Modern Alternatives

The venerable grep and find have had a decade of serious challengers, and if you do a lot of code searching it is worth knowing about them.

ripgrep (the binary is called rg) is the one to reach for first. It is dramatically faster than grep on any non-trivial tree — often ten or twenty times faster — because it uses parallel walking, a highly optimised regex engine, and sensible defaults. It skips binary files and directories listed in .gitignore by default, which is exactly what you want for searching source code and exactly not what classic grep does. Install it with your package manager and use it like this:

rg "TODO"                      # recursive from current directory
rg -i "error" /var/log         # case-insensitive
rg -t py "def foo"             # only Python files
rg -l "import numpy"           # just list filenames with matches

ag (the_silver_searcher) is an older project in the same spirit, predating ripgrep by a few years. If you already have it in your muscle memory, carry on; if you are learning fresh, ripgrep is the better default.

fd is to find what ripgrep is to grep: faster, simpler syntax, .gitignore-aware by default. fd README finds every README under the current directory; fd -e py lists every Python file; fd -t d test lists every directory matching "test". The traditional find is still useful for its expressive power and for its ubiquity (it is on every Unix system since the 1970s), but for interactive code searching fd is more pleasant.

None of these replace grep and find — the classics are still everywhere, their flags are in everyone's fingers, and any shell script aimed at portability should stick to them. But on your own machine, for your own searches, the modern alternatives are a quiet, enormous productivity boost.

A Text-Processing Tour de Force

Let us close with a real example. Suppose you have a web server access log and you want to know the top ten IP addresses by number of requests. One line:

awk '{print $1}' access.log | sort | uniq -c | sort -rn | head

awk extracts the first column (the IP), sort groups them, uniq -c counts, sort -rn orders by count descending, head takes the top ten. Five commands, one line, and it would run on a gigabyte log file in seconds. This is the Unix way: small tools, sharp edges, composed precisely to cut exactly the shape you want.
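To watch it work, here is the same pipeline run on a six-line toy log (made-up data):

```shell
# Made-up access log: the first column is the client IP.
printf '%s\n' \
  '1.2.3.4 GET /'  '5.6.7.8 GET /a' '1.2.3.4 GET /b' \
  '1.2.3.4 GET /c' '5.6.7.8 GET /d' '9.9.9.9 GET /e' > access.log

awk '{print $1}' access.log | sort | uniq -c | sort -rn | head
# (counts are left-padded by uniq -c)
#   3 1.2.3.4
#   2 5.6.7.8
#   1 9.9.9.9
```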