Text Processing: What are UTF-8 and ASCII, and why does it matter for shell work?

Dr Chris Paton

Frequently Asked Question

What are UTF-8 and ASCII, and why does it matter for shell work?

ASCII is the original 7-bit character encoding from 1963: 128 code points covering English letters, digits, punctuation, and 33 control characters. It fits in one byte with a spare bit. UTF-8 is a variable-length encoding for Unicode that uses one byte for the original ASCII characters and two to four bytes for everything else (Greek, Cyrillic, Chinese, emoji). Crucially, the ASCII range is encoded identically in both, so a pure-ASCII file is also a valid UTF-8 file. That backward compatibility is what made UTF-8 the dominant encoding on Linux and on the web.
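The variable length is easy to see by counting bytes. The following is a minimal sketch, not a definitive recipe: the four sample characters and the variable names are illustrative choices, and the characters are written as octal escapes so the result does not depend on how your terminal or editor encodes them.

```shell
# Count the bytes UTF-8 uses for four sample characters.
# wc -c always counts bytes, whatever the locale.
# tr -d ' ' strips the padding some wc implementations add.
ascii_bytes=$(printf 'A' | wc -c | tr -d ' ')                 # U+0041 LATIN CAPITAL LETTER A
greek_bytes=$(printf '\316\261' | wc -c | tr -d ' ')          # U+03B1 GREEK SMALL LETTER ALPHA
cjk_bytes=$(printf '\344\270\255' | wc -c | tr -d ' ')        # U+4E2D (Chinese "middle")
emoji_bytes=$(printf '\360\237\230\200' | wc -c | tr -d ' ')  # U+1F600 GRINNING FACE

echo "A=$ascii_bytes alpha=$greek_bytes cjk=$cjk_bytes emoji=$emoji_bytes"
# prints: A=1 alpha=2 cjk=3 emoji=4
```

The ASCII letter takes a single byte, exactly as it would in an ASCII file, while the other three take two, three, and four bytes respectively.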

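For shell work the practical implications are as follows. Tools that work in bytes will miscount or split characters in multi-byte UTF-8 text: wc -c reports bytes, classic tr translates byte by byte, and cut -c in GNU coreutils has long been documented as behaving the same as cut -b (--characters is just the long spelling of -c, not a character-aware mode). Where a character-aware alternative exists, use it: wc -m counts characters according to the current locale, and an awk with multibyte support, such as gawk, can take substrings by character. Set LANG=en_GB.UTF-8 (or en_US.UTF-8) so utilities know to treat input as UTF-8; the locale must be installed (check with locale -a), and LC_ALL, if set, overrides LANG. Files of unknown origin can be inspected with file thing.txt, which guesses the encoding, and re-encoded with iconv, for example iconv -f ISO-8859-1 -t UTF-8 thing.txt > thing.utf8.txt.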
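The difference between byte and character counts can be sketched on a one-word file. This is an illustrative example under stated assumptions: the temporary file names are arbitrary, and C.UTF-8 is assumed to be an installed locale (substitute any UTF-8 locale that locale -a lists on your system). If that locale is missing, wc -m falls back to counting bytes.

```shell
# Create a small UTF-8 sample: "café" plus a newline.
# \303\251 is the two-byte UTF-8 encoding of é (U+00E9).
sample=$(mktemp)
printf 'caf\303\251\n' > "$sample"

# Bytes: c, a, f, two bytes for é, newline = 6.
byte_count=$(wc -c < "$sample" | tr -d ' ')

# Characters: 5 when the (assumed) C.UTF-8 locale is available.
char_count=$(LC_ALL=C.UTF-8 wc -m < "$sample" | tr -d ' ')

# Ask file(1) for its guess at the encoding, if it is installed.
if command -v file >/dev/null 2>&1; then
    file "$sample"
fi

# Re-encode to ISO-8859-1 (Latin-1), where é is a single byte.
latin1=$(mktemp)
iconv -f UTF-8 -t ISO-8859-1 "$sample" > "$latin1"
latin1_bytes=$(wc -c < "$latin1" | tr -d ' ')

echo "utf8_bytes=$byte_count chars=$char_count latin1_bytes=$latin1_bytes"
rm -f "$sample" "$latin1"
```

The UTF-8 file is 6 bytes and the Latin-1 copy is 5, because é shrinks from two bytes to one. The same text therefore has different byte counts depending on the encoding, which is why byte-oriented tools cannot be trusted to count characters.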

