Text Processing: Why does uniq need sort? Why can't it find duplicates on its own?

Dr Chris Paton

Frequently Asked Question

Why does uniq need sort? Why can't it find duplicates on its own?

uniq removes only adjacent duplicate lines. It walks the input once, keeping a single line of state (the previous line), and emits each line only if it differs from its predecessor. That is why a file whose four lines are a, b, a, b passes through uniq unchanged: the duplicates exist but are not adjacent.
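The one-pass behaviour can be sketched in a few lines of awk. This is only an illustration of the idea, not how any real uniq is implemented, and the function name adjacent_dedupe is a hypothetical helper invented for this example.

```shell
# Hypothetical helper: mimic plain uniq by remembering only the previous line.
adjacent_dedupe() {
    # Print the first line, then any line that differs from the one before it.
    awk 'NR == 1 || $0 != prev { print } { prev = $0 }'
}

# Duplicates that are not adjacent survive, just as they do with uniq.
printf '%s\n' a b a b | adjacent_dedupe   # prints a, b, a, b

# Adjacent duplicates collapse to one line each.
printf '%s\n' a a b b | adjacent_dedupe   # prints a, b
```

Because the only state is the previous line, memory use stays constant however large the input is, which is the trade-off behind leaving the grouping to sort.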

The fix is to sort the input first, which brings every duplicate next to its siblings: sort file | uniq gives the true set of unique lines. The pairing is so common that sort -u exists as a shortcut for it. It also enables counting: sort file | uniq -c prefixes each line with the number of times it appeared, and piping that through sort -rn orders the result by frequency, highest first. The classic one-liner for the ten most frequent client addresses in a web log, awk '{print $1}' access.log | sort | uniq -c | sort -rn | head, illustrates the pattern. It assumes the address is the first whitespace-separated field of each log line, and it relies on head printing ten lines by default.
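The pipelines above can be tried on a small made-up input. In this sketch the sample_log and top_ips functions, the addresses, and the request paths are all hypothetical, chosen only so the counts are easy to check by eye.

```shell
# Hypothetical sample: three invented log lines, address in the first field.
sample_log() {
    printf '%s\n' \
        '10.0.0.1 GET /index.html' \
        '10.0.0.2 GET /about.html' \
        '10.0.0.1 GET /contact.html'
}

# Count requests per address, most frequent first, at most ten lines.
top_ips() {
    awk '{print $1}' | sort | uniq -c | sort -rn | head
}

sample_log | awk '{print $1}' | sort | uniq   # 10.0.0.1, then 10.0.0.2
sample_log | awk '{print $1}' | sort -u       # the same two lines
sample_log | top_ips                          # count 2 for 10.0.0.1, then count 1 for 10.0.0.2
```

The amount of leading whitespace that uniq -c puts before each count differs between implementations, so scripts that parse its output are safer splitting on whitespace than relying on fixed columns.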
