Frequently Asked Question

Why does uniq need sort? Why can't it find duplicates on its own?

uniq removes only consecutive duplicate lines. It walks the input once, keeping a single line of state (the previous line), and emits each line only if it differs from its predecessor. That is why a file containing the four lines a, b, a, b passes through uniq unchanged: the duplicates exist but are not adjacent.
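The adjacency rule is easy to see on a four-line input. The following is a minimal sketch; letters.txt is a hypothetical scratch file created only for this demonstration.

```shell
# letters.txt is a hypothetical scratch file holding the lines a, b, a, b.
# No two equal lines are adjacent.
printf 'a\nb\na\nb\n' > letters.txt

# uniq compares each line only with the line before it,
# so all four lines come through unchanged.
uniq letters.txt

# Sorting first places equal lines next to each other,
# so uniq can collapse them to one copy each.
sort letters.txt | uniq
```

The first command prints all four lines; the second prints only a and b.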

The fix is to sort the input first, which brings every duplicate next to its siblings: sort file | uniq gives the true set of unique lines. The pairing is common enough that sort -u exists as a shortcut for it. The pairing also enables counting: sort file | uniq -c prefixes each distinct line with the number of times it appeared, and piping that through sort -rn orders the result from most to least frequent. The classic one-liner for the top ten IP addresses in a web log, awk '{print $1}' access.log | sort | uniq -c | sort -rn | head, illustrates the pattern: awk extracts the first field, sort groups equal addresses, uniq -c counts each group, sort -rn ranks the counts, and head keeps the first ten lines.
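To try the counting pipeline without a real log, here is a small sketch. sample.log and its contents are invented for illustration, and the only assumption about the format is that the client address is the first whitespace-separated field.

```shell
# sample.log is a made-up stand-in for a web access log;
# the first field of each line is the client IP address.
printf '%s\n' \
  '10.0.0.1 GET /' \
  '10.0.0.2 GET /about' \
  '10.0.0.1 GET /contact' \
  '10.0.0.3 GET /' \
  '10.0.0.1 GET /' \
  '10.0.0.2 GET /' > sample.log

# Extract the address, group equal addresses, count each group,
# rank by count (highest first), and keep at most ten lines.
awk '{print $1}' sample.log | sort | uniq -c | sort -rn | head

# For plain deduplication without counts, sort -u gives the same
# result as sort | uniq.
awk '{print $1}' sample.log | sort -u
```

The first pipeline ranks 10.0.0.1 (3 requests) ahead of 10.0.0.2 (2) and 10.0.0.3 (1). The width of the padding that uniq -c puts before each count differs between implementations, so scripts that parse the output should split on whitespace rather than rely on fixed columns.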

Further reading and video