Frequently Asked Question
Why does uniq need sort? Why can't it find duplicates on its own?
uniq removes only consecutive duplicate lines. It walks the input once,
keeping a single line of state (the previous line), and emits each line only if
it differs from its predecessor. That is why a file whose four lines are a, b,
a, b passes through uniq unchanged: the duplicates exist but are not adjacent.
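
The adjacent-only comparison is easy to see on a four-line input. The sketch below also includes adjacent_uniq, a hypothetical helper name (not a standard tool) that imitates the one-line-of-state behaviour in awk. It is an illustration of the idea, not how uniq is actually implemented.

```shell
# Non-adjacent duplicates: uniq finds nothing to remove.
printf 'a\nb\na\nb\n' | uniq      # prints a, b, a, b

# Adjacent duplicates: each run collapses to a single line.
printf 'a\na\nb\nb\n' | uniq      # prints a, b

# Hypothetical helper: remember only the previous line and print the
# current line when it is the first one or differs from that line.
adjacent_uniq() {
  awk 'NR == 1 || $0 != prev { print } { prev = $0 }'
}

printf 'a\nb\na\nb\n' | adjacent_uniq   # prints a, b, a, b
```

Because the only state is the previous line, memory use stays constant no matter how long the input is, which is the trade-off behind requiring sorted input.
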
The fix is to sort the input first, which brings every duplicate next to its
siblings: sort file | uniq gives the true set of unique lines. The convention
is so universal that sort -u exists as a shortcut. The pairing also enables
counting: sort file | uniq -c prefixes each line with the number of times it
appeared, and piping that through sort -rn orders the result by descending
count. The classic one-liner for the top ten IP addresses in a web log,
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head, illustrates
the pattern; head prints ten lines by default, which is where the "top ten"
comes from.
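
To try the counting pipeline without a real log, it can be run on a few made-up lines. Everything in this sketch is an assumption for illustration: the sample data, the variable name log_lines, and the function name top_ips. It also assumes the client IP is the first whitespace-separated field, as in the one-liner above.

```shell
# Hypothetical sample data: the first field of each line is the client IP.
log_lines='10.0.0.1 GET /
10.0.0.2 GET /about
10.0.0.1 GET /news
10.0.0.3 GET /
10.0.0.1 GET /contact
10.0.0.2 GET /'

# Hypothetical wrapper around the pipeline: extract the first field,
# sort so equal values become adjacent, count each run, order by count
# (highest first), and keep the first ten lines.
top_ips() {
  awk '{print $1}' | sort | uniq -c | sort -rn | head
}

printf '%s\n' "$log_lines" | top_ips

# sort -u gives the same set of lines as sort | uniq, without counts.
printf '%s\n' "$log_lines" | awk '{print $1}' | sort -u
```

With this input, 10.0.0.1 is listed first with a count of 3, then 10.0.0.2 with 2, then 10.0.0.3 with 1. The amount of whitespace before each count varies between uniq implementations.
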