Frequently Asked Question
What does the load average actually mean, and why is it so often misleading?
The three numbers from uptime are the exponentially-weighted moving averages
of the number of tasks in the run queue or in uninterruptible sleep (state D)
over the past 1, 5, and 15 minutes. On a single-CPU machine, 1.0 means the CPU
was constantly busy with one runnable task on average. On an eight-core machine,
anything up to 8.0 just means full utilisation; only sustained values above the
core count indicate contention. The 15-minute number, lagging behind the others,
tells you whether you are trending up or recovering.
It is misleading because it conflates CPU demand with I/O demand. A server stuck
waiting on a frozen NFS mount can show a huge load while the CPUs sit idle, and
a CPU-pinned scientific job at full tilt can show a perfectly modest 1.0. Brendan
Gregg's investigation of this in 2017 traced the uninterruptible-sleep behaviour
back to a 1993 patch from Matthias Urlichs. Treat the load average as a
thirty-second triage signal, then move quickly to top, vmstat, and iostat
to find out what the queueing actually consists of.