Frequently Asked Question
What is NUMA and when does it start to matter?
Non-Uniform Memory Access is the architecture of every modern multi-socket
server: each CPU socket has its own memory controllers and DIMMs, and
access to "local" memory (attached to your socket) is faster than "remote"
memory (attached to another socket, fetched over the interconnect). The
kernel models the topology as a set of nodes; numactl --hardware shows
them, and /sys/devices/system/node/ exposes per-node memory and CPU
information. By default the page allocator prefers to allocate from the
node the requesting task is running on, and when automatic NUMA balancing
is enabled (the kernel.numa_balancing sysctl) the kernel also tries to keep
tasks and their memory on the same node. Even so, a long-running process
whose threads bounce between sockets can end up with most of its memory
remote.
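
As a rough illustration of the sysfs layout mentioned above, the sketch
below lists which CPUs belong to which node. It assumes each node directory
contains a cpulist file in the kernel's usual range format (for example
"0-3,8-11"); the function names parse_cpulist and read_numa_topology are
made up for this example, not part of any library.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a cpulist string such as "0-3,8-11" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_numa_topology(base="/sys/devices/system/node"):
    """Return {node_id: [cpu, ...]}; empty dict if no node directories exist."""
    topology = {}
    for path in glob.glob(os.path.join(base, "node[0-9]*")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(path))
        if not match:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as f:
                topology[int(match.group(1))] = parse_cpulist(f.read())
        except OSError:
            continue  # node vanished or file unreadable; skip it
    return topology


if __name__ == "__main__":
    for node, cpus in sorted(read_numa_topology().items()):
        print(f"node {node}: cpus {cpus}")
```

On a machine where this prints a single node, the rest of this answer is
unlikely to apply.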
For single-socket desktop and small-server workloads it usually does not
matter. It starts to matter on machines with two or more sockets running
latency-sensitive or
bandwidth-hungry services: databases, in-memory caches, ML training,
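packet processing. The usual cure is to bind a process to one node with
numactl --cpunodebind=0 --membind=0 ./myprog, or to confine it with cgroup
v2's cpuset.cpus and cpuset.mems. numastat shows per-node allocation
counters: numa_miss counts pages allocated on a node when the task
preferred a different one, and numa_foreign counts pages intended for a
node but allocated elsewhere. These are allocation counters, not access
counters, but a steadily growing numa_miss or numa_foreign means memory is
landing off the preferred node, and remote accesses usually follow.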
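
To watch those counters over time without eyeballing numastat output, one
option is to sample the per-node file under /sys/devices/system/node/ twice
and compare. The sketch below assumes that file is named numastat and holds
one "name value" pair per line; the helper names parse_node_numastat and
miss_fraction are invented for this example.

```python
def parse_node_numastat(text):
    """Parse "name value" lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def miss_fraction(before, after):
    """Share of pages allocated on this node between two samples that
    were numa_miss (the task preferred a different node)."""
    hits = after.get("numa_hit", 0) - before.get("numa_hit", 0)
    misses = after.get("numa_miss", 0) - before.get("numa_miss", 0)
    total = hits + misses
    return misses / total if total > 0 else 0.0


if __name__ == "__main__":
    import time

    path = "/sys/devices/system/node/node0/numastat"  # assumed location
    with open(path) as f:
        first = parse_node_numastat(f.read())
    time.sleep(5)
    with open(path) as f:
        second = parse_node_numastat(f.read())
    print(f"node0 miss fraction over 5s: {miss_fraction(first, second):.3f}")
```

A fraction that stays near zero suggests allocations are landing where the
tasks want them; what counts as "too high" depends on the workload.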