Performance and Observability: What is NUMA and when does it start to matter?

Dr Chris Paton

Frequently Asked Question

What is NUMA and when does it start to matter?

Non-Uniform Memory Access (NUMA) is the memory architecture of every modern multi-socket server: each CPU socket has its own memory controllers and DIMMs, and access to "local" memory (attached to the socket a thread is running on) is faster than access to "remote" memory (attached to another socket and fetched over the interconnect). The kernel models the topology as a set of nodes; numactl --hardware shows them along with the relative distances between them, and /sys/devices/system/node/ exposes per-node memory and CPU information. By default the page allocator prefers to allocate from the node of the CPU the thread is running on, and the scheduler tries to keep tasks close to their memory (where automatic NUMA balancing is enabled, the kernel may also migrate tasks or pages to bring them back together). Even so, a long-running process whose threads bounce between sockets can end up with most of its memory remote.
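
To make the sysfs layout concrete, here is a minimal Python sketch that lists each node and its CPUs. It assumes a Linux system where every nodeN directory under /sys/devices/system/node/ contains a cpulist file in the usual "0-3,8-11" range format; the function names parse_cpulist and read_topology are invented for this illustration and are not part of any library.

```python
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means a node with memory but no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_topology(root=NODE_ROOT):
    """Return {node_id: [cpu, ...]} for every nodeN directory under root."""
    topology = {}
    for node_dir in root.glob("node[0-9]*"):
        node_id = int(node_dir.name[len("node"):])
        topology[node_id] = parse_cpulist((node_dir / "cpulist").read_text())
    # Sort numerically so node10 comes after node2.
    return dict(sorted(topology.items()))


if __name__ == "__main__":
    if NODE_ROOT.is_dir():
        for node_id, cpus in read_topology().items():
            print(f"node {node_id}: {len(cpus)} CPUs {cpus}")
    else:
        print("no NUMA sysfs tree found (not Linux, or NUMA support not built in)")
```

On a machine that reports a single node, everything is local and the rest of this answer is mostly academic; two or more nodes is the case where placement starts to matter.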

For desktops and single-socket servers it usually does not matter, because such machines typically expose a single node. It starts to matter on machines with two or more sockets running latency-sensitive or bandwidth-hungry services: databases, in-memory caches, ML training, packet processing. The cure is to pin a process and its memory to one node with numactl --cpunodebind=0 --membind=0 ./myprog, or to let cgroup v2's cpuset.cpus and cpuset.mems do it. numastat shows per-node allocation counters; a growing numa_miss or numa_foreign count tells you that pages are being allocated on a node other than the one that was preferred, which means remote accesses will follow.
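
Because what matters is whether the counters are growing, not their absolute values, it helps to compare two samples. The sketch below does that in Python by reading the per-node numastat files under /sys/devices/system/node/, which carry the same counters that numastat prints. The helper names (parse_numastat, snapshot, growth) and the one-second sampling window are assumptions made for this example.

```python
import time
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")


def parse_numastat(text):
    """Parse 'name value' lines from a per-node numastat file into a dict."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def snapshot(root=NODE_ROOT):
    """Return {node_name: counters} for every node that has a numastat file."""
    return {
        path.parent.name: parse_numastat(path.read_text())
        for path in root.glob("node[0-9]*/numastat")
    }


def growth(before, after, keys=("numa_hit", "numa_miss", "numa_foreign")):
    """Per-node increase of the selected counters between two snapshots."""
    return {
        node: {k: after[node].get(k, 0) - before[node].get(k, 0) for k in keys}
        for node in after
        if node in before
    }


if __name__ == "__main__":
    INTERVAL_SECONDS = 1.0  # arbitrary sampling window, chosen for illustration
    first = snapshot()
    if not first:
        print("no per-node numastat files found")
    else:
        time.sleep(INTERVAL_SECONDS)
        for node, diff in sorted(growth(first, snapshot()).items()):
            print(node, diff)
```

If numa_miss or numa_foreign keeps increasing while the service is under load, that is a hint to look at pinning; if only numa_hit moves, allocations are landing where they were requested.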


Further reading and video