Containers and Virtualisation: What are capabilities and seccomp, and how do containers use them?

Dr Chris Paton

Frequently Asked Question

What are capabilities and seccomp, and how do containers use them?

Historically, Unix's privilege model was binary: either you were root (UID 0, able to do anything) or you were not. Linux capabilities split root's powers into about forty smaller pieces, such as CAP_NET_ADMIN to configure network interfaces, CAP_SYS_ADMIN for mounting and many other administrative operations, and CAP_NET_BIND_SERVICE to bind ports below 1024, so that a process can hold just the powers it needs. A container runtime takes advantage of this by starting containers with a heavily reduced capability set: by default Docker keeps only 14 capabilities, dropping roughly two-thirds of what root would normally have, and hardened deployments drop all of them and re-add only what the workload actually needs (--cap-drop=ALL --cap-add=NET_BIND_SERVICE).
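To make the idea of a capability set concrete, here is a minimal Python sketch that decodes a capability bitmask (the format shown on the CapEff line of /proc/<pid>/status) into names. The name table follows the bit order in linux/capability.h. The constant DOCKER_DEFAULT_MASK is an assumption: it is the value a default Docker container typically reports on recent versions, and it may differ on your system. The function names are illustrative only.

```python
# Capability names indexed by bit number, in the order of <linux/capability.h>.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]

# Assumption: effective mask typically seen inside a default Docker container.
DOCKER_DEFAULT_MASK = 0x00000000A80425FB


def decode_caps(mask):
    """Return the names of the capabilities whose bits are set in mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def current_effective_mask(status_path="/proc/self/status"):
    """Read CapEff for this process; return None where /proc is unavailable."""
    try:
        with open(status_path) as status:
            for line in status:
                if line.startswith("CapEff:"):
                    return int(line.split()[1], 16)
    except OSError:
        pass
    return None


if __name__ == "__main__":
    kept = decode_caps(DOCKER_DEFAULT_MASK)
    print(len(kept), "capabilities kept:", ", ".join(kept))
    mask = current_effective_mask()
    if mask is not None:
        print("this process holds:", ", ".join(decode_caps(mask)))
```

Running it inside and outside a container is a quick way to see the difference: a root shell on the host normally decodes to the full list, while a default container decodes to a much shorter one.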

seccomp-bpf is a parallel mechanism that filters system calls rather than capabilities. A seccomp profile is compiled into a small BPF program that the kernel runs on every syscall the process attempts; for each one it can allow the call, fail it with an error code, or kill the process. Docker and Podman both ship a default seccomp profile that blocks around 50 of the ~350 Linux syscalls, mostly obscure or dangerous ones such as keyctl, kexec_load, clone when called with namespace-creating flags, and the namespace setup syscalls unshare and setns. Together, capability dropping and seccomp filtering substantially shrink the kernel attack surface a container can reach, which is much of why default-config containers have been as resilient as they have.
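The allow, deny, or kill decision can be illustrated with a small user-space model in Python. This is only a sketch: the real filter is a BPF program evaluated inside the kernel, and real profiles can also match on syscall arguments and on the capabilities a container holds, which is ignored here. The profile below is a hypothetical, heavily trimmed example written in the JSON layout that Docker accepts (defaultAction plus a list of syscalls rules); the function name action_for is illustrative.

```python
import json

# Hypothetical, heavily trimmed profile. Anything not listed falls through to
# defaultAction, so this is an allow-list: unknown syscalls fail with an error.
PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": ["read", "write", "openat", "close", "exit_group"],
            "action": "SCMP_ACT_ALLOW",
        },
        {
            "names": ["kexec_load"],
            "action": "SCMP_ACT_KILL",
        },
    ],
}


def action_for(profile, syscall):
    """Return the action this simplified model assigns to a syscall name.

    The first rule naming the syscall wins; the rules in this sketch never
    overlap, so ordering does not matter here.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]


if __name__ == "__main__":
    for name in ("read", "keyctl", "kexec_load"):
        print(name, "->", action_for(PROFILE, name))
    # Serialise the profile so it could be saved to a file such as profile.json.
    print(json.dumps(PROFILE, indent=2))
```

A profile saved in this layout can be supplied to a container with docker run --security-opt seccomp=profile.json, although a profile this small would stop almost any real program from starting.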

