Frequently Asked Question
What are capabilities and seccomp, and how do containers use them?
Historically, Unix's privilege model was binary: either you were root (UID 0, able to do
anything) or you were not. Linux capabilities split root's powers into
about forty smaller pieces: CAP_NET_ADMIN to configure network interfaces,
CAP_SYS_ADMIN for a broad catch-all of administrative operations (mount among
them), CAP_NET_BIND_SERVICE to bind ports below 1024, and so on. A process can
then hold just the powers it needs. A
container runtime takes advantage of this by starting containers with a heavily
reduced capability set: by default Docker keeps only 14 capabilities, dropping
roughly two-thirds of what root would normally have, and hardened deployments
drop everything and re-add only what the workload actually needs (for example,
--cap-drop=ALL --cap-add=NET_BIND_SERVICE).
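
To see which capabilities a process actually holds, you can decode the CapEff bitmask that Linux exposes in /proc/<pid>/status. The following is a minimal sketch, not an official tool: the function names decode_caps and effective_caps are made up for illustration, and the name table reflects capability numbering in <linux/capability.h> as of recent kernels (newer kernels may add bits, which are reported as cap_<n>).

```python
# Capability names indexed by bit number, following <linux/capability.h>.
CAP_NAMES = (
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG",
    "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL",
    "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG",
    "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON",
    "CAP_BPF", "CAP_CHECKPOINT_RESTORE",
)


def decode_caps(mask_hex):
    """Turn a hex capability mask (as printed in /proc) into a list of names."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask:
        if mask & 1:
            # Bits beyond the table are reported generically rather than guessed.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"cap_{bit}")
        mask >>= 1
        bit += 1
    return names


def effective_caps(status_path="/proc/self/status"):
    """Read the CapEff line for a process and decode it (Linux only)."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return decode_caps(line.split()[1])
    return []
```

Calling effective_caps() inside a container and on the host makes the difference visible. As an illustration, the mask 00000000a80425fb, which is what a default non-privileged Docker container typically reports, decodes to 14 names and does not include CAP_SYS_ADMIN, while 0000000000000400 decodes to CAP_NET_BIND_SERVICE alone.
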
seccomp-bpf is a parallel mechanism that filters system calls, not
capabilities. A seccomp profile is a small BPF program that the kernel runs on every
syscall the process attempts; for each one it can allow the call, fail it with an
error, or kill the process. Docker and Podman both ship a default seccomp profile
that blocks roughly 40 to 50 of the 300-plus Linux syscalls, mostly obscure or
dangerous ones like keyctl, kexec_load, clone with namespace-creating flags, and
namespace setup syscalls such as unshare and setns. Together,
capability dropping and seccomp filtering substantially shrink the kernel attack
surface a container can reach, which is a large part of why default-config
containers have held up as well as they have.
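
As a rough illustration of what a seccomp profile looks like, the sketch below builds a small deny-list in the JSON layout that Docker and Podman accept (defaultAction plus a list of syscalls rules). The helpers build_profile and action_for are hypothetical names, and action_for is only a simplified model of the lookup: the kernel runs a compiled BPF filter, not this Python. Note also that the default profile Docker ships is written the other way round, as an allow-list with an error default, so treat this as a teaching example and not a replacement for it.

```python
import json


def build_profile(blocked, killed=()):
    """Build a deny-list seccomp profile: allow by default, with exceptions."""
    rules = []
    if blocked:
        # These syscalls fail with an error instead of running.
        rules.append({"names": sorted(blocked), "action": "SCMP_ACT_ERRNO"})
    if killed:
        # These syscalls terminate the offending process.
        rules.append({"names": sorted(killed), "action": "SCMP_ACT_KILL"})
    return {"defaultAction": "SCMP_ACT_ALLOW", "syscalls": rules}


def action_for(profile, syscall):
    """Simplified model: first matching rule wins, else the default action."""
    for rule in profile["syscalls"]:
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]


profile = build_profile(
    blocked=["keyctl", "unshare", "setns"],
    killed=["kexec_load"],
)
profile_json = json.dumps(profile, indent=2)
```

Writing profile_json to a file and passing it with docker run --security-opt seccomp=<file> applies it to a container. In this model, action_for(profile, "read") falls through to the allow default, while keyctl gets an error and kexec_load kills the process, matching the three outcomes described above.
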