Seccomp
Basic Information
Seccomp, short for Secure Computing mode, is a security feature of the Linux kernel designed to filter system calls. In its original form it restricts a process to a limited set of system calls: exit(), sigreturn(), read(), and write() on already-open file descriptors. If the process attempts any other call, the kernel terminates it with SIGKILL or SIGSYS. This mechanism does not virtualize resources; it isolates the process from them.
There are two ways to activate seccomp: through the prctl(2) system call with PR_SET_SECCOMP, or, on Linux kernels 3.17 and above, through the seccomp(2) system call. The older method of enabling seccomp by writing to /proc/self/seccomp has been deprecated in favor of prctl().
An enhancement, seccomp-bpf, adds the capability to filter system calls with a customizable policy, using Berkeley Packet Filter (BPF) rules. This extension is leveraged by software such as OpenSSH, vsftpd, and the Chrome/Chromium browsers on Chrome OS and Linux for flexible and efficient syscall filtering, offering an alternative to the now unsupported systrace for Linux.
Original/Strict Mode
In this mode seccomp only allows the syscalls exit(), sigreturn(), read(), and write() on already-open file descriptors. If any other syscall is made, the process is killed with SIGKILL.
Seccomp-bpf
This mode allows filtering of system calls using a configurable policy implemented using Berkeley Packet Filter rules.
Seccomp in Docker
Docker supports seccomp-bpf to restrict the syscalls available to containers, which effectively reduces the attack surface. The syscalls blocked by default are listed at https://docs.docker.com/engine/security/seccomp/ and the default seccomp profile can be found at https://github.com/moby/moby/blob/master/profiles/seccomp/default.json. You can run a Docker container with a different seccomp policy by passing a profile file to docker run through the --security-opt seccomp= option.
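Such an invocation can be sketched as follows. This is only an illustration: the image name (alpine), the profile path (/tmp/profile.json), and the helper name are assumptions chosen for the example. The only part taken from Docker itself is the --security-opt seccomp=<value> option.

```python
import shlex


def docker_run_with_seccomp(image, profile_path, command=()):
    """Build the argv for `docker run` using a custom seccomp profile file.

    Hypothetical helper: it only assembles the command line and does not
    invoke Docker.
    """
    return [
        "docker", "run", "--rm", "-it",
        "--security-opt", f"seccomp={profile_path}",
        image, *command,
    ]


# Hypothetical image and profile path, for illustration only.
argv = docker_run_with_seccomp("alpine", "/tmp/profile.json", ["sh"])
print(shlex.join(argv))
```

The list can be passed to subprocess.run() as-is, which avoids shell quoting issues with the profile path.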
For example, to forbid a container from executing a syscall such as uname, you could download the default profile from https://github.com/moby/moby/blob/master/profiles/seccomp/default.json and remove the uname string from the list.
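The edit can also be scripted. The sketch below assumes the profile layout used by default.json: a top-level "syscalls" list whose entries carry a "names" list and an "action". The function name and the tiny sample profile are made up for the example; in practice the dictionary would come from json.load() on the downloaded file.

```python
import copy
import json


def forbid_syscall(profile, syscall):
    """Return a copy of `profile` with `syscall` removed from every allow rule."""
    result = copy.deepcopy(profile)
    for rule in result.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            rule["names"] = [n for n in rule.get("names", []) if n != syscall]
    return result


# Tiny stand-in for the real default.json (assumption, not the actual file).
sample = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "uname"], "action": "SCMP_ACT_ALLOW"},
    ],
}

restricted = forbid_syscall(sample, "uname")
print(json.dumps(restricted, indent=2))
```

Because the default action of that profile denies anything not listed, removing the name is enough to block the call.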
If you want to make sure that some binary doesn't work inside a Docker container, you could use strace to list the syscalls the binary uses and then forbid them.
For example, running the uname binary under strace reveals the syscalls it makes.
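Collecting the syscall names from a trace can be automated. The sketch below assumes plain strace text output, one call per line, optionally prefixed with "[pid N]". The sample trace is invented for illustration and is not real output from any system.

```python
import re

# Syscall name at the start of an strace line, optionally preceded by the
# "[pid N]" prefix that strace prints when following child processes.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(output):
    """Return the sorted, de-duplicated syscall names found in strace output."""
    names = set()
    for line in output.splitlines():
        match = SYSCALL_RE.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


# Invented sample trace (assumption), shaped like strace output for uname.
sample_output = """\
execve("/bin/uname", ["uname"], 0x7ffc0 /* 10 vars */) = 0
brk(NULL) = 0x5580
uname({sysname="Linux", nodename="host", ...}) = 0
write(1, "Linux\\n", 6) = 6
exit_group(0) = ?
+++ exited with 0 +++
"""

print(syscalls_from_strace(sample_output))
```

Lines that do not start with a syscall name, such as the final exit status line, are ignored.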
If you are using Docker just to launch an application, you can profile it with strace and allow only the syscalls it needs.
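An allow-only profile of that kind can be generated from the observed syscall names. This is a minimal sketch: the function name and the input list are assumptions, and a real profile may need additional syscalls that the container runtime itself uses while starting the process.

```python
import json


def allowlist_profile(syscalls):
    """Build a profile that denies everything except the given syscalls."""
    return {
        # Anything not listed below fails with an error.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(set(syscalls)), "action": "SCMP_ACT_ALLOW"},
        ],
    }


# Hypothetical list of syscalls observed with strace.
profile = allowlist_profile(["read", "write", "exit_group", "read"])
print(json.dumps(profile, indent=2))
```

The printed JSON can be saved to a file and passed to Docker through --security-opt seccomp=<file>.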
Example Seccomp policy
To illustrate the feature, let's create a seccomp profile that disables the chmod system call.
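A profile of this shape can be generated with a few lines of Python. The sketch assumes the "names"/"action" rule layout used by Docker's default profile; the helper name is made up for the example.

```python
import json


def deny_profile(denied):
    """Build a profile that allows everything except the given syscalls."""
    return {
        # Every syscall is allowed unless a rule below says otherwise.
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            # Listed syscalls fail with an error instead of executing.
            {"names": list(denied), "action": "SCMP_ACT_ERRNO"},
        ],
    }


profile = deny_profile(["chmod"])
print(json.dumps(profile, indent=2))
```

Depending on the libc and the tool in use, a chmod command may go through a related syscall such as fchmodat rather than chmod, in which case that name would need to be listed as well.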
In this profile, the default action is set to "allow" and a blacklist disables chmod. To be more secure, we can set the default action to deny and create a whitelist that selectively enables system calls. When a container runs with this profile, the chmod call returns an error because it is disabled in the seccomp profile.
Running docker inspect on the container displays the profile among its security options.
Deactivate it in Docker
Launch a container with the flag: --security-opt seccomp=unconfined
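In Kubernetes, seccomp support became stable in version 1.19, where a profile is selected through the securityContext.seccompProfile field. Pods are not confined automatically: unless a profile is specified, or the kubelet is configured to apply one by default (the --seccomp-default option, available in later releases), they run unconfined. The "RuntimeDefault" profile is the one provided by the container runtime (e.g., Docker, containerd); it allows most system calls while blocking a few that are considered dangerous or not generally required by containers.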
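A Pod can request the "RuntimeDefault" profile explicitly through its security context. The sketch below builds such a manifest as JSON, which kubectl accepts alongside YAML. The Pod name, image, and helper name are assumptions for the example.

```python
import json


def pod_with_runtime_default(name, image):
    """Build a Pod manifest that requests the RuntimeDefault seccomp profile."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pod-level setting: applies to every container in the Pod.
            "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
            "containers": [{"name": name, "image": image}],
        },
    }


# Hypothetical Pod name and image.
manifest = pod_with_runtime_default("demo", "alpine")
print(json.dumps(manifest, indent=2))
```

Setting the type to "Unconfined" instead would disable seccomp filtering for the Pod.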