eBPF Vulnerability (CVE-2017-16995): When the Doorman Becomes the Backdoor
Co-written by Nahman Khayet and Michael Cherny
eBPF Verifier Bypass Vulnerability
To put things in proportion, the media-frenzied Meltdown/Spectre vulnerabilities, which were very severe indeed, enabled reading from memory but not writing to it. While the application here is narrower (this isn’t a CPU vulnerability and doesn’t affect all OSs), the risk is much higher.
In this blog we’ll review what this vulnerability is about.
The extended/enhanced Berkeley Packet Filter (eBPF) is an in-kernel virtual machine that provides the ability to attach to almost any location in the kernel.
eBPF can be used for tracing and debugging of the kernel, filtering of network events, and security. The general mechanics are:
- User space loads a bytecode program into the kernel and specifies where that program should be attached
- The kernel runs a ‘verifier’ to make sure that the program is safe
- The kernel translates the bytecode into native code and attaches it to the requested location
The vulnerability, numbered CVE-2017-16995, is one of a set of security vulnerabilities in the eBPF verifier that were found by Jann Horn and fixed in this commit. It affects Linux kernel versions 4.4 to 4.14.
The irony of a vulnerability in the “verifier” is worth mentioning. The verifier is crucial: it is the only component that determines whether an eBPF program is safe. It checks that the eBPF program meets certain requirements:
- The number of bytecode instructions is limited
- There are no loops
- There are no unreachable instructions
- There are no out-of-bounds jumps
- Memory access is restricted to authorized areas only
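To make the structural checks above more concrete, here is a minimal sketch of a toy verifier in Python. It is not the kernel's algorithm: the instruction format (`mov`, `jmp`, `jeq`, `exit` tuples), the `verify` function and the `MAX_INSNS` value are all made up for illustration, and the memory-access check, which is the one that matters for this CVE, is deliberately left out.

```python
MAX_INSNS = 4096  # illustrative instruction cap for this toy model


def verify(prog):
    """Return a list of problems found in a toy program; empty means it passes.

    Hypothetical instruction format (not real eBPF):
      ("mov", reg, imm)          set a register
      ("jmp", off)               jump to pc + 1 + off
      ("jeq", reg, imm, off)     fall through, or jump to pc + 1 + off
      ("exit",)                  stop
    """
    n = len(prog)
    if n == 0:
        return ["empty program"]
    if n > MAX_INSNS:
        return ["too many instructions"]

    def successors(pc):
        op = prog[pc][0]
        if op == "exit":
            return []
        if op == "jmp":
            return [pc + 1 + prog[pc][1]]
        if op == "jeq":
            return [pc + 1, pc + 1 + prog[pc][3]]
        return [pc + 1]

    problems = []
    seen = set()
    stack = [0]
    while stack:
        pc = stack.pop()
        if pc in seen:
            continue
        seen.add(pc)
        for target in sorted(set(successors(pc))):
            if target < 0 or target >= n:
                problems.append(f"insn {pc}: control flow leaves the program")
            elif target <= pc:
                # Any edge that goes backwards (or to itself) could form a loop.
                problems.append(f"insn {pc}: backward jump (loop)")
            else:
                stack.append(target)

    # Anything the walk never visited is dead code.
    for pc in range(n):
        if pc not in seen:
            problems.append(f"insn {pc}: unreachable")
    return problems


good = [("mov", 0, 1), ("jeq", 0, 1, 1), ("mov", 0, 2), ("exit",)]
loop = [("mov", 0, 0), ("jmp", -2), ("exit",)]
out_of_bounds = [("jmp", 5), ("exit",)]
```

Calling `verify(good)` returns an empty list, while `loop` and `out_of_bounds` are rejected. Rejecting every backward edge is a simplification chosen for brevity; it is stricter than strictly necessary but keeps the sketch short.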
If the verifier passes a malicious program, it exposes the whole system to great risk. eBPF runs in the kernel, so what should have been safe execution turns into arbitrary code execution in the kernel. Specifically in this case, the verifier fails to verify access to memory, allowing reads and writes to arbitrary kernel addresses!
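We do not walk through the bug itself in this post. As far as we understand the upstream fix (titled "bpf: fix incorrect sign extension in check_alu_op()"), the verifier sign-extended the immediate of a 32-bit move, while at runtime a 32-bit move zero-extends it. The following Python sketch only models that arithmetic, under that assumption; the helper names are ours and it is not kernel code.

```python
MASK64 = (1 << 64) - 1


def sign_extend_32_to_64(imm32):
    """Treat the low 32 bits as signed and widen to a 64-bit register value."""
    imm32 &= 0xFFFFFFFF
    if imm32 & 0x80000000:
        imm32 -= 1 << 32
    return imm32 & MASK64


def zero_extend_32_to_64(imm32):
    """Keep the low 32 bits and clear the upper 32 bits."""
    return imm32 & 0xFFFFFFFF


imm = 0xFFFFFFFF

# Assumption: what the buggy verifier believed a 32-bit move left in the register.
verifier_view = sign_extend_32_to_64(imm)
# What a 32-bit move leaves in the register at runtime.
runtime_value = zero_extend_32_to_64(imm)
# Immediates of conditional jumps are sign-extended before the comparison.
compare_operand = sign_extend_32_to_64(imm)

# A "jump if not equal" on that register, as seen by each side.
verifier_thinks_jump_taken = verifier_view != compare_operand
runtime_jump_taken = runtime_value != compare_operand
```

In this model the verifier concludes that the jump is never taken and does not check the code behind it, while at runtime the jump is taken and that unchecked code runs.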
The discovery of such a vulnerability is very disturbing, as it shakes confidence in being able to use eBPF safely, especially since eBPF was developed with security-oriented uses in mind. Our personal take: in the long run the benefits are great, but between now and then, extra care should be taken. This is why some hardened kernel patch sets (e.g., grsecurity) limit eBPF. Nevertheless, most major distributions don’t.
The combination of two things, the ability of an unprivileged user to load and run an eBPF program (added in this patch) and the verifier's failure to check the code, gave unprivileged users unrestricted read/write access to kernel memory.
The vulnerability has been out there for a couple of years (from the release of 4.9 through 4.14)! Bruce Leidl had a working, though at that time unpublished, POC since at least June.
Interestingly, the vulnerability has gone “under the radar” and still might exist in several recent distribution releases.
When exploited correctly, the vulnerability gives unprivileged users unrestricted read/write access to kernel memory. This can obviously be leveraged in numerous ways. Bruce Leidl published a POC that escalates a regular user to root with all capabilities!
As of January 1st, we found the POC to work on several distributions out of the box (including Ubuntu Desktop/Server version 17.10, Arch (1.12.17), Fedora latest, Mint 18.3).
Figure: Permissions before and after the exploit
What Does This Mean for Containers?
As for container security, we are in a relatively good place. eBPF programs are loaded by means of the bpf() system call.
Access to the bpf() system call is gated by the CAP_SYS_ADMIN capability, which is not assigned to containers by default.
On the other hand, if the OS is configured to allow unprivileged users to load eBPF programs (which is actually the default on most Linux distributions), then CAP_SYS_ADMIN is not needed, since the bpf() system call is available within containers by default!
Docker applies seccomp as a second layer of defense: when enabled, a default seccomp profile blocks the bpf() syscall. One caveat applies when you run with CAP_SYS_ADMIN: in that case the seccomp profile does not block the bpf() system call.
The following example illustrates how seccomp blocks the exploit when it is run inside a container:
Please note that in Kubernetes seccomp is an alpha feature, and pods run by default with seccomp in ‘unconfined’ mode. The result is that as long as loading BPF by unprivileged users is enabled, dropping CAP_SYS_ADMIN is not helpful.
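The interaction between the capability, the sysctl setting and seccomp described above can be summarized as a small decision function. This is a simplified model of the rules stated in this post, not a probe of a real system; the function name and the two profile labels (`"docker-default"`, `"unconfined"`) are our own for illustration.

```python
def can_load_bpf(has_cap_sys_admin, unprivileged_bpf_disabled, seccomp):
    """Model whether a process in a container can reach bpf() and load a program.

    seccomp is "docker-default" (Docker's default profile) or "unconfined".
    """
    if seccomp == "docker-default":
        # The default profile only lets bpf() through for CAP_SYS_ADMIN.
        seccomp_allows = has_cap_sys_admin
    elif seccomp == "unconfined":
        seccomp_allows = True
    else:
        raise ValueError(f"unknown seccomp mode: {seccomp!r}")

    # The kernel itself allows loading with the capability, or without it
    # when unprivileged loading has not been disabled.
    kernel_allows = has_cap_sys_admin or not unprivileged_bpf_disabled
    return seccomp_allows and kernel_allows
```

For example, `can_load_bpf(False, False, "docker-default")` is `False` (Docker's default profile blocks the call), while `can_load_bpf(False, False, "unconfined")` is `True`, which is the Kubernetes default case described above.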
One important thing to note: This is another great example of why running containers in --privileged mode and giving a container unneeded capabilities is bad practice. Upon running a container in privileged mode we are granting it all capabilities, thus allowing it to do almost anything the host can do. In the case of this vulnerability, privileged mode will bring fire and fury on your system, as it will give the user in the container full R/W permissions to the kernel and anywhere on the host. Administrators should be very circumspect when granting extended capabilities to containers.
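One way to see whether a process actually holds CAP_SYS_ADMIN is to look at the `CapEff` line of `/proc/<pid>/status`. The sketch below parses that line in Python; the helper name is ours, and the two sample masks are illustrative values of the kind you would typically see for a default and a privileged container, not output captured for this post.

```python
CAP_SYS_ADMIN = 21  # bit number of CAP_SYS_ADMIN in linux/capability.h


def has_cap_sys_admin(status_text):
    """Check the effective capability mask in the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return bool(mask & (1 << CAP_SYS_ADMIN))
    raise ValueError("no CapEff line found")


# Illustrative sample inputs (assumed typical values, not measured here).
default_container = "Name:\tsh\nCapEff:\t00000000a80425fb\n"
privileged_container = "Name:\tsh\nCapEff:\t0000003fffffffff\n"
```

Inside a container this could be used as `has_cap_sys_admin(open("/proc/self/status").read())`.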
The obvious immediate step is to update your Linux version if a patch is available.
Unless absolutely necessary, disable the ability of unprivileged users to load eBPF programs by setting the sysctl knob “kernel.unprivileged_bpf_disabled” to 1 (the default is 0). This way an unprivileged user will not be able to load eBPF programs or use this exploit to elevate privileges.
Figure: Before and after enabling the sysctl knob
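A host's current setting can be read from procfs. The following is a minimal Python sketch; the function name is ours, and it only distinguishes the value 0 (unprivileged loading allowed) from anything else.

```python
from pathlib import Path

SYSCTL_PATH = "/proc/sys/kernel/unprivileged_bpf_disabled"


def unprivileged_bpf_allowed(path=SYSCTL_PATH):
    """Return True if unprivileged users may load eBPF programs.

    Returns None when the knob does not exist on this kernel.
    """
    try:
        value = Path(path).read_text().strip()
    except FileNotFoundError:
        return None
    return value == "0"
```

If this returns `True`, running `sysctl -w kernel.unprivileged_bpf_disabled=1` as root applies the mitigation. Note that once the knob is set to 1 it cannot be switched back to 0 without a reboot.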
Even if an unprivileged user cannot load eBPF programs, there is still a risk that a limited root user could elevate their capabilities.
Therefore, for containers running with the CAP_SYS_ADMIN capability (either added specifically or through the global --privileged flag), we suggest customizing the seccomp policy to block the bpf() syscall.
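As a sketch of what such a customization could look like, the Python function below takes a seccomp profile in Docker's JSON format (already parsed into a dict) and returns a copy that never allows bpf(). It assumes the profile uses the `names` list form for syscall rules; the function name and the tiny sample profile are ours and stand in for a real profile file.

```python
import copy


def without_bpf(profile):
    """Return a copy of a seccomp profile dict that does not allow bpf()."""
    result = copy.deepcopy(profile)
    kept = []
    for rule in result.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            # Drop bpf from every allow rule, including capability-gated ones.
            rule["names"] = [n for n in rule.get("names", []) if n != "bpf"]
            if not rule["names"]:
                continue  # the rule allowed nothing else, so remove it
        kept.append(rule)
    if result.get("defaultAction") == "SCMP_ACT_ALLOW":
        # An allow-by-default profile needs an explicit deny rule instead.
        kept.append({"names": ["bpf"], "action": "SCMP_ACT_ERRNO"})
    result["syscalls"] = kept
    return result


# Hypothetical, heavily shortened profile used only to exercise the function.
sample_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"},
        {
            "names": ["bpf"],
            "action": "SCMP_ACT_ALLOW",
            "includes": {"caps": ["CAP_SYS_ADMIN"]},
        },
    ],
}
hardened = without_bpf(sample_profile)
```

The resulting dict can be written out with `json.dump` and passed to Docker with `--security-opt seccomp=<path to the file>`.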
Aqua customers can implement the mitigation by disabling the bpf() system call as part of the runtime policy.
CVE-2017-16995 is a severe Linux vulnerability which, for some reason, has received little attention. It’s a particularly nasty one because it stems from the eBPF virtual machine that’s supposed to make Linux more secure. It highlights again the need to ensure minimal privileges to users, and to disable syscalls where they are not needed.
It also highlights that the use of containers can make systems more secure -- but only if they are configured properly.
Finally, this is yet more proof that no matter how much you adhere to best practices, you may still be vulnerable. Therefore, monitoring applications at runtime is not a luxury but a necessity, to allow you to detect anomalies and prevent (or at least limit) attacks.
Our thanks to Bruce Leidl for his help in clarifying some of the technical aspects of this CVE.