Inside your Linux kernel there’s a VM. It runs a specific bytecode. It can also JIT-compile that bytecode. And it can silently trace your processes’ behaviour. It has been around since 1992. Its name? BPF
1. BPF from origins to present
What BPF can do now is quite distant from its modest origins. BPF was initially designed as a packet filtering technology that runs at the kernel level to provide better performance. The name itself is an acronym for BSD Packet Filter.
The technology was originally authored by Steven McCanne and Van Jacobson [1] and improved upon existing packet filtering solutions. Most importantly, BPF was designed to work in kernel space, bypassing the need to copy data into user space for filtering. This simple, historical BPF is now usually referred to by kernel developers as cBPF (classic BPF).
In 2012, 15 years after BPF was added to the Linux kernel, BPF outgrew packet filtering for the first time: it became usable as a system call filter (for writing seccomp policies).
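To make that concrete, here is a minimal sketch of a seccomp filter written in classic BPF (illustrative only: a production policy must also validate the architecture field of seccomp_data before trusting the syscall number).

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int install_filter(void)
{
    /* cBPF program: load the syscall number, deny ptrace(2), allow the rest. */
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ptrace, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    /* Required so an unprivileged process cannot gain privileges afterwards. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```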

In late September 2013, Alexei Starovoitov proposed a set of updates, primarily optimizations, to BPF [2], which, with the help of Daniel Borkmann, were merged into the kernel in 2014 [3]. This marks the beginning of extended BPF (eBPF) in the Linux kernel.
2. BPF hook points
I’ll use the names eBPF and BPF interchangeably from here on, as eBPF is not really a name found in the kernel source code. It’s all BPF.
BPF programs can be hooked at various points within the Linux kernel (and even outside of it). They can observe or even change system behaviour. Since 2015, eBPF has become a very popular tool for system tracing in Linux [4], and it can be argued that tracing is eBPF’s primary focus today.
Here are some examples of hooking points:
2.1. Network specific hooks
- eXpress Data Path (XDP) [5]: XDP represents the earliest point at which a “packet” can be filtered; at this stage, the kernel has not even parsed the Ethernet frame. There are three types of XDP hook points (a minimal XDP program is sketched after this list):
- Generic XDP runs inside the kernel’s networking stack as a fallback for drivers without native XDP support; it is the latest-running of the three modes.
- Native XDP runs at the networking driver’s early receive path. (Native XDP is available only if the network driver supports this feature.)
- Offloaded XDP runs directly on the Network Interface Card (NIC), imposing zero CPU overhead on the host.
- Traffic Control (TC) [5]: At this level, the packet has been allocated an sk_buff structure.
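Here is the promised minimal XDP sketch in C (libbpf style). It drops malformed frames that are shorter than an Ethernet header and passes everything else; the interface name used for attaching is an assumption.

```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* Drop anything shorter than an Ethernet header, before the kernel
 * even allocates an sk_buff for it. */
SEC("xdp")
int xdp_sanity_filter(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    if (data + 14 > data_end)   /* 14 == sizeof(struct ethhdr) */
        return XDP_DROP;
    return XDP_PASS;
}
```

In generic mode this could be attached with something like `ip link set dev eth0 xdpgeneric obj xdp.o sec xdp` (object file name assumed).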
2.2. Function hooks
| Hook Points | Dynamic | Static |
|---|---|---|
| User Space | uprobes, uretprobes | USDTs [6] |
| Kernel | kprobes, kretprobes | fentry/fexit, tracepoints, raw tracepoints, LSM |
In the Linux kernel, kprobes predate eBPF. kprobes can dynamically add hooks at arbitrary points within kernel functions and run instrumentation code there.
How do kprobes inject themselves into the kernel?
For inserting a probe, the following steps are performed [7], [8]:
- The instruction at the targeted address is copied, saved, and replaced by a breakpoint instruction (INT3 on x86_64). The remaining space, if any, is filled with NOP instructions.
- When the breakpoint is hit, the breakpoint handler within the kernel executes the installed kprobe handler.
- The original instruction is executed, and the normal execution flow resumes.
When a probe is removed, the original instruction is copied back to the targeted address [8].
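For reference, this is roughly what registering a kprobe from a kernel module looks like (a sketch; the target symbol is an arbitrary choice, any non-inlined kernel function works):

```c
#include <linux/module.h>
#include <linux/kprobes.h>

/* Hypothetical target symbol chosen for illustration. */
static struct kprobe kp = {
    .symbol_name = "do_sys_openat2",
};

/* Runs from the breakpoint handler, before the saved original
 * instruction is executed. */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
    pr_info("kprobe hit: %s\n", p->symbol_name);
    return 0;
}

static int __init kp_init(void)
{
    kp.pre_handler = handler_pre;
    return register_kprobe(&kp);
}

static void __exit kp_exit(void)
{
    unregister_kprobe(&kp);  /* copies the original instruction back */
}

module_init(kp_init);
module_exit(kp_exit);
MODULE_LICENSE("GPL");
```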
In the case of kretprobes, a probe is inserted at the function’s entry; when it is hit, the return address is saved and then replaced with a trampoline that executes the kretprobe handler [8].
uprobes and uretprobes are the user-space equivalents of kprobes and kretprobes, respectively. When a uprobe breakpoint is hit, a context switch into the kernel happens. This significantly affects performance if the targeted function is executed frequently during the process’s lifetime, as malloc and free typically are [8].
Function names and implementations often change between kernel versions, which can break kprobes and uprobes. For this reason, static hook point equivalents exist in both the kernel and user space. Static hook points are explicitly defined alongside the data they expose to the tracer (except for raw tracepoints), providing a closer-to-stable API, unlike kprobes, which offer direct read access to register values. These static hook points rely on the implementation of their dynamic equivalents but come with an optimization: at the location of the static hook, enough NOP instructions are placed that they can be overwritten with a JMP instruction [8], thus bypassing the kernel breakpoint interrupt.
eBPF programs that use fexit hooks are executed at the end of the hooked function and directly receive both the initial parameters of the function call and its return value. To accomplish the same with kprobes, one would have to register a kprobe at the function’s entry to store the call’s arguments in an eBPF map, and then register a kretprobe that retrieves the stored values, executes the desired logic, and frees the stored data. A sketch of the fexit approach follows.
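This is a minimal fexit sketch, modeled on the well-known libbpf-bootstrap example (do_unlinkat is the kernel function backing unlink(2)/unlinkat(2)); note how the arguments and the return value arrive together:

```c
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* fexit programs see the arguments *and* the return value in one place,
 * with no map juggling between entry and exit. */
SEC("fexit/do_unlinkat")
int BPF_PROG(trace_unlink_exit, int dfd, struct filename *name, long ret)
{
    bpf_printk("unlinkat(%s) returned %ld", name->name, ret);
    return 0;
}
```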
Tracing data by recording what is sent to the kernel at system call entry is unreliable, as attackers can change the data before the kernel copies it into its own structures. This is known as the Time Of Check to Time Of Use (TOCTOU) issue.
Initially added to support security modules, Linux Security Module (LSM) hooks have also been accessible to eBPF programs since 2020 [9]. These hooks permit instrumentation after the data has been copied into kernel memory but before the kernel acts on it, avoiding the TOCTOU vulnerability [10].
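A hedged sketch of a BPF LSM program: bprm_check_security is a real LSM hook that fires during execve, after the kernel has already copied the exec data into kernel memory (hence no TOCTOU window). Returning a negative errno from such a hook denies the operation; this sketch only observes.

```c
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* Observe every exec; returning e.g. -EPERM instead of 0 would block it. */
SEC("lsm/bprm_check_security")
int BPF_PROG(exec_check, struct linux_binprm *bprm)
{
    bpf_printk("exec of %s", bprm->filename);
    return 0;
}
```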
3. The BPF verifier
The BPF verifier is one of the main components that make BPF the nice technology it is: its mission is to perform static analysis on BPF programs’ bytecode to ensure they cannot crash the kernel.
The BPF verifier can guarantee that BPF programs terminate and are memory safe [11] by making sure they are free from infinite loops, null pointer dereferences, and out-of-bounds (OOB) reads and writes, and that they do not exceed certain resource limits.
One of the most important limitations is that a BPF program’s CFG (Control Flow Graph) must be acyclic [12]. This limitation helps the BPF verifier ensure that BPF programs terminate, but it also means that BPF programs cannot be Turing-complete. Practically, jump instructions to earlier code in a program are forbidden. BPF compilers used to perform loop unrolling to overcome this limitation; in 2019 the verifier gained support for bounded loops, and a dedicated helper function for loops was later added on top of that [13]. Another safeguard is that the number of tail calls a BPF program can perform is limited; thus, a chain of multiple BPF programs that call each other cannot break the termination guarantee [14].
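As a concrete taste of the verifier at work, here is the classic pattern every BPF program must follow when dereferencing a map lookup result (a sketch; hook point and map layout are arbitrary choices):

```c
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, u64);
} counter SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_write")
int on_write(void *ctx)
{
    u32 key = 0;
    u64 *val = bpf_map_lookup_elem(&counter, &key);

    /* Without this NULL check the verifier rejects the program:
     * lookups may fail, so the pointer is "map_value_or_null". */
    if (!val)
        return 0;

    __sync_fetch_and_add(val, 1);
    return 0;
}
```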
Alas, the BPF verifier is not perfect [11]: it rejects some safe programs and may accept programs that perform OOB operations, and bugs within the BPF verifier were successfully exploited for privilege escalation in the past [15].
4. The BPF helper functions
eBPF helper functions are the primary way eBPF programs access additional data and interact with the kernel. Depending on its hook point, an eBPF program has access to a certain set of helper functions. Some of the helpers’ checks are done during static analysis and others at runtime; in the latter case, a helper that cannot perform the request returns an error code to the eBPF program.
eBPF helper functions can be categorized as follows [16]:
- context helper functions: bpf_get_current_task, bpf_get_current_pid_tgid, bpf_ktime_get_ns, etc.
- map operations helpers: bpf_map_lookup_elem, bpf_map_update_elem, bpf_map_delete_elem, etc.
- memory related helpers: bpf_probe_read, bpf_probe_write_user, etc.
- program type specific helpers: bpf_xdp_adjust_tail, bpf_csum_diff, bpf_l3_csum_replace, etc.
At the time of writing, the number of eBPF helper functions is nearing 200.
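A sketch combining the first two categories (the hook point and map layout are arbitrary choices made for illustration):

```c
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, u32);
    __type(value, u64);
} start_ns SEC(".maps");

/* Remember when each thread entered openat(2): context helpers provide
 * the identity and timestamp, a map helper stores the result. */
SEC("tracepoint/syscalls/sys_enter_openat")
int on_enter_openat(void *ctx)
{
    u32 tid = (u32)bpf_get_current_pid_tgid(); /* lower 32 bits = thread id */
    u64 ts = bpf_ktime_get_ns();               /* monotonic timestamp */

    bpf_map_update_elem(&start_ns, &tid, &ts, BPF_ANY);
    return 0;
}
```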
The bpf_probe_write_user helper function was added in 2016 [17], allowing eBPF programs to overwrite user-space process data. When an application loads an eBPF program that uses this function, the kernel emits a dedicated log message for security purposes.
An outlier helper function, bpf_override_return, was added in 2017; it can override the return value of a specific subset of kernel functions. This helper has a dedicated Kconfig entry (CONFIG_BPF_KPROBE_OVERRIDE), so Linux kernels can be compiled without it entirely, but most distributions choose to enable it. Attackers have used this helper to fool some detection applications by making system calls that executed successfully appear as if they had failed.
Another way BPF programs can interact with the kernel is by using kfuncs (kernel functions). These are functions in the Linux kernel that are exposed for use by BPF programs; unlike eBPF helper functions, which should provide a stable API, kfuncs can change more freely between kernel versions [18].
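A sketch of how a BPF program declares and calls kfuncs. The two task kfuncs below exist in recent kernels, but the exact set, and which program types may call them, vary by version, which is precisely the caveat of this section.

```c
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* kfuncs are declared as extern symbols with the __ksym attribute. */
extern struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
extern void bpf_task_release(struct task_struct *p) __ksym;

SEC("fentry/do_unlinkat")
int BPF_PROG(use_kfunc, int dfd, struct filename *name)
{
    struct task_struct *init_task = bpf_task_from_pid(1);

    if (init_task)
        bpf_task_release(init_task); /* acquired references must be released */
    return 0;
}
```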
5. BPF Maps
eBPF maps [19] are how eBPF programs store data and communicate, both among themselves and with user space. Some of the essential map types are BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_HASH (hash maps with an arbitrary data type as keys), both of which come in PERCPU variants (which use a separate memory region for each CPU), as well as BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS (which can hold map references as values).
Other map types are optimized for particular use cases, such as stacks, queues, and least-recently-used data storage, or for specific objects, such as sockmaps and devmaps [10].
eBPF maps are not isolated: any root process holding the CAP_BPF capability can read and write any map on the system, even outside of its own container. This lack of isolation means that such a process could change the configuration of eBPF security tools to prevent them from detecting or preventing malicious activity [20], as sketched below.
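A user-space sketch using libbpf that shows why this matters: a sufficiently privileged process can open any BPF map on the system by its global id and rewrite its contents, including the maps of someone else’s security tooling. The map id, key, and value layout below are assumptions for illustration.

```c
#include <bpf/bpf.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical map id; real ids can be enumerated via the bpf() syscall
     * or `bpftool map list`. */
    int fd = bpf_map_get_fd_by_id(42);
    if (fd < 0) {
        perror("bpf_map_get_fd_by_id");
        return 1;
    }

    unsigned int key = 0;
    unsigned long long value = 0;   /* e.g., zero out a rule flag */
    if (bpf_map_update_elem(fd, &key, &value, 0 /* BPF_ANY */) != 0)
        perror("bpf_map_update_elem");
    return 0;
}
```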
6. How to make and attach a BPF program
I believe this topic deserves its own blog post in the future, as it’s quite complex.
For the moment, I recommend Liz Rice’s book “Learning eBPF” [10] as a good resource for getting started writing eBPF programs.
7. eBPF application landscape
7.1. Falco
Falco [21] is an eBPF runtime security tool that inspects the system calls applications make at runtime and emits alerts based on predefined rules, essentially functioning as a HIDS (Host Intrusion Detection System) with a strong focus on containers and Kubernetes environments.
These types of solutions can identify and mitigate the impact of unknown security vulnerabilities being exploited by detecting anomalous behavior or patterns that indicate an intrusion.
Falco checks for privilege escalations; reads and writes to system directories such as /etc, /usr/bin, and /usr/sbin; ownership and mode changes; unexpected network connections; processes spawned using execve; and changes to Linux coreutils executables, login binaries, and shadowutils or passwd executables [21].
Not only does Falco come with a set of predefined rules that map to common attacker TTPs, but users can also define their own detection rules in YAML; an illustrative rule follows below.
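For illustration, here is a representative rule in the spirit of the walk-through in [22] (illustrative, not verbatim): it alerts whenever a shell is spawned inside a container.

```yaml
- rule: shell_in_container
  desc: Notice shell activity within a container
  condition: >
    evt.type = execve and evt.dir = < and
    container.id != host and proc.name = bash
  output: >
    Shell spawned in a container
    (user=%user.name container=%container.name cmdline=%proc.cmdline)
  priority: WARNING
```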
7.2. Pixie
Pixie [23], an observability tool for Kubernetes applications, uses eBPF to enable profiling by gathering telemetry data without requiring developers to change anything in their code. This allows developers to identify bottlenecks from real usage data.
Pixie can be used to gain an overview of cluster state, such as service maps, cluster resources, and application traffic. It also provides more detailed views, such as pod statuses, flame graphs, and full-body captures of individual application requests [23].
An interesting aspect of Pixie is that it uses uprobes on TLS libraries to capture data before encryption and after decryption, instead of relying only on kprobes for send(2) and recv(2) [23].
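As a flavour of that approach, a bpftrace one-liner can hook SSL_write the same way. The library path and SONAME below are assumptions for a typical OpenSSL 3 system; SSL_write’s third argument is the plaintext length.

```
bpftrace -e 'uprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_write { printf("%s writing %d plaintext bytes\n", comm, arg2); }'
```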
Pixie also employs a sampling-based profiler that relies on eBPF to periodically interrupt the CPU (roughly every 10 ms), with negligible overhead, to record the currently running program and where in its code it is [23].
7.3. Cilium
Cilium is primarily a networking visibility and security tool, also used for Kubernetes applications. The project is quite broad, so I’ll only paint an overall picture.
One of the main networking optimisations Cilium implements using eBPF, amongst others, is that on the networking path it can redirect packets directly to the veth of the destination pod, bypassing most of the processing that would otherwise happen unnecessarily in the kernel’s networking stack [24].
In a deployed environment, it is also important to perform load balancing and rate limiting, and to be able to monitor metrics such as ingress and egress data volume, performance, and the availability of services. This is known as having a service mesh [24].
To implement a service mesh, the standard approach was to add a dedicated container, known as a sidecar container, to every pod to hold the mesh logic. Failing to include a sidecar container in a pod deployment can have security implications [24].
The Cilium Service Mesh is an alternative to the sidecar container: with its BPF-based implementation, deploying a sidecar container for every pod is no longer necessary [24].
Another important feature of Cilium is transparent traffic encryption. If traffic leaves the host, Cilium can encrypt it without the applications being aware, using layer-three protocols such as IPsec or WireGuard [24].
7.4. bpftrace
Here are a few one-liners that demonstrate different capabilities of bpftrace, adapted from the project’s tutorial [25].
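These are representative examples from the bpftrace one-liner tutorial in [25] (syntax shown for recent bpftrace versions; older versions use `args->field` instead of `args.field`):

```
# Files opened by process
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args.filename)); }'

# Syscall count by program
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Read bytes by process
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args.ret/ { @[comm] = sum(args.ret); }'
```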
8. Closing notes
We’ve covered a big part of what BPF is, how it came to be what it is today, how it works, and how it is used.
The main thing to remember is that BPF began as a packet filtering tool but has since evolved into an important tool for kernel and application observability.
Personally, I envision eBPF becoming the standard method for kernel tuning, enabling custom kernel programming tailored to application profiles, and gaining more capabilities in that regard. This will allow applications to adjust kernel behaviour to better fit their needs and improve system performance. Such a shift would likely displace traditional Loadable Kernel Module (LKM) solutions, which come with inherent risks and complexities, ultimately leading to broader adoption of eBPF-based solutions.
Thank you for reading!
[1] S. McCanne and V. Jacobson, “The BSD Packet Filter: A New Architecture for User-level Packet Capture,” in Proc. USENIX Winter, 1993, vol. 46.
[2] A. Starovoitov, “[PATCH net-next] extended BPF,” Sep. 30, 2013. [Online]. Available: https://lkml.org/lkml/2013/9/30/627 [Accessed: Sep. 20, 2024].
[3] D. Borkmann, “[PATCH net-next 0/9] BPF updates,” Mar. 21, 2014. [Online]. Available: https://lore.kernel.org/netdev/1395404418-25376-1-git-send-email-dborkman@redhat.com/T/#u [Accessed: Sep. 20, 2024].
[4] A. Starovoitov, “[PATCH v7 tip 0/8] tracing: attach eBPF programs to kprobes,” Mar. 16, 2015. [Online]. Available: https://lwn.net/Articles/636976/ [Accessed: Sep. 20, 2024].
[5] “Program Types,” Cilium Documentation. [Online]. Available: https://docs.cilium.io/en/stable/bpf/progtypes/ [Accessed: Sep. 23, 2024].
[6] M. Fleming, “Using user-space tracepoints with BPF,” May 11, 2018. [Online]. Available: https://lwn.net/Articles/753601/ [Accessed: Sep. 20, 2024].
[7] S. Goswami, “An introduction to KProbes,” Apr. 18, 2005. [Online]. Available: https://lwn.net/Articles/132196/ [Accessed: Sep. 20, 2024].
[8] B. Gregg, BPF Performance Tools. Addison-Wesley Professional, 2019.
[9] K. Singh, “[PATCH bpf-next v9 1/8] bpf: Introduce BPF_PROG_TYPE_LSM.” [Online]. Available: https://lore.kernel.org/bpf/20200329004356.27286-2-kpsingh@chromium.org/ [Accessed: Sep. 20, 2024].
[10] L. Rice, Learning eBPF. O'Reilly Media, 2023.
[11] E. Gershuni, N. Amit, A. Gurfinkel, N. Narodytska, J. Navas, N. Rinetzky, L. Ryzhyk, and M. Sagiv, “Simple and precise static analysis of untrusted Linux kernel extensions,” in Proc. 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2019, pp. 1069–1084.
[12] “eBPF verifier,” The Linux Kernel documentation. [Online]. Available: https://docs.kernel.org/bpf/verifier.html [Accessed: Sep. 20, 2024].
[13] J. Corbet, “A different approach to BPF loops,” Nov. 29, 2021. [Online]. Available: https://lwn.net/Articles/877062/ [Accessed: Sep. 20, 2024].
[14] “BPF Architecture,” Cilium Documentation. [Online]. Available: https://docs.cilium.io/en/stable/bpf/architecture/ [Accessed: Sep. 20, 2024].
[15] “CVE-2020-8835: Linux Kernel Privilege Escalation via Improper eBPF Program Verification,” Apr. 16, 2020. [Online]. Available: https://www.zerodayinitiative.com/blog/2020/4/8/cve-2020-8835-linux-kernel-privilege-escalation-via-improper-ebpf-program-verification [Accessed: Sep. 20, 2024].
[16] G. Fournier and S. Afchain, “eBPF, I thought we were friends!,” DEF CON 29, Aug. 2021. [Online]. Available: https://media.defcon.org/DEF%20CON%2029/DEF%20CON%2029%20presentations/Guillaume%20Fournier%20Sylvain%20Afchain%20Sylvain%20Baubeau%20-%20eBPF,%20I%20thought%20we%20were%20friends.pdf [Accessed: Sep. 20, 2024].
[17] S. Dhillon, “[PATCH v4 0/2] bpf: add bpf_probe_write helper & example,” Jul. 21, 2016. [Online]. Available: https://lkml.org/lkml/2016/7/21/701 [Accessed: Sep. 20, 2024].
[18] “BPF Kernel Functions (kfuncs),” The Linux Kernel documentation. [Online]. Available: https://docs.kernel.org/bpf/kfuncs.html [Accessed: Oct. 1, 2024].
[19] “BPF maps,” The Linux Kernel documentation. [Online]. Available: https://docs.kernel.org/bpf/maps.html [Accessed: Sep. 20, 2024].
[20] P. Hogan, “Mapping It Out: Analyzing the Security of eBPF Maps,” Feb. 22, 2021. [Online]. Available: https://www.crowdstrike.com/blog/analyzing-the-security-of-ebpf-maps/ [Accessed: Sep. 20, 2024].
[21] “The Falco Project.” [Online]. Available: https://falco.org/docs/ [Accessed: Oct. 1, 2024].
[22] M. Ducy, “Getting Started Writing Falco Rules,” Mar. 7, 2018. [Online]. Available: https://sysdig.com/blog/getting-started-writing-falco-rules/ [Accessed: Oct. 1, 2024].
[23] “About Pixie.” [Online]. Available: https://docs.px.dev/about-pixie/ [Accessed: Oct. 1, 2024].
[24] T. Graf, “Cilium Service Mesh – Everything You Need to Know,” Jul. 20, 2022. [Online]. Available: https://isovalent.com/blog/post/cilium-service-mesh/ [Accessed: Oct. 1, 2024].
[25] “bpftrace.” [Online]. Available: https://github.com/bpftrace/bpftrace [Accessed: Oct. 1, 2024].