eBPF Deep Dive: From bpftrace One-Liners to XDP, LSM, and sched_ext

April 8, 2026

Macro close-up of a honeybee — the mascot of the eBPF project

What is eBPF and why does it matter?

Imagine you need to observe every file a process opens, every TCP connection that gets created, or every time the scheduler picks a task to run, all in production and without a noticeable performance hit. Before eBPF, you had two poor options:

Kernel module: Efficient but dangerous; a single bug can crash the whole server
strace / perf: Safe but slow; overhead can reach 10x (strace in particular), which rules them out for production

eBPF offers a third option: run your own code inside the kernel, with safety close to user space and speed close to native kernel code.

Brendan Gregg, author of "BPF Performance Tools", sums this up with a comparison: "eBPF does to Linux what JavaScript does to HTML." Just as JavaScript turns static pages into dynamic applications without changing the browser, eBPF lets you program a running kernel without rebooting or writing a kernel module.

eBPF (extended Berkeley Packet Filter): A virtual machine inside the Linux kernel that runs sandboxed programs which have been verified before they execute. An eBPF program is attached to a hook point in the kernel (syscalls, the network path, the scheduler, and so on) and runs with very low overhead each time the event fires.

In this post, we go from the first steps all the way down to the lowest-level technical details:

text

What we cover:

  [Beginner]
  1. Architecture overview
  2. bpftrace one-liners
  3. bcc tools ready-to-use

  [Intermediate]
  4. First eBPF program in C (libbpf)
  5. eBPF Maps - all types with examples
  6. Program types: kprobe, tracepoint, XDP, TC

  [Advanced]
  7. XDP for high-performance networking
  8. TC hooks and packet modification
  9. BPF LSM for security

  [Deep Dive]
  10. eBPF internals: verifier, JIT, instruction set
  11. Modern eBPF: BTF, CO-RE, sched_ext

eBPF Architecture

Before writing any code, it helps to understand how an eBPF program runs inside the kernel.

Lifecycle of an eBPF program

text

User space                    Kernel space
-----------                   ------------

Source (.bpf.c)
    |
    | clang/LLVM
    v
eBPF bytecode (.o)
    |
    | bpf() syscall
    v
[BPF Verifier] ----FAIL----> EACCES / EINVAL
    |
   PASS
    |
    | JIT Compiler
    v
Native machine code
    |
    | bpf_prog_attach()
    v
Hook point (kprobe / XDP / tracepoint / ...)
    |
    | Event fires
    v
Program executes <-------> BPF Maps (shared with user space)

Hook Points

eBPF programs can attach to many points in the kernel:

Category	Hook Type	Use Case
Tracing	kprobe/kretprobe	Trace kernel functions
Tracing	tracepoint	Stable kernel events
Tracing	uprobe/uretprobe	Trace user space functions
Tracing	USDT	User-defined static tracing
Networking	XDP	Packet processing at driver level
Networking	TC ingress/egress	Traffic control
Networking	socket	Socket-level filtering
Networking	sk_lookup	Custom socket dispatch
Security	LSM	Linux Security Module hooks
Scheduling	sched_ext	Custom CPU schedulers (kernel 6.12+)

Why is eBPF safe?

The BPF verifier is the heart of the safety model. Before a program runs, the verifier checks:

No infinite loops: The verifier must be able to prove that the program always terminates. Backward jumps are only allowed in loops with a bounded iteration count.
No out-of-bounds memory access: Every pointer must be bounds-checked before it is dereferenced.
No uninitialized reads: The verifier tracks the state of every register and stack slot.
No unsafe kernel function calls: A program may only call functions the kernel explicitly exposes to BPF (helper functions and kfuncs), never an arbitrary kernel function.

text

Verifier state machine (simplified):

  For each instruction:
    - Track type of every register (scalar, ptr-to-map, ptr-to-ctx, ...)
    - Track value range (min/max bounds) for scalars
    - For pointer arithmetic: verify result still in bounds
    - For memory access: verify ptr type + offset is valid

  If any check fails -> program rejected
  If all paths terminate safely -> program approved

BPF Verifier: A static analyzer in the kernel that inspects eBPF bytecode before it is allowed to run. It performs abstract interpretation, simulating every possible code path, to make sure the program can never corrupt memory or loop forever.
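To make the range tracking described above concrete, here is a minimal user-space sketch in plain C. It is a toy model, not the kernel's verifier: `struct scalar_range`, `narrow_below`, and `access_ok` are hypothetical names invented for this illustration, and only unsigned bounds on a single scalar are modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the value range tracked for one scalar register. */
struct scalar_range {
    uint64_t min;
    uint64_t max;
};

/*
 * Models the fall-through path of `if (idx >= limit) return 0;`:
 * past that branch, the scalar is known to be below `limit`.
 * (limit == 0 would make the path unreachable; that case is not modeled.)
 */
static struct scalar_range narrow_below(struct scalar_range r, uint64_t limit)
{
    if (limit > 0 && r.max > limit - 1)
        r.max = limit - 1;
    return r;
}

/*
 * An access of `size` bytes at a variable offset is accepted only if
 * the worst-case offset still fits inside the region.
 */
static bool access_ok(struct scalar_range off, uint64_t size,
                      uint64_t region_size)
{
    if (size > region_size)
        return false;
    return off.max <= region_size - size;
}
```

With an unbounded index, a 1-byte access into a 16-byte map value is rejected; after `narrow_below(idx, 16)` the same access is accepted. This is why eBPF code is full of explicit bounds checks that look redundant to a C programmer: the check is what gives the verifier the range it needs.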

Registers and Calling Convention

eBPF has eleven 64-bit registers (r0 to r10), plus a hidden program counter:

Register	Role
r0	Return value of a function call; exit value of the program
r1–r5	Arguments for function calls
r6–r9	Callee-saved (preserved across function calls)
r10	Frame pointer (read-only), points to the stack
pc	Program counter (implicit, not directly accessible)

The eBPF stack has a fixed size of 512 bytes. Keep this limit in mind when writing programs: large buffers belong in maps, not on the stack.

bpftrace: eBPF for beginners

bpftrace is a high-level language for eBPF tracing, much like awk is for text processing. You can write powerful one-liners without knowing C or kernel internals.

Installation

bash

# Ubuntu/Debian
sudo apt install bpftrace
 
# Fedora/RHEL
sudo dnf install bpftrace
 
# Verify
bpftrace --version
# bpftrace v0.21.0

Basic syntax

text

probe[,probe,...] /filter/ {
    action
}

Sample one-liners

1. Hello World: trace every call to write():

bash

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_write {
    printf("pid=%d comm=%s fd=%d\n", pid, comm, args->fd);
}'

text

Output:
pid=1234 comm=bash fd=1
pid=5678 comm=nginx fd=4
pid=1234 comm=bash fd=2

2. Count system calls per process:

bash

sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter {
    @[comm] = count();
}'

text

Output (on Ctrl+C):
@[chrome]:  45821
@[nginx]:    3422
@[sshd]:      891
@[bash]:      234

3. Trace file opens: which files are being opened:

bash

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%-16s %-6d %s\n", comm, pid, str(args->filename));
}'

text

Output:
nginx            1234   /var/log/nginx/access.log
chrome           5678   /etc/hosts
sshd             9012   /etc/ssh/sshd_config

4. Latency of read() as a histogram:

bash

sudo bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
    @latency_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'

text

Output:
@latency_us:
[0, 1)     3421 |@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, 2)     1823 |@@@@@@@@@@@@@@|
[2, 4)      891 |@@@@@@@|
[4, 8)      345 |@@@@|
[8, 16)     123 |@|
[16, 32)     45 |
[32, 64)     12 |

5. New TCP connections:

bash

sudo bpftrace -e '
kprobe:tcp_connect {
    $sk = (struct sock *)arg0;
    printf("%-16s -> %s:%d\n",
        comm,
        ntop($sk->__sk_common.skc_daddr),
        ($sk->__sk_common.skc_dport >> 8) | (($sk->__sk_common.skc_dport << 8) & 0xff00));
}'

6. CPU scheduler: context switches:

bash

sudo bpftrace -e '
tracepoint:sched:sched_switch {
    printf("%-16s -> %-16s\n", args->prev_comm, args->next_comm);
}'

7. CPU profiling: sample the kernel stack of one process (nginx here) at 99 Hz:

bash

sudo bpftrace -e '
profile:hz:99 /comm == "nginx"/ {
    @[kstack] = count();
}'

The stack trace output helps identify hot paths; it is the raw data behind a flame graph.

8. Trace execve: every command run on the system:

bash

sudo bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%-8d %-16s %s\n", pid, comm, str(args->filename));
}'

9. Block I/O latency:

bash

sudo bpftrace -e '
tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
    @io_ms = hist((nsecs - @start[args->dev, args->sector]) / 1000000);
    delete(@start[args->dev, args->sector]);
}'
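The latency one-liners above rely on `hist()`, which groups values into power-of-two buckets so that a handful of counters can cover anything from nanoseconds to seconds. The following plain C sketch shows that bucketing idea; `log2_bucket` is a hypothetical name, and this is an illustration of the technique rather than bpftrace's actual implementation.

```c
#include <stdint.h>

/*
 * Map a value to a power-of-two bucket index:
 *   bucket 0 holds the value 0,
 *   bucket k (k >= 1) holds values in [2^(k-1), 2^k).
 * The index is simply the number of significant bits in the value.
 */
static unsigned int log2_bucket(uint64_t value)
{
    unsigned int bucket = 0;

    while (value != 0) {
        value >>= 1;
        bucket++;
    }
    return bucket;
}
```

A tracing program only has to increment `counts[log2_bucket(latency)]` on each event, which is cheap enough to run on every I/O completion; the ASCII bars are rendered later in user space.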

Key bpftrace syntax

Syntax	Meaning
`pid`, `tid`	Process/thread ID
`comm`	Process name
`nsecs`	Current timestamp (nanoseconds)
`kstack`	Kernel stack trace
`ustack`	User space stack trace
`@map[key]`	Map lookup/update
`count()`	Count occurrences
`hist(n)`	Power-of-two histogram
`str(ptr)`	Convert a pointer to a string
`ntop(...)`	Convert an IP address to a string
`args->field`	Access a tracepoint argument

bpftrace scripts (multi-line)

More complex scripts can be saved to a .bt file:

bash

#!/usr/bin/bpftrace
# tcplife.bt - track TCP session lifetime
 
#include <net/sock.h>
 
BEGIN {
    printf("%-5s %-16s %-5s %-16s %-5s %s\n",
        "PID", "COMM", "LPORT", "RADDR", "RPORT", "MS");
}
 
kprobe:tcp_set_state {
    $sk = (struct sock *)arg0;
    $newstate = arg1;
 
    if ($newstate == 1) {
        // TCP_ESTABLISHED
        @start[$sk] = nsecs;
        @comm[$sk] = comm;
        @lport[$sk] = $sk->__sk_common.skc_num;
    }
 
    if ($newstate == 7 && @start[$sk]) {
        // TCP_CLOSE
        $ms = (nsecs - @start[$sk]) / 1000000;
        printf("%-5d %-16s %-5d %-16s %-5d %d\n",
            pid,
            @comm[$sk],
            @lport[$sk],
            ntop(AF_INET, $sk->__sk_common.skc_daddr),
            ($sk->__sk_common.skc_dport >> 8) | (($sk->__sk_common.skc_dport << 8) & 0xff00),
            $ms);
        delete(@start[$sk]);
        delete(@comm[$sk]);
        delete(@lport[$sk]);
    }
}

bash

sudo bpftrace tcplife.bt
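The script prints the remote port by swapping the two bytes of `skc_dport`, because the kernel stores that field in network (big-endian) byte order. The plain C sketch below shows the same swap and why the shifted-left half should be masked when the operand is wider than 16 bits. `swap16` is a hypothetical helper written for this illustration, and the sample values assume a little-endian host.

```c
#include <stdint.h>

/*
 * Swap the two low bytes of a value that was read as an integer.
 * Both halves are masked so that nothing survives above bit 15 when
 * the input has been widened to 32 or 64 bits.
 */
static uint16_t swap16(uint32_t raw)
{
    return (uint16_t)(((raw >> 8) & 0x00ff) | ((raw << 8) & 0xff00));
}
```

Port 80 is stored as the bytes `00 50`, which a little-endian host reads as `0x5000`; `swap16(0x5000)` gives 80. Without the mask, `(0x5000 >> 8) | (0x5000 << 8)` evaluates to `0x500050` in 32-bit arithmetic, which is the kind of oversized port number an unmasked expression can print.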

Writing your first eBPF program with libbpf

When bpftrace is not flexible enough, you need to write the eBPF program in C. The modern workflow uses libbpf with a generated skeleton.

Prerequisites

bash

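# Ubuntu 22.04+
# (bpftool ships in the linux-tools packages on Ubuntu; on Debian, install "bpftool" instead)
sudo apt install -y \
    clang llvm \
    libbpf-dev \
    linux-headers-$(uname -r) \
    linux-tools-common linux-tools-$(uname -r)
 
# Verify
clang --version    # >= 12
bpftool version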

Project layout

text

myebpf/
+-- myebpf.bpf.c      # eBPF program (kernel side)
+-- myebpf.c          # Userspace loader
+-- Makefile

Example 1: Hello World, tracing execve

Kernel side (myebpf.bpf.c):

// myebpf.bpf.c
#include "vmlinux.h"           // All kernel types (generated by bpftool)
#include <bpf/bpf_helpers.h>  // BPF helper functions
#include <bpf/bpf_tracing.h>  // Tracing macros
 
// Shared event structure (must match userspace)
struct event {
    u32 pid;
    u8  comm[16];
    u8  filename[256];
};
 
// Ring buffer map for sending events to userspace
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024); // 256KB ring buffer
} events SEC(".maps");
 
SEC("tracepoint/syscalls/sys_enter_execve")
int trace_execve(struct trace_event_raw_sys_enter *ctx)
{
    struct event *e;
 
    // Reserve space in ring buffer
    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;
 
    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
 
    // Read filename from user space pointer
    bpf_probe_read_user_str(&e->filename, sizeof(e->filename),
                            (void *)ctx->args[0]);
 
    // Submit event to ring buffer
    bpf_ringbuf_submit(e, 0);
    return 0;
}
 
char LICENSE[] SEC("license") = "GPL";

Userspace (myebpf.c):

// myebpf.c
#include <errno.h>   // EINTR
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <bpf/libbpf.h>
#include "myebpf.skel.h"  // Auto-generated by bpftool
 
struct event {
    unsigned int pid;
    char comm[16];
    char filename[256];
};
 
static volatile sig_atomic_t stop = 0;
static void sig_handler(int sig) { stop = 1; }
 
static int handle_event(void *ctx, void *data, size_t size)
{
    const struct event *e = data;
    printf("pid=%-6u comm=%-16s file=%s\n", e->pid, e->comm, e->filename);
    return 0;
}
 
int main(void)
{
    struct myebpf_bpf *skel;
    struct ring_buffer *rb = NULL;  // NULL so the cleanup path is safe if attach fails
    int err;
 
    // Open and load BPF skeleton
    skel = myebpf_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "Failed to open BPF skeleton\n");
        return 1;
    }
 
    // Attach tracepoint
    err = myebpf_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "Failed to attach BPF programs: %d\n", err);
        goto cleanup;
    }
 
    // Set up ring buffer polling
    rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
                          handle_event, NULL, NULL);
    if (!rb) {
        err = -1;
        fprintf(stderr, "Failed to create ring buffer\n");
        goto cleanup;
    }
 
    signal(SIGINT, sig_handler);
    printf("Tracing execve... Press Ctrl+C to stop.\n");
    printf("%-6s %-16s %s\n", "PID", "COMM", "FILE");
 
    while (!stop) {
        err = ring_buffer__poll(rb, 100 /* timeout_ms */);
        if (err == -EINTR) { err = 0; break; }
        if (err < 0) { fprintf(stderr, "Error polling: %d\n", err); break; }
    }
 
cleanup:
    ring_buffer__free(rb);
    myebpf_bpf__destroy(skel);
    return err;
}

Makefile:

makefile

CLANG ?= clang
BPFTOOL ?= bpftool
LIBBPF_CFLAGS := $(shell pkg-config --cflags libbpf)
LIBBPF_LIBS := $(shell pkg-config --libs libbpf)
 
# Generate vmlinux.h from running kernel
vmlinux.h:
	$(BPFTOOL) btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
 
# Compile BPF program to object file
myebpf.bpf.o: myebpf.bpf.c vmlinux.h
	$(CLANG) -g -O2 -target bpf -D__TARGET_ARCH_x86 \
		$(LIBBPF_CFLAGS) \
		-c myebpf.bpf.c -o myebpf.bpf.o
 
# Generate skeleton header
myebpf.skel.h: myebpf.bpf.o
	$(BPFTOOL) gen skeleton myebpf.bpf.o > myebpf.skel.h
 
# Compile userspace program
myebpf: myebpf.c myebpf.skel.h
	$(CC) -g -O2 myebpf.c $(LIBBPF_CFLAGS) $(LIBBPF_LIBS) -o myebpf
 
all: myebpf
 
clean:
	rm -f vmlinux.h myebpf.bpf.o myebpf.skel.h myebpf

bash

make all
sudo ./myebpf

text

Tracing execve... Press Ctrl+C to stop.
PID    COMM             FILE
12345  bash             /usr/bin/ls
12346  sshd             /usr/sbin/sshd
12347  python3          /usr/bin/python3

Example 2: Kprobe with arguments, tracing connect()

// kprobe_connect.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
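#include <bpf/bpf_core_read.h>
#include <bpf/bpf_endian.h>    // bpf_ntohs

#define AF_INET 2  // vmlinux.h carries types only, not macros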
 
struct conn_event {
    u32 pid;
    u32 daddr;   // destination IP (network byte order)
    u16 dport;   // destination port
    u8  comm[16];
};
 
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 16);
} events SEC(".maps");
 
SEC("kprobe/tcp_connect")
int BPF_KPROBE(trace_connect, struct sock *sk)
{
    struct conn_event *e;
    u16 family;
 
    // Only IPv4
    family = BPF_CORE_READ(sk, __sk_common.skc_family);
    if (family != AF_INET)
        return 0;
 
    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;
 
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->daddr = BPF_CORE_READ(sk, __sk_common.skc_daddr);
    e->dport = bpf_ntohs(BPF_CORE_READ(sk, __sk_common.skc_dport));
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
 
    bpf_ringbuf_submit(e, 0);
    return 0;
}
 
char LICENSE[] SEC("license") = "GPL";

BPF_CORE_READ: A CO-RE (Compile Once, Run Everywhere) macro. Instead of dereferencing a pointer with offsets fixed at compile time (fragile, because kernel struct layouts change between versions), BPF_CORE_READ records which field is being accessed and lets libbpf use BTF to resolve the field offset at load time, so the same binary runs correctly across many kernel versions.

eBPF Maps — Shared Data Structures

Maps are the mechanism through which an eBPF program (kernel side) and userspace share data. They are also how an eBPF program keeps state between invocations.

Map types

Map Type	Characteristics	Use Case
`HASH`	Hash table, key→value	Per-PID counters, connection state
`ARRAY`	Fixed-size array indexed by u32	Global stats, configuration values
`PERCPU_HASH`	Hash table with a separate value per CPU	High-frequency counters, no lock needed
`PERCPU_ARRAY`	Array with a separate value per CPU	Per-CPU stats
`LRU_HASH`	Hash table with LRU eviction	Connection tracking under a memory cap
`LPM_TRIE`	Longest Prefix Match trie	IP routing, CIDR matching
`RINGBUF`	MPSC ring buffer	Sending events from kernel to userspace (preferred)
`PERF_EVENT_ARRAY`	Perf event buffer	Legacy event streaming
`QUEUE` / `STACK`	FIFO / LIFO	Packet queues
`SOCKHASH` / `SOCKMAP`	Socket reference maps	Socket redirection
`ARRAY_OF_MAPS`	Map whose values are other maps	Dynamic dispatch

Hash Map: a practical example, counting syscalls per process

// syscall_counter.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
 
// Map: PID (u32) -> syscall count (u64)
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);
    __type(value, u64);
} syscall_count SEC(".maps");
 
SEC("tracepoint/raw_syscalls/sys_enter")
int count_syscalls(struct trace_event_raw_sys_enter *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *count;
 
    count = bpf_map_lookup_elem(&syscall_count, &pid);
    if (count) {
        // Atomic increment
        __sync_fetch_and_add(count, 1);
    } else {
        // First time for this PID
        u64 init = 1;
        bpf_map_update_elem(&syscall_count, &pid, &init, BPF_NOEXIST);
    }
    return 0;
}

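Reading from userspace:

// userspace
#include <stdio.h>
#include <bpf/bpf.h>   // bpf_map_get_next_key(), bpf_map_lookup_elem()
 
// get_comm() is a placeholder for this example: it should return the
// contents of /proc/<pid>/comm, or "?" if the process has already exited.
const char *get_comm(__u32 pid);
 
void print_top_processes(int map_fd)
{
    __u32 *prev = NULL, key, next_key;
    __u64 value;
 
    printf("%-8s %-20s %s\n", "PID", "COMM", "SYSCALLS");
    // Passing NULL as the previous key returns the first key in the map
    while (bpf_map_get_next_key(map_fd, prev, &next_key) == 0) {
        if (bpf_map_lookup_elem(map_fd, &next_key, &value) == 0)
            printf("%-8u %-20s %llu\n", next_key, get_comm(next_key),
                   (unsigned long long)value);
        key = next_key;
        prev = &key;
    }
}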

Per-CPU Array: high-performance counters

Per-CPU arrays avoid contention because each CPU has its own copy of every value, so no locking is needed:

// Declaration
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 4); // 4 counters: rx_pkts, rx_bytes, tx_pkts, tx_bytes
    __type(key, u32);
    __type(value, u64);
} stats SEC(".maps");
 
// In the XDP program
SEC("xdp")
int count_packets(struct xdp_md *ctx)
{
    u32 key = 0; // rx_pkts counter
    u64 *count = bpf_map_lookup_elem(&stats, &key);
    if (count)
        (*count)++;  // No lock needed! Per-CPU
 
    key = 1; // rx_bytes counter
    u64 *bytes = bpf_map_lookup_elem(&stats, &key);
    u32 pkt_size = ctx->data_end - ctx->data;
    if (bytes)
        *bytes += pkt_size;
 
    return XDP_PASS;
}

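Reading from userspace requires summing the values of all CPUs:

// Read a per-CPU array from userspace
// Needs <stdio.h>, <stdlib.h>, <bpf/bpf.h> and <bpf/libbpf.h>
void read_percpu_stats(int map_fd)
{
    int num_cpus = libbpf_num_possible_cpus();
    if (num_cpus <= 0)
        return;
 
    // One 8-byte slot per possible CPU
    __u64 *values = calloc(num_cpus, sizeof(__u64));
    if (!values)
        return;
 
    // key=0 = rx_pkts
    __u32 key = 0;
    __u64 total_pkts = 0;
    if (bpf_map_lookup_elem(map_fd, &key, values) == 0) {
        for (int i = 0; i < num_cpus; i++)
            total_pkts += values[i];
    }
 
    printf("Total packets: %llu\n", (unsigned long long)total_pkts);
    free(values);
}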
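One detail worth spelling out when reading per-CPU maps: a lookup from userspace copies one value per possible CPU, and the kernel pads each value to a multiple of 8 bytes. The plain C sketch below computes the buffer size a caller would need; `percpu_value_buf_size` is a hypothetical helper name, and in real code the CPU count would come from `libbpf_num_possible_cpus()`.

```c
#include <stddef.h>

/*
 * Size of the userspace buffer needed to read one element of a
 * per-CPU map: each CPU's value is padded up to a multiple of 8 bytes.
 */
static size_t percpu_value_buf_size(size_t value_size,
                                    size_t num_possible_cpus)
{
    size_t padded = (value_size + 7) & ~(size_t)7;

    return padded * num_possible_cpus;
}
```

For a `u64` value the padding is a no-op, which is why a plain `u64` array works in the reader above. With a 4-byte or 12-byte value, a buffer sized as `value_size * num_cpus` would be too small and the lookup would write past its end.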

Ring Buffer — Event Streaming

The ring buffer is the recommended way to send events from the kernel to userspace:

// Kernel side
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20); // 1MB
} rb SEC(".maps");
 
struct net_event {
    u64 ts;
    u32 pid;
    u32 daddr;
    u16 dport;
    u8  comm[16];
};
 
SEC("kprobe/tcp_connect")
int BPF_KPROBE(on_connect, struct sock *sk)
{
    struct net_event *e;
 
    // bpf_ringbuf_reserve: allocate space (lock-free)
    e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e)
        return 0;  // ring buffer full, drop event
 
    e->ts    = bpf_ktime_get_ns();
    e->pid   = bpf_get_current_pid_tgid() >> 32;
    e->daddr = BPF_CORE_READ(sk, __sk_common.skc_daddr);
    e->dport = bpf_ntohs(BPF_CORE_READ(sk, __sk_common.skc_dport));
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
 
    // bpf_ringbuf_submit: publish event (visible to userspace)
    bpf_ringbuf_submit(e, 0);
    return 0;
}

// Userspace polling
static int handle_event(void *ctx, void *data, size_t data_sz)
{
    const struct net_event *e = data;
    char ip[16];
    inet_ntop(AF_INET, &e->daddr, ip, sizeof(ip));
    printf("[%llu] pid=%-6u comm=%-16s -> %s:%u\n",
           e->ts, e->pid, e->comm, ip, e->dport);
    return 0;
}
 
// Setup ring buffer
struct ring_buffer *rb = ring_buffer__new(
    bpf_map__fd(skel->maps.rb),
    handle_event, NULL, NULL);
 
// Event loop
while (!stop) {
    ring_buffer__poll(rb, 100);
}

LPM Trie — IP CIDR Matching

Extremely useful for firewall rules and routing:

// IP blocklist with CIDR support
struct lpm_key {
    __u32 prefixlen;
    __u32 addr;
};
 
struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(max_entries, 1024);
    __type(key, struct lpm_key);
    __type(value, u64);
    __uint(map_flags, BPF_F_NO_PREALLOC);
} ip_blocklist SEC(".maps");
 
// In the XDP program
SEC("xdp")
int drop_blocked(struct xdp_md *ctx)
{
    // ... parse IP header ...
    struct lpm_key key = {
        .prefixlen = 32,
        .addr = ip->saddr,  // Source IP
    };
    if (bpf_map_lookup_elem(&ip_blocklist, &key))
        return XDP_DROP;
    return XDP_PASS;
}

From userspace, add a network to the blocklist:

// Block 192.168.1.0/24
struct lpm_key key = {
    .prefixlen = 24,
    .addr = inet_addr("192.168.1.0"),
};
u64 value = 1;
bpf_map_update_elem(map_fd, &key, &value, BPF_ANY);
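The lookup in the XDP program always passes `prefixlen = 32`, and the trie returns the entry with the longest prefix that covers the address. The plain C sketch below reproduces that matching rule with a linear scan so the semantics are easy to see. It is not how the kernel implements `LPM_TRIE`; `struct cidr_rule`, `prefix_mask`, and `lpm_lookup` are hypothetical names, and addresses are kept in host byte order here for readability (the real map key holds them in network byte order).

```c
#include <stddef.h>
#include <stdint.h>

/* One blocklist entry; addr is in host byte order in this sketch. */
struct cidr_rule {
    uint32_t addr;
    unsigned int prefixlen;
    uint64_t value;
};

/* Netmask for a prefix length; /0 and /32 avoid out-of-range shifts. */
static uint32_t prefix_mask(unsigned int prefixlen)
{
    if (prefixlen == 0)
        return 0;
    if (prefixlen >= 32)
        return UINT32_MAX;
    return UINT32_MAX << (32 - prefixlen);
}

/* Return the matching rule with the longest prefix, or NULL. */
static const struct cidr_rule *lpm_lookup(const struct cidr_rule *rules,
                                          size_t n, uint32_t addr)
{
    const struct cidr_rule *best = NULL;

    for (size_t i = 0; i < n; i++) {
        uint32_t mask = prefix_mask(rules[i].prefixlen);

        if ((addr & mask) != (rules[i].addr & mask))
            continue;
        if (!best || rules[i].prefixlen > best->prefixlen)
            best = &rules[i];
    }
    return best;
}
```

With rules for 10.0.0.0/8 and 10.1.0.0/16, the address 10.1.2.3 matches both and the /16 wins, while 10.2.3.4 falls back to the /8. That is what lets a blocklist mix broad ranges with more specific entries.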

Map operations from userspace

// CRUD operations on a map
int map_fd = bpf_map__fd(skel->maps.my_map);
 
// Lookup
u32 key = pid;
u64 value;
int ret = bpf_map_lookup_elem(map_fd, &key, &value);
if (ret == 0) printf("Found: %llu\n", value);
 
// Update (BPF_ANY: create or update, BPF_NOEXIST: only create, BPF_EXIST: only update)
u64 new_value = 100;
bpf_map_update_elem(map_fd, &key, &new_value, BPF_ANY);
 
// Delete
bpf_map_delete_elem(map_fd, &key);
 
// Iterate all entries
u32 cur_key = 0, next_key;
while (bpf_map_get_next_key(map_fd, &cur_key, &next_key) == 0) {
    bpf_map_lookup_elem(map_fd, &next_key, &value);
    // process entry
    cur_key = next_key;
}

XDP — eXpress Data Path

XDP is the lowest hook point in the Linux network stack: it runs as soon as a packet reaches the NIC driver, before the kernel allocates an sk_buff. That is why XDP can reach tens of millions of packets per second on a single CPU core.

text

NIC hardware
    |
    | DMA transfer
    v
[XDP hook] <-- eBPF program runs HERE
    | PASS / DROP / TX / REDIRECT / ABORTED
    v
sk_buff allocation
    |
    v
Network stack (IP layer, TCP/UDP, ...)
    |
    v
Socket receive buffer
    |
    v
Application (read/recv)

XDP Actions

Action	Meaning
`XDP_PASS`	Let the packet continue into the kernel network stack
`XDP_DROP`	Drop the packet immediately (extremely fast)
`XDP_TX`	Send the packet back out of the interface it arrived on (hairpin)
`XDP_REDIRECT`	Redirect to another interface or another CPU
`XDP_ABORTED`	Drop and fire a tracepoint (signals a program error; for debugging)

Example 1: Packet Counter

// xdp_counter.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
 
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 5);  // one slot per XDP action
    __type(key, u32);
    __type(value, u64);
} action_stats SEC(".maps");
 
SEC("xdp")
int xdp_counter(struct xdp_md *ctx)
{
    u32 action = XDP_PASS;
    u32 key = action;
    u64 *count = bpf_map_lookup_elem(&action_stats, &key);
    if (count)
        (*count)++;
    return action;
}
 
char LICENSE[] SEC("license") = "GPL";
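The counter above uses the XDP action itself as the array index, so `action_stats` has five slots. When printing those counters from userspace, it helps to map each index back to its name. The sketch below does that in plain C; the numeric values follow `enum xdp_action` in the kernel UAPI header, while `xdp_action_name` is a hypothetical helper written for this example.

```c
#include <stddef.h>

/* Index = value of enum xdp_action from <linux/bpf.h>. */
static const char *const xdp_action_names[] = {
    "XDP_ABORTED",  /* 0 */
    "XDP_DROP",     /* 1 */
    "XDP_PASS",     /* 2 */
    "XDP_TX",       /* 3 */
    "XDP_REDIRECT", /* 4 */
};

/* Translate a slot index into a printable action name. */
static const char *xdp_action_name(unsigned int action)
{
    size_t n = sizeof(xdp_action_names) / sizeof(xdp_action_names[0]);

    return action < n ? xdp_action_names[action] : "UNKNOWN";
}
```

Note that `XDP_PASS` is 2, not 0, so the program above only ever increments slot 2. A reader that loops over all five keys and labels them this way makes that obvious in the output.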

bash

# Load and attach to interface eth0
sudo ip link set dev eth0 xdp obj xdp_counter.bpf.o sec xdp
 
# Or use xdp-loader
sudo xdp-loader load eth0 xdp_counter.bpf.o
 
# Remove
sudo ip link set dev eth0 xdp off

Example 2: IP Blocklist (DDoS mitigation)

// xdp_blocklist.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
 
#define ETH_P_IP 0x0800  // IPv4 EtherType; vmlinux.h carries types only, not macros
 
struct lpm_key {
    __u32 prefixlen;
    __u32 addr;
};
 
struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(max_entries, 65536);
    __type(key, struct lpm_key);
    __type(value, u64);
    __uint(map_flags, BPF_F_NO_PREALLOC);
} blocklist SEC(".maps");
 
// Stats: [0]=passed, [1]=dropped
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 2);
    __type(key, u32);
    __type(value, u64);
} stats SEC(".maps");
 
static __always_inline void update_stat(u32 key)
{
    u64 *val = bpf_map_lookup_elem(&stats, &key);
    if (val)
        (*val)++;
}
 
SEC("xdp")
int xdp_block(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx->data_end;
    void *data     = (void *)(long)ctx->data;
 
    // Parse Ethernet header
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
 
    // Only process IPv4
    if (bpf_ntohs(eth->h_proto) != ETH_P_IP) {
        update_stat(0);
        return XDP_PASS;
    }
 
    // Parse IP header
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
 
    // LPM lookup
    struct lpm_key key = {
        .prefixlen = 32,
        .addr      = ip->saddr,
    };
 
    if (bpf_map_lookup_elem(&blocklist, &key)) {
        update_stat(1);
        return XDP_DROP;
    }
 
    update_stat(0);
    return XDP_PASS;
}
 
char LICENSE[] SEC("license") = "GPL";
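The XDP program above repeats one pattern: compute where a header would end, compare against `data_end`, and only then read a field. The plain C sketch below applies the same discipline to an ordinary byte buffer so it can be exercised outside the kernel. It also honors the IPv4 header length field (IHL), which the XDP example skips by assuming a 20-byte header. `parse_ipv4_saddr` and `ETHERTYPE_IP4` are hypothetical names for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HLEN      14      /* Ethernet header length, no VLAN tag */
#define ETHERTYPE_IP4 0x0800  /* IPv4 EtherType */

/*
 * Copy the IPv4 source address (network byte order) out of a raw frame.
 * Every read is preceded by a check against data_end.
 */
static bool parse_ipv4_saddr(const uint8_t *data, const uint8_t *data_end,
                             uint32_t *saddr)
{
    if (data_end < data || (size_t)(data_end - data) < ETH_HLEN + 20)
        return false;  /* too short for Ethernet + minimal IPv4 */

    uint16_t ethertype = (uint16_t)(data[12] << 8 | data[13]);
    if (ethertype != ETHERTYPE_IP4)
        return false;

    const uint8_t *ip = data + ETH_HLEN;
    if ((ip[0] >> 4) != 4)
        return false;  /* IP version field */

    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;  /* header length in bytes */
    if (ihl < 20 || (size_t)(data_end - ip) < ihl)
        return false;  /* malformed or truncated header */

    memcpy(saddr, ip + 12, sizeof(*saddr));  /* saddr is at offset 12 */
    return true;
}
```

In a real XDP program the verifier enforces these checks: remove one and the program is rejected at load time. In ordinary C nothing stops the out-of-bounds read, which is exactly the class of bug the verifier exists to prevent.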

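Managing the blocklist from userspace:

// Add an IP or network to the blocklist
// Needs <stdio.h>, <arpa/inet.h> and <bpf/bpf.h>
int add_to_blocklist(int map_fd, const char *cidr)
{
    struct lpm_key key;
    struct in_addr addr;
    __u64 value = 1;
    char ip_str[32];
    int prefix;
 
    // Bound the %[ conversion so a long input cannot overflow ip_str
    if (sscanf(cidr, "%31[^/]/%d", ip_str, &prefix) != 2)
        return -1;
    if (prefix < 0 || prefix > 32 || inet_pton(AF_INET, ip_str, &addr) != 1)
        return -1;
 
    // /0 is handled separately: shifting a 32-bit value by 32 is undefined
    __u32 mask = prefix == 0 ? 0 : htonl(~0u << (32 - prefix));
    key.prefixlen = prefix;
    key.addr = addr.s_addr & mask;
 
    return bpf_map_update_elem(map_fd, &key, &value, BPF_ANY);
}
 
// Block an entire /24
add_to_blocklist(map_fd, "192.168.100.0/24");
// Block a single IP
add_to_blocklist(map_fd, "10.0.0.1/32");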

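Example 3: Rate Limiter

// xdp_ratelimit.bpf.c: limit the packet rate per source IP
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>  // bpf_ntohs
 
#define ETH_P_IP 0x0800      // IPv4 EtherType; not provided by vmlinux.h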
 
#define RATE_LIMIT_PPS 1000  // packets per second per IP
 
struct rate_limit_val {
    u64 last_ts;   // timestamp of last reset
    u64 pkt_count; // packets since last reset
};
 
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 100000);
    __type(key, u32);   // source IP
    __type(value, struct rate_limit_val);
} rate_map SEC(".maps");
 
SEC("xdp")
int xdp_ratelimit(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx->data_end;
    void *data     = (void *)(long)ctx->data;
 
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
 
    if (bpf_ntohs(eth->h_proto) != ETH_P_IP)
        return XDP_PASS;
 
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
 
    u32 src_ip = ip->saddr;
    u64 now = bpf_ktime_get_ns();
 
    struct rate_limit_val *val = bpf_map_lookup_elem(&rate_map, &src_ip);
    if (val) {
        // Reset counter every second
        if (now - val->last_ts >= 1000000000ULL) {
            val->last_ts = now;
            val->pkt_count = 1;
        } else if (val->pkt_count >= RATE_LIMIT_PPS) {
            return XDP_DROP;  // Exceeded rate limit
        } else {
            val->pkt_count++;
        }
    } else {
        // New IP
        struct rate_limit_val new_val = {
            .last_ts   = now,
            .pkt_count = 1,
        };
        bpf_map_update_elem(&rate_map, &src_ip, &new_val, BPF_ANY);
    }
 
    return XDP_PASS;
}
 
char LICENSE[] SEC("license") = "GPL";
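The rate limiter above is a fixed-window counter: each source IP gets a one-second window and a packet count. The plain C sketch below isolates that decision logic so its behavior can be checked without a NIC. `struct window` and `window_allow` are hypothetical names; the logic mirrors the branches of the XDP program, with the limit passed as a parameter.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct window {
    uint64_t start_ns;  /* start of the current one-second window */
    uint64_t count;     /* packets accepted in this window */
};

/* Return true if the packet arriving at now_ns is within the limit. */
static bool window_allow(struct window *w, uint64_t now_ns, uint64_t limit)
{
    if (now_ns - w->start_ns >= NSEC_PER_SEC) {
        w->start_ns = now_ns;  /* this packet opens a new window */
        w->count = 1;
        return true;
    }
    if (w->count >= limit)
        return false;          /* over the limit: drop */
    w->count++;
    return true;
}
```

Two properties of this design are worth knowing. A sender can pass up to twice the limit in a short burst that straddles a window boundary, which a token bucket would smooth out. And in the XDP version the map value is shared by all CPUs and updated without atomics, so under heavy parallel traffic from one source the count is approximate rather than exact.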

XDP Redirect — Packet Forwarding

// Redirect a packet to another interface
SEC("xdp")
int xdp_redirect(struct xdp_md *ctx)
{
    // Redirect to the interface with ifindex 3
    return bpf_redirect(3, 0);
}
 
// Redirect through a DEVMAP: the map value is the target ifindex.
// (Redirecting to another CPU uses BPF_MAP_TYPE_CPUMAP instead.)
struct {
    __uint(type, BPF_MAP_TYPE_DEVMAP);
    __uint(max_entries, 64);
    __type(key, u32);
    __type(value, u32);
} tx_port SEC(".maps");
 
SEC("xdp")
int xdp_redirect_map(struct xdp_md *ctx)
{
    u32 key = 0;
    return bpf_redirect_map(&tx_port, key, XDP_PASS);
}

XDP Performance

Cloudflare has reported XDP dropping roughly 10-20 million packets per second per CPU core, compared with about 1-2 Mpps for iptables. That gap is why XDP has become a standard building block for DDoS mitigation.

text

Throughput comparison (single core, 10GbE NIC):

  iptables DROP:  ~1-2 Mpps
  XDP DROP:      ~10-20 Mpps
  XDP (HW offload): ~100+ Mpps (with supported NIC)

TC Hooks and eBPF for Security

TC (Traffic Control): more flexible than XDP

TC hooks attach one layer above XDP, after the sk_buff has been allocated, which lets you read and modify the packet more fully, including its metadata.

text

XDP hook  <-- packet arrives, BEFORE sk_buff
    |
    v
sk_buff allocation
    |
    v
[TC ingress hook] <-- eBPF attaches here (ingress)
    |
    v
IP layer (routing decision)
    |
    v
[TC egress hook]  <-- eBPF attaches here (egress)
    |
    v
NIC driver (transmit)

XDP vs TC:

Feature	XDP	TC
Hook location	In the NIC driver, on receive	After sk_buff allocation
Overhead	Very low	Low
sk_buff access	No	Yes
Egress support	Limited	Full (ingress + egress)
Packet modification	Basic	Full
Use case	DDoS drop, forwarding	Policy enforcement, packet mangling

Example: TC ingress, logging packets

// tc_trace.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
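// UAPI headers such as <linux/pkt_cls.h> redefine types already in vmlinux.h,
// and vmlinux.h carries no macros, so these constants are defined by hand.
#define TC_ACT_OK 0
#define ETH_P_IP  0x0800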
 
struct pkt_info {
    u32 src_ip;
    u32 dst_ip;
    u16 src_port;
    u16 dst_port;
    u8  proto;
};
 
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 16);
} events SEC(".maps");
 
SEC("tc")
int tc_ingress(struct __sk_buff *skb)
{
    void *data_end = (void *)(long)skb->data_end;
    void *data     = (void *)(long)skb->data;
 
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return TC_ACT_OK;
 
    if (bpf_ntohs(eth->h_proto) != ETH_P_IP)
        return TC_ACT_OK;
 
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return TC_ACT_OK;
 
    struct pkt_info *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return TC_ACT_OK;
 
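    e->src_ip   = ip->saddr;
    e->dst_ip   = ip->daddr;
    e->proto    = ip->protocol;
    // Ring buffer memory is not zeroed, so give the ports a defined value
    e->src_port = 0;
    e->dst_port = 0;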
 
    if (ip->protocol == IPPROTO_TCP) {
        struct tcphdr *tcp = (void *)(ip + 1);
        if ((void *)(tcp + 1) <= data_end) {
            e->src_port = bpf_ntohs(tcp->source);
            e->dst_port = bpf_ntohs(tcp->dest);
        }
    }
 
    bpf_ringbuf_submit(e, 0);
    return TC_ACT_OK;
}
 
char LICENSE[] SEC("license") = "GPL";

bash

# Load TC eBPF program
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress bpf da obj tc_trace.bpf.o sec tc
 
# Show current filters
tc filter show dev eth0 ingress
 
# Remove
tc filter del dev eth0 ingress
tc qdisc del dev eth0 clsact

TC Return Values

Value	Number	Meaning
`TC_ACT_OK`	0	Accept the packet and continue processing
`TC_ACT_SHOT`	2	Drop the packet
`TC_ACT_PIPE`	3	Continue with the next action, if any
`TC_ACT_REDIRECT`	7	Redirect

eBPF for Security: BPF LSM

BPF LSM (Linux Security Module) lets you write security policy in eBPF and attach it to the security hooks in the kernel.

LSM (Linux Security Module): The Linux kernel framework for pluggable security policy. SELinux and AppArmor are LSM implementations. BPF LSM lets you write an LSM in eBPF, which can be more flexible than SELinux policy and easier to roll out than AppArmor profiles.

Common LSM hooks

// Some BPF LSM hook points
lsm/file_open        // When a file is opened
lsm/bprm_check_security  // When a binary is executed
lsm/socket_connect   // When a socket connects
lsm/task_kill        // When a signal is sent
lsm/socket_create    // When a socket is created
lsm/inode_mkdir      // When a directory is created

Example: Block unauthorized processes from opening a sensitive file

// security_file.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include <errno.h>
 
// Processes allowed to read /etc/shadow
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 64);
    __type(key, u32);   // PID
    __type(value, u8);
} allowed_pids SEC(".maps");
 
SEC("lsm/file_open")
int BPF_PROG(restrict_shadow, struct file *file)
{
    char fname[32] = {};
    const unsigned char *name;
 
    // Get the file's basename (last path component)
    name = BPF_CORE_READ(file, f_path.dentry, d_name.name);
    bpf_probe_read_kernel_str(fname, sizeof(fname), name);
 
    // Only the basename is compared, so any file named exactly "shadow"
    // matches, not just /etc/shadow. Comparing 7 bytes includes the
    // terminating NUL, so "shadow-" or "shadow.bak" do not match.
    if (__builtin_memcmp(fname, "shadow", 7) != 0)
        return 0;  // Not a shadow file, allow
 
    u32 pid = bpf_get_current_pid_tgid() >> 32;
 
    // Check if PID is in allowed list
    u8 *allowed = bpf_map_lookup_elem(&allowed_pids, &pid);
    if (allowed)
        return 0;  // Allowed
 
    // Block access - return EPERM
    bpf_printk("Blocked access to shadow by pid=%u\n", pid);
    return -EPERM;
}
 
char LICENSE[] SEC("license") = "GPL";

bash

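# Enable BPF LSM: needs a kernel built with CONFIG_BPF_LSM=y and "bpf" in the lsm= boot parameter
# GRUB_CMDLINE_LINUX="lsm=bpf,lockdown,capability,yama"
# (keep any LSMs your distro already enables, such as apparmor or selinux, in that list)
# Or check whether it is already active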
cat /sys/kernel/security/lsm

Example: Block execution of a disallowed binary

// deny_exec.bpf.c: block specific binaries from running (same includes and LICENSE as security_file.bpf.c)
SEC("lsm/bprm_check_security")
int BPF_PROG(check_exec, struct linux_binprm *bprm)
{
    char filename[256];
    u8  comm[16];
    u32 pid = bpf_get_current_pid_tgid() >> 32;
 
    // Get the binary being executed
    bpf_probe_read_kernel_str(&filename, sizeof(filename),
        BPF_CORE_READ(bprm, filename));
 
    bpf_get_current_comm(&comm, sizeof(comm));
 
    // Block execution of certain tools by non-root
    if (__builtin_memcmp(filename, "/usr/bin/nc", 11) == 0) {
        uid_t uid = bpf_get_current_uid_gid() & 0xFFFFFFFF;
        if (uid != 0) {
            bpf_printk("Blocked: pid=%u comm=%s tried to exec nc\n",
                       pid, comm);
            return -EPERM;
        }
    }
 
    return 0;
}
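
Two details in the LSM examples are easy to get wrong: a fixed-length byte comparison is only a prefix match unless the length covers the terminating NUL, and both `bpf_get_current_pid_tgid()` and `bpf_get_current_uid_gid()` pack two 32-bit values into one 64-bit return value. The user-space sketch below illustrates both points. All function names in it are hypothetical and exist only for this example.

```c
#include <stdint.h>
#include <string.h>

/* Prefix match: looks only at the first strlen(want) bytes of path. */
static int matches_prefix(const char *path, const char *want)
{
    return strncmp(path, want, strlen(want)) == 0;
}

/* Exact match: one extra byte, so the terminating NUL is compared too. */
static int matches_exact(const char *path, const char *want)
{
    return strncmp(path, want, strlen(want) + 1) == 0;
}

/* bpf_get_current_pid_tgid(): TGID in the upper 32 bits, thread ID in the lower. */
static uint32_t tgid_of(uint64_t pid_tgid) { return (uint32_t)(pid_tgid >> 32); }
static uint32_t tid_of(uint64_t pid_tgid)  { return (uint32_t)pid_tgid; }

/* bpf_get_current_uid_gid(): GID in the upper 32 bits, UID in the lower. */
static uint32_t gid_of(uint64_t uid_gid) { return (uint32_t)(uid_gid >> 32); }
static uint32_t uid_of(uint64_t uid_gid) { return (uint32_t)(uid_gid & 0xFFFFFFFF); }
```

With these definitions, `matches_prefix("/usr/bin/ncat", "/usr/bin/nc")` is true while `matches_exact` on the same arguments is false, which is the difference between comparing 11 and 12 bytes in the BPF program.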

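Example: Seccomp-BPF (syscall filtering)

Seccomp-BPF is a simpler way to restrict the syscalls of one specific process. Note that seccomp filters are written in classic BPF (cBPF), not eBPF, so they have no maps or helper calls:

// User space: restrict the syscalls available to the calling process
#include <stddef.h>          // offsetof
#include <linux/audit.h>     // AUDIT_ARCH_X86_64
#include <linux/filter.h>    // struct sock_filter, BPF_STMT, BPF_JUMP
#include <linux/seccomp.h>   // struct seccomp_data, SECCOMP_RET_*
#include <sys/prctl.h>
#include <sys/syscall.h>     // __NR_* syscall numbers
 
// Returns 0 on success, -1 on failure (errno is set by prctl)
static int apply_seccomp_filter(void)
{
    struct sock_filter filter[] = {
        // Load architecture
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        // Allow only x86-64: skip the kill below when the arch matches
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
 
        // Load syscall number
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
 
        // Allow read, write, exit, exit_group: each match jumps forward
        // to the SECCOMP_RET_ALLOW at the end of the filter
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 4, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 3, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
        // Kill everything else
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
 
    struct sock_fprog prog = {
        .len    = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };
 
    // Required so that an unprivileged process may install a filter
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}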
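
The jump offsets in a seccomp filter are relative to the instruction that follows the jump, which makes them easy to miscount. One way to check the layout is to run it through a tiny interpreter. The sketch below is a deliberately simplified model: the `insn` and `sc_data` structs, the `OP_*` opcodes, and the `RET_*` values are hypothetical stand-ins for illustration, not the real cBPF encodings. The syscall numbers and the architecture constant are the x86-64 values.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for struct sock_filter and struct seccomp_data. */
struct insn    { uint16_t code; uint8_t jt, jf; uint32_t k; };
struct sc_data { int32_t nr; uint32_t arch; };

enum { OP_LD_ABS, OP_JEQ, OP_RET };   /* not real cBPF opcode values */
enum { RET_KILL = 0, RET_ALLOW = 1 }; /* not the real SECCOMP_RET_* values */
enum { NR_READ = 0, NR_WRITE = 1, NR_EXIT = 60, NR_EXIT_GROUP = 231 };
#define ARCH_X86_64 0xC000003Eu       /* numeric value of AUDIT_ARCH_X86_64 */

/* Same layout as the filter above: arch check, then four allowed syscalls. */
static const struct insn demo_filter[] = {
    { OP_LD_ABS, 0, 0, offsetof(struct sc_data, arch) },
    { OP_JEQ,    1, 0, ARCH_X86_64 },
    { OP_RET,    0, 0, RET_KILL },
    { OP_LD_ABS, 0, 0, offsetof(struct sc_data, nr) },
    { OP_JEQ,    4, 0, NR_READ },
    { OP_JEQ,    3, 0, NR_WRITE },
    { OP_JEQ,    2, 0, NR_EXIT },
    { OP_JEQ,    1, 0, NR_EXIT_GROUP },
    { OP_RET,    0, 0, RET_KILL },
    { OP_RET,    0, 0, RET_ALLOW },
};

static uint32_t run_filter(const struct insn *prog, size_t len,
                           const struct sc_data *d)
{
    uint32_t acc = 0; /* the single cBPF accumulator */
    for (size_t pc = 0; pc < len; pc++) {
        const struct insn *i = &prog[pc];
        switch (i->code) {
        case OP_LD_ABS: /* load a 32-bit field of the syscall data */
            acc = (i->k == offsetof(struct sc_data, arch))
                      ? d->arch : (uint32_t)d->nr;
            break;
        case OP_JEQ:    /* skip jt or jf instructions; the loop's pc++
                           makes the offset relative to the next insn */
            pc += (acc == i->k) ? i->jt : i->jf;
            break;
        case OP_RET:
            return i->k;
        }
    }
    return RET_KILL; /* fell off the end of the program */
}

static uint32_t check_syscall(int32_t nr, uint32_t arch)
{
    struct sc_data d = { nr, arch };
    return run_filter(demo_filter,
                      sizeof(demo_filter) / sizeof(demo_filter[0]), &d);
}
```

Running `check_syscall` for each allowed syscall should land on the final allow instruction, while any other syscall number (for example 12, which is `brk` on x86-64) or a foreign architecture should hit one of the kill instructions.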

Calico + eBPF: Kubernetes Security

Calico's eBPF mode implements network policy with TC hooks and eBPF maps instead of iptables chains:

text

Kubernetes NetworkPolicy -> Calico controller
                                |
                          translates to
                                |
                          eBPF TC hooks    +    eBPF Maps (LPM Trie for CIDR)
                                |
                          policy lookup cost independent of rule count
                          (vs O(n) linear iptables scan)

With 1000 pods and many NetworkPolicy objects:

iptables: tens of thousands of rules, evaluated by a linear O(n) scan
Calico eBPF: hash map + LPM trie. A hash lookup is O(1) on average and an LPM trie lookup is bounded by the prefix length (at most 32 steps for IPv4), so neither grows with the number of rules
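
To make "longest prefix match" concrete, here is a small user-space sketch of the lookup semantics that an LPM trie provides for CIDR rules. It uses a linear scan purely to show which rule wins; the kernel's `BPF_MAP_TYPE_LPM_TRIE` reaches the same answer by walking a trie. The rule table, the verdict values, and all names are assumptions made up for this example and are not taken from Calico.

```c
#include <stddef.h>
#include <stdint.h>

/* Build an IPv4 address in host byte order from its four octets. */
#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

struct cidr_rule {
    uint32_t addr;      /* network address, host byte order */
    uint8_t  prefixlen; /* 0..32 */
    int      verdict;   /* 0 = deny, 1 = allow (hypothetical encoding) */
};

static uint32_t prefix_mask(uint8_t prefixlen)
{
    /* Shifting a 32-bit value by 32 is undefined, so handle /0 separately. */
    return prefixlen == 0 ? 0 : 0xFFFFFFFFu << (32 - prefixlen);
}

/* Return the verdict of the most specific matching rule, or dflt if none. */
static int lpm_lookup(const struct cidr_rule *rules, size_t n,
                      uint32_t ip, int dflt)
{
    int best_len = -1;
    int verdict = dflt;
    for (size_t i = 0; i < n; i++) {
        uint32_t mask = prefix_mask(rules[i].prefixlen);
        if ((ip & mask) == (rules[i].addr & mask) &&
            rules[i].prefixlen > best_len) {
            best_len = rules[i].prefixlen;
            verdict = rules[i].verdict;
        }
    }
    return verdict;
}

static const struct cidr_rule demo_rules[] = {
    { IP4(10, 0, 0, 0),  8, 0 }, /* deny  10.0.0.0/8  */
    { IP4(10, 1, 0, 0), 16, 1 }, /* allow 10.1.0.0/16 */
    { IP4(10, 1, 2, 3), 32, 0 }, /* deny  10.1.2.3/32 */
};

static int demo_lookup(uint32_t ip)
{
    return lpm_lookup(demo_rules,
                      sizeof(demo_rules) / sizeof(demo_rules[0]), ip, -1);
}
```

Here 10.1.5.5 matches both the /8 and the /16 rule and takes the verdict of the /16, because the longer prefix is more specific.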

eBPF Internals: Verifier, JIT, and Instruction Set

This part is for readers who want to understand how eBPF actually works under the hood, from bytecode to machine code.

eBPF Instruction Set

Each basic eBPF instruction has a fixed size of 8 bytes (64 bits). The one exception is the 64-bit immediate load, which occupies two consecutive 8-byte slots:

text

eBPF Instruction encoding (64-bit / 8 bytes):

  Byte 0     Byte 1     Byte 2-3   Byte 4-7
  +--------+ +--------+ +--------+ +--------+
  | opcode | |dst |src | | offset | | imm32  |
  | 8 bits | |4b  |4b  | | 16 bit | | 32 bit |
  +--------+ +--------+ +--------+ +--------+

Opcode structure:

text

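opcode (8 bits):
  bits [2:0] = instruction class:
    000 = BPF_LD    (non-standard loads, e.g. 64-bit immediate)
    001 = BPF_LDX   (load from memory into a register)
    010 = BPF_ST    (store immediate)
    011 = BPF_STX   (store register)
    100 = BPF_ALU   (32-bit arithmetic)
    101 = BPF_JMP   (64-bit jumps)
    110 = BPF_JMP32 (32-bit jumps)
    111 = BPF_ALU64 (64-bit arithmetic)
  bits [7:3] depend on the class:
    ALU / JMP classes:    [7:4] = operation (add, mov, jeq, call, exit, ...)
                          [3]   = source operand (0 = imm32, 1 = src register)
    load / store classes: [7:5] = mode (imm, abs, mem, atomic, ...)
                          [4:3] = size (W, H, B, DW)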

Commonly used instruction macros (from the kernel's include/linux/filter.h):

text

BPF_MOV64_REG(dst, src)              -- dst = src           (64-bit move)
BPF_MOV64_IMM(dst, imm)              -- dst = imm           (immediate)
BPF_ALU64_IMM(BPF_ADD, dst, imm)     -- dst += imm          (add)
BPF_JMP_IMM(BPF_JEQ, dst, imm, off)  -- if dst == imm goto pc+off (relative to the next instruction)
BPF_LDX_MEM(BPF_DW, dst, src, off)   -- dst = *(u64 *)(src+off)
BPF_STX_MEM(BPF_DW, dst, src, off)   -- *(u64 *)(dst+off) = src
BPF_EMIT_CALL(func)                  -- call helper function
BPF_EXIT_INSN()                      -- return r0

Example: disassembling a simple eBPF program:

bash

# Compile BPF program
clang -O2 -target bpf -c simple.bpf.c -o simple.bpf.o
 
# Disassemble
llvm-objdump -d simple.bpf.o

text

Output:
simple.bpf.o:   file format elf64-bpf

Disassembly of section tracepoint/syscalls/sys_enter_execve:

0000000000000000 <trace_execve>:
       0:       85 00 00 00 0e 00 00 00 call 14    ; bpf_get_current_pid_tgid()
       1:       77 00 00 00 20 00 00 00 r0 >>= 32  ; get PID (upper 32 bits)
       2:       63 0a f8 ff 00 00 00 00 *(u32 *)(r10 - 8) = r0  ; store PID on stack
       3:       bf a1 00 00 00 00 00 00 r1 = r10   ; r1 = frame pointer
       4:       07 01 00 00 f0 ff ff ff r1 += -16  ; r1 = &comm buffer
       ...
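
The raw bytes in the llvm-objdump output above can be decoded by hand using the 8-byte layout described earlier. The sketch below does that in plain C for a little-endian BPF object (the `bpfel` target). The struct and function names are hypothetical, chosen only for this example; the byte arrays are copied from the disassembly.

```c
#include <stdint.h>

/* Decoded view of one 8-byte eBPF instruction. */
struct ebpf_insn {
    uint8_t opcode;  /* byte 0 */
    uint8_t dst_reg; /* byte 1, low 4 bits */
    uint8_t src_reg; /* byte 1, high 4 bits */
    int16_t off;     /* bytes 2-3, little-endian, signed */
    int32_t imm;     /* bytes 4-7, little-endian, signed */
};

static struct ebpf_insn decode_insn(const uint8_t raw[8])
{
    struct ebpf_insn in;
    in.opcode  = raw[0];
    in.dst_reg = raw[1] & 0x0f;
    in.src_reg = raw[1] >> 4;
    in.off = (int16_t)((uint16_t)raw[2] | ((uint16_t)raw[3] << 8));
    in.imm = (int32_t)((uint32_t)raw[4] | ((uint32_t)raw[5] << 8) |
                       ((uint32_t)raw[6] << 16) | ((uint32_t)raw[7] << 24));
    return in;
}

/* The low 3 bits of the opcode select the instruction class. */
static uint8_t insn_class(uint8_t opcode) { return opcode & 0x07; }

/* Instructions 0, 2, 3 and 4 from the disassembly above. */
static const uint8_t insn_call[8]  = {0x85, 0x00, 0x00, 0x00, 0x0e, 0x00, 0x00, 0x00};
static const uint8_t insn_store[8] = {0x63, 0x0a, 0xf8, 0xff, 0x00, 0x00, 0x00, 0x00};
static const uint8_t insn_mov[8]   = {0xbf, 0xa1, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
static const uint8_t insn_add[8]   = {0x07, 0x01, 0x00, 0x00, 0xf0, 0xff, 0xff, 0xff};
```

Decoding `insn_store` gives class 3 (BPF_STX), destination register 10, source register 0, and offset -8, which matches `*(u32 *)(r10 - 8) = r0`. Decoding `insn_call` gives class 5 (BPF_JMP) with immediate 14, the helper ID shown as `call 14`.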

BPF Verifier — Deep Dive

The verifier performs abstract interpretation: it simulates every code path using abstract values (types and value ranges) instead of concrete values.

text

Verifier state for each instruction:

  regs[0..10] = {
    type:    SCALAR / PTR_TO_MAP_VALUE / PTR_TO_CTX / NOT_INIT / ...
    value:   known value (if constant)
    range:   [min_value, max_value] (if bounded)
    off:     offset from base pointer
    id:      unique ID for pointer tracking
  }
  stack[0..511] = similar state per stack slot
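
The range part of this state is what decides whether a memory access is accepted. The sketch below is a toy model of that idea, assuming a single signed [min, max] interval per scalar; the real verifier tracks separate signed and unsigned bounds plus known bits, so treat this as an illustration of the principle only. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy abstract value: every concrete value lies within [min, max]. */
struct scalar_range { int64_t min, max; };

/* A value the analysis knows nothing about, e.g. read from a packet. */
static struct scalar_range range_unknown(void)
{
    return (struct scalar_range){ INT64_MIN, INT64_MAX };
}

/* Effect of `var &= mask` for a non-negative constant mask:
 * the result is always within [0, mask], whatever var was before. */
static struct scalar_range range_and_const(struct scalar_range r, int64_t mask)
{
    struct scalar_range out = { 0, mask };
    if (r.min >= 0 && r.max < mask)
        out.max = r.max; /* an already tighter non-negative bound survives */
    return out;
}

/* Accept an access of access_size bytes at offset off into a value of
 * value_size bytes only if it is in bounds for every possible offset. */
static bool access_ok(struct scalar_range off, int64_t access_size,
                      int64_t value_size)
{
    return off.min >= 0 && off.max <= value_size - access_size;
}
```

In this model an unknown offset is rejected for a 64-byte value, while the same offset after `&= 63` is accepted for a 1-byte access. This is why masking or an explicit bounds check often satisfies the verifier.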

Example of a verifier rejection:

// BUG: missing bounds check
SEC("xdp")
int bad_program(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
 
    struct ethhdr *eth = data;
    // BUG: h_proto is read without first proving it lies inside the packet
    __u16 proto = eth->h_proto;  // VERIFIER REJECTS THIS
    // The value is used so that clang cannot drop the load as dead code
    return proto ? XDP_PASS : XDP_DROP;
}
 
// FIX: add the bounds check
SEC("xdp")
int good_program(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
 
    struct ethhdr *eth = data;
    // Check that the whole header fits: (void *)eth + sizeof(*eth) <= data_end
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
 
    __u16 proto = eth->h_proto;  // OK!
    return proto ? XDP_PASS : XDP_DROP;
}
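
The check in good_program is the same discipline any packet parser needs: prove the bytes are inside the buffer before reading them. Below is a user-space sketch of that pattern, with a hypothetical `read_ethertype` helper and a made-up 14-byte demo frame. It compares lengths rather than forming a pointer past the end of the buffer, which plain C does not allow.

```c
#include <stddef.h>
#include <stdint.h>

/* 6-byte destination MAC + 6-byte source MAC + 2-byte EtherType */
#define ETH_HEADER_LEN 14

/* Returns the EtherType in host byte order, or -1 if the frame is too
 * short to contain a full Ethernet header. */
static int read_ethertype(const uint8_t *data, const uint8_t *data_end)
{
    if (data_end - data < ETH_HEADER_LEN)
        return -1;                     /* bounds check before any read */
    return (data[12] << 8) | data[13]; /* EtherType is big-endian on the wire */
}

/* Demo frame: broadcast destination, arbitrary source, EtherType 0x0800 (IPv4). */
static const uint8_t demo_frame[ETH_HEADER_LEN] = {
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0x02, 0x00, 0x00, 0x00, 0x00, 0x01,
    0x08, 0x00,
};

/* Parse only the first len bytes of the demo frame. */
static int ethertype_of_prefix(size_t len)
{
    return read_ethertype(demo_frame, demo_frame + len);
}
```

The difference in eBPF is that the verifier enforces this statically: a program that could read past `data_end` on any path is refused at load time instead of failing at run time.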

Common verifier error messages:

text

"invalid mem access 'inv'"   (newer kernels print 'scalar' instead of 'inv')
  -> Dereferencing a register that holds a plain number rather than a valid pointer

"R1 min value is negative, either use unsigned or 'var &= const'"
  -> Pointer arithmetic with an offset that may be negative and has not been bounds-checked

"back-edge from insn X to Y"
  -> Backward jump that the verifier does not accept as a bounded loop

"value -1 makes map_value pointer be out of bounds"
  -> Map value access outside the size of the value

"math between map_value pointer and register with unbounded min value"
  -> Pointer arithmetic with a register whose range is unbounded

Debugging the verifier with a verbose log:

bash

# -d (--debug) prints libbpf debug output and the full verifier log
sudo bpftool -d prog load myebpf.bpf.o /sys/fs/bpf/myebpf \
    2>&1 | head -100

JIT Compiler

Once the verifier accepts the program, the kernel JIT-compiles the eBPF bytecode into native machine code.

bash

# Enable the JIT (usually on by default)
echo 1 | sudo tee /proc/sys/net/core/bpf_jit_enable
 
# Inspect a loaded program before and after JIT compilation
sudo bpftool prog dump xlated id <prog_id>     # eBPF bytecode after verification
sudo bpftool prog dump jited id <prog_id>      # Native assembly

Sample dump jited output:

text

int trace_execve(...):
; u32 pid = bpf_get_current_pid_tgid() >> 32;
   0:    push   rbp
   1:    mov    rbp, rsp
   4:    push   rbx
   5:    push   r13
   7:    push   r14
   9:    push   r15
   b:    sub    rsp, 0x28
   f:    call   0xffffffffce4a1c70  ; bpf_get_current_pid_tgid
  14:    shr    rax, 0x20           ; >> 32 to get PID
  18:    mov    dword ptr [rbp - 0x4], eax  ; store on stack

JIT-compiled code runs at close to the speed of native C code, with no interpreter overhead.

BTF — BPF Type Format

BTF is a compact type-information (debug info) format for eBPF, and it is what makes CO-RE (Compile Once, Run Everywhere) possible:

bash

# Dump the kernel's BTF
bpftool btf dump file /sys/kernel/btf/vmlinux | head -50
 
# Dump the BTF of a loaded BPF program
bpftool btf dump prog id <prog_id>

text

# Sample BTF output for struct sock (illustrative; type IDs, sizes, and offsets differ per kernel build)
# Names such as sk_node or sk_hash are C macros for fields of __sk_common,
# so they do not appear as members of struct sock itself.
[156] STRUCT 'sock' size=760 vlen=44
    '__sk_common' type_id=163 bits_offset=0
    ...

How CO-RE works:

// Without CO-RE, a direct access bakes in the offset from the headers you compiled against:
// __be16 dport = sk->__sk_common.skc_dport;  // BAD: offset hardcoded
 
// With BPF_CORE_READ:
__be16 dport = BPF_CORE_READ(sk, __sk_common.skc_dport);  // network byte order
// At load time, libbpf:
// 1. Reads the BTF of the running kernel
// 2. Looks up the actual offset of skc_dport in that kernel
// 3. Patches the instruction to use that offset
// -> The program works on any BTF-enabled kernel that still has this field
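
The three load-time steps can be modelled in a few lines of ordinary C. The sketch below is a toy: two hypothetical layouts of the same struct stand in for two kernel builds, a small name-to-offset table stands in for kernel BTF, and `relocate` plays the role of libbpf patching an instruction. None of these names come from libbpf, and real relocations match on type names and full field access paths rather than a single string.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The "same" struct as laid out by two hypothetical kernel builds. */
struct task_v1 { int32_t pid; char comm[16]; };
struct task_v2 { uint64_t new_field; int32_t pid; char comm[16]; };

/* Stand-in for kernel BTF: field names and their byte offsets. */
struct field_info { const char *name; size_t offset; };

static const struct field_info btf_v1[] = {
    { "pid",  offsetof(struct task_v1, pid)  },
    { "comm", offsetof(struct task_v1, comm) },
};
static const struct field_info btf_v2[] = {
    { "new_field", offsetof(struct task_v2, new_field) },
    { "pid",       offsetof(struct task_v2, pid)       },
    { "comm",      offsetof(struct task_v2, comm)      },
};

/* A load instruction whose offset is filled in at load time. */
struct load_insn { const char *field; int16_t off; };

/* Patch insn->off from the target's table; -1 if the field is missing. */
static int relocate(struct load_insn *insn,
                    const struct field_info *btf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(btf[i].name, insn->field) == 0) {
            insn->off = (int16_t)btf[i].offset;
            return 0;
        }
    }
    return -1;
}

/* Resolve one field against "kernel" 1 or 2; returns the offset or -1. */
static int relocated_offset(const char *field, int kernel_version)
{
    struct load_insn insn = { field, 0 };
    int rc = (kernel_version == 1)
        ? relocate(&insn, btf_v1, sizeof(btf_v1) / sizeof(btf_v1[0]))
        : relocate(&insn, btf_v2, sizeof(btf_v2) / sizeof(btf_v2[0]));
    return rc == 0 ? insn.off : -1;
}
```

The same compiled "instruction" reads `pid` at offset 0 on the first layout and at offset 8 on the second. A field missing from the target fails to relocate, which is the situation a check like `bpf_core_field_exists()` is meant to guard against.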

Inspecting loaded programs

bash

# List all BPF programs
sudo bpftool prog list
 
# Details of one program
sudo bpftool prog show id 42
 
# List maps
sudo bpftool map list
 
# Dump a map's contents
sudo bpftool map dump id 5
 
# Pin a map so that other processes can share it
sudo bpftool map pin id 5 /sys/fs/bpf/mymap

text

# Sample bpftool prog list output (illustrative)
42: tracepoint  name trace_execve  tag abc123def456  gpl
    loaded_at 2024-01-01T00:00:00+0000  uid 0
    xlated 320B  jited 248B  memlock 4096B  map_ids 7,8
    btf_id 45

Modern eBPF: BTF, CO-RE, and the newest features

CO-RE: Compile Once, Run Everywhere

Before CO-RE, you had to compile an eBPF program on the target machine, because it needed kernel headers matching the running kernel. CO-RE removes that requirement:

text

Before CO-RE:
  Source.c -> compile against kernel 5.4 -> binary only runs on kernel 5.4

With CO-RE:
  Source.c -> compile once -> binary runs on kernel 5.4, 5.15, 6.1, 6.12
                                     (as long as the kernel has BTF)

CO-RE workflow:

bash

# 1. Generate vmlinux.h from the running kernel
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
 
# 2. Compile with -g so that clang emits BTF and CO-RE relocations
clang -g -O2 -target bpf \
    -D__TARGET_ARCH_x86 \
    -I. \
    -c program.bpf.c -o program.bpf.o
 
# 3. Distribute the .bpf.o file; it loads on any kernel that has BTF and the fields the program reads

CO-RE macros:

// Read a field from a kernel struct (CO-RE safe)
u32 pid = BPF_CORE_READ(task, pid);
 
// Read through a nested struct
__be16 dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
 
// Check whether a field exists (cross-version compatibility)
if (bpf_core_field_exists(struct task_struct, on_cpu)) {
    // This kernel has the field
    int on_cpu = BPF_CORE_READ(task, on_cpu);
}
 
// Read a string
char filename[256];
BPF_CORE_READ_STR_INTO(&filename, bprm, filename);

BPF Skeleton: Modern Loading Pattern

A skeleton is an auto-generated header that replaces hand-written map and program loading code:

bash

# Generate skeleton
bpftool gen skeleton program.bpf.o > program.skel.h

// Generated skeleton (program.skel.h), simplified
struct program_bpf {
    struct bpf_object_skeleton *skeleton;
    struct bpf_object *obj;
    struct {
        struct bpf_map *events;            // BPF map
        struct bpf_map *config;
    } maps;
    struct {
        struct bpf_program *trace_execve;  // BPF program
    } progs;
    struct {
        struct bpf_link *trace_execve;     // Link to hook
    } links;
    struct program_bpf__rodata {
        int config_value;                  // one member per const global
    } *rodata;
};
 
// API:
struct program_bpf *skel;
skel = program_bpf__open();        // Open (before load); NULL on failure
// Adjust read-only globals before load:
skel->rodata->config_value = 42;
program_bpf__load(skel);           // Load into kernel; 0 on success
program_bpf__attach(skel);         // Attach to hooks; 0 on success
program_bpf__destroy(skel);        // Detach, unload, and free

BPF Global Variables (Rodata)

// In the BPF program: declare global configuration
const volatile u32 min_duration_ns = 0;
const volatile bool filter_by_comm = false;
 
SEC("tracepoint/syscalls/sys_enter_execve")
int trace_execve(void *ctx)
{
    u64 duration = get_duration();  // hypothetical helper defined elsewhere in the program
    if (min_duration_ns && duration < min_duration_ns)
        return 0;  // Skip short-lived processes
    // ...
}

// In user space: set the values before load
skel = program_bpf__open();
skel->rodata->min_duration_ns = 100 * 1000 * 1000;  // 100ms
skel->rodata->filter_by_comm = true;
program_bpf__load(skel);

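BPF Tokens: Unprivileged eBPF (Kernel 6.9+)

Loading an eBPF program normally requires CAP_BPF (CAP_SYS_ADMIN before kernel 5.8), usually together with CAP_PERFMON or CAP_NET_ADMIN depending on the program type. BPF tokens let an administrator delegate a subset of these rights to an otherwise unprivileged process:

// Admin: mount a BPF filesystem for the workload's user namespace and list
// the commands to delegate:
// mount -t bpf none /sys/fs/bpf -o delegate_cmds=prog_load:map_create
 
// Unprivileged process: derive a token from that mount and pass it when loading.
// Sketch assuming libbpf 1.4+; check the signatures against your libbpf version.
int bpffs_fd = open("/sys/fs/bpf", O_RDONLY | O_DIRECTORY);
int token_fd = bpf_token_create(bpffs_fd, NULL);
LIBBPF_OPTS(bpf_prog_load_opts, opts,
    .token_fd = token_fd,
    .prog_flags = BPF_F_TOKEN_FD,
);
int prog_fd = bpf_prog_load(prog_type, "my_prog", "GPL", insns, insns_cnt, &opts);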

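BPF Arena: Shared Memory (Kernel 6.9+)

BPF Arena: A memory region shared between eBPF programs and user space. User space mmap()s the arena, so for streaming large amounts of data it can be faster than maps, since no per-access syscall or copy is needed.

// Kernel side: allocate pages from the arena.
// __arena is the address-space annotation from the kernel selftests'
// bpf_arena_common.h; this sketch assumes that header or an equivalent.
#define NUMA_NO_NODE (-1)
void __arena *bpf_arena_alloc_pages(void *map, void __arena *addr,
                                    __u32 page_cnt, int node_id, __u64 flags) __ksym;
 
struct {
    __uint(type, BPF_MAP_TYPE_ARENA);
    __uint(map_flags, BPF_F_MMAPABLE);  // required for arenas
    __uint(max_entries, 256);           // size in pages: 256 * 4 KiB = 1 MiB
} arena SEC(".maps");
 
// bpf_arena_alloc_pages() was introduced as a sleepable kfunc, so this
// sketch uses a sleepable program type rather than XDP
SEC("syscall")
int arena_demo(void *ctx)
{
    // Ask the kernel for one page anywhere in the arena
    void __arena *buf = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
    if (!buf)
        return 0;
 
    // Write data directly to the arena; it is visible to user space
    // ...
    return 0;
}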

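sched_ext: Custom CPU Scheduler (Kernel 6.12+)

sched_ext: A scheduler class that lets you implement CPU scheduling algorithms in eBPF. There is no kernel rebuild; you load a BPF scheduler into the running kernel.

// Minimal BPF scheduler skeleton (a sketch of the structure, not a tuned scheduler)
// Note: scx_bpf_dispatch() was renamed scx_bpf_dsq_insert() in kernel 6.13.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <scx/common.bpf.h>  // BPF_STRUCT_OPS and scx_bpf_* declarations (from the scx headers)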
 
// Simple FIFO scheduler
struct {
    __uint(type, BPF_MAP_TYPE_QUEUE);
    __uint(max_entries, 4096);
    __type(value, u32);  // task PID
} runq SEC(".maps");
 
// Called when task becomes runnable
void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
{
    s32 pid = p->pid;
    bpf_map_push_elem(&runq, &pid, 0);
}
 
// Called to select next task to run
void BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev)
{
    s32 pid;
    if (bpf_map_pop_elem(&runq, &pid) == 0) {
        struct task_struct *p = bpf_task_from_pid(pid);
        if (p) {
            scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
            bpf_task_release(p);
        }
    }
}
 
// Scheduler metadata
SEC(".struct_ops.link")
struct sched_ext_ops my_sched = {
    .enqueue  = (void *)sched_enqueue,
    .dispatch = (void *)sched_dispatch,
    .name     = "my_fifo_scheduler",
};
 
char LICENSE[] SEC("license") = "GPL";

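bash

# Load the custom scheduler. scx schedulers usually ship a small user-space
# loader built on the skeleton; bpftool can also register a struct_ops
# object and pin its link:
sudo bpftool struct_ops register my_sched.bpf.o /sys/fs/bpf/my_sched
# Tasks in the default scheduling classes (SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE) now use your scheduler
cat /sys/kernel/sched_ext/state
 
# Restore the default scheduler by removing the pinned link
sudo rm -r /sys/fs/bpf/my_sched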

Use cases for sched_ext:

Gaming: reduce latency by prioritizing game threads
Cloud: isolate noisy-neighbor workloads
Real-time: custom latency-sensitive scheduling
Research: try new scheduling algorithms without patching the kernel

BPF Exceptions (Kernel 6.7+)

// Today: check every access and bail out explicitly
struct iphdr *ip = data + sizeof(struct ethhdr);
if ((void *)(ip + 1) > data_end)
    return XDP_PASS;
 
// With BPF exceptions:
// the bpf_throw() kfunc aborts the program immediately and unwinds its stack,
// so assertion-style checks can replace some explicit error paths
// (the feature is still evolving)

Rust + eBPF: Aya framework

toml

# Writing an eBPF program in Rust with Aya
# Cargo.toml of the eBPF crate (aya-bpf was renamed aya-ebpf when it was published to crates.io)
[package]
name = "my-ebpf"
version = "0.1.0"
edition = "2021"
 
[dependencies]
aya-ebpf = "0.1"
aya-log-ebpf = "0.1"

rust

// src/main.rs (kernel side)
#![no_std]
#![no_main]
 
use aya_ebpf::{
    macros::tracepoint,
    programs::TracePointContext,
    helpers::bpf_get_current_pid_tgid,
};
use aya_log_ebpf::info;
 
#[tracepoint]
pub fn trace_execve(ctx: TracePointContext) -> u32 {
    let pid = (bpf_get_current_pid_tgid() >> 32) as u32;
    info!(&ctx, "execve called by pid={}", pid);
    0
}
 
#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}

rust

// User space (src/main.rs of the loader crate)
use aya::{Bpf, include_bytes_aligned, programs::TracePoint};
use tokio::signal;
 
#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let mut bpf = Bpf::load(include_bytes_aligned!(
        "../../target/bpfel-unknown-none/release/my-ebpf"
    ))?;
 
    let program: &mut TracePoint = bpf.program_mut("trace_execve")
        .unwrap()
        .try_into()?;
    program.load()?;
    program.attach("syscalls", "sys_enter_execve")?;
 
    println!("Tracing execve... Ctrl+C to stop");
    signal::ctrl_c().await?;
    Ok(())
}

Conclusion

Having gone through eBPF layer by layer, from bpftrace one-liners to XDP, TC, LSM, and verifier internals, here are the points worth remembering:

When to use what

Tool	When to use it
`bpftrace`	Ad-hoc tracing, debugging a production issue on the spot
`bcc tools`	Using ready-made tools (execsnoop, tcplife, biolatency, ...)
`libbpf + skeleton`	Building a complete eBPF tool that you need to distribute
`XDP`	High-performance packet filtering, DDoS mitigation
`TC hooks`	Packet modification, policy enforcement
`BPF LSM`	Security policy at the kernel level
`sched_ext`	Custom CPU scheduling

Key mental model

text

eBPF program:
  - Attach to a hook point (kprobe, XDP, tracepoint, ...)
  - Execute on every event at that hook
  - Communicate with userspace via Maps or Ring Buffer
  - Must pass Verifier (safety guarantee)
  - Compiled to native code by JIT (performance guarantee)

Maps:
  - Shared memory between kernel programs and userspace
  - Multiple types for different access patterns (hash, array, LPM, ringbuf, ...)
  - Persist across program invocations
  - Can be shared between multiple eBPF programs

Checklist for debugging an eBPF program

bash

# 1. Get the detailed verifier error (-d prints the full verifier log)
sudo bpftool -d prog load prog.bpf.o /sys/fs/bpf/prog 2>&1
 
# 2. List loaded programs
sudo bpftool prog list
 
# 3. Inspect the JIT-compiled code
sudo bpftool prog dump jited id <ID>
 
# 4. Dump a map's contents
sudo bpftool map dump id <ID>
 
# 5. Debug with bpf_printk (read from the trace pipe)
sudo cat /sys/kernel/debug/tracing/trace_pipe
 
# 6. Check whether BTF is available
ls /sys/kernel/btf/vmlinux
 
# 7. Check runtime stats and memory limits
cat /proc/sys/kernel/bpf_stats_enabled  # 1 = per-program run time and run count are collected
ulimit -l  # locked memory limit (matters for maps on kernels before 5.11)

Further reading

Besides the articles in the References section, these tools and repositories are useful:

libbpf-bootstrap: Template project for eBPF development with libbpf
bcc/tools: 70+ production-ready eBPF tools
bpftrace/tools: bpftrace one-liner scripts
cilium/ebpf: Go library for writing eBPF loaders
Aya: Rust eBPF framework
scx: Reference BPF schedulers built on sched_ext

eBPF is one of the fastest-moving parts of the Linux kernel, and nearly every kernel release adds new capabilities. If you work on infrastructure, observability, or security, it is well worth learning.

References

Tổng hợp tài nguyên học eBPF (eBPF learning resources roundup) - A curated list of the best blogs, tutorials, and books for learning eBPF
Container Networking from Scratch - Building container networking from the ground up with veth pairs, bridges, NAT, and CNI plugins
Kubernetes CNI Overview - An overview of CNI plugins in Kubernetes
