google · phlaie · Dec 12, 2025 · Dec 13, 2025 · Dec 14, 2025
diff --git a/pocs/linux/kernelctf/CVE-2025-37890_lts_cos_mitigation/docs/exploit.md b/pocs/linux/kernelctf/CVE-2025-37890_lts_cos_mitigation/docs/exploit.md
@@ -0,0 +1,335 @@
+# CVE-2025-37890
+## Overview
+First, we use a prefetch side channel to bypass KASLR. Then, we use the re-entrancy bug to doubly activate an hfsc class. This causes the hfsc qdisc to mistrack the class's reference count, leaving us with a UAF hfsc class. We can then perform a write-what-where to achieve ROP.
+For the mitigation instance, we use the same exploit technique as [CVE-2025-37798](https://github.com/google/security-research/blob/7e4f27632c6eeb08380e4d1fb1f73c1296253603/pocs/linux/kernelctf/CVE-2025-37798_lts_cos_mitigation/docs/exploit.md#mitigation-exploit).
+
+## Exploiting double class activation
+The bug allows us to doubly activate a new child (class) in a classful qdisc. While this effect is possible across a variety of classful qdiscs, it is most useful in an hfsc qdisc. As explained in [vulnerability.md](vulnerability.md), the re-entrancy bug allows us to run the new-class activation routine in `hfsc_enqueue()` on the same class twice.
+
+```c
+	if (first) {
+		if (cl->cl_flags & HFSC_RSC)
+			init_ed(cl, len);
+		if (cl->cl_flags & HFSC_FSC)
+			init_vf(cl, len);
+		/*
+		 * If this is the first packet, isolate the head so an eventual
+		 * head drop before the first dequeue operation has no chance
+		 * to invalidate the deadline.
+		 */
+		if (cl->cl_flags & HFSC_RSC)
+			cl->qdisc->ops->peek(cl->qdisc);
+
+	}
+```
+
+When we configure the hfsc class with the `HFSC_FSC` flag, `init_vf()` is used to initialize the class.
+```c
+static void
+init_vf(struct hfsc_class *cl, unsigned int len)
+{
+	struct hfsc_class *max_cl;
+	struct rb_node *n;
+	u64 vt, f, cur_time;
+	int go_active;
+
+	cur_time = 0;
+	go_active = 1;
+	for (; cl->cl_parent != NULL; cl = cl->cl_parent) {                      // [1]
+		if (go_active && cl->cl_nactive++ == 0)                              // [2]
+			go_active = 1;
+		else
+			go_active = 0;
+
+		if (go_active) {
+			// [...]
+```
+
+At [1], a for loop traverses up the hierarchy of hfsc classes, starting from the new class. The new class's parent is the root class (with the same class id as the hfsc qdisc's handle), which was automatically created when the hfsc qdisc was initialized. In the loop, the class's `cl_nactive` reference count is incremented at [2] if `go_active` is true. Since `go_active` is initially set to `1`, the `cl_nactive` count is always incremented for the new class. As a result, activating the same class twice (incorrectly) sets its `cl_nactive` reference count to `2`.
+
+To understand why this reference count is important, let's look at the class deletion process. First, when the child qdisc is emptied, the hfsc qdisc's `.qlen_notify` handler `hfsc_qlen_notify()` is triggered.
+```c
+static void
+hfsc_qlen_notify(struct Qdisc *sch, unsigned long arg)
+{
+	struct hfsc_class *cl = (struct hfsc_class *)arg;
+
+	/* vttree is now handled in update_vf() so that update_vf(cl, 0, 0)
+	 * needs to be called explicitly to remove a class from vttree.
+	 */
+	update_vf(cl, 0, 0);
+	if (cl->cl_flags & HFSC_RSC)
+		eltree_remove(cl);
+}
+```
+
+For our FSC class, the important call is to `update_vf()`.
+```c
+static void
+update_vf(struct hfsc_class *cl, unsigned int len, u64 cur_time)
+{
+	u64 f; /* , myf_bound, delta; */
+	int go_passive = 0;
+
+	if (cl->qdisc->q.qlen == 0 && cl->cl_flags & HFSC_FSC)                   // [3]
+		go_passive = 1;
+
+	for (; cl->cl_parent != NULL; cl = cl->cl_parent) {
+		cl->cl_total += len;
+
+		if (!(cl->cl_flags & HFSC_FSC) || cl->cl_nactive == 0)
+			continue;
+
+		if (go_passive && --cl->cl_nactive == 0)                             // [4]
+			go_passive = 1;
+		else
+			go_passive = 0;
+
+		/* update vt */
+		cl->cl_vt = rtsc_y2x(&cl->cl_virtual, cl->cl_total) + cl->cl_vtadj;
+
+		/*
+		 * if vt of the class is smaller than cvtmin,
+		 * the class was skipped in the past due to non-fit.
+		 * if so, we need to adjust vtadj.
+		 */
+		if (cl->cl_vt < cl->cl_parent->cl_cvtmin) {
+			cl->cl_vtadj += cl->cl_parent->cl_cvtmin - cl->cl_vt;
+			cl->cl_vt = cl->cl_parent->cl_cvtmin;
+		}
+
+		if (go_passive) {                                                    // [5]
+			/* no more active child, going passive */
+
+			/* update cvtoff of the parent class */
+			if (cl->cl_vt > cl->cl_parent->cl_cvtoff)
+				cl->cl_parent->cl_cvtoff = cl->cl_vt;
+
+			/* remove this class from the vt tree */
+			vttree_remove(cl);
+
+			cftree_remove(cl);
+			update_cfmin(cl->cl_parent);
+
+			continue;
+		}
+		// [...]
+```
+
+Since the FSC class is empty, `go_passive` is set to `1` at [3]. If `cl_nactive` were correctly `1`, the decrement at [4] would bring it to `0` and `go_passive` would stay `1`. Subsequently, at [5], the inactive class would be removed from the vt tree with `vttree_remove()`. However, since `cl_nactive` was previously incremented twice to `2`, the decrement only brings it to `1`, the check at [4] fails, and `vttree_remove()` is never called.
+
+When deleting the hfsc class, `hfsc_delete_class()` assumes that the class was already removed from the vt tree and does not check for it. The class is eventually freed in `hfsc_destroy_class()`, which leaves a dangling reference to the class in the vt tree.
+
+## UAF
+To exploit the dangling reference in the vt tree, we target `hfsc_dequeue()`. We construct an hfsc qdisc with two FSC child classes.
+```
+    1:0 (hfsc)
+   /   \
+1:1    1:2
+ |      |
+netem  pfifo
+```
+
+We first perform the double activation and class deletion on 1:1, which leaves it as a dangling pointer in the vt tree. Then, we enqueue a packet into 1:2. This triggers the `__qdisc_run` routine which, after enqueueing the packet, tries to dequeue a packet from the hfsc qdisc.
+
+```c
+static struct sk_buff *
+hfsc_dequeue(struct Qdisc *sch)
+{
+	struct hfsc_sched *q = qdisc_priv(sch);
+	struct hfsc_class *cl;
+	struct sk_buff *skb;
+	u64 cur_time;
+	unsigned int next_len;
+	int realtime = 0;
+
+	if (sch->q.qlen == 0)
+		return NULL;
+
+	cur_time = psched_get_time();
+
+	/*
+	 * if there are eligible classes, use real-time criteria.
+	 * find the class with the minimum deadline among
+	 * the eligible classes.
+	 */
+	cl = eltree_get_mindl(q, cur_time);                                      // [6]
+	if (cl) {
+		realtime = 1;
+	} else {
+		/*
+		 * use link-sharing criteria
+		 * get the class with the minimum vt in the hierarchy
+		 */
+		cl = vttree_get_minvt(&q->root, cur_time);                           // [7]
+		if (cl == NULL) {
+			qdisc_qstats_overlimit(sch);
+			hfsc_schedule_watchdog(sch);
+			return NULL;
+		}
+	}
+	// [...]
+```
+
+Because both of our classes are FSC and not RSC, there is no eligible class at [6]. Instead, the function falls back to the link-sharing criteria and looks for the class with the minimum vt in the vt tree at [7].
+```c
+static struct hfsc_class *
+vttree_get_minvt(struct hfsc_class *cl, u64 cur_time)
+{
+	/* if root-class's cfmin is bigger than cur_time nothing to do */
+	if (cl->cl_cfmin > cur_time)
+		return NULL;
+
+	while (cl->level > 0) {
+		cl = vttree_firstfit(cl, cur_time);                                  // [8]
+		if (cl == NULL)
+			return NULL;
+		/*
+		 * update parent's cl_cvtmin.
+		 */
+		if (cl->cl_parent->cl_cvtmin < cl->cl_vt)
+			cl->cl_parent->cl_cvtmin = cl->cl_vt;
+	}
+	return cl;
+}
+```
+
+The freed class 1:1 is still linked in the vt tree, so it is returned at [8], giving us a UAF.
+
+## LPE
+From this point on, there are many documented strategies to achieve LPE.
+
+For LTS and COS, we use the strategy outlined in [CVE-2023-4623](https://github.com/google/security-research/blob/66053d865bf43b3e8d379f41f353e3b125cf4524/pocs/linux/kernelctf/CVE-2023-4623_lts_cos/docs/exploit.md#write-what-where). There are 2 differences in our exploit: we reclaim with `struct user_key_payload` instead of `simple_xattr`, and we use a different ROP chain. First, in `spray_keyring()`, we reclaim the UAF class with `struct user_key_payload`, whose contents we can [control](https://bsauce.github.io/2021/09/26/kernel-exploit-%E6%9C%89%E7%94%A8%E7%9A%84%E7%BB%93%E6%9E%84%E4%BD%93/#5-add_key). The payload has an elastic size and is allocated with `GFP_KERNEL`, so it is allocated in the same cache as the `hfsc_class` (in fact, all qdisc classes are allocated with `GFP_KERNEL`).
+
+```c
+static int
+hfsc_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
+		  struct nlattr **tca, unsigned long *arg,
+		  struct netlink_ext_ack *extack)
+{
+	// [...]
+	cl = kzalloc(sizeof(struct hfsc_class), GFP_KERNEL);
+```
+
+```c
+int user_preparse(struct key_preparsed_payload *prep)
+{
+	struct user_key_payload *upayload;
+	size_t datalen = prep->datalen;
+
+	if (datalen <= 0 || datalen > 32767 || !prep->data)
+		return -EINVAL;
+
+	upayload = kmalloc(sizeof(*upayload) + datalen, GFP_KERNEL);
+```
+
+
+This method manipulates internal `hfsc_class` pointers to obtain an 8-byte write-what-where (the pointers are set in `prep_key_desc()`). We use the write-what-where to overwrite the `qfq_qdisc_ops.change()` pointer and perform ROP. Using the available ROP gadgets, we overwrite `core_pattern` with our program path using `copy_from_user()`, then simply call `msleep()`. Another thread of our exploit watches `/proc/sys/kernel/core_pattern`; once it notices the change, it crashes itself so that our exploit is executed with high privileges, giving us a root shell to get the flag.
+
+## Mitigation exploit
+The exploit method is similar to [CVE-2025-37798](https://github.com/google/security-research/blob/7e4f27632c6eeb08380e4d1fb1f73c1296253603/pocs/linux/kernelctf/CVE-2025-37798_lts_cos_mitigation/docs/exploit.md#mitigation-exploit) with a few minor tweaks. I will only outline the differences.
+
+The original attack uses a drr qdisc as the root qdisc. However, the double enqueue vulnerability triggers the drr activation routine twice, so the same class is added to the active list twice, leading to a double add splat.
+```c
+static int drr_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
+{
+	// ...
+	if (!cl_is_active(cl)) {
+		list_add_tail(&cl->alist, &q->active);
+		cl->deficit = cl->quantum;
+	}
+```
+
+Thus, we replace the drr parent with an hfsc parent. There is little functional difference. We tweak the exploit as follows:
+- hfsc root qdisc (1:0)
+- multiq qdisc (2:0) attached to hfsc class 1:1. This will contain the vulnerable qdisc tree
+- Some arbitrary qdisc (10:0) attached to hfsc class 1:2, just to keep hfsc level > 0 after deleting class 1:1
+
+Because the netem double enqueue sends the enqueue request to the root qdisc, in this case the hfsc (1:0), it is actually the hfsc-multiq relationship that is corrupted. Our goal is to delete the class that the multiq qdisc is attached to (1:1) while the hfsc qdisc still references it. Like in the original attack, we attach the vulnerable qdisc tree to a specific multiq class. The setup and trigger of the vulnerability are the same as in the LTS/COS exploit.
+
+Then, as before, we will reclaim the multiq `q->queues`.
+
+In the original method used for CVE-2025-37798, we used the chain: `drr_dequeue() -> multiq_peek() -> qdisc->peek()` for RIP hijack. Since we replaced the drr with hfsc, we will use the alternate codepath: `hfsc_dequeue() -> qdisc_dequeue_peeked() -> multiq_dequeue() -> qdisc->dequeue()`.
+
+```c
+static struct sk_buff *
+hfsc_dequeue(struct Qdisc *sch)
+{
+	struct hfsc_sched *q = qdisc_priv(sch);
+	struct hfsc_class *cl;
+	struct sk_buff *skb;
+	u64 cur_time;
+	unsigned int next_len;
+	int realtime = 0;
+
+	if (sch->q.qlen == 0)
+		return NULL;
+
+	cur_time = psched_get_time();
+
+	cl = eltree_get_mindl(q, cur_time);
+	if (cl) {
+		realtime = 1;
+	} else {
+		cl = vttree_get_minvt(&q->root, cur_time);                           // [9]
+		if (cl == NULL) {
+			qdisc_qstats_overlimit(sch);
+			hfsc_schedule_watchdog(sch);
+			return NULL;
+		}
+	}
+
+	skb = qdisc_dequeue_peeked(cl->qdisc);                                   // [10]
+	// ...
+}
+
+static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
+{
+	struct sk_buff *skb = skb_peek(&sch->gso_skb);                           // [11]
+
+	if (skb) {
+		skb = __skb_dequeue(&sch->gso_skb);
+		if (qdisc_is_percpu_stats(sch)) {
+			qdisc_qstats_cpu_backlog_dec(sch, skb);
+			qdisc_qstats_cpu_qlen_dec(sch);
+		} else {
+			qdisc_qstats_backlog_dec(sch, skb);
+			sch->q.qlen--;
+		}
+	} else {
+		skb = sch->dequeue(sch);                                             // [12]
+	}
+
+	return skb;
+}
+
+static struct sk_buff *multiq_dequeue(struct Qdisc *sch)
+{
+	struct multiq_sched_data *q = qdisc_priv(sch);
+	struct Qdisc *qdisc;
+	struct sk_buff *skb;
+	int band;
+
+	for (band = 0; band < q->bands; band++) {
+		q->curband++;
+		if (q->curband >= q->bands)
+			q->curband = 0;
+
+		if (!netif_xmit_stopped(
+		    netdev_get_tx_queue(qdisc_dev(sch), q->curband))) {
+			qdisc = q->queues[q->curband];
+			skb = qdisc->dequeue(qdisc);
+			// ...
+```
+
+At [9], the deleted class 1:1 is returned. At [10], `qdisc_dequeue_peeked()` is called on the freed multiq qdisc. At [11], `skb_peek()` returns `NULL` because `sch->gso_skb` is empty for the multiq qdisc (it is usually populated for non-work-conserving qdiscs in `qdisc_peek_dequeued()`). Execution therefore takes the branch at [12], calling the multiq qdisc's dequeue method `multiq_dequeue()`. In a similar fashion to `multiq_peek()` in the original exploit, we end up with a call to `qdisc->dequeue(qdisc)`, where `qdisc` is our forged qdisc pointer. The rest of the ROP chain is the same.
+
+In summary,
+1. Create qdisc (`setup_hfsc_multiq()`): hfsc (1:0), multiq (2:0)
+2. Determine band (`determine_band()`)
+3. Create qdisc: plug (10:0)
+4. Attach vulnerable set-up to 2:x (`setup_vuln_tree()`)
+5. Delete the hfsc class 1:1
+6. Reclaim `q->queues` (`do_spray_sendmsg()`)
+7. Trigger `hfsc_dequeue()` on UAF multiq (`trigger_uaf_miti()`)