Processes, Threads and the Clone Syscall
I want to explore how threads and processes work on Linux, under the hood, as in-depth as I can. It's been a while since I dove into the kernel. In this series of posts, I'll be writing some potentially insane C code to dive into this.
With pthread
Normally, if you want a thread in C, you'd use the pthreads (POSIX threads) API. Here is a quick example with pthreads.
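#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void *work(void *arg);

int main() {
    pthread_t ptid;
    if (pthread_create(&ptid, NULL, &work, NULL) != 0) {
        fprintf(stderr, "failed to create thread\n");
        return 1;
    }
    printf("I launched a thread!\n");
    // Wait for the thread to finish before main returns.
    pthread_join(ptid, NULL);
    return 0;
}

void *work(void *arg) {
    // The thread stays joinable (it is not detached) so that main can
    // pthread_join it; joining a detached thread is undefined behaviour.
    printf("Look ma! I'm a thread!\n");
    pthread_exit(NULL);
}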
Compile and run, and you will get the following:
$ gcc -pthread -o pthread_thread.o pthread_thread.c
$ ./pthread_thread.o
I launched a thread!
Look ma! I'm a thread!
The order of the prints depends on which thread gets to print first. Note too that printf is thread-safe here because the thread was created through glibc's pthread_create; more on that later. This seems pretty straightforward. How would we do it without using pthreads?
Without pthread
Note that the following will not be equivalent to the pthread call above. That does a lot more setup to make using the thread easier and safer than what we are going to see in this section. With that said, the clone syscall on Linux allows the creation of threads. It can be called directly or through libc. Let's do the latter. The signature for clone in libc looks as follows:
int clone (int (*fn)(void *arg), void *child_stack, int flags,
           void *arg, ... /* pid_t *parent_tid, void *tls, pid_t *child_tid */);
int (*fn)(void *arg) is a pointer to a function that will do the work in the thread, void *child_stack is a pointer to the top of the stack for the thread, int flags are the flags for this clone call (more on those later), and void *arg is a pointer to an argument to pass to this thread.
The rest of the arguments, defined with var args (...), we are not going to use. They are a pointer to where the child's thread ID should be stored in the parent's memory, a pointer to thread local storage for the new thread, and a pointer to where the child's thread ID should be stored in the child's memory. clone returns the child's thread ID on success and -1 on error.
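To make the four main parameters concrete, here is a minimal sketch that passes a value through arg and reads the child's result back. The names return_arg, run_child_with_arg and DEMO_STACK_SIZE are made up for this illustration. It also assumes different flags from the ones used later in this post: CLONE_VM | SIGCHLD creates a child that shares memory but is still a separate process, so the parent can collect it with waitpid.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (64 * 1024)

/* Hypothetical child function: reads an int through arg and returns it.
   Whatever fn returns becomes the exit status of the child. */
static int return_arg(void *arg) {
    return *(int *)arg;
}

/* Clone a child that runs return_arg(&value), wait for it, and return
   its exit status. Returns -1 if anything fails. */
int run_child_with_arg(int value) {
    char *stack = mmap(NULL, DEMO_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* clone wants the top of the stack, i.e. the highest address. */
    pid_t pid = clone(return_arg, stack + DEMO_STACK_SIZE,
                      CLONE_VM | SIGCHLD, &value);

    int status = 0;
    int result = -1;
    if (pid != -1 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    munmap(stack, DEMO_STACK_SIZE);
    return result;
}
```

Calling run_child_with_arg(42) from main should hand 42 back, as long as the value fits in the 8 bits of an exit status.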
Clone a Thread
First, what should our thread do? Let's just make it print something and then sleep for a bit.
// We need to define this before any #include so that <sched.h> declares clone.
#define _GNU_SOURCE
// We don't need all these headers just for thread_work, but we will need them later.
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

// clone expects a function that takes a void * argument and returns int.
int thread_work(void *arg) {
    // write(2) takes an explicit length, so pass exactly the message's bytes.
    const char *running = "running thread\n";
    write(STDOUT_FILENO, running, strlen(running));
    sleep(5);
    const char *finished = "finished thread\n";
    write(STDOUT_FILENO, finished, strlen(finished));
    return 0;
}
int main(int argc, char **argv) {}
All this does is print "running thread", sleep for 5 seconds, and then print "finished thread". We can't use printf here: it is not thread safe when called from a thread that wasn't created with pthread_create. More on why later.
For now, what's next? Well, we need to set up some stack space for the thread. Let's create a function for that.
char *alloc_stack(int stack_size) {
    char *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED) {
        perror("failed to allocate stack for thread");
        exit(1);
    }
    return stack;
}
There is a bit going on here! We call mmap to allocate memory for the stack. Let's break it down.
mmap
The signature for mmap is:
void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);
void *addr is a hint to the kernel about where it should place the mapping; passing NULL lets the kernel choose. size_t length is the length of the mapping in bytes. int prot defines the memory protection that should be applied. int flags takes a number of flags that determine how this mapping behaves; more on that in a moment. int fd is a file descriptor, since mmap can be used to memory map a file; it is ignored when MAP_ANONYMOUS is passed. Finally, off_t offset is the offset into the file at which to start mapping, if mmap is memory mapping a file.
We pass PROT_READ | PROT_WRITE as the value for prot:
PROT_READ allows pages in the mapping to be read.
PROT_WRITE allows pages in the mapping to be written to.
For flags we pass MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK:
MAP_PRIVATE marks the mapping as private (copy-on-write), so writes to it are not shared with other processes.
MAP_ANONYMOUS means this region is not backed by a file, and the fd argument to mmap is ignored.
MAP_STACK is currently a no-op on Linux. This is an interesting one. I could have left it out here, but it is recommended to add the flag. I've dedicated a section later in this post specifically to MAP_STACK.
Hopefully that makes sense. mmap returns the address of the mapping on success, or MAP_FAILED if it fails to allocate. So alloc_stack checks for MAP_FAILED and, if it sees it, logs an error with perror and exits. perror logs our message along with a description of the errno value that mmap set.
If mmap succeeds, alloc_stack returns a pointer to the lowest address of the stack region.
Time to Clone
Now we have a function to allocate a stack and a function that defines the work our thread should do. Time to clone; let's just do it in main. First up, allocate the stack.
int main(int argc, char **argv) {
    const int STACK_SIZE = 65536;
    char *stack = alloc_stack(STACK_SIZE);
    char *stack_top = stack + STACK_SIZE;
    return 0;
}
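We compute the address of the top of the stack, stack_top, because clone needs it as an argument: stacks grow downward on the architectures Linux commonly runs on, so the child starts at the highest address of the region. Now we can clone: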
int main(int argc, char **argv) {
    const int STACK_SIZE = 65536;
    char *stack = alloc_stack(STACK_SIZE);
    char *stack_top = stack + STACK_SIZE;
    printf("starting thread...\n");
    if (clone(thread_work, stack_top, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM,
              NULL) == -1) {
        perror("error cloning");
        exit(1);
    }
    sleep(10);
    return 0;
}
We call clone with thread_work, the function we defined earlier, a pointer to the top of the stack, flags to tell clone what we want to do, and NULL for the argument. clone has a lot of functionality, which is chosen through the flags passed to it.
CLONE_THREAD says to put the child thread into the same thread group as the calling process.
CLONE_SIGHAND is required when using CLONE_THREAD. This makes the child and the parent share the same signal handlers.
CLONE_VM is required when using CLONE_SIGHAND. This makes the calling process and the child share the same memory space.
These three flags together make the child that clone creates a thread of the calling process. If we passed different flags, we could spawn a new process, instead of a thread, that does not share memory with the calling process.
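The effect of CLONE_VM on memory sharing can be observed directly. The sketch below is an illustration under assumptions, not the code from this post: set_flag and parent_sees_child_write are hypothetical names, and the child is created with SIGCHLD as its termination signal (without CLONE_THREAD) so the parent can wait for it with waitpid.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (64 * 1024)

/* Hypothetical child function: sets the int that arg points to. */
static int set_flag(void *arg) {
    *(volatile int *)arg = 1;
    return 0;
}

/* Clone a child with extra_flags, wait for it, then report whether the
   parent observes the child's write: 1 if yes, 0 if no, -1 on error. */
int parent_sees_child_write(int extra_flags) {
    volatile int flag = 0;
    char *stack = mmap(NULL, DEMO_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    pid_t pid = clone(set_flag, stack + DEMO_STACK_SIZE,
                      extra_flags | SIGCHLD, (void *)&flag);
    int ok = (pid != -1 && waitpid(pid, NULL, 0) == pid);

    munmap(stack, DEMO_STACK_SIZE);
    return ok ? flag : -1;
}
```

With CLONE_VM the child writes into the parent's memory, so the function should return 1. With no extra flags the child gets its own copy of the address space and the parent's flag stays 0.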
Compile and run, and you should get the following:
$ gcc -o clone_thread.o clone_thread.c
$ ./clone_thread.o
starting thread...
running thread
finished thread
Great! I mentioned this is not equivalent to pthread_create. One difference is that pthread_create will set a lot more flags. Here is an excerpt from the glibc source on thread creation:
/* We rely heavily on various flags the CLONE function understands:

   CLONE_VM, CLONE_FS, CLONE_FILES
     These flags select semantics with shared address space and
     file descriptors according to what POSIX requires.

   CLONE_SIGHAND, CLONE_THREAD
     This flag selects the POSIX signal semantics and various
     other kinds of sharing (itimers, POSIX timers, etc.).

   CLONE_SETTLS
     The sixth parameter to CLONE determines the TLS area for the
     new thread.

   CLONE_PARENT_SETTID
     The kernels writes the thread ID of the newly created thread
     into the location pointed to by the fifth parameters to CLONE.

     Note that it would be semantically equivalent to use
     CLONE_CHILD_SETTID but it is be more expensive in the kernel.

   CLONE_CHILD_CLEARTID
     The kernels clears the thread ID of a thread that has called
     sys_exit() in the location pointed to by the seventh parameter
     to CLONE.

   The termination signal is chosen to be zero which means no signal
   is sent. */
const int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM
                         | CLONE_SIGHAND | CLONE_THREAD
                         | CLONE_SETTLS | CLONE_PARENT_SETTID
                         | CLONE_CHILD_CLEARTID
                         | 0);
You can see the source here. The actual calls from the pthread_create function to that function go through an insane chain of macros, so it's not easy to see the call chain at all!
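To make one of the flags from that excerpt concrete, here is a rough sketch of CLONE_PARENT_SETTID, which uses the fifth parameter of the libc wrapper. The names do_nothing and parent_settid_matches are hypothetical, and the flag combination is an assumption chosen for the demo (a child with SIGCHLD that the parent can wait for), not what pthread_create passes.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (64 * 1024)

/* Hypothetical child function that exits straight away. */
static int do_nothing(void *arg) {
    (void)arg;
    return 0;
}

/* Returns 1 if the kernel stored the child's ID in parent_tid and it
   matches clone's return value, 0 if it does not, -1 on error. */
int parent_settid_matches(void) {
    pid_t parent_tid = 0;
    char *stack = mmap(NULL, DEMO_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* Fifth argument: where to store the child's ID in the parent.
       The sixth (tls) and seventh (child_tid) are unused here. */
    pid_t pid = clone(do_nothing, stack + DEMO_STACK_SIZE,
                      CLONE_VM | CLONE_PARENT_SETTID | SIGCHLD,
                      NULL, &parent_tid, NULL, NULL);
    int ok = (pid != -1 && waitpid(pid, NULL, 0) == pid);

    munmap(stack, DEMO_STACK_SIZE);
    return ok ? (parent_tid == pid) : -1;
}
```

The kernel completes the store before clone returns to the parent, so parent_tid can be read right after the call.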
Anyway, we know how to spawn a thread with clone in libc now, yay!
mmap and MAP_STACK
In the mmap section I mentioned we could have just ignored this flag, at least for this post. But, I was super curious why it even existed after reading the man page description which says:
MAP_STACK (since Linux 2.6.27)
Allocate the mapping at an address suitable for a process or thread stack.
This flag is currently a no-op on Linux. However, by employing this flag, applications can ensure that they transparently obtain support if the flag is implemented in the future. Thus, it is used in the glibc threading implementation to allow for the fact that some architectures may (later) require special treatment for stack allocations. A further reason to employ this flag is portability: MAP_STACK exists (and has an effect) on some other systems (e.g., some of the BSDs).
The clone(2) man page also talks about why you should use mmap over malloc for allocating stack memory:
Within the sample program, we allocate the memory that is to be used for the child's stack using mmap(2) rather than malloc(3) for the following reasons:
mmap(2) allocates a block of memory that starts on a page boundary and is a multiple of the page size. This is useful if we want to establish a guard page (a page with protection PROT_NONE) at the end of the stack using mprotect(2).
We can specify the MAP_STACK flag to request a mapping that is suitable for a stack. For the moment, this flag is a no-op on Linux, but it exists and has effect on some other systems, so we should include it for portability.
Besides the man pages, you can read a bit more about MAP_STACK in this thread, https://lkml.org/lkml/2019/11/11/135, which discusses why the flag was added:
So, my understanding from the above is that MAP_STACK was added to allow a possible fix on some old architectures, should anyone decide it was worth doing the work of implementing it. But so far, after 12 years, no one did. It kind of looks like no one ever will (since those old architectures become less and less relevant).
Computers are hard.
Thread Safety of printf
I never really thought about it until writing this post, but printf is only thread safe for threads that libc itself set up, that is, threads created with pthread_create. Here is what I originally tried to write when building the initial clone example.
int thread_work(void *arg) {
    printf("running thread\n");
    sleep(5);
    printf("finished thread\n");
    return 0;
}

char *alloc_stack(int stack_size) {
    char *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED) {
        perror("failed to allocate stack for thread");
        exit(1);
    }
    return stack;
}

int main(int argc, char **argv) {
    const int STACK_SIZE = 65536;
    char *stack = alloc_stack(STACK_SIZE);
    char *stack_top = stack + STACK_SIZE;
    if (clone(thread_work, stack_top, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM,
              NULL) == -1) {
        perror("error cloning");
        exit(1);
    }
    printf("started thread...\n");
    sleep(10);
    return 0;
}
If you compile and run this (hint, you shouldn't 😄), you might get something like this:
$ gcc -o clone_thread.o clone_thread.c
$ ./clone_thread.o
�U.
running thread
finished thread
Uh oh! It may segfault too. Or do who knows what; it's undefined behaviour! This baffled me for a bit. It seems that when you call pthread_create, libc sets up some bookkeeping, perhaps thread local storage for the buffer in printf 🤔. I'm not entirely sure yet. There are some posts on the web about this. This Stack Overflow answer, and the one above it, come to the same conclusion. This thread for a related bug on sourceware talks about it too.
It makes sense. I'm going to dive more into glibc to get a better understanding of what it's doing when it spawns a thread with pthread_create. I just need to build up the energy to wade through crazy macros first 😬.
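Until that is understood, the workaround used in the clone example is to skip stdio and call write directly. A small helper makes that less error prone. This is only a sketch: write_str is a hypothetical name, and it assumes the trouble comes from stdio state rather than from write itself, which is a thin wrapper around the system call and keeps no user-space buffer.

```c
#include <string.h>
#include <unistd.h>

/* Write a NUL-terminated string to fd using write(2) only, so no stdio
   buffers are involved. Loops because write may accept fewer bytes than
   requested. Returns 0 on success, -1 if write reports an error
   (including EINTR, which this sketch does not retry). */
int write_str(int fd, const char *msg) {
    size_t left = strlen(msg);
    while (left > 0) {
        ssize_t n = write(fd, msg, left);
        if (n < 0)
            return -1;
        msg += n;
        left -= (size_t)n;
    }
    return 0;
}
```

Inside a cloned thread this would be called as write_str(STDOUT_FILENO, "running thread\n"), which also avoids having to count the message length by hand.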
Conclusion
I covered the basic way threads can be created on Linux using clone through libc. It's a lot simpler than pthreads for sure! But it does make printf (and possibly other parts of libc!) unsafe to use in child threads. Hopefully this gave you a better understanding of what is happening under the hood. In the next post I want to drop a level lower and clone a thread by calling the clone syscall directly, without libc. Either that, or dive more into the printf issue and see how glibc sets up threads.