Processes, Threads and the Clone Syscall

I want to explore how threads and processes work on Linux, under the hood, as in-depth as I can. It's been a while since I dove into the kernel. In this series of posts, I'll be writing some potentially insane C code to dive into this.

With pthread

Normally, if you want a thread in C, you'd use the pthreads (POSIX threads) API. Here's a quick example:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void *work(void *arg);

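int main(void) {
  pthread_t ptid;
  // pthread_create returns an error number (not -1) on failure.
  if (pthread_create(&ptid, NULL, work, NULL) != 0) {
    fprintf(stderr, "failed to create thread\n");
    return 1;
  }
  printf("I launched a thread!\n");
  // Wait for the thread to finish before main returns.
  pthread_join(ptid, NULL);
  return 0;
}

void *work(void *arg) {
  // The thread stays joinable: detaching it here would make the
  // pthread_join call in main undefined behaviour.
  (void)arg;
  printf("Look ma! I'm a thread!\n");
  return NULL;
}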

Compile and run, and you will get the following:

$ gcc -pthread -o pthread_thread.o pthread_thread.c

$ ./pthread_thread.o
I launched a thread!
Look ma! I'm a thread!

The order of the prints depends on which thread gets to print first. Note too that printf is thread-safe here because the thread was created through glibc's pthread_create; more on that later. This seems pretty straightforward. How would we do it without using pthreads?

Without pthread

Note the following will not be equivalent to the pthread call above. That does a lot more setup to make using the thread easier and safer than what we are going to see in this section. With that said - the clone syscall on Linux allows the creation of threads. This can be called directly or through libc. Let's do the latter. The signature for clone in libc looks as follows:

int clone (int (*fn)(void *arg), void *child_stack, int flags,
	      void *arg, ... /* pid_t *parent_tid, void *tls, pid_t *child_tid */);

int (*fn)(void *arg) is a pointer to a function that will do the work in the thread, void *child_stack is a pointer to the top of the stack for the thread, int flags are the flags for this clone call (more on those later), and void *arg is a pointer to an argument to pass to this thread.

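We are not going to use the rest of the arguments, which are passed as varargs (...). They are a pointer that receives the child's thread ID in the parent, a pointer to thread-local storage for the new thread, and a pointer that receives the child's thread ID in the child. On success, clone returns the child's thread ID to the caller; on error it returns -1.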
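To make the signature a bit more concrete, here is a minimal sketch of how arg and the optional parent_tid argument could be used. Everything named demo_* is made up for this illustration, and I use CLONE_VM plus SIGCHLD (rather than the thread flags we'll see later) only so that a plain waitpid can collect the child.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (64 * 1024) /* hypothetical size for this sketch */

/* Hypothetical payload handed to the child through clone's arg. */
struct demo_job {
  int input;
  int output;
};

/* Runs in the child: reads the argument and stores a result next to it. */
static int demo_double(void *arg) {
  struct demo_job *job = arg;
  job->output = job->input * 2;
  return 0;
}

/* Clones a child that shares our memory, waits for it, and returns the
 * doubled value, or -1 on failure. *tid_out receives the ID the kernel
 * stores through the optional parent_tid argument. */
int demo_clone_with_arg(int input, pid_t *tid_out) {
  assert(tid_out != NULL);
  struct demo_job job = {input, 0};
  char *stack = mmap(NULL, DEMO_STACK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (stack == MAP_FAILED)
    return -1;

  /* The low byte of flags is the termination signal; SIGCHLD there
   * lets waitpid() reap the child like a forked process. */
  pid_t pid = clone(demo_double, stack + DEMO_STACK_SIZE,
                    CLONE_VM | CLONE_PARENT_SETTID | SIGCHLD, &job, tid_out);
  if (pid == -1) {
    munmap(stack, DEMO_STACK_SIZE);
    return -1;
  }

  int status;
  pid_t waited = waitpid(pid, &status, 0);
  munmap(stack, DEMO_STACK_SIZE);
  return waited == pid ? job.output : -1;
}
```

Calling demo_clone_with_arg(21, &tid) should hand 21 to the child through arg and leave the child's ID in tid, the same value clone itself returned.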

Clone a Thread

First, what should our thread do? Let's just make it print something and then sleep for a bit.

// We need to define this so we can call clone later.
#define _GNU_SOURCE

// We don't need all these headers just for thread_work, but we will need them later.
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

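int thread_work(void *arg) {
  // clone expects int (*)(void *); the argument is unused here.
  (void)arg;
  // sizeof - 1 drops the terminating NUL so only the text is written.
  const char running[] = "running thread\n";
  const char finished[] = "finished thread\n";
  write(STDOUT_FILENO, running, sizeof(running) - 1);
  sleep(5);
  write(STDOUT_FILENO, finished, sizeof(finished) - 1);
  return 0;
}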

int main(int argc, char **argv) {}

All this does is print "running thread", sleep for 5 seconds, and then print "finished thread". We can't use printf here: it's not thread-safe unless the thread was created with pthread_create. More on why later.
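Since write is doing printf's job here, a small helper makes it less error-prone. This is only a sketch: write_all and write_str are names I made up, and the only claim is that they bypass stdio's buffering, not that every libc call is safe in a raw clone thread.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Writes all len bytes to fd, retrying on short writes and EINTR.
 * Returns 0 on success and -1 on error. */
int write_all(int fd, const char *data, size_t len) {
  while (len > 0) {
    ssize_t n = write(fd, data, len);
    if (n == -1) {
      if (errno == EINTR)
        continue; /* interrupted before anything was written: retry */
      return -1;
    }
    data += n;
    len -= (size_t)n;
  }
  return 0;
}

/* Convenience wrapper: strlen() leaves out the terminating NUL, so only
 * the visible text reaches the file descriptor. */
int write_str(int fd, const char *s) {
  assert(s != NULL);
  return write_all(fd, s, strlen(s));
}
```

With this, the body of the thread function could shrink to calls like write_str(STDOUT_FILENO, "running thread\n").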

For now, what's next? Well, we need to set up some stack space for the thread. Let's create a function for that.

char *alloc_stack(int stack_size) {
  char *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (stack == MAP_FAILED) {
    perror("failed to allocate stack for thread");
    exit(1);
  }
  return stack;
}

There is a bit going on here! We call mmap to allocate memory for the stack. Let's break it down.

mmap

The signature for mmap is:

void *mmap(void *addr, size_t length, int prot, int flags,
                  int fd, off_t offset);

void *addr is a hint to the kernel about where it should place the mapping. size_t length is the length of the mapping. int prot defines the memory protection that should be applied. int flags allows passing a number of flags that determine how this mapping functions, more on that in a moment. int fd is a file descriptor: mmap can be used to memory map a file, but fd is ignored when certain flags are passed. Finally, off_t offset is the offset into the file at which the mapping starts, if mmap is memory mapping a file.

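We pass PROT_READ | PROT_WRITE as the value for prot, so the pages of the mapping can be read and written.

For flags we pass MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK. MAP_PRIVATE creates a private copy-on-write mapping. MAP_ANONYMOUS means the mapping is not backed by a file: its contents start out zeroed and fd is ignored (we pass -1). MAP_STACK asks for an address suitable for a stack; we could have ignored this flag for this post, more on that later.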

Hopefully that makes sense. mmap returns MAP_FAILED if it fails, or the address of the new mapping if it succeeds. So alloc_stack checks for MAP_FAILED and, if it sees it, logs an error with perror and exits. perror logs our message along with a description of the errno value mmap set.
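To see the MAP_FAILED and errno behaviour in isolation, here is a small sketch. try_anonymous_map is a hypothetical helper; it relies on the documented fact that a zero-length mmap fails with EINVAL on Linux.

```c
#define _GNU_SOURCE /* for MAP_ANONYMOUS when compiling with strict -std=c11 */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Tries to map length bytes of anonymous memory and unmaps it again.
 * Returns 0 on success, or the errno value mmap reported on failure. */
int try_anonymous_map(size_t length) {
  void *p = mmap(NULL, length, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return errno; /* the same value perror() would translate */

  int rc = munmap(p, length);
  assert(rc == 0);
  (void)rc; /* keeps the build quiet when NDEBUG removes the assert */
  return 0;
}
```

try_anonymous_map(0) returns EINVAL, which perror would print as "Invalid argument" after our message.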

If mmap succeeds, alloc_stack returns a pointer to the stack.

Time to Clone

Now we have a function to allocate a stack and a function that defines the work our thread should do. Time to clone. Let's just do it in main. First up, allocate the stack.

int main(int argc, char **argv) {
  const int STACK_SIZE = 65536;
  char *stack = alloc_stack(STACK_SIZE);
  char *stack_top = stack + STACK_SIZE;

  return 0;
}

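We compute the address of the top of the stack, stack_top, because the stack grows downward on almost every architecture Linux runs on, so clone expects a pointer to the topmost address of the child's stack. Now we can clone: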
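With a page-aligned mapping and a size like 65536, stack + STACK_SIZE is already nicely aligned. If the memory came from somewhere else, the top might need rounding down first. A minimal sketch, assuming the 16-byte stack alignment used on x86-64 and AArch64 (stack_top_aligned is a made-up helper, and other architectures may want a different figure):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the highest 16-byte-aligned address in [base, base + size].
 * Assumption: 16 bytes is the required stack alignment (true for the
 * x86-64 and AArch64 ABIs). */
char *stack_top_aligned(char *base, size_t size) {
  assert(base != NULL && size >= 16);
  uintptr_t top = (uintptr_t)base + size;
  top &= ~(uintptr_t)15; /* round down: the stack grows toward base */
  return (char *)top;
}
```

For a 100-byte region that starts on a 16-byte boundary, this returns base + 96 rather than base + 100.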

int main(int argc, char **argv) {
  const int STACK_SIZE = 65536;
  char *stack = alloc_stack(STACK_SIZE);
  char *stack_top = stack + STACK_SIZE;

  printf("starting thread...\n");

  if (clone(thread_work, stack_top, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM,
            NULL) == -1) {
    perror("error cloning");
    exit(1);
  }

  sleep(10);

  return 0;
}

We call clone with thread_work, the function we defined earlier, a pointer to the top of the stack, flags to tell clone what we want to do, and NULL for the argument. clone has a lot of functionality which is chosen through the flags passed to it.

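CLONE_VM makes the child share the caller's address space, CLONE_SIGHAND makes it share the caller's table of signal handlers, and CLONE_THREAD places it in the caller's thread group. Together, these three flags make the child which clone creates a thread of the calling process. If we passed different flags, we could spawn a new process, instead of a thread, that does not share memory with the calling process.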
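The effect of CLONE_VM is easy to observe. In this sketch (all names are hypothetical) the child writes to a global and the parent checks whether it can see the write. I leave CLONE_THREAD out on purpose, because a child in the caller's thread group cannot be collected with waitpid.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define SHARE_STACK_SIZE (64 * 1024) /* hypothetical size for this sketch */

/* Written by the child, read by the parent. */
static volatile int shared_marker;

static int set_marker(void *arg) {
  (void)arg;
  shared_marker = 1;
  return 0;
}

/* Clones a child with extra_flags (plus SIGCHLD so waitpid works) and
 * reports whether the parent sees the child's write:
 * 1 = visible, 0 = not visible, -1 = error. */
int child_write_visible(int extra_flags) {
  assert((extra_flags & CLONE_THREAD) == 0); /* waitpid cannot reap those */
  char *stack = mmap(NULL, SHARE_STACK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (stack == MAP_FAILED)
    return -1;

  shared_marker = 0;
  pid_t pid = clone(set_marker, stack + SHARE_STACK_SIZE,
                    extra_flags | SIGCHLD, NULL);
  int ok = pid != -1 && waitpid(pid, NULL, 0) == pid;
  munmap(stack, SHARE_STACK_SIZE);
  return ok ? shared_marker : -1;
}
```

child_write_visible(CLONE_VM) reports 1 because both sides share one address space, while child_write_visible(0) reports 0 because the child only changed its own copy-on-write copy, much like after fork.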

Compile and run, and you should get the following:

$ gcc -o clone_thread.o clone_thread.c

$ ./clone_thread.o
starting thread...
running thread
finished thread

Great! I mentioned this is not equivalent to pthread_create. One difference is that pthread_create will set a lot more flags. Here is an excerpt from the libc source on thread creation:

/* We rely heavily on various flags the CLONE function understands:

     CLONE_VM, CLONE_FS, CLONE_FILES
	These flags select semantics with shared address space and
	file descriptors according to what POSIX requires.

     CLONE_SIGHAND, CLONE_THREAD
	This flag selects the POSIX signal semantics and various
	other kinds of sharing (itimers, POSIX timers, etc.).

     CLONE_SETTLS
	The sixth parameter to CLONE determines the TLS area for the
	new thread.

     CLONE_PARENT_SETTID
	The kernels writes the thread ID of the newly created thread
	into the location pointed to by the fifth parameters to CLONE.

	Note that it would be semantically equivalent to use
	CLONE_CHILD_SETTID but it is be more expensive in the kernel.

     CLONE_CHILD_CLEARTID
	The kernels clears the thread ID of a thread that has called
	sys_exit() in the location pointed to by the seventh parameter
	to CLONE.

     The termination signal is chosen to be zero which means no signal
     is sent.  */
  const int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM
			   | CLONE_SIGHAND | CLONE_THREAD
			   | CLONE_SETTLS | CLONE_PARENT_SETTID
			   | CLONE_CHILD_CLEARTID
			   | 0);

You can see the source here. The actual calls from the pthread_create function to that function go through an insane chain of macros, so it's not easy to see the call chain at all!
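The CLONE_PARENT_SETTID and CLONE_CHILD_CLEARTID flags in that excerpt hint at how a parent could wait for a thread without a blind sleep(10): the kernel zeroes the thread ID slot when the thread exits and wakes any futex waiter on it. Below is a minimal sketch of that idea, assuming 64-bit Linux with glibc. The join_* and spawn_and_join names are mine, the helper is not reentrant, and as far as I can tell this is roughly what pthread_join builds on, but treat that as my reading rather than something I have verified in the glibc source.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <sched.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define JOIN_STACK_SIZE (64 * 1024) /* hypothetical size for this sketch */

static pid_t join_tid;  /* set by the kernel at clone, zeroed at thread exit */
static int join_result; /* written by the worker */

/* Worker: a plain memory store and nothing else, so it never touches
 * libc state that a raw clone thread has not set up. */
static int join_worker(void *arg) {
  join_result = *(int *)arg + 1;
  return 0;
}

/* Starts a thread with clone, blocks until it has exited, and returns
 * input + 1 as computed by the thread, or -1 on error. */
int spawn_and_join(int input) {
  assert(sizeof join_tid == sizeof(int)); /* futex words are 32 bits */
  char *stack = mmap(NULL, JOIN_STACK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (stack == MAP_FAILED)
    return -1;

  join_result = 0;
  int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM |
              CLONE_SIGHAND | CLONE_THREAD | CLONE_PARENT_SETTID |
              CLONE_CHILD_CLEARTID;
  /* Varargs order: parent_tid, tls (unused without CLONE_SETTLS), child_tid. */
  if (clone(join_worker, stack + JOIN_STACK_SIZE, flags, &input, &join_tid,
            NULL, &join_tid) == -1) {
    munmap(stack, JOIN_STACK_SIZE);
    return -1;
  }

  /* Sleep while join_tid still holds the thread ID. FUTEX_WAIT returns
   * right away if the value already changed, so the loop re-checks. */
  pid_t tid;
  while ((tid = __atomic_load_n(&join_tid, __ATOMIC_ACQUIRE)) != 0)
    syscall(SYS_futex, &join_tid, FUTEX_WAIT, tid, NULL, NULL, 0);

  munmap(stack, JOIN_STACK_SIZE); /* safe now: the thread is gone */
  return join_result;
}
```

spawn_and_join(41) should return 42 once the worker has finished, without the parent guessing how long to sleep.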

Anyway, we know how to spawn a thread with clone in libc now, yay!

mmap and MAP_STACK

In the mmap section I mentioned we could have just ignored this flag, at least for this post. But, I was super curious why it even existed after reading the man page description which says:

MAP_STACK (since Linux 2.6.27)

Allocate the mapping at an address suitable for a process or thread stack.

This flag is currently a no-op on Linux. However, by employing this flag, applications can ensure that they transparently obtain support if the flag is implemented in the future. Thus, it is used in the glibc threading implementation to allow for the fact that some architectures may (later) require special treatment for stack allocations. A further reason to employ this flag is portability: MAP_STACK exists (and has an effect) on some other systems (e.g., some of the BSDs).

Further down that man page it talks about why you should use mmap over malloc for allocating stack memory:

Within the sample program, we allocate the memory that is to be used for the child's stack using mmap(2) rather than malloc(3) for the following reasons:

  • mmap(2) allocates a block of memory that starts on a page boundary and is a multiple of the page size. This is useful if we want to establish a guard page (a page with protection PROT_NONE) at the end of the stack using mprotect(2).

  • We can specify the MAP_STACK flag to request a mapping that is suitable for a stack. For the moment, this flag is a no-op on Linux, but it exists and has effect on some other systems, so we should include it for portability.
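The guard page from the first bullet can be sketched like this. alloc_guarded_stack and free_guarded_stack are hypothetical helpers, and I'm assuming a downward-growing stack, so the guard goes at the lowest addresses of the mapping.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Maps usable bytes of stack plus one extra PROT_NONE page below them.
 * Returns the lowest usable address (just above the guard page), or
 * NULL on failure. usable must be a multiple of the page size. */
char *alloc_guarded_stack(size_t usable) {
  size_t page = (size_t)sysconf(_SC_PAGESIZE);
  assert(usable > 0 && usable % page == 0);

  char *base = mmap(NULL, usable + page, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (base == MAP_FAILED)
    return NULL;

  /* The stack grows down, so running off its end hits this page and
   * faults instead of silently corrupting whatever is mapped below. */
  if (mprotect(base, page, PROT_NONE) == -1) {
    munmap(base, usable + page);
    return NULL;
  }
  return base + page;
}

/* Releases a stack returned by alloc_guarded_stack, guard page included. */
void free_guarded_stack(char *stack, size_t usable) {
  size_t page = (size_t)sysconf(_SC_PAGESIZE);
  munmap(stack - page, usable + page);
}
```

The pointer to hand to clone would then be the returned address plus usable, exactly as with alloc_stack earlier.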

Besides the man pages, you can read a bit more about MAP_STACK in this thread https://lkml.org/lkml/2019/11/11/135, which explains why the flag was added:

So, my understanding from the above is that MAP_STACK was added to allow a possible fix on some old architectures, should anyone decide it was worth doing the work of implementing it. But so far, after 12 years, no one did. It kind of looks like no one ever will (since those old architectures become less and less relevant).

Computers are hard.

Thread Safety of printf

I never really thought about it until writing this post, but printf is only thread-safe for threads that libc itself set up, i.e. ones created with pthread_create. Here is what I originally tried to write when building the initial clone example.

int thread_work(void *arg) {
  printf("running thread\n");
  sleep(5);
  printf("finished thread\n");
  return 0;
}

char *alloc_stack(int stack_size) {
  char *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
  if (stack == MAP_FAILED) {
    perror("failed to allocate stack for thread");
    exit(1);
  }
  return stack;
}

int main(int argc, char **argv) {
  const int STACK_SIZE = 65536;
  char *stack = alloc_stack(STACK_SIZE);
  char *stack_top = stack + STACK_SIZE;

  if (clone(thread_work, stack_top, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM,
            NULL) == -1) {
    perror("error cloning");
    exit(1);
  }

  printf("started thread...\n");
  sleep(10);

  return 0;
}

If you compile and run this (hint, you shouldn't 😄), you might get something like this:

$ gcc -o clone_thread.o clone_thread.c

$ ./clone_thread.o
ï¿œU.
running thread
finished thread

Uh oh! It may segfault too. Or do who knows what, it's undefined behaviour! This baffled me for a bit. It seems that when calling pthread_create, libc will set up some bookkeeping, perhaps setting up thread-local storage for the buffer in printf 🤔. I'm not entirely sure yet. There are some posts on the interweb about this. This Stack Overflow answer, and the one above it, come to the same conclusion. This thread for a related bug on sourceware talks about it too.

It makes sense. I'm going to dive more into glibc, to get a better understanding of what it's doing when it spawns a thread with pthread_create. I just need to build up the energy to wade through crazy macros first 😬.
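Until I understand the printf problem properly, one workaround for printing numbers from a raw clone thread is to format them by hand and pass the bytes to write. A minimal sketch, with format_uint being a made-up helper that touches no libc state at all:

```c
#include <assert.h>
#include <stddef.h>

/* Writes the decimal digits of value into out (no NUL terminator) and
 * returns the number of digits. cap is the size of out; 20 bytes is
 * enough for any 64-bit value. */
size_t format_uint(unsigned long long value, char *out, size_t cap) {
  char tmp[20]; /* digits are produced in reverse order */
  size_t n = 0;
  do {
    tmp[n++] = (char)('0' + value % 10);
    value /= 10;
  } while (value != 0);

  assert(n <= cap);
  for (size_t i = 0; i < n; i++)
    out[i] = tmp[n - 1 - i];
  return n;
}
```

The returned length goes straight into write, e.g. write(STDOUT_FILENO, buf, format_uint(12345, buf, sizeof buf)).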

Conclusion

I covered the basic way threads can be created on Linux using clone through libc. It's a lot simpler than pthreads for sure, but it does make printf (and possibly other parts of libc!) unsafe to use in child threads. Hopefully this gave you a better understanding of what is happening under the hood. In the next post I want to drop a level lower and clone a thread by calling the clone syscall directly, without libc. Either that, or dive more into the printf issue and see how glibc sets up threads.