## Me, trying to understand the GIL

— 10 Mar 2018

So today I want to do all sorts of things to understand how I can make my Python codes run faster. Dealing with large datasets means you’re constantly racing with time. Surely tools such as Spark are available, but in some cases it could probably an overkill! I don’t have fixed “use this when that” rules yet but I guess adding more tools to your toolboxes (and gaining more in-depth understanding of your tools) is not a bad idea, and over time I’ll learn more about which tool is suitable for a case. And there are probably an infinite amount of different cases with different constraints (e.g. time) :-). For a few millions of rows where I need to work fast and the requirements are constantly changing (can you try this? How about that?!), I’d resort to bash scripts (GASP). They’re pretty cool, OK, I’ll write about them one day. But I can’t rely on bash scripts forever, and I know Python, so why not try to make the best out of it?

If you don’t know me, I enjoy getting to the bottom of things. Some personal anecdote: I once annoyed the heck out of my kindergarten teacher because she told me to stop sniffing the flowers in the playground. Ugh. What happens to “stop and smell the roses” amirite? So unacceptable. So I asked her one simple question: “why?” and her answer was something along the like, “you’d get sick”. It was just a start of a long train of “why”s. “Why would I get sick?” “Why blah blah blah?” I literally followed her around, even all the way to the bathroom!

As I’ve learned early in life… sometimes it’s impossible to do that, you know, get to the bottom of things. Most of the time it takes a lot of time, but I do try! So before I begin to learn the applied part of things, like which library/module to use etc., and how to profile my Python codes, I begin by understanding why talking about performance in Python is kind of a tricky topic. I think talking about Python’s Global Interpreter Lock (GIL) is a good start, which I’ve always heard about from time to time.

I began by watching David Beazley’s talk titled Understanding the Python GIL. Below is basically my summary of the video, so you might just want to watch the video instead! Writing it down basically helps me so that everything that I just learned doesn’t go over my head.

The video gives an example of running a Python program that decrements a given number until it reaches 0:

Decrementing 50000000 to 0 takes 2.35452699661 s in my machine. If we wish to speed this up, we’d probably think of applying multithreading: divide the COUNT by 2, and decrement each of the subdivision in each thread. In Python, you can implement this by using the threading module. Here’s to hoping that we can get things running twice faster than our initial program!

The total elapsed time that I got was 3.15860486031 s which is waaay worse and totally is not what we would expect.

Then there’s this weird thing where the threaded performance gets EVEN BETTER when you’re disabling one of your cores (I didn’t try this one out myself because I didn’t have XCode installed).

## What’s the deal with the GIL?

Ugggh bad pun I KNOW. I’m sorry.

The GIL forbids parallel execution. This simplifies the implementation of the interpreter: memory management, C extension, etc. Thus, instead of the usual parallel computing we’d expect to happen, we get something called cooperative multitasking.

What happens in cooperative mutlitasking? When a thread is running (and remember, we can only have one thread running at a time), it is holding the GIL. It will continue to do so until the thread is on I/O–read, write, send, yada yada. At this point, the thread releases the GIL and allows another thread to run.

You may have noticed that the whole thing really depends on performing I/O because that’s when the GIL is going to be released and another thread is allowed to run. But what if this one thread never performs I/O at all?! Ugh this reminds me of that mean kid in my kindergarten that never allows anyone else to use the slides.

Thankfully, the interpreter already handles this case! Thank goodness! The interpreter has something called a check that occurrs every 100 ticks.

In the Python interpreter, ticks map to interpreter instructions. These ticks are not related to time. Some ticks might be longer than others.

So what happens in a check?

• Resets the tick counter back to 100
• If it is the main thread, there’s going to be signal handling
• Releases the GIL
• Reacquires the GIL

You can even see it, in all its glory, in the Python interpreter source code (Python/ceval.c)!

Right so I don’t completely understand the code above yet, but I’ll just leave it here for the moment. Otherwise, I’ll never get this post done. Every thread that you create is going to be running the exact same code above.

So the Python only implements one type of lock (in C). It’s not a mutex lock which most people might be familiar of.

Mutex lock: provides mutual exclusion. Only one of the producer/consumer can have the key (mutex) and work. Either one of them has to wait.

There are many kinds of locks. Aside mutex, there’s something called spinlock, for example. They’re all used for synchronization, but they differ in the mechanism/how they work. I need to read more into this! These links may help.

So what lock does Python implement, then? Python implements something called the binary semaphore. GIL, in fact, is an instance of a binary semaphore.

### A crash course on locks

A lock consists of three parts:

locked = 0 # the lock status
mutex = pthreads_mutex() # lock for the status


These three parts are essential to describe how a GIL is released and acquired, as illustrated by the pseudocodes below:

release() {
mutex.acquire()
locked = 0
mutex.release()
cond.signal()
}


When the lock is released, the release function will give a signal to another thread using cond.signal() that says, “hey, I’m done here! You can run now if you’d like!”.

acquire() {
mutex.acquire()
while (locked) {
cond.wait(mutex)
}
locked = 1
mutex.release()
}


See the while(locked) loop? Yep. If a thread tries to acquire a lock while it is still locked by another thread, it will enter this loop and wait (cond.wait(mutex)) until it receives a signal from release(). Once it is no longer locked, the thread will “keep” the lock for itself by setting locked = 1.

As written before, the thread switching thing happens when I/O is performed. Say that we have two threads: Thread A and Thread B. Thread A is now running, which means it is also holding the GIL. At some point, Thread A performs an I/O operation. Thread A will release the GIL, and this triggers a signaling operation (remember cond.signal()?) in which Thread A signals Thread B that Thread B may run now. Thread B, after getting the notification, acquires the GIL and starts its process.

However, like most things in life, things aren’t that easy. What happens when Thread A does not perform any I/O and thus runs until the check happens–after 100 ticks, that is?

Turns out, any of the thread can continue running if it wants to. But which one? Never-ending questions, huh!?!?! Such is life.

OK, so remember our friend cond = pthreads_cond() a.k.a. the conditional variable? Right. Our friend here is made up of a queue of threads. So whenever something like cond.wait() is called, what happens is we’re putting the signaled thread into the queue–often FIFO, not always though. This is all in the OS though, so as a Python user we do not have control over it… most of the time. Well perhaps you can mess around with some weird syscalls if you want. Now, the OS will then choose the thread to be run next based on priority: this may or may not be the signaled thread that is just inserted earlier.

## What can go wrong?

Heh, up until this point it seems like all sunshine and rainbows. But things can definitely go wrong! What are those?

To investigate this, Beazley added additional logging to a version of Python where the whole GIL events happen. Example:

release() {
mutex.acquire()
locked = 0
if gil: log("RELEASE")
mutex.release()
cond.signal()
}


This way, we can know where the release, acquiring, and other events happen. Thankfully, there are vizualisations available, generated from ~2GB of logging data! YAY! Thank goodness!

Some quick takeaways from the visualizations, as highlighted in the video:

• Threads running in 1 CPU generally run in a long interval
• The amount of switching in 2 CPUs is… pretty crazy.
• OK NOW here comes the BIG REVEAL (at least a big reveal to me!). If you take a look at the graph(s) using 2 CPUs, there are lots of red blocks there, signaling that there’s a lot of thrashing happening. Not good. Turns out that, when you’re running > 1 CPU, say 2 CPUs, it thinks of something like “OK now I can run two things at once! Hooray!”. Let’s say that Thread A is running. Since we have 2 CPUs, the OS also wakes up Thread B because it thinks it can run Thread B too. But… nu-uh. Nope. The GIL is not there so Thread B cannot run! But this keeps on happening which leads of tons of thrashing and system calls being burned away. Ouch! Now this kinds of explain the weird bad performance of threading in Python…
• What happens when you put 1 I/O-thread (only performing an I/O event, in this particular case it randomly receives UDP packets) and 1 CPU-thread together? Using 1 CPU, the I/O-thread will only run for like a few ticks. But when using 2 CPUs and a UDP packet shows up, the thread will try to run on the GIL and the thrashing will happen too!
• Crazy thrashing also happens when you put many (say, 512 I/O-threads) together!

## There’s a new GIL!

WOW, you guys. It turns out, all of these crazy thrashing stuff is solved in the new GIL implementation! This new GIL is available since the Python 3.2. Now I’m so glad that we’ve got to learn all the stuff about the old GIL because it would make it easier for me to appreciate what the new GIL brings!

Right, so what’s new?

Remember our old friend ticks? You know, the one that decides when to perform a check (that is, after 100 ticks)? They’re now gone.

They’re now replaced by a global variable:

...
static volatile int gil_drop_request = 0


The thread will run until the value of gil_drop_request is set to 1. At this point, the thread releases the GIL.

OK, how?

The answer is: TIMEOUT. HUH. This is a totally different approach isn’t it!

Previously we’ve been relying on ticks (unless some sort of event, like an I/O event, happens) to trigger a thread to release the GIL and sends a signal to another waiting thread. This time, we’re relying on timeout instead.

The waiting thread (say, Thread B) waits to see if Thread A will voluntarily release the GIL at some point (e.g. because of an I/O event). So there’s something like cond.wait(gil, TIMEOUT) instead of just cond.wait(). The default TIMEOUT value is set to 5 milliseconds, but this value can be changed.

When the timeout occurrs, the value gil_drop_request will be set to 1. This means, whatever happens, Thread A must release the GIL, suspend its operation, and notify Thread B that it is allowed to run. Thread B waits for an ack which ensures that it gets the GIL and is now running.

So let’s try it out! Using the same codes as the ones in the beginning of the post in Python 3, I got 3.630388021469116 s for the first code and 3.8260281085968018 s for the second code using threading. Now I’m not sure why the original code is even slower on Python 3, BUT at least there isn’t much difference between the original vs. the threaded code.

Keep in mind though, that GIL is still limiting us in a lot of ways. As we can see, it doesn’t speed up our process, but at least the new GIL is doing a little better.

## BUT

Watching the talk (and thus, writing this post) is an emotional rollercoaster for me because once you think things are going well for you…

It doesn’t.

The new GIL is doing better and everything, but it turns out that there are some caveats. Why can’t you just let us have a happy ending!?

### Poor response time for I/O process

Let’s say that Thread A is a CPU-thread and Thread B is an I/O-thread. Thread A is working, and suddenly Thread B receives a packet or something. What happens is, the new GIL increases the response time of the I/O process.

How come? Remember that the new GIL depends heavily on timeout. Therefore, the thread must get through the entire timeout sequence until it is able to be run, which means high-priority I/O or events are going to be ignored. Ouch.

### Unfair wakeup/starvation

So… let’s say that we have three threads now: Thread A, Thread B, and Thread C. Thread A is running, and Thread B is waiting–it therefore performs the timeout. However, surprise SURPRISE, there is no guarantee that Thread B is going to be the one that gets the GIL. It could be Thread C, for all we know. This is due to how queueing works in the OS.

### Convoy effect

OK… this is interesting. Python assumes that all I/O operations are blocking, which makes it releases the GIL as soon as it acquires it. But this is not always the case. Lots of I/O operations don’t stop running because the data is already buffered to the operating system, even when the GIL is already released.

The problem happens when we have both I/O-thread and CPU-thread on multiple cores: once the GIL is released, the CPU-thread will immediately pick the GIL. CPU-thread always gets the GIL, and I/O-thread always has to wait for another timeout (the default 5 ms). This cycle goes on and on, making performance suffer!

Hmm, doesn’t this happen in Python 2 too? What’s the difference? Sooo this is not in the video but I’m curious about the answer to this question. From this slide of a different (but on a similar topic) talk, in Python 2 our old friend ticks save the day! These poor I/O-threads do not need to wait as long as they have to in Python 3, so the issue is not really noticeable.

Lots of things to improve, but don’t worry, the new GIL is not a lost cause! Some ideas for improvements:

• Find a way to separate low-priority (CPU) and high-priority (I/O) threads