Grok the GIL: How to write fast and thread-safe Python

Grok the GIL: How to write fast and thread-safe Python

We explore Python's global interpreter lock and learn how it affects multithreaded programs.

Grok the GIL: How to write fast and thread-safe Python
Image credits : 

When I was six years old, I had a music box. I'd wind it up, and a ballerina revolved on top of the box while a mechanism inside plinked out "Twinkle, Twinkle, Little Star." The thing must have been godawful tacky, but I loved that music box, and I wanted to know how it worked. Somehow I got it open and was rewarded with the sight of a simple device—a metal cylinder the size of my thumb, studded so that as it rotated, it plucked the teeth of a steel comb and made the notes.

music box parts

Of all a programmer's traits, curiosity about how things work is the sine qua non. When I opened my music box to see inside, I showed that I could grow up to be, if not a great programmer, then at least a curious one.

It is odd, then, that for many years I wrote Python programs while holding mistaken notions about the global interpreter lock (GIL), because I was never curious enough to look at how it worked. I've met others with the same hesitation, and the same ignorance. The time has come for us to pry open the box. Let's read the CPython interpreter source code and find out exactly what the GIL is, why Python has one, and how it affects your multi-threaded programs. I'll show examples to help you grok the GIL. You will learn to write fast and thread-safe Python, and how to choose between threads and processes.

(For the sake of focus, I only describe CPython here—not Jython, PyPy, or IronPython. CPython is the Python implementation that working programmers overwhelmingly use.)

Behold, the global interpreter lock

Here it is:

static PyThread_type_lock interpreter_lock = 0; /* This is the GIL */

This line of code is in ceval.c, in the CPython 2.7 interpreter's source code. Guido van Rossum's comment, "This is the GIL," was added in 2003, but the lock itself dates from his first multithreaded Python interpreter in 1997. On Unix systems, PyThread_type_lock is an alias for the standard C lock, mutex_t. It is initialized when the Python interpreter begins:

void
PyEval_InitThreads(void)
{
    interpreter_lock = PyThread_allocate_lock();
    PyThread_acquire_lock(interpreter_lock);
}

All C code within the interpreter must hold this lock while executing Python. Guido first built Python this way because it is simple, and every attempt to remove the GIL from CPython has cost single-threaded programs too much performance to be worth the gains for multithreading.

The GIL's effect on the threads in your program is simple enough that you can write the principle on the back of your hand: "One thread runs Python, while N others sleep or await I/O." Python threads can also wait for a threading.Lock or other synchronization object from the threading module; consider threads in that state to be "sleeping," too.

hand with writing

When do threads switch? Whenever a thread begins sleeping or awaiting network I/O, there is a chance for another thread to take the GIL and execute Python code. This is cooperative multitasking. CPython also has preemptive multitasking: If a thread runs uninterrupted for 1000 bytecode instructions in Python 2, or runs 15 milliseconds in Python 3, then it gives up the GIL and another thread may run. Think of this like time slicing in the olden days when we had many threads but one CPU. I will discuss these two kinds of multitasking in detail.

Think of Python as an old mainframe; many tasks share one CPU.

Cooperative multitasking

When it begins a task, such as network I/O, that is of long or uncertain duration and does not require running any Python code, a thread relinquishes the GIL so another thread can take it and run Python. This polite conduct is called cooperative multitasking, and it allows concurrency; many threads can wait for different events at the same time.

Say that two threads each connect a socket:

def do_connect():
    s = socket.socket()
    s.connect(('python.org', 80))  # drop the GIL
 
for i in range(2):
    t = threading.Thread(target=do_connect)
    t.start()

Only one of these two threads can execute Python at a time, but once the thread has begun connecting, it drops the GIL so the other thread can run. This means that both threads could be waiting for their sockets to connect concurrently, which is a good thing. They can do more work in the same amount of time.

Let's pry open the box and see how a Python thread actually drops the GIL while it waits for a connection to be established, in socketmodule.c:

/* s.connect((host, port)) method */
static PyObject *
sock_connect(PySocketSockObject *s, PyObject *addro)
{
    sock_addr_t addrbuf;
    int addrlen;
    int res;
 
    /* convert (host, port) tuple to C address */
    getsockaddrarg(s, addro, SAS2SA(&addrbuf), &addrlen);
 
    Py_BEGIN_ALLOW_THREADS
    res = connect(s->sock_fd, addr, addrlen);
    Py_END_ALLOW_THREADS
 
    /* error handling and so on .... */
}

The Py_BEGIN_ALLOW_THREADS macro is where the thread drops the GIL; it is defined simply as:

PyThread_release_lock(interpreter_lock);

And of course Py_END_ALLOW_THREADS reacquires the lock. A thread might block at this spot, waiting for another thread to release the lock; once that happens, the waiting thread grabs the GIL back and resumes executing your Python code. In short: While N threads are blocked on network I/O or waiting to reacquire the GIL, one thread can run Python.

Below, see a complete example that uses cooperative multitasking to fetch many URLs quickly. But before that, let's contrast cooperative multitasking with the other kind of multitasking.

Preemptive multitasking

A Python thread can voluntarily release the GIL, but it can also have the GIL seized from it preemptively.

Let's back up and talk about how Python is executed. Your program is run in two stages. First, your Python text is compiled into a simpler binary format called bytecode. Second, the Python interpreter's main loop, a function mellifluously named PyEval_EvalFrameEx(), reads the bytecode and executes the instructions in it one by one.

While the interpreter steps through your bytecode it periodically drops the GIL, without asking permission of the thread whose code it is executing, so other threads can run:

for (;;) {
    if (--ticker < 0) {
        ticker = check_interval;
 
        /* Give another thread a chance */
        PyThread_release_lock(interpreter_lock);
 
        /* Other threads may run now */
 
        PyThread_acquire_lock(interpreter_lock, 1);
    }
 
    bytecode = *next_instr++;
    switch (bytecode) {
        /* execute the next instruction ... */ 
    }
}

By default the check interval is 1000 bytecodes. All threads run this same code and have the lock taken from them periodically in the same way. In Python 3 the GIL's implementation is more complex, and the check interval is not a fixed number of bytecodes, but 15 milliseconds. For your code, however, these differences are not significant.

Thread safety in Python

Weaving together multiple threads requires skill.

If a thread can lose the GIL at any moment, you must make your code thread-safe. Python programmers think differently about thread safety than C or Java programmers do, however, because many Python operations are atomic.

An example of an atomic operation is calling sort() on a list. A thread cannot be interrupted in the middle of sorting, and other threads never see a partly sorted list, nor see stale data from before the list was sorted. Atomic operations simplify our lives, but there are surprises. For example, += seems simpler than sort(), but += is not atomic. How can you know which operations are atomic and which are not?

Consider this code:

n = 0
 
def foo():
    global n
    n += 1

We can see the bytecode to which this function compiles, with Python's standard dis module:

>>> import dis
>>> dis.dis(foo)
LOAD_GLOBAL              0 (n)
LOAD_CONST               1 (1)
INPLACE_ADD
STORE_GLOBAL             0 (n)

One line of code, n += 1, has been compiled to four bytecodes, which do four primitive operations:

  1. load the value of n onto the stack
  2. load the constant 1 onto the stack
  3. sum the two values at the top of the stack
  4. store the sum back into n

Remember that every 1000 bytecodes a thread is interrupted by the interpreter taking the GIL away. If the thread is unlucky, this might happen between the time it loads the value of n onto the stack and when it stores it back. How this leads to lost updates is easy see:

threads = []
for i in range(100):
    t = threading.Thread(target=foo)
    threads.append(t)
 
for t in threads:
    t.start()
 
for t in threads:
    t.join()
 
print(n)

Usually this code prints 100, because each of the 100 threads has incremented n. But sometimes you see 99 or 98, if one of the threads' updates was overwritten by another.

So, despite the GIL, you still need locks to protect shared mutable state:

n = 0
lock = threading.Lock()
 
def foo():
    global n
    with lock:
        n += 1

What if we were using an atomic operation like sort() instead?:

lst = [4, 1, 3, 2]
 
def foo():
    lst.sort()

This function's bytecode shows that sort() cannot be interrupted, because it is atomic:

>>> dis.dis(foo)
LOAD_GLOBAL              0 (lst)
LOAD_ATTR                1 (sort)
CALL_FUNCTION            0

The one line compiles to three bytecodes:

  1. load the value of lst onto the stack
  2. load its sort method onto the stack
  3. call the sort method

Even though the line lst.sort() takes several steps, the sort call itself is a single bytecode, and thus there is no opportunity for the thread to have the GIL seized from it during the call. We could conclude that we don't need to lock around sort(). Or, to avoid worrying about which operations are atomic, follow a simple rule: Always lock around reads and writes of shared mutable state. After all, acquiring a threading.Lock in Python is cheap.

Although the GIL does not excuse us from the need for locks, it does mean there is no need for fine-grained locking. In a free-threaded language like Java, programmers make an effort to lock shared data for the shortest time possible, to reduce thread contention and allow maximum parallelism. Because threads cannot run Python in parallel, however, there's no advantage to fine-grained locking. So long as no thread holds a lock while it sleeps, does I/O, or some other GIL-dropping operation, you should use the coarsest, simplest locks possible. Other threads couldn't have run in parallel anyway.

Finishing sooner with concurrency

I wager what you really came for is to optimize your programs with multi-threading. If your task will finish sooner by awaiting many network operations at once, then multiple threads help, even though only one of them can execute Python at a time. This is concurrency, and threads work nicely in this scenario.

This code runs faster with threads:

import threading
import requests
 
urls = [...]
 
def worker():
    while True:
        try:
            url = urls.pop()
        except IndexError:
            break  # Done.
 
        requests.get(url)
 
for _ in range(10):
    t = threading.Thread(target=worker)
    t.start()

As we saw above, these threads drop the GIL while waiting for each socket operation involved in fetching a URL over HTTP, so they finish the work sooner than a single thread could.

Parallelism

What if your task will finish sooner only by running Python code simultaneously? This kind of scaling is called parallelism, and the GIL prohibits it. You must use multiple processes, which can be more complicated than threading and requires more memory, but it will take advantage of multiple CPUs.

This example finishes sooner by forking 10 processes than it could with only one, because the processes run in parallel on several cores. But it wouldn't run faster with 10 threads than with one, because only one thread can execute Python at a time:

import os
import sys
 
nums =[1 for _ in range(1000000)]
chunk_size = len(nums) // 10
readers = []
 
while nums:
    chunk, nums = nums[:chunk_size], nums[chunk_size:]
    reader, writer = os.pipe()
    if os.fork():
        readers.append(reader)  # Parent.
    else:
        subtotal = 0
        for i in chunk: # Intentionally slow code.
            subtotal += i
 
        print('subtotal %d' % subtotal)
        os.write(writer, str(subtotal).encode())
        sys.exit(0)
 
# Parent.
total = 0
for reader in readers:
    subtotal = int(os.read(reader, 1000).decode())
    total += subtotal
 
print("Total: %d" % total)

Because each forked process has a separate GIL, this program can parcel the work out and run multiple computations at once.

(Jython and IronPython provide single-process parallelism, but they are far from full CPython compatibility. PyPy with Software Transactional Memory may some day be fast. Try these interpreters if you're curious.)

Conclusion

Now that you've opened the music box and seen the simple mechanism, you know all you need to write fast, thread-safe Python. Use threads for concurrent I/O, and processes for parallel computation. The principle is plain enough that you might not even need to write it on your hand.

A. Jesse Jiryu Davis will be speaking at PyCon 2017, which will be held May 17-25 in Portland, Oregon. Catch his talk, Grok the GIL: Write Fast and Thread-Safe Python, on Friday, May 19.

5 Comments

Yakup Keskindağ

Hi Jesse, This is the greatest article I've read about GIL and Multithread programming in Python. Thanks a lot ;)

Vote up!
1
Vote down!
0
Stephen White

Good article, but there are a couple of issues with it. (a) Extension modules can also release the GIL during purely CPU operations. Most notably, numpy does this if you perform operations on large matrices, like dot, +, or even sort! . If you're doing something computationally intensive you ought to be using something like this anyway, because the core code is written in C and will run orders of magnitude faster than a hand-rolled loop written in Python. This is the main misconception about the GIL and it's a shame that this article propagates it. (b) Although you mentioned that you're only writing about CPython, I don't think you made it clear that the stuff about individual operations being atomic (like sort) only applies to CPython. Besides, what is one op code in today's version of Python might be multiple in a later version. So your discussion using dis is academically interesting but you really should manually lock in cases like that.

Vote up!
0
Vote down!
0
gusennan

Nice article and very helpful how you included CPython code for background.

I think, though, that your statement about list.sort() being atomic is not accurate. In a preempt-able scenario with multiple threads, there's nothing preventing thread 1 from starting to sort the list in-place, runs out of the 1000 bytecodes or 15ms that's allocated to it, and then another thread becomes the currently running thread and appends an item to it. This would be a problem. An external locking mechanism is needed to guarantee the list.sort() function is not interrupted.

From the documentation:

"CPython implementation detail: While a list is being sorted, the effect of attempting to mutate, or even inspect, the list is undefined. The C implementation of Python makes the list appear empty for the duration, and raises ValueError if it can detect that the list has been mutated during a sort."

https://docs.python.org/3/library/stdtypes.html?highlight=list#list

Vote up!
0
Vote down!
0
emptysquare

Thanks Nate. Check out the bytecode: list.sort() is a single bytecode, so it cannot be interrupted. The documentation you refer to describes how a C extension must interact with a list, while it is being sorted, if the C extension is running a thread that does not hold the GIL. Your Python code always holds the GIL while it runs, and therefore it can never see a list *while* it is being sorted by another thread.

Vote up!
0
Vote down!
0
gusennan

Thanks for clarifying. This makes sense as long as the GIL will not be relinquished in the CPython code, which from your other examples seems to only be done consciously by IO-dependent code, which I take from what you've written to be the case.

Vote up!
0
Vote down!
0

Comment now