Parallelization in Python

Python is often praised for its simplicity and readability, but when it comes to optimizing parallel code execution, it raises complex questions. This article explores the fundamental differences between multithreading, asynchronous programming, and multiprocessing in Python. It also highlights the limitations imposed by the Global Interpreter Lock (GIL) and the challenges associated with multiprocessing.

Multithreading vs Asynchronous Programming vs Multiprocessing

What is Multithreading?

Multithreading involves running multiple threads within the same process. These threads can execute on different physical or logical CPU cores while sharing the same memory space, which facilitates communication and reduces memory usage. In theory, it is possible to allocate a dedicated thread to each CPU core, allowing multiple tasks to run in parallel and resulting in significant performance gains.

However, in Python, multithreading is heavily limited by the GIL (Global Interpreter Lock). This lock prevents multiple Python threads from executing bytecode simultaneously, even on a multi-core processor. For I/O-bound tasks (such as network requests or disk operations), multithreading remains effective because the GIL is released during these calls. Beyond these scenarios, only Python extensions written in C that explicitly release this lock can fully utilize multiple cores. But then, can it truly be considered Python parallelization if achieving it requires external extensions that are not written in Python?

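A minimal sketch of I/O-bound multithreading, with time.sleep standing in for a network request (the URLs are placeholders):

```python
import threading
import time

def fetch(url: str, results: dict) -> None:
    # Simulate an I/O-bound call (e.g. a network request); the GIL is
    # released while the thread sleeps, so the other threads can run.
    time.sleep(0.2)
    results[url] = f"response from {url}"

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
results: dict = {}

start = time.perf_counter()
threads = [threading.Thread(target=fetch, args=(u, results)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"{len(results)} responses in {elapsed:.2f}s")  # ~0.2s, not 0.6s
```

The three simulated requests overlap, so the total time is close to the longest single call rather than the sum of all three.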

What is Asynchronous Programming?

Asynchronous programming, often referred to as “concurrent programming,” allows tasks to be executed in an interleaved manner by relying on rapid context switches. This creates the illusion of tasks being executed in parallel. It achieves this by running I/O-bound operations in a non-blocking way, using an event-driven model. Unlike multithreading, which relies on multiple concurrent threads, asynchronous programming uses a single thread and coroutines to manage multiple tasks concurrently. This approach is more efficient than multithreading for I/O-bound operations, though it often requires a higher level of development expertise.

In Python, asynchronous programming is supported by libraries such as asyncio, anyio, or trio. Due to the GIL, multithreading offers no real advantage over asynchronous programming for I/O-bound tasks. In such cases, asynchronous programming is generally the better choice, as it ensures greater compatibility with future language developments, especially if the GIL is eventually removed in upcoming versions.

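A comparable sketch with asyncio, where asyncio.sleep stands in for a non-blocking I/O call (again, the URLs are placeholders):

```python
import asyncio
import time

async def fetch(url: str) -> str:
    # Simulate non-blocking I/O; `await` yields control to the event
    # loop so the other coroutines make progress in the meantime.
    await asyncio.sleep(0.2)
    return f"response from {url}"

async def main() -> list:
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    # Schedule all coroutines concurrently, on a single thread.
    return await asyncio.gather(*(fetch(u) for u in urls))

start = time.perf_counter()
responses = asyncio.run(main())
elapsed = time.perf_counter() - start
print(f"{len(responses)} responses in {elapsed:.2f}s")  # ~0.2s total
```

No threads are involved: one event loop interleaves the three coroutines at each `await` point.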

What is Multiprocessing?

For a long time, the only way to parallelize CPU-bound tasks in Python has been multiprocessing: creating multiple child processes, each with its own memory space and, more importantly, its own interpreter and GIL. Within each process, Python code still effectively runs on a single thread, but the processes themselves run in parallel, allowing for significant performance gains on CPU-bound workloads.

Since this approach is necessary in Python more than in other languages, the multiprocessing module aims to provide an API very similar to that of the threading module. This design attempts to hide many complexities specific to multiprocessing, particularly regarding communication between child processes and resource management. Despite this effort, we believe that trying to transparently equate these two concepts is sometimes tedious or even risky, as they are fundamentally different.

We will also explore other techniques available in Python to work around the GIL. Unfortunately, these techniques remain experimental or not widely adopted at this time.


The Global Interpreter Lock: A Barrier to Parallelization

The Global Interpreter Lock (GIL), as its name suggests, is an internal lock specific to the CPython interpreter that restricts parallel execution of Python threads. Its primary role is to ensure that only one thread can execute Python bytecode at a time, even on multi-core systems.
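A quick way to observe this is to time the same pure-Python CPU-bound function run sequentially and then in two threads; the sketch below assumes a standard (GIL) build of CPython, and exact timings will vary by machine:

```python
import threading
import time

def busy(n: int) -> int:
    # Pure-Python CPU-bound loop: it holds the GIL almost constantly.
    total = 0
    for i in range(n):
        total += i
    return total

N = 2_000_000

start = time.perf_counter()
busy(N)
busy(N)
sequential = time.perf_counter() - start

start = time.perf_counter()
threads = [threading.Thread(target=busy, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

# With the GIL, the threaded version is no faster than the sequential
# one, despite having two threads (and likely two cores) available.
print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```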

Other languages also implement similar mechanisms, such as:

  • Ruby: The Global VM Lock (GVL) is conceptually equivalent to the GIL.
  • R: Its threading model is also highly limited. Parallel execution is only achievable through native code execution or explicitly adapted packages.
  • PHP: The Zend Engine implementation, used as an Apache module, employs a similar but slightly more permissive concept.
  • Lua: Intrinsically single-threaded, Lua doesn’t have an exact equivalent to the GIL, but running multiple Lua scripts in parallel within the same interpreter is impossible.
  • JavaScript: Like Lua, its execution model is single-threaded, relying on an event loop, making it effectively asynchronous programming.

Why Does the GIL Still Exist?

The GIL significantly simplifies the implementation of the CPython interpreter and Python extensions written in C, which are often not “thread-safe.” By maintaining a global lock, it avoids the need for complex synchronization mechanisms around shared memory management.

Removing the GIL would require profound changes, not only to CPython itself but also to the majority of third-party libraries, whether or not they rely on Python extensions written in other languages. Many of these libraries depend on assumptions tied to the GIL’s presence. These changes would include overhauling memory management models and introducing synchronization mechanisms within CPython. More importantly, it would necessitate revisiting all third-party libraries, some of which may rely on improper multithreading synchronization methods that, until now, have been harmless thanks to the GIL.

What’s Next for the GIL?

In recent years, many efforts have been made to reduce or completely eliminate the limitations imposed by the GIL:

IronPython / Jython

Several alternative Python interpreters have been developed to address some of the limitations of CPython:

  • IronPython: Designed to run on the .NET Framework, this interpreter does not use a global lock, enabling true multithreaded execution for CPU-bound tasks.
  • Jython: An implementation of Python on the JVM (Java Virtual Machine), it leverages the JVM’s thread management mechanisms and, therefore, does not rely on a global lock.

These interpreters provide interesting alternatives to CPython, particularly in scenarios where the absence of a global lock is critical for performance. However, many Python extensions written in C are not (or are only partially) supported by these interpreters. This includes a significant number of libraries that contribute to Python’s popularity today, such as pandas or numpy, often making the transition to these interpreters impractical.

Software Transactional Memory (STM) with PyPy

Software Transactional Memory (STM) is an approach that manages synchronization and concurrency without relying on explicit locking mechanisms like the GIL. STM is inspired by transactions in databases: concurrent operations are executed within isolated “transactions,” and at the end of each transaction, a validation mechanism ensures that the changes made did not result in conflicts. If conflicts are detected, the transactions are automatically replayed. This approach enables better utilization of multi-core processors for CPU-bound Python applications.

While CPython does not natively support STM, projects like PyPy have explored this approach. For example, PyPy STM manages concurrency by avoiding the limitations of the GIL while maintaining compatibility with most standard Python programs.

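PyPy STM exposed this through its own transaction primitives. As an illustration of the idea itself rather than the actual PyPy STM API, here is a toy optimistic-concurrency sketch in plain Python (all class and method names are invented for this example):

```python
import threading

class STMCell:
    """Toy STM-style cell: optimistic read, then validate-and-commit."""

    def __init__(self, value: int) -> None:
        self.value = value
        self.version = 0
        self._commit_lock = threading.Lock()  # held only at commit time

    def run_transaction(self, update) -> None:
        while True:
            # 1. Take a snapshot without blocking other threads.
            seen_version, seen_value = self.version, self.value
            # 2. Do the work inside the "transaction".
            new_value = update(seen_value)
            # 3. Validate and commit atomically; on conflict, replay.
            with self._commit_lock:
                if self.version == seen_version:
                    self.value = new_value
                    self.version += 1
                    return

def bump_many(cell: STMCell, times: int) -> None:
    for _ in range(times):
        cell.run_transaction(lambda v: v + 1)

cell = STMCell(0)
threads = [threading.Thread(target=bump_many, args=(cell, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cell.value)  # 4000: every increment committed exactly once
```

A transaction that loses the race simply observes a changed version number at commit time and is replayed, exactly as described above for STM.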

The Major Drawback of STM

The primary disadvantage of STM lies in its complexity and its performance cost when conflicts are frequent or the number of simultaneous transactions is high. Currently, support for STM in the Python ecosystem remains limited and exploratory.

Subinterpreters (Python 3.12)

Since CPython 3.12 (PEP 684), via the C API for Python extensions, it has been possible to create subinterpreters that no longer depend on the parent interpreter’s GIL, each holding its own global lock. This feature represents a significant advancement. However, its practical use remains limited: it is primarily designed for Python extensions in C, and its integration requires explicit management of communication and synchronization between subinterpreters. Similar to the multiprocessing approach, shared data must be exchanged through mechanisms like queues, UNIX sockets, or shared memory segments, which increases implementation complexity. As a result, despite its potential, this approach has not yet gained significant adoption in the standard Python ecosystem.

NoGIL (Python 3.13)

PEP 703, titled “Making the Global Interpreter Lock Optional in CPython”, proposes to:

  • Make the GIL optional, disabled in a dedicated build of CPython.
  • Provide backward compatibility.
  • Minimize the impact on single-threaded performance.

Without the GIL, there would no longer be a need for multiprocessing or multi-interpreter approaches and their associated complexities. However, as mentioned earlier, removing the GIL requires a complete review of all existing libraries, including both pure Python libraries and Python extensions, which will now need to be thread-safe. While auditing CPython’s standard library alone is a significant challenge, Python’s success has led to hundreds of thousands of libraries available on PyPI. Many of these will likely never be audited or updated.

This change could also affect single-threaded applications, as existing thread safety would need to be reinforced, even though PEP 703 includes optimizations for this scenario. In other words, while this change is a major step forward, we are still far from being able to fully benefit from it.

Challenges of Multiprocessing

Although multiprocessing circumvents the GIL and enables effective parallelization of CPU-bound tasks, it introduces several challenges for which the multiprocessing module provides some tools:

Communication Between Processes and Data Management

Unlike a single-process approach, data must be explicitly transferred between child processes and the parent process, as they do not share the same memory space.

To achieve this, one or more of the following techniques can be used:

  • Shared Memory
    The multiprocessing module provides tools such as Value and Array for sharing data between processes, which helps minimize data copying and improve performance.

  • Sockets
    Sockets enable data exchange using network primitives, either between remote machines or on the same machine. In the latter case, it is preferable to use UNIX sockets, which provide better performance by bypassing the IP layer.
  • Queues and Pipes
    For simpler communication, the multiprocessing module provides queues (multiprocessing.Queue), which offer a straightforward producer/consumer-style interface, and pipes (multiprocessing.Pipe), which return a pair of Connection objects with a socket-like interface that exchanges whole Python objects.

Synchronization

Just like in multithreading approaches, primitives such as locks, semaphores, and events are necessary to coordinate processes. However, it is important to keep in mind that these primitives, while easy to use, also rely on the inter-process communication techniques mentioned earlier. Needless to say, these primitives have a much higher cost than their counterparts that share the same memory space.

And What About Memory?

Unlike threads, child processes do not share their memory space, which inevitably leads to increased memory usage.

A Concrete Example: Logging

One common pitfall when starting with the multiprocessing approach is using the logging module. Unfortunately, this is one of those subtle issues that might not come to mind initially when optimizing performance. And yet, it often becomes essential when building a serious application.

To convert a single-threaded Python application with a logger into one using the multiprocessing approach, it is necessary to configure the logger in a specific way to avoid concurrent access issues. This ensures consistent and complete logs, particularly when they are being written to disk. Below is an example of an additional configuration to include in your application:


With this configuration, each log message generated by the child processes is sent through a queue to the parent process, which is the only one that actually logs messages in the application.


Conclusion

The choice between multithreading, asynchronous programming, and multiprocessing depends on the specific needs of your application.

In Python, the limitations of the GIL and the complexities of multiprocessing show that parallelization is not always straightforward. It’s easy to assume that a multithreading approach will automatically improve performance, but in reality it often offers no advantage over an asynchronous approach, and can even perform worse. However, future developments, especially the removal of the GIL, will undoubtedly transform how parallelized applications are developed in Python.

For now, a deep understanding of existing approaches remains essential to optimizing your Python applications.

My Recommendations

For long-running calculations or persistent applications:

  • I/O-bound tasks: Favor asynchronous programming.
    Specifically, I recommend using the anyio library, which keeps your projects compatible with both asyncio and trio. It also helps ensure your code remains robust regardless of the Python implementation you use, with or without the GIL.
  • CPU-bound tasks: Use multiprocessing to work around the GIL.
    Consider using a more advanced third-party module like joblib, which provides a high-level interface for job management with features like caching. Don’t hesitate to rely on a higher-level library than the default multiprocessing module to avoid many issues inherent to this approach. Some libraries, for instance, handle problems like the one encountered with the logging module mentioned earlier, either directly or indirectly.

For short-running calculations or ephemeral applications:

If you’re looking for a quick way to develop and test different approaches, you might consider using the concurrent.futures module. It provides a relatively simple abstraction for parallelization and allows you to quickly choose between multithreading and multiprocessing approaches.


Additional Links