Parallel computation in Python with Dask

The Dask library scales Python computation to multiple cores or even to multiple machines.
One frequent complaint about Python performance is the global interpreter lock (GIL). Because of the GIL, only one thread can execute Python byte code at a time. As a consequence, using threads does not speed up computation—even on modern, multi-core machines.
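The constraint is easy to see with a small sketch (purely illustrative; the function name and the work size here are made up): running a CPU-bound function in several threads produces correct results, but because the GIL lets only one thread execute byte code at a time, the threads take turns rather than running simultaneously.

```python
import threading

# A CPU-bound task in pure Python: the GIL serializes its byte code
def sum_squares(n):
    return sum(i * i for i in range(n))

results = []

def worker(n):
    results.append(sum_squares(n))

# Four threads all doing the same CPU-bound work; correct, but no
# faster than calling sum_squares four times in a loop
threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```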

But when you need to parallelize to many cores, you don't need to stop using Python: the Dask library will scale computation to multiple cores or even to multiple machines. Some setups configure Dask on thousands of machines, each with multiple cores; while there are scaling limits, they are not easy to hit.

Dask has many built-in array operations. As an example of something that is not built in, we can calculate the skewness:

import numpy
from dask import array as darray

arr = darray.from_array(numpy.array(my_data), chunks=(1000,))
mean = darray.mean(arr)
stddev = darray.std(arr)
unnormalized_moment = darray.mean(arr * arr * arr)
# See the formula on Wikipedia:
skewness = ((unnormalized_moment - (3 * mean * stddev ** 2) - mean ** 3) /
            stddev ** 3)

Notice that each operation will use as many cores as it can. This will parallelize across all cores, even when calculating across billions of elements.
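The formula itself can be sanity-checked with plain NumPy on a small sample (a sketch using randomly generated data, since `my_data` is not defined here): the moment-based expression should agree with the direct definition of skewness, the mean of the cubed standardized values.

```python
import numpy

# Hypothetical stand-in data for my_data
rng = numpy.random.default_rng(0)
data = rng.normal(size=10_000)

mean = data.mean()
stddev = data.std()
unnormalized_moment = (data ** 3).mean()

# Same formula as the Dask version, evaluated eagerly
skewness = (unnormalized_moment - 3 * mean * stddev ** 2 - mean ** 3) / stddev ** 3

# Direct definition: mean of the cubed standardized values
direct = (((data - mean) / stddev) ** 3).mean()

print(abs(skewness - direct))  # should be tiny
```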

Of course, it is not always the case that our operations can be parallelized by the library; sometimes we need to implement parallelism on our own.

For that, Dask has a "delayed" functionality:

import dask

def is_palindrome(s):
    return s == s[::-1]

# string_list is assumed to be defined as a list of strings
palindromes = [dask.delayed(is_palindrome)(s) for s in string_list]
total = dask.delayed(sum)(palindromes)
result = total.compute()

This will calculate whether strings are palindromes in parallel and will return a count of the palindromic ones.
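Here is the same pattern as a self-contained sketch, with a hypothetical sample list standing in for `string_list`. Note that `dask.delayed` only records the calls; nothing runs until `compute()` is invoked.

```python
import dask

def is_palindrome(s):
    return s == s[::-1]

# Hypothetical sample data standing in for string_list
string_list = ["racecar", "dask", "level", "python", "noon"]

# Each call is wrapped into a lazy task; no work happens yet
palindromes = [dask.delayed(is_palindrome)(s) for s in string_list]
total = dask.delayed(sum)(palindromes)

# compute() executes the task graph, running the tasks in parallel
result = total.compute()
print(result)
```

Since booleans count as 0 and 1, summing the delayed results yields the number of palindromes in the list.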

While Dask was created for data scientists, it is by no means limited to data science. Whenever we need to parallelize tasks in Python, we can turn to Dask—GIL or no GIL.

Moshe has been involved in the Linux community since 1998, helping with Linux installation parties. He has been programming in Python since 1999 and has contributed to the core Python interpreter. Moshe has been a DevOps/SRE since before those terms existed, and he cares deeply about software reliability, build reproducibility, and other such things.



This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.