Dispel the Myths with Tools to Achieve Parallelism
David Liu, software technical consulting engineer, and Anton Malakhov, software development engineer, Intel Corporation
Python* as a programming language has enjoyed nearly a decade of use in industry and academia. This high-productivity language has been one of the most popular abstractions to scientific computing and machine learning, yet the base Python language remains single threaded. How is productivity in these fields being maintained with a single-threaded language?
The Python language design by Guido van Rossum was meant to trade off type flexibility and predictable, thread-safe behavior against the complexity of having to manage static types and threading primitives. This, in turn, meant having to enforce a global interpreter lock (GIL) to limit running to a single thread at a time to preserve this design mentality. Over the last decade, many concurrency implementations have been made for Python but few are in the region of parallelism. Does this mean the language isn’t performant? Let's explore further.
The base language's fundamental constructs for loops and other asynchronous or concurrent calls all abide by the single-threaded GIL, so even list comprehensions such as [x*x for x in range(0,10)] are always single threaded. The threading library's existence in the base language is also a bit misleading, since it provides the behavior of a threading implementation but still operates under the GIL. Many of the features of Python concurrent futures to almost-parallel tasks also operate under the GIL. Why does such an expressive productivity language restrict the language to these rules?
The reason is the level of abstraction the language design adopted. It ships with many tools to wrap C code, from ctypes to cffi. It prioritizes multiprocessing over multithreading in the base language, as evidenced by the multiprocessing package in the built-in Python library. These two design ideas are evident in some of the popular packages, like NumPy and SciPy, that use C code under the Python API to dispatch to a mathematical runtime library such as Intel® Math Kernel Library (Intel® MKL) or OpenBLAS. The community has adopted the paradigm to dispatch to higher-speed C-based libraries, and has become the preferred method to implement parallelism in Python.
In the combination of these accepted methods and language limitations are options to escape them and apply parallelism in Python through unique parallelism frameworks:
- Numba* allows for JIT-based compilation of Python code that can also run LLVM-based code that's compatible with Python.
- Cython gives a syntax that's similar to Python with compiled modules that can target hardware vectorization as it compiles to a C module.
- NumExpr allows for a symbolic evaluation to use compilers and advanced vectorization.
These methods escape the Python GIL in different ways while preserving the original intent of the language. All three methods implement different models of parallelism.
Let's take the general example of one of the most common language constructs to apply parallelism—the for loop. The following loop provides a basic service that returns in a list all the numbers that are fewer than 50:
def test_func(list_of_items): final_list =  for items in list_of_items: if item < 50: final_list.append(item) return final_list
Running this code gives the following result:
import random random_list = [random.randint(0,1000000) for x in range(0,1000000)] %timeit test_func(random_list) 27.4ms ± 331 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Python handles the list of items in a single-threaded way under the GIL since it's written in pure Python. It handles everything sequentially and doesn’t apply any parallelism to the code. Because of the way this code is written, it's a good candidate for the Numba framework. Numba uses a decorator (with the @ symbol) to flag functions for just-in-time (JIT) compilation that is applied on this function:
@jit(nopython=True) def test_func(list_of_items): final_list =  for item in list_of_items: if item < 50: final_list.append(item) return final_list
Running this code now gives the following result:
%timeit test_func(random_list) 15.7ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Including this simple decorator nearly doubled performance. This works because the original Python code is written in primitives and datatypes that can be easily compiled and vectorized to a CPU. Python lists are the first place to look. Normally, this data structure is heavy with its loose typing and built-in allocator. However, the datatypes that the random_list contains are all integers. Because of this consistency, the JIT compiler of Numba can vectorize the loop.
If the list contains mixed items (for example, a list of chars and ints), the compiled code throws a TypeError because it can't handle the heterogeneous list. Also, if the function contains mixed datatype operations, Numba fails to produce a high-performance JIT-compiled code and falls back to Python object code.
The lesson here is that achieving parallelism in Python depends on how the original code is written. Cleanliness of datatypes and the use of vectorizable data structures allow Numba to parallelize code with the insertion of a simple decorator. Being careful about the use of Python dictionaries pays dividends because historically they don’t vectorize well. Generators and comprehensions suffer from the same problem. Refactoring such code to lists, sets, or arrays can facilitate vectorization.
Parallelism is much easier to achieve in numerical and symbolic mathematics. NumPy and SciPy do a great job dispatching the computation outside of the Python GIL to lower-level C code and the Intel MKL runtime. Take, for example, the following simple NumPy symbolic expression, ((2*a + 3*b)/b):
import numpy as np a = np.random.rand(int(1e6)) b = np.random.rand(int(1e6)) %timeit (2*a + 3*b)/b 8.61ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This expression makes multiple trips through the single-threaded Python interpreter because of the structure and design of NumPy. Each return from NumPy is dispatched to C and returned back to the Python level. Then, the Python object is sent to each subsequent call to be dispatched to C again. This back-and-forth jumping becomes a bottleneck in the computation. So, when you need to compute custom kernels that can't be described in NumPy or SciPy, numexpr is a better option:
import numexpr as ne %timeit ne.evaluate('(2*a + 3*b)/b') 2.22ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
How does numexpr achieve nearly a 4x speedup? The previous code takes the symbolic representation of the computation into the numexpr engine to generate code that works with the vectorization commands from the vector math library in Intel MKL. So, the entire computation stays in low-level code before completing and returning the result back to the Python layer. This method also avoids multiple trips through the Python interpreter, cutting down on single-threaded sections while also providing a concise syntax.
By looking at the Python ecosystem and evaluating the different parallelism frameworks, it's evident that there are good options. To master Python parallelism, it's important to understand the tools and their limitations. Python chose the GIL as a design consideration to simplify framework development and give predictable language behavior. But, the GIL and its single-threaded restrictions are easier to sidestep with the right tools.
This article is from The Parallel Universe, Intel's quarterly magazine that helps take your software development into the future with the latest tools, tips, and training to expand your expertise.
You May Also Like
Parallelism in Python*: Directing Vectorization with NumExpr*
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.