Parallelism in Python*: Directing Vectorization with NumExpr

ID 721702
Updated 8/19/2019



Boost Performance for Computing with Arrays and Numerical Expressions

Fabio Baruffa, PhD, technical consulting engineer, Intel Corporation

Python* has several pathways to vectorization (for example, instruction-level parallelism), ranging from just-in-time (JIT) compilation with Numba*1 to C-like code with Cython. One interesting way of achieving Python parallelism is through NumExpr, in which a symbolic evaluator transforms numerical Python expressions into high-performance, vectorized code. NumExpr achieves this by vectorizing in chunks of elements instead of compiling everything at once. This creates accelerated object kernels that are usable from Python code. This article explores how to refactor Python code to take advantage of the NumExpr capabilities.


Parallelization of Numerical Expressions

The flexibility of Python, with its easy syntax, allows developers to rapidly prototype numerical computations with the help of libraries like NumPy and SciPy. But the Python language wasn't designed with parallelism in mind, even though parallelism is a key requirement for getting performance out of modern vector and multicore processors. So how is it possible to vectorize numerical expressions using Python?

A numerical expression is a mathematical statement that involves numbers and mathematical symbols to perform a calculation (for example, 11*a-42*b). In Python, this expression can also operate on NumPy arrays a and b when it's evaluated through the NumExpr package. In that case, the same calculation is accelerated through intrinsic parallelism and vectorization, compared to evaluating it in standard Python.

To boost performance, NumExpr can use the optimized Vector Mathematical Function Library (VML) that is included in the Intel® Math Kernel Library (Intel® MKL). This library makes it possible to accelerate the evaluation of mathematical functions (for example, sine, exponential, or square root) that operate on vectors stored contiguously in memory.


Refactor Common NumPy Calls for NumExpr

To use the NumExpr package, pass the computational string to the evaluate function. NumExpr compiles it into an object and carries out the entire computation in low-level code. Only the final result is returned to the Python layer, which avoids repeated calls to the Python interpreter.

Let's look at an example of computing a simple expression for NumPy arrays:
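The original listing isn't reproduced here, but a minimal sketch of such a comparison might look like the following (the array size is a placeholder; the speedup reported below was measured on large arrays):

```python
import numpy as np
import numexpr as ne

# Hypothetical array size; speedups like the one quoted below
# are typically observed on multi-million-element arrays.
a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Plain NumPy: each operator materializes a full temporary array.
result_np = 11 * a - 42 * b

# NumExpr: the whole expression string is compiled once and
# evaluated in cache-friendly chunks.
result_ne = ne.evaluate("11 * a - 42 * b")
```

Timing both versions (for example, with the timeit module or IPython's %timeit) is how such speedups are usually measured.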

In this case, the 4x speedup is due to the intrinsic vectorization enabled by VML. The library can also perform in-place operations where the copying overhead is negligible.

Now let's evaluate the speedup from NumExpr when using a mathematical function, where the benefit of VML becomes more evident:
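The article's listing is omitted here; a sketch of such a measurement, assuming a transcendental expression like sqrt(2*x + 1), could be:

```python
import numpy as np
import numexpr as ne

x = np.random.rand(1_000_000)  # hypothetical size

# NumPy evaluates the expression step by step,
# materializing the intermediate arrays 2*x and 2*x + 1.
y_np = np.sqrt(2 * x + 1)

# NumExpr avoids those intermediates and, when built against
# Intel MKL, dispatches sqrt to the optimized VML routine.
y_ne = ne.evaluate("sqrt(2 * x + 1)")
```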

In this case, the higher performance comes from the optimized sqrt function in Intel MKL; the speedup is close to 19x. This shows that, for some expressions, the NumPy library alone doesn't deliver the acceleration you might expect. The NumExpr implementation also avoids memory allocations for intermediate results, which improves cache use and reduces memory overhead. The benefit of these optimizations shows up most clearly in computations with large arrays.


Control the NumExpr Evaluator

Since NumExpr uses the VML library internally, it computes mathematical functions only for the types the library supports: real and complex vector arguments with unit increments, plus integer and Boolean types. When the types of the arrays in an evaluate expression don't match, they're cast according to the usual inference rules.

Performance depends on various factors, including vectorization and memory overhead. For this reason, NumExpr exposes VML-related functions to tune performance and control numerical accuracy (and, if needed, the number of threads).

To check the installation and get the VML library version, call the function get_vml_version(). All the vector functions support the following accuracy modes through the function set_vml_accuracy_mode(mode). The mode can be set to:

  • High, equivalent to high accuracy; this is the default mode
  • Low, equivalent to low accuracy, which improves performance by reducing the accuracy in the two least significant bits
  • Fast, equivalent to enhanced performance, which provides the best performance at the cost of significantly reduced accuracy (approximately half of the bits in the mantissa are correct)
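As a sketch, the accuracy mode can be queried and switched like this (on NumExpr builds without VML support, get_vml_version() returns None and the mode setting has no effect):

```python
import numpy as np
import numexpr as ne

# None here means this NumExpr build has no VML support.
print(ne.get_vml_version())

x = np.random.rand(1_000_000)

# Switch to the fast mode; the previous mode is returned
# so it can be restored afterward.
previous_mode = ne.set_vml_accuracy_mode('fast')
y_fast = ne.evaluate("sqrt(x)")

# Restore the earlier setting.
if previous_mode is not None:
    ne.set_vml_accuracy_mode(previous_mode)
```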

For more information, see the Intel MKL Developer Reference2 and the official documentation of NumExpr3.

NumExpr can also be used to control the number of threads. The function set_num_threads(nthreads) sets the maximum number of threads NumExpr uses for its operations (set_vml_num_threads(nthreads) does the same for VML internally). The return value is the previous setting of the number of threads in the current environment. Let's modify the previous example to use threads to improve performance even further:
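A sketch of the threaded variant might look like this (the thread count of 4 is a placeholder; in practice you would match the machine's core count):

```python
import numpy as np
import numexpr as ne

x = np.random.rand(1_000_000)

# Request four worker threads (hypothetical count); the previous
# setting is returned, so it can be restored afterward.
old_setting = ne.set_num_threads(4)

y = ne.evaluate("sqrt(2 * x + 1)")

# Restore the earlier thread count.
ne.set_num_threads(old_setting)
```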

The speedup is 3.7x, with 93% parallel efficiency. In this example, adding threads delivers better performance.

Using NumExpr as an alternative to NumPy can give significant performance benefits for computing with arrays and numerical expressions, thanks to VML. The syntax is similar to NumPy, and with a couple of function calls you can transition code to NumExpr.



  1. The Parallel Universe, Issue 36
  2. Developer Reference for Intel MKL
  3. NumExpr 2.0 User Guide


