The New Strided API for the Intel® Math Kernel Library (MKL) Vector Math (VM) component

ID 659917
Updated 7/17/2020
Version Latest




This article discusses the Strided API - a new functionality introduced in the Intel® Math Kernel Library (Intel® MKL), version 2020 [1] for the Intel® MKL Vector Math (VM) component.
This functionality allows you to use non-unit increments when traversing input and output arrays of a VM mathematical function. It is particularly useful in multi-dimensional arrays processing which is widely demanded in Python* popular NumPy package. The new Intel® MKL Vector Math API allows it to be done in just one pass in parallel mode.


Originally the Vector Math component of Intel® Math Kernel Library was designed to support simple one-dimensional arrays processing by unit index increments. But in some cases, only selected elements of an input array with some fixed step need to be processed and written to an output array using some other fixed step. In this article, we show how the one-dimensional VM approach can be extended to the multi-dimensional case and support Python* NumPy notion of “strides”. NumPy strides are the number of bytes to jump-over in memory in order to get from one item to the next along each direction/dimension of the array. It is the separation between consecutive items for each dimension. With NumPy strides, Python* users can process multi-dimensional arrays with complex layouts in any direction – vertical, horizontal, etc., using arbitrary increments.
The Vector Math component is being utilized in the NumPy package performance optimization. The lack of a strided interface for VM Mathematical Functions in previous Intel® Math Kernel Library releases can be emulated either by simple threaded loops before and after the VM call, for example: 


#pragma omp parallel for
for(i=0;i<n;i++){ packed_input[i] = input[i*inca]; } 

vdExp(n, packed_input, packed_result);

#pragma omp parallel for
for(i=0;i<n;i++){ result[i*incr] = packed_result [i]; }

or by calling special VM Pack\Unpack functions doing the same (currently not threaded in Intel® MKL 2020 though):


vdPackI (n, input, inca, packed_input);

vdExp (n, packed_input, packed_result);

vdUnpackI (n, packed_result, result, incr);

here are several obvious drawbacks of these emulation approaches:
1)    Additional memory consumption for intermediate packed arrays
2)    Three passes through memory rather than one (possible performance loss)
3)    Threads creation overhead for three different sections rather than one (more performance loss mostly on shorter vector lengths)
The new strided API combines pack\unpack parts with computational core. It does so within the same pass through memory and the same parallel section (in case of threaded layer). The difference with classic interface is additional increment parameters after each array and the “I” suffix in the function name:


vdExpI (n, input, inca, result, incr); 

The pseudo-code for [1.3]:


vdExpI (n, input, inca, result, incr)
   for(i=0; i < n; i++)  
        result[i*incr] = exp( input[i*inca] );

All previously available functionality – full VM functions list, all precisions (single, double, complex), accuracies (VML_HA, VML_LA, VML_EP), computation and error handling modes – also supported in the new strided API. In-place operations are also supported but only when the corresponding input and output increments are equal.
Increment values for each input and output array can be different in general. Negative and zero strides are not yet supported and considered for next Intel® MKL releases. 
The note for NumPy developers: strided VM increment parameters are measured in elements of corresponding data type while NumPy stride units are bytes – conversion may be required.
Another important note: ‘n’ is the number of elements to be processed, not the maximum array size. So, when calling vdExpI(n, input, inca, result, incr) be sure that the input vector is allocated at least for 1 + (n-1)*inca elements and the result vector has a space for 1 + (n-1)*incr elements.

Usage Example

The following example shows how to implement NumPy np.Add function via strided VM call:


np.add(A, B):
for (i0, i2) in zip(range(0, 2), range(0, 2)):
    for i1 in range(3):
        C[i0, i1, i2] = A[i0, i1, i2] + B[i0, i1, i2]

Can be expressed as:


for(i0=0 ; i0 < 2; ++i0) 
    for(i2=0 ; i2 < 2; ++i2) 
        DOUBLE_add({&A[i0,0,i2], &B[i0,0,i2], &C[i0,0,i2]}, {3,3,3}, {2,2,2})  

DOUBLE_add({ &p0, &p1, &p2 }, {n, n, n}, {st0, st1, st2})
    for (i=0; i<n; i++, p0+=st0, p1+= s1, p2+=st2)
        double v0 = *((double *) p0), 
        double v1 = *((double *) p1);
        double u = v0 + v1;
        *((double *)p2)= u;

If the strides were equal to 1 then we could just have called vdAdd. 
In this example, stride=2 so the new strided vdAddI is needed:


DOUBLE_add({ &p0, &p1, &p2 }, {n, n, n}, {st0, st1, st2})
    vdAddI(n, p0, st0, p1, st1, p2, st2);


Performance Comparison of Strided VM versus Pack\Unpack emulation approach

The performance comparison is based on simple CPU time stamp counter and provides clock cycles per element (CPE) values. The new strided VM vdExpI HA (High Accuracy) function performance was compared versus the simple ‘packed’ emulation approach [1.1] above implemented via threaded loops. Input arguments are uniformly distributed in the interval [-10.0, 10.0].
Performance by stride sizes (3, 7, 11, 33, 66, 123) with fixed vector length (1,000,000):

Performance by vector lengths (10K, 50K, 100K, 500K, 1M, 5M) with fixed stride size (33):

The new strided VM API demonstrates 1.4x–7.3x performance benefits and reduces memory consumption in comparison to approach [1.1].
Used Hardware: 
•    Cascade Lake server (CLX), 128 GB memory, 2 sockets, 24 cores/socket, Cache size: L1: 32KB + 32KB, L2: 1 MB, L3: 36 MB, Hyper-threading & throttling OFF
Used Software: 
•    Linux RHEL 7.2 
•    Intel® MKL 2020 u2, TBB threading layer




Kolesov, Andrey
Akkas, Ahmet
de Lassus Saint-Geniès, Hugues
Schneider, Eric
Gilbert, Robin C
Cornea, Marius
Fedorov, Gennady