# How To Optimize Parallel Stable Sort Performance Using The Intel® oneAPI HPC Toolkit

Published: 03/22/2020

Last Updated: 03/22/2020

## Introduction

In this article, I will discuss several aspects of using the new Intel® oneAPI HPC Toolkit to deliver modern code that implements the parallel “stable” sort introduced in one of my previous articles (/content/www/us/en/develop/articles/how-to-implement-a-parallel-stable-three-way-quicksort-using-intel-c-compiler-and-openmp-45.html).

Specifically, readers will learn how to use the Data Parallel C++ (DPC++) compiler and the oneAPI library which, based on the SYCL execution model, allow the execution of an application’s workloads to be delegated to various hardware acceleration targets, such as CPU, GPU, or FPGA, re-using the same parallel code.

The main idea of this article is to introduce an approach to delivering and running code that implements the parallel stable sort on a heterogeneous computational platform, combining CPU, GPU, and FPGA accelerators to accomplish the parallel sorting task, so that different workloads are executed by accelerators with different hardware architectures. This, in turn, drastically improves the overall performance of the sorting process itself.

While delivering the modern parallel code, I will use the oneAPI library along with other libraries released as part of the oneAPI Toolkit, such as oneTBB (oneAPI Threading Building Blocks) and Parallel STL, to provide maximum performance speed-up and scalability for the entire sorting process. I will demonstrate how to use the oneTBB library, as an alternative to OpenMP’s compiler-level pragmas, to modernize the code implementing the same parallel three-way quicksort discussed previously. Instead of OpenMP’s concurrent tasks, I will use the functionality of the tbb::task_group class to perform the recursive parallel subsorts. Beyond the parallel three-way quicksort code executed on CPUs, I will also discuss how to use oneAPI’s SYCL execution model to parallelize the array-partitioning code and execute it on an accelerator with a different architecture, rather than running it solely on the CPU.

Along with a number of performance optimization aspects, I will also explain the known shortcomings and limitations of the oneAPI SYCL execution model and why, specifically, the parallel stable sort task cannot be executed entirely as SYCL kernel code.

Finally, at the end of this article, I will discuss how to compile and run the parallel code delivered, as well as how to evaluate its performance in the Intel® DevCloud using various performance acceleration hardware, such as Intel® Xeon® Gold CPUs, Intel® HD Graphics, or the Intel® Programmable Acceleration Card (PAC) FPGA platform based on Intel® Arria® 10 GX hardware accelerators.

## Parallel Three-Way Quicksort Using oneTBB Library

Delivering parallel modern code that implements the three-way quicksort using OpenMP’s concurrent tasks was initially discussed in my first article (/content/www/us/en/develop/articles/an-efficient-parallel-three-way-quicksort-using-intel-c-compiler-and-openmp-45-library.html). Using concurrent tasks is ideal here, since the code performing the parallel recursive subsorts can be implemented with most existing performance libraries and frameworks that support concurrent tasking.

However, parallel concurrent tasking is not supported by the oneAPI library and the SYCL execution model. Specifically, SYCL supports only a single task per kernel submitted, and multiple single tasks are executed sequentially. Also, the SYCL execution model does not support recursive function calls within a kernel, which makes it impossible, in this case, to perform the parallel recursive subsorts inside a kernel.

As a workaround to the problem mentioned above, we will use the oneTBB library to implement another version of the code performing the parallel three-way quicksort:

    template<class Container, class _Pred>
    void qsort3w(Container& array, std::size_t _First, std::size_t _Last, _Pred compare)
    {
        if (_First >= _Last) return;

        std::size_t _Size = array.size(); g_depth++;
        if (_Size > 0)
        {
            std::size_t _Left = _First, _Right = _Last;
            bool is_swapped_left = false, is_swapped_right = false;
            gen::ITEM _Pivot = array[_First];

            std::size_t _Fwd = _First + 1;
            while (_Fwd <= _Right)
            {
                if (compare(array[_Fwd], _Pivot))
                {
                    is_swapped_left = true;
                    std::swap(array[_Left], array[_Fwd]);
                    _Left++; _Fwd++;
                }
                else if (compare(_Pivot, array[_Fwd])) {
                    is_swapped_right = true;
                    std::swap(array[_Right], array[_Fwd]);
                    _Right--;
                }
                else _Fwd++;
            }

            tbb::task_group task_group;
            task_group.run([&]() {
                if (((_Left - _First) > 0) && (is_swapped_left))
                    qsort3w(array, _First, _Left - 1, compare);
            });

            task_group.run([&]() {
                if (((_Last - _Right) > 0) && (is_swapped_right))
                    qsort3w(array, _Right + 1, _Last, compare);
            });

            task_group.wait();
        }
    }

In this code I’ve used the functionality of the tbb::task_group class, instead of the corresponding OpenMP tasking pragmas, to spawn two concurrent tasks performing the parallel recursive subsorts. Specifically, I’ve declared a tbb::task_group object and invoked its run(…) method to spawn each task, passing a lambda function that performs the recursive call to qsort3w(…) as the method’s single argument. After spawning these tasks, I’ve provided proper barrier synchronization by invoking the task_group.wait() method:

    tbb::task_group task_group;
    task_group.run([&]() {
        if (((_Left - _First) > 0) && (is_swapped_left))
            qsort3w(array, _First, _Left - 1, compare);
    });

    task_group.run([&]() {
        if (((_Last - _Right) > 0) && (is_swapped_right))
            qsort3w(array, _Right + 1, _Last, compare);
    });

    task_group.wait();

Also, I will no longer use variadic templates for the random-access iterator types, since iterator-based ranges, such as tbb::blocked_range&lt;RanIt&gt;, are not supported by the oneTBB library’s API for this purpose. Instead, I will use indices of the std::size_t type.
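The same fork-join pattern can be sketched with the C++ standard library alone, for readers who want to experiment without oneTBB installed. In this hypothetical sketch, std::async plays the role of tbb::task_group::run and the future’s get() plays the role of task_group.wait(); the function name, the plain-int element type, and the choice to spawn only one of the two subsorts asynchronously are all illustrative choices, not part of the article’s actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Standard-library stand-in for the oneTBB version: three-way partition,
// then two recursive subsorts over disjoint index ranges, one of them
// spawned on another thread via std::async (illustrative names throughout).
template<class T, class Pred>
void qsort3w_async(std::vector<T>& array, std::size_t first, std::size_t last, Pred compare)
{
    if (first >= last) return;

    std::size_t left = first, right = last;
    bool swapped_left = false, swapped_right = false;
    T pivot = array[first];

    // Three-way partition around the pivot: [< pivot | == pivot | > pivot].
    std::size_t fwd = first + 1;
    while (fwd <= right)
    {
        if (compare(array[fwd], pivot)) {
            swapped_left = true;
            std::swap(array[left], array[fwd]);
            left++; fwd++;
        }
        else if (compare(pivot, array[fwd])) {
            swapped_right = true;
            std::swap(array[right], array[fwd]);
            right--;
        }
        else fwd++;
    }

    // Fork: run the left subsort concurrently while the right subsort
    // proceeds on this thread; join via the future's get().
    auto left_task = std::async(std::launch::async, [&] {
        if (((left - first) > 0) && swapped_left)
            qsort3w_async(array, first, left - 1, compare);
    });
    if (((last - right) > 0) && swapped_right)
        qsort3w_async(array, right + 1, last, compare);
    left_task.get();
}
```

The two recursive calls touch disjoint index ranges of the same vector, so no additional synchronization is needed, mirroring the reasoning behind the tbb::task_group version.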

To invoke the parallelism, I will launch the entire sorting process by spawning the qsort3w(…) function as a separate task, executed in its own thread:

    template<class Container, class _Pred>
    void parallel_sort(Container& array, std::size_t _First, std::size_t _Last, _Pred compare)
    {
        g_depth = 0L;
        tbb::task_group task_group;
        task_group.run([&]() {
            internal::qsort3w(array, _First, _Last - 1, compare);
        });
        task_group.wait();
    }

Finally, this code will be used to perform the actual sorting, running on the CPU hardware target.

## Running The Parallel Stable Sort On CPU, GPU And FPGA

As we’ve already discussed in the previous article, to perform a parallel stable sort we first invoke internal::parallel_sort(…) to sort the array of objects by key. According to the basic concept introduced in this article, the actual sorting is normally performed on the CPU target.

In turn, the process of partitioning the array into subsets of objects with equal keys can be executed in parallel on hardware targets other than the multi-core CPU. For that purpose, I will use the Data Parallel C++ compiler and oneAPI’s OpenCL/SYCL wrapper library to deliver code that implements a specific kernel performing the array partitioning in parallel while being executed on various targets:

    template<class Container, class _CompKey, class _CompVals>
    void parallel_stable_sort(Container& array, sycl::queue device_queue,
        _CompKey comp_key, _CompVals comp_vals)
    {
        std::vector<std::size_t> pv;
        pv.resize(array.size()); pv[0] = 0;

        internal::parallel_sort(array, 0L, array.size(), comp_key);

        cl::sycl::buffer<std::size_t> pv_buf{ pv.data(), array.size() };
        cl::sycl::buffer<gen::ITEM> array_buf{ array.data(), array.size() };

        device_queue.submit([&](sycl::handler& cgh) {
            auto pv_acc = pv_buf.get_access<cl::sycl::access::mode::write>(cgh);
            auto array_acc = array_buf.get_access<cl::sycl::access::mode::read>(cgh);
            cgh.parallel_for<class partition_kernel>(cl::sycl::range<1>{array.size() - 2},
                [=](cl::sycl::id<1> idx) {
                    if ((comp_key(array_acc[idx[0]], array_acc[idx[0] + 1]) ||
                        (comp_key(array_acc[idx[0] + 1], array_acc[idx[0]])))) {
                        pv_acc[idx[0] + 1] = idx[0] + 1;
                    }
                });
        }).wait();

        pv_buf.get_access<cl::sycl::access::mode::read>();

        auto _LastIt = std::remove_if(dpstd::execution::par_unseq,
            pv.begin() + 1, pv.end(), [&](const std::size_t index) { return index == 0L; });

        pv.resize(std::distance(pv.begin(), _LastIt));
        pv.push_back(std::distance(array.begin(), array.end()));

        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, pv.size() - 1),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t index = r.begin(); index != r.end(); index++)
                    internal::parallel_sort(array, pv[index], pv[index + 1], comp_vals);
            });
    }


The code implementing the partitioning algorithm can easily be parallelized and executed on various hardware targets, since it normally does not incur data-flow dependencies or similar issues. However, there’s one important algorithm-level aspect of the partitioning process. Normally, the SYCL execution model allows the data stored in buffers of various kinds, such as simple C/C++ arrays or the more complex std::array<> and std::vector<> STL containers, to be accessed through accessors. However, it does not support dynamic buffer re-allocation from inside an executing kernel, such as resizing the buffer or appending new elements with the std::vector<>::push_back(…) method.

That’s why we need to declare a vector of partition indices, such as std::vector&lt;std::size_t&gt; pv, allocating its buffer in the code running on the host CPU prior to executing the specific kernel on the accelerator.

After that, we need to instantiate the device selector and kernel execution queue objects, as shown in the code below:

    default_selector device_selector;
    queue device_queue(device_selector);

The SYCL execution model provides a number of device selectors, each one corresponding to a specific hardware acceleration platform:

| Device selector | Target platform |
| --- | --- |
| sycl::default_selector {} | Intel® FPGA Emulator Application |
| sycl::cpu_selector {} | Intel® CPUs |
| sycl::gpu_selector {} | Intel® GPUs |
| sycl::intel::fpga_selector {} | Intel® FPGA Accelerators |
| sycl::intel::fpga_emulator_selector {} | Intel® FPGA Emulators |

In this case, I will use the default_selector {} object, which allows the specific kernel to be executed on the Intel FPGA Emulator target.

To provide access to the vector of partition indices and the array being sorted from the code running within the specific kernel, I will declare two cl::sycl::buffer<> class objects, whose constructor accepts two arguments: a pointer to the container’s data and the buffer size:

    cl::sycl::buffer<std::size_t> pv_buf{ pv.data(), array.size() };
    cl::sycl::buffer<gen::ITEM> array_buf{ array.data(), array.size() };

Next, I will implement code submitting the specific kernel to the SYCL execution queue. This is typically done by using the device_queue object and invoking the device_queue.submit(…) method. This method accepts a single argument: a lambda function in which the kernel code is implemented. In turn, the argument of that lambda function is an object of the sycl::handler type. We will use this object to access the various methods that provide the ability to run specific code in parallel, as well as to access the data buffers from inside the kernel. Also, we need to provide barrier synchronization for the kernel execution. To do that, I will use the wait() method, invoked after the device_queue.submit(…) call:

    device_queue.submit([&](sycl::handler& cgh) {
        // SYCL kernel code...
    }).wait();

Within this lambda function’s scope, I will instantiate two accessors used to access the specific buffers from the kernel code:

    auto pv_acc = pv_buf.get_access<cl::sycl::access::mode::write>(cgh);
    auto array_acc = array_buf.get_access<cl::sycl::access::mode::read>(cgh);

These accessor objects are instantiated by using the cl::sycl::buffer<>::get_access(…) method, which accepts a single argument: the sycl::handler object.

Then, I will invoke the cgh.parallel_for(…) method to execute a single loop in parallel, during each iteration of which the actual partitioning is performed. Using this method is quite similar to performing tight-loop parallelization with the corresponding OpenMP directive construct. Each iteration of this loop is executed in parallel by its own thread.

The cgh.parallel_for(…) method accepts two main arguments: a cl::sycl::range&lt;Dims&gt; object and the lambda function invoked during each iteration of the loop. The cl::sycl::range&lt;Dims&gt; object is used to specify the number of iterations performed by cgh.parallel_for(…). In turn, the lambda function accepts a single argument: a cl::sycl::id&lt;1&gt; object, by which the index value for each iteration is retrieved. Both cl::sycl::range&lt;1&gt; and cl::sycl::id&lt;1&gt; accept a template parameter specifying the number of dimensions of the execution model being used. In this case, we will use the one-dimensional execution model, since the parallel code being discussed sorts data stored in a vector having a single dimension only, rather than processing matrices or other spatial data:

    cgh.parallel_for<class partition_kernel>(cl::sycl::range<1>{array.size() - 2},
        [=](cl::sycl::id<1> idx) {
            if ((comp_key(array_acc[idx[0]], array_acc[idx[0] + 1]) ||
                (comp_key(array_acc[idx[0] + 1], array_acc[idx[0]])))) {
                pv_acc[idx[0] + 1] = idx[0] + 1;
            }
        });

According to the variant of the partitioning algorithm discussed in this article, each iteration of the partitioning loop, executed in parallel, compares the key of the current object with the key of the adjacent object, at indices idx[0] and idx[0] + 1, respectively. If the keys are not equal, it assigns the index idx[0] + 1 to position idx[0] + 1 of the vector pv, accessed for writing from inside the kernel; otherwise, the value at position idx[0] + 1 remains zero. Unlike the variants of the partitioning algorithm previously discussed, it requires O(N) extra space, where N is the number of objects in the array to be sorted, since it uses a vector of indices whose initial capacity equals the size of the array being sorted.
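The per-element logic of the partition kernel can be reproduced serially on the host, which is a convenient way to reason about what the kernel writes into pv. The sketch below is an assumption-laden stand-in: the function name is hypothetical, gen::ITEM is replaced by a simple key/value pair, and the parallel index space is replaced by an ordinary loop over all adjacent pairs:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative stand-in for the article's gen::ITEM: (key, value).
using Item = std::pair<int, int>;

// Serial sketch of the SYCL partition kernel's logic: for a key-sorted
// array, write index i + 1 into pv wherever the keys at i and i + 1
// differ; all other slots keep their initial zero value.
std::vector<std::size_t> mark_partitions(const std::vector<Item>& array)
{
    auto comp_key = [](const Item& a, const Item& b) { return a.first < b.first; };
    std::vector<std::size_t> pv(array.size(), 0); // zero-filled; pv[0] stays 0
    for (std::size_t i = 0; i + 1 < array.size(); ++i) {
        // Two keys are unequal iff a < b or b < a under the key comparator.
        if (comp_key(array[i], array[i + 1]) || comp_key(array[i + 1], array[i]))
            pv[i + 1] = i + 1; // a new partition starts at i + 1
    }
    return pv;
}
```

For a key-sorted input such as keys {1, 1, 2, 3, 3}, the marks land exactly at the positions where a new key run begins, with zeroes everywhere else, which is what the subsequent normalization step relies on.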

At the end of the kernel execution, the updated vector pv is returned to the code executed on the host CPU. This is typically done by invoking the following method:

    pv_buf.get_access<cl::sycl::access::mode::read>();


After the vector of indices pv has been successfully returned from the kernel, we must normalize it by eliminating all elements whose value is equal to zero, maintaining the relative order of the partition indices:

    auto _LastIt = std::remove_if(dpstd::execution::par_unseq,
        pv.begin() + 1, pv.end(), [&](const std::size_t index) { return index == 0L; });

    pv.resize(std::distance(pv.begin(), _LastIt));
    pv.push_back(std::distance(array.begin(), array.end()));


For that purpose, we’ve used the std::remove_if(…) routine with the Parallel STL execution policy dpstd::execution::par_unseq to run it in parallel.
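The same normalization can be sketched with the plain sequential std::remove_if, which is the standard-library equivalent of the call above without the Parallel STL execution policy. The function name here is illustrative, and the sketch assumes pv is non-empty with pv[0] == 0, as the article's code guarantees:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Serial sketch of the normalization step: drop every zero after the
// leading pv[0], then append the array size as the sentinel boundary
// (one-past-the-end of the last partition). Assumes pv is non-empty.
std::vector<std::size_t> normalize_pv(std::vector<std::size_t> pv, std::size_t array_size)
{
    auto last = std::remove_if(pv.begin() + 1, pv.end(),
        [](std::size_t index) { return index == 0; });
    pv.resize(std::distance(pv.begin(), last));
    pv.push_back(array_size);
    return pv;
}
```

Because each kernel iteration writes index i + 1 into slot i + 1, the surviving indices are already in ascending order after the zeroes are squeezed out, which is why no extra sort of pv is needed.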

Also, there’s another algorithm-level optimization of the partitioning code. Since each iteration performed by the cgh.parallel_for(…) method writes its partition index to its own position in pv, the vector is already in ascending order and we no longer need to sort it to preserve that order. This, in turn, increases the overall performance speed-up of the sorting at the algorithm level.

Finally, we need to sort the array of objects by value. To iterate through each pair of indices in the vector pv, we will use tbb::parallel_for(…), which executes each iteration in parallel on the host CPU. This method accepts two arguments: a tbb::blocked_range<> object and a lambda function executed during each iteration of the loop:

    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, pv.size() - 1),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t index = r.begin(); index != r.end(); index++)
                internal::parallel_sort(array, pv[index], pv[index + 1], comp_vals);
        });

The tbb::blocked_range&lt;std::size_t&gt; object is used to maintain the range [0..pv.size() - 1] for a blocked-range iterator of the std::size_t type. In the lambda function, we implement a loop, during each iteration of which a pair of indices is fetched from the vector pv using the blocked-range iterator, incremented at the end of each iteration. We use each pair of indices to maintain a range [pv[index]..pv[index + 1]] and independently sort all objects in the array whose indices belong to that range, by invoking the same internal::parallel_sort(…) function.

This code is executed on the host CPU, since it’s impossible to use the oneTBB API from inside SYCL kernel code due to the restriction on virtual function calls in code executed by a kernel.

## Build And Run The Sample Program In The Intel® DevCloud

After delivering the modern parallel code capable of running on a heterogeneous platform, it’s time to evaluate its performance by running it on the various hardware acceleration targets (e.g., CPU, GPU, and FPGA).

The Intel® DevCloud is a cluster platform provided by Intel Corporation that allows you to run, test, and evaluate a variety of parallel application workloads using the latest Intel hardware and software, such as Intel® Xeon® family CPUs, as well as the innovative Intel® HD Graphics and Intel® Programmable Acceleration Card / Intel® Arria® 10 GX hardware accelerators.

Before we briefly discuss the performance of the code implementing the parallel stable sort, let’s first spend a moment and see how to build and run this code in the Intel DevCloud for oneAPI projects.

The very first thing that needs to be done is to sign up for the Intel DevCloud for oneAPI projects by visiting https://intelsoftwaresites.secure.force.com/devcloud/oneapi.

After we’ve successfully signed up for the Intel DevCloud, we have to log in to the Linux terminal using a Jupyter Notebook, as explained at https://devcloud.intel.com/oneapi/connect. Also, we have to download the project of interest at the bottom of this article’s page and extract it from the archive. Then, we will need to upload the project’s folder to the cloud.

Once the project has been successfully uploaded, we need to subscribe for a computational node that will be used for running the parallel stable sort application. This is typically done by using one of the following commands in the Linux terminal:

| Computational node | Command |
| --- | --- |
| Intel® Xeon® Gold 6128 CPUs | `qsub -I -l nodes=1:cpu:ppn=2` |
| Intel® Gen9 Graphics NEO | `qsub -I -l nodes=1:gpu:ppn=2` |
| Intel® PAC Platform (compile) | `qsub -I -l nodes=1:fpga_compile:ppn=2` |
| Intel® PAC Platform (runtime) | `qsub -I -l nodes=1:fpga_runtime:ppn=2` |

(In each command, `-I` is a capital “i” and `-l` is a lowercase “L”.)

Finally, we must change to the project’s folder and build it using the make command, as follows:

    cd ./parallel_stable_sort_XXXX && make

After successfully building the project executable, run the parallel sort application by using this command:

    ./parallel_stable_sort_XXXX

Now we can evaluate the performance of the parallel stable sort by running it on the heterogeneous platform.

## Evaluating The Performance

Here, I will provide a brief survey of the parallel stable sort performance, depending on the hardware acceleration target used, comparing the actual speed-up of the parallel modern code delivered to the performance of the std::sort(…) library function:

| Hardware target | Speed-up vs. std::sort(…) |
| --- | --- |
| Intel® Xeon® Gold 6128 CPU @ 3.47 GHz / 192 GiB RAM | 2x-11x faster |
| Intel® FPGA Emulator Application | 4x-6x faster |
| Intel® Gen9 Graphics NEO | 4x-7x faster |
| Intel® PAC Platform / Intel® Arria® 10 GX | 3x-8x faster |

#### Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.