Perform Complex Mathematical Operations with Just a Few Lines of DPC++ and Intel® oneAPI Math Kernel Library (oneMKL)
Henry A Gabb, senior principal engineer and editor-in-chief of The Parallel Universe, Intel Corporation
Note This article was adapted from an example created for the oneAPI GPU Optimization Guide. The complete source code and build and run instructions are available in the oneAPI-samples repository. The guide and the repo are both excellent resources for oneAPI developers.
Introduction to Cross-Correlation
Offloading individual Intel® oneAPI Math Kernel Library (oneMKL) kernel functions to accelerators is straightforward, so let’s look at a more complex mathematical operation requiring multiple kernel functions: cross-correlation. Cross-correlation has many applications (for example, measuring the similarity of two 1D signals, finding the best translation to overlay similar images, or volumetric medical image segmentation). Consider the following simple signals, represented as vectors of ones and zeros:
The signals are treated as circularly shifted versions of each other, so shifting the second signal three elements relative to the first signal gives the maximum correlation score of two:
Shifts of two or four elements give a correlation score of one. Any other shift gives a correlation score of zero. This is computed as follows:
where N is the number of elements in the signal vectors and i is the shift of sig2 relative to sig1.
Real signals contain more data (and noise), but the principle is the same whether you are aligning 1D signals, overlaying 2D images, or performing 3D volumetric image registration. The goal is to find the translation that maximizes correlation. However, the brute force summation shown above requires N multiplications and additions for every N shifts. In 1D, 2D, and 3D, the problem is O(N2), O(N3), and O(N4), respectively.
The Fourier correlation algorithm is a much more efficient way to perform this computation because it takes advantage of the O(N log N) complexity of the Fourier transform:
where DFT is the discrete Fourier transform, IDFT is the inverse DFT, and CONJG is the complex conjugate. The Fourier correlation algorithm can be composed using oneMKL, which contains optimized forward and backward transforms and complex conjugate multiplication functions. Therefore, the entire computation can be performed on the accelerator device.
Generate Some Test Data Using oneMKL
In many applications, only the final correlation result matters, so this is all that has to be transferred from the device back to the host. In this example, two artificial signals are created on the device, transformed in place, and then correlated. The host retrieves the final result and reports the optimal translation and correlation score.
Conventional wisdom suggests that buffering would give the best performance because it provides explicit control over data movement between the host and the device. To test this hypothesis, let’s generate two input signals:
Random noise is often added to signals to prevent overfitting during neural network training, to add visual effects to images, or to improve the detectability of signals obtained from suboptimal detectors. The buffers are initialized with random noise, using a basic random number generator in oneMKL:
Notice that a new scope is opened and a buffer, corr_buf, is declared for the correlation result. When this buffer goes out of scope, corr will be updated on the host.
An artificial signal is placed at opposite ends of each buffer, similar to the trivial example above:
Implement the 1D Fourier Correlation Using Explicit Buffering
Now that the signals are ready, let’s transform them using the DFT functions in oneMKL.
A single-precision, real-to-complex forward transform is committed to the SYCL* queue, and then an inplace DFT is performed on the data in both buffers. The result of DFT(sig1) must now be multiplied by CONJG(DFT(sig2)). This could be done with a hand-coded Data Parallel C++ (DPC++) kernel:
However, this basic implementation is unlikely to give optimal cross-architecture performance. Fortunately, oneMKL provides a convenience function, oneapi::mkl::vm::mulbyconj, that can be used for this step. The mulbyconj function expects std::complex input, but the buffers were initialized as the float data type. Even though they contain complex data after the forward transform, the buffers will have to be recast:
The IDFT step completes the computation:
When the scope that was opened at the start of this example is closed, the buffer holding the correlation result goes out of scope, forcing an update of the host container. This is the only data transfer between the host and the device.
The complete source code (fcorr_1d_buffers.cpp) is available in the oneAPI-samples repository.
Implement the 1D Fourier Correlation Using Unified Shared Memory (USM)
The Fourier correlation algorithm will now be reimplemented using Unified Shared Memory (USM) to compare to explicit buffering. Only the differences in the two implementations are highlighted. First, the signal and correlation arrays are allocated in USM, and then initialized with artificial data:
The rest of the implementation is largely the same, except pointers to USM are passed to the oneMKL functions instead of SYCL buffers:
It is also necessary to free the allocated memory:
The USM implementation has a more familiar syntax. It is also conceptually simpler because it relies on implicit data transfer handled by the DPC++ runtime. However, a programmer error hurts performance.
Notice the warning messages in the previous code snippets. The oneMKL random number generation engine is initialized on the device, so sig1 and sig2 are initialized with random noise on the device. Unfortunately, the code adding the artificial signal runs on the host, so the DPC++ runtime has to make the host and device data consistent. The signals used in Fourier correlation are usually large, especially in 3D imaging applications, so unnecessary data transfer degrades performance.
Updating the signal data directly on the device keeps the data consistent, thereby avoiding the unnecessary data transfer:
The explicit buffering and USM implementations have equivalent performance, indicating that the DPC++ runtime is good at avoiding unnecessary data transfers (provided the programmer pays attention to data consistency).
The complete source code (fcorr_1d_usm.cpp) is available in the oneAPI-samples repository.
Note that the final step of finding the location of the maximum correlation value is performed on the host:
It would be better to do this computation on the device, especially when the input data is large. Fortunately, the MAXLOC reduction is a common parallel pattern that can be implemented using DPC++. This is left as an exercise for the reader, but Figure 14-11 of Data Parallel C++ provides a suitable example to help you get started. If you’re not in the mood to exercise, the USM example in the oneAPI-samples repository has the MAXLOC reduction already implemented so that the entire computation is done on the device.
You May Also Like
Get the Software
Intel® oneAPI Math Kernel Library
Accelerate math processing routines, including matrix algebra, fast Fourier transforms (FFT), and vector math. Part of the Intel® oneAPI Base Toolkit.
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.