Introduction

The Intel® Integrated Performance Primitives (Intel® IPP) library contains hundreds of image processing functions that are highly optimized for different Intel platforms, and each Intel IPP function strives to provide the best performance on the market. However, if an application performs several consecutive operations on images, the simple approach of calling these functions one by one on the whole image may not always result in good performance.

The main reason for this is that image processing applications work with data arrays that are much larger than the CPU L1 and L2 caches. Under such conditions, optimizing memory access is more important than optimizing arithmetic calculations. Processing the image data in small portions can provide better performance than simply executing the sequence of fast Intel IPP function calls over the whole image. This approach becomes even more effective in multi-threaded applications.

This document compares four approaches to image processing with the Intel IPP library and its Threading Layer component:

  1. Single-threaded (naïve) implementation

  2. Single-threaded pipeline with slicing

  3. Multi-threaded pipeline with function-level parallelism

  4. Multi-threaded pipeline with application-level parallelism

These approaches differ in complexity, performance, and scalability. The document explains each of them, starting with the simple single-threaded implementation, then improving performance step by step, and finally showing how to organize a fast calculation pipeline combined with optimized memory access.

Please refer to the Building and running the application section for information on where to find the source code and for building instructions.

The comparison is based on the example of the Sobel edge detector filter. The Sobel filter consists of several consecutive stages:

  1. The vertical Sobel filter is applied to the source image, and the result is stored in a temporary image A

  2. The horizontal Sobel filter is applied to the source image, and the result is stored in a temporary image B

  3. An in-place square operation is applied to image A

  4. An in-place square operation is applied to image B

  5. The sum of images A and B is stored in image A (this operation can also be done in-place)

  6. The square root of image A is calculated

This sequence of steps is called a pipeline and is illustrated in the figure below.

Sobel edge detector filter functions pipeline:


Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Single-threaded (naïve) implementation

In the naïve implementation of the Sobel filter, the application sequentially calls the single-threaded Intel IPP functions of the filter pipeline, applying each function to the whole image (see the Sobel edge detector filter functions pipeline figure). This approach is cache inefficient because image sizes are usually much bigger than the L1 CPU cache, which leads to redundant memory access operations (see the chart in the Performance results section).
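
For illustration, the structure of this naïve variant can be sketched as follows. This is a minimal single-threaded sketch, not the actual example code: the stage helpers below are hypothetical scalar stand-ins for the corresponding Intel IPP calls (the real ippiFilterSobelVertBorder, ippiFilterSobelHorizBorder, ippiSqr, ippiAdd, and ippiSqrt family functions additionally take ROI, row-step, border, scaling, and work-buffer parameters and handle saturation properly). Each helper processes the row range [y0, y1) of the image so that the same helpers can be reused in the sliced variants sketched later.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical scalar stand-ins for the Intel IPP pipeline stages.
// Each helper processes rows [y0, y1) of a w x h single-channel image.
// The integer scaling/saturation of the real IPP functions is crudely
// approximated here by clamping to the Ipp16s range.
static void sobelVert(const uint8_t* s, int16_t* d, int w, int h, int y0, int y1)
{
    for (int y = std::max(y0, 1); y < std::min(y1, h - 1); ++y)   // skip the image border rows
        for (int x = 1; x < w - 1; ++x)
            d[y * w + x] = (int16_t)(
                  (s[(y + 1) * w + x - 1] + 2 * s[(y + 1) * w + x] + s[(y + 1) * w + x + 1])
                - (s[(y - 1) * w + x - 1] + 2 * s[(y - 1) * w + x] + s[(y - 1) * w + x + 1]));
}

static void sobelHoriz(const uint8_t* s, int16_t* d, int w, int h, int y0, int y1)
{
    for (int y = std::max(y0, 1); y < std::min(y1, h - 1); ++y)
        for (int x = 1; x < w - 1; ++x)
            d[y * w + x] = (int16_t)(
                  (s[(y - 1) * w + x + 1] + 2 * s[y * w + x + 1] + s[(y + 1) * w + x + 1])
                - (s[(y - 1) * w + x - 1] + 2 * s[y * w + x - 1] + s[(y + 1) * w + x - 1]));
}

static void sqrRows(int16_t* p, int w, int y0, int y1)                    // in-place square
{
    for (int i = y0 * w; i < y1 * w; ++i)
        p[i] = (int16_t)std::min(p[i] * p[i], 32767);
}

static void addRows(const int16_t* s, int16_t* sd, int w, int y0, int y1) // in-place add
{
    for (int i = y0 * w; i < y1 * w; ++i)
        sd[i] = (int16_t)std::min(sd[i] + s[i], 32767);
}

static void sqrtRows(int16_t* p, int w, int y0, int y1)                   // in-place square root
{
    for (int i = y0 * w; i < y1 * w; ++i)
        p[i] = (int16_t)std::lround(std::sqrt((double)p[i]));
}

// Naïve pipeline: every stage runs over the whole image, so each stage
// re-reads its input from main memory rather than from the L1 cache.
void sobelNaive(const uint8_t* src, int16_t* dst, int w, int h)
{
    std::vector<int16_t> b((size_t)w * h, 0);        // temporary image B
    std::fill(dst, dst + (size_t)w * h, (int16_t)0); // temporary image A lives in dst

    sobelVert (src, dst,      w, h, 0, h);   // 1. vertical Sobel   -> A
    sobelHoriz(src, b.data(), w, h, 0, h);   // 2. horizontal Sobel -> B
    sqrRows (dst,      w, 0, h);             // 3. A = A^2
    sqrRows (b.data(), w, 0, h);             // 4. B = B^2
    addRows (b.data(), dst, w, 0, h);        // 5. A = A + B
    sqrtRows(dst, w, 0, h);                  // 6. A = sqrt(A)
}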

Single-threaded pipeline with slicing

To optimize memory access operations and reduce memory access overhead, decompose the big image into parts in such a way that each part fits into the L1 CPU cache, and then call the Sobel edge detector filter pipeline on those image parts (see the Sobel edge detector filter functions pipeline for decomposed image (single-threaded) figure below). In this case, you can take advantage of cache locality of the source data: only the first function of the pipeline has to fetch its source data from memory during slice processing, while the other functions operate on data already stored in the cache. This approach is implemented for all functions of the Intel IPP Threading Layer.

The image can be split in any way: by tiles, by slices (groups of rows), or by other segments. This example uses slice processing: each slice consists of 4 rows, takes ~8 KB (for a Full-HD grayscale source image of 1920x1080 size), and easily fits into a 32 KB L1 cache.
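
Under the same assumptions as the naïve sketch above (the hypothetical sobelVert, sobelHoriz, sqrRows, addRows, and sqrtRows helpers, each processing a row range), only the driver changes: the whole pipeline is executed slice by slice, so the intermediate data of a slice stays in the L1 cache between stages.

// Sliced single-threaded pipeline: the full stage sequence is applied to one
// 4-row slice at a time, using the hypothetical stage helpers defined in the
// naïve sketch above.
void sobelSliced(const uint8_t* src, int16_t* dst, int w, int h)
{
    std::vector<int16_t> b((size_t)w * h, 0);        // temporary image B
    std::fill(dst, dst + (size_t)w * h, (int16_t)0); // temporary image A lives in dst

    const int sliceRows = 4;  // ~8 KB of 8-bit source data for a 1920-pixel-wide image
    for (int y0 = 0; y0 < h; y0 += sliceRows) {
        const int y1 = std::min(y0 + sliceRows, h);
        sobelVert (src, dst,      w, h, y0, y1);  // 1. vertical Sobel   -> A
        sobelHoriz(src, b.data(), w, h, y0, y1);  // 2. horizontal Sobel -> B
        sqrRows (dst,      w, y0, y1);            // 3. A = A^2
        sqrRows (b.data(), w, y0, y1);            // 4. B = B^2
        addRows (b.data(), dst, w, y0, y1);       // 5. A = A + B
        sqrtRows(dst, w, y0, y1);                 // 6. A = sqrt(A)
    }
}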

As shown in the performance comparison chart in the Performance results section, the slicing approach gives a ~20% performance boost even for a single-threaded application.

Sobel edge detector filter functions pipeline for decomposed image (single-threaded)

Threading with the Intel® IPP Threading Layer

The Intel® IPP library provides the Threading Layer component, which contains multi-threaded functions for image processing. These functions are implemented as wrappers over the highly optimized single-threaded Intel IPP functions, combining slice processing with external multi-threading based on OpenMP* or Intel® Threading Building Blocks (Intel® TBB).

The general scheme of the Threading Layer wrappers is presented in the figure below. Before processing, the source image is split into slices, and each thread takes its own part of the image (a specific slice) and processes it with a single-threaded Intel IPP function, storing the result in the corresponding part of the destination array.

General scheme of parallelization of an Intel IPP single-threaded function with Threading Layer
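
A minimal sketch of this scheme is shown below, assuming OpenMP* and the hypothetical sobelVert helper from the naïve sketch in place of a real single-threaded Intel IPP function; the actual Threading Layer wrappers additionally deal with details such as slice borders and work buffers that are omitted here.

#include <algorithm>
#include <cstdint>
#include <omp.h>

// Hypothetical sketch of a Threading Layer-style wrapper: the destination
// image is split into horizontal bands, and each OpenMP thread applies the
// single-threaded stage function to its own band.
void sobelVert_T(const uint8_t* src, int16_t* dst, int w, int h)
{
    #pragma omp parallel
    {
        const int nt   = omp_get_num_threads();
        const int tid  = omp_get_thread_num();
        const int rows = (h + nt - 1) / nt;        // rows per thread, rounded up
        const int y0   = std::min(tid * rows, h);
        const int y1   = std::min(y0 + rows, h);
        sobelVert(src, dst, w, h, y0, y1);         // single-threaded call on this band
    }   // implicit barrier: all threads join before the wrapper returns
}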

Multi-threaded pipeline: function level

The single-threaded naïve implementation can be parallelized with the Intel IPP Threading Layer in a straightforward manner: simply replace the single-threaded functions with the multi-threaded wrappers from the Threading Layer (see the figure below). As the performance chart in the Performance results section shows, this approach has reasonably good scalability and performance, but it is still not the best one in terms of performance, for the reasons explained below.

Multi-threaded version of the Sobel edge detector filter pipeline with Threading Layer functions (function level parallelism)


In modern hardware, hardware threads have fixed identities: the real hardware thread number can be determined by reading the APIC (Advanced Programmable Interrupt Controller) identification, which is assigned during the OS boot/initialization step and cannot be changed until reboot. The OpenMP standard provides the omp_get_thread_num() function for obtaining the current thread id. This function returns the logical number of the current thread, and this number is used inside a parallel region to assign a specific amount of work, or a specific part of the data, to that logical thread. The assignment of a logical number to the current thread is handled by the OS and the threading runtime in use, and this logical number has no fixed relation to a hardware thread number. This means that in one parallel region hardware thread #3 may execute logical thread #0, and in the next one, for example, logical thread #5. In other words, logical thread numbers float from one hardware thread to another in random order.

There is a special term, affinity, that describes the binding of a logical thread to a hardware thread. Without it, this correspondence is not guaranteed even within one parallel region: for example, if an application or function is interrupted by some system process, the logical thread assigned to a specific hardware thread can be moved to another hardware thread. Such non-deterministic behavior of the OS and threading runtime significantly reduces the benefit of data locality in a specific cache.
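
The floating nature of logical thread numbers is easy to observe. The following sketch is a Linux-specific illustration (sched_getcpu() is a glibc call, and the output depends on the OS scheduler): it prints which hardware CPU executes each logical OpenMP thread in two successive parallel regions; without pinned affinity, the mapping may differ between the two regions.

#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu(), Linux/glibc specific

int main()
{
    // Two successive parallel regions: the same logical OpenMP thread number
    // is not guaranteed to run on the same hardware CPU in both regions.
    for (int region = 1; region <= 2; ++region) {
        #pragma omp parallel
        {
            std::printf("region %d: logical thread %d runs on CPU %d\n",
                        region, omp_get_thread_num(), sched_getcpu());
        }
    }
    return 0;
}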

Intel OpenMP provides special functionality that makes it possible to set affinity, that is, to bind a logical thread to a specific hardware thread. At first glance, this mechanism seems to solve all the issues mentioned above: with affinity set, there is a fixed correspondence between logical and hardware threads, so if logical thread #0 is bound to hardware thread #0, it will always run on hardware thread #0. However, this approach solves the issues discussed above only on an ideal, “clean” computer.

Every real computer runs under the management of an OS that always has hundreds of active internal processes, and at any moment any of these processes can be woken up on some hardware thread. Imagine that you have carefully calculated the workload for each hardware thread, carefully divided the amount of data to be processed by each thread, set hardcoded affinity, and started your application, and at that very moment some OS activity starts running on hardware thread #0. Because affinity is hardcoded, neither the operating system nor the threading runtime can move your task assigned to thread #0 to any other thread. This means that when all the other threads #1-#7 have finished their work and entered the wait loop at the application synchronization point, your task for thread #0 is still waiting for the hardware thread #0 resources to become free. Since such situations are unpredictable, one run can show rather good performance with this approach and another run very poor performance.

Modern CPU cache structure

Based on the above, it is clear that this approach, a pipeline of internally threaded functions, is very inefficient: there are 7 successive function calls, and each function internally creates or wakes up a number of threads and has a thread synchronization point before it exits. One of the main issues resulting from the loose correspondence between logical and hardware threads is that there is no guarantee that the same logical threads will run on the same hardware threads for two successive function calls and will process the portions of data that already reside in the cache parts assigned to the corresponding hardware thread. Instead of processing the data that already exists in the cache assigned to a particular hardware thread, intensive data exchange between different cache parts begins.

Logical thread migration from one parallel region to another


Imagine that “region #1” corresponds to the ippiFilterSobelVertBorder_8u16s_C1R_T function call and “region #2” corresponds to the ippiSqr_16s_C1IRSfs_T function call; both functions use the simple data decomposition approach for parallel processing:

Simple data decomposition


According to scheme #2, logical thread #0 is executed on hardware thread #4, and therefore the slice of destination data produced by the ippiFilterSobelVert function is stored in the L2 and L3 caches related to hardware thread #4. The subsequent parallel region of the ippiSqr function works with the data decomposed in the same way, so logical thread #0 expects to find all of its input data in its cache. In the second parallel region, however, logical thread #0 runs on hardware thread #3. Therefore, instead of the data being reused by the successive operation, an intensive data exchange starts between hardware threads #3 and #4 and all the other pieces of cache.

All of the above shows that threading at the Intel IPP function level is not efficient: it does not utilize the full CPU power and does not make effective use of the multi-core cache subsystem organization.

Multi-threaded pipeline: application level

It is significantly more efficient to thread pipelines of Intel IPP functions at the application level, above the Intel IPP functions, as shown in the figure below:

Multi-threaded version of the Sobel edge detector filter pipeline threaded manually (application level parallelism).

This solution combines the benefits of pipelined image processing with the use of the optimized Intel IPP library functions. It is the most efficient approach in terms of performance, and it is implemented in the Intel IPP Threading Layer as a separate function, ippiFilterSobel_8u16s_C1R_T.
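
Under the same assumptions as the earlier sketches, application-level threading amounts to parallelizing the outer slice loop of the sliced single-threaded variant, so each thread runs the complete pipeline over its own slices and the intermediate slice data stays in the cache of the core that processes it (the real ippiFilterSobel_8u16s_C1R_T implementation is, of course, more involved):

// Application-level parallelism: a single fork/join for the whole pipeline.
// Each thread pushes complete slices through all six stages, so intermediate
// data of a slice never has to travel between cores. Uses the hypothetical
// stage helpers from the naïve sketch.
void sobelPipelineApplicationLevel(const uint8_t* src, int16_t* dst, int w, int h)
{
    std::vector<int16_t> b((size_t)w * h, 0);        // shared temporary; threads write disjoint rows
    std::fill(dst, dst + (size_t)w * h, (int16_t)0);

    const int sliceRows = 4;
    #pragma omp parallel for schedule(static)
    for (int y0 = 0; y0 < h; y0 += sliceRows) {
        const int y1 = std::min(y0 + sliceRows, h);
        sobelVert (src, dst,      w, h, y0, y1);
        sobelHoriz(src, b.data(), w, h, y0, y1);
        sqrRows (dst,      w, y0, y1);
        sqrRows (b.data(), w, y0, y1);
        addRows (b.data(), dst, w, y0, y1);
        sqrtRows(dst, w, y0, y1);
    }
}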

Conclusion

This document demonstrated that when several successive functions are applied to the same source data, a correct and efficient implementation of the processing pipeline can bring a significant performance boost. Two approaches can be applied in an Intel IPP-enabled application:

  1. Process the image by slices that fit into the CPU L1 cache, running the whole pipeline of single-threaded Intel IPP functions on each slice

  2. Thread the pipeline at the application level, so that each thread runs the complete pipeline over its own slices, instead of calling internally threaded functions one after another

You can find implementation details in the source code and comments of the Intel IPP Threading Layer example; see the Finding the source code section for details.

Performance results

The measurements that are illustrated in the figure below have been done under the following conditions:

·         System: Red Hat* Enterprise Linux* Server 7.4, Intel® Xeon® Platinum 8180 processor at 2.50 GHz with Hyper-Threading disabled

·         Source data: Full-HD grayscale image of 1920x1080 size

·         Measurement loop: contains only the processing function, without initialization routines; the results are stabilized by taking the average value over 100 runs. Multi-threaded approaches are parallelized using OpenMP*. You can find more details in the well-commented source code of the examples; refer to the Finding the source code section to locate it.

Performance comparison of image processing approaches described in this paper

Building and running the application

Finding the source code

You can find the source code of all four examples described in this document in the standalone package of Intel IPP 2020 Update 2 (or later releases) or in any bundle that contains this version of the Intel IPP library (e.g., Intel® Parallel Studio XE 2020 Update 2 for C++ or later).

Extract the examples from the archive located at:

Linux

<installdir>/compilers_and_libraries_2020.2.XXX/linux/ipp/components/components_and_examples_lin_ps.tgz,

 

where

<installdir> value can be {/opt/intel, $HOME/intel} depending on the package installation type

Windows

C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2020.2.XXX\windows\ipp\components\components_and_examples_win_ps.zip

Building the application

To build the examples, follow these steps:

Linux

1.    Set up your environment using the script:

 source <installdir>/compilers_and_libraries_2020.2.XXX/linux/bin/compilervars.sh <arch>, where

                <arch> value can be “ia32” or “intel64” depending on your platform

2.       Unpack the examples archive:

cd <installdir>/compilers_and_libraries_2020.2.XXX/linux/ipp/components

tar xzvf components_and_examples_<target>.tgz

3.       Go to the examples folder and build Sobel filter benchmarks:

cd components/interfaces/tl

make

Note: By default, the multi-threaded benchmark binaries are linked with OpenMP*. If you want to change the threading mechanism to Intel TBB, set the TBBROOT environment variable to the directory where Intel TBB is installed before running the make command:

export TBBROOT=<installdir>/compilers_and_libraries_2020.2.XXX/linux/tbb

Windows

  1. Unpack the examples archive (see the Finding the source code section to find it).
  2. Open the Visual Studio* solution located in the archive: <path_to_extracted_archive>\components\interfaces\tl\tl_example.sln
  3. Choose the required “Platform toolset” in the Solution properties.
  4. Choose the appropriate build configuration. The solution can be built with OpenMP* or Intel TBB in Debug and Release configurations.
  5. Run “Build->Build Solution” in the Visual Studio* menu.

After the build is finished, you can find four resulting binaries in the _build/<arch>/release_{omp,tbb} folder on Linux or in the _build/<arch>/{Release|Debug} {OpenMP|TBB} folder on Windows:

Binary name                         Description

tl_sobel_st_pipeline_per_image      Single-threaded pipeline (naïve implementation)

tl_sobel_st_pipeline_per_slice      Single-threaded pipeline (with slicing)

tl_sobel_mt_pipeline_per_image      Multi-threaded pipeline (function level parallelism with slicing)

tl_sobel_mt_pipeline_per_slice      Multi-threaded pipeline (application level parallelism with slicing)

Running the application

Use the following command line to run a benchmark:

tl_sobel_mt_pipeline_per_image [-i] InputFile [[-o] OutputFile] [Options]

where [Options] can be:

-t <NUM>    number of threads for the Threading Layer interface (for multi-threaded binaries only)

-w <NUM>    minimum test time in milliseconds

-l <NUM>    number of loops to repeat the calculation (overrides the test time)

-h          print help and exit
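
For example, the following command (the input file name here is illustrative only) runs the application-level benchmark on 8 threads and repeats the calculation 100 times:

tl_sobel_mt_pipeline_per_slice -i input_image -t 8 -l 100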

References

  1. Sobel edge-detector filter usage: https://software.intel.com/en-us/ipp-dev-reference-filtersobel