Intel® IPP overcomes “int-size” limitations in the new 2017 version.
All Intel® IPP APIs in all previous versions of the Intel® IPP libraries share one historical limitation inherited from ia32 (see /content/www/us/en/develop/articles/ia-32-intelr-64-ia-64-architecture-mean.html): every function parameter that describes a vector length, an image or buffer size, an image step, and so on has an integer (or, in rare cases, unsigned integer) data type.
In the second decade of the 21st century, Intel64 (AMD64) is the core architecture for almost all servers, PCs, ultrabooks, and even tablets and smartphones, and it has been spreading rapidly for about 15 years. Against that background, the 32-bit integer data type is one of the main limitations of the Intel® IPP libraries. Many Intel® IPP functions provide no suitable and convenient way to process very large (super-resolution) images or long vectors in one call and, moreover, cannot even process such amounts of data in pieces (slices). For example, with Intel® IPP you can’t perform an FFT of order greater than 27 for the Ipp32fc data type (26 for Ipp64fc), you can’t process images in which the distance between two consecutive rows exceeds INT_MAX (2^31 - 1), you can’t process AC4 images (Ipp8u data type) larger than about 23170*23170, and you can’t allocate more than 2 GB of memory with any ippMalloc function. Image sizes are growing (“Panorama” modes are now integrated in almost all point-and-shoot digital cameras), so what if somebody needs to work with an image larger than 2 GB? To overcome all these limitations, Intel® IPP 2017 introduces the L-mode extension of the Intel® IPP API.
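The numbers quoted above can be checked with a little arithmetic. The sketch below assumes, as an illustration only, that each limit comes from a byte count that has to fit in a signed 32-bit integer; the helper names are hypothetical and are not part of Intel® IPP.

```c
#include <stdint.h>

/* Returns 1 when `count` elements of `elemBytes` bytes each give a byte
 * size that is still representable as a signed 32-bit integer. */
static int bytes_fit_int32(int64_t count, int64_t elemBytes)
{
    return count * elemBytes <= INT32_MAX;
}

/* Largest FFT order N such that a buffer of 2^N elements fits the limit.
 * elemBytes is 8 for a complex float pair, 16 for a complex double pair. */
static int max_fft_order(int64_t elemBytes)
{
    int order = 0;
    while (bytes_fit_int32((int64_t)1 << (order + 1), elemBytes))
        ++order;
    return order;
}

/* Largest side of a square image whose pixel data fits the limit.
 * bytesPerPixel is 4 for a four-channel 8-bit image. */
static int64_t max_square_side(int64_t bytesPerPixel)
{
    int64_t side = 0;
    while (bytes_fit_int32((side + 1) * (side + 1), bytesPerPixel))
        ++side;
    return side;
}
```

Under that assumption, `max_fft_order(8)` and `max_fft_order(16)` reproduce the orders 27 and 26 quoted for Ipp32fc and Ipp64fc, and `max_square_side(4)` reproduces the 23170 figure for four-channel 8-bit images.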
What is the L-mode extension?
The extension adds a new data type for “length” and “size” parameters, IppSizeL, which is Ipp32s on ia32 operating systems and Ipp64s on Intel64 operating systems. It also adds a new data type for image sizes, IppiSizeL, which is derived from IppSizeL.
The L-mode extension for Intel® IPP therefore has different meanings on the ia32 and Intel64 architectures: on Intel64 ‘L’ means “Long”, and on ia32 it means “Legacy”. The abbreviation for this new term is Lx Intel® IPP API.
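To make the “Long” versus “Legacy” idea concrete, here is a minimal sketch of an architecture-dependent size type. The names SketchSizeL, SketchImageSizeL, and SKETCH_SIZE_L_MAX are hypothetical stand-ins, and the pointer-width test is an assumption made for the example; the real IppSizeL and IppiSizeL definitions live in the Intel® IPP headers and may be written differently.

```c
#include <stdint.h>

/* Hypothetical stand-in for IppSizeL: 64-bit on 64-bit targets ("Long"),
 * 32-bit on 32-bit targets ("Legacy"). The target is detected here by
 * pointer width, which is an assumption made for this sketch. */
#if UINTPTR_MAX > 0xFFFFFFFFu
typedef int64_t SketchSizeL;
#define SKETCH_SIZE_L_MAX INT64_MAX
#else
typedef int32_t SketchSizeL;
#define SKETCH_SIZE_L_MAX INT32_MAX
#endif

/* Hypothetical stand-in for IppiSizeL: an image size built from the type above. */
typedef struct {
    SketchSizeL width;
    SketchSizeL height;
} SketchImageSizeL;

/* Byte size of a region, computed in 64 bits so the product cannot wrap
 * for realistic image dimensions on either target. */
static int64_t roi_bytes(SketchImageSizeL roi, int64_t bytesPerPixel)
{
    return (int64_t)roi.width * (int64_t)roi.height * bytesPerPixel;
}

/* Returns 1 when a non-negative size n can be passed as a SketchSizeL. */
static int fits_sketch_size_l(int64_t n)
{
    return n >= 0 && n <= SKETCH_SIZE_L_MAX;
}
```

On a 64-bit build a size of 3,000,000,000 bytes passes `fits_sketch_size_l`, while on a 32-bit build it does not, which mirrors the two meanings of ‘L’ described above.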
The new Lx Intel® IPP APIs do not replace the existing “classic” Intel® IPP APIs; they are additional Intel® IPP functions with extended capabilities.
In most cases the difference between the “classic” and “Lx” APIs lies in the data types used for “size” parameters and in the additional “_L” suffix in the function name. In some cases the “Lx” approach introduces new APIs in order to facilitate processing of huge amounts of data by slices or tiles.
These new APIs are the basis for one more layer provided with Intel® IPP 2017: the “threading” layer, which introduces the new function-name suffix “_LT”, meaning “Lx Threaded API”. The layer is available in two forms: as a prebuilt binary library and as source code. This approach overcomes the restrictions and limitations of the internally threaded libraries in previous versions of Intel® IPP (*). Customers who are satisfied with internal function threading can use the “LT” functions in the prebuilt binary form. For advanced customers who are striving for the highest possible performance of their applications, we recommend using the “LT” functionality in source form and building threaded pipelines that combine Intel® IPP functions with their own code, processing large portions of data by slices or tiles in parallel.
Therefore the main goals of this Threaded Lx Intel® IPP API extension are:
- overcome Intel® IPP 32-bit “size” limitations
- provide clear examples of threading Intel® IPP functions at the application level and of processing data by slices or tiles (“external” threading)
- replace multithreaded Intel® IPP libraries (deprecated in Intel® IPP 8.2) in the long term
Pict #1: Classic Intel® IPP approach: the choice between single-threaded (st) and multithreaded (mt) libraries is made at the linking stage, and the st and mt libraries have the same function names.
Pict #2: Approach with the Threaded Layer for the Lx Intel® IPP API: the Threaded Layer is available in both binary and source forms.
Compare 2 APIs – “classic” Intel® IPP and Lx Intel® IPP API:
IPPAPI(IppStatus, ippsAdd_32f, (const Ipp32f* pSrc1, const Ipp32f* pSrc2, Ipp32f* pDst, int len))
IPPAPI(IppStatus, ippsAdd_32f_L, (const Ipp32f* pSrc1, const Ipp32f* pSrc2, Ipp32f* pDst, IppSizeL len))
The differences are the “_L” suffix in the function name and the IppSizeL type of the len parameter; almost all other APIs differ in the same way. There are also several “classic” Intel® IPP functionalities that can’t be executed by slices or tiles (and therefore can’t be threaded at the application level) because of API limitations, and several that can’t be threaded (or executed by slices or tiles) because of the nature of their algorithms (fully sequential or with feedback dependency). For the first group (“classic” API limitations), the new Lx Intel® IPP APIs differ from the “classic” approach in order to achieve the main goal (external threading) and overcome all historical limitations. The second group requires an individual approach for each algorithm: some functionalities are divided into smaller logical parts that can be partially threaded (for example, Pyramids and Canny), while others gain only the size extension and remain sequential (for example, IIR).
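For an element-wise operation such as vector addition, processing by slices is straightforward, and a short sketch shows what a 64-bit length buys. The code below is plain C and does not call Intel® IPP: add_32f_classic is a hypothetical stand-in for a primitive with an int length, and add_32f_long is a hypothetical wrapper that walks a 64-bit length in slices. The status codes are also invented for the example.

```c
#include <stdint.h>

/* Stand-in for a "classic" primitive: the length is a 32-bit int.
 * Returns 0 on success, a negative value on bad arguments. */
static int add_32f_classic(const float *src1, const float *src2,
                           float *dst, int len)
{
    if (!src1 || !src2 || !dst) return -1;
    if (len <= 0) return -2;
    for (int i = 0; i < len; ++i)
        dst[i] = src1[i] + src2[i];
    return 0;
}

/* Hypothetical "long" wrapper: accepts a 64-bit length and calls the
 * classic routine on slices of at most maxSlice elements. Each slice is
 * independent, so the slices could also be handed to different threads. */
static int add_32f_long(const float *src1, const float *src2,
                        float *dst, int64_t len, int maxSlice)
{
    if (len <= 0 || maxSlice <= 0) return -2;
    int64_t done = 0;
    while (done < len) {
        int64_t left = len - done;
        int slice = left > maxSlice ? maxSlice : (int)left;
        int status = add_32f_classic(src1 + done, src2 + done,
                                     dst + done, slice);
        if (status != 0) return status;
        done += slice;
    }
    return 0;
}
```

This only works because addition has no dependency between elements. Algorithms in the second group above, such as IIR filters, carry state from one sample to the next, so a wrapper like this cannot split them into independent slices.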
Pict #3: Advantage of a pipeline processed by tiles. Y-axis: ratio of the reference code (the OpenCV cv::convertTo functionality) to an implementation built from Intel® IPP primitives, measured in cpe (clocks per element); below 1.0 OpenCV is faster, above 1.0 the Intel® IPP pipeline is faster. X-axis: test case index (different input data types and image sizes) from the OpenCV performance system, sorted in ascending order. To achieve the same functionality with Intel® IPP we need a pipeline of three functions: convert the input data type to floating point (float or double), multiply by a floating-point constant, and then convert to the new data type. The red curve represents the direct implementation, a sequential call of the three Intel® IPP functions on the full image; the blue curve represents the same three calls organized by tiles, where the size of each tile is chosen to fit the L1 data cache (src+tmpbuf+dst). Processing the pipeline by tiles is up to 5-7x faster than the direct naïve implementation with three successive calls.
Comments on Pict #3: cv::convertTo is equivalent to the new Intel® IPP API:
IPPAPI(IppStatus, ippiScaleC_8u16s_C1R, ( const Ipp8u* pSrc, int srcStep, Ipp64f mVal, Ipp64f aVal, Ipp16s* pDst, int dstStep, IppiSize roiSize, IppHintAlgorithm hint ))
Here 8u16s can be any other combination of data types supported by Intel® IPP. A direct substitution with three calls looks as follows (note: for this particular case 64f doesn’t make sense, and Intel® IPP has no such conversion):
IPPAPI(IppStatus, ippiConvert_8u32f_C1R, ( const Ipp8u* pSrc, int srcStep, Ipp32f* pDst, int dstStep, IppiSize roiSize ))
IPPAPI(IppStatus, ippiMulC_32f_C1IR, (Ipp32f value, Ipp32f* pSrcDst, int srcDstStep, IppiSize roiSize))
IPPAPI(IppStatus, ippiConvert_32f16s_C1RSfs, ( const Ipp32f* pSrc, int srcStep, Ipp16s* pDst, int dstStep, IppiSize roi, IppRoundMode round, int scaleFactor ))
Below is a scheme for processing by tiles:
The scheme uses a very simple condition for the tile size: Tile.width * Tile.height * (sizeof(Ipp8u) + sizeof(Ipp16s) + sizeof(Ipp32f)) < 32768 (32 KB is the usual L1 data cache size for Intel CPUs). To simplify the selection of an optimal tile size and the processing of images by tiles, the new Lx Intel® IPP API provides helper functionality (search for “tile” in the ippi_l.h header file).
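The tile-size condition and the three-stage pipeline can be sketched in plain C. The code below is an illustration under stated assumptions, not the Intel® IPP implementation. The names Tile, choose_tile, and scale_8u16s_tiled are hypothetical. Strides are counted in elements rather than bytes. The three inner loops stand in for the convert, multiply, and convert calls. Rounding is half away from zero with saturation, which may differ from the rounding mode used in a real pipeline.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int width; int height; } Tile;

/* Picks a tile so that width * height * (1 + 2 + 4 bytes per pixel) stays
 * strictly below cacheBytes; falls back to a single pixel if even that
 * does not fit. Assumes imgWidth > 0 and imgHeight > 0. */
static Tile choose_tile(int imgWidth, int imgHeight, size_t cacheBytes)
{
    const size_t bpp = sizeof(uint8_t) + sizeof(int16_t) + sizeof(float);
    size_t maxPixels = cacheBytes > bpp ? (cacheBytes - 1) / bpp : 1;
    Tile t;
    t.width = (size_t)imgWidth < maxPixels ? imgWidth : (int)maxPixels;
    size_t rows = maxPixels / (size_t)t.width;
    t.height = rows < (size_t)imgHeight ? (int)rows : imgHeight;
    return t;
}

/* Rounds half away from zero and saturates to the 16-bit signed range. */
static int16_t saturate_round_16s(float v)
{
    if (v >= 32767.0f) return INT16_MAX;
    if (v <= -32768.0f) return INT16_MIN;
    return (int16_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* 8u -> 32f -> multiply -> 16s, one cache-sized tile at a time. */
static int scale_8u16s_tiled(const uint8_t *src, int srcStride, float mul,
                             int16_t *dst, int dstStride,
                             int width, int height, size_t cacheBytes)
{
    if (!src || !dst || width <= 0 || height <= 0) return -1;
    Tile t = choose_tile(width, height, cacheBytes);
    float *tmp = malloc(sizeof(float) * (size_t)t.width * (size_t)t.height);
    if (!tmp) return -2;
    for (int y = 0; y < height; y += t.height) {
        int h = height - y < t.height ? height - y : t.height;
        for (int x = 0; x < width; x += t.width) {
            int w = width - x < t.width ? width - x : t.width;
            const uint8_t *s = src + (size_t)y * (size_t)srcStride + x;
            int16_t *d = dst + (size_t)y * (size_t)dstStride + x;
            /* Stage 1: convert the source tile to float in the temp buffer. */
            for (int r = 0; r < h; ++r)
                for (int c = 0; c < w; ++c)
                    tmp[r * t.width + c] = (float)s[(size_t)r * (size_t)srcStride + c];
            /* Stage 2: multiply the temp buffer in place by the constant. */
            for (int r = 0; r < h; ++r)
                for (int c = 0; c < w; ++c)
                    tmp[r * t.width + c] *= mul;
            /* Stage 3: convert the temp buffer to 16-bit signed output. */
            for (int r = 0; r < h; ++r)
                for (int c = 0; c < w; ++c)
                    d[(size_t)r * (size_t)dstStride + c] =
                        saturate_round_16s(tmp[r * t.width + c]);
        }
    }
    free(tmp);
    return 0;
}
```

For a 1920-pixel-wide image and a 32768-byte budget, `choose_tile` returns a tile of 1920 x 2 pixels (26880 bytes across source, temporary, and destination). The temporary buffer is sized for one tile only, so all three stages touch data that is still in cache instead of streaming the full image through memory three times.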
Our team will extend the number of Lx APIs with every release and update until all “classic” APIs have Lx Intel® IPP counterparts, and will continue to release them in both source and prebuilt binary forms. We will continue to encourage all advanced Intel® IPP customers to use application-level threading and to create threaded pipelines of Intel® IPP functions instead of using individually threaded functions, because this recommended approach is more forward-looking in terms of performance and makes fuller use of existing and future hardware.
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.