I love ARM. Yes, I do. - Why do I work for Intel then? Because I love Intel even more. That's why I'd like to help independent software vendors to port their products from ARM to Intel® architecture. But if and only if they would like to do it.
Why would they like it? The answer is simple. Currently, Intel® CPUs are in smartphones and tablets, while Android*, Windows* 8, and some other operating systems support both ARM and x86, increasing the developer's options enormously. In most cases porting to x86 is very easy: the work ranges from zero (for managed code and for generic native code running via the Houdini* binary translator) to a simple code rebuild with the corresponding compiler. But for some applications this is not the case.
Modern ARM* CPUs widely used in mobile devices (iPhone*, iPad*, Microsoft Surface*, Samsung devices, and millions of others) have a 64/128-bit SIMD instruction set (aka NEON* or the "MPE" Media Processing Engine), first defined as a part of the ARM* Architecture, Version 7 (ARMv7). NEON is used by numerous developers for performance-critical tasks, either via assembler or via the NEON intrinsics set supported by modern compilers such as gcc, RVCT, and Microsoft's. NEON can be found in such famous open source projects as FFMPEG, VP8, OpenCV, etc. For such projects, achieving maximal performance on x86 requires porting the ARM NEON instructions or intrinsics to Intel SIMD (SSE): namely, to Supplemental Streaming SIMD Extensions 3 (SSSE3) for the first generation of Intel® Atom™ CPU-based devices, and to Intel® Streaming SIMD Extensions 4 (Intel® SSE4) for the second and later generations available since 2013.
However, the x86 SIMD and NEON instruction sets, and therefore their intrinsic functions, are different; there is no one-to-one correspondence between them, so the porting task is not trivial. Or, to be more precise, it was not trivial before this post was published.
Automatic Porting Solution for Intrinsic Functions-Based ARM NEON Source Code to Intel® x86 SIMD
Attached is the automatic solution for porting intrinsic functions-based ARM NEON source code to Intel x86 SIMD (Intel SSE up to 4.2). Why intrinsics? - x86 SIMD intrinsic functions are supported by all widespread C/C++ compilers – Intel® compilers, the Microsoft* compiler, gcc, etc. These intrinsics are very mature, and their performance is equal to or even greater than pure assembler performance, while the usability is much better.
Why Intel SSE only and not MMX? - While the MMX (64-bit data processing) instruction set could be used to substitute 64-bit NEON instructions, it is not recommended: MMX performance is commonly the same as or lower than that of the Intel SSE instructions, and the MMX-specific problem of sharing the floating point registers with serial code could cause a lot of problems in software if not properly treated. Moreover, MMX is NOT supported on the 64-bit systems that are coming to mobile devices.
By default, Intel SSE up to SSSE3 is used for porting, unless you have explicitly set the Intel SSE4 code generation flag during compilation or uncommented the corresponding "#define USE_SSE4" line in the file provided. In that case, Intel SSE up to Intel SSE4 is used for porting.
Though the solution is targeted at intrinsics, it could also assist with pure assembler porting: for each NEON function the corresponding NEON asm instruction is listed, the corresponding x86 intrinsics code could be used directly, and the asm code could be copied from the compiler's intermediate output.
The solution is shaped as a C/C++ language header to be included in the ported sources instead of the standard "arm_neon.h", providing fully automatic porting.
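For instance, a build could select the header conditionally; here is a minimal usage sketch (the header file name below is an assumption, use the name of the attached file):
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    #include <arm_neon.h>     /* original NEON intrinsics on ARM builds */
#else
    #include "NEON_2_SSE.h"   /* the porting header: the same NEON intrinsics, implemented via x86 SIMD */
#endif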
The solution covers 100% of NEON functions (~1700 of them), except for 16-bit float processing, and:
- Redefines ARM NEON 64 and 128-bit vectors as the corresponding x86 SIMD data.
- Redefines some functions from ARM NEON to Intel SSE where a 1:1 correspondence exists (~50% of 128-bit functions)
- Implements some ARM NEON functions using Intel SIMD where a performance-effective implementation is possible (~45% of functions)
- Implements the remaining NEON functions (<5% of functions, not widely used in applications) using a serial solution and issuing the corresponding "low performance" compiler warning
//******* definition sample *****************
int8x16_t vaddq_s8(int8x16_t a, int8x16_t b); // VADD.I8 q0,q0,q0
#define vaddq_s8 _mm_add_epi8
//***** porting sample***********
uint8x16_t vshrq_n_u8(uint8x16_t a, __constrange(1,8) int b) // VSHR.U8 q0,q0,#8
{ // no 8-bit shift available, need the special trick
    __declspec(align(16)) unsigned short mask10_16[9] = {0xffff, 0xff7f, 0xff3f, 0xff1f, 0xff0f, 0xff07, 0xff03, 0xff01, 0xff00};
    __m128i mask0 = _mm_set1_epi16(mask10_16[b]); // to mask the bits to be "spoiled" by the 16-bit shift
    __m128i r = _mm_srli_epi16(a, b);
    return _mm_and_si128(r, mask0);
}
//***** “serial solution” sample *****
uint64x1_t vqadd_u64(uint64x1_t a, uint64x1_t b); // VQADD.U64 d0,d0,d0
_NEON2SSE_INLINE _NEON2SSE_PERFORMANCE_WARNING(uint64x1_t vqadd_u64(uint64x1_t a, uint64x1_t b), _NEON2SSE_REASON_SLOW_SERIAL)
{
    _NEON2SSE_ALIGN_16 uint64_t a64, b64;
    uint64x1_t res;
    a64 = a.m64_u64[0];
    b64 = b.m64_u64[0];
    res.m64_u64[0] = a64 + b64;
    if (res.m64_u64[0] < a64) { // wrap-around means the unsigned add overflowed: saturate
        res.m64_u64[0] = ~(uint64_t)0;
    }
    return res;
}
The implemented functions pass extensive correctness tests for the ARM NEON intrinsic functions.
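As an illustration of what such a check could look like (a hypothetical snippet, not the project's actual test suite; the header name is assumed), a ported intrinsic can be compared against a scalar reference:
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include "NEON_2_SSE.h"   /* header name assumed; use the attached file */

int main(void)
{
    int8_t a[16], b[16], ref[16], out[16];
    for (int i = 0; i < 16; i++) { a[i] = (int8_t)(i * 3); b[i] = (int8_t)(50 - i); }
    for (int i = 0; i < 16; i++) ref[i] = (int8_t)(a[i] + b[i]);  /* scalar reference */
    int8x16_t va = vld1q_s8(a);
    int8x16_t vb = vld1q_s8(b);
    vst1q_s8(out, vaddq_s8(va, vb));                              /* ported NEON intrinsic */
    assert(memcmp(out, ref, sizeof(out)) == 0);
    return 0;
}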
Main ARM NEON - x86 SIMD Porting Challenges:
- 64-bit processing functions. Only 128-bit SSE (xmm) registers are used for x86 vector operations, which means that for each 64-bit processing function we need to somehow load the data into an xmm register and then store it back. This impacts not only code quality but performance as well. Different load-store techniques are preferable for different compilers; for some functions, serial processing is faster.
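Just to illustrate where that overhead comes from (the helper names below are hypothetical, not the header's internal macros), a 64-bit vector kept in memory has to be moved through an xmm register and back:
static inline __m128i load_m64(const void *p)        /* 64-bit NEON vector -> low half of an xmm register */
{
    return _mm_loadl_epi64((const __m128i *)p);
}
static inline void store_m64(void *p, __m128i v)     /* low half of an xmm register -> 64-bit NEON vector */
{
    _mm_storel_epi64((__m128i *)p, v);
}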
- Some x86 intrinsic functions require immediate parameters rather than constants, resulting in a compile-time "catastrophic error" when called from a wrapper function. Fortunately, this happens only for some compilers and only in non-optimized (Debug) builds. The solution is to replace such calls with a switch over the possible immediate values, using branches (cases) in debug mode.
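A simplified sketch of that workaround (the wrapper name is made up, and _mm_srli_epi16 is used only for illustration; the header's actual debug macros differ):
static inline __m128i srli_epi16_switch(__m128i a, int b) /* b expected in 1..8 */
{
    switch (b) {                       /* each case passes a literal immediate to the intrinsic */
    case 1: return _mm_srli_epi16(a, 1);
    case 2: return _mm_srli_epi16(a, 2);
    case 3: return _mm_srli_epi16(a, 3);
    case 4: return _mm_srli_epi16(a, 4);
    case 5: return _mm_srli_epi16(a, 5);
    case 6: return _mm_srli_epi16(a, 6);
    case 7: return _mm_srli_epi16(a, 7);
    default: return _mm_srli_epi16(a, 8);
    }
}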
- Not all arithmetic operations are available for 8-bit data in x86 SIMD; there is also no shift for such data. The common solution is to convert the 8-bit data to 16-bit, process them, and then pack them back to 8-bit. However, in some cases it is possible to use tricks like the one shown in the vector right shift sample above (the vshrq_n_u8 function) to avoid such conversions.
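For illustration (a sketch, not the header's exact code), the widen-process-pack approach could look like this for an unsigned 8-bit multiply that keeps the low 8 bits of each product, an operation with no direct SSE counterpart:
static inline __m128i mul_u8_sketch(__m128i a, __m128i b)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i mask = _mm_set1_epi16(0x00ff);
    /* widen each half to 16-bit, multiply, then keep the low byte of every product */
    __m128i lo = _mm_mullo_epi16(_mm_unpacklo_epi8(a, zero), _mm_unpacklo_epi8(b, zero));
    __m128i hi = _mm_mullo_epi16(_mm_unpackhi_epi8(a, zero), _mm_unpackhi_epi8(b, zero));
    /* mask to 8 bits and pack back into one 16 x 8-bit vector */
    return _mm_packus_epi16(_mm_and_si128(lo, mask), _mm_and_si128(hi, mask));
}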
- For some functions whose x86 implementation contains more than one instruction, intermediate overflow is possible. The solution is to use an overflow-safe algorithm even if it is slower. Say, if we need to calculate the average of a and b, i.e. (a+b)/2, the calculation should be done as (a/2 + b/2).
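A minimal sketch of that idea for unsigned 16-bit lanes (not the header's code); note that a/2 + b/2 alone loses the carry when both a and b are odd, so that bit is added back explicitly:
static inline __m128i hadd_u16_sketch(__m128i a, __m128i b) /* (a + b) / 2 without ever computing a + b */
{
    __m128i half_a = _mm_srli_epi16(a, 1);                                  /* a / 2 */
    __m128i half_b = _mm_srli_epi16(b, 1);                                  /* b / 2 */
    __m128i carry  = _mm_and_si128(_mm_and_si128(a, b), _mm_set1_epi16(1)); /* 1 iff both a and b are odd */
    return _mm_add_epi16(_mm_add_epi16(half_a, half_b), carry);
}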
- For some NEON functions, there exist corresponding x86 SIMD functions; however, their behavior differs when the function parameters are "out of range". Such cases need special processing, as in the following table lookup sample: while the NEON specification says that indices out of range return 0, for Intel SIMD we need to set the most significant bit of the index to 1 to get a zero returned:
uint8x8_t vtbl1_u8(uint8x8_t a, uint8x8_t b)
{
    uint8x8_t res64; // used by the return64() helper
    __m128i c7, maskgt, bmask, b128;
    c7 = _mm_set1_epi8(7);
    b128 = _pM128i(b);
    maskgt = _mm_cmpgt_epi8(b128, c7);           // 0xff for every out-of-range index (>7)
    bmask = _mm_or_si128(b128, maskgt);          // sets the most significant bit of those indices
    bmask = _mm_shuffle_epi8(_pM128i(a), bmask); // _mm_shuffle_epi8 returns 0 where the index MSB is set
    return64(bmask);
}
- For some NEON functions there exist corresponding x86 SIMD functions; however, their rounding rules are different, so we need to compensate for it by adding or subtracting 1 from the final result.
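For example (a sketch with made-up naming, not the header's code), a NEON-style rounding shift right can be built from the truncating _mm_srli_epi16 by adding back the bit that rounding would have contributed:
static inline __m128i rshr_u16_sketch(__m128i a, int n)  /* n in 1..16, computes (a + (1 << (n-1))) >> n */
{
    __m128i truncated = _mm_srli_epi16(a, n);
    __m128i roundbit  = _mm_and_si128(_mm_srli_epi16(a, n - 1), _mm_set1_epi16(1)); /* bit n-1 of a */
    return _mm_add_epi16(truncated, roundbit);
}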
- For some functions, an x86 SIMD implementation is not possible or not effective. Examples of such functions are: shift of a vector by another vector, and some arithmetic operations on 64- and 32-bit data. The only solution here is a serial code implementation.
Performance
First, it is necessary to note that the exact porting solution selected for each function is based on common sense and on the x86 SIMD latency and throughput data for the latest Intel® Atom™ CPU. However, for some CPUs and conditions, a better solution might be possible.
Solution performance was tested on several projects demonstrating very similar results, which leads to the first and very important conclusion:
- For most cases of x86 porting, expect a performance increase ratio similar to the ARM NEON vectorized/serial code ratio, provided 128-bit processing NEON functions are used.
Unfortunately, the situation is different for 64-bit processing NEON functions (even for those taking 64-bit input and returning 128 bits, or vice versa). For them the speedup is significantly lower.
So the second very important conclusion is:
- Avoid 64-bit processing NEON functions; try to use the 128-bit versions even if your data are only 64 bits wide. If you do use 64-bit NEON functions, expect the corresponding performance penalty.
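An illustrative sketch (array names and sizes are hypothetical; assuming <stdint.h>, <stddef.h>, and the NEON intrinsics header are included): even for byte data, processing 16 elements per iteration with the 128-bit q-forms ports much better than using the 64-bit d-forms:
void add_arrays(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {              /* 128-bit NEON path, maps to full xmm registers */
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vaddq_u8(va, vb));
    }
    for (; i < n; i++)                          /* scalar tail */
        dst[i] = (uint8_t)(a[i] + b[i]);
}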
Other porting considerations and best known methods are:
- Use 16-byte data alignment for faster loads and stores
- Avoid NEON functions working with constants. They give not a gain, but a performance penalty for the constants load/propagation instead. If constants usage is necessary, try to move the constants initialization out of hotspot loops (see the sketch after this list) and, if applicable, replace it with logical and compare operations.
- Try to avoid functions marked as "serially implemented", because they need to store the data from vector registers to memory, process them serially, and load them again. You could probably change the data type or algorithm used to make the whole port vectorized rather than serial.
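A sketch of the constants hoisting advice (buffer names and the constant value are made up; assuming the usual NEON intrinsics and stdint types are available):
void add_bias(const int16_t *src, int16_t *dst, size_t n)
{
    const int16x8_t bias = vdupq_n_s16(16);        /* constant initialized once, outside the loop */
    for (size_t i = 0; i + 8 <= n; i += 8)
        vst1q_s16(dst + i, vaddq_s16(vld1q_s16(src + i), bias));
    /* the anti-pattern would be calling vdupq_n_s16(16) inside the loop body on every iteration */
}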
Once again: just include the file below in your project instead of the "arm_neon.h" header, and your code will be ported without any other changes required!
As of December 2016, the project is hosted on GitHub: https://github.com/intel/ARM_NEON_2_x86_SSE. All issues, reports, requests, and feedback are kindly accepted there.