Cookbook

  • 2021
  • 11/09/2021

Compile a Portable Optimized Binary with the Latest Instruction Set

Learn the different methods for compiling a binary with the latest instruction set while maintaining portability.
Content expert: Roman Khatko
Modern Intel® processors support instruction set extensions such as the different versions of Intel® Advanced Vector Extensions (Intel® AVX): AVX-512, AVX2, and AVX.
When compiling your application, you may consider three options based on the intended usage of your application:
  • Generic binary:
    Compile an application for the generic x86 instruction set. As a result, the application will run on all x86 processors, but may not utilize a newer processor to its full potential.
  • Native binary:
    Compile an application for the specific processor. As a result, the application will utilize all features of the target processor but will not run on older processors.
  • Portable binary:
    Compile a portable optimized binary with multiple versions of functions, each targeted for different processors using compiler options and function attributes. The resulting binary will have the performance characteristics of an application compiled for a specific processor (native binary) and will run on older processors.
This recipe demonstrates how you can compile a portable binary with the performance characteristics of a native binary, while still maintaining portability of a generic binary. Over the course of this recipe, you compile both the generic and native binaries first to determine if the resulting performance improvement is large enough to justify the increase in binary size.
This recipe covers the Intel® C++ Compiler Classic and the GNU* Compiler Collection (GCC).
This recipe does not cover manual dispatching using the CPUID processor instruction, Processor Targeting compiler options, or the target function attribute.
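Before choosing between the three options, it can help to know which vector extension the build or test machine reports. The following is a minimal diagnostic sketch, not a dispatching mechanism. It assumes a GCC-compatible compiler that provides the `__builtin_cpu_supports` builtin on x86; the function names `best_vector_isa` and `print_best_vector_isa` are hypothetical and not part of the recipe.

```c
#include <stdio.h>

/* Hypothetical helper: names the widest vector extension the running CPU
   reports. Assumes the GCC builtins __builtin_cpu_init and
   __builtin_cpu_supports, which are only available on x86 targets. */
const char *best_vector_isa(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* safe to call more than once */
    if (__builtin_cpu_supports("avx512f")) return "AVX-512F";
    if (__builtin_cpu_supports("avx2"))    return "AVX2";
    if (__builtin_cpu_supports("avx"))     return "AVX";
    if (__builtin_cpu_supports("sse4.2"))  return "SSE4.2";
    return "generic x86";
#else
    /* Fallback for compilers or targets without the builtin. */
    return "unknown (builtin not available)";
#endif
}

/* Prints the detected extension, for a quick look before picking flags. */
void print_best_vector_isa(void)
{
    printf("Widest vector extension reported: %s\n", best_vector_isa());
}
```

Calling `print_best_vector_isa()` from a small `main` on each machine you plan to support gives a first idea of which code paths a portable binary would need.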

Ingredients

This section lists the systems and tools used in the creation of this recipe:
  • Processor:
    Intel® Xeon® Processor code named Cascade Lake
  • Operating System:
    Fedora 32
  • Compilers:
    • Intel® C++ Compiler Classic 2021.1.2
    • GCC version 10.1.1
  • Analysis Tool:
    Intel® VTune™ Profiler 2021.1.2

Sample Application

Save the following code to a source file named fma.c:
// fma.c
#include <stdio.h>
#include <stdlib.h>

void init(float *a, float *b, float *c, int size)
{
    for (int i = 0; i < size; i++) {
        a[i] = (float) (i % 10);
        b[i] = a[i] * 1.1f;
        c[i] = a[i] * 1.2f;
    }
}

void my_fma(float *a, float *b, float *c, int size)
{
    for (int i = 0; i < size; i++) {
        c[i] += a[i]*b[i];
    }
}

#define ITERATIONS 10000000
#define SIZE 2048

int main()
{
    float *a = malloc(SIZE*sizeof(float));
    float *b = malloc(SIZE*sizeof(float));
    float *c = malloc(SIZE*sizeof(float));

    for (int i = 0; i < ITERATIONS; i++) {
        init(a, b, c, SIZE);
        my_fma(a, b, c, SIZE);
    }

    printf("%f", c[5]); // use the data

    free(a);
    free(b);
    free(c);
    return 0;
}
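Because the sample is rebuilt several times with different options, a small self-check can confirm that each build still computes the same values. Builds that use fused multiply-add instructions may round slightly differently, so the comparison uses a tolerance. The sketch below is an assumption layered on top of the recipe: the names `reference_value` and `first_mismatch` are hypothetical, and the reference formula simply restates what one `init()` plus `my_fma()` pass does to element `i`.

```c
/* Hypothetical reference: value of c[i] after one init() + my_fma() pass,
   computed in double precision. With a = i % 10, the sample computes
   c = a * 1.2 + a * (a * 1.1). */
double reference_value(int i)
{
    double a = (double)(i % 10);
    return a * 1.2 + a * (a * 1.1);
}

/* Returns the index of the first element of c that deviates from the
   reference by more than rel_tol (relative to the reference magnitude,
   with a floor of 1.0), or -1 if every element is within tolerance. */
int first_mismatch(const float *c, int size, double rel_tol)
{
    for (int i = 0; i < size; i++) {
        double ref = reference_value(i);
        double diff = (double)c[i] - ref;
        if (diff < 0.0) {
            diff = -diff;
        }
        double scale = ref > 1.0 ? ref : 1.0;
        if (diff > rel_tol * scale) {
            return i;
        }
    }
    return -1;
}
```

One possible use is to call `first_mismatch(c, SIZE, 1e-5)` after the main loop and report an error if the result is not -1. The tolerance of 1e-5 is an illustrative choice for single-precision data, not a value taken from the recipe.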

Compile Generic Optimized Binary

Compile the binary following the recommendations in the VTune Profiler User Guide (see also the recommendations for Windows).
Intel C++ Compiler Classic
Compile the binary with debug information and the -O3 optimization level:
icc -g -O3 -debug inline-debug-info fma.c -o fma_generic
GNU Compiler Collection
Compile the binary with debug information and the -O2 optimization level:
gcc -g -O2 fma.c -o fma_generic_O2
Check if the code was vectorized using the HPC Performance Characterization analysis type of VTune Profiler.
To do that, run the analysis:
vtune -c hpc-performance -r fma_generic_O2_hpc ./fma_generic_O2
Then open the result in the VTune Profiler GUI:
vtune-gui fma_generic_O2_hpc
In the analysis result, find the Top Loops/Functions with FPU Usage by CPU Time section of the Summary tab.
There, the FP Ops: Scalar value equals 100% and the Vector Instruction Set column is empty, which indicates that this version of GCC does not vectorize the code at the -O2 optimization level. Use the -O2 -ftree-vectorize or -O3 options to enable vectorization.
Compile the fma_generic binary with the -O3 optimization level:
gcc -g -O3 fma.c -o fma_generic

Compile Native Binary

Compile native binary with the Intel C++ Compiler Classic
The -xHost option instructs the compiler to generate instructions for the highest instruction set available on the processor performing the compilation. Alternatively, the -x{Arch} option, where {Arch} is the architecture codename, instructs the compiler to target processor features of a specific architecture.
Compile the fma_native binary with the -xHost flag:
icc -g -O3 -debug inline-debug-info -xHost fma.c -o fma_native
Compile native binary with the GNU Compiler Collection
Compile the fma_native binary with the -march=native flag:
gcc -g -O3 -march=native fma.c -o fma_native
If your processor supports the AVX-512 instruction set extension, consider experimenting with the -mprefer-vector-width=512 option.
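To confirm which extensions a given build was allowed to use, the compiler's predefined macros can be inspected at compile time. The sketch below assumes the macros `__AVX512F__`, `__AVX2__`, `__AVX__`, `__SSE4_2__`, and `__FMA__`, which GCC defines when the corresponding instruction set is enabled; other compilers may define a different set. The function names are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: names the widest vector extension this translation
   unit was compiled for, based on predefined compiler macros. */
const char *compiled_vector_isa(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";
#else
    return "baseline";
#endif
}

/* Returns 1 if the compiler was allowed to emit FMA instructions, else 0. */
int compiled_with_fma(void)
{
#if defined(__FMA__)
    return 1;
#else
    return 0;
#endif
}

/* Prints a one-line summary of the build target. */
void print_build_target(void)
{
    printf("Compiled for: %s, FMA %s\n",
           compiled_vector_isa(),
           compiled_with_fma() ? "enabled" : "disabled");
}
```

Built without architecture options, this is expected to report the compiler's baseline; built with -march=native, it reports whatever the compiling machine supports. Note that these macros describe the flags for the whole file, so they do not reflect per-function code paths.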

Compare Generic and Native Binaries

Collect the HPC Performance Characterization analysis data for both binaries:
vtune -c hpc-performance -r fma_generic_hpc ./fma_generic
vtune -c hpc-performance -r fma_native_hpc ./fma_native
Compare these results using the command:
vtune-gui fma_generic_hpc fma_native_hpc
In the VTune Profiler GUI, switch to the Bottom-Up tab and set Loop Mode to Functions only.
Switch to the Summary tab and scroll down to the Top Loops/Functions with FPU Usage by CPU Time section.
Observe the CPU Time and Vector Instruction Set columns.
Consider the performance difference between the generic and the native binary. Decide whether it makes sense to compile a portable binary with multiple code paths.
This sample application was auto-vectorized by the compiler. To investigate vectorization opportunities in your application in depth, try Intel® Advisor.
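As a rough cross-check of the CPU Time column, the kernel can also be timed directly inside the program. This is a minimal sketch using the standard `clock()` function; it is not a replacement for the profiler. The names `kernel_fn` and `cpu_seconds` are hypothetical, and `my_fma` is repeated from fma.c only so that the block stands on its own.

```c
#include <time.h>

/* Signature shared by the kernels in fma.c. */
typedef void (*kernel_fn)(float *, float *, float *, int);

/* Copy of the sample kernel from fma.c, repeated here for completeness. */
void my_fma(float *a, float *b, float *c, int size)
{
    for (int i = 0; i < size; i++) {
        c[i] += a[i] * b[i];
    }
}

/* Hypothetical helper: returns the processor time, in seconds, consumed by
   calling fn on the given arrays `repeats` times. */
double cpu_seconds(kernel_fn fn, float *a, float *b, float *c,
                   int size, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; r++) {
        fn(a, b, c, size);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Running the same measurement in the generic and native builds gives a quick indication of whether the difference reported by the profiler is reproducible. The resolution of `clock()` is coarse, so a large repeat count is advisable.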

Compile Portable Binary

If the comparison between the generic and native binaries shows a performance improvement, for example, a lower CPU Time, consider compiling a portable binary.
Compile the portable binary with the Intel C++ Compiler Classic
Use the -ax (/Qax for Windows) option to instruct the compiler to generate multiple feature-specific auto-dispatch code paths for Intel processors.
Compile the fma_portable binary with the -ax option:
icc -g -O3 -debug inline-debug-info -axCOMMON-AVX512,CORE-AVX2,AVX,SSE4.2,TREMONT,ICELAKE-SERVER fma.c -o fma_portable
Refer to the -ax option help page for the list of supported architectures.
Compile the portable binary with the GNU Compiler Collection
Compare the results for the generic and native binaries. If the CPU Time was improved and an additional Vector Instruction Set was utilized for a specific function in the native binary result, then add the target_clones attribute to this function.
If the function calls other functions, consider adding the flatten attribute to force inlining, since the target_clones attribute is not recursive.
Copy the contents of the fma.c source file to a new file, fma_portable.c, and add the TARGET_CLONES preprocessor macro:
#define TARGET_CLONES __attribute__((flatten,target_clones("default,sse4.2,avx,"\
"avx2,avx512f,arch=skylake,arch=tremont,arch=skylake-avx512,"\
"arch=cascadelake,arch=cooperlake,arch=tigerlake,arch=icelake-server")))
Refer to the x86 Options page of the GCC manual for the list of supported architectures.
Multiple versions of a function will increase the binary size. Consider the trade-off between performance improvement for each target and code size. Collecting and comparing VTune Profiler results enables you to make data-driven decisions to apply the TARGET_CLONES macro only to the functions that will run faster with new instructions.
Add the TARGET_CLONES macro before the definitions of the my_fma and init functions and save the changes to fma_portable.c:
TARGET_CLONES void my_fma(float *a, float *b, float *c, int size)
Compile the fma_portable binary:
gcc -g -O3 fma_portable.c -o fma_portable
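Putting the GCC steps together, the annotated functions in fma_portable.c might look like the condensed sketch below. To keep the example short and to let it compile with older GCC versions, it uses a shortened target list ("default,avx2,avx512f"), which is an assumption and not the list from the recipe. The preprocessor guard that makes the macro expand to nothing on other compilers and platforms is also an addition for illustration; target_clones relies on platform support for function multiversioning, which is assumed here to be available on x86 Linux with GCC.

```c
/* Apply the attributes only where they are assumed to be supported:
   GCC-compatible compilers on x86-64 Linux. Elsewhere the macro is empty
   and the functions are compiled once, as in fma.c. */
#if defined(__GNUC__) && !defined(__INTEL_COMPILER) && \
    defined(__x86_64__) && defined(__linux__)
#define TARGET_CLONES \
    __attribute__((flatten, target_clones("default,avx2,avx512f")))
#else
#define TARGET_CLONES
#endif

/* One clone of init() is generated per listed target; the version matching
   the running processor is selected when the program starts. */
TARGET_CLONES
void init(float *a, float *b, float *c, int size)
{
    for (int i = 0; i < size; i++) {
        a[i] = (float) (i % 10);
        b[i] = a[i] * 1.1f;
        c[i] = a[i] * 1.2f;
    }
}

/* Same treatment for the multiply-add kernel. */
TARGET_CLONES
void my_fma(float *a, float *b, float *c, int size)
{
    for (int i = 0; i < size; i++) {
        c[i] += a[i] * b[i];
    }
}
```

The rest of fma_portable.c, including `main`, stays as in fma.c. Each added target in the list produces one more copy of every annotated function, so a shorter list keeps the binary smaller at the cost of less specialized code paths.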

Compare Portable and Native Binaries

To compare the performance of the portable and native binaries, collect the HPC Performance Characterization data for the fma_portable binary:
vtune -c hpc-performance -r fma_portable_hpc ./fma_portable
Open the comparison in the VTune Profiler GUI:
vtune-gui fma_portable_hpc fma_native_hpc
The comparison should show that the portable binary uses the highest instruction set extension available on the target system and performs on par with the native binary.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.