Intel VTune Profiler Performance Analysis Cookbook

ID 766316
Date 12/16/2022
Public

Compile a Portable Optimized Binary with the Latest Instruction Set

Learn the different methods for compiling a binary with the latest instruction set while maintaining portability.

Content expert: Roman Khatko

Modern Intel® processors support instruction set extensions such as the different versions of Intel® Advanced Vector Extensions (Intel® AVX): AVX-512, AVX2, and AVX.

When compiling your application, you may consider three options based on the intended usage of your application:

  • Generic binary: Compile an application for the generic x86 instruction set. As a result, the application will run on all x86 processors, but may not utilize a newer processor to its full potential.
  • Native binary: Compile an application for the specific processor. As a result, the application will utilize all features of the target processor but will not run on older processors.
  • Portable binary: Compile a portable optimized binary with multiple versions of functions, each targeted for different processors using compiler options and function attributes. The resulting binary will have the performance characteristics of an application compiled for a specific processor (native binary) and will run on older processors.

This recipe demonstrates how to compile a portable binary that has the performance characteristics of a native binary while keeping the portability of a generic binary. You first compile both the generic and native binaries to determine whether the performance improvement is large enough to justify the increase in binary size.

This recipe covers the Intel® C++ Compiler Classic and the GNU* Compiler Collection (GCC).

This recipe does not cover manual dispatching using the CPUID processor instruction, Processor Targeting compiler options or the target function attribute.

Ingredients

This section lists the systems and tools used in the creation of this recipe:

  • Processor: Intel® Xeon® Processor code named Cascade Lake
  • Operating System: Fedora 32
  • Compilers:
    • Intel® C++ Compiler Classic 2021.1.2
    • GCC version 10.1.1
  • Analysis Tool: Intel® VTune™ Profiler 2021.1.2

Sample Application

Save the following code to a source file named fma.c:

// fma.c
#include <stdio.h>
#include <stdlib.h>

void init(float *a, float *b, float *c, int size)
{
    for (int i = 0; i < size; i++)
    {
        a[i] = (float) (i % 10);
        b[i] = a[i] * 1.1f;
        c[i] = a[i] * 1.2f;
    }
}

void my_fma(float *a, float *b, float *c, int size)
{
    for (int i = 0; i < size; i++)
    {
        c[i] += a[i]*b[i];
    }
}

#define ITERATIONS 10000000
#define SIZE 2048

int main()
{
    float *a = malloc(SIZE*sizeof(float));
    float *b = malloc(SIZE*sizeof(float));
    float *c = malloc(SIZE*sizeof(float));

    for (int i = 0; i < ITERATIONS; i++)
    {
        init(a, b, c, SIZE);
        my_fma(a, b, c, SIZE);
    }
    printf("%f", c[5]); // use the data

    free(a);
    free(b);
    free(c);
    return 0;
}

Compile Generic Optimized Binary

Compile the binary following the compiler option recommendations in the VTune Profiler User Guide (separate recommendations are available for Windows).

Intel C++ Compiler Classic

Compile the binary with debug information and -O3 optimization level:

icc -g -O3 -debug inline-debug-info fma.c -o fma_generic

GNU Compiler Collection

Compile the binary with debug information and -O2 optimization level:

gcc -g -O2 fma.c -o fma_generic_O2

Check if the code was vectorized using the HPC Performance Characterization analysis type of VTune Profiler.

To do that, run the analysis:

vtune -c hpc-performance -r fma_generic_O2_hpc ./fma_generic_O2

And open the result in VTune Profiler GUI:

vtune-gui fma_generic_O2_hpc

In the Summary tab, scroll to the Top Loops/Functions with FPU Usage by CPU Time section.

An FP Ops: Scalar value of 100% and an empty Vector Instruction Set column indicate that the GCC version used in this recipe does not vectorize the code at the -O2 optimization level. To enable vectorization, use the -O2 -ftree-vectorize or -O3 options.

Compile the fma_generic binary with -O3 optimization level:

gcc -g -O3 fma.c -o fma_generic

Compile Native Binary

Compile native binary with the Intel C++ Compiler Classic

The -xHost option instructs the compiler to generate instructions for the highest instruction set available on the processor performing the compilation. Alternatively, the -x{Arch} option, where {Arch} is the architecture codename, instructs the compiler to target processor features of a specific architecture.

Compile the fma_native binary with -xHost flag:

icc -g -O3 -debug inline-debug-info -xHost fma.c -o fma_native

Compile native binary with the GNU Compiler Collection

Compile the fma_native binary with -march=native flag:

gcc -g -O3 -march=native fma.c -o fma_native

If your processor supports the AVX-512 instruction set extension, consider experimenting with the -mprefer-vector-width=512 option.

Compare Generic and Native Binaries

Collect the HPC Performance Characterization analysis data for both binaries:

vtune -c hpc-performance -r fma_generic_hpc ./fma_generic
vtune -c hpc-performance -r fma_native_hpc ./fma_native

Compare these results using the command:

vtune-gui fma_generic_hpc fma_native_hpc

In the VTune Profiler GUI, switch to the Bottom-Up tab and set Loop Mode to Functions only:

Switch to the Summary tab and scroll down to the Top Loops/Functions with FPU Usage by CPU Time section:

Observe the CPU Time and Vector Instruction Set columns.

Consider the performance difference between the generic and the native binary. Decide whether it makes sense to compile a portable binary with multiple code paths.

NOTE:

This sample application was auto-vectorized by the compiler. To investigate vectorization opportunities in your application in depth, try Intel® Advisor.

Compile Portable Binary

If the comparison between the generic and native binary shows a performance improvement, for example, if the CPU Time was improved, consider compiling a portable binary.

Compile the portable binary with the Intel C++ Compiler Classic

Use the -ax (/Qax for Windows) option to instruct the compiler to generate multiple feature-specific auto-dispatch code paths for Intel processors.

Compile the fma_portable binary with the -ax option:

icc -g -O3 -debug inline-debug-info -axCOMMON-AVX512,CORE-AVX2,AVX,SSE4.2,TREMONT,ICELAKE-SERVER fma.c -o fma_portable

Refer to the -ax option help page for the list of supported architectures.

Compile the portable binary with the GNU Compiler Collection

Compare the results for generic and native binaries. If the CPU Time was improved and an additional Vector Instruction Set was utilized for a specific function in the native binary result, then add the target_clones attribute to this function.

If the function calls other functions, consider adding the flatten attribute to force inlining, since the target_clones attribute is not recursive.

Copy the contents of the fma.c source file to a new file, fma_portable.c, and define the TARGET_CLONES preprocessor macro:

#define TARGET_CLONES __attribute__((flatten,target_clones("default,sse4.2,avx,"\
    "avx2,avx512f,arch=skylake,arch=tremont,arch=skylake-avx512,"\
    "arch=cascadelake,arch=cooperlake,arch=tigerlake,arch=icelake-server")))

Refer to the x86 Options page of the GCC manual for the list of supported architectures.

Multiple versions of a function will increase the binary size. Consider the trade-off between performance improvement for each target and code size. Collecting and comparing VTune Profiler results enables you to make data-driven decisions to apply the TARGET_CLONES macro only to the functions that will run faster with new instructions.

Add the TARGET_CLONES macro before the definitions of the my_fma and init functions, and save the changes to fma_portable.c. For example, for my_fma:

TARGET_CLONES
void my_fma(float *a, float *b, float *c, int size)

Compile the fma_portable binary:

gcc -g -O3 fma_portable.c -o fma_portable

Compare Portable and Native Binaries

To compare the performance of the portable and native binaries, collect the HPC Performance Characterization data for the fma_portable binary:

vtune -c hpc-performance -r fma_portable_hpc ./fma_portable

Open the comparison in VTune Profiler GUI:

vtune-gui fma_portable_hpc fma_native_hpc

The portable binary is expected to use the highest instruction set extension available on the target system and to show a CPU Time close to that of the native binary. Use the Vector Instruction Set and CPU Time columns to confirm this.