Fine-Tune Performance with LLVM* Compiler Optimization Reports

06/05/2024


Valuable Insights on Optimizations Applied to Your Code

In today's rapidly advancing field of software development, optimizing code performance is more important than ever, particularly given the continuous evolution of computing architectures. Intel's LLVM-based compilers, including the Intel® oneAPI DPC++/C++ Compiler and Intel® Fortran Compiler, are at the forefront of this optimization journey, providing developers with robust tools to enhance the efficiency and speed of their applications across various computing devices such as CPUs and other specialized processors.

Having these optimization reports as part of the compiler allows for seamless integration of code optimization into the software design and code generation stage of the development cycle. The report includes information on loop transformations and vectorization. In future articles, we'll discuss other opt-report topics like inlining, Profile Guided Optimization (PGO), and more.

This article focuses on how you can generate detailed optimization reports with these compilers and how to apply the information they provide to evaluate the possibility of code improvements. By delving into the intricacies of these tools, developers will gain invaluable insights into fine-tuning their code for peak performance on modern hardware architectures.

Enabling and Controlling the Report

Below is the command line syntax for activating and managing optimization reports with Intel compilers on Windows and Linux platforms. Typically, Linux options begin with '-q', while Windows options start with '/Q'. These options apply equally to C++ and Fortran compilers.

Linux:         -qopt-report[=N]
Windows:       /Qopt-report[:N]
Functionality: Enables the report; N=1-3 specifies an increasing level of detail. The default is N=2 if no argument is passed.

Linux:         -qopt-report-file=stdout | stderr | filename
Windows:       /Qopt-report-file:stdout | stderr | filename
Functionality: Specifies whether the output for the generated optimization report goes to a file, stderr, or stdout.

Linux:         -qopt-report-stdout
Windows:       /Qopt-report-stdout
Functionality: Specifies that the generated report should go to stdout.
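
For example, the following commands request a level-3 report and write it to a file (the source and report file names here, sample.cpp and report.txt, are placeholders):

Linux:

$ icpx -c -qopt-report=3 -qopt-report-file=report.txt sample.cpp

Windows:

> icx /c /Qopt-report:3 /Qopt-report-file:report.txt sample.cpp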

Layout of Loop-Related Reports

The optimization report presents a structured hierarchy of messages related to nested loops, maintaining a clear format. Each loop within the compiler-generated code is identified with a "LOOP BEGIN" message, along with the corresponding line and column numbers from the source code. The nesting of loops is clearly illustrated through indentation. It's worth noting that a single source loop might produce multiple compiler-generated loops, and the nesting structure may differ from the original code. In certain instances, a loop could be divided into several sub-loops, a technique known as "distribution."

double a[1000][1000],b[1000][1000],c[1000][1000];
void foo() 
{
        int i,j,k;
        for( i=0; i<1000; i++)
        {
                for( j=0; j< 1000; j++)
                {
                        c[j][i] = 0.0;
                        for( k=0; k<1000; k++)
                        {
                                c[j][i] = c[j][i] + a[k][i] * b[j][k];
                        }
                }
        }
}

Output:

$ icpx -c -qopt-report=3 -qopt-report-file=stderr loop.cpp

Global optimization report for : _Z3foov
LOOP BEGIN at loop.cpp (5, 2)
    remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
    LOOP BEGIN at loop.cpp (7, 3)
        remark #25529: Dead stores eliminated in loop
        remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
        LOOP BEGIN at loop.cpp (10, 4)
            remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
            remark #25438: Loop unrolled without remainder by 8
        LOOP END
    LOOP END
LOOP END


Outer Loop at loop.cpp (5, 2): The first outer loop was not vectorized because it was not considered a suitable candidate for vectorization.
Nested Loop at loop.cpp (7, 3): Within this loop, the compiler eliminated dead stores, which are write operations to memory that do not affect the program's outcome. Removing them reduces unnecessary memory operations and improves execution speed.
Innermost Loop at loop.cpp (10, 4): This loop was recognized as vectorizable, but the compiler judged vectorization to be inefficient under its default cost model.
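
As remark #15335 suggests, the cost-model decision for the innermost loop can be overridden either with the -vec-threshold0 option or with a vector always directive placed directly on the loop. A minimal sketch of the latter (not part of the original example):

// Sketch: force vectorization of the innermost loop even when the
// compiler's cost model judges it unprofitable (see remark #15335).
#pragma vector always
for (k = 0; k < 1000; k++)
{
        c[j][i] = c[j][i] + a[k][i] * b[j][k];
}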

Using the Loop and Vectorization Reports

The compiler option "-xcore-avx512" directs the compiler to generate code optimized specifically for processors that support AVX-512 instructions. AVX-512, an extension of the Advanced Vector Extensions (AVX) instruction set architecture, offers a wider range of vectorization capabilities.

With this option, the compiler is able to restructure and vectorize the loop nest above using AVX-512 instructions.

Output:

$ icpx -c -qopt-report=3 -qopt-report-file=stderr -xcore-avx512 loop.cpp

Global optimization report for : _Z3foov
LOOP BEGIN at loop.cpp (7, 3)
<Distributed chunk1>
    remark #25426: Loop distributed (2 way) for perfect loop nest formation
    remark #25567: 2 loops have been collapsed
    remark #25408: memset generated
    remark #25260: Dead loop optimized away
LOOP END
LOOP BEGIN at loop.cpp (7, 3)
    remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
    LOOP BEGIN at loop.cpp (10, 4)
    <Distributed chunk2>
        remark #25444: Loopnest interchanged: ( 1 2 3 ) --> ( 2 3 1 )
        remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
        LOOP BEGIN at loop.cpp (5, 2)
            remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
            LOOP BEGIN at loop.cpp (7, 3)
            <Distributed chunk2>
                remark #25566: blocked by 64
                remark #25540: Loop unrolled and jammed by 4
                remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
                LOOP BEGIN at loop.cpp (10, 4)
                    remark #25566: blocked by 64
                    remark #25540: Loop unrolled and jammed by 4
                    remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
                    LOOP BEGIN at loop.cpp (5, 2)
                        remark #25566: blocked by 64
                        remark #25563: Load hoisted out of the loop
                        remark #25583: Number of Array Refs Scalar Replaced In Loop: 36
                        remark #15300: LOOP WAS VECTORIZED
                        remark #15305: vectorization support: vector length 4
                        remark #15475: --- begin vector loop cost summary ---
                        remark #15476: scalar cost: 47.000000
                        remark #15477: vector cost: 12.250000
                        remark #15478: estimated potential speedup: 3.812500
                        remark #15309: vectorization support: normalized vectorization overhead 0.156250
                        remark #15488: --- end vector loop cost summary ---
                        remark #15447: --- begin vector loop memory reference summary ---
                        remark #15450: unmasked unaligned unit stride loads: 8
                        remark #15451: unmasked unaligned unit stride stores: 4
                        remark #15474: --- end vector loop memory reference summary ---
                    LOOP END
                LOOP END
            LOOP END
        LOOP END
    LOOP END
LOOP END

The report above shows that the compiler distributed the loop nest, interchanged the loops, blocked (tiled) them by 64, applied unroll-and-jam, and finally vectorized the resulting innermost loop. Breaking loops into chunks in this way improves cache utilization by leveraging the hierarchical caches of modern processors (L1, L2, L3), which operate much faster than main memory. Dividing a loop into smaller segments allows each segment's data to reside within a cache, thereby decreasing the frequency of slow memory accesses.
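
As an illustration only (hand-written, not the code the compiler actually emits), blocking the original loop nest by 64 looks roughly like this:

#include <algorithm> // std::min

double a[1000][1000], b[1000][1000], c[1000][1000];

// Hand-written sketch of cache blocking (tiling) by 64 for the earlier loop
// nest; the compiler additionally interchanges and unroll-and-jams the loops,
// which is omitted here for clarity.
void foo_blocked()
{
        const int N = 1000, B = 64;                // B = block size ("blocked by 64")
        for (int jj = 0; jj < N; jj += B)          // tile the j loop
        {
                for (int kk = 0; kk < N; kk += B)  // tile the k loop
                {
                        for (int i = 0; i < N; i++)
                        {
                                for (int j = jj; j < std::min(jj + B, N); j++)
                                {
                                        // start from 0 on the first k tile, otherwise accumulate
                                        double sum = (kk == 0) ? 0.0 : c[j][i];
                                        for (int k = kk; k < std::min(kk + B, N); k++)
                                                sum += a[k][i] * b[j][k];
                                        c[j][i] = sum;
                                }
                        }
                }
        }
}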

Multi-Versioning

In this example, the compiler generates two loop versions corresponding to a single loop in the source code; this is known as "multi-versioning". The optimization report tells us that this is due to a potential data dependence: the compiler does not know at compile time whether the two pointer arguments are aliased, i.e., whether the data they point to overlaps in a way that would make vectorization unsafe. It therefore creates two versions of the loop, one vectorized and one not, and inserts a run-time test for data overlap so that the vectorized loop is executed if it is safe to do so; otherwise, the non-vectorized version is executed.

#include <math.h>
void func (float* theta, float* sth) 
{
        int i;
        for (i=0; i < 128; i++)
                sth[i] = sin(theta[i]+3.1415927);
}

Output:

$ icpx -c -qopt-report=3 -qopt-report-file=stderr multi.cpp

Global optimization report for : _Z4funcPfS_
LOOP BEGIN at multi.cpp (5, 2)
<Multiversioned v2>
    remark #15319: Loop was not vectorized: novector directive used
LOOP END

LOOP BEGIN at multi.cpp (5, 2)
<Multiversioned v1>
    remark #25228: Loop multiversioned for Data Dependence
    remark #15300: LOOP WAS VECTORIZED
    remark #15305: vectorization support: vector length 2
    remark #15475: --- begin vector loop cost summary ---
    remark #15476: scalar cost: 49.000000
    remark #15477: vector cost: 28.046875
    remark #15478: estimated potential speedup: 1.734375
    remark #15309: vectorization support: normalized vectorization overhead 0.000000
    remark #15570: using scalar loop trip count: 128
    remark #15482: vectorized math library calls: 1
    remark #15488: --- end vector loop cost summary ---
    remark #15447: --- begin vector loop memory reference summary ---
    remark #15450: unmasked unaligned unit stride loads: 1
    remark #15451: unmasked unaligned unit stride stores: 1
    remark #15474: --- end vector loop memory reference summary ---
LOOP END   

If the two pointer arguments are never aliased, we can communicate this to the compiler with the -fargument-noalias flag:

Output:

$ icpx -c -qopt-report=3 -qopt-report-file=stderr -fargument-noalias multi.cpp

Global optimization report for : _Z4funcPfS_
LOOP BEGIN at multi.cpp (5, 2)
    remark #15300: LOOP WAS VECTORIZED
    remark #15305: vectorization support: vector length 2
    remark #15475: --- begin vector loop cost summary ---
    remark #15476: scalar cost: 49.000000
    remark #15477: vector cost: 28.046875
    remark #15478: estimated potential speedup: 1.734375
    remark #15309: vectorization support: normalized vectorization overhead 0.000000
    remark #15570: using scalar loop trip count: 128
    remark #15482: vectorized math library calls: 1
    remark #15488: --- end vector loop cost summary ---
    remark #15447: --- begin vector loop memory reference summary ---
    remark #15450: unmasked unaligned unit stride loads: 1
    remark #15451: unmasked unaligned unit stride stores: 1
    remark #15474: --- end vector loop memory reference summary ---
LOOP END
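
The same no-aliasing guarantee can also be expressed in the source code itself rather than on the command line; one way (a sketch, not used in the article's build commands) is to qualify the pointer parameters with __restrict, which icpx supports:

#include <math.h>

// Sketch: __restrict promises the compiler that theta and sth never alias,
// so it can vectorize without multi-versioning or a run-time overlap check.
void func(float* __restrict theta, float* __restrict sth)
{
        int i;
        for (i = 0; i < 128; i++)
                sth[i] = sin(theta[i] + 3.1415927);
}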

SIMD

SIMD vectorization allows for parallel data processing by executing a single instruction across multiple data elements simultaneously. The pragma

#pragma omp simd aligned(list : alignment)

tells the compiler to vectorize the loop that follows; the aligned clause additionally asserts that the listed pointers are aligned on the specified byte boundary.

Vectorization can be automatically applied, as in the previous examples, or the user can explicitly request it using the OpenMP construct “#pragma omp simd”. In the latter case, the compiler will attempt to vectorize the loop regardless of whether vectorization appears profitable. If the compiler is unable to vectorize any loop marked with this pragma, it will produce a warning message. In such cases, the optimization report will contain remarks indicating why vectorization could not be performed.

An explicitly requested SIMD loop is identified as a SIMD LOOP, rather than just a LOOP, in the optimization report.

#include <math.h>
void func (float* theta, float* sth) 
{
        int i;
#pragma omp simd aligned( sth, theta:32)
        for (i=0; i < 128; i++)
                sth[i] = sinf(theta[i]+3.1415927f);
}

Output:

$ icpx -c -qopt-report=3 -qopt-report-file=stderr -fiopenmp simd.cpp

Global optimization report for : _Z4funcPfS_

LOOP BEGIN at simd.cpp (5, 1)
    remark #15301: SIMD LOOP WAS VECTORIZED
    remark #15305: vectorization support: vector length 4
    remark #15475: --- begin vector loop cost summary ---
    remark #15476: scalar cost: 32.000000
    remark #15477: vector cost: 12.062500
    remark #15478: estimated potential speedup: 2.640625
    remark #15309: vectorization support: normalized vectorization overhead 0.000000
    remark #15570: using scalar loop trip count: 128
    remark #15482: vectorized math library calls: 1
    remark #15488: --- end vector loop cost summary ---
    remark #15447: --- begin vector loop memory reference summary ---
    remark #15450: unmasked unaligned unit stride loads: 1
    remark #15451: unmasked unaligned unit stride stores: 1
    remark #15474: --- end vector loop memory reference summary ---
LOOP END   

Even More Detailed Reporting to Come

The Intel compiler team continues to make improvements to the optimization reports. Upcoming releases will support additional filtering options, optimization remarks, options for viewing inlining information, remarks for device offloading, and much more. Stay tuned!

Download the Compiler Now 

You can download the Intel oneAPI DPC++/C++ Compiler and the Intel Fortran Compiler from Intel's oneAPI Developer Tools product page.

They are also included in the Intel® oneAPI Base Toolkit and the Intel® HPC Toolkit, respectively, which bundle an advanced set of foundational tools, libraries, and analysis, debug, and code migration tools.

You may also want to check out our contributions to the LLVM compiler project on GitHub.

Additional Resources