Valuable Insights on Optimizations Applied to Your Code
In today's rapidly advancing field of software development, optimizing code performance is more important than ever, particularly given the continuous evolution of computing architectures. Intel's LLVM-based compilers, including the Intel® oneAPI DPC++/C++ Compiler and Intel® Fortran Compiler, are at the forefront of this optimization journey, providing developers with robust tools to enhance the efficiency and speed of their applications across various computing devices such as CPUs and other specialized processors.
Because the optimization reports are built into the compiler, code optimization integrates seamlessly into the design and code-generation stages of the development cycle. The report includes information on loop transformations and vectorization. In future articles, we'll discuss other opt-report topics such as inlining, Profile-Guided Optimization (PGO), and more.
This article focuses on how to generate detailed optimization reports with these compilers and how to use the information they provide to identify opportunities for code improvement. These reports give developers valuable insight for fine-tuning their code for peak performance on modern hardware architectures.
Enabling and Controlling the Report
Below is the command line syntax for activating and managing optimization reports with Intel compilers on Windows and Linux platforms. Typically, Linux options begin with '-q', while Windows options start with '/Q'. These options apply equally to C++ and Fortran compilers.
| Linux | Windows | Functionality |
|---|---|---|
| -qopt-report[=N] | /Qopt-report[:N] | Enables the report; N=1-3 specifies an increasing level of detail. The default is N=2 if no argument is passed. |
| -qopt-report-file=stdout \| stderr \| filename | /Qopt-report-file:stdout \| stderr \| filename | Specifies whether the output for the generated optimization report goes to a file, stderr, or stdout. |
| -qopt-report-stdout | /Qopt-report-stdout | Specifies that the generated report should go to stdout. |
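For example, the following command lines (the source file name is just a placeholder) request the most detailed report and send it to stderr; the Fortran drivers accept the same options:
Linux:
icpx -c -qopt-report=3 -qopt-report-file=stderr example.cpp
Windows:
icx /c /Qopt-report:3 /Qopt-report-file:stderr example.cpp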
Layout of Loop-Related Reports
The optimization report presents a structured hierarchy of messages related to nested loops, maintaining a clear format. Each loop within the compiler-generated code is identified with a "LOOP BEGIN" message, along with the corresponding line and column numbers from the source code. The nesting of loops is clearly illustrated through indentation. It's worth noting that a single source loop might produce multiple compiler-generated loops, and the nesting structure may differ from the original code. In certain instances, a loop could be divided into several sub-loops, a technique known as "distribution."
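As a quick illustration of distribution, consider the hypothetical function below (it is not part of the article's example): a single source loop that both zeroes one array and updates another may be split by the compiler into two independent loops, each of which can then be optimized on its own.
// Hypothetical sketch of loop distribution, not compiler-generated code.
// Original source loop:
//   for (int i = 0; i < n; i++) { x[i] = 0.0f; y[i] = 2.0f * y[i]; }
// After distribution, conceptually:
void distribute_sketch(float* x, float* y, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = 0.0f;             // this loop may be turned into a memset
    for (int i = 0; i < n; i++)
        y[i] = 2.0f * y[i];      // this loop can be vectorized separately
}
We will see distribution, together with a "memset generated" remark, in the report for the matrix example later in this article.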
double a[1000][1000], b[1000][1000], c[1000][1000];
void foo()
{
 int i, j, k;
 for( i=0; i<1000; i++)
 {
  for( j=0; j<1000; j++)
  {
   c[j][i] = 0.0;
   for( k=0; k<1000; k++)
   {
    c[j][i] = c[j][i] + a[k][i] * b[j][k];
   }
  }
 }
}
Output:
$ icpx -c -qopt-report=3 -qopt-report-file=stderr loop.cpp
Global optimization report for : _Z3foov
LOOP BEGIN at loop.cpp (5, 2)
remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
LOOP BEGIN at loop.cpp (7, 3)
remark #25529: Dead stores eliminated in loop
remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
LOOP BEGIN at loop.cpp (10, 4)
remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
remark #25438: Loop unrolled without remainder by 8
LOOP END
LOOP END
LOOP END
Outer Loop at loop.cpp (5, 2): The outermost loop was not vectorized because the compiler does not treat outer loops as auto-vectorization candidates.
Nested Loop at loop.cpp (7, 3): Within this loop, the compiler eliminated dead stores, which are write operations to memory that do not affect the program's outcome. Removing them reduces unnecessary memory traffic and improves execution speed.
Innermost Loop at loop.cpp (10, 4): This loop was recognized as vectorizable, but the compiler judged vectorization to be inefficient under its default cost model; the remark suggests a vector always directive or -vec-threshold0 to override that decision.
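As a minimal illustration of a dead store, consider the hypothetical fragment below (not taken from the example above): the first assignment is never read before it is overwritten, so the compiler can remove it.
// Hypothetical dead-store example: the first write to x is never read.
int dead_store_example(int n)
{
    int x = n * 2;   // dead store: overwritten before it is ever used
    x = n + 1;       // only this value reaches the return statement
    return x;
}
In the matrix loop above, the repeated writes to c[j][i] inside the k loop can likewise be kept in a register, so only the final sum needs to be stored back to memory.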
Using the Loop and Vectorization Reports
The compiler option -xcore-avx512 directs the compiler to generate code specifically for processors that support AVX-512, an extension of the Advanced Vector Extensions (AVX) instruction set architecture that offers wider vectors and additional vectorization capabilities. With this option, the innermost loop from the example above is vectorized using AVX-512 instructions.
Output:
$ icpx -c -qopt-report=3 -qopt-report-file=stderr -xcore-avx512 loop.cpp
Global optimization report for : _Z3foov
LOOP BEGIN at loop.cpp (7, 3)
<Distributed chunk1>
remark #25426: Loop distributed (2 way) for perfect loop nest formation
remark #25567: 2 loops have been collapsed
remark #25408: memset generated
remark #25260: Dead loop optimized away
LOOP END
LOOP BEGIN at loop.cpp (7, 3)
remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
LOOP BEGIN at loop.cpp (10, 4)
<Distributed chunk2>
remark #25444: Loopnest interchanged: ( 1 2 3 ) --> ( 2 3 1 )
remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
LOOP BEGIN at loop.cpp (5, 2)
remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
LOOP BEGIN at loop.cpp (7, 3)
<Distributed chunk2>
remark #25566: blocked by 64
remark #25540: Loop unrolled and jammed by 4
remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
LOOP BEGIN at loop.cpp (10, 4)
remark #25566: blocked by 64
remark #25540: Loop unrolled and jammed by 4
remark #15553: loop was not vectorized: outer loop is not an auto-vectorization candidate.
LOOP BEGIN at loop.cpp (5, 2)
remark #25566: blocked by 64
remark #25563: Load hoisted out of the loop
remark #25583: Number of Array Refs Scalar Replaced In Loop: 36
remark #15300: LOOP WAS VECTORIZED
remark #15305: vectorization support: vector length 4
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar cost: 47.000000
remark #15477: vector cost: 12.250000
remark #15478: estimated potential speedup: 3.812500
remark #15309: vectorization support: normalized vectorization overhead 0.156250
remark #15488: --- end vector loop cost summary ---
remark #15447: --- begin vector loop memory reference summary ---
remark #15450: unmasked unaligned unit stride loads: 8
remark #15451: unmasked unaligned unit stride stores: 4
remark #15474: --- end vector loop memory reference summary ---
LOOP END
LOOP END
LOOP END
LOOP END
LOOP END
LOOP END
Breaking loops into chunks improves cache utilization, leveraging the hierarchical cache structure of modern processors (L1, L2, L3), which operate much faster than main memory. Dividing a loop into smaller segments allows each segment's data to potentially reside within a cache, thereby decreasing the frequency of slow memory accesses.
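The hand-written sketch below illustrates the idea behind the "blocked by 64" remarks for the matrix example. It is only a conceptual rendering of cache blocking (tiling), not the code the compiler actually generates, and it assumes the c array has already been zeroed (the memset chunk in the report).
// Conceptual cache blocking (tiling) of the matrix loops with a 64-element
// tile, mirroring the "blocked by 64" remarks; illustrative only.
void foo_blocked_sketch()
{
    const int B = 64;                                  // assumed tile size
    for (int jj = 0; jj < 1000; jj += B)
        for (int kk = 0; kk < 1000; kk += B)
            for (int j = jj; j < jj + B && j < 1000; j++)
                for (int k = kk; k < kk + B && k < 1000; k++)
                    for (int i = 0; i < 1000; i++)     // unit stride, vectorizable
                        c[j][i] += a[k][i] * b[j][k];  // c assumed zeroed beforehand
}
Each tile of b and the corresponding rows of c then stay resident in cache while they are reused, which is what reduces the slow main-memory accesses described above.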
Multi-Version
For the function below, the compiler generates two loop versions from a single loop in the source code; this is known as "multi-versioning". The optimization report tells us the reason is data dependence: the compiler cannot know at compile time whether the two pointer arguments are aliased, i.e., whether the data they point to overlaps in a way that would make vectorization unsafe. It therefore creates two versions of the loop, one vectorized and one not, and inserts a run-time test for data overlap so that the vectorized loop executes when it is safe to do so; otherwise, the non-vectorized loop version executes.
#include <math.h>
void func (float* theta, float* sth)
{
 int i;
 for (i=0; i < 128; i++)
  sth[i] = sin(theta[i]+3.1415927);
}
Output:
$ icpx -c -qopt-report=3 -qopt-report-file=stderr multi.cpp
Global optimization report for : _Z4funcPfS_
LOOP BEGIN at multi.cpp (5, 2)
<Multiversioned v2>
remark #15319: Loop was not vectorized: novector directive used
LOOP END
LOOP BEGIN at multi.cpp (5, 2)
<Multiversioned v1>
remark #25228: Loop multiversioned for Data Dependence
remark #15300: LOOP WAS VECTORIZED
remark #15305: vectorization support: vector length 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar cost: 49.000000
remark #15477: vector cost: 28.046875
remark #15478: estimated potential speedup: 1.734375
remark #15309: vectorization support: normalized vectorization overhead 0.000000
remark #15570: using scalar loop trip count: 128
remark #15482: vectorized math library calls: 1
remark #15488: --- end vector loop cost summary ---
remark #15447: --- begin vector loop memory reference summary ---
remark #15450: unmasked unaligned unit stride loads: 1
remark #15451: unmasked unaligned unit stride stores: 1
remark #15474: --- end vector loop memory reference summary ---
LOOP END
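Conceptually, the multiversioned code behaves like the hand-written sketch below. The overlap test shown is only an illustration; the actual run-time check the compiler inserts is an internal detail and may differ.
#include <math.h>
// Conceptual sketch of multi-versioning for func(); NOT the compiler's
// actual generated code.
void func_sketch(float* theta, float* sth)
{
    // Illustrative run-time overlap test: if the 128 elements written
    // through sth cannot overlap the 128 elements read through theta,
    // the vectorized version is safe to run.
    if (sth + 128 <= theta || theta + 128 <= sth) {
        for (int i = 0; i < 128; i++)          // version 1: vectorized loop
            sth[i] = sin(theta[i] + 3.1415927);
    } else {
        for (int i = 0; i < 128; i++)          // version 2: original scalar loop
            sth[i] = sin(theta[i] + 3.1415927);
    }
}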
If the two pointer arguments are never aliased, we can tell the compiler so with the -fargument-noalias flag:
Output:
$ icpx -c -qopt-report=3 -qopt-report-file=stderr -fargument-noalias multi.cpp
Global optimization report for : _Z4funcPfS_
LOOP BEGIN at multi.cpp (5, 2)
remark #15300: LOOP WAS VECTORIZED
remark #15305: vectorization support: vector length 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar cost: 49.000000
remark #15477: vector cost: 28.046875
remark #15478: estimated potential speedup: 1.734375
remark #15309: vectorization support: normalized vectorization overhead 0.000000
remark #15570: using scalar loop trip count: 128
remark #15482: vectorized math library calls: 1
remark #15488: --- end vector loop cost summary ---
remark #15447: --- begin vector loop memory reference summary ---
remark #15450: unmasked unaligned unit stride loads: 1
remark #15451: unmasked unaligned unit stride stores: 1
remark #15474: --- end vector loop memory reference summary ---
LOOP END
SIMD
SIMD vectorization enables parallel data processing by executing a single instruction across multiple data elements simultaneously. The pragma
#pragma omp simd aligned(...)
tells the compiler to vectorize the loop that follows using SIMD instructions; the optional aligned clause additionally asserts that the listed pointers are aligned to the specified byte boundary.
Vectorization can be automatically applied, as in the previous examples, or the user can explicitly request it using the OpenMP construct “#pragma omp simd”. In the latter case, the compiler will attempt to vectorize the loop regardless of whether vectorization appears profitable. If the compiler is unable to vectorize any loop marked with this pragma, it will produce a warning message. In such cases, the optimization report will contain remarks indicating why vectorization could not be performed.
An explicitly requested SIMD loop is identified as a SIMD LOOP, rather than just a LOOP, in the optimization report.
#include <math.h>
void func (float* theta, float* sth)
{
 int i;
#pragma omp simd aligned( sth, theta:32)
 for (i=0; i < 128; i++)
  sth[i] = sinf(theta[i]+3.1415927f);
}
Output:
$ icpx -c -qopt-report=3 -qopt-report-file=stderr -fiopenmp simd.cpp
Global optimization report for : _Z4funcPfS_
LOOP BEGIN at simd.cpp (5, 1)
remark #15301: SIMD LOOP WAS VECTORIZED
remark #15305: vectorization support: vector length 4
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar cost: 32.000000
remark #15477: vector cost: 12.062500
remark #15478: estimated potential speedup: 2.640625
remark #15309: vectorization support: normalized vectorization overhead 0.000000
remark #15570: using scalar loop trip count: 128
remark #15482: vectorized math library calls: 1
remark #15488: --- end vector loop cost summary ---
remark #15447: --- begin vector loop memory reference summary ---
remark #15450: unmasked unaligned unit stride loads: 1
remark #15451: unmasked unaligned unit stride stores: 1
remark #15474: --- end vector loop memory reference summary ---
LOOP END
Even More Detailed Reporting to Come
The Intel compiler team continues to make improvements to the optimization reports. Upcoming releases will support additional filtering options, optimization remarks, options for viewing inlining information, remarks for device offloading, and much more. Stay tuned!
Download the Compiler Now
You can download the Intel oneAPI DPC++/C++ Compiler and Intel Fortran Compiler on Intel’s oneAPI Developer Tools product page.
These compilers are also included in the Intel® oneAPI Base Toolkit and the Intel® oneAPI HPC Toolkit, respectively, each of which bundles an advanced set of foundational tools, libraries, and analysis, debug, and code migration tools.
You may also want to check out our contributions to the LLVM compiler project on GitHub.
Additional Resources
- Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference
- Vectorization and Optimization Reports
- Make the Most of Intel® Compiler Optimization Reports
- Porting Guide for Intel® C++ Compiler Classic Users to the Intel® oneAPI DPC++/C++ Compiler