Intel® Advisor User Guide

ID 766448
Date 6/24/2024
Public

Advanced Modeling Options

When you select a Target System of Intel® Xeon Phi™ or Offload to Intel Xeon Phi coprocessor, additional modeling parameters appear below the Runtime Modeling area, under Intel Xeon Phi Advanced Modeling:

  • Select Consider Code Vectorization if you are willing to modify your parallel code later to improve vector parallel execution. When this option is selected, you can specify:

    • Reference CPU Vectorization Speedup you expect to achieve. This value is the speedup multiplier for the current site when vectorization techniques are used on the reference CPU. Base your estimate on the target device characteristics and on your knowledge of how much of this code can be vectorized and how well.

    • Intel Xeon Phi Vectorization Speedup you expect to achieve. This value is the speedup multiplier for the current site when vectorization techniques are used on an Intel® Xeon Phi™ processor. Base your estimate on the target device characteristics and on your knowledge of how much of this code can be vectorized and how well.

  • When the Target System is Offload to Intel Xeon Phi, you can set Offload Transfer Data Size to specify the amount of data you expect to transfer (in KB).

  • Click Apply after modifying any of these values.

In some cases, you can restructure your code to enable more efficient vector operations. Loop vectorization lets the hardware apply one operation to several data elements at once, in small fixed-size units (usually 64 bytes), for example when operating on data arrays.

One way to enable more efficient vector operations is strip-mining: split a single loop into a new outer loop and an inner loop that together cover the same iteration space. The innermost loop then works on small chunks that map well onto vector operations.
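As a rough illustration of strip-mining, the C sketch below shows a single loop and a strip-mined equivalent. The function names, the scale-and-add operation, and the strip length are hypothetical choices made for this example; they do not come from Intel Advisor.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: one loop over all n elements. */
void scale_add(float *y, const float *x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Strip-mined form: the outer loop steps through the array one strip
 * at a time, and the inner loop processes the elements of that strip.
 * Together the two loops visit exactly the indices 0..n-1 once.
 * 'strip' is an illustrative parameter (for example, 16 floats = 64 bytes). */
void scale_add_strip_mined(float *y, const float *x, float a,
                           size_t n, size_t strip)
{
    assert(strip > 0);
    for (size_t is = 0; is < n; is += strip) {
        /* Clamp the last strip when n is not a multiple of strip. */
        size_t end = (n - is < strip) ? n : is + strip;
        for (size_t i = is; i < end; i++)
            y[i] = a * x[i] + y[i];
    }
}
```

Both functions compute the same result. Whether the strip-mined form actually vectorizes better depends on the compiler and target, so check the compiler vectorization report before and after the change.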

Other ways to enable more efficient vector operations include examining outermost loops where threading parallelism might already be used, and considering whether to vectorize their innermost loops and/or callee functions.

Certain innermost loops may benefit from OpenMP 4 constructs. That is, under certain conditions you can use both an omp parallel for threading pragma and an omp simd (or similar) vectorization pragma (see the compiler vectorization report and the descriptions at http://openmp.org).
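The sketch below shows one way the two pragmas might be combined: omp parallel for on an outer loop whose iterations are independent, and omp simd on the inner loop. The function name and the row-major matrix layout are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Adds each element of b to the matching element of a.
 * Both arrays are assumed to be row-major with rows x cols elements
 * and must not overlap. */
void add_rows(float *a, const float *b, int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);

    /* Outer loop: rows are independent, so threads can share them. */
    #pragma omp parallel for
    for (int r = 0; r < rows; r++) {
        float *ar = a + (size_t)r * (size_t)cols;
        const float *br = b + (size_t)r * (size_t)cols;

        /* Inner loop: ask the compiler to vectorize it. */
        #pragma omp simd
        for (int c = 0; c < cols; c++)
            ar[c] += br[c];
    }
}
```

The pragmas take effect only when OpenMP support is enabled at compile time (for example, -fopenmp with GCC or Clang, -qopenmp with Intel compilers). Without it, the code still compiles and runs serially with the same result.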

The processor microarchitecture determines which vector instructions are supported, and thus the size of data the hardware can process efficiently.
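To make the link between vector width and data size concrete, the following minimal sketch computes how many elements fit in one vector register. The helper name is hypothetical. The widths used in the usage note (16, 32, and 64 bytes) correspond to 128-bit, 256-bit, and 512-bit vector registers.

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many elements of elem_bytes each fit in one vector
 * register of vector_bytes. Hypothetical helper for illustration. */
size_t lanes_per_vector(size_t vector_bytes, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    return vector_bytes / elem_bytes;
}
```

For example, lanes_per_vector(64, 4) gives 16 single-precision values per 64-byte vector, while lanes_per_vector(32, 8) gives 4 double-precision values per 32-byte vector.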