{"payload":{"allShortcutsEnabled":false,"path":"DirectProgramming/Fortran/DenseLinearAlgebra/vectorize-vecmatmult","repo":{"id":267223745,"defaultBranch":"master","name":"oneAPI-samples","ownerLogin":"oneapi-src","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2020-05-27T04:52:05.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/60144784?v=4","public":true,"private":false,"isOrgOwned":true},"currentUser":null,"refInfo":{"name":"master","listCacheKey":"v0:1709893225.0","canEdit":false,"refType":"branch","currentOid":"d9ea61ab122ab030fbd74362bf9cd68766556117"},"tree":{"items":[{"name":"src","path":"DirectProgramming/Fortran/DenseLinearAlgebra/vectorize-vecmatmult/src","contentType":"directory"},{"name":"License.txt","path":"DirectProgramming/Fortran/DenseLinearAlgebra/vectorize-vecmatmult/License.txt","contentType":"file"},{"name":"Makefile","path":"DirectProgramming/Fortran/DenseLinearAlgebra/vectorize-vecmatmult/Makefile","contentType":"file"},{"name":"README.md","path":"DirectProgramming/Fortran/DenseLinearAlgebra/vectorize-vecmatmult/README.md","contentType":"file"},{"name":"sample.json","path":"DirectProgramming/Fortran/DenseLinearAlgebra/vectorize-vecmatmult/sample.json","contentType":"file"}],"templateDirectorySuggestionUrl":null,"readme":{"displayName":"README.md","richText":"

Vectorize VecMatMult Sample


The Vectorize VecMatMult sample demonstrates how to use the auto-vectorizer to improve the performance of the sample application. You will compare the performance of a serial version and a version compiled with the auto-vectorizer.

| Area | Description |
|:--- |:--- |
| What you will learn | How to use automatic vectorization with the Intel® Fortran Compiler |
| Time to complete | 15 minutes |

Purpose


The Intel® Fortran Compiler has an auto-vectorizer that detects operations in the application that can be done in parallel and converts sequential operations to parallel operations by using the Single Instruction Multiple Data (SIMD) instruction set.


For the Intel® Fortran Compiler, vectorization is the unrolling of a loop combined with the generation of packed SIMD instructions. Because the packed instructions operate on more than one data element simultaneously, the loop can execute more efficiently. It is sometimes referred to as auto-vectorization to emphasize that the compiler automatically identifies and optimizes suitable loops on its own.
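As a minimal sketch of the kind of loop the auto-vectorizer targets, consider the program below. It is not part of the sample; the program name, array names, and sizes are assumptions chosen for illustration. The second loop has unit stride and independent iterations, so it is a typical candidate for packed SIMD instructions.

```fortran
program vector_loop_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000          ! hypothetical problem size
  real(dp) :: x(n), y(n)
  real(dp) :: alpha
  integer :: i

  alpha = 2.0_dp
  do i = 1, n
     x(i) = real(i, dp)
     y(i) = 1.0_dp
  end do

  ! Unit-stride loop whose iterations do not depend on each other:
  ! a typical candidate for auto-vectorization.
  do i = 1, n
     y(i) = y(i) + alpha * x(i)
  end do

  ! y(i) = 1 + 2*i, so sum(y) = n + n*(n+1) = 1002000 for n = 1000.
  if (abs(sum(y) - 1002000.0_dp) > 1.0e-9_dp) error stop 'unexpected sum'
  print *, 'sum(y) =', sum(y)
end program vector_loop_sketch
```

Whether a given compiler actually vectorizes this loop depends on the optimization level and target options, which is what the vectorization report in this sample is meant to show.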


Vectorization may call library routines that can result in additional performance gain on Intel microprocessors when compared to non-Intel microprocessors. The vectorization can also be affected by specific options, such as -m or -x.


Vectorization is enabled with the compiler at optimization levels of -O2 (the default level) and higher for both Intel® microprocessors and non-Intel® microprocessors. Many loops are vectorized automatically. In cases where this does not happen, you may be able to vectorize loops by making simple code modifications.

This sample leads you through the following steps.

  1. Establish a performance baseline.
  2. Generate a vectorization report.
  3. Improve performance by aligning data.
  4. Improve performance with interprocedural optimization.

Intel® Advisor can assist with vectorization and show optimization report messages with your source code. See Intel® Advisor for more information.


Prerequisites

| Optimized for | Description |
|:--- |:--- |
| OS | macOS*, Xcode* |
| Hardware | Mac* with an Intel® processor |
| Software | Intel® Fortran Compiler |

Note: The Intel® Fortran Compiler is part of the Intel® oneAPI HPC Toolkit (HPC Kit).


Key Implementation Details


You will use the following Fortran source files in the sample.

| File | Description |
|:--- |:--- |
| matvec.f90 | Fortran source file with a matrix-times-vector algorithm. |
| driver.f90 | Fortran source file with the main program calling matvec. |
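To make the division of work between the two files concrete, here is a minimal sketch of a matrix-times-vector routine and a small driver. It is not the sample's actual source: the module name, the routine name matvec_demo, and the sizes are hypothetical. The inner loop runs down a column, which is the contiguous direction for Fortran arrays.

```fortran
module matvec_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical routine: computes c = a * b for an m-by-n matrix a.
  subroutine matvec_demo(m, n, a, b, c)
    integer, intent(in)   :: m, n
    real(dp), intent(in)  :: a(m, n), b(n)
    real(dp), intent(out) :: c(m)
    integer :: i, j

    c = 0.0_dp
    do j = 1, n
       ! Inner loop walks down column j with unit stride.
       do i = 1, m
          c(i) = c(i) + a(i, j) * b(j)
       end do
    end do
  end subroutine matvec_demo
end module matvec_sketch

program driver_sketch
  use matvec_sketch
  implicit none
  integer, parameter :: m = 4, n = 3      ! hypothetical sizes
  real(dp) :: a(m, n), b(n), c(m)

  a = 1.0_dp
  b = [1.0_dp, 2.0_dp, 3.0_dp]
  call matvec_demo(m, n, a, b, c)

  ! Every row of a is all ones, so each c(i) = 1 + 2 + 3 = 6.
  if (any(abs(c - 6.0_dp) > 1.0e-12_dp)) error stop 'unexpected result'
  print *, c
end program driver_sketch
```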

Read the Intel® Fortran Compiler Developer Guide and Reference for more information on the features and options mentioned in this sample.

Set Environment Variables


When working with the command-line interface (CLI), you should configure the oneAPI toolkits using environment variables. Set up your CLI environment by sourcing the setvars script every time you open a new terminal window. This practice ensures that your compiler, libraries, and tools are ready for development.


Build the Fortran Vectorization Sample

Note: If you have not already done so, set up your CLI environment by sourcing the setvars script in the root of your oneAPI installation.

For more information on configuring environment variables, see Use the setvars Script with Linux* or macOS*.


Use Visual Studio Code* (VS Code) (Optional)


You can use Visual Studio Code* (VS Code) extensions to set your environment, create launch configurations, and browse and download samples.

The basic steps to build and run a sample using VS Code include:

  1. Configure the oneAPI environment with the extension Environment Configurator for Intel® oneAPI Toolkits.
  2. Download a sample using the extension Code Sample Browser for Intel® oneAPI Toolkits.
  3. Open a terminal in VS Code (Terminal > New Terminal).
  4. Run the sample in the VS Code terminal using the instructions below.

To learn more about the extensions and how to configure the oneAPI environment, see the Using Visual Studio Code with Intel® oneAPI Toolkits User Guide.

On macOS*


Step 1. Establish a Performance Baseline


Create a performance baseline for the improvements that follow in this sample by compiling your sources from the src directory.

  1. Compile the sources with the following command.

    ifort -real-size 64 -O1 src/matvec.f90 src/driver.f90 -o MatVector

  2. Run MatVector.

    ./MatVector

  3. Record the execution time reported in the output. This is the baseline against which subsequent improvements will be measured.
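The sample reports its own execution time. If you want to take a comparable measurement in your own code, one possible approach is the standard system_clock intrinsic, sketched below. This is an illustration only and not how driver.f90 necessarily measures time; the program name, the summation kernel, and the array size are assumptions.

```fortran
program timing_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 2000000       ! hypothetical workload size
  integer(i8) :: t_start, t_end, rate
  real(dp), allocatable :: x(:)
  real(dp) :: total, elapsed
  integer :: i

  allocate(x(n))
  x = 1.0_dp

  ! Read the clock (and its tick rate) before the kernel...
  call system_clock(count=t_start, count_rate=rate)

  total = 0.0_dp
  do i = 1, n
     total = total + x(i)
  end do

  ! ...and again after it.
  call system_clock(count=t_end)

  ! Convert clock ticks to seconds.
  elapsed = real(t_end - t_start, dp) / real(rate, dp)
  print *, 'total =', total, ' elapsed (s) =', elapsed
end program timing_sketch
```

Printing the computed total keeps the compiler from discarding the timed loop as dead code.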

Step 2. Generate a Vectorization Report


A vectorization report shows which loops in your code were vectorized and explains why other loops were not vectorized. To generate a vectorization report, use the -qopt-report-phase=vec compiler option together with -qopt-report=1 or -qopt-report=2.

Together with -qopt-report-phase=vec, -qopt-report=1 generates a report that lists the loops that were vectorized, while -qopt-report=2 generates a report that lists both the loops that were vectorized and the reasons that other loops were not vectorized.

Because vectorization is turned off with the -O1 option, the compiler does not generate a vectorization report. Generate a vectorization report by compiling the project with the -O2, -qopt-report-phase=vec, and -qopt-report=1 options.

  1. Compile the sources with the following command.

    ifort -real-size 64 -O2 -qopt-report=1 -qopt-report-phase=vec src/matvec.f90 src/driver.f90 -o MatVector

  2. Run MatVector again.

    ./MatVector

  3. Record the new execution time.

    The reduction in time is mostly due to auto-vectorization of the inner loop at line 32 noted in the vectorization report matvec.optrpt.

    Note: Your line and column numbers may be different.

     Begin optimization report for matvec_

       Report from: Vector optimizations [vec]


     LOOP BEGIN at matvec.f90(26,3)
       remark #25460: No loop optimizations reported

       LOOP BEGIN at matvec.f90(26,3)
         remark #15300: LOOP WAS VECTORIZED
       LOOP END

       LOOP BEGIN at matvec.f90(26,3)
       <Remainder loop for vectorization>
       LOOP END
     LOOP END

     LOOP BEGIN at matvec.f90(27,3)
       remark #25460: No loop optimizations reported

       LOOP BEGIN at matvec.f90(32,6)
       <Peeled loop for vectorization>
       LOOP END

       LOOP BEGIN at matvec.f90(32,6)
         remark #15300: LOOP WAS VECTORIZED
       LOOP END

       LOOP BEGIN at matvec.f90(32,6)
       <Alternate Alignment Vectorized Loop>
       LOOP END

       LOOP BEGIN at matvec.f90(32,6)
       <Remainder loop for vectorization>
       LOOP END
     LOOP END

    The combination of -qopt-report=2 with -qopt-report-phase=vec,loop returns a list that includes loops that were not vectorized or multi-versioned, along with the reason that the compiler did not vectorize or multi-version them.

  4. Recompile your project with the -qopt-report=2 and -qopt-report-phase=vec options. To include loop-transformation remarks as well, add loop to the phase list, as in -qopt-report-phase=vec,loop.

    ifort -real-size 64 -O2 -qopt-report-phase=vec -qopt-report=2 src/matvec.f90 src/driver.f90 -o MatVector

    The vectorization report matvec.optrpt indicates that the loop at line 33 in matvec.f90 did not vectorize because it is not the loop nest's innermost loop.

    Note: Your line and column numbers may be different.

     LOOP BEGIN at matvec.f90(27,3)
       remark #15542: loop was not vectorized: inner loop was already vectorized

       LOOP BEGIN at matvec.f90(32,6)
        <Peeled loop for vectorization>
       LOOP END

       LOOP BEGIN at matvec.f90(32,6)
         remark #15300: LOOP WAS VECTORIZED
       LOOP END

       LOOP BEGIN at matvec.f90(32,6)
        <Alternate Alignment Vectorized Loop>
       LOOP END

       LOOP BEGIN at matvec.f90(32,6)
        <Remainder loop for vectorization>
          remark #15335: remainder loop was not vectorized: vectorization possible but seemed inefficient. Use vector always directive or -vec-threshold0 to override
       LOOP END
     LOOP END

    For more information on the -qopt-report and -qopt-report-phase compiler options, read the Compiler Options section of the Intel® Fortran Compiler Developer Guide and Reference.

Step 3. Improve Performance by Aligning Data


The vectorizer can generate faster code when operating on aligned data. In this activity, you will improve the vectorizer performance by aligning the arrays a, b, and c in driver.f90 on a 16-byte boundary so the vectorizer can use aligned load instructions for all arrays rather than the slower unaligned load instructions and can avoid runtime tests of alignment.


Using the ALIGNED macro inserts an alignment directive for a, b, and c in driver.f90 with the following syntax:

    !dir$ attributes align : 16 :: a,b,c

This directive instructs the compiler to create arrays that are aligned on a 16-byte boundary, which facilitates the use of SSE aligned load instructions.

The column height of the matrix needs to be padded out to be a multiple of 16 bytes, so that each column maintains the same 16-byte alignment. In practice, maintaining a constant alignment between columns is much more important than aligning the arrays' start.
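The padding arithmetic can be sketched as follows. This is an illustration under stated assumptions, not code from the sample: the program name, the variable names, and the unpadded column height of 47 are hypothetical. The idea is to round the column height up to the next multiple of the number of elements that fit in 16 bytes.

```fortran
program padding_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: align_bytes = 16
  integer :: elem_bytes, elems_per_block, rows, padded_rows

  ! storage_size returns bits; a double-precision element is 8 bytes.
  elem_bytes = storage_size(1.0_dp) / 8
  ! Number of elements in one 16-byte block (2 for double precision).
  elems_per_block = align_bytes / elem_bytes

  rows = 47                               ! hypothetical unpadded column height
  ! Round up to the next multiple of elems_per_block.
  padded_rows = ((rows + elems_per_block - 1) / elems_per_block) * elems_per_block

  ! With 8-byte elements, 47 rows pad out to 48.
  if (padded_rows /= 48) error stop 'unexpected padding'
  print *, 'rows =', rows, ' padded rows =', padded_rows
end program padding_sketch
```

Declaring the matrix with padded_rows as its first dimension keeps every column start at the same offset modulo 16 bytes, provided the first column is itself aligned.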


To derive the maximum benefit from this alignment, you also need to tell the vectorizer that it can safely assume that the arrays in matvec.f90 are aligned by using the following directive:

    !dir$ vector aligned

Note: If you use !dir$ vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if !dir$ vector aligned is not used. See the code under the ALIGNED macro in matvec.f90.
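The two directives can be combined as in the sketch below. It is a simplified stand-in for the sample code, with hypothetical program and array names, and it assumes the loop starts at the first element of arrays that were declared with the alignment attribute. Compilers that do not recognize these directives treat them as ordinary comments.

```fortran
program aligned_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1024          ! hypothetical array length
  real(dp) :: a(n), b(n)
  ! Request 16-byte alignment for both arrays.
  !dir$ attributes align : 16 :: a, b
  integer :: i

  b = 3.0_dp

  ! Promise the vectorizer that every array reference in the next loop
  ! is aligned. This is only safe because a and b are aligned above and
  ! the loop starts at element 1.
  !dir$ vector aligned
  do i = 1, n
     a(i) = 2.0_dp * b(i)
  end do

  if (any(abs(a - 6.0_dp) > 1.0e-12_dp)) error stop 'unexpected result'
  print *, 'a(1) =', a(1), ' a(n) =', a(n)
end program aligned_sketch
```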


If your compilation targets the Intel® AVX-512 instruction set, you should try to align data on a 64-byte boundary. This may result in improved performance. In this case, !dir$ vector aligned advises the compiler that the data is 64-byte aligned.

  1. Recompile the program after adding the ALIGNED macro to ensure consistently aligned data.

    ifort -real-size 64 -qopt-report=2 -qopt-report-phase=vec -D ALIGNED src/matvec.f90 src/driver.f90 -o MatVector

Step 4. Improve Performance with Interprocedural Optimization


The compiler may be able to perform additional optimizations if it can optimize across procedure and source file boundaries. These may include, but are not limited to, function inlining. Enable this optimization with the -ipo option.

  1. Recompile the program using the -ipo option to enable interprocedural optimization.

    ifort -real-size 64 -qopt-report=2 -qopt-report-phase=vec -D ALIGNED -ipo src/matvec.f90 src/driver.f90 -o MatVector

    The vectorization messages now appear at the point of inlining in driver.f90 (line 70), and they are found in the file ipo_out.optrpt.

    Note: Your line and column numbers may be different.

     LOOP BEGIN at driver.f90(73,16)
        remark #15541: loop was not vectorized: inner loop was already vectorized

        LOOP BEGIN at matvec.f90(32,3) inlined into driver.f90(70,14)
           remark #15398: loop was not vectorized: loop was transformed to memset or memcpy
        LOOP END

        LOOP BEGIN at matvec.f90(33,3) inlined into driver.f90(70,14)
           remark #15541: loop was not vectorized: inner loop was already vectorized

           LOOP BEGIN at matvec.f90(38,6) inlined into driver.f90(70,14)
              remark #15399: vectorization support: unroll factor set to 4
              remark #15300: LOOP WAS VECTORIZED
           LOOP END
        LOOP END
     LOOP END

  2. Run the program, and record the execution time.

Additional Exercises


The previous examples made use of double-precision arrays. You could build the same examples with single-precision arrays by changing the command-line option -real-size 64 to -real-size 32. The non-vectorized versions of the loop execute only slightly faster than the double-precision version; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 32-byte vector register operates on eight single-precision data elements at once instead of four double-precision data elements.
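The element counts quoted above follow from dividing the register width by the element size, as the sketch below works out. The program and variable names are hypothetical. The kinds are requested with selected_real_kind rather than from default real literals, so the result should not depend on the -real-size option used to build it.

```fortran
program simd_lanes_sketch
  implicit none
  ! Request kinds by precision so -real-size does not change them.
  integer, parameter :: sp = selected_real_kind(6)
  integer, parameter :: dp = selected_real_kind(15)
  integer, parameter :: register_bytes = 32   ! 32-byte vector register
  integer :: sp_lanes, dp_lanes

  ! storage_size returns bits, so divide by 8 to get bytes per element.
  sp_lanes = register_bytes / (storage_size(1.0_sp) / 8)
  dp_lanes = register_bytes / (storage_size(1.0_dp) / 8)

  ! 32 / 4 = 8 single-precision elements, 32 / 8 = 4 double-precision.
  if (sp_lanes /= 8 .or. dp_lanes /= 4) error stop 'unexpected lane count'
  print *, 'single-precision elements per register:', sp_lanes
  print *, 'double-precision elements per register:', dp_lanes
end program simd_lanes_sketch
```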

Note: In the example with data alignment, you will need to set ROWBUF=3 to ensure 16-byte alignment for each row of the matrix a. Otherwise, the directive !dir$ vector aligned will cause the program to fail.


License


Code samples are licensed under the MIT license. See License.txt for details.

Third party program Licenses can be found here: third-party-programs.txt.

","errorMessage":null,"headerInfo":{"toc":[{"level":1,"text":"Vectorize VecMatMult Sample","anchor":"vectorize-vecmatmult-sample","htmlText":"Vectorize VecMatMult Sample"},{"level":2,"text":"Purpose","anchor":"purpose","htmlText":"Purpose"},{"level":2,"text":"Prerequisites","anchor":"prerequisites","htmlText":"Prerequisites"},{"level":2,"text":"Key Implementation Details","anchor":"key-implementation-details","htmlText":"Key Implementation Details"},{"level":2,"text":"Set Environment Variables","anchor":"set-environment-variables","htmlText":"Set Environment Variables"},{"level":2,"text":"Build the Fortran Vectorization Sample","anchor":"build-the-fortran-vectorization-sample","htmlText":"Build the Fortran Vectorization Sample"},{"level":3,"text":"Use Visual Studio Code* (VS Code) (Optional)","anchor":"use-visual-studio-code-vs-code-optional","htmlText":"Use Visual Studio Code* (VS Code) (Optional)"},{"level":3,"text":"On macOS*","anchor":"on-macos","htmlText":"On macOS*"},{"level":4,"text":"Step 1. Establish a Performance Baseline","anchor":"step-1-establish-a-performance-baseline","htmlText":"Step 1. Establish a Performance Baseline"},{"level":4,"text":"Step 2. Generate a Vectorization Report","anchor":"step-2-generate-a-vectorization-report","htmlText":"Step 2. Generate a Vectorization Report"},{"level":4,"text":"Step 3. Improve Performance by Aligning Data","anchor":"step-3-improve-performance-by-aligning-data","htmlText":"Step 3. Improve Performance by Aligning Data"},{"level":4,"text":"Step 4. Improve Performance with Interprocedural Optimization","anchor":"step-4-improve-performance-with-interprocedural-optimization","htmlText":"Step 4. 
Improve Performance with Interprocedural Optimization"},{"level":3,"text":"Additional Exercises","anchor":"additional-exercises","htmlText":"Additional Exercises"},{"level":2,"text":"License","anchor":"license","htmlText":"License"}],"siteNavLoginPath":"/login?return_to=https%3A%2F%2Fgithub.com%2Foneapi-src%2FoneAPI-samples%2Ftree%2Fmaster%2FDirectProgramming%2FFortran%2FDenseLinearAlgebra%2Fvectorize-vecmatmult"}},"totalCount":5,"showBranchInfobar":false},"fileTree":{"DirectProgramming/Fortran/DenseLinearAlgebra":{"items":[{"name":"optimize-integral","path":"DirectProgramming/Fortran/DenseLinearAlgebra/optimize-integral","contentType":"directory"},{"name":"vectorize-vecmatmult","path":"DirectProgramming/Fortran/DenseLinearAlgebra/vectorize-vecmatmult","contentType":"directory"},{"name":"third-party-programs.txt","path":"DirectProgramming/Fortran/DenseLinearAlgebra/third-party-programs.txt","contentType":"file"}],"totalCount":3},"DirectProgramming/Fortran":{"items":[{"name":"CombinationalLogic","path":"DirectProgramming/Fortran/CombinationalLogic","contentType":"directory"},{"name":"DenseLinearAlgebra","path":"DirectProgramming/Fortran/DenseLinearAlgebra","contentType":"directory"},{"name":"EdgeDetection","path":"DirectProgramming/Fortran/EdgeDetection","contentType":"directory"},{"name":"Jupyter","path":"DirectProgramming/Fortran/Jupyter","contentType":"directory"},{"name":"guided_Coarray","path":"DirectProgramming/Fortran/guided_Coarray","contentType":"directory"},{"name":"guided_matrix_mul_OpenMP","path":"DirectProgramming/Fortran/guided_matrix_mul_OpenMP","contentType":"directory"},{"name":".gitkeep","path":"DirectProgramming/Fortran/.gitkeep","contentType":"file"}],"totalCount":7},"DirectProgramming":{"items":[{"name":"C++","path":"DirectProgramming/C++","contentType":"directory"},{"name":"C++SYCL","path":"DirectProgramming/C++SYCL","contentType":"directory"},{"name":"C++SYCL_FPGA","path":"DirectProgramming/C++SYCL_FPGA","contentType":"directory"},{"name":"Fort
ran","path":"DirectProgramming/Fortran","contentType":"directory"}],"totalCount":4},"":{"items":[{"name":".github","path":".github","contentType":"directory"},{"name":"AI-and-Analytics","path":"AI-and-Analytics","contentType":"directory"},{"name":"DirectProgramming","path":"DirectProgramming","contentType":"directory"},{"name":"Libraries","path":"Libraries","contentType":"directory"},{"name":"Publications","path":"Publications","contentType":"directory"},{"name":"RenderingToolkit","path":"RenderingToolkit","contentType":"directory"},{"name":"Templates","path":"Templates","contentType":"directory"},{"name":"Tools","path":"Tools","contentType":"directory"},{"name":"common","path":"common","contentType":"directory"},{"name":".gitignore","path":".gitignore","contentType":"file"},{"name":".gitmodules","path":".gitmodules","contentType":"file"},{"name":"CONTRIBUTING.md","path":"CONTRIBUTING.md","contentType":"file"},{"name":"License.txt","path":"License.txt","contentType":"file"},{"name":"README.md","path":"README.md","contentType":"file"},{"name":"SECURITY.md","path":"SECURITY.md","contentType":"file"},{"name":"release.json","path":"release.json","contentType":"file"},{"name":"third-party-programs.txt","path":"third-party-programs.txt","contentType":"file"}],"totalCount":17}},"fileTreeProcessingTime":8.898295000000001,"foldersToFetch":[],"treeExpanded":true,"symbolsExpanded":false,"csrf_tokens":{"/oneapi-src/oneAPI-samples/branches":{"post":"ipUJ5cqnyJietNhtaxOSlc2_ZDtwVlaVK_mv6Dd7K8jiiFqjGPeDWGSBeOqF6pxsateMozvkPxJj2dXlprfiWQ"},"/oneapi-src/oneAPI-samples/branches/fetch_and_merge/master":{"post":"4Lvn0HBwE8m31cXtijZLlMLp6poO_dygPegCZIqKDSW3fRFXuJZIH-7GVtHtxwuv3oOEyb-5_l9n79gEx3f3Qg"},"/oneapi-src/oneAPI-samples/branches/fetch_and_merge/master?discard_changes=true":{"post":"PJVP7lR9ysZ1w9Wzajo2RDZWhBN76yn_voazGdWJOw9rU7lpnJuRECzQRo8Ny3Z_KjzqQMqvCwDkgWl5mHTBaA"}}},"title":"oneAPI-samples/DirectProgramming/Fortran/DenseLinearAlgebra/vectorize-vecmatmult at master · 
oneapi-src/oneAPI-samples"}