Analyze Vector Instruction Set with Intel® VTune™ Profiler

Use Intel® VTune™ Profiler to understand why the computation time of the 2/64 combination is worse than that of the 32/4 combination, even though its elapsed time is much lower. A lower elapsed time is not achievable for the 32/4 combination because of the overhead of the larger MPI deployment, so it is better to focus on improving the computation time of the 2/64 combination.
To analyze the application performance with Intel VTune Profiler:
  • Set up the Analysis: Determine the process on which the analysis should be run.
  • Run the Collection: Use the vtune command line interface in an mpirun command to run the HPC Performance Characterization analysis type.
  • View and Analyze the Results: Open the result file in the Intel VTune Profiler GUI to identify specific issues with the application.
  • Rebuild the Application: Rebuild the application using an updated vector instruction set.
  • Check Application Performance: Run the application again with both configuration options to see how performance has improved.

Set Up Analysis

Rather than collecting performance data for the entire application, collect data only on the process with the lowest MPI time. That process is the slowest one: it spends the least time waiting in MPI because its computation time is high, and that computation time is the target for improvement.
  1. Use the -t option of the aps-report tool from Application Performance Snapshot to view the MPI Time per Rank data.
    $ aps-report stat_second -t
  2. Find the rank with the lowest MPI Time value. In this example, it is process number 7.
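The rank selection in step 2 can also be scripted. A minimal sketch, assuming a simplified two-column dump of the MPI Time per Rank data (rank number, then MPI time in seconds); the real aps-report output has more columns, so the field numbers would need adjusting:

```shell
# Hypothetical per-rank MPI time data: "<rank> <MPI time in seconds>".
# In practice this would be extracted from the aps-report output.
aps_output='0 12.4
3 11.9
7 3.2
12 10.8'

# The rank with the lowest MPI time spends the most time computing,
# so it is the one to profile. Sort numerically by column 2, take row 1.
target_rank=$(printf '%s\n' "$aps_output" | sort -g -k2 | awk 'NR==1 {print $1}')
echo "target rank: $target_rank"
```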

Run the Collection

  1. Set up the environment for the Intel VTune Profiler command line interface:
    $ source <install-dir>/vtune-vars.sh
    where <install-dir> is the installed location of Intel VTune Profiler.
  2. Launch the application using mpirun with the -gtool option and the appropriate rank number:
    $ export OMP_NUM_THREADS=64
    $ mpirun -n 16 -ppn 2 -f hosts.txt -gtool "vtune -collect hpc-performance -data-limit=0 -r result_init:7" ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -i -t 50
    Replace the rank number in the second command with the rank identified in the previous section. In this example command, the rank value is 7.
    The following options are included in the command:
    • -gtool: launches tools (such as Intel VTune Profiler) on specified ranks. Additional information about the option is available in the Intel® MPI Library Developer Reference for Linux* OS.
    • vtune: the Intel VTune Profiler command line interface, used here with the following options to run the analysis:
      • -collect: specifies the analysis type to run on the application. Additional information about the option is available in the Intel VTune Profiler help.
      • -data-limit: disables the size limit for result files when set to 0.
      • -r: specifies the name and location of the result file.
    The application launches and performance data collection begins. The data collection stops as soon as the application completes and the collected data is saved in a result file.
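Because the target rank changes from run to run, it can help to parameterize the -gtool string. A small sketch that assembles the mpirun command from the steps above (printed rather than executed, so it works without an MPI installation):

```shell
# Target rank identified with aps-report; 7 in this example.
RANK=7

# The -gtool value pairs a tool command with the rank(s) it should attach to,
# separated by a colon: "<tool and options>:<rank list>".
GTOOL="vtune -collect hpc-performance -data-limit=0 -r result_init:${RANK}"

# Dry run: print the full command instead of executing it.
CMD="mpirun -n 16 -ppn 2 -f hosts.txt -gtool \"${GTOOL}\" ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -i -t 50"
echo "$CMD"
```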

View and Analyze the Results

  1. After running the performance analysis, launch the Intel VTune Profiler GUI and open the result file using the following command:
    $ vtune-gui result_init.<host>/result_init.<host>
  2. Start the analysis with the Summary window. Hover over the question mark icons to read the pop-up help and better understand what each performance metric means.
  3. Notice that the SIMD Instructions per Cycle section indicates that the application could have better vectorization. The Vector Instruction Set column shows that the vector instruction set values are outdated (AVX, SSE). The same information can be seen in the Bottom-up window.
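A result directory is created per host, so before opening the GUI it can be worth confirming that the expected directory exists. A small sketch, using a hypothetical host name node01 in place of <host>:

```shell
# "node01" is a hypothetical host name; substitute the actual <host> value.
result_dir="result_init.node01"

if [ -d "$result_dir" ]; then
  echo "opening $result_dir"
  # ...the GUI would be launched on "$result_dir" here
else
  echo "result directory not found: $result_dir"
fi
```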

Rebuild Application with New Instruction Set

The application currently uses legacy instruction sets (SSE, AVX, SSE2). Update the instruction set to AVX-512 by adding the -xMIC-AVX512 option to the existing build script. Run the following command to rebuild the application with the new instruction set:
$ mpiicpc ../heart_demo.cpp ../luo_rudy_1991.cpp ../rcm.cpp ../mesh.cpp -g -o heart_demo -O3 -xMIC-AVX512 -std=c++11 -qopenmp -parallel-source-info=2
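To confirm that the rebuild actually emitted AVX-512 code, the disassembly can be checked for 512-bit zmm registers. This is a quick sanity check, not part of the original build flow:

```shell
# AVX-512 instructions operate on zmm registers, so their presence in the
# disassembly indicates AVX-512 code was generated. Prints a message either way.
has_avx512() {
  objdump -d "$1" 2>/dev/null | grep -q 'zmm'
}

if has_avx512 ./heart_demo; then
  echo "AVX-512 instructions found in ./heart_demo"
else
  echo "no AVX-512 instructions found in ./heart_demo"
fi
```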

Check Application Performance

Run the application without any analysis tool, timing both configurations, to review the improvement in computation time and elapsed time:
$ time mpirun <options for the 2/64 combination> ./heart_demo <arguments>
$ time mpirun <options for the 32/4 combination> ./heart_demo <arguments>
The following table shows the results as an example:
Combination (MPI/OpenMP) | Computation Time | Elapsed Time
These results show that the computation time for 2 processes per node and 64 OpenMP threads per process improved from over 19 seconds down to just over 16 seconds. It also shows a minor improvement in elapsed time. Check the parallelism of the updated code next.
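The improvement can be quantified from the approximate numbers above (19 seconds before, 16 seconds after; substitute the exact measured values):

```shell
# Relative reduction in computation time and the corresponding speedup,
# using the approximate before/after values from the text.
awk 'BEGIN {
  before = 19; after = 16
  printf "reduction: %.1f%%\n", (before - after) / before * 100
  printf "speedup: %.2fx\n", before / after
}'
```

That is roughly a 16% reduction in computation time, or about a 1.19x speedup.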

Key Take-Away

Using legacy vector instruction sets can lead to inefficient application performance. Be sure to use the latest vector instruction sets for your application.
