Tutorial: Analyzing OpenMP* and MPI Applications

ID 773235
Date 5/20/2020
Public

Build and Configure Application

Before you start analyzing the performance of the target application, do the following:

  1. Get software tools.

  2. Build the application.

  3. Run the application to determine the best MPI and OpenMP* process and thread configuration.

Get Software Tools

You need the following tools to try the tutorial steps yourself using the heart_demo sample application:

  • Intel® Parallel Studio XE Cluster Edition, including Intel® C++ Compiler, Intel® MPI Library, Intel® Trace Analyzer and Collector, and Intel® VTune™ Profiler

  • heart_demo sample application, available from GitHub* at https://github.com/CardiacDemo/Cardiac_demo.

Build Application

To build the application, do the following:

  1. Clone the application GitHub* repository to your local system:

    $ git clone https://github.com/CardiacDemo/Cardiac_demo.git

  2. Set up the Intel C++ compiler environment:

    $ source <compiler_installdir>/bin/compilervars.sh intel64

    By default, <compiler_installdir> is /opt/intel/compilers_and_libraries_<version>.<update>.<package#>/linux.

  3. In the root level of the sample package, create a build directory and change to that directory:

    $ mkdir build

    $ cd build

  4. Build the application using the following command:

    $ mpiicpc ../heart_demo.cpp ../luo_rudy_1991.cpp ../rcm.cpp ../mesh.cpp -g -o heart_demo -O3 -std=c++11 -qopenmp -parallel-source-info=2
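
    Optionally, you can confirm that the resulting binary is linked against the Intel MPI and OpenMP runtime libraries. This is only a quick sanity check on a Linux system; the exact library names may vary between releases:

    $ ldd ./heart_demo | grep -E 'mpi|iomp'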

Run Application with Various Configurations

When running a hybrid MPI/OpenMP application, it is important to find an optimal combination of processes and threads. Different combinations may produce widely varying results, so experimenting with several combinations is an important step in optimizing application performance. Run several scenarios and record the results to set a baseline for your performance optimization work.

The best combination also depends on the number of cores on each node. Experimentation has shown that loading half of the available logical CPUs works best for the heart_demo application.

For the examples in this tutorial, the application was launched on 8 cluster nodes with Intel® Xeon Phi™ processors (formerly codenamed Knights Landing), each with 256 logical CPUs. As a result, 128 logical CPUs are loaded on each node with MPI processes or OpenMP threads. Three per-node combinations were evaluated, measuring pure computation time and elapsed time for each (the arithmetic relating these settings to the mpirun options is sketched after the list):

  • 128 MPI processes, 1 OpenMP thread
  • 32 MPI processes, 4 OpenMP threads
  • 2 MPI processes, 64 OpenMP threads
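
The following minimal shell sketch (variable names are illustrative) shows the arithmetic behind these settings: the MPI ranks per node multiplied by OMP_NUM_THREADS equals the 128 loaded logical CPUs, and the total rank count passed to mpirun -n is the per-node rank count multiplied by the 8 nodes.

    # Arithmetic behind the three configurations (8 nodes, 128 logical CPUs loaded per node)
    NODES=8
    LOADED_CPUS_PER_NODE=128
    for PPN in 128 32 2; do
        OMP=$((LOADED_CPUS_PER_NODE / PPN))   # OpenMP threads per MPI rank
        NTOTAL=$((NODES * PPN))               # total MPI ranks passed to mpirun -n
        echo "-ppn $PPN  OMP_NUM_THREADS=$OMP  -n $NTOTAL"
    done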

To perform the measurement yourself, do the following:

  1. Set up the environment for the Intel MPI Library:

    $ source <impi_installdir>/intel64/bin/mpivars.sh

    Where <impi_installdir> is the installed location for Intel MPI Library (default location is /opt/intel/compilers_and_libraries_<version>.<update>.<package#>/linux/mpi).

  2. Create a host file that lists all of the cluster nodes involved:

    node1
    node2
    ...
    node8

  3. Save the file as hosts.txt in the build directory.

  4. In the build directory, run the application with each of the three combinations. Wrap each run in a small shell script (created below with cat; press Ctrl+D to finish the input), make the script executable, and use the time utility to measure the application elapsed time. The computation time is calculated by the application internally.

    # 128/1
    $ cat > run_ppn128_omp1.sh
    export OMP_NUM_THREADS=1
    mpirun -n 1024 -ppn 128 -f hosts.txt ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 50
    $ chmod +x run_ppn128_omp1.sh
    $ time ./run_ppn128_omp1.sh
    # 32/4
    $ cat > run_ppn32_omp4.sh
    export OMP_NUM_THREADS=4
    mpirun -n 256 -ppn 32 -f hosts.txt ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 50
    $ chmod +x run_ppn32_omp4.sh
    $ time ./run_ppn32_omp4.sh
    # 2/64
    $ cat > run_ppn2_omp64.sh
    export OMP_NUM_THREADS=64
    mpirun -n 16 -ppn 2 -f hosts.txt ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 50
    $ chmod +x run_ppn2_omp64.sh
    $ time ./run_ppn2_omp64.sh

  5. Review and save the computation and elapsed time values. The computation time is printed by the application in the last lines of its output (the wall time line), and the elapsed time is reported by the time utility (the real line). A sketch that runs all three configurations and extracts both values in one pass follows this list.

    ...
    wall time: <value>
    real <value>
    ...
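
As mentioned in step 5, the following sketch collects both values for all three configurations in one pass. It assumes a bash shell and that the three run scripts created in step 4 exist in the build directory and are executable; the file name collect_times.sh is illustrative.

    # collect_times.sh: run each configuration, capture its output, and extract
    # the computation time ("wall time", printed by heart_demo) and the
    # elapsed time ("real", printed by the time utility)
    for cfg in ppn128_omp1 ppn32_omp4 ppn2_omp64; do
        echo "=== ${cfg} ==="
        { time ./run_${cfg}.sh ; } &> run_${cfg}.log
        grep 'wall time' run_${cfg}.log
        grep '^real' run_${cfg}.log
    done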

The results for each experiment are available in the table below. Your results should be similar.

Combination (MPI/OpenMP)    Computation Time, s    Elapsed Time, s
128/1                       521.57                 613.53
32/4                        172.60                 202.29
2/64                        57.15                  73.56
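
For reference, the elapsed-time figures correspond to a speedup of roughly 613.53 / 73.56 ≈ 8.3x for the 2/64 combination over the 128/1 combination, and roughly 202.29 / 73.56 ≈ 2.8x over the 32/4 combination.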

The following can be observed:

  • The first combination uses only MPI parallelism, so its performance is considerably worse than that of the combinations that use both MPI and OpenMP. It is not worth investigating further.

  • The second combination is an intermediate case: the times are significantly better, but still not optimal. This may be due to an unoptimized MPI communication pattern in the application.

  • The third combination shows the best performance, so it is reasonable to focus on this one for further optimizations.

Key Take-Aways

  • Using only one method of parallelism is inefficient. Using both MPI and OpenMP parallelism at once can give a significant performance boost.

  • Test various combinations of MPI processes and OpenMP threads for your hybrid application. Different combinations can produce very different performance results for the same application.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804