Tutorial: Analyzing OpenMP* and MPI Applications

ID 773235
Date 5/20/2020
Public

Build and Configure Application

Before you start analyzing the performance of the target application, do the following:

  1. Get software tools.

  2. Build the application.

  3. Run the application to determine the best MPI and OpenMP* process and thread configuration.

Get Software Tools

You need the following tools to try the tutorial steps yourself using the heart_demo sample application:

  • Intel® Parallel Studio XE Cluster Edition, including Intel® C++ Compiler, Intel® MPI Library, Intel® Trace Analyzer and Collector, and Intel® VTune™ Profiler

  • heart_demo sample application, available from GitHub* at https://github.com/CardiacDemo/Cardiac_demo.

Build Application

To build the application, do the following:

  1. Clone the application GitHub* repository to your local system:

    $ git clone https://github.com/CardiacDemo/Cardiac_demo.git

  2. Set up the Intel C++ compiler environment:

    $ source <compiler_installdir>/bin/compilervars.sh intel64

    By default, <compiler_installdir> is /opt/intel/compilers_and_libraries_<version>.<update>.<package#>/linux.

  3. In the root level of the sample package, create a build directory and change to that directory:

    $ mkdir build

    $ cd build

  4. Build the application using the following command:

    $ mpiicpc ../heart_demo.cpp ../luo_rudy_1991.cpp ../rcm.cpp ../mesh.cpp -g -o heart_demo -O3 -std=c++11 -qopenmp -parallel-source-info=2
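
    Optionally, you can confirm that the resulting binary is linked against the Intel MPI and OpenMP runtime libraries. This is only a quick sanity check on a Linux system; the exact library names may vary between releases:

    $ ldd ./heart_demo | grep -E 'mpi|iomp'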

Run Application with Various Configurations

When running a hybrid MPI/OpenMP application, it is important to find an optimal combination of processes and threads. Different combinations may produce widely varying results, so experimenting with several combinations is an important step in optimizing application performance. Run several scenarios and record the results to set a baseline for your performance optimization work.

The best combination also depends on the number of cores on each node. Experimentation has shown that loading half of the available logical CPUs works best for the heart_demo application.

For the examples in this tutorial, the application was launched on 8 cluster nodes with Intel® Xeon Phi™ processors (formerly codenamed Knights Landing), each with 256 logical CPUs. As a result, 128 logical CPUs are loaded on each node with MPI processes or OpenMP threads. Three per-node combinations were evaluated, measuring pure computation time and elapsed time for each (the arithmetic relating these settings to the mpirun options is sketched after the list):

  • 128 MPI processes, 1 OpenMP thread
  • 32 MPI processes, 4 OpenMP threads
  • 2 MPI processes, 64 OpenMP threads
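
The following minimal shell sketch (variable names are illustrative) shows the arithmetic behind these settings: the MPI ranks per node multiplied by OMP_NUM_THREADS equals the 128 loaded logical CPUs, and the total rank count passed to mpirun -n is the per-node rank count multiplied by the 8 nodes.

    # Arithmetic behind the three configurations (8 nodes, 128 logical CPUs loaded per node)
    NODES=8
    LOADED_CPUS_PER_NODE=128
    for PPN in 128 32 2; do
        OMP=$((LOADED_CPUS_PER_NODE / PPN))   # OpenMP threads per MPI rank
        NTOTAL=$((NODES * PPN))               # total MPI ranks passed to mpirun -n
        echo "-ppn $PPN  OMP_NUM_THREADS=$OMP  -n $NTOTAL"
    done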

To perform the measurement yourself, do the following:

  1. Set up the environment for the Intel MPI Library:

    $ source <impi_installdir>/intel64/bin/mpivars.sh

    Where <impi_installdir> is the installed location for Intel MPI Library (default location is /opt/intel/compilers_and_libraries_<version>.<update>.<package#>/linux/mpi).

  2. Create a host file that lists all of the cluster nodes involved:

    node1
    node2
    ...
    node8

  3. Save the file as hosts.txt in the build directory.

  4. In the build directory, run the application with each of the three combinations. Wrap each run in a small shell script (created below with cat; press Ctrl+D to finish the input), make the script executable, and use the time utility to measure the application elapsed time. The computation time is calculated by the application internally.

    # 128/1
    $ cat > run_ppn128_omp1.sh
    export OMP_NUM_THREADS=1
    mpirun -n 1024 -ppn 128 -f hosts.txt ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 50
    $ chmod +x run_ppn128_omp1.sh
    $ time ./run_ppn128_omp1.sh
    # 32/4
    $ cat > run_ppn32_omp4.sh
    export OMP_NUM_THREADS=4
    mpirun -n 256 -ppn 32 -f hosts.txt ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 50
    $ chmod +x run_ppn32_omp4.sh
    $ time ./run_ppn32_omp4.sh
    # 2/64
    $ cat > run_ppn2_omp64.sh
    export OMP_NUM_THREADS=64
    mpirun -n 16 -ppn 2 -f hosts.txt ./heart_demo -m ../mesh_mid -s ../setup_mid.txt -t 50
    $ chmod +x run_ppn2_omp64.sh
    $ time ./run_ppn2_omp64.sh

  5. Review and save the computation and elapsed time values. The computation time is printed by the application in the last lines of its output (the wall time line), and the elapsed time is reported by the time utility (the real line). A sketch that runs all three configurations and extracts both values in one pass follows this list.

    ...
    wall time: <value>
    real <value>
    ...
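
As mentioned in step 5, the following sketch collects both values for all three configurations in one pass. It assumes a bash shell and that the three run scripts created in step 4 exist in the build directory and are executable; the file name collect_times.sh is illustrative.

    # collect_times.sh: run each configuration, capture its output, and extract
    # the computation time ("wall time", printed by heart_demo) and the
    # elapsed time ("real", printed by the time utility)
    for cfg in ppn128_omp1 ppn32_omp4 ppn2_omp64; do
        echo "=== ${cfg} ==="
        { time ./run_${cfg}.sh ; } &> run_${cfg}.log
        grep 'wall time' run_${cfg}.log
        grep '^real' run_${cfg}.log
    done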

The results for each experiment are available in the table below. Your results should be similar.

Combination (MPI/OpenMP)    Computation Time, s    Elapsed Time, s
128/1                       521.57                 613.53
32/4                        172.60                 202.29
2/64                        57.15                  73.56
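
For reference, the elapsed-time figures correspond to a speedup of roughly 613.53 / 73.56 ≈ 8.3x for the 2/64 combination over the 128/1 combination, and roughly 202.29 / 73.56 ≈ 2.8x over the 32/4 combination.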

The following can be observed:

  • The first combination uses only MPI parallelism, so its performance is considerably worse than that of the combinations that use both MPI and OpenMP. It is not worth investigating further.

  • The second combination is an intermediate case: the times are significantly better, but still not optimal. This may be due to an unoptimized MPI communication pattern in the application.

  • The third combination shows the best performance, so it is reasonable to focus on this one for further optimizations.

Key Take-Aways

  • Using only one method of parallelism is inefficient. Using both MPI and OpenMP parallelism at once can give a significant performance boost.

  • Test various combinations of MPI processes and OpenMP threads for your hybrid application. Different combinations can produce very different performance results for the same application.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804