Offload and Optimize OpenMP* Applications with Intel® Tools
Overview
Parallel programming is a powerful way for software applications to exploit the full potential of multicore hardware. Including OpenMP* directives is an effective way to add parallelism to your application.
GPUs are inherently powerful computation devices because of their massively parallel architectures. As you explore ways to improve the performance of your application, consider adding OpenMP offload directives to your code to offload work onto a GPU. Optimizing the offloaded application can further improve performance by ensuring that you use the available hardware to the fullest extent.
The following learning path shows how to offload and optimize an OpenMP application on Intel® GPUs using developer tools from Intel.
What You Will Learn
Use this workflow to offload applications enhanced with OpenMP directives onto Intel GPUs. Learn how to:
- Select the right functions to offload.
- Profile application performance iteratively.
- Employ the full potential of your GPU.
Each step in the workflow contains:
- Tasks you can complete using developer tools from Intel
- Resources for further learning
- References to a comprehensive ISO3DFD OpenMP Offload sample with a guided README file to illustrate the process
Who This Is For
- Software developers who intend to run OpenMP applications on Intel GPUs and maximize code performance
- HPC domain experts who want to learn the basics of using OpenMP offload constructs
What You Need
- An application parallelized with OpenMP
Programming Languages
- C
- C++
- Fortran
Step 1: Install Development Tools and an Appropriate Compiler
Install the compiler and tools you need to run your application:
- C, C++, and DPC++ applications require the Intel® DPC++/C++ Compiler. Get this compiler with the Intel® oneAPI Base Toolkit (Base Kit).
- Fortran applications require the Intel® Fortran Compiler. Get this compiler with the Intel® oneAPI HPC Toolkit. For full functionality, download this toolkit in addition to the Base Kit.
- Intel® Advisor
- Intel® VTune™ Profiler
- Profiling Tools Interfaces for Intel GPUs, which include the OneTrace tool
Note When working with the command-line interface (CLI), configure the software development toolkits from Intel using environment variables. Learn how to set environment variables.
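For example, on Linux*, if the toolkits are installed in the default location, you can set the required environment variables with the setvars.sh script:
source /opt/intel/oneapi/setvars.sh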
Step 2: Assess Application Performance on CPU and Potential GPU Kernels
Before attempting to offload functions, you must first understand how well your application runs on the CPU. Look for unexpected performance hot spots that you can speed up. You may also discover synchronization issues that negatively impact parallel performance.
2.1 Identify the most time-consuming functions in your OpenMP application and analyze their source code. For this purpose, run a Hotspots analysis with Intel VTune Profiler. The insights you gain from a Hotspots analysis are useful for tuning both serial and parallel applications. Improve your code and run the Hotspots analysis again to verify improvement. You may need to complete several tuning cycles to achieve the desired performance.
Example See how the Hotspots analysis works in the ISO3DFD guided sample.
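For reference, a minimal command-line Hotspots collection might look like this (my_app is a placeholder for your application binary):
vtune -collect hotspots -result-dir r_hotspots -- ./my_app
vtune -report summary -result-dir r_hotspots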
2.2 Examine the vectorization in your application. Vectorization converts an algorithm from operating on a single value at a time to operating on a set of values (or vectors) simultaneously. See if the vectorized loops in your application provide the benefit you expect. Understand why other loops are not vectorized. Run the Vectorization and Code Insights Perspective in Intel® Advisor.
Example See how the vectorization perspective works in the ISO3DFD guided sample.
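From the command line, a basic Vectorization and Code Insights run collects survey data, optionally followed by trip counts and FLOP data (the project directory and my_app are placeholders):
advisor --collect=survey --project-dir=./advi_results -- ./my_app
advisor --collect=tripcounts --flop --project-dir=./advi_results -- ./my_app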
Note Optimizing hot spots may require several iterations.
Also estimate the performance of your application against ceilings imposed by the hardware to find out the maximum performance you can achieve with your current hardware resources. Run the CPU/Memory Roofline Insights Perspective in Intel® Advisor.
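A CPU/Memory Roofline collection can reuse the same project directory; for example:
advisor --collect=roofline --project-dir=./advi_results -- ./my_app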
2.3 Once your application performs satisfactorily on the CPU, find out what parts of your application are worth offloading. Use the Offload Modeling Perspective in Intel® Advisor. This perspective measures and compares the performance of your application against modeled performance on an Intel GPU, without making use of the actual GPU or requiring any change to your code. To understand code behavior as well as the nature of the offload process, first offload some simple pieces of code.
If you do not observe satisfactory results with the Offload Modeling perspective, your application may not benefit from a GPU offload operation. In other words, the offload is not profitable; this may be a good stopping point.
Example See how the Offload Modeling perspective works in the ISO3DFD guided sample.
Tip To identify pieces of code that are the most profitable for offload, use the Offload Modeling Perspective with multiple workloads of varying sizes.
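In recent versions of Intel Advisor, you can also run Offload Modeling from the command line in a single step; the GPU is only modeled, so no GPU hardware is required (paths are placeholders):
advisor --collect=offload --project-dir=./advi_results -- ./my_app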
Learn More
- Hotspots Analysis in Intel VTune Profiler
- Vectorization Perspective in Intel Advisor
- Vectorization Resources
- CPU and Memory Roofline Insights Perspective in Intel Advisor
- Offload Modeling Resources
- A cookbook recipe to demonstrate the Offload Modeling process
- A cookbook recipe to estimate the speedup of a C++ application on an Intel GPU
Step 3: Offload Compute-Intensive Code Regions in Kernels
You are now ready to offload the most profitable loops and functions to an Intel GPU. Your application typically contains nested loops as well as calls to math libraries and other functions. To offload calls to math libraries, use the OpenMP Dispatch Construct.
At this point, your code already contains OpenMP pragmas and any rewrites that your algorithm requires. The next step is to include additional pragmas to:
- Cause some of your code to run on the GPU
- Move data required by the offloaded code to and from the GPU
In this phase:
3.1 Add OpenMP pragmas to these loops and functions to offload them and manage memory movement. To understand and use OpenMP directives, see the OpenMP Offload Tuning Guide.
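For illustration, here is a minimal sketch of an offloaded loop in C++ with explicit data movement. The function and variable names are hypothetical, not taken from the ISO3DFD sample:

// Hypothetical example: offload a scale-and-add loop to the GPU.
// map(to:) copies b to the device; map(tofrom:) copies a in and back out.
void scale_add(float *a, const float *b, float scale, int n)
{
    #pragma omp target teams distribute parallel for \
        map(to: b[0:n]) map(tofrom: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] += scale * b[i];
}

The target construct offloads the region, teams distribute parallel for spreads the loop iterations across the GPU, and the map clauses control data transfer between host and device.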
3.2 To offload the OpenMP code onto Intel GPUs, recompile the application using these compiler options:
-fiopenmp -fopenmp-targets=spir64
These options are valid for C, C++, and Fortran code. The Intel compiler converts your application into an intermediate language representation called SPIR-V. The compilation process saves the SPIR-V version of your code in the generated binary. When you run your code, the just-in-time compiler translates the SPIR-V code into the appropriate assembly code for the target offload device.
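For example, you might compile the C++ sketch above with the Intel DPC++/C++ Compiler, or a Fortran equivalent with the Intel Fortran Compiler (file names are placeholders):
icpx -fiopenmp -fopenmp-targets=spir64 my_app.cpp -o my_app
ifx -fiopenmp -fopenmp-targets=spir64 my_app.f90 -o my_app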
If necessary, debug application crashes or incorrect output (compared to nonoffloaded output). To troubleshoot your application, use OneTrace and Intel® Distribution for GDB* until you get the expected results.
Example See how offloading works in the ISO3DFD guided sample.
Learn More
- Offloading Intel® oneAPI Math Kernel Library (oneMKL) Computations onto the GPU
- OpenMP Offload Tuning Guide
- Training Modules to Learn the Basics of OpenMP Offload
- Troubleshoot Highly Parallel Applications
- Debug the DPC++ and OpenMP* Offload Process
- Get Started with Intel Distribution for GDB: Windows* Host | Linux* Host
- OneTrace Application
Step 4: Evaluate Offload Efficiency
Evaluate the efficiency of the data transfer and compute tasks to ensure that they overlap appropriately without introducing bottlenecks (such as unnecessary synchronization).
4.1 The OpenMP offloading runtime libraries provide basic profiling of your OpenMP application. Before you begin optimizing the offload process, use this profiling capability to understand aspects of the execution of your application, including:
- Kernel creation
- Thread resource allocation
- Data transfer
- Kernel naming
- Kernel execution time
To enable the basic profiling feature, set this variable:
export LIBOMPTARGET_PLUGIN_PROFILE=T
Optionally, you can also collect debugging information by setting this variable:
export LIBOMPTARGET_DEBUG=<1|2|3|4>
A higher value for this variable corresponds to higher verbosity.
Note Collecting debugging information this way can generate a very large output. You may need to redirect stdout and stderr to a file.
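For example, one way to combine both variables for a single run and capture the output in a file (my_app is a placeholder for your application binary):
LIBOMPTARGET_PLUGIN_PROFILE=T LIBOMPTARGET_DEBUG=1 ./my_app > offload_run.log 2>&1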
For quick triage, generate trace reports with the OneTrace application (a command sketch follows the list below). Use these reports to determine:
- Data transfer times
- Kernel running times
- Overhead associated with the offload process
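As a sketch, assuming OneTrace is installed and on your PATH, you might generate a combined host and device timing report as follows (my_app is a placeholder; verify the flag names against your OneTrace version):
onetrace --host-timing --device-timing ./my_app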
4.2 To improve your understanding of GPU issues you identified in the previous step, in Intel VTune Profiler, run the GPU Offload Analysis.
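A minimal command-line run of this analysis might look like this:
vtune -collect gpu-offload -result-dir r_gpu_offload -- ./my_app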
4.3 With inferences from the GPU Offload Analysis, modify your code to minimize the overhead associated with data transfer and offload. Run a few cycles of this analysis to maximize computations on the CPU and GPU.
Tip When modifying your code, for guidance, use the OpenMP Offload Tuning Guide.
Example See how the GPU Offload Analysis works in the ISO3DFD guided sample.
Learn More
- Compile and Run an OpenMP Application
- Information about OpenMP Offloading Runtime Libraries
- OneTrace Application
- Methodology Recipe on Software Optimization for Intel GPUs
- GPU Offload Analysis in Intel VTune Profiler
- A Recipe on Profiling an OpenMP Offload Application
Step 5: Optimize the Offload Process
Take a closer look into the workings of your GPU to resolve performance issues and maximize its use.
5.1 Identify performance hot spots caused by memory latency or inefficient kernel algorithms. Locate the most time-consuming kernels, understand the causes of their behavior, and resolve them. Use the GPU Compute/Media Hotspots analysis in Intel VTune Profiler to further enhance your understanding of GPU usage. Run this analysis (a command-line sketch follows the list below) and focus on these aspects, which can have a profound impact on GPU performance:
- Xe Vector Engine (XVE) occupancy
- Cache utilization
- XVE stalls
- Basic block latency
- Memory latency
- Energy consumption by GPU (for certain devices)
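As referenced above, a minimal command-line run of this analysis (my_app is a placeholder):
vtune -collect gpu-hotspots -result-dir r_gpu_hotspots -- ./my_app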
Example See how the GPU Compute/Media Hotspots analysis works in the ISO3DFD guided sample.
5.2 Measure the performance of your CPU and GPU against performance ceilings imposed by your hardware. Where in the pipeline should you focus your attention in order to maximize throughput on the CPU and GPU? To answer this question, run the GPU Roofline Insights perspective in Intel Advisor.
The results can help you understand the maximum performance you can expect from both your CPU and GPU in the current hardware environment. If critical kernels still display as hot spots, see if memory bandwidth or compute capacity are limiting their performance. Optimize these kernels accordingly.
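From the command line, a GPU Roofline collection might look like this; the --profile-gpu option enables GPU profiling (paths are placeholders):
advisor --collect=roofline --profile-gpu --project-dir=./advi_gpu -- ./my_app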
Example See how the GPU Roofline Insights perspective works in the ISO3DFD guided sample.
Learn More
- GPU Compute/Media Hotspots Analysis in Intel VTune Profiler
- GPU Roofline Insights Perspective in Intel Advisor
- Resources on the GPU Roofline Insights Perspective
- Optimize Your GPU Application with the Base Kit
Step 6: Review Overall Application Performance
When you offload OpenMP applications, getting the most out of your GPU takes an iterative approach. Remember that you are comparing the performance of your offloaded application to the original OpenMP application (which was optimized for your host platform). Repeat this exercise until you are satisfied with the performance of your OpenMP offload application.
Share your offload experience in the analyzers community forum.