Offload and Optimize OpenMP* Applications with Intel® Tools
Overview
Parallel programming is a powerful way for software applications to exploit the full potential of multicore hardware. Including OpenMP* directives is an effective way to add parallelism to your application.
GPUs are inherently powerful computation devices because of their massively parallel architectures. As you explore ways to improve the performance of your application, consider adding OpenMP offload directives to your code to offload work onto a GPU. Optimizing the offloaded application can further improve performance by ensuring that you use the available hardware to the fullest extent.
The following learning path shows how to offload and optimize an OpenMP application on Intel® GPUs using developer tools from Intel.
What You Will Learn
Use this workflow to offload applications enhanced with OpenMP directives onto Intel GPUs. Learn how to:
- Select the right functions to offload.
- Profile application performance iteratively.
- Employ the full potential of your GPU.
Each step in the workflow contains:
- Tasks you can complete using developer tools from Intel
- Resources for further learning
- References to a comprehensive ISO3DFD OpenMP Offload sample with a guided README file to illustrate the process
Who This Is For
- Software developers who intend to run OpenMP applications on Intel GPUs and maximize code performance
- HPC domain experts who want to learn the basics of using OpenMP offload constructs
What You Need
- An application parallelized with OpenMP
Programming Languages
- C
- C++
- Fortran
Step 1: Install Development Tools and an Appropriate Compiler
Install the compiler and tools you need to run your application:
- C, C++, and DPC++ applications require the Intel® DPC++/C++ Compiler. Get this compiler with the Intel® oneAPI Base Toolkit (Base Kit).
- Fortran applications require the Intel® Fortran Compiler. Get this compiler with the Intel® oneAPI HPC Toolkit. For full functionality, download this toolkit in addition to the Base Kit.
- Intel® Advisor
- Intel® VTune™ Profiler
- Profiling Tools Interfaces for Intel GPUs, which include the OneTrace tool
Note When working with the command-line interface (CLI), configure the software development toolkits from Intel using environment variables. Learn how to set environment variables.
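For example, on Linux*, if the toolkits are installed in the default location, you can set the required environment variables with the setvars.sh script:
source /opt/intel/oneapi/setvars.sh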
Step 2: Assess Application Performance on CPU and Potential GPU Kernels
Before attempting to offload functions, you must first understand how well your application runs on the CPU. Look for unexpected performance hot spots that you can speed up. You may also discover synchronization issues that negatively impact parallel performance.
2.1 Identify the most time-consuming functions in your OpenMP application and analyze their source code. For this purpose, run a Hotspots analysis with Intel VTune Profiler. The insights you gain from a Hotspots analysis are useful for tuning both serial and parallel applications. Improve your code and run the Hotspots analysis again to verify improvement. You may need to complete several tuning cycles to achieve the desired performance.
Example See how the Hotspots analysis works in the ISO3DFD guided sample.
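For reference, a minimal command-line Hotspots collection might look like this (my_app is a placeholder for your application binary):
vtune -collect hotspots -result-dir r_hotspots -- ./my_app
vtune -report summary -result-dir r_hotspots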
2.2 Examine the vectorization in your application. Vectorization converts an algorithm from operating on a single value at a time to operating on a set of values (or vectors) simultaneously. See if the vectorized loops in your application provide the benefit you expect. Understand why other loops are not vectorized. Run the Vectorization and Code Insights Perspective in Intel® Advisor.
Example See how the vectorization perspective works in the ISO3DFD guided sample.
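From the command line, a basic Vectorization and Code Insights run collects survey data, optionally followed by trip counts and FLOP data (the project directory and my_app are placeholders):
advisor --collect=survey --project-dir=./advi_results -- ./my_app
advisor --collect=tripcounts --flop --project-dir=./advi_results -- ./my_app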
Note Optimizing hot spots may require several iterations.
Also estimate the performance of your application against ceilings imposed by the hardware to find out the maximum performance you can achieve with your current hardware resources. Run the CPU/Memory Roofline Insights Perspective in Intel® Advisor.
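A CPU/Memory Roofline collection can reuse the same project directory; for example:
advisor --collect=roofline --project-dir=./advi_results -- ./my_app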
2.3 Once your application performs satisfactorily on the CPU, find out what parts of your application are worth offloading. Use the Offload Modeling Perspective in Intel® Advisor. This perspective measures and compares the performance of your application against modeled performance on an Intel GPU, without making use of the actual GPU or requiring any change to your code. To understand code behavior as well as the nature of the offload process, first offload some simple pieces of code.
If you do not observe satisfactory results with the Offload Modeling perspective, your application may not benefit from a GPU offload operation. In other words, the offload is not profitable; this may be a good stopping point.
Example See how the Offload Modeling perspective works in the ISO3DFD guided sample.
Tip To identify pieces of code that are the most profitable for offload, use the Offload Modeling Perspective with multiple workloads of varying sizes.
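In recent versions of Intel Advisor, you can also run Offload Modeling from the command line in a single step; the GPU is only modeled, so no GPU hardware is required (paths are placeholders):
advisor --collect=offload --project-dir=./advi_results -- ./my_app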
Learn More
- Hotspots Analysis in Intel VTune Profiler
- Vectorization Perspective in Intel Advisor
- Vectorization Resources
- CPU and Memory Roofline Insights Perspective in Intel Advisor
- Offload Modeling Resources
- A cookbook recipe to demonstrate the Offload Modeling process
- A cookbook recipe to estimate the speedup of a C++ application on an Intel GPU
Step 3: Offload Compute-Intensive Code Regions in Kernels
You are now ready to offload the most profitable loops and functions to an Intel GPU. Your application typically contains nested loops as well as calls to math libraries and other functions. To offload calls to math libraries, use the OpenMP Dispatch Construct.
At this point, your code already contains OpenMP pragmas and any rewrites that your algorithm requires. The next step is to include additional pragmas to:
- Cause some of your code to run on the GPU
- Move data required by the offloaded code to and from the GPU
In this phase:
3.1 Add OpenMP pragmas to these loops and functions to offload them and manage memory movement. To understand and use OpenMP directives, see the OpenMP Offload Tuning Guide.
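For illustration, here is a minimal sketch of an offloaded loop in C++ with explicit data movement. The function and variable names are hypothetical, not taken from the ISO3DFD sample:

// Hypothetical example: offload a scale-and-add loop to the GPU.
// map(to:) copies b to the device; map(tofrom:) copies a in and back out.
void scale_add(float *a, const float *b, float scale, int n)
{
    #pragma omp target teams distribute parallel for \
        map(to: b[0:n]) map(tofrom: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] += scale * b[i];
}

The target construct offloads the region, teams distribute parallel for spreads the loop iterations across the GPU, and the map clauses control data transfer between host and device.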
3.2 To offload the OpenMP code onto Intel GPUs, recompile the application using these compiler options:
-fiopenmp -fopenmp-targets=spir64
These options are valid for C, C++, and Fortran code. The Intel compiler converts your application into an intermediate language representation called SPIR-V. The compilation process saves the SPIR-V version of your code in the generated binary. When you run your code, the just-in-time compiler translates the SPIR-V code into the appropriate assembly code for the target offload device.
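For example, you might compile the C++ sketch above with the Intel DPC++/C++ Compiler, or a Fortran equivalent with the Intel Fortran Compiler (file names are placeholders):
icpx -fiopenmp -fopenmp-targets=spir64 my_app.cpp -o my_app
ifx -fiopenmp -fopenmp-targets=spir64 my_app.f90 -o my_app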
If necessary, debug application crashes or incorrect output (compared to nonoffloaded output). To troubleshoot your application, use OneTrace and Intel® Distribution for GDB* until you get the expected results.
Example See how offloading works in the ISO3DFD guided sample.
Learn More
- Offloading Intel® oneAPI Math Kernel Library (oneMKL) Computations onto the GPU
- OpenMP Offload Tuning Guide
- Training Modules to Learn the Basics of OpenMP Offload
- Troubleshoot Highly Parallel Applications
- Debug the DPC++ and OpenMP* Offload Process
- Get Started with Intel Distribution for GDB: Windows* Host | Linux* Host
- OneTrace Application
Step 4: Evaluate Offload Efficiency
Evaluate the efficiency of the data transfer and compute tasks to ensure that they overlap appropriately without introducing bottlenecks (such as unnecessary synchronization).
4.1 The OpenMP offloading runtime libraries provide basic profiling of your OpenMP application. Before you begin optimizing the offload process, use this profiling capability to understand aspects of the execution of your application, including:
- Kernel creation
- Thread resource allocation
- Data transfer
- Kernel naming
- Kernel execution time
To enable the basic profiling feature, set this variable:
export LIBOMPTARGET_PLUGIN_PROFILE=T
Optionally, you can also collect debugging information by setting this variable:
export LIBOMPTARGET_DEBUG=<1|2|3|4>
A higher value for this variable corresponds to higher verbosity.
Note Collecting debugging information this way can generate a very large output. You may need to redirect stdout and stderr to a file.
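For example, one way to combine both variables for a single run and capture the output in a file (my_app is a placeholder for your application binary):
LIBOMPTARGET_PLUGIN_PROFILE=T LIBOMPTARGET_DEBUG=1 ./my_app > offload_run.log 2>&1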
For quick triage, generate trace reports with the OneTrace application (a command sketch follows the list below). Use these reports to determine:
- Data transfer times
- Kernel running times
- Overhead associated with the offload process
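As a sketch, assuming OneTrace is installed and on your PATH, you might generate a combined host and device timing report as follows (my_app is a placeholder; verify the flag names against your OneTrace version):
onetrace --host-timing --device-timing ./my_app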
4.2 To improve your understanding of GPU issues you identified in the previous step, in Intel VTune Profiler, run the GPU Offload Analysis.
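A minimal command-line run of this analysis might look like this:
vtune -collect gpu-offload -result-dir r_gpu_offload -- ./my_app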
4.3 With inferences from the GPU Offload Analysis, modify your code to minimize the overhead associated with data transfer and offload. Run a few cycles of this analysis to maximize computations on the CPU and GPU.
Tip When modifying your code, for guidance, use the OpenMP Offload Tuning Guide.
Example See how the GPU Offload Analysis works in the ISO3DFD guided sample.
Learn More
- Compile and Run an OpenMP Application
- Information about OpenMP Offloading Runtime Libraries
- OneTrace Application
- Methodology Recipe on Software Optimization for Intel GPUs
- GPU Offload Analysis in Intel VTune Profiler
- A Recipe on Profiling an OpenMP Offload Application
Step 5: Optimize the Offload Process
Take a closer look into the workings of your GPU to resolve performance issues and maximize its use.
5.1 Identify performance hot spots caused by memory latency or inefficient kernel algorithms. Locate the most time-consuming kernels, understand the causes of their behavior, and resolve them. Use the GPU Compute/Media Hotspots analysis in Intel VTune Profiler to further enhance your understanding of GPU usage. Run this analysis (a command-line sketch follows the list below) and focus on these aspects, which can have a profound impact on GPU performance:
- Xe Vector Engine (XVE) occupancy
- Cache utilization
- XVE stalls
- Basic block latency
- Memory latency
- Energy consumption by GPU (for certain devices)
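As referenced above, a minimal command-line run of this analysis (my_app is a placeholder):
vtune -collect gpu-hotspots -result-dir r_gpu_hotspots -- ./my_app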
Example See how the GPU Compute/Media Hotspots analysis works in the ISO3DFD guided sample.
5.2 Measure the performance of your CPU and GPU against performance ceilings imposed by your hardware. Where in the pipeline should you focus your attention in order to maximize throughput on the CPU and GPU? To answer this question, run the GPU Roofline Insights perspective in Intel Advisor.
The results can help you understand the maximum performance you can expect from both your CPU and GPU in the current hardware environment. If critical kernels still display as hot spots, see if memory bandwidth or compute capacity are limiting their performance. Optimize these kernels accordingly.
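From the command line, a GPU Roofline collection might look like this; the --profile-gpu option enables GPU profiling (paths are placeholders):
advisor --collect=roofline --profile-gpu --project-dir=./advi_gpu -- ./my_app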
Example See how the GPU Roofline Insights perspective works in the ISO3DFD guided sample.
Learn More
- GPU Compute/Media Hotspots Analysis in Intel VTune Profiler
- GPU Roofline Insights Perspective in Intel Advisor
- Resources on the GPU Roofline Insights Perspective
- Optimize Your GPU Application with the Base Kit
Step 6: Review Overall Application Performance
When you offload OpenMP applications, getting the most out of your GPU takes an iterative approach. Remember that you are comparing the performance of your offloaded application to the original OpenMP application (which was optimized for your host platform). Repeat this exercise until you are satisfied with the performance of your OpenMP offload application.
Share your offload experience in the analyzers community forum.