Troubleshoot Highly Parallel Applications

Overview

Running compute-intensive code on the CPU alone can strain its resources. Using attached accelerators such as GPUs and FPGAs to free up CPU resources is referred to as offloading. This workflow shows you the steps to troubleshoot applications that use OpenMP* or the SYCL* API with extensions to offload work to these devices.

 

What You Will Learn

This workflow provides a recommended troubleshooting path, along with documentation and resources for common problems. For hands-on learning, the workflow references five samples, all based on matrix multiply. Each sample includes source code with errors and source code that illustrates the solution, with step-by-step instructions.

Samples included:

  • Guided Matrix Multiply Invalid Contexts
  • Guided Matrix Multiply Exceptions
  • Guided Matrix Multiply Race Conditions
  • Guided Matrix Multiplication Bad Buffers
  • Guided Matrix Multiplication Illegal Shared Local Memory (SLM) Size

 

Who This Is For

Software developers familiar with targeting attached accelerators (such as GPUs and FPGAs) using the Intel® oneAPI Base Toolkit and Intel® oneAPI HPC Toolkit software.

The Workflow

 

Prerequisites

Step 1: Prepare the Application

Step 2: Resolve Build and Runtime Crashes

Step 3: Resolve Application-Level Problems

Step 4: Address Application Performance Issues

Prerequisites: Configure Your Development Environment

Ensure that your oneAPI development environment is configured correctly and rule out common problems.

Configure Your oneAPI Environment

  1. Use the following guides:
    • oneAPI Installation Guide 
    • Get Started Guides Linux* | Windows* | macOS* 
    • Install Intel® Toolkits and Intel® Graphics Compute Runtime in HPC Cluster Environment
    • If deploying to a remote target, you may need the latest runtime versions.
  2. Install the latest driver updates for the OpenCL™ platform and oneAPI Level Zero
    • Linux | Windows
  3. Use the Diagnostics Utility for Intel toolkits to check for common configuration problems. This utility is included in your oneAPI installation. (A quick in-code device check is sketched after this list.)
    • User Guide
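
As a complement to the diagnostics utility, a quick runtime check is to enumerate every platform and device the SYCL runtime can see (similar in spirit to the sycl-ls utility shipped with oneAPI). The following is a minimal sketch, not code from the Intel documentation; if an expected GPU or the CPU OpenCL device is missing, revisit the driver and runtime steps above.

// Minimal sketch: list every SYCL platform and device visible to the runtime.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    for (const auto &platform : sycl::platform::get_platforms()) {
        std::cout << "Platform: "
                  << platform.get_info<sycl::info::platform::name>() << '\n';
        for (const auto &device : platform.get_devices()) {
            std::cout << "  Device: "
                      << device.get_info<sycl::info::device::name>() << '\n';
        }
    }
    return 0;
}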

Get Help

At any point in the following steps, you can also submit requests to the Online Service Center. Once you are signed in:

  1. Select Request Support, and then select Choose from a List. 
  2. From the list, select Software, and then Development Software. 
  3. Choose either Compiler or GPU Software Stack, and then include the output of errors in your request.

Step 1: Prepare the Application

Prepare and write your application:

  1. Start with a serial implementation of the algorithm that you can use to verify expected results.
  2. Identify areas of the code that might benefit from parallelism, and then implement them as parallel loops or parallel kernel invocations.

By default, SYCL applications use the oneAPI Level Zero runtime, which provides a low-level, direct-to-metal interface for the devices in a oneAPI platform.

Kernel-Based Application Development

Take advantage of additional parallelism on attached compute accelerators by implementing some of the application's parallelism using the kernel-based approach from SYCL. Test it during host-only execution using the OpenCL driver for the CPU. (It is easier to debug many issues on the CPU.)

Throughout this process, check the results of the parallel implementation against the serial implementation with various real-world datasets.
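
As a concrete illustration of this approach, the following sketch pairs a naive serial matrix multiply with a SYCL kernel version of the same loop nest; the serial result serves as the reference for checking the parallel output. The code is illustrative and assumed, not taken from the guided samples; N and the function names are placeholders.

#include <sycl/sycl.hpp>
#include <vector>

constexpr size_t N = 256;  // placeholder problem size

// Serial reference implementation used to verify the parallel results.
void matmul_serial(const std::vector<float> &a, const std::vector<float> &b,
                   std::vector<float> &c) {
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < N; ++k)
                sum += a[i * N + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
}

// SYCL kernel version of the same computation.
void matmul_parallel(sycl::queue &q, const std::vector<float> &a,
                     const std::vector<float> &b, std::vector<float> &c) {
    sycl::buffer<float, 2> buf_a(a.data(), sycl::range<2>(N, N));
    sycl::buffer<float, 2> buf_b(b.data(), sycl::range<2>(N, N));
    sycl::buffer<float, 2> buf_c(c.data(), sycl::range<2>(N, N));
    q.submit([&](sycl::handler &h) {
        sycl::accessor A(buf_a, h, sycl::read_only);
        sycl::accessor B(buf_b, h, sycl::read_only);
        sycl::accessor C(buf_c, h, sycl::write_only, sycl::no_init);
        h.parallel_for(sycl::range<2>(N, N), [=](sycl::id<2> idx) {
            float sum = 0.0f;
            for (size_t k = 0; k < N; ++k)
                sum += A[idx[0]][k] * B[k][idx[1]];
            C[idx] = sum;
        });
    });
    // buf_c copies the result back into c when it goes out of scope.
}

int main() {
    std::vector<float> a(N * N, 1.0f), b(N * N, 2.0f);
    std::vector<float> c_serial(N * N), c_parallel(N * N);
    matmul_serial(a, b, c_serial);
    sycl::queue q;
    matmul_parallel(q, a, b, c_parallel);
    // Compare c_parallel against c_serial here (see Step 3 for a
    // tolerance-based comparison).
    return 0;
}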

Tip If the code fails to build, the most common causes are compilation and linking failures. Compile using --save-temps -v to generate verbose output. This can show whether the failure occurs in other binaries such as clang-offload-bundler, llvm-link, or llvm-spirv.

Resources

  • oneAPI Level Zero runtime
  • Intel® 64 and IA-32 Architectures Software Developer Manuals
  • Intel compilers and optimizations: Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference
  • Intel oneAPI Programming Guide
  • Intel-optimized libraries: Intel oneAPI Programming Guide: oneAPI Library Overview
  • Intel oneAPI GPU Optimization Guide

  

Step 2: Resolve Build and Runtime Crashes

Debug on CPU or GPU

To debug failed attempts to run parallel code (kernels) on a specific device (CPU or GPU), do the following:

  1. Run the kernels on the CPU before trying on the GPU, because CPU debugging tends to be easier. Once your code is running correctly on the CPU, target the GPU using either the OpenCL or oneAPI Level Zero runtimes.
  2. If your code fails when it tries to launch a kernel, but the failure occurs outside the kernel itself (for example, in a library or driver you did not write), troubleshoot the problem at the runtime and driver level.

Build Your Application without Optimizations

Building without optimizations makes it possible to follow all local and passed variables and get reliable line numbers, which makes it easier to find the root cause of issues like memory overruns or bad pointers.

Some applications show problems only when built with optimizations. To find the root cause of those issues, debug the optimized build only after you have resolved every problem you can find in the unoptimized build.

 

First: Run the Application on the CPU

If your program fails when it attempts to call a kernel, the problem may be the result of an error that was detected by the SYCL, OpenMP, or OpenCL runtimes. 

To fix this error:

  1. Force your application to use the CPU using SYCL environment variables (a device-selection and exception-handling sketch follows this list). 
    • By default, SYCL applications use the oneAPI Level Zero runtime. The SYCL environment variables also allow you to switch to the OpenCL runtime for testing.
  2. If you experience runtime crashes: 
    • To get a summary of how your program got where it is, conduct a backtrace. See Backtrace in the Intel® Distribution for GDB* User Guide. 
    • To generate trace reports, see Profiling Tools Interfaces for GPU (PTI for GPU): 
      • oneTrace*: A host and device tracing tool for OpenCL runtime and oneAPI Level Zero back ends. 
      • zeTrace: The standard for oneAPI Level Zero API call tracing and profiling.
  3. Use the Intel Distribution for GDB for application-level debugging. 
    • See the Get Started Guide: Linux | Windows
  4. Repeat the previous steps until you can verify that your kernel starts to run. 
  5. Once you are sure that your offload kernel code runs on the CPU, try to run it on the GPU.
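
The following minimal sketch illustrates items 1 and 2 above: it selects the CPU device explicitly (an in-code alternative to the environment-variable approach, which needs no code change), installs an asynchronous exception handler on the queue, and calls wait_and_throw() so runtime errors surface with a message instead of a silent crash. The code is illustrative, not taken from the samples.

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // Report asynchronous errors raised while kernels execute.
    auto async_handler = [](sycl::exception_list errors) {
        for (const std::exception_ptr &e : errors) {
            try {
                std::rethrow_exception(e);
            } catch (const sycl::exception &ex) {
                std::cerr << "Asynchronous SYCL exception: " << ex.what() << '\n';
            }
        }
    };

    try {
        sycl::queue q(sycl::cpu_selector_v, async_handler);
        std::cout << "Running on: "
                  << q.get_device().get_info<sycl::info::device::name>() << '\n';
        // ... submit kernels here ...
        q.wait_and_throw();  // deliver any pending asynchronous exceptions
    } catch (const sycl::exception &ex) {
        std::cerr << "Synchronous SYCL exception: " << ex.what() << '\n';
        return 1;
    }
    return 0;
}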

 

Resources

  • Debug the DPC++ and OpenMP Offload Process.
  • For instructions on how to use the Intel Distribution for GDB, see the Guided Matrix Multiply Exceptions sample available on GitHub*.

Tips for GPU Offload

  • Start small. Go one kernel at a time. 
  • Run your kernel on only a few threads and build from there. Stepping through a few threads to find problems is much easier than stepping through thousands of threads (see the sketch below).
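
A minimal sketch of this tip follows; the DEBUG_SMALL_RANGE macro and the sizes are illustrative assumptions, not part of any Intel tool or sample.

#include <sycl/sycl.hpp>

#ifdef DEBUG_SMALL_RANGE
constexpr size_t kLaunchCount = 4;     // debug build: step through a few work-items
#else
constexpr size_t kLaunchCount = 1024;  // normal build: full problem size
#endif

int main() {
    sycl::queue q;
    int *data = sycl::malloc_shared<int>(kLaunchCount, q);
    q.parallel_for(sycl::range<1>(kLaunchCount), [=](sycl::id<1> i) {
        const size_t idx = i[0];
        data[idx] = static_cast<int>(idx);
    }).wait();
    sycl::free(data, q);
    return 0;
}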

 

Next: Run the Application on the GPU

If the kernel fails to run on the GPU, the problem is likely related to the OpenMP or SYCL runtime, the OpenCL Driver, or the oneAPI Level Zero driver. 

To help triage your application, follow these steps. 

  1. Force your application to use the GPU using SYCL environment variables. 
  2. If you experience runtime crashes, try switching from Just in Time (JIT) compilation to Ahead of Time (AOT) Compilation.
  3. To switch between the default oneAPI Level Zero runtime and the OpenCL runtime, use SYCL environment variables (the sketch after the note below shows how to confirm which backend was selected).
  4. If you continue to experience runtime crashes, generate trace reports using Profiling Tools Interfaces for GPU (PTI for GPU): 
    • oneTrace: A host and device tracing tool for OpenCL platform and oneAPI Level Zero back ends. 
    • zeTrace: The standard for oneAPI Level Zero API call tracing and profiling.
  5. If your runtime stops responding:
    • Use the Profiling Tools Interfaces for GPU (PTI for GPU) to identify which operation is not responding and to further localize investigation. 
    • To get a summary of the status of your program, conduct a backtrace. See Backtrace in the Intel Distribution for GDB User Guide. 

Note  On some GPUs, ctrl-c may be used to recover the system when it stops responding.
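
When switching runtimes with environment variables, it helps to confirm which backend the queue actually selected. The following sketch is illustrative; the ONEAPI_DEVICE_SELECTOR values in the comments are examples for recent oneAPI releases and may differ on older installations.

// Minimal sketch: print which device and platform the queue selected.
// Example invocations:
//   ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./app   (oneAPI Level Zero backend)
//   ONEAPI_DEVICE_SELECTOR=opencl:gpu ./app       (OpenCL backend)
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    sycl::queue q{sycl::gpu_selector_v};
    sycl::device dev = q.get_device();
    std::cout << "Device  : " << dev.get_info<sycl::info::device::name>() << '\n'
              << "Platform: "
              << dev.get_platform().get_info<sycl::info::platform::name>()
              << '\n';  // the platform name indicates Level Zero or OpenCL
    return 0;
}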

 

Resources

To see how to use oneTrace from the PTI for GPU tool set, see the Guided Matrix Multiply Invalid Contexts and Guided Matrix Multiplication Illegal Shared Local Memory (SLM) Size samples available on GitHub.

Step 3: Resolve Application-Level Problems

With application kernels now running, program crashes or runtime problems are likely caused by errors in your code.

  • Focus on debugging kernel execution using the Intel Distribution for GDB, which follows a model closely aligned with traditional debugging techniques.
  • Compare your application results between the CPU-only and accelerated implementations (where some of the program runs on an attached compute accelerator, such as a GPU). If necessary, compare your results with the original application. If the results differ by more than the expected precision differences, there may be a problem in one of the implementations (a tolerance-based comparison is sketched below).
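
A tolerance-based comparison such as the following sketch avoids flagging the small floating-point differences that legitimately arise from reordered arithmetic on the accelerator. The helper name and tolerance value are illustrative assumptions.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Flag results that differ from the serial reference by more than a
// relative tolerance.
bool results_match(const std::vector<float> &reference,
                   const std::vector<float> &offloaded,
                   float rel_tol = 1e-4f) {
    for (std::size_t i = 0; i < reference.size(); ++i) {
        float scale = std::max(std::fabs(reference[i]), 1.0f);
        if (std::fabs(reference[i] - offloaded[i]) > rel_tol * scale) {
            std::printf("Mismatch at %zu: %f (reference) vs %f (offloaded)\n",
                        i, reference[i], offloaded[i]);
            return false;
        }
    }
    return true;
}

int main() {
    std::vector<float> cpu_result{1.0f, 2.0f, 3.0f};
    std::vector<float> gpu_result{1.0f, 2.00001f, 3.0f};
    std::printf("Results %s\n",
                results_match(cpu_result, gpu_result) ? "match" : "differ");
    return 0;
}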

Resources

  • Intel Distribution for GDB: Linux | Windows
  • Debug the Offload Process

Tips for GPU Debugging

As you scale up your problem sizes and thread count, continue to debug while monitoring your overall performance improvements and correctness.

Note As you continue to modify and expand your kernel-based code, you may encounter errors or problems that require you to revert to Step 2.

  1. Use the Intel Distribution for GDB for application-level debugging. 
    • See the Intel Distribution for GDB Get Started Guide: Linux | Windows
    • For common strategies, see Debug the Offload Process in the Intel oneAPI Programming Guide. 
    • To learn more about debugging programs with multiple threads, see Chapter 4.10 of the Intel Distribution for GDB User Guide.
  2. If your results are incorrect:
    • Verify that the results are still accurate on the CPU.
    • Use Intel® Inspector or Valgrind* to find correctness problems such as bad pointers or pointer overruns in GPU-offloaded code. Running the offloaded code on the CPU OpenCL driver lets these tools catch issues that also affect the GPU.

To learn how to use GDB stack traces to locate problems in your code, see the Guided Matrix Multiply Race Conditions and Guided Matrix Multiplication Bad Buffers samples available on GitHub.

  

Step 4: Address Application Performance Issues

With your application successfully running on the CPU and GPU, you can now focus on getting the most out of the available hardware.

To determine how well the resulting application is working, use Intel® VTune™ Profiler.

  • To make sure that the application is spending its runtime in appropriate places, run real workloads. 
  • Unless it is compute intensive, the code that feeds your kernels should not take much time to run. As you increase the number of threads and available hardware, the application runtime should decrease. 
  • Look for unexpected time spent transferring or waiting for data, and for kernels that take longer than they should due to atomics, memory contention, or other issues. Use Intel VTune Profiler plus Intel® Advisor to understand how well your kernels use the offload device and overlap work (a kernel-timing sketch follows this list).
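
As a quick cross-check of what the profilers report, SYCL event profiling can time an individual kernel on the device. The following is a minimal sketch; the kernel and problem size are placeholders.

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // enable_profiling lets events report device start/end timestamps.
    sycl::queue q{sycl::default_selector_v,
                  sycl::property::queue::enable_profiling{}};
    constexpr size_t n = 1 << 20;
    float *data = sycl::malloc_shared<float>(n, q);

    sycl::event e = q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        data[i] = static_cast<float>(i[0]) * 0.5f;
    });
    e.wait();

    auto start = e.get_profiling_info<sycl::info::event_profiling::command_start>();
    auto end   = e.get_profiling_info<sycl::info::event_profiling::command_end>();
    std::cout << "Kernel time: " << (end - start) * 1e-6 << " ms\n";

    sycl::free(data, q);
    return 0;
}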

 

Resources

  • User Guide for Intel VTune Profiler
  • Get Started with Intel VTune Profiler
  • Get Started with Intel Advisor

Intel VTune Profiler enables you to measure and tune the performance of the entire application, not just the accelerated portion. This helps you find bottlenecks and opportunities for optimization throughout your entire application. 

Roofline analysis with Intel Advisor can help show you how much optimization is available to you on different hardware configurations for each kernel. It can also show where in the pipeline to focus your attention to maximize throughput on the GPU.
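
For reference, the roofline model behind this analysis bounds the attainable performance of each kernel by both its compute and memory limits:

attainable FLOP/s = min(peak FLOP/s, arithmetic intensity [FLOP/byte] × peak memory bandwidth [bytes/s])

A kernel sitting well below its roof still has headroom, and the limit it sits under indicates whether to focus on data movement or on computation.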

The optimization advice from these two tools is complementary. Intel VTune Profiler gives you an overall assessment, and Intel Advisor identifies how much further you can improve performance in each kernel.

As you optimize your code, changes in your application may result in errors and problems that require you to return to an earlier step. 

 

Resources

  • Run Roofline Analysis with Intel Advisor
  • Run Performance Analysis on Intel GPUs Using Intel VTune Profiler 
  • Profile a SYCL Application Running on a GPU with Intel VTune Profiler 
  • Optimize Applications for Intel GPUs with Intel VTune Profiler