Programming Guide

Contents

CPU Offload Flow

By default, if you are offloading to a CPU device, it goes through an OpenCL™ runtime, which also uses Intel oneAPI Threading Building Blocks for parallelism.
When offloading to a CPU, workgroups map to different logical cores and these workgroups can execute in parallel. Each work-item in the workgroup can map to a CPU SIMD lane. Work-items (sub-groups) execute together in a SIMD fashion.
CPU workgroups

Set Up for CPU Offload

  1. Make sure you have followed all steps in the oneAPI Development Environment Setup section, including running the
    setvars
    script.
  2. Check if you have the required OpenCL runtime associated with the CPU using the
    sycl-ls
    command. For example:
    $sycl-ls CPU : OpenCL 2.1 (Build 0)[ 2020.11.12.0.14_160000 ] GPU : OpenCL 3.0 NEO [ 21.33.20678 ] GPU : 1.1[ 1.2.20939 ]
  3. Use one of the following code samples to verify that your code is running on the CPU. The code sample adds scalar to large vectors of integers and verifies the results.
SYCL*
To run on a CPU, SYCL provides built-in device selectors for convenience. They use
device_selector
as a base class.
cpu_selector
selects a CPU device.
Alternatively, you could also use the following environment variable when using
default_selector
to select a device according to implementation-defined heuristics.
export SYCL_DEVICE_FILTER=cpu
SYCL code sample:
#include <CL/sycl.hpp> #include <array> #include <iostream> using namespace sycl; using namespace std; constexpr size_t array_size = 10000; int main(){ constexpr int value = 100000; try{ cpu_selector d_selector; queue q(d_selector); int *sequential = malloc_shared<int>(array_size, q); int *parallel = malloc_shared<int>(array_size, q); //Sequential iota for (size_t i = 0; i < array_size; i++) sequential[i] = value + i; //Parallel iota in SYCL auto e = q.parallel_for(range{array_size}, [=](auto i) { parallel[i] = value + i; }); e.wait(); // Verify two results are equal. for (size_t i = 0; i < array_size; i++) { if (parallel[i] != sequential[i]) { cout << "Failed on device.\n"; return -1; } } free(sequential, q); free(parallel, q); }catch (std::exception const &e) { cout << "An exception is caught while computing on device.\n"; terminate(); } cout << "Successfully completed on device.\n"; return 0; }
To compile the code sample, use:
dpcpp simple-iota-dp.cpp -o simple-iota.
Additional commands are available from Example CPU Commands.
Results after compilation:
./simple-iota Running on device: Intel® Core™ i7-8700 CPU @ 3.20GHz Successfully completed on device.
OpenMP*
OpenMP code sample:
#include<iostream> #include<omp.h> #define N 1024 int main(){ float *a = (float *)malloc(sizeof(float)*N); for(int i = 0; i < N; i++) a[i] = i; #pragma omp target teams distribute parallel for simd map(tofrom: a[:N]) for(int i = 0; i < 1024; i++) a[i]++; std::cout<<a[100]<<"\n"; return 0; }
Use the following environment variable to compile for running on a CPU:
export LIBOMPTARGET_DEVICETYPE=cpu
To compile the code sample, use:
icpx simple-ompoffload.cpp -fiopenmp -fopenmp-targets=spir64 -o simple-ompoffload
Results after compilation:
./simple-ompoffload Successfully completed on device

Offload Code to CPU

When offloading your application, it is important to identify the bottlenecks and which code will benefit from offloading. If you have a code that is compute intensive or a highly data parallel kernel, offloading your code would be something to look into.
To find opportunities to offload your code, use the Intel Advisor for Offload Modeling.

Debug Offloaded Code

The following list has some basic debugging tips for offloaded code.
  • Check host target to verify the correctness of your code.
  • Use
    printf
    to debug your application. Both SYCL and OpenMP offload support
    printf
    in kernel code.
  • Use environment variables to control verbose log information.
    • For SYCL, the following debug environment variables are recommended. A full list of environment variables is available from GitHub.
      SYCL Recommended Debug Environment Variables
      Name
      Value
      Description
      SYCL_DEVICE_FILTER
      backend:device_type:device_num
      SYCL_PI_TRACE
      1|2|-1
      1: print out the basic trace log of the SYCL/DPC++ runtime plugin
      2: print out all API traces of SYCL/DPC++ runtime plugin
      -1: all of “2” including more debug messages
    • For OpenMP, the following debug environment variables are recommended. A full list is available from the LLVM/OpenMP documentation.
      OpenMP Recommended Debug Environment Variables
      Name
      Value
      Description
      LIBOMPTARGET_DEVICETYPE
      cpu|gpu|host
      Select
      LIBOMPTARGET_DEBUG
      1
      Print out verbose debug information
      LIBOMPTARGET_INFO
      Allows the user to request different types of runtime information from
      libomptarget
  • Use Ahead of Time (AOT) to move Just-in-Time (JIT) compilations to AOT compilation issues. For more information, see Ahead-of-Time Compilation for CPU Architectures.
See Debugging the SYCL and OpenMP Offload Process for more information on debug techniques and debugging tools available with oneAPI.

Optimize CPU Code

There are many factors that can affect the performance of CPU offload code. The number of work-items, workgroups, and amount of work done depends on the number of cores in your CPU.
  • If the amount of work being done by the core is not compute-intensive, then this could hurt performance. This is because of the scheduling overhead and thread context switching.
  • On a CPU, there is no need for data transfer through PCIe, resulting in lower latency because the offload region does not have to wait long for the data.
  • Based on the nature of your application, thread affinity could affect the performance on CPU. For details, see Control Binary Execution on Multiple Cores.
  • Offloaded code uses JIT compilation by default. Use AOT compilation (offline compilation) instead. With offline compilation, you could target your code to specific CPU architecture. Refer to Optimization Flags for CPU Architectures for details.
Additional recommendations are available from Optimize Offload Performance.

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.