• 2021.2
  • 03/26/2021
  • Public Content

Introduction to Data Parallel C++ (DPC++) with Samples from Intel
DPC++ applications are C++ programs for parallelism.
DPC++ is designed for data parallel programming and heterogeneous computing.
DPC++ provides a consistent programming language (C++) and APIs across CPU, GPU, FPGA, and AI accelerators. Each architecture can be programmed and used, either in isolation or together. This allows developers to learn once and then program for distinct accelerators. Each class of accelerator requires an appropriate formulation and tuning of the algorithms for best performance, but the language and programming model remain consistent, regardless of the target device.
DPC++ is based on SYCL* from the Khronos* Group to support data parallelism and heterogeneous programming. In addition, Intel is pursuing extensions to SYCL with the aim of providing value to customer code and working with the standards organization for adoption. For instance, the DPC++ language includes an implementation of unified shared memory to ease memory usage between the host and the accelerators. These features are being driven into a future version of the SYCL language. For more details about SYCL, refer to version 1.2.1 of the SYCL Specification.
The oneAPI.com site contains more details about DPC++ and its specifications.
This guide aims to help developers understand how to program using the DPC++ programming model, and how to target and optimize for the appropriate architecture to achieve optimal application performance.
For samples specific to FPGA, visit the Explore DPC++ Through Intel® FPGA Code Samples page.

Build and Run a Sample Project

The links below take you to the Get Started with the Intel® oneAPI Base Toolkit content for the Command Line and IDE:

Sample 1: Simple Device Offload Structure

Sample 1 uses Vector Add as the equivalent of a Hello, World! sample for data parallel programs. It provides the basic structure of a DPC++ application by showing you how to target an offload device. Sample 1 provides two different source files as examples of how to manage memory: you can use buffers or Unified Shared Memory (USM).
Vector Add provides both GPU and FPGA device selectors.
In this sample, you will learn how to use the basic elements (features) of DPC++ to offload a simple computation using 1D arrays to accelerators. The basic features are:
  • A one-dimensional array of data.
  • A device selector queue, buffer, accessor, and kernel.
  • Memory management using buffers and accessors or USM.
Visit Code Sample: Vector Add for a detailed code walkthrough.
Get the sample:

Sample 2: Basic DPC++ Features Defined

Using a two-dimensional stencil to simulate a wave propagating in a 2D isotropic medium, this sample walks you through the basic tenets of DPC++ step by step, with:
  • DPC++ queues (including device selectors and exception handlers).
  • DPC++ buffers and accessors.
  • The ability to call a function inside a kernel definition and pass accessor arguments as pointers. A function called inside the kernel performs a computation (it updates a grid point specified by the global ID variable) for a single time step.
Get the sample:

Sample 3: Optimizing for More Complex Applications

This code sample extends the DPC++ concepts reviewed in the previous sample, and explains how they can be used to solve complex stencil computations in 3D. Moving from 2D to 3D grid sizes can uncover common GPGPU (device) programming issues that are related to inefficient data access patterns, low flops-to-byte ratios, and low occupancy. Using this code sample shows you how DPC++ features can be used to tackle those underlying issues and optimize your performance. This sample includes:
  • DPC++ local buffers and accessors (declare local memory buffers and accessors to be accessed and managed by each work-group).
  • Code for Shared Local Memory (SLM) optimizations.
  • DPC++ kernels (including the parallel_for function and nd-range objects).
  • DPC++ queues (including custom device selectors and exception handlers).
Get the sample:

Sample 4: Introducing Synchronization

This sample adds some complexity in the form of a large number of moving particles and their interaction with a fixed grid of cells. This is used to illustrate new DPC++ features such as synchronization (atomic operations).
Using this code sample shows you how to offload to an accelerator a computation that uses the following DPC++ features:
  • DPC++ queues (including device selectors and exception handlers).
  • DPC++ buffers and accessors (communicate data between the host and the device).
  • DPC++ kernels (including the parallel_for function and range objects).
  • DPC++ atomic operations for synchronization.
  • API-based programming: use oneMKL to generate random numbers.
Visit Code Sample: Particle Diffusion for a detailed code walkthrough.
Get the sample:

Next Steps

Code Walkthroughs
Next, try a detailed code walkthrough on the following topics:
Determine Which Code to Offload
You can determine which parts of your code benefit from offloading to an accelerator with Intel® Advisor. The Offload Advisor feature allows you to collect performance predictor data, in addition to the standard profiling capabilities. It determines what code can be offloaded to a target device, which accelerates the performance of your CPU-based applications. The Get Started with Intel® Advisor guide helps you:
  • Optimize CPU or GPU code for memory and compute with Roofline Analysis.
  • Enable more vector parallelism and improve its efficiency.
  • Model, tune, and test multiple threading designs.
  • Create and analyze data flow and dependency computations for heterogeneous algorithms.
Transform CUDA* Code into DPC++ Code
You can transform CUDA code into standards-based DPC++ code with a migration engine called the Intel® DPC++ Compatibility Tool. The Get Started Guide and User Guide help you migrate your existing CUDA applications and cover the general workflow of the migration process. The tool can be used to transform programs that are composed of multiple source and header files. It also includes:
  • One-time migration of both kernels and API calls.
  • Inline comments in the generated output that guide you toward code that can be compiled with the Intel® oneAPI DPC++/C++ Compiler.
  • Command-line tools and IDE plug-ins that streamline operations.
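As a hedged example of the command-line workflow, a typical invocation might look like the following (the directory layout and file name are hypothetical; check the tool's documentation for the exact options):

```shell
# Hypothetical layout: migrate CUDA sources under ./src into ./migrated
dpct --in-root=./src --out-root=./migrated ./src/vector_add.cu
```

The tool writes migrated .dp.cpp files into the output directory, with inline comments where manual follow-up is needed.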
Additional Resources
Access a wide range of tutorials, videos, and webinar replays to learn more about DPC++ and the supporting tools on the Intel® oneAPI Toolkits Training site.
Learn about oneAPI and DPC++: programming models, programming interfaces, runtimes, APIs, and software development processes.
Look through our Get Started Guides for more in-depth information.
Look through our Tutorials for more in-depth information.
Look through the FPGA code samples for more in-depth information.

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.