Code Sample: Particle Diffusion – An Intel® oneAPI DPC++/C++ Compiler Example

Published: 10/29/2020  

Last Updated: 10/29/2020

By Alberto Villarreal Cueva

File(s): GitHub
License: MIT
Optimized for...  
(Programming Language, Tool, IDE, Framework)
Intel® oneAPI DPC++/C++ Compiler
Prerequisites: Familiarity with C++ and an interest in DPC++
Requirements:  C++ programming experience and familiarity with the DPC++ basics is assumed (Queues, Buffers/Accessors and Kernels). If you do not have prior exposure to DPC++, read the DPC++ resources listed at the end of this tutorial and reivew the other code samples in the Explore DPC++ Language with Samples from Intel document.


This tutorial demonstrates an Intel® oneAPI DPC++/C++ Compiler example, showing the implementation of a Monte Carlo simulation of diffusion of water molecules in tissue. It illustrates the basics of the Data Parallel C++ (DPC++) programming language and techniques for optimization (API-based programming and Atomic Functions). This example is motivated by a medical imaging application (Diffusion-Weighted Imaging (DWI)).


Diffusion-Weighted Imaging (DWI) is a type of magnetic resonance (MR) imaging that uses the diffusion of water molecules to generate contrast in MR images. The diffusion patterns can reveal details about certain possible pathologies found in tissue. DWI signals can be simulated using a Monte Carlo simulation, where the trajectories of randomly moving water molecules are tracked through an arrangement of cells in biological tissue (represented here by simple geometric 2D objects, like circles or squares).

In this code sample, a simplified Monte Carlo simulation of the diffusion of water molecules in a 2D grid of cells is implemented in parallel using DPC++. More complex Monte Carlo simulations can be developed using this simple code sample as a starting point. One possible way to model the cells is to use a regular grid of cells, where each cell is modeled by a circle of radius R (which is constant across the 2D grid). Figure 1 shows the grid layout. Cells, represented as green circles with radius R, are located in the center of each grid cell.

Figure 1: 2D grid layout. Each cell is modeled by a circle of radius R.


Water molecule diffusion is simulated by defining particles P (simulated water molecules) at random positions in the grid, followed by random walks of these particles in the ensemble of cells in the grid (Figure 2). During the random walks, particles can move randomly inside or outside simulated cells.

Figure 2: Random walks of particles in the grid of cells.


The positions of these particles at every time step in the simulation, the number of times they go through a cell membrane (in/out), and the time every particle spends inside and outside cells can be recorded. These counters are an example of useful information that can be used to simulate an MR signal. A small set of these counters are simulated in this code sample. As each particle can move independently from the other particles, a large number of particles (water molecules) should be used for a useful simulation. This example is well suited for a device with a large number of processing elements, like a GPU.

Code Sample Walkthrough

main() Function

The parameters used in the simulation are defined in the main function.

int main(int argc, char* argv[]) {
  // Set command line arguments to their default values
  size_t n_iterations = 10000;
  size_t n_particles = 256;
  size_t grid_size = 22;
  int seed = 777;
  unsigned int cpu_flag = 0;
  unsigned int grid_output_flag = 1;

With the above information, you can allocate arrays to hold the position of each particle (in the X and Y directions) in the grid. Arrays to store the random displacements are also declared and allocated.

// Stores X and Y position of particles in the cell grid
float* particle_X = new float[n_particles];
float* particle_Y = new float[n_particles];
// Total number of motion events
const size_t n_moves = n_particles * n_iterations;
// Declare vectors to store random values for X and Y directions
float* random_X = new float[n_moves];
float* random_Y = new float[n_moves];

The grid of cells (shown in the figures above) is implemented in the code using a flattened 3D array (A 3D array turned into a one dimensional array). Although the cell grid is 2D, the third dimension in the grid array in the code stores the different counters in the simulation.   

// Size of 3rd dimension of 3D grid (3D matrix). 3 counters => 3 planes
const size_t planes = 3;
// Stores a grid of cells, initialized to zero.
size_t* grid = new size_t[grid_size * grid_size * planes]();

DPC++ Device Selector, Queues, and Buffers

A device queue is used to enqueue work that is executed in a particular device or accelerator, specified by the device_selector object: 

// Create a device queue using default or host/gpu selectors
default_selector device_selector;

// Create a device queue using DPC++ class queue
queue q(device_selector, dpc_common::exception_handler);

NOTE: This sample uses a user defined exception handler; in this sample it is defined in the dpc_common header file.

Next you can create buffer objects that let you communicate data between the device and the host. The buffers can store collections of elements that can be 1, 2, or 3 dimensional. In this case, they are used to store 1D arrays to represent the X and Y positions of particles, the 2D grid, and the random numbers for particle displacement.

// Create buffers using DPC++ buffer class
buffer random_X_buf(random_X, range(n_moves));
buffer random_Y_buf(random_Y, range(n_moves));
buffer particle_X_buf(particle_X, range(n_particles));
buffer particle_Y_buf(particle_Y, range(n_particles));
buffer grid_buf(grid, range(grid_size * grid_size * planes));

Buffers are directly connected to how C or C++ stores arrays. Data in these buffers can be communicated back and forth between the host and device using accessors. These accessors are used in kernel functions to communicate data to and from the host, via the buffers declared above.

NOTE: The accessor use for the grid is declared using an atomic access mode. This will prevent conflicting accesses to memory when different particles try to update the same counter in the grid array.

// Submit command group for execution
// h is a handler type
q.submit([&](auto& h) {
  // Declare accessors
  accessor particle_X_a(particle_X_buf, h);  // Read/write access
  accessor particle_Y_a(particle_Y_buf, h);
  accessor random_X_a(random_X_buf, h, read_only);
  accessor random_Y_a(random_Y_buf, h, read_only);
  // Use DPC++ atomic access mode to create atomic accessors
  accessor grid_a = grid_buf.get_access<access::mode::atomic>(h);

Random Numbers to Simulate Random Displacement of Particles

To simulate the random walk of the particles, generate random numbers in the device using functionality from Intel oneAPI Math Kernel Library (oneMKL). As indicated in the documentation, you can follow two steps to generate random numbers using the Random Number Generator (RNG) functions in oneMKL.

First, create and initialize the object for basic random number generator and the distribution generator. In this sample, a Gaussian distribution with mean=alpha and S=sigma (the maximum displacement for the particles at each time iteration) is used.

// Declare basic random number generator (BRNG) for random vector
mkl::rng::philox4x32x10 engine(q, seed);
// Distribution object
mkl::rng::gaussian<float, mkl::rng::gaussian_method::icdf> distr(alpha,

Next, call the generate routine to get random numbers with appropriate statistical distribution. The random values for X and Y directions are stored in the respective buffers.

mkl::rng::generate(distr, engine, n_moves, random_X_buf);
mkl::rng::generate(distr, engine, n_moves, random_Y_buf);

Those random numbers are used via the accessors to compute random displacement for particles:

// Set the displacements to the random numbers
float displacement_X = random_X_a[iter * n_particles + p];
float displacement_Y = random_Y_a[iter * n_particles + p];

Find more information about oneMKL and RNG functionality, including code samples, in the Intel® oneAPI Math Kernel Library Developer Guide and Reference.

Command Handler and Kernel Definition

Once you have declared and filled the buffers with all the necessary data for the simulation, you can offload work to the device by submitting a command group to the device queue created above. The command group includes the kernel function that is enqueued to the device defined in the queue. The accessors that let you communicate data between the host and device via the buffers is declared with:

q.submit([&](auto& h) {
  // Declare accessors
  accessor particle_X_a(particle_X_buf, h);  // Read/write access
  accessor particle_Y_a(particle_Y_buf, h);
  accessor random_X_a(random_X_buf, h, read_only);
  accessor random_Y_a(random_Y_buf, h, read_only);
  // Use DPC++ atomic access mode to create atomic accessors
  accessor grid_a = grid_buf.get_access<access::mode::atomic>(h);

The command group declared above takes a command group handler h as a parameter, which is constructed at run-time and includes all the requirements for a kernel to execute.

NOTE: One of the accessors is assigned to an atomic RW mode. As several particles in this simulation might be writing to the memory pointed by this accessor, you can prevent data races by enabling atomicity in the access mode. Not enabling atomic access mode for this access may lead to incorrect results because several particles may try to access a memory location simultaneously. For example, in the case where more than one particle enters a single cell in one iteration.

Once you have defined the command group handler h, you can use its parallel_for function to send a kernel function to execute on the device. This kernel function is constructed using C++ lambda functions syntax. The parallel_for function creates a number of instances of the kernel function (specified by the range) and adds them to the command queue to be executed in parallel in the device.

// Send a DPC++ kernel (lambda) for parallel execution
h.parallel_for(range(n_particles), [=](auto item) {
  // Particle number (used for indexing)
  size_t p = item.get_id(0);

The first parameter in the parallel_for function defines the range where the kernel function is executed, which is given by the n_particles variable.

In the code snippet above, the body of the kernel function defines the code that is executed in parallel by every instance of the kernel function created by the parallel_for function; in this case, it is the displacement of each particle in the cell grid. The index variable item uniquely identifies (indexes) the instance of the kernel function executing in the device, which represents each particle in the simulation. Inside the kernel definition, every particle moves using the random numbers generated and stored in the buffers.

// Each particle performs this loop
for (size_t iter = 0; iter < n_iterations; ++iter) {
  // Set the displacements to the random numbers
  float displacement_X = random_X_a[iter * n_particles + p];
  float displacement_Y = random_Y_a[iter * n_particles + p];
  // Displace particles
  particle_X_a[p] += displacement_X;
  particle_Y_a[p] += displacement_Y;

After each particle performs random displacement, the distances from the particles to closest grid points in the cell grid are computed. Using these values it is possible to test if the particle is inside the cell or not, if the particle just entered a cell or just left one, etc.  The logic for the tests is described in the comments in the code.

// Compute distances from particle position to grid point i.e.,
// the particle's distance from center of cell. Subtract the
// integer value from floating point value to get just the
// decimal portion. Use this value to later determine if the
// particle is inside or outside of the cell
float dX = sycl::abs(particle_X_a[p] - sycl::round(particle_X_a[p]));
float dY = sycl::abs(particle_Y_a[p] - sycl::round(particle_Y_a[p]));
/* Grid point indices closest the particle, defined by the following:
|               Condition               |         Result         |
|particle_X + 0.5 >= ceiling(particle_X)|iX = ceiling(particle_X)|
|particle_Y + 0.5 >= ceiling(particle_Y)|iY = ceiling(particle_Y)|
|particle_X + 0.5 < ceiling(particle_X) |iX = floor(particle_X)  |
|particle_Y + 0.5 < ceiling(particle_Y) |iY = floor(particle_Y)  |
------------------------------------------------------------------  */
int iX = sycl::floor(particle_X_a[p] + 0.5);
int iY = sycl::floor(particle_Y_a[p] + 0.5);

/* There are 5 cases when considering particle movement about the grid.

For example, the following code snippet tests case 1, where the particle was not inside a cell in the previous iteration, but after the last displacement the particle it is now inside the cell defined by the grid points iX and iY. In this case counters 2 and 3 are marked to be updated.

// Check if particle's grid indices are still inside computation grid
if ((iX < grid_size) && (iY < grid_size) && (iX >= 0) && (iY >= 0)) {
  // Compare the radius to particle's distance from center of cell
  if (radius >= sycl::sqrt(dX * dX + dY * dY)) {
    // Satisfies counter 1 requirement for cases 1, 3, 4
    increment_C1 = true;
    // Case 1
    if (!inside_cell) {
      increment_C2 = true;
      increment_C3 = true;
      inside_cell = true;
      update_coordinates = true;

After the tests, and depending on the motion of each particle with respect to the cells, the appropriate counters are updated (using atomic functions).

// Counter 1 layer of the grid (0 * grid_size * grid_size)
layer = 0;
if (increment_C1)
  atomic_fetch_add<size_t>(grid_a[curr_coordinates + layer], 1);

// Counter 2 layer of the grid (1 * grid_size * grid_size)
layer = gs2;
if (increment_C2)
  atomic_fetch_add<size_t>(grid_a[curr_coordinates + layer], 1);

// Counter 3 layer of the grid (2 * grid_size * grid_size)
layer = gs2 + gs2;
if (increment_C3)
  atomic_fetch_add<size_t>(grid_a[curr_coordinates + layer], 1);

Compiling/Running the Code and Result Information

To compile and execute the code, follow the directions in the README file in the code sample directory in GitHub. To get the expected results, this walkthrough assumes that an Intel® Processor Graphics Gen9 is present in the system. If this is not the case, the code will execute on the CPU.


This tutorial describes a parallel implementation of a Monte Carlo Particle Diffusion Simulation. The complete code is available to download, and it is intended to show the basic elements of the DPC++ programming language, optimizations as generating random numbers in the device (using functionality from oneMKL), and atomic functions to prevent memory conflicts.

NOTE: This code sample is for educational purposes only; it is not optimized so that the use of DPC++ basic structures can be explicitly shown. This code is intended to provide a basic understanding of DPC++ and can be used as a starting point for more code development in DPC++.


Notices and Disclaimers

Intel® technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.



Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at