Code Sample: Vector Add – An Intel® oneAPI DPC++/C++ Compiler Example

Published: 08/26/2020  

Last Updated: 03/22/2022

By Dylan Benito

File(s): GitHub
License: MIT
Optimized for (software): Intel® oneAPI DPC++/C++ Compiler
Prerequisites: Familiarity with C++ and an interest in SYCL*

This tutorial describes parallel implementations for adding two vectors using SYCL*. The two code samples (vector-add-buffers.cpp and vector-add-usm.cpp), which demonstrate two different memory management techniques, are available on GitHub. You can use these samples as a starting point for developing more complex applications.

This tutorial and its code examples use the Intel® oneAPI DPC++/C++ Compiler and assume the compiler and its environment have been set up and configured correctly. Building and running these samples verifies that your development environment is ready to use the core features of SYCL.

Introduction

Writing a SYCL application involves a few basic tasks: specifying the device for offload (where kernels execute) and managing the interaction and data movement between the device and the host. In these samples, you will learn how to write the kernel and how to use two memory management techniques: buffers and Unified Shared Memory (USM).

The samples use two .cpp files to demonstrate the options independently: 

  • vector-add-buffers.cpp uses buffers, along with accessors, to copy data to and from the device. Buffers provide data mapping between the host and the accelerator: 1-, 2-, or 3-dimensional arrays are placed into buffers, and work is submitted to a queue. The queue provides work scheduling, orchestration, and high-level parallel operations. Work is submitted to the queue using a lambda that encapsulates the work kernel and the data needed for its execution. The buffers are initialized on the host and accessed by the lambda, which requests read access for the input vectors and write access for the output vector.
  • vector-add-usm.cpp uses USM for data management. It uses a pointer-based approach, like malloc or new, to allocate memory. USM requires hardware support for a unified virtual address space (this allows for consistent pointer values between the host and device). All memory is allocated by the host. USM offers three distinct allocation types:
    • Shared: located on either the host or the device (placement and migration are managed by the runtime), accessible by both the host and the device.
    • Device: located on the device, accessible only by the device.
    • Host: located on the host, accessible by the host or the device.

The following diagram illustrates the difference between the two:

Left: both the host and the device (for example, a discrete GPU) may have their own physical memories. Right: the logical view of USM, which provides a unified address space across the host and the device, even when their memories are physically separate.

Problem Statement

You can compute a third vector from two vectors by adding their corresponding elements. This simple but fundamental computation appears in many linear algebra algorithms used across a wide range of application areas. Using this problem, the tutorial demonstrates two parallel implementations, one with SYCL buffers and one with USM. It also provides a sequential implementation to verify that the result of the offloaded computation is correct.
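As a reference point, the sequential version is a single loop over the elements. A minimal sketch (the function name is illustrative, not taken from the sample):

```cpp
#include <cstddef>
#include <vector>

// Sequential reference: sum[i] = a[i] + b[i] for every element.
void vector_add_sequential(const std::vector<int> &a,
                           const std::vector<int> &b,
                           std::vector<int> &sum) {
  for (std::size_t i = 0; i < a.size(); i++) sum[i] = a[i] + b[i];
}
```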

Parallel Implementation and Sample Code Walkthrough

Sample 1: Buffers

The basic SYCL implementation explained in this code sample includes device selector, queue, buffer, accessor, kernel, and command group.

Create Device Selector

Using SYCL, you can offload computation from the CPU to a device. The first step is to select a device (the sample can target an FPGA, an FPGA emulator, or the default device). Based on the availability of devices and their intended use cases, create a device selector object:

  • Default selector: default_selector
  • For FPGA: INTEL::fpga_selector
  • For an FPGA emulator: INTEL::fpga_emulator_selector

This sample uses preprocessor definitions to choose between device selectors: FPGA, FPGA emulator, or the default device. Provide the appropriate definition when compiling the sample to create the intended device selector object; the makefile illustrates how to specify such a definition. If no definition is specified at compile time, the default device selector object is created, which has the advantage of selecting the most performant available device at runtime. So, if you do not intend to use an FPGA or an FPGA emulator, simply omit the definition.

int main(int argc, char* argv[]) {
  // Change vector_size if it was passed as argument
  if (argc > 1) vector_size = std::stoi(argv[1]);

  // Create device selector for the device of your interest.
#if FPGA_EMULATOR
  // SYCL extension: FPGA emulator selector on systems without FPGA card.
  INTEL::fpga_emulator_selector d_selector;
#elif FPGA
  // SYCL extension: FPGA selector on systems with FPGA card.
  INTEL::fpga_selector d_selector;
#else
  // The default device selector will select the most performant device.
  default_selector d_selector;
#endif

Create a Command Queue

The next step is to create a command queue. Any computation you want to offload to a device is submitted to this queue. The sample instantiates a command queue by passing the following arguments to its constructor:

  • A device selector.
  • An exception handler.

The first argument, the device selector, is the object you created in the Create Device Selector section above. The second argument, the exception handler, handles any asynchronous exception that the offloaded computation may raise during execution. The sample implements an exception_handler for this purpose:

 queue q(d_selector, exception_handler);

The exception_handler handles exceptions by invoking std::terminate() to terminate the process. On debug builds, the handler first prints the message Failure and then terminates the process.

If you want to handle exceptions differently, without terminating the process for example, then you can provide your own exception handler.

Create and Initialize Vectors

For a vector_add operation, you need two vectors as inputs and a third vector to store the result. This sample actually creates two result vectors: one holds the output of the sequential computation performed on the host, and the other retains the output of the parallel computation performed on the device.

Two input vectors are initialized with values: 0, 1, 2, …

Allocate Device Visible Memory

To perform the computation on the device, you need to make the input vectors visible to the device. You also need to copy the computed result back from the device to the host.

Along with offloading the compute to the device, this sample demonstrates how to make the input visible to the device and how to copy the computed result back. There are two major options to achieve this goal:

  • Buffers and accessors
  • USM

Buffers and Accessors

This code sample uses buffers and accessors. Below is a high-level summary of this technique and the relevant code snippets.

Create Buffers

This implementation demonstrates two ways to create buffers: pass a reference to a C++ container to the buffer constructor, or pass a pointer and a size. The first option is useful when your data is already in a C++/STL container. The second option is useful when the data is in a regular C/C++ array, for example:

// Create buffers that hold the data shared between the host and the devices.
// The buffer destructor is responsible for copying the data back to the host
// when the buffer goes out of scope.
  
  buffer a_buf(a_vector);
  buffer b_buf(b_vector);
  buffer sum_buf(sum_parallel.data(), num_items);

Create Accessors

You create accessors from the respective buffer objects by specifying a data access mode. In vector_add, the device reads the first two vectors and writes the third. For the input buffers, a_buf and b_buf, read access (access::mode::read) is requested to obtain the appropriate accessors; for the output buffer, sum_buf, write access (access::mode::write) is requested.

Specifying the appropriate access mode matters for both correctness and performance. Write access for the output vector tells the Intel® oneAPI DPC++/C++ Compiler that the result computed on the device must be copied back to the host at the end of the computation; this is required for correctness. At the same time, the runtime avoids copying the contents of the result vector from the host to the device before the computation, which improves performance by eliminating an unneeded host-to-device copy. This is safe because the computation overwrites sum_buf.

 // Create an accessor for each buffer with access permission: read, write or
 // read/write. An accessor is a means to access the memory in the buffer.

    accessor a(a_buf, h, read_only);
    accessor b(b_buf, h, read_only);

 // The sum_accessor is used to store (with write permission) the sum data.

    accessor sum(sum_buf, h, write_only);

Command Group Handler

The command group handler object is constructed by the SYCL runtime. All accessors defined in a command group take the command group handler as an argument; this is how the runtime keeps track of data dependencies. The kernel invocation functions are member functions of the command group handler class. A command group handler object cannot be copied or moved.

parallel_for

parallel_for is a commonly used SYCL programming construct. While the iterations of a C++ for loop run sequentially, multiple logical iterations of a parallel_for can run simultaneously on multiple execution/compute units of the device, so the overall computation finishes faster. It is suited to data-parallel computations in which each logical iteration executes the same code but operates on a different piece of data (also known as single instruction, multiple data (SIMD)). parallel_for works best when there are no data dependencies between logical iterations.

In this sample, parallel_for is applicable because the corresponding elements of the two input vectors can be added independently and in parallel to compute the individual elements of the result vector, with no dependency on other elements of the input or output vectors. parallel_for offloads the compute onto the device. The first argument is the number of work items; for vector_add, this is simply the number of elements in the vector. The second argument is the kernel encapsulating the compute for each work item. Each work item computes the sum of two elements from the input vectors and writes the result to the corresponding element of the output vector.

The offloaded work (parallel_for) continues asynchronously on the device. In the following snippet, synchronization happens implicitly: when the enclosing scope ends, the buffer destructors wait for the kernel to complete and copy the results back to the host.

    // Use parallel_for to run vector addition in parallel on device. This
    // executes the kernel.
    //    1st parameter is the number of work items.
    //    2nd parameter is the kernel, a lambda that specifies what to do per
    //    work item. The parameter of the lambda is the work item id.
    // SYCL supports unnamed lambda kernel by default.

    h.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
  });
}

Kernel for Vector Add (Parallel Compute on the Device)

The compute is offloaded to the device by submitting a lambda function to the command queue. The device accesses the two input vectors and the output vector through the accessors. The previous sections covered how the buffers and accessors are set up and how parallel_for computes the result vector.

Before the result buffer (sum_buf) goes out of scope, the SYCL runtime copies the computed result from the device memory over to the host memory.

// Submit a command group to the queue by a lambda function that contains the
// data access permission and device computation (kernel).

q.submit([&](handler &h) {

  // Create an accessor for each buffer with access permission: read, write or
  // read/write. An accessor is a means to access the memory in the buffer.

    accessor a(a_buf, h, read_only);
    accessor b(b_buf, h, read_only);

  // The sum_accessor is used to store (with write permission) the sum data.

    accessor sum(sum_buf, h, write_only);

  // Use parallel_for to run vector addition in parallel on device. This
  // executes the kernel.
  // The 1st parameter is the number of work items.
  // The 2nd parameter is the kernel, a lambda that specifies what to do per
  // work item. The parameter of the lambda is the work item ID.
  // SYCL supports unnamed lambda kernel by default.

    h.parallel_for(num_items, [=](id<1> i) { sum[i] = a[i] + b[i]; });
  });
}

Sample 2: USM

USM offers three different types of allocations: device, host, and shared. This sample uses shared allocations.

Shared allocations are accessible on both the host and the device. They resemble host allocations, but differ in that data can migrate between host memory and device-local memory. After the migration from host memory to device-local memory completes, accesses on the device come from device-local memory instead of remotely accessing host memory. This is accomplished by the SYCL runtime and lower-level drivers.

Shared allocations use implicit data movement. With this allocation type, you do not explicitly insert copy operations to move data between the host and the device. Instead, you access the data through pointers inside a kernel, and any required data movement is performed automatically. This simplifies porting existing code to SYCL: you replace malloc or new with the appropriate SYCL USM allocation functions. Like SYCL buffers and device allocations, shared allocation is a software abstraction; the difference is that the data is migrated implicitly rather than explicitly.

With the necessary hardware support, shared data migrates page by page between the host and the device. With page migration, computation overlaps with incremental data movement instead of waiting for the whole transfer to complete, and this implicit overlap can increase throughput. This is an advantage of shared allocation: with device allocations and SYCL buffers, the computation does not start until the data transfer is complete.

Memory Allocation and Memory Free

Use malloc_shared() to allocate shared memory accessible to both the host and the device. The two arguments are:

  • The number of elements in the vector.
  • The queue (associated with the device to which the compute is offloaded).

Use free() to release the memory. free() in SYCL is similar to its C counterpart, but takes the queue as an extra argument:

    // Create arrays with "array_size" to store input and output data. Allocate
    // unified shared memory so that both CPU and device can access them.

    int *a = malloc_shared<int>(array_size, q);
    int *b = malloc_shared<int>(array_size, q);
    int *sum_sequential = malloc_shared<int>(array_size, q);
    int *sum_parallel = malloc_shared<int>(array_size, q);

    if ((a == nullptr) || (b == nullptr) || (sum_sequential == nullptr) ||
        (sum_parallel == nullptr)) {
      if (a != nullptr) free(a, q);
      if (b != nullptr) free(b, q);
      if (sum_sequential != nullptr) free(sum_sequential, q);
      if (sum_parallel != nullptr) free(sum_parallel, q);

      std::cout << "Shared memory allocation failure.\n";
      return -1;
    }

Use USM

Like Sample 1, this sample employs parallel_for to add two vectors. q.parallel_for() returns an event, e; e.wait() waits until the work associated with that event, here the kernel, has completed, so the host can safely read the computed result after it migrates back from device-local memory to host memory:

// Create the range object for the arrays.

  range<1> num_items{size};

  // Use parallel_for to run vector addition in parallel on device. This
  // executes the kernel.
  //    1st parameter is the number of work items.
  //    2nd parameter is the kernel, a lambda that specifies what to do per
  //    work item. the parameter of the lambda is the work item id.
  // SYCL supports unnamed lambda kernel by default.

  auto e = q.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });


  // q.parallel_for() is an asynchronous call. SYCL runtime enqueues and runs
  // the kernel asynchronously. Wait for the asynchronous call to complete.

  e.wait();
}

Verification of Results

Once the computation is complete, you can compare the outputs from sequential and parallel computations to verify that the execution on the device computed the same result as on the host:

    // Compute the sum of two arrays sequentially for validation.

    for (size_t i = 0; i < array_size; i++) sum_sequential[i] = a[i] + b[i];

If the verification is successful, a few elements from input vectors and the result vector are printed to show the output of the computation:

    // Print out the result of vector add.

    for (int i = 0; i < indices_size; i++) {
      int j = indices[i];
      if (i == indices_size - 1) std::cout << "...\n";
      std::cout << "[" << j << "]: " << j << " + " << j << " = "
                << sum_sequential[j] << "\n";
    }

    free(a, q);
    free(b, q);
    free(sum_sequential, q);
    free(sum_parallel, q);
  } catch (exception const &e) {
    std::cout << "An exception is caught while adding two vectors.\n";
    std::terminate();
  }

  std::cout << "Vector add successfully completed on device.\n";
  return 0;
}

Summary

This tutorial demonstrates the basic features commonly used in a SYCL program: creating a queue, basic memory management (using buffers and accessors, or using USM), and writing a kernel to offload compute to the device. Use this tutorial as a starting point for developing more complex applications.
