SYCL* Foundations Code Walkthrough

Dylan Benito

In this sample walkthrough, we use a vector_add sample to demonstrate oneAPI concepts and functionality. The sample employs hardware acceleration to add two arrays of integers together. Throughout this walkthrough, you will learn about:

SYCL headers
Asynchronous exceptions from kernels
Device selectors for different accelerators
Buffers and accessors
Queues
parallel_for kernel

Download the vector_add source from GitHub.

SYCL Headers

Intel is currently using SYCL from the Khronos Group*, which includes language extensions developed through an open source community process. The Intel® oneAPI DPC++/C++ Compiler provides the sycl.hpp header file and the fpga_extensions.hpp header file includes FPGA support.

The following code snippet, taken from vector_add, demonstrates the different headers you need to support various accelerators.

//For CPU or GPU

#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
#include <string>
using namespace sycl;

//For FPGA

#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
#include <string>
#if FPGA_HARDWARE || FPGA_EMULATOR || FPGA_SIMULATOR
#include <sycl/ext/intel/fpga_extensions.hpp>
#endif
using namespace sycl;

Catch Asynchronous Exceptions from SYCL Kernels

SYCL kernels run asynchronously on accelerators in different stack frames. The kernel may have asynchronous errors that cannot be propagated up to the stack. To catch the asynchronous exceptions, the SYCL queue class provides a way for error handler functions.

The following code snippet, from vector_add, shows you how to create an exception handler.

// Use this to create an exception handler with catch asynchronous exceptions.

static auto exception_handler = [](sycl::exception_list eList) {
	for (std::exception_ptr const &e : eList) {
		try {
			std::rethrow_exception(e);
		}
		catch (std::exception const &e) {
#if _DEBUG
			std::cout << "Failure" << std::endl;
#endif
			std::terminate();
		}
	}
};
… … 
try {
    queue q(selector, exception_handler);
    … … 
} catch (exception const &e) {
    … … 
}

Using a Default Selector for Accelerators

You can select an accelerator for offload kernels in a straightforward manner. SYCL and oneAPI offer selectors that can discover and provide access to the hardware available in your environment.

The default_selector_v enumerates all available accelerators and selects the most performant one. SYCL also provides additional selector classes for the FPGA accelerator, including fpga_selector_v and fpga_emulator_selector_v, which you can find in fpga_extensions.hpp.

The following code snippet, taken from vector_add, demonstrates how to include FPGA selectors.

#if FPGA || FPGA_EMULATOR
#include <sycl/ext/intel/fpga_extensions.hpp>
#endif
… … 
#if FPGA_EMULATOR
  // Intel extension: FPGA emulator selector on systems without FPGA card.
  auto selector = sycl::ext::intel::fpga_emulator_selector_v;
#elif FPGA_SIMULATOR
  // Intel extension: FPGA simulator selector on systems without FPGA card.
  auto selector = sycl::ext::intel::fpga_simulator_selector_v;
#elif FPGA_HARDWARE
  // Intel extension: FPGA selector on systems with FPGA card.
  auto selector = sycl::ext::intel::fpga_selector_v;
#else
  // The default device selector will select the most performant device.
  auto selector = default_selector_v;
#endif

Data, Buffers, and Accessors

SYCL processes large pieces of data or computation using kernels that run on accelerators. The host declares data, which SYCL runtime then wraps in a buffer and implicitly transfers to the accelerators. Accelerators read or write to the buffer through an accessor. The runtime also determines the kernel dependencies from the used accessors and then dispatches and runs the kernels in the most efficient order. Remember the following:

a_vector, b_vector, and sum_parallel are array objects from the host.
a_buf, b_buf, and sum_buf serve as buffer wrappers.
a and b are read-only accessors, while sum is a write-only accessor.

The following code snippet, taken from vector_add, demonstrates how to use buffers and accessors.

  buffer a_buf(a_vector);
  buffer b_buf(b_vector);
  buffer sum_buf(sum_parallel.data(), num_items);
… … 
  q.submit([&](handler &h) {

// Create an accessor for each buffer with access permission: read, write or
// read/write. The accessor is a mean to access the memory in the buffer.    
    accessor a(a_buf, h, read_only);
    accessor b(b_buf, h, read_only);

// The sum_accessor is used to store (with write permission) the sum data.
    accessor sum(sum_buf, h, write_only, no_init);
… … 
  });

Queue and parallel_for Kernels

A SYCL queue encapsulates all the necessary context and states for kernel execution. By default, when no parameter is passed, a queue is created and associated with an accelerator through a default selector. It can also accept a specific device selector and an asynchronous exception handler, as used in vector_add.

You enqueue kernels to the queue for execution. Kernels come in different types: single task kernel, basic data-parallel kernel, hierarchical parallel kernel, etc. The vector_add uses the basic data-parallel parallel_for kernel, as shown in the following snippets.

try {
    queue q(selector, exception_handler);
    … … 
    q.submit([&](handler &h) {
    … … 
        h.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
  });  
} catch (exception const &e) {
    … … 
}

The kernel body is an addition of two arrays captured in the Lambda function.

sum[i] = a[i] + b[i];

The range of data the kernel can process is specified in the first parameter num_items of h.parallel_for. For example, a 1-D range with a size of num_items. Two read-only data, a_buf and b_buf, are transferred to the accelerator by the runtime. When the kernel is completed, the sum of the data in the sum_buf buffer is copied to the host when the sum_buf goes out of scope.

Summary

Device selectors, buffers, accessors, queues, and kernels are the building blocks of oneAPI programming. SYCL and community extensions are used to simplify data parallel programming. SYCL allows code reuse across hardware targets and enables high productivity and performance across CPU, GPU, and FPGA architectures while permitting accelerator-specific tuning.

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in