SYCL* Foundations Code Walkthrough

ID 657785
Updated 1/11/2023
Version Latest
Public


This walkthrough uses the vector_add sample to demonstrate oneAPI concepts and functionality. The sample adds two arrays of integers together using hardware acceleration. In this walkthrough, you will learn about:

  • SYCL headers
  • Asynchronous exceptions from kernels
  • Device selectors for different accelerators
  • Buffers and accessors
  • Queues
  • parallel_for kernel

Download the vector_add source from GitHub.

SYCL Headers

Intel currently uses SYCL from the Khronos Group* and includes language extensions developed using an open source community process. The header file, sycl.hpp, is provided in the Intel® oneAPI DPC++/C++ Compiler. FPGA support is included with the fpga_extensions.hpp header file.

The code snippet below, from vector_add, shows the different headers you need for supporting different accelerators.

//For CPU or GPU

#include <CL/sycl.hpp>
#include <array>
#include <iostream>
using namespace sycl;

//For FPGA

#include <CL/sycl.hpp>
#include <array>
#include <iostream>
#if FPGA || FPGA_EMULATOR
#include <CL/sycl/INTEL/fpga_extensions.hpp>
#endif
using namespace sycl;

Catch Asynchronous Exceptions from SYCL Kernels

SYCL kernels run asynchronously on accelerators, in stack frames separate from the host code. Errors that occur inside a kernel therefore cannot be propagated up the host call stack as ordinary C++ exceptions. To catch these asynchronous exceptions, the SYCL queue class lets you register an asynchronous exception handler function.

The code snippet below, from vector_add, shows you how to create an exception handler.

// Use this to create an exception handler that catches asynchronous exceptions.

static auto exception_handler = [](cl::sycl::exception_list eList) {
	for (std::exception_ptr const &e : eList) {
		try {
			std::rethrow_exception(e);
		}
		catch (std::exception const &e) {
#if _DEBUG
			std::cout << "Failure" << std::endl;
#endif
			std::terminate();
		}
	}
};
… … 
try {
    queue q(d_selector, exception_handler);
    … … 
} catch (exception const &e) {
    … … 
}
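
The handler is not invoked at the moment a kernel fails. The runtime collects asynchronous exceptions and passes them to the handler at synchronization points, such as a call to queue::wait_and_throw(). The self-contained sketch below is not taken from vector_add; the handler name and the kernel body are illustrative placeholders.

#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

static auto report_handler = [](exception_list eList) {
  for (std::exception_ptr const &e : eList) {
    try {
      std::rethrow_exception(e);
    } catch (std::exception const &ex) {
      std::cout << "Async exception: " << ex.what() << std::endl;
    }
  }
};

int main() {
  queue q(default_selector{}, report_handler);
  q.submit([&](handler &h) {
    h.single_task([=]() { /* kernel work */ });
  });
  // Pending asynchronous exceptions are passed to report_handler here.
  q.wait_and_throw();
  return 0;
}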

Using a Default Selector for Accelerators

Selecting an accelerator to offload kernels to is straightforward. SYCL and oneAPI provide selectors that can discover and provide access to the hardware available in your environment. The default_selector enumerates all the available accelerators and selects the most performant one among them.

SYCL provides additional selector classes for FPGA accelerators: the fpga_selector and fpga_emulator_selector classes, which are found in fpga_extensions.hpp.

The code snippet below, from vector_add, shows you how to include FPGA selectors.

#if FPGA || FPGA_EMULATOR
#include <CL/sycl/INTEL/fpga_extensions.hpp>
#endif
… … 
#if FPGA_EMULATOR
  // SYCL extension: FPGA emulator selector on systems without FPGA card.
  INTEL::fpga_emulator_selector d_selector;
#elif FPGA
  // SYCL extension: FPGA selector on systems with FPGA card.
  INTEL::fpga_selector d_selector;
#else
  // The default device selector will select the most performant device.
  default_selector d_selector;
#endif
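
To verify which device a selector actually chose, you can query the queue after constructing it. The short sketch below is not part of the sample; it assumes the headers, using directive, and d_selector shown above.

  queue q(d_selector);
  std::cout << "Running on device: "
            << q.get_device().get_info<info::device::name>() << "\n";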

Data, Buffers, and Accessors

SYCL uses kernels that run on accelerators to process large amounts of data or computation. Data declared on the host is wrapped in a buffer and transferred to the accelerators implicitly by the SYCL runtime. The accelerators read from or write to the buffer through an accessor. The runtime also derives kernel dependencies from the accessors used, then dispatches and runs the kernels in the most efficient order. Keep the following in mind:

  • a_array, b_array, and sum_parallel are array objects from the host.
  • a_buf, b_buf, and sum_buf are buffer wrappers.
  • a and b are read-only accessors; sum is a write-only accessor.

The code snippet below, from vector_add, shows you how to use buffers and accessors.

  buffer a_buf(a_array);
  buffer b_buf(b_array);
  buffer sum_buf(sum_parallel.data(), num_items);
… … 
  q.submit([&](handler &h) {

// Create an accessor for each buffer with access permission: read, write, or
// read/write. The accessor is used to access the memory in the buffer.
    
    accessor a(a_buf, h, read_only);
    accessor b(b_buf, h, read_only);

// The sum accessor is used to store (with write permission) the sum data.

    accessor sum(sum_buf, h, write_only);
… … 
  });
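
Because dependencies are derived from accessors, kernels submitted to the same queue are ordered automatically when they touch the same buffer. The following self-contained sketch is not from vector_add (the names buf, out, and io are illustrative): it submits two kernels where the second reads and updates the buffer the first one writes, so the runtime runs them in that order.

#include <CL/sycl.hpp>
#include <vector>
using namespace sycl;

int main() {
  constexpr size_t N = 8;
  std::vector<int> data(N, 0);
  {
    buffer buf(data);
    queue q;

    // Kernel 1 writes buf.
    q.submit([&](handler &h) {
      accessor out(buf, h, write_only);
      h.parallel_for(range<1>(N), [=](auto i) { out[i] = static_cast<int>(i[0]); });
    });

    // Kernel 2 reads and updates buf, so it runs after kernel 1 completes.
    q.submit([&](handler &h) {
      accessor io(buf, h, read_write);
      h.parallel_for(range<1>(N), [=](auto i) { io[i] += 1; });
    });
  } // buf is destroyed here and the results are copied back into data.
  return 0;
}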

Queue and parallel_for Kernels

A SYCL queue encapsulates all the context and state needed for kernel execution. By default, when no parameter is passed, a queue is created and associated with an accelerator chosen by the default selector. A queue can also take a specific device selector and an asynchronous exception handler, which is how it is used in vector_add.

Kernels are enqueued to the queue and executed. There are different types of kernels: single-task kernels, basic data-parallel kernels, hierarchical parallel kernels, and so on. The basic data-parallel parallel_for kernel is used in vector_add, as shown in the snippet below.

try {
    queue q(d_selector, exception_handler);
    … … 
    q.submit([&](handler &h) {
    … … 
        h.parallel_for(num_items, [=](auto i) { sum[i] = a[i] + b[i]; });
    });
} catch (exception const &e) {
    … … 
}

The kernel body is the addition of two arrays, captured in the lambda function.

sum[i] = a[i] + b[i]; 

The range of data the kernel processes is specified by the first parameter of h.parallel_for, num_items: in this example, a 1-D range of size num_items. The two read-only inputs, a_array and b_array, are transferred to the accelerator by the runtime. When the kernel completes, the results in sum_buf are copied back to the host as sum_buf goes out of scope.
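
Putting these pieces together, the sketch below is a simplified, self-contained variant of vector_add (with illustrative names, not the exact sample source). The buffers are created inside a scope, the kernel adds the two inputs, and the results are available on the host once the buffers are destroyed at the end of that scope.

#include <CL/sycl.hpp>
#include <array>
#include <iostream>
using namespace sycl;

int main() {
  constexpr size_t N = 4;
  std::array<int, N> a{1, 2, 3, 4}, b{10, 20, 30, 40}, sum{};
  {
    queue q(default_selector{});
    buffer a_buf(a), b_buf(b), sum_buf(sum);

    q.submit([&](handler &h) {
      accessor pa(a_buf, h, read_only);
      accessor pb(b_buf, h, read_only);
      accessor ps(sum_buf, h, write_only);
      h.parallel_for(range<1>(N), [=](auto i) { ps[i] = pa[i] + pb[i]; });
    });
  } // sum_buf goes out of scope here; the results are copied back into sum.

  for (int v : sum) std::cout << v << " ";
  std::cout << std::endl;
  return 0;
}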

Summary

Device selectors, buffers, accessors, queues, and kernels are the building blocks of oneAPI programming. SYCL and community extensions simplify data-parallel programming. SYCL allows code reuse across hardware targets and enables high productivity and performance across CPU, GPU, and FPGA architectures, while permitting accelerator-specific tuning.