Migrating Between CPU, GPU, and FPGA

Intel® oneAPI Programming Guide

Download PDF

ID 771723

Date 12/17/2022

Version

Public

A newer version of this document is available. Customers should click here to go to the newest version.

Visible to Intel only — GUID: GUID-34ED9B9E-781C-4EFF-92D5-DC2E03D6C3D9

View Details

Migrating Between CPU, GPU, and FPGA

Programming with SYCL* and using the DPC++ compiler, a platform consists of a host device connected to zero or more devices, such as CPU, GPU, FPGA, or other kinds of accelerators and processors.

When a platform has multiple devices, design the application to offload some or most of the work to the devices. There are different ways to distribute work across devices in the oneAPI programming model:

Initialize device selector – SYCL provides a set of classes called selectors that allow manual selection of devices in the platform or let oneAPI runtime heuristics choose a default device based on the compute power available on the devices.

Splitting datasets – With a highly parallel application with no data dependency, explicitly divide the datasets to employ different devices. The following code sample is an example of dispatching workloads across multiple devices. Use icpx -fsycl snippet.cpp to compile the code.

int main() {
   int data[1024];
   for (int i = 0; i < 1024; i++)
       data[i] = i;
       try {
           cpu_selector cpuSelector;
           queue cpuQueue(cpuSelector);
           gpu_selector gpuSelector;
           queue gpuQueue(gpuSelector);
           buffer<int, 1> buf(data, range<1>(1024));
           cpuQueue.submit([&](handler& cgh) {
               auto ptr =
               buf.get_access<access::mode::read_write>(cgh);
               cgh.parallel_for<class divide>(range<1>(512),
                   [=](id<1> index) {
                   ptr[index] -= 1;
               });
           });
           gpuQueue.submit([&](handler& cgh1) {
               auto ptr =
               buf.get_access<access::mode::read_write>(cgh1);
               cgh1.parallel_for<class offset1>(range<1>(1024),
                   id<1>(512), [=](id<1> index) {
                       ptr[index] += 1;
               });
           });
           cpuQueue.wait();
           gpuQueue.wait();
      }
      catch (exception const& e) {
          std::cout <<
          "SYCL exception caught: " << e.what() << '\n';
          return 2;
      }
      return 0;
}

Target multiple kernels across devices – If the application has scope for parallelization on multiple independent kernels, employ different queues to target devices. The list of SYCL supported platforms can be obtained with the list of devices for each platform by calling get_platforms() and platform.get_devices() respectively. Once all the devices are identified, construct a queue per device and dispatch different kernels to different queues. The following code sample represents dispatching a kernel on multiple SYCL devices.

#include <stdio.h>
#include <vector>
#include <CL/sycl.hpp>
using namespace cl::sycl;
using namespace std;
int main()
{
       size_t N = 1024;
       vector<float> a(N, 10.0);
       vector<float> b(N, 10.0);
       vector<float> c_add(N, 0.0);
       vector<float> c_mul(N, 0.0);
   {
       buffer<float, 1> abuffer(a.data(), range<1>(N),
         { property::buffer::use_host_ptr() });
       buffer<float, 1> bbuffer(b.data(), range<1>(N),
         { property::buffer::use_host_ptr() });
       buffer<float, 1> c_addbuffer(c_add.data(), range<1>(N),
        { property::buffer::use_host_ptr() });
       buffer<float, 1> c_mulbuffer(c_mul.data(), range<1>(N),
         { property::buffer::use_host_ptr() });
    try {
             gpu_selector gpuSelector;
             auto queue = cl::sycl::queue(gpuSelector);
             queue.submit([&](cl::sycl::handler& cgh) {
                    auto a_acc = abuffer.template
                      get_access<access::mode::read>(cgh);
                    auto b_acc = bbuffer.template
                      get_access<access::mode::read>(cgh);
                    auto c_acc_add = c_addbuffer.template
                      get_access<access::mode::write>(cgh);
                    cgh.parallel_for<class VectorAdd>
                     (range<1>(N), [=](id<1> it) {
                         //int i = it.get_global();
                             c_acc_add[it] = a_acc[it] + b_acc[it];
                                  });
                           });
             cpu_selector cpuSelector;
             auto queue1 = cl::sycl::queue(cpuSelector);
             queue1.submit([&](cl::sycl::handler& cgh) {
                    auto a_acc = abuffer.template
                        get_access<access::mode::read>(cgh);
                    auto b_acc = bbuffer.template
                        get_access<access::mode::read>(cgh);
                    auto c_acc_mul = c_mulbuffer.template
                        get_access<access::mode::write>(cgh);
                    cgh.parallel_for<class VectorMul>
                     (range<1>(N), [=](id<1> it) {
                          c_acc_mul[it] = a_acc[it] * b_acc[it];
                                  });
                           });
              }
              catch (cl::sycl::exception e) {
/* In the case of an exception being throw, print the
error message and
                     * return 1. */
                     std::cout << e.what();
                     return 1;
              }
       }
       for (int i = 0; i < 8; i++) {
              std::cout << c_add[i] << std::endl;
              std::cout << c_mul[i] << std::endl;
       }
       return 0;
}

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

Intel® oneAPI Programming Guide

Migrating Between CPU, GPU, and FPGA