Using Multiple Heterogeneous Devices
Most accelerators reside in a server that itself contains a significant amount of compute resources. For instance, a typical server can have up to eight sockets, each containing over 50 cores. SYCL provides the ability to treat the CPUs and the accelerators uniformly to distribute work among them. It is the responsibility of the programmer to ensure a balanced distribution of work among the heterogeneous compute resources in the platform.
Overlapping Compute on Various Accelerators in the Platform
SYCL provides access to different kinds of devices through the abstraction of device selectors. A queue can be created for each device, and kernels can be submitted to it for execution. All kernel submissions in SYCL are non-blocking: once a kernel is submitted to a queue for execution, the host does not wait for it to finish unless waiting on the queue is explicitly requested. This allows the host to do some work itself, or to initiate work on other devices, while the kernel is executing on the accelerator.
The host CPU can itself be treated as an accelerator, and SYCL can submit kernels to it for execution. This is completely independent of, and orthogonal to, the work the host does to orchestrate kernel creation and submission. The underlying operating system manages a kernel submitted to the CPU accelerator as another process and uses the same OpenCL/Level Zero runtime mechanisms to exchange information with the host.
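For example, one queue bound to a GPU and another bound to the CPU can be constructed with device selectors (a minimal sketch; the selector names assume a SYCL 2020 implementation such as DPC++, and the variable names are illustrative):

  // Queue bound to a GPU device
  sycl::queue q_gpu{sycl::gpu_selector_v};

  // Queue bound to the CPU, treated as just another accelerator
  sycl::queue q_cpu{sycl::cpu_selector_v};

  std::cout << "GPU queue runs on: "
            << q_gpu.get_device().get_info<sycl::info::device::name>() << "\n";
  std::cout << "CPU queue runs on: "
            << q_cpu.get_device().get_info<sycl::info::device::name>() << "\n";

Kernels submitted to q_gpu and q_cpu do not block the host and may execute concurrently on the two devices.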
The following example shows a simple vector add operation that works on a single GPU device.
size_t VectorAdd1(sycl::queue &q, const IntArray &a, const IntArray &b,
                  IntArray &sum, int iter) {
  sycl::range num_items{a.size()};

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    auto e = q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items,
                     [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector add1 completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
} // end VectorAdd1
In the following example the input vector is split into two parts, and the computation is performed on two different accelerators (one CPU and one GPU) that can execute concurrently. Care must be taken to ensure that the kernels, in addition to being submitted, are actually launched on the devices in order to get this parallelism. The actual time at which a kernel is launched can be substantially later than when it was submitted by the host: the implementation decides when to launch kernels based on heuristics intended to maximize metrics like utilization, throughput, or latency. For instance, in the case of the OpenCL backend, on certain platforms one needs to explicitly issue a clFlush on each queue (shown in the commented-out code near the end of the example) to launch the kernels on the accelerators.
size_t VectorAdd2(sycl::queue &q1, sycl::queue &q2, const IntArray &a,
                  const IntArray &b, IntArray &sum, int iter) {
  sycl::range num_items{a.size() / 2};

  auto start = std::chrono::steady_clock::now();
  {
    sycl::buffer a1_buf(a.data(), num_items);
    sycl::buffer b1_buf(b.data(), num_items);
    sycl::buffer sum1_buf(sum.data(), num_items);

    sycl::buffer a2_buf(a.data() + a.size() / 2, num_items);
    sycl::buffer b2_buf(b.data() + a.size() / 2, num_items);
    sycl::buffer sum2_buf(sum.data() + a.size() / 2, num_items);

    for (int i = 0; i < iter; i++) {
      q1.submit([&](auto &h) {
        // Input accessors
        sycl::accessor a_acc(a1_buf, h, sycl::read_only);
        sycl::accessor b_acc(b1_buf, h, sycl::read_only);
        // Output accessor
        sycl::accessor sum_acc(sum1_buf, h, sycl::write_only, sycl::no_init);

        h.parallel_for(num_items,
                       [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
      });
      // do the work on host
      q2.submit([&](auto &h) {
        // Input accessors
        sycl::accessor a_acc(a2_buf, h, sycl::read_only);
        sycl::accessor b_acc(b2_buf, h, sycl::read_only);
        // Output accessor
        sycl::accessor sum_acc(sum2_buf, h, sycl::write_only, sycl::no_init);

        h.parallel_for(num_items,
                       [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
      });
    }
    // On some platforms this explicit flush of queues is needed
    // to ensure the overlap in execution between the CPU and GPU
    //
    // cl_command_queue cq = q1.get();
    // clFlush(cq);
    // cq = q2.get();
    // clFlush(cq);
  }
  q1.wait();
  q2.wait();
  auto end = std::chrono::steady_clock::now();
  std::cout << "Vector add2 completed on device - took "
            << (end - start).count() << " u-secs\n";
  return ((end - start).count());
} // end VectorAdd2
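The two versions can be driven from a small main program such as the following sketch. The array size, iteration count, and the IntArray typedef shown here are assumptions that mirror the surrounding sample rather than values taken from it:

#include <sycl/sycl.hpp>
#include <array>
#include <iostream>

constexpr size_t array_size = 1 << 24;         // illustrative size
typedef std::array<int, array_size> IntArray;  // assumed definition of IntArray

int main() {
  // static storage avoids placing the large arrays on the stack
  static IntArray a, b, sum;
  a.fill(1);
  b.fill(2);

  sycl::queue q_gpu{sycl::gpu_selector_v};
  sycl::queue q_cpu{sycl::cpu_selector_v};

  // Single-device version: the whole vector is processed on the GPU.
  VectorAdd1(q_gpu, a, b, sum, 16);

  // Two-device version: first half on the GPU, second half on the CPU.
  VectorAdd2(q_gpu, q_cpu, a, b, sum, 16);

  return 0;
}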
Comparing the running times of the two versions shows that the application runs almost twice as fast as before, since more hardware resources are dedicated to solving the problem. To achieve good balance, you will have to split the work in proportion to the capability of each accelerator, instead of distributing it evenly as was done in the above example.
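The following sketch of a hypothetical VectorAdd3 shows one way to bias the split. The 3:1 GPU-to-CPU ratio used here is an assumption for illustration only; in practice the ratio should be derived from the measured throughput of each device on the actual kernel:

size_t VectorAdd3(sycl::queue &q1, sycl::queue &q2, const IntArray &a,
                  const IntArray &b, IntArray &sum, int iter) {
  // Assumed 3:1 split: three quarters of the elements go to q1 (GPU),
  // the remainder to q2 (CPU).
  size_t gpu_share = (a.size() * 3) / 4;
  size_t cpu_share = a.size() - gpu_share;
  sycl::range<1> gpu_items{gpu_share};
  sycl::range<1> cpu_items{cpu_share};

  auto start = std::chrono::steady_clock::now();
  {
    sycl::buffer a1_buf(a.data(), gpu_items);
    sycl::buffer b1_buf(b.data(), gpu_items);
    sycl::buffer sum1_buf(sum.data(), gpu_items);

    sycl::buffer a2_buf(a.data() + gpu_share, cpu_items);
    sycl::buffer b2_buf(b.data() + gpu_share, cpu_items);
    sycl::buffer sum2_buf(sum.data() + gpu_share, cpu_items);

    for (int i = 0; i < iter; i++) {
      // Larger portion of the work on the GPU queue
      q1.submit([&](auto &h) {
        sycl::accessor a_acc(a1_buf, h, sycl::read_only);
        sycl::accessor b_acc(b1_buf, h, sycl::read_only);
        sycl::accessor sum_acc(sum1_buf, h, sycl::write_only, sycl::no_init);
        h.parallel_for(gpu_items,
                       [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
      });
      // Smaller portion of the work on the CPU queue
      q2.submit([&](auto &h) {
        sycl::accessor a_acc(a2_buf, h, sycl::read_only);
        sycl::accessor b_acc(b2_buf, h, sycl::read_only);
        sycl::accessor sum_acc(sum2_buf, h, sycl::write_only, sycl::no_init);
        h.parallel_for(cpu_items,
                       [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
      });
    }
  }
  q1.wait();
  q2.wait();
  auto end = std::chrono::steady_clock::now();
  return ((end - start).count());
} // end VectorAdd3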