Developer Guide

Executing Multiple Kernels on the Device at the Same Time

DPC++ has two kinds of queues that a programmer can create and submit kernels for execution.
in-order queues
where kernels are executed in the order they were submitted to the queue
out-of-order queues
where kernels can be executed in an arbitrary order (subject to the dependency constraints among them).
The choice to create an in-order or out-of-order queue is at queue construction time through the property
sycl::property::queue::in_order()
. By default, when no property is specified the queue is out-of-order.
In the following example, three kernels are submitted per iteration. Each of these kernels uses only one work-group with 256 work-items. These kernels are created specifically with one group to ensure that they do not use the entire machine. This is done to illustrate the benefit of parallel kernel execution.
int multi_queue(sycl::queue &q, const IntArray &a, const IntArray &b) { size_t num_items = a.size(); IntArray s1, s2, s3; sycl::buffer a_buf(a); sycl::buffer b_buf(b); sycl::buffer sum_buf1(s1); sycl::buffer sum_buf2(s2); sycl::buffer sum_buf3(s3); size_t num_groups = 1; size_t wg_size = 256; auto start = std::chrono::steady_clock::now(); for (int i = 0; i < iter; i++) { q.submit([&](sycl::handler &h) { sycl::accessor a_acc(a_buf, h, sycl::read_only); sycl::accessor b_acc(b_buf, h, sycl::read_only); sycl::accessor sum_acc(sum_buf1, h, sycl::write_only, sycl::no_init); h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size), [=](sycl::nd_item<1> index) { size_t loc_id = index.get_local_id(); sum_acc[loc_id] = 0; for (int j = 0; j < 1000; j++) for (size_t i = loc_id; i < array_size; i += wg_size) { sum_acc[loc_id] += a_acc[i] + b_acc[i]; } }); }); q.submit([&](sycl::handler &h) { sycl::accessor a_acc(a_buf, h, sycl::read_only); sycl::accessor b_acc(b_buf, h, sycl::read_only); sycl::accessor sum_acc(sum_buf2, h, sycl::write_only, sycl::no_init); h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size), [=](sycl::nd_item<1> index) { size_t loc_id = index.get_local_id(); sum_acc[loc_id] = 0; for (int j = 0; j < 1000; j++) for (size_t i = loc_id; i < array_size; i += wg_size) { sum_acc[loc_id] += a_acc[i] + b_acc[i]; } }); }); q.submit([&](sycl::handler &h) { sycl::accessor a_acc(a_buf, h, sycl::read_only); sycl::accessor b_acc(b_buf, h, sycl::read_only); sycl::accessor sum_acc(sum_buf3, h, sycl::write_only, sycl::no_init); h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size), [=](sycl::nd_item<1> index) { size_t loc_id = index.get_local_id(); sum_acc[loc_id] = 0; for (int j = 0; j < 1000; j++) for (size_t i = loc_id; i < array_size; i += wg_size) { sum_acc[loc_id] += a_acc[i] + b_acc[i]; } }); }); } q.wait(); auto end = std::chrono::steady_clock::now(); std::cout << "multi_queue completed on device - took " << (end - start).count() << " u-secs\n"; // check results return ((end - start).count()); } // end multi_queue
In the case where the underlying queue is in-order, these kernels cannot be executed in parallel and have to be executed sequentially even though there are adequate resources in the machine and there are no dependencies among the kernels. This can be seen from the larger total execution time for all the kernels. The creation of the queue and the kernel submission is shown below.
sycl::property_list q_prop{sycl::property::queue::in_order()}; std::cout << "In order queue: Jitting+Execution time\n"; sycl::queue q1(d_selector, q_prop); multi_queue(q1, a, b); usleep(500 * 1000); std::cout << "In order queue: Execution time\n"; multi_queue(q1, a, b);
When the queue is out-of-order, the overall execution time is much lower, indicating that the machine is able to execute different kernels from the queue at the same time. The creation of the queue and the invocation of the kernel is shown below.
sycl::queue q2(d_selector); std::cout << "Out of order queue: Jitting+Execution time\n"; multi_queue(q2, a, b); usleep(500 * 1000); std::cout << "Out of order queue: Execution time\n"; multi_queue(q2, a, b);
In situations where kernels do not scale strongly and therefore cannot effectively utilize full machine compute resources, it is better to allocate only the required compute units through appropriate selection of work-group/work-item values and try to execute multiple kernels at the same time.
The following timeline view shows the kernels being executed by in-order and out-of-order queues (this was collected using the
clIntercept
tool when the binary is run on the OpenCL back-end). Here one can clearly see that kernels submitted to the out-of-order queue are being executed in parallel. Another thing to notice is that not all three kernels are executed in parallel. The reason for this is a heuristic the driver uses in deciding about the kernel submissions. When the driver sees the GPU is not active, it immediately submits the kernel for execution and then buffers the kernels in the queue.
Timeline for kernels executed with in-order and out-of-order queues
Timeline for kernels executed with in-order and out-of-order queues
It is also possible to statically partition a single device into sub-devices through the use of
create_sub_devices
function of
device class
. This provides more control to the programmer for submitting kernels to an appropriate sub-device. However, the partition of a device into sub-devices is static, so the runtime will not be able to adapt to the dynamic load of an application because it does not have flexibility to move kernels from one sub-device to another.
At the time of writing, only the OpenCL backend is able to execute the kernels out of order. Support in the Level Zero backend to execute kernels out of order is still in development.

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.