Using Multiple Heterogeneous Devices
Most accelerators reside in a server that itself has a significant
amount of compute resources. For instance, a typical server can
have up to eight sockets, each containing over 50 cores. DPC++
provides the ability to treat the CPUs and the accelerators uniformly and to
distribute work among them. It is the programmer's responsibility
to ensure a balanced distribution of work among the heterogeneous compute
resources in the platform.
Overlapping Compute on Various Accelerators in the Platform
DPC++ provides access to different kinds of devices through the abstraction of
device selectors. A queue can be created for each device, and kernels
can be submitted to it for execution. All kernel submissions in DPC++ are
non-blocking: once a kernel is submitted to a queue for
execution, the host does not wait for it to finish unless it explicitly waits
on the queue. This allows the host to do some work itself, or to
initiate work on other devices, while the kernel is executing on the
accelerator.
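For example, one queue can be bound to a GPU and another to the CPU using the standard SYCL 2020 device selectors. The snippet below is a minimal sketch (the trivial kernel and variable names are illustrative, not part of the vector add sample that follows); it simply shows that submit returns immediately and the host only blocks when it explicitly waits.
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  // One queue per device: the selectors pick a GPU and the host CPU.
  sycl::queue gpu_q(sycl::gpu_selector_v);
  sycl::queue cpu_q(sycl::cpu_selector_v);

  constexpr size_t N = 1024;
  std::vector<float> data(N, 1.0f);
  {
    sycl::buffer<float> buf(data.data(), sycl::range{N});

    // submit() returns immediately; the host continues while the GPU works.
    gpu_q.submit([&](sycl::handler &h) {
      sycl::accessor acc(buf, h, sycl::read_write);
      h.parallel_for(sycl::range{N}, [=](sycl::id<1> i) { acc[i] *= 2.0f; });
    });

    // The host can do its own work, or feed cpu_q, while the kernel runs.

    gpu_q.wait(); // block only when the result is actually needed
  }
  return 0;
}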
The host CPU can itself be treated as an accelerator, and DPC++ can submit
kernels to it for execution. This is completely independent of, and orthogonal
to, the work the host does to orchestrate kernel creation and submission. The
underlying operating system manages the kernels submitted to the CPU
accelerator as another process and uses the same OpenCL/Level Zero runtime
mechanisms to exchange information with the host device.
The following example shows a simple vector add operation that works on a
single GPU device.
int VectorAdd1(sycl::queue &q, const IntArray &a, const IntArray &b,
               IntArray &sum, int iter) {
  sycl::range num_items{a.size()};

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items,
                     [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
    });
  }
  // Wait for all submitted iterations to finish before stopping the timer.
  q.wait();
  auto end = std::chrono::steady_clock::now();
  auto elapsed =
      std::chrono::duration_cast<std::chrono::microseconds>(end - start);
  std::cout << "Vector add1 completed on device - took " << elapsed.count()
            << " u-secs\n";
  return elapsed.count();
} // end VectorAdd1
In the following example the input vector is split into two halves and the
computation is done on two different accelerators (one CPU and one GPU) that
can execute concurrently. Care must be taken to ensure that the kernels, in
addition to being submitted, are actually launched on the devices in order to
get this parallelism. The actual time at which a kernel is launched can be
substantially later than when it was submitted by the host; the implementation
decides when to launch the kernels based on heuristics that maximize
metrics like utilization, throughput, or latency. For instance, in the case of
the OpenCL backend, on certain platforms one needs to explicitly issue a
clFlush
on the queue (as shown in the commented-out calls near the end of VectorAdd2
below) to launch the kernels on the accelerators.
int VectorAdd2(sycl::queue &q1, sycl::queue &q2, const IntArray &a,
               const IntArray &b, IntArray &sum, int iter) {
  sycl::range num_items{a.size() / 2};

  auto start = std::chrono::steady_clock::now();
  {
    // The first half of the input goes to q1, the second half to q2.
    sycl::buffer a1_buf(a.data(), num_items);
    sycl::buffer b1_buf(b.data(), num_items);
    sycl::buffer sum1_buf(sum.data(), num_items);
    sycl::buffer a2_buf(a.data() + a.size() / 2, num_items);
    sycl::buffer b2_buf(b.data() + a.size() / 2, num_items);
    sycl::buffer sum2_buf(sum.data() + a.size() / 2, num_items);

    for (int i = 0; i < iter; i++) {
      q1.submit([&](auto &h) {
        // Input accessors
        sycl::accessor a_acc(a1_buf, h, sycl::read_only);
        sycl::accessor b_acc(b1_buf, h, sycl::read_only);
        // Output accessor
        sycl::accessor sum_acc(sum1_buf, h, sycl::write_only, sycl::no_init);

        h.parallel_for(num_items,
                       [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
      });
      // Submit the second half of the work to the other queue (the CPU device)
      q2.submit([&](auto &h) {
        // Input accessors
        sycl::accessor a_acc(a2_buf, h, sycl::read_only);
        sycl::accessor b_acc(b2_buf, h, sycl::read_only);
        // Output accessor
        sycl::accessor sum_acc(sum2_buf, h, sycl::write_only, sycl::no_init);

        h.parallel_for(num_items,
                       [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
      });
    }
    // On some platforms this explicit flush of queues is needed
    // to ensure the overlap in execution between the CPU and GPU
    // cl_command_queue cq = q1.get();
    // clFlush(cq);
    // cq = q2.get();
    // clFlush(cq);
  }
  q1.wait();
  q2.wait();
  auto end = std::chrono::steady_clock::now();
  auto elapsed =
      std::chrono::duration_cast<std::chrono::microseconds>(end - start);
  std::cout << "Vector add2 completed on device - took " << elapsed.count()
            << " u-secs\n";
  return elapsed.count();
} // end VectorAdd2
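For completeness, here is one way these functions might be driven. This is a sketch under assumptions: the definition of IntArray, the array size, and the iteration count are not shown in this section, so the values below are illustrative only.
#include <sycl/sycl.hpp>
#include <array>

constexpr size_t array_size = 1 << 20;         // assumed size for illustration
typedef std::array<int, array_size> IntArray;  // assumed definition of IntArray

// ... definitions of VectorAdd1 and VectorAdd2 from above ...

int main() {
  sycl::queue gpu_q(sycl::gpu_selector_v); // GPU device
  sycl::queue cpu_q(sycl::cpu_selector_v); // host CPU treated as an accelerator

  static IntArray a, b, sum;
  a.fill(1);
  b.fill(2);

  VectorAdd1(gpu_q, a, b, sum, 16);        // whole problem on one device
  VectorAdd2(gpu_q, cpu_q, a, b, sum, 16); // split between the GPU and the CPU
  return 0;
}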
Comparing the running times of the two versions above shows that the
application runs almost twice as fast as before, since it now has more hardware
resources dedicated to solving the problem. To achieve good balance,
the work should be split in proportion to the capability of each
accelerator, instead of being distributed evenly as was done in the above
example.
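One simple way to approximate such a proportional split is to calibrate with a small run on each device and size the two portions according to the measured throughput. The helper below is an illustrative sketch only; the split_point name and the fallback to an even split are assumptions, not part of the sample above.
// Illustrative only: derive the number of elements to give the GPU from
// calibration timings (e.g., values returned by runs similar to VectorAdd1
// performed on each device for the same amount of work).
size_t split_point(size_t total, long long t_cpu_usecs, long long t_gpu_usecs) {
  if (t_cpu_usecs <= 0 || t_gpu_usecs <= 0)
    return total / 2;  // fall back to an even split
  // Throughput is inversely proportional to the measured time, so the GPU's
  // share of the work is t_cpu / (t_cpu + t_gpu).
  double gpu_share =
      static_cast<double>(t_cpu_usecs) / (t_cpu_usecs + t_gpu_usecs);
  return static_cast<size_t>(total * gpu_share);
}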