Developer Guide

Optimizing Memory Movement between Host and Accelerator

Buffers can be created with properties that control how they are allocated. One such property is use_host_ptr, which informs the runtime that, if possible, the host memory should be used by the buffer directly rather than a copy. This avoids copying the contents of the buffer back and forth between host memory and buffer memory, potentially saving time during buffer creation and destruction. In addition, when the GPU and CPU share memory, copies can be avoided entirely by sharing pages; for page sharing to be possible, the allocated memory must satisfy certain conditions, such as being aligned on a page boundary. On discrete devices, the benefit may not be realized, because any memory operation by the accelerator would have to cross PCIe or some other interface that is slower than the accelerator's own memory.
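The examples in this section pass AlignedVector<int> arguments, but the type itself is not shown. Below is a minimal sketch of one possible definition, assuming a std::vector with a custom allocator that returns page-aligned storage; the AlignedAllocator name and the 4096-byte page size are assumptions for illustration, not part of the guide's sources.

#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical allocator that returns page-aligned storage, one of the
// conditions for page sharing described above. 4096 bytes is an assumed
// page size.
template <typename T>
struct AlignedAllocator {
  using value_type = T;
  static constexpr std::size_t alignment = 4096;

  AlignedAllocator() = default;
  template <typename U>
  AlignedAllocator(const AlignedAllocator<U> &) {}

  T *allocate(std::size_t n) {
    // std::aligned_alloc requires the total size to be a multiple of the
    // alignment, so round the request up.
    std::size_t bytes = (n * sizeof(T) + alignment - 1) / alignment * alignment;
    if (void *p = std::aligned_alloc(alignment, bytes))
      return static_cast<T *>(p);
    throw std::bad_alloc();
  }
  void deallocate(T *p, std::size_t) { std::free(p); }
};

template <typename T, typename U>
bool operator==(const AlignedAllocator<T> &, const AlignedAllocator<U> &) {
  return true;
}
template <typename T, typename U>
bool operator!=(const AlignedAllocator<T> &, const AlignedAllocator<U> &) {
  return false;
}

template <typename T>
using AlignedVector = std::vector<T, AlignedAllocator<T>>;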
The following code shows how to print the memory addresses on the host, inside the buffer, and, from inside the kernel, on the accelerator device.
int VectorAdd0(sycl::queue &q, AlignedVector<int> &a, AlignedVector<int> &b,
               AlignedVector<int> &sum, int iter) {
  sycl::range num_items{a.size()};
  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  for (int i = 0; i < iter; i++) {
    sycl::buffer a_buf(a, props);
    sycl::buffer b_buf(b, props);
    sycl::buffer sum_buf(sum.data(), num_items, props);
    {
      sycl::host_accessor a_host_acc(a_buf);
      std::cout << "add0: buff memory address =" << a_host_acc.get_pointer()
                << "\n";
      std::cout << "add0: address of vector a = " << a.data() << "\n";
    }
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
      sycl::stream out(1024 * 1024, 1 * 128, h);

      h.parallel_for(num_items, [=](auto i) {
        if (i[0] == 0)
          out << "add0: dev addr = " << a_acc.get_pointer() << "\n";
        sum_acc[i] = a_acc[i] + b_acc[i];
      });
    });
  }
  q.wait();
  return (0);
}
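For completeness, a minimal driver that exercises VectorAdd0 might look like the sketch below; the vector size and iteration count are arbitrary choices, not values from the guide.

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;  // default-selected device
  constexpr size_t N = 16 * 1024;
  AlignedVector<int> a(N, 1), b(N, 2), sum(N, 0);
  // With use_host_ptr and page-aligned host memory, the three printed
  // addresses should match on an integrated GPU.
  VectorAdd0(q, a, b, sum, 1);
  return 0;
}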
When this program is run, it can be seen that all three addresses (on the host, in the buffer, and on the accelerator) are the same when the use_host_ptr property is set, for integrated GPU devices. For discrete GPU devices, however, the buffer and device addresses will differ. Also note that in the function signature of VectorAdd0, none of the incoming arguments is declared const. If they are declared const, then during buffer creation the contents are copied into newly allocated memory instead of reusing the memory of the host vectors. The code snippet below demonstrates this: when it is executed, the addresses of the incoming vectors differ from both the address of the memory in the buffer and the address seen on the accelerator device.
int VectorAdd1(sycl::queue &q, const AlignedVector<int> &a,
               const AlignedVector<int> &b, AlignedVector<int> &sum,
               int iter) {
  sycl::range num_items{a.size()};
  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};

  for (int i = 0; i < iter; i++) {
    sycl::buffer a_buf(a, props);
    sycl::buffer b_buf(b, props);
    sycl::buffer sum_buf(sum.data(), num_items, props);
    {
      sycl::host_accessor a_host_acc(a_buf);
      std::cout << "add1: buff memory address =" << a_host_acc.get_pointer()
                << "\n";
      std::cout << "add1: address of vector a = " << a.data() << "\n";
    }
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
      sycl::stream out(16 * 1024, 16 * 1024, h);

      h.parallel_for(num_items, [=](auto i) {
        if (i[0] == 0)
          out << "add1: dev addr = " << a_acc.get_pointer() << "\n";
        sum_acc[i] = a_acc[i] + b_acc[i];
      });
    });
  }
  q.wait();
  return (0);
}
The kernel VectorAdd2 below will not incur the cost of copying the memory contents from the buffer to the accelerator device, because the use_host_ptr property is set when the buffers are created and the buffers are aligned on a page boundary, for an integrated GPU device. If the memory referenced by a buffer is not aligned on a page boundary, new page-aligned memory is allocated and the contents of the buffer are copied into it. This new memory is then shared with the accelerator either by copying its contents from the host to the device (for accelerators that do not share any memory with the host) or, in the shared-memory case, by using the page tables to avoid a physical copy. A small helper for checking the alignment condition is sketched after the code below.
int VectorAdd2(sycl::queue &q, AlignedVector<int> &a, AlignedVector<int> &b,
               AlignedVector<int> &sum, int iter) {
  sycl::range num_items{a.size()};
  const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};
  auto start = std::chrono::steady_clock::now();

  for (int i = 0; i < iter; i++) {
    sycl::buffer a_buf(a, props);
    sycl::buffer b_buf(b, props);
    sycl::buffer sum_buf(sum.data(), num_items, props);
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items,
                     [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  // Report elapsed time in microseconds to match the "u-secs" label.
  auto elapsed =
      std::chrono::duration_cast<std::chrono::microseconds>(end - start)
          .count();
  std::cout << "Vector add2 completed on device - took " << elapsed
            << " u-secs\n";
  return (elapsed);
}
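As a sanity check before relying on this zero-copy path, host pointers can be tested for page alignment. The helper below is a sketch, not part of the guide's sources; kPageSize is an assumed value (the actual page size can be queried from the operating system).

#include <cstddef>
#include <cstdint>

// Sketch of a page-alignment check; 4096 bytes is an assumed page size.
constexpr std::size_t kPageSize = 4096;

inline bool IsPageAligned(const void *p) {
  return reinterpret_cast<std::uintptr_t>(p) % kPageSize == 0;
}

// Example use: warn when use_host_ptr cannot avoid the extra copy.
// if (!IsPageAligned(a.data()))
//   std::cout << "vector a is not page aligned; expect an extra copy\n";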
The kernel VectorAdd3 below, by contrast, will incur the cost of copying the memory contents between the host and the buffer, and also from the buffer to the accelerator, because its buffers are created from const host vectors without the use_host_ptr property.
int VectorAdd3(sycl::queue &q, const AlignedVector<int> &a,
               const AlignedVector<int> &b, AlignedVector<int> &sum,
               int iter) {
  sycl::range num_items{a.size()};
  auto start = std::chrono::steady_clock::now();

  for (int i = 0; i < iter; i++) {
    sycl::buffer a_buf(a);
    sycl::buffer b_buf(b);
    sycl::buffer sum_buf(sum.data(), num_items);
    q.submit([&](auto &h) {
      // Input accessors
      sycl::accessor a_acc(a_buf, h, sycl::read_only);
      sycl::accessor b_acc(b_buf, h, sycl::read_only);
      // Output accessor
      sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

      h.parallel_for(num_items,
                     [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  // Report elapsed time in microseconds to match the "u-secs" label.
  auto elapsed =
      std::chrono::duration_cast<std::chrono::microseconds>(end - start)
          .count();
  std::cout << "Vector add3 completed on device - took " << elapsed
            << " u-secs\n";
  return (elapsed);
}
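To make the copy overhead visible, the two variants can be timed back to back. A minimal comparison driver is sketched below; the vector size and iteration count are arbitrary, and the measured ratio will depend on the device and the interface between it and the host.

int main() {
  sycl::queue q;
  constexpr size_t N = 1024 * 1024;
  constexpr int iters = 100;
  AlignedVector<int> a(N, 1), b(N, 2), sum(N, 0);
  // VectorAdd2 reuses the page-aligned host memory via use_host_ptr;
  // VectorAdd3 copies between host, buffer, and device on each iteration.
  int t2 = VectorAdd2(q, a, b, sum, iters);
  int t3 = VectorAdd3(q, a, b, sum, iters);
  std::cout << "add3/add2 time ratio = "
            << static_cast<double>(t3) / t2 << "\n";
  return 0;
}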
Care must be taken to avoid unnecessary copies when creating buffers and when passing their memory to kernels. Even when the accelerator shares memory with the host, a few additional conditions, such as page alignment, must be satisfied to avoid these extra copies.
