Reduction
Reduction is a common operation in parallel programming where an
operator is applied to all elements of an array and a single result is
produced. The reduction operator is associative and in some cases
commutative. Some examples of reductions are summation, maximum,
and minimum. A serial summation reduction is shown below:
// The outer loop repeats the summation for timing; each trip recomputes sum.
for (int it = 0; it < iter; it++) {
  sum = 0;
  for (size_t i = 0; i < data_size; ++i) {
    sum += data[i];
  }
}
The time complexity of reduction is linear in the number of elements. There
are several ways this can be parallelized, and care must be taken to minimize
the amount of communication and synchronization between the different
processing elements. A naive way to parallelize this reduction is to use a
global variable and let the threads update this variable using an atomic
operation:
q.submit([&](auto &h) {
  sycl::accessor buf_acc(buf, h, sycl::read_only);
  sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
  h.parallel_for(data_size, [=](auto index) {
    size_t glob_id = index[0];
    auto v = sycl::atomic_ref<int, sycl::memory_order::relaxed,
                              sycl::memory_scope::device,
                              sycl::access::address_space::global_space>(
        sum_acc[0]);
    v.fetch_add(buf_acc[glob_id]);
  });
});
This kernel will perform poorly because all the threads are atomically
updating a single memory location, creating significant contention.
A better approach is to split the array into small chunks, let each
thread compute a local sum for each chunk, and then do a
sequential/tree reduction of the local sums. The number of chunks
will depend on the number of processing elements present in the
platform. This can be queried using the
get_info<info::device::max_compute_units>()
function on the device
object:
q.submit([&](auto &h) {
  sycl::accessor buf_acc(buf, h, sycl::read_only);
  sycl::accessor accum_acc(accum_buf, h, sycl::write_only, sycl::no_init);
  h.parallel_for(num_processing_elements, [=](auto index) {
    size_t glob_id = index[0];
    // Each work-item sums a contiguous chunk of BATCH elements;
    // N is the total number of elements.
    size_t start = glob_id * BATCH;
    size_t end = (glob_id + 1) * BATCH;
    if (end > N)
      end = N;
    int sum = 0;
    for (size_t i = start; i < end; ++i)
      sum += buf_acc[i];
    accum_acc[glob_id] = sum;
  });
});
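For reference, the number of processing elements used above can be obtained on
the host with a device query along these lines (a sketch; BATCH, N, and
num_processing_elements are assumed to be the host variables the kernel
captures):
// Host-side sketch: size the chunks from the number of compute units.
int num_processing_elements =
    q.get_device().get_info<sycl::info::device::max_compute_units>();
// Round up so the last chunk covers any remainder.
size_t BATCH = (N + num_processing_elements - 1) / num_processing_elements;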
This chunked kernel will perform better than the kernel that atomically
updates a shared memory location. However, it is still inefficient
because the compiler is not able to vectorize the loop. One way to get
the compiler to produce vector code is to modify the loop as shown
below:
q.submit([&](auto &h) {
  sycl::accessor buf_acc(buf, h, sycl::read_only);
  sycl::accessor accum_acc(accum_buf, h, sycl::write_only, sycl::no_init);
  h.parallel_for(num_work_items, [=](auto index) {
    size_t glob_id = index[0];
    int sum = 0;
    for (size_t i = glob_id; i < data_size; i += num_work_items)
      sum += buf_acc[i];
    accum_acc[glob_id] = sum;
  });
});
The compiler can vectorize this loop because consecutive work-items now access
adjacent memory locations on every trip through the loop, so the performance
is better.
On GPUs, a number of thread contexts are available on each physical
processor, referred to as a Vector Engine (VE) or Execution Unit (EU) on the
machine. So the above code, where the number of threads equals the number of
VEs, does not utilize all the thread contexts. Even on CPUs that have two
hyperthreads per core, the code will not use all the thread contexts. In
general, it is better to divide the work into enough work-groups to get full
occupancy of all thread contexts, which allows the code to better tolerate
long-latency instructions. The following table shows the number of thread
contexts available per processing element in different devices:
Device | VEs | Threads per VE | Total threads
---|---|---|---
KBL | 24 | 7 | 168
TGL | 96 | 7 | 672
The code below shows a kernel with enough work-items to fully utilize the
available resources. Notice that there is no good way to query the number of
available thread contexts from the device, so, depending on the device, you
should scale the number of work-items you create when splitting the work among
them.
q.submit([&](auto &h) {
  sycl::accessor buf_acc(buf, h, sycl::read_only);
  sycl::accessor accum_acc(accum_buf, h, sycl::write_only, sycl::no_init);
  h.parallel_for(num_work_items, [=](auto index) {
    size_t glob_id = index[0];
    int sum = 0;
    for (size_t i = glob_id; i < data_size; i += num_work_items)
      sum += buf_acc[i];
    accum_acc[glob_id] = sum;
  });
});
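Since the thread-context count cannot be queried directly, one reasonable
heuristic is to oversubscribe the compute units by a fixed factor on the host,
for example (an illustrative sketch, not part of the original code):
// Illustrative sizing heuristic: oversubscribe compute units so all thread
// contexts have work. The factor 7 matches threads-per-VE in the table above.
int num_compute_units =
    q.get_device().get_info<sycl::info::device::max_compute_units>();
int num_work_items = num_compute_units * 7;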
One popular way of doing a reduction operation on GPUs is to create a number of
work-groups and do a tree reduction within each work-group. In the kernel shown
below, each work-item in the work-group participates in a reduction network to
eventually sum up all the elements assigned to that work-group. The
intermediate results from all the work-groups are then summed up by doing a
serial reduction (if this intermediate set of results is large enough, a few
more rounds of tree reduction can be applied). This tree reduction algorithm
takes advantage of the very fast synchronization operations among the
work-items in a work-group. The performance of this kernel is highly dependent
on the efficiency of the kernel launches, because a large number of kernels are
launched. Also, the kernel as written below is not very efficient because the
number of work-items doing actual work halves on every iteration of the loop.
q.submit([&](auto &h) {
  sycl::accessor buf_acc(buf, h, sycl::read_only);
  sycl::accessor accum_acc(accum_buf, h, sycl::write_only, sycl::no_init);
  sycl::accessor<int, 1, sycl::access::mode::read_write,
                 sycl::access::target::local>
      scratch(work_group_size, h);
  h.parallel_for(sycl::nd_range<1>(num_work_items, work_group_size),
                 [=](sycl::nd_item<1> item) {
                   size_t global_id = item.get_global_id(0);
                   int local_id = item.get_local_id(0);
                   int group_id = item.get_group(0);
                   if (global_id < data_size)
                     scratch[local_id] = buf_acc[global_id];
                   else
                     scratch[local_id] = 0;
                   // Do a tree reduction on items in work-group
                   for (int i = work_group_size / 2; i > 0; i >>= 1) {
                     item.barrier(sycl::access::fence_space::local_space);
                     if (local_id < i)
                       scratch[local_id] += scratch[local_id + i];
                   }
                   if (local_id == 0)
                     accum_acc[group_id] = scratch[0];
                 });
});
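After this kernel completes, the host still has to add up the per-work-group
partial sums, e.g. (a sketch; num_work_groups is assumed to be
num_work_items / work_group_size):
// Host-side sketch: combine the per-work-group partial sums serially.
int num_work_groups = num_work_items / work_group_size;
sycl::host_accessor h_acc(accum_buf);
int sum = 0;
for (int i = 0; i < num_work_groups; ++i)
  sum += h_acc[i];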
This single-stage reduction is not very efficient, since it leaves a
lot of work for the host. Adding one more stage reduces the work on
the host and improves performance quite a bit. In the kernels below,
the intermediate result computed in stage 1 is used as the input to
stage 2. This can be generalized into a multi-stage reduction that
continues until the result is small enough to be finished on the host.
q.submit([&](auto &h) {
  sycl::accessor buf_acc(buf, h, sycl::read_only);
  sycl::accessor accum_acc(accum1_buf, h, sycl::write_only, sycl::no_init);
  sycl::accessor<int, 1, sycl::access::mode::read_write,
                 sycl::access::target::local>
      scratch(work_group_size, h);
  // Stage 1: reduce the input into one partial sum per work-group.
  h.parallel_for(sycl::nd_range<1>(num_work_items1, work_group_size),
                 [=](sycl::nd_item<1> item) {
                   size_t global_id = item.get_global_id(0);
                   int local_id = item.get_local_id(0);
                   int group_id = item.get_group(0);
                   if (global_id < data_size)
                     scratch[local_id] = buf_acc[global_id];
                   else
                     scratch[local_id] = 0;
                   // Do a tree reduction on items in work-group
                   for (int i = work_group_size / 2; i > 0; i >>= 1) {
                     item.barrier(sycl::access::fence_space::local_space);
                     if (local_id < i)
                       scratch[local_id] += scratch[local_id + i];
                   }
                   if (local_id == 0)
                     accum_acc[group_id] = scratch[0];
                 });
});
q.submit([&](auto &h) {
  sycl::accessor buf_acc(accum1_buf, h, sycl::read_only);
  sycl::accessor accum_acc(accum2_buf, h, sycl::write_only, sycl::no_init);
  sycl::accessor<int, 1, sycl::access::mode::read_write,
                 sycl::access::target::local>
      scratch(work_group_size, h);
  // Stage 2: reduce the stage-1 partial sums.
  h.parallel_for(sycl::nd_range<1>(num_work_items2, work_group_size),
                 [=](sycl::nd_item<1> item) {
                   size_t global_id = item.get_global_id(0);
                   int local_id = item.get_local_id(0);
                   int group_id = item.get_group(0);
                   if (global_id < static_cast<size_t>(num_work_items2))
                     scratch[local_id] = buf_acc[global_id];
                   else
                     scratch[local_id] = 0;
                   // Do a tree reduction on items in work-group
                   for (int i = work_group_size / 2; i > 0; i >>= 1) {
                     item.barrier(sycl::access::fence_space::local_space);
                     if (local_id < i)
                       scratch[local_id] += scratch[local_id + i];
                   }
                   if (local_id == 0)
                     accum_acc[group_id] = scratch[0];
                 });
});
SYCL also supports built-in reduction operations, and you should use them where
suitable because their implementation is fine-tuned to the underlying
architecture. The following kernel shows how to use the built-in reduction
operator provided by the compiler.
q.submit([&](auto &h) {
  sycl::accessor buf_acc(buf, h, sycl::read_only);
  sycl::accessor sum_acc(sum_buf, h, sycl::read_write);
  auto sumr =
      sycl::ext::oneapi::reduction(sum_acc, sycl::ext::oneapi::plus<>());
  h.parallel_for(sycl::nd_range<1>{data_size, 256}, sumr,
                 [=](sycl::nd_item<1> item, auto &sumr_arg) {
                   int glob_id = item.get_global_id(0);
                   sumr_arg += buf_acc[glob_id];
                 });
});
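Note that the ext::oneapi spelling shown here predates SYCL 2020; the standard
equivalent is sycl::reduction, which takes the buffer and handler directly,
combined with sycl::plus<>().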
A further optimization is to block the accesses to the input vector and use the
shared local memory to store the intermediate results. This kernel is shown
below. In this kernel, every work-item operates on a certain number of vector
elements, and then one work-item in the work-group reduces all these partial
results to a single value by linearly going through the shared local memory
containing them.
q.submit([&](auto &h) {
  sycl::accessor buf_acc(buf, h, sycl::read_only);
  sycl::accessor accum_acc(accum_buf, h, sycl::write_only, sycl::no_init);
  sycl::accessor<int, 1, sycl::access::mode::read_write,
                 sycl::access::target::local>
      scratch(work_group_size, h);
  h.parallel_for(sycl::nd_range<1>{num_work_items, work_group_size},
                 [=](sycl::nd_item<1> item) {
                   size_t glob_id = item.get_global_id(0);
                   size_t group_id = item.get_group(0);
                   size_t loc_id = item.get_local_id(0);
                   // Base of this work-item's block plus its lane within the
                   // block, so consecutive work-items read adjacent elements.
                   int offset = ((glob_id >> log2workitems_per_block)
                                 << log2elements_per_block) +
                                (glob_id & mask);
                   int sum = 0;
                   for (int i = 0; i < elements_per_work_item; ++i)
                     sum += buf_acc[(i << log2workitems_per_block) + offset];
                   scratch[loc_id] = sum;
                   // Serial reduction of the per-work-item partial sums
                   item.barrier(sycl::access::fence_space::local_space);
                   if (loc_id == 0) {
                     int sum = 0;
                     for (int i = 0; i < work_group_size; ++i)
                       sum += scratch[i];
                     accum_acc[group_id] = sum;
                   }
                 });
});
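The blocking constants used above are assumed to be set up on the host; a
sketch with illustrative values (the names match what the kernel captures):
// Illustrative host-side constants for the blocked access pattern.
// Powers of two let the kernel use shifts and masks instead of division.
constexpr int elements_per_work_item = 256;
constexpr int log2elements_per_work_item = 8; // log2(256)
constexpr int log2workitems_per_block = 8;    // e.g., 256 work-items per block
constexpr int log2elements_per_block =
    log2workitems_per_block + log2elements_per_work_item;
constexpr int mask = (1 << log2workitems_per_block) - 1;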
The kernel below is similar to the blocked kernel above, except that a tree
reduction is used to combine the intermediate results from all the
work-items in a work-group. In most cases this does not seem to make a
big difference in performance.
q.submit([&](auto &h) {
  sycl::accessor buf_acc(buf, h, sycl::read_only);
  sycl::accessor accum_acc(accum_buf, h, sycl::write_only, sycl::no_init);
  sycl::accessor<int, 1, sycl::access::mode::read_write,
                 sycl::access::target::local>
      scratch(work_group_size, h);
  h.parallel_for(sycl::nd_range<1>{num_work_items, work_group_size},
                 [=](sycl::nd_item<1> item) {
                   size_t glob_id = item.get_global_id(0);
                   size_t group_id = item.get_group(0);
                   size_t loc_id = item.get_local_id(0);
                   int offset = ((glob_id >> log2workitems_per_block)
                                 << log2elements_per_block) +
                                (glob_id & mask);
                   int sum = 0;
                   for (int i = 0; i < elements_per_work_item; ++i)
                     sum += buf_acc[(i << log2workitems_per_block) + offset];
                   scratch[loc_id] = sum;
                   // Tree reduction on items in work-group
                   item.barrier(sycl::access::fence_space::local_space);
                   for (int i = work_group_size / 2; i > 0; i >>= 1) {
                     item.barrier(sycl::access::fence_space::local_space);
                     if (loc_id < static_cast<size_t>(i))
                       scratch[loc_id] += scratch[loc_id + i];
                   }
                   if (loc_id == 0)
                     accum_acc[group_id] = scratch[0];
                 });
});
The kernel below uses the blocking technique together with the built-in
reduction operator for the final reduction. This gives good performance
on most of the platforms on which it was tested.
q.submit([&](auto &h) {
  sycl::accessor buf_acc(buf, h, sycl::read_only);
  sycl::accessor sum_acc(sum_buf, h, sycl::read_write, sycl::no_init);
  auto sumr =
      sycl::ext::oneapi::reduction(sum_acc, sycl::ext::oneapi::plus<>());
  h.parallel_for(sycl::nd_range<1>{num_work_items, work_group_size}, sumr,
                 [=](sycl::nd_item<1> item, auto &sumr_arg) {
                   size_t glob_id = item.get_global_id(0);
                   int offset = ((glob_id >> log2workitems_per_block)
                                 << log2elements_per_block) +
                                (glob_id & mask);
                   int sum = 0;
                   for (int i = 0; i < elements_per_work_item; ++i)
                     sum += buf_acc[(i << log2workitems_per_block) + offset];
                   sumr_arg += sum;
                 });
});
This next kernel uses a completely different technique for accessing the
memory. It uses sub-group loads to generate the intermediate result in a vector
form. This intermediate result is then brought back to the host and the final
reduction is performed there. In some cases it may be better to create another
kernel to reduce this result in a single work-group, which lets you perform
tree reduction through efficient barriers.
q.submit([&](auto &h) {
  const sycl::accessor buf_acc(buf, h);
  sycl::accessor accum_acc(accum_buf, h, sycl::write_only, sycl::no_init);
  sycl::accessor<sycl::vec<int, 8>, 1, sycl::access::mode::read_write,
                 sycl::access::target::local>
      scratch(work_group_size, h);
  h.parallel_for(
      sycl::nd_range<1>{num_work_items, work_group_size},
      [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(16)]] {
        size_t group_id = item.get_group(0);
        size_t loc_id = item.get_local_id(0);
        sycl::ext::oneapi::sub_group sg = item.get_sub_group();
        sycl::vec<int, 8> sum{0, 0, 0, 0, 0, 0, 0, 0};
        using global_ptr =
            sycl::multi_ptr<int, sycl::access::address_space::global_space>;
        int base = (group_id * work_group_size +
                    sg.get_group_id()[0] * sg.get_local_range()[0]) *
                   elements_per_work_item;
        // Each sub-group load consumes 8 (vector width) * 16 (sub-group
        // size) = 128 contiguous ints.
        for (int i = 0; i < elements_per_work_item / 8; ++i)
          sum += sg.load<8>(global_ptr(&buf_acc[base + i * 128]));
        scratch[loc_id] = sum;
        for (int i = work_group_size / 2; i > 0; i >>= 1) {
          item.barrier(sycl::access::fence_space::local_space);
          if (loc_id < static_cast<size_t>(i))
            scratch[loc_id] += scratch[loc_id + i];
        }
        if (loc_id == 0)
          accum_acc[group_id] = scratch[0];
      });
});
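The final reduction on the host then sums the lanes of each vec<8> partial
result, e.g. (a sketch; num_work_groups is assumed to be
num_work_items / work_group_size):
// Host-side sketch: add up all lanes of the per-group vec<8> partial sums.
int num_work_groups = num_work_items / work_group_size;
sycl::host_accessor h_acc(accum_buf);
int sum = 0;
for (int i = 0; i < num_work_groups; ++i)
  for (int j = 0; j < 8; ++j)
    sum += h_acc[i][j];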
Several implementations of the reduction operation have been presented and
discussed here; their performance characteristics can differ depending on the
architecture of the accelerator. Another important thing to note is that the
time it takes to bring the result of the reduction to the host over the PCIe
interface (for a discrete GPU) is almost the same as the time to do the entire
reduction on the device. This shows that one should avoid data transfers
between host and device as much as possible, or overlap kernel execution
with data transfers.