Local Barriers vs Global Atomics
Atomics allow multiple work-items in the kernel to work on
shared resources. Barriers allow synchronization among
the work-items in a work-group. It is possible to achieve
the functionality of global atomics through judicious use
of kernel launches and local barriers. Depending on the
architecture and the amount of data involved, one or the other
can have better performance.
In the following example, we try to sum a relatively
small number of elements in a vector. This task is can be
achieved in different ways. The first kernel shown below does
this using only one work-item which walks through all elements
of the vector and sums them up.
q.submit([&](auto &h) {
sycl::accessor buf_acc(buf, h, sycl::read_only);
sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
h.parallel_for(data_size, [=](auto index) {
int glob_id = index[0];
if (glob_id == 0) {
int sum = 0;
for (int i = 0; i < N; i++)
sum += buf_acc[i];
sum_acc[0] = sum;
}
});
});
In the kernel shown below, the same problem is solved using
global atomics, where every work-item updates a global variable
with the value it needs to accumulate. Although there is a lot
of parallelism here, the contention on the global variable
is quite high and in most cases its performance will not be
very good.
q.submit([&](auto &h) {
sycl::accessor buf_acc(buf, h, sycl::read_only);
sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
h.parallel_for(data_size, [=](auto index) {
size_t glob_id = index[0];
auto v = sycl::ext::oneapi::atomic_ref<
int, sycl::ext::oneapi::memory_order::relaxed,
sycl::ext::oneapi::memory_scope::device,
sycl::access::address_space::global_space>(sum_acc[0]);
v.fetch_add(buf_acc[glob_id]);
});
});
In the following kernel, every work-item is responsible for accumulating
multiple elements of the vector. This accumulation is done in parallel
and then updated into an array that is shared among all work-items of the
work-group. At this point all work-items of the work-group do a tree
reduction using barriers to synchronize among themselves to reduce
intermediate results in shared memory to the final result. This kernel
explicitly created exactly one work-group and distributes the
responsibility of all elements in the vector to the work-items
in the work-group. Although it is not using the full capability of the
machine in terms of the number of threads, sometimes this amount of
parallelism is enough for small problem sizes.
Timer timer;
q.submit([&](auto &h) {
sycl::accessor buf_acc(buf, h, sycl::read_only);
sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
sycl::accessor<int, 1, sycl::access::mode::read_write,
sycl::access::target::local>
scratch(work_group_size, h);
h.parallel_for(sycl::nd_range<1>{work_group_size, work_group_size},
[=](sycl::nd_item<1> item) {
size_t loc_id = item.get_local_id(0);
int sum = 0;
for (int i = loc_id; i < data_size; i += num_work_items)
sum += buf_acc[i];
scratch[loc_id] = sum;
for (int i = work_group_size / 2; i > 0; i >>= 1) {
item.barrier(sycl::access::fence_space::local_space);
if (loc_id < i)
scratch[loc_id] += scratch[loc_id + i];
}
if (loc_id == 0)
sum_acc[0] = scratch[0];
});
});
The performance of these three kernels varies quite a bit among various
platforms, and developers need to pick the technique that suits their
application and hardware.