Developer Guide

Contents

Static Memory Coalescing

Static memory coalescing is an 
Intel® oneAPI
DPC++/C++
Compiler
optimization step that merges contiguous accesses to global memory into a single wide access. A similar optimization is applied to on-chip memory.
The figure below shows a common case where kernel performance might benefit from static memory coalescing:
Static Memory Coalescing
Static Memory Coalescing
Consider the following vectorized kernel:
q.submit([&](handler &cgh) { accessor a(a_buf, cgh, read_only); accessor b(b_buf, cgh, write_only, no_init); cgh.single_task<class coalesced>([=]() { #pragma unroll for (int i = 0; i < 4; i++) b[i] = a[i]; }); });
The kernel performs four load operations from buffer a that access consecutive locations in memory. Instead of performing four memory accesses to competing locations, the compiler coalesces the four loads into a single, wider vector load. This optimization reduces the number of accesses to a memory system and potentially leads to better memory access patterns.
Although the compiler performs static memory coalescing automatically, you should use wide vector loads and stores in your SYCL* code whenever possible to ensure efficient memory accesses.
To allow static memory coalescing, you must write your code in such a way that the compiler can identify a sequential access pattern during compilation. The original kernel code shown in the figure above can benefit from static memory coalescing because all indexes into buffers 
a
 and 
b
 increment with offsets that are known at compilation time. In contrast, the following code does not allow static memory coalescing to occur:
q.submit([&](handler &cgh) { accessor a(a_buf, cgh, read_only); accessor b(b_buf, cgh, write_only); accessor offsets(offsets_buf, cgh, read_only); cgh.single_task<class not_coalesced>([=]() { #pragma unroll for (int i = 0; i < 4; i++) b[i] = a[offsets[i]]; }); });
The value 
offsets[i]
 is unknown at compilation time. As a result, the
Intel® oneAPI
DPC++/C++
Compiler
cannot statically coalesce the read accesses to buffer 
a
.
For more information, refer to Local and Private Memory Accesses Optimization.

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.