Developer Guide

Shared Local Memory

Often work-items need to share data and communicate with each other. On one hand, all work-items in all work-groups can access global memory, so data sharing and communication can occur through global memory. However, due to its lower bandwidth and higher latency, sharing and communication through global memory is less efficient. On the other hand, work-items in a sub-group executing simultaneously in an execution unit (EU) thread can share data and communicate with each other very efficiently, but the number of work-items in a sub-group is usually small and the scope of data sharing and communication is very limited. Memory with higher bandwidth and lower latency accessible to a bigger scope of work-items is very desirable for data sharing communication among work-items. The shared local memory (SLM) in Intel
®
GPUs is designed for this purpose.
Each SubSlice of Intel
®
GPUs has its own SLM. Access to the SLM is limited to the EUs in the SubSlice or work-items in the same work-group scheduled to execute on the EUs of the same SubSlice. It is local to a SubSlice (or work-group) and shared by EUs in the same SubSlice (or work-items in the same work-group), so it is called SLM. Because it is on-chip in each SubSlice, the SLM has much higher bandwidth and much lower latency than global memory. Because it is accessible to all work-items in a work-group, the SLM can accommodate data sharing and communication among hundreds of work-items, depending on the work-group size.
It is often helpful to think of SLM as a work-group managed cache. When a work-group starts, work-items in the work-group can explicitly load data from global memory into SLM. The data stays in SLM during the lifetime of the work-group for faster access. Before the work-group finishes, the data in the SLM can be explicitly written back to the global memory by the work-items. After the work-group completes execution, the data in SLM is also gone and invalid. Data consistency between the SLM and the global memory is the program’s responsibility. Properly using SLM can make significant performance difference.

Shared Local Memory Size and Work-group Size

Because it is on-chip, the SLM has limited size. How much memory is available to a work-group is device-dependent and can be obtained by querying the device, e.g.:
std::cout << "Local Memory Size: " << q.get_device().get_info<sycl::info::device::local_mem_size>() << std::endl;
The output may look like:
Local Memory Size: 65536
The unit of the size is a byte. So this GPU device has 65,536 bytes or 64KB SLM for each work-group.
It is important to know the maximum SLM size a work-group can have. In a lot of cases, the total size of SLM available to a work-group is a non-constant function of the number of work-items in the work-group. The maximum SLM size can limit the total number of work-items in a group, i.e. work-group size. For example, if the maximum SLM size is 64KB and each work-item needs 512 bytes of SLM, the maximum work-group size cannot exceed 128.

Bank Conflicts

The SLM is divided into equally sized memory banks that can be accessed simultaneously for high bandwidth. The total number of banks is device-dependent. At the time of writing, 64 consecutive bytes are stored in 16 consecutive banks at 4-byte (32-bit) granularity. Requests for access to different banks can be serviced in parallel, but requests to different addresses in the same bank cause a bank conflict and are serialized. Bank conflicts adversely affect performance. Consider this example:
constexpr int N = 32; int *data = sycl::malloc_shared<int>(N, q); auto e = q.submit([&](auto &h) { sycl::accessor<int, 1, sycl::access::mode::read_write, sycl::access::target::local> slm(sycl::range(32 * 64), h); h.parallel_for(sycl::nd_range(sycl::range{N}, sycl::range{32}), [=](sycl::nd_item<1> it) { int i = it.get_global_linear_id(); int j = it.get_local_linear_id(); slm[j * 16] = 0; it.barrier(sycl::access::fence_space::local_space); for (int m = 0; m < 1024 * 1024; m++) { slm[j * 16] += i * m; it.barrier(sycl::access::fence_space::local_space); } data[i] = slm[j * 16]; }); });
If the number of banks is 16, all work-items in the above example will read from and write to different addresses in the same bank. The memory bandwidth is 1/16 of full bandwidth.
The next example instead does not have SLM bank conflicts and achieves full memory bandwidth because every work-item reads from and writes to different addresses in different banks.
constexpr int N = 32; int *data = sycl::malloc_shared<int>(N, q); auto e = q.submit([&](auto &h) { sycl::accessor<int, 1, sycl::access::mode::read_write, sycl::access::target::local> slm(sycl::range(32 * 64), h); h.parallel_for(sycl::nd_range(sycl::range{N}, sycl::range{32}), [=](sycl::nd_item<1> it) { int i = it.get_global_linear_id(); int j = it.get_local_linear_id(); slm[j] = 0; it.barrier(sycl::access::fence_space::local_space); for (int m = 0; m < 1024 * 1024; m++) { slm[j] += i * m; it.barrier(sycl::access::fence_space::local_space); } data[i] = slm[j]; }); });

Data Sharing and Work-group Barriers

Let us consider the histogram with 256 bins example from the “Avoiding Register Spills” chapter once again.
constexpr int blockSize = 256; constexpr int NUM_BINS = 256; std::vector<unsigned long> hist(NUM_BINS, 0); sycl::buffer<unsigned long, 1> mbuf(input.data(), N); sycl::buffer<unsigned long, 1> hbuf(hist.data(), NUM_BINS); auto e = q.submit([&](auto &h) { sycl::accessor macc(mbuf, h, sycl::read_only); auto hacc = hbuf.get_access<sycl::access::mode::atomic>(h); h.parallel_for( sycl::nd_range(sycl::range{N / blockSize}, sycl::range{64}), [= ](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] { int group = it.get_group()[0]; int gSize = it.get_local_range()[0]; sycl::ext::oneapi::sub_group sg = it.get_sub_group(); int sgSize = sg.get_local_range()[0]; int sgGroup = sg.get_group_id()[0]; unsigned int histogram[NUM_BINS / 16]; // histogram bins take too much storage // to be promoted to registers for (int k = 0; k < NUM_BINS / 16; k++) { histogram[k] = 0; } for (int k = 0; k < blockSize; k++) { unsigned long x = sg.load(macc.get_pointer() + group * gSize * blockSize + sgGroup * sgSize * blockSize + sgSize * k); // subgroup size is 16 #pragma unroll for (int j = 0; j < 16; j++) { unsigned long y = sycl::group_broadcast(sg, x, j); #pragma unroll for (int i = 0; i < 8; i++) { unsigned int c = y & 0xFF; // (c & 0xF) is the workitem in which the bin resides // (c >> 4) is the bin index if (sg.get_local_id()[0] == (c & 0xF)) { histogram[c >> 4] += 1; } y = y >> 8; } } } for (int k = 0; k < NUM_BINS / 16; k++) { hacc[16 * k + sg.get_local_id()[0]].fetch_add(histogram[k]); } }); });
We optimized this example to use integer data type instead of long and to share registers in the sub-group so that the private histogram bins can fit in registers for optimal performance. If we need a larger bin size (e.g., 1024), it is inevitable that the private histogram bins will spill to global memory.
The histogram bins can be shared by work-items in a work-group as long as each bin is updated atomically.
constexpr int NUM_BINS = 1024; constexpr int blockSize = 256; std::vector<unsigned long> hist(NUM_BINS, 0); sycl::buffer<unsigned long, 1> mbuf(input.data(), N); sycl::buffer<unsigned long, 1> hbuf(hist.data(), NUM_BINS); auto e = q.submit([&](auto &h) { sycl::accessor macc(mbuf, h, sycl::read_only); auto hacc = hbuf.get_access<sycl::access::mode::atomic>(h); sycl::accessor<unsigned int, 1, sycl::access::mode::atomic, sycl::access::target::local> local_histogram(sycl::range(NUM_BINS), h); h.parallel_for( sycl::nd_range(sycl::range{N / blockSize}, sycl::range{64}), [=](sycl::nd_item<1> it) { int group = it.get_group()[0]; int gSize = it.get_local_range()[0]; sycl::ext::oneapi::sub_group sg = it.get_sub_group(); int sgSize = sg.get_local_range()[0]; int sgGroup = sg.get_group_id()[0]; int factor = NUM_BINS / gSize; int local_id = it.get_local_id()[0]; if ((factor <= 1) && (local_id < NUM_BINS)) { local_histogram[local_id].store(0); } else { for (int k = 0; k < factor; k++) { local_histogram[gSize * k + local_id].store(0); } } it.barrier(sycl::access::fence_space::local_space); for (int k = 0; k < blockSize; k++) { unsigned long x = sg.load(macc.get_pointer() + group * gSize * blockSize + sgGroup * sgSize * blockSize + sgSize * k); local_histogram[x & 0x3FFU].fetch_add(1); local_histogram[(x >> 16) & 0x3FFU].fetch_add(1); local_histogram[(x >> 32) & 0x3FFU].fetch_add(1); local_histogram[(x >> 48) & 0x3FFU].fetch_add(1); } it.barrier(sycl::access::fence_space::local_space); if ((factor <= 1) && (local_id < NUM_BINS)) { hacc[local_id].fetch_add(local_histogram[local_id].load()); } else { for (int k = 0; k < factor; k++) { hacc[gSize * k + local_id].fetch_add( local_histogram[gSize * k + local_id].load()); } } }); });
When the work-group is started, each work-item in the work-group initializes a portion of the histogram bins in SLM to 0 (code in lines 21-27 in the above example). We could designate one work-item to initialize all the histogram bins, but it is usually more efficient to divide the job among all work-items in the work-group.
The work-group barrier after initialization at line 28 guarantees that all histogram bins are initialized to 0 before any work-item updates any bins.
Because the histogram bins in SLM are shared among all work-items, updates to any bin by any work-item has to be atomic.
The global histograms are updated once the local histograms in the work-group is completed. But before reading the local SLM bins to update the global bins, a work-group barrier is again called at line 43 to make sure all work-items have completed their work.
When SLM data is shared, work-group barriers are often required for work-item synchronization. The barrier has a cost and the cost may increase with a larger work-group size. It is always a good idea to try different work-group sizes to find the best one for your application.
We also have an SLM version of histogram with 256 bins in the examples folder. You can compare its performance with the performance of the version using registers. You may get some surprising results and want to think what further optimizations can be done for more performance. We leave this as an exercise.

Using SLM as Cache

We sometimes find it is more desirable to have the application manage caching of some hot data than to have the hardware do it automatically for us. With the application managing data caching directly, whenever the data is needed, we know exactly where the data is and the cost to access it. The SLM can be used for such purpose.
Consider the following 1-D convolution example:
sycl::buffer<int> ibuf(input.data(), N); sycl::buffer<int> obuf(output.data(), N); sycl::buffer<int> kbuf(kernel.data(), M); auto e = q.submit([&](auto &h) { sycl::accessor iacc(ibuf, h, sycl::read_only); sycl::accessor oacc(obuf, h); sycl::accessor kacc(kbuf, h, sycl::read_only); h.parallel_for(sycl::nd_range<1>(sycl::range{N}, sycl::range{256}), [=](sycl::nd_item<1> it) { int i = it.get_global_linear_id(); int group = it.get_group()[0]; int gSize = it.get_local_range()[0]; int t = 0; if ((group == 0) || (group == N / gSize - 1)) { if (i < M / 2) { for (int j = M / 2 - i, k = 0; j < M; j++, k++) { t += iacc[k] * kacc[j]; } } else { if (i + M / 2 >= N) { for (int j = 0, k = i - M / 2; j < M / 2 + N - i; j++, k++) { t += iacc[k] * kacc[j]; } } else { for (int j = 0, k = i - M / 2; j < M; j++, k++) { t += iacc[k] * kacc[j]; } } } } else { for (int j = 0, k = i - M / 2; j < M; j++, k++) { t += iacc[k] * kacc[j]; } } oacc[i] = t; }); });
The example convolves an integer array of 8192 x 8192 elements using a kernel array of 257 elements and writes the result to an output array. Each work-item convolves one element. To convolve one element, however, up to 256 neighboring elements are needed.
Noticing each input element is used by multiple work-items, we can preload all input elements needed by a whole work-group into SLM. Later, when an element is needed, it can be loaded from SLM instead of global memory.
sycl::buffer<int> ibuf(input.data(), N); sycl::buffer<int> obuf(output.data(), N); sycl::buffer<int> kbuf(kernel.data(), M); auto e = q.submit([&](auto &h) { sycl::accessor iacc(ibuf, h, sycl::read_only); sycl::accessor oacc(obuf, h); sycl::accessor kacc(kbuf, h, sycl::read_only); sycl::accessor<int, 1, sycl::access::mode::read_write, sycl::access::target::local> ciacc(sycl::range(256 + (M / 2) * 2), h); h.parallel_for(sycl::nd_range(sycl::range{N}, sycl::range{256}), [=](sycl::nd_item<1> it) { int i = it.get_global_linear_id(); int group = it.get_group()[0]; int gSize = it.get_local_range()[0]; int local_id = it.get_local_id()[0]; ciacc[local_id + M / 2] = iacc[i]; if (local_id == 0) { if (group == 0) { for (int j = 0; j < M / 2; j++) { ciacc[j] = 0; } } else { for (int j = 0, k = i - M / 2; j < M / 2; j++, k++) { ciacc[j] = iacc[k]; } } } if (local_id == gSize - 1) { if (group == it.get_group_range()[0] - 1) { for (int j = gSize + M / 2; j < gSize + M / 2 + M / 2; j++) { ciacc[j] = 0; } } else { for (int j = gSize + M / 2, k = i + 1; j < gSize + M / 2 + M / 2; j++, k++) { ciacc[j] = iacc[k]; } } } it.barrier(sycl::access::fence_space::local_space); int t = 0; for (int j = 0, k = local_id; j < M; j++, k++) { t += ciacc[k] * kacc[j]; } oacc[i] = t; }); });
When the work-group starts, all input elements needed by each work-item are loaded into SLM. Each work-item, except the first one and the last one, loads one element into SLM. The first work-item loads neighbors on the left of the first element and the last work item loads neighbors on the right of the last element in the SLM. If no neighbors exist, elements in SLM are filled with 0s.
Before convolution starts in each work-item, a local barrier is called to make sure all input elements are loaded into SLM.
The convolution in each work-item is straightforward. All neighboring elements are loaded from the faster SLM instead of global memory.

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.