Shared Local Memory
Work-items often need to share data and communicate with each other. On one hand, all work-items in all work-groups can access global memory, so data sharing and communication can occur through global memory; however, due to its lower bandwidth and higher latency, sharing and communication through global memory is less efficient. On the other hand, work-items in a sub-group executing simultaneously in a vector engine (VE) thread can share data and communicate with each other very efficiently, but a sub-group is usually small and the scope of its data sharing and communication is very limited. Memory with higher bandwidth and lower latency, accessible to a larger scope of work-items, is very desirable for data sharing and communication among work-items. The shared local memory (SLM) in Intel® GPUs is designed for this purpose.
Each Xe-core of an Intel GPU has its own SLM. Access to the SLM is limited to the VEs in that Xe-core, i.e., to the work-items of a work-group scheduled to execute on the VEs of the same Xe-core. It is local to an Xe-core (or work-group) and shared by the VEs in the same Xe-core (or the work-items in the same work-group), hence the name SLM. Because it is on-chip in each Xe-core, SLM has much higher bandwidth and much lower latency than global memory. Because it is accessible to all work-items in a work-group, SLM can accommodate data sharing and communication among hundreds of work-items, depending on the work-group size.
It is often helpful to think of SLM as a work-group managed cache. When a work-group starts, its work-items can explicitly load data from global memory into SLM, where it stays for the lifetime of the work-group for faster access. Before the work-group finishes, the work-items can explicitly write the data in SLM back to global memory. After the work-group completes execution, the data in SLM is gone and invalid. Data consistency between SLM and global memory is the program's responsibility. Properly using SLM can make a significant performance difference.
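A minimal sketch of this load/compute/write-back pattern might look like the following. This is illustrative only: the queue q, the array data, and the neighbor computation are assumptions, not part of the examples later in this chapter.

// Work-group managed cache pattern (sketch): load from global memory into
// SLM, synchronize, compute on the SLM copy, write results back.
// Assumes a sycl::queue q already exists.
constexpr int N = 1024;
constexpr int WG = 256;
int *data = sycl::malloc_shared<int>(N, q);
q.submit([&](auto &h) {
   sycl::accessor<int, 1, sycl::access::mode::read_write,
                  sycl::access::target::local>
       slm(sycl::range(WG), h);
   h.parallel_for(sycl::nd_range(sycl::range{N}, sycl::range{WG}),
                  [=](sycl::nd_item<1> it) {
                    int i = it.get_global_linear_id();
                    int j = it.get_local_linear_id();
                    // Explicitly load this work-item's element into SLM.
                    slm[j] = data[i];
                    // Wait until the whole work-group has finished loading.
                    it.barrier(sycl::access::fence_space::local_space);
                    // Compute on the SLM copy (here: add the next neighbor).
                    int t = slm[j] + slm[(j + 1) % WG];
                    // Explicitly write the result back to global memory.
                    data[i] = t;
                  });
 }).wait();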
Shared Local Memory Size and Work-group Size
Because it is on-chip, the SLM has limited size. How much memory is available to a work-group is device-dependent and can be obtained by querying the device, e.g.:
std::cout << "Local Memory Size: " << q.get_device().get_info<sycl::info::device::local_mem_size>() << std::endl;
The output may look like:
Local Memory Size: 65536
The size is reported in bytes, so this GPU device has 65,536 bytes, or 64 KB, of SLM for each work-group.
It is important to know the maximum SLM size a work-group can have. In many cases, the total amount of SLM a work-group requires is a function of the number of work-items in the work-group, so the maximum SLM size can limit the total number of work-items in a group, i.e., the work-group size. For example, if the maximum SLM size is 64 KB and each work-item needs 512 bytes of SLM, the work-group size cannot exceed 64 KB / 512 bytes = 128.
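This bound can be derived in code. The following is a sketch, assuming a sycl::queue q and a hypothetical per-work-item SLM requirement of 512 bytes:

// Bound the work-group size by the per-work-group SLM capacity.
// slm_per_item (512 bytes) is an assumption for illustration only.
size_t slm_size =
    q.get_device().get_info<sycl::info::device::local_mem_size>();
constexpr size_t slm_per_item = 512;
size_t max_wg_from_slm = slm_size / slm_per_item; // 65536 / 512 = 128
// The usable work-group size is the smaller of this bound and the
// device's own work-group size limit.
size_t max_wg_device =
    q.get_device().get_info<sycl::info::device::max_work_group_size>();
size_t max_wg = std::min(max_wg_from_slm, max_wg_device);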
Bank Conflicts
The SLM is divided into equally sized memory banks that can be accessed simultaneously for high bandwidth. The total number of banks is device-dependent. At the time of writing, 64 consecutive bytes are stored in 16 consecutive banks at 4-byte (32-bit) granularity. Requests for access to different banks can be serviced in parallel, but requests to different addresses in the same bank cause a bank conflict and are serialized. Bank conflicts adversely affect performance. Consider this example:
constexpr int N = 32;
int *data = sycl::malloc_shared<int>(N, q);

auto e = q.submit([&](auto &h) {
  sycl::accessor<int, 1, sycl::access::mode::read_write,
                 sycl::access::target::local>
      slm(sycl::range(32 * 64), h);
  h.parallel_for(sycl::nd_range(sycl::range{N}, sycl::range{32}),
                 [=](sycl::nd_item<1> it) {
                   int i = it.get_global_linear_id();
                   int j = it.get_local_linear_id();

                   slm[j * 16] = 0;
                   it.barrier(sycl::access::fence_space::local_space);

                   for (int m = 0; m < 1024 * 1024; m++) {
                     slm[j * 16] += i * m;
                     it.barrier(sycl::access::fence_space::local_space);
                   }

                   data[i] = slm[j * 16];
                 });
});
If the number of banks is 16, all work-items in the above example read from and write to different addresses in the same bank, so the memory bandwidth is 1/16 of the full bandwidth.
The next example, in contrast, has no SLM bank conflicts and achieves full memory bandwidth because every work-item reads from and writes to different addresses in different banks.
constexpr int N = 32;
int *data = sycl::malloc_shared<int>(N, q);

auto e = q.submit([&](auto &h) {
  sycl::accessor<int, 1, sycl::access::mode::read_write,
                 sycl::access::target::local>
      slm(sycl::range(32 * 64), h);
  h.parallel_for(sycl::nd_range(sycl::range{N}, sycl::range{32}),
                 [=](sycl::nd_item<1> it) {
                   int i = it.get_global_linear_id();
                   int j = it.get_local_linear_id();

                   slm[j] = 0;
                   it.barrier(sycl::access::fence_space::local_space);

                   for (int m = 0; m < 1024 * 1024; m++) {
                     slm[j] += i * m;
                     it.barrier(sycl::access::fence_space::local_space);
                   }

                   data[i] = slm[j];
                 });
});
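To see why the first kernel conflicts and the second does not, apply the bank mapping described above (16 banks at 4-byte, i.e., one-int, granularity): an int at SLM index idx falls in bank idx % 16. This arithmetic sketch assumes that mapping, not a hardware query:

// Bank mapping sketch, assuming 16 banks at 4-byte (one int) granularity.
int bank_of(int idx) { return idx % 16; }

// First kernel:  slm[j * 16] -> bank (j * 16) % 16 == 0 for every j,
// so all 32 work-items hit bank 0 and their accesses serialize.
// Second kernel: slm[j]      -> bank j % 16, so the accesses spread
// evenly across all 16 banks and proceed in parallel.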
Data Sharing and Work-group Barriers
Let us consider the histogram with 256 bins example from the “Avoiding Register Spills” chapter once again.
constexpr int blockSize = 256;
constexpr int NUM_BINS = 256;

std::vector<unsigned long> hist(NUM_BINS, 0);
sycl::buffer<unsigned long, 1> mbuf(input.data(), N);
sycl::buffer<unsigned long, 1> hbuf(hist.data(), NUM_BINS);

auto e = q.submit([&](auto &h) {
  sycl::accessor macc(mbuf, h, sycl::read_only);
  auto hacc = hbuf.get_access<sycl::access::mode::atomic>(h);
  h.parallel_for(
      sycl::nd_range(sycl::range{N / blockSize}, sycl::range{64}),
      [=](sycl::nd_item<1> it) [[intel::reqd_sub_group_size(16)]] {
        int group = it.get_group()[0];
        int gSize = it.get_local_range()[0];
        sycl::ext::oneapi::sub_group sg = it.get_sub_group();
        int sgSize = sg.get_local_range()[0];
        int sgGroup = sg.get_group_id()[0];

        // histogram bins take too much storage to be promoted to registers
        unsigned int histogram[NUM_BINS / 16];
        for (int k = 0; k < NUM_BINS / 16; k++) {
          histogram[k] = 0;
        }

        for (int k = 0; k < blockSize; k++) {
          unsigned long x =
              sg.load(macc.get_pointer() + group * gSize * blockSize +
                      sgGroup * sgSize * blockSize + sgSize * k);
          // subgroup size is 16
#pragma unroll
          for (int j = 0; j < 16; j++) {
            unsigned long y = sycl::group_broadcast(sg, x, j);
#pragma unroll
            for (int i = 0; i < 8; i++) {
              unsigned int c = y & 0xFF;
              // (c & 0xF) is the work-item in which the bin resides
              // (c >> 4) is the bin index
              if (sg.get_local_id()[0] == (c & 0xF)) {
                histogram[c >> 4] += 1;
              }
              y = y >> 8;
            }
          }
        }

        for (int k = 0; k < NUM_BINS / 16; k++) {
          hacc[16 * k + sg.get_local_id()[0]].fetch_add(histogram[k]);
        }
      });
});
This example has been optimized to use the integer data type instead of long and to share registers in the sub-group so that the private histogram bins can fit in registers for optimal performance. If you need more bins (e.g., 1024), the private histogram bins will inevitably spill to global memory.
The histogram bins can be shared by work-items in a work-group as long as each bin is updated atomically.
constexpr int NUM_BINS = 1024;
constexpr int blockSize = 256;

std::vector<unsigned long> hist(NUM_BINS, 0);
sycl::buffer<unsigned long, 1> mbuf(input.data(), N);
sycl::buffer<unsigned long, 1> hbuf(hist.data(), NUM_BINS);

auto e = q.submit([&](auto &h) {
  sycl::accessor macc(mbuf, h, sycl::read_only);
  auto hacc = hbuf.get_access<sycl::access::mode::atomic>(h);
  sycl::accessor<unsigned int, 1, sycl::access::mode::atomic,
                 sycl::access::target::local>
      local_histogram(sycl::range(NUM_BINS), h);
  h.parallel_for(
      sycl::nd_range(sycl::range{N / blockSize}, sycl::range{64}),
      [=](sycl::nd_item<1> it) {
        int group = it.get_group()[0];
        int gSize = it.get_local_range()[0];
        sycl::ext::oneapi::sub_group sg = it.get_sub_group();
        int sgSize = sg.get_local_range()[0];
        int sgGroup = sg.get_group_id()[0];

        int factor = NUM_BINS / gSize;
        int local_id = it.get_local_id()[0];
        if ((factor <= 1) && (local_id < NUM_BINS)) {
          local_histogram[local_id].store(0);
        } else {
          for (int k = 0; k < factor; k++) {
            local_histogram[gSize * k + local_id].store(0);
          }
        }
        it.barrier(sycl::access::fence_space::local_space);

        for (int k = 0; k < blockSize; k++) {
          unsigned long x =
              sg.load(macc.get_pointer() + group * gSize * blockSize +
                      sgGroup * sgSize * blockSize + sgSize * k);
          local_histogram[x & 0x3FFU].fetch_add(1);
          local_histogram[(x >> 16) & 0x3FFU].fetch_add(1);
          local_histogram[(x >> 32) & 0x3FFU].fetch_add(1);
          local_histogram[(x >> 48) & 0x3FFU].fetch_add(1);
        }
        it.barrier(sycl::access::fence_space::local_space);

        if ((factor <= 1) && (local_id < NUM_BINS)) {
          hacc[local_id].fetch_add(local_histogram[local_id].load());
        } else {
          for (int k = 0; k < factor; k++) {
            hacc[gSize * k + local_id].fetch_add(
                local_histogram[gSize * k + local_id].load());
          }
        }
      });
});
When the work-group starts, each work-item in the work-group initializes a portion of the histogram bins in SLM to 0 (the store loop before the first barrier in the above example). You could designate one work-item to initialize all the histogram bins, but it is usually more efficient to divide the job among all work-items in the work-group.
The work-group barrier immediately after this initialization guarantees that all histogram bins are initialized to 0 before any work-item updates any bin.
Because the histogram bins in SLM are shared among all work-items, updates to any bin by any work-item have to be atomic.
The global histogram is updated once the local histogram in the work-group is complete. But before the local SLM bins are read to update the global bins, a work-group barrier is called again to make sure all work-items have completed their updates.
When SLM data is shared, work-group barriers are often required for work-item synchronization. The barrier has a cost and the cost may increase with a larger work-group size. It is always a good idea to try different work-group sizes to find the best one for your application.
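One way to experiment is to time the kernel at several candidate sizes. The following is a sketch: it assumes a queue created with sycl::property::queue::enable_profiling and a hypothetical helper run_histogram(q, wg) that submits the histogram kernel above with work-group size wg and returns its sycl::event.

// Scan candidate work-group sizes and time each kernel run.
// run_histogram() is a hypothetical wrapper around the SLM histogram kernel.
for (size_t wg : {16, 32, 64, 128, 256}) {
  sycl::event e = run_histogram(q, wg);
  e.wait();
  auto t0 =
      e.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto t1 = e.get_profiling_info<sycl::info::event_profiling::command_end>();
  std::cout << "work-group size " << wg << ": " << (t1 - t0) * 1e-6
            << " ms\n";
}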
You can find an SLM version of the 256-bin histogram example in the Examples folder. Compare its performance with that of the version using registers; the results may surprise you and suggest further optimizations.
Using SLM as Cache
You may sometimes find it more desirable to have the application manage caching of some hot data than to have the hardware do it automatically. With the application managing data caching directly, whenever the data is needed, you know exactly where the data is and the cost to access it. The SLM can be used for this purpose.
Consider the following 1-D convolution example:
sycl::buffer<int> ibuf(input.data(), N);
sycl::buffer<int> obuf(output.data(), N);
sycl::buffer<int> kbuf(kernel.data(), M);

auto e = q.submit([&](auto &h) {
  sycl::accessor iacc(ibuf, h, sycl::read_only);
  sycl::accessor oacc(obuf, h);
  sycl::accessor kacc(kbuf, h, sycl::read_only);
  h.parallel_for(sycl::nd_range<1>(sycl::range{N}, sycl::range{256}),
                 [=](sycl::nd_item<1> it) {
                   int i = it.get_global_linear_id();
                   int group = it.get_group()[0];
                   int gSize = it.get_local_range()[0];
                   int t = 0;
                   int _M = static_cast<int>(M);
                   int _N = static_cast<int>(N);
                   if ((group == 0) || (group == _N / gSize - 1)) {
                     if (i < _M / 2) {
                       for (int j = _M / 2 - i, k = 0; j < _M; ++j, ++k) {
                         t += iacc[k] * kacc[j];
                       }
                     } else {
                       if (i + _M / 2 >= _N) {
                         for (int j = 0, k = i - _M / 2;
                              j < _M / 2 + _N - i; ++j, ++k) {
                           t += iacc[k] * kacc[j];
                         }
                       } else {
                         for (int j = 0, k = i - _M / 2; j < _M; ++j, ++k) {
                           t += iacc[k] * kacc[j];
                         }
                       }
                     }
                   } else {
                     for (int j = 0, k = i - _M / 2; j < _M; ++j, ++k) {
                       t += iacc[k] * kacc[j];
                     }
                   }
                   oacc[i] = t;
                 });
});
The example convolves an integer array of 8192 x 8192 elements using a kernel array of 257 elements and writes the result to an output array. Each work-item convolves one element. To convolve one element, however, up to 256 neighboring elements are needed.
Noticing that each input element is used by multiple work-items, you can preload all input elements needed by a whole work-group into SLM. Later, when an element is needed, it can be loaded from SLM instead of global memory.
sycl::buffer<int> ibuf(input.data(), N);
sycl::buffer<int> obuf(output.data(), N);
sycl::buffer<int> kbuf(kernel.data(), M);

auto e = q.submit([&](auto &h) {
  sycl::accessor iacc(ibuf, h, sycl::read_only);
  sycl::accessor oacc(obuf, h);
  sycl::accessor kacc(kbuf, h, sycl::read_only);
  sycl::accessor<int, 1, sycl::access::mode::read_write,
                 sycl::access::target::local>
      ciacc(sycl::range(256 + (M / 2) * 2), h);

  h.parallel_for(
      sycl::nd_range(sycl::range{N}, sycl::range{256}),
      [=](sycl::nd_item<1> it) {
        int i = it.get_global_linear_id();
        int group = it.get_group()[0];
        int gSize = it.get_local_range()[0];
        int local_id = it.get_local_id()[0];
        int _M = static_cast<int>(M);

        ciacc[local_id + M / 2] = iacc[i];
        if (local_id == 0) {
          if (group == 0) {
            for (int j = 0; j < _M / 2; ++j) {
              ciacc[j] = 0;
            }
          } else {
            for (int j = 0, k = i - _M / 2; j < _M / 2; ++j, ++k) {
              ciacc[j] = iacc[k];
            }
          }
        }
        if (local_id == gSize - 1) {
          if (group == static_cast<int>(it.get_group_range()[0]) - 1) {
            for (int j = gSize + _M / 2; j < gSize + _M / 2 + _M / 2; ++j) {
              ciacc[j] = 0;
            }
          } else {
            for (int j = gSize + _M / 2, k = i + 1;
                 j < gSize + _M / 2 + _M / 2; ++j, ++k) {
              ciacc[j] = iacc[k];
            }
          }
        }

        it.barrier(sycl::access::fence_space::local_space);

        int t = 0;
        for (int j = 0, k = local_id; j < _M; ++j, ++k) {
          t += ciacc[k] * kacc[j];
        }
        oacc[i] = t;
      });
});
When the work-group starts, all input elements needed by the work-group are loaded into SLM. Each work-item loads one element into SLM. In addition, the first work-item loads the neighbors to the left of the work-group's first element, and the last work-item loads the neighbors to the right of the work-group's last element. Where neighbors do not exist (at the boundaries of the input), the corresponding SLM elements are filled with 0s.
Before convolution starts in each work-item, a local barrier is called to make sure all input elements are loaded into SLM.
The convolution in each work-item is straightforward. All neighboring elements are loaded from the faster SLM instead of global memory.
Troubleshooting SLM Errors
A CL_OUT_OF_RESOURCES error may occur when a kernel uses more __private or __local memory than the emulator supports by default. When this occurs, you will see an error message similar to this:
$ ./myapp
:
Problem size: c(150,600) = a(150,300) * b(300,600)
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  Native API failed. Native API returns: -5 (CL_OUT_OF_RESOURCES) -5 (CL_OUT_OF_RESOURCES)
Aborted (core dumped)
$
Or if using onetrace:
$ onetrace -c ./myapp
:
>>>> [6254070891] zeKernelSuggestGroupSize: hKernel = 0x263b7a0 globalSizeX = 163850 globalSizeY = 1 globalSizeZ = 1 groupSizeX = 0x7fff94e239f0 groupSizeY = 0x7fff94e239f4 groupSizeZ = 0x7fff94e239f8
<<<< [6254082074] zeKernelSuggestGroupSize [922 ns] -> ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY(0x1879048195)
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  Native API failed. Native API returns: -5 (CL_OUT_OF_RESOURCES) -5 (CL_OUT_OF_RESOURCES)
Aborted (core dumped)
$
To see how much shared local memory the kernel requested and the actual hardware limit, set these debug keys:
export PrintDebugMessages=1
export NEOReadDebugKeys=1
This will change the output to:
$ ./myapp
:
Size of SLM (656384) larger than available (131072)
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  Native API failed. Native API returns: -5 (CL_OUT_OF_RESOURCES) -5 (CL_OUT_OF_RESOURCES)
Aborted (core dumped)
$
Or, if using onetrace:
$ onetrace -c ./myapp
:
>>>> [317651739] zeKernelSuggestGroupSize: hKernel = 0x2175ae0 globalSizeX = 163850 globalSizeY = 1 globalSizeZ = 1 groupSizeX = 0x7ffd9caf0950 groupSizeY = 0x7ffd9caf0954 groupSizeZ = 0x7ffd9caf0958
Size of SLM (656384) larger than available (131072)
<<<< [317672417] zeKernelSuggestGroupSize [10325 ns] -> ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY(0x1879048195)
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what():  Native API failed. Native API returns: -5 (CL_OUT_OF_RESOURCES) -5 (CL_OUT_OF_RESOURCES)
Aborted (core dumped)
$
Once you have determined the amount of memory needed, try setting larger values for the CL_CONFIG_CPU_FORCE_PRIVATE_MEM_SIZE or the CL_CONFIG_CPU_FORCE_LOCAL_MEM_SIZE environment variable, as described in the Emulator Environment Variables section of the oneAPI Programming Guide.
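For example, given the 656,384-byte requirement reported above, a value like the following would suffice (the 1MB value and its format here are only an illustration; consult that section for the accepted value syntax):

export CL_CONFIG_CPU_FORCE_LOCAL_MEM_SIZE=1MB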