• 2019 Update 4
  • 03/20/2019
  • Public Content

Memory Access Overview

Optimizing memory accesses is the first step to achieving high performance with OpenCL™ on Intel® Graphics. Tune your kernel to access memory at an optimal granularity and at optimal addresses.
The OpenCL™ implementation for Intel® Graphics primarily accesses global and constant memory through the following caches:
  • GPU-specific L3 cache
  • CPU and GPU shared Last Level Cache (LLC).
Of these two caches, the L3 cache is the more important one to optimize memory accesses for. The L3 cache line is 64 bytes.
Finally, there are L1 and L2 caches that are specific to the sampler and renderer.
Accesses to __global memory and __constant memory go through the L3 cache and LLC. In addition, __private memory that spills from registers does the same. If multiple OpenCL work-items in the same hardware thread make requests to the same L3 cache line, these requests are collapsed into a single request. This means that the effective __global memory, __constant memory, and __private memory bandwidth is determined by the number of L3 cache lines accessed.
For example, if work-items in the same hardware thread access two L3 cache lines, the effective memory bandwidth is half of what it is when they access only one L3 cache line.
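The relationship between access pattern and the number of L3 cache lines touched can be sketched with a small host-side model in C. This is an illustration only, not part of the OpenCL runtime: the function name count_l3_lines is hypothetical, and the 64-byte line size and the choice of 16 work-items per hardware thread in the usage note are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed L3 cache line size in bytes (an assumption, not a queried value). */
#define L3_LINE_BYTES 64u

/*
 * Count the distinct L3 cache lines touched when `num_items` work-items each
 * read one element of `elem_bytes` bytes, where work-item i reads at byte
 * address base + i * stride_elems * elem_bytes.
 *
 * Assumes stride_elems >= 1 and that no element straddles a cache line, so
 * addresses never decrease and each new line index has not been seen before.
 */
size_t count_l3_lines(uint64_t base, size_t num_items,
                      size_t elem_bytes, size_t stride_elems)
{
    size_t lines = 0;
    uint64_t prev_line = UINT64_MAX; /* sentinel: no line seen yet */

    for (size_t i = 0; i < num_items; ++i) {
        uint64_t addr = base + (uint64_t)i * stride_elems * elem_bytes;
        uint64_t line = addr / L3_LINE_BYTES;
        if (line != prev_line) {
            ++lines;
            prev_line = line;
        }
    }
    return lines;
}
```

Under these assumptions, 16 work-items reading consecutive 4-byte values from a 64-byte-aligned base touch one line (count_l3_lines(0, 16, 4, 1) returns 1). A stride of two elements, or a base offset of 32 bytes, touches two lines, which corresponds to the halved effective bandwidth described above.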
__local memory is allocated directly from the L3 cache and is divided into 16 banks at a 32-bit granularity. Because it is so highly banked, it is more important to minimize bank conflicts when accessing local memory than to minimize the number of L3 cache lines accessed.
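Bank conflicts for a strided local-memory access pattern can be estimated with a similar host-side model. This is a simplified sketch: the function name max_bank_load is hypothetical, the 32-bit bank width is an assumption, and the model counts every access to the same bank as a conflict, which real hardware may not do (for example, when several work-items read the same address).

```c
#include <stddef.h>

#define LOCAL_BANKS 16u /* bank count stated in the text */

/*
 * Return the largest number of work-items that map to the same bank when
 * work-item i accesses the 32-bit word at index i * stride_words.
 * Assumes one bank per consecutive 32-bit word, wrapping every 16 words.
 * A result of 1 means the pattern is conflict-free in this model.
 */
size_t max_bank_load(size_t num_items, size_t stride_words)
{
    size_t load[LOCAL_BANKS] = {0};
    size_t worst = 0;

    for (size_t i = 0; i < num_items; ++i) {
        size_t bank = (i * stride_words) % LOCAL_BANKS;
        if (++load[bank] > worst) {
            worst = load[bank];
        }
    }
    return worst;
}
```

In this model, 16 work-items with a stride of one word each hit a different bank (max_bank_load(16, 1) returns 1), while a stride of 16 words sends all of them to the same bank (returns 16). Padding the stride to 17 words spreads the accesses across all banks again, which is a common way to avoid conflicts when accessing columns of a 16-wide tile.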
All memory can be accessed in 8-bit, 16-bit, or 32-bit quantities. 32-bit quantities can be accessed as vectors of one, two, three, or four components.
