Intel® Graphics Compute Architecture uses system memory as compute device memory. This memory is unified: the GPU shares the same DRAM with the CPU. The obvious performance advantage is that shared physical memory enables zero-copy transfers between the host CPU and the Intel® Graphics OpenCL™ device. The same zero-copy path works for the CPU OpenCL™ device and for a shared CPU-GPU context. Refer to the “Mapping Memory Objects” section for more information.
The Compute Architecture memory system is augmented with several levels of caches:
Read-only memory path for OpenCL images, which includes level-1 (L1) and level-2 (L2) sampler caches. Image writes follow a different path (see below);
Level-3 (L3) data cache, which is a slice-shared asset. All reads and writes on OpenCL buffers flow through the L3 data cache in units of 64-byte cache lines. The L3 cache also serves sampler read transactions that miss in the L1 and L2 sampler caches, and it supports sampler writes. See section “Execution of OpenCL™ Work-Items: the SIMD Machine” for details on slice-shared assets.
The L3 efficiency is highest for accesses that are cache-line-aligned and adjacent within a cache line.
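The effect of alignment and adjacency on L3 traffic can be estimated by counting how many 64-byte lines an access pattern covers. The following is a minimal sketch in plain C, not Intel code: the function names `lines_touched` and `lines_touched_strided` are hypothetical, and the model only counts lines, ignoring every other hardware detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* L3 cache line width stated in the text */

/* Number of distinct 64-byte lines covered by one contiguous access of
 * `bytes` bytes starting at byte address `addr`. */
size_t lines_touched(uint64_t addr, size_t bytes)
{
    assert(bytes > 0);
    uint64_t first = addr / CACHE_LINE_BYTES;
    uint64_t last  = (addr + bytes - 1) / CACHE_LINE_BYTES;
    return (size_t)(last - first + 1);
}

/* Number of distinct lines touched when `n` work-items each read one
 * 4-byte element, item i at byte address base + i * stride_bytes.
 * Assumes 4-byte-aligned addresses, so no element straddles two lines.
 * Addresses never decrease, so counting line changes counts lines. */
size_t lines_touched_strided(uint64_t base, size_t stride_bytes, size_t n)
{
    assert(n > 0 && base % 4 == 0 && stride_bytes % 4 == 0);
    size_t count = 0;
    uint64_t prev_line = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t line = (base + (uint64_t)i * stride_bytes) / CACHE_LINE_BYTES;
        if (i == 0 || line != prev_line) {
            ++count;
            prev_line = line;
        }
    }
    return count;
}
```

Under this model, sixteen adjacent 4-byte reads starting at a 64-byte boundary cover one line, the same reads starting at byte offset 32 cover two, and sixteen reads with a 64-byte stride cover sixteen lines.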
Shared Local Memory (SLM) is a dedicated structure within the L3 that supports the work-group local memory address space. The read/write bus interface to shared local memory is again 64 bytes wide, but shared local memory is organized as 16 banks at 4-byte granularity. This organization can yield full-bandwidth access even for access patterns that are not 64-byte aligned or not contiguous in memory.
The amount of SLM is an important limiting factor for the number of work-groups that can be executed simultaneously on the device. Use the clGetDeviceInfo call with the CL_DEVICE_LOCAL_MEM_SIZE parameter to query the exact value.
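To illustrate why the SLM amount caps concurrency, the arithmetic can be sketched as a simple division. This is a rough model under stated assumptions, not a description of the actual hardware scheduler: `max_resident_work_groups`, `slm_pool_bytes` and `slm_per_group_bytes` are hypothetical names, and the pool size is an input you would have to supply for your device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Upper bound, imposed by SLM alone, on the number of work-groups that
 * can be resident at once.
 *   slm_pool_bytes      - SLM shared by concurrently resident groups
 *                         (hypothetical input, device dependent)
 *   slm_per_group_bytes - local memory requested by one work-group
 * Returns SIZE_MAX when a kernel uses no local memory, meaning SLM is
 * not the limiting factor. Other resources may still limit residency. */
size_t max_resident_work_groups(size_t slm_pool_bytes,
                                size_t slm_per_group_bytes)
{
    assert(slm_pool_bytes > 0);
    if (slm_per_group_bytes == 0) {
        return SIZE_MAX;
    }
    return slm_pool_bytes / slm_per_group_bytes;
}
```

For example, with an assumed 64 KB pool, a kernel that requests 16 KB per work-group allows at most four resident work-groups under this model, while one that requests 40 KB allows only one.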
As shared local memory is highly banked, it is more important to minimize bank conflicts when accessing local memory than to minimize the number of cache lines.
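Bank conflicts can be reasoned about with a small model of the 16-bank, 4-byte-granularity layout. The sketch below assumes that consecutive 4-byte words map to consecutive banks, which the text does not state explicitly; `slm_bank` and `max_bank_conflict` are hypothetical helper names, and the model covers only accesses to distinct words.

```c
#include <assert.h>
#include <stdint.h>

#define SLM_BANKS      16u  /* bank count stated in the text */
#define SLM_BANK_BYTES  4u  /* bank granularity stated in the text */

/* Bank index of a byte address, ASSUMING consecutive 4-byte words are
 * interleaved across consecutive banks. */
unsigned slm_bank(uint32_t byte_addr)
{
    return (byte_addr / SLM_BANK_BYTES) % SLM_BANKS;
}

/* Largest number of accesses that land in a single bank when `lanes`
 * work-items each access one 4-byte word, lane i at word index
 * i * stride_words. A result of 1 means the pattern is conflict-free.
 * stride_words must be non-zero so that every lane hits a distinct word. */
unsigned max_bank_conflict(unsigned lanes, unsigned stride_words)
{
    unsigned hits[SLM_BANKS] = {0};
    unsigned worst = 0;
    assert(lanes > 0 && stride_words > 0);
    for (unsigned i = 0; i < lanes; ++i) {
        unsigned b = slm_bank(i * stride_words * SLM_BANK_BYTES);
        if (++hits[b] > worst) {
            worst = hits[b];
        }
    }
    return worst;
}
```

In this model, sixteen lanes reading with a stride of 16 words (for instance, a column of a 16-wide float tile) all hit the same bank, whereas padding each row to 17 words spreads the same column across all sixteen banks.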
Finally, the entire architecture interfaces to the rest of the SoC components via a dedicated interface unit called the Graphics Technology Interface (GTI). The rest of the SoC memory hierarchy includes the large Last-Level Cache (LLC), which is shared between the CPU and GPU, possibly embedded DRAM, and finally the system DRAM.
Figure 4. View of memory hierarchy and peak bandwidths (in bytes/cycle) for the Gen7.5 compute architecture (4th Generation Intel® Core™ family of microprocessors).
The following sections describe memory access in more detail.