Developer Guide

Intel® Processors with Integrated Graphics

Intel® UHD Graphics is a proprietary Intel technology that provides graphics, compute, media, and display capabilities for many Intel® processors. This section focuses on the compute components of the UHD Graphics architecture.

Execution Unit

An Intel GPU consists of a set of execution units (EUs). Each EU is simultaneously multithreaded (SMT) with seven hardware threads. The primary computation units are a pair of Single Instruction Multiple Data (SIMD) Arithmetic Logic Units (ALUs). Each ALU can execute up to four 32-bit floating-point or integer operations, or eight 16-bit floating-point operations. Together, the two ALUs let each EU execute eight 32-bit (F32) SIMD operations, or 16 when a fused multiply-add is counted as two operations, and 16 16-bit (F16) SIMD operations, or 32 with fused multiply-add. Each hardware thread has 128 general-purpose registers (GRF), each 32 bytes wide, for a total of 4 KB of registers per thread, or 28 KB per EU. Each GRF can hold a vector of one, two, four, or eight 32-bit floating-point or integer values, or 16 16-bit values. For convenience, an EU can be modeled as executing seven threads with eight SIMD lanes each, for a total of 56 concurrent 32-bit operations or 112 concurrent 16-bit operations.
Figure: Execution Unit
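The per-EU figures above follow from a few multiplications. The sketch below is only a back-of-the-envelope model that reproduces them; the constant and function names are hypothetical and chosen for illustration, and the numbers are the ones quoted in this section.

```python
# Back-of-the-envelope model of one EU, using the figures quoted in this section.
# All names are illustrative; they do not come from any Intel API.
THREADS_PER_EU = 7       # SMT hardware threads per EU
GRF_PER_THREAD = 128     # general-purpose registers per hardware thread
GRF_WIDTH_BYTES = 32     # each register is 32 bytes wide
SIMD_LANES_F32 = 8       # modeled 32-bit SIMD lanes per thread


def eu_summary():
    """Return the derived per-EU quantities as a dictionary."""
    grf_bytes_per_thread = GRF_PER_THREAD * GRF_WIDTH_BYTES
    return {
        "grf_kb_per_thread": grf_bytes_per_thread // 1024,
        "grf_kb_per_eu": THREADS_PER_EU * grf_bytes_per_thread // 1024,
        "f32_values_per_grf": GRF_WIDTH_BYTES // 4,
        "f16_values_per_grf": GRF_WIDTH_BYTES // 2,
        "concurrent_ops_f32": THREADS_PER_EU * SIMD_LANES_F32,
        "concurrent_ops_f16": THREADS_PER_EU * SIMD_LANES_F32 * 2,
    }


if __name__ == "__main__":
    for name, value in eu_summary().items():
        print(f"{name}: {value}")
```

The model treats every thread as always issuing full-width SIMD operations, so the results are upper bounds rather than expected throughput.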

SubSlice

Each SubSlice contains an EU array, ranging from 8 EUs for Ice Lake with Iris Xe Graphics (ICX) to 16 EUs for Tiger Lake with Iris Xe Graphics (TGL). Hence each SubSlice can perform 448 (ICX) to 896 (TGL) concurrent 32-bit operations. In addition to EUs, each SubSlice also contains an instruction cache, a local thread dispatcher, a read-only texture/image sampler with 64B/cycle of bandwidth, a Dataport with 64B/cycle of bandwidth for both reads and writes, and 64KB of shared local memory (SLM). With 8 EUs per SubSlice, the Dataport's read bandwidth of 64B/cycle averages to 8B/cycle/EU, or two FP32 inputs/cycle/EU. For maximum performance, compute kernels therefore need a high ratio of computation to memory access and must reuse data whenever possible. For read-only data, the sampler unit can supply an additional 64B/cycle of input data to a SubSlice. The total read bandwidth across the Dataport, sampler, and SLM is 192B/cycle.
Figure: SubSlice
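To make the bandwidth argument concrete, the following sketch divides the quoted Dataport read bandwidth among the EUs of a SubSlice and estimates how many FP32 operations a kernel would need per loaded value to keep an EU busy. It is a rough model under stated assumptions: the function names are hypothetical, the peak of 16 FP32 operations per EU (with fused multiply-add) is treated as a per-cycle rate, and the same 64B/cycle Dataport figure is assumed to apply to both the 8-EU and 16-EU configurations.

```python
# Rough bandwidth model for one SubSlice, using the figures quoted in this section.
# Names are illustrative only.
DATAPORT_READ_BYTES_PER_CYCLE = 64
SAMPLER_READ_BYTES_PER_CYCLE = 64
TOTAL_READ_BYTES_PER_CYCLE = 192   # Dataport + sampler + SLM, as quoted
FP32_BYTES = 4
EU_F32_OPS_WITH_FMA = 16           # assumption: treated here as a per-cycle peak


def dataport_fp32_inputs_per_eu(eus_per_subslice):
    """FP32 values per cycle that each EU can read through the Dataport."""
    return DATAPORT_READ_BYTES_PER_CYCLE / eus_per_subslice / FP32_BYTES


def ops_needed_per_input(eus_per_subslice):
    """FP32 operations per loaded value needed to match the EU's peak rate."""
    return EU_F32_OPS_WITH_FMA / dataport_fp32_inputs_per_eu(eus_per_subslice)


def implied_slm_read_bytes_per_cycle():
    """SLM share of the quoted total; implied by subtraction, not stated directly."""
    return (TOTAL_READ_BYTES_PER_CYCLE
            - DATAPORT_READ_BYTES_PER_CYCLE
            - SAMPLER_READ_BYTES_PER_CYCLE)


if __name__ == "__main__":
    for eus in (8, 16):
        print(eus, dataport_fp32_inputs_per_eu(eus), ops_needed_per_input(eus))
    print("implied SLM read bandwidth:", implied_slm_read_bytes_per_cycle())
```

Under this model, a kernel that performs only one or two operations per value fetched through the Dataport is limited by memory access rather than by the ALUs, which is why data reuse matters.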
The SLM is a 64KB, highly banked memory accessible from the EUs in the SubSlice. One important use of SLM is to share global atomic data among all the 448 (ICX) to 896 (TGL) concurrent work-items executing in a SubSlice. For this reason, if a kernel's work-group contains synchronization operations, all work-items of the work-group must be allocated to a single SubSlice so that they have shared access to the same 64KB SLM. The work-group size must be chosen carefully to maximize the occupancy and utilization of the SubSlice. In contrast, if a kernel does not access SLM, its work-items can be dispatched across multiple SubSlices for high occupancy and utilization.
The maximum number of work-groups that can be executed on a single SubSlice is 16. Small work-groups may make it impossible to achieve full occupancy on a SubSlice.
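The effect of the 16 work-group limit on occupancy can be estimated with a small calculation. The sketch below is a simplified model, not a description of the actual hardware scheduler: it assumes each work-group is split into sub-groups of `simd_width` work-items (8 by default), that each sub-group occupies one hardware thread, and that SLM and barrier limits are ignored. The function name and parameters are hypothetical.

```python
import math

# Simplified occupancy model for one SubSlice. Names are illustrative only.
MAX_WORK_GROUPS_PER_SUBSLICE = 16   # limit quoted in this section
THREADS_PER_EU = 7                  # hardware threads per EU


def subslice_occupancy(work_group_size, eus_per_subslice, simd_width=8):
    """Fraction of hardware threads in a SubSlice that can be kept busy.

    Assumes one hardware thread per sub-group of `simd_width` work-items
    and ignores SLM capacity and barrier limits.
    """
    threads_available = THREADS_PER_EU * eus_per_subslice
    threads_per_group = math.ceil(work_group_size / simd_width)
    # Work-groups that fit by thread count, capped by the work-group limit.
    groups_by_threads = threads_available // threads_per_group
    resident_groups = min(MAX_WORK_GROUPS_PER_SUBSLICE, groups_by_threads)
    return resident_groups * threads_per_group / threads_available


if __name__ == "__main__":
    for size in (16, 32, 64, 256):
        print(size, round(subslice_occupancy(size, eus_per_subslice=8), 3))
```

In this model, a work-group of 16 work-items on an 8-EU SubSlice uses two threads, so 16 resident work-groups occupy only 32 of the 56 available threads. Very large work-groups can also lower occupancy when the thread count is not a multiple of the threads per work-group.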
The following table summarizes the computing capacity of a SubSlice.
SubSlice computing capacity

GPU Generation       EUs   Threads         Operations       Maximum Work Groups
Intel Iris Xe ICX    8     7 × 8 = 56      56 × 8 = 448     16
Intel Iris Xe TGL    16    7 × 16 = 112    112 × 8 = 896    16

Slice

On the Intel® Iris® Xe Graphics (ICX) GPU, eight SubSlices form a Slice for an aggregate total of 64 EUs. Each Slice also contains a shared 3,072KB L3 cache and other slice-common units for graphics and media processing. The Intel® Iris® Xe Graphics on Intel Ice Lake processors contains one Slice with 64 EUs, as illustrated below, which amounts to 3,584 simultaneous computations. For good performance, an application must keep the EU occupancy high with thousands of work-items.
Note that the number of SubSlices in each Intel® Iris® Xe Graphics generation is subject to change. The GPUs on Intel Tiger Lake processors contain six SubSlices, each with sixteen EUs. The coming generations of Intel CPUs with Intel UHD Graphics contain multiple Slices to further scale computation capacity.
Figure: Intel® Iris® Xe Graphics on Intel Ice Lake processors, one Slice

Architecture Parameters across Generations

The following table summarizes the key architecture parameters of the currently released products with Intel UHD Graphics:
Key architecture parameters, Intel UHD Graphics

Generations              Threads per EU   EUs per SubSlice   SubSlices   Total Threads   Total Operations
Gen9 (BDW)               7                8                  3           168             1344
Intel Iris Xe (Gen11)    7                8                  8           448             3584
Intel Iris Xe (Gen12)    7                16                 6           672             5376
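The totals in the table can be derived from the per-EU and per-SubSlice parameters. The following sketch recomputes them, assuming the eight-lane SIMD model of an EU used earlier in this section; the `GpuConfig` class and its field names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuConfig:
    """Illustrative container for the table's parameters (not an Intel API)."""
    name: str
    threads_per_eu: int
    eus_per_subslice: int
    subslices: int
    simd_lanes: int = 8  # assumption: eight 32-bit lanes per thread

    @property
    def total_threads(self):
        return self.threads_per_eu * self.eus_per_subslice * self.subslices

    @property
    def total_operations(self):
        return self.total_threads * self.simd_lanes


# Parameter values copied from the table in this section.
CONFIGS = [
    GpuConfig("Gen9 (BDW)", 7, 8, 3),
    GpuConfig("Intel Iris Xe (Gen11)", 7, 8, 8),
    GpuConfig("Intel Iris Xe (Gen12)", 7, 16, 6),
]

if __name__ == "__main__":
    for cfg in CONFIGS:
        print(f"{cfg.name}: {cfg.total_threads} threads, "
              f"{cfg.total_operations} operations")
```

The total operations column is the number of work-items that can be in flight at once under this model, which is why kernels need thousands of work-items to fill the GPU.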

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.