Developer Guide

Intel® Iris® Xe GPU Architecture

The Intel® Iris® Xe GPU family consists of a series of microarchitectures, ranging from integrated/low power (Xe-LP), to enthusiast/high performance gaming (Xe-HPG), data center/AI (Xe-HP) and high performance computing (Xe-HPC).
Intel® Iris® Xe family
This chapter introduces Xe GPU family microarchitectures and configuration parameters.

Xe-LP Execution Units (EUs)

An Execution Unit (EU) is the smallest thread-level building block of the Intel® Iris® Xe-LP GPU architecture. Each EU is simultaneously multithreaded (SMT) with seven threads. The primary computation unit consists of an 8-wide Single Instruction Multiple Data (SIMD) Arithmetic Logic Unit (ALU) supporting SIMD8 FP/INT operations and a 2-wide SIMD ALU supporting SIMD2 extended math operations. Each hardware thread has 128 general-purpose registers (GRF), each 32B wide.
Xe-LP-EU
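As a quick arithmetic check of the register file figures above, the following minimal sketch computes the GRF capacity per hardware thread and per EU. The constant and function names are illustrative only and do not correspond to any Intel API.

```python
# Figures taken from the text above; all names here are illustrative.
THREADS_PER_EU = 7        # SMT hardware threads per Xe-LP EU
GRF_REGISTERS = 128       # general-purpose registers per hardware thread
GRF_REGISTER_BYTES = 32   # width of one register in bytes
FP32_BYTES = 4            # size of one single-precision value


def grf_bytes_per_thread() -> int:
    """Register file bytes available to a single hardware thread."""
    return GRF_REGISTERS * GRF_REGISTER_BYTES


def grf_bytes_per_eu() -> int:
    """Register file bytes across all hardware threads of one EU."""
    return THREADS_PER_EU * grf_bytes_per_thread()


def fp32_values_per_register() -> int:
    """Number of FP32 values that fit in one 32B register."""
    return GRF_REGISTER_BYTES // FP32_BYTES


if __name__ == "__main__":
    print(grf_bytes_per_thread() // 1024, "KB of GRF per thread")
    print(grf_bytes_per_eu() // 1024, "KB of GRF per EU")
    print(fp32_values_per_register(), "FP32 values per register")
```

One 32B register holds eight FP32 values, which lines up with the SIMD8 width of the primary ALU.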
The Xe-LP EU supports the data types FP16, INT16 and INT8 for AI applications. The Intel® GPU Compute Throughput Rates (Ops/clock/EU) table compares the EU throughput rates of Xe-LP with those of Intel® Gen 11 GPUs.
Intel® GPU Compute Throughput Rates (Ops/clock/EU)

| Data type | Intel® Iris® Xe-LP | Gen 11 |
|---|---|---|
| FP32 | 8 | 8 |
| FP16 | 16 | 16 |
| INT32 | 8 | 4 |
| INT16 | 16 | 8 |
| INT8 | 32 (DP4A) | NA |

Xe-LP Dual Subslices

Each Xe-LP Dual Subslice (DSS) consists of an EU array of 16 EUs, an instruction cache, a local thread dispatcher, Shared Local Memory (SLM), and a data port of 128B/cycle. It is called a dual subslice because the hardware can pair two EUs for SIMD16 execution.
The SLM is a 128KB high-bandwidth memory accessible from the EUs in the subslice. One important use of SLM is to share atomic data and signals among the concurrent work-items executing in a subslice. For this reason, if a kernel’s work-group contains synchronization operations, all work-items of the work-group must be allocated to a single subslice so that they have shared access to the same 128KB SLM. The work-group size must be chosen carefully to maximize the occupancy and utilization of the subslice. In contrast, if a kernel does not access SLM, its work-items can be dispatched across multiple subslices.
The following table summarizes the computing capacity of a subslice.
Subslice computing capacity

| GPU Generation | EUs | Threads | Operations |
|---|---|---|---|
| Intel Iris Xe ICX | 8 | 7 × 8 = 56 | 56 × 8 = 448 |
| Intel Iris Xe-LP TGL | 16 | 7 × 16 = 112 | 112 × 8 = 896 |
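The capacity of a subslice follows from seven hardware threads per EU and SIMD8 ALUs. The sketch below computes those figures and gives a rough estimate of how many work-groups can be resident in one subslice at a time. The occupancy model is a simplifying assumption, not a hardware specification: each sub-group is taken to occupy one hardware thread, and a work-group is limited by available threads and by the 128KB SLM. All function and parameter names are hypothetical.

```python
import math
from typing import Tuple

THREADS_PER_EU = 7      # hardware threads per EU (from the text)
SIMD_WIDTH = 8          # native SIMD8 ALU width (from the text)
SLM_BYTES = 128 * 1024  # SLM per Xe-LP dual subslice (from the text)


def subslice_capacity(eus: int) -> Tuple[int, int]:
    """Return (hardware threads, SIMD8 operations) for one subslice."""
    threads = THREADS_PER_EU * eus
    return threads, threads * SIMD_WIDTH


def resident_work_groups(eus: int, wg_size: int, sub_group_size: int,
                         slm_per_wg: int = 0) -> int:
    """Estimate how many work-groups fit in one subslice at once.

    Assumed model: one sub-group per hardware thread, and a work-group
    must fit entirely within a single subslice.
    """
    threads, _ = subslice_capacity(eus)
    threads_per_wg = math.ceil(wg_size / sub_group_size)
    by_threads = threads // threads_per_wg
    if slm_per_wg > 0:
        # SLM is shared by every work-group resident in the subslice.
        return min(by_threads, SLM_BYTES // slm_per_wg)
    return by_threads


if __name__ == "__main__":
    print(subslice_capacity(16))
    print(resident_work_groups(16, wg_size=256, sub_group_size=16))
    print(resident_work_groups(16, wg_size=256, sub_group_size=16,
                               slm_per_wg=32 * 1024))
```

Under this model, a 256-item work-group compiled at SIMD16 needs 16 threads, so seven such groups use all 112 threads of a 16-EU subslice. If each group also requests 32KB of SLM, only four fit, and SLM becomes the limiting resource.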

Xe-LP Slice

Each Xe-LP slice consists of six (dual) subslices for a total of 96 EUs, up to 16MB L3 cache, 128B/cycle bandwidth to L3 and 128B/cycle bandwidth to memory.
Xe-LP slice

Intel UHD Architecture Parameters across Generations

The following table summarizes the key architecture parameters of currently released products with Intel UHD Graphics:
Key architecture parameters, Intel UHD Graphics

| Generations | Threads per EU | EUs per SubSlice | SubSlices | Total Threads | Total Operations |
|---|---|---|---|---|---|
| Gen9 (BDW) | 7 | 8 | 3 | 168 | 1344 |
| Intel Iris Xe ICL (Gen11) | 7 | 8 | 8 | 448 | 3584 |
| Intel Iris Xe-LP TGL (Gen12) | 7 | 16 | 6 | 672 | 5376 |
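The totals in this table are products of the per-level parameters. The following sketch recomputes them, assuming (as the figures imply) that "Total Operations" counts eight SIMD lanes per hardware thread. The dictionary layout and the `totals` helper are illustrative, not part of any Intel tool.

```python
SIMD_WIDTH = 8  # lanes per hardware thread implied by the table

# (threads per EU, EUs per subslice, subslices), as listed in the table
GENERATIONS = {
    "Gen9 (BDW)": (7, 8, 3),
    "Intel Iris Xe ICL (Gen11)": (7, 8, 8),
    "Intel Iris Xe-LP TGL (Gen12)": (7, 16, 6),
}


def totals(threads_per_eu: int, eus_per_subslice: int, subslices: int):
    """Return (total hardware threads, total SIMD8 operations)."""
    total_threads = threads_per_eu * eus_per_subslice * subslices
    return total_threads, total_threads * SIMD_WIDTH


if __name__ == "__main__":
    for name, params in GENERATIONS.items():
        threads, ops = totals(*params)
        print(f"{name}: {threads} threads, {ops} operations")
```

Read this way, the "Total Operations" column is also the number of work-items needed to give every hardware thread one SIMD8 sub-group.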

Xe-Core

Unlike Xe-LP and prior generations of Intel GPUs, which used the Execution Unit (EU) as a compute unit, Xe-HPG and Xe-HPC use the Xe-core. This is similar to an Xe-LP dual subslice.
An Xe-core contains vector and matrix ALUs, which are referred to as vector and matrix engines.
An Intel® Iris® Xe-core contains 8 vector and 8 matrix engines, alongside a large 512KB L1 cache/SLM. It powers the Ponte Vecchio GPU.
Each vector engine is 512 bits wide, supporting 16 FP32 SIMD operations with fused multiply-add (FMA). With 8 vector engines, an Xe-core delivers 512 FP16, 256 FP32 and 256 FP64 operations/cycle.
Each matrix engine is 4096 bits wide. With 8 matrix engines, an Xe-core delivers 8192 INT8, 4096 FP16/BF16 and 2048 FP32 operations/cycle.
An Xe-core also provides 512B/cycle load/store bandwidth to the memory system.
Xe-core
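Most of the per-cycle throughput figures quoted for the Xe-core can be reproduced from the engine widths. The sketch below assumes a simple lane model: lanes = engine width / element width, with each fused multiply-add counted as two operations. That counting convention is an assumption that happens to match the quoted numbers; the names are illustrative.

```python
ENGINES_PER_CORE = 8      # vector engines (and matrix engines) per Xe-core
VECTOR_WIDTH_BITS = 512   # width of one vector engine
MATRIX_WIDTH_BITS = 4096  # width of one matrix engine
OPS_PER_FMA = 2           # assumption: multiply + add counted separately


def ops_per_cycle(width_bits: int, element_bits: int,
                  engines: int = ENGINES_PER_CORE) -> int:
    """Operations per cycle per Xe-core under the simple lane model."""
    lanes = width_bits // element_bits
    return engines * lanes * OPS_PER_FMA


if __name__ == "__main__":
    print("vector FP32:", ops_per_cycle(VECTOR_WIDTH_BITS, 32))
    print("vector FP16:", ops_per_cycle(VECTOR_WIDTH_BITS, 16))
    print("matrix INT8:", ops_per_cycle(MATRIX_WIDTH_BITS, 8))
    print("matrix FP16/BF16:", ops_per_cycle(MATRIX_WIDTH_BITS, 16))
    print("matrix FP32:", ops_per_cycle(MATRIX_WIDTH_BITS, 32))
```

The quoted 256 FP64 operations/cycle does not follow from this lane model, which would give 128 for 64-bit elements in a 512-bit engine, so that figure should be taken as stated in the text rather than derived.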

Xe-Slice

An Xe-slice contains 16 Xe-cores for a total of 8MB L1 cache, 16 ray tracing units and 1 hardware context.
Xe-slice

Xe-Stack

An Xe-stack contains up to 4 Xe-slices: 64 Xe-cores, 64 ray tracing units, 4 hardware contexts, 4 HBM2e controllers, 1 media engine, and 8 Xe-Link high-speed coherent fabric links. It also contains a shared L2 cache.
Xe-stack

Xe-HPC 2-Stack Ponte Vecchio GPU

An Xe-HPC 2-stack Ponte Vecchio GPU consists of 2 stacks: 8 slices, 128 Xe-cores, 128 ray tracing units, 8 hardware contexts, 8 HBM2e controllers, and 16 Xe-Links.

Xe-HPG GPU

Xe-HPG is the enthusiast or high performance gaming variant of the Xe architecture. The microarchitecture is focused on graphics performance and supports hardware-accelerated ray tracing.
Each Xe-HPG-core contains 16 vector engines and 16 matrix engines. Each vector engine is 256 bits wide, supporting 8 FP32 SIMD vector operations per cycle. An Xe-HPG GPU consists of 8 Xe-HPG-slices, each containing up to 4 Xe-HPG-cores, for a total of 4096 FP32 ALUs/shader cores.
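The 4096 figure is the product of the counts in this paragraph, with each 256-bit vector engine contributing eight 32-bit lanes. The short sketch below spells that out; the function name and defaults are illustrative.

```python
def xe_hpg_fp32_alus(slices: int = 8, cores_per_slice: int = 4,
                     vector_engines_per_core: int = 16,
                     engine_width_bits: int = 256) -> int:
    """Count FP32 lanes across a fully populated Xe-HPG GPU."""
    lanes_per_engine = engine_width_bits // 32  # eight FP32 lanes
    return (slices * cores_per_slice
            * vector_engines_per_core * lanes_per_engine)


if __name__ == "__main__":
    print(xe_hpg_fp32_alus())
```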

Terminology and Configuration Summary

The following Architecture Terminology Changes table maps legacy GPU terminology (used in Generation 9 through Generation 12 Intel® Core architectures) to the new names in the Intel® Iris® Xe GPU (Generation 12.7 and newer) architecture paradigm.
Architecture Terminology Changes

| Old Term | New Intel Term | Generic Term | New Abbreviation |
|---|---|---|---|
| Execution Unit (EU) | Xe Vector Engine | Vector Engine | XVE |
| Systolic/"DPAS part of EU" | Xe Matrix eXtension | Matrix Engine | XMX |
| Subslice (SS) or Dual Subslice (DSS) | Xe-core | NA | XC |
| Slice | Render Slice / Compute Slice | Slice | SLC |
| Tile | Stack | Stack | STK |
The following Xe Configurations table lists the hardware characteristics across the Xe family GPUs.
Xe Configurations

| Architecture | Xe-LP (TGL) | Xe-HPG (DG2) | Xe-HPC (PVC 1 Stack) |
|---|---|---|---|
| Slice count | 1 | 8 | 4 |
| XC (DSS) count | 6 | 32 | 64 |
| XVE (EU) / XC | 16 | 16 | 8 |
| XVE count | 96 | 512 | 512 |
| Threads / XVE | 7 | 8 | 8 |
| Thread count | 672 | 4096 | 4096 |
| FLOPs / clk - single precision, MAD | 1536 | 8192 | 16384 |
| FLOPs / clk - double precision, MAD | NA | NA | 16384 |
| FLOPs / clk - FP16 DP4AS | NA | NA | 262144 |
| GTI bandwidth bytes / unslice-clk | r:128, w:128 | r:512, w:512 | r:1024, w:1024 |
| LL cache size | 3.84MB | 16MB | up to 204MB |
| SLM size | 6 × 128KB | [formula image not recovered] | [formula image not recovered] |
| FMAD, SP (ops / XVE / clk) | 8 | 8 | 16 |
| SQRT, SP (ops / XVE / clk) | 2 | 2 | 4 |
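Several rows of the configuration table are derived from others: XVE count = XC count × XVE per XC, thread count = XVE count × threads per XVE, and single-precision FLOPs/clk = XVE count × FMAD rate × 2. The sketch below recomputes them. For Xe-HPC it uses 8 vector engines per Xe-core, as stated in the Xe-Core section, which is the value consistent with an XVE count of 512. The `peak_sp_tflops` helper and its clock argument are hypothetical conveniences; no clock frequency is given in this chapter.

```python
# name: (XC count, XVE per XC, threads per XVE, FMAD SP ops/XVE/clk)
CONFIGS = {
    "Xe-LP (TGL)": (6, 16, 7, 8),
    "Xe-HPG (DG2)": (32, 16, 8, 8),
    "Xe-HPC (PVC 1 Stack)": (64, 8, 8, 16),
}


def derive(xc: int, xve_per_xc: int, threads_per_xve: int,
           fmad_sp: int) -> dict:
    """Recompute the derived rows of the configuration table."""
    xve = xc * xve_per_xc
    return {
        "xve_count": xve,
        "thread_count": xve * threads_per_xve,
        # A multiply-add counts as two floating-point operations.
        "sp_flops_per_clk": xve * fmad_sp * 2,
    }


def peak_sp_tflops(flops_per_clk: int, clock_ghz: float) -> float:
    """Peak single-precision TFLOPS at a caller-supplied clock."""
    return flops_per_clk * clock_ghz / 1000.0


if __name__ == "__main__":
    for name, params in CONFIGS.items():
        print(name, derive(*params))
```

Multiplying the FLOPs/clk figure by an actual device clock, for example a value queried from the driver at run time, gives a theoretical peak to compare against measured kernel throughput.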
