Intel® Xe GPU Architecture
Xe GPU is Intel’s GPU architecture designed for integrated graphics, discrete graphics and data center GPUs.
There are multiple generations of the Xe GPU architecture. For example, the second generation is referred to as Xe2, the third generation as Xe3, and so on.
Each Xe GPU generation has different microarchitectures:
Xe HPC = High Performance Computing
Xe HP = Data Center / AI
Xe HPG = High Performance Gaming
Xe LP/LPG = Lower Power Optimized
The performance difference between the GPU microarchitectures depends on hardware characteristics such as compute capabilities, memory and cache sizes, data type support, process technology, and frequency, which are described in subsequent sections of this document.
Each Xe GPU microarchitecture may have multiple GPU models, or GPU SKUs. The main differentiating characteristic between GPU models within a microarchitecture is the number of Xe-cores in the GPU; the Xe-core is the fundamental building block of Intel Xe GPUs.
Intel® GPU Hierarchy:
GPU Architecture - Xe
GPU Generations - (Xe, Xe2, Xe3, …)
GPU Microarchitectures - (Xe LP/LPG, Xe HPG, Xe HPC, …)
GPU Models - (Intel® Arc™ A770 Graphics, Intel® Arc™ A750 Graphics, …)
The following table lists the hardware characteristics across the Xe family GPUs in order of the launch year.
The table shows hardware characteristics for one popular GPU model in each GPU microarchitecture; usually more than one model is available for each microarchitecture.
The table includes GPUs designed for all market segments: Low Power Graphics, High Performance Gaming, Data Center/AI, and High Performance Computing. Any of these GPU products can be used to offload computations; the capability and performance of GPU computation vary depending on the hardware characteristics.
Architecture | Xe2-HPG | Xe2-LPG | Xe-LPG | Xe-HPC | Xe-HP | Xe-HPG | Xe-LP |
---|---|---|---|---|---|---|---|
GPU Model | Intel® Arc™ B580 Graphics | Intel® Core™ Ultra 7 Processor 265 | Intel® Core™ Ultra 7 Processor 155H | Intel® Data Center GPU MAX 1550 | Intel® Data Center GPU Flex 170 | Intel® Arc™ A770 Graphics | Intel® Iris® Xe Graphics |
Year Launched | 2024 | 2024 | 2023 | 2023 | 2022 | 2022 | 2020 |
Intel code name | BattleMage | Lunar Lake | Meteor Lake | Ponte Vecchio | Alchemist Server | Alchemist | Tiger Lake |
GPU Type | Discrete | Integrated | Integrated | Discrete | Discrete | Discrete | Integrated |
Use Case | High Performance Gaming and AI | AI PC | AI PC | High Performance Computing | DataCenter and AI | High Performance Gaming | Low Power Graphics |
Xe-Core count | 20 | 8 | 8 | 64 x 2 | 32 | 32 | 6 |
Vector Engines per Xe-Core | 8 | 8 | 16 | 8 | 16 | 16 | 16 |
Vector Engine count | 160 | 64 | 128 | 512 x 2 | 512 | 512 | 96 |
Hardware Threads per Vector Engine | 8 | 8 | 8 | 8 | 8 | 8 | 7 |
Hardware Thread count | 1280 | 512 | 1024 | 4096 x 2 | 4096 | 4096 | 672 |
Matrix Engine (XMX Support or DPAS) | Yes | Yes | Yes | Yes | Yes | Yes | No |
Double Precision Native Support | Yes | Yes | Yes | Yes | Yes | No | No |
General Register File (GRF) registers per thread | 128 / 256 (regular mode / large register mode) | 128 / 256 (regular mode / large register mode) | 128 / 256 (regular mode / large register mode) | 128 / 256 (regular mode / large register mode) | 128 | 128 | 128 |
Register Width | 512 bits | 512 bits | 256 bits | 512 bits | 256 bits | 256 bits | 256 bits |
Global Memory Size | 12 GB | Shared System Memory | Shared System Memory | 128 GB | 16 GB | 16 GB | Shared System Memory |
L3 cache size | 18 MB | 8 MB | 4 MB | 2 x 192 MB | 16 MB | 16 MB | 3.75 MB |
L1 cache size per Xe-Core | 256 KB | 192 KB | 160 KB | 512 KB | 192 KB | 192 KB | 0 KB |
SLM size per Xe-Core | 128 KB | 128 KB | 128 KB | 128 KB | 128 KB | 128 KB | 64 KB |
Max SLM size per Work-Group | 128 KB | 128 KB | 64 KB | 128 KB | 64 KB | 64 KB | 64 KB |
Max Work-Group Size | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 512 |
Supported Sub-group Sizes | 16, 32 | 16, 32 | 8, 16, 32 | 16, 32 | 8, 16, 32 | 8, 16, 32 | 8, 16, 32 |
The table above can be used to understand the hardware characteristics for each Xe Family GPU.
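The derived rows of the table follow from simple products: the Vector Engine count is the Xe-core count multiplied by the Vector Engines per Xe-core, and the hardware thread count is that result multiplied by the hardware threads per Vector Engine. The following is a minimal sketch of that arithmetic. The `GpuConfig` struct, the function names, and the constants are hypothetical helpers for illustration, not part of SYCL or any Intel API; the numeric values are taken from the table above.

```cpp
#include <cstdint>

// Hypothetical helper type for illustration; not part of SYCL or any Intel API.
struct GpuConfig {
  std::uint32_t stacks;              // Xe Stacks in the GPU (1 for most models)
  std::uint32_t xe_cores_per_stack;  // Xe-cores in each stack
  std::uint32_t ves_per_xe_core;     // Vector Engines per Xe-core
  std::uint32_t threads_per_ve;      // hardware threads per Vector Engine
};

// Total Xe-cores across all stacks.
constexpr std::uint32_t xe_core_count(const GpuConfig& g) {
  return g.stacks * g.xe_cores_per_stack;
}

// Total Vector Engines = Xe-cores x Vector Engines per Xe-core.
constexpr std::uint32_t vector_engine_count(const GpuConfig& g) {
  return xe_core_count(g) * g.ves_per_xe_core;
}

// Total hardware threads = Vector Engines x hardware threads per Vector Engine.
constexpr std::uint32_t hardware_thread_count(const GpuConfig& g) {
  return vector_engine_count(g) * g.threads_per_ve;
}

// Values taken from the table above.
constexpr GpuConfig kArcB580{1, 20, 8, 8};  // Xe2-HPG
constexpr GpuConfig kMax1550{2, 64, 8, 8};  // Xe-HPC, two stacks
constexpr GpuConfig kIrisXe{1, 6, 16, 7};   // Xe-LP

// Compile-time checks against the table rows.
static_assert(vector_engine_count(kArcB580) == 160, "Xe2-HPG Vector Engines");
static_assert(hardware_thread_count(kMax1550) == 8192, "Xe-HPC: 4096 x 2");
static_assert(hardware_thread_count(kIrisXe) == 672, "Xe-LP hardware threads");
```

For the two-stack MAX 1550, `hardware_thread_count(kMax1550)` evaluates to 8192, which is the table's "4096 x 2" written as a single total.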
Querying for Intel GPU hardware characteristics
Some Intel GPU hardware characteristics are needed to program kernels that offload computations to the GPU efficiently. Characteristics such as Xe-core count, Vector Engine count, SLM size, maximum work-group size, and supported sub-group sizes can be queried using SYCL device information. Others, such as the sizes of the hardware-managed L1 and L3 caches, cannot be queried programmatically; refer to the hardware specification for this information. The following SYCL code example shows how to query the device information:
```cpp
#include <iostream>
#include <sycl/sycl.hpp>

int main() {
  // Select a GPU device and keep a handle to it for the queries below.
  sycl::queue q(sycl::gpu_selector_v);
  const sycl::device dev = q.get_device();

  auto device_name = dev.get_info<sycl::info::device::name>();

  // Intel extension queries use slice/subslice/EU terminology; as the output
  // below shows, a subslice maps to an Xe-core and an EU to a Vector Engine.
  auto numSlices = dev.get_info<sycl::ext::intel::info::device::gpu_slices>();
  auto numSubslicesPerSlice = dev.get_info<sycl::ext::intel::info::device::gpu_subslices_per_slice>();
  auto numEUsPerSubslice = dev.get_info<sycl::ext::intel::info::device::gpu_eu_count_per_subslice>();
  auto numThreadsPerEU = dev.get_info<sycl::ext::intel::info::device::gpu_hw_threads_per_eu>();

  // Standard SYCL queries; memory sizes are reported in bytes.
  auto global_mem_size = dev.get_info<sycl::info::device::global_mem_size>();
  auto local_mem_size = dev.get_info<sycl::info::device::local_mem_size>();
  auto max_work_group_size = dev.get_info<sycl::info::device::max_work_group_size>();
  auto sub_group_sizes = dev.get_info<sycl::info::device::sub_group_sizes>();

  std::cout << "Intel GPU Characteristics:\n";
  std::cout << " GPU Model : " << device_name << "\n";
  std::cout << " XeCore count : " << numSlices * numSubslicesPerSlice << "\n";
  std::cout << " Vector Engines per XeCore : " << numEUsPerSubslice << "\n";
  std::cout << " Vector Engine count : " << numSlices * numSubslicesPerSlice * numEUsPerSubslice << "\n";
  std::cout << " Hardware Threads per Vector Engine : " << numThreadsPerEU << "\n";
  std::cout << " Hardware Thread count : " << numSlices * numSubslicesPerSlice * numEUsPerSubslice * numThreadsPerEU << "\n";
  std::cout << " GPU Memory Size : " << global_mem_size << "\n";
  std::cout << " Shared Local Memory per Work-group : " << local_mem_size << "\n";
  std::cout << " Max Work-group size : " << max_work_group_size << "\n";
  std::cout << " Supported Sub-group sizes : ";
  for (auto size : sub_group_sizes) std::cout << size << " ";
  std::cout << "\n";
  return 0;
}
```
Intel® Xe GPU Building Blocks
At a high level, the Intel® Xe GPUs have many Xe-cores. Each Xe-core has several Vector Engines, which are multi-threaded with several hardware threads. The Vector Engine consists of Arithmetic Logic Units (ALUs) which support different data type operations. The following section describes each of these building blocks.
The Intel® Data Center GPU MAX 1550 is used in the following sections as an example to break down the GPU architecture. It has 128 Xe-cores, each Xe-core has 8 Vector Engines, and each Vector Engine is multi-threaded with 8 hardware threads.
Vector Engine
The Vector Engine is the smallest thread-level building block of Intel Xe GPUs. In the example GPU, each Vector Engine is simultaneously multithreaded with 8 hardware threads, and each hardware thread executes SIMD instructions with a width of 16 or 32.
The Vector Engine capability varies between GPU families and determines how many operations can be executed per cycle. The Vector Engine consists of multiple Single Instruction Multiple Data (SIMD) Arithmetic Logic Units (ALUs) that support operations on different data types (FP64, FP32, FP16, BF16, INT64, INT32, INT16, INT8, etc.). Each ALU may support a few data types. The configuration of ALUs in the Vector Engine determines the computational performance.
Each hardware thread within the Vector Engine has a dedicated large-capacity, high-bandwidth General Register File (GRF). The number and width of the registers are shown in the table above.
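The per-thread register file capacity follows from the two register rows of the table: the number of registers per thread multiplied by the register width. The following is a minimal sketch of that arithmetic. The function names are hypothetical, and the per-Vector-Engine total simply multiplies by whatever hardware thread count the caller supplies.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not part of SYCL or any Intel API.

// GRF capacity of one hardware thread, in bytes:
// (registers per thread) x (register width in bits) / 8.
constexpr std::uint64_t grf_bytes_per_thread(std::uint64_t registers,
                                             std::uint64_t register_width_bits) {
  return registers * register_width_bits / 8;
}

// GRF capacity of one Vector Engine, in bytes, for a given number of
// hardware threads per Vector Engine.
constexpr std::uint64_t grf_bytes_per_vector_engine(
    std::uint64_t registers, std::uint64_t register_width_bits,
    std::uint64_t threads_per_ve) {
  return grf_bytes_per_thread(registers, register_width_bits) * threads_per_ve;
}

// Xe-HPC, regular mode: 128 registers of 512 bits is 8 KB per thread.
static_assert(grf_bytes_per_thread(128, 512) == 8 * 1024, "8 KB per thread");
// Xe-HPC, large register mode: 256 registers of 512 bits is 16 KB per thread.
static_assert(grf_bytes_per_thread(256, 512) == 16 * 1024, "16 KB per thread");
// Xe-LP: 128 registers of 256 bits is 4 KB per thread.
static_assert(grf_bytes_per_thread(128, 256) == 4 * 1024, "4 KB per thread");
```

With the Xe-HPC values in regular mode (128 registers, 512 bits, 8 hardware threads), `grf_bytes_per_vector_engine(128, 512, 8)` gives 65536 bytes, or 64 KB per Vector Engine.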
The Vector Engine count varies by product generation and by model within a generation, and Vector Engine capabilities have evolved across generations of the architecture.
Xe-Core
The Xe-Core is the fundamental building block of Intel Xe GPUs. An Xe-Core contains vector and matrix ALUs, which are referred to as Vector Engines and Matrix Engines. Each Xe-Core of the Xe HPC family has 8 Vector Engines and 8 Matrix Engines, also known as Intel® Xe Matrix Extensions (Intel® XMX).
Each Xe-Core has an L1 cache and Shared Local Memory (SLM) that are shared by all 8 Vector Engines within the Xe-Core. Refer to Thread Mapping and GPU Occupancy for more information.
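As a rough illustration of how these building blocks relate to a kernel launch, the number of hardware threads a work-group occupies can be estimated as the work-group size divided by the sub-group (SIMD) size, rounded up. This sketch assumes that each sub-group is executed by one hardware thread and that a work-group runs within a single Xe-Core; the function names are hypothetical and the numbers come from the table above.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not part of SYCL or any Intel API.
// Assumption: each sub-group is executed by one hardware thread.

// Hardware threads needed by one work-group (rounded up).
constexpr std::uint32_t threads_per_work_group(std::uint32_t work_group_size,
                                               std::uint32_t sub_group_size) {
  return (work_group_size + sub_group_size - 1) / sub_group_size;
}

// Hardware threads available in one Xe-Core.
constexpr std::uint32_t threads_per_xe_core(std::uint32_t ves_per_xe_core,
                                            std::uint32_t threads_per_ve) {
  return ves_per_xe_core * threads_per_ve;
}

// True if one work-group fits within the hardware threads of a single Xe-Core.
constexpr bool fits_in_one_xe_core(std::uint32_t work_group_size,
                                   std::uint32_t sub_group_size,
                                   std::uint32_t ves_per_xe_core,
                                   std::uint32_t threads_per_ve) {
  return threads_per_work_group(work_group_size, sub_group_size) <=
         threads_per_xe_core(ves_per_xe_core, threads_per_ve);
}

// MAX 1550: 8 Vector Engines x 8 hardware threads = 64 threads per Xe-Core.
// A 1024-item work-group at sub-group size 16 needs exactly 64 threads.
static_assert(threads_per_work_group(1024, 16) == 64, "64 threads needed");
static_assert(fits_in_one_xe_core(1024, 16, 8, 8), "fills one Xe-Core");
```

Under these assumptions, the maximum work-group size of 1024 at sub-group size 16 fills all 64 hardware threads of one MAX 1550 Xe-Core, while the same work-group at sub-group size 32 needs only 32 of them.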
Xe Stack
A collection of Xe-Cores, ray tracing units, hardware contexts, memory controllers, and media engines makes up an Xe Stack, which is a fully functional GPU. The Xe Stack has an L3 cache that is accessible to all Xe-Cores.
Xe GPU
One or two Xe Stacks make up a GPU. Currently, only the Xe HPC family has a multi-stack configuration; other Xe family GPUs, such as Xe HPG, Xe LP, and Xe2 HPG, have only one Xe Stack. Refer to Multi-Stack GPU Architecture for more information.