Intel® Xe GPU Architecture

Developer Guide

oneAPI GPU Optimization Guide

ID 771772

Date 7/10/2025

Version

Public

X^e GPU is Intel’s GPU architecture designed for integrated graphics, discrete graphics and data center GPUs.

There are multiple generations of the X^e GPU architecture. For example, the second generation of X^e GPU is referred to as X^e2 GPU, the third generation as X^e3 GPU, and so on.

Each X^e GPU generation has different microarchitectures:

X^e HPC = High Performance Computing
X^e HP = Data Center / AI
X^e HPG = High Performance Gaming
X^e LP/LPG = Lower Power Optimized

"Diagram of the XE GPU family"

The performance difference between GPU microarchitectures depends on hardware characteristics such as compute capabilities, memory and cache sizes, data type support, process technology, and frequency, which are described in subsequent sections of this document.

Each X^e GPU microarchitecture may have multiple GPU models (SKUs). The main differentiating characteristic between GPU models within a microarchitecture is the number of X^e-cores, the fundamental building block of Intel X^e GPUs.

Intel^® GPU Hierarchy:

GPU Architecture - X^e
GPU Generations - (X^e, X^e2, X^e3, …)
GPU Micro-architectures - (X^e LP/LPG, X^e HPG, X^e HPC, …)
GPU Models - (Intel^® Arc^TM A770 Graphics, Intel^® Arc^TM A750 Graphics, …)

The following table lists the hardware characteristics across the X^e family GPUs in order of the launch year.

The table shows hardware characteristics for one popular GPU model in each GPU microarchitecture; usually more than one model is available for each microarchitecture.

The table includes GPUs designed for all market segments: Low Power Graphics, High Performance Gaming, Data Center/AI, and High Performance Computing. Any of these GPU products can be used to offload computations; the capability and performance of GPU computation vary with the hardware characteristics.

Architecture	X^e2-HPG	X^e2-LPG	X^e-LPG	X^e-HPC	X^e-HP	X^e-HPG	X^e-LP
GPU Model	Intel^® Arc^TM B580 Graphics	Intel^® Core^TM Ultra 7 Processor 265	Intel^® Core^TM Ultra 7 Processor 155H	Intel^® Data Center GPU MAX 1550	Intel^® Data Center GPU Flex 170	Intel^® Arc^TM A770 Graphics	Intel^® Iris^® X^e Graphics
Year Launched	2024	2024	2023	2023	2022	2022	2020
Intel code name	BattleMage	Lunar Lake	Meteor Lake	Ponte Vecchio	Alchemist Server	Alchemist	Tiger Lake
GPU Type	Discrete	Integrated	Integrated	Discrete	Discrete	Discrete	Integrated
Use Case	High Performance Gaming and AI	AI PC	AI PC	High Performance Computing	DataCenter and AI	High Performance Gaming	Low Power Graphics
X^e-Core count	20	8	8	64 x 2	32	32	6
Vector Engines per X^e-Core	8	8	16	8	16	16	16
Vector Engine count	160	64	128	512 x 2	512	512	96
Hardware Threads per Vector Engine	8	8	8	8	8	8	7
Hardware Thread count	1280	512	1024	4096 x 2	4096	4096	672
Matrix Engine (XMX Support or DPAS)	Yes	Yes	Yes	Yes	Yes	Yes	No
Double Precision Native Support	Yes	Yes	Yes	Yes	Yes	No	No
General Register File (GRF) registers per thread	128 / 256 (regular mode / large register mode)	128 / 256 (regular mode / large register mode)	128 / 256 (regular mode / large register mode)	128 / 256 (regular mode / large register mode)	128	128	128
Register Width	512 bits	512 bits	256 bits	512 bits	256 bits	256 bits	256 bits
Global Memory Size	12 GB	Shared System Memory	Shared System Memory	128 GB	16 GB	16 GB	Shared System Memory
L3 cache size	18 MB	8 MB	4 MB	2 x 192 MB	16 MB	16 MB	3.75 MB
L1 cache size per X^e-Core	256 KB	192 KB	160 KB	512 KB	192 KB	192 KB	0 KB
SLM size per X^e-Core	128 KB	128 KB	128 KB	128 KB	128 KB	128 KB	64 KB
Max SLM size per Work-Group	128 KB	128 KB	64 KB	128 KB	64 KB	64 KB	64 KB
Max Work-Group Size	1024	1024	1024	1024	1024	1024	512
Supported Sub-group Sizes	16, 32	16, 32	8, 16, 32	16, 32	8, 16, 32	8, 16, 32	8, 16, 32

The table above can be used to understand the hardware characteristics for each X^e Family GPU.

Querying for Intel GPU hardware characteristics

Some Intel GPU hardware characteristics are necessary for efficiently programming kernels that offload computations to the GPU. Characteristics such as X^e-core count, Vector Engine count, and SLM size can be queried using SYCL device information. Hardware-managed characteristics such as L1 and L3 cache sizes cannot be queried programmatically; refer to the hardware specification for that information. The following SYCL code example shows how to query the device information:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
   // Select a GPU device; the queue constructor throws if none is available.
   sycl::queue q(sycl::gpu_selector_v);
   sycl::device dev = q.get_device();

   auto device_name = dev.get_info<sycl::info::device::name>();

   // Intel extension queries. They use legacy names: slices x subslices per slice
   // gives the Xe-core count, and an EU corresponds to a Vector Engine.
   auto numSlices = dev.get_info<sycl::ext::intel::info::device::gpu_slices>();
   auto numSubslicesPerSlice = dev.get_info<sycl::ext::intel::info::device::gpu_subslices_per_slice>();
   auto numEUsPerSubslice = dev.get_info<sycl::ext::intel::info::device::gpu_eu_count_per_subslice>();
   auto numThreadsPerEU = dev.get_info<sycl::ext::intel::info::device::gpu_hw_threads_per_eu>();

   // Standard SYCL queries; memory sizes are reported in bytes.
   auto global_mem_size = dev.get_info<sycl::info::device::global_mem_size>();
   auto local_mem_size = dev.get_info<sycl::info::device::local_mem_size>();
   auto max_work_group_size = dev.get_info<sycl::info::device::max_work_group_size>();
   auto sub_group_sizes = dev.get_info<sycl::info::device::sub_group_sizes>();

   std::cout << "Intel GPU Characteristics:\n";
   std::cout << " GPU Model : " << device_name << "\n";
   std::cout << " XeCore count : " << numSlices * numSubslicesPerSlice << "\n";
   std::cout << " Vector Engines per XeCore : " << numEUsPerSubslice << "\n";
   std::cout << " Vector Engine count : " << numSlices * numSubslicesPerSlice * numEUsPerSubslice << "\n";
   std::cout << " Hardware Threads per Vector Engine : " << numThreadsPerEU << "\n";
   std::cout << " Hardware Threads count : " << numSlices * numSubslicesPerSlice * numEUsPerSubslice * numThreadsPerEU << "\n";
   std::cout << " GPU Memory Size : " << global_mem_size << " bytes\n";
   std::cout << " Shared Local Memory per Work-group : " << local_mem_size << " bytes\n";
   std::cout << " Max Work-group size : " << max_work_group_size << "\n";
   std::cout << " Supported Sub-group sizes : ";
   for (auto size : sub_group_sizes) std::cout << size << " ";
   std::cout << "\n";

   return 0;
}

Intel^® X^e GPU Building Blocks

At a high level, the Intel^® X^e GPUs have many X^e-cores. Each X^e-core has several Vector Engines, which are multi-threaded with several hardware threads. The Vector Engine consists of Arithmetic Logic Units (ALUs) which support different data type operations. The following section describes each of these building blocks.

The Intel^® Data Center GPU MAX 1550 is used in the following section as an example to break down the GPU architecture. It has 128 X^e-cores (64 in each of its two stacks), each X^e-core has 8 Vector Engines, and each Vector Engine is multi-threaded with 8 hardware threads. Each building block is described below.

Vector Engine

The Vector Engine is the smallest thread-level building block of Intel X^e GPUs. It is simultaneously multithreaded, with 8 hardware threads. Each hardware thread executes SIMD instructions with a width of 16 or 32.

The Vector Engine capability varies between GPU families and determines how many operations can be executed per cycle. The Vector Engine consists of multiple Single Instruction Multiple Data (SIMD) Arithmetic Logic Units (ALUs) supporting different data type operations (FP64, FP32, FP16, BF16, INT64, INT32, INT16, INT8, etc.). Each ALU may support a few data types. The configuration of ALUs in the Vector Engine determines the computational performance.

Each hardware thread within the Vector Engine has a dedicated large-capacity, high-bandwidth General Register File (GRF). The number of registers and the register width are shown in the table above.

The number of Vector Engines varies by product generation and by model within a generation, and Vector Engine capabilities have evolved across generations of the architecture.

X^e-Core

The X^e-Core is a fundamental building block of Intel X^e GPUs. An X^e-Core contains vector and matrix ALUs, which are referred to as Vector Engines and Matrix Engines. Each X^e-Core of the X^e HPC family has 8 Vector Engines and 8 Matrix Engines, also known as Intel^® X^e Matrix Extensions (Intel^® XMX).

Each X^e-Core has an L1 cache and Shared Local Memory (SLM) that are shared by all 8 Vector Engines within the X^e-Core. Refer to Thread Mapping and GPU Occupancy for more information.

"Diagram of an XE core"

X^e Stack

A collection of X^e-Cores, ray tracing units, hardware contexts, memory controllers, and media engines makes up an X^e Stack, which is a fully functional GPU, as shown below. The stack has an L3 cache that is accessible to all X^e-Cores.

"Diagram of an XE stack"

X^e GPU

One or two X^e Stacks make up a GPU. Currently, only the X^e HPC family has a multi-stack configuration; other X^e family GPUs, such as X^e HPG, X^e LP, and X^e2 HPG, have only one X^e Stack. Refer to Multi-Stack GPU Architecture for more information.

"Diagram of a multi-stack XE GPU configuration"
