Intel® Data Center GPU Max Series Technical Overview

ID 766085
Updated 4/8/2024
Version Current
Public

By David Mulnix

This article discusses the capabilities available in the Intel® Data Center GPU Max Series (previously code-named Ponte Vecchio) and how developers can take advantage of them. This new GPU product is based on the Intel Xe HPC micro-architecture and targets the highly parallel computing models associated with AI and HPC. It is supported by the oneAPI open ecosystem, with the flexibility of both Single Instruction Multiple Data (SIMD) and Single Instruction Multiple Threads (SIMT) execution.

Figure 1 – High-Level Xe HPC Stack Component Overview

At the heart of the product is the Xe HPC Stack, which comprises various tiles stacked upon each other within a single package. The Xe HPC Stack consists of the Xe-core Tile (Compute Tile), the L2 Cache Tile, the Base Tile (PCIe paths, media engine, copy engine, etc.), the High Bandwidth Memory (HBM) Tile, the Xe Link Tile for scale-up and scale-out, and the Embedded Multi-Die Interconnect Bridge (EMIB) for communication between Xe HPC Stacks. These components all reside within a multi-tile package.

Intel Xe HPC Micro-Architecture

Over the course of this article, we are going to cover the different functional aspects of the total solution. This includes a discussion of how the architecture scales: multiple execution units are combined into an individual core, multiple cores are combined into a slice, multiple slices are combined into a stack, and multiple stacks are incorporated into different form factors to become different product lines.

Figure 2 – Block Diagram of the Different Components of the Intel Xe HPC Micro-Architecture

Intel Xe is a scalable graphics and compute architecture designed to deliver exceptional performance and functionality. The product line includes a discrete PCIe add-in card and an OpenCompute Accelerator Module (OAM) to provide solutions that require increased GPU density within the data center. Intel's Xe architecture comes in one of three micro-architectures: Xe LP, a low-power solution; Xe HPG, optimized for enthusiasts and gaming; and Xe HPC, optimized for HPC and AI acceleration. The Intel® Data Center GPU Max Series is built upon the Xe HPC micro-architecture, with a compute-focused, programmable, and scalable element called the Xe-core. The Xe-core combines Vector Engines, Matrix Engines (referred to in this article as Intel® Xe Matrix Extensions, or Intel® XMX), cache, and shared local memory. Each Xe-core contains eight 512-bit Vector Engines designed to accelerate traditional graphics, compute, and HPC workloads, along with eight 4096-bit Intel XMX engines engineered to accelerate AI workloads. The Xe-core provides 512KB of L1 cache and shared local memory to support the engines.

Figure 3 – Xe-core Block Diagram With Xe HPC Micro-Architecture
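As a practical starting point, these per-core resources can be inspected from a oneAPI (SYCL) application. The following is a minimal sketch using standard SYCL 2020 device queries; the reported values depend on the device and driver:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // Pick the first available GPU; assumes a working oneAPI runtime.
    sycl::queue q{sycl::gpu_selector_v};
    sycl::device dev = q.get_device();

    // Standard SYCL 2020 queries; on Xe HPC these reflect the resources
    // described above (e.g., shared local memory available per work-group).
    std::cout << "Device: "
              << dev.get_info<sycl::info::device::name>() << "\n"
              << "Compute units: "
              << dev.get_info<sycl::info::device::max_compute_units>() << "\n"
              << "Shared local memory (bytes): "
              << dev.get_info<sycl::info::device::local_mem_size>() << "\n";
}
```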

Developers can optimize their software for the Xe HPC micro-architecture by using libraries and compilers that leverage its instruction set architecture (ISA) and take advantage of the Intel XMX units, previously referred to as the Data Parallel Matrix Engine. Each execution unit of the Xe-core contains three ports that support a variety of data types to help maximize application performance. The Xe-core's eight vector engines, each with 512-bit-wide registers, support the following vector (non-systolic) numerics: FP64 (256 ops per clock), FP32 (256 ops per clock), and FP16 (512 ops per clock). The Intel XMX systolic array depth has been increased to 8, double that of an Xe HPG core, providing 4096 bits for each of the units. Intel XMX supports the following systolic numerics: TF32 (2048 ops per clock), FP16 (4096 ops per clock), BF16 (4096 ops per clock), and INT8 (8192 ops per clock). New data types introduced to support specific segments include a 64-bit double-precision float (FP64) to benefit HPC workloads utilizing the Vector Engines, and a 32-bit tensor float (TF32) to benefit AI workloads using Intel XMX. The L1 data cache and shared local memory have been enlarged to sustain the additional FLOPS introduced by these new data types, and the load/store unit has been enhanced for n-dimensional tensor data structures.

Figure 4 – Inside one of the execution units of the Xe-core on the Xe HPC micro-architecture
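Because not every device supports every numeric type, a portable application can gate its kernel selection on standard SYCL 2020 aspect queries. A minimal sketch:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // Select the default GPU; assumes a supported oneAPI backend is installed.
    sycl::device dev{sycl::gpu_selector_v};
    std::cout << "Device: " << dev.get_info<sycl::info::device::name>() << "\n";

    // Standard SYCL 2020 aspect queries: use these to choose an FP64 or
    // reduced-precision kernel path at run time.
    std::cout << "FP64 supported: " << dev.has(sycl::aspect::fp64) << "\n";
    std::cout << "FP16 supported: " << dev.has(sycl::aspect::fp16) << "\n";
}
```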

The operations supplied to an execution unit of the Xe HPC micro-architecture are serviced by one of the three ports. Integer and floating-point operations are separated onto two different ports to increase throughput by running calculations in parallel. The port architecture allows AI applications, matrix operations, element-wise vector operations, and address computations to all be executed concurrently. If broader HPC applications require higher precision, the dual-issued double-precision pipes deliver FP64 vector compute at the same rate as FP32.

Figure 5 – Data Type Comparison

Deep neural networks have a high tolerance for reduced numerical precision. Comparing the different data types, higher system performance with AI workloads can be achieved with lower- and mixed-precision operations. The 32-bit floating-point format (FP32) is the default for deep learning training and inference. When we look at what the other data types offer compared to FP32, we can see their advantages. The 32-bit tensor float (TF32) data type, which can be enabled with a simple global setting, reduces the precision (a 10-bit mantissa) while keeping the same 8-bit exponent; because of this, TF32 throughput is higher than FP32 simply based on the amount of work done per clock cycle. The 16-bit floating-point format (FP16) can do more work per clock cycle than TF32, but its narrow 5-bit exponent reduces dynamic range, so it requires special handling techniques via loss scaling. The Brain Floating Point format (BF16) is also a 16-bit floating-point data type. It matches the dynamic range of FP32 and TF32 with an 8-bit exponent while having lower precision than either, with only a 7-bit mantissa. It does not require the special handling needed by FP16 yet matches its throughput advantage, making it ideal for training as compared to FP32, FP16, or TF32. The 8-bit integer (INT8) data type can be used for post-training quantization for faster inference: its smaller 8-bit size, compared to FP32, FP16, BF16, or TF32, reduces both memory and compute requirements. Although faster, it lacks the precision of the other data types, which makes it useful for inference rather than training. Cloud and edge devices running deep learning with smaller memory and bandwidth budgets are one area that can benefit from INT8.
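To make the BF16 trade-off concrete, here is a minimal, hedged sketch (not Intel library code) that converts FP32 to BF16 by truncation, keeping the sign bit, the full 8-bit exponent, and the top 7 mantissa bits; hardware and production libraries typically round to nearest even instead:

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Truncate FP32 to BF16: the top 16 bits of an FP32 value are exactly
// sign (1) + exponent (8) + upper mantissa (7), i.e., the BF16 layout.
uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// Expand BF16 back to FP32 by zero-filling the lost mantissa bits.
float bf16_to_fp32(uint16_t b) {
    uint32_t bits = static_cast<uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    float x = 3.14159265f;
    float y = bf16_to_fp32(fp32_to_bf16(x));
    std::printf("FP32: %.8f  BF16 round trip: %.8f\n", x, y);
}
```

The round trip preserves magnitude (the full exponent survives) while the mantissa loses precision, which is exactly the trade-off described above.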

Due to the general-purpose nature of the execution unit, it can process both Single Instruction Multiple Data (SIMD) and Single Instruction Multiple Threads (SIMT). Typically, SIMD is used by CPUs while SIMT is used by GPUs. The execution unit provides parallelism across both threads and data elements, so software developers can exploit the parallel nature of their data while also extending parallelism across instructions. Simultaneously executing both SIMT and SIMD provides a unique opportunity for potential performance improvements, and oneAPI makes this approach straightforward to develop for.
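A minimal SYCL sketch of this data-parallel style, assuming a oneAPI compiler (e.g., icpx -fsycl) and a USM-capable GPU: the developer expresses one work-item per data element, and the compiler and runtime map those work-items onto SIMD lanes and hardware threads:

```cpp
#include <sycl/sycl.hpp>

int main() {
    constexpr size_t n = 1024;
    sycl::queue q{sycl::gpu_selector_v};

    // Unified shared memory, accessible from both host and device.
    float *a = sycl::malloc_shared<float>(n, q);
    float *b = sycl::malloc_shared<float>(n, q);
    float *c = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // One work-item per element; the runtime maps work-items onto the
    // vector lanes and threads of the execution units.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
}
```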

Xe HPC Slice

Building up from the Xe-cores to create a functional SoC (System on a Chip) infrastructure, we start with the Xe HPC Slice. It comprises sixteen Xe-cores combined with sixteen ray tracing units and a hardware context. The ray tracing units provide fixed-function computation for ray traversal, bounding box intersection, and triangle intersection. The hardware context enables concurrent execution of multiple applications without expensive software-based context switches.

Figure 6 – Xe HPC Slice

Xe HPC Stack

Four Xe HPC Slices are then combined with an L2 cache, one media engine, a Gen 5 PCI Express interconnect, a memory controller, and eight Xe Links to form an Xe HPC Stack. The L2 cache can be up to 204MB in size, and the media engine performs decode-only functions for AVC, AV1, and HEVC. In a two-stack configuration, communication occurs via the Embedded Multi-Die Interconnect Bridge (EMIB) interface at up to 230 GB/s in both directions. The Xe Links provide a high-speed, coherent, unified fabric for communication between GPUs to support scale-up and scale-out.

Figure 7 – Xe HPC Stack

From a form-factor perspective, a single Xe HPC Stack can be used to create a PCIe add-in card, while more than one Xe HPC Stack can be combined within a physical package known as an OpenCompute Accelerator Module (OAM). The OAMs are placed onto a carrier baseboard for scalability, and the sled can be plugged into a server via PCI Express.
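From oneAPI, a multi-stack package can often be addressed per stack. The following is a hedged sketch using standard SYCL 2020 partitioning; whether stacks appear as sub-devices or as separate root devices depends on the driver and its device-hierarchy configuration:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    sycl::device root{sycl::gpu_selector_v};

    // Standard SYCL 2020 partitioning; on a multi-stack device each
    // sub-device typically corresponds to one Xe HPC Stack.
    try {
        std::vector<sycl::device> stacks = root.create_sub_devices<
            sycl::info::partition_property::partition_by_affinity_domain>(
            sycl::info::partition_affinity_domain::next_partitionable);
        std::cout << "Sub-devices (stacks): " << stacks.size() << "\n";
    } catch (const sycl::exception &) {
        std::cout << "Device does not support this partitioning.\n";
    }
}
```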

Xe Link

The Xe Link is designed to provide a high-speed, coherent, unified fabric for communication between Xe HPC Stacks. This includes both internal communication within an OAM and communication across multiple OAMs. It supports load and store operations, bulk data transfers, and synchronization semantics. The technology is a scalability feature that can provide communications for up to eight Xe HPC Stacks (x8 glueless) via an embedded switch. Each Xe Link is capable of up to 26.5 GB/s of bandwidth in each direction.

Figure 8 - Xe Link Configuration Options

Scale Up

The Xe Link is integral to scale-up and scale-out solutions. Expanding the number of Xe HPC Stack instances within a single system is known as a scale-up solution. This can take on different configurations depending on whether PCI Express cards or OAMs are used. A PCI Express card has a single Xe HPC Stack and six Xe Links available for scaling up. Two PCI Express card configurations are supported: a two-card configuration with six Xe Links between the Xe HPC Stacks, and a four-card configuration with two Xe Links between each pair of Xe HPC Stacks. With a bandwidth of 26.5 GB/s in each direction per Xe Link, the two-card configuration provides 159 GB/s in each direction, and the four-card configuration provides 53 GB/s in each direction between each pair of stacks. Figure 8 shows the varying levels of Xe Link connectivity that are possible with different numbers of PCI Express cards and OAMs. Ease of use for developers on scale-up solutions is provided by the OS kernel, which exposes all the execution units without requiring any code changes.
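The per-configuration numbers follow directly from the per-link rate; a small sketch of the arithmetic, with the link counts taken from the configurations above:

```cpp
#include <cstdio>

int main() {
    constexpr double per_link_gbps = 26.5;  // GB/s per Xe Link, each direction

    // Two-card topology: six Xe Links between the two stacks.
    std::printf("Two cards:  %.0f GB/s each direction\n", 6 * per_link_gbps);
    // Four-card topology: two Xe Links between each pair of stacks.
    std::printf("Four cards: %.0f GB/s each direction\n", 2 * per_link_gbps);
}
```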

Scale Out

A scale-out solution connects multiple nodes together to create a larger cluster of execution units. Up to sixty-four interconnected OAMs can be linked together using a single form of external communication between the nodes. The inter-node communication can be a NIC-based solution using an InfiniBand fabric; alternatively, an Xe Link glueless solution using directly connected passive copper cables can link together up to sixty-four OAMs. Scaling out beyond sixty-four OAMs can be accomplished by combining Xe Link glueless connectivity with a NIC-based InfiniBand fabric, a combination that can support up to 512 interconnected OAMs.
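On the software side, scale-out deployments are commonly programmed with MPI, typically one rank per stack or OAM. A minimal, hedged sketch in which each rank binds to one visible GPU; the rank-to-device mapping policy here is an assumption and varies by cluster setup:

```cpp
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Enumerate the GPUs visible on this node and bind this rank to one;
    // assumes ranks are placed densely per node.
    auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
    sycl::queue q{gpus[rank % gpus.size()]};

    std::printf("Rank %d/%d using %s\n", rank, size,
                q.get_device().get_info<sycl::info::device::name>().c_str());
    MPI_Finalize();
}
```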

Intel's initial offering will be a carrier baseboard that contains four OAMs to create a half-width sled. This configuration, codenamed Tuscany, can be plugged into an existing server via the Gen 5 PCI Express bus and provides a high-density solution for HPC workloads. There will be 450-watt and 600-watt versions that can support liquid- or air-cooled systems on single- or dual-processor platforms. PCI Express add-in cards will be available in addition to the OAMs. The table below gives an overview of the different product specifications.

 

Micro-Architecture Overview

Table 1. Intel® Data Center GPU Max Series Micro-Architecture Overview

 

Specifications below are per OAM or per PCI Express card. Note that in the case of Intel's half-width sled offering, codenamed Tuscany, there will be four OAMs.

| SKU | OAM, 600 W Thermal Design Power | OAM, 450 W Thermal Design Power | PCI Express Card, 300 W Thermal Design Power |
| --- | --- | --- | --- |
| Thermal Design Power | 600 watts; liquid cooling only | 450 watts; liquid or air cooling | 300 watts; air cooling only |
| Silicon | Xe-core (codename Ponte Vecchio) | Xe-core (codename Ponte Vecchio) | Xe-core (codename Ponte Vecchio) |
| Xe HPC Stacks | Two Xe HPC Stacks | Two Xe HPC Stacks | One Xe HPC Stack |
| Xe HPC Micro-Architecture | 128 Xe-cores with a total of 1024 Vector Engines and 1024 Matrix Engines | 96 Xe-cores with a total of 896 Vector Engines and 896 Matrix Engines | 56 Xe-cores with a total of 448 Vector Engines and 448 Matrix Engines |
| L1 Cache | 64MB total (512KB per Xe-core) | 48MB total (512KB per Xe-core) | 28MB total (512KB per Xe-core) |
| L2 Cache | 408MB total (204MB per Xe HPC Stack) | 216MB total (108MB per Xe HPC Stack) | 108MB total |
| Memory | 128GB HBM2e total (64GB per Xe HPC Stack) with an aggregate bandwidth of 12.8 TB/s | 96GB HBM2e total (48GB per Xe HPC Stack) with an aggregate bandwidth of 12.8 TB/s | 48GB HBM2e total |
| Embedded Multi-Die Interconnect Bridge (EMIB) Interface | 230 GB/s between Xe HPC Stacks 1 and 2 within the same OAM | 230 GB/s between Xe HPC Stacks 1 and 2 within the same OAM | N/A |
| Xe Links | 16 total (8 per Xe HPC Stack); each link capable of 26.5 GB/s of bandwidth in each direction | 16 total (8 per Xe HPC Stack); each link capable of 26.5 GB/s of bandwidth in each direction | 3 total, supporting x2 or x4 bridge; each link capable of 26.5 GB/s of bandwidth in each direction |
| HBM | Up to 4x HBM2e at 3.2 GT/s | Up to 4x HBM2e at 3.2 GT/s | Up to 4x HBM2e at 3.2 GT/s |
| Memory Capacity | 16G-128G (1T to 2T) dedicated | — | — |
| PCI Express | Gen5 | Gen5 | Gen5 |
| PCIe Lanes | x16 PCIe Gen5 per tile | — | — |
| Media Codec Support | Decode-only functions for AVC, AV1, and HEVC | Decode-only functions for AVC, AV1, and HEVC | Decode-only functions for AVC, AV1, and HEVC |
| FP64 Peak FLOPS | 52 TFLOPS | 44 TFLOPS | 22 TFLOPS |
| FP32 Peak FLOPS | 52 TFLOPS | 44 TFLOPS | 22 TFLOPS |
| BF16 Peak FLOPS | 832 TFLOPS | 704 TFLOPS | 352 TFLOPS |
| INT8 Peak Throughput | 1664 TOPS | 1408 TOPS | 704 TOPS |

Virtualization

The Xe HPC micro-architecture supports virtualization built upon Single Root I/O Virtualization (SR-IOV). Sixty-three virtual functions (VFs) are possible with a two-stack Xe HPC implementation to help with the scalability of virtualization. To assist with management, VF profiles can be dynamically modified without taking the system offline. Additionally, there is flexibility to administer either heterogeneous or homogeneous profiles across the device or tile. Security of the VFs is enforced through hardware isolation at both the device and the Xe HPC Stack; isolating the data within the memory associated with the device and with each Xe HPC Stack provides an additional level of security.

Virtualization can be implemented at the OAM level (i.e., across multiple Xe HPC Stacks) or at the individual Xe HPC Stack level. The entire OAM's processing and memory capabilities can be dedicated to a single virtual function, or each Xe HPC Stack within the OAM can be associated with a separate VF. Alternatively, VFs can share the entire OAM or an Xe HPC Stack via allotted slices of time. These modes of operation are managed by the hypervisor and provide full memory isolation between the VFs.

Software

Intel oneAPI is provided for ease of adoption by software developers and includes a complete set of advanced compilers, libraries, and a SYCL-based code-migration tool, along with analysis and debugger tools. Developers can be confident that existing applications will work seamlessly with oneAPI due to its interoperability with existing programming models and code bases (C++, Fortran, Python, OpenMP, etc.). The ability to target CPUs, GPUs, FPGAs, and AI accelerators from a single code base in Data Parallel C++ helps to optimize a developer's time. A good place to begin for application developers is the GPU-focused workflow guide and the Intel® oneAPI Base Toolkit.

The Intel oneAPI Base Toolkit provides a set of high-performance tools for building C++ and Data Parallel C++ applications, while domain-specific toolkits are available for particular workloads. Intel oneAPI Tools for HPC provides support for scaling Fortran, OpenMP, and MPI applications. Intel oneAPI Tools for IoT is designed to help developers building solutions at the edge of the network. The Intel oneAPI AI Analytics Toolkit provides optimized deep learning frameworks and Python libraries to assist with machine learning and data science pipelines. The Intel oneAPI Rendering Toolkit focuses on visualization applications, and the Intel Distribution of OpenVINO Toolkit assists developers working with inference applications from the edge to the cloud. Lastly, for guidance on code optimization, see the oneAPI GPU Optimization Guide.

If you are focused on AI and work with the PyTorch or TensorFlow frameworks, the latest versions are compatible with the Intel® Data Center GPU Max Series.

 

Additional Resources:

Technical Overview Of The Intel® Xeon® Scalable processor Max Series

Technical Overview Of The 4th Gen Intel® Xeon® Scalable processor family

Code Sample: Intel® Advanced Matrix Extensions (Intel® AMX) - Intrinsics Functions

Intel® In-Memory Analytics Accelerator (Intel® IAA) Enabling Guide

Proof Points of Intel® Dynamic Load Balancer (Intel® DLB)

 

The Author: David Mulnix is a software engineer and has been with Intel Corporation for over 20 years. His areas of focus include technical content development and training, software automation, and server performance and power analysis with SPECpower, and he supported the development effort for the Server Efficiency Rating Tool™.

 

Notices/Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

No product or component can be absolutely secure.

Code names are used by Intel to identify products, technologies, or services that are in development and not publicly available. These are not "commercial" names and are not intended to function as trademarks.

Intel disclaims all express and implied warranties, including without limitation the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.