Terminology
In this chapter, OpenMP and SYCL terminology is used interchangeably to describe the partitioning of iterations of an offloaded parallel loop.
As described in the “SYCL Thread Hierarchy and Mapping” chapter, the iterations of a parallel loop (execution range) offloaded onto the GPU are divided into work-groups, sub-groups, and work-items. The ND-range represents the total execution range, which is divided into work-groups of equal size. A work-group is a 1-, 2-, or 3-dimensional set of work-items. Each work-group can be divided into sub-groups. A sub-group represents a short range of consecutive work-items that are processed together as a SIMD vector.
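The relationship between these levels can be seen in a basic SYCL kernel launch. The following is a minimal sketch (the global range of 1024, the work-group size of 64, and the kernel body are illustrative values only) showing how a work-item queries its position in the ND-range, its work-group, and its sub-group:

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;

  // ND-range: 1024 work-items in total, divided into work-groups of 64.
  sycl::nd_range<1> range{sycl::range<1>{1024}, sycl::range<1>{64}};

  q.parallel_for(range, [=](sycl::nd_item<1> item) {
     // Index within the full execution range (ND-range).
     size_t global_id = item.get_global_id(0);
     // Index of this work-item's work-group, and index within it.
     size_t group_id  = item.get_group(0);
     size_t local_id  = item.get_local_id(0);
     // Sub-group: consecutive work-items executed together as a SIMD vector.
     auto   sg        = item.get_sub_group();
     size_t lane      = sg.get_local_id()[0];
     (void)global_id; (void)group_id; (void)local_id; (void)lane;
   }).wait();
}
```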
The following table shows how SYCL concepts map to OpenMP and CUDA concepts.
| SYCL | OpenMP | CUDA |
|---|---|---|
| Work-item | OpenMP thread or SIMD lane | CUDA thread |
| Work-group | Team | Thread block |
| Work-group size | Team size | Thread block size |
| Number of work-groups | Number of teams | Number of thread blocks |
| Sub-group | SIMD chunk (simdlen = 8, 16, or 32) | Warp (size = 32) |
| Maximum number of work-items per work-group | Thread limit | Maximum number of CUDA threads per thread block |
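As a rough illustration of this mapping, an OpenMP offload construct expresses the same hierarchy with teams (work-groups), threads (work-items), and SIMD lanes (sub-groups). The sketch below is only an example; the num_teams, thread_limit, and simdlen values are arbitrary and not recommendations:

```cpp
#include <omp.h>

void scale(float *x, int n) {
  // Teams correspond to SYCL work-groups, OpenMP threads to work-items,
  // and the simd clause to sub-group (SIMD lane) execution.
  #pragma omp target teams distribute parallel for simd \
          num_teams(64) thread_limit(256) simdlen(16) map(tofrom: x[0:n])
  for (int i = 0; i < n; ++i)
    x[i] *= 2.0f;
}
```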