oneAPI GPU Optimization Guide

Developer Guide

ID 771772

Date 12/16/2022

Version

Public

A newer version of this document is available.

oneAPI GPU Optimization Guide

Introduction
Getting Started
Parallelization
Intel® Iris® Xe GPU Architecture
GPU Execution Model Overview
SYCL* Thread Mapping and GPU Occupancy
Kernels
    Sub-groups and SIMD Vectorization
    Removing Conditional Checks
    Registerization and Avoid Register Spills
    Shared Local Memory
    Pointer Aliasing and the Restrict Directive
    Synchronization among Threads in a Kernel
        Atomic Operations
        Local Barriers vs Global Atomics
    Considerations for Selecting Work-group Size
    Reduction
    Kernel Launch
    Executing Multiple Kernels on the Device at the Same Time
    Submitting Kernels to Multiple Queues
    Avoid Redundant Queue Construction
Using Libraries for GPU Offload
    Using Performance Libraries
    Using Standard Library Functions in SYCL Kernels
    Efficiently Implementing Fourier Correlation Using oneAPI Math Kernel Library (oneMKL)
Host/Device Memory, Buffer and USM
    Performance Impact of USM and Buffers
    Optimizing Memory Movement Between Host and Accelerator
    Avoid moving data back and forth between host and device
    Avoid Declaring Buffers in a Loop
    Buffer Accessor Modes
Host/Device Coordination
    Asynchronous and Overlapping Data Transfers Between Host and Device
Using Multiple Heterogeneous Devices
Compilation
    Just-In-Time Compilation in SYCL
    Specialization Constants
Optimizing Media Pipelines
    Media Engine Hardware
    Media API Options for Hardware Acceleration
    Media Pipeline Parallelism
    Media Pipeline Inter-operation and Memory Sharing
    SYCL-Blur Example
OpenMP Offloading Tuning Guide
    OpenMP Directives
    OpenMP Execution Model
    Terminology
    Compiling and Running an OpenMP Application
    Offloading oneMKL Computations onto the GPU
    Tools to Analyze Performance of OpenMP Applications
    OpenMP Offload Best Practices
        Using More GPU Resources
        Minimizing Data Transfers and Memory Allocations
        Making Better Use of OpenMP Constructs
        Memory Allocation
        Clauses: is_device_ptr, use_device_ptr, has_device_addr, use_device_addr
Debugging and Profiling
    GPU Analysis with VTune™ Profiler
    Intel® Advisor GPU Analysis
        Identify Regions to Offload to GPU with Offload Modeling
        Run a GPU Roofline Analysis
        Optimize Memory-bound Applications with GPU Roofline
    Doing IO in the Kernel
    Using the Timers
    How to Use the Intercept Layer for OpenCL™ Applications
    Level Zero Tracer
GPU Analysis with Intel® Graphics Performance Analyzers (Intel® GPA)
Reference
Terms and Conditions