OpenMP Offload Best Practices
In this chapter we present best practices for improving the performance of applications that offload work onto the GPU. We organize the best practices into the following categories, which are described in the sections that follow:
- Using More GPU Resources
- Minimizing Data Transfers and Memory Allocations
- Making Better Use of OpenMP Constructs
- Memory Allocation
- Clauses: is_device_ptr, use_device_ptr, has_device_addr, use_device_addr
Note:
The following configuration was used when collecting the OpenMP performance numbers:
- 2-tile Intel® GPU
- One GPU tile only (no implicit or explicit scaling)
- Internal versions of the Intel® compilers, runtimes, and GPU driver
- Level-Zero plugin
- A dummy target construct introduced at the beginning of each program, so that device startup time is not measured
- Just-In-Time (JIT) compilation mode