Execution Model Overview
Thread Mapping and GPU Occupancy
Kernels
Using Libraries for GPU Offload
Host/Device Memory, Buffer and USM
Unified Shared Memory Allocations
Performance Impact of USM and Buffers
Avoiding Moving Data Back and Forth between Host and Device
Optimizing Data Transfers
Avoiding Declaring Buffers in a Loop
Buffer Accessor Modes
Host/Device Coordination
Using Multiple Heterogeneous Devices
Compilation
OpenMP Offloading Tuning Guide
Multi-GPU and Multi-Stack Architecture and Programming
Level Zero
Performance Profiling and Analysis
Configuring GPU Device
Sub-Groups and SIMD Vectorization
Removing Conditional Checks
Registers and Performance
Shared Local Memory
Pointer Aliasing and the Restrict Directive
Synchronization among Threads in a Kernel
Considerations for Selecting Work-Group Size
Prefetch
Reduction
Kernel Launch
Executing Multiple Kernels on the Device at the Same Time
Submitting Kernels to Multiple Queues
Avoiding Redundant Queue Constructions
Programming Intel® XMX Using SYCL Joint Matrix Extension
Doing I/O in the Kernel
Optimizing Explicit SIMD Kernels
OpenMP Offload Best Practices
In this chapter, we present best practices for improving the performance of applications that offload work onto the GPU. We organize the best practices into the following categories, which are described in the sections that follow:
- Using More GPU Resources
- Minimizing Data Transfers and Memory Allocations
- Making Better Use of OpenMP Constructs
- Memory Allocation
- Fortran Example
- Clauses: is_device_ptr, use_device_ptr, has_device_addr, use_device_addr
- Prefetching
- Atomics with SLM
- OpenMP Interop with SYCL
- Offloading DO CONCURRENT
Note:
The following configuration was used when collecting the OpenMP performance numbers in this chapter:
- A 2-stack Intel® GPU, using one GPU stack only (no implicit or explicit scaling)
- Intel® compilers, runtimes, and GPU drivers
- The Level Zero plugin
- Just-In-Time (JIT) compilation mode
- A dummy target construct at the beginning of each program, so that device startup time is not included in the measurements