OpenMP Offloading Tuning Guide
The Intel® LLVM-based C/C++ and Fortran compilers (icx, icpx, and ifx) support OpenMP offloading to GPUs. When using OpenMP, the programmer inserts device directives in the code that direct the compiler to offload selected parts of the application to the GPU. Offloading compute-intensive code can yield better performance.
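As an illustration of what a device directive looks like, the following is a minimal sketch, not taken from this guide. The function name `saxpy_offload` and its parameters are hypothetical. It marks a single loop for offload with the standard OpenMP `target teams distribute parallel for` construct and uses `map` clauses to describe which arrays move to and from the device. If the code is built without OpenMP enabled, the pragma is ignored and the loop runs on the host.

```c
#include <stddef.h>

/* Hypothetical helper for illustration: computes y[i] = a * x[i] + y[i]
   for i in [0, n). */
void saxpy_offload(size_t n, float a, const float *x, float *y)
{
    /* map(to: ...) copies x to the device before the region runs;
       map(tofrom: ...) copies y to the device and back to the host. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

With the Intel compilers, a source file containing this function would typically be built with offload enabled, for example `icx -fiopenmp -fopenmp-targets=spir64 saxpy.c`, where the file name is an assumption. Whether offloading such a small loop actually pays off depends on the problem size and on the cost of the data transfers.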
This section covers topics related to OpenMP offloading and how to improve the performance of offloaded code.