Using Libraries for GPU Offload
Several libraries included with the oneAPI toolkits simplify programming by providing specialized APIs for building optimized applications. This section describes the steps for using these libraries to accelerate applications and includes code samples. Detailed information about each library, including its available APIs, is in that library's main documentation.